
Title: Improving ecological data science with workflow management software

Pressing environmental research questions demand the integration of increasingly diverse and large‐scale ecological datasets as well as complex analytical methods, which require specialized tools and resources.

Computational training for ecological and evolutionary sciences has become more abundant and accessible over the past decade, but tool development has outpaced the availability of specialized training. Most training for scripted analyses focuses on individual analysis steps in one script rather than creating a scripted pipeline, where modular functions comprise an ecosystem of interdependent steps. Although current computational training creates an excellent starting place, linear styles of scripting can risk becoming labor‐ and time‐intensive and less reproducible by often requiring manual execution. Pipelines, however, can be easily automated or tracked by software to increase efficiency and reduce potential errors. Ecology and evolution would benefit from techniques that reduce these risks by managing analytical pipelines in a modular, readily parallelizable format with clear documentation of dependencies.

Workflow management software (WMS) can aid in the reproducibility, intelligibility and computational efficiency of complex pipelines. To date, WMS adoption in ecology and evolutionary research has been slow. We discuss the benefits and challenges of implementing WMS and illustrate its use through a case study with the targets R package to further highlight WMS benefits through workflow automation, dependency tracking and improved clarity for reviewers.

Although WMS requires familiarity with function‐oriented programming and careful planning for more advanced applications and pipeline sharing, investment in training will enable access to the benefits of WMS and impart transferable computing skills that can facilitate ecological and evolutionary data science at large scales.
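The dependency tracking described above can be illustrated with a minimal sketch. It is written in Python for brevity and is not the API of the targets R package used in the case study; the `Pipeline` class, its `target` and `make` methods, and the example target names are all hypothetical. Each target is a function with named upstream dependencies, and a target is rerun only when its code or the values it depends on have changed.

```python
import hashlib
import pickle


def code_signature(code):
    """Bytes that change when a function's compiled code, names or constants change."""
    parts = [code.co_code, repr(code.co_names).encode()]
    for const in code.co_consts:
        if hasattr(const, "co_code"):      # nested function or comprehension
            parts.append(code_signature(const))
        else:
            parts.append(repr(const).encode())
    return b"|".join(parts)


class Pipeline:
    """Toy workflow manager (hypothetical): builds targets in dependency order."""

    def __init__(self):
        self.targets = {}  # name -> (function, names of upstream targets)
        self.cache = {}    # name -> (fingerprint, stored value)

    def target(self, name, func, deps=()):
        self.targets[name] = (func, tuple(deps))

    def make(self, name, ran=None):
        """Return (value, names of targets that were actually rerun)."""
        ran = [] if ran is None else ran
        func, deps = self.targets[name]
        dep_values = [self.make(dep, ran)[0] for dep in deps]
        # Fingerprint = the target's own code plus the values it depends on.
        digest = hashlib.sha256(code_signature(func.__code__))
        digest.update(pickle.dumps(dep_values))
        fingerprint = digest.hexdigest()
        if name not in self.cache or self.cache[name][0] != fingerprint:
            self.cache[name] = (fingerprint, func(*dep_values))
            ran.append(name)
        return self.cache[name][1], ran


pipe = Pipeline()
pipe.target("raw", lambda: [2.0, 4.0, 6.0])
pipe.target("clean", lambda raw: [x for x in raw if x > 2.0], deps=["raw"])
pipe.target("mean", lambda clean: sum(clean) / len(clean), deps=["clean"])

value, ran = pipe.make("mean")      # first run builds every target
_, ran_again = pipe.make("mean")    # nothing changed, so nothing reruns

# Editing one step invalidates only that step and the targets downstream of it.
pipe.target("clean", lambda raw: [x for x in raw if x > 0.0], deps=["raw"])
value2, ran_after_edit = pipe.make("mean")
```

In this sketch, the second call to `make` reruns nothing, and editing the `clean` step reruns `clean` and `mean` but not `raw`. A real WMS adds persistent storage, parallel execution and visualization of the dependency graph on top of this basic idea.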

Journal Name:
Methods in Ecology and Evolution
Size: p. 1381-1388
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Biodiversity studies rely heavily on estimates of species' distributions often obtained through ecological niche modelling. Numerous software packages exist that allow users to model ecological niches using machine learning and statistical methods. However, no existing package with a graphical user interface allows users to perform model calibration and selection based on convex forms such as ellipsoids, which may match fundamental ecological niche shapes better, incorporating tools for exploring, modelling, and evaluating niches and distributions that are intuitive for both novice and proficient users.

    Here we describe an R package, NicheToolBox (ntbox), that allows users to conduct all processing steps involved in ecological niche modelling: downloading and curating occurrence data, obtaining and transforming environmental data layers, selecting environmental variables, exploring relationships between geographic and environmental spaces, calibrating and selecting ellipsoid models, evaluating models using binomial and partial ROC tests, assessing extrapolation risk, and performing geographic information system operations via a graphical user interface. A summary of the entire workflow is produced for use as a stand‐alone algorithm or as part of research reports.

    The method is explained in detail and tested via modelling the threatened feline species Leopardus wiedii. Georeferenced occurrence data for this species are queried to display both point occurrences and the IUCN extent of occurrence polygon (IUCN, 2007). This information is used to illustrate tools available for accessing, processing and exploring biodiversity data (e.g. number of occurrences and chronology of collecting) and transforming environmental data (e.g. a summary PCA for 19 bioclimatic layers). Visualizations of three‐dimensional ecological niches modelled as minimum volume ellipsoids are developed with ancillary statistics. This niche model is then projected to geographic space, to represent a corresponding potential suitability map.

    Using ntbox allows a fast and straightforward means by which to retrieve and manipulate occurrence and environmental data, which can then be implemented in model calibration, projection and evaluation for assessing distributions of species in geographic space and their corresponding environmental combinations.

  2. Abstract

    The field of ecology has undergone a molecular revolution, with researchers increasingly relying on DNA‐based methods for organism detection. Unfortunately, these techniques often require expensive equipment, dedicated laboratory spaces and specialized training in molecular and computational techniques; limitations that may exclude field researchers, underfunded programmes and citizen scientists from contributing to cutting‐edge science.

    It is for these reasons that we have designed a simplified, inexpensive method for field‐based molecular organism detection: FINDeM (Field‐deployable Isothermal Nucleotide‐based Detection Method). In this approach, DNA is extracted using chemical cell lysis and a cellulose filter disc, followed by two body‐heat inducible reactions (recombinase polymerase amplification and a CRISPR‐Cas12a fluorescent reporter assay) to amplify and detect target DNA, respectively.

    Here, we introduce and validate FINDeM in detecting Batrachochytrium dendrobatidis, the causative agent of amphibian chytridiomycosis, and show that this approach can identify single‐digit DNA copies from epidermal swabs in under 1 h using low‐cost supplies and field‐friendly equipment.

    This research signifies a breakthrough in ecology, as we demonstrate a field‐deployable platform that requires only basic supplies (i.e. micropipettes, plastic consumables and a UV flashlight), inexpensive reagents (~$1.29 USD/sample) and emanated body heat for highly sensitive, DNA‐based organism detection. By presenting FINDeM in an ecological system with pressing, global biodiversity implications, we aim to not only highlight how CRISPR‐based applications promise to revolutionize organism detection but also how the continued development of such techniques will allow for additional, more diversely trained researchers to answer the most pressing questions in ecology.

  3. Abstract

    Estimating phenotypic distributions of populations and communities is central to many questions in ecology and evolution. These distributions can be characterized by their moments (mean, variance, skewness and kurtosis) or diversity metrics (e.g. functional richness). Typically, such moments and metrics are calculated using community‐weighted approaches (e.g. abundance‐weighted mean). We propose an alternative bootstrapping approach that allows flexibility in trait sampling and explicit incorporation of intraspecific variation, and show that this approach significantly improves estimation while allowing us to quantify uncertainty.

    We assess the performance of different approaches for estimating the moments of trait distributions across various sampling scenarios, taxa and datasets by comparing estimates derived from simulated samples with the true values calculated from full datasets. Simulations differ in sampling intensity (individuals per species), sampling biases (abundance, size), trait data source (local vs. global) and estimation method (two types of community‐weighting, two types of bootstrapping).

    We introduce the traitstrap R package, which contains a modular and extensible set of bootstrapping and weighted‐averaging functions that use community composition and trait data to estimate the moments of community trait distributions with their uncertainty. Importantly, the first function in the workflow, trait_fill, allows the user to specify hierarchical structures (e.g. plot within site, experiment vs. control, species within genus) to assign trait values to each taxon in each community sample.

    Across all taxa, simulations and metrics, bootstrapping approaches were more accurate and less biased than community‐weighted approaches. With bootstrapping, a sample size of 9 or more measurements per species per trait generally included the true mean within the 95% CI. It reduced average percent errors by 26%–74% relative to community‐weighting. Random sampling across all species outperformed both size‐ and abundance‐biased sampling.

    Our results suggest randomly sampling ~9 individuals per sampling unit and species, covering all species in the community and analysing the data using nonparametric bootstrapping generally enable reliable inference on trait distributions, including the central moments, of communities. By providing better estimates of community trait distributions, bootstrapping approaches can improve our ability to link traits to both the processes that generate them and their effects on ecosystems.

  4. Abstract

    Remote sensing of forested landscapes can transform the speed, scale and cost of forest research. The delineation of individual trees in remote sensing images is an essential task in forest analysis. Here we introduce a new Python package, DeepForest, that detects individual trees in high resolution RGB imagery using deep learning.

    While deep learning has proven highly effective in a range of computer vision tasks, it requires large amounts of training data that are typically difficult to obtain in ecological studies. DeepForest overcomes this limitation by including a model pretrained on over 30 million algorithmically generated crowns from 22 forests and fine‐tuned using 10,000 hand‐labelled crowns from six forests.

    The package supports the application of this general model to new data, fine tuning the model to new datasets with user labelled crowns, training new models and evaluating model predictions. This simplifies the process of using and retraining deep learning models for a range of forests, sensors and spatial resolutions.

    We illustrate the workflow of DeepForest using data from the National Ecological Observatory Network, a tropical forest in French Guiana, and street trees from Portland, Oregon.

  5. Abstract

    Anticipating and preparing for the effect of environmental changes on biodiversity requires understanding and predicting both the ecological and evolutionary responses of populations. Tools and methods to efficiently integrate these complex processes are lacking.

    We present the genetically and spatially explicit individual‐based simulation software Nemo‐age, which combines ecological and evolutionary processes. Nemo‐age has a strong emphasis on modelling complex life histories. We here provide a methodology to predict changes in species distribution for given climate projections using Nemo‐age.

    Modelling complex life histories, spatial distribution and evolutionary processes unravel possible eco‐evolutionary mechanisms that have been previously overlooked when populations endure rapid environmental changes.

    The interface of Nemo‐age is designed to integrate species' data from different fields, from demography to genetic architecture and spatial distributions, thus representing a versatile tool to model a variety of applied and theoretical scenarios.
