

Title: A practical guide to understanding and validating complex models using data simulations
Biologists routinely fit novel and complex statistical models to push the limits of our understanding. Examples include, but are not limited to, flexible Bayesian approaches (e.g. BUGS, Stan), frequentist and likelihood-based approaches (e.g. the R package lme4) and machine learning methods.

These programs afford the user greater control and flexibility in tailoring complex hierarchical models. However, this level of control and flexibility places a higher degree of responsibility on the user to evaluate the robustness of their statistical inference. To determine how often biologists run model diagnostics on hierarchical models, we reviewed 50 papers published in 2021 in the journal Nature Ecology & Evolution, and we found that the majority of published papers did not report any validation of their hierarchical models, making it difficult for the reader to assess the robustness of their inference. This lack of reporting likely stems from a lack of standardized guidance for best practices and standard methods.

Here, we provide a guide to understanding and validating complex models using data simulations. To determine how often biologists use data simulation techniques, we also reviewed 50 papers published in 2021 in the journal Methods in Ecology and Evolution. We found that 78% of the papers that proposed a new estimation technique, package or model used simulations or generated data in some capacity (18 of 23 papers), but very few of those papers (5 of 23) included either a demonstration that the code could recover realistic estimates for a dataset with known parameters or a demonstration of the statistical properties of the approach. To distil the variety of simulation techniques and their uses, we provide a taxonomy of simulation studies based on the intended inference. We also encourage authors to include a basic validation study whenever novel statistical models are used, which, in general, is easy to implement.

Simulating data helps a researcher gain a deeper understanding of the models and their assumptions and establish the reliability of their estimation approaches. Wider adoption of data simulations by biologists can improve statistical inference, reliability and open science practices.
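The basic validation study the abstract recommends can be sketched in a few lines: simulate data from a model with known parameters, fit the model, and check how often the estimation procedure recovers the truth. The following Python sketch (illustrative only; the parameter values and the simple linear model are assumptions, not from the paper) checks 95% confidence-interval coverage for a slope estimate:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical "true" parameters of a simple linear model (chosen for illustration)
beta_true = np.array([2.0, -0.5])
sigma = 1.0
n, n_sims = 200, 500

covered = 0
for _ in range(n_sims):
    # Simulate a dataset with known parameters
    X = np.column_stack([np.ones(n), rng.normal(size=n)])
    y = X @ beta_true + rng.normal(scale=sigma, size=n)
    # Fit by ordinary least squares
    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta_hat
    # Standard error of the slope estimate
    s2 = resid @ resid / (n - 2)
    se = np.sqrt(s2 * np.linalg.inv(X.T @ X)[1, 1])
    lo, hi = beta_hat[1] - 1.96 * se, beta_hat[1] + 1.96 * se
    covered += lo <= beta_true[1] <= hi

coverage = covered / n_sims
print(f"95% CI coverage for the slope: {coverage:.2f}")
```

If the estimator and its intervals are sound, the reported coverage should sit close to the nominal 0.95; the same pattern generalizes to hierarchical models, where such checks are rarely reported.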

 
Award ID(s): 2015273
NSF-PAR ID: 10473163
Publisher / Repository: BES Journals
Journal Name: Methods in Ecology and Evolution
Volume: 14
Issue: 1
ISSN: 2041-210X
Page Range / eLocation ID: 203 to 217
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract

    Estimating phenotypic distributions of populations and communities is central to many questions in ecology and evolution. These distributions can be characterized by their moments (mean, variance, skewness and kurtosis) or diversity metrics (e.g. functional richness). Typically, such moments and metrics are calculated using community‐weighted approaches (e.g. abundance‐weighted mean). We propose an alternative bootstrapping approach that allows flexibility in trait sampling and explicit incorporation of intraspecific variation, and show that this approach significantly improves estimation while allowing us to quantify uncertainty.

    We assess the performance of different approaches for estimating the moments of trait distributions across various sampling scenarios, taxa and datasets by comparing estimates derived from simulated samples with the true values calculated from full datasets. Simulations differ in sampling intensity (individuals per species), sampling biases (abundance, size), trait data source (local vs. global) and estimation method (two types of community‐weighting, two types of bootstrapping).

We introduce the traitstrap R package, which contains a modular and extensible set of bootstrapping and weighted-averaging functions that use community composition and trait data to estimate the moments of community trait distributions with their uncertainty. Importantly, the first function in the workflow, trait_fill, allows the user to specify hierarchical structures (e.g. plot within site, experiment vs. control, species within genus) to assign trait values to each taxon in each community sample.

Across all taxa, simulations and metrics, bootstrapping approaches were more accurate and less biased than community-weighted approaches. With bootstrapping, a sample size of nine or more measurements per species per trait generally yielded a 95% CI that included the true mean, and reduced average percent errors by 26%–74% relative to community-weighting. Random sampling across all species outperformed both size- and abundance-biased sampling.

    Our results suggest randomly sampling ~9 individuals per sampling unit and species, covering all species in the community and analysing the data using nonparametric bootstrapping generally enable reliable inference on trait distributions, including the central moments, of communities. By providing better estimates of community trait distributions, bootstrapping approaches can improve our ability to link traits to both the processes that generate them and their effects on ecosystems.
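The core idea, resampling individual-level trait measurements in proportion to species abundance and summarizing the resulting distribution of moments, can be sketched as follows. This is a minimal illustration, not the traitstrap implementation; the species names, trait values and abundances are invented for the example:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical trait measurements: 9 individuals per species, 4 species
traits = {
    "sp_a": rng.normal(10.0, 2.0, size=9),
    "sp_b": rng.normal(14.0, 3.0, size=9),
    "sp_c": rng.normal(8.0, 1.5, size=9),
    "sp_d": rng.normal(12.0, 2.5, size=9),
}
abundance = {"sp_a": 40, "sp_b": 30, "sp_c": 20, "sp_d": 10}

# Each individual's sampling probability: species abundance share,
# split evenly among that species' measured individuals
species = list(traits)
p_sp = np.array([abundance[s] for s in species], dtype=float)
p_sp /= p_sp.sum()
values = np.concatenate([traits[s] for s in species])
probs = np.concatenate(
    [np.full(len(traits[s]), p / len(traits[s])) for s, p in zip(species, p_sp)]
)

# Nonparametric bootstrap: resample individuals, abundance-weighted,
# and compute the community mean for each replicate
n_boot, n_draw = 2000, 200
draws = rng.choice(values, size=(n_boot, n_draw), p=probs)
means = draws.mean(axis=1)

lo, hi = np.percentile(means, [2.5, 97.5])
print(f"community mean = {means.mean():.2f}, 95% CI ({lo:.2f}, {hi:.2f})")
```

Higher moments (variance, skewness, kurtosis) follow the same pattern: compute the statistic per bootstrap replicate, then summarize the replicate distribution.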

     
  2. Abstract

    Despite the wide application of meta‐analysis in ecology, some of the traditional methods used for meta‐analysis may not perform well given the type of data characteristic of ecological meta‐analyses.

    We reviewed published meta‐analyses on the ecological impacts of global climate change, evaluating the number of replicates used in the primary studies (ni) and the number of studies or records (k) that were aggregated to calculate a mean effect size. We used the results of the review in a simulation experiment to assess the performance of conventional frequentist and Bayesian meta‐analysis methods for estimating a mean effect size and its uncertainty interval.

Our literature review showed that ni and k were highly variable, distributions were right-skewed and were generally small (median ni = 5, median k = 44). Our simulations show that the choice of method for calculating uncertainty intervals was critical for obtaining appropriate coverage (close to the nominal value of 0.95). When k was low (<40), 95% coverage was achieved by a confidence interval (CI) based on the t distribution that uses an adjusted standard error (the Hartung–Knapp–Sidik–Jonkman, HKSJ), or by a Bayesian credible interval, whereas bootstrap or z distribution CIs had lower coverage. Despite the importance of the method to calculate the uncertainty interval, 39% of the meta-analyses reviewed did not report the method used, and of the 61% that did, 94% used a potentially problematic method, which may be a consequence of software defaults.

In general, for a simple random-effects meta-analysis, the performance of the best frequentist and Bayesian methods was similar for the same combinations of factors (k and mean replication), though the Bayesian approach had higher than nominal (>95%) coverage for the mean effect when k was very low (k < 15). Our literature review suggests that many meta-analyses that used z distribution or bootstrapping CIs may have overestimated the statistical significance of their results when the number of studies was low; more appropriate methods need to be adopted in ecological meta-analyses.
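The contrast between the conventional z interval and the HKSJ interval is easy to see in code. The sketch below (an illustration with invented effect sizes, not the paper's simulation) runs a DerSimonian–Laird random-effects meta-analysis and computes both intervals:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# Hypothetical effect sizes for k studies
k = 12
tau2_true, mu_true = 0.04, 0.3
v = rng.uniform(0.02, 0.2, size=k)              # within-study variances
y = rng.normal(mu_true, np.sqrt(v + tau2_true))  # observed effect sizes

# DerSimonian-Laird estimate of the between-study variance tau^2
w = 1.0 / v
ybar = np.sum(w * y) / np.sum(w)
Q = np.sum(w * (y - ybar) ** 2)
c = np.sum(w) - np.sum(w**2) / np.sum(w)
tau2 = max(0.0, (Q - (k - 1)) / c)

# Random-effects weights and pooled mean effect
ws = 1.0 / (v + tau2)
mu = np.sum(ws * y) / np.sum(ws)

# Conventional z-distribution CI (the common software default)
se_z = np.sqrt(1.0 / np.sum(ws))
ci_z = (mu - 1.96 * se_z, mu + 1.96 * se_z)

# HKSJ: adjusted variance estimator with a t(k-1) critical value
var_hk = np.sum(ws * (y - mu) ** 2) / ((k - 1) * np.sum(ws))
t_crit = stats.t.ppf(0.975, df=k - 1)
ci_hk = (mu - t_crit * np.sqrt(var_hk), mu + t_crit * np.sqrt(var_hk))

print(f"z CI:    ({ci_z[0]:.3f}, {ci_z[1]:.3f})")
print(f"HKSJ CI: ({ci_hk[0]:.3f}, {ci_hk[1]:.3f})")
```

With small k, the HKSJ interval is typically wider than the z interval, which is exactly why the z default tends to overstate significance when few studies are pooled.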

     
  3. Abstract

Structured population models are among the most widely used tools in ecology and evolution. Integral projection models (IPMs) use continuous representations of how survival, reproduction and growth change as functions of state variables such as size, requiring fewer parameters to be estimated than projection matrix models (PPMs). Yet, almost all published IPMs make an important assumption that size-dependent growth transitions are normally distributed, or can be transformed to be. In fact, many organisms exhibit highly skewed size transitions. Small individuals can grow more than they can shrink, and large individuals may often shrink more dramatically than they can grow. Yet, the implications of such skew for inference from IPMs have not been explored, nor have general methods been developed to incorporate skewed size transitions into IPMs, or to deal with other aspects of real growth rates, including bounds on possible growth or shrinkage.

    Here, we develop a flexible approach to modelling skewed growth data using a modified beta regression model. We propose that sizes first be converted to a (0,1) interval by estimating size‐dependent minimum and maximum sizes through quantile regression. Transformed data can then be modelled using beta regression with widely available statistical tools. We demonstrate the utility of this approach using demographic data for a long‐lived plant, gorgonians and an epiphytic lichen. Specifically, we compare inferences of population parameters from discrete PPMs to those from IPMs that either assume normality or incorporate skew using beta regression or, alternatively, a skewed normal model.

    The beta and skewed normal distributions accurately capture the mean, variance and skew of real growth distributions. Incorporating skewed growth into IPMs decreases population growth and estimated life span relative to IPMs that assume normally distributed growth, and more closely approximate the parameters of PPMs that do not assume a particular growth distribution. A bounded distribution, such as the beta, also avoids the eviction problem caused by predicting some growth outside the modelled size range.

    Incorporating biologically relevant skew in growth data has important consequences for inference from IPMs. The approaches we outline here are flexible and easy to implement with existing statistical tools.
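The transform-then-fit workflow described above can be illustrated with stand-ins: rescale observed size transitions onto the open (0,1) interval, then fit a beta distribution. This sketch uses constant empirical bounds and a method-of-moments beta fit purely for illustration; the paper's approach estimates size-dependent bounds via quantile regression and fits a full beta regression, which needs dedicated tools:

```python
import numpy as np

rng = np.random.default_rng(3)

# Hypothetical size transitions: initial sizes and (skewed) sizes a year later
size_t = rng.uniform(1.0, 10.0, size=400)
# Growth is right-skewed, with some shrinkage for larger individuals (invented)
size_t1 = size_t + rng.gamma(2.0, 0.5, size=400) - 0.15 * size_t

# Step 1: bounds on attainable size. The paper uses size-dependent bounds
# from quantile regression; constant empirical bounds stand in here.
lo_b = size_t1.min() - 0.01
hi_b = size_t1.max() + 0.01

# Step 2: rescale transitions to the open interval (0, 1)
u = (size_t1 - lo_b) / (hi_b - lo_b)

# Step 3: fit a beta distribution by the method of moments
# (beta *regression* on size would replace this step in practice)
m, s2 = u.mean(), u.var()
common = m * (1 - m) / s2 - 1
alpha, beta = m * common, (1 - m) * common
print(f"alpha = {alpha:.2f}, beta = {beta:.2f}")
```

Because the fitted beta distribution has support only on the rescaled size range, predicted transitions can never fall outside the modelled bounds, which is what avoids the eviction problem mentioned above.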

     
  4. Abstract

    Resource selection functions (RSFs) are among the most commonly used statistical tools in both basic and applied animal ecology. They are typically parameterized using animal tracking data, and advances in animal tracking technology have led to increasing levels of autocorrelation between locations in such data sets. Because RSFs assume that data are independent and identically distributed, such autocorrelation can cause misleadingly narrow confidence intervals and biased parameter estimates.

    Data thinning, generalized estimating equations and step selection functions (SSFs) have been suggested as techniques for mitigating the statistical problems posed by autocorrelation, but these approaches have notable limitations that include statistical inefficiency, unclear or arbitrary targets for adequate levels of statistical independence, constraints in input data and (in the case of SSFs) scale‐dependent inference. To remedy these problems, we introduce a method for likelihood weighting of animal locations to mitigate the negative consequences of autocorrelation on RSFs.

    In this study, we demonstrate that this method weights each observed location in an animal's movement track according to its level of non‐independence, expanding confidence intervals and reducing bias that can arise when there are missing data in the movement track.

Ecologists and conservation biologists can use this method to improve the quality of inferences derived from RSFs. We also provide a complete, annotated analytical workflow to help new users apply our method to their own animal tracking data using the ctmm R package.
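The general idea of likelihood weighting, down-weighting non-independent locations when fitting a use/availability model, can be sketched as a weighted logistic likelihood. This is a conceptual illustration only: the covariate, the randomly drawn placeholder weights and the simple logistic RSF are assumptions, whereas the paper derives weights from a fitted movement model (via ctmm):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(9)

# Hypothetical use/availability design: 1 = used location, 0 = available
n_used, n_avail = 300, 1000
# One habitat covariate; used points sit at higher values (invented)
x = np.concatenate([rng.normal(1.0, 1.0, size=n_used),
                    rng.normal(0.0, 1.0, size=n_avail)])
z = np.concatenate([np.ones(n_used), np.zeros(n_avail)])

# Likelihood weights: down-weight autocorrelated used locations.
# Real weights would come from a movement model; these are placeholders.
w = np.concatenate([rng.uniform(0.3, 1.0, size=n_used),
                    np.ones(n_avail)])

def neg_weighted_loglik(theta):
    """Weighted Bernoulli log-likelihood for a logistic RSF."""
    eta = theta[0] + theta[1] * x
    ll = z * eta - np.log1p(np.exp(eta))  # per-location log-likelihood
    return -np.sum(w * ll)

fit = minimize(neg_weighted_loglik, x0=np.zeros(2), method="BFGS")
print(f"intercept = {fit.x[0]:.2f}, selection coefficient = {fit.x[1]:.2f}")
```

With all weights equal to 1 this reduces to an ordinary logistic RSF; lowering a location's weight shrinks its contribution to the likelihood, which is how non-independence widens the resulting confidence intervals.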

     
  5. Our ability to project changes to the climate via anthropogenic forcing has steadily increased over the last five decades. Yet, biologists still lack accurate projections about climate change impacts. Despite recent advances, biologists still often rely on correlative approaches to make projections, ignore important mechanisms, develop models with limited coordination, and lack much of the data to inform projections and test them. In contrast, atmospheric scientists have incorporated mechanistic data, established a global network of weather stations, and apply multi‐model inference by comparing divergent model projections. I address the following questions: How have the two fields developed through time? To what degree does biological projection differ from climate projection? What is needed to make similar progress in biological projection? Although the challenges in biodiversity projections are great, I highlight how biology can make substantial progress in the coming years. Most obstacles are surmountable and relate to history, lag times, scientific culture, international organization, and finances. Just as climate change projections have improved, biological modeling can improve in accuracy by incorporating mechanistic understanding, employing multi‐model ensemble approaches, coordinating efforts worldwide, and validating projections against records from a well‐designed network of biotic stations. Now that climate scientists can make better projections of climate change, biologists need to project and prevent its impacts on biodiversity.

    This article is categorized under:

    Climate, Ecology, and Conservation > Modeling Species and Community Interactions

     