
Title: Random errors are neither: On the interpretation of correlated data

Many statistical models currently used in ecology and evolution account for covariances among random errors. Here, I address five points: (i) correlated random errors unite many types of statistical models, including spatial, phylogenetic and time‐series models; (ii) random errors are neither unpredictable nor mistakes; (iii) diagnostics for correlated random errors are not useful, but simulations are; (iv) model predictions can be made with random errors; and (v) can random errors be causal?

These five points are illustrated by applying statistical models to analyse simulated spatial, phylogenetic and time‐series data. These three simulation studies are paired with three types of predictions that can be made using information from covariances among random errors: predictions for goodness‐of‐fit, interpolation, and forecasting.

In the simulation studies, models incorporating covariances among random errors improve inference about the relationship between dependent and independent variables. They also imply the existence of unmeasured variables that generate the covariances among random errors. Understanding the covariances among random errors gives information about possible processes underlying the data.

Random errors are caused by something. Therefore, to extract full information from data, covariances among random errors should not just be included in statistical models; they should also be studied in their own right. Data are hard won, and appropriate statistical analyses can make the most of them.
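As a rough illustration of the time‐series case described above, the following minimal sketch simulates a regression with autocorrelated random errors and fits it two ways: by ordinary least squares, which ignores the covariances, and by generalized least squares, which uses them. It is not the article's code. The AR(1) error structure, the function names (`ar1_covariance`, `ols`, `gls`, `simulate`) and all parameter values are assumptions chosen for illustration, and the autocorrelation `rho` is treated as known, whereas in practice it would be estimated from the data.

```python
import numpy as np


def ar1_covariance(n, rho, sigma2=1.0):
    """Covariance of a stationary AR(1) series: sigma2 * rho^|i-j| / (1 - rho^2)."""
    lags = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
    return sigma2 * rho ** lags / (1.0 - rho ** 2)


def ols(X, y):
    """Ordinary least squares: coefficient estimates and their standard errors."""
    xtx_inv = np.linalg.inv(X.T @ X)
    beta = xtx_inv @ X.T @ y
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - X.shape[1])
    return beta, np.sqrt(s2 * np.diag(xtx_inv))


def gls(X, y, V):
    """Generalized least squares: whiten with the Cholesky factor of V, then run OLS."""
    L = np.linalg.cholesky(V)
    return ols(np.linalg.solve(L, X), np.linalg.solve(L, y))


def simulate(n=200, rho=0.8, slope=0.5, seed=0):
    """Simulate y = 1 + slope * x + e, with x and e both AR(1) (assumed setup)."""
    rng = np.random.default_rng(seed)
    L = np.linalg.cholesky(ar1_covariance(n, rho))
    x = L @ rng.standard_normal(n)       # autocorrelated predictor
    errors = L @ rng.standard_normal(n)  # autocorrelated random errors
    X = np.column_stack([np.ones(n), x])
    y = 1.0 + slope * x + errors
    return X, y


if __name__ == "__main__":
    X, y = simulate()
    V = ar1_covariance(len(y), rho=0.8)
    for name, (beta, se) in {"OLS": ols(X, y), "GLS": gls(X, y, V)}.items():
        print(f"{name}: slope = {beta[1]:.3f}, SE = {se[1]:.3f}")
```

Repeating the simulation over many seeds and comparing the spread of the slope estimates with the reported standard errors is one way to check whether each method's stated uncertainty matches its actual variability.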

Publisher / Repository:
John Wiley and Sons
Journal Name:
Methods in Ecology and Evolution
Page Range / eLocation ID:
2092 to 2105
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Matrix population models are frequently built and used by ecologists to analyse demography and elucidate the processes driving population growth or decline. Life Table Response Experiments (LTREs) are comparative analyses that decompose the realized difference or variance in population growth rate (λ) into contributions from the differences or variances in the vital rates (i.e. the matrix elements). Since their introduction, LTREs have been based on approximations and have not included biologically relevant interaction terms.

    We used the functional analysis of variance framework to derive an exact LTRE method, which calculates the exact response of λ to the difference or variance in a given vital rate, for all interactions among vital rates, including higher‐order interactions neglected by the classical methods. We used the publicly available COMADRE and COMPADRE databases to perform a meta‐analysis comparing the results of exact and classical LTRE methods. We analysed 186 and 1487 LTREs for animal and plant matrix population models, respectively.

    We found that the classical methods often had small errors, but that very high errors were possible. Overall error was related to the difference or variance in the matrices being analysed, consistent with the Taylor series basis of the classical method. Neglected interaction terms accounted for most of the errors in fixed design LTRE, highlighting the importance of two‐way interaction terms. For random design LTRE, errors in the contribution terms present in both classical and exact methods were comparable to errors due to neglected interaction terms. In most examples we analysed, evaluating exact contributions up to three‐way interaction terms was sufficient for interpreting 90% or more of the difference or variance in λ.

    Relative error, previously used to evaluate the accuracy of classical LTREs, is not a reliable metric of how closely the classical and exact methods agree. Error compensation between estimated contribution terms and neglected contribution terms can lead to low relative error despite faulty biological interpretation. Trade‐offs or negative covariances among matrix elements can lead to high relative error despite accurate biological interpretation. Exact LTRE provides reliable and accurate biological interpretation, and the R package exactLTRE makes the exact method accessible to ecologists.

  2. Abstract

    Estimation of an unstructured covariance matrix is difficult because of the challenges posed by parameter space dimensionality and the positive‐definiteness constraint that estimates should satisfy. We consider a general nonparametric covariance estimation framework for longitudinal data using the Cholesky decomposition of a positive‐definite matrix. The covariance matrix of time‐ordered measurements is diagonalized by a lower triangular matrix with unconstrained entries that are statistically interpretable as parameters for a varying coefficient autoregressive model. Using this dual interpretation of the Cholesky decomposition and allowing for irregular sampling time points, we treat covariance estimation as bivariate smoothing and cast it in a regularization framework for desired forms of simplicity in covariance models. Viewing stationarity as a form of simplicity or parsimony in covariance, we model the varying coefficient function with components depending on time lag and its orthogonal direction separately and penalize the components that capture the nonstationarity in the fitted function. We demonstrate construction of a covariance estimator using the smoothing spline framework. Simulation studies establish the advantage of our approach over alternative estimators proposed in the longitudinal data setting. We analyze a longitudinal dataset to illustrate application of the methodology and compare our estimates to those resulting from alternative models.

    This article is categorized under:

    Data: Types and Structure > Time Series, Stochastic Processes, and Functional Data

    Statistical and Graphical Methods of Data Analysis > Nonparametric Methods

    Algorithms and Computational Methods > Maximum Likelihood Methods

  3. Abstract

    Spatial biases are a common feature of presence–absence data from citizen scientists. Spatial thinning can mitigate errors in species distribution models (SDMs) that use these data. When detections or non‐detections are rare, however, SDMs may suffer from class imbalance or low sample size of the minority (i.e. rarer) class. Poor predictions can result, the severity of which may vary by modelling technique.

    To explore the consequences of spatial bias and class imbalance in presence–absence data, we used eBird citizen science data for 102 bird species from the northeastern USA to compare spatial thinning, class balancing and majority‐only thinning (i.e. retaining all samples of the minority class). We created SDMs using two parametric or semi‐parametric techniques (generalized linear models and generalized additive models) and two machine learning techniques (random forest and boosted regression trees). We tested the predictive abilities of these SDMs using an independent and systematically collected reference dataset with a combination of discrimination (area under the receiver operator characteristic curve; true skill statistic; area under the precision‐recall curve) and calibration (Brier score; Cohen's kappa) metrics.

    We found large variation in SDM performance depending on thinning and balancing decisions. Across all species, there was no single best approach, with the optimal choice of thinning and/or balancing depending on modelling technique, performance metric and the baseline sample prevalence of species in the data. Spatially thinning all the data was often a poor approach, especially for species with baseline sample prevalence <0.1. For most of these rare species, balancing classes improved model discrimination between presence and absence classes using machine learning techniques, but typically hindered model calibration.

    Baseline sample prevalence, sample size, modelling approach and the intended application of SDM output—whether discrimination or calibration—should guide decisions about how to thin or balance data, given the considerable influence of these methodological choices on SDM performance. For prognostic applications requiring good model calibration (vis‐à‐vis discrimination), the match between sample prevalence and true species prevalence may be the overriding feature and warrants further investigation.

  4. Abstract

    Seed bank, seed dispersal and historical disturbance are critical factors affecting plant population persistence. However, because of difficulties collecting data on these factors they are often ignored.

    We evaluated the roles of seed bank, seed dispersal and historical disturbance on metapopulation persistence of Hypericum cumulicola, a Florida endemic. We took advantage of long‐term demographic data of multiple populations (22 years; ~11 K individuals; 15 populations) and a wealth of information on burn history (1962–present), and habitat attributes (patch specific location, elevation, area and aggregation) of a system of 92 patches of Florida rosemary scrub. We used previously developed integral projection models to assess the relative ability of simulations with different levels of seed dormancy for recently produced and older seeds and different dispersal kernels (including no dispersal) to predict regional observed occupancy and plant abundance in patches in 2016–2018. We compared a simulation with this model using historical burn history to 500 model simulations with the same average fire regime (using a Weibull distribution to determine the probability of ignition) but with random ignition years.

    The most likely model had limited dispersal (mean = 0.5 m) and the highest dormancy (field estimates × 1.2%) and its predictions were associated with observed occurrences (67% correct) and densities (20% of variance explained). Historical burn synchrony among neighbouring patches (skewness in the number of patches burned by year = 1.79) probably explains the higher densities predicted by the simulation with the historical fire regime compared with predicted abundances after simulations using random ignition years (skewness = 0.20, SE = 0.01).

    Synthesis. Our findings demonstrate the pivotal role of seed dormancy, dispersal and fire history on population dynamics, distribution and abundance. Because of the prevalence of metapopulation dynamics, we should be aware of the significance of changes in the availability and configuration of suitable habitat associated with human or non‐human landscape changes. Decisions on prescribed fires (or other disturbances) will benefit from our knowledge of consequences of fire frequency, but also of location of ignition and the probability of fire spread.

  5. Abstract

    Quantification of phenological patterns (e.g. migration, hibernation or reproduction) should involve statistical assessments of non‐uniform temporal patterns. Circular statistics (e.g. Rayleigh test or Hermans‐Rasson test) provide useful approaches for doing so based on the number of individuals that exhibit particular activities during a number of time intervals.

    This study used monthly reproductive activity as an example to illustrate problems in applying circular statistics to data when marginal totals characterize experimental designs (e.g. the number of reproductively active individuals per time interval depends on sampling effort or sampling success). We illustrate the nature of this problem by crafting four exemplar data sets and developing a bootstrapping simulation procedure to overcome complications that arise from the existence of marginal totals. In addition, we apply circular statistics and our bootstrapping simulation to empirical data on the reproductive phenology of six species of Neotropical bats from the Amazon.

    Because sampling effort or success can differ among time intervals, circular statistics can produce misleading results of two types: those suggesting uniform phenologies when empirical patterns are markedly modal, and those suggesting non‐uniform phenologies when empirical patterns are uniform. The bootstrapping simulation overcomes these limitations: the exemplar phenology in which the percentage of reproductively active individuals is modal is appropriately identified as non‐uniform based on the bootstrapping approach, and the exemplar phenology in which the percentage of reproductively active individuals is invariant is appropriately identified as uniform based on the bootstrapping approach. The reproductive phenology of each of the six empirical examples is non‐uniform based on the bootstrapping approach, and this is true for bat species with unimodal peaks or bimodal peaks.

    In addition to problems with marginal totals, a review of analyses of phenological patterns in ecology identified two other frequent issues in the application of circular statistics: sampling bias and pseudoreplication. Each of these issues and potential solutions are also discussed. By providing source code for the execution of the Rayleigh test and Hermans‐Rasson test, along with the code for the bootstrapping simulation, we offer a useful tool for assessing non‐random phenologies when marginal totals characterize experimental designs.
