Title: Robustness of linear mixed‐effects models to violations of distributional assumptions

Linear mixed‐effects models are powerful tools for analysing complex datasets with repeated or clustered observations, a common data structure in ecology and evolution. Mixed‐effects models involve complex fitting procedures and make several assumptions, in particular about the distributions of residuals and random effects. Violations of these assumptions are common in real datasets, yet it is not always clear how much these violations matter to accurate and unbiased estimation.

Here we address the consequences of violations of distributional assumptions and the impact of missing random effect components on model estimates. In particular, we evaluate the effects of skewed, bimodal and heteroscedastic random effect and residual variances, of missing random effect terms and of correlated fixed effect predictors. We focus on bias and prediction error in estimates of fixed and random effects.

Model estimates were usually robust to violations of assumptions, with the exception of slight upward biases in estimates of random effect variance if the generating distribution was bimodal but was modelled by Gaussian error distributions. Further, estimates for (random effect) components that violated distributional assumptions became less precise but remained unbiased. However, this particular problem did not affect other parameters of the model. The same pattern was found for strongly correlated fixed effects, which led to imprecise, but unbiased estimates, with uncertainty estimates reflecting imprecision.

Unmodelled sources of random effect variance had predictable effects on variance component estimates. The pattern is best viewed as a cascade of hierarchical grouping factors. Variances trickle down the hierarchy such that missing higher‐level random effect variances pool at lower levels and missing lower‐level and crossed random effect variances manifest as residual variance.
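The downward pooling described above can be illustrated with a minimal simulation sketch. It is not the authors' procedure: it uses a balanced nested design and a simple one-way ANOVA (method-of-moments) estimator rather than a fitted mixed model, and the function names, group sizes and variance values are hypothetical choices made only for illustration. Data are generated with a higher-level factor A, a lower-level factor B nested in A, and residual noise; the analysis then ignores A and estimates only a B-level variance and a residual variance.

```python
import numpy as np


def simulate_nested(n_a=400, n_b=5, n_obs=10,
                    var_a=1.0, var_b=0.5, var_e=1.0, seed=1):
    """Simulate y = a_i + b_ij + e_ijk for a balanced nested design.

    All sizes and variances are illustrative assumptions.
    Returns an array of shape (n_a, n_b, n_obs).
    """
    rng = np.random.default_rng(seed)
    a = rng.normal(0.0, np.sqrt(var_a), size=(n_a, 1, 1))      # higher level
    b = rng.normal(0.0, np.sqrt(var_b), size=(n_a, n_b, 1))    # lower level
    e = rng.normal(0.0, np.sqrt(var_e), size=(n_a, n_b, n_obs))  # residual
    return a + b + e


def one_way_components(y):
    """Balanced one-way ANOVA estimator for a (groups, obs) array.

    Returns (between-group variance estimate, residual variance estimate).
    """
    n_obs = y.shape[1]
    ms_within = y.var(axis=1, ddof=1).mean()          # pooled within-group variance
    ms_between = n_obs * y.mean(axis=1).var(ddof=1)   # mean square between groups
    var_group = (ms_between - ms_within) / n_obs
    return var_group, ms_within


y = simulate_nested()
n_a, n_b, n_obs = y.shape

# Drop the higher-level factor A: treat every B subgroup as an independent group.
var_b_hat, var_e_hat = one_way_components(y.reshape(n_a * n_b, n_obs))

print(f"B-level variance estimate (A omitted): {var_b_hat:.2f}")
print(f"Residual variance estimate:            {var_e_hat:.2f}")
```

Under these assumed settings the B-level estimate should land near the sum of the simulated A and B variances (1.0 + 0.5), while the residual estimate stays near its simulated value, which is the "trickle down" pattern the paragraph describes for an omitted higher-level term.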

Overall, our results show remarkable robustness of mixed‐effects models that should allow researchers to use mixed‐effects models even if the distributional assumptions are objectively violated. However, this does not free researchers from careful evaluation of the model. Estimates that are based on data that show clear violations of key assumptions should be treated with caution because individual datasets might give highly imprecise estimates, even if they will be unbiased on average across datasets.

Journal Name:
Methods in Ecology and Evolution
Page Range / eLocation ID:
p. 1141-1152
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Integrated population models (IPMs) have become increasingly popular for the modelling of populations, as investigators seek to combine survey and demographic data to understand processes governing population dynamics. These models are particularly useful for identifying and exploring knowledge gaps within life histories, because they allow investigators to estimate biologically meaningful parameters, such as immigration or reproduction, that were previously unidentifiable without additional data. As IPMs have been developed relatively recently, there is much to learn about model behaviour. Behaviour of parameters, such as estimates near boundaries, and the consequences of varying degrees of dependency among datasets, has been explored. However, the reliability of parameter estimates remains underexamined, particularly when models include parameters that are not identifiable from one data source, but are indirectly identifiable from multiple datasets and a presumed model structure, such as the estimation of immigration using capture‐recapture, fecundity and count data, combined with a life‐history model.

    To examine the behaviour of model parameter estimates, we simulated stable populations closed to immigration and emigration. We simulated two scenarios that might induce error into survival estimates: marker induced bias in the capture–mark–recapture data and heterogeneity in the mortality process. We subsequently fit capture–mark–recapture, state‐space and fecundity models, as well as IPMs that estimated additional parameters.

    Simulation results suggested that when model assumptions are violated, estimation of additional, previously unidentifiable, parameters using IPMs may be extremely sensitive to these violations of model assumptions. For example, when annual marker loss was simulated, estimates of survival rates were low and estimates of immigration rate from an IPM were high. When heterogeneity in the mortality process was induced, there were substantial relative differences between the medians of posterior distributions and truth for juvenile survival and fecundity.

    Our results have important implications for biological inference when using IPMs, as well as future model development and implementation. Specifically, using multiple datasets to identify additional parameters resulted in the posterior distributions of additional parameters directly reflecting the effects of the violations of model assumptions in integrated modelling frameworks. We suggest that investigators interpret posterior distributions of these parameters as a combination of biological process and systematic error.

  2. Abstract

    Ecological analyses typically involve many interacting variables. Ecologists often specify lagged interactions in community dynamics (i.e. vector‐autoregressive models) or simultaneous interactions (e.g. structural equation models), but there is less familiarity with dynamic structural equation models (DSEM) that can include any simultaneous or lagged effect in multivariate time‐series analysis.

    We propose a novel approach to parameter estimation for DSEM, which involves constructing a Gaussian Markov random field (GMRF) representing simultaneous and lagged path coefficients, and then fitting this as a generalized linear mixed model to missing and/or non‐normal data. We provide a new R‐package dsem, which extends the ‘arrow interface’ from path analysis to represent user‐specified lags when constructing the GMRF. We also outline how the resulting nonseparable precision matrix can generalize existing separable models, for example, for time‐series and species interactions in a vector‐autoregressive model.

    We first demonstrate dsem by simulating a two‐species vector‐autoregressive model based on wolf–moose interactions on Isle Royale. We show that DSEM has improved precision when data are missing relative to a conventional dynamic linear model. We then demonstrate DSEM via two contrasting case studies. The first identifies a trophic cascade where decreased sunflower starfish has increased urchin and decreased kelp densities, while sea otters have a simultaneous positive effect on kelp in the California Current from 1999 to 2018. The second estimates how declining sea ice has decreased cold‐water habitats, driving a decreased density for fall copepod predation and inhibiting early‐life survival for Alaska pollock from 1963 to 2023.

    We conclude that DSEM can be fitted efficiently as a GLMM involving missing data, while allowing users to specify both simultaneous and lagged effects in a time‐series structural model. DSEM then allows conceptual models (developed with stakeholder input or from ecological expertise) to be fitted to incomplete time series and provides a simple interface for granular control over the number of estimated time‐series parameters. Finally, computational methods are sufficiently simple that DSEM can be embedded as a component within larger (e.g. integrated population) models. We therefore recommend greater exploration and performance testing for DSEM relative to familiar time‐series forecasting methods.

  3. Abstract

    We propose a consistent and locally efficient method of estimating the model parameters of a logistic mixed effect model with random slopes. Our approach relaxes two typical assumptions: the random effects being normally distributed, and the covariates and random effects being independent of each other. Adhering to these assumptions is particularly difficult in health studies where, in many cases, we have limited resources to design experiments and gather data in long‐term studies, while new findings from other fields might emerge, suggesting the violation of such assumptions. So it is crucial to have an estimator that is robust to such violations; then we could make better use of current data harvested using various valuable resources. Our method generalizes the framework presented in Garcia & Ma (2016) which also deals with a logistic mixed effect model but only considers a random intercept. A simulation study reveals that our proposed estimator remains consistent even when the independence and normality assumptions are violated. This contrasts favourably with the traditional maximum likelihood estimator which is likely to be inconsistent when there is dependence between the covariates and random effects. Application of this work to a study of Huntington's disease reveals that disease diagnosis can be enhanced using assessments of cognitive performance. The Canadian Journal of Statistics 47: 140–156; 2019 © 2019 Statistical Society of Canada

  4. Abstract

    Calibration weighting has been widely used to correct selection biases in nonprobability sampling, missing data and causal inference. The main idea is to calibrate the biased sample to the benchmark by adjusting the subject weights. However, hard calibration can produce enormous weights when an exact calibration is enforced on a large set of extraneous covariates. This article proposes a soft calibration scheme, where the outcome and the selection indicator follow mixed-effect models. The scheme imposes an exact calibration on the fixed effects and an approximate calibration on the random effects. On the one hand, our soft calibration has an intrinsic connection with best linear unbiased prediction, which results in a more efficient estimation compared to hard calibration. On the other hand, soft calibration weighting estimation can be envisioned as penalized propensity score weight estimation, with the penalty term motivated by the mixed-effect structure. The asymptotic distribution and a valid variance estimator are derived for soft calibration. We demonstrate the superiority of the proposed estimator over other competitors in simulation studies and using a real-world data application on the effect of BMI screening on childhood obesity.

  5. Summary

    We introduce a flexible marginal modelling approach for statistical inference for clustered and longitudinal data under minimal assumptions. This estimated estimating equations approach is semiparametric and the proposed models are fitted by quasi-likelihood regression, where the unknown marginal means are a function of the fixed effects linear predictor with unknown smooth link, and variance–covariance is an unknown smooth function of the marginal means. We propose to estimate the nonparametric link and variance–covariance functions via smoothing methods, whereas the regression parameters are obtained via the estimated estimating equations. These are score equations that contain nonparametric function estimates. The proposed estimated estimating equations approach is motivated by its flexibility and easy implementation. Moreover, if data follow a generalized linear mixed model, with either a specified or an unspecified distribution of random effects and link function, the model proposed emerges as the corresponding marginal (population-average) version and can be used to obtain inference for the fixed effects in the underlying generalized linear mixed model, without the need to specify any other components of this generalized linear mixed model. Among marginal models, the estimated estimating equations approach provides a flexible alternative to modelling with generalized estimating equations. Applications of estimated estimating equations include diagnostics and link selection. The asymptotic distribution of the proposed estimators for the model parameters is derived, enabling statistical inference. Practical illustrations include Poisson modelling of repeated epileptic seizure counts and simulations for clustered binomial responses.
