Title: Recommendations for improving statistical inference in population genomics
The field of population genomics has grown rapidly in response to the recent advent of affordable, large-scale sequencing technologies. As opposed to the situation during the majority of the 20th century, in which the development of theoretical and statistical population genetic insights outpaced the generation of data to which they could be applied, genomic data are now being produced at a far greater rate than they can be meaningfully analyzed and interpreted. With this wealth of data has come a tendency to focus on fitting specific (and often rather idiosyncratic) models to data, at the expense of a careful exploration of the range of possible underlying evolutionary processes. For example, the approach of directly investigating models of adaptive evolution in each newly sequenced population or species often neglects the fact that a thorough characterization of ubiquitous nonadaptive processes is a prerequisite for accurate inference. We here describe the perils of these tendencies, present our consensus views on current best practices in population genomic data analysis, and highlight areas of statistical inference and theory that are in need of further attention. Thereby, we argue for the importance of defining a biologically relevant baseline model tuned to the details of each new analysis, of skepticism and scrutiny in interpreting model fitting results, and of carefully defining addressable hypotheses and underlying uncertainties.
Award ID(s):
1927159 2119963
PAR ID:
10351094
Journal Name:
PLOS Biology
Volume:
20
Issue:
5
ISSN:
1545-7885
Page Range / eLocation ID:
e3001669
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Demographic inference methods in population genetics typically assume that the ancestry of a sample can be modeled by the Kingman coalescent. A defining feature of this stochastic process is that it generates genealogies that are binary trees: no more than 2 ancestral lineages may coalesce at the same time. However, this assumption breaks down under several scenarios. For example, pervasive natural selection and extreme variation in offspring number can both generate genealogies with “multiple-merger” events in which more than 2 lineages coalesce instantaneously. Therefore, detecting violations of the Kingman assumptions (e.g. due to multiple mergers) is important both for understanding which forces have shaped the diversity of a population and for avoiding fitting misspecified models to data. Current methods to detect deviations from Kingman coalescence in genomic data rely primarily on the site frequency spectrum (SFS). However, the signatures of some non-Kingman processes (e.g. multiple mergers) in the SFS are also consistent with a Kingman coalescent with a time-varying population size. Here, we present a new statistical test for determining whether the Kingman coalescent with any population size history is consistent with population data. Our approach is based on information contained in the 2-site joint frequency spectrum (2-SFS) for pairs of linked sites, which has a different dependence on the topologies of genealogies than the SFS. Our statistical test is global in the sense that it can detect when the genome-wide genetic diversity is inconsistent with the Kingman model, rather than detecting outlier regions, as in selection scan methods. We validate this test using simulations and then apply it to demonstrate that genomic diversity data from Drosophila melanogaster is inconsistent with the Kingman coalescent. 
  2. Abstract Bayesian hierarchical models allow ecologists to account for uncertainty and make inference at multiple scales. However, hierarchical models are often computationally intensive to fit, especially with large datasets, and researchers face trade‐offs between capturing ecological complexity in statistical models and implementing these models. We present a recursive Bayesian computing (RB) method that can be used to fit Bayesian models efficiently in sequential MCMC stages to ease computation and streamline hierarchical inference. We also introduce transformation‐assisted RB (TARB) to create unsupervised MCMC algorithms and improve interpretability of parameters. We demonstrate TARB by fitting a hierarchical animal movement model to obtain inference about individual‐ and population‐level migratory characteristics. Our recursive procedure reduced computation time for fitting our hierarchical movement model by half compared to fitting the model with a single MCMC algorithm. We obtained the same inference fitting our model using TARB as we obtained fitting the model with a single algorithm. For complex ecological statistical models, like those for animal movement, multi‐species systems, or large spatial and temporal scales, the computational demands of fitting models with conventional computing techniques can limit model specification, thus hindering scientific discovery. Transformation‐assisted RB is one of the most accessible methods for reducing these limitations, enabling us to implement new statistical models and advance our understanding of complex ecological phenomena.
  3. Abstract Multivariate spatially oriented data sets are prevalent in the environmental and physical sciences. Scientists seek to jointly model multiple variables, each indexed by a spatial location, to capture any underlying spatial association for each variable and associations among the different dependent variables. Multivariate latent spatial process models have proved effective in driving statistical inference and rendering better predictive inference at arbitrary locations for the spatial process. High‐dimensional multivariate spatial data, which are the theme of this article, refer to data sets where the number of spatial locations and the number of spatially dependent variables are very large. The field has witnessed substantial developments in scalable models for univariate spatial processes, but such methods for multivariate spatial processes, especially when the number of outcomes is moderately large, are limited in comparison. Here, we extend scalable modeling strategies for a single process to multivariate processes. We pursue Bayesian inference, which is attractive for full uncertainty quantification of the latent spatial process. Our approach exploits distribution theory for the matrix‐normal distribution, which we use to construct scalable versions of a hierarchical linear model of coregionalization (LMC) and spatial factor models that deliver inference over a high‐dimensional parameter space including the latent spatial process. We illustrate the computational and inferential benefits of our algorithms over competing methods using simulation studies and an analysis of a massive vegetation index data set.
  4. Summary A variety of demographic statistical models exist for studying population dynamics when individuals can be tracked over time. In cases where data are missing due to imperfect detection of individuals, the associated measurement error can be accommodated under certain study designs (e.g. those that involve multiple surveys or replication). However, the interaction of the measurement error and the underlying dynamic process can complicate the implementation of statistical agent‐based models (ABMs) for population demography. In a Bayesian setting, traditional computational algorithms for fitting hierarchical demographic models can be prohibitively cumbersome to construct. Thus, we discuss a variety of approaches for fitting statistical ABMs to data and demonstrate how to use multi‐stage recursive Bayesian computing and statistical emulators to fit models in such a way that alleviates the need to have analytical knowledge of the ABM likelihood. Using two examples, a demographic model for survival and a compartment model for COVID‐19, we illustrate statistical procedures for implementing ABMs. The approaches we describe are intuitive and accessible for practitioners and can be parallelised easily for additional computational efficiency. 
  5. Abstract This paper describes model construction practices used by scientifically trained experts. Our work on science experts has involved analyzing data from videotaped protocols of experts thinking aloud about unfamiliar explanation problems. These studies document the value of nonformal heuristic reasoning processes such as analogies, identification of new variables, Gedanken experiments, and the construction and running of visualizable explanatory models. Although these processes are less formal than formal deduction or induction or statistical inference procedures, the case study analyzed here shows that they can lead to real insights and conceptual change. At a larger time scale, the subject went through model evolution cycles of model generation, evaluation, and modification that utilized the heuristic reasoning processes above. In addition, the prevalence of imagistic simulation as an underlying foundation in these episodes suggests that it may be important to pay greater attention to an imagistic level of processing in the analysis of expert thinking. Larger time scale modes of model evolution and model competition were also evidenced. The analysis leads to four levels of processes or practices: IV. An overarching set of Model Construction Modes, primarily alternating between Model Evolution, in which a model is improved, and Model Competition, in which two or more models compete. III. Modeling (GEM) Cycle process of Model Generation, Evaluation, and Modification at a Macro level, as shown in Figure 10. II. Nonformal Reasoning Processes at a Micro level: e.g. analogy, running a model, identifying a new variable, and conducting a Gedanken experiment. I. Underlying Imagistic process including Imagistic Simulation that may have been occurring within all of the above processes. To our knowledge these four levels of processes have not been analyzed together in the past. They complement empirical processes of discovery, experimentation, and evaluative argumentation documented by others. Diagrams of how the above processes interact may give us some new ways to picture the roles of nonformal reasoning and imagistic processes during qualitative model construction. We call the set of processes at all four levels a 'Modeling Practices Framework'. Processes at a lower level serve as subprocesses for the level above it in this framework. Each level has multiple "things to try" to achieve tasks at the level above it. Thus the framework is an organized but flexible structure of heuristic processes. This lies between and contrasts with those who would describe theory making in science as either 'anarchistic', with no method structure, or 'algorithmic', with fairly standardized procedures.