Title: Optimizing genomic sampling for demographic and epidemiological inference with Markov decision processes
Abstract: Inferences from population genomic data provide valuable insights into the demographic history of a population. Likewise, in genomic epidemiology, pathogen genomic data provide key insights into epidemic dynamics and potential sources of transmission. Yet, predicting what information will be gained from genomic data about variables of interest and how different sampling strategies will impact the quality of downstream inferences remains challenging. As a result, population genomics and related fields such as phylodynamics and phylogeography largely lack theory to guide decisions on how best to sample individuals for genomic sequencing. By adopting a sequential decision making framework based on Markov decision processes, we model how sampling interacts with a population’s demographic history to shape the ancestral or genealogical relationships of sampled individuals. By probabilistically considering these ancestral relationships, we can use Markov decision processes to predict the expected value of sampling in terms of information gained about estimated variables. This in turn allows us to very efficiently explore and identify optimal sampling strategies even when the informational value of sampling depends on past or future sampling events. To illustrate our framework, we develop Markov decision processes for three common demographic and epidemiological inference problems: estimating population growth rates, minimizing the transmission distance between sampled individuals and estimating migration rates between subpopulations. In each case, the Markov decision process allows us to identify optimal sampling strategies that maximize the information gained from genomic data while minimizing the associated costs of sampling.
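The general idea of choosing a sampling strategy with a Markov decision process can be illustrated with a toy example. The sketch below is not the authors' model: the information measure (`info`), the per-sample cost, the sequencing success probability and all constants are hypothetical placeholders chosen only for illustration. It solves a small finite-horizon MDP by backward induction, where the state is the time point and the number of genomes already collected, so the value of sampling now depends on past and future sampling.

```python
import math
from functools import lru_cache

# All constants below are hypothetical and chosen only for illustration.
HORIZON = 4        # number of sampling time points
MAX_PER_STEP = 3   # most samples that can be attempted per time point
COST = 0.15        # cost per attempted sample
P_SUCCESS = 0.9    # chance an attempted sample yields a usable genome


def info(n):
    """Toy information measure with diminishing returns in the sample count."""
    return math.log1p(n)


def binom_pmf(k, a, p):
    """Probability that k of a attempted samples succeed."""
    return math.comb(a, k) * p**k * (1 - p) ** (a - k)


@lru_cache(maxsize=None)
def value(t, n):
    """Return (expected value, best action) at time t with n genomes so far."""
    if t == HORIZON:
        return 0.0, 0
    best_value, best_action = -math.inf, 0
    for a in range(MAX_PER_STEP + 1):
        q = -COST * a  # pay for every attempted sample
        for k in range(a + 1):
            gain = info(n + k) - info(n)  # information added by k new genomes
            q += binom_pmf(k, a, P_SUCCESS) * (gain + value(t + 1, n + k)[0])
        if q > best_value + 1e-12:  # ties go to the smaller sample count
            best_value, best_action = q, a
    return best_value, best_action


if __name__ == "__main__":
    for n in (0, 2, 10, 50):
        v, a = value(0, n)
        print(f"genomes so far={n:2d}  best action={a}  expected value={v:.3f}")
```

Because the toy information measure has diminishing returns, the recommended number of samples falls as more genomes accumulate, and sampling stops once the expected marginal information no longer covers the cost.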
Award ID(s):
2200047
PAR ID:
10657927
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
GENETICS
Volume:
232
Issue:
1
ISSN:
1943-2631
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Synopsis Understanding recent population trends is critical to quantifying species vulnerability and implementing effective management strategies. To evaluate the accuracy of genomic methods for quantifying recent declines (beginning <120 generations ago), we simulated genomic data using forward-time methods (SLiM) coupled with coalescent simulations (msprime) under a number of demographic scenarios. We evaluated both site frequency spectrum (SFS)-based methods (momi2, Stairway Plot) and methods that employ linkage disequilibrium information (NeEstimator, GONE) with a range of sampling schemes (contemporary-only samples, sampling two time points, and serial sampling) and data types (RAD-like data and whole-genome sequencing). GONE and momi2 performed best overall, with >80% power to detect severe declines with large sample sizes. Two-sample and serial sampling schemes could accurately reconstruct changes in population size, and serial sampling was particularly valuable for making accurate inferences when genotyping errors or minor allele frequency cutoffs distort the SFS or under model mis-specification. However, sampling only contemporary individuals provided reliable inferences about contemporary size and size change using either site frequency or linkage-based methods, especially when large sample sizes or whole genomes from contemporary populations were available. These findings provide a guide for researchers designing genomics studies to evaluate recent demographic declines. 
  2. Schiffels, Stephan (Ed.)
    Movement of individuals between populations or demes is often restricted, especially between geographically isolated populations. The structured coalescent provides an elegant theoretical framework for describing how movement between populations shapes the genealogical history of sampled individuals and thereby structures genetic variation within and between populations. However, in the presence of recombination an individual may inherit different regions of their genome from different parents, resulting in a mosaic of genealogical histories across the genome, which can be represented by an Ancestral Recombination Graph (ARG). In this case, different genomic regions may have different ancestral histories and so different histories of movement between populations. Recombination therefore poses an additional challenge to phylogeographic methods that aim to reconstruct the movement of individuals from genealogies, although also a potential benefit in that different loci may contain additional information about movement. Here, we introduce the Structured Coalescent with Ancestral Recombination (SCAR) model, which builds on recent approximations to the structured coalescent by incorporating recombination into the ancestry of sampled individuals. The SCAR model allows us to infer how the migration history of sampled individuals varies across the genome from ARGs, and improves estimation of key population genetic parameters such as population sizes, recombination rates and migration rates. Using the SCAR model, we explore the potential and limitations of phylogeographic inference using full ARGs. We then apply the SCAR to lineages of the recombining fungus Aspergillus flavus sampled across the United States to explore patterns of recombination and migration across the genome. 
  3. INTRODUCTION: The Anthropocene is marked by an accelerated loss of biodiversity, widespread population declines, and a global conservation crisis. Given limited resources for conservation intervention, an approach is needed to identify threatened species from among the thousands lacking adequate information for status assessments. Such prioritization for intervention could come from genome sequence data, as genomes contain information about demography, diversity, fitness, and adaptive potential. However, the relevance of genomic data for identifying at-risk species is uncertain, in part because genetic variation may reflect past events and life histories better than contemporary conservation status.
    RATIONALE: The Zoonomia multispecies alignment presents an opportunity to systematically compare neutral and functional genomic diversity and their relationships to contemporary extinction risk across a large sample of diverse mammalian taxa. We surveyed 240 species spanning from the “Least Concern” to “Critically Endangered” categories, as published in the International Union for Conservation of Nature’s Red List of Threatened Species. Using a single genome for each species, we estimated historical effective population sizes (Ne) and distributions of genome-wide heterozygosity. To estimate genetic load, we identified substitutions relative to reconstructed ancestral sequences, assuming that mutations at evolutionarily conserved sites and in protein-coding sequences, especially in genes essential for viability in mice, are predominantly deleterious. We examined relationships between the conservation status of species and metrics of heterozygosity, demography, and genetic load and used these data to train and test models to distinguish threatened from nonthreatened species.
    RESULTS: Species with smaller historical Ne are more likely to be categorized as at risk of extinction, suggesting that demography, even from periods more than 10,000 years in the past, may be informative of contemporary resilience. Species with smaller historical Ne also carry proportionally higher burdens of weakly and moderately deleterious alleles, consistent with theoretical expectations of the long-term accumulation and fixation of genetic load under strong genetic drift. We found weak support for a causative link between fixed drift load and extinction risk; however, other types of genetic load not captured in our data, such as rare, highly deleterious alleles, may also play a role. Although ecological (e.g., physiological, life-history, and behavioral) variables were the best predictors of extinction risk, genomic variables nonrandomly distinguished threatened from nonthreatened species in regression and machine learning models. These results suggest that information encoded within even a single genome can provide a risk assessment in the absence of adequate ecological or population census data.
    CONCLUSION: Our analysis highlights the potential for genomic data to rapidly and inexpensively gauge extinction risk by leveraging relationships between contemporary conservation status and genetic variation shaped by the long-term demographic history of species. As more resequencing data and additional reference genomes become available, estimates of genetic load, estimates of recent demographic history, and accuracy of predictive models will improve. We therefore echo calls for including genomic information in assessments of the conservation status of species.
    Figure caption: Genomic information can help predict extinction risk in diverse mammalian species. Across 240 mammals, species with smaller historical Ne had lower genetic diversity, higher genetic load, and were more likely to be threatened with extinction. Genomic data were used to train models that predict whether a species is threatened, which can be valuable for assessing extinction risk in species lacking ecological or census data. [Animal silhouettes are from PhyloPic]
  4. Abstract Detecting recent demographic changes is a crucial component of species conservation and management, as many natural populations face declines due to anthropogenic habitat alteration and climate change. Genetic methods allow researchers to detect changes in effective population size (Ne) from sampling at a single timepoint. However, in species with long lifespans, there is a lag between the start of a decline in a population and the resulting decrease in genetic diversity. This lag slows the rate at which diversity is lost, and therefore makes it difficult to detect recent declines using genetic data. However, the genomes of old individuals can provide a window into the past, and can be compared to those of younger individuals, a contrast that may help reveal recent demographic declines. To test whether comparing the genomes of young and old individuals can help infer recent demographic bottlenecks, we use forward‐time, individual‐based simulations with varying mean individual lifespans and extents of generational overlap. We find that age information can be used to aid in the detection of demographic declines when the decline has been severe. When average lifespan is long, comparing young and old individuals from a single timepoint has greater power to detect a recent (within the last 50 years) bottleneck event than comparing individuals sampled at different points in time. Our results demonstrate how longevity and generational overlap can be both a hindrance and a boon to detecting recent demographic declines from population genomic data. 
  5. Abstract Genomic data and machine learning approaches have gained interest due to their potential to identify adaptive genetic variation across populations and to assess species vulnerability to climate change. By identifying gene–environment associations for putatively adaptive loci, these approaches project changes to adaptive genetic composition as a function of future climate change (genetic offsets), which are interpreted as measuring the future maladaptation of populations due to climate change. In principle, higher genetic offsets relate to increased population vulnerability and therefore can be used to set priorities for conservation and management. However, it is not clear how sensitive these metrics are to the intensity of population and individual sampling. Here, we use five genomic datasets with varying numbers of SNPs (NSNPs = 7006–1,398,773), sampled populations (Npop = 23–47) and individuals (Nind = 185–595) to evaluate the estimation sensitivity of genetic offsets to varying degrees of sampling intensity. We found that genetic offsets are sensitive to the number of populations being sampled, especially with fewer than 10 populations and when genetic structure is high. We also found that the number of individuals sampled per population had small effects on the estimation of genetic offsets, with more robust results when five or more individuals are sampled. Finally, uncertainty associated with the use of different future climate scenarios slightly increased estimation uncertainty in the genetic offsets. Our results suggest that sampling efforts should focus on increasing the number of populations, rather than the number of individuals per population, and that multiple future climate scenarios should be evaluated to ascertain estimation sensitivity.