Title: Optimizing genomic sampling for demographic and epidemiological inference with Markov decision processes
Abstract: Inferences from population genomic data provide valuable insights into the demographic history of a population. Likewise, in genomic epidemiology, pathogen genomic data provide key insights into epidemic dynamics and potential sources of transmission. Yet, predicting what information will be gained from genomic data about variables of interest and how different sampling strategies will impact the quality of downstream inferences remains challenging. As a result, population genomics and related fields such as phylodynamics and phylogeography largely lack theory to guide decisions on how best to sample individuals for genomic sequencing. By adopting a sequential decision making framework based on Markov decision processes, we model how sampling interacts with a population’s demographic history to shape the ancestral or genealogical relationships of sampled individuals. By probabilistically considering these ancestral relationships, we can use Markov decision processes to predict the expected value of sampling in terms of information gained about estimated variables. This in turn allows us to very efficiently explore and identify optimal sampling strategies even when the informational value of sampling depends on past or future sampling events. To illustrate our framework, we develop Markov decision processes for three common demographic and epidemiological inference problems: estimating population growth rates, minimizing the transmission distance between sampled individuals and estimating migration rates between subpopulations. In each case, the Markov decision process allows us to identify optimal sampling strategies that maximize the information gained from genomic data while minimizing the associated costs of sampling.
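The general idea of treating sampling as a sequential decision problem can be illustrated with a toy finite-horizon Markov decision process solved by backward induction. This is a minimal sketch under stated assumptions and not the authors' model: the function name `solve_sampling_mdp`, the logarithmic information-gain reward, the per-sample cost, and the sequencing success probability `p_success` are all hypothetical choices made for illustration. The state is the number of genomes obtained so far, so the value of sampling now depends on past sampling (through diminishing returns) and on future sampling (through the value function).

```python
from math import comb, log


def solve_sampling_mdp(weights, max_per_step, cost, p_success=0.9):
    """Backward induction for a toy finite-horizon sampling MDP.

    State:  n, the number of genomes successfully sequenced so far.
    Action: a, the number of individuals sampled at the current step.
    Reward: weights[t] * (log(1 + n_new) - log(1 + n)) - cost * a,
            an assumed diminishing-returns proxy for information gain.
    Each sampled individual yields a usable genome with probability
    p_success (an illustrative assumption).
    """
    horizon = len(weights)
    max_state = horizon * max_per_step
    # value[t][n]: expected future reward from step t with n genomes in hand
    value = [[0.0] * (max_state + 1) for _ in range(horizon + 1)]
    # policy[t][n]: optimal number of individuals to sample at step t
    policy = [[0] * (max_state + 1) for _ in range(horizon)]

    for t in range(horizon - 1, -1, -1):
        for n in range(t * max_per_step + 1):  # states reachable by step t
            best_q, best_a = float("-inf"), 0
            for a in range(max_per_step + 1):
                q = -cost * a
                for k in range(a + 1):  # k of the a samples succeed
                    prob = comb(a, k) * p_success**k * (1 - p_success) ** (a - k)
                    gain = weights[t] * (log(1 + n + k) - log(1 + n))
                    q += prob * (gain + value[t + 1][n + k])
                if q > best_q:  # ties resolve to the smaller action
                    best_q, best_a = q, a
            value[t][n] = best_q
            policy[t][n] = best_a
    return value, policy


if __name__ == "__main__":
    # Hypothetical setting: later samples are assumed to be more informative.
    value, policy = solve_sampling_mdp([0.5, 1.0, 2.0], max_per_step=4, cost=0.3)
    print("expected value at start:", value[0][0])
    print("samples to take at t=0 with none in hand:", policy[0][0])
```

In this sketch `policy[t][n]` gives the number of individuals to sample at step `t` when `n` genomes are already in hand. Changing `weights` or `cost` shifts sampling effort between time steps, which is the kind of trade-off between information and cost that the abstract describes.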
Award ID(s):
2200047
PAR ID:
10651470
Author(s) / Creator(s):
; ;
Editor(s):
Brandvain, Y
Publisher / Repository:
Wiley
Date Published:
Journal Name:
GENETICS
ISSN:
1943-2631
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Understanding recent population trends is critical to quantifying species vulnerability and implementing effective management strategies. To evaluate the accuracy of genomic methods for quantifying recent declines (beginning <120 generations ago), we simulated genomic data using forward-time methods (SLiM) coupled with coalescent simulations (msprime) under a number of demographic scenarios. We evaluated both site frequency spectrum (SFS)-based methods (momi2, Stairway Plot) and methods that employ linkage disequilibrium information (NeEstimator, GONE) with a range of sampling schemes (contemporary-only samples, sampling two time points, and serial sampling) and data types (RAD-like data and whole-genome sequencing). GONE and momi2 performed best overall, with >80% power to detect severe declines with large sample sizes. Two-sample and serial sampling schemes could accurately reconstruct changes in population size, and serial sampling was particularly valuable for making accurate inference when genotyping errors or minor allele frequency cutoffs distort the SFS or under model mis-specification. However, sampling only contemporary individuals provided reliable inferences about contemporary size and size change using either site frequency or linkage-based methods, especially when large sample sizes or whole genomes from contemporary populations were available. These findings provide a guide for researchers designing genomics studies to evaluate recent demographic declines. 
  2. Schiffels, Stephan (Ed.)
    Movement of individuals between populations or demes is often restricted, especially between geographically isolated populations. The structured coalescent provides an elegant theoretical framework for describing how movement between populations shapes the genealogical history of sampled individuals and thereby structures genetic variation within and between populations. However, in the presence of recombination an individual may inherit different regions of their genome from different parents, resulting in a mosaic of genealogical histories across the genome, which can be represented by an Ancestral Recombination Graph (ARG). In this case, different genomic regions may have different ancestral histories and so different histories of movement between populations. Recombination therefore poses an additional challenge to phylogeographic methods that aim to reconstruct the movement of individuals from genealogies, although also a potential benefit in that different loci may contain additional information about movement. Here, we introduce the Structured Coalescent with Ancestral Recombination (SCAR) model, which builds on recent approximations to the structured coalescent by incorporating recombination into the ancestry of sampled individuals. The SCAR model allows us to infer how the migration history of sampled individuals varies across the genome from ARGs, and improves estimation of key population genetic parameters such as population sizes, recombination rates and migration rates. Using the SCAR model, we explore the potential and limitations of phylogeographic inference using full ARGs. We then apply the SCAR to lineages of the recombining fungus Aspergillus flavus sampled across the United States to explore patterns of recombination and migration across the genome. 
  3. Harris, Kelley (Ed.)
    Abstract As a species of considerable biomedical importance, characterizing the evolutionary genomics of the common marmoset (Callithrix jacchus) is of significance across multiple fields of research. However, at least 2 peculiarities of this species potentially preclude commonly utilized population genetic modeling and inference approaches: a high frequency of twin births and hematopoietic chimerism. We here investigate these effects within the context of demographic inference, demonstrating via simulation that neglecting these biological features results in significant mis-inference of the underlying population history. Based upon this result, we develop a novel approximate Bayesian inference approach accounting for both common twin births and chimeric sampling. In addition, we newly present population genomic data from 15 individuals sequenced to high coverage and utilize gene-level annotations to identify neutrally evolving intergenic regions appropriate for demographic inference. Applying our developed methodology, we estimate a well-fitting population history for this species, which suggests robust ancestral and current population sizes, as well as a size reduction roughly 7,000 years ago likely associated with a shift from arboreal to savanna vegetation in north-eastern Brazil during this period. 
  4. INTRODUCTION The Anthropocene is marked by an accelerated loss of biodiversity, widespread population declines, and a global conservation crisis. Given limited resources for conservation intervention, an approach is needed to identify threatened species from among the thousands lacking adequate information for status assessments. Such prioritization for intervention could come from genome sequence data, as genomes contain information about demography, diversity, fitness, and adaptive potential. However, the relevance of genomic data for identifying at-risk species is uncertain, in part because genetic variation may reflect past events and life histories better than contemporary conservation status.
    RATIONALE The Zoonomia multispecies alignment presents an opportunity to systematically compare neutral and functional genomic diversity and their relationships to contemporary extinction risk across a large sample of diverse mammalian taxa. We surveyed 240 species spanning from the “Least Concern” to “Critically Endangered” categories, as published in the International Union for Conservation of Nature’s Red List of Threatened Species. Using a single genome for each species, we estimated historical effective population sizes (Ne) and distributions of genome-wide heterozygosity. To estimate genetic load, we identified substitutions relative to reconstructed ancestral sequences, assuming that mutations at evolutionarily conserved sites and in protein-coding sequences, especially in genes essential for viability in mice, are predominantly deleterious. We examined relationships between the conservation status of species and metrics of heterozygosity, demography, and genetic load and used these data to train and test models to distinguish threatened from nonthreatened species.
    RESULTS Species with smaller historical Ne are more likely to be categorized as at risk of extinction, suggesting that demography, even from periods more than 10,000 years in the past, may be informative of contemporary resilience. Species with smaller historical Ne also carry proportionally higher burdens of weakly and moderately deleterious alleles, consistent with theoretical expectations of the long-term accumulation and fixation of genetic load under strong genetic drift. We found weak support for a causative link between fixed drift load and extinction risk; however, other types of genetic load not captured in our data, such as rare, highly deleterious alleles, may also play a role. Although ecological (e.g., physiological, life-history, and behavioral) variables were the best predictors of extinction risk, genomic variables nonrandomly distinguished threatened from nonthreatened species in regression and machine learning models. These results suggest that information encoded within even a single genome can provide a risk assessment in the absence of adequate ecological or population census data.
    CONCLUSION Our analysis highlights the potential for genomic data to rapidly and inexpensively gauge extinction risk by leveraging relationships between contemporary conservation status and genetic variation shaped by the long-term demographic history of species. As more resequencing data and additional reference genomes become available, estimates of genetic load, estimates of recent demographic history, and accuracy of predictive models will improve. We therefore echo calls for including genomic information in assessments of the conservation status of species.
    Genomic information can help predict extinction risk in diverse mammalian species. Across 240 mammals, species with smaller historical Ne had lower genetic diversity, higher genetic load, and were more likely to be threatened with extinction. Genomic data were used to train models that predict whether a species is threatened, which can be valuable for assessing extinction risk in species lacking ecological or census data. [Animal silhouettes are from PhyloPic]
  5. Abstract Detecting recent demographic changes is a crucial component of species conservation and management, as many natural populations face declines due to anthropogenic habitat alteration and climate change. Genetic methods allow researchers to detect changes in effective population size (Ne) from sampling at a single timepoint. However, in species with long lifespans, there is a lag between the start of a decline in a population and the resulting decrease in genetic diversity. This lag slows the rate at which diversity is lost, and therefore makes it difficult to detect recent declines using genetic data. However, the genomes of old individuals can provide a window into the past, and can be compared to those of younger individuals, a contrast that may help reveal recent demographic declines. To test whether comparing the genomes of young and old individuals can help infer recent demographic bottlenecks, we use forward‐time, individual‐based simulations with varying mean individual lifespans and extents of generational overlap. We find that age information can be used to aid in the detection of demographic declines when the decline has been severe. When average lifespan is long, comparing young and old individuals from a single timepoint has greater power to detect a recent (within the last 50 years) bottleneck event than comparing individuals sampled at different points in time. Our results demonstrate how longevity and generational overlap can be both a hindrance and a boon to detecting recent demographic declines from population genomic data. 