Title: Robust detection of natural selection using a probabilistic model of tree imbalance
Abstract: Neutrality tests such as Tajima’s D and Fay and Wu’s H are standard implements in the population genetics toolbox. One of their most common uses is to scan the genome for signals of natural selection. However, it is well understood that D and H are confounded by other evolutionary forces, in particular population expansion, that may be unrelated to selection. Because they are not model-based, it is not clear how to deconfound these tests in a principled way. In this article, we derive new likelihood-based methods for detecting natural selection, which are robust to fluctuations in effective population size. At the core of our method is a novel probabilistic model of tree imbalance, which generalizes Kingman’s coalescent to allow certain aberrant tree topologies to arise more frequently than is expected under neutrality. We derive a frequency spectrum-based estimator that can be used in place of D, and also extend to the case where genealogies are first estimated. We benchmark our methods on real and simulated data, and provide an open source software implementation.
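For readers unfamiliar with the classical statistics named in the abstract, the following is a minimal sketch of how Tajima’s D and the unnormalized Fay and Wu’s H can be computed from an unfolded site frequency spectrum. It uses the standard textbook formulas and is not the likelihood-based estimator introduced in the article. The function names and the input convention (`sfs[i-1]` holds the number of segregating sites whose derived allele appears in `i` of `n` sampled sequences) are assumptions made for illustration.

```python
from math import comb, sqrt


def diversity_estimates(sfs):
    """Return (n, S, pi, theta_H) from an unfolded site frequency spectrum.

    Assumed convention: sfs[i-1] is the number of segregating sites whose
    derived allele is carried by i of the n samples, for i = 1..n-1.
    """
    n = len(sfs) + 1
    pairs = comb(n, 2)
    S = sum(sfs)  # number of segregating sites
    # Mean number of pairwise differences.
    pi = sum(x * i * (n - i) for i, x in enumerate(sfs, start=1)) / pairs
    # Fay and Wu's estimator, which weights high-frequency derived alleles.
    theta_h = sum(x * i * i for i, x in enumerate(sfs, start=1)) / pairs
    return n, S, pi, theta_h


def tajimas_d(sfs):
    """Classical Tajima's D (Tajima 1989 normalization)."""
    n, S, pi, _ = diversity_estimates(sfs)
    a1 = sum(1 / i for i in range(1, n))
    a2 = sum(1 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3 * (n - 1))
    b2 = 2 * (n * n + n + 3) / (9 * n * (n - 1))
    c1 = b1 - 1 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    return (pi - S / a1) / sqrt(e1 * S + e2 * S * (S - 1))


def fay_wu_h(sfs):
    """Unnormalized Fay and Wu's H = pi - theta_H."""
    _, _, pi, theta_h = diversity_estimates(sfs)
    return pi - theta_h
```

Under the neutral expectation, where the number of sites with derived count `i` is proportional to `1/i`, both statistics are zero. An excess of singletons, as produced by population expansion or by a recent sweep, drives D negative, which illustrates the confounding the abstract describes.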
Award ID(s):
2052653
PAR ID:
10363456
Author(s) / Creator(s):
; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Genetics
Volume:
220
Issue:
3
ISSN:
1943-2631
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Motivation: Branch lengths and topology of a species tree are essential in most downstream analyses, including estimation of diversification dates, characterization of selection, understanding adaptation, and comparative genomics. Modern phylogenomic analyses often use methods that account for the heterogeneity of evolutionary histories across the genome due to processes such as incomplete lineage sorting. However, these methods typically do not generate branch lengths in units that are usable by downstream applications, forcing phylogenomic analyses to resort to alternative shortcuts such as estimating branch lengths by concatenating gene alignments into a supermatrix. Yet, concatenation and other available approaches for estimating branch lengths fail to address heterogeneity across the genome. Results: In this article, we derive expected values of gene tree branch lengths in substitution units under an extension of the multispecies coalescent (MSC) model that allows substitutions with varying rates across the species tree. We present CASTLES, a new technique for estimating branch lengths on the species tree from estimated gene trees that uses these expected values, and our study shows that CASTLES improves on the most accurate prior methods with respect to both speed and accuracy. Availability and implementation: CASTLES is available at https://github.com/ytabatabaee/CASTLES.
  2. Abstract We derive precise asymptotic results that are directly usable for confidence intervals and Wald hypothesis tests for likelihood-based generalized linear mixed model analysis. The essence of our approach is to derive the exact leading term behaviour of the Fisher information matrix when both the number of groups and number of observations within each group diverge. This leads to asymptotic normality results with simple studentizable forms. Similar analyses result in tractable leading term forms for the determination of approximate locally D-optimal designs. 
  3. Abstract Properties of gene genealogies such as tree height (H), total branch length (L), total lengths of external (E) and internal (I) branches, mean length of basal branches (B), and the underlying coalescence times (T) can be used to study population-genetic processes and to develop statistical tests of population-genetic models. Uses of tree features in statistical tests often rely on predictions that depend on pairwise relationships among such features. For genealogies under the coalescent, we provide exact expressions for Taylor approximations to expected values and variances of ratios X_n/Y_n, for all 15 pairs among the variables {H_n, L_n, E_n, I_n, B_n, T_k}, considering n leaves and 2 ≤ k ≤ n. For expected values of the ratios, the approximations match closely with empirical simulation-based values. The approximations to the variances are not as accurate, but they generally match simulations in their trends as n increases. Although E_n has expectation 2 and H_n has expectation 2 in the limit as n → ∞, the approximation to the limiting expectation for E_n/H_n is not 1, instead equaling π²/3 − 2 ≈ 1.28987. The new approximations augment fundamental results in coalescent theory on the shapes of genealogical trees.
  4. Kim, Yuseob (Ed.)
    Abstract Natural selection leaves a spatial pattern along the genome, with a haplotype distribution distortion near the selected locus that fades with distance. Evaluating the spatial signal of a population-genetic summary statistic across the genome allows for patterns of natural selection to be distinguished from neutrality. Considering the genomic spatial distribution of multiple summary statistics is expected to aid in uncovering subtle signatures of selection. In recent years, numerous methods have been devised that consider genomic spatial distributions across summary statistics, utilizing both classical machine learning and deep learning architectures. However, better predictions may be attainable by improving the way in which features are extracted from these summary statistics. We apply wavelet transform, multitaper spectral analysis, and S-transform to summary statistic arrays to achieve this goal. Each analysis method converts one-dimensional summary statistic arrays to two-dimensional images of spectral analysis, allowing simultaneous temporal and spectral assessment. We feed these images into convolutional neural networks and consider combining models using ensemble stacking. Our modeling framework achieves high accuracy and power across a diverse set of evolutionary settings, including population size changes and test sets of varying sweep strength, softness, and timing. A scan of central European whole-genome sequences recapitulated well-established sweep candidates and predicted novel cancer-associated genes as sweeps with high support. Given that this modeling framework is also robust to missing genomic segments, we believe that it will represent a welcome addition to the population-genomic toolkit for learning about adaptive processes from genomic data. 
  5. Abstract Aim: Lineage fusion (merging of two or more populations of a species resulting in a single panmictic group) is a special case of secondary contact. It has the potential to counteract diversification and speciation, or to facilitate it through creation of novel genotypes. Understanding the prevalence of lineage fusion in nature requires reliable detection of it, such that efficient summary statistics are needed. Here, we report on simulations that characterized the initial intensity and subsequent decay of signatures of past fusion for 17 summary statistics applicable to DNA sequence haplotype data. Location: Global. Taxa: Diploid out‐crossing species. Methods: We considered a range of scenarios that could reveal the impacts of different combinations of read length versus number of loci (arrangement of DNA sequence data), and whether or not pre‐fusion populations experienced bottlenecks coinciding with their divergence (historical context of fusion). Post‐fusion gene pools were sampled along 10 successive time points representing increasing lag times following merging of sister populations, and summary statistic values were recalculated at each. Results: Many summary statistics were able to detect signatures of complete merging of populations after a sampling lag time of 1.5 N_e generations, but the most informative ones included two neutrality tests and four diversity metrics, with Z_nS (a linkage disequilibrium‐based neutrality test) being particularly powerful. Correlation was relatively low among the two neutrality tests and two of the diversity metrics. There were clear benefits of many short (200‐bp × 200) loci over a handful of long (4‐kb × 10) loci. Also, only the latter genetic dataset type showed impacts of bottlenecks during divergence upon the number of informative summary statistics. Main conclusions: This work contributes to identifying cases of lineage fusion, and advances phylogeography by enabling more nuanced reconstructions of how individual species, or multiple members of an ecological community, responded to past environmental change.
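The expectations quoted in item 3 of the list above (total external branch length with expectation 2, tree height with expectation approaching 2) can be checked with a small Monte Carlo simulation of Kingman's coalescent. The sketch below is illustrative only and is not taken from any of the listed articles. It assumes time is measured in units where each pair of lineages coalesces at rate 1, and the function names are hypothetical. It also returns the simulated mean of the ratio of external length to height, which can be compared informally with the Taylor approximation reported in that abstract.

```python
import random


def simulate_external_and_height(n, rng):
    """Simulate one Kingman genealogy with n leaves; return (E_n, H_n).

    Assumption: time is scaled so each pair of lineages coalesces at rate 1.
    """
    is_leaf = [True] * n  # one flag per active lineage; True = still a leaf
    t = 0.0               # current time, measured back from the present
    external = 0.0        # running total of external branch lengths
    while len(is_leaf) > 1:
        k = len(is_leaf)
        t += rng.expovariate(k * (k - 1) / 2)  # waiting time with k lineages
        i, j = rng.sample(range(k), 2)         # the pair that coalesces
        for idx in (i, j):
            if is_leaf[idx]:
                external += t  # a leaf's branch ends at its first coalescence
        # Remove the two merged lineages (higher index first), add the parent.
        for idx in sorted((i, j), reverse=True):
            is_leaf.pop(idx)
        is_leaf.append(False)
    return external, t  # t is now the tree height


def mean_tree_stats(n, reps, seed=0):
    """Return Monte Carlo means of E_n, H_n, and E_n / H_n."""
    rng = random.Random(seed)
    sum_e = sum_h = sum_ratio = 0.0
    for _ in range(reps):
        e, h = simulate_external_and_height(n, rng)
        sum_e += e
        sum_h += h
        sum_ratio += e / h
    return sum_e / reps, sum_h / reps, sum_ratio / reps
```

For example, `mean_tree_stats(20, 5000)` should give a mean external length near 2 and a mean height near 2(1 − 1/20) = 1.9, up to sampling noise.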