Abstract Neutrality tests such as Tajima’s D and Fay and Wu’s H are standard implements in the population genetics toolbox. One of their most common uses is to scan the genome for signals of natural selection. However, it is well understood that D and H are confounded by other evolutionary forces—in particular, population expansion—that may be unrelated to selection. Because they are not model-based, it is not clear how to deconfound these tests in a principled way. In this article, we derive new likelihood-based methods for detecting natural selection, which are robust to fluctuations in effective population size. At the core of our method is a novel probabilistic model of tree imbalance, which generalizes Kingman’s coalescent to allow certain aberrant tree topologies to arise more frequently than is expected under neutrality. We derive a frequency spectrum-based estimator that can be used in place of D, and also extend to the case where genealogies are first estimated. We benchmark our methods on real and simulated data, and provide an open source software implementation.
Approximations to the expectations and variances of ratios of tree properties under the coalescent
Abstract Properties of gene genealogies such as tree height (H), total branch length (L), total lengths of external (E) and internal (I) branches, mean length of basal branches (B), and the underlying coalescence times (T) can be used to study population-genetic processes and to develop statistical tests of population-genetic models. Uses of tree features in statistical tests often rely on predictions that depend on pairwise relationships among such features. For genealogies under the coalescent, we provide exact expressions for Taylor approximations to expected values and variances of ratios Xn/Yn, for all 15 pairs among the variables {Hn, Ln, En, In, Bn, Tk}, considering n leaves and 2 ≤ k ≤ n. For expected values of the ratios, the approximations match closely with empirical simulation-based values. The approximations to the variances are not as accurate, but they generally match simulations in their trends as n increases. Although En has expectation 2 and Hn has expectation 2 in the limit as n → ∞, the approximation to the limiting expectation for En/Hn is not 1, instead equaling π²/3 − 2 ≈ 1.28987. The new approximations augment fundamental results in coalescent theory on the shapes of genealogical trees.
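The gap between a ratio of expectations and the expectation of a ratio can be illustrated with a small simulation. The sketch below is not the article's code. It assumes the standard Kingman coalescent in units where each pair of lineages coalesces at rate 1, and it assumes the usual second-order Taylor (delta-method) form for E[X/Y]. The function names `simulate_E_and_H` and `taylor_ratio_mean` are hypothetical.

```python
import math
import random


def simulate_E_and_H(n, rng):
    """Return (E_n, H_n) for one simulated Kingman genealogy with n leaves.

    Assumption: time is scaled so that each pair of lineages coalesces at
    rate 1, which gives E[T_k] = 2 / (k (k - 1)).
    """
    is_leaf = [True] * n          # True while a lineage is still an external branch
    external, height = 0.0, 0.0
    k = n
    while k > 1:
        t = rng.expovariate(k * (k - 1) / 2)   # waiting time T_k
        external += t * sum(is_leaf)           # every surviving leaf branch grows by t
        height += t
        i, j = rng.sample(range(k), 2)         # choose the pair that coalesces
        for idx in sorted((i, j), reverse=True):
            is_leaf.pop(idx)
        is_leaf.append(False)                  # the merged lineage is internal
        k -= 1
    return external, height


def taylor_ratio_mean(mean_x, mean_y, var_y, cov_xy):
    """Second-order Taylor approximation to E[X / Y] (assumed form)."""
    return mean_x / mean_y - cov_xy / mean_y**2 + mean_x * var_y / mean_y**3


if __name__ == "__main__":
    rng = random.Random(1)
    n, reps = 20, 5000
    draws = [simulate_E_and_H(n, rng) for _ in range(reps)]
    mean_e = sum(e for e, _ in draws) / reps
    mean_h = sum(h for _, h in draws) / reps
    var_h = sum((h - mean_h) ** 2 for _, h in draws) / reps
    cov_eh = sum((e - mean_e) * (h - mean_h) for e, h in draws) / reps
    print("simulated mean of E_n / H_n:", sum(e / h for e, h in draws) / reps)
    print("Taylor approximation:", taylor_ratio_mean(mean_e, mean_h, var_h, cov_eh))
```

Under these assumptions the limiting variance of Hn is 4(π²/3 − 3). Substituting the limiting means of 2 for both En and Hn, together with a covariance of zero (an assumption made here for illustration, not a statement taken from the abstract), gives π²/3 − 2, the value quoted above.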
- Award ID(s): 2116322
- PAR ID: 10372385
- Publisher / Repository: Oxford University Press
- Date Published:
- Journal Name: G3 Genes|Genomes|Genetics
- Volume: 12
- Issue: 10
- ISSN: 2160-1836
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Abstract Motivation Population admixture is an important subject in population genetics. Inferring population demographic history with admixture under the so-called admixture network model from population genetic data is an established problem in genetics. Existing admixture network inference approaches work with single genetic polymorphisms. While these methods are usually very fast, they do not fully utilize the information [e.g. linkage disequilibrium (LD)] contained in population genetic data. Results In this article, we develop a new admixture network inference method called GTmix. Different from existing methods, GTmix works with local gene genealogies that can be inferred from population haplotypes. Local gene genealogies represent the evolutionary history of sampled haplotypes and contain the LD information. GTmix performs coalescent-based maximum likelihood inference of admixture networks with inferred local genealogies based on the well-known multispecies coalescent (MSC) model. GTmix utilizes various techniques to speed up the likelihood computation on the MSC model and the optimal network search. Our simulations show that GTmix can infer more accurate admixture networks with much smaller data than existing methods, even when these existing methods are given much larger data. GTmix is reasonably efficient and can analyze population genetic datasets of current interest. Availability and implementation The program GTmix is available for download at: https://github.com/yufengwudcs/GTmix. Supplementary information Supplementary data are available at Bioinformatics online.
-
An approach to the coalescent, the fractional coalescent (f-coalescent), is introduced. The derivation is based on the discrete-time Cannings population model in which the variance of the number of offspring depends on the parameter α. This additional parameter α affects the variability of the patterns of the waiting times; values of α < 1 lead to an increase of short time intervals, but occasionally allow for very long time intervals. When α = 1, the f-coalescent and Kingman’s n-coalescent are equivalent. The distribution of the time to the most recent common ancestor and the probability that n genes descend from m ancestral genes in a time interval of length T for the f-coalescent are derived. The f-coalescent has been implemented in the population genetic model inference software Migrate. Simulation studies suggest that it is possible to accurately estimate α values from data that were generated with known α values and that the f-coalescent can detect potential environmental heterogeneity within a population. Bayes factor comparisons of simulated data with α < 1 and real data (H1N1 influenza and malaria parasites) showed an improved model fit of the f-coalescent over the n-coalescent. The development of the f-coalescent and its inclusion into the inference program Migrate facilitates testing for deviations from the n-coalescent.
-
Schiffels, Stephan (Ed.) Movement of individuals between populations or demes is often restricted, especially between geographically isolated populations. The structured coalescent provides an elegant theoretical framework for describing how movement between populations shapes the genealogical history of sampled individuals and thereby structures genetic variation within and between populations. However, in the presence of recombination an individual may inherit different regions of their genome from different parents, resulting in a mosaic of genealogical histories across the genome, which can be represented by an Ancestral Recombination Graph (ARG). In this case, different genomic regions may have different ancestral histories and so different histories of movement between populations. Recombination therefore poses an additional challenge to phylogeographic methods that aim to reconstruct the movement of individuals from genealogies, although also a potential benefit in that different loci may contain additional information about movement. Here, we introduce the Structured Coalescent with Ancestral Recombination (SCAR) model, which builds on recent approximations to the structured coalescent by incorporating recombination into the ancestry of sampled individuals. The SCAR model allows us to infer how the migration history of sampled individuals varies across the genome from ARGs, and improves estimation of key population genetic parameters such as population sizes, recombination rates and migration rates. Using the SCAR model, we explore the potential and limitations of phylogeographic inference using full ARGs. We then apply the SCAR to lineages of the recombining fungus Aspergillus flavus sampled across the United States to explore patterns of recombination and migration across the genome.
-
Abstract Assessments of the effects of gene tree error in coalescent analyses have widely ignored coalescent branch lengths (CBLs) despite their potential utility in estimating ancestral population demographics and detecting species tree anomaly zones. However, the ability of coalescent methods to obtain accurate estimates remains largely unexplored. Errors in gene trees should lead to underestimates of the true CBL, and for a given set of comparisons, longer CBLs should be more accurate. Here, we furthered our empirical understanding of how error in gene tree quality (i.e., locus informativeness and gene tree resolution) affects CBLs using four datasets comprised of ultraconserved elements (UCE) or exons for clades that exhibit wide ranges of branch lengths. For each dataset, we compared the impact of locus informativeness (assessed using number of parsimony‐informative sites) and gene tree resolution on CBL estimates. Our results, in general, showed that CBLs were drastically shorter when estimates included low informative loci. Gene tree resolution also had an impact on UCE datasets, with polytomous gene trees producing longer branches than randomly resolved gene trees. However, resolution did not appear to affect CBL estimates from the more informative exon datasets. Thus, as expected, gene tree quality affects CBL estimates, though this can generally be minimized by using moderate filtering to select more informative loci and/or by allowing polytomies in gene trees. These approaches, as well as additional contributions to improve CBL estimation, should lead to CBLs that are useful for addressing evolutionary and biological questions.
