skip to main content


Title: Variational inference using approximate likelihood under the coalescent with recombination
Coalescent methods are proven and powerful tools for population genetics, phylogenetics, epidemiology, and other fields. A promising avenue for the analysis of large genomic alignments, which are increasingly common, is coalescent hidden Markov model (coalHMM) methods, but these methods have lacked general usability and flexibility. We introduce a novel method for automatically learning a coalHMM and inferring the posterior distributions of evolutionary parameters using black-box variational inference, with the transition rates between local genealogies derived empirically by simulation. This derivation enables our method to work directly with three or four taxa and through a divide-and-conquer approach with more taxa. Using a simulated data set resembling a human–chimp–gorilla scenario, we show that our method has comparable or better accuracy to previous coalHMM methods. Both species divergence times and population sizes were accurately inferred. The method also infers local genealogies, and we report on their accuracy. Furthermore, we discuss a potential direction for scaling the method to larger data sets through a divide-and-conquer approach. This accuracy means our method is useful now, and by deriving transition rates by simulation, it is flexible enough to enable future implementations of various population models.  more » « less
Award ID(s):
1800723 2030604
NSF-PAR ID:
10330466
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Genome Research
Volume:
31
Issue:
11
ISSN:
1088-9051
Page Range / eLocation ID:
2107 to 2119
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Motivation Population admixture is an important subject in population genetics. Inferring population demographic history with admixture under the so-called admixture network model from population genetic data is an established problem in genetics. Existing admixture network inference approaches work with single genetic polymorphisms. While these methods are usually very fast, they do not fully utilize the information [e.g. linkage disequilibrium (LD)] contained in population genetic data. Results In this article, we develop a new admixture network inference method called GTmix. Different from existing methods, GTmix works with local gene genealogies that can be inferred from population haplotypes. Local gene genealogies represent the evolutionary history of sampled haplotypes and contain the LD information. GTmix performs coalescent-based maximum likelihood inference of admixture networks with inferred local genealogies based on the well-known multispecies coalescent (MSC) model. GTmix utilizes various techniques to speed up the likelihood computation on the MSC model and the optimal network search. Our simulations show that GTmix can infer more accurate admixture networks with much smaller data than existing methods, even when these existing methods are given much larger data. GTmix is reasonably efficient and can analyze population genetic datasets of current interests. Availability and implementation The program GTmix is available for download at: https://github.com/yufengwudcs/GTmix. Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  2. Abstract Aim

    Quantifying abundance distributions is critical for understanding both how communities assemble, and how community structure varies through time and space, yet estimating abundances requires considerable investment in fieldwork. Community‐level population genetic data potentially offer a powerful way to indirectly infer richness, abundance and the history of accumulation of biodiversity within a community. Here we introduce a joint model linking neutral community assembly and comparative phylogeography to generate both community‐level richness, abundance and genetic variation under a neutral model, capturing both equilibrium and non‐equilibrium dynamics.

    Location

    Global.

    Methods

    Our model combines a forward‐time individual‐based community assembly process with a rescaled backward‐time neutral coalescent model of multi‐taxa population genetics. We explore general dynamics of genetic and abundance‐based summary statistics and use approximate Bayesian computation (ABC) to estimate parameters underlying the model of island community assembly. Finally, we demonstrate two applications of the model using community‐scale mtDNAsequence data and densely sampled abundances of an arachnid community on La Réunion. First, we use genetic data alone to estimate a summary of the abundance distribution, ground‐truthing this against the observed abundances. Then, we jointly use the observed genetic data and abundances to estimate the proximity of the community to equilibrium.

    Results

    Simulation experiments of ourABCprocedure demonstrate that coupling abundance with genetic data leads to improved accuracy and precision of model parameter estimates compared with using abundance‐only data. We further demonstrate reasonable precision and accuracy in estimating a metric underlying the shape of the abundance distribution, temporal progress towards local equilibrium and several key parameters of the community assembly process. For the insular arachnid assemblage, we find the joint distribution of genetic diversity and abundance approaches equilibrium expectations, and that the Shannon entropy of the observed abundances can be estimated using genetic data alone.

    Main conclusions

    The framework that we present unifies neutral community assembly and comparative phylogeography to characterize the community‐level distribution of both abundance and genetic variation through time, providing a resource that should greatly enhance understanding of both the processes structuring ecological communities and the associated aggregate demographic histories.

     
    more » « less
  3. Abstract

    A potential shortcoming of concatenation methods for species tree estimation is their failure to account for incomplete lineage sorting. Coalescent methods address this problem but make various assumptions that, if violated, can result in worse performance than concatenation. Given the challenges of analyzing DNA sequences with both concatenation and coalescent methods, retroelement insertions (RIs) have emerged as powerful phylogenomic markers for species tree estimation. Here, we show that two recently proposed quartet-based methods, SDPquartets and ASTRAL_BP, are statistically consistent estimators of the unrooted species tree topology under the coalescent when RIs follow a neutral infinite-sites model of mutation and the expected number of new RIs per generation is constant across the species tree. The accuracy of these (and other) methods for inferring species trees from RIs has yet to be assessed on simulated data sets, where the true species tree topology is known. Therefore, we evaluated eight methods given RIs simulated from four model species trees, all of which have short branches and at least three of which are in the anomaly zone. In our simulation study, ASTRAL_BP and SDPquartets always recovered the correct species tree topology when given a sufficiently large number of RIs, as predicted. A distance-based method (ASTRID_BP) and Dollo parsimony also performed well in recovering the species tree topology. In contrast, unordered, polymorphism, and Camin–Sokal parsimony (as well as an approach based on MDC) typically fail to recover the correct species tree topology in anomaly zone situations with more than four ingroup taxa. Of the methods studied, only ASTRAL_BP automatically estimates internal branch lengths (in coalescent units) and support values (i.e., local posterior probabilities). We examined the accuracy of branch length estimation, finding that estimated lengths were accurate for short branches but upwardly biased otherwise. This led us to derive the maximum likelihood (branch length) estimate for when RIs are given as input instead of binary gene trees; this corrected formula produced accurate estimates of branch lengths in our simulation study provided that a sufficiently large number of RIs were given as input. Lastly, we evaluated the impact of data quantity on species tree estimation by repeating the above experiments with input sizes varying from 100 to 100,000 parsimony-informative RIs. We found that, when given just 1000 parsimony-informative RIs as input, ASTRAL_BP successfully reconstructed major clades (i.e., clades separated by branches $>0.3$ coalescent units) with high support and identified rapid radiations (i.e., shorter connected branches), although not their precise branching order. The local posterior probability was effective for controlling false positive branches in these scenarios. [Coalescence; incomplete lineage sorting; Laurasiatheria; Palaeognathae; parsimony; polymorphism parsimony; retroelement insertions; species trees; transposon.]

     
    more » « less
  4. Abstract Motivation

    At RECOMB-CG 2018, we presented NJMerge and showed that it could be used within a divide-and-conquer framework to scale computationally intensive methods for species tree estimation to larger datasets. However, NJMerge has two significant limitations: it can fail to return a tree and, when used within the proposed divide-and-conquer framework, has O(n5) running time for datasets with n species.

    Results

    Here we present a new method called ‘TreeMerge’ that improves on NJMerge in two ways: it is guaranteed to return a tree and it has dramatically faster running time within the same divide-and-conquer framework—only O(n2) time. We use a simulation study to evaluate TreeMerge in the context of multi-locus species tree estimation with two leading methods, ASTRAL-III and RAxML. We find that the divide-and-conquer framework using TreeMerge has a minor impact on species tree accuracy, dramatically reduces running time, and enables both ASTRAL-III and RAxML to complete on datasets (that they would otherwise fail on), when given 64 GB of memory and 48 h maximum running time. Thus, TreeMerge is a step toward a larger vision of enabling researchers with limited computational resources to perform large-scale species tree estimation, which we call Phylogenomics for All.

    Availability and implementation

    TreeMerge is publicly available on Github (http://github.com/ekmolloy/treemerge).

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  5. An emerging discovery in phylogenomics is that interspecific gene flow has played a major role in the evolution of many different organ- isms. To what extent is the Tree of Life not truly a tree reflecting strict “vertical” divergence, but rather a more general graph structure known as a phylogenetic network which also captures “horizontal” gene flow? The answer to this fundamental question not only depends upon densely sam- pled and divergent genomic sequence data, but also computational meth- ods which are capable of accurately and efficiently inferring phylogenetic networks from large-scale genomic sequence datasets. Recent methodolog- ical advances have attempted to address this gap. However, in the 2016 performance study of Hejase and Liu, state-of-the-art methods fell well short of the scalability requirements of existing phylogenomic studies. The methodological gap remains: how can phylogenetic networks be accurately and efficiently inferred using genomic sequence data involv- ing many dozens or hundreds of taxa? In this study, we address this gap by proposing a new phylogenetic divide-and-conquer method which we call FastNet. We conduct a performance study involving a range of evolutionary scenarios, and we demonstrate that FastNet outperforms state-of-the-art methods in terms of computational efficiency and topo- logical accuracy. 
    more » « less