- NSF-PAR ID:
- Date Published:
- Journal Name:
- Genome Research
- Page Range / eLocation ID:
- 2107 to 2119
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
Abstract Motivation Population admixture is an important subject in population genetics. Inferring population demographic history with admixture under the so-called admixture network model from population genetic data is an established problem in genetics. Existing admixture network inference approaches work with single genetic polymorphisms. While these methods are usually very fast, they do not fully utilize the information [e.g. linkage disequilibrium (LD)] contained in population genetic data. Results In this article, we develop a new admixture network inference method called GTmix. Different from existing methods, GTmix works with local gene genealogies that can be inferred from population haplotypes. Local gene genealogies represent the evolutionary history of sampled haplotypes and contain the LD information. GTmix performs coalescent-based maximum likelihood inference of admixture networks with inferred local genealogies based on the well-known multispecies coalescent (MSC) model. GTmix utilizes various techniques to speed up the likelihood computation on the MSC model and the optimal network search. Our simulations show that GTmix can infer more accurate admixture networks with much smaller data than existing methods, even when these existing methods are given much larger data. GTmix is reasonably efficient and can analyze population genetic datasets of current interests. Availability and implementation The program GTmix is available for download at: https://github.com/yufengwudcs/GTmix. Supplementary information Supplementary data are available at Bioinformatics online.more » « less
Quantifying abundance distributions is critical for understanding both how communities assemble, and how community structure varies through time and space, yet estimating abundances requires considerable investment in fieldwork. Community‐level population genetic data potentially offer a powerful way to indirectly infer richness, abundance and the history of accumulation of biodiversity within a community. Here we introduce a joint model linking neutral community assembly and comparative phylogeography to generate both community‐level richness, abundance and genetic variation under a neutral model, capturing both equilibrium and non‐equilibrium dynamics.
Our model combines a forward‐time individual‐based community assembly process with a rescaled backward‐time neutral coalescent model of multi‐taxa population genetics. We explore general dynamics of genetic and abundance‐based summary statistics and use approximate Bayesian computation (
ABC) to estimate parameters underlying the model of island community assembly. Finally, we demonstrate two applications of the model using community‐scale mt DNAsequence data and densely sampled abundances of an arachnid community on La Réunion. First, we use genetic data alone to estimate a summary of the abundance distribution, ground‐truthing this against the observed abundances. Then, we jointly use the observed genetic data and abundances to estimate the proximity of the community to equilibrium. Results
Simulation experiments of our
ABCprocedure demonstrate that coupling abundance with genetic data leads to improved accuracy and precision of model parameter estimates compared with using abundance‐only data. We further demonstrate reasonable precision and accuracy in estimating a metric underlying the shape of the abundance distribution, temporal progress towards local equilibrium and several key parameters of the community assembly process. For the insular arachnid assemblage, we find the joint distribution of genetic diversity and abundance approaches equilibrium expectations, and that the Shannon entropy of the observed abundances can be estimated using genetic data alone. Main conclusions
The framework that we present unifies neutral community assembly and comparative phylogeography to characterize the community‐level distribution of both abundance and genetic variation through time, providing a resource that should greatly enhance understanding of both the processes structuring ecological communities and the associated aggregate demographic histories.
A potential shortcoming of concatenation methods for species tree estimation is their failure to account for incomplete lineage sorting. Coalescent methods address this problem but make various assumptions that, if violated, can result in worse performance than concatenation. Given the challenges of analyzing DNA sequences with both concatenation and coalescent methods, retroelement insertions (RIs) have emerged as powerful phylogenomic markers for species tree estimation. Here, we show that two recently proposed quartet-based methods, SDPquartets and ASTRAL_BP, are statistically consistent estimators of the unrooted species tree topology under the coalescent when RIs follow a neutral infinite-sites model of mutation and the expected number of new RIs per generation is constant across the species tree. The accuracy of these (and other) methods for inferring species trees from RIs has yet to be assessed on simulated data sets, where the true species tree topology is known. Therefore, we evaluated eight methods given RIs simulated from four model species trees, all of which have short branches and at least three of which are in the anomaly zone. In our simulation study, ASTRAL_BP and SDPquartets always recovered the correct species tree topology when given a sufficiently large number of RIs, as predicted. A distance-based method (ASTRID_BP) and Dollo parsimony also performed well in recovering the species tree topology. In contrast, unordered, polymorphism, and Camin–Sokal parsimony (as well as an approach based on MDC) typically fail to recover the correct species tree topology in anomaly zone situations with more than four ingroup taxa. Of the methods studied, only ASTRAL_BP automatically estimates internal branch lengths (in coalescent units) and support values (i.e., local posterior probabilities). We examined the accuracy of branch length estimation, finding that estimated lengths were accurate for short branches but upwardly biased otherwise. This led us to derive the maximum likelihood (branch length) estimate for when RIs are given as input instead of binary gene trees; this corrected formula produced accurate estimates of branch lengths in our simulation study provided that a sufficiently large number of RIs were given as input. Lastly, we evaluated the impact of data quantity on species tree estimation by repeating the above experiments with input sizes varying from 100 to 100,000 parsimony-informative RIs. We found that, when given just 1000 parsimony-informative RIs as input, ASTRAL_BP successfully reconstructed major clades (i.e., clades separated by branches $>0.3$ coalescent units) with high support and identified rapid radiations (i.e., shorter connected branches), although not their precise branching order. The local posterior probability was effective for controlling false positive branches in these scenarios. [Coalescence; incomplete lineage sorting; Laurasiatheria; Palaeognathae; parsimony; polymorphism parsimony; retroelement insertions; species trees; transposon.]
At RECOMB-CG 2018, we presented NJMerge and showed that it could be used within a divide-and-conquer framework to scale computationally intensive methods for species tree estimation to larger datasets. However, NJMerge has two significant limitations: it can fail to return a tree and, when used within the proposed divide-and-conquer framework, has O(n5) running time for datasets with n species.
Here we present a new method called ‘TreeMerge’ that improves on NJMerge in two ways: it is guaranteed to return a tree and it has dramatically faster running time within the same divide-and-conquer framework—only O(n2) time. We use a simulation study to evaluate TreeMerge in the context of multi-locus species tree estimation with two leading methods, ASTRAL-III and RAxML. We find that the divide-and-conquer framework using TreeMerge has a minor impact on species tree accuracy, dramatically reduces running time, and enables both ASTRAL-III and RAxML to complete on datasets (that they would otherwise fail on), when given 64 GB of memory and 48 h maximum running time. Thus, TreeMerge is a step toward a larger vision of enabling researchers with limited computational resources to perform large-scale species tree estimation, which we call Phylogenomics for All.
Availability and implementation
TreeMerge is publicly available on Github (http://github.com/ekmolloy/treemerge).
Supplementary data are available at Bioinformatics online.
An emerging discovery in phylogenomics is that interspecific gene flow has played a major role in the evolution of many different organ- isms. To what extent is the Tree of Life not truly a tree reflecting strict “vertical” divergence, but rather a more general graph structure known as a phylogenetic network which also captures “horizontal” gene flow? The answer to this fundamental question not only depends upon densely sam- pled and divergent genomic sequence data, but also computational meth- ods which are capable of accurately and efficiently inferring phylogenetic networks from large-scale genomic sequence datasets. Recent methodolog- ical advances have attempted to address this gap. However, in the 2016 performance study of Hejase and Liu, state-of-the-art methods fell well short of the scalability requirements of existing phylogenomic studies. The methodological gap remains: how can phylogenetic networks be accurately and efficiently inferred using genomic sequence data involv- ing many dozens or hundreds of taxa? In this study, we address this gap by proposing a new phylogenetic divide-and-conquer method which we call FastNet. We conduct a performance study involving a range of evolutionary scenarios, and we demonstrate that FastNet outperforms state-of-the-art methods in terms of computational efficiency and topo- logical accuracy.more » « less