- PAR ID:
- 10183089
- Date Published:
- Journal Name:
- Bioinformatics
- Volume:
- 36
- Issue:
- Supplement_1
- ISSN:
- 1367-4803
- Page Range / eLocation ID:
- i326 to i334
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Coalescent methods are proven and powerful tools for population genetics, phylogenetics, epidemiology, and other fields. A promising avenue for the analysis of large genomic alignments, which are increasingly common, is coalescent hidden Markov model (coalHMM) methods, but these methods have lacked general usability and flexibility. We introduce a novel method for automatically learning a coalHMM and inferring the posterior distributions of evolutionary parameters using black-box variational inference, with the transition rates between local genealogies derived empirically by simulation. This derivation enables our method to work directly with three or four taxa and through a divide-and-conquer approach with more taxa. Using a simulated data set resembling a human–chimp–gorilla scenario, we show that our method has accuracy comparable to or better than that of previous coalHMM methods. Both species divergence times and population sizes were accurately inferred. The method also infers local genealogies, and we report on their accuracy. Furthermore, we discuss a potential direction for scaling the method to larger data sets through a divide-and-conquer approach. This accuracy means our method is useful now, and by deriving transition rates by simulation, it is flexible enough to enable future implementations of various population models.
-
Russell, Schwartz (Ed.) Abstract. Motivation: With growing genome-wide molecular datasets from next-generation sequencing, phylogenetic networks can be estimated using a variety of approaches. These phylogenetic networks include events like hybridization, gene flow or horizontal gene transfer explicitly. However, the most accurate network inference methods are computationally heavy. Methods that scale to larger datasets do not calculate a full likelihood, such that traditional likelihood-based tools for model selection are not applicable to decide how many past hybridization events best fit the data. We propose here a goodness-of-fit test to quantify the fit between data observed from genome-wide multi-locus data, and patterns expected under the multi-species coalescent model on a candidate phylogenetic network. Results: We identified weaknesses in the previously proposed TICR test, and proposed corrections. The performance of our new test was validated by simulations on real-world phylogenetic networks. Our test provides one of the first rigorous tools for model selection, to select the adequate network complexity for the data at hand. The test can also work for identifying poorly inferred areas on a network. Availability and implementation: Software for the goodness-of-fit test is available as a Julia package at https://github.com/cecileane/QuartetNetworkGoodnessFit.jl. Supplementary information: Supplementary data are available at Bioinformatics online.
-
Abstract. Motivation: Cells in an organism share a common evolutionary history, called a cell lineage tree. A cell lineage tree can be inferred from single cell genotypes at genomic variation sites. Cell lineage tree inference from noisy single cell data is a challenging computational problem. Most existing methods for cell lineage tree inference assume uniform uncertainty in genotypes. A key missing aspect is that real single cell data usually has non-uniform uncertainty in individual genotypes. Moreover, existing methods are often sampling based and can be very slow for large data. Results: In this article, we propose a new method called ScisTree, which infers a cell lineage tree and calls genotypes from noisy single cell genotype data. Unlike most existing approaches, ScisTree works with genotype probabilities of individual genotypes (which can be computed by existing single cell genotype callers). ScisTree assumes the infinite sites model. Given uncertain genotypes with individualized probabilities, ScisTree implements a fast heuristic for inferring a cell lineage tree and calling the genotypes that allow the so-called perfect phylogeny and maximize the likelihood of the genotypes. Through simulation, we show that ScisTree performs well on the accuracy of inferred trees, and is much more efficient than existing methods. The efficiency of ScisTree enables new applications including imputation of the so-called doublets. Availability and implementation: The program ScisTree is available for download at: https://github.com/yufengwudcs/ScisTree. Supplementary information: Supplementary data are available at Bioinformatics online.
-
Summary: Motivated by the statistical inference problem in population genetics, we present a new sequential importance sampling with resampling strategy. The idea of resampling is key to the recent surge of popularity of sequential Monte Carlo methods in the statistics and engineering communities, but existing resampling techniques do not work well for coalescent-based inference problems in population genetics. We develop a new method called ‘stopping-time resampling’, which allows us to compare partially simulated samples at different stages to terminate unpromising partial samples and to multiply promising samples early on. To illustrate the idea, we first apply the new method to approximate the solution of a Dirichlet problem and the likelihood function of a non-Markovian process. Then we focus on its application in population genetics. All our examples show that the new resampling method can significantly improve the computational efficiency of existing sequential importance sampling methods.
-
Abstract. In the past few decades, population genetics and phylogeographic studies have improved our knowledge of connectivity and population demography in marine environments. Studies of deep‐sea hydrothermal vent populations have identified barriers to gene flow, hybrid zones, and demographic events, such as historical population expansions and contractions. These deep‐sea studies, however, used few loci, which limited the amount of information they provided for coalescent analysis and thus our ability to confidently test complex population dynamics scenarios.
In this study, we investigated population structure, demographic history, and gene flow directionality among four Western Pacific hydrothermal vent populations of the vent limpet Lepetodrilus aff. schrolli. These vent sites are located in the Manus and Lau back‐arc basins, currently of great interest for deep‐sea mineral extraction. A total of 42 loci were sequenced from each individual using high‐throughput amplicon sequencing. Amplicon sequences were analyzed using both genetic variant clustering methods and evolutionary coalescent approaches. Like most previously investigated vent species in the South Pacific, L. aff. schrolli showed no genetic structure within basins but significant differentiation between basins. We inferred significant directional gene flow from Manus Basin to Lau Basin, with low to no gene flow in the opposite direction. This study is one of the very few marine population studies using >10 loci for coalescent analysis and serves as a guide for future marine population studies.