Abstract We consider the evolution of phylogenetic gene trees along phylogenetic species networks, according to the network multispecies coalescent process, and introduce a new network coalescent model with correlated inheritance of gene flow. This model generalizes two traditional versions of the network coalescent: with independent or common inheritance. At each reticulation, multiple lineages of a given locus are inherited from parental populations chosen at random, either independently across lineages or with positive correlation according to a Dirichlet process. This process may account for locus-specific probabilities of inheritance, for example. We implemented the simulation of gene trees under these network coalescent models in the Julia package PhyloCoalSimulations, which depends on PhyloNetworks and its powerful network manipulation tools. Input species phylogenies can be read in extended Newick format, either in numbers of generations or in coalescent units. Simulated gene trees can be written in Newick format, and in a way that preserves information about their embedding within the species network. This embedding can be used for downstream purposes, such as to simulate species-specific processes like rate variation across species, or for other scenarios as illustrated in this note. This package should be useful for simulation studies and simulation-based inference methods. The software is available open source with documentation and a tutorial at https://github.com/cecileane/PhyloCoalSimulations.jl.
more »
« less
Ultrafast learning of four-node hybridization cycles in phylogenetic networks using algebraic invariants
Abstract MotivationThe abundance of gene flow in the Tree of Life challenges the notion that evolution can be represented with a fully bifurcating process which cannot capture important biological realities like hybridization, introgression, or horizontal gene transfer. Coalescent-based network methods are increasingly popular, yet not scalable for big data, because they need to perform a heuristic search in the space of networks as well as numerical optimization that can be NP-hard. Here, we introduce a novel method to reconstruct phylogenetic networks based on algebraic invariants. While there is a long tradition of using algebraic invariants in phylogenetics, our work is the first to define phylogenetic invariants on concordance factors (frequencies of four-taxon splits in the input gene trees) to identify level-1 phylogenetic networks under the multispecies coalescent model. ResultsOur novel hybrid detection methodology is optimization-free as it only requires the evaluation of polynomial equations, and as such, it bypasses the traversal of network space, yielding a computational speed at least 10 times faster than the fastest-to-date network methods. We illustrate our method’s performance on simulated and real data from the genus Canis. Availability and implementationWe present an open-source publicly available Julia package PhyloDiamond.jl available at https://github.com/solislemuslab/PhyloDiamond.jl with broad applicability within the evolutionary community.
more »
« less
- Award ID(s):
- 2144367
- PAR ID:
- 10491905
- Publisher / Repository:
- Oxford University Press
- Date Published:
- Journal Name:
- Bioinformatics Advances
- Volume:
- 4
- Issue:
- 1
- ISSN:
- 2635-0041
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Russell, Schwartz (Ed.)Abstract Motivation With growing genome-wide molecular datasets from next-generation sequencing, phylogenetic networks can be estimated using a variety of approaches. These phylogenetic networks include events like hybridization, gene flow or horizontal gene transfer explicitly. However, the most accurate network inference methods are computationally heavy. Methods that scale to larger datasets do not calculate a full likelihood, such that traditional likelihood-based tools for model selection are not applicable to decide how many past hybridization events best fit the data. We propose here a goodness-of-fit test to quantify the fit between data observed from genome-wide multi-locus data, and patterns expected under the multi-species coalescent model on a candidate phylogenetic network. Results We identified weaknesses in the previously proposed TICR test, and proposed corrections. The performance of our new test was validated by simulations on real-world phylogenetic networks. Our test provides one of the first rigorous tools for model selection, to select the adequate network complexity for the data at hand. The test can also work for identifying poorly inferred areas on a network. Availability and implementation Software for the goodness-of-fit test is available as a Julia package at https://github.com/cecileane/QuartetNetworkGoodnessFit.jl. Supplementary information Supplementary data are available at Bioinformatics online.more » « less
-
Abstract MotivationReticulate evolutionary histories, such as those arising in the presence of hybridization, are best modeled as phylogenetic networks. Recently developed methods allow for statistical inference of phylogenetic networks while also accounting for other processes, such as incomplete lineage sorting. However, these methods can only handle a small number of loci from a handful of genomes. ResultsIn this article, we introduce a novel two-step method for scalable inference of phylogenetic networks from the sequence alignments of multiple, unlinked loci. The method infers networks on subproblems and then merges them into a network on the full set of taxa. To reduce the number of trinets to infer, we formulate a Hitting Set version of the problem of finding a small number of subsets, and implement a simple heuristic to solve it. We studied their performance, in terms of both running time and accuracy, on simulated as well as on biological datasets. The two-step method accurately infers phylogenetic networks at a scale that is infeasible with existing methods. The results are a significant and promising step towards accurate, large-scale phylogenetic network inference. Availability and implementationWe implemented the algorithms in the publicly available software package PhyloNet (https://bioinfocs.rice.edu/PhyloNet). Supplementary informationSupplementary data are available at Bioinformatics online.more » « less
-
Abstract Phylogenetic networks provide a powerful framework for modeling and analyzing reticulate evolutionary histories. While polyploidy has been shown to be prevalent not only in plants but also in other groups of eukaryotic species, most work done thus far on phylogenetic network inference assumes diploid hybridization. These inference methods have been applied, with varying degrees of success, to data sets with polyploid species, even though polyploidy violates the mathematical assumptions underlying these methods. Statistical methods were developed recently for handling specific types of polyploids and so were parsimony methods that could handle polyploidy more generally yet while excluding processes such as incomplete lineage sorting. In this article, we introduce a new method for inferring most parsimonious phylogenetic networks on data that include polyploid species. Taking gene tree topologies as input, the method seeks a phylogenetic network that minimizes deep coalescences while accounting for polyploidy. We demonstrate the performance of the method on both simulated and biological data. The inference method as well as a method for evaluating evolutionary hypotheses in the form of phylogenetic networks are implemented and publicly available in the PhyloNet software package. [Incomplete lineage sorting; minimizing deep coalescences; multilabeled trees; multispecies network coalescent; phylogenetic networks; polyploidy.]more » « less
-
Abstract MotivationGene network reconstruction from gene expression profiles is a compute- and data-intensive problem. Numerous methods based on diverse approaches including mutual information, random forests, Bayesian networks, correlation measures, as well as their transforms and filters such as data processing inequality, have been proposed. However, an effective gene network reconstruction method that performs well in all three aspects of computational efficiency, data size scalability, and output quality remains elusive. Simple techniques such as Pearson correlation are fast to compute but ignore indirect interactions, while more robust methods such as Bayesian networks are prohibitively time consuming to apply to tens of thousands of genes. ResultsWe developed maximum capacity path (MCP) score, a novel maximum-capacity-path-based metric to quantify the relative strengths of direct and indirect gene–gene interactions. We further present MCPNet, an efficient, parallelized gene network reconstruction software based on MCP score, to reverse engineer networks in unsupervised and ensemble manners. Using synthetic and real Saccharomyces cervisiae datasets as well as real Arabidopsis thaliana datasets, we demonstrate that MCPNet produces better quality networks as measured by AUPRC, is significantly faster than all other gene network reconstruction software, and also scales well to tens of thousands of genes and hundreds of CPU cores. Thus, MCPNet represents a new gene network reconstruction tool that simultaneously achieves quality, performance, and scalability requirements. Availability and implementationSource code freely available for download at https://doi.org/10.5281/zenodo.6499747 and https://github.com/AluruLab/MCPNet, implemented in C++ and supported on Linux.more » « less
An official website of the United States government
