Title: Maximum Parsimony Inference of Phylogenetic Networks in the Presence of Polyploid Complexes
Abstract: Phylogenetic networks provide a powerful framework for modeling and analyzing reticulate evolutionary histories. While polyploidy has been shown to be prevalent not only in plants but also in other groups of eukaryotic species, most work done thus far on phylogenetic network inference assumes diploid hybridization. These inference methods have been applied, with varying degrees of success, to data sets with polyploid species, even though polyploidy violates the mathematical assumptions underlying these methods. Statistical methods were recently developed for handling specific types of polyploids, as were parsimony methods that handle polyploidy more generally but exclude processes such as incomplete lineage sorting. In this article, we introduce a new method for inferring most parsimonious phylogenetic networks on data that include polyploid species. Taking gene tree topologies as input, the method seeks a phylogenetic network that minimizes deep coalescences while accounting for polyploidy. We demonstrate the performance of the method on both simulated and biological data. The inference method, as well as a method for evaluating evolutionary hypotheses in the form of phylogenetic networks, is implemented and publicly available in the PhyloNet software package. [Incomplete lineage sorting; minimizing deep coalescences; multilabeled trees; multispecies network coalescent; phylogenetic networks; polyploidy.]
Award ID(s):
1800723
PAR ID:
10305260
Publisher / Repository:
Oxford University Press
Journal Name:
Systematic Biology
ISSN:
1063-5157
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
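The abstract's optimization criterion, minimizing deep coalescences, can be illustrated in a much simpler setting than the one the article addresses. The sketch below is not the PhyloNet implementation and does not handle networks or polyploidy: it assumes a species tree (not a network), one sampled allele per species, and rooted trees written as nested Python tuples of leaf names. The function names (`leaves`, `lineages_leaving`, `extra_lineages`, `mdc_score`) are hypothetical and chosen for illustration only. For each non-root species tree branch it counts the gene lineages that leave the branch going back in time and charges one "extra lineage" for each lineage beyond the first.

```python
def leaves(tree):
    """Return the set of leaf labels of a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree])
    return frozenset().union(*(leaves(child) for child in tree))


def lineages_leaving(gene_tree, cluster):
    """Count maximal gene tree clades whose leaves all lie in `cluster`.

    Each such clade has coalesced to a single lineage by the top of the
    species tree branch that defines `cluster`.
    """
    if leaves(gene_tree) <= cluster:
        return 1
    if isinstance(gene_tree, str):
        return 0  # a leaf outside the cluster contributes nothing
    return sum(lineages_leaving(child, cluster) for child in gene_tree)


def extra_lineages(species_tree, gene_tree):
    """Total extra lineages of one gene tree within a species tree."""
    total = 0

    def visit(node, is_root):
        nonlocal total
        if not is_root:
            k = lineages_leaving(gene_tree, leaves(node))
            total += max(k - 1, 0)
        if not isinstance(node, str):
            for child in node:
                visit(child, False)

    visit(species_tree, True)
    return total


def mdc_score(species_tree, gene_trees):
    """Sum of extra lineages over all gene trees (lower is more parsimonious)."""
    return sum(extra_lineages(species_tree, g) for g in gene_trees)


if __name__ == "__main__":
    species = (("A", "B"), "C")
    genes = [(("A", "B"), "C"), (("A", "C"), "B")]
    print(mdc_score(species, genes))
```

Under these assumptions, a parsimony search would compare `mdc_score` across candidate species trees and keep the one with the lowest score; extending the idea to networks and multilabeled trees is the subject of the article itself and is not attempted here.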
More Like this
  1. Phylogenetic networks extend phylogenetic trees to allow for modeling reticulate evolutionary processes such as hybridization. They take the shape of a rooted, directed, acyclic graph, and when parameterized with evolutionary parameters, such as divergence times and population sizes, they form a generative process of molecular sequence evolution. Early work on computational methods for phylogenetic network inference focused exclusively on reticulations and sought networks with the fewest number of reticulations to fit the data. As processes such as incomplete lineage sorting (ILS) could be at play concurrently with hybridization, work in the last decade has shifted to computational approaches for phylogenetic network inference in the presence of ILS. In such a short period, significant advances have been made on developing and implementing such computational approaches. In particular, parsimony, likelihood, and Bayesian methods have been devised for estimating phylogenetic networks and associated parameters using estimated gene trees as data. Use of those inference methods has been augmented with statistical tests for specific hypotheses of hybridization, like the D-statistic. Most recently, Bayesian approaches for inferring phylogenetic networks directly from sequence data were developed and implemented. In this chapter, we survey such advances and discuss model assumptions as well as methods’ strengths and limitations. We also discuss parallel efforts in the population genetics community aimed at inferring similar structures. Finally, we highlight major directions for future research in this area. 
  2. The development of statistical methods to infer species phylogenies with reticulations (species networks) has led to many discoveries of gene flow between distinct species. These methods typically assume only incomplete lineage sorting and introgression. Given that phylogenetic networks can be arbitrarily complex, these methods might compensate for model misspecification by increasing the number of dimensions beyond the true value. Herein, we explore the effect of potential model misspecification, including the neglect of gene tree estimation error (GTEE) and the assumption of a single substitution rate for all genomic loci, on the accuracy of phylogenetic network inference using both simulated and biological data. In particular, we assess the accuracy of estimated phylogenetic networks as well as test statistics for determining whether a network is the correct evolutionary history, as opposed to the simpler model that is a tree. We found that while GTEE negatively impacts the performance of test statistics to determine the “treeness” of the evolutionary history of a data set, running those tests on triplets of taxa and correcting for multiple testing significantly ameliorates the problem. We also found that accounting for substitution rate heterogeneity improves the reliability of full Bayesian inference methods of phylogenetic networks, whereas summary statistic methods are robust to GTEE and rate heterogeneity, though they currently require manual inspection to determine the network complexity. 
  3. Abstract Motivation: Despite the widespread occurrence of polyploids across the Tree of Life, especially in the plant kingdom, very few computational methods have been developed to handle the specific complexities introduced by polyploids in phylogeny estimation. Furthermore, methods that are designed to account for polyploidy often disregard incomplete lineage sorting (ILS), a major source of heterogeneous gene histories, or are computationally very demanding. Therefore, there is a great need for efficient and robust methods to accurately reconstruct polyploid phylogenies. Results: We introduce Polyphest (POLYploid PHylogeny ESTimation), a new method for efficiently and accurately inferring species phylogenies in the presence of both polyploidy and ILS. Polyphest bypasses the need for extensive network space searches by first generating a multilabeled tree based on gene trees, which is then converted into a (uniquely labeled) species phylogeny. We compare the performance of Polyphest to that of two polyploid phylogeny estimation methods, one of which does not account for ILS, namely PADRE, and another that accounts for ILS, namely MPAllopp. Polyphest is more accurate than PADRE and achieves comparable accuracy to MPAllopp, while being significantly faster. We also demonstrate the application of Polyphest to empirical data from the hexaploid bread wheat and confirm the allopolyploid origin of bread wheat along with the closest relatives for each of its subgenomes. Availability and implementation: Polyphest is available at https://github.com/NakhlehLab/Polyphest. 
  4. Abstract Prokaryotic genomes are often considered to be mosaics of genes that do not necessarily share the same evolutionary history due to widespread horizontal gene transfers (HGTs). Consequently, representing evolutionary relationships of prokaryotes as bifurcating trees has long been controversial. However, studies reporting conflicts among gene trees derived from phylogenomic data sets have shown that these conflicts can be the result of artifacts or evolutionary processes other than HGT, such as incomplete lineage sorting, low phylogenetic signal, and systematic errors due to substitution model misspecification. Here, we present the results of an extensive exploration of phylogenetic conflicts in the cyanobacterial order Nostocales, for which previous studies have inferred strongly supported conflicting relationships when using different concatenated phylogenomic data sets. We found that most of these conflicts are concentrated in deep clusters of short internodes of the Nostocales phylogeny, where the great majority of individual genes have low resolving power. We then inferred phylogenetic networks to detect HGT events while also accounting for incomplete lineage sorting. Our results indicate that most conflicts among gene trees are likely due to incomplete lineage sorting linked to an ancient rapid radiation, rather than to HGTs. Moreover, the short internodes of this radiation fit the expectations of the anomaly zone, i.e., a region of the tree parameter space where a species tree is discordant with its most likely gene tree. We demonstrated that concatenation of different sets of loci can recover up to 17 distinct and well-supported relationships within the putative anomaly zone of Nostocales, corresponding to the observed conflicts among well-supported trees based on concatenated data sets from previous studies. 
Our findings highlight the important role of rapid radiations as a potential cause of strongly conflicting phylogenetic relationships when using phylogenomic data sets of bacteria. We propose that polytomies may be the most appropriate phylogenetic representation of these rapid radiations that are part of anomaly zones, especially when all possible genomic markers have been considered to infer these phylogenies. [Anomaly zone; bacteria; horizontal gene transfer; incomplete lineage sorting; Nostocales; phylogenomic conflict; rapid radiation; Rhizonema.] 
  5. Abstract Motivation: Reticulate evolutionary histories, such as those arising in the presence of hybridization, are best modeled as phylogenetic networks. Recently developed methods allow for statistical inference of phylogenetic networks while also accounting for other processes, such as incomplete lineage sorting. However, these methods can only handle a small number of loci from a handful of genomes. Results: In this article, we introduce a novel two-step method for scalable inference of phylogenetic networks from the sequence alignments of multiple, unlinked loci. The method infers networks on subproblems and then merges them into a network on the full set of taxa. To reduce the number of trinets to infer, we formulate a Hitting Set version of the problem of finding a small number of subsets, and implement a simple heuristic to solve it. We studied their performance, in terms of both running time and accuracy, on simulated as well as on biological datasets. The two-step method accurately infers phylogenetic networks at a scale that is infeasible with existing methods. The results are a significant and promising step towards accurate, large-scale phylogenetic network inference. Availability and implementation: We implemented the algorithms in the publicly available software package PhyloNet (https://bioinfocs.rice.edu/PhyloNet). Supplementary information: Supplementary data are available at Bioinformatics online. 
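Item 5 above mentions a Hitting Set formulation solved with a simple heuristic but does not spell out either. As a rough illustration of the general idea only, the sketch below shows the textbook greedy heuristic for the generic Hitting Set problem: given a collection of sets, repeatedly pick the element that occurs in the most sets not yet hit. It is an assumption that something of this flavor applies; the function name `greedy_hitting_set` and the input encoding are hypothetical and are not taken from that article or from PhyloNet.

```python
from collections import Counter


def greedy_hitting_set(sets):
    """Greedily choose elements so that every non-empty input set is hit.

    Ties are broken by picking the smallest element, so the elements must
    be mutually comparable (for example, all ints or all strings).
    """
    # Empty sets can never be hit, so they are dropped up front.
    remaining = [set(s) for s in sets if s]
    chosen = []
    while remaining:
        # Count how many still-unhit sets each element appears in.
        counts = Counter(x for s in remaining for x in s)
        # max() returns the first maximum, so sorting gives a stable tie-break.
        best = max(sorted(counts), key=counts.__getitem__)
        chosen.append(best)
        remaining = [s for s in remaining if best not in s]
    return chosen


if __name__ == "__main__":
    print(greedy_hitting_set([{1, 2}, {2, 3}, {3, 4}]))
```

The greedy choice does not guarantee a minimum hitting set, which is NP-hard to find in general, but it is cheap and usually keeps the selection small.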