skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: MAST: Phylogenetic Inference with Mixtures Across Sites and Trees
Abstract Hundreds or thousands of loci are now routinely used in modern phylogenomic studies. Concatenation approaches to tree inference assume that there is a single topology for the entire dataset, but different loci may have different evolutionary histories due to incomplete lineage sorting (ILS), introgression, and/or horizontal gene transfer; even single loci may not be treelike due to recombination. To overcome this shortcoming, we introduce an implementation of a multi-tree mixture model that we call mixtures across sites and trees (MAST). This model extends a prior implementation by Boussau et al. (2009) by allowing users to estimate the weight of each of a set of pre-specified bifurcating trees in a single alignment. The MAST model allows each tree to have its own weight, topology, branch lengths, substitution model, nucleotide or amino acid frequencies, and model of rate heterogeneity across sites. We implemented the MAST model in a maximum-likelihood framework in the popular phylogenetic software, IQ-TREE. Simulations show that we can accurately recover the true model parameters, including branch lengths and tree weights for a given set of tree topologies, under a wide range of biologically realistic scenarios. We also show that we can use standard statistical inference approaches to reject a single-tree model when data are simulated under multiple trees (and vice versa). We applied the MAST model to multiple primate datasets and found that it can recover the signal of ILS in the Great Apes, as well as the asymmetry in minor trees caused by introgression among several macaque species. When applied to a dataset of 4 Platyrrhine species for which standard concatenated maximum likelihood (ML) and gene tree approaches disagree, we observe that MAST gives the highest weight (i.e., the largest proportion of sites) to the tree also supported by gene tree approaches. These results suggest that the MAST model is able to analyze a concatenated alignment using ML while avoiding some of the biases that come with assuming there is only a single tree. We discuss how the MAST model can be extended in the future.  more » « less
Award ID(s):
1936187
PAR ID:
10527214
Author(s) / Creator(s):
; ; ; ; ; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Systematic Biology
Volume:
73
Issue:
2
ISSN:
1063-5157
Format(s):
Medium: X Size: p. 375-391
Size(s):
p. 375-391
Sponsoring Org:
National Science Foundation
More Like this
  1. Murphy, William (Ed.)
    Abstract DNA sequence alignments have provided the majority of data for inferring phylogenetic relationships with both concatenation and coalescent methods. However, DNA sequences are susceptible to extensive homoplasy, especially for deep divergences in the Tree of Life. Retroelement insertions have emerged as a powerful alternative to sequences for deciphering evolutionary relationships because these data are nearly homoplasy-free. In addition, retroelement insertions satisfy the “no intralocus-recombination” assumption of summary coalescent methods because they are singular events and better approximate neutrality relative to DNA loci commonly sampled in phylogenomic studies. Retroelements have traditionally been analyzed with parsimony, distance, and network methods. Here, we analyze retroelement data sets for vertebrate clades (Placentalia, Laurasiatheria, Balaenopteroidea, Palaeognathae) with 2 ILS-aware methods that operate by extracting, weighting, and then assembling unrooted quartets into a species tree. The first approach constructs a species tree from retroelement bipartitions with ASTRAL, and the second method is based on split-decomposition with parsimony. We also develop a Quartet-Asymmetry test to detect hybridization using retroelements. Both ILS-aware methods recovered the same species-tree topology for each data set. The ASTRAL species trees for Laurasiatheria have consecutive short branch lengths in the anomaly zone whereas Palaeognathae is outside of this zone. For the Balaenopteroidea data set, which includes rorquals (Balaenopteridae) and gray whale (Eschrichtiidae), both ILS-aware methods resolved balaeonopterids as paraphyletic. Application of the Quartet-Asymmetry test to this data set detected 19 different quartets of species for which historical introgression may be inferred. Evidence for introgression was not detected in the other data sets. 
    more » « less
  2. Ruane, Sara (Ed.)
    Abstract Some phylogenetic problems remain unresolved even when large amounts of sequence data are analyzed and methods that accommodate processes such as incomplete lineage sorting are employed. In addition to investigating biological sources of phylogenetic incongruence, it is also important to reduce noise in the phylogenomic dataset by using appropriate filtering approach that addresses gene tree estimation errors. We present the results of a case study in manakins, focusing on the very difficult clade comprising the genera Antilophia and Chiroxiphia. Previous studies suggest that Antilophia is nested within Chiroxiphia, though relationships among Antilophia+Chiroxiphia species have been highly unstable. We extracted more than 11,000 loci (ultra-conserved elements and introns) from whole genomes and conducted analyses using concatenation and multispecies coalescent methods. Topologies resulting from analyses using all loci differed depending on the data type and analytical method, with 2 clades (Antilophia+Chiroxiphia and Manacus+Pipra+Machaeopterus) in the manakin tree showing incongruent results. We hypothesized that gene trees that conflicted with a long coalescent branch (e.g., the branch uniting Antilophia+Chiroxiphia) might be enriched for cases of gene tree estimation error, so we conducted analyses that either constrained those gene trees to include monophyly of Antilophia+Chiroxiphia or excluded these loci. While constraining trees reduced some incongruence, excluding the trees led to completely congruent species trees, regardless of the data type or model of sequence evolution used. We found that a suite of gene metrics (most importantly the number of informative sites and likelihood of intralocus recombination) collectively explained the loci that resulted in non-monophyly of Antilophia+Chiroxiphia. We also found evidence for introgression that may have contributed to the discordant topologies we observe in Antilophia+Chiroxiphia and led to deviations from expectations given the multispecies coalescent model. Our study highlights the importance of identifying factors that can obscure phylogenetic signal when dealing with recalcitrant phylogenetic problems, such as gene tree estimation error, incomplete lineage sorting, and reticulation events. [Birds; c-gene; data type; gene estimation error; model fit; multispecies coalescent; phylogenomics; reticulation] 
    more » « less
  3. Abstract Assessing effects of gene tree error in coalescent analyses have widely ignored coalescent branch lengths (CBLs) despite their potential utility in estimating ancestral population demographics and detecting species tree anomaly zones. However, the ability of coalescent methods to obtain accurate estimates remains largely unexplored. Errors in gene trees should lead to underestimates of the true CBL, and for a given set of comparisons, longer CBLs should be more accurate. Here, we furthered our empirical understanding of how error in gene tree quality (i.e., locus informativeness and gene tree resolution) affect CBLs using four datasets comprised of ultraconserved elements (UCE) or exons for clades that exhibit wide ranges of branch lengths. For each dataset, we compared the impact of locus informativeness (assessed using number of parsimony‐informative sites) and gene tree resolution on CBL estimates. Our results, in general, showed that CBLs were drastically shorter when estimates included low informative loci. Gene tree resolution also had an impact on UCE datasets, with polytomous gene trees producing longer branches than randomly resolved gene trees. However, resolution did not appear to affect CBL estimates from the more informative exon datasets. Thus, as expected, gene tree quality affects CBL estimates, though this can generally be minimized by using moderate filtering to select more informative loci and/or by allowing polytomies in gene trees. These approaches, as well as additional contributions to improve CBL estimation, should lead to CBLs that are useful for addressing evolutionary and biological questions. 
    more » « less
  4. Introgression can produce novel genetic variation in organisms that hybridize. Sympatric species pairs in the carnivorous plant genusSarraceniaL. frequently hybridize, and all known hybrids are fertile. Despite being a desirable system for studying the evolutionary consequences of hybridization, the extent to which introgression occurs in the genus is limited to a few species in only two field sites. Previous phylogenomic analysis ofSarraceniaestimated a highly resolved species tree from 199 nuclear genes, but revealed a plastid genome that is highly discordant with the species tree. Such cytonuclear discordance could be caused by chloroplast introgression (i.e. chloroplast capture) or incomplete lineage sorting (ILS). To better understand the extent to which introgression is occurring inSarracenia, the chloroplast capture and ILS hypotheses were formally evaluated. Plastomes were assembledde-novofrom sequencing reads generated from 17 individuals in addition to reads obtained from the previous study. Assemblies of 14 whole plastomes were generated and annotated, and the remaining fragmented assemblies were scaffolded to these whole-plastome assemblies. Coding sequence from 79 homologous genes were aligned and concatenated for maximum-likelihood phylogeny estimation. The plastome tree is extremely discordant with the published species tree. Plastome trees were simulated under the coalescent and tree distance from the species tree was calculated to generate a null distribution of discordance that is expected under ILS alone. A t-test rejected the null hypothesis that ILS could cause the level of discordance seen in the plastome tree, suggesting that chloroplast capture must be invoked to explain the discordance. Due to the extreme level of discordance in the plastome tree, it is likely that chloroplast capture has been common in the evolutionary history ofSarracenia. 
    more » « less
  5. Abstract MotivationPhylogenomics faces a dilemma: on the one hand, most accurate species and gene tree estimation methods are those that co-estimate them; on the other hand, these co-estimation methods do not scale to moderately large numbers of species. The summary-based methods, which first infer gene trees independently and then combine them, are much more scalable but are prone to gene tree estimation error, which is inevitable when inferring trees from limited-length data. Gene tree estimation error is not just random noise and can create biases such as long-branch attraction. ResultsWe introduce a scalable likelihood-based approach to co-estimation under the multi-species coalescent model. The method, called quartet co-estimation (QuCo), takes as input independently inferred distributions over gene trees and computes the most likely species tree topology and internal branch length for each quartet, marginalizing over gene tree topologies and ignoring branch lengths by making several simplifying assumptions. It then updates the gene tree posterior probabilities based on the species tree. The focus on gene tree topologies and the heuristic division to quartets enables fast likelihood calculations. We benchmark our method with extensive simulations for quartet trees in zones known to produce biased species trees and further with larger trees. We also run QuCo on a biological dataset of bees. Our results show better accuracy than the summary-based approach ASTRAL run on estimated gene trees. Availability and implementationQuCo is available on https://github.com/maryamrabiee/quco. Supplementary informationSupplementary data are available at Bioinformatics online. 
    more » « less