skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Probabilistic Species Tree Distances: Implementing the Multispecies Coalescent to Compare Species Trees Within the Same Model-Based Framework Used to Estimate Them
Abstract Despite the ubiquitous use of statistical models for phylogenomic and population genomic inferences, this model-based rigor is rarely applied to post hoc comparison of trees. In a recent study, Garba et al. derived new methods for measuring the distance between two gene trees computed as the difference in their site pattern probability distributions. Unlike traditional metrics that compare trees solely in terms of geometry, these measures consider gene trees and associated parameters as probabilistic models that can be compared using standard information theoretic approaches. Consequently, probabilistic measures of phylogenetic tree distance can be far more informative than simply comparisons of topology and/or branch lengths alone. However, in their current form, these distance measures are not suitable for the comparison of species tree models in the presence of gene tree heterogeneity. Here, we demonstrate an approach for how the theory of Garba et al. (2018), which is based on gene tree distances, can be extended naturally to the comparison of species tree models. Multispecies coalescent (MSC) models parameterize the discrete probability distribution of gene trees conditioned upon a species tree with a particular topology and set of divergence times (in coalescent units), and thus provide a framework for measuring distances between species tree models in terms of their corresponding gene tree topology probabilities. We describe the computation of probabilistic species tree distances in the context of standard MSC models, which assume complete genetic isolation postspeciation, as well as recent theoretical extensions to the MSC in the form of network-based MSC models that relax this assumption and permit hybridization among taxa. We demonstrate these metrics using simulations and empirical species tree estimates and discuss both the benefits and limitations of these approaches. We make our species tree distance approach available as an R package called pSTDistanceR, for open use by the community.  more » « less
Award ID(s):
1655571
PAR ID:
10144113
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Systematic Biology
Volume:
69
Issue:
1
ISSN:
1063-5157
Page Range / eLocation ID:
194 to 207
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Abstract Stochastic models of character trait evolution have become a cornerstone of evolutionary biology in an array of contexts. While probabilistic models have been used extensively for statistical inference, they have largely been ignored for the purpose of measuring distances between phylogeny-aware models. Recent contributions to the problem of phylogenetic distance computation have highlighted the importance of explicitly considering evolutionary model parameters and their impacts on molecular sequence data when quantifying dissimilarity between trees. By comparing two phylogenies in terms of their induced probability distributions that are functions of many model parameters, these distances can be more informative than traditional approaches that rely strictly on differences in topology or branch lengths alone. Currently, however, these approaches are designed for comparing models of nucleotide substitution and gene tree distributions, and thus, are unable to address other classes of traits and associated models that may be of interest to evolutionary biologists. Here we expand the principles of probabilistic phylogenetic distances to compute tree distances under models of continuous trait evolution along a phylogeny. By explicitly considering both the degree of relatedness among species and the evolutionary processes that collectively give rise to character traits, these distances provide a foundation for comparing models and their predictions, and for quantifying the impacts of assuming one phylogenetic background over another while studying the evolution of a particular trait. We demonstrate the properties of these approaches using theory, simulations, and several empirical datasets that highlight potential uses of probabilistic distances in many scenarios. We also introduce an open-source R package named PRDATR for easy application by the scientific community for computing phylogenetic distances under models of character trait evolution. 
    more » « less
  2. Tang, H. (Ed.)
    Rooted species trees are used in several downstream applications of phylogenetics. Most species tree estimation methods produce unrooted trees and additional methods are then used to root these unrooted trees. Recently, Quintet Rooting (QR) (Tabatabaee et al., ISMB and Bioinformatics 2022), a polynomial-time method for rooting an unrooted species tree given unrooted gene trees under the multispecies coalescent, was introduced. QR, which is based on a proof of identifiability of rooted 5-taxon trees in the presence of incomplete lineage sorting, was shown to have good accuracy, improving over other methods for rooting species trees when incomplete lineage sorting was the only cause of gene tree discordance, except when gene tree estimation error was very high. However, the statistical consistency of QR was left as an open question. Here, we present QR-STAR, a polynomial-time variant of QR that has an additional step for determining the rooted shape of each quintet tree. We prove that QR-STAR is statistically consistent under the multispecies coalescent model, and our simulation study shows that QR-STAR matches or improves on the accuracy of QR. QR-STAR is available in open source form at https://github.com/ytabatabaee/Quintet-Rooting. 
    more » « less
  3. Abstract Species tree inference from multilocus data has emerged as a powerful paradigm in the postgenomic era, both in terms of the accuracy of the species tree it produces as well as in terms of elucidating the processes that shaped the evolutionary history. Bayesian methods for species tree inference are desirable in this area as they have been shown not only to yield accurate estimates, but also to naturally provide measures of confidence in those estimates. However, the heavy computational requirements of Bayesian inference have limited the applicability of such methods to very small data sets. In this article, we show that the computational efficiency of Bayesian inference under the multispecies coalescent can be improved in practice by restricting the space of the gene trees explored during the random walk, without sacrificing accuracy as measured by various metrics. The idea is to first infer constraints on the trees of the individual loci in the form of unresolved gene trees, and then to restrict the sampler to consider only resolutions of the constrained trees. We demonstrate the improvements gained by such an approach on both simulated and biological data. 
    more » « less
  4. Satta, Yoko (Ed.)
    Abstract Likelihood-based tests of phylogenetic trees are a foundation of modern systematics. Over the past decade, an enormous wealth and diversity of model-based approaches have been developed for phylogenetic inference of both gene trees and species trees. However, while many techniques exist for conducting formal likelihood-based tests of gene trees, such frameworks are comparatively underdeveloped and underutilized for testing species tree hypotheses. To date, widely used tests of tree topology are designed to assess the fit of classical models of molecular sequence data and individual gene trees and thus are not readily applicable to the problem of species tree inference. To address this issue, we derive several analogous likelihood-based approaches for testing topologies using modern species tree models and heuristic algorithms that use gene tree topologies as input for maximum likelihood estimation under the multispecies coalescent. For the purpose of comparing support for species trees, these tests leverage the statistical procedures of their original gene tree-based counterparts that have an extended history for testing phylogenetic hypotheses at a single locus. We discuss and demonstrate a number of applications, limitations, and important considerations of these tests using simulated and empirical phylogenomic data sets that include both bifurcating topologies and reticulate network models of species relationships. Finally, we introduce the open-source R package SpeciesTopoTestR (SpeciesTopology Tests in R) that includes a suite of functions for conducting formal likelihood-based tests of species topologies given a set of input gene tree topologies. 
    more » « less
  5. As genomic-scale datasets motivate research on species tree inference, simulators of the multispecies coalescent (MSC) process have become essential for the testing and evaluation of new inference methods. However, the simulators themselves must be tested to ensure that they give valid samples. This work develops methods for checking whether a collection of gene trees is in accord with the MSC model on a given species tree. When applied to well-known simulators, we find that several give flawed samples. The tests presented are capable of validating both topological and metric properties of gene tree samples, and are implemented in a freely available R package MSCsimtester so that developers and users may easily apply them. 
    more » « less