Title: Maximum Likelihood Estimation for Unrooted 3-Leaf Trees: An Analytic Solution for the CFN Model
Abstract Maximum likelihood estimation is among the most widely used methods for inferring phylogenetic trees from sequence data. This paper solves the problem of computing solutions to the maximum likelihood problem for 3-leaf trees under the 2-state symmetric mutation model (CFN model). Our main result is a closed-form solution to the maximum likelihood problem for unrooted 3-leaf trees, given generic data; this result characterizes all of the ways that a maximum likelihood estimate can fail to exist for generic data and provides theoretical validation for predictions made in Parks and Goldman (Syst Biol 63(5):798–811, 2014). Our proof makes use of both classical tools for studying group-based phylogenetic models, such as Hadamard conjugation and reparameterization in terms of Fourier coordinates, and more recent results concerning the semi-algebraic constraints of the CFN model. To put these results into practice, we also give a complete characterization for testing genericity.
Award ID(s):
2308495 2023239
PAR ID:
10528041
Author(s) / Creator(s):
; ;
Publisher / Repository:
Springer
Date Published:
Journal Name:
Bulletin of Mathematical Biology
Volume:
86
Issue:
9
ISSN:
0092-8240
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Satta, Yoko (Ed.)
    Abstract Likelihood-based tests of phylogenetic trees are a foundation of modern systematics. Over the past decade, an enormous wealth and diversity of model-based approaches have been developed for phylogenetic inference of both gene trees and species trees. However, while many techniques exist for conducting formal likelihood-based tests of gene trees, such frameworks are comparatively underdeveloped and underutilized for testing species tree hypotheses. To date, widely used tests of tree topology are designed to assess the fit of classical models of molecular sequence data and individual gene trees and thus are not readily applicable to the problem of species tree inference. To address this issue, we derive several analogous likelihood-based approaches for testing topologies using modern species tree models and heuristic algorithms that use gene tree topologies as input for maximum likelihood estimation under the multispecies coalescent. For the purpose of comparing support for species trees, these tests leverage the statistical procedures of their original gene tree-based counterparts that have an extended history for testing phylogenetic hypotheses at a single locus. We discuss and demonstrate a number of applications, limitations, and important considerations of these tests using simulated and empirical phylogenomic data sets that include both bifurcating topologies and reticulate network models of species relationships. Finally, we introduce the open-source R package SpeciesTopoTestR (SpeciesTopology Tests in R) that includes a suite of functions for conducting formal likelihood-based tests of species topologies given a set of input gene tree topologies. 
  2. Ouangraoua, Aida (Ed.)
    Abstract Scientists worldwide are making massive efforts to understand how the biodiversity that we see on Earth evolved from single-cell organisms at the origin of life; this diversification process is represented by the Tree of Life. Low sampling rates and high heterogeneity in the rate of evolution across sites and lineages produce a phenomenon called “long branch attraction” (LBA), in which long non-sister lineages are estimated to be sisters regardless of their true evolutionary relationship. LBA has been a pervasive problem in phylogenetic inference, affecting methodologies ranging from distance-based to likelihood-based. Here, we present a novel neural network model that outperforms standard phylogenetic methods and other neural network implementations under LBA settings. Furthermore, unlike existing neural network models in phylogenetics, our model naturally accounts for tree isomorphisms via permutation-invariant functions, which ultimately results in lower memory use and allows seamless extension to larger trees.
  3. Abstract Analysis of phylogenetic trees has become an essential tool in epidemiology. Likelihood-based methods fit models to phylogenies to draw inferences about the phylodynamics and history of viral transmission. However, these methods are often computationally expensive, which limits the complexity and realism of phylodynamic models and makes them ill-suited for informing policy decisions in real time during rapidly developing outbreaks. Likelihood-free methods using deep learning are pushing the boundaries of inference beyond these constraints. In this paper, we extend, compare, and contrast a recently developed deep learning method for likelihood-free inference from trees. We trained multiple deep neural networks using phylogenies from simulated outbreaks that spread among 5 locations and found they achieve close to the same levels of accuracy as Bayesian inference under the true simulation model. We compared the robustness to model misspecification of a trained neural network with that of a Bayesian method. We found that both models had comparable performance, converging on similar biases. We also implemented a method of uncertainty quantification called conformalized quantile regression, which we demonstrate has similar patterns of sensitivity to model misspecification as Bayesian highest posterior density (HPD) intervals; its intervals overlap greatly with HPDs but have lower precision (are more conservative). Finally, we trained and tested a neural network against phylogeographic data from a recent study of the SARS-CoV-2 pandemic in Europe and obtained similar estimates of region-specific epidemiological parameters and the location of the common ancestor in Europe. Along with being as accurate and robust as likelihood-based methods, our trained neural networks are on average over 3 orders of magnitude faster after training. Our results support the notion that neural networks can be trained with simulated data to accurately mimic the good and bad statistical properties of the likelihood functions of generative phylogenetic models.
  4. Thomson, Robert (Ed.)
    Abstract Phylogenomic analyses of closely related species allow important glimpses into their evolutionary history. Although recent studies have demonstrated that inter-species hybridization has occurred in several groups, incorporating this process in phylogenetic reconstruction remains challenging. Specifically, the most predominant topology across the genome is often assumed to reflect the speciation tree, but rampant hybridization might overwhelm the genomes, causing that assumption to be violated. The notoriously challenging phylogeny of the 5 extant Panthera species (specifically jaguar [P. onca], lion [P. leo], and leopard [P. pardus]) is an interesting system to address this problem. Here we employed a Panthera-wide whole-genome-sequence data set incorporating 3 jaguar genomes and 2 representatives of lions and leopards to dissect the relationships among these 3 species. Maximum-likelihood trees reconstructed from non-overlapping genomic fragments of 4 different sizes strongly supported the monophyly of all 3 species. The most frequent topology (76–95%) united lion + leopard as sister species (topology 1), followed by lion + jaguar (topology 2: 4–8%) and leopard + jaguar (topology 3: 0–6%). Topology 1 was dominant across the genome, especially in high-recombination regions. Topologies 2 and 3 were enriched in low-recombination segments, likely reflecting the species tree in the face of hybridization. Divergence times between sister species of each topology, corrected for local-recombination-rate effects, indicated that the lion-leopard divergence was significantly younger than the alternatives, likely driven by post-speciation admixture. Introgression analyses detected pervasive hybridization between lions and leopards, regardless of the assumed species tree. This inference was strongly supported by multispecies-coalescence-with-introgression analyses, which rejected topology 1 (lion+leopard) or any model without introgression. Interestingly, topologies 2 (lion+jaguar) and 3 (jaguar+leopard) with extensive lion-leopard introgression were unidentifiable, highlighting the complexity of this phylogenetic problem. Our results suggest that the dominant genome-wide tree topology is not the true species tree but rather a consequence of overwhelming post-speciation admixture between lion and leopard.
  5. Carbone, Alessandra; El-Kebir, Mohammed (Ed.)
    The maximum parsimony phylogenetic reconciliation problem seeks to explain incongruity between a gene phylogeny and a species phylogeny with respect to a set of evolutionary events. While the reconciliation problem is well-studied for species and gene trees subject to events such as duplication, transfer, loss, and deep coalescence, recent work has examined species phylogenies that incorporate hybridization and are thus represented by networks rather than trees. In this paper, we show that the problem of computing a maximum parsimony reconciliation for a gene tree and species network is NP-hard even when only considering deep coalescence. This result suggests that future work on maximum parsimony reconciliation for species networks should explore approximation algorithms and heuristics. 