Title: Joint Maximum Likelihood of Phylogeny and Ancestral States Is Not Consistent
Abstract

Maximum likelihood estimation in phylogenetics requires a means of handling unknown ancestral states. Classical maximum likelihood averages over these unknown intermediate states, leading to provably consistent estimation of the topology and continuous model parameters. Recently, a computationally efficient approach has been proposed to jointly maximize over these unknown states and phylogenetic parameters. Although this method of joint maximum likelihood estimation can obtain estimates more quickly, its properties as an estimator are not yet clear. In this article, we show that this method of jointly estimating phylogenetic parameters along with ancestral states is not consistent in general. We find a sizeable region of parameter space that generates data on a four-taxon tree for which this joint method estimates the internal branch length to be exactly zero, even in the limit of infinite-length sequences. More generally, we show that this joint method only estimates branch lengths correctly on a set of measure zero. We show empirically that branch length estimates are systematically biased downward, even for short branches.
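
The contrast between the two estimators can be written out explicitly. This is only a sketch in notation assumed here, not taken from the article: $y_i$ is the observed data at site $i$ of $n$ sites, $h_i$ the unobserved ancestral states at that site, $\tau$ the topology, and $t$ the branch lengths.

$$
L_{\text{marginal}}(\tau, t) = \prod_{i=1}^{n} \sum_{h_i} P(y_i, h_i \mid \tau, t),
\qquad
L_{\text{joint}}(\tau, t) = \prod_{i=1}^{n} \max_{h_i} P(y_i, h_i \mid \tau, t)
$$

Classical maximum likelihood maximizes $L_{\text{marginal}}$ over $(\tau, t)$, whereas the joint approach maximizes $L_{\text{joint}}$, treating every $h_i$ as an additional free parameter. Because the number of such parameters grows with the sequence length $n$, the standard consistency argument for maximum likelihood need not carry over to the joint estimator.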

 
NSF-PAR ID:
10115793
Publisher / Repository:
Oxford University Press
Journal Name:
Molecular Biology and Evolution
ISSN:
0737-4038
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like This
  1. Abstract

    A potential shortcoming of concatenation methods for species tree estimation is their failure to account for incomplete lineage sorting. Coalescent methods address this problem but make various assumptions that, if violated, can result in worse performance than concatenation. Given the challenges of analyzing DNA sequences with both concatenation and coalescent methods, retroelement insertions (RIs) have emerged as powerful phylogenomic markers for species tree estimation. Here, we show that two recently proposed quartet-based methods, SDPquartets and ASTRAL_BP, are statistically consistent estimators of the unrooted species tree topology under the coalescent when RIs follow a neutral infinite-sites model of mutation and the expected number of new RIs per generation is constant across the species tree. The accuracy of these (and other) methods for inferring species trees from RIs has yet to be assessed on simulated data sets, where the true species tree topology is known. Therefore, we evaluated eight methods given RIs simulated from four model species trees, all of which have short branches and at least three of which are in the anomaly zone. In our simulation study, ASTRAL_BP and SDPquartets always recovered the correct species tree topology when given a sufficiently large number of RIs, as predicted. A distance-based method (ASTRID_BP) and Dollo parsimony also performed well in recovering the species tree topology. In contrast, unordered, polymorphism, and Camin–Sokal parsimony (as well as an approach based on MDC) typically fail to recover the correct species tree topology in anomaly zone situations with more than four ingroup taxa. Of the methods studied, only ASTRAL_BP automatically estimates internal branch lengths (in coalescent units) and support values (i.e., local posterior probabilities). We examined the accuracy of branch length estimation, finding that estimated lengths were accurate for short branches but upwardly biased otherwise. 
    This led us to derive the maximum likelihood (branch length) estimate for when RIs are given as input instead of binary gene trees; this corrected formula produced accurate estimates of branch lengths in our simulation study provided that a sufficiently large number of RIs were given as input. Lastly, we evaluated the impact of data quantity on species tree estimation by repeating the above experiments with input sizes varying from 100 to 100,000 parsimony-informative RIs. We found that, when given just 1000 parsimony-informative RIs as input, ASTRAL_BP successfully reconstructed major clades (i.e., clades separated by branches $>0.3$ coalescent units) with high support and identified rapid radiations (i.e., shorter connected branches), although not their precise branching order. The local posterior probability was effective for controlling false positive branches in these scenarios. [Coalescence; incomplete lineage sorting; Laurasiatheria; Palaeognathae; parsimony; polymorphism parsimony; retroelement insertions; species trees; transposon.]

     
  2. Abstract

    The standardized precipitation index (SPI) measures meteorological drought relative to historical climatology by normalizing accumulated precipitation. Longer record lengths improve parameter estimates, but these longer records may include signals of anthropogenic climate change and multidecadal natural climate fluctuations. Historically, climate nonstationarity has either been ignored or incorporated into the SPI using a quasi-stationary reference period, such as the WMO 30-yr period. This study introduces and evaluates a novel nonstationary SPI model based on Bayesian splines, designed to both improve parameter estimates for stationary climates and to explicitly incorporate nonstationarity. Using synthetically generated precipitation, this study directly compares the proposed Bayesian SPI model with existing SPI approaches based on maximum likelihood estimation for stationary and nonstationary climates. The proposed model not only reproduced the performance of existing SPI models but improved upon them in several key areas: reducing parameter uncertainty and noise, simultaneously modeling the likelihood of zero and positive precipitation, and capturing nonlinear trends and seasonal shifts across all parameters. Further, the fully Bayesian approach ensures all parameters have uncertainty estimates, including zero precipitation likelihood. The study notes that the zero precipitation parameter is too sensitive and could be improved in future iterations. The study concludes with an application of the proposed Bayesian nonstationary SPI model for nine gauges across a range of hydroclimate zones in the United States. Results of this experiment show that the model is stable and reproduces nonstationary patterns identified in prior studies, while also indicating new findings, particularly for the shape and zero precipitation parameters.

    Significance Statement

    We typically measure how bad a drought is by comparing it with the historical record. With long-term changes in climate or other factors, however, a typical drought today may not have been typical in the recent past. The purpose of this study is to build a model that measures drought relative to a changing climate. Our results confirm that the model is accurate and captures previously noted climate change patterns—a drier western United States, a wetter eastern United States, earlier summer weather, and more extreme wet seasons. This is significant because this model can improve drought measurement and identify recent changes in drought.

     
  3. Abstract

    Motivation

    The multispecies coalescent model is now widely accepted as an effective model for incorporating variation in the evolutionary histories of individual genes into methods for phylogenetic inference from genome-scale data. However, because model-based analysis under the coalescent can be computationally expensive for large datasets, a variety of inferential frameworks and corresponding algorithms have been proposed for estimation of species-level phylogenies and associated parameters, including speciation times and effective population sizes.

    Results

    We consider the problem of estimating the timing of speciation events along a phylogeny in a coalescent framework. We propose a maximum a posteriori estimator based on composite likelihood (MAPCL) for inferring these speciation times under a model of DNA sequence evolution for which exact site-pattern probabilities can be computed under the assumption of a constant θ throughout the species tree. We demonstrate that the MAPCL estimates are statistically consistent and asymptotically normally distributed, and we show how this result can be used to estimate their asymptotic variance. We also provide a more computationally efficient estimator of the asymptotic variance based on the non-parametric bootstrap. We evaluate the performance of our method using simulation and by application to an empirical dataset for gibbons.

    Availability and implementation

    The method has been implemented in the PAUP* program, freely available at https://paup.phylosolutions.com for Macintosh, Windows and Linux operating systems.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  4. Abstract

    Cosmological parameters encoding our understanding of the expansion history of the universe can be constrained by the accurate estimation of time delays arising in gravitationally lensed systems. We propose TD-CARMA, a Bayesian method to estimate cosmological time delays by modeling observed and irregularly sampled light curves as realizations of a continuous auto-regressive moving average (CARMA) process. Our model accounts for heteroskedastic measurement errors and microlensing, an additional source of independent extrinsic long-term variability in the source brightness. The semiseparable structure of the CARMA covariance matrix allows for fast and scalable likelihood computation using Gaussian process modeling. We obtain a sample from the joint posterior distribution of the model parameters using a nested sampling approach. This allows for “painless” Bayesian computation, dealing with the expected multimodality of the posterior distribution in a straightforward manner and not requiring the specification of starting values or an initial guess for the time delay, unlike existing methods. In addition, the proposed sampling procedure automatically evaluates the Bayesian evidence, allowing us to perform principled Bayesian model selection. TD-CARMA is parsimonious, and typically includes no more than a dozen unknown parameters. We apply TD-CARMA to six doubly lensed quasars HS2209+1914, SDSS J1001+5027, SDSS J1206+4332, SDSS J1515+1511, SDSS J1455+1447, and SDSS J1349+1227, estimating their time delays as −21.96 ± 1.448, 120.93 ± 1.015, 111.51 ± 1.452, 210.80 ± 2.18, 45.36 ± 1.93, and 432.05 ± 1.950, respectively. These estimates are consistent with those derived in the relevant literature, but are typically two to four times more precise.

     
  5. Abstract

    Contamination of a genetic sample with DNA from one or more nontarget species is a continuing concern of molecular phylogenetic studies, both Sanger sequencing studies and next-generation sequencing studies. We developed an automated pipeline for identifying and excluding likely cross-contaminated loci based on the detection of bimodal distributions of patristic distances across gene trees. When contamination occurs between samples within a data set, a comparison between a contaminated sample and its contaminant taxon will yield bimodal distributions with one peak close to zero patristic distance. This new method does not rely on a priori knowledge of taxon relatedness nor does it determine the cause(s) of the contamination. Exclusion of putatively contaminated loci from a data set generated for the insect family Cicadidae showed that these sequences were affecting some topological patterns and branch supports, although the effects were sometimes subtle, with some contamination-influenced relationships exhibiting strong bootstrap support. Long tip branches and outlier values for one anchored phylogenomic pipeline statistic (AvgNHomologs) were correlated with the presence of contamination. While the anchored hybrid enrichment markers used here, which target hemipteroid taxa, proved effective in resolving deep and shallow level Cicadidae relationships in aggregate, individual markers contained inadequate phylogenetic signal, in part probably due to short length. The cleaned data set, consisting of 429 loci, from 90 genera representing 44 of 56 current Cicadidae tribes, supported three of the four sampled Cicadidae subfamilies in concatenated-matrix maximum likelihood (ML) and multispecies coalescent-based species tree analyses, with the fourth subfamily weakly supported in the ML trees. No well-supported patterns from previous family-level Sanger sequencing studies of Cicadidae phylogeny were contradicted. One taxon (Aragualna plenalinea) did not fall with its current subfamily in the genetic tree, and this genus and its tribe Aragualnini are reclassified to Tibicininae following morphological re-examination. Only subtle differences were observed in trees after the removal of loci for which divergent base frequencies were detected. Greater success may be achieved by increased taxon sampling and developing a probe set targeting a more recent common ancestor and longer loci. Searches for contamination are an essential step in phylogenomic analyses of all kinds and our pipeline is an effective solution. [Auchenorrhyncha; base-composition bias; Cicadidae; Cicadoidea; Hemiptera; phylogenetic conflict.]

     