skip to main content

Title: LoRaD: Marginal likelihood estimation with haste (but no waste)

The Lowest Radial Distance (LoRaD) method is a modification of the recently introduced Partition-Weighted Kernel method for estimating the marginal likelihood of a model, a quantity important for Bayesian model selection. For analyses involving a fixed tree topology, LoRaD improves upon the Steppingstone or Thermodynamic Integration (Path Sampling) approaches now in common use in phylogenetics because it requires sampling only from the posterior distribution, avoiding the need to sample from a series of ad hoc power posterior distributions, and yet is more accurate than other fast methods such as the Generalized Harmonic Mean (GHM) method. We show that the method performs well in comparison to the Generalized Steppingstone method on an empirical fixed-topology example from molecular phylogenetics involving 180 parameters. The LoRaD method can also be used to obtain the marginal likelihood in the variable-topology case if at least one tree topology occurs with sufficient frequency in the posterior sample to allow accurate estimation of the marginal likelihood conditional on that topology. [Bayesian; marginal likelihood; phylogenetics.]

more » « less
Author(s) / Creator(s):
; ; ; ; ; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Systematic Biology
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Bayesian Markov chain Monte Carlo explores tree space slowly, in part because it frequently returns to the same tree topology. An alternative strategy would be to explore tree space systematically, and never return to the same topology. In this article, we present an efficient parallelized method to map out the high likelihood set of phylogenetic tree topologies via systematic search, which we show to be a good approximation of the high posterior set of tree topologies on the data sets analyzed. Here, “likelihood” of a topology refers to the tree likelihood for the corresponding tree with optimized branch lengths. We call this method “phylogenetic topographer” (PT). The PT strategy is very simple: starting in a number of local topology maxima (obtained by hill-climbing from random starting points), explore out using local topology rearrangements, only continuing through topologies that are better than some likelihood threshold below the best observed topology. We show that the normalized topology likelihoods are a useful proxy for the Bayesian posterior probability of those topologies. By using a nonblocking hash table keyed on unique representations of tree topologies, we avoid visiting topologies more than once across all concurrent threads exploring tree space. We demonstrate that PT can be used directly to approximate a Bayesian consensus tree topology. When combined with an accurate means of evaluating per-topology marginal likelihoods, PT gives an alternative procedure for obtaining Bayesian posterior distributions on phylogenetic tree topologies.

    more » « less
  2. Abstract

    The marginal likelihood of a model is a key quantity for assessing the evidence provided by the data in support of a model. The marginal likelihood is the normalizing constant for the posterior density, obtained by integrating the product of the likelihood and the prior with respect to model parameters. Thus, the computational burden of computing the marginal likelihood scales with the dimension of the parameter space. In phylogenetics, where we work with tree topologies that are high-dimensional models, standard approaches to computing marginal likelihoods are very slow. Here, we study methods to quickly compute the marginal likelihood of a single fixed tree topology. We benchmark the speed and accuracy of 19 different methods to compute the marginal likelihood of phylogenetic topologies on a suite of real data sets under the JC69 model. These methods include several new ones that we develop explicitly to solve this problem, as well as existing algorithms that we apply to phylogenetic models for the first time. Altogether, our results show that the accuracy of these methods varies widely, and that accuracy does not necessarily correlate with computational burden. Our newly developed methods are orders of magnitude faster than standard approaches, and in some cases, their accuracy rivals the best established estimators.

    more » « less
  3. Elshall, Ahmed ; Ye, Ming (Ed.)

    Bayesian model evidence (BME) is a measure of the average fit of a model to observation data given all the parameter values that the model can assume. By accounting for the trade-off between goodness-of-fit and model complexity, BME is used for model selection and model averaging purposes. For strict Bayesian computation, the theoretically unbiased Monte Carlo based numerical estimators are preferred over semi-analytical solutions. This study examines five BME numerical estimators and asks how accurate estimation of the BME is important for penalizing model complexity. The limiting cases for numerical BME estimators are the prior sampling arithmetic mean estimator (AM) and the posterior sampling harmonic mean (HM) estimator, which are straightforward to implement, yet they result in underestimation and overestimation, respectively. We also consider the path sampling methods of thermodynamic integration (TI) and steppingstone sampling (SS) that sample multiple intermediate distributions that link the prior and the posterior. Although TI and SS are theoretically unbiased estimators, they could have a bias in practice arising from numerical implementation. For example, sampling errors of some intermediate distributions can introduce bias. We propose a variant of SS, namely the multiple one-steppingstone sampling (MOSS) that is less sensitive to sampling errors. We evaluate these five estimators using a groundwater transport model selection problem. SS and MOSS give the least biased BME estimation at an efficient computational cost. If the estimated BME has a bias that covariates with the true BME, this would not be a problem because we are interested in BME ratios and not their absolute values. On the contrary, the results show that BME estimation bias can be a function of model complexity. Thus, biased BME estimation results in inaccurate penalization of more complex models, which changes the model ranking. This was less observed with SS and MOSS as with the three other methods.

    more » « less
  4. Abstract

    Cosmological parameters encoding our understanding of the expansion history of the universe can be constrained by the accurate estimation of time delays arising in gravitationally lensed systems. We propose TD-CARMA, a Bayesian method to estimate cosmological time delays by modeling observed and irregularly sampled light curves as realizations of a continuous auto-regressive moving average (CARMA) process. Our model accounts for heteroskedastic measurement errors and microlensing, an additional source of independent extrinsic long-term variability in the source brightness. The semiseparable structure of the CARMA covariance matrix allows for fast and scalable likelihood computation using Gaussian process modeling. We obtain a sample from the joint posterior distribution of the model parameters using a nested sampling approach. This allows for “painless” Bayesian computation, dealing with the expected multimodality of the posterior distribution in a straightforward manner and not requiring the specification of starting values or an initial guess for the time delay, unlike existing methods. In addition, the proposed sampling procedure automatically evaluates the Bayesian evidence, allowing us to perform principled Bayesian model selection. TD-CARMA is parsimonious, and typically includes no more than a dozen unknown parameters. We apply TD-CARMA to six doubly lensed quasars HS2209+1914, SDSS J1001+5027, SDSS J1206+4332, SDSS J1515+1511, SDSS J1455+1447, and SDSS J1349+1227, estimating their time delays as −21.96 ± 1.448, 120.93 ± 1.015, 111.51 ± 1.452, 210.80 ± 2.18, 45.36 ± 1.93, and 432.05 ± 1.950, respectively. These estimates are consistent with those derived in the relevant literature, but are typically two to four times more precise.

    more » « less
  5. Abstract

    A potential shortcoming of concatenation methods for species tree estimation is their failure to account for incomplete lineage sorting. Coalescent methods address this problem but make various assumptions that, if violated, can result in worse performance than concatenation. Given the challenges of analyzing DNA sequences with both concatenation and coalescent methods, retroelement insertions (RIs) have emerged as powerful phylogenomic markers for species tree estimation. Here, we show that two recently proposed quartet-based methods, SDPquartets and ASTRAL_BP, are statistically consistent estimators of the unrooted species tree topology under the coalescent when RIs follow a neutral infinite-sites model of mutation and the expected number of new RIs per generation is constant across the species tree. The accuracy of these (and other) methods for inferring species trees from RIs has yet to be assessed on simulated data sets, where the true species tree topology is known. Therefore, we evaluated eight methods given RIs simulated from four model species trees, all of which have short branches and at least three of which are in the anomaly zone. In our simulation study, ASTRAL_BP and SDPquartets always recovered the correct species tree topology when given a sufficiently large number of RIs, as predicted. A distance-based method (ASTRID_BP) and Dollo parsimony also performed well in recovering the species tree topology. In contrast, unordered, polymorphism, and Camin–Sokal parsimony (as well as an approach based on MDC) typically fail to recover the correct species tree topology in anomaly zone situations with more than four ingroup taxa. Of the methods studied, only ASTRAL_BP automatically estimates internal branch lengths (in coalescent units) and support values (i.e., local posterior probabilities). We examined the accuracy of branch length estimation, finding that estimated lengths were accurate for short branches but upwardly biased otherwise. This led us to derive the maximum likelihood (branch length) estimate for when RIs are given as input instead of binary gene trees; this corrected formula produced accurate estimates of branch lengths in our simulation study provided that a sufficiently large number of RIs were given as input. Lastly, we evaluated the impact of data quantity on species tree estimation by repeating the above experiments with input sizes varying from 100 to 100,000 parsimony-informative RIs. We found that, when given just 1000 parsimony-informative RIs as input, ASTRAL_BP successfully reconstructed major clades (i.e., clades separated by branches $>0.3$ coalescent units) with high support and identified rapid radiations (i.e., shorter connected branches), although not their precise branching order. The local posterior probability was effective for controlling false positive branches in these scenarios. [Coalescence; incomplete lineage sorting; Laurasiatheria; Palaeognathae; parsimony; polymorphism parsimony; retroelement insertions; species trees; transposon.]

    more » « less