-
Abstract: Here we present the open-source and cross-platform BEAST X software that combines molecular phylogenetic reconstruction with complex trait evolution, divergence-time dating and coalescent demographics in an efficient statistical inference engine. BEAST X significantly advances the flexibility and scalability of evolutionary models supported. Novel clock and substitution models leverage a large variety of evolutionary processes; discrete, continuous and mixed traits with missingness and measurement errors; and fast, gradient-informed integration techniques that rapidly traverse high-dimensional parameter spaces.
Free, publicly accessible full text available August 1, 2026.
-
The continuous-time Markov chain (CTMC) is the mathematical workhorse of evolutionary biology. Learning CTMC model parameters using modern, gradient-based methods requires the derivative of the matrix exponential evaluated at the CTMC’s infinitesimal generator (rate) matrix. Motivated by the derivative’s extreme computational complexity as a function of state space cardinality, recent work demonstrates the surprising effectiveness of a naive, first-order approximation for a host of problems in computational biology. In response to this empirical success, we obtain rigorous deterministic and probabilistic bounds for the error accrued by the naive approximation and establish a “blessing of dimensionality” result that is universal for a large class of rate matrices with random entries. Finally, we apply the first-order approximation within surrogate-trajectory Hamiltonian Monte Carlo for the analysis of the early spread of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) across 44 geographic regions that comprise a state space of unprecedented dimensionality for unstructured (flexible) CTMC models within evolutionary biology.
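To make the idea of a first-order approximation concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the naive approximation to the directional derivative of exp(tQ) in direction E takes the form t·E·exp(tQ), which the abstract does not spell out, and compares it with the exact derivative obtained from the standard block-matrix identity. All function names, the random rate matrix, and the choice of direction are hypothetical.

```python
import numpy as np


def expm(a, terms=18):
    """Matrix exponential by scaling and squaring with a truncated Taylor series."""
    norm = np.linalg.norm(a, 1)
    # Halve the matrix until its 1-norm is at most 0.5, then square back up.
    squarings = max(0, int(np.ceil(np.log2(norm))) + 1) if norm > 0 else 0
    scaled = a / (2 ** squarings)
    result = np.eye(a.shape[0])
    term = np.eye(a.shape[0])
    for k in range(1, terms + 1):
        term = term @ scaled / k
        result = result + term
    for _ in range(squarings):
        result = result @ result
    return result


def expm_derivative_exact(q, direction, t):
    """Exact directional derivative of exp(t*q): the upper-right block of
    exp([[t*q, t*direction], [0, t*q]])."""
    n = q.shape[0]
    block = np.zeros((2 * n, 2 * n))
    block[:n, :n] = t * q
    block[n:, n:] = t * q
    block[:n, n:] = t * direction
    return expm(block)[:n, n:]


def expm_derivative_first_order(q, direction, t):
    """Assumed naive approximation: t * direction @ exp(t*q).
    It is exact whenever direction commutes with q."""
    return (t * direction) @ expm(t * q)


def random_rate_matrix(n, seed=0):
    """Hypothetical generator: positive off-diagonal rates, rows summing to zero."""
    rng = np.random.default_rng(seed)
    q = rng.gamma(shape=2.0, scale=0.5, size=(n, n))
    np.fill_diagonal(q, 0.0)
    np.fill_diagonal(q, -q.sum(axis=1))
    return q


if __name__ == "__main__":
    q = random_rate_matrix(5)
    # Perturb the rate 0 -> 1 while keeping row 0 summing to zero.
    direction = np.zeros((5, 5))
    direction[0, 1], direction[0, 0] = 1.0, -1.0
    exact = expm_derivative_exact(q, direction, t=0.1)
    approx = expm_derivative_first_order(q, direction, t=0.1)
    print("max abs error:", np.abs(exact - approx).max())
```

The approximation needs a single matrix exponential and one matrix product per direction, whereas the block-matrix route exponentiates a matrix of twice the dimension for every direction, which is one way to see why the exact derivative becomes expensive as the state space grows.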
-
Abstract: Phylogenetic and discrete-trait evolutionary inference depend heavily on an appropriate characterization of the underlying character substitution process. In this paper, we present random-effects substitution models that extend common continuous-time Markov chain models into a richer class of processes capable of capturing a wider variety of substitution dynamics. As these random-effects substitution models often require many more parameters than their usual counterparts, inference can be both statistically and computationally challenging. Thus, we also propose an efficient approach to compute an approximation to the gradient of the data likelihood with respect to all unknown substitution model parameters. We demonstrate that this approximate gradient enables scaling of sampling-based inference, namely Bayesian inference via Hamiltonian Monte Carlo, under random-effects substitution models across large trees and state spaces. Applied to a dataset of 583 SARS-CoV-2 sequences, an HKY model with random effects shows strong signals of nonreversibility in the substitution process, and posterior predictive model checks clearly show that it is a more adequate model than a reversible model. When analyzing the pattern of phylogeographic spread of 1441 influenza A virus (H3N2) sequences between 14 regions, a random-effects phylogeographic substitution model infers that air travel volume adequately predicts almost all dispersal rates. A random-effects state-dependent substitution model reveals no evidence for an effect of arboreality on the swimming mode in the tree frog subfamily Hylinae. Simulations reveal that random-effects substitution models can accommodate both negligible and radical departures from the underlying base substitution model. We show that our gradient-based inference approach is over an order of magnitude more time efficient than conventional approaches.
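One way to picture a random-effects extension of a base model is sketched below. It is an illustration under assumptions, not the paper's parameterization: it assumes each off-diagonal rate of a base HKY matrix is multiplied by exp(ε_ij) for a rate-specific random effect ε_ij, uses the state order A, C, G, T, and skips rate normalization. The function names are hypothetical.

```python
import numpy as np


def hky_rate_matrix(kappa, freqs):
    """Unnormalized HKY generator in the order A, C, G, T:
    q[i, j] = kappa * pi_j for transitions (A<->G, C<->T), pi_j otherwise."""
    freqs = np.asarray(freqs, dtype=float)
    q = np.tile(freqs, (4, 1))  # every row starts as (pi_A, pi_C, pi_G, pi_T)
    for i, j in [(0, 2), (2, 0), (1, 3), (3, 1)]:
        q[i, j] *= kappa
    np.fill_diagonal(q, 0.0)
    np.fill_diagonal(q, -q.sum(axis=1))  # rows of a generator sum to zero
    return q


def add_random_effects(base, effects):
    """Assumed form: scale each off-diagonal base rate by exp(effects[i, j]),
    then recompute the diagonal. Diagonal entries of `effects` are ignored."""
    q = base * np.exp(effects)
    np.fill_diagonal(q, 0.0)
    np.fill_diagonal(q, -q.sum(axis=1))
    return q


if __name__ == "__main__":
    freqs = [0.1, 0.2, 0.3, 0.4]
    base = hky_rate_matrix(kappa=2.0, freqs=freqs)
    effects = np.zeros((4, 4))
    effects[0, 1] = 0.5  # raise the A -> C rate only
    q = add_random_effects(base, effects)
    flux = np.asarray(freqs)[:, None] * q
    # A nonzero value here means detailed balance under the base frequencies is broken.
    print("largest flux asymmetry:", np.abs(flux - flux.T).max())
```

With all effects at zero the base model is recovered, while an effect on a single directed rate is enough to break detailed balance with respect to the base frequencies, which is how a model of this shape could express the kind of nonreversibility the abstract reports.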
-
Chekouo, Thierry (Ed.) Inferring dependencies between mixed-type biological traits while accounting for evolutionary relationships between specimens is of great scientific interest yet remains infeasible when trait and specimen counts grow large. The state-of-the-art approach uses a phylogenetic multivariate probit model to accommodate binary and continuous traits via a latent variable framework, and utilizes an efficient bouncy particle sampler (BPS) to tackle the computational bottleneck: integrating many latent variables from a high-dimensional truncated normal distribution. This approach breaks down as the number of specimens grows and fails to reliably characterize conditional dependencies between traits. Here, we propose an inference pipeline for phylogenetic probit models that greatly outperforms BPS. The novelty lies in 1) a combination of the recent Zigzag Hamiltonian Monte Carlo (Zigzag-HMC) with linear-time gradient evaluations and 2) a joint sampling scheme for highly correlated latent variables and correlation matrix elements. In an application exploring HIV-1 evolution from 535 viruses, the inference requires joint sampling from an 11,235-dimensional truncated normal and a 24-dimensional covariance matrix. Our method yields a 5-fold speedup compared to BPS and makes it possible to learn partial correlations between candidate viral mutations and virulence. Computational speedup now enables us to tackle even larger problems: we study the evolution of influenza H1N1 glycosylations on around 900 viruses. For broader applicability, we extend the phylogenetic probit model to incorporate categorical traits, and demonstrate its use to study Aquilegia flower and pollinator co-evolution.
-
Yang, Ziheng (Ed.) Abstract: Divergence time estimation is crucial to provide temporal signals for dating biologically important events from species divergence to viral transmissions in space and time. With the advent of high-throughput sequencing, recent Bayesian phylogenetic studies have analyzed hundreds to thousands of sequences. Such large-scale analyses challenge divergence time reconstruction by requiring inference on highly correlated internal node heights that often become computationally infeasible. To overcome this limitation, we explore a ratio transformation that maps the original $N-1$ internal node heights into a space of one height parameter and $N-2$ ratio parameters. To make the analyses scalable, we develop a collection of linear-time algorithms to compute the gradient and Jacobian-associated terms of the log-likelihood with respect to these ratios. We then apply Hamiltonian Monte Carlo sampling with the ratio transform in a Bayesian framework to learn the divergence times in 4 pathogenic viruses (West Nile virus, rabies virus, Lassa virus, and Ebola virus) and the coralline red algae. Our method both resolves a mixing issue in the West Nile virus example and improves inference efficiency by at least 5-fold for the Lassa and rabies virus examples as well as for the algae example. Our method now also makes it computationally feasible to incorporate mixed-effects molecular clock models for the Ebola virus example, confirms the findings from the original study, and reveals clearer multimodal distributions of the divergence times of some clades of interest.
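The ratio transformation can be pictured with a small sketch. It makes a simplifying assumption that the abstract does not state: all tips are sampled at the same time (height 0), so each non-root internal node's ratio is simply its height divided by its parent's height. Trees with serially sampled tips would need a more general definition. The tree encoding and function names are hypothetical.

```python
def heights_to_ratios(parent, heights, root):
    """Map internal node heights to (root height, ratios).

    parent:  dict from each non-root internal node to its parent internal node
    heights: dict from each internal node to its height (tips assumed at height 0)
    """
    ratios = {
        node: heights[node] / heights[parent[node]]
        for node in heights
        if node != root
    }
    return heights[root], ratios


def ratios_to_heights(parent, root_height, ratios, root):
    """Invert the transform by multiplying ratios down from the root."""
    heights = {root: root_height}

    def height(node):
        # Resolve the parent's height first, caching results as we go.
        if node not in heights:
            heights[node] = ratios[node] * height(parent[node])
        return heights[node]

    for node in ratios:
        height(node)
    return heights


if __name__ == "__main__":
    # A 4-tip caterpillar tree has N - 1 = 3 internal nodes: r (root), a, b.
    parent = {"a": "r", "b": "a"}
    heights = {"r": 10.0, "a": 6.0, "b": 3.0}
    root_height, ratios = heights_to_ratios(parent, heights, root="r")
    print(root_height, ratios)
    print(ratios_to_heights(parent, root_height, ratios, root="r"))
```

Under this assumption every ratio lies strictly between 0 and 1 whenever a child is younger than its parent, so the ordering constraints among node heights turn into independent box constraints on the $N-2$ ratios plus a positivity constraint on the root height.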
-
Objective: To assess the uptake of second line antihyperglycaemic drugs among patients with type 2 diabetes mellitus who are receiving metformin.
Design: Federated pharmacoepidemiological evaluation in LEGEND-T2DM.
Setting: 10 US and seven non-US electronic health record and administrative claims databases in the Observational Health Data Sciences and Informatics network in eight countries from 2011 to the end of 2021.
Participants: 4.8 million patients (≥18 years) across US and non-US based databases with type 2 diabetes mellitus who had received metformin monotherapy and had initiated second line treatments.
Exposure: The exposure used to evaluate each database was calendar year trends, with the years in the study that were specific to each cohort.
Main outcome measures: The outcome was the incidence of second line antihyperglycaemic drug use (ie, glucagon-like peptide-1 receptor agonists, sodium-glucose cotransporter-2 inhibitors, dipeptidyl peptidase-4 inhibitors, and sulfonylureas) among individuals who were already receiving treatment with metformin. The relative drug class level uptake across cardiovascular risk groups was also evaluated.
Results: 4.6 million patients were identified in US databases, 61 382 from Spain, 32 442 from Germany, 25 173 from the UK, 13 270 from France, 5580 from Scotland, 4614 from Hong Kong, and 2322 from Australia. During 2011-21, the combined proportional initiation of the cardioprotective antihyperglycaemic drugs (glucagon-like peptide-1 receptor agonists and sodium-glucose cotransporter-2 inhibitors) increased across all data sources, with the combined initiation of these drugs as second line drugs in 2021 ranging from 35.2% to 68.2% in the US databases, 15.4% in France, 34.7% in Spain, 50.1% in Germany, and 54.8% in Scotland. From 2016 to 2021, in some US and non-US databases, uptake of glucagon-like peptide-1 receptor agonists and sodium-glucose cotransporter-2 inhibitors increased more significantly among populations with no cardiovascular disease compared with patients with established cardiovascular disease. No data source provided evidence of a greater increase in the uptake of these two drug classes in populations with cardiovascular disease compared with no cardiovascular disease.
Conclusions: Despite the increase in overall uptake of cardioprotective antihyperglycaemic drugs as second line treatments for type 2 diabetes mellitus, their uptake was lower in patients with cardiovascular disease than in people with no cardiovascular disease over the past decade. A strategy is needed to ensure that medication use is concordant with guideline recommendations to improve outcomes of patients with type 2 diabetes mellitus.
