

This content will become publicly available on May 7, 2025

Title: Random-effects substitution models for phylogenetics via scalable gradient approximations
Abstract

Phylogenetic and discrete-trait evolutionary inference depend heavily on an appropriate characterization of the underlying character substitution process. In this paper, we present random-effects substitution models that extend common continuous-time Markov chain models into a richer class of processes capable of capturing a wider variety of substitution dynamics. As these random-effects substitution models often require many more parameters than their usual counterparts, inference can be both statistically and computationally challenging. Thus, we also propose an efficient approach to compute an approximation to the gradient of the data likelihood with respect to all unknown substitution model parameters. We demonstrate that this approximate gradient enables scaling of sampling-based inference, namely Bayesian inference via Hamiltonian Monte Carlo, under random-effects substitution models across large trees and state-spaces.

Applied to a dataset of 583 SARS-CoV-2 sequences, an HKY model with random effects shows strong signals of nonreversibility in the substitution process, and posterior predictive model checks clearly show that it is a more adequate model than a reversible model. When analyzing the pattern of phylogeographic spread of 1441 influenza A virus (H3N2) sequences between 14 regions, a random-effects phylogeographic substitution model infers that air travel volume adequately predicts almost all dispersal rates. A random-effects state-dependent substitution model reveals no evidence for an effect of arboreality on the swimming mode in the tree frog subfamily Hylinae. Simulations reveal that random-effects substitution models can accommodate both negligible and radical departures from the underlying base substitution model. We show that our gradient-based inference approach is over an order of magnitude more time-efficient than conventional approaches.
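
As an illustration of the idea in the abstract, the following is a minimal sketch of how random effects might be layered on an HKY base model. It assumes the random effects act multiplicatively on each off-diagonal rate (additively on the log scale); the abstract does not spell out the parameterization, so this form, the function names, and the numerical values are all hypothetical. The sketch checks detailed balance to show that a nonzero random effect can break the reversibility of the base model, while zero random effects recover the base model exactly.

```python
import numpy as np

def hky_rate_matrix(kappa, pi):
    """Unnormalized HKY rate matrix for states ordered A, C, G, T."""
    pi = np.asarray(pi, dtype=float)
    n = len(pi)
    Q = np.tile(pi, (n, 1))                 # q_ij proportional to pi_j
    for i, j in [(0, 2), (2, 0), (1, 3), (3, 1)]:
        Q[i, j] *= kappa                    # transitions A<->G and C<->T
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))     # rows sum to zero
    return Q

def add_random_effects(Q_base, eps):
    """Assumed form: scale each off-diagonal rate by exp(eps_ij)."""
    Q = Q_base * np.exp(eps)
    np.fill_diagonal(Q, 0.0)
    np.fill_diagonal(Q, -Q.sum(axis=1))     # recompute the diagonal
    return Q

def stationary_distribution(Q):
    """Solve pi Q = 0 subject to sum(pi) = 1 by least squares."""
    n = Q.shape[0]
    A = np.vstack([Q.T, np.ones(n)])
    b = np.append(np.zeros(n), 1.0)
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

def is_reversible(Q, tol=1e-10):
    """Detailed balance: pi_i q_ij equals pi_j q_ji for all i, j."""
    pi = stationary_distribution(Q)
    flux = pi[:, None] * Q
    return bool(np.allclose(flux, flux.T, atol=tol))

# Hypothetical parameter values, chosen only for illustration.
rng = np.random.default_rng(0)
Q_base = hky_rate_matrix(kappa=2.0, pi=[0.3, 0.2, 0.2, 0.3])
eps = rng.normal(scale=0.5, size=(4, 4))    # one random effect per rate
Q_re = add_random_effects(Q_base, eps)

print("base model reversible:", is_reversible(Q_base))
print("random-effects model reversible:", is_reversible(Q_re))
```

Under this assumed form, the 12 off-diagonal random effects are the extra parameters, which hints at why the parameter count grows quickly with the size of the state space.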

 
Award ID(s):
2152774 2236854
PAR ID:
10506582
Publisher / Repository:
Oxford Academic
Journal Name:
Systematic Biology
ISSN:
1063-5157
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Fast inference of numerical model parameters from data is an important prerequisite to generate predictive models for a wide range of applications. Use of sampling-based approaches such as Markov chain Monte Carlo may become intractable when each likelihood evaluation is computationally expensive. New approaches combining variational inference with normalizing flow are characterized by a computational cost that grows only linearly with the dimensionality of the latent variable space, and rely on gradient-based optimization instead of sampling, providing a more efficient approach for Bayesian inference about the model parameters. Moreover, the cost of frequently evaluating an expensive likelihood can be mitigated by replacing the true model with an offline trained surrogate model, such as a neural network. However, this approach might generate significant bias when the surrogate is insufficiently accurate around the posterior modes. To reduce the computational cost without sacrificing inferential accuracy, we propose Normalizing Flow with Adaptive Surrogate (NoFAS), an optimization strategy that alternately updates the normalizing flow parameters and surrogate model parameters. We also propose an efficient sample weighting scheme for surrogate model training that preserves global accuracy while effectively capturing high posterior density regions. We demonstrate the inferential and computational superiority of NoFAS against various benchmarks, including cases where the underlying model lacks identifiability. The source code and numerical experiments used for this study are available at https://github.com/cedricwangyu/NoFAS.
  2. Abstract

    Claiming causal inferences in network settings necessitates careful consideration of the often complex dependency between outcomes for actors. Of particular importance are treatment spillover or outcome interference effects. We consider causal inference when the actors are connected via an underlying network structure. Our key contribution is a model for causality when the underlying network is endogenous; where the ties between actors and the actor covariates are statistically dependent. We develop a joint model for the relational and covariate generating process that avoids restrictive separability and fixed network assumptions, as these rarely hold in realistic social settings. While our framework can be used with general models, we develop the highly expressive class of Exponential-family Random Network models (ERNM) of which Markov random fields and Exponential-family Random Graph models are special cases. We present potential outcome-based inference within a Bayesian framework and propose a modification to the exchange algorithm to allow for sampling from ERNM posteriors. We present results of a simulation study demonstrating the validity of the approach. Finally, we demonstrate the value of the framework in a case study of smoking in the context of adolescent friendship networks.

     
  3. Summary

    We introduce a flexible marginal modelling approach for statistical inference for clustered and longitudinal data under minimal assumptions. This estimated estimating equations approach is semiparametric and the proposed models are fitted by quasi-likelihood regression, where the unknown marginal means are a function of the fixed effects linear predictor with unknown smooth link, and variance–covariance is an unknown smooth function of the marginal means. We propose to estimate the nonparametric link and variance–covariance functions via smoothing methods, whereas the regression parameters are obtained via the estimated estimating equations. These are score equations that contain nonparametric function estimates. The proposed estimated estimating equations approach is motivated by its flexibility and easy implementation. Moreover, if data follow a generalized linear mixed model, with either a specified or an unspecified distribution of random effects and link function, the model proposed emerges as the corresponding marginal (population-average) version and can be used to obtain inference for the fixed effects in the underlying generalized linear mixed model, without the need to specify any other components of this generalized linear mixed model. Among marginal models, the estimated estimating equations approach provides a flexible alternative to modelling with generalized estimating equations. Applications of estimated estimating equations include diagnostics and link selection. The asymptotic distribution of the proposed estimators for the model parameters is derived, enabling statistical inference. Practical illustrations include Poisson modelling of repeated epileptic seizure counts and simulations for clustered binomial responses.

     
  4. Abstract

    Improvements in the description of amino acid substitution are required to develop better pseudo‐energy‐based protein structure‐aware models for use in phylogenetic studies. These models are used to characterize the probabilities of amino acid substitution and enable better simulation of protein sequences over a phylogeny. A better characterization of amino acid substitution probabilities in turn enables numerous downstream applications, like detecting positive selection, ancestral sequence reconstruction, and evolutionarily‐motivated protein engineering. Many existing Markov models for amino acid substitution in molecular evolution disregard molecular structure and describe the amino acid substitution process over longer evolutionary periods poorly. Here, we present a new model upgraded with a site‐specific parameterization of pseudo‐energy terms in a coarse‐grained force field, which describes local heterogeneity in physical constraints on amino acid substitution better than a previous pseudo‐energy‐based model with minimal runtime cost. The importance of each weight term parameterization in characterizing underlying features of the site, including contact number, solvent accessibility, and secondary structural elements, was evaluated, returning both expected and biologically reasonable relationships between model parameters. This results in the acceptance of proposed amino acid substitutions that more closely resemble the site‐specific frequencies observed in gene family alignments. The modular site‐specific pseudo‐energy function is made available for download through the following website: https://liberles.cst.temple.edu/Software/CASS/index.html.

     
  5. Abstract

    Recent advances in deep learning for neural networks with large numbers of parameters have been enabled by automatic differentiation, an algorithmic technique for calculating gradients of measures of model fit with respect to model parameters. Estimation of high‐dimensional parameter sets is an important problem within the hydrological sciences. Here, we demonstrate the effectiveness of gradient‐based estimation techniques for high‐dimensional inverse estimation problems using a conceptual rainfall‐runoff model. In particular, we compare the effectiveness of Hamiltonian Monte Carlo and automatic differentiation variational inference against two nongradient‐dependent methods, random walk Metropolis and differential evolution Metropolis. We show that the former two techniques exhibit superior performance for inverse estimation of daily rainfall values and are much more computationally efficient on larger data sets in an experiment with synthetic data. We also present a case study evaluating the effectiveness of automatic differentiation variational inference for inverse estimation over 25 years of daily precipitation conditional on streamflow observations at three catchments and show that it is scalable to very high dimensional parameter spaces. The presented results highlight the power of combining hydrological process‐based models with optimization techniques from deep learning for high‐dimensional estimation problems.

     