skip to main content

Title: Horseshoe‐based Bayesian nonparametric estimation of effective population size trajectories

Phylodynamics is an area of population genetics that uses genetic sequence data to estimate past population dynamics. Modern state‐of‐the‐art Bayesian nonparametric methods for recovering population size trajectories of unknown form use either change‐point models or Gaussian process priors. Change‐point models suffer from computational issues when the number of change‐points is unknown and needs to be estimated. Gaussian process‐based methods lack local adaptivity and cannot accurately recover trajectories that exhibit features such as abrupt changes in trend or varying levels of smoothness. We propose a novel, locally adaptive approach to Bayesian nonparametric phylodynamic inference that has the flexibility to accommodate a large class of functional behaviors. Local adaptivity results from modeling the log‐transformed effective population size a priori as a horseshoe Markov random field, a recently proposed statistical model that blends together the best properties of the change‐point and Gaussian process modeling paradigms. We use simulated data to assess model performance, and find that our proposed method results in reduced bias and increased precision when compared to contemporary methods. We also use our models to reconstruct past changes in genetic diversity of human hepatitis C virus in Egypt and to estimate population size changes of ancient and modern steppe bison. These analyses show that our new method captures features of the population size trajectories that were missed by the state‐of‐the‐art methods.

more » « less
Award ID(s):
Author(s) / Creator(s):
 ;  ;  ;  
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Medium: X Size: p. 677-690
["p. 677-690"]
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Understanding animal movement often relies upon telemetry and biologging devices. These data are frequently used to estimate latent behavioural states to help understand why animals move across the landscape. While there are a variety of methods that make behavioural inferences from biotelemetry data, some features of these methods (e.g. analysis of a single data stream, use of parametric distributions) may limit their generality to reliably discriminate among behavioural states.

    To address some of the limitations of existing behavioural state estimation models, we introduce a nonparametric Bayesian framework called the mixed‐membership method for movement (M4), which is available within the open‐sourcebayesmoveR package. This framework can analyse multiple data streams (e.g. step length, turning angle, acceleration) without relying on parametric distributions, which may capture complex behaviours more successfully than current methods. We tested our Bayesian framework using simulated trajectories and compared model performance against two segmentation methods (behavioural change point analysis (BCPA) and segclust2d), one machine learning method [expectation‐maximization binary clustering (EMbC)] and one type of state‐space model [hidden Markov model (HMM)]. We also illustrated this Bayesian framework using movements of juvenile snail kitesRostrhamus sociabilisin Florida, USA.

    The Bayesian framework estimated breakpoints more accurately than the other segmentation methods for tracks of different lengths. Likewise, the Bayesian framework provided more accurate estimates of behaviour than the other state estimation methods when simulations were generated from less frequently considered distributions (e.g. truncated normal, beta, uniform). Three behavioural states were estimated from snail kite movements, which were labelled as ‘encamped’, ‘area‐restricted search’ and ‘transit’. Changes in these behaviours over time were associated with known dispersal events from the nest site, as well as movements to and from possible breeding locations.

    Our nonparametric Bayesian framework estimated behavioural states with comparable or superior accuracy compared to the other methods when step lengths and turning angles of simulations were generated from less frequently considered distributions. Since the most appropriate parametric distributions may not be obvious a priori, methods (such as M4) that are agnostic to the underlying distributions can provide powerful alternatives to address questions in movement ecology.

    more » « less
  2. Abstract

    Stationary points embedded in the derivatives are often critical for a model to be interpretable and may be considered as key features of interest in many applications. We propose a semiparametric Bayesian model to efficiently infer the locations of stationary points of a nonparametric function, which also produces an estimate of the function. We use Gaussian processes as a flexible prior for the underlying function and impose derivative constraints to control the function's shape via conditioning. We develop an inferential strategy that intentionally restricts estimation to the case of at least one stationary point, bypassing possible mis-specifications in the number of stationary points and avoiding the varying dimension problem that often brings in computational complexity. We illustrate the proposed methods using simulations and then apply the method to the estimation of event-related potentials derived from electroencephalography (EEG) signals. We show how the proposed method automatically identifies characteristic components and their latencies at the individual level, which avoids the excessive averaging across subjects that is routinely done in the field to obtain smooth curves. By applying this approach to EEG data collected from younger and older adults during a speech perception task, we are able to demonstrate how the time course of speech perception processes changes with age.

    more » « less
  3. Abstract Particle filters avoid parametric estimates for Bayesian posterior densities, which alleviates Gaussian assumptions in nonlinear regimes. These methods, however, are more sensitive to sampling errors than Gaussian-based techniques such as ensemble Kalman filters. A recent study by the authors introduced an iterative strategy for particle filters that match posterior moments—where iterations improve the filter’s ability to draw samples from non-Gaussian posterior densities. The iterations follow from a factorization of particle weights, providing a natural framework for combining particle filters with alternative filters to mitigate the impact of sampling errors. The current study introduces a novel approach to forming an adaptive hybrid data assimilation methodology, exploiting the theoretical strengths of nonparametric and parametric filters. At each data assimilation cycle, the iterative particle filter performs a sequence of updates while the prior sample distribution is non-Gaussian, then an ensemble Kalman filter provides the final adjustment when Gaussian distributions for marginal quantities are detected. The method employs the Shapiro–Wilk test to determine when to make the transition between filter algorithms, which has outstanding power for detecting departures from normality. Experiments using low-dimensional models demonstrate that the approach has a significant value, especially for nonhomogeneous observation networks and unknown model process errors. Moreover, hybrid factors are extended to consider marginals of more than one collocated variables using a test for multivariate normality. Findings from this study motivate the use of the proposed method for geophysical problems characterized by diverse observation networks and various dynamic instabilities, such as numerical weather prediction models. Significance Statement Data assimilation statistically processes observation errors and model forecast errors to provide optimal initial conditions for the forecast, playing a critical role in numerical weather forecasting. The ensemble Kalman filter, which has been widely adopted and developed in many operational centers, assumes Gaussianity of the prior distribution and solves a linear system of equations, leading to bias in strong nonlinear regimes. On the other hand, particle filters avoid many of those assumptions but are sensitive to sampling errors and are computationally expensive. We propose an adaptive hybrid strategy that combines their advantages and minimizes the disadvantages of the two methods. The hybrid particle filter–ensemble Kalman filter is achieved with the Shapiro–Wilk test to detect the Gaussianity of the ensemble members and determine the timing of the transition between these filter updates. Demonstrations in this study show that the proposed method is advantageous when observations are heterogeneous and when the model has an unknown bias. Furthermore, by extending the statistical hypothesis test to the test for multivariate normality, we consider marginals of more than one collocated variable. These results encourage further testing for real geophysical problems characterized by various dynamic instabilities, such as real numerical weather prediction models. 
    more » « less
  4. Abstract Aim

    Biogeographers have used three primary data types to examine shifts in tree ranges in response to past climate change: fossil pollen, genetic data and contemporary occurrences. Although recent efforts have explored formal integration of these types of data, we have limited understanding of how integration affects estimates of range shift rates and their uncertainty. We compared estimates of biotic velocity (i.e. rate of species' range shifts) using each data type independently to estimates obtained using integrated models.


    Eastern North America.


    Fraxinus pennsylvanicaMarshall (green ash).


    Using fossil pollen, genomic data and modern occurrence data, we estimated biotic velocities directly from 24 species distribution models (SDMs) and 200 pollen surfaces created with a novel Bayesian spatio‐temporal model. We compared biotic velocity from these analyses to estimates based on coupled demographic‐coalescent simulations and Approximate Bayesian Computation that combined fossil pollen and SDMs with population genomic data collected across theF. pennsylvanicarange.


    Patterns and magnitude of biotic velocity over time varied by the method used to estimate past range dynamics. Estimates based on fossil pollen yielded the highest rates of range movement. Overall, integrating genetic data with other data types in our simulation‐based framework reduced apparent uncertainty in biotic velocity estimates and resulted in greater similarity in estimates between SDM‐ and pollen‐integrated analyses.

    Main Conclusions

    By reducing uncertainty in our assessments of range shifts, integration of data types improves our understanding of the past distribution of species. Based on these results, we propose further steps to reach the integration of these three lines of biogeographical evidence into a unified analytical framework.

    more » « less
  5. Abstract. Robust, proxy-based reconstructions of relative sea-level (RSL) change are critical to distinguishing the processes that drive spatial and temporal sea-level variability. The relationships between individual proxies and RSL can be complex and are often poorly represented by traditional methods that assume Gaussian likelihood distributions. We develop a new statistical framework to estimate past RSL change based on nonparametric, empirical modern distributions of proxies in relation to RSL, applying the framework to corals and mangroves as an illustrative example. We validate our model by comparing its skill in reconstructing RSL and rates of change to two previous RSL models using synthetic time-series datasets based on Holocene sea-level data from South Florida. The new framework results in lower bias, better model fit, and greater accuracy and precision than the two previous RSL models. We also perform sensitivity tests using sea-level scenarios based on two periods of interest – meltwater pulses (MWPs) and the Holocene – to analyze the sensitivity of the statistical reconstructions to the quantity and precision of proxy data; we define high-precision indicators, such as mangroves and the reef-crest coral Acropora palmata, with 2σ vertical uncertainties within ± 3 m and lower-precision indicators, such as Orbicella spp., with 2σ vertical uncertainties within ± 10 m. For reconstructing rapid rates of change in RSL of up to ∼ 40 m kyr−1, such as those that may have characterized MWPs during deglacial periods, we find that employing the nonparametric model with 5 to 10 high-precision data points per kiloyear enables us to constrain rates to within ± 3 m kyr−1 (1σ). For reconstructing RSL with rates of up to ∼ 15 m kyr−1, as observed during the Holocene, we conclude that employing the model with 5 to 10 high-precision (or a combination of high- and low-precision) data points per kiloyear enables precise estimates of RSL within ±∼ 2 m (2σ) and accurate RSL reconstructions with errors ≲ 0.7 m. Employing the nonparametric model with only lower-precision indicators also produces fairly accurate estimates of RSL with errors ≲1.50 m, although with less precision, only constraining RSL to ±∼ 3–4 m (2σ). Although the model performs better than previous models in terms of bias, model fit, accuracy, and precision, it is computationally expensive to run because it requires inverting large matrices for every sample. The new model also provides minimal gains over similar models when a large quantity of high-precision data are available. Therefore, we recommend incorporating the nonparametric likelihood distributions when no other information (e.g., reef facies or epibionts indicative of shallow-water environments to refine coral elevational uncertainties) or no high-precision data are available at a location or during a given time period of interest. 
    more » « less