Title: Scalable Marginalization of Correlated Latent Variables with Applications to Learning Particle Interaction Kernels
Marginalization of latent variables or nuisance parameters is a fundamental aspect of Bayesian inference and uncertainty quantification. In this work, we focus on scalable marginalization of latent variables in modeling correlated data, such as spatio-temporal or functional observations. We first introduce Gaussian processes (GPs) for modeling correlated data and highlight the computational challenge: the computational cost grows cubically with the number of observations. We then review the connection between state space models and GPs with Matérn covariance for temporal inputs. The Kalman filter and Rauch-Tung-Striebel smoother are introduced as a scalable marginalization technique for computing the likelihood and making predictions with GPs, without approximation. We then discuss recent efforts to extend this scalable marginalization idea to the linear model of coregionalization for multivariate correlated outputs and spatio-temporal observations. In the final part of this work, we introduce a novel marginalization technique for estimating interaction kernels and forecasting particle trajectories. The computational gain comes from a sparse representation of the inverse covariance matrix of the latent variables, combined with the conjugate gradient method, which improves predictive accuracy on large data sets. The computational advances achieved in this work enable a wide range of applications in molecular dynamics simulation, cellular migration, and agent-based models.
Journal Name: The New England Journal of Statistics in Data Science
Page Range / eLocation ID: 1 to 15
Sponsoring Org: National Science Foundation
More Like this
  1. Spatiotemporal data analysis with massive zeros is widely used in areas such as epidemiology and public health. We use a Bayesian framework to fit zero-inflated negative binomial models and employ a set of latent variables from Pólya-Gamma distributions to derive an efficient Gibbs sampler. The proposed model accommodates varying spatial and temporal random effects through Gaussian process priors, which offer both simplicity and flexibility in modeling nonlinear relationships through a covariance function. To overcome the computational bottleneck that GPs suffer when the sample size is large, we adopt the nearest-neighbor GP approach, which approximates the covariance matrix using local experts. In a simulation study, we consider multiple settings with varying numbers of spatial locations to evaluate the performance of the proposed model in estimating spatial and temporal random effects, and we compare the results with other methods. We also apply the proposed model to the COVID-19 death counts in the state of Florida, USA from 3/25/2020 through 7/29/2020 to examine relationships between social vulnerability and COVID-19 deaths.
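The nearest-neighbor GP idea above can be sketched as a Vecchia-type approximation: each observation is conditioned only on a small set of nearby, previously ordered points, replacing one O(n³) dense solve with n small m × m solves. A minimal illustration (our own sketch; `nngp_loglik` and `cov_fn` are hypothetical names, not the authors' code):

```python
import numpy as np

def nngp_loglik(coords, y, cov_fn, m=10):
    """Nearest-neighbour (Vecchia-type) GP log-likelihood: observation i is
    conditioned only on its m nearest neighbours among points ordered before
    it.  cov_fn(A, B) must return the cross-covariance matrix of A vs B."""
    n = len(y)
    ll = 0.0
    for i in range(n):
        if i == 0:
            mu = 0.0
            var = cov_fn(coords[:1], coords[:1])[0, 0]
        else:
            # the m closest points among those ordered before i
            d = np.linalg.norm(coords[:i] - coords[i], axis=1)
            nb = np.argsort(d)[:m]
            C_nn = cov_fn(coords[nb], coords[nb])
            c_in = cov_fn(coords[nb], coords[i:i + 1])[:, 0]
            w = np.linalg.solve(C_nn, c_in)          # local kriging weights
            mu = w @ y[nb]
            var = cov_fn(coords[i:i + 1], coords[i:i + 1])[0, 0] - w @ c_in
        ll += -0.5 * (np.log(2 * np.pi * var) + (y[i] - mu)**2 / var)
    return ll
```

When m is at least n - 1, every point conditions on all of its predecessors and the approximation recovers the exact dense GP likelihood; smaller m trades accuracy for near-linear cost.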

  2. The spatio-temporal properties of seismicity give us incisive insight into the stress state evolution and fault structures of the crust. Empirical models based on self-exciting point processes continue to provide an important tool for analysing seismicity, given the epistemic uncertainty associated with physical models. In particular, the epidemic-type aftershock sequence (ETAS) model acts as a reference model for studying seismicity catalogues. The traditional ETAS model uses simple parametric definitions for the background rate of triggering-independent seismicity. This reduces the effectiveness of the basic ETAS model in modelling the temporally complex seismicity patterns seen in seismic swarms that are dominated by aseismic tectonic processes such as fluid injection rather than aftershock triggering. In order to robustly capture time-varying seismicity rates, we introduce a deep Gaussian process (GP) formulation for the background rate as an extension to ETAS. GPs are a robust non-parametric model for function spaces with covariance structure. By conditioning the length-scale structure of a GP with another GP, we have a deep-GP: a probabilistic, hierarchical model that automatically tunes its structure to match data constraints. We show how the deep-GP-ETAS model can be efficiently sampled by making use of a Metropolis-within-Gibbs scheme, taking advantage of the branching process formulation of ETAS and a stochastic partial differential equation (SPDE) approximation for Matérn GPs. We illustrate our method using synthetic examples, and show that the deep-GP-ETAS model successfully captures multiscale temporal behaviour in the background forcing rate of seismicity.
We then apply the results to two real-data catalogues: the Ridgecrest, CA 2019 July 5 Mw 7.1 event catalogue, showing that deep-GP-ETAS can successfully characterize a classical aftershock sequence; and the 2016–2019 Cahuilla, CA earthquake swarm, which shows two distinct phases of aseismic forcing concordant with a fluid injection-driven initial sequence, arrest of the fluid along a physical barrier and release following the largest Mw 4.4 event of the sequence.

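The standard ETAS conditional intensity that the deep-GP formulation above extends is a background rate plus Omori-law aftershock triggering from all earlier events. A minimal sketch (our own, with purely illustrative parameter values) in which the background rate `mu` is a callable, so a constant rate or a GP-modulated one can be plugged in:

```python
import numpy as np

def etas_intensity(t, times, mags, mu, K=0.05, alpha=1.0, c=0.01, p=1.2, M0=2.0):
    """ETAS conditional intensity at time t: a (possibly time-varying)
    background rate mu(t) plus modified-Omori-law triggering contributions
    from all events strictly before t.  Parameter values are illustrative."""
    past = times < t
    trig = K * np.exp(alpha * (mags[past] - M0)) * (t - times[past] + c)**(-p)
    return mu(t) + trig.sum()
```

With `mu = lambda t: 0.2` this is the classical constant-background ETAS; replacing `mu` with a function drawn from a (deep) GP yields the time-varying background rate the abstract describes.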
  3.
    Hydropower is the largest renewable energy source for electricity generation in the world, with numerous benefits: environmental protection (near-zero air pollution and climate impact), cost-effectiveness (long-term use without significant impact from market fluctuations), and reliability (quick response to surges in demand). However, the effectiveness of hydropower plants is affected by multiple factors such as reservoir capacity, rainfall, temperature, and fluctuating electricity demand, and particularly by their complicated relationships, which make predicting a station's operational output a difficult challenge. In this paper, we present DeepHydro, a novel stochastic method for modeling multivariate time series (e.g., water inflow/outflow and temperature) and forecasting the power generation of hydropower stations. DeepHydro captures temporal dependencies in co-evolving time series with a new conditioned latent recurrent neural network, which not only considers the hidden states of observations but also preserves the uncertainty of the latent variables. We introduce a generative network parameterized by a continuous normalizing flow to approximate the complex posterior distribution of multivariate time series data, and further use neural ordinary differential equations to estimate the continuous-time dynamics of the latent variables underlying the observable data. This allows our model to handle discrete observations in the context of continuous dynamical systems while remaining robust to noise. We conduct extensive experiments on real-world datasets from a large power generation company operating cascade hydropower stations. The experimental results demonstrate that the proposed method effectively predicts power production and significantly outperforms the candidate baseline approaches.
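The neural-ODE component mentioned above models the latent state as the solution of a learned differential equation. The sketch below (our own illustrative stand-in, not the DeepHydro architecture; the weight matrices are assumed to have been learned elsewhere) integrates latent dynamics dz/dt = f(z), with f a one-hidden-layer tanh network, using explicit Euler steps:

```python
import numpy as np

def latent_ode_trajectory(z0, W1, b1, W2, b2, t_grid):
    """Integrate dz/dt = f(z), where f(z) = W2 @ tanh(W1 @ z + b1) + b2,
    with explicit Euler steps over t_grid, and return the latent states.
    A minimal stand-in for a neural-ODE latent model; in practice one would
    use an adaptive solver and learn the weights by backpropagation."""
    zs = [np.asarray(z0, dtype=float)]
    for k in range(1, len(t_grid)):
        dt = t_grid[k] - t_grid[k - 1]
        z = zs[-1]
        dz = W2 @ np.tanh(W1 @ z + b1) + b2    # learned vector field f(z)
        zs.append(z + dt * dz)                 # Euler step
    return np.stack(zs)
```

Because the dynamics are defined in continuous time, the same learned vector field can be evaluated on any time grid, which is what lets such models reconcile irregular discrete observations with continuous dynamical systems.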
  4. Two popular approaches for relating correlated measurements of a non‐Gaussian response variable to a set of predictors are to fit a marginal model using generalized estimating equations and to fit a generalized linear mixed model (GLMM) by introducing latent random variables. The first approach is effective for parameter estimation, but leaves one without a formal model for the data with which to assess quality of fit or make individual‐level predictions for future observations. The second approach overcomes these deficiencies, but leads to parameter estimates that must be interpreted conditional on the latent variables. To obtain marginal summaries, one needs to evaluate an analytically intractable integral or use attenuation factors as an approximation. Further, we note an unpalatable implication of the standard GLMM. To resolve these issues, we turn to a class of marginally interpretable GLMMs that lead to parameter estimates with a marginal interpretation while maintaining the desirable statistical properties of a conditionally specified model and avoiding problematic implications. We establish the form of these models under the most commonly used link functions and address computational issues. For logistic mixed effects models, we introduce an accurate and efficient method for evaluating the logistic‐normal integral.

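For context, the logistic-normal integral above, E[expit(X)] for X ~ N(mu, sigma²), has no closed form. A generic numerical baseline is Gauss-Hermite quadrature; the sketch below is our own illustration of that baseline, not the more accurate method the paper introduces:

```python
import numpy as np

def logistic_normal_integral(mu, sigma, n=40):
    """Approximate E[expit(X)] for X ~ N(mu, sigma^2) by Gauss-Hermite
    quadrature: transform the physicists' Hermite rule (weight e^{-x^2})
    to the Gaussian measure via x -> mu + sqrt(2)*sigma*x."""
    x, w = np.polynomial.hermite.hermgauss(n)
    f = 1.0 / (1.0 + np.exp(-(mu + np.sqrt(2.0) * sigma * x)))
    return (w @ f) / np.sqrt(np.pi)
```

Because the quadrature nodes are symmetric, mu = 0 yields exactly 0.5 (the expit function averages to 1/2 under any symmetric distribution centered at zero), a handy sanity check on any implementation.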
  5.
    Multivariate spatially oriented data sets are prevalent in the environmental and physical sciences. Scientists seek to jointly model multiple variables, each indexed by a spatial location, to capture any underlying spatial association for each variable and associations among the different dependent variables. Multivariate latent spatial process models have proved effective in driving statistical inference and rendering better predictive inference at arbitrary locations for the spatial process. High‐dimensional multivariate spatial data, which are the theme of this article, refer to data sets where the number of spatial locations and the number of spatially dependent variables are very large. The field has witnessed substantial developments in scalable models for univariate spatial processes, but such methods for multivariate spatial processes, especially when the number of outcomes is moderately large, are limited in comparison. Here, we extend scalable modeling strategies for a single process to multivariate processes. We pursue Bayesian inference, which is attractive for full uncertainty quantification of the latent spatial process. Our approach exploits distribution theory for the matrix‐normal distribution, which we use to construct scalable versions of a hierarchical linear model of coregionalization (LMC) and spatial factor models that deliver inference over a high‐dimensional parameter space including the latent spatial process. We illustrate the computational and inferential benefits of our algorithms over competing methods using simulation studies and an analysis of a massive vegetation index data set.

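The matrix-normal distribution theory mentioned above is what makes such models scale: for an n × q data matrix with row covariance U (over locations) and column covariance V (over outcomes), the density equals that of an N(vec(M), V ⊗ U) vector, but it can be evaluated with one n × n and one q × q factorization instead of an nq × nq one. A minimal sketch (our own illustration, not the paper's code):

```python
import numpy as np

def matrix_normal_logpdf(Y, M, U, V):
    """Log-density of Y ~ MN(M, U, V), with U the n x n row covariance and
    V the q x q column covariance.  Equivalent to N(vec(M), V kron U), but
    only an n x n and a q x q Cholesky are needed, never the nq x nq one."""
    n, q = Y.shape
    R = Y - M
    Uc = np.linalg.cholesky(U)
    Vc = np.linalg.cholesky(V)
    # quadratic form trace(V^{-1} R^T U^{-1} R) via two triangular solves
    A = np.linalg.solve(Uc, R)        # Uc^{-1} R, so A^T A = R^T U^{-1} R
    B = np.linalg.solve(Vc, A.T)      # Vc^{-1} A^T; Frobenius norm gives the trace
    quad = np.sum(B**2)
    logdetU = 2.0 * np.sum(np.log(np.diag(Uc)))
    logdetV = 2.0 * np.sum(np.log(np.diag(Vc)))
    return -0.5 * (n * q * np.log(2 * np.pi) + q * logdetU + n * logdetV + quad)
```

In an LMC or spatial factor model, U would carry the spatial covariance across locations and V the cross-covariance among the q outcomes, so the cost in q stays decoupled from the cost in n.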