Multivariate spatially oriented data sets are prevalent in the environmental and physical sciences. Scientists seek to jointly model multiple variables, each indexed by a spatial location, to capture any underlying spatial association for each variable and associations among the different dependent variables. Multivariate latent spatial process models have proved effective in driving statistical inference and rendering better predictive inference at arbitrary locations for the spatial process. High‐dimensional multivariate spatial data, which are the theme of this article, refer to data sets where both the number of spatial locations and the number of spatially dependent variables are very large. The field has witnessed substantial developments in scalable models for univariate spatial processes, but such methods for multivariate spatial processes, especially when the number of outcomes is moderately large, are limited in comparison. Here, we extend scalable modeling strategies for a single process to multivariate processes. We pursue Bayesian inference, which is attractive for full uncertainty quantification of the latent spatial process. Our approach exploits distribution theory for the matrix‐normal distribution, which we use to construct scalable versions of a hierarchical linear model of coregionalization (LMC) and spatial factor models that deliver inference over a high‐dimensional parameter space including the latent spatial process. We illustrate the computational and inferential benefits of our algorithms over competing methods using simulation studies and an analysis of a massive vegetation index data set.
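As a rough illustration of the linear model of coregionalization named in the abstract, the sketch below simulates a latent multivariate field as a linear combination of independent univariate Gaussian processes. It is not the authors' scalable algorithm: the exponential correlation function, the decay parameters, the loading matrix, and the names `exp_corr` and `simulate_lmc` are all assumptions made for this example, and the dense Cholesky factorization used here is exactly the cost that scalable methods aim to avoid.

```python
import numpy as np

def exp_corr(coords, phi):
    """Exponential correlation matrix exp(-phi * distance) (assumed kernel)."""
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    return np.exp(-phi * d)

def simulate_lmc(coords, loadings, phis, rng):
    """Draw an n x q latent field W = F @ loadings.

    The K columns of F are independent zero-mean Gaussian processes,
    each with its own exponential correlation decay phis[j].
    """
    n = coords.shape[0]
    k, _ = loadings.shape
    factors = np.empty((n, k))
    for j in range(k):
        # Small jitter on the diagonal keeps the Cholesky factorization stable.
        corr = exp_corr(coords, phis[j]) + 1e-10 * np.eye(n)
        factors[:, j] = np.linalg.cholesky(corr) @ rng.standard_normal(n)
    return factors @ loadings

rng = np.random.default_rng(0)
coords = rng.uniform(size=(50, 2))            # 50 hypothetical locations
loadings = np.array([[1.0, 0.5, -0.3],
                     [0.0, 1.0, 0.8]])        # K = 2 factors, q = 3 outcomes
w = simulate_lmc(coords, loadings, phis=[3.0, 6.0], rng=rng)
```

In the special case where every factor shares one correlation matrix R, the field W is matrix-normal with row covariance R and column covariance equal to the transposed loading matrix times the loading matrix, which is the kind of structure that matrix-normal distribution theory can exploit.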
This content will become publicly available on September 6, 2024
- Award ID(s):
- NSF-PAR ID:
- Publisher / Repository: The New England Statistical Society (NESS)
- Date Published:
- Journal Name: The New England Journal of Statistics in Data Science
- Page Range / eLocation ID: 283 to 295
- Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Joint modeling of spatially oriented dependent variables is commonplace in the environmental sciences, where scientists seek to estimate the relationships among a set of environmental outcomes accounting for dependence among these outcomes and the spatial dependence for each outcome. Such modeling is now sought for massive data sets with variables measured at a very large number of locations. Bayesian inference, while attractive for accommodating uncertainties through hierarchical structures, can become computationally onerous for modeling massive spatial data sets because of its reliance on iterative estimation algorithms. This article develops a conjugate Bayesian framework for analyzing multivariate spatial data using analytically tractable posterior distributions that obviate iterative algorithms. We discuss differences between modeling the multivariate response itself as a spatial process and modeling a latent process in a hierarchical model. We illustrate the computational and inferential benefits of these models using simulation studies and analysis of a vegetation index data set with spatially dependent observations numbering in the millions.
For multivariate spatial Gaussian process models, customary specifications of cross-covariance functions do not exploit relational inter-variable graphs to ensure process-level conditional independence between the variables. This is undesirable, especially in highly multivariate settings, where popular cross-covariance functions, such as multivariate Matérn functions, suffer from a curse of dimensionality as the numbers of parameters and floating-point operations scale up in quadratic and cubic order, respectively, with the number of variables. We propose a class of multivariate graphical Gaussian processes using a general construction called stitching that crafts cross-covariance functions from graphs and ensures process-level conditional independence between variables. For the Matérn family of functions, stitching yields a multivariate Gaussian process whose univariate components are Matérn Gaussian processes, and which conforms to process-level conditional independence as specified by the graphical model. For highly multivariate settings and decomposable graphical models, stitching offers massive computational gains and parameter dimension reduction. We demonstrate the utility of the graphical Matérn Gaussian process to jointly model highly multivariate spatial data using simulation examples and an application to air-pollution modelling.
Compound flooding is a physical phenomenon that has become more destructive in recent years. Moreover, compound flooding is a broad term that envelops many different physical processes that can range from preconditioned, to multivariate, to temporally compounding, or spatially compounding. This research aims to analyze a specific case of compound flooding related to tropical cyclones where the compounding effect is on coastal flooding due to a combination of storm surge and river discharge. In recent years, such compound flood events have increased in frequency and magnitude, due to a number of factors such as sea-level rise from warming oceans. Therefore, the ability to model such events is of increasing urgency. At present, there is no holistic, integrated modeling system capable of simulating or forecasting compound flooding on a large regional or global scale, leading to the need to couple various existing models. More specifically, two more challenges in such a modeling effort are determining the primary model and accounting for the effect of adjacent watersheds that discharge to the same receiving water body in amplifying the impact of compound flooding from riverine discharge with storm surge when the scale of the model includes an entire coastline. In this study, we investigated the possibility of using the Advanced Circulation (ADCIRC) model as the primary model to simulate the compounding effects of fluvial flooding and storm surge via loose one-way coupling with gage data through internal time-dependent flux boundary conditions. The performance of the ADCIRC model was compared with the Hydrologic Engineering Center's River Analysis System (HEC-RAS) model at both the watershed and global scales. Furthermore, the importance of including riverine discharges and the interactions among adjacent watersheds were quantified.
Results showed that the ADCIRC model could reliably be used to model compound flooding on both a watershed scale and a regional scale. Moreover, accounting for the interaction of river discharge from multiple watersheds is critical in accurately predicting flood patterns when high amounts of riverine flow occur in conjunction with storm surge. Particularly, with storms such as Hurricane Harvey (2017), where river flows were near record levels, inundation patterns and water surface elevations were highly dependent on the incorporation of the discharge input from multiple watersheds. Such an effect caused extra and longer inundations in some areas during Hurricane Harvey. Comparisons with real gauge data show that adding internal flow boundary conditions into ADCIRC to account for river discharge from multiple watersheds significantly improves accuracy in predictions of water surface elevations during coastal flooding events.
Disease mapping is an important statistical tool used by epidemiologists to assess geographic variation in disease rates and identify lurking environmental risk factors from spatial patterns. Such maps rely upon spatial models for regionally aggregated data, where neighboring regions tend to exhibit more similar outcomes than those farther apart. We contribute to the literature on multivariate disease mapping, which deals with measurements on multiple (two or more) diseases in each region. We aim to disentangle associations among the multiple diseases from spatial autocorrelation in each disease. We develop multivariate directed acyclic graphical autoregression models to accommodate spatial and inter‐disease dependence. The hierarchical construction imparts flexibility and richness, interpretability of spatial autocorrelation and inter‐disease relationships, and computational ease, but depends upon the order in which the cancers are modeled. To obviate this, we demonstrate how Bayesian model selection and averaging across orders are easily achieved using bridge sampling. We compare our method with a competitor using simulation studies and present an application to multiple cancer mapping using data from the Surveillance, Epidemiology, and End Results program.