skip to main content

Title: Gaussian Predictive Process Models for Large Spatial Data Sets

With scientific data available at geocoded locations, investigators are increasingly turning to spatial process models for carrying out statistical inference. Over the last decade, hierarchical models implemented through Markov chain Monte Carlo methods have become especially popular for spatial modelling, given their flexibility and power to fit models that would be infeasible with classical methods as well as their avoidance of possibly inappropriate asymptotics. However, fitting hierarchical spatial models often involves expensive matrix decompositions whose computational complexity increases in cubic order with the number of spatial locations, rendering such models infeasible for large spatial data sets. This computational burden is exacerbated in multivariate settings with several spatially dependent response variables. It is also aggravated when data are collected at frequent time points and spatiotemporal process models are used. With regard to this challenge, our contribution is to work with what we call predictive process models for spatial and spatiotemporal data. Every spatial (or spatiotemporal) process induces a predictive process model (in fact, arbitrarily many of them). The latter models project process realizations of the former to a lower dimensional subspace, thereby reducing the computational burden. Hence, we achieve the flexibility to accommodate non-stationary, non-Gaussian, possibly multivariate, possibly spatiotemporal processes in the context of large data sets. We discuss attractive theoretical properties of these predictive processes. We also provide a computational template encompassing these diverse settings. Finally, we illustrate the approach with simulated and real data sets.

more » « less
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Journal of the Royal Statistical Society Series B: Statistical Methodology
Medium: X Size: p. 825-848
["p. 825-848"]
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Multivariate spatially oriented data sets are prevalent in the environmental and physical sciences. Scientists seek to jointly model multiple variables, each indexed by a spatial location, to capture any underlying spatial association for each variable and associations among the different dependent variables. Multivariate latent spatial process models have proved effective in driving statistical inference and rendering better predictive inference at arbitrary locations for the spatial process. High‐dimensional multivariate spatial data, which are the theme of this article, refer to data sets where the number of spatial locations and the number of spatially dependent variables is very large. The field has witnessed substantial developments in scalable models for univariate spatial processes, but such methods for multivariate spatial processes, especially when the number of outcomes are moderately large, are limited in comparison. Here, we extend scalable modeling strategies for a single process to multivariate processes. We pursue Bayesian inference, which is attractive for full uncertainty quantification of the latent spatial process. Our approach exploits distribution theory for the matrix‐normal distribution, which we use to construct scalable versions of a hierarchical linear model of coregionalization (LMC) and spatial factor models that deliver inference over a high‐dimensional parameter space including the latent spatial process. We illustrate the computational and inferential benefits of our algorithms over competing methods using simulation studies and an analysis of a massive vegetation index data set.

    more » « less
  2. Abstract

    Joint modeling of spatially oriented dependent variables is commonplace in the environmental sciences, where scientists seek to estimate the relationships among a set of environmental outcomes accounting for dependence among these outcomes and the spatial dependence for each outcome. Such modeling is now sought for massive data sets with variables measured at a very large number of locations. Bayesian inference, while attractive for accommodating uncertainties through hierarchical structures, can become computationally onerous for modeling massive spatial data sets because of its reliance on iterative estimation algorithms. This article develops a conjugate Bayesian framework for analyzing multivariate spatial data using analytically tractable posterior distributions that obviate iterative algorithms. We discuss differences between modeling the multivariate response itself as a spatial process and that of modeling a latent process in a hierarchical model. We illustrate the computational and inferential benefits of these models using simulation studies and analysis of a vegetation index data set with spatially dependent observations numbering in the millions.

    more » « less
  3. Abstract

    A key challenge in spatial data science is the analysis for massive spatially‐referenced data sets. Such analyses often proceed from Gaussian process specifications that can produce rich and robust inference, but involve dense covariance matrices that lack computationally exploitable structures. Recent developments in spatial statistics offer a variety of massively scalable approaches. Bayesian inference and hierarchical models, in particular, have gained popularity due to their richness and flexibility in accommodating spatial processes. Our current contribution is to provide computationally efficient exact algorithms for spatial interpolation of massive data sets using scalable spatial processes. We combine low‐rank Gaussian processes with efficient sparse approximations. Following recent work by Zhang et al. (2019), we model the low‐rank process using a Gaussian predictive process (GPP) and the residual process as a sparsity‐inducing nearest‐neighbor Gaussian process (NNGP). A key contribution here is to implement these models using exact conjugate Bayesian modeling to avoid expensive iterative algorithms. Through the simulation studies, we evaluate performance of the proposed approach and the robustness of our models, especially for long range prediction. We implement our approaches for remotely sensed light detection and ranging (LiDAR) data collected over the US Forest Service Tanana Inventory Unit (TIU) in a remote portion of Interior Alaska.

    more » « less
  4. With continued advances in Geographic Information Systems and related computational technologies, statisticians are often required to analyze very large spatial data sets. This has generated substantial interest over the last decade, already too vast to be summarized here, in scalable methodologies for analyzing large spatial data sets. Scalable spatial process models have been found especially attractive due to their richness and flexibility and, particularly so in the Bayesian paradigm, due to their presence in hierarchical model settings. However, the vast majority of research articles present in this domain have been geared toward innovative theory or more complex model development. Very limited attention has been accorded to approaches for easily implementable scalable hierarchical models for the practicing scientist or spatial analyst. This article devises massively scalable Bayesian approaches that can rapidly deliver inference on spatial process that are practically indistinguishable from inference obtained using more expensive alternatives. A key emphasis is on implementation within very standard (modest) computing environments (eg, a standard desktop or laptop) using easily available statistical software packages. Key insights are offered regarding assumptions and approximations concerning practical efficiency.

    more » « less
  5. Abstract

    The joint analysis of spatial and temporal processes poses computational challenges due to the data's high dimensionality. Furthermore, such data are commonly non-Gaussian. In this paper, we introduce a copula-based spatiotemporal model for analyzing spatiotemporal data and propose a semiparametric estimator. The proposed algorithm is computationally simple, since it models the marginal distribution and the spatiotemporal dependence separately. Instead of assuming a parametric distribution, the proposed method models the marginal distributions nonparametrically and thus offers more flexibility. The method also provides a convenient way to construct both point and interval predictions at new times and locations, based on the estimated conditional quantiles. Through a simulation study and an analysis of wind speeds observed along the border between Oregon and Washington, we show that our method produces more accurate point and interval predictions for skewed data than those based on normality assumptions.

    more » « less