skip to main content

Title: A Full Scale Approximation of Covariance Functions for Large Spatial Data Sets

Gaussian process models have been widely used in spatial statistics but face tremendous computational challenges for very large data sets. The model fitting and spatial prediction of such models typically require O(n3) operations for a data set of size n. Various approximations of the covariance functions have been introduced to reduce the computational cost. However, most existing approximations cannot simultaneously capture both the large- and the small-scale spatial dependence. A new approximation scheme is developed to provide a high quality approximation to the covariance function at both the large and the small spatial scales. The new approximation is the summation of two parts: a reduced rank covariance and a compactly supported covariance obtained by tapering the covariance of the residual of the reduced rank approximation. Whereas the former part mainly captures the large-scale spatial variation, the latter part captures the small-scale, local variation that is unexplained by the former part. By combining the reduced rank representation and sparse matrix techniques, our approach allows for efficient computation for maximum likelihood estimation, spatial prediction and Bayesian inference. We illustrate the new approach with simulated and real data sets.

more » « less
Author(s) / Creator(s):
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Journal of the Royal Statistical Society Series B: Statistical Methodology
Page Range / eLocation ID:
p. 111-132
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    A key challenge in spatial data science is the analysis for massive spatially‐referenced data sets. Such analyses often proceed from Gaussian process specifications that can produce rich and robust inference, but involve dense covariance matrices that lack computationally exploitable structures. Recent developments in spatial statistics offer a variety of massively scalable approaches. Bayesian inference and hierarchical models, in particular, have gained popularity due to their richness and flexibility in accommodating spatial processes. Our current contribution is to provide computationally efficient exact algorithms for spatial interpolation of massive data sets using scalable spatial processes. We combine low‐rank Gaussian processes with efficient sparse approximations. Following recent work by Zhang et al. (2019), we model the low‐rank process using a Gaussian predictive process (GPP) and the residual process as a sparsity‐inducing nearest‐neighbor Gaussian process (NNGP). A key contribution here is to implement these models using exact conjugate Bayesian modeling to avoid expensive iterative algorithms. Through the simulation studies, we evaluate performance of the proposed approach and the robustness of our models, especially for long range prediction. We implement our approaches for remotely sensed light detection and ranging (LiDAR) data collected over the US Forest Service Tanana Inventory Unit (TIU) in a remote portion of Interior Alaska.

    more » « less
  2. Abstract Obtaining lightweight and accurate approximations of discretized objective functional Hessians in inverse problems governed by partial differential equations (PDEs) is essential to make both deterministic and Bayesian statistical large-scale inverse problems computationally tractable. The cubic computational complexity of dense linear algebraic tasks, such as Cholesky factorization, that provide a means to sample Gaussian distributions and determine solutions of Newton linear systems is a computational bottleneck at large-scale. These tasks can be reduced to log-linear complexity by utilizing hierarchical off-diagonal low-rank (HODLR) matrix approximations. In this work, we show that a class of Hessians that arise from inverse problems governed by PDEs are well approximated by the HODLR matrix format. In particular, we study inverse problems governed by PDEs that model the instantaneous viscous flow of ice sheets. In these problems, we seek a spatially distributed basal sliding parameter field such that the flow predicted by the ice sheet model is consistent with ice sheet surface velocity observations. We demonstrate the use of HODLR Hessian approximation to efficiently sample the Laplace approximation of the posterior distribution with covariance further approximated by HODLR matrix compression. Computational studies are performed which illustrate ice sheet problem regimes for which the Gauss–Newton data-misfit Hessian is more efficiently approximated by the HODLR matrix format than the low-rank (LR) format. We then demonstrate that HODLR approximations can be favorable, when compared to global LR approximations, for large-scale problems by studying the data-misfit Hessian associated with inverse problems governed by the first-order Stokes flow model on the Humboldt glacier and Greenland ice sheet. 
    more » « less
  3. Abstract

    Gaussian process (GP) is a staple in the toolkit of a spatial statistician. Well‐documented computing roadblocks in the analysis of large geospatial datasets using GPs have now largely been mitigated via several recent statistical innovations. Nearest neighbor Gaussian process (NNGP) has emerged as one of the leading candidates for such massive‐scale geospatial analysis owing to their empirical success. This article reviews the connection of NNGP to sparse Cholesky factors of the spatial precision (inverse‐covariance) matrix. Focus of the review is on these sparse Cholesky matrices which are versatile and have recently found many diverse applications beyond the primary usage of NNGP for fast parameter estimation and prediction in the spatial (generalized) linear models. In particular, we discuss applications of sparse NNGP Cholesky matrices to address multifaceted computational issues in spatial bootstrapping, simulation of large‐scale realizations of Gaussian random fields, and extensions to nonparametric mean function estimation of a GP using random forests. We also review a sparse‐Cholesky‐based model for areal (geographically aggregated) data that addresses long‐established interpretability issues of existing areal models. Finally, we highlight some yet‐to‐be‐addressed issues of such sparse Cholesky approximations that warrant further research.

    This article is categorized under:

    Algorithms and Computational Methods > Algorithms

    Algorithms and Computational Methods > Numerical Methods

    more » « less
  4. Summary

    Spatial statistics for very large spatial data sets is challenging. The size of the data set, n, causes problems in computing optimal spatial predictors such as kriging, since its computational cost is of order n3. In addition, a large data set is often defined on a large spatial domain, so the spatial process of interest typically exhibits non-stationary behaviour over that domain. A flexible family of non-stationary covariance functions is defined by using a set of basis functions that is fixed in number, which leads to a spatial prediction method that we call fixed rank kriging. Specifically, fixed rank kriging is kriging within this class of non-stationary covariance functions. It relies on computational simplifications when n is very large, for obtaining the spatial best linear unbiased predictor and its mean-squared prediction error for a hidden spatial process. A method based on minimizing a weighted Frobenius norm yields best estimators of the covariance function parameters, which are then substituted into the fixed rank kriging equations. The new methodology is applied to a very large data set of total column ozone data, observed over the entire globe, where n is of the order of hundreds of thousands.

    more » « less
  5. Abstract

    High‐resolution characterization of hydraulic properties and dense nonaqueous phase liquid (DNAPL) contaminant source is crucial to develop efficient remediation strategies. However, DNAPL characterization suffers from a limited number of borehole data in the field, resulting in a low‐resolution estimation. Moreover, high‐resolution DNAPL characterization requires a large number of unknowns to be estimated, presenting a computational bottleneck. In this paper, a low‐cost geophysical approach, the self‐potential method, is used as additional information for hydraulic properties characterization. Joint inversion of hydraulic head and self‐potential measurements is proposed to improve hydraulic conductivity estimation, which is then used to characterize the DNAPL saturation distribution by inverting partitioning tracer measurements. The computational barrier is overcome by (a) solving the inversion by the principal component geostatistical approach, in which the covariance matrix is replaced by a low‐rank approximation, thus reducing the number of forward model runs; (b) using temporal moments of concentrations instead of individual concentration data points for faster forward simulations. To assess the ability of the proposed approach, numerical experiments are conducted in a 3‐D aquifer with 104unknown hydraulic conductivities and DNAPL saturations. Results show that with realistic DNAPL sources and a limited number of hydraulic heads, the traditional hydraulic/partitioning tracer tomography roughly reconstructs subsurface heterogeneity but fails to resolve the DNAPL distribution. By adding self‐potential data, the error is reduced by 24% in hydraulic conductivity estimation and 68% in DNAPL saturation characterization. The proposed sequential inversion framework utilizes the complementary information from multi‐source hydrogeophysical data sets, and can provide high‐resolution characterizations for realistic DNAPL sources.

    more » « less