Abstract The Gaussian process (GP) is a staple in the toolkit of a spatial statistician. Well‐documented computing roadblocks in the analysis of large geospatial datasets using GPs have now largely been mitigated via several recent statistical innovations. The nearest neighbor Gaussian process (NNGP) has emerged as one of the leading candidates for such massive‐scale geospatial analysis owing to its empirical success. This article reviews the connection of NNGP to sparse Cholesky factors of the spatial precision (inverse‐covariance) matrix. The focus of the review is on these sparse Cholesky matrices, which are versatile and have recently found many diverse applications beyond the primary usage of NNGP for fast parameter estimation and prediction in spatial (generalized) linear models. In particular, we discuss applications of sparse NNGP Cholesky matrices to address multifaceted computational issues in spatial bootstrapping, simulation of large‐scale realizations of Gaussian random fields, and extensions to nonparametric mean function estimation of a GP using random forests. We also review a sparse‐Cholesky‐based model for areal (geographically aggregated) data that addresses long‐established interpretability issues of existing areal models. Finally, we highlight some yet‐to‐be‐addressed issues of such sparse Cholesky approximations that warrant further research. This article is categorized under: Algorithms and Computational Methods > Algorithms; Algorithms and Computational Methods > Numerical Methods
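The link between NNGP and a sparse triangular factor of the precision matrix can be sketched in a few lines. The following is a minimal illustration, not the article's implementation: the exponential kernel, the ordering of locations by row index, and the names `exp_cov`, `nngp_factors`, and `simulate` are all assumptions made for this example, and the factor is stored densely for readability where a real implementation would use a sparse matrix.

```python
import numpy as np

def exp_cov(a, b, sigma2=1.0, phi=3.0):
    """Exponential covariance sigma2 * exp(-phi * distance) (an assumed kernel)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return sigma2 * np.exp(-phi * d)

def nngp_factors(coords, m=5, sigma2=1.0, phi=3.0):
    """Return (A, D) so that the approximate precision is
    (I - A)^T diag(1 / D) (I - A), with A strictly lower triangular
    and at most m nonzero entries per row."""
    n = coords.shape[0]
    A = np.zeros((n, n))   # dense here for clarity only
    D = np.empty(n)
    D[0] = sigma2          # first location has no neighbors to condition on
    for i in range(1, n):
        # neighbors: the (up to) m nearest among previously ordered locations
        dist = np.linalg.norm(coords[:i] - coords[i], axis=1)
        nb = np.argsort(dist)[:m]
        C_nn = exp_cov(coords[nb], coords[nb], sigma2, phi)
        c_in = exp_cov(coords[i:i + 1], coords[nb], sigma2, phi).ravel()
        w = np.linalg.solve(C_nn, c_in)   # kriging weights on the neighbors
        A[i, nb] = w
        D[i] = sigma2 - c_in @ w          # conditional variance
    return A, D

def simulate(A, D, rng):
    """Draw one field with covariance [(I - A)^T diag(1/D) (I - A)]^{-1}."""
    z = rng.standard_normal(D.size)
    return np.linalg.solve(np.eye(D.size) - A, np.sqrt(D) * z)

rng = np.random.default_rng(0)
coords = rng.uniform(size=(200, 2))
A, D = nngp_factors(coords, m=10)
w = simulate(A, D, rng)
```

When every earlier location is used as a neighbor, the factorization reproduces the exact precision matrix; the approximation comes only from truncating each neighbor set to `m` locations. The `simulate` helper hints at why the same factor is useful for generating realizations of a Gaussian random field.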
Scalable Predictions for Spatial Probit Linear Mixed Models Using Nearest Neighbor Gaussian Processes
Spatial probit generalized linear mixed models (spGLMM) with a linear fixed effect and a spatial random effect, endowed with a Gaussian Process prior, are widely used for analysis of binary spatial data. However, the canonical Bayesian implementation of this hierarchical mixed model can involve protracted Markov Chain Monte Carlo sampling. Alternative approaches have been proposed that circumvent this by directly representing the marginal likelihood from spGLMM in terms of multivariate normal cumulative distribution functions (cdf). We present a direct and fast rendition of this latter approach for predictions from a spatial probit linear mixed model. We show that the covariance matrix of the cdf characterizing the marginal cdf of binary spatial data from spGLMM is amenable to approximation using Nearest Neighbor Gaussian Processes (NNGP). This facilitates a scalable prediction algorithm for spGLMM using NNGP that only involves sparse or small matrix computations and can be deployed in an embarrassingly parallel manner. We demonstrate the accuracy and scalability of the algorithm via numerous simulation experiments and an analysis of species presence-absence data.
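To make the cdf representation concrete, the identity below shows how the marginal likelihood of binary data reduces to a single multivariate normal cdf under a latent-variable probit formulation. The notation ($z$, $w$, $\epsilon$, $S$, $C_\theta$) and the unit-variance noise term are assumptions chosen for illustration and may differ from the parameterization used in the paper.

```latex
\begin{aligned}
z &= X\beta + w + \epsilon, \qquad w \sim N(0, C_\theta), \qquad \epsilon \sim N(0, I_n), \qquad y_i = \mathbb{1}(z_i > 0), \\
S &= \operatorname{diag}(2y_1 - 1, \dots, 2y_n - 1), \\
P(y \mid \beta, \theta) &= P(Sz > 0) = \Phi_n\!\left(SX\beta;\ 0,\ S\,(C_\theta + I_n)\,S\right).
\end{aligned}
```

Here $\Phi_n(\cdot\,; 0, \Sigma)$ denotes the cdf of an $n$-variate normal with mean zero and covariance $\Sigma$. The last equality follows because $Sz \sim N(SX\beta,\ S(C_\theta + I_n)S)$ and the centered normal is symmetric about zero.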
- Award ID(s): 1915803
- PAR ID: 10447670
- Date Published:
- Journal Name: Journal of Data Science
- ISSN: 1680-743X
- Page Range / eLocation ID: 533 to 544
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Summary Canonical correlation analysis investigates linear relationships between two sets of variables, but it often works poorly on modern datasets because of high dimensionality and mixed data types such as continuous, binary and zero-inflated. To overcome these challenges, we propose a semiparametric approach to sparse canonical correlation analysis based on the Gaussian copula. The main result of this paper is a truncated latent Gaussian copula model for data with excess zeros, which allows us to derive a rank-based estimator of the latent correlation matrix for mixed variable types without estimation of marginal transformation functions. The resulting canonical correlation analysis method works well in high-dimensional settings, as demonstrated via numerical studies, and when applied to the analysis of association between gene expression and microRNA data from breast cancer patients.
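The idea of a rank-based estimator that avoids estimating marginal transformations can be illustrated for the simplest case of two continuous variables, where the Gaussian copula gives the well-known relation r = sin(π τ / 2) between the latent correlation r and Kendall's τ. This sketch covers only that continuous case; the truncated and mixed-type bridge functions derived in the paper are not reproduced here, and the function names `kendall_tau` and `latent_corr` are hypothetical.

```python
import numpy as np

def kendall_tau(x, y):
    """Kendall's tau (tau-a) computed from pairwise sign agreements."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sx = np.sign(x[:, None] - x[None, :])
    sy = np.sign(y[:, None] - y[None, :])
    n = x.size
    # the sum runs over ordered pairs, so it counts each pair twice
    return (sx * sy).sum() / (n * (n - 1))

def latent_corr(x, y):
    """Latent Gaussian-copula correlation for two continuous variables."""
    return np.sin(0.5 * np.pi * kendall_tau(x, y))

rng = np.random.default_rng(1)
cov = np.array([[1.0, 0.6], [0.6, 1.0]])
z = rng.multivariate_normal(np.zeros(2), cov, size=2000)
# Monotone transformations of the margins leave the ranks, and so the estimate, unchanged.
r_raw = latent_corr(z[:, 0], z[:, 1])
r_transformed = latent_corr(np.exp(z[:, 0]), z[:, 1] ** 3)
```

Because the estimator depends on the data only through ranks, `r_raw` and `r_transformed` agree, which is the property that removes the need to estimate the marginal transformation functions.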
-
Abstract Preharvest yield estimates can be used for harvest planning, marketing, and prescribing in‐season fertilizer and pesticide applications. One approach that is being widely tested is the use of machine learning (ML) or artificial intelligence (AI) algorithms to estimate yields. However, one barrier to the adoption of this approach is that ML/AI algorithms behave as a black box. An alternative approach is to create an algorithm using Bayesian statistics. In Bayesian statistics, prior information is used to help create the algorithm. However, algorithms based on Bayesian statistics are often not computationally efficient. The objective of the current study was to compare the accuracy and computational efficiency of four Bayesian models that used different assumptions to reduce the execution time. In this paper, the Bayesian multiple linear regression (BLR), Bayesian spatial, Bayesian skewed spatial regression, and Bayesian nearest neighbor Gaussian process (NNGP) models were compared with a non‐Bayesian ML random forest model. In this analysis, soybean (Glycine max) yields were the response variable (y), and space‐based blue, green, red, and near‐infrared reflectance measured with the PlanetScope satellite were the predictors (x). Among the models tested, the Bayesian NNGP model (R2‐testing = 0.485), which captures the short‐range correlation, outperformed the BLR (R2‐testing = 0.02), Bayesian spatial regression (SRM; R2‐testing = 0.087), and Bayesian skewed spatial regression (sSRM; R2‐testing = 0.236) models. However, associated with improved accuracy was an increase in run time from 534 s for the BLR model to 2047 s for the NNGP model. These data show that relatively accurate within‐field yield estimates can be obtained without sacrificing computational efficiency and that the coefficients have biological meaning. However, all Bayesian models had lower R2 values and higher execution times than the random forest model.
-
This paper considers the latent Gaussian graphical model, which extends the Gaussian graphical model to handle discrete data as well as mixed data with both continuous and discrete variables by assuming that discrete variables are generated by discretizing latent Gaussian variables. We propose a modified expectation‐maximization (EM) algorithm to estimate parameters in the latent Gaussian model for binary data. We also extend the proposed modified EM algorithm to the latent Gaussian model for mixed data. The conditional dependence structure can be consequently constructed by exploring the sparsity pattern of the precision matrix of the latent variables. We illustrate the performance of our proposed estimator through comprehensive numerical studies and an application to voting data of the United Nations General Assembly.
-
Abstract A key challenge in spatial data science is the analysis of massive spatially‐referenced data sets. Such analyses often proceed from Gaussian process specifications that can produce rich and robust inference, but involve dense covariance matrices that lack computationally exploitable structures. Recent developments in spatial statistics offer a variety of massively scalable approaches. Bayesian inference and hierarchical models, in particular, have gained popularity due to their richness and flexibility in accommodating spatial processes. Our current contribution is to provide computationally efficient exact algorithms for spatial interpolation of massive data sets using scalable spatial processes. We combine low‐rank Gaussian processes with efficient sparse approximations. Following recent work by Zhang et al. (2019), we model the low‐rank process using a Gaussian predictive process (GPP) and the residual process as a sparsity‐inducing nearest‐neighbor Gaussian process (NNGP). A key contribution here is to implement these models using exact conjugate Bayesian modeling to avoid expensive iterative algorithms. Through simulation studies, we evaluate the performance of the proposed approach and the robustness of our models, especially for long‐range prediction. We implement our approaches for remotely sensed light detection and ranging (LiDAR) data collected over the US Forest Service Tanana Inventory Unit (TIU) in a remote portion of Interior Alaska.