Title: Nonparametric spectral methods for multivariate spatial and spatial–temporal data
We propose computationally efficient methods for estimating stationary multivariate spatial and spatial–temporal spectra from incomplete gridded data. The methods are iterative and rely on successive imputation of data and updating of model estimates. Imputations are done according to a periodic model on an expanded domain. The periodicity of the imputations is a key feature that reduces edge effects in the periodogram and is facilitated by efficient circulant embedding techniques. In addition, we describe efficient methods for decomposing the estimated cross spectral density function into a linear model of coregionalization plus a residual process. The methods are applied to two storm datasets, one of which is from Hurricane Florence, which struck the southeastern United States in September 2018. The application demonstrates how fitted models from different datasets can be compared, and how the methods are computationally feasible on datasets with more than 200,000 total observations.
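
As a concrete reference point, the raw ingredient behind such spectral estimates is the matrix-valued periodogram built from the DFT of the gridded fields. The NumPy sketch below is illustrative only (it is not the authors' code, and it omits the paper's imputation scheme and edge-effect corrections): it computes a cross-periodogram for a bivariate field on a complete grid.

```python
import numpy as np

def cross_periodogram(fields):
    """fields: (p, n1, n2) array holding p co-located variables on an
    n1 x n2 grid. Returns the p x p matrix-valued periodogram I(omega)."""
    p, n1, n2 = fields.shape
    n = n1 * n2
    # DFT of each mean-removed component field
    J = np.fft.fft2(fields - fields.mean(axis=(1, 2), keepdims=True))
    # I_jk(omega) = J_j(omega) * conj(J_k(omega)) / n, Hermitian in (j, k)
    return np.einsum('jab,kab->jkab', J, np.conj(J)) / n

rng = np.random.default_rng(0)
Z = rng.standard_normal((2, 64, 64))  # two toy component fields
I = cross_periodogram(Z)
print(I.shape)  # (2, 2, 64, 64): a 2 x 2 spectral matrix at each frequency
```
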
Award ID(s):
1916208
NSF-PAR ID:
10485008
Author(s) / Creator(s):
Publisher / Repository:
Elsevier
Date Published:
Journal Name:
Journal of Multivariate Analysis
Volume:
187
Issue:
C
ISSN:
0047-259X
Page Range / eLocation ID:
104823
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Occupancy modelling is a common approach to assess species distribution patterns while explicitly accounting for false absences in detection–nondetection data. Numerous extensions of the basic single-species occupancy model exist to model multiple species and spatial autocorrelation, and to integrate multiple data types. However, specialized and computationally efficient software incorporating such extensions, especially for large datasets, is scarce or absent.

    We introduce the spOccupancy R package designed to fit single-species and multi-species spatially explicit occupancy models. We fit all models within a Bayesian framework using Pólya-Gamma data augmentation, which results in fast and efficient inference. spOccupancy provides functionality for data integration of multiple single-species detection–nondetection datasets via a joint likelihood framework. The package leverages Nearest Neighbour Gaussian Processes to account for spatial autocorrelation, which enables spatially explicit occupancy modelling for potentially massive datasets (e.g. 1,000s–100,000s of sites).

    spOccupancy provides user-friendly functions for data simulation, model fitting, model validation (by posterior predictive checks), model comparison (using information criteria and k-fold cross-validation) and out-of-sample prediction. We illustrate the package's functionality via a vignette, simulated data analysis and two bird case studies.

    The spOccupancy package provides a user-friendly platform to fit a variety of single-species and multi-species occupancy models, making it straightforward to address detection biases and spatial autocorrelation in species distribution models even for large datasets.
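
For readers unfamiliar with the underlying model, here is a toy Python sketch of the basic single-species occupancy likelihood that spOccupancy generalizes; it is purely illustrative (the package itself is R and uses Pólya-Gamma augmentation and NNGP spatial effects rather than direct likelihood evaluation like this).

```python
import numpy as np

def occupancy_loglik(y, psi, p):
    """y: (sites, visits) 0/1 detection histories; psi: (sites,) occupancy
    probabilities; p: per-visit detection probability (scalar here)."""
    det = (p ** y) * ((1 - p) ** (1 - y))      # per-visit detection terms
    lik_if_occupied = det.prod(axis=1)          # P(history | site occupied)
    never_detected = (y.sum(axis=1) == 0)       # non-occupancy possible only here
    lik = psi * lik_if_occupied + (1 - psi) * never_detected
    return np.log(lik).sum()

rng = np.random.default_rng(1)
y = rng.binomial(1, 0.3, size=(100, 4))         # toy detection histories
print(occupancy_loglik(y, psi=np.full(100, 0.6), p=0.45))
```
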

     
  2. Abstract

    There has been a great deal of recent interest in the development of spatial prediction algorithms for very large datasets and/or prediction domains. These methods have primarily been developed in the spatial statistics community, but there has been growing interest in the machine learning community for such methods, primarily driven by the success of deep Gaussian process regression approaches and deep convolutional neural networks. These methods are often computationally expensive to train and implement, and consequently there has been a resurgence of interest in random projections and deep learning models based on random weights, so-called reservoir computing methods. Here, we combine several of these ideas to develop the random ensemble deep spatial (REDS) approach to predict spatial data. The procedure uses random Fourier features as inputs to an extreme learning machine (a deep neural model with random weights); calibrated ensembles of outputs from this model under different random weights provide simple uncertainty quantification. The REDS method is demonstrated on simulated data and on a classic large satellite dataset.
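
A hedged sketch of the idea as described above: random Fourier features feed a network whose hidden weights are random and fixed, only the output layer is estimated (here by ridge regression), and an ensemble over random seeds yields simple uncertainty estimates. All parameter values below are illustrative assumptions, not the paper's settings.

```python
import numpy as np

def make_rff(d, n_feat, ls, rng):
    # Random Fourier features approximating a Gaussian kernel, lengthscale ls
    W = rng.standard_normal((d, n_feat)) / ls
    b = rng.uniform(0.0, 2.0 * np.pi, n_feat)
    return lambda X: np.sqrt(2.0 / n_feat) * np.cos(X @ W + b)

def elm_predict(Xtr, ytr, Xte, seed, n_feat=256, n_hid=128, lam=1e-3, ls=0.3):
    rng = np.random.default_rng(seed)
    phi = make_rff(Xtr.shape[1], n_feat, ls, rng)
    A = rng.standard_normal((n_feat, n_hid))    # random hidden weights, never trained
    H = lambda X: np.tanh(phi(X) @ A)
    Htr = H(Xtr)
    # Only the output layer is fit: ridge regression on the hidden activations
    beta = np.linalg.solve(Htr.T @ Htr + lam * np.eye(n_hid), Htr.T @ ytr)
    return H(Xte) @ beta

# 2-D spatial toy problem; the ensemble spread is the uncertainty estimate
rng = np.random.default_rng(0)
Xtr = rng.uniform(size=(500, 2))
ytr = np.sin(6 * Xtr[:, 0]) * np.cos(6 * Xtr[:, 1]) + 0.1 * rng.standard_normal(500)
Xte = rng.uniform(size=(50, 2))
preds = np.stack([elm_predict(Xtr, ytr, Xte, s) for s in range(30)])
mean, sd = preds.mean(axis=0), preds.std(axis=0)
```
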

     
  3. Spatial classification with limited feature observations has been a challenging problem in machine learning. The problem exists in applications where only a subset of sensors is deployed at certain regions or partial responses are collected in field surveys. Existing research mostly focuses on addressing incomplete or missing data, e.g., data cleaning and imputation, classification models that allow for missing feature values, or modeling missing features as hidden variables and applying the EM algorithm. These methods, however, assume that incomplete feature observations occur for only a small subset of samples, and thus cannot solve problems where the vast majority of samples have missing feature observations. To address this issue, we propose a new approach that incorporates physics-aware structural constraints into the model representation. Our approach assumes that a spatial contextual feature is observed at all sample locations and establishes spatial structural constraints from the spatial contextual feature map. We design efficient algorithms for model parameter learning and class inference. Evaluations on real-world hydrological applications show that our approach significantly outperforms several baseline methods in classification accuracy, and the proposed solution is computationally efficient on large data volumes.
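
To make the notion of a physics-aware structural constraint concrete, here is a hypothetical toy (not the paper's algorithm): along a terrain transect, gravity implies a location can be flooded only if every lower location is, so constrained inference reduces to choosing the elevation threshold that best agrees with noisy per-pixel class scores.

```python
import numpy as np

def constrained_classify(elev, score):
    """elev, score: (n,) arrays; score in (0, 1) is P(flood) from any weak
    classifier. Returns 0/1 labels obeying the elevation constraint."""
    order = np.argsort(elev)                 # low ground first
    s = score[order]
    # Threshold k floods pixels order[:k]. Choose k maximizing the total
    # log-likelihood-style agreement with the scores (prefix of log-odds).
    gain = np.concatenate(([0.0], np.cumsum(np.log(s) - np.log(1 - s))))
    k = int(np.argmax(gain))
    labels = np.zeros_like(elev, dtype=int)
    labels[order[:k]] = 1
    return labels

elev = np.array([2.0, 1.0, 4.0, 0.5, 3.0])      # hypothetical transect
score = np.array([0.6, 0.9, 0.2, 0.8, 0.4])
print(constrained_classify(elev, score))         # low-lying pixels flagged flooded
```
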
  4. In regions of the world where topography varies significantly with distance, most global climate models (GCMs) have spatial resolutions that are too coarse to accurately simulate key meteorological variables that are influenced by topography, such as clouds, precipitation, and surface temperatures. One approach to tackle this challenge is to run climate models of sufficiently high resolution in those topographically complex regions, such as the North American Regionally Refined Model (NARRM) subset of the Department of Energy's (DOE) Energy Exascale Earth System Model version 2 (E3SM v2). Although high-resolution simulations are expected to provide unprecedented details of atmospheric processes, running models at such high resolutions remains computationally expensive compared to lower-resolution models such as the E3SM Low Resolution (LR). Moreover, because regionally refined and high-resolution GCMs are relatively new, there are a limited number of observational datasets and frameworks available for evaluating climate models with regionally varying spatial resolutions. As such, we developed a new framework to quantify the added value of high spatial resolution in simulating precipitation over the contiguous United States (CONUS). To determine its viability, we applied the framework to two model simulations and an observational dataset. We first remapped all the data into Hierarchical Equal-Area Iso-Latitude Pixelization (HEALPix) pixels. HEALPix offers several mathematical properties that enable seamless evaluation of climate models across different spatial resolutions, including its equal-area and partitioning properties. The remapped HEALPix-based data are used to show how the spatial variability of both observed and simulated precipitation changes as resolution increases. This study provides valuable insights into the requirements for achieving accurate simulations of precipitation patterns over the CONUS. It highlights the importance of allocating sufficient computational resources to run climate models at higher temporal and spatial resolutions to capture spatial patterns effectively. Furthermore, the study demonstrates the effectiveness of the HEALPix framework in evaluating precipitation simulations across different spatial resolutions. This framework offers a viable approach for comparing observed and simulated data when dealing with datasets of varying spatial resolutions. By employing this framework, researchers can extend its usage to other climate variables, datasets, and disciplines that require comparing datasets with different spatial resolutions.
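
A minimal sketch of the remapping step, assuming the healpy library and using pixel means as the aggregation rule (an illustrative stand-in for the paper's framework):

```python
import numpy as np
import healpy as hp

def to_healpix(lon, lat, values, nside):
    """Average point values falling in each HEALPix pixel (equal-area cells)."""
    pix = hp.ang2pix(nside, lon, lat, lonlat=True)
    npix = hp.nside2npix(nside)
    sums = np.bincount(pix, weights=values, minlength=npix)
    cnts = np.bincount(pix, minlength=npix)
    return np.where(cnts > 0, sums / np.maximum(cnts, 1), np.nan)

rng = np.random.default_rng(0)
lon = rng.uniform(-125, -65, 10_000)      # roughly CONUS longitudes
lat = rng.uniform(25, 50, 10_000)
precip = rng.gamma(2.0, 2.0, 10_000)      # synthetic precipitation values
for nside in (16, 32, 64):                # coarse -> fine resolution
    m = to_healpix(lon, lat, precip, nside)
    print(nside, np.nanvar(m))            # spatial variability vs. resolution
```
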

     
  5. Big spatial data has become ubiquitous, from mobile applications to satellite data. In most of these applications, data is continuously growing to huge volumes. Existing systems for big spatial data organize records at either the record level or the block level. Systems that use record-level structures include key-value stores and LSM-Tree stores, which support insert and delete operations and are optimized for highly selective queries. On the other hand, systems like GeoSpark that use block-level structures (e.g., 128 MB each) are more efficient for analytical queries, but they cannot incrementally maintain the partitioned data and do not support delete operations. This paper proposes a general framework that enables block-level systems to incrementally maintain spatial partitions, in the presence of bulk insertions and deletions, in distributed file system (DFS) blocks. We first formally study the incremental spatial partitioning problem for big data and demonstrate its NP-hardness. Then, we propose a cost model to estimate the performance of queries on the partitioned data and the effect of modifying it as the data grows. After that, we provide three different implementations of the incremental partitioning framework. Comprehensive experiments on large real datasets show that our proposed partitioning algorithms outperform state-of-the-art spatial partitioning methods.
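
As a toy illustration of block-level incremental maintenance in the spirit described above (hypothetical, and far simpler than the paper's cost model and three implementations): keep partitions as axis-aligned boxes with a per-block record budget, route bulk inserts to the enclosing box, and split any box that outgrows its block.

```python
from dataclasses import dataclass, field

BLOCK_CAPACITY = 4  # records per partition; stands in for a 128 MB DFS block

@dataclass
class Partition:
    xlo: float
    ylo: float
    xhi: float
    yhi: float
    points: list = field(default_factory=list)

def split(p):
    # Split on the longer axis at the median point
    axis = 0 if (p.xhi - p.xlo) >= (p.yhi - p.ylo) else 1
    mid = sorted(q[axis] for q in p.points)[len(p.points) // 2]
    if axis == 0:
        a, b = Partition(p.xlo, p.ylo, mid, p.yhi), Partition(mid, p.ylo, p.xhi, p.yhi)
    else:
        a, b = Partition(p.xlo, p.ylo, p.xhi, mid), Partition(p.xlo, mid, p.xhi, p.yhi)
    for q in p.points:
        (a if q[axis] < mid else b).points.append(q)
    return [a, b]

def bulk_insert(parts, batch):
    for pt in batch:
        for p in parts:
            if p.xlo <= pt[0] < p.xhi and p.ylo <= pt[1] < p.yhi:
                p.points.append(pt)
                break
    # Incrementally repartition: split only the blocks that overflowed
    out = []
    for p in parts:
        out.extend(split(p) if len(p.points) > BLOCK_CAPACITY else [p])
    return out

parts = [Partition(0, 0, 1, 1)]
parts = bulk_insert(parts, [(0.1, 0.2), (0.8, 0.7), (0.3, 0.9), (0.5, 0.5), (0.2, 0.4)])
print(len(parts), [len(p.points) for p in parts])  # 2 partitions after one split
```
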