skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Cross-validation for Geospatial Data: Estimating Generalization Performance in Geostatistical Problems
Geostatistical learning problems are frequently characterized by spatial autocorrelation in the input features and/or the potential for covariate shift at test time. These realities violate the classical assumption of independent, identically distributed data, upon which most cross-validation algorithms rely in order to estimate the generalization performance of a model. In this paper, we present a theoretical criterion for unbiased cross-validation estimators in the geospatial setting. We also introduce a new cross-validation algorithm to evaluate models, inspired by the challenges of geospatial problems. We apply a framework for categorizing problems into different types of geospatial scenarios to help practitioners select an appropriate cross-validation strategy. Our empirical analyses compare cross-validation algorithms on both simulated and several real datasets to develop recommendations for a variety of geospatial settings. This paper aims to draw attention to some challenges that arise in model evaluation for geospatial problems and to provide guidance for users.  more » « less
Award ID(s):
2046678
PAR ID:
10514920
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
Transactions on Machine Learning Research
Date Published:
Journal Name:
Transactions on Machine Learning Research
ISSN:
2835-8856
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Geospatial problems often involve spatial autocorrelation and covariate shift, which violate the independent, identically distributed assumption underlying standard cross-validation. In this work, we establish a theoretical criterion for unbiased crossvalidation, introduce a preliminary categorization framework to guide practitioners in choosing suitable cross-validation strategies for geospatial problems, reconcile conflicting recommendations on best practices, and develop a novel, straightforward method with both theoretical guarantees and empirical success. 
    more » « less
  2. Despite numerous years of research into the merits and trade-offs of various model selection criteria, obtaining robust results that elucidate the behavior of cross-validation remains a challenging endeavor. In this paper, we highlight the inherent limitations of cross-validation when employed to discern the structure of a Gaussian graphical model. We provide finite-sample bounds on the probability that the Lasso estimator for the neighborhood of a node within a Gaussian graphical model, optimized using a prediction oracle, misidentifies the neighborhood. Our results pertain to both undirected and directed acyclic graphs, encompassing general, sparse covariance structures. To support our theoretical findings, we conduct an empirical investigation of this inconsistency by contrasting our outcomes with other commonly used information criteria through an extensive simulation study. Given that many algorithms designed to learn the structure of graphical models require hyperparameter selection, the precise calibration of this hyperparameter is paramount for accurately estimating the inherent structure. Consequently, our observations shed light on this widely recognized practical challenge. 
    more » « less
  3. We are currently living in the era of big data. The volume of collected or archived geospatial data for land use and land cover (LULC) mapping including remotely sensed satellite imagery and auxiliary geospatial datasets is increasing. Innovative machine learning, deep learning algorithms, and cutting-edge cloud computing have also recently been developed. While new opportunities are provided by these geospatial big data and advanced computer technologies for LULC mapping, challenges also emerge for LULC mapping from using these geospatial big data. This article summarizes the review studies and research progress in remote sensing, machine learning, deep learning, and geospatial big data for LULC mapping since 2015. We identified the opportunities, challenges, and future directions of using geospatial big data for LULC mapping. More research needs to be performed for improved LULC mapping at large scales. 
    more » « less
  4. CyberGIS—geographic information science and systems (GIS) based on advanced cyberinfrastructure—is becoming increasingly important to tackling a variety of socio-environmental problems like climate change, disaster management, and water security. While recent advances in high-performance computing (HPC) have the potential to help address these problems, the technical knowledge required to use HPC has posed challenges to many domain experts. In this paper, we present CyberGIS-Compute: a geospatial middleware tool designed to democratize HPC access for solving diverse socio-environmental problems. CyberGIS-Compute does this by providing a simple user interface in Jupyter, streamlining the process of integrating domain-specific models with HPC, and establishing a suite of APIs friendly to domain experts. 
    more » « less
  5. null (Ed.)
    Summary While many statistical models and methods are now available for network analysis, resampling of network data remains a challenging problem. Cross-validation is a useful general tool for model selection and parameter tuning, but it is not directly applicable to networks since splitting network nodes into groups requires deleting edges and destroys some of the network structure. In this paper we propose a new network resampling strategy, based on splitting node pairs rather than nodes, that is applicable to cross-validation for a wide range of network model selection tasks. We provide theoretical justification for our method in a general setting and examples of how the method can be used in specific network model selection and parameter tuning tasks. Numerical results on simulated networks and on a statisticians’ citation network show that the proposed cross-validation approach works well for model selection. 
    more » « less