NSF PAR Search | NSF Public Access Repository

Spatial Clustering of Citizen Science Data Improves Downstream Species Distribution Models

https://doi.org/10.1609/aaai.v39i27.34993

Ahmed, Nahian; Roth, Mark; Hallman, Tyler A; Robinson, W Douglas; Hutchinson, Rebecca A (April 2025, Proceedings of the AAAI Conference on Artificial Intelligence)

Citizen science biodiversity data present great opportunities for ecology and conservation across vast spatial and temporal scales. However, because these data are collected opportunistically, they lack the sampling structure required by modeling methodologies that address a pervasive challenge in ecological data collection: imperfect detection, i.e., the likelihood of under-observing species on field surveys. Occupancy modeling is an example of an approach that accounts for imperfect detection by explicitly modeling the observation process separately from the biological process of habitat selection. This produces species distribution models that speak to the pattern of the species on a landscape after accounting for imperfect detection in the data, rather than the pattern of species observations corrupted by errors. To achieve this benefit, occupancy models require multiple surveys of a site across which the site's status (i.e., occupied or not) is assumed constant. Since citizen science data are not collected under the required repeated-visit protocol, observations may be grouped into sites post hoc. Existing approaches for constructing sites discard some observations and/or consider only geographic distance and not environmental similarity. In this study, we compare ten approaches for site construction in terms of their impact on downstream species distribution models for 31 bird species in Oregon, using observations recorded in the eBird database. We find that occupancy models built on sites constructed by spatial clustering algorithms perform better than existing alternatives.
Free, publicly-accessible full text available April 11, 2026
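The abstract above describes grouping checklists into sites post hoc and then fitting an occupancy model to each site's repeated surveys. The following is a minimal sketch of that idea, not the authors' method: it uses plain single-linkage clustering on a combined geographic and environmental distance, and the standard single-season occupancy likelihood with constant occupancy probability `psi` and detection probability `p`. The function names, the tuple layout, and the `max_dist` threshold are all hypothetical choices made for illustration.

```python
import math
from collections import defaultdict


def build_sites(checklists, max_dist):
    """Group checklists into sites by single-linkage clustering.

    checklists: list of (x, y, env, detected) tuples, where x and y are
    projected coordinates, env is a scaled habitat covariate, and
    detected is 0 or 1. (This layout is an assumption of the sketch.)
    Two checklists are linked when their Euclidean distance in
    (x, y, env) space is at most max_dist.
    Returns one detection history (list of 0/1) per site.
    """
    parent = list(range(len(checklists)))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(checklists)):
        for j in range(i + 1, len(checklists)):
            if math.dist(checklists[i][:3], checklists[j][:3]) <= max_dist:
                parent[find(i)] = find(j)

    sites = defaultdict(list)
    for i, row in enumerate(checklists):
        sites[find(i)].append(row[3])
    return list(sites.values())


def site_likelihood(history, psi, p):
    """Occupancy-model likelihood of one site's detection history.

    Either the site is occupied (probability psi) and each survey detects
    the species with probability p, or the site is unoccupied, which is
    only possible when the species was never detected.
    """
    occupied = psi
    for y in history:
        occupied *= p if y else (1.0 - p)
    unoccupied = 0.0 if any(history) else (1.0 - psi)
    return occupied + unoccupied


checklists = [
    (0.0, 0.0, 0.1, 0),
    (0.5, 0.0, 0.1, 1),
    (10.0, 10.0, 0.9, 0),
    (10.2, 10.0, 0.9, 0),
]
sites = build_sites(checklists, max_dist=1.0)
likelihoods = [site_likelihood(h, psi=0.6, p=0.5) for h in sites]
```

Unlike approaches that keep only one checklist per location, this grouping retains every observation, and including `env` in the distance keeps environmentally dissimilar checklists apart even when they are geographically close.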
Model Evaluation for Geospatial Problems

Wang, Jing; Hallman, Tyler A; Hopkins, Laurel M; Kilbride, John B; Robinson, W Douglas; Hutchinson, Rebecca A (December 2023, CompSust-2023 2023 NeurIPS Workshop on Computational Sustainability: Pitfalls and Promises from Theory to Deployment)

Geospatial problems often involve spatial autocorrelation and covariate shift, which violate the independent, identically distributed assumption underlying standard cross-validation. In this work, we establish a theoretical criterion for unbiased cross-validation, introduce a preliminary categorization framework to guide practitioners in choosing suitable cross-validation strategies for geospatial problems, reconcile conflicting recommendations on best practices, and develop a novel, straightforward method with both theoretical guarantees and empirical success.
Full Text Available
Cross-validation for Geospatial Data: Estimating Generalization Performance in Geostatistical Problems

Wang, Jing; Hopkins, Laurel; Hallman, Tyler A; Robinson, W Douglas; Hutchinson, Rebecca A (October 2023, Transactions on Machine Learning Research)

Geostatistical learning problems are frequently characterized by spatial autocorrelation in the input features and/or the potential for covariate shift at test time. These realities violate the classical assumption of independent, identically distributed data, upon which most cross-validation algorithms rely in order to estimate the generalization performance of a model. In this paper, we present a theoretical criterion for unbiased cross-validation estimators in the geospatial setting. We also introduce a new cross-validation algorithm to evaluate models, inspired by the challenges of geospatial problems. We apply a framework for categorizing problems into different types of geospatial scenarios to help practitioners select an appropriate cross-validation strategy. Our empirical analyses compare cross-validation algorithms on both simulated and several real datasets to develop recommendations for a variety of geospatial settings. This paper aims to draw attention to some challenges that arise in model evaluation for geospatial problems and to provide guidance for users.
Full Text Available
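To illustrate why spatial autocorrelation matters for cross-validation, the sketch below shows spatially blocked cross-validation, a common baseline in which whole grid cells are held out together so that nearby points do not straddle the train/test boundary. This is not the new algorithm introduced in the paper above; the function name `spatial_block_folds` and the parameters `block_size` and `n_folds` are hypothetical.

```python
from collections import defaultdict


def spatial_block_folds(coords, block_size, n_folds):
    """Yield (train_idx, test_idx) pairs that hold out whole grid blocks.

    coords: list of (x, y) projected coordinates.
    block_size: side length of the square grid cells, in coordinate units.
    n_folds: number of folds; blocks are assigned to folds round-robin.
    """
    # Assign every point to the grid cell that contains it.
    blocks = defaultdict(list)
    for i, (x, y) in enumerate(coords):
        key = (int(x // block_size), int(y // block_size))
        blocks[key].append(i)

    # Distribute blocks (not individual points) across folds.
    folds = [[] for _ in range(n_folds)]
    for k, key in enumerate(sorted(blocks)):
        folds[k % n_folds].extend(blocks[key])

    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in range(len(coords)) if i not in held_out]
        yield train_idx, sorted(test_idx)


coords = [(0.1, 0.2), (0.3, 0.4), (5.1, 0.2), (5.3, 0.1), (0.2, 5.5), (5.5, 5.5)]
splits = list(spatial_block_folds(coords, block_size=5.0, n_folds=2))
```

A larger `block_size` increases the separation between training and test points, which trades a less optimistic error estimate against having fewer distinct blocks to distribute across folds.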
On the Role of Spatial Clustering Algorithms in Building Species Distribution Models from Community Science Data

Roth, Mark; Hallman, Tyler; Robinson, W. Douglas; Hutchinson, Rebecca A. (July 2021, ICML 2021 Workshop: Tackling Climate Change with Machine Learning)

This paper discusses opportunities for developments in spatial clustering methods to help leverage broad-scale community science data for building species distribution models (SDMs). SDMs are tools that inform the science and policy needed to mitigate the impacts of climate change on biodiversity. Community science data span spatial and temporal scales unachievable by expert surveys alone, but they lack the structure imposed in smaller-scale studies to allow adjustments for observational biases. Spatial clustering approaches can construct the necessary structure after surveys have occurred, but more work is needed to ensure that they are effective for this purpose. In this proposal, we describe the role of spatial clustering for realizing the potential of large biodiversity datasets, how existing methods approach this problem, and ideas for future work.
Full Text Available