Title: On the Role of Spatial Clustering Algorithms in Building Species Distribution Models from Community Science Data
This paper discusses opportunities for developments in spatial clustering methods to help leverage broad-scale community science data for building species distribution models (SDMs). SDMs are tools that inform the science and policy needed to mitigate the impacts of climate change on biodiversity. Community science data span spatial and temporal scales unachievable by expert surveys alone, but they lack the structure imposed in smaller-scale studies to allow adjustments for observational biases. Spatial clustering approaches can construct the necessary structure after surveys have occurred, but more work is needed to ensure that they are effective for this purpose. In this proposal, we describe the role of spatial clustering for realizing the potential of large biodiversity datasets, how existing methods approach this problem, and ideas for future work.
Award ID(s):
2046678
PAR ID:
10332683
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
ICML 2021 Workshop: Tackling Climate Change with Machine Learning
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Dainton, John (Ed.)
    Improving models of species' distributions is essential for conservation, especially in light of global change. Species distribution models (SDMs) often rely on mean environmental conditions, yet species distributions are also a function of environmental heterogeneity and filtering acting at multiple spatial scales. Geodiversity, which we define as the variation of abiotic features and processes of Earth's entire geosphere (inclusive of climate), has potential to improve SDMs and conservation assessments because it captures multiple abiotic dimensions of species niches; however, it has not been sufficiently tested in SDMs. We tested a range of geodiversity variables computed at varying scales using climate and elevation data. We compared predictive performance of MaxEnt SDMs generated using CHELSA bioclimatic variables to those also including geodiversity variables for 31 mammalian species in Colombia. Results show the spatial grain of geodiversity variables affects SDM performance. Some variables consistently exhibited an increasing or decreasing trend in variable importance with spatial grain, showing slight scale-dependence and indicating that some geodiversity variables are more relevant at particular scales for some species. Incorporating geodiversity variables into SDMs, and doing so at the appropriate spatial scales, enhances the ability to model species-environment relationships, thereby contributing to the conservation and management of biodiversity. This article is part of the Theo Murphy meeting issue ‘Geodiversity for science and society’.
  2. A core goal of the National Ecological Observatory Network (NEON) is to measure changes in biodiversity across the 30‐yr horizon of the network. In contrast to NEON’s extensive use of automated instruments to collect environmental data, NEON’s biodiversity surveys are almost entirely conducted using traditional human‐centric field methods. We believe that the combination of instrumentation for remote data collection and machine learning models to process such data represents an important opportunity for NEON to expand the scope, scale, and usability of its biodiversity data collection while potentially reducing long‐term costs. In this manuscript, we first review the current status of instrument‐based biodiversity surveys within the NEON project and previous research at the intersection of biodiversity, instrumentation, and machine learning at NEON sites. We then survey methods that have been developed at other locations but could potentially be employed at NEON sites in future. Finally, we expand on these ideas in five case studies that we believe suggest particularly fruitful future paths for automated biodiversity measurement at NEON sites: acoustic recorders for sound‐producing taxa, camera traps for medium and large mammals, hydroacoustic and remote imagery for aquatic diversity, expanded remote and ground‐based measurements for plant biodiversity, and laboratory‐based imaging for physical specimens and samples in the NEON biorepository. Through its data science‐literate staff and user community, NEON has a unique role to play in supporting the growth of such automated biodiversity survey methods, as well as demonstrating their ability to help answer key ecological questions that cannot be answered at the more limited spatiotemporal scales of human‐driven surveys.
  3. Understanding the ranges of rare and endangered species is central to conserving biodiversity in the Anthropocene. Species distribution models (SDMs) have become a common and powerful tool for analyzing species–environment relationships across geographic space. Although evaluating the distribution of rare species is integral to their conservation, this can be difficult when limited distribution data are available. Community science platforms, such as iNaturalist, have emerged as alternative sources for species occurrence data. Although these observations are often thought to be of lower quality than those of natural history collections, they may have potential for improving SDMs for species with few occurrence records from collections. Here, we investigate the utility of iNaturalist data for developing SDMs for a rare high‐elevation plant, Telesonix jamesii. Because methods for modeling rare species are limited in the literature, five different modeling techniques were considered, including profile methods, statistical models, and machine learning algorithms. The inclusion of iNaturalist data doubled the number of usable records for T. jamesii. We found that a random forest (RF) model using ensemble training data performed best of any model (area under curve = 0.98). We then compared the performance of RF models that use only natural history training data and those that use a combination of natural history (herbarium specimens) and iNaturalist training data. All models heavily relied on climate data (mean temperature of driest quarter, and precipitation of the warmest quarter), indicating that this species is under threat as climate continues to change. Validation datasets affected model fits as well. Models using only herbarium data performed slightly worse when evaluated with cross‐validation than when validated externally with iNaturalist data. This study can serve as a model for future SDM studies of species with similar data limitations.
  4. Biodiversity is rapidly changing due to changes in the climate and human-related activities; thus, accurate predictions of species composition and diversity are critical to developing conservation actions and management strategies. In this paper, using satellite remote sensing products as covariates, we constructed stacked species distribution models (S-SDMs) under a Bayesian framework to build next-generation biodiversity models. Performance of these models was assessed using oak assemblages distributed across the continental United States obtained from the National Ecological Observatory Network (NEON). This study represents an attempt to evaluate the integrated predictions of biodiversity models, including assemblage diversity and composition, obtained by stacking next-generation SDMs. We found that applying constraints to assemblage predictions, such as using the probability ranking rule, does not improve biodiversity prediction models. Furthermore, we found that independent of the stacking procedure (bS-SDM versus pS-SDM versus cS-SDM), these kinds of next-generation biodiversity models do not accurately recover the observed species composition at the plot level or ecological-community scales (NEON plots are 400 m²). However, these models do return reasonable predictions at macroecological scales, i.e., moderately to highly correct assignments of species identities at the scale of NEON sites (mean area ~27 km²). Our results provide insights for advancing the accuracy of prediction of assemblage diversity and composition at different spatial scales globally. An important task for future studies is to evaluate the reliability of combining S-SDMs with direct detection of species using image spectroscopy to build a new generation of biodiversity models that accurately predict and monitor ecological assemblages through time and space.
  5. Citizen science biodiversity data present great opportunities for ecology and conservation across vast spatial and temporal scales. However, the opportunistic nature of these data lacks the sampling structure required by modeling methodologies that address a pervasive challenge in ecological data collection: imperfect detection, i.e., the likelihood of under-observing species on field surveys. Occupancy modeling is an example of an approach that accounts for imperfect detection by explicitly modeling the observation process separately from the biological process of habitat selection. This produces species distribution models that speak to the pattern of the species on a landscape after accounting for imperfect detection in the data, rather than the pattern of species observations corrupted by errors. To achieve this benefit, occupancy models require multiple surveys of a site across which the site's status (i.e., occupied or not) is assumed constant. Since citizen science data are not collected under the required repeated-visit protocol, observations may be grouped into sites post hoc. Existing approaches for constructing sites discard some observations and/or consider only geographic distance and not environmental similarity. In this study, we compare ten approaches for site construction in terms of their impact on downstream species distribution models for 31 bird species in Oregon, using observations recorded in the eBird database. We find that occupancy models built on sites constructed by spatial clustering algorithms perform better than existing alternatives. 
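
The post hoc site construction described in the title record and in item 5 can be illustrated with a minimal sketch. It is not the method of any paper listed here. It assumes each checklist is a hypothetical `(latitude, longitude, elevation_m)` tuple and links two checklists into the same site when they are within `max_km` of each other and their elevations differ by at most `max_env_diff`. The function names, the thresholds, and the use of elevation as the only environmental covariate are all assumptions made for illustration.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius_km * math.asin(math.sqrt(a))


def build_sites(checklists, max_km=1.0, max_env_diff=50.0):
    """Group checklists into sites by single-linkage clustering.

    checklists: list of (lat, lon, elevation_m) tuples (hypothetical format).
    Two checklists are linked when they are geographically close AND
    environmentally similar; sites are the connected components.
    Returns one integer site label per checklist, numbered by first appearance.
    """
    n = len(checklists)
    parent = list(range(n))  # union-find forest

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i in range(n):
        lat_i, lon_i, env_i = checklists[i]
        for j in range(i + 1, n):
            lat_j, lon_j, env_j = checklists[j]
            close = haversine_km(lat_i, lon_i, lat_j, lon_j) <= max_km
            similar = abs(env_i - env_j) <= max_env_diff
            if close and similar:
                ri, rj = find(i), find(j)
                if ri != rj:
                    parent[max(ri, rj)] = min(ri, rj)

    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


# Illustrative coordinates only (not real survey data).
checklists = [
    (44.5600, -123.2600, 70.0),   # checklist A
    (44.5605, -123.2605, 75.0),   # near A, similar elevation -> same site
    (44.5602, -123.2598, 400.0),  # near A, very different elevation -> own site
    (45.5000, -122.6000, 60.0),   # far away -> own site
]
print(build_sites(checklists))  # [0, 0, 1, 2]
```

Checklists sharing a label can then be treated as repeated visits to one site. The pairwise loop is quadratic in the number of checklists, so a dataset at the scale of eBird would need a spatial index or a dedicated clustering library instead.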