skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: occTest : An integrated approach for quality control of species occurrence data
Abstract AimSpecies occurrence data are valuable information that enables one to estimate geographical distributions, characterize niches and their evolution, and guide spatial conservation planning. Rapid increases in species occurrence data stem from increasing digitization and aggregation efforts, and citizen science initiatives. However, persistent quality issues in occurrence data can impact the accuracy of scientific findings, underscoring the importance of filtering erroneous occurrence records in biodiversity analyses. InnovationWe introduce an R package, occTest, that synthesizes a growing open‐source ecosystem of biodiversity cleaning workflows to prepare occurrence data for different modelling applications. It offers a structured set of algorithms to identify potential problems with species occurrence records by employing a hierarchical organization of multiple tests. The workflow has a hierarchical structure organized in testPhases(i.e. cleaning vs. testing)that encompass different testBlocksgrouping differenttestTypes(e.g.environmental outlier detection), which may use differenttestMethods(e.g.Rosner test, jacknife,etc.). Four differenttestBlockscharacterize potential problems in geographic, environmental, human influence and temporal dimensions. Filtering and plotting functions are incorporated to facilitate the interpretation of tests. We provide examples with different data sources, with default and user‐defined parameters. Compared to other available tools and workflows, occTest offers a comprehensive suite of integrated tests, and allows multiple methods associated with each test to explore consensus among data cleaning methods. It uniquely incorporates both coordinate accuracy analysis and environmental analysis of occurrence records. Furthermore, it provides a hierarchical structure to incorporate future tests yet to be developed. Main conclusionsoccTest will help users understand the quality and quantity of data available before the start of data analysis, while also enabling users to filter data using either predefined rules or custom‐built rules. As a result, occTest can better assess each record's appropriateness for its intended application.  more » « less
Award ID(s):
2225078
PAR ID:
10544284
Author(s) / Creator(s):
; ; ; ; ; ; ; ; ;
Publisher / Repository:
wiley
Date Published:
Journal Name:
Global Ecology and Biogeography
Volume:
33
Issue:
7
ISSN:
1466-822X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract PremiseDigitized biodiversity data offer extensive information; however, obtaining and processing biodiversity data can be daunting. Complexities arise during data cleaning, such as identifying and removing problematic records. To address these issues, we created the R package Geographic And Taxonomic Occurrence R‐based Scrubbing (gatoRs). Methods and ResultsThe gatoRs workflow includes functions that streamline downloading records from the Global Biodiversity Information Facility (GBIF) and Integrated Digitized Biocollections (iDigBio). We also created functions to clean downloaded specimen records. Unlike previous R packages, gatoRs accounts for differences in download structure between GBIF and iDigBio and allows for user control via interactive cleaning steps. ConclusionsOur pipeline enables the scientific community to process biodiversity data efficiently and is accessible to the R coding novice. We anticipate that gatoRs will be useful for both established and beginning users. Furthermore, we expect our package will facilitate the introduction of biodiversity‐related concepts into the classroom via the use of herbarium specimens. 
    more » « less
  2. Abstract Species occurrence data are foundational for research, conservation, and science communication, but the limited availability and accessibility of reliable data represents a major obstacle, particularly for insects, which face mounting pressures. We presentBeeBDC, a newRpackage, and a global bee occurrence dataset to address this issue. We combined >18.3 million bee occurrence records from multiple public repositories (GBIF, SCAN, iDigBio, USGS, ALA) and smaller datasets, then standardised, flagged, deduplicated, and cleaned the data using the reproducibleBeeBDC R-workflow. Specifically, we harmonised species names (following established global taxonomy), country names, and collection dates and, we added record-level flags for a series of potential quality issues. These data are provided in two formats, “cleaned” and “flagged-but-uncleaned”. TheBeeBDCpackage with online documentation provides end users the ability to modify filtering parameters to address their research questions. By publishing reproducibleRworkflows and globally cleaned datasets, we can increase the accessibility and reliability of downstream analyses. This workflow can be implemented for other taxa to support research and conservation. 
    more » « less
  3. Though data cleaning systems have earned great success and wide spread in both academia and industry, they fall short when trying to clean spatial data. The main reason is that state-of-the-art data cleaning systems mainly rely on functional dependency rules where there is sufficient co-occurrence of value pairs to learn that a certain value of an attribute leads to a corresponding value of another attribute. However, for spatial attributes that represent locations, there is very little chance that two records would have the same exact coordinates, and hence co-occurrence is unlikely to exist. This paper presents Sparcle (SPatially-AwaRe CLEaning); a novel framework that injects spatial awareness into the core engine of rule-based data cleaning systems through two main concepts: (1)Spatial Neighborhood, where co-occurrence is relaxed to be within a certain spatial proximity rather than same exact value, and (2)Distance Weighting, where records are given different weights of whether they satisfy a dependency rule, based on their relative distance. Experimental results using a real deployment of Sparcle inside a state-of-the-art data cleaning system, and real and synthetic datasets, show that Sparcle significantly boosts the accuracy of data cleaning systems when dealing with spatial data. 
    more » « less
  4. Abstract 1. The occurrence and distributions of wildlife populations and communities are shifting as a result of global changes. To evaluate whether these shifts are negatively impacting biodiversity processes, it is critical to monitor the status, trends and effects of environmental variables on entire communities. However, modelling the dynamics of multiple species simultaneously can require large amounts of diverse data, and few modelling approaches exist to simultaneously provide species and community‐level inferences. 2. We present an ‘integrated community occupancy model’ (ICOM) that unites principles of data integration and hierarchical community modelling in a single framework to provide inferences on species‐specific and community occurrence dynamics using multiple data sources. The ICOM combines replicated and nonreplicated detection–nondetection data sources using a hierarchical framework that explicitly accounts for different detection and sampling processes across data sources. We use simulations to compare the ICOM to previously developed hierarchical community occupancy models and single species integrated distribution models. We then apply our model to assess the occurrence and biodiversity dynamics of foliage‐gleaning birds in the White Mountain National Forest in the northeastern USA from 2010 to 2018 using three independent data sources. 3. Simulations reveal that integrating multiple data sources in the ICOM increased precision and accuracy of species and community‐level inferences compared to single data source models, although benefits of integration were dependent on the information content of individual data sources (e.g. amount of replication). Compared to single species models, the ICOM yielded more precise species‐level estimates. Within our case study, the ICOM had the highest out‐of‐sample predictive performance compared to single species models and models that used only a subset of the three data sources. 4. The ICOM provides more precise estimates of occurrence dynamics compared to multi‐species models using single data sources or integrated single‐species models. We further found that the ICOM had improved predictive performance across a broad region of interest with an empirical case study of forest birds. The ICOM offers an attractive approach to estimate species and biodiversity dynamics, which is additionally valuable to inform management objectives of both individual species and their broader communities. 
    more » « less
  5. IntroductionForecasting range shifts in response to climate change requires accurate species distribution models (SDMs), particularly at the margins of species' ranges. However, most studies producing SDMs rely on sparse species occurrence datasets from herbarium records and public databases, along with random pseudoabsences. While environmental covariates used to fit SDMS are increasingly precise due to satellite data, the availability of species occurrence records is still a large source of bias in model predictions. We developed distribution models for hybridizing sister species of western and eastern Joshua trees (Yucca brevifoliaandY. jaegeriana, respectively), iconic Mojave Desert species that are threatened by climate change and habitat loss. MethodsWe conducted an intensive visual grid search of online satellite imagery for 672,043 0.25 km2grid cells to identify the two species' presences and absences on the landscape with exceptional resolution, and field validated 29,050 cells in 15,001 km of driving. We used the resulting presence/absence data to train SDMs for each Joshua tree species, revealing the contemporary environmental gradients (during the past 40 years) with greatest influence on the current distribution of adult trees. ResultsWhile the environments occupied byY. brevifoliaandY. jaegerianawere similar in total aridity, they differed with respect to seasonal precipitation and temperature ranges, suggesting the two species may have differing responses to climate change. Moreover, the species showed differing potential to occupy each other's geographic ranges: modeled potential habitat forY. jaegerianaextends throughout the range ofY. brevifolia, while potential habitat forY. brevifoliais not well represented within the range ofY. jaegeriana. DiscussionBy reproducing the current range of the Joshua trees with high fidelity, our dataset can serve as a baseline for future research, monitoring, and management of this species, including an increased understanding of dynamics at the trailing and leading margins of the species' ranges and potential for climate refugia. 
    more » « less