


This content will become publicly available on July 1, 2025

Title: occTest: An integrated approach for quality control of species occurrence data
Abstract Aim

Species occurrence data are valuable information that enables one to estimate geographical distributions, characterize niches and their evolution, and guide spatial conservation planning. Rapid increases in species occurrence data stem from increasing digitization and aggregation efforts, and citizen science initiatives. However, persistent quality issues in occurrence data can impact the accuracy of scientific findings, underscoring the importance of filtering erroneous occurrence records in biodiversity analyses.

Innovation

We introduce an R package, occTest, that synthesizes a growing open‐source ecosystem of biodiversity cleaning workflows to prepare occurrence data for different modelling applications. It offers a structured set of algorithms to identify potential problems with species occurrence records by employing a hierarchical organization of multiple tests. The workflow has a hierarchical structure organized in testPhases (i.e. cleaning vs. testing) that encompass different testBlocks grouping different testTypes (e.g. environmental outlier detection), which may use different testMethods (e.g. Rosner test, jackknife, etc.). Four different testBlocks characterize potential problems in geographic, environmental, human influence and temporal dimensions. Filtering and plotting functions are incorporated to facilitate the interpretation of tests. We provide examples with different data sources, with default and user‐defined parameters. Compared to other available tools and workflows, occTest offers a comprehensive suite of integrated tests, and allows multiple methods associated with each test to explore consensus among data cleaning methods. It uniquely incorporates both coordinate accuracy analysis and environmental analysis of occurrence records. Furthermore, it provides a hierarchical structure to incorporate future tests yet to be developed.
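
The hierarchy and the idea of consensus among methods can be illustrated with a minimal sketch. This is not the occTest API: it is written in Python rather than R, the names `HIERARCHY`, `score_records` and `filter_records` are hypothetical, the testPhases level is left out, and the two outlier methods (a median absolute deviation rule and Tukey's IQR fences) are simple stand-ins for the Rosner test and jackknife named in the abstract. Each testBlock groups testTypes, each testType lists several testMethods, and a record's score in a block is the fraction of methods that flag it.

```python
from statistics import median, quantiles


def mad_outliers(values, cutoff=3.5):
    """Flag values whose robust z-score (median absolute deviation) exceeds cutoff."""
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return [False] * len(values)
    return [0.6745 * abs(v - med) / mad > cutoff for v in values]


def iqr_outliers(values, k=1.5):
    """Flag values outside Tukey's fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, _, q3 = quantiles(values, n=4)
    low, high = q1 - k * (q3 - q1), q3 + k * (q3 - q1)
    return [v < low or v > high for v in values]


def invalid_coords(records):
    """Flag records whose coordinates fall outside the valid lat/lon range."""
    return [not (-90 <= r["lat"] <= 90 and -180 <= r["lon"] <= 180) for r in records]


def zero_coords(records):
    """Flag records located exactly at (0, 0), a common placeholder."""
    return [r["lat"] == 0 and r["lon"] == 0 for r in records]


# Hypothetical hierarchy: testBlock -> testType -> list of testMethods.
# Every method takes the records and returns one boolean flag per record.
HIERARCHY = {
    "geo": {"coordinateValidity": [invalid_coords, zero_coords]},
    "env": {
        "environmentalOutlier": [
            lambda recs: mad_outliers([r["temp"] for r in recs]),
            lambda recs: iqr_outliers([r["temp"] for r in recs]),
        ]
    },
}


def score_records(records, hierarchy):
    """Per record, return {block: fraction of that block's methods that flagged it}."""
    scores = [{} for _ in records]
    for block, test_types in hierarchy.items():
        flags = [m(records) for methods in test_types.values() for m in methods]
        for i in range(len(records)):
            scores[i][block] = sum(f[i] for f in flags) / len(flags)
    return scores


def filter_records(records, scores, threshold=0.5):
    """Keep records whose score stays below threshold in every block."""
    return [r for r, s in zip(records, scores) if all(v < threshold for v in s.values())]


records = [
    {"id": 1, "lat": 10.0, "lon": -70.0, "temp": 20.1},
    {"id": 2, "lat": 10.2, "lon": -70.3, "temp": 21.0},
    {"id": 3, "lat": 9.8, "lon": -69.9, "temp": 19.5},
    {"id": 4, "lat": 10.1, "lon": -70.1, "temp": 20.6},
    {"id": 5, "lat": 0.0, "lon": 0.0, "temp": 20.3},    # placeholder coordinates
    {"id": 6, "lat": 10.3, "lon": -70.2, "temp": 45.0},  # environmental outlier
    {"id": 7, "lat": 9.9, "lon": -70.4, "temp": 19.9},
    {"id": 8, "lat": 10.0, "lon": -69.8, "temp": 20.8},
]

scores = score_records(records, HIERARCHY)
kept = filter_records(records, scores, threshold=0.5)
print([r["id"] for r in kept])  # [1, 2, 3, 4, 7, 8]
```

Scoring per block rather than per method is what lets a user apply either a default threshold or a custom one: lowering `threshold` makes the filter stricter, while raising it keeps records that only a minority of methods flag.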

Main conclusions

occTest will help users understand the quality and quantity of data available before the start of data analysis, while also enabling users to filter data using either predefined rules or custom‐built rules. As a result, occTest can better assess each record's appropriateness for its intended application.

 
Award ID(s):
2225078
PAR ID:
10544284
Publisher / Repository:
Wiley
Date Published:
Journal Name:
Global Ecology and Biogeography
Volume:
33
Issue:
7
ISSN:
1466-822X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Premise

    Digitized biodiversity data offer extensive information; however, obtaining and processing biodiversity data can be daunting. Complexities arise during data cleaning, such as identifying and removing problematic records. To address these issues, we created the R package Geographic And Taxonomic Occurrence R‐based Scrubbing (gatoRs).

    Methods and Results

    The gatoRs workflow includes functions that streamline downloading records from the Global Biodiversity Information Facility (GBIF) and Integrated Digitized Biocollections (iDigBio). We also created functions to clean downloaded specimen records. Unlike previous R packages, gatoRs accounts for differences in download structure between GBIF and iDigBio and allows for user control via interactive cleaning steps.

    Conclusions

    Our pipeline enables the scientific community to process biodiversity data efficiently and is accessible to the R coding novice. We anticipate that gatoRs will be useful for both established and beginning users. Furthermore, we expect our package will facilitate the introduction of biodiversity‐related concepts into the classroom via the use of herbarium specimens.

     
  2. Abstract

    Conservation planning and decision‐making rely on evaluations of biodiversity status and threats that are based upon species' distribution estimates. However, gaps exist regarding automated tools to delineate species' current ranges from distribution estimates and use those estimates to calculate both species‐ and community‐level biodiversity metrics. Here, we introduce changeRangeR, an R package that facilitates workflows to reproducibly transform estimates of species' distributions into metrics relevant for conservation. For example, by combining predictions from species distribution models (SDMs) with other maps of environmental data (e.g., suitable forest cover), researchers can characterize the proportion of a species' range that is under protection, metrics used under the IUCN Criteria A and B guidelines (Area of Occupancy and Extent of Occurrence), and other more general metrics such as taxonomic and phylogenetic diversity and endemism. Further, changeRangeR facilitates temporal comparisons among biodiversity metrics to inform efforts toward complementarity and consideration of future scenarios in conservation decisions. changeRangeR also provides tools to determine the effects of modeling decisions through sensitivity tests. Transparent and repeatable workflows for calculating biodiversity change metrics from SDMs such as those provided by changeRangeR are essential to inform conservation decision‐making efforts and represent key extensions for SDM methodology and associated metadata documentation.

     
  3. Abstract

    Biodiversity studies rely heavily on estimates of species' distributions often obtained through ecological niche modelling. Numerous software packages exist that allow users to model ecological niches using machine learning and statistical methods. However, no existing package with a graphical user interface allows users to perform model calibration and selection based on convex forms such as ellipsoids, which may match fundamental ecological niche shapes better, incorporating tools for exploring, modelling, and evaluating niches and distributions that are intuitive for both novice and proficient users.

    Here we describe an R package, NicheToolBox (ntbox), that allows users to conduct all processing steps involved in ecological niche modelling: downloading and curating occurrence data, obtaining and transforming environmental data layers, selecting environmental variables, exploring relationships between geographic and environmental spaces, calibrating and selecting ellipsoid models, evaluating models using binomial and partial ROC tests, assessing extrapolation risk, and performing geographic information system operations via a graphical user interface. A summary of the entire workflow is produced for use as a stand‐alone algorithm or as part of research reports.

    The method is explained in detail and tested via modelling the threatened feline species Leopardus wiedii. Georeferenced occurrence data for this species are queried to display both point occurrences and the IUCN extent of occurrence polygon (IUCN, 2007). This information is used to illustrate tools available for accessing, processing and exploring biodiversity data (e.g. number of occurrences and chronology of collecting) and transforming environmental data (e.g. a summary PCA for 19 bioclimatic layers). Visualizations of three‐dimensional ecological niches modelled as minimum volume ellipsoids are developed with ancillary statistics. This niche model is then projected to geographic space, to represent a corresponding potential suitability map.

    Using ntbox allows a fast and straightforward means by which to retrieve and manipulate occurrence and environmental data, which can then be implemented in model calibration, projection and evaluation for assessing distributions of species in geographic space and their corresponding environmental combinations.

     
  4. Abstract Aim

    Phenology, the temporal response of a population to its climate, is a crucial behavioural trait shared across life on earth. How species adapt their phenologies to climate change is poorly understood but critical in understanding how species will respond to future change. We use a group of flies (Rhaphiomidas) endemic to the North American deserts to understand how species adapt to changing climatic conditions. Here, we explore a novel approach for taxa with constrained phenologies aimed to accurately model their environmental niche and relate this to phenological and morphological adaptations in a phylogenetic context.

    Taxon

    Insecta, Diptera, Mydidae, Rhaphiomidas.

    Location

    North America, Mojave, Sonoran and Chihuahuan Deserts.

    Methods

    We gathered geographical and phenological occurrence data for the entire genus Rhaphiomidas and estimated a time‐calibrated phylogeny. We compared Daymet‐derived temperature values for a species’ adult occurrence period (phenology) with those derived from WorldClim data partitioned by month or quarter to examine what effect using more precise data has on capturing a species’ environmental niche. We then examined to what extent phylogenetic signal in phenological traits, climate tolerance and morphology can inform us about how species adapt to different environmental regimes.

    Results

    We found that the Bioclim temperature data, which are averages across monthly intervals, poorly represent the climate windows to which adult flies are actually adapted. Using temporally relevant climate data, we show that many species use a combination of morphological and phenological changes to adapt to different climate regimes. There are also instances where species changed only phenology to track a climate type or only morphology to adapt to different environments.

    Main Conclusions

    Without using a fine‐scale phenological data approach, identifying environmental adaptations could be misleading because the data do not represent the conditions the animals are actually experiencing. We find that fine‐scale phenological niche models are needed when assessing taxa that have a discrete phenological window that is key to their survival, accurately linking environment to morphology and phenology. Using this approach, we show that Rhaphiomidas use a combination of niche tracking and adaptation to persist in new niches. Modelling the effect of phenology on such species’ niches will be critical for better predictions of how these species might respond to future climate change.

     
  5. Abstract

    Species occurrence data are foundational for research, conservation, and science communication, but the limited availability and accessibility of reliable data represents a major obstacle, particularly for insects, which face mounting pressures. We present BeeBDC, a new R package, and a global bee occurrence dataset to address this issue. We combined >18.3 million bee occurrence records from multiple public repositories (GBIF, SCAN, iDigBio, USGS, ALA) and smaller datasets, then standardised, flagged, deduplicated, and cleaned the data using the reproducible BeeBDC R workflow. Specifically, we harmonised species names (following established global taxonomy), country names, and collection dates, and we added record-level flags for a series of potential quality issues. These data are provided in two formats, “cleaned” and “flagged-but-uncleaned”. The BeeBDC package with online documentation provides end users the ability to modify filtering parameters to address their research questions. By publishing reproducible R workflows and globally cleaned datasets, we can increase the accessibility and reliability of downstream analyses. This workflow can be implemented for other taxa to support research and conservation.

     