This content will become publicly available on March 1, 2025

Title: Geographic And Taxonomic Occurrence R‐based Scrubbing (gatoRs): An R package and workflow for processing biodiversity data
Abstract

Premise

Digitized biodiversity data offer extensive information; however, obtaining and processing biodiversity data can be daunting. Complexities arise during data cleaning, such as identifying and removing problematic records. To address these issues, we created the R package Geographic And Taxonomic Occurrence R‐based Scrubbing (gatoRs).

Methods and Results

The gatoRs workflow includes functions that streamline downloading records from the Global Biodiversity Information Facility (GBIF) and Integrated Digitized Biocollections (iDigBio). We also created functions to clean downloaded specimen records. Unlike previous R packages, gatoRs accounts for differences in download structure between GBIF and iDigBio and allows for user control via interactive cleaning steps.

Conclusions

Our pipeline enables the scientific community to process biodiversity data efficiently and is accessible to the R coding novice. We anticipate that gatoRs will be useful for both established and beginning users. Furthermore, we expect our package will facilitate the introduction of biodiversity‐related concepts into the classroom via the use of herbarium specimens.

 
Award ID(s):
2027654
NSF-PAR ID:
10518749
Publisher / Repository:
NSF Public Access Repository (NSF-PAR)
Journal Name:
Applications in Plant Sciences
Volume:
12
Issue:
2
ISSN:
2168-0450
Subject(s) / Keyword(s):
GBIF basis cleaning biodiversity data download herbaria iDigBio locality cleaning spatial correction taxonomic harmonization
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Aim

    Species occurrence data are valuable information that enables one to estimate geographical distributions, characterize niches and their evolution, and guide spatial conservation planning. Rapid increases in species occurrence data stem from increasing digitization and aggregation efforts, and citizen science initiatives. However, persistent quality issues in occurrence data can impact the accuracy of scientific findings, underscoring the importance of filtering erroneous occurrence records in biodiversity analyses.

    Innovation

    We introduce an R package, occTest, that synthesizes a growing open‐source ecosystem of biodiversity cleaning workflows to prepare occurrence data for different modelling applications. It offers a structured set of algorithms to identify potential problems with species occurrence records by employing a hierarchical organization of multiple tests. The workflow has a hierarchical structure organized in testPhases (i.e. cleaning vs. testing) that encompass different testBlocks grouping different testTypes (e.g. environmental outlier detection), which may use different testMethods (e.g. Rosner test, jackknife, etc.). Four different testBlocks characterize potential problems in geographic, environmental, human influence and temporal dimensions. Filtering and plotting functions are incorporated to facilitate the interpretation of tests. We provide examples with different data sources, with default and user‐defined parameters. Compared to other available tools and workflows, occTest offers a comprehensive suite of integrated tests, and allows multiple methods associated with each test to explore consensus among data cleaning methods. It uniquely incorporates both coordinate accuracy analysis and environmental analysis of occurrence records. Furthermore, it provides a hierarchical structure to incorporate future tests yet to be developed.

    Main conclusions

    occTest will help users understand the quality and quantity of data available before the start of data analysis, while also enabling users to filter data using either predefined rules or custom‐built rules. As a result, occTest can better assess each record's appropriateness for its intended application.

     
  2. Abstract

    Species occurrence data are foundational for research, conservation, and science communication, but the limited availability and accessibility of reliable data represents a major obstacle, particularly for insects, which face mounting pressures. We present BeeBDC, a new R package, and a global bee occurrence dataset to address this issue. We combined >18.3 million bee occurrence records from multiple public repositories (GBIF, SCAN, iDigBio, USGS, ALA) and smaller datasets, then standardised, flagged, deduplicated, and cleaned the data using the reproducible BeeBDC R workflow. Specifically, we harmonised species names (following established global taxonomy), country names, and collection dates, and we added record-level flags for a series of potential quality issues. These data are provided in two formats, “cleaned” and “flagged-but-uncleaned”. The BeeBDC package with online documentation provides end users the ability to modify filtering parameters to address their research questions. By publishing reproducible R workflows and globally cleaned datasets, we can increase the accessibility and reliability of downstream analyses. This workflow can be implemented for other taxa to support research and conservation.

     
  3. The Global Biodiversity Information Facility (GBIF 2022a) has indexed more than 2 billion occurrence records from 70,147 datasets. These datasets often include "hidden" biotic interaction data because biodiversity communities use the Darwin Core standard (DwC, Wieczorek et al. 2012) in different ways to document biotic interactions. In this study, we extracted biotic interactions from GBIF data using an approach similar to that employed in the Global Biotic Interactions (GloBI; Poelen et al. 2014) and summarized the results. Here we aim to present an estimation of the interaction data available in GBIF, showing that biotic interaction claims can be automatically found and extracted from GBIF. Our results suggest that much can be gained by an increased focus on development of tools that help to index and curate biotic interaction data in existing datasets. Combined with data standardization and best practices for sharing biotic interactions, such as the initiative on plant-pollinators interaction (Salim 2022), this approach can rapidly contribute to and meet open data principles (Wilkinson 2016). We used Preston (Elliott et al. 2020), open-source software that versions biodiversity datasets, to copy all GBIF-indexed datasets. The biodiversity data graph version (Poelen 2020) of the GBIF-indexed datasets used during this study contains 58,504 datasets in Darwin Core Archive (DwC-A) format, totaling 574,715,196 records. After retrieval and verification, the datasets were processed using Elton. Elton extracts biotic interaction data and supports 20+ existing file formats, including various types of data elements in DwC records. Elton also helps align interaction claims (e.g., host of, parasite of, associated with) to the Relations Ontology (RO, Mungall 2022), making it easier to discover datasets across a heterogeneous collection of datasets. 
Using a specific mapping between interaction claims found in the DwC records and the terms in RO, Elton found 30,167,984 potential records (with non-empty values for the scanned DwC terms) and 15,248,478 records with recognized interaction types. Taxonomic name validation was performed using Nomer, which maps input names to names found in a variety of taxonomic catalogs. We only considered an interaction record valid where the interaction type could be mapped to a term in RO and where Nomer found a valid name for source and target taxa. Based on the workflow described in Fig. 1, we found 7,947,822 interaction records (52% of the potential interactions). Most of them were generic interactions (interacts_with, 87.5%), but the remaining 12.5% (993,477 records) included host-parasite and plant-animal interactions. The majority of the interaction records found involved plants (78%), animals (14%) and fungi (6%). In conclusion, there are many biotic interactions embedded in existing datasets registered in large biodiversity data indexers and aggregators like iDigBio, GBIF, and BioCASE. We exposed these biotic interaction claims using the combined functionality of biodiversity data tools Elton (for interaction data extraction), Preston (for reliable dataset tracking) and Nomer (for taxonomic name alignment). Nonetheless, the development of new vocabularies, standards and best practice guides would facilitate aggregation of interaction data, including the diversification of the GBIF data model (GBIF 2022b) for sharing biodiversity data beyond occurrence data. That is the aim of the TDWG Interest Group on Biological Interactions Data (TDWG 2022).
  4. Biodiversity datasets, or descriptions of biodiversity datasets, are increasingly available through open digital data infrastructures such as the Global Biodiversity Information Facility (GBIF, https://gbif.org), Integrated Digitized Biocollections (iDigBio, https://www.idigbio.org) and the Biological Collection Access Service (BioCASE, http://www.biocase.org).

    However, little is known about how these networks, and the data accessed through them, change over time. This dataset provides snapshots of the biodiversity dataset graphs as tracked by Preston (https://github.com/bio-guoda/preston, https://doi.org/10.5281/zenodo.1410543).

    The rdf/nquad and tsv snapshots were generated using the respective commands:

    preston ls | bzip2 > preston-ls.nq.bz2

    and

    preston ls --log tsv | bzip2 > preston-ls.tsv.bz2

    For convenience, the first 100 uncompressed entries of both files are included, as well as the sha256 hashes of the content of the files.
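
As a rough illustration of how such convenience files could be produced, the sketch below takes the first 100 uncompressed entries of a compressed snapshot and records a sha256 hash of its content. It is not the procedure used for this dataset. The dummy input stands in for real `preston ls` output so the example runs without Preston, and the output names `preston-ls.nq.head100` and `preston-ls.nq.bz2.sha256` are assumptions made for this example. It also assumes `bzip2` and GNU `sha256sum` are installed.

```shell
# Stand-in for a real snapshot: 250 dummy lines compressed with bzip2.
# (With Preston installed, this file would instead come from
#  `preston ls | bzip2 > preston-ls.nq.bz2`.)
seq 1 250 | sed 's/^/entry-/' | bzip2 > preston-ls.nq.bz2

# Keep the first 100 uncompressed entries for quick inspection.
# The output file name is hypothetical.
bzip2 -dc preston-ls.nq.bz2 | head -n 100 > preston-ls.nq.head100

# Record the sha256 hash of the compressed file's content.
# The hash file name is hypothetical.
sha256sum preston-ls.nq.bz2 > preston-ls.nq.bz2.sha256

# Later, confirm that the file still matches the recorded hash.
sha256sum -c preston-ls.nq.bz2.sha256
```

The same steps would apply to the tsv snapshot by substituting `preston-ls.tsv.bz2` for the input file.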

     
  5. The Landsat satellites provide decades of near‐global surface reflectance measurements that are increasingly used to assess interannual changes in terrestrial ecosystem function. These assessments often rely on spectral indices related to vegetation greenness and productivity (e.g. Normalized Difference Vegetation Index, NDVI). Nevertheless, multiple factors impede multi‐decadal assessments of spectral indices using Landsat satellite data, including ease of data access and cleaning, as well as lingering issues with cross‐sensor calibration and challenges with irregular timing of cloud‐free acquisitions. To help address these problems, we developed the ‘LandsatTS’ package for R. This software package facilitates sample‐based time series analysis of surface reflectance and spectral indices derived from Landsat sensors. The package includes functions that enable the extraction of the full Landsat 5, 7, and 8 records from Collection 2 for point sample locations or small study regions using Google Earth Engine accessed directly from R. Moreover, the package includes functions for 1) rigorous data cleaning, 2) cross‐sensor calibration, 3) phenological modeling, and 4) time series analysis. For an example application, we show how ‘LandsatTS’ can be used to assess changes in annual maximum vegetation greenness from 2000 to 2022 across the Noatak National Preserve in northern Alaska, USA. Overall, this software provides a suite of functions to enable broader use of Landsat satellite data for assessing and monitoring terrestrial ecosystem function during recent decades across local to global geographic extents.

     