Title: occTest: An integrated approach for quality control of species occurrence data
Abstract
Aim: Species occurrence data make it possible to estimate geographical distributions, characterize niches and their evolution, and guide spatial conservation planning. Rapid increases in species occurrence data stem from increasing digitization and aggregation efforts, and from citizen science initiatives. However, persistent quality issues in occurrence data can impact the accuracy of scientific findings, underscoring the importance of filtering erroneous occurrence records in biodiversity analyses.
Innovation: We introduce an R package, occTest, that synthesizes a growing open‐source ecosystem of biodiversity cleaning workflows to prepare occurrence data for different modelling applications. It offers a structured set of algorithms to identify potential problems with species occurrence records by employing a hierarchical organization of multiple tests. The workflow is organized in testPhases (i.e. cleaning vs. testing) that encompass different testBlocks grouping different testTypes (e.g. environmental outlier detection), which may use different testMethods (e.g. Rosner test, jackknife, etc.). Four different testBlocks characterize potential problems in geographic, environmental, human influence and temporal dimensions. Filtering and plotting functions are incorporated to facilitate the interpretation of tests. We provide examples with different data sources, with default and user‐defined parameters. Compared to other available tools and workflows, occTest offers a comprehensive suite of integrated tests, and allows multiple methods associated with each test to explore consensus among data cleaning methods. It uniquely incorporates both coordinate accuracy analysis and environmental analysis of occurrence records. Furthermore, it provides a hierarchical structure to incorporate future tests yet to be developed.
Main conclusions: occTest will help users understand the quality and quantity of data available before the start of data analysis, while also enabling users to filter data using either predefined rules or custom‐built rules. As a result, occTest can better assess each record's appropriateness for its intended application.
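The hierarchy described in the abstract (testPhases containing testBlocks, which group testTypes, each run by several testMethods) and the idea of consensus among methods can be illustrated with a short sketch. This is not the occTest API, which is an R package; the sketch is in Python, and every function name, record field and threshold below is a hypothetical choice made for illustration. The two outlier rules (median absolute deviation and interquartile range) are simple stand-ins, not the Rosner test or jackknife named in the abstract.

```python
from statistics import median
from typing import Callable, Dict, List

Record = Dict[str, float]
# A test method looks at all records and returns one flag per record (True = suspicious).
Method = Callable[[List[Record]], List[bool]]


def out_of_range(records: List[Record]) -> List[bool]:
    """Flag coordinates outside the valid latitude/longitude range."""
    return [not (-90 <= r["lat"] <= 90 and -180 <= r["lon"] <= 180) for r in records]


def zero_zero(records: List[Record]) -> List[bool]:
    """Flag records placed exactly at latitude 0, longitude 0."""
    return [r["lat"] == 0 and r["lon"] == 0 for r in records]


def mad_outlier(key: str, cutoff: float = 3.5) -> Method:
    """Stand-in outlier rule: modified z-score based on the median absolute deviation."""
    def method(records: List[Record]) -> List[bool]:
        values = [r[key] for r in records]
        med = median(values)
        mad = median([abs(v - med) for v in values])
        if mad == 0:
            return [False] * len(values)
        return [0.6745 * abs(v - med) / mad > cutoff for v in values]
    return method


def iqr_outlier(key: str, k: float = 1.5) -> Method:
    """Stand-in outlier rule: values beyond k interquartile ranges from the quartiles."""
    def method(records: List[Record]) -> List[bool]:
        ordered = sorted(r[key] for r in records)
        n = len(ordered)
        q1, q3 = ordered[n // 4], ordered[(3 * n) // 4]  # crude quartiles
        spread = q3 - q1
        return [r[key] < q1 - k * spread or r[key] > q3 + k * spread for r in records]
    return method


# testPhase -> testBlock -> testType -> testMethod (all names hypothetical).
HIERARCHY = {
    "testing": {
        "geo": {"coordinateValidity": {"range": out_of_range, "zeroZero": zero_zero}},
        "env": {"envOutlier": {"mad": mad_outlier("temp"), "iqr": iqr_outlier("temp")}},
    }
}


def run_tests(records: List[Record], hierarchy) -> List[Dict[str, float]]:
    """For each record and testType, return the fraction of methods that flagged it."""
    scores: List[Dict[str, float]] = [{} for _ in records]
    for blocks in hierarchy.values():
        for block, test_types in blocks.items():
            for test_type, methods in test_types.items():
                flags = [m(records) for m in methods.values()]
                for i in range(len(records)):
                    hits = sum(f[i] for f in flags)
                    scores[i][f"{block}.{test_type}"] = hits / len(flags)
    return scores


def keep(records, scores, threshold: float = 0.5):
    """Drop a record when any testType reaches the consensus threshold."""
    return [r for r, s in zip(records, scores) if all(v < threshold for v in s.values())]


records = [
    {"lat": 10, "lon": 20, "temp": 15.0},
    {"lat": 11, "lon": 21, "temp": 16.0},
    {"lat": 12, "lon": 22, "temp": 14.5},
    {"lat": 0, "lon": 0, "temp": 15.5},    # zero-zero coordinates
    {"lat": 13, "lon": 23, "temp": 16.5},
    {"lat": 14, "lon": 24, "temp": 60.0},  # environmental outlier
    {"lat": 95, "lon": 25, "temp": 15.2},  # latitude out of range
    {"lat": 15, "lon": 26, "temp": 15.8},
]
scores = run_tests(records, HIERARCHY)
filtered = keep(records, scores, threshold=0.5)
```

Scoring each testType as the share of its methods that flag a record keeps the filtering rule separate from the tests themselves, so a stricter or looser consensus threshold can be applied without rerunning anything.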
Award ID(s):
2225078
PAR ID:
10544284
Author(s) / Creator(s):
; ; ; ; ; ; ; ; ;
Publisher / Repository:
Wiley
Date Published:
Journal Name:
Global Ecology and Biogeography
Volume:
33
Issue:
7
ISSN:
1466-822X
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Premise: Digitized biodiversity data offer extensive information; however, obtaining and processing biodiversity data can be daunting. Complexities arise during data cleaning, such as identifying and removing problematic records. To address these issues, we created the R package Geographic And Taxonomic Occurrence R‐based Scrubbing (gatoRs). Methods and Results: The gatoRs workflow includes functions that streamline downloading records from the Global Biodiversity Information Facility (GBIF) and Integrated Digitized Biocollections (iDigBio). We also created functions to clean downloaded specimen records. Unlike previous R packages, gatoRs accounts for differences in download structure between GBIF and iDigBio and allows for user control via interactive cleaning steps. Conclusions: Our pipeline enables the scientific community to process biodiversity data efficiently and is accessible to the R coding novice. We anticipate that gatoRs will be useful for both established and beginning users. Furthermore, we expect our package will facilitate the introduction of biodiversity‐related concepts into the classroom via the use of herbarium specimens.
  2. Abstract Species occurrence data are foundational for research, conservation, and science communication, but the limited availability and accessibility of reliable data represents a major obstacle, particularly for insects, which face mounting pressures. We present BeeBDC, a new R package, and a global bee occurrence dataset to address this issue. We combined >18.3 million bee occurrence records from multiple public repositories (GBIF, SCAN, iDigBio, USGS, ALA) and smaller datasets, then standardised, flagged, deduplicated, and cleaned the data using the reproducible BeeBDC R workflow. Specifically, we harmonised species names (following established global taxonomy), country names, and collection dates, and we added record-level flags for a series of potential quality issues. These data are provided in two formats, “cleaned” and “flagged-but-uncleaned”. The BeeBDC package with online documentation provides end users the ability to modify filtering parameters to address their research questions. By publishing reproducible R workflows and globally cleaned datasets, we can increase the accessibility and reliability of downstream analyses. This workflow can be implemented for other taxa to support research and conservation.
  3. Though data cleaning systems have achieved great success and widespread adoption in both academia and industry, they fall short when trying to clean spatial data. The main reason is that state-of-the-art data cleaning systems mainly rely on functional dependency rules, which require sufficient co-occurrence of value pairs to learn that a certain value of one attribute leads to a corresponding value of another attribute. However, for spatial attributes that represent locations, there is very little chance that two records would have exactly the same coordinates, and hence co-occurrence is unlikely to exist. This paper presents Sparcle (SPatially-AwaRe CLEaning), a novel framework that injects spatial awareness into the core engine of rule-based data cleaning systems through two main concepts: (1) Spatial Neighborhood, where co-occurrence is relaxed to mean being within a certain spatial proximity rather than having the same exact value, and (2) Distance Weighting, where records are weighted by their relative distance when deciding whether they satisfy a dependency rule. Experimental results using a real deployment of Sparcle inside a state-of-the-art data cleaning system, and real and synthetic datasets, show that Sparcle significantly boosts the accuracy of data cleaning systems when dealing with spatial data.
  4. Introduction: Forecasting range shifts in response to climate change requires accurate species distribution models (SDMs), particularly at the margins of species' ranges. However, most studies producing SDMs rely on sparse species occurrence datasets from herbarium records and public databases, along with random pseudoabsences. While environmental covariates used to fit SDMs are increasingly precise due to satellite data, the availability of species occurrence records is still a large source of bias in model predictions. We developed distribution models for hybridizing sister species of western and eastern Joshua trees (Yucca brevifolia and Y. jaegeriana, respectively), iconic Mojave Desert species that are threatened by climate change and habitat loss. Methods: We conducted an intensive visual grid search of online satellite imagery for 672,043 0.25 km² grid cells to identify the two species' presences and absences on the landscape with exceptional resolution, and field validated 29,050 cells in 15,001 km of driving. We used the resulting presence/absence data to train SDMs for each Joshua tree species, revealing the contemporary environmental gradients (during the past 40 years) with greatest influence on the current distribution of adult trees. Results: While the environments occupied by Y. brevifolia and Y. jaegeriana were similar in total aridity, they differed with respect to seasonal precipitation and temperature ranges, suggesting the two species may have differing responses to climate change. Moreover, the species showed differing potential to occupy each other's geographic ranges: modeled potential habitat for Y. jaegeriana extends throughout the range of Y. brevifolia, while potential habitat for Y. brevifolia is not well represented within the range of Y. jaegeriana. Discussion: By reproducing the current range of the Joshua trees with high fidelity, our dataset can serve as a baseline for future research, monitoring, and management of this species, including an increased understanding of dynamics at the trailing and leading margins of the species' ranges and potential for climate refugia.
  5. Abstract Background: Analysis of the relationship between chromosomal structural variation (synteny breaks) and 3D-chromatin architectural changes among closely related species has the potential to reveal causes and correlates between chromosomal change and chromatin remodeling. Of note, contrary to extensive studies in animal species, the pace and pattern of chromatin architectural changes following the speciation of plants remain unexplored; moreover, there is little exploration of the occurrence of synteny breaks in the context of multiple genome topological hierarchies within the same model species. Results: Here we used Hi-C and epigenomic analyses to characterize and compare the profiles of hierarchical chromatin architectural features in representative species of the cotton tribe (Gossypieae), including Gossypium arboreum, Gossypium raimondii, and Gossypioides kirkii, which differ with respect to chromosome rearrangements. We found that (i) overall chromatin architectural territories were preserved in Gossypioides and Gossypium, which was reflected in their similar intra-chromosomal contact patterns and spatial chromosomal distributions; (ii) the non-random preferential occurrence of synteny breaks in the A compartment is significantly associated with the B-to-A compartment switch in syntenic blocks flanking synteny breaks; (iii) synteny changes co-localize with open-chromatin boundaries of topologically associating domains (TADs), while TAD stabilization has a greater influence on regulating orthologous expression divergence than do rearrangements; and (iv) rearranged chromosome segments largely maintain ancestral in-cis interactions. Conclusions: Our findings provide insights into the non-random occurrence of epigenomic remodeling relative to the genomic landscape and its evolutionary and functional connections to alterations of hierarchical chromatin architecture, on a known evolutionary timescale.