This content will become publicly available on December 7, 2024

Title: Quantifying error in occurrence data: Comparing the data quality of iNaturalist and digitized herbarium specimen data in flowering plant families of the southeastern United States

iNaturalist has the potential to be an extremely rich source of organismal occurrence data. Launched in 2008, it contained over 150 million uploaded observations as of May 2023. Based on the findings of a limited number of past studies assessing the taxonomic accuracy of participatory science-driven sources of occurrence data such as iNaturalist, there has been concern that some portion of these records might be misidentified in certain taxonomic groups. In this case study, we compare Research Grade iNaturalist observations with digitized herbarium specimens, both of which are currently available for combined download from large data aggregators and are therefore the primary sources of occurrence data for large-scale biodiversity/biogeography studies. Our comparisons were confined regionally to the southeastern United States (Florida, Georgia, North Carolina, South Carolina, Texas, Tennessee, Kentucky, and Virginia). Occurrence records from ten plant families (Gentianaceae, Ericaceae, Melanthiaceae, Ulmaceae, Fabaceae, Asteraceae, Fagaceae, Cyperaceae, Juglandaceae, Apocynaceae) were downloaded and scored on taxonomic accuracy. We found a comparable and relatively low rate of misidentification among both digitized herbarium specimens and Research Grade iNaturalist observations within the study area. This finding illustrates the utility and high quality of iNaturalist data for future research in the region, but also points to key differences between data types, giving each a respective advantage, depending on applications of the data.

 
Award ID(s):
2027654
NSF-PAR ID:
10518722
Editor(s):
Qin, Hong
Publisher / Repository:
NSF Public Access Repository (NSF-PAR)
Journal Name:
PLOS ONE
Volume:
18
Issue:
12
ISSN:
1932-6203
Page Range / eLocation ID:
e0295298
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Understanding the ranges of rare and endangered species is central to conserving biodiversity in the Anthropocene. Species distribution models (SDMs) have become a common and powerful tool for analyzing species–environment relationships across geographic space. Although evaluating the distribution of rare species is integral to their conservation, this can be difficult when limited distribution data are available. Community science platforms, such as iNaturalist, have emerged as alternative sources for species occurrence data. Although these observations are often thought to be of lower quality than those of natural history collections, they may have potential for improving SDMs for species with few occurrence records from collections. Here, we investigate the utility of iNaturalist data for developing SDMs for a rare high‐elevation plant, Telesonix jamesii. Because methods for modeling rare species are limited in the literature, five different modeling techniques were considered, including profile methods, statistical models, and machine learning algorithms. The inclusion of iNaturalist data doubled the number of usable records for T. jamesii. We found that a random forest (RF) model using ensemble training data performed best of any model (area under the curve = 0.98). We then compared the performance of RF models that use only natural history training data and those that use a combination of natural history (herbarium specimens) and iNaturalist training data. All models heavily relied on climate data (mean temperature of driest quarter, and precipitation of the warmest quarter), indicating that this species is under threat as climate continues to change. Validation datasets affected model fits as well. Models using only herbarium data performed slightly worse when evaluated with cross‐validation than when validated externally with iNaturalist data. This study can serve as a model for future SDM studies of species with similar data limitations.

     
  2. Abstract

    Premise

    Digitized biodiversity data offer extensive information; however, obtaining and processing biodiversity data can be daunting. Complexities arise during data cleaning, such as identifying and removing problematic records. To address these issues, we created the R package Geographic And Taxonomic Occurrence R‐based Scrubbing (gatoRs).

    Methods and Results

    The gatoRs workflow includes functions that streamline downloading records from the Global Biodiversity Information Facility (GBIF) and Integrated Digitized Biocollections (iDigBio). We also created functions to clean downloaded specimen records. Unlike previous R packages, gatoRs accounts for differences in download structure between GBIF and iDigBio and allows for user control via interactive cleaning steps.

    Conclusions

    Our pipeline enables the scientific community to process biodiversity data efficiently and is accessible to the R coding novice. We anticipate that gatoRs will be useful for both established and beginning users. Furthermore, we expect our package will facilitate the introduction of biodiversity‐related concepts into the classroom via the use of herbarium specimens.

     
  3. Abstract

    Widespread specimen digitization has greatly enhanced the use of herbarium data in scientific research. Publications using herbarium data have increased exponentially over the last century. Here, we review changing uses of herbaria through time with a computational text analysis of 13,702 articles from 1923 to 2017 that quantitatively complements traditional review approaches. Although maintaining its core contribution to taxonomic knowledge, herbarium use has diversified from a few dominant research topics a century ago (e.g., taxonomic notes, botanical history, local observations), with many topics only recently emerging (e.g., biodiversity informatics, global change biology, DNA analyses). Specimens are now appreciated as temporally and spatially extensive sources of genotypic, phenotypic, and biogeographic data. Specimens are increasingly used in ways that influence our ability to steward future biodiversity. As we enter the Anthropocene, herbaria have likewise entered a new era with enhanced scientific, educational, and societal relevance.

     
  4. Native bee species in the United States provide invaluable pollination services. Concerns about native bee declines are growing, and there are calls for a national monitoring program. Documenting species ranges at ecologically meaningful scales through coverage completeness analysis is a fundamental step to track bees from species to communities. It may take decades before all existing bee specimens are digitized, so projections are needed now to focus future research and management efforts. From 1.923 million records, we created range maps for nearly 88% (3158 species) of bee species in the contiguous United States, provided the first analysis of inventory completeness for digitized specimens of a major insect clade, and perhaps most important, estimated spatial completeness accounting for all known bee specimens in USA collections, including undigitized bee specimens. Completeness analyses were very low (3–37%) across four examined spatial resolutions when using the currently available bee specimen records. Adding a subset of observations from community science data sources did not significantly increase completeness, and adding a projected 4.7 million undigitized specimens increased completeness by only an additional 12–13%. Assessments of data, including projected specimen records, indicate persistent taxonomic and geographic deficiencies. In conjunction with expedited digitization, new inventories that integrate community science data with specimen‐based documentation will be required to close these gaps. A combined effort involving both strategic inventories and accelerated digitization campaigns is needed for a more complete understanding of USA bee distributions. 
  5. Premise of the Study

    Phenological annotation models computed on large‐scale herbarium data sets were developed and tested in this study.

    Methods

    Herbarium specimens represent a significant resource with which to study plant phenology. Nevertheless, phenological annotation of herbarium specimens is time‐consuming, requires substantial human investment, and is difficult to mobilize at large taxonomic scales. We created and evaluated new methods based on deep learning techniques to automate annotation of phenological stages and tested these methods on four herbarium data sets representing temperate, tropical, and equatorial American floras.

    Results

    Deep learning allowed correct detection of fertile material with an accuracy of 96.3%. Accuracy was slightly decreased for finer‐scale information (84.3% for flower and 80.5% for fruit detection).

    Discussion

    The method described has the potential to allow fine‐grained phenological annotation of herbarium specimens at large ecological scales. Deeper investigation regarding the taxonomic scalability of this approach is needed.

     