Abstract
Premise: One of the slowest steps in digitizing natural history collections is converting the labels associated with specimens into digital data records usable for collections management and research. Here, we address how herbarium specimen labels can be converted into digital data records via extraction into standardized Darwin Core fields.
Methods: We first describe the development of a rule-based approach and compare its outcomes with a large language model-based approach, in particular ChatGPT4. For both approaches, we quantified omission and commission error rates across target fields for a set of labels transcribed using optical character recognition (OCR). For example, we found that ChatGPT4 often creates field names that are not Darwin Core compliant, while rule-based approaches often have high commission error rates.
Results: The two approaches have different strengths and limitations. We therefore developed an ensemble approach that leverages the strengths of each individual method, and we documented that ensembling strongly reduced overall information extraction errors.
Discussion: This work shows that an ensemble approach has particular value for creating high-quality digital data records, even for complicated label content. While human validation is still needed to ensure the best possible quality, automated approaches can speed the digitization of herbarium specimen labels and are likely to be broadly usable across natural history collection types.
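To make the ensemble idea concrete, here is a minimal Python sketch (ours, not the authors' published pipeline). It assumes each extractor returns a dict keyed by field name, restricts output to a small illustrative subset of Darwin Core terms, and scores omission and commission errors against a hand-validated gold record.

```python
# A minimal sketch of the ensemble and scoring ideas, assuming each
# extractor returns a dict of field name -> value. All names here are
# illustrative, not the authors' actual pipeline.

DWC_TERMS = {"scientificName", "recordedBy", "eventDate", "country",
             "stateProvince", "locality", "habitat", "catalogNumber"}

def ensemble(rule_rec: dict, llm_rec: dict) -> dict:
    """Keep only valid Darwin Core terms (dropping the non-compliant
    field names an LLM may invent); prefer the LLM value when present,
    falling back to the rule-based value."""
    merged = {term: llm_rec.get(term) or rule_rec.get(term)
              for term in DWC_TERMS}
    return {k: v for k, v in merged.items() if v}

def error_rates(pred: dict, gold: dict) -> tuple[float, float]:
    """Omission: gold fields the prediction left empty.
    Commission: predicted fields that disagree with gold."""
    omission = sum(1 for k in gold if not pred.get(k)) / len(gold)
    commission = sum(1 for k, v in pred.items()
                     if gold.get(k) != v) / max(len(pred), 1)
    return omission, commission
```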
Humans in the loop: Community science and machine learning synergies for overcoming herbarium digitization bottlenecks
Abstract
Premise: Among the slowest steps in the digitization of natural history collections is converting imaged labels into digital text. We present here a working solution to overcome this long-recognized efficiency bottleneck that leverages synergies between community science efforts and machine learning approaches.
Methods: We present two new semi-automated services. The first detects and classifies typewritten, handwritten, or mixed labels on herbarium sheets. The second transcribes label text using an optical character recognition (OCR) workflow tuned for specimen labels. The label finder and classifier was built via humans-in-the-loop processes that use the community science Notes from Nature platform to develop training and validation data sets that feed into a machine learning pipeline.
Results: Our results showcase a >93% success rate for finding and classifying main labels. The OCR pipeline optimizes pre-processing, multiple OCR engines, and post-processing steps, including an alignment approach borrowed from molecular systematics. This pipeline yields >4-fold reductions in errors compared to off-the-shelf open-source solutions. The OCR workflow also allows human validation using a custom Notes from Nature tool.
Discussion: Our work showcases a usable set of tools for herbarium digitization, including a custom-built, freely accessible web application. Further work to better integrate these services into existing toolkits can support broad community use.
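The alignment step can be pictured with a toy consensus function. The sketch below is illustrative only and sidesteps the hard part: it assumes the per-engine OCR outputs have already been aligned to equal length with "-" gap characters, as in a multiple sequence alignment, and simply takes a per-column majority vote.

```python
from collections import Counter

# Consensus over pre-aligned OCR reads: for each character column,
# ignore gap characters and keep the majority vote.
def consensus(aligned_reads: list[str], gap: str = "-") -> str:
    out = []
    for column in zip(*aligned_reads):
        votes = Counter(c for c in column if c != gap)
        if votes:
            out.append(votes.most_common(1)[0][0])
    return "".join(out)

reads = ["Quercus a1ba L.",   # engine 1 misreads "l" as "1"
         "Quercus alba I.",   # engine 2 misreads "L" as "I"
         "Quercus alba L."]   # engine 3 is correct
print(consensus(reads))       # -> "Quercus alba L."
```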
- PAR ID: 10566017
- Publisher / Repository: Applications in Plant Sciences
- Date Published:
- Journal Name: Applications in Plant Sciences
- Volume: 12
- Issue: 1
- ISSN: 2168-0450
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Historical data sources, like medical records or biological collections, consist of unstructured, heterogeneous content: handwritten text, fonts of different sizes and types, and text overlapped with lines, images, stamps, and sketches. The information these documents provide is important both from a historical perspective and, chiefly, because we can learn from it. Automatic digitization of these historical documents is a complex machine learning process that usually produces poor results, requiring costly interventions by experts who must transcribe and interpret the content. This paper describes hybrid (human- and machine-intelligent) workflows for scientific data extraction that combine machine learning and crowdsourcing software elements. Our results demonstrate that mixing human and machine processes has advantages in data extraction time and quality when compared to a machine-only workflow. More specifically, we show that OCRopus and Tesseract, two widely used open-source Optical Character Recognition (OCR) tools, improve their accuracy by more than 42% when text areas are cropped by humans prior to OCR, while total time can increase or decrease depending on the OCR engine selected. The digitization of 400 images of Entomology, Bryophyte, and Lichen specimens is evaluated following four different approaches: processing the whole specimen image (machine-only), processing crowd-cropped labels (hybrid), processing crowd-cropped fields (hybrid), and cleaning the machine-only output. As a secondary result, our experiments reveal differences in speed and quality between Tesseract and OCRopus.
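As a rough illustration of the crop-then-OCR comparison (the file name and box coordinates below are made up, and a local Tesseract install is required), the hybrid step amounts to something like:

```python
from PIL import Image
import pytesseract  # Python wrapper; requires Tesseract installed locally

# Run Tesseract on the whole sheet versus on a human-supplied label crop.
img = Image.open("specimen_sheet.jpg")           # illustrative path
whole_sheet_text = pytesseract.image_to_string(img)

# (left, upper, right, lower) pixel box a volunteer drew around the label
label_box = (1420, 2350, 2280, 2900)             # made-up coordinates
label_text = pytesseract.image_to_string(img.crop(label_box))

print(label_text)  # typically far cleaner than whole_sheet_text
```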
Abstract
Premise: Digitized biodiversity data offer extensive information; however, obtaining and processing biodiversity data can be daunting. Complexities arise during data cleaning, such as identifying and removing problematic records. To address these issues, we created the R package Geographic And Taxonomic Occurrence R-based Scrubbing (gatoRs).
Methods and Results: The gatoRs workflow includes functions that streamline downloading records from the Global Biodiversity Information Facility (GBIF) and Integrated Digitized Biocollections (iDigBio). We also created functions to clean downloaded specimen records. Unlike previous R packages, gatoRs accounts for differences in download structure between GBIF and iDigBio and allows for user control via interactive cleaning steps.
Conclusions: Our pipeline enables the scientific community to process biodiversity data efficiently and is accessible to the R coding novice. We anticipate that gatoRs will be useful for both established and beginning users. Furthermore, we expect our package will facilitate the introduction of biodiversity-related concepts into the classroom via the use of herbarium specimens.
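gatoRs itself is an R package; for readers outside R, the following pandas sketch shows the flavor of two cleaning steps it automates, assuming a CSV of occurrence records with illustrative Darwin Core column names:

```python
import pandas as pd

# Not gatoRs itself: a pandas analogue of two common cleaning steps
# applied to downloaded GBIF/iDigBio occurrence records.
records = pd.read_csv("occurrences.csv")  # illustrative file name

# Drop records with missing or clearly invalid coordinates.
records = records.dropna(subset=["decimalLatitude", "decimalLongitude"])
records = records[records["decimalLatitude"].between(-90, 90)
                  & records["decimalLongitude"].between(-180, 180)]

# Remove duplicates that arise when GBIF and iDigBio serve the same
# specimen record under different download structures.
records = records.drop_duplicates(
    subset=["scientificName", "decimalLatitude",
            "decimalLongitude", "eventDate"])
```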
Premise: The digitization of natural history collections includes transcribing specimen label data into standardized formats. Born-digital specimen data, initially gathered in digital formats, do not need to be transcribed, enabling their efficient integration into digitized collections. Modernizing field collection methods for born-digital workflows requires the development of new tools and processes.
Methods and Results: collNotes, a mobile application, was developed for Android and iOS to supplement traditional field journals. Designed for efficiency in the field, collNotes avoids redundant data entries and does not require cellular service. collBook, a companion desktop application, refines field notes into database-ready formats and produces specimen labels.
Conclusions: collNotes and collBook can be used in combination as a field-to-database solution for gathering born-digital voucher specimen data for plants and fungi. Both programs are open source and use common file types, simplifying either program's integration into existing workflows.
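What "born-digital, database-ready" means in practice can be sketched in a few lines. This is not collNotes/collBook code, just a hypothetical record structure written straight to a Darwin Core-style CSV with no transcription step:

```python
import csv
from dataclasses import dataclass, asdict

# Hypothetical structured field record; every field and value below is
# illustrative, not collNotes/collBook output.
@dataclass
class FieldRecord:
    recordNumber: str
    scientificName: str
    recordedBy: str
    eventDate: str      # ISO 8601, e.g. "2024-05-17"
    locality: str
    decimalLatitude: float
    decimalLongitude: float

rec = FieldRecord("1234", "Quercus alba L.", "J. Smith", "2024-05-17",
                  "Ordway-Swisher Biological Station", 29.6880, -82.0180)

# Written once in the field, ingested directly: no label transcription.
with open("vouchers.csv", "w", newline="") as fh:
    writer = csv.DictWriter(fh, fieldnames=asdict(rec).keys())
    writer.writeheader()
    writer.writerow(asdict(rec))
```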
This paper describes ongoing work in dictionary digitization, in particular the processing of OCRed text into a structured lexicon. In processing the output of an OCR engine into structured data, we are faced with three problems: (1) typographic errors; (2) conversion from the dictionary's visual layout into a lexicographically structured, computer-readable format such as XML; and (3) conversion of each dictionary's idiosyncratic structure into some standard tagging system. This paper deals mostly with the second issue, but touches on the first and third.
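As a toy example of the second problem (ours, with a deliberately simplified layout assumption), turning an OCRed line into a structured XML entry might look like:

```python
import xml.etree.ElementTree as ET

# Illustrative only: assumes the visual layout survives OCR as a simple
# "HEADWORD pos. definition" pattern (real dictionary layouts are far
# messier and dictionary-specific).
def line_to_entry(line: str) -> ET.Element:
    headword, pos, definition = line.split(" ", 2)
    entry = ET.Element("entry")
    ET.SubElement(entry, "form").text = headword
    ET.SubElement(entry, "pos").text = pos.rstrip(".")
    ET.SubElement(entry, "def").text = definition
    return entry

lexicon = ET.Element("lexicon")
lexicon.append(line_to_entry("abaft adv. toward the stern of a ship"))
print(ET.tostring(lexicon, encoding="unicode"))
```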