skip to main content

Title: The Herbarium 2021 Half–Earth Challenge Dataset and Machine Learning Competition
Herbarium sheets present a unique view of the world's botanical history, evolution, and biodiversity. This makes them an all–important data source for botanical research. With the increased digitization of herbaria worldwide and advances in the domain of fine–grained visual classification which can facilitate automatic identification of herbarium specimen images, there are many opportunities for supporting and expanding research in this field. However, existing datasets are either too small, or not diverse enough, in terms of represented taxa, geographic distribution, and imaging protocols. Furthermore, aggregating datasets is difficult as taxa are recognized under a multitude of names and must be aligned to a common reference. We introduce the Herbarium 2021 Half–Earth dataset: the largest and most diverse dataset of herbarium specimen images, to date, for automatic taxon recognition. We also present the results of the Herbarium 2021 Half–Earth challenge, a competition that was part of the Eighth Workshop on Fine-Grained Visual Categorization (FGVC8) and hosted by Kaggle to encourage the development of models to automatically identify taxa from herbarium sheet images.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ;
Date Published:
Journal Name:
Frontiers in Plant Science
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Herbarium sheets present a unique view of the world's botanical history, evolution, and diversity. This makes them an all-important data source for botanical research. With the increased digitisation of herbaria worldwide and the advances in the fine-grained classification domain that can facilitate automatic identification of herbarium specimens, there are a lot of opportunities for supporting research in this field. However, existing datasets are either too small, or not diverse enough, in terms of represented taxa, geographic distribution or host institutions. Furthermore, aggregating multiple datasets is difficult as taxa exist under a multitude of different names and the taxonomy requires alignment to a common reference. We present the Herbarium Half-Earth dataset, the largest and most diverse dataset of herbarium specimens to date for automatic taxon recognition. 
    more » « less
  2. Premise

    Digitization and imaging of herbarium specimens provides essential historical phenotypic and phenological information about plants. However, the full use of these resources requires high‐quality human annotations for downstream use. Here we provide guidance on the design and implementation of image annotation projects for botanical research.

    Methods and Results

    We used a novel gold‐standard data set to test the accuracy of human phenological annotations of herbarium specimen images in two settings: structured, in‐person sessions and an online, community‐science platform. We examined how different factors influenced annotation accuracy and found that botanical expertise, academic career level, and time spent on annotations had little effect on accuracy. Rather, key factors included traits and taxa being scored, the annotation setting, and the individual scorer. In‐person annotations were significantly more accurate than online annotations, but both generated relatively high‐quality outputs. Gathering multiple, independent annotations for each image improved overall accuracy.


    Our results provide a best‐practices basis for using human effort to annotate images of plants. We show that scalable community science mechanisms can produce high‐quality data, but care must be taken to choose tractable taxa and phenophases and to provide informative training material.

    more » « less
  3. Abstract

    Vision science, particularly machine vision, has been revolutionized by introducing large-scale image datasets and statistical learning approaches. Yet, human neuroimaging studies of visual perception still rely on small numbers of images (around 100) due to time-constrained experimental procedures. To apply statistical learning approaches that include neuroscience, the number of images used in neuroimaging must be significantly increased. We present BOLD5000, a human functional MRI (fMRI) study that includes almost 5,000 distinct images depicting real-world scenes. Beyond dramatically increasing image dataset size relative to prior fMRI studies, BOLD5000 also accounts for image diversity, overlapping with standard computer vision datasets by incorporating images from the Scene UNderstanding (SUN), Common Objects in Context (COCO), and ImageNet datasets. The scale and diversity of these image datasets, combined with a slow event-related fMRI design, enables fine-grained exploration into the neural representation of a wide range of visual features, categories, and semantics. Concurrently, BOLD5000 brings us closer to realizing Marr’s dream of a singular vision science–the intertwined study of biological and computer vision.

    more » « less
  4. Premise

    Plant biodiversity is threatened, yet many species remain undescribed. It is estimated that >50% of undescribed species have already been collected and are awaiting discovery in herbaria. Robust automatic species identification algorithms using machine learning could accelerate species discovery.


    To encourage the development of an automatic species identification algorithm, we submitted our Herbarium 2019 data set to the Fine‐Grained Visual Categorization sub‐competition (FGVC6) hosted on the Kaggle platform. We chose to focus on the flowering plant family Melastomataceae because we have a large collection of imaged herbarium specimens (46,469 specimens representing 683 species) and taxonomic expertise in the family. As is common for herbarium collections, some species in this data set are represented by few specimens and others by many.


    In less than three months, the FGVC6 Herbarium 2019 Challenge drew 22 teams who entered 254 models for Melastomataceae species identification. The four best algorithms identified species with >88% accuracy.


    The FGVC competitions provide a unique opportunity for computer vision and machine learning experts to address difficult species‐recognition problems. The Herbarium 2019 Challenge brought together a novel combination of collections resources, taxonomic expertise, and collaboration between botanists and computer scientists.

    more » « less
  5. Abstract Machine learning (ML) has great potential to drive scientific discovery by harvesting data from images of herbarium specimens—preserved plant material curated in natural history collections—but ML techniques have only recently been applied to this rich resource. ML has particularly strong prospects for the study of plant phenological events such as growth and reproduction. As a major indicator of climate change, driver of ecological processes, and critical determinant of plant fitness, plant phenology is an important frontier for the application of ML techniques for science and society. In the present article, we describe a generalized, modular ML workflow for extracting phenological data from images of herbarium specimens, and we discuss the advantages, limitations, and potential future improvements of this workflow. Strategic research and investment in specimen-based ML methods, along with the aggregation of herbarium specimen data, may give rise to a better understanding of life on Earth. 
    more » « less