Title: The Herbarium 2021 Half-Earth Challenge Dataset and Machine Learning Competition
Herbarium sheets present a unique view of the world's botanical history, evolution, and biodiversity. This makes them an all-important data source for botanical research. With the increased digitization of herbaria worldwide and advances in the domain of fine-grained visual classification which can facilitate automatic identification of herbarium specimen images, there are many opportunities for supporting and expanding research in this field. However, existing datasets are either too small or not diverse enough in terms of represented taxa, geographic distribution, and imaging protocols. Furthermore, aggregating datasets is difficult as taxa are recognized under a multitude of names and must be aligned to a common reference. We introduce the Herbarium 2021 Half-Earth dataset: the largest and most diverse dataset of herbarium specimen images, to date, for automatic taxon recognition. We also present the results of the Herbarium 2021 Half-Earth challenge, a competition that was part of the Eighth Workshop on Fine-Grained Visual Categorization (FGVC8) and hosted by Kaggle to encourage the development of models to automatically identify taxa from herbarium sheet images.
Journal Name:
Frontiers in Plant Science
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract 1. The widespread digitization of natural history collections, combined with novel tools and approaches, is revolutionizing biodiversity science. The ‘extended specimen’ concept advocates a more holistic approach in which a specimen is framed as a diverse stream of interconnected data. Herbarium specimens that by their very nature capture multispecies relationships, such as certain parasites, fungi and lichens, hold great potential to provide a broader and more integrative view of the ecology and evolution of symbiotic interactions. This particularly applies to parasite–host associations, which owing to their interconnectedness are especially vulnerable to global environmental change. 2. Here, we present an overview of how parasitic flowering plants are represented in herbarium collections. We then discuss the variety of data that can be gathered from parasitic plant specimens, and how they can be used to understand global change impacts at multiple scales. Finally, we review best practices for sampling parasitic plants in the field, and subsequently preparing and digitizing these specimens. 3. Plant parasitism has evolved 12 times within angiosperms, and similar to other plant taxa, herbarium collections represent the foundation for analysing key aspects of their ecology and evolution. Yet these collections hold far greater potential. Data and metadata obtained from parasitic plant specimens can inform analyses of co-distribution patterns, changes in eco-physiology and species plasticity spanning temporal and spatial scales, chemical ecology of tripartite interactions (e.g. host–parasite–herbivore), and molecular data critical for species conservation. Moreover, owing to the historic nature and sheer size of global herbarium collections, these data provide the spatiotemporal breadth essential for investigating organismal response to global change. 4. Parasitic plant specimens are primed to serve as ideal examples of the extended specimen concept and help motivate the next generation of creative and impactful collection-based science. Continued digitization efforts and improved curatorial practices will contribute to opening these specimens to a broader audience, allowing integrative research spanning multiple domains and offering novel opportunities for education.
  2. Abstract Machine learning (ML) has great potential to drive scientific discovery by harvesting data from images of herbarium specimens—preserved plant material curated in natural history collections—but ML techniques have only recently been applied to this rich resource. ML has particularly strong prospects for the study of plant phenological events such as growth and reproduction. As a major indicator of climate change, driver of ecological processes, and critical determinant of plant fitness, plant phenology is an important frontier for the application of ML techniques for science and society. In the present article, we describe a generalized, modular ML workflow for extracting phenological data from images of herbarium specimens, and we discuss the advantages, limitations, and potential future improvements of this workflow. Strategic research and investment in specimen-based ML methods, along with the aggregation of herbarium specimen data, may give rise to a better understanding of life on Earth.
  3. Abstract

    Vision science, particularly machine vision, has been revolutionized by introducing large-scale image datasets and statistical learning approaches. Yet, human neuroimaging studies of visual perception still rely on small numbers of images (around 100) due to time-constrained experimental procedures. To apply statistical learning approaches that include neuroscience, the number of images used in neuroimaging must be significantly increased. We present BOLD5000, a human functional MRI (fMRI) study that includes almost 5,000 distinct images depicting real-world scenes. Beyond dramatically increasing image dataset size relative to prior fMRI studies, BOLD5000 also accounts for image diversity, overlapping with standard computer vision datasets by incorporating images from the Scene UNderstanding (SUN), Common Objects in Context (COCO), and ImageNet datasets. The scale and diversity of these image datasets, combined with a slow event-related fMRI design, enable fine-grained exploration into the neural representation of a wide range of visual features, categories, and semantics. Concurrently, BOLD5000 brings us closer to realizing Marr’s dream of a singular vision science: the intertwined study of biological and computer vision.

  4. Large systematic revisionary projects incorporating data for hundreds or thousands of taxa require an integrative approach, with a strong biodiversity-informatics core for efficient data management to facilitate research on the group. Our original biodiversity informatics platform, 3i (Internet-accessible Interactive Identification), combined a customized MS Access database backend with ASP-based web interfaces to support revisionary syntheses of several large genera of leafhoppers (Hemiptera: Auchenorrhyncha: Cicadellidae). More recently, for our National Science Foundation sponsored project, “GoLife: Collaborative Research: Integrative genealogy, ecology and phenomics of deltocephaline leafhoppers (Hemiptera: Cicadellidae), and their microbial associates”, we selected the new open-source platform TaxonWorks as the cyberinfrastructure. In the scope of the project, the original “3i World Auchenorrhyncha Database” was imported into TaxonWorks. At the present time, TaxonWorks has many tools to automatically import nomenclature, citations, and specimen-based collection data. At the time of the initial migration of the 3i database, many of those tools were still under development, and the complexity of the data in the database required a custom migration script, which is probably still the most efficient solution for importing datasets with a long development history. At the moment, the World Auchenorrhyncha Database comprehensively covers nomenclature of the group and includes data on 70 valid families, 6,816 valid genera, and 47,064 valid species, as well as synonymy and subsequent combinations (Fig. 1). In addition, many taxon records include the original citation, bibliography, type information, etymology, etc. The bibliography of the group includes 37,579 sources, about 1/3 of which are associated with PDF files. Species have distribution records, either derived from individual specimens or as country and state level asserted distribution, as well as biological associations indicating host plants, predators, and parasitoids. Observation matrices in TaxonWorks are designed to handle morphological data associated with taxa or specimens. The matrices may be used to automatically generate interactive identification keys and taxon descriptions. They can also be downloaded to be imported, for example, into Lucid builder, or to perform phylogenetic analysis using an external application. At the moment there are 36 matrices associated with the project. The observation matrix from the GoLife project covers 798 taxa by 210 descriptors (most of which are qualitative multi-state morphological descriptors) (Fig. 2). Illustrations are provided for 9,886 taxa and organized in the specialized image matrix, and can be used as a pictorial key for determination of species and taxa of a higher rank. For the phylogenetic analysis, a dataset was constructed for 730 terminal taxa and >160,000 nucleotide positions obtained using anchored hybrid enrichment of genomic DNA for a sample of leafhoppers from the subfamily Deltocephalinae and outgroups. The probe kit targets leafhopper genes, as well as some bacterial genes (endosymbionts and plant pathogens transmitted by leafhoppers). The maximum likelihood analyses of concatenated nucleotide and amino acid sequences as well as coalescent gene tree analysis yielded well-resolved phylogenetic trees (Cao et al. 2022). Raw sequence data have been uploaded to the Sequence Read Archive on GenBank. Occurrence and morphological data, as well as diagnostic images, for voucher specimens have been incorporated into TaxonWorks. Data in TaxonWorks can be exported in raw format, accessed via an Application Programming Interface (API), or shared with external data aggregators like Catalogue of Life, GBIF, and iDigBio.
  5. Herbarium collections shape our understanding of the world’s flora and are crucial for addressing global change and biodiversity conservation. The formation of such natural history collections, however, is not free from sociopolitical issues of immediate relevance. Despite increasing efforts addressing issues of representation and colonialism in natural history collections, herbaria have received comparatively less attention. While it has been noted that the majority of plant specimens are housed in the global North, the extent of this disparity has not been rigorously quantified to date. Here, by analyzing over 85 million specimen records and surveying herbaria across the globe, we assess the colonial legacy of botanical collections and how we may move towards a more inclusive future. We demonstrate that colonial exploitation has contributed to an inverse relationship between where plant biodiversity exists in nature and where it is housed in herbaria. Such disparities persist in herbaria across physical and digital realms despite overt colonialism having ended over half a century ago, suggesting ongoing digitization and decolonization efforts have yet to alleviate colonial-era discrepancies. We emphasize the need for acknowledging the inconvenient history of herbarium collections and the implementation of a more equitable, global paradigm for their collection, curation, and use.