Title: An algorithm competition for automatic species identification from herbarium specimens
Premise

Plant biodiversity is threatened, yet many species remain undescribed. It is estimated that >50% of undescribed species have already been collected and are awaiting discovery in herbaria. Robust automatic species identification algorithms using machine learning could accelerate species discovery.

Methods

To encourage the development of an automatic species identification algorithm, we submitted our Herbarium 2019 data set to the Fine‐Grained Visual Categorization sub‐competition (FGVC6) hosted on the Kaggle platform. We chose to focus on the flowering plant family Melastomataceae because we have a large collection of imaged herbarium specimens (46,469 specimens representing 683 species) and taxonomic expertise in the family. As is common for herbarium collections, some species in this data set are represented by few specimens and others by many.

Results

In less than three months, the FGVC6 Herbarium 2019 Challenge drew 22 teams who entered 254 models for Melastomataceae species identification. The four best algorithms identified species with >88% accuracy.
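
The abstract does not say how accuracy was scored, so the following is only a minimal sketch of two common ways to score a species classifier on an imbalanced data set such as the one described in Methods. The function names, the toy labels, and the choice of metrics are assumptions for illustration, not the competition's actual evaluation code.

```python
from collections import defaultdict


def top1_accuracy(y_true, y_pred):
    """Fraction of specimens whose predicted species equals the labeled species."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("no specimens to score")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_accuracy(y_true, y_pred):
    """Mean of per-species accuracies, so rare species weigh as much as common ones."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("no specimens to score")
    hits = defaultdict(int)
    totals = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        hits[t] += int(t == p)
    return sum(hits[s] / totals[s] for s in totals) / len(totals)


# Hypothetical toy labels: one common species (8 specimens), one rare (2 specimens).
y_true = ["species_A"] * 8 + ["species_B"] * 2
y_pred = ["species_A"] * 8 + ["species_A", "species_B"]

overall = top1_accuracy(y_true, y_pred)     # 9 of 10 specimens correct
per_species = macro_accuracy(y_true, y_pred)  # mean of 8/8 and 1/2
```

On this toy input the overall score is higher than the per-species mean, which is why the two numbers are worth reporting separately when some species have few specimens.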

Discussion

The FGVC competitions provide a unique opportunity for computer vision and machine learning experts to address difficult species‐recognition problems. The Herbarium 2019 Challenge brought together a novel combination of collections resources, taxonomic expertise, and collaboration between botanists and computer scientists.

 
NSF-PAR ID:
10456546
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Journal Name:
Applications in Plant Sciences
Volume:
8
Issue:
6
ISSN:
2168-0450
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Premise

    Herbarium specimens represent an outstanding source of material with which to study plant phenological changes in response to climate change. The fine‐scale phenological annotation of such specimens is nevertheless highly time consuming and requires substantial human investment and expertise, which are difficult to rapidly mobilize.

    Methods

    We trained and evaluated new deep learning models to automate the detection, segmentation, and classification of four reproductive structures of Streptanthus tortuosus (flower buds, flowers, immature fruits, and mature fruits). We used a training data set of 21 digitized herbarium sheets for which the position and outlines of 1036 reproductive structures were annotated manually. We adjusted the hyperparameters of a Mask R‐CNN (regional convolutional neural network) to this specific task and evaluated the resulting trained models for their ability to count reproductive structures and estimate their size.

    Results

    The main outcome of our study is that the performance of detection and segmentation can vary significantly with: (i) the type of annotations used for training, (ii) the type of reproductive structures, and (iii) the size of the reproductive structures. In the case of Streptanthus tortuosus, the method can provide quite accurate estimates (77.9% of cases) of the number of reproductive structures, which is better estimated for flowers than for immature fruits and buds. The size estimation results are also encouraging, showing a difference of only a few millimeters between the predicted and actual sizes of buds and flowers.

    Discussion

    This method has great potential for automating the analysis of reproductive structures in high‐resolution images of herbarium sheets. Deeper investigations regarding the taxonomic scalability of this approach and its potential improvement will be conducted in future work.

     
  2. Abstract

    Understanding the ranges of rare and endangered species is central to conserving biodiversity in the Anthropocene. Species distribution models (SDMs) have become a common and powerful tool for analyzing species–environment relationships across geographic space. Although evaluating the distribution of rare species is integral to their conservation, this can be difficult when limited distribution data are available. Community science platforms, such as iNaturalist, have emerged as alternative sources for species occurrence data. Although these observations are often thought to be of lower quality than those of natural history collections, they may have potential for improving SDMs for species with few occurrence records from collections. Here, we investigate the utility of iNaturalist data for developing SDMs for a rare high‐elevation plant, Telesonix jamesii. Because methods for modeling rare species are limited in the literature, five different modeling techniques were considered, including profile methods, statistical models, and machine learning algorithms. The inclusion of iNaturalist data doubled the number of usable records for T. jamesii. We found that a random forest (RF) model using ensemble training data performed best of any model (area under curve = 0.98). We then compared the performance of RF models that use only natural history training data and those that use a combination of natural history (herbarium specimens) and iNaturalist training data. All models heavily relied on climate data (mean temperature of driest quarter, and precipitation of the warmest quarter), indicating that this species is under threat as climate continues to change. Validation datasets affected model fits as well. Models using only herbarium data performed slightly worse when evaluated with cross‐validation than when validated externally with iNaturalist data. This study can serve as a model for future SDM studies of species with similar data limitations.

     
  3. Leaves are the most abundant and visible plant organ, both in the modern world and the fossil record. Identifying foliage to the correct plant family based on leaf architecture is a fundamental botanical skill that is also critical for isolated fossil leaves, which often, especially in the Cenozoic, represent extinct genera and species from extant families. Resources focused on leaf identification are remarkably scarce; however, the situation has improved due to the recent proliferation of digitized herbarium material, live-plant identification applications, and online collections of cleared and fossil leaf images. Nevertheless, the need remains for a specialized image dataset for comparative leaf architecture. We address this gap by assembling an open-access database of 30,252 images of vouchered leaf specimens vetted to family level, primarily of angiosperms, including 26,176 images of cleared and x-rayed leaves representing 354 families and 4,076 of fossil leaves from 48 families. The images maintain original resolution, have user-friendly filenames, and are vetted using APG and modern paleobotanical standards. The cleared and x-rayed leaves include the Jack A. Wolfe and Leo J. Hickey contributions to the National Cleared Leaf Collection and a collection of high-resolution scanned x-ray negatives, housed in the Division of Paleobotany, Department of Paleobiology, Smithsonian National Museum of Natural History, Washington D.C.; and the Daniel I. Axelrod Cleared Leaf Collection, housed at the University of California Museum of Paleontology, Berkeley. The fossil images include a sampling of Late Cretaceous to Eocene paleobotanical sites from the Western Hemisphere held at numerous institutions, especially from Florissant Fossil Beds National Monument (late Eocene, Colorado), as well as several other localities from the Late Cretaceous to Eocene of the Western USA and the early Paleogene of Colombia and southern Argentina. 
    The dataset facilitates new research and education opportunities in paleobotany, comparative leaf architecture, systematics, and machine learning.
  4. Research on plant-pollinator interactions requires a diversity of perspectives and approaches, and documenting changing pollinator-plant interactions due to declining insect diversity and climate change is especially challenging. Natural history collections are increasingly important for such research and can provide ecological information across broad spatial and temporal scales. Here, we describe novel approaches that integrate museum specimens from insect and plant collections with field observations to quantify pollen networks over large spatial and temporal gradients. We present methodological strategies for evaluating insect-pollen network parameters based on pollen collected from museum insect specimens. These methods provide insight into spatial and temporal variation in pollen-insect interactions and complement other approaches to studying pollination, such as pollinator observation networks and flower enclosure experiments. We present example data from butterfly pollen networks over the past century in the Great Basin Desert and Sierra Nevada Mountains, United States. Complementary to these approaches, we describe rapid pollen identification methods that can increase speed and accuracy of taxonomic determinations, using pollen grains collected from herbarium specimens. As an example, we describe a convolutional neural network (CNN) to automate identification of pollen. We extracted images of pollen grains from 21 common species from herbarium specimens at the University of Nevada Reno (RENO). The CNN model achieved exceptional accuracy of identification, with a correct classification rate of 98.8%. These and similar approaches can transform the way we estimate pollination network parameters and greatly change inferences from existing networks, which have exploded over the past few decades. These techniques also allow us to address critical ecological questions related to mutualistic networks, community ecology, and conservation biology. 
    Museum collections remain a bountiful source of data for biodiversity science and understanding global change.
  5. Premise

    With digitization and data sharing initiatives underway over the last 15 years, an important need has been prioritizing specimens to digitize. Because duplicate specimens are shared among herbaria in exchange and gift programs, we investigated the extent to which unique biogeographic data are held in small herbaria vs. these data being redundant with those held by larger institutions. We evaluated the unique specimen contributions that small herbaria make to biogeographic understanding at county, locality, and temporal scales.

    Methods

    We sampled herbarium specimens of 40 plant taxa from each of eight states of the United States of America in four broad status categories: extremely rare, very rare, common native, and introduced. We gathered geographic information from specimens held by large (≥100,000 specimens) and small (<100,000 specimens) herbaria. We built generalized linear mixed models to assess which features of the collections may best predict unique contributions of herbaria and used an Akaike information criterion‐based information‐theoretic approach for our model selection to choose the best model for each scale.

    Results

    Small herbaria contributed unique specimens at all scales in proportion with their contribution of specimens to our data set. The best models for all scales were the full models that included the factors of species status and herbarium size when accounting for state as a random variable.

    Conclusions

    We demonstrated that small herbaria contribute unique information for research. It is clear that unique contributions cannot be predicted based on herbarium size alone. We must prioritize digitization and data sharing from herbaria of all sizes.

     