Title: Machine Learning Using Digitized Herbarium Specimens to Advance Phenological Research
Abstract: Machine learning (ML) has great potential to drive scientific discovery by harvesting data from images of herbarium specimens (preserved plant material curated in natural history collections), but ML techniques have only recently been applied to this rich resource. ML has particularly strong prospects for the study of plant phenological events such as growth and reproduction. As a major indicator of climate change, driver of ecological processes, and critical determinant of plant fitness, plant phenology is an important frontier for the application of ML techniques for science and society. In the present article, we describe a generalized, modular ML workflow for extracting phenological data from images of herbarium specimens, and we discuss the advantages, limitations, and potential future improvements of this workflow. Strategic research and investment in specimen-based ML methods, along with the aggregation of herbarium specimen data, may give rise to a better understanding of life on Earth.
Award ID(s):
1754584 1802209 1902064 1902078
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Machine learning (ML) can accelerate the extraction of phenological data from herbarium specimens; however, no studies have assessed whether ML-derived phenological data can be used reliably to evaluate ecological patterns. In this study, 709 herbarium specimens representing a widespread annual herb, Streptanthus tortuosus, were scored both manually by human observers and by a Mask R-CNN object detection model to (1) evaluate the concordance between ML-derived and manually derived phenological data and (2) determine whether ML-derived data can be used to reliably assess phenological patterns. The ML model generally underestimated the number of reproductive structures present on each specimen; however, when these counts were used to provide a quantitative estimate of the phenological stage of plants on a given sheet (i.e., the phenological index, or PI), the ML-derived and manually derived PIs were highly concordant. Moreover, herbarium specimen age had no effect on the estimated PI of a given sheet. Finally, including ML-derived PIs as predictor variables in phenological models produced estimates of the phenological sensitivity of this species to climate, temporal shifts in flowering time, and the rate of phenological progression that are indistinguishable from those produced by models based on data provided by human observers. This study demonstrates that phenological data extracted using machine learning can be used reliably to estimate the phenological stage of herbarium specimens and to detect phenological patterns.
  2. Abstract

    Background and Aims: Fruiting remains under-represented in long-term phenology records, relative to leaf and flower phenology. Herbarium specimens and historical field notes can fill this gap, but selecting and synthesizing these records for modern-day comparison requires an understanding of whether different historical data sources contain similar information, and whether similar, but not equivalent, fruiting metrics are comparable with one another.

    Methods: For 67 fleshy-fruited plant species, we compared observations of fruiting phenology made by Henry David Thoreau in Concord, Massachusetts (1850s), with phenology data gathered from herbarium specimens collected across New England (mid-1800s to 2000s). To identify whether fruiting times and the order of fruiting among species are similar between datasets, we compared dates of first, peak and last observed fruiting (recorded by Thoreau), and earliest, mean and latest specimen (collected from herbarium records), as well as fruiting durations.

    Key Results: On average, earliest herbarium specimen dates were earlier than first fruiting dates observed by Thoreau; mean specimen dates were similar to Thoreau’s peak fruiting dates; latest specimen dates were later than Thoreau’s last fruiting dates; and durations of fruiting captured by herbarium specimens were longer than durations of fruiting observed by Thoreau. All metrics of fruiting phenology except duration were significantly, positively correlated within (r: 0.69–0.88) and between (r: 0.59–0.85) datasets.

    Conclusions: Strong correlations in fruiting phenology between Thoreau’s observations and data from herbaria suggest that field and herbarium methods capture similar broad-scale phenological information, including relative fruiting times among plant species in New England. Differences in the timing of first, last and duration of fruiting suggest that historical datasets collected with different methods, scales and metrics may not be comparable when exact timing is important. Researchers should strongly consider matching methodology when selecting historical records of fruiting phenology for present-day comparisons.
  3. Premise

    Herbarium specimens represent an outstanding source of material with which to study plant phenological changes in response to climate change. The fine‐scale phenological annotation of such specimens is nevertheless highly time consuming and requires substantial human investment and expertise, which are difficult to rapidly mobilize.


    We trained and evaluated new deep learning models to automate the detection, segmentation, and classification of four reproductive structures of Streptanthus tortuosus (flower buds, flowers, immature fruits, and mature fruits). We used a training data set of 21 digitized herbarium sheets for which the position and outlines of 1036 reproductive structures were annotated manually. We adjusted the hyperparameters of a mask R‐CNN (regional convolutional neural network) to this specific task and evaluated the resulting trained models for their ability to count reproductive structures and estimate their size.


    The main outcome of our study is that the performance of detection and segmentation can vary significantly with: (i) the type of annotations used for training, (ii) the type of reproductive structures, and (iii) the size of the reproductive structures. In the case of Streptanthus tortuosus, the method can provide quite accurate estimates (77.9% of cases) of the number of reproductive structures, which is better estimated for flowers than for immature fruits and buds. The size estimation results are also encouraging, showing a difference of only a few millimeters between the predicted and actual sizes of buds and flowers.


    This method has great potential for automating the analysis of reproductive structures in high‐resolution images of herbarium sheets. Deeper investigations regarding the taxonomic scalability of this approach and its potential improvement will be conducted in future work.

  4. Abstract

    The widespread digitization of natural history collections, combined with novel tools and approaches, is revolutionizing biodiversity science. The ‘extended specimen’ concept advocates a more holistic approach in which a specimen is framed as a diverse stream of interconnected data. Herbarium specimens that by their very nature capture multispecies relationships, such as certain parasites, fungi and lichens, hold great potential to provide a broader and more integrative view of the ecology and evolution of symbiotic interactions. This particularly applies to parasite–host associations, which owing to their interconnectedness are especially vulnerable to global environmental change.

    Here, we present an overview of how parasitic flowering plants are represented in herbarium collections. We then discuss the variety of data that can be gathered from parasitic plant specimens, and how they can be used to understand global change impacts at multiple scales. Finally, we review best practices for sampling parasitic plants in the field, and subsequently preparing and digitizing these specimens.

    Plant parasitism has evolved 12 times within angiosperms, and similar to other plant taxa, herbarium collections represent the foundation for analysing key aspects of their ecology and evolution. Yet these collections hold far greater potential. Data and metadata obtained from parasitic plant specimens can inform analyses of co‐distribution patterns, changes in eco‐physiology and species plasticity spanning temporal and spatial scales, chemical ecology of tripartite interactions (e.g. host–parasite–herbivore), and molecular data critical for species conservation. Moreover, owing to the historic nature and sheer size of global herbarium collections, these data provide the spatiotemporal breadth essential for investigating organismal response to global change.

    Parasitic plant specimens are primed to serve as ideal examples of the extended specimen concept and help motivate the next generation of creative and impactful collection‐based science. Continued digitization efforts and improved curatorial practices will contribute to opening these specimens to a broader audience, allowing integrative research spanning multiple domains and offering novel opportunities for education.

  5. Phenology is a key biological trait of an organism’s success and is one of the best indicators of its response to recent climate change. Plants are among the most well-studied organisms in this regard, but observational data bearing on this topic are largely restricted to woody species of the northern hemisphere, mostly from ca. the last three decades. Recent research has demonstrated that mobilized online herbarium specimens provide important, albeit mostly neglected, information on plant phenology. Here, we use the web tool CrowdCurio to crowdsource phenological data from more than 10,000 herbarium specimens representing 30 flowering plant species broadly distributed across the eastern United States. Our results, spanning 120 years and generated from over 2,000 crowdsourcers, clarify numerous aspects of plant phenology. First, they reveal that plant reproductive phenology is significantly advancing in response to warming, which is consistent with previous studies. Second, among those species with broad latitudinal ranges, populations from more southern latitudes are significantly more phenologically sensitive to temperature than those from more northern latitudes. Last, contrary to some recent findings, plants in warmer, less variable climates may be much more dynamic, on average, in their phenological sensitivity. Our results are robust to a variety of confounding factors and span large phylogenetic distances and myriad life histories. These results may represent more global trends in the latitudinal gradient of phenological response, with myriad potential ecological and evolutionary consequences, and lead us to hypothesize that phenological sensitivity across species' ranges is driven by adaptation to local climates.
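As an illustration of the phenological index (PI) described in item 1, the sketch below shows one way a PI could be computed from per-sheet counts of reproductive structures. The abstracts do not give the formula, so everything here is an assumption: the function name `phenological_index`, the stage names, and the weights 1 to 4 (a count-weighted mean stage) are hypothetical choices for illustration, not the authors' method.

```python
# Hypothetical stage weights (assumption): later stages get larger values.
STAGE_WEIGHTS = {
    "bud": 1,
    "flower": 2,
    "immature_fruit": 3,
    "mature_fruit": 4,
}


def phenological_index(counts):
    """Return the count-weighted mean stage of one sheet, or None if empty.

    `counts` maps a stage name to the number of structures of that stage
    detected on the sheet; missing stages are treated as zero.
    """
    total = sum(counts.get(stage, 0) for stage in STAGE_WEIGHTS)
    if total == 0:
        return None  # no reproductive structures, so no index is defined
    weighted = sum(w * counts.get(stage, 0) for stage, w in STAGE_WEIGHTS.items())
    return weighted / total


# Made-up counts for one sheet: the "ML" counts are lower than the manual
# counts, but the proportions across stages are the same.
manual_counts = {"bud": 10, "flower": 20, "immature_fruit": 10, "mature_fruit": 0}
ml_counts = {"bud": 4, "flower": 8, "immature_fruit": 4}

print(phenological_index(manual_counts))
print(phenological_index(ml_counts))
```

Under this assumed definition, the index depends only on the proportions of structures in each stage, which is one plausible reason why a model that undercounts structures could still yield a PI close to the manually derived one.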