
Title: Robust and simplified machine learning identification of pitfall trap‐collected ground beetles at the continental scale

Insect populations are changing rapidly, and monitoring these changes is essential for understanding the causes and consequences of such shifts. However, large‐scale insect identification projects are time‐consuming and expensive when done solely by human identifiers. Machine learning offers a possible solution to help collect insect data quickly and efficiently.

Here, we outline a methodology for training classification models to identify pitfall trap‐collected insects from image data and then apply the method to identify ground beetles (Carabidae). All beetles were collected by the National Ecological Observatory Network (NEON), a continental‐scale ecological monitoring project with sites across the United States. We describe the procedures for image collection, image data extraction, data preparation, and model training, and compare the performance of five machine learning algorithms and two classification methods (hierarchical vs. single‐level) in identifying ground beetles from the species to subfamily level. All models were trained using pre‐extracted feature vectors, not raw image data. Our methodology allows data to be extracted from multiple individuals within the same image, thus enhancing time efficiency; utilizes relatively simple models that allow for direct assessment of model performance; and can be performed on relatively small datasets.
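The contrast between the two classification methods can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the abstract: a nearest-centroid rule stands in for the trained models (it is not LDA), the taxonomy and feature values are hypothetical placeholders, and "single-level" is read as classifying each taxonomic level independently while "hierarchical" is read as classifying top-down. The authors' actual procedure may differ.

```python
import math

# Hypothetical toy taxonomy (placeholder names, not real taxa): species -> genus.
PARENT = {"sp_a1": "genus_a", "sp_a2": "genus_a", "sp_b1": "genus_b"}

# Hypothetical species centroids in a 2-D feature space, standing in for
# pre-extracted feature vectors.
CENTROIDS = {"sp_a1": (0.0, 0.0), "sp_a2": (10.0, 0.0), "sp_b1": (4.0, 0.0)}


def nearest(vector, centroids):
    """Return the label whose centroid is closest to the feature vector."""
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))


def genus_centroids():
    """Average the species centroids within each genus."""
    groups = {}
    for species, centroid in CENTROIDS.items():
        groups.setdefault(PARENT[species], []).append(centroid)
    return {
        genus: tuple(sum(axis) / len(points) for axis in zip(*points))
        for genus, points in groups.items()
    }


def classify_single_level(vector):
    """Classify genus and species independently of one another."""
    return nearest(vector, genus_centroids()), nearest(vector, CENTROIDS)


def classify_hierarchical(vector):
    """Classify the genus first, then choose only among that genus's species."""
    genus = nearest(vector, genus_centroids())
    candidates = {s: c for s, c in CENTROIDS.items() if PARENT[s] == genus}
    return genus, nearest(candidates and vector, candidates)


print(classify_single_level((4.6, 0.0)))
print(classify_hierarchical((4.6, 0.0)))
```

In this toy case the independent species call can contradict the genus call, whereas the top-down version is consistent by construction; that is a property of the sketch, not a result reported in the abstract.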

The best‐performing algorithm, linear discriminant analysis (LDA), reached an accuracy of 84.6% at the species level when naively identifying species, which was further increased to >95% when classifications were limited by known local species pools. Model performance was negatively correlated with taxonomic specificity, with the LDA model reaching an accuracy of ~99% at the subfamily level. When classifying carabid species not included in the training dataset at higher taxonomic levels, the models performed significantly better than if classifications were made randomly. We also observed greater performance when classifications were made using the hierarchical classification method compared to the single‐level classification method at higher taxonomic levels.
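Limiting classifications by a known local species pool can be pictured as masking a classifier's per-species scores. The sketch below is one possible way to do this, not the authors' implementation: the function names, species labels, and probability values are all hypothetical, and the abstract does not state how the restriction was applied.

```python
def restrict_to_pool(probabilities, local_pool):
    """Keep only species in the local pool and renormalise so scores sum to 1."""
    kept = {sp: p for sp, p in probabilities.items() if sp in local_pool}
    total = sum(kept.values())
    if total == 0:
        raise ValueError("no species in the local pool has a non-zero score")
    return {sp: p / total for sp, p in kept.items()}


def predict(probabilities, local_pool=None):
    """Return the top-scoring species, optionally limited to a local pool."""
    if local_pool is not None:
        probabilities = restrict_to_pool(probabilities, local_pool)
    return max(probabilities, key=probabilities.get)


# Hypothetical classifier output for a single specimen.
scores = {"species_a": 0.45, "species_b": 0.35, "species_c": 0.20}

naive = predict(scores)                                           # unrestricted
pooled = predict(scores, local_pool={"species_b", "species_c"})  # site-limited
print(naive, pooled)
```

Here the unrestricted call picks `species_a`, while the pool-limited call picks `species_b` because `species_a` is assumed not to occur at the hypothetical site.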

The general methodology outlined here serves as a proof‐of‐concept for classifying pitfall trap‐collected organisms using machine learning algorithms, and the image data extraction methodology may also be used for purposes other than machine learning. We propose that integration of machine learning in large‐scale identification pipelines will increase efficiency and lead to a greater flow of insect macroecological data, with the potential to be expanded for use with non‐insect taxa.

Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Journal Name:
Ecology and Evolution
Page Range / eLocation ID:
p. 13143-13153
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Classifying insect species involves a tedious process of identifying distinctive morphological insect characters by taxonomic experts. Machine learning can harness the power of computers to potentially create an accurate and efficient method for performing this task at scale, given that its analytical processing can be more sensitive to subtle physical differences in insects, which experts may not perceive. However, existing machine learning methods are designed to only classify insect samples into described species, thus failing to identify samples from undescribed species.

    We propose a novel deep hierarchical Bayesian model for insect classification, given the taxonomic hierarchy inherent in insects. This model can classify samples of both described and undescribed species; described samples are assigned a species while undescribed samples are assigned a genus, which is a pivotal advancement over just identifying them as outliers. We demonstrated this proof of concept on a new database containing paired insect image and DNA barcode data from four insect orders, including 1040 species, which far exceeds the number of species used in existing work. A quarter of the species were excluded from the training set to simulate undescribed species.

    With the proposed classification framework using combined image and DNA data in the model, species classification accuracy for described species was 96.66% and genus classification accuracy for undescribed species was 81.39%. Including both data sources in the model resulted in significant improvement over including image data only (39.11% accuracy for described species and 35.88% genus accuracy for undescribed species), and modest improvement over including DNA data only (73.39% genus accuracy for undescribed species).

    Unlike current machine learning methods, the proposed deep hierarchical Bayesian learning approach can simultaneously classify samples of both described and undescribed species, a functionality that could become instrumental in biodiversity monitoring across the globe. This framework can be customized for any taxonomic classification problem for which image and DNA data can be obtained, thus making it relevant for use across all biological kingdoms.


  2. Camera trap studies have become a popular medium to assess many ecological phenomena including population dynamics, patterns of biodiversity, and monitoring of endangered species. In conjunction with the benefit to scientists, camera traps present an unprecedented opportunity to involve the public in scientific research via image classifications. However, this engagement strategy comes with a myriad of complications. Volunteers vary in their familiarity with wildlife; thus, the accuracy of user‐derived classifications may be biased by the commonness or popularity of species and user experience. From an extensive multi‐site camera trap study across Michigan, U.S.A., we compiled and classified images through a public science platform called Michigan ZoomIN. We aggregated responses from 15 independent users per image using multiple consensus methods to assess accuracy by comparing to species identification completed by wildlife experts. We also evaluated how different factors including consensus algorithms, study area, wildlife species, user support, and camera type influenced the accuracy of user‐derived classifications. Overall accuracy of user‐derived classification was 97%, although several canid (e.g., Canis lupus, Vulpes vulpes) and mustelid (e.g., Neovison vison) species were repeatedly difficult to identify by users and had lower accuracy. When validating user‐derived classification, we found that study area, consensus method, and user support best explained accuracy. To overcome hesitancy associated with data collected by untrained participants, we demonstrated their value by showing that the accuracy from volunteers was comparable to experts when classifying North American mammals. Our hierarchical workflow that integrated multiple consensus methods led to more image classifications without extensive training and even when the expertise of the volunteer was unknown. Ultimately, adopting such an approach can harness broader participation, expedite future camera trap data synthesis, and improve allocation of resources by scholars to enhance performance of public participants and increase accuracy of user‐derived data. © 2021 The Wildlife Society.

  3. Abstract

    The Amazon River basin contains a vast diversity of lotic habitats and accompanying hydrological regimes. Further understanding the spatial distribution of flow regimes across the Amazon can be useful for recognizing riverine ecohydrological processes and informing river management and conservation, especially in areas with limited or inconsistent streamflow monitoring.

    This study compares four inductive approaches for classifying streamflow regimes across the Amazon using an unprecedented compilation of streamflow records from Bolivia, Brazil, Colombia, Ecuador, and Peru.

    Inductive classification schemes use attributes of streamflow data to categorize river reaches into similar classes, which then may be generalized to understand streamflow behaviour at the basin scale. In this study, classification was accomplished through hierarchical clustering of 67 flow metrics calculated using indicators of hydrologic alteration (IHA) and daily streamflow data from median annual hydrographs (MAHs) for 404 stations (representing >7,000 station‐years) across five Amazonian countries.

    Classification was performed using both flow magnitude‐inclusive and flow magnitude‐independent datasets. For flow magnitude‐independent methods, optimal solutions included six or seven primary hydrological classes for IHA and MAH datasets; for approaches that retained magnitude, variance was sufficiently large to prevent convergence to a specific number of classes.

    Across methods, class membership was strongly associated with the timing, frequency, and rate of change of flow, and spatially coherent clusters were associated with seasonal, elevational, and stream‐order gradients. These results highlight the diversity of flow regimes across the Amazon and provide a framework for studying relationships between hydrological regimes and ecological responses in the context of changing climate, land use, and human‐induced hydrological alteration.

    The methodology applied provides a data‐driven approach for classifying flow regimes based on observed data. When coupled with ecological knowledge and expertise, these classifications can be used to develop ecohydrologically informed and management‐relevant conservation practices.

  4. Image data collected after natural disasters play an important role in the forensics of structure failures. However, curating and managing large amounts of post-disaster imagery data is challenging. In most cases, data users still have to spend much effort to find and sort images from the massive amounts of images archived for past decades in order to study specific types of disasters. This paper proposes a new machine learning-based approach for automating the labeling and classification of large volumes of post-natural disaster image data to address this issue. More specifically, the proposed method couples pre-trained computer vision models and a natural language processing model with an ontology tailored to natural disasters to facilitate the search and query of specific types of image data. The resulting process returns each image with five primary labels and similarity scores, representing its content based on the developed word-embedding model. Validation and accuracy assessment of the proposed methodology was conducted with ground-level residential building panoramic images from Hurricane Harvey. The computed primary labels showed a minimum average difference of 13.32% when compared to manually assigned labels. This versatile and adaptable approach offers a practical solution for automating image labeling and classification tasks, with the potential to be applied to various image classifications and used in different fields and industries. The flexibility of the method means that it can be updated and improved to meet the evolving needs of various domains, making it a valuable asset for future research and development.
  5. Abstract

    The taxonomic concepts of Blapimorpha and Opatrinae (informal and traditional, morphology‐based groupings among darkling beetles) are tested using molecular phylogenetics and a reassessment of larval and adult morphology to address a major phylogeny‐classification gap in Tenebrionidae. Instead of a holistic approach (family‐level phylogeny), this study uses a bottom‐up strategy (tribal grouping) in order to define larger, monophyletic lineages within Tenebrioninae. Sampling included representatives of 27 tenebrionid tribes: Alleculini, Amarygmini, Amphidorini, Blaptini, Bolitophagini, Branchini, Cerenopini, Coniontini, Caenocrypticini, Dendarini, Eulabini, Helopini, Lagriini, Melanimini, Opatrini, Pedinini, Phaleriini, Physogasterini, Platynotini, Platyscelidini, Praociini, Scaurini, Scotobiini, Tenebrionini, Trachyscelini, Triboliini and Ulomini. Molecular analyses were based on DNA sequence data from four non‐overlapping gene regions: carbamoyl‐phosphate synthetase domain of rudimentary (CAD) (723 bp), wingless (wg) (438 bp), nuclear ribosomal 28S (1101 bp) and mitochondrial ribosomal 12S (363 bp). Additionally, 15 larval and imaginal characters were scored and subjected to an ancestral state reconstruction analysis. Results revealed that Amphidorini, Blaptini, Dendarini, Pedinini, Platynotini, Platyscelidini and Opatrini form a clade which can be defined by the following morphological features: adults: antennae lacking compound/stellate sensoria; procoxal cavities externally and internally closed, intersternal membrane of abdominal ventrites 3–5 visible; paired abdominal defensive glands present, elongate, not annulated; larvae: prolegs enlarged (adapted for digging); ninth tergite lacking urogomphi. To accommodate this monophyletic grouping (281 genera and ∼4000 species), the subfamily Blaptinae sens. nov. is resurrected. Prior to these results, all of the tribes within Blaptinae were classified within the polyphyletic subfamily Tenebrioninae. The non‐monophyletic nature of Tenebrioninae has already been postulated by previous authors, yet no taxonomic decisions were made to fix its status. The reinstatement of Blaptinae, which groups ∼50% of the former Tenebrioninae, helps to clarify phylogenetic relations among the whole family and is the first step towards a complete higher‐level revision of Tenebrionidae. The Central Asian tribe Dissonomini (two genera, ∼30 species) was not included in Blaptinae due to a lack of representatives in the performed phylogenetic analyses; however, based on morphological features, the tribe is listed as a potential addition to the subfamily.
