

Title: Robust and simplified machine learning identification of pitfall trap‐collected ground beetles at the continental scale
Abstract

Insect populations are changing rapidly, and monitoring these changes is essential for understanding the causes and consequences of such shifts. However, large‐scale insect identification projects are time‐consuming and expensive when done solely by human identifiers. Machine learning offers a possible solution to help collect insect data quickly and efficiently.

Here, we outline a methodology for training classification models to identify pitfall trap‐collected insects from image data and then apply the method to identify ground beetles (Carabidae). All beetles were collected by the National Ecological Observatory Network (NEON), a continental‐scale ecological monitoring project with sites across the United States. We describe the procedures for image collection, image data extraction, data preparation, and model training, and compare the performance of five machine learning algorithms and two classification methods (hierarchical vs. single‐level) in identifying ground beetles from the species to the subfamily level. All models were trained on pre‐extracted feature vectors, not raw image data. Our methodology allows data to be extracted from multiple individuals within the same image, enhancing time efficiency; it uses relatively simple models that allow direct assessment of model performance, and it can be applied to relatively small datasets.

The best performing algorithm, linear discriminant analysis (LDA), reached an accuracy of 84.6% at the species level when naively identifying species, which increased to >95% when classifications were limited by known local species pools. Model performance was negatively correlated with taxonomic specificity, with the LDA model reaching an accuracy of ~99% at the subfamily level. When classifying carabid species not included in the training dataset at higher taxonomic levels, the models performed significantly better than random classification. We also observed greater performance at higher taxonomic levels when classifications were made using the hierarchical classification method rather than the single‐level classification method.

The general methodology outlined here serves as a proof of concept for classifying pitfall trap‐collected organisms using machine learning algorithms, and the image data extraction methodology may also be used for non‐machine learning purposes. We propose that integrating machine learning into large‐scale identification pipelines will increase efficiency and lead to a greater flow of insect macroecological data, with the potential to be expanded for use with other, noninsect taxa.
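The classification setup described in the abstract can be sketched in a few lines. Below is a minimal, illustrative Python example (not the authors' code) of training an LDA classifier on pre‐extracted feature vectors and then restricting predictions to a known local species pool; the upstream feature‐extraction step, variable names, and pool structure are assumptions.

```python
# Sketch of the described workflow: LDA on image-derived feature vectors,
# with a second prediction mode constrained to a site's known species pool.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import train_test_split

def fit_lda(features, labels):
    """features: (n_specimens, n_features) array of pre-extracted descriptors."""
    X_tr, X_te, y_tr, y_te = train_test_split(
        features, labels, stratify=labels, random_state=0
    )
    lda = LinearDiscriminantAnalysis()
    lda.fit(X_tr, y_tr)
    print(f"naive species-level accuracy: {lda.score(X_te, y_te):.3f}")
    return lda, X_te, y_te

def predict_within_pool(lda, X, local_pool):
    """Suppress posterior probabilities of species absent from the local
    species pool, then take the argmax of what remains."""
    probs = lda.predict_proba(X)
    mask = np.isin(lda.classes_, list(local_pool))
    probs = probs * mask  # zero out species not recorded at this site
    return lda.classes_[probs.argmax(axis=1)]
```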
Award ID(s):
1702426
PAR ID:
10455027
Author(s) / Creator(s):
 ;  ;  ;  ;  ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Ecology and Evolution
Volume:
10
Issue:
23
ISSN:
2045-7758
Page Range / eLocation ID:
p. 13143-13153
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Kajtoch, Łukasz (Ed.)
    This study presents an initial model for bark beetle identification, serving as a foundational step toward a fully functional, practical identification tool. Bark beetles are known for the extensive damage they cause to forests globally, as well as for their uniform and homoplastic morphology, which poses identification challenges. We use a MaxViT-based deep learning backbone, which applies local and global attention, to classify bark beetles down to the genus level from images containing multiple beetles. The methodology involves image collection, preparation, and model training, leveraging pre-classified beetle species to ensure accuracy and reliability. The model's F1 score estimates of 0.99 and 1.0 indicate a strong ability to accurately classify genera in the collected data, including those previously unknown to the model. This makes it a valuable first step toward a tool for forest management and ecological research. While the current model distinguishes among 12 genera, further refinement and additional data will be necessary to achieve reliable species-level identification, which is particularly important for detecting new invasive species. Despite the controlled conditions of image collection and potential challenges in real-world application, this study provides the first model capable of identifying bark beetle genera, trained on by far the largest image set for any comparable insect group. We also designed a function that reports when a specimen appears to be unknown to the model. Further research is suggested to enhance the model's generalization capabilities and scalability, emphasizing the integration of advanced machine learning techniques for improved species classification and the detection of invasive or undescribed species.
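As a rough illustration of the setup described above, the sketch below builds a MaxViT backbone for genus-level classification and flags low-confidence predictions as unknown. The timm model name, the softmax threshold, and the confidence-based unknown test are assumptions for the example, not the authors' exact configuration.

```python
# Hedged sketch: MaxViT backbone adapted to 12 genera, plus a simple
# "unknown" report when the top softmax probability is low.
import torch
import timm

NUM_GENERA = 12
model = timm.create_model("maxvit_tiny_tf_224", pretrained=True,
                          num_classes=NUM_GENERA)

@torch.no_grad()
def classify_or_flag_unknown(batch, genera, threshold=0.8):
    """batch: (N, 3, 224, 224) tensor of beetle crops. Returns genus names,
    or 'unknown' when confidence falls below the (assumed) threshold."""
    model.eval()
    probs = torch.softmax(model(batch), dim=1)
    conf, idx = probs.max(dim=1)
    return [genera[i] if c >= threshold else "unknown"
            for c, i in zip(conf, idx)]
```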
  2. ABSTRACT Camera trap studies have become a popular medium for assessing many ecological phenomena, including population dynamics, patterns of biodiversity, and monitoring of endangered species. In conjunction with the benefit to scientists, camera traps present an unprecedented opportunity to involve the public in scientific research via image classifications. However, this engagement strategy comes with a myriad of complications. Volunteers vary in their familiarity with wildlife; thus, the accuracy of user‐derived classifications may be biased by the commonness or popularity of species and by user experience. From an extensive multi‐site camera trap study across Michigan, U.S.A., we compiled and classified images through a public science platform called Michigan ZoomIN. We aggregated responses from 15 independent users per image using multiple consensus methods and assessed accuracy by comparing to species identifications completed by wildlife experts. We also evaluated how different factors, including consensus algorithm, study area, wildlife species, user support, and camera type, influenced the accuracy of user‐derived classifications. Overall accuracy of user‐derived classification was 97%, although several canid (e.g., Canis lupus, Vulpes vulpes) and mustelid (e.g., Neovison vison) species were repeatedly difficult for users to identify and had lower accuracy. When validating user‐derived classification, we found that study area, consensus method, and user support best explained accuracy. To overcome hesitancy associated with data collected by untrained participants, we demonstrated their value by showing that the accuracy of volunteers was comparable to that of experts when classifying North American mammals. Our hierarchical workflow that integrated multiple consensus methods led to more image classifications without extensive training and even when the expertise of the volunteer was unknown. Ultimately, adopting such an approach can harness broader participation, expedite future camera trap data synthesis, and improve allocation of resources by scholars to enhance performance of public participants and increase accuracy of user‐derived data. © 2021 The Wildlife Society.
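A plurality-vote consensus over independent volunteer classifications, scored against expert labels, can be illustrated as below. This is a generic sketch under assumed data structures, not the Michigan ZoomIN pipeline itself.

```python
# Illustrative plurality consensus over ~15 volunteer votes per image,
# scored against expert labels. Input shapes are assumptions.
from collections import Counter

def plurality_consensus(votes):
    """votes: list of species labels from independent users for one image.
    Returns the winning label and the fraction of users who agreed."""
    (label, count), = Counter(votes).most_common(1)
    return label, count / len(votes)

def consensus_accuracy(user_votes_by_image, expert_labels):
    """Fraction of images where the consensus label matches the expert's."""
    hits = sum(
        plurality_consensus(votes)[0] == expert_labels[img]
        for img, votes in user_votes_by_image.items()
    )
    return hits / len(user_votes_by_image)
```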
  3. Abstract The biodiversity crisis necessitates spatially extensive methods to monitor multiple taxonomic groups for evidence of change in response to evolving environmental conditions. Programs that combine passive acoustic monitoring and machine learning are increasingly used to meet this need. These methods require large, annotated datasets, which are time‐consuming and expensive to produce, creating potential barriers to adoption in data‐ and funding‐poor regions. Recently released pre‐trained avian acoustic classification models provide opportunities to reduce the need for manual labelling and accelerate the development of new acoustic classification algorithms through transfer learning. Transfer learning is a strategy for developing algorithms under data scarcity that uses pre‐trained models from related tasks to adapt to new tasks.

Our primary objective was to develop a transfer learning strategy using the feature embeddings of a pre‐trained avian classification model to train custom acoustic classification models in data‐scarce contexts. We used three annotated avian acoustic datasets to test whether transfer learning and soundscape simulation‐based data augmentation could substantially reduce the annotated training data necessary to develop performant custom acoustic classifiers. We also conducted a sensitivity analysis for hyperparameter choice and model architecture. We then assessed the generalizability of our strategy to increasingly novel non‐avian classification tasks.

With as few as two training examples per class, our soundscape simulation data augmentation approach consistently yielded new classifiers with improved performance relative to the pre‐trained classification model and to transfer learning classifiers trained with other augmentation approaches. Performance increases were evident for three avian test datasets, including single‐class and multi‐label contexts. We observed that the relative performance among our data augmentation approaches varied for the avian datasets and nearly converged for one dataset when we included more training examples.

We demonstrate an efficient approach to developing new acoustic classifiers leveraging open‐source sound repositories and pre‐trained networks to reduce manual labelling. With very few examples, our soundscape simulation approach to data augmentation yielded classifiers with performance equivalent to those trained with many more examples, showing it is possible to reduce manual labelling while still achieving high‐performance classifiers and, in turn, expanding the potential for passive acoustic monitoring to address rising biodiversity monitoring needs.
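The core transfer-learning step, training a lightweight classifier on feature embeddings from a pre-trained model, can be sketched as follows. The sketch assumes embeddings have already been computed from the pre-trained avian model; the paper's soundscape-simulation augmentation would supply additional labelled clips before this step. File names and shapes are assumptions.

```python
# Minimal transfer-learning sketch: a small classifier fit on embeddings
# from a pre-trained network's penultimate layer, even with very few
# labelled examples per class.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_custom_classifier(embeddings, labels):
    """embeddings: (n_clips, d) array of pre-computed feature embeddings;
    labels: target classes, possibly with only ~2 examples per class."""
    clf = LogisticRegression(max_iter=1000)
    clf.fit(embeddings, labels)
    return clf

# Usage (illustrative): train_emb = np.load("train_embeddings.npy")
# clf = train_custom_classifier(train_emb, train_labels)
# preds = clf.predict(np.load("test_embeddings.npy"))
```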
  4. This data set contains the individual classifications that the Gravity Spy citizen science volunteers made for glitches through 20 July 2024. Classifications made by science team members or in testing workflows have been removed, as have classifications of glitches lacking a Gravity Spy identifier. See Zevin et al. (2017) for an explanation of the citizen science task and classification interface. Data about glitches with machine-learning labels are provided in an earlier data release (Glanzer et al., 2021). Final classifications combining ML and volunteer classifications are provided in Zevin et al. (2022).

22 of the classification labels match the labels used in the earlier data release, namely 1080Lines, 1400Ripples, Air_Compressor, Blip, Chirp, Extremely_Loud, Helix, Koi_Fish, Light_Modulation, Low_Frequency_Burst, Low_Frequency_Lines, No_Glitch, None_of_the_Above, Paired_Doves, Power_Line, Repeating_Blips, Scattered_Light, Scratchy, Tomte, Violin_Mode, Wandering_Line and Whistle. One glitch class added to the machine-learning classification, Blip_Low_Frequency, has not been added to the Zooniverse project and so does not appear in this file. Four classes were added to the citizen science platform but not to the machine learning model and so have only volunteer labels: 70HZLINE, HIGHFREQUENCYBURST, LOWFREQUENCYBLIP and PIZZICATO. The glitch class Fast_Scattering added to the machine-learning classification has an equivalent volunteer label, CROWN, which is used here (Soni et al. 2021).

Glitches are presented to volunteers in a succession of workflows. Workflows include glitches classified by a machine learning classifier as likely to belong to a subset of classes and offer the option to classify only those classes plus None_of_the_Above. Each level includes the classes available in lower levels. The top level does not add new classification options but includes all glitches, including those for which the machine learning model is uncertain of the class. Because the classes available to volunteers change depending on the workflow, a glitch might be classified as None_of_the_Above in a lower workflow and subsequently as a different class in a higher workflow. Workflows and available classes are shown in the table below.

Workflow ID | Name | Number of glitch classes | Glitch classes added
1610 | Level 1 | 3 | Blip, Whistle, None_of_the_Above
1934 | Level 2 | 6 | Koi_Fish, Power_Line, Violin_Mode
1935 | Level 3 | 10 | Chirp, Low_Frequency_Burst, No_Glitch, Scattered_Light
2360 | Original level 4 | 22 | 1080Lines, 1400Ripples, Air_Compressor, Extremely_Loud, Helix, Light_Modulation, Low_Frequency_Lines, Paired_Doves, Repeating_Blips, Scratchy, Tomte, Wandering_Line
7765 | New level 4 | 15 | 1080Lines, Extremely_Loud, Low_Frequency_Lines, Repeating_Blips, Scratchy
2117 | Original level 5 | 22 | No new glitch classes
7766 | New level 5 | 27 | 1400Ripples, Air_Compressor, Paired_Doves, Tomte, Wandering_Line, 70HZLINE, CROWN, HIGHFREQUENCYBURST, LOWFREQUENCYBLIP, PIZZICATO
7767 | Level 6 | 27 | No new glitch classes

Description of data fields

Classification_id: a unique identifier for the classification. A volunteer may choose multiple classes for a glitch, in which case there will be multiple rows with the same classification_id.
Subject_id: a unique identifier for the glitch being classified. This field can be used to join the classification to data about the glitch from the prior data release.
User_hash: an anonymized identifier for the user making the classification; for anonymous users, an identifier that can track the user within a session but may not persist across sessions.
Anonymous_user: True if the classification was made by a non-logged-in user.
Workflow: the Gravity Spy workflow in which the classification was made.
Workflow_version: the version of the workflow.
Timestamp: timestamp for the classification.
Classification: glitch class selected by the volunteer.

Related datasets

For machine learning classifications of all glitches in O1, O2, O3a, and O3b, see Gravity Spy Machine Learning Classifications on Zenodo.
For classifications of glitches combining machine learning and volunteer classifications, see Gravity Spy Volunteer Classifications of LIGO Glitches from Observing Runs O1, O2, O3a, and O3b.
For the training set used in Gravity Spy machine learning algorithms, see Gravity Spy Training Set on Zenodo.
For detailed information on the training set used for the original Gravity Spy machine learning paper, see Machine learning for Gravity Spy: Glitch classification and dataset on Zenodo.
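For orientation, here is a hedged example of working with the released fields: load the classifications, optionally drop anonymous users, and compute a per-glitch majority label by Subject_id. The file name is an assumption, and column-name casing may differ in the actual release.

```python
# Sketch of a per-glitch volunteer consensus from the classification table.
# Column names follow the field descriptions above (lowercased here).
import pandas as pd

df = pd.read_csv("gravity_spy_classifications.csv")  # assumed file name
df = df[~df["anonymous_user"]]  # optional: keep logged-in users only

# Count votes per (glitch, class), then take the majority class per glitch;
# ties are resolved arbitrarily by idxmax.
consensus = (
    df.groupby(["subject_id", "classification"]).size()
      .groupby(level="subject_id").idxmax()
      .apply(lambda pair: pair[1])
      .rename("consensus_label")
)
print(consensus.head())
```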
  5. Abstract Tropical elevation gradients support highly diverse assemblages, but competing hypotheses place peak species richness either in lowland rainforests or at mid‐elevations. We investigated scolytine beetles—phloem‐, ambrosia‐ and seed‐feeding beetles—along a tropical elevational gradient in Papua New Guinea.

Highly standardised sampling from 200 to 3700 m above sea level (asl) identified areas of highest and lowest species richness, abundance, and other biodiversity variables. Using passive flight intercept traps at eight elevations from 200 to 3500 m asl, we collected over 9600 specimens representing 215 species. Despite extensive sampling, species accumulation curves suggest that diversity was not fully exhausted.

Scolytine species richness followed a unimodal distribution, peaking between 700 and 1200 m asl, supporting prior findings of highest diversity at low‐to‐mid elevations. Alternative models, such as a monotonic decrease from the lowlands to higher elevations or a mid‐elevation maximum, fit our data less well. Abundance was greatest at the lowest sites, driven by a few extremely abundant species. The turnover rate—beta diversity between elevation steps—was greatest between the highest elevations.

Among the dominant tribes—Dryocoetini, Xyleborini and Cryphalini—species richness peaked between 700 and 2200 m asl. Taxon‐specific analyses revealed distinct patterns: Euwallacea spp. abundance declined uniformly with elevation, while other genera were driven by dominant species at different elevations. Coccotrypes and phloem‐feeding Cryphalus have undergone evolutionary radiations in New Guinea, with many species still undescribed. Species not yet known to science are most likely to be found at lower and middle elevations, where overall diversity is highest.
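The model comparison described above, a unimodal versus a monotonic richness-elevation relationship, can be illustrated with a small AIC comparison of ordinary least-squares fits. The richness values below are placeholders for illustration, not the study's data.

```python
# Illustrative comparison of a monotonic (linear) vs. unimodal (quadratic)
# fit of species richness against elevation, scored by AIC (lower is better).
import numpy as np

def aic_ols(design, y):
    """AIC for an OLS fit: n*log(RSS/n) + 2*(k + 1), with +1 for the
    error-variance parameter."""
    beta, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ beta
    n, k = design.shape
    rss = float(resid @ resid)
    return n * np.log(rss / n) + 2 * (k + 1)

elev = np.array([200, 700, 1200, 1700, 2200, 2700, 3200, 3500], float)
rich = np.array([38, 52, 49, 35, 30, 18, 9, 4], float)  # placeholder counts

linear = np.column_stack([np.ones_like(elev), elev])
quad = np.column_stack([np.ones_like(elev), elev, elev**2])
print("monotonic AIC:", aic_ols(linear, rich))
print("unimodal AIC: ", aic_ols(quad, rich))
```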