Title: Classifying the unknown: Insect identification with deep hierarchical Bayesian learning
Abstract

Classifying insect species involves a tedious process of identifying distinctive morphological insect characters by taxonomic experts. Machine learning can harness the power of computers to potentially create an accurate and efficient method for performing this task at scale, given that its analytical processing can be more sensitive to subtle physical differences in insects, which experts may not perceive. However, existing machine learning methods are designed to only classify insect samples into described species, thus failing to identify samples from undescribed species.

We propose a novel deep hierarchical Bayesian model for insect classification, given the taxonomic hierarchy inherent in insects. This model can classify samples of both described and undescribed species; described samples are assigned a species while undescribed samples are assigned a genus, which is a pivotal advancement over just identifying them as outliers. We demonstrated this proof of concept on a new database containing paired insect image and DNA barcode data from four insect orders, including 1040 species, which far exceeds the number of species used in existing work. A quarter of the species were excluded from the training set to simulate undescribed species.

With the proposed classification framework using combined image and DNA data in the model, species classification accuracy for described species was 96.66% and genus classification accuracy for undescribed species was 81.39%. Including both data sources in the model resulted in significant improvement over including image data only (39.11% accuracy for described species and 35.88% genus accuracy for undescribed species), and modest improvement over including DNA data only (73.39% genus accuracy for undescribed species).

Unlike current machine learning methods, the proposed deep hierarchical Bayesian learning approach can simultaneously classify samples of both described and undescribed species, a functionality that could become instrumental in biodiversity monitoring across the globe. This framework can be customized for any taxonomic classification problem for which image and DNA data can be obtained, thus making it relevant for use across all biological kingdoms.
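The evaluation protocol described in the abstract (holding out a quarter of the species to simulate undescribed species, then scoring described samples at the species level and undescribed samples at the genus level) can be sketched roughly as follows. This is only an illustrative sketch of the bookkeeping, not the authors' hierarchical Bayesian model; the function names `hold_out_species` and `two_tier_accuracy`, the `species_to_genus` mapping, and the toy labels are all assumptions introduced here.

```python
import random


def hold_out_species(species_to_genus, fraction=0.25, seed=0):
    """Split species into 'described' (kept for training) and
    'undescribed' (held out to simulate species unknown to the model).

    species_to_genus: dict mapping species name -> genus name (hypothetical input).
    """
    rng = random.Random(seed)
    species = sorted(species_to_genus)  # sort first so the split is reproducible
    rng.shuffle(species)
    n_held = int(len(species) * fraction)
    undescribed = set(species[:n_held])
    described = set(species[n_held:])
    return described, undescribed


def two_tier_accuracy(records, described, species_to_genus):
    """Score predictions at two taxonomic levels.

    records: iterable of (true_species, predicted_label) pairs.
    A described sample counts as correct if the predicted label is its species;
    an undescribed sample counts as correct if the predicted label is its genus.
    """
    hits = {"described": 0, "undescribed": 0}
    totals = {"described": 0, "undescribed": 0}
    for true_species, predicted in records:
        if true_species in described:
            key, target = "described", true_species
        else:
            key, target = "undescribed", species_to_genus[true_species]
        totals[key] += 1
        hits[key] += int(predicted == target)
    # Return None for a group with no samples rather than dividing by zero.
    return {k: (hits[k] / totals[k] if totals[k] else None) for k in totals}


if __name__ == "__main__":
    # Toy taxonomy: four genera with two species each (made-up labels).
    taxonomy = {"a1": "A", "a2": "A", "b1": "B", "b2": "B",
                "c1": "C", "c2": "C", "d1": "D", "d2": "D"}
    described, undescribed = hold_out_species(taxonomy)
    print(len(described), "described species;", len(undescribed), "held out")
```

Reporting the two accuracies separately, as the abstract does, keeps the species-level result for described samples from masking the harder genus-level task for undescribed ones.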

 
NSF-PAR ID:
10408193
Publisher / Repository:
Wiley-Blackwell
Journal Name:
Methods in Ecology and Evolution
Volume:
14
Issue:
6
ISSN:
2041-210X
Format(s):
Medium: X Size: p. 1515-1530
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Insect populations are changing rapidly, and monitoring these changes is essential for understanding the causes and consequences of such shifts. However, large‐scale insect identification projects are time‐consuming and expensive when done solely by human identifiers. Machine learning offers a possible solution to help collect insect data quickly and efficiently.

    Here, we outline a methodology for training classification models to identify pitfall trap‐collected insects from image data and then apply the method to identify ground beetles (Carabidae). All beetles were collected by the National Ecological Observatory Network (NEON), a continental scale ecological monitoring project with sites across the United States. We describe the procedures for image collection, image data extraction, data preparation, and model training, and compare the performance of five machine learning algorithms and two classification methods (hierarchical vs. single‐level) identifying ground beetles from the species to subfamily level. All models were trained using pre‐extracted feature vectors, not raw image data. Our methodology allows for data to be extracted from multiple individuals within the same image thus enhancing time efficiency, utilizes relatively simple models that allow for direct assessment of model performance, and can be performed on relatively small datasets.

    The best performing algorithm, linear discriminant analysis (LDA), reached an accuracy of 84.6% at the species level when naively identifying species, which was further increased to >95% when classifications were limited by known local species pools. Model performance was negatively correlated with taxonomic specificity, with the LDA model reaching an accuracy of ~99% at the subfamily level. When classifying carabid species not included in the training dataset at higher taxonomic levels, the models performed significantly better than if classifications were made randomly. We also observed greater performance when classifications were made using the hierarchical classification method compared to the single‐level classification method at higher taxonomic levels.

    The general methodology outlined here serves as a proof‐of‐concept for classifying pitfall trap‐collected organisms using machine learning algorithms, and the image data extraction methodology may be used for nonmachine learning uses. We propose that integration of machine learning in large‐scale identification pipelines will increase efficiency and lead to a greater flow of insect macroecological data, with the potential to be expanded for use with other noninsect taxa.

     
  2. Abstract

    Environmental DNA (eDNA) metabarcoding is a promising method to monitor species and community diversity that is rapid, affordable and non‐invasive. The longstanding needs of the eDNA community are modular informatics tools, comprehensive and customizable reference databases, flexibility across high‐throughput sequencing platforms, fast multilocus metabarcode processing and accurate taxonomic assignment. Improvements in bioinformatics tools make addressing each of these demands within a single toolkit a reality.

    The new modular metabarcode sequence toolkit Anacapa (https://github.com/limey-bean/Anacapa/) addresses the above needs, allowing users to build comprehensive reference databases and assign taxonomy to raw multilocus metabarcode sequence data. A novel aspect of Anacapa is its database building module, “Creating Reference libraries Using eXisting tools” (CRUX), which generates comprehensive reference databases for specific user‐defined metabarcoding loci. The Quality Control and ASV Parsing module sorts and processes multiple metabarcoding loci and processes merged, unmerged and unpaired reads maximizing recovered diversity. DADA2 then detects amplicon sequence variants (ASVs) and the Anacapa Classifier module aligns these ASVs to CRUX‐generated reference databases using Bowtie2. Lastly, taxonomy is assigned to ASVs with confidence scores using a Bayesian Lowest Common Ancestor (BLCA) method. The Anacapa Toolkit also includes an R package, ranacapa, for automated results exploration through standard biodiversity statistical analysis.

    Benchmarking tests verify that the Anacapa Toolkit effectively and efficiently generates comprehensive reference databases that capture taxonomic diversity, and can assign taxonomy to both MiSeq and HiSeq‐length sequence data. We demonstrate the value of the Anacapa Toolkit in assigning taxonomy to seawater eDNA samples collected in southern California.

    The Anacapa Toolkit improves the functionality of eDNA and streamlines biodiversity assessment and management by generating metabarcode specific databases, processing multilocus data, retaining a larger proportion of sequencing reads and expanding non‐traditional eDNA targets. All the components of the Anacapa Toolkit are open and available in a virtual container to ease installation.

     
  3. Abstract

    Although they are a valuable source of specimens, insect natural history collections continue to be under‐utilized in molecular systematics, mostly due to difficulties in obtaining DNA sequences. Old specimens or specimens stored under suboptimal conditions are intractable for traditional Sanger sequencing. In this study we use an inexpensive hybrid capture with in‐house generated baits to retrieve commonly utilized ribosomal and mitochondrial loci from old museum specimens and combine them with a Sanger‐generated dataset comprising recently collected material. We focus on the Corixidea genus group (Schizopteridae), which comprises rarely collected, small (1–2 mm) and primarily tropical insects of which only c. 10–20% of the species have been described. A molecular phylogeny is needed to resolve relationships and revise the genus‐level classification to correctly place the c. 150 yet to be described species. Applying this approach, we constructed a dataset, containing 101 taxa, 11 of which were preserved in low‐percentage ethanol, 48 are dry and point‐mounted, and 40 are > 20 years old at DNA extraction. The obtained data proved sufficient for reconstructing a well‐supported phylogeny with c. 50% of the predicted diversity, and for the oldest successfully sequenced specimen (95 years) to be unambiguously placed in that phylogeny. We confirmed monophyly of the Corixidea genus group, showed paraphyly of the genus Corixidea, and recovered nine well‐supported clades within the group. Ancestral character states of selected morphological features were inferred and used to re‐examine primary homology hypotheses and inform an upcoming taxonomic revision.

     
  4. Abstract

    The diverse, largely Neotropical subtribe Euptychiina is widely regarded as one of the most taxonomically challenging groups among all butterflies. Over the last two decades, morphological and molecular studies have revealed widespread paraphyly and polyphyly among genera, and a comprehensive, robust phylogenetic hypothesis is needed to build a firm generic classification to support ongoing taxonomic revisions at the species level. Here, we generated a dataset that includes sequences for up to nine nuclear genes and the mitochondrial COI ‘barcode’ for a total of 1280 specimens representing 449 described and undescribed species of Euptychiina and 39 out‐groups, resulting in the most complete phylogeny for the subtribe to date. In combination with a recently developed genomic backbone tree, this dataset resulted in a topology with strong support for most branches. We recognize eight major clades that each contain two or more genera, together containing all but seven Euptychiina genera. We provide a summary of the taxonomy, diversity and natural history of each clade, and discuss taxonomic changes implied by the phylogenetic results. We describe nine new genera to accommodate 38 described species: Lazulina Willmott, Nakahara & Espeland, gen.n., Saurona Huertas & Willmott, gen.n., Argentaria Huertas & Willmott, gen.n., Taguaiba Freitas, Zacca & Siewert, gen.n., Xenovena Marín & Nakahara, gen.n., Deltaya Willmott, Nakahara & Espeland, gen.n., Modica Zacca, Casagrande & Willmott, gen.n., Occulta Nakahara & Willmott, gen.n., and Trico Nakahara & Espeland, gen.n. We also synonymize Nubila Viloria, Andrade & Henao, 2019 (syn.n.) with Splendeuptychia Forster, 1964, Macrocissia Viloria, Le Crom & Andrade, 2019 (syn.n.) with Satyrotaygetis Forster, 1964, and Rudyphthimoides Viloria, 2022 (syn.n.) with Malaveria Viloria & Benmesbah, 2020. Overall, we revised the generic placement of 79 species (74 new generic combinations and five revised combinations), and as a result all but six described species of Euptychiina are accommodated within 70 named, monophyletic genera. For all newly described genera, we provide illustrations of representative species, drawings of wing venation and male and (where possible) female genitalia, and distribution maps, and summarize the natural history of the genus. For three new monotypic genera, Occulta gen.n., Trico gen.n. and Xenovena gen.n., we provide a taxonomic revision with a review of the taxonomy of each species and data from examined specimens. We provide a revised synonymic list for Euptychiina containing 460 valid described species, 53 subspecies and 255 synonyms, including several new synonyms and reinstated species.

     
  5. Metagenomics is a technique for genome-wide profiling of microbiomes; this technique generates billions of DNA sequences called reads. Given the multiplication of metagenomic projects, computational tools are necessary to enable the efficient and accurate classification of metagenomic reads without needing to construct a reference database. The program DL-TODA presented here aims to classify metagenomic reads using a deep learning model trained on over 3000 bacterial species. A convolutional neural network architecture originally designed for computer vision was applied for the modeling of species-specific features. Using synthetic testing data simulated with 2454 genomes from 639 species, DL-TODA was shown to classify nearly 75% of the reads with high confidence. The classification accuracy of DL-TODA was over 0.98 at taxonomic ranks above the genus level, making it comparable with Kraken2 and Centrifuge, two state-of-the-art taxonomic classification tools. DL-TODA also achieved an accuracy of 0.97 at the species level, which is higher than 0.93 by Kraken2 and 0.85 by Centrifuge on the same test set. Application of DL-TODA to the human oral and cropland soil metagenomes further demonstrated its use in analyzing microbiomes from diverse environments. Compared to Centrifuge and Kraken2, DL-TODA predicted distinct relative abundance rankings and is less biased toward a single taxon. 