

This content will become publicly available on July 18, 2026

Title: BeetleVerse: A study on taxonomic classification of ground beetles
Ground beetles are a highly sensitive and speciose biological indicator, making them vital for monitoring biodiversity. However, they are currently an underutilized resource due to the manual effort required by taxonomic experts to perform challenging species differentiations based on subtle morphological differences, precluding widespread applications. In this paper, we evaluate 12 vision models on taxonomic classification across four diverse, long-tailed datasets spanning over 230 genera and 1769 species, with images ranging from controlled laboratory settings to challenging field-collected (in-situ) photographs. We further explore taxonomic classification in two important real-world contexts: sample efficiency and domain adaptation. Our results show that the Vision and Language Transformer combined with an MLP head is the best performing model, with 97% accuracy at genus and 94% at species level. Sample efficiency analysis shows that we can reduce train data requirements by up to 50% with minimal compromise in performance. The domain adaptation experiments reveal significant challenges when transferring models from lab to in-situ images, highlighting a critical domain gap. Overall, our study lays a foundation for large-scale automated taxonomic classification of beetles, and beyond that, advances sample-efficient learning and cross-domain adaptation for diverse long-tailed ecological datasets.
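As an illustration of the kind of pipeline the abstract evaluates, the sketch below pairs a frozen, pretrained vision backbone with a small trainable MLP classification head. It is not the authors' code: a torchvision ViT-B/16 stands in for the vision-and-language transformer they report as best performing, and the species count is taken from the abstract.

```python
# Minimal sketch (assumptions noted above), not the BeetleVerse implementation.
import torch
import torch.nn as nn
from torchvision.models import vit_b_16, ViT_B_16_Weights

class BeetleClassifier(nn.Module):
    """Frozen ViT backbone + trainable MLP head for species-level classification."""
    def __init__(self, num_classes: int, hidden: int = 512):
        super().__init__()
        self.backbone = vit_b_16(weights=ViT_B_16_Weights.DEFAULT)
        self.backbone.heads = nn.Identity()          # drop the ImageNet classifier
        for p in self.backbone.parameters():         # train only the MLP head
            p.requires_grad = False
        self.head = nn.Sequential(
            nn.Linear(768, hidden), nn.ReLU(), nn.Linear(hidden, num_classes)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.head(self.backbone(x))

model = BeetleClassifier(num_classes=1769)           # 1769 species, per the abstract
logits = model(torch.randn(2, 3, 224, 224))          # dummy batch of 224x224 images
print(logits.shape)                                  # torch.Size([2, 1769])
```

The same head could be retargeted at the roughly 230 genera instead of the species labels.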
Award ID(s):
2301322
PAR ID:
10639890
Author(s) / Creator(s):
Publisher / Repository:
arXiv
Date Published:
Page Range / eLocation ID:
2504.13393v2
Subject(s) / Keyword(s):
Computer vision; carabids; National Ecological Observatory Network
Format(s):
Medium: X
Location:
University of Maine
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Insect populations are changing rapidly, and monitoring these changes is essential for understanding the causes and consequences of such shifts. However, large-scale insect identification projects are time-consuming and expensive when done solely by human identifiers. Machine learning offers a possible solution to help collect insect data quickly and efficiently. Here, we outline a methodology for training classification models to identify pitfall trap-collected insects from image data and then apply the method to identify ground beetles (Carabidae). All beetles were collected by the National Ecological Observatory Network (NEON), a continental-scale ecological monitoring project with sites across the United States. We describe the procedures for image collection, image data extraction, data preparation, and model training, and compare the performance of five machine learning algorithms and two classification methods (hierarchical vs. single-level) identifying ground beetles from the species to subfamily level. All models were trained using pre-extracted feature vectors, not raw image data. Our methodology allows for data to be extracted from multiple individuals within the same image, thus enhancing time efficiency; utilizes relatively simple models that allow for direct assessment of model performance; and can be performed on relatively small datasets. The best performing algorithm, linear discriminant analysis (LDA), reached an accuracy of 84.6% at the species level when naively identifying species, which was further increased to >95% when classifications were limited by known local species pools. Model performance was negatively correlated with taxonomic specificity, with the LDA model reaching an accuracy of ~99% at the subfamily level. When classifying carabid species not included in the training dataset at higher taxonomic levels, the models performed significantly better than if classifications were made randomly. We also observed greater performance when classifications were made using the hierarchical classification method compared to the single-level classification method at higher taxonomic levels. The general methodology outlined here serves as a proof-of-concept for classifying pitfall trap-collected organisms using machine learning algorithms, and the image data extraction methodology may be used for non-machine-learning purposes. We propose that integration of machine learning in large-scale identification pipelines will increase efficiency and lead to a greater flow of insect macroecological data, with the potential to be expanded for use with other non-insect taxa.
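The abstract above hinges on two concrete steps: fitting a simple classifier (LDA) on pre-extracted feature vectors and restricting its predictions to a known local species pool. The sketch below illustrates both with scikit-learn on synthetic features; the taxon names and feature dimensions are placeholders, not NEON data.

```python
# Illustrative sketch only; synthetic features stand in for the pre-extracted vectors.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
labels = np.array(["Pterostichus", "Carabus", "Amara", "Harpalus"])  # placeholder taxa
X_train = rng.normal(size=(200, 16))                 # 16-dim feature vector per beetle
y_train = rng.choice(labels, size=200)

lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

def predict_with_local_pool(X, local_pool):
    """Keep only taxa known from the collection site, then take the argmax."""
    proba = lda.predict_proba(X)
    proba = proba * np.isin(lda.classes_, local_pool)  # zero out taxa outside the pool
    return lda.classes_[proba.argmax(axis=1)]

X_new = rng.normal(size=(5, 16))
print(predict_with_local_pool(X_new, local_pool=["Amara", "Carabus"]))
```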
  2. In the context of pressing climate change challenges and the significant biodiversity loss among arthropods, automated taxonomic classification from organismal images is a subject of intense research. However, traditional AI pipelines based on deep neural visual architectures such as CNNs or ViTs face limitations, including degraded performance on the long tail of classes and an inability to reason about their predictions. We integrate image captioning and retrieval-augmented generation (RAG) with large language models (LLMs) to enhance biodiversity monitoring, showing particular promise for characterizing rare and unknown arthropod species. While a naive Vision-Language Model (VLM) excels at classifying images of common species, the RAG model enables classification of rarer taxa by matching explicit textual descriptions of taxonomic features to contextual biodiversity text data from external sources. The RAG model shows promise in reducing overconfidence and enhancing accuracy relative to naive LLMs, suggesting its viability in capturing the nuances of the taxonomic hierarchy, particularly at the challenging family and genus levels. Our findings highlight the potential of modern vision-language AI pipelines to support biodiversity conservation initiatives, emphasizing the role of comprehensive data curation and collaboration with citizen science platforms to improve species identification, characterize unknown species, and ultimately inform conservation strategies.
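A rough sketch of the retrieval step such a RAG pipeline relies on appears below: a caption generated from an arthropod image (here, a hand-written placeholder string) is matched against reference taxon descriptions, and the best match would be injected into the LLM prompt. TF-IDF similarity stands in for whatever embedding model a real system would use, and the descriptions are illustrative, not curated biodiversity text.

```python
# Illustrative retrieval step only; not the authors' pipeline.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

reference_texts = {  # placeholder taxon descriptions
    "Carabidae": "ground beetles with prominent mandibles, striated elytra, long running legs",
    "Coccinellidae": "rounded dome-shaped beetles with spotted elytra and short clubbed antennae",
    "Staphylinidae": "elongate beetles with very short elytra exposing most of the abdomen",
}
caption = "elongate dark beetle, shortened elytra, abdomen exposed"  # stand-in VLM caption

vectorizer = TfidfVectorizer()
doc_vecs = vectorizer.fit_transform(reference_texts.values())
query_vec = vectorizer.transform([caption])
scores = cosine_similarity(query_vec, doc_vecs).ravel()

best_family, best_score = max(zip(reference_texts, scores), key=lambda t: t[1])
print(best_family, round(best_score, 3))  # retrieved context to pass into the LLM prompt
```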
  3. Understanding genomic sequences through the lens of language modeling has the potential to revolutionize biological research, yet challenges in tokenization, model architecture, and adaptation to diverse genomic contexts remain. In this study, we investigated key innovations in DNA sequence modeling, treating DNA as a language and applying language models to genomic data. We gathered two diverse pretraining datasets: one consisting of 19,551 reference genomes, including over 18,000 prokaryotic genomes (115B nucleotides), and another more balanced dataset with 1,354 genomes, including 1,166 prokaryotic and 188 eukaryotic reference genomes (180B nucleotides). We trained five byte-pair encoding tokenizers and pretrained 52 DNA language models, systematically comparing different architectures, hyperparameters, and classification heads. We introduce seqLens, a family of models based on disentangled attention with relative positional encoding, which outperforms state-of-the-art models in 13 of 19 benchmarking phenotypic predictions. We further explore continual pretraining, domain adaptation, and parameter-efficient fine-tuning methods to assess trade-offs between computational efficiency and accuracy. Our findings demonstrate that relevant pretraining data significantly boosts performance, alternative pooling techniques enhance classification, and larger tokenizers negatively impact generalization. These insights provide a foundation for optimizing DNA language models and improving genome annotations.
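One concrete ingredient the abstract mentions, training a byte-pair encoding tokenizer directly on DNA, can be sketched with the Hugging Face tokenizers library as below; the random sequences and vocabulary size are placeholders, not the paper's settings.

```python
# Illustrative sketch; synthetic sequences stand in for reference genomes.
import random
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer

random.seed(0)
sequences = ["".join(random.choices("ACGT", k=200)) for _ in range(500)]

tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
trainer = BpeTrainer(vocab_size=256, special_tokens=["[UNK]", "[PAD]"])
tokenizer.train_from_iterator(sequences, trainer)      # merges frequent nucleotide pairs

encoding = tokenizer.encode(sequences[0])
print(len(encoding.tokens), encoding.tokens[:10])      # multi-nucleotide tokens, not single bases
```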
  4. Machine learning (ML) based skin cancer detection tools are an example of a transformative medical technology that could democratize early detection of skin cancer for everyone. However, because they depend on training datasets, ML-based skin cancer detection tools suffer from systemic racial bias. Racial and ethnic communities that are not well represented in the training datasets cannot use these tools reliably, amplifying existing health disparities. Based on empirical observations, we posit that skin cancer training data is biased because its images mostly represent communities with lighter skin tones, despite skin cancer being far more lethal for people of color. In this paper, we use domain adaptation techniques, employing CycleGANs, to mitigate racial biases in state-of-the-art machine learning based skin cancer detection tools by adapting minority images to appear as the majority. Using our domain adaptation techniques to augment our minority datasets, we improve the accuracy, precision, recall, and F1 score of typical image classification machine learning models for skin cancer classification from a biased 50% accuracy rate to a 79% accuracy rate when testing on minority skin tone images. We also evaluate and demonstrate a proof-of-concept smartphone application.
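The augmentation step that the paragraph describes (running minority skin-tone images through a CycleGAN generator so they resemble the majority domain, then training the classifier on both versions) might look roughly like the sketch below. The one-layer "generator" is a stand-in for a trained CycleGAN generator, and the tensors are synthetic; only the augmentation logic is illustrated.

```python
# Illustrative sketch; the generator here is a placeholder, not a trained CycleGAN.
import torch
import torch.nn as nn

# Stand-in for a trained CycleGAN generator G: minority-tone -> majority-tone domain.
generator = nn.Sequential(nn.Conv2d(3, 3, kernel_size=3, padding=1), nn.Tanh())

def augment_minority_batch(gen: nn.Module, images: torch.Tensor) -> torch.Tensor:
    """Return the original minority-tone batch plus its domain-translated copies."""
    with torch.no_grad():
        translated = gen(images)                       # images re-rendered in the other domain
    return torch.cat([images, translated], dim=0)      # doubled batch for classifier training

batch = torch.rand(4, 3, 128, 128)                     # synthetic lesion images in [0, 1]
print(augment_minority_batch(generator, batch).shape)  # torch.Size([8, 3, 128, 128])
```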
  5. Vanilla models for object detection and instance segmentation suffer from a heavy bias toward detecting frequent objects in the long-tailed setting. Existing methods address this issue mostly during training, e.g., by re-sampling or re-weighting. In this paper, we investigate a largely overlooked approach: post-processing calibration of confidence scores. We propose NorCal, Normalized Calibration for long-tailed object detection and instance segmentation, a simple and straightforward recipe that reweighs the predicted scores of each class by its training sample size. We show that separately handling the background class and normalizing the scores over classes for each proposal are keys to achieving superior performance. On the LVIS dataset, NorCal effectively improves nearly all the baseline models, not only on rare classes but also on common and frequent classes. Finally, we conduct extensive analysis and ablation studies to offer insights into various modeling choices and mechanisms of our approach. Our code is publicly available at https://github.com/tydpan/NorCal.
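A back-of-the-envelope version of the calibration idea described above (divide each foreground class score by a power of its training frequency, keep the background score aside, and renormalize per proposal) is sketched below with NumPy. The exponent and the toy numbers are illustrative; consult the linked repository for the actual formulation.

```python
# Illustrative NorCal-style calibration; not the released implementation.
import numpy as np

def calibrate(scores: np.ndarray, class_counts: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """scores: (num_proposals, num_fg_classes + 1), background score in the last column."""
    fg, bg = scores[:, :-1], scores[:, -1:]
    adjusted = fg / (class_counts[None, :] ** gamma)       # penalize frequent classes
    adjusted = adjusted / adjusted.sum(axis=1, keepdims=True) * (1.0 - bg)
    return np.concatenate([adjusted, bg], axis=1)          # rows still sum to 1

scores = np.array([[0.60, 0.10, 0.10, 0.20]])              # 3 foreground classes + background
counts = np.array([10_000, 500, 50])                       # long-tailed training sample sizes
print(calibrate(scores, counts))                           # rare classes gain relative weight
```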