This content will become publicly available on March 1, 2026

Title: Persistent Gaps and Errors in Reference Databases Impede Ecologically Meaningful Taxonomy Assignments in 18S rRNA Studies: A Case Study of Terrestrial and Marine Nematodes
ABSTRACT In metabarcoding studies, Linnaean taxonomy assignments of Operational Taxonomic Units (OTUs) or Amplicon Sequence Variants (ASVs) underpin many downstream bioinformatics analyses and ecological interpretations of environmental DNA (eDNA) datasets. However, public molecular databases (i.e., SILVA, EUKARYOME, BOLD) for most microbial metazoan phyla (nematodes, tardigrades, kinorhynchs, etc.) are sparsely populated, negatively impacting our ability to assign ecologically meaningful taxonomy to these understudied groups. Additionally, the choice of bioinformatics parameters and computational algorithms can further affect the accuracy of eDNA taxonomy assignments. Here, we use two in silico datasets to show that taxonomy assignments using the 18S rRNA gene can be dramatically improved by curating Linnaean taxonomy strings associated with each reference sequence and closing phylogenetic gaps by improving taxon sampling. Using free‐living nematodes as a case study, we applied two commonly used taxonomy assignment algorithms (BLAST+ and the QIIME2 Naïve Bayes classifier) across six iterations of the SILVA 138 reference database to evaluate the precision and accuracy of taxonomy assignments. The BLAST+ top hit with a 90% sequence similarity cutoff often returned the highest percentage of correctly assigned taxonomy at the genus level, and the QIIME2 Naïve Bayes classifier performed similarly well when paired with a reference database containing corrected taxonomy strings. Our results highlight the urgent need for phylogenetically informed expansions of public reference databases (encompassing both genomes and common gene markers), focused on poorly sampled lineages that are now robustly recovered via eDNA metabarcoding approaches. Additional taxonomy curation efforts should be applied to popular reference databases such as SILVA, and taxon sampling could be rapidly improved by more frequent incorporation of newly published GenBank sequences linked to genus‐ and/or species‐level identifications.
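The "BLAST+ top hit with a 90% sequence similarity cutoff" rule mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline. It assumes BLAST+ tabular output in the default 12-column layout (`-outfmt 6`), picks the top hit by bit score, and assumes the genus is the last semicolon-separated field of each reference taxonomy string. The function names `top_hits` and `genus_accuracy`, and the convention that unassigned queries count as incorrect, are hypothetical choices made for illustration.

```python
from typing import Dict, Iterable, Tuple

Hit = Tuple[str, float, float]  # (subject id, percent identity, bit score)


def top_hits(blast_lines: Iterable[str], min_identity: float = 90.0) -> Dict[str, Hit]:
    """Keep the highest-bit-score hit per query at or above min_identity.

    Assumes default BLAST+ tabular columns (-outfmt 6):
    qseqid sseqid pident length mismatch gapopen qstart qend sstart send evalue bitscore
    """
    best: Dict[str, Hit] = {}
    for line in blast_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        pident, bitscore = float(fields[2]), float(fields[11])
        if pident < min_identity:
            continue  # below the similarity cutoff: treated as unassigned
        # On ties, the first hit encountered is kept.
        if query not in best or bitscore > best[query][2]:
            best[query] = (subject, pident, bitscore)
    return best


def genus_accuracy(
    hits: Dict[str, Hit],
    ref_taxonomy: Dict[str, str],
    expected_genus: Dict[str, str],
) -> float:
    """Fraction of queries whose top hit carries the expected genus.

    Assumption: the genus is the last ';'-separated field of the reference
    taxonomy string. Queries with no passing hit count as incorrect.
    """
    if not expected_genus:
        return 0.0
    correct = 0
    for query, genus in expected_genus.items():
        hit = hits.get(query)
        if hit is None:
            continue
        assigned = ref_taxonomy.get(hit[0], "").split(";")[-1].strip()
        if assigned == genus:
            correct += 1
    return correct / len(expected_genus)
```

Under these assumptions, a mislabeled reference taxonomy string lowers the score even when the sequence match itself is good, which is the kind of effect the curated database iterations in the abstract are meant to expose.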
Award ID(s):
2144304
PAR ID:
10585249
Publisher / Repository:
Wiley
Journal Name:
Environmental DNA
Volume:
7
Issue:
2
ISSN:
2637-4943
Page Range / eLocation ID:
e70080
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Fish biodiversity is an important indicator of ecosystem health and a priority for the National Park Service in Drakes Estero, a shallow estuary within Point Reyes National Seashore, Marin County, California. However, fish diversity has yet to be described following the removal of oyster aquaculture infrastructure within Drakes Estero from 2016 to 2017. We used environmental DNA (eDNA) to characterize fish biodiversity within Drakes Estero. We amplified fish eDNA with MiFish primers and classified sequences with a 12S rRNA reference database. We identified 110 unique operational taxonomic units (OTUs, at 97% similarity) within the estuary from 40 samples across 4 sites. From these 110 OTUs, we identified 9 species and 13 taxonomic groups at the genus, family, order, or class level within the estuary. Species‐level assignments are limited by a lack of representative sequences targeted by the MiFish primers for 42% of eelgrass fishes in our region that we identified from a literature review in the Northeast Pacific (NEP) from Elkhorn Slough to Humboldt Bay. Despite this limitation, we identified some common Drakes Estero fishes with our eDNA surveys, including the three‐spined stickleback (Gasterosteus aculeatus), Pacific staghorn sculpin (Leptocottus armatus), surfperches (Embiotocidae), gobies (Gobiidae), and a hound shark (Triakidae). We also compared fish biodiversity within the estuary with that from nearby Limantour Beach, a coastal site. Limantour Beach differed in community composition from Drakes Estero and was characterized by high relative abundances of anchovy (Engraulis sp.) and herring (Clupea sp.). Thus, we can distinguish estuarine and non‐estuarine sites (<10 km away) with eDNA surveys. Further, eDNA surveys accounted for greater fish diversity than seine surveys conducted at one site within the estuary. Environmental DNA surveys will likely be a useful tool to monitor fish biodiversity across eelgrass estuaries in the Northeast Pacific, especially as reference databases become better populated with regional species.
  2. Abstract Motivation Environmental DNA (eDNA), as a rapidly expanding research field, stands to benefit from shared resources including sampling protocols, study designs, discovered sequences, and taxonomic assignments to sequences. High-quality community shareable eDNA resources rely heavily on comprehensive metadata documentation that captures the complex workflows covering field sampling, molecular biology lab work, and bioinformatic analyses. There are limited sources that provide documentation of database development on comprehensive metadata for eDNA and these workflows and no open-source software. Results We present medna-metadata, an open-source, modular system that aligns with Findable, Accessible, Interoperable, and Reusable guiding principles that support scholarly data reuse and the database and application development of a standardized metadata collection structure that encapsulates critical aspects of field data collection, wet lab processing, and bioinformatic analysis. Medna-metadata is showcased with metabarcoding data from the Gulf of Maine (Polinski et al., 2019). Availability and implementation The source code of the medna-metadata web application is hosted on GitHub (https://github.com/Maine-eDNA/medna-metadata). Medna-metadata is a docker-compose installable package. Documentation can be found at https://medna-metadata.readthedocs.io/en/latest/?badge=latest. The application is implemented in Python, PostgreSQL and PostGIS, RabbitMQ, and NGINX, with all major browsers supported. A demo can be found at https://demo.metadata.maine-edna.org/. Supplementary information Supplementary data are available at Bioinformatics online.
  3. Abstract Motivation As the cost of sequencing decreases, the amount of data being deposited into public repositories is increasing rapidly. Public databases rely on the user to provide metadata for each submission that is prone to user error. Unfortunately, most public databases, such as non-redundant (NR), rely on user input and do not have methods for identifying errors in the provided metadata, leading to the potential for error propagation. Previous research on a small subset of the non-redundant (NR) database analyzed misclassification based on sequence similarity. To the best of our knowledge, the amount of misclassification in the entire database has not been quantified. We propose a heuristic method to detect potentially misclassified taxonomic assignments in the NR database. We applied a curation technique and quality control to find the most probable taxonomic assignment. Our method incorporates provenance and frequency of each annotation from manually and computationally created databases and clustering information at 95% similarity. Results We found more than 2 million potentially taxonomically misclassified proteins in the NR database. Using simulated data, we show a high precision of 97% and a recall of 87% for detecting taxonomically misclassified proteins. The proposed approach and findings could also be applied to other databases. Availability Source code, dataset, documentation, Jupyter notebooks, and Docker container are available at https://github.com/boalang/nr. Supplementary information Supplementary data are available at Bioinformatics online. 
  4. Marschall, Tobias (Ed.)
    Abstract Summary CONSTAX—the CONSensus TAXonomy classifier—was developed for accurate and reproducible taxonomic annotation of fungal rDNA amplicon sequences and is based upon a consensus approach of RDP, SINTAX and UTAX algorithms. CONSTAX2 extends these features to classify prokaryotes as well as eukaryotes and incorporates BLAST-based classifiers to reduce classification errors. Additionally, CONSTAX2 implements a conda-installable command-line tool with improved classification metrics, faster training, multithreading support, capacity to incorporate external taxonomic databases and new isolate matching and high-level taxonomy tools, replete with documentation and example tutorials. Availability and implementation CONSTAX2 is available at https://github.com/liberjul/CONSTAXv2, and is packaged for Linux and MacOS from Bioconda with use under the MIT License. A tutorial and documentation are available at https://constax.readthedocs.io/en/latest/. Data and scripts associated with the manuscript are available at https://github.com/liberjul/CONSTAXv2_ms_code. Supplementary information Supplementary data are available at Bioinformatics online. 
  5. Abstract Metagenomics studies leverage genomic reference databases to generate discoveries in basic science and translational research. However, current microbial studies use disparate reference databases that lack consistent standards of specimen inclusion, data preparation, taxon labelling and accessibility, hindering their quality and comprehensiveness, and calling for the establishment of recommendations for reference genome database assembly. Here, we analyze existing fungal and bacterial databases and discuss guidelines for the development of a master reference database that promises to improve the quality and quantity of omics research.