Persistent Gaps and Errors in Reference Databases Impede Ecologically Meaningful Taxonomy Assignments in 18S rRNA Studies: A Case Study of Terrestrial and Marine Nematodes

De Santiago, Alejandro; Pereira, Tiago José; Ferrero, Timothy John; Barnes, Natalie; Lallias, Delphine; Creer, Simon; Bik, Holly M

doi:10.1002/edn3.70080

ABSTRACT In metabarcoding studies, Linnaean taxonomy assignments of Operational Taxonomic Units (OTUs) or Amplicon Sequence Variants (ASVs) underpin many downstream bioinformatics analyses and ecological interpretations of environmental DNA (eDNA) datasets. However, public molecular databases (e.g., SILVA, EUKARYOME, BOLD) for most microbial metazoan phyla (nematodes, tardigrades, kinorhynchs, etc.) are sparsely populated, negatively impacting our ability to assign ecologically meaningful taxonomy to these understudied groups. Additionally, the choice of bioinformatics parameters and computational algorithms can further affect the accuracy of eDNA taxonomy assignments. Here, we use two in silico datasets to show that taxonomy assignments using the 18S rRNA gene can be dramatically improved by curating Linnaean taxonomy strings associated with each reference sequence and closing phylogenetic gaps by improving taxon sampling. Using free‐living nematodes as a case study, we applied two commonly used taxonomy assignment algorithms (BLAST+ and the QIIME2 Naïve Bayes classifier) across six iterations of the SILVA 138 reference database to evaluate the precision and accuracy of taxonomy assignments. The BLAST+ top hit with a 90% sequence similarity cutoff often returned the highest percentage of correctly assigned taxonomy at the genus level, and the QIIME2 Naïve Bayes classifier performed similarly well when paired with a reference database containing corrected taxonomy strings. Our results highlight the urgent need for phylogenetically informed expansions of public reference databases (encompassing both genomes and common gene markers), focused on poorly sampled lineages that are now robustly recovered via eDNA metabarcoding approaches.
Additional taxonomy curation efforts should be applied to popular reference databases such as SILVA, and taxon sampling could be rapidly improved by more frequent incorporation of newly published GenBank sequences linked to genus‐ and/or species‐level identifications.
