SetBERT: the deep learning platform for contextualized embeddings and explainable predictions from high-throughput sequencing

https://doi.org/10.1093/bioinformatics/btaf370

Ludwig, II, David W; Guptil, Christopher; Alexander, Nicholas R; Zhalnina, Kateryna; Wipf, Edi M-L; Khasanova, Albina; Barber, Nicholas A; Swingley, Wesley; Walker, Donald M; Phillips, Joshua L; et al. (June 2025, Bioinformatics)

Abstract

Motivation: High-throughput sequencing (HTS) is a modern sequencing technology used to profile microbiomes by sequencing thousands of short genomic fragments from the microorganisms within a given sample. This technology presents a unique opportunity for artificial intelligence to comprehend the underlying functional relationships of microbial communities. However, due to the unstructured nature of HTS data, nearly all computational models are limited to processing DNA sequences individually. This limitation causes them to miss key interactions between microorganisms, significantly hindering our understanding of how these interactions influence microbial communities as a whole. Furthermore, most computational methods rely on post-processing of samples, which could inadvertently introduce protocol-specific bias.

Results: Addressing these concerns, we present SetBERT, a robust pre-training methodology for creating generalized deep learning models that process HTS data to produce contextualized embeddings and can be fine-tuned for downstream tasks with explainable predictions. By leveraging sequence interactions, we show that SetBERT significantly outperforms other models in taxonomic classification, with a genus-level classification accuracy of 95%. Furthermore, we demonstrate that SetBERT is able to accurately explain its predictions autonomously by confirming the biological relevance of taxa identified by the model.

Availability and implementation: All source code is available at https://github.com/DLii-Research/setbert. SetBERT may be used through the q2-deepdna QIIME 2 plugin, whose source code is available at https://github.com/DLii-Research/q2-deepdna.