SetBERT: the deep learning platform for contextualized embeddings and explainable predictions from high-throughput sequencing

Ludwig, II, David W (ORCID: 0009-0001-5851-7964); Guptil, Christopher; Alexander, Nicholas R (ORCID: 0009-0004-7271-5520); Zhalnina, Kateryna (ORCID: 0000-0002-1893-1888); Wipf, Edi M-L (ORCID: 0000-0002-1025-9310); Khasanova, Albina (ORCID: 0000-0001-6652-012X); Barber, Nicholas A (ORCID: 0000-0003-1653-0009); Swingley, Wesley (ORCID: 0000-0001-7734-5179); Walker, Donald M (ORCID: 0000-0003-3119-8809); Phillips, Joshua L (ORCID: 0000-0002-4619-6083); Birol, ed., Inanc

doi:10.1093/bioinformatics/btaf370

Citation Details

Abstract

Motivation: High-throughput sequencing (HTS) is a modern sequencing technology used to profile microbiomes by sequencing thousands of short genomic fragments from the microorganisms within a given sample. This technology presents a unique opportunity for artificial intelligence to comprehend the underlying functional relationships of microbial communities. However, due to the unstructured nature of HTS data, nearly all computational models are limited to processing DNA sequences individually. This limitation causes them to miss key interactions between microorganisms, significantly hindering our understanding of how these interactions influence microbial communities as a whole. Furthermore, most computational methods rely on post-processing of samples, which could inadvertently introduce protocol-specific bias.

Results: Addressing these concerns, we present SetBERT, a robust pre-training methodology for creating generalized deep learning models that process HTS data to produce contextualized embeddings and that can be fine-tuned for downstream tasks with explainable predictions. By leveraging sequence interactions, we show that SetBERT significantly outperforms other models in taxonomic classification, with a genus-level classification accuracy of 95%. Furthermore, we demonstrate that SetBERT is able to accurately explain its predictions autonomously by confirming the biological relevance of taxa identified by the model.

Availability and implementation: All source code is available at https://github.com/DLii-Research/setbert. SetBERT may be used through the q2-deepdna QIIME 2 plugin, whose source code is available at https://github.com/DLii-Research/q2-deepdna.

Award ID(s): 2236580, 2125065, 1933925, 1742518

PAR ID: 10614464

Author(s) / Creator(s): Ludwig, II, David W; Guptil, Christopher; Alexander, Nicholas R; Zhalnina, Kateryna; Wipf, Edi M-L; Khasanova, Albina; Barber, Nicholas A; Swingley, Wesley; Walker, Donald M; Phillips, Joshua L; Birol, ed., Inanc

Publisher / Repository: Oxford University Press

Date Published: 2025-06-25

Journal Name: Bioinformatics

Volume: 41

Issue: 7

ISSN: 1367-4811

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Journal Article:
https://doi.org/10.1093/bioinformatics/btaf370
