NSF PAR Search | NSF Public Access Repository

Efficient High-Throughput DNA Breathing Features Generation Using Jax-EPBD

https://doi.org/10.1101/2024.12.06.627191

Inan, Toki Tahmid; Kabir, Anowarul; Rasmussen, Kim; Shehu, Amarda; Usheva, Anny; Bishop, Alan; Alexandrov, Boian; Bhattarai, Manish (December 2024, bioRxiv)

Abstract DNA breathing dynamics (transient base-pair opening and closing due to thermal fluctuations) are vital for processes like transcription, replication, and repair. Traditional models, such as the Extended Peyrard-Bishop-Dauxois (EPBD), provide insights into these dynamics but are computationally limited for long sequences. We present JAX-EPBD, a high-throughput Langevin molecular dynamics framework leveraging JAX for GPU-accelerated simulations, achieving up to 30x speedup and superior scalability compared to the original C-based EPBD implementation. JAX-EPBD efficiently captures time-dependent behaviors, including bubble lifetimes and base flipping kinetics, enabling genome-scale analyses. Applying it to transcription factor (TF) binding affinity prediction using SELEX datasets, we observed consistent improvements in R² values when incorporating breathing features with sequence data. Validating on the 77-bp AAV P5 promoter, JAX-EPBD revealed sequence-specific differences in bubble dynamics correlating with transcriptional activity. These findings establish JAX-EPBD as a powerful and scalable tool for understanding DNA breathing dynamics and their role in gene regulation and transcription factor binding.
Free, publicly-accessible full text available December 12, 2025
Scalable DNA Feature Generation and Transcription Factor Binding Prediction via Deep Surrogate Models

https://doi.org/10.1101/2024.12.06.626709

Kabir, Anowarul; Inan, Toki Tahmid; Rasmussen, Kim; Shehu, Amarda; Usheva, Anny; Bishop, Alan; Alexandrov, Boian; Bhattarai, Manish (December 2024, bioRxiv)

Abstract Simulating DNA breathing dynamics, for instance with the Extended Peyrard-Bishop-Dauxois (EPBD) model, across the entire human genome using traditional biophysical methods like pyDNA-EPBD is computationally prohibitive due to intensive techniques such as Markov Chain Monte Carlo (MCMC) and Langevin dynamics. To overcome this limitation, we propose a deep surrogate generative model utilizing a conditional Denoising Diffusion Probabilistic Model (DDPM) trained on DNA sequence-EPBD feature pairs. This surrogate model efficiently generates high-fidelity DNA breathing features conditioned on DNA sequences, reducing computational time from months to hours, a speedup of over 1000 times. By integrating these features into the EPBDxDNABERT-2 model, we enhance the accuracy of transcription factor (TF) binding site predictions. Experiments demonstrate that the surrogate-generated features perform comparably to those obtained from the original EPBD framework, validating the model’s efficacy and fidelity. This advancement enables real-time, genome-wide analyses, significantly accelerating genomic research and offering powerful tools for disease understanding and therapeutic development.
Free, publicly-accessible full text available December 10, 2025
Variant Effect Prediction in the Age of Machine Learning

https://doi.org/10.1101/cshperspect.a041467

Bromberg, Yana; Prabakaran, R; Kabir, Anowarul; Shehu, Amarda (July 2024, Cold Spring Harbor Perspectives in Biology)

Over the years, many computational methods have been created for the analysis of the impact of single amino acid substitutions resulting from single-nucleotide variants in genome coding regions. Historically, all methods have been supervised and thus limited by the inadequate sizes of experimentally curated data sets and by the lack of a standardized definition of variant effect. The emergence of unsupervised, deep learning (DL)-based methods raised an important question: Can machines learn the language of life from the unannotated protein sequence data well enough to identify significant errors in the protein “sentences”? Our analysis suggests that some unsupervised methods perform as well as or better than existing supervised methods. Unsupervised methods are also faster and can, thus, be useful in large-scale variant evaluations. For all other methods, however, their performance varies by both evaluation metrics and by the type of variant effect being predicted. We also note that the evaluation of method performance is still lacking on less-studied, nonhuman proteins where unsupervised methods hold the most promise.
Full Text Available
In the twilight zone of protein sequence homology: do protein language models learn protein structure?

https://doi.org/10.1093/bioadv/vbae119

Kabir, Anowarul; Moldwin, Asher; Bromberg, Yana; Shehu, Amarda; Gogovi, Gideon (ed.) (August 2024, Bioinformatics Advances)

Abstract Motivation: Protein language models based on the transformer architecture are increasingly improving performance on protein prediction tasks, including secondary structure, subcellular localization, and more. Despite being trained only on protein sequences, protein language models appear to implicitly learn protein structure. This paper investigates whether sequence representations learned by protein language models encode structural information and to what extent. Results: We address this by evaluating protein language models on remote homology prediction, where identifying remote homologs from sequence information alone requires structural knowledge, especially in the “twilight zone” of very low sequence identity. Through rigorous testing at progressively lower sequence identities, we profile the performance of protein language models ranging from millions to billions of parameters in a zero-shot setting. Our findings indicate that while transformer-based protein language models outperform traditional sequence alignment methods, they still struggle in the twilight zone. This suggests that current protein language models have not sufficiently learned protein structure to address remote homology prediction when sequence signals are weak. Availability and implementation: We believe this opens the way for further research both on remote homology prediction and on the broader goal of learning sequence- and structure-rich representations of protein molecules. All code, data, and models are made publicly available.
A More Informative and Reproducible Remote Homology Evaluation for Protein Language Models

Moldwin, Asher; Kabir, Anowarul; Shehu, Amarda (February 2024, LLMs4Bio)

Recent studies exploring the abilities of transformer-based protein language models have highlighted their performance on the task of remote homology detection, but have not provided datasets or evaluation procedures geared toward properly measuring performance on this task. With the goal of obtaining more informative and reproducible results, we offer a detailed procedure for constructing datasets and evaluating remote homology detection performance in a way that allows detailed analyses to be performed that shed light on the remote homology detection performance throughout the “twilight zone” of low sequence similarity. Using the proposed procedures, we found that three state-of-the-art protein language models exhibit diminishing performance when the pairwise sequence similarity between the query sequence and other proteins is restricted to below 35% identity.
Full Text Available
A More Informative and Reproducible Remote Homology Evaluation for Protein Language Models

Moldwin, Asher; Kabir, Anowarul; Shehu, Amarda (February 2024, AAAI 2024 LLMs4Bio Workshop)

Recent studies exploring the abilities of transformer-based protein language models have highlighted their performance on the task of remote homology detection, but have not provided datasets or evaluation procedures geared toward properly measuring performance on this task. With the goal of obtaining more informative and reproducible results, we offer a detailed procedure for constructing datasets and evaluating remote homology detection performance in a way that allows detailed analyses to be performed that shed light on the remote homology detection performance throughout the “twilight zone” of low sequence similarity. Using the proposed procedures, we found that three state-of-the-art protein language models exhibit diminishing performance when the pairwise sequence similarity between the query sequence and other proteins is restricted to below 35% identity.
Full Text Available
A Comparative Analysis of Transformer-based Protein Language Models for Remote Homology Prediction

https://doi.org/10.1145/3584371.3612942

Kabir, Anowarul; Moldwin, Asher; Shehu, Amarda (September 2023, ACM)

Full Text Available
Sequence-Structure Embeddings via Protein Language Models Improve on Prediction Tasks

https://doi.org/10.1109/ICKG55886.2022.00021

Kabir, Anowarul; Shehu, Amarda (November 2022, IEEE Intl Conf on Knowledge Graphs (ICKG))
Analysis of AlphaFold2 for Modeling Structures of Wildtype and Variant Protein Sequences

https://doi.org/10.29007/5g4v

Kabir, Anowarul; Inan, Toki; Shehu, Amarda (March 2022, EPiC Series in Computing)

ResNet and, more recently, AlphaFold2 have demonstrated that deep neural networks can now predict a tertiary structure of a given protein amino-acid sequence with high accuracy. This seminal development will allow molecular biology researchers to advance various studies linking sequence, structure, and function. Many studies will undoubtedly focus on the impact of sequence mutations on stability, fold, and function. In this paper, we evaluate the ability of AlphaFold2 to predict accurate tertiary structures of wildtype and mutated sequences of protein molecules. We do so on a benchmark dataset in mutation modeling studies. Our empirical evaluation utilizes global and local structure analyses and yields several interesting observations. It shows, for instance, that AlphaFold2 performs similarly on wildtype and variant sequences. The placement of the main chain of a protein molecule is highly accurate. However, while AlphaFold2 reports similar confidence in its predictions over wildtype and variant sequences, its performance on placements of the side chains suffers in comparison to main-chain predictions. The analysis overall supports the premise that AlphaFold2-predicted structures can be utilized in further downstream tasks, but that further refinement of these structures may be necessary.
Full Text Available
Graph Neural Networks in Predicting Protein Function and Interactions

https://doi.org/10.1007/978-981-16-6054-2_25

Kabir, Anowarul; Shehu, Amarda (July 2021, Graph Neural Networks: Foundations, Frontiers, and Applications)

Full Text Available