NSF PAR Search | NSF Public Access Repository

Efficient High-Throughput DNA Breathing Features Generation Using Jax-EPBD

https://doi.org/10.1101/2024.12.06.627191

Inan, Toki Tahmid; Kabir, Anowarul; Rasmussen, Kim; Shehu, Amarda; Usheva, Anny; Bishop, Alan; Alexandrov, Boian; Bhattarai, Manish (December 2024, bioRxiv)

Abstract: DNA breathing dynamics (transient base-pair opening and closing due to thermal fluctuations) are vital for processes like transcription, replication, and repair. Traditional models, such as the Extended Peyrard-Bishop-Dauxois (EPBD), provide insights into these dynamics but are computationally limited for long sequences. We present JAX-EPBD, a high-throughput Langevin molecular dynamics framework leveraging JAX for GPU-accelerated simulations, achieving up to 30x speedup and superior scalability compared to the original C-based EPBD implementation. JAX-EPBD efficiently captures time-dependent behaviors, including bubble lifetimes and base flipping kinetics, enabling genome-scale analyses. Applying it to transcription factor (TF) binding affinity prediction using SELEX datasets, we observed consistent improvements in R² values when incorporating breathing features with sequence data. Validating on the 77-bp AAV P5 promoter, JAX-EPBD revealed sequence-specific differences in bubble dynamics correlating with transcriptional activity. These findings establish JAX-EPBD as a powerful and scalable tool for understanding DNA breathing dynamics and their role in gene regulation and transcription factor binding.
Full Text Available
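
As a rough illustration of the kind of Langevin integration the JAX-EPBD abstract refers to, the sketch below advances a short chain of base-pair displacements in plain Python. It is not the authors' code. The energy uses the commonly published Peyrard-Bishop-Dauxois form (a Morse term per base pair plus an anharmonic stacking term between neighbors), every parameter value is an arbitrary placeholder, the names `energy`, `gradient`, `langevin_step`, and `simulate` are hypothetical, and a simple Euler-Maruyama update stands in for whatever integrator JAX-EPBD actually uses.

```python
import math
import random

# Illustrative placeholder parameters (assumed; NOT taken from the paper).
D, A = 0.05, 4.0               # Morse well depth and inverse width
K, RHO, B = 0.025, 2.0, 0.35   # stacking constant, anharmonicity, decay rate
MASS, GAMMA = 300.0, 0.05      # effective mass and friction coefficient


def energy(y):
    """Total PBD-style energy for displacements y (one value per base pair)."""
    morse = sum(D * (math.exp(-A * yi) - 1.0) ** 2 for yi in y)
    stack = sum(
        0.5 * K * (1.0 + RHO * math.exp(-B * (u + v))) * (u - v) ** 2
        for u, v in zip(y[1:], y[:-1])
    )
    return morse + stack


def gradient(y):
    """Analytic gradient of energy(y) with respect to each displacement."""
    grad = []
    for yi in y:
        e = math.exp(-A * yi)
        grad.append(-2.0 * A * D * e * (e - 1.0))  # derivative of Morse term
    for n in range(1, len(y)):
        u, v = y[n], y[n - 1]
        e = RHO * math.exp(-B * (u + v))
        common = -0.5 * K * B * e * (u - v) ** 2   # from the exponential factor
        spring = K * (1.0 + e) * (u - v)           # from the quadratic factor
        grad[n] += common + spring
        grad[n - 1] += common - spring
    return grad


def langevin_step(y, vel, dt, kT, rng):
    """One Euler-Maruyama step of underdamped Langevin dynamics."""
    grad = gradient(y)
    sigma = math.sqrt(2.0 * GAMMA * kT * dt / MASS)  # thermal noise scale
    new_vel = [
        v + dt * (-g / MASS - GAMMA * v) + sigma * rng.gauss(0.0, 1.0)
        for v, g in zip(vel, grad)
    ]
    new_y = [yi + dt * v for yi, v in zip(y, new_vel)]
    return new_y, new_vel


def simulate(n_bp, n_steps, dt=0.01, kT=0.025, seed=0):
    """Run the chain from rest at equilibrium and return final displacements."""
    rng = random.Random(seed)
    y, vel = [0.0] * n_bp, [0.0] * n_bp
    for _ in range(n_steps):
        y, vel = langevin_step(y, vel, dt, kT, rng)
    return y
```

Calling `simulate(8, 200)` returns eight final displacements. A GPU implementation would vectorize these loops and batch many sequences at once, which is the role a framework such as JAX would play.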
Scalable DNA Feature Generation and Transcription Factor Binding Prediction via Deep Surrogate Models

https://doi.org/10.1101/2024.12.06.626709

Kabir, Anowarul; Inan, Toki Tahmid; Rasmussen, Kim; Shehu, Amarda; Usheva, Anny; Bishop, Alan; Alexandrov, Boian; Bhattarai, Manish (December 2024, bioRxiv)

Abstract: Simulating DNA breathing dynamics, for instance with the Extended Peyrard-Bishop-Dauxois (EPBD) model, across the entire human genome using traditional biophysical methods like pyDNA-EPBD is computationally prohibitive due to intensive techniques such as Markov Chain Monte Carlo (MCMC) and Langevin dynamics. To overcome this limitation, we propose a deep surrogate generative model utilizing a conditional Denoising Diffusion Probabilistic Model (DDPM) trained on DNA sequence-EPBD feature pairs. This surrogate model efficiently generates high-fidelity DNA breathing features conditioned on DNA sequences, reducing computational time from months to hours, a speedup of over 1000 times. By integrating these features into the EPBDxDNABERT-2 model, we enhance the accuracy of transcription factor (TF) binding site predictions. Experiments demonstrate that the surrogate-generated features perform comparably to those obtained from the original EPBD framework, validating the model’s efficacy and fidelity. This advancement enables real-time, genome-wide analyses, significantly accelerating genomic research and offering powerful tools for disease understanding and therapeutic development.
Full Text Available
Domain-Specific Retrieval-Augmented Generation Using Vector Stores, Knowledge Graphs, and Tensor Factorization

Barron, Ryan C; Grantcharov, Vesselin; Wanna, Selma; Eren, Maksim E; Bhattarai, Manish; Solovyev, Nicholas; Tompkins, George; Nicholas, Charles; Rasmussen, Kim O; Matuszek, Cynthia; et al (December 2024, Proceedings of the 2024 International Conference on Machine Learning and Applications (ICMLA))

Large Language Models (LLMs) are pre-trained on large-scale corpora and excel in numerous general natural language processing (NLP) tasks, such as question answering (QA). Despite their advanced language capabilities, when it comes to domain-specific and knowledge-intensive tasks, LLMs suffer from hallucinations, knowledge cut-offs, and a lack of knowledge attribution. Additionally, fine-tuning LLMs' intrinsic knowledge to highly specific domains is an expensive and time-consuming process. Retrieval-augmented generation (RAG) has recently emerged as a method capable of optimizing LLM responses by referencing them to a predetermined ontology. It has been shown that using a Knowledge Graph (KG) ontology for RAG improves QA accuracy by taking into account relevant sub-graphs that preserve the information in a structured manner. In this paper, we introduce SMART-SLIC, a highly domain-specific LLM framework that integrates RAG with a KG and a vector store (VS) that stores factual domain-specific information. Importantly, to avoid hallucinations in the KG, we build these highly domain-specific KGs and VSs without the use of LLMs, relying instead on NLP, data mining, and nonnegative tensor factorization with automatic model selection. Pairing our RAG with a domain-specific (i) KG (containing structured information) and (ii) VS (containing unstructured information) enables the development of domain-specific chat-bots that attribute the source of information, mitigate hallucinations, lessen the need for fine-tuning, and excel in highly domain-specific question answering tasks. We pair SMART-SLIC with chain-of-thought prompting agents. The framework is designed to be generalizable so that it can adapt to any specific or specialized domain. In this paper, we demonstrate the question answering capabilities of our framework on a corpus of scientific publications on malware analysis and anomaly detection.
Full Text Available
Single Model Quality Estimation of Protein Structures via Non-negative Tensor Factorization

Kabir, Kazi Lutful; Bhattarai, Manish; Alexandrov, Boian S; Shehu, Amarda (January 2021, IEEE Intl Conf on Comput Adv in Bio and Medical Sciences (ICCABS) 2021)

Finding the inherent organization in the structure space of a protein molecule is central in many computational studies of proteins. Grouping or clustering tertiary structures of a protein has been leveraged to build representations of the structure-energy landscape, highlight stable and semi-stable structural states, support models of structural dynamics, and connect them to biological function. Over the years, our laboratory has introduced methods to reveal structural states and build models of state-to-state protein dynamics. These methods have also been shown to be competitive for an orthogonal problem known as model selection, where model refers to a computed tertiary structure. Building on this work, in this paper we present a novel, tensor factorization-based method that doubles as a non-parametric clustering method. While the method has broad applicability, here we focus on the estimation of model accuracy (EMA) problem and demonstrate the method's efficacy on it. The method outperforms state-of-the-art methods, including single-model methods that leverage deep neural networks and domain-specific insight.