This content will become publicly available on December 10, 2025

Title: Scalable DNA Feature Generation and Transcription Factor Binding Prediction via Deep Surrogate Models
Abstract Simulating DNA breathing dynamics, for instance with the Extended Peyrard-Bishop-Dauxois (EPBD) model, across the entire human genome using traditional biophysical methods such as pyDNA-EPBD is computationally prohibitive because these methods rely on intensive techniques such as Markov Chain Monte Carlo (MCMC) and Langevin dynamics. To overcome this limitation, we propose a deep surrogate generative model utilizing a conditional Denoising Diffusion Probabilistic Model (DDPM) trained on DNA sequence-EPBD feature pairs. This surrogate model efficiently generates high-fidelity DNA breathing features conditioned on DNA sequences, reducing computational time from months to hours, a speedup of over 1000 times. By integrating these features into the EPBDxDNABERT-2 model, we enhance the accuracy of transcription factor (TF) binding site predictions. Experiments demonstrate that the surrogate-generated features perform comparably to those obtained from the original EPBD framework, validating the model’s efficacy and fidelity. This advancement enables real-time, genome-wide analyses, significantly accelerating genomic research and offering powerful tools for disease understanding and therapeutic development.
Award ID(s):
2310113
PAR ID:
10612832
Publisher / Repository:
bioRxiv
Date Published:
Format(s):
Medium: X
Institution:
bioRxiv
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract DNA breathing dynamics (transient base-pair opening and closing due to thermal fluctuations) are vital for processes like transcription, replication, and repair. Traditional models, such as the Extended Peyrard-Bishop-Dauxois (EPBD), provide insights into these dynamics but are computationally limited for long sequences. We present JAX-EPBD, a high-throughput Langevin molecular dynamics framework leveraging JAX for GPU-accelerated simulations, achieving up to 30x speedup and superior scalability compared to the original C-based EPBD implementation. JAX-EPBD efficiently captures time-dependent behaviors, including bubble lifetimes and base flipping kinetics, enabling genome-scale analyses. Applying it to transcription factor (TF) binding affinity prediction using SELEX datasets, we observed consistent improvements in R² values when incorporating breathing features with sequence data. Validating on the 77-bp AAV P5 promoter, JAX-EPBD revealed sequence-specific differences in bubble dynamics correlating with transcriptional activity. These findings establish JAX-EPBD as a powerful and scalable tool for understanding DNA breathing dynamics and their role in gene regulation and transcription factor binding.
  2. Abstract Mitigating the adverse impacts caused by increasing flood risks in urban coastal communities requires effective flood prediction for prompt action. Typically, physics‐based 1‐D pipe/2‐D overland flow models are used to simulate urban pluvial flooding. Because these models require significant computational resources and have long run times, they are often unsuitable for real‐time flood prediction at a street scale. This study explores the potential of a machine learning method, Random Forest (RF), to serve as a surrogate model for urban flood predictions. The surrogate model was trained to relate topographic and environmental features to hourly water depths simulated by a high‐resolution 1‐D/2‐D physics‐based model at 16,914 road segments in the coastal city of Norfolk, Virginia, USA. Two training scenarios for the RF model were explored: (i) training on only the most flood‐prone street segments in the study area and (ii) training on all 16,914 street segments in the study area. The RF model yielded high predictive skill, especially for the scenario when the model was trained on only the most flood‐prone streets. The results also showed that the surrogate model reduced the computational run time of the physics‐based model by a factor of 3,000, making real‐time decision support more feasible compared to using the full physics‐based model. We concluded that machine learning surrogate models strategically trained on high‐resolution and high‐fidelity physics‐based models have the potential to significantly advance the ability to support decision making in real‐time flood management within urban communities. 
  3. Abstract Many viruses eject their DNA via a nanochannel in the viral shell, driven by internal forces arising from the high-density genome packing. The speed of DNA exit is controlled by friction forces that limit the molecular mobility, but the nature of this friction is unknown. We introduce a method to probe the mobility of the tightly confined DNA by measuring DNA exit from phage phi29 capsids with optical tweezers. We measure extremely low initial exit velocity, a regime of exponentially increasing velocity, stochastic pausing that dominates the kinetics and large dynamic heterogeneity. Measurements with variable applied force provide evidence that the initial velocity is controlled by DNA–DNA sliding friction, consistent with a Frenkel–Kontorova model for nanoscale friction. We confirm several aspects of the ejection dynamics predicted by theoretical models. Features of the pausing suggest that it is connected to the phenomenon of ‘clogging’ in soft matter systems. Our results provide evidence that DNA–DNA friction and clogging control the DNA exit dynamics, but that this friction does not significantly affect DNA packaging. 
  4. Abstract In the design of stellarators, energetic particle confinement is a critical point of concern which remains challenging to study from a numerical point of view. Standard Monte Carlo (MC) analyses are highly expensive because a large number of particle trajectories need to be integrated over long time scales, and small time steps must be taken to accurately capture the features of the wide variety of trajectories. Even when they are based on guiding center trajectories, as opposed to full-orbit trajectories, these standard MC studies are too expensive to be included in most stellarator optimization codes. We present the first multifidelity Monte Carlo (MFMC) scheme for accelerating the estimation of energetic particle confinement in stellarators. Our approach relies on a two-level hierarchy, in which a guiding center model serves as the high-fidelity model, and a data-driven linear interpolant is leveraged as the low-fidelity surrogate model. We apply MFMC to the study of energetic particle confinement in a four-period quasi-helically symmetric stellarator, assessing various metrics of confinement. Owing to the very high computational efficiency of our surrogate model and its sufficient correlation with the high-fidelity model, we obtain speedups of up to a factor of 10 with MFMC compared to standard MC.
  5. Abstract In contrast to the typified view of genome cycling only between haploidy and diploidy, there is evidence from across the tree of life of genome dynamics that alter both copy number (i.e. ploidy) and chromosome complements. Here, we highlight examples of such processes, including endoreplication, aneuploidy, inheritance of extrachromosomal DNA, and chromatin extrusion. Synthesizing data on eukaryotic genome dynamics in diverse extant lineages suggests the possibility that such processes were present before the last eukaryotic common ancestor. While present in some prokaryotes, these features appear exaggerated in eukaryotes where they are regulated by eukaryote-specific innovations including the nucleus, complex cytoskeleton, and synaptonemal complex. Based on these observations, we propose a model by which genome conflict drove the transformation of genomes during eukaryogenesis: from the origin of eukaryotes (i.e. first eukaryotic common ancestor) through the evolution of last eukaryotic common ancestor. 