Title: Scalable DNA Feature Generation and Transcription Factor Binding Prediction via Deep Surrogate Models
Abstract Simulating DNA breathing dynamics with models such as the Extended Peyrard-Bishop-Dauxois (EPBD) model across the entire human genome is computationally prohibitive with traditional biophysical tools like pyDNA-EPBD, because they rely on intensive techniques such as Markov Chain Monte Carlo (MCMC) and Langevin dynamics. To overcome this limitation, we propose a deep surrogate generative model that uses a conditional Denoising Diffusion Probabilistic Model (DDPM) trained on DNA sequence-EPBD feature pairs. This surrogate model efficiently generates high-fidelity DNA breathing features conditioned on DNA sequences, reducing computational time from months to hours, a speedup of more than 1000-fold. By integrating these features into the EPBDxDNABERT-2 model, we enhance the accuracy of transcription factor (TF) binding site predictions. Experiments demonstrate that the surrogate-generated features perform comparably to those obtained from the original EPBD framework, validating the model's efficacy and fidelity. This advancement enables real-time, genome-wide analyses, significantly accelerating genomic research and offering powerful tools for disease understanding and therapeutic development.
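As an illustration of what training a conditional DDPM on sequence-feature pairs involves, the sketch below builds one noised training example with the standard DDPM forward process, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε. It is a minimal sketch under stated assumptions and not the authors' implementation: the one-hot encoding, the linear noise schedule, the helper names (`one_hot`, `make_schedule`, `noised_training_example`), and the random placeholder feature vector are all hypothetical, and no denoising network is included.

```python
import numpy as np

BASES = "ACGT"


def one_hot(seq: str) -> np.ndarray:
    """Encode a DNA string as an (L, 4) one-hot array (hypothetical conditioning input)."""
    idx = np.array([BASES.index(b) for b in seq])
    return np.eye(4)[idx]


def make_schedule(num_steps: int = 1000, beta_start: float = 1e-4, beta_end: float = 0.02):
    """Linear beta schedule (an assumption) and the cumulative products alpha_bar_t."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bar = np.cumprod(1.0 - betas)
    return betas, alpha_bar


def noised_training_example(x0, cond, t, alpha_bar, rng):
    """Apply the DDPM forward process at step t.

    Returns (x_t, eps, cond, t). A denoiser would be trained to predict
    eps from (x_t, t, cond); that network is not part of this sketch.
    """
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps, cond, t


rng = np.random.default_rng(0)
seq = "ACGTTGCA"
cond = one_hot(seq)
# Placeholder per-base feature vector; NOT real EPBD output.
x0 = rng.random(len(seq))
betas, alpha_bar = make_schedule()
x_t, eps, cond, t = noised_training_example(x0, cond, 500, alpha_bar, rng)
```

The conditioning array is carried alongside the noised features unchanged; in a conditional DDPM only the target (here the placeholder feature vector) is noised, while the sequence is supplied to the denoiser at every step.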
Award ID(s):
2310113
PAR ID:
10612832
Publisher / Repository:
bioRxiv
Format(s):
Medium: X
Institution:
bioRxiv
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract DNA breathing dynamics (transient base-pair opening and closing due to thermal fluctuations) are vital for processes like transcription, replication, and repair. Traditional models, such as the Extended Peyrard-Bishop-Dauxois (EPBD), provide insights into these dynamics but are computationally limited for long sequences. We present JAX-EPBD, a high-throughput Langevin molecular dynamics framework leveraging JAX for GPU-accelerated simulations, achieving up to 30x speedup and superior scalability compared to the original C-based EPBD implementation. JAX-EPBD efficiently captures time-dependent behaviors, including bubble lifetimes and base flipping kinetics, enabling genome-scale analyses. Applying it to transcription factor (TF) binding affinity prediction using SELEX datasets, we observed consistent improvements in R² values when incorporating breathing features with sequence data. Validating on the 77-bp AAV P5 promoter, JAX-EPBD revealed sequence-specific differences in bubble dynamics correlating with transcriptional activity. These findings establish JAX-EPBD as a powerful and scalable tool for understanding DNA breathing dynamics and their role in gene regulation and transcription factor binding.
  2. Abstract We introduce new high-resolution galaxy simulations accelerated by a surrogate model that reduces the computation cost by approximately 75%. Massive stars with a zero-age main-sequence mass of more than about 10 M☉ explode as core-collapse supernovae (CCSNe), which play a critical role in galaxy formation. The energy released by CCSNe is essential for regulating star formation and driving feedback processes in the interstellar medium (ISM). However, the short integration time steps required for SN feedback have presented significant bottlenecks in astrophysical simulations across various scales. Overcoming this challenge is crucial for enabling star-by-star galaxy simulations, which aim to capture the dynamics of individual stars and the inhomogeneous shell's expansion within the turbulent ISM. To address this, our new framework combines direct numerical simulations and surrogate modeling, including machine learning and Gibbs sampling. The star formation history and the time evolution of outflow rates in the galaxy match those obtained from resolved direct numerical simulations. Our new approach achieves high-resolution fidelity while reducing computational costs, effectively bridging the physical scale gap and enabling multiscale simulations.
  3. Abstract In the design of stellarators, energetic particle confinement is a critical point of concern which remains challenging to study from a numerical point of view. Standard Monte Carlo (MC) analyses are highly expensive because a large number of particle trajectories need to be integrated over long time scales, and small time steps must be taken to accurately capture the features of the wide variety of trajectories. Even when they are based on guiding center trajectories, as opposed to full-orbit trajectories, these standard MC studies are too expensive to be included in most stellarator optimization codes. We present the first multifidelity Monte Carlo (MFMC) scheme for accelerating the estimation of energetic particle confinement in stellarators. Our approach relies on a two-level hierarchy, in which a guiding center model serves as the high-fidelity model, and a data-driven linear interpolant is leveraged as the low-fidelity surrogate model. We apply MFMC to the study of energetic particle confinement in a four-period quasi-helically symmetric stellarator, assessing various metrics of confinement. Stemming from the very high computational efficiency of our surrogate model as well as its sufficient correlation to the high-fidelity model, we obtain speedups of up to 10 with MFMC compared to standard MC. 
  4. Ciliates are a model lineage for studies of genome architecture given their unusual genome structures. All ciliates have both somatic macronuclei (MAC) and germline micronuclei (MIC), both of which develop from a zygotic nucleus following sex (i.e., conjugation). Nuclear developmental stages are not well documented among non-model ciliates, including Chilodonella uncinata (class Phyllopharyngea), the focus of our work. Here, we characterize nuclear architecture and genome dynamics in C. uncinata by combining 4′,6-diamidino-2-phenylindole (DAPI) staining and fluorescence in situ hybridization (FISH) techniques with confocal microscopy. We developed a telomere probe for staining, which alongside DAPI allows for the identification of fragmented somatic chromosomes among the total DNA in the nuclei. We quantify both total DNA and telomere-bound signals from more than 250 nuclei sampled from 116 individual cells, and analyze changes in DNA content and nuclear architecture across Chilodonella's nuclear life cycle. Specifically, we find that MAC developmental stages in the ciliate C. uncinata are different from those reported from other ciliate species. These data provide insights into nuclear dynamics during development and enrich our understanding of genome evolution in non-model ciliates. IMPORTANCE Ciliates are a clade of diverse single-celled eukaryotic microorganisms that contain at least one somatic macronucleus (MAC) and germline micronucleus (MIC) within each cell/organism. Ciliates rely on complex genome rearrangements to generate somatic genomes from a zygotic nucleus. However, the development of somatic nuclei has only been documented for a few model ciliate genera, including Paramecium, Tetrahymena, and Oxytricha. Here, we study the MAC developmental process in the non-model ciliate C. uncinata. We analyze both total DNA and the generation of gene-sized somatic chromosomes using a laser scanning confocal microscope to describe C. uncinata's nuclear life cycle. We show that DNA content changes dramatically during their life cycle and in a manner that differs from previous studies on model ciliates. Our study expands knowledge of genome dynamics in ciliates and among eukaryotes more broadly.
  5. Abstract The vacuum-assisted resin infusion mold (VARIM) process is widely used in wind blade manufacturing for its cost-effectiveness and reliability. However, the current method faces challenges such as long curing times and defects due to nonuniform heating across the blade structure. To address this, a multi-zone heated bed setup tailored to blade thickness has been considered. Determining an optimal temperature for each zone, however, poses a computational challenge, which can be tackled with a novel machine-learning approach. Using a digital twin based on a high-fidelity multiphysics solver, a time-distributed LSTM model was trained to capture complex resin curing dynamics. This eliminates the need for costly lab experiments, as the model learns heating patterns and curing behavior efficiently. Once trained, the ML model acts as a digital twin by predicting the degree of cure for a given temperature setpoint with 96.73% accuracy. When used as a surrogate in a Nelder-Mead optimization workflow, this model improves the curing time by roughly 12.5% and yields a more uniform curing rate throughout the part.