skip to main content


Title: Efficient Parameter Estimation for DNA Kinetics Modeled as Continuous-Time Markov Chains
Nucleic acid kinetic simulators aim to predict the kinetics of interacting nucleic acid strands. Many simulators model the kinetics of interacting nucleic acid strands as continuous-time Markov chains (CTMCs). States of the CTMCs represent a collection of secondary structures, and transitions between the states correspond to the forming or breaking of base pairs and are determined by a nucleic acid kinetic model. The number of states these CTMCs can form may be exponentially large in the length of the strands, making two important tasks challenging, namely, mean first passage time (MFPT) estimation and parameter estimation for kinetic models based on MFPTs. Gillespie’s stochastic simulation algorithm (SSA) is widely used to analyze nucleic acid folding kinetics, but could be computationally expensive for reactions whose CTMC has a large state space or for slow reactions. It could also be expensive for arbitrary parameter sets that occur in parameter estimation. Our work addresses these two challenging tasks, in the full state space of all non-pseudoknotted secondary structures of each reaction. In the first task, we show how to use a reduced variance stochastic simulation algorithm (RVSSA), which is adapted from SSA, to estimate the MFPT of a reaction’s CTMC. In the second task, we estimate model parameters based on MFPTs. To this end, first, we show how to use a generalized method of moments (GMM) approach, where we minimize a squared norm of moment functions that we formulate based on experimental and estimated MFPTs. Second, to speed up parameter estimation, we introduce a fixed path ensemble inference (FPEI) approach, that we adapt from RVSSA. We implement and evaluate RVSSA and FPEI using the Multistrand kinetic simulator. In our experiments on a dataset of DNA reactions, FPEI speeds up parameter estimation compared to inference using SSA, by more than a factor of three for slow reactions. Also, for reactions with large state spaces, it speeds up parameter estimation by more than a factor of two.  more » « less
Award ID(s):
1643606
NSF-PAR ID:
10112046
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
DNA Computing and Molecular Programming
Volume:
11648
Page Range / eLocation ID:
80-99
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Models of nucleic acid thermal stability are calibrated to a wide range of experimental observations, and typically predict equilibrium probabilities of nucleic acid secondary structures with reasonable accuracy. By comparison, a similar calibration and evaluation of nucleic acid kinetic models to a broad range of measurements has not been attempted so far. We introduce an Arrhenius model of interacting nucleic acid kinetics that relates the activation energy of a state transition with the immediate local environment of the affected base pair. Our model can be used in stochastic simulations to estimate kinetic properties and is consistent with existing thermodynamic models. We infer parameters for our model using an ensemble Markov chain Monte Carlo (MCMC) approach on a training dataset with 320 kinetic measurements of hairpin closing and opening, helix association and dissociation, bubble closing and toehold-mediated strand exchange. Our new model surpasses the performance of the previously established Metropolis model both on the training set and on a testing set of size 56 composed of toehold-mediated 3-way strand displacement with mismatches and hairpin opening and closing rates: reaction rates are predicted to within a factor of three for 93.4% and 78.5% of reactions for the training and testing sets, respectively. 
    more » « less
  2. Abstract

    Oligonucleotide hybridization is crucial in various biological, prebiotic and nanotechnological processes, including gene regulation, non-enzymatic primer extension and DNA nanodevice assembly. Although extensive research has focused on the thermodynamics and kinetics of nucleic acid hybridization, the behavior of complex mixtures and the outcome of competition for target binding remain less well understood. In this study, we investigate the impact of mismatches and bulges in a 12 bp DNA or RNA duplex on its association (kon) and dissociation (koff) kinetics. We find that such defects have relatively small effects on the association kinetics, while the dissociation kinetics vary in a position-dependent manner by up to 6 orders of magnitude. Building upon this observation, we explored a competition scenario involving multiple oligonucleotides, and observed a transient low specificity of probe hybridization to fully versus partially complementary targets in solution. We characterize these long-lived metastable states and their evolution toward equilibrium, and show that sufficiently long-lived mis-paired duplexes can serve as substrates for prebiotically relevant chemical copying reactions. Our results suggest that transient low accuracy states may spontaneously emerge within all complex nucleic acid systems comprising a large enough number of competing strands, with potential repercussions for gene regulation in the realm of modern biology and the prebiotic preservation of genetic information.

     
    more » « less
  3. Abstract Although the governing equations of many systems, when derived from first principles, may be viewed as known, it is often too expensive to numerically simulate all the interactions they describe. Therefore, researchers often seek simpler descriptions that describe complex phenomena without numerically resolving all the interacting components. Stochastic differential equations (SDEs) arise naturally as models in this context. The growth in data acquisition, both through experiment and through simulations, provides an opportunity for the systematic derivation of SDE models in many disciplines. However, inconsistencies between SDEs and real data at short time scales often cause problems, when standard statistical methodology is applied to parameter estimation. The incompatibility between SDEs and real data can be addressed by deriving sufficient statistics from the time-series data and learning parameters of SDEs based on these. Here, we study sufficient statistics computed from time averages, an approach that we demonstrate to lead to sufficient statistics on a variety of problems and that has the secondary benefit of obviating the need to match trajectories. Following this approach, we formulate the fitting of SDEs to sufficient statistics from real data as an inverse problem and demonstrate that this inverse problem can be solved by using ensemble Kalman inversion. Furthermore, we create a framework for non-parametric learning of drift and diffusion terms by introducing hierarchical, refinable parameterizations of unknown functions, using Gaussian process regression. We demonstrate the proposed methodology for the fitting of SDE models, first in a simulation study with a noisy Lorenz ’63 model, and then in other applications, including dimension reduction in deterministic chaotic systems arising in the atmospheric sciences, large-scale pattern modeling in climate dynamics and simplified models for key observables arising in molecular dynamics. The results confirm that the proposed methodology provides a robust and systematic approach to fitting SDE models to real data. 
    more » « less
  4. Chen, Ho-Lin ; Evans, Constantine G. (Ed.)
    Polynomial time dynamic programming algorithms play a crucial role in the design, analysis and engineering of nucleic acid systems including DNA computers and DNA/RNA nanostructures. However, in complex multistranded or pseudoknotted systems, computing the minimum free energy (MFE), and partition function of nucleic acid systems is NP-hard. Despite this, multistranded and/or pseudoknotted systems represent some of the most utilised and successful systems in the field. This leaves open the tempting possibility that many of the kinds of multistranded and/or pseudoknotted systems we wish to engineer actually fall into restricted classes, that do in fact have polynomial time algorithms, but we've just not found them yet. Here, we give polynomial time algorithms for MFE and partition function calculation for a restricted kind of multistranded system called the 1D scaffolded DNA computer. This model of computation thermodynamically favours correct outputs over erroneous states, simulates finite state machines in 1D and Boolean circuits in 2D, and is amenable to DNA storage applications. In an effort to begin to ask the question of whether we can naturally compare the expressivity of nucleic acid systems based on the computational complexity of prediction of their preferred energetic states, we show our MFE problem is in logspace (the complexity class L), making it perhaps one of the simplest known, natural, nucleic acid MFE problems. Finally, we provide a stochastic kinetic simulator for the 1D scaffolded DNA computer and evaluate strategies for efficiently speeding up this thermodynamically favourable system in a constant-temperature kinetic regime. 
    more » « less
  5. In many mechanistic medical, biological, physical, and engineered spatiotemporal dynamic models the numerical solution of partial differential equations (PDEs), especially for diffusion, fluid flow and mechanical relaxation, can make simulations impractically slow. Biological models of tissues and organs often require the simultaneous calculation of the spatial variation of concentration of dozens of diffusing chemical species. One clinical example where rapid calculation of a diffusing field is of use is the estimation of oxygen gradients in the retina, based on imaging of the retinal vasculature, to guide surgical interventions in diabetic retinopathy. Furthermore, the ability to predict blood perfusion and oxygenation may one day guide clinical interventions in diverse settings, i.e., from stent placement in treating heart disease to BOLD fMRI interpretation in evaluating cognitive function (Xie et al., 2019 ; Lee et al., 2020 ). Since the quasi-steady-state solutions required for fast-diffusing chemical species like oxygen are particularly computationally costly, we consider the use of a neural network to provide an approximate solution to the steady-state diffusion equation. Machine learning surrogates, neural networks trained to provide approximate solutions to such complicated numerical problems, can often provide speed-ups of several orders of magnitude compared to direct calculation. Surrogates of PDEs could enable use of larger and more detailed models than are possible with direct calculation and can make including such simulations in real-time or near-real time workflows practical. Creating a surrogate requires running the direct calculation tens of thousands of times to generate training data and then training the neural network, both of which are computationally expensive. Often the practical applications of such models require thousands to millions of replica simulations, for example for parameter identification and uncertainty quantification, each of which gains speed from surrogate use and rapidly recovers the up-front costs of surrogate generation. We use a Convolutional Neural Network to approximate the stationary solution to the diffusion equation in the case of two equal-diameter, circular, constant-value sources located at random positions in a two-dimensional square domain with absorbing boundary conditions. Such a configuration caricatures the chemical concentration field of a fast-diffusing species like oxygen in a tissue with two parallel blood vessels in a cross section perpendicular to the two blood vessels. To improve convergence during training, we apply a training approach that uses roll-back to reject stochastic changes to the network that increase the loss function. The trained neural network approximation is about 1000 times faster than the direct calculation for individual replicas. Because different applications will have different criteria for acceptable approximation accuracy, we discuss a variety of loss functions and accuracy estimators that can help select the best network for a particular application. We briefly discuss some of the issues we encountered with overfitting, mismapping of the field values and the geometrical conditions that lead to large absolute and relative errors in the approximate solution. 
    more » « less