skip to main content

Title: Efficient Parameter Estimation for DNA Kinetics Modeled as Continuous-Time Markov Chains
Nucleic acid kinetic simulators aim to predict the kinetics of interacting nucleic acid strands. Many simulators model the kinetics of interacting nucleic acid strands as continuous-time Markov chains (CTMCs). States of the CTMCs represent a collection of secondary structures, and transitions between the states correspond to the forming or breaking of base pairs and are determined by a nucleic acid kinetic model. The number of states these CTMCs can form may be exponentially large in the length of the strands, making two important tasks challenging, namely, mean first passage time (MFPT) estimation and parameter estimation for kinetic models based on MFPTs. Gillespie’s stochastic simulation algorithm (SSA) is widely used to analyze nucleic acid folding kinetics, but could be computationally expensive for reactions whose CTMC has a large state space or for slow reactions. It could also be expensive for arbitrary parameter sets that occur in parameter estimation. Our work addresses these two challenging tasks, in the full state space of all non-pseudoknotted secondary structures of each reaction. In the first task, we show how to use a reduced variance stochastic simulation algorithm (RVSSA), which is adapted from SSA, to estimate the MFPT of a reaction’s CTMC. In the second task, we estimate model parameters based on MFPTs. To this end, first, we show how to use a generalized method of moments (GMM) approach, where we minimize a squared norm of moment functions that we formulate based on experimental and estimated MFPTs. Second, to speed up parameter estimation, we introduce a fixed path ensemble inference (FPEI) approach, that we adapt from RVSSA. We implement and evaluate RVSSA and FPEI using the Multistrand kinetic simulator. In our experiments on a dataset of DNA reactions, FPEI speeds up parameter estimation compared to inference using SSA, by more than a factor of three for slow reactions. Also, for reactions with large state spaces, it speeds up parameter estimation by more than a factor of two.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
DNA Computing and Molecular Programming
Page Range / eLocation ID:
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Models of nucleic acid thermal stability are calibrated to a wide range of experimental observations, and typically predict equilibrium probabilities of nucleic acid secondary structures with reasonable accuracy. By comparison, a similar calibration and evaluation of nucleic acid kinetic models to a broad range of measurements has not been attempted so far. We introduce an Arrhenius model of interacting nucleic acid kinetics that relates the activation energy of a state transition with the immediate local environment of the affected base pair. Our model can be used in stochastic simulations to estimate kinetic properties and is consistent with existing thermodynamic models. We infer parameters for our model using an ensemble Markov chain Monte Carlo (MCMC) approach on a training dataset with 320 kinetic measurements of hairpin closing and opening, helix association and dissociation, bubble closing and toehold-mediated strand exchange. Our new model surpasses the performance of the previously established Metropolis model both on the training set and on a testing set of size 56 composed of toehold-mediated 3-way strand displacement with mismatches and hairpin opening and closing rates: reaction rates are predicted to within a factor of three for 93.4% and 78.5% of reactions for the training and testing sets, respectively. 
    more » « less
  2. Abstract Although the governing equations of many systems, when derived from first principles, may be viewed as known, it is often too expensive to numerically simulate all the interactions they describe. Therefore, researchers often seek simpler descriptions that describe complex phenomena without numerically resolving all the interacting components. Stochastic differential equations (SDEs) arise naturally as models in this context. The growth in data acquisition, both through experiment and through simulations, provides an opportunity for the systematic derivation of SDE models in many disciplines. However, inconsistencies between SDEs and real data at short time scales often cause problems, when standard statistical methodology is applied to parameter estimation. The incompatibility between SDEs and real data can be addressed by deriving sufficient statistics from the time-series data and learning parameters of SDEs based on these. Here, we study sufficient statistics computed from time averages, an approach that we demonstrate to lead to sufficient statistics on a variety of problems and that has the secondary benefit of obviating the need to match trajectories. Following this approach, we formulate the fitting of SDEs to sufficient statistics from real data as an inverse problem and demonstrate that this inverse problem can be solved by using ensemble Kalman inversion. Furthermore, we create a framework for non-parametric learning of drift and diffusion terms by introducing hierarchical, refinable parameterizations of unknown functions, using Gaussian process regression. We demonstrate the proposed methodology for the fitting of SDE models, first in a simulation study with a noisy Lorenz ’63 model, and then in other applications, including dimension reduction in deterministic chaotic systems arising in the atmospheric sciences, large-scale pattern modeling in climate dynamics and simplified models for key observables arising in molecular dynamics. The results confirm that the proposed methodology provides a robust and systematic approach to fitting SDE models to real data. 
    more » « less
  3. In many mechanistic medical, biological, physical, and engineered spatiotemporal dynamic models the numerical solution of partial differential equations (PDEs), especially for diffusion, fluid flow and mechanical relaxation, can make simulations impractically slow. Biological models of tissues and organs often require the simultaneous calculation of the spatial variation of concentration of dozens of diffusing chemical species. One clinical example where rapid calculation of a diffusing field is of use is the estimation of oxygen gradients in the retina, based on imaging of the retinal vasculature, to guide surgical interventions in diabetic retinopathy. Furthermore, the ability to predict blood perfusion and oxygenation may one day guide clinical interventions in diverse settings, i.e., from stent placement in treating heart disease to BOLD fMRI interpretation in evaluating cognitive function (Xie et al., 2019 ; Lee et al., 2020 ). Since the quasi-steady-state solutions required for fast-diffusing chemical species like oxygen are particularly computationally costly, we consider the use of a neural network to provide an approximate solution to the steady-state diffusion equation. Machine learning surrogates, neural networks trained to provide approximate solutions to such complicated numerical problems, can often provide speed-ups of several orders of magnitude compared to direct calculation. Surrogates of PDEs could enable use of larger and more detailed models than are possible with direct calculation and can make including such simulations in real-time or near-real time workflows practical. Creating a surrogate requires running the direct calculation tens of thousands of times to generate training data and then training the neural network, both of which are computationally expensive. Often the practical applications of such models require thousands to millions of replica simulations, for example for parameter identification and uncertainty quantification, each of which gains speed from surrogate use and rapidly recovers the up-front costs of surrogate generation. We use a Convolutional Neural Network to approximate the stationary solution to the diffusion equation in the case of two equal-diameter, circular, constant-value sources located at random positions in a two-dimensional square domain with absorbing boundary conditions. Such a configuration caricatures the chemical concentration field of a fast-diffusing species like oxygen in a tissue with two parallel blood vessels in a cross section perpendicular to the two blood vessels. To improve convergence during training, we apply a training approach that uses roll-back to reject stochastic changes to the network that increase the loss function. The trained neural network approximation is about 1000 times faster than the direct calculation for individual replicas. Because different applications will have different criteria for acceptable approximation accuracy, we discuss a variety of loss functions and accuracy estimators that can help select the best network for a particular application. We briefly discuss some of the issues we encountered with overfitting, mismapping of the field values and the geometrical conditions that lead to large absolute and relative errors in the approximate solution. 
    more » « less
  4. Antibodies are important biomolecules that are often designed to recognize target antigens. However, they are expensive to produce and their relatively large size prevents their transport across lipid membranes. An alternative to antibodies is aptamers, short ([Formula: see text] bp) oligonucleotides (and amino acid sequences) with specific secondary and tertiary structures that govern their affinity to specific target molecules. Aptamers are typically generated via solid phase oligonucleotide synthesis before selection and amplification through Systematic Evolution of Ligands by EXponential enrichment (SELEX), a process based on competitive binding that enriches the population of certain strands while removing unwanted sequences, yielding aptamers with high specificity and affinity to a target molecule. Mathematical analyses of SELEX have been formulated in the mass action limit, which assumes large system sizes and/or high aptamer and target molecule concentrations. In this paper, we develop a fully discrete stochastic model of SELEX. While converging to a mass-action model in the large system-size limit, our stochastic model allows us to study statistical quantities when the system size is small, such as the probability of losing the best-binding aptamer during each round of selection. Specifically, we find that optimal SELEX protocols in the stochastic model differ from those predicted by a deterministic model.

    more » « less
  5. Goal: Lifting is a common manual material handling task performed in the workplaces. It is considered as one of the main risk factors for work-related musculoskeletal disorders. An important criterion to identify the unsafe lifting task is the values of the net force and moment at L5/S1 joint. These values are mainly calculated in a laboratory environment, which utilizes marker-based sensors to collect three-dimensional (3-D) information and force plates to measure the external forces and moments. However, this method is usually expensive to set up, time-consuming in process, and sensitive to the surrounding environment. In this study, we propose a deep neural network (DNN)-based framework for 3-D pose estimation, which addresses the aforementioned limitations, and we employ the results for L5/S1 moment and force calculation. Methods: At the first step of the proposed framework, full body 3-D pose is captured using a DNN, then at the second step, estimated 3-D body pose along with the subject's anthropometric information is utilized to calculate L5/S1 join's kinetic by a top-down inverse dynamic algorithm. Results: To fully evaluate our approach, we conducted experiments using a lifting dataset consisting of 12 subjects performing various types of lifting tasks. The results are validated against a marker-based motion capture system as a reference. The grand mean ± SD of the total moment/force absolute errors across all the dataset was 9.06 ± 7.60 N·m/4.85 ± 4.85 N. Conclusion: The proposed method provides a reliable tool for assessment of the lower back kinetics during lifting and can be an alternative when the use of marker-based motion capture systems is not possible. 
    more » « less