skip to main content


The NSF Public Access Repository (NSF-PAR) system and access will be unavailable from 11:00 PM ET on Thursday, May 23 until 2:00 AM ET on Friday, May 24 due to maintenance. We apologize for the inconvenience.

Title: Deep learning to decompose macromolecules into independent Markovian domains

The increasing interest in modeling the dynamics of ever larger proteins has revealed a fundamental problem with models that describe the molecular system as being in a global configuration state. This notion limits our ability to gather sufficient statistics of state probabilities or state-to-state transitions because for large molecular systems the number of metastable states grows exponentially with size. In this manuscript, we approach this challenge by introducing a method that combines our recent progress on independent Markov decomposition (IMD) with VAMPnets, a deep learning approach to Markov modeling. We establish a training objective that quantifies how well a given decomposition of the molecular system into independent subdomains with Markovian dynamics approximates the overall dynamics. By constructing an end-to-end learning framework, the decomposition into such subdomains and their individual Markov state models are simultaneously learned, providing a data-efficient and easily interpretable summary of the complex system dynamics. While learning the dynamical coupling between Markovian subdomains is still an open issue, the present results are a significant step towards learning Ising models of large molecular complexes from simulation data.

more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Nature Publishing Group
Date Published:
Journal Name:
Nature Communications
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Protein structure refinement is the last step in protein structure prediction pipelines. Physics‐based refinement via molecular dynamics (MD) simulations has made significant progress during recent years. During CASP14, we tested a new refinement protocol based on an improved sampling strategy via MD simulations. MD simulations were carried out at an elevated temperature (360 K). An optimized use of biasing restraints and the use of multiple starting models led to enhanced sampling. The new protocol generally improved the model quality. In comparison with our previous protocols, the CASP14 protocol showed clear improvements. Our approach was successful with most initial models, many based on deep learning methods. However, we found that our approach was not able to refine machine‐learning models from the AlphaFold2 group, often decreasing already high initial qualities. To better understand the role of refinement given new types of models based on machine‐learning, a detailed analysis via MD simulations and Markov state modeling is presented here. We continue to find that MD‐based refinement has the potential to improve AI predictions. We also identified several practical issues that make it difficult to realize that potential. Increasingly important is the consideration of inter‐domain and oligomeric contacts in simulations; the presence of large kinetic barriers in refinement pathways also continues to present challenges. Finally, we provide a perspective on how physics‐based refinement could continue to play a role in the future for improving initial predictions based on machine learning‐based methods.

    more » « less
  2. Abstract Motivation

    An accurate characterization of transcription factor (TF)-DNA affinity landscape is crucial to a quantitative understanding of the molecular mechanisms underpinning endogenous gene regulation. While recent advances in biotechnology have brought the opportunity for building binding affinity prediction methods, the accurate characterization of TF-DNA binding affinity landscape still remains a challenging problem.


    Here we propose a novel sequence embedding approach for modeling the transcription factor binding affinity landscape. Our method represents DNA binding sequences as a hidden Markov model which captures both position specific information and long-range dependency in the sequence. A cornerstone of our method is a novel message passing-like embedding algorithm, called Sequence2Vec, which maps these hidden Markov models into a common nonlinear feature space and uses these embedded features to build a predictive model. Our method is a novel combination of the strength of probabilistic graphical models, feature space embedding and deep learning. We conducted comprehensive experiments on over 90 large-scale TF-DNA datasets which were measured by different high-throughput experimental technologies. Sequence2Vec outperforms alternative machine learning methods as well as the state-of-the-art binding affinity prediction methods.

    Availability and implementation

    Our program is freely available at

    Supplementary information

    Supplementary data are available at Bioinformatics online.

    more » « less
  3. Abstract

    Searching for patterns in data is important because it can lead to the discovery of sequence segments that play a functional role. The complexity of pattern statistics that are used in data analysis and the need of the sampling distribution of those statistics for inference renders efficient computation methods as paramount. This article gives an overview of the main methods used to compute distributions of statistics of overlapping pattern occurrences, specifically, generating functions, correlation functions, the Goulden‐Jackson cluster method, recursive equations, and Markov chain embedding. The underlying data sequence will be assumed to be higher‐order Markovian, which includes sparse Markov models and variable length Markov chains as special cases. Also considered will be recent developments for extending the computational capabilities of the Markov chain‐based method through an algorithm for minimizing the size of the chain's state space, as well as improved data modeling capabilities through sparse Markov models. An application to compute a distribution used as a test statistic in sequence alignment will serve to illustrate the usefulness of the methodology.

    This article is categorized under:

    Statistical Learning and Exploratory Methods of the Data Sciences > Pattern Recognition

    Data: Types and Structure > Categorical Data

    Statistical and Graphical Methods of Data Analysis > Modeling Methods and Algorithms

    more » « less
  4. The path-tracking control performance of an autonomous vehicle (AV) is crucially dependent upon modeling choices and subsequent system-identification updates. Traditionally, automotive engineering has built upon increasing fidelity of white- and gray-box models coupled with system identification. While these models offer explainability, they suffer from modeling inaccuracies, non-linearities, and parameter variation. On the other end, end-to-end black-box methods like behavior cloning and reinforcement learning provide increased adaptability but at the expense of explainability, generalizability, and the sim2real gap. In this regard, hybrid data-driven techniques like Koopman Extended Dynamic Mode Decomposition (KEDMD) can achieve linear embedding of non-linear dynamics through a selection of “lifting functions”. However, the success of this method is primarily predicated on the choice of lifting function(s) and optimization parameters. In this study, we present an analytical approach to construct these lifting functions using the iterative Lie bracket vector fields considering holonomic and non-holonomic constraints on the configuration manifold of our Ackermann-steered autonomous mobile robot. The prediction and control capabilities of the obtained linear KEDMD model are showcased using trajectory tracking of standard vehicle dynamics maneuvers and along a closed-loop racetrack. 
    more » « less
  5. One important problem in constructing the reduced dynamics of molecular systems is the accurate modeling of the non-Markovian behavior arising from the dynamics of unresolved variables. The main complication emerges from the lack of scale separations, where the reduced dynamics generally exhibits pronounced memory and non-white noise terms. We propose a data-driven approach to learn the reduced model of multi-dimensional resolved variables that faithfully retains the non-Markovian dynamics. Different from the common approaches based on the direct construction of the memory function, the present approach seeks a set of non-Markovian features that encode the history of the resolved variables and establishes a joint learning of the extended Markovian dynamics in terms of both the resolved variables and these features. The training is based on matching the evolution of the correlation functions of the extended variables that can be directly obtained from the ones of the resolved variables. The constructed model essentially approximates the multi-dimensional generalized Langevin equation and ensures numerical stability without empirical treatment. We demonstrate the effectiveness of the method by constructing the reduced models of molecular systems in terms of both one-dimensional and four-dimensional resolved variables. 
    more » « less