

Title: A review of mathematical representations of biomolecular data
Recently, machine learning (ML) has established itself in various worldwide benchmarking competitions in computational biology, including the Critical Assessment of Structure Prediction (CASP) and the Drug Design Data Resource (D3R) Grand Challenges. However, the intricate structural complexity and high ML dimensionality of biomolecular datasets obstruct the efficient application of ML algorithms in the field. In addition to data and algorithms, an efficient ML machinery for biomolecular predictions must include structural representation as an indispensable component. Mathematical representations that simplify biomolecular structural complexity and reduce ML dimensionality have emerged as a prime winner in the D3R Grand Challenges. This review is devoted to recent advances in developing low-dimensional and scalable mathematical representations of biomolecules in our laboratory. We discuss three classes of mathematical approaches: algebraic topology, differential geometry, and graph theory. We elucidate how physical and biological challenges have guided the evolution and development of these mathematical apparatuses for massive and diverse biomolecular data. We focus our performance analysis on protein–ligand binding predictions, although these methods have also had tremendous success in many other applications, such as protein classification, virtual screening, and the prediction of solubility, solvation free energies, toxicity, partition coefficients, and protein folding stability changes upon mutation.
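To give a concrete flavor of the graph-theory representations surveyed here, the sketch below is a toy illustration (not the authors' implementation): it collapses a 3D structure into a fixed-length, element-specific feature vector by accumulating a decaying function of inter-atomic distances for each ordered element pair. The element list, the decay scale `eta`, and the coordinates are all invented for illustration.

```python
# Toy element-specific, distance-weighted descriptor in the spirit of
# graph-theory representations: one scalar feature per ordered element pair,
# independent of the number of atoms in the molecule.
import math
from itertools import product

def pair_features(atoms, elements=("C", "N", "O"), eta=2.0):
    """atoms: list of (element, (x, y, z)) tuples.
    Returns a dict mapping each ordered element pair to a summed edge weight."""
    feats = {pair: 0.0 for pair in product(elements, elements)}
    for e1, p1 in atoms:
        for e2, p2 in atoms:
            if p1 is p2:          # skip self-pairs
                continue
            d = math.dist(p1, p2)
            if (e1, e2) in feats:
                feats[(e1, e2)] += math.exp(-d / eta)  # decaying "edge weight"
    return feats

# Invented three-atom example:
atoms = [("C", (0.0, 0.0, 0.0)), ("N", (1.5, 0.0, 0.0)), ("O", (0.0, 1.4, 0.0))]
f = pair_features(atoms)
```

Because the feature vector has one entry per element pair rather than per atom, molecules of any size map to the same low, fixed dimension — the dimensionality-reduction property the abstract emphasizes.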
Award ID(s): 1900473, 1761320, 1721024
NSF-PAR ID: 10170687
Journal Name: Physical Chemistry Chemical Physics
Volume: 22
Issue: 8
ISSN: 1463-9076
Page Range: 4343 to 4367
Sponsoring Org: National Science Foundation
More Like this
  1. Abstract

    Motivation: Despite its great success in various areas of physical modeling, differential geometry (DG) has rarely been devised as a versatile tool for analyzing large, diverse, and complex molecular and biomolecular datasets, because of the limited understanding of its potential power in dimensionality reduction and its ability to encode essential chemical and biological information in differentiable manifolds.

    Results: We put forward a differential geometry-based geometric learning (DG-GL) hypothesis: the intrinsic physics of three-dimensional (3D) molecular structures lies on a family of low-dimensional manifolds embedded in a high-dimensional data space. We encode crucial chemical, physical, and biological information into 2D element interactive manifolds, extracted from a high-dimensional structural data space via a multiscale discrete-to-continuum mapping using differentiable density estimators. Differential geometry apparatuses are utilized to construct element interactive curvatures in analytical forms for certain analytically differentiable density estimators. These low-dimensional differential geometry representations are paired with a robust machine learning algorithm to showcase their descriptive and predictive powers for large, diverse, and complex molecular and biomolecular datasets. Extensive numerical experiments demonstrate that the proposed DG-GL strategy outperforms other advanced methods in the prediction of drug discovery-related protein-ligand binding affinities, drug toxicity, and molecular solvation free energies.

    Availability and implementation: http://weilab.math.msu.edu/DG-GL/

    Contact: wei@math.msu.edu
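The discrete-to-continuum mapping at the heart of this approach can be sketched as follows. This is a minimal illustration, not the DG-GL code: it assumes a Gaussian density estimator and an invented scale parameter `eta`, smearing an atomic point cloud into a smooth density evaluated on grid points.

```python
# Hedged sketch of a discrete-to-continuum mapping: atoms (points) become a
# differentiable density via a Gaussian estimator. Curvatures of the level
# sets of such densities are what DG-based representations build on.
import numpy as np

def density(grid_points, atom_coords, eta=1.0, weights=None):
    """rho(r) = sum_j w_j * exp(-||r - r_j||^2 / eta^2), evaluated at grid_points.
    grid_points: (G, 3) array; atom_coords: (A, 3) array."""
    if weights is None:
        weights = np.ones(len(atom_coords))
    diff = grid_points[:, None, :] - atom_coords[None, :, :]  # (G, A, 3)
    d2 = np.sum(diff * diff, axis=-1)                         # squared distances
    return (weights * np.exp(-d2 / eta**2)).sum(axis=1)       # (G,)

# Two invented atoms; density sampled at two points:
atoms = np.array([[0.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
grid = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
rho = density(grid, atoms)
```

Because the estimator is analytically differentiable, curvature quantities of the resulting manifold can be written in closed form, which is what makes the "element interactive curvatures" above tractable.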

  2. Abstract

    Land surface models (LSMs) are a vital tool for understanding, projecting, and predicting the dynamics of the land surface and its role within the Earth system under global change. Driven by the need to address a set of key questions, LSMs have grown in complexity from simplified representations of land surface biophysics to encompass a broad set of interrelated processes spanning the disciplines of biophysics, biogeochemistry, hydrology, ecosystem ecology, community ecology, human management, and societal impacts. This vast scope and complexity, while warranted by the problems LSMs are designed to solve, has led to enormous challenges in understanding and attributing differences between LSM predictions. Meanwhile, the wide range of spatial scales that govern land surface heterogeneity, and the broad spectrum of timescales in land surface dynamics, create challenges in tractably representing processes in LSMs. Based on these issues, we identify three "grand challenges" in the development and use of LSMs: managing process complexity, representing land surface heterogeneity, and understanding parametric dynamics across the broad set of problems asked of LSMs in a changing world. In this review, we discuss progress that has been made, as well as promising directions forward, for each of these challenges.

  3. Abstract

    Generative AI is rapidly transforming the frontier of research in computational structural biology. Indeed, recent successes have substantially advanced protein design and drug discovery. One of the key methodologies underlying these advances is diffusion models (DMs). Diffusion models originated in computer vision, where they rapidly took over image generation by offering superior quality and performance. These models were subsequently extended and modified for use in other areas, including computational structural biology. DMs are well equipped to model high-dimensional, geometric data while exploiting key strengths of deep learning. In structural biology, for example, they have achieved state-of-the-art results on protein 3D structure generation and small-molecule docking. This review covers the basics of diffusion models, associated modeling choices regarding molecular representations, generation capabilities, and prevailing heuristics, as well as key limitations and forthcoming refinements. We also provide best practices around evaluation procedures to help establish rigorous benchmarking and evaluation. The review is intended to provide a fresh view into the state-of-the-art, as well as to highlight the potential and current challenges of recent generative techniques in computational structural biology.
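As a minimal sketch of the machinery this abstract refers to (assumed textbook DDPM mechanics, not anything specific to the review): the forward process of a diffusion model corrupts data toward Gaussian noise in closed form, and a learned network would reverse that corruption step by step. The schedule parameters below are conventional but illustrative.

```python
# Toy DDPM forward process: x_t = sqrt(abar_t) * x0 + sqrt(1 - abar_t) * eps.
# A trained denoiser (omitted here) would predict eps to run the reverse chain.
import numpy as np

def alpha_bar(t, T=1000, beta_min=1e-4, beta_max=0.02):
    """Cumulative product of (1 - beta_s) up to step t, linear beta schedule."""
    betas = np.linspace(beta_min, beta_max, T)
    return np.prod(1.0 - betas[: t + 1])

def forward_noise(x0, t, rng):
    """Sample x_t ~ q(x_t | x_0) in closed form; also return the noise used."""
    ab = alpha_bar(t)
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(ab) * x0 + np.sqrt(1.0 - ab) * eps, eps

rng = np.random.default_rng(0)
x0 = np.zeros(3)                       # stand-in for 3D molecular coordinates
xt, eps = forward_noise(x0, t=999, rng=rng)
```

At the final step the signal term is nearly gone and x_t is essentially pure noise, which is why generation can start from a Gaussian sample; for molecules, x0 would be atom coordinates and the denoiser would typically be an equivariant network.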

    This article is categorized under:

    Data Science > Artificial Intelligence/Machine Learning

    Structure and Mechanism > Molecular Structures

    Software > Molecular Modeling

  4. INTRODUCTION

    Solving quantum many-body problems, such as finding ground states of quantum systems, has far-reaching consequences for physics, materials science, and chemistry. Classical computers have facilitated many profound advances in science and technology, but they often struggle to solve such problems. Scalable, fault-tolerant quantum computers will be able to solve a broad array of quantum problems but are unlikely to be available for years to come. Meanwhile, how can we best exploit our powerful classical computers to advance our understanding of complex quantum systems? Recently, classical machine learning (ML) techniques have been adapted to investigate problems in quantum many-body physics. So far, these approaches are mostly heuristic, reflecting the general paucity of rigorous theory in ML. Although they have been shown to be effective in some intermediate-size experiments, these methods are generally not backed by convincing theoretical arguments to ensure good performance.

    RATIONALE

    A central question is whether classical ML algorithms can provably outperform non-ML algorithms in challenging quantum many-body problems. We provide a concrete answer by devising and analyzing classical ML algorithms for predicting the properties of ground states of quantum systems. We prove that these ML algorithms can efficiently and accurately predict ground-state properties of gapped local Hamiltonians, after learning from data obtained by measuring other ground states in the same quantum phase of matter. Furthermore, under a widely accepted complexity-theoretic conjecture, we prove that no efficient classical algorithm that does not learn from data can achieve the same prediction guarantee. By generalizing from experimental data, ML algorithms can solve quantum many-body problems that could not be solved efficiently without access to experimental data.

    RESULTS

    We consider a family of gapped local quantum Hamiltonians, where the Hamiltonian H(x) depends smoothly on m parameters (denoted by x). The ML algorithm learns from a set of training data consisting of sampled values of x, each accompanied by a classical representation of the ground state of H(x). These training data could be obtained from either classical simulations or quantum experiments. During the prediction phase, the ML algorithm predicts a classical representation of ground states for Hamiltonians different from those in the training data; ground-state properties can then be estimated using the predicted classical representation. Specifically, our classical ML algorithm predicts expectation values of products of local observables in the ground state, with a small error when averaged over the value of x. The run time of the algorithm and the amount of training data required both scale polynomially in m and linearly in the size of the quantum system. Our proof of this result builds on recent developments in quantum information theory, computational learning theory, and condensed matter theory. Furthermore, under the widely accepted conjecture that nondeterministic polynomial-time (NP)-complete problems cannot be solved in randomized polynomial time, we prove that no polynomial-time classical algorithm that does not learn from data can match the prediction performance achieved by the ML algorithm. In a related contribution using similar proof techniques, we show that classical ML algorithms can efficiently learn how to classify quantum phases of matter. In this scenario, the training data consist of classical representations of quantum states, where each state carries a label indicating whether it belongs to phase A or phase B. The ML algorithm then predicts the phase label for quantum states that were not encountered during training. The classical ML algorithm not only classifies phases accurately but also constructs an explicit classifying function. Numerical experiments verify that our proposed ML algorithms work well in a variety of scenarios, including Rydberg atom systems, two-dimensional random Heisenberg models, symmetry-protected topological phases, and topologically ordered phases.

    CONCLUSION

    We have rigorously established that classical ML algorithms, informed by data collected in physical experiments, can effectively address some quantum many-body problems. These rigorous results boost our hopes that classical ML trained on experimental data can solve practical problems in chemistry and materials science that would be too hard to solve using classical processing alone. Our arguments build on the concept of a succinct classical representation of quantum states derived from randomized Pauli measurements. Although some quantum devices lack the local control needed to perform such measurements, we expect that other classical representations could be exploited by classical ML with similarly powerful results. How can we make use of accessible measurement data to predict properties reliably? Answering such questions will expand the reach of near-term quantum platforms.

    Figure caption: Classical algorithms for quantum many-body problems. Classical ML algorithms learn from training data, obtained from either classical simulations or quantum experiments. Then, the ML algorithm produces a classical representation for the ground state of a physical system that was not encountered during training. Classical algorithms that do not learn from data may require substantially longer computation time to achieve the same task.
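The learning setup described above — regressing a ground-state property that varies smoothly with the Hamiltonian parameters x — can be caricatured with kernel ridge regression on a toy 1D problem. This is an assumed illustration, not the paper's algorithm; the target function, kernel width `gamma`, and regularizer `lam` are all invented stand-ins.

```python
# Toy version of "learn a smooth map x -> <O>(x) from sampled ground states":
# Gaussian-kernel ridge regression, alpha = (K + lam*I)^{-1} y.
import numpy as np

def krr_fit_predict(x_train, y_train, x_test, gamma=10.0, lam=1e-6):
    """Fit on (x_train, y_train), return predictions at x_test (all 1D arrays)."""
    K = np.exp(-gamma * (x_train[:, None] - x_train[None, :]) ** 2)
    alpha = np.linalg.solve(K + lam * np.eye(len(x_train)), y_train)
    K_test = np.exp(-gamma * (x_test[:, None] - x_train[None, :]) ** 2)
    return K_test @ alpha

x_train = np.linspace(0.0, 1.0, 50)        # sampled Hamiltonian parameters
y_train = np.cos(2 * np.pi * x_train)      # stand-in for a smooth <O>(x)
x_test = np.array([0.25, 0.5])             # unseen parameter values
pred = krr_fit_predict(x_train, y_train, x_test)
```

The point mirrored here is the one the abstract makes rigorous: because the property depends smoothly on x within a phase, modest training data suffice to generalize to unseen Hamiltonians.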
  5. Cells respond to biochemical and physical signals, both internal and external. These signals can be broadly classified into two categories: (a) "actionable" or "reference" inputs that should elicit appropriate biological or physical responses such as gene expression or motility, and (b) "disturbances" or "perturbations" that should be ignored or actively filtered out. These disturbances might be exogenous, such as the binding of nonspecific ligands, or endogenous, such as variations in enzyme concentrations or gene copy numbers. In this context, the term robustness describes the capability to produce appropriate responses to reference inputs while at the same time remaining insensitive to disturbances. These two objectives often conflict with each other and require delicate design trade-offs. Indeed, natural biological systems use complicated and still poorly understood control strategies to finely balance the goals of responsiveness and robustness. A better understanding of such natural strategies remains an important scientific goal in itself and will play a role in the construction of synthetic circuits for therapeutic and biosensing applications. A prototype problem in robustly responding to inputs is that of "robust tracking," defined by the requirement that some designated internal quantity (for example, the expression level of a reporter protein) should faithfully follow an input signal while being insensitive to an appropriate class of perturbations. Control theory predicts that a certain type of motif, called integral feedback, will help achieve this goal; this motif is, in fact, a necessary feature of any system that exhibits robust tracking. Indeed, integral feedback has been a key component of electrical and mechanical control systems at least since the 18th century, when James Watt employed the centrifugal governor to regulate steam engines.
    Motivated by this knowledge, biological engineers have proposed various designs for biomolecular integral feedback control mechanisms. However, practical and quantitatively predictable implementations have proved challenging, in part due to the difficulty of obtaining accurate models of transcription, translation, and resource competition in living cells, and the stochasticity inherent in cellular reactions. These challenges prevent first-principles rational design and parameter optimization. In this work, we exploit the versatility of an Escherichia coli cell-free transcription-translation (TXTL) system to accurately design, model, and then build a synthetic biomolecular integral controller that precisely controls the expression of a target gene. To our knowledge, this is the first design of a functioning gene network that achieves the goal of making gene expression track an externally imposed reference level, achieves this goal even in the presence of disturbances, and whose performance quantitatively agrees with mathematical predictions.
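Why integral feedback guarantees robust tracking can be seen in a minimal simulation (an assumed textbook sketch, not the TXTL circuit above; the gain `k`, degradation rate `d`, and setpoint `r` are invented). The integrator state z accumulates the tracking error r - y, so any steady state must have y = r, regardless of the disturbed parameter d.

```python
# Minimal integral-feedback loop, integrated with forward Euler:
#   dz/dt = r - y        (integral of the tracking error)
#   dy/dt = k*z - d*y    ("expression" driven by the controller, degraded at rate d)
# At steady state dz/dt = 0 forces y = r, independent of d.
def simulate(r=2.0, k=1.0, d=0.5, dt=0.001, steps=200_000):
    y, z = 0.0, 0.0
    for _ in range(steps):
        z += dt * (r - y)
        y += dt * (k * z - d * y)
    return y

y_nominal = simulate(d=0.5)
y_disturbed = simulate(d=2.0)   # quadrupled degradation, same setpoint
```

Both runs settle at the setpoint r = 2.0 even though the degradation rate differs fourfold — the disturbance rejection property that the cell-free controller above realizes biochemically.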