Abstract Motivation: Despite its great success in various physical modeling, differential geometry (DG) has rarely been devised as a versatile tool for analyzing large, diverse, and complex molecular and biomolecular datasets because of the limited understanding of its potential power in dimensionality reduction and its ability to encode essential chemical and biological information in differentiable manifolds. Results: We put forward a differential geometry‐based geometric learning (DG‐GL) hypothesis that the intrinsic physics of three‐dimensional (3D) molecular structures lies on a family of low‐dimensional manifolds embedded in a high‐dimensional data space. We encode crucial chemical, physical, and biological information into 2D element interactive manifolds, extracted from a high‐dimensional structural data space via a multiscale discrete‐to‐continuum mapping using differentiable density estimators. Differential geometry apparatuses are utilized to construct element interactive curvatures in analytical forms for certain analytically differentiable density estimators. These low‐dimensional differential geometry representations are paired with a robust machine learning algorithm to showcase their descriptive and predictive powers for large, diverse, and complex molecular and biomolecular datasets. Extensive numerical experiments are carried out to demonstrate that the proposed DG‐GL strategy outperforms other advanced methods in the predictions of drug discovery‐related protein‐ligand binding affinity, drug toxicity, and molecular solvation free energy. Availability and implementation: http://weilab.math.msu.edu/DG‐GL/ Contact: wei@math.msu.edu
This content will become publicly available on March 15, 2026
Multiscale Differential Geometry Learning for Protein Flexibility Analysis
ABSTRACT Protein structural fluctuations, measured by Debye‐Waller factors or B‐factors, are known to be closely associated with protein flexibility and function. Theoretical approaches have also been developed to predict B‐factor values, which reflect protein flexibility. Previous models have made significant strides in analyzing B‐factors by fitting experimental data. In this study, we propose a novel approach for B‐factor prediction using differential geometry theory, based on the assumption that the intrinsic properties of proteins reside on a family of low‐dimensional manifolds embedded within the high‐dimensional space of protein structures. By analyzing the mean and Gaussian curvatures of a set of low‐dimensional manifolds defined by kernel functions, we develop effective and robust multiscale differential geometry (mDG) models. Our mDG model demonstrates a 27% increase in accuracy compared to the classical Gaussian network model (GNM) in predicting B‐factors for a dataset of 364 proteins. Additionally, by incorporating both global and local protein features, we construct a highly effective machine‐learning model for the blind prediction of B‐factors. Extensive least‐squares approximations and machine learning‐based blind predictions validate the effectiveness of the mDG modeling approach for B‐factor predictions.
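The abstract does not specify the kernel functions or how the curvatures are evaluated, so the following is only a minimal sketch of the general idea. It assumes a Gaussian kernel density rho(r) = sum_j exp(-|r - r_j|^2 / eta^2) and uses the standard implicit-surface formulas for the Gaussian and mean curvature of the level set passing through a point. The function names, the parameter `eta`, and the choice of kernel are illustrative assumptions, not the authors' implementation.

```python
import math


def gaussian_density_derivatives(point, atoms, eta):
    """Gradient and Hessian of rho(r) = sum_j exp(-|r - r_j|^2 / eta^2) at `point`.

    The Gaussian kernel and the scale `eta` are assumptions made for illustration.
    """
    grad = [0.0, 0.0, 0.0]
    hess = [[0.0] * 3 for _ in range(3)]
    for atom in atoms:
        d = [point[k] - atom[k] for k in range(3)]
        w = math.exp(-sum(c * c for c in d) / eta**2)
        for a in range(3):
            grad[a] += -2.0 * w * d[a] / eta**2
            for b in range(3):
                diag = 2.0 / eta**2 if a == b else 0.0
                hess[a][b] += w * (4.0 * d[a] * d[b] / eta**4 - diag)
    return grad, hess


def cross(u, v):
    """Cross product of two 3-vectors."""
    return [
        u[1] * v[2] - u[2] * v[1],
        u[2] * v[0] - u[0] * v[2],
        u[0] * v[1] - u[1] * v[0],
    ]


def level_set_curvatures(grad, hess):
    """Gaussian and mean curvature of the level set through the evaluation point.

    Uses the implicit-surface formulas
        K = g^T adj(H) g / |g|^4
        H_mean = (g^T H g - |g|^2 trace(H)) / (2 |g|^3)
    The sign of the mean curvature depends on the orientation convention.
    """
    g2 = sum(c * c for c in grad)
    if g2 == 0.0:
        raise ValueError("curvature is undefined where the gradient vanishes")

    # Cofactor matrix of the Hessian; it equals the adjugate because H is symmetric.
    adj = [
        cross(hess[1], hess[2]),
        cross(hess[2], hess[0]),
        cross(hess[0], hess[1]),
    ]

    def quad(m):
        return sum(grad[a] * m[a][b] * grad[b] for a in range(3) for b in range(3))

    trace = hess[0][0] + hess[1][1] + hess[2][2]
    gaussian = quad(adj) / g2**2
    mean = (quad(hess) - g2 * trace) / (2.0 * g2**1.5)
    return gaussian, mean


# Sanity check: for a single atom the level sets are spheres centered on it,
# so at distance d the curvatures should be 1/d^2 and 1/d.
grad, hess = gaussian_density_derivatives([2.0, 0.0, 0.0], [[0.0, 0.0, 0.0]], eta=3.0)
K, H = level_set_curvatures(grad, hess)
print(round(K, 6), round(H, 6))  # 0.25 0.5
```

In a multiscale setting one would presumably evaluate such curvatures at each atom for several values of `eta` and use them as features in a fit against experimental B‐factors; that step is not shown because the abstract gives no details of it.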
- Award ID(s): 2052983
- PAR ID: 10616125
- Publisher / Repository: Journal of Computational Chemistry
- Date Published:
- Journal Name: Journal of Computational Chemistry
- Volume: 46
- Issue: 7
- ISSN: 0192-8651
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
Abstract Recently, persistent homology has had tremendous success in biomolecular data analysis. It works by examining the topological relationship or connectivity of a group of atoms in a molecule at a variety of scales, then rendering a family of topological representations of the molecule. However, persistent homology is rarely employed for the analysis of atomic properties, such as biomolecular flexibility analysis or B-factor prediction. This work introduces atom-specific persistent homology to provide a local atomic level representation of a molecule via a global topological tool. This is achieved through the construction of a pair of conjugated sets of atoms and corresponding conjugated simplicial complexes, as well as conjugated topological spaces. The difference between the topological invariants of the pair of conjugated sets is measured by Bottleneck and Wasserstein metrics and leads to an atom-specific topological representation of individual atomic properties in a molecule. Atom-specific topological features are integrated with various machine learning algorithms, including gradient boosting trees and convolutional neural network for protein thermal fluctuation analysis and B-factor prediction. Extensive numerical results indicate the proposed method provides a powerful topological tool for analyzing and predicting localized information in complex macromolecules.
Abstract Motivation: Nucleic acid binding proteins (NABPs) play critical roles in various and essential biological processes. Many machine learning-based methods have been developed to predict different types of NABPs. However, most of these studies have limited applications in predicting the types of NABPs for any given protein with unknown functions, due to several factors such as dataset construction, prediction scope and features used for training and testing. In addition, single-stranded DNA binding proteins (DBP) (SSBs) have not been extensively investigated for identifying novel SSBs from proteins with unknown functions. Results: To improve prediction accuracy of different types of NABPs for any given protein, we developed hierarchical and multi-class models with machine learning-based methods and a feature extracted from protein language model ESM2. Our results show that by combining the feature from ESM2 and machine learning methods, we can achieve high prediction accuracy up to 95% for each stage in the hierarchical approach, and 85% for overall prediction accuracy from the multi-class approach. More importantly, besides the much improved prediction of other types of NABPs, the models can be used to accurately predict single-stranded DBPs, which are underexplored. Availability and implementation: The datasets and code can be found at https://figshare.com/projects/Prediction_of_nucleic_acid_binding_proteins_using_protein_language_model/211555.
Recently, machine learning (ML) has established itself in various worldwide benchmarking competitions in computational biology, including Critical Assessment of Structure Prediction (CASP) and Drug Design Data Resource (D3R) Grand Challenges. However, the intricate structural complexity and high ML dimensionality of biomolecular datasets obstruct the efficient application of ML algorithms in the field. In addition to data and algorithm, an efficient ML machinery for biomolecular predictions must include structural representation as an indispensable component. Mathematical representations that simplify the biomolecular structural complexity and reduce ML dimensionality have emerged as a prime winner in D3R Grand Challenges. This review is devoted to the recent advances in developing low-dimensional and scalable mathematical representations of biomolecules in our laboratory. We discuss three classes of mathematical approaches, including algebraic topology, differential geometry, and graph theory. We elucidate how the physical and biological challenges have guided the evolution and development of these mathematical apparatuses for massive and diverse biomolecular data. We focus the performance analysis on protein–ligand binding predictions in this review although these methods have had tremendous success in many other applications, such as protein classification, virtual screening, and the predictions of solubility, solvation free energies, toxicity, partition coefficients, protein folding stability changes upon mutation, etc.
ABSTRACT Predicting the structure of ligands bound to proteins is a foundational problem in modern biotechnology and drug discovery, yet little is known about how to combine the predictions of protein‐ligand structure (poses) produced by the latest deep learning methods to identify the best poses and how to accurately estimate the binding affinity between a protein target and a list of ligand candidates. Further, a blind benchmarking and assessment of protein‐ligand structure and binding affinity prediction is necessary to ensure it generalizes well to new settings. Towards this end, we introduce MULTICOM_ligand, a deep learning‐based protein‐ligand structure and binding affinity prediction ensemble featuring structural consensus ranking for unsupervised pose ranking and a new deep generative flow matching model for joint structure and binding affinity prediction. Notably, MULTICOM_ligand ranked among the top‐5 ligand prediction methods in both protein‐ligand structure prediction and binding affinity prediction in the 16th Critical Assessment of Techniques for Structure Prediction (CASP16), demonstrating its efficacy and utility for real‐world drug discovery efforts. The source code for MULTICOM_ligand is freely available on GitHub.
