Abstract Substantial progresses in protein structure prediction have been made by utilizing deep‐learning and residue‐residue distance prediction since CASP13. Inspired by the advances, we improve our CASP14 MULTICOM protein structure prediction system by incorporating three new components: (a) a new deep learning‐based protein inter‐residue distance predictor to improve template‐free (ab initio) tertiary structure prediction, (b) an enhanced template‐based tertiary structure prediction method, and (c) distance‐based model quality assessment methods empowered by deep learning. In the 2020 CASP14 experiment, MULTICOM predictor was ranked seventh out of 146 predictors in tertiary structure prediction and ranked third out of 136 predictors in inter‐domain structure prediction. The results demonstrate that the template‐free modeling based on deep learning and residue‐residue distance prediction can predict the correct topology for almost all template‐based modeling targets and a majority of hard targets (template‐free targets or targets whose templates cannot be recognized), which is a significant improvement over the CASP13 MULTICOM predictor. Moreover, the template‐free modeling performs better than the template‐based modeling on not only hard targets but also the targets that have homologous templates. The performance of the template‐free modeling largely depends on the accuracy of distance prediction closely related to the quality of multiple sequence alignments. The structural model quality assessment works well on targets for which enough good models can be predicted, but it may perform poorly when only a few good models are predicted for a hard target and the distribution of model quality scores is highly skewed. MULTICOM is available athttps://github.com/jianlin-cheng/MULTICOM_Human_CASP14/tree/CASP14_DeepRank3andhttps://github.com/multicom-toolbox/multicom/tree/multicom_v2.0.
more »
« less
Distilling Structural Representations into Protein Sequence Models
Abstract Protein language models, like the popular ESM2, are widely used tools for extracting evolution-based protein representations and have achieved significant success on downstream biological tasks. Representations based on sequence and structure models, however, show significant performance differences depending on the downstream task. A major open problem is to obtain representations that best capture both the evolutionary and structural properties of proteins in general. Here we introduceImplicitStructureModel(ISM), a sequence-only input model with structurally-enriched representations that outperforms state-of-the-art sequence models on several well-studied benchmarks including mutation stability assessment and structure prediction. Our key innovations are a microenvironment-based autoencoder for generating structure tokens and a self-supervised training objective that distills these tokens into ESM2’s pre-trained model. We have madeISM’s structure-enriched weights easily available: integrating ISM into any application using ESM2 requires changing only a single line of code. Our code is available athttps://github.com/jozhang97/ISM.
more »
« less
- Award ID(s):
- 2505865
- PAR ID:
- 10631086
- Publisher / Repository:
- bioRxiv
- Date Published:
- Format(s):
- Medium: X
- Institution:
- bioRxiv
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Abstract Protein language models (pLMs) have been widely adopted for various protein and peptide-related downstream tasks and demonstrated promising performance. However, short peptides are significantly underrepresented in commonly used pLM training datasets. For example, only 2.8% of sequences in the UniProt Reference Cluster (UniRef) contain fewer than 50 residues, which potentially limits the effectiveness of pLMs for peptide-specific applications. Here, we present PepBERT, a lightweight and efficient peptide language model specifically designed for encoding peptide sequences. Two versions of the model—PepBERT-large (4.9 million parameters) and PepBERT-small (1.86 million parameters)—were pretrained from scratch using four custom peptide datasets and evaluated on nine peptide-related downstream prediction tasks. Both PepBERT models achieved performance superior to or comparable to the benchmark model, ESM-2 with 7.5 million parameters, on 8 out of 9 datasets. Overall, PepBERT provides a compact yet effective solution for generating high-quality peptide representations for downstream applications. By enabling more accurate representation and prediction of bioactive peptides, PepBERT can accelerate the discovery of food-derived bioactive peptides with health-promoting properties, supporting the development of sustainable functional foods and value-added utilization of food processing by-products. The datasets, source codes, pretrained models, and tutorials for the usage of PepBERT are available athttps://github.com/dzjxzyd/PepBERT.more » « less
-
Li, Jinyan (Ed.)Artificial intelligence-powered protein structure prediction methods have led to a paradigm-shift in computational structural biology, yet contemporary approaches for predicting the interfacial residues (i.e., sites) of protein-protein interaction (PPI) still rely on experimental structures. Recent studies have demonstrated benefits of employing graph convolution for PPI site prediction, but ignore symmetries naturally occurring in 3-dimensional space and act only on experimental coordinates. Here we present EquiPPIS, an E(3) equivariant graph neural network approach for PPI site prediction. EquiPPIS employs symmetry-aware graph convolutions that transform equivariantly with translation, rotation, and reflection in 3D space, providing richer representations for molecular data compared to invariant convolutions. EquiPPIS substantially outperforms state-of-the-art approaches based on the same experimental input, and exhibits remarkable robustness by attaining better accuracy with predicted structural models from AlphaFold2 than what existing methods can achieve even with experimental structures. Freely available athttps://github.com/Bhattacharya-Lab/EquiPPIS, EquiPPIS enables accurate PPI site prediction at scale.more » « less
-
Rapid advancement in machine learning is increasing the demand for effective graph data analysis. However, real-world graph data often exhibits class imbalance, leading to poor performance of standard machine learning models on underrepresented classes. To address this,Class-ImbalancedLearning onGraphs (CILG) has emerged as a promising solution that combines graph representation learning and class-imbalanced learning. This survey provides a comprehensive understanding of CILG’s current state-of-the-art, establishing the first systematic taxonomy of existing work and its connections to traditional imbalanced learning. We critically analyze recent advances and discuss key open problems. A continuously updated reading list of relevant articles and code implementations is available athttps://github.com/yihongma/CILG-Papers.more » « less
-
Abstract Many proteins are composed of several domains that pack together into a complex tertiary structure. Multidomain proteins can be challenging for protein structure modeling, particularly those for which templates can be found for individual domains but not for the entire sequence. In such cases, homology modeling can generate high quality models of the domains but not for the orientations between domains. Small‐angle X‐ray scattering (SAXS) reports the structural properties of entire proteins and has the potential for guiding homology modeling of multidomain proteins. In this article, we describe a novel multidomain protein assembly modeling method, SAXSDom that integrates experimental knowledge from SAXS with probabilistic Input‐Output Hidden Markov model to assemble the structures of individual domains together. Four SAXS‐based scoring functions were developed and tested, and the method was evaluated on multidomain proteins from two public datasets. Incorporation of SAXS information improved the accuracy of domain assembly for 40 out of 46 critical assessment of protein structure prediction multidomain protein targets and 45 out of 73 multidomain protein targets from the ab initio domain assembly dataset. The results demonstrate that SAXS data can provide useful information to improve the accuracy of domain‐domain assembly. The source code and tool packages are available athttps://github.com/jianlin-cheng/SAXSDom.more » « less
An official website of the United States government

