skip to main content


Title: A gated graph transformer for protein complex structure quality assessment and its performance in CASP15
Abstract Motivation

Proteins interact to form complexes to carry out essential biological functions. Computational methods such as AlphaFold-multimer have been developed to predict the quaternary structures of protein complexes. An important yet largely unsolved challenge in protein complex structure prediction is to accurately estimate the quality of predicted protein complex structures without any knowledge of the corresponding native structures. Such estimations can then be used to select high-quality predicted complex structures to facilitate biomedical research such as protein function analysis and drug discovery.

Results

In this work, we introduce a new gated neighborhood-modulating graph transformer to predict the quality of 3D protein complex structures. It incorporates node and edge gates within a graph transformer framework to control information flow during graph message passing. We trained, evaluated and tested the method (called DProQA) on newly-curated protein complex datasets before the 15th Critical Assessment of Techniques for Protein Structure Prediction (CASP15) and then blindly tested it in the 2022 CASP15 experiment. The method was ranked 3rd among the single-model quality assessment methods in CASP15 in terms of the ranking loss of TM-score on 36 complex targets. The rigorous internal and external experiments demonstrate that DProQA is effective in ranking protein complex structures.

Availability and implementation

The source code, data, and pre-trained models are available at https://github.com/jianlin-cheng/DProQA.

 
more » « less
Award ID(s):
1763246
NSF-PAR ID:
10427681
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Bioinformatics
Volume:
39
Issue:
Supplement_1
ISSN:
1367-4803
Page Range / eLocation ID:
p. i308-i317
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Motivation

    The state-of-art protein structure prediction methods such as AlphaFold are being widely used to predict structures of uncharacterized proteins in biomedical research. There is a significant need to further improve the quality and nativeness of the predicted structures to enhance their usability. In this work, we develop ATOMRefine, a deep learning-based, end-to-end, all-atom protein structural model refinement method. It uses a SE(3)-equivariant graph transformer network to directly refine protein atomic coordinates in a predicted tertiary structure represented as a molecular graph.

    Results

    The method is first trained and tested on the structural models in AlphaFoldDB whose experimental structures are known, and then blindly tested on 69 CASP14 regular targets and 7 CASP14 refinement targets. ATOMRefine improves the quality of both backbone atoms and all-atom conformation of the initial structural models generated by AlphaFold. It also performs better than two state-of-the-art refinement methods in multiple evaluation metrics including an all-atom model quality score—the MolProbity score based on the analysis of all-atom contacts, bond length, atom clashes, torsion angles, and side-chain rotamers. As ATOMRefine can refine a protein structure quickly, it provides a viable, fast solution for improving protein geometry and fixing structural errors of predicted structures through direct coordinate refinement.

    Availability and implementation

    The source code of ATOMRefine is available in the GitHub repository (https://github.com/BioinfoMachineLearning/ATOMRefine). All the required data for training and testing are available at https://doi.org/10.5281/zenodo.6944368.

     
    more » « less
  2. Abstract Motivation

    Quality assessment (QA) of predicted protein tertiary structure models plays an important role in ranking and using them. With the recent development of deep learning end-to-end protein structure prediction techniques for generating highly confident tertiary structures for most proteins, it is important to explore corresponding QA strategies to evaluate and select the structural models predicted by them since these models have better quality and different properties than the models predicted by traditional tertiary structure prediction methods.

    Results

    We develop EnQA, a novel graph-based 3D-equivariant neural network method that is equivariant to rotation and translation of 3D objects to estimate the accuracy of protein structural models by leveraging the structural features acquired from the state-of-the-art tertiary structure prediction method—AlphaFold2. We train and test the method on both traditional model datasets (e.g. the datasets of the Critical Assessment of Techniques for Protein Structure Prediction) and a new dataset of high-quality structural models predicted only by AlphaFold2 for the proteins whose experimental structures were released recently. Our approach achieves state-of-the-art performance on protein structural models predicted by both traditional protein structure prediction methods and the latest end-to-end deep learning method—AlphaFold2. It performs even better than the model QA scores provided by AlphaFold2 itself. The results illustrate that the 3D-equivariant graph neural network is a promising approach to the evaluation of protein structural models. Integrating AlphaFold2 features with other complementary sequence and structural features is important for improving protein model QA.

    Availability and implementation

    The source code is available at https://github.com/BioinfoMachineLearning/EnQA.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  3. Abstract Motivation

    Millions of protein sequences have been generated by numerous genome and transcriptome sequencing projects. However, experimentally determining the function of the proteins is still a time consuming, low-throughput, and expensive process, leading to a large protein sequence-function gap. Therefore, it is important to develop computational methods to accurately predict protein function to fill the gap. Even though many methods have been developed to use protein sequences as input to predict function, much fewer methods leverage protein structures in protein function prediction because there was lack of accurate protein structures for most proteins until recently.

    Results

    We developed TransFun—a method using a transformer-based protein language model and 3D-equivariant graph neural networks to distill information from both protein sequences and structures to predict protein function. It extracts feature embeddings from protein sequences using a pre-trained protein language model (ESM) via transfer learning and combines them with 3D structures of proteins predicted by AlphaFold2 through equivariant graph neural networks. Benchmarked on the CAFA3 test dataset and a new test dataset, TransFun outperforms several state-of-the-art methods, indicating that the language model and 3D-equivariant graph neural networks are effective methods to leverage protein sequences and structures to improve protein function prediction. Combining TransFun predictions and sequence similarity-based predictions can further increase prediction accuracy.

    Availability and implementation

    The source code of TransFun is available at https://github.com/jianlin-cheng/TransFun.

     
    more » « less
  4. Abstract

    We report the results of the “UM‐TBM” and “Zheng” groups in CASP15 for protein monomer and complex structure prediction. These prediction sets were obtained using the D‐I‐TASSER and DMFold‐Multimer algorithms, respectively. For monomer structure prediction, D‐I‐TASSER introduced four new features during CASP15: (i) a multiple sequence alignment (MSA) generation protocol that combines multi‐source MSA searching and a structural modeling‐based MSA ranker; (ii) attention‐network based spatial restraints; (iii) a multi‐domain module containing domain partition and arrangement for domain‐level templates and spatial restraints; (iv) an optimized I‐TASSER‐based folding simulation system for full‐length model creation guided by a combination of deep learning restraints, threading alignments, and knowledge‐based potentials. For 47 free modeling targets in CASP15, the final models predicted by D‐I‐TASSER showed average TM‐score 19% higher than the standard AlphaFold2 program. We thus showed that traditional Monte Carlo‐based folding simulations, when appropriately coupled with deep learning algorithms, can generate models with improved accuracy over end‐to‐end deep learning methods alone. For protein complex structure prediction, DMFold‐Multimer generated models by integrating a new MSA generation algorithm (DeepMSA2) with the end‐to‐end modeling module from AlphaFold2‐Multimer. For the 38 complex targets, DMFold‐Multimer generated models with an average TM‐score of 0.83 and Interface Contact Score of 0.60, both significantly higher than those of competing complex prediction tools. Our analyses on complexes highlighted the critical role played by MSA generating, ranking, and pairing in protein complex structure prediction. We also discuss future room for improvement in the areas of viral protein modeling and complex model ranking.

     
    more » « less
  5. Abstract Motivation

    Deep learning has revolutionized protein tertiary structure prediction recently. The cutting-edge deep learning methods such as AlphaFold can predict high-accuracy tertiary structures for most individual protein chains. However, the accuracy of predicting quaternary structures of protein complexes consisting of multiple chains is still relatively low due to lack of advanced deep learning methods in the field. Because interchain residue–residue contacts can be used as distance restraints to guide quaternary structure modeling, here we develop a deep dilated convolutional residual network method (DRCon) to predict interchain residue–residue contacts in homodimers from residue–residue co-evolutionary signals derived from multiple sequence alignments of monomers, intrachain residue–residue contacts of monomers extracted from true/predicted tertiary structures or predicted by deep learning, and other sequence and structural features.

    Results

    Tested on three homodimer test datasets (Homo_std dataset, DeepHomo dataset and CASP-CAPRI dataset), the precision of DRCon for top L/5 interchain contact predictions (L: length of monomer in a homodimer) is 43.46%, 47.10% and 33.50% respectively at 6 Å contact threshold, which is substantially better than DeepHomo and DNCON2_inter and similar to Glinter. Moreover, our experiments demonstrate that using predicted tertiary structure or intrachain contacts of monomers in the unbound state as input, DRCon still performs well, even though its accuracy is lower than using true tertiary structures in the bound state are used as input. Finally, our case study shows that good interchain contact predictions can be used to build high-accuracy quaternary structure models of homodimers.

    Availability and implementation

    The source code of DRCon is available at https://github.com/jianlin-cheng/DRCon. The datasets are available at https://zenodo.org/record/5998532#.YgF70vXMKsB.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less