skip to main content

Title: Protein structure prediction using deep learning distance and hydrogen‐bonding restraints in CASP14

In this article, we report 3D structure prediction results by two of our best server groups (“Zhang‐Server” and “QUARK”) in CASP14. These two servers were built based on the D‐I‐TASSER and D‐QUARK algorithms, which integrated four newly developed components into the classical protein folding pipelines, I‐TASSER and QUARK, respectively. The new components include: (a) a new multiple sequence alignment (MSA) collection tool, DeepMSA2, which is extended from the DeepMSA program; (b) a contact‐based domain boundary prediction algorithm, FUpred, to detect protein domain boundaries; (c) a residual convolutional neural network‐based method, DeepPotential, to predict multiple spatial restraints by co‐evolutionary features derived from the MSA; and (d) optimized spatial restraint energy potentials to guide the structure assembly simulations. For 37 FM targets, the average TM‐scores of the first models produced by D‐I‐TASSER and D‐QUARK were 96% and 112% higher than those constructed by I‐TASSER and QUARK, respectively. The data analysis indicates noticeable improvements produced by each of the four new components, especially for the newly added spatial restraints from DeepPotential and the well‐tuned force field that combines spatial restraints, threading templates, and generic knowledge‐based potentials. However, challenges still exist in the current pipelines. These include difficulties in modeling multi‐domain proteins due to low accuracy in inter‐domain distance prediction and modeling protein domains from oligomer complexes, as the co‐evolutionary analysis cannot distinguish inter‐chain and intra‐chain distances. Specifically tuning the deep learning‐based predictors for multi‐domain targets and protein complexes may be helpful to address these issues.

more » « less
Author(s) / Creator(s):
 ;  ;  ;  ;  ;  ;  ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Proteins: Structure, Function, and Bioinformatics
Page Range / eLocation ID:
p. 1734-1751
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    We report the results of the “UM‐TBM” and “Zheng” groups in CASP15 for protein monomer and complex structure prediction. These prediction sets were obtained using the D‐I‐TASSER and DMFold‐Multimer algorithms, respectively. For monomer structure prediction, D‐I‐TASSER introduced four new features during CASP15: (i) a multiple sequence alignment (MSA) generation protocol that combines multi‐source MSA searching and a structural modeling‐based MSA ranker; (ii) attention‐network based spatial restraints; (iii) a multi‐domain module containing domain partition and arrangement for domain‐level templates and spatial restraints; (iv) an optimized I‐TASSER‐based folding simulation system for full‐length model creation guided by a combination of deep learning restraints, threading alignments, and knowledge‐based potentials. For 47 free modeling targets in CASP15, the final models predicted by D‐I‐TASSER showed average TM‐score 19% higher than the standard AlphaFold2 program. We thus showed that traditional Monte Carlo‐based folding simulations, when appropriately coupled with deep learning algorithms, can generate models with improved accuracy over end‐to‐end deep learning methods alone. For protein complex structure prediction, DMFold‐Multimer generated models by integrating a new MSA generation algorithm (DeepMSA2) with the end‐to‐end modeling module from AlphaFold2‐Multimer. For the 38 complex targets, DMFold‐Multimer generated models with an average TM‐score of 0.83 and Interface Contact Score of 0.60, both significantly higher than those of competing complex prediction tools. Our analyses on complexes highlighted the critical role played by MSA generating, ranking, and pairing in protein complex structure prediction. We also discuss future room for improvement in the areas of viral protein modeling and complex model ranking.

    more » « less
  2. Abstract

    We report the results of two fully automated structure prediction pipelines, “Zhang‐Server” and “QUARK”, in CASP13. The pipelines were built upon the C‐I‐TASSER and C‐QUARK programs, which in turn are based on I‐TASSER and QUARK but with three new modules: (a) a novel multiple sequence alignment (MSA) generation protocol to construct deep sequence‐profiles for contact prediction; (b) an improved meta‐method, NeBcon, which combines multiple contact predictors, including ResPRE that predicts contact‐maps by coupling precision‐matrices with deep residual convolutional neural‐networks; and (c) an optimized contact potential to guide structure assembly simulations. For 50 CASP13 FM domains that lacked homologous templates, average TM‐scores of the first models produced by C‐I‐TASSER and C‐QUARK were 28% and 56% higher than those constructed by I‐TASSER and QUARK, respectively. For the first time, contact‐map predictions demonstrated usefulness on TBM domains with close homologous templates, where TM‐scores of C‐I‐TASSER models were significantly higher than those of I‐TASSER models with aP‐value <.05. Detailed data analyses showed that the success of C‐I‐TASSER and C‐QUARK was mainly due to the increased accuracy of deep‐learning‐based contact‐maps, as well as the careful balance between sequence‐based contact restraints, threading templates, and generic knowledge‐based potentials. Nevertheless, challenges still remain for predicting quaternary structure of multi‐domain proteins, due to the difficulties in domain partitioning and domain reassembly. In addition, contact prediction in terminal regions was often unsatisfactory due to the sparsity of MSAs. Development of new contact‐based domain partitioning and assembly methods and training contact models on sparse MSAs may help address these issues.

    more » « less
  3. Abstract

    Most proteins in nature contain multiple folding units (or domains). The revolutionary success of AlphaFold2 in single-domain structure prediction showed potential to extend deep-learning techniques for multi-domain structure modeling. This work presents a significantly improved method, DEMO2, which integrates analogous template structural alignments with deep-learning techniques for high-accuracy domain structure assembly. Starting from individual domain models, inter-domain spatial restraints are first predicted with deep residual convolutional networks, where full-length structure models are assembled using L-BFGS simulations under the guidance of a hybrid energy function combining deep-learning restraints and analogous multi-domain template alignments searched from the PDB. The output of DEMO2 contains deep-learning inter-domain restraints, top-ranked multi-domain structure templates, and up to five full-length structure models. DEMO2 was tested on a large-scale benchmark and the blind CASP14 experiment, where DEMO2 was shown to significantly outperform its predecessor and the state-of-the-art protein structure prediction methods. By integrating with new deep-learning techniques, DEMO2 should help fill the rapidly increasing gap between the improved ability of tertiary structure determination and the high demand for the high-quality multi-domain protein structures. The DEMO2 server is available at

    more » « less
  4. Abstract

    This article reports and analyzes the results of protein contact and distance prediction by our methods in the 14th Critical Assessment of techniques for protein Structure Prediction (CASP14). A new deep learning‐based contact/distance predictor was employed based on the ensemble of two complementary coevolution features coupling with deep residual networks. We also improved our multiple sequence alignment (MSA) generation protocol with wholesale meta‐genome sequence databases. On 22 CASP14 free modeling (FM) targets, the proposed model achieved a top‐L/5 long‐range precision of 63.8% and a mean distance bin error of 1.494. Based on the predicted distance potentials, 11 out of 22 FM targets and all of the 14 FM/template‐based modeling (TBM) targets have correctly predicted folds (TM‐score >0.5), suggesting that our approach can provide reliable distance potentials for ab initio protein folding.

    more » « less
  5. Abstract

    Accurate prediction of protein secondary structure (alpha‐helix, beta‐strand and coil) is a crucial step for protein inter‐residue contact prediction and ab initio tertiary structure prediction. In a previous study, we developed a deep belief network‐based protein secondary structure method (DNSS1) and successfully advanced the prediction accuracy beyond 80%. In this work, we developed multiple advanced deep learning architectures (DNSS2) to further improve secondary structure prediction. The major improvements over the DNSS1 method include (a) designing and integrating six advanced one‐dimensional deep convolutional/recurrent/residual/memory/fractal/inception networks to predict 3‐state and 8‐state secondary structure, and (b) using more sensitive profile features inferred from Hidden Markov model (HMM) and multiple sequence alignment (MSA). Most of the deep learning architectures are novel for protein secondary structure prediction. DNSS2 was systematically benchmarked on independent test data sets with eight state‐of‐art tools and consistently ranked as one of the best methods. Particularly, DNSS2 was tested on the protein targets of 2018 CASP13 experiment and achieved the Q3 score of 81.62%, SOV score of 72.19%, and Q8 score of 73.28%. DNSS2 is freely available at:

    more » « less