skip to main content


Title: Ensembling multiple raw coevolutionary features with deep residual neural networks for contact‐map prediction in CASP13
Abstract

We report the results of residue‐residue contact prediction of a new pipeline built purely on the learning of coevolutionary features in the CASP13 experiment. For a query sequence, the pipeline starts with the collection of multiple sequence alignments (MSAs) from multiple genome and metagenome sequence databases using two complementary Hidden Markov Model (HMM)‐based searching tools. Three profile matrices, built on covariance, precision, and pseudolikelihood maximization respectively, are then created from the MSAs, which are used as the input features of a deep residual convolutional neural network architecture for contact‐map training and prediction. Two ensembling strategies have been proposed to integrate the matrix features through end‐to‐end training and stacking, resulting in two complementary programs called TripletRes and ResTriplet, respectively. For the 31 free‐modeling domains that do not have homologous templates in the PDB, TripletRes and ResTriplet generated comparable results with an average accuracy of 0.640 and 0.646, respectively, for the topL/5 long‐range predictions, where 71% and 74% of the cases have an accuracy above 0.5. Detailed data analyses showed that the strength of the pipeline is due to the sensitive MSA construction and the advanced strategies for coevolutionary feature ensembling. Domain splitting was also found to help enhance the contact prediction performance. Nevertheless, contact models for tail regions, which often involve a high number of alignment gaps, and for targets with few homologous sequences are still suboptimal. Development of new approaches where the model is specifically trained on these regions and targets might help address these problems.

 
more » « less
NSF-PAR ID:
10459821
Author(s) / Creator(s):
 ;  ;  ;  ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Proteins: Structure, Function, and Bioinformatics
Volume:
87
Issue:
12
ISSN:
0887-3585
Page Range / eLocation ID:
p. 1082-1091
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Motivation The success of genome sequencing techniques has resulted in rapid explosion of protein sequences. Collections of multiple homologous sequences can provide critical information to the modeling of structure and function of unknown proteins. There are however no standard and efficient pipeline available for sensitive multiple sequence alignment (MSA) collection. This is particularly challenging when large whole-genome and metagenome databases are involved. Results We developed DeepMSA, a new open-source method for sensitive MSA construction, which has homologous sequences and alignments created from multi-sources of whole-genome and metagenome databases through complementary hidden Markov model algorithms. The practical usefulness of the pipeline was examined in three large-scale benchmark experiments based on 614 non-redundant proteins. First, DeepMSA was utilized to generate MSAs for residue-level contact prediction by six coevolution and deep learning-based programs, which resulted in an accuracy increase in long-range contacts by up to 24.4% compared to the default programs. Next, multiple threading programs are performed for homologous structure identification, where the average TM-score of the template alignments has over 7.5% increases with the use of the new DeepMSA profiles. Finally, DeepMSA was used for secondary structure prediction and resulted in statistically significant improvements in the Q3 accuracy. It is noted that all these improvements were achieved without re-training the parameters and neural-network models, demonstrating the robustness and general usefulness of the DeepMSA in protein structural bioinformatics applications, especially for targets without homologous templates in the PDB library. Availability and implementation https://zhanglab.ccmb.med.umich.edu/DeepMSA/. Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less
  2. Abstract

    We report the results of two fully automated structure prediction pipelines, “Zhang‐Server” and “QUARK”, in CASP13. The pipelines were built upon the C‐I‐TASSER and C‐QUARK programs, which in turn are based on I‐TASSER and QUARK but with three new modules: (a) a novel multiple sequence alignment (MSA) generation protocol to construct deep sequence‐profiles for contact prediction; (b) an improved meta‐method, NeBcon, which combines multiple contact predictors, including ResPRE that predicts contact‐maps by coupling precision‐matrices with deep residual convolutional neural‐networks; and (c) an optimized contact potential to guide structure assembly simulations. For 50 CASP13 FM domains that lacked homologous templates, average TM‐scores of the first models produced by C‐I‐TASSER and C‐QUARK were 28% and 56% higher than those constructed by I‐TASSER and QUARK, respectively. For the first time, contact‐map predictions demonstrated usefulness on TBM domains with close homologous templates, where TM‐scores of C‐I‐TASSER models were significantly higher than those of I‐TASSER models with aP‐value <.05. Detailed data analyses showed that the success of C‐I‐TASSER and C‐QUARK was mainly due to the increased accuracy of deep‐learning‐based contact‐maps, as well as the careful balance between sequence‐based contact restraints, threading templates, and generic knowledge‐based potentials. Nevertheless, challenges still remain for predicting quaternary structure of multi‐domain proteins, due to the difficulties in domain partitioning and domain reassembly. In addition, contact prediction in terminal regions was often unsatisfactory due to the sparsity of MSAs. Development of new contact‐based domain partitioning and assembly methods and training contact models on sparse MSAs may help address these issues.

     
    more » « less
  3. Martelli, Pier Luigi (Ed.)
    Abstract Motivation Accurate prediction of residue-residue distances is important for protein structure prediction. We developed several protein distance predictors based on a deep learning distance prediction method and blindly tested them in the 14th Critical Assessment of Protein Structure Prediction (CASP14). The prediction method uses deep residual neural networks with the channel-wise attention mechanism to classify the distance between every two residues into multiple distance intervals. The input features for the deep learning method include co-evolutionary features as well as other sequence-based features derived from multiple sequence alignments (MSAs). Three alignment methods are used with multiple protein sequence/profile databases to generate MSAs for input feature generation. Based on different configurations and training strategies of the deep learning method, five MULTICOM distance predictors were created to participate in the CASP14 experiment. Results Benchmarked on 37 hard CASP14 domains, the best performing MULTICOM predictor is ranked 5th out of 30 automated CASP14 distance prediction servers in terms of precision of top L/5 long-range contact predictions (i.e. classifying distances between two residues into two categories: in contact (< 8 Angstrom) and not in contact otherwise) and performs better than the best CASP13 distance prediction method. The best performing MULTICOM predictor is also ranked 6th among automated server predictors in classifying inter-residue distances into 10 distance intervals defined by CASP14 according to the precision of distance classification. The results show that the quality and depth of MSAs depend on alignment methods and sequence databases and have a significant impact on the accuracy of distance prediction. Using larger training datasets and multiple complementary features improves prediction accuracy. However, the number of effective sequences in MSAs is only a weak indicator of the quality of MSAs and the accuracy of predicted distance maps. In contrast, there is a strong correlation between the accuracy of contact/distance predictions and the average probability of the predicted contacts, which can therefore be more effectively used to estimate the confidence of distance predictions and select predicted distance maps. Availability The software package, source code, and data of DeepDist2 are freely available at https://github.com/multicom-toolbox/deepdist and https://zenodo.org/record/4712084#.YIIM13VKhQM. 
    more » « less
  4. Kolodny, Rachel (Ed.)
    The topology of protein folds can be specified by the inter-residue contact-maps and accurate contact-map prediction can help ab initio structure folding. We developed TripletRes to deduce protein contact-maps from discretized distance profiles by end-to-end training of deep residual neural-networks. Compared to previous approaches, the major advantage of TripletRes is in its ability to learn and directly fuse a triplet of coevolutionary matrices extracted from the whole-genome and metagenome databases and therefore minimize the information loss during the course of contact model training. TripletRes was tested on a large set of 245 non-homologous proteins from CASP 11&12 and CAMEO experiments and outperformed other top methods from CASP12 by at least 58.4% for the CASP 11&12 targets and 44.4% for the CAMEO targets in the top- L long-range contact precision. On the 31 FM targets from the latest CASP13 challenge, TripletRes achieved the highest precision (71.6%) for the top- L /5 long-range contact predictions. It was also shown that a simple re-training of the TripletRes model with more proteins can lead to further improvement with precisions comparable to state-of-the-art methods developed after CASP13. These results demonstrate a novel efficient approach to extend the power of deep convolutional networks for high-accuracy medium- and long-range protein contact-map predictions starting from primary sequences, which are critical for constructing 3D structure of proteins that lack homologous templates in the PDB library. 
    more » « less
  5. Abstract Motivation Deep learning has become the dominant technology for protein contact prediction. However, the factors that affect the performance of deep learning in contact prediction have not been systematically investigated. Results We analyzed the results of our three deep learning-based contact prediction methods (MULTICOM-CLUSTER, MULTICOM-CONSTRUCT and MULTICOM-NOVEL) in the CASP13 experiment and identified several key factors [i.e. deep learning technique, multiple sequence alignment (MSA), distance distribution prediction and domain-based contact integration] that influenced the contact prediction accuracy. We compared our convolutional neural network (CNN)-based contact prediction methods with three coevolution-based methods on 75 CASP13 targets consisting of 108 domains. We demonstrated that the CNN-based multi-distance approach was able to leverage global coevolutionary coupling patterns comprised of multiple correlated contacts for more accurate contact prediction than the local coevolution-based methods, leading to a substantial increase of precision by 19.2 percentage points. We also tested different alignment methods and domain-based contact prediction with the deep learning contact predictors. The comparison of the three methods showed deeper sequence alignments and the integration of domain-based contact prediction with the full-length contact prediction improved the performance of contact prediction. Moreover, we demonstrated that the domain-based contact prediction based on a novel ab initio approach of parsing domains from MSAs alone without using known protein structures was a simple, fast approach to improve contact prediction. Finally, we showed that predicting the distribution of inter-residue distances in multiple distance intervals could capture more structural information and improve binary contact prediction. Availability and implementation https://github.com/multicom-toolbox/DNCON2/. Supplementary information Supplementary data are available at Bioinformatics online. 
    more » « less