Information about the domain architecture of proteins is useful for studying protein structure and function. However, accurate prediction of protein domain boundaries (i.e., sequence regions separating two domains) from sequence remains a significant challenge. In this work, we develop a deep learning method based on multi-head U-Nets (called DistDom) to predict protein domain boundaries using 1D sequence features and a predicted 2D inter-residue distance map as input. The 1D features contain the evolutionary and physicochemical information of protein sequences, whereas the 2D distance map carries structural information that has rarely been used in domain boundary prediction. The 1D and 2D features are processed by 1D and 2D U-Nets, respectively, to generate hidden features. Multi-head attention then uses the hidden features to predict the probability that each residue of a protein lies in a domain boundary, leveraging both local and global information in the features. The residue-level domain boundary predictions can also be used to classify proteins as single-domain or multi-domain. DistDom classifies the CASP14 single-domain and multi-domain targets with an accuracy of 75.9%, 13.28% more accurate than the state-of-the-art method. Tested on the CASP14 multi-domain protein targets with expert-annotated domain boundaries, the average per-target F1 score of DistDom's domain boundary prediction is 0.263, 29.56% higher than that of the state-of-the-art method.
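As a rough illustration of how residue-level boundary probabilities could feed both the single-domain/multi-domain call and a per-target F1 score, here is a minimal sketch. It is not DistDom's actual procedure: the 0.5 probability threshold, the "any residue above threshold" rule, and the function names `classify_domain_count` and `f1_score` are assumptions made for this example only.

```python
def classify_domain_count(boundary_probs, threshold=0.5):
    """Label a protein from per-residue boundary probabilities.

    Assumption (not from the abstract): a protein is called multi-domain
    if any residue's boundary probability reaches `threshold`.
    """
    if any(p >= threshold for p in boundary_probs):
        return "multi-domain"
    return "single-domain"


def f1_score(predicted, actual):
    """Residue-level F1 between predicted and annotated boundary labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy five-residue example with made-up probabilities and annotations.
probs = [0.1, 0.2, 0.7, 0.6, 0.1]
predicted = [p >= 0.5 for p in probs]
annotated = [False, True, True, False, False]
label = classify_domain_count(probs)
score = f1_score(predicted, annotated)
```

Averaging `f1_score` over all targets would give a per-target mean comparable in spirit to the figure quoted above, although the exact boundary definition used in the evaluation is not given here.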
Improving Inter-Helix Contact Prediction With Local 2D Topological Information
Inter-helix contact prediction aims to identify residue contacts across different helices in α-helical integral membrane proteins. Despite the progress made by various computational methods, contact prediction remains a challenging task, and to our knowledge no method directly taps into the contact map in an alignment-free manner. We build 2D contact models from an independent dataset to capture the topological patterns in the neighborhood of a residue pair, depending on whether it is a contact or not, and apply the models to the state-of-the-art method's predictions to extract features reflecting 2D inter-helix contact patterns. A secondary classifier is trained on these features. Realizing that the achievable improvement hinges intrinsically on the quality of the original predictions, we devise a mechanism to deal with this issue by introducing (1) partial discretization of the original prediction scores to leverage useful information more effectively, and (2) a fuzzy score that assesses the quality of the original prediction to help select the residue pairs where improvement is more achievable. The cross-validation results show that our method outperforms other methods, including the state-of-the-art method (DeepHelicon), by a notable degree even without the refinement selection scheme. With the refinement selection scheme applied, our method significantly outperforms the state-of-the-art method on the selected sequences.
- Award ID(s): 1820103
- PAR ID: 10437655
- Date Published:
- Journal Name: IEEE/ACM Transactions on Computational Biology and Bioinformatics
- ISSN: 1545-5963
- Page Range / eLocation ID: 1 to 12
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Valencia, Alfonso (Ed.) Abstract. Motivation: Protein structure prediction remains one of the most important problems in computational biology and biophysics. In the past few years, protein residue–residue contact prediction has undergone substantial improvement, which has made it a critical driving force for successful protein structure prediction. Boosting the accuracy of contact predictions has therefore moved to the forefront of protein structure prediction. Results: We present a novel contact map refinement method, ContactGAN, which uses Generative Adversarial Networks (GAN). ContactGAN made a significant improvement over predictions from recent contact prediction methods when tested on three datasets, including protein structure modeling targets in CASP13 and CASP14. We show an improvement in contact prediction precision, which translated into more accurate protein tertiary structure models. On the other hand, the observed improvement over trRosetta was relatively small; the reasons are discussed. ContactGAN will be a valuable addition to the structure prediction pipeline for achieving an extra gain in contact prediction accuracy. Availability and implementation: https://github.com/kiharalab/ContactGAN. Supplementary information: Supplementary data are available at Bioinformatics online.
- Residue-residue distance information is useful for predicting tertiary structures of protein monomers or quaternary structures of protein complexes. Many deep learning methods have been developed to predict intra-chain residue-residue distances of monomers accurately, but few methods can accurately predict inter-chain residue-residue distances of complexes. We develop a deep learning method, CDPred (i.e., Complex Distance Prediction), based on a 2D attention-powered residual network to address this gap. Tested on two homodimer datasets, CDPred achieves precisions of 60.94% and 42.93%, respectively, for top L/5 inter-chain contact predictions (L: length of the monomer in the homodimer), substantially higher than DeepHomo’s 37.40% and 23.08% and GLINTER’s 48.09% and 36.74%. Tested on two heterodimer datasets, the top Ls/5 inter-chain contact prediction precision of CDPred (Ls: length of the shorter monomer in the heterodimer) is 47.59% and 22.87%, respectively, surpassing GLINTER’s 23.24% and 13.49%. Moreover, the predictions of CDPred are complementary to those of AlphaFold2-multimer.
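The top L/5 precision quoted above can be sketched as follows. This is an illustrative computation, not CDPred's evaluation code: the function name `top_k_precision`, the dictionary input format, and the integer rounding of L/5 are assumptions.

```python
def top_k_precision(scores, true_contacts, length, divisor=5):
    """Precision of the top (length // divisor) highest-scoring residue pairs.

    scores: dict mapping (i, j) residue pairs to predicted contact scores.
    true_contacts: set of (i, j) pairs that are real contacts.
    length: L for homodimers, or the shorter chain length Ls for heterodimers.
    Assumption: L/divisor is rounded down, with a minimum of one pair.
    """
    k = max(1, length // divisor)
    ranked = sorted(scores, key=scores.get, reverse=True)[:k]
    if not ranked:
        return 0.0  # nothing was predicted
    hits = sum(1 for pair in ranked if pair in true_contacts)
    return hits / len(ranked)


# Toy example: L = 10, so the top 10 // 5 = 2 pairs are evaluated.
scores = {(0, 5): 0.9, (1, 6): 0.8, (2, 7): 0.1}
true_contacts = {(0, 5), (2, 7)}
precision = top_k_precision(scores, true_contacts, length=10)
```

In this toy case the two highest-scoring pairs are (0, 5) and (1, 6), of which only the first is a true contact.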
- Abstract. Background: Driven by deep learning, inter-residue contact/distance prediction has improved significantly and has substantially enhanced ab initio protein structure prediction. Currently, most distance prediction methods classify inter-residue distances into multiple distance intervals instead of directly predicting real-value distances. The output of the former has to be converted into real-value distances to be used in tertiary structure prediction. Results: To explore the potential of predicting real-value inter-residue distances, we develop a multi-task deep learning distance predictor (DeepDist) based on new residual convolutional network architectures to simultaneously predict real-value inter-residue distances and classify them into multiple distance intervals. Tested on 43 CASP13 hard domains, DeepDist achieves comparable performance in real-value distance prediction and multi-class distance prediction. The average mean square error (MSE) of DeepDist’s real-value distance prediction is 0.896 Å² when predicted distances ≥ 16 Å are filtered out, which is lower than the 1.003 Å² of DeepDist’s multi-class distance prediction. When distance predictions are converted into contact predictions at the 8 Å threshold (the standard threshold in the field), the precision of the top L/5 and L/2 contact predictions of DeepDist’s multi-class distance prediction is 79.3% and 66.1%, respectively, higher than the 78.6% and 64.5% of its real-value distance prediction and the best results in the CASP13 experiment. Conclusions: DeepDist can predict inter-residue distances well and improve binary contact prediction over the existing state-of-the-art methods. Moreover, because of their lower MSE, the predicted real-value distances can be used directly to reconstruct protein tertiary structures better than multi-class distance predictions. Finally, we demonstrate that predicting the real-value distance map and the multi-class distance map at the same time performs better than predicting real-value distances alone.
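The two evaluation steps mentioned above, converting a distance map into contacts at 8 Å and computing the MSE after discarding predicted distances ≥ 16 Å, can be sketched as below. This is a minimal illustration rather than DeepDist's code: the function names, the nested-list map format, and the strict "less than 8 Å" contact rule are assumptions.

```python
def distances_to_contacts(dist_map, threshold=8.0):
    """Turn a distance map (in Å) into a boolean contact map.

    Assumption: a pair counts as a contact when its distance is
    strictly below `threshold`.
    """
    return [[d < threshold for d in row] for row in dist_map]


def filtered_mse(pred_map, true_map, cutoff=16.0):
    """Mean squared error over pairs whose predicted distance is < cutoff."""
    squared_errors = [
        (p - t) ** 2
        for pred_row, true_row in zip(pred_map, true_map)
        for p, t in zip(pred_row, true_row)
        if p < cutoff  # drop pairs with predicted distance >= cutoff
    ]
    if not squared_errors:
        return float("nan")  # every pair was filtered out
    return sum(squared_errors) / len(squared_errors)


# Toy 2x2 maps with made-up distances in Å.
pred = [[4.0, 18.0], [10.0, 7.0]]
true = [[5.0, 20.0], [12.0, 7.0]]
contacts = distances_to_contacts(pred)
mse = filtered_mse(pred, true)
```

Here the pair predicted at 18 Å is excluded from the error, so the MSE is taken over the remaining three pairs.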
- Abstract. Motivation: Protein model quality estimation, in many ways, informs protein structure prediction. Despite their tight coupling, existing model quality estimation methods do not leverage inter-residue distance information or the recent deep learning breakthroughs that have revolutionized protein structure prediction. Results: We present a new distance-based single-model quality estimation method called QDeep that harnesses the power of stacked deep residual neural networks (ResNets). Our method first employs stacked deep ResNets to perform residue-level ensemble error classifications at multiple predefined error thresholds, and then combines the predictions from the individual error classifiers to estimate the quality of a protein structural model. Experimental results show that our method consistently outperforms existing state-of-the-art methods, including ProQ2, ProQ3, ProQ3D, ProQ4, 3DCNN, MESHI, and VoroMQA, on multiple independent test datasets across a wide range of accuracy measures, and that predicted distance information contributes significantly to the improved performance of QDeep. Availability and implementation: https://github.com/Bhattacharya-Lab/QDeep. Supplementary information: Supplementary data are available at Bioinformatics online.

