skip to main content


Title: DPAM : A domain parser for AlphaFold models
Abstract

The recent breakthroughs in structure prediction, where methods such as AlphaFold demonstrated near‐atomic accuracy, herald a paradigm shift in structural biology. The 200 million high‐accuracy models released in the AlphaFold Database are expected to guide protein science in the coming decades. Partitioning these AlphaFold models into domains and assigning them to an evolutionary hierarchy provide an efficient way to gain functional insights into proteins. However, classifying such a large number of predicted structures challenges the infrastructure of current structure classifications, including our Evolutionary Classification of protein Domains (ECOD). Better computational tools are urgently needed to parse and classify domains from AlphaFold models automatically. Here we present a Domain Parser for AlphaFold Models (DPAM) that can automatically recognize globular domains from these models based on inter‐residue distances in 3D structures, predicted aligned errors, and ECOD domains found by sequence (HHsuite) and structural (Dali) similarity searches. Based on a benchmark of 18,759 AlphaFold models, we demonstrate that DPAM can recognize 98.8% of domains and assign correct boundaries for 87.5%, significantly outperforming structure‐based domain parsers and homology‐based domain assignment using ECOD domains found by HHsuite or Dali. Application of DPAM to the massive AlphaFold models will enable efficient classification of domains, providing evolutionary contexts and facilitating functional studies.

 
more » « less
Award ID(s):
2224128
NSF-PAR ID:
10394150
Author(s) / Creator(s):
 ;  ;  ;  ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Protein Science
Volume:
32
Issue:
2
ISSN:
0961-8368
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Mutations in human proteins lead to diseases. The structure of these proteins can help understand the mechanism of such diseases and develop therapeutics against them. With improved deep learning techniques, such as RoseTTAFold and AlphaFold, we can predict the structure of proteins even in the absence of structural homologs. We modeled and extracted the domains from 553 disease-associated human proteins without known protein structures or close homologs in the Protein Databank. We noticed that the model quality was higher and the Root mean square deviation (RMSD) lower between AlphaFold and RoseTTAFold models for domains that could be assigned to CATH families as compared to those which could only be assigned to Pfam families of unknown structure or could not be assigned to either. We predicted ligand-binding sites, protein–protein interfaces and conserved residues in these predicted structures. We then explored whether the disease-associated missense mutations were in the proximity of these predicted functional sites, whether they destabilized the protein structure based on ddG calculations or whether they were predicted to be pathogenic. We could explain 80% of these disease-associated mutations based on proximity to functional sites, structural destabilization or pathogenicity. When compared to polymorphisms, a larger percentage of disease-associated missense mutations were buried, closer to predicted functional sites, predicted as destabilizing and pathogenic. Usage of models from the two state-of-the-art techniques provide better confidence in our predictions, and we explain 93 additional mutations based on RoseTTAFold models which could not be explained based solely on AlphaFold models. 
    more » « less
  2. Abstract

    Identification of the molecular networks that facilitated the evolution of multicellular animals from their unicellular ancestors is a fundamental problem in evolutionary cellular biology. Choanoflagellates are recognized as the closest extant nonmetazoan ancestors to animals. These unicellular eukaryotes can adopt a multicellular‐like “rosette” state. Therefore, they are compelling models for the study of early multicellularity. Comparative studies revealed that a number of putative human orthologs are present in choanoflagellate genomes, suggesting that a subset of these genes were necessary for the emergence of multicellularity. However, previous work is largely based on sequence alignments alone, which does not confirm structural nor functional similarity. Here, we focus on the PDZ domain, a peptide‐binding domain which plays critical roles in myriad cellular signaling networks and which underwent a gene family expansion in metazoan lineages. Using a customized sequence similarity search algorithm, we identified 178 PDZ domains in theMonosiga brevicollisproteome. This includes 11 previously unidentified sequences, which we analyzed using Rosetta and homology modeling. To assess conservation of protein structure, we solved high‐resolution crystal structures of representativeM. brevicollisPDZ domains that are homologous to human Dlg1 PDZ2, Dlg1 PDZ3, GIPC, and SHANK1 PDZ domains. To assess functional conservation, we calculated binding affinities for mbGIPC, mbSHANK1, mbSNX27, and mbDLG‐3 PDZ domains fromM. brevicollis. Overall, we find that peptide selectivity is generally conserved between these two disparate organisms, with one possible exception, mbDLG‐3. Overall, our results provide novel insight into signaling pathways in a choanoflagellate model of primitive multicellularity.

     
    more » « less
  3. Abstract Motivation

    Deep learning has revolutionized protein tertiary structure prediction recently. The cutting-edge deep learning methods such as AlphaFold can predict high-accuracy tertiary structures for most individual protein chains. However, the accuracy of predicting quaternary structures of protein complexes consisting of multiple chains is still relatively low due to lack of advanced deep learning methods in the field. Because interchain residue–residue contacts can be used as distance restraints to guide quaternary structure modeling, here we develop a deep dilated convolutional residual network method (DRCon) to predict interchain residue–residue contacts in homodimers from residue–residue co-evolutionary signals derived from multiple sequence alignments of monomers, intrachain residue–residue contacts of monomers extracted from true/predicted tertiary structures or predicted by deep learning, and other sequence and structural features.

    Results

    Tested on three homodimer test datasets (Homo_std dataset, DeepHomo dataset and CASP-CAPRI dataset), the precision of DRCon for top L/5 interchain contact predictions (L: length of monomer in a homodimer) is 43.46%, 47.10% and 33.50% respectively at 6 Å contact threshold, which is substantially better than DeepHomo and DNCON2_inter and similar to Glinter. Moreover, our experiments demonstrate that using predicted tertiary structure or intrachain contacts of monomers in the unbound state as input, DRCon still performs well, even though its accuracy is lower than using true tertiary structures in the bound state are used as input. Finally, our case study shows that good interchain contact predictions can be used to build high-accuracy quaternary structure models of homodimers.

    Availability and implementation

    The source code of DRCon is available at https://github.com/jianlin-cheng/DRCon. The datasets are available at https://zenodo.org/record/5998532#.YgF70vXMKsB.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  4. Recent advances in protein structure prediction have generated accurate structures of previously uncharacterized human proteins. Identifying domains in these predicted structures and classifying them into an evolutionary hierarchy can reveal biological insights. Here, we describe the detection and classification of domains from the human proteome. Our classification indicates that only 62% of residues are located in globular domains. We further classify these globular domains and observe that the majority (65%) can be classified among known folds by sequence, with a smaller fraction (33%) requiring structural data to refine the domain boundaries and/or to support their homology. A relatively small number (966 domains) cannot be confidently assigned using our automatic pipelines, thus demanding manual inspection. We classify 47,576 domains, of which only 23% have been included in experimental structures. A portion (6.3%) of these classified globular domains lack sequence-based annotation in InterPro. A quarter (23%) have not been structurally modeled by homology, and they contain 2,540 known disease-causing single amino acid variations whose pathogenesis can now be inferred using AF models. A comparison of classified domains from a series of model organisms revealed expansions of several immune response-related domains in humans and a depletion of olfactory receptors. Finally, we use this classification to expand well-known protein families of biological significance. These classifications are presented on the ECOD website ( http://prodata.swmed.edu/ecod/index_human.php ). 
    more » « less
  5. Abstract

    Protein structure prediction is an important problem in bioinformatics and has been studied for decades. However, there are still few open-source comprehensive protein structure prediction packages publicly available in the field. In this paper, we present our latest open-source protein tertiary structure prediction system—MULTICOM2, an integration of template-based modeling (TBM) and template-free modeling (FM) methods. The template-based modeling uses sequence alignment tools with deep multiple sequence alignments to search for structural templates, which are much faster and more accurate than MULTICOM1. The template-free (ab initio or de novo) modeling uses the inter-residue distances predicted by DeepDist to reconstruct tertiary structure models without using any known structure as template. In the blind CASP14 experiment, the average TM-score of the models predicted by our server predictor based on the MULTICOM2 system is 0.720 for 58 TBM (regular) domains and 0.514 for 38 FM and FM/TBM (hard) domains, indicating that MULTICOM2 is capable of predicting good tertiary structures across the board. It can predict the correct fold for 76 CASP14 domains (95% regular domains and 55% hard domains) if only one prediction is made for a domain. The success rate is increased to 3% for both regular and hard domains if five predictions are made per domain. Moreover, the prediction accuracy of the pure template-free structure modeling method on both TBM and FM targets is very close to the combination of template-based and template-free modeling methods. This demonstrates that the distance-based template-free modeling method powered by deep learning can largely replace the traditional template-based modeling method even on TBM targets that TBM methods used to dominate and therefore provides a uniform structure modeling approach to any protein. Finally, on the 38 CASP14 FM and FM/TBM hard domains, MULTICOM2 server predictors (MULTICOM-HYBRID, MULTICOM-DEEP, MULTICOM-DIST) were ranked among the top 20 automated server predictors in the CASP14 experiment. After combining multiple predictors from the same research group as one entry, MULTICOM-HYBRID was ranked no. 5. The source code of MULTICOM2 is freely available athttps://github.com/multicom-toolbox/multicom/tree/multicom_v2.0.

     
    more » « less