Abstract Mutations in human proteins lead to diseases. The structure of these proteins can help understand the mechanism of such diseases and develop therapeutics against them. With improved deep learning techniques, such as RoseTTAFold and AlphaFold, we can predict the structure of proteins even in the absence of structural homologs. We modeled and extracted the domains from 553 disease-associated human proteins without known protein structures or close homologs in the Protein Databank. We noticed that the model quality was higher and the Root mean square deviation (RMSD) lower between AlphaFold and RoseTTAFold models for domains that could be assigned to CATH families as compared to those which could only be assigned to Pfam families of unknown structure or could not be assigned to either. We predicted ligand-binding sites, protein–protein interfaces and conserved residues in these predicted structures. We then explored whether the disease-associated missense mutations were in the proximity of these predicted functional sites, whether they destabilized the protein structure based on ddG calculations or whether they were predicted to be pathogenic. We could explain 80% of these disease-associated mutations based on proximity to functional sites, structural destabilization or pathogenicity. When compared to polymorphisms, a larger percentage of disease-associated missense mutations were buried, closer to predicted functional sites, predicted as destabilizing and pathogenic. Usage of models from the two state-of-the-art techniques provide better confidence in our predictions, and we explain 93 additional mutations based on RoseTTAFold models which could not be explained based solely on AlphaFold models.
more »
« less
DPAM : A domain parser for AlphaFold models
Abstract The recent breakthroughs in structure prediction, where methods such as AlphaFold demonstrated near‐atomic accuracy, herald a paradigm shift in structural biology. The 200 million high‐accuracy models released in the AlphaFold Database are expected to guide protein science in the coming decades. Partitioning these AlphaFold models into domains and assigning them to an evolutionary hierarchy provide an efficient way to gain functional insights into proteins. However, classifying such a large number of predicted structures challenges the infrastructure of current structure classifications, including our Evolutionary Classification of protein Domains (ECOD). Better computational tools are urgently needed to parse and classify domains from AlphaFold models automatically. Here we present a Domain Parser for AlphaFold Models (DPAM) that can automatically recognize globular domains from these models based on inter‐residue distances in 3D structures, predicted aligned errors, and ECOD domains found by sequence (HHsuite) and structural (Dali) similarity searches. Based on a benchmark of 18,759 AlphaFold models, we demonstrate that DPAM can recognize 98.8% of domains and assign correct boundaries for 87.5%, significantly outperforming structure‐based domain parsers and homology‐based domain assignment using ECOD domains found by HHsuite or Dali. Application of DPAM to the massive AlphaFold models will enable efficient classification of domains, providing evolutionary contexts and facilitating functional studies.
more »
« less
- Award ID(s):
- 2224128
- PAR ID:
- 10394150
- Publisher / Repository:
- Wiley Blackwell (John Wiley & Sons)
- Date Published:
- Journal Name:
- Protein Science
- Volume:
- 32
- Issue:
- 2
- ISSN:
- 0961-8368
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
ABSTRACT Accurate prediction of protein–peptide complex structures plays a critical role in structure‐based drug design, including antibody design. Most peptide‐docking benchmark studies were conducted using crystal structures of protein–peptide complexes; as such, the performance of the current peptide docking tools in the practical setting is unknown. Here, the practical setting implies there are no crystal or other experimental structures for the complex, nor for the receptor and peptide. In this work, we have developed a practical docking protocol that incorporated two famous machine learning models, AlphaFold 2 for structural prediction and ANI‐2x for ab initio potential prediction, to achieve a high success rate in modeling protein–peptide complex structures. The docking protocol consists of three major stages. In the first stage, the 3D structure of the receptor is predicted by AlphaFold 2 using the monomer mode, and that of the peptide is predicted by AlphaFold 2 using the multimer mode. We found that it is essential to include the receptor information to generate a high‐quality 3D structure of the peptide. In the second stage, rigid protein–peptide docking is performed using ZDOCK software. In the last stage, the top 10 docking poses are relaxed and refined by ANI‐2x in conjunction with our in‐house geometry optimization algorithm—conjugate gradient with backtracking line search (CG‐BS). CG‐BS was developed by us to more efficiently perform geometry optimization, which takes the potential and force directly from ANI‐2x machine learning models. The docking protocol achieved a very encouraging performance for a set of 62 very challenging protein–peptide systems which had an overall success rate of 34% if only the top 1 docking poses were considered. This success rate increased to 45% if the top 3 docking poses were considered. It is emphasized that this encouraging protein–peptide docking performance was achieved without using any crystal or experimental structures.more » « less
-
Dunbrack, Roland L (Ed.)Protein structure prediction has now been deployed widely across several different large protein sets. Large-scale domain annotation of these predictions can aid in the development of biological insights. Using our Evolutionary Classification of Protein Domains (ECOD) from experimental structures as a basis for classification, we describe the detection and cataloging of domains from 48 whole proteomes deposited in the AlphaFold Database. On average, we can provide positive classification (either of domains or other identifiable non-domain regions) for 90% of residues in all proteomes. We classified 746,349 domains from 536,808 proteins comprised of over 226,424,000 amino acid residues. We examine the varying populations of homologous groups in both eukaryotes and bacteria. In addition to containing a higher fraction of disordered regions and unassigned domains, eukaryotes show a higher proportion of repeated proteins, both globular and small repeats. We enumerate those highly populated domains that are shared in both eukaryotes and bacteria, such as the Rossmann domains, TIM barrels, and P-loop domains. Additionally, we compare the sampling of homologous groups from this whole proteome set against our stable ECOD reference and discuss groups that have been enriched by structure predictions. Finally, we discuss the implication of these results for protein target selection for future classification strategies for very large protein sets.more » « less
-
Recent advances in protein structure prediction have generated accurate structures of previously uncharacterized human proteins. Identifying domains in these predicted structures and classifying them into an evolutionary hierarchy can reveal biological insights. Here, we describe the detection and classification of domains from the human proteome. Our classification indicates that only 62% of residues are located in globular domains. We further classify these globular domains and observe that the majority (65%) can be classified among known folds by sequence, with a smaller fraction (33%) requiring structural data to refine the domain boundaries and/or to support their homology. A relatively small number (966 domains) cannot be confidently assigned using our automatic pipelines, thus demanding manual inspection. We classify 47,576 domains, of which only 23% have been included in experimental structures. A portion (6.3%) of these classified globular domains lack sequence-based annotation in InterPro. A quarter (23%) have not been structurally modeled by homology, and they contain 2,540 known disease-causing single amino acid variations whose pathogenesis can now be inferred using AF models. A comparison of classified domains from a series of model organisms revealed expansions of several immune response-related domains in humans and a depletion of olfactory receptors. Finally, we use this classification to expand well-known protein families of biological significance. These classifications are presented on the ECOD website ( http://prodata.swmed.edu/ecod/index_human.php ).more » « less
-
Abstract The Membranome database provides comprehensive structural information on single‐pass (i.e., bitopic) membrane proteins from six evolutionarily distant organisms, including protein–protein interactions, complexes, mutations, experimental structures, and models of transmembrane α‐helical dimers. We present a new version of this database, Membranome 3.0, which was significantly updated by revising the set of 5,758 bitopic proteins and incorporating models generated by AlphaFold 2 in the database. The AlphaFold models were parsed into structural domains located at the different membrane sides, modified to exclude low‐confidence unstructured terminal regions and signal sequences, validated through comparison with available experimental structures, and positioned with respect to membrane boundaries. Membranome 3.0 was re‐developed to facilitate visualization and comparative analysis of multiple 3D structures of proteins that belong to a specified family, complex, biological pathway, or membrane type. New tools for advanced search and analysis of proteins, their interactions, complexes, and mutations were included. The database is freely accessible athttps://membranome.org.more » « less
An official website of the United States government
