Abstract The InterPro database (https://www.ebi.ac.uk/interpro/) provides an integrative classification of protein sequences into families, and identifies functionally important domains and conserved sites. Here, we report recent developments with InterPro (version 90.0) and its associated software, including updates to data content and to the website. These developments extend and enrich the information provided by InterPro, and provide a more user friendly access to the data. Additionally, we have worked on adding Pfam website features to the InterPro website, as the Pfam website will be retired in late 2022. We also show that InterPro's sequence coverage has kept pace with the growth of UniProtKB. Moreover, we report the development of a card game as a method of engaging the non-scientific community. Finally, we discuss the benefits and challenges brought by the use of artificial intelligence for protein structure prediction.
more »
« less
The Pfam protein families database: embracing AI/ML
Abstract The Pfam protein families database is a comprehensive collection of protein domains and families used for genome annotation and protein structure and function analysis (https://www.ebi.ac.uk/interpro/). This update describes major developments in Pfam since 2020, including decommissioning the Pfam website and integration with InterPro, harmonization with the ECOD structural classification, and expanded curation of metagenomic, microprotein and repeat-containing families. We highlight how AlphaFold structure predictions are being leveraged to refine domain boundaries and identify new domains. New families discovered through large-scale sequence similarity analysis of AlphaFold models are described. We also detail the development of Pfam-N, which uses deep learning to expand family coverage, achieving an 8.8% increase in UniProtKB coverage compared to standard Pfam. We discuss plans for more frequent Pfam releases integrated with InterPro and the potential for artificial intelligence to further assist curation. Despite recent advances, many protein families remain to be classified, and Pfam continues working toward comprehensive coverage of the protein universe.
more »
« less
- Award ID(s):
- 2224128
- PAR ID:
- 10652074
- Publisher / Repository:
- Oxford University Press (OUP)
- Date Published:
- Journal Name:
- Nucleic Acids Research
- Volume:
- 53
- Issue:
- D1
- ISSN:
- 0305-1048
- Page Range / eLocation ID:
- D523 to D534
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Abstract The evolutionary classification of protein domains (ECOD) classifies protein domains using a combination of sequence and structural data (http://prodata.swmed.edu/ecod). Here we present the culmination of our previous efforts at classifying domains from predicted structures, principally from the AlphaFold Database (AFDB), by integrating these domains with our existing classification of PDB structures. This combined classification includes both domains from our previous, purely experimental, classification of domains as well as domains from our provisional classification of 48 proteomes in AFDB predicted from model organisms and organisms of concern to global health. ECOD classifies over 1.8 M domains from over 1000 000 proteins collectively deposited in the PDB and AFDB. Additionally, we have changed the F-group classification reference used for ECOD, deprecating our original ECODf library and instead relying on direct collaboration with the Pfam sequence family database to inform our classification. Pfam provides similar coverage of ECOD with family classification while being more accurate and less redundant. By eliminating duplication of effort, we can improve both classifications. Finally, we discuss the initial deployment of DrugDomain, a database of domain-ligand interactions, on ECOD and discuss future plans.more » « less
-
Abstract Mutations in human proteins lead to diseases. The structure of these proteins can help understand the mechanism of such diseases and develop therapeutics against them. With improved deep learning techniques, such as RoseTTAFold and AlphaFold, we can predict the structure of proteins even in the absence of structural homologs. We modeled and extracted the domains from 553 disease-associated human proteins without known protein structures or close homologs in the Protein Databank. We noticed that the model quality was higher and the Root mean square deviation (RMSD) lower between AlphaFold and RoseTTAFold models for domains that could be assigned to CATH families as compared to those which could only be assigned to Pfam families of unknown structure or could not be assigned to either. We predicted ligand-binding sites, protein–protein interfaces and conserved residues in these predicted structures. We then explored whether the disease-associated missense mutations were in the proximity of these predicted functional sites, whether they destabilized the protein structure based on ddG calculations or whether they were predicted to be pathogenic. We could explain 80% of these disease-associated mutations based on proximity to functional sites, structural destabilization or pathogenicity. When compared to polymorphisms, a larger percentage of disease-associated missense mutations were buried, closer to predicted functional sites, predicted as destabilizing and pathogenic. Usage of models from the two state-of-the-art techniques provide better confidence in our predictions, and we explain 93 additional mutations based on RoseTTAFold models which could not be explained based solely on AlphaFold models.more » « less
-
ABSTRACT Homology‐based protein domain classification is a powerful tool for gaining biological insights into protein function. This classification process has been significantly enhanced by the availability of experimental structures and high‐accuracy structural models generated by advanced tools such as AlphaFold. Our Evolutionary Classification of protein Domains (ECOD) database provides a continuously updated and refined domain classification system. Isolated (“orphan”) protein domain families, which have a limited distribution in the protein universe, present a unique challenge in this classification process. These families lack clear or identifiable evolutionary relationships with other sequence families. While some isolated domain families may have emerged through de novo evolution, others potentially share common evolutionary origins with existing domain families but represent difficult cases for traditional classification methods. In this study, we conducted a manual analysis of a set of isolated families of small domains in ECOD. By exploring sequence, structural, and functional evidence, we uncovered distant members and likely homologous relationships between different isolated domain families that were previously unrecognized. Our analysis provides valuable insights into the evolution of isolated domain families and has led to improved classification within ECOD. This work enhances our understanding of protein evolution and underscores the importance of continuous refinement in domain classification systems as new data and analytical methods become available.more » « less
-
Abstract Domain classification of protein predictions released in the AlphaFold Database (AFDB) has been a recent focus of the Evolutionary Classification of protein Domains (ECOD). Although a primary focus of our recent work has been the partition and assignment of domains from these predictions, we here show how these diverse predictions can be used to examine the reference domain set more closely. Using results from DPAM, our AlphaFold‐specific domain parsing algorithm, we examine hierarchical groupings that share significant levels of homologous links, both between groups that were not previously assessed to be definitively homologous and between groups that were not previously observed to share significant homologous links. Combined with manual analysis, these large datasets of structural and sequence similarities allow us to merge homologous groups in multiple cases which we detail within. These domains tend to be families of domains from families that are either small, previously had few experimental representatives, or had unknown function. The exception to this is the chromodomains, a large homologous group which were increased from “possibly homologous” to “definitely homologous” to increase the consistency of ECOD based their strong homologous links to the SH3 domains.more » « less
An official website of the United States government

