Title: ECOD domain classification of 48 whole proteomes from AlphaFold Structure Database using DPAM2
Protein structure prediction has now been deployed widely across several different large protein sets. Large-scale domain annotation of these predictions can aid in the development of biological insights. Using our Evolutionary Classification of Protein Domains (ECOD) from experimental structures as a basis for classification, we describe the detection and cataloging of domains from 48 whole proteomes deposited in the AlphaFold Database. On average, we can provide positive classification (either of domains or other identifiable non-domain regions) for 90% of residues in all proteomes. We classified 746,349 domains from 536,808 proteins comprising over 226,424,000 amino acid residues. We examine the varying populations of homologous groups in both eukaryotes and bacteria. In addition to containing a higher fraction of disordered regions and unassigned domains, eukaryotes show a higher proportion of repeated proteins, both globular and small repeats. We enumerate those highly populated domains that are shared by both eukaryotes and bacteria, such as the Rossmann domains, TIM barrels, and P-loop domains. Additionally, we compare the sampling of homologous groups from this whole proteome set against our stable ECOD reference and discuss groups that have been enriched by structure predictions. Finally, we discuss the implications of these results for protein target selection and for future classification strategies for very large protein sets.
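As a quick sanity check on the totals quoted in the abstract, the following sketch derives two pooled averages from them. Only the three counts come from the abstract; the variable names are illustrative, and the residue count is a lower bound because the abstract says "over" 226,424,000. These are pooled figures across all 48 proteomes, not per-proteome averages.

```python
# Totals quoted in the abstract.
n_domains = 746_349
n_proteins = 536_808
n_residues_min = 226_424_000  # stated as "over" this many, so a lower bound

# Pooled averages across all 48 proteomes (not per-proteome means).
domains_per_protein = n_domains / n_proteins
residues_per_protein_min = n_residues_min / n_proteins

print(f"classified domains per protein: {domains_per_protein:.2f}")
print(f"residues per protein (lower bound): {residues_per_protein_min:.1f}")
```

The result is roughly 1.4 classified domains per protein, with an average protein length a little over 420 residues.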
Award ID(s):
2224128
PAR ID:
10539187
Editor(s):
Dunbrack, Roland L
Publisher / Repository:
PLOS
Journal Name:
PLOS Computational Biology
Volume:
20
Issue:
2
ISSN:
1553-7358
Page Range / eLocation ID:
e1011586
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. O'Toole, George (Ed.)
    ABSTRACT Members of the widely conserved progestin and adipoQ receptor (PAQR) family function to maintain membrane homeostasis: membrane fluidity and fatty acid composition in eukaryotes and membrane energetics and fatty acid composition in bacteria. All PAQRs consist of a core seven transmembrane domain structure and five conserved amino acids (three histidines, one serine, and one aspartic acid) predicted to form a hydrolase-like catalytic site. PAQR homologs in Bacteria (called TrhA, for transmembrane homeostasis protein A) maintain homeostasis of membrane charge gradients, like the membrane potential and proton gradient that comprise the proton motive force, but their molecular mechanisms are not yet understood. Here, we show that TrhA in Escherichia coli has a periplasmic C-terminus, which places the five conserved residues shared by all PAQRs at the cytoplasmic interface of the membrane. Here, we characterize several conserved residues predicted to form an active site by site-directed mutagenesis. We also identify a specific role for TrhA in modulating unsaturated fatty acid biosynthesis with conserved residues required to either promote or reduce the abundance of unsaturated fatty acids. We also identify distinct roles for the conserved residues in supporting TrhA’s role in maintaining membrane energetics homeostasis that suggest that both functions are intertwined and probably partly dependent on one another. An analysis of domain architecture of TrhA-like domains in Bacteria further supports a function of TrhA linking membrane energetics homeostasis with biosynthesis of unsaturated fatty acid in the membrane. IMPORTANCE Progestin and adipoQ receptor (PAQR) family proteins are evolutionarily conserved regulators of membrane homeostasis and have been best characterized in eukaryotes. Bacterial PAQR homologs, named TrhA (transmembrane homeostasis protein A), regulate membrane energetics homeostasis through an unknown mechanism. Here, we present evidence linking TrhA to both membrane energetics homeostasis and unsaturated fatty acid biosynthesis. Analysis of domain architecture together with experimental evidence suggests a model where TrhA activity on unsaturated fatty acid biosynthesis is regulated by changes in membrane energetics to dynamically adjust membrane homeostasis.
  2. Laub, Michael T (Ed.)
    Animals use a variety of cell-autonomous innate immune proteins to detect viral infections and prevent replication. Recent studies have discovered that a subset of mammalian antiviral proteins have homology to antiphage defense proteins in bacteria, implying that there are aspects of innate immunity that are shared across the Tree of Life. While the majority of these studies have focused on characterizing the diversity and biochemical functions of the bacterial proteins, the evolutionary relationships between animal and bacterial proteins are less clear. This ambiguity is partly due to the long evolutionary distances separating animal and bacterial proteins, which obscures their relationships. Here, we tackle this problem for 3 innate immune families (CD-NTases [including cGAS], STINGs, and viperins) by deeply sampling protein diversity across eukaryotes. We find that viperins and OAS family CD-NTases are ancient immune proteins, likely inherited since the earliest eukaryotes first arose. In contrast, we find other immune proteins that were acquired via at least 4 independent events of horizontal gene transfer (HGT) from bacteria. Two of these events allowed algae to acquire new bacterial viperins, while 2 more HGT events gave rise to distinct superfamilies of eukaryotic CD-NTases: the cGLR superfamily (containing cGAS) that has since diversified via a series of animal-specific duplications and a previously undefined eSMODS superfamily, which more closely resembles bacterial CD-NTases. Finally, we found that cGAS and STING proteins have substantially different histories, with STING protein domains undergoing convergent domain shuffling in bacteria and eukaryotes. Overall, our findings paint a picture of eukaryotic innate immunity as highly dynamic, where eukaryotes build upon their ancient antiviral repertoires through the reuse of protein domains and by repeatedly sampling a rich reservoir of bacterial antiphage genes. 
  3. Friedberg, Iddo (Ed.)
    The Immunoglobulin fold (Ig-fold) is found in proteins from all domains of life and represents the most populous fold in the human genome, with current estimates ranging from 2 to 3% of protein coding regions. That proportion is much higher in the surfaceome where Ig and Ig-like domains orchestrate cell-cell recognition, adhesion and signaling. The ability of Ig-domains to reliably fold and self-assemble through highly specific interfaces represents a remarkable property of these domains, making them key elements of molecular interaction systems: the immune system, the nervous system, the vascular system and the muscular system. We define a universal residue numbering scheme, common to all domains sharing the Ig-fold in order to study the wide spectrum of Ig-domain variants constituting the Ig-proteome and Ig-Ig interactomes at the heart of these systems. The “IgStrand numbering scheme” enables the identification of Ig structural proteomes and interactomes in and between any species, and comparative structural, functional, and evolutionary analyses. We review how Ig-domains are classified today as topological and structural variants and highlight the “Ig-fold irreducible structural signature” shared by all of them. The IgStrand numbering scheme lays the foundation for the systematic annotation of structural proteomes by detecting and accurately labeling Ig-, Ig-like and Ig-extended domains in proteins, which are poorly annotated in current databases and opens the door to accurate machine learning. Importantly, it sheds light on the robust Ig protein folding algorithm used by nature to form beta sandwich supersecondary structures. The numbering scheme powers an algorithm implemented in the interactive structural analysis software iCn3D to systematically recognize Ig-domains, annotate them and perform detailed analyses comparing any domain sharing the Ig-fold in sequence, topology and structure, regardless of their diverse topologies or origin. The scheme provides a robust fold detection and labeling mechanism that reveals unsuspected structural homologies among protein structures beyond currently identified Ig- and Ig-like domain variants. Indeed, multiple folds classified independently contain a common structural signature, in particular jelly-rolls. Examples of folds that harbor an “Ig-extended” architecture are given. Applications in protein engineering around the Ig-architecture are straightforward based on the universal numbering.
  4. Recently developed protein language models have enabled a variety of applications with the protein contextual embeddings they produce. Per-protein representations (each protein is represented as a vector of fixed dimension) can be derived via averaging the embeddings of individual residues, or applying matrix transformation techniques such as the discrete cosine transformation (DCT) to matrices of residue embeddings. Such protein-level embeddings have been applied to enable fast searches of similar proteins; however, limitations have been found; for example, PROST is good at detecting global homologs but not local homologs, and knnProtT5 excels for proteins with single domains but not multidomain proteins. Here, we propose a novel approach that first segments proteins into domains (or subdomains) and then applies the DCT to the vectorized embeddings of residues in each domain to infer domain-level contextual vectors. Our approach, called DCTdomain, uses predicted contact maps from ESM-2 for domain segmentation, which is formulated as a domain segmentation problem and can be solved using a recursive cut algorithm (RecCut in short) in time quadratic in the protein length; for comparison, an existing approach for domain segmentation uses a cubic-time algorithm. We show such domain-level contextual vectors (termed DCT fingerprints) enable fast and accurate detection of similarity between proteins that share global similarities but with undefined extended regions between shared domains, and those that only share local similarities. In addition, tests on a database search benchmark show that the DCTdomain is able to detect distant homologs by leveraging the structural information in the contextual embeddings.
  5. Disordered linkers (DLs) are intrinsically disordered regions that facilitate movement between adjacent functional regions/domains, contributing to many key cellular functions. The recently completed second Critical Assessments of protein Intrinsic Disorder prediction (CAID2) experiment evaluated DL predictions by considering a rather narrow scenario when predicting 40 proteins that are already known to have DLs. We expand this evaluation by using a much larger set of nearly 350 test proteins from CAID2 and by investigating three distinct scenarios: (1) prediction of residues in DLs vs. in non-DL regions (typical use of DL predictors); (2) prediction of residues in DLs vs. other disordered residues (to evaluate whether predictors can differentiate residues in DLs from other types of intrinsically disordered residues); and (3) prediction of proteins harboring DLs. We find that several methods provide relatively accurate predictions of DLs in the first scenario. However, only one method, APOD, accurately identifies DLs among other types of disordered residues (scenario 2) and predicts proteins harboring DLs (scenario 3). We also find that APOD’s predictive performance is modest, motivating further research into the development of new and more accurate DL predictors. We note that these efforts will benefit from a growing amount of training data and the availability of sophisticated deep network models and emphasize that future methods should provide accurate results across the three scenarios.
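The three evaluation scenarios for disordered linker (DL) prediction described in item 5 differ only in which residues or proteins count as positives and negatives. The sketch below shows one way the ground-truth labels could be assembled. The per-residue encoding (`L` for a DL residue, `D` for another disordered residue, `-` for an ordered residue), the function name, and the toy proteins are all hypothetical and are not taken from CAID2 or from the study.

```python
from typing import Dict, List, Tuple

Label = Tuple[str, int, int]  # (protein id, residue index, 1 if DL else 0)

def scenario_labels(annotations: Dict[str, str]):
    """Build ground-truth labels for the three scenarios.

    Hypothetical encoding: 'L' = disordered linker residue,
    'D' = other disordered residue, '-' = ordered residue.
    """
    s1: List[Label] = []      # scenario 1: DL vs. every non-DL residue
    s2: List[Label] = []      # scenario 2: DL vs. other disordered residues only
    s3: Dict[str, int] = {}   # scenario 3: does the protein harbor any DL?
    for pid, ann in annotations.items():
        for i, state in enumerate(ann):
            is_dl = int(state == "L")
            s1.append((pid, i, is_dl))
            if state in ("L", "D"):  # ordered residues are excluded here
                s2.append((pid, i, is_dl))
        s3[pid] = int("L" in ann)
    return s1, s2, s3

# Toy annotations, invented for illustration only.
toy = {"P1": "--LLDD-", "P2": "DD---"}
s1, s2, s3 = scenario_labels(toy)
print(len(s1), len(s2), s3)
```

Under these assumptions, a predictor's per-residue scores would be compared against `s1` or `s2`, and a per-protein score against `s3`. Scenario 2 is harder than scenario 1 because the ordered residues, which are presumably the easiest negatives, are excluded.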