This content will become publicly available on September 5, 2025

Title: Protein domain embeddings for fast and accurate similarity search
Recently developed protein language models produce contextual embeddings that have enabled a variety of applications. Per-protein representations (each protein is represented as a vector of fixed dimension) can be derived by averaging the embeddings of individual residues, or by applying matrix transformation techniques such as the discrete cosine transform (DCT) to matrices of residue embeddings. Such protein-level embeddings have been used for fast searches of similar proteins, but they have limitations: for example, PROST is good at detecting global homologs but not local homologs, and knnProtT5 excels for single-domain proteins but not for multidomain proteins. Here, we propose a novel approach that first segments proteins into domains (or subdomains) and then applies the DCT to the vectorized embeddings of the residues in each domain to infer domain-level contextual vectors. Our approach, called DCTdomain, uses contact maps predicted by ESM-2 for domain segmentation, which is formulated as a domain segmentation problem and can be solved using a recursive cut algorithm (RecCut for short) in time quadratic in the protein length; for comparison, an existing approach for domain segmentation uses a cubic-time algorithm. We show that such domain-level contextual vectors (termed DCT fingerprints) enable fast and accurate detection of similarity between proteins that share global similarities but have undefined extended regions between shared domains, and between proteins that share only local similarities. In addition, tests on a database search benchmark show that DCTdomain is able to detect distant homologs by leveraging the structural information in the contextual embeddings.
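The fingerprint step described above can be illustrated with a minimal sketch. It is not the DCTdomain implementation: the function names, the number of retained coefficients (`keep_rows`, `keep_cols`), and the orthonormal DCT-II scaling are all assumptions chosen for illustration, and the domain boundaries are taken as given rather than computed by RecCut. The sketch applies a 2D DCT to the residue-by-dimension embedding matrix of each domain and keeps a fixed-size block of low-frequency coefficients, so domains of different lengths map to vectors of the same size.

```python
import numpy as np

def dct2_matrix(n: int) -> np.ndarray:
    """Orthonormal DCT-II basis as an (n, n) matrix; row k is frequency k."""
    k = np.arange(n)[:, None]
    i = np.arange(n)[None, :]
    basis = np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    basis[0] *= 1.0 / np.sqrt(2.0)  # rescale the constant row for orthonormality
    return basis * np.sqrt(2.0 / n)

def domain_fingerprint(emb, keep_rows: int = 3, keep_cols: int = 8) -> np.ndarray:
    """Fixed-length vector for one domain.

    emb: (n_residues, n_dims) residue embeddings for the domain.
    keep_rows / keep_cols: hypothetical sizes of the low-frequency block kept.
    """
    emb = np.asarray(emb, dtype=float)
    n_res, n_dim = emb.shape
    # 2D DCT: transform along residues, then along embedding dimensions.
    coeffs = dct2_matrix(n_res) @ emb @ dct2_matrix(n_dim).T
    # Zero-pad so very short domains still yield the same output length.
    block = np.zeros((keep_rows, keep_cols))
    r, c = min(keep_rows, n_res), min(keep_cols, n_dim)
    block[:r, :c] = coeffs[:r, :c]
    return block.ravel()

def protein_fingerprints(emb, domains, keep_rows: int = 3, keep_cols: int = 8):
    """One fingerprint per (start, end) half-open residue range in `domains`."""
    emb = np.asarray(emb, dtype=float)
    return [domain_fingerprint(emb[s:e], keep_rows, keep_cols) for s, e in domains]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    embeddings = rng.normal(size=(50, 16))  # stand-in for real residue embeddings
    fps = protein_fingerprints(embeddings, [(0, 20), (20, 50)])
    print([fp.shape for fp in fps])
```

Because every domain yields a vector of the same length, fingerprints from different proteins can be compared directly with any vector distance; which distance or scoring scheme the authors use is not stated in the abstract.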
Award ID(s):
2025451
PAR ID:
10546265
Author(s) / Creator(s):
; ;
Publisher / Repository:
Genome Research
Date Published:
Journal Name:
Genome Research
ISSN:
1088-9051
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. King, Nicole (Ed.)
    The discovery that sponges (Porifera) can fully regenerate from aggregates of dissociated cells launched them as one of the earliest experimental models to study the evolution of cell adhesion and allorecognition in animals. This process depends on an extracellular glycoprotein complex called the Aggregation Factor (AF), which is composed of proteins thought to be unique to sponges. We used quantitative proteomics to identify additional AF components and interacting proteins in the classical model, Clathria prolifera, and compared them to proteins involved in cell interactions in Bilateria. Our results confirm MAFp3/p4 proteins as the primary components of the AF but implicate related proteins with calx-beta and wreath domains as additional components. Using AlphaFold, we unveiled close structural similarities of AF components to protein domains in other animals, previously masked by the mutational decay of sequence similarity. The wreath domain, believed to be unique to the AF, was predicted to contain a central beta-sandwich of the same organization as the vWFD domain (also found in extracellular, gel-forming glycoproteins in other animals). Additionally, many copurified proteins share a conserved C-terminus, containing divergent immunoglobulin (Ig) and Fn3 domains predicted to serve as an AF–interaction interface. One of these proteins, MAF-associated protein 1, resembles Ig superfamily cell adhesion molecules and we hypothesize that it may function to link the AF to the surface of cells. Our results highlight the existence of an ancient toolkit of conserved protein domains regulating cell–cell and cell–extracellular matrix protein interactions in all animals, and likely reflect a common origin of cell adhesion and allorecognition.
  2. O'Toole, George (Ed.)
    ABSTRACT Members of the widely conserved progestin and adipoQ receptor (PAQR) family function to maintain membrane homeostasis: membrane fluidity and fatty acid composition in eukaryotes and membrane energetics and fatty acid composition in bacteria. All PAQRs consist of a core seven transmembrane domain structure and five conserved amino acids (three histidines, one serine, and one aspartic acid) predicted to form a hydrolase-like catalytic site. PAQR homologs in Bacteria (called TrhA, for transmembrane homeostasis protein A) maintain homeostasis of membrane charge gradients, like the membrane potential and proton gradient that comprise the proton motive force, but their molecular mechanisms are not yet understood. Here, we show that TrhA in Escherichia coli has a periplasmic C-terminus, which places the five conserved residues shared by all PAQRs at the cytoplasmic interface of the membrane. Here, we characterize several conserved residues predicted to form an active site by site-directed mutagenesis. We also identify a specific role for TrhA in modulating unsaturated fatty acid biosynthesis with conserved residues required to either promote or reduce the abundance of unsaturated fatty acids. We also identify distinct roles for the conserved residues in supporting TrhA’s role in maintaining membrane energetics homeostasis that suggest that both functions are intertwined and probably partly dependent on one another. An analysis of domain architecture of TrhA-like domains in Bacteria further supports a function of TrhA linking membrane energetics homeostasis with biosynthesis of unsaturated fatty acid in the membrane. IMPORTANCE Progestin and adipoQ receptor (PAQR) family proteins are evolutionary conserved regulators of membrane homeostasis and have been best characterized in eukaryotes. Bacterial PAQR homologs, named TrhA (transmembrane homeostasis protein A), regulate membrane energetics homeostasis through an unknown mechanism. Here, we present evidence linking TrhA to both membrane energetics homeostasis and unsaturated fatty acid biosynthesis. Analysis of domain architecture together with experimental evidence suggests a model where TrhA activity on unsaturated fatty acid biosynthesis is regulated by changes in membrane energetics to dynamically adjust membrane homeostasis.
  3. Champion, Patricia A (Ed.)
    ABSTRACT Cellular life relies on enzymes that require metals, which must be acquired from extracellular sources. Bacteria utilize surface and secreted proteins to acquire such valuable nutrients from their environment. These include the cargo proteins of the type eleven secretion system (T11SS), which have been connected to host specificity, metal homeostasis, and nutritional immunity evasion. This Sec-dependent, Gram-negative secretion system is encoded by organisms throughout the phylum Proteobacteria, including human pathogens Neisseria meningitidis, Proteus mirabilis, Acinetobacter baumannii, and Haemophilus influenzae. Experimentally verified T11SS-dependent cargo include transferrin-binding protein B (TbpB), the hemophilin homologs heme receptor protein C (HrpC), hemophilin A (HphA), the immune evasion protein factor-H binding protein (fHbp), and the host symbiosis factor nematode intestinal localization protein C (NilC). Here, we examined the specificity of T11SS systems for their cognate cargo proteins using taxonomically distributed homolog pairs of T11SS and hemophilin cargo and explored the ligand binding ability of those hemophilin cargo homologs. In vivo expression in Escherichia coli of hemophilin homologs revealed that each is secreted in a specific manner by its cognate T11SS protein. Sequence analysis and structural modeling suggest that all hemophilin homologs share an N-terminal ligand-binding domain with the same topology as the ligand-binding domains of the Haemophilus haemolyticus heme binding protein (Hpl) and HphA. We term this signature feature of this group of proteins the hemophilin ligand-binding domain. Network analysis of hemophilin homologs revealed five subclusters and representatives from four of these showed variable heme-binding activities, which, combined with sequence-structure variation, suggests that hemophilins are diversifying in function. IMPORTANCE The secreted protein hemophilin and its homologs contribute to the survival of several bacterial symbionts within their respective host environments. Here, we compared taxonomically diverse hemophilin homologs and their paired Type 11 secretion systems (T11SS) to determine if heme binding and T11SS secretion are conserved characteristics of this family. We establish the existence of divergent hemophilin sub-families and describe structural features that contribute to distinct ligand-binding behaviors. Furthermore, we demonstrate that T11SS are specific for their cognate hemophilin family cargo proteins. Our work establishes that hemophilin homolog-T11SS pairs are diverging from each other, potentially evolving into novel ligand acquisition systems that provide competitive benefits in host niches.
  4. Abstract Domains are functional and structural units of proteins that govern various biological functions performed by the proteins. Therefore, the characterization of domains in a protein can serve as a proper functional representation of proteins. Here, we employ a self-supervised protocol to derive functionally consistent representations for domains by learning domain-Gene Ontology (GO) co-occurrences and associations. The domain embeddings we constructed turned out to be effective in performing actual function prediction tasks. Extensive evaluations showed that protein representations using the domain embeddings are superior to those of large-scale protein language models in GO prediction tasks. Moreover, the new function prediction method built on the domain embeddings, named Domain-PFP, substantially outperformed the state-of-the-art function predictors. Additionally, Domain-PFP demonstrated competitive performance in the CAFA3 evaluation, achieving overall the best performance among the top teams that participated in the assessment. 
  5. Dunbrack, Roland L (Ed.)
    Protein structure prediction has now been deployed widely across several different large protein sets. Large-scale domain annotation of these predictions can aid in the development of biological insights. Using our Evolutionary Classification of Protein Domains (ECOD) from experimental structures as a basis for classification, we describe the detection and cataloging of domains from 48 whole proteomes deposited in the AlphaFold Database. On average, we can provide positive classification (either of domains or other identifiable non-domain regions) for 90% of residues in all proteomes. We classified 746,349 domains from 536,808 proteins comprised of over 226,424,000 amino acid residues. We examine the varying populations of homologous groups in both eukaryotes and bacteria. In addition to containing a higher fraction of disordered regions and unassigned domains, eukaryotes show a higher proportion of repeated proteins, both globular and small repeats. We enumerate those highly populated domains that are shared in both eukaryotes and bacteria, such as the Rossmann domains, TIM barrels, and P-loop domains. Additionally, we compare the sampling of homologous groups from this whole proteome set against our stable ECOD reference and discuss groups that have been enriched by structure predictions. Finally, we discuss the implication of these results for protein target selection for future classification strategies for very large protein sets. 