Recent advances in protein structure prediction have generated accurate structures of previously uncharacterized human proteins. Identifying domains in these predicted structures and classifying them into an evolutionary hierarchy can reveal biological insights. Here, we describe the detection and classification of domains from the human proteome. Our classification indicates that only 62% of residues are located in globular domains. We further classify these globular domains and observe that the majority (65%) can be classified among known folds by sequence, with a smaller fraction (33%) requiring structural data to refine the domain boundaries and/or to support their homology. A relatively small number (966 domains) cannot be confidently assigned using our automatic pipelines, thus demanding manual inspection. We classify 47,576 domains, of which only 23% have been included in experimental structures. A portion (6.3%) of these classified globular domains lack sequence-based annotation in InterPro. A quarter (23%) have not been structurally modeled by homology, and they contain 2,540 known disease-causing single amino acid variations whose pathogenesis can now be inferred using AF models. A comparison of classified domains from a series of model organisms revealed expansions of several immune response-related domains in humans and a depletion of olfactory receptors. Finally, we use this classification to expand well-known protein families of biological significance. These classifications are presented on the ECOD website ( http://prodata.swmed.edu/ecod/index_human.php ).
This content will become publicly available on December 1, 2025
Structure classification of the proteins from Salmonella enterica pangenome revealed novel potential pathogenicity islands
Abstract Salmonella enterica is a pathogenic bacterium known for causing severe typhoid fever in humans, making it important to study due to its potential health risks and significant impact on public health. This study provides an evolutionary classification of proteins from the Salmonella enterica pangenome. We classified 17,238 domains from 13,147 proteins from 79,758 Salmonella enterica strains and studied in detail the domains of 272 proteins from 14 characterized Salmonella pathogenicity islands (SPIs). Among SPI-related proteins, 90 function in the secretion machinery. 41% of the domains of SPI proteins have no previous sequence annotation. By comparing clinical and environmental isolates, we identified 3682 proteins that are overrepresented in the clinical group, which we consider potentially pathogenic. Among the domains of potentially pathogenic proteins, only 50% were previously annotated by sequence methods. Moreover, 36% (1330 out of 3682) of potentially pathogenic proteins cannot be classified into the Evolutionary Classification of Protein Domains database (ECOD). Among classified domains of potentially pathogenic proteins, the most populated homology groups include helix-turn-helix (HTH), Immunoglobulin-related, and P-loop domains-related. Functional analysis revealed overrepresentation of these proteins in biological processes related to viral entry into host cell, antibiotic biosynthesis, DNA metabolism and conformation change, and underrepresentation in translational processes. Analysis of the potentially pathogenic proteins indicates that they form 119 clusters or novel potential pathogenicity islands (NPPIs) within the Salmonella genome, suggesting their potential contribution to the bacterium’s virulence. One of the NPPIs revealed significant overrepresentation of potentially pathogenic proteins. Overall, our analysis revealed that identified potentially pathogenic proteins are poorly studied.
- Award ID(s):
- 2224128
- PAR ID:
- 10539189
- Publisher / Repository:
- Nature
- Date Published:
- Journal Name:
- Scientific Reports
- Volume:
- 14
- Issue:
- 1
- ISSN:
- 2045-2322
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Navarre, William (Ed.) The bacterial flagellum is a rotary motor organelle and important virulence factor that propels motile pathogenic bacteria, such as Salmonella enterica, through their surroundings. Bacteriophages, or phages, are viruses that solely infect bacteria. As such, phages have myriad applications in the healthcare field, including phage therapy against antibiotic-resistant bacterial pathogens. Bacteriophage χ (Chi) is a flagellum-dependent (flagellotropic) bacteriophage, which begins its infection cycle by attaching its long tail fiber to the S. enterica flagellar filament as its primary receptor. The interactions between phage and flagellum are poorly understood, as are the reasons that χ only kills certain Salmonella serotypes while others entirely evade phage infection. In this study, we used molecular cloning, targeted mutagenesis, heterologous flagellin expression, and phage-host interaction assays to determine which domains within the flagellar filament protein flagellin mediate this complex interaction. We identified the antigenic N- and C-terminal D2 domains as essential for phage χ binding, with the hypervariable central D3 domain playing a less crucial role. Here, we report that the primary structure of the Salmonella flagellin D2 domains is the major determinant of χ adhesion. The phage susceptibility of a strain is directly tied to these domains. We additionally uncovered important information about flagellar function. The central and most variable domain, D3, is not required for motility in S. Typhimurium 14028s, as it can be deleted or its sequence composition can be significantly altered with minimal impacts on motility. Further knowledge about the complex interactions between flagellotropic phage χ and its primary bacterial receptor may allow genetic engineering of its host range for use as targeted antimicrobial therapy against motile pathogens of the χ-host genera Salmonella, Escherichia, or Serratia.
-
Abstract H-NS is a nucleoid structuring protein and global repressor of virulence and horizontally-acquired genes in bacteria. H-NS can interact with itself or with homologous proteins, but protein family diversity and regulatory network overlap remain poorly defined. Here, we present a comprehensive phylogenetic analysis that revealed deep-branching clades, dispelling the presumption that H-NS is the progenitor of varied molecular backups. Each clade is composed exclusively of either chromosome-encoded or plasmid-encoded proteins. On chromosomes, stpA and newly discovered hlpP are core genes in specific genera, whereas hfp and newly discovered hlpC are sporadically distributed. Six clades of H-NS plasmid proteins (Hpp) exhibit ancient and dedicated associations with plasmids, including three clades with fidelity for plasmid incompatibility groups H, F or X. A proliferation of H-NS homologs in Erwiniaceae includes the first observation of potentially co-dependent H-NS forms. Conversely, the observed diversification of oligomerization domains may facilitate stable co-existence of divergent homologs in a genome. Transcriptomic and proteomic analysis in Salmonella revealed regulatory crosstalk and hierarchical control of H-NS homologs. We also discovered that H-NS is both a repressor and activator of Salmonella Pathogenicity Island 1 gene expression, and both regulatory modes are restored by Sfh (HppH) in the absence of H-NS.
-
Burbank, Lindsey Price (Ed.) ABSTRACT Liberibacter pathogens are the causative agents of several severe crop diseases worldwide, including citrus Huanglongbing and potato zebra chip. These bacteria are endophytic and nonculturable, which makes experimental approaches challenging and highlights the need for bioinformatic analysis in advancing our understanding about Liberibacter pathogenesis. Here, we performed an in-depth comparative phylogenomic analysis of the Liberibacter pathogens and their free-living, nonpathogenic, ancestral species, aiming to identify major genomic changes and determinants associated with their evolutionary transitions in living habitats and pathogenicity. Using gene neighborhood analysis and phylogenetic classification, we systematically uncovered, annotated, and classified all prophage loci into four types, including one previously unrecognized group. We showed that these prophages originated through independent gene transfers at different evolutionary stages of Liberibacter and only the SC-type prophage was associated with the emergence of the pathogens. Using ortholog clustering, we vigorously identified two additional sets of genomic genes, which were either lost or gained in the ancestor of the pathogens. Consistent with the habitat change, the lost genes were enriched for biosynthesis of cellular building blocks. Importantly, among the gained genes, we uncovered several previously unrecognized toxins, including new toxins homologous to the EspG/VirA effectors, a YdjM phospholipase toxin, and a secreted endonuclease/exonuclease/phosphatase (EEP) protein. Our results substantially extend the knowledge of the evolutionary events and potential determinants leading to the emergence of endophytic, pathogenic Liberibacter species, which will facilitate the design of functional experiments and the development of new methods for detection and blockage of these pathogens.
IMPORTANCE Liberibacter pathogens are associated with several severe crop diseases, including citrus Huanglongbing, the most destructive disease to the citrus industry. Currently, no effective cure or treatments are available, and no resistant citrus variety has been found. The fact that these obligate endophytic pathogens are not culturable has made it extremely challenging to experimentally uncover the genes/proteins important to Liberibacter pathogenesis. Further, earlier bioinformatics studies failed to identify key genomic determinants, such as toxins and effector proteins, that underlie the pathogenicity of the bacteria. In this study, an in-depth comparative genomic analysis of Liberibacter pathogens along with their ancestral nonpathogenic species identified the prophage loci and several novel toxins that are evolutionarily associated with the emergence of the pathogens. These results shed new light on the disease mechanism of Liberibacter pathogens and will facilitate the development of new detection and blockage methods targeting the toxins.
-
Friedberg, Iddo (Ed.) The Immunoglobulin fold (Ig-fold) is found in proteins from all domains of life and represents the most populous fold in the human genome, with current estimates ranging from 2 to 3% of protein coding regions. That proportion is much higher in the surfaceome where Ig and Ig-like domains orchestrate cell-cell recognition, adhesion and signaling. The ability of Ig-domains to reliably fold and self-assemble through highly specific interfaces represents a remarkable property of these domains, making them key elements of molecular interaction systems: the immune system, the nervous system, the vascular system and the muscular system. We define a universal residue numbering scheme, common to all domains sharing the Ig-fold in order to study the wide spectrum of Ig-domain variants constituting the Ig-proteome and Ig-Ig interactomes at the heart of these systems. The “IgStrand numbering scheme” enables the identification of Ig structural proteomes and interactomes in and between any species, and comparative structural, functional, and evolutionary analyses. We review how Ig-domains are classified today as topological and structural variants and highlight the “Ig-fold irreducible structural signature” shared by all of them. The IgStrand numbering scheme lays the foundation for the systematic annotation of structural proteomes by detecting and accurately labeling Ig-, Ig-like and Ig-extended domains in proteins, which are poorly annotated in current databases and opens the door to accurate machine learning. Importantly, it sheds light on the robust Ig protein folding algorithm used by nature to form beta sandwich supersecondary structures. The numbering scheme powers an algorithm implemented in the interactive structural analysis software iCn3D to systematically recognize Ig-domains, annotate them and perform detailed analyses comparing any domain sharing the Ig-fold in sequence, topology and structure, regardless of their diverse topologies or origin.
The scheme provides a robust fold detection and labeling mechanism that reveals unsuspected structural homologies among protein structures beyond currently identified Ig- and Ig-like domain variants. Indeed, multiple folds classified independently contain a common structural signature, in particular jelly-rolls. Examples of folds that harbor an “Ig-extended” architecture are given. Applications in protein engineering around the Ig-architecture are straightforward based on the universal numbering.
