Title: A survey of experimental and computational identification of small proteins
Abstract: Small proteins (SPs) are typically characterized as eukaryotic proteins shorter than 100 amino acids and prokaryotic proteins shorter than 50 amino acids. Historically, they were disregarded because of the arbitrary size thresholds used to define proteins. However, recent research has revealed the existence of many SPs and their crucial roles. Despite this, the identification of SPs and the elucidation of their functions are still in their infancy. To pave the way for future SP studies, we briefly introduce the limitations and advancements in experimental techniques for SP identification. We then provide an overview of available computational tools for SP identification, their constraints, and their evaluation. Additionally, we highlight existing resources for SP research. This survey aims to initiate further exploration into SPs and encourage the development of more sophisticated computational tools for SP identification in prokaryotes and microbiomes.
Award ID(s):
2015838
PAR ID:
10523679
Author(s) / Creator(s):
; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Briefings in Bioinformatics
Volume:
25
Issue:
4
ISSN:
1467-5463
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Small proteins (SPs) are pivotal in various cellular functions such as immunity, defense, and communication. Despite their significance, identifying them is still in its infancy. Existing computational tools are tailored to specific eukaryotic species, leaving only a few options for SP identification in prokaryotes. In addition, these existing tools still have suboptimal performance in SP identification. To fill this gap, we introduce PSPI, a deep learning-based approach designed specifically for predicting prokaryotic SPs. We showed that PSPI had a high accuracy in predicting generalized sets of prokaryotic SPs and sets specific to the human metagenome. Compared with three existing tools, PSPI was faster and showed greater precision, sensitivity, and specificity not only for prokaryotic SPs but also for eukaryotic ones. We also observed that the incorporation of (n,k)-mers greatly enhances the performance of PSPI, suggesting that many SPs may contain short linear motifs. The PSPI tool, which is freely available at https://www.cs.ucf.edu/~xiaoman/tools/PSPI/, will be useful for identifying prokaryotic SPs, and it can be trained to identify other types of SPs as well.
  2. Short (15–30 residue) chains of amino acids at the amino termini of expressed proteins known as signal peptides (SPs) specify secretion in living cells. We trained an attention-based neural network, the Transformer model, on data from all available organisms in Swiss-Prot to generate SP sequences. Experimental testing demonstrates that the model-generated SPs are functional: when appended to enzymes expressed in an industrial Bacillus subtilis strain, the SPs lead to secreted activity that is competitive with industrially used SPs. Additionally, the model-generated SPs are diverse in sequence, sharing as little as 58% sequence identity to the closest known native signal peptide and 73% ± 9% on average. 
  3. There are continuous efforts to elucidate the structure and biological functions of short hydrogen bonds (SHBs), whose donor and acceptor heteroatoms reside more than 0.3 Å closer than the sum of their van der Waals radii. In this work, we evaluate 1070 atomic-resolution protein structures and characterize the common chemical features of SHBs formed between the side chains of amino acids and small molecule ligands. We then develop a machine learning assisted prediction of protein-ligand SHBs (MAPSHB-Ligand) model and reveal that the types of amino acids and ligand functional groups as well as the sequence of neighboring residues are essential factors that determine the class of protein-ligand hydrogen bonds. The MAPSHB-Ligand model and its implementation on our web server enable the effective identification of protein-ligand SHBs in proteins, which will facilitate the design of biomolecules and ligands that exploit these close contacts for enhanced functions.
  4. We study and characterize the topology of connectivity circuits observed in natively folded protein structures whose coordinates are deposited in the Protein Data Bank (PDB). Polypeptide chains of some proteins naturally fold into unique knotted configurations. Another kind of nontrivial topology of polypeptide chains is observed when, in addition to covalent bonds connecting consecutive amino acids in polypeptide chains, one also considers disulfide and ionic bonds between non‐consecutive amino acids. Bonds between non‐consecutive amino acids introduce bifurcation points into connectivity circuits defined by bonds between consecutive and non‐consecutive amino acids in analyzed proteins. Circuits with bifurcation points can form θ‐curves with various topologies. We catalog here the observed topologies of θ‐curves passing through bridges between consecutive and non‐consecutive amino acids in studied proteins.
  5. Proteins are an abundant biopolymer in organic waste feedstocks for biorefining. When degraded, amino acids are released, but their fate in non-methanogenic microbiomes is not well understood. The ability of a microbiome obtained from an anaerobic digester to produce volatile fatty acids (VFAs) from the twenty proteinogenic amino acids was tested using batch experiments. Batch tests were conducted using an initial concentration of each amino acid of 9000 mg COD L⁻¹ along with 9000 mg COD L⁻¹ acetate. Butyrate production was observed from lysine, glutamate, and serine fermentation. Lesser amounts of propionate, iso-butyrate, and iso-valerate were also observed from individual amino acids. Based on 16S rRNA gene amplicon sequencing, Anaerostignum, Intestimonas, Aminipila, and Oscillibacter all likely play a role in the conversion of amino acids to butyrate. The specific roles of other abundant taxa, including Coprothermobacter, Fervidobacterium, Desulfovibrio, and Wolinella, remain unknown, but these genera should be studied for their role in fermentation of amino acids and proteins to VFAs.