O-linked β-N-acetylglucosamine (O-GlcNAc) is a distinct monosaccharide modification of serine (S) or threonine (T) residues of nucleocytoplasmic and mitochondrial proteins. O-GlcNAc modification (i.e., O-GlcNAcylation) is involved in the regulation of diverse cellular processes, including transcription, epigenetic modifications, and cell signaling. Despite great progress in experimentally mapping O-GlcNAc sites, there remains an unmet need for robust prediction tools that can reliably locate O-GlcNAc sites in protein sequences of interest. In this work, we performed a comprehensive evaluation of a framework for predicting protein O-GlcNAc sites using embeddings from pre-trained protein language models. In particular, we compared the performance of three sequence-based large protein language models (pLMs), Ankh, ESM-2, and ProtT5, for predicting O-GlcNAc sites, and evaluated various ensemble strategies for integrating embeddings from these models. The decision-level fusion approach, which integrates the decisions of the three embedding-based models and which we call LM-OGlcNAc-Site, outperformed the models trained on the individual language models, the other fusion approaches, and existing predictors on almost all of the evaluation metrics. Precise prediction of O-GlcNAc sites will facilitate the probing of O-GlcNAc site-specific functions of proteins in physiology and disease. These findings also indicate the effectiveness of combining multiple protein language models for post-translational modification prediction and open avenues for further research on other downstream protein tasks. LM-OGlcNAc-Site’s web server and source code are publicly available to the community.
MS-based proteomics for comprehensive investigation of protein O-GlcNAcylation
Protein O-GlcNAcylation refers to the covalent attachment of a single N-acetylglucosamine (GlcNAc) to a serine or threonine residue. This modification occurs primarily on proteins in the nucleus and the cytosol, and plays critical roles in many cellular events, including regulation of gene expression and signal transduction. Aberrant protein O-GlcNAcylation is directly related to human diseases such as cancers, diabetes, and neurodegenerative diseases. In the past decades, considerable progress has been made in the global and site-specific analysis of O-GlcNAcylation in complex biological samples using mass spectrometry (MS)-based proteomics. In this review, we summarize previous efforts toward the comprehensive investigation of protein O-GlcNAcylation by MS. Specifically, the review focuses on methods for enriching and site-specifically mapping O-GlcNAcylated peptides, and on applications for quantifying protein O-GlcNAcylation in different biological systems. As O-GlcNAcylation is an important protein modification for cell survival, effective methods are essential for advancing our understanding of glycoprotein functions and cellular events.
- Award ID(s): 2003597
- PAR ID: 10222517
- Date Published:
- Journal Name: Molecular Omics
- Volume: 17
- Issue: 2
- ISSN: 2515-4184
- Page Range / eLocation ID: 186 to 196
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Among RNAs, transfer RNAs (tRNAs) contain the widest variety of abundant post-transcriptional chemical modifications. These modifications are crucial for tRNAs to participate in protein synthesis, promoting proper tRNA structure and aminoacylation, facilitating anticodon:codon recognition, and ensuring that the ribosome maintains the reading frame. While tRNA modifications were long thought to be stoichiometric, it is becoming increasingly apparent that these modifications can change dynamically in response to the cellular environment. The ability to broadly characterize the fluctuating tRNA modification landscape will be essential for establishing the molecular-level contributions of individual sites of tRNA modification. The locations of modifications within individual tRNA sequences can be mapped using liquid chromatography coupled to tandem mass spectrometry (LC-MS/MS). In this approach, a single tRNA species is purified and treated with ribonucleases, and the resulting single-stranded RNA products are subjected to LC-MS/MS analysis. The application of LC-MS/MS to study tRNAs is limited by the necessity of analyzing one tRNA at a time, because the digestion of total tRNA mixtures by commercially available ribonucleases produces many short digestion products that cannot be uniquely mapped back to a single site within a tRNA. We overcame these limitations by taking advantage of the highly structured nature of tRNAs to prevent full digestion by ribonucleases specific to single-stranded RNA. Folding total tRNA prior to digestion allowed us to sequence S. cerevisiae tRNAs with up to 97% sequence coverage for individual tRNA species by LC-MS/MS. This method presents a robust avenue for directly detecting the distribution of modifications in total tRNAs.
- Proteoforms, the different forms of a protein with sequence variations including post-translational modifications (PTMs), execute vital functions in biological systems, such as cell signaling and epigenetic regulation. Advances in top-down mass spectrometry (MS) technology have permitted the direct characterization of intact proteoforms and their exact number of modification sites, allowing for the relative quantification of positional isomers (PI). Protein positional isomers refer to a set of proteoforms with identical total mass and set of modifications, but varying PTM site combinations. The relative abundance of PI can be estimated by matching proteoform-specific fragment ions to top-down tandem MS (MS2) data to localize and quantify modifications. However, the current approaches heavily rely on manual annotation. Here, we present IsoForma, an open-source R package for the relative quantification of PI within a single tool. Benchmarking IsoForma's performance against two existing workflows produced comparable results and improvements in speed. Overall, IsoForma provides a streamlined process for quantifying PI, reduces the analysis time, and offers an essential framework for developing customized proteoform analysis workflows. The software is open source and available at https://github.com/EMSL-Computing/isoforma-lib.
- Recent advances in mass spectrometry (MS)-based proteomics have enabled tremendous progress in the understanding of cellular mechanisms, disease progression, and the relationship between genotype and phenotype. Though many popular bioinformatics methods in proteomics are derived from other omics studies, novel analysis strategies are required to deal with the unique characteristics of proteomics data. In this review, we discuss the current developments in the bioinformatics methods used in proteomics and how they facilitate the mechanistic understanding of biological processes. We first introduce bioinformatics software and tools designed for mass spectrometry-based protein identification and quantification, and then we review the different statistical and machine learning methods that have been developed to perform comprehensive analysis in proteomics studies. We conclude with a discussion of how quantitative protein data can be used to reconstruct protein interactions and signaling networks.
- Knowledge of protein structure is crucial to our understanding of biological function and is routinely used in drug discovery. High-resolution techniques to determine the three-dimensional atomic coordinates of proteins are available. However, such methods are frequently limited by experimental challenges such as sample quantity, target size, and efficiency. Structural mass spectrometry (MS) is a technique in which structural features of proteins are elucidated quickly and relatively easily. Computational techniques that convert sparse MS data into protein models that demonstrate agreement with the data are needed. This review features cutting-edge computational methods that predict protein structure from MS data such as chemical cross-linking, hydrogen–deuterium exchange, hydroxyl radical protein footprinting, limited proteolysis, ion mobility, and surface-induced dissociation. Additionally, we address future directions for protein structure prediction with sparse MS data.