Abstract De novo peptide sequencing, which does not rely on a comprehensive target sequence database, provides us with a way to identify novel peptides from tandem mass spectra. However, current de novo sequencing algorithms suffer from low accuracy and coverage, which hinders their application in proteomics. In this paper, we presentPepNet, a fully convolutional neural network for high accuracy de novo peptide sequencing. PepNet takes an MS/MS spectrum (represented as a high-dimensional vector) as input, and outputs the optimal peptide sequence along with its confidence score. The PepNet model is trained using a total of 3 million high-energy collisional dissociation MS/MS spectra from multiple human peptide spectral libraries. Evaluation results show that PepNet significantly outperforms current best-performing de novo sequencing algorithms (e.g. PointNovo and DeepNovo) in both peptide-level accuracy and positional-level accuracy. PepNet can sequence a large fraction of spectra that were not identified by database search engines, and thus could be used as a complementary tool to database search engines for peptide identification in proteomics. In addition, PepNet runs around 3x and 7x faster than PointNovo and DeepNovo on GPUs, respectively, thus being more suitable for the analysis of large-scale proteomics data.
more »
« less
Sequence-to-sequence translation from mass spectra to peptides with a transformer model
Abstract A fundamental challenge in mass spectrometry-based proteomics is the identification of the peptide that generated each acquired tandem mass spectrum. Approaches that leverage known peptide sequence databases cannot detect unexpected peptides and can be impractical or impossible to apply in some settings. Thus, the ability to assign peptide sequences to tandem mass spectra without prior information—de novo peptide sequencing—is valuable for tasks including antibody sequencing, immunopeptidomics, and metaproteomics. Although many methods have been developed to address this problem, it remains an outstanding challenge in part due to the difficulty of modeling the irregular data structure of tandem mass spectra. Here, we describe Casanovo, a machine learning model that uses a transformer neural network architecture to translate the sequence of peaks in a tandem mass spectrum into the sequence of amino acids that comprise the generating peptide. We train a Casanovo model from 30 million labeled spectra and demonstrate that the model outperforms several state-of-the-art methods on a cross-species benchmark dataset. We also develop a version of Casanovo that is fine-tuned for non-enzymatic peptides. Finally, we demonstrate that Casanovo’s superior performance improves the analysis of immunopeptidomics and metaproteomics experiments and allows us to delve deeper into the dark proteome.
more »
« less
- Award ID(s):
- 2245300
- PAR ID:
- 10528365
- Publisher / Repository:
- Nature Publishing Group
- Date Published:
- Journal Name:
- Nature Communications
- Volume:
- 15
- Issue:
- 1
- ISSN:
- 2041-1723
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Abstract MotivationTandem mass spectrometry (MS/MS) is a crucial technology for large-scale proteomic analysis. The protein database search or the spectral library search are commonly used for peptide identification from MS/MS spectra, which, however, may face challenges due to experimental variations between replicated spectra and similar fragmentation patterns among distinct peptides. To address this challenge, we present SpecEncoder, a deep metric learning approach to address these challenges by transforming MS/MS spectra into robust and sensitive embedding vectors in a latent space. The SpecEncoder model can also embed predicted MS/MS spectra of peptides, enabling a hybrid search approach that combines spectral library and protein database searches for peptide identification. ResultsWe evaluated SpecEncoder on three large human proteomics datasets, and the results showed a consistent improvement in peptide identification. For spectral library search, SpecEncoder identifies 1%–2% more unique peptides (and PSMs) than SpectraST. For protein database search, it identifies 6%–15% more unique peptides than MSGF+ enhanced by Percolator, Furthermore, SpecEncoder identified 6%–12% additional unique peptides when utilizing a combined library of experimental and predicted spectra. SpecEncoder can also identify more peptides when compared to deep-learning enhanced methods (MSFragger boosted by MSBooster). These results demonstrate SpecEncoder’s potential to enhance peptide identification for proteomic data analyses. Availability and ImplementationThe source code and scripts for SpecEncoder and peptide identification are available on GitHub at https://github.com/lkytal/SpecEncoder. Contact: hatang@iu.edu.more » « less
-
Lisacek, Frederique (Ed.)Historically, the database search algorithms have been the de facto standard for inferring peptides from mass spectrometry (MS) data. Database search algorithms deduce peptides by transforming theoretical peptides into theoretical spectra and matching them to the experimental spectra. Heuristic similarity-scoring functions are used to match an experimental spectrum to a theoretical spectrum. However, the heuristic nature of the scoring functions and the simple transformation of the peptides into theoretical spectra, along with noisy mass spectra for the less abundant peptides, can introduce a cascade of inaccuracies. In this paper, we design and implement a Deep Cross-Modal Similarity Network called SpeCollate , which overcomes these inaccuracies by learning the similarity function between experimental spectra and peptides directly from the labeled MS data. SpeCollate transforms spectra and peptides into a shared Euclidean subspace by learning fixed size embeddings for both. Our proposed deep-learning network trains on sextuplets of positive and negative examples coupled with our custom-designed SNAP-loss function. Online hardest negative mining is used to select the appropriate negative examples for optimal training performance. We use 4.8 million sextuplets obtained from the NIST and MassIVE peptide libraries to train the network and demonstrate that for closed search, SpeCollate is able to perform better than Crux and MSFragger in terms of the number of peptide-spectrum matches (PSMs) and unique peptides identified under 1% FDR for real-world data. SpeCollate also identifies a large number of peptides not reported by either Crux or MSFragger. To the best of our knowledge, our proposed SpeCollate is the first deep-learning network that can determine the cross-modal similarity between peptides and mass-spectra for MS-based proteomics. We believe SpeCollate is significant progress towards developing machine-learning solutions for MS-based omics data analysis. SpeCollate is available at https://deepspecs.github.io/ .more » « less
-
ABSTRACT Protein tandem mass spectrometry data are most often interpreted by matching observed mass spectra to a protein database derived from the reference genome of the sample being analyzed. In many application domains, however, a relevant protein database is unavailable or incomplete, and in such settings de novo sequencing is required. Since the introduction of the DeepNovo algorithm in 2017, the field of de novo sequencing has been dominated by deep learning methods, which use large amounts of labeled mass spectrometry data to train multi‐layer neural networks to translate from observed mass spectra to corresponding peptide sequences. Here, we describe these deep learning methods, outline procedures for evaluating their performance, and discuss the challenges in the field, both in terms of methods development and evaluation protocols.more » « less
-
This research employs pepsin-containing membranes to digest proteins online after a capillary electrophoresis (CE) separation and prior to tandem mass spectrometry. Proteolysis after the separation allows the peptides from a given protein to enter the mass spectrometer in a single plug. Thus, migration time can serve as an additional criterion for confirming the identification of a peptide. The membrane resides in a sheath-flow electrospray ionization (ESI) source to enable digestion immediately before spray into the mass spectrometer, thus limiting separation of the digested peptides. Using the same membrane, digestion occurred reproducibly during 20 consecutive CE analyses performed over a 10 h period. Additionally, after separating a mixture of six unreduced proteins with CE, online digestion facilitated protein identification with at least 2 identifiable peptides for all the proteins. Sequence coverages were >75% for myoglobin and carbonic anhydrase II but much lower for proteins containing disulfide bonds. Development of methods for efficient separation of reduced proteins or identification of cross-linked peptides should enhance sequence coverages for proteins with disulfide bonds. Migration times for the peptides identified from a specific protein differed by <∼30 s, which allows for rejection of some spurious peptide identifications.more » « less
An official website of the United States government
