Journal Article
SpecEncoder: deep metric learning for accurate peptide identification in proteomics
Liu, Kaiyuan; Tao, Chenghua; Ye, Yuzhen; Tang, Haixu
<title>Abstract</title>
<sec><title>Motivation</title>Tandem mass spectrometry (MS/MS) is a crucial technology for large-scale proteomic analysis. Protein database search and spectral library search are commonly used for peptide identification from MS/MS spectra; both, however, face challenges from experimental variation between replicate spectra and from similar fragmentation patterns among distinct peptides. We present SpecEncoder, a deep metric learning approach that addresses these challenges by transforming MS/MS spectra into robust and sensitive embedding vectors in a latent space. The SpecEncoder model can also embed predicted MS/MS spectra of peptides, enabling a hybrid search approach that combines spectral library and protein database searches for peptide identification.</sec>
<sec><title>Results</title>We evaluated SpecEncoder on three large human proteomics datasets, and the results showed a consistent improvement in peptide identification. For spectral library search, SpecEncoder identifies 1%–2% more unique peptides (and PSMs) than SpectraST. For protein database search, it identifies 6%–15% more unique peptides than MSGF+ enhanced by Percolator. Furthermore, SpecEncoder identifies 6%–12% additional unique peptides when utilizing a combined library of experimental and predicted spectra. SpecEncoder also identifies more peptides than deep-learning-enhanced methods (MSFragger boosted by MSBooster). These results demonstrate SpecEncoder’s potential to enhance peptide identification for proteomic data analyses.</sec>
<sec><title>Availability and Implementation</title>The source code and scripts for SpecEncoder and peptide identification are available on GitHub at https://github.com/lkytal/SpecEncoder.
Contact: hatang@iu.edu.</sec>
Oxford University Press
2024-06-28
10521530
Bioinformatics
40
Supplement_1
i257 to i265
1367-4803
https://doi.org/10.1093/bioinformatics/btae220
2011271
Proteomics, Deep Metric Learning, Mass Spectrometry, Peptide Identification, Spectral Library Searching
National Science Foundation