<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Foundation model for mass spectrometry proteomics</title></titleStmt>
			<publicationStmt>
				<publisher>https://doi.org/10.48550/arXiv.2505.10848</publisher>
				<date>05/19/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10631819</idno>
					<idno type="doi"></idno>
					
					<author>Justin Sanders</author><author>Melih Yilmaz</author><author>Jacob_H Russell</author><author>Wout Bittremieux</author><author>William_E Fondrie</author><author>Nicholas_M Riley</author><author>Sewoong Oh</author><author>William_Stafford Noble</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Mass spectrometry is the dominant technology in the field of proteomics, enabling high-throughput analysis of the protein content of complex biological samples. Due to the complexity of the instrumentation and resulting data, sophisticated computational methods are required for the processing and interpretation of acquired mass spectra. Machine learning has shown great promise to improve the analysis of mass spectrometry data, with numerous purpose-built methods for improving specific steps in the data acquisition and analysis pipeline reaching widespread adoption. Here, we propose unifying various spectrum prediction tasks under a single foundation model for mass spectra. To this end, we pre-train a spectrum encoder using de novo sequencing as a pre-training task. We then show that using these pre-trained spectrum representations improves our performance on the four downstream tasks of spectrum quality prediction, chimericity prediction, phosphorylation prediction, and glycosylation status prediction. Finally, we perform multi-task fine-tuning and find that this approach improves the performance on each task individually. Overall, our work demonstrates that a foundation model for tandem mass spectrometry proteomics trained on de novo sequencing learns generalizable representations of spectra, improves performance on downstream tasks where training data is limited, and can ultimately enhance data acquisition and analysis in proteomics experiments.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>In recent years, foundation models have emerged as a powerful machine learning paradigm for various problem domains <ref type="bibr">[30,</ref><ref type="bibr">29,</ref><ref type="bibr">9]</ref>. These models are trained to learn rich latent representations of input modalities from large datasets of unlabeled or weakly labeled data (e.g., online text, protein sequences in public repositories) using pre-training tasks such as masked language modeling. The trained model can subsequently be used to perform a variety of downstream tasks, relying on the</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Background and related work</head><p>Currently, tandem mass spectrometry is the only high-throughput method for systematically analyzing the full protein content of biological samples <ref type="bibr">[24]</ref>, driving breakthroughs in disease biomarker discovery, drug development, and analysis of post-translational modifications (PTMs) <ref type="bibr">[1]</ref>. In a standard tandem mass spectrometry experiment, proteins are digested into short peptides, ionized, and fragmented. The mass-to-charge ratios (m/z) of the resulting fragment ions are then measured very precisely by the instrument. This process yields a list of "peaks," each representing the m/z of a specific ion along with an intensity value corresponding to its abundance. Together, this list of peaks is called an "MS/MS spectrum," which serves as a fingerprint of the specific analyte being measured. In a typical mass spectrometry run, the instrument will collect on the order of 100,000 such spectra, each corresponding to a distinct peptide (or contaminant). Canonically, these spectra are then processed by a database search algorithm, with the goal of assigning to each spectrum its generating peptide. 
However, in many settings, there are a variety of other important downstream tasks involving tandem mass spectra, beyond simply assigning the peptide sequence.</p><p>As in many other fields, deep learning methods have taken mass spectrometry proteomics by storm. The tasks addressed thus far can be divided into two groups: those that take as input a peptide sequence and those that take as input a spectrum. In the first category, reviewed by Angelis et al. <ref type="bibr">[4]</ref>, models predict properties of a given peptide, such as expected fragmentation patterns and retention time, primarily with the goal of improving the sensitivity of database search.</p><p>The second category, tasks that take a spectrum as input, is more relevant to our foundation model. The most fundamental task in this category is de novo peptide sequencing, in which the input is a spectrum and the output is a peptide sequence. De novo sequencing offers an alternative method to database search for solving the spectrum annotation problem without relying on prior knowledge, making it a valuable tool for identifying peptides not present in a pre-defined protein database. Algorithms for solving this problem were introduced in the late 1990s <ref type="bibr">[37]</ref>, and the problem was first addressed with machine learning in 2015 <ref type="bibr">[22]</ref>. Subsequently, DeepNovo <ref type="bibr">[38]</ref> combined a convolutional neural network and a recurrent neural network to autoregressively predict the subsequent amino acid when provided an MS/MS spectrum and a peptide prefix. More recently, Casanovo <ref type="bibr">[45]</ref> employed a transformer architecture to frame de novo sequencing as a sequence-to-sequence translation task. Many methods have since successfully extended Casanovo's transformer architecture to include various ideas such as bidirectional decoding, a contrastive loss, improved positional embeddings, and alternative decoding strategies <ref type="bibr">[7]</ref>. 
</p><p>Downstream tasks. We study four downstream tasks that take tandem mass spectra as input: predicting spectrum quality, chimericity, phosphorylation, and glycosylation status.</p><p>In spectrum quality prediction, the model is asked to predict whether an observed spectrum is identifiable, meaning that it shows strong and clear signal for a peptide. This problem has been addressed with a variety of classical machine learning techniques <ref type="bibr">[28,</ref><ref type="bibr">33,</ref><ref type="bibr">42,</ref><ref type="bibr">23]</ref> and more recently using a convolutional neural network <ref type="bibr">[15]</ref>.</p><p>In mass spectrometry experiments, acquired spectra often inadvertently contain signal from multiple peptides. Such spectra are called chimeras, and they can be hard to analyze due to the mixture of signals from each peptide. To our knowledge, prediction of chimeric spectra has not previously been solved using machine learning methods. However, many existing methods generalize the database search procedure to allow for chimeric matches <ref type="bibr">[46,</ref><ref type="bibr">14]</ref>.</p><p>Predicting whether a post-translational modification is present in a given spectrum is another key task of interest. PhoStar uses a random forest to predict whether a given spectrum was generated by a phosphorylated peptide based on a set of hand-designed features <ref type="bibr">[13]</ref>. AHLF improves on this using a convolutional model which takes as input the full spectrum <ref type="bibr">[3]</ref>. For predicting whether a spectrum contains a peptide which is N- or O-glycosylated, current methods rely on hand-designed rules based on specific fragment ions <ref type="bibr">[36]</ref>.</p><p>Learning representations of spectra. 
Prior work has investigated learning spectrum representations, but these representations have primarily been used as a dimensionality reduction technique focused specifically on clustering spectra and improving peptide identification. GLEAMS learns low-dimensional spectrum representations optimized such that spectra from the same peptide cluster together <ref type="bibr">[8]</ref>. Similarly, yHydra co-embeds peptides and spectra such that spectra are close to their generating peptides in embedding space <ref type="bibr">[2]</ref>.</p><p>Finally, prior work has investigated foundation models for tandem mass spectra in the metabolomics space. Small molecules typically result in lower complexity mass spectra than peptides, and do not exhibit the same consistent fragmentation patterns along a linear molecular backbone. The methods LSM1-MS2 <ref type="bibr">[6]</ref>, PRISM <ref type="bibr">[17]</ref>, and DreaMS <ref type="bibr">[10]</ref> use unsupervised masked-peak modeling to learn representations of metabolomics mass spectra, demonstrating that these representations improve performance on downstream chemical property prediction tasks.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">De novo peptide sequencing as a pre-training task</head><p>To accurately perform de novo sequencing, a model needs to capture the fundamental relationships between the analyte present in the instrument (i.e., the peptide) and the observed signal measured by the mass spectrometer. This in turn requires a rich understanding of the physics and chemistry governing peptide chromatography, ionization, and fragmentation. We hypothesize that this fundamental understanding of mass spectra, which is acquired through pre-training on the de novo sequencing task, will generalize to other tasks involving mass spectra for which less training data is available.</p><p>Typically, foundation models are trained in an unsupervised manner, so as to benefit from massive datasets of unlabeled training examples. However, unlike the settings of natural language processing and computer vision, where there are orders of magnitude more unlabeled training examples than labeled samples, typically 40-60% of the acquired spectra can be annotated in a given mass spectrometry run and hence can be labeled with their generating peptide. Additionally, this labeling is fully automated and high-throughput, with no need for costly human annotations. Thus, here we consider making use of these labels to explore the supervised task of de novo peptide sequencing as pre-training for a foundation model.</p><p>In this work, we perform experiments with a state-of-the-art, transformer-based de novo sequencing model, Casanovo <ref type="bibr">[45,</ref><ref type="bibr">44]</ref>. Casanovo is trained on a dataset of 30 million high-quality labeled tandem mass spectra from the MassIVE-KB spectral library <ref type="bibr">[40]</ref>. We use Casanovo's pre-trained spectrum encoder off the shelf as a foundation model for mass spectrometry proteomics. 
Casanovo uses a standard transformer encoder architecture, where spectra are treated as a sequence of peaks, and each peak is embedded with a positional m/z embedding and a learned intensity embedding. This setup allows the model to easily attend to pairs of peaks with specified mass shifts, which is key to interpreting mass spectra. To obtain an overall spectrum representation we take the mean of the individual peak embeddings from the spectrum encoder, yielding a single embedding for the spectrum as a whole. To apply the encoder to each downstream task, we train a small task-specific dense predictor head that takes these frozen spectrum embeddings from Casanovo as input.</p><p>Figure 1: Receiver operating characteristic (ROC) curves (true positive rate versus false positive rate) for (A) spectrum quality prediction, (B) chimericity prediction, (C) phosphorylation prediction, and (D) glycosylation status prediction. Methods compared: Casanovo Foundation, Casanovo Foundation with multi-task training, end-to-end transformer, and binned embeddings; panel D also includes the GlyCounter + XGBoost and 138/144 ratio baselines.</p></div>
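<div xmlns="http://www.tei-c.org/ns/1.0"><p>The pooling step described in this section can be sketched as follows. This is a minimal illustration, assuming that per-peak embeddings are already available as equal-length lists of floats. The names <code>mean_pool</code> and <code>linear_head</code> and the toy weights are hypothetical and are not part of Casanovo's API; the actual predictor head is a small trained dense network rather than the single logistic unit shown here.</p>
<p><![CDATA[
```python
import math
from typing import List, Sequence


def mean_pool(peak_embeddings: Sequence[Sequence[float]]) -> List[float]:
    """Average per-peak embeddings into one spectrum-level embedding."""
    if not peak_embeddings:
        raise ValueError("spectrum has no peaks")
    dim = len(peak_embeddings[0])
    pooled = [0.0] * dim
    for emb in peak_embeddings:
        if len(emb) != dim:
            raise ValueError("inconsistent embedding size")
        for i, value in enumerate(emb):
            pooled[i] += value
    return [total / len(peak_embeddings) for total in pooled]


def linear_head(embedding: Sequence[float],
                weights: Sequence[float],
                bias: float) -> float:
    """Hypothetical one-unit head: logistic regression on a frozen embedding."""
    logit = sum(w * x for w, x in zip(weights, embedding)) + bias
    return 1.0 / (1.0 + math.exp(-logit))


# Toy example: three peaks with 2-dimensional embeddings.
peaks = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
spectrum_embedding = mean_pool(peaks)
probability = linear_head(spectrum_embedding, weights=[0.5, -0.5], bias=0.5)
```
]]></p>
<p>Because the encoder is frozen, only the parameters of the head (here, the hypothetical weights and bias) would be updated when training on a downstream task.</p></div>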
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Downstream tasks</head><p>We use our foundation model as a starting point to address four downstream tasks. In each task, we compare the frozen Casanovo encoder coupled with a small task-specific dense predictor head ("Casanovo Foundation") against at least two baselines. First, we bin spectrum peaks along the m/z axis to obtain spectrum embeddings and then train a gradient boosted decision tree classifier directly on those embeddings ("binned embedding"). Second, we train a transformer spectrum encoder, which has the same architecture as Casanovo, along with a multilayer perceptron (MLP) classifier head from scratch to learn the downstream tasks end-to-end ("end-to-end transformer"). For one of the tasks (phosphorylation detection), we also benchmark against a task-specific state-of-the-art classifier <ref type="bibr">[3]</ref>.</p><p>For the spectrum quality task, model weights for SPEQ <ref type="bibr">[15]</ref>, the current state-of-the-art deep learning method, are unfortunately not publicly available. However, our end-to-end transformer method serves as a conceptually similar baseline representing deep learning models trained directly on the task which take full spectra as input.</p></div>
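<div xmlns="http://www.tei-c.org/ns/1.0"><p>As an illustration of the binned embedding baseline, the sketch below converts a peak list into a fixed-length vector by summing intensities within m/z bins. The m/z range, the bin width, and the choice to sum intensities are assumptions made for this example; the text does not specify them. The function name <code>bin_spectrum</code> is hypothetical.</p>
<p><![CDATA[
```python
from typing import List, Sequence


def bin_spectrum(mz: Sequence[float],
                 intensity: Sequence[float],
                 min_mz: float = 0.0,
                 max_mz: float = 2000.0,
                 bin_width: float = 1.0) -> List[float]:
    """Sum peak intensities into fixed-width m/z bins (assumed settings)."""
    n_bins = int((max_mz - min_mz) / bin_width)
    binned = [0.0] * n_bins
    for m, i in zip(mz, intensity):
        idx = int((m - min_mz) // bin_width)
        # Peaks outside the assumed m/z range are dropped.
        if idx in range(n_bins):
            binned[idx] += i
    return binned


# Two peaks fall into the same bin; the last peak is out of range.
vector = bin_spectrum([100.2, 100.7, 1999.5, 2500.0], [1.0, 2.0, 4.0, 8.0])
```
]]></p>
<p>The resulting vector could then be passed to any tabular classifier, such as a gradient boosted decision tree.</p></div>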
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Spectrum quality prediction</head><p>The first downstream task we consider is spectrum quality prediction. For this task, the goal is to predict whether a given observed MS/MS spectrum will be successfully annotated by database search. The motivation for this task is three-fold. First, if we can quickly identify low-quality spectra, then we can save time and potentially boost our statistical power by eliminating these spectra prior to the database search procedure. Second, spectra that are deemed to be high-quality by the trained model but nonetheless fail to be identified during the database search procedure are good candidates for more expensive computational analyses to find identifications outside of the database. Finally, by predicting in real time which spectra can be annotated, it is possible to better allocate instrument time towards these analytes.</p><p>To create a labeled dataset for this task, we first randomly sample from the MassIVE repository 20 human mass spectrometry runs that use high-resolution instruments, and we select 10/5/5 files to create training/validation/test splits, where each split contains approximately 450k, 245k and 295k spectra, respectively. Spectra that are matched to a peptide under 1% false discovery rate (FDR) by database search are labeled as high quality, whereas spectra that failed to be matched are annotated as low quality for the binary classification task. In our dataset, we observe a 40%/60% distribution of high- and low-quality spectra.</p><p>Because spectrum quality prediction is a binary classification task and the task is roughly balanced, we use the area under the ROC curve (AUROC) as the primary performance measure. 
The presence of foreign spectra (i.e., spectra generated by peptides that are not in the given database, due to contamination or unexpected genetic variation) makes this task particularly challenging, because these may be high-quality spectra that will never be confidently assigned a peptide by the database search procedure. Additionally, because identifications were determined at a 1% FDR threshold, a small proportion of spectra in the positive class may be incorrect identifications of low-quality spectra. As a result, we expect the training and test labels to be fairly noisy for this task and we do not expect a priori to be able to achieve AUROC values close to 1.</p><p>Applying Casanovo Foundation to this task, we achieve an AUROC of 0.820, outperforming our task-specific end-to-end transformer and the binned embedding baselines (AUROC of 0.719 and 0.723, respectively) (Figure <ref type="figure">1A</ref>). This result suggests that the pre-trained spectrum representations from Casanovo capture properties that are hard to learn from the quality prediction task alone. This is not too surprising, given that the de novo sequencing pre-training task is both an inherently richer task and took advantage of more data.</p></div>
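<div xmlns="http://www.tei-c.org/ns/1.0"><p>For readers less familiar with the AUROC metric used throughout this section, the following sketch computes it as the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half. The function name <code>auroc</code> and the toy labels and scores are illustrative only. The pairwise loop is quadratic in the number of examples, so this is an explanatory sketch and not a practical implementation for hundreds of thousands of spectra.</p>
<p><![CDATA[
```python
from typing import Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a correct ranking
    return wins / (len(pos) * len(neg))


# Toy data: 1 = high-quality spectrum, 0 = low-quality spectrum.
toy_labels = [1, 0, 1, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.3, 0.1]
toy_auroc = auroc(toy_labels, toy_scores)  # 5 of 6 pairs are ranked correctly
```
]]></p>
<p>Under this definition, label noise of the kind described above caps the achievable value below 1 even for a model that ranks spectra by their true quality.</p></div>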
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Spectrum chimericity prediction</head><p>Tandem mass spectrometry experiments are designed to attempt to isolate individual peptide species, by first separating them by hydrophobicity in the liquid chromatography step and then separating peptides by m/z in the first round of mass spectrometry analysis. Nonetheless, in many cases, two peptides with similar hydrophobicities and m/z values end up being fragmented simultaneously. The result is an MS/MS spectrum that contains peaks corresponding to both peptides. Such chimeric spectra are difficult to analyze. Most database search algorithms assign at most one peptide to each spectrum, and even assigning a single peptide to a chimeric spectrum is challenging due to the presence of unexplained peaks from the undetected peptide.</p><p>Accordingly, our second downstream task involves detecting when more than one peptide species is responsible for generating a given MS/MS spectrum, i.e., predicting whether it is chimeric or not. Many existing methods generalize the database search procedure to allow chimeric matches <ref type="bibr">[46,</ref><ref type="bibr">14]</ref>; however, prediction of chimeric spectra has not previously been solved using machine learning methods. Such a predictor would be useful, for example, in deciding which spectra to provide as input to one of the tools above or in adjusting the settings of an instrument to avoid unwanted chimeras.</p><p>To train a chimericity predictor, we use spectra from human, mouse, and yeast samples for training, validation, and test, respectively. Database search is performed using the wide-window setting in FragPipe <ref type="bibr">[26]</ref>, which allows spectra to be assigned multiple peptides. For the binary classification task, spectra assigned more than one peptide are labeled chimeric and spectra annotated with a single peptide are labeled non-chimeric. Unannotated spectra are discarded. 
For each of the splits, we have roughly 60 thousand spectra annotated with at least one peptide, and approximately 45% of these spectra are chimeric. Similar to the quality prediction task, we expect that there is noise in these labels due to both false positive and false negative annotations from database search, so achieving close to perfect performance is unlikely.</p><p>Like the quality prediction task, chimericity prediction is a binary classification task without a pronounced class imbalance, so we use AUROC as our primary performance measure. Comparing Casanovo Foundation to the baseline methods, we again see that it achieves improved performance (AUROC 0.780) compared to the two baselines (AUROC of 0.684 and 0.711) (Figure <ref type="figure">1B</ref>). </p></div>
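<div xmlns="http://www.tei-c.org/ns/1.0"><p>The labeling rule for the chimericity task can be sketched as below. The input format (a mapping from a spectrum identifier to the list of peptides assigned by database search) and the function name <code>label_chimericity</code> are assumptions for illustration and do not reflect FragPipe's output format. Counting distinct peptide strings is also an assumption of this sketch.</p>
<p><![CDATA[
```python
from typing import Dict, List


def label_chimericity(assignments: Dict[str, List[str]]) -> Dict[str, int]:
    """Label spectra as chimeric (1) or non-chimeric (0).

    Spectra with no assigned peptide are discarded, mirroring the text.
    """
    labels = {}
    for spectrum_id, peptides in assignments.items():
        n_peptides = len(set(peptides))
        if n_peptides == 0:
            continue  # unannotated spectrum: dropped from the dataset
        labels[spectrum_id] = 1 if n_peptides > 1 else 0
    return labels


# Hypothetical search results for three spectra.
search_results = {
    "scan_1": ["PEPTIDEK"],
    "scan_2": ["PEPTIDEK", "ELVISLIVESK"],
    "scan_3": [],
}
chimeric_labels = label_chimericity(search_results)
```
]]></p>
<p>Any false positive or false negative peptide assignment from the search propagates directly into these labels, which is the source of the label noise noted above.</p></div>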
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Post-translational modification detection</head><p>The final type of downstream task we consider is the detection of spectra generated by peptides containing PTMs. A PTM is a molecular group that attaches to the side-chain of one of the amino acids in a peptide. The most commonly studied PTMs include phosphorylation, glycosylation, and methylation, but many more potential types of PTMs exist in nature, and some are quite rare. The peptide database used during database search of MS/MS data can be augmented to include PTMs, but because of the many potential types of PTMs and the fact that a single peptide can harbor multiple PTMs, accounting for all possible modifications is not computationally or statistically feasible. Additionally, identifying peptides containing PTMs and localizing the modification, i.e., determining which residue the modification is attached to, often requires adjusting settings in the mass spectrometer, such as the fragmentation type or collision energy, to be optimized specifically for that PTM. Thus, a model capable of identifying which PTMs are associated with a given MS/MS spectrum would be valuable in guiding how data is both collected and subsequently analyzed. In fact, simple methods for solving this task are regularly employed in practice to improve the sensitivity and quantitative accuracy of experiments targeting peptides carrying a specific PTM <ref type="bibr">[36,</ref><ref type="bibr">21]</ref>. Here, we train classifiers to recognize two common types of PTMs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.1">Phosphorylation detection</head><p>We first consider the detection of spectra from phosphorylated peptides. Protein phosphorylation is arguably one of the most important and well studied PTMs, driving key physiological activities such as energy metabolism, cell proliferation and growth, apoptosis, and signal transduction <ref type="bibr">[5]</ref>. We frame phosphorylation prediction as a binary classification task, predicting whether or not a given spectrum derives from a phosphorylated peptide.</p><p>To train a classifier, we use 19.2 million labeled spectra from the human phosphoproteome dataset <ref type="bibr">[25]</ref> which were used to train AHLF <ref type="bibr">[3]</ref>, a state-of-the-art phosphorylation predictor. The human phosphoproteome consists of 112 individual PRIDE datasets, containing 101 human cell or tissue types, where each dataset was collected with phospho-enrichment assays. To create labeled data for training AHLF, the human phosphoproteome was subjected to database search, and a binary label was assigned to spectra indicating phosphorylated or unphosphorylated peptides (see <ref type="bibr">[3]</ref> for details).</p><p>Of the resulting 19.2 million labeled spectra, 54% are phosphorylated. Following the cross-validation setup described in <ref type="bibr">[3]</ref>, we use the same train, validation, and test splits as the AHLF-&#945; model. For phosphorylation prediction, we use the F1 score in addition to the AUROC metric to account for class imbalances at the level of individual datasets within the test set.</p><p>Comparing the ROC curves for Casanovo Foundation to our two baselines for this task, we observe that, unlike the previous two tasks, our foundation model (AUROC 0.948) performs worse than the end-to-end transformer model (AUROC 0.965) but better than the binned spectrum baseline (AUROC 0.861) (Figure <ref type="figure">1C</ref>). 
Breaking performance down across each of the 25 test datasets and comparing to AHLF results, we observe that Casanovo Foundation performs somewhat better than AHLF, yielding higher performance on 19/25 datasets for each metric. However, the end-to-end transformer baseline outperforms both, with a higher F1 score on 19/25 and a higher AUROC on 17/25 datasets (Table <ref type="table">1</ref>). This result is not too surprising, because foundation modeling is not expected to provide a major advantage on tasks with very large amounts of high-quality labeled data available for training. Furthermore, the dataset used for pre-training Casanovo does not contain phosphopeptides, meaning that the model may not have learned to fully recognize the importance of specific peaks and mass shifts which are indicative of phosphorylation.</p><p>For PTMs other than phosphorylation, which may be both rarer biologically and lack well-established enrichment protocols, such a large training set is unavailable. For these modifications, we reasoned that the foundation modeling approach may prove more valuable. Accordingly, to investigate the relationship between the number of available training samples and the performance of each model, we create a series of 10 nested subsets of the phosphoproteomics training data, which range in size from roughly 7,700 to 7.7 million training spectra. The relative performance of foundation modeling to our supervised baselines on each subset informs in what settings foundation modeling may provide an advantage. We find that for datasets with fewer than &#8764;1 million spectra, the performance of our Casanovo Foundation model and the end-to-end transformer baseline cross over, with the foundation model showing the best performance (Figure <ref type="figure">2A</ref>). The difference in performance grows as the size of the training set decreases. 
Strikingly, Casanovo Foundation achieves an AUROC of 0.881 when trained on a dataset of just 7,615 spectra, compared to 0.639 for the binned embedding baseline and 0.635 for the end-to-end transformer baseline. Given that for many rarer or less-well-studied PTMs, assembling a dataset of even &#8764;100,000 spectra may prove a significant challenge, this result suggests the potential utility of Casanovo Foundation for other PTM prediction tasks.</p></div>
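<div xmlns="http://www.tei-c.org/ns/1.0"><p>One way to build nested training subsets for a learning curve of this kind is sketched below. The sketch assumes that each subset is half the size of the previous one and is drawn as a prefix of a single fixed shuffle, so that every smaller subset is contained in every larger one. The halving schedule, the seed, and the function name <code>nested_subsets</code> are assumptions for illustration; the text only states that the subsets are nested and gives their approximate size range.</p>
<p><![CDATA[
```python
import random
from typing import Iterable, List


def nested_subsets(items: Iterable[int], n_subsets: int, seed: int = 0) -> List[List[int]]:
    """Return successively halved subsets, each contained in the previous one."""
    rng = random.Random(seed)
    shuffled = list(items)
    rng.shuffle(shuffled)  # one fixed shuffle guarantees nesting
    subsets = []
    size = len(shuffled)
    for _ in range(n_subsets):
        size //= 2
        if size == 0:
            break
        subsets.append(shuffled[:size])
    return subsets


# Toy example: indices of 1,024 training spectra, halved ten times.
subsets = nested_subsets(range(1024), n_subsets=10)
sizes = [len(s) for s in subsets]
```
]]></p>
<p>Using prefixes of one shuffle means that differences between points on the learning curve reflect training set size and not a change in which spectra were sampled.</p></div>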
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3.2">Glycosylation determination</head><p>To explore the task of PTM prediction in a setting where foundation modeling may be more necessary, we turn to another important modification for which training data is less readily available. Protein glycosylation is a complex PTM where various combinations of mono- and oligosaccharides are attached to specific residues. Here, we consider the task of predicting the glycosylation class of a peptide from its spectrum. The two most common classes of glycosylation are N-glycosylation, where glycans are attached to the side-chain nitrogen of asparagine residues within a specific motif, and O-glycosylation, where glycans are attached to hydroxyl groups of serine and threonine residues <ref type="bibr">[16,</ref><ref type="bibr">34,</ref><ref type="bibr">20,</ref><ref type="bibr">39]</ref>. Recognizing whether a given spectrum represents a glycosylated peptide is straightforward due to the presence of characteristic oxonium ions that are generated upon collisional activation <ref type="bibr">[43,</ref><ref type="bibr">32,</ref><ref type="bibr">35,</ref><ref type="bibr">47]</ref>. However, distinguishing N-glycosylation from O-glycosylation is more difficult (Supplementary note S2.5).</p><p>Effectively classifying glycopeptides, especially N- vs. O-glycopeptides, is critical because the optimal dissociation method differs for N- and O-glycopeptides. Tryptic N-glycopeptides typically only have a single potential glycosite, meaning that higher-energy collisional dissociation (HCD) is sufficient for both identifying and localizing N-glycosylation <ref type="bibr">[31]</ref>. On the other hand, O-glycopeptide sequences can often contain multiple potential O-glycosites per peptide, and thus require spectra acquired with alternative dissociation methods, e.g., electron-transfer dissociation (ETD), which generate peptide fragment ions that retain glycan modifications and thereby facilitate localization. 
Acquiring ETD spectra incurs a significant overhead in instrument time. Thus, by predicting whether a given HCD spectrum contains an N- versus an O-glycopeptide, we can intelligently guide the data acquisition to spend instrument time acquiring ETD spectra only for the precursor ions for which it is necessary <ref type="bibr">[36]</ref>. To train a model to distinguish N- versus O-glycosylation, we use a publicly available dataset of the mouse brain glycoproteome produced by DQGlyco. This dataset contains 252,970 total glycopeptide identifications, of which 25,757 (10.2%) are O-glycosylated <ref type="bibr">[27]</ref>. In addition to our two standard baselines, for this task we also consider two domain-specific baselines. The first looks at the ratio in intensity between the oxonium ion at 138 m/z to that at 144 m/z. This ratio between the abundances of expected product ions from N- and O-glycans is currently used in practice for real-time prediction in glycoproteomics experiments <ref type="bibr">[36]</ref>. The second baseline is a slightly more sophisticated version of the prior approach, which trains an XGBoost classifier on the abundance of these two oxonium ions, along with the abundances of 52 other oxonium ions, extracted by GlyCounter <ref type="bibr">[19]</ref>, that are known to be characteristic of glycosylation.</p><p>Evaluating the performance of each method, we find that the domain-specific baselines are already reasonably good, with an AUROC of 0.872 for the 138/144 ratio and 0.951 for GlyCounter+XGBoost. The binned embedding baseline and end-to-end transformer baselines perform similarly, achieving AUROCs of 0.950 and 0.959, respectively. However, we again find that Casanovo Foundation offers the best results, achieving an AUROC of 0.976 (Figure <ref type="figure">1D</ref>). 
Given the significant class imbalance in the data, with only &#8764;10% of the data coming from the positive class, we also plot precision-recall curves for this task (Supplementary Figure <ref type="figure">S1</ref>). This accentuates the difference in performance between methods, with Casanovo Foundation achieving an AUPR of 0.914, compared to 0.753, 0.860, 0.811, and 0.867 for the ratio, GlyCounter, binned, and transformer baselines, respectively.</p></div>
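<div xmlns="http://www.tei-c.org/ns/1.0"><p>The 138/144 ratio baseline can be sketched as a simple intensity ratio computed from a peak list. The text gives only the nominal masses 138 and 144; the more precise target values, the matching tolerance, and the small constant that avoids division by zero are assumptions of this sketch, as are the function names. The sketch returns the raw ratio as a score and deliberately does not choose a decision threshold.</p>
<p><![CDATA[
```python
from typing import Sequence


def peak_intensity(mz: Sequence[float],
                   intensity: Sequence[float],
                   target: float,
                   tol: float = 0.02) -> float:
    """Sum the intensity of all peaks within +/- tol of the target m/z."""
    return sum(i for m, i in zip(mz, intensity) if abs(m - target) <= tol)


def oxonium_ratio(mz: Sequence[float],
                  intensity: Sequence[float],
                  mz_a: float = 138.055,   # assumed value for the "138" ion
                  mz_b: float = 144.066,   # assumed value for the "144" ion
                  tol: float = 0.02,       # assumed matching tolerance
                  eps: float = 1e-9) -> float:
    """Ratio of the summed intensities of the two oxonium ions."""
    a = peak_intensity(mz, intensity, mz_a, tol)
    b = peak_intensity(mz, intensity, mz_b, tol)
    return a / (b + eps)  # eps avoids division by zero when the ion is absent


# Toy HCD spectrum with three peaks.
ratio = oxonium_ratio([138.055, 144.066, 204.087], [300.0, 100.0, 500.0])
```
]]></p>
<p>Because the score is a single number per spectrum, it can be evaluated with the same AUROC and precision-recall analysis as the learned models.</p></div>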
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Multi-task training</head><p>Having demonstrated the utility of pre-trained spectrum representations from Casanovo Foundation on various downstream tasks, we next turn to strategies for improving the representations further. To this end, we fine-tune our pre-trained Casanovo encoder using a multi-task learning strategy. In this setup, we add three task-specific prediction heads on top of the shared pre-trained spectrum encoder. The model is then jointly optimized on the spectrum quality, chimericity, and phosphorylation prediction downstream tasks, in addition to the main de novo sequencing task used during pre-training. During each training step, the multi-task model receives one batch of spectra from each task and minimizes their summed loss. To approximately balance the amount of training data from each of the three downstream tasks, we use the 1/32 downsampled phosphorylation dataset (243,710 spectra) from the above learning curve experiment for training. We hypothesize that the diversity in training data and tasks during joint training will introduce the model to a broader distribution of spectra than was seen during pre-training, thereby helping the encoder to recognize and extract a wider range of important spectrum features. Specifically, during de novo sequencing pre-training, the model saw no low-quality spectra nor spectra with phosphorylation as a PTM.</p><p>Having trained our multi-task encoder, we evaluate it using the same procedure as described in Section 4. For each downstream task, we obtain spectral embeddings from the pre-trained encoder and train a task-specific classifier directly on these representations. 
We find that our multi-task training improves performance on all three tasks, although to varying degrees: spectrum quality prediction improves from an AUROC of 0.820 to 0.837, chimericity prediction improves from 0.780 to 0.821, and phosphorylation prediction improves from 0.948 to 0.988 (Figures <ref type="figure">1A-C</ref>). This performance on the phosphorylation task now surpasses the task-specific transformer model trained on the full phosphorylation dataset, achieving the best performance out of all methods for 17 of the 25 individual datasets in the test set (Table <ref type="table">1</ref>). Notably, we observe that our phosphorylation classifier trained on multi-task encoder embeddings converges very quickly, unlike the end-to-end transformer baseline, which continues to improve when trained on larger datasets all the way up to the full 7.7 million spectra. Thus, we opted to use the same downsampled dataset for training the phosphorylation predictor head that was used for pre-training the multi-task encoder. This means that not only does Casanovo Foundation with multi-task training yield state-of-the-art performance on the phosphorylation prediction task, but it does so while using, in total, only 1/32 as much training data as competing methods.</p><p>To visualize the structure of the latent space learned by our spectrum encoders and to better understand how multi-task training improves performance, we performed principal component analysis (PCA) of the spectrum embeddings of spectra from each of the three tasks. For the pretrained encoder, we observed a heavy overlap between spectra from each class for all three tasks (Figure <ref type="figure">2A</ref>, Supplementary Figures <ref type="figure">S2A-B</ref>). This overlap indicates that the features relevant to each task are not assigned high weight by the encoder. 
After joint training, however, the spectrum representations from the phosphorylation dataset show clear separation in the first principal component (Figure <ref type="figure">2D</ref>), which explains why the task-specific prediction head trained on these representations requires so little training data to achieve good performance. For the other two datasets there is still considerable overlap between the embeddings of each class, as may be expected given the comparatively lower AUROCs achieved on these tasks. However, we still observe much clearer separation than is seen for the embeddings without multi-task fine-tuning (Supplementary Figures <ref type="figure">S2C-D</ref>).</p><p>Having shown that multi-task fine-tuning of the spectrum encoder improves spectrum representations for the tasks included in training, we next sought to evaluate whether this approach also improves performance for other tasks not included in training. Thus, we evaluated the representations from the multi-task encoder on the unseen glycosylation status prediction task above. We find that the representations learned by the multi-task model are less useful for this task, achieving an AUROC of 0.968, compared to the 0.976 achieved by the encoder without fine-tuning. This suggests that multi-task training does not improve performance on downstream tasks that are not included in the training, at least for the single held-out task that we tested here.</p><p>Overall, these results indicate that the best performance on a given task is obtained when the pre-trained Casanovo spectrum encoder is fine-tuned on that task in a multi-task training setup. However, the benefits of this fine-tuning are task-specific and do not necessarily generalize to new tasks. Additionally, the benefits of this approach come at the expense of added complexity and significantly higher computational cost, which may not be worthwhile for many use cases.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion and future work</head><p>In this work, we demonstrate that the spectrum encoder learned by a model trained on the de novo sequencing task is generally applicable as a foundation model for tandem mass spectrometry data. Small models trained on frozen spectrum embeddings give good performance across a wide range of downstream tasks, and multi-task fine-tuning of the spectrum encoder improves performance further, with Casanovo Foundation ultimately achieving state-of-the-art performance on all downstream tasks it was applied to. These results demonstrate the utility of foundation models for mass spectrometry proteomics as a flexible starting point for solving novel tasks without the need for massive task-specific labeled datasets.</p><p>One promising avenue for future research is to replace or augment the de novo pre-training with an unsupervised pre-training task, as has been done in metabolomics <ref type="bibr">[6,</ref><ref type="bibr">17,</ref><ref type="bibr">10]</ref>. Although this will not dramatically increase the training dataset size, it may lead to richer and more generalizable spectrum representations. Additionally, such an approach would allow the inclusion of more diverse spectra, including those not readily annotatable by database search.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>S2.2 Chimericity task</head><p>The samples were prepared using the method described in <ref type="bibr">[41]</ref> and analyzed using an Orbitrap Fusion Lumos mass spectrometer. Raw MS/MS data were converted to mzML files using MSConvert with peak picking enabled in ProteoWizard (version 3.0.24031) <ref type="bibr">[11]</ref>. The human, mouse, and yeast MS/MS data were then searched against a human (20,597 proteins, 02/2024), mouse (21,701 proteins, 02/2024), and yeast (6060 proteins, 02/2024) proteome database, respectively, using FragPipe (version 22.0) with the default workflow and "DDA+" mode (i.e., wide window database search). Database search results were filtered at a 1% PSM-level FDR. Spectra were assigned as chimeric if involved in more than one high-confidence PSM.</p></div>
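<div xmlns="http://www.tei-c.org/ns/1.0"><p>The chimericity labeling rule described above can be sketched as follows. This is a minimal illustration, not the pipeline used in this work: the function name <code>label_chimeric</code> and the input layout (pairs of spectrum identifier and peptide that have already passed the 1% PSM-level FDR filter) are assumptions, as is the choice to count distinct peptides rather than raw PSM rows.</p><p><![CDATA[
```python
from collections import defaultdict


def label_chimeric(confident_psms):
    """Label each spectrum as chimeric (True) or not (False).

    confident_psms: iterable of (spectrum_id, peptide) pairs that already
    passed the 1% PSM-level FDR filter (hypothetical input layout).
    """
    peptides_per_spectrum = defaultdict(set)
    for spectrum_id, peptide in confident_psms:
        peptides_per_spectrum[spectrum_id].add(peptide)
    # Assumption: a spectrum is chimeric when it is matched to more than
    # one distinct peptide among the high-confidence PSMs.
    return {
        spectrum_id: len(peptides) > 1
        for spectrum_id, peptides in peptides_per_spectrum.items()
    }
```
]]></p><p>Spectra that never appear in a confident PSM are absent from the returned mapping, so how to treat them is left to the caller.</p></div>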
<div xmlns="http://www.tei-c.org/ns/1.0"><head>S2.3 Phosphorylation task</head><p>The raw data and database search results from the human phosphoproteome dataset were downloaded from ProteomeXchange PXD012174 <ref type="bibr">[25]</ref>, which contains data from 101 human cell and tissue types analyzed using phospho-enrichment assays. Data were prepared following the pre-processing scripts used by AHLF <ref type="bibr">[3]</ref>, which were shared by the authors. These scripts filter spectra at a 1% FDR at the PSM, protein, and phosphosite localization levels. Additionally, PSMs were filtered based on a minimum score for modified peptides of 40 and a minimum delta score for modified peptides of 6. Spectra assigned only to phosphorylated peptides were assigned a positive label, and spectra assigned only to unphosphorylated peptides were assigned a negative label. Remaining spectra were discarded. Finally, the data were split into train/validation/test sets at the cell/tissue type level following the same splits used by Altenberg et al. <ref type="bibr">[3]</ref>.</p></div>
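<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a rough illustration of the labeling rule above, the sketch below assigns a positive label to spectra matched only to phosphorylated peptides, a negative label to spectra matched only to unphosphorylated peptides, and drops the rest. The function name <code>label_phospho</code> and the input layout are hypothetical; the AHLF pre-processing scripts themselves are not reproduced here, and the FDR and score filters are assumed to have been applied already.</p><p><![CDATA[
```python
def label_phospho(psm_flags_by_spectrum):
    """Assign phosphorylation labels to spectra.

    psm_flags_by_spectrum: dict mapping spectrum_id to a list of booleans,
    one per filtered PSM, True if the matched peptide is phosphorylated
    (hypothetical input layout).
    Returns a dict mapping spectrum_id to 1 (positive) or 0 (negative).
    """
    labels = {}
    for spectrum_id, flags in psm_flags_by_spectrum.items():
        if not flags:
            continue  # no confident PSM: nothing to label
        if all(flags):
            labels[spectrum_id] = 1  # only phosphorylated peptides
        elif not any(flags):
            labels[spectrum_id] = 0  # only unphosphorylated peptides
        # Mixed assignments are discarded, as described in the text.
    return labels
```
]]></p></div>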
<div xmlns="http://www.tei-c.org/ns/1.0"><head>S2.4 Glycosylation task</head><p>The raw data from the 48 mouse brain HCD runs generated by Potel et al. were downloaded from ProteomeXchange PXD052447 <ref type="bibr">[27]</ref>, along with the FragPipe <ref type="bibr">[26]</ref> N- and O-glycopeptide search results. Raw MS/MS data were converted to mzML files using MSConvert with peak picking enabled in ProteoWizard (version 3.0.24031) <ref type="bibr">[11]</ref>. The data were randomly split at the run level into train/validation/test sets containing 36/6/6 runs, respectively. N-glyco PSMs in the MSFragger search results were filtered for assigned modifications at asparagine residues; O-glyco PSMs were filtered for assigned modifications at either serine or threonine residues. From there, results were filtered based on a hyperscore greater than 16 and a glycan q-value less than 0.01 to obtain a 1% FDR for glycan assignment. To obtain confident classification labels, cases of co-occupancy of O- and N-glycosites on the same PSM were filtered out. Spectra identified as containing an O-glycopeptide were labeled as positive examples, while spectra containing an N-glycopeptide were assigned negative labels. Spectra not identified with a glycopeptide were discarded. GlyCounter <ref type="bibr">[19]</ref> was run on each of these spectra, yielding a list of 54 oxonium ions, which were in turn used to calculate the m/z 138/144 ratio.</p></div>
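<div xmlns="http://www.tei-c.org/ns/1.0"><p>The filtering and labeling steps above can be summarized in a short sketch. The thresholds (hyperscore greater than 16, glycan q-value less than 0.01) come from the text; the record layout, the key names <code>hyperscore</code>, <code>glycan_q</code>, and <code>sites</code>, and the function name are assumptions made for illustration and do not reflect the actual FragPipe output columns.</p><p><![CDATA[
```python
HYPERSCORE_MIN = 16.0  # PSMs must score strictly above this value
GLYCAN_Q_MAX = 0.01    # PSMs must have a glycan q-value strictly below this


def label_glyco_psm(psm):
    """Return 1 (O-glyco), 0 (N-glyco), or None if the PSM is discarded.

    psm: hypothetical dict with keys "hyperscore", "glycan_q", and "sites",
    where "sites" is a string of glycosylated residue letters such as "N"
    or "ST".
    """
    passes = psm["hyperscore"] > HYPERSCORE_MIN and psm["glycan_q"] < GLYCAN_Q_MAX
    if not passes:
        return None
    has_n_site = "N" in psm["sites"]
    has_o_site = any(residue in psm["sites"] for residue in "ST")
    if has_n_site == has_o_site:
        # Either co-occupancy of N- and O-glycosites, or no glycosite at all.
        return None
    return 1 if has_o_site else 0
```
]]></p></div>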
<div xmlns="http://www.tei-c.org/ns/1.0"><head>S2.5 N- vs O-glycosylation</head><p>In some cases, distinguishing N-glycosylation from O-glycosylation is straightforward using ratios of ions that indicate the presence of N-acetylglucosamine (GlcNAc) or N-acetylgalactosamine (GalNAc). This classification can be simplified to a comparison of m/z 138 to m/z 144, where a 1:1 ratio indicates the presence of GalNAc, but not GlcNAc, in a glycan composition. This ratio is useful for classifying N-glycopeptides, which have GlcNAc but not GalNAc residues, relative to simple core 1 O-glycopeptides, which only contain GalNAc. This task becomes more challenging when considering elongated core 1 O-glycans and core 2-8 O-glycans that contain both GalNAc and GlcNAc moieties. For example, core 2 glycans are relatively common in mammalian glycoproteomic datasets, and the GlcNAc residues in these O-glycopeptides mean they produce oxonium ion patterns that look more similar to those of N-glycopeptides than to those of core 1 O-glycopeptides that lack GlcNAc.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>S3 Training Settings and Hyper-parameters S3.1 Supervised pre-training</head><p>The weights for the pre-trained Casanovo model checkpoint 4.0.0 (Apache 2.0 license) were downloaded from GitHub. This model was trained on the MassIVE-KB dataset using the supervised de novo sequencing task as described in Yilmaz et al. <ref type="bibr">[44]</ref>. Only the weights for the spectrum encoder from the encoder-decoder Casanovo model were used. This gives an encoder-only model with nine transformer encoder block layers, an embedding size of 512, and eight attention heads. Overall spectrum representations were obtained from this encoder by taking the mean of the individual peak embeddings.</p></div>
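<div xmlns="http://www.tei-c.org/ns/1.0"><p>The mean pooling step described above, in which per-peak embeddings are averaged into a single spectrum representation, can be illustrated with a small dependency-free sketch. The function name <code>mean_pool</code> is hypothetical, and the sketch assumes that every peak embedding passed in corresponds to a real peak; in a batched implementation, padded positions would need to be masked out before averaging.</p><p><![CDATA[
```python
def mean_pool(peak_embeddings):
    """Average equal-length peak embedding vectors into one spectrum embedding.

    peak_embeddings: list of lists of floats, one inner list per peak.
    Assumption: no padding entries are present in the input.
    """
    if not peak_embeddings:
        raise ValueError("spectrum has no peak embeddings")
    n_peaks = len(peak_embeddings)
    dim = len(peak_embeddings[0])
    return [
        sum(vector[i] for vector in peak_embeddings) / n_peaks
        for i in range(dim)
    ]
```
]]></p><p>With the encoder described above, each inner vector would have 512 entries, so the pooled spectrum embedding is also 512-dimensional regardless of the number of peaks.</p></div>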
<div xmlns="http://www.tei-c.org/ns/1.0"><head>S3.2 Multi-task training</head><p>For our multi-task training experiment, we fine-tune the pre-trained spectrum encoder on three downstream tasks simultaneously. One task-specific classification head, consisting of a dense network with one 512-dimensional hidden layer and ReLU non-linearity, is added to the model for each task. Each task-specific head takes as input the mean of the individual peak embeddings output by the encoder. To prevent fine-tuned spectrum representations from overfitting to the three specific tasks, we also retain the standard de novo sequencing loss during fine-tuning.</p><p>When training the multi-task model, during each training step we load one batch of spectra from each task. The task-specific loss is computed on the corresponding batch, and the sum of all four losses is optimized. To roughly balance the size of the datasets for the three downstream tasks, we downsampled the phosphoproteomics dataset by a factor of 32 to obtain a dataset of roughly 240,000 spectra. We anticipate that more specific re-weighting of the task-specific losses to account for differences in difficulty of each task may improve overall results by preventing one task from overfitting before other tasks have converged. However, due to resource constraints, we were unable to extensively search the space of loss weighting terms.</p><p>We trained the multi-task model for 185,000 training steps with a batch size of 32, performing validation every 4000 steps. The model checkpoint with the lowest average validation loss across the three downstream tasks was selected (step 96,000). The peak learning rate was set to 1e-5, with a linear warmup period of 1000 steps and a cosine learning rate schedule with a half-period of 120,000 steps. Binary cross-entropy loss was used for all four tasks, with label smoothing of 0.001. 
Gradient updates were performed using the Adam optimizer <ref type="bibr">[18]</ref> with 1e-6 weight decay.</p></div>
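<div xmlns="http://www.tei-c.org/ns/1.0"><p>To make the learning rate schedule above concrete, the sketch below combines a linear warmup with a cosine factor using the stated values (peak learning rate 1e-5, 1000 warmup steps, cosine half-period of 120,000 steps). The exact way the two factors are combined is an assumption for illustration, and the function name <code>learning_rate</code> is hypothetical; the training code itself may differ in detail.</p><p><![CDATA[
```python
import math

PEAK_LR = 1e-5          # peak learning rate stated in the text
WARMUP_STEPS = 1000     # length of the linear warmup
HALF_PERIOD = 120_000   # steps for the cosine factor to go from 1 to 0


def learning_rate(step):
    """Learning rate at a given training step (assumed functional form)."""
    # Linear ramp from 0 to 1 over the warmup period, then held at 1.
    warmup_factor = min(1.0, step / WARMUP_STEPS)
    # Cosine factor: 1 at step 0, 0.5 at HALF_PERIOD / 2, 0 at HALF_PERIOD.
    cosine_factor = 0.5 * (1.0 + math.cos(math.pi * step / HALF_PERIOD))
    return PEAK_LR * warmup_factor * cosine_factor
```
]]></p><p>Under this assumed form, the learning rate at the selected checkpoint (step 96,000) has already decayed to roughly a tenth of its peak value.</p></div>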
<div xmlns="http://www.tei-c.org/ns/1.0"><head>S3.3 Task-specific training</head><p>Binned baseline. To pre-process the input for our binned baseline models, we discretize the m/z axis into equal-width bins between 140 and 2000 m/z. We experimented with different binning resolutions and settled on using 100-bin, i.e., 100-dimensional, embeddings (Supplementary Figure <ref type="figure">S4</ref>). Peaks outside the range 140-2000 m/z are filtered out, and the remaining peak intensities are binned at 18.6 m/z resolution. We then train a gradient-boosted decision tree classifier on these representations <ref type="bibr">[12]</ref> using the validation set for early stopping based on validation AUROC. The hyperparameter early_stopping_rounds was set to 32, and n_iters was chosen to be sufficiently large that training is always terminated by early stopping. Otherwise, default parameters were used.</p><p>GlyCounter baseline. Similar to the binned baseline, the GlyCounter baseline for the glycosylation status prediction task represents each spectrum as a 54-dimensional vector of intensities for a predefined set of oxonium ions known to be produced by glycan fragmentation. An XGBoost classifier is likewise trained on these representations.</p><p>End-to-end transformer. For the end-to-end transformer pipeline, we train the transformer spectrum encoder and MLP classifier head end-to-end on each task. The transformer encoder is implemented using depthcharge components to have the same architecture as the Casanovo encoder, except for the number of transformer layers, which was selected from the range 1 to 9 based on validation set performance. 
Gradient updates during training were performed using the Adam optimizer <ref type="bibr">[18]</ref> with a learning rate of 1e-4 and a weight decay of 1e-6. Training is terminated with early stopping based on AUROC on the validation set with a patience of 5 epochs.</p><p>Casanovo Foundation. To apply Casanovo Foundation to a downstream task, we first use the pre-trained encoder from Casanovo version 4.0.0 (Apache License 2.0) to obtain a 512-dimensional spectrum embedding for each spectrum. We then train a small two-layer dense network on these embeddings. The dense model has one 512-dimensional hidden layer with ReLU activation. The model is trained with a learning rate of 1e-3, and training is also terminated via early stopping with a patience of 5 epochs.</p></div>
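<div xmlns="http://www.tei-c.org/ns/1.0"><p>The pre-processing for the binned baseline in S3.3 can be sketched as follows: peaks outside 140-2000 m/z are dropped and the remainder are placed into 100 equal-width bins of 18.6 m/z. The function name <code>bin_spectrum</code> is hypothetical, and summing the intensities of peaks that fall into the same bin is an assumption, since the text does not state how multiple peaks per bin are combined.</p><p><![CDATA[
```python
MIN_MZ = 140.0
MAX_MZ = 2000.0
N_BINS = 100


def bin_spectrum(mz_values, intensities):
    """Convert a peak list into a fixed-length binned intensity vector."""
    bin_width = (MAX_MZ - MIN_MZ) / N_BINS  # 18.6 m/z per bin
    binned = [0.0] * N_BINS
    for mz, intensity in zip(mz_values, intensities):
        if mz < MIN_MZ or mz > MAX_MZ:
            continue  # peak outside the retained m/z range
        # Clamp so that a peak at exactly MAX_MZ falls into the last bin.
        index = min(int((mz - MIN_MZ) / bin_width), N_BINS - 1)
        # Assumption: intensities sharing a bin are summed.
        binned[index] += intensity
    return binned
```
]]></p><p>The resulting 100-dimensional vectors are what a gradient-boosted tree classifier would consume in this baseline.</p></div>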
<div xmlns="http://www.tei-c.org/ns/1.0"><head>S3.4 Timing and compute resources</head><p>Pre-training Casanovo on MassIVE-KB took 8 days on 4 RTX 2080 Ti GPUs. Multi-task fine-tuning took 4 days on a single A100 80G GPU. Training models for each of the downstream tasks was done using 2 L40S 48GB GPUs and took &#8764;8 days in total. The majority of this time was spent training the end-to-end transformer model on the phosphorylation task. All remaining experiments were done on a CPU workstation with 16x Intel Xeon CPU E5-2680 @ 2.70GHz and 64GB of RAM in relatively negligible time.</p><p>This dependence on large compute resources, GPUs in particular, is a notable current limitation of Casanovo Foundation. In practice, many mass spectrometry proteomics labs do not have access to or familiarity with using GPUs. However, as deep learning is becoming more widespread in the field, labs are beginning to invest more in local and cloud-based compute resources. Additionally, future engineering efforts to accelerate the inference time of Casanovo Foundation can further bridge this gap.</p></div></body>
		</text>
</TEI>
