This content will become publicly available on May 15, 2026

Title: Machine Learning in Small-Molecule Mass Spectrometry
Tandem mass spectrometry (MS/MS) is crucial for small-molecule analysis; however, traditional computational methods are limited by incomplete reference libraries and complex data processing. Machine learning (ML) is transforming small-molecule mass spectrometry in three key directions: (a) predicting MS/MS spectra and related physicochemical properties to expand reference libraries, (b) improving spectral matching through automated pattern extraction, and (c) predicting molecular structures of compounds directly from their MS/MS spectra. We review ML approaches for molecular representations [descriptors, simplified molecular-input line-entry system (SMILES) strings, and graphs] and MS/MS spectrum representations (binned vectors and peak lists), along with recent advances in the prediction of spectra, retention times, and collision cross sections, and in spectral matching. Finally, we discuss ML-integrated workflows for chemical formula identification. By addressing the limitations of current methods for compound identification, these ML approaches can greatly enhance the understanding of biological processes and the development of diagnostic and therapeutic tools.
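The two spectrum representations named in the abstract (peak lists and binned vectors) can be illustrated with a minimal sketch. The peak values, the 1 Da bin width, and the 0 to 200 m/z range are illustrative assumptions, not parameters taken from the review.

```python
from typing import List, Tuple

# A peak list is a sequence of (m/z, intensity) pairs.
Peak = Tuple[float, float]


def bin_spectrum(
    peaks: List[Peak],
    mz_min: float = 0.0,    # assumed lower m/z bound
    mz_max: float = 200.0,  # assumed upper m/z bound
    bin_width: float = 1.0  # assumed bin width in Da
) -> List[float]:
    """Convert a peak list into a fixed-length, max-normalized binned vector."""
    n_bins = int((mz_max - mz_min) / bin_width)
    vec = [0.0] * n_bins
    for mz, intensity in peaks:
        if mz_min <= mz < mz_max:
            # Clamp the index to guard against floating-point edge cases.
            idx = min(int((mz - mz_min) / bin_width), n_bins - 1)
            vec[idx] += intensity  # peaks in the same bin are summed
    top = max(vec, default=0.0)
    return [v / top for v in vec] if top > 0 else vec


# Hypothetical peak list for illustration only.
peaks = [(41.04, 120.0), (69.07, 300.0), (69.40, 100.0), (151.11, 50.0)]
vector = bin_spectrum(peaks)
```

The peak list keeps exact m/z values but varies in length from spectrum to spectrum, whereas the binned vector has a fixed length that many ML models expect, at the cost of merging peaks that fall into the same bin (here the two peaks near m/z 69).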
Award ID(s):
2011271
PAR ID:
10636863
Author(s) / Creator(s):
; ;
Publisher / Repository:
Annual Reviews
Date Published:
Journal Name:
Annual Review of Analytical Chemistry
Volume:
18
Issue:
1
ISSN:
1936-1327
Page Range / eLocation ID:
193 to 215
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract. Motivation: Tandem mass spectrometry is an essential technology for characterizing chemical compounds at high sensitivity and throughput, and is commonly adopted in many fields. However, computational methods for automated compound identification from MS/MS spectra are still limited, especially for novel compounds that have not been previously characterized. In recent years, in silico methods were proposed to predict the MS/MS spectra of compounds, which can then be used to expand the reference spectral libraries for compound identification. However, these methods did not consider the compounds’ 3D conformations, and thus neglected critical structural information. Results: We present the 3D Molecular Network for Mass Spectra Prediction (3DMolMS), a deep neural network model to predict the MS/MS spectra of compounds from their 3D conformations. We evaluated the model on the experimental spectra collected in several spectral libraries. The results showed that 3DMolMS predicted the spectra with average cosine similarities of 0.691 and 0.478 to the experimental MS/MS spectra acquired in positive and negative ion modes, respectively. Furthermore, the 3DMolMS model can be generalized to the prediction of MS/MS spectra acquired by different labs on different instruments through minor fine-tuning on a small set of spectra. Finally, we demonstrate that the molecular representation learned by 3DMolMS from MS/MS spectra prediction can be adapted to enhance the prediction of chemical properties such as the elution time in liquid chromatography and the collisional cross section measured by ion mobility spectrometry, both of which are often used to improve compound identification. Availability and implementation: The code for 3DMolMS is available at https://github.com/JosieHong/3DMolMS and the web service is at https://spectrumprediction.gnps2.org.
  2. The annotation of small molecules is one of the most challenging and important steps in untargeted mass spectrometry analysis, as most of our biological interpretations rely on structural annotations. Molecular networking has emerged as a structured way to organize and mine data from untargeted tandem mass spectrometry (MS/MS) experiments and has been widely applied to propagate annotations. However, propagation is done through manual inspection of MS/MS spectra connected in the spectral networks and is only possible when a reference library spectrum is available. An alternative approach to annotating an unknown fragmentation mass spectrum is the use of in silico predictions. One of the challenges of in silico annotation is the uncertainty about which structure in the predicted candidate list is correct. Here we show how molecular networking can be used to improve the accuracy of in silico predictions through propagation of structural annotations, even when there is no match to an MS/MS spectrum in spectral libraries. This is accomplished by creating a network consensus of re-ranked structural candidates using the molecular network topology and structural similarity to improve in silico annotations. The Network Annotation Propagation (NAP) tool is accessible through the GNPS web platform at https://gnps.ucsd.edu/ProteoSAFe/static/gnps-theoretical.jsp.
  3. Abstract: Neuropeptides have tremendous potential for application in modern medicine, including utility as biomarkers and therapeutics. To overcome the inherent challenges associated with neuropeptide identification and characterization, data‐independent acquisition (DIA) is a fitting mass spectrometry (MS) method for achieving sensitive and accurate analysis. It is advantageous for preliminary neuropeptidomic studies to occur in less complex organisms, with crustacean models serving as a popular choice due to their relatively simple nervous system. With spectral libraries serving as a means to interpret DIA‐MS output spectra, and Cancer borealis as a model of choice for neuropeptide analysis, we performed the first spectral library mapping of crustacean neuropeptides. Leveraging pre‐existing data‐dependent acquisition (DDA) spectra, a spectral library was built using PEAKS Online. The library comprises 333 unique neuropeptides. The identification results obtained with this spectral library were compared with those achieved through library‐free analysis of crustacean brain, pericardial organ (PO), and thoracic ganglia (TG) tissues. A statistically significant increase (Student's t‐test, P value < 0.05) in the number of identifications achieved from the TG data was observed in the spectral library results. Furthermore, in each of the tissues, a distinctly different set of identifications was found in the library search compared to the library‐free search. This work highlights the necessity of spectral libraries in neuropeptide analysis, illustrating their advantage for interpreting DIA spectra in a reproducible manner with greater neuropeptidomic depth.
  4. Abstract: Plant metabolomes are structurally diverse. One of the most popular techniques for sampling this diversity is liquid chromatography–mass spectrometry (LC‐MS), which typically detects thousands of peaks from single organ extracts, many representing true metabolites. These peaks are usually annotated using in‐house retention time or spectral libraries, in silico fragmentation libraries, and increasingly through computational techniques such as machine learning. Despite these advances, over 85% of LC‐MS peaks remain unidentified, posing a major challenge for data analysis and biological interpretation. This bottleneck limits our ability to fully understand the diversity, functions, and evolution of plant metabolites. In this review, we first summarize current approaches for metabolite identification, highlighting their challenges and limitations. We further focus on alternative strategies that bypass the need for metabolite identification, allowing researchers to interpret global metabolic patterns and pinpoint key metabolite signals. These methods include molecular networking, distance‐based approaches, information theory–based metrics, and discriminant analysis. Additionally, we explore their practical applications in plant science and highlight a set of useful tools to support researchers in analyzing complex plant metabolomics data. By adopting these approaches, researchers can enhance their ability to uncover new insights into plant metabolism.
  5. Rationale: A major hurdle in identifying chemicals in mass spectrometry experiments is the limited availability of tandem mass spectrometry (MS/MS) reference spectra in public databases. Currently, scientists purchase databases or use public databases such as Global Natural Products Social Molecular Networking (GNPS). The MSMS‐Chooser workflow is an open‐source protocol for the creation of MS/MS reference spectra directly in the GNPS infrastructure. Methods: An MSMS‐Chooser Sample Template is provided and completed manually. The MSMS‐Chooser Submission File and Sequence Table for data acquisition were programmatically generated. Standards from the Mass Spectrometry Metabolite Library (MSMLS) suspended in a methanol–water (1:1) solution were analyzed. Flow injection on an LC/MS/MS system was used to generate negative and positive mode data using data‐dependent acquisition. The MS/MS spectra and Submission File were uploaded to the MSMS‐Chooser workflow in GNPS for automatic selection of MS/MS spectra. Results: Data acquisition and processing required ~2 h and ~2 min, respectively, per 96‐well plate using MSMS‐Chooser. Analysis of the MSMLS, over 600 small molecules, using MSMS‐Chooser added 889 spectra (including multiple adducts) to the public library in GNPS. Manual validation of one plate indicated accurate selection of MS/MS scans (true positive rate of 0.96 and true negative rate of 0.99). The MSMS‐Chooser output includes a table formatted for inclusion in the GNPS library as well as the ability to directly launch searches via MASST. Conclusions: MSMS‐Chooser enables rapid data acquisition and data analysis (selection of MS/MS spectra), and provides a formatted table for inspection and upload to GNPS. Open file‐format data (.mzML or .mzXML) from most mass spectrometry platforms containing MS/MS spectra can be processed using MSMS‐Chooser. MSMS‐Chooser democratizes the creation of MS/MS reference spectra in GNPS, which will improve annotation and strengthen the tools that use the annotation information.
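Item 1 in the list above reports agreement between predicted and experimental spectra as a cosine similarity. The following is a minimal sketch of such a score on binned peak lists; the 1 Da bin width and the toy peak lists are assumptions for illustration, and this is not the 3DMolMS evaluation code, whose binning and preprocessing are not described here.

```python
import math
from typing import Dict, List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def to_bins(peaks: List[Peak], bin_width: float = 1.0) -> Dict[int, float]:
    """Sum intensities into sparse m/z bins (bin width is an assumption)."""
    bins: Dict[int, float] = {}
    for mz, intensity in peaks:
        key = int(mz / bin_width)
        bins[key] = bins.get(key, 0.0) + intensity
    return bins


def cosine_similarity(a: List[Peak], b: List[Peak], bin_width: float = 1.0) -> float:
    """Cosine of the angle between two binned spectra; 0.0 if either is empty."""
    va, vb = to_bins(a, bin_width), to_bins(b, bin_width)
    dot = sum(v * vb.get(k, 0.0) for k, v in va.items())
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical spectra for illustration only.
predicted = [(41.04, 30.0), (69.07, 100.0), (151.11, 10.0)]
experimental = [(41.03, 25.0), (69.08, 100.0), (95.05, 40.0)]

score = cosine_similarity(predicted, experimental)
identical = cosine_similarity(predicted, predicted)
disjoint = cosine_similarity([(50.0, 1.0)], [(60.0, 1.0)])
```

For non-negative intensities the score ranges from 0 (no shared bins) to 1 (identical relative intensities). Production tools typically add steps this sketch omits, such as m/z tolerance matching and intensity weighting.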