This content will become publicly available on December 18, 2025

Title: Oracle Embeddings for Chemical Detection
The accurate detection of chemical agents promotes many national security and public safety goals, and robust chemical detection methods can prevent disasters and support effective response to incidents. Mass spectrometry is an important tool for detecting and identifying chemical agents. However, acquiring enough lab-generated mass spectrometry data to train machine learning algorithms is costly and logistically difficult, because data generation requires skilled personnel, sample preparation, and analysis. These costs hinder the development of machine learning and deep learning models to detect and identify chemical agents. Accordingly, the primary objective of our research is to create a mass spectrometry data generation model whose output (synthetic mass spectrometry data) enhances the performance of downstream machine learning chemical classification models. Such a synthetic data generation model would reduce the need to generate costly real-world data and would provide additional training data to use alongside lab-generated mass spectrometry data when training classifiers. Our approach is a novel combination of autoencoder-based synthetic data generation with a fixed, a priori defined hidden layer geometry. In particular, we train pairs of encoders and decoders with an additional loss term that forces the hidden layer passed from the encoder to the decoder to match the embedding provided by an external deep learning model designed to predict functional properties of chemicals. We have verified that incorporating our synthetic spectra into a lab-generated dataset enhances the performance of classification algorithms compared to using only the real data. Our synthetic spectra have also been successfully matched to lab-generated spectra for their respective chemicals using library matching software, further demonstrating the validity of our work.
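The training objective described in the abstract (a reconstruction loss plus a term that ties the hidden layer to an external embedding) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the linear encoder and decoder, the use of mean squared error for both terms, and the names `oracle_autoencoder_loss` and `embedding_weight` are all assumptions made here for clarity. The abstract does not specify the architecture, the loss functions, or their weighting.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def matvec(weights: Matrix, x: Vector) -> List[float]:
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def mse(a: Vector, b: Vector) -> float:
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b)) / len(a)


def oracle_autoencoder_loss(
    spectrum: Vector,
    oracle_embedding: Vector,
    enc_weights: Matrix,
    dec_weights: Matrix,
    embedding_weight: float = 1.0,  # hypothetical trade-off hyperparameter
) -> float:
    """Reconstruction loss plus a penalty tying the latent code to a fixed embedding.

    Assumption: a linear encoder/decoder stands in for the real networks,
    and MSE stands in for whatever losses the authors actually use.
    """
    latent = matvec(enc_weights, spectrum)        # encoder output (hidden layer)
    reconstruction = matvec(dec_weights, latent)  # decoder output (synthetic spectrum)
    reconstruction_loss = mse(reconstruction, spectrum)
    # Extra term: the hidden layer should match the externally supplied embedding.
    embedding_loss = mse(latent, oracle_embedding)
    return reconstruction_loss + embedding_weight * embedding_loss
```

In a sketch like this, the embedding for each chemical is fixed ahead of time by the external model, so only the encoder and decoder weights would be updated during training. Once the penalty has pulled the latent space into that fixed geometry, a decoder could in principle be fed the external embedding directly to produce a synthetic spectrum, though the abstract does not state how generation is actually performed.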
Award ID(s):
2021871
PAR ID:
10620869
Publisher / Repository:
IEEE
ISBN:
979-8-3503-7488-9
Page Range / eLocation ID:
272 to 279
Format(s):
Medium: X
Location:
Miami, FL, USA
Sponsoring Org:
National Science Foundation
More Like this
  1. Mass spectrometry is the dominant technology in the field of proteomics, enabling high-throughput analysis of the protein content of complex biological samples. Due to the complexity of the instrumentation and resulting data, sophisticated computational methods are required for the processing and interpretation of acquired mass spectra. Machine learning has shown great promise to improve the analysis of mass spectrometry data, with numerous purpose-built methods for improving specific steps in the data acquisition and analysis pipeline reaching widespread adoption. Here, we propose unifying various spectrum prediction tasks under a single foundation model for mass spectra. To this end, we pre-train a spectrum encoder using de novo sequencing as a pre-training task. We then show that using these pre-trained spectrum representations improves our performance on the four downstream tasks of spectrum quality prediction, chimericity prediction, phosphorylation prediction, and glycosylation status prediction. Finally, we perform multi-task fine-tuning and find that this approach improves the performance on each task individually. Overall, our work demonstrates that a foundation model for tandem mass spectrometry proteomics trained on de novo sequencing learns generalizable representations of spectra, improves performance on downstream tasks where training data is limited, and can ultimately enhance data acquisition and analysis in proteomics experiments. 
  2. In this study, we consider three different machine‐learning methods—a three‐hidden‐layer neural network, support vector regression, and Gaussian process regression—and compare how well they can learn from a synthetic data set for proton acceleration in the Target Normal Sheath Acceleration regime. The synthetic data set was generated from a previously published theoretical model by Fuchs et al. 2005 that we modified. Once trained, these machine‐learning methods can assist with efforts to maximize the peak proton energy, or with the more general problem of configuring the laser system to produce a proton energy spectrum with desired characteristics. In our study, we focus on both the accuracy of the machine‐learning methods and the performance on one GPU including memory consumption. Although it is arguably the least sophisticated machine‐learning model we considered, support vector regression performed very well in our tests. 
  3. Thermal desorption/degradation with an atmospheric solids analysis probe (ASAP) and ion mobility (IM) separation are coupled with mass spectrometry (MS) analysis and tandem mass spectrometry (MS/MS) fragmentation to characterize thermoplastic elastomers. The compounds investigated, which are used in the manufacture of a wide variety of packaging materials, are mainly composed of thermoplastic copolymers, but also contain additional chemicals (“additives”), like antioxidants and UV stabilizers, for enhancement of their properties or protection from degradation. The traditional method for analyzing such complex mixtures is vacuum pyrolysis followed by electron or chemical ionization mass spectrometry, often after gas chromatography separation. Here, an alternative, faster approach, involving mild degradation at atmospheric pressure (ASAP) and subsequent characterization of the desorbates and pyrolyzates by IM‐MS, and if needed, MS/MS is presented. Such multidimensional dispersion considerably simplifies the resulting spectra, permitting the conclusive separation, characterization, and classification of the multicomponent materials examined.
  4. Synthetic data is highly useful for training machine learning systems performing image-based 3D reconstruction, as synthetic data has applications in both extending existing generalizable datasets and being tailored to train neural networks for specific learning tasks of interest. In this paper, we introduce and utilize a synthetic data generation suite capable of generating data given existing 3D scene models as input. Specifically, we use our tool to generate image sequences for use with Multi-View Stereo (MVS), moving a camera through the virtual space according to user-chosen camera parameters. We evaluate how the given camera parameters and type of 3D environment affect how applicable the generated image sequences are to the MVS task using five pre-trained neural networks on image sequences generated from three different 3D scene datasets. We obtain generated predictions for each combination of parameter value and input image sequence, using standard error metrics to analyze the differences in depth predictions on image sequences across 3D datasets, parameters, and networks. Among other results, we find that camera height and vertical camera viewing angle are the parameters that cause the most variation in depth prediction errors on these image sequences. 
  5. Tandem mass spectrometry (MS/MS) is crucial for small-molecule analysis; however, traditional computational methods are limited by incomplete reference libraries and complex data processing. Machine learning (ML) is transforming small-molecule mass spectrometry in three key directions: (a) predicting MS/MS spectra and related physicochemical properties to expand reference libraries, (b) improving spectral matching through automated pattern extraction, and (c) predicting molecular structures of compounds directly from their MS/MS spectra. We review ML approaches for molecular representations [descriptors, simplified molecular-input line-entry system (SMILES) strings, and graphs] and MS/MS spectra representations (using binned vectors and peak lists) along with recent advances in spectra prediction, retention time, collision cross sections, and spectral matching. Finally, we discuss ML-integrated workflows for chemical formula identification. By addressing the limitations of current methods for compound identification, these ML approaches can greatly enhance the understanding of biological processes and the development of diagnostic and therapeutic tools.