Title: 3DMolMS: prediction of tandem mass spectra from 3D molecular conformations
Abstract
Motivation: Tandem mass spectrometry is an essential technology for characterizing chemical compounds at high sensitivity and throughput, and is commonly adopted in many fields. However, computational methods for automated compound identification from MS/MS spectra are still limited, especially for novel compounds that have not been previously characterized. In recent years, in silico methods have been proposed to predict the MS/MS spectra of compounds, which can then be used to expand the reference spectral libraries for compound identification. However, these methods did not consider the compounds’ 3D conformations, and thus neglected critical structural information.
Results: We present the 3D Molecular Network for Mass Spectra Prediction (3DMolMS), a deep neural network model that predicts the MS/MS spectra of compounds from their 3D conformations. We evaluated the model on the experimental spectra collected in several spectral libraries. The predicted spectra reached average cosine similarities of 0.691 and 0.478 with the experimental MS/MS spectra acquired in positive and negative ion modes, respectively. Furthermore, the 3DMolMS model can be generalized to the prediction of MS/MS spectra acquired by different labs on different instruments through minor fine-tuning on a small set of spectra. Finally, we demonstrate that the molecular representation learned by 3DMolMS from MS/MS spectra prediction can be adapted to enhance the prediction of chemical properties such as the elution time in liquid chromatography and the collisional cross section measured by ion mobility spectrometry, both of which are often used to improve compound identification.
Availability and implementation: The code of 3DMolMS is available at https://github.com/JosieHong/3DMolMS and the web service is at https://spectrumprediction.gnps2.org.
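The abstract reports agreement between predicted and experimental spectra as a cosine similarity. The sketch below illustrates one common way such a score can be computed: bin each spectrum's peaks by m/z and take the cosine of the angle between the two intensity vectors. The function names, the 0.2 Da bin width, and the example peak lists are assumptions made for illustration; they are not taken from the 3DMolMS paper or its code, whose preprocessing may differ.

```python
import math
from collections import defaultdict


def bin_spectrum(peaks, bin_width=0.2):
    """Sum peak intensities into fixed-width m/z bins.

    `peaks` is a list of (mz, intensity) pairs. The bin width is an
    illustrative assumption, not a value from the 3DMolMS paper.
    Returns a sparse vector as {bin_index: summed_intensity}.
    """
    binned = defaultdict(float)
    for mz, intensity in peaks:
        binned[int(mz // bin_width)] += intensity
    return dict(binned)


def cosine_similarity(a, b):
    """Cosine similarity between two sparse binned spectra."""
    dot = sum(value * b.get(index, 0.0) for index, value in a.items())
    norm_a = math.sqrt(sum(value * value for value in a.values()))
    norm_b = math.sqrt(sum(value * value for value in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty spectrum matches nothing
    return dot / (norm_a * norm_b)


# Hypothetical peak lists, made up for this example only.
predicted = [(55.05, 0.30), (91.05, 1.00), (119.08, 0.60)]
experimental = [(55.05, 0.25), (91.06, 0.90), (150.10, 0.40)]

score = cosine_similarity(bin_spectrum(predicted), bin_spectrum(experimental))
print(f"cosine similarity: {score:.3f}")
```

A score of 1 means the two binned spectra have identical relative intensities, and 0 means they share no occupied bins. Wider bins tolerate more m/z error but merge nearby fragments, so the chosen width affects the score.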
Award ID(s):
2011271 1916645
PAR ID:
10424582
Author(s) / Creator(s):
; ; ; ; ; ;
Publisher / Repository:
Oxford University Press
Date Published:
Journal Name:
Bioinformatics
Volume:
39
Issue:
6
ISSN:
1367-4811
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract. Motivation: Tandem mass spectrometry (MS/MS) is a crucial technology for large-scale proteomic analysis. Protein database search and spectral library search are commonly used for peptide identification from MS/MS spectra; however, both may face challenges due to experimental variations between replicated spectra and similar fragmentation patterns among distinct peptides. To address these challenges, we present SpecEncoder, a deep metric learning approach that transforms MS/MS spectra into robust and sensitive embedding vectors in a latent space. The SpecEncoder model can also embed predicted MS/MS spectra of peptides, enabling a hybrid search approach that combines spectral library and protein database searches for peptide identification. Results: We evaluated SpecEncoder on three large human proteomics datasets, and the results showed a consistent improvement in peptide identification. For spectral library search, SpecEncoder identifies 1%–2% more unique peptides (and PSMs) than SpectraST. For protein database search, it identifies 6%–15% more unique peptides than MSGF+ enhanced by Percolator. Furthermore, SpecEncoder identified 6%–12% additional unique peptides when utilizing a combined library of experimental and predicted spectra. SpecEncoder can also identify more peptides than deep-learning enhanced methods (MSFragger boosted by MSBooster). These results demonstrate SpecEncoder’s potential to enhance peptide identification for proteomic data analyses. Availability and implementation: The source code and scripts for SpecEncoder and peptide identification are available on GitHub at https://github.com/lkytal/SpecEncoder. Contact: hatang@iu.edu.
  2. Background: Spectral library searching is currently the most common approach for compound annotation in untargeted metabolomics. Spectral libraries applicable to liquid chromatography mass spectrometry have grown in size over the past decade to include hundreds of thousands to millions of mass spectra and tens of thousands of compounds, forming an essential knowledge base for the interpretation of metabolomics experiments. Aim of Review: We describe existing spectral library resources, highlight different strategies for compiling spectral libraries, and discuss quality considerations that should be taken into account when interpreting spectral library searching results. Finally, we describe how spectral libraries are empowering the next generation of machine learning tools in computational metabolomics, and discuss several opportunities for using increasingly accessible large spectral libraries. Key Scientific Concepts of Review: This review focuses on the current state of spectral libraries for untargeted LC-MS/MS based metabolomics. We show how the number of entries in publicly accessible spectral libraries has increased more than 60-fold in the past eight years to aid molecular interpretation and we discuss how the role of spectral libraries in untargeted metabolomics will evolve in the near future. 
  3. Introduction: Dissolved organic matter (DOM) composition varies over space and time, with a multitude of factors driving the presence or absence of each compound found in the complex DOM mixture. Compounds ubiquitously present across a wide range of river systems (hereafter termed core compounds) may differ in chemical composition and reactivity from compounds present in only a few settings (hereafter termed satellite compounds). Here, we investigated the spatial patterns in DOM molecular formulae presence (occupancy) in surface water and sediments across 97 river corridors at a continental scale using the “Worldwide Hydrobiogeochemical Observation Network for Dynamic River Systems—WHONDRS” research consortium. Methods: We used a novel data-driven approach to identify core and satellite compounds and compared their molecular properties identified with Fourier-transform ion cyclotron resonance mass spectrometry (FT-ICR MS). Results: We found that core compounds clustered around intermediate hydrogen/carbon and oxygen/carbon ratios across both sediment and surface water samples, whereas the satellite compounds varied widely in their elemental composition. Within surface water samples, core compounds were dominated by lignin-like formulae, whereas protein-like formulae dominated the core pool in sediment samples. In contrast, satellite molecular formulae were more evenly distributed between compound classes in both sediment and water molecules. Core compounds found in both sediment and water exhibited lower molecular mass, lower oxidation state, and a higher degree of aromaticity, and were inferred to be more persistent than global satellite compounds. Higher putative biochemical transformations were found in core than satellite compounds, suggesting that the core pool was more processed. Discussion: The observed differences in chemical properties of core and satellite compounds point to potential differences in their sources and contribution to DOM processing in river corridors. Overall, our work points to the potential of data-driven approaches separating rare and common compounds to reduce some of the complexity inherent in studying riverine DOM.
  4. Rationale: Silicone wristbands have emerged as valuable passive samplers for monitoring of personal exposure to environmental contaminants in the rapidly developing field of exposomics. Once deployed, silicone wristbands collect and hold a wealth of chemical information that can be interrogated using high‐resolution mass spectrometry (HRMS) to provide a broad coverage of chemical mixtures. Methods: Gas chromatography coupled to Orbitrap™ mass spectrometry (GC/Orbitrap™ MS) was used to simultaneously perform suspect screening (using an in‐house database) and unknown screening (using vendor databases) of extracts from wristbands worn by volunteers. The goal of this study was to optimize a workflow that allows detection of low levels of priority pollutants with high reliability. In this regard, a data processing workflow for GC/Orbitrap™ MS was developed using a mixture of 123 environmentally relevant standards consisting of pesticides, flame retardants, organophosphate esters, and polycyclic aromatic hydrocarbons as test compounds. Results: The optimized unknown screening workflow using a search index threshold of 750 resulted in positive identification of 70 analytes in validation samples, and a reduction in the number of false positives by over 50%. An average of 26 compounds with high confidence identification, 7 level 1 compounds and 19 level 2 compounds, were observed in worn wristbands. The data were further analyzed via suspect screening and retrospective suspect screening to identify an additional 36 compounds. Conclusions: This study provides three important findings: (1) clear evidence of the importance of sample cleanup in addressing complex sample matrices for unknown analysis, (2) a valuable workflow for the identification of unknown contaminants in silicone wristband samplers using electron ionization HRMS data, and (3) a novel application of GC/Orbitrap™ MS for the unknown analysis of organic contaminants that can be used in exposomics studies.
  5. Abstract. Motivation: Driven by technological advances, the throughput and cost of mass spectrometry (MS) proteomics experiments have improved by orders of magnitude in recent decades. Spectral library searching is a common approach to annotating experimental mass spectra by matching them against large libraries of reference spectra corresponding to known peptides. An important disadvantage, however, is that only peptides included in the spectral library can be found, whereas novel peptides, such as those with unexpected post-translational modifications (PTMs), will remain unknown. Open modification searching (OMS) is an increasingly popular approach to annotate modified peptides based on partial matches against their unmodified counterparts. Unfortunately, this leads to very large search spaces and excessive runtimes, which is especially problematic considering the continuously increasing sizes of MS proteomics datasets. Results: We propose an OMS algorithm, called HOMS-TC, that fully exploits parallelism in the entire pipeline of spectral library searching. We designed a new highly parallel encoding method based on the principle of hyperdimensional computing to encode mass spectral data to hypervectors while minimizing information loss. This process can be easily parallelized since each dimension is calculated independently. HOMS-TC processes two stages of existing cascade search in parallel and selects the most similar spectra while considering PTMs. We accelerate HOMS-TC on NVIDIA’s tensor core units, which are emerging and readily available in recent graphics processing units (GPUs). Our evaluation shows that HOMS-TC is 31× faster on average than alternative search engines and provides comparable accuracy to competing search tools. Availability and implementation: HOMS-TC is freely available under the Apache 2.0 license as an open-source software project at https://github.com/tycheyoung/homs-tc.