Abstract Motivation: Tandem mass spectrometry is an essential technology for characterizing chemical compounds at high sensitivity and throughput, and is commonly adopted in many fields. However, computational methods for automated compound identification from MS/MS spectra are still limited, especially for novel compounds that have not been previously characterized. In recent years, in silico methods were proposed to predict the MS/MS spectra of compounds, which can then be used to expand the reference spectral libraries for compound identification. However, these methods did not consider the compounds’ 3D conformations, and thus neglected critical structural information. Results: We present the 3D Molecular Network for Mass Spectra Prediction (3DMolMS), a deep neural network model to predict the MS/MS spectra of compounds from their 3D conformations. We evaluated the model on the experimental spectra collected in several spectral libraries. The results showed that 3DMolMS predicted the spectra with average cosine similarities of 0.691 and 0.478 to the experimental MS/MS spectra acquired in positive and negative ion modes, respectively. Furthermore, the 3DMolMS model can be generalized to the prediction of MS/MS spectra acquired by different labs on different instruments through minor fine-tuning on a small set of spectra. Finally, we demonstrate that the molecular representation learned by 3DMolMS from MS/MS spectra prediction can be adapted to enhance the prediction of chemical properties such as the elution time in liquid chromatography and the collisional cross section measured by ion mobility spectrometry, both of which are often used to improve compound identification. Availability and implementation: The code of 3DMolMS is available at https://github.com/JosieHong/3DMolMS and the web service is at https://spectrumprediction.gnps2.org.
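The cosine similarity used above as the evaluation metric can be illustrated with a minimal sketch. It assumes each spectrum is a list of (m/z, intensity) peaks summed into fixed-width bins; the helper names, the 1.0 m/z bin width, the 1000 m/z upper limit, and the example peaks are all hypothetical and are not taken from the 3DMolMS code, which may preprocess spectra differently.

```python
import math


def bin_spectrum(peaks, bin_width=1.0, max_mz=1000.0):
    """Sum peak intensities into fixed-width m/z bins (assumed preprocessing)."""
    n_bins = int(max_mz / bin_width)
    vec = [0.0] * n_bins
    for mz, intensity in peaks:
        idx = int(mz / bin_width)
        if 0 <= idx < n_bins:  # peaks outside the range are ignored
            vec[idx] += intensity
    return vec


def cosine_similarity(a, b):
    """Cosine of the angle between two intensity vectors; 0.0 if either is empty."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical peak lists, for illustration only.
predicted = [(91.05, 100.0), (119.08, 40.0), (165.07, 10.0)]
experimental = [(91.06, 80.0), (119.09, 55.0), (147.04, 20.0)]

score = cosine_similarity(bin_spectrum(predicted), bin_spectrum(experimental))
print(f"cosine similarity: {score:.3f}")
```

A score of 1 means the binned intensity patterns are proportional, and 0 means the two spectra share no occupied bins. Averaging such scores over a test set would give a figure comparable in kind to the values quoted above.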
Chemsearch: collaborative compound libraries with structure-aware browsing
Abstract Summary: Chemsearch is a cross-platform server application for developing and managing a chemical compound library and associated data files, with an interface for browsing and search that allows for easy navigation to a compound of interest, similar compounds or compounds that have desired structural properties. With provisions for access control and centralized document and data storage, Chemsearch supports collaboration by distributed teams. Availability and implementation: Chemsearch is a free and open-source Flask web application that can be linked to a Google Workspace account. Source code is available at https://github.com/gem-net/chemsearch (GPLv3 license). A Docker image allowing rapid deployment is available at https://hub.docker.com/r/cgemcci/chemsearch.
- Award ID(s): 2002182
- PAR ID: 10311419
- Editor(s): Bahar, Ivet
- Date Published:
- Journal Name: Bioinformatics Advances
- Volume: 1
- Issue: 1
- ISSN: 2635-0041
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Abstract Motivation: Environmental DNA (eDNA), as a rapidly expanding research field, stands to benefit from shared resources including sampling protocols, study designs, discovered sequences, and taxonomic assignments to sequences. High-quality community shareable eDNA resources rely heavily on comprehensive metadata documentation that captures the complex workflows covering field sampling, molecular biology lab work, and bioinformatic analyses. There are limited sources that provide documentation of database development on comprehensive metadata for eDNA and these workflows and no open-source software. Results: We present medna-metadata, an open-source, modular system that aligns with Findable, Accessible, Interoperable, and Reusable guiding principles that support scholarly data reuse and the database and application development of a standardized metadata collection structure that encapsulates critical aspects of field data collection, wet lab processing, and bioinformatic analysis. Medna-metadata is showcased with metabarcoding data from the Gulf of Maine (Polinski et al., 2019). Availability and implementation: The source code of the medna-metadata web application is hosted on GitHub (https://github.com/Maine-eDNA/medna-metadata). Medna-metadata is a docker-compose installable package. Documentation can be found at https://medna-metadata.readthedocs.io/en/latest/?badge=latest. The application is implemented in Python, PostgreSQL and PostGIS, RabbitMQ, and NGINX, with all major browsers supported. A demo can be found at https://demo.metadata.maine-edna.org/. Supplementary information: Supplementary data are available at Bioinformatics online.
- Abstract Summary: Despite the availability of existing calculators for statistical power analysis in genetic association studies, there has not been a model-invariant and test-independent tool that allows for both planning of prospective studies and systematic review of reported findings. In this work, we develop a web-based application U-PASS (Unified Power analysis of ASsociation Studies), implementing a unified framework for the analysis of common association tests for binary qualitative traits. The application quantifies the shared asymptotic power limits of the common association tests, and visualizes the fundamental statistical trade-off between risk allele frequency and odds ratio. The application also addresses the applicability of asymptotics-based power calculations in finite samples, and provides guidelines for single-SNP-based association tests. In addition to designing prospective studies, U-PASS enables researchers to retrospectively assess the statistical validity of previously reported associations. Availability and implementation: U-PASS is an open-source R Shiny application. A live instance is hosted at https://power.stat.lsa.umich.edu. Source is available on https://github.com/Pill-GZ/U-PASS. Supplementary information: Supplementary data are available at Bioinformatics online.
- Abstract Motivation: Computational methods for compound–protein affinity and contact (CPAC) prediction aim at facilitating rational drug discovery by simultaneous prediction of the strength and the pattern of compound–protein interactions. Although the desired outputs are highly structure-dependent, the lack of protein structures often makes structure-free methods rely on protein sequence inputs alone. The scarcity of compound–protein pairs with affinity and contact labels further limits the accuracy and the generalizability of CPAC models. Results: To overcome the aforementioned challenges of structure naivety and labeled-data scarcity, we introduce cross-modality and self-supervised learning, respectively, for structure-aware and task-relevant protein embedding. Specifically, protein data are available in both modalities of 1D amino-acid sequences and predicted 2D contact maps, which are separately embedded with recurrent and graph neural networks, respectively, as well as jointly embedded with two cross-modality schemes. Furthermore, both protein modalities are pre-trained under various self-supervised learning strategies, by leveraging massive amounts of unlabeled protein data. Our results indicate that individual protein modalities differ in their strengths of predicting affinities or contacts. Proper cross-modality protein embedding combined with self-supervised learning improves model generalizability when predicting both affinities and contacts for unseen proteins. Availability and implementation: Data and source code are available at https://github.com/Shen-Lab/CPAC. Supplementary information: Supplementary data are available at Bioinformatics online.
- Abstract. Accurate representation of fire emissions is critical for modeling the in-plume, near-source, and remote effects of biomass burning (BB) on atmospheric composition, air quality, and climate. In recent years application of advanced instrumentation has significantly improved knowledge of the compounds emitted from fires, which, coupled with a large number of recent laboratory and field campaigns, has facilitated the emergence of new emission factor (EF) compilations. The Next-generation Emissions InVentory expansion of Akagi (NEIVA) version 1.0 is one such compilation in which the EFs for 14 globally relevant fuel and fire types have been updated to include data from recent studies, with a focus on gaseous non-methane organic compounds (NMOC_g). The data are stored in a series of connected tables that facilitate flexible querying from the individual study level to recommended averages of all laboratory and field data by fire type. The querying features are enabled by assignment of unique identifiers to all compounds and constituents, including thousands of NMOC_g. NEIVA also includes chemical and physical property data and model surrogate assignments for three widely used chemical mechanisms for each NMOC_g. NEIVA EF datasets are compared with recent publications and other EF compilations at the individual compound level and in the context of overall volatility distributions and hydroxyl (OH) reactivity (OHR) estimates. The NMOC_g in NEIVA include ∼4–8 times more compounds with improved representation of intermediate volatility organic compounds, resulting in much lower overall volatility (lowest-volatility bin shifted by as much as 3 orders of magnitude) and significantly higher OHR (up to 90 %) than other compilations. These updates can strongly impact model predictions of the effects of BB on atmospheric composition and chemistry.

