Title: Chemsearch: collaborative compound libraries with structure-aware browsing
Abstract

Summary

Chemsearch is a cross-platform server application for developing and managing a chemical compound library and associated data files, with an interface for browsing and search that allows for easy navigation to a compound of interest, similar compounds or compounds that have desired structural properties. With provisions for access control and centralized document and data storage, Chemsearch supports collaboration by distributed teams.

Availability and implementation

Chemsearch is a free and open-source Flask web application that can be linked to a Google Workspace account. Source code is available at https://github.com/gem-net/chemsearch (GPLv3 license). A Docker image allowing rapid deployment is available at https://hub.docker.com/r/cgemcci/chemsearch.
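The abstract does not say how "similar compounds" are ranked. As a rough illustration only, the sketch below scores similarity with the Tanimoto coefficient over sets of fingerprint bits, a common choice in cheminformatics. The function names, the toy fingerprints, and the 0.5 threshold are all hypothetical and are not taken from the Chemsearch source code.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    if not a and not b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def rank_similar(query: frozenset, library: dict, threshold: float = 0.5) -> list:
    """Return (name, score) pairs scoring at or above the threshold, best first."""
    scored = [(name, tanimoto(query, bits)) for name, bits in library.items()]
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)


# Toy library: the bit positions are made up and stand in for real fingerprints.
library = {
    "compound_a": frozenset({1, 4, 7, 9}),
    "compound_b": frozenset({1, 4, 8}),
    "compound_c": frozenset({2, 3}),
}
query = frozenset({1, 4, 7})
hits = rank_similar(query, library)
print(hits)
```

In practice the bit sets would come from a cheminformatics toolkit rather than being written by hand, and the threshold would be tuned to the library.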
Award ID(s):
2002182
NSF-PAR ID:
10311419
Author(s) / Creator(s):
; ; ;
Editor(s):
Bahar, Ivet
Date Published:
Journal Name:
Bioinformatics Advances
Volume:
1
Issue:
1
ISSN:
2635-0041
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    This study investigates the use of airborne in situ measurements to derive transit time distributions (TTD) from the boundary layer (BL) to the upper troposphere (UT) over the highly convective tropical western Pacific (TWP). The feasibility of this method is demonstrated using 42 volatile organic compounds (VOCs) measured during the Convective Transport of Active Species in the Tropics (CONTRAST) experiment. Two important approximations necessary for the application are the constant chemical lifetimes for each compound and the representation of the BL source by the local CONTRAST data. To characterize uncertainties associated with the first approximation, we quantify the changes in derived TTDs when chemical lifetimes are estimated using conditions of the BL, UT, and tropospheric average. With the support of a trajectory model study in a companion paper, we characterize the BL source region contributing to the transport to the sampled UT. In addition to the TTDs derived using a regional average, we analyze the potential information content in locally averaged measurements to represent the dynamical variability of the region. Around 150 TTDs, derived using measurements on a ∼100 km spatial scale, show a distribution of mean and mode transit times consistent with the wide range of convective conditions encountered during the campaign. Two extreme cases, with the shortest and the near‐longest TTD, are examined using the dynamical background of the measurements and back trajectory analyses. The result provides physical consistency supporting the hypothesis that sufficient information can be obtained from measurements to resolve dynamical variability of the region.

     
  2. Abstract Motivation

    Accurately predicting the likelihood of interaction between two objects (compound–protein sequence, user–item, author–paper, etc.) is a fundamental problem in Computer Science. Current deep-learning models rely on learning accurate representations of the interacting objects. Importantly, relationships between the interacting objects, or features of the interaction, offer an opportunity to partition the data to create multi-views of the interacting objects. The resulting congruent and non-congruent views can then be exploited via contrastive learning techniques to learn enhanced representations of the objects.

    Results

    We present a novel method, Contrastive Stratification for Interaction Prediction (CSI), to stratify (partition) a dataset in a manner that can be exploited via Contrastive Multiview Coding to learn embeddings that maximize the mutual information across congruent data views. CSI assigns a key and multiple views to each data point, where data partitions under a particular key form congruent views of the data. We showcase the effectiveness of CSI by applying it to the compound–protein sequence interaction prediction problem, a pressing problem whose solution promises to expedite drug delivery (drug–protein interaction prediction), metabolic engineering, and synthetic biology (compound–enzyme interaction prediction) applications. Comparing CSI with a baseline model that uses neither data stratification nor contrastive learning, we show gains in average precision ranging from 13.7% to 39% using compounds and sequences as keys across multiple drug–target and enzymatic datasets, and gains ranging from 16.9% to 63% using reaction features as keys across enzymatic datasets.

    Availability and implementation

    Code and dataset are available at https://github.com/HassounLab/CSI.

     
  3. ABSTRACT

    The marine unicellular cyanobacterium Prochlorococcus is an abundant primary producer and widespread inhabitant of the photic layer in tropical and subtropical marine ecosystems, where the inorganic nutrients required for growth are limiting. In this study, we demonstrate that Prochlorococcus high-light strain MIT9301, an isolate from the phosphate-depleted subtropical North Atlantic Ocean, can oxidize methylphosphonate (MPn) and hydroxymethylphosphonate (HMPn), two phosphonate compounds present in marine dissolved organic matter, to obtain phosphorus. The oxidation of these phosphonates releases the methyl group as formate, which is both excreted and assimilated into purines in RNA and DNA. Genes encoding the predicted phosphonate oxidative pathway of MIT9301 were predominantly present in Prochlorococcus genomes from parts of the North Atlantic Ocean where phosphate availability is typically low, suggesting that phosphonate oxidation is an ecosystem-specific adaptation of some Prochlorococcus populations to cope with phosphate scarcity.

    IMPORTANCE

    Until recently, MPn was only known to be degraded in the environment by the bacterial carbon-phosphorus (CP) lyase pathway, a reaction that releases the greenhouse gas methane. The identification of a formate-yielding MPn oxidative pathway in the marine planctomycete Gimesia maris (S. R. Gama, M. Vogt, T. Kalina, K. Hupp, et al., ACS Chem Biol 14:735–741, 2019, https://doi.org/10.1021/acschembio.9b00024) and the presence of this pathway in Prochlorococcus indicate that this compound can follow an alternative fate in the environment while providing a valuable source of P to organisms. In the ocean, where MPn is a major component of dissolved organic matter, the oxidation of MPn to formate by Prochlorococcus may direct the flow of this one-carbon compound to carbon dioxide or assimilation into biomass, thus limiting the production of methane.
  4. Abstract Motivation

    Tandem mass spectrometry is an essential technology for characterizing chemical compounds at high sensitivity and throughput, and is commonly adopted in many fields. However, computational methods for automated compound identification from their MS/MS spectra are still limited, especially for novel compounds that have not been previously characterized. In recent years, in silico methods were proposed to predict the MS/MS spectra of compounds, which can then be used to expand the reference spectral libraries for compound identification. However, these methods did not consider the compounds’ 3D conformations, and thus neglected critical structural information.

    Results

    We present the 3D Molecular Network for Mass Spectra Prediction (3DMolMS), a deep neural network model to predict the MS/MS spectra of compounds from their 3D conformations. We evaluated the model on the experimental spectra collected in several spectral libraries. The results showed that 3DMolMS predicted the spectra with average cosine similarities of 0.691 and 0.478 to the experimental MS/MS spectra acquired in positive and negative ion modes, respectively. Furthermore, the 3DMolMS model can be generalized to the prediction of MS/MS spectra acquired by different labs on different instruments through minor fine-tuning on a small set of spectra. Finally, we demonstrate that the molecular representation learned by 3DMolMS from MS/MS spectra prediction can be adapted to enhance the prediction of chemical properties such as elution time in liquid chromatography and the collisional cross section measured by ion mobility spectrometry, both of which are often used to improve compound identification.

    Availability and implementation

    The code for 3DMolMS is available at https://github.com/JosieHong/3DMolMS and the web service is at https://spectrumprediction.gnps2.org.

     
  5. Abstract Motivation

    Computational methods for compound–protein affinity and contact (CPAC) prediction aim at facilitating rational drug discovery by simultaneous prediction of the strength and the pattern of compound–protein interactions. Although the desired outputs are highly structure-dependent, the lack of protein structures often makes structure-free methods rely on protein sequence inputs alone. The scarcity of compound–protein pairs with affinity and contact labels further limits the accuracy and the generalizability of CPAC models.

    Results

    To overcome the aforementioned challenges of structure naivety and labeled-data scarcity, we introduce cross-modality and self-supervised learning, respectively, for structure-aware and task-relevant protein embedding. Specifically, protein data are available in two modalities, 1D amino-acid sequences and predicted 2D contact maps, which are separately embedded with recurrent and graph neural networks, respectively, as well as jointly embedded with two cross-modality schemes. Furthermore, both protein modalities are pre-trained under various self-supervised learning strategies by leveraging massive amounts of unlabeled protein data. Our results indicate that individual protein modalities differ in their strengths for predicting affinities or contacts. Proper cross-modality protein embedding combined with self-supervised learning improves model generalizability when predicting both affinities and contacts for unseen proteins.

    Availability and implementation

    Data and source code are available at https://github.com/Shen-Lab/CPAC.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     