Title: Chemsearch: collaborative compound libraries with structure-aware browsing
Abstract Summary

Chemsearch is a cross-platform server application for developing and managing a chemical compound library and associated data files, with an interface for browsing and search that allows for easy navigation to a compound of interest, similar compounds or compounds that have desired structural properties. With provisions for access control and centralized document and data storage, Chemsearch supports collaboration by distributed teams.

Availability and implementation

Chemsearch is a free and open-source Flask web application that can be linked to a Google Workspace account. Source code is available at (GPLv3 license). A Docker image allowing rapid deployment is available at
Bahar, Ivet
Journal Name: Bioinformatics Advances
Sponsoring Org: National Science Foundation
More Like this
  1. ABSTRACT The marine unicellular cyanobacterium Prochlorococcus is an abundant primary producer and widespread inhabitant of the photic layer in tropical and subtropical marine ecosystems, where the inorganic nutrients required for growth are limiting. In this study, we demonstrate that Prochlorococcus high-light strain MIT9301, an isolate from the phosphate-depleted subtropical North Atlantic Ocean, can oxidize methylphosphonate (MPn) and hydroxymethylphosphonate (HMPn), two phosphonate compounds present in marine dissolved organic matter, to obtain phosphorus. The oxidation of these phosphonates releases the methyl group as formate, which is both excreted and assimilated into purines in RNA and DNA. Genes encoding the predicted phosphonate oxidative pathway of MIT9301 were predominantly present in Prochlorococcus genomes from parts of the North Atlantic Ocean where phosphate availability is typically low, suggesting that phosphonate oxidation is an ecosystem-specific adaptation of some Prochlorococcus populations to cope with phosphate scarcity. IMPORTANCE Until recently, MPn was only known to be degraded in the environment by the bacterial carbon-phosphorus (CP) lyase pathway, a reaction that releases the greenhouse gas methane. The identification of a formate-yielding MPn oxidative pathway in the marine planctomycete Gimesia maris (S. R. Gama, M. Vogt, T. Kalina, K. Hupp, et al., ACS Chem Biol 14:735–741, 2019) and the presence of this pathway in Prochlorococcus indicate that this compound can follow an alternative fate in the environment while providing a valuable source of P to organisms. In the ocean, where MPn is a major component of dissolved organic matter, the oxidation of MPn to formate by Prochlorococcus may direct the flow of this one-carbon compound to carbon dioxide or assimilation into biomass, thus limiting the production of methane.
  2. Abstract

    The budding field of materials informatics has coincided with a shift towards artificial intelligence to discover new solid-state compounds. The steady expansion of repositories for crystallographic and computational data has set the stage for developing data-driven models capable of predicting a bevy of physical properties. Machine learning methods, in particular, have already shown the ability to identify materials with near-ideal properties for energy-related applications by screening crystal structure databases. However, examples of the data-guided discovery of entirely new, never-before-reported compounds remain limited. The critical step for determining if an unknown compound is synthetically accessible is obtaining the formation energy and constructing the associated convex hull. Fortunately, this information has become widely available through density functional theory (DFT) data repositories to the point that they can be used to develop machine learning models. In this Review, we discuss the specific design choices for developing a machine learning model capable of predicting formation energy, including the thermodynamic quantities governing material stability. We investigate several models presented in the literature that cover various possible architectures and feature sets and find that they have succeeded in uncovering new DFT-stable compounds and directing materials synthesis. To expand access to machine learning models for synthetic solid-state chemists, we additionally present MatLearn. This web-based application is intended to guide the exploration of a composition diagram towards regions likely to contain thermodynamically accessible inorganic compounds. Finally, we discuss the future of machine-learned formation energy and highlight the opportunities for improved predictive power toward the synthetic realization of new energy-related materials.
  3. Abstract Motivation

    Computational methods for compound–protein affinity and contact (CPAC) prediction aim at facilitating rational drug discovery by simultaneous prediction of the strength and the pattern of compound–protein interactions. Although the desired outputs are highly structure-dependent, the lack of protein structures often makes structure-free methods rely on protein sequence inputs alone. The scarcity of compound–protein pairs with affinity and contact labels further limits the accuracy and the generalizability of CPAC models.


    To overcome the aforementioned challenges of structure naivety and labeled-data scarcity, we introduce cross-modality and self-supervised learning, respectively, for structure-aware and task-relevant protein embedding. Specifically, protein data are available in two modalities, 1D amino-acid sequences and predicted 2D contact maps, which are embedded separately with recurrent and graph neural networks, respectively, and jointly with two cross-modality schemes. Furthermore, both protein modalities are pre-trained under various self-supervised learning strategies by leveraging massive amounts of unlabeled protein data. Our results indicate that individual protein modalities differ in their strengths of predicting affinities or contacts. Proper cross-modality protein embedding combined with self-supervised learning improves model generalizability when predicting both affinities and contacts for unseen proteins.

    Availability and implementation

    Data and source codes are available at

    Supplementary information

    Supplementary data are available at Bioinformatics online.
  4. Abstract Summary

    Although advances in untargeted metabolomics have made it possible to gather data on thousands of cellular metabolites in parallel, identification of novel metabolites from these datasets remains challenging. To address this need, Metabolic in silico Network Expansions (MINEs) were developed. A MINE is an expansion of known biochemistry which can be used as a list of potential structures for unannotated metabolomics peaks. Here, we present MINE 2.0, which utilizes a new set of biochemical transformation rules that covers 93% of MetaCyc reactions (compared to 25% in MINE 1.0). This results in a 17-fold increase in database size and a 40% increase in MINE database compounds matching unannotated peaks from an untargeted metabolomics dataset. MINE 2.0 is thus a significant improvement to this community resource.

    Availability and implementation

    The MINE 2.0 website can be accessed at The MINE 2.0 web API documentation can be accessed at The data and code underlying this article are available in the MINE-2.0-Paper repository at MINE 2.0 source code can be accessed at (MINE construction), (backend web API) and (web app).

    Supplementary information

    Supplementary data are available at Bioinformatics online.

  5. We present Descending from Stochastic Clustering Variance Regression (DiSCoVeR), a Python tool for identifying and assessing high-performing, chemically unique compositions relative to existing compounds using a combination of a chemical distance metric, density-aware dimensionality reduction, clustering, and a regression model. In this work, we create pairwise distance matrices between compounds via Element Mover's Distance (ElMD) and use these to create 2D density-aware embeddings for chemical compositions via Density-preserving Uniform Manifold Approximation and Projection (DensMAP). Because ElMD assigns distances between compounds that are more chemically intuitive than Euclidean-based distances, the compounds can then be clustered into chemically homogeneous clusters via Hierarchical Density-based Spatial Clustering of Applications with Noise (HDBSCAN*). In combination with performance predictions via Compositionally-Restricted Attention-Based Network (CrabNet), we introduce several new metrics for materials discovery and validate DiSCoVeR on Materials Project bulk moduli using compound-wise and cluster-wise validation methods. We visualize these via multi-objective Pareto front plots and assign a weighted score to each composition that encompasses the trade-off between performance and density-based chemical uniqueness. In addition to density-based metrics, we explore an additional uniqueness proxy related to property gradients in DensMAP space. As a validation study, we use DiSCoVeR to screen materials for both performance and uniqueness to extrapolate to new chemical spaces. Top-10 rankings are provided for the compound-wise density and property gradient uniqueness proxies. Top-ranked compounds can be further curated via literature searches, physics-based simulations, and/or experimental synthesis. Finally, we compare DiSCoVeR against the naive baseline of random search for several parameter combinations in an adaptive design scheme. To our knowledge, this is the first time automated screening has been performed with explicit emphasis on discovering high-performing, novel materials.
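
The trade-off described in item 5, between predicted performance and chemical uniqueness, can be illustrated with a generic two-objective Pareto front and a weighted score. The sketch below is not DiSCoVeR's implementation. The function names, the equal default weight, and the candidate values (assumed to be pre-scaled to the range 0 to 1) are all hypothetical and chosen only for illustration.

```python
from typing import List, Tuple


def pareto_front(points: List[Tuple[float, float]]) -> List[int]:
    """Return indices of points that no other point dominates.

    Both objectives (performance, uniqueness) are treated as "higher is better".
    A point is dominated if another point is at least as good in both
    objectives and strictly better in at least one.
    """
    front = []
    for i, (perf_i, uniq_i) in enumerate(points):
        dominated = any(
            perf_j >= perf_i and uniq_j >= uniq_i
            and (perf_j > perf_i or uniq_j > uniq_i)
            for j, (perf_j, uniq_j) in enumerate(points)
            if j != i
        )
        if not dominated:
            front.append(i)
    return front


def weighted_score(performance: float, uniqueness: float, weight: float = 0.5) -> float:
    """Combine two pre-scaled objectives into one score (hypothetical weighting)."""
    return weight * performance + (1.0 - weight) * uniqueness


# Hypothetical (performance, uniqueness) pairs, assumed already scaled to [0, 1].
candidates = [(0.9, 0.1), (0.6, 0.6), (0.2, 0.95), (0.5, 0.5), (0.1, 0.1)]

front = pareto_front(candidates)
scores = [weighted_score(p, u) for p, u in candidates]
best = max(range(len(candidates)), key=scores.__getitem__)

print("Pareto-optimal indices:", front)
print("Highest weighted score at index:", best)
```

With these made-up values, the first three candidates are Pareto-optimal and the balanced candidate at index 1 receives the highest weighted score. Changing `weight` shifts the ranking toward performance or toward uniqueness.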