
Title: Pharmacological affinity fingerprints derived from bioactivity data for the identification of designer drugs

Facing the continuous emergence of new psychoactive substances (NPS) and their threat to public health, more effective methods for NPS prediction and identification are critical. In this study, the pharmacological affinity fingerprints (Ph-fp) of NPS compounds were predicted by Random Forest classification models using bioactivity data from the ChEMBL database. The binary Ph-fp is the vector consisting of a compound’s activity against a list of molecular targets reported to be responsible for the pharmacological effects of NPS. The performance of Ph-fp in similarity searching and unsupervised clustering was assessed and compared to that of the 2D structure fingerprints Morgan and MACCS (the 1024-bit ECFP4 and 166-bit SMARTS-based MACCS implementations of RDKit). The performance in retrieving compounds according to their pharmacological categorizations is influenced by the predicted active assay counts in Ph-fp and the choice of similarity metric. Overall, the comparative unsupervised clustering analysis suggests the use of a classification model with Morgan fingerprints as input for the construction of Ph-fp. This combination gives satisfactory clustering performance based on external and internal clustering validation indices.
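
As an illustration of the similarity searching described above, the following is a minimal sketch in plain Python, not the authors' implementation. It ranks a small library of binary fingerprints against a query using the Tanimoto coefficient, one common similarity metric for bit vectors. The compound names, the six-bit vectors, and the helper names `tanimoto` and `rank_by_similarity` are hypothetical and chosen only for the example; real Ph-fp vectors would hold one bit per predicted target activity.

```python
from typing import Dict, List, Sequence, Tuple


def tanimoto(a: Sequence[int], b: Sequence[int]) -> float:
    """Tanimoto (Jaccard) similarity between two equal-length binary vectors."""
    if len(a) != len(b):
        raise ValueError("fingerprints must have the same length")
    both = sum(1 for x, y in zip(a, b) if x and y)    # bits set in both vectors
    either = sum(1 for x, y in zip(a, b) if x or y)   # bits set in at least one
    # Assumption: two all-zero fingerprints are treated as having similarity 0.
    return both / either if either else 0.0


def rank_by_similarity(
    query: Sequence[int], library: Dict[str, List[int]]
) -> List[Tuple[str, float]]:
    """Return (name, similarity) pairs sorted from most to least similar."""
    scores = [(name, tanimoto(query, fp)) for name, fp in library.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


# Hypothetical binary fingerprints: 1 = predicted active against a target.
library = {
    "compound_A": [1, 1, 0, 0, 1, 0],
    "compound_B": [0, 0, 1, 1, 0, 0],
    "compound_C": [1, 0, 0, 0, 1, 0],
}
query = [1, 1, 0, 0, 0, 0]

ranking = rank_by_similarity(query, library)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Because the Tanimoto coefficient depends on how many bits are set, sparse fingerprints with few predicted actives can yield low or zero similarity even for related compounds, which is consistent with the abstract's remark that retrieval depends on the active assay counts and on the choice of metric.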

Journal Name:
Journal of Cheminformatics
Springer Science + Business Media
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract In this study, we developed a novel algorithm to improve the screening performance of an arbitrary docking scoring function by recalibrating the docking score of a query compound based on its structural similarity to a set of training compounds, at negligible extra computational cost. Two popular docking methods, Glide and AutoDock Vina, were adopted as the original scoring functions to be processed with our new algorithm, and similar improvements were achieved for both. Predicted binding affinities were compared against experimental data from the ChEMBL and DUD-E databases. Eleven representative drug receptors from diverse drug target categories were used to evaluate the hybrid scoring function. The effects of four different fingerprints (FP2, FP3, FP4, and MACCS) and four different compound similarity effect (CSE) functions were explored. Encouragingly, the screening performance was significantly improved for all 11 drug targets, especially when CSE = S^4 (S is the Tanimoto structural similarity) and the FP2 fingerprint were applied. The average predictive index (PI) values increased from 0.34 to 0.66 and from 0.39 to 0.71 for the Glide and AutoDock Vina scoring functions, respectively. To evaluate the performance of the calibration algorithm in drug lead identification, we also imposed an upper limit on the structural similarity to mimic the real scenario of screening diverse libraries, for which query ligands are general-purpose screening compounds and are not necessarily structurally similar to reference ligands. Encouragingly, we found that our hybrid scoring function still outperformed the original docking scoring function. The hybrid scoring function was further evaluated using external datasets for two systems, and we found that the PI values increased from 0.24 to 0.46 and from 0.14 to 0.42 for the A2AR and CFX systems, respectively. In conclusion, our calibration algorithm can significantly improve virtual screening performance in both the drug lead optimization and identification phases with negligible computational cost.
  2. Abstract Background

    In Alzheimer’s Disease (AD) research, multimodal imaging analysis can unveil complementary information from multiple imaging modalities and further our understanding of the disease. One application is to discover disease subtypes using unsupervised clustering. However, existing clustering methods are often applied to input features directly, and could suffer from the curse of dimensionality with high-dimensional multimodal data. The purpose of our study is to identify multimodal imaging-driven subtypes in Mild Cognitive Impairment (MCI) participants using a multiview learning framework based on Deep Generalized Canonical Correlation Analysis (DGCCA), to learn a shared low-dimensional latent representation from 3 neuroimaging modalities.


    DGCCA applies non-linear transformations to input views using neural networks and is able to learn correlated low-dimensional embeddings that capture more variance than its linear counterpart, generalized CCA (GCCA). We designed experiments to compare DGCCA embeddings with single modality features and GCCA embeddings by generating 2 subtypes from each feature set using unsupervised clustering. In our validation studies, we found that amyloid PET imaging has the most discriminative features compared with structural MRI and FDG PET, and that DGCCA learns from these features but GCCA does not. DGCCA subtypes show differential measures in 5 cognitive assessments, 6 brain volume measures, and conversion to AD patterns. In addition, DGCCA MCI subtypes confirmed AD genetic markers with strong signals that the existing late MCI group did not identify.


    Overall, DGCCA is able to learn effective low dimensional embeddings from multimodal data by learning non-linear projections. MCI subtypes generated from DGCCA embeddings are different from existing early and late MCI groups and show most similarity with those identified by amyloid PET features. In our validation studies, DGCCA subtypes show distinct patterns in cognitive measures, brain volumes, and are able to identify AD genetic markers. These findings indicate the promise of the imaging-driven subtypes and their power in revealing disease structures beyond early and late stage MCI.

  3. Abstract Background

    Interpretation of high-throughput gene expression data continues to require mathematical tools in data analysis that recognize the shape of the data in high dimensions. Topological data analysis (TDA) has recently been successful in extracting robust features in several applications dealing with high dimensional constructs. In this work, we utilize some recent developments in TDA to curate gene expression data. Our work differs from its predecessors in two aspects: (1) Traditional TDA pipelines use topological signatures called barcodes to enhance feature vectors which are used for classification. In contrast, this work involves curating relevant features to obtain somewhat better representatives with the help of TDA. These representatives of the entire data facilitate better comprehension of the phenotype labels. (2) Most of the earlier works employ barcodes obtained using topological summaries as fingerprints for the data. Even though they are stable signatures, there exists no direct mapping between the data and said barcodes.


    The topology-relevant curated data that we obtain provides an improvement in shallow learning as well as deep learning based supervised classifications. We further show that the representative cycles we compute have an unsupervised inclination towards phenotype labels. This work thus shows that topological signatures are able to comprehend gene expression levels and classify cohorts accordingly.


    In this work, we engender representative persistent cycles to discern the gene expression data. These cycles allow us to directly procure genes entailed in similar processes.

  4. Abstract This paper introduces the use of topological data analysis (TDA) as an unsupervised machine learning tool to uncover classification criteria in complex inorganic crystal chemistries. Using the apatite chemistry as a template, we track through the use of persistent homology the topological connectivity of input crystal chemistry descriptors on defining similarity between different stoichiometries of apatites. It is shown that TDA automatically identifies a hierarchical classification scheme within apatites based on the commonality of the number of discrete coordination polyhedra that constitute the structural building units common among the compounds. This information is presented in the form of a visualization scheme of a barcode of homology classifications, where the persistence of similarity between compounds is tracked. Unlike traditional perspectives of structure maps, this new “Materials Barcode” schema serves as an automated exploratory machine learning tool that can uncover structural associations from crystal chemistry databases, as well as to achieve a more nuanced insight into what defines similarity among homologous compounds.
  5. Abstract

    Magnetic transition metals (mTM = Cr, Mn, Fe, Co, and Ni) and their complex compounds (oxides, hydroxides, and oxyhydroxides) are highly important material platforms for diverse technologies, where electrochemical phase diagrams with respect to electrode potential and solution pH can be used to effectively understand their corrosion and oxidation behaviors in relevant aqueous environments. Many decades-old mTM–Pourbaix diagrams are inconsistent with various direct electrochemical observations, because experimental complexities associated with extracting reliable free energies of formation (ΔfG) lead to inaccuracies in the data used for modeling. Here, we develop a high-throughput simulation approach based on density-functional theory (DFT), which quickly screens structures and compounds using efficient DFT methods and calculates accurate ΔfG values using high-level exchange-correlation functionals, to obtain ab initio Pourbaix diagrams in comprehensive and close agreement with various important electrochemical, geological, and biomagnetic observations reported over the last few decades. We also analyze the microscopic mechanisms governing the chemical trends among the ΔfG values and Pourbaix diagrams to further understand the electrochemical behaviors of mTM-based materials. Last, we provide probability profiles at variable electrode potential and solution pH to show quantitatively the likely coexistence of multiple-phase areas and diffuse phase boundaries.