
Title: Machine learning classification can reduce false positives in structure-based virtual screening
With the recent explosion in the size of libraries available for screening, virtual screening is positioned to assume a more prominent role in early drug discovery’s search for active chemical matter. In typical virtual screens, however, only about 12% of the top-scoring compounds actually show activity when tested in biochemical assays. We argue that most scoring functions used for this task have been developed with insufficient attention to the datasets on which they are trained and tested, leading to overly simplistic models and/or overtraining. These problems are compounded in the literature because studies reporting new scoring methods have not validated their models prospectively within the same study. Here, we report a strategy for building a training dataset (D-COID) that aims to generate highly compelling decoy complexes that are individually matched to available active complexes. Using this dataset, we train a general-purpose classifier for virtual screening (vScreenML) that is built on the XGBoost framework. In retrospective benchmarks, our classifier shows outstanding performance relative to other scoring functions. In a prospective context, nearly all candidate inhibitors from a screen against acetylcholinesterase show detectable activity; beyond this, 10 of 23 compounds have an IC50 better than 50 μM. Without any medicinal chemistry optimization, the most potent hit has an IC50 of 280 nM, corresponding to a Ki of 173 nM. These results support using the D-COID strategy for training classifiers in other computational biology tasks, and using vScreenML in virtual screening campaigns against other protein targets. Both D-COID and vScreenML are freely distributed to facilitate such efforts.
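The abstract reports both an IC50 and a Ki for the most potent hit but does not say how one was converted to the other. A minimal sketch, assuming a competitive inhibitor and the standard Cheng-Prusoff relation Ki = IC50 / (1 + [S]/Km), shows what substrate-to-Km ratio would make the two reported values consistent. The function names are hypothetical, and the assay conditions are an assumption, not something stated in the record.

```python
def ki_from_ic50(ic50_nm: float, substrate_over_km: float) -> float:
    """Cheng-Prusoff for competitive inhibition: Ki = IC50 / (1 + [S]/Km)."""
    return ic50_nm / (1.0 + substrate_over_km)


def implied_substrate_over_km(ic50_nm: float, ki_nm: float) -> float:
    """Invert Cheng-Prusoff to get the [S]/Km ratio implied by an IC50/Ki pair."""
    return ic50_nm / ki_nm - 1.0


# Values reported in the abstract (nM).
IC50_NM = 280.0
KI_NM = 173.0

# Ratio that would reconcile the two numbers under the competitive assumption
# (about 0.62, i.e. substrate somewhat below Km).
ratio = implied_substrate_over_km(IC50_NM, KI_NM)

# Round trip: feeding the ratio back in recovers the reported Ki.
ki_check = ki_from_ic50(IC50_NM, ratio)

# Fraction of tested compounds with IC50 better than 50 uM (10 of 23),
# to compare against the "about 12%" typical rate quoted in the abstract.
hit_rate = 10 / 23

if __name__ == "__main__":
    print(f"implied [S]/Km = {ratio:.3f}")
    print(f"Ki from round trip = {ki_check:.1f} nM")
    print(f"hit rate = {hit_rate:.1%}")
```

If the inhibition mode were noncompetitive or uncompetitive, a different relation would apply, so the implied ratio here is only illustrative.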
Journal Name:
Proceedings of the National Academy of Sciences
Page Range / eLocation ID:
18477 to 18488
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Melanoma and nonmelanoma skin cancers are among the most prevalent and most lethal forms of skin cancers. To identify new lead compounds with potential anticancer properties for further optimization, in vitro assays combined with in‐silico target fishing and docking have been used to identify and further map out the antiproliferative activity and potential mode of action of molecules from a small library of compounds previously prepared in our laboratory. From screening these compounds in vitro against A375, SK‐MEL‐28, A431, and SCC‐12 skin cancer cell lines, 35 displayed antiproliferative activities at the micromolar level, with the majority being primarily potent against the A431 and SCC‐12 squamous carcinoma cell lines. The most active compounds, 11 (A431: IC50 = 5.0 μM, SCC‐12: IC50 = 2.9 μM, SKMEL‐28: IC50 = 4.9 μM, A375: IC50 = 6.7 μM) and 13 (A431: IC50 = 5.0 μM, SCC‐12: IC50 = 3.3 μM, SKMEL‐28: IC50 = 13.8 μM, A375: IC50 = 17.1 μM), significantly and dose‐dependently induced apoptosis of SCC‐12 and SK‐MEL‐28 cells, as evidenced by the suppression of Bcl‐2 and upregulation of Bax, cleaved caspase‐3, caspase‐9, and PARP protein expression levels. Both agents significantly reduced scratch wound healing, colony formation, and expression levels of deregulated cancer molecular targets including RSK/Akt/ERK1/2 and S6K1. In silico target prediction and docking studies using the SwissTargetPrediction web‐based tool suggested that CDK8, CLK4, nuclear receptor ROR, tyrosine protein‐kinase Fyn/LCK, ROCK1/2, and PARP, all of which are dysregulated in skin cancers, might be prospective targets for the two most active compounds. Further validation of these targets by western blot analyses revealed that ROCK/Fyn and its associated Hedgehog (Hh) pathways were downregulated or modulated by the two lead compounds. In aggregate, these results provide a strong framework for further validation of the observed activities and the development of a more comprehensive structure–activity relationship through the preparation and biological evaluation of analogs.

  2. Abstract

    RNA-dependent RNA polymerase (RdRp) is essential to RNA replication in the life cycle of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), which causes the deadly respiratory illness COVID-19. Remdesivir is a prodrug that has seen some success in inhibiting this enzyme; however, there is still a pressing need for effective alternatives. In this study, we present the discovery of four non-nucleoside small molecules that bind more favorably to SARS-CoV-2 RdRp than the active form of the popular drug remdesivir (RTP) and adenosine triphosphate (ATP), identified by high-throughput virtual screening (HTVS) against the vast ZINC compound database coupled with extensive molecular dynamics (MD) simulations. After post-trajectory analysis, we found that the simulations of complexes containing ATP and RTP remained stable for the duration of their trajectories. Additionally, the phosphate tail of RTP was stabilized by both the positively charged amino acid pocket and magnesium ions near the entry channel of RdRp, which includes residues K551, R553, R555, and K621. Residues D623, D760, and N691 further stabilized the ribose portion of RTP, with U10 on the template RNA strand forming hydrogen pairs with the adenosine motif. We used these models of RdRp to screen the ZINC database of ~17 million molecules. Using docking and drug properties scoring, we narrowed our selection to fourteen candidates. Each was subjected to a 200 ns simulation followed by free energy calculations. We identified four hit compounds from the ZINC database that have binding poses similar to RTP while possessing lower overall binding free energies, with ZINC097971592 having a binding free energy two times lower than RTP.

  3. Abstract

    In this study, we developed a novel algorithm to improve the screening performance of an arbitrary docking scoring function by recalibrating the docking score of a query compound based on its structural similarity to a set of training compounds, at negligible extra computational cost. Two popular docking methods, Glide and AutoDock Vina, were adopted as the original scoring functions to be processed with our new algorithm, and similar performance improvements were achieved. Predicted binding affinities were compared against experimental data from the ChEMBL and DUD-E databases. Eleven representative drug receptors from diverse drug target categories were used to evaluate the hybrid scoring function. The effects of four different fingerprints (FP2, FP3, FP4, and MACCS) and four different compound similarity effect (CSE) functions were explored. Encouragingly, the screening performance was significantly improved for all 11 drug targets, especially when CSE = S^4 (where S is the Tanimoto structural similarity) and the FP2 fingerprint were applied. The average predictive index (PI) values increased from 0.34 to 0.66 and from 0.39 to 0.71 for the Glide and AutoDock Vina scoring functions, respectively. To evaluate the performance of the calibration algorithm in drug lead identification, we also imposed an upper limit on the structural similarity to mimic the real scenario of screening diverse libraries, in which query ligands are general-purpose screening compounds that are not necessarily structurally similar to reference ligands. Encouragingly, we found that our hybrid scoring function still outperformed the original docking scoring function. The hybrid scoring function was further evaluated using external datasets for two systems, and the PI values increased from 0.24 to 0.46 and from 0.14 to 0.42 for the A2AR and CFX systems, respectively. In conclusion, our calibration algorithm can significantly improve virtual screening performance in both the drug lead optimization and identification phases at negligible computational cost.
  4. Three triorganotin(IV) cyclopentane carboxylates were synthesized and structurally characterized in the solid state by Fourier‐transform infrared spectroscopy and single crystal diffraction, and in solution by NMR (1H, 13C, and 119Sn) spectroscopy. The complexes were tested for their anticancer activity against MCF‐7 and HeLa cells along with normal BHK‐21 cells. As revealed by MTT assay, complex 2 was identified as the most potent derivative, with IC50 values of 2.59 and 0.051 μM against HeLa and MCF‐7 cells, respectively. The results were compared with cisplatin as a reference drug. Fluorescence microscopy studies using 4′,6‐diamidino‐2‐phenylindole (DAPI) and propidium iodide (PI) staining confirmed the occurrence of apoptosis in HeLa cells treated with the most active complex, 2. Complex 2 also triggered the release of lactate dehydrogenase (LDH) in treated HeLa and MCF‐7 cells, whereas a luminescence assay displayed a remarkable increase in the activity of caspase‐9 and ‐3. Moreover, flow cytometric results revealed that complex 2 caused G0/G1 arrest in the treated HeLa cells. The complexes were further screened for DNA binding through UV‐vis spectroscopy and cyclic voltammetry. The high activity of complex 2 was attributed to its higher Lewis acidity, as indicated by natural bond orbital (NBO) analysis. Theoretical modelling and molecular docking studies were also conducted to study the reactivity of the complexes against VEGFR‐2 kinase.

  5. Abstract

    Reliably scoring and ranking candidate models of protein complexes and assigning their oligomeric state from the structure of the crystal lattice represent outstanding challenges. A community‐wide effort was launched to tackle these challenges. The latest resources on protein complexes and interfaces were exploited to derive a benchmark dataset consisting of 1677 homodimer protein crystal structures, including a balanced mix of physiological and non‐physiological complexes. The non‐physiological complexes in the benchmark were selected to bury a similar or larger interface area than their physiological counterparts, making it more difficult for scoring functions to differentiate between them. Next, 252 functions for scoring protein‐protein interfaces previously developed by 13 groups were collected and evaluated for their ability to discriminate between physiological and non‐physiological complexes. A simple consensus score generated using the best performing score of each of the 13 groups, and a cross‐validated Random Forest (RF) classifier were created. Both approaches showed excellent performance, with an area under the Receiver Operating Characteristic (ROC) curve of 0.93 and 0.94, respectively, outperforming individual scores developed by different groups. Additionally, AlphaFold2 engines recalled the physiological dimers with significantly higher accuracy than the non‐physiological set, lending support to the reliability of our benchmark dataset annotations. Optimizing the combined power of interface scoring functions and evaluating it on challenging benchmark datasets appears to be a promising strategy.

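As a rough illustration of the evaluation described in item 5 above, the sketch below computes the area under the ROC curve for individual scores and for a simple consensus of them. The abstract does not say how the consensus score was formed, so averaging min-max-normalized scores is an assumption made here for illustration, and the six labels and score values are made-up toy data, not benchmark results.

```python
from typing import List, Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Rank-based AUC: the probability that a positive outscores a negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def min_max(values: Sequence[float]) -> List[float]:
    """Rescale values to [0, 1]; a constant list maps to 0.5 everywhere."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.5] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def consensus(score_lists: Sequence[Sequence[float]]) -> List[float]:
    """Hypothetical consensus: mean of min-max-normalized scores per complex."""
    normed = [min_max(s) for s in score_lists]
    return [sum(col) / len(col) for col in zip(*normed)]


# Toy data (invented): 1 = physiological, 0 = non-physiological.
labels = [1, 1, 1, 0, 0, 0]
score_a = [0.9, 0.4, 0.8, 0.5, 0.2, 0.1]
score_b = [3.0, 9.0, 2.0, 1.0, 4.0, 0.5]

auc_a = roc_auc(score_a, labels)
auc_b = roc_auc(score_b, labels)
consensus_scores = consensus([score_a, score_b])
auc_consensus = roc_auc(consensus_scores, labels)

if __name__ == "__main__":
    print(f"AUC score_a:   {auc_a:.3f}")
    print(f"AUC score_b:   {auc_b:.3f}")
    print(f"AUC consensus: {auc_consensus:.3f}")
```

In this toy case each score misranks a different complex, so combining them separates the classes better than either alone. That is the intuition behind consensus scoring, not a reproduction of the reported 0.93 and 0.94 values.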