Title: Machine learning classification can reduce false positives in structure-based virtual screening
With the recent explosion in the size of libraries available for screening, virtual screening is positioned to assume a more prominent role in early drug discovery’s search for active chemical matter. In typical virtual screens, however, only about 12% of the top-scoring compounds actually show activity when tested in biochemical assays. We argue that most scoring functions used for this task have been developed with insufficient attention to the datasets on which they are trained and tested, leading to overly simplistic models and/or overtraining. These problems are compounded in the literature because studies reporting new scoring methods have not validated their models prospectively within the same study. Here, we report a strategy for building a training dataset (D-COID) that aims to generate highly compelling decoy complexes that are individually matched to available active complexes. Using this dataset, we train a general-purpose classifier for virtual screening (vScreenML) that is built on the XGBoost framework. In retrospective benchmarks, our classifier shows outstanding performance relative to other scoring functions. In a prospective context, nearly all candidate inhibitors from a screen against acetylcholinesterase show detectable activity; beyond this, 10 of 23 compounds have an IC50 better than 50 μM. Without any medicinal chemistry optimization, the most potent hit has an IC50 of 280 nM, corresponding to a Ki of 173 nM. These results support using the D-COID strategy for training classifiers in other computational biology tasks, and using vScreenML in virtual screening campaigns against other protein targets. Both D-COID and vScreenML are freely distributed to facilitate such efforts.
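The abstract does not say how the IC50 of 280 nM was converted to a Ki of 173 nM. A minimal sketch, assuming the standard Cheng-Prusoff relation for a competitive inhibitor (Ki = IC50 / (1 + [S]/Km)), shows what substrate-to-Km ratio those two numbers would imply, along with the prospective hit rate quoted above. The function names are illustrative, and the competitive-inhibition assumption is ours rather than something stated in the record.

```python
def cheng_prusoff_ki(ic50_nm: float, s_over_km: float) -> float:
    """Ki for a competitive inhibitor: Ki = IC50 / (1 + [S]/Km)."""
    return ic50_nm / (1.0 + s_over_km)


def implied_s_over_km(ic50_nm: float, ki_nm: float) -> float:
    """Invert the relation to get the [S]/Km ratio implied by an IC50/Ki pair."""
    return ic50_nm / ki_nm - 1.0


# Values quoted in the abstract (nM). The ratio is back-calculated under the
# ASSUMED competitive model; it is not a number reported by the authors.
ratio = implied_s_over_km(280.0, 173.0)
ki = cheng_prusoff_ki(280.0, ratio)

# Fraction of tested compounds with IC50 better than 50 uM (10 of 23).
hit_rate = 10 / 23

print(f"implied [S]/Km = {ratio:.3f}, Ki = {ki:.0f} nM, hit rate = {hit_rate:.1%}")
```

Under that assumption the assay would have been run with substrate at roughly 0.6 times Km; a different inhibition mechanism would change this reading.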
Award ID(s):
1836950
NSF-PAR ID:
10207648
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
Proceedings of the National Academy of Sciences
Volume:
117
Issue:
31
ISSN:
0027-8424
Page Range / eLocation ID:
18477 to 18488
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    RNA-dependent RNA polymerase (RdRp) is an essential enzyme for RNA replication in the life cycle of severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), the cause of the deadly respiratory disease COVID-19. Remdesivir is a prodrug that has seen some success in inhibiting this enzyme; however, there is still a pressing need for effective alternatives. In this study, we present the discovery of four non-nucleoside small molecules that bind more favorably to SARS-CoV-2 RdRp than the active form of the popular drug remdesivir (RTP) and adenosine triphosphate (ATP), identified by high-throughput virtual screening (HTVS) against the vast ZINC compound database coupled with extensive molecular dynamics (MD) simulations. Post-trajectory analysis showed that the simulated complexes containing ATP and RTP remained stable for the duration of their trajectories. The phosphate tail of RTP was stabilized by both the positively charged amino acid pocket and the magnesium ions near the entry channel of RdRp, which includes residues K551, R553, R555 and K621. Residues D623, D760, and N691 further stabilized the ribose portion of RTP, with U10 on the template RNA strand forming hydrogen bonds with the adenosine motif. We used these models of RdRp to screen the ZINC database of ~17 million molecules. Using docking and drug-property scoring, we narrowed our selection to fourteen candidates. Each candidate was subjected to a 200 ns simulation followed by free energy calculations. We identified four hit compounds from the ZINC database that have binding poses similar to RTP while possessing lower overall binding free energies, with ZINC097971592 having a binding free energy two times lower than that of RTP.

  2. Abstract

    In this study, we developed a novel algorithm that improves the screening performance of an arbitrary docking scoring function by recalibrating the docking score of a query compound based on its structural similarity to a set of training compounds, at negligible extra computational cost. Two popular docking methods, Glide and AutoDock Vina, were adopted as the original scoring functions to be processed with our new algorithm, and similar performance improvements were achieved for both. Predicted binding affinities were compared against experimental data from the ChEMBL and DUD-E databases. Eleven representative drug receptors from diverse drug target categories were used to evaluate the hybrid scoring function. The effects of four different fingerprints (FP2, FP3, FP4, and MACCS) and four different compound similarity effect (CSE) functions were explored. Encouragingly, the screening performance was significantly improved for all 11 drug targets, especially when CSE = S^4 (where S is the Tanimoto structural similarity) and the FP2 fingerprint were applied. The average predictive index (PI) values increased from 0.34 to 0.66 and from 0.39 to 0.71 for the Glide and AutoDock Vina scoring functions, respectively. To evaluate the performance of the calibration algorithm in drug lead identification, we also imposed an upper limit on the structural similarity to mimic the real scenario of screening diverse libraries, in which query ligands are general-purpose screening compounds that are not necessarily structurally similar to reference ligands. Encouragingly, our hybrid scoring function still outperformed the original docking scoring function. The hybrid scoring function was further evaluated using external datasets for two systems, and the PI values increased from 0.24 to 0.46 and from 0.14 to 0.42 for the A2AR and CFX systems, respectively. In conclusion, our calibration algorithm can significantly improve virtual screening performance in both the drug lead optimization and identification phases at negligible computational cost.
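The abstract above names the ingredients of the recalibration (a docking score, Tanimoto similarity S to training compounds, and a compound similarity effect such as S^4) but does not give the formula that combines them. The sketch below is one plausible form, not the authors' algorithm: it shifts the query's docking score by the similarity-weighted mean docking error of the training compounds, scaled by the strongest similarity weight so that dissimilar queries are left almost untouched. Fingerprints are represented as plain sets of "on" bit indices, and `recalibrate`, `cse`, and the toy data are all hypothetical.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bit indices."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def cse(similarity: float, power: int = 4) -> float:
    """Compound similarity effect; power=4 corresponds to CSE = S^4."""
    return similarity ** power


def recalibrate(query_fp, query_score, training, power=4):
    """ASSUMED combination rule (not taken from the abstract).

    training: list of (fingerprint, docking_score, experimental_value).
    Returns the docking score shifted by the CSE-weighted mean docking error
    of the training set, scaled by the largest CSE weight.
    """
    weights = [cse(tanimoto(query_fp, fp), power) for fp, _, _ in training]
    total = sum(weights)
    if total == 0.0:
        return query_score  # nothing similar: keep the raw docking score
    mean_error = sum(
        w * (exp - dock) for w, (_, dock, exp) in zip(weights, training)
    ) / total
    return query_score + max(weights) * mean_error


# Toy training set: (fingerprint, docking score, experimental value).
training_set = [
    (frozenset({1, 2, 3, 4}), -7.0, -9.0),
    (frozenset({5, 6, 7, 8}), -6.0, -5.0),
]
similar = recalibrate(frozenset({1, 2, 3, 4}), -7.5, training_set)
dissimilar = recalibrate(frozenset({9, 10}), -7.5, training_set)
print(similar, dissimilar)
```

Raising S to the fourth power sharply discounts weakly similar training compounds (a similarity of 0.5 contributes a weight of only 0.0625), which fits the abstract's observation that S^4 performed best.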
  3. Three triorganotin(IV) cyclopentane carboxylates were synthesized and structurally characterized in the solid state by Fourier‐transform infrared spectroscopy and single crystal diffraction, and in solution by NMR (1H, 13C, and 119Sn) spectroscopy. The complexes were tested for their anticancer activity against MCF‐7 and HeLa cells along with normal BHK‐21 cells. As revealed by MTT assay, complex 2 was identified as the most potent derivative, with IC50 values of 2.59 and 0.051 μM against HeLa and MCF‐7 cells, respectively. The results were compared with cisplatin as the reference drug. Fluorescence microscopy studies using 4′,6‐diamidino‐2‐phenylindole (DAPI) and propidium iodide (PI) staining confirmed the occurrence of apoptosis in HeLa cells treated with the most active complex, 2. Complex 2 also triggered the release of lactate dehydrogenase (LDH) in treated HeLa and MCF‐7 cells, whereas a luminescence assay displayed a remarkable increase in the activity of caspase‐9 and ‐3. Moreover, flow cytometric results revealed that complex 2 caused G0/G1 arrest in the treated HeLa cells. The complexes were further screened for DNA binding through UV‐vis spectroscopy and cyclic voltammetry. The high activity of complex 2 was attributed to its higher Lewis acidity, as indicated by natural bond orbital (NBO) analysis. Theoretical modelling and molecular docking studies were also conducted to study the reactivity of the complexes against VEGFR 2 kinase.
  4. Abstract

    Reliably scoring and ranking candidate models of protein complexes and assigning their oligomeric state from the structure of the crystal lattice represent outstanding challenges. A community‐wide effort was launched to tackle these challenges. The latest resources on protein complexes and interfaces were exploited to derive a benchmark dataset consisting of 1677 homodimer protein crystal structures, including a balanced mix of physiological and non‐physiological complexes. The non‐physiological complexes in the benchmark were selected to bury a similar or larger interface area than their physiological counterparts, making it more difficult for scoring functions to differentiate between them. Next, 252 functions for scoring protein‐protein interfaces previously developed by 13 groups were collected and evaluated for their ability to discriminate between physiological and non‐physiological complexes. A simple consensus score generated using the best performing score of each of the 13 groups, and a cross‐validated Random Forest (RF) classifier were created. Both approaches showed excellent performance, with an area under the Receiver Operating Characteristic (ROC) curve of 0.93 and 0.94, respectively, outperforming individual scores developed by different groups. Additionally, AlphaFold2 engines recalled the physiological dimers with significantly higher accuracy than the non‐physiological set, lending support to the reliability of our benchmark dataset annotations. Optimizing the combined power of interface scoring functions and evaluating it on challenging benchmark datasets appears to be a promising strategy.
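    The evaluation described above rests on two simple quantities: the area under the ROC curve for each scoring function, and a consensus built from several of them. The abstract does not say how the consensus was formed, so the sketch below assumes a plain average of scores that are already on a common scale; the AUC is computed from its pairwise definition (the probability that a physiological complex outscores a non‐physiological one, with ties counted as half). The score lists are made-up toy values.

```python
def roc_auc(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return wins / (len(pos) * len(neg))


def consensus(score_lists):
    """ASSUMED consensus: per-complex mean over several scoring functions."""
    return [sum(column) / len(column) for column in zip(*score_lists)]


# 1 = physiological, 0 = non-physiological (toy labels and scores).
labels = [1, 1, 0, 0]
group_a = [0.9, 0.4, 0.5, 0.1]
group_b = [0.6, 0.8, 0.2, 0.7]

auc_a = roc_auc(group_a, labels)
auc_b = roc_auc(group_b, labels)
auc_consensus = roc_auc(consensus([group_a, group_b]), labels)
print(auc_a, auc_b, auc_consensus)
```

    In this toy case each individual score misranks one pair, while their average ranks every pair correctly, which is the kind of gain the consensus approach aims for.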

  5. Metabotropic glutamate receptors (mGluRs) play an important role in regulating glutamate signaling pathways, which are involved in neuropathy and peripheral homeostasis. mGluR4, which belongs to the Group III mGluRs, is the most widely distributed in the periphery among all the mGluRs. Regulation of this receptor has been shown to be involved in diabetes, colorectal carcinoma and many other diseases. However, the application of structure-based drug design to identify small molecules that regulate mGluR4 is limited by the absence of a resolved mGluR4 protein structure. In this work, we first built a homology model of mGluR4 based on a crystal structure of mGluR8, and then conducted hierarchical virtual screening (HVS) to identify possible active ligands for mGluR4. The HVS protocol consists of three hierarchical filters: Glide docking, molecular dynamics (MD) simulation and binding free energy calculation. We successfully prioritized active ligands of mGluR4 from a set of screening compounds using HVS. The active ligands predicted from binding affinities cover almost all of the experimentally determined active ligands, with only one ligand missed. The correlation between the measured and predicted binding affinities is significantly improved for the MM-PB/GBSA-WSAS methods compared to the Glide docking method. More importantly, we have identified hotspots for ligand binding: SER157 and GLY158 tend to contribute to the selectivity of mGluR4 ligands, while ALA154 and ALA155 could account for ligand selectivity toward mGluR8. We also identified five other key residues that are critical for ligand potency. The differences between the binding profiles of mGluR4 and mGluR8 can guide the development of more potent and selective modulators. Moreover, we evaluated the performance of IPSF, a novel type of scoring function trained by a machine learning algorithm on residue–ligand interaction profiles, in guiding drug lead optimization. The cross-validation root-mean-square errors (RMSEs) are much smaller than those of the endpoint methods, and the correlation coefficients are comparable to the best endpoint methods for both mGluRs. Thus, machine learning-based IPSF can be applied to guide lead optimization even though the total number of actives and inactives is small, a typical scenario in drug discovery projects.
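The three-filter HVS protocol described above (docking, then MD simulation, then binding free energy calculation) is a funnel in which each stage is more expensive than the last and sees fewer compounds. A minimal sketch of that control flow follows. The keep fractions, the use of an MD RMSD as the stage-two stability measure, and all numbers are hypothetical placeholders for values a real campaign would compute with the actual tools.

```python
import math


def hierarchical_screen(compounds, stages):
    """Apply (score_fn, keep_fraction) filters in order; lower score is better."""
    survivors = list(compounds)
    for score_fn, keep_fraction in stages:
        ranked = sorted(survivors, key=score_fn)
        n_keep = max(1, math.ceil(len(ranked) * keep_fraction))
        survivors = ranked[:n_keep]
    return survivors


# Toy compounds with precomputed, made-up values for each stage.
toy_library = [
    {"id": "c0", "dock": -9.0, "md_rmsd": 1.0, "dg": -30.0},
    {"id": "c1", "dock": -8.5, "md_rmsd": 3.5, "dg": -40.0},
    {"id": "c2", "dock": -8.0, "md_rmsd": 1.2, "dg": -35.0},
    {"id": "c3", "dock": -7.5, "md_rmsd": 2.0, "dg": -20.0},
    {"id": "c4", "dock": -5.0, "md_rmsd": 0.5, "dg": -50.0},
    {"id": "c5", "dock": -4.5, "md_rmsd": 0.8, "dg": -45.0},
    {"id": "c6", "dock": -4.0, "md_rmsd": 1.5, "dg": -25.0},
    {"id": "c7", "dock": -3.0, "md_rmsd": 2.5, "dg": -10.0},
]

# Hypothetical stage definitions: keep the best half at every stage.
stages = [
    (lambda c: c["dock"], 0.5),     # stage 1: docking score
    (lambda c: c["md_rmsd"], 0.5),  # stage 2: pose stability in MD (assumed proxy)
    (lambda c: c["dg"], 0.5),       # stage 3: binding free energy estimate
]

hits = [c["id"] for c in hierarchical_screen(toy_library, stages)]
print(hits)
```

In this toy library the compound with the best free energy estimate (c4) is discarded at the docking stage, which illustrates the trade-off a funnel makes: early, cheap filters bound what later, more accurate ones can recover.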