Abstract Quantitative structure-activity relationship (QSAR) modeling is a powerful tool for drug discovery, yet the lack of interpretability of commonly used QSAR models hinders their application in molecular design. We propose a similarity-based regression framework, topological regression (TR), that offers a statistically grounded, computationally fast, and interpretable technique to predict drug responses. We compare the predictive performance of TR on 530 ChEMBL human target activity datasets against the predictive performance of deep-learning-based QSAR models. Our results suggest that our sparse TR model can achieve equal, if not better, performance than the deep learning-based QSAR models and provide better intuitive interpretation by extracting an approximate isometry between the chemical space of the drugs and their activity space.
Knowledge transfer between small datasets for boosting the predictive performance of machine learning assisted QSAR models on contaminant oxidative reactivity
Using machine learning (ML) to develop quantitative structure–activity relationship (QSAR) models for contaminant reactivity has emerged as a promising approach because it can effectively handle non-linear relationships. However, ML is often data-demanding, whereas data scarcity is common in QSAR model development. Here, we proposed two approaches to address this issue: combining small datasets and transferring knowledge between them. First, we compiled four individual datasets for four oxidants, i.e., SO4•-, HClO, O3 and ClO2, each dataset containing a different number of contaminants with their corresponding rate constants and reaction conditions (pH and/or temperature). We then used molecular fingerprints (MF) or molecular descriptors (MD) to represent the contaminants, combined them with ML algorithms to develop individual QSAR models for these four datasets, and interpreted the models with the SHapley Additive exPlanations (SHAP) method. The results showed that both the optimal contaminant representation and the best ML algorithm are dataset dependent. Next, we merged these four datasets and developed a unified model, which showed better predictive performance on the HClO, O3 and ClO2 datasets because the model ‘corrected’ some wrongly learned effects of several atom groups. We further developed knowledge transfer models based on the second approach, the effectiveness of which depends on whether consistent knowledge is shared between the two datasets and on the predictive performance of the respective single models. This study demonstrated the benefit of combining small similar datasets and transferring knowledge between them, which can be leveraged to boost the predictive performance of ML-assisted QSAR models.
- Award ID(s): 1808406
- PAR ID: 10288084
- Date Published:
- Journal Name: Environmental science technology
- Volume: In review
- ISSN: 1086-931X
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Abstract Quantitative structure–activity relationship (QSAR) modeling has become a critical tool in drug design. Recently proposed Topological Regression (TR), a computationally efficient and highly interpretable QSAR model that maps distances in the chemical domain to distances in the activity domain, has shown predictive performance comparable to state-of-the-art deep learning-based models. However, TR’s dependence on simple random sampling-based anchor selection and utilization of radial basis function for response reconstruction constrain its interpretability and predictive capacity. To address these limitations, we propose Adaptive Topological Regression (AdapToR) with adaptive anchor selection and optimization-based reconstruction. We evaluated AdapToR on the NCI60 GI50 dataset, which consists of over 50,000 drug responses across 60 human cancer cell lines, and compared its performance to Transformer CNN, Graph Transformer, TR, and other baseline models. The results demonstrate that AdapToR outperforms competing QSAR models for drug response prediction with significantly lower computational cost and greater interpretability as compared to deep learning-based models.
-
Machine learning represents a milestone in data-driven research, including material informatics, robotics, and computer-aided drug discovery. With the continuously growing virtual and synthetically available chemical space, efficient and robust quantitative structure–activity relationship (QSAR) methods are required to uncover molecules with desired properties. Herein, we propose variable-length-array SMILES-based (VLA-SMILES) structural descriptors that expand conventional SMILES descriptors widely used in machine learning. This structural representation extends the family of numerically coded SMILES, particularly binary SMILES, to expedite the discovery of new deep learning QSAR models with high predictive ability. VLA-SMILES descriptors were shown to speed up the training of QSAR models based on multilayer perceptron (MLP) with optimized backpropagation (ATransformedBP), resilient propagation (iRPROP‒), and Adam optimization learning algorithms featuring rational train–test splitting, while improving the predictive ability toward the more compute-intensive binary SMILES representation format. All the tested MLPs under the same length-array-based SMILES descriptors showed similar predictive ability and convergence rate of training in combination with the considered learning procedures. Validation with the Kennard–Stone train–test splitting based on the structural descriptor similarity metrics was found more effective than the partitioning with the ranking by activity based on biological activity values metrics for the entire set of VLA-SMILES featured QSAR. Robustness and the predictive ability of MLP models based on VLA-SMILES were assessed via the method of QSAR parametric model validation. 
In addition, the method of the statistical H0 hypothesis testing of the linear regression between real and observed activities based on the F2,n−2 criterion was used for predictability estimation among VLA-SMILES featured QSAR-MLPs (with n being the volume of the testing set). Both approaches of QSAR parametric model validation and statistical hypothesis testing were found to correlate when used for the quantitative evaluation of predictabilities of the designed QSAR models with VLA-SMILES descriptors.
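The Kennard–Stone train–test splitting named in the abstract above can be illustrated with a minimal sketch. It assumes plain Euclidean distance on small numeric descriptor vectors; the function name `kennard_stone` and the toy points are hypothetical and are not taken from the paper, which uses its own structural descriptor similarity metrics.

```python
import math


def kennard_stone(points, n_train):
    """Return indices of n_train points picked by the Kennard-Stone procedure.

    Assumption: similarity is plain Euclidean distance between descriptor
    vectors; the cited work may use a different metric.
    """
    n = len(points)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and len(points)")

    # Full pairwise distance matrix (fine for small illustrative sets).
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Seed the training set with the two mutually most distant points.
    first, second = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda pair: dist[pair[0]][pair[1]],
    )
    selected = [first, second]
    remaining = set(range(n)) - {first, second}

    # For each candidate, track its distance to the nearest selected point.
    nearest = {k: min(dist[k][first], dist[k][second]) for k in remaining}

    # Repeatedly add the candidate that is farthest from the selected set.
    while len(selected) < n_train:
        chosen = max(sorted(remaining), key=lambda k: nearest[k])
        selected.append(chosen)
        remaining.remove(chosen)
        for k in remaining:
            nearest[k] = min(nearest[k], dist[k][chosen])

    return selected


if __name__ == "__main__":
    toy = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (10.0, 0.0), (5.0, 0.0)]
    train_idx = kennard_stone(toy, 3)
    test_idx = [i for i in range(len(toy)) if i not in train_idx]
    print("train:", train_idx, "test:", test_idx)
```

The points left unselected form the test set, so the training set spans the descriptor space while test compounds fall inside it.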
-
Background: Metastatic cancer remains one of the leading causes of cancer-related mortality worldwide. Yet, the prediction of survivability in this population remains limited by heterogeneous clinical presentations and high-dimensional molecular features. Advances in machine learning (ML) provide an opportunity to integrate diverse patient- and tumor-level factors into explainable predictive ML models. Leveraging large real-world datasets and modern ML techniques can enable improved risk stratification and precision oncology. Objective: This study aimed to develop and interpret ML models for predicting overall survival in patients with metastatic cancer using the Memorial Sloan Kettering-Metastatic (MSK-MET) dataset and to identify key prognostic biomarkers through explainable artificial intelligence techniques. Methods: We performed a retrospective analysis of the MSK-MET cohort, comprising 25,775 patients across 27 tumor types. After data cleaning and balancing, 20,338 patients were included. Overall survival was defined as deceased versus living at last follow-up. Five classifiers (extreme gradient boosting [XGBoost], logistic regression, random forest, decision tree, and naive Bayes) were trained using an 80/20 stratified split and optimized via grid search with 5-fold cross-validation. Model performance was assessed using accuracy, area under the curve (AUC), precision, recall, and F1-score. Model explainability was achieved using Shapley additive explanations (SHAP). Survival analyses included Kaplan-Meier estimates, Cox proportional hazards models, and an XGBoost-Cox model for time-to-event prediction. The positive predictive value and negative predictive value were calculated at the Youden index–optimal threshold. Results: XGBoost achieved the highest performance (accuracy=0.74; AUC=0.82), outperforming other classifiers. In survival analyses, the XGBoost-Cox model with a concordance index (C-index) of 0.70 exceeded the traditional Cox model (C-index=0.66). 
SHAP analysis and Cox models consistently identified metastatic site count, tumor mutational burden, fraction of genome altered, and the presence of distant liver and bone metastases as among the strongest prognostic factors, a pattern that held at both the pan-cancer level and recurrently across cancer-specific models. At the cancer-specific level, performance varied; prostate cancer achieved the highest predictive accuracy (AUC=0.88), while pancreatic cancer was notably more challenging (AUC=0.68). Kaplan-Meier analyses demonstrated marked survival separation between patients with and without metastases (80-month survival: approximately 0.80 vs 0.30). At the Youden-optimal threshold, positive predictive value and negative predictive value were approximately 70% and 80%, respectively, supporting clinical use for risk stratification. Conclusions: Explainable ML models, particularly XGBoost combined with SHAP, can strongly predict survivability in metastatic cancers while highlighting clinically meaningful features. These findings support the use of ML-based tools for patient counseling, treatment planning, and integration into precision oncology workflows. Future work should include external validation on independent cohorts, integration with electronic health records via Fast Healthcare Interoperability Resources–based dashboards, and prospective clinician-in-the-loop evaluation to assess real-world use.
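As an illustration of the Youden index step described in the abstract above, the sketch below picks the score threshold that maximizes J = sensitivity + specificity − 1 and reports the positive and negative predictive values at that threshold. The function name `youden_threshold` and the toy labels and scores are hypothetical; this is not the study's code or data.

```python
import math


def youden_threshold(y_true, scores):
    """Find the threshold maximizing Youden's J and report PPV/NPV there.

    y_true holds 0/1 labels; a case is predicted positive when its score is
    greater than or equal to the threshold. Ties in J keep the lowest threshold.
    """
    if not (0 in y_true and 1 in y_true):
        raise ValueError("both classes must be present")

    best = None
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 1)
        fp = sum(1 for y, s in zip(y_true, scores) if s >= t and y == 0)
        fn = sum(1 for y, s in zip(y_true, scores) if s < t and y == 1)
        tn = sum(1 for y, s in zip(y_true, scores) if s < t and y == 0)

        sensitivity = tp / (tp + fn)
        specificity = tn / (tn + fp)
        j = sensitivity + specificity - 1

        if best is None or j > best["j"]:
            best = {
                "threshold": t,
                "j": j,
                # PPV/NPV are undefined when no case falls on that side.
                "ppv": tp / (tp + fp) if tp + fp else math.nan,
                "npv": tn / (tn + fn) if tn + fn else math.nan,
            }
    return best


if __name__ == "__main__":
    labels = [0, 0, 0, 0, 1, 1, 1]
    probs = [0.1, 0.2, 0.3, 0.6, 0.5, 0.7, 0.9]
    print(youden_threshold(labels, probs))
```

Scanning every observed score as a candidate threshold is quadratic in the number of cases, which is acceptable for a sketch; a sorted single pass would be preferable on cohorts of the size reported above.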
-
Per- and polyfluoroalkyl substances (PFAS) contamination has posed a significant environmental and public health challenge due to their ubiquitous nature. Adsorption has emerged as a promising remediation technique, yet optimizing adsorption efficiency remains complex due to the diverse physicochemical properties of PFAS and the wide range of adsorbent materials. Traditional modeling approaches, such as response surface methodology (RSM), struggled to capture nonlinear interactions, while standalone machine learning (ML) models required extensive datasets. This study addressed these limitations by developing hybrid RSM-ML models to improve the prediction and optimization of PFAS adsorption. A comprehensive dataset was constructed using experimental adsorption data, integrating key parameters such as pH, pHpzc, surface area, temperature, and PFAS molecular properties. RSM was employed to model adsorption behavior, while gradient boosting (GB), random forest (RF), and extreme gradient boosting (XGB) were used to enhance predictive performance. Hybrid models (linear, RMSE-based, multiplicative, and meta-learning) were developed and evaluated. The meta-learning HOP-RSM-GB model achieved near-perfect accuracy (R² = 1.00, RMSE = 10.59), outperforming all other models. Surface plots revealed that low pH and high pHpzc maximized the adsorption while increasing log Kow consistently enhanced PFAS adsorption. These findings establish hybrid RSM-ML modeling as a powerful framework for optimizing PFAS remediation strategies. The integration of statistical and machine learning approaches significantly improves predictive accuracy, reduces experimental costs, and provides deeper insights into adsorption mechanisms. This study underscores the importance of data-driven approaches in environmental engineering and highlights future opportunities for integrating ML-driven modeling with experimental adsorption research.

