Title: VLA-SMILES: Variable-Length-Array SMILES Descriptors in Neural Network-Based QSAR Modeling
Machine learning represents a milestone in data-driven research, including material informatics, robotics, and computer-aided drug discovery. With the continuously growing virtual and synthetically available chemical space, efficient and robust quantitative structure–activity relationship (QSAR) methods are required to uncover molecules with desired properties. Herein, we propose variable-length-array SMILES-based (VLA-SMILES) structural descriptors that expand conventional SMILES descriptors widely used in machine learning. This structural representation extends the family of numerically coded SMILES, particularly binary SMILES, to expedite the discovery of new deep learning QSAR models with high predictive ability. VLA-SMILES descriptors were shown to speed up the training of QSAR models based on multilayer perceptrons (MLPs) with optimized backpropagation (ATransformedBP), resilient propagation (iRPROP‒), and Adam optimization learning algorithms featuring rational train–test splitting, while improving the predictive ability toward the more compute-intensive binary SMILES representation format. All the tested MLPs under the same length-array-based SMILES descriptors showed similar predictive ability and convergence rate of training in combination with the considered learning procedures. Validation with Kennard–Stone train–test splitting, which is based on structural descriptor similarity metrics, was found to be more effective than partitioning by activity ranking, which is based on biological activity values, for the entire set of VLA-SMILES-featured QSAR models. The robustness and predictive ability of MLP models based on VLA-SMILES were assessed via the method of QSAR parametric model validation.
In addition, statistical testing of the $H_0$ hypothesis for the linear regression between real and observed activities, based on the $F_{2,n-2}$ criterion, was used for predictability estimation among VLA-SMILES-featured QSAR-MLPs (with n being the size of the testing set). The two approaches, QSAR parametric model validation and statistical hypothesis testing, were found to correlate when used for the quantitative evaluation of the predictabilities of the designed QSAR models with VLA-SMILES descriptors.
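The Kennard–Stone train–test splitting mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `kennard_stone_split`, the toy descriptor vectors, and the use of Euclidean distance as the similarity metric are all assumptions made for illustration, since the abstract does not say which metric was used.

```python
from math import dist  # Euclidean distance, Python 3.8+


def kennard_stone_split(X, n_train):
    """Return (train_idx, test_idx) chosen by the Kennard-Stone procedure.

    X is a list of equal-length numeric descriptor vectors (hypothetical
    stand-ins for numerically coded SMILES descriptors).
    """
    n = len(X)
    if not 2 <= n_train <= n:
        raise ValueError("n_train must be between 2 and len(X)")

    # Pairwise distances between descriptor vectors (assumed metric: Euclidean).
    d = [[dist(X[i], X[j]) for j in range(n)] for i in range(n)]

    # Seed the training set with the two most distant samples.
    i0, j0 = max(
        ((i, j) for i in range(n) for j in range(i + 1, n)),
        key=lambda p: d[p[0]][p[1]],
    )
    train = [i0, j0]
    remaining = [k for k in range(n) if k not in (i0, j0)]

    # Repeatedly add the sample whose nearest selected neighbour is farthest
    # away, so the training set spreads evenly over descriptor space.
    while len(train) < n_train:
        nxt = max(remaining, key=lambda k: min(d[k][t] for t in train))
        train.append(nxt)
        remaining.remove(nxt)

    return train, remaining


# Toy usage: five one-dimensional points embedded in 2D.
X = [[0, 0], [1, 0], [10, 0], [5, 0], [9, 0]]
train_idx, test_idx = kennard_stone_split(X, n_train=3)
print(train_idx, test_idx)
```

Because selection depends only on the descriptors, not on the activity values, the split is deterministic for a given descriptor set. This is the property that distinguishes it from partitioning by activity ranking.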
Award ID(s):
2118061
NSF-PAR ID:
10348471
Journal Name:
Machine Learning and Knowledge Extraction
Volume:
4
Issue:
3
ISSN:
2504-4990
Page Range / eLocation ID:
715 to 737
Sponsoring Org:
National Science Foundation
More Like this
  1. In this study, the photocatalytic properties and in vitro cytotoxicity of 29 TiO2-based multi-component nanomaterials (i.e., hybrids of more than two composition types of nanoparticles) were evaluated using a combination of experimental testing and supervised machine learning modeling. TiO2-based multi-component nanomaterials with metal clusters of silver, and their mixtures with gold, palladium, and platinum, were successfully synthesized. Two activities, photocatalytic activity and cytotoxicity, were studied. A novel cheminformatic approach was developed and applied for the computational representation of the photocatalytic activity and cytotoxicity effect. In this approach, features of the investigated TiO2-based hybrid nanomaterials were reflected by a series of novel additive descriptors for hybrid nanostructures (denoted as “hybrid nanostructure descriptors”). These descriptors are based on quantum chemical calculations and the Smoluchowski equation. The obtained experimental data and calculated hybrid nanostructure descriptors were used to develop novel predictive quantitative structure–activity relationship computational models (called “nano-QSAR mix”). The proposed modeling approach is an initial step in understanding the relationships between the physicochemical properties of hybrid nanoparticles, their toxicity, and their photochemical activity under UV-vis irradiation. The acquired knowledge supports safe-by-design approaches relevant to the development of efficient hybrid nanomaterials with reduced hazardous effects.
  2. Using machine learning (ML) to develop quantitative structure–activity relationship (QSAR) models for contaminant reactivity has emerged as a promising approach because it can effectively handle non-linear relationships. However, ML is often data-demanding, whereas data scarcity is common in QSAR model development. Here, we propose two approaches to address this issue: combining small datasets and transferring knowledge between them. First, we compiled four individual datasets for four oxidants, i.e., SO4•-, HClO, O3 and ClO2, each dataset containing a different number of contaminants with their corresponding rate constants and reaction conditions (pH and/or temperature). We then used molecular fingerprints (MF) or molecular descriptors (MD) to represent the contaminants, combined them with ML algorithms to develop individual QSAR models for these four datasets, and interpreted the models with the SHapley Additive exPlanations (SHAP) method. The results showed that both the optimal contaminant representation and the best ML algorithm are dataset dependent. Next, we merged these four datasets and developed a unified model, which showed better predictive performance on the HClO, O3 and ClO2 datasets because the model ‘corrected’ some wrongly learned effects of several atom groups. We further developed knowledge transfer models based on the second approach, whose effectiveness depends on whether consistent knowledge is shared between the two datasets as well as on the predictive performance of the respective single models. This study demonstrates the benefit of combining small, similar datasets and transferring knowledge between them, which can be leveraged to boost the predictive performance of ML-assisted QSAR models.
  3. The purpose of this study was to train and validate a deep learning approach to detect microscale streetscape features related to pedestrian physical activity. This work innovates by combining computer vision techniques with Google Street View (GSV) images to overcome impediments to conducting audits (e.g., time, safety, and expert labor cost). The EfficientNETB5 architecture was used to build deep learning models for eight microscale features guided by the Microscale Audit of Pedestrian Streetscapes Mini tool: sidewalks, sidewalk buffers, curb cuts, zebra and line crosswalks, walk signals, bike symbols, and streetlights. We used a train–correct loop, whereby models were trained on a training dataset, evaluated using a separate validation dataset, and trained further until acceptable performance metrics were achieved. Further, we used the trained models to audit participant (N = 512) neighborhoods in the WalkIT Arizona trial. Correlations were explored between microscale features and GIS-measured and participant-reported neighborhood macroscale walkability. Classifier precision, recall, and overall accuracy were all above 84%. Total microscale was associated with overall macroscale walkability (r = 0.30, p < 0.001). Positive associations were found between model-detected and self-reported sidewalks (r = 0.41, p < 0.001) and sidewalk buffers (r = 0.26, p < 0.001). The computer vision model results suggest an alternative to trained human raters, allowing for audits of hundreds or thousands of neighborhoods for population surveillance or hypothesis testing.
  4. De novo, in-silico design of molecules is a challenging problem with applications in drug discovery and material design. We introduce a masked graph model, which learns a distribution over graphs by capturing conditional distributions over unobserved nodes (atoms) and edges (bonds) given observed ones. We train and then sample from our model by iteratively masking and replacing different parts of initialized graphs. We evaluate our approach on the QM9 and ChEMBL datasets using the GuacaMol distribution-learning benchmark. We find that validity, KL-divergence and Fréchet ChemNet Distance scores are anti-correlated with novelty, and that we can trade off between these metrics more effectively than existing models. On distributional metrics, our model outperforms previously proposed graph-based approaches and is competitive with SMILES-based approaches. Finally, we show that our model generates molecules with desired values of specified properties while maintaining physicochemical similarity to the training distribution.
  5. Introduction: Multi-series CT (MSCT) scans, including non-contrast CT (NCCT), CT Perfusion (CTP), and CT Angiography (CTA), are widely used in acute stroke imaging. While each scan has its own advantage in disease diagnosis, the varying image resolution of the different series hinders the radiologist's ability to discern subtle suspicious findings. In addition, higher image quality requires higher radiation doses, which increase health risks such as cataract formation and cancer induction. It is therefore crucial to develop an approach that improves MSCT resolution and lowers radiation exposure. Hypothesis: MSCT images of the same patient are highly correlated in their structural features, so transferring and integrating the shared and complementary information from different series is beneficial for achieving high image quality. Methods: We propose TL-GAN, a learning-based method that uses Transfer Learning (TL) and a Generative Adversarial Network (GAN) to reconstruct high-quality diagnostic images. TL-GAN is evaluated on 4,382 images collected from nine patients’ MSCT scans, including 415 NCCT slices, 3,696 CTP slices, and 271 CTA slices. We randomly split the nine patients into a training set (4 patients), a validation set (2 patients), and a testing set (3 patients). In preprocessing, we remove the background and skull and visualize the images in the brain window. The low-resolution images (1/4 of the original spatial size) are simulated by bicubic down-sampling. Without TL, we train each series individually; with TL, we fine-tune following the scanning sequence (NCCT, CTP, and CTA). Results: The performance of TL-GAN is evaluated by the peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) index on 184 NCCT, 882 CTP, and 107 CTA test images. Figure 1 provides both visual (a-c) and quantitative (d-f) comparisons. TL-GAN with TL shows a significant improvement over TL-GAN without TL (training from scratch) for NCCT, CTP, and CTA images. The significance of these improvements was evaluated with one-tailed paired t-tests (p < 0.05). We enlarge the regions of interest for detailed visual comparison. Further, we evaluate CTP performance by calculating the perfusion maps, including cerebral blood flow (CBF) and cerebral blood volume (CBV). The visual comparison of the perfusion maps in Figure 2 demonstrates that TL-GAN is beneficial for achieving high diagnostic image quality, comparable to the ground truth images for both CBF and CBV maps. Conclusion: TL-GAN can effectively improve the image resolution of MSCT and provide radiologists with more image detail for suspicious findings, making it a practical solution for MSCT image quality enhancement.