Title: QSAR without borders
Prediction of chemical bioactivity and physical properties has been one of the most important applications of statistical and, more recently, machine learning and artificial intelligence methods in the chemical sciences. This field of research, broadly known as quantitative structure–activity relationship (QSAR) modeling, has developed many important algorithms and has found a broad range of applications in physical organic and medicinal chemistry over the past 55+ years. This Perspective summarizes recent technological advances in QSAR modeling, and it also highlights the applicability of algorithms, modeling methods, and validation practices developed in QSAR to a wide range of research areas outside traditional QSAR boundaries, including synthesis planning, nanotechnology, materials science, biomaterials, and clinical informatics. As modern research methods generate rapidly increasing amounts of data, knowledge of the robust data-driven modeling methods practiced within the QSAR field can become essential for scientists working both within and outside of chemical research. We hope that this contribution highlighting the generalizable components of QSAR modeling will serve to address this challenge.
Award ID(s):
1802831
PAR ID:
10198968
Journal Name:
Chemical Society Reviews
Volume:
49
Issue:
11
ISSN:
0306-0012
Page Range / eLocation ID:
3525 to 3564
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Outlier detection is a statistical procedure that aims to find suspicious events or items that are different from the normal form of a dataset. It has drawn considerable interest in the field of data mining and machine learning. Outlier detection is important in many applications, including fraud detection in credit card transactions and network intrusion detection. There are two general types of outlier detection: global and local. Global outliers fall outside the normal range for an entire dataset, whereas local outliers may fall within the normal range for the entire dataset, but outside the normal range for the surrounding data points. This paper addresses local outlier detection. The best-known technique for local outlier detection is the Local Outlier Factor (LOF), a density-based technique. There are many LOF algorithms for a static data environment; however, these algorithms cannot be applied directly to data streams, which are an important type of big data. In general, local outlier detection algorithms for data streams are still deficient and better algorithms need to be developed that can effectively analyze the high velocity of data streams to detect local outliers. This paper presents a literature review of local outlier detection algorithms in static and stream environments, with an emphasis on LOF algorithms. It collects and categorizes existing local outlier detection algorithms and analyzes their characteristics. Furthermore, the paper discusses the advantages and limitations of those algorithms and proposes several promising directions for developing improved local outlier detection methods for data streams.
  2. Machine learning represents a milestone in data-driven research, including material informatics, robotics, and computer-aided drug discovery. With the continuously growing virtual and synthetically available chemical space, efficient and robust quantitative structure–activity relationship (QSAR) methods are required to uncover molecules with desired properties. Herein, we propose variable-length-array SMILES-based (VLA-SMILES) structural descriptors that expand conventional SMILES descriptors widely used in machine learning. This structural representation extends the family of numerically coded SMILES, particularly binary SMILES, to expedite the discovery of new deep learning QSAR models with high predictive ability. VLA-SMILES descriptors were shown to speed up the training of QSAR models based on multilayer perceptron (MLP) with optimized backpropagation (ATransformedBP), resilient propagation (iRPROP‒), and Adam optimization learning algorithms featuring rational train–test splitting, while improving the predictive ability relative to the more compute-intensive binary SMILES representation format. All the tested MLPs under the same length-array-based SMILES descriptors showed similar predictive ability and convergence rate of training in combination with the considered learning procedures. Validation with Kennard–Stone train–test splitting, based on structural descriptor similarity metrics, was found to be more effective than partitioning by activity ranking, based on biological activity values, for the entire set of VLA-SMILES-featured QSAR models. Robustness and the predictive ability of MLP models based on VLA-SMILES were assessed via the method of QSAR parametric model validation. In addition, statistical testing of the H0 hypothesis for the linear regression between real and observed activities, based on the F_{2, n−2} criterion, was used for predictability estimation among VLA-SMILES-featured QSAR-MLPs (with n being the size of the testing set). Both approaches, QSAR parametric model validation and statistical hypothesis testing, were found to correlate when used for the quantitative evaluation of the predictabilities of the designed QSAR models with VLA-SMILES descriptors.
  3. Research interest in nanoscale biomaterials has continued to grow in the past few decades, driving the need to form families of nanomaterials grouped by similar physical or chemical properties. Nanotubes have occupied a unique space in this field, primarily due to their high versatility in a wide range of biomedical applications. Although similar in morphology, members of this nanomaterial family differ widely in synthesis methods, mechanical and physicochemical properties, and therapeutic applications. As this field continues to develop, it is important to provide insight into novel biomaterial developments and their overall impact on current technology and therapeutics. In this review, we aim to characterize and compare two members of the nanotube family: carbon nanotubes (CNTs) and Janus-base nanotubes (JBNts). While CNTs have been extensively studied for decades, JBNts provide a fresh perspective on many therapeutic modalities bound by the limitations of carbon-based nanomaterials. Herein, we characterize the morphology, synthesis, and applications of CNTs and JBNts to provide a comprehensive comparison between these nanomaterial technologies.
  4. Using machine learning (ML) to develop quantitative structure–activity relationship (QSAR) models for contaminant reactivity has emerged as a promising approach because it can effectively handle non-linear relationships. However, ML is often data-demanding, whereas data scarcity is common in QSAR model development. Here, we proposed two approaches to address this issue: combining small datasets and transferring knowledge between them. First, we compiled four individual datasets for four oxidants, i.e., SO4•-, HClO, O3 and ClO2, each dataset containing a different number of contaminants with their corresponding rate constants and reaction conditions (pH and/or temperature). We then used molecular fingerprints (MF) or molecular descriptors (MD) to represent the contaminants; combined them with ML algorithms to develop individual QSAR models for these four datasets; and interpreted the models using the SHapley Additive exPlanations (SHAP) method. The results showed that both the optimal contaminant representation and the best ML algorithm are dataset dependent. Next, we merged these four datasets and developed a unified model, which showed better predictive performance on the datasets of HClO, O3 and ClO2 because the model ‘corrected’ some wrongly learned effects of several atom groups. We further developed knowledge transfer models based on the second approach, the effectiveness of which depends on whether consistent knowledge is shared between the two datasets, as well as on the predictive performance of the respective single models. This study demonstrated the benefit of combining small similar datasets and transferring knowledge between them, which can be leveraged to boost the predictive performance of ML-assisted QSAR models.
  5. Over the past decade, the field of natural language processing has developed a wide array of computational methods for reasoning about narrative, including summarization, commonsense inference, and event detection. While this work has brought an important empirical lens for examining narrative, it is by and large divorced from the large body of theoretical work on narrative within the humanities, social and cognitive sciences. In this position paper, we introduce the dominant theoretical frameworks to the NLP community, situate current research in NLP within distinct narratological traditions, and argue that linking computational work in NLP to theory opens up a range of new empirical questions that would both help advance our understanding of narrative and open up new practical applications. 