skip to main content

Title: QSAR without borders
Prediction of chemical bioactivity and physical properties has been one of the most important applications of statistical and more recently, machine learning and artificial intelligence methods in chemical sciences. This field of research, broadly known as quantitative structure–activity relationships (QSAR) modeling, has developed many important algorithms and has found a broad range of applications in physical organic and medicinal chemistry in the past 55+ years. This Perspective summarizes recent technological advances in QSAR modeling but it also highlights the applicability of algorithms, modeling methods, and validation practices developed in QSAR to a wide range of research areas outside of traditional QSAR boundaries including synthesis planning, nanotechnology, materials science, biomaterials, and clinical informatics. As modern research methods generate rapidly increasing amounts of data, the knowledge of robust data-driven modelling methods professed within the QSAR field can become essential for scientists working both within and outside of chemical research. We hope that this contribution highlighting the generalizable components of QSAR modeling will serve to address this challenge.
; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ; ;
Award ID(s):
Publication Date:
Journal Name:
Chemical Society Reviews
Page Range or eLocation-ID:
3525 to 3564
Sponsoring Org:
National Science Foundation
More Like this
  1. Outlier detection is a statistical procedure that aims to find suspicious events or items that are different from the normal form of a dataset. It has drawn considerable interest in the field of data mining and machine learning. Outlier detection is important in many applications, including fraud detection in credit card transactions and network intrusion detection. There are two general types of outlier detection: global and local. Global outliers fall outside the normal range for an entire dataset, whereas local outliers may fall within the normal range for the entire dataset, but outside the normal range for the surrounding data points. This paper addresses local outlier detection. The best-known technique for local outlier detection is the Local Outlier Factor (LOF), a density-based technique. There are many LOF algorithms for a static data environment; however, these algorithms cannot be applied directly to data streams, which are an important type of big data. In general, local outlier detection algorithms for data streams are still deficient and better algorithms need to be developed that can effectively analyze the high velocity of data streams to detect local outliers. This paper presents a literature review of local outlier detection algorithms in static and stream environments,more »with an emphasis on LOF algorithms. It collects and categorizes existing local outlier detection algorithms and analyzes their characteristics. Furthermore, the paper discusses the advantages and limitations of those algorithms and proposes several promising directions for developing improved local outlier detection methods for data streams.« less
  2. Abstract
    This dataset includes rainfall, cloud, river and stream hydro-chemistry of the Plynlimon research catchments. The data is from weekly monitoring of stream hydrochemistry of the River Hafren (Severn) at both the Lower and Upper Hafren site from 1998, stream hydrochemistry of the River Hore at the Lower Hore site from 1983 and Upper Hore site from 1984 as well as rainfall hydrochemistry near the Carreg Wen meteorological site from 1983 and cloud hydrochemistry near the Carreg Wen meteorological site from 1990. Data for over 50 chemical determinands are presented alongside data for some in-situ measurements such as water temperature. Full descriptions of the analytical methods used for each determinand is included. The Plynlimon research catchments lie within the headwaters of the River Severn and the River Wye in the uplands of mid-Wales. Intensive and long-term monitoring within the catchments underpins a wealth of hydrological and hydro-chemical research; other linked datasets include river flow, meteorology and a variety of detailed spatial datasets representing the topography, soils and rivers of the catchments. Monitoring is funded by the Centre for Ecology & Hydrology, and is ongoing since 1968.
    Originally designed to improve understanding of water use by coniferous forests, monitoring withinMore>>
  3. Machine learning represents a milestone in data-driven research, including material informatics, robotics, and computer-aided drug discovery. With the continuously growing virtual and synthetically available chemical space, efficient and robust quantitative structure–activity relationship (QSAR) methods are required to uncover molecules with desired properties. Herein, we propose variable-length-array SMILES-based (VLA-SMILES) structural descriptors that expand conventional SMILES descriptors widely used in machine learning. This structural representation extends the family of numerically coded SMILES, particularly binary SMILES, to expedite the discovery of new deep learning QSAR models with high predictive ability. VLA-SMILES descriptors were shown to speed up the training of QSAR models based on multilayer perceptron (MLP) with optimized backpropagation (ATransformedBP), resilient propagation (iRPROP‒), and Adam optimization learning algorithms featuring rational train–test splitting, while improving the predictive ability toward the more compute-intensive binary SMILES representation format. All the tested MLPs under the same length-array-based SMILES descriptors showed similar predictive ability and convergence rate of training in combination with the considered learning procedures. Validation with the Kennard–Stone train–test splitting based on the structural descriptor similarity metrics was found more effective than the partitioning with the ranking by activity based on biological activity values metrics for the entire set of VLA-SMILES featured QSAR. Robustness andmore »the predictive ability of MLP models based on VLA-SMILES were assessed via the method of QSAR parametric model validation. In addition, the method of the statistical H0 hypothesis testing of the linear regression between real and observed activities based on the F2,n−2 -criteria was used for predictability estimation among VLA-SMILES featured QSAR-MLPs (with n being the volume of the testing set). Both approaches of QSAR parametric model validation and statistical hypothesis testing were found to correlate when used for the quantitative evaluation of predictabilities of the designed QSAR models with VLA-SMILES descriptors.« less
  4. BACKGROUND Optical sensing devices measure the rich physical properties of an incident light beam, such as its power, polarization state, spectrum, and intensity distribution. Most conventional sensors, such as power meters, polarimeters, spectrometers, and cameras, are monofunctional and bulky. For example, classical Fourier-transform infrared spectrometers and polarimeters, which characterize the optical spectrum in the infrared and the polarization state of light, respectively, can occupy a considerable portion of an optical table. Over the past decade, the development of integrated sensing solutions by using miniaturized devices together with advanced machine-learning algorithms has accelerated rapidly, and optical sensing research has evolved into a highly interdisciplinary field that encompasses devices and materials engineering, condensed matter physics, and machine learning. To this end, future optical sensing technologies will benefit from innovations in device architecture, discoveries of new quantum materials, demonstrations of previously uncharacterized optical and optoelectronic phenomena, and rapid advances in the development of tailored machine-learning algorithms. ADVANCES Recently, a number of sensing and imaging demonstrations have emerged that differ substantially from conventional sensing schemes in the way that optical information is detected. A typical example is computational spectroscopy. In this new paradigm, a compact spectrometer first collectively captures the comprehensive spectral information ofmore »an incident light beam using multiple elements or a single element under different operational states and generates a high-dimensional photoresponse vector. An advanced algorithm then interprets the vector to achieve reconstruction of the spectrum. This scheme shifts the physical complexity of conventional grating- or interference-based spectrometers to computation. Moreover, many of the recent developments go well beyond optical spectroscopy, and we discuss them within a common framework, dubbed “geometric deep optical sensing.” The term “geometric” is intended to emphasize that in this sensing scheme, the physical properties of an unknown light beam and the corresponding photoresponses can be regarded as points in two respective high-dimensional vector spaces and that the sensing process can be considered to be a mapping from one vector space to the other. The mapping can be linear, nonlinear, or highly entangled; for the latter two cases, deep artificial neural networks represent a natural choice for the encoding and/or decoding processes, from which the term “deep” is derived. In addition to this classical geometric view, the quantum geometry of Bloch electrons in Hilbert space, such as Berry curvature and quantum metrics, is essential for the determination of the polarization-dependent photoresponses in some optical sensors. In this Review, we first present a general perspective of this sensing scheme from the viewpoint of information theory, in which the photoresponse measurement and the extraction of light properties are deemed as information-encoding and -decoding processes, respectively. We then discuss demonstrations in which a reconfigurable sensor (or an array thereof), enabled by device reconfigurability and the implementation of neural networks, can detect the power, polarization state, wavelength, and spatial features of an incident light beam. OUTLOOK As increasingly more computing resources become available, optical sensing is becoming more computational, with device reconfigurability playing a key role. On the one hand, advanced algorithms, including deep neural networks, will enable effective decoding of high-dimensional photoresponse vectors, which reduces the physical complexity of sensors. Therefore, it will be important to integrate memory cells near or within sensors to enable efficient processing and interpretation of a large amount of photoresponse data. On the other hand, analog computation based on neural networks can be performed with an array of reconfigurable devices, which enables direct multiplexing of sensing and computing functions. We anticipate that these two directions will become the engineering frontier of future deep sensing research. On the scientific frontier, exploring quantum geometric and topological properties of new quantum materials in both linear and nonlinear light-matter interactions will enrich the information-encoding pathways for deep optical sensing. In addition, deep sensing schemes will continue to benefit from the latest developments in machine learning. Future highly compact, multifunctional, reconfigurable, and intelligent sensors and imagers will find applications in medical imaging, environmental monitoring, infrared astronomy, and many other areas of our daily lives, especially in the mobile domain and the internet of things. Schematic of deep optical sensing. The n -dimensional unknown information ( w ) is encoded into an m -dimensional photoresponse vector ( x ) by a reconfigurable sensor (or an array thereof), from which w′ is reconstructed by a trained neural network ( n ′ = n and w′   ≈   w ). Alternatively, x may be directly deciphered to capture certain properties of w . Here, w , x , and w′ can be regarded as points in their respective high-dimensional vector spaces ℛ n , ℛ m , and ℛ n ′ .« less
  5. Research interest in nanoscale biomaterials has continued to grow in the past few decades, driving the need to form families of nanomaterials grouped by similar physical or chemical properties. Nanotubes have occupied a unique space in this field, primarily due to their high versatility in a wide range of biomedical applications. Although similar in morphology, members of this nanomaterial family widely differ in synthesis methods, mechanical and physiochemical properties, and therapeutic applications. As this field continues to develop, it is important to provide insight into novel biomaterial developments and their overall impact on current technology and therapeutics. In this review, we aim to characterize and compare two members of the nanotube family: carbon nanotubes (CNTs) and janus-base nanotubes (JBNts). While CNTs have been extensively studied for decades, JBNts provide a fresh perspective on many therapeutic modalities bound by the limitations of carbon-based nanomaterials. Herein, we characterize the morphology, synthesis, and applications of CNTs and JBNts to provide a comprehensive comparison between these nanomaterial technologies.