Title: QSAR without borders
Prediction of chemical bioactivity and physical properties has been one of the most important applications of statistical and, more recently, machine learning and artificial intelligence methods in the chemical sciences. This field of research, broadly known as quantitative structure–activity relationship (QSAR) modeling, has produced many important algorithms and has found a broad range of applications in physical organic and medicinal chemistry over the past 55+ years. This Perspective summarizes recent technological advances in QSAR modeling, and it also highlights the applicability of algorithms, modeling methods, and validation practices developed in QSAR to a wide range of research areas outside traditional QSAR boundaries, including synthesis planning, nanotechnology, materials science, biomaterials, and clinical informatics. As modern research methods generate rapidly increasing amounts of data, knowledge of the robust data-driven modeling methods professed within the QSAR field can become essential for scientists working both within and outside chemical research. We hope that this contribution, which highlights the generalizable components of QSAR modeling, will help address this challenge.
Award ID(s):
1802831
NSF-PAR ID:
10198968
Journal Name:
Chemical Society Reviews
Volume:
49
Issue:
11
ISSN:
0306-0012
Page Range / eLocation ID:
3525 to 3564
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Outlier detection is a statistical procedure that aims to find suspicious events or items that are different from the normal form of a dataset. It has drawn considerable interest in the field of data mining and machine learning. Outlier detection is important in many applications, including fraud detection in credit card transactions and network intrusion detection. There are two general types of outlier detection: global and local. Global outliers fall outside the normal range for an entire dataset, whereas local outliers may fall within the normal range for the entire dataset, but outside the normal range for the surrounding data points. This paper addresses local outlier detection. The best-known technique for local outlier detection is the Local Outlier Factor (LOF), a density-based technique. There are many LOF algorithms for a static data environment; however, these algorithms cannot be applied directly to data streams, which are an important type of big data. In general, local outlier detection algorithms for data streams are still deficient and better algorithms need to be developed that can effectively analyze the high velocity of data streams to detect local outliers. This paper presents a literature review of local outlier detection algorithms in static and stream environments, with an emphasis on LOF algorithms. It collects and categorizes existing local outlier detection algorithms and analyzes their characteristics. Furthermore, the paper discusses the advantages and limitations of those algorithms and proposes several promising directions for developing improved local outlier detection methods for data streams. 
  2. Abstract

    Olfactory cues play an important role in mammalian biology, but have been challenging to assess in the field. Current methods pose problematic issues with sample storage and transportation, limiting our ability to connect chemical variation in scents with relevant ecological and behavioral contexts. Real‐time, in‐field analysis via portable gas chromatography–mass spectrometry (GC‐MS) has the potential to overcome these issues, but with trade‐offs of reduced sensitivity and compound mass range. We field‐tested the ability of portable GC‐MS to support two representative applications of chemical ecology research with a wild arboreal primate, common marmoset monkeys (Callithrix jacchus). We developed methods to (a) evaluate the chemical composition of marmoset scent marks deposited at feeding sites and (b) characterize the scent profiles of exudates eaten by marmosets. We successfully collected marmoset scent marks across several canopy heights, with the portable GC‐MS detecting known components of marmoset glandular secretions and differentiating these from in‐field controls. Likewise, variation in the chemical profile of scent marks demonstrated a significant correlation with marmoset feeding behavior, indicating these scents’ biological relevance. The portable GC‐MS also delineated species‐specific olfactory signatures of exudates fed on by marmosets. Despite the trade‐offs, portable GC‐MS represents a viable option for characterizing olfactory compounds used by wild mammals, yielding biologically relevant data. While the decision to adopt portable GC‐MS will likely depend on site‐ and project‐specific needs, our ability to conduct two example applications under relatively challenging field conditions bodes well for the versatility of in‐field GC‐MS.

     
  3. Abstract

    In recent years, the availability of airborne imaging spectroscopy (hyperspectral) data has expanded dramatically. The high spatial and spectral resolution of these data uniquely enable spatially explicit ecological studies including species mapping, assessment of drought mortality and foliar trait distributions. However, we have barely begun to unlock the potential of these data to use direct mapping of vegetation characteristics to infer subsurface properties of the critical zone. To assess their utility for Earth systems research, imaging spectroscopy data acquisitions require integration with large, coincident ground‐based datasets collected by experts in ecology and environmental and Earth science. Without coordinated, well‐planned field campaigns, potential knowledge leveraged from advanced airborne data collections could be lost. Despite the growing importance of this field, documented methods to couple such a wide variety of disciplines remain sparse.

    We coordinated the first National Ecological Observatory Network Airborne Observation Platform (AOP) survey performed outside of their core sites, which took place in the Upper East River watershed, Colorado. Extensive planning for sample tracking and organization allowed field and flight teams to update the ground‐based sampling strategy daily. This enabled collection of an extensive set of physical samples to support a wide range of ecological, microbiological, biogeochemical and hydrological studies.

    We present a framework for integrating airborne and field campaigns to obtain high‐quality data for foliar trait prediction and document an archive of coincident physical samples collected to support a systems approach to ecological research in the critical zone. This detailed methodological account provides an example of how a multi‐disciplinary and multi‐institutional team can coordinate to maximize knowledge gained from an airborne survey, an approach that could be extended to other studies.

    The coordination of imaging spectroscopy surveys with appropriately timed and extensive field surveys, along with high‐quality processing of these data, presents a unique opportunity to reveal new insights into the structure and dynamics of the critical zone. To our knowledge, this level of co‐aligned sampling has never been undertaken in tandem with AOP surveys and subsequent studies utilizing this archive will shed considerable light on the breadth of applications for which imaging spectroscopy data can be leveraged.

     
  4. Machine learning represents a milestone in data-driven research, including material informatics, robotics, and computer-aided drug discovery. With the continuously growing virtual and synthetically available chemical space, efficient and robust quantitative structure–activity relationship (QSAR) methods are required to uncover molecules with desired properties. Herein, we propose variable-length-array SMILES-based (VLA-SMILES) structural descriptors that expand conventional SMILES descriptors widely used in machine learning. This structural representation extends the family of numerically coded SMILES, particularly binary SMILES, to expedite the discovery of new deep learning QSAR models with high predictive ability. VLA-SMILES descriptors were shown to speed up the training of QSAR models based on multilayer perceptron (MLP) with optimized backpropagation (ATransformedBP), resilient propagation (iRPROP‒), and Adam optimization learning algorithms featuring rational train–test splitting, while improving the predictive ability toward the more compute-intensive binary SMILES representation format. All the tested MLPs under the same length-array-based SMILES descriptors showed similar predictive ability and convergence rate of training in combination with the considered learning procedures. Validation with the Kennard–Stone train–test splitting based on the structural descriptor similarity metrics was found more effective than the partitioning with the ranking by activity based on biological activity values metrics for the entire set of VLA-SMILES featured QSAR. Robustness and the predictive ability of MLP models based on VLA-SMILES were assessed via the method of QSAR parametric model validation. 
    In addition, statistical testing of the null hypothesis (H0) for the linear regression between real and observed activities, based on the $F_{2,n-2}$ criterion, was used to estimate predictability among VLA-SMILES featured QSAR-MLPs (with n being the size of the testing set). Both approaches, QSAR parametric model validation and statistical hypothesis testing, were found to correlate when used for the quantitative evaluation of the predictive ability of the designed QSAR models with VLA-SMILES descriptors.
  5. This dataset includes rainfall, cloud, river and stream hydro-chemistry of the Plynlimon research catchments. The data are from weekly monitoring of stream hydrochemistry of the River Hafren (Severn) at both the Lower and Upper Hafren sites from 1998, stream hydrochemistry of the River Hore at the Lower Hore site from 1983 and the Upper Hore site from 1984, as well as rainfall hydrochemistry near the Carreg Wen meteorological site from 1983 and cloud hydrochemistry near the Carreg Wen meteorological site from 1990. Data for over 50 chemical determinands are presented alongside data for some in-situ measurements such as water temperature. Full descriptions of the analytical methods used for each determinand are included. The Plynlimon research catchments lie within the headwaters of the River Severn and the River Wye in the uplands of mid-Wales. Intensive and long-term monitoring within the catchments underpins a wealth of hydrological and hydro-chemical research; other linked datasets include river flow, meteorology and a variety of detailed spatial datasets representing the topography, soils and rivers of the catchments. Monitoring is funded by the Centre for Ecology & Hydrology and has been ongoing since 1968. Originally designed to improve understanding of water use by coniferous forests, monitoring within the Plynlimon research catchments has subsequently developed into a multi-disciplinary project underpinning a wealth of hydrological research. The stream hydrochemistry dataset is a combination of chemical determinands and in-situ physical properties. Samples are taken on an approximately weekly basis and analysed in laboratories for approximately 50 chemical determinands. The laboratory analysis has been undertaken at different locations using different equipment during the years of monitoring represented within the dataset. Details of the analytical methods used for each determinand are included within the dataset.
    The information provided is presented as raw data which have not been modified to include "less than" values when the numbers are less than the lowest quotable value. This allows the data to be used for research purposes, including detailed statistical analysis where inclusion of less than values would be problematic. The data are thus "uncensored" for research purposes and include in places values that are negative due to analytical uncertainties and the extremely low concentrations of some of the determinands. The data are provided in good faith with the expectation that they will be used in a positive and constructive manner. It is up to the user to decide how to use and interpret the data; information on full methodologies and less than values is provided within the metadata files. We wholeheartedly encourage the use of these data for research and educational purposes, bearing in mind that they are uncensored data and that over the years research directions have evolved to reflect changing project and funding priorities while analytical methodologies have changed. The data provide a legacy from many years' research involving scientists, analysts and field workers who in many cases have now retired. While we have done our best to provide all the information possible, in many cases it will not be possible to respond to detailed enquiries about the earlier parts of the record.
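The note in item 5 about uncensored values could be handled along the following lines. This is only a minimal sketch: the `Reading` type, the `flag_below_limit` helper, the determinand names and the limit value are all hypothetical, and the real lowest quotable values would have to be taken from the dataset's metadata files. The sketch flags readings that fall below a limit while leaving the raw (possibly negative) values unchanged, so they remain usable for statistical analysis.

```python
from dataclasses import dataclass


@dataclass
class Reading:
    determinand: str  # hypothetical determinand name
    value: float      # raw, uncensored laboratory value (may be negative)


def flag_below_limit(readings, limits):
    """Pair each reading with a flag that is True when its value lies below
    the lowest quotable value for its determinand. Raw values are never
    modified; determinands with no known limit are flagged False."""
    flagged = []
    for reading in readings:
        limit = limits.get(reading.determinand)
        below = limit is not None and reading.value < limit
        flagged.append((reading, below))
    return flagged


# Illustrative values only; none of these numbers come from the dataset.
readings = [
    Reading("nitrate", 0.42),
    Reading("nitrate", -0.01),
    Reading("aluminium", 0.003),
]
limits = {"nitrate": 0.05}  # hypothetical lowest quotable value

flagged = flag_below_limit(readings, limits)
for reading, below in flagged:
    print(reading.determinand, reading.value, "below limit" if below else "ok")
```

Keeping the flag separate from the value lets a user decide later whether to censor, substitute, or retain low readings, in line with the abstract's statement that interpretation is left to the user.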