Title: Accurate and transferable multitask prediction of chemical properties with an atoms-in-molecules neural network
Atomic and molecular properties could be evaluated from the fundamental Schrödinger equation and therefore represent different modalities of the same quantum phenomena. Here, we present AIMNet, a modular and chemically inspired deep neural network potential. We used AIMNet with multitarget training to learn multiple modalities of the state of the atom in a molecular system. On several benchmark datasets, the resulting model shows state-of-the-art accuracy, comparable to the results of DFT methods that are orders of magnitude more expensive. It can simultaneously predict several atomic and molecular properties without an increase in the computational cost. With AIMNet, we show a new dimension of transferability: the ability to learn new targets using multimodal information from previous training. The model can learn implicit solvation energy (SMD method) using only a fraction of the original training data and achieve a median absolute deviation error of 1.1 kcal/mol compared to experimental solvation free energies in the MNSol database.
Award ID(s):
1802789
NSF-PAR ID:
10162649
Journal Name:
Science Advances
Volume:
5
Issue:
8
ISSN:
2375-2548
Page Range / eLocation ID:
eaav6490
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Gas-particle partitioning of secondary organic aerosols (SOA) is impacted by particle phase state and viscosity, which can be inferred from the glass transition temperature (Tg) of the constituent organic compounds. Several parametrizations were developed to predict Tg of organic compounds based on molecular properties and elemental composition, but they are subject to relatively large uncertainties as they do not account for molecular structure and functionality. Here we develop a new Tg prediction method powered by machine learning and “molecular embeddings”, which are unique numerical representations of chemical compounds that retain information on their structure, interatomic connectivity and functionality. We have trained multiple state-of-the-art machine learning models on databases of experimental Tg of organic compounds and their corresponding molecular embeddings. The best prediction model is the tgBoost model, built with an Extreme Gradient Boosting (XGBoost) regressor trained via a nested cross-validation method, which reproduces experimental data very well with a mean absolute error of 18.3 K. It can also quantify the influence of the number and location of functional groups on the Tg of organic molecules, while accounting for atom connectivity and predicting different Tg for compositional isomers. The tgBoost model suggests the following trend for sensitivity of Tg to functional group addition: –COOH (carboxylic acid) > –C(O)OR (ester) ≈ –OH (alcohol) > –C(O)R (ketone) ≈ –COR (ether) ≈ –C(O)H (aldehyde). We also developed a model to predict the melting point (Tm) of organic compounds by training a deep neural network on a large dataset of experimental Tm. The model performs reasonably well against the available dataset with a mean absolute error of 31.0 K. These new machine learning powered models can be applied to field and laboratory measurements as well as atmospheric aerosol models to predict the Tg and Tm of SOA compounds for evaluation of the phase state and viscosity of SOA.
  2. Abstract Motivation

    Expanding our knowledge of small molecules beyond what is known in nature or designed in wet laboratories promises to significantly advance cheminformatics, drug discovery, biotechnology and material science. In silico molecular design remains challenging, primarily due to the complexity of the chemical space and the non-trivial relationship between chemical structures and biological properties. Deep generative models that learn directly from data are intriguing, but they have yet to demonstrate interpretability in the learned representation that would let us learn more about the relationship between the chemical and biological space. In this article, we advance research on disentangled representation learning for small molecule generation. We build on recent work by us and others on deep graph generative frameworks, which capture atomic interactions via a graph-based representation of a small molecule. The methodological novelty is how we leverage the concept of disentanglement in the graph variational autoencoder framework both to generate biologically relevant small molecules and to enhance model interpretability.

    Results

    Extensive qualitative and quantitative experimental evaluation in comparison with state-of-the-art models demonstrates the superiority of our disentanglement framework. We believe this work is an important step to address key challenges in small molecule generation with deep generative frameworks.

    Availability and implementation

    Training and generated data are made available at https://ieee-dataport.org/documents/dataset-disentangled-representation-learning-interpretable-molecule-generation. All code is made available at https://anonymous.4open.science/r/D-MolVAE-2799/.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
  3. Implicit solvent models divide solvation free energies into polar and nonpolar additive contributions, whereas polar and nonpolar interactions are inseparable and nonadditive. We present a feature functional theory (FFT) framework to break this ad hoc division. The essential ideas of FFT are as follows: (i) representability assumption: there exists a microscopic feature vector that can uniquely characterize and distinguish one molecule from another; (ii) feature‐function relationship assumption: the macroscopic features of a molecule, including solvation free energy, are a functional of microscopic feature vectors; and (iii) similarity assumption: molecules with similar microscopic features have similar macroscopic properties, such as solvation free energies. Based on these assumptions, solvation free energy prediction is carried out in the following protocol. First, we construct a molecular microscopic feature vector that is efficient in characterizing the solvation process using quantum mechanics and Poisson–Boltzmann theory. Microscopic feature vectors are combined with macroscopic features, that is, physical observables, to form extended feature vectors. Additionally, we partition a solvation dataset into queries according to molecular compositions. Moreover, for each target molecule, we adopt a machine learning algorithm for its nearest neighbor search, based on the selected microscopic feature vectors. Finally, from the extended feature vectors of the obtained nearest neighbors, we construct a functional of solvation free energy, which is employed to predict the solvation free energy of the target molecule. The proposed FFT model has been extensively validated via a large dataset of 668 molecules. The leave‐one‐out test gives an optimal root‐mean‐square error (RMSE) of 1.05 kcal/mol. FFT predictions of the SAMPL0, SAMPL1, SAMPL2, SAMPL3, and SAMPL4 challenge sets deliver RMSEs of 0.61, 1.86, 1.64, 0.86, and 1.14 kcal/mol, respectively. Using a test set of 94 molecules and its associated training set, the present approach was carefully compared with a classic solvation model based on weighted solvent accessible surface area. © 2017 Wiley Periodicals, Inc.

     
  4. The abundance and isotopic composition of noble gases dissolved in water have many applications in the geosciences. In recent years, new analytical techniques have opened the door to the use of high-precision measurements of noble gas isotopes as tracers for groundwater hydrology, oceanography, mantle geochemistry, and paleoclimatology. These analytical advances have brought about new measurements of solubility equilibrium isotope effects (SEIEs) in water (i.e., the relative solubilities of noble gas isotopes) and their sensitivities to temperature and salinity. Here, we carry out a suite of classical molecular dynamics (MD) simulations and employ the theoretical method of quantum correction to estimate SEIEs for comparison with experimental observations. We find that classical MD simulations can accurately predict SEIEs for the isotopes of Ar, Kr, and Xe to order 0.01‰, on the scale of analytical uncertainty. However, MD simulations consistently overpredict the SEIEs of Ne and He by up to 40% of observed values. We carry out sensitivity tests at different temperatures, salinities, and pressures and employ different sets of interatomic potential parameters and water models. For all noble gas isotopes, the TIP4P water model is found to reproduce observed SEIEs more accurately than the SPC/E and TIP4P/ice models. Classical MD simulations also accurately capture the sign and approximate magnitude of temperature and salinity sensitivities of SEIEs for heavy noble gases. We find that experimental and modeled SEIEs generally follow an inverse-square mass dependence, which implies that the mean-square force experienced by a noble gas atom within a solvation shell is similar for all noble gases. This inverse-square mass proportionality is nearly exact for Ar, Kr, and Xe isotopes, but He and Ne exhibit a slightly weaker mass dependence. We hypothesize that the apparent dichotomy between He–Ne and Ar–Kr–Xe SEIEs may result from atomic size differences, whereby the smaller noble gases are more likely to spontaneously fit within cavities of water without breaking water–water H-bonds, thereby experiencing softer collisions during translation within a solvation shell. We further speculate that the overprediction of simulated He and Ne SEIEs may result from the neglect of higher-order quantum corrections or the overly stiff representation of van der Waals repulsion by the widely used Lennard-Jones 6–12 potential model. We suggest that new measurements of SEIEs of heavy and light noble gases may represent a novel set of constraints with which to refine hydrophobic solvation theories and optimize the set of interatomic potential models used in MD simulations of water and noble gases.
  5. Multimodal sentiment analysis is a core research area that studies speaker sentiment expressed through the language, visual, and acoustic modalities. The central challenge in multimodal learning involves inferring joint representations that can process and relate information from these modalities. However, existing work learns joint representations by requiring all modalities as input and, as a result, the learned representations may be sensitive to noisy or missing modalities at test time. With the recent success of sequence to sequence (Seq2Seq) models in machine translation, there is an opportunity to explore new ways of learning joint representations that may not require all input modalities at test time. In this paper, we propose a method to learn robust joint representations by translating between modalities. Our method is based on the key insight that translation from a source to a target modality provides a method of learning joint representations using only the source modality as input. We augment modality translations with a cycle consistency loss to ensure that our joint representations retain maximal information from all modalities. Once our translation model is trained with paired multimodal data, we only need data from the source modality at test time for final sentiment prediction. This ensures that our model remains robust to perturbations or missing information in the other modalities. We train our model with a coupled translation-prediction objective and it achieves new state-of-the-art results on multimodal sentiment analysis datasets: CMU-MOSI, ICT-MMMO, and YouTube. Additional experiments show that our model learns increasingly discriminative joint representations with more input modalities while maintaining robustness to missing or perturbed modalities.
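
The nearest-neighbor protocol and leave-one-out test described in item 3 can be illustrated with a small sketch. This is not the FFT implementation: the abstract does not specify the feature vectors, the neighbor search algorithm, or the form of the functional, so the code below assumes a plain Euclidean nearest-neighbor search and an inverse-distance-weighted average as a stand-in. The function names (`knn_predict`, `leave_one_out_rmse`) and the toy data are hypothetical.

```python
import math


def knn_predict(query, features, targets, k=3):
    """Predict a property for `query` from its k nearest neighbors.

    Assumption: an inverse-distance-weighted mean stands in for the
    functional that the FFT abstract builds from neighbor features.
    """
    # Pair each neighbor's Euclidean distance with its target, keep the k closest.
    nearest = sorted(
        (math.dist(query, f), t) for f, t in zip(features, targets)
    )[:k]
    # A small epsilon avoids division by zero for exact matches.
    weights = [1.0 / (d + 1e-9) for d, _ in nearest]
    return sum(w * t for w, (_, t) in zip(weights, nearest)) / sum(weights)


def leave_one_out_rmse(features, targets, k=3):
    """Hold out each sample in turn, predict it from the rest, return the RMSE."""
    squared_error = 0.0
    for i in range(len(features)):
        rest_f = features[:i] + features[i + 1:]
        rest_t = targets[:i] + targets[i + 1:]
        pred = knn_predict(features[i], rest_f, rest_t, k)
        squared_error += (pred - targets[i]) ** 2
    return math.sqrt(squared_error / len(features))


# Made-up one-dimensional toy data (not solvation data).
toy_features = [[0.0], [1.0], [2.0], [3.0]]
toy_targets = [0.0, 1.0, 2.0, 3.0]

exact = knn_predict([1.0], toy_features, toy_targets, k=1)
loo = leave_one_out_rmse(toy_features, toy_targets, k=2)
print(f"prediction at a known point: {exact:.3f}")
print(f"leave-one-out RMSE: {loo:.3f}")
```

On the toy data the interior points are predicted well while the end points, which have neighbors on one side only, carry most of the leave-one-out error. This is the usual reason a similarity-based predictor depends on how densely the dataset covers the region around each target.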