skip to main content

This content will become publicly available on December 21, 2023

Title: Electronic, redox, and optical property prediction of organic π-conjugated molecules through a hierarchy of machine learning approaches
Accelerating the development of π-conjugated molecules for applications such as energy generation and storage, catalysis, sensing, pharmaceuticals, and (semi)conducting technologies requires rapid and accurate evaluation of the electronic, redox, or optical properties. While high-throughput computational screening has proven to be a tremendous aid in this regard, machine learning (ML) and other data-driven methods can further enable orders of magnitude reduction in time while at the same time providing dramatic increases in the chemical space that is explored. However, the lack of benchmark datasets containing the electronic, redox, and optical properties that characterize the diverse, known chemical space of organic π-conjugated molecules limits ML model development. Here, we present a curated dataset containing 25k molecules with density functional theory (DFT) and time-dependent DFT (TDDFT) evaluated properties that include frontier molecular orbitals, ionization energies, relaxation energies, and low-lying optical excitation energies. Using the dataset, we train a hierarchy of ML models, ranging from classical models such as ridge regression to sophisticated graph neural networks, with molecular SMILES representation as input. We observe that graph neural networks augmented with contextual information allow for significantly better predictions across a wide array of properties. Our best-performing models also provide an uncertainty quantification for the predictions. more » To democratize access to the data and trained models, an interactive web platform has been developed and deployed. « less
; ; ; ; ;
Award ID(s):
Publication Date:
Journal Name:
Chemical Science
Page Range or eLocation-ID:
203 to 213
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    The electronic, optical, and solid state properties of a series of monoradicals, anions and cations obtained from starting neutral diradicals have been studied. Diradicals based ons‐indacene and indenoacenes, with benzothiophenes fused and in different orientations, feature a varying degree of diradical character in the neutral state, which is here related with the properties of the radical redox forms. The analysis of their optical features in the polymethine monoradicals has been carried out in the framework of the molecular orbital and valence bond theories. Electronic UV‐Vis‐NIR absorption, X‐ray solid‐state diffraction and quantum chemical calculations have been carried out. Studies of the different positive‐/negative‐charged species, both residing in the same skeletalπ‐conjugated backbone, are rare for organic molecules. The key factor for the dual stabilization is the presence of the starting diradical character that enables to indistinctively accommodate a pseudo‐hole and a pseudo‐electron defect with certainly small reorganization energies for ambipolar charge transport.

  2. Machine learning (ML) offers an attractive method for making predictions about molecular systems while circumventing the need to run expensive electronic structure calculations. Once trained on ab initio data, the promise of ML is to deliver accurate predictions of molecular properties that were previously computationally infeasible. In this work, we develop and train a graph neural network model to correct the basis set incompleteness error (BSIE) between a small and large basis set at the RHF and B3LYP levels of theory. Our results show that, when compared to fitting to the total potential, an ML model fitted to correct the BSIE is better at generalizing to systems not seen during training. We test this ability by training on single molecules while evaluating on molecular complexes. We also show that ensemble models yield better behaved potentials in situations where the training data is insufficient. However, even when only fitting to the BSIE, acceptable performance is only achieved when the training data sufficiently resemble the systems one wants to make predictions on. The test error of the final model trained to predict the difference between the cc-pVDZ and cc-pV5Z potential is 0.184 kcal/mol for the B3LYP density functional, and the ensemble modelmore »accurately reproduces the large basis set interaction energy curves on the S66x8 dataset.« less
  3. Abstract Designing a new heterostructure electrode has many challenges associated with interface engineering. Demanding simulation resources and lack of heterostructure databases continue to be a barrier to understanding the chemistry and mechanics of complex interfaces using simulations. Mixed-dimensional heterostructures composed of two-dimensional (2D) and three-dimensional (3D) materials are undisputed next-generation materials for engineered devices due to their changeable properties. The present work computationally investigates the interface between 2D graphene and 3D tin (Sn) systems with density functional theory (DFT) method. This computationally demanding simulation data is further used to develop machine learning (ML)-based potential energy surfaces (PES). The approach to developing PES for complex interface systems in the light of limited data and the transferability of such models has been discussed. To develop PES for graphene-tin interface systems, high-dimensional neural networks (HDNN) are used that rely on atom-centered symmetry function to represent structural information. HDNN are modified to train on the total energies of the interface system rather than atomic energies. The performance of modified HDNN trained on 5789 interface structures of graphene|Sn is tested on new interfaces of the same material pair with varying levels of structural deviations from the training dataset. Root-mean-squared error (RMSE) for test interfaces fallmore »in the range of 0.01–0.45 eV/atom, depending on the structural deviations from the reference training dataset. By avoiding incorrect decomposition of total energy into atomic energies, modified HDNN model is shown to obtain higher accuracy and transferability despite a limited dataset. Improved accuracy in the ML-based modeling approach promises cost-effective means of designing interfaces in heterostructure energy storage systems with higher cycle life and stability.« less
  4. Nuclear magnetic resonance (NMR) is one of the primary techniques used to elucidate the chemical structure, bonding, stereochemistry, and conformation of organic compounds. The distinct chemical shifts in an NMR spectrum depend upon each atom's local chemical environment and are influenced by both through-bond and through-space interactions with other atoms and functional groups. The in silico prediction of NMR chemical shifts using quantum mechanical (QM) calculations is now commonplace in aiding organic structural assignment since spectra can be computed for several candidate structures and then compared with experimental values to find the best possible match. However, the computational demands of calculating multiple structural- and stereo-isomers, each of which may typically exist as an ensemble of rapidly-interconverting conformations, are expensive. Additionally, the QM predictions themselves may lack sufficient accuracy to identify a correct structure. In this work, we address both of these shortcomings by developing a rapid machine learning (ML) protocol to predict 1 H and 13 C chemical shifts through an efficient graph neural network (GNN) using 3D structures as input. Transfer learning with experimental data is used to improve the final prediction accuracy of a model trained using QM calculations. When tested on the CHESHIRE dataset, the proposed modelmore »predicts observed 13 C chemical shifts with comparable accuracy to the best-performing DFT functionals (1.5 ppm) in around 1/6000 of the CPU time. An automated prediction webserver and graphical interface are accessible online at We further demonstrate the model in three applications: first, we use the model to decide the correct organic structure from candidates through experimental spectra, including complex stereoisomers; second, we automatically detect and revise incorrect chemical shift assignments in a popular NMR database, the NMRShiftDB; and third, we use NMR chemical shifts as descriptors for determination of the sites of electrophilic aromatic substitution.« less
  5. Abstract

    Maximum diversification of data is a central theme in building generalized and accurate machine learning (ML) models. In chemistry, ML has been used to develop models for predicting molecular properties, for example quantum mechanics (QM) calculated potential energy surfaces and atomic charge models. The ANI-1x and ANI-1ccx ML-based general-purpose potentials for organic molecules were developed through active learning; an automated data diversification process. Here, we describe the ANI-1x and ANI-1ccx data sets. To demonstrate data diversity, we visualize it with a dimensionality reduction scheme, and contrast against existing data sets. The ANI-1x data set contains multiple QM properties from 5 M density functional theory calculations, while the ANI-1ccx data set contains 500 k data points obtained with an accurate CCSD(T)/CBS extrapolation. Approximately 14 million CPU core-hours were expended to generate this data. Multiple QM calculated properties for the chemical elements C, H, N, and O are provided: energies, atomic forces, multipole moments, atomic charges, etc. We provide this data to the community to aid research and development of ML models for chemistry.