skip to main content

Title: Electronic, redox, and optical property prediction of organic π-conjugated molecules through a hierarchy of machine learning approaches
Accelerating the development of π-conjugated molecules for applications such as energy generation and storage, catalysis, sensing, pharmaceuticals, and (semi)conducting technologies requires rapid and accurate evaluation of the electronic, redox, or optical properties. While high-throughput computational screening has proven to be a tremendous aid in this regard, machine learning (ML) and other data-driven methods can further enable orders of magnitude reduction in time while at the same time providing dramatic increases in the chemical space that is explored. However, the lack of benchmark datasets containing the electronic, redox, and optical properties that characterize the diverse, known chemical space of organic π-conjugated molecules limits ML model development. Here, we present a curated dataset containing 25k molecules with density functional theory (DFT) and time-dependent DFT (TDDFT) evaluated properties that include frontier molecular orbitals, ionization energies, relaxation energies, and low-lying optical excitation energies. Using the dataset, we train a hierarchy of ML models, ranging from classical models such as ridge regression to sophisticated graph neural networks, with molecular SMILES representation as input. We observe that graph neural networks augmented with contextual information allow for significantly better predictions across a wide array of properties. Our best-performing models also provide an uncertainty quantification for the predictions. To democratize access to the data and trained models, an interactive web platform has been developed and deployed.  more » « less
Award ID(s):
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
Chemical Science
Page Range / eLocation ID:
203 to 213
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    The electronic, optical, and solid state properties of a series of monoradicals, anions and cations obtained from starting neutral diradicals have been studied. Diradicals based ons‐indacene and indenoacenes, with benzothiophenes fused and in different orientations, feature a varying degree of diradical character in the neutral state, which is here related with the properties of the radical redox forms. The analysis of their optical features in the polymethine monoradicals has been carried out in the framework of the molecular orbital and valence bond theories. Electronic UV‐Vis‐NIR absorption, X‐ray solid‐state diffraction and quantum chemical calculations have been carried out. Studies of the different positive‐/negative‐charged species, both residing in the same skeletalπ‐conjugated backbone, are rare for organic molecules. The key factor for the dual stabilization is the presence of the starting diradical character that enables to indistinctively accommodate a pseudo‐hole and a pseudo‐electron defect with certainly small reorganization energies for ambipolar charge transport.

    more » « less
  2. Biocompatible molecules with electronic functionality provide a promising substrate for biocompatible electronic devices and electronic interfacing with biological systems. Synthetic oligopeptides composed of an aromatic π-core flanked by oligopeptide wings are a class of molecules that can self-assemble in aqueous environments into supramolecular nanoaggregates with emergent optical and electronic activity. We present an integrated computational–experimental pipeline employing all-atom molecular dynamics simulations and experimental UV-visible spectroscopy within an active learning workflow using deep representational learning and multi-objective and multi-fidelity Bayesian optimization to design π-conjugated peptides programmed to self-assemble into elongated pseudo-1D nanoaggregates with a high degree of H-type co-facial stacking of the π-cores. We consider as our design space the 694 982 unique π-conjugated peptides comprising a quaterthiophene π-core flanked by symmetric oligopeptide wings up to five amino acids in length. After sampling only 1181 molecules (∼0.17% of the design space) by computation and 28 (∼0.004%) by experiment, we identify and experimentally validate a diversity of previously unknown high-performing molecules and extract interpretable design rules linking peptide sequence to emergent supramolecular structure and properties. 
    more » « less
  3. Abstract

    It is critical that machine learning (ML) model predictions be trustworthy for high-throughput catalyst discovery approaches. Uncertainty quantification (UQ) methods allow estimation of the trustworthiness of an ML model, but these methods have not been well explored in the field of heterogeneous catalysis. Herein, we investigate different UQ methods applied to a crystal graph convolutional neural network to predict adsorption energies of molecules on alloys from the Open Catalyst 2020 dataset, the largest existing heterogeneous catalyst dataset. We apply three UQ methods to the adsorption energy predictions, namelyk-fold ensembling, Monte Carlo dropout, and evidential regression. The effectiveness of each UQ method is assessed based on accuracy, sharpness, dispersion, calibration, and tightness. Evidential regression is demonstrated to be a powerful approach for rapidly obtaining tunable, competitively trustworthy UQ estimates for heterogeneous catalysis applications when using neural networks. Recalibration of model uncertainties is shown to be essential in practical screening applications of catalysts using uncertainties.

    more » « less
  4. Machine learning (ML) offers an attractive method for making predictions about molecular systems while circumventing the need to run expensive electronic structure calculations. Once trained on ab initio data, the promise of ML is to deliver accurate predictions of molecular properties that were previously computationally infeasible. In this work, we develop and train a graph neural network model to correct the basis set incompleteness error (BSIE) between a small and large basis set at the RHF and B3LYP levels of theory. Our results show that, when compared to fitting to the total potential, an ML model fitted to correct the BSIE is better at generalizing to systems not seen during training. We test this ability by training on single molecules while evaluating on molecular complexes. We also show that ensemble models yield better behaved potentials in situations where the training data is insufficient. However, even when only fitting to the BSIE, acceptable performance is only achieved when the training data sufficiently resemble the systems one wants to make predictions on. The test error of the final model trained to predict the difference between the cc-pVDZ and cc-pV5Z potential is 0.184 kcal/mol for the B3LYP density functional, and the ensemble model accurately reproduces the large basis set interaction energy curves on the S66x8 dataset. 
    more » « less
  5. The lack of publicly available, large, and unbiased datasets is a key bottleneck for the application of machine learning (ML) methods in synthetic chemistry. Data from electronic laboratory notebooks (ELNs) could provide less biased, large datasets, but no such datasets have been made publicly available. The first real-world dataset from the ELNs of a large pharmaceutical company is disclosed and its relationship to high-throughput experimentation (HTE) datasets is described. For chemical yield predictions, a key task in chemical synthesis, an attributed graph neural network (AGNN) performs as well as or better than the best previous models on two HTE datasets for the Suzuki–Miyaura and Buchwald–Hartwig reactions. However, training the AGNN on an ELN dataset does not lead to a predictive model. The implications of using ELN data for training ML-based models are discussed in the context of yield predictions. 
    more » « less