Accelerating the development of π-conjugated molecules for applications such as energy generation and storage, catalysis, sensing, pharmaceuticals, and (semi)conducting technologies requires rapid and accurate evaluation of their electronic, redox, or optical properties. While high-throughput computational screening has proven to be a tremendous aid in this regard, machine learning (ML) and other data-driven methods can reduce evaluation time by further orders of magnitude while dramatically expanding the chemical space that is explored. However, the lack of benchmark datasets containing the electronic, redox, and optical properties that characterize the diverse, known chemical space of organic π-conjugated molecules limits ML model development. Here, we present a curated dataset containing 25k molecules with density functional theory (DFT) and time-dependent DFT (TDDFT) evaluated properties that include frontier molecular orbitals, ionization energies, relaxation energies, and low-lying optical excitation energies. Using the dataset, we train a hierarchy of ML models, ranging from classical models such as ridge regression to sophisticated graph neural networks, with molecular SMILES representations as input. We observe that graph neural networks augmented with contextual information allow for significantly better predictions across a wide array of properties. Our best-performing models also provide uncertainty quantification for the predictions. To democratize access to the data and trained models, an interactive web platform has been developed and deployed.
Electronic structure prediction of multi-million atom systems through uncertainty quantification enabled transfer learning
Abstract The ground state electron density, obtainable using Kohn-Sham Density Functional Theory (KS-DFT) simulations, contains a wealth of material information, making its prediction via machine learning (ML) models attractive. However, the computational expense of KS-DFT scales cubically with system size, which tends to stymie training data generation, making it difficult to develop quantifiably accurate ML models that are applicable across many scales and system configurations. Here, we address this fundamental challenge by employing transfer learning to leverage the multi-scale nature of the training data, while comprehensively sampling system configurations using thermalization. Our ML models are less reliant on heuristics and, being based on Bayesian neural networks, enable uncertainty quantification. We show that our models incur significantly lower data generation costs while allowing confident (and, when verifiable, accurate) predictions for a wide variety of bulk systems well beyond training, including systems with defects, systems with different alloy compositions, and systems at multi-million-atom scales. Moreover, such predictions can be carried out using only modest computational resources.
- Award ID(s): 2215734
- PAR ID: 10572425
- Publisher / Repository: Nature
- Date Published:
- Journal Name: npj Computational Materials
- Volume: 10
- Issue: 1
- ISSN: 2057-3960
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
Gagliardi, Laura (Ed.) The computational cost of the Kohn–Sham density functional theory (KS-DFT), employing advanced orbital-based exchange–correlation (XC) functionals, increases quickly for large systems. To tackle this problem, we recently developed a local correlation method in the framework of KS-DFT: the embedded cluster density approximation (ECDA). The aim of ECDA is to obtain accurate electronic structures in an entire system. With ECDA, for each atom in a system, we define a cluster to enclose that atom, with the remaining atoms treated as the environment. The system’s electron density is then partitioned among the cluster and the environment. The cluster’s XC energy density is then calculated based on its electron density using an advanced orbital-based XC functional. The system’s XC energy is obtained by patching all clusters’ XC energy densities in an atom-by-atom manner. In our previous formulation of ECDA, environments were treated by KS-DFT, which makes the following two tasks computationally expensive for large systems. The first task is to partition the system’s electron density among a cluster and its environment. The second task is to solve the environments’ Sternheimer equations for calculating the system’s XC potential. In this work, we remove these two computational bottlenecks by treating the environments with the orbital-free (OF) DFT. The new method is called ECDA-envOF. The performance of ECDA-envOF is examined in two systems: ester and Cl-tetracene, for which the exact exchange (EXX) is used as the advanced XC functional. We show that ECDA-envOF gives results that are very close to the previous formulation in which the environments were treated by KS-DFT. Therefore, ECDA-envOF can be used for future large-scale simulations. Another focus of this work is to examine ECDA-envOF’s performance on systems having different bond types. With ECDA-envOF, we calculate the energy curves for stretching/compressing some covalent, metallic, and ionic systems. ECDA-envOF’s predictions agree well with the benchmarks when reasonably large clusters are used. These examples demonstrate that ECDA-envOF is nearly a black-box local correlation method for investigating heterogeneous materials in which different bond types exist.
-
Abstract Density functional theory (DFT) has been a critical component of computational materials research and discovery for decades. However, the computational cost of solving the central Kohn–Sham equation remains a major obstacle for dynamical studies of complex phenomena at scale. Here, we propose an end-to-end machine learning (ML) model that emulates the essence of DFT by mapping the atomic structure of the system to its electronic charge density, followed by the prediction of other properties such as density of states, potential energy, atomic forces, and stress tensor, by using the atomic structure and charge density as input. Our deep learning model successfully bypasses the explicit solution of the Kohn–Sham equation with orders of magnitude speedup (linear scaling with system size with a small prefactor), while maintaining chemical accuracy. We demonstrate the capability of this ML-DFT concept for an extensive database of organic molecules, polymer chains, and polymer crystals.
-
Abstract Accurate theoretical predictions of desired properties of materials play an important role in materials research and development. Machine learning (ML) can accelerate materials design by building a model from input data. For complex datasets, such as those of crystalline compounds, a vital issue is how to construct low-dimensional representations for input crystal structures with chemical insights. In this work, we introduce an algebraic topology-based method, called atom-specific persistent homology (ASPH), as a unique representation of crystal structures. The ASPH can capture both pairwise and many-body interactions and reveal the topology-property relationship of a group of atoms at various scales. Combined with composition-based attributes, an ASPH-based ML model provides a highly accurate prediction of the formation energy calculated by density functional theory (DFT). After training with more than 30,000 different structure types and compositions, our model achieves a mean absolute error of 61 meV/atom in cross-validation, which outperforms previous approaches such as Voronoi tessellations and the Coulomb matrix method using the same ML algorithm and datasets. Our results indicate that the proposed topology-based method provides a powerful computational tool for predicting materials properties relative to previous work.
-
Machine learning (ML) offers an attractive method for making predictions about molecular systems while circumventing the need to run expensive electronic structure calculations. Once trained on ab initio data, the promise of ML is to deliver accurate predictions of molecular properties that were previously computationally infeasible. In this work, we develop and train a graph neural network model to correct the basis set incompleteness error (BSIE) between a small and large basis set at the RHF and B3LYP levels of theory. Our results show that, when compared to fitting to the total potential, an ML model fitted to correct the BSIE is better at generalizing to systems not seen during training. We test this ability by training on single molecules while evaluating on molecular complexes. We also show that ensemble models yield better-behaved potentials in situations where the training data are insufficient. However, even when only fitting to the BSIE, acceptable performance is only achieved when the training data sufficiently resemble the systems one wants to make predictions on. The test error of the final model trained to predict the difference between the cc-pVDZ and cc-pV5Z potentials is 0.184 kcal/mol for the B3LYP density functional, and the ensemble model accurately reproduces the large basis set interaction energy curves on the S66x8 dataset.

