

This content will become publicly available on December 1, 2025

Title: Electronic structure prediction of multi-million atom systems through uncertainty quantification enabled transfer learning
Abstract The ground state electron density — obtainable using Kohn-Sham Density Functional Theory (KS-DFT) simulations — contains a wealth of material information, making its prediction via machine learning (ML) models attractive. However, the computational expense of KS-DFT scales cubically with system size which tends to stymie training data generation, making it difficult to develop quantifiably accurate ML models that are applicable across many scales and system configurations. Here, we address this fundamental challenge by employing transfer learning to leverage the multi-scale nature of the training data, while comprehensively sampling system configurations using thermalization. Our ML models are less reliant on heuristics, and being based on Bayesian neural networks, enable uncertainty quantification. We show that our models incur significantly lower data generation costs while allowing confident — and when verifiable, accurate — predictions for a wide variety of bulk systems well beyond training, including systems with defects, different alloy compositions, and at multi-million-atom scales. Moreover, such predictions can be carried out using only modest computational resources.
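The abstract's Bayesian neural networks provide predictive uncertainty alongside each prediction. One common, practical stand-in for that idea is an ensemble whose members act like posterior samples: the spread of their predictions estimates the model's confidence, and it grows far from the training data. The sketch below illustrates this with perturbed linear fits; all data and model choices are invented for illustration and are not the paper's actual models.

```python
import random
import statistics

# Toy stand-in for Bayesian-NN-style uncertainty quantification (UQ).
# An ensemble of perturbed linear fits plays the role of posterior samples:
# the spread of their predictions is the uncertainty estimate.

def fit_linear(xs, ys):
    """Least-squares slope and intercept for 1-D data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return slope, my - slope * mx

def ensemble_predict(xs, ys, x_new, n_members=50, noise=0.05, seed=0):
    """Train an ensemble on noise-perturbed targets; return (mean, std)."""
    rng = random.Random(seed)
    preds = []
    for _ in range(n_members):
        ys_noisy = [y + rng.gauss(0.0, noise) for y in ys]
        a, b = fit_linear(xs, ys_noisy)
        preds.append(a * x_new + b)
    return statistics.mean(preds), statistics.stdev(preds)

xs = [0.0, 1.0, 2.0, 3.0]
ys = [0.1, 0.9, 2.1, 2.9]                  # roughly y = x

mean_in, std_in = ensemble_predict(xs, ys, 1.5)     # interpolation
mean_out, std_out = ensemble_predict(xs, ys, 10.0)  # far extrapolation
```

The key behavior — larger predictive spread in extrapolation than interpolation — is what lets such models flag when a prediction is "confident" versus when it should not be trusted.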
Award ID(s):
2215734
PAR ID:
10572425
Author(s) / Creator(s):
; ; ; ;
Publisher / Repository:
Nature
Date Published:
Journal Name:
npj Computational Materials
Volume:
10
Issue:
1
ISSN:
2057-3960
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Accelerating the development of π-conjugated molecules for applications such as energy generation and storage, catalysis, sensing, pharmaceuticals, and (semi)conducting technologies requires rapid and accurate evaluation of the electronic, redox, or optical properties. While high-throughput computational screening has proven to be a tremendous aid in this regard, machine learning (ML) and other data-driven methods can further enable orders of magnitude reduction in time while at the same time providing dramatic increases in the chemical space that is explored. However, the lack of benchmark datasets containing the electronic, redox, and optical properties that characterize the diverse, known chemical space of organic π-conjugated molecules limits ML model development. Here, we present a curated dataset containing 25k molecules with density functional theory (DFT) and time-dependent DFT (TDDFT) evaluated properties that include frontier molecular orbitals, ionization energies, relaxation energies, and low-lying optical excitation energies. Using the dataset, we train a hierarchy of ML models, ranging from classical models such as ridge regression to sophisticated graph neural networks, with molecular SMILES representation as input. We observe that graph neural networks augmented with contextual information allow for significantly better predictions across a wide array of properties. Our best-performing models also provide an uncertainty quantification for the predictions. To democratize access to the data and trained models, an interactive web platform has been developed and deployed. 
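Among the "classical models" this abstract benchmarks is ridge regression. A minimal sketch of the closed-form solution for a single scalar descriptor (no intercept) shows how the regularizer shrinks the weight; the descriptor and target values below are invented, not from the dataset.

```python
# Ridge regression on one scalar molecular descriptor: w = Σxy / (Σx² + λ).
# Real models in the paper use SMILES-derived features; these numbers are toys.

def ridge_fit_1d(xs, ys, lam):
    """Closed-form ridge weight for a single feature without intercept."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + lam)

xs = [1.0, 2.0, 3.0, 4.0]        # e.g. a conjugation-length descriptor
ys = [2.0, 4.1, 5.9, 8.0]        # e.g. some target property, roughly 2*x

w_no_reg = ridge_fit_1d(xs, ys, lam=0.0)    # ordinary least squares
w_ridge  = ridge_fit_1d(xs, ys, lam=10.0)   # regularized: shrunk toward zero
```

The same bias–variance trade-off motivates trying both simple regularized baselines and graph neural networks on a new benchmark dataset.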
  2.
    Abstract Accurate theoretical predictions of desired materials properties play an important role in materials research and development. Machine learning (ML) can accelerate materials design by building a model from input data. For complex datasets, such as those of crystalline compounds, a vital issue is how to construct low-dimensional representations of input crystal structures with chemical insight. In this work, we introduce an algebraic-topology-based method, called atom-specific persistent homology (ASPH), as a unique representation of crystal structures. ASPH can capture both pairwise and many-body interactions and reveal the topology-property relationship of a group of atoms at various scales. Combined with composition-based attributes, the ASPH-based ML model provides a highly accurate prediction of the formation energy calculated by density functional theory (DFT). After training with more than 30,000 different structure types and compositions, our model achieves a mean absolute error of 61 meV/atom in cross-validation, outperforming previous approaches such as Voronoi tessellations and the Coulomb matrix method using the same ML algorithm and datasets. These results indicate that the proposed topology-based method provides a powerful computational tool for predicting materials properties.
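The simplest building block of persistent-homology descriptors like ASPH is 0-dimensional persistence of a point cloud: every point starts as its own connected component, and components "die" at the scale where clusters merge. Those death scales are exactly the edge weights of a minimum spanning tree (single-linkage clustering), which the sketch below computes with Kruskal's algorithm. The 2-D points are illustrative toys, not crystal structures, and this omits the atom-specific and higher-dimensional machinery of ASPH itself.

```python
import math
from itertools import combinations

def h0_death_times(points):
    """Sorted merge (death) scales of 0-dim persistence via a Kruskal MST."""
    edges = sorted(
        (math.dist(p, q), i, j)
        for (i, p), (j, q) in combinations(enumerate(points), 2)
    )
    parent = list(range(len(points)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path halving
            i = parent[i]
        return i

    deaths = []
    for w, i, j in edges:                   # ascending edge weights
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj
            deaths.append(w)                # two components merge at scale w
    return deaths

# Two tight clusters far apart: four small death scales, one large one.
pts = [(0, 0), (0, 1), (1, 0), (10, 0), (10, 1), (11, 0)]
deaths = h0_death_times(pts)
```

The long-lived bar (the death at scale 9) encodes the large-scale two-cluster geometry, while the short bars encode local packing — the multi-scale signal such descriptors feed into an ML model.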
  3. Machine learning (ML) offers an attractive method for making predictions about molecular systems while circumventing the need to run expensive electronic structure calculations. Once trained on ab initio data, the promise of ML is to deliver accurate predictions of molecular properties that were previously computationally infeasible. In this work, we develop and train a graph neural network model to correct the basis set incompleteness error (BSIE) between a small and large basis set at the RHF and B3LYP levels of theory. Our results show that, when compared to fitting to the total potential, an ML model fitted to correct the BSIE is better at generalizing to systems not seen during training. We test this ability by training on single molecules while evaluating on molecular complexes. We also show that ensemble models yield better behaved potentials in situations where the training data is insufficient. However, even when only fitting to the BSIE, acceptable performance is only achieved when the training data sufficiently resemble the systems one wants to make predictions on. The test error of the final model trained to predict the difference between the cc-pVDZ and cc-pV5Z potential is 0.184 kcal/mol for the B3LYP density functional, and the ensemble model accurately reproduces the large basis set interaction energy curves on the S66x8 dataset. 
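The core idea here — fitting the difference between a cheap and an expensive calculation rather than the expensive total itself — is a form of Δ-learning. A minimal sketch under toy assumptions: the "small-basis" and "large-basis" energies and the scalar descriptor below are all invented, and a plain linear fit stands in for the paper's graph neural network.

```python
# Δ-learning sketch for a basis-set incompleteness error (BSIE) correction:
# learn E_large - E_small from cheap inputs, then add the learned correction
# to new small-basis results instead of predicting E_large from scratch.

def fit_line(xs, ys):
    """Least-squares slope and intercept for 1-D data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
        / sum((x - mx) ** 2 for x in xs)
    return a, my - a * mx

# Toy training data: (descriptor, E_small, E_large) per system.
train = [(1.0, -10.0, -10.3), (2.0, -20.0, -20.6), (3.0, -30.0, -30.9)]
xs = [d for d, _, _ in train]
deltas = [e_big - e_small for _, e_small, e_big in train]
a, b = fit_line(xs, deltas)          # linear model of the BSIE

def predict_large(descriptor, e_small):
    """Cheap small-basis energy plus learned correction ≈ large-basis energy."""
    return e_small + (a * descriptor + b)
```

Because the correction is typically smoother and smaller in magnitude than the total potential, a model fit to it tends to generalize better — the abstract's central observation.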
  4. Design optimization, and particularly adjoint-based multi-physics shape and topology optimization, is time-consuming and often requires expensive iterations to converge to desired designs. In response, researchers have developed Machine Learning (ML) approaches, often referred to as Inverse Design methods, to either replace or accelerate tools like topology optimization (TO). However, these methods have their own hidden, non-trivial costs, including those of data generation, training, and refinement of ML-produced designs. This raises the question: when is it actually worth learning an Inverse Design model, compared to just optimizing designs without ML assistance? This paper quantitatively addresses this question by comparing the costs and benefits of three different Inverse Design ML model families on a TO task against running the optimizer by itself. We explore the relationship between the size of the training data and the predictive power of each ML model, as well as the computational and training costs of the models and the extent to which they accelerate or hinder TO convergence. The results demonstrate that simpler models, such as K-Nearest Neighbors and Random Forests, are more effective for TO warmstarting with limited training data, while more complex models, such as deconvolutional neural networks, are preferable with more data. We also emphasize the need to balance the benefits of larger training sets against the costs of data generation when selecting an Inverse Design model. Finally, the paper addresses some challenges that arise when using ML predictions to warmstart optimization and offers suggestions for budget and resource management.
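A K-Nearest Neighbors warmstart, the low-data winner in this comparison, is easy to picture: retrieve the stored optimized designs whose problem parameters are closest to the new query and average them as the optimizer's initial guess. The sketch below uses toy parameter tuples and short "density field" vectors; the library contents and query are invented for illustration.

```python
import math

def knn_warmstart(library, query, k=2):
    """library: list of (params, design) pairs; return averaged design
    of the k entries whose params are closest (Euclidean) to query."""
    ranked = sorted(library, key=lambda pd: math.dist(pd[0], query))
    nearest = [design for _, design in ranked[:k]]
    n = len(nearest)
    return [sum(col) / n for col in zip(*nearest)]

library = [
    ((0.0, 1.0), [1.0, 0.0, 0.0]),   # (load-case params, optimized densities)
    ((0.1, 0.9), [1.0, 1.0, 0.0]),
    ((5.0, 5.0), [0.0, 0.0, 1.0]),
]
x0 = knn_warmstart(library, query=(0.05, 0.95), k=2)  # TO starting design
```

Handing `x0` to the optimizer instead of a uniform initial field is the warmstart; whether the retrieval pays for its data-generation cost is exactly the trade-off the paper quantifies.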
  5. Abstract Density functional theory (DFT) has been a critical component of computational materials research and discovery for decades. However, the computational cost of solving the central Kohn–Sham equation remains a major obstacle for dynamical studies of complex phenomena at scale. Here, we propose an end-to-end machine learning (ML) model that emulates the essence of DFT by mapping the atomic structure of the system to its electronic charge density, and then predicting other properties, such as the density of states, potential energy, atomic forces, and stress tensor, using the atomic structure and charge density as input. Our deep learning model successfully bypasses the explicit solution of the Kohn–Sham equation with orders-of-magnitude speedup (linear scaling with system size, with a small prefactor) while maintaining chemical accuracy. We demonstrate the capability of this ML-DFT concept on an extensive database of organic molecules, polymer chains, and polymer crystals.
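The linear scaling claimed for such surrogates usually comes from locality: each atom's contribution is predicted from its local environment alone, so the total is a single sum over atoms. A toy sketch of that structure, with an invented per-atom function standing in for a trained network:

```python
# Why ML surrogates can scale linearly: one local evaluation per atom,
# then a sum. The per-atom "model" below is a made-up stand-in.

def local_model(descriptor):
    """Toy per-atom energy; a real model would be a trained network."""
    return -1.0 + 0.1 * descriptor

def total_energy(descriptors):
    """O(N) in the number of atoms: one local evaluation each."""
    return sum(local_model(d) for d in descriptors)

e_small = total_energy([0.2] * 10)      # 10-atom system
e_large = total_energy([0.2] * 1000)    # 1000-atom system, same environments
```

Doubling the atom count doubles the work, in contrast to the cubic scaling of solving the Kohn–Sham equations directly.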