Title: Local-environment-guided selection of atomic structures for the development of machine-learning potentials
Machine learning potentials (MLPs) have attracted significant attention in computational chemistry and materials science due to their high accuracy and computational efficiency. The proper selection of atomic structures is crucial for developing reliable MLPs: insufficient or redundant structures can impede training and yield a poor-quality model. Here, we propose a local-environment-guided screening algorithm for efficient dataset selection in MLP development. The algorithm maintains a bank of unique local atomic environments and evaluates the dissimilarity between a candidate environment and those in the bank using the Euclidean distance. A new structure is selected only if one of its local environments differs significantly from those already present in the bank, and the bank is then updated with all the new local environments found in the selected structure. To demonstrate the effectiveness of our algorithm, we applied it to select structures for a Ge system and a Pd13H2 particle system. The algorithm reduced the training data size by around 80% for both systems without compromising the performance of the MLP models, and we verified that the results were independent of the selection and ordering of the initial structures. We also compared our method with the farthest point sampling algorithm; the results show that our algorithm is superior in both robustness and computational efficiency. Furthermore, the generated local environment bank can be continuously updated and can potentially serve as a growing database of representative local environments, aiding efficient dataset maintenance for constructing accurate MLPs.
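Below is a minimal sketch of the screening loop described in the abstract, assuming each atom's local environment has already been encoded as a fixed-length descriptor vector (the featurizer, the threshold d_min, and all names are illustrative assumptions, not details from the paper).

```python
import numpy as np

def select_structures(structures, d_min=0.1):
    """structures: list of (n_atoms, n_features) descriptor arrays, one per
    candidate structure. Returns the indices of the selected structures.
    The descriptor pipeline and d_min are stand-ins, not values from the paper."""
    bank = []       # 1D descriptor vectors of unique local environments seen so far
    selected = []
    for idx, envs in enumerate(structures):
        if not bank:
            novel = np.ones(len(envs), dtype=bool)   # first structure: keep everything
        else:
            bank_arr = np.asarray(bank)
            # Euclidean distance from every candidate environment to every banked one
            dists = np.linalg.norm(envs[:, None, :] - bank_arr[None, :, :], axis=-1)
            novel = dists.min(axis=1) > d_min        # far from all banked environments?
        if novel.any():
            selected.append(idx)                     # structure adds new information
            bank.extend(envs[novel])                 # grow the bank with its novel environments
    return selected
```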
Award ID(s):
2102317
PAR ID:
10566103
Author(s) / Creator(s):
; ; ; ; ;
Publisher / Repository:
AIP Publishing
Date Published:
Journal Name:
The Journal of Chemical Physics
Volume:
160
Issue:
7
ISSN:
0021-9606
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
1. Machine learning potentials (MLPs) are poised to combine the accuracy of ab initio predictions with the computational efficiency of classical molecular dynamics (MD) simulations. While great progress has been made over the last two decades in developing MLPs, much remains to be done to evaluate their model transferability and facilitate their development. In this work, we construct two deep potential (DP) models for liquid water near graphene surfaces, Model S and Model F, with the latter having more training data. A concurrent learning algorithm (DP-GEN) is adopted to explore the configurational space beyond the scope of conventional ab initio MD simulation. Examining the performance of Model S, we find that accurate prediction of atomic forces does not imply accurate prediction of system energy, and that the relative deviation of the atomic forces alone is insufficient to assess the accuracy of the DP models. Based on the performance of Model F, we propose that the relative magnitude of the model deviation and the corresponding root-mean-square error on the original test dataset, including energy and atomic forces, can serve as an indicator of the accuracy of the model prediction for a given structure; this is particularly applicable to large systems for which density functional theory calculations are infeasible. Beyond the prediction accuracy described above, we also briefly discuss simulation stability and its relationship to accuracy. Both are important aspects in assessing the transferability of an MLP model.
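As an illustration of the proposed indicator, the following sketch computes the ensemble force model deviation for a structure and compares it against the force RMSE on the original test set. The ensemble callables and the comparison factor are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def force_model_deviation(force_fns, structure):
    """force_fns: ensemble of callables, each mapping a structure to an
    (n_atoms, 3) force array (stand-ins for your MLP inference calls)."""
    F = np.stack([f(structure) for f in force_fns])           # (n_models, n_atoms, 3)
    # per-atom spread of the predicted force vector across the ensemble,
    # reduced to its maximum over atoms (the usual DP-GEN convention)
    spread = np.sqrt(((F - F.mean(axis=0)) ** 2).sum(axis=-1).mean(axis=0))
    return spread.max()

def likely_reliable(force_fns, structure, test_force_rmse, factor=1.0):
    # Heuristic suggested by the abstract: trust the prediction when the model
    # deviation is small relative to the test-set force RMSE; `factor` is an
    # illustrative knob, not a value quoted in the paper.
    return force_model_deviation(force_fns, structure) <= factor * test_force_rmse
```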
2. Novel machine learning algorithms that make the best use of significantly less data are of great interest. For example, active learning (AL) addresses this problem by iteratively training a model on a small number of labeled samples, running the whole dataset through the trained model, and then querying the labels of selected samples, which are then used to train a new model. This paper presents a fast and accurate data selection method in which the selected samples are optimized to span the subspace of all data. We propose a new selection algorithm, referred to as iterative projection and matching (IPM), with linear complexity in the number of data points and no parameters to tune. At each iteration of our algorithm, the maximum information about the structure of the data is captured by one selected sample, and this captured information is excluded from subsequent iterations by projecting onto the null space of the previously selected samples. The computational efficiency and selection accuracy of the proposed algorithm outperform those of conventional methods. Its superiority is further demonstrated on active learning for video action recognition on UCF-101; learning using representatives on ImageNet; training a generative adversarial network (GAN) to generate multi-view images from a single-view input on the CMU Multi-PIE dataset; and video summarization on the UTE Egocentric dataset.
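A minimal sketch of the IPM idea as summarized above: at each iteration, match the sample most aligned with the dominant direction of the (projected) data, then project that sample's span out of the data. The full SVD used here is for clarity only; the paper's linear-complexity claim relies on more efficient computations.

```python
import numpy as np

def ipm_select(X, k):
    """Pick k representative row indices of X (n_samples x n_features)
    by iterative projection and matching."""
    X = np.asarray(X, dtype=float).copy()
    selected = []
    for _ in range(k):
        # dominant direction of the (projected) data in feature space
        _, _, Vt = np.linalg.svd(X, full_matrices=False)
        u = Vt[0]
        norms = np.linalg.norm(X, axis=1)
        norms[norms == 0] = np.inf               # ignore fully projected-out rows
        # matching step: the sample most aligned with the dominant direction
        i = int(np.argmax(np.abs(X @ u) / norms))
        selected.append(i)
        # projection step: remove the chosen sample's span from all remaining data
        v = X[i] / np.linalg.norm(X[i])
        X = X - np.outer(X @ v, v)
    return selected
```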
3. Machine learning potentials (MLPs) for atomistic simulations have an enormous prospective impact on materials modeling, offering orders-of-magnitude speedup over density functional theory (DFT) calculations without appreciably sacrificing accuracy in the prediction of material properties. However, generating the large datasets needed to train MLPs is daunting. Herein, we show that MLP-based material property predictions converge faster with respect to the precision of Brillouin zone integrations than DFT-based property predictions, and we demonstrate that this phenomenon is robust across material properties for different metallic systems. Further, we provide statistical error metrics to accurately determine a priori the precision level required of DFT training datasets for MLPs to ensure accelerated convergence of material property predictions, thus significantly reducing the computational expense of MLP development.
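As a hedged illustration of this kind of a priori precision analysis (not the paper's actual error metrics), one might scan a property against increasing Brillouin-zone sampling density and report the coarsest density that reaches a target precision relative to the densest grid:

```python
def required_kpoint_density(densities, energies, tol=1e-3):
    """densities: ascending k-point densities; energies: corresponding property
    values (e.g., eV/atom) from your DFT or MLP driver, both hypothetical inputs.
    Returns the first density whose error vs. the densest grid is within tol."""
    reference = energies[-1]                     # densest grid as the reference value
    for rho, e in zip(densities, energies):
        if abs(e - reference) <= tol:
            return rho
    return densities[-1]
```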
4. The rapid development and large body of literature on machine learning potentials (MLPs) can make it difficult for researchers who are not experts, but who wish to use these tools, to know how to proceed. The spirit of this review is to help such researchers by serving as a practical, accessible guide to the state of the art in MLPs. It covers a broad range of topics, including (i) central aspects of how and why MLPs enable many exciting advancements in molecular modeling; (ii) the main underpinnings of different types of MLPs, including their basic structure and formalism; (iii) the potentially transformative impact of universal MLPs for both organic and inorganic systems, with an overview of the most recent advances, capabilities, downsides, and potential applications of this nascent class of MLPs; (iv) a practical guide for estimating and understanding the execution speed of MLPs, with guidance for users based on hardware availability, the type of MLP used, and prospective simulation size and time; (v) a manual for choosing an MLP for a given application, considering hardware resources, speed requirements, and energy and force accuracy requirements, along with guidance on choosing a pre-trained potential versus fitting a new one from scratch; (vi) a discussion of MLP infrastructure, including sources of training data, pre-trained potentials, and hardware resources for training; (vii) a summary of key limitations of present MLPs and current approaches to mitigate them, including methods for incorporating long-range interactions, handling magnetic systems, and treating excited states; and finally (viii) some more speculative thoughts on what the future holds for the development and application of MLPs over the next 3-10+ years.
5. AutoML has demonstrated remarkable success in finding effective neural architectures for a given machine learning task defined by a specific dataset and an evaluation metric. However, most present AutoML techniques consider each task independently and from scratch, requiring the exploration of many architectures and leading to high computational cost. We propose AutoTransfer, an AutoML solution that improves search efficiency by transferring prior architectural design knowledge to the novel task of interest. Our key innovations are a task-model bank that captures model performance over a diverse set of GNN architectures and tasks, and a computationally efficient task embedding that can accurately measure the similarity between tasks. Based on the task-model bank and the task embeddings, our method estimates design priors of desirable models for the novel task by aggregating a similarity-weighted sum of the top-K design distributions of tasks similar to the task of interest. The computed design priors can be used with any AutoML search algorithm. We evaluate AutoTransfer on six datasets in the graph machine learning domain. Experiments demonstrate that (i) the proposed task embedding can be computed efficiently and that tasks with similar embeddings have similar best-performing architectures, and (ii) AutoTransfer significantly improves search efficiency with the transferred design priors, reducing the number of explored architectures by an order of magnitude. Finally, we release GNN-BANK-101, a large-scale dataset of detailed GNN training information for 120,000 task-model combinations, to facilitate and inspire future research.
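The following sketch illustrates the similarity-weighted aggregation of design priors described above; the embedding dictionaries, the cosine-similarity choice, and the top_k parameter are assumptions for illustration, not AutoTransfer's actual API.

```python
import numpy as np

def design_prior(new_emb, task_embeddings, design_distributions, top_k=3):
    """task_embeddings: task name -> embedding vector; design_distributions:
    task name -> {architecture choice: probability}. Both are hypothetical
    containers standing in for the task-model bank."""
    names = list(task_embeddings)
    # cosine similarity between the new task and every banked task
    sims = np.array([
        np.dot(new_emb, task_embeddings[n])
        / (np.linalg.norm(new_emb) * np.linalg.norm(task_embeddings[n]))
        for n in names
    ])
    top = np.argsort(sims)[-top_k:]              # k most similar banked tasks
    weights = sims[top] / sims[top].sum()        # similarity-weighted mixture
                                                 # (assumes non-negative similarities)
    prior = {}
    for w, i in zip(weights, top):
        for choice, p in design_distributions[names[i]].items():
            prior[choice] = prior.get(choice, 0.0) + w * p
    return prior                                 # aggregated prior over design choices
```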