Convergence acceleration in machine learning potentials for atomistic simulations
Machine learning potentials (MLPs) for atomistic simulations have an enormous prospective impact on materials modeling, offering orders of magnitude speedup over density functional theory (DFT) calculations without appreciably sacrificing accuracy in the prediction of material properties. However, the generation of large datasets needed for training MLPs is daunting. Herein, we show that MLP-based material property predictions converge faster with respect to precision for Brillouin zone integrations than DFT-based property predictions. We demonstrate that this phenomenon is robust across material properties for different metallic systems. Further, we provide statistical error metrics to accurately determine a priori the precision level required of DFT training datasets for MLPs to ensure accelerated convergence of material property predictions, thus significantly reducing the computational expense of MLP development.
- Award ID(s):
- 2003808
- PAR ID:
- 10417870
- Date Published:
- Journal Name:
- Digital Discovery
- Volume:
- 1
- Issue:
- 1
- ISSN:
- 2635-098X
- Page Range / eLocation ID:
- 61 to 69
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
The rapid development and large body of literature on machine learning potentials (MLPs) can make it difficult to know how to proceed for researchers who are not experts but wish to use these tools. The spirit of this review is to help such researchers by serving as a practical, accessible guide to the state of the art in MLPs. This review paper covers a broad range of topics related to MLPs, including (i) central aspects of how and why MLPs are enablers of many exciting advancements in molecular modeling, (ii) the main underpinnings of different types of MLPs, including their basic structure and formalism, (iii) the potentially transformative impact of universal MLPs for both organic and inorganic systems, including an overview of the most recent advances, capabilities, downsides, and potential applications of this nascent class of MLPs, (iv) a practical guide for estimating and understanding the execution speed of MLPs, including guidance for users based on hardware availability, type of MLP used, and prospective simulation size and time, (v) a manual for what MLP a user should choose for a given application by considering hardware resources, speed requirements, energy and force accuracy requirements, as well as guidance for choosing pre-trained potentials or fitting a new potential from scratch, (vi) discussion around MLP infrastructure, including sources of training data, pre-trained potentials, and hardware resources for training, (vii) summary of some key limitations of present MLPs and current approaches to mitigate such limitations, including methods for incorporating long-range interactions, handling magnetic systems, and treatment of excited states, and finally (viii) some more speculative thoughts on what the future holds for the development and application of MLPs over the next 3-10+ years.
-
Machine learning potentials (MLPs) have attracted significant attention in computational chemistry and materials science due to their high accuracy and computational efficiency. The proper selection of atomic structures is crucial for developing reliable MLPs. Insufficient or redundant atomic structures can impede the training process and potentially result in a poor quality MLP. Here, we propose a local-environment-guided screening algorithm for efficient dataset selection in MLP development. The algorithm utilizes a local environment bank to store unique local environments of atoms. The dissimilarity between a particular local environment and those stored in the bank is evaluated using the Euclidean distance. A new structure is selected only if its local environment is significantly different from those already present in the bank. The bank is then updated with all the new local environments found in the selected structure. To demonstrate the effectiveness of our algorithm, we applied it to select structures for a Ge system and a Pd13H2 particle system. The algorithm reduced the training data size by around 80% for both without compromising the performance of the MLP models. We verified that the results were independent of the selection and ordering of the initial structures. We also compared the performance of our method with the farthest point sampling algorithm, and the results show that our algorithm is superior in both robustness and computational efficiency. Furthermore, the generated local environment bank can be continuously updated and can potentially serve as a growing database of feature local environments, aiding in efficient dataset maintenance for constructing accurate MLPs.
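The screening loop described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes each structure is already represented as a list of per-atom descriptor vectors, that "significantly different" means a Euclidean distance above a user-chosen threshold, and that a structure is selected when at least one of its atoms has such a novel environment. The names screen_structures, min_distance_to_bank, and threshold are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length descriptor vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def min_distance_to_bank(env, bank):
    """Distance from one local environment to its nearest neighbor in the bank."""
    if not bank:
        return math.inf  # an empty bank makes every environment novel
    return min(euclidean(env, stored) for stored in bank)


def screen_structures(structures, threshold):
    """Select structures that contribute at least one novel local environment.

    structures: list of structures, each a list of per-atom descriptor vectors
                (the descriptor choice is an assumption of this sketch).
    threshold:  hypothetical novelty cutoff on the Euclidean distance.
    Returns the indices of selected structures and the final bank.
    """
    bank = []
    selected = []
    for index, environments in enumerate(structures):
        is_novel = any(
            min_distance_to_bank(env, bank) > threshold for env in environments
        )
        if not is_novel:
            continue  # every environment is already represented in the bank
        selected.append(index)
        # Update the bank with the new environments of the selected structure.
        for env in environments:
            if min_distance_to_bank(env, bank) > threshold:
                bank.append(env)
    return selected, bank


if __name__ == "__main__":
    toy = [
        [(0.0, 0.0), (1.0, 0.0)],    # both environments are new
        [(0.0, 0.01), (1.0, 0.01)],  # near-duplicates of the first structure
        [(5.0, 5.0)],                # a clearly different environment
    ]
    chosen, bank = screen_structures(toy, threshold=0.1)
    print(chosen, len(bank))
```

In this toy input the second structure is skipped because both of its environments lie within the threshold of entries already in the bank. The linear scan over the bank is the simplest choice; a spatial index would be a natural substitute for large banks.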
-
Machine learning represents a milestone in data-driven research, including material informatics, robotics, and computer-aided drug discovery. With the continuously growing virtual and synthetically available chemical space, efficient and robust quantitative structure–activity relationship (QSAR) methods are required to uncover molecules with desired properties. Herein, we propose variable-length-array SMILES-based (VLA-SMILES) structural descriptors that expand conventional SMILES descriptors widely used in machine learning. This structural representation extends the family of numerically coded SMILES, particularly binary SMILES, to expedite the discovery of new deep learning QSAR models with high predictive ability. VLA-SMILES descriptors were shown to speed up the training of QSAR models based on multilayer perceptron (MLP) with optimized backpropagation (ATransformedBP), resilient propagation (iRPROP‒), and Adam optimization learning algorithms featuring rational train–test splitting, while improving the predictive ability toward the more compute-intensive binary SMILES representation format. All the tested MLPs under the same length-array-based SMILES descriptors showed similar predictive ability and convergence rate of training in combination with the considered learning procedures. Validation with the Kennard–Stone train–test splitting based on the structural descriptor similarity metrics was found more effective than the partitioning with the ranking by activity based on biological activity values metrics for the entire set of VLA-SMILES featured QSAR. Robustness and the predictive ability of MLP models based on VLA-SMILES were assessed via the method of QSAR parametric model validation. 
In addition, the method of the statistical H0 hypothesis testing of the linear regression between real and observed activities based on the F2,n−2-criteria was used for predictability estimation among VLA-SMILES featured QSAR-MLPs (with n being the volume of the testing set). Both approaches of QSAR parametric model validation and statistical hypothesis testing were found to correlate when used for the quantitative evaluation of predictabilities of the designed QSAR models with VLA-SMILES descriptors.
-
Rapid and automated lipid profiling by nuclear magnetic resonance spectroscopy using neural networks
Nuclear magnetic resonance (NMR) spectroscopy is a powerful tool for quantitative metabolomics; however, quantification of metabolites from NMR data is often a slow and tedious process requiring user input and expertise. In this study, we propose a neural network approach for rapid, automated lipid identification and quantification from NMR data. Multilayered perceptron (MLP) networks were developed with NMR spectra as the input and lipid concentrations as output. Three large synthetic datasets were generated, each with 55,000 spectra from an original 30 scans of reference standards, by using linear combinations of standards and simulating experimental‐like modifications (line broadening, noise, peak shifts, baseline shifts) and common interference signals (water, tetramethylsilane, extraction solvent), and were used to train MLPs for robust prediction of lipid concentrations. The performances of MLPs were first validated on various synthetic datasets to assess the effect of incorporating different modifications on their accuracy. The MLPs were then evaluated on experimentally acquired data from complex lipid mixtures. The MLP‐derived lipid concentrations showed high correlations and slopes close to unity for most of the quantified lipid metabolites in experimental mixtures compared with ground‐truth concentrations. The most accurate, robust MLP was used to profile lipids in lipophilic hepatic extracts from a rat metabolomics study. The MLP lipid results analyzed by two‐way ANOVA for dietary and sex differences were similar to those obtained with a conventional NMR quantification method. 
In conclusion, this study demonstrates the potential and feasibility of a neural network approach for improving speed and automation in NMR lipid profiling, and this approach can be easily tailored to other quantitative, targeted spectroscopic analyses in academia or industry.
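The synthetic-data step mentioned in this abstract (linear combinations of reference standards with experimental-like modifications) can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the procedure used in the study. It assumes spectra are equal-length lists of intensities, draws concentrations uniformly from 0 to 1, applies a whole-point peak shift, a constant baseline offset, and Gaussian noise, and leaves out line broadening and interference signals. The function name synthesize_spectrum and all parameter names and default values are hypothetical.

```python
import random


def synthesize_spectrum(standards, rng, noise_sd=0.01, max_shift=2, max_baseline=0.05):
    """Build one synthetic spectrum and return it with its concentrations.

    standards:    list of reference spectra (equal-length lists of floats).
    rng:          a random.Random instance, passed in for reproducibility.
    noise_sd:     standard deviation of added Gaussian noise (assumed value).
    max_shift:    largest peak shift, in data points (assumed value).
    max_baseline: largest constant baseline offset (assumed value).
    """
    n_points = len(standards[0])
    # Random concentrations act as the training labels.
    concentrations = [rng.uniform(0.0, 1.0) for _ in standards]
    # Linear combination of the reference standards.
    mixture = [
        sum(c * s[i] for c, s in zip(concentrations, standards))
        for i in range(n_points)
    ]
    # Peak shift: move the whole spectrum by a few points, padding with zeros.
    shift = rng.randint(-max_shift, max_shift)
    shifted = [
        mixture[i - shift] if 0 <= i - shift < n_points else 0.0
        for i in range(n_points)
    ]
    # Baseline shift and additive noise.
    baseline = rng.uniform(-max_baseline, max_baseline)
    spectrum = [v + baseline + rng.gauss(0.0, noise_sd) for v in shifted]
    return spectrum, concentrations


if __name__ == "__main__":
    rng = random.Random(0)
    toy_standards = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 2.0, 0.0]]
    dataset = [synthesize_spectrum(toy_standards, rng) for _ in range(5)]
    print(len(dataset), len(dataset[0][0]))
```

Each call yields one (spectrum, concentrations) pair, so repeating it many times produces inputs and targets for a regression model such as an MLP.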

