<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Discrepancies and error evaluation metrics for machine learning interatomic potentials</title></titleStmt>
			<publicationStmt>
				<publisher>npj Computational Materials</publisher>
				<date>12/01/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10552844</idno>
					<idno type="doi">10.1038/s41524-023-01123-3</idno>
					<title level='j'>npj Computational Materials</title>
<idno>2057-3960</idno>
<biblScope unit="volume">9</biblScope>
<biblScope unit="issue">1</biblScope>					

					<author>Yunsheng Liu</author><author>Xingfeng He</author><author>Yifei Mo</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[<title>Abstract</title> <p>Machine learning interatomic potentials (MLIPs) are a promising technique for atomic modeling. While small errors are widely reported for MLIPs, an open concern is whether MLIPs can accurately reproduce atomistic dynamics and related physical properties in molecular dynamics (MD) simulations. In this study, we examine the state-of-the-art MLIPs and uncover several discrepancies related to atom dynamics, defects, and rare events (REs), compared to ab initio methods. We find that low averaged errors by current MLIP testing are insufficient, and develop quantitative metrics that better indicate the accurate prediction of atomic dynamics by MLIPs. The MLIPs optimized by the RE-based evaluation metrics are demonstrated to have improved prediction in multiple properties. The identified errors, the evaluation metrics, and the proposed process of developing such metrics are general to MLIPs, thus providing valuable guidance for future testing and improvements of accurate and reliable MLIPs for atomistic modeling.</p>]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>INTRODUCTION</head><p>Atomistic modeling, which simulates physical phenomena based on the interactions of atoms, is a crucial research technique in a range of disciplines including physics, chemistry, biology, and materials science. Density functional theory (DFT) calculation have been the standard technique for evaluating atom interactions among a diverse range of configurations and chemistries, but their applications are limited to small system sizes of a few hundred atoms (up to a few nm) due to high computation costs <ref type="bibr">[1]</ref><ref type="bibr">[2]</ref><ref type="bibr">[3]</ref> . By contrast, classical interatomic potentials, also known as force fields, have significantly lower computation costs and thus can be employed for atomistic simulations with much larger length-scale (nm-&#956;m) and longer time-scale (ns-&#956;s), but they lack the transferability to different atomistic configurations that are not considered in the potential fitting <ref type="bibr">1,</ref><ref type="bibr">4,</ref><ref type="bibr">5</ref> .</p><p>As an emerging technique to bridge the gaps among different computational techniques <ref type="bibr">[1]</ref><ref type="bibr">[2]</ref><ref type="bibr">[3]</ref><ref type="bibr">[6]</ref><ref type="bibr">[7]</ref><ref type="bibr">[8]</ref><ref type="bibr">[9]</ref><ref type="bibr">[10]</ref> , machine learning interatomic potentials (MLIPs) utilize machine learning (ML) models to predict energies and forces of atomistic structures, which are mapped into the atomistic descriptors as input. Current state-of-the-art MLIPs include Gaussian Approximation Potential (GAP) based on Smooth Overlap of Atomic Positions (SOAP) descriptors <ref type="bibr">11,</ref><ref type="bibr">12</ref> , Neural Network Potential (NNP) <ref type="bibr">13,</ref><ref type="bibr">14</ref> , Spectral Neighbor Analysis Potential (SNAP) <ref type="bibr">15</ref> , Moment Tensor Potential (MTP) <ref type="bibr">16</ref> , and Deep Potential (DeePMD) models <ref type="bibr">5</ref> , and many MLIP variances derived from different modifications and combinations of ML models and descriptors. MLIPs are trained using DFT-calculated energies and forces from a diverse range of atomistic configurations, typically encompassing bulk and defected structures, equilibrium and nonequilibrium structures, and solid and liquid phases. Current stateof-the-art MLIPs are claimed to achieve accuracies similar to ab initio calculations <ref type="bibr">1,</ref><ref type="bibr">5,</ref><ref type="bibr">11,</ref><ref type="bibr">13,</ref><ref type="bibr">[15]</ref><ref type="bibr">[16]</ref><ref type="bibr">[17]</ref><ref type="bibr">[18]</ref> , while maintaining low computation costs and linear size scaling akin to classical interatomic potentials.</p><p>However, the MLIPs are black-box predictors not directly based on physical principles. An open question is whether MLIPs can always accurately reproduce physical phenomena in atomistic simulations. Conventional ML error testing primarily quantifies MLIP accuracies through average errors, such as root-mean-square error (RMSE) or mean-absolute error (MAE), of energies and atomic forces across a range of configurations known as testing dataset. These atomistic configurations in the testing dataset are randomly split from the entire datasets generated in the same manner as the training dataset, and thus are similar to the training dataset but may differ from the atomistic configurations that may encounter during MD simulations. Most MLIPs are reported to achieve small, average errors of energies and atomic forces as low as 1 meV atom -1 and 0.05 eV &#197; <ref type="bibr">-11,5,8,11,19,20</ref> , respectively, in conventional ML testing. These low averaged errors reported have created the impression that MLIPs are as accurate as DFT calculations. However, these MLIPs with small average errors may not always accurately reproduce the physical phenomena in atomistic simulations, as shown in the following examples <ref type="bibr">10,</ref><ref type="bibr">21,</ref><ref type="bibr">22</ref> . An MLIP of Al by Botu et al. was reported a low MAE force error of 0.03 eV &#197; -1 , but its MD simulations predicted the activation energy of Al vacancy diffusion with an error of 0.1 eV compared to the DFT value of 0.59 eV, even though vacancy structures and vacancy diffusion were included in the training dataset <ref type="bibr">23</ref> . In Vandermause et al. <ref type="bibr">20</ref> , an Al MLIP with a low RMSE force error of 0.05 eV &#197; -1 for solid phase and 0.12 eV &#197; -1 for liquid phase also exhibited discrepancies with DFT in in surface adatom migration, which were considered during the on-the-fly training. Zuo et al. <ref type="bibr">1</ref> reported a number of MLIPs (such as GAP, NNP, SNAP, MTP) with small RMSEs of atomic forces at the level of 0.15-0.4 eV &#197; -1 and 10-20% errors in the vacancy formation energy and migration barrier for a number of materials (such as Li, Mo, Si, Ge), while vacancy structures were also included in the training. Additionally, in the aforementioned studies, there were large errors of the predicted migration energy barriers of defects, even though these defects and their migrations were considered in the training and testing datasets. Atomistic diffusion is determined by the dynamics of atoms (often in the form of point defects) and the potential energy surface (PES) beyond their equilibrium sites, but the direct testing of MLIPs on these atomistic-level details in MD simulations still shows errors and failures <ref type="bibr">24</ref> . As reported by Fu et al. <ref type="bibr">24</ref> , the MD simulations based on MLIPs observe errors, such as radial density functions, and even the failure of the MD simulations after a certain duration. These results suggest there are errors in the MLIPs causing these errors and failures from actual MD simulations. It is crucial to examine the accuracy of MLIPs in simulating the atomic dynamics and reproducing physical properties, understand potential discrepancies, and develop appropriate testing metrics.</p><p>Typical approaches to improve the MLIPs include adjusting the fraction or weights of certain structures in the training dataset, modifying the cost/loss functions, and tuning hyperparameters <ref type="bibr">25</ref> . The average errors in energies and forces or a few easily computable properties, such as elastic constants, energy vs. volume curves for different crystal structures, and formation energies of point defects <ref type="bibr">1,</ref><ref type="bibr">26</ref> , are often used to optimize and select the MLIP models. However, to test and quantify those errors that can only be directly observed in actual MD simulations, such as the errors in diffusional properties, one would need to conduct numerous tests in MD simulations for an extended duration before selecting the final MLIP models <ref type="bibr">24</ref> . This approach requires a large computational cost of running MD simulations to select MLIPs, which may be impractical for optimizing MLIPs with many combinations of training variables and hyperparameters. Therefore, appropriate testing metrics should be developed to thoroughly gauge the ability of an MLIP in reproducing atomic dynamics and physical properties in a range of typically encountered physical situations, and such quantitative metrics are crucial for the further improvement of MLIPs.</p><p>In this study, by comparing atomic dynamics from MLIP-MD simulations and ab initio MD (AIMD) simulations, we reveal that state-of-the-art MLIPs, even with carefully selected training datasets and small average errors evaluated by conventional testing, may not fully reproduce atomic dynamics or related properties (Discrepancies of MLIPs on atomic dynamics between MLIP and DFT occur even when low average errors are reported). The tested MLIPs show discrepancies in diffusions or rare events (Rare events are sources of discrepancies), defect configurations (Configurations with similar energies are discrepancy sources), and atomic vibrations (Vibration near defects is a discrepancy source). We then develop the error evaluation metrics based on the atomic forces of RE atoms (Quantifying the force errors on RE migrating atoms) and demonstrate them for indicating the performance of MLIPs on atomistic dynamics in MD simulations (Force performance scores as effective metrics). The MLIPs trained with enhanced RE data and selected by newly developed metrics show improved predictions of atom dynamics and diffusional properties. In the end, we summarize our process of developing the evaluation metrics for the observed simulation-based discrepancies. The identified discrepancies, our evaluation metrics, and their development process are general to all MLIPs and can serve as guidance for future development and improvements of accurate, robust, and reliable MLIPs for atomistic modeling.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>RESULTS</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Discrepancies of MLIPs on atomic dynamics between MLIP and DFT occur even when low average errors are reported</head><p>We conduct our study on a number of MLIPs to show the observed phenomena are general among MLIPs. In order to perform a consistent comparison, we directly retrieve the MLIP (GAP, GAP PRX , NNP, SNAP, and MTP) models of Si from previous studies <ref type="bibr">1,</ref><ref type="bibr">26</ref> , besides the DeePMD model 27 trained using the same training dataset from ref. <ref type="bibr">1</ref> . This training dataset of ref. <ref type="bibr">1</ref> includes a diverse range of selected atomistic structures of solid Si, melted liquid Si, strained Si, Si surfaces, and Si-vacancy from AIMD simulations at a wide range of temperatures (Fig. <ref type="figure">1c</ref>, Methods and Supplementary Note 1). In order to quantify the errors on predicted energies and atomic forces of these MLIPs, we construct interstitial-RE and vacancy-RE testing sets, D RE-I Testing and D RE-V Testing , respectively, each consisting of 100 snapshots of atomic configurations with a single migrating vacancy or interstitial, respectively, from AIMD simulations at 1230 K (Fig. <ref type="figure">1a</ref>, b, Methods), with the true values of energies and atomic forces evaluated by DFT calculations with a k-point mesh of 4 &#215; 4 &#215; 4 (DFT K4). The MLIPs accurately predict the energies and atomic forces in the training dataset and the testing dataset in consistency with the original studies <ref type="bibr">1,</ref><ref type="bibr">26</ref> , showing low root-meansquare errors (RMSEs) below 10 meV atom -1 for energies and 0.3 eV &#197; -1 for forces (with the median force magnitude of 1.67 eV &#197; -1 ) for most MLIPs on the vacancy-RE testing set D RE-V Testing (Fig. <ref type="figure">1e</ref>, h, Supplementary Fig. <ref type="figure">1</ref>, and Supplementary Table <ref type="table">1</ref>). Given similar structures with vacancies are covered in the training dataset, the good performance of MLIPs on vacancy structures should be expected.</p><p>We also test the MLIPs on the structures not included in the training data (Fig. <ref type="figure">1c</ref>), specifically the interstitial-RE testing dataset D RE-I Testing , which comprises the snapshots from the AIMD simulations of Si supercell with an interstitial. Most MLIPs (except for GAP PRX ) do not include the configurations with an interstitial in the training dataset. While the MLIPs prediction of atomistic energies show a bias with an average offset of 10-13 meV atom -1 lower than DFT values, the overall RMSEs are below 15 meV atom -1 and 0.3 eV &#197; -1 (with the median force magnitude of 1.69 eV &#197; -1 ) for these interstitial structures for most MLIPs (Fig. <ref type="figure">1d</ref>, g, Supplementary Fig. <ref type="figure">2</ref>, and Supplementary Table <ref type="table">1</ref>). These small error values in energy and force on training and testing datasets are often interpreted as good performance of MLIPs.</p><p>In addition to the errors of energies and forces, we evaluate and compare the phonon dispersion relations predicted by MLIPs with DFT (Fig. <ref type="figure">1f</ref>, i, Supplementary Fig. <ref type="figure">3</ref>, and Supplementary Fig. <ref type="figure">4</ref>). While most MLIPs have good agreement on phonon dispersions of bulk Si (Fig. <ref type="figure">1f</ref>) <ref type="bibr">26,</ref><ref type="bibr">28</ref> , a number of MLIPs exhibit noticeable differences on phonon dispersions in the Si supercell with a vacancy (Fig. <ref type="figure">1i</ref> and Supplementary Fig. <ref type="figure">4</ref>, Methods). For example, the phonon dispersion by GAP has imaginary frequencies (Fig. <ref type="figure">1i</ref>), and the phonon dispersion by SNAP has additional bands with lower frequencies compared to DFT K4 in high frequency sections (Supplementary Fig. <ref type="figure">4</ref>). The training data includes the atom vibration from AIMD simulations with vacancies, which should help the training of phonon dispersion relations. Therefore, these discrepancies require further investigation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Rare events are sources of discrepancies</head><p>Here the MD simulations using these MLIPs are performed to simulate atom dynamics and related physical properties in a Si supercell with a single vacancy or a single interstitial, to identify potential discrepancies between MLIP-MD and AIMD simulations. The Si diffusivities are evaluated at a range of temperatures 730-1600 K from the mean-squared-displacements of Si over time (Methods) (Fig. <ref type="figure">2a</ref>, <ref type="figure">b</ref>). Given the MLIPs are trained on the DFT calculations based on a fine k-point mesh of 4 &#215; 4 &#215; 4 (K4), the MLIP results should be compared to DFT K4 benchmark as in many of our tests. For the MD simulations, the MLIP results are compared to the AIMD simulations based on coarser accuracy setting of a k-point mesh of 2 &#215; 2 &#215; 2 (K2) and a single &#915;-point (K1), because AIMD simulations based on K4 is takes prohibitively long to obtain enough number of atom hops. In addition, these lower accuracy settings are commonly utilized for AIMD simulations in previous studies <ref type="bibr">4,</ref><ref type="bibr">29,</ref><ref type="bibr">30</ref> . The errors with lower-accuracy DFT calculations also can be used as error ranges for comparison.</p><p>For vacancy diffusion, which is covered in the training data, most MLIPs predict diffusivities within a reasonable error range of the diffusivities predicted by K2 AIMD simulations. Most MLIPs perform better than DFT-K1. DeePMD gives higher diffusivities than AIMD simulations, but an agreement on the fitted activation energy (0.17 eV compared with 0.2 eV given by AIMD K2). Some MLIPs show discrepancies and deviations among each other in the diffusivities at lower temperatures, leading to discrepancies in the fitted activation energies and extrapolated room temperature diffusivities. Nonetheless, the comparison of the MLIPs and AIMD simulations should take into consideration the stochastic nature of estimating diffusivities from MD simulations and the limited number of data points, which would lead to large uncertainty of the fitted activation energy and extrapolated diffusivity <ref type="bibr">4</ref> . As shown in the previous study <ref type="bibr">4</ref> , even for a total mean square displacement of ion diffusion over a few thousand &#197; 2 , the standard deviation of obtained diffusivity can be as large as 20-30%. The error bars and range of all obtained diffusivity values are shown in Supplementary Tables 4, 5 and 6.</p><p>While the Si interstitials are not covered in the training of MLIPs (except for GAP PRX ), all MLIPs show low RMSEs in the testing dataset (Fig. <ref type="figure">1d</ref>, <ref type="figure">g</ref>) with interstitial configurations. Here the MD simulations of interstitial diffusion is performed to test whether the atomic dynamics can be correctly reproduced. Some MLIPs, such as GAP PRX , DeePMD, and SNAP, show significant deviations in the interstitial diffusional properties predicted by MLIP-MD simulations (Fig. <ref type="figure">2a</ref> and Supplementary Table <ref type="table">2</ref>). The diffusivities predicted by GAP and MTP agree reasonably with AIMD simulations over the temperature range 1000-1600 K, and the fitted activation energies of interstitial diffusion, E a I , show minor differences. GAP PRX , which is the only MLIP considered interstitial in the training, also shows discrepancies in interstitial diffusivities with a low E a I of 0.12 eV. SNAP shows a E a I of 0.74 eV, much higher than AIMD K2 (0.30 eV). For NNP, the crystalline Si structure melted during the MD simulations at the temperature of 730-1600 K (Supplementary Note 4). According to the tests, many MLIPs show some discrepancies in predicting diffusivity, activation energy, or both, and thus the ability to fully reproduce the atom dynamics of  <ref type="table">4</ref>, <ref type="table">5</ref>). The error bars of diffusivities are estimated based on the scheme in ref. <ref type="bibr">4</ref> </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>(Methods).</head><p>Si interstitial diffusion is limited. The discrepancies in diffusional properties indicate that having low average errors in energies and forces are insufficient error evaluation metrics to judge whether the atomic dynamics of diffusions in MD simulations are accurate.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Configurations with similar energies are discrepancy sources</head><p>In order to reveal the discrepancies of MLIPs in predicting interstitial diffusions, we analyze the snapshots from MLIP-MD simulations of interstitials in comparison with AIMD simulations. DFT studies <ref type="bibr">[31]</ref><ref type="bibr">[32]</ref><ref type="bibr">[33]</ref> reported three types of Si interstitials, such as the ground-state split-&lt;110&gt; (Fig. <ref type="figure">3a</ref>), tetrahedral (Fig. <ref type="figure">3b</ref>), and hexagonal (Fig. <ref type="figure">3c</ref>) interstitials, with the formation energies E f of 3.56, 3.72, and 3.59 eV, respectively, from DFT (K4) calculations (Fig. <ref type="figure">3</ref>). Consistent with the trend of the formation energies, the split-&lt;110&gt; interstitial has higher occurrence frequency in the AIMD simulations than tetrahedral or hexagonal (Methods) (Fig. <ref type="figure">3d</ref>, <ref type="figure">e</ref>) <ref type="bibr">[31]</ref><ref type="bibr">[32]</ref><ref type="bibr">[33]</ref> . By contrast, MLIP-MD simulations show higher occurrence frequencies of either tetrahedral interstitial (by GAP, SNAP, and MTP) or hexagonal interstitial (by DeePMD) (Fig. <ref type="figure">3e</ref> and Supplementary Fig. <ref type="figure">7</ref>). Consistent with the occurrence frequencies in MLIP-MD simulations, most MLIPs (except for GAP PRX ) give lower E f for the tetrahedral interstitial than the ground-state split-&lt;110&gt; interstitial (Fig. <ref type="figure">3d</ref>). The trend of increasing occurrence frequencies for different interstitials can be largely explained by the decreasing formation energies, while the entropy of these defects may also play an effect (Supplementary Note 5 and Supplementary Note 10). Therefore, the errors in interstitial diffusion during MLIP-MD simulations are caused by the discrepancies of MLIPs in the predictions of different interstitial configurations. These discrepancies should be expected given interstitial configurations are not considered in the training (except for GAP PRX ). However, it is worth noting that small errors are reported for the testing of energies and forces for interstitial configurations (in the interstitial-RE test set D RE-I Testing ). Thus, having small average errors in energies and forces in conventional error testing on the defect configurations may be insufficient in determining whether the MLIPs would correctly reproduce these defect configurations in MD simulations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Vibration near defects is a discrepancy source</head><p>The good performance of MLIPs on Si vacancies is expected given the dynamical snapshots of Si vacancies in a wide range of conditions are well covered in the training dataset. However, the phonon dispersion of the vacancy structure predicted by the MLIPs (Fig. <ref type="figure">1i</ref>) suggests potential discrepancies of PES and atomic vibrations. Additionally, for the vacancy diffusion (Fig. <ref type="figure">2a</ref>, Supplementary Table <ref type="table">3</ref>), the discrepancies of diffusional properties are observed in a few cases (Supplementary Note 4). The diffusional properties, such as activation energies and the preexponential factors of the Arrhenius relation of diffusivity, are critically dependent on the PES of a migrating atom next to a defect such as a vacancy.</p><p>We further analyze the vibrations of Si atoms neighboring the vacancy, which is directly related to vacancy diffusion, in order to gain insights into the observed discrepancies of the PES. We visualize the vibrations of Si atoms near the vacancy by plotting the distribution of the distance r s and r v of the atoms from the nearest static Si site to its nearest static vacancy site, respectively (Fig. <ref type="figure">4a</ref>) (Methods). For the atoms that are not the nearest neighbor (NN) atoms of the vacancy (Fig. <ref type="figure">4d</ref>), the atom vibrations are similar to those in crystalline bulk. The agreement in the vibration of non-NN atoms is consistent with the good agreement in the phonon dispersion relations (Fig. <ref type="figure">1f</ref>) and elastic constants of bulk crystalline Si <ref type="bibr">1,</ref><ref type="bibr">26,</ref><ref type="bibr">28</ref> .</p><p>The vibrations of the vacancy nearest neighbor (vacancy-NN) atoms in MLIP-MD simulations show major discrepancies with AIMD simulations (Fig. <ref type="figure">4e-h</ref>, and Supplementary Fig. <ref type="figure">8</ref>). In the MLIP-MD simulations, the vacancy-NN atoms vibrate much further away from their static sites, as indicated by the distributions of r s (Fig. <ref type="figure">4c</ref>, Supplementary Fig. <ref type="figure">8</ref>). The discrepancies in vacancy-NN vibrations reveal that MLIPs do not accurately reproduce atom dynamics around the defect.</p><p>The observed discrepancies in the vibrations of vacancy-NN atoms can be explained by the errors of the PES of a vacancy-NN Si atom moving along the direction toward the vacancy (Fig. <ref type="figure">4b</ref> and Supplementary Fig. <ref type="figure">10</ref>). While some pathways of the migration for a single atom in a static configuration with fully relaxed static sites (Supplementary Fig. <ref type="figure">10</ref>) may be poorly predicted by the MLIPs, some different pathways can be accurately predicted in very similar configurations with slightly adjusted atom positions (near-equilibrium snapshots from AIMD simulations) (Supplementary Note 6, Supplementary Fig. <ref type="figure">11</ref>). These observed discrepancies indicate that the PES predicted by the MLIPs may not be always accurate and reliable under similar atomic configurations with minor variances in positions.</p><p>These errors in the atom vibrations and PESs nearby a defect are surprising, given that a range of dynamical configurations of vacancies from AIMD simulations is considered in the training data and that small errors in energies and forces are reported for these configurations. These results show the challenges of MLIPs in accurately reproducing the PES of atoms related to defects, even if these defects and related dynamics are included in the training. In addition, the conventional error testing of MLIPs based on average errors of energies and forces are insufficient in indicating these errors in atom vibration and PES near defects. Our results reveal previously neglected errors on the PES of MLIPs in causing inaccurate atom dynamics.</p><p>In summary, we have identified discrepancies based on observations of atomic dynamics in MD simulations, such as diffusions, configurations of defects, and atomic vibrations. In order to train MLIPs that can more accurately reproduce these dynamical phenomena, we need to quantify these discrepancies by developing corresponding error evaluation metrics, which can be further used to train and select the MLIPs with the highest metric scores.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Quantifying the force errors on RE migrating atoms</head><p>Using the discrepancies on atom diffusions as an example, we here develop corresponding error evaluation metrics, and improve the performances of MLIPs on diffusional properties. The process is as follows. We first develop a number of metrics for quantifying the aforementioned sources of discrepancies (Quantifying the force errors on RE migrating atoms). The evaluation metrics are then statistically verified to effectively indicate diffusional properties derived from atom dynamics in MD simulations (Force performance scores as effective metrics). This process can be generalized to improve the training and testing of MLIPs, as summarized in "Process of developing error evaluation metrics".</p><p>Given the aforementioned discrepancies in diffusional properties, we identify the sources of errors on migrating atoms, which are the atoms of interstitial or vacancy in the middle of the hopping from the current equilibrium site to the neighboring site (Methods). These atom hops are known as rare events (REs) in MD simulations. To quantify the errors of MLIP predictions for RE atoms, we compare the predicted atomic forces with DFT results on these RE atoms. We evaluate the error of atomic force in the magnitude &#948; F and the direction &#948; &#952; of as</p><p>where FDFT is the benchmark true force values calculated by DFT K4, and Fpredicted is the force predicted by the MLIPs (or DFT K1, or K2). For the RE atoms in the D RE-I Testing and D RE-V Testing datasets, large errors were found in the force magnitude &#948; F (Fig. <ref type="figure">5a</ref>) and force direction &#948; &#952; (Fig. <ref type="figure">5f</ref>) of the atomic forces predicted by MLIPs. For the RE atoms and the nearby atoms within a distance r &lt; 3 &#197;, 50-80% and &gt;40% exhibit large errors of &#948; F &gt; 0.5 eV &#197; -1 or &#948; &#952; &gt; 15&#176;, respectively, whereas around 20% of the other atoms that are &gt;3 &#197; away from the migrating atoms show similar levels of errors (Supplementary Note 7, Supplementary Table <ref type="table">7</ref>). While other factors in addition to the force errors on RE atoms may contribute to the discrepancies, the analyses nonetheless confirm that there are large errors on the atoms near defects and RE atoms are a major source responsible for discrepancies in diffusions.</p><p>The cumulative distribution functions (CDFs) of &#948; F and &#948; &#952; on RE atoms show the distributions of force errors of MLIPs (Fig. <ref type="figure">5d</ref>, <ref type="figure">e</ref>, <ref type="figure">i</ref>, <ref type="figure">j</ref>, and Supplementary Fig. <ref type="figure">12</ref>). Most MLIPs (except for GAP PRX ) exhibit errors in force prediction on RE atoms (8-25% for &#948; F &gt; 0.5 eV &#197; -1 and 40-70% for &#948; &#952; &gt; 15&#176;), whereas lower accuracy DFT K2 (&lt;0.5% for &#948; F &gt; 0.5 eV &#197; -1 and 1% for &#948; &#952; &gt; 15&#176;) and DFT K1 calculations (5% for &#948; F &gt; 0.5 eV &#197; -1 and 30% for &#948; &#952; &gt; 15&#176;) have much smaller errors. These results clearly show MLIP predictions have large errors on RE atoms. The force magnitude error &#948; F of larger than 0.5 eV &#197; -1 is significant, as the median force magnitude of all atoms is around 1.7 eV &#197; -1 (Fig. <ref type="figure">5b</ref>). Among these atoms with &#948; F &gt; 0.5 eV &#197; -1 , 10-35% interstitials and 3-15% vacancies also exhibit significant force direction errors of &#948; &#952; &gt; 30&#176;( Fig. <ref type="figure">5g</ref>), which would lead to major errors in predicting atom dynamics in MD simulations. Therefore, these large errors in MLIPpredicted forces on RE atoms (Fig. <ref type="figure">5d</ref>, <ref type="figure">e</ref>, <ref type="figure">i</ref>, <ref type="figure">j</ref>) and their nearby atoms (Supplementary Table <ref type="table">7</ref> and Supplementary Fig. <ref type="figure">12</ref>) are major sources of the observed discrepancies in the atom vibration and diffusion between MLIPs and DFT.</p><p>By identifying what atoms tend to give large errors, it can be understood why conventional error testing of MLIPs shows very small errors. In conventional error testing of MLIPs, most of the atomistic configurations evaluated are for atoms near equilibrium positions, which are accurately predicted by MLIPs with very small errors. The RE atoms only consist of a very small fraction in most typical testing datasets, e.g., &lt;1% in the dataset of Zuo et al. <ref type="bibr">1</ref> and Bartok et al. <ref type="bibr">26</ref> (2-3% in our RE-testing datasets). Therefore, the large errors on RE atoms are averaged out by the majority of nearequilibrium atom configurations. The total RMSEs of the forces on all atoms in conventional error testing mostly reflect the accurate MLIP prediction of near-equilibrium atoms, which often dominate the typical testing dataset (Supplementary Note 7). However, it should be noted that a few percent of RE-related atoms are critical for the correct prediction of atom dynamics and physical phenomena in MD simulations.</p><p>In order to overcome the limitations of conventional error testing, we propose to quantify the CDFs of force errors &#948; F or &#948; &#952; on RE atoms, using the normalized area of curve (NAC) under the CDF curves (see details in Methods) as new metrics. The NAC of CDF equals to 1 for an MLIP that completely agree with the benchmark true values (DFT K4), and a lower value of NAC suggests larger errors. To combine the metrics of force magnitude NAC(&#948; F , D) and force direction NAC(&#948; &#952; , D), we define the force performance score P(D), as the product of NAC(&#948; F , D) and NAC(&#948; &#952; , D) for errors of atomic forces in a given dataset D (Methods), P&#240;D&#222; &#188; NAC&#240;&#948; F ; D&#222; NAC&#240;&#948; &#952; ; D&#222;:</p><p>These quantitative metrics can be effectively used in training, validation, and testing to improve MLIPs as demonstrated in the next section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Force performance scores as effective metrics</head><p>To evaluate the effectiveness of P(D), we here obtain MLIPs with high P(D) scores and compare their atomic dynamics in diffusions with previous MLIPs. A RE-enhanced training dataset is generated to train the MLIPs that can achieve higher P(D) scores. For the REenhanced training dataset, we replace a fraction (54%) of the structures in the original dataset in ref. <ref type="bibr">1</ref> by those containing identified migrating interstitials. In this way, the size of the training dataset is kept the same to eliminate the impact of training data size in the comparison with original MLIPs (Methods). In the REenhanced validation, the force errors on RE atoms are evaluated for trained MLIPs using the enhanced validation set (EVS) (Methods), which includes those structures with RE atoms (migrating interstitials and vacancies) identified in addition to a fraction of the original dataset in ref. <ref type="bibr">1</ref> . We use the evaluation metrics on errors of energies, overall forces, forces of RE atoms (migrating interstitials or vacancies) to fine-tune the To evaluate the effectiveness of our force performance scores, here we calculate and compare the force performance scores P(D) and the diffusional properties from MD simulations, for a total of 81 MLIPs with a range of models and different hyperparameters, such as 35 original MLIPs and 46 interstitial-enhanced MLIPs selected by the metrics of force performance scores P(D) in the validation step (Methods). Eight criteria measuring the energy errors and force errors of both force magnitudes and directions (Methods) on RE atoms are evaluated. The diffusional properties, such as activation energies, diffusivities, and diffusion pre-factors, from MD simulations of RE-enhanced MLIPs with high force performance scores, show improved agreement with AIMD simulations (Fig. <ref type="figure">6b</ref>). Among these 81 MLIPs (Fig. <ref type="figure">6a</ref>), the MLIPs with higher force accuracies P(D RE-I Testing ) on interstitial RE atoms than DFT K1 predict the activation energy E a I of interstitial diffusion in good agreement with the AIMD K2 value (Fig. <ref type="figure">6a</ref> and Supplementary Fig. <ref type="figure">14</ref>) (Supplementary Note 9). The other MLIPs with lower P(D RE-I Testing ) lower scores than DFT K1 (below the red dashed line in Fig. <ref type="figure">6a</ref>) show much large variations and errors in their predictions of diffusion. Among the RE-enhanced MLIPs with high performance scores, GAP RE-I , SNAP RE-I , NNP RE-I , and MTP RE-I , give E a I of 0.2, 0.33, 0.29, and 0.26 eV, respectively, in better agreement with 0.25 and 0.30 eV from K1 and K2 AIMD simulations, compared to 0.42, 0.74, and 0.42 eV by the original MLIPs, (Fig. <ref type="figure">6b</ref>, Supplementary Table <ref type="table">2</ref>). The predicted diffusion pre-factors D 0 I of enhanced MLIPs improve from &gt;5.7 &#215; 10 -5 cm 2 s -1 (except for GAP PRX ) in original MLIPs to 8.5 &#215; 10 -6 -4.1 &#215; 10 -5 cm 2 s -1 , in better agreement with 7.7 &#215; 10 -6 -2.3 &#215; 10 -5 cm 2 s -1 from AIMD simulations (Fig. <ref type="figure">6b</ref>). In addition, the interstitial configurations observed in MLIP-MD simulations also agree better with AIMD simulations. The occurrence frequencies of split-&lt;110&gt; interstitials increase to around 35% in RE-enhanced GAP RE-I , NNP RE-I , SNAP RE-I , and MTP RE-I from less than 10% by the original MLIPs (Fig. <ref type="figure">3e</ref>). In addition to these improvements, the occurrence frequencies of ground-state split-&lt;110&gt; interstitial are still lower than those of ground-state tetrahedral, even the configurations of all three types of interstitials are included in the RE-enhanced training data. Given the total energy difference of 0.16 eV between the two interstitial configurations is merely 3 meV atom -1 in the 65-atom supercell used, these results indicate the difficulties of MLIPs in accurately reproducing the atomic configurations with similar energies. In summary, the force performance scores on RE atoms are effective error evaluation metrics to indicate related diffusional properties predicted by the MD simulations of the MLIPs. Using improved error evaluation metrics for optimization and validation is demonstrated as an effective step in training MLIPs with improved performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Process of developing error evaluation metrics</head><p>Developing error evaluation metrics that are indicative of the predictions of atomic dynamics is essential for the development of MLIPs. In the conventional training process, optimizing MLIPs on properties other than the errors of predicted energies and forces can be either done by 1) adding additional terms and weights into the loss functions as training targets (Fig. <ref type="figure">7a-i</ref>) <ref type="bibr">1,</ref><ref type="bibr">25</ref> , or 2) adding additional metrics or material properties, such as elastic tensors in ref. <ref type="bibr">1</ref> , when deciding the optimal hyperparameters for MLIPs (Fig. <ref type="figure">7a-ii</ref>). However, this conventional approach would be computationally expensive for evaluating atomic dynamics (Fig. <ref type="figure">6a</ref>-ii), because it requires running multiple MD simulations for each trained model for the fine-tuning of hyperparameters <ref type="bibr">24</ref> . Thus, effective error evaluation metrics that do not require extensive MD simulations and are indicative of atomic dynamics are essential, so one can train MLIPs that can accurately predict physical phenomena in MD simulations with lower computational costs in the training and validation process.</p><p>Here we summarize the process of developing error evaluation metrics as follows, which can be generalized for future development to improve simulation-based performance of MLIPs (Fig. <ref type="figure">2b):</ref> (1) Identify the sources of discrepancies in the MD simulations, (2) propose related but easy-to-calculate metrics that are related to the physical properties based on the simulation (e.g., forces on RE atoms to diffusional properties in our case), (3) quantify the proposed error evaluation metrics and dynamical properties based on simulations for a range of MLIPs, and (4) verify the effectiveness of proposed metrics statistically among these MLIPs with different models, hyperparameters, or training data. If the metrics prove to have a statistically significant effect on improving the atomic dynamics for MLIPs, we then can add the proposed and verified evaluation metrics into the training process (Fig. <ref type="figure">7a-ii</ref>).</p><p>There are important considerations in generalizing this process of developing error evaluation metrics for simulation-derived properties. It's important to use a large amount of data, on metrics and dynamical properties, predicted by diverse MLIPs, and conduct statistical verification instead of using a single or a few MLIPs because the developed metrics should be general enough to be applied to most MLIPs, which may be in the future trained by other descriptors, models, or training data. In some exception cases, MLIPs (e.g., NNPs (orange crosses), SNAPs (purple crosses), and the other models (circles) in Fig. <ref type="figure">6a</ref>) may give poor scores in evaluation metrics but may have small errors on dynamical properties, thus statistically verifying the evaluation metrics is important. There are many factors affecting the outcome of the training of ML models, and many are random in nature and yet poorly understood.</p><p>Trade-offs in MLIP performances: pareto fronts of MLIPs Additionally, we compare the performance and force accuracies of MLIPs based on a variety of ML algorithms and atomic descriptors and observe the trade-off the accuracies on different properties (Fig. <ref type="figure">6c</ref>). We compare MLIPs including the 46 interstitial-enhanced These Pareto fronts indicate the difficulties of MLIPs in achieving accurate predictions for all properties and serve as a good way to compare the performance of different MLIPs. Some MLIPs have lower Pareto fronts, indicating low force accuracies in interstitials at given force accuracies in vacancies. Notably, a few MLIPs, such as GAP and MTP, are able to achieve higher force accuracies than DFT K1 in both interstitial and vacancy. These MLIP models and descriptors may be more effective in accurate prediction for different defect configurations at the given training dataset, though the trade-offs between different predicted properties generally exist for all MLIPs.</p><p>There are more examples showing the trade-offs of MLIPs on other properties. MTP gives accurate predictions for atomic forces and activation energies but shows discrepancies in the predicted atomic vibrations near vacancy (Figs. <ref type="figure">3f</ref>, <ref type="figure">6a</ref>, <ref type="figure">c</ref>). GAP has good predictions on diffusivities in MD simulations but shows discrepancies on phonon spectra (Fig. <ref type="figure">1f-h</ref>). DeePMD reproduces activation energies of diffusion but shows errors for diffusivities, atomic forces, and interstitial configurations (Supplementary Table <ref type="table">2</ref>, Supplementary Table <ref type="table">3</ref>, Figs. <ref type="figure">1</ref>, <ref type="figure">2</ref>, <ref type="figure">4</ref>, and Supplementary Fig. <ref type="figure">12</ref>). Therefore, the performance of MLIPs should not be judged by a single property or a few properties. Even if an MLIP may give good performance for a number of properties, good performance for other properties should not be assumed. Furthermore, overcoming the trade-off of MLIPs' performance on different properties is required to further improve MLIPs for a wide variety of physical simulations. While a careful selection and balance for different types of defect data are known to be essential, it is also important to have a systematic process and quantitative metrics to train MLIPs with balanced accuracies in a range of structures and properties.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>DISCUSSION</head><p>Our study presents a systematic testing on the current state-ofthe-art MLIPs, regarding the discrepancies in the prediction of atom dynamics and physical properties, in comparison with DFT calculations. Even with the carefully selected training dataset and very low average errors in energies and atomic forces from conventional ML testing, MLIPs may exhibit significant discrepancies in their calculated properties and atom dynamics in MD simulations. Examples of these discrepancies include different defect types, atom vibration near defects, phonon dispersion, and diffusional properties. These discrepancies of the MLIPs in actual MD simulations should not be neglected. While the discrepancies for cases that are not included in the training data may be expected, significant discrepancies for presumed cases that are fully covered in the training dataset (e.g., atomic vibration around vacancy) are also significant in some cases and should be carefully tested. These sources of potential discrepancies are summarized as follows and should be carefully considered for improved training and rigorous testing in the future development of MLIPs.</p><p>The properties related to rare events Atomistic diffusion is a typical example of physical property derived from MD simulations. However, the diffusion is largely determined by the REs, which are often less sampled in the training and testing data of the MLIPs <ref type="bibr">34,</ref><ref type="bibr">35</ref> . The MLIP predictions on the atomic forces (both in force magnitude and force direction) of RE atoms deviate significantly from DFT values, causing errors in the predicted atomic dynamics and diffusional properties from the MD simulations. On the other hand, these errors associated with RE atoms serve as effective metrics for quantifying errors of MLIPs. Similar discrepancies are expected for diffusions in other materials systems or other types of rare events, such as reactions and state transitions. In addition, while the RE atoms analyzed in this study are identified using a hand-designed algorithm based on local atom distances, unsupervised machine learning, e.g., the k-means clustering method, as we demonstrated in the Supporting Information, can also be employed to identify these RE atoms.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>The defects with similar energies</head><p>As shown in the examples of Si interstitials, the formation energies and the occurrence in MD simulations of different defect configurations (e.g., split-&lt;110&gt; vs tetrahedral interstitials) may be incorrectly predicted by MLIPs. Even if these various defects are considered in the training, as shown in our results generated by RE-enhanced MLIPs (Fig. <ref type="figure">6b</ref>), the MLIPs' predictions still do not completely agree with the DFT benchmark. These discrepancies are not revealed by low average errors on energies and forces in conventional ML testing for MLIPs, even if the testing dataset includes these defect configurations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Atomic vibration</head><p>The dynamics of atoms near defects are poorly reproduced by the MLIPs, even though these atom dynamics are considered in the training data. These errors are caused by the poor prediction of PES around defects. Since the PES is a high dimensional function of multi-atom configurations, the capability of MLIPs covering these configuration spaces may still have limitations, which should be further studied. It's worthwhile to note that the atomic vibration and the aforementioned formation energy of defect are persistent challenges for MLIPs in our study and are not fully resolved.</p><p>In general, these errors of MLIPs in defect energies, atom vibrations, and REs are related to defects and their nearby atoms. These aforementioned errors given by MLIPs, though yield large errors in the prediction of atom dynamics and properties in MLIP-MD simulations, are not reflected in the conventional testing of MLIPs. In conventional testing, a majority of the testing data are near-equilibrium atomic configurations, which are well described by the MLIPs, leading to low average error values (Supplementary Note 1). Those aforementioned error-prone atomic configurations, such as RE atoms and those around defects, account for a very small fraction of atomic configurations in the testing dataset and their errors are averaged out in the RMSE/MAE error values.</p><p>This understanding provides important guidance for the training, testing, and development of MLIPs. Careful considerations are required for complex defects, their surrounding atoms, and non-equilibrium structures in the training of MLIPs <ref type="bibr">1,</ref><ref type="bibr">26</ref> . In addition to the evaluation of average error values, the accuracies and robustness of MLIPs should be carefully considered and tested on a number of key scenarios with a high likelihood of errors, including defects, atom vibrations near defects, and forces (both direction and magnitude) on RE atoms. As demonstrated in our study, the error evaluation metrics quantifying these errors can effectively improve the training, validation, and testing of MLIPs. For example, force accuracies on RE atoms are demonstrated examples of such error metrics to be used in the validation (Fig. <ref type="figure">6a</ref>) and testing of MLIPs (Figs. <ref type="figure">3</ref>, <ref type="figure">6c</ref>). Furthermore, our demonstrated process of developing error evaluation metrics, which were verified over a large number of MLIPs with different models and training datasets and were applied to selecting MLIP models, can be generally applied and extended to alleviate or overcome these MLIP errors.</p><p>It is very important to develop these error evaluation metrics that are indicative of simulation-based errors. These error evaluation metrics serve as an applicable and effective method in the verification, testing, and selection of better MLIPs, because it is computationally expensive to run a large number of MD simulations for all MLIPs over an extended time duration to obtain related physical properties (such as diffusion). For these error evaluation metrics to be effective and generally applicable, it is important to statistically verify these metrics on a large number of MLIPs with different models and even different training datasets and to compare many metrics and their values with the targeted dynamical properties. Since evaluation metrics are effective in a statistical sense, applying them to different MLIP models, descriptors, and hyperparameters may show various levels of improvement. Nonetheless, developing and applying error evaluation metrics are critical to developing MLIPs with improved performance on atomic dynamics and other properties derived from MD simulations. The process we demonstrate to develop such evaluation metrics can be generalized and extended for future studies.</p><p>Applying our RE-based evaluation metrics provides significant improvements in MLIPs, and the aforementioned errors are significantly reduced but not fully eliminated, including atom vibration near defects and defect configuration occurrences since they're not directly related to RE atoms. Beyond the aforementioned issues of averaged errors and limited PES sampling, our results reveal a number of other potential challenges of MLIPs for improving performances on these defect-related errors. The defect-related errors, such as on different types of interstitials or the PES around defects, can be understood as atomistic configurations with small perturbations and small energy differences (~10 1 meV atom -1 ). Given the defect has a low concentration in the supercell model and the total energies of the entire supercell model are used for training, it can be challenging for the MLIPs to accurately reproduce the PES around all these atomistic configurations with small differences and similar energies. Small differences in the PES would lead to significant change in the probability density, atomic forces, occurrence frequency, and dynamics of atoms. While more systematic testing and more training data may be helpful, it is possible that the descriptors and ML models can be further improved to capture the PES of these varying atomistic configurations. Given the PES is a highdimension function of the atomic descriptor, it is possible the interpolation among the given configurations from the training dataset may be inadequate for accurately covering the entire relevant portion of PES encountered during MD simulations. Overall, to accurately reproduce all aforementioned error cases with similar energies, more studies are needed to further test and improve the atomistic descriptors, ML models, or the schemes of training and testing.</p><p>In conclusion, we study a number of MLIPs and identified a number of potential discrepancy sources in their applications. Leveraging these, we develop evaluation metrics into a process that identifies sources of discrepancies in atomic dynamics, quantifies the discrepancies, statistically verifies the effectiveness of the developed metrics, and optimizes MLIPs using enhanced quantitative metrics. By proposing and demonstrating improved evaluation metrics and the general process to develop such metrics, we show the improvement of the MLIPs in the prediction of physical properties. Overall, our results highlight general guidance and potential challenges in the future development of accurate, robust, and reliable MLIPs for atomistic modeling.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>METHODS Obtaining and training MLIPs</head><p>To obtain existing MLIPs in consistent with previous studies, we retrieved MLIPs (GAP, NNP, SNAP, and MTP) directly from the corresponding mlearn repository of Zuo et al. <ref type="bibr">1</ref> , or trained MLIPs (DeePMD) by using the training data the same as in ref. <ref type="bibr">1</ref> . The mlearn package and the corresponding MLIP models, including QUIP for GAP <ref type="bibr">11</ref> , N2P2 for NNP <ref type="bibr">36</ref> , MLIP for MTP <ref type="bibr">16,</ref><ref type="bibr">37</ref> , and SNAP coded in LAMMPS <ref type="bibr">38</ref> , were used for the energy and force evaluation. The GAP PRX model was retrieved from Ward et al. <ref type="bibr">39</ref> . Since different training data was used for the existing DeePMD model by Zhang et al. <ref type="bibr">5</ref> , the DeePMD model was trained here using the training data from Zuo et al. <ref type="bibr">1</ref> in order to have a fair comparison with other MLIP models. Here we refer to the early version of Deep Potential Molecular Dynamics as DeePMD <ref type="bibr">5</ref> and the later version of Deep Potential-Smooth Edition as DeepPot-SE in Supporting Information <ref type="bibr">27</ref> . The training of DeePMD models was performed using the DeePMD-kit package <ref type="bibr">27,</ref><ref type="bibr">40</ref> . The hyperparameters of the DeePMD model, including the size of the neural network, cut-off radius, and the number of iteration steps, were optimized using a grid search algorithm (two to six values were considered for each hyperparameter) by the RMSEs of energies and forces in a separate set of 100 snapshots from AIMD simulations of bulk Si with single vacancy defect (vacancy validation dataset). The length of the neighbor list in DeePMD model was estimated based on cutoff radius. The final set of hyperparameters was selected to have the lowest RMSEs of energies and forces in both training dataset and the vacancy validation dataset.</p><p>First-principles computation DFT calculations were performed to generate additional data of energies, forces, PES and atomic configurations for training and testing. All DFT calculations were performed by Vienna ab initio simulation package <ref type="bibr">41</ref> (VASP) with the projector augmented-wave approach on these snapshots. The Perdew-Burke-Ernzerhof <ref type="bibr">42</ref> (PBE) functionals by generalized-gradient approximation (GGA) were adopted to calculate the total energies of snapshots. Static relaxation of atomic configurations were performed using spinpolarized DFT calculations with an energy cutoff of 520 eV, an electronic relaxation convergence cut-off of 10 -5 eV, and other parameters set similar to those used in Materials Project <ref type="bibr">43,</ref><ref type="bibr">44</ref> . Our benchmark true values of energies and forces were calculated using 4 &#215; 4 &#215; 4 k-point mesh (K4), while &#915;-centered 1&#215;1&#215;1 (K1) and 2 &#215; 2 &#215; 2 (K2) were also evaluated for comparison.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Ab initio molecular dynamics simulation</head><p>Ab initio molecular dynamics (AIMD) simulations were performed using non-spin-polarized, an electronic energy convergence cutoff of 10 -4 eV, a &#915;-centered 1 &#215; 1 &#215; 1 (K1) or 2 &#215; 2 &#215; 2 (K2) k-point mesh, and a time step of 2 fs. The PBE functionals by GGA were adopted as in the First-Principles Computation section. The supercell model consists of Si bulk-phase with 2 &#215; 2 &#215; 2 conventional unit cells (64 atoms) with a single vacancy (63 atoms) or a single interstitial (65 atoms). The initial temperature of the simulations was set to 100 K after a static relaxation of the initial structures, and the structures were heated to the final temperatures during a period of 2 ps with a constant rate by velocity scaling, and afterwards an NVT ensemble with Nos&#233;-Hoover thermostat was adopted. To obtain diffusional properties, AIMD simulations were performed at different temperatures 730, 840, 1000, 1230, 1500, and 1600 K following the same scheme in refs. <ref type="bibr">4,</ref><ref type="bibr">45</ref> . Missing diffusivities at certain temperatures in Fig. <ref type="figure">2a</ref>, <ref type="figure">b</ref> are either due to the melting of the crystal structures or inadequate numbers of hopping events (specified in Supplementary Tables <ref type="table">4</ref>, <ref type="table">5</ref>, and 6). Given the stochastic nature of ion hopping in estimating diffusivities, the diffusivity calculations were converged using our developed scheme <ref type="bibr">4,</ref><ref type="bibr">45</ref> . and the error bars are estimated correspondingly based on the total number of ion hops. The diffusivities and their error bars of AIMD simulations at all temperatures are available in Supplementary Tables <ref type="table">4</ref>, <ref type="table">5</ref>, and 6.</p><p>The diffusivities and their error bars were evaluated according to the total mean-squared-displacement (TMSD) of Si atoms as in ref. <ref type="bibr">4</ref> . The total time durations of AIMD simulations were in the range of 100 and 1000 ps, so the values of TMSD were in the range of 1200-3000 &#197; 2 and were 600 &#197; 2 for AIMD K2 simulations at temperatures of 1000 K and 1230 K due to high computation cost.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>MLIP-MD simulations</head><p>All classical MD simulations based on MLIPs were carried out using LAMMPS. The MLIP-MD simulations for vacancy and interstitial diffusion were performed on the same supercells with single interstitial or vacancy defect as in the AIMD simulations with NVT ensemble. The time step of all MD simulations was set to 1 fs. MLIP-MD simulations of all MLIPs are performed at six different temperatures 730, 840, 1000, 1230, 1500, and 1600 K. The total time duration of MD simulations was in the range of 500 to 5000 ps, similar to those used in AIMD simulations. The TMSDs of MD simulations were between 6000 and 320,000 &#197; 2 for NNP, SNAP, MTP, DeePMD, and DeepPot-SE, and &gt;1500 &#197; 2 for GAP models. The convergence of diffusivity and the error bars are estimated correspondingly based on the total number of ion hops in the same scheme as in the AIMD simulations <ref type="bibr">4,</ref><ref type="bibr">45</ref> . The diffusivities and their error bars from MLIP-MD simulations are summarized in Supplementary Table <ref type="table">4</ref> and 5, following the same scheme described in "Ab initio molecular dynamics simulation" section.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Analyzing Si interstitial in MD simulations</head><p>The occurrence frequency of each interstitial configuration (split-&lt;110 &gt; , tetrahedral, and hexagonal) was counted among 2000 snapshots (taken every 100 fs) from the single-interstitial supercell model from MD simulations at 1230 K. To determine the interstitial type, a structural matching algorithm was performed for each snapshot following the scheme adopted in ref. <ref type="bibr">46</ref> using pymatgen <ref type="bibr">43</ref> . The interstitial configurations used as matching templates were fully relaxed using the fixed lattices of a crystalline Si bulk and a &#915;-centered 4 &#215; 4 &#215; 4 k-point mesh, with energy and the force convergence criteria at 10 -7 eV and 0.01 eV &#197; -1 , respectively. The matching algorithm used the tolerance parameters of the lattice angle of 5&#176;, the lattice length of 20%, and the site root-mean-square tolerance of 0.275(V/n) 1/3 , where V/n was the volume V normalized by the number of atoms n. Some interstitial configurations, e.g., migrating interstitials or concerted migrations of multiple interstitials, happened in MD simulations, cannot be classified as either of three interstitials, and thus the occurrence frequencies of three interstitials do not always sum up to 1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Identifying static sites, vacancies, and migrating atoms</head><p>The algorithms for identifying static sites, vacancies, and migrating atoms during MD simulations were as follows. The static sites of Si were set to the sites of the perfect crystalline Si bulk. A static site that has no Si atoms within 1.1 &#197; was identified as a vacancy. A migrating Si atom (RE atom) was a Si atom between two nearest-neighbor static sites, which was selected if the distances to their 1st nearest and 2nd nearest static sites has a difference below 0.75 &#197; (~31% of distances between two static sites). This criterion can effectively distinguish migrating atoms from atoms vibrating around static sites.</p><p>The values of the distance to the nearest static site, r s , the distances to its nearest vacancy, r v , and the distances to its nearest RE atom, r, were quantified correspondingly from these static sites. All distributions in Fig. <ref type="figure">3c</ref>-h was generated with 50&#215;50 grid size by collecting r v and r s distances of vacancy NN or non-NN atoms over 2000 snapshots taken randomly (&gt;every 100 fs) from each simulation of MLIPs or AIMD at 1230 K. The Gaussian filter with smoothing parameter &#963; set to 3 was applied in plotting the distributions using scipy library <ref type="bibr">47</ref> .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Testing dataset with REs Two testing sets, D RE-I</head><p>Testing and D RE-V Testing sets, are constructed from 100 snapshots taken from AIMD simulations at 1230 K with single-vacancy and single-interstitial, respectively, with every snapshot having a RE. The energies and atomic forces of each snapshot were further converged by single-step self-consistent DFT calculations to a higher accuracy (&#915;-centered 4 &#215; 4 &#215; 4 k-point mesh) with fixed atom positions and lattices, and these converged values (DFT K4) were used as true values for testing.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Error evaluation of atomic forces</head><p>Using the D RE-I Testing and D RE-V Testing sets, the error evaluation of atomic forces was performed for all MLIPs. The errors of atomic forces by DFT calculations with lower accuracy setting (K1 and K2) commonly used in AIMD simulations were also evaluated in comparison to the MLIPs. The magnitude error &#948; f and the directional error &#948; &#952; of forces were evaluated according to Eqs. ( <ref type="formula">1</ref>) and (2), respectively. From the error evaluation results on the RE atoms (i.e., migrating atoms as defined above) in the D RE-I Testing or D RE-V Testing dataset, the CDFs of &#948; F and &#948; &#952; were generated. The normalized area under the curves NAC(&#948;, D) of the CDFs of &#948; F and &#948; &#952; were quantified in corresponding dataset D for &#948; F over 0 -1 eV &#197; -1 (20 bins) or for &#948; &#952; over 0-90&#176;(20 bins), respectively, for each MLIP or DFT K1/K2 as shown in Fig. <ref type="figure">4</ref>, and were normalized by the total area of the evaluation range of 0-1 eV &#197; -1 for x-axis and 0-100% for y-axis, giving to a value between 0 and 1. Higher NAC values (closer to 1) correspond to smaller errors in the MLIP prediction compared to DFT K4 benchmark.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Generating RE-enhanced training dataset</head><p>To train MLIPs with low errors of forces on RE atoms, 120 snapshots with identified RE atoms were randomly selected to replace the structures in the original training dataset 1 from the following categories, such as liquid Si, AIMD simulations of Si bulk, and the strained Si bulk, from the original training dataset, in order to maintain a balance of structures for each category. The REenhanced training dataset has the same size and similar diversity of different types of atomic configurations as the original dataset in ref. <ref type="bibr">1</ref> , so we can make a fair comparison between existing and RE-enhanced MLIPs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Optimizing enhanced MLIPs</head><p>The validation dataset to optimize the MLIPs trained by REenhanced training data was constructed as follows. The enhanced validation set (EVS) contains 50 total structures, consisting of 20 structures randomly selected from the 120 replaced structures of the original training dataset, 11 snapshots with vacancy RE from AIMD simulations, and 19 snapshots with interstitial RE from AIMD simulations. The optimization of the hyperparameters of MLIPs considered 4 to 10 values for each parameter including the band limit of spherical harmonic basis functions and the number of radial basis functions for GAP, the size of neural network for NNP, the number of radial basis functions number for MTP. For each MLIP model, grid search algorithms were performed among 300-2000 sets of hyperparameters to identify the selected MLIPs as explained below.</p><p>For the selection of optimal MLIP models, which were used to study the force performances of MLIPs in Fig. <ref type="figure">6c</ref>, we evaluated the following eight metrics, such as the RMSEs of energies and forces of 31 structures (excluding the 19 interstitial structures), the RMSEs of energies and forces of RE atoms in 11 vacancy structures, the RMSEs of the energies and forces of RE atoms in 19 interstitial structures, the RMSEs of the energies and forces of RE atoms in 30 vacancy and interstitial structures in the EVS. We used the force performance score P (D for the dataset D. The evaluations were performed for different data such as all atoms in EVS, the RE atoms of the interstitial structures in EVS, the RE atoms of the vacancy structures in EVS, or all RE atoms in EVS.</p><p>Using these evaluation scores, we first selected those MLIPs with the lowest RMSE or the highest NAC in any one of the eight criteria, giving a total of 46 MLIPs as interstitial-enhanced MLIPs. The same optimization process using eight criterion was also applied on selecting additional MLIPs trained by the original dataset in ref. <ref type="bibr">1</ref> obtaining 24 additional MLIPs. All these 46 interstitial-enhanced MLIPs and 24 additional MLIPs were used in Fig. <ref type="figure">6a</ref> for testing of their force performances on interstitial RE atoms. These RE-enhanced MLIPs were further down selected to have the optimal one as the representative MLIPs, using the joint force error score of P(D RE-I EVS ) 2 + P(D RE-V EVS ) <ref type="bibr">2</ref> . All NACs of &#948; F and &#948; &#952; were computed using the RE atoms in the EVS. Six optimal MLIPs, one for each model, were selected. Both MLIP-MD simulations with single vacancy and MD simulations with single interstitial were further performed at six different temperatures to analyze their diffusional properties. </p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>npj Computational Materials (2023) 174 Published in partnership with the Shanghai Institute of Ceramics of the Chinese Academy of Sciences 1234567890():,;</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1"><p>Published in partnership with the Shanghai Institute of Ceramics of the Chinese Academy of Sciences npj Computational Materials (2023) 174</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_2"><p>npj Computational Materials (2023) 174 Published in partnership with the Shanghai Institute of Ceramics of the Chinese Academy of Sciences</p></note>
		</body>
		</text>
</TEI>
