<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Increasing the Efficiency of Ensemble Molecular Dynamics Simulations with Termination of Unproductive Trajectories Identified at Runtime</title></titleStmt>
			<publicationStmt>
				<publisher>Journal of Physical Chemistry</publisher>
				<date>03/06/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10631991</idno>
					<idno type="doi">10.1021/acs.jpca.4c05182</idno>
					<title level='j'>The Journal of Physical Chemistry A</title>
<idno>1089-5639</idno>
<biblScope unit="volume">129</biblScope>
<biblScope unit="issue">9</biblScope>					

					<author>Jack Marquez</author><author>Michel A Cuendet</author><author>Silvina Caino-Lores</author><author>Trilce Estrada</author><author>Ewa Deelman</author><author>Harel Weinstein</author><author>Michela Taufer</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[The application of molecular dynamics (MD) simulations to study increasingly larger and more complex systems is challenged by the required amounts of trajectory data needed to sample their conformational space appropriately. The analysis and interpretation phase of such massive data sets that have to be stored and fed to the various algorithms to reveal the dynamic behaviors of the systems and the underlying energetics in structural terms related to functional mechanisms are also a significant challenge. To develop computational means that can address these challenges, we are developing a software framework that can increase the ekciency of this process. We present one component of this framework that can reduce the size of the accumulating data set while maintaining the structural attributes, distribution, and relative probability ranking of the minima in the free energy map for the system. This framework component utilizes early termination of individual trajectories identified as unproductive in the sampling of conformational space. The criteria for termination are derived quantities such as collective variables (CVs) and secondary quantities calculated from the time series of CVs. They are computed and applied during the trajectory generation. The approach is illustrated with simulations of the FS peptide and evaluated from comparisons between the free energy surfaces calculated from ensembles of complete, unabridged simulations with those obtained from ensembles in which ∼5-50% of trajectories were terminated early. Our early termination approach can optimize computational ekciency while achieving a robust representation of conformational space.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#9632; INTRODUCTION</head><p>The massive data sets accumulated from molecular dynamic (MD) simulations encode the time evolution of the system sampled to explore conformations separated by energy barriers of diWerent heights. <ref type="bibr">[1]</ref><ref type="bibr">[2]</ref><ref type="bibr">[3]</ref> The constant increase in the power of current high-performance computing (HPC) platforms has enabled the long-expected growth in the size and complexity of the systems that can be explored and understood mechanistically using MD simulations. This progress is accompanied by increasingly powerful algorithms for trajectory analysis, which, however, are hampered by the need to overcome the burden caused by working on postsimulation analysis from saved data written to a storage system. <ref type="bibr">[4]</ref><ref type="bibr">[5]</ref><ref type="bibr">[6]</ref> Therefore, alternatives that can increase ekciency have been considered based on in situ frameworks <ref type="bibr">[7]</ref><ref type="bibr">[8]</ref><ref type="bibr">[9]</ref> for processing and analyzing data during its generation at simulation execution time, with memory used for analytics and visualization tasks. <ref type="bibr">[10]</ref><ref type="bibr">[11]</ref><ref type="bibr">[12]</ref> By reducing the overhead from storage system writing and reading tasks, <ref type="bibr">[7]</ref><ref type="bibr">[8]</ref><ref type="bibr">[9]</ref><ref type="bibr">13,</ref><ref type="bibr">14</ref> in situ frameworks overcome some of the postsimulation analysis challenges and reduce the I/O cost.</p><p>In this study, we present a component of Analytics for Molecular Dynamics (A4MD), <ref type="bibr">15</ref> a specialized in situ framework designed to address these challenges. This component of A4MD focuses on reducing computational burden and cost by allowing for early termination of trajectories identified as "unproductive" at runtime. <ref type="bibr">16</ref> This identification is based on criteria predicting a trajectory's ekciency in leading to the exploration of the conformational space spanned by the simulated system. The ekciency gained by this approach depends on the ability to identify early the likely unproductive trajectories. As the conformational changes leading to transitions between long-lived microstates (or metastable states) constitute rare events in the long trajectories, <ref type="bibr">[17]</ref><ref type="bibr">[18]</ref><ref type="bibr">[19]</ref> we set out to determine whether it is possible to develop useful termination criteria based on the detection of rare events identified by monitoring derived quantities (DQs) relevant to the information sought about the simulated system such as collective variables (CVs), <ref type="bibr">20</ref> or secondary quantities calculated from time series of CVs (such as the variance, or the autocorrelation functions). Specifically, we chose the largest eigenvalue (LEV) and the eWective sample size (ESS) as DQs (see Methods). We compared their performance in steering the simulation toward the robust representation of the molecular free energy surface (FES) of the system with reduced resource utilization.</p><p>To implement this component of the A4MD framework, we compute the two types of DQ during the accumulation of each trajectory in the MD simulation. Based on the properties and mode of calculation of the DQs (detailed in Methods), we examined whether they can serve to identify, early in the trajectory evolution, that the simulation trajectory is not moving from a metastable state and is not progressing to sample ekciently the conformation space of the molecular system. Terminating such "unproductive" trajectories at runtime would result in reduced utilization of computational and storage resources and might allow the rest of the ensemble to reveal the essential features of the conformational space covered by the corresponding full, untrimmed simulation ensemble. <ref type="bibr">21,</ref><ref type="bibr">22</ref> To evaluate these possibilities, we use this component of the A4MD in situ framework, leveraging LEV and ESS DQs to eWectively terminate some trajectories early. <ref type="bibr">[23]</ref><ref type="bibr">[24]</ref><ref type="bibr">[25]</ref><ref type="bibr">[26]</ref> We assess the balance between the extent of trimming by trajectory termination and the performance relative to the full ensemble. We illustrate the method and its results for a benchmark test system for protein folding&#57557;the Fs peptide system (Ace-A 5 (AAARA) 3 A-NME), <ref type="bibr">[27]</ref><ref type="bibr">[28]</ref><ref type="bibr">[29]</ref> and show that this strategy can save computational resources in the computational eWort of conformational space sampling by eliminating up to &#8764;70% of the ensemble of trajectories with results that while identifying the positions and rank order of the minima in the free energy map of the system correctly.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#9632; METHODS</head><p>Simulations of the folding process of the Fs peptide system (Ace-A 5 (AAARA) 3 A-NME) into an &#945; helix were performed using GROMACS 2021.4 <ref type="bibr">30</ref> on the Oak Ridge National Laboratory's Summit supercomputer. <ref type="bibr">31</ref> This folding process has served as a benchmark in previous folding studies. <ref type="bibr">32</ref> The simulations utilize the Amber03 force field <ref type="bibr">33</ref> starting from initial peptide conformations generated by the Modeler software <ref type="bibr">34</ref> and solvated in a truncated octahedron box with 7775 water molecules. Each simulation trajectory was run with a time step of 2 fs on a single node with one GPU and eight CPU cores. The simulation ensembles contained 40 individual replicas, run for 200 ns each, with DQ values calculated at every step by the in situ framework that interfaces the MD code via an extension of the PLUMED plugin. <ref type="bibr">35</ref> The resulting data set consists of 40 trajectories, recorded with a stride of 1 step (5,500,040 frames), and is open-access in Dataverse (10.7910/DVN/DSTBP). <ref type="bibr">36</ref> Trajectory Termination Criteria. LEV Criterion. The LEV, the LEV of the pairwise distance matrix between all C &#945; atoms of the peptide chain of interest, has been used previously to characterize the global fold of polypeptides. <ref type="bibr">11</ref> In particular, it was shown that for a 21-residue protein (21 C &#945; atoms), the expected LEV range for the &#945;-helical folded state is [4000, 4400] according to ref 8. Since we know the folded state of the Fs peptide is a canonical &#945;-helix, we can assume that this state will be very stable. Hence, the trajectories arriving at this folded state will likely remain there for a long time and, therefore, not be very productive in terms of sampling. To define a stopping criterion based on this idea, we needed to define a time window for monitoring the LEV and decide whether it was mainly in the target range of 4000-4400 during this time window. If YES, the system is considered immobile in the folded state (or a state with a high &#945;-helical content). The criterion for stopping (terminating) the trajectory is thus composed of two elements: the time window duration and the fraction of points admitted inside the 4000-4400 range within the time window. Figure <ref type="figure">S2</ref> illustrates these considerations. We note that this LEV stropping criterion assumes that the nature of the target folded state is known a priori, allowing the definition of the appropriate target range.</p><p>ESS Criterion. The ESS captures the apparent number of independent samples in an autocorrelated time series. For a time series {x i }, i = 1, &#8226;&#8226;&#8226;, n, the ESS, noted n eff , can be approximated with the following expression. <ref type="bibr">[37]</ref><ref type="bibr">[38]</ref><ref type="bibr">[39]</ref> n n</p><p>where r k is the estimated correlation coekcient at lag time k</p><p>and k max is the last lag value before the empirical autocorrelation function changes sign for the first time. The ESS has been proposed previously to monitor trajectory convergence, <ref type="bibr">40</ref> or to identify immobile trajectories in MD simulations. <ref type="bibr">21</ref> We use the LEV as our underlying CV time series. We calculate the ESS within a sliding time window of size n ESS , which includes the most recent portion of the trajectory. Low ESS values indicate the presence of slow, large-amplitude conformational changes (rare events), while high ESS values are expected when the system oscillates rapidly around a metastable state. Therefore, if the ESS remains below a certain threshold for a period and then increases dramatically, we can conclude that the trajectory has become immobile and is not very productive for sampling. We have two parameters for an ESS-based stopping criterion: n ESS and the gradient of the first significant increase. Importantly, unlike LEV, the ESS criterion</p><p>The Journal of Physical Chemistry A is free of any prior assumption about the final folded state (e.g., &#945;-helical or &#946;-sheet) and its expected LEV range defined in ref 8. It only detects when the system becomes immobile according to the underlying CV.</p><p>Postprocessing of the Simulation Trajectories. To represent the slowest modes of the system, which are also the most challenging to sample, we utilized the standard Markov State Model (MSM) approach available in the PyEmma package. <ref type="bibr">41</ref> Initially, we applied the dimensionality reduction algorithm tICA (time-structure-based independent component analysis) <ref type="bibr">42</ref> to the trajectory data, concentrating on the slowest degrees of freedom with a lag time of 5000 steps. For clarity in representation, we focused on the first two vectors spanning the tICA space (tICs), as they represent the slowest modes and, therefore, the most challenging to sample. We reasoned that if sampling is adequate in the space of the first two tICs, it would also be sukcient for the faster dimensions. Next, we used K-means clustering on the tICA-reduced data to discretize the conformational space into clusters, resulting in 200 clusters. Subsequently, we evaluated the MSM using a time lag of 100 steps, applying the VAMP method to ensure robust state definition. <ref type="bibr">43</ref> Finally, we employed the Perron-Cluster Cluster Analysis (PCCA+) method to group the 200 clusters into a smaller number of macrostates. PCCA+ guarantees that the identified macrostates are metastable, meaning they exhibit relatively high intrastate transition probabilities and relatively low interstate transition probabilities. This protocol successfully identifies six metastable states of the simulated system (States 0-5).</p><p>A4MD Utilization. The A4MD in situ framework can function for either runtime trajectory management or postanalysis of trajectories. In the runtime mode, when the folded conformation is detected in the trajectory at runtime (i.e., the LEV is within a range corresponding to the folded state of this 21-residue molecule <ref type="bibr">11</ref> ) and the early termination criterion is fulfilled, A4MD stops the execution of the specific MD job that generated that trajectory. Alternatively, the ESS of the LEV time series can be calculated in real-time, allowing the framework to terminate a trajectory when it observes sudden uncorrelated oscillations around a stable conformation, indicated by a sharp increase in the ESS. Based on the two DQs, LEV and ESS, the trimmed trajectory results in fewer frames depending on the selected parameters, freeing up computing resources for a new MD job.</p><p>In the postanalysis mode, we used for the present study, A4MD loads the trajectory set of the full data set. It trims each trajectory using the LEV or ESS criteria. A4MD then calculates CVs and builds Markov State Models (MSMs) for both full and trimmed trajectories. For easier comparison of trimmed versus full trajectories, A4MD can build MSMs by projecting on tICA maps and microstates obtained with a reference full trajectory. Similarly, mapping the PCCA+ cluster assignments of the trimmed data to the metastable states identified in the full model allows us to validate whether the early termination does not miss critical states in the conformational space.</p><p>A4MD can compute free energy landscapes using various methods. In this study, we use the equilibrium probabilities from the MSM stationary distribution. Note that importance sampling on the CV data will not work in a workflow where immobile trajectories are actively terminated, as this intervention alters the distribution of conformations. A4MD can also estimate Hidden Markov Models (HMMs) to represent macrostate transitions and relative free energies using the PyEmma implementation. Our study assesses to what extent the early termination in the A4MD framework provides a robust and accurate representation of conformational space, comparable to full trajectory simulations. This validation is critical for optimizing computational resources while maintaining the integrity of MD simulations. The Supporting Information describes the open-access Jupyter notebooks that implement the A4MD workflow, including key input parameters and analysis steps.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#9632; RESULTS</head><p>We use the trajectory analysis approaches comprising the in situ framework described in the Methods section to compare data from the full complement of trajectories (termed FULL) with results from the truncated ensemble of MD simulation trajectories. The truncated ensemble was obtained by applying the early termination protocols using evaluation criteria based on the DQ's LEV and ESS. The results from the FULL simulations of the Fs peptide folding, shown in Figure <ref type="figure">1</ref>, were obtained using the standard MSM-based postprocessing approach as described. By reducing dimensionality and building a two-dimensional time-structure independent The Journal of Physical Chemistry A component analysis (tICA) map of the dynamics of the peptide's backbone dihedral angles we capture the slowest modes of the system, where the distance between two points in the tICA map is linked to the rate of interconversion between the corresponding conformations. The MSM, based on 200 local microstates (identified by dots in Figure <ref type="figure">1b</ref>), allows us to identify six main basins on the FES that correspond to the six metastable states returned by the PCCA+. <ref type="bibr">44</ref> This FES is shown in Figure <ref type="figure">1a</ref>, and we use the relative position and depth of its local minima as a baseline to compare the performance of simulation ensembles at diWerent levels of truncation.</p><p>To investigate the impact of the LEV stopping criterion on sampling the conformational space of the Fs peptide, we varied the window size and the stable number of frame parameters used to obtain termination at a series of percentages of the full trajectory set (see Table <ref type="table">S1</ref>). Figure <ref type="figure">2a</ref> shows the resulting FESs with an increasingly aggressive LEV-stopping criterion. We observe that until the reduction to 68% of the full number of trajectories, the FES remains qualitatively similar. At 57%, important deviations appear, but the global minimum remains in State 0, the expected folded state. This shows that even with quite drastic data reduction, some characteristic features of the FES are still captured.</p><p>We focus on two convergence measures to better quantify the overall agreement between the sampling obtained from the FES-trimmed trajectories and the full data set. First, we examine the global similarity of the FESs by calculating the mean absolute free energy diWerence at the locations of the 200 microstates. The results are shown in Figure <ref type="figure">2b</ref>. We observe that there is a low overall diWerence at 86% of the full The Journal of Physical Chemistry A ensemble, which increases rapidly when the percentage of frames retained decreases further. Second, we concentrate on the most critical regions of the FES, specifically the values at the local minima corresponding to the six metastable states detected in relation to the global minimum. Figure <ref type="figure">2c</ref> displays the results. A general trend is that with the increase in truncation (i.e., lower percentages), the values of the FE minima shift downward, indicating that the perceived depth of the global minimum is underestimated. Figure <ref type="figure">S3</ref> (Supporting Information) shows that the free energy ranking of the first four local minima is well-preserved down to 61%, and the global minimum of the FULL ensemble is no longer recognized at 46%.</p><p>Next, we conducted the same experiment using the ESS stopping criterion. Overall, we observe similar behavior illustrated in Figure <ref type="figure">3</ref> and found that the global minimum is no longer recognized correctly when we reduced the trimmed ensemble to 45% of the FULL. The mean absolute diWerence converges to zero at approximately the same rate as with LEV (cf. Figure <ref type="figure">3b</ref> to Figure <ref type="figure">2b</ref>). Interestingly, Figure <ref type="figure">3c</ref> suggests that the depth of the global minimum is aWected by more minor trajectory reductions compared to the LEV case: with 90% of the data, the relative position of the local minimum (State 1) changes by 0.3 kT, although the ranking is correct, but at 66% it deviates from the full ensemble results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#9632; DISCUSSION</head><p>We developed the approach described in this manuscript as part of a general eWort to increase the ekciency of MD simulations of very large molecular systems by implementing algorithms that can run in parallel with canonical MD and modify the workflow based on results from in situ analyses of the accumulating data. We tested its performance in reducing the resources utilized to achieve correctly the goals of the simulation. Specifically, we probed the capability of this aspect of the A4MD framework to attain this goal using a method of early termination of unproductive trajectories. To assess whether such trimming of the MD simulation ensemble can achieve satisfactory coverage of the conformational space compared to the corresponding the full simulation ensemble of The Journal of Physical Chemistry A a known test system: the FS peptide (Ace-A 5 (AAARA) 3 A-NME).</p><p>Analysis of our comparative simulation results demonstrates the feasibility and ekciency of using DQs such as LEV and ESS for early termination of molecular dynamics (MD) simulations. A key finding is that using DQs can substantially cut the required simulation frames. If the goal is to identify the global minimum in the FES, the MD ensemble can be reduced by around 55% for both the LEV and ESS criteria. On the other hand, if the goal is to preserve the stability ranking of metastable states, values of 61% for LEV and 66% for ESS are the limits to which one can reduce the MD ensemble. This reduction is still beneficial as it achieves computational savings while maintaining essential information on the stability ranking of these states. However, reducing beyond these thresholds (lower than 61% for LEV and 66% for ESS) starts to aWect the accuracy of the simulation, causing the apparent depth of the global minimum to decrease, which means key details about the system's most stable state are lost. The overall reductions represent consequential savings in computational resources that enable more ekcient utilization of supercomputing facilities and facilitate larger-scale studies.</p><p>Our performance analysis of the DQs found that the free energy values and locations of the local FES minima in tICA space were similar for the FULL, LEV-trimmed, and ESStrimmed MD ensembles. The LEV criterion preserved free energy values slightly better than the ESS for this system. Still, this potential advantage may be oWset by the fact that the LEVstopping criterion requires an assumption regarding the target conformation of the peptide, while the ESS approach does not. Instead, the application of the ESS-based criterion for trajectory termination does not expect the underlying CV to stabilize in a predefined region, only to stabilize somewhere in the configurational space, and thus can be combined with any underlying CV that can capture the global dynamics of the protein which oWers a larger scope of targeted conformational space explorations.The findings of this study have significant implications for MD and computational chemistry. Ekciently terminating simulations early, guided by time series of derived quantities calculated from trajectory data, can preserve critical biomolecular properties, accelerating the discovery process and allowing for more extensive exploration of conformational spaces. While careful adjustment and choice of parameters for the early termination criteria, such as the LEV range and ESS window size, significantly impact the results, it oWers opportunities for creative targeted utilization, and leaves room for further improvements in the early termination decisions and ekciency in utilization of computational resources. As we undertake to systematically optimize these parameters and examine their eWects on energy landscapes and state maps, we will also study their transferability across various molecular systems through the generalization of the early termination criteria to multidimensional DQs, such as developing the ESS for multiple underlying CVs measured simultaneously. In the A4MD framework for large-scale MD simulations, this component of the trajectory ensemble management system will then integrate the early termination criteria with ekcient trajectory restart strategies to maximize computational resource utilization to benefit essential largescale MD simulations such as eWorts in protein folding studies, drug discovery, and large cellular systems simulations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>&#9632; ASSOCIATED CONTENT</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>* s&#305; Supporting Information</head><p>The Supporting Information is available free of charge at <ref type="url">https://pubs.acs.org/doi/10.1021/acs.jpca.4c05182</ref>.</p><p>Open-access Jupyter notebooks that implement the A4MD workflow, including key input parameters and analysis steps (PDF)</p><p>&#9632; AUTHOR INFORMATION Corresponding Author Michela Taufer -Department of Electrical Engineering and Computer Science, University of Tennessee, Knoxville, Tennessee 37916, United States; orcid.org/0000-0002-0031-6377; Email: taufer@acm.org tion TG-CIS200053; and IBM through a Shared University Research Award.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>https://doi.org/10.1021/acs.jpca.4c05182 J. Phys. Chem. A 2025, 129, 2317-2324</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1"><p>The Journal of Physical Chemistry A</p></note>
		</body>
		</text>
</TEI>
