<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>AQME: Automated quantum mechanical environments for researchers and educators</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>02/26/2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10403098</idno>
					<idno type="doi">10.1002/wcms.1663</idno>
					<title level='j'>WIREs Computational Molecular Science</title>
<idno>1759-0876</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Juan V. Alegre‐Requena</author><author>Shree Sowndarya S. V.</author><author>Raúl Pérez‐Soto</author><author>Turki M. Alturaifi</author><author>Robert S. Paton</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[AQME, automated quantum mechanical environments, is a free and open-sourcePython package for the rapid deployment of automated workflows using cheminformatics and quantum chemistry. AQME workflows integrate tasks performed across multiple computational chemistry packages and data formats, preserving all computational protocols, data, and metadata for machine and human users to access and reuse. AQME has a modular structure of independent modules that can be implemented in any sequence, allowing the users to use all or only the desired parts of the program. The code has been developed for researchers with basic familiarity with the Python programming language. The CSEARCH module interfaces to molecular mechanics and semi-empirical QM (SQM) conformer generation tools (e.g., RDKit and Conformer-Rotamer Ensemble Sampling Tool, CREST) starting from various initial structure formats. The CMIN module enables geometry refinement with SQM and neural network potentials, such as ANI. The QPREP module interfaces with multiple QM programs, such as Gaussian, ORCA, and PySCF. The QCORR module processes QM results, storing structural, energetic, and property data while also enabling automated error handling (i.e., convergence errors, wrong number of imaginary frequencies, isomerization, etc.) and job resubmission. The QDESCP module provides easy access to QM ensemble-averaged molecular descriptors and computed properties, such as NMR spectra. Overall, AQME provides automated, transparent, and reproducible workflows to produce, analyze and archive computational chemistry results. SMILES inputs can be used, and many aspects of tedious human manipulation can be avoided. Installation and execution on Windows, macOS, and Linux platforms have been tested, and the code has been developed to support access through Jupyter Notebooks, the command line, and job submission (e.g., Slurm) scripts. Examples of pre-configured workflows are available in various formats, and hands-on video tutorials illustrate their use. Juan V. Alegre-Requena and Shree Sowndarya S. V. contributed equally.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><p>The program presented is a free, open-source Python-based software installed via conda-forge (conda install -c conda-forge aqme) or Python Package Index (pip install aqme). All the dependencies required to run AQME are installed automatically, except for RDKit and Openbabel when using pip install. Along with the software, we set up tests through Pytest, Circle CI, and Codecov, which allows us to ensure a correct functioning of a significant proportion of the code. Also, multiple code analyzers (CodeFactor, Codacy, and LGTM) were employed to improve the quality standards and readability of the deployed code. A detailed documentation page is available at Read the Docs. <ref type="bibr">29</ref> </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">| MODULES OF AQME</head><p>AQME is divided into modules that can be called as part of a workflow or separately. Four main applications enclose these modules: (i) conformer generation and geometry refinement (CSEARCH and CMIN), (ii) generation of QM input files (QPREP), (iii) postprocessing of output files (QCORR), and (iv) generation of Boltzmann weighted descriptors (QDESCP). In this section, the technical details of the modules are disclosed in more detail.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">| Conformer generation and geometry refinement: The CSEARCH and CMIN modules</head><p>The availability of numerous conformer generators for small molecules (i.e., ConfGen, <ref type="bibr">38</ref> OMEGA, <ref type="bibr">39</ref> Frog2, <ref type="bibr">40</ref> etc.) highlights the central importance of this task. In the CSEARCH module, AQME gathers multiple types of conformational search tools. It can be used simply as an interface to the external conformer generation protocols available in RDKit or CREST, or to perform torsion-based sampling internally while making use of the MM or SQM potentials in those packages (Figure <ref type="figure">2a</ref>). When starting from SMILES strings, CSEARCH attempts to derive molecular charge and multiplicity, although this can be manually overwritten (charge and mult options). The number of simultaneous processes is controlled by the max_workers option; the number of processors used by each process with the nprocs option.</p><p>The Grimme group's CREST generates conformers by extensive metadynamics sampling using semi-empirical methods (GFNn-xTB) or force fields (GFN-FF), with an additional genetic Z-matrix crossing step at the end. Userdefined constraints for atom positions, bond distances, angles and dihedral angles enable approximate TS structures to be sampled. When starting from 1D and 2D structures, RDKit is used to generate the necessary 3D input for CREST. Before the sampling, two initial xTB optimizations are performed to avoid errors. During the first optimization, the F I G U R E 1 General workflow including the modules available in AQME and their connectivity with external programs. calculations are performed with all the bonds frozen in addition to any user-defined constraints. This preparatory calculation avoids problems related to the superimposition of molecules when the input molecules are generated from 1D or 2D inputs. In the second optimization, only the user-defined constraints are included. Afterward, the CREST search is carried out, including any additional keywords specified in the crest_keywords option (i.e., crest_keywords="--nci--cbonds 0.5").</p><p>MM-based searches are faster and may be necessary for large systems or large numbers of molecules. The first method (using program="rdkit") performs RDKit-based conformer optimization and filters duplicates based on energy and geometry (with root mean square distance, RMSD). It starts with an energy window to remove high energy conformers (ewin_csearch option, default 5 kcal/mol above lowest), then removes conformers with similar energies (initial_energy_threshold option, default 0.0001 kcal/mol), and finally removes conformers with similar energy and RMSD (energy_threshold, default 0.25 kcal/mol and rms_threshold, default 0.25). The default values were chosen based on a benchmark study of flexible druglike compounds and natural products to yield significant conformers while reducing duplicates (see the Benchmarking of CSEARCH-RDKit and CMIN section in the AQME_ESI.docx document available in FigShare <ref type="bibr">41</ref> ).</p><p>When the initial conformational sampling fails, the program automatically tries to address the problems through a series of changes to its initial protocol (i.e., changing MMFF for UFF, using random coordinates for molecular embedding, etc.), making the protocol more robust. CSEARCH also tries to overcome other severe limitations in the conformational sampling of molecules that contain transition metals or atoms with uncommon hybridization (i.e., pentacoordinate P atoms). Additionally, common templates for organometallic compounds, such as linear, trigonal planar, and square-planar geometries, can be used. These geometries are not usually obtained with standard RDKit protocols (i.e., square-planar metal complexes lead to tetrahedral structures), which may then neglect an essential aspect of conformational behavior (see the Highlighting the Importance of Specifying the Metal Type: ABEVUZ as an Example section in the ESI <ref type="bibr">41</ref> ).</p><p>Internal torsional sampling approaches, such as the systematic unbounded multiple minimum (SUMM) <ref type="bibr">42</ref> and Monte Carlo Multiple Minimum (MCMM) algorithms, are also implemented. <ref type="bibr">43,</ref><ref type="bibr">44</ref> The SUMM approach surveys dihedral angles that are progressively varied by a user-specified increment, while MCMM applies random values to a random subset of the rotatable torsions. These two methods require more time than the standard RDKit sampling, but they might render more accurate results in cases with complex conformational spaces. Finally, the CMIN module refines the energies and geometries of the structures obtained with RDKit or other low-level methods before optimizing with more demanding levels of theory, such as density functional theory (DFT). xTB or ANI methods typically result in a reordering of relative conformational stabilities closer to QM results and the removal of duplicate conformations. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">| QM input file preparation: The QPREP module</head><p>The QPREP module is designed to convert multiple formats (SDF, XYZ, PDB, JSON, LOG/OUT) into input files for QM programs ready to be submitted without further modification. When using SDF and XYZ files, a QM input is generated for each structure in the files. For LOG/OUT calculations, only the final geometry is employed to create inputs for postoptimization single-point calculations (i.e., energy corrections at a higher theory level, TD-DFT calculations, etc.). QPREP currently generates inputs for Gaussian, ORCA, and PySCF. One of the most convenient features of this module is that the Gen and GenECP specifications from Gaussian input files are automatically written. When the user specifies atom types to use in gen_atoms, QPREP detects the atom types of the molecule and separates them into two groups for the input GenECP section. For example, the users can set the 6-31 + G(d,p) basis set for C, H, N, and P atoms while using def2-SVP for Pd atoms (Figure <ref type="figure">2b</ref>). This automated protocol avoids the tedious manual setup of all the types of atoms in the GenECP part, which is especially helpful when working with different families of compounds or with big molecular datasets. The input keywords for the generated input files are specified through the qm_input option, and other parameters can be edited as preferred, such as charge, multiplicity, generation of CHK files, number of processors, and memory. Also, the user can include final lines after the molecular coordinates (i.e., NBO keywords).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">| Postprocessing of output files: The QCORR module</head><p>Typically, a tedious manual search and correction for error terminations, convergence issues, and extra imaginary frequencies is necessary after running QM calculations. Based on our experience with structure optimizations and frequency calculations for large databases (i.e., many thousands) of organic compounds, such occurrences are relatively common. QCORR structures output data and automatically detects issues or errors, creating new input files that try to correct those issues, a cycle that can be repeated several times (Figure <ref type="figure">2c</ref>).</p><p>We conducted a study of 2709 calculations with organic molecules to find the optimal number of QCORR cycles for ground state optimization and frequency calculations in Gaussian 16 (QCORR_benchmarking.zip file in FigShare <ref type="bibr">41</ref> ). The initial geometries were obtained with a CSEARCH-RDKit standard conformer sampling. In the first round, 24% of the RDKit-generated conformers converged to duplicated QM conformers after optimization. From the unique structures, many outputs had problems: 1.5% showed imaginary frequencies, and 35% failed to converge to a stationary point in frequency calculation (albeit they converged during optimization). QCORR then automatically generated new inputs to fix the issues and these inputs were run. After the second round of QM calculations, 98% of the outputs had no issues, indicating that two QCORR rounds might be sufficient for most organic molecules. In our experience, an additional cycle is normally needed for complex systems with metals or supramolecular aggregates. The freq_conv keyword, which checks if a stationary point is found during optimization but not during frequency calculations, may be best avoided for flat or complex energy surfaces.</p><p>Structural isomerizations can be automatically detected and filtered. Also, calculations with spin contamination will be removed if S <ref type="bibr">2</ref> differs from s(s + 1) by more than 10%, as previously suggested. <ref type="bibr">45</ref> The filter can be disabled or adjusted to other thresholds with the s2_threshold option. QCORR uses the cclib Python library and stores all parsed data as JSON files since this format allows other Python tools to retrieve the information easily. By default, QCORR uses this information to detect all calculations for consistency in terms of calculation type and software version.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">| Generation of Boltzmann weighted properties and descriptors:</head><p>The QDESCP module When comparing computed results with experimental observables, Boltzmann-weighted values are generally advisable for molecules with multiple conformers. This also applies to molecular descriptors, where the utility of approaches that derive an ensemble average for a particular descriptor, or directly use minimum/maximum values, has been demonstrated. <ref type="bibr">46,</ref><ref type="bibr">47</ref> AQME is integrated with xTB to compute and curate computed descriptors for large compound databases. Starting from 3D geometries, atomic properties such as charges, fractional occupation densities, Fukui indices, D3-dispersion coefficients, and molecular properties including dipole moment, HOMO-LUMO gap, polarizability, energy are determined for every conformation of a molecule (Figure <ref type="figure">2d</ref>). Then, the Boltzmann averaged values of each descriptor are calculated and stored separately. This protocol enables the generation of SQM molecular descriptors as starting points for ML models. Additionally, QDESCP can be used to obtain Boltzmann averaged nuclear magnetic resonance (NMR) chemical shifts from DFT calculations. The user can specify slope and intercept to scale the results to the tetramethylsilane (TMS) scale using tools such as The Tantillo group's CHESHIRE repository, <ref type="bibr">48</ref> rendering simulated spectra that can be compared directly with experimental spectra.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">| END-TO-END WORKFLOWS</head><p>In this section, we detail three illustrative workflows that have been adapted for different applications regularly carried out in our group, such as calculating energy profiles and generating molecular descriptors for ML models. These workflows are available on Figshare <ref type="bibr">41</ref> along with associated data and metadata, and three formats are available in the code and the Read the Docs webpage: Jupyter Notebook, SLURM script, and command-line script. Hands-on tutorials have been uploaded to YouTube. <ref type="bibr">29</ref> For large systems or datasets, we typically execute end-to-end AQME workflows on a cluster using SLURM commands: the overall time taken is dominated by the QM calculation steps.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">| The conformational distribution and 1 H chemical shifts of strychnine from SMILES input</head><p>Strychnine is a natural alkaloid produced by different plants of the genus Strychnos, whose complexity and pharmacological properties have attracted many organic synthetic groups over time. <ref type="bibr">49</ref> Recently, John, Reinscheid, and coworkers reported two different conformers in a 97:3 ratio observed in NMR studies. <ref type="bibr">50</ref> The following AQME workflow aims to identify these two structures and simulate an averaged NMR spectra starting from a SMILES string, using a combination of (i) RDKit conformer sampling, (ii) Gaussian geometry optimization with B3LYP/6-31 + G(d,p), (iii) fixing errors and imaginary frequencies of the output files, (iv) GoodVibes 51 calculation of Boltzmann distributions using Gibbs free energies at 298.15 K, and (v) Boltzmann averaged shielding tensor calculations (empirically-scaled to obtain chemical shifts) with B3LYP/6-311 + G(2d,p), SMD = CHCl 3 (Figure <ref type="figure">3</ref>). Using the CSEARCH module with RDkit yields two conformers which are utilized further for DFT optimization. The calculated Boltzmann distribution for the two conformers is 99:1, which correlates well with the experimental observation of 97:3. Furthermore, the predicted 1 H chemical shifts present a low mean average error (MAE, 0.14 ppm for nine known 1 H signals) compared to the experimental values. <ref type="bibr">52</ref> This workflow did not require manual intervention and suggests that further applicability of AQME to automate NMR prediction and organic structure elucidation merits investigation.</p><p>F I G U R E 3 End-to-end workflow to calculate the conformer distribution and scaled (TMS) <ref type="bibr">1</ref> H NMR chemical shifts of strychnine in CHCl 3 .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">| Comparing Diels-Alder activation barriers from multiple SMILES inputs</head><p>A common computational task involves comparing the reactivity of several different substrates or reagents, for which the same elementary steps are studied separately for each of the related systems. Where transition states are involved, a common approach to conformational analysis involves constraining forming/breaking bonds while flexible regions are explored in much the same way as for a ground state structure, followed by saddle point optimizations. AQME can be employed to generate and compare energy profiles by automating this sequence of steps that is often performed manually. Figure <ref type="figure">4</ref> summarizes a workflow where a CSV input containing a list of SMILES is used to generate the reaction energy profiles for the Diels-Alder cycloadditions of multiple cyclic dienophiles, each reacting with cyclopentadiene. <ref type="bibr">53</ref> A Jupyter notebook was used to define SMILES strings and to identify relevant atom numbers to define constraints that are used for the conformational analysis of TSs. Performed manually, these tasks would typically each structure to be built and visualized by the user. This approach is limited to reactions where the TS structures are intuitively known; for automated TS location and energy profile generation without requiring prior knowledge, and mechanistic discovery, tools such as autodE are highly recommended. <ref type="bibr">15</ref> The workflow presented illustrates a typical multistep combination of SQM conformer sampling, geometry optimizations and single point energy corrections with different levels of theory, and the generation of a potential energy surface diagram. AQME links together the following tasks: (i) CREST conformer sampling, (ii) Gaussian geometry optimization (B3LYP/def2-TZVP), (iii) fixing errors and imaginary frequencies of the output files, (iv) ORCA single point energy corrections using DLPNO-CCSD(T)/def2-TZVPP, and (v) Boltzmann weighted thermochemistry calculation and PES generation with GoodVibes at 298.15 K. There is minimal manual intervention and the use of separate spreadsheets to create the PES is avoided.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">| Generating QM or SQM molecular descriptors for a large dataset</head><p>Statistical and ML applications in chemistry are often enhanced by using feature vectors or parameters derived from QM calculations, as opposed to features obtained solely from the 2D-molecular graph. <ref type="bibr">54</ref> For example, QMderived atomic charges or populations are often used in multivariate linear regression or neural network models. <ref type="bibr">55</ref> Figure <ref type="figure">5</ref> shows a workflow performed on the SMILES-containing ESOL database, <ref type="bibr">56</ref> which contains measured aqueous solubilities, that uses a message passing graph neural network (GNN) model. There are 1126 structures with experimental values available, which are split into training (901), validation (175) and test (50) sets. The protocol includes (i) RDKit conformer sampling, (ii) xTB descriptor generation (Boltzmann weighted), and (iii) neural fingerprint (nfp) <ref type="bibr">57</ref> based GNN model creation. In the first step, the CSEARCH module creates 3D conformations using the SMILES strings of the database with RDKit. Then, QDESCP uses xTB to generate molecular and atomic properties such as dipole moments, charges, Fukui indexes, dispersion parameters, polarizability, and HOMO-LUMO gaps, among others. These properties are provided individually for each conformer and as F I G U R E 4 End-to-end workflow to generate the energy profile of multiple Diels-Alder reactions using a CSV as the input file.</p><p>Boltzmann averaged values. In addition to properties generated from xTB, features in the Lipinski/Descriptors modules of RDKit are also included in the QDESCP analysis.</p><p>Boltzmann-weighted xTB parameters are then used as descriptors in a GNN model. The GFN2-xTB atomic properties are encoded as node features, and the molecular properties are passed as global features in the input graph structure. This graphical representation of the molecule with embedded atomic and molecular properties is to build a message passing GNN. The GNN model utilized the AdamW optimizer, and model performance was assessed by measuring the mean absolute error during training for 500 epochs. The model showed an R 2 and MAE of 0.9 in the held-out test set of 50 molecules. Hyperparameter tuning can be performed along with feature selection to further improve this accuracy. Other ML models, such as random forests can also be employed in this workflow instead of the GNN shown.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">| CONCLUSION</head><p>AQME is an open-source Python package for building computational workflows to perform multistep protocols efficiently that combine different packages and model chemistries. The approach is well-suited to general tasks incorporating conformational analysis, geometry refinement, QM optimizations, and ensemble-averaged property predictions. Representative examples of chemical shift prediction, reaction energy profile calculation, and dataset featurization are shown here, in each case operating as a fully automated ("end-to-end") workflow from the supplied inputs to desired Boltzmann-averaged outputs. Inputs can be supplied in SMILES or several 3D structure formats. This approach captures and preserves protocols used at every stage of the research process: there are no extraneous Spreadsheets involved and all steps can be reproduced by other researchers.</p><p>The software consists of independent modules that can be combined in any order. The CSEARCH module performs conformational sampling or interfaces to external conformational analysis tools using MM (RDKit) or SQM (xTB, CREST) levels of theory. CMIN allows the refinement of conformer ensembles with xTB or ANI methods. The QPREP module converts files with different 3D input formats into input files for external QM programs, such as Gaussian, ORCA, and PySCF. The QCORR module analyzes QM output data by systematically processing output files. Filters for example, convergence errors, imaginary frequencies, and undesired structural isomerization, can be implemented to create new inputs automatically. The QDESCP module produces Boltzmann-weighted descriptors and properties. The modular structure means AQME can be used in Jupyter notebook environments or reused and imported by other Python projects.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>17590884, 0, Downloaded from https://wires.onlinelibrary.wiley.com/doi/10.1002/wcms.1663, Wiley Online Library on [27/02/2023]. See the Terms and Conditions (https://onlinelibrary.wiley.com/terms-and-conditions) on Wiley Online Library for rules of use; OA articles are governed by the applicable Creative Commons License</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_1"><p>of 11</p></note>
		</body>
		</text>
</TEI>
