Modern semiempirical electronic structure methods have considerable promise in drug discovery as universal “force fields” that can reliably model biological and drug-like molecules, including alternative tautomers and protonation states. Herein, we compare the performance of several neglect of diatomic differential overlap-based semiempirical (MNDO/d, AM1, PM6, PM6-D3H4X, PM7, and ODM2), density-functional tight-binding based (DFTB3, DFTB/ChIMES, GFN1-xTB, and GFN2-xTB) models with pure machine learning potentials (ANI-1x and ANI-2x) and hybrid quantum mechanical/machine learning potentials (AIQM1 and QD π) for a wide range of data computed at a consistent ωB97X/6-31G* level of theory (as in the ANI-1x database). This data includes conformational energies, intermolecular interactions, tautomers, and protonation states. Additional comparisons are made to a set of natural and synthetic nucleic acids from the artificially expanded genetic information system that has important implications for the design of new biotechnology and therapeutics. Finally, we examine the acid/base chemistry relevant for RNA cleavage reactions catalyzed by small nucleolytic ribozymes, DNAzymes, and ribonucleases. Overall, the hybrid quantum mechanical/machine learning potentials appear to be the most robust for these datasets, and the recently developed QD π model performs exceptionally well, having especially high accuracy for tautomers and protonation states relevant to drug discovery.
more »
« less
The ANI-1ccx and ANI-1x data sets, coupled-cluster and density functional theory properties for molecules
Abstract Maximum diversification of data is a central theme in building generalized and accurate machine learning (ML) models. In chemistry, ML has been used to develop models for predicting molecular properties, for example quantum mechanics (QM) calculated potential energy surfaces and atomic charge models. The ANI-1x and ANI-1ccx ML-based general-purpose potentials for organic molecules were developed through active learning; an automated data diversification process. Here, we describe the ANI-1x and ANI-1ccx data sets. To demonstrate data diversity, we visualize it with a dimensionality reduction scheme, and contrast against existing data sets. The ANI-1x data set contains multiple QM properties from 5 M density functional theory calculations, while the ANI-1ccx data set contains 500 k data points obtained with an accurate CCSD(T)/CBS extrapolation. Approximately 14 million CPU core-hours were expended to generate this data. Multiple QM calculated properties for the chemical elements C, H, N, and O are provided: energies, atomic forces, multipole moments, atomic charges, etc. We provide this data to the community to aid research and development of ML models for chemistry.
more »
« less
- PAR ID:
- 10214281
- Publisher / Repository:
- Nature Publishing Group
- Date Published:
- Journal Name:
- Scientific Data
- Volume:
- 7
- Issue:
- 1
- ISSN:
- 2052-4463
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Machine learning (ML) continues to revolutionize computational chemistry for accelerating predictions and simulations by training on experimental or accurate but expensive quantum mechanical (QM) calculations. Photodynamics simulations require hundreds of trajectories coupled with multiconfigurational QM calculations of excited-state potential energies surfaces that contribute to the prohibitive computational cost at long timescales and complex organic molecules. ML accelerates photodynamics simulations by combining nonadiabatic photodynamics simulations with an ML model trained with high-fidelity QM calculations of energies, forces, and non-adiabatic couplings. This approach has provided time-dependent molecular structural information for understanding photochemical reaction mechanisms of organic reactions in vacuum and complex environments (i.e., explicit solvation). This review focuses on the fundamentals of QM calculations and ML techniques. We, then, discuss the strategies to balance adequate training data and the computational cost of generating these training data. Finally, we demonstrate the power of applying these ML-photodynamics simulations to understand the origin of reactivities and selectivities of organic photochemical reactions, such as cis–trans isomerization, [2 + 2]-cycloaddition, 4π-electrostatic ring-closing, and hydrogen roaming mechanism.more » « less
-
Abstract Computational modeling of chemical and biological systems at atomic resolution is a crucial tool in the chemist’s toolset. The use of computer simulations requires a balance between cost and accuracy: quantum-mechanical methods provide high accuracy but are computationally expensive and scale poorly to large systems, while classical force fields are cheap and scalable, but lack transferability to new systems. Machine learning can be used to achieve the best of both approaches. Here we train a general-purpose neural network potential (ANI-1ccx) that approaches CCSD(T)/CBS accuracy on benchmarks for reaction thermochemistry, isomerization, and drug-like molecular torsions. This is achieved by training a network to DFT data then using transfer learning techniques to retrain on a dataset of gold standard QM calculations (CCSD(T)/CBS) that optimally spans chemical space. The resulting potential is broadly applicable to materials science, biology, and chemistry, and billions of times faster than CCSD(T)/CBS calculations.more » « less
-
Abstract AQME, automated quantum mechanical environments, is a free and open‐source Python package for the rapid deployment of automated workflows using cheminformatics and quantum chemistry. AQME workflows integrate tasks performed across multiple computational chemistry packages and data formats, preserving all computational protocols, data, and metadata for machine and human users to access and reuse. AQME has a modular structure of independent modules that can be implemented in any sequence, allowing the users to use all or only the desired parts of the program. The code has been developed for researchers with basic familiarity with the Python programming language. The CSEARCH module interfaces to molecular mechanics and semi‐empirical QM (SQM) conformer generation tools (e.g., RDKit and Conformer–Rotamer Ensemble Sampling Tool, CREST) starting from various initial structure formats. The CMIN module enables geometry refinement with SQM and neural network potentials, such as ANI. The QPREP module interfaces with multiple QM programs, such as Gaussian, ORCA, and PySCF. The QCORR module processes QM results, storing structural, energetic, and property data while also enabling automated error handling (i.e., convergence errors, wrong number of imaginary frequencies, isomerization, etc.) and job resubmission. The QDESCP module provides easy access to QM ensemble‐averaged molecular descriptors and computed properties, such as NMR spectra. Overall, AQME provides automated, transparent, and reproducible workflows to produce, analyze and archive computational chemistry results. SMILES inputs can be used, and many aspects of tedious human manipulation can be avoided. Installation and execution on Windows, macOS, and Linux platforms have been tested, and the code has been developed to support access through Jupyter Notebooks, the command line, and job submission (e.g., Slurm) scripts. Examples of pre‐configured workflows are available in various formats, and hands‐on video tutorials illustrate their use. This article is categorized under:Data Science > ChemoinformaticsData Science > Computer Algorithms and ProgrammingSoftware > Quantum Chemistrymore » « less
-
Abstract We have performed a large‐scale evaluation of current computational methods, including conventional small‐molecule force fields; semiempirical, density functional, ab initio electronic structure methods; and current machine learning (ML) techniques to evaluate relative single‐point energies. Using up to 10 local minima geometries across ~700 molecules, each optimized by B3LYP‐D3BJ with single‐point DLPNO‐CCSD(T) triple‐zeta energies, we consider over 6500 single points to compare the correlation between different methods for both relative energies and ordered rankings of minima. We find that the current ML methods have potential and recommend methods at each tier of the accuracy‐time tradeoff, particularly the recent GFN2 semiempirical method, the B97‐3c density functional approximation, and RI‐MP2 for accurate conformer energies. The ANI family of ML methods shows promise, particularly the ANI‐1ccx variant trained in part on coupled‐cluster energies. Multiple methods suggest continued improvements should be expected in both performance and accuracy.more » « less
An official website of the United States government
