Title: Assessing conformer energies using electronic structure and machine learning methods
Abstract: We have performed a large‐scale evaluation of current computational methods, including conventional small‐molecule force fields; semiempirical, density functional, and ab initio electronic structure methods; and current machine learning (ML) techniques to evaluate relative single‐point energies. Using up to 10 local minima geometries across ~700 molecules, each optimized by B3LYP‐D3BJ with single‐point DLPNO‐CCSD(T) triple‐zeta energies, we consider over 6500 single points to compare the correlation between different methods for both relative energies and ordered rankings of minima. We find that the current ML methods have potential and recommend methods at each tier of the accuracy‐time tradeoff, particularly the recent GFN2 semiempirical method, the B97‐3c density functional approximation, and RI‐MP2 for accurate conformer energies. The ANI family of ML methods shows promise, particularly the ANI‐1ccx variant trained in part on coupled‐cluster energies. Multiple methods suggest continued improvements should be expected in both performance and accuracy.
Award ID(s):
1800435
PAR ID:
10453347
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Journal Name:
International Journal of Quantum Chemistry
Volume:
121
Issue:
1
ISSN:
0020-7608
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Modern semiempirical electronic structure methods have considerable promise in drug discovery as universal “force fields” that can reliably model biological and drug-like molecules, including alternative tautomers and protonation states. Herein, we compare the performance of several neglect of diatomic differential overlap-based semiempirical (MNDO/d, AM1, PM6, PM6-D3H4X, PM7, and ODM2), density-functional tight-binding based (DFTB3, DFTB/ChIMES, GFN1-xTB, and GFN2-xTB) models with pure machine learning potentials (ANI-1x and ANI-2x) and hybrid quantum mechanical/machine learning potentials (AIQM1 and QDπ) for a wide range of data computed at a consistent ωB97X/6-31G* level of theory (as in the ANI-1x database). This data includes conformational energies, intermolecular interactions, tautomers, and protonation states. Additional comparisons are made to a set of natural and synthetic nucleic acids from the artificially expanded genetic information system that has important implications for the design of new biotechnology and therapeutics. Finally, we examine the acid/base chemistry relevant for RNA cleavage reactions catalyzed by small nucleolytic ribozymes, DNAzymes, and ribonucleases. Overall, the hybrid quantum mechanical/machine learning potentials appear to be the most robust for these datasets, and the recently developed QDπ model performs exceptionally well, having especially high accuracy for tautomers and protonation states relevant to drug discovery.
  2. Abstract Maximum diversification of data is a central theme in building generalized and accurate machine learning (ML) models. In chemistry, ML has been used to develop models for predicting molecular properties, for example quantum mechanics (QM) calculated potential energy surfaces and atomic charge models. The ANI-1x and ANI-1ccx ML-based general-purpose potentials for organic molecules were developed through active learning, an automated data diversification process. Here, we describe the ANI-1x and ANI-1ccx data sets. To demonstrate data diversity, we visualize it with a dimensionality reduction scheme, and contrast against existing data sets. The ANI-1x data set contains multiple QM properties from 5 M density functional theory calculations, while the ANI-1ccx data set contains 500 k data points obtained with an accurate CCSD(T)/CBS extrapolation. Approximately 14 million CPU core-hours were expended to generate this data. Multiple QM calculated properties for the chemical elements C, H, N, and O are provided: energies, atomic forces, multipole moments, atomic charges, etc. We provide this data to the community to aid research and development of ML models for chemistry.
  3. Abstract This study explores open-shell biradical and polyradical molecular compounds based on extended multireference (MR) methods (MR-configuration interaction with singles and doubles (CISD) and MR-averaged quadratic coupled cluster (AQCC) approach) using the numbers of unpaired densities N_U. These results were used to guide the analysis of the fractional occupation number weighted density (FOD) calculated within the finite temperature (FT) density functional theory (DFT) approach. As critical test examples, the dissociation of carbon–carbon (CC) single, double and triple bonds and a benchmark set of polycyclic aromatic hydrocarbons (PAHs) have been chosen. By examining single, double, and triple bond dissociations, we demonstrate the utility and accuracy but also limitations of the FOD analysis for describing these dissociation processes. In significant extension of previous work (Phys Chem Chem Phys 25: 27380–27393), the assessment of FOD applications for different classes of DFT functionals was performed examining the range-separated functionals ωB97XD, ωB97M-V, CAM-B3LYP, LC-ωPBE, and MN12-SX, the hybrid (M06-2X) functional and the double hybrid (B2P-LYP) functional. In all cases, strong correlations between N_FOD and N_U values are found. The major task was to develop a new linear regression formula for range-separated functionals allowing a convenient determination of the optimal electronic temperature T_el for the FT-DFT calculation. We also established an optimal temperature for the semiempirical extended tight-binding GFN2-xTB method. These findings significantly broaden the applicability of FOD analysis across various DFT functionals and semiempirical methods.
  4. Abstract The global minima of urea and thiourea were characterized along with other low‐lying stationary points. Each structure was optimized with the CCSD(T) method and triple‐ζ correlation consistent basis sets followed by harmonic vibrational frequency computations. Relative energies evaluated near the complete basis set limit with both canonical and explicitly correlated CCSD(T) techniques reveal several subtle but important details about both systems. These computations resolve a discrepancy by demonstrating that the electronic energy of the C2v second‐order saddle point of urea lies at least 1.5 kcal mol⁻¹ above the C2 global minimum regardless of whether the structures were optimized with MP2, CCSD, or CCSD(T). Additionally, urea effectively has one minimum instead of two because the electronic barrier for inversion at one amino group in the Cs local minimum vanishes at the CCSD(T) CBS limit. Characterization of both systems with the same ab initio methods and large basis sets conclusively establishes that the electronic barriers to inversion at one or both NH2 groups in thiourea are appreciably smaller than in urea. CCSDT(Q)/cc‐pVTZ computations show higher‐order electron correlation effects have little impact on the relative energies and are consistently offset by core correlation effects of opposite sign and comparable magnitude.
  5. Jouline, Igor B (Ed.)
    ABSTRACT Large-scale surveys of prokaryotic communities (metagenomes), as well as isolate genomes, have revealed that their diversity is predominantly organized in sequence-discrete units that may be equated to species. Specifically, genomes of the same species commonly show genome-aggregate average nucleotide identity (ANI) >95% among themselves and ANI <90% to members of other species, while genomes showing ANI 90%–95% are comparatively rare. However, it remains unclear if such “discontinuities” or gaps in ANI values can be observed within species and thus used to advance and standardize intra-species units. By analyzing 18,123 complete isolate genomes from 330 bacterial species with at least 10 genome representatives each and available long-read metagenomes, we show that another discontinuity exists between 99.2% and 99.8% (midpoint 99.5%) ANI in most of these species. The 99.5% ANI threshold is largely consistent with how sequence types have been defined in previous epidemiological studies but provides clusters with ~20% higher accuracy in terms of evolutionary and gene-content relatedness of the grouped genomes, while strains should be consequently defined at higher ANI values (>99.99% proposed). Collectively, our results should facilitate future micro-diversity studies across clinical or environmental settings because they provide a more natural definition of intra-species units of diversity.
    IMPORTANCE Bacterial strains and clonal complexes are two cornerstone concepts for microbiology that remain loosely defined, which confuses communication and research. Here we identify a natural gap in genome sequence comparisons among isolate genomes of all well-sequenced species that has gone unnoticed so far and could be used to more accurately and precisely define these and related concepts compared to current methods. These findings advance the molecular toolbox for accurately delineating and following the important units of diversity within prokaryotic species and thus should greatly facilitate future epidemiological and micro-diversity studies across clinical and environmental settings.