Title: Assessing conformer energies using electronic structure and machine learning methods
Abstract

We have performed a large‐scale evaluation of current computational methods, including conventional small‐molecule force fields; semiempirical, density functional, ab initio electronic structure methods; and current machine learning (ML) techniques to evaluate relative single‐point energies. Using up to 10 local minima geometries across ~700 molecules, each optimized by B3LYP‐D3BJ with single‐point DLPNO‐CCSD(T) triple‐zeta energies, we consider over 6500 single points to compare the correlation between different methods for both relative energies and ordered rankings of minima. We find that the current ML methods have potential and recommend methods at each tier of the accuracy‐time tradeoff, particularly the recent GFN2 semiempirical method, the B97‐3c density functional approximation, and RI‐MP2 for accurate conformer energies. The ANI family of ML methods shows promise, particularly the ANI‐1ccx variant trained in part on coupled‐cluster energies. Multiple methods suggest continued improvements should be expected in both performance and accuracy.
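The comparison described above rests on two per-molecule statistics: how well a method's relative conformer energies correlate with the reference, and how well it reproduces the energy ordering of the minima. The following is a minimal sketch of both, not the authors' analysis code. The energy values are hypothetical, and the helper names (`relative_energies`, `pearson`, `ranks`, `spearman`) are illustrative assumptions. Only the Python standard library is used.

```python
from math import sqrt


def relative_energies(energies):
    """Shift energies so the lowest-energy conformer sits at zero."""
    lowest = min(energies)
    return [e - lowest for e in energies]


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy)


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical energies (kcal/mol) for four conformers of one molecule:
# a high-level reference and a cheaper candidate method.
reference = [-100.0, -99.2, -98.5, -97.9]
candidate = [-50.0, -49.6, -49.7, -48.0]

ref_rel = relative_energies(reference)
cand_rel = relative_energies(candidate)

r_squared = pearson(ref_rel, cand_rel) ** 2   # agreement of relative energies
rho = spearman(ref_rel, cand_rel)             # agreement of conformer ordering
print(f"R^2 = {r_squared:.3f}, Spearman rho = {rho:.3f}")
```

In this toy case the candidate swaps the second and third conformers, so the rank correlation falls below 1 even though the overall energy trend is preserved. In a study of this kind, such statistics would be computed per molecule and then summarized across the whole set.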

 
Award ID(s):
1800435
NSF-PAR ID:
10453347
Author(s) / Creator(s):
 ;  
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
International Journal of Quantum Chemistry
Volume:
121
Issue:
1
ISSN:
0020-7608
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Modern semiempirical electronic structure methods have considerable promise in drug discovery as universal “force fields” that can reliably model biological and drug-like molecules, including alternative tautomers and protonation states. Herein, we compare the performance of several neglect of diatomic differential overlap-based semiempirical (MNDO/d, AM1, PM6, PM6-D3H4X, PM7, and ODM2), density-functional tight-binding based (DFTB3, DFTB/ChIMES, GFN1-xTB, and GFN2-xTB) models with pure machine learning potentials (ANI-1x and ANI-2x) and hybrid quantum mechanical/machine learning potentials (AIQM1 and QDπ) for a wide range of data computed at a consistent ωB97X/6-31G* level of theory (as in the ANI-1x database). This data includes conformational energies, intermolecular interactions, tautomers, and protonation states. Additional comparisons are made to a set of natural and synthetic nucleic acids from the artificially expanded genetic information system that has important implications for the design of new biotechnology and therapeutics. Finally, we examine the acid/base chemistry relevant for RNA cleavage reactions catalyzed by small nucleolytic ribozymes, DNAzymes, and ribonucleases. Overall, the hybrid quantum mechanical/machine learning potentials appear to be the most robust for these datasets, and the recently developed QDπ model performs exceptionally well, having especially high accuracy for tautomers and protonation states relevant to drug discovery.
  2. Abstract

    Maximum diversification of data is a central theme in building generalized and accurate machine learning (ML) models. In chemistry, ML has been used to develop models for predicting molecular properties, for example quantum mechanics (QM) calculated potential energy surfaces and atomic charge models. The ANI-1x and ANI-1ccx ML-based general-purpose potentials for organic molecules were developed through active learning, an automated data diversification process. Here, we describe the ANI-1x and ANI-1ccx data sets. To demonstrate data diversity, we visualize it with a dimensionality reduction scheme, and contrast against existing data sets. The ANI-1x data set contains multiple QM properties from 5 million density functional theory calculations, while the ANI-1ccx data set contains 500,000 data points obtained with an accurate CCSD(T)/CBS extrapolation. Approximately 14 million CPU core-hours were expended to generate this data. Multiple QM calculated properties for the chemical elements C, H, N, and O are provided: energies, atomic forces, multipole moments, atomic charges, etc. We provide this data to the community to aid research and development of ML models for chemistry.

     
  3. Abstract

    The global minima of urea and thiourea were characterized along with other low‐lying stationary points. Each structure was optimized with the CCSD(T) method and triple‐ζ correlation consistent basis sets followed by harmonic vibrational frequency computations. Relative energies evaluated near the complete basis set limit with both canonical and explicitly correlated CCSD(T) techniques reveal several subtle but important details about both systems. These computations resolve a discrepancy by demonstrating that the electronic energy of the C2v second‐order saddle point of urea lies at least 1.5 kcal mol−1 above the C2 global minimum regardless of whether the structures were optimized with MP2, CCSD, or CCSD(T). Additionally, urea effectively has one minimum instead of two because the electronic barrier for inversion at one amino group in the Cs local minimum vanishes at the CCSD(T) CBS limit. Characterization of both systems with the same ab initio methods and large basis sets conclusively establishes that the electronic barriers to inversion at one or both NH2 groups in thiourea are appreciably smaller than in urea. CCSDT(Q)/cc‐pVTZ computations show higher‐order electron correlation effects have little impact on the relative energies and are consistently offset by core correlation effects of opposite sign and comparable magnitude.

     
  4. Abstract

    We have carried out a large‐scale computational investigation to assess the utility of common small‐molecule force fields for computational screening of low‐energy conformers of typical organic molecules. Using statistical analyses on the energies and relative rankings of up to 250 diverse conformers of 700 different molecular structures, we find that energies from widely used classical force fields (MMFF94, UFF, and GAFF) show unconditionally poor energy and rank correlation with semiempirical (PM7) and Kohn–Sham density functional theory (DFT) energies calculated at PM7 and DFT optimized geometries. In contrast, semiempirical PM7 calculations show significantly better correlation with DFT calculations and generally better geometries. With these results, we make recommendations to more reliably carry out conformer screening.

     
  5.
    Introduction: Vaso-occlusive crises (VOCs) are a leading cause of morbidity and early mortality in individuals with sickle cell disease (SCD). These crises are triggered by sickle red blood cell (sRBC) aggregation in blood vessels and are influenced by factors such as enhanced sRBC and white blood cell (WBC) adhesion to inflamed endothelium. Advances in microfluidic biomarker assays (i.e., SCD Biochip systems) have led to clinical studies of blood cell adhesion onto endothelial proteins, including fibronectin, laminin, P-selectin, and ICAM-1, functionalized in microchannels. These microfluidic assays allow mimicking the physiological aspects of human microvasculature and help characterize biomechanical properties of adhered sRBCs under flow. However, analysis of the microfluidic biomarker assay data has so far relied on manual cell counting and exhaustive visual morphological characterization of cells by trained personnel. Integrating deep learning algorithms with microscopic imaging of adhesion protein functionalized microfluidic channels can accelerate and standardize accurate classification of blood cells in microfluidic biomarker assays. Here we develop a deep learning approach into a general-purpose analytical tool covering a wide range of conditions: channels functionalized with different proteins (laminin or P-selectin), with varying degrees of adhesion by both sRBCs and WBCs, and in both normoxic and hypoxic environments. Methods: Our neural networks were trained on a repository of manually labeled SCD Biochip microfluidic biomarker assay whole channel images. Each channel contained adhered cells pertaining to clinical whole blood under constant shear stress of 0.1 Pa, mimicking physiological levels in post-capillary venules. 
The machine learning (ML) framework consists of two phases: Phase I segments pixels belonging to blood cells adhered to the microfluidic channel surface, while Phase II associates pixel clusters with specific cell types (sRBCs or WBCs). Phase I is implemented through an ensemble of seven generative fully convolutional neural networks, and Phase II is an ensemble of five neural networks based on a Resnet50 backbone. Each pixel cluster is given a probability of belonging to one of three classes: adhered sRBC, adhered WBC, or non-adhered / other. Results and Discussion: We applied our trained ML framework to 107 novel whole channel images not used during training and compared the results against counts from human experts. As seen in Fig. 1A, there was excellent agreement in counts across all protein and cell types investigated: sRBCs adhered to laminin, sRBCs adhered to P-selectin, and WBCs adhered to P-selectin. Not only was the approach able to handle surfaces functionalized with different proteins, but it also performed well for high cell density images (up to 5000 cells per image) in both normoxic and hypoxic conditions (Fig. 1B). The average uncertainty for the ML counts, obtained from accuracy metrics on the test dataset, was 3%. This uncertainty is a significant improvement on the 20% average uncertainty of the human counts, estimated from the variance in repeated manual analyses of the images. Moreover, manual classification of each image may take up to 2 hours, versus about 6 minutes per image for the ML analysis. Thus, ML provides greater consistency in the classification at a fraction of the processing time. To assess which features the network used to distinguish adhered cells, we generated class activation maps (Fig. 1C-E). These heat maps indicate the regions of focus for the algorithm in making each classification decision. 
Intriguingly, the highlighted features were similar to those used by human experts: the dimple in partially sickled RBCs, the sharp endpoints for highly sickled RBCs, and the uniform curvature of the WBCs. Overall, the robust performance of the ML approach in our study sets the stage for generalizing it to other endothelial proteins and experimental conditions, a first step toward a universal microfluidic ML framework targeting blood disorders. Such a framework would not only be able to integrate advanced biophysical characterization into fast, point-of-care diagnostic devices, but also provide a standardized and reliable way of monitoring patients undergoing targeted therapies and curative interventions, including stem cell and gene-based therapies for SCD. Disclosures Gurkan: Dx Now Inc.: Patents & Royalties; Xatek Inc.: Patents & Royalties; BioChip Labs: Patents & Royalties; Hemex Health, Inc.: Consultancy, Current Employment, Patents & Royalties, Research Funding.