Abstract Machine learning (ML) has become a central focus of the computational chemistry community. I will first discuss my personal history in the field. Then I will provide a broader view of how this resurgence in ML interest echoes and advances upon earlier efforts. Although numerous changes have brought about this latest wave, one of the most significant is the increased accuracy and efficiency of low‐cost methods (e.g., density functional theory, or DFT) that have made it possible to generate large data sets for ML models. ML has also been used to bypass, guide, or improve DFT. The field of computational chemistry thus finds itself at a crossroads as ML both augments and supersedes traditional efforts. I will present what I believe the role of the computational chemist will be in this evolving landscape, with specific focus on my experience in the development of autonomous workflows in computational materials discovery for open‐shell transition‐metal chemistry.
Using experimental data in computationally guided rational design of inorganic materials with machine learning
Abstract While the impact of machine learning (ML) has been felt everywhere, its effect has been most transformative where large, high-quality datasets are available. For promising materials spaces, such as transition metal coordination complexes and metal–organic frameworks, the large chemical diversity has not yet been matched by similarly large datasets, and computational datasets (e.g., from density functional theory) may not be predictive. Extraction of experimental data from the literature represents an alternative approach to the data-driven design of materials. This perspective will describe efforts in (i) extracting experimental data; (ii) associating extracted data with known chemical structures; (iii) leveraging data in ML and screening; (iv) designing materials with enriched stability; and (v) using experimental data to improve high-throughput workflows. I will summarize some of the outstanding challenges and opportunities for data enrichment with high-throughput experimentation and large language models.
- Award ID(s): 1846426
- PAR ID: 10588737
- Publisher / Repository: Cambridge University Press (CUP)
- Date Published:
- Journal Name: Journal of Materials Research
- Volume: 40
- Issue: 6
- ISSN: 0884-2914
- Format(s): Medium: X; Size: p. 833-848
- Size(s): p. 833-848
- Sponsoring Org: National Science Foundation
More Like this
-
The lack of publicly available, large, and unbiased datasets is a key bottleneck for the application of machine learning (ML) methods in synthetic chemistry. Data from electronic laboratory notebooks (ELNs) could provide less biased, large datasets, but no such datasets have been made publicly available. The first real-world dataset from the ELNs of a large pharmaceutical company is disclosed and its relationship to high-throughput experimentation (HTE) datasets is described. For chemical yield predictions, a key task in chemical synthesis, an attributed graph neural network (AGNN) performs as well as or better than the best previous models on two HTE datasets for the Suzuki–Miyaura and Buchwald–Hartwig reactions. However, training the AGNN on an ELN dataset does not lead to a predictive model. The implications of using ELN data for training ML-based models are discussed in the context of yield predictions.
-
Abstract As machine learning (ML) has matured, it has opened a new frontier in theoretical and computational chemistry by offering the promise of simultaneous paradigm shifts in accuracy and efficiency. Nowhere is this advance more needed, but also more challenging to achieve, than in the discovery of open‐shell transition metal complexes. Here, localized d or f electrons exhibit variable bonding that is challenging to capture even with the most computationally demanding methods. Thus, despite great promise, clear obstacles remain in constructing ML models that can supplement or even replace explicit electronic structure calculations. In this article, I outline the recent advances in building ML models in transition metal chemistry, including the ability to approach sub‐kcal/mol accuracy on a range of properties with tailored representations, to discover and enumerate complexes in large chemical spaces, and to reveal opportunities for design through analysis of feature importance. I discuss unique considerations that have been essential to enabling ML in open‐shell transition metal chemistry, including (a) the relationship of data set size/diversity, model complexity, and representation choice, (b) the importance of quantitative assessments of both theory and model domain of applicability, and (c) the need to enable autonomous generation of reliable, large data sets both for ML model training and in active learning or discovery contexts. Finally, I summarize the next steps toward making ML a mainstream tool in the accelerated discovery of transition metal complexes. This article is categorized under: Electronic Structure Theory > Density Functional Theory; Software > Molecular Modeling; Computer and Information Science > Chemoinformatics
-
Abstract Motivation: Despite its great success in various physical modeling, differential geometry (DG) has rarely been devised as a versatile tool for analyzing large, diverse, and complex molecular and biomolecular datasets because of the limited understanding of its potential power in dimensionality reduction and its ability to encode essential chemical and biological information in differentiable manifolds. Results: We put forward a differential geometry‐based geometric learning (DG‐GL) hypothesis that the intrinsic physics of three‐dimensional (3D) molecular structures lies on a family of low‐dimensional manifolds embedded in a high‐dimensional data space. We encode crucial chemical, physical, and biological information into 2D element interactive manifolds, extracted from a high‐dimensional structural data space via a multiscale discrete‐to‐continuum mapping using differentiable density estimators. Differential geometry apparatuses are utilized to construct element interactive curvatures in analytical forms for certain analytically differentiable density estimators. These low‐dimensional differential geometry representations are paired with a robust machine learning algorithm to showcase their descriptive and predictive powers for large, diverse, and complex molecular and biomolecular datasets. Extensive numerical experiments are carried out to demonstrate that the proposed DG‐GL strategy outperforms other advanced methods in the predictions of drug discovery‐related protein‐ligand binding affinity, drug toxicity, and molecular solvation free energy. Availability and implementation: http://weilab.math.msu.edu/DG‐GL/ Contact: wei@math.msu.edu
-
Abstract Efficient separation of C2H4/C2H6 mixtures is of paramount importance in the petrochemical industry. Nanoporous materials, especially metal-organic frameworks (MOFs), may serve the purpose owing to their tailorable structures and pore geometries. In this work, we propose a computational framework for high-throughput screening and inverse design of high-performance MOFs for adsorption and membrane processes. High-throughput screening of the computational-ready, experimental (CoRE 2019) MOF database leads to materials with exceptionally high ethane-selective adsorption selectivity (LUDLAZ: 7.68) and ethene-selective membrane selectivity (EBINUA02: 2167.3). Moreover, the inverse design enables the exploration of broader chemical space and identification of MOF structures with even higher membrane selectivity and permeability. In addition, a relative membrane performance score (rMPS) has been formulated to evaluate the overall membrane performance relative to the Robeson boundary. The computational framework offers guidelines for the design of MOFs and is generically applicable to materials discovery for gas storage and separation.
