The lack of publicly available, large, and unbiased datasets is a key bottleneck for the application of machine learning (ML) methods in synthetic chemistry. Data from electronic laboratory notebooks (ELNs) could provide less biased, large datasets, but no such datasets have been made publicly available. The first real-world dataset from the ELNs of a large pharmaceutical company is disclosed and its relationship to high-throughput experimentation (HTE) datasets is described. For chemical yield predictions, a key task in chemical synthesis, an attributed graph neural network (AGNN) performs as well as or better than the best previous models on two HTE datasets for the Suzuki–Miyaura and Buchwald–Hartwig reactions. However, training the AGNN on an ELN dataset does not lead to a predictive model. The implications of using ELN data for training ML-based models are discussed in the context of yield predictions.
Predictive chemistry: machine learning for reaction deployment, reaction development, and reaction discovery
The field of predictive chemistry concerns the development of models able to describe how molecules interact and react. It encompasses the long-standing task of computer-aided retrosynthesis, but is far broader and more ambitious in its goals. In this review, we summarize several areas where predictive chemistry models hold the potential to accelerate the deployment, development, and discovery of organic reactions and advance synthetic chemistry.
- Award ID(s): 2144153
- PAR ID: 10396889
- Date Published:
- Journal Name: Chemical Science
- Volume: 14
- Issue: 2
- ISSN: 2041-6520
- Page Range / eLocation ID: 226 to 244
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Learning the language of organic chemistry, i.e., how to describe reaction mechanisms, is crucial to success in any postsecondary organic chemistry course. However, it is well known that learners struggle with reasoning about and explaining reaction mechanisms beyond surface-level features. Multiple studies have sought to aid learners in developing these skills. Investigating the connections that learners make regarding reaction mechanisms through their explanations provides insight into how we can better promote the development of learners’ reasoning skills. In this study, we evaluate 20,000+ learner explanations of 90 reaction mechanisms. We use network analysis to explore patterns in the keywords used by learners and visualize the connections between those words, based on their co-occurrence, within our entire data set, by reaction type, and by level of explanation sophistication. Our results indicate that learners consistently rely on explicit surface-level features in their explanations, with expected contextual variance by reaction type. This trend persists across the levels of sophistication, although vocabulary use and coherency improve as sophistication progresses. We hypothesize that this is evidence of learners actively working toward constructing understanding as they experiment with and refine their vocabulary until they are able to pare down their explanations in a coherent manner. This work offers insights for instructors seeking to promote the development of learners’ reasoning skills and for researchers interested in the development of machine-learning models to assist in evaluating learner explanations of reaction mechanisms.
- The application of statistical modeling in organic chemistry is emerging as a standard practice for probing structure-activity relationships and as a predictive tool for many optimization objectives. This review is intended as a tutorial for those entering the area of statistical modeling in chemistry. We provide case studies to highlight the considerations and approaches that can be used to successfully analyze datasets in low-data regimes, a common situation given the experimental demands of organic chemistry. Statistical modeling hinges on the data (what is being modeled), descriptors (how data are represented), and algorithms (how data are modeled). Herein, we focus on how various reaction outputs (e.g., yield, rate, selectivity, solubility, stability, and turnover number) and data structures (e.g., binned, heavily skewed, and distributed) influence the choice of algorithm used for constructing predictive and chemically insightful statistical models.
- Polariton chemistry exploits the strong interaction between quantized excitations in molecules and quantized photon states in optical cavities to affect chemical reactivity. Molecular polaritons have been experimentally realized by the coupling of electronic, vibrational, and rovibrational transitions to photon modes, which has spurred a tremendous theoretical effort to model and explain how polariton formation can influence chemistry. This tutorial review focuses on computational approaches for the electronic strong coupling problem through the combination of familiar techniques from ab initio electronic structure theory and cavity quantum electrodynamics, toward the goal of supplying predictive theories for polariton chemistry. Our aim is to emphasize the relevant theoretical details with enough clarity for newcomers to the field to follow, and to present simple and practical code examples to catalyze further development work.
- A local-sensitivity-analysis technique is employed to generate new skeletal reaction models for methane combustion from the foundational fuel chemistry model (FFCM-1). The sensitivities of the thermo-chemical variables with respect to the reaction rates are computed via the forced-optimally time dependent (f-OTD) methodology. In this methodology, the large sensitivity matrix containing all local sensitivities is modeled as a product of two low-rank time-dependent matrices. The evolution equations of these matrices are derived from the governing equations of the system. The modeled sensitivities are computed for the auto-ignition of methane at atmospheric and high pressures with different sets of initial temperatures and equivalence ratios. These sensitivities are then analyzed to rank the most important (sensitive) species. A series of skeletal models with different numbers of species and levels of accuracy in reproducing the FFCM-1 results are suggested. The performance of the generated models is compared against FFCM-1 in predicting the ignition delay, the laminar flame speed, and the flame extinction. The results of this comparative assessment suggest that the skeletal models with 24 or more species reproduce the FFCM-1 results with excellent accuracy.