
Title: Reproducing Reaction Mechanisms with Machine‐Learning Models Trained on a Large‐Scale Mechanistic Dataset
Abstract Mechanistic understanding of organic reactions can facilitate reaction development, impurity prediction, and in principle, reaction discovery. While several machine learning models have sought to address the task of predicting reaction products, their extension to predicting reaction mechanisms has been impeded by the lack of a corresponding mechanistic dataset. In this study, we construct such a dataset by imputing intermediates between experimentally reported reactants and products using expert reaction templates and train several machine learning models on the resulting dataset of 5,184,184 elementary steps. We explore the performance and capabilities of these models, focusing on their ability to predict reaction pathways and recapitulate the roles of catalysts and reagents. Additionally, we demonstrate the potential of mechanistic models in predicting impurities, often overlooked by conventional models. We conclude by evaluating the generalizability of mechanistic models to new reaction types, revealing challenges related to dataset diversity, consecutive predictions, and violations of atom conservation.
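As a minimal sketch of the template-based imputation step the abstract describes, the snippet below applies a hypothetical expert template (hydroxide addition to an ester carbonyl) with RDKit to generate one mechanistic intermediate. The template and molecules are illustrative assumptions, not the authors' template library or pipeline.

```python
# A minimal sketch, assuming RDKit and a hypothetical expert template;
# this is not the authors' template library or imputation pipeline.
from rdkit import Chem
from rdkit.Chem import AllChem

# Template: nucleophilic addition of hydroxide to the carbonyl carbon,
# producing the tetrahedral alkoxide intermediate of ester hydrolysis.
template = AllChem.ReactionFromSmarts(
    "[C:1](=[O:2])[O:3][C:4].[OH-:5]>>[C:1]([O-:2])([O:3][C:4])[O+0:5]"
)

ester = Chem.MolFromSmiles("CC(=O)OC")   # methyl acetate
hydroxide = Chem.MolFromSmiles("[OH-]")

for products in template.RunReactants((ester, hydroxide)):
    intermediate = products[0]
    Chem.SanitizeMol(intermediate)
    print(Chem.MolToSmiles(intermediate))  # the tetrahedral intermediate
```

Chaining such single-step applications from reactants toward the reported product is what yields the elementary-step sequences that a mechanistic model can then be trained on.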
Award ID(s):
2144153
PAR ID:
10539524
Author(s) / Creator(s):
Publisher / Repository:
Wiley Blackwell (John Wiley & Sons)
Date Published:
Journal Name:
Angewandte Chemie International Edition
ISSN:
1433-7851
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Sillanpää, Mikko (Ed.)
    Abstract Predicting phenotypes from a combination of genetic and environmental factors is a grand challenge of modern biology. Slight improvements in this area have the potential to save lives, improve food and fuel security, permit better care of the planet, and create other positive outcomes. In 2022 and 2023, the first open-to-the-public Genomes to Fields (G2F) initiative Genotype by Environment (GxE) prediction competition was held using a large dataset including genomic variation, phenotype and weather measurements, and field management notes, gathered by the project over nine years. The competition attracted registrants from around the world with representation from academic, government, industry, and non-profit institutions, as well as unaffiliated individuals. These participants came from diverse disciplines, including plant science, animal science, breeding, statistics, computational biology, and others. Some participants had no formal genetics or plant-related training, and some were just beginning their graduate education. The teams applied varied methods and strategies, providing a wealth of modeling knowledge based on a common dataset. The winner's strategy involved two models combining machine learning and traditional breeding tools: one model emphasized environment using features extracted by Random Forest, Ridge Regression, and least squares, and one focused on genetics. Other high-performing teams' methods included quantitative genetics, machine learning/deep learning, mechanistic models, and model ensembles. The dataset factors used, such as genetics, weather, and management data, were also diverse, demonstrating that no single model or strategy is far superior to all others within the context of this competition.
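    A minimal sketch of the general two-model idea (synthetic arrays; the data shapes and the equal-weight blend are assumptions, not the winning team's code):

```python
# Blend an environment-focused model with a genetics-focused model.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
n = 500
X_env = rng.normal(size=(n, 20))                          # weather/management features
X_gen = rng.integers(0, 3, size=(n, 1000)).astype(float)  # SNP dosages (0/1/2)
y = X_env[:, 0] + 0.01 * X_gen[:, :50].sum(axis=1) + rng.normal(scale=0.1, size=n)

# Environment-focused model and genetics-focused model, blended equally.
env_model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_env, y)
gen_model = Ridge(alpha=10.0).fit(X_gen, y)
y_hat = 0.5 * env_model.predict(X_env) + 0.5 * gen_model.predict(X_gen)
```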
  2. Waldemar Karwowski (Ed.)
    Online advertising is a billion-dollar industry, with many companies choosing online websites and various social media platforms to promote their products. The primary concerns in online marketing are to optimize the performance of a digital advert, reach the right audience, and maximize revenue, which can be achieved by accurately predicting the probability of a given ad being clicked, called the click-through rate (CTR). A high CTR suggests that the ad is reaching its target customers, while a low CTR suggests that it is not reaching its desired audience, which may translate to a low return on investment (ROI). We propose a data-science-driven approach to help businesses improve their internet advertising campaigns, which involves building various machine learning models to accurately predict the CTR and selecting the best-performing model. To build our classification models, we use the Avazu dataset, publicly available on the Kaggle website. Having insights on this metric will allow companies to compete in real-time bidding, gauge how relevant their keywords are in search engine querying, and mitigate an unexpected loss in spending budget. The authors strive to use modern machine learning tools and techniques to improve the performance of CTR prediction in online advertisements.
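    A minimal sketch of one plausible CTR baseline on this dataset (not the paper's final model; it assumes the public Avazu train.csv with an "id" column and a binary "click" label, treating all other columns as categorical):

```python
# Hashed logistic-regression baseline for click probability.
import pandas as pd
from sklearn.feature_extraction import FeatureHasher
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

df = pd.read_csv("train.csv", nrows=100_000)   # sample for speed
y = df["click"]
cats = df.drop(columns=["click", "id"]).astype(str)

# Hash "column=value" tokens so high-cardinality IDs fit a fixed space.
hasher = FeatureHasher(n_features=2**18, input_type="string")
X = hasher.transform(
    cats.apply(lambda row: [f"{c}={v}" for c, v in row.items()], axis=1)
)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```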
  3. ABSTRACT Machine‐learning models have been surprisingly successful at predicting stream solute concentrations, even for solutes without dedicated sensors. It would be extremely valuable if these models could predict solute concentrations in streams beyond the one in which they were trained. We assessed the generalisability of random forest models by training them in one or more streams and testing them in another. Models were made using grab sample and sensor data from 10 New Hampshire streams and rivers. As observed in previous studies, models trained in one stream were capable of accurately predicting solute concentrations in that stream. However, models trained on one stream produced inaccurate predictions of solute concentrations in other streams, with the exception of solutes measured by dedicated sensors (i.e., nitrate and dissolved organic carbon). Using data from multiple watersheds improved model results, but model performance was still worse than using the mean of the training dataset (Nash–Sutcliffe Efficiency < 0). Our results demonstrate that machine‐learning models thus far reliably predict solute concentrations only where trained, as differences in solute concentration patterns and sensor‐solute relationships limit their broader applicability. 
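    A minimal sketch of the cross-stream evaluation with synthetic stand-in data (real inputs would be grab-sample and sensor features): train a random forest in one stream, then score its transfer to another with Nash–Sutcliffe Efficiency.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def nse(obs, pred):
    """NSE = 1 - SSE / variance; NSE <= 0 is no better than the observed mean."""
    obs, pred = np.asarray(obs), np.asarray(pred)
    return 1.0 - np.sum((obs - pred) ** 2) / np.sum((obs - obs.mean()) ** 2)

rng = np.random.default_rng(1)
X_a, y_a = rng.normal(size=(300, 5)), rng.normal(size=300)  # training stream
X_b, y_b = rng.normal(size=(100, 5)), rng.normal(size=100)  # unseen stream

model = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_a, y_a)
print("NSE in unseen stream:", nse(y_b, model.predict(X_b)))
```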
    Abstract Sheet metal stamped and welded assemblies, such as the ones used in automotive body-in-white (BIW) structures, have various sources of manufacturing variations during stamping and assembly processes. One of the major contributors to these variations is the springback on clamping release due to elastic recovery. Mitigating these variations requires expert knowledge of mechanical behavior, tooling, and process design. No analytical models are applicable across the variety of geometries involved. Nonlinear finite element analysis (FEA) is also being used to predict springback, but it is time-consuming and requires specialized expertise, which makes it difficult to use in design exploration. Machine learning holds the promise of democratizing such complex analyses. This paper presents several case studies for data curation/generation, ML training, and validation. The prediction and quantification of the effects of springback are done on two levels: (i) low granularity, which involves predicting variations in certain parameters that are critical to measuring and understanding springback, and (ii) high granularity, predicting the shape of the component while taking into account the effects of springback and the stresses in the components. The data required to train, test, and validate the ML models were generated previously using an automated, integrated multi-stage simulation approach that was necessary to produce large datasets. Stamping simulations were validated against NUMISHEET benchmarks and also compared to test results published by other researchers. Subsequently, machine learning models were trained on the curated dataset to predict 2D stamped component shapes after springback and stress distributions across these shapes. For the assembly dataset, parameters such as unconstrained planar minimum zone magnitudes, angles between component planes, and twist angles are predicted using machine learning models, including linear and polynomial regression, decision trees, gradient boosting regression, support vector regression, and fully connected neural networks, and compared for their performance using consistent metrics. Hyperparameter tuning is performed to optimize model performance, with artificial neural networks demonstrating promising capabilities in understanding variations in forming and multi-stage assembly processes.
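    A minimal sketch of the model-comparison step on synthetic stand-in data (in the paper, the features and targets, such as twist angles, come from the stamping/assembly simulations): one tuning loop, one shared metric.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.metrics import mean_squared_error
from sklearn.linear_model import Ridge
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.svm import SVR
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 8))                                 # process parameters
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=400)  # e.g. a twist angle
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

candidates = {
    "ridge": (Ridge(), {"alpha": [0.1, 1.0, 10.0]}),
    "tree": (DecisionTreeRegressor(random_state=0), {"max_depth": [3, 5, 8]}),
    "gbr": (GradientBoostingRegressor(random_state=0), {"n_estimators": [100, 300]}),
    "svr": (SVR(), {"C": [1.0, 10.0]}),
    "mlp": (MLPRegressor(max_iter=2000, random_state=0),
            {"hidden_layer_sizes": [(64,), (64, 64)]}),
}
for name, (est, grid) in candidates.items():
    search = GridSearchCV(est, grid, cv=3).fit(X_tr, y_tr)
    rmse = np.sqrt(mean_squared_error(y_te, search.predict(X_te)))
    print(f"{name}: RMSE = {rmse:.3f}")
```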
    Abstract Physics-informed machine learning bridges the gap between the high fidelity of mechanistic models and the adaptive insights of artificial intelligence. In chemical reaction network modeling, this synergy proves valuable, addressing the high computational costs of detailed mechanistic models while leveraging the predictive power of machine learning. This study applies this fusion to the biomedical challenge of Aβ fibril aggregation, a key factor in Alzheimer's disease. Central to the research is the introduction of an automatic reaction order model reduction framework, designed to optimize reduced-order kinetic models. This framework represents a shift in model construction, automatically determining the appropriate level of detail for reaction network modeling. The proposed approach significantly improves simulation efficiency and accuracy, particularly in systems like Aβ aggregation, where precise modeling of nucleation and growth kinetics can reveal potential therapeutic targets. Additionally, the automatic model reduction technique has the potential to generalize to other network models. The methodology offers a scalable and adaptable tool for applications beyond biomedical research. Its ability to dynamically adjust model complexity based on system-specific needs ensures that models remain both computationally feasible and scientifically relevant, accommodating new data and evolving understandings of complex phenomena.
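    A toy sketch of the order-selection idea (the rate law, candidate orders, and logistic stand-in data are all assumptions, not the paper's framework): fit a reduced-order kinetic model at each candidate reaction order and keep the order with the lowest fitting error.

```python
import numpy as np
from scipy.integrate import solve_ivp
from scipy.optimize import minimize_scalar

t_obs = np.linspace(0.0, 10.0, 50)
data = 1.0 / (1.0 + np.exp(-(t_obs - 5.0)))   # stand-in fibril-mass curve

def simulate(n, k):
    """Fibril mass M(t) from dM/dt = k * (1 - M)^n * (M + 1e-3)."""
    rhs = lambda t, M: k * (1.0 - M[0]) ** n * (M[0] + 1e-3)
    return solve_ivp(rhs, (0.0, 10.0), [0.0], t_eval=t_obs).y[0]

def fit(n):
    """Best-fit rate constant and squared error for reaction order n."""
    return minimize_scalar(
        lambda k: np.sum((simulate(n, k) - data) ** 2),
        bounds=(0.01, 50.0), method="bounded",
    )

results = {n: fit(n) for n in (1, 2, 3)}
best_n = min(results, key=lambda n: results[n].fun)
print(f"selected order n={best_n}, rate k={results[best_n].x:.2f}")
```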