Title: Data Science in Chemical Engineering: Applications to Molecular Science
Chemical engineering is being rapidly transformed by the tools of data science. On the horizon, artificial intelligence (AI) applications will impact a huge swath of our work, ranging from the discovery and design of new molecules to operations and manufacturing and many areas in between. Early adoption of data science, machine learning, and AI in chemical engineering has been rich with examples of molecular data science: the application of data science tools to molecular discovery and property optimization at the atomic scale. We summarize key advances in this nascent subfield while introducing molecular data science for a broad chemical engineering readership. We introduce the field through the concept of a molecular data science life cycle and discuss relevant aspects of five distinct phases of this process: creation of curated data sets, molecular representations, data-driven property prediction, generation of new molecules, and feasibility and synthesizability considerations.
Award ID(s):
1633216
PAR ID:
10279449
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Annual Review of Chemical and Biomolecular Engineering
Volume:
12
Issue:
1
ISSN:
1947-5438
Page Range / eLocation ID:
15 to 37
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Organic molecules and polymers have a broad range of applications in biomedical, chemical, and materials science fields. Traditional design approaches for organic molecules and polymers are mainly experimentally driven, guided by experience, intuition, and conceptual insights. Though they have been successfully applied to discover many important materials, these methods face significant challenges due to the tremendous demand for new materials and the vast design space of organic molecules and polymers. Accelerated and inverse materials design is an ideal solution to these challenges. With advancements in high-throughput computation, artificial intelligence (especially machine learning, ML), and the growth of materials databases, ML-assisted materials design is emerging as a promising tool to foster breakthroughs in many areas of materials science and engineering. To date, using ML-assisted approaches, the quantitative structure property/activity relation for material property prediction can be established more accurately and efficiently. In addition, materials design can be accelerated through ML-enabled molecular generation and inverse molecular design. In this perspective, we review recent progress in ML-guided design of organic molecules and polymers, highlight several successful examples, and examine future opportunities in biomedical, chemical, and materials science fields. We further discuss the challenges that must be solved in order to fully realize the potential of ML-assisted materials design for organic molecules and polymers. In particular, this study summarizes publicly available materials databases, feature representations for organic molecules, open-source tools for feature generation, methods for molecular generation, and ML models for prediction of material properties, which serve as a tutorial for researchers who have little prior experience with ML and want to apply it to various applications. Last but not least, it draws insights into the current limitations of ML-guided design of organic molecules and polymers. We anticipate that ML-assisted materials design for organic molecules and polymers will be a driving force in the near future, helping to meet the tremendous demand for new materials with tailored properties in different fields.
  2. Many of the greatest challenges facing society today likely have molecular solutions that await discovery. However, the process of identifying and manufacturing such molecules has remained slow and highly specialist dependent. Interfacing the fields of artificial intelligence (AI) and synthetic organic chemistry has the potential to powerfully address both limitations. The Molecule Maker Lab Institute (MMLI) brings together a team of chemists, engineers, and AI experts from the University of Illinois Urbana‐Champaign (UIUC), Pennsylvania State University, and the Rochester Institute of Technology, with the goal of accelerating the discovery, synthesis and manufacture of complex organic molecules. Advanced AI and machine learning (ML) methods are deployed in four key thrusts: (1) AI‐enabled synthesis planning, (2) AI‐enabled catalyst development, (3) AI‐enabled molecule manufacturing, and (4) AI‐enabled molecule discovery. The MMLI's new AI‐enabled synthesis platform integrates chemical and enzymatic catalysis with literature mining and ML to predict the best way to make new molecules with desirable biological and material properties. The MMLI is transforming chemical synthesis and generating use‐inspired AI advances. Simultaneously, the MMLI is also acting as a training ground for the next generation of scientists with combined expertise in chemistry and AI. Outreach efforts aimed toward high school students and the public are being used to show how AI‐enabled tools can help to make chemical synthesis accessible to nonexperts.
  3. Motivation: Properties of molecules are indicative of their functions and thus are useful in many applications. With the advances of deep-learning methods, computational approaches for predicting molecular properties are gaining increasing momentum. However, customized and advanced methods and comprehensive tools for this task are currently lacking. Results: Here, we develop a suite of comprehensive machine-learning methods and tools spanning different computational models, molecular representations and loss functions for molecular property prediction and drug discovery. Specifically, we represent molecules as both graphs and sequences. Built on these representations, we develop novel deep models for learning from molecular graphs and sequences. In order to learn effectively from highly imbalanced datasets, we develop advanced loss functions that optimize areas under precision–recall curves (PRCs) and receiver operating characteristic (ROC) curves. Altogether, our work not only serves as a comprehensive tool, but also contributes toward developing novel and advanced graph and sequence-learning methodologies. Results on both online and offline antibiotics discovery and molecular property prediction tasks show that our methods achieve consistent improvements over prior methods. In particular, our methods achieve #1 ranking in terms of both ROC-AUC (area under curve) and PRC-AUC on the AI Cures open challenge for drug discovery related to COVID-19. Availability and implementation: Our source code is released as part of the MoleculeX library (https://github.com/divelab/MoleculeX) under AdvProp. Supplementary information: Supplementary data are available at Bioinformatics online.
  4. The discovery of advanced thermal materials with exceptional phonon properties drives technological advancements, impacting innovations from electronics to superconductors. Understanding the intricate relationship between composition, structure, and phonon thermal transport properties is crucial for speeding up such discovery. Exploring innovative materials involves navigating vast design spaces and considering chemical and structural factors on multiple scales and modalities. Artificial intelligence (AI) is transforming science and engineering and is poised to transform discovery and innovation. This era offers a unique opportunity to establish a new paradigm for the discovery of advanced materials by leveraging databases, simulations, and accumulated knowledge, venturing into experimental frontiers, and incorporating cutting-edge AI technologies. In this perspective, first, the general approach of density functional theory (DFT) coupled with the phonon Boltzmann transport equation (BTE) for predicting comprehensive phonon properties will be reviewed. Then, to circumvent the extremely computationally demanding DFT + BTE approach, some early studies and progress in deploying AI/machine learning (ML) models for phonon thermal transport in the context of structure–phonon property relationship prediction will be presented, and their limitations will also be discussed. Finally, a summary of current challenges and an outlook on future trends will be given. Further development in incorporating AI/ML algorithms for phonon thermal transport could range from phonon database construction, to universal machine learning potential training, to inverse design of materials with target phonon properties, and to extending ML models beyond traditional phonons.
  5. The discovery of molecules with optimal functional properties is a central challenge across diverse fields such as energy storage, catalysis, and chemical sensing. However, molecular property optimization (MPO) remains difficult due to the combinatorial size of chemical space and the cost of acquiring property labels via simulations or wet-lab experiments. Bayesian optimization (BO) offers a principled framework for sample-efficient discovery in such settings, but its effectiveness depends critically on the quality of the molecular representation used to train the underlying probabilistic surrogate model. Existing approaches based on fingerprints, graphs, SMILES strings, or learned embeddings often struggle in low-data regimes due to high dimensionality or poorly structured latent spaces. Here, we introduce Molecular Descriptors with Actively Identified Subspaces (MolDAIS), a flexible molecular BO framework that adaptively identifies task-relevant subspaces within large descriptor libraries. Leveraging the sparse axis-aligned subspace (SAAS) prior introduced in recent BO literature, MolDAIS constructs parsimonious Gaussian process surrogate models that focus on task-relevant features as new data is acquired. In addition to validating this approach for descriptor-based MPO, we introduce two novel screening variants, which significantly reduce computational cost while preserving predictive accuracy and physical interpretability. We demonstrate that MolDAIS consistently outperforms state-of-the-art MPO methods across a suite of benchmark and real-world tasks, including single- and multi-objective optimization. Our results show that MolDAIS can identify near-optimal candidates from chemical libraries with over 100,000 molecules using fewer than 100 property evaluations, highlighting its promise as a practical tool for data-scarce molecular discovery. 
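
Item 3 above reports results as ROC-AUC and PRC-AUC on highly imbalanced datasets. As a rough illustration of what those two metrics measure, the following minimal Python sketch computes both from scratch on a small made-up example. The function names, labels, and scores are hypothetical and chosen only for illustration; this is not the loss-function machinery of the MoleculeX library, and average precision is used here as one common summary of the precision-recall curve.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the probability that a random positive outscores
    a random negative (ties count as half a win)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Average precision: mean of the precision values measured at the
    rank of each positive, with items ranked by descending score."""
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive label")
    return precision_sum / hits


# Hypothetical imbalanced example: 2 positives among 10 molecules.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.35, 0.4, 0.5, 0.6, 0.9, 0.8, 0.7]

print(roc_auc(labels, scores))            # 0.875
print(average_precision(labels, scores))  # about 0.583
```

In this toy case a single high-scoring negative lowers ROC-AUC only slightly (to 0.875) but pulls average precision down to about 0.58, which suggests why precision-recall metrics are often reported alongside ROC-AUC when positives are rare.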