This content will become publicly available on October 8, 2026

Title: Adaptive subspace Bayesian optimization over molecular descriptor libraries for data-efficient chemical design
The discovery of molecules with optimal functional properties is a central challenge across diverse fields such as energy storage, catalysis, and chemical sensing. However, molecular property optimization (MPO) remains difficult due to the combinatorial size of chemical space and the cost of acquiring property labels via simulations or wet-lab experiments. Bayesian optimization (BO) offers a principled framework for sample-efficient discovery in such settings, but its effectiveness depends critically on the quality of the molecular representation used to train the underlying probabilistic surrogate model. Existing approaches based on fingerprints, graphs, SMILES strings, or learned embeddings often struggle in low-data regimes due to high dimensionality or poorly structured latent spaces. Here, we introduce Molecular Descriptors with Actively Identified Subspaces (MolDAIS), a flexible molecular BO framework that adaptively identifies task-relevant subspaces within large descriptor libraries. Leveraging the sparse axis-aligned subspace (SAAS) prior introduced in recent BO literature, MolDAIS constructs parsimonious Gaussian process surrogate models that focus on task-relevant features as new data is acquired. In addition to validating this approach for descriptor-based MPO, we introduce two novel screening variants, which significantly reduce computational cost while preserving predictive accuracy and physical interpretability. We demonstrate that MolDAIS consistently outperforms state-of-the-art MPO methods across a suite of benchmark and real-world tasks, including single- and multi-objective optimization. Our results show that MolDAIS can identify near-optimal candidates from chemical libraries with over 100,000 molecules using fewer than 100 property evaluations, highlighting its promise as a practical tool for data-scarce molecular discovery.
Award ID(s):
2237616
PAR ID:
10657425
Author(s) / Creator(s):
 ;  ;  
Publisher / Repository:
Royal Society of Chemistry
Date Published:
Journal Name:
Digital Discovery
Volume:
4
Issue:
10
ISSN:
2635-098X
Page Range / eLocation ID:
2910 to 2926
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Chemical engineering is being rapidly transformed by the tools of data science. On the horizon, artificial intelligence (AI) applications will impact a huge swath of our work, ranging from the discovery and design of new molecules to operations and manufacturing and many areas in between. Early adoption of data science, machine learning, and early examples of AI in chemical engineering has been rich with examples of molecular data science: the application of tools for molecular discovery and property optimization at the atomic scale. We summarize key advances in this nascent subfield while introducing molecular data science for a broad chemical engineering readership. We introduce the field through the concept of a molecular data science life cycle and discuss relevant aspects of five distinct phases of this process: creation of curated data sets, molecular representations, data-driven property prediction, generation of new molecules, and feasibility and synthesizability considerations.
  2. Bayesian optimization (BO) is an indispensable tool to optimize objective functions that either do not have known functional forms or are expensive to evaluate. Currently, optimal experimental design is always conducted within the workflow of BO, leading to more efficient exploration of the design space compared to traditional strategies. This can have a significant impact on modern scientific discovery, in particular autonomous materials discovery, which can be viewed as an optimization problem aimed at looking for the maximum (or minimum) point for the desired materials properties. The performance of BO-based experimental design depends not only on the adopted acquisition function but also on the surrogate models that help to approximate underlying objective functions. In this paper, we propose a fully autonomous experimental design framework that uses more adaptive and flexible Bayesian surrogate models in a BO procedure, namely Bayesian multivariate adaptive regression splines and Bayesian additive regression trees. They can overcome the weaknesses of widely used Gaussian process-based methods when faced with relatively high-dimensional design space or non-smooth patterns of objective functions. Both simulation studies and real-world materials science case studies demonstrate their enhanced search efficiency and robustness.
  3. The management and analysis of large in silico molecular libraries is pivotal in many areas of modern chemistry. The adoption and success of data-oriented approaches to chemical research is dependent on the ease of handling large collections of in silico molecular structures in a programmatic way. Herein, we introduce the MOLecular LIbrary toolkit, “molli”, which is a Python 3 chemoinformatics module that provides a streamlined interface for manipulating large in silico libraries. Three-dimensional, combinatorial molecule libraries can be expanded directly from two-dimensional chemical structure fragments stored in CDXML files with high stereochemical fidelity. Geometry optimization, property calculation, and conformer generation are executed by interfacing with widely used computational chemistry programs such as OpenBabel, RDKit, ORCA, and xTB/CREST. Conformer-dependent grid-based feature calculators provide numerical representations suitable for diversity analysis, and interfaces to robust three-dimensional visualization tools provide comprehensive images to enhance human understanding of libraries with thousands of members. The package includes a command-line interface in addition to Python classes to streamline frequently used workflows. This work describes the development and implementation of molli 1.0 and highlights the available functionality. Parallel performance is benchmarked on various hardware platforms, and common workflows are demonstrated for different tasks ranging from optimized grid-based descriptor calculation on catalyst libraries to an NMR prediction workflow from CDXML files.
  4. Additive manufacturing (AM) enables the fabrication of complex, highly customized geometries. However, the design and fabrication of structures with advanced functionalities, such as multistability and fail-safe mechanisms, remain challenging due to the significant time and costs required for high-fidelity simulations and iterative prototyping. In this study, we investigate the application of Bayesian Optimization (BO), an advanced machine learning framework, to accelerate the discovery of optimal AM-compatible designs with such advanced properties. BO uses a probabilistic surrogate to strategically balance the exploration of design space with few test designs and the exploitation of design space near current best-performing designs, thereby reducing the number of design simulations needed. While existing studies have demonstrated the potential of BO in AM, most have focused on static or simple designs. Here, we target multistable structures that can reconfigure among multiple stable states in response to external conditions. Since mechanical performance (e.g., strength) is configuration-dependent, our goal is to identify high-performing designs while ensuring that strength in all stable configurations exceeds a prescribed threshold for structural robustness.
  5. Current research practice for optimizing bioink involves exhaustive experimentation with multi-material compositions to determine printability, shape fidelity, and biocompatibility. Predicting bioink properties can be beneficial to the research community but is a challenging task due to the non-Newtonian behavior of complex compositions. Existing models such as the Cross model become inadequate for predicting the viscosity of heterogeneous bioink compositions. In this paper, we utilize a machine learning framework to accurately predict the viscosity of heterogeneous bioink compositions, aiming to enhance extrusion-based bioprinting techniques. Utilizing Bayesian optimization (BO), our strategy leverages a limited dataset to inform our model. This technique is especially useful given the typically sparse data in this domain. Moreover, we have also developed a mask technique that can handle complex constraints, informed by domain expertise, to define the feasible parameter space for the components of the bioink and their interactions. Our proposed method is focused on predicting the intrinsic factor (e.g. viscosity) of the bioink precursor, which is tied to the extrinsic property (e.g. cell viability) through the mask function. Through the optimization of the hyperparameter, we strike a balance between exploration of new possibilities and exploitation of known data, a balance crucial for refining our acquisition function. This function then guides the selection of subsequent sampling points within the defined viable space, and the process continues until convergence is achieved, indicating that the model has sufficiently explored the parameter space and identified the optimal or near-optimal solutions. Employing this AI-guided BO framework, we have developed, tested, and validated a surrogate model for determining the viscosity of heterogeneous bioink compositions.
This data-driven approach significantly reduces the experimental workload required to identify bioink compositions conducive to functional tissue growth. It not only streamlines the process of finding the optimal bioink compositions from a vast array of heterogeneous options but also offers a promising avenue for accelerating advancements in tissue engineering by minimizing the need for extensive experimental trials. 