
Title: BOKEI: Bayesian optimization using knowledge of correlated torsions and expected improvement for conformer generation
A key challenge in conformer sampling is finding low-energy conformations with a small number of energy evaluations. We recently demonstrated that the Bayesian Optimization Algorithm (BOA) is an effective method for finding the lowest-energy conformation of a small molecule. Our approach balances exploitation and exploration, and is more efficient than exhaustive or random search methods. Here, we extend strategies used on proteins and oligopeptides (e.g. Ramachandran plots of secondary structure) and study correlated torsions in small molecules. We use bivariate von Mises distributions to capture these correlations, and use them to constrain the search space. We validate the performance of our new method, Bayesian Optimization with Knowledge-based Expected Improvement (BOKEI), on a dataset of 533 diverse small molecules, using (i) a force field (MMFF94) and (ii) a semi-empirical method (GFN2) as the objective function. We compare the search performance of BOKEI, BOA with Expected Improvement (BOA-EI), and a genetic algorithm (GA), using a fixed number of energy evaluations. In more than 60% of the cases examined, BOKEI finds lower-energy conformations than global optimization with BOA-EI or GA. More importantly, we find correlated torsions in up to 15% of small molecules in larger data sets, up to 8 times more often than previously reported. The BOKEI patterns not only describe steric clashes, but also reflect favorable intramolecular interactions such as hydrogen bonds and π–π stacking. Increasing our understanding of the conformational preferences of molecules will help improve our ability to find low-energy conformers efficiently, which will have an impact on a wide range of computational modeling applications.
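The search strategy outlined in the abstract pairs a Gaussian-process surrogate and an expected-improvement acquisition with a knowledge-based density over correlated torsion pairs. The short Python sketch below illustrates this idea only, under simplifying assumptions: the Gaussian-process settings, the bivariate von Mises parameters, and the candidate/energy arrays are illustrative placeholders, not the authors' implementation.

    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor

    def bivariate_von_mises(phi, psi, mu1=0.0, mu2=0.0, k1=2.0, k2=2.0, lam=1.0):
        # Unnormalized cosine-model density capturing the correlation between two torsions.
        return np.exp(k1 * np.cos(phi - mu1) + k2 * np.cos(psi - mu2)
                      - lam * np.cos(phi - mu1 - psi + mu2))

    def expected_improvement(mu, sigma, best):
        # EI for minimization: expected improvement over the best energy seen so far.
        z = (best - mu) / np.maximum(sigma, 1e-9)
        return (best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

    def knowledge_weighted_step(X_evaluated, energies, candidates, torsion_pair=(0, 1)):
        # X_evaluated: torsion sets already scored (radians); energies: their energies.
        gp = GaussianProcessRegressor(normalize_y=True).fit(X_evaluated, energies)
        mu, sigma = gp.predict(candidates, return_std=True)
        ei = expected_improvement(mu, sigma, energies.min())
        phi, psi = candidates[:, torsion_pair[0]], candidates[:, torsion_pair[1]]
        weight = bivariate_von_mises(phi, psi)      # knowledge-based term for the pair
        return candidates[np.argmax(ei * weight)]   # next torsion set to evaluate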
Award ID(s):
1800435
PAR ID:
10177038
Author(s) / Creator(s):
Date Published:
Journal Name:
Physical Chemistry Chemical Physics
Volume:
22
Issue:
9
ISSN:
1463-9076
Page Range / eLocation ID:
5211 to 5219
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The design of machine learning systems often requires trading off different objectives, for example, prediction error and energy consumption for deep neural networks (DNNs). Typically, no single design performs well in all objectives; therefore, finding Pareto-optimal designs is of interest. The search for Pareto-optimal designs involves evaluating designs in an iterative process, and the measurements are used to evaluate an acquisition function that guides the search process. However, measuring different objectives incurs different costs. For example, the cost of measuring the prediction error of DNNs is orders of magnitude higher than that of measuring the energy consumption of a pre-trained DNN, as it requires re-training the DNN. Current state-of-the-art methods do not consider this difference in objective evaluation cost, potentially incurring expensive evaluations of objective functions in the optimization process. In this paper, we develop a novel decoupled and cost-aware multi-objective optimization algorithm, which we call Flexible Multi-Objective Bayesian Optimization (FlexiBO), to address this issue. For each design evaluation, FlexiBO selects the objective with the higher relative gain by weighting the improvement of the hypervolume of the Pareto region with the measurement cost of each objective. This strategy therefore balances the expense of collecting new information with the knowledge gained through objective evaluations, preventing FlexiBO from performing expensive measurements for little to no gain. We evaluate FlexiBO on seven state-of-the-art DNNs for image recognition, natural language processing (NLP), and speech-to-text translation. Our results indicate that, given the same total experimental budget, FlexiBO discovers designs with 4.8% to 12.4% lower hypervolume error than the best method in state-of-the-art multi-objective optimization.
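A minimal sketch of the decoupled, cost-aware selection rule described above: at each step, only the objective whose estimated hypervolume gain per unit measurement cost is largest is evaluated. The gain estimates and cost figures are hypothetical placeholders, not values from the paper.

    def select_objective(hv_gain_estimates, measurement_costs):
        # Pick the objective with the largest estimated hypervolume gain per unit cost.
        return max(hv_gain_estimates,
                   key=lambda k: hv_gain_estimates[k] / measurement_costs[k])

    # Example: prediction error is costly to measure (requires re-training); energy is cheap.
    chosen = select_objective({"prediction_error": 0.12, "energy": 0.05},
                              {"prediction_error": 3600.0, "energy": 2.0})
    # chosen == "energy": the cheap measurement offers more hypervolume gain per unit cost.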
  2. The Pareto-optimal frontier for a bi-objective search problem instance consists of all solutions that are not worse than any other solution in both objectives. The size of the Pareto-optimal frontier can be exponential in the size of the input graph, and hence finding it can be hard. Some existing works leverage a user-specified approximation factor ε to compute an approximate Pareto-optimal frontier that can be significantly smaller than the Pareto-optimal frontier. In this paper, we propose an anytime approximate bi-objective search algorithm, called Anytime Bi-Objective A*-ε (A-BOA*ε). A-BOA*ε is useful when deliberation time is limited. It first finds an approximate Pareto-optimal frontier quickly, iteratively improves it while time allows, and eventually finds the Pareto-optimal frontier. It efficiently reuses the search effort from previous iterations and makes use of a novel pruning technique. Our experimental results show that A-BOA*ε substantially outperforms baseline algorithms that do not reuse previous search effort, both in terms of runtime and number of node expansions. In fact, the most advanced variant of A-BOA*ε even slightly outperforms BOA*, a state-of-the-art bi-objective search algorithm, for finding the Pareto-optimal frontier. Moreover, given only a limited amount of deliberation time, A-BOA*ε finds solutions that collectively approximate the Pareto-optimal frontier much better than the solutions found by BOA*. 
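The user-specified factor ε enters through an ε-dominance test: a candidate can be discarded when an already-found solution is within a factor of (1 + ε) of it on both objectives. The sketch below shows only this test with illustrative numbers; it is not the paper's pruning technique in full.

    def eps_dominated(candidate, frontier, eps):
        # True if some stored solution is within (1 + eps) of the candidate on both objectives.
        c1, c2 = candidate
        return any(s1 <= (1 + eps) * c1 and s2 <= (1 + eps) * c2 for s1, s2 in frontier)

    frontier = [(10, 40), (20, 25), (35, 15)]           # (cost1, cost2) pairs, both minimized
    print(eps_dominated((11, 41), frontier, eps=0.1))   # True: (10, 40) eps-dominates it
    print(eps_dominated((9, 14), frontier, eps=0.1))    # False: it would extend the frontier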
  3. Optimizing a black-box function that is expensive to evaluate emerges in a gamut of machine learning and artificial intelligence applications, including drug discovery, policy optimization in robotics, and hyperparameter tuning of learning models, to list a few. Bayesian optimization (BO) provides a principled framework to find the global optimum of such functions using a limited number of function evaluations. BO relies on a statistical surrogate model to actively select new query points; this surrogate is typically a Gaussian process (GP). Unlike most existing approaches that hinge on a single GP surrogate model with a pre-selected kernel function, which may confine the expressiveness of the sought function especially under a limited evaluation budget, the present work puts forth a weighted ensemble of GPs as a surrogate model. Building on the advocated Gaussian mixture (GM) posterior, the EGP framework adapts to the best-fitted surrogate model as data arrive on the fly, offering a richer function space. For the acquisition of the next evaluation points, the EGP-based posterior is coupled with an adaptive expected improvement (EI) criterion to balance exploration and exploitation of the search space. Numerical tests on a set of benchmark synthetic functions and two robotic tasks demonstrate the impressive benefits of the proposed approach.
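A simplified sketch of the weighted-ensemble idea: fit several GPs with different kernels, weight each by its (softmaxed) log marginal likelihood, and average their expected-improvement scores. The kernel choices and weighting scheme here are assumptions made for illustration, not the EGP implementation itself.

    import numpy as np
    from scipy.stats import norm
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF, Matern, RationalQuadratic

    def ensemble_ei_query(X, y, candidates):
        # Fit one GP per candidate kernel and weight them by log marginal likelihood.
        gps = [GaussianProcessRegressor(kernel=k, normalize_y=True).fit(X, y)
               for k in (RBF(), Matern(nu=2.5), RationalQuadratic())]
        log_ml = np.array([gp.log_marginal_likelihood_value_ for gp in gps])
        w = np.exp(log_ml - log_ml.max())
        w /= w.sum()                                    # ensemble (mixture) weights
        best, score = y.min(), np.zeros(len(candidates))
        for wi, gp in zip(w, gps):
            mu, sd = gp.predict(candidates, return_std=True)
            z = (best - mu) / np.maximum(sd, 1e-9)
            score += wi * ((best - mu) * norm.cdf(z) + sd * norm.pdf(z))
        return candidates[np.argmax(score)]             # next query point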
  4. Symbolic regression (SR) can generate interpretable, concise expressions that fit a given dataset, allowing for more human understanding of the structure than black-box approaches. The addition of background knowledge (in the form of symbolic mathematical constraints) allows for the generation of expressions that are meaningful with respect to theory while also being consistent with data. We specifically examine the addition of constraints to traditional genetic algorithm (GA) based SR (PySR) as well as a Markov chain Monte Carlo (MCMC) based Bayesian SR architecture (Bayesian Machine Scientist), and apply these to rediscovering adsorption equations from experimental, historical datasets. We find that, while hard constraints prevent GA and MCMC SR from searching, soft constraints can lead to improved performance both in terms of search effectiveness and model meaningfulness, with computational costs increasing by about an order of magnitude. If the constraints do not correlate well with the dataset or expected models, they can hinder the search for expressions. We find that incorporating these constraints in Bayesian SR (as the Bayesian prior) is better than modifying the fitness function in the GA.
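The soft-constraint idea can be pictured as adding a penalty term to the data-fit loss rather than rejecting violating expressions outright. In the sketch below, the specific constraint (adsorption non-decreasing in pressure) and the penalty weight are illustrative assumptions, not the constraints used in the paper.

    import numpy as np

    def soft_constrained_loss(expr_predict, P, q_observed, weight=10.0):
        # expr_predict: a candidate symbolic expression, callable on an array of pressures.
        data_loss = np.mean((expr_predict(P) - q_observed) ** 2)
        # Soft penalty: count monotonicity violations on a dense pressure grid.
        grid = np.linspace(P.min(), P.max(), 200)
        violations = np.sum(np.diff(expr_predict(grid)) < 0)
        return data_loss + weight * violations          # penalized, not rejected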
  5. In the wake of the terrorist attacks of 11 September 2001, extensive research efforts have been dedicated to the development of computational algorithms for identifying contamination sources in water distribution systems (WDSs). Previous studies have extensively relied on evolutionary optimization techniques, which require the simulation of numerous contamination scenarios in order to solve the inverse-modeling contamination source identification (CSI) problem. This study presents a novel framework for CSI in WDSs using Bayesian optimization (BO) techniques. By constructing an explicit acquisition function to balance exploration with exploitation, BO requires only a few evaluations of the objective function to converge to near-optimal solutions, enabling CSI in real time. The presented framework couples BO with EPANET to reveal the most likely contaminant injection/intrusion scenarios by minimizing the error between simulated and measured concentrations at a given number of water quality monitoring locations. The framework was tested on two benchmark WDSs under different contamination injection scenarios, and the algorithm successfully revealed the characteristics of the contamination source(s), i.e., the location, pattern, and concentration, for all scenarios. A sensitivity analysis was conducted to evaluate the performance of the framework using various BO techniques, including two different surrogate models, Gaussian Processes (GPs) and Random Forest (RF), and three different acquisition functions, namely expected improvement (EI), probability of improvement (PI), and upper confidence bound (UCB). The results revealed that BO with the RF surrogate model and the UCB acquisition function produced the most efficient and reliable CSI performance.
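A minimal sketch of the best-performing combination reported above, a random-forest surrogate with a confidence-bound acquisition, assuming a pre-computed error value for each tried scenario in place of the EPANET coupling; the scenario encoding and the kappa value are illustrative, not the paper's settings.

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def next_scenario(X_tried, errors, candidates, kappa=2.0):
        # X_tried: encoded (node, start time, strength) scenarios already simulated;
        # errors: mismatch between simulated and measured sensor concentrations.
        rf = RandomForestRegressor(n_estimators=200).fit(X_tried, errors)
        per_tree = np.stack([tree.predict(candidates) for tree in rf.estimators_])
        mu, sd = per_tree.mean(axis=0), per_tree.std(axis=0)
        # The error is minimized, so the confidence-bound rule favors the candidate
        # with the smallest optimistic (lower) bound.
        return candidates[np.argmin(mu - kappa * sd)]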