This content will become publicly available on February 14, 2025
- Award ID(s):
- 2154428
- PAR ID:
- 10408410
- Publisher / Repository:
- RSC
- Date Published:
- Journal Name:
- Digital Discovery
- Volume:
- 3
- Issue:
- 2
- ISSN:
- 2635-098X
- Page Range / eLocation ID:
- DOI: 10.26434/chemrxiv-2023-0zv2f-v2
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Generating molecules with desired chemical and biological properties, such as high drug-likeness and high binding affinity to target proteins, is critical for drug discovery. In this paper, we propose a probabilistic generative model to capture the joint distribution of molecules and their properties. Our model assumes an energy-based model (EBM) in the latent space. Conditional on the latent vector, the molecule and its properties are modeled by a molecule generation model and a property regression model, respectively. To search for molecules with desired properties, we propose a sampling with gradual distribution shifting (SGDS) algorithm: after the model is initially learned on training data of existing molecules and their properties, the algorithm gradually shifts the model distribution towards the region supported by molecules with the desired property values. Our experiments show that our method achieves strong performance on various molecule design tasks.
-
Generative models are a sub-class of machine learning models that are capable of generating new samples with a target set of properties. In chemical and materials applications, these new samples might be drug targets, novel semiconductors, or catalysts constrained to exhibit an application-specific set of properties. Given their potential to yield high-value targets from otherwise intractable design spaces, generative models are currently under intense study with respect to how predictions can be improved through changes in model architecture and data representation. Here we explore the potential of multi-task transfer learning as a complementary approach to improving the validity and property specificity of molecules generated by such models. We have compared baseline generative models trained on a single property prediction task against models trained on additional ancillary prediction tasks and observe a generic positive impact on the validity and specificity of the multi-task models. In particular, we observe that the validity of generated structures is strongly affected by whether or not the models have chemical property data, as opposed to only syntactic structural data, supplied during learning. We demonstrate this effect in both interpolative and extrapolative scenarios (i.e., where the generative targets are poorly represented in training data) for models trained to generate high energy structures and models trained to generate structures with targeted bandgaps within certain ranges. In both instances, the inclusion of additional chemical property data improves the ability of models to generate valid, unique structures with increased property specificity. This approach requires only minor alterations to existing generative models, in many cases leveraging prediction frameworks already native to these models. Additionally, the transfer learning strategy is complementary to ongoing efforts to improve model architectures and data representation and can foreseeably be stacked on top of these developments.
-
Despite its simplicity, the composition of a material can be used as input to machine learning models to predict a range of materials properties. However, many property optimization tasks require the generation of novel but realistic materials compositions. In this study, we describe a way to generate compositions of hybrid organic–inorganic crystals by adapting Augmented CycleGAN, a novel generative model that can learn many-to-many relations between two domains. Specifically, we investigate the problem of composition change upon amine swap: for a specific chemical system (set of elements) crystallized with amine A, how would the product chemical compositions change if it were crystallized with amine B? By training with limited data from the Cambridge Structural Database, our model can generate realistic chemical compositions for hybrid crystalline materials. The Augmented CycleGAN model can also utilize abundant unpaired data (compositions of different chemical systems), a feature that traditional supervised methods lack. The generated compositions can be used for many tasks, for example, as input to a classifier that predicts structural dimensionality.
-
Inverse molecular generation is an essential task for drug discovery, and generative models offer a very promising avenue, especially when diffusion models are used. Despite their great success, existing methods are inherently limited by the lack of a semantic latent space that can be navigated to perform targeted exploration and generate molecules with desired properties. Here, we present a property-guided diffusion model for generating desired molecules, which incorporates a sophisticated diffusion process capturing intricate interactions of nodes and edges within molecular graphs and leverages a time-dependent molecular property classifier to integrate desired properties into the diffusion sampling process. Furthermore, we extend our model to a multi-property-guided paradigm. Experimental results underscore the competitiveness of our approach in molecular generation, highlighting its superiority in generating desired molecules without the need for additional optimization steps.
-
Drug discovery aims to find novel compounds with specified chemical property profiles. In terms of generative modeling, the goal is to learn to sample molecules in the intersection of multiple property constraints. This task becomes increasingly challenging when there are many property constraints. We propose to offset this complexity by composing molecules from a vocabulary of substructures that we call molecular rationales. These rationales are identified from molecules as substructures that are likely responsible for each property of interest. We then learn to expand rationales into a full molecule using graph generative models. Our final generative model composes molecules as mixtures of multiple rationale completions, and this mixture is fine-tuned to preserve the properties of interest. We evaluate our model on various drug design tasks and demonstrate significant improvements over state-of-the-art baselines in terms of accuracy, diversity, and novelty of generated compounds.