Discovering novel molecules with targeted properties remains a formidable challenge in materials science, often likened to finding a needle in a haystack. Traditional experimental approaches are slow, costly, and inefficient. In this study, we present an inverse design framework based on a molecular graph conditional variational autoencoder (CVAE) that enables the generation of new molecules with user-specified optical properties, particularly the molar extinction coefficient ($$\varepsilon$$). Our model encodes molecular graphs, derived from SMILES strings, into a structured latent space and then decodes them into valid molecular structures conditioned on a target $$\varepsilon$$ value. Trained on a curated dataset of known molecules with corresponding extinction coefficients, the CVAE learns to generate chemically valid structures, as verified by RDKit. Subsequent Density Functional Theory (DFT) simulations confirm that many of the generated molecules exhibit electronic structures similar to those of molecules with the desired $$\varepsilon$$ values. We have also verified the $$\varepsilon$$ values of the generated molecules using a graph neural network (GNN) and the synthesizability of those molecules using an open-source module named ASKCOS. This approach demonstrates the potential of CVAEs to accelerate molecular discovery by enabling user-guided, property-driven molecule generation, offering a scalable, data-driven alternative to traditional trial-and-error synthesis.
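For readers unfamiliar with CVAEs, the objective usually maximized during training is the conditional evidence lower bound below. This is the textbook form, given here as an assumption; the abstract does not state the exact loss used. The symbols are illustrative: $$G$$ is a molecular graph, $$z$$ the latent vector, $$\varepsilon$$ the conditioning property, $$q_\phi$$ the encoder, $$p_\theta$$ the decoder, and $$p(z \mid \varepsilon)$$ the prior (often simply a standard normal).

$$\mathcal{L}(\theta,\phi) = \mathbb{E}_{q_\phi(z \mid G, \varepsilon)}\left[\log p_\theta(G \mid z, \varepsilon)\right] - D_{\mathrm{KL}}\left(q_\phi(z \mid G, \varepsilon)\,\|\,p(z \mid \varepsilon)\right)$$

In this generic setup, generation amounts to sampling $$z$$ from the prior and decoding it together with the target $$\varepsilon$$.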
                    
                            
                            De novo molecule design towards biased properties via a deep generative framework and iterative transfer learning
                        
                    
    
De novo design of molecules with targeted properties represents a new frontier in molecule development. Despite enormous progress, two main challenges remain: (i) generating novel molecules conditioned on targeted, continuous property values; (ii) obtaining molecules with property values beyond the range in the training data. To tackle these challenges, we propose a reinforced regressional and conditional generative adversarial network (RRCGAN) to generate chemically valid molecules with a targeted HOMO–LUMO energy gap ($$\Delta E_{\text{H–L}}$$) as a proof-of-concept study. As validated by density functional theory (DFT) calculations, 75% of the generated molecules have a relative error (RE) below 20% with respect to the targeted $$\Delta E_{\text{H–L}}$$ values. To bias the generation toward $$\Delta E_{\text{H–L}}$$ values beyond the range of the original training molecules, transfer learning was applied to iteratively retrain the RRCGAN model. After just two iterations, the mean $$\Delta E_{\text{H–L}}$$ of the generated molecules increases to 8.7 eV from the mean value of 5.9 eV in the initial training dataset. Qualitative and quantitative analyses reveal that the model has successfully captured the underlying structure–property relationship, which agrees well with established physical and chemical rules. These results present a trustworthy, purely data-driven methodology for the highly efficient generation of novel molecules with different targeted properties.
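The validation statistic quoted above (the share of generated molecules whose DFT-computed gap lies within 20% of the target) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names and sample values are hypothetical, and the relative error is assumed to be |computed − target| / |target|.

```python
from typing import Sequence


def relative_error(computed: float, target: float) -> float:
    """Assumed definition: |computed - target| / |target|."""
    if target == 0:
        raise ValueError("target must be non-zero")
    return abs(computed - target) / abs(target)


def fraction_within(
    computed: Sequence[float], targets: Sequence[float], tol: float = 0.20
) -> float:
    """Share of molecules whose relative error is strictly below tol."""
    if len(computed) != len(targets):
        raise ValueError("computed and targets must have the same length")
    if not computed:
        raise ValueError("at least one molecule is required")
    hits = sum(relative_error(c, t) < tol for c, t in zip(computed, targets))
    return hits / len(computed)


# Hypothetical DFT gaps (eV) for four generated molecules and their targets.
dft_gaps = [5.0, 6.6, 7.9, 4.0]
target_gaps = [5.5, 6.0, 8.0, 6.0]
print(fraction_within(dft_gaps, target_gaps))  # prints 0.75
```

With these made-up values, three of the four molecules fall within the tolerance, which is how a figure such as "75%" would be obtained.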
        
    
- Award ID(s): 2154428
- PAR ID: 10408410
- Publisher / Repository: RSC
- Date Published:
- Journal Name: Digital Discovery
- Volume: 3
- Issue: 2
- ISSN: 2635-098X
- Page Range / eLocation ID:
- DOI: 10.26434/chemrxiv-2023-0zv2f-v2
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
- Generation of molecules with desired chemical and biological properties, such as high drug-likeness and high binding affinity to target proteins, is critical for drug discovery. In this paper, we propose a probabilistic generative model to capture the joint distribution of molecules and their properties. Our model assumes an energy-based model (EBM) in the latent space. Conditional on the latent vector, the molecule and its properties are modeled by a molecule generation model and a property regression model, respectively. To search for molecules with desired properties, we propose a sampling with gradual distribution shifting (SGDS) algorithm: after the model is initially learned on the training data of existing molecules and their properties, the algorithm gradually shifts the model distribution towards the region supported by molecules with desired property values. Our experiments show that our method achieves very strong performance on various molecule design tasks.
- Despite its simplicity, the composition of a material can be used as input to machine learning models to predict a range of materials properties. However, many property optimization tasks require the generation of novel but realistic materials compositions. In this study, we describe a way to generate compositions of hybrid organic–inorganic crystals by adapting Augmented CycleGAN, a novel generative model that can learn many-to-many relations between two domains. Specifically, we investigate the problem of composition change upon amine swap: for a specific chemical system (set of elements) crystallized with amine A, how would the product chemical compositions change if it is crystallized with amine B? By training with limited data from the Cambridge Structural Database, our model can generate realistic chemical compositions for hybrid crystalline materials. The Augmented CycleGAN model can also utilize abundant unpaired data (compositions of different chemical systems), a feature that traditional supervised methods lack. The generated compositions can be used for many tasks, for example, as input to a classifier that predicts structural dimensionality.
- Inverse molecular generation is an essential task for drug discovery, and generative models offer a very promising avenue, especially when diffusion models are used. Despite their great success, existing methods are inherently limited by the lack of a semantic latent space that can be navigated to perform targeted exploration and generate molecules with desired properties. Here, we present a property-guided diffusion model for generating desired molecules, which incorporates a sophisticated diffusion process capturing intricate interactions of nodes and edges within molecular graphs and leverages a time-dependent molecular property classifier to integrate desired properties into the diffusion sampling process. Furthermore, we extend our model to a multi-property-guided paradigm. Experimental results underscore the competitiveness of our approach in molecular generation, highlighting its superiority in generating desired molecules without the need for additional optimization steps.
- The accurate detection of chemical agents promotes many national security and public safety goals, and robust chemical detection methods can prevent disasters and support effective response to incidents. Mass spectrometry is an important tool in detecting and identifying chemical agents. However, there are high costs and logistical challenges associated with acquiring sufficient lab-generated mass spectrometry data for training machine learning algorithms, including the skilled personnel, sample preparation, and analysis required for data generation. These high costs of mass spectrometry data collection hinder the development of machine learning and deep learning models to detect and identify chemical agents. Accordingly, the primary objective of our research is to create a mass spectrometry data generation model whose output (synthetic mass spectrometry data) would enhance the performance of downstream machine learning chemical classification models. Such a synthetic data generation model would reduce the need to generate costly real-world data and provide additional training data to use in combination with lab-generated mass spectrometry data when training classifiers. Our approach is a novel combination of autoencoder-based synthetic data generation with a fixed, a priori defined hidden layer geometry. In particular, we train pairs of encoders and decoders with an additional loss term that enforces that the hidden layer passed from the encoder to the decoder matches the embedding provided by an external deep learning model designed to predict functional properties of chemicals. We have verified that incorporating our synthetic spectra into a lab-generated dataset enhances the performance of classification algorithms compared to using only the real data. Our synthetic spectra have been successfully matched to lab-generated spectra for their respective chemicals using library matching software, further demonstrating the validity of our work.
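The "additional loss term" described in the last abstract, which ties the autoencoder's hidden layer to an external embedding, can be sketched as a reconstruction loss plus a weighted latent-matching penalty. This is a minimal sketch under stated assumptions: the abstract does not specify the form of either term, so the use of mean squared error, the `weight` parameter, and all function names here are hypothetical.

```python
from typing import Sequence


def mean_squared_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean of squared element-wise differences between two equal-length vectors."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def latent_matched_loss(
    spectrum: Sequence[float],
    reconstruction: Sequence[float],
    latent: Sequence[float],
    reference_embedding: Sequence[float],
    weight: float = 1.0,
) -> float:
    """Reconstruction error plus a penalty pulling the latent code toward
    the embedding supplied by an external model (assumed MSE for both terms)."""
    reconstruction_term = mean_squared_error(spectrum, reconstruction)
    matching_term = mean_squared_error(latent, reference_embedding)
    return reconstruction_term + weight * matching_term


# Hypothetical values: a 3-bin spectrum and a 2-dimensional latent code.
loss = latent_matched_loss(
    spectrum=[0.0, 1.0, 0.5],
    reconstruction=[0.0, 0.8, 0.5],
    latent=[0.2, -0.1],
    reference_embedding=[0.0, -0.1],
    weight=0.5,
)
print(loss)
```

Raising `weight` pushes the hidden layer closer to the external embedding at the expense of reconstruction fidelity; the balance actually used is not stated in the abstract.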
 An official website of the United States government