skip to main content


Title: Deep Latent-Variable Models for Controllable Molecule Generation
Representation learning via deep generative models is opening a new avenue for small molecule generation in silico. Linking chemical and biological space remains a key challenge. In this paper, we debut a graph-based variational autoencoder framework to address this challenge under the umbrella of disentangled representation learning. The framework permits several inductive biases that connect the learned latent factors to molecular properties. Evaluation on diverse benchmark datasets shows that the resulting models are powerful and open up an exciting line of research on controllable molecule generation in support of cheminformatics, drug discovery, and other application settings.  more » « less
Award ID(s):
1900061
NSF-PAR ID:
10343759
Author(s) / Creator(s):
; ; ; ; ; ;
Date Published:
Journal Name:
2021 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)
Page Range / eLocation ID:
372 to 375
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract Motivation

    Expanding our knowledge of small molecules beyond what is known in nature or designed in wet laboratories promises to significantly advance cheminformatics, drug discovery, biotechnology and material science. In silico molecular design remains challenging, primarily due to the complexity of the chemical space and the non-trivial relationship between chemical structures and biological properties. Deep generative models that learn directly from data are intriguing, but they have yet to demonstrate interpretability in the learned representation, so we can learn more about the relationship between the chemical and biological space. In this article, we advance research on disentangled representation learning for small molecule generation. We build on recent work by us and others on deep graph generative frameworks, which capture atomic interactions via a graph-based representation of a small molecule. The methodological novelty is how we leverage the concept of disentanglement in the graph variational autoencoder framework both to generate biologically relevant small molecules and to enhance model interpretability.

    Results

    Extensive qualitative and quantitative experimental evaluation in comparison with state-of-the-art models demonstrate the superiority of our disentanglement framework. We believe this work is an important step to address key challenges in small molecule generation with deep generative frameworks.

    Availability and implementation

    Training and generated data are made available at https://ieee-dataport.org/documents/dataset-disentangled-representation-learning-interpretable-molecule-generation. All code is made available at https://anonymous.4open.science/r/D-MolVAE-2799/.

    Supplementary information

    Supplementary data are available at Bioinformatics online.

     
    more » « less
  2. With the debut of AlphaFold2, we now can get a highly-accurate view of a reasonable equilibrium tertiary structure of a protein molecule. Yet, a single-structure view is insufficient and does not account for the high structural plasticity of protein molecules. Obtaining a multi-structure view of a protein molecule continues to be an outstanding challenge in computational structural biology. In tandem with methods formulated under the umbrella of stochastic optimization, we are now seeing rapid advances in the capabilities of methods based on deep learning. In recent work, we advance the capability of these models to learn from experimentally-available tertiary structures of protein molecules of varying lengths. In this work, we elucidate the important role of the composition of the training dataset on the neural network’s ability to learn key local and distal patterns in tertiary structures. To make such patterns visible to the network, we utilize a contact map-based representation of protein tertiary structure. We show interesting relationships between data size, quality, and composition on the ability of latent variable models to learn key patterns of tertiary structure. In addition, we present a disentangled latent variable model which improves upon the state-of-the-art variable autoencoder-based model in key, physically-realistic structural patterns. We believe this work opens up further avenues of research on deep learning-based models for computing multi-structure views of protein molecules. 
    more » « less
  3. Designing molecules with specific structural and functional properties (e.g., drug-likeness and water solubility) is central to advancing drug discovery and material science, but it poses outstanding challenges both in wet and dry laboratories. The search space is vast and rugged. Recent advances in deep generative models are motivating new computational approaches building over deep learning to tackle the molecular space. Despite rapid advancements, state-of-the-art deep generative models for molecule generation have many limitations, including lack of interpretability. In this paper we address this limitation by proposing a generic framework for interpretable molecule generation based on novel disentangled deep graph generative models with property control. Specifically, we propose a disentanglement enhancement strategy for graphs. We also propose new deep neural architecture to achieve the above learning objective for inference and generation for variable-size graphs efficiently. Extensive experimental evaluation demonstrates the superiority of our approach in various critical aspects, such as accuracy, novelty, and disentanglement. 
    more » « less
  4. Conformational dynamics of biomolecules are of fundamental importance for their function. Single-molecule studies of Förster Resonance Energy Transfer (smFRET) between a tethered donor and acceptor dye pair are a powerful tool to investigate the structure and dynamics of labeled molecules. However, capturing and quantifying conformational dynamics in intensity-based smFRET experiments remains challenging when the dynamics occur on the sub-millisecond timescale. The method of multiparameter fluorescence detection addresses this challenge by simultaneously registering fluorescence intensities and lifetimes of the donor and acceptor. Together, two FRET observables, the donor fluorescence lifetime τ D and the intensity-based FRET efficiency E, inform on the width of the FRET efficiency distribution as a characteristic fingerprint for conformational dynamics. We present a general framework for analyzing dynamics that relates average fluorescence lifetimes and intensities in two-dimensional burst frequency histograms. We present parametric relations of these observables for interpreting the location of FRET populations in E–τ D diagrams, called FRET-lines. To facilitate the analysis of complex exchange equilibria, FRET-lines serve as reference curves for a graphical interpretation of experimental data to (i) identify conformational states, (ii) resolve their dynamic connectivity, (iii) compare different kinetic models, and (iv) infer polymer properties of unfolded or intrinsically disordered proteins. For a simplified graphical analysis of complex kinetic networks, we derive a moment-based representation of the experimental data that decouples the motion of the fluorescence labels from the conformational dynamics of the biomolecule. Importantly, FRET-lines facilitate exploring complex dynamic models via easily computed experimental observables. We provide extensive computational tools to facilitate applying FRET-lines. 
    more » « less
  5. With the widespread adoption of the Next Generation Science Standards (NGSS), science teachers and online learning environments face the challenge of evaluating students' integration of different dimensions of science learning. Recent advances in representation learning in natural language processing have proven effective across many natural language processing tasks, but a rigorous evaluation of the relative merits of these methods for scoring complex constructed response formative assessments has not previously been carried out. We present a detailed empirical investigation of feature-based, recurrent neural network, and pre-trained transformer models on scoring content in real-world formative assessment data. We demonstrate that recent neural methods can rival or exceed the performance of feature-based methods. We also provide evidence that different classes of neural models take advantage of different learning cues, and pre-trained transformer models may be more robust to spurious, dataset-specific learning cues, better reflecting scoring rubrics. 
    more » « less