skip to main content


Title: Masked graph modeling for molecule generation
Abstract

De novo, in-silico design of molecules is a challenging problem with applications in drug discovery and material design. We introduce a masked graph model, which learns a distribution over graphs by capturing conditional distributions over unobserved nodes (atoms) and edges (bonds) given observed ones. We train and then sample from our model by iteratively masking and replacing different parts of initialized graphs. We evaluate our approach on the QM9 and ChEMBL datasets using the GuacaMol distribution-learning benchmark. We find that validity, KL-divergence and Fréchet ChemNet Distance scores are anti-correlated with novelty, and that we can trade off between these metrics more effectively than existing models. On distributional metrics, our model outperforms previously proposed graph-based approaches and is competitive with SMILES-based approaches. Finally, we show our model generates molecules with desired values of specified properties while maintaining physiochemical similarity to the training distribution.

 
more » « less
Award ID(s):
1922658
NSF-PAR ID:
10231155
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
Nature Publishing Group
Date Published:
Journal Name:
Nature Communications
Volume:
12
Issue:
1
ISSN:
2041-1723
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Statistical relational learning (SRL) and graph neural networks (GNNs) are two powerful approaches for learning and inference over graphs. Typically, they are evaluated in terms of simple metrics such as accuracy over individual node labels. Complexaggregate graph queries(AGQ) involving multiple nodes, edges, and labels are common in the graph mining community and are used to estimate important network properties such as social cohesion and influence. While graph mining algorithms support AGQs, they typically do not take into account uncertainty, or when they do, make simplifying assumptions and do not build full probabilistic models. In this paper, we examine the performance of SRL and GNNs on AGQs over graphs with partially observed node labels. We show that, not surprisingly, inferring the unobserved node labels as a first step and then evaluating the queries on the fully observed graph can lead to sub-optimal estimates, and that a better approach is to compute these queries as an expectation under the joint distribution. We propose a sampling framework to tractably compute the expected values of AGQs. Motivated by the analysis of subgroup cohesion in social networks, we propose a suite of AGQs that estimate the community structure in graphs. In our empirical evaluation, we show that by estimating these queries as an expectation, SRL-based approaches yield up to a 50-fold reduction in average error when compared to existing GNN-based approaches.

     
    more » « less
  2. Abstract Motivation

    Modeling the structural plasticity of protein molecules remains challenging. Most research has focused on obtaining one biologically active structure. This includes the recent AlphaFold2 that has been hailed as a breakthrough for protein modeling. Computing one structure does not suffice to understand how proteins modulate their interactions and even evade our immune system. Revealing the structure space available to a protein remains challenging. Data-driven approaches that learn to generate tertiary structures are increasingly garnering attention. These approaches exploit the ability to represent tertiary structures as contact or distance maps and make direct analogies with images to harness convolution-based generative adversarial frameworks from computer vision. Since such opportunistic analogies do not allow capturing highly structured data, current deep models struggle to generate physically realistic tertiary structures.

    Results

    We present novel deep generative models that build upon the graph variational autoencoder framework. In contrast to existing literature, we represent tertiary structures as ‘contact’ graphs, which allow us to leverage graph-generative deep learning. Our models are able to capture rich, local and distal constraints and additionally compute disentangled latent representations that reveal the impact of individual latent factors. This elucidates what the factors control and makes our models more interpretable. Rigorous comparative evaluation along various metrics shows that the models, we propose advance the state-of-the-art. While there is still much ground to cover, the work presented here is an important first step, and graph-generative frameworks promise to get us to our goal of unraveling the exquisite structural complexity of protein molecules.

    Availability and implementation

    Code is available at https://github.com/anonymous1025/CO-VAE.

    Supplementary information

    Supplementary data are available at Bioinformatics Advances online.

     
    more » « less
  3. Abstract

    Dynamic (temporal) graphs are a convenient mathematical abstraction for many practical complex systems including social contacts, business transactions, and computer communications. Community discovery is an extensively used graph analysis kernel with rich literature for static graphs. However, community discovery in a dynamic setting is challenging for two specific reasons. Firstly, the notion of temporal community lacks a widely accepted formalization, and only limited work exists on understanding how communities emerge over time. Secondly, the added temporal dimension along with the sheer size of modern graph data necessitates new scalable algorithms. In this paper, we investigate how communities evolve over time based on several graph metrics under a temporal formalization. We compare six different algorithmic approaches for dynamic community detection for their quality and runtime. We identify that a vertex-centric (local) optimization method works as efficiently as the classical modularity-based methods. To its advantage, such local computation allows for the efficient design of parallel algorithms without incurring a significant parallel overhead. Based on this insight, we design a shared-memory parallel algorithmDyComPar, which demonstrates between 4 and 18 fold speed-up on a multi-core machine with 20 threads, for several real-world and synthetic graphs from different domains.

     
    more » « less
  4. Abstract

    We study the performance of Markov chains for theq-state ferromagnetic Potts model on random regular graphs. While the cases of the grid and the complete graph are by now well-understood, the case of random regular graphs has resisted a detailed analysis and, in fact, even analysing the properties of the Potts distribution has remained elusive. It is conjectured that the performance of Markov chains is dictated by metastability phenomena, i.e., the presence of “phases” (clusters) in the sample space where Markov chains with local update rules, such as the Glauber dynamics, are bound to take exponential time to escape, and therefore cause slow mixing. The phases that are believed to drive these metastability phenomena in the case of the Potts model emerge as local, rather than global, maxima of the so-called Bethe functional, and previous approaches of analysing these phases based on optimisation arguments fall short of the task. Our first contribution is to detail the emergence of the two relevant phases for theq-state Potts model on thed-regular random graph for all integers$$q,d\ge 3$$q,d3, and establish that for an interval of temperatures, delineated by the uniqueness and a broadcasting threshold on thed-regular tree, the two phases coexist (as possible metastable states). The proofs are based on a conceptual connection between spatial properties and the structure of the Potts distribution on the random regular graph, rather than complicated moment calculations. This significantly refines earlier results by Helmuth, Jenssen, and Perkins who had established phase coexistence for a small interval around the so-called ordered-disordered threshold (via different arguments) that applied for largeqand$$d\ge 5$$d5. Based on our new structural understanding of the model, our second contribution is to obtain metastability results for two classical Markov chains for the Potts model. We first complement recent fast mixing results for Glauber dynamics by Blanca and Gheissari below the uniqueness threshold, by showing an exponential lower bound on the mixing time above the uniqueness threshold. Then, we obtain tight results even for the non-local and more elaborate Swendsen–Wang chain, where we establish slow mixing/metastability for the whole interval of temperatures where the chain is conjectured to mix slowly on the random regular graph. The key is to bound the conductance of the chains using a random graph “planting” argument combined with delicate bounds on random-graph percolation.

     
    more » « less
  5. null (Ed.)
    Understanding the structure of minor-free metrics, namely shortest path metrics obtained over a weighted graph excluding a fixed minor, has been an important research direction since the fundamental work of Robertson and Seymour. A fundamental idea that helps both to understand the structural properties of these metrics and lead to strong algorithmic results is to construct a “small-complexity” graph that approximately preserves distances between pairs of points of the metric. We show the two following structural results for minor-free metrics: 1) Construction of a light subset spanner. Given a subset of vertices called terminals, and ϵ, in polynomial time we construct a sub graph that preserves all pairwise distances between terminals up to a multiplicative 1+ϵ factor, of total weight at most Oϵ(1) times the weight of the minimal Steiner tree spanning the terminals. 2) Construction of a stochastic metric embedding into low treewidth graphs with expected additive distortion ϵD. Namely, given a minor-free graph G=(V,E,w) of diameter D, and parameter ϵ, we construct a distribution D over dominating metric embeddings into treewidth-Oϵ(log n) graphs such that ∀u,v∈V, Ef∼D[dH(f(u),f(v))]≤dG(u,v)+ϵD. Our results have the following algorithmic consequences: (1) the first efficient approximation scheme for subset TSP in minor-free metrics; (2) the first approximation scheme for bounded-capacity vehicle routing in minor-free metrics; (3) the first efficient approximation scheme for bounded-capacity vehicle routing on bounded genus metrics. En route to the latter result, we design the first FPT approximation scheme for bounded-capacity vehicle routing on bounded-treewidth graphs (parameterized by the treewidth). 
    more » « less