<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>D-VAE: A Variational Autoencoder for Directed Acyclic Graphs</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2019</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10140929</idno>
					<idno type="doi"></idno>
					<title level='j'>Advances in neural information processing systems</title>
<idno>1049-5258</idno>
<biblScope unit="volume">32</biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Muhan Zhang</author><author>Shali Jiang</author><author>Zhicheng Cui</author><author>Roman Garnett</author><author>Yixin Chen</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Graph structured data are abundant in the real world. Among different graph types, directed acyclic graphs (DAGs) are of particular interest to machine learning researchers, as many machine learning models are realized as computations on DAGs, including neural networks and Bayesian networks. In this paper, we study deep generative models for DAGs, and propose a novel DAG variational autoencoder (D-VAE). To encode DAGs into the latent space, we leverage graph neural networks. We propose an asynchronous message passing scheme that allows encoding the computations on DAGs, rather than using existing simultaneous message passing schemes to encode local graph structures. We demonstrate the effectiveness of our proposed DVAE through two tasks: neural architecture search and Bayesian network structure learning. Experiments show that our model not only generates novel and valid DAGs, but also produces a smooth latent space that facilitates searching for DAGs with better performance through Bayesian optimization.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Many real-world problems can be posed as optimizing of a directed acyclic graph (DAG) representing some computational task. For example, the architecture of a neural network is a DAG. The problem of searching optimal neural architectures is essentially a DAG optimization task. Similarly, one critical problem in learning graphical models -optimizing the connection structures of Bayesian networks <ref type="bibr">[1]</ref>, is also a DAG optimization task. DAG optimization is pervasive in other fields as well. In electronic circuit design, engineers need to optimize DAG circuit blocks not only to realize target functions, but also to meet specifications such as power usage and operating temperature.</p><p>DAG optimization is a hard problem. Firstly, the evaluation of a DAG's performance is often timeconsuming (e.g., training a neural network). Secondly, state-of-the-art black-box optimization techniques such as simulated annealing and Bayesian optimization primarily operate in a continuous space, thus are not directly applicable to DAG optimization due to the discrete nature of DAGs. In particular, to make Bayesian optimization work for discrete structures, we need a kernel to measure the similarity between discrete structures as well as a method to explore the design space and extrapolate to new points. Principled solutions to these problems are still lacking.</p><p>Is there a way to circumvent the trouble from discreteness? The answer is yes. If we can embed all DAGs to a continuous space and make the space relatively smooth, we might be able to directly use principled black-box optimization algorithms to optimize DAGs in this space, or even use gradient methods if gradients are available. Recently, there has been increased interest in training generative models for discrete data types such as molecules <ref type="bibr">[2,</ref><ref type="bibr">3]</ref>, arithmetic expressions <ref type="bibr">[4]</ref>, source code <ref type="bibr">[5]</ref>, undirected graphs <ref type="bibr">[6]</ref>, etc. In particular, Kusner et al. <ref type="bibr">[3]</ref> developed a grammar variational autoencoder (G-VAE) for molecules, which is able to encode and decode molecules into and from a continuous latent space, allowing one to optimize molecule properties by searching in this well-behaved space instead of a discrete space. Inspired by this work, we propose to also train a variational autoencoder for DAGs, and optimize DAG structures in the latent space via Bayesian optimization.</p><p>To encode DAGs, we leverage graph neural networks (GNNs) <ref type="bibr">[7]</ref>. Traditionally, a GNN treats all nodes symmetrically, and extracts local features around nodes by simultaneously passing all nodes' neighbors' messages to themselves. However, such a simultaneous message passing scheme is designed to learn local structure features. It might not be suitable for DAGs, since in a DAG: 1) nodes are not symmetric, but intrinsically have some ordering based on its dependency structure; and 2) we are more concerned about the computation represented by the entire graph, not the local structures.</p><p>In this paper, we propose an asynchronous message passing scheme to encode the computations on DAGs. The message passing no longer happens at all nodes simultaneously, but respects the computation dependencies (the partial order) among the nodes. For example, suppose node A has two predecessors, B and C, in a DAG. Our scheme does not perform feature learning for A until the feature learning on B and C are both finished. Then, the aggregated message from B and C is passed to A to trigger A's feature learning. This means, although the message passing is not simultaneous, it is also not completely unordered -some synchronization is still required. We incorporate this feature learning scheme in both our encoder and decoder, and propose DAG variational autoencoder (D-VAE). D-VAE has an excellent theoretical property for modeling DAGs-we prove that D-VAE can injectively encode computations on DAGs. This means, we can build a mapping from the discrete space to a continuous latent space so that every DAG computation has its unique embedding in the latent space, which justifies performing optimization in the latent space instead of the original design space.</p><p>Our contributions in this paper are: 1) We propose D-VAE, a variational autoencoder for DAGs using a novel asynchronous message passing scheme, which is able to injectively encode computations.</p><p>2) Based on D-VAE, we propose a new DAG optimization framework which performs Bayesian optimization in a continuous latent space. 3) We apply D-VAE to two problems, neural architecture search and Bayesian network structure learning. Experiments show that D-VAE not only generates novel and valid DAGs, but also learns smooth latent spaces effective for optimizing DAG structures.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Related work</head><p>Variational autoencoder (VAE) <ref type="bibr">[8,</ref><ref type="bibr">9]</ref> provides a framework to learn both a probabilistic generative model p &#952; (x|z) (the decoder) as well as an approximated posterior distribution q &#966; (z|x) (the encoder). VAE is trained through maximizing the evidence lower bound</p><p>The posterior approximation q &#966; (z|x) and the generative model p &#952; (x|z) can in principle take arbitrary parametric forms whose parameters &#966; and &#952; are output by the encoder and decoder networks. After learning p &#952; (x|z), we can generate new data by decoding latent space vectors z sampled from the prior p(z). For generating discrete data, p &#952; (x|z) is often decomposed into a series of decision steps.</p><p>Deep graph generative models use neural networks to learn distributions over graphs. There are mainly three types: token-based, adjacency-matrix-based, and graph-based. Token-based models <ref type="bibr">[2,</ref><ref type="bibr">3,</ref><ref type="bibr">10]</ref> represent a graph as a sequence of tokens (e.g., characters, grammar rules) and model these sequences using RNNs. They are less general since task-specific graph grammars such as SMILES for molecules <ref type="bibr">[11]</ref> are required. Adjacency-matrix-based models <ref type="bibr">[12,</ref><ref type="bibr">13,</ref><ref type="bibr">14,</ref><ref type="bibr">15,</ref><ref type="bibr">16]</ref> leverage the proxy adjacency matrix representation of a graph, and generate the matrix in one shot or generate the columns/entries sequentially. In contrast, graph-based models <ref type="bibr">[6,</ref><ref type="bibr">17,</ref><ref type="bibr">18,</ref><ref type="bibr">19]</ref> seem more natural, since they operate directly on graph structures (instead of proxy matrix representations) by iteratively adding new nodes/edges to a graph based on the existing graph and node states. In addition, the graph and node states are learned by graph neural networks (GNNs), which have already shown their powerful graph representation learning ability on various tasks <ref type="bibr">[20,</ref><ref type="bibr">21,</ref><ref type="bibr">22,</ref><ref type="bibr">23,</ref><ref type="bibr">24,</ref><ref type="bibr">25,</ref><ref type="bibr">26,</ref><ref type="bibr">27]</ref>.</p><p>Neural architecture search (NAS) aims at automating the design of neural network architectures. It has seen major advances in recent years <ref type="bibr">[28,</ref><ref type="bibr">29,</ref><ref type="bibr">30,</ref><ref type="bibr">31,</ref><ref type="bibr">32,</ref><ref type="bibr">33]</ref>. See Hutter et al. <ref type="bibr">[34]</ref> for an overview. NAS methods can be mainly categorized into: 1) reinforcement learning methods <ref type="bibr">[28,</ref><ref type="bibr">31,</ref><ref type="bibr">33]</ref> which train controllers to generate architectures with high rewards in terms of validation accuracy, 2) Bayesian optimization based methods <ref type="bibr">[35]</ref> which define kernels to measure architecture similarity and extrapolate the architecture space heuristically, 3) evolutionary approaches <ref type="bibr">[29,</ref><ref type="bibr">36,</ref><ref type="bibr">37]</ref> which use evolutionary algorithms to optimize neural architectures, and 4) differentiable methods <ref type="bibr">[32,</ref><ref type="bibr">38,</ref><ref type="bibr">39]</ref> which use continuous relaxation/mapping of neural architectures to enable gradient-based optimization. In Appendix A, we include more detailed discussion on several most related works.</p><p>Bayesian network structure learning (BNSL) is to learn the structure of the underlying Bayesian network from observed data <ref type="bibr">[40,</ref><ref type="bibr">41,</ref><ref type="bibr">42,</ref><ref type="bibr">43]</ref>. Bayesian network is a probabilistic graphical model encoding conditional dependencies among variables via a DAG <ref type="bibr">[1]</ref>. One main approach for BNSL is score-based search, i.e., define some "goodness-of-fit" score for network structures, and search for one with the optimal score in the discrete design space. Commonly used scores include BIC and BDeu, mostly based on marginal likelihood <ref type="bibr">[1]</ref>. Due to the NP-hardness <ref type="bibr">[44]</ref>, however, exact algorithms such as dynamic programming <ref type="bibr">[45]</ref> or shortest path approaches <ref type="bibr">[46,</ref><ref type="bibr">47]</ref> can only solve small-scale problems. Thus, people have to resort to heuristic methods such as local search and simulated annealing, etc. <ref type="bibr">[48]</ref>. BNSL is still an active research area <ref type="bibr">[41,</ref><ref type="bibr">43,</ref><ref type="bibr">49,</ref><ref type="bibr">50,</ref><ref type="bibr">51]</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">DAG variational autoencoder (D-VAE)</head><p>In this section, we describe our proposed DAG variational autoencoder (D-VAE). D-VAE uses an asynchronous message passing scheme to encode and decode DAGs. In contrast to the simultaneous message passing in traditional GNNs, D-VAE allows encoding computations rather than structures. Definition 1. (Computation) Given a set of elementary operations O, a computation C is the composition of a finite number of operations o &#8712; O applied to an input signal x, with the output of each operation being the input to its succeeding operations. The set of elementary operations O depends on specific applications. For example, when we are interested in computations given by a calculator, O will be the set of all the operations defined on the functional buttons, such as +, -, &#215;, &#247;, etc. When modeling neural networks, O can be a predefined set of basic layers, such as 3&#215;3 convolution, 5&#215;5 convolution, 2&#215;2 max pooling, etc. A computation can be represented as a directed acyclic graph (DAG), with directed edges representing signal flow directions among node operations. The graph must be acyclic, since otherwise the input signal will go through an infinite number of operations so that the computation never stops. Figure <ref type="figure">1</ref> shows two examples. Note that the two different DAGs in Figure <ref type="figure">1</ref> represent the same computation, as the input signal goes through exactly the same operations. We discuss it further in Appendix B.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Encoding</head><p>We first introduce the encoder of D-VAE, which can be seen as a graph neural network (GNN) using an asynchronous message passing scheme. Given a DAG G, we assume there is a single starting node which does not have any predecessors (e.g., the input layer of a neural architecture). If there are multiple such nodes, we add a virtual starting node connecting to all of them. Similar to standard GNNs, we use an update function U to compute the hidden state of each node based on its neighbors' incoming message. The hidden state of node v is given by:</p><p>where x v is the one-hot encoding of v's type, and h in v represents the incoming message to v. h in v is given by aggregating the hidden states of v's predecessors using an aggregation function A:</p><p>where u &#8594; v denotes there is a directed edge from u to v, and h u : u &#8594; v represents a multiset of v's predecessors' hidden states. If an empty set is input to A (corresponding to the case for the starting node without any predecessors), we let A output an all-zero vector.</p><p>Compared to the traditional simultaneous message passing, in D-VAE the message passing for a node must wait until all of its predecessors' hidden states have already been computed. This simulates how a computation is really performed -to execute some operation, we also need to wait until all its input signals are ready. So how to make sure all the predecessor states are available when a new node comes? One solution is that we can sequentially perform message passing for nodes following a topological ordering of the DAG. We illustrate this encoding process in Figure <ref type="figure">2</ref>. each step according to the decoding distributions described in Section 3.2 and calculate subsequent decoding distributions based on the sampled results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Discussion and model extensions</head><p>Relation with RNNs. The D-VAE encoder and decoder can be reduced to ordinary RNNs when the input DAG is reduced to a chain of nodes. Although we propose D-VAE from a GNN's perspective, our model can also be seen as a generalization of traditional sequence modeling frameworks <ref type="bibr">[55,</ref><ref type="bibr">56]</ref> where a timestamp depends only on the timestamp immediately before it, to the DAG case where a timestamp has multiple previous dependencies. As special DAGs, similar ideas have been explored for trees <ref type="bibr">[57,</ref><ref type="bibr">17]</ref>, where a node can have multiple incoming edges yet only one outgoing edge.</p><p>Bidirectional encoding. D-VAE's encoding process can be seen as simulating how an input signal goes through a DAG, with h v simulating the output signal at each node v. This is also known as forward propagation in neural networks. Inspired by the bidirectional RNN <ref type="bibr">[58]</ref>, we can also use another GRU to reversely encode a DAG (i.e., reverse all edge directions and encode the DAG again), thus simulating the backward propagation too. After reverse encoding, we get two ending states, which are concatenated and linearly mapped to their original size as the final output state. We find this bidirectional encoding can increase the performance and convergence speed on neural architectures.</p><p>Incorporating vertex semantics. Note that D-VAE currently uses one-hot encoding of node types as x v , which does not consider the semantic meanings of different node types. For example, a 3 &#215; 3 convolution layer might be functionally very similar to a 5 &#215; 5 convolution layer, while being functionally distinct from a max pooling layer. We expect incorporating such semantic meanings of node types to be able to further improve D-VAE's performance. For example, we can use pretrained embeddings of node types to replace the one-hot encoding. We leave it for future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiments</head><p>We validate the proposed DAG variational autoencoder (D-VAE) on two DAG optimization tasks:</p><p>&#8226; Neural architecture search. Our neural network dataset contains 19,020 neural architectures from the ENAS software <ref type="bibr">[33]</ref>. Each neural architecture has 6 layers (excluding input and output layers) sampled from: 3 &#215; 3 and 5 &#215; 5 convolutions, 3 &#215; 3 and 5 &#215; 5 depthwise-separable convolutions <ref type="bibr">[59]</ref>, 3 &#215; 3 max pooling, and 3 &#215; 3 average pooling. We evaluate each neural architecture's weight-sharing accuracy <ref type="bibr">[33]</ref> (a proxy of the true accuracy) on CIFAR-10 [60] as its performance measure. We split the dataset into 90% training and 10% held-out test sets. We use the training set for VAE training, and use the test set only for evaluation. &#8226; Bayesian network structure learning. Our Bayesian network dataset contains 200,000 random 8-node Bayesian networks from the bnlearn package <ref type="bibr">[61]</ref> in R. For each network, we compute the Bayesian Information Criterion (BIC) score to measure the performance of the network structure for fitting the Asia dataset <ref type="bibr">[62]</ref>. We split the Bayesian networks into 90% training and 10% test sets. For more details, please refer to Appendix I.</p><p>Following <ref type="bibr">[3]</ref>, we do four experiments for each task:</p><p>&#8226; Basic abilities of VAE models. In this experiment, we perform standard tests to evaluate the reconstructive and generative abilities of a VAE model for DAGs, including reconstruction accuracy, prior validity, uniqueness and novelty. &#8226; Predictive performance of latent representation. We test how well we can use the latent embeddings of neural architectures and Bayesian networks to predict their performances. &#8226; Bayesian optimization. This is the motivating application of D-VAE. We test how well the learned latent space can be used for searching for high-performance DAGs through Bayesian optimization. &#8226; Latent space visualization. We visualize the latent space to qualitatively evaluate its smoothness.</p><p>Since there is little previous work on DAG generation, we compare D-VAE with four generative baselines adapted for DAGs: S-VAE, GraphRNN, GCN and DeepGMG. Among them, S-VAE <ref type="bibr">[56]</ref> and GraphRNN <ref type="bibr">[13]</ref> are adjacency-matrix-based methods; GCN <ref type="bibr">[22]</ref> and DeepGMG <ref type="bibr">[6]</ref> are graph-based methods which use simultaneous message passing to embed DAGs. We include more details about these baselines and discuss D-VAE's advantages over them in Appendix J. The training details are in Appendix K. All the code and data are available at <ref type="url">https://github.com/muhanzhang/D-VAE</ref>. We first evaluate each model's reconstruction accuracy on the test sets. Following previous work <ref type="bibr">[3,</ref><ref type="bibr">17]</ref>, we regard the encoding as a stochastic process. That is, after getting the mean and variance parameters of the posterior approximation q &#966; (z|G), we sample a z from it as G's latent vector. To estimate the reconstruction accuracy, we sample z 10 times for each G, and decode each z 10 times too. Then we report the average proportion of the 100 decoded DAGs that are identical to the input.</p><p>To calculate prior validity, we sample 1,000 latent vectors z from the prior distribution p(z) and decode each latent vector 10 times. Then we report the proportion of valid DAGs in these 10,000 generations. A generated DAG is valid if it can be read by the original software which generated the training data. More details about the validity experiment are in Appendix M.1.</p><p>We show the results in Table <ref type="table">1</ref>. Among all the models, D-VAE and S-VAE generally perform the best. We find that D-VAE, S-VAE and GraphRNN all have near perfect reconstruction accuracy, prior validity and novelty. However, D-VAE and S-VAE show higher uniqueness, meaning that they generate more diverse examples. GCN and DeepGMG have worse reconstruction accuracies for neural architectures due to nonzero training losses. This is because the simultaneous message passing scheme in them focus more on learning local graph structures, but fail to encode the computation represented by the entire neural network. Besides, the sum pooling after the message passing might also lose some global topology information which is important for the reconstruction. The nonzero training loss of DeepGMG acts like an early stopping regularizer, making DeepGMG generate more unique graphs. Nevertheless, reconstruction accuracy is much more important than uniqueness in our tasks, since we want our embeddings to accurately remap to their original structures after latent space optimization.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Predictive performance of latent representation.</head><p>In this experiment, we evaluate how well the learned latent embeddings can predict the corresponding DAGs' performances, which tests a VAE's unsupervised representation learning ability. Being able to accurately predict a latent point's performance also makes it much easier to search for highperformance points in this latent space. Thus, the experiment is also an indirect way to evaluate a VAE latent space's amenability for DAG optimization. Following <ref type="bibr">[3]</ref>, we train a sparse Gaussian process (SGP) model <ref type="bibr">[63]</ref> with 500 inducing points on the embeddings of training data to predict the performance of unseen test data. We include the SGP training details in Appendix L. We use two metrics to evaluate the predictive performance of the latent embeddings (given by the mean of the posterior approximations q &#966; (z|G)). One is the RMSE between the SGP predictions and the true performances. The other is the Pearson correlation coefficient (or Pearson's r), measuring how well the prediction and real performance tend to go up and down together. A small RMSE and a large Pearson's r indicate a better predictive performance.  All the experiments are repeated 10 times and the means and standard deviations are reported. Table <ref type="table">2</ref> shows the results. We find that both the RMSE and Pearson's r of D-VAE are significantly better than those of the other models. A possible explanation is that D-VAE encodes the computation, while a DAG's performance is primarily determined by its computation. Therefore, D-VAE's latent embeddings are more informative about performance. In comparison, adjacency-matrix-based methods (S-VAE and GraphRNN) and graph-based methods with simultaneous message passing (GCN and DeepGMG) both only encode (local) graph structures without specifically modeling computations on DAG structures. The better predictive power of D-VAE favors using a predictive model in its latent space to guide the search for high performance graphs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Bayesian optimization</head><p>We perform Bayesian optimization (BO) using the two best models, D-VAE and S-VAE, validated by previous experiments. Based on the SGP model from the last experiment, we perform 10 iterations of batch BO, and average results across 10 trials. Following Kusner et al. <ref type="bibr">[3]</ref>, in each iteration, a batch of 50 points are proposed by sequentially maximizing the expected improvement (EI) acquisition function, using Kriging Believer <ref type="bibr">[64]</ref> to assume labels for previously chosen points in the batch. For each batch of selected points, we evaluate their decoded DAGs' real performances and add them back to the SGP to select the next batch. Finally, we check the best-performing DAGs found by each model to evaluate its DAG optimization performance.</p><p>Neural architectures. For neural architectures, we select the top 15 found architectures in terms of their weight-sharing accuracies, and fully train them on CIFAR-10's train set to evaluate their true test accuracies. More details can be found in Appendix H. We show the 5 architectures with the highest true test accuracies in Figure <ref type="figure">4</ref>. As we can see, D-VAE in general found much better neural architectures than S-VAE. Among the selected architectures, D-VAE achieved a highest accuracy of 94.80%, while S-VAE's highest accuracy was only 92.79%. In addition, all the 5 architectures of D-VAE have accuracies higher than 94%, indicating that D-VAE's latent space can stably find many high-performance architectures. More details about our NAS experiments are in Appendix H. Bayesian networks. We similarly report the top 5 Bayesian networks found by each model ranked by their BIC scores in Figure <ref type="figure">5</ref>. D-VAE generally found better Bayesian networks than S-VAE. The best Bayesian network found by D-VAE achieved a BIC of -11125.75, which is better than the best network in the training set with a BIC of -11141.89 (a higher BIC score is better). Note that BIC is in log scale, thus the probability of our found network to explain the data is actually 1E7 times larger than that of the best training network. For reference, the true Bayesian network used to generate the Asia data has a BIC of -11109.74. Although we did not exactly find the true network, our found network was close to it and outperformed all 180,000 training networks. Our experiments show that searching in an embedding space is a promising direction for Bayesian network structure learning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Latent space visualization</head><p>In this experiment, we visualize the latent spaces of the VAE models to get a sense of their smoothness.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.</p></note>
		</body>
		</text>
</TEI>
