<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Efficient and Degree-Guided Graph Generation via Discrete Diffusion Modeling</title></titleStmt>
			<publicationStmt>
				<publisher>ML Research Press</publisher>
				<date when="2023-07-23">23 July 2023</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10496888</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of Machine Learning Research</title>
<idno type="issn">2640-3498</idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Xiaohui Chen</author><author>Jiaxing He</author><author>Xu Han</author><author>Li-Ping Liu</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Diffusion-based graph generative models are effective at generating high-quality small graphs. However, they are hard to scale to large graphs that contain thousands of nodes. In this work, we propose EDGE, a new diffusion-based graph generative model that addresses generative tasks for large graphs. The model is developed by reversing a discrete diffusion process that randomly removes edges until an empty graph is obtained. It leverages graph sparsity in the diffusion process to improve computational efficiency. In particular, EDGE only focuses on a small portion of graph nodes and only adds edges between these nodes. Without compromising modeling ability, it makes far fewer edge predictions than previous diffusion-based generative models. Furthermore, EDGE can explicitly model the node degrees of training graphs, improving its ability to capture graph statistics. The empirical study shows that EDGE is much more efficient than competing methods and can generate large graphs with thousands of nodes. It also outperforms baseline models in generation quality: graphs generated by the proposed model have graph statistics more similar to those of training graphs.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1.">Introduction</head><p>There is a long history of using random graph models <ref type="bibr">(Newman et al., 2002)</ref> to model large graphs. Traditional models such as Erd&#337;s-R&#233;nyi (ER) model <ref type="bibr">(Erdos et al., 1960)</ref>, Stochastic-Block Model (SBM) <ref type="bibr">(Holland et al., 1983)</ref>, and Exponential-family Random Graph Models <ref type="bibr">(Lusher et al., 2013)</ref> are often used to model existing graph data and focus on prescribed graph structures. Besides modeling existing data, one interesting problem is to generate new graphs to simulate existing ones <ref type="bibr">(Ying &amp; Wu, 2009)</ref>, which has applications such as network data sharing. In generative tasks <ref type="bibr">(Chakrabarti &amp; Faloutsos, 2006)</ref>, traditional models often fall short in describing complex structures. A promising direction is to use deep neural models to generate large graphs.</p><p>There are only a few deep generative models designed for generating large graphs: NetGAN <ref type="bibr">(Bojchevski et al., 2018)</ref> and CELL <ref type="bibr">(Rendsburg et al., 2020)</ref> are two examples. However, recent research <ref type="bibr">(Chanpuriya et al., 2021)</ref> shows that these two models are edge-independent models and have a theoretical limitation: they cannot reproduce several important statistics (e.g. triangle counts and clustering coefficient) in their generated graphs unless they memorize the training graph. 
Several other models <ref type="bibr">(Chanpuriya et al., 2021)</ref>, including Variational Graph Autoencoders (VGAE) <ref type="bibr">(Kipf &amp; Welling, 2016b)</ref> and GraphVAE <ref type="bibr">(Simonovsky &amp; Komodakis, 2018)</ref>, are also edge-independent models and share the same limitation.</p><p>Diffusion-based generative models <ref type="bibr">(Liu et al., 2019;</ref><ref type="bibr">Niu et al., 2020;</ref><ref type="bibr">Jo et al., 2022;</ref><ref type="bibr">Chen et al., 2022b)</ref> have gained success in modeling small graphs. These models generate a graph in multiple steps and are NOT edge-independent because edges generated in later steps depend on previously generated edges. They are more flexible than one-shot models <ref type="bibr">(Kipf &amp; Welling, 2016b;</ref><ref type="bibr">Madhawa et al., 2019;</ref><ref type="bibr">Lippe &amp; Gavves, 2020)</ref>, which directly predict an adjacency matrix in one step. They also have an advantage over autoregressive graph models <ref type="bibr">(You et al., 2018;</ref><ref type="bibr">Liao et al., 2019)</ref>: diffusion-based models are invariant to node permutations and do not have long-term memory issues. However, previous diffusion-based models are only designed for tasks with small graphs (usually with fewer than one hundred nodes). This work aims to scale diffusion-based generative models to large graphs. The major issue of a diffusion-based model is that it must compute a latent vector or a probability for each node pair in a graph at each diffusion step <ref type="bibr">(Niu et al., 2020;</ref><ref type="bibr">Jo et al., 2022)</ref>, so the computation cost is O(TN^2) if the model generates a graph with N nodes in T steps. The learning task becomes challenging when N is large. At the same time, large graphs make it more difficult for a model to capture global graph statistics such as clustering coefficients. 
As a result, model performance degrades as the sizes of the training graphs scale up.</p><p>We propose the Efficient and Degree-guided graph GEnerative model (EDGE), based on a discrete diffusion process. The development of EDGE has three innovations. First, we encourage sparsity of the graphs in the diffusion process by setting the empty graph as the convergent "distribution".</p><p>The diffusion process then only removes edges and can be viewed as an edge-removal process. The increased sparsity of the graphs in the process dramatically reduces computation: the message-passing neural network (MPNN) <ref type="bibr">(Kipf &amp; Welling, 2016a)</ref> used in the generative model runs on these graphs, and its runtime is linear in the number of edges. Second, the generative model, which is the reverse of the edge-removal process, only predicts edges for a small portion of "active nodes" that have edge changes in the original edge-removal process. This strategy reduces the number of predictions the MPNN makes and thus its computation time. More importantly, this new design is naturally derived from the aforementioned edge-removal process without modifying its forward transition probabilities. Third, we model the node degrees of training graphs explicitly. By characterizing the node degrees, the statistics of the generated graphs become much closer to those of the training graphs. While other diffusion-based graph models struggle even to train or sample on large graphs, our approach can efficiently generate large graphs with desired statistical properties. 
We summarize our contributions as follows:</p><list rend="bulleted"><item>we use empty graphs as the convergent distribution in a discrete diffusion process to reduce computation;</item><item>we propose a new generative process that only predicts edges between a fraction of nodes in graphs;</item><item>we explicitly model node degrees in the probabilistic framework to improve graph statistics of generated graphs; and</item><item>we conduct an extensive empirical study and show that our method can efficiently generate large graphs with desired statistics.</item></list></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.">Background</head><p>This work considers graph generative models that sample adjacency matrices to generate graphs. Let A_N denote the space of adjacency matrices of size N. We consider simple graphs without self-loops or multi-edges, so an adjacency matrix A ∈ A_N is a binary symmetric matrix with a zero diagonal. A generative model defines a distribution over A_N.</p><p>In this work, we construct a generative model based on a discrete diffusion process <ref type="bibr">(Austin et al., 2021;</ref><ref type="bibr">Hoogeboom et al., 2021;</ref><ref type="bibr">Vignac et al., 2022)</ref>. Let A^0 denote a graph from the data; the diffusion process defined by q(A^t | A^{t-1}) then corrupts A^0 in T steps and forms a trajectory (A^0, A^1, ..., A^T). We treat (A^1, ..., A^T) as latent variables, so q(A^{1:T} | A^0) = ∏_{t=1}^T q(A^t | A^{t-1}). As T → ∞, q(A^T) approaches a convergent distribution, which is often a simple one that is easy to sample from. We choose T large enough that q(A^T) is a good approximation of the convergent distribution.</p><p>We model these trajectories with a denoising model p_θ(A^{t-1} | A^t) parameterized by θ; the model then has a joint distribution p_θ(A^{0:T}) = p(A^T) ∏_{t=1}^T p_θ(A^{t-1} | A^t) and a marginal p_θ(A^0) that describes the data distribution. Here p(A^T) is the convergent distribution of q.</p><p>Usually q(A^t | A^{t-1}) needs easy probability calculations. One choice is to treat each edge independently: q(A^t | A^{t-1}) = B(A^t; (1 - β_t) A^{t-1} + β_t p). Here B(x; μ) represents the Bernoulli distribution over x with probability μ; we also use B(A; μ) to represent the probability of independent Bernoulli variables arranged in a matrix. The diffusion rate β_t determines the probability of resampling the entry A^t_{i,j} from a Bernoulli distribution with probability p, instead of keeping the entry A^{t-1}_{i,j}. This diffusion process requires two special properties for model fitting. First, we can sample A^t at any time step t directly from A^0: let ᾱ_t = ∏_{s=1}^t (1 - β_s); then q(A^t | A^0) = B(A^t; ᾱ_t A^0 + (1 - ᾱ_t) p). The diffusion rates β_t are defined such that ᾱ_T is almost 0, so A^T is almost independent of A^0, i.e., q(A^T | A^0) ≈ p(A^T) ≡ B(A^T; p). The configuration of the β_t's is called noise scheduling. In the context of graph generation, p(A^T) is the Erdős-Rényi graph model G(N, p) <ref type="bibr">(Erdos et al., 1960)</ref>, with p being the probability of forming an edge between two nodes.</p><p>Second, we can compute the posterior of the forward transition when conditioning on A^0: q(A^{t-1} | A^t, A^0) = q(A^t | A^{t-1}) q(A^{t-1} | A^0) / q(A^t | A^0). Since all the terms on the right-hand side are known, the posterior can be computed analytically.</p><p>The generative model p_θ(A^{0:T}) is trained by maximizing a variational lower bound of log p_θ(A^0) <ref type="bibr">(Ho et al., 2020;</ref><ref type="bibr">Hoogeboom et al., 2021;</ref><ref type="bibr">Austin et al., 2021)</ref>. Intuitively, p_θ(A^{t-1} | A^t) is learned to match the posterior of the forward transition, q(A^{t-1} | A^t, A^0).</p><p>During generation, we sample A^T ∼ p(A^T) and then "denoise" it iteratively with p_θ(A^{t-1} | A^t) to obtain an A^0 sample.</p></div>
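The transition and its closed-form marginal above can be sketched numerically. Below is a minimal NumPy illustration, assuming symmetric binary adjacency matrices and the Bernoulli transition of this section; the helper names are ours, not the paper's code:

```python
import numpy as np

def symmetrize(A):
    # keep only the strict upper triangle and mirror it:
    # simple graph, zero diagonal
    S = np.triu(A, k=1)
    return S + S.T

def forward_step(A_prev, beta_t, p, rng):
    # q(A^t | A^{t-1}) = B(A^t; (1 - beta_t) A^{t-1} + beta_t p):
    # each entry is resampled from Bernoulli(p) w.p. beta_t, else kept
    N = A_prev.shape[0]
    resample = beta_t > rng.random((N, N))
    fresh = (p > rng.random((N, N))).astype(np.int8)
    return symmetrize(np.where(resample, fresh, A_prev))

def sample_t_given_0(A0, betas, t, p, rng):
    # closed-form marginal q(A^t | A^0) = B(A^t; abar_t A^0 + (1 - abar_t) p),
    # with abar_t the cumulative product of (1 - beta_s) up to step t
    abar_t = np.cumprod(1.0 - np.asarray(betas))[t - 1]
    prob = abar_t * A0 + (1.0 - abar_t) * p
    return symmetrize((prob > rng.random(A0.shape)).astype(np.int8))
```

With p = 0 this reduces to pure edge removal, which is exactly the setting adopted in Section 3.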
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.">Method</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1.">Diffuse graphs to empty graphs - a motivation</head><p>To improve computational efficiency, we advocate setting p = 0 and using G(N, 0) as the convergent distribution. This configuration increases the sparsity of the adjacency matrices in diffusion trajectories, thus reducing computation. We consider the amount of computation in the denoising model p_θ(A^{t-1} | A^t) from two aspects: the computation on the input A^t and the number of entries to be predicted in the output A^{t-1}.</p><p>We first consider the computation on the input side. We assume that the denoising model p_θ(A^{t-1} | A^t) is constructed with an MPNN. Suppose the input graph A^t has M_t edges; then a typical MPNN needs to perform O(M_t) message-passing operations to compute node vectors (here we treat hidden sizes and the number of network layers as constants). The total number of message-passing operations over the trajectory is O(∑_{t=1}^T M_t), whose expectation is E[∑_{t=1}^T M_t] = ∑_{t=1}^T (ᾱ_t M_0 + (1 - ᾱ_t) p N(N - 1)/2), where M_0 is the number of edges in A^0. By setting p = 0, we eliminate the second term and reduce the number of edges in the trajectory's graphs by a significant factor, so the MPNN performs far fewer message-passing operations.</p><p>We then analyze the number of entries we need to predict in the output A^{t-1}. When p = 0, the forward process is an edge-removal process, and the degree of a node is non-increasing over any forward transition. A node whose degree changes from step t-1 to step t is considered "active". When a node is inactive at t-1, all edges incident to this node are kept at t. Figure <ref type="figure">1</ref> shows the average number of active nodes for each forward transition. We observe that active nodes make up only a small fraction of all nodes when the convergent distribution is G(N, 0).</p></div>
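The cost argument above can be made concrete by writing out the expected edge counts along the trajectory. A small sketch (our own helper, using the schedule notation from Section 2): with p above 0 the noise term contributes (1 - ᾱ_t) p N(N-1)/2 expected edges per step, and it vanishes at p = 0.

```python
import numpy as np

def expected_total_edges(M0, N, betas, p):
    # E[sum_t M_t]: each original edge survives to step t w.p. abar_t, and
    # each of the N(N-1)/2 node pairs carries noise w.p. (1 - abar_t) * p
    abar = np.cumprod(1.0 - np.asarray(betas, dtype=float))
    return float(np.sum(abar * M0 + (1.0 - abar) * p * N * (N - 1) / 2.0))

# illustrative numbers (not from the paper): a sparse graph with
# N = 1000 nodes, M0 = 5000 edges, and a 128-step linear schedule
betas = np.linspace(1e-4, 0.1, 128)
dense = expected_total_edges(5000, 1000, betas, p=0.5)
sparse = expected_total_edges(5000, 1000, betas, p=0.0)
```

Under these assumed numbers, the p = 0 trajectory carries far fewer edges in total, which is exactly the message-passing work the MPNN avoids.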
<div xmlns="http://www.tei-c.org/ns/1.0"><p>While previous diffusion-based models make predictions for all node pairs, the observation above indicates that we can save computation by making predictions only for pairs of active nodes. In particular, the denoising model can first infer which nodes are active in each step and then only predict edges between active nodes. Below we develop such a model and only consider the diffusion process with p = 0.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.">A diffusion-based model that explicitly models active nodes</head><p>We treat the "active nodes" as latent variables s^{1:T} and incorporate them into both the forward and reverse processes. Let d^t = deg(A^t) be the node degree vector of A^t; then s^t = 1(d^{t-1} ≠ d^t) is a binary vector indicating which nodes are active (i.e., have a degree change) from step t-1 to step t. In the following, we redefine the forward and reverse processes.</p><p>Forward process. With the latent variables s^{1:T}, the forward process can be rewritten into the following decomposition: q(A^{1:T}, s^{1:T} | A^0) = ∏_{t=1}^T q(s^t | A^{t-1}, A^t) q(A^t | A^{t-1}), where q(s^t | A^{t-1}, A^t) is deterministic. The forward process does not change by including s^{1:T}, since the value of s^t is determined by A^{t-1} and A^t. This allows us to still use the forward transition q(A^t | A^{t-1}) to draw the entire sequence.</p><p>Reverse process. We decompose the denoising model as follows: p_θ(A^{t-1}, s^t | A^t) = p_θ(A^{t-1} | A^t, s^t) p_θ(s^t | A^t). Here both p_θ(A^{t-1} | A^t, s^t) and p_θ(s^t | A^t) are learnable distributions. Intuitively, the denoising model first predicts which nodes are active (s^t) and then generates edges between them to obtain A^{t-1}. Since we only predict edges between the active nodes indicated by s^t, all edges incident to inactive nodes are carried from A^t to A^{t-1} directly.</p><p>Our EDGE model is specified by ( <ref type="formula">6</ref>). The generative framework is illustrated in Figure <ref type="figure">2</ref>.</p></div>
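Because s^t is a deterministic function of two adjacent graphs, augmenting a sampled forward trajectory with active-node indicators requires no extra randomness. A small sketch (the helper name is ours):

```python
import numpy as np

def active_nodes(A_prev, A_next):
    # s^t = 1(deg(A^{t-1}) != deg(A^t)): a node is active at step t
    # exactly when its degree changed across the forward transition
    return (A_prev.sum(axis=1) != A_next.sum(axis=1)).astype(np.int8)
```

In the edge-removal process (p = 0), a node is active at step t precisely when at least one of its edges was removed at that step.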
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Figure 2</head><p>Forward and reverse processes. The forward process removes edges via q(A^t | A^{t-1}); the reverse process adds edges between active nodes via p_θ. In the forward process, A^t is sampled from q(A^t | A^{t-1}), and s^t is then deterministically generated given A^{t-1} and A^t. In the reverse process, s^t is first sampled from a node selection distribution p_θ(s^t | A^t), and A^{t-1} is then sampled from the parameterized distribution p_θ(A^{t-1} | A^t, s^t).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3.">Learning the reverse process</head><p>We optimize the model parameters θ by maximizing the variational lower bound (VLB) of log p(A^0). Following <ref type="bibr">Sohl-Dickstein et al. (2015)</ref> and <ref type="bibr">Ho et al. (2020)</ref>, the VLB decomposes into per-step terms: for each step t, an edge term L_edge(t), the KL divergence between the posterior q(A^{t-1} | A^t, s^t, A^0) and the model p_θ(A^{t-1} | A^t, s^t), and a node selection term L_node(t), the KL divergence between q(s^t | A^t, A^0) and p_θ(s^t | A^t). Unlike in Ho et al. (2020), the posterior q(A^{t-1} | A^t, s^t, A^0) is hard to compute, and there is no closed form for this term. Since the entropy H[q(A^{t-1} | A^t, s^t, A^0)] is a constant, we only optimize the cross-entropy term in L_edge(t) via Monte Carlo estimates. We leave variance reduction to future work.</p><p>For the node selection term L_node(t), we show that q(s^t | A^t, A^0) has a closed-form expression. In particular, we first derive the posterior of the node degree distribution, q(d^{t-1} | A^t, A^0) = ∏_i q(d^{t-1}_i | d^t_i, d^0_i), where q(d^{t-1}_i | d^t_i, d^0_i) = Bin(d^{t-1}_i - d^t_i; d^0_i - d^t_i, ᾱ_{t-1}β_t / (1 - ᾱ_t)). Here Bin(k; n, p) is a binomial distribution parameterized by n and p. Intuitively, a node degree d^{t-1}_i is only relevant to the node's degrees d^0_i and d^t_i at steps 0 and t. The actual edges do not affect the degree probability since each edge is added or removed independently. We provide a formal proof and discuss the forward node degree distribution in Appendix A.2.</p><p>Finally, we obtain the closed-form posterior q(s^t_i = 1 | A^t, A^0) = 1 - Bin(0; d^0_i - d^t_i, ᾱ_{t-1}β_t / (1 - ᾱ_t)), so each s^t_i is a Bernoulli variable, and the KL divergence L_node(t) turns out to be comparisons between Bernoulli distributions.</p></div>
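The closed-form node-selection posterior is cheap to evaluate. A sketch of q(s^t_i = 1 | A^t, A^0) under the edge-removal process, following the binomial posterior discussed in this section (the function is our own illustration):

```python
import numpy as np

def q_active_posterior(d0, dt, abar_prev, abar_t):
    # Each of a node's d0_i - dt_i removed edges was still present at step
    # t-1 w.p. mu = (abar_{t-1} - abar_t) / (1 - abar_t); the node is active
    # iff at least one such edge survived to t-1, a Bernoulli probability.
    d0 = np.asarray(d0, dtype=float)
    dt = np.asarray(dt, dtype=float)
    mu = (abar_prev - abar_t) / (1.0 - abar_t)
    return 1.0 - (1.0 - mu) ** (d0 - dt)
```

Nodes whose current degree already equals their degree at step 0 get probability exactly 0; this per-node Bernoulli probability is also the quantity reused for the parameter-free node selection in Section 3.4.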
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4.">Degree-guided graph generation</head><p>A graph's node degrees are often strongly correlated with its other statistics, so it is important for a generative model to capture the node degrees of training graphs. Here we directly incorporate degree information into the proposed generative model.</p><p>We explicitly model the node degrees d^0 of a graph A^0 as a random variable; the forward process then becomes q(A^{1:T}, s^{1:T}, d^0 | A^0) = q(d^0 | A^0) q(A^{1:T}, s^{1:T} | A^0). Here q(d^0 | A^0) = 1 because d^0 is determined by A^0. We also include d^0 in the generative model p(A^0, d^0). If the model guarantees that d^0 is the node degree vector of A^0, then p_θ(A^0) = p_θ(A^0, d^0) still models the graph data A^0. Even if p_θ(A^0, d^0) allows d^0 to differ a little from the true node degrees, it is still valid to maximize the likelihood p_θ(A^0, d^0 = A^0 1), where A^0 1 is the vector of node degrees, because training will encourage the model to generate A^0 and d^0 that match each other. We decompose the model by p_θ(A^{0:T}, s^{1:T}, d^0) = p_θ(d^0) p_θ(A^{0:T}, s^{1:T} | d^0). With this decomposition, we first sample node degrees d^0 from p_θ(d^0), then generate a graph under the degree constraint (see Alg. 1). Correspondingly, the denoising model becomes p_θ(A^{t-1}, s^t | A^t, d^0) = p_θ(A^{t-1} | A^t, s^t, d^0) p_θ(s^t | A^t, d^0). We separate the optimizations of the node degree model p_θ(d^0) and the graph denoising model p_θ(A^{0:T}, s^{1:T} | d^0); the entire training objective is L(θ) = L(d^0; θ) + L(A^0 | d^0; θ). (See Appendix B.2 for the detailed derivation.) For L(d^0; θ), we treat learning the node degree distribution as a sequence modeling task. The decomposition of L(A^0 | d^0; θ) remains the same as Eqn. ( <ref type="formula">7</ref>), except that all terms related to the graph denoising model now condition on d^0. In particular, for the node selection distribution, we consider a special parameterization by setting p_θ(s^t | A^t, d^0) := q(s^t | d^t, d^0) in Eqn. ( <ref type="formula">9</ref>). Note that the node selection distribution then contains no learnable parameters. Moreover, since the KL divergence L_node(t) is now zero, we can further simplify L(A^0 | d^0; θ) to a sum over the edge terms L_edge(t) alone.</p><p>In our framework, the node degree constraint d^0 is mainly imposed through p_θ(s^t | A^t, d^0): only nodes with a degree below the specified degree d^0 may be selected to participate in edge prediction. On the other hand, though the exact probability p_θ(A^{t-1} | A^t, s^t, d^0) includes information about the maximum number of edges (d^0 - d^t) that can still be added to each node, this is not easy to track during edge formation. Here we simply augment the inputs to the neural network with d^0. In practice, we found that the specified node degrees d^0 accurately control the actual node degrees of the generated graphs.</p><p>Degree-guided generation turns out to be very useful in modeling large graphs. We reason that conditioning on d^0 significantly reduces the number of possible trajectories along which a graph can evolve, thus reducing the modeling complexity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.5.">Implementation</head><p>We briefly describe the implementation of p_θ(s^t | A^t), p_θ(A^{t-1} | A^t, s^t), and p_θ(d^0). Note that we use the same network architecture for p_θ(A^{t-1} | A^t, s^t) and p_θ(A^{t-1} | A^t, s^t, d^0), except that the inputs to the latter include d^0. We treat p_θ(s^t | A^t) as a node classification problem and p_θ(A^{t-1} | A^t, s^t) as a link prediction problem. Both components share the same MPNN, which takes A^t as input and computes node representations Z^t ∈ R^{N×d_h} for all nodes. The hidden dimension d_h is a hyper-parameter. One network head uses Z^t to predict s^t, and another uses Z^t[s^t] to predict links between the active nodes indicated by s^t. For the node degree model p_θ(d^0), if there are multiple graphs in the dataset, we use a recurrent neural network (RNN) to fit the histogram of node degrees. If there is only one graph with node degrees d^0, then we set p_θ(d^0) = 1 directly. Implementation details are in Appendix C.</p><p>Algorithm 1 (Degree-guided graph generation). Input: empty graph A^T, graph model p_θ(A^{t-1} | A^t, s^t), degree sequence model p_θ(d^0), and number of diffusion steps T. Output: generated graph A^0. Draw d^0 ∼ p_θ(d^0); then for t = T, ..., 1: draw s^t ∼ q(s^t | deg(A^t), d^0) and draw A^{t-1} ∼ p_θ(A^{t-1} | A^t, s^t).</p></div>
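Algorithm 1 can be sketched as a plain sampling loop. The three callables below stand in for the learned components and are hypothetical interfaces for illustration, not the paper's actual API:

```python
import numpy as np

def generate(N, T, degree_model, node_selector, edge_predictor, rng):
    # Degree-guided generation: start from the empty graph A^T and
    # iteratively add edges between sampled active nodes.
    A = np.zeros((N, N), dtype=np.int8)
    d0 = degree_model()                      # d^0 ~ p_theta(d^0)
    for t in range(T, 0, -1):
        probs = node_selector(A, d0, t)      # q(s^t | deg(A^t), d^0)
        s = (probs > rng.random(N)).astype(np.int8)
        A = edge_predictor(A, s, t)          # A^{t-1} ~ p_theta(. | A^t, s^t)
    return A
```

Inactive nodes never enter the link-prediction head, so edges incident to them are carried over unchanged, matching the reverse process of Section 3.2.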
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.6.">Model analysis</head><p>Complexity analysis. Let M represent the number of edges in a graph, and K the maximum number of active nodes during the reverse process. In each generation step t, the MPNN needs O(M) operations to compute node representations, O(N) operations to predict s^t, and O(K^2) operations to predict links between the K active nodes. The factor K depends on the noise scheduling: we find that K is smaller than N by at least one order of magnitude when the noise scheduling is linear. Over T generation steps, the overall running time is O(T max(K^2, M)). As a comparison, previous diffusion-based models need running time O(TN^2) because they make O(N^2) link predictions at each time step.</p><p>Expressivity analysis. EDGE modifies a graph over multiple iterations to generate a sample. In each iteration, it adds new edges to the graph based on the graph structure from the previous iteration. Therefore, EDGE is NOT an edge-independent model and does not have the limitation analyzed by <ref type="bibr">Chanpuriya et al. (2021)</ref>; thus it has a theoretical advantage over previous one-shot generative models.</p><p>The ability of EDGE might be affected by the underlying MPNN, which may not be able to distinguish different graph structures due to expressivity issues <ref type="bibr">(Xu et al., 2018)</ref>. However, this issue can be overcome by choosing more expressive GNNs <ref type="bibr">(Sato, 2020)</ref>. We defer such discussion to future work.</p></div>
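The complexity gap can be made concrete with a back-of-the-envelope helper (the numbers below are illustrative assumptions, not measurements from the paper):

```python
def per_step_cost_ratio(N, M, K):
    # dense diffusion models score all ~N^2 node pairs per step, while EDGE
    # does O(M) message passing plus O(K^2) link predictions
    return (N * N) / max(K * K, M)

# e.g. a graph with N = 1000 nodes, M = 5000 edges, and K = 50 active
# nodes per step would see a per-step saving of about 200x
ratio = per_step_cost_ratio(1000, 5000, 50)
```

The saving grows with graph size whenever the graph stays sparse and K stays far below N, which is the regime the linear noise schedule targets.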
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.">Related Work</head><p>Edge-independent models, which assume edges are formed independently with some probabilities, are prevalent in probabilistic models for large networks. They include classical models such as ER graph models <ref type="bibr">(Erdos et al., 1960)</ref> and SBMs <ref type="bibr">(Holland et al., 1983)</ref>, as well as recent neural models such as variational graph auto-encoders <ref type="bibr">(Kipf &amp; Welling, 2016b;</ref><ref type="bibr">Mehta et al., 2019;</ref><ref type="bibr">Li et al., 2020;</ref><ref type="bibr">Chen et al., 2022a)</ref> and NetGAN and its variant <ref type="bibr">(Bojchevski et al., 2018;</ref><ref type="bibr">Rendsburg et al., 2020)</ref>. Recent works show that these models cannot reproduce desired statistics of the target network, such as triangle counts, clustering coefficients, and square counts <ref type="bibr">(Seshadhri et al., 2020;</ref><ref type="bibr">Chanpuriya et al., 2021)</ref>.</p><p>Deep auto-regressive (AR) graph models <ref type="bibr">(Li et al., 2018;</ref><ref type="bibr">You et al., 2018;</ref><ref type="bibr">Liao et al., 2019;</ref><ref type="bibr">Zang &amp; Wang, 2020;</ref><ref type="bibr">Han et al., 2023)</ref> generate a graph by sequentially filling in its adjacency matrix. These algorithms are slow because they need to make N^2 predictions. <ref type="bibr">Dai et al. (2020)</ref> propose a method that leverages graph sparsity and predicts only the non-zero entries in the adjacency matrix. Long-term memory is a typical issue of these sequential models, so it is hard for them to model global graph properties. 
Moreover, these models are not invariant with respect to the node orderings of training graphs, and special techniques <ref type="bibr">(Chen et al., 2021;</ref><ref type="bibr">Han et al., 2023)</ref> are often needed to train them.</p><p>Diffusion-based generative models have been shown to be powerful in generating high-quality graphs <ref type="bibr">(Niu et al., 2020;</ref><ref type="bibr">Liu et al., 2019;</ref><ref type="bibr">Jo et al., 2022;</ref><ref type="bibr">Haefeli et al., 2022;</ref><ref type="bibr">Chen et al., 2022b;</ref><ref type="bibr">Vignac et al., 2022;</ref><ref type="bibr">Kong et al.)</ref>. By "tailoring" a graph over multiple steps, these models can capture edge correlations, and they overcome the limitations of auto-regressive models as well. However, all previous diffusion-based models focus on generation tasks with small graphs. This work aims to scale diffusion-based models to large graphs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.">Experiments</head><p>We empirically evaluate our proposed approach from two perspectives: whether it can capture statistics of training graphs and whether it can generate graphs efficiently.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1.">Experimental setup</head><p>Datasets. We conduct experiments on both generic graph datasets and large networks. The generic graph datasets consist of multiple graphs of varying sizes. Here we consider the Community and Ego datasets <ref type="bibr">(You et al., 2018)</ref>, both of which contain graphs with hundreds of nodes. We also consider four real-world networks, Polblogs <ref type="bibr">(Adamic &amp; Glance, 2005)</ref>, Cora <ref type="bibr">(Sen et al., 2008)</ref>, Road-Minnesota <ref type="bibr">(Rossi &amp; Ahmed, 2015)</ref>, and PPI <ref type="bibr">(Stark et al., 2010)</ref>, each of which contains thousands of nodes. We also use the QM9 dataset <ref type="bibr">(Ramakrishnan et al., 2014)</ref> to demonstrate that EDGE can be easily extended to generate graphs with attributes. The statistics of the datasets are shown in Table <ref type="table">1</ref>.</p><p>Baselines. For generic graphs, we compare EDGE to six recent deep generative graph models: two auto-regressive graph models, GraphRNN <ref type="bibr">(You et al., 2018)</ref> and GRAN <ref type="bibr">(Liao et al., 2019)</ref>; three diffusion-based models, GDSS <ref type="bibr">(Jo et al., 2022)</ref>, DiscDDPM <ref type="bibr">(Haefeli et al., 2022)</ref>, and DiGress <ref type="bibr">(Vignac et al., 2022)</ref>; and one flow-based model, GraphCNF <ref type="bibr">(Lippe &amp; Gavves, 2020)</ref>. For large networks, we follow <ref type="bibr">Chanpuriya et al. (2021)</ref> and use six edge-independent models: VGAE <ref type="bibr">(Kipf &amp; Welling, 2016b)</ref>, CELL <ref type="bibr">(Rendsburg et al., 2020)</ref>, TSVD <ref type="bibr">(Seshadhri et al., 2020)</ref>, and three methods proposed by <ref type="bibr">Chanpuriya et al. (2021)</ref> (CCOP, HDOP, Linear). We also include GraphRNN as a baseline because it is still affordable to train on large networks. 
For the QM9 dataset, we compare EDGE against GDSS <ref type="bibr">(Jo et al., 2022)</ref> and DiGress <ref type="bibr">(Vignac et al., 2022)</ref>. The implementation of our model is available at github.com/tufts-ml/graph-generation-EDGE.</p><p>Evaluation. We examine the generated generic graphs with both structure-based and neural-based metrics. For structure-based metrics, we evaluate the Maximum Mean Discrepancy (MMD) <ref type="bibr">(Gretton et al., 2012)</ref> between test graphs and generated graphs in terms of degrees, clustering coefficients, and orbit counts <ref type="bibr">(You et al., 2018)</ref>. For neural-based metrics, we evaluate the FID and MMD RBF metrics proposed by <ref type="bibr">Thompson et al. (2022)</ref>. All implementations of the evaluation are provided by <ref type="bibr">Thompson et al. (2022)</ref>. For all these metrics, smaller is better.</p><p>For each large network, we follow <ref type="bibr">Chanpuriya et al. (2021)</ref> and evaluate how well the graph statistics of the generated network match the ground truths, which are statistics computed from the training data. We consider the following statistics: power-law exponent of the degree sequence (PLE); normalized triangle counts (NTC); global clustering coefficient (CC) <ref type="bibr">(Chanpuriya et al., 2021)</ref>; characteristic path length (CPL); and assortativity coefficient (AC) <ref type="bibr">(Newman, 2002)</ref>. We also report the edge overlap ratio (EO) between the generated network and the original one to check the degree to which a model memorizes the training graph. A graph generated by a good model should have statistics similar to the true values. For the QM9 dataset, we evaluate Validity, Uniqueness, Fréchet ChemNet Distance <ref type="bibr">(Preuer et al., 2018)</ref>, and Scaffold similarity <ref type="bibr">(Bemis &amp; Murcko, 1996)</ref> on the samples generated by the baselines and our proposed method. 
We use the molsets library <ref type="bibr">(Polykovskiy et al., 2020)</ref> to implement the evaluation.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2.">Evaluation of sample quality</head><p>Generic graph generation. Table <ref type="table">2</ref> summarizes the evaluation of generated graphs on the Community and Ego datasets. Best performances are in bold, and second-best performances are underlined. EDGE outperforms all baselines on 8 out of 10 metrics; on the other two, EDGE performs only slightly worse than the best. We hypothesize that EDGE gains an advantage by modeling node degrees because they are informative about the graph structure.</p><p>Large network generation. Unlike edge-independent models, the edge overlap ratios of GraphRNN and our approach are not tunable. To make a fair comparison, we report the performance of the edge-independent models that have a similar or higher EO than GraphRNN and EDGE.</p><p>Table <ref type="table">3</ref> shows the statistics of the network itself (labeled as "True") and the statistics computed from generated graphs. The statistics nearest to the true values are considered best performances and are in bold; second-best performances are underlined.</p><p>The proposed approach shows superior performance on all four networks. The improvements on the Polblogs and PPI networks are clear. On the Road-Minnesota dataset, EDGE has a much smaller EO than the edge-independent models, but its performance in capturing graph statistics is similar to those models. On the Cora dataset, EDGE also has an EO much smaller than the edge-independent models, yet it still slightly improves over them. Road-Minnesota and Cora are both sparse networks, where the message-passing neural model may not work at full strength. We note that GraphRNN cannot even compete with the edge-independent models. We also visualize the generated graphs for Polblogs in Figure <ref type="figure">4</ref>.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.3.">Efficiency</head><p>We compare the sampling efficiency of EDGE against other deep generative graph models. To make a consistent comparison across all datasets, we record the average time for a model to sample one graph; the average sampling time for each dataset is computed over 128 runs. Figure <ref type="figure">3</ref> reports the results.</p><p>We also conduct an ablation study with four model variants: 1) setting G(N, 0.5) as the convergent distribution and using an MPNN as the denoising model p_θ(A^{t-1} | A^t); 2) setting G(N, 0) as the convergent distribution and directly using an MPNN as the denoising model (without modeling active nodes or degree guidance); 3) the EDGE model without degree guidance; and 4) the full EDGE model. Table <ref type="table">5</ref> shows the performances of the four models. If we set the convergent distribution to G(N, 0.5), we cannot even train such a model since it requires an excessively large amount of GPU memory. This justifies our use of G(N, 0) as the convergent distribution. The introduction of s^{1:T} (Section 3.2) significantly improves the sampling speed. Finally, the full EDGE approach, which explicitly models node degrees d^0 and generates graphs with degree guidance, further improves the generative performance.</p><p>We further study how the model performance varies with the number of diffusion steps T and the noise scheduling. Specifically, we train our model on three large networks with T ∈ {64, 128, 256, 512, 1024} and report the model performance in Table <ref type="table">6</ref>. Unlike traditional diffusion models, in which more diffusion steps usually yield better performance, a larger T for our model does not always improve performance. For instance, T = 64 gives the best performance on the Cora and Road-Minnesota datasets. Our explanation for this observation is the high level of sparsity in the training graphs: with a large total number of generation steps T, the model can only identify a few active nodes and predict edges between them at each time step. The model then faces a highly imbalanced classification problem, which may lead to poor model convergence. 
Such an issue is not observed for relatively denser graphs, e.g., the Polblogs and PPI datasets, which require a relatively large T to guarantee good model performance. When T is large enough (T = 128 for Polblogs and T = 256 for PPI), further increasing T does not improve the model performance.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6.">Conclusion</head><p>In this work, we propose EDGE, a generative graph model based on a discrete diffusion process. By leveraging sparsity in the diffusion process, EDGE significantly improves computational efficiency and scales to graphs with thousands of nodes. By explicitly modeling node degrees, EDGE improves its ability to capture important statistics of training graphs. Our extensive empirical study shows that EDGE has superior performance in benchmark graph generation tasks in terms of both computational efficiency and generation quality.</p></div></body>
		</text>
</TEI>
