<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Distance Encoding: Design Provably More Powerful Neural Networks for Graph Representation Learning</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2020</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10219229</idno>
					<idno type="doi"></idno>
					<title level='j'>Neural Information Processing Systems (NeurIPS)</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Pan Li</author><author>Yanbang Wang</author><author>Hongwei Wang</author><author>Jure. Leskovec</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Learning representations of sets of nodes in a graph is crucial for applications ranging from node-role discovery to link prediction and molecule classification. Graph Neural Networks (GNNs) have achieved great success in graph representation learning. However, expressive power of GNNs is limited by the 1-Weisfeiler-Lehman (WL) test and thus GNNs generate identical representations for graph substructures that may in fact be very different. More powerful GNNs, proposed recently by mimicking higher-order-WL tests, only focus on representing entire graphs and they are computationally inefficient as they cannot utilize sparsity of the underlying graph. Here we propose and mathematically analyze a general class of structure related features, termed Distance Encoding (DE). DE assists GNNs in representing any set of nodes, while providing strictly more expressive power than the 1-WL test. DE captures the distance between the node set whose representation is to be learned and each node in the graph. To capture the distance DE can apply various graph-distance measures such as shortest path distance or generalized PageRank scores. We propose two ways for GNNs to use DEs (1) as extra node features, and (2) as controllers of message aggregation in GNNs. Both approaches can utilize the sparse structure of the underlying graph, which leads to computational efficiency and scalability. We also prove that DE can distinguish node sets embedded in almost all regular graphs where traditional GNNs always fail. We evaluate DE on three tasks over six real networks: structural role prediction, link prediction, and triangle prediction. Results show that our models outperform GNNs without DE by up-to 15% in accuracy and AUROC. Furthermore, our models also significantly outperform other state-of-the-art methods especially designed for the above tasks.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Graph representation learning aims to learn representation vectors of graph-structured data <ref type="bibr">[1]</ref>. Representations of node sets in a graph can be leveraged for a wide range of applications, such as discovery of functions/roles of nodes based on individual node representations <ref type="bibr">[2]</ref><ref type="bibr">[3]</ref><ref type="bibr">[4]</ref><ref type="bibr">[5]</ref><ref type="bibr">[6]</ref>, link or link type prediction based on node-pair representations <ref type="bibr">[7]</ref><ref type="bibr">[8]</ref><ref type="bibr">[9]</ref><ref type="bibr">[10]</ref> and graph comparison or molecule classification based on entire-graph representations <ref type="bibr">[11]</ref><ref type="bibr">[12]</ref><ref type="bibr">[13]</ref><ref type="bibr">[14]</ref><ref type="bibr">[15]</ref><ref type="bibr">[16]</ref><ref type="bibr">[17]</ref>.</p><p>&#119907; ( WLGNN-p to represent T = (S, A), |S| = p Initialize: For all v &#8712; V , h</p><p>v = A vv For layers l = 0, 1, ..., L -1 and all v &#8712; V , do: h</p><p>Figure <ref type="figure">1</ref>: (a) 3-regular graph with 8 nodes. Briefly assume that all node attributes are the same and the nodes can only be distinguished based on their network structure. Then for all nodes, WLGNN will produce the same representation and thus fail to distinguish them. However, nodes with different colors should have different representations, as they are not structurally equivalent (or "isomorphic" as defined in Section 2). Furthermore, WLGNN cannot distinguish all the node-pairs (e.g., {v1, v2} vs {v4, v7}). However, if we use shortest-pathdistances (SPDs) between nodes as features we can distinguish blue nodes from green and red nodes because there is another node with SPD= 3 to a blue node of interest (e.g., SPD between v3 and v8), while all SPDs between other nodes to red/green nodes are less than 3. Note that the structural equivalence between any two nodes of the same color can be obtained from the reflexivity of the graph while the equivalence between two vertically-aligned blue nodes can be further obtained from the node permutation shown in the right. (b) WLGNN algorithm to represent a node set S of size pfi(&#8226;)'s are arbitrary neural networks; AGG(&#8226;)'s are set-pooling operators; L is the number of layers.</p><p>(2) Iteratively update the representation of each node by aggregating over the representations of its neighboring nodes; (3) Readout the final representation of a single node, a set of nodes, or the entire node set as required by the task. Under the above framework, researchers have proposed many GNN architectures <ref type="bibr">[14]</ref><ref type="bibr">[15]</ref><ref type="bibr">[16]</ref><ref type="bibr">[20]</ref><ref type="bibr">[21]</ref><ref type="bibr">[22]</ref><ref type="bibr">[23]</ref>. Interested readers may refer to tutorials on GNNs for further details <ref type="bibr">[1,</ref><ref type="bibr">24]</ref>.</p><p>Despite the success of GNNs, their representation power in representation learning is limited <ref type="bibr">[16]</ref>. Recent works proved that the representation power of GNNs that follow the above framework is bounded by the 1-WL test <ref type="bibr">[16,</ref><ref type="bibr">25,</ref><ref type="bibr">26]</ref> (We shall refer to these GNNs as WLGNNs). Concretely, WLGNNs yield identical vector representations for any subgraph structure that the 1-WL test cannot distinguish. Consider an extreme case: If node attributes are all nodes are the same, then for any node in a r-regular graph GNN will output identical representation. Such an issue becomes even worse when WLGNNs are used to extract representations of node sets, e.g., node-pairs for link prediction (Fig. <ref type="figure">1 (a)</ref>). A few works have been recently proposed to improve the power of WLGNNs <ref type="bibr">[27]</ref>. However, they either focus on building theory only for entire-graph representations <ref type="bibr">[26]</ref><ref type="bibr">[27]</ref><ref type="bibr">[28]</ref><ref type="bibr">[29]</ref><ref type="bibr">[30]</ref>, or show empirical success using heuristic methods without strong theoretical characterization <ref type="bibr">[9,</ref><ref type="bibr">10,</ref><ref type="bibr">17,</ref><ref type="bibr">[31]</ref><ref type="bibr">[32]</ref><ref type="bibr">[33]</ref>. We review these methods in detail in Section 4.</p><p>Here we address the limitations of WLGNNs and propose and mathematically analyze a new class of node features, termed Distance Encoding (DE). DE comes with both theoretical guarantees and empirical efficiency. Given a node set S whose structural representation is to be learnt, for every node u in the graph DE is defined as a mapping of a set of landing probabilities of random walks from each node of the set S to node u. DE may use measures such as shortest path distance (SPD) and generalized PageRank scores <ref type="bibr">[34]</ref>. DE can be combined with any GNN architecture in simple but effective ways: First, we propose DE-GNN that utilizes DE as an extra node feature. We further enhance DE-GNN by allowing DE to control the message aggregation procedure of WLGNNs, which yields another model DEA-GNN. Since DE purely depends on the graph structure and is independent of node identifiers, DE also provides inductive and generalization capability. We mathematically analyze the expressive power of DE-GNN and DEA-GNN for structural representation learning. We prove that the two models are able to distinguish two non-isomorphic equally-sized node sets (including nodes, node-pairs, . . . , entire-graphs) that are embedded in almost all sparse regular graphs, where WLGNN always fails to distinguish them unless discriminatory node/edge attributes are available. We also prove that the two models are not more powerful than WLGNN when applied to distance regular graphs <ref type="bibr">[35]</ref>, which implies the limitation of DEs. However, we show that DE has an extra power to learn the structural representations of node-pairs over distance regular graphs <ref type="bibr">[35]</ref>.</p><p>We experimentally evaluate DE-GNN and DEA-GNN on three levels of tasks including nodestructural-role classification (node-level), link prediction (node-pair-level), triangle prediction (nodetriad-level). Our methods outperform WLGNN on all three tasks by up-to 15% improvement in average accuracy. Our methods also outperform other baselines specifically designed for these tasks.</p><p>The invariant property is critical for the inductive and generalization capability as it frees structural representations from node identifiers and effectively reduces the problem dimension by incorporating the symmetry of the parameter space <ref type="bibr">[29]</ref> (e.g., the convolutional layers in GCN <ref type="bibr">[20]</ref>). The invariant property also implies that structural representations do not allow encoding the absolute positions of S in the graph.</p><p>The definition of structural representation is very general. Suppose we set two node sets S (1) , S (2) as two single nodes and set two graph structures A (1) and A (2) as the ego-networks around these two nodes. Then, the definition of structural representation provides a mathematical characterization the concept "structural roles" of nodes <ref type="bibr">[3,</ref><ref type="bibr">5,</ref><ref type="bibr">6]</ref>, where two far-away nodes could have the same structural roles (representations) as long as their ego-networks have the same structure.</p><p>Note that depending on the application one can vary/select the size p of the node set S. For example, when p = 1 then we are in the regime of node classification, p = 2 is link prediction, and when S = V , structural representations reduce to entire graph representations. However, in this work we will primarily focus on the case that the node set S has a fixed and small size p, where p does not depend on the graph size n. Although Corollary 3.4 later shows the potential of our techniques on learning the entire graph representations, this is not the main focus of our work here. We expect the techniques proposed here can be further used for entire-graph representations while we leave the detailed investigation for future work.</p><p>Although structural representation defines a more general concept, it shares some properties with traditional entire-graph representation. For example, the universal approximation theorem regarding entire-graph representation <ref type="bibr">[30]</ref> can be directly generalized to the case of structural representations: Theorem 2.6. If structural representations &#915; are different over any two non-isomorphic tuples T 1 and T 2 in &#8486; p , then for any invariant function f : &#8486; p &#8594; R, f can be universally approximated by feeding &#915; into a 3-layer feed-forward neural network with ReLu as the activation function, as long as (1) the feature space A is compact and (2) f (S, &#8226;) is continuous over A for any S &#8712; P p (V ).</p><p>Theorem 2.6 formally establishes the relation between learning structural representations and distinguishing non-isomorphic structures, i.e., &#915;(T 1 ) = &#915;(T 2 ) iff T 1 and T 2 are non-isomorphic. However, no polynomial algorithm has been found to distinguish even just non-isomorphic entire graphs (S = V ) without node/edge attributes (A = A), which is known as the graph isomorphism problem <ref type="bibr">[36]</ref>. In this work, we will use the range of non-isomorphic structures that GNNs can distinguish to characterize their expressive power for graph representation learning.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2.2">Weisfeiler-Lehman Tests and WLGNN for Structural Representation Learning</head><p>Weisfeiler-Lehman test (WL-test) is a family of very successful algorithmic heuristics used in graph isomorphism problems <ref type="bibr">[25]</ref>. 1-WL test, the simplest one among this family, starts with coloring nodes with their degrees, then it iteratively aggregates the colors of nodes and their neighborhoods, and hashes the aggregated colors into unique new colors. The coloring procedure finally converges to some static node-color configuration. Here a node-color configuration is a multiset that records the types of colors and their numbers. Different node-color configurations indicate two graphs are non-isomorphic while the reverse statement is not always true.</p><p>More than the graph isomorphism problem, the node colors obtained by the 1-WL test naturally provide a test of structural isomorphism. Consider two tuples T 1 = (S (1) , A (1) ) and T 2 = (S (2) , A (2) ) according to Definition 2.3. We temporarily ignore node/edge attributes for simplicity, so A (1) , A (2)  reduce to adjacent matrices. It is easy to show that different node-color configurations of nodes in S (1) and in S (2) obtained by the 1-WL test also indicate that T 1 and T 2 are not isomorphic.</p><p>WLGNNs refer to those GNNs that mimic the 1-WL test to learn structural representation, which is summarized in Fig. <ref type="figure">1 (b)</ref>. It covers many well-known GNNs of which difference may appear in the implementation of neural networks f i and set-poolings AGG(&#8226;) (Fig. <ref type="figure">1 (b)</ref>), including GCN <ref type="bibr">[20]</ref>, GraphSAGE <ref type="bibr">[21]</ref>, GAT <ref type="bibr">[22]</ref>, MPNN <ref type="bibr">[14]</ref>, GIN <ref type="bibr">[16]</ref> and many others <ref type="bibr">[29]</ref>. Note that we use WLGNN-p to denote the WLGNN that is to learn structural representations of node sets S with size |S| = p. One may directly choose S = V to obtain the entire-graph representation. Theoretically, the structural representation power of WLGNN-p is provably bounded by the 1-WL test <ref type="bibr">[16]</ref>. The result can be also generalized to the case of structural representations as follows.</p><p>Theorem 2.7. Consider two tuples T 1 = (S (1) , A (1) ) and T 2 = (S (2) , A (2) ) in &#8486; p . If T 1 , T 2 cannot be distinguished by the 1-WL test, then the corresponding outputs of WLGNN-p satisfy &#915;(T 1 ) = &#915;(T 2 ). On the other side, if they can be distinguished by the 1-WL test and we suppose aggregation operations (AGG) and neural networks f 1 , f 2 are all injective mappings, then with a large enough number of layers L, the outputs of WLGNN-p also satisfy &#915;(T 1 ) = &#915;(T 2 ).</p><p>Because of Theorem 2.7, WLGNN inherits the limitation of the 1-WL test. For example, WLGNN cannot distinguish two equal-sized node sets in all r-regular graphs (unless node/edge features are discriminatory). Here, a r-regular graph means that all its nodes have degree r. Therefore, researchers have recently focused on designing GNNs with expressive power greater than the 1-WL test. Here we will improve the power of GNNs by developing a general class of structural features. Later we use &#950;(u|S) for brevity where A could be inferred from the context. For simplicity, we choose DE as a set aggregation (e.g., the sum-pooling) of DEs between nodes u, v where v &#8712; S:</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Distance Encoding and Its Power</head><p>More complicated DE may be used while this simple design can be efficiently implemented and achieves good empirical performance. Then, the problem reduces to choosing a proper &#950;(u|v). Again for simplicity, we consider the following class of functions that is based on the mapping of a list of landing probabilities of random walks from v to u over the graph, i.e.,</p><p>where W = AD -1 is the random walk matrix, f 3 may be simply designed by some heuristics or be parameterized and learnt as a feed-forward neural network. In practice, a finite length of vu , say 3,4, is enough. Note that Eq. ( <ref type="formula">2</ref>) covers many important distance measures. First, setting f 3 ( uv ) as the first non-zero position in uv gives the shortest-path-distance (SPD) from v to u. We denote this specific choice as &#950; spd (u|v). Second, one may also use generalized PageRank scores <ref type="bibr">[34]</ref>:</p><p>Note that the permutation invariant property of DE is beneficial for inductive learning, which fundamentally differs from positional node embeddings such as node2vec <ref type="bibr">[37]</ref> or one-hot node identifiers. In the rest of this work, we will show that DE improves the expressive power of GNNs in both theory and practice. In Section 3.2, we use DE as extra node features. We term this model as DE-GNN, and theoretically demonstrate its expressive power. In the next subsection, we further use DE-1 to control the aggregation procedure of WLGNN. We term this model as DEA-GNN and extend our theory there.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">DE-GNN-Distance Encodings as Node Features</head><p>DE can be used as extra node features. Specifically, we improve WLGNNs by setting h</p><p>where &#8853; is the concatenation. We call the obtained model DE-GNN. We similarly use DE-GNN-p to specify the case when |S| = p. For simplicity, we give the following definition. Definition 3.2. DE-GNN is called proper if f 1 , f 2 , AGGs in the WLGNN (Fig. <ref type="figure">1 (b</ref>)), and AGG in Eq. ( <ref type="formula">1</ref>), f 3 in Eq. ( <ref type="formula">2</ref>) are injective mappings as long as the input features are all countable.</p><p>We know that a proper DE-GNN exists because of the universal approximation theorem of feedforward networks (to construct f i , i &#8712; {1, 2, 3}) and Deep Sets <ref type="bibr">[38]</ref> (to construct AGGs).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.1">The Expressive Power of DE-GNN</head><p>Next, we demonstrate the power of DE-GNN to distinguish structural representations. Recall that the fundamental limit of WLGNN is the 1-WL test for structural representation (Theorem 2.7). One important class of graphs that cannot be distinguished by the 1-WL test are regular graphs (although, in practice, node/edge attributes may help diminish such difficulty by breaking the symmetry). In theory, we may consider the most difficult case by assuming that no node/edge attributes are available. In the following, our main theorem shows that even in the most difficult case, DE-GNN is able to distinguish two equal-sized node sets that are embedded in almost all r-regular graphs. One example where DE-GNN using &#950; spd (&#8226;) (SPD) is shown in Fig. <ref type="figure">1</ref> (a): The blue nodes can be easily distinguished from the green or red nodes as SPD= 3 may appear between two nodes when a blue node is the node set of interest, while all SPDs from other nodes to red and green nodes are less than 3. Actually, DE-GNN-1 with SPD may also distinguish the red or green nodes by investigating its procedure in details (Fig. <ref type="figure">3</ref> in Appendix). Theorem 3.3. Given two equal-sized sets S (1) , S (2) &#8834; V , |S (1) | = |S (2) | = p. Consider two tuples T (1) = (S (1) , A (1) ) and T (2) = (S (2) , A (2) ) in the most difficult setting where features A (1)  and A (2) are only different in graph structures specified by A (1) and A (2) respectively. Suppose A (1) and A (2) are uniformly independently sampled from all r-regular graphs over V where 3 &#8804; r &lt; (2 log n) 1/2 . Then, for any small constant &gt; 0, within L &#8804; ( 1 2 + ) log n log(r-1) layers, there exists a proper DE-GNN-p using DEs &#950;(u|S (1) ), &#950;(u|S (2) ) for all u &#8712; V , such that with probability 1 -o(n -1 ), the outputs &#915;(T (1) ) = &#915;(T (2) ). Specifically, f 3 can be simply chosen as SPD, i.e., &#950;(u|v) = &#950; spd (u|v). The big-O notations here and later are w.r.t. n. Remark 3.1. In some cases, we are to learn representations of structures that lie in a single large graph, i.e., A (1) = A (2) . Actually, there is no additional difficulty to extend the proof of Theorem 3.3 to this setting as long as A (1) (= A (2) ) is uniformly sampled from all r-regular graphs and S (1) &#8745; S (2) = &#8709;. The underlying intuition is that for large n, the local subgraphs (within L-hop neighbors) around two non-overlapping fixed sets S (1) , S (2) are almost independent. Simulation results to validate the single node case (p = 1) of Theorem 3.3 and Remark 3.1 are shown in Fig. <ref type="figure">2 (a)</ref>.</p><p>Actually, the power of structural representations of small node sets can be used to further characterize the power of entire graph representations. Consider that we directly aggregate all the representations of nodes of a graph output by DE-GNN-1 via set-pooling as the graph representation, which is a common strategy adopted to learn graph representation via WLGNN-1 <ref type="bibr">[14,</ref><ref type="bibr">16,</ref><ref type="bibr">23]</ref>. So how about the power of DE-GNN-1 to represent graphs? To answer this question, suppose two n-sized r-regular graphs A (1) and A (2) satisfy the condition in Theorem 3.3. Then, by using a union bound, Theorem 3.3 indicates that for a node v &#8712; V , its representation &#915;((v, A (1) )) &#8712; {&#915;((u, A (2) ))|u &#8712; V } with probability 1 -no(n -1 ) = 1 -o(1). Therefore, these two graphs A (1) and A (2) can be distinguished via DE-GNN-1 with high probability. We formally state this result in the following corollary. Corollary 3.4. Suppose two graphs are uniformly independently sampled from all n-sized r-regular graphs over V where 3 &#8804; r &lt; (2 log n) 1/2 . Then, within L &#8804; ( 1 2 + ) log n log(r-1) layers, DE-GNN-1 can distinguish these two graphs with probability 1 -o(1) by being concatenated with injective set-pooling over all the representations of nodes.</p><p>One insightful observation is that the structural representation with a small set S may become easier to be learnt than the entire graph representation as the knowledge of the node set S can be viewed as a piece of side information. Models can be built to leverage such information, while when to compute the entire graph representation (S = V ), the model loses such side information. This could be a reason why the successful probability to learn structural representation of a small node set (Theorem 3.3) is higher than to that of an entire graph in (Corollary 3.4), though our derivation of the probabilistic bounds is not tight. Note that DE provides a convenient way to effectively capture such side information. One naive way to leverage the information of S is to simply annotate the nodes in S with binary encoding 1 and those out of S with 0. This is obviously a special case of DE but it is not as powerful as the general DE (even compared to the special case SPD). Think about setting the target node set in Eq.(1) as the entire graph (S = V ), where annotating the nodes in/out of S does not improve the representation power of WLGNN because all nodes are annotated as 1.</p><p>However, DE still holds extra representation power: For example, we want to distinguish a graph with two disconnected 3-circles and a graph with a 6-circle. These two graphs generate different SPDs between nodes.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.2">The Limitation of DE-GNN</head><p>Next, we show the limitation of DE-GNN. We prove that over a subclass of regular graphs, distance regular graphs (DRG), DE-1 is useless for structural representation. We provide the definition of DRG as follows while we refer interested readers to check more properties of DRGs in <ref type="bibr">[35]</ref>. Definition 3.5. A distance regular graph is a regular graph such that for any two nodes v, u &#8712; V , the number of vertices w s.t. SPD(w, v) = i and SPD(w, u) = j, only depends on i, j and SPD(v, u).</p><p>The Shrikhande graph and the 4 &#215; 4 Rook's graph are two non-isomorphic DRGs shown in Fig. <ref type="figure">2</ref> (b) (We temporarily ignore the nodes colors which will be discussed later). For simplicity, we only consider connected DRGs that can be characterized by arrays of integers termed intersection arrays. .., c } such that for any node pair (u, v) &#8712; V &#215;V that satisfies SPD(v, u) = j, b j is the number of nodes w that are neighbors of v and satisfy SPD(w, u) = j + 1, and c j is the number of nodes w that are neighbors of v and satisfy SPD(w, u) = j -1.</p><p>It is not hard to show that the two DRGs in Fig. <ref type="figure">2</ref> (b) share the same intersecion array {6, 3; 1, 2}.</p><p>The following theorem shows that over distance regular graphs, DE-GNN-1 requires discriminatory node/edge attributes to distinguish structures, which indicates the limitation of DE-1.</p><p>Theorem 3.7. Given any two nodes v, u &#8712; V , consider two tuples T 1 = (v, A (1) ) and T 2 = (u, A (2) ) with graph structures A (1) and A (2) that correspond to two connected DRGs with a same intersection array. Then, DE-GNN-1 must use discriminatory node/edge attributes to distinguish T 1 and T 2 .</p><p>Note Theorem 3.7 only works for node representations using DE-1. Therefore, DE-GNN-1 may not associate distinguishable node representations in the two DRGs in Fig. <ref type="figure">2 (b)</ref>.</p><p>However, if we are to learn higher-order structural representations (|S| &#8805; 2) with DE-p (p &#8805; 2), DE-GNN-p may have even stronger representation power. We illustrate this point by considering structural representations of two node-pairs that form edges of the two DRGs respectively. Consider two node-pairs that correspond to two edges of these two graphs in Fig. <ref type="figure">2</ref> (b) respectively. Then, there exists a proper DE-GNN-2 via using SPD as DE-2, associating these two node-pairs with different representations. Moreover, by simply aggregating the obtained representations of all node-pairs into graph representations via a set-pooling, we may also distinguish these two graphs. Note that distinguishing the node-pairs of the two DRGs is really hard, because even the 2-WL test<ref type="foot">foot_1</ref> will fail to distinguish any edges in the DRGs with a same intersection array and diameters exactly equal to 2 2 . This means that the recently proposed more powerful GNNs, such as RingGNN <ref type="bibr">[30]</ref> and PPGN <ref type="bibr">[27]</ref>, will also fail in this case. However, it is possible to use DE-GNN-2 to distinguish those two DRGs.</p><p>It is interesting to generalize Theorem 3.3 to DRGs to demonstrate the power of DE-GNN-p (p &#8805; 2). However, missing analytic-friendly random models for DRGs makes such generalization challenging.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">DEA-GNN-Distance Encoding-1's as Controllers of the Message Aggregation</head><p>DE-GNN only uses DEs as initial node features. In this subsection, we further consider leveraging DE-1 between any two nodes to control the aggregation procedure of DE-GNN. Specifically, we propose DE-Aggregation-GNN (DEA-GNN) to do the following change</p><p>Note that the representation power of DEA-GNN is at least no worse than DE-GNN because the later one is specialized by aggregating the nodes with &#950; spd (u|v) = 1, so Theorem 3.3, Corollary 3.4 are still true. Interestingly, its power is also limited by Theorem 3.7. We conclude as the follows. Corollary 3.8. Theorem 3.3, Corollary 3.4 and Theorem 3.7 are still true for DEA-GNN.</p><p>The general form Eq. ( <ref type="formula">4</ref>) that aggregates all nodes in each iteration holds more theoretical significance than practical usage due to scalability concern. In practice, the aggregation procedure of DEA-GNN may be trimmed by balancing the tradeoff between complexity and performance. For example, we may choose &#950;(u|v) = &#950; spd (u|v), and only aggregate the nodes u such that &#950; spd (u|v) &#8804; K, i.e., K-hop neighbors. Multi-hop aggregation allows avoiding the training issues of deep architecture, e.g., gradient degeneration. Particularly, we may prove that K-hop aggregation decreases the number of layers L requested to ( 1 2 + ) log n K log(r-1) in Theorem 3.3 and Corollary 3.4 with proof in Appendix F. We may also choose &#950;(u|v) = &#950; gpr (u|v) with non-negative &#947; k in Eq. ( <ref type="formula">3</ref>) and aggregate the nodes whose &#950;(u|v) are top-K ranked among all u &#8712; V . This manner is able to control fix-sized aggregation sets. As DEA-GNN does not show provably better representation power than DE-GNN, all the above approaches share the same theoretical power and limitations. However, in practice their specific performance may vary across datasets and applications.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Related Work</head><p>Recently, extensive effort has been taken to improve the structural representation power of WL-GNN. From the theoretical perspective, most previous works only considered representations of entire graphs <ref type="bibr">[26-30, 42, 43]</ref> while Srinivasan &amp; Ribeiro initialized the investigation of structural representations of node sets <ref type="bibr">[44]</ref> from the view of joint probabilistic distributions. Some works view GNNs as general approximators of invariant functions but the proposed models hold more theoretical implication than practical usage because of their dependence on polynomial(n)-order tensors <ref type="bibr">[29,</ref><ref type="bibr">45,</ref><ref type="bibr">46]</ref>. Ring-GNN <ref type="bibr">[30]</ref> (or equivalently PPGN <ref type="bibr">[27]</ref>), a relatively scalable model among them, was based on 3-order tensors and was proposed to achieve the expressive power of the 2-WL test (a brief introduction in Appendix H). However, Ring-GNN (PPGN) was proposed for entire-graph representations and cannot leverage the sparsity of the graph structure to be scalable enough to process large graphs <ref type="bibr">[27,</ref><ref type="bibr">30]</ref>. DE-GNN still benefits from such sparsity and are also used to represent node sets of arbitrary sizes. Moreover, our models theoretically behave orthogonal to Ring-GNN, as DE-2 can distinguish some non-isomorphic node-pairs that Ring-GNN fails to distinguish because the power of Ring-GNN is limited by the 2-WL test (Fig. <ref type="figure">2 (b)</ref>). Some works with empirical success inspire the proposal of DE, though we are the first one to derive theoretical characterization and leverage our theory to better those models as a return. SEAL <ref type="bibr">[9]</ref> predicts links by reading out the representations of ego-networks of node-pairs. Although SEAL leverages a specific DE-2, the representations of those ego-networks are extracted via complex SortPooling <ref type="bibr">[23]</ref>. However, we argue against such complex operations as DE-2 yields all the benefit of representations of node-pairs, as demonstrated by our experiments. PGNN <ref type="bibr">[10]</ref> uses SPDs between each node and some anchor nodes to encode distance between nodes. As those encodings are not permutation invariant, PGNN holds worse inductive/generalization capability than our models. Quite a few works targeted at revising neighbor aggregation procedure of WLGNN and thus are related to the DEA-GNN. However, none of them demystified their connection to DE or provided theoretical characterization. MixHop <ref type="bibr">[47]</ref>, PAGTN <ref type="bibr">[31]</ref>, MAT <ref type="bibr">[17]</ref> essentially used &#950; spd (u|v) to change the way of aggregation (Eq. ( <ref type="formula">4</ref>)) while GDC <ref type="bibr">[32]</ref> and PowerGNN <ref type="bibr">[48]</ref> used variants of &#950; gpr (u|v). MixHop, GDC and PowerGNN are evaluated for node classification while PAGTN and MAT are evaluated for graph classification. GDC claims that the aggregation based on &#950; gpr (u|v) does not help link prediction. However, we are able to show its empirical success for link prediction, as the key point missed by GDC is using DEs as extra node attributes (Appendix G.3). Note that as the above models are covered by DEA-GNN-1, their representation powers are all bounded by Theorem 3.7 according to Corollary 3.8.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Experiments</head><p>Extensive experiments<ref type="foot">foot_3</ref> are conducted to evaluate our DE-GNN and DEA-GNN over three levels of tasks involving target node sets with sizes 1, 2 and 3 respectively: roles of nodes classification (Task 1), link prediction (Task 2), and triangle prediction (Task 3). Triangle prediction is to predict for any given subset of 3 nodes, {u, v, w}, whether links uv, uw, and vw all exist. This task belongs to the more general class of higher-order network motif prediction tasks <ref type="bibr">[49,</ref><ref type="bibr">50]</ref> and has recently attracted much significance to <ref type="bibr">[51]</ref><ref type="bibr">[52]</ref><ref type="bibr">[53]</ref><ref type="bibr">[54]</ref><ref type="bibr">[55]</ref>. We briefly introduce the experimental settings and save the details of the datasets and the model parameters to Appendix G.</p><p>Dataset &amp; Training. We use the following six real graphs for the three tasks introduced above: Brazil-Airports (Task 1), Europe-Airports (1), USA-Airports (1), NS (2 &amp; 3), PB (2), C.ele <ref type="bibr">(2 &amp; 3)</ref>. For Task 1, the goal is to predict the passenger flow level of a given airport based solely on the flight traffic network. These airports datasets are chosen because the labels indicate the structural roles of nodes (4 levels in total from hubs to switches) rather than community identifiers of nodes as traditionally used <ref type="bibr">[20,</ref><ref type="bibr">21,</ref><ref type="bibr">56]</ref>. For Tasks 2 &amp; 3, the datasets were used by the strongest baseline <ref type="bibr">[9]</ref>, which consist of existing links/triangles plus the the same number of randomly sampled negative instances from those graphs. The positive test links/triangles are removed from the graphs during the training phase. For all tasks, we use 80%, 10%, 10% dataset splitting for training, validation, testing respectively. All the models are trained until loss converges and the testing performance of the best model on validation set is reported. We also report the experiments without validation sets that follow the original settings of the baselines <ref type="bibr">[5,</ref><ref type="bibr">9]</ref>  Baselines. We choose six baselines. GCN <ref type="bibr">[20]</ref>, GraphSAGE(SAGE) <ref type="bibr">[21]</ref>, GIN <ref type="bibr">[16]</ref> are representative methods of WLGNN. These models use node degrees as initial features when attributes are not available to keep inductive ability. Struc2vec <ref type="bibr">[5]</ref> is a kernel-based method, particularly designed for structural representations of single nodes. PGNN <ref type="bibr">[10]</ref> and SEAL <ref type="bibr">[9]</ref> are also GNN-based methods: PGNN learns node positional embeddings and is not inductive for node classification; SEAL is particularly designed for link prediction by using entire-graph representations of ego-networks of node-pairs. SEAL outperforms other link prediction approaches such as VGAE <ref type="bibr">[57]</ref>. The node initial features for these two models are set as the inductive setting suggested in their papers. We tune parameters of all baselines (Appendix G.5) and list their optimal performance here.</p><p>Instantiation of DE-GNN and DEA-GNN. We choose GCN as the basic WLGNN and implement three variants of DE-GNN over it. Note that GIN could be a more powerful basis while we tend to keep our models simple. The first two variants of Eq. ( <ref type="formula">2</ref> Results are shown in Table <ref type="table">1</ref>. Regarding the node-level task, GIN outperforms other WLGNNs, which matches the theory in <ref type="bibr">[16,</ref><ref type="bibr">26,</ref><ref type="bibr">29]</ref>. Struc2vec is also powerful though it is kernel-based. DE-GNN's significantly outperform the baselines (except Eur.-Airport) which imply the power of DE-1's. Among them, landing probabilities (LP) work slightly better than SPDs as DE-1's.</p><p>Regarding node-pairs-level tasks, SEAL is the strongest baseline, as it is particularly designed for link prediction by using a special DE-2 plus a graph-level readout <ref type="bibr">[23]</ref>. However, our DE-GNN-SPD performs even significantly better than SEAL: The numbers are close, but the difference is still significant; The decreases of error rates are always greater than 10% and achieve almost 30% over NS. This indicates that DE-2 is the key signal that makes SEAL work while the complex graph-level readout adopted by SEAL is not necessary. Moreover, our set-pooling form of DE-2 (Eq. ( <ref type="formula">1</ref>)) decreases the dimension of DE-2 adopted in SEAL, which also betters the generalization of our models (See the detailed discussion in Appendix G.3). Moreover, for link prediction, SPD seems to be much better to be chosen as DE-2 than LP.</p><p>Regarding node-triads-level tasks, no baselines were particularly designed for this setting. We have not expected that GIN outperforms PGNN as PGNN captures node positional information that seems useful to predict triangles. We guess that the distortion of absolute positional embeddings learnt by PGNN may be the reason that limits its ability to distinguish structures with nodes in close positions: For example, the three nodes in a path of length two are close in position and the three nodes in a triangle are also close in position. However, this is not a problem for DE-3. We also conjecture that the gain based on DEs grows w.r.t. their orders (i.e., |S| in Eq. ( <ref type="formula">1</ref>)). Again, for triangle prediction, SPD seems to be much better to be chosen as DE-3 than LP.</p><p>Note that DEA-GNN-SPD further improves DE-GNN-SPD (by almost 1% across most of the tasks). This demonstrates the power of multi-hop aggregation (Eq. ( <ref type="formula">4</ref>)). However, note that DEA-GNN-SPD needs to aggregate multi-hop neighbors simultaneously and thus pays an additional cost of scalability.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Broader Impact</head><p>This work proposes a novel angle to systematically improve the structural representation power of GNNs. We break from the convention that previous works characterize and further improve the power of GNNs by intimating different-order WL tests <ref type="bibr">[26,</ref><ref type="bibr">27,</ref><ref type="bibr">30,</ref><ref type="bibr">58]</ref>. As far as we know, we are the first one to provide non-asymptotic analysis of the power of the proposed GNN models. Therefore, the proof techniques of Theorems 3.3,3.7 may be expected to inspire new theoretical studies of GNNs and further better the practical usage of GNNs. Moreover, our models have good scalability by avoiding using the framework of WL tests, as higher-order WL tests are not able to leverage the sparsity of graphs. To be evaluated over extremely large graphs <ref type="bibr">[59]</ref>, our models can be simply trimmed and work on the ego-networks sampled with a limited size around the target node sets, just as the strategy adopted by GraphSAGE <ref type="bibr">[21]</ref> and GraphSAINT <ref type="bibr">[60]</ref>. Therefore, our work may motivate practitioners to design and deploy more powerful GNNs in industrial pipelines to benefit the society.</p><p>Distance encoding unifies the techniques of many GNN models <ref type="bibr">[9,</ref><ref type="bibr">17,</ref><ref type="bibr">31,</ref><ref type="bibr">32,</ref><ref type="bibr">47</ref>] and provides a extremely general framework with clear theoretical characterization. In this paper, we only evaluate four specific instantiations over three levels of tasks. However, there are some other interesting instantiations and applications. For example, we expect a better usage of PageRank scores as edge attributes (Eq. ( <ref type="formula">4</ref>)). Currently, our instantiation DEAGNN-PR simply uses those scores as weights in a weighted sum to aggregate node representations. We also have not considered any attention-based mechanism over DEs in aggregation while it seems to be useful <ref type="bibr">[17,</ref><ref type="bibr">22]</ref>. Researchers may try these directions in a more principled manner based on this work. Our approaches may also help other tasks based on structural representation learning, such as graph-level classification/regression <ref type="bibr">[14,</ref><ref type="bibr">16,</ref><ref type="bibr">23,</ref><ref type="bibr">26,</ref><ref type="bibr">27]</ref> and subgraph counting <ref type="bibr">[58]</ref>, which correspond to many applications with wide societal impact including drug discovery and structured data analysis.</p><p>There are also two important implications coming from the observations of this work. First, Theorem 3.7 and Corollary 3.8 show the limitation of DE-1 over distance regular graphs, including the cases when DE-1's are used as node attributes or controllers of message aggregation. As distance regular graphs with the same intersection array have the important co-spectral property <ref type="bibr">[35]</ref>, we guess that DE-1 is a bridge to connect GNN frameworks to spectral approaches, two fundamental approaches in graph-structured data processing. This point sheds some light on the question left in <ref type="bibr">[30]</ref> while more rigorous characterization is still needed. Second, as observed in the experiments, higher-order DE's induce larger gains as opposed to WLGNN, while Theorem 3.3 is not able to characterize this observation as the probability 1 -o( 1 n ) does not depend on the size p. We are sure that the probabilistic quantization in Theorem 3.3 is not tight, so it is interesting to see how such probability depends on p by deriving tighter bounds.</p><p>We are not aware of any societal disadvantages of this research. Our experiments also do not leverage biases in the data. We choose those datasets based on two rules that are irrelavant to ethics: <ref type="bibr">(1)</ref> The labels for evaluation are graph-structured related; (2) Those datasets were used in some baselines so that we provide fair comparison. Regarding the point (1), this is the reason that we do not use the datasets, such as Cora, Citeseer and Pubmed <ref type="bibr">[56]</ref>, whose node labels that indicate the community belongings of nodes instead of the structural function of nodes. However, it is interesting to evaluate distance encoding techniques over those datasets with community-related labels in the future.</p><p>readout step (AGG(&#8226;)), which works on a subset of nodes instead of the entire node set. Therefore, the same logic of proofs of Lemma 2 and Theorem 3 in <ref type="bibr">[16]</ref> can be directly applied for structural representation learning with little revision.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C Proof for The Power of DE -Theorem 3.3</head><p>We restate Theorem 3.3: Given two fixed-sized sets S (1) , S (2) &#8834; V , |S (1) | = |S (2) | = p. Consider two tuples T (1) = (S (1) , A (1) ) and T (2) = (S (2) , A (2) ) in the most difficult setting when features A (1) and A (2) are only different in graph structures specified by A (1) and A (2) respectively. Suppose A (1) and A (2) are uniformly independently sampled from all r-regular graphs over V where 3 &#8804; r &lt; (2 log n) 1/2 . Then, for any constant &gt; 0, there exist a proper DE-GNN-p with layers L &lt; ( 1 2 + ) log n log(r-1) , using DE-p &#950;(u|S (1) ), &#950;(u|S (2) ) for all u &#8712; V such that with probability 1 -o(n -1 ), its outputs &#915;(T (1) ) = &#915;(T (2) ). Specifically, f 3 can be simply chosen as SPD, i.e., &#950;(u|v) = &#950; spd (u|v). The big-O notation is with respect to n.</p><p>Proof. To prove the statement, we only need to prove the case that |S (1) </p><p>Lemma C.1. Suppose the statement is true when |S (1) </p><p>Then, the statement is also true for the case when |S (1) | = |S (2) | = p &gt; 1 for some fixed p, and &#950;(u|v) is a neural network fed with the list of landing probabilities.</p><p>Proof. We first focus on the case when DE is chosen as SPD, i.e., &#950;(u|v) = &#950; spd (u|v). We want to use the results of the case |S (1) </p><p>We choose an arbitrary node from S (1) , say w 1 . As we assume the statement is true for the single node case, for any node in S (2) , say w 2 , DE-GNN with DEs &#950; spd (u|w 1 ), &#950; spd (u|w 2 ) is able to distinguish two tuples (w 1 , A (1) ) and (w 2 , A (2) ), with probability at least 1-o(n -1 ). Given that the space of SPD is countable and AGG in Eq. ( <ref type="formula">1</ref>) is injective, &#950; spd (u|S (1) ) and &#950; spd (u|S (2) ) are different if &#950; spd (u|w 1 ) is different from any &#950; spd (u|w 2 ) (w 2 &#8712; S (2) ), which happens with probability at least 1 -|S (2) |o(n -1 ) = 1 -o(n -1 ). Therefore, DE-GNN with DEs &#950; spd (u|S (1) ), &#950; spd (u|S (2) ) is also able to distinguish two tuples (w 1 , A (1) ) and (w 2 , A (2) ), with probability at least 1 -o(n -1 ). Based on the union bound, we know that DE-GNN with DEs &#950; spd (u|S (1) ), &#950; spd (u|S (2) ) is able to distinguish two tuples (w 1 , A (1) ) and (v, A (2) ) for any v &#8712; S (2) , with probability at least 1 -|S (2) |o(n -1 ) = 1 -o(n -1 ). Therefore, we prove the capability to generalize the result from the single node case to the multiple node cases. Now, let us generalize the result from &#950; spd (u|v) to arbitrary &#950;(u|v) represented by neural networks fed with the list of landing probabilities. As &#950; spd (u|v) is indeed a function of the list of landing probabilities (Eq. ( <ref type="formula">2</ref>)), the general &#950;(u|v) should have stronger discriminatory power unless neural networks cannot provide a good mapping from the list of landing probabilities to SPD. However, we do not have to worry this because the list of landing probabilities fortunately lies a countable space for unweighted graphs: 1) The dimension of this list is countable (finite in practice); 2) Each component of this list is always a rational number if the graph is unweighted. According to our assumption, f 3 is allowed to have an injective mapping over the list of landing probabilities. Therefore, neural networks on the list of landing probabilities will not decrease the representation power that is just based on &#950; spd (u|v).</p><p>Keep in mind that Lemma C.1 could be loose. We leave the tight probability bound for the p &gt; 2 case in the future.</p><p>Outline: From now on, we focus on the single node case, i.e. |S (1) | = |S (2) | = 1, with SPD as the DE-1 (f 3 ), i.e., &#950;(u|v) = &#950; spd (u|v). Without loss of generality, we suppose S (1) = S (2) = {u}. As SPD is countable, there exists a proper DE-GNN that guarantees that all the operations are injective, which follows the basic condition used in <ref type="bibr">[16]</ref>. Because of the iterative procedure of DE-GNN and all mappings are injection, the label of node u only depends on the subtree with depth L rooted at u (See the illustration of the subtree rooted at a given node in Fig. <ref type="figure">3</ref>). Recall L is the number of layers The subtree rooted at a node: In the left two graphs, the black nodes are the target nodes who structural representations are to be learnt. Different colors of the nodes correspond to different SPDs with respect to the target nodes. The right two trees correspond to the subtrees rooted at these two target nodes respectively. DE-GNN essentially works from bottom to top along these subtrees to obtain the representations of the target nodes. Different types of subtrees yield different representations based on a proper DE-GNN. In this example, these two target nodes are all from 3-regular graphs so they cannot be distinguished via WLGNN without DE-1's or the informative node/edge attributes. However, with DE-1's, we can see the corresponding subtrees of these two nodes are different and the difference appears in the second layer. in DE-GNN. Therefore, we only need to show that A (1) and A (2) , which are uniformly sampled r-regular graphs, with probability at most o(n -1 ), have the same subtrees rooted at u given SPDs as initial node labels. To show this, our proof contains four steps. Note that all through the following proof, we assume that n is very large and is any small positive constant that is independent from n.</p><p>&#8226; We first explain that we are able to work on the configuration model of r-regular graphs proposed in <ref type="bibr">[61]</ref> which associates uniform measure over all r-regular graphs. Given the condition r &lt; (2 log n) 1/2 , there are a large portion (&#8486;(n -1/2 )) of all the graphs generated by this model are simple (without self-loops and multi-edges) r-regular graphs <ref type="bibr">[61]</ref>. Since the configuration model alleviates the difficulty to analyze the dependence between edges in r-regular graphs, we consider the graphs generated by the configuration model for the next two steps.</p><p>&#8226; Suppose the set of nodes that are associated with SPD= k from u is denoted by Q k , and the number of edges that connect the nodes in Q k and those in Q k+1 is denoted by p k . We prove that with probability 1 -o(n -3 2 ), for all k &#8712; ( 5</p><p>&#8226; Next, we define the edge configuration between Q k and Q k+1 as a list C k = (a 1,k , a 2,k , ...)</p><p>where a i,k denotes the number of nodes in Q k+1 of which each has exactly i edges from Q k . We prove that for each k &#8712; ( 1 2 log n log(r-1-) , 4   7   log n log(r-1-) ), as the edges between Q k and Q k+1 are so many, there are too much randomness that makes each type of edge configuration C k appear with only limited probability P(C k ) = O( n 1/2 p k ). Recall that p k is defined in step 2. Then, given any log n log(r-1-) many k's, the probability that A (1) and A (2) have all the same edge configurations for these k's is bounded by</p><p>2 ). Therefore, we only need to consider edge configurations for k &#8712; ( 1 2 log n log(r-1-) , ( 1 2 + ) log n log(r-1-) ) to distinguish A (1) and A (2) , as this will give us 1 -o(n -3 2) probability to succeed.</p><p>&#8226; Since there are at least &#8486;(n -1/2 ) of all the graphs generated by the configuration model that are simple r-regular graphs, and there are at most o(n -3 2 ) probability that A (1) and A (2) share the same subtrees rooted at u, there are at most o(n -3 2 /n -1/2 ) = o(n -1 ) probability that A (1) and A (2) are simple r-regular graphs and share the same subtrees rooted at u, which concludes the proof.</p><p>Step 1: We first introduce the configuration model proposed in <ref type="bibr">[61]</ref> for r-regular graphs of n nodes. Suppose we have n sets of items, W u , u &#8712; [n], where each set corresponds to one node in <ref type="bibr">[n]</ref>. Each set W u has r items. Now, we randomly partition all these nr items into nr 2 pairs. Then, each partitioning result corresponds to a r-regular graph: if a pair contains items from W u and W v , then there is an edge between nodes u and v in the graph. Note that such partitioning results may render self-loops and multi-edges. Of course, we would like to consider only simple graphs which do not have self-loops and multi-edges. For this, the theory in <ref type="bibr">[61]</ref> shows that for all these r-regular graphs, if r &lt; (2 log n) 1/2 , there are about exp(-r 2 -1 4 ) portion among them, i.e., &#8486;(n -1/2 ), which are simple graphs.</p><p>Step 2: Now, we consider a graph that is uniformly sampled from the configuration model. Recall that the set of nodes that are associated with SPD= k from u is denoted by Q k , and the number of edges that connect the nodes in Q k and those in Q k+1 is denoted by p k . Now, we prove that there exists a small constant &gt; 0, such that with probability</p><p>We prove an even stronger lemma that gives the previous argument via a union bound and doing induction over all k &#8712; ( 5 log n log(r-1) + 1, ( <ref type="formula">2</ref>3 -) log n log(r-1) ).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma C.2.</head><p>There exists a small constant &gt; 0, with probability 1 -O(n -2+ ), such that: 1) For</p><p>Proof. We consider the following procedure to generate the graph based on the configuration model. Recall the node u is the target node, or the node with SP D = 0 to the target node. We start from generating the edges attached to this node. We start from the set W u and generate the r pairs with at least one item in W u . Then, we have all the nodes in Q 1 . Based on the set &#8746; v&#8712;Q1 W v , we generate all the (r -1)|Q 1 | pairs with at least one item in &#8746; v&#8712;Q1 W v , and we have all the nodes in Q 2 . The procedure goes on so on and so forth, from Q k to Q k+1 . Now, we prove 1). First, we prepare some inequalities. We have</p><p>This inequality is very crude. Now, we go back to prove the bound for p k . Recall the definition of p k that is the number of edges between Q k and Q k+1 . Then, the number of edges that are generated with both end-nodes are in</p><p>As we suppose the edges are generated sequentially, the probability to generate an edge whose two end-nodes are in Q k is upper bounded by</p><p>where c 1 , c 2 are constants, the numbers above the equality/inequality signs refer to which equations are used.</p><p>Next, we prove the bound for</p><p>. Again, if the edges are generated sequentially, p k -Q k+1 indicates the number of edges whose end-nodes in Q k+1 also belong to other edges that has been generated between Q k and Q k+1 . The probability of this edge is upper bounded by</p><p>Till now, we have proved the statement 1). Now, we prove the statement 2). Actually, at the time when Q k is generated, the number of edges having been generated is at most r(r -1) k -2. These edges cover at most r(r -1) k -1 nodes. When k = 5 log n log(r-1) + 1, we claim that with probability 1 -O(n -2+ ), at most 1 edge among these edges when generated is not connected to a new node. This is because if there are more than 1 such edges, the probability is at most (by summing over i such edges)</p><p>We use this result to give a lower bound of |Q k | for k = 5 log n log(r-1) + 1. Because there is at most 1 edge when generated do not connect to a new node. The worst case appears when two items in W u are mutually connected, which leads to |Q 1 | &#8805; r -2 &#8805; 1. All edges after Q 1 is generated are connected to new nodes and furthermore</p><p>Step 3: We start to consider the edge configuration between</p><p>. We focus our attention on the graphs that satisfy the properties developed in Step 2, which, as demonstrated in Step 2 , are with high probability 1 -o(n -3/2 ). For those graphs, we know that for k &#8712; (</p><p>and therefore at the time when Q k is generated, there are still q k = n -o(n) = &#920;(n) nodes that have not been connected.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Recall that we define the edge configuration between</head><p>where a i,k means the number of nodes in Q k+1 of which each has exactly i edges from Q k . According to the definition of C k , it satisfies</p><p>Note that if DE-GNN cannot distinguish (u, A (1) ) and (u, A (2) ), then A (1) and A (2) must share the same edge configuration between Q k and Q k+1 . Otherwise, after one iteration, the intermediate representation of nodes in Q k+1 are different between A (1) and A (2) . Such difference will be propagated to u later. To bound the probability that A (1) and A (2) must share the same edge configuration between Q k and Q k+1 , for simplicity, we consider the probability of C k given the number of edges between Q k and Q k+1 , i.e., p k and the number remaining nodes, i.e.,</p><p>We are to derive a upper bound of P(C k ) based on the configuration model in the following lemma.</p><p>Lemma C.3. Suppose p k &#8712; [n 1/2 , n 2/3-] and q k = &#920;(n). Consider the configuration model to generate edges: there are p k edges that correspond to two items that are one in &#8746; v&#8712;Q k W v and one among the rest q k r items. Then, for any possible edge configuration C k obtained based on this generating procedure, P(C k ) &#8804; c 5 q 1/2 k p k for some constant c 5 .</p><p>Proof. First, for the configuration C k = (a 1,k , a 2,k , ...), we claim that the most probable C k is achieved when a i,k = 0 for i &#8805; 3. We prove this statement via the adjustment method: We fix the value of a 1,k + ia i,k and all the other a i ,k 's. We compare the probability of (a 1,k = x, a i,k = y) and that of (a 1,k = x + i, a i,k = y -1). Because x, y &#8804; p k = O(n 2/3-), for some constant c 6 , we have</p><p>Therefore, we only need to consider the case when a 1,k , a 2,k &gt; 0 so a 1,k + 2a 2,k = p k . Define a function g(x) to denote the probability of the edge configuration (a 1,k = p k -2y, a 2,k = y). We compare g(y) and g(y + 1)</p><p>.</p><p>Consider the choice y = y * to make g(y * )/g(y * +1) &#8805; 1 while g(y * -1)/g(y * ) &#8804; 1 that corresponds to g(y * ) = max y g(y). Then, we must have y * = o(p k ) and otherwise g(y * -1)/g(y * ) &gt; 1 because q k = &#920;(n). As y * = o(p k ) and p k = o(q k ), y * according to Eq. 8 is about </p><p>Moreover, for &#948; &gt; 0 ). Based on Lemma C.3, the probability that A (1) and A (2) share the same edge configurations between Q k and Q k+1 for all these k's is bounded by</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>As</head><p>Step 4: From step 2, we know that the graphs that do not satisfy</p><p>) portion of all graphs generated from the configuration model. From step 3, we know that A (1) and A (2) that satisfy the properties of step 2, with probability at most o(n -3/2 ), their subtrees rooted at node u are the same even with DE-1's. Even all these graphs belong to simple regular graphs, as step 1 tells the portion of simple regular graphs among the graphs generated from the configuration model is &#8486;(n -1/2 ), we arrive the final conclusion that if A (1) and A (2) are sampled from all simple r-regular graphs, with probability at least 1 -o(n -3/2 )/&#8486;(n -1/2 ) = 1 -o(n -1 ), a proper DE-GNN can distinguish (u, A (1) ) and (u, A (2) ).</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D Discussion on node sets over irregular graphs</head><p>Theorem 3.3 focuses on node sets embedded in regular graphs. A natural question therefore arises: how about the power of DEs to distinguish non-isomorphic node sets embedded in irregular graphs that the 1-WL test (also WLGNN) may not distinguish. To answer this question, note that there is some important connection between irregular graphs and regular graphs under the umbrella of the 1-WL test. Actually, the partition of nodes over irregular graphs according to their representations (colors) stably associated by the 1-WL test has equitable property <ref type="bibr">[62]</ref>. Basically, suppose that the whole node set V can be partitioned into several parts based on the representations (colors) of nodes, V = &#8746; c i=1 V i , where for an arbitrary i, nodes in V i share the same representations based on the 1-WL test. Then, the induced subgraph of the nodes in V i is a regular graph for all i's. What's more, for any i, j, the number of nodes in V j that are neighbors of a certain node in V i is shared by all the nodes in V i , which again shows certain regularity. Theoretically, if we focus on the regular subgraph defined over each V i , we may further leverage DE defined over this subgraph to further distinguish the nodes in V i . In practice, we do not need to work on each subgraph individually. Mostly, the capability of DE indicated by Theorem 3.3 may still help with breaking such regularity.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>E Proof for The Limitation of DE-1 -Theorem 3.7</head><p>We restate Theorem 3.7: Consider any two nodes v, u &#8712; V . Consider two tuples T 1 = (v, A (1) ) and T 2 = (u, A (2) ) with graph topologies A (1) and A (2) that correspond to two connected DRGs with a same intersection array. Then, DE-GNN-1 must depend on discriminatory node/edge attributes to distinguish T 1 and T 2 .</p><p>Proof. We are to prove that if A (1) and A (2) correspond to two DRGs with a same intersection array, then the subtrees rooted at any nodes are all same even if the nodes are labeled with any DE-1's (see illustration of a subtree rooted at a node in Fig. <ref type="figure">3</ref>). Because if the subtrees are same, the only possibility to differentiate two nodes is based on discriminatory node/edge attributes embedded in these subtrees when DE-GNN processes them from the bottom to the top.</p><p>We recall the definition of the intersection array of a connected DRG as the following.</p><p>Definition E.1. The intersection array of a connected DRG with diameter is an array of integers {b 0 , b 1 , ..., b -1 ; c 1 , c 2 , ..., c } such that for any node pair (u, v) &#8712; V &#215;V that satisfies SPD(v, u) = j, b j is the number of nodes w that are neighbors of v and satisfy SPD(w, u) = j + 1, and c j is the number of nodes w that are neighbors of v and satisfy SPD(w, u) = j -1.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>DE-1 = {0}</head><p>DE-1 = {1} DE-1 = {2} Figure <ref type="figure">5</ref>: K-hop aggregation does necessarily better the discriminatory power. Consider using DE-GNN-1 and DEA-GNN-1-2-hop to learn the structural representation of the nodes colored by black. We choose SPD as DE-1. Both models require at least two layers to distinguish two black nodes. DEA-GNN-1-2-hop cannot decrease the number of layers by a factor of 2.</p><p>Actually, Lemma E.3 is closely related the argument in <ref type="bibr">[63]</ref>, which claims the same result for walks over one DRG. However, Lemma E.3 extends the argument to any DRGs with the same intersection array. As DRGs are b 0 -regular graphs, there is a bijective mapping between the list of landing probabilities and the list of walks of different length. Therefore, this is a bijective mapping between SPD and the list of landing probabilities, which concludes the proof.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F Proof for DEA-GNN-Corollary 3.8 and Further Discussion</head><p>DEA-GNN contains a general aggregation procedure assisted by DE-1 (Eq. ( <ref type="formula">4</ref>)). As we discussed in Section 3.3, as DEA-GNN allows to use DEs as extra node features as DE-GNN in the same time, it has at least the same representation power as DE-GNN. Therefore, Theorem 3.3 and Corollary 3.4 are still true for DEA-GNN. So our first question is whether DEA-GNN shares the same limitation with DE-GNN with DE-1 for node representation learning over DRGs. Later, we will prove our confirmation to this question. An interesting case in practice is to set the DE-1 in Eq. ( <ref type="formula">4</ref>) as SPD &#950;(u|v) = &#950; spd (u|v). For some K &#8805; 1, we specify the aggregation as</p><p>which means that the aggregation happens among K-hop neighbors. If the model is to learn the representation of a node subset with size p, We term this model as DEA-GNN-p-K-hop. The second question is to investigate whether DEA-GNN-p-K-hop may decrease the number of layers of DE-GNN required in Theorem 3.3 and Corollary 3.4 by a factor of K. This result is not trivial. For example, Fig. <ref type="figure">5</ref> shows two trees whose root nodes are the target nodes to learn structural representation. Obviously, DE-GNN-1 needs two layers to distinguish these two root nodes. However, DEA-GNN-1-2-hop may not decrease the number of layers by a factor of 2 and thus still needs two layers. This is because the set aggregation in Eq. ( <ref type="formula">9</ref>) may decrease the discriminatory power of those features interacted with graph structures. However, next, we can still prove the confirmation to this second questions.</p><p>Confirmation to the first question. We formally restate the conclusion to prove: Consider any two nodes w 1 , w 2 &#8712; V . Consider two tuples T 1 = (w 1 , A (1) ) and T 2 = (w 2 , A (2) ) with graph topologies A (1) and A (2) that correspond to two connected DRGs with a same intersection array. Then, DEA-GNN-1 must depend on discriminatory node/edge attributes to distinguish T 1 and T 2 .</p><p>Proof. Because Lemma E.3 implies that in all DRGs with the same intersection array, there is a bijective mapping between SPD and the list of landing probabilities. Therefore, we only need to focus on the case that uses SPDs as DE-1, which both appears as extra node attributes and features in aggregation (see Eq. ( <ref type="formula">4</ref>)). As the statement is about the limitation, we consider the most expressive case, i.e., the aggregation appearing among every pair of nodes.</p><p>Recall the tree structure to compute DE-GNN is termed as the subtree rooted at some node. Now, we call the tree structure to compute DEA-GNN for the structural representation of a node w as the extended subtree rooted at w, which has the same utility as the tree structure for DE-GNN (Fig. <ref type="figure">3</ref>). ) such that these k's hold the same integral interval K and can be denoted as k 0 , k 0 + K, .... The probability that A (1) and A (2) have all the same edge configurations for these k's is bounded by</p><p>Therefore, we only need to consider edge configurations for k &#8712; ( 1 2 log n log(r-1-) , ( 1 2 + ) log n log(r-1-) ) to distinguish A (1) and A (2) . And within ( 1 2 + )</p><p>log n K log(r-1-) + 2, DEA-GNN-1-K-hop yields different representaions for (w, A (1) ) and (w, A (2) ). Note that the constant 2 may be merged in for simplicity, which concludes the proof.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G Details of the Experiments G.1 Datasets</head><p>The three air traffic networks for Task 1, Brazil-Airports, Europe-Airports, and USA-Airports were collected by <ref type="bibr">[64]</ref> from the government websites throughout the year 2016 and were used to evaluate algorithms to learn structural representations of nodes <ref type="bibr">[5,</ref><ref type="bibr">6]</ref>. Networks are built such nodes represent airports and there exists an edge between two nodes if there are commercial flights between them. Brazil-Airports is a network with 131 nodes, 1,038 edges and diameter 5; Europe-Airports is a network with 399 nodes, 5,995 edges and also diameter 5; USA-Airports is a network with 1,190 nodes, 13,599 edges and diameter 8. In each dataset, the airports are divided into 4 different levels according to the annual passengers flow distribution by 3 quantiles: 25%, 50%, 75%. The goal is to infer the level of an airport using solely the connectivity pattern of them.</p><p>Tasks 2 &amp; 3 were carried out on three other datasets used by SEAL <ref type="bibr">[9]</ref> to facilitate comparison study: C.ele, NS and PB. C.ele <ref type="bibr">[65]</ref> is a neural network of C. elegans with 297 nodes, 2,148 edges and 3241 triangles (closed node triads), and diameter of 5, in which nodes are neurons and edges are neural linkage between them. NS <ref type="bibr">[66]</ref> is a network of collaboration relationship between scientists specialized in network science, comprising of 1461 nodes 2742 edges and 3764 triangles. PB <ref type="bibr">[64]</ref> is a network of reference relationships between political post web-pages, consisting of 1222 nodes, 16714 edges, and of diameter 8. Following <ref type="bibr">[9,</ref><ref type="bibr">10]</ref>, for Task 2 &amp; 3, we remove all links or triangles in testing sets from graph structure during the training phase to avoid label leakage.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G.2 Baseline Details</head><p>We have five baselines based on GNNs and one baseline, struc2vec <ref type="bibr">[5]</ref>, based on kernels using handcrafted structural features. We first introduce the implementation of struc2vec and then discuss other baselines.</p><p>Struc2vec is implemented in a 2-phase manner. In the first phase, embeddings for all the nodes are learned by running the n-gram framework over a constructed graph based on structural similarity kernels. We directly use the code provided by the original paper <ref type="bibr">[5]</ref>  <ref type="foot">4</ref> . In the second phase, the embeddings of nodes that in the target node set are concatenated and further fed into an one-layer fully connected neural network to make further inference.</p><p>Regarding other GNN-based baselines, GCN is implemented according to Equation ( <ref type="formula">9</ref>) of <ref type="bibr">[20]</ref> with self-loops added. SAGE is implemented according to Algorithm 1 of Section 3.1 in <ref type="bibr">[21]</ref>. Mean pooling is used as the neighborhood aggregation function. GIN is implemented by adapting the code provided by the original paper <ref type="bibr">[16]</ref>  <ref type="foot">5</ref> , where we use the sum-pooling aggregation and multi-linear perception to aggregate neighbors. In all three baselines described above, ReLU nonlinearities are applied to the output of each hidden layer, followed by a Dropout layer. PGNN layer is implemented by adapting the code provided by the original paper <ref type="bibr">[10]</ref>  <ref type="foot">6</ref> . SEAL is implemented by adapting the code provided by the original paper <ref type="foot">7</ref> . As we focus on learning structural representation with inductive capability, all the five GNN-based methods use node degrees as input features if node attributes are not available.</p><p>Final readout layers are tailored to suit different tasks. For Task 1 since the task is node classification, the final layer for all baselines is a one-layer neural network followed by a cross entropy loss. Tasks 2 &amp; 3 have slightly more complex readout layers since the target entity for prediction is a node set of size 2 or 3. Note that SEAL is specifically designed for Task 2 and has its own readout that uses SortPooling over all node representations over the ego-networks of node-pairs <ref type="bibr">[23]</ref>. we refer the readers to the original paper for details <ref type="bibr">[9]</ref>. For all the other baselines, to make a fair comparison, we use the following difference-pooling: Suppose the target node set is S and the representation of node v for v &#8712; S is denoted by h v , then we readout the representation of S as</p><p>where | &#8226; | denotes component-wise absolute value. Note that Tasks 2 &amp; 3 are to predict the existence of a link / triangle. So we use the inner product w, z where w &#8712; R d is a trainable final projection vector, and feed this product into the binary cross entropy loss to train the models.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G.3 DE-GNN Variants Details.</head><p>Minibatch training based on ego-network extraction. To understand detailed framework it is helpful to first discuss the minibatch training of GCN, although the original GCN is trained in a full-batch manner <ref type="bibr">[20]</ref>. To train GCN in minibatchs, we first extract, for each target node v in a given minibatch, an ego-network centering at v within L-hop neighbors by doing a depth-L BFS from v, denoted by G v . Here, L is the number of GCN layers intended to be used in the model. Note that the representation of node v via using L-layer GCN over G v is the same as that via using L-layer GCN over the whole graph. If node attributes are not available for GCN or other WLGNNs, we may use the degree of each node as its node attributes.</p><p>Our models are implemented by following the above mini-batch training framework. For a target node set S, we first extract the union of ego-networks centering at any nodes in S within L-hop neighbors. We call the union of ego-networks as the ego-network around S, denoted by G S = &#8746; v&#8712;S G v . Note that even if S has multiple nodes, G S can be extracted as a whole by running BFS. All the edges of G S between nodes that are both in S will be further removed, which is denoted by G S . For G S , we associate each node u in these ego-networks with the DE &#950;(u|S) as extra node attributes. Specifically, we use a simple aggregation for Eq. (1):</p><p>Next, we detail the different versions of &#950;(u|v) used by different variants of DE-GNN.</p><p>DE-GNN-SPD. This variant sets &#950;(u|v), v &#8712; S and u &#8712; G S as a one-hot vector of the truncated shortest-path-distance between u and v. That is,</p><p>where d max is the maximum distance to be encoded. As a result, the &#950; spd (u|v) is a vector of length d max + 1. The pairwise SPDs can either be pre-computed in preprocessing stage, or be computed by traversing the extracted ego-networks on the fly. The d max (&#8804; L) helps prevent overfitting the noise in an overly large neighborhood.</p><p>Compare DE-GNN-SPD with SEAL. DE-GNN-SPD, when used for link prediction, is similar to SEAL <ref type="bibr">[9]</ref> in a sense that we both encode the distance between any node and the two nodes in the target node-pairs. However, we are fundamentally different from SEAL as SEAL uses graphlevel readout of all nodes in the ego-networks. SEAL also has no discussion on the expressive power of distance encoding. Their intention of node labeling as they reported is just to let the model know which node-pair in the extracted ego-networks is the target node-pair. Moreover, the specific DE &#950;(u|S) for a node-pair S = {v 1 , v 2 } used in SEAL is a one-hot encoding of the value</p><p>The dimension of this DE is O(d 2 max ) which is higher than our DE (Eq. ( <ref type="formula">14</ref>)) used in DE-GNN-SPD, which may result in model overfitting for large d max DE-GNN-LP. This variant sets &#950;(u|v), v &#8712; S and u &#8712; G S as landing probabilities of random walks (of different lengths) from node v to node u: &#950; lp (u|v) = ((W (0) ) vu , (W (1) ) vu , (W (2) ) vu , ..., (W (drw) ) vu )</p><p>where W (k) = (AD -1 ) k is the k-step random walk matrix, d rw is the max step number of random walks. Notice that in principle, &#950; lp (u|v) encodes the distance information at a finer granularity than &#950; spd (u|S). This is because SPD(u, v) can be inferred from &#950; lp (u|v) as the index of the first non-zero random walk feature. However, in practice we observe that such encoding does not always bring a significant performance gain.</p><p>For both DE-GNN-SPD and DE-GNN-LP, we use the same 1-hop neighborhood aggregation as GCN. DE can also be used to control the message passing as shown in Eq. ( <ref type="formula">4</ref>). We further discuss two variants by incorporating DE-GNN-SPD and Eq. ( <ref type="formula">4</ref>).</p><p>DEA-GNN-SPD. DEA-GNN-SPD is built on top of DE-GNN-SPD, using the same extra node features &#950; spd (u|S), but allows for multi-hop aggregation by specifying Eq. ( <ref type="formula">4</ref>) as</p><p>In experiments, we choose K = 2, 3, which means that each node aggregates representations of other nodes that are not only its direct neighbors but also its (exclusive) 2-hop and even 3-hop neighbors. As we do not have edge attributes in our data, we omit A vu . Our implementation of the aggregation for the layer l follows</p><p>where &#920; (lk) is a trainable weight matrix and for each k, we aggregate k-hop neighbors via a GCN layer with a self-loop. Note that when implementing DEA-GNN-SPD, we need to extract the ego-network of nodes within LK-hops, if DEA-GNN-SPD has L layers.</p><p>DEA-GNN-PR. DEA-GNN-SPD is also built on top of DE-GNN-SPD, but the propagation is by specifying Eq. (4) as</p><p>As the aggregation is over the whole node set, this model does not extract the ego-networks but uses the entire graphs. For the layer l, we further specify the above aggregation by using</p><p>where &#920; (l) is a trainable weight matrix and &#950; ppr (u|v) is a specific form of &#950; gpr (u|v) based on Personalized Pagerank scores <ref type="bibr">[67]</ref>, i.e,</p><p>Note that the above 0.9 is a hyper-parameter. As we are just willing to show the use case, 0.9 is set as a heuristic and is not obtained via parameter tuning. Other values may yield better performance.</p><p>Other types of PageRank scores may be used, e.g., heat-kernel PageRank scores <ref type="bibr">[68]</ref>, time-dependent PageRank scores <ref type="bibr">[69]</ref>.</p><p>We compare DEA-GNN-PR with all other methods, which yields the following Table <ref type="table">2</ref>. DEA-GNN-PR performs worse than DEA-GNN-SPD while it still works much better than WLGNNs in link and triangle predictions. Comparing these observations with the statements on GDC <ref type="bibr">[32]</ref>, we argue that missing DEs as node attributes is the key that limits the performance of link prediction via GDC. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>G.4 Model performance without validation datasets</head><p>We have confirmed with the original authors of Struc2vec <ref type="bibr">[5]</ref> and SEAL <ref type="bibr">[9]</ref> that the performance of these two baselines reported in their papers do not use validation set. The performance therein is the best testing results ever achieved when models are being trained till convergence. We think that it is necessary to include validation datasets to achieve fair comparison and therefore reported the results with validation datasets in the main text. We put the results without validation datasets here. Under both experimental settings, we draw similar conclusions for our models in comparison with the baselines. Under the ROC Curve (AUC) (mean in percentage &#177; 95% confidence level). &#8224; highlights the best baselines. * , bold font, bold font * respectively highlights the case where our proposed model's performance: exceeds the best baseline on average, exceeds by 70% confidence, exceeds by 95% confidence.</p><p>We report additional results of our model on Task 2 and 3 versus other baselines, measured by average accuracy without validation set in Table <ref type="table">4</ref>. Similar observations as reported in the main text can be drawn from both Table <ref type="table">3</ref> and 4: the strongest baselines are given by SEAL <ref type="bibr">[9]</ref> and GIN <ref type="bibr">[16]</ref>, while our DEA-GNN variants further significantly outperform those baselines on all tasks. G.5 Hyperparameters Tuning. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>H A Brief Introduction of Higher-Order Weisfeiler-Lehman Tests</head><p>We mentioned higher-order WL tests in our context, especially on the 2-WL test. There are different definitions of the k-WL test (k &#8805; 2), while in this work we followed the definition in <ref type="bibr">[39]</ref>. Note that the k-WL test here also corresponds to the k-WL' test in <ref type="bibr">[40]</ref> and the k-FWL test in <ref type="bibr">[27]</ref>, and are equivalent to the k + 1-WL tests in <ref type="bibr">[26,</ref><ref type="bibr">27]</ref>.</p><p>The k-WL test (k &#8805; 2) follows the following coloring procedure:</p><p>1. For each k-tuple of node set V i = (v i1 , v i2 , ..., v i k ) &#8712; V k , i &#8712; [n k ], we initialize V i with a color denoted by C</p><p>i . These colors satisfies that for two k-tuples, V i and V j , C </p><p>2. For each k-tuple V i and u &#8712; V , define N (V i ; u) as a sequence of k-tuples such that N (V i ; u) ((u, i2 , ..., v i k ), (v i1 , u, ..., v i k ), (v i1 , v i2 , ..., u)). Then, the color of V i can be updated via the following mapping:</p><p>where g(&#8226;) is injective coloring.</p><p>3. For each step l, {C</p><p>i } i&#8712;[n k ] is a coloring configuration of the graph G, which is essentially a multi-set. If two graphs have different coloring configurations, these two graphs are determined to be non-isomorphic, while the inverse is not true.</p><p>Note that the step 2 essentially requires to aggregate colors of nk-tuples and thus even when k = 2, the 2-WL test may not leverage the sparsity of graph structure to keep good scalability. Ring-GNN <ref type="bibr">[30]</ref> and PPGN <ref type="bibr">[27]</ref> essentially try to achieve the expressive power of the 2-WL test. They are not scalable to process large graphs and they were only evaluated for entire-graph-level tasks such as graph classification and graph regression <ref type="bibr">[27,</ref><ref type="bibr">30]</ref>.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>(a) (b)</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_1"><p>We follow the terminology</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_2"><p>2-WL test in<ref type="bibr">[39]</ref>, which refines representations of node-pairs iteratively and is proved to be more powerful than 1-WL test. This 2-WL test is termed 2-WL' test in<ref type="bibr">[40]</ref> or 2-FWL test in<ref type="bibr">[27]</ref>. A brief introduction of higher-order WL tests can be found in Appendix H.<ref type="bibr">2</ref> Actually, the 2-WL test may not distinguish edges in a special class of DRGs, termed strongly regular graphs (SRG)<ref type="bibr">[39]</ref>. A connected SRG is a DRG with diameter constrained as 2<ref type="bibr">[41]</ref>.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_3"><p>The code to evaluate our model can be downloaded from https://github.com/snap-stanford/distance-encoding.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_4"><p>j=0 |Q k |) &lt; |Q k |n where we use</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_5"><p>https://github.com/leoribeiro/struc2vec</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_6"><p>https://github.com/weihua916/powerful-gnns</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_7"><p>https://github.com/JiaxuanYou/P-GNN</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="7" xml:id="foot_8"><p>https://github.com/muhanzhang/SEAL</p></note>
		</body>
		</text>
</TEI>
