<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Graph InfoClust: Maximizing Coarse-Grain Mutual Information in Graphs</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>2021</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10291005</idno>
					<idno type="doi">https://doi.org/10.1007/978-3-030-75762-5_43</idno>
					<title level='j'>Advances in Knowledge Discovery and Data Mining. PAKDD</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Costas Mavromatis</author><author>George Karypis</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[This work proposes a new unsupervised (or self-supervised) node representation learning method that aims to leverage the coarse-grain information that is available in most graphs. This extends previous attempts that only leverage fine-grain information (similarities within local neighborhoods) or global graph information (similarities across all nodes). Intuitively, the proposed method identifies nodes that belong to the same clusters and maximizes their mutual information. Thus, coarse-grain (cluster-level) similarities that are shared between nodes are preserved in their representations. The core components of the proposed method are (i) a jointly optimized clustering of nodes during learning and (ii) an Infomax objective term that preserves the mutual information among nodes of the same clusters. Our method is able to outperform competing state-of-the-art methods in various downstream tasks, such as node classification, link prediction, and node clustering. Experiments show that the average gain over the best competing approach is between 0.2% and 6.1% across all tasks. Our code is publicly available at: https://github.com/cmavro/Graph-InfoClust-GIC.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Graph structured data naturally emerge in various real-world applications. Such examples include social networks, citation networks, and biological networks. The challenge, from a data representation perspective, is to encode the high-dimensional, non-Euclidean information about the graph structure and the attributes associated with the nodes and edges into a low-dimensional embedding space. The learned embeddings (a.k.a. representations) can then be used for various tasks, e.g., node classification, link prediction, community detection, and data visualization. In this paper, we focus on unsupervised (or self-supervised) representation learning methods that estimate node embeddings without using any labeled data but instead employ various self-supervision approaches. These methods eliminate the need to develop task-specific graph representation models, eliminate the cost of acquiring labeled data, and can lead to better representations by using large unlabeled datasets.</p><p>Various approaches have been developed to self-supervise graph representation learning. Many of them optimize the embeddings based on a loss function that ensures that pairs of nearby nodes are closer in the embedding space than pairs of distant nodes, e.g., DeepWalk <ref type="bibr">[16]</ref> and GraphSAGE <ref type="bibr">[4]</ref>. A key difference between such methods is whether they use graph neural network (GNN) encoders to compute the embeddings. These encoders insert an additional inductive bias, namely that nodes share similarities with their neighbors, and lead to significant performance gains.</p><p>Since GNN encoders already preserve similarities between neighboring nodes, Deep Graph Infomax (DGI) <ref type="bibr">[21]</ref> uses a different loss function as self-supervision. 
This self-supervision encourages each node to be mindful of the global graph properties, in addition to its local neighborhood properties. Specifically, DGI maximizes the mutual information (MI) between the representation of each node (fine-grain representations) and the global graph representation, which corresponds to the summary of all node representations. DGI has been shown to estimate superior node representations and is considered to be among the best unsupervised node representation learning approaches. However, in many graphs, besides their global structure, there is additional structure that can be captured. For example, nodes tend to belong to (multiple) clusters that represent topologically nearby nodes, or some nodes may share similar structural roles with topologically distant nodes. In such cases, methods that simultaneously preserve these coarse-grain interactions allow node representations to encode richer structural information.</p><p>Motivated by this observation, we developed Graph InfoClust (GIC), an unsupervised representation learning method that extracts coarse-grain information by identifying nodes that belong to the same clusters. Then, GIC learns node representations by maximizing the mutual information of nodes and their cluster-derived summaries, which preserves the coarse-grain information in the embedding space. Furthermore, since the node representations can help identify better clusters (both w.r.t. communities and their roles), we do not rely on an a priori clustering solution. Instead, the cluster-level summaries are obtained and jointly optimized by a differentiable K-means clustering <ref type="bibr">[23]</ref> in an end-to-end fashion. We evaluated GIC on seven standard datasets using node classification, link prediction, and clustering as the downstream tasks. 
Our experiments show that in eleven out of thirteen dataset-task combinations, GIC performs better than the best competing approach, and its average improvement over DGI is 0.9, 2.6, and 15.5 percentage points for node classification, link prediction, and clustering, respectively. These results demonstrate that by leveraging cluster summaries, GIC is able to improve the quality of the estimated representations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Notation, Definitions, and Problem Statement</head><p>Let G := {V, E} denote a graph with N nodes and |E| edges, where V := {v_1, . . . , v_N} is the set of nodes and E is the set of edges. The connectivity is represented with the adjacency matrix A ∈ R^{N×N}, with A_{i,j} = 1 if (v_i, v_j) ∈ E and A_{i,j} = 0 otherwise. Let x_i ∈ R^F be the feature vector associated with node v_i and X ∈ R^{N×F} be the matrix that stores these features across all nodes. Here, we denote vectors by bold lower-case letters and matrices by bold upper-case letters. We also use the terms representation and embedding interchangeably.</p><p>Let H := [h_1, . . . , h_N] ∈ R^{N×D} be the node embedding matrix of G, where h_i ∈ R^D is the node embedding vector for v_i. The goal of node representation learning is to learn a function (encoder), f : R^{N×N} × R^{N×F} → R^{N×D}, such that H = f(A, X). Once learned, H can be used as an input feature matrix to downstream tasks such as node classification, link prediction, and clustering.</p></div>
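To make the notation concrete, the following NumPy sketch (ours, purely illustrative; the random linear map standing in for the encoder f is an assumption, not the paper's GNN) builds A and X for a toy graph and checks the shape of H = f(A, X):

```python
import numpy as np

# Toy graph: N = 4 nodes, undirected edges, F = 3 features per node.
edges = [(0, 1), (1, 2), (2, 3)]
N, F, D = 4, 3, 2

A = np.zeros((N, N))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0                           # symmetric adjacency

X = np.random.default_rng(0).normal(size=(N, F))      # node feature matrix

# Stand-in encoder f: (A, X) -> H; a real encoder would be a trained GNN.
Theta = np.random.default_rng(1).normal(size=(F, D))
H = A @ X @ Theta                                     # one naive propagation step

assert H.shape == (N, D)                              # one D-dim embedding per node
```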
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Graph InfoClust (GIC)</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Motivation and Overview</head><p>Preliminaries, DGI Framework. DGI employs a loss function that encourages node embeddings to contain information about the global graph properties. It does so by training the GNN-encoder to maximize the mutual information (MI) between the representation of each node h_i (fine-grain representation) and a summary representation s ∈ R^D of the entire graph (global summary). Maximizing the precise value of mutual information is intractable; instead, DGI maximizes a Jensen-Shannon MI estimator, which maximizes a lower bound of the MI <ref type="bibr">[6]</ref>. This estimator acts like a binary cross-entropy (BCE) loss, whose objective maximizes the expected log-ratio of the samples from the joint distribution (positive examples) and the product of marginal distributions (negative examples). The positive examples are pairings of s with h_i of the real input graph G := (A, X), while the negatives are pairings of s with h̃_i, which are obtained from a fake/corrupted input graph G̃ := (Ã, X̃) (assuming the same number of nodes). The graph summary s is obtained by averaging all node representations followed by the logistic sigmoid nonlinearity σ(·), as s = σ((1/N) Σ_{i=1}^{N} h_i). A discriminator d : R^D × R^D → R is used to assign higher scores to the positive than the negative examples. The Jensen-Shannon-based BCE objective is expressed as</p><formula notation="TeX">\mathcal{L}_1 = \frac{1}{2N}\sum_{i=1}^{N}\Big[\mathbb{E}_{(\mathbf{A},\mathbf{X})}\log d(\mathbf{h}_i,\mathbf{s}) + \mathbb{E}_{(\tilde{\mathbf{A}},\tilde{\mathbf{X}})}\log\big(1-d(\tilde{\mathbf{h}}_i,\mathbf{s})\big)\Big],\quad(1)</formula><p>which corresponds to a noise-contrastive objective between positive and negative examples. This type of objective has been proven to effectively maximize mutual information between h_i and s <ref type="bibr">[6,</ref><ref type="bibr">21]</ref>.</p></div>
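The ingredients above (the mean readout s, a bilinear discriminator d, and the BCE-style objective) can be sketched in a few lines of NumPy. This is our illustrative sketch with our own function names; the actual implementation uses a trained GNN encoder and learnable parameters in PyTorch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def global_summary(H):
    """Readout: s = sigmoid of the mean of all node embeddings."""
    return sigmoid(H.mean(axis=0))

def discriminator(H, s, W):
    """Bilinear discriminator d(h_i, s) = sigmoid(h_i^T W s), for all rows of H."""
    return sigmoid(H @ W @ s)

def dgi_objective(H_real, H_fake, W):
    """BCE-style Jensen-Shannon estimator: positives pair s with real h_i,
    negatives pair s with corrupted h~_i (written as a loss: lower is better)."""
    s = global_summary(H_real)
    eps = 1e-9
    pos = np.log(discriminator(H_real, s, W) + eps)
    neg = np.log(1.0 - discriminator(H_fake, s, W) + eps)
    return -0.5 * (pos.mean() + neg.mean())
```

Maximizing the mutual-information estimator corresponds to minimizing this loss with respect to the encoder and W.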
<div xmlns="http://www.tei-c.org/ns/1.0"><head>GIC Framework.</head><p>Optimizing the node representations based on Eq. (1) encourages the encoder to preferentially encode information that is shared across all nodes. Such an approach ensures that the computed representations do not encode the noise that may exist in some neighborhoods, since such noise will be very different from the global summary. However, for exactly the same reason, it will also fail to encode information that differs from the global summary but is over-represented in small parts of the graph (e.g., the neighborhoods of some nodes). Capturing such information can be important for downstream tasks like link prediction and clustering. Graph InfoClust (GIC) is specifically designed to address this problem. It postulates that the nodes belong to multiple clusters and learns node representations by simultaneously maximizing the MI between a node's (fine-grain) representation and both the global graph summary and a coarse-grain summary derived from the clusters that the node belongs to. Since this approach takes advantage of multiple entities within the graph, it leverages significantly more of the information that is present across the nodes and across different levels of the graph.</p><p>As Fig. <ref type="figure">1</ref> illustrates, GIC uses a GNN-based encoder to compute fine-grain node representations, which are then used to derive (i) the global summary, and (ii) the cluster-based summaries via a differentiable K-means clustering algorithm. These summaries are used in a contrastive-loss setting to decide whether a node representation comes from the real or a fake graph. Specifically, GIC introduces a coarse-grain loss L_C to discriminate between real and fake samples based on the cluster-based summaries and, accordingly, uses the global-level loss L_1 for the global summary. GIC's overall objective is given by</p><formula notation="TeX">\mathcal{L} = \alpha\,\mathcal{L}_C + (1-\alpha)\,\mathcal{L}_1,\quad(2)</formula><p>where α ∈ [0, 1] controls the relative importance of each component. 
The optimization of this objective leads to the maximization of the mutual information that is present in both fine-grain and coarse-grain levels as well as the global level of the graph.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Coarse-Grain Loss</head><p>Suppose we have computed a coarse-grain/cluster-derived summary z_i ∈ R^D that is associated with v_i (described later in Sect. 3.3). Then, we can simply maximize the mutual information between z_i and h_i in a manner similar to DGI. The new coarse-grain-based objective term is</p><formula notation="TeX">\mathcal{L}_C = \frac{1}{2N}\sum_{i=1}^{N}\Big[\mathbb{E}_{(\mathbf{A},\mathbf{X})}\log g(\mathbf{h}_i,\mathbf{z}_i) + \mathbb{E}_{(\tilde{\mathbf{A}},\tilde{\mathbf{X}})}\log\big(1-g(\tilde{\mathbf{h}}_i,\mathbf{z}_i)\big)\Big],\quad(3)</formula><p>where positive examples are pairings of h_i with z_i and negatives are pairings of h̃_i with z_i. A discriminator g : R^D × R^D → R is used to facilitate the optimization, as before, by assigning higher scores to the positive examples.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">Coarse-Grain Summaries</head><p>Coarse-grain summaries summarize the information that is present in the corresponding (coarse-grain) clusters of nodes within the graph. Because nodes may belong to multiple clusters, and thus may be present in multiple coarse-grain groups of the graph, it is advantageous to perform a soft assignment to these clusters. Then, the coarse-grain summary z_i can be computed as a weighted average of the centroids of the clusters that node v_i belongs to. Since the representations of the nodes can help identify better clusters (both w.r.t. communities and their roles), we optimize the clusters in an end-to-end fashion along with the node representations. Specifically, the cluster centroids μ_k ∈ R^D, with k = 1, . . . , K (assuming K clusters), are obtained by a layer that implements a differentiable version of K-means clustering, as in ClusterNet <ref type="bibr">[23]</ref>, as follows. The clusters are updated by optimizing Eq. (<ref type="formula">3</ref>) via an iterative process, alternately setting</p><formula notation="TeX">r_{ik} = \frac{\exp\big(\beta\cos(\mathbf{h}_i,\boldsymbol{\mu}_k)\big)}{\sum_{k'=1}^{K}\exp\big(\beta\cos(\mathbf{h}_i,\boldsymbol{\mu}_{k'})\big)}\quad(4)</formula><p>and</p><formula notation="TeX">\boldsymbol{\mu}_k = \frac{\sum_{i=1}^{N} r_{ik}\,\mathbf{h}_i}{\sum_{i=1}^{N} r_{ik}},\quad(5)</formula><p>where cos(·, ·) denotes the cosine similarity between two instances and β is an inverse-temperature hyperparameter; β → ∞ gives a binary value for each cluster assignment. The gradients propagate only to the last iteration of the forward-pass updates in order to ensure convergence <ref type="bibr">[23]</ref>. Finally, the cluster-derived summary, i.e., the coarse-grain summary, associated with v_i in Eq. (<ref type="formula">3</ref>) is given by z_i = σ(Σ_{k=1}^{K} r_{ik} μ_k), where r_{ik} is the degree to which v_i is assigned to cluster k, μ_k is the centroid of the kth cluster, as described before, and σ(·) is the sigmoid nonlinearity.</p></div>
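A minimal NumPy sketch of these alternating updates (ours, for illustration; the real layer is differentiable end-to-end in PyTorch, and the centroid initialization used here is an assumption):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def soft_kmeans(H, K, beta=10.0, iters=10, seed=0):
    """ClusterNet-style soft K-means: r_ik = softmax_k(beta * cos(h_i, mu_k)),
    then mu_k = weighted mean of the h_i (Eqs. (4) and (5))."""
    rng = np.random.default_rng(seed)
    Hn = H / (np.linalg.norm(H, axis=1, keepdims=True) + 1e-9)
    mu = H[rng.choice(len(H), size=K, replace=False)]    # init centroids from nodes
    for _ in range(iters):
        mun = mu / (np.linalg.norm(mu, axis=1, keepdims=True) + 1e-9)
        logits = beta * (Hn @ mun.T)                     # beta * cosine similarity
        logits -= logits.max(axis=1, keepdims=True)      # numerically stable softmax
        R = np.exp(logits)
        R /= R.sum(axis=1, keepdims=True)                # soft assignments r_ik
        mu = (R.T @ H) / (R.sum(axis=0)[:, None] + 1e-9) # centroid update
    return R, mu

def cluster_summaries(R, mu):
    """Coarse-grain summary z_i = sigmoid(sum_k r_ik * mu_k)."""
    return sigmoid(R @ mu)
```

A larger β sharpens the softmax toward hard (binary) assignments, matching the β → ∞ remark above.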
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Fake Input and Discriminators</head><p>When the input is a single graph, we opt to corrupt the graph by row-shuffling the original features X, i.e., X̃ := shuffle([x_1, x_2, . . . , x_N]) and Ã := A (see Fig. <ref type="figure">1a</ref>). In the case of multiple input graphs, it may be useful to randomly sample a different graph from the training set as negative examples.</p><p>As the discriminator function d in L_1, we use a bilinear scoring function, followed by a logistic sigmoid nonlinearity, which converts scores into probabilities, as d(h_i, s) = σ(h_i^T W s), where W is a learnable scoring matrix. As the discriminator function g in L_C, we use an inner product similarity, followed by σ(·), as g(h_i, z_i) = σ(h_i^T z_i). We replace the bilinear scoring function with an inner product for g, since this dramatically reduces the memory requirements and worked better in our case.</p></div>
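The corruption function and the inner-product discriminator can be sketched as follows (illustrative NumPy with our own function names):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def corrupt(A, X, seed=0):
    """Single-graph corruption: keep the adjacency A, row-shuffle the features X."""
    rng = np.random.default_rng(seed)
    return A, X[rng.permutation(X.shape[0])]

def g_inner(H, Z):
    """Coarse-grain discriminator g(h_i, z_i) = sigmoid(h_i^T z_i), row-wise.
    Unlike the bilinear d, no D x D scoring matrix W is needed."""
    return sigmoid(np.einsum('nd,nd->n', H, Z))
```

The memory saving comes from dropping the D × D matrix W; the inner product has no parameters at all.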
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Related Work</head><p>Many unsupervised/self-supervised graph representation learning approaches follow a contrastive learning paradigm. Their objective is to give a higher score to positive examples and a lower score to negative examples, which acts as a binary classification between positives and negatives. Based on the selection of positive/negative examples, methods are able to capture fine-grain similarities (DeepWalk <ref type="bibr">[16]</ref>, node2vec <ref type="bibr">[3]</ref>, GraphSAGE <ref type="bibr">[4]</ref>, GAE/VGAE <ref type="bibr">[9]</ref>, ARGVA <ref type="bibr">[13]</ref>, GMI <ref type="bibr">[15]</ref>) or global similarities (DGI <ref type="bibr">[21]</ref>, MVGRL <ref type="bibr">[5]</ref>) that are shared between nodes.</p><p>For example, in DeepWalk, node2vec, and GraphSAGE, positive examples are representations of node pairs that co-occur in short random walks, while negatives are representations of distant nodes. In GAE/VGAE and ARGVA, positive examples are representations of incident nodes and negatives are representations of random pairs of nodes. ARGVA uses additional positive/negative pairs, as it simultaneously discriminates whether a latent node representation comes from the prior (positive) or the graph encoder (negative). GMI has two types of positive/negative examples: first, it pairs a node representation with its own input structure and attributes (positive) and with the input of a random node (negative); second, it obtains them as in GAE. MVGRL <ref type="bibr">[5]</ref> works in a similar manner to DGI, but it augments the real graph to obtain two additional versions of the graph (real and fake), which doubles the pairs of positive/negative examples. Methods that capture additional global properties (like DGI and MVGRL) have been shown to estimate superior node representations.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Experimental Methodology and Results</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.1">Methodology and Configuration</head><p>Datasets. We evaluated the performance of GIC using seven commonly used benchmarks (Table <ref type="table">1</ref>). CORA, CiteSeer, and PubMed <ref type="bibr">[25]</ref> are three citation networks, CoauthorCS and CoauthorPhysics are co-authorship graphs, and AmazonComputers and AmazonPhoto <ref type="bibr">[18]</ref> are segments of the Amazon co-purchase graph.</p><p>Hyper-parameter Tuning and Model Selection. As the encoder function f_GNN we use the graph convolution network (GCN) <ref type="bibr">[8]</ref> with the propagation rule at layer l:</p><formula notation="TeX">\mathbf{H}^{(l+1)} = \mathrm{PReLU}\big(\hat{\mathbf{D}}^{-1/2}\hat{\mathbf{A}}\hat{\mathbf{D}}^{-1/2}\mathbf{H}^{(l)}\boldsymbol{\Theta}\big),</formula><p>where Â = A + I_N is the adjacency matrix with self-loops, D̂ is the diagonal degree matrix of Â, Θ ∈ R^{F×D} is a learnable matrix, PReLU denotes the nonlinear parametric rectified linear unit, and H^{(0)} = X.</p><p>We use a one-layer GCN-encoder (l = 1) and iterate the cluster updates in Eq. (<ref type="formula">4</ref>) and (<ref type="formula">5</ref>) 10 times. Since GIC's cluster updates are performed on the unit sphere (cosine similarity), we row-normalize the embeddings before the downstream task.</p><p>GIC's learnable parameters are initialized with Glorot initialization <ref type="bibr">[2]</ref> and the objective is optimized using the Adam optimizer <ref type="bibr">[7]</ref> with a learning rate of 0.001. We train for a maximum of 2k epochs, but training is terminated early if the training loss does not improve for 20 or 50 consecutive epochs. The model state is then reset to the one with the best (lowest) training loss. For each dataset-task pair, we perform model selection based on the validation set of the corresponding task (except for clustering, for which we use the link prediction task instead). 
We set α ∈ {0.25, 0.5, 0.75} (regularization of the two objective terms), β ∈ {10, 100} (softness of the cluster assignments), and K ∈ {32, 128} (number of clusters) to train the model, and keep the parameter triplet that achieved the best result on the validation set. We determine these parameters once for each dataset-task pair, so that the corresponding results are reported based on the same triplet.</p><p>Competing Approaches and Implementation. We compare the performance of GIC against seventeen unsupervised and six semi-supervised methods and variants. The results for all the competing methods, except DGI and sometimes GMI and MVGRL, were obtained directly from <ref type="bibr">[5,</ref><ref type="bibr">13,</ref><ref type="bibr">15,</ref><ref type="bibr">18]</ref>. We name VGAE-best and ARGVA-best the best-performing variants of the VGAE and ARGVA methods, respectively. We implemented GIC using the Deep Graph Library (DGL) <ref type="bibr">[22]</ref> and PyTorch <ref type="bibr">[14]</ref>, as well as by modifying DGI's original implementation (<ref type="url">https://github.com/cmavro/Graph-InfoClust-GIC</ref>). All experiments were performed on an NVIDIA GeForce RTX 2070 GPU, on a machine with an i5-8400 CPU and 32 GB RAM. The semi-supervised methods are GCN <ref type="bibr">[8]</ref>, GAT <ref type="bibr">[20]</ref>, GraphSAGE <ref type="bibr">[4]</ref>, and MoNet <ref type="bibr">[12]</ref>.</p></div>
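The GCN propagation rule of Sect. 5.1 can be sketched as follows (a NumPy illustration of one layer; in practice the PReLU slope and Θ are learned, and the paper's DGL/PyTorch implementation should be used instead):

```python
import numpy as np

def gcn_layer(A, H, Theta, slope=0.25):
    """One GCN propagation step:
    H_next = PReLU(D^-1/2 (A + I) D^-1/2 H Theta).
    The PReLU slope is learnable in practice; fixed here for simplicity."""
    A_hat = A + np.eye(A.shape[0])                               # add self-loops
    d_inv_sqrt = 1.0 / np.sqrt(A_hat.sum(axis=1))
    A_norm = A_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]   # sym. normalization
    H_next = A_norm @ H @ Theta
    return np.where(H_next > 0, H_next, slope * H_next)          # PReLU
```

Self-loops guarantee every row of Â has degree at least one, so the normalization is always well defined.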
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Results</head><p>Node Classification. For unsupervised methods, the learned node embeddings are passed to a downstream classifier based on logistic regression. Following <ref type="bibr">[18]</ref>, we set the embedding dimension to D = 64, unless otherwise stated. Also, we sample 20×#classes nodes as the train set and 30×#classes nodes as the validation set; the remaining nodes form the test set for the classifier. The sets are either uniformly drawn from each class (balanced sets) or randomly sampled (imbalanced sets). In the Planetoid split <ref type="bibr">[25]</ref>, 1,000 of the remaining nodes are used only for testing. We use a logistic regression classifier, which is trained with a learning rate of 0.01 for 300 or 1k epochs with the Adam optimizer and Glorot initialization.</p><p>Table <ref type="table">2</ref> shows the performance of GIC compared to DGI and semi-supervised methods. Leveraging cluster information benefits datasets such as CORA and PubMed, where GIC achieves a mean classification accuracy gain of more than 1% over DGI. In CiteSeer, CoauthorCS, and CoauthorPhysics, the gain is slightly lower, but still more than 0.4%, since the abundant attributes of each node make cluster extraction more challenging. In AmazonComputers and AmazonPhoto, GIC performs significantly better than DGI, with a gain of more than 2% on average. Due to the large edge density of these datasets, the GNN-encoder aggregates information from many other nodes, leading to representations very similar to the global summary. In such cases, GIC's objective term is responsible for making the representations contain different aspects of information. In all cases, GIC performs better than the worst-performing semi-supervised method, with a gain of more than 1.5% and as high as 4.3%. 
We also observe that leveraging additional coarse-grain information (GIC, MVGRL) helps more than leveraging extra fine-grain information (GMI). Moreover, GIC is the only method that consistently performs better than DGI across all datasets with varying embedding size D. The performance improvement for the other methods (GMI, MVGRL) is evident only for large D, e.g., D = 512 in their original papers. Link Prediction. In link prediction, some edges are hidden in the input graph and the goal is to predict the existence of these edges based on the computed embeddings. The probability of an edge between nodes i and j is given by σ(h_i^T h_j), where σ is the logistic sigmoid function. We follow the setup described in <ref type="bibr">[9]</ref>: 5% of edges and negative edges as the validation set, and 10% of edges and negative edges as the test set.</p><p>Table <ref type="table">4</ref> illustrates the benefits of GIC for link prediction tasks. GIC outperforms DGI in all three datasets, since GIC's clustering is able to preserve and reveal useful interactions between nodes which may hint at the existence of links between them. GIC also outperforms VGAE and ARGVA in CORA and CiteSeer, by 1%-2% and 4.5%-5.5%, respectively, even though these methods are specifically designed for link prediction tasks. In PubMed, which has fewer attributes to exploit, the performance of GIC is slightly worse than that of VGAE and ARGVA. It is noteworthy that GIC's proposed objective term works better than GMI, which combines a mutual information objective term with a link prediction term.</p><p>Clustering. In clustering, the goal is to cluster together related nodes (e.g., nodes that belong to the same class) without any label information. The computed embeddings are clustered into K = #classes clusters with K-means. 
We set D = 32, and the evaluation is based on external labels, the same as those used for node classification.</p><p>Table <ref type="table">5</ref> illustrates GIC's performance for clustering. GIC performs better than the other unsupervised methods in two out of three datasets (CORA and CiteSeer), and performs almost equally with ARGVA in PubMed. The gain over DGI is significantly large in all datasets, and can be as high as 15% to 18.5% for the NMI metric. Due to its jointly optimized clustering, GIC can be considered an effective method when the downstream task is clustering.</p></div>
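The link-prediction scoring described above, σ(h_i^T h_j), can be sketched as follows (the AUC helper is our illustrative addition, not the paper's evaluation code):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def edge_probs(H, pairs):
    """P(edge i-j) = sigmoid(h_i^T h_j) for an integer array of (i, j) pairs."""
    i, j = pairs[:, 0], pairs[:, 1]
    return sigmoid(np.einsum('nd,nd->n', H[i], H[j]))

def auc_score(pos, neg):
    """AUC as the probability that a held-out edge outscores a sampled
    non-edge (simple O(n^2) pairwise comparison, for illustration)."""
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()
```

Embeddings that place linked nodes close together (large inner products) therefore score high AUC on the held-out test edges.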
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Ablation and Parameter Study</head><p>Effect of the Loss Function. We report results w.r.t. the silhouette score (SIL) <ref type="bibr">[17]</ref>, a clustering evaluation metric, computed on the learned representations after their t-SNE 2D projection <ref type="bibr">[10]</ref>. Table <ref type="table">6</ref> illustrates that using the novel objective term L_C (α = 1) outperforms the baseline DGI objective L_1 in all experiments. Primarily, using a single global vector to optimize the node representations may lead to certain pitfalls, especially when the embedding size, and thus the global vector's capacity, is low (Table <ref type="table">6a</ref>). Moreover, accounting for both cluster-level and graph-level information leads to better representations, while ignoring either kind of information (α = 0 or α = 1) worsens the quality of the representations. This also holds when the hyperparameters β and K are not optimized (Avg vs. Max in Table <ref type="table">6b</ref>), since the soft assignment of the clusters and their joint optimization during learning alleviate this need. Note, however, that setting a large number of clusters K does not mean that all clusters will be distinct.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>We have presented Graph InfoClust (GIC), an unsupervised graph representation learning method which relies on leveraging cluster-level content. GIC identifies nodes with similar representations, clusters them together, and maximizes their mutual information. This enables us to improve the quality of node representations with richer content and obtain better results than existing approaches for tasks like node classification, link prediction, clustering, and data visualization.</p></div></body>
		</text>
</TEI>
