<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Attribute-Enhanced Similarity Ranking for Sparse Link Prediction</title></titleStmt>
			<publicationStmt>
				<publisher>Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining</publisher>
				<date>08/03/2025</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10661377</idno>
					<idno type="doi"></idno>
					
					<author>Joao Mattos</author><author>Zexi Huang</author><author>Mert Kosan</author><author>Ambuj Singh</author><author>Arlei Silva</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Link prediction is a fundamental problem in graph data. In its most realistic setting, the problem consists of predicting missing or future links between random pairs of nodes from the set of disconnected pairs. Graph Neural Networks (GNNs) have become the predominant framework for link prediction. GNN-based methods treat link prediction as a binary classification problem and handle the extreme class imbalance---real graphs are very sparse---by sampling (uniformly at random) a balanced number of disconnected pairs not only for training but also for evaluation. However, we show that the reported performance of GNNs for link prediction in the balanced setting does not translate to the more realistic imbalanced setting and that simpler topology-based approaches are often better at handling sparsity. These findings motivate Gelato, a similarity-based link-prediction method that applies (1) graph learning based on node attributes to enhance a topological heuristic, (2) a ranking loss for addressing class imbalance, and (3) a negative sampling scheme that efficiently selects hard training pairs via graph partitioning. Experiments show that Gelato outperforms existing GNN-based alternatives.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">Introduction</head><p>Machine learning on graphs supports various structured-data applications including social network analysis <ref type="bibr">[42,</ref><ref type="bibr">64,</ref><ref type="bibr">74]</ref>, recommender systems <ref type="bibr">[33,</ref><ref type="bibr">52,</ref><ref type="bibr">77]</ref>, natural language processing <ref type="bibr">[66,</ref><ref type="bibr">72,</ref><ref type="bibr">86]</ref>, and physics modeling <ref type="bibr">[17,</ref><ref type="bibr">32,</ref><ref type="bibr">68]</ref>. Among the graph-related tasks, one could argue that link prediction, which consists of predicting missing or future links <ref type="bibr">[47,</ref><ref type="bibr">50]</ref>, is the most fundamental one. This is because link prediction not only has many concrete applications <ref type="bibr">[45,</ref><ref type="bibr">62]</ref> but can also be considered an (implicit or explicit) step of the graph-based machine learning pipeline <ref type="bibr">[2,</ref><ref type="bibr">49,</ref><ref type="bibr">80]</ref>-as the observed graph is usually noisy and/or incomplete.</p><p>Graph Neural Networks (GNNs) <ref type="bibr">[26,</ref><ref type="bibr">40,</ref><ref type="bibr">76]</ref> have emerged as the predominant paradigm for machine learning on graphs. Similar to their great success in node classification <ref type="bibr">[41,</ref><ref type="bibr">81,</ref><ref type="bibr">96]</ref> and graph classification <ref type="bibr">[28,</ref><ref type="bibr">53,</ref><ref type="bibr">87,</ref><ref type="bibr">91]</ref>, GNNs have been shown to achieve state-of-the-art link prediction performance <ref type="bibr">[10,</ref><ref type="bibr">46,</ref><ref type="bibr">59,</ref><ref type="bibr">79,</ref><ref type="bibr">89,</ref><ref type="bibr">90]</ref>. Compared to classical approaches that rely on expert-designed heuristics to extract topological information (e.g., Common Neighbors <ref type="bibr">[55]</ref>, Adamic-Adar <ref type="bibr">[1]</ref>, Preferential Attachment <ref type="bibr">[4]</ref>), GNNs can naturally incorporate attributes and are believed to be able to learn new effective heuristics directly from data via supervised learning.</p><p>However, we argue that the evaluation of GNN-based link prediction methods paints an overly optimistic view of their model performance. Most real graphs are sparse and have a modular structure <ref type="bibr">[3,</ref><ref type="bibr">54]</ref>. In Cora and Citeseer (citation networks), less than 0.2% of the node pairs are links/positive (see Table <ref type="table">1</ref>) and modules arise around research topics. Yet GNN-based link prediction methods are evaluated on an artificially balanced test set that includes every positive pair but only a small sample of the negative ones chosen uniformly at random <ref type="bibr">[27]</ref>. Due to modularity, the majority of negative pairs sampled are expected to be relatively far from each other (i.e. across different modules) compared to positive pairs. As a consequence, performance metrics reported for this balanced setting, which we call biased testing, differ widely from the ones observed for the more challenging unbiased testing, where the test set includes every disconnected pair of nodes. In particular, we have found that unsupervised topological heuristics are more competitive in the unbiased setting, often outperforming recent GNN-based link prediction methods. This finding has motivated us to rethink the design of link prediction methods for sparse graphs.</p><p>A key hypothesis of our work is that effective unbiased link prediction in sparse graphs requires a similarity metric that can distinguish positive pairs from hard negative ones. More specifically, link prediction should be seen as a "needle in the haystack" type of problem, where extreme class imbalance makes even the most similar pairs still more likely to be negative. Existing GNN-based approaches fail in this sparse regime due to (1) their use of a binary classification loss that is highly sensitive to class imbalance; <ref type="bibr">(2)</ref> their biased training that mimics biased testing; (3) their inability to learn effective topological heuristics directly from data.</p><p>The goal of this paper is to address the key limitations of GNNs for link prediction mentioned above. We present Gelato, a novel similarity-based framework for link prediction that combines a topological heuristic and graph learning to leverage both topological and attribute information. Gelato applies a ranking-based N-pair loss and partitioning-based negative sampling to select hard training node pairs. Extensive experiments demonstrate that our model significantly outperforms state-of-the-art GNNs in both accuracy and scalability. Figure <ref type="figure">1</ref> provides an overview of our approach.</p><p>To summarize, our contributions are: <ref type="bibr">(1)</ref> We scrutinize the evaluation of supervised link prediction methods and identify their limitations in handling class imbalance; <ref type="bibr">(2)</ref> we propose a simple, effective, and efficient framework to combine topological and attribute information for link prediction in an innovative fashion;</p><p>(3) we introduce an N-pair link prediction loss that we show to be more effective at addressing class imbalance; and (4) we propose an efficient partitioning-based negative sampling scheme that improves link prediction generalization in the sparse setting.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">Limitations in Supervised Link Prediction Evaluation</head><p>Supervised link prediction is often formulated as binary classification, where the positive (or negative) class includes node pairs connected (or not connected) by a link. A key difference between link prediction and other classification problems is that the two classes in link prediction are extremely imbalanced as most graphs of interest are sparse-e.g. the datasets from Table <ref type="table">1</ref> are significantly more imbalanced than those in <ref type="bibr">[75]</ref>. However, the class imbalance is not properly addressed in the evaluation of existing approaches.</p><p>Existing link prediction methods <ref type="bibr">[8,</ref><ref type="bibr">11,</ref><ref type="bibr">14,</ref><ref type="bibr">39,</ref><ref type="bibr">59,</ref><ref type="bibr">83,</ref><ref type="bibr">90,</ref><ref type="bibr">92,</ref><ref type="bibr">99</ref>] are evaluated on a test set containing all positive test pairs and only an equal number of random negative pairs. Similarly, the Open Graph Benchmark (OGB) ranks predicted links against a very small sample of random negative pairs. We term these approaches biased testing as they highly overestimate the ratio of positive pairs in the graph. This issue is exacerbated in most real graphs, where community structure <ref type="bibr">[56]</ref> causes random negative pairs to be particularly easy to identify <ref type="bibr">[43]</ref>-they likely involve members of different communities. Evaluation metrics based on biased testing provide an overly optimistic assessment of the performance in unbiased testing, where every negative pair is included in the test set. In fact, in real applications where positive test edges are not known a priori, it is impossible to construct those biased test sets to begin with.</p><p>Regarding evaluation metrics, Area Under the Receiver Operating Characteristic Curve (AUC) and Average Precision (AP) are the two most popular evaluation metrics for supervised link prediction <ref type="bibr">[8,</ref><ref type="bibr">11,</ref><ref type="bibr">14,</ref><ref type="bibr">39,</ref><ref type="bibr">59,</ref><ref type="bibr">83,</ref><ref type="bibr">90,</ref><ref type="bibr">92,</ref><ref type="bibr">99]</ref>. We first argue that, as in other imbalanced classification problems <ref type="bibr">[18,</ref><ref type="bibr">67]</ref>, AUC is not an effective evaluation metric for link prediction as it is biased towards the majority class (non-edges). On the other hand, AP and other rankbased metrics such as Hits@&#119896;-used in OGB <ref type="bibr">[27]</ref>-are effective for imbalanced classification but only if evaluated on an unbiased test.</p><p>Example: Consider an instance of Stochastic Block Model (SBM) <ref type="bibr">[35]</ref> with 10 blocks of size 1k, intra-block density 0.9, and interblock density 0.1. The number of inter-block negative pairs is 10 &#215; 1k &#215; (10 -1) &#215; 1k &#215; (1 -0.1)/2 = 40.5M, while the number of intra-block negative pairs, which have high topological similarities like the ground-truth positive pairs and are much harder to contrast against, is 10 &#215; 1k &#215; 1k &#215; (1 -0.9)/2 = 0.5M. Biased testing would select less than 0.5M/(0.5M + 40.5M) &lt; 2% of the test negative pairs among the (hard) intra-block ones. In this scenario, even a random classifier is expected to obtain 50% precision. However, the expected precision drops to less than 22% (9M positive pairs vs. 41M negative pairs) under unbiased testing.</p><p>We will formalize the argument used in the example above by performing link prediction on a generic instance of the SBM with intra-block density &#119901;, inter-block density &#119902;, where &#119901; &gt; &#119902;, and &#119896; blocks of size &#119899;. In particular, we will consider an instance of SBM corresponding to the expected node pattern given the parameters, where a node is connected to (&#119899; -1)&#119901; other nodes within its block and (&#119899;&#119896; -&#119899;)&#119902; nodes outside its block. In this setting, the optimal link prediction algorithm can only distinguish potential links within or across blocks-as pairs within each set are connected with probability &#119901; and &#119902;, respectively.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Lemma 1.</head><p>The ratio &#120572; between inter-cluster and intra-cluster negative node pairs in the SBM is such that:</p><p>The above lemma follows directly from the definition of the SBM and shows that the set of negative pairs is dominated by (easy) inter-cluster pairs as &#119901; increases compared to &#119902;. Theorem 2.1. In the unbiased setting, the optimal accuracy link prediction method based on binary classification for the SBM predicts no links if &#119901; &lt; 0.5.</p><p>The proof is given in the Appendix B. Intuitively, even if the classifier has access to the SBM block structure, most within-block pairs are disconnected and thus the accuracy is maximized if no links are predicted. On the other hand, if &#119901; &gt; &#119902;, an effective link prediction method should be able to leverage the SBM block structure to predict within block links. This motivates our formulation of link prediction as a "needle in the haystack" type of problem, where even the top candidate links (i.e., within-block pairs) are still more likely to be negative due to the sparsity of the graph. Lemma 2. In the biased setting, there exist non-trivial link prediction methods with optimal accuracy based on binary classification for the SBM with &#119901; &lt; 0.5.</p><p>The proof is given in the Appendix C. The idea is that in the biased setting, a link prediction method that predicts within-block pairs as links can outperform the trivial classifier described in Theorem 2.1. This illustrates how biased testing, which is applied by recent work on supervised link prediction, can be misleading for sparse graphs. More specifically, a model trained under the biased setting might perform poorly if evaluated in the, more realistic, unbiased setting due to possibly unforeseen distribution shifts across the settings. This is a key motivation for our work.</p><p>The above discussion motivates a more representative evaluation setting for supervised link prediction. We argue for the use of rankbased evaluation metrics-AP, Precision@&#119896; <ref type="bibr">[47]</ref>, and Hits@&#119896; <ref type="bibr">[5]</ref>with unbiased testing, where positive edges are ranked against hard negative node pairs. These metrics have been widely applied in related problems, such as unsupervised link prediction <ref type="bibr">[30,</ref><ref type="bibr">47,</ref><ref type="bibr">57,</ref><ref type="bibr">93]</ref>, knowledge graph completion <ref type="bibr">[5,</ref><ref type="bibr">73,</ref><ref type="bibr">84]</ref>, and information retrieval <ref type="bibr">[69]</ref>, where class imbalance is also significant. In our experiments, we will illustrate how these evaluation metrics combined with unbiased testing provide a drastically different and more informative performance evaluation compared to existing approaches.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">Method</head><p>The limitations of supervised link prediction methods, including GNNs, to handle unbiased testing in sparse graphs motivate the design of a novel link prediction approach. First, preliminary results (see Table <ref type="table">2</ref>) have shown that topological heuristics are not impacted by class imbalance. That is because these heuristics are sensitive to small differences in structural similarity between positive and hard negative pairs while not relying on any learning-and thus not being affected by biased training. However, local structure proximity heuristics, such as Common Neighbors, are known to be less efficient in highly sparse scenarios observed in many realworld applications <ref type="bibr">[48]</ref>-Table <ref type="table">1</ref> shows the sparsity of our datasets. Further, unlike GNNs, topological heuristics are unable to leverage attribute information. Our approach addresses these limitations by integrating supervision into a powerful topological heuristic to leverage attribute data via graph learning.</p><p>Notation and problem. Consider an attributed graph &#119866; = (&#119881; , &#119864;, &#119883; ), where &#119881; is the set of &#119899; nodes, &#119864; is the set of &#119898; edges (links), and &#119883; = (&#119909; 1 , ..., &#119909; &#119899; ) &#119879; &#8712; R &#119899;&#215;&#119903; collects &#119903; -dimensional node attributes. The topological (structural) information of the graph is represented by its adjacency matrix &#119860; &#8712; R &#119899;&#215;&#119899; , with &#119860; &#119906;&#119907; &gt; 0 if an edge of weight &#119860; &#119906;&#119907; connects nodes &#119906; and &#119907; and &#119860; &#119906;&#119907; = 0, otherwise. The (weighted) degree of node &#119906; is given as &#119889; &#119906; = &#119907; &#119860; &#119906;&#119907; and the corresponding degree vector (matrix) is denoted as &#119889; &#8712; R &#119899; (&#119863; &#8712; R &#119899;&#215;&#119899; ). The volume of the graph is vol(&#119866;) = &#119906; &#119889; &#119906; . Our goal is to infer missing links in &#119866; based on its topological and attribute information, &#119860; and &#119883; .</p><p>Model overview. Figure <ref type="figure">1</ref> provides an overview of our model. It starts by selecting training node pairs using a novel partitioningbased negative sampling scheme. Next, a topology-centric graph learning phase incorporates node attribute information directly into the graph structure via a Multi-layer Perceptron (MLP). We then apply a topological heuristic, Autocovariance (AC), to the attributeenhanced graph to obtain a pairwise score matrix. Node pairs with the highest scores are predicted as links. The scores for training pairs are collected to compute an N-pair loss. Finally, the loss is used to train the MLP parameters in an end-to-end manner. We name our model Gelato (Graph enhancement for link prediction with autocovariance). Gelato represents a different paradigm in supervised link prediction combining a graph encoding of attributes with a topological heuristic instead of relying on node embeddings. While the building blocks of Gelato have been proposed by previous work, our paper is the first to apply these building blocks to address challenges in supervised link prediction for sparse graphs.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Graph learning</head><p>The goal of graph learning is to generate an enhanced graph that incorporates node attribute information into the topology. This can be considered as the "dual" operation of message-passing in GNNs, which incorporates topological information into attributes (embeddings). We propose graph learning as a more suitable scheme to combine attributes and topology for link prediction since it does not rely on the GNN to learn a topological heuristic, which we have verified empirically to be a challenge.</p><p>Specifically, our first step of graph learning is to augment the original edges with a set of node pairs based on their (untrained) attribute similarity (i.e., adding an &#120598;-neighborhood graph):</p><p>where &#119904; (&#8226;) can be any similarity function (we use cosine in our experiments) and &#120598; &#120578; is a threshold that determines the number of added pairs as a ratio &#120578; of the original number of edges &#119898;.</p><p>A simple MLP then maps the pairwise node attributes into a trained edge weight for every edge in &#119864;:</p><p>where [&#119909; &#119906; ; &#119909; &#119907; ] denotes the concatenation of &#119909; &#119906; and &#119909; &#119907; and &#120579; contains the trainable parameters. For undirected graphs, we instead use the following permutation invariant operator <ref type="bibr">[13]</ref>:</p><p>The final weights of the enhanced graph are a combination of the topological, untrained, and trained weights:</p><p>where &#120572; and &#120573; are hyperparameters. The enhanced adjacency matrix &#119860; is then fed into a topological heuristic for link prediction introduced in the next section. The MLP is not trained directly to predict the links but instead trained end-to-end to enhance the input graph given to the topological heuristic. Further, the MLP can be easily replaced by a more powerful model such as a GNN, but the goal of this paper is to demonstrate the general effectiveness of our framework and we will show that even a simple MLP leads to significant improvement over the base heuristic.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Topological heuristic</head><p>Assuming that the learned adjacency matrix &#119860; incorporates structural and attribute information, Gelato applies a topological heuristic to &#119860;. Specifically, we generalize Autocovariance, which has been shown to be effective for non-attributed graphs <ref type="bibr">[30]</ref>, to the attributed setting. Autocovariance is a random-walk-based similarity metric. Intuitively, it measures the difference between the co-visiting probabilities for a pair of nodes in a truncated walk</p><p>Autocovariance R w uv u v MLP(&#952;) a b Graph learning Topological heuristic N-pair loss Link ranking Sec. 3.1 Sec. 3.2 Sec. 3.3 L(&#952;) Eq. 8 R(a,b) &#10007; &#10007; &#10007; &#10007; &#10003; Negative sampling Sec. 3.4</p><p>Figure <ref type="figure">1</ref>: Gelato applies graph learning to incorporate attribute information into the topology. The learned graph is given to a topological heuristic that predicts edges between node pairs with high Autocovariance similarity. The parameters of the MLP are optimized end-to-end using the N-pair loss over node pairs selected via a partitioning-based negative sampling scheme. Experiments show that Gelato outperforms state-of-the-art GNN-based link prediction methods.</p><p>and in an infinitely long walk. Given the enhanced graph &#119866;, the Autocovariance similarity matrix &#119877; &#8712; R &#119899;&#215;&#119899; is given as</p><p>where &#119905; &#8712; N 0 is the scaling parameter of the truncated walk. Each entry &#119877; &#119906;&#119907; represents a similarity score for node pair (&#119906;, &#119907;), and top similarity pairs are predicted as links. Note that &#119877; &#119906;&#119907; only depends on the &#119905;-hop enclosing subgraph of (&#119906;, &#119907;) and can be easily differentiated with respect to the edge weights in the subgraph. Gelato could be applied with any differentiable topological heuristics or even a combination of them. In our experiments (Section 4.3), we will show that Autocovariance alone enables state-of-the-art link prediction without requiring any learning.</p><p>Autocovariance versus other heuristics. Following <ref type="bibr">[48]</ref>, we show that local structural heuristics commonly employed by GNNs, such as Common Neighbors, exhibit reduced efficacy in sparse networks with less informative neighborhood structures. This observation motivates our selection of Autocovariance as our topological heuristic, given its ability to capture global structural patterns through random walks. Further, the parameter &#119905; in Autocovariance offers adaptability to varying network sparsity levels <ref type="bibr">[48]</ref>, ranging from denser (lower &#119905; values) to sparser (higher &#119905; values) networks.</p><p>Autocovariance distinguishes negative pairs. Autocovariance can be seen as a general case of the Modularity metric &#119876; <ref type="bibr">[19]</ref>:</p><p>in which &#119898; = vol(&#119866;)/2, &#119889; &#119894; and &#119889; &#119895; are the degrees of nodes &#119894; and &#119895;, and &#119904; &#119894; &#119904; &#119895; is a product that indicates whether both nodes are in the same partition. More specifically, for &#119905; = 1, Autocovariance expresses the graph partitioning resulting in the optimal Modularity value, which captures the relationship between the expected number of edges between two partitions compared to the probability of any random edge in the graph. This key property directly applies to our scenario, enabling Gelato to distinguish between hard (same partitions) and easy (different partitions) negative pairs and motivating us to adopt Autocovariance as our graph heuristic. Further, as &#119905; increases, Autocovariance expresses growing coarser partitions We represent sparse tensors (1 and 2) as matrices with blank entries and dense tensors (3 and 4) as color-filled matrices. We extract from the enhanced transition matrix (1) a slice &#119875; 0 (2) given a batch of node indices &#119881; &#119887;&#119886;&#119905;&#119888;&#8462; . Instead of a matrix exponentiation, we compute &#119875; 0 ( &#119863; -1 &#119860;) repeatedly for &#119905; times to obtain &#119875; &#119896; (3), a dense tensor. Finally, we use &#119875; &#119896; to obtain the autocovariance &#119877; (4) for nodes in the batch. This is implemented efficiently using dense-sparse tensor multiplication.</p><p>until approximating spectral clustering (for &#119905; &#8594; &#8734;), being flexible regarding partition sizes according to different domains. Scaling up Gelato with batching and sparse operations. Naively implementing Gelato using dense tensors is infeasible, due to the quadratic VRAM requirement (&#119877; &#8712; R |&#119881; | &#215; |&#119881; | ). To address this limitation, we propose storing &#119860; as a sparse matrix. Then, instead of directly computing ( &#119863; -1 &#119860;) &#119905; from Equation <ref type="formula">5</ref>(resulting on a dense |&#119881; | &#215; |&#119881; | matrix), we compute</p><p>where &#119875; 0 = ( &#119863; -1 &#119860;) &#119894; &#119895; , for all &#119894; &#8712; &#119881; &#119887;&#119886;&#119905;&#119888;&#8462; , where &#119881; &#119887;&#119886;&#119905;&#119888;&#8462; consists of the nodes in the current batch. This operation substitution allows us to compute a sequence of &#119905; multiplications between a dense</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.3">N-pair loss</head><p>Supervised link prediction methods rely on the cross entropy loss (CE) to optimize model parameters. However, CE is known to be sensitive to class imbalance <ref type="bibr">[7]</ref>. Instead, Gelato leverages the N-pair loss <ref type="bibr">[71]</ref> that is inspired by the metric learning and learning-torank literature <ref type="bibr">[9,</ref><ref type="bibr">51,</ref><ref type="bibr">65,</ref><ref type="bibr">78]</ref> to train the parameters of our graph learning model from highly imbalanced unbiased training data.</p><p>The N-pair loss (NP) contrasts each positive training edge (&#119906;, &#119907;) against a set of negative pairs &#119873; (&#119906;, &#119907;). It is computed as follows:</p><p>Intuitively, &#119871;(&#120579; ) is minimized when each positive edge (&#119906;, &#119907;) has a much higher similarity than its contrasted negative pairs: &#119877; &#119906;&#119907; &#8811; &#119877; &#119901;&#119902; , &#8704;(&#119901;, &#119902;) &#8712; &#119873; (&#119906;, &#119907;). Compared to CE, NP is more sensitive to negative pairs that have comparable similarities to those of positive pairs-they are more likely to be false positives. While NP achieves good performance in our experiments, alternative losses from the learning-to-rank literature <ref type="bibr">[6,</ref><ref type="bibr">23,</ref><ref type="bibr">82]</ref> could also be applied.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.4">Negative sampling</head><p>Supervised methods for link prediction sample a small number of negative pairs uniformly at random but most of these pairs are expected to be easy (see Section 2). To minimize distribution shifts between training and test, negative samples &#119873; (&#119906;, &#119907;) should ideally be generated using unbiased training (see additional example in Appendix A). This means that &#119873; (&#119906;, &#119907;) is a random subset of all disconnected pairs in the training graph, and |&#119873; (&#119906;, &#119907;)| is proportional to the ratio of negative pairs. In this way, we enforce &#119873; (&#119906;, &#119907;) to include hard negative pairs. However, due to graph sparsity (see Table <ref type="table">1</ref>), this approach does not scale to large graphs as the total number of negative pairs would be &#119874; (|&#119881; | 2 -|&#119864;|). Considering Lemma 3 (see proof in the Appendix D), we argue that it is unlikely for an inter-block pair to be ranked within the top Autocovariance pairs, implying that removing these pairs from training would not affect the results. To efficiently generate a small number of hard negative pairs, we propose a novel negative sampling scheme for link prediction based on graph partitioning <ref type="bibr">[16,</ref><ref type="bibr">21]</ref>. The idea is to select negative samples inside partitions (or communities) as they are expected to have similarity values comparable to positive pairs. We adopt METIS <ref type="bibr">[36]</ref> as our graph partitioning method due to its scalability and its flexibility to generate partitions of a size given as a parameter. METIS' partitions are expected to be densely connected inside and sparsely connected across (partitions). We apply METIS to obtain &#119896; partitions in which &#8704;&#119894; &#8712; {1, 2, ..., &#119896; } : &#119866; &#119894; = (&#119881; &#119894; , &#119864; &#119894; , &#119883; &#119894; ), &#119881; &#119894; &#8834; &#119881; , &#119864; &#119894; &#8834; &#119864;, &#119883; &#119894; &#8834; &#119883; , such that &#119881; = &#119896; &#119894;=1 &#119881; &#119894; and |&#119881; &#119894; | &#8776; |&#119881; |/&#119896;. Then, we apply unbiased training only within each partition, reducing the number of sampled negative pairs to</p><p>Following Lemma 4 (see proof on Appendix E), the choice of the value of &#119896; should consider a trade-off between training speed and link prediction performance (see Appendix 4.3). Further, the algorithm proposed by <ref type="bibr">[56]</ref> could be adopted to find the optimal value of &#119896; that maximizes the Modularity gain while obtaining the minimal training time. In the remainder of the paper, we refer to this approach as partitioned training. We claim that this procedure filters (easy) pairs consisting of nodes that would be too far away in the network topology from training while maintaining the more informative (hard) pairs that are closer and topologically similar, according to METIS. We include in the Appendix 4.3 (See Figure <ref type="figure">5</ref>) a performance comparison between Gelato trained using unbiased training against partitioned training.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">Experiments</head><p>In this section, we provide empirical evidence supporting our claims about supervised link prediction, demonstrate the accuracy and efficiency of Gelato, and present ablation studies. The implementation is available on Anonymous GitHub<ref type="foot">foot_0</ref> .</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Experiment settings</head><p>Datasets. Our method is evaluated on five attributed graphs commonly used for link prediction <ref type="bibr">[11,</ref><ref type="bibr">14,</ref><ref type="bibr">27,</ref><ref type="bibr">59,</ref><ref type="bibr">83,</ref><ref type="bibr">92,</ref><ref type="bibr">99]</ref>. Table <ref type="table">1</ref> shows dataset statistics. Baselines. For GNN-based link prediction, we include four stateof-the-art methods published in the past two years: Neo-GNN <ref type="bibr">[89]</ref>, BUDDY <ref type="bibr">[10]</ref>, and NCN / NCNC <ref type="bibr">[79]</ref>, as well as the pioneering work-SEAL <ref type="bibr">[90]</ref>. For topological link prediction heuristics, we consider Common Neighbors (CN) <ref type="bibr">[55]</ref>, Adamic Adar (AA) <ref type="bibr">[1]</ref>, Personalized PageRank (PPR) <ref type="bibr">[58]</ref>, and Autocovariance (AC) <ref type="bibr">[30]</ref>the base heuristic in our model. For comparison with older baselines, please refer to an earlier preprint version of our paper <ref type="bibr">[29]</ref>.</p><p>Hyperparameters. For Gelato, we tune the proportion of added edges &#120578; from {0.0, 0.25, 0.5, 0.75, 1.0}, the topological weight &#120572; from {0.0, 0.25, 0.5, 0.75}, and the trained weight &#120573; from {0.25, 0.5, 0.75, 1.0}. All other settings are fixed across datasets: MLP with one hidden layer of 128 neurons, AC scaling parameter &#119905; = 3, Adam optimizer <ref type="bibr">[38]</ref> with a learning rate of 0.001, a dropout rate of 0.5, and unbiased training without downsampling. To maintain fairness in our results, we also tuned the baselines and exposed our procedures in detail in Appendix F. For all models, including Gelato, the tuning process is done in all datasets, except for ogbl-collab. Data splits for unbiased training and unbiased testing. Following <ref type="bibr">[11,</ref><ref type="bibr">14,</ref><ref type="bibr">39,</ref><ref type="bibr">59,</ref><ref type="bibr">90,</ref><ref type="bibr">92]</ref>, we adopt 85%/5%/10% ratios for training, validation, and testing. Specifically, for unbiased training and unbiased testing, we first randomly divide the (positive) edges &#119864; of the original graph into &#119864; + &#119905;&#119903;&#119886;&#119894;&#119899; , &#119864; + &#119907;&#119886;&#119897;&#119894;&#119889; , and &#119864; + &#119905;&#119890;&#119904;&#119905; for training, validation, and testing based on the selected ratios. Then, we set the negative pairs in these three sets as (1) &#119864; - &#119905;&#119903;&#119886;&#119894;&#119899; = &#119864; -+ &#119864; + &#119907;&#119886;&#119897;&#119894;&#119889; + &#119864; + &#119905;&#119890;&#119904;&#119905; , (2) &#119864; - &#119907;&#119886;&#119897;&#119894;&#119889; = &#119864; -+ &#119864; + &#119905;&#119890;&#119904;&#119905; , and (3) &#119864; - &#119905;&#119890;&#119904;&#119905; = &#119864; -, where &#119864; -is the set of all negative pairs (excluding self-loops) in the original graph. Notice that the validation and testing positive edges are included in the negative training set, and the testing positive edges are included in the negative validation set. This setting simulates the real-world scenario where the test edges (and the validation edges) are unobserved during validation (training). For negative sampling, we repeat the dividing procedure above for each generated partition &#119866; &#119894; . The final sets are unions of individual sets for each partition: &#119864;</p><p>, and &#119864;</p><p>&#119905;&#119890;&#119904;&#119905; &#119894; . We notice that these splits do not leak training data to the test, as both positive and negative test pairs are disconnected during training. Evaluation metrics. We adopt &#8462;&#119894;&#119905;&#119904;@&#119896; -the ratio of positive edges individually ranked above &#119896;th place against all negative pairs-as our evaluation metric since it represents a good notion of class distinction under heavily imbalanced scenarios in information retrieval, compatible with the intuition of link prediction as a similarity-based ranking task.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Partitioned sampling and link prediction as a similarity task</head><p>This section provides empirical evidence for some of the claims made regarding limitations in the evaluation of supervised link prediction methods (see Section 2). It also demonstrates the effectiveness of Gelato to distinguish true links from hard negative node pairs in sparse graphs.</p><p>Negative sampling for harder pairs. Based on the hardness of negative pairs, the easiest scenario is the biased testing, followed by unbiased testing and partitioned testing-i.e. only negative pairs from inside partitions are sampled. This can be verified by Figure <ref type="figure">3</ref>, which compares the predicted scores of NCN against the similarities computed by Gelato on the test set of CiteSeer. Biased testing, the easiest and most unrealistic scenario, shows a good separation between positive and negative pairs both in NCN and Gelato. For unbiased testing, which is more realistic, Gelato is better at distinguishing positive and negative pairs. Finally, partitioned testing presents a particular challenge but Gelato still ranks most positive pairs above negative ones. Other GNN-based link prediction approaches have shown similar behaviors to NCN. Similarity-based link prediction. Figure <ref type="figure">3</ref> shows densities normalized by the size of positive and negative sets, respectively. However, in real-world sparse graphs, the number of negative pairs is much larger than that of positive ones. The results show that for unbiased and partitioned testing, ranking positive pairs over hard negative pairs is especially challenging due to their overwhelming number, i.e. positive pairs are "needles in a haystack". This provides evidence that classifiers, such as GNNs for link prediction, are not suitable for finding decision boundaries in these extremely imbalanced settings, which motivates the design of Gelato as a similarity ranking model trained using an N-pair loss.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.3">Link prediction performance</head><p>Table <ref type="table">2</ref> summarizes the link prediction performance in terms of the mean and standard deviation of &#8462;&#119894;&#119905;&#119904;@1000 for all methods. We show the same results for varying values of &#119896; in Figure <ref type="figure">4</ref>. First, we want to highlight the drastically different performance of GNN-based methods compared to those found in the original papers <ref type="bibr">[10,</ref><ref type="bibr">79,</ref><ref type="bibr">89,</ref><ref type="bibr">90]</ref>. Some of them underperform even the simplest topological heuristics such as Common Neighbors under unbiased testing. Moreover, Autocovariance, which is the base topological heuristic applied by Gelato and does not account for node attributes, outperforms all the GNN-based baselines for the majority of the datasets. These results support our arguments from Section 2 that evaluation metrics based on biased testing can produce misleading results compared to unbiased testing.</p><p>The overall best-performing GNN model is NCNC, which generalizes a pairwise topological heuristic (Common Neighbors) using message-passing. NCNC only outperforms Gelato on OGBL-ddi, which is consistent with previous results <ref type="bibr">[48]</ref> showing that local structural heuristics are effective for very dense networks (see Table <ref type="table">1</ref>). Moreover, OGBL-ddi is the only dataset considered that does not contain natural node features, which explains why our approach achieves the same performance as AC. Gelato also remains superior for different values of &#8462;&#119894;&#119905;&#119904;@&#119870;, especially for Cora, CiteSeer and OGBL-collab, and being remains competitive for OGBL-ddi being competitive as shown in Figure <ref type="figure">4</ref>. This characteristic is especially relevant in real-world scenarios where robustness is desired, mainly in more conservative regimes with lower values of &#119896;. Overall, Gelato outperforms the best GNN-based method by 138%, 125%, 156%, and 11% for Cora, Citeseer, Pubmed, and OGBL-collab, respectively. Further, Gelato outperforms its base topological heuristic (Autocovariance) by 48%, 39%, 10%, and 139% for Cora, Citeseer, Pubmed, and OGBL-collab, respectively. Unbiased Training vs. Partitioned Training. Figure <ref type="figure">5</ref> compares partitioned and unbiased training on CiteSeer using hits@K for varying &#119870;. Results show minimal performance differences, even in extreme partitioning scenarios, while partitioned sampling achieves a significant speedup-training up to 6x faster and scaling with the number of partitions. Unbiased training requires &#119874; (&#119881; 2 -&#119864;) for sparse graphs due to extensive negative sampling, whereas partitioned sampling reduces this to &#119874; (</p><p>, where (&#119881; &#119894; , &#119864; &#119894; ) are nodes and edges within partition &#119894;. Experiments confirm that performance is largely insensitive to different values of &#119901;.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.4">Ablation study</head><p>Here, we collect the results with the same hyperparameter setting as Gelato and present an ablation study in Table <ref type="table">3</ref>. Gelato-MLP (AC) represents Gelato without the MLP (Autocovariance) component, i.e., only using Autocovariance (MLP) for link prediction. Gelato-NP (UT ) replaces the proposed N-pair loss (unbiased training) with the cross entropy loss (biased training) applied by the baselines. Finally, Gelato-NP+UT replaces both the loss and the training setting.</p><p>10 -3 10 -1 10 1 10 3 Similarity 0.0 0.2 0.4 0.6 0.8 Density Gelato -Biased 10 -5 10 -3 10 -1 10 1 Similarity 0.0 0.2 0.4 0.6 0.8 Density Gelato -Unbiased 10 -5 10 -4 10 -3 10 -2 10 -1 10 0 10 1 10 2 Similarity 0.0 0.2 0.4 0.6 0.8 Density Gelato -Partitioned 10 -2 10 -1 10 0 10 1 10 2 Score 0.0 0.5 1.0 1.5 2.0 2.5 3.0 Density NCN -Biased 10 -4 10 -3 10 -2 10 -1 10 0 10 1 Score 0.0 0.5 1.0 1.5 2.0 Density NCN -Unbiased 10 -3 10 -2 10 -1 10 0 10 1 Score 0 5 10 15 20 Density NCN -Partitioned Positive Negative 0 200 400 600 800 1000 K 0 5 10 15 Hits@K Cora 0 200 400 600 800 1000 K 0 5 10 15 20 Hits@K CiteSeer 0 200 400 600 800 1000 K 0 5 10 15 20 25 30 Hits@K OGBL-Collab 0 200 400 600 800 1000 K 0.0 0.2 0.4 0.6 0.8 Hits@K OGBL-DDI Gelato NCN NCNC BUDDY SEAL NeoGNN</p><p>Figure <ref type="figure">4</ref>: Link prediction comparison in terms of &#8462;&#119894;&#119905;&#119904;@&#119896; varying &#119896; using Cora, CiteSeer, OGBL-DDI and OGBL-Collab. All datasets were split using unbiased sampling, except OGBL-Collab, which was split using partitioned sampling. Gelato outperforms the baselines across different values of &#119896; and remains competitive on OGBL-DDI, a dataset in which all methods struggle. Table <ref type="table">2</ref>: Link prediction performance comparison (mean &#177; std hits@1000) for all datasets considered. Gelato consistently outperforms GNN-based methods, topological heuristics, and two-stage approaches combining attributes/topology. For Cora, CiteSeer, ogbl-ddi and PubMed results we used unbiased training, while for ogbl-collab partitioned sampling is used, for scalability reasons. The top three models are colored by First, Second and Third.</p><p>Cora CiteSeer PubMed ogbl-ddi ogbl-collab GNN SEAL 0.0 * 7.25 * *** 0.75 * 25.9 * Neo-GNN 6.96 &#177; 4.24 5.42 &#177; 0.13 1.63 &#177; 0.32 0.76 * 0.85 * BUDDY 4.81 &#177; 0.72 5.86 &#177; 0.34 OOM 0.74 &#177; 0.01 27.66 &#177; 0.24 NCN 4.11 &#177; 1.22 7.84 &#177; 1.13 0.06 &#177; 0.1 0.82 &#177; 0.02 7.16 &#177; 1.42 NCNC 6.58 &#177; 0.58 8.72 &#177; 2.08 1.04 &#177; 0.09 0.89 &#177; 0.09 0.44 &#177; 0.37 Topological Heuristics CN 4.17 &#177; 0.00 4.4 &#177; 0.00 0.36 &#177; 0.00 0.8 &#177; 0.00 2.4 &#177; 0.00 AA 6.64 &#177; 0.00 4.4 &#177; 0.00 1.13 &#177; 0.00 0.79 &#177; 0.00 4.88 &#177; 0.00 PPR 9.30 &#177; 0.00 6.59 &#177; 0.00 0.32 &#177; 0.00 0.08 &#177; 0.00 1.24 &#177; 0.00 AC 11.20 &#177; 0.00 14.29 &#177; 0.00 3.81 &#177; 0.00 0.78 &#177; 0.00 12.89 &#177; 0.00 Gelato 16.62 &#177; 0.31 19.78 &#177; 0.23 4.18 &#177; 0.19 0.78 &#177; 0.00 30.92 * * Run only once as each run takes &gt;24 hrs; *** Each run takes &gt;1000 hrs; OOM: Out Of Memory.</p><p>0 200 400 600 800 1000 K 0 5 10 15 20 Hits@K CiteSeer -Unbiased vs. Partitioned 50 Partitions Unbiased We observe that removing either MLP or Autocovariance leads to inferior performance, as the corresponding attribute or topology information would be missing. Further, to address the class imbalance problem of link prediction, both the N-pair loss and unbiased training are crucial for the effective training of Gelato.</p><p>We also present results for Gelato using different ranking-based loss functions. We choose between Precision@k, pairwise hinge, pairwise exponential, and pairwise logistic losses as candidates for replacing the N-pair loss based on <ref type="bibr">[12]</ref>. The results are shown in Table <ref type="table">4</ref>, demonstrating that there is no clear winner considering the &#8462;&#119894;&#119905;&#119904;@1000 metric in the two datasets used (Cora and CiteSeer).</p><p>We analyze hyperparameter sensitivity and its impact on performance. Figure <ref type="figure">6</ref> shows that optimal &#120572; and &#120573; values depend on dataset characteristics, though performance varies smoothly for Cora and CiteSeer, easing tuning. Figure <ref type="figure">7</ref> highlights diminishing returns from increasing &#120578;, as performance plateaus.  0.00 0.25 0.50 0.75 1.00 &#951; 14 15 16 17 hits@1000 Cora 0.00 0.25 0.50 0.75 1.00 &#951; 12 14 16 18 hits@1000 CiteSeer 3.0 3.2 3.4 3.6 3.8 AP 2.0 2.5 3.0 3.5 AP Figure 7: Performance of Gelato with different values of &#120578;, using &#119879; = 3.</p><p>We represent hits@1000 in green and AP in blue.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">Related Work</head><p>An earlier preprint version of our paper is available at arXiv <ref type="bibr">[29]</ref>.</p><p>Topological heuristics for link prediction. The early link prediction literature focuses on topology-based heuristics. This includes approaches based on local (e.g., Common Neighbors <ref type="bibr">[55]</ref>, Adamic Adar <ref type="bibr">[1]</ref>, and Resource Allocation <ref type="bibr">[98]</ref>) and higher-order (e.g., Katz <ref type="bibr">[37]</ref>, PageRank <ref type="bibr">[58]</ref>, and SimRank <ref type="bibr">[34]</ref>) information. More recently, random-walk based graph embedding methods, which learn vector representations for nodes <ref type="bibr">[24,</ref><ref type="bibr">30,</ref><ref type="bibr">61]</ref>, have achieved promising results in graph machine learning tasks. Popular embedding approaches, such as DeepWalk <ref type="bibr">[61]</ref> and node2vec <ref type="bibr">[24]</ref>, have been shown to implicitly approximate the Pointwise Mutual Information similarity <ref type="bibr">[63]</ref>, which can also be used as a link prediction heuristic. This has motivated the investigation of other similarity metrics such as Autocovariance <ref type="bibr">[19,</ref><ref type="bibr">30,</ref><ref type="bibr">31]</ref>. However, these heuristics are unsupervised and cannot take advantage of data beyond the topology.</p><p>Graph Neural Networks for link prediction. GNN-based link prediction addresses the limitations of topological heuristics by training a neural network to combine topological and attribute information and potentially learn new heuristics. These works often assume that links are correlated with homophily in node attributes <ref type="bibr">[20,</ref><ref type="bibr">97]</ref>, as also is the case for this paper. GAE <ref type="bibr">[39]</ref> combines a graph convolution network <ref type="bibr">[40]</ref> and an inner product decoder based on node embeddings. SEAL <ref type="bibr">[90]</ref> models link prediction as a binary subgraph classification problem (edge/non-edge), and follow-up work (e.g., SHHF <ref type="bibr">[46]</ref>, WalkPool <ref type="bibr">[59]</ref>) investigates different pooling strategies. Other recent approaches for GNN-based link prediction include learning representations in hyperbolic space (e.g., HGCN <ref type="bibr">[11]</ref>,</p><p>LGCN <ref type="bibr">[92]</ref>), generalizing topological heuristics (e.g., Neo-GNN <ref type="bibr">[89]</ref>, NBFNet <ref type="bibr">[99]</ref>), and incorporating additional topological features (e.g., TLC-GNN <ref type="bibr">[83]</ref>, BScNets <ref type="bibr">[14]</ref>). ELPH and BUDDY <ref type="bibr">[10]</ref> apply hashing to efficiently approximate subgraph-based link prediction models, such as SEAL, using a message-passing neural network (MPNN) with distance-based structural features. NCNC <ref type="bibr">[79]</ref> combines the Common Neighbors heuristic with an MPNN achieving state-of-the-art results. Motivated by the growing popularity of GNNs for link prediction, this work investigates key questions regarding their training, evaluation, and ability to learn effective topological heuristics directly from data. We propose Gelato, which is simpler, more accurate, and faster than most state-of-the-art GNN-based link prediction methods. Graph learning. Gelato learns a graph that combines topological and attribute information. Our goal differs from generative models <ref type="bibr">[25,</ref><ref type="bibr">44,</ref><ref type="bibr">88]</ref>, which learn to sample from a distribution over graphs. Graph learning also enables the application of GNNs when the graph is unavailable, noisy, or incomplete <ref type="bibr">[94]</ref>. LDS <ref type="bibr">[22]</ref> and GAug <ref type="bibr">[95]</ref> jointly learn a probability distribution over edges and GNN parameters. IDGL <ref type="bibr">[15]</ref> and EGLN <ref type="bibr">[85]</ref> alternate between optimizing the graph and embeddings for node/graph classification and collaborative filtering. <ref type="bibr">[70]</ref> proposes two-stage link prediction by augmenting the graph as a preprocessing step. In comparison, Gelato effectively learns a graph in an end-to-end manner by minimizing the loss of a topological heuristic.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">Conclusion</head><p>This work exposes key limitations in evaluating supervised link prediction methods due to the widespread use of biased testing.</p><p>These limitations led to a consensus in the graph machine learning community that (1) GNNs are superior for link prediction, casting topological heuristics obsolete; and (2) link prediction is now an easy task due to deep learning advances. We challenge both assumptions, demonstrating that link prediction in sparse graphs remains a hard problem when evaluated properly. GNNs struggle with link prediction in sparse graphs due to extreme class imbalance, motivating Gelato, our novel link prediction framework. Gelato is a similarity-based method that combines graph learning and autocovariance to leverage attribute and topological information. Gelato employs an N-pair loss instead of cross-entropy to address the class imbalance and introduces a partitioning-based negative sampling scheme for efficient hard negative pair sampling. Through extensive experiments, we demonstrate superior accuracy and scalability of Gelato when compared to state-of-the-art GNN-based solutions across various datasets.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A Analysis of Link Prediction Evaluation Metrics with Different Test Settings</head><p>Example: Consider a graph with 10&#119870; nodes, 100&#119870; edges, and 99.9&#119872; disconnected (or negative) pairs. A (bad) model that ranks 1M false positives higher than the true edges achieves 0.99 AUC and 0.95 in AP under biased testing with equal negative samples. Figures <ref type="figure">8a</ref> and <ref type="figure">8b</ref> show the receiver operating characteristic (ROC) and precision-recall (PR) curves for the model under biased testing with equal number of negative samples. Due to the downsampling, only 100k (out of 99.9M) negative pairs are included in the test set, among which only 100k/99.9M &#215; 1M &#8776; 1k pairs are ranked higher than the positive edges. In the ROC curve, this means that once the false positive rate reaches 1k/100k = 0.01, the true positive rate would reach 1.0, leading to an AUC score of 0.99. Similarly, in the PR curve, when the recall reaches 1.0, the precision is 100k/(1k + 100k) &#8776; 0.99, leading to an overall AP score of &#8764;0. 95.  By comparison, as shown in Figure <ref type="figure">8c</ref>, when the recall reaches 1.0, the precision under unbiased testing is only 100k/(1M + 100k) &#8776; 0.09, leading to an AP score of &#8764;0.05. This demonstrates that evaluation metrics based on biased testing provide an overly optimistic measurement of link prediction model performance compared to the more realistic unbiased testing setting.  </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B Proof of Theorem 2.1</head><p>There are only three classifiers that we need to consider in this setting, assuming that the classifier can recover the block structure:</p><p>(1) It predicts every disconnected pair as a link;</p><p>(2) It predicts every disconnected pair as a non-link;</p><p>(3) It predicts within-block pairs as links and across-block pairs as non-links.</p><p>The classifier 1 cannot be optimal for sparse graphs-i.e., density lower than .5-and thus we will focus on classifiers 2 and 3. We will compute the expected number of True Positives (TP), False Positives (FP), False Negatives (FN), and True Negatives (TN) per node for each of them: </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>C Proof of Lemma 2</head><p>We will consider the same classifiers 2 and 3 from the proof of Theorem 2.1. Moreover, we will assume that the number of sampled negative pairs is the same as the number of positive pairs (i.e., balanced sampling).</p><p>By definition, the accuracy of classifier 2 is 0.5, as all predictions for negative pairs will be correct and all those for positive pairs will be incorrect. Thus, we only have to show that there exists an SBM instance for which classifier 3 achieves better accuracy than 2.</p><p>The accuracy of classifier 3 is computed as &#119886; 1 + &#119886; 2 /2, where: It follows that, as &#119902; &#8594; 0, classifier 3 can achieve an accuracy higher than 0.5.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>D Proof of Lemma 3</head><p>Let us initially consider Autocovariance with &#119905; = 1 computed in the Stochastic Block Model described in Lemma 3. We will adopt the entry-wise notation of the original Autocovariance definition presented in Section 3.2, using lower-case letters to represent individual entries in matrices and vectors, and for the sake of consistency with the Modularity definition, we adopt vol(&#119866;) = 2&#119898;. We first obtain the shortened form of Autocovariance for &#119905; = 1:</p><p>We can obtain the expected expression value for the case where (&#119894;, &#119895;) is an intra-cluster pair (E[&#119877; &#119894;&#119899;&#119905;&#119903;&#119886; ]):</p><p>Likewise, we follow the same procedure for the case where (&#119894;, &#119895;) is an inter-cluster pair (E[&#119877; &#119894;&#119899;&#119905;&#119890;&#119903; ]):</p><p>-&#119889; &#119894; &#119889; &#119895; 2&#119898; )(1 -&#119901;) + (0 -&#119889; &#119894; &#119889; &#119895; 2&#119898; )&#119901;) (13) = 1 2&#119898; (1 -&#119901; -&#119889; &#119894; &#119889; &#119895; 2&#119898; ) (14) = 1 2&#119898; (&#119902; -&#119889; &#119894; &#119889; &#119895; 2&#119898; ). (15) Due to the reversible property of Markov chains, this holds for larger values of &#119905;. Since &#119901; &gt; &#119902; =&#8658; E[&#119877; &#119894;&#119899;&#119905;&#119903;&#119886; ] &gt; E[&#119877; &#119894;&#119899;&#119905;&#119890;&#119903; ]. E Proof of Lemma 4 From Appendix D, we have E[&#119877; &#119894;&#119899;&#119905;&#119903;&#119886; ] = 1 2&#119898; (&#119901; -&#119889; &#119894; &#119889; &#119895; 2&#119898; ) is solely dependent on the value of &#119901;, since all the other terms are constants. We will denominate &#119881; &#119894;&#119896; and &#119864; + &#119894;&#119896; the number of nodes and positive pairs in the &#119894;-th partition of our graph partitioned in &#119896; partitions. Considering the estimate &#119901; = |&#119864; + &#119894;&#119896; | /|&#119881; &#119894;&#119896; | 2 , for simplicity, the number of positive pairs we can lose by increasing &#119896; to &#119896; + 1 is at most |&#119864; + &#119894;&#119896;+1 | &#8805; |&#119864; + &#119894;&#119896; | -(|&#119881; &#119894;&#119896; | 2 -|&#119881; &#119894;&#119896;+1 | 2 ), if we consider the extreme scenario in which every pair lost was positive. With this estimate, we can compare with the actual &#119901; estimate: |&#119864; + &#119894;&#119896;+1 | |&#119881; &#119894;&#119896;+1 | 2 &#8805; |&#119864; + &#119894;&#119896; | -(|&#119881; &#119894;&#119896; | 2 -|&#119881; &#119894;&#119896;+1 | 2 ) |&#119881; &#119894;&#119896;+1 | 2 (16)</p><p>It follows that, since the number of pairs drops faster than the number of positive edges for a given partition, E[&#119877; &#119894;&#119899;&#119905;&#119903;&#119886; ] increases when &#119896; increases.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>F Detailed experiment settings</head><p>Positive masking. To ensure generalizability under unbiased training, we employ a trick similar to negative injection <ref type="bibr">[90]</ref>. Training positive edges are divided into batches, and for each batch &#119864; &#119887; , only the residual edges &#119864; -&#119864; &#119887; are used as structural input to the model. This simulates testing, where edge predictions are made without leveraging their own connectivity. We term this positive masking. Implementation details. Self-loops are added to isolated nodes in the enhanced adjacency matrix to ensure valid transition probabilities for computing Autocovariance. Gelato similarity scores are standardized as in <ref type="bibr">[30]</ref> before loss computation. We train the model using gradient descent with pytorch <ref type="bibr">[60]</ref> and skip parameter updates for batches with invalid gradients (e.g., with cross-entropy loss). Models are selected based on &#119901;&#119903;&#119890;&#119888;@100% on the (unbiased) validation set. Maximum epochs are set to 100 for Cora/CiteSeer and 250 for OGBL-DDI/OGBL-Collab. For partitioned testing, we use METIS <ref type="bibr">[36]</ref> for its scalability and balanced partitions.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="1" xml:id="foot_0"><p>https://anonymous.4open.science/r/Gelato/</p></note>
		</body>
		</text>
</TEI>
