

Title: Graph diffusion distance: Properties and efficient computation
We define a new family of similarity and distance measures on graphs, and explore their theoretical properties in comparison to conventional distance metrics. These measures are defined by the solution(s) to an optimization problem which attempts to find a map minimizing the discrepancy between two graph Laplacian exponential matrices, under norm-preserving and sparsity constraints. Variants of the distance metric are introduced to consider such optimized maps under sparsity constraints as well as fixed time-scaling between the two Laplacians. The objective function of this optimization is multimodal and has discontinuous slope, and is hence difficult for univariate optimizers to solve. We demonstrate a novel procedure for efficiently calculating these optima for two of our distance measure variants. We present numerical experiments demonstrating that (a) upper bounds of our distance metrics can be used to distinguish between lineages of related graphs; (b) our procedure is faster at finding the required optima, by as much as a factor of 10^3; and (c) the upper bounds satisfy the triangle inequality exactly under some assumptions and approximately under others. We also derive an upper bound for the distance between two graph products, in terms of the distance between the two pairs of factors. Additionally, we present several possible applications, including the construction of infinite “graph limits” by means of Cauchy sequences of graphs related to one another by our distance measure.
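As a rough numerical sketch of the kind of objective described above, the Python snippet below evaluates the Frobenius discrepancy between two Laplacian heat kernels exp(-tL) as a function of the diffusion time t, for equal-sized graphs with the map between them fixed to the identity. The full method additionally optimizes over a constrained (norm-preserving, possibly sparse) map between the two graphs and is not reproduced here; the function names and the grid-plus-refinement time search are illustrative assumptions, not the paper's algorithm.

```python
import numpy as np
import networkx as nx
from scipy.linalg import expm
from scipy.optimize import minimize_scalar

def laplacian(G):
    """Dense graph Laplacian L = D - A."""
    return nx.laplacian_matrix(G).toarray().astype(float)

def diffusion_discrepancy(G1, G2, t):
    """Frobenius discrepancy between the two heat kernels exp(-t*L) at time t
    (equal-sized graphs, with the map between them fixed to the identity)."""
    return np.linalg.norm(expm(-t * laplacian(G1)) - expm(-t * laplacian(G2)), "fro")

def simple_diffusion_distance(G1, G2, t_max=10.0, grid=200):
    """Extremize the discrepancy over the diffusion time t. The objective is
    multimodal, so a coarse grid scan is followed by bounded local refinement."""
    ts = np.linspace(1e-3, t_max, grid)
    vals = [diffusion_discrepancy(G1, G2, t) for t in ts]
    t0 = ts[int(np.argmax(vals))]
    res = minimize_scalar(lambda t: -diffusion_discrepancy(G1, G2, t),
                          bounds=(max(t0 - 0.1, 1e-3), t0 + 0.1), method="bounded")
    return -res.fun, res.x

# Example: discrepancy between a cycle and a path on 12 nodes
print(simple_diffusion_distance(nx.cycle_graph(12), nx.path_graph(12)))
```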
Award ID(s): 1633631
NSF-PAR ID: 10310846
Author(s) / Creator(s):
Editor(s): Oliva, Gabriele
Date Published:
Journal Name: PLOS ONE
Volume: 16
Issue: 4
ISSN: 1932-6203
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. The Sparsest Cut is a fundamental optimization problem that has been extensively studied (a brute-force illustration of the objective appears in the sketch after this list). For planar inputs the problem is in P and can be solved in Õ(n^3) time if all vertex weights are 1. Despite a significant amount of effort, the best algorithms date back to the early 90s and can only achieve an O(log n)-approximation in Õ(n) time or a 3.5-approximation in Õ(n^2) time [Rao, STOC92]. Our main result is an Ω(n^(2−ε)) lower bound for Sparsest Cut even in planar graphs with unit vertex weights, under the (min, +)-Convolution conjecture, showing that approximations are inevitable in the near-linear time regime. To complement the lower bound, we provide a 3.3-approximation in near-linear time, improving upon the 25-year-old result of Rao in both time and accuracy. We also show that our lower bound is not far from optimal by observing an exact algorithm with running time Õ(n^(5/2)), improving upon the Õ(n^3) algorithm of Park and Phillips [STOC93]. Our lower bound accomplishes a repeatedly raised challenge by being the first fine-grained lower bound for a natural planar graph problem in P. Building on our construction we prove near-quadratic lower bounds under SETH for variants of the closest pair problem in planar graphs, and use them to show that the popular Average-Linkage procedure for Hierarchical Clustering cannot be simulated in truly subquadratic time. At the core of our constructions is a diamond-like gadget that also settles the complexity of Diameter in distributed planar networks. We prove an Ω(n/log n) lower bound on the number of communication rounds required to compute the weighted diameter of a network in the CONGEST model, even when the underlying graph is planar and all nodes are D = 4 hops away from each other. This is the first poly(n) lower bound in the planar-distributed setting, and it complements the recent poly(D, log n) upper bounds of Li and Parter [STOC 2019] for (exact) unweighted diameter and for (1 + ε)-approximate weighted diameter.
  2. The application of graph Laplacian eigenvectors has been quite popular in the graph signal processing field: one can use them as ingredients to design smooth multiscale bases. Our long-term goal is to study and understand the dual geometry of graph Laplacian eigenvectors. In order to do that, it is necessary to define a certain metric to measure the behavioral differences between each pair of eigenvectors. Saito (2018) considered the ramified optimal transportation (ROT) cost between the squares of the eigenvectors as such a metric. Cloninger and Steinerberger (2018) proposed a way to measure the affinity (or "similarity") between the eigenvectors based on their Hadamard (HAD) product. In this article, we propose a simplified ROT metric that is more computationally efficient and introduce two more ways to define the distance between the eigenvectors: the time-stepping diffusion (TSD) metric and the difference of absolute gradient (DAG) pseudometric. The TSD metric measures the cost of "flattening" the initial graph signal via a diffusion process up to a certain time, hence it can be viewed as a time-dependent version of the ROT metric. The DAG pseudometric is the l2-distance between feature vectors derived from the eigenvectors, in particular their absolute gradients (a minimal sketch appears after this list). We then compare the performance of ROT, HAD and the two new metrics on different kinds of graphs. Finally, we investigate their relationship as well as their pros and cons. Keywords: Graph Laplacian eigenvectors, metrics between orthonormal vectors, dual geometry of graph Laplacian eigenvectors, multiscale basis dictionaries on graphs, heat diffusion on graphs, Wasserstein distance, optimal transport
  3. Learning representations of sets of nodes in a graph is crucial for applications ranging from node-role discovery to link prediction and molecule classification. Graph Neural Networks (GNNs) have achieved great success in graph representation learning. However, the expressive power of GNNs is limited by the 1-Weisfeiler-Lehman (WL) test, and thus GNNs generate identical representations for graph substructures that may in fact be very different. More powerful GNNs, proposed recently by mimicking higher-order WL tests, only focus on representing entire graphs, and they are computationally inefficient as they cannot utilize the sparsity of the underlying graph. Here we propose and mathematically analyze a general class of structure-related features, termed Distance Encoding (DE). DE assists GNNs in representing any set of nodes, while providing strictly more expressive power than the 1-WL test. DE captures the distance between the node set whose representation is to be learned and each node in the graph. To capture the distance, DE can apply various graph-distance measures such as shortest path distance or generalized PageRank scores (a minimal sketch of the shortest-path variant appears after this list). We propose two ways for GNNs to use DEs: (1) as extra node features, and (2) as controllers of message aggregation in GNNs. Both approaches can utilize the sparse structure of the underlying graph, which leads to computational efficiency and scalability. We also prove that DE can distinguish node sets embedded in almost all regular graphs where traditional GNNs always fail. We evaluate DE on three tasks over six real networks: structural role prediction, link prediction, and triangle prediction. Results show that our models outperform GNNs without DE by up to 15% in accuracy and AUROC. Furthermore, our models also significantly outperform other state-of-the-art methods especially designed for the above tasks.
  4. The availability of extensive genetics data across multiple individuals and populations is driving the growing importance of graph-based reference representations. Aligning sequences to graphs is a fundamental operation on several types of sequence graphs (variation graphs, assembly graphs, pan-genomes, etc.) and their biological applications. Though research on sequence-to-graph alignment is nascent, it can draw from related work on pattern matching in hypertext. In this paper, we study sequence-to-graph alignment problems under Hamming and edit distance models, and linear and affine gap penalty functions, for multiple variants of the problem that allow changes in the query alone, the graph alone, or both. We prove that when changes are permitted in graphs, either standalone or in conjunction with changes in the query, the sequence-to-graph alignment problem is NP-complete under both Hamming and edit distance models for alphabets of size ≥ 2. For the case where only changes to the sequence are permitted, we present an O(|V| + m|E|) time algorithm, where m denotes the query size, and V and E denote the vertex and edge sets of the graph, respectively (a minimal dynamic-programming sketch appears after this list). Our result is generalizable to both linear and affine gap penalty functions, and improves upon the run-time complexity of existing algorithms.
  5. Quantifying the differences between networks is a challenging and ever-present problem in network science. In recent years, a multitude of diverse, ad hoc solutions to this problem have been introduced. Here, we propose that simple and well-understood ensembles of random networks, such as Erdős–Rényi graphs, random geometric graphs, Watts–Strogatz graphs, the configuration model, and preferential attachment networks, are natural benchmarks for network comparison methods. Moreover, we show that the expected distance between two networks independently sampled from a generative model is a useful property that encapsulates many key features of that model. To illustrate our results, we calculate this within-ensemble graph distance and related quantities for classic network models (and several parameterizations thereof) using 20 distance measures commonly used to compare graphs (a minimal Monte Carlo sketch appears after this list). The within-ensemble graph distance provides a new framework for developers of graph distances to better understand their creations and for practitioners to better choose an appropriate tool for their particular task.
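For item 1, the following is a brute-force illustration of the uniform sparsest-cut objective with unit vertex weights (the quantity the cited bounds concern): the sparsity of a cut (S, V∖S) is taken here as the cut weight divided by |S|·|V∖S|. This is a small-graph sketch for intuition only, not any of the algorithms discussed above.

```python
import itertools
import networkx as nx

def uniform_sparsest_cut(G):
    """Brute-force uniform sparsest cut with unit vertex weights:
    minimize cut_weight(S, V \\ S) / (|S| * |V \\ S|) over all nontrivial S.
    Exponential time; only feasible for very small graphs."""
    nodes = list(G.nodes())
    n = len(nodes)
    best_val, best_cut = float("inf"), None
    for r in range(1, n // 2 + 1):
        for subset in itertools.combinations(nodes, r):
            S = set(subset)
            cut = sum(d.get("weight", 1.0)
                      for u, v, d in G.edges(data=True)
                      if (u in S) != (v in S))
            val = cut / (len(S) * (n - len(S)))
            if val < best_val:
                best_val, best_cut = val, S
    return best_val, best_cut

# Example on a small planar graph (a 3x3 grid)
print(uniform_sparsest_cut(nx.grid_2d_graph(3, 3)))
```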
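For item 2, a minimal sketch of the difference-of-absolute-gradient (DAG) pseudometric as described: each Laplacian eigenvector is turned into an edgewise feature vector of absolute gradients |f(i) − f(j)|, and two eigenvectors are compared by the l2-distance between these feature vectors. The exact feature construction in the cited work may differ; function names are illustrative.

```python
import numpy as np
import networkx as nx

def laplacian_eigvecs(A):
    """Eigenvectors (columns) of the unnormalized graph Laplacian L = D - A."""
    L = np.diag(A.sum(axis=1)) - A
    _, vecs = np.linalg.eigh(L)
    return vecs

def dag_pseudometric(A, phi, psi):
    """DAG pseudometric: l2-distance between the edgewise absolute-gradient
    features |f(i) - f(j)| of two eigenvectors phi and psi."""
    edges = np.transpose(np.nonzero(np.triu(A)))
    g_phi = np.array([abs(phi[i] - phi[j]) for i, j in edges])
    g_psi = np.array([abs(psi[i] - psi[j]) for i, j in edges])
    return np.linalg.norm(g_phi - g_psi)

# Example: compare the 2nd and 3rd eigenvectors of a 10-node path graph
A = nx.to_numpy_array(nx.path_graph(10))
V = laplacian_eigvecs(A)
print(dag_pseudometric(A, V[:, 1], V[:, 2]))
```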
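For item 3, a sketch of the simplest Distance Encoding variant described: shortest-path distances from every node to each member of the target node set, used as extra node features. Node labels are assumed to be integers 0..n−1, and the feature post-processing used in the paper (e.g., bucketing or generalized PageRank scores) is not reproduced here.

```python
import networkx as nx
import numpy as np

def distance_encoding(G, node_set):
    """Shortest-path-distance Distance Encoding: for every node v, collect its
    distances to each node of the target set and use them as extra features."""
    n = G.number_of_nodes()
    feats = np.full((n, len(node_set)), np.inf)
    for k, s in enumerate(node_set):
        for v, d in nx.single_source_shortest_path_length(G, s).items():
            feats[v, k] = d
    feats[np.isinf(feats)] = n  # sentinel value for unreachable nodes
    return feats

# Example: DE features for the node set {0, 5} on a 10-node cycle
G = nx.cycle_graph(10)
print(distance_encoding(G, [0, 5]))
```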
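For item 4, a sketch of the dynamic program for the case where only the query may change, specialized to the Hamming-distance model on a character-labeled DAG whose vertices are given in topological order (the general edit-distance algorithm on graphs with cycles needs additional machinery not shown). D[i][v] is the fewest mismatches over alignments of query[0..i] to paths ending at vertex v; each edge is examined once per query position, matching the stated O(|V| + m|E|) behavior for connected graphs.

```python
def hamming_align_dag(query, labels, edges):
    """Best Hamming-distance alignment of `query` to a path in a DAG with
    single-character vertex labels, vertices indexed in topological order."""
    n, m = len(labels), len(query)
    INF = float("inf")
    in_nbrs = [[] for _ in range(n)]
    for u, v in edges:
        in_nbrs[v].append(u)
    # D[i][v]: fewest mismatches aligning query[:i+1] to a path ending at v
    D = [[INF] * n for _ in range(m)]
    for v in range(n):
        D[0][v] = 0 if query[0] == labels[v] else 1
    for i in range(1, m):
        for v in range(n):
            for u in in_nbrs[v]:
                cand = D[i - 1][u] + (0 if query[i] == labels[v] else 1)
                if cand < D[i][v]:
                    D[i][v] = cand
    return min(D[m - 1])

# Example: "ACGT" matches the path 0-1-2-4 exactly, so the distance is 0
labels = ["A", "C", "G", "G", "T"]
edges = [(0, 1), (1, 2), (1, 3), (2, 4), (3, 4)]
print(hamming_align_dag("ACGT", labels, edges))
```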
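For item 5, a minimal Monte Carlo sketch of a within-ensemble graph distance: draw independent pairs of graphs from the same generative model and average a chosen graph distance over the pairs. The adjacency spectral distance used here is just one convenient stand-in for the 20 distance measures surveyed in the cited work.

```python
import networkx as nx
import numpy as np

def spectral_distance(G, H):
    """One convenient graph distance: l2 difference of sorted adjacency spectra
    (requires G and H to have the same number of nodes)."""
    a = np.sort(np.linalg.eigvalsh(nx.to_numpy_array(G)))
    b = np.sort(np.linalg.eigvalsh(nx.to_numpy_array(H)))
    return np.linalg.norm(a - b)

def within_ensemble_distance(sampler, n_pairs=50):
    """Monte Carlo estimate of the expected distance between two graphs drawn
    independently from the same generative model."""
    d = [spectral_distance(sampler(), sampler()) for _ in range(n_pairs)]
    return float(np.mean(d)), float(np.std(d))

# Example: within-ensemble distance of the Erdos-Renyi model G(n=50, p=0.1)
print(within_ensemble_distance(lambda: nx.gnp_random_graph(50, 0.1)))
```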