Computing similarity between graphs is a fundamental and critical problem in graph-based applications, and one of the most commonly used graph similarity measures is graph edit distance (GED), defined as the minimum number of graph edit operations that transform one graph to another. Existing GED solutions suffer from severe performance issues due in particular to the NP-hardness of exact GED computation. Recently, deep learning has shown early promise for GED approximation with high accuracy and low computational cost. However, existing methods treat GED as a global, coarse-grained graph similarity value, while neglecting the type-specific transformative impacts incurred by different types of graph edit operations, including node insertion/deletion, node relabeling, edge insertion/deletion, and edge relabeling. In this paper, we propose a type-aware graph similarity learning and computation framework, TaGSim (T ype -a ware G raph Sim ilarity), that estimates GED in a fine-grained approach w.r.t. different graph edit types. Specifically, for each type of graph edit operations, TaGSim models its unique transformative impacts upon graphs, and encodes them into high-quality, type-aware graph embeddings, which are further fed into type-aware neural networks for accurate GED estimation. Extensive experiments on five real-world datasets demonstrate the effectiveness and efficiency of TaGSim, which significantly outperforms state-of-the-art GED solutions.
more »
« less
Similarity Search in Graph Databases: A Multi-Layered Indexing Approach
We consider in this paper the similarity search problem that retrieves relevant graphs from a graph database under the well-known graph edit distance (GED) constraint. Formally, given a graph database G = {g1, g2, . . . , gn} and a query graph q, we aim to search the graph gi ∈ g such that the graph edit distance between gi and q, GED(gi, q), is within a user-specified GED threshold, τ. In spite of its theoretical significance and wide applicability, the GED-based similarity search problem is challenging in large graph databases due in particular to a large amount of GED computation incurred, which has proven to be NP-hard. In this paper, we propose a parameterized, partition-based GED lower bound that can be instantiated into a series of tight lower bounds towards synergistically pruning false-positive graphs from before costly GED computation is performed. We design an efficient, selectivity-aware algorithm to partition graphs of into highly selective subgraphs. They are further incorporated in a cost-effective, multi-layered indexing structure, ML-Index (Multi-Layered Index), for GED lower bound cross-checking and false-positive graph filtering with theoretical performance guarantees. Experimental studies in real and synthetic graph databases validate the efficiency and effectiveness of ML-Index, which achieves up to an order of magnitude speedup over the state-of-the-art method for similarity search in graph databases.
more »
« less
- Award ID(s):
- 1743142
- PAR ID:
- 10057532
- Date Published:
- Journal Name:
- 33rd IEEE International Conference on Data Engineering, ICDE 2017
- Page Range / eLocation ID:
- 783-794
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Shortest-path computation on graphs is one of the most well-studied problems in algorithmic theory. An aspect that has only recently attracted attention is the use of databases in combination with graph algorithms, so-called distance oracles, to compute shortest-path queries on large graphs. To this purpose, we propose a novel, efficient, pure-SQL framework for answering exact distance queries on large-scale graphs, implemented entirely on an open-source database engine. Our COLD framework (COmpressed Labels on the Database) may answer multiple distance queries (vertex-to-vertex, one-to-many, k-Nearest Neighbors, Reverse k-Nearest Neighbors, Reverse k-Farthest Neighbors and Top-k Range) not handled by previous methods, rendering it a complete database solution for a variety of practical large-scale graph applications. Our experimentation shows that COLD outperforms existing approaches (including popular graph databases) in terms of query time and efficiency, while requiring significantly less storage space than these methods.more » « less
-
null (Ed.)Abstract For positive integers n and d > 0, let $G(\mathbb {Q}^n,\; d)$ denote the graph whose vertices are the set of rational points $\mathbb {Q}^n$ , with $u,v \in \mathbb {Q}^n$ being adjacent if and only if the Euclidean distance between u and v is equal to d . Such a graph is deemed “non-trivial” if d is actually realized as a distance between points of $\mathbb {Q}^n$ . In this paper, we show that a space $\mathbb {Q}^n$ has the property that all pairs of non-trivial distance graphs $G(\mathbb {Q}^n,\; d_1)$ and $G(\mathbb {Q}^n,\; d_2)$ are isomorphic if and only if n is equal to 1, 2, or a multiple of 4. Along the way, we make a number of observations concerning the clique number of $G(\mathbb {Q}^n,\; d)$ .more » « less
-
We consider the community search problem defined upon a large graph G: given a query vertex q in G, to find as output all the densely connected subgraphs of G, each of which contains the query v. As an online, query-dependent variant of the well-known community detection problem, community search enables personalized community discovery that has found widely varying applications in real-world, large-scale graphs. In this paper, we study the community search problem in the truss-based model aimed at discovering all dense and cohesive k-truss communities to which the query vertex q belongs. We introduce a novel equivalence relation, k-truss equivalence, to model the intrinsic density and cohesiveness of edges in k-truss communities. Consequently, all the edges of G can be partitioned to a series of k-truss equivalence classes that constitute a space-efficient, truss-preserving index structure, EquiTruss. Community search can be henceforth addressed directly upon EquiTruss without repeated, time-demanding accesses to the original graph, G, which proves to be theoretically optimal. In addition, EquiTruss can be efficiently updated in a dynamic fashion when G evolves with edge insertion and deletion. Experimental studies in real-world, large-scale graphs validate the efficiency and effectiveness of EquiTruss, which has achieved at least an order of magnitude speedup in community search over the state-of-the-art method, TCP-Index.more » « less
-
Motivation: Intra-sample heterogeneity describes the phenomenon where a genomic sample contains a diverse set of genomic sequences. In practice, the true string sets in a sample are often unknown due to limitations in sequencing technology. In order to compare heterogeneous samples, genome graphs can be used to represent such sets of strings. However, a genome graph is generally able to represent a string set universe that contains multiple sets of strings in addition to the true string set. This difference between genome graphs and string sets is not well characterized. As a result, a distance metric between genome graphs may not match the distance between true string sets. Results: We extend a genome graph distance metric, Graph Traversal Edit Distance (GTED) proposed by Ebrahimpour Boroojeny et al., to FGTED to model the distance between heterogeneous string sets and show that GTED and FGTED always underestimate the Earth Mover’s Edit Distance (EMED) between string sets. We introduce the notion of string set universe diameter of a genome graph. Using the diameter, we are able to upper-bound the deviation of FGTED from EMED and to improve FGTED so that it reduces the average error in empirically estimating the similarity between true string sets. On simulated T-cell receptor sequences and actual Hepatitis B virus genomes, we show that the diameter-corrected FGTED reduces the average deviation of the estimated distance from the true string set distances by more than 250%. Availability and implementation: Data and source code for reproducing the experiments are available at: https:// github.com/Kingsford-Group/gtedemedtest/.more » « less