Finding from a big graph those subgraphs that satisfy certain conditions is useful in many applications such as community detection and subgraph matching. These problems have a high time complexity, but existing systems that attempt to scale them are all IObound in execution. We propose the first truly CPUbound distributed framework called Gthinker for subgraph finding algorithms, which adopts a taskbased computation model, and which also provides a userfriendly subgraphcentric vertexpulling API for writing distributed subgraph finding algorithms that can be easily adapted from existing serial algorithms. To utilize all CPU cores of a cluster, Gthinker features (1) a highly concurrent vertex cache for parallel task access and (2) a lightweight task scheduling approach that ensures high task throughput. These designs well overlap communication with computation to minimize the idle time of CPU cores. To further improve load balancing on graphs where the workloads of individual tasks can be drastically different due to biased graph density distribution, we propose to prioritize the scheduling of those tasks that tend to be long running for processing and decomposition, plus a timeout mechanism for task decomposition to prevent longrunning straggler tasks. The idea has been integrated into a novelty algorithm for maximum cliquemore »
Gthinker: A Distributed Framework for Mining Subgraphs in a Big Graph
Mining from a big graph those subgraphs that satisfy certain conditions is useful in many applications such as community detection and subgraph matching. These problems have a high time complexity, but existing systems to scale them are all IObound in execution. We propose the first truly CPUbound distributed framework called Gthinker that adopts a userfriendly subgraphcentric vertexpulling API for writing distributed subgraph mining algorithms. To utilize all CPU cores of a cluster, Gthinker features (1) a highlyconcurrent vertex cache for parallel task access and (2) a lightweight task scheduling approach that ensures high task throughput. These designs well overlap communication with computation to minimize the CPU idle time. Extensive experiments demonstrate that Gthinker achieves orders of magnitude speedup compared even with the fastest existing subgraphcentric system, and it scales well to much larger and denser real network data. Gthinker is opensourced at http://bit.ly/gthinker with detailed documentation.
 Award ID(s):
 1755464
 Publication Date:
 NSFPAR ID:
 10140007
 Journal Name:
 Proceedings of the 36th IEEE International Conference on Data Engineering (ICDE)
 Sponsoring Org:
 National Science Foundation
More Like this


Given a userspecified minimum degree threshold γ, a γquasiclique is a subgraph g = (Vg, Eg) where each vertex ν ∈ Vg connects to at least γ fraction of the other vertices (i.e., ⌈γ · (Vg 1)⌉ vertices) in g. Quasiclique is one of the most natural definitions for dense structures useful in finding communities in social networks and discovering significant biomolecule structures and pathways. However, mining maximal quasicliques is notoriously expensive. In this paper, we design parallel algorithms for mining maximal quasicliques on Gthinker, a distributed graph mining framework that decomposes mining into computeintensive tasks to fully utilize CPU cores. We found that directly using Gthinker results in the straggler problem due to (i) the drastic load imbalance among different tasks and (ii) the difficulty of predicting the task running time. We address these challenges by redesigning Gthinker's execution engine to prioritize longrunning tasks for execution, and by utilizing a novel timeout strategy to effectively decompose longrunning tasks to improve load balancing. While this system redesign applies to many other expensive dense subgraph mining problems, this paper verifies the idea by adapting the stateoftheart quasiclique algorithm, Quick, to our redesigned Gthinker. Extensive experiments verify that our new solution scalesmore »

Given a userspecified minimum degree threshold γ, a γquasiclique is a subgraph where each vertex connects to at least γ fraction of the other vertices. Quasiclique is a natural definition for dense structures, so finding large and hence statistically significant quasicliques is useful in applications such as community detection in social networks and discovering significant biomolecule structures and pathways. However, mining maximal quasicliques is notoriously expensive, and even a recent algorithm for mining large maximal quasicliques is flawed and can lead to a lot of repeated searches. This paper proposes a parallel solution for mining maximal quasicliques that is able to fully utilize CPU cores. Our solution utilizes divide and conquer to decompose the workloads into independent tasks for parallel mining, and we addressed the problem of (i) drastic load imbalance among different tasks and (ii) difficulty in predicting the task running time and the time growth with task subgraph size, by (a) using a timeoutbased task decomposition strategy, and by (b) utilizing a priority task queue to schedule longrunning tasks earlier for mining and decomposition to avoid stragglers. Unlike our conference version in PVLDB 2020 where the solution was built on a distributed graph mining framework called Gthinker, thismore »

Many computationally expensive problems are solved by a divideandconquer algorithm: a problem over a big dataset can be recursively divided into independent tasks over smaller subsets of the dataset. We present a distributed generalpurpose framework called Tthinker which effectively utilizes the CPU cores in a cluster by properly decomposing an expensive problem into smaller independent tasks for parallel computation. Tthinker well overlaps CPU processing with network communication, and its superior performance is verified over a reengineered graph mining system Gthinker available at http://cs.uab.edu/yanda/gthinker/

We present a weighted approach to compute a maximum cardinality matching in an arbitrary bipartite graph. Our main result is a new algorithm that takes as input a weighted bipartite graph G(A cup B,E) with edge weights of 0 or 1. Let w <= n be an upper bound on the weight of any matching in G. Consider the subgraph induced by all the edges of G with a weight 0. Suppose every connected component in this subgraph has O(r) vertices and O(mr/n) edges. We present an algorithm to compute a maximum cardinality matching in G in O~(m(sqrt{w} + sqrt{r} + wr/n)) time. When all the edge weights are 1 (symmetrically when all weights are 0), our algorithm will be identical to the wellknown HopcroftKarp (HK) algorithm, which runs in O(m sqrt{n}) time. However, if we can carefully assign weights of 0 and 1 on its edges such that both w and r are sublinear in n and wr=O(n^{gamma}) for gamma < 3/2, then we can compute maximum cardinality matching in G in o(m sqrt{n}) time. Using our algorithm, we obtain a new O~(n^{4/3}/epsilon^4) time algorithm to compute an epsilonapproximate bottleneck matching of A,B subsetR^2 and an 1/(epsilon^{O(d)}}n^{1+(d1)/(2d1)}) poly logmore »