Given a user-specified minimum degree threshold γ, a γ-quasi-clique is a subgraph in which each vertex connects to at least a γ fraction of the other vertices. Quasi-cliques are a natural definition of dense structures, so finding large and hence statistically significant quasi-cliques is useful in applications such as community detection in social networks and the discovery of significant biomolecule structures and pathways. However, mining maximal quasi-cliques is notoriously expensive, and even a recent algorithm for mining large maximal quasi-cliques is flawed and can lead to many repeated searches. This paper proposes a parallel solution for mining maximal quasi-cliques that is able to fully utilize CPU cores. Our solution uses divide and conquer to decompose the workload into independent tasks for parallel mining, and we address the problems of (i) drastic load imbalance among different tasks and (ii) the difficulty of predicting a task's running time and how that time grows with the task's subgraph size, (a) by using a timeout-based task decomposition strategy, and (b) by utilizing a priority task queue to schedule long-running tasks earlier for mining and decomposition, thereby avoiding stragglers. Unlike our conference version in PVLDB 2020, where the solution was built on a distributed graph mining framework called G-thinker, this paper targets a single-machine multi-core environment, which is more accessible to an average end user. A general framework called T-thinker is developed to facilitate the programming of parallel programs for algorithms that adopt divide and conquer, including but not limited to our quasi-clique mining algorithm. Additionally, we consider the problem of directly mining large quasi-cliques from the dense parts of a graph, where we identify the repeated-search issue of a recent method and address it using a carefully designed concurrent trie data structure.
Extensive experiments verify that our parallel solution scales well with the number of CPU cores, achieving a 26.68× runtime speedup with 32 mining threads when mining a graph with 3.77M vertices and 16.5M edges. Additionally, mining large quasi-cliques from dense parts provides an additional speedup of up to 89.46×.
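The degree condition that defines a γ-quasi-clique can be sketched as a simple membership check (a minimal illustration, assuming an adjacency-set representation of an undirected graph; `is_quasi_clique` is a hypothetical helper, not code from the paper):

```python
import math

def is_quasi_clique(adj, vertices, gamma):
    """Check whether `vertices` induces a gamma-quasi-clique in the
    undirected graph given by adjacency sets `adj`: each vertex must be
    adjacent to at least ceil(gamma * (|V_g| - 1)) of the other vertices."""
    vs = set(vertices)
    threshold = math.ceil(gamma * (len(vs) - 1))
    return all(len((adj[v] & vs) - {v}) >= threshold for v in vs)
```

With γ = 1 the condition reduces to the ordinary clique requirement, since every vertex must then be adjacent to all |V_g| − 1 others.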
G-thinker: a general distributed framework for finding qualified subgraphs in a big graph with load balancing
Finding in a big graph those subgraphs that satisfy certain conditions is useful in many applications such as community detection and subgraph matching. These problems have a high time complexity, but existing systems that attempt to scale them are all I/O-bound in execution. We propose the first truly CPU-bound distributed framework for subgraph finding algorithms, called G-thinker, which adopts a task-based computation model and provides a user-friendly subgraph-centric vertex-pulling API for writing distributed subgraph finding algorithms that can be easily adapted from existing serial algorithms. To utilize all CPU cores of a cluster, G-thinker features (1) a highly concurrent vertex cache for parallel task access and (2) a lightweight task scheduling approach that ensures high task throughput. These designs overlap communication with computation well to minimize the idle time of CPU cores. To further improve load balancing on graphs where the workloads of individual tasks can differ drastically due to a biased graph density distribution, we propose to prioritize the scheduling of tasks that tend to be long-running for processing and decomposition, plus a timeout mechanism for task decomposition to prevent long-running straggler tasks. The idea has been integrated into a novel algorithm for maximum clique finding (MCF) that adopts a hybrid task decomposition strategy, which significantly improves the running time of MCF on dense and large graphs: the algorithm finds a maximum clique of size 1,109 on a large and dense WikiLinks graph dataset in 70 minutes. Extensive experiments demonstrate that G-thinker achieves orders-of-magnitude speedup compared even with the fastest existing subgraph-centric system, and it scales well to much larger and denser real network data. G-thinker is open-sourced at http://bit.ly/gthinker with detailed documentation.
 Award ID(s):
 1755464
 NSFPAR ID:
 10331912
 Date Published:
 Journal Name:
 The VLDB journal
 ISSN:
 1066-8888
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


Mining from a big graph those subgraphs that satisfy certain conditions is useful in many applications such as community detection and subgraph matching. These problems have a high time complexity, but existing systems that attempt to scale them are all I/O-bound in execution. We propose the first truly CPU-bound distributed framework, called G-thinker, which adopts a user-friendly subgraph-centric vertex-pulling API for writing distributed subgraph mining algorithms. To utilize all CPU cores of a cluster, G-thinker features (1) a highly concurrent vertex cache for parallel task access and (2) a lightweight task scheduling approach that ensures high task throughput. These designs overlap communication with computation well to minimize CPU idle time. Extensive experiments demonstrate that G-thinker achieves orders-of-magnitude speedup compared even with the fastest existing subgraph-centric system, and it scales well to much larger and denser real network data. G-thinker is open-sourced at http://bit.ly/gthinker with detailed documentation.

Given a user-specified minimum degree threshold γ, a γ-quasi-clique is a subgraph g = (V_g, E_g) in which each vertex v ∈ V_g connects to at least a γ fraction of the other vertices (i.e., ⌈γ · (|V_g| − 1)⌉ vertices) in g. The quasi-clique is one of the most natural definitions of dense structures, useful in finding communities in social networks and discovering significant biomolecule structures and pathways. However, mining maximal quasi-cliques is notoriously expensive. In this paper, we design parallel algorithms for mining maximal quasi-cliques on G-thinker, a distributed graph mining framework that decomposes mining into compute-intensive tasks to fully utilize CPU cores. We found that directly using G-thinker results in the straggler problem due to (i) the drastic load imbalance among different tasks and (ii) the difficulty of predicting the task running time. We address these challenges by redesigning G-thinker's execution engine to prioritize long-running tasks for execution, and by utilizing a novel timeout strategy to effectively decompose long-running tasks to improve load balancing. While this system redesign applies to many other expensive dense subgraph mining problems, this paper verifies the idea by adapting the state-of-the-art quasi-clique algorithm, Quick, to our redesigned G-thinker. Extensive experiments verify that our new solution scales well with the number of CPU cores, achieving a 201× runtime speedup when mining a graph with 3.77M vertices and 16.5M edges in a 16-node cluster.
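The combination of largest-first priority scheduling and timeout-based decomposition of straggler tasks can be sketched as follows (a toy single-threaded model; `Task` and `schedule` are hypothetical names, a deterministic cost budget stands in for the wall-clock timeout, and summing a list stands in for the mining work):

```python
import heapq

class Task:
    """Toy divide-and-conquer task: sums a list of numbers."""
    def __init__(self, nums):
        self.nums = nums
    @property
    def cost(self):
        # Stand-in for an estimate of how long the task will run
        return len(self.nums)
    def split(self):
        # Decompose one task into two smaller independent subtasks
        mid = len(self.nums) // 2
        return [Task(self.nums[:mid]), Task(self.nums[mid:])]
    def run(self):
        return sum(self.nums)

def schedule(tasks, budget):
    """Pop the costliest task first (heapq is a min-heap, so costs are
    negated); a task whose cost exceeds `budget` is decomposed and its
    subtasks re-queued instead of being run, mimicking timeout-based
    decomposition of long-running stragglers."""
    pq = [(-t.cost, i, t) for i, t in enumerate(tasks)]
    heapq.heapify(pq)
    tie = len(tasks)  # tie-breaker so the heap never compares Task objects
    total = 0
    while pq:
        _, _, task = heapq.heappop(pq)
        if task.cost > budget:
            for sub in task.split():
                tie += 1
                heapq.heappush(pq, (-sub.cost, tie, sub))
        else:
            total += task.run()
    return total
```

In the real system the budget is a wall-clock timeout checked during mining, and the re-queued subtasks are picked up by other worker threads; the sketch only shows the scheduling discipline.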

Quasi-cliques are a type of dense subgraph that generalizes the notion of cliques, important for applications such as community/module detection in various social and biological networks. However, the existing quasi-clique definition and algorithms are only applicable to undirected graphs. In this paper, we generalize the concept of quasi-cliques to directed graphs by proposing $(\gamma_1, \gamma_2)$-quasi-cliques, which have density requirements in both the inbound and outbound directions of each vertex in a quasi-clique subgraph. An efficient recursive algorithm is proposed to find maximal $(\gamma_1, \gamma_2)$-quasi-cliques; it integrates many effective pruning rules that are validated by ablation studies. We also study finding the top-$k$ large quasi-cliques directly by bootstrapping the search from more compact quasi-cliques, to scale the mining to larger networks. The algorithms are parallelized with effective load balancing, and we demonstrate that they can scale up effectively with the number of CPU cores.
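One possible reading of the two-sided density requirement is sketched below (an illustrative assumption, not the paper's exact definition: γ1 is taken to govern in-degree and γ2 out-degree, each against a ⌈γ · (|V_g| − 1)⌉ threshold as in the undirected case; the function name is hypothetical):

```python
import math

def is_directed_quasi_clique(out_adj, vertices, gamma1, gamma2):
    """Check a candidate (gamma1, gamma2)-quasi-clique in a directed graph
    given by out-neighbour sets `out_adj`: every vertex needs at least
    ceil(gamma1 * (|V_g| - 1)) in-neighbours and ceil(gamma2 * (|V_g| - 1))
    out-neighbours inside the subgraph."""
    vs = set(vertices)
    k = len(vs) - 1
    t_in, t_out = math.ceil(gamma1 * k), math.ceil(gamma2 * k)
    for v in vs:
        outs = len((out_adj[v] & vs) - {v})
        ins = sum(1 for u in vs - {v} if v in out_adj[u])
        if ins < t_in or outs < t_out:
            return False
    return True
```

Setting γ1 = γ2 and symmetrizing the edges recovers the undirected condition, which is the sense in which the directed definition generalizes the original one.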

Counting and uniformly sampling motifs in a graph are fundamental algorithmic tasks with numerous applications across multiple fields. Since these problems are computationally expensive, recent efforts have focused on devising sublinear-time algorithms for them. We consider the model where the algorithm gets a constant-size motif H and query access to a graph G, where the allowed queries are degree, neighbor, and pair queries, as well as uniform edge sample queries. In the sampling task, the algorithm is required to output a uniformly distributed copy of H in G (if one exists), and in the counting task it is required to output a good estimate of the number of copies of H in G. Previous algorithms for the uniform sampling task were based on a decomposition of H into a collection of odd cycles and stars, denoted D*(H) = {O_{k_1}, ..., O_{k_q}, S_{p_1}, ..., S_{p_ℓ}}. These algorithms were shown to be optimal for the case where H is a clique or an odd-length cycle, but no other lower bounds were known. We present a new algorithm for sampling arbitrary motifs which, up to poly(log n) factors, is always preferable for any motif H whose decomposition contains at least two components or at least one star. The main ingredient leading to this improvement is an improved uniform algorithm for sampling stars, which might be of independent interest, as it allows sampling vertices according to the p-th moment of the degree distribution. We further show how to use our sampling algorithm to obtain an approximate counting algorithm with essentially the same complexity. Finally, we prove that this algorithm is decomposition-optimal for decompositions that contain at least one odd cycle. That is, we prove that for any decomposition D that contains at least one odd cycle, there exists a motif H_D with decomposition D, and a family of graphs G, such that in order to output a uniform copy of H_D in a uniformly chosen graph in G, the number of required queries matches our upper bound.
These are the first lower bounds for motifs H with a non-trivial decomposition, i.e., motifs that have more than a single component in their decomposition.