Quasicliques are dense incomplete subgraphs of a graph that generalize the notion of cliques. Enumerating quasicliques from a graph is a robust way to detect densely connected structures with applications in bioinformatics and social network analysis. However, enumerating quasicliques in a graph is a challenging problem, even harder than the problem of enumerating cliques. We consider the enumeration of top k degreebased quasicliques and make the following contributions: (1) we show that even the problem of detecting whether a given quasiclique is maximal (i.e., not contained within another quasiclique) is NPhard. (2) We present a novel heuristic algorithm K ernel QC to enumerate the k largest quasicliques in a graph. Our method is based on identifying kernels of extremely dense subgraphs within a graph, followed by growing subgraphs around these kernels, to arrive at quasicliques with the required densities. (3) Experimental results show that our algorithm accurately enumerates quasicliques from a graph, is much faster than current stateoftheart methods for quasiclique enumeration (often more than three orders of magnitude faster), and can scale to larger graphs than current methods.
more »
« less
Maximal Directed QuasiClique Mining
Quasicliques are a type of dense subgraphs that generalize the notion of cliques, important for applications such as community/module detection in various social and biological networks. However, the existing quasiclique definition and algorithms are only applicable to undirected graphs. In this paper, we generalize the concept of quasicliques to directed graphs by proposing $(\gamma_1, \gamma_2)$quasicliques which have density requirements in both inbound and outbound directions of each vertex in a quasiclique subgraph. An efficient recursive algorithm is proposed to find maximal $(\gamma_1, \gamma_2)$quasicliques which integrates many effective pruning rules that are validated by ablation studies. We also study the finding of top$k$ large quasicliques directly by bootstrapping the search from more compact quasicliques, to scale the mining to larger networks. The algorithms are parallelized with effective load balancing, and we demonstrate that they can scale up effectively with the number of CPU cores.
more »
« less
 Award ID(s):
 1755464
 NSFPAR ID:
 10331905
 Date Published:
 Journal Name:
 Proceedings of the 38th IEEE International Conference on Data Engineering (ICDE)
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


Cliquecounting is a fundamental problem that has application in many areas eg. dense subgraph discovery, community detection, spam detection, etc. The problem of kcliquecounting is difficult because as k increases, the number of kcliques goes up exponentially. Enumeration algorithms (even parallel ones) fail to count kcliques beyond a small k. Approximation algorithms, like TuránShadow have been shown to perform well upto k = 10, but are inefficient for larger cliques. The recently proposed Pivoter algorithm significantly improved the stateoftheart and was able to give exact counts of all kcliques in a large number of graphs. However, the clique counts of some graphs (for example, comlj) are still out of reach of these algorithms. We revisit the TuránShadow algorithm and propose a generalized framework called YACC that leverages several insights about realworld graphs to achieve faster cliquecounting. The bottleneck in TuránShadow is a recursive subroutine whose stopping condition is based on a classic result from extremal combinatorics called Turán's theorem. This theorem gives a lower bound for the kclique density in a subgraph in terms of its edge density. However, this stopping condition is based on a worstcase graph that does not reflect the nature of realworld graphs. Using techniques for quickly discovering dense subgraphs, we relax the stopping condition in a systematic way such that we get a smaller recursion tree while still maintaining the guarantees provided by TuránShadow. We deploy our algorithm on several realworld data sets and show that YACC reduces the size of the recursion tree and the running time by over an order of magnitude. Using YACC, we are able to obtain clique counts for several graphs for which cliquecounting was infeasible before, including comlj.more » « less

Given a userspecified minimum degree threshold γ, a γquasiclique is a subgraph where each vertex connects to at least γ fraction of the other vertices. Quasiclique is a natural definition for dense structures, so finding large and hence statistically significant quasicliques is useful in applications such as community detection in social networks and discovering significant biomolecule structures and pathways. However, mining maximal quasicliques is notoriously expensive, and even a recent algorithm for mining large maximal quasicliques is flawed and can lead to a lot of repeated searches. This paper proposes a parallel solution for mining maximal quasicliques that is able to fully utilize CPU cores. Our solution utilizes divide and conquer to decompose the workloads into independent tasks for parallel mining, and we addressed the problem of (i) drastic load imbalance among different tasks and (ii) difficulty in predicting the task running time and the time growth with task subgraph size, by (a) using a timeoutbased task decomposition strategy, and by (b) utilizing a priority task queue to schedule longrunning tasks earlier for mining and decomposition to avoid stragglers. Unlike our conference version in PVLDB 2020 where the solution was built on a distributed graph mining framework called Gthinker, this paper targets a singlemachine multicore environment which is more accessible to an average end user. A general framework called Tthinker is developed to facilitate the programming of parallel programs for algorithms that adopt divide and conquer, including but not limited to our quasiclique mining algorithm. Additionally, we consider the problem of directly mining large quasicliques from dense parts of a graph, where we identify the repeated search issue of a recent method and address it using a carefully designed concurrent trie data structure. Extensive experiments verify that our parallel solution scales well with the number of CPU cores, achieving 26.68× runtime speedup when mining a graph with 3.77M vertices and 16.5M edges with 32 mining threads. Additionally, mining large quasicliques from dense parts can provide an additional speedup of up to 89.46×.more » « less

null (Ed.)Given a userspecified minimum degree threshold γ, a γquasiclique is a subgraph g = (Vg, Eg) where each vertex ν ∈ Vg connects to at least γ fraction of the other vertices (i.e., ⌈γ · (Vg 1)⌉ vertices) in g. Quasiclique is one of the most natural definitions for dense structures useful in finding communities in social networks and discovering significant biomolecule structures and pathways. However, mining maximal quasicliques is notoriously expensive. In this paper, we design parallel algorithms for mining maximal quasicliques on Gthinker, a distributed graph mining framework that decomposes mining into computeintensive tasks to fully utilize CPU cores. We found that directly using Gthinker results in the straggler problem due to (i) the drastic load imbalance among different tasks and (ii) the difficulty of predicting the task running time. We address these challenges by redesigning Gthinker's execution engine to prioritize longrunning tasks for execution, and by utilizing a novel timeout strategy to effectively decompose longrunning tasks to improve load balancing. While this system redesign applies to many other expensive dense subgraph mining problems, this paper verifies the idea by adapting the stateoftheart quasiclique algorithm, Quick, to our redesigned Gthinker. Extensive experiments verify that our new solution scales well with the number of CPU cores, achieving 201× runtime speedup when mining a graph with 3.77M vertices and 16.5M edges in a 16node cluster.more » « less

Cliques and cliquelike subgraphs (e.g., quasicliques) are important dense structures whose counting or listing are essential in applications like complex network analysis and community detection. These problems are usually solved by divide and conquer, where a task over a big graph can be recursively divided into subtasks over smaller subgraphs whose search spaces are disjoint. This divisible algorithmic paradigm brings enormous potential for parallelism, since different subtasks can run concurrently to drastically reduce the overall running time. In this paper, we explore this potential by proposing a unified framework for counting and listing cliquelike subgraphs. We study how to divide and distribute the counting and listing tasks, and meanwhile, to balance the assigned workloads of each thread dynamically. Four applications are studied under our parallel framework, i.e., triangle counting, clique counting, maximal clique listing and quasiclique listing. Extensive experiments are conducted which demonstrate that our solution achieves an ideal speedup on various real graph datasets.more » « less