Quasicliques are dense incomplete subgraphs of a graph that generalize the notion of cliques. Enumerating quasicliques from a graph is a robust way to detect densely connected structures with applications in bioinformatics and social network analysis. However, enumerating quasicliques in a graph is a challenging problem, even harder than the problem of enumerating cliques. We consider the enumeration of top k degreebased quasicliques and make the following contributions: (1) we show that even the problem of detecting whether a given quasiclique is maximal (i.e., not contained within another quasiclique) is NPhard. (2) We present a novel heuristic algorithm K ernel QC to enumerate the k largest quasicliques in a graph. Our method is based on identifying kernels of extremely dense subgraphs within a graph, followed by growing subgraphs around these kernels, to arrive at quasicliques with the required densities. (3) Experimental results show that our algorithm accurately enumerates quasicliques from a graph, is much faster than current stateoftheart methods for quasiclique enumeration (often more than three orders of magnitude faster), and can scale to larger graphs than current methods.
This content will become publicly available on January 1, 2023
Maximal Directed QuasiClique Mining
Quasicliques are a type of dense subgraphs that generalize the notion of cliques, important for applications such as community/module detection in various social and biological networks. However, the existing quasiclique definition and algorithms are only applicable to undirected graphs. In this paper, we generalize the concept of quasicliques to directed graphs by proposing $(\gamma_1, \gamma_2)$quasicliques which have density requirements in both inbound and outbound directions of each vertex in a quasiclique subgraph. An efficient recursive algorithm is proposed to find maximal $(\gamma_1, \gamma_2)$quasicliques which integrates many effective pruning rules that are validated by ablation studies. We also study the finding of top$k$ large quasicliques directly by bootstrapping the search from more compact quasicliques, to scale the mining to larger networks. The algorithms are parallelized with effective load balancing, and we demonstrate that they can scale up effectively with the number of CPU cores.
 Award ID(s):
 1755464
 Publication Date:
 NSFPAR ID:
 10331905
 Journal Name:
 Proceedings of the 38th IEEE International Conference on Data Engineering (ICDE)
 Sponsoring Org:
 National Science Foundation
More Like this


Given a userspecified minimum degree threshold γ, a γquasiclique is a subgraph where each vertex connects to at least γ fraction of the other vertices. Quasiclique is a natural definition for dense structures, so finding large and hence statistically significant quasicliques is useful in applications such as community detection in social networks and discovering significant biomolecule structures and pathways. However, mining maximal quasicliques is notoriously expensive, and even a recent algorithm for mining large maximal quasicliques is flawed and can lead to a lot of repeated searches. This paper proposes a parallel solution for mining maximal quasicliques that is able to fully utilize CPU cores. Our solution utilizes divide and conquer to decompose the workloads into independent tasks for parallel mining, and we addressed the problem of (i) drastic load imbalance among different tasks and (ii) difficulty in predicting the task running time and the time growth with task subgraph size, by (a) using a timeoutbased task decomposition strategy, and by (b) utilizing a priority task queue to schedule longrunning tasks earlier for mining and decomposition to avoid stragglers. Unlike our conference version in PVLDB 2020 where the solution was built on a distributed graph mining framework called Gthinker, thismore »

Given a userspecified minimum degree threshold γ, a γquasiclique is a subgraph g = (Vg, Eg) where each vertex ν ∈ Vg connects to at least γ fraction of the other vertices (i.e., ⌈γ · (Vg 1)⌉ vertices) in g. Quasiclique is one of the most natural definitions for dense structures useful in finding communities in social networks and discovering significant biomolecule structures and pathways. However, mining maximal quasicliques is notoriously expensive. In this paper, we design parallel algorithms for mining maximal quasicliques on Gthinker, a distributed graph mining framework that decomposes mining into computeintensive tasks to fully utilize CPU cores. We found that directly using Gthinker results in the straggler problem due to (i) the drastic load imbalance among different tasks and (ii) the difficulty of predicting the task running time. We address these challenges by redesigning Gthinker's execution engine to prioritize longrunning tasks for execution, and by utilizing a novel timeout strategy to effectively decompose longrunning tasks to improve load balancing. While this system redesign applies to many other expensive dense subgraph mining problems, this paper verifies the idea by adapting the stateoftheart quasiclique algorithm, Quick, to our redesigned Gthinker. Extensive experiments verify that our new solution scalesmore »

Cliquecounting is a fundamental problem that has application in many areas eg. dense subgraph discovery, community detection, spam detection, etc. The problem of kcliquecounting is difficult because as k increases, the number of kcliques goes up exponentially. Enumeration algorithms (even parallel ones) fail to count kcliques beyond a small k. Approximation algorithms, like TuránShadow have been shown to perform well upto k = 10, but are inefficient for larger cliques. The recently proposed Pivoter algorithm significantly improved the stateoftheart and was able to give exact counts of all kcliques in a large number of graphs. However, the clique counts of some graphs (for example, comlj) are still out of reach of these algorithms. We revisit the TuránShadow algorithm and propose a generalized framework called YACC that leverages several insights about realworld graphs to achieve faster cliquecounting. The bottleneck in TuránShadow is a recursive subroutine whose stopping condition is based on a classic result from extremal combinatorics called Turán's theorem. This theorem gives a lower bound for the kclique density in a subgraph in terms of its edge density. However, this stopping condition is based on a worstcase graph that does not reflect the nature of realworld graphs. Using techniques formore »

Cliques and cliquelike subgraphs (e.g., quasicliques) are important dense structures whose counting or listing are essential in applications like complex network analysis and community detection. These problems are usually solved by divide and conquer, where a task over a big graph can be recursively divided into subtasks over smaller subgraphs whose search spaces are disjoint. This divisible algorithmic paradigm brings enormous potential for parallelism, since different subtasks can run concurrently to drastically reduce the overall running time. In this paper, we explore this potential by proposing a unified framework for counting and listing cliquelike subgraphs. We study how to divide and distribute the counting and listing tasks, and meanwhile, to balance the assigned workloads of each thread dynamically. Four applications are studied under our parallel framework, i.e., triangle counting, clique counting, maximal clique listing and quasiclique listing. Extensive experiments are conducted which demonstrate that our solution achieves an ideal speedup on various real graph datasets.