Title: On Finding and Analyzing the Backbone of the k-Core Structure of a Graph
In many network applications, dense subgraphs have proven to be extremely useful. One particular type of dense subgraph known as the k-core has received a great deal of attention. k-cores have been used in a number of important applications, including identifying important nodes, speeding up community detection, network visualization, and others. However, little work has investigated the ‘skeletal’ structure of the k-core, and the effect of such structures on the properties of the overall k-core and network itself. In this paper, we propose the Skeletal Core Subgraph, which describes the backbone of the k-core structure of a graph. We show how to categorize graphs based on their skeletal cores, and demonstrate how to efficiently decompose a given graph into its Skeletal Core Subgraph. We show both theoretically and experimentally the relationship between the Skeletal Core Subgraph and properties of the graph, including its core resilience.
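For background, the k-core of a graph is the maximal subgraph in which every vertex has degree at least k, and each vertex's core number is the largest k for which it belongs to the k-core. The sketch below shows the standard peeling algorithm for computing core numbers; it is a minimal illustration of the underlying concept, not the paper's Skeletal Core decomposition.

```python
from collections import deque

def core_numbers(adj):
    """Compute the core number of every vertex by iterative peeling.

    adj: dict mapping vertex -> set of neighbours (undirected graph).
    Returns a dict vertex -> core number. Linear-time variants exist;
    this simple version repeatedly peels minimum-degree vertices.
    """
    degree = {v: len(ns) for v, ns in adj.items()}
    alive = set(adj)
    core = {}
    k = 0
    while alive:
        # raise k to the current minimum remaining degree if needed
        k = max(k, min(degree[v] for v in alive))
        # peel every vertex whose degree has dropped to <= k
        queue = deque(v for v in alive if degree[v] <= k)
        while queue:
            v = queue.popleft()
            if v not in alive:
                continue
            alive.remove(v)
            core[v] = k
            for u in adj[v]:
                if u in alive:
                    degree[u] -= 1
                    if degree[u] <= k:
                        queue.append(u)
    return core
```

For example, a triangle with a pendant vertex attached yields core number 2 for the triangle vertices and 1 for the pendant.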
Award ID(s): 1908048
NSF-PAR ID: 10462518
Journal Name: 2022 IEEE International Conference on Data Mining (ICDM)
Page Range / eLocation ID: 1017 to 1022
Sponsoring Org: National Science Foundation
More Like this
  1. Given a user-specified minimum degree threshold γ, a γ-quasi-clique is a subgraph in which each vertex connects to at least a γ fraction of the other vertices. The quasi-clique is a natural definition for dense structures, so finding large (and hence statistically significant) quasi-cliques is useful in applications such as community detection in social networks and discovering significant biomolecule structures and pathways. However, mining maximal quasi-cliques is notoriously expensive, and even a recent algorithm for mining large maximal quasi-cliques is flawed and can lead to many repeated searches. This paper proposes a parallel solution for mining maximal quasi-cliques that is able to fully utilize CPU cores. Our solution uses divide and conquer to decompose the workload into independent tasks for parallel mining, and we address the problems of (i) drastic load imbalance among different tasks and (ii) difficulty in predicting task running time and its growth with task-subgraph size by (a) using a timeout-based task-decomposition strategy and (b) using a priority task queue to schedule long-running tasks earlier for mining and decomposition, avoiding stragglers. Unlike our conference version in PVLDB 2020, where the solution was built on a distributed graph-mining framework called G-thinker, this paper targets a single-machine multi-core environment, which is more accessible to an average end user. A general framework called T-thinker is developed to facilitate the programming of parallel programs for algorithms that adopt divide and conquer, including but not limited to our quasi-clique mining algorithm. Additionally, we consider the problem of directly mining large quasi-cliques from dense parts of a graph, where we identify the repeated-search issue of a recent method and address it using a carefully designed concurrent trie data structure.
Extensive experiments verify that our parallel solution scales well with the number of CPU cores, achieving a 26.68× runtime speedup when mining a graph with 3.77M vertices and 16.5M edges using 32 mining threads. Additionally, mining large quasi-cliques from dense parts of the graph provides a further speedup of up to 89.46×.
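The γ-quasi-clique condition stated in the abstract translates directly into a membership test: every vertex of the candidate set S must be adjacent to at least γ·(|S| − 1) of the other vertices of S. A minimal sketch (our own illustration, not the paper's mining algorithm):

```python
def is_quasi_clique(adj, S, gamma):
    """Check the gamma-quasi-clique condition: every vertex in S is
    adjacent to at least gamma * (|S| - 1) other vertices of S.

    adj: dict mapping vertex -> set of neighbours.
    """
    S = set(S)
    need = gamma * (len(S) - 1)
    return all(len((adj[v] & S) - {v}) >= need for v in S)
```

With γ = 1 this is exactly the clique test; lowering γ admits progressively sparser dense subgraphs, which is why mining maximal quasi-cliques has a much larger search space than clique mining.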
  2. Finding from a big graph those subgraphs that satisfy certain conditions is useful in many applications such as community detection and subgraph matching. These problems have high time complexity, but existing systems that attempt to scale them are all IO-bound in execution. We propose G-thinker, the first truly CPU-bound distributed framework for subgraph-finding algorithms. It adopts a task-based computation model and provides a user-friendly subgraph-centric, vertex-pulling API for writing distributed subgraph-finding algorithms that can be easily adapted from existing serial algorithms. To utilize all CPU cores of a cluster, G-thinker features (1) a highly concurrent vertex cache for parallel task access and (2) a lightweight task-scheduling approach that ensures high task throughput. These designs overlap communication with computation to minimize the idle time of CPU cores. To further improve load balancing on graphs where the workloads of individual tasks differ drastically due to skewed graph-density distribution, we propose to prioritize the scheduling of tasks that tend to be long-running for processing and decomposition, plus a timeout mechanism for task decomposition to prevent long-running straggler tasks. This idea has been integrated into a novel algorithm for maximum clique finding (MCF) that adopts a hybrid task-decomposition strategy, which significantly improves the running time of MCF on dense and large graphs: the algorithm finds a maximum clique of size 1,109 on a large and dense WikiLinks graph dataset in 70 minutes. Extensive experiments demonstrate that G-thinker achieves orders-of-magnitude speedup even compared with the fastest existing subgraph-centric system, and it scales well to much larger and denser real network data. G-thinker is open-sourced at http://bit.ly/gthinker with detailed documentation.
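The scheduling idea in this abstract, namely run the largest estimated tasks first and decompose any task that exceeds its time budget, can be sketched sequentially with a max-priority queue. All names here are illustrative assumptions, not G-thinker's API; `process` returning `None` stands in for a timeout:

```python
import heapq
from itertools import count

def schedule(tasks, process, split, estimate):
    """Sequential sketch of priority-first scheduling with
    timeout-based decomposition (illustrative, not G-thinker's API).

    tasks:    initial work items
    process:  returns a result, or None to signal "timed out"
    split:    decomposes a timed-out task into smaller subtasks
    estimate: predicts a task's workload; larger runs first
    """
    tick = count()  # tie-breaker so the heap never compares tasks
    heap = [(-estimate(t), next(tick), t) for t in tasks]
    heapq.heapify(heap)
    results = []
    while heap:
        _, _, task = heapq.heappop(heap)
        finished = process(task)
        if finished is not None:      # completed within its budget
            results.append(finished)
        else:                         # "timed out": decompose and requeue
            for sub in split(task):
                heapq.heappush(heap, (-estimate(sub), next(tick), sub))
    return results
```

Running big tasks first lets their subtasks fill the queue early, which is the straggler-avoidance argument made in the abstract; in the real system the loop body would execute on worker threads.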
  3. Clique counting is a fundamental problem with applications in many areas, e.g., dense subgraph discovery, community detection, and spam detection. The problem of k-clique counting is difficult because the number of k-cliques grows exponentially with k. Enumeration algorithms (even parallel ones) fail to count k-cliques beyond a small k. Approximation algorithms like TuránShadow have been shown to perform well up to k = 10 but are inefficient for larger cliques. The recently proposed Pivoter algorithm significantly improved the state of the art and was able to give exact counts of all k-cliques in a large number of graphs. However, the clique counts of some graphs (for example, com-lj) are still out of reach of these algorithms. We revisit the TuránShadow algorithm and propose a generalized framework called YACC that leverages several insights about real-world graphs to achieve faster clique counting. The bottleneck in TuránShadow is a recursive subroutine whose stopping condition is based on a classic result from extremal combinatorics called Turán's theorem, which gives a lower bound on the k-clique density of a subgraph in terms of its edge density. However, this stopping condition is based on a worst-case graph that does not reflect the nature of real-world graphs. Using techniques for quickly discovering dense subgraphs, we relax the stopping condition in a systematic way so that we obtain a smaller recursion tree while still maintaining the guarantees provided by TuránShadow. We deploy our algorithm on several real-world datasets and show that YACC reduces both the size of the recursion tree and the running time by over an order of magnitude. Using YACC, we are able to obtain clique counts for several graphs for which clique counting was previously infeasible, including com-lj.
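The Turán-theorem stopping condition mentioned above has a simple concrete form: any n-vertex graph whose edge density exceeds 1 − 1/(k − 1) must contain a k-clique, so recursion can stop once a subgraph clears that threshold. A minimal sketch of this kind of test (our illustration of the classical bound, not YACC's relaxed condition):

```python
def turan_density_threshold(k):
    """Turán's theorem: an n-vertex graph with edge density above
    1 - 1/(k - 1) necessarily contains a k-clique."""
    return 1.0 - 1.0 / (k - 1)

def dense_enough(n, m, k):
    """Worst-case stopping test of the kind TuránShadow uses.

    n, m: vertex and edge counts of the (sub)graph under recursion.
    Returns True when the subgraph's edge density exceeds the Turán
    threshold for k-cliques, so recursion can stop there.
    """
    if n < k:
        return False
    density = 2.0 * m / (n * (n - 1))
    return density > turan_density_threshold(k)
```

Because real-world subgraphs are usually far from the worst case, this threshold fires late, which is exactly the inefficiency YACC's relaxed condition targets.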
  4. Nowadays, large-scale graph data is being generated in a variety of real-world applications, from social networks to co-authorship networks, from protein-protein interaction networks to road-traffic networks. Many existing works on graph mining focus on vertices and edges, with the first-order Markov chain as the underlying model. They fail to explore high-order network structures, which are of key importance in many high-impact domains. For example, in bank customer personally identifiable information (PII) networks, star structures often correspond to sets of synthetic identities; in financial transaction networks, loop structures may indicate money laundering. In this paper, we focus on mining user-specified high-order network structures and aim to find a structure-rich subgraph that can be separated from the rest of the graph without breaking many such structures. A key challenge in finding a structure-rich subgraph is the prohibitive computational cost. To address this problem, inspired by the family of local graph-clustering algorithms that efficiently identify a low-conductance cut without exploring the entire graph, we generalize their key idea to model high-order network structures. In particular, we start with a generic definition of high-order conductance and define the high-order diffusion core, which is based on a high-order random walk induced by the user-specified high-order network structure. We then propose a novel High-Order Structure-Preserving LOcal Cut (HOSPLOC) algorithm, which runs in polylogarithmic time with respect to the number of edges in the graph. It starts with a seed vertex and iteratively explores its neighborhood until a subgraph with small high-order conductance is found. Furthermore, we analyze its performance in terms of both effectiveness and efficiency.
The experimental results on both synthetic and real graphs demonstrate the effectiveness and efficiency of our proposed HOSPLOC algorithm.
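For context on what "high-order conductance" generalizes, the standard first-order conductance of a vertex set S is the number of cut edges divided by the smaller side's volume; the abstract's high-order version replaces single edges with user-specified structures such as triangles or stars. A minimal sketch of the first-order quantity (our illustration, not the paper's definition):

```python
def conductance(adj, S):
    """First-order conductance of vertex set S: cut edges divided by
    the smaller side's volume (sum of degrees).

    adj: dict mapping vertex -> set of neighbours (undirected graph).
    Low conductance means S is well connected internally but only
    weakly connected to the rest of the graph.
    """
    S = set(S)
    cut = sum(1 for v in S for u in adj[v] if u not in S)
    vol_S = sum(len(adj[v]) for v in S)
    vol_rest = sum(len(adj[v]) for v in adj if v not in S)
    denom = min(vol_S, vol_rest)
    return cut / denom if denom else 0.0
```

For two triangles joined by a single bridge edge, either triangle has conductance 1/7, which is the kind of low-conductance cut a local clustering algorithm sweeps for.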
  5.
    Context. Stars form in cold dense cores that show subsonic velocity dispersions, while the parental molecular clouds display higher temperatures and supersonic velocity dispersions. The transition from core to cloud has been observed in velocity dispersion, but temperature and abundance variations are unknown. Aims. We aim to measure the temperature and velocity dispersion across cores and the ambient cloud in a single tracer to study the transition between the two regions. Methods. We use NH3 (1,1) and (2,2) maps of L1688 from the Green Bank Ammonia Survey, smoothed to 1′, and determine the physical properties by fitting the spectra. We identify the coherent cores and study the changes in temperature and velocity dispersion from the cores to the surrounding cloud. Results. We obtain a kinetic-temperature map that extends beyond the dense cores and traces the cloud, improving on previous maps that traced mostly the cores. The cloud is 4–6 K warmer than the cores and shows a larger velocity dispersion (Δσ_v = 0.15–0.25 km s⁻¹). Comparing with Herschel-based dust temperatures, we find that the cores show kinetic temperatures ≈1.8 K lower than the dust temperature, while the gas temperature is higher than the dust temperature in the cloud. We find an average p-NH3 fractional abundance (with respect to H2) of (4.2 ± 0.2) × 10⁻⁹ towards the coherent cores and (1.4 ± 0.1) × 10⁻⁹ outside the core boundaries. Using stacked spectra, we detect two components, one narrow and one broad, towards the cores and their neighbourhoods. We find the turbulence in the narrow component to be correlated with the size of the structure (Pearson r = 0.54). With these unresolved regional measurements, we obtain a turbulence–size relation of σ_v,NT ∝ r^0.5, which is similar to previous findings using multiple tracers. Conclusions. We discover that the subsonic component extends up to 0.15 pc beyond the typical coherent boundaries, unveiling larger extents of the coherent cores and showing a gradual transition to coherence over ~0.2 pc.