Title: The ParClusterers Benchmark Suite (PCBS): A Fine-Grained Analysis of Scalable Graph Clustering
We introduce the ParClusterers Benchmark Suite (PCBS)---a collection of highly scalable parallel graph clustering algorithms and benchmarking tools that streamline comparing different graph clustering algorithms and implementations. The benchmark includes clustering algorithms that target a wide range of modern clustering use cases, including community detection, classification, and dense subgraph mining. The benchmark toolkit makes it easy to run and evaluate multiple instances of different clustering algorithms with respect to both running time and quality. We evaluate the PCBS algorithms empirically and find that they deliver both state-of-the-art quality and state-of-the-art running time. In terms of running time, they are on average over 4x faster than the fastest library we compared against. In terms of quality, the correlation clustering algorithm [Shi et al., VLDB'21] optimizing the LambdaCC objective, which has no direct counterpart in other libraries, delivers the highest quality on the majority of the datasets we used.
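A quality-versus-time comparison of this kind can be expressed in a few lines of driver code. The sketch below only illustrates the workflow and is not PCBS's actual interface: the NetworkX clusterers, the toy graph, and modularity as the quality measure are stand-ins chosen for brevity.

```python
# Illustrative sketch of a time-vs-quality comparison; not PCBS's API.
import time
import networkx as nx
from networkx.algorithms.community import (
    louvain_communities,
    label_propagation_communities,
    modularity,
)

G = nx.karate_club_graph()  # placeholder input graph

clusterers = {
    "louvain": lambda g: louvain_communities(g, seed=0),
    "label_propagation": lambda g: list(label_propagation_communities(g)),
}

for name, run in clusterers.items():
    start = time.perf_counter()
    communities = run(G)
    elapsed = time.perf_counter() - start
    q = modularity(G, communities)  # one of several possible quality measures
    print(f"{name:20s} time={elapsed:.4f}s modularity={q:.3f}")
```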
Award ID(s):
2403235
PAR ID:
10636087
Author(s) / Creator(s):
Publisher / Repository:
Proceedings of the VLDB Endowment
Date Published:
Journal Name:
Proceedings of the VLDB Endowment
Volume:
18
Issue:
3
ISSN:
2150-8097
Page Range / eLocation ID:
836 to 849
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
1. A fundamental building block in any graph algorithm is a graph container -- a data structure used to represent the graph. Ideally, a graph container enables efficient access to the underlying graph, has low space usage, and supports updating the graph efficiently. In this paper, we conduct an extensive empirical evaluation of graph containers designed to support running algorithms on large graphs. To our knowledge, this is the first apples-to-apples comparison of graph containers rather than overall systems, which include confounding factors such as differences in algorithm implementations and infrastructure. We measure the running time of 10 highly optimized algorithms across over 20 different containers and 10 graphs. Somewhat surprisingly, we find that the average algorithm running time does not differ much across containers, especially those that support dynamic updates. Specifically, a simple container based on an off-the-shelf B-tree is only 1.22× slower on average than a highly optimized static one. Moreover, we observe that simplifying a graph-container Application Programming Interface (API) to only a few simple functions incurs a mere 1.16× slowdown compared to a complete API. Finally, we also measure batch-insert throughput in dynamic-graph containers for a full picture of their performance. To perform the benchmarks, we introduce BYO, a unified framework that standardizes evaluations of graph-algorithm performance across different graph containers. BYO extends the Graph Based Benchmark Suite (Dhulipala et al. '18), a state-of-the-art graph algorithm benchmark, to easily plug into different dynamic graph containers and enable fair comparisons between them on a large suite of graph algorithms. While several graph algorithm benchmarks have been developed to date, to the best of our knowledge, BYO is the first system designed to benchmark graph containers.
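To make the "few simple functions" point concrete, here is a hypothetical sketch of the minimal-API idea: an algorithm written against a tiny container interface runs unchanged on any container that implements it. The names below are illustrative and are not BYO's actual interface.

```python
# Hypothetical minimal graph-container interface; not BYO's real API.
from collections import deque
from typing import Iterable, Protocol


class GraphContainer(Protocol):
    def num_nodes(self) -> int: ...
    def neighbors(self, u: int) -> Iterable[int]: ...


class AdjacencyListGraph:
    """A simple static container backed by Python lists."""

    def __init__(self, n: int, edges: list[tuple[int, int]]):
        self._adj = [[] for _ in range(n)]
        for u, v in edges:
            self._adj[u].append(v)
            self._adj[v].append(u)

    def num_nodes(self) -> int:
        return len(self._adj)

    def neighbors(self, u: int) -> Iterable[int]:
        return self._adj[u]


def bfs_distances(g: GraphContainer, source: int) -> list[int]:
    """Runs against any container exposing the two-method interface."""
    dist = [-1] * g.num_nodes()
    dist[source] = 0
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in g.neighbors(u):
            if dist[v] == -1:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


g = AdjacencyListGraph(5, [(0, 1), (1, 2), (2, 3), (3, 4)])
print(bfs_distances(g, 0))  # [0, 1, 2, 3, 4]
```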
2. Untrusted third parties in commercial-off-the-shelf (COTS) printed circuit board (PCB) supply chains may poison PCBs with hardware, firmware, and software implants. Hence, we focus on detecting malicious implants in PCBs. State-of-the-art hardware Trojan detection methods require a golden PCB system/model to detect malicious implants and do not scale to large COTS PCB systems. We map a COTS PCB system to a graph and propose a golden-free methodology comprising a graph-based mathematical construction on node and edge equivalences, clustering of identical nodes and paths, and validation of hypothesized statistical properties on measured side-channel data. We evaluate the methodology on a multi-PCB testbed with hierarchically networked PCB devices and several types of Trojans.
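As a loose illustration of what golden-free validation can look like, the sketch below checks whether devices hypothesized to be identical produce statistically indistinguishable side-channel measurements, flagging pairs whose distributions diverge. The simulated data, the two-sample test, and the threshold are all assumptions made for the example; this is not the paper's construction.

```python
# Golden-free checking idea (illustrative only, not the paper's methodology):
# nodes hypothesized to be equivalent should yield statistically similar
# side-channel measurements, so a large divergence flags a candidate implant.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
# Simulated power traces (arbitrary units) for three devices assumed identical.
measurements = {
    "device_A": rng.normal(1.00, 0.05, size=500),
    "device_B": rng.normal(1.01, 0.05, size=500),
    "device_C": rng.normal(1.20, 0.05, size=500),  # behaves differently
}

ALPHA = 0.01  # significance threshold (an assumption, not from the paper)
names = list(measurements)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        stat, p = ks_2samp(measurements[names[i]], measurements[names[j]])
        flag = "SUSPECT" if p < ALPHA else "ok"
        print(f"{names[i]} vs {names[j]}: KS={stat:.3f} p={p:.3g} -> {flag}")
```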
3. Graph clustering is a fundamental problem in social network analysis, the goal of which is to group the vertices of a graph into a series of densely knit clusters, with each cluster well separated from all the others. Classical graph clustering methods take advantage of the graph topology to model and quantify vertex proximity. With the proliferation of rich graph content, such as user profiles in social networks and gene annotations in protein interaction networks, it is essential to consider both the structure and the content of graphs for high-quality graph clustering. In this paper, we propose a graph embedding approach to clustering content-enriched graphs. The key idea is to embed each vertex of a graph into a continuous vector space where the localized structural and attributive information of vertices can be encoded in a unified, latent representation. Specifically, we quantify vertex-wise attribute proximity into edge weights and employ truncated, attribute-aware random walks to learn the latent representations of vertices. We evaluate our attribute-aware graph embedding method on real-world attributed graphs, and the results demonstrate its effectiveness in comparison with state-of-the-art algorithms.
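The sketch below illustrates the general idea of attribute-aware walks under simplifying assumptions: edge weights are cosine similarities of endpoint attributes, and truncated random walks sample neighbors in proportion to those weights. The toy graph and similarity choice are illustrative only, and the subsequent embedding step (e.g., a skip-gram model over the walks) is omitted; this is not the paper's exact method.

```python
# Attribute-aware truncated random walks on a toy attributed graph (sketch).
import numpy as np

rng = np.random.default_rng(42)

# Toy attributed graph: adjacency lists and one attribute vector per vertex.
adj = {0: [1, 2], 1: [0, 2, 3], 2: [0, 1], 3: [1]}
attrs = {0: np.array([1.0, 0.0]), 1: np.array([0.9, 0.1]),
         2: np.array([0.0, 1.0]), 3: np.array([0.8, 0.2])}


def edge_weight(u, v):
    """Cosine similarity of the endpoints' attributes, floored at a small value."""
    a, b = attrs[u], attrs[v]
    sim = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return max(sim, 1e-3)


def truncated_walk(start, length=5):
    """Sample a short walk, picking neighbors in proportion to edge weight."""
    walk = [start]
    for _ in range(length - 1):
        nbrs = adj[walk[-1]]
        w = np.array([edge_weight(walk[-1], v) for v in nbrs])
        walk.append(int(rng.choice(nbrs, p=w / w.sum())))
    return walk


walks = [truncated_walk(v) for v in adj for _ in range(3)]
print(walks[:4])  # these walks would then feed a skip-gram style embedder
```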
4. Clustering plays a crucial role in computer science, facilitating data analysis and problem-solving across numerous fields. By partitioning large datasets into meaningful groups, clustering reveals hidden structures and relationships within the data, aiding tasks such as unsupervised learning, classification, anomaly detection, and recommendation systems. Particularly in relational databases, where data is distributed across multiple tables, efficient clustering is essential yet challenging due to the computational complexity of joining tables. This paper addresses this challenge by introducing efficient algorithms for k-median and k-means clustering on relational data without the need to pre-compute the join query results. For relational k-median clustering, we propose the first efficient relative approximation algorithm. For relational k-means clustering, our algorithm significantly improves both the approximation factor and the running time of the known relational k-means clustering algorithms, which suffer from either large constant approximation factors or expensive running times. Given a join query q and a database instance D of O(N) tuples, for both k-median and k-means clustering on the results of q on D, we propose randomized (1+ε)γ-approximation algorithms that run in roughly O(k²N^fhw) + T_γ(k²) time, where ε ∈ (0,1) is a constant parameter chosen by the user, fhw is the fractional hypertree width of q, and γ and T_γ(x) denote, respectively, the approximation factor and the running time of a traditional clustering algorithm in the standard computational setting over x points.
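For contrast, the following sketch shows the naive baseline that such relational algorithms avoid: materialize the join result and hand it to an off-the-shelf k-means. The schema and data are invented for illustration; the point is that the materialized join can be much larger than the input tables, which is the cost the paper's algorithms sidestep.

```python
# Naive baseline (illustrative): materialize the join, then run k-means on it.
import sqlite3
import numpy as np
from sklearn.cluster import KMeans

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE R(uid INTEGER, x REAL);
    CREATE TABLE S(uid INTEGER, y REAL);
""")
conn.executemany("INSERT INTO R VALUES (?, ?)",
                 [(i % 10, float(i)) for i in range(100)])
conn.executemany("INSERT INTO S VALUES (?, ?)",
                 [(i % 10, float(i) * 0.5) for i in range(100)])

# Materializing the join: its size can blow up far beyond the input tables.
rows = conn.execute("SELECT R.x, S.y FROM R JOIN S ON R.uid = S.uid").fetchall()
points = np.array(rows)
print("join size:", len(points), "vs. total input size:", 200)

labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(points)
print("cluster sizes:", np.bincount(labels))
```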
5. Stream clustering is an important data mining technique for capturing the evolving patterns in real-time data streams. Today's data streams, e.g., IoT events and Web clicks, are usually high-speed and contain dynamically changing patterns. Existing stream clustering algorithms usually follow an online-offline paradigm with a one-record-at-a-time update model, which was designed for running on a single machine. These stream clustering algorithms, with this sequential update model, cannot be efficiently parallelized and fail to deliver the required high throughput for stream clustering. In this paper, we present DistStream, a distributed framework that can effectively scale out online-offline stream clustering algorithms. To parallelize these algorithms for high throughput, we develop a mini-batch update model with efficient parallelization approaches. To maintain high clustering quality, DistStream's mini-batch update model preserves the update order in all the computation steps during parallel execution, which can reflect recent changes in dynamically changing streaming data. We implement DistStream atop Spark Streaming, along with four representative stream clustering algorithms based on it. Our evaluation on three real-world datasets shows that DistStream-based stream clustering algorithms can achieve sublinear throughput gain and clustering quality comparable (99%) to that of their single-machine counterparts.
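The sketch below shows a mini-batch update model in miniature, under simplifying assumptions (centroid-style micro-clusters, batches of ten records): the assignment step over a batch is embarrassingly parallel, while the maintenance step folds points into their clusters in arrival order. It is only an illustration of the update model, not DistStream's Spark Streaming implementation.

```python
# Mini-batch micro-cluster update (sketch; not DistStream's implementation).
import numpy as np


class MicroClusters:
    def __init__(self, centers):
        self.centers = np.asarray(centers, dtype=float)
        self.counts = np.ones(len(self.centers))

    def update_batch(self, batch):
        batch = np.asarray(batch, dtype=float)
        # Assignment step: independent per record, so parallelizable per batch.
        d = np.linalg.norm(batch[:, None, :] - self.centers[None, :, :], axis=2)
        assign = d.argmin(axis=1)
        # Maintenance step: fold points into their clusters in arrival order.
        for x, c in zip(batch, assign):
            self.counts[c] += 1
            self.centers[c] += (x - self.centers[c]) / self.counts[c]


rng = np.random.default_rng(1)
mc = MicroClusters(centers=[[0.0, 0.0], [5.0, 5.0]])
stream = rng.normal(loc=[[0, 0]] * 50 + [[5, 5]] * 50, scale=0.3)
for start in range(0, len(stream), 10):  # mini-batches of 10 records
    mc.update_batch(stream[start:start + 10])
print(np.round(mc.centers, 2))
```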