skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: It’s Hard to HAC Average Linkage!
Average linkage Hierarchical Agglomerative Clustering (HAC) is an extensively studied and applied method for hierarchical clustering. Recent applications to massive datasets have driven significant interest in near-linear-time and efficient parallel algorithms for average linkage HAC. We provide hardness results that rule out such algorithms. On the sequential side, we establish a runtime lower bound of n^{3/2-ε} on n node graphs for sequential combinatorial algorithms under standard fine-grained complexity assumptions. This essentially matches the best-known running time for average linkage HAC. On the parallel side, we prove that average linkage HAC likely cannot be parallelized even on simple graphs by showing that it is CC-hard on trees of diameter 4. On the possibility side, we demonstrate that average linkage HAC can be efficiently parallelized (i.e., it is in NC) on paths and can be solved in near-linear time when the height of the output cluster hierarchy is small.  more » « less
Award ID(s):
2317194
PAR ID:
10609048
Author(s) / Creator(s):
; ; ; ; ;
Editor(s):
Bringmann, Karl; Grohe, Martin; Puppis, Gabriele; Svensson, Ola
Publisher / Repository:
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Date Published:
Volume:
297
ISSN:
1868-8969
ISBN:
978-3-95977-322-5
Page Range / eLocation ID:
18:1-18:18
Subject(s) / Keyword(s):
Clustering Hierarchical Graph Clustering HAC Fine-Grained Complexity Parallel Algorithms CC Theory of computation → Parallel algorithms Theory of computation → Streaming, sublinear and near linear time algorithms Theory of computation → Graph algorithms analysis
Format(s):
Medium: X Size: 18 pages; 1997336 bytes Other: application/pdf
Size(s):
18 pages 1997336 bytes
Right(s):
Creative Commons Attribution 4.0 International license; info:eu-repo/semantics/openAccess
Sponsoring Org:
National Science Foundation
More Like this
  1. We give an algorithm for Centroid-Linkage Hierarchical Agglomerative Clustering (HAC), which computes a $$c$$-approximate clustering in roughly $$n^{1+O(1/c^2)}$$ time. We obtain our result by combining a new Centroid-Linkage HAC algorithm with a novel fully dynamic data structure for nearest neighbor search which works under adaptive updates. We also evaluate our algorithm empirically. By leveraging a state-of-the-art nearest-neighbor search library, we obtain a fast and accurate Centroid-Linkage HAC algorithm. Compared to an existing state-of-the-art exact baseline, our implementation maintains the clustering quality while delivering up to a $$36\times$$ speedup due to performing fewer distance comparisons. 
    more » « less
  2. Hierarchical clustering is a fundamental tool in data mining, machine learning and statistics. Popular hierarchical clustering algorithms include top-down divisive approaches such as bisecting k-means, k-median, and k-center and bottom-up agglomerative approaches such as single- linkage, average-linkage, and centroid-linkage. Unfortunately, only a few scalable hierarchical clustering algorithms are known, mostly based on the single-linkage algorithm. So, as datasets increase in size every day, there is a pressing need to scale other popular methods. We introduce efficient distributed algorithms for bisecting k-means, k- median, and k-center as well as centroid-linkage. In particular, we first formalize a notion of closeness for a hierarchical clustering algorithm, and then we use this notion to design new scalable distributed methods with strong worst case bounds on the running time and the quality of the solutions. Finally, we show experimentally that the introduced algorithms are efficient and close to their sequential variants in practice. 
    more » « less
  3. null (Ed.)
    The Sparsest Cut is a fundamental optimization problem that have been extensively studied. For planar inputs the problem is in P and can be solved in Õ(n 3 ) time if all vertex weights are 1. Despite a significant amount of effort, the best algorithms date back to the early 90’s and can only achieve O(log n)-approximation in Õ(n) time or 3.5-approximation in Õ(n 2 ) time [Rao, STOC92]. Our main result is an Ω(n 2−ε ) lower bound for Sparsest Cut even in planar graphs with unit vertex weights, under the (min, +)-Convolution conjecture, showing that approxima- tions are inevitable in the near-linear time regime. To complement the lower bound, we provide a 3.3-approximation in near-linear time, improving upon the 25-year old result of Rao in both time and accuracy. We also show that our lower bound is not far from optimal by observing an exact algorithm with running time Õ(n 5/2 ) improving upon the Õ(n 3 ) algorithm of Park and Phillips [STOC93]. Our lower bound accomplishes a repeatedly raised challenge by being the first fine-grained lower bound for a natural planar graph problem in P. Building on our construction we prove near-quadratic lower bounds under SETH for variants of the closest pair problem in planar graphs, and use them to show that the popular Average-Linkage procedure for Hierarchical Clustering cannot be simulated in truly subquadratic time. At the core of our constructions is a diamond-like gadget that also settles the complexity of Diameter in distributed planar networks. We prove an Ω(n/ log n) lower bound on the number of communication rounds required to compute the weighted diameter of a network in the CONGET model, even when the underlying graph is planar and all nodes are D = 4 hops away from each other. This is the first poly(n) lower bound in the planar-distributed setting, and it complements the recent poly(D, log n) upper bounds of Li and Parter [STOC 2019] for (exact) unweighted diameter and for (1 + ε) approximate weighted diameter. 
    more » « less
  4. Stream clustering is an important data mining technique to capture the evolving patterns in real-time data streams. Today’s data streams, e.g., IoT events and Web clicks, are usually high-speed and contain dynamically-changing patterns. Existing stream clustering algorithms usually follow an online-offline paradigm with a one-record-at-a-time update model, which was designed for running in a single machine. These stream clustering algorithms, with this sequential update model, cannot be efficiently parallelized and fail to deliver the required high throughput for stream clustering. In this paper, we present DistStream, a distributed framework that can effectively scale out online-offline stream clustering algorithms. To parallelize these algorithms for high throughput, we develop a mini-batch update model with efficient parallelization approaches. To maintain high clustering quality, DistStream’s mini-batch update model preserves the update order in all the computation steps during parallel execution, which can reflect the recent changes for dynamically-changing streaming data. We implement DistStream atop Spark Streaming, as well as four representative stream clustering algorithms based on DistStream. Our evaluation on three real-world datasets shows that DistStream-based stream clustering algorithms can achieve sublinear throughput gain and comparable (99%) clustering quality with their single-machine counterparts. 
    more » « less
  5. We show new applications of the nearest-neighbor chain algorithm, a technique that originated in agglomerative hierarchical clustering. We use it to construct the greedy multi-fragment tour for Euclidean TSP in O(n log n) time in any fixed dimension and for Steiner TSP in planar graphs in O(n sqrt(n)log n) time; we compute motorcycle graphs, a central step in straight skeleton algorithms, in O(n^(4/3+epsilon)) time for any epsilon>0. 
    more » « less