

Title: A Framework for Parallelizing Hierarchical Clustering Methods
Hierarchical clustering is a fundamental tool in data mining, machine learning, and statistics. Popular hierarchical clustering algorithms include top-down divisive approaches such as bisecting k-means, k-median, and k-center, and bottom-up agglomerative approaches such as single-linkage, average-linkage, and centroid-linkage. Unfortunately, only a few scalable hierarchical clustering algorithms are known, mostly based on the single-linkage algorithm. So, as datasets increase in size every day, there is a pressing need to scale other popular methods. We introduce efficient distributed algorithms for bisecting k-means, k-median, and k-center, as well as centroid-linkage. In particular, we first formalize a notion of closeness for a hierarchical clustering algorithm, and then we use this notion to design new scalable distributed methods with strong worst-case bounds on the running time and the quality of the solutions. Finally, we show experimentally that the introduced algorithms are efficient and close to their sequential variants in practice.
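The top-down divisive scheme named in the abstract can be made concrete with a minimal sequential sketch of bisecting k-means (a plain NumPy toy with helper names of our own choosing, not the paper's distributed algorithm):

```python
import numpy as np

def kmeans_2(points, iters=20):
    """2-means via Lloyd's algorithm, seeded with point 0 and the point
    farthest from it (a deterministic toy initialization)."""
    far = np.linalg.norm(points - points[0], axis=1).argmax()
    centers = points[[0, far]].astype(float)
    for _ in range(iters):
        d = np.linalg.norm(points[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(2):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(axis=0)
    return labels

def bisecting_kmeans(points, k):
    """Top-down divisive clustering: repeatedly bisect the largest
    remaining cluster until k clusters are left."""
    clusters = [np.arange(len(points))]
    while len(clusters) < k:
        clusters.sort(key=len)
        big = clusters.pop()                    # always split the largest cluster
        labels = kmeans_2(points[big])
        clusters += [big[labels == 0], big[labels == 1]]
    return clusters
```

The recursion tree of splits is exactly the hierarchy the algorithm outputs; the paper's contribution is running each split in a distributed model with provable closeness to this sequential behavior.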
Award ID(s):
1824303 1845146
NSF-PAR ID:
10130909
Author(s) / Creator(s):
Date Published:
Journal Name:
European Conference on Machine Learning
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Recent years have witnessed an increasing popularity of algorithm design for distributed data, largely because massive datasets are often collected and stored in different locations. In the distributed setting, communication typically dominates the query processing time, so it becomes crucial to design communication-efficient algorithms for queries on distributed data. Simultaneously, it has been widely recognized that partial optimizations, where we are allowed to disregard a small part of the data, provide significantly better solutions. The motivation for disregarding points often arises from noise and other phenomena that are pervasive in large data scenarios. In this paper we focus on partial clustering problems, k-center, k-median, and k-means, in the distributed model, and provide algorithms with communication sublinear in the input size. As a consequence we develop the first algorithms for the partial k-median and k-means objectives that run in subquadratic time. We also initiate the study of distributed algorithms for clustering uncertain data, where each data point may fall into one of multiple locations under a certain probability distribution.
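The partial (outlier-tolerant) k-center objective can be illustrated with a small sequential heuristic: greedy farthest-point traversal that skips the z farthest points when picking each center, then reports the radius over the n − z closest points. This is an illustrative sketch only, not the paper's communication-efficient distributed algorithm:

```python
import numpy as np

def k_center_with_outliers(points, k, z):
    """Greedy farthest-point k-center that ignores the z farthest points
    when choosing each new center, then measures the covering radius over
    the n - z best-served points (the z disregarded points pay nothing)."""
    centers = [points[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
        centers.append(points[np.argsort(d)[-(z + 1)]])  # skip z likely outliers
    d = np.min([np.linalg.norm(points - c, axis=1) for c in centers], axis=0)
    return np.array(centers), np.sort(d)[: len(points) - z].max()
```

On a dataset with one extreme noise point, dropping z = 1 point lets the two centers cover the genuine clusters exactly, which is the benefit of the partial objective described above.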
  2. Bansal, Nikhil and (Ed.)
    This paper presents universal algorithms for clustering problems, including the widely studied k-median, k-means, and k-center objectives. The input is a metric space containing all potential client locations. The algorithm must select k cluster centers such that they are a good solution for any subset of clients that actually realizes. Specifically, we aim for low regret, defined as the maximum over all subsets of the difference between the cost of the algorithm’s solution and that of an optimal solution. A universal algorithm’s solution sol for a clustering problem is said to be an (α, β)-approximation if for all subsets of clients C', it satisfies sol(C') ≤ α ⋅ opt(C') + β ⋅ mr, where opt(C') is the cost of the optimal solution for clients C' and mr is the minimum regret achievable by any solution. Our main results are universal algorithms for the standard clustering objectives of k-median, k-means, and k-center that achieve (O(1), O(1))-approximations. These results are obtained via a novel framework for universal algorithms using linear programming (LP) relaxations. These results generalize to other 𝓁_p-objectives and the setting where some subset of the clients is fixed. We also give hardness results showing that (α, β)-approximation is NP-hard if α or β is at most a certain constant, even for the widely studied special case of Euclidean metric spaces. This shows that in some sense, (O(1), O(1))-approximation is the strongest type of guarantee obtainable for universal clustering.
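The regret notion can be made concrete by brute force on a tiny 1-D k-median instance (all names and the instance are ours, purely to unpack the definition; real universal algorithms avoid enumerating subsets):

```python
from itertools import combinations

clients = [0.0, 1.0, 10.0]      # all potential client locations (toy metric: the line)
candidates = [0.0, 1.0, 10.0]   # potential center locations, k = 1

def cost(centers, realized):
    # k-median cost: each realized client pays its distance to the nearest center
    return sum(min(abs(c - f) for f in centers) for c in realized)

def regret(centers):
    # max over all nonempty client subsets of (our cost) - (optimal cost in hindsight)
    r = 0.0
    for m in range(1, len(clients) + 1):
        for sub in combinations(clients, m):
            opt = min(cost([f], sub) for f in candidates)
            r = max(r, cost(centers, sub) - opt)
    return r

mr = min(regret([f]) for f in candidates)  # minimum regret achievable by any solution
```

Here the center at 1.0 hedges between the two extremes and attains the minimum regret mr, so any (α, β)-approximate universal solution sol must satisfy sol(C') ≤ α ⋅ opt(C') + β ⋅ mr on every subset C'.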
  3. Abstract

    Policy interest in socio‐ecological systems has driven attempts to define and map socio‐ecological zones (SEZs), that is, spatial regions, distinguishable by their conjoined social and bio‐geo‐physical characteristics. The state of Idaho, USA, has a strong need for SEZ designations because of potential conflicts between rapidly increasing and impactful human populations, and proximal natural ecosystems. Our Idaho SEZs address analytical shortcomings in previously published SEZs by: (1) considering potential biases of clustering methods, (2) cross‐validating SEZ classifications, (3) measuring the relative importance of bio‐geo‐physical and social system predictors, and (4) considering spatial autocorrelation. We obtained authoritative bio‐geo‐physical and social system datasets for Idaho, aggregated into 5‐km grids (25 km²), and decomposed these using principal components analyses (PCAs). PCA scores were classified using two clustering techniques commonly used in SEZ mapping: hierarchical clustering with Ward's linkage, and k‐means analysis. Classification evaluators indicated that eight‐ and five‐cluster solutions were optimal for the bio‐geo‐physical and social datasets for Ward's linkage, resulting in 31 SEZ composite types, and six‐ and five‐cluster solutions were optimal for k‐means analysis, resulting in 24 SEZ composite types. Ward's and k‐means solutions were similar for bio‐geo‐physical and social classifications with similar numbers of clusters. Further, both classifiers identified the same dominant SEZ composites. For rarer SEZs, however, classification methods strongly affected SEZ classifications, potentially altering land management perspectives. Our SEZs identify several critical regions of social–ecological overlap. These include suburban interface types and a high desert transition zone.
Based on multinomial generalized linear models, bio‐geo‐physical information explained more variation in SEZs than social system data, after controlling for spatial autocorrelation, under both Ward's and k‐means approaches. Agreement (cross‐validation) levels were high for multinomial models with bio‐geo‐physical and social predictors for both Ward's and k‐means SEZs. A consideration of historical drivers, including indigenous social systems, and current trajectories of land and resource management in Idaho, indicates a strong need for rigorous SEZ designations to guide development and conservation in the region. Our analytical framework can be broadly applied in SES research and in other regions, when categorical responses—including cluster designations—have a spatial component.
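The analytical pipeline above (PCA, then Ward's hierarchical clustering and k-means on the component scores) can be sketched with SciPy on stand-in data; the array shapes and cluster counts here are illustrative, not the study's actual grids:

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.cluster.vq import kmeans2

rng = np.random.default_rng(1)
grid = rng.normal(size=(200, 12))      # stand-in for 200 grid cells x 12 variables

# PCA via SVD on centered data; keep the first 4 principal-component scores
X = grid - grid.mean(axis=0)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
scores = U[:, :4] * s[:4]

# Hierarchical clustering with Ward's linkage, cut into 5 clusters
Z = linkage(scores, method="ward")
ward_labels = fcluster(Z, t=5, criterion="maxclust")

# k-means on the same scores, also with 5 clusters
_, km_labels = kmeans2(scores, 5, minit="++", seed=1)
```

Comparing `ward_labels` against `km_labels` cell by cell is the kind of cross-classifier agreement check the study uses to flag SEZs that are sensitive to the choice of clustering method.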

     
  4. null (Ed.)
    The Sparsest Cut is a fundamental optimization problem that has been extensively studied. For planar inputs the problem is in P and can be solved in Õ(n^3) time if all vertex weights are 1. Despite a significant amount of effort, the best algorithms date back to the early 90's and can only achieve O(log n)-approximation in Õ(n) time or 3.5-approximation in Õ(n^2) time [Rao, STOC92]. Our main result is an Ω(n^{2−ε}) lower bound for Sparsest Cut even in planar graphs with unit vertex weights, under the (min, +)-Convolution conjecture, showing that approximations are inevitable in the near-linear time regime. To complement the lower bound, we provide a 3.3-approximation in near-linear time, improving upon the 25-year-old result of Rao in both time and accuracy. We also show that our lower bound is not far from optimal by observing an exact algorithm with running time Õ(n^{5/2}), improving upon the Õ(n^3) algorithm of Park and Phillips [STOC93]. Our lower bound accomplishes a repeatedly raised challenge by being the first fine-grained lower bound for a natural planar graph problem in P. Building on our construction we prove near-quadratic lower bounds under SETH for variants of the closest pair problem in planar graphs, and use them to show that the popular Average-Linkage procedure for Hierarchical Clustering cannot be simulated in truly subquadratic time. At the core of our constructions is a diamond-like gadget that also settles the complexity of Diameter in distributed planar networks. We prove an Ω(n/ log n) lower bound on the number of communication rounds required to compute the weighted diameter of a network in the CONGEST model, even when the underlying graph is planar and all nodes are D = 4 hops away from each other. This is the first poly(n) lower bound in the planar-distributed setting, and it complements the recent poly(D, log n) upper bounds of Li and Parter [STOC 2019] for (exact) unweighted diameter and for (1 + ε)-approximate weighted diameter.
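The Average-Linkage procedure whose subquadratic simulation the abstract rules out is, in its naive form, built entirely from pairwise distance scans. A minimal sequential sketch (ours, for illustration) makes the quadratic-and-worse work visible:

```python
import numpy as np

def average_linkage(points, k):
    """Naive agglomerative clustering with average linkage: every merge
    step rescans all pairwise cluster distances."""
    clusters = [[i] for i in range(len(points))]
    D = np.linalg.norm(points[:, None] - points[None], axis=2)  # all n^2 distances
    while len(clusters) > k:
        best, pair = np.inf, None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # average linkage: mean distance over all cross-cluster point pairs
                d = D[np.ix_(clusters[a], clusters[b])].mean()
                if d < best:
                    best, pair = d, (a, b)
        a, b = pair
        clusters[a] += clusters.pop(b)  # merge the closest pair of clusters
    return clusters
```

Even computing `D` already takes Θ(n²) time and space; the abstract's SETH-based construction shows that in planar-graph metrics this quadratic barrier is inherent, not an artifact of the naive implementation.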
  5. Bae, Sang Won ; Park, Heejin (Ed.)
    In this paper we introduce and formally study the problem of k-clustering with faulty centers. Specifically, we study the faulty versions of k-center, k-median, and k-means clustering, where centers have some probability of not existing, as opposed to prior work where clients had some probability of not existing. For all three problems we provide fixed parameter tractable algorithms, in the parameters k, d, and ε, that (1+ε)-approximate the minimum expected cost solutions for points in d dimensional Euclidean space. For Faulty k-center we additionally provide a 5-approximation for general metrics. Significantly, all of our algorithms have a small dependence on n. Specifically, our Faulty k-center algorithms have only linear dependence on n, while for our algorithms for Faulty k-median and Faulty k-means the dependence is still only n^(1 + o(1)). 
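The expected-cost objective for faulty centers can be written out for a single client: sort the centers by distance and condition on which is the nearest survivor. This is a sketch of the objective only (with a hypothetical `penalty` for the all-centers-fail event, and a single shared survival probability p), not the paper's FPT approximation algorithms:

```python
import numpy as np

def expected_cost(client, centers, p, penalty=0.0):
    """Expected distance from one client to its nearest surviving center,
    when each center independently exists with probability p. `penalty` is
    a hypothetical charge for the event that every center fails."""
    dists = np.sort(np.linalg.norm(centers - client, axis=1))
    cost, all_failed = 0.0, 1.0
    for dist in dists:
        # the client pays `dist` iff this center exists and all nearer ones failed
        cost += dist * p * all_failed
        all_failed *= 1.0 - p
    return cost + penalty * all_failed
```

Summing this quantity over all clients gives the expected cost that the paper's Faulty k-center, k-median, and k-means algorithms (1+ε)-approximately minimize over the choice of centers.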