Title: Universal Algorithms for Clustering
This paper presents universal algorithms for clustering problems, including the widely studied k-median, k-means, and k-center objectives. The input is a metric space containing all potential client locations. The algorithm must select k cluster centers such that they are a good solution for any subset of clients that actually realize. Specifically, we aim for low regret, defined as the maximum over all subsets of the difference between the cost of the algorithm’s solution and that of an optimal solution. A universal algorithm’s solution sol for a clustering problem is said to be an (α, β)-approximation if for all subsets of clients C', it satisfies sol(C') ≤ α ⋅ opt(C') + β ⋅ mr, where opt(C') is the cost of the optimal solution for clients C' and mr is the minimum regret achievable by any solution. Our main results are universal algorithms for the standard clustering objectives of k-median, k-means, and k-center that achieve (O(1), O(1))-approximations. These results are obtained via a novel framework for universal algorithms using linear programming (LP) relaxations. These results generalize to other 𝓁_p-objectives and the setting where some subset of the clients are fixed. We also give hardness results showing that (α, β)-approximation is NP-hard if α or β is at most a certain constant, even for the widely studied special case of Euclidean metric spaces. This shows that in some sense, (O(1), O(1))-approximation is the strongest type of guarantee obtainable for universal clustering.
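To make the regret and (α, β)-approximation definitions concrete, the following is a minimal brute-force sketch for k-median on a tiny one-dimensional metric. It is only an illustration of the definitions, not the paper's LP-based algorithms; the point set, k, and the candidate solution are made up.

```python
# Brute-force illustration of regret and (alpha, beta)-approximation for
# universal k-median on a tiny metric. All names here are illustrative.
from itertools import combinations

points = [0.0, 1.0, 2.0, 6.0, 7.0]          # potential client/center locations on a line
k = 2
dist = lambda a, b: abs(a - b)

def cost(centers, clients):
    """k-median cost: each client pays the distance to its nearest center."""
    return sum(min(dist(c, f) for f in centers) for c in clients)

def opt(clients):
    """Optimal k-median cost for a fixed client set (brute force over center sets)."""
    return min(cost(S, clients) for S in combinations(points, k))

# All non-empty subsets of clients that could realize.
subsets = [C for r in range(1, len(points) + 1) for C in combinations(points, r)]

def regret(centers):
    """Worst-case excess of this solution's cost over the subset-wise optimum."""
    return max(cost(centers, C) - opt(C) for C in subsets)

mr = min(regret(S) for S in combinations(points, k))   # minimum regret

def is_alpha_beta_approx(centers, alpha, beta):
    """Check sol(C') <= alpha * opt(C') + beta * mr for every client subset C'."""
    return all(cost(centers, C) <= alpha * opt(C) + beta * mr for C in subsets)

candidate = (1.0, 6.0)
print(regret(candidate), mr, is_alpha_beta_approx(candidate, 1.0, 1.0))
```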
Award ID(s):
1816861
PAR ID:
10379274
Author(s) / Creator(s):
Editor(s):
Bansal, Nikhil and
Date Published:
Journal Name:
Leibniz international proceedings in informatics
Volume:
198
ISSN:
1868-8969
Page Range / eLocation ID:
1-70
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. This article presents universal algorithms for clustering problems, including the widely studied k-median, k-means, and k-center objectives. The input is a metric space containing all potential client locations. The algorithm must select k cluster centers such that they are a good solution for any subset of clients that actually realize. Specifically, we aim for low regret, defined as the maximum over all subsets of the difference between the cost of the algorithm’s solution and that of an optimal solution. A universal algorithm’s solution sol for a clustering problem is said to be an (α, β)-approximation if for all subsets of clients C′, it satisfies sol(C′) ≤ α ⋅ opt(C′) + β ⋅ mr, where opt(C′) is the cost of the optimal solution for clients C′ and mr is the minimum regret achievable by any solution. Our main results are universal algorithms for the standard clustering objectives of k-median, k-means, and k-center that achieve (O(1), O(1))-approximations. These results are obtained via a novel framework for universal algorithms using linear programming (LP) relaxations. These results generalize to other ℓ_p-objectives and the setting where some subset of the clients are fixed. We also give hardness results showing that (α, β)-approximation is NP-hard if α or β is at most a certain constant, even for the widely studied special case of Euclidean metric spaces. This shows that in some sense, (O(1), O(1))-approximation is the strongest type of guarantee obtainable for universal clustering.
  2. We consider the problem of clustering in the learning-augmented setting. We are given a data set in $d$-dimensional Euclidean space, and a label for each data point given by a predictor indicating which subsets of points should be clustered together. This setting captures situations where we have access to some auxiliary information about the data set relevant to our clustering objective, for instance the labels output by a neural network. Following prior work, we assume that each predicted cluster contains at most an $\alpha \in (0,c)$ fraction of false positives and false negatives, for some $c<1$, in the absence of which the labels would attain the optimal clustering cost $\mathrm{OPT}$. For a dataset of size $m$, we propose a deterministic $k$-means algorithm that produces centers with an improved bound on the clustering cost compared to the previous randomized state-of-the-art algorithm while preserving the $O(dm \log m)$ runtime. Furthermore, our algorithm works even when the predictions are not very accurate, i.e., our cost bound holds for $\alpha$ up to $1/2$, an improvement from $\alpha$ being at most $1/7$ in previous work. For the $k$-medians problem we again improve upon prior work by achieving a biquadratic improvement in the dependence of the approximation factor on the accuracy parameter $\alpha$, obtaining a cost of $(1+O(\alpha))\mathrm{OPT}$ while requiring essentially just $O(md \log^3 m / \alpha)$ runtime.
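The following is a minimal sketch of the learning-augmented input model described above: noisy predicted labels are used to form candidate k-means centers by averaging each predicted cluster. It only illustrates the setting, not the deterministic algorithm from that work; the noise rate, instance, and function names are hypothetical.

```python
# Sketch of the learning-augmented k-means setting: a predictor supplies
# per-point cluster labels, of which an alpha fraction may be wrong.
import numpy as np

def centers_from_predictions(X, labels, k):
    """Average the points sharing a predicted label to get k candidate centers."""
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])

def kmeans_cost(X, centers):
    """Sum of squared distances from each point to its nearest center."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.min(axis=1).sum()

rng = np.random.default_rng(0)
k, m, d = 3, 300, 2
true_labels = rng.integers(0, k, size=m)
X = rng.normal(size=(m, d)) + 5.0 * true_labels[:, None]   # well-separated clusters

alpha = 0.2                                  # assumed fraction of corrupted labels
noisy = true_labels.copy()
flip = rng.random(m) < alpha
noisy[flip] = rng.integers(0, k, size=flip.sum())          # corrupt some labels

print(kmeans_cost(X, centers_from_predictions(X, noisy, k)))
```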
  3. Clustering plays a crucial role in computer science, facilitating data analysis and problem-solving across numerous fields. By partitioning large datasets into meaningful groups, clustering reveals hidden structures and relationships within the data, aiding tasks such as unsupervised learning, classification, anomaly detection, and recommendation systems. In relational databases in particular, where data is distributed across multiple tables, efficient clustering is essential yet challenging due to the computational complexity of joining tables. This paper addresses this challenge by introducing efficient algorithms for k-median and k-means clustering on relational data without the need to pre-compute the join query results. For relational k-median clustering, we propose the first efficient relative approximation algorithm. For relational k-means clustering, our algorithm significantly improves both the approximation factor and the running time of the known relational k-means clustering algorithms, which suffer either from large constant approximation factors or from expensive running time. Given a join query q and a database instance D of O(N) tuples, for both k-median and k-means clustering on the results of q on D, we propose randomized (1+ε)γ-approximation algorithms that run in roughly O(k^2 N^fhw) + T_γ(k^2) time, where ε ∈ (0,1) is a constant parameter decided by the user, fhw is the fractional hypertree width of q, and γ and T_γ(x) represent, respectively, the approximation factor and the running time of a traditional clustering algorithm in the standard computational setting over x points.
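As a point of contrast, the sketch below shows the naive baseline that relational clustering algorithms avoid: materializing the join and then clustering the joined tuples. The join result can be much larger than the base tables, which is exactly the blow-up at issue. The toy relations and sizes are made up for illustration.

```python
# Naive baseline (what the paper avoids): materialize the join, then run an
# off-the-shelf clustering routine on the joined tuples.
from itertools import product

# Two base relations sharing a join key, each with 200 tuples.
R = [(key, x) for key in range(10) for x in range(20)]   # (key, attr_x)
S = [(key, y) for key in range(10) for y in range(20)]   # (key, attr_y)

# Natural join R ⋈ S on `key`: every matching pair contributes one output tuple.
joined = [(x, y) for (kr, x), (ks, y) in product(R, S) if kr == ks]

print(len(R), len(S), len(joined))   # 200 200 4000 -- quadratic blow-up here

# In the naive approach, a standard k-means/k-median routine would now run on
# `joined` (each tuple as a point in R^2), paying time proportional to the
# join size rather than to the base-table sizes.
```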
  4. In this paper we introduce and formally study the problem of $k$-clustering with faulty centers. Specifically, we study the faulty versions of $k$-center, $k$-median, and $k$-means clustering, where centers have some probability of not existing, as opposed to prior work where clients had some probability of not existing. For all three problems we provide fixed-parameter tractable algorithms, in the parameters $k$, $d$, and $\varepsilon$, that $(1+\varepsilon)$-approximate the minimum expected cost solutions for points in $d$-dimensional Euclidean space. For Faulty $k$-center we additionally provide a 5-approximation for general metrics. Significantly, all of our algorithms have only a linear dependence on $n$.
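The snippet below is a minimal illustration of the faulty-centers objective: each opened center survives independently with some probability, and a client pays the distance to the nearest surviving center; the expected cost is estimated by Monte Carlo. It only illustrates the objective, not the FPT algorithms from that paper, and the survival probability, penalty convention, and instance are assumptions.

```python
# Monte Carlo estimate of the expected k-median cost when centers may fail.
import random

random.seed(1)
clients = [0.0, 1.0, 2.0, 6.0, 7.0, 9.0]
centers = [1.0, 7.0]          # a candidate set of k = 2 centers
p_survive = 0.8               # assumed independent survival probability per center

def kmedian_cost(cs, alive):
    """Sum of client distances to the nearest surviving center.
    If no center survives, charge a large penalty (one simple convention)."""
    if not alive:
        return 1e9
    return sum(min(abs(c - f) for f in alive) for c in cs)

def expected_cost(cs, centers, trials=100_000):
    total = 0.0
    for _ in range(trials):
        alive = [f for f in centers if random.random() < p_survive]
        total += kmedian_cost(cs, alive)
    return total / trials

print(expected_cost(clients, centers))
```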
  5. In this paper, we address clustering problems in scenarios where accurate direct access to the full dataset is impractical or impossible. Instead, we leverage oracle-based methods, which are particularly valuable in real-world applications where the data may be noisy, restricted due to privacy concerns, or simply too large. We utilize two oracles, the quadruplet oracle and the distance oracle. The quadruplet oracle is a weaker oracle that only approximately compares the distances of two pairs of vertices. In practice, such oracles can be implemented using crowdsourcing, or by training classifiers or other predictive models. The distance oracle, on the other hand, returns the exact distance between two vertices, so it is a stronger and more expensive oracle to implement. We consider two noise models for the quadruplet oracle. In the adversarial noise model, if two pairs have similar distances, the response is chosen by an adversary. In the probabilistic noise model, the pair with the smaller distance is returned with a constant probability. We consider a set V of n vertices in a metric space that supports the quadruplet and the distance oracle. For each of the k-center, k-median, and k-means clustering problems on V, we design constant-approximation algorithms that perform roughly O(nk) calls to the quadruplet oracle and O(k^2) calls to the distance oracle in both noise models. When the dataset has low intrinsic dimension, we significantly improve the approximation factors of our algorithms by performing a few additional calls to the distance oracle. We also show that for k-median and k-means clustering there is no hope of returning any sublinear approximation using only the quadruplet oracle. Finally, we give constant-approximation algorithms for estimating the clustering cost induced by any set of k vertices, performing roughly O(nk) calls to the quadruplet oracle and O(k^2) calls to the distance oracle.
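To make the two oracle interfaces concrete, the sketch below implements a distance oracle and a quadruplet oracle under the probabilistic noise model, where the correct comparison is returned only with constant probability. It is just an interface illustration, not the clustering algorithms from that work; the point set and the correctness probability are made up.

```python
# Toy distance oracle and noisy quadruplet oracle (probabilistic noise model).
import random

random.seed(0)
V = [(random.random(), random.random()) for _ in range(50)]   # points in a metric space

def distance_oracle(u, v):
    """Strong oracle: returns the exact (Euclidean) distance between u and v."""
    return ((u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2) ** 0.5

def quadruplet_oracle(u1, v1, u2, v2, p_correct=0.9):
    """Weak oracle: answers whether d(u1, v1) <= d(u2, v2).
    Under the probabilistic noise model the answer is correct with
    probability p_correct and flipped otherwise."""
    truth = distance_oracle(u1, v1) <= distance_oracle(u2, v2)
    return truth if random.random() < p_correct else not truth

# Example query: is the first pair closer together than the second pair?
print(quadruplet_oracle(V[0], V[1], V[2], V[3]))
```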