skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Improved Learning-augmented Algorithms for k-means and k-medians Clustering
We consider the problem of clustering in the learning-augmented setting. We are given a data set in $$d$$-dimensional Euclidean space, and a label for each data point given by a predictor indicating what subsets of points should be clustered together. This setting captures situations where we have access to some auxiliary information about the data set relevant for our clustering objective, for instance the labels output by a neural network. Following prior work, we assume that there are at most an $$\alpha \in (0,c)$$ for some $c<1$ fraction of false positives and false negatives in each predicted cluster, in the absence of which the labels would attain the optimal clustering cost $$\mathrm{OPT}$$. For a dataset of size $$m$$, we propose a deterministic $$k$$-means algorithm that produces centers with an improved bound on the clustering cost compared to the previous randomized state-of-the-art algorithm while preserving the $$O( d m \log m)$$ runtime. Furthermore, our algorithm works even when the predictions are not very accurate, i.e., our cost bound holds for $$\alpha$$ up to $1/2$, an improvement from $$\alpha$$ being at most $1/7$ in previous work. For the $$k$$-medians problem we again improve upon prior work by achieving a biquadratic improvement in the dependence of the approximation factor on the accuracy parameter $$\alpha$$ to get a cost of $$(1+O(\alpha))\mathrm{OPT}$$, while requiring essentially just $$O(md \log^3 m/\alpha)$$ runtime.  more » « less
Award ID(s):
1750716
PAR ID:
10493899
Author(s) / Creator(s):
; ;
Publisher / Repository:
International Conference on Learning Representations
Date Published:
Journal Name:
International Conference on Learning Representations
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    We consider the problem of explainable k-medians and k-means introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020). In this problem, our goal is to find a threshold decision tree that partitions data into k clusters and minimizes the k-medians or k-means objective. The obtained clustering is easy to interpret because every decision node of a threshold tree splits data based on a single feature into two groups. We propose a new algorithm for this problem which is O(log k) competitive with k-medians with ℓ1 norm and O(k) competitive with k-means. This is an improvement over the previous guarantees of O(k) and O(k^2) by Dasgupta et al (2020). We also provide a new algorithm which is O(log^{3}{2}k) competitive for k-medians with ℓ2 norm. Our first algorithm is near-optimal: Dasgupta et al (2020) showed a lower bound of Ω(log k) for k-medians; in this work, we prove a lower bound of Ω(k) for k-means. We also provide a lower bound of Ω(log k) for k-medians with ℓ2 norm. 
    more » « less
  2. Given a data set of size n in d'-dimensional Euclidean space, the k-means problem asks for a set of k points (called centers) such that the sum of the l_2^2-distances between the data points and the set of centers is minimized. Previous work on this problem in the local differential privacy setting shows how to achieve multiplicative approximation factors arbitrarily close to optimal, but suffers high additive error. The additive error has also been seen to be an issue in implementations of differentially private k-means clustering algorithms in both the central and local settings. In this work, we introduce a new locally private k-means clustering algorithm that achieves near-optimal additive error whilst retaining constant multiplicative approximation factors and round complexity. Concretely, given any c>sqrt(2), our algorithm achieves O(k^(1 + O(1/(2c^2-1))) * sqrt(d' n) * log d' * poly log n) additive error with an O(c^2) multiplicative approximation factor. 
    more » « less
  3. Bojańczyk, Mikołaj; Merelli, Emanuela; Woodruff, David P (Ed.)
    Given n points in 𝓁_p^d, we consider the problem of partitioning points into k clusters with associated centers. The cost of a clustering is the sum of p-th powers of distances of points to their cluster centers. For p ∈ [1,2], we design sketches of size poly(log(nd),k,1/ε) such that the cost of the optimal clustering can be estimated to within factor 1+ε, despite the fact that the compressed representation does not contain enough information to recover the cluster centers or the partition into clusters. This leads to a streaming algorithm for estimating the clustering cost with space poly(log(nd),k,1/ε). We also obtain a distributed memory algorithm, where the n points are arbitrarily partitioned amongst m machines, each of which sends information to a central party who then computes an approximation of the clustering cost. Prior to this work, no such streaming or distributed-memory algorithm was known with sublinear dependence on d for p ∈ [1,2). 
    more » « less
  4. In this paper we provide anO(mloglogO(1)nlog (1/ϵ))-expected time algorithm for solving Laplacian systems onn-nodem-edge graphs, improving upon the previous best expected runtime of\(O(m \sqrt {\log n} \mathrm{log log}^{O(1)} n \log (1/\epsilon)) \)achieved by (Cohen, Kyng, Miller, Pachocki, Peng, Rao, Xu 2014). To obtain this result we provide efficient constructions of low spectral stretch graph approximations with improved stretch and sparsity bounds. As motivation for this work, we show that for every set of vectors in\(\mathbb {R}^d \)(not just those induced by graphs) and all integerk> 1 there exist an ultra-sparsifier withd− 1 +O(d/k) re-weighted vectors of relative condition number at mostk2. For smallk, this improves upon the previous best known multiplicative factor of\(k \cdot \tilde{O}(\log d) \), which is only known for the graph case. Additionally, in the graph case we employ our low-stretch subgraph construction to obtainn− 1 +O(n/k)-edge ultrasparsifiers of relative condition numberk1 +o(1)fork=ω(log δn) for anyδ> 0: this improves upon the previous work fork=o(exp (log 1/2 −δn)). 
    more » « less
  5. We introduce a new (ϵₚ, δₚ)-differentially private algorithm for the k-means clustering problem. Given a dataset in Euclidean space, the k-means clustering problem requires one to find k points in that space such that the sum of squares of Euclidean distances between each data point and its closest respective point among the k returned is minimised. Although there exist privacy-preserving methods with good theoretical guarantees to solve this problem, in practice it is seen that it is the additive error which dictates the practical performance of these methods. By reducing the problem to a sequence of instances of maximum coverage on a grid, we are able to derive a new method that achieves lower additive error than previous works. For input datasets with cardinality n and diameter Δ, our algorithm has an O(Δ² (k log² n log(1/δₚ)/ϵₚ + k √(d log(1/δₚ))/ϵₚ)) additive error whilst maintaining constant multiplicative error. We conclude with some experiments and find an improvement over previously implemented work for this problem. 
    more » « less