We consider the problem of clustering in the learningaugmented setting. We are given a data set in $d$dimensional Euclidean space, and a label for each data point given by a predictor indicating what subsets of points should be clustered together. This setting captures situations where we have access to some auxiliary information about the data set relevant for our clustering objective, for instance the labels output by a neural network. Following prior work, we assume that there are at most an $\alpha \in (0,c)$ for some $c<1$ fraction of false positives and false negatives in each predicted cluster, in the absence of which the labels would attain the optimal clustering cost $\mathrm{OPT}$. For a dataset of size $m$, we propose a deterministic $k$means algorithm that produces centers with an improved bound on the clustering cost compared to the previous randomized stateoftheart algorithm while preserving the $O( d m \log m)$ runtime. Furthermore, our algorithm works even when the predictions are not very accurate, i.e., our cost bound holds for $\alpha$ up to $1/2$, an improvement from $\alpha$ being at most $1/7$ in previous work. For the $k$medians problem we again improve upon prior work by achieving a biquadratic improvement in the dependence of the approximation factor on the accuracy parameter $\alpha$ to get a cost of $(1+O(\alpha))\mathrm{OPT}$, while requiring essentially just $O(md \log^3 m/\alpha)$ runtime.
more »
« less
Performance of JohnsonLindenstrauss transform for kmeans and kmedians clustering
Consider an instance of Euclidean kmeans or kmedians clustering. We show that the cost of the optimal solution is preserved up to a factor of (1+ε) under a projection onto a random O(log(k /ε) / ε2)dimensional subspace. Further, the cost of every clustering is preserved within (1+ε). More generally, our result applies to any dimension reduction map satisfying a mild subGaussiantail condition. Our bound on the dimension is nearly optimal. Additionally, our result applies to Euclidean kclustering with the distances raised to the pth power for any constant p.
For kmeans, our result resolves an open problem posed by Cohen, Elder, Musco, Musco, and Persu (STOC 2015); for kmedians, it answers a question raised by Kannan.
more »
« less
 Award ID(s):
 1718820
 NSFPAR ID:
 10107373
 Date Published:
 Journal Name:
 Performance of JohnsonLindenstrauss transform for kmeans and kmedians clustering
 Page Range / eLocation ID:
 10271038
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


null (Ed.)We consider the problem of explainable kmedians and kmeans introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020). In this problem, our goal is to find a threshold decision tree that partitions data into k clusters and minimizes the kmedians or kmeans objective. The obtained clustering is easy to interpret because every decision node of a threshold tree splits data based on a single feature into two groups. We propose a new algorithm for this problem which is O(log k) competitive with kmedians with ℓ1 norm and O(k) competitive with kmeans. This is an improvement over the previous guarantees of O(k) and O(k^2) by Dasgupta et al (2020). We also provide a new algorithm which is O(log^{3}{2}k) competitive for kmedians with ℓ2 norm. Our first algorithm is nearoptimal: Dasgupta et al (2020) showed a lower bound of Ω(log k) for kmedians; in this work, we prove a lower bound of Ω(k) for kmeans. We also provide a lower bound of Ω(log k) for kmedians with ℓ2 norm.more » « less

We provide a new bicriteria O(log2k) competitive algorithm for explainable kmeans clustering. Explainable kmeans was recently introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020). It is described by an easy to interpret and understand (threshold) decision tree or diagram. The cost of the explainable kmeans clustering equals to the sum of costs of its clusters; and the cost of each cluster equals the sum of squared distances from the points in the cluster to the center of that cluster. The best non bicriteria algorithm for explainable clustering O(k) competitive, and this bound is tight. Our randomized bicriteria algorithm constructs a threshold decision tree that partitions the data set into (1+δ)k clusters (where δ∈(0,1) is a parameter of the algorithm). The cost of this clustering is at most O(1/δ⋅log2k) times the cost of the optimal unconstrained kmeans clustering. We show that this bound is almost optimal.more » « less

Given a data set of size n in d'dimensional Euclidean space, the kmeans problem asks for a set of k points (called centers) such that the sum of the l_2^2distances between the data points and the set of centers is minimized. Previous work on this problem in the local differential privacy setting shows how to achieve multiplicative approximation factors arbitrarily close to optimal, but suffers high additive error. The additive error has also been seen to be an issue in implementations of differentially private kmeans clustering algorithms in both the central and local settings. In this work, we introduce a new locally private kmeans clustering algorithm that achieves nearoptimal additive error whilst retaining constant multiplicative approximation factors and round complexity. Concretely, given any c>sqrt(2), our algorithm achieves O(k^(1 + O(1/(2c^21))) * sqrt(d' n) * log d' * poly log n) additive error with an O(c^2) multiplicative approximation factor.more » « less

Bae, Sang Won ; Park, Heejin (Ed.)In this paper we introduce and formally study the problem of kclustering with faulty centers. Specifically, we study the faulty versions of kcenter, kmedian, and kmeans clustering, where centers have some probability of not existing, as opposed to prior work where clients had some probability of not existing. For all three problems we provide fixed parameter tractable algorithms, in the parameters k, d, and ε, that (1+ε)approximate the minimum expected cost solutions for points in d dimensional Euclidean space. For Faulty kcenter we additionally provide a 5approximation for general metrics. Significantly, all of our algorithms have a small dependence on n. Specifically, our Faulty kcenter algorithms have only linear dependence on n, while for our algorithms for Faulty kmedian and Faulty kmeans the dependence is still only n^(1 + o(1)).more » « less