Given a data set of size n in d'-dimensional Euclidean space, the k-means problem asks for a set of k points (called centers) such that the sum of the l_2^2 distances between the data points and the set of centers is minimized. Previous work on this problem in the local differential privacy setting shows how to achieve multiplicative approximation factors arbitrarily close to optimal, but suffers from high additive error. Additive error has also been an issue in implementations of differentially private k-means clustering algorithms in both the central and local settings. In this work, we introduce a new locally private k-means clustering algorithm that achieves near-optimal additive error whilst retaining constant multiplicative approximation factors and round complexity. Concretely, given any c > sqrt(2), our algorithm achieves O(k^(1 + O(1/(2c^2 - 1))) * sqrt(d' n) * log d' * poly log n) additive error with an O(c^2) multiplicative approximation factor.
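As a concrete illustration of the objective defined above, the following minimal sketch (our own helper, not code from the paper) evaluates the k-means cost of a candidate set of centers, i.e. the sum over points of the squared Euclidean distance to the nearest center:

```python
def kmeans_cost(points, centers):
    """Sum over points of the squared l_2 distance to the closest center."""
    total = 0.0
    for p in points:
        total += min(
            sum((pi - ci) ** 2 for pi, ci in zip(p, c))
            for c in centers
        )
    return total

# Two tight groups with one center placed near each: the cost is small (~0.01).
points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0)]
centers = [(0.05, 0.0), (5.05, 5.0)]
print(kmeans_cost(points, centers))
```

A k-means algorithm with multiplicative factor a and additive error b returns centers whose cost is at most a * OPT + b under this measure; the abstract's contribution is shrinking b in the local-privacy setting.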
Differentially Private Clustering via Maximum Coverage
This paper studies the problem of clustering in metric spaces while preserving the privacy of individual data. Specifically, we examine differentially private variants of the k-medians and Euclidean k-means problems. We present polynomial-time algorithms with constant multiplicative error and lower additive error than the previous state of the art for each problem. Additionally, our algorithms use a clustering algorithm without differential privacy as a black box. This allows practitioners to control the trade-off between runtime and approximation factor by choosing a suitable clustering algorithm to use.
 Award ID(s):
 1750716
 NSF-PAR ID:
 10316110
 Date Published:
 Journal Name:
 Proceedings of the AAAI Conference on Artificial Intelligence
 Volume:
 35
 Issue:
 13
 ISSN:
 2159-5399
 Page Range / eLocation ID:
 11555 to 11563
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


We introduce a new (ϵₚ, δₚ)-differentially private algorithm for the k-means clustering problem. Given a dataset in Euclidean space, the k-means clustering problem requires one to find k points in that space such that the sum of squares of Euclidean distances between each data point and its closest respective point among the k returned is minimised. Although there exist privacy-preserving methods with good theoretical guarantees to solve this problem, in practice it is seen that it is the additive error which dictates the practical performance of these methods. By reducing the problem to a sequence of instances of maximum coverage on a grid, we are able to derive a new method that achieves lower additive error than previous works. For input datasets with cardinality n and diameter Δ, our algorithm has an O(Δ² (k log² n log(1/δₚ)/ϵₚ + k √(d log(1/δₚ))/ϵₚ)) additive error whilst maintaining constant multiplicative error. We conclude with some experiments and find an improvement over previously implemented work for this problem.
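The additive error this abstract emphasises typically enters through the noise a private algorithm must inject. As a hedged illustration only: the sketch below shows the standard noisy-average step used in many differentially private k-means baselines (e.g. private Lloyd iterations), not the maximum-coverage reduction of this paper; the function and parameter names are our own.

```python
import math
import random

def laplace(scale):
    # Sample from Laplace(0, scale) via inverse CDF; the standard DP noise primitive.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noisy_centroid(cluster_points, dim, eps, diameter):
    # One private averaging step: perturb each per-coordinate sum and the count.
    # The magnitude of this noise is independent of how good the current
    # clustering is -- it is exactly where the additive error comes from.
    sums = [sum(p[j] for p in cluster_points) + laplace(diameter / eps)
            for j in range(dim)]
    count = max(1.0, len(cluster_points) + laplace(1.0 / eps))
    return [s / count for s in sums]
```

With a large privacy budget the noise vanishes and the output approaches the true centroid; with a small budget the perturbation persists no matter how well separated the clusters are.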

Hierarchical clustering is a fundamental tool in data mining, machine learning and statistics. Popular hierarchical clustering algorithms include top-down divisive approaches such as bisecting k-means, k-median, and k-center and bottom-up agglomerative approaches such as single-linkage, average-linkage, and centroid-linkage. Unfortunately, only a few scalable hierarchical clustering algorithms are known, mostly based on the single-linkage algorithm. So, as datasets increase in size every day, there is a pressing need to scale other popular methods. We introduce efficient distributed algorithms for bisecting k-means, k-median, and k-center as well as centroid-linkage. In particular, we first formalize a notion of closeness for a hierarchical clustering algorithm, and then we use this notion to design new scalable distributed methods with strong worst-case bounds on the running time and the quality of the solutions. Finally, we show experimentally that the introduced algorithms are efficient and close to their sequential variants in practice.
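A minimal sequential sketch of one of the top-down methods mentioned, bisecting k-means: repeatedly split the largest cluster in two with plain Lloyd's algorithm. This is our own simplified version for illustration, without the paper's distributed machinery or closeness guarantees.

```python
import random

def two_means(points, iters=20):
    """Plain Lloyd's algorithm with k=2 (random init, no guarantees)."""
    c = random.sample(points, 2)
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            d = [sum((pi - ci) ** 2 for pi, ci in zip(p, cc)) for cc in c]
            groups[d.index(min(d))].append(p)
        # Recompute each center as the mean of its group (keep old center if empty).
        c = [tuple(sum(col) / len(g) for col in zip(*g)) if g else c[i]
             for i, g in enumerate(groups)]
    return groups

def bisecting_kmeans(points, k):
    """Top-down divisive clustering: split the largest cluster until k remain."""
    clusters = [list(points)]
    while len(clusters) < k:
        clusters.sort(key=len, reverse=True)
        left, right = two_means(clusters.pop(0))
        clusters += [left, right]
    return clusters
```

Running the split recursively yields the binary hierarchy; the distributed versions in the paper parallelise the per-split work while staying provably close to this sequential behaviour.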

In this paper, we study k-means++ and k-means++ parallel, the two most popular algorithms for the classic k-means clustering problem. We provide novel analyses and show improved approximation and bi-criteria approximation guarantees for k-means++ and k-means++ parallel. Our results give a better theoretical justification for why these algorithms perform extremely well in practice. We also propose a new variant of the k-means++ parallel algorithm (Exponential Race k-means++) that has the same approximation guarantees as k-means++.
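The k-means++ seeding rule analysed in the abstract can be sketched in a few lines: pick the first center uniformly at random, then draw each subsequent center with probability proportional to its squared distance (D²) from the nearest center chosen so far. A minimal version, assuming points are given as coordinate tuples (the helper name is our own):

```python
import random

def kmeanspp_seed(points, k):
    """k-means++ D^2 seeding: far-away points are more likely to become centers."""
    centers = [random.choice(points)]
    while len(centers) < k:
        # Squared distance from each point to its nearest chosen center.
        d2 = [min(sum((pi - ci) ** 2 for pi, ci in zip(p, c)) for c in centers)
              for p in points]
        # Sample the next center proportionally to d2 (D^2 sampling).
        r, acc = random.random() * sum(d2), 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers
```

k-means++ parallel (k-means||) replaces this inherently sequential loop with rounds that oversample many candidates at once and then reduce them to k, which is what makes the distributed variant practical.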

We consider the problem of clustering data sets in the presence of arbitrary outliers. Traditional clustering algorithms such as k-means and spectral clustering are known to perform poorly for data sets contaminated with even a small number of outliers. In this paper, we develop a provably robust spectral clustering algorithm that applies a simple rounding scheme to denoise a Gaussian kernel matrix built from the data points and uses vanilla spectral clustering to recover the cluster labels of data points. We analyze the performance of our algorithm under the assumption that the "good" data points are generated from a mixture of sub-Gaussians (we term these "inliers"), whereas the outlier points can come from any arbitrary probability distribution. For this general class of models, we show that the misclassification error decays at an exponential rate in the signal-to-noise ratio, provided the number of outliers is a small fraction of the inlier points. Surprisingly, this derived error bound matches the best-known bound for semidefinite programs (SDPs) under the same setting without outliers. We conduct extensive experiments on a variety of simulated and real-world data sets to demonstrate that our algorithm is less sensitive to outliers compared with other state-of-the-art algorithms proposed in the literature. Funding: G. A. Hanasusanto was supported by the National Science Foundation Grants NSF ECCS-1752125 and NSF CCF-2153606. P. Sarkar gratefully acknowledges support from the National Science Foundation Grants NSF DMS-1713082, NSF HDR-1934932 and NSF 2019844. Supplemental Material: The online appendix is available at https://doi.org/10.1287/opre.2022.2317 .
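The denoise-then-cluster pipeline described above can be sketched as follows. This is a toy version under our own assumptions: we use a simple entrywise 0/1 rounding of the Gaussian kernel and a deterministic farthest-point initialization for the final k-means step; the paper's actual rounding scheme and analysis are more involved, and all parameter names here are ours.

```python
import numpy as np

def rounded_spectral_clustering(X, k, bandwidth=1.0, threshold=0.5, iters=20):
    """Build a Gaussian kernel, round entries to 0/1 to suppress outlier
    influence, then run vanilla spectral clustering on the denoised matrix."""
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    K = np.exp(-sq / (2.0 * bandwidth ** 2))
    A = (K >= threshold).astype(float)          # the rounding / denoising step
    _, vecs = np.linalg.eigh(A)                 # eigenvalues in ascending order
    V = vecs[:, -k:]                            # embed points via top-k eigenvectors
    centers = V[[0]]                            # farthest-point initialization
    for _ in range(1, k):
        d = ((V[:, None, :] - centers[None]) ** 2).sum(-1).min(1)
        centers = np.vstack([centers, V[[d.argmax()]]])
    for _ in range(iters):                      # plain Lloyd's iterations on rows of V
        labels = ((V[:, None, :] - centers[None]) ** 2).sum(-1).argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = V[labels == j].mean(0)
    return labels
```

The rounding step is the whole point: an outlier's kernel entries are mostly below the threshold, so it contributes near-zero rows to A rather than distorting the spectrum the way raw kernel weights can.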