

Title: Locally Private k-Means Clustering with Constant Multiplicative Approximation and Near-Optimal Additive Error
Given a data set of size n in d'-dimensional Euclidean space, the k-means problem asks for a set of k points (called centers) such that the sum of squared l_2-distances from each data point to its nearest center is minimized. Previous work on this problem in the local differential privacy setting shows how to achieve multiplicative approximation factors arbitrarily close to optimal, but suffers from high additive error. Additive error has also proved to be the bottleneck in implementations of differentially private k-means clustering algorithms, in both the central and local settings. In this work, we introduce a new locally private k-means clustering algorithm that achieves near-optimal additive error whilst retaining constant multiplicative approximation factors and round complexity. Concretely, given any c > √2, our algorithm achieves O(k^(1 + O(1/(2c²-1))) · √(d'n) · log d' · polylog n) additive error with an O(c²) multiplicative approximation factor.
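For concreteness, the objective being approximated can be written in a few lines. The sketch below (plain NumPy with made-up toy data; it illustrates the objective, not the private algorithm from the paper) computes the k-means cost of a candidate center set, i.e. the sum of squared l_2-distances from each point to its nearest center.

    import numpy as np

    def kmeans_cost(points, centers):
        """Sum of squared l2-distances from each point to its nearest center.

        points:  (n, d') array of data points.
        centers: (k, d') array of candidate centers.
        """
        diffs = points[:, None, :] - centers[None, :, :]   # (n, k, d')
        sq_dists = (diffs ** 2).sum(axis=2)                # (n, k)
        return float(sq_dists.min(axis=1).sum())

    # Toy usage: n = 100 points in d' = 2 dimensions, k = 3 arbitrary centers.
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 2))
    print(kmeans_cost(X, X[:3]))

An algorithm with multiplicative factor w and additive error t returns centers whose cost is at most w·OPT + t; the result above trades a constant w = O(c²) for additive error growing only as √(d'n) up to k and polylog factors.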
Award ID(s):
1909314
NSF-PAR ID:
10385072
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of the AAAI Conference on Artificial Intelligence
Volume:
36
Issue:
6
ISSN:
2374-3468
Page Range / eLocation ID:
6167-6174
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. We introduce a new (ϵₚ, δₚ)-differentially private algorithm for the k-means clustering problem. Given a dataset in Euclidean space, the k-means clustering problem asks for k points in that space minimising the sum of squared Euclidean distances from each data point to its nearest chosen point. Although there exist privacy-preserving methods with good theoretical guarantees for this problem, in practice it is the additive error that dictates their performance. By reducing the problem to a sequence of instances of maximum coverage on a grid, we derive a new method that achieves lower additive error than previous works. For input datasets with cardinality n and diameter Δ, our algorithm has O(Δ² (k log² n log(1/δₚ)/ϵₚ + k √(d log(1/δₚ))/ϵₚ)) additive error whilst maintaining constant multiplicative error. We conclude with some experiments and find an improvement over previously implemented work for this problem.
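The grid/maximum-coverage reduction is too involved for a short snippet, but the way additive error enters such algorithms can be illustrated with a standard (ϵₚ, δₚ)-DP building block: privately averaging the points of a cluster with the Gaussian mechanism. The sketch below is a generic illustration under assumed parameter names, not the algorithm of this paper.

    import numpy as np

    def gaussian_sigma(sensitivity, eps, delta):
        # Classic Gaussian-mechanism calibration (valid for eps <= 1):
        # sigma >= sensitivity * sqrt(2 ln(1.25/delta)) / eps.
        return sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / eps

    def dp_cluster_mean(cluster, eps, delta, diameter):
        """(eps, delta)-DP estimate of a cluster's mean.

        Assumes every point lies in a ball of the given diameter, so replacing
        one point changes the coordinate-wise sum by at most `diameter` in l2.
        The budget is split between a noisy sum and a noisy count
        (basic composition; tighter accounting exists).
        """
        n, d = cluster.shape
        noisy_sum = cluster.sum(axis=0) + np.random.normal(
            0.0, gaussian_sigma(diameter, eps / 2, delta / 2), size=d)
        noisy_count = n + np.random.normal(
            0.0, gaussian_sigma(1.0, eps / 2, delta / 2))
        return noisy_sum / max(1.0, noisy_count)

The Δ² factor in the stated error bound plays the same role as `diameter` here: the noise, and hence the additive error, must scale with the data's spread.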
  2. We consider the problem of clustering in the learning-augmented setting. We are given a data set in $d$-dimensional Euclidean space, and a label for each data point given by a predictor indicating which points should be clustered together. This setting captures situations where we have access to auxiliary information about the data set relevant to our clustering objective, for instance the labels output by a neural network. Following prior work, we assume that each predicted cluster contains at most an $\alpha \in (0,c)$ fraction, for some $c<1$, of false positives and false negatives, in the absence of which the labels would attain the optimal clustering cost $\mathrm{OPT}$. For a dataset of size $m$, we propose a deterministic $k$-means algorithm that produces centers with an improved bound on the clustering cost compared to the previous randomized state-of-the-art algorithm, while preserving the $O(dm \log m)$ runtime. Furthermore, our algorithm works even when the predictions are not very accurate: our cost bound holds for $\alpha$ up to $1/2$, an improvement over the $\alpha \le 1/7$ bound in previous work. For the $k$-medians problem we again improve upon prior work, achieving a biquadratic improvement in the dependence of the approximation factor on the accuracy parameter $\alpha$, to obtain a cost of $(1+O(\alpha))\mathrm{OPT}$ while requiring essentially just $O(md \log^3 m/\alpha)$ runtime.
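As a point of comparison, the simplest way to exploit the predictor is to average each predicted cluster; robustness to the $\alpha$ fraction of mislabeled points is what makes the actual algorithms nontrivial. The sketch below shows this baseline with a coordinate-wise trimmed mean (a standard robust estimator); it is an illustration under assumed names, not the deterministic algorithm of the paper.

    import numpy as np

    def trimmed_mean(points, alpha):
        """Coordinate-wise mean after discarding the alpha-fraction smallest
        and largest values in each coordinate."""
        n = points.shape[0]
        t = int(alpha * n)
        srt = np.sort(points, axis=0)   # sort each coordinate independently
        kept = srt[t:n - t] if n > 2 * t else srt
        return kept.mean(axis=0)

    def centers_from_predictions(points, labels, k, alpha):
        """One candidate center per predicted cluster (illustrative baseline;
        assumes every predicted label in range(k) is non-empty)."""
        return np.vstack([trimmed_mean(points[labels == j], alpha)
                          for j in range(k)])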
  3. Ahn, Hee-Kap ; Sadakane, Kunihiko (Ed.)
    In the standard planar k-center clustering problem, one is given a set P of n points in the plane, and the goal is to select k center points, so as to minimize the maximum distance over points in P to their nearest center. Here we initiate the systematic study of the clustering with neighborhoods problem, which generalizes the k-center problem to allow the covered objects to be a set of general disjoint convex objects C rather than just a point set P. For this problem we first show that there is a PTAS for approximating the number of centers. Specifically, if r_opt is the optimal radius for k centers, then in n^O(1/ε²) time we can produce a set of (1+ε)k centers with radius ≤ r_opt. If instead one considers the standard goal of approximating the optimal clustering radius, while keeping k as a hard constraint, we show that the radius cannot be approximated within any factor in polynomial time unless P = NP, even when C is a set of line segments. When C is a set of unit disks we show the problem is hard to approximate within a factor of (√13-√3)/(2-√3) ≈ 6.99. This hardness result complements our main result, where we show that when the objects are disks, of possibly differing radii, there is a (5+2√3) ≈ 8.46 approximation algorithm. Additionally, for unit disks we give an O(n log k)+(k/ε)^O(k) time (1+ε)-approximation to the optimal radius, that is, an FPTAS for constant k whose running time depends only linearly on n. Finally, we show that the one-dimensional version of the problem, even when intersections are allowed, can be solved exactly in O(n log n) time.
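For contrast with the neighborhoods generalization, the classical point-set special case (C = P) admits a simple greedy 2-approximation due to Gonzalez: repeatedly add the point farthest from the centers picked so far. A minimal sketch:

    import numpy as np

    def greedy_k_center(points, k):
        """Gonzalez's farthest-point greedy: a classical 2-approximation for
        k-center on a point set. Returns the k chosen centers."""
        chosen = [0]                                   # arbitrary first center
        dist = np.linalg.norm(points - points[0], axis=1)
        for _ in range(k - 1):
            nxt = int(np.argmax(dist))                 # farthest point so far
            chosen.append(nxt)
            dist = np.minimum(dist,
                              np.linalg.norm(points - points[nxt], axis=1))
        return points[chosen]

The hardness results above show how much changes once the covered objects are segments or disks rather than points: no factor is achievable for segments, and the achievable constant for unit disks is pinned between ≈ 6.99 and ≈ 8.46.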
  4. We study the problem of fair k-median, where each cluster is required to have a fair representation of individuals from different groups. In the fair representation k-median problem, we are given a set of points X in a metric space. Each point x ∈ X belongs to one of ℓ groups. Further, we are given fair representation parameters α_j and β_j for each group j ∈ [ℓ]. We say that a k-clustering C_1, ⋅⋅⋅, C_k fairly represents all groups if the number of points from group j in cluster C_i is between α_j |C_i| and β_j |C_i| for every j ∈ [ℓ] and i ∈ [k]. The goal is to find a set of k centers and an assignment ϕ such that the clustering defined by ϕ fairly represents all groups and minimizes the ℓ_1-objective ∑_{x ∈ X} d(x, ϕ(x)). We present an O(log k)-approximation algorithm that runs in time n^{O(ℓ)}. Note that the known algorithms for the problem either (i) violate the fairness constraints by an additive term or (ii) run in time that is exponential in both k and ℓ. We also consider an important special case of the problem where and for all j ∈ [ℓ]. For this special case, we present an O(log k)-approximation algorithm that runs in time.
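The fairness condition itself is straightforward to check for a given clustering; the sketch below verifies α_j|C_i| ≤ |{x ∈ C_i : group(x) = j}| ≤ β_j|C_i| for all i and j (function and argument names are illustrative, taken from the definitions above).

    from collections import Counter

    def fairly_represents(clusters, group, alpha, beta):
        """Check the fair-representation condition from the abstract.

        clusters: list of lists of point indices (C_1, ..., C_k).
        group:    group[x] in range(ell) gives the group of point x.
        alpha, beta: length-ell lists of lower/upper representation fractions.
        """
        for C_i in clusters:
            size = len(C_i)
            counts = Counter(group[x] for x in C_i)
            for j in range(len(alpha)):
                if not alpha[j] * size <= counts.get(j, 0) <= beta[j] * size:
                    return False
        return True

The hard part of the problem is, of course, finding low-cost centers and an assignment ϕ for which this check passes, not performing the check.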
  5. Estimating ranks, quantiles, and distributions over streaming data is a central task in data analysis and monitoring. Given a stream of n items from a data universe equipped with a total order, the task is to compute a sketch (data structure) of size polylogarithmic in n. Given the sketch and a query item y, one should be able to approximate its rank in the stream, i.e., the number of stream elements smaller than or equal to y. Most works to date focused on additive εn error approximation, culminating in the KLL sketch that achieved optimal asymptotic behavior. This paper investigates multiplicative (1 ± ε)-error approximations to the rank. Practical motivation for multiplicative error stems from demands to understand the tails of distributions, and hence for sketches to be more accurate near extreme values. The most space-efficient algorithms due to prior work store either O(log(ε²n)/ε²) or O(log³(εn)/ε) universe items. We present a randomized sketch storing O(log^1.5(εn)/ε) items that can (1 ± ε)-approximate the rank of each universe item with high constant probability; this space bound is within an O(√(log(εn))) factor of optimal. Our algorithm does not require prior knowledge of the stream length and is fully mergeable, rendering it suitable for parallel and distributed computing environments.
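To make the two error notions concrete: for a query y, rank(y) is the number of stream elements ≤ y; an additive sketch may be off by εn everywhere, while a multiplicative sketch must land in [(1-ε)·rank(y), (1+ε)·rank(y)], which is far stricter for small ranks, i.e., in the tails. The sketch below spells this out, with a naive exact rank for reference (it stores the whole stream; a real sketch keeps only polylogarithmically many items). Function names are illustrative.

    import bisect

    def exact_rank(sorted_stream, y):
        """Number of stream elements <= y (baseline; stores everything)."""
        return bisect.bisect_right(sorted_stream, y)

    def ok_additive(estimate, true_rank, eps, n):
        """KLL-style guarantee: |estimate - rank| <= eps * n."""
        return abs(estimate - true_rank) <= eps * n

    def ok_multiplicative(estimate, true_rank, eps):
        """Guarantee studied here: estimate within (1 +/- eps) * rank."""
        return (1 - eps) * true_rank <= estimate <= (1 + eps) * true_rank

    # For an item of rank 10 in a stream of n = 10**6 with eps = 0.01, the
    # additive notion tolerates error 10**4; the multiplicative one, only 0.1.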