Given a data set of size n in d'-dimensional Euclidean space, the k-means problem asks for a set of k points (called centers) such that the sum of the 𝓁₂² distances between the data points and the set of centers is minimized. Previous work on this problem in the local differential privacy setting shows how to achieve multiplicative approximation factors arbitrarily close to optimal, but suffers from high additive error. The additive error has also been seen to be an issue in implementations of differentially private k-means clustering algorithms in both the central and local settings. In this work, we introduce a new locally private k-means clustering algorithm that achieves near-optimal additive error whilst retaining constant multiplicative approximation factors and round complexity. Concretely, given any c > √2, our algorithm achieves O(k^(1 + O(1/(2c²-1))) · √(d'n) · log d' · poly log n) additive error with an O(c²) multiplicative approximation factor.
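The 𝓁₂² (k-means) objective described in the abstract above can be sketched directly; the point set and centers below are illustrative assumptions, not data from the paper.

```python
def kmeans_cost(points, centers):
    """Sum of squared Euclidean distances from each point to its nearest center."""
    total = 0.0
    for p in points:
        total += min(sum((pi - ci) ** 2 for pi, ci in zip(p, c)) for c in centers)
    return total

# Toy instance: two tight points near one center, one point on the other.
points = [(0.0, 0.0), (1.0, 0.0), (10.0, 10.0)]
centers = [(0.5, 0.0), (10.0, 10.0)]
print(kmeans_cost(points, centers))  # → 0.5  (0.25 + 0.25 + 0.0)
```

A private algorithm must approximate this cost without seeing the raw points; the sketch above only fixes the objective being approximated.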
Polylogarithmic Sketches for Clustering
Given n points in 𝓁_p^d, we consider the problem of partitioning points into k clusters with associated centers. The cost of a clustering is the sum of pth powers of distances of points to their cluster centers. For p ∈ [1,2], we design sketches of size poly(log(nd),k,1/ε) such that the cost of the optimal clustering can be estimated to within factor 1+ε, despite the fact that the compressed representation does not contain enough information to recover the cluster centers or the partition into clusters. This leads to a streaming algorithm for estimating the clustering cost with space poly(log(nd),k,1/ε). We also obtain a distributed memory algorithm, where the n points are arbitrarily partitioned amongst m machines, each of which sends information to a central party who then computes an approximation of the clustering cost. Prior to this work, no such streaming or distributed-memory algorithm was known with sublinear dependence on d for p ∈ [1,2).
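The (k, p)-clustering cost that these sketches estimate can be written down exactly; a minimal sketch of the objective, assuming full access to the points (which is precisely what the paper's compressed representations avoid):

```python
def lp_clustering_cost(points, centers, p):
    """(k, p)-clustering cost: sum over points of the p-th power of the
    𝓁_p distance to the nearest center (p = 1 is k-median, p = 2 is k-means)."""
    # Under the 𝓁_p norm, d(x, c)^p = sum_i |x_i - c_i|^p, and the nearest
    # center under d also minimizes d^p, so the p-th root can be skipped.
    return sum(
        min(sum(abs(xi - ci) ** p for xi, ci in zip(x, c)) for c in centers)
        for x in points
    )
```

For example, with p = 1 (k-median) on the line, points at 0, 2, 10 served by centers at 0 and 10 incur cost 2.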
 Award ID(s):
 2002201
 NSF-PAR ID:
 10529463
 Editor(s):
 Bojańczyk, Mikołaj; Merelli, Emanuela; Woodruff, David P
 Publisher / Repository:
 Schloss Dagstuhl – Leibniz-Zentrum für Informatik
 Date Published:
 Volume:
 229
 ISSN:
 1868-8969
 ISBN:
 978-3-95977-235-8
 Page Range / eLocation ID:
 229229
 Subject(s) / Keyword(s):
 sketching; clustering; Theory of computation → Sketching and sampling
 Format(s):
 Medium: X Size: 20 pages; 847506 bytes Other: application/pdf
 Size(s):
 20 pages 847506 bytes
 Right(s):
 Creative Commons Attribution 4.0 International license; info:eu-repo/semantics/openAccess
 Sponsoring Org:
 National Science Foundation
More Like this


Megow, Nicole; Smith, Adam (Ed.) Structural balance theory studies stability in networks. Given an n-vertex complete graph G = (V,E) whose edges are labeled positive or negative, the graph is considered balanced if every triangle either consists of three positive edges (three mutual "friends"), or one positive edge and two negative edges (two "friends" with a common "enemy"). From a computational perspective, structural balance turns out to be a special case of correlation clustering with the number of clusters at most two. The two main algorithmic problems of interest are: (i) detecting whether a given graph is balanced, or (ii) finding a partition that approximates the frustration index, i.e., the minimum number of edge flips that turn the graph balanced. We study these problems in the streaming model where edges are given one by one and focus on memory efficiency. We provide randomized single-pass algorithms for: (i) determining whether an input graph is balanced with O(log n) memory, and (ii) finding a partition that induces a (1 + ε)-approximation to the frustration index with O(n ⋅ polylog(n)) memory. We further provide several new lower bounds, complementing different aspects of our algorithms, such as the need for randomization or approximation. To obtain our main results, we develop a method using pseudorandom generators (PRGs) to sample edges between independently chosen vertices in graph streaming. Furthermore, our algorithm that approximates the frustration index improves the running time of the state-of-the-art correlation clustering with two clusters (the Giotis-Guruswami algorithm [SODA 2006]) from n^O(1/ε²) to O(n² log³ n/ε² + n log n ⋅ (1/ε)^O(1/ε⁴)) time for a (1+ε)-approximation. These results may be of independent interest.
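Detecting balance amounts to checking whether the vertices can be split into two camps with positive edges inside camps and negative edges across. A minimal offline sketch using a parity union-find follows (this is the textbook reduction, not the paper's O(log n)-memory streaming algorithm):

```python
class ParityDSU:
    """Union-find with parities: tracks whether two vertices must lie in the
    same or opposite camp of a hypothetical balanced two-partition."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.parity = [0] * n  # parity of each node relative to its parent

    def find(self, v):
        if self.parent[v] == v:
            return v, 0
        root, p = self.find(self.parent[v])
        self.parent[v] = root   # path compression
        self.parity[v] ^= p     # parity is now relative to the root
        return root, self.parity[v]

    def union(self, u, v, rel):
        # rel = 0 for a positive edge (same camp), 1 for a negative edge.
        ru, pu = self.find(u)
        rv, pv = self.find(v)
        if ru == rv:
            return (pu ^ pv) == rel  # consistent with earlier constraints?
        self.parent[ru] = rv
        self.parity[ru] = pu ^ pv ^ rel
        return True

def is_balanced(n, signed_edges):
    """signed_edges: iterable of (u, v, sign) with sign in {+1, -1}."""
    dsu = ParityDSU(n)
    return all(dsu.union(u, v, 0 if s > 0 else 1) for u, v, s in signed_edges)
```

A triangle of three "friends" is balanced, while two friendships plus one enmity is not, matching the triangle characterization above.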

Ahn, Hee-Kap; Sadakane, Kunihiko (Ed.) In the standard planar k-center clustering problem, one is given a set P of n points in the plane, and the goal is to select k center points so as to minimize the maximum distance over points in P to their nearest center. Here we initiate the systematic study of the clustering with neighborhoods problem, which generalizes the k-center problem by allowing the covered objects to be a set C of general disjoint convex objects rather than just a point set P. For this problem we first show that there is a PTAS for approximating the number of centers. Specifically, if r_opt is the optimal radius for k centers, then in n^O(1/ε²) time we can produce a set of (1+ε)k centers with radius ≤ r_opt. If instead one considers the standard goal of approximating the optimal clustering radius while keeping k as a hard constraint, we show that the radius cannot be approximated within any factor in polynomial time unless P = NP, even when C is a set of line segments. When C is a set of unit disks, we show the problem is hard to approximate within a factor of (√13 - √3)/(2 - √3) ≈ 6.99. This hardness result complements our main result, where we show that when the objects are disks, of possibly differing radii, there is a (5 + 2√3) ≈ 8.46 approximation algorithm. Additionally, for unit disks we give an O(n log k) + (k/ε)^O(k) time (1+ε)-approximation to the optimal radius, that is, an FPTAS for constant k whose running time depends only linearly on n. Finally, we show that the one-dimensional version of the problem, even when intersections are allowed, can be solved exactly in O(n log n) time.
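For context on the base problem being generalized, the standard point-set k-center admits a simple greedy 2-approximation (Gonzalez's farthest-point heuristic); the sketch below illustrates that baseline only, not the paper's algorithms for convex neighborhoods:

```python
import math

def gonzalez_k_center(points, k):
    """Farthest-point greedy: classical 2-approximation for point-set k-center.
    Returns the chosen centers and the resulting covering radius."""
    centers = [points[0]]
    dist = [math.dist(p, centers[0]) for p in points]  # dist to nearest center
    for _ in range(1, k):
        i = max(range(len(points)), key=lambda j: dist[j])  # farthest point
        centers.append(points[i])
        for j, p in enumerate(points):
            dist[j] = min(dist[j], math.dist(p, points[i]))
    return centers, max(dist)
```

On the line instance {0, 1, 9, 10} with k = 2, it picks 0 and 10 and reports radius 1.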

Bae, Sang Won; Park, Heejin (Ed.) In this paper we introduce and formally study the problem of k-clustering with faulty centers. Specifically, we study the faulty versions of k-center, k-median, and k-means clustering, where centers have some probability of not existing, as opposed to prior work where clients had some probability of not existing. For all three problems we provide fixed-parameter tractable algorithms, in the parameters k, d, and ε, that (1+ε)-approximate the minimum expected cost solutions for points in d-dimensional Euclidean space. For Faulty k-center we additionally provide a 5-approximation for general metrics. Significantly, all of our algorithms have a small dependence on n. Specifically, our Faulty k-center algorithms have only linear dependence on n, while for our algorithms for Faulty k-median and Faulty k-means the dependence is still only n^(1 + o(1)).
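The expected-cost objective can be evaluated by brute force over which centers survive; a toy sketch for Faulty k-center, assuming a uniform failure probability q and the (hypothetical) convention that the all-failed outcome contributes cost 0. This is illustration only, not the paper's FPT algorithm:

```python
import math
from itertools import combinations

def expected_faulty_kcenter_cost(points, centers, q):
    """Expected k-center cost when each center independently fails with
    probability q; exponential in len(centers), for tiny examples only."""
    m = len(centers)
    expected = 0.0
    for r in range(1, m + 1):  # skip r = 0: all centers failed, cost taken as 0
        for alive in combinations(range(m), r):
            prob = (1 - q) ** r * q ** (m - r)
            cost = max(min(math.dist(p, centers[i]) for i in alive)
                       for p in points)
            expected += prob * cost
    return expected
```

With points and centers at 0 and 4 and q = 0.5, only the two single-survivor outcomes (probability 1/4 each, cost 4) contribute, giving expected cost 2.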

We study a clustering problem where the goal is to maximize the coverage of the input points by k chosen centers. Specifically, given a set of n points P ⊆ ℝ^d, the goal is to pick k centers C ⊆ ℝ^d that maximize the service ∑_{p∈P} φ(𝖽(p,C)) to the points P, where 𝖽(p,C) is the distance of p to its nearest center in C, and φ: ℝ+ → ℝ+ is a nonincreasing service function. This includes the problem of placing k base stations so as to maximize the total bandwidth to the clients: indeed, the closer a client is to its nearest base station, the more data it can send and receive, and the target is to place k base stations so that the total bandwidth is maximized. We provide an n^{ε^{-O(d)}} time algorithm for this problem that achieves a (1-ε)-approximation. Notably, the runtime does not depend on the parameter k, and the algorithm works for an arbitrary nonincreasing service function φ: ℝ+ → ℝ+.
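The service objective above is easy to state in code; a minimal sketch, where the decaying `bandwidth` function is an illustrative assumption standing in for an arbitrary nonincreasing φ:

```python
import math

def total_service(points, centers, phi):
    """Total service sum over p in P of phi(d(p, C)), where d(p, C) is the
    Euclidean distance from p to its nearest center and phi is nonincreasing."""
    return sum(phi(min(math.dist(p, c) for c in centers)) for p in points)

# Hypothetical bandwidth-style service function: decays with distance.
bandwidth = lambda d: 1.0 / (1.0 + d)
```

Unlike minimization objectives, faraway points simply contribute little service here, which is what makes the runtime independent of k plausible.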