
Title: Polylogarithmic Sketches for Clustering
Given n points in 𝓁_p^d, we consider the problem of partitioning points into k clusters with associated centers. The cost of a clustering is the sum of p-th powers of distances of points to their cluster centers. For p ∈ [1,2], we design sketches of size poly(log(nd),k,1/ε) such that the cost of the optimal clustering can be estimated to within factor 1+ε, despite the fact that the compressed representation does not contain enough information to recover the cluster centers or the partition into clusters. This leads to a streaming algorithm for estimating the clustering cost with space poly(log(nd),k,1/ε). We also obtain a distributed memory algorithm, where the n points are arbitrarily partitioned amongst m machines, each of which sends information to a central party who then computes an approximation of the clustering cost. Prior to this work, no such streaming or distributed-memory algorithm was known with sublinear dependence on d for p ∈ [1,2).
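For concreteness, the sketch below (a hypothetical helper, not the paper's compression scheme) evaluates the exact (𝓁_p)^p clustering cost that the abstract's sketches are designed to estimate; the result above says a 1+ε approximation of this quantity survives compression to poly(log(nd), k, 1/ε) size, without keeping the centers or the partition.

```python
import numpy as np

def lp_clustering_cost(points, centers, p=1.5):
    """Exact cost from the abstract: sum over all points of the p-th power
    of the l_p distance to the nearest of the k centers (p in [1, 2])."""
    diffs = np.abs(points[:, None, :] - centers[None, :, :])   # shape (n, k, d)
    dists = np.sum(diffs ** p, axis=2) ** (1.0 / p)            # l_p distances, shape (n, k)
    return float(np.sum(np.min(dists, axis=1) ** p))

# toy usage with synthetic data: n = 1000 points in d = 5 dimensions, k = 3 centers
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
C = rng.normal(size=(3, 5))
print(lp_clustering_cost(X, C, p=1.5))
```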
Award ID(s):
2002201
PAR ID:
10529463
Author(s) / Creator(s):
Editor(s):
Bojańczyk, Mikołaj; Merelli, Emanuela; Woodruff, David P
Publisher / Repository:
Schloss Dagstuhl – Leibniz-Zentrum für Informatik
Date Published:
Volume:
229
ISSN:
1868-8969
ISBN:
978-3-95977-235-8
Page Range / eLocation ID:
229-229
Subject(s) / Keyword(s):
sketching clustering Theory of computation → Sketching and sampling
Format(s):
Medium: application/pdf
Size(s):
20 pages; 847506 bytes
Right(s):
Creative Commons Attribution 4.0 International license; info:eu-repo/semantics/openAccess
Sponsoring Org:
National Science Foundation
More Like this
  1. Given a data set of size n in d'-dimensional Euclidean space, the k-means problem asks for a set of k points (called centers) such that the sum of the 𝓁_2^2-distances between the data points and the set of centers is minimized. Previous work on this problem in the local differential privacy setting shows how to achieve multiplicative approximation factors arbitrarily close to optimal, but suffers from high additive error. The additive error has also been seen to be an issue in implementations of differentially private k-means clustering algorithms in both the central and local settings. In this work, we introduce a new locally private k-means clustering algorithm that achieves near-optimal additive error whilst retaining constant multiplicative approximation factors and round complexity. Concretely, given any c > √2, our algorithm achieves O(k^(1 + O(1/(2c²-1))) · √(d'n) · log d' · poly log n) additive error with an O(c²) multiplicative approximation factor.
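As a rough illustration of the local model referenced in item 1 (this is a generic local randomizer, not the algorithm of the paper, and its additive error is far worse than the bound quoted above): each user perturbs their own point before it ever leaves their device, and the server clusters only the noisy reports.

```python
import numpy as np

def local_gaussian_report(x, clip_norm, sigma, rng):
    """Generic local randomizer: the user clips their point to l2 norm
    <= clip_norm and adds per-coordinate Gaussian noise before sending it.
    With sigma >= 2 * clip_norm * sqrt(2 * ln(1.25 / delta)) / eps (eps < 1),
    this is the standard Gaussian mechanism applied locally, hence
    (eps, delta)-local DP. Shown only as a baseline, not the paper's method."""
    norm = np.linalg.norm(x)
    x_clipped = x if norm <= clip_norm else x * (clip_norm / norm)
    return x_clipped + rng.normal(scale=sigma, size=x.shape)

# toy usage: the server only ever sees the noisy reports
rng = np.random.default_rng(1)
data = rng.normal(size=(500, 4))                       # one row per user
reports = np.array([local_gaussian_report(x, clip_norm=5.0, sigma=2.0, rng=rng)
                    for x in data])
# any off-the-shelf k-means can now be run on `reports`; the injected noise is
# what drives additive error of the kind discussed above
```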
  2. In this paper, we study streaming algorithms that minimize the number of changes made to their internal state (i.e., memory contents). While the design of streaming algorithms typically focuses on minimizing space and update time, these metrics fail to capture the asymmetric costs, inherent in modern hardware and database systems, of reading versus writing to memory. In fact, most streaming algorithms write to their memory on every update, which is undesirable when writing is significantly more expensive than reading. This raises the question of whether streaming algorithms with small space and a small number of memory writes are possible. We first demonstrate that, for the fundamental F_p moment estimation problem with p ≥ 1, any streaming algorithm that achieves a constant-factor approximation must make Ω(n^(1-1/p)) internal state changes, regardless of how much space it uses. Perhaps surprisingly, we show that this lower bound can be matched by an algorithm that also has near-optimal space complexity. Specifically, we give a (1+ε)-approximation algorithm for F_p moment estimation that uses a near-optimal Õ_ε(n^(1-1/p)) number of state changes while simultaneously achieving near-optimal space: for p ∈ [1,2), our algorithm uses poly(log n, 1/ε) bits of space, while for p > 2, it uses Õ_ε(n^(1-1/p)) space. We similarly design streaming algorithms that are simultaneously near-optimal in both space complexity and the number of state changes for the heavy-hitters problem, sparse support recovery, and entropy estimation. Our results demonstrate that an optimal number of state changes can be achieved without sacrificing space complexity.
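To make the F_p estimation task in item 2 concrete, here is the classic AMS estimator for the special case p = 2; note that it changes every counter on every stream element, i.e., it is exactly the kind of write-heavy algorithm the paper improves upon.

```python
import random

class AMSF2Sketch:
    """Classic AMS estimator for the second frequency moment F_2 = sum_i f_i^2
    (the p = 2 case of F_p). Shown only as an illustration of the task."""

    def __init__(self, num_counters=64, seed=0):
        self.rng = random.Random(seed)
        self.counters = [0.0] * num_counters
        self.signs = [dict() for _ in range(num_counters)]   # lazily drawn +/-1 hashes

    def _sign(self, j, item):
        cache = self.signs[j]
        if item not in cache:
            cache[item] = self.rng.choice((-1.0, 1.0))
        return cache[item]

    def update(self, item, weight=1.0):
        for j in range(len(self.counters)):
            self.counters[j] += self._sign(j, item) * weight

    def estimate(self):
        # each counter Z_j = sum_i sign_j(i) * f_i satisfies E[Z_j^2] = F_2
        return sum(z * z for z in self.counters) / len(self.counters)

# toy stream with frequencies f = (1, 2, 3), so the true F_2 is 1 + 4 + 9 = 14
sk = AMSF2Sketch()
for x in [1, 2, 2, 3, 3, 3]:
    sk.update(x)
print(sk.estimate())
```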
  3. Megow, Nicole; Smith, Adam (Ed.)
Structural balance theory studies stability in networks. Given an n-vertex complete graph G = (V,E) whose edges are labeled positive or negative, the graph is considered balanced if every triangle either consists of three positive edges (three mutual "friends"), or one positive edge and two negative edges (two "friends" with a common "enemy"). From a computational perspective, structural balance turns out to be a special case of correlation clustering with the number of clusters at most two. The two main algorithmic problems of interest are: (i) detecting whether a given graph is balanced, and (ii) finding a partition that approximates the frustration index, i.e., the minimum number of edge flips that turn the graph balanced. We study these problems in the streaming model, where edges are given one by one and the focus is on memory efficiency. We provide randomized single-pass algorithms for: (i) determining whether an input graph is balanced with O(log n) memory, and (ii) finding a partition that induces a (1+ε)-approximation to the frustration index with O(n ⋅ polylog(n)) memory. We further provide several new lower bounds, complementing different aspects of our algorithms such as the need for randomization or approximation. To obtain our main results, we develop a method using pseudorandom generators (PRGs) to sample edges between independently chosen vertices in graph streaming. Furthermore, our algorithm that approximates the frustration index improves the running time of the state-of-the-art correlation clustering with two clusters (the Giotis-Guruswami algorithm [SODA 2006]) from n^O(1/ε²) to O(n² log³ n/ε² + n log n ⋅ (1/ε)^O(1/ε⁴)) time for a (1+ε)-approximation. These results may be of independent interest.
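The balance test in item 3 can be phrased offline as a parity constraint problem: a positive edge forces its endpoints onto the same side of a two-way partition, a negative edge onto opposite sides, and the graph is balanced exactly when all constraints are satisfiable. The sketch below checks this with a parity-augmented union-find; it is an offline illustration, not the O(log n)-memory streaming algorithm of the paper.

```python
class ParityDSU:
    """Union-find that also tracks whether two vertices lie on the same
    or opposite side of a two-way partition."""
    def __init__(self, n):
        self.parent = list(range(n))
        self.parity = [0] * n          # parity[v]: side of v relative to its parent

    def find(self, v):
        if self.parent[v] == v:
            return v, 0
        root, p = self.find(self.parent[v])
        self.parent[v] = root          # path compression
        self.parity[v] ^= p            # now parity is relative to the root
        return root, self.parity[v]

    def union(self, u, v, rel):
        """rel = 0 if u, v must be on the same side, 1 if on opposite sides.
        Returns False on a contradiction."""
        ru, pu = self.find(u)
        rv, pv = self.find(v)
        if ru == rv:
            return (pu ^ pv) == rel
        self.parent[ru] = rv
        self.parity[ru] = pu ^ pv ^ rel
        return True

def is_balanced(n, signed_edges):
    """signed_edges: iterable of (u, v, sign) with sign in {+1, -1}."""
    dsu = ParityDSU(n)
    return all(dsu.union(u, v, 0 if s > 0 else 1) for u, v, s in signed_edges)

# balanced triangle: 0 and 1 are friends, both are enemies of 2
print(is_balanced(3, [(0, 1, +1), (0, 2, -1), (1, 2, -1)]))   # True
# unbalanced triangle: 0 is friends with 1 and 2, but 1 and 2 are enemies
print(is_balanced(3, [(0, 1, +1), (0, 2, +1), (1, 2, -1)]))   # False
```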
  4. Ahn, Hee-Kap; Sadakane, Kunihiko (Ed.)
In the standard planar k-center clustering problem, one is given a set P of n points in the plane, and the goal is to select k center points so as to minimize the maximum distance from points in P to their nearest center. Here we initiate the systematic study of the clustering with neighborhoods problem, which generalizes the k-center problem by allowing the covered objects to be a set C of general disjoint convex objects rather than just a point set P. For this problem we first show that there is a PTAS for approximating the number of centers. Specifically, if r_opt is the optimal radius for k centers, then in n^O(1/ε²) time we can produce a set of (1+ε)k centers with radius ≤ r_opt. If instead one considers the standard goal of approximating the optimal clustering radius while keeping k as a hard constraint, we show that the radius cannot be approximated within any factor in polynomial time unless P = NP, even when C is a set of line segments. When C is a set of unit disks, we show the problem is hard to approximate within a factor of (√13-√3)/(2-√3) ≈ 6.99. This hardness result complements our main result, where we show that when the objects are disks, of possibly differing radii, there is a (5+2√3) ≈ 8.46 approximation algorithm. Additionally, for unit disks we give an O(n log k)+(k/ε)^O(k) time (1+ε)-approximation to the optimal radius, that is, an FPTAS for constant k whose running time depends only linearly on n. Finally, we show that the one-dimensional version of the problem, even when intersections are allowed, can be solved exactly in O(n log n) time.
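For contrast with the neighborhoods variant in item 4, the following is the textbook farthest-point greedy (Gonzalez) for the standard point k-center problem, which achieves a 2-approximation to the optimal radius; function and variable names are illustrative, and this says nothing about covering disjoint convex objects.

```python
import numpy as np

def gonzalez_k_center(points, k, seed=0):
    """Farthest-point greedy for the standard point k-center problem:
    returns k centers whose covering radius is at most twice the optimum."""
    rng = np.random.default_rng(seed)
    centers = [points[rng.integers(len(points))]]
    dist = np.linalg.norm(points - centers[0], axis=1)   # distance to nearest chosen center
    for _ in range(k - 1):
        far = int(np.argmax(dist))                       # currently worst-covered point
        centers.append(points[far])
        dist = np.minimum(dist, np.linalg.norm(points - points[far], axis=1))
    return np.array(centers), float(dist.max())          # centers and achieved radius

# toy usage on random points in the unit square
pts = np.random.default_rng(2).uniform(size=(200, 2))
centers, radius = gonzalez_k_center(pts, k=4)
print(radius)
```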
  5. Meka, Raghu (Ed.)
We introduce Online Balanced Allocation of Dynamic Components (OBADC), a problem motivated by the practical challenge of dynamic resource allocation for large-scale distributed applications. In OBADC, we need to allocate a dynamic set of at most k𝓁 vertices (representing processes) in 𝓁 > 0 clusters. We consider an over-provisioned setup in which each cluster can hold at most k(1+ε) vertices, for an arbitrary constant ε > 0. The communication requirements among the vertices are modeled by the notion of a dynamically changing component, which is a subset of vertices that need to be co-located in the same cluster. At each time t, a request r_t of one of the following types arrives: 1) insertion of a vertex v, forming the singleton component {v}, at unit cost; 2) a merge of (u,v), requiring that the components containing u and v be merged and co-located thereafter; 3) deletion of an existing vertex v at zero cost. Before serving any request, an algorithm can migrate vertices from one cluster to another, at a unit migration cost per vertex. We seek an online algorithm that minimizes the total migration cost incurred for an arbitrary request sequence σ = (r_t)_{t > 0}, while simultaneously minimizing the number of clusters utilized. We analyze competitiveness with respect to an optimal clairvoyant offline algorithm with identical (over-provisioned) capacity constraints. We give an O(log k)-competitive algorithm for OBADC, and a matching lower bound. The number of clusters utilized by our algorithm is always within a (2+ε) factor of the minimum. Furthermore, in a resource-augmented setting where the optimal offline algorithm is constrained to capacity k per cluster, our algorithm retains O(log k) competitiveness and utilizes a number of clusters within a (1+ε) factor of the minimum. We also consider OBADC in the context of machine-learned predictions, where for each newly inserted vertex v at time t: i) with probability η > 0, the set of vertices (that exist at time t) in the component of v is revealed, and ii) with probability 1-η, no information is revealed. For OBADC with predictions, we give an O(1)-consistent and O(min(log(1/η), log k))-robust algorithm.
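A toy simulator for the OBADC request types in item 5 is sketched below. The class name, greedy placement rule, and bookkeeping are this sketch's own (not the paper's algorithm); it only tracks migration cost and cluster loads and makes no competitiveness guarantee.

```python
class NaiveOBADC:
    """Toy simulator for OBADC requests (insert / merge / delete). Placement
    is a naive greedy rule; assumes every single component fits in one cluster."""

    def __init__(self, capacity):
        self.capacity = capacity        # at most k(1+eps) vertices per cluster
        self.comp_of = {}               # vertex -> component id
        self.members = {}               # component id -> set of vertices
        self.cluster_of = {}            # component id -> cluster id
        self.load = {}                  # cluster id -> number of vertices held
        self.migrations = 0
        self._next_comp = 0
        self._next_cluster = 0

    def _find_cluster(self, size, exclude=()):
        """Return any cluster with room for `size` vertices, opening one if needed."""
        for c, l in self.load.items():
            if c not in exclude and l + size <= self.capacity:
                return c
        c = self._next_cluster
        self._next_cluster += 1
        self.load[c] = 0
        return c

    def insert(self, v):
        comp = self._next_comp
        self._next_comp += 1
        self.comp_of[v] = comp
        self.members[comp] = {v}
        cluster = self._find_cluster(1)
        self.cluster_of[comp] = cluster
        self.load[cluster] += 1
        self.migrations += 1            # unit cost for the initial placement

    def delete(self, v):                # deletions are free, per the model above
        comp = self.comp_of.pop(v)
        self.members[comp].discard(v)
        self.load[self.cluster_of[comp]] -= 1
        if not self.members[comp]:
            del self.members[comp], self.cluster_of[comp]

    def merge(self, u, v):
        a, b = self.comp_of[u], self.comp_of[v]
        if a == b:
            return
        if len(self.members[a]) < len(self.members[b]):
            a, b = b, a                 # make `a` the larger component
        ca, cb = self.cluster_of[a], self.cluster_of[b]
        sa, sb = len(self.members[a]), len(self.members[b])
        if ca != cb:                    # components must be co-located after the merge
            if self.load[ca] + sb <= self.capacity:
                target, moved = ca, sb                       # pull b into a's cluster
            elif self.load[cb] + sa <= self.capacity:
                target, moved = cb, sa                       # push a into b's cluster
            else:
                target = self._find_cluster(sa + sb, exclude=(ca, cb))
                moved = sa + sb                              # relocate both components
            self.load[ca] -= sa if target != ca else 0
            self.load[cb] -= sb if target != cb else 0
            self.load[target] += (sa if target != ca else 0) + (sb if target != cb else 0)
            self.cluster_of[a] = target
            self.migrations += moved
        for w in self.members[b]:       # fold component b into component a
            self.comp_of[w] = a
        self.members[a] |= self.members[b]
        del self.members[b], self.cluster_of[b]

# toy usage with capacity 3 per cluster
sim = NaiveOBADC(capacity=3)
for v in range(6):
    sim.insert(v)                       # fills clusters 0 and 1
sim.merge(0, 1)                         # already co-located, no migration
sim.merge(0, 3)                         # {0,1,3} fits next to nothing: all 3 vertices move
print(sim.migrations, sim.load)         # 9 {0: 1, 1: 2, 2: 3}
```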