Given a data set of size n in d'-dimensional Euclidean space, the k-means problem asks for a set of k points (called centers) such that the sum of the l_2^2-distances between the data points and the set of centers is minimized. Previous work on this problem in the local differential privacy setting shows how to achieve multiplicative approximation factors arbitrarily close to optimal, but suffers high additive error. The additive error has also been seen to be an issue in implementations of differentially private k-means clustering algorithms in both the central and local settings. In this work, we introduce a new locally private k-means clustering algorithm that achieves near-optimal additive error whilst retaining constant multiplicative approximation factors and round complexity. Concretely, given any c>sqrt(2), our algorithm achieves O(k^(1 + O(1/(2c^2-1))) * sqrt(d' n) * log d' * poly log n) additive error with an O(c^2) multiplicative approximation factor.
more »
« less
Faster Algorithms for Fair Max-Min Diversification in R d
The task of extracting a diverse subset from a dataset, often referred to as maximum diversification, plays a pivotal role in various real-world applications that have far-reaching consequences. In this work, we delve into the realm of fairness-aware data subset selection, specifically focusing on the problem of selecting a diverse set of size k from a large collection of n data points (FairDiv). The FairDiv problem is well-studied in the data management and theory community. In this work, we develop the first constant approximation algorithm for FairDiv that runs in near-linear time using only linear space. In contrast, all previously known constant approximation algorithms run in super-linear time (with respect to n or k) and use super-linear space. Our approach achieves this efficiency by employing a novel combination of the Multiplicative Weight Update method and advanced geometric data structures to implicitly and approximately solve a linear program. Furthermore, we improve the efficiency of our techniques by constructing a coreset. Using our coreset, we also propose the first efficient streaming algorithm for the FairDiv problem whose efficiency does not depend on the distribution of data points. Empirical evaluation on million-sized datasets demonstrates that our algorithm achieves the best diversity within a minute. All prior techniques are either highly inefficient or do not generate a good solution.
more »
« less
- Award ID(s):
- 2348919
- PAR ID:
- 10614544
- Publisher / Repository:
- Association for Computing Machinery (ACM)
- Date Published:
- Journal Name:
- Proceedings of the ACM on Management of Data
- Volume:
- 2
- Issue:
- 3
- ISSN:
- 2836-6573
- Page Range / eLocation ID:
- 1 to 26
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
null (Ed.)The field of algorithms has seen a push for fairness, or the removal of inherent bias, in recent history. In data summarization, where a much smaller subset of a data set is chosen to represent the whole of the data, fairness can be introduced by guaranteeing each "demographic group" a specific portion of the representative subset. Specifically, this paper examines this fair variant of the k-centers problem, where a subset of the data with cardinality k is chosen to minimize distance to the rest of the data. Previous papers working on this problem presented both a 3-approximation algorithm with a super-linear runtime and a linear-time algorithm whose approximation factor is exponential in the number of demographic groups. This paper combines the best of each algorithm by presenting a linear-time algorithm with a guaranteed 3-approximation factor and provides empirical evidence of both the algorithm’s runtime and effectiveness.more » « less
-
Given a set of n points in the plane, and a parameter k, we consider the problem of computing the minimum (perimeter or area) axis-aligned rectangle enclosing k points. We present the first near quadratic time algorithm for this problem, improving over the previous near-O(n^{5/2})-time algorithm by Kaplan et al. [Haim Kaplan et al., 2017]. We provide an almost matching conditional lower bound, under the assumption that (min,+)-convolution cannot be solved in truly subquadratic time. Furthermore, we present a new reduction (for either perimeter or area) that can make the time bound sensitive to k, giving near O(n k) time. We also present a near linear time (1+epsilon)-approximation algorithm to the minimum area of the optimal rectangle containing k points. In addition, we study related problems including the 3-sided, arbitrarily oriented, weighted, and subset sum versions of the problem.more » « less
-
Given a set of points $$P = (P^+ \sqcup P^-) \subset \mathbb{R}^d$$ for some constant $$d$$ and a supply function $$\mu:P\to \mathbb{R}$$ such that $$\mu(p) > 0~\forall p \in P^+$$, $$\mu(p) < 0~\forall p \in P^-$$, and $$\sum_{p\in P}{\mu(p)} = 0$$, the geometric transportation problem asks one to find a transportation map $$\tau: P^+\times P^-\to \mathbb{R}_{\ge 0}$$ such that $$\sum_{q\in P^-}{\tau(p, q)} = \mu(p)~\forall p \in P^+$$, $$\sum_{p\in P^+}{\tau(p, q)} = -\mu(q) \forall q \in P^-$$, and the weighted sum of Euclidean distances for the pairs $$\sum_{(p,q)\in P^+\times P^-}\tau(p, q)\cdot ||q-p||_2$$ is minimized. We present the first deterministic algorithm that computes, in near-linear time, a transportation map whose cost is within a $$(1 + \varepsilon)$$ factor of optimal. More precisely, our algorithm runs in $$O(n\varepsilon^{-(d+2)}\log^5{n}\log{\log{n}})$$ time for any constant $$\varepsilon > 0$$. While a randomized $$n\varepsilon^{-O(d)}\log^{O(d)}{n}$$ time algorithm for this problem was discovered in the last few years, all previously known deterministic $$(1 + \varepsilon)$$-approximation algorithms run in~$$\Omega(n^{3/2})$$ time. A similar situation existed for geometric bipartite matching, the special case of geometric transportation where all supplies are unit, until a deterministic $$n\varepsilon^{-O(d)}\log^{O(d)}{n}$$ time $$(1 + \varepsilon)$$-approximation algorithm was presented at STOC 2022. Surprisingly, our result is not only a generalization of the bipartite matching one to arbitrary instances of geometric transportation, but it also reduces the running time for all previously known $$(1 + \varepsilon)$$-approximation algorithms, randomized or deterministic, even for geometric bipartite matching. In particular, we give the first $$(1 + \varepsilon)$$-approximate deterministic algorithm for geometric bipartite matching and the first $$(1 + \varepsilon)$$-approximate deterministic or randomized algorithm for geometric transportation with no dependence on $$d$$ in the exponent of the running time's polylog. As an additional application of our main ideas, we also give the first randomized near-linear $$O(\varepsilon^{-2} m \log^{O(1)} n)$$ time $$(1 + \varepsilon)$$-approximation algorithm for the uncapacitated minimum cost flow (transshipment) problem in undirected graphs with arbitrary \emph{real} edge costs.more » « less
-
We present the first near-linear-time algorithm that computes a (1+ε)-approximation of the diameter of a weighted unit-disk graph of n vertices. Our algorithm requires O(n log^2 n) time for any constant ε>0, so we considerably improve upon the near-O(n^{3/2})-time algorithm of Gao and Zhang (2005). Using similar ideas we develop (1+ε)-approximate \emph{distance oracles} of O(1) query time with a likewise improvement in the preprocessing time, specifically from near O(n^{3/2}) to O(n log^3 n). We also obtain similar new results for a number of related problems in the weighted unit-disk graph metric such as the radius and the bichromatic closest pair. As a further application we employ our distance oracle, along with additional ideas, to solve the (1+ε)-approximate \emph{all-pairs bounded-leg shortest paths\/} (apBLSP) problem for a set of n planar points. Our data structure requires O(n^2 log n) space, O(loglog n) query time, and nearly O(n^{2.579}) preprocessing time for any constant ε>0, and is the first that breaks the near-cubic preprocessing time bound given by Roditty and Segal (2011).more » « less
An official website of the United States government

