Given a data set of size n in d'dimensional Euclidean space, the kmeans problem asks for a set of k points (called centers) such that the sum of the l_2^2distances between the data points and the set of centers is minimized. Previous work on this problem in the local differential privacy setting shows how to achieve multiplicative approximation factors arbitrarily close to optimal, but suffers high additive error. The additive error has also been seen to be an issue in implementations of differentially private kmeans clustering algorithms in both the central and local settings. In this work, we introduce a new locally private kmeans clustering algorithm that achieves nearoptimal additive error whilst retaining constant multiplicative approximation factors and round complexity. Concretely, given any c>sqrt(2), our algorithm achieves O(k^(1 + O(1/(2c^21))) * sqrt(d' n) * log d' * poly log n) additive error with an O(c^2) multiplicative approximation factor.
more »
« less
Improved Learningaugmented Algorithms for kmeans and kmedians Clustering
We consider the problem of clustering in the learningaugmented setting. We are given a data set in $d$dimensional Euclidean space, and a label for each data point given by a predictor indicating what subsets of points should be clustered together. This setting captures situations where we have access to some auxiliary information about the data set relevant for our clustering objective, for instance the labels output by a neural network. Following prior work, we assume that there are at most an $\alpha \in (0,c)$ for some $c<1$ fraction of false positives and false negatives in each predicted cluster, in the absence of which the labels would attain the optimal clustering cost $\mathrm{OPT}$. For a dataset of size $m$, we propose a deterministic $k$means algorithm that produces centers with an improved bound on the clustering cost compared to the previous randomized stateoftheart algorithm while preserving the $O( d m \log m)$ runtime. Furthermore, our algorithm works even when the predictions are not very accurate, i.e., our cost bound holds for $\alpha$ up to $1/2$, an improvement from $\alpha$ being at most $1/7$ in previous work. For the $k$medians problem we again improve upon prior work by achieving a biquadratic improvement in the dependence of the approximation factor on the accuracy parameter $\alpha$ to get a cost of $(1+O(\alpha))\mathrm{OPT}$, while requiring essentially just $O(md \log^3 m/\alpha)$ runtime.
more »
« less
 Award ID(s):
 1750716
 NSFPAR ID:
 10493899
 Publisher / Repository:
 International Conference on Learning Representations
 Date Published:
 Journal Name:
 International Conference on Learning Representations
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


null (Ed.)We consider the problem of explainable kmedians and kmeans introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020). In this problem, our goal is to find a threshold decision tree that partitions data into k clusters and minimizes the kmedians or kmeans objective. The obtained clustering is easy to interpret because every decision node of a threshold tree splits data based on a single feature into two groups. We propose a new algorithm for this problem which is O(log k) competitive with kmedians with ℓ1 norm and O(k) competitive with kmeans. This is an improvement over the previous guarantees of O(k) and O(k^2) by Dasgupta et al (2020). We also provide a new algorithm which is O(log^{3}{2}k) competitive for kmedians with ℓ2 norm. Our first algorithm is nearoptimal: Dasgupta et al (2020) showed a lower bound of Ω(log k) for kmedians; in this work, we prove a lower bound of Ω(k) for kmeans. We also provide a lower bound of Ω(log k) for kmedians with ℓ2 norm.more » « less

Woodruff, David P. (Ed.)We give improved algorithms for maintaining edgeorientations of a fullydynamic graph, such that the maximum outdegree is bounded. On one hand, we show how to orient the edges such that maximum outdegree is proportional to the arboricity $\alpha$ of the graph, in, either, an amortised update time of $O(\log^2 n \log \alpha)$, or a worstcase update time of $O(\log^3 n \log \alpha)$. On the other hand, motivated by applications including dynamic maximal matching, we obtain a different tradeoff. Namely, the improved update time of either $O(\log n \log \alpha)$, amortised, or $O(\log ^2 n \log \alpha)$, worstcase, for the problem of maintaining an edgeorientation with at most $O(\alpha + \log n)$ outedges per vertex. Finally, all of our algorithms naturally limit the recourse to be polylogarithmic in $n$ and $\alpha$. Our algorithms adapt to the current arboricity of the graph, and yield improvements over previous work: Firstly, we obtain deterministic algorithms for maintaining a $(1+\varepsilon)$ approximation of the maximum subgraph density, $\rho$, of the dynamic graph. Our algorithms have update times of $O(\varepsilon^{6}\log^3 n \log \rho)$ worstcase, and $O(\varepsilon^{4}\log^2 n \log \rho)$ amortised, respectively. We may output a subgraph $H$ of the input graph where its density is a $(1+\varepsilon)$ approximation of the maximum subgraph density in time linear in the size of the subgraph. These algorithms have improved update time compared to the $O(\varepsilon^{6}\log ^4 n)$ algorithm by Sawlani and Wang from STOC 2020. Secondly, we obtain an $O(\varepsilon^{6}\log^3 n \log \alpha)$ worstcase update time algorithm for maintaining a $(1~+~\varepsilon)\textnormal{OPT} + 2$ approximation of the optimal outorientation of a graph with adaptive arboricity $\alpha$, improving the $O(\varepsilon^{6}\alpha^2 \log^3 n)$ algorithm by Christiansen and Rotenberg from ICALP 2022. This yields the first worstcase polylogarithmic dynamic algorithm for decomposing into $O(\alpha)$ forests. Thirdly, we obtain arboricityadaptive fullydynamic deterministic algorithms for a variety of problems including maximal matching, $\Delta+1$ colouring, and matrix vector multiplication. All update times are worstcase $O(\alpha+\log^2n \log \alpha)$, where $\alpha$ is the current arboricity of the graph. For the maximal matching problem, the stateoftheart deterministic algorithms by Kopelowitz, Krauthgamer, Porat, and Solomon from ICALP 2014 runs in time $O(\alpha^2 + \log^2 n)$, and by Neiman and Solomon from STOC 2013 runs in time $O(\sqrt{m})$. We give improved running times whenever the arboricity $\alpha \in \omega( \log n\sqrt{\log\log n})$.more » « less

We provide a new bicriteria O(log2k) competitive algorithm for explainable kmeans clustering. Explainable kmeans was recently introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020). It is described by an easy to interpret and understand (threshold) decision tree or diagram. The cost of the explainable kmeans clustering equals to the sum of costs of its clusters; and the cost of each cluster equals the sum of squared distances from the points in the cluster to the center of that cluster. The best non bicriteria algorithm for explainable clustering O(k) competitive, and this bound is tight. Our randomized bicriteria algorithm constructs a threshold decision tree that partitions the data set into (1+δ)k clusters (where δ∈(0,1) is a parameter of the algorithm). The cost of this clustering is at most O(1/δ⋅log2k) times the cost of the optimal unconstrained kmeans clustering. We show that this bound is almost optimal.more » « less

We introduce a new (ϵₚ, δₚ)differentially private algorithm for the kmeans clustering problem. Given a dataset in Euclidean space, the kmeans clustering problem requires one to find k points in that space such that the sum of squares of Euclidean distances between each data point and its closest respective point among the k returned is minimised. Although there exist privacypreserving methods with good theoretical guarantees to solve this problem, in practice it is seen that it is the additive error which dictates the practical performance of these methods. By reducing the problem to a sequence of instances of maximum coverage on a grid, we are able to derive a new method that achieves lower additive error than previous works. For input datasets with cardinality n and diameter Δ, our algorithm has an O(Δ² (k log² n log(1/δₚ)/ϵₚ + k √(d log(1/δₚ))/ϵₚ)) additive error whilst maintaining constant multiplicative error. We conclude with some experiments and find an improvement over previously implemented work for this problem.more » « less