We show that the RandomCoordinateCut algorithm gives the optimal competitive ratio for explainable k-medians in ℓ1. The problem of explainable k-medians was introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian in 2020. Several groups of authors independently proposed a simple polynomial-time randomized algorithm for the problem and showed that this algorithm is O(log k log log k)-competitive. We provide a tight analysis of the algorithm and prove that its competitive ratio is upper bounded by 2 ln k + 2. This bound matches the Ω(log k) lower bound by Dasgupta et al. (2020).
Near-optimal Algorithms for Explainable k-Medians and k-Means
We consider the problem of explainable k-medians and k-means introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020). In this problem, our goal is to find a threshold decision tree that partitions data into k clusters and minimizes the k-medians or k-means objective. The obtained clustering is easy to interpret because every decision node of a threshold tree splits data based on a single feature into two groups. We propose a new algorithm for this problem which is O(log k)-competitive with k-medians with ℓ1 norm and O(k)-competitive with k-means. This is an improvement over the previous guarantees of O(k) and O(k^2) by Dasgupta et al. (2020). We also provide a new algorithm which is O(log^{3/2} k)-competitive for k-medians with ℓ2 norm. Our first algorithm is near-optimal: Dasgupta et al. (2020) showed a lower bound of Ω(log k) for k-medians; in this work, we prove a lower bound of Ω(k) for k-means. We also provide a lower bound of Ω(log k) for k-medians with ℓ2 norm.
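To make the objects in this abstract concrete, the following is a minimal sketch of a threshold decision tree and of the k-medians cost (ℓ1 norm) of the clustering it induces. It is not the algorithm from the paper: the names `Leaf`, `Node`, `assign`, and `kmedians_l1_cost` are hypothetical, and the sketch assumes that each cluster is charged against its coordinate-wise median, which is the center minimizing the ℓ1 cost of a fixed cluster.

```python
from dataclasses import dataclass
from statistics import median
from typing import Dict, List, Sequence, Union


@dataclass
class Leaf:
    cluster: int  # index of the cluster this leaf represents


@dataclass
class Node:
    feature: int      # single coordinate inspected at this node
    threshold: float  # points with x[feature] <= threshold go left
    left: Union["Node", Leaf]
    right: Union["Node", Leaf]


def assign(tree: Union[Node, Leaf], x: Sequence[float]) -> int:
    """Follow threshold cuts from the root until a leaf is reached."""
    while isinstance(tree, Node):
        tree = tree.left if x[tree.feature] <= tree.threshold else tree.right
    return tree.cluster


def kmedians_l1_cost(tree: Union[Node, Leaf],
                     points: Sequence[Sequence[float]]) -> float:
    """Sum of l1 distances from each point to its cluster's center.

    Assumption: the center of each cluster is its coordinate-wise median.
    """
    clusters: Dict[int, List[Sequence[float]]] = {}
    for p in points:
        clusters.setdefault(assign(tree, p), []).append(p)

    total = 0.0
    for members in clusters.values():
        dim = len(members[0])
        center = [median(m[j] for m in members) for j in range(dim)]
        total += sum(abs(m[j] - center[j]) for m in members for j in range(dim))
    return total


# A tree with one cut on feature 0 at threshold 5 gives k = 2 clusters.
points = [(0, 0), (1, 0), (10, 0), (11, 1)]
tree = Node(feature=0, threshold=5.0, left=Leaf(0), right=Leaf(1))
print(kmedians_l1_cost(tree, points))
```

A competitive ratio in the sense of the abstract compares this cost with the cost of the best unconstrained clustering into k clusters, which the sketch does not compute.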
Award ID(s): 1955351
NSF-PAR ID: 10281998
Date Published:
Journal Name: International Conference on Machine Learning
Volume: 38
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this


We provide a new bicriteria O(log^2 k)-competitive algorithm for explainable k-means clustering. Explainable k-means was recently introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020). It is described by an easy-to-interpret (threshold) decision tree or diagram. The cost of an explainable k-means clustering equals the sum of the costs of its clusters, and the cost of each cluster equals the sum of squared distances from the points in the cluster to the center of that cluster. The best non-bicriteria algorithm for explainable clustering is O(k)-competitive, and this bound is tight. Our randomized bicriteria algorithm constructs a threshold decision tree that partitions the data set into (1+δ)k clusters (where δ ∈ (0,1) is a parameter of the algorithm). The cost of this clustering is at most O(1/δ · log^2 k) times the cost of the optimal unconstrained k-means clustering. We show that this bound is almost optimal.
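The cost definition and the bicriteria guarantee stated in this abstract can be written out as follows. The notation is assumed for illustration and does not come from the abstract: P_1, ..., P_m are the clusters produced by a threshold tree T, c_i is the center of P_i, and OPT_k is the cost of the optimal unconstrained k-means clustering.

```latex
% Cost of a clustering: sum over clusters of squared distances to the center.
\[
  \operatorname{cost}(P_1,\dots,P_m)
    \;=\; \sum_{i=1}^{m} \sum_{x \in P_i} \lVert x - c_i \rVert_2^{2}
\]
% Bicriteria guarantee as stated: up to (1+\delta)k clusters are allowed,
% and the cost is compared with the optimum that uses only k clusters.
\[
  m \;\le\; (1+\delta)\,k ,
  \qquad
  \operatorname{cost}(T) \;\le\; O\!\left(\tfrac{1}{\delta}\,\log^{2} k\right)\cdot \mathrm{OPT}_k ,
  \qquad \delta \in (0,1).
\]
```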

Chakrabarti, Amit; Swamy, Chaitanya (Ed.) A Boolean maximum constraint satisfaction problem, MaxCSP(f), is specified by a predicate f: {-1,1}^k → {0,1}. An n-variable instance of MaxCSP(f) consists of a list of constraints, each of which applies f to k distinct literals drawn from the n variables. For k = 2, Chou, Golovnev, and Velusamy [Chou et al., 2020] obtained explicit ratios characterizing the √n-space streaming approximability of every predicate. For k ≥ 3, Chou, Golovnev, Sudan, and Velusamy [Chou et al., 2022] proved a general dichotomy theorem for √n-space sketching algorithms: For every f, there exists α(f) ∈ (0,1] such that for every ε > 0, MaxCSP(f) is (α(f) - ε)-approximable by an O(log n)-space linear sketching algorithm, but (α(f) + ε)-approximation sketching algorithms require Ω(√n) space. In this work, we give closed-form expressions for the sketching approximation ratios of multiple families of symmetric Boolean functions. Letting α'_k = 2^{-(k-1)} (1 - k^{-2})^{(k-1)/2}, we show that for odd k ≥ 3, α(kAND) = α'_k, and for even k ≥ 2, α(kAND) = 2α'_{k+1}. Thus, for every k, kAND can be (2 - o(1))2^{-k}-approximated by O(log n)-space sketching algorithms; we contrast this with a lower bound of Chou, Golovnev, Sudan, Velingker, and Velusamy [Chou et al., 2022] implying that streaming (2 + ε)2^{-k}-approximations require Ω(n) space! We also resolve the ratio for the "at-least-(k-1)-1's" function for all even k; the "exactly-(k+1)/2-1's" function for odd k ∈ {3,…,51}; and fifteen other functions. We stress here that for general f, the dichotomy theorem in [Chou et al., 2022] only implies that α(f) can be computed to arbitrary precision in PSPACE, and thus closed-form expressions need not have existed a priori. Our analyses involve identifying and exploiting structural "saddle-point" properties of this dichotomy. Separately, for all threshold functions, we give optimal "bias-based" approximation algorithms generalizing [Chou et al., 2020] while simplifying [Chou et al., 2022]. Finally, we investigate the √n-space streaming lower bounds in [Chou et al., 2022], and show that they are incomplete for 3AND, i.e., they fail to rule out (α(3AND) - ε)-approximations in o(√n) space.
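As a small worked illustration of the closed form for kAND, the sketch below evaluates it exactly with rational arithmetic. It assumes the formula reads α'_k = 2^{-(k-1)} (1 - k^{-2})^{(k-1)/2}, with α(kAND) = α'_k for odd k ≥ 3 and α(kAND) = 2α'_{k+1} for even k ≥ 2. The function names `alpha_prime` and `alpha_k_and` are hypothetical. Because α'_k is only ever evaluated at odd k here, the exponent (k-1)/2 is an integer and every value is an exact fraction.

```python
from fractions import Fraction


def alpha_prime(k: int) -> Fraction:
    """alpha'_k = 2^{-(k-1)} * (1 - k^{-2})^{(k-1)/2}, evaluated for odd k >= 3."""
    if k < 3 or k % 2 == 0:
        raise ValueError("this sketch evaluates alpha'_k only for odd k >= 3")
    half_exponent = (k - 1) // 2  # an integer because k is odd
    return Fraction(1, 2 ** (k - 1)) * (1 - Fraction(1, k * k)) ** half_exponent


def alpha_k_and(k: int) -> Fraction:
    """Sketching approximation ratio of kAND under the assumed closed form."""
    if k < 2:
        raise ValueError("k must be at least 2")
    if k % 2 == 1:
        return alpha_prime(k)       # odd k: alpha(kAND) = alpha'_k
    return 2 * alpha_prime(k + 1)   # even k: alpha(kAND) = 2 * alpha'_{k+1}


for k in range(2, 7):
    ratio = alpha_k_and(k)
    # ratio * 2^k is the constant in front of 2^{-k}; the abstract states it is 2 - o(1).
    print(k, ratio, float(ratio * 2 ** k))
```

Under this reading the values for k = 2 and k = 3 come out to 4/9 and 2/9, and the printed constant in front of 2^{-k} stays below 2.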

Consider an instance of Euclidean k-means or k-medians clustering. We show that the cost of the optimal solution is preserved up to a factor of (1+ε) under a projection onto a random O(log(k/ε)/ε^2)-dimensional subspace. Further, the cost of every clustering is preserved within (1+ε). More generally, our result applies to any dimension reduction map satisfying a mild sub-Gaussian-tail condition. Our bound on the dimension is nearly optimal. Additionally, our result applies to Euclidean k-clustering with the distances raised to the p-th power for any constant p. For k-means, our result resolves an open problem posed by Cohen, Elder, Musco, Musco, and Persu (STOC 2015); for k-medians, it answers a question raised by Kannan.

Directed Steiner Tree (DST) is a central problem in combinatorial optimization and theoretical computer science: Given a directed graph G = (V, E) with edge costs c ∈ ℝ_{≥ 0}^E, a root r ∈ V and k terminals K ⊆ V, we need to output a minimum-cost arborescence in G that contains an r → t path for every t ∈ K. Recently, Grandoni, Laekhanukit and Li, and independently Ghuge and Nagarajan, gave quasi-polynomial time O(log^2 k / log log k)-approximation algorithms for the problem, which are tight under popular complexity assumptions. In this paper, we consider the more general Degree-Bounded Directed Steiner Tree (DB-DST) problem, where we are additionally given a degree bound d_v on each vertex v ∈ V, and we require that every vertex v in the output tree has at most d_v children. We give a quasi-polynomial time (O(log n log k), O(log^2 n))-bicriteria approximation: The algorithm produces a solution with cost at most O(log n log k) times the cost of the optimum solution that violates the degree constraints by at most a factor of O(log^2 n). This is the first non-trivial result for the problem. While our cost guarantee is nearly optimal, the degree violation factor of O(log^2 n) is an O(log n) factor away from the approximation lower bound of Ω(log n) from the Set Cover hardness. The hardness result holds even on the special case of the Degree-Bounded Group Steiner Tree problem on trees (DB-GST-T). With the hope of closing the gap, we study the question of whether the degree violation factor can be made tight for this special case. We answer the question in the affirmative by giving an (O(log n log k), O(log n))-bicriteria approximation algorithm for DB-GST-T.