We consider the problem of explainable kmedians and kmeans introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020). In this problem, our goal is to find a threshold decision tree that partitions data into k clusters and minimizes the kmedians or kmeans objective. The obtained clustering is easy to interpret because every decision node of a threshold tree splits data based on a single feature into two groups. We propose a new algorithm for this problem which is O(log k) competitive with kmedians with ℓ1 norm and O(k) competitive with kmeans. This is an improvement over the previous guarantees of O(k) and O(k^2) by Dasgupta et al (2020). We also provide a new algorithm which is O(log^{3}{2}k) competitive for kmedians with ℓ2 norm. Our first algorithm is nearoptimal: Dasgupta et al (2020) showed a lower bound of Ω(log k) for kmedians; in this work, we prove a lower bound of Ω(k) for kmeans. We also provide a lower bound of Ω(log k) for kmedians with ℓ2 norm.
Explainable kmeans: don’t be greedy, plant bigger trees!
We provide a new bicriteria O(log2k) competitive algorithm for explainable kmeans clustering. Explainable kmeans was recently introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020). It is described by an easy to interpret and understand (threshold) decision tree or diagram. The cost of the explainable kmeans clustering equals to the sum of costs of its clusters; and the cost of each cluster equals the sum of squared distances from the points in the cluster to the center of that cluster. The best non bicriteria algorithm for explainable clustering O(k) competitive, and this bound is tight.
Our randomized bicriteria algorithm constructs a threshold decision tree that partitions the data set into (1+δ)k clusters (where δ∈(0,1) is a parameter of the algorithm). The cost of this clustering is at most O(1/δ⋅log2k) times the cost of the optimal unconstrained kmeans clustering. We show that this bound is almost optimal.
 Award ID(s):
 1955351
 Publication Date:
 NSFPAR ID:
 10336877
 Journal Name:
 STOC 2022
 Page Range or eLocationID:
 1629 to 1642
 Sponsoring Org:
 National Science Foundation
More Like this


Round genomes are found in bacteria, plant chloroplasts, and mitochondria. Genetic or epigenetic marks can present biologically interesting clusters along a circular genome. The circular data clustering problem groups N points on a circle into K clusters to minimize the withincluster sum of squared distances. Repeatedly applying the Kmeans algorithm takes quadratic time, impractical for large circular datasets. To overcome this issue, we developed a fast, reproducible, and optimal circular clustering (FOCC) algorithm of worstcase O(KN log^2 N) time. The core is a fast optimal framed clustering algorithm, which we designed by integrating two divideandconquer and one bracket dynamic programming strategies. The algorithm is optimal based on a property of monotonic increasing cluster borders over frames on linearized data. On clustering 50,000 circular data points, FOCC outruns bruteforce or heuristic circular clustering by three orders of magnitude. We produced clusters of CpG sites and genes along three round genomes, exhibiting higher quality than heuristic clustering. More broadly, the presented subquadratictime algorithms offer the fastest known solution to not only framed and circular clustering, but also angular, periodical, and looped clustering. We implemented these algorithms in the R package OptCirClust (https://CRAN.Rproject.org/package=OptCirClust)

We examine the problem of smoothed online optimization, where a decision maker must sequentially choose points in a normed vector space to minimize the sum of perround, nonconvex hitting costs and the costs of switching decisions between rounds. The decision maker has access to a blackbox oracle, such as a machine learning model, that provides untrusted and potentially inaccurate predictions of the optimal decision in each round. The goal of the decision maker is to exploit the predictions if they are accurate, while guaranteeing performance that is not much worse than the hindsight optimal sequence of decisions, even when predictions are inaccurate. We impose the standard assumption that hitting costs are globally αpolyhedral. We propose a novel algorithm, Adaptive Online Switching (AOS), and prove that, for a large set of feasible δ > 0, it is (1+δ)competitive if predictions are perfect, while also maintaining a uniformly bounded competitive ratio of 2~O (1/(α δ)) even when predictions are adversarial. Further, we prove that this tradeoff is necessary and nearly optimal in the sense that any deterministic algorithm which is (1+δ)competitive if predictions are perfect must be at least 2~Ω (1/(α δ)) competitive when predictions are inaccurate. In fact, we observe amore »

Clustering continues to be an important tool for data engineering and analysis. While advances in deep learning tend to be at the forefront of machine learning, it is only useful for the supervised classification of data sets. Clustering is an essential tool for problems where labeling data sets is either too labor intensive or where there is no agreed upon ground truth. The well studied kmeans problem partitions groups of similar vectors into k clusters by iteratively updating the cluster assignment such that it minimizes the within cluster sum of squares metric. Unfortunately kmeans can become prohibitive for very large high dimensional data sets as iterative methods often rely on random access to, or multiple passes over, the data set — a requirement that is not often possible for large and potentially unbounded data sets. In this work we explore an randomized, approximate method for clustering called TreeWalk Random Projection Clustering (TWRP) that is a fast, memory efficient method for finding cluster embedding in high dimensional spaces. TWRP combines random projection with a tree based partitioner to achieve a clustering method that forgoes storing the exhaustive representation of all vectors in the data space and instead performs a bounded searchmore »

Bansal, Nikhil and (Ed.)his paper presents universal algorithms for clustering problems, including the widely studied kmedian, kmeans, and kcenter objectives. The input is a metric space containing all potential client locations. The algorithm must select k cluster centers such that they are a good solution for any subset of clients that actually realize. Specifically, we aim for low regret, defined as the maximum over all subsets of the difference between the cost of the algorithm’s solution and that of an optimal solution. A universal algorithm’s solution sol for a clustering problem is said to be an (α, β)approximation if for all subsets of clients C', it satisfies sol(C') ≤ α ⋅ opt(C') + β ⋅ mr, where opt(C') is the cost of the optimal solution for clients C' and mr is the minimum regret achievable by any solution. Our main results are universal algorithms for the standard clustering objectives of kmedian, kmeans, and kcenter that achieve (O(1), O(1))approximations. These results are obtained via a novel framework for universal algorithms using linear programming (LP) relaxations. These results generalize to other 𝓁_pobjectives and the setting where some subset of the clients are fixed. We also give hardness results showing that (α, β)approximation is NPhard ifmore »