
Title: Explainable k-means: don’t be greedy, plant bigger trees!
We provide a new bi-criteria O(log^2 k) competitive algorithm for explainable k-means clustering. Explainable k-means was recently introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020). It is described by an easy-to-interpret-and-understand threshold decision tree or diagram. The cost of an explainable k-means clustering equals the sum of the costs of its clusters, and the cost of each cluster equals the sum of squared distances from the points in the cluster to the center of that cluster. The best non-bi-criteria algorithm for explainable clustering is O(k) competitive, and this bound is tight. Our randomized bi-criteria algorithm constructs a threshold decision tree that partitions the data set into (1+δ)k clusters (where δ ∈ (0,1) is a parameter of the algorithm). The cost of this clustering is at most O((1/δ) · log^2 k) times the cost of the optimal unconstrained k-means clustering. We show that this bound is almost optimal.
Award ID(s): 1955351
NSF-PAR ID: 10336877
Journal Name: STOC 2022
Page Range or eLocation-ID: 1629 to 1642
Sponsoring Org: National Science Foundation
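
The cost definition in the abstract above translates directly into code. Below is a minimal Python sketch, not the paper's algorithm: it only routes points through a given threshold decision tree and evaluates the k-means cost of the induced clustering. The Node class and its fields are illustrative assumptions rather than notation from the paper.

```python
import numpy as np

class Node:
    """A node of a threshold decision tree (illustrative representation).

    An internal node splits on a single coordinate: points with
    x[feature] <= threshold go left, the rest go right. A leaf holds
    a cluster id.
    """
    def __init__(self, feature=None, threshold=None, left=None, right=None, cluster=None):
        self.feature = feature
        self.threshold = threshold
        self.left = left
        self.right = right
        self.cluster = cluster  # set only at leaves

def assign(tree, x):
    """Route a single point to the cluster id of the leaf it reaches."""
    node = tree
    while node.cluster is None:
        node = node.left if x[node.feature] <= node.threshold else node.right
    return node.cluster

def kmeans_cost(tree, X):
    """Sum over clusters of squared distances to the cluster mean (the k-means objective)."""
    labels = np.array([assign(tree, x) for x in X])
    cost = 0.0
    for c in np.unique(labels):
        pts = X[labels == c]
        center = pts.mean(axis=0)  # for a fixed cluster, its mean is the optimal center
        cost += ((pts - center) ** 2).sum()
    return cost

# Tiny usage example: a depth-1 tree that splits on feature 0 at threshold 0.5.
if __name__ == "__main__":
    X = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
    tree = Node(feature=0, threshold=0.5,
                left=Node(cluster=0), right=Node(cluster=1))
    print(kmeans_cost(tree, X))
```

For a fixed partition, the mean of each cluster is its optimal center, which is why the sketch recomputes centers from the leaves of the tree.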
More Like this
  1. We consider the problem of explainable k-medians and k-means introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020). In this problem, our goal is to find a threshold decision tree that partitions data into k clusters and minimizes the k-medians or k-means objective. The obtained clustering is easy to interpret because every decision node of a threshold tree splits data based on a single feature into two groups. We propose a new algorithm for this problem which is O(log k) competitive with k-medians with the ℓ1 norm and O(k) competitive with k-means. This is an improvement over the previous guarantees of O(k) and O(k^2) by Dasgupta et al. (2020). We also provide a new algorithm which is O(log^{3/2} k) competitive for k-medians with the ℓ2 norm. Our first algorithm is near-optimal: Dasgupta et al. (2020) showed a lower bound of Ω(log k) for k-medians; in this work, we prove a lower bound of Ω(k) for k-means. We also provide a lower bound of Ω(log k) for k-medians with the ℓ2 norm.
  2. Round genomes are found in bacteria, plant chloroplasts, and mitochondria. Genetic or epigenetic marks can present biologically interesting clusters along a circular genome. The circular data clustering problem groups N points on a circle into K clusters to minimize the within-cluster sum of squared distances. Repeatedly applying the K-means algorithm takes quadratic time, impractical for large circular datasets. To overcome this issue, we developed a fast, reproducible, and optimal circular clustering (FOCC) algorithm of worst-case O(KN log^2 N) time. The core is a fast optimal framed clustering algorithm, which we designed by integrating two divide-and-conquer strategies and one bracket dynamic programming strategy. The algorithm is optimal based on a property of monotonically increasing cluster borders over frames on linearized data. On clustering 50,000 circular data points, FOCC outruns brute-force or heuristic circular clustering by three orders of magnitude. We produced clusters of CpG sites and genes along three round genomes, exhibiting higher quality than heuristic clustering. More broadly, the presented subquadratic-time algorithms offer the fastest known solution to not only framed and circular clustering, but also angular, periodical, and looped clustering. We implemented these algorithms in the R package OptCirClust (https://CRAN.R-project.org/package=OptCirClust).
  3. We examine the problem of smoothed online optimization, where a decision maker must sequentially choose points in a normed vector space to minimize the sum of per-round, non-convex hitting costs and the costs of switching decisions between rounds (this objective is spelled out in a short sketch after this list). The decision maker has access to a black-box oracle, such as a machine learning model, that provides untrusted and potentially inaccurate predictions of the optimal decision in each round. The goal of the decision maker is to exploit the predictions if they are accurate, while guaranteeing performance that is not much worse than the hindsight optimal sequence of decisions, even when predictions are inaccurate. We impose the standard assumption that hitting costs are globally α-polyhedral. We propose a novel algorithm, Adaptive Online Switching (AOS), and prove that, for a large set of feasible δ > 0, it is (1+δ)-competitive if predictions are perfect, while also maintaining a uniformly bounded competitive ratio of 2^{Õ(1/(αδ))} even when predictions are adversarial. Further, we prove that this trade-off is necessary and nearly optimal in the sense that any deterministic algorithm which is (1+δ)-competitive if predictions are perfect must be at least 2^{Ω̃(1/(αδ))}-competitive when predictions are inaccurate. In fact, we observe a unique threshold-type behavior in this trade-off: if δ is not in the set of feasible options, then no algorithm is simultaneously (1+δ)-competitive if predictions are perfect and ζ-competitive when predictions are inaccurate for any ζ < ∞. Furthermore, we show that memory is crucial in AOS by proving that any algorithm that does not use memory cannot benefit from predictions. We complement our theoretical results by a numerical study on a microgrid application.
  4. Clustering continues to be an important tool for data engineering and analysis. While advances in deep learning tend to be at the forefront of machine learning, it is only useful for the supervised classification of data sets. Clustering is an essential tool for problems where labeling data sets is either too labor intensive or where there is no agreed-upon ground truth. The well-studied k-means problem partitions groups of similar vectors into k clusters by iteratively updating the cluster assignment such that it minimizes the within-cluster sum of squares metric. Unfortunately, k-means can become prohibitive for very large, high-dimensional data sets, as iterative methods often rely on random access to, or multiple passes over, the data set, a requirement that is not often possible for large and potentially unbounded data sets. In this work we explore a randomized, approximate method for clustering called Tree-Walk Random Projection Clustering (TWRP), a fast, memory-efficient method for finding cluster embeddings in high-dimensional spaces. TWRP combines random projection with a tree-based partitioner to achieve a clustering method that forgoes storing the exhaustive representation of all vectors in the data space and instead performs a bounded search over the implied cluster bifurcation tree, represented as approximate vector and count values. The TWRP algorithm is described and experimentally evaluated for scalability and accuracy in the presence of noise against several other well-known algorithms.
  5. Bansal, Nikhil (Ed.)
    This paper presents universal algorithms for clustering problems, including the widely studied k-median, k-means, and k-center objectives. The input is a metric space containing all potential client locations. The algorithm must select k cluster centers such that they are a good solution for any subset of clients that actually realize. Specifically, we aim for low regret, defined as the maximum over all subsets of the difference between the cost of the algorithm’s solution and that of an optimal solution. A universal algorithm’s solution sol for a clustering problem is said to be an (α, β)-approximation if for all subsets of clients C', it satisfies sol(C') ≤ α ⋅ opt(C') + β ⋅ mr, where opt(C') is the cost of the optimal solution for clients C' and mr is the minimum regret achievable by any solution (this guarantee is spelled out in a short sketch after this list). Our main results are universal algorithms for the standard clustering objectives of k-median, k-means, and k-center that achieve (O(1), O(1))-approximations. These results are obtained via a novel framework for universal algorithms using linear programming (LP) relaxations. These results generalize to other 𝓁_p-objectives and the setting where some subset of the clients are fixed. We also give hardness results showing that (α, β)-approximation is NP-hard if α or β is at most a certain constant, even for the widely studied special case of Euclidean metric spaces. This shows that in some sense, (O(1), O(1))-approximation is the strongest type of guarantee obtainable for universal clustering.
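
To make the (α, β)-approximation guarantee from item 5 concrete, here is a minimal Python sketch that simply checks the definition sol(C') ≤ α ⋅ opt(C') + β ⋅ mr over all nonempty client subsets. The oracles sol_cost and opt_cost and the value mr are hypothetical inputs supplied by the caller, not part of the paper, and the exhaustive enumeration is exponential, so this only spells out the guarantee rather than computing anything efficiently.

```python
from itertools import combinations

def is_alpha_beta_approx(sol_cost, opt_cost, clients, mr, alpha, beta):
    """Check sol(C') <= alpha * opt(C') + beta * mr for every nonempty subset C'.

    sol_cost, opt_cost : callables mapping a tuple of clients to a cost
                         (hypothetical oracles, assumed to be provided)
    mr                 : minimum regret achievable by any single solution
    """
    for r in range(1, len(clients) + 1):
        for subset in combinations(clients, r):
            if sol_cost(subset) > alpha * opt_cost(subset) + beta * mr + 1e-9:
                return False
    return True
```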
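
Similarly, the objective in item 3, per-round hitting costs plus the cost of switching decisions between rounds, can be evaluated directly for any fixed decision sequence. The sketch below uses the Euclidean norm as the switching cost purely for illustration; in that setting the norm and the (possibly non-convex) hitting-cost functions are part of the problem instance.

```python
import numpy as np

def smoothed_online_cost(decisions, hitting_costs):
    """Total cost of a decision sequence: sum over rounds of f_t(x_t) + ||x_t - x_{t-1}||.

    decisions     : list of numpy arrays, the point chosen in each round
    hitting_costs : list of callables f_t, one per round (possibly non-convex)
    The Euclidean switching cost is an illustrative choice of norm.
    """
    total, prev = 0.0, None
    for x, f in zip(decisions, hitting_costs):
        total += f(x)                          # per-round hitting cost
        if prev is not None:
            total += np.linalg.norm(x - prev)  # cost of switching decisions
        prev = x
    return total
```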