We study a sketch-and-solve approach to speed up the Peng–Wei semidefinite relaxation of $k$-means clustering. When the data are appropriately separated we identify the $k$-means optimal clustering. Otherwise, our approach provides a high-confidence lower bound on the optimal $k$-means value. This lower bound is data-driven; it does not make any assumption on the data nor how they are generated. We provide code and an extensive set of numerical experiments where we use this approach to certify approximate optimality of clustering solutions obtained by k-means++.
Sketch-and-Lift: Scalable Subsampled Semidefinite Program for K-means Clustering
Semidefinite programming (SDP) is a powerful tool for tackling a wide range of computationally hard problems such as clustering. Despite the high accuracy, semidefinite programs are often too slow in practice with poor scalability on large (or even moderate) datasets. In this paper, we introduce a linear time complexity algorithm for approximating an SDP relaxed K-means clustering. The proposed sketch-and-lift (SL) approach solves an SDP on a subsampled dataset and then propagates the solution to all data points by a nearest-centroid rounding procedure. It is shown that the SL approach enjoys a similar exact recovery threshold as the K-means SDP on the full dataset, which is known to be information-theoretically tight under the Gaussian mixture model. The SL method can be made adaptive with enhanced theoretic properties when the cluster sizes are unbalanced. Our simulation experiments demonstrate that the statistical accuracy of the proposed method outperforms state-of-the-art fast clustering algorithms without sacrificing too much computational efficiency, and is comparable to the original K-means SDP with substantially reduced runtime.
more »
« less
- Award ID(s):
- 1752614
- PAR ID:
- 10321218
- Date Published:
- Journal Name:
- International Conference on Artificial Intelligence and Statistics
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Abstract -
Clustering continues to be an important tool for data engineering and analysis. While advances in deep learning tend to be at the forefront of machine learning, it is only useful for the supervised classification of data sets. Clustering is an essential tool for problems where labeling data sets is either too labor intensive or where there is no agreed upon ground truth. The well studied k-means problem partitions groups of similar vectors into k clusters by iteratively updating the cluster assignment such that it minimizes the within cluster sum of squares metric. Unfortunately k-means can become prohibitive for very large high dimensional data sets as iterative methods often rely on random access to, or multiple passes over, the data set — a requirement that is not often possible for large and potentially unbounded data sets. In this work we explore an randomized, approximate method for clustering called Tree-Walk Random Projection Clustering (TWRP) that is a fast, memory efficient method for finding cluster embedding in high dimensional spaces. TWRP combines random projection with a tree based partitioner to achieve a clustering method that forgoes storing the exhaustive representation of all vectors in the data space and instead performs a bounded search over the implied cluster bifurcation tree represented as approximate vector and count values. The TWRP algorithm is described and experimentally evaluated for scalability and accuracy in the presence of noise against several other well-known algorithms.more » « less
-
In this paper, we present CPL-Sync, a certifiably correct algorithm to solve planar pose graph optimization (PGO) using the complex number representation. We formulate planar PGO as the maximum likelihood estimation (MLE) on the product of unit complex numbers, and relax this nonconvex quadratic complex optimization problem to complex semidefinite programming (SDP). Furthermore, we simplify the corresponding semidefinite programming to Riemannian staircase optimization (RSO) on complex oblique manifolds that can be solved with the Riemannian trust region (RTR) method. In addition, we prove that the SDP relaxation and RSO simplification are tight as long as the noise magnitude is below a certain threshold. The efficacy of this work is validated through comparisons with existing methods as well as applications on planar PGO in simultaneous localization and mapping (SLAM), which indicates that the proposed algorithm is more efficient and capable of solving planar PGO certifiably. The C++ code for CPL-Sync is available at https://github. com/fantaosha/CPL- Sync.more » « less
-
null (Ed.)The traveling salesman problem (TSP) is a fundamental problem in combinatorial optimization. Several semidefinite programming relaxations have been proposed recently that exploit a variety of mathematical structures including, for example, algebraic connectivity, permutation matrices, and association schemes. The main results of this paper are twofold. First, de Klerk and Sotirov [de Klerk E, Sotirov R (2012) Improved semidefinite programming bounds for quadratic assignment problems with suitable symmetry. Math. Programming 133(1):75–91.] present a semidefinite program (SDP) based on permutation matrices and symmetry reduction; they show that it is incomparable to the subtour elimination linear program but generally dominates it on small instances. We provide a family of simplicial TSP instances that shows that the integrality gap of this SDP is unbounded. Second, we show that these simplicial TSP instances imply the unbounded integrality gap of every SDP relaxation of the TSP mentioned in the survey on SDP relaxations of the TSP in section 2 of Sotirov [Sotirov R (2012) SDP relaxations for some combinatorial optimization problems. Anjos MF, Lasserre JB, eds., Handbook on Semidefinite, Conic and Polynomial Optimization (Springer, New York), 795–819.]. In contrast, the subtour linear program performs perfectly on simplicial instances. The simplicial instances thus form a natural litmus test for future SDP relaxations of the TSP.more » « less
-
Semidefinite programs (SDPs) often arise in relaxations of some NP-hard problems, and if the solution of the SDP obeys certain rank constraints, the relaxation will be tight. Decomposition methods based on chordal sparsity have already been applied to speed up the solution of sparse SDPs, but methods for dealing with rank constraints are underdeveloped. This paper leverages a minimum rank completion result to decompose the rank constraint on a single large matrix into multiple rank constraints on a set of smaller matrices. The re-weighted heuristic is used as a proxy for rank, and the specific form of the heuristic preserves the sparsity pattern between iterations. Implementations of rank-minimized SDPs through interior-point and first-order algorithms are discussed. The problem of subspace clustering is used to demonstrate the computational improvement of the proposed method.more » « less