skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


Title: Sketch-and-Lift: Scalable Subsampled Semidefinite Program for K-means Clustering
Semidefinite programming (SDP) is a powerful tool for tackling a wide range of computationally hard problems such as clustering. Despite the high accuracy, semidefinite programs are often too slow in practice with poor scalability on large (or even moderate) datasets. In this paper, we introduce a linear time complexity algorithm for approximating an SDP relaxed K-means clustering. The proposed sketch-and-lift (SL) approach solves an SDP on a subsampled dataset and then propagates the solution to all data points by a nearest-centroid rounding procedure. It is shown that the SL approach enjoys a similar exact recovery threshold as the K-means SDP on the full dataset, which is known to be information-theoretically tight under the Gaussian mixture model. The SL method can be made adaptive with enhanced theoretic properties when the cluster sizes are unbalanced. Our simulation experiments demonstrate that the statistical accuracy of the proposed method outperforms state-of-the-art fast clustering algorithms without sacrificing too much computational efficiency, and is comparable to the original K-means SDP with substantially reduced runtime.  more » « less
Award ID(s):
1752614
PAR ID:
10321218
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
International Conference on Artificial Intelligence and Statistics
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. K-means clustering is a widely used machine learning method for identifying patterns in large datasets. Recently, semidefinite programming (SDP) relaxations have been proposed for solving the K-means optimization problem, which enjoy strong statistical optimality guarantees. However, the prohibitive cost of implementing an SDP solver renders these guarantees inaccessible to practical datasets. In contrast, nonnegative matrix factorization (NMF) is a simple clustering algorithm widely used by machine learning practitioners, but it lacks a solid statistical underpinning and theoretical guarantees. In this paper, we consider an NMF-like algorithm that solves a nonnegative low-rank restriction of the SDP-relaxed K-means formulation using a nonconvex Burer--Monteiro factorization approach. The resulting algorithm is as simple and scalable as state-of-the-art NMF algorithms while also enjoying the same strong statistical optimality guarantees as the SDP. In our experiments, we observe that our algorithm achieves significantly smaller mis-clustering errors compared to the existing state-of-the-art while maintaining scalability. 
    more » « less
  2. Clustering continues to be an important tool for data engineering and analysis. While advances in deep learning tend to be at the forefront of machine learning, it is only useful for the supervised classification of data sets. Clustering is an essential tool for problems where labeling data sets is either too labor intensive or where there is no agreed upon ground truth. The well studied k-means problem partitions groups of similar vectors into k clusters by iteratively updating the cluster assignment such that it minimizes the within cluster sum of squares metric. Unfortunately k-means can become prohibitive for very large high dimensional data sets as iterative methods often rely on random access to, or multiple passes over, the data set — a requirement that is not often possible for large and potentially unbounded data sets. In this work we explore an randomized, approximate method for clustering called Tree-Walk Random Projection Clustering (TWRP) that is a fast, memory efficient method for finding cluster embedding in high dimensional spaces. TWRP combines random projection with a tree based partitioner to achieve a clustering method that forgoes storing the exhaustive representation of all vectors in the data space and instead performs a bounded search over the implied cluster bifurcation tree represented as approximate vector and count values. The TWRP algorithm is described and experimentally evaluated for scalability and accuracy in the presence of noise against several other well-known algorithms. 
    more » « less
  3. Ground motion selection has become increasingly central to the assessment of earthquake resilience. The selection of ground motion records for use in nonlinear dynamic analysis significantly affects structural response. This, in turn, will impact the outcomes of earthquake resilience analysis. This paper presents a new ground motion clustering algorithm, which can be embedded in current ground motion selection methods to properly select representative ground motion records that a structure of interest will probabilistically experience. The proposed clustering-based ground motion selection method includes four main steps: 1) leveraging domain-specific knowledge to pre-select candidate ground motions; 2) using a convolutional autoencoder to learn low-dimensional underlying characteristics of candidate ground motions’ response spectra – i.e., latent features; 3) performing k-means clustering to classify the learned latent features, equivalent to cluster the response spectra of candidate ground motions; and 4) embedding the clusters in the conditional spectra-based ground motion selection. The selected ground motions can represent a given hazard level well (by matching conditional spectra) and fully describe the complete set of candidate ground motions. Three case studies for modified, pulse-type, and non-pulse-type ground motions are designed to evaluate the performance of the proposed ground motion clustering algorithm (convolutional autoencoder + k-means). Considering the limited number of pre-selected candidate ground motions in the last two case studies, the response spectra simulation and transfer learning are used to improve the stability and reproducibility of the proposed ground motion clustering algorithm. The results of the three case studies demonstrate that the convolutional autoencoder + k-means can 1) achieve 100% accuracy in classifying ground motion response spectra, 2) correctly determine the optimal number of clusters, and 3) outperform established clustering algorithms (i.e., autoencoder + k-means, time series k-means, spectral clustering, and k-means on ground motion influence factors). Using the proposed clustering-based ground motion selection method, an application is performed to select ground motions for a structure in San Francisco, California. The developed user-friendly codes are published for practical use. 
    more » « less
  4. In this paper, we present CPL-Sync, a certifiably correct algorithm to solve planar pose graph optimization (PGO) using the complex number representation. We formulate planar PGO as the maximum likelihood estimation (MLE) on the product of unit complex numbers, and relax this nonconvex quadratic complex optimization problem to complex semidefinite programming (SDP). Furthermore, we simplify the corresponding semidefinite programming to Riemannian staircase optimization (RSO) on complex oblique manifolds that can be solved with the Riemannian trust region (RTR) method. In addition, we prove that the SDP relaxation and RSO simplification are tight as long as the noise magnitude is below a certain threshold. The efficacy of this work is validated through comparisons with existing methods as well as applications on planar PGO in simultaneous localization and mapping (SLAM), which indicates that the proposed algorithm is more efficient and capable of solving planar PGO certifiably. The C++ code for CPL-Sync is available at https://github. com/fantaosha/CPL- Sync. 
    more » « less
  5. Semidefinite programs (SDPs) often arise in relaxations of some NP-hard problems, and if the solution of the SDP obeys certain rank constraints, the relaxation will be tight. Decomposition methods based on chordal sparsity have already been applied to speed up the solution of sparse SDPs, but methods for dealing with rank constraints are underdeveloped. This paper leverages a minimum rank completion result to decompose the rank constraint on a single large matrix into multiple rank constraints on a set of smaller matrices. The re-weighted heuristic is used as a proxy for rank, and the specific form of the heuristic preserves the sparsity pattern between iterations. Implementations of rank-minimized SDPs through interior-point and first-order algorithms are discussed. The problem of subspace clustering is used to demonstrate the computational improvement of the proposed method. 
    more » « less