The size and amount of data captured from numerous sources has created a situation where the large quantity of data challenges our ability to understand the meaning within the data. This has motivated studies for mechanized data analysis and in particular for the clustering, or partitioning, of data into related groups. In fact, the size of the data has grown to the point where it is now often necessary to stream the data through the system for online and high speed analysis. This paper explores the application of approximate methods for the stream clustering of highdimensional data (feature sizes contains 100+ measures). In particular, the algorithm that has been developed, called streamingRPHash, combines Random Projection with Locality Sensitive Hashing and a countmin sketch to implement a highperformance method for the parallel and distributed clustering of streaming data in a MapReduce framework. streamingRPHash is able to perform clustering at a rate much faster than traditional clustering algorithms such as KMeans. streamingRPHash provides clustering results that are only slightly less accurate than KMeans, but in runtimes that are nearly half that required by KMeans. The performance advantage for streamingRPHash becomes even more significant as the dimensionality of the input data stream increases. Furthermore, the experimental results show that streamingRPHash has a near linear speedup relative to the number of CPU cores. This speedup efficiency is possible because the approximate methods used in streamingRPHash allow independent and largely unsynchronized analyses to be performed on each streamed data vectors.
more »
« less
Tree Based Clustering On Large, High Dimensional Datasets
Clustering continues to be an important tool for data engineering and analysis. While advances in deep learning tend to be at the forefront of machine learning, it is only useful for the supervised classification of data sets. Clustering is an essential tool for problems where labeling data sets is either too labor intensive or where there is no agreed upon ground truth. The well studied kmeans problem partitions groups of similar vectors into k clusters by iteratively updating the cluster assignment such that it minimizes the within cluster sum of squares metric. Unfortunately kmeans can become prohibitive for very large high dimensional data sets as iterative methods often rely on random access to, or multiple passes over, the data set — a requirement that is not often possible for large and potentially unbounded data sets. In this work we explore an randomized, approximate method for clustering called TreeWalk Random Projection Clustering (TWRP) that is a fast, memory efficient method for finding cluster embedding in high dimensional spaces. TWRP combines random projection with a tree based partitioner to achieve a clustering method that forgoes storing the exhaustive representation of all vectors in the data space and instead performs a bounded search over the implied cluster bifurcation tree represented as approximate vector and count values. The TWRP algorithm is described and experimentally evaluated for scalability and accuracy in the presence of noise against several other wellknown algorithms.
more »
« less
 Award ID(s):
 1440420
 NSFPAR ID:
 10350966
 Date Published:
 Journal Name:
 MLDM 2019: 15th International Conference on Machine Learning and Data Mining
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this


We provide a new bicriteria O(log2k) competitive algorithm for explainable kmeans clustering. Explainable kmeans was recently introduced by Dasgupta, Frost, Moshkovitz, and Rashtchian (ICML 2020). It is described by an easy to interpret and understand (threshold) decision tree or diagram. The cost of the explainable kmeans clustering equals to the sum of costs of its clusters; and the cost of each cluster equals the sum of squared distances from the points in the cluster to the center of that cluster. The best non bicriteria algorithm for explainable clustering O(k) competitive, and this bound is tight. Our randomized bicriteria algorithm constructs a threshold decision tree that partitions the data set into (1+δ)k clusters (where δ∈(0,1) is a parameter of the algorithm). The cost of this clustering is at most O(1/δ⋅log2k) times the cost of the optimal unconstrained kmeans clustering. We show that this bound is almost optimal.more » « less

null (Ed.)Many quantum algorithms for machine learning require access to classical data in superposition. However, for many natural data sets and algorithms, the overhead required to load the data set in superposition can erase any potential quantum speedup over classical algorithms. Recent work by Harrow introduces a new paradigm in hybrid quantumclassical computing to address this issue, relying on coresets to minimize the data loading overhead of quantum algorithms. We investigated using this paradigm to perform kmeans clustering on nearterm quantum computers, by casting it as a QAOA optimization instance over a small coreset. We used numerical simulations to compare the performance of this approach to classical kmeans clustering. We were able to find data sets with which coresets work well relative to random sampling and where QAOA could potentially outperform standard kmeans on a coreset. However, finding data sets where both coresets and QAOA work well—which is necessary for a quantum advantage over kmeans on the entire data set—appears to be challenging.more » « less

Semidefinite programming (SDP) is a powerful tool for tackling a wide range of computationally hard problems such as clustering. Despite the high accuracy, semidefinite programs are often too slow in practice with poor scalability on large (or even moderate) datasets. In this paper, we introduce a linear time complexity algorithm for approximating an SDP relaxed Kmeans clustering. The proposed sketchandlift (SL) approach solves an SDP on a subsampled dataset and then propagates the solution to all data points by a nearestcentroid rounding procedure. It is shown that the SL approach enjoys a similar exact recovery threshold as the Kmeans SDP on the full dataset, which is known to be informationtheoretically tight under the Gaussian mixture model. The SL method can be made adaptive with enhanced theoretic properties when the cluster sizes are unbalanced. Our simulation experiments demonstrate that the statistical accuracy of the proposed method outperforms stateoftheart fast clustering algorithms without sacrificing too much computational efficiency, and is comparable to the original Kmeans SDP with substantially reduced runtime.more » « less

Advances in single cell transcriptomics have allowed us to study the identity of single cells. This has led to the discovery of new cell types and high resolution tissue maps of them. Technologies that measure multiple modalities of such data add more detail, but they also complicate data integration. We offer an integrated analysis of the spatial location and gene expression profiles of cells to determine their identity. We propose scHybridNMF (singlecell Hybrid Nonnegative Matrix Factorization), which performs cell type identification by combining sparse nonnegative matrix factorization (sparse NMF) with kmeans clustering to cluster highdimensional gene expression and lowdimensional location data. We show that, under multiple scenarios, including the cases where there is a small number of genes profiled and the location data is noisy, scHybridNMF outperforms sparse NMF, kmeans, and an existing method that uses a hidden Markov random field to encode cell location and gene expression data for cell type identification.more » « less