The volume of data captured from numerous sources has grown to the point where the sheer quantity challenges our ability to extract meaning from it. This has motivated studies of mechanized data analysis and, in particular, of the clustering, or partitioning, of data into related groups. Indeed, data sets have grown so large that it is now often necessary to stream the data through the system for online, high-speed analysis. This paper explores the application of approximate methods to the stream clustering of high-dimensional data (feature vectors with 100+ measures). In particular, the algorithm we have developed, called streamingRPHash, combines random projection with locality-sensitive hashing and a count-min sketch to implement a high-performance method for the parallel and distributed clustering of streaming data in a MapReduce framework. streamingRPHash is able to perform clustering at a rate much faster than traditional clustering algorithms such as k-means. It provides clustering results that are only slightly less accurate than those of k-means, but in runtimes nearly half those required by k-means. The performance advantage of streamingRPHash becomes even more significant as the dimensionality of the input data stream increases.
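The abstract's two main ingredients, sign-of-projection hashing and a count-min sketch, can be sketched in a few lines. This is an illustrative toy, not the streamingRPHash implementation; the class `CountMinSketch` and function `rp_hash` are hypothetical names, and the idea shown is that frequently hashed buckets approximate dense regions of the stream:

```python
import numpy as np

class CountMinSketch:
    """Count-min sketch: approximate frequency counts in sub-linear space.
    Estimates only ever over-count (collisions add, never subtract)."""
    def __init__(self, width=1024, depth=4, seed=0):
        rng = np.random.default_rng(seed)
        self.width = width
        self.table = np.zeros((depth, width), dtype=np.int64)
        # one independent hash salt per row
        self.salts = rng.integers(0, 2**31, size=depth)

    def _cells(self, key):
        return [(i, hash((int(s), key)) % self.width)
                for i, s in enumerate(self.salts)]

    def add(self, key):
        for i, j in self._cells(key):
            self.table[i, j] += 1

    def estimate(self, key):
        # min over rows bounds the collision error from above
        return min(self.table[i, j] for i, j in self._cells(key))

def rp_hash(x, R):
    """LSH bucket id for vector x: the sign pattern of a random projection R @ x."""
    return tuple(int(b) for b in (R @ x > 0))
```

In a streaming setting, each arriving high-dimensional vector would be reduced via `rp_hash` and counted with `cms.add(bucket)`; buckets with large estimated counts are candidate cluster regions.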
Tree-Based Clustering on Large, High-Dimensional Datasets
Clustering continues to be an important tool for data engineering and analysis. While advances in deep learning tend to be at the forefront of machine learning, deep learning is only useful for the supervised classification of data sets. Clustering is an essential tool for problems where labeling data sets is either too labor intensive or where there is no agreed-upon ground truth. The well-studied k-means problem partitions groups of similar vectors into k clusters by iteratively updating the cluster assignments so as to minimize the within-cluster sum-of-squares metric. Unfortunately, k-means can become prohibitive for very large, high-dimensional data sets, as iterative methods often rely on random access to, or multiple passes over, the data set, a requirement that is often not possible for large and potentially unbounded data sets. In this work we explore a randomized, approximate method for clustering called TreeWalk Random Projection Clustering (TWRP), a fast, memory-efficient method for finding cluster embeddings in high-dimensional spaces. TWRP combines random projection with a tree-based partitioner to achieve a clustering method that forgoes storing an exhaustive representation of all vectors in the data space and instead performs a bounded search …
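The combination of random projection with a tree-based partitioner can be illustrated with a deliberately simplified variant: at each level, every point is split on the sign of one global random direction relative to the median, so each point accumulates a bit-code identifying its leaf. A real random-projection tree would choose a fresh direction per node; the function name `rp_tree_labels` and its parameters are illustrative, not the paper's TWRP algorithm:

```python
import numpy as np

def rp_tree_labels(X, depth=3, seed=0):
    """Assign each row of X a leaf id in [0, 2**depth) by recursively
    splitting on the sign of a random projection about its median.
    (Simplified: one global direction per level, not one per node.)"""
    rng = np.random.default_rng(seed)
    codes = np.zeros(len(X), dtype=int)
    for _ in range(depth):
        w = rng.standard_normal(X.shape[1])   # random direction
        proj = X @ w                          # 1-D projection of every point
        bit = proj > np.median(proj)          # balanced split at this level
        codes = codes * 2 + bit.astype(int)   # append one bit to each leaf code
    return codes
```

Points sharing a leaf code are cluster candidates; cluster centers can then be estimated as per-leaf means, without ever storing pairwise distances.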
 Award ID(s):
 1440420
 Publication Date:
 NSFPAR ID:
 10350966
 Journal Name:
 MLDM 2019: 15th International Conference on Machine Learning and Data Mining
 Sponsoring Org:
 National Science Foundation
More Like this


Motivation: Software engineering for High Performance Computing (HPC) environments in general [1], and for big data in particular [5], faces a set of unique challenges, including the high complexity of middleware and of computing environments. Tools that make it easier for scientists to utilize HPC are therefore of paramount importance. We provide an experience report of using one such highly effective middleware package, pbdR [9], which allows the scientist to use the R programming language without, at least nominally, having to master many layers of HPC infrastructure, such as OpenMPI [4] and ScaLAPACK [2]. Objective: To evaluate the extent to which middleware helps improve scientist productivity, we use pbdR to solve a real problem that we, as scientists, are investigating. Our big data comes from commits on GitHub and other project-hosting sites, and we are trying to cluster developers based on the text of these commit messages. Context: We need to be able to identify the developer for every commit and to identify the commits of a single developer. Developer identifiers in the commits, such as login, email, and name, are often spelled in multiple ways, since that information may come from different version control systems (Git, Mercurial, SVN, ...) and may …
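The identity-resolution problem described here, linking commits whose author fields are spelled differently, is commonly attacked by normalizing the identifier strings and merging records that share any key. A minimal sketch of that idea, using a union-find structure (all names here, `normalize`, `DSU`, `link_identities`, are hypothetical, and this is not the authors' pbdR pipeline):

```python
import re

def normalize(name):
    """Heuristic: lowercase and strip punctuation so 'Jane Doe' == 'jane.doe'."""
    return re.sub(r"[^a-z0-9 ]", " ", name.lower()).split() and \
           " ".join(re.sub(r"[^a-z0-9 ]", " ", name.lower()).split())

class DSU:
    """Union-find: merge records transitively linked by any shared key."""
    def __init__(self, n):
        self.p = list(range(n))
    def find(self, x):
        while self.p[x] != x:
            self.p[x] = self.p[self.p[x]]   # path halving
            x = self.p[x]
        return x
    def union(self, a, b):
        self.p[self.find(a)] = self.find(b)

def link_identities(records):
    """records: list of (name, email). Returns a component id per record;
    records sharing an email or a normalized name get the same id."""
    dsu = DSU(len(records))
    seen = {}
    for i, (name, email) in enumerate(records):
        for key in (("e", email.lower()), ("n", normalize(name))):
            if key in seen:
                dsu.union(i, seen[key])
            else:
                seen[key] = i
    return [dsu.find(i) for i in range(len(records))]
```

Once commits are grouped by resolved developer, the per-developer commit-message text can be vectorized and clustered as the report describes.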

Many quantum algorithms for machine learning require access to classical data in superposition. However, for many natural data sets and algorithms, the overhead required to load the data set in superposition can erase any potential quantum speedup over classical algorithms. Recent work by Harrow introduces a new paradigm in hybrid quantum-classical computing to address this issue, relying on coresets to minimize the data-loading overhead of quantum algorithms. We investigated using this paradigm to perform k-means clustering on near-term quantum computers by casting it as a QAOA optimization instance over a small coreset. We used numerical simulations to compare the performance of this approach to classical k-means clustering. We were able to find data sets with which coresets work well relative to random sampling and where QAOA could potentially outperform standard k-means on a coreset. However, finding data sets where both coresets and QAOA work well, which is necessary for a quantum advantage over k-means on the entire data set, appears to be challenging.
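A coreset replaces the full data set with a small weighted sample on which k-means gives provably similar cost. One simple classical construction is the lightweight coreset of Bachem et al., which mixes uniform sampling with sampling proportional to squared distance from the data mean; the sketch below (function name and parameters illustrative, and not the specific construction used in the QAOA work above) shows the idea:

```python
import numpy as np

def lightweight_coreset(X, m, seed=0):
    """Sample m weighted points from X (n x d) as a k-means coreset.
    Probability for point i: q_i = 1/(2n) + d(x_i, mean)^2 / (2 * sum of all d^2).
    Weight 1/(m * q_i) makes the weighted sample an unbiased cost estimator."""
    rng = np.random.default_rng(seed)
    mu = X.mean(axis=0)
    d2 = ((X - mu) ** 2).sum(axis=1)
    q = 0.5 / len(X) + 0.5 * d2 / d2.sum()
    idx = rng.choice(len(X), size=m, p=q)   # importance sampling
    w = 1.0 / (m * q[idx])                  # importance weights
    return X[idx], w
```

Running (weighted) k-means, or a QAOA instance, over the m sampled points then stands in for the full n-point problem, which is what keeps the quantum data-loading overhead small.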

Obeid, I.; Selesnik, I.; Picone, J. (Ed.) The Neuronix high-performance computing cluster allows us to conduct extensive machine learning experiments on big data [1]. This heterogeneous cluster uses innovative scheduling technology, Slurm [2], that manages a network of CPUs and graphics processing units (GPUs). The GPU farm consists of a variety of processors, ranging from low-end consumer-grade devices such as the Nvidia GTX 970 to higher-end devices such as the GeForce RTX 2080. These GPUs are essential to our research, since they allow extremely compute-intensive deep learning tasks to be executed on massive data resources such as the TUH EEG Corpus [2]. We use TensorFlow [3] as the core machine learning library for our deep learning systems, and routinely employ multiple GPUs to accelerate the training process. Reproducible results are essential to machine learning research. Reproducibility in this context means the ability to replicate an existing experiment: performance metrics such as error rates should be identical, and floating-point calculations should match closely. Three examples of ways we typically expect an experiment to be replicable are: (1) the same job run on the same processor should produce the same results each time it is run; (2) a job run on a CPU and GPU should produce …
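The baseline form of reproducibility described here, the same job on the same processor yielding bit-identical results, reduces to controlling every source of randomness with an explicit seed. A minimal illustration with a toy "training run" (the function `train_step` is hypothetical and stands in for a real seeded pipeline; real GPU training additionally requires deterministic kernel settings in the framework):

```python
import numpy as np

def train_step(seed):
    """Toy stand-in for one training run whose result depends only on the seed.
    With a fixed seed, the floating-point result is bit-identical across runs
    on the same processor."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((16, 16))   # "weights"
    x = rng.standard_normal(16)         # "input"
    return float(np.tanh(W @ x).sum())  # deterministic given W and x
```

Discrepancies between CPU and GPU runs, by contrast, come from differing floating-point reduction orders, which is why the abstract asks only that those calculations "match closely" rather than exactly.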

Advances in single-cell transcriptomics have allowed us to study the identity of single cells. This has led to the discovery of new cell types and high-resolution tissue maps of them. Technologies that measure multiple modalities of such data add more detail, but they also complicate data integration. We offer an integrated analysis of the spatial location and gene expression profiles of cells to determine their identity. We propose scHybridNMF (single-cell Hybrid Nonnegative Matrix Factorization), which performs cell type identification by combining sparse nonnegative matrix factorization (sparse NMF) with k-means clustering to cluster high-dimensional gene expression data and low-dimensional location data. We show that, under multiple scenarios, including cases where only a small number of genes are profiled and the location data is noisy, scHybridNMF outperforms sparse NMF, k-means, and an existing method that uses a hidden Markov random field to encode cell location and gene expression data for cell type identification.