skip to main content


Title: Clustering Data in Secured, Distributed Datasets
The massive growth in data generation and collection has brought to the forefront the necessity to develop mechanized methods to analyze and extract information from them. Data clustering is one of the fundamental modes to discover new insights from data. However, high dimensional data has its own challenges where many conventional clustering algorithms fails either in accuracy or scalability. To further complicate the issue, distinct subsets of sensitive data may reside in geographically separated locations with the sensitive nature of the data preventing (or inhibiting) its access for mechanized analysis. Thus, methods to discover information from the collective whole of these secured, distributed data sets that also preserves the integrity of the data must be found. In this paper we develop and assess a distributed algorithm that can cluster geographically separated data while simultaneously preserving the strict privacy requirements of non sharing of protected high dimensional data. We implement our algorithm on the distributed map-reduce based platform Spark and demonstrate its performance by comparing it to the standard data clustering algorithms.  more » « less
Award ID(s):
1440420
NSF-PAR ID:
10193706
Author(s) / Creator(s):
Date Published:
Journal Name:
The Third International Workshop on Parallel and Distributed Data Mining
Page Range / eLocation ID:
557-572
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. The size and amount of data captured from numerous sources has created a situation where the large quantity of data challenges our ability to understand the meaning within the data. This has motivated studies for mechanized data analysis and in particular for the clustering, or partitioning, of data into related groups. In fact, the size of the data has grown to the point where it is now often necessary to stream the data through the system for online and high speed analysis. This paper explores the application of approximate methods for the stream clustering of high-dimensional data (feature sizes contains 100+ measures). In particular, the algorithm that has been developed, called streamingRPHash, combines Random Projection with Locality Sensitive Hashing and a count-min sketch to implement a high-performance method for the parallel and distributed clustering of streaming data in a MapReduce framework. streamingRPHash is able to perform clustering at a rate much faster than traditional clustering algorithms such as K-Means. streamingRPHash provides clustering results that are only slightly less accurate than K-Means, but in runtimes that are nearly half that required by K-Means. The performance advantage for streamingRPHash becomes even more significant as the dimensionality of the input data stream increases. Furthermore, the experimental results show that streamingRPHash has a near linear speedup relative to the number of CPU cores. This speedup efficiency is possible because the approximate methods used in streamingRPHash allow independent and largely unsynchronized analyses to be performed on each streamed data vectors. 
    more » « less
  2. Clustering streaming data has gained importance in recent years due to an expanding opportunity to discover knowledge in widely available data streams. As streams are potentially evolving and unbounded sequence of data objects, clustering algorithms capable of performing fast and incremental processing of data points are necessary. This paper presents a method of clustering high-dimensional data streams using approximate methods called streamingRPHash. streamingRPHash combines random projections with locality-sensitivity hashing to construct a high-performance clustering method. streamingRPHash is amenable to distributed processing frameworks such as Map-Reduce, and also has the benefits of constrained overall complexity growth. This paper describes streamingRPHash algorithm and its various configurations. The clustering performance of streamingRPHash is compared to several alternatives. Experimental results show that streamingRPHash has comparable clustering accuracy and substantially lower runtime and memory usage. 
    more » « less
  3. In mixed multi-view data, multiple sets of diverse features are measured on the same set of samples. By integrating all available data sources, we seek to discover common group structure among the samples that may be hidden in individualistic cluster analyses of a single data view. While several techniques for such integrative clustering have been explored, we propose and develop a convex formalization that enjoys strong empirical performance and inherits the mathematical properties of increasingly popular convex clustering methods. Specifically, our Integrative Generalized Convex Clustering Optimization (iGecco) method employs different convex distances, losses, or divergences for each of the different data views with a joint convex fusion penalty that leads to common groups. Additionally, integrating mixed multi-view data is often challenging when each data source is high-dimensional. To perform feature selection in such scenarios, we develop an adaptive shifted group-lasso penalty that selects features by shrinking them towards their loss-specific centers. Our so-called iGecco+ approach selects features from each data view that are best for determining the groups, often leading to improved integrative clustering. To solve our problem, we develop a new type of generalized multi-block ADMM algorithm using sub-problem approximations that more efficiently fits our model for big data sets. Through a series of numerical experiments and real data examples on text mining and genomics, we show that iGecco+ achieves superior empirical performance for high-dimensional mixed multi-view data. 
    more » « less
  4. Abstract Background

    In Alzheimer’s Diseases (AD) research, multimodal imaging analysis can unveil complementary information from multiple imaging modalities and further our understanding of the disease. One application is to discover disease subtypes using unsupervised clustering. However, existing clustering methods are often applied to input features directly, and could suffer from the curse of dimensionality with high-dimensional multimodal data. The purpose of our study is to identify multimodal imaging-driven subtypes in Mild Cognitive Impairment (MCI) participants using a multiview learning framework based on Deep Generalized Canonical Correlation Analysis (DGCCA), to learn shared latent representation with low dimensions from 3 neuroimaging modalities.

    Results

    DGCCA applies non-linear transformation to input views using neural networks and is able to learn correlated embeddings with low dimensions that capture more variance than its linear counterpart, generalized CCA (GCCA). We designed experiments to compare DGCCA embeddings with single modality features and GCCA embeddings by generating 2 subtypes from each feature set using unsupervised clustering. In our validation studies, we found that amyloid PET imaging has the most discriminative features compared with structural MRI and FDG PET which DGCCA learns from but not GCCA. DGCCA subtypes show differential measures in 5 cognitive assessments, 6 brain volume measures, and conversion to AD patterns. In addition, DGCCA MCI subtypes confirmed AD genetic markers with strong signals that existing late MCI group did not identify.

    Conclusion

    Overall, DGCCA is able to learn effective low dimensional embeddings from multimodal data by learning non-linear projections. MCI subtypes generated from DGCCA embeddings are different from existing early and late MCI groups and show most similarity with those identified by amyloid PET features. In our validation studies, DGCCA subtypes show distinct patterns in cognitive measures, brain volumes, and are able to identify AD genetic markers. These findings indicate the promise of the imaging-driven subtypes and their power in revealing disease structures beyond early and late stage MCI.

     
    more » « less
  5. Motivation: Software engineering for High Performace Computing (HPC) environments in general [1] and for big data in particular [5] faces a set of unique challenges including high complexity of middleware and of computing environments. Tools that make it easier for scientists to utilize HPC are, therefore, of paramount importance. We provide an experience report of using one of such highly effective middleware pbdR [9] that allow the scientist to use R programming language without, at least nominally, having to master many layers of HPC infrastructure, such as OpenMPI [4] and ScalaPACK [2]. Objective: to evaluate the extent to which middleware helps improve scientist productivity, we use pbdR to solve a real problem that we, as scientists, are investigating. Our big data comes from the commits on GitHub and other project hosting sites and we are trying to cluster developers based on the text of these commit messages. Context: We need to be able to identify developer for every commit and to identify commits for a single developer. Developer identifiers in the commits, such as login, email, and name are often spelled in multiple ways since that information may come from different version control systems (Git, Mercurial, SVN, ...) and may depend on which computer is used (what is specified in .git/config of the home folder). Method: We train Doc2Vec [7] model where existing credentials are used as a document identifier and then use the resulting 200-dimensional vectors for the 2.3M identifiers to cluster these identifiers so that each cluster represents a specific individual. The distance matrix occupies 32TB and, therefore, is a good target for HPC in general and pbdR in particular. pbdR allows data to be distributed over computing nodes and even has implemented K-means and mixture-model clustering techniques in the package pmclust. Results: We used strategic prototyping [3] to evaluate the capabilities of pbdR and discovered that a) the use of middleware required extensive understanding of its inner workings thus negating many of the expected benefits; b) the implemented algorithms were not suitable for the particular combination of n, p, and k (sample size, data dimension, and the number of clusters); c) the development environment based on batch jobs increases development time substantially. Conclusions: In addition to finding from Basili et al., we find that the quality of the implementation of HPC infrastructure and its development environment has a tremendous effect on development productivity. 
    more » « less