Massively Scalable Parallel KMeans on the HPCC Systems Platform

Xu, Lili; Apon, Amy; Villanustre, Flavio; Dev, Roger; Chala, Arjuna

doi:10.1109/CSITSS47250.2019.9031047

Citation Details

Clustering algorithms are an important part of unsupervised machine learning. With Big Data, applying clustering algorithms such as KMeans has become a challenge because of the significantly larger volume of data and the computational complexity of the standard approach, Lloyd's algorithm. This work tackles that challenge by transforming the classic KMeans clustering algorithm so that it is highly scalable and can operate on Big Data. We leverage the distributed computing environment of the HPCC Systems platform. The presented KMeans algorithm adopts a hybrid parallelism method to achieve a massively scalable parallel KMeans. Our approach can save significant time for researchers and machine learning practitioners who train hundreds of models on a daily basis. Performance is evaluated with datasets and clusters of different sizes, and the results show significant scalability of the parallel KMeans algorithm.

Award ID(s): 1725573

PAR ID: 10201358

Author(s) / Creator(s): Xu, Lili; Apon, Amy; Villanustre, Flavio; Dev, Roger; Chala, Arjuna

Date Published: 2019-12-01

Journal Name: 2019 4th International Conference on Computational Systems and Information Technology for Sustainable Solution

Page Range / eLocation ID: 1 to 8

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Conference Paper:
https://doi.org/10.1109/CSITSS47250.2019.9031047
