NSF PAR Search | NSF Public Access Repository

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

Supercharging distributed computing environments for high-performance data engineering

https://doi.org/10.3389/fhpcp.2024.1384619

Perera, Niranda; Sarker, Arup Kumar; Shan, Kaiying; Fetea, Alex; Kamburugamuve, Supun; Kanewala, Thejaka Amila; Widanage, Chathura; Staylor, Mills; Zhong, Tianle; Abeykoon, Vibhatha; et al (July 2024, Frontiers in High Performance Computing)

The data engineering and data science community has embraced the idea of using Python and R dataframes for regular applications. Driven by the big data revolution and artificial intelligence, these frameworks are now ever more important in order to process terabytes of data. They can easily exceed the capabilities of a single machine but also demand significant developer time and effort due to their convenience and ability to manipulate data with high-level abstractions that can be optimized. Therefore it is essential to design scalable dataframe solutions. There have been multiple efforts to be integrated into the most efficient fashion to tackle this problem, the most notable being the dataframe systems developed using distributed computing environments such as Dask and Ray. Even though Dask and Ray's distributed computing features look very promising, we perceive that the Dask Dataframes and Ray Datasets still have room for optimization In this paper, we present CylonFlow, an alternative distributed dataframe execution methodology that enables state-of-the-art performance and scalability on the same Dask and Ray infrastructure (superchargingthem!). To achieve this, we integrate ahigh-performance dataframesystem Cylon, which was originally based on an entirely different execution paradigm, into Dask and Ray. Our experiments show that on a pipeline of dataframe operators, CylonFlow achieves 30 × more distributed performance than Dask Dataframes. Interestingly, it also enables superior sequential performance due to leveraging the native C++ execution of Cylon. We believe the performance of Cylon in conjunction with CylonFlow extends beyond the data engineering domain and can be used to consolidate high-performance computing and distributed computing ecosystems.
more » « less
Full Text Available
In-depth analysis on parallel processing patterns for high-performance Dataframes

https://doi.org/10.1016/j.future.2023.07.007

Perera, Niranda; Sarker, Arup Kumar; Staylor, Mills; von Laszewski, Gregor; Shan, Kaiying; Kamburugamuve, Supun; Widanage, Chathura; Abeykoon, Vibhatha; Kanewela, Thejaka Amila; Fox, Geoffrey (December 2023, Future Generation Computer Systems)

Full Text Available
Hybrid Cloud and HPC Approach to High-Performance Dataframes

https://doi.org/10.1109/BigData55660.2022.10020958

Shan, Kaiying; Perera, Niranda; Lenadora, Damitha; Zhong, Tianle; Kumar Sarker, Arup; Kamburugamuve, Supun; Amila Kanewela, Thejaka; Widanage, Chathura; Fox, Geoffrey (December 2022, 2022 IEEE International Conference on Big Data (Big Data))
HPTMT Parallel Operators for High Performance Data Science and Data Engineering

https://doi.org/10.3389/fdata.2021.756041

Abeykoon, Vibhatha; Kamburugamuve, Supun; Widanage, Chathura; Perera, Niranda; Uyar, Ahmet; Kanewala, Thejaka Amila; von Laszewski, Gregor; Fox, Geoffrey (February 2022, Frontiers in Big Data)

Data-intensive applications are becoming commonplace in all science disciplines. They are comprised of a rich set of sub-domains such as data engineering, deep learning, and machine learning. These applications are built around efficient data abstractions and operators that suit the applications of different domains. Often lack of a clear definition of data structures and operators in the field has led to other implementations that do not work well together. The HPTMT architecture that we proposed recently, identifies a set of data structures, operators, and an execution model for creating rich data applications that links all aspects of data engineering and data science together efficiently. This paper elaborates and illustrates this architecture using an end-to-end application with deep learning and data engineering parts working together. Our analysis show that the proposed system architecture is better suited for high performance computing environments compared to the current big data processing systems. Furthermore our proposed system emphasizes the importance of efficient compact data structures such as Apache Arrow tabular data representation defined for high performance. Thus the system integration we proposed scales a sequential computation to a distributed computation retaining optimum performance along with highly usable application programming interface.
more » « less
Full Text Available
HPTMT: Operator-Based Architecture for Scalable High-Performance Data-Intensive Frameworks

https://doi.org/10.1109/CLOUD53861.2021.00036

Kamburugamuve, Supun; Widanage, Chathura; Perera, Niranda; Abeykoon, Vibhatha; Uyar, Ahmet; Kanewala, Thejaka Amila; Von Laszewski, Gregor; Fox, Geoffrey (September 2021, 2021 IEEE 14th International Conference on Cloud Computing (CLOUD))

Full Text Available
A Fast, Scalable, Universal Approach For Distributed Data Aggregations

https://doi.org/10.1109/BigData50022.2020.9378124

Perera, Niranda; Abeykoon, Vibhatha; Widanage, Chathura; Kamburugamuve, Supun; Kanewala, Thejaka Amila; Wickramasinghe, Pulasthi; Uyar, Ahmet; Maithree, Hasara; Lenadora, Damitha; Fox, Geoffrey (March 2021, 2020 IEEE International Conference on Big Data (Big Data))

Full Text Available
Data Engineering for HPC with Python

https://doi.org/10.1109/PyHPC51966.2020.00007

Abeykoon, Vibhatha; Perera, Niranda; Widanage, Chathura; Kamburugamuve, Supun; Kanewala, Thejaka Amila; Maithree, Hasara; Wickramasinghe, Pulasthi; Uyar, Ahmet; Fox, Geoffrey (November 2020, 2020 IEEE/ACM 9th Workshop on Python for High-Performance and Scientific Computing (PyHPC))
null (Ed.)
Data engineering is becoming an increasingly important part of scientific discoveries with the adoption of deep learning and machine learning. Data engineering deals with a variety of data formats, storage, data extraction, transformation, and data movements. One goal of data engineering is to transform data from original data to vector/matrix/tensor formats accepted by deep learning and machine learning applications. There are many structures such as tables, graphs, and trees to represent data in these data engineering phases. Among them, tables are a versatile and commonly used format to load and process data. In this paper, we present a distributed Python API based on table abstraction for representing and processing data. Unlike existing state-of-the-art data engineering tools written purely in Python, our solution adopts high performance compute kernels in C++, with an in-memory table representation with Cython-based Python bindings. In the core system, we use MPI for distributed memory computations with a data-parallel approach for processing large datasets in HPC clusters.
more » « less
Full Text Available
High Performance Data Engineering Everywhere

https://doi.org/10.1109/SMDS49396.2020.00022

Widanage, Chathura; Perera, Niranda; Abeykoon, Vibhatha; Kamburugamuve, Supun; Kanewala, Thejaka Amila; Maithree, Hasara; Wickramasinghe, Pulasthi; Uyar, Ahmet; Gunduz, Gurhan; Fox, Geoffrey (October 2020, 2020 IEEE International Conference on Smart Data Services (SMDS))
null (Ed.)
The amazing advances being made in the fields of machine and deep learning are a highlight of the Big Data era for both enterprise and research communities. Modern applications require resources beyond a single node's ability to provide. However this is just a small part of the issues facing the overall data processing environment, which must also support a raft of data engineering for pre- and post-data processing, communication, and system integration. An important requirement of data analytics tools is to be able to easily integrate with existing frameworks in a multitude of languages, thereby increasing user productivity and efficiency. All this demands an efficient and highly distributed integrated approach for data processing, yet many of today's popular data analytics tools are unable to satisfy all these requirements at the same time. In this paper we present Cylon, an open-source high performance distributed data processing library that can be seamlessly integrated with existing Big Data and AI/ML frameworks. It is developed with a flexible C++ core on top of a compact data structure and exposes language bindings to C++, Java, and Python. We discuss Cylon's architecture in detail, and reveal how it can be imported as a library to existing applications or operate as a standalone framework. Initial experiments show that Cylon enhances popular tools such as Apache Spark and Dask with major performance improvements for key operations and better component linkages. Finally, we show how its design enables Cylon to be used cross-platform with minimum overhead, which includes popular AI tools such as PyTorch, Tensorflow, and Jupyter notebooks.
more » « less
Full Text Available
Twister2 Cross‐platform resource scheduler for big data

https://doi.org/10.1002/cpe.6502

Uyar, Ahmet; Gunduz, Gurhan; Kamburugamuve, Supun; Wickramasinghe, Pulasthi; Widanage, Chathura; Govindarajan, Kannan; Perera, Niranda; Abeykoon, Vibhatha; Akkas, Selahattin; Fox, Geoffrey (July 2021, Concurrency and Computation: Practice and Experience)

Abstract Twister2 is an open‐source big data hosting environment designed to process both batch and streaming data at scale. Twister2 runs jobs in both high‐performance computing (HPC) and big data clusters. It provides a cross‐platform resource scheduler to run jobs in diverse environments. Twister2 is designed with a layered architecture to support various clusters and big data problems. In this paper, we present the cross‐platform resource scheduler of Twister2. We identify required services and explain implementation details. We present job startup delays for single jobs and multiple concurrent jobs in Kubernetes and OpenMPI clusters. We compare job startup delays for Twister2 and Spark at a Kubernetes cluster. In addition, we compare the performance of terasort algorithm on Kubernetes and bare metal clusters at AWS cloud.
more » « less

Search for: All records