NSF PAR Search | NSF Public Access Repository


Deep RC: A Scalable Data Engineering and Deep Learning Pipeline

Sarker, Arup; Alsaadi, Aymen; Halpern, Alexander; Tangella, Prabhath; Titov, Mikhail; Perera, Niranda; Staylor, Mills; von_Laszewski, Gregor; Jha, Shantenu; Fox, Geoffrey (June 2025, 28th edition of the workshop on Job Scheduling Strategies for Parallel Processing. JSSPP 2025 https://jsspp.org/)

Scientific domains such as genetics, climate modeling, and astronomy face significant obstacles in managing, preprocessing, and training on complex data for deep learning. Although several large-scale solutions offer distributed execution environments, open-source alternatives that integrate scalable runtime tools, deep learning frameworks, and data frameworks on high-performance computing (HPC) platforms remain crucial for accessibility and flexibility. In this paper, we introduce Deep Radical-Cylon (RC), a heterogeneous runtime system that combines data engineering, deep learning frameworks, and workflow engines across several HPC environments, including cloud and supercomputing infrastructures. Deep RC supports heterogeneous systems with accelerators, allows the use of communication libraries such as MPI, GLOO, and NCCL across multi-node setups, and facilitates parallel and distributed deep learning pipelines by using Radical Pilot as a task execution framework. On an end-to-end pipeline comprising preprocessing, model training, and postprocessing, run with 11 neural forecasting models (PyTorch) and hydrology models (TensorFlow) under identical resource conditions, the system saves 3.28 and 75.9 seconds, respectively. The design of Deep RC ensures the smooth integration of scalable data frameworks, such as Cylon, with deep learning processes, and shows strong performance on cloud platforms and scientific HPC systems. By offering a flexible, high-performance solution for resource-intensive applications, this approach closes the gap between data preprocessing, model training, and postprocessing.
Free, publicly-accessible full text available June 7, 2026
Full Text Available
Radical-Cylon: A Heterogeneous Data Pipeline for Scientific Computing

https://doi.org/10.1007/978-3-031-74430-3_5

Sarker, Arup Kumar; Alsaadi, Aymen; Perera, Niranda; Staylor, Mills; von_Laszewski, Gregor; Turilli, Matteo; Kilic, Ozgur Ozan; Titov, Mikhail; Merzky, Andre; Jha, Shantenu; et al (December 2024, Springer Nature Switzerland)

Managing and preparing complex data for deep learning, a prevalent approach in large-scale data science, can be challenging. Data transfer for model training also presents difficulties, impacting scientific fields like genomics, climate modeling, and astronomy. A large-scale solution like Google Pathways, with a distributed execution environment for deep learning models, exists but is proprietary. Integrating existing open-source, scalable runtime tools and data frameworks on high-performance computing (HPC) platforms is crucial to address these challenges. Our objective is to establish a smooth and unified method of combining data engineering and deep learning frameworks with diverse execution capabilities that can be deployed on various high-performance computing platforms, including cloud and supercomputers. We aim to support heterogeneous systems with accelerators, where Cylon and other data engineering and deep learning frameworks can utilize heterogeneous execution. To achieve this, we propose Radical-Cylon, a heterogeneous runtime system with a parallel and distributed data framework to execute Cylon as a task of Radical Pilot. We thoroughly explain Radical-Cylon’s design and development and the execution process of Cylon tasks using Radical Pilot. This approach enables the use of heterogeneous MPI-Communicators across multiple nodes. Radical-Cylon achieves better performance than Bare-Metal Cylon with minimal and constant overhead. Radical-Cylon achieves 4–15% faster execution time than batch execution while performing similar join and sort operations with 35 million and 3.5 billion rows with the same resources. The approach aims to excel in both scientific and engineering research HPC systems while demonstrating robust performance on cloud infrastructures. This dual capability fosters collaboration and innovation within the open-source scientific research community.
Full Text Available
Supercharging distributed computing environments for high-performance data engineering

https://doi.org/10.3389/fhpcp.2024.1384619

Perera, Niranda; Sarker, Arup Kumar; Shan, Kaiying; Fetea, Alex; Kamburugamuve, Supun; Kanewala, Thejaka Amila; Widanage, Chathura; Staylor, Mills; Zhong, Tianle; Abeykoon, Vibhatha; et al (July 2024, Frontiers in High Performance Computing)

The data engineering and data science community has embraced the idea of using Python and R dataframes for regular applications. Driven by the big data revolution and artificial intelligence, these frameworks are now ever more important in order to process terabytes of data. They can easily exceed the capabilities of a single machine but also demand significant developer time and effort due to their convenience and ability to manipulate data with high-level abstractions that can be optimized. Therefore, it is essential to design scalable dataframe solutions. There have been multiple efforts to tackle this problem efficiently, the most notable being the dataframe systems developed using distributed computing environments such as Dask and Ray. Even though Dask and Ray's distributed computing features look very promising, we perceive that Dask Dataframes and Ray Datasets still have room for optimization. In this paper, we present CylonFlow, an alternative distributed dataframe execution methodology that enables state-of-the-art performance and scalability on the same Dask and Ray infrastructure (supercharging them!). To achieve this, we integrate a high-performance dataframe system, Cylon, which was originally based on an entirely different execution paradigm, into Dask and Ray. Our experiments show that on a pipeline of dataframe operators, CylonFlow achieves 30× more distributed performance than Dask Dataframes. Interestingly, it also enables superior sequential performance due to leveraging the native C++ execution of Cylon. We believe the performance of Cylon in conjunction with CylonFlow extends beyond the data engineering domain and can be used to consolidate high-performance computing and distributed computing ecosystems.