NSF PAR Search | NSF Public Access Repository

A Transducers-based Programming Framework for Efficient Data Transformation

https://doi.org/10.1145/3656019.3676891

Nguyen, Tri; Becchi, Michela (October 2024, ACM)

Many data analytics and scientific applications rely on data transformation tasks, such as encoding, decoding, parsing of structured and unstructured data, and conversions between data formats and layouts. Previous work has shown that data transformation can be a performance bottleneck for data analytics workloads. The transducer computational abstraction can be used to express a wide range of data transformations, and recent efforts have proposed configurable engines implementing various transducer models (from finite state transducers, to pushdown transducers, to extended models). This line of research, however, is still at an early stage. Notably, expressing data transformations using transducers requires a paradigm shift, which impacts programmability. To address this problem, we propose a programming framework to map data transformation tasks onto a variety of transducer models. Our framework includes: (1) a platform-agnostic programming language (xPTLang) to code transducer programs using intuitive programming constructs, and (2) a compiler that, given an xPTLang program, generates efficient transducer processing engines for CPU and GPU. Our compiler includes a set of optimizations to improve code efficiency. We demonstrate our framework on a diverse set of data transformation tasks on an Intel CPU and an Nvidia GPU.
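
To make the transducer abstraction concrete, here is a minimal sketch of a two-state finite state transducer in Python. It is not xPTLang and not code produced by the framework's compiler; the task (turning commas outside quotes into tabs) and the names `step` and `transduce` are assumptions chosen purely for illustration.

```python
# Hypothetical two-state finite state transducer (not xPTLang).
OUTSIDE, INSIDE = "outside", "inside"  # state: are we inside a quoted field?

def step(state, ch):
    """Transition function: (state, input symbol) -> (next state, output)."""
    if ch == '"':
        # A quote toggles the state and emits nothing.
        return (INSIDE if state == OUTSIDE else OUTSIDE), ""
    if ch == "," and state == OUTSIDE:
        # A comma outside quotes is a field separator: emit a tab instead.
        return state, "\t"
    # Every other symbol is copied through unchanged.
    return state, ch

def transduce(text, start=OUTSIDE):
    """Feed the input through the transducer one symbol at a time."""
    state, out = start, []
    for ch in text:
        state, emitted = step(state, ch)
        out.append(emitted)
    return "".join(out)
```

The whole transformation is captured by the transition function, which is the property that lets a generic engine execute many different transformations.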
Full Text Available
Significantly Improving Fixed-Ratio Compression Framework for Resource-limited Applications

https://doi.org/10.1145/3673038.3673092

Nguyen, Tri; Rahman, Md Hasanur; Di, Sheng; Becchi, Michela (August 2024, ACM)

Scientific simulations running on HPC facilities generate massive amounts of data, putting significant pressure on supercomputers’ storage capacity and network bandwidth. To alleviate this problem, there has been a rich body of work on reducing data volumes via error-controlled lossy compression. However, fixed-ratio compression is not well supported, which prevents users from appropriately allocating memory/storage space or knowing the data transfer time over the network in advance. To address this problem, recent ratio-controlled frameworks, such as FXRZ, have incorporated methods to predict the error bound settings required to reach a user-specified compression ratio. However, these approaches fail to achieve fixed-ratio compression in an accurate, efficient, and scalable fashion on diverse datasets and compression algorithms. This work proposes an efficient, scalable, ratio-controlled lossy compression framework (CAROL). At the core of CAROL are four optimization strategies that improve prediction accuracy and runtime efficiency over state-of-the-art solutions. First, CAROL uses surrogate-based compression ratio estimation to generate training data. Second, it includes a novel calibration method to improve prediction accuracy across a variety of compressors. Third, it leverages Bayesian optimization to allow for efficient training and incremental model refinement. Fourth, it uses GPU acceleration to speed up prediction. We evaluate CAROL on four compression algorithms and six scientific datasets. On average, when compared to the state-of-the-art FXRZ framework, CAROL achieves a 4× speedup in setup time and a 36× speedup in inference time, while maintaining less than 1% difference in estimation accuracy.
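
The fixed-ratio problem itself can be illustrated with a brute-force baseline: repeatedly compress the data while bisecting on the error bound until the ratio approaches the target. The Python sketch below is not CAROL or FXRZ, which predict the error bound instead of searching for it. The toy compressor (uniform quantization followed by `zlib`) and every function name are assumptions, and the search assumes the ratio grows with the error bound.

```python
import math
import struct
import zlib

def toy_compress(values, error_bound):
    """Toy error-bounded compressor (an assumption, not a real scientific codec)."""
    # Rounding to multiples of 2*error_bound keeps each value within error_bound.
    codes = [round(v / (2.0 * error_bound)) for v in values]
    # Pack the integer codes as 64-bit ints and let zlib remove the redundancy.
    return zlib.compress(struct.pack(f"{len(codes)}q", *codes))

def compression_ratio(values, error_bound):
    raw_bytes = 8 * len(values)  # input counted as 64-bit floats
    return raw_bytes / len(toy_compress(values, error_bound))

def search_error_bound(values, target_ratio, lo=1e-9, hi=1.0, steps=40):
    """Bisect (in log space) for an error bound whose ratio is near the target."""
    for _ in range(steps):
        mid = math.sqrt(lo * hi)  # geometric midpoint of the current bracket
        if compression_ratio(values, mid) < target_ratio:
            lo = mid  # ratio too low: a looser bound is needed
        else:
            hi = mid  # ratio high enough: try a tighter bound
    return hi
```

Each bisection step runs a full compression pass, which is the cost that prediction-based frameworks aim to avoid.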
Full Text Available
FuseIM: Fusing probabilistic traversals for influence maximization on exascale systems

Neff, Reece; Zach, Mostafa; Minutoli, Marco; Halappanavar, Mahantesh; Tumeo, Antonino; Kalyanaraman, Ananth; Becchi, Michela (August 2024, ICS 2024)

Full Text Available
A Portable, Fast, DCT-based Compressor for AI Accelerators

https://doi.org/10.1145/3625549.3658662

Shah, Milan; Yu, Xiaodong; Di, Sheng; Becchi, Michela; Cappello, Franck (June 2024, ACM)

Full Text Available
A Code Transformation to Improve the Efficiency of OpenCL Code on FPGA through Pipes

https://doi.org/10.1145/3587135.3592210

Zarch, Mostafa Eghbali; Becchi, Michela (May 2023, CF '23: Proceedings of the 20th ACM International Conference on Computing Frontiers)

Over the past few years, there has been increased interest in using FPGAs alongside CPUs and GPUs in high-performance computing systems and data centers. This trend has led to a push toward the use of high-level programming models and libraries, such as OpenCL, both to lower the barriers to the adoption of FPGAs by programmers unfamiliar with hardware description languages, and to allow a single code base to be deployed seamlessly on different devices. Today, both Intel and Xilinx offer toolchains to compile OpenCL code onto FPGA. However, using OpenCL on FPGAs is complicated by performance portability issues, since different devices differ fundamentally in their architecture and in the nature of the hardware parallelism they offer. Hence, platform-specific optimizations are crucial to achieving good performance across devices. In this paper, we propose a code transformation to improve the performance of OpenCL codes running on FPGA. The proposed method uses pipes to separate the memory accesses and core computation within OpenCL kernels. We analyze the benefits of the approach as well as the restrictions to its applicability. Using OpenCL applications from popular benchmark suites, we show that this code transformation can result in higher utilization of the available global memory bandwidth and increased instruction concurrency, thus improving the overall throughput of OpenCL kernels at the cost of a modest resource utilization overhead. Further concurrency can be achieved by using multiple memory and compute kernels.
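
The idea of splitting a kernel into a memory stage and a compute stage connected by a pipe can be sketched by analogy in Python, with threads standing in for kernels and a bounded queue standing in for the pipe. This is not OpenCL and not the paper's transformation; `memory_kernel`, `compute_kernel`, `run_split`, and the squaring workload are hypothetical names chosen for illustration.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the stream

def memory_kernel(src, pipe):
    """Memory stage: only reads input elements and pushes them into the pipe."""
    for x in src:
        pipe.put(x)  # blocks when the pipe is full, like a bounded FIFO
    pipe.put(_DONE)

def compute_kernel(pipe, out):
    """Compute stage: only consumes from the pipe and does the arithmetic."""
    while True:
        x = pipe.get()
        if x is _DONE:
            break
        out.append(x * x)  # placeholder computation

def run_split(src, depth=8):
    """Run both stages concurrently, connected by a pipe of the given depth."""
    pipe = queue.Queue(maxsize=depth)
    out = []
    reader = threading.Thread(target=memory_kernel, args=(src, pipe))
    worker = threading.Thread(target=compute_kernel, args=(pipe, out))
    reader.start()
    worker.start()
    reader.join()
    worker.join()
    return out
```

Because the two stages communicate only through the pipe, the reader can run ahead of the consumer by up to `depth` elements, which is the decoupling the abstract describes.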
Full Text Available
Runway: In-transit Data Compression on Heterogeneous HPC Systems

https://doi.org/10.1109/CCGrid57682.2023.00030

Ravi, John; Byna, Suren; Becchi, Michela (May 2023, 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid))

To alleviate bottlenecks in storing and accessing data on high-performance computing (HPC) systems, I/O libraries are enabling computation while data is in-transit, such as HDFS filters. For scientific applications that commonly use floating-point data, error-bounded lossy compression methods are a critical technique to significantly reduce the storage and bandwidth requirements. Thus far, deciding when and where to schedule in-transit data transformations, such as compression, has been outside the scope of I/O libraries. In this paper, we introduce Runway, a runtime framework that enables computation on in-transit data with an object storage abstraction. Runway is designed to be extensible to execute user-defined functions at runtime. In this effort, we focus on studying methods to offload data compression operations to available processing units based on latency and throughput. We compare the performance of running compression on multi-core CPUs, as well as offloading it to a GPU and a Data Processing Unit (DPU). We implement a state-of-the-art error-bounded lossy compression algorithm, SZ3, as a Runway function with a variant optimized for DPUs. We propose dynamic modeling to guide scheduling decisions for in-transit data compression. We evaluate Runway using four scientific datasets from the SDRBench benchmark suite on the Perlmutter supercomputer at NERSC.
Full Text Available
Lightweight Huffman Coding for Efficient GPU Compression

https://doi.org/10.1145/3577193.3593736

Shah, Milan; Yu, Xiaodong; Di, Sheng; Becchi, Michela; Cappello, Franck (June 2023, ICS '23: Proceedings of the 37th International Conference on Supercomputing)

Full Text Available
High-Level Synthesis of Irregular Applications: A Case Study on Influence Maximization

https://doi.org/10.1145/3587135.3592196

Neff, Reece; Minutoli, Marco; Tumeo, Antonino; Becchi, Michela (May 2023, CF '23: Proceedings of the 20th ACM International Conference on Computing Frontiers)

FPGAs are promising platforms for accelerating irregular applications due to their ability to implement highly specialized hardware designs for each kernel. However, the design and implementation of FPGA-accelerated kernels can take several months using hardware design languages. High Level Synthesis (HLS) tools provide fast, high quality results for regular applications, but lack the support to effectively accelerate more irregular, complex workloads. This work analyzes the challenges and benefits of using a commercial state-of-the-art HLS tool and its available optimizations to accelerate graph sampling. We evaluate the resulting designs and their effectiveness when deployed in a state-of-the-art heterogeneous framework that implements the Influence Maximization with Martingales (IMM) algorithm, a complex graph analytics algorithm. We discuss future opportunities for improvement in hardware, HLS tools, and hardware/software co-design methodology to better support complex irregular applications such as IMM.
Full Text Available
Evaluating Asynchronous Parallel I/O on HPC Systems

https://doi.org/10.1109/IPDPS54959.2023.00030

Ravi, John; Byna, Suren; Koziol, Quincey; Tang, Houjun; Becchi, Michela (May 2023, 10.1109/IPDPS54959.2023.00030)

Parallel I/O is an effective method to optimize data movement between memory and storage for many scientific applications. Poor performance of traditional disk-based file systems has led to the design of I/O libraries which take advantage of faster memory layers, such as on-node memory, present in high-performance computing (HPC) systems. By allowing caching and prefetching of data for applications alternating computation and I/O phases, a faster memory layer also provides opportunities for hiding the latency of I/O phases by overlapping them with computation phases, a technique called asynchronous I/O. Since asynchronous parallel I/O in HPC systems is still in the initial stages of development, there hasn't been a systematic study of the factors affecting its performance. In this paper, we perform a systematic study of various factors affecting the performance and efficacy of asynchronous I/O, we develop a performance model to estimate the aggregate I/O bandwidth achievable by iterative applications using synchronous and asynchronous I/O based on past observations, and we evaluate the performance of the recently developed asynchronous I/O feature of a parallel I/O library (HDF5) using benchmarks and real-world science applications. Our study covers parallel file systems on two large-scale HPC systems: Summit and Cori, the former with GPFS storage and the latter with a Lustre parallel file system.
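
The benefit of overlapping I/O with computation can be seen in an idealized timing model for an iterative application. The Python sketch below is a simplified textbook model, not the performance model developed in the paper; it assumes a fixed compute time and a fixed I/O time per iteration, and that the I/O of one iteration can fully overlap the computation of the next. The function names are hypothetical.

```python
def sync_time(iterations, compute_s, io_s):
    """Synchronous I/O: every iteration computes, then waits for its I/O."""
    return iterations * (compute_s + io_s)

def async_time(iterations, compute_s, io_s):
    """Idealized asynchronous I/O: I/O of iteration i overlaps compute of i+1."""
    if iterations == 0:
        return 0.0
    # The first compute phase and the last I/O phase cannot be hidden;
    # in between, each step costs whichever of the two phases is longer.
    return compute_s + (iterations - 1) * max(compute_s, io_s) + io_s

def speedup(iterations, compute_s, io_s):
    """Ratio of synchronous to asynchronous run time under this model."""
    return sync_time(iterations, compute_s, io_s) / async_time(iterations, compute_s, io_s)
```

Under these assumptions the gain is largest when the compute and I/O phases take similar time, and it vanishes as either phase dominates.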
Full Text Available
A GPU-accelerated Data Transformation Framework Rooted in Pushdown Transducers

https://doi.org/10.1109/HiPC56025.2022.00038

Nguyen, Tri; Becchi, Michela (December 2022, 2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC))

With the rise of machine learning and data analytics, the ability to process large and diverse sets of data efficiently has become crucial. Research has shown that data transformation is a key performance bottleneck for applications across a variety of domains, from data analytics to scientific computing. Custom hardware accelerators and GPU implementations targeting specific data transformation tasks can alleviate the problem, but suffer from narrow applicability and lack of generality. To tackle this problem, we propose a GPU-accelerated data transformation engine grounded on pushdown transducers. We define an extended pushdown transducer abstraction (effPDT) that allows expressing a wide range of data transformations in a memory-efficient fashion, and is thus amenable to GPU deployment. The effPDT execution engine utilizes a data streaming model that reduces the application’s memory requirements significantly, facilitating deployment on high- and low-end systems. We showcase our GPU-accelerated engine on a diverse set of transformation tasks covering data encoding/decoding, parsing and querying of structured data, and matrix transformation, and we evaluate it against publicly available CPU and GPU library implementations of the considered data transformation tasks. To understand the benefits of the effPDT abstraction, we extend our data transformation engine to also support finite state transducers (FSTs), we map the considered data transformation tasks on FSTs, and we compare the performance and resource requirements of the FST-based and the effPDT-based implementations.
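
What a pushdown transducer adds over a finite state transducer is a stack, which lets the output depend on arbitrarily deep nesting. The Python sketch below is a plain CPU illustration, not the effPDT abstraction or the GPU engine; the bracket-to-tag task and the names `OPEN`, `CLOSE`, and `pdt_transduce` are assumptions.

```python
# Hypothetical pushdown transducer: rewrites nested brackets as named tags.
OPEN = {"{": "obj", "[": "arr"}   # opening symbol -> tag name
CLOSE = {"}": "{", "]": "["}      # closing symbol -> required opening symbol

def pdt_transduce(text):
    stack, out = [], []
    for ch in text:
        if ch in OPEN:
            stack.append(ch)                # push: remember what was opened
            out.append(f"<{OPEN[ch]}>")
        elif ch in CLOSE:
            if not stack or stack[-1] != CLOSE[ch]:
                raise ValueError(f"unexpected {ch!r}")
            # pop: the stack top decides which closing tag to emit
            out.append(f"</{OPEN[stack.pop()]}>")
        else:
            out.append(ch)                  # other symbols pass through
    if stack:
        raise ValueError("unclosed bracket")
    return "".join(out)
```

The stack holds only the currently open brackets, so memory use grows with nesting depth and not with input length, which fits a streaming execution model.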
Full Text Available
