Search for: All records

Award ID contains: 1812727


Lightweight Huffman Coding for Efficient GPU Compression

https://doi.org/10.1145/3577193.3593736

Shah, Milan; Yu, Xiaodong; Di, Sheng; Becchi, Michela; Cappello, Franck (June 2023, ICS '23: Proceedings of the 37th International Conference on Supercomputing)

Full Text Available
High-Level Synthesis of Irregular Applications: A Case Study on Influence Maximization

https://doi.org/10.1145/3587135.3592196

Neff, Reece; Minutoli, Marco; Tumeo, Antonino; Becchi, Michela (May 2023, CF '23: Proceedings of the 20th ACM International Conference on Computing Frontiers)

FPGAs are promising platforms for accelerating irregular applications due to their ability to implement highly specialized hardware designs for each kernel. However, the design and implementation of FPGA-accelerated kernels can take several months using hardware design languages. High-Level Synthesis (HLS) tools provide fast, high-quality results for regular applications, but lack the support to effectively accelerate more irregular, complex workloads. This work analyzes the challenges and benefits of using a commercial state-of-the-art HLS tool and its available optimizations to accelerate graph sampling. We evaluate the resulting designs and their effectiveness when deployed in a state-of-the-art heterogeneous framework that implements the Influence Maximization with Martingales (IMM) algorithm, a complex graph analytics algorithm. We discuss future opportunities for improvement in hardware, HLS tools, and hardware/software co-design methodology to better support complex irregular applications such as IMM.
Full Text Available
A Code Transformation to Improve the Efficiency of OpenCL Code on FPGA through Pipes

https://doi.org/10.1145/3587135.3592210

Zarch, Mostafa Eghbali; Becchi, Michela (May 2023, CF '23: Proceedings of the 20th ACM International Conference on Computing Frontiers)

Over the past few years, there has been an increased interest in using FPGAs alongside CPUs and GPUs in high-performance computing systems and data centers. This trend has led to a push toward the use of high-level programming models and libraries, such as OpenCL, both to lower the barriers to the adoption of FPGAs by programmers unfamiliar with hardware description languages, and to allow a single code to be deployed seamlessly on different devices. Today, both Intel and Xilinx offer toolchains to compile OpenCL code onto FPGA. However, using OpenCL on FPGAs is complicated by performance portability issues, since different devices have fundamental differences in architecture and in the nature of the hardware parallelism they offer. Hence, platform-specific optimizations are crucial to achieving good performance across devices. In this paper, we propose a code transformation to improve the performance of OpenCL codes running on FPGA. The proposed method uses pipes to separate the memory accesses and core computation within OpenCL kernels. We analyze the benefits of the approach as well as the restrictions to its applicability. Using OpenCL applications from popular benchmark suites, we show that this code transformation can result in higher utilization of the available global memory bandwidth and increased instruction concurrency, thus improving the overall throughput of OpenCL kernels at the cost of a modest resource utilization overhead. Further concurrency can be achieved by using multiple memory and compute kernels.
Full Text Available
Evaluating Asynchronous Parallel I/O on HPC Systems

https://doi.org/10.1109/IPDPS54959.2023.00030

Ravi, John; Byna, Suren; Koziol, Quincey; Tang, Houjun; Becchi, Michela (May 2023, 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS))

Parallel I/O is an effective method to optimize data movement between memory and storage for many scientific applications. Poor performance of traditional disk-based file systems has led to the design of I/O libraries which take advantage of faster memory layers, such as on-node memory, present in high-performance computing (HPC) systems. By allowing caching and prefetching of data for applications alternating computation and I/O phases, a faster memory layer also provides opportunities for hiding the latency of I/O phases by overlapping them with computation phases, a technique called asynchronous I/O. Since asynchronous parallel I/O in HPC systems is still in the initial stages of development, there has not yet been a systematic study of the factors affecting its performance. In this paper, we perform a systematic study of various factors affecting the performance and efficacy of asynchronous I/O; we develop a performance model to estimate the aggregate I/O bandwidth achievable by iterative applications using synchronous and asynchronous I/O based on past observations; and we evaluate the performance of the recently developed asynchronous I/O feature of a parallel I/O library (HDF5) using benchmarks and real-world science applications. Our study covers parallel file systems on two large-scale HPC systems, Summit and Cori, the former with GPFS storage and the latter with a Lustre parallel file system.
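The overlap idea described in this abstract can be illustrated with a small back-of-the-envelope model. The sketch below is not the performance model from the paper; it assumes a hypothetical application in which every iteration has a fixed compute time and a fixed I/O time, and in which the asynchronous case overlaps the I/O of one iteration with the compute phase of the next. The function names and parameters are illustrative only.

```python
def iteration_times(t_compute, t_io, iterations, asynchronous):
    """Return (total runtime, I/O time the application is exposed to).

    Assumes each iteration computes for t_compute seconds and then writes
    data that takes t_io seconds to reach storage.
    """
    if asynchronous:
        # The first compute phase and the last I/O phase cannot be hidden;
        # in between, the slower of the two phases sets the pace.
        total = t_compute + (iterations - 1) * max(t_compute, t_io) + t_io
    else:
        # Synchronous I/O: compute and I/O phases strictly alternate.
        total = iterations * (t_compute + t_io)
    exposed_io = total - iterations * t_compute
    return total, exposed_io


def observed_bandwidth(bytes_per_iter, t_compute, t_io, iterations, asynchronous):
    """Bytes written divided by the I/O time that is visible to the application."""
    _, exposed_io = iteration_times(t_compute, t_io, iterations, asynchronous)
    return bytes_per_iter * iterations / exposed_io
```

Under these assumptions, when the compute phase is longer than the I/O phase, almost all I/O time is hidden and only the final write remains visible to the application.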
Full Text Available
GPU-Accelerated Error-Bounded Compression Framework for Quantum Circuit Simulations

https://doi.org/10.1109/IPDPS54959.2023.00081

Shah, Milan; Yu, Xiaodong; Di, Sheng; Lykov, Danylo; Alexeev, Yuri; Becchi, Michela; Cappello, Franck (May 2023, 2023 IEEE International Parallel and Distributed Processing Symposium (IPDPS))

Quantum circuit simulations enable researchers to develop quantum algorithms without the need for a physical quantum computer. Quantum computing simulators, however, all suffer from significant memory footprint requirements, which prevent large circuits from being simulated on classical supercomputers. In this paper, we explore different lossy compression strategies to substantially shrink quantum circuit tensors in the QTensor package (a state-of-the-art tensor network quantum circuit simulator) while ensuring the reconstructed data satisfy the user-needed fidelity. Our contribution is fourfold. (1) We propose a series of optimized pre- and post-processing steps to boost the compression ratio of tensors with a very limited performance overhead. (2) We characterize the impact of lossy decompressed data on quantum circuit simulation results, and leverage the analysis to ensure the fidelity of reconstructed data. (3) We propose a configurable compression framework for GPU based on cuSZ and cuSZx, two state-of-the-art GPU-accelerated lossy compressors, to address different use cases: either prioritizing compression ratio or prioritizing compression speed. (4) We perform a comprehensive evaluation by running 9 state-of-the-art compressors on an NVIDIA A100 GPU based on QTensor-generated tensors of varying sizes. When prioritizing compression ratio, our results show that our strategies can increase the compression ratio nearly 10 times compared to using only cuSZ. When prioritizing throughput, we can perform compression at a speed comparable to cuSZx while achieving 3-4× higher compression ratios. Decompressed tensors can be used in QTensor circuit simulation to yield a final energy result within 1-5% of the true energy value.
Full Text Available
Runway: In-transit Data Compression on Heterogeneous HPC Systems

https://doi.org/10.1109/CCGrid57682.2023.00030

Ravi, John; Byna, Suren; Becchi, Michela (May 2023, 2023 IEEE/ACM 23rd International Symposium on Cluster, Cloud and Internet Computing (CCGrid))

To alleviate bottlenecks in storing and accessing data on high-performance computing (HPC) systems, I/O libraries are enabling computation on data while it is in transit, for example through HDFS filters. For scientific applications that commonly use floating-point data, error-bounded lossy compression methods are a critical technique to significantly reduce the storage and bandwidth requirements. Thus far, deciding when and where to schedule in-transit data transformations, such as compression, has been outside the scope of I/O libraries. In this paper, we introduce Runway, a runtime framework that enables computation on in-transit data with an object storage abstraction. Runway is designed to be extensible to execute user-defined functions at runtime. In this effort, we focus on studying methods to offload data compression operations to available processing units based on latency and throughput. We compare the performance of running compression on multi-core CPUs, as well as offloading it to a GPU and a Data Processing Unit (DPU). We implement a state-of-the-art error-bounded lossy compression algorithm, SZ3, as a Runway function with a variant optimized for DPUs. We propose dynamic modeling to guide scheduling decisions for in-transit data compression. We evaluate Runway using four scientific datasets from the SDRBench benchmark suite on the Perlmutter supercomputer at NERSC.
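For readers unfamiliar with the term, "error-bounded" lossy compression means that every reconstructed value differs from the original by no more than a user-chosen bound. The sketch below shows the simplest way to obtain that guarantee, uniform scalar quantization. It is not SZ3 and not Runway code; the function names are hypothetical, and a real compressor would add prediction and entropy coding on top of the integer codes.

```python
def quantize(values, error_bound):
    """Map each float to an integer code using bins of width 2 * error_bound."""
    step = 2.0 * error_bound
    return [round(v / step) for v in values]


def dequantize(codes, error_bound):
    """Reconstruct each value as the center of its quantization bin."""
    step = 2.0 * error_bound
    return [c * step for c in codes]


def max_abs_error(original, reconstructed):
    """Largest pointwise absolute difference between two sequences."""
    return max(abs(a - b) for a, b in zip(original, reconstructed))
```

Because each value is rounded to the nearest bin center, the reconstruction error stays within half a bin width, that is, within the error bound (up to floating-point rounding). The integer codes are typically far more compressible than the original floating-point values.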
Full Text Available
Data Transformation Acceleration using Deterministic Finite-State Transducers

https://doi.org/10.1109/BigData55660.2022.10020756

Nourian, Marziyeh; Nguyen, Tri; Chien, Andrew A.; Becchi, Michela (December 2022, 2022 IEEE International Conference on Big Data (Big Data))

Data transformation tasks are a critical and costly part of many data processing and analytics applications. A simple computing model that can efficiently represent data transformation and be mapped to different platforms can provide programmers with the flexibility of using different data representations and allow for exploiting different platforms, including general-purpose processors and accelerators. We propose extended Deterministic Finite State Transducers (DFST+s), a computing model that enables the compact expression of data transformations (a significantly terser expression compared to the DFST model, a traditional computational abstraction for data transformation), aiding their correct and efficient implementation. We define the TFORM language to facilitate expressing the DFST+, and the TFORM virtual machine to enable a further compact expression, leading to a high-performance and portable implementation. We propose two TFORM VM execution models and evaluate them using a variety of data transformations (from Apache Parquet file format and sparse matrices). Our results show both effective portability across CPU and a hardware accelerator, and performance increases of 1.7× and 11.7× geometric mean, respectively, over a custom CPU implementation of the same transformations.
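As background for the abstraction this abstract builds on, a plain deterministic finite-state transducer reads one input symbol at a time, moves between a finite set of states, and emits output along the way. The sketch below is a minimal, table-driven example that collapses runs of spaces into a single space. It illustrates the traditional DFST model only, not the DFST+ extension or the TFORM language; the state names and the transition table are made up for this example.

```python
def classify(ch):
    """Reduce the input alphabet to two symbol classes."""
    return "space" if ch == " " else "other"


# (current state, input class) -> (next state, emit the input character?)
TRANSITIONS = {
    ("text", "other"): ("text", True),
    ("text", "space"): ("gap", True),   # first space of a run is kept
    ("gap", "other"): ("text", True),
    ("gap", "space"): ("gap", False),   # further spaces are dropped
}


def transduce(text):
    """Run the transducer over text and return the emitted output."""
    state = "text"
    out = []
    for ch in text:
        state, emit = TRANSITIONS[(state, classify(ch))]
        if emit:
            out.append(ch)
    return "".join(out)
```

Each step is a single table lookup with no backtracking, which is part of what makes transducer-style models straightforward to map onto different execution platforms.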
Full Text Available
A GPU-accelerated Data Transformation Framework Rooted in Pushdown Transducers

https://doi.org/10.1109/HiPC56025.2022.00038

Nguyen, Tri; Becchi, Michela (December 2022, 2022 IEEE 29th International Conference on High Performance Computing, Data, and Analytics (HiPC))

With the rise of machine learning and data analytics, the ability to process large and diverse sets of data efficiently has become crucial. Research has shown that data transformation is a key performance bottleneck for applications across a variety of domains, from data analytics to scientific computing. Custom hardware accelerators and GPU implementations targeting specific data transformation tasks can alleviate the problem, but suffer from narrow applicability and lack of generality. To tackle this problem, we propose a GPU-accelerated data transformation engine grounded on pushdown transducers. We define an extended pushdown transducer abstraction (effPDT) that allows expressing a wide range of data transformations in a memory-efficient fashion, and is thus amenable to GPU deployment. The effPDT execution engine utilizes a data streaming model that reduces the application’s memory requirements significantly, facilitating deployment on high- and low-end systems. We showcase our GPU-accelerated engine on a diverse set of transformation tasks covering data encoding/decoding, parsing and querying of structured data, and matrix transformation, and we evaluate it against publicly available CPU and GPU library implementations of the considered data transformation tasks. To understand the benefits of the effPDT abstraction, we extend our data transformation engine to also support finite state transducers (FSTs), we map the considered data transformation tasks on FSTs, and we compare the performance and resource requirements of the FST-based and the effPDT-based implementations.
Full Text Available
Accelerating Random Forest Classification on GPU and FPGA

https://doi.org/10.1145/3545008.3545067

Shah, Milan; Neff, Reece; Wu, Hancheng; Minutoli, Marco; Tumeo, Antonino; Becchi, Michela (August 2022, ICPP '22: Proceedings of the 51st International Conference on Parallel Processing)

Random Forests (RFs) are a commonly used machine learning method for classification and regression tasks spanning a variety of application domains, including bioinformatics, business analytics, and software optimization. While prior work has focused primarily on improving performance of the training of RFs, many applications, such as malware identification, cancer prediction, and banking fraud detection, require fast RF classification. In this work, we accelerate RF classification on GPU and FPGA. In order to provide efficient support for large datasets, we propose a hierarchical memory layout suitable to the GPU/FPGA memory hierarchy. We design three RF classification code variants based on that layout, and we investigate GPU- and FPGA-specific considerations for these kernels. Our experimental evaluation, performed on an Nvidia Xp GPU and on a Xilinx Alveo U250 FPGA accelerator card using publicly available datasets on the scale of millions of samples and tens of features, covers various aspects. First, we evaluate the performance benefits of our hierarchical data structure over the standard compressed sparse row (CSR) format. Second, we compare our GPU implementation with cuML, a machine learning library targeting Nvidia GPUs. Third, we explore the performance/accuracy tradeoff resulting from the use of different tree depths in the RF. Finally, we perform a comparative performance analysis of our GPU and FPGA implementations. Our evaluation shows that, while reporting the best performance on GPU, our code variants outperform the CSR baseline both on GPU and FPGA. For high accuracy targets, our GPU implementation yields a 5-9× speedup over CSR, and up to a 2× speedup over Nvidia’s cuML library.
Full Text Available
PILOT: a Runtime System to Manage Multi-tenant GPU Unified Memory Footprint

https://doi.org/10.1109/HiPC53243.2021.00063

Ravi, John; Nguyen, Tri; Zhou, Huiyang; Becchi, Michela (January 2022, 2021 IEEE 28th International Conference on High Performance Computing, Data, and Analytics (HiPC))

Concurrent kernel execution on GPU has proven an effective technique to improve system throughput by maximizing resource utilization. In order to increase programmability and meet the increasing memory requirements of data-intensive applications, current GPUs support Unified Virtual Memory (UVM), which provides a virtual memory abstraction with demand paging. By allowing applications to oversubscribe GPU memory, UVM provides increased opportunities to share GPU resources across applications. However, in the presence of applications with competing memory requirements, GPU sharing can lead to performance degradation due to thrashing. NVIDIA's Multi-Process Service (MPS) offers the capability to space-share bare-metal GPUs, thereby enabling cluster workload managers, such as Slurm, to share a single GPU across MPI ranks with limited control over resource partitioning. However, it is not possible to preempt, schedule, or throttle a running GPU process through MPS. These features would enable new OS-managed scheduling policies to be implemented for GPU kernels to dynamically handle resource contention and offer consistent performance. The contribution of this paper is twofold. We first show how memory oversubscription can impact the performance of concurrent GPU applications. Then, we propose three methods to transparently mitigate memory interference through kernel preemption and scheduling policies. To implement our policies, we develop our own runtime system (PILOT) to serve as an alternative to NVIDIA's MPS. In the presence of memory oversubscription, we observed a dramatic improvement in the overall throughput when using our scheduling policies and runtime hints.
Full Text Available
