skip to main content

Title: cuSZ: An Efficient GPU-Based Error-Bounded Lossy Compression Framework for Scientific Data
Error-bounded lossy compression is a state-of-the-art data reduction technique for HPC applications because it not only significantly reduces storage overhead but also can retain high fidelity for postanalysis. Because supercomputers and HPC applications are becoming heterogeneous using accelerator-based architectures, in particular GPUs, several development teams have recently released GPU versions of their lossy compressors. However, existing state-of-the-art GPU-based lossy compressors suffer from either low compression and decompression throughput or low compression quality. In this paper, we present an optimized GPU version, cuSZ, for one of the best error-bounded lossy compressors-SZ. To the best of our knowledge, cuSZ is the first error-bounded lossy compressor on GPUs for scientific data. Our contributions are fourfold. (1) We propose a dual-quantization scheme to entirely remove the data dependency in the prediction step of SZ such that this step can be performed very efficiently on GPUs. (2) We develop an efficient customized Huffman coding for the SZ compressor on GPUs. (3) We implement cuSZ using CUDA and optimize its performance by improving the utilization of GPU memory bandwidth. (4) We evaluate our cuSZ on five real-world HPC application datasets from the Scientific Data Reduction Benchmarks and compare it with other state-of-the-art methods on both CPUs more » and GPUs. Experiments show that our cuSZ improves SZ's compression throughput by up to 370.1x and 13.1x, respectively, over the production version running on single and multiple CPU cores, respectively, while getting the same quality of reconstructed data. It also improves the compression ratio by up to 3.48x on the tested data compared with another state-of-the-art GPU supported lossy compressor. « less
Authors:
; ; ; ; ; ; ; ; ; ;
Award ID(s):
2034169 1948447 2003624 2042084 2003709 1633608 2303820 2303064
Publication Date:
NSF-PAR ID:
10173692
Journal Name:
The 29th International Conference on Parallel Architectures and Compilation Techniques (PACT 2020)
Sponsoring Org:
National Science Foundation
More Like this
  1. Error-bounded lossy compression is a state-of-the-art data reduction technique for HPC applications because it not only significantly reduces storage overhead but also can retain high fidelity for postanalysis. Because supercomputers and HPC applications are becoming heterogeneous using accelerator-based architectures, in particular GPUs, several development teams have recently released GPU versions of their lossy compressors. However, existing state-of-the-art GPU-based lossy compressors suffer from either low compression and decompression throughput or low compression quality. In this paper, we present an optimized GPU version, cuSZ, for one of the best error-bounded lossy compressors-SZ. To the best of our knowledge, cuSZ is the first error-bounded lossy compressor on GPUs for scientific data. Our contributions are fourfold. (1) We propose a dual-quantization scheme to entirely remove the data dependency in the prediction step of SZ such that this step can be performed very efficiently on GPUs. (2) We develop an efficient customized Huffman coding for the SZ compressor on GPUs. (3) We implement cuSZ using CUDA and optimize its performance by improving the utilization of GPU memory bandwidth. (4) We evaluate our cuSZ on five real-world HPC application datasets from the Scientific Data Reduction Benchmarks and compare it with other state-of-the-art methods on both CPUsmore »and GPUs. Experiments show that our cuSZ improves SZ's compression throughput by up to 370.1x and 13.1x, respectively, over the production version running on single and multiple CPU cores, respectively, while getting the same quality of« less
  2. More and more HPC applications require fast and effective compression techniques to handle large volumes of data in storage and transmission. Not only do these applications need to compress the data effectively during simulation, but they also need to perform decompression efficiently for post hoc analysis. SZ is an error-bounded lossy compressor for scientific data, and cuSZ is a version of SZ designed to take advantage of the GPU's power. At present, cuSZ's compression performance has been optimized significantly while its decompression still suffers considerably lower performance because of its sophisticated lossless compression step---a customized Huffman decoding. In this work, we aim to significantly improve the Huffman decoding performance for cuSZ, thus improving the overall decompression performance in turn. To this end, we first investigate two state-of-the-art GPU Huffman decoders in depth. Then, we propose a deep architectural optimization for both algorithms. Specifically, we take full advantage of CUDA GPU architectures by using shared memory on decoding/writing phases, online tuning the amount of shared memory to use, improving memory access patterns, and reducing warp divergence. Finally, we evaluate our optimized decoders on an Nvidia V100 GPU using eight representative scientific datasets. Our new decoding solution obtains an average speedup ofmore »3.64X over cuSZ's Huffman decoder and improves its overall decompression performance by 2.43X on average.« less
  3. Today’s extreme-scale high-performance computing (HPC) applications are producing volumes of data too large to save or transfer because of limited storage space and I/O bandwidth. Error-bounded lossy compression has been commonly known as one of the best solutions to the big science data issue, because it can significantly reduce the data volume with strictly controlled data distortion based on user requirements. In this work, we develop an adaptive parameter optimization algorithm integrated with a series of optimization strategies for SZ, a state-of-the-art prediction-based compression model. Our contribution is threefold. (1) We exploit effective strategies by using 2nd-order regression and 2nd-order Lorenzo predictors to improve the prediction accuracy significantly for SZ, thus substantially improving the overall compression quality. (2) We design an efficient approach selecting the best-fit parameter setting, by conducting a comprehensive priori compression quality analysis and exploiting an efficient online controlling mechanism. (3) We evaluate the compression quality and performance on a supercomputer with 4,096 cores, as compared with other state-ofthe-art error-bounded lossy compressors. Experiments with multiple real world HPC simulations datasets show that our solution can improve the compression ratio up to 46% compared with the second-best compressor. Moreover, the parallel I/O performance is improved by up tomore »40% thanks to the significant reduction of data size.« less
  4. Vast volumes of data are produced by today’s scientific simulations and advanced instruments. These data cannot be stored and transferred efficiently because of limited I/O bandwidth, network speed, and storage capacity. Error-bounded lossy compression can be an effective method for addressing these issues: not only can it significantly reduce data size, but it can also control the data distortion based on user-defined error bounds. In practice, many scientific applications have specific requirements or constraints for lossy compression, in order to guarantee that the reconstructed data are valid for post hoc analysis. For example, some datasets contain irrelevant data that should be isolated in particular and users often have intuition regarding value ranges, geospatial regions, and other data subsets that are crucial for subsequent analysis. Existing state-of-the-art error-bounded lossy compressors, however, do not consider these constraints during compression, resulting in inferior compression ratios with respect to user’s post hoc analysis, due to the fact that the data itself provides little or no value for post hoc analysis. In this work we address this issue by proposing an optimized framework that can preserve diverse constraints during the error-bounded lossy compression, e.g., cleaning the irrelevant data, efficiently preserving different precision for multiple valuemore »intervals, and allowing users to set diverse precision over both regular and irregular regions. We perform our evaluation on a supercomputer with up to 2,100 cores. Experiments with six real-world applications show that our proposed diverse constraints based error-bounded lossy compressor can obtain a higher visual quality or data fidelity on reconstructed data with the same or even higher compression ratios compared with the traditional state-of-the-art compressor SZ. Our experiments also demonstrate very good scalability in compression performance compared with the I/O throughput of the parallel file system.« less
  5. As parallel computers continue to grow to exascale, the amount of data that needs to be saved or transmitted is exploding. To this end, many previous works have studied using error-bounded lossy compressors to reduce the data size and improve the I/O performance. However, little work has been done for effectively offloading lossy compression onto FPGA-based SmartNICs to reduce the compression overhead. In this paper, we propose a hardware-algorithm co-design for an efficient and adaptive lossy compressor for scientific data on FPGAs (called CEAZ), which is the first lossy compressor that can achieve high compression ratios and throughputs simultaneously. Specifically, we propose an efficient Huffman coding approach that can adaptively update Huffman codewords online based on codewords generated offline, from a variety of representative scientific datasets. Moreover, we derive a theoretical analysis to support a precise control of compression ratio under an error-bounded compression mode, enabling accurate offline Huffman codewords generation. This also helps us create a fixed-ratio compression mode for consistent throughput. In addition, we develop an efficient compression pipeline by adopting cuSZ's dual-quantization algorithm to our hardware use cases. Finally, we evaluate CEAZ on five real-world datasets with both a single FPGA board and 128 nodes (to acceleratemore »parallel I/O). Experiments show that CEAZ outperforms the second-best FPGA-based lossy compressor by 2X of throughput and 9.6X of ratio. It also improves MPI_File_write and MPI_Gather throughputs by up to 28.1X and 36.9X, respectively.« less