<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>cuSZ: An Efficient GPU-Based Error-Bounded Lossy Compression Framework for Scientific Data</title></titleStmt>
			<publicationStmt>
				<publisher></publisher>
				<date>10/03/2020</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10299839</idno>
					<idno type="doi"></idno>
					<title level='j'>Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques</title>
<idno></idno>
<biblScope unit="volume"></biblScope>
<biblScope unit="issue"></biblScope>					

					<author>Sheng Di</author>
					<author>Jiannan Tian</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[Error-bounded lossy compression is a state-of-the-art data reduction technique for HPC applications because it not only significantly reduces storage overhead but also can retain high fidelity for postanalysis. Because supercomputers and HPC applications are becoming heterogeneous using accelerator-based architectures, in particular GPUs, several development teams have recently released GPU versions of their lossy compressors. However, existing state-of-the-art GPU-based lossy compressors suffer from either low compression and decompression throughput or low compression quality. In this paper, we present an optimized GPU version, cuSZ, for one of the best error-bounded lossy compressors, SZ. To the best of our knowledge, cuSZ is the first error-bounded lossy compressor on GPUs for scientific data. Our contributions are fourfold. (1) We propose a dual-quantization scheme to entirely remove the data dependency in the prediction step of SZ such that this step can be performed very efficiently on GPUs. (2) We develop an efficient customized Huffman coding for the SZ compressor on GPUs. (3) We implement cuSZ using CUDA and optimize its performance by improving the utilization of GPU memory bandwidth. (4) We evaluate our cuSZ on five real-world HPC application datasets from the Scientific Data Reduction Benchmarks and compare it with other state-of-the-art methods on both CPUs and GPUs. Experiments show that our cuSZ improves SZ's compression throughput by up to 370.1x and 13.1x over the production version running on single and multiple CPU cores, respectively, while getting the same quality of reconstructed data. It also improves the compression ratio by up to 3.48x on the tested data compared with another state-of-the-art GPU-supported lossy compressor.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ABSTRACT</head><p>Error-bounded lossy compression is a state-of-the-art data reduction technique for HPC applications because it not only significantly reduces storage overhead but also can retain high fidelity for postanalysis. Because supercomputers and HPC applications are becoming heterogeneous using accelerator-based architectures, in particular GPUs, several development teams have recently released GPU versions of their lossy compressors. However, existing state-of-the-art GPU-based lossy compressors suffer from either low compression and decompression throughput or low compression quality. In this paper, we present an optimized GPU version, cuSZ, for one of the best error-bounded lossy compressors, SZ. To the best of our knowledge, cuSZ is the first error-bounded lossy compressor on GPUs for scientific data. Our contributions are fourfold. (1) We propose a dual-quantization scheme to entirely remove the data dependency in the prediction step of SZ such that this step can be performed very efficiently on GPUs. (2) We develop an efficient customized Huffman coding for the SZ compressor on GPUs. (3) We implement cuSZ using CUDA and optimize its performance by improving the utilization of GPU memory bandwidth. (4) We evaluate our cuSZ on five real-world HPC application datasets from the Scientific Data Reduction Benchmarks and compare it with other state-of-the-art methods on both CPUs and GPUs. Experiments show that our cuSZ improves SZ's compression throughput by up to 370.1&#215; and 13.1&#215; over the production version running on single and multiple CPU cores, respectively, while getting the same quality of reconstructed data. It also improves the compression ratio by up to 3.48&#215; on the tested data compared with another state-of-the-art GPU-supported lossy compressor.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="1">INTRODUCTION</head><p>Large-scale high-performance computing (HPC) applications can generate extremely large volumes of scientific data. For instance, the Hardware/Hybrid Accelerated Cosmology Code (HACC) <ref type="bibr">[1]</ref> can simulate 1&#8764;10 trillion particles in one simulation and produce up to 220 TB of data per snapshot, amounting to a total of 22 PB of data during the simulation <ref type="bibr">[2]</ref> with only one hundred timesteps/snapshots. Such a large volume of data imposes an unprecedented burden on supercomputer storage and interconnects <ref type="bibr">[3]</ref>, both for storing data to persistent storage and for loading data for postanalysis and visualization. Therefore, data reduction has attracted great attention from HPC application users as a way to reduce the volume of data moved to/from storage systems. Common approaches simply decimate snapshots periodically and adopt interpolation for data reconstruction. However, such approaches result in a significant loss of valuable information for postanalysis <ref type="bibr">[4]</ref>. Traditional data deduplication and lossless compression have also been used for shrinking data size but suffer from very limited reduction ratios on HPC floating-point datasets. Specifically, deduplication generally reduces the size of scientific datasets by only 20% to 30% <ref type="bibr">[5]</ref>, and lossless compression achieves a reduction ratio of up to about 2:1 in general <ref type="bibr">[6]</ref>. This is far from scientists' desired compression ratios, which are around 10:1 or higher (e.g., for the Community Earth System Model (CESM) <ref type="bibr">[7]</ref>).</p><p>Error-bounded lossy compression has been proposed to significantly reduce data size while ensuring acceptable data distortion for users <ref type="bibr">[8]</ref>. 
SZ <ref type="bibr">[8,</ref><ref type="bibr">9]</ref> is a state-of-the-art error-bounded lossy compression framework for scientific data (detailed in &#167;2), which often offers higher compression quality (i.e., better rate distortion) than other state-of-the-art techniques <ref type="bibr">[3]</ref>. However, as illustrated in prior work <ref type="bibr">[8,</ref><ref type="bibr">9]</ref>, SZ suffers from low compression and decompression throughput, only tens to hundreds of megabytes per second on a single CPU core. This throughput is far from enough for extreme-scale applications or advanced instruments with extremely high data acquisition rates, which is a major concern for corresponding users. The LCLS-II laser <ref type="bibr">[10]</ref>, for instance, may produce data at a rate of 250 GB/s <ref type="bibr">[11]</ref>, such that corresponding researchers require an extremely fast compression solution that can still achieve relatively high compression ratios (for example, 10:1) with preserved data accuracy. In order to match such a high data production rate, leveraging multiple graphics processing units (GPUs) is a fairly attractive solution because of their massive single-instruction multiple-thread (SIMT) parallelism and their high programmability as opposed to FPGAs or ASICs <ref type="bibr">[12]</ref>. Moreover, the SZ algorithm has O(n) time complexity and performs a large number of memory reads and writes, and hence its performance is ultimately bounded by memory bandwidth. State-of-the-art GPUs provide not only high computation capability but also high memory bandwidth. For example, the NVIDIA V100 GPU provides at least an order of magnitude higher memory bandwidth than state-of-the-art CPUs do <ref type="bibr">[13]</ref>.</p><p>SZ, however, cannot run on GPUs efficiently because of the lack of parallelism in its design. 
The main challenges are twofold: 1 the tight dependency in the prediction-quantization step of the SZ algorithm incurs expensive synchronizations across iterations in a GPU implementation; and 2 during the customized Huffman coding step of the SZ algorithm, coding and decoding each symbol based on the constructed Huffman tree involve many different branches (see &#167;2 for more details). This process causes serious warp divergence and random memory access issues, which inevitably lead to low GPU memory bandwidth utilization and performance.</p><p>To solve these issues, this paper presents an optimized GPU version of the SZ algorithm, called cuSZ, and proposes a series of optimization techniques for cuSZ to achieve high compression and decompression throughputs on GPUs. Specifically, we focus on the main performance bottlenecks (Lorenzo prediction <ref type="bibr">[14]</ref> and customized Huffman coding <ref type="bibr">[8]</ref>) and improve their performance for GPUs. We propose a novel technique called dual-quantization that can be applied to any prediction-based compression algorithm to alleviate the tight dependency in its prediction step. Moreover, according to prior work <ref type="bibr">[11]</ref>, a strict error-controlling scheme of lossy compression is needed by many HPC applications for their scientific explorations and postanalyses. However, state-of-the-art GPU-based lossy compressors such as cuZFP <ref type="bibr">[15]</ref> are not error-bounded. To the best of our knowledge, cuSZ 1 is the first strictly error-bounded lossy compressor on GPUs for scientific data. Our contributions are summarized as follows.</p><p>&#8226; We propose a generic dual-quantization scheme to entirely remove the data dependencies in the prediction-quantization step of lossy compression and apply it to the Lorenzo predictor in the SZ algorithm. 
1 The code is available at <ref type="url">https://github.com/hipdac-lab/cuSZ</ref>.</p><p>&#8226; We develop an efficient customized Huffman coding for SZ on GPUs with fine- and coarse-grained parallelism. &#8226; We carefully implement cuSZ and optimize its performance on the CUDA architecture. In particular, we fine-tune the chunk size in Huffman coding and develop an adaptive method that dynamically selects a 32-bit or 64-bit representation for Huffman codes, which can significantly improve GPU memory bandwidth utilization. &#8226; We evaluate our proposed cuSZ on five real-world HPC application datasets provided by a public repository, the Scientific Data Reduction Benchmarks <ref type="bibr">[16]</ref>, and compare it with other state-of-the-art methods on both CPUs and GPUs. Experiments show that cuSZ significantly improves compression throughput, by up to 370.1&#215; and 13.1&#215; over the production version of SZ running on a single CPU core and multiple CPU cores, respectively. cuSZ also achieves up to 3.48&#215; higher compression ratio than another advanced GPU-supported lossy compressor with reasonable data distortion. The rest of the paper is organized as follows. In &#167;2, we discuss SZ lossy compression in detail. In &#167;3, we propose our novel optimizations for the GPU version of SZ and implement it using CUDA. In &#167;4, we present the evaluation results based on five real-world simulation datasets from the Scientific Data Reduction Benchmarks and compare cuSZ with other state-of-the-art compressors on both CPUs and GPUs. In &#167;5, we discuss related work. In &#167;6, we present our conclusions and discuss future work.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="2">SZ BACKGROUND</head><p>Many scientific applications require strict error-bounded control when using lossy compression, in order to achieve accurate postanalysis and visualization for scientific discovery, as well as a high compression ratio. SZ <ref type="bibr">[8,</ref><ref type="bibr">9]</ref> is a prediction-based lossy compression framework designed for scientific data that strictly controls the global upper bound of compression error. Given a user-set error bound eb, SZ guarantees |d - d'| &lt; eb, where d and d' are the original value and the decompressed value, respectively. SZ's algorithm involves five key steps: preprocessing, data prediction, linear-scaling quantization, customized variable-length encoding, and optional lossless compression, e.g., gzip <ref type="bibr">[17]</ref> and Zstd <ref type="bibr">[18]</ref>.</p><p>1) Preprocessing SZ performs a preprocessing step, such as linearization in version 1.0 or a logarithmic transform for the pointwise relative error bound in version 2.0 <ref type="bibr">[19]</ref>. 2) Data Prediction SZ predicts the value of each data point by a data-fitting predictor, e.g., a Lorenzo predictor <ref type="bibr">[14]</ref> (abbreviated as &#8467;-predictor), based on its neighboring values. In order to guarantee that the compression error is always within the user-set error bound, the predicted values must be exactly the same in the compression and decompression procedures. To this end, the neighboring values used in the prediction have to be the decompressed values instead of the original values. 3) Linear-Scaling Quantization SZ computes the difference between the predicted value and the original value for each data point and performs a linear-scaling quantization <ref type="bibr">[8]</ref> to convert the difference to an integer based on the user-set error bound.   
4) Customized Variable-Length Encoding SZ encodes the integer quantization codes with a customized Huffman coding. 5) Lossless Compression SZ further compresses the encoded data with a lossless compressor such as Zstd <ref type="bibr">[18]</ref>, which may significantly reduce the size due to potential repeated patterns in the bitstream.</p><p>In this work, we focus mainly on the SZ compressor, because much prior work <ref type="bibr">[3,</ref><ref type="bibr">20,</ref><ref type="bibr">21,</ref><ref type="bibr">22,</ref><ref type="bibr">23,</ref><ref type="bibr">24,</ref><ref type="bibr">25,</ref><ref type="bibr">26]</ref> has verified that SZ yields the best compression quality among all the prediction-based compressors. However, it is nontrivial to port SZ to GPUs because of the strict constraints in its compression design. For instance, the data used in the prediction must be updated with decompressed values, such that the data prediction in the SZ compressor <ref type="bibr">[8,</ref><ref type="bibr">9]</ref> must be performed point by point in sequential order. This requirement introduces a loop-carried read-after-write (RAW) dependency during compression (discussed in &#167;3.1.2), making SZ hard to parallelize.</p><p>We mainly focus on SZ-1.4 instead of SZ-2.0 because the 2.0 model is particularly designed for low-precision use cases with visualization goals, in which the compression ratio can reach up to several hundred while the reconstructed data often have large distortions. Recent studies <ref type="bibr">[11]</ref>, however, demonstrate that scientists often require relatively high precision (or a low error bound) for sophisticated postanalysis beyond visualization purposes. In this situation (with relatively low error bounds), SZ-2.0 has very similar (or even slightly worse, in some cases) compression quality to SZ-1.4, as demonstrated in <ref type="bibr">[3]</ref>. 
Accordingly, our design for the GPU-accelerated SZ lossy compression is based on SZ-1.4 and takes advantage of both algorithmic and GPU hardware characteristics. Moreover, the current CPU version of SZ does not support SIMD vectorization and has no specific optimization for arithmetic performance. Therefore, the CPU baseline used in our following evaluation is the nonvectorized single-core and multicore implementation.</p></div>
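As a concrete illustration of step 3 above, linear-scaling quantization can be sketched in a few lines of Python (our own simplified rendering with hypothetical function names, not SZ's C implementation; the handling of out-of-range "unpredictable" values is omitted):

```python
def linscale_quantize(pred, orig, eb):
    # Scale the prediction error by 2*eb and round to the nearest integer,
    # so that dequantization reconstructs the value to within eb.
    return int(round((orig - pred) / (2 * eb)))

def linscale_dequantize(pred, code, eb):
    return pred + code * (2 * eb)

pred, orig, eb = 1.00, 1.37, 0.05
q = linscale_quantize(pred, orig, eb)      # q == 4
recon = linscale_dequantize(pred, q, eb)   # recon == 1.40
assert abs(orig - recon) <= eb             # the error-bound guarantee
```

Because both compressor and decompressor derive `recon` from the same predicted value and integer code, the reconstruction error never exceeds eb.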
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3">DESIGN METHODOLOGY OF CUSZ</head><p>In this section, we propose our novel lossy compression design, cuSZ, for CUDA architectures based on the SZ model. A system overview of our proposed cuSZ is shown in Figure <ref type="figure">1</ref>. We apply different coarse- and fine-grained parallelization techniques to each subprocedure in compression and decompression. Specifically, we first employ a data-chunking technique to exploit coarse-grained data parallelism. The chunking technique is used throughout the whole cuSZ design, including the lossless (steps 2 and 3) and lossy (steps 1, 4, and 5) procedures in both compression and decompression. We then deeply analyze the RAW data dependency in SZ and propose a novel two-phase prediction-quantization approach, namely, dual-quantization, which entirely eliminates the data dependency in the prediction-quantization (abbreviated as predict-quant) step. Furthermore, we provide an in-depth breakdown analysis of Huffman coding and develop an efficient Huffman coding on GPUs with multiple optimizations. A summary of our parallelization techniques is shown in Table <ref type="table">1</ref>. </p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.1">Parallelizing Prediction-Quantization in Compression</head><p>In this section, we discuss our proposed optimization techniques to parallelize SZ's predict-quant procedure on GPU architectures. We first chunk the original data to gain coarse-grained parallelization, and then we assign a thread to each data point for fine-grained in-chunk parallel computations.</p><p>3.1.1 Chunking and Padding. Figure <ref type="figure">2</ref> illustrates our chunking and padding technique. For each chunked data block, we assign a thread to each data point (i.e., fine-grained parallelism). To avoid complex modifications to the prediction function after chunking, we add a padding layer to each block in the predict-quant step. We set all the values in the padding layer to 0 such that they do not affect the predicted values of the points neighboring the padding layer, as shown in Figure <ref type="figure">2</ref>. We note that in the original SZ, the uppermost and leftmost points (denoted as the "outer layer", shaded in Figure <ref type="figure">2</ref>) are saved directly as unpredictable data. In our chunking version, however, directly storing these points for each block would significantly degrade the compression ratio. Therefore, we apply &#8467;-prediction to the outer layer instead, such that every point in the block is consistently processed by the &#8467;-predictor, avoiding thread/warp divergence. Because the padding layer is initialized with 0s, the prediction for each outer-layer point falls back to 1D first-order Lorenzo prediction, as shown in Figure <ref type="figure">2</ref>. Based on our empirical results, we adopt a block size of 32 for 1D data, 16&#215;16 for 2D data, and 8&#215;8&#215;8 for 3D data.</p><p>3.1.2 Dual-Quantization Scheme. In the following discussion, we use superscript circle (&#8728;) and bullet (&#8226;) to denote values in the compression and decompression procedures, respectively. 
We use star (&#8902;) to denote all the values related to the data reconstruction in compression. The subscript k represents the kth iteration.</p><p>Read-After-Write in SZ. In the original SZ algorithm, all data points need to go through predict-quant and in situ reconstruction iteratively, which causes intrinsic read-after-write (RAW) dependencies (as illustrated in Figure <ref type="figure">3</ref>).</p><p>We describe the loop-carried RAW dependency issue in detail below. For any data point at the (k-1)th iteration in SZ compression, given a predicted value p_{k-1}, the prediction error e_{k-1} (= d_{k-1} - p_{k-1}) is converted to an integer and a corresponding quantization code q_{k-1} based on the user-set error bound eb. Then, the reconstructed prediction error e&#8902;_{k-1} and the reconstructed value d&#8902;_{k-1} (= p_{k-1} + e&#8902;_{k-1}) are generated by using q_{k-1}, eb, and p_{k-1}. d&#8902;_{k-1} is equivalent to the reconstructed d&#8226;_{k-1} during decompression (as shown in Figure <ref type="figure">3</ref>); however, the kth iteration must wait until this update completes at the end of the (k-1)th iteration, which incurs a loop-carried data dependency. Also note that d&#8902; is written back in the last step of the current iteration, and its written value is read at the beginning of the next iteration; therefore, two consecutive iterations cannot overlap. Hence, under the original design of predict-quant in SZ, it is infeasible to effectively exploit fine-grained parallelism and efficiently utilize SIMT on GPUs. We present the original SZ predict-quant step in detail in Algorithm 1.</p><p>Proposed Dual-Quantization Approach. To eliminate the RAW dependency, we propose a dual-quantization scheme that modifies the data representation during the predict-quant procedure. 
Our dual-quantization (abbreviated as dual-quant) consists of two steps: pre-quantization and post-quantization.</p><p>Given a dataset D of arbitrary dimension, we first quantize it based on the user-set eb and convert it to a new dataset D&#8728; = {d&#8728; = 2&#183;eb&#183;round(d/(2&#183;eb))}, where any d&#8728; &#8712; D&#8728; is strictly a multiple of 2&#183;eb. We call this step pre-quantization (abbreviated as prequant). In order to avoid overflow, d&#8728; is stored in a floating-point data type. We note that the error introduced by prequant is strictly bounded by the user-set error bound, that is, |d - d&#8728;| = |d - 2&#183;eb&#183;round(d/(2&#183;eb))| &#8804; eb.</p><p>After pre-quantization, we can calculate each predicted value p&#8728; based on its surrounding values (denoted by d&#8728;_sr) and the &#8467;-predictor, i.e., p&#8728; = &#8467;(d&#8728;_sr).</p><p>The second step, called post-quantization (abbreviated as postquant), serves as the counterpart of the linear-scaling quantization in the original SZ. postquant computes the difference between the predicted value and the prequant-ized value. Different from the original SZ, this difference does not cause any compression error (as discussed later); we use &#948; instead of e to denote it: &#948;&#8728; = d&#8728; - p&#8728;.</p><p>Then, the quantization code q&#8728; is generated based on &#948;&#8728;. Note that q&#8728; is quantitatively equivalent to &#948;&#8728; but represented differently: &#948;&#8728; is a floating-point number to avoid subnormal values (i.e., under/overflow), while q&#8728; is an integer, which is used in the subsequent lossless coding (e.g., Huffman coding). It is worth noting that, during decompression, d&#8728; can be reconstructed (as d&#8226;) based on the losslessly decoded q&#8226; &#8801; q&#8728; (hence exactly &#948;&#8728;) and the predicted p&#8226;; thus the postquant step does not introduce any further error.</p><p>Eliminating RAW. In the following text, we explain in detail why the dual-quant method effectively eliminates the RAW dependency. 
Conceptually, similar to the original SZ, we can construct &#948;&#8902; and d&#8902; during the compression, as shown in Figure <ref type="figure">3</ref>. In fact, for the (k-1)th iteration, &#948;&#8902;_{k-1} &#8801; &#948;&#8728;_{k-1} and d&#8902;_{k-1} &#8801; d&#8728;_{k-1} strictly hold. Consequently, unlike the original SZ, which must write d&#8902; back before the next iteration, d&#8902; &#8801; d&#8728; always holds in our proposed dual-quant approach. As illustrated in Figure <ref type="figure">3</ref>, after prequant, all d&#8728; are dependency free for postquant. By eliminating the loop-carried RAW dependency (marked as arrows in Figure <ref type="figure">3</ref>), we can effectively parallelize the dual-quant procedure by performing fine-grained (per-point) parallel computation, which is commonly seen in image processing <ref type="bibr">[27]</ref>. We illustrate the detailed dual-quant procedure in Algorithm 2.</p><p>Lorenzo Predictor with Binomial Coefficients. According to Tao et al. <ref type="bibr">[8]</ref>, the generalized &#8467;-predictor is a signed combination of neighboring values with binomial coefficients, as illustrated in Figure <ref type="figure">1</ref>. We note that all the coefficients in the formula of the &#8467;-predictor are integers; thus, the prediction computation consists of mathematically integer-based operations (additions and multiplications) with unit weight. This ensures that no division is needed and that the data reconstruction based on dual-quant is fairly precise and robust with respect to machine &#1013;; the original SZ, by contrast, uses precise floating-point operations and suffers from underflow. Note that the predicted values will be completely corrected by the saved quantization codes in decompression, so the final error is still bounded by eb.</p></div>
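The dual-quant scheme above can be sketched for 1D data with a first-order &#8467;-predictor as follows (a NumPy sketch of the algorithmic idea under our own naming, not the CUDA kernels; note how every quantization code reads only pre-quantized values, so no loop-carried dependency remains):

```python
import numpy as np

def dual_quant_1d(data, eb):
    # prequant: snap every value to a multiple of 2*eb (introduced error <= eb);
    # kept in floating point, as in cuSZ, to avoid integer overflow
    d = np.round(data / (2 * eb))
    # postquant: first-order Lorenzo prediction p_i = d_{i-1}; all deltas
    # depend only on prequantized values, so every point is independent
    pred = np.concatenate(([0.0], d[:-1]))
    return (d - pred).astype(np.int64)   # integer codes, lossless w.r.t. d

def reconstruct_1d(codes, eb):
    # invert the Lorenzo prediction with a prefix sum, then undo prequant
    return np.cumsum(codes) * (2 * eb)

eb = 1e-2
data = np.random.default_rng(0).random(1000)
recon = reconstruct_1d(dual_quant_1d(data, eb), eb)
assert np.max(np.abs(data - recon)) <= eb    # error bound holds end to end
```

Only the prequant step is lossy; the postquant codes reproduce the prequantized values exactly, which is why the whole pipeline stays within eb.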
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2">Efficient Customized Huffman Coding</head><p>To efficiently compress the quantization codes generated by dual-quant, we develop an efficient customized Huffman coding for SZ on GPUs. Specifically, Huffman coding consists of the following subprocedures: 1 calculate the statistical frequency for each quantization bin (as a symbol); 2 build the Huffman tree based on the frequencies and generate a base codebook along with each code's bitwidth; 3 transform the base codebook into the canonical Huffman codebook (called canonization); 4 encode in parallel according to the codebook, and concatenate the Huffman codes into a bitstream (called deflating). Huffman decoding is composed of 1 retrieving the reverse codebook and 2 decoding accordingly.</p><p>Note that the fourth subprocedure of encoding can be further decomposed into two steps for fine-grained optimization. Codebook-based encoding is basically a memory copy and can be fully parallelized at fine granularity, whereas deflating can be performed only sequentially (except for the blockwise parallelization discussed in &#167;3.1.1) because of its atomic operations. We discuss Huffman coding on GPUs step by step as follows.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>3.2.1</head><p>Histogram for Quantization Bins. The first step of Huffman coding is to build a histogram representing the frequency of each quantization bin from the data prediction step. The GPU histogramming algorithm that we use is derived from the algorithm proposed by G&#243;mez-Luna et al. <ref type="bibr">[28]</ref>. This algorithm minimizes conflicts in updating histogram bins by privatizing the histogram per thread block and storing it in shared memory; where capacity allows, conflicts are further reduced by giving each block access to multiple replicated copies. All threads inside a block read a designated partition of the quantization codes and use atomic operations to update a specific replicated histogram. As each block finishes its portion of the predicted data, the replicated histograms are combined via a parallel reduction into a single global histogram, which is used to construct the final codebook in Huffman coding.</p></div>
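The replicated-histogram scheme can be mimicked on the CPU as follows (a NumPy sketch under our own naming: per-chunk private histograms stand in for the per-block shared-memory copies, and the final sum stands in for the parallel reduction):

```python
import numpy as np

def chunked_histogram(codes, num_bins, num_blocks=8):
    # one private histogram per "thread block"; privatization reduces
    # contention on atomic updates in the GPU version
    chunks = np.array_split(codes, num_blocks)
    private = np.zeros((num_blocks, num_bins), dtype=np.int64)
    for b, chunk in enumerate(chunks):
        np.add.at(private[b], chunk, 1)   # stands in for shared-memory atomics
    return private.sum(axis=0)            # final parallel reduction

codes = np.random.default_rng(1).integers(0, 1024, size=10_000)
hist = chunked_histogram(codes, 1024)
assert hist.sum() == codes.size
assert np.array_equal(hist, np.bincount(codes, minlength=1024))
```

The merged result is identical to a direct histogram; only the update contention pattern differs.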
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.2">Constructing Huffman Codebook.</head><p>In order to build the optimal Huffman tree, the local symbol frequencies need to be aggregated to generate the global symbol frequencies for the whole dataset. Using the aggregated frequencies, we build a codebook according to the Huffman tree for encoding. Note that the number of symbols (namely, the number of quantization bins) is limited (generally no greater than 65,536) and is much smaller than the data size (generally, millions of data points or more). This leads to a much lower number of nodes in the Huffman tree compared with the data size, such that the time complexity of building the Huffman tree is low. We note that building the Huffman tree sequentially on the CPU benefits from high CPU frequency and low memory-access latency. However, it requires a CPU-to-GPU/GPU-to-CPU transfer of the frequencies/codebook before/after building the tree, and communicating these two small messages would incur non-negligible overheads. Therefore, we adopt one GPU thread to build the Huffman tree sequentially to avoid such overheads. We propose an adaptive codeword representation to enhance the utilization of memory bandwidth, which in turn improves the Huffman encoding performance. We illustrate the organization of the codebook in Figure <ref type="figure">4</ref>. The codebook is organized in units of unsigned integers; each unit contains a variable-length Huffman codeword aligned to the LSB (the rightmost, or least significant, bits) and its bitwidth stored at the MSB (the leftmost, or most significant, bits). According to the pessimistic estimation of the maximum bitwidth of an optimal Huffman codeword <ref type="bibr">[29]</ref>, one is supposed to use uint64_t to hold each bitwidth-codeword representation. For example, the maximum bitwidth could be 33 bits for the CLDHGH field from the CESM-ATM dataset in the worst case. 
However, we note that using uint32_t to represent a bitwidth-codeword tuple can significantly improve the Huffman coding and decoding performance compared with using 64-bit unsigned integers (i.e., uint64_t), because of higher GPU memory bandwidth utilization. Thus, we propose to dynamically select uint32_t or uint64_t representation for the Huffman bitwidth-codeword based on the practical maximum bitwidth instead of pessimistic estimation. We show the performance evaluation with different representations in &#167;4.</p></div>
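The bitwidth-codeword packing and the adaptive 32-/64-bit choice can be sketched as follows (our own simplified layout; the 8-bit bitwidth field at the MSB end is an illustrative assumption, not necessarily cuSZ's exact layout):

```python
def pack_entry(codeword, bitwidth, unit_bits):
    # bitwidth in the top 8 bits (MSB end), codeword right-aligned (LSB end)
    assert bitwidth <= unit_bits - 8
    return (bitwidth << (unit_bits - 8)) | codeword

def choose_unit(max_bitwidth):
    # prefer 32-bit units for bandwidth; fall back to 64-bit units only
    # when some codeword cannot fit in the 24 payload bits
    return 32 if max_bitwidth <= 24 else 64

assert choose_unit(4) == 32
assert choose_unit(33) == 64       # e.g., the 33-bit worst case cited above
entry = pack_entry(0b1011, 4, 32)
assert entry >> 24 == 4            # bitwidth recovered from the MSB end
assert entry & 0xFFFFFF == 0b1011  # codeword recovered from the LSB end
```

The decision is made once per dataset from the actual maximum bitwidth, so the common case halves the bytes moved per codeword.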
<div xmlns="http://www.tei-c.org/ns/1.0"><head>[Figure 4: codebook unit layout, with the bitwidth at the MSB end and the codeword at the LSB end]</head><p>The theoretical time complexity is O(k log k) for building a Huffman tree and O(k) for traversing the tree, where k is the number of symbols (quantization bins). Our experiments show that the real execution time of building a Huffman tree is consistent with the theoretical time complexity analysis (see Table <ref type="table">3</ref>). On the other hand, the number of symbols is determined by the smoothness of the dataset and the user-desired error bound (1,024 bins by default). For example, with a relatively large error bound such as a value-range-based relative error bound<ref type="foot">foot_0</ref> of 10^-3, we observe that most of the symbols are concentrated near the center of the codebook. As the error bound decreases, the symbols become more evenly distributed. Thus, determining a suitable number of quantization bins is important for high performance in constructing a codebook.</p></div>
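The O(k log k) tree construction over the k bin frequencies can be sketched with a binary heap (a Python sketch with our own names; as described above, cuSZ runs the equivalent sequential build on a single GPU thread):

```python
import heapq

def huffman_bitwidths(freqs):
    """Return each symbol's Huffman code bitwidth (its tree depth).
    Symbols with zero frequency receive no code."""
    heap = [(f, sym, None) for sym, f in enumerate(freqs) if f > 0]
    heapq.heapify(heap)
    if len(heap) == 1:                    # degenerate single-symbol case
        return {heap[0][1]: 1}
    tiebreak = len(freqs)                 # unique ids for internal nodes
    while len(heap) > 1:
        f1, s1, n1 = heapq.heappop(heap)  # merge the two least-frequent
        f2, s2, n2 = heapq.heappop(heap)  # subtrees, O(log k) per merge
        heapq.heappush(heap, (f1 + f2, tiebreak, ((s1, n1), (s2, n2))))
        tiebreak += 1
    depths = {}
    def walk(sym, node, depth):           # leaf depth = code bitwidth
        if node is None:
            depths[sym] = depth
        else:
            for child_sym, child in node:
                walk(child_sym, child, depth + 1)
    walk(heap[0][1], heap[0][2], 0)
    return depths

bw = huffman_bitwidths([45, 13, 12, 16, 9, 5])
assert bw[0] == 1                         # most frequent symbol: 1-bit code
assert sorted(bw.values()) == [1, 3, 3, 3, 4, 4]
```

Only the bitwidths matter downstream: canonization (§3.2.3) reassigns the actual codewords from them.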
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.3">Canonizing Codebook.</head><p>A canonical Huffman codebook <ref type="bibr">[30]</ref> keeps the same bitwidth for each codeword as the original Huffman codebook (i.e., the base codebook), while its bijective mapping (between quantization code and Huffman codeword) and variable-length codewords make the memory layout more efficient. The time complexity of sequentially canonizing the base codebook is O(k), where k is the number of symbols (i.e., the number of quantization bins), which is sufficiently small compared with the data size. By using a canonical codebook, we can (i) decode without the Huffman tree, (ii) efficiently cache the reverse codebook for high decoding throughput, and (iii) maintain exactly the same compression ratio as the base Huffman codebook.</p><p>The process of canonizing the codebook can be decomposed into the following subprocedures: 1 linearly scanning the base codebook (sequentially O(k)), which is parallelized at fine granularity with atomic operations; 2 loosely radix-sorting the codewords by bitwidth (sequentially O(k)), which cannot be parallelized because of an intrinsic RAW dependency; and 3 building the reverse codebook (sequentially O(k)), which is enabled with fine-grained parallelism.</p><p>It is intuitive to separate the functionalities of the aforementioned subprocedures and implement them as independent CUDA kernels with different configurations (i.e., blockDim and gridDim). Based on our profiling results on an NVIDIA V100 GPU, however, launching a CUDA kernel usually takes about 60 microseconds (&#181;s), measured over 11 kernel launches in total, or about 200 &#181;s for the three kernels of canonization. Moreover, any two consecutive subprocedures require an additional expensive synchronization (i.e., cudaDeviceSynchronize). 
Our experiments indicate that canonizing the codebook is itself sufficiently fast that these launch and synchronization overheads would dominate; thus, we integrate all three subprocedures into one single kernel.</p><p>We note that this single kernel must be launched with more threads than the limit for a single thread block (i.e., 1,024) for two reasons. On the one hand, high scalability is required for the parallel reads/writes to match the problem size in subprocedures 1 and 3 . On the other hand, unlike histogramming, which saves only the &#920;(k) frequencies in shared memory, this kernel requires saving both the codebook and its footprint, which may exceed the maximum allowable capacity of shared memory in a single thread block (e.g., 96 KB for a V100). Since shared memory is visible only to its designated thread block, shuffling codewords in shared memory across different thread blocks is prohibited. In addition, there is little intermediate data reuse; thus, we use global memory instead of shared memory to save the codebook. Hence, we employ the state-of-the-art CUDA Cooperative Groups API <ref type="bibr">[31]</ref> to achieve in-grid synchronization. Specifically, we launch the same number of threads as the number of codewords in the base codebook. We select one thread to perform the RAW-restricted sequential subprocedure when needed (see Table <ref type="table">1</ref>). Note that it takes only 32 &#181;s on a V100 to launch this Cooperative-Groups-enabled kernel, which significantly reduces the overhead compared with launching multiple kernels. Moreover, compared with two inter-kernel barriers costing more than 2&#215;60 &#181;s, two in-grid barriers have relatively low overheads, eventually resulting in, for instance, about 200 &#181;s of kernel time with k = 1024.</p></div>
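The canonization described above can be modeled compactly: keep each symbol's bitwidth, reassign code values in (bitwidth, symbol) order, and decode with per-length first-code tables instead of a tree. A minimal Python sketch under a toy codebook; the function names and table layout are our assumptions, not cuSZ's internal format:

```python
def canonize(bitwidths):
    """Assign canonical codes: each symbol keeps the bitwidth of the base
    codebook, and code values are handed out in (bitwidth, symbol) order.
    Returns the forward codebook plus the per-length tables that let us
    decode without the Huffman tree."""
    symbols = sorted(bitwidths, key=lambda s: (bitwidths[s], s))
    code, prev_len = 0, 0
    codebook, first_code, sym_index = {}, {}, {}
    for s in symbols:
        l = bitwidths[s]
        code <<= (l - prev_len)          # extend code when bitwidth grows
        if l not in first_code:
            first_code[l] = code         # smallest code of this length
            sym_index[l] = []            # symbols of this length, in order
        codebook[s] = (code, l)
        sym_index[l].append(s)
        code += 1
        prev_len = l
    return codebook, first_code, sym_index

def decode_one(bits, pos, first_code, sym_index):
    """Read one canonical codeword starting at bit offset `pos`,
    using only the per-length tables (no tree traversal)."""
    code, l = 0, 0
    while True:
        code = (code << 1) | bits[pos + l]
        l += 1
        if l in first_code and first_code[l] <= code < first_code[l] + len(sym_index[l]):
            return sym_index[l][code - first_code[l]], pos + l

# Toy base codebook bitwidths {symbol: bitwidth}; encode then decode.
cb, fc, si = canonize({0: 4, 1: 3, 2: 1, 3: 2, 4: 4})
bits = []
for s in [2, 3, 2, 4]:
    c, l = cb[s]
    bits += [(c >> (l - 1 - i)) & 1 for i in range(l)]
decoded, pos = [], 0
while pos < len(bits):
    s, pos = decode_one(bits, pos, fc, si)
    decoded.append(s)
```

Note how property (iii) holds: every symbol's bitwidth, and hence the compressed size, is unchanged; only the code values are reassigned.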
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="3.2.4">Encoding and Deflating.</head><p>We design an efficient solution to perform the encoding by GPU threads in parallel. Encoding involves looking up a symbol in the codebook and performing a memory copy. After we adaptively select a 32-/64-bit unsigned integer to represent a Huffman code with its bitwidth, the encoding step is massively parallelized. To generate the dense bitstream of Huffman codes within each data block, we conduct deflating in order to concatenate the Huffman codes and remove the unnecessary zero bits according to the saved bitwidths.</p><p>Since the deflated code is organized sequentially, we apply the coarse-grained chunkwise parallelization technique discussed in &#167;3.1.1. In particular, a data chunk for compression and decompression is mapped to one GPU thread. Note that the chunk size for deflating is not necessarily the same as the chunk size for dual-quant, and it does not rely on the dimensionality. We optimize the deflating chunk size by evaluating the performance with different sizes (shown in &#167;4.2.1). We also employ a memory reuse technique to reduce the GPU memory footprint in deflating. Specifically, we reuse the memory space of the Huffman codes for the deflated bitstream, because the latter uses significantly less memory space and there is no conflict when writing the deflated bitstream to the designated location.</p></div>
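The deflating step described above, concatenating codes and dropping the unused leading zeros of each fixed-width container, can be sketched per chunk as follows. This is a Python model of the per-thread logic; in cuSZ each chunk would be handled by one GPU thread, whereas here the chunks are simply processed in a loop, and the chunk contents are illustrative:

```python
def deflate_chunk(codes):
    """Concatenate (codeword, bitwidth) pairs of one chunk into a dense
    byte stream, keeping only the `bitwidth` significant bits of each
    fixed-width (32-/64-bit) container."""
    acc, nbits = 0, 0
    for code, width in codes:
        acc = (acc << width) | code     # append the significant bits only
        nbits += width
    acc <<= (-nbits) % 8                # zero-pad the final byte
    return acc.to_bytes((nbits + 7) // 8, "big"), nbits

# Each chunk deflates independently; the per-chunk bit counts (and the
# offsets derived from them) let decompression locate every chunk's stream.
chunks = [[(0b0, 1), (0b10, 2), (0b0, 1)], [(0b1111, 4)]]
streams = [deflate_chunk(c) for c in chunks]
```

Because each chunk writes to its own region, no write conflict arises, which is what makes reusing the Huffman-code buffer for the deflated output safe.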
<div xmlns="http://www.tei-c.org/ns/1.0"><head>Decompression</head><p>cuSZ's decompression consists of two steps: Huffman decoding (or inflating the densely concatenated Huffman bitstream) and reverse dual-quant. In inflating, we first use the previously built reverse codebook to retrieve the quantization codes from the deflated Huffman bitstream. Then, based on the retrieved quantization codes, we reconstruct the floating-point data values. Note that only coarse-grained chunking can be applied to decompression, and its chunk size is determined during compression. The reason is that both steps have a RAW dependency issue. In fact, retrieving the variable-length codes has the same pattern as a loop-carried RAW dependency. For the reverse dual-quantization procedure, each data point cannot be decompressed until its preceding values are fully reconstructed.</p></div>
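Assuming the quantization codes of a chunk have already been retrieved via the reverse codebook, the reverse dual-quantization step can be sketched for the 1-D case; the sequential loop makes the RAW dependency explicit. The radius of 512 (half of the 1,024 default bins) and the 1-D previous-point predictor are our simplifying assumptions, not cuSZ's exact layout:

```python
def reverse_dual_quant_1d(bincodes, eb, radius=512):
    """Reconstruct a 1-D chunk from its quantization bin codes. Each point
    needs its predecessor fully reconstructed first, hence the sequential
    loop: this is the loop-carried RAW dependency that restricts
    decompression to coarse-grained (per-chunk) parallelism."""
    prev, out = 0, []
    for c in bincodes:
        prev += (c - radius)        # undo the Lorenzo delta on integers
        out.append(prev * 2 * eb)   # scale back from error-bound units
    return out
```

With eb = 0.5, the codes [512, 514, 516] reconstruct the sequence 0.0, 2.0, 6.0; each reconstructed value is within eb of the original by construction of the bin codes.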
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4">EXPERIMENTAL EVALUATION</head><p>In this section, we present our experimental setup (including platform, baselines, and datasets) and our evaluation results.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.1">Experimental Setup</head><p>Evaluation Platform. We conduct our experimental evaluation on the PantaRhei cluster <ref type="bibr">[32]</ref>. We perform the experiments on an NVIDIA V100 GPU <ref type="bibr">[13]</ref> from the cluster and compare with lossy compressors on two 20-core Intel Xeon Gold 6148 CPUs from the cluster. The GPU is connected to the host via a 16-lane PCIe 3.0 interconnect. We use NVIDIA CUDA 9.2 and its default profiler to measure the kernel time.</p><p>Comparison Baselines. We compare our cuSZ with two baselines: SZ-1.4.13.5 and cuZFP <ref type="bibr">[15]</ref>. For SZ-1.4, we adopt the default setting: 16 bits for linear-scaling quantization (i.e., 1,024 quantization bins), best_compression mode, and best_speed mode for gzip, which lead to a good tradeoff between compression ratio and performance.</p><p>Test Datasets. We conduct our evaluation and comparison based on five typical real-world HPC simulation datasets of different dimensionalities from the Scientific Data Reduction Benchmarks suite <ref type="bibr">[16]</ref>: 1 the 1D HACC cosmology particle simulation <ref type="bibr">[1]</ref>, 2 the 2D CESM-ATM climate simulation <ref type="bibr">[33]</ref>, 3 the 3D Hurricane ISABEL simulation <ref type="bibr">[34]</ref>, 4 the 3D Nyx cosmology simulation <ref type="bibr">[35]</ref>, and 5 the 4D QMCPACK quantum Monte Carlo simulation <ref type="bibr">[36]</ref>. They have been widely used in prior works <ref type="bibr">[3,</ref><ref type="bibr">11,</ref><ref type="bibr">19,</ref><ref type="bibr">37,</ref><ref type="bibr">38]</ref> and are good representatives of production-level simulation datasets. Table <ref type="table">2</ref> shows all 112 fields<ref type="foot">foot_1</ref> across these datasets. The data sizes of the five datasets are 6.3 GB, 2.0 GB, 1.9 GB, 3.0 GB, and 1.2 GB, respectively. Note that our evaluated HACC dataset is consistent with real-world scenarios that generate petabytes of data. 
For example, according to <ref type="bibr">[1]</ref>, a typical large-scale HACC simulation for cosmological surveys runs on 16,384 nodes, each with 128 million particles, and generates 5 PB of data over the whole simulation. The simulation contains 100 individual snapshots of roughly 3 GB per node. We evaluate a single snapshot for each dataset instead of all the snapshots, because the compressibility of the snapshots is usually strongly similar. Moreover, when a field is too large to fit in a single GPU's memory, cuSZ divides it into blocks and then compresses them block by block.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2">Evaluation Results and Analysis</head><p>In this section, we evaluate the compression performance and quality of cuSZ and compare it with CPU-SZ and cuZFP.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.1">Compression</head><p>Performance. We first evaluate the performance of the dual-quant step of cuSZ. The average throughput of the dual-quant step on each tested dataset is shown in Table <ref type="table">7</ref>. Compared with the original serial CPU-SZ, the prediction-quantization throughput is improved by more than 1000&#215; via our proposed dual-quant on the GPU. This improvement is because dual-quant entirely eliminates the RAW dependency and enables fine-grained (per-point) parallel computation, which is significantly accelerated on the GPU.</p><p>We then evaluate the performance of our implemented Huffman coding step by step. First, we conduct the experiment of Huffman histogram computation and show its throughput performance <ref type="foot">4</ref> . Efficiently computing a histogram on a GPU remains an open challenge because multiple threads need to write to the same memory locations simultaneously. Here, we present a method that, while still a bottleneck in the Huffman process, is a 2&#215; improvement over a serial implementation.</p><p>Next, we perform the experiment of constructing the codebook with different numbers of quantization bins, as shown in Table <ref type="table">3</ref>. We note that the execution times of building a Huffman tree and creating a codebook are consistent with our time complexity analyses in &#167;3.2.2. We use 1,024 quantization bins by default. Since the time overhead of constructing a codebook depends only on the number of quantization bins, it is almost fixed (for example, 4.81 ms) for the remaining experiments. We also note that a larger data size lowers the relative performance overhead of constructing a codebook, thus leading to higher overall performance. We also evaluate the performance of encoding and decoding based on the canonical codebook. 
To increase the memory bandwidth utilization, we adopt online selection of the Huffman codeword representation between uint32_t and uint64_t. Table <ref type="table">4</ref> illustrates that our encoding achieves about 250 GB/s for uint64_t and about 380 GB/s<ref type="foot">foot_3</ref> for uint32_t, based on the test with all 111 fields under the error bound of 1e-4. Hence, we conclude that using uint32_t enables significantly higher performance than using uint64_t. Because of the coarse-grained chunkwise parallelization, the performance of deflating is about 60 GB/s, which is lower than the encoding throughput of 380 GB/s. Consequently, the Huffman coding performance is bounded mainly by the deflating throughput.</p><p>To improve the deflating and inflating performance, we further evaluate different chunk sizes and identify the appropriate sizes for both deflating and inflating on the tested datasets, as shown in Table <ref type="table">6</ref>. Specifically, we evaluate chunk sizes ranging from 2^6 to 2^16, due to the different field sizes. We observe that using a total of around 2e4 concurrent threads consistently achieves the optimal throughput. Note that inflating must follow exactly the same data chunking strategy as deflating; thus, we need to select the same chunk size. Even under this constraint, our selected chunk sizes still achieve throughputs close to the peak ones, as illustrated in Table <ref type="table">6</ref>. Therefore, we conclude that the overall optimal performance can be achieved by setting up a total of 2e4 concurrent threads in practice.</p><p>Next, we evaluate the overall compression and decompression performance of cuSZ, as shown in Table <ref type="table">7</ref>. We compare cuSZ with cuZFP in terms of both the kernel performance and the overall performance, which includes the GPU-to-CPU communication cost. 
Note that the performance of cuZFP is highly related to its user-set fixed bitrate according to a previous study <ref type="bibr">[39]</ref>, whereas the performance of cuSZ is hardly affected by the user-set error bound. Therefore, we choose a fixed bitrate for cuZFP that generates data distortion (i.e., a PSNR of about 85 dB) similar to that of cuSZ, as shown in Table <ref type="table">5</ref>. Also, note that we exclude cuZFP for HACC in Table <ref type="table">7</ref>, because cuZFP yields fairly low compression quality on the 1D HACC dataset. In particular, even when the bitrate is as high as 16, the PSNR is only about 20 dB, which is not usable. The throughput in Table <ref type="table">7</ref> is calculated based on the original data size rather than the size of the data transferred between the GPU and CPU. Table <ref type="table">7</ref> shows that cuZFP has a higher kernel throughput but a lower GPU-to-CPU throughput than does cuSZ. The reason is that cuSZ provides a much higher compression ratio than does cuZFP at the same data distortion. We note that the overall throughputs of cuSZ and cuZFP are close to each other with respect to the CPU-GPU interconnect (16-lane PCIe 3.0) bandwidth in our evaluation. Generally speaking, many applications on GPU-based HPC systems generate the data on GPUs, so the compression needs to be performed directly on the data in GPU memory, and the compressed data currently must be transferred from GPUs to disks through CPUs. Current state-of-the-art CPU-GPU interconnect technologies such as NVLink <ref type="bibr">[40]</ref> can typically provide a theoretical transfer rate of 50 GB/s over two links, while our cuSZ's compression kernel provides a comparable throughput of about 40 GB/s. Although cuZFP's compression kernel achieves about 70 GB/s, its overall throughput is limited by the CPU-GPU bandwidth of 50 GB/s. 
Thus, the data transfer between CPU and GPU (at most about 50 GB/s) is still the bottleneck for high-throughput compression kernels. Moreover, the decompression throughput of cuSZ is lower than both its compression throughput and that of cuZFP. This is because only coarse-grained chunking can be applied to decompression, as mentioned in &#167;3.3. Here we argue that the compression throughput is more important than the decompression throughput, because after the compressed data is transferred and stored to parallel file systems, users mainly decompress it with CPU-SZ for postanalysis and visualization rather than on the GPU <ref type="bibr">[11,</ref><ref type="bibr">39]</ref>. We note that cuSZ on the CESM-ATM dataset exhibits much lower performance than on the other datasets. This is due to the fact that each field of the CESM-ATM dataset is fairly small (&#8764;25 MB), such that the codebook construction cost turns out to be relatively high compared with the other steps on this dataset. In fact, codebook construction is not a bottleneck for a relatively large dataset (such as hundreds of MBs per field), which is more common in practice (e.g., HACC, Nyx, QMCPACK).</p><p>We also compare the performance of cuSZ with that of the production version of SZ running on a single CPU core and on multiple CPU cores. The parallelization of OpenMP-SZ is achieved by simply chunking the whole dataset without any further algorithmic optimization (such as our proposed dual-quant). In particular, each thread is assigned a fixed-size block and runs the original sequential CPU-SZ code. The points on the border are handled similarly to cuSZ (as shown in Figure <ref type="figure">2</ref>). The main differences between OpenMP-SZ and cuSZ are fourfold: 1 In the proposed dual-quant, each point in cuSZ is assigned to a GPU thread, whereas OpenMP-SZ uses a CPU thread to handle a block of data points. 
2 After post-quant, the data are transformed into integers (in units of the error bound), and all the following arithmetic operations are performed on these integers. Hence cuSZ does not need to handle the errors that are introduced by floating-point operations (e.g., underflow). 3 OpenMP-SZ does not fully parallelize Huffman coding, whereas cuSZ provides an efficient parallel implementation of Huffman coding on the GPU. 4 OpenMP-SZ supports only 3D datasets, so in our comparison we use the 3D Hurricane Isabel and Nyx datasets and mark n/a for non-3D datasets in Figure <ref type="figure">5</ref>, which illustrates the compression and decompression throughput of cuSZ (including the CPU-GPU communication overhead) and CPU-SZ. Compared with the serial SZ, the overall compression performance is improved by 242.9&#215; to 370.1&#215;. cuSZ also improves the overall performance by 11.0&#215; to 13.1&#215; over SZ running with OpenMP on 32 cores.</p></div>
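The integer-only arithmetic noted in point 2 can be sketched for a 1-D chunk as follows. This is a hedged Python model of the forward dual-quantization idea: the 1-D previous-point predictor and the radius of 512 (half of the 1,024 default bins) are our simplifications of cuSZ's actual multi-dimensional scheme:

```python
def dual_quant_1d(data, eb, radius=512):
    """Forward dual-quantization sketch for a 1-D chunk. Pre-quantization
    maps each value to an integer multiple of 2*eb; post-quantization then
    takes the Lorenzo delta between *integers*, so no floating-point
    rounding error accumulates, and each point's pre-quantization is
    independent of every other point (fine-grained parallelism)."""
    q = [round(x / (2 * eb)) for x in data]          # prequant, per point
    return [radius + q[i] - (q[i - 1] if i else 0)   # postquant delta
            for i in range(len(q))]

# With eb = 0.5, values 0.0, 2.0, 6.0 map to integers 0, 2, 6 and then to
# centered bin codes 512, 514, 516.
codes = dual_quant_1d([0.0, 2.0, 6.0], 0.5)
```

Only the prefix sum of the deltas is needed to reconstruct the integers, which is exactly the sequential loop that decompression must perform chunk by chunk.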
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="4.2.2">Compression</head><p>Quality. We then present the compression quality of cuSZ compared with another advanced GPU-supported lossy compressor, cuZFP, based on the compression ratios and data distortions on the tested datasets. We use the peak signal-to-noise ratio (PSNR) <ref type="foot">6</ref> to evaluate the quality of the reconstructed data.</p><p>The larger the PSNR, the lower the reconstruction distortion, and hence the more accurate the postanalysis.</p><p>We compare cuSZ and cuZFP only on the two 3D datasets, Hurricane Isabel and Nyx, because the compression quality of cuZFP on the 1D/2D datasets is much lower than that on the 3D datasets. For a fair comparison, we plot the rate-distortion curves for both cuSZ and cuZFP on all the fields of the two datasets and compare their compression quality in PSNR at the same compression ratio. Figure <ref type="figure">6</ref> shows the rate-distortion curves of cuSZ and cuZFP on the Nyx dataset. We observe that cuSZ generally has a higher PSNR than does cuZFP at the same compression ratio on the Nyx dataset. In other words, cuSZ provides a much higher compression ratio than does cuZFP at the same compression quality. The main reason is twofold: 1 ZFP has better compression quality in the absolute error bound (fixed-accuracy) mode than in the fixed-rate mode (as indicated by the ZFP developer <ref type="bibr">[41]</ref>); and 2 the Lorenzo predictor of cuSZ has a higher decorrelation efficiency than does the block transform of cuZFP, especially on fields with a large value range and a concentrated distribution, such as baryon_density.</p><p>Similar results for cuSZ and cuZFP are observed on the Hurricane Isabel dataset, as shown in Figure <ref type="figure">7</ref>. We note that the rate-distortion curves of cuSZ on some fields, namely, QCLOUD, QICE, and CLOUD, increase notably when the compression ratio decreases. 
This is because these fields contain areas full of zeros, causing the compression ratio to change very slowly once the error bound is smaller than a certain value. In other words, most of the nonzeros are unpredictable, whereas the zeros are always predictable.</p><p>We also illustrate the overall rate-distortion curves of cuSZ and cuZFP on the Hurricane Isabel and Nyx datasets, as shown in Figure <ref type="figure">8</ref>. For example, cuSZ provides a 2.41&#215; (2.49 vs. 6) lower bitrate than cuZFP on the Nyx dataset and a 3.48&#215; (3.45 vs. 12) lower bitrate than cuZFP on the Hurricane Isabel dataset, with reasonable PSNRs, as shown in Table <ref type="table">5</ref>.</p><p>The reason is that, as described in &#167;3.1.1, cuSZ sets all the values in the padding layer to 0 and uses these zeros to predict the top-left data points, resulting in better prediction on the tested datasets, especially for fields with large value ranges and a large majority of values close to zero (such as CLOUDf48, QSNOWf48, and baryon_density, as shown in Table <ref type="table">9</ref>). However, SZ-1.4's prediction highly depends on the first data point's value, so it may suffer low prediction accuracy when the first data point deviates greatly from most of the other points. Therefore, cuSZ and SZ-1.4 have similar PSNRs on the datasets represented in logarithmic scale. A range of eb, or even eb/10, around 0 or the minimum value covers a majority of the data in these fields.</p></div>
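The PSNR metric used throughout this comparison (see the footnote) can be computed directly from its definition; a short Python transcription of the formula, with illustrative sample values:

```python
import math

def psnr(original, reconstructed):
    """PSNR = 20 * log10((dmax - dmin) / RMSE), where dmax/dmin are the
    extrema of the original data and RMSE is the root mean squared
    pointwise error between original and reconstructed values."""
    n = len(original)
    rmse = math.sqrt(sum((a - b) ** 2
                         for a, b in zip(original, reconstructed)) / n)
    dmax, dmin = max(original), min(original)
    return 20 * math.log10((dmax - dmin) / rmse)

# Value range 10 with RMSE 0.1 gives 20 * log10(100) = 40 dB.
quality = psnr([0.0, 10.0], [0.1, 9.9])
```

Note that PSNR is range-normalized: the same absolute error yields a higher PSNR on a field with a larger value range, which is why fields such as baryon_density are often compared in logarithmic scale.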
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5">RELATED WORK 5.1 GPU-Accelerated Scientific Compression</head><p>Scientific data compression has been studied for many years for reducing storage and I/O overhead. It includes two main categories: lossless compression and lossy compression. Lossless compressors for scientific datasets such as FPC <ref type="bibr">[42]</ref> and FPZIP <ref type="bibr">[43]</ref> ensure that the decompressed data is unchanged, but they provide only a limited compression ratio because of the significant randomness of the trailing mantissa bits of HPC floating-point data. According to a recent study <ref type="bibr">[6]</ref>, the compression ratio of lossless compressors on scientific datasets is generally up to 2:1, which is much lower than the user-desired ratio for HPC applications.</p><p>Error-bounded lossy compression significantly reduces the size of scientific data while maintaining desired data characteristics. Traditional lossy compressors (such as JPEG <ref type="bibr">[44]</ref>) are designed for image and visualization purposes; however, they are difficult to apply to scientific datasets because of scientists' specific data-fidelity requirements. Recently, error-bounded lossy compressors (such as SZ <ref type="bibr">[8]</ref> and ZFP <ref type="bibr">[45]</ref>) have been developed for scientific datasets. Such compressors provide strict error controls according to user requirements. Both SZ and ZFP, for example, provide an absolute error bound in their CPU versions.</p><p>Different from SZ's prediction-based compression algorithm, ZFP's algorithm is based on a block transform. It first splits the whole dataset into many small blocks. It then compresses the data in each block separately in four main steps: exponent alignment, customized near-orthogonal transform, fixed-point integer conversion, and bit-plane-based embedded coding. A truncation is performed based on the user-set bitrate. 
Recently, the ZFP team released a CUDA version, called cuZFP <ref type="bibr">[15]</ref>, which provides higher throughputs for compression and decompression than the CPU version <ref type="bibr">[39]</ref>. However, the current cuZFP supports only the fixed-rate mode, which significantly limits its adoption in practice.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="5.2">Huffman Coding on GPU</head><p>During the Huffman coding process, a specific method is used to determine the bit representation of each symbol, which results in variable-length prefix codes. The set of these prefix codes makes up the codebook, with each prefix code based on its symbol's frequency in the data. This codebook is then used to replace each input symbol with its corresponding prefix code. Previous studies have shown that Huffman coding achieves better performance in parallel on a GPU than in serial on a CPU. In general, parallel Huffman coding obtains each codeword from a lookup table (generated from the Huffman tree) and concatenates the codewords together. However, a severe performance issue arises when different threads write codewords of different lengths, which results in warp divergence on the GPU <ref type="bibr">[46]</ref>. Methods deviate from one another most in how they concatenate the codewords.</p><p>Fuentes-Alventosa et al. <ref type="bibr">[47]</ref> proposed a GPU implementation of Huffman coding using CUDA with a given table of variable-length codes, which improves the performance by more than 20&#215; compared with a serial CPU implementation. Rahmani et al. <ref type="bibr">[48]</ref> proposed a CUDA implementation of Huffman coding based on serially constructing the Huffman codeword tree and generating the byte stream in parallel, which can achieve up to 22&#215; speedup compared with a serial CPU implementation, without any constraint on the maximum codeword length or data entropy. Lal et al. <ref type="bibr">[49]</ref> proposed a Huffman-coding-based memory compression for GPUs (called E 2 MC) based on a probability estimation of symbols. It uses an intermediate buffer to reduce the required memory bandwidth. 
In order to place a codeword into the correct memory location, E 2 MC extends the codeword to the size of the buffer length and uses a barrel shifter to shift the codeword to the correct location. Once shifted, the codeword is bitwise ORed with the intermediate buffer, and the write location is advanced by the codeword length.</p></div>
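The shift-and-OR write path described above can be modeled on the CPU as follows. The word size, flush policy, and names below are our assumptions for illustration, not E 2 MC's actual hardware design:

```python
def pack_with_buffer(codes, word_bits=32):
    """Write variable-length codewords in the shift-and-OR style outlined
    above: shift each codeword to the current bit offset, OR it into an
    intermediate buffer, advance the write position by the codeword
    length, and flush a word whenever `word_bits` bits have accumulated."""
    words, buf, filled = [], 0, 0
    for code, width in codes:
        buf = (buf << width) | code          # barrel-shift then OR into buffer
        filled += width                      # advance by the codeword length
        while filled >= word_bits:           # flush full output words
            words.append((buf >> (filled - word_bits)) & ((1 << word_bits) - 1))
            filled -= word_bits
            buf &= (1 << filled) - 1
    if filled:
        words.append(buf << (word_bits - filled))  # zero-pad the last word
    return words
```

The intermediate buffer is what lets many narrow codewords be written with a few wide memory transactions, reducing the required memory bandwidth.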
<div xmlns="http://www.tei-c.org/ns/1.0"><head n="6">CONCLUSION AND FUTURE WORK</head><p>In this work, we propose cuSZ, a high-performance GPU-based lossy compressor for NVIDIA GPU architectures that effectively improves the compression throughput of SZ compared with the production version on CPUs. We propose a dual-quantization scheme to completely remove the strong data dependency in SZ's prediction-quantization step and implement an efficient customized Huffman coding. We also propose a series of techniques to optimize the performance of cuSZ, including fine-tuning the chunk size, adaptively selecting the Huffman code representation, and reusing memory. Experiments on five real-world HPC simulation datasets show that our proposed cuSZ improves the compression throughput by 242.9&#215; to 370.1&#215; over the serial CPU version and by 11.0&#215; to 13.1&#215; over the parallel CPU version. Compared with another state-of-the-art GPU-supported lossy compressor, cuSZ improves the compression ratio by 2.41&#215; to 3.48&#215; with reasonable data distortion on the tested datasets. We plan to further optimize the performance of decompression, implement other data prediction methods such as a linear-regression-based predictor, and evaluate the performance improvements of parallel I/O with cuSZ.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>ACKNOWLEDGMENTS</head><p>This research was supported by the Exascale Computing Project (ECP), Project Number: 17-SC-20-SC, a collaborative effort of two DOE organizations-the Office of Science and the National Nuclear Security Administration, responsible for the planning and preparation of a capable exascale ecosystem, including software, applications, hardware, advanced system engineering and early testbed platforms, to support the nation's exascale computing imperative. The material was supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357. This work was also supported by the National Science Foundation under Grants CCF-1619253, OAC-2003709, OAC-1948447/2034169, and OAC-2003624/202042084. We would like to thank The University of Alabama for providing the startup funding for this work.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="2" xml:id="foot_0"><p>Value-range-based relative error bound (denoted by valrel) is the error bound relative to the value range of the dataset.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="3" xml:id="foot_1"><p>The QMCPACK dataset includes only one field but with two representations.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="4" xml:id="foot_2"><p>All throughputs shown are measured based on the original data size and time.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="5" xml:id="foot_3"><p>NVIDIA V100 GPU has a theoretical peak memory bandwidth of 900 GB/s.</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" n="6" xml:id="foot_4"><p>PSNR is calculated as PSNR = 20 &#8226; log 10 ((d max -d min )/RMSE), where d max /d min is the maximal/minimal value of the original data. The root mean squared error (RMSE) is obtained by RMSE = sqrt((1/N) &#931; (d i -d' i ) 2 ), where N is the number of data points and d i /d' i denote the original/reconstructed values.</p></note>
		</body>
		</text>
</TEI>
