skip to main content


Title: GPU-Accelerated Decoding of Integer Lists
An inverted index is the basic data structure used in most current large-scale information retrieval systems. It can be modeled as a collection of sorted sequences of integers. Many compression techniques for inverted indexes have been studied in the past, with some of them reaching tremendous decompression speeds through the use of SIMD instructions available on modern CPUs. While there has been some work on query processing algorithms for Graphics Processing Units (GPUs), little of it has focused on how to efficiently access compressed index structures, and we see some potential for significant improvements in decompression speed. In this paper, we describe and implement two encoding schemes for index decompression on GPU architectures. Their format and decoding algorithm is adapted from existing CPU-based compression methods to exploit the execution model and memory hierarchy offered by GPUs. We show that our solutions, GPU-BP and GPU-VByte, achieve significant speedups over their already carefully optimized CPU counterparts.  more » « less
Award ID(s):
1718680
NSF-PAR ID:
10171645
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
Proceedings of the 28th ACM International Conference on Information and Knowledge Management
Page Range / eLocation ID:
2193 to 2196
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    Error-bounded lossy compression is a state-of-the-art data reduction technique for HPC applications because it not only significantly reduces storage overhead but also can retain high fidelity for postanalysis. Because supercomputers and HPC applications are becoming heterogeneous using accelerator-based architectures, in particular GPUs, several development teams have recently released GPU versions of their lossy compressors. However, existing state-of-the-art GPU-based lossy compressors suffer from either low compression and decompression throughput or low compression quality. In this paper, we present an optimized GPU version, cuSZ, for one of the best error-bounded lossy compressors-SZ. To the best of our knowledge, cuSZ is the first error-bounded lossy compressor on GPUs for scientific data. Our contributions are fourfold. (1) We propose a dual-quantization scheme to entirely remove the data dependency in the prediction step of SZ such that this step can be performed very efficiently on GPUs. (2) We develop an efficient customized Huffman coding for the SZ compressor on GPUs. (3) We implement cuSZ using CUDA and optimize its performance by improving the utilization of GPU memory bandwidth. (4) We evaluate our cuSZ on five real-world HPC application datasets from the Scientific Data Reduction Benchmarks and compare it with other state-of-the-art methods on both CPUs and GPUs. Experiments show that our cuSZ improves SZ's compression throughput by up to 370.1x and 13.1x, respectively, over the production version running on single and multiple CPU cores, respectively, while getting the same quality of 
    more » « less
  2. Error-bounded lossy compression is a state-of-the-art data reduction technique for HPC applications because it not only significantly reduces storage overhead but also can retain high fidelity for postanalysis. Because supercomputers and HPC applications are becoming heterogeneous using accelerator-based architectures, in particular GPUs, several development teams have recently released GPU versions of their lossy compressors. However, existing state-of-the-art GPU-based lossy compressors suffer from either low compression and decompression throughput or low compression quality. In this paper, we present an optimized GPU version, cuSZ, for one of the best error-bounded lossy compressors-SZ. To the best of our knowledge, cuSZ is the first error-bounded lossy compressor on GPUs for scientific data. Our contributions are fourfold. (1) We propose a dual-quantization scheme to entirely remove the data dependency in the prediction step of SZ such that this step can be performed very efficiently on GPUs. (2) We develop an efficient customized Huffman coding for the SZ compressor on GPUs. (3) We implement cuSZ using CUDA and optimize its performance by improving the utilization of GPU memory bandwidth. (4) We evaluate our cuSZ on five real-world HPC application datasets from the Scientific Data Reduction Benchmarks and compare it with other state-of-the-art methods on both CPUs and GPUs. Experiments show that our cuSZ improves SZ's compression throughput by up to 370.1x and 13.1x, respectively, over the production version running on single and multiple CPU cores, respectively, while getting the same quality of reconstructed data. It also improves the compression ratio by up to 3.48x on the tested data compared with another state-of-the-art GPU supported lossy compressor. 
    more » « less
  3. Chi-Wang Shu (Ed.)
    GPU computing is expected to play an integral part in all modern Exascale supercomputers. It is also expected that higher order Godunov schemes will make up about a significant fraction of the application mix on such supercomputers. It is, therefore, very important to prepare the community of users of higher order schemes for hyperbolic PDEs for this emerging opportunity. Not every algorithm that is used in the space-time update of the solution of hyperbolic PDEs will take well to GPUs. However, we identify a small core of algorithms that take exceptionally well to GPU computing. Based on an analysis of available options, we have been able to identify weighted essentially non-oscillatory (WENO) algorithms for spatial reconstruction along with arbitrary derivative (ADER) algorithms for time extension followed by a corrector step as the winning three-part algorithmic combination. Even when a winning subset of algorithms has been identified, it is not clear that they will port seamlessly to GPUs. The low data throughput between CPU and GPU, as well as the very small cache sizes on modern GPUs, implies that we have to think through all aspects of the task of porting an application to GPUs. For that reason, this paper identifies the techniques and tricks needed for making a successful port of this very useful class of higher order algorithms to GPUs. Application codes face a further challenge—the GPU results need to be practically indistinguishable from the CPU results—in order for the legacy knowledge bases embedded in these applications codes to be preserved during the port of GPUs. This requirement often makes a complete code rewrite impossible. For that reason, it is safest to use an approach based on OpenACC directives, so that most of the code remains intact (as long as it was originally well-written). This paper is intended to be a one-stop shop for anyone seeking to make an OpenACC-based port of a higher order Godunov scheme to GPUs. We focus on three broad and high-impact areas where higher order Godunov schemes are used. The first area is computational fluid dynamics (CFD). The second is computational magnetohydrodynamics (MHD) which has an involution constraint that has to be mimetically preserved. The third is computational electrodynamics (CED) which has involution constraints and also extremely stiff source terms. Together, these three diverse uses of higher order Godunov methodology, cover many of the most important applications areas. In all three cases, we show that the optimal use of algorithms, techniques, and tricks, along with the use of OpenACC, yields superlative speedups on GPUs. As a bonus, we find a most remarkable and desirable result: some higher order schemes, with their larger operations count per zone, show better speedup than lower order schemes on GPUs. In other words, the GPU is an optimal stratagem for overcoming the higher computational complexities of higher order schemes. Several avenues for future improvement have also been identified. A scalability study is presented for a real-world application using GPUs and comparable numbers of high-end multicore CPUs. It is found that GPUs offer a substantial performance benefit over comparable number of CPUs, especially when all the methods designed in this paper are used. 
    more » « less
  4. Today’s large-scale scientific applications running on high-performance computing (HPC) systems generate vast data volumes. Thus, data compression is becoming a critical technique to mitigate the storage burden and data-movement cost. However, existing lossy compressors for scientific data cannot achieve a high compression ratio and throughput simultaneously, hindering their adoption in many applications requiring fast compression, such as in-memory compression. To this end, in this work, we develop a fast and high-ratio error-bounded lossy compressor on GPUs for scientific data (called FZ-GPU). Specifically, we first design a new compression pipeline that consists of fully parallelized quantization, bitshuffle, and our newly designed fast encoding. Then, we propose a series of deep architectural optimizations for each kernel in the pipeline to take full advantage of CUDA architectures. We propose a warp-level optimization to avoid data conflicts for bit-wise operations in bitshuffle, maximize shared memory utilization, and eliminate unnecessary data movements by fusing different compression kernels. Finally, we evaluate FZ-GPU on two NVIDIA GPUs (i.e., A100 and RTX A4000) using six representative scientific datasets from SDRBench. Results on the A100 GPU show that FZ-GPU achieves an average speedup of 4.2× over cuSZ and an average speedup of 37.0× over a multi-threaded CPU implementation of our algorithm under the same error bound. FZ-GPU also achieves an average speedup of 2.3× and an average compression ratio improvement of 2.0× over cuZFP under the same data distortion. 
    more » « less
  5. null (Ed.)
    Deep neural networks (DNNs) are increasingly used for real-time inference, requiring low latency, but require significant computational power as they continue to increase in complexity. Edge clouds promise to offer lower latency due to their proximity to end-users and having powerful accelerators like GPUs to provide the computation power needed for DNNs. But it is also important to ensure that the edge-cloud resources are utilized well. For this, multiplexing several DNN models through spatial sharing of the GPU can substantially improve edge-cloud resource usage. Typical GPU runtime environments have significant interactions with the CPU, to transfer data to the GPU, for CPU-GPU synchronization on inference task completions, etc. These result in overheads. We present a DNN inference framework with a set of software primitives that reduce the overhead for DNN inference, increase GPU utilization and improve performance, with lower latency and higher throughput. Our first primitive uses the GPU DMA effectively, reducing the CPU cycles spent to transfer the data to the GPU. A second primitive uses asynchronous ‘events’ for faster task completion notification. GPU runtimes typically preclude fine-grained user control on GPU resources, causing long GPU downtimes when adjusting resources. Our third primitive supports overlapping of model-loading and execution, thus allowing GPU resource re-allocation with very little GPU idle time. Our other primitives increase inference throughput by improving scheduling and processing more requests. Overall, our primitives decrease inference latency by more than 35% and increase DNN throughput by 2-3×. 
    more » « less