- Award ID(s):
- 1637458
- PAR ID:
- 10062445
- Date Published:
- Journal Name:
- IPDPS .... [proceedings]
- ISSN:
- 2332-1237
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
We develop a dynamic dictionary data structure for the GPU, supporting fast insertions and deletions, based on the Log Structured Merge tree (LSM). Our implementation on an NVIDIA K40c GPU has an average update (insertion or deletion) rate of 225 M elements/s, 13.5x faster than merging items into a sorted array. The GPU LSM supports the retrieval operations of lookup, count, and range query operations with an average rate of 75 M, 32 M and 23 M queries/s respectively. The trade-off for the dynamic updates is that the sorted array is almost twice as fast on retrievals. We believe that our GPU LSM is the first dynamic general-purpose dictionary data structure for the GPU.more » « less
-
We engineer a GPU implementation of a B-Tree that supports concurrent queries (point, range, and successor) and updates (insertions and deletions). Our B-tree outperforms the state of the art, a GPU log-structured merge tree (LSM) and a GPU sorted array. In particular, point and range queries are significantly faster than in a GPU LSM (the GPU LSM does not implement successor queries). Furthermore, B-Tree insertions are also faster than LSM and sorted array insertions unless insertions come in batches of more than roughly 100k. Because we cache the upper levels of the tree, we achieve lookup throughput that exceeds the DRAM bandwidth of the GPU. We demonstrate that the key limiter of performance on a GPU is contention and describe the design choices that allow us to achieve this high performance.more » « less
-
We present a fast dynamic graph data structure for the GPU. Our dynamic graph structure uses one hash table per vertex to store adjacency lists and achieves 3.4–14.8x faster insertion rates over the state of the art across a diverse set of large datasets, as well as deletion speedups up to 7.8x. The data structure supports queries and dynamic updates through both edge and vertex insertion and deletion. In addition, we define a comprehensive evaluation strategy based on operations, workloads, and applications that we believe better characterize and evaluate dynamic graph data structures.more » « less
-
Today’s large-scale scientific applications running on high-performance computing (HPC) systems generate vast data volumes. Thus, data compression is becoming a critical technique to mitigate the storage burden and data-movement cost. However, existing lossy compressors for scientific data cannot achieve a high compression ratio and throughput simultaneously, hindering their adoption in many applications requiring fast compression, such as in-memory compression. To this end, in this work, we develop a fast and high-ratio error-bounded lossy compressor on GPUs for scientific data (called FZ-GPU). Specifically, we first design a new compression pipeline that consists of fully parallelized quantization, bitshuffle, and our newly designed fast encoding. Then, we propose a series of deep architectural optimizations for each kernel in the pipeline to take full advantage of CUDA architectures. We propose a warp-level optimization to avoid data conflicts for bit-wise operations in bitshuffle, maximize shared memory utilization, and eliminate unnecessary data movements by fusing different compression kernels. Finally, we evaluate FZ-GPU on two NVIDIA GPUs (i.e., A100 and RTX A4000) using six representative scientific datasets from SDRBench. Results on the A100 GPU show that FZ-GPU achieves an average speedup of 4.2× over cuSZ and an average speedup of 37.0× over a multi-threaded CPU implementation of our algorithm under the same error bound. FZ-GPU also achieves an average speedup of 2.3× and an average compression ratio improvement of 2.0× over cuZFP under the same data distortion.more » « less
-
Abstract Motivation Accurate and efficient predictions of protein structures play an important role in understanding their functions. Iterative Threading Assembly Refinement (I-TASSER) is one of the most successful and widely used protein structure prediction methods in the recent community-wide CASP experiments. Yet, the computational efficiency of I-TASSER is one of the limiting factors that prevent its application for large-scale structure modeling.
Results We present I-TASSER for Graphics Processing Units (GPU-I-TASSER), a GPU accelerated I-TASSER protein structure prediction tool for fast and accurate protein structure prediction. Our implementation is based on OpenACC parallelization of the replica-exchange Monte Carlo simulations to enhance the speed of I-TASSER by extending its capabilities to the GPU architecture. On a benchmark dataset of 71 protein structures, GPU-I-TASSER achieves on average a 10× speedup with comparable structure prediction accuracy compared to the CPU version of the I-TASSER.
Availability and implementation The complete source code for GPU-I-TASSER can be downloaded and used without restriction from https://zhanggroup.org/GPU-I-TASSER/.
Supplementary information Supplementary data are available at Bioinformatics online.