Analyzing and Implementing GPU Hash Tables

https://doi.org/10.1137/1.9781611977578.ch3

Awad, Muhammad A.; Ashkiani, Saman; Porumbescu, Serban D.; Farach-Colton, Martín; Owens, John D. (January 2023, SIAM Symposium on Algorithmic Principles of Computer Systems)

We revisit the problem of building static hash tables on the GPU and present an efficient implementation of bucketed hash tables. By decoupling the probing scheme from the hash table's in-memory representation, we offer an implementation where the number of probes and the bucket size are the only factors limiting performance. Our analysis sweeps through the hash table parameter space for two probing schemes: cuckoo and iceberg hashing. We show that a bucketed cuckoo hash table (BCHT) that uses three hash functions outperforms alternative methods that use iceberg hashing and a cuckoo hash table that uses a bucket size of one. At load factors as high as 0.99, BCHT enjoys an average probe count of 1.43 during insertion. Using only three hash functions, positive and negative queries require at most 1.39 and 2.8 average probes per key, respectively.
Full Text Available
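
To make the bucketed cuckoo scheme in the abstract above concrete, here is a minimal CPU-side sketch in Python. It is not the authors' GPU implementation. The class name, the bucket size of 16, the hash family `(a*key + b) mod p`, the random-victim eviction policy, and the eviction limit are all assumptions chosen for illustration. Keys are assumed to be distinct non-negative integers below the modulus, and the table is build-once (no deletions).

```python
import random


class BucketedCuckooTable:
    """Illustrative bucketed cuckoo hash table (CPU sketch, hypothetical names)."""

    def __init__(self, num_buckets, bucket_size=16, num_hashes=3,
                 max_evictions=1000, seed=0):
        self.num_buckets = num_buckets
        self.bucket_size = bucket_size      # assumed value, not taken from the abstract
        self.num_hashes = num_hashes        # the abstract's best variant uses three
        self.max_evictions = max_evictions  # assumed limit before giving up
        self.rng = random.Random(seed)
        self.prime = 2**31 - 1              # Mersenne prime used as the hash modulus
        self.params = [(self.rng.randrange(1, self.prime),
                        self.rng.randrange(self.prime))
                       for _ in range(num_hashes)]
        self.buckets = [[] for _ in range(num_buckets)]

    def _bucket_of(self, key, i):
        a, b = self.params[i]
        return ((a * key + b) % self.prime) % self.num_buckets

    def insert(self, key, value):
        """Insert a new key; return the number of buckets probed."""
        item, i = (key, value), 0
        for probes in range(1, self.max_evictions + 1):
            index = self._bucket_of(item[0], i)
            bucket = self.buckets[index]
            if len(bucket) < self.bucket_size:
                bucket.append(item)
                return probes
            # Bucket is full: swap with a random resident, then retry the
            # evicted pair with the hash function after the one that put it here.
            slot = self.rng.randrange(self.bucket_size)
            item, bucket[slot] = bucket[slot], item
            used = next(h for h in range(self.num_hashes)
                        if self._bucket_of(item[0], h) == index)
            i = (used + 1) % self.num_hashes
        # The pair currently held is dropped here; a real table would rebuild.
        raise RuntimeError("insertion failed; rebuild with new hash functions")

    def find(self, key):
        """Return the stored value, or None if the key is absent."""
        for i in range(self.num_hashes):
            bucket = self.buckets[self._bucket_of(key, i)]
            for k, v in bucket:
                if k == key:
                    return v
            # Without deletions, a key only moves past a bucket that was full,
            # so a non-full bucket proves the key is absent.
            if len(bucket) < self.bucket_size:
                return None
        return None
```

The early exit in `find` is what keeps the probe count for absent keys below the number of hash functions on average; it relies on the no-deletion assumption stated above.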
Dynamic Graphs on the GPU

https://doi.org/10.1109/IPDPS47924.2020.00081

Awad, Muhammad A.; Ashkiani, Saman; Porumbescu, Serban D.; Owens, John D. (May 2020, Proceedings of the 34th IEEE International Parallel and Distributed Processing Symposium)

We present a fast dynamic graph data structure for the GPU. Our dynamic graph structure uses one hash table per vertex to store adjacency lists and achieves 3.4–14.8x faster insertion rates over the state of the art across a diverse set of large datasets, as well as deletion speedups up to 7.8x. The data structure supports queries and dynamic updates through both edge and vertex insertion and deletion. In addition, we define a comprehensive evaluation strategy based on operations, workloads, and applications that we believe better characterize and evaluate dynamic graph data structures.
Full Text Available
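
The "one hash table per vertex" layout described in the abstract above can be sketched on the CPU as follows. This is only an illustration of the data layout, not the authors' GPU structure: it uses Python's built-in `dict` as the per-vertex hash table, assumes an undirected graph, and the class and method names are hypothetical.

```python
class DynamicGraph:
    """Undirected dynamic graph with one hash table per vertex (CPU sketch)."""

    def __init__(self):
        # vertex -> {neighbor: weight}; each inner dict is that vertex's
        # adjacency list stored as a hash table.
        self.adj = {}

    def insert_vertex(self, v):
        self.adj.setdefault(v, {})

    def insert_edge(self, u, v, weight=1):
        self.insert_vertex(u)
        self.insert_vertex(v)
        self.adj[u][v] = weight
        self.adj[v][u] = weight

    def has_edge(self, u, v):
        return u in self.adj and v in self.adj[u]

    def delete_edge(self, u, v):
        if not self.has_edge(u, v):
            return False
        del self.adj[u][v]
        self.adj[v].pop(u, None)  # pop tolerates the self-loop case u == v
        return True

    def delete_vertex(self, v):
        neighbors = self.adj.pop(v, None)
        if neighbors is None:
            return False
        # The vertex's own table lists exactly which other tables need fixing.
        for n in neighbors:
            if n != v:
                self.adj[n].pop(v, None)
        return True

    def degree(self, v):
        return len(self.adj.get(v, {}))
```

Edge queries and updates touch only the tables of the two endpoints, so their cost does not depend on the size of the rest of the graph.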
Engineering a High-Performance GPU B-Tree

https://doi.org/10.1145/3293883.3295706

Awad, Muhammad A.; Ashkiani, Saman; Johnson, Rob; Farach-Colton, Martín; Owens, John D. (February 2019, Proceedings of the 24th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming)

We engineer a GPU implementation of a B-Tree that supports concurrent queries (point, range, and successor) and updates (insertions and deletions). Our B-Tree outperforms the state of the art: a GPU log-structured merge tree (LSM) and a GPU sorted array. In particular, point and range queries are significantly faster than in a GPU LSM (the GPU LSM does not implement successor queries). Furthermore, B-Tree insertions are also faster than LSM and sorted array insertions unless insertions come in batches of more than roughly 100k. Because we cache the upper levels of the tree, we achieve lookup throughput that exceeds the DRAM bandwidth of the GPU. We demonstrate that the key limiter of performance on a GPU is contention and describe the design choices that allow us to achieve this high performance.
Full Text Available
A Dynamic Hash Table for the GPU

https://doi.org/10.1109/IPDPS.2018.00052

Ashkiani, Saman; Farach-Colton, Martín; Owens, John D. (May 2018, Proceedings of the 32nd IEEE International Parallel and Distributed Processing Symposium)

We design and implement a fully concurrent dynamic hash table for GPUs with performance comparable to state-of-the-art static hash tables. We propose a warp-cooperative work sharing strategy that reduces branch divergence and provides an efficient alternative to traditional per-thread (or per-warp) work assignment and processing. By using this strategy, we build a dynamic non-blocking concurrent linked list, the slab list, that supports asynchronous, concurrent updates (insertions and deletions) as well as search queries. We use the slab list to implement a dynamic hash table with chaining (the slab hash). On an NVIDIA Tesla K40c GPU, the slab hash performs up to 512 M updates/s and processes up to 937 M search queries/s. We also design a warp-synchronous dynamic memory allocator, SlabAlloc, that suits the high performance needs of the slab hash. SlabAlloc dynamically allocates memory at a rate of 600 M allocations/s, which is up to 37x faster than alternative methods in similar scenarios.
Full Text Available
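
The slab list and chained hash table from the abstract above can be approximated by a single-threaded Python sketch. It shows only the layout idea (a linked list whose nodes each hold many keys, so one node can be scanned in one wide step) and none of the concurrency, warp cooperation, or SlabAlloc. The slab capacity of 31 keys, the slot-reuse policy on deletion, and all names are assumptions made for illustration.

```python
EMPTY = object()  # sentinel marking an unused slot
SLAB_KEYS = 31    # assumed capacity: 31 keys plus one next pointer per slab


class Slab:
    """One node of the slab list: a fixed array of keys and a next pointer."""

    def __init__(self):
        self.keys = [EMPTY] * SLAB_KEYS
        self.next = None


class SlabHashSet:
    """Hash set with chaining, where each chain is a list of slabs."""

    def __init__(self, num_buckets):
        self.heads = [Slab() for _ in range(num_buckets)]

    def _head(self, key):
        return self.heads[hash(key) % len(self.heads)]

    def search(self, key):
        slab = self._head(key)
        while slab is not None:
            # On a GPU a warp would compare all slots of the slab at once;
            # here the list scan stands in for that step.
            if key in slab.keys:
                return True
            slab = slab.next
        return False

    def insert(self, key):
        """Insert key; return False if it was already present."""
        slab = self._head(key)
        free = None
        while True:
            for i, k in enumerate(slab.keys):
                if k is EMPTY:
                    if free is None:
                        free = (slab, i)  # remember the first free slot
                elif k == key:
                    return False
            if slab.next is None:
                break
            slab = slab.next
        if free is None:
            slab.next = Slab()            # chain is full: append a new slab
            free = (slab.next, 0)
        target, i = free
        target.keys[i] = key
        return True

    def delete(self, key):
        slab = self._head(key)
        while slab is not None:
            if key in slab.keys:
                slab.keys[slab.keys.index(key)] = EMPTY
                return True
            slab = slab.next
        return False
```

Packing many keys into each node means a chain traversal follows far fewer pointers than a one-key-per-node linked list would for the same number of elements.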
GPU LSM: A Dynamic Dictionary Data Structure for the GPU

https://doi.org/10.1109/IPDPS.2018.00053

Ashkiani, Saman; Li, Shengren; Farach-Colton, Martín; Amenta, Nina; Owens, John D. (May 2018, Proceedings of the 32nd IEEE International Parallel and Distributed Processing Symposium)

We develop a dynamic dictionary data structure for the GPU, supporting fast insertions and deletions, based on the Log-Structured Merge tree (LSM). Our implementation on an NVIDIA K40c GPU has an average update (insertion or deletion) rate of 225 M elements/s, 13.5x faster than merging items into a sorted array. The GPU LSM supports lookup, count, and range queries at average rates of 75 M, 32 M, and 23 M queries/s, respectively. The trade-off for the dynamic updates is that the sorted array is almost twice as fast on retrievals. We believe that our GPU LSM is the first dynamic general-purpose dictionary data structure for the GPU.
Full Text Available
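
The LSM idea in the abstract above (batched updates merged into sorted levels, with retrievals searching those levels) can be sketched on the CPU as follows. This is an illustrative sketch under assumptions, not the authors' code: the fixed batch size, the doubling level sizes, the tombstone marker for deletions, and all names are choices made here, and only point lookup is shown.

```python
import bisect

TOMBSTONE = object()  # assumed marker: a (key, TOMBSTONE) update deletes key


def _key(pair):
    return pair[0]


class BatchLSM:
    """Levels of sorted (key, value) runs; level i holds batch_size * 2**i pairs."""

    def __init__(self, batch_size):
        self.batch_size = batch_size
        self.levels = []  # levels[i] is None (empty) or a sorted list of pairs

    def update(self, batch):
        """Apply a full batch of (key, value) or (key, TOMBSTONE) updates."""
        if len(batch) != self.batch_size:
            raise ValueError("updates must arrive in full batches")
        # Reverse first so that, after the stable sort, the latest update
        # to a key within the batch comes first among equal keys.
        run = sorted(reversed(batch), key=_key)
        i = 0
        # Carry like a binary counter: merge with each occupied level in turn.
        while i < len(self.levels) and self.levels[i] is not None:
            # The newer run is listed first and the sort is stable, so newer
            # pairs stay ahead of older pairs with the same key.
            run = sorted(run + self.levels[i], key=_key)
            self.levels[i] = None
            i += 1
        if i == len(self.levels):
            self.levels.append(None)
        self.levels[i] = run

    def lookup(self, key):
        """Return the most recent value for key, or None if absent or deleted."""
        for level in self.levels:  # smaller levels hold newer updates
            if level is None:
                continue
            # (key,) sorts before every (key, value) pair, so bisect_left
            # lands on the newest pair for this key without comparing values.
            pos = bisect.bisect_left(level, (key,))
            if pos < len(level) and level[pos][0] == key:
                value = level[pos][1]
                return None if value is TOMBSTONE else value
        return None
```

Stale pairs and tombstones are kept inside the levels rather than removed on merge, which keeps every level at an exact size but means lookups may scan several levels; this mirrors the update-versus-retrieval trade-off the abstract mentions.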