We present ColumnBurst, a memory-efficient, near-storage hardware accelerator for database join queries. While the paradigm of near-storage computation has demonstrated performance and efficiency benefits on many workloads by reducing data movement overhead, memory-bound operations such as relational joins on unsorted data have been relatively inefficient with fast modern storage devices, due to the limited capacity and performance of memory available on the near-storage processing engine. ColumnBurst delivers very high performance even on such complex queries, while staying within the memory performance and capacity budget of what is typically already available on off-the-shelf storage devices. ColumnBurst achieves this via a compact, hardware implementation of sorting-based group-by aggregation and join algorithms, instead of the conventional hash-based algorithms. We evaluate ColumnBurst using an FPGA-based prototype with 1 GB of slow on-device DDR3 DRAM, and show that on benchmarks including TPC-H queries with join queries on unsorted columns, it outperforms MonetDB on a 6-core i7 with 32 GB of DRAM by over 7x, and ColumnBurst using a near-storage hash join algorithm by 2x. 
                        more » 
                        « less   
                    
                            
                            Towards a Memory-Adaptive Hybrid Hash Join Design
                        
                    
    
            In database management systems (DBMSs) that handle multiple concurrent queries, adapting to fluctuating workloads is crucial. This flexibility allows the DBMS to revise decisions based on current workload and available resources. As memory availability changes with the arrival or completion of queries, having memory-intensive operators like the Hybrid Hash Join that dynamically adapt is vital. This paper introduces a new memory-adaptive Hash-Based join algorithm design implemented in Apache AsterixDB and evaluates its responsiveness to memory variability. 
        more » 
        « less   
        
    
    
                            - PAR ID:
- 10548704
- Publisher / Repository:
- 2023 IEEE International Conference on Big Data (BigData)
- Date Published:
- ISBN:
- 979-8-3503-2445-7
- Page Range / eLocation ID:
- 6283 to 6285
- Format(s):
- Medium: X
- Location:
- Sorrento, Italy
- Sponsoring Org:
- National Science Foundation
More Like this
- 
            
- 
            We optimize the overall energy consumption of a Narrowband Internet of Things (NB-IoT) application created using a hybrid blockchain framework. We accomplish this by engineering the underlying hash function (SHA-256) that is used in different procedures (Unique ID generation, Device Join, and Device Transaction) of the blockchain-based NB-IoT system. In order to reduce the complexity of hash verification, IoT devices in the NB-IoT application are built to save the hashes of their authorized transactions as a linear hash chain rather than the entire Merkle tree. Furthermore, base station memory is dynamically partitioned to improve memory usage efficiency and scalability. Compared with the state-of-the-art approach, our approach considerably reduces the total energy consumption of the state-of-the-art application.more » « less
- 
            We design and implement a fully concurrent dynamic hash table for GPUs with comparable performance to the state of the art static hash tables. We propose a warp-cooperative work sharing strategy that reduces branch divergence and provides an efficient alternative to the traditional way of per-thread (or per-warp) work assignment and processing. By using this strategy, we build a dynamic non-blocking concurrent linked list, the slab list, that supports asynchronous, concurrent updates (insertions and deletions) as well as search queries. We use the slab list to implement a dynamic hash table with chaining (the slab hash). On an NVIDIA Tesla K40c GPU, the slab hash performs updates with up to 512 M updates/s and processes search queries with up to 937 M queries/s. We also design a warp-synchronous dynamic memory allocator, SlabAlloc, that suits the high performance needs of the slab hash. SlabAlloc dynamically allocates memory at a rate of 600 M allocations/s, which is up to 37x faster than alternative methods in similar scenarios.more » « less
- 
            We design and implement a fully concurrent dynamic hash table for GPUs with comparable performance to the state of the art static hash tables. We propose a warp-cooperative work sharing strategy that reduces branch divergence and provides an efficient alternative to the tradi- tional way of per-thread (or per-warp) work assignment and processing. By using this strategy, we build a dynamic non- blocking concurrent linked list, the slab list, that supports asynchronous, concurrent updates (insertions and deletions) as well as search queries. We use the slab list to implement a dynamic hash table with chaining (the slab hash). On an NVIDIA Tesla K40c GPU, the slab hash performs updates with up to 512 M updates/s and processes search queries with up to 937 M queries/s. We also design a warp-synchronous dynamic memory allocator, SlabAlloc, that suits the high performance needs of the slab hash. SlabAlloc dynamically allocates memory at a rate of 600 M allocations/s, which is up to 37x faster than alternative methods in similar scenarios.more » « less
- 
            Abstract The join and group-by aggregation are two memory intensive operators that are affecting the performance of relational databases. Hashing is a common approach used to implement both operators. Recent paradigm shifts in multi-core processor architectures have reinvigorated research into how the join and group-by aggregation operators can leverage these advances. However, the poor spatial locality of the hashing approach has hindered performance on multi-core processor architectures which rely on using large cache hierarchies for latency mitigation. Multithreaded architectures can better cope with poor spatial locality by masking memory latency with many outstanding requests. Nevertheless, the number of parallel threads, even in the most advanced multithreaded processors, such as UltraSPARC, is not enough to fully cover the main memory access latency. In this paper, we explore the hardware re-configurability of FPGAs to enable deeper execution pipelines that maintain hundreds (instead of tens) of outstanding memory requests across four FPGAs-drastically increasing concurrency and throughput. We present two end-to-end in-memory accelerators for the join and group-by aggregation operators using FPGAs. Both accelerators use massive multithreading to mask long memory delays of traversing linked-list data structures, while concurrently managing hundreds of thread states across four FPGAs locally. We explore how content addressable memories can be intermixed within our multithreaded designs to act as a synchronizing cache , which enforces locks and merges jobs together before they are written to memory. Throughput results for our hash-join operator accelerator show a speedup between 2 $$\times $$ × and 3.4 $$\times $$ × over the best multi-core approaches with comparable memory bandwidths on uniform and skewed datasets. The accelerator for the hash-based group-by aggregation operator demonstrates that leveraging CAMs achieves average speedup of 3.3 $$\times $$ × with a best case of 9.4 $$\times $$ × in terms of throughput over CPU implementations across five types of data distributions.more » « less
 An official website of the United States government
An official website of the United States government 
				
			 
					 
					
 
                                    