 Award ID(s):
 1740333
 NSFPAR ID:
 10397858
 Date Published:
 Journal Name:
 SIAM Symposium on Algorithmic Principles of Computer Systems
 Page Range / eLocation ID:
 33–50
 Format(s):
 Medium: X
 Sponsoring Org:
 National Science Foundation
More Like this

Hashing is a fundamental operation in database management, playing a key role in the implementation of numerous core database data structures and algorithms. Traditional hash functions aim to mimic a function that maps a key to a random value, which can result in collisions, where multiple keys are mapped to the same value. There are many well-known schemes, such as chaining, probing, and cuckoo hashing, for handling collisions. In this work, we study whether using learned models instead of traditional hash functions can reduce collisions and whether such a reduction translates to improved performance, particularly for indexing and joins. We show that learned models reduce collisions in some cases, depending on how the data is distributed. To evaluate the effectiveness of learned models as hash functions, we test them with bucket chaining, linear probing, and cuckoo hash tables. We find that learned models can (1) yield a 1.4x lower probe latency, and (2) reduce the non-partitioned hash join runtime by 28% over the next best baseline for certain datasets. On the other hand, if the data distribution is not suitable, we either see no gains or see worse performance. In summary, we find that learned models can indeed outperform hash functions, but only for certain data distributions.
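To make the idea concrete, here is a minimal Python sketch of a learned model used as a hash function, assuming a CDF-style model: the model estimates a key's rank in the distribution and scales it to the bucket range. The class name, the sampling approach, and the bucket-chaining usage are illustrative assumptions, not the paper's implementation.

```python
# Sketch of using a learned model as a hash function (illustrative only;
# the paper's actual models and training procedure may differ).
# Idea: approximate the CDF of the key distribution, then scale the
# predicted rank to the number of buckets. When the model fits the data
# well, nearly-unique CDF values yield fewer collisions than a random hash.

import bisect

class LearnedHash:
    def __init__(self, sample_keys, num_buckets):
        # A piecewise-constant CDF estimate from a sorted sample of keys.
        self.sorted_sample = sorted(sample_keys)
        self.num_buckets = num_buckets

    def __call__(self, key):
        # Estimated rank of `key` in the distribution -> bucket index.
        rank = bisect.bisect_left(self.sorted_sample, key)
        cdf = rank / len(self.sorted_sample)
        return min(int(cdf * self.num_buckets), self.num_buckets - 1)

# Usage with simple bucket chaining:
sample = [3, 8, 15, 21, 42, 57, 63, 80, 91, 99]
h = LearnedHash(sample, num_buckets=8)
table = [[] for _ in range(8)]
for k in [15, 42, 77]:
    table[h(k)].append(k)
```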

RDMA has been an important building block for many high-performance distributed key-value stores in recent prior work. To sustain millions of operations per second per node, many of these works use hashing schemes, such as cuckoo hashing, that guarantee that an existing key can be found in a small, constant number of RDMA operations. In this paper, we explore whether linear probing is a compelling design alternative to cuckoo-based hashing schemes. Instead of reading a fixed number of bytes per RDMA request, this paper introduces a mathematical model that picks the optimal read size by balancing the cost of performing an RDMA operation against the probability of encountering a probe sequence of a certain length. The model can further leverage optimization hints about the location of clusters of keys, which commonly occur in skewed key distributions. We extensively evaluate the performance of linear probing with a variable read size in a modern cluster to understand the trade-offs between cuckoo hashing and linear probing. We find that cuckoo hashing outperforms linear probing only in very highly loaded hash tables (load factors of 90% or higher) that would be prohibitively expensive to maintain in practice. Overall, linear probing with variable-sized reads using our model achieves up to 2.8× higher throughput than cuckoo hashing, with throughput as high as 50M lookup operations per second per node.
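The sketch below is a simplified reconstruction of such a read-size model, not the paper's exact formulation: it picks the read size that minimizes expected RDMA cost, trading the per-operation cost of a read against the chance that a probe sequence needs additional round trips. The cost parameters (setup latency, per-slot transfer cost) and the probe-length distribution are assumptions for illustration.

```python
# Hedged sketch of a variable-read-size cost model (our simplified
# reconstruction; the paper's actual model and parameters may differ).
# A read of s slots costs a fixed setup latency plus a per-slot transfer
# cost, and a probe sequence of length k needs ceil(k / s) round trips.

import math

def expected_cost(read_size, length_probs, base_us=2.0, per_slot_us=0.01):
    """length_probs[k] = P(probe sequence has length k), k >= 1."""
    per_read = base_us + read_size * per_slot_us
    expected_trips = sum(p * math.ceil(k / read_size)
                         for k, p in length_probs.items())
    return per_read * expected_trips

def optimal_read_size(length_probs, max_size=64):
    # Exhaustively score each candidate size and keep the cheapest.
    return min(range(1, max_size + 1),
               key=lambda s: expected_cost(s, length_probs))

# Example: roughly geometric probe lengths at a moderate load factor.
raw = {k: 0.5 ** k for k in range(1, 20)}
total = sum(raw.values())
probs = {k: p / total for k, p in raw.items()}
print(optimal_read_size(probs))
```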

Despite being one of the oldest data structures in computer science, hash tables continue to be the focus of a great deal of both theoretical and empirical research. A central reason for this is that many of the fundamental properties that one desires from a hash table are difficult to achieve simultaneously; thus many variants offering different tradeoffs have been proposed.
This article introduces Iceberg hashing, a hash table that simultaneously offers the strongest known guarantees on a large number of core properties. Iceberg hashing supports constant-time operations while improving on the state of the art for space efficiency, cache efficiency, and low failure probability. Iceberg hashing is also the first hash table to support a load factor of up to 1 - o(1) while being stable, meaning that the position where an element is stored only ever changes when resizes occur. In fact, in the setting where keys are Θ(log n) bits, the space guarantee that Iceberg hashing offers, namely that it uses at most \(\log \binom{U}{n} + O(n \log \log n)\) bits to store n items from a universe U, matches a lower bound by Demaine et al. that applies to any stable hash table.
Iceberg hashing introduces new general-purpose techniques for some of the most basic aspects of hash-table design. Notably, our indirection-free technique for dynamic resizing, which we call waterfall addressing, and our techniques for achieving stability and very-high-probability guarantees, can be applied to any hash table that makes use of the front-yard/backyard paradigm for hash table design.
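For intuition, here is a minimal Python sketch of the front-yard/backyard paradigm referred to above; it is an assumption-laden illustration that omits Iceberg's metadata scheme, waterfall addressing, and stability guarantees. The idea is that most items land in a fixed-capacity front-yard bucket, while the rare overflows spill into a much smaller backyard structure.

```python
# Minimal sketch of the front-yard/backyard paradigm (illustrative only;
# not Iceberg hashing itself). Lookups usually touch one front-yard
# bucket; only rare overflow items fall through to the backyard.

FRONT_BUCKETS = 64
FRONT_SLOTS = 8          # capacity of each front-yard bucket

front = [[] for _ in range(FRONT_BUCKETS)]
backyard = {}            # small overflow structure (a dict here)

def insert(key, value):
    b = hash(key) % FRONT_BUCKETS
    if len(front[b]) < FRONT_SLOTS:
        front[b].append((key, value))   # common case: one bucket probe
    else:
        backyard[key] = value           # rare case: bucket is full

def lookup(key):
    b = hash(key) % FRONT_BUCKETS
    for k, v in front[b]:
        if k == key:
            return v
    return backyard.get(key)
```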

First introduced in 1954, the linear-probing hash table is among the oldest data structures in computer science, and thanks to its unrivaled data locality, linear probing continues to be one of the fastest hash tables in practice. It is widely believed and taught, however, that linear probing should never be used at high load factors; this is because of an effect known as primary clustering, which causes insertions at a load factor of 1 - 1/x to take expected time O(x²) (rather than the intuitive running time of Θ(x)). The dangers of primary clustering, first discovered by Knuth in 1963, have now been taught to generations of computer scientists, and have influenced the design of some of the most widely used hash tables in production. We show that primary clustering is not the foregone conclusion that it is reputed to be. We demonstrate that seemingly small design decisions in how deletions are implemented have dramatic effects on the asymptotic performance of insertions: if these design decisions are made correctly, then even if a hash table operates continuously at a load factor of 1 - 1/x, the expected amortized cost per insertion/deletion is O(x). This is because the tombstones left behind by deletions can actually cause an anti-clustering effect that combats primary clustering. Interestingly, these design decisions, despite their remarkable effects, have historically been viewed as simply implementation-level engineering choices. We also present a new variant of linear probing (which we call graveyard hashing) that completely eliminates primary clustering on any sequence of operations: if, when an operation is performed, the current load factor is 1 - 1/x for some x, then the expected cost of the operation is O(x). Thus we can achieve the data locality of traditional linear probing without any of the disadvantages of primary clustering. One corollary is that, in the external-memory model with data blocks of size B, graveyard hashing offers the following remarkably strong guarantee: at any load factor 1 - 1/x satisfying x = o(B), graveyard hashing achieves 1 + o(1) expected block transfers per operation. In contrast, past external-memory hash tables have only been able to offer a 1 + o(1) guarantee when the block size B is at least O(x²). Our results come with actionable lessons for both theoreticians and practitioners: in particular, well-designed use of tombstones can completely change the asymptotic landscape of how linear probing behaves (even in workloads without deletions).
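The Python sketch below illustrates the deletion design decision in question: deletions leave tombstones, and insertions reuse the first tombstone they pass. It is a simplified illustration; graveyard hashing proper additionally spreads fresh, evenly spaced tombstones through the table at rebuild time, which this sketch omits.

```python
# Minimal sketch of linear probing with tombstone deletions (illustrative;
# not the paper's full graveyard-hashing algorithm).

EMPTY, TOMBSTONE = object(), object()

class LinearProbingTable:
    def __init__(self, capacity=16):
        self.slots = [EMPTY] * capacity

    def insert(self, key, value):
        start = hash(key) % len(self.slots)
        first_tombstone = None
        for step in range(len(self.slots)):
            i = (start + step) % len(self.slots)
            s = self.slots[i]
            if s is EMPTY:
                # Prefer a tombstone passed along the way: reusing it keeps
                # probe runs short, the anti-clustering effect.
                target = i if first_tombstone is None else first_tombstone
                self.slots[target] = (key, value)
                return
            if s is TOMBSTONE:
                if first_tombstone is None:
                    first_tombstone = i
            elif s[0] == key:
                self.slots[i] = (key, value)       # update in place
                return
        if first_tombstone is not None:            # no empty slot was found
            self.slots[first_tombstone] = (key, value)
            return
        raise RuntimeError("hash table is full")

    def delete(self, key):
        start = hash(key) % len(self.slots)
        for step in range(len(self.slots)):
            i = (start + step) % len(self.slots)
            s = self.slots[i]
            if s is EMPTY:
                return                             # key not present
            if s is not TOMBSTONE and s[0] == key:
                self.slots[i] = TOMBSTONE          # leave a tombstone behind
                return
```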

Cuckoo hashing is a hashing scheme introduced by Pagh and Rodler. It uses d ≥ 2 distinct hash functions to insert n items into a hash table of size m = (1 + ε)n. In their original paper they treated d = 2 and m = (2 + ε)n. It has been an open question for some time as to the expected time for Random Walk Insertion to add items when d > 2. We show that if the number of hash functions d ≥ d_ε = O(1), then the expected insertion time is O(1) per item.
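A minimal Python sketch of Random Walk Insertion with d hash functions appears below; the seeded use of Python's built-in hash, the step bound, and the uniform eviction choice are illustrative assumptions rather than details from the paper.

```python
# Sketch of Random Walk Insertion for cuckoo hashing with d hash functions
# (illustrative; eviction policy and parameters are assumptions).

import random

def make_hashes(d, m):
    # d independent hash functions into a table of size m, built as seeded
    # variants of Python's built-in hash (real implementations would differ).
    seeds = [random.randrange(2**32) for _ in range(d)]
    return [lambda key, s=s: hash((s, key)) % m for s in seeds]

def random_walk_insert(table, hashes, key, max_steps=500):
    """Insert `key`, evicting a random occupant when all d slots are full."""
    for _ in range(max_steps):
        slots = [h(key) for h in hashes]
        for i in slots:
            if table[i] is None:        # an empty candidate slot: done
                table[i] = key
                return True
        i = random.choice(slots)        # all full: evict a random occupant
        table[i], key = key, table[i]   # and continue the walk with it
    return False                        # walk too long: rebuild in practice

# Example: m = (1 + ε)n with ε = 1/3 and d = 3 hash functions.
n, m, d = 75, 100, 3
table = [None] * m
hashes = make_hashes(d, m)
print(all(random_walk_insert(table, hashes, k) for k in range(n)))
```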