skip to main content


Title: Vector Quotient Filters: Overcoming the Time/Space Trade-Off in Filter Design
Today’s filters, such as quotient, cuckoo, and Morton, have a trade-off between space and speed; even when moderately full (e.g., 50%-75% full), their performance degrades nontrivially. The result is that today’s systems designers are forced to choose between speed and space usage. In this paper, we present the vector quotient filter (VQF). Locally, the VQF is based on Robin Hood hashing, like the quotient filter, but uses power-of-two-choices hashing to reduce the variance of runs, and thus others consistent, high throughput across load factors. Power-of-two-choices hashing also makes it more amenable to concurrent updates, compared to the cuckoo filter and variants. Finally, the vector quotient filter is designed to exploit SIMD instructions so that all operations have $ (1) cost, independent of the size of the filter or its load factor. We show that the vector quotient flter is 2x faster for inserts compared to the Morton filter (a cuckoo filter variant and state-of- the-art for inserts) and has similar lookup and deletion performance as the cuckoo filter (which is fastest for queries and deletes), despite having a simpler design and implementation. The vector quotient filter has minimal performance decline at high load factors, a problem that has plagued modern filters, including quotient, cuckoo, and Morton. Furthermore, we give a thread-safe version of the vector quotient filter and show that insertion throughput scales 3x with four threads compared to a single thread.  more » « less
Award ID(s):
1938180 2106999 2118620
NSF-PAR ID:
10298533
Author(s) / Creator(s):
; ; ; ; ;
Date Published:
Journal Name:
SIGMOD record
ISSN:
1943-5835
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In this paper, we present our GPU implementation of the quotient filter, a compact data structure designed to implement approximate membership queries. The quotient filter is similar to the more well-known Bloom filter; however, in addition to set insertion and membership queries, the quotient filter also supports deletions and merging filters without requiring rehashing of the data set. Furthermore, the quotient filter can be extended to include counters without increasing the memory footprint. This paper describes our GPU implementation of two types of quotient filters: the standard quotient filter and the rank-and-select-based quotient filter. We describe the parallelization of all filter operations, including a comparison of the four different methods we devised for parallelizing quotient filter construction. In solving this problem, we found that we needed an operation similar to a parallel scan, but for non-associative operators. One outcome of this work is a variety of methods for computing parallel scan-type operations on a non-associative operator. For membership queries, we achieve a throughput of up to 1.13 billion items/second for the rank-and-select-based quotient filter: a speedup of 3x over the BloomGPU filter. Our fastest filter build method achieves a speedup of 2.1-3.1x over BloomGPU, with a peak throughput of 621 million items/second, and a rate of 516 million items/second for a 70% full filter. However, we find that our filters do not perform incremental updates as fast as the BloomGPU filter. For a batch of 2 million items, we perform incremental inserts at a rate of 81 million items/second - a 2.5x slowdown compared to BloomGPU's throughput of 201 million items/second. The quotient filter's memory footprint is comparable to that of a Bloom filter. 
    more » « less
  2. In this paper, we present our GPU implementation of the quotient filter, a compact data structure designed to implement approximate membership queries. The quotient filter is similar to the more well-known Bloom filter; however, in addition to set insertion and membership queries, the quotient filter also supports deletions and merging filters without requiring rehashing of the data set. Furthermore, the quotient filter can be extended to include counters without increasing the memory footprint. This paper describes our GPU implementation of two types of quotient filters: the standard quotient filter and the rank-and-select-based quotient filter. We describe the parallelization of all filter operations, including a comparison of the four different methods we devised for parallelizing quotient filter construction. In solving this problem, we found that we needed an operation similar to a parallel scan, but for non-associative operators. One outcome of this work is a variety of methods for computing parallel scan-type operations on a non-associative operator. For membership queries, we achieve a throughput of up to 1.13 billion items/second for the rank-and-select-based quotient filter: a speedup of 3x over the BloomGPU filter. Our fastest filter build method achieves a speedup of 2.1--3.1x over BloomGPU, with a peak throughput of 621 million items/second, and a rate of 516 million items/second for a 70% full filter. However, we find that our filters do not perform incremental updates as fast as the BloomGPU filter. For a batch of 2 million items, we perform incremental inserts at a rate of 81 million items/second -- a 2.5x slowdown compared to BloomGPU's throughput of 201 million items/second. The quotient filter's memory footprint is comparable to that of a Bloom filter. 
    more » « less
  3. null (Ed.)
    RDMA has been an important building block for many high-performance distributed key-value stores in recent prior work. To sustain millions of operations per second per node, many of these works use hashing schemes, such as cuckoo hashing, that guarantee that an existing key can be found in a small, constant number of RDMA operations. In this paper, we explore whether linear probing is a compelling design alternative to cuckoo-based hashing schemes. Instead of reading a fixed number of bytes per RDMA request, this paper introduces a mathematical model that picks the optimal read size by balancing the cost of performing an RDMA operation with the probability of encountering a probe sequence of a certain length. The model can further leverage optimization hints about the location of clusters of keys, which commonly occur in skewed key distributions. We extensively evaluate the performance of linear probing with a variable read size in a modern cluster to understand the trade-offs between cuckoo hashing and linear probing. We find that cuckoo hashing outperforms linear probing only in very highly loaded hash tables (load factors at 90% or higher) that would be prohibitively expensive to maintain in practice. Overall, linear probing with variable-sized reads using our model has up to 2.8× higher throughput than cuckoo hashing, with throughput as high as 50M lookup operations per second per node. 
    more » « less
  4. We revisit the problem of building static hash tables on the GPU and present an efficient implementation of bucketed hash tables. By decoupling the probing scheme from the hash table in-memory representation, we offer an implementation where the number of probes and the bucket size are the only factors limiting performance. Our analysis sweeps through the hash table parameter space for two probing schemes: cuckoo and iceberg hashing. We show that a bucketed cuckoo hash table (BCHT) that uses three hash functions outperforms alternative methods that use iceberg hashing and a cuckoo hash table that uses a bucket size of one. At load factors as high as 0.99, BCHT enjoys an average probe count of 1.43 during insertion. Using three hash functions only, positive and negative queries require at most 1.39 and 2.8 average probes per key, respectively. 
    more » « less
  5. For nearly six decades, the central open question in the study of hash tables has been to determine the optimal achievable tradeoff curve between time and space. State-of-the-art hash tables offer the following guarantee: If keys/values are Θ(logn) bits each, then it is possible to achieve constant-time insertions/deletions/queries while wasting only O(loglogn) bits of space per key when compared to the information-theoretic optimum. Even prior to this bound being achieved, the target of O(log log n) wasted bits per key was known to be a natural end goal, and was proven to be optimal for a number of closely related problems (e.g., stable hashing, dynamic retrieval, and dynamically-resized filters). This paper shows that O(log log n) wasted bits per key is not the end of the line for hashing. In fact, for any k ∈ [log∗ n], it is possible to achieve O(k)-time insertions/deletions, O(1)-time queries, and O(log(k) n) = Ologlog···logn 􏰟 􏰞􏰝 􏰠 k wasted bits per key (all with high probability in n). This means that, each time we increase inser- tion/deletion time by an additive constant, we reduce the wasted bits per key exponentially. We further show that this tradeoff curve is the best achievable by any of a large class of hash tables, including any hash table designed using the current framework for making constant-time hash tables succinct. Our results hold not just for fixed-capacity hash tables, but also for hash tables that are dynamically resized (this is a fundamental departure from what is possible for filters); and for hash tables that store very large keys/values, each of which can be up to no(1) bits (this breaks with the conventional wisdom that larger keys/values should lead to more wasted bits per key). For very small keys/values, we are able to tighten our bounds to o(1) wasted bits per key, even when k = O(1). Building on this, we obtain a constant-time dynamic filter that uses n􏰕logε−1􏰖+nloge+o(n) bits of space for a wide choice of 
    more » « less