Modern NVMe solid-state drives offer significantly higher bandwidth and lower latency than prior storage devices, yet current key-value stores struggle to fully utilize the bandwidth of such devices. This paper presents SplinterDB, a new key-value store explicitly designed for NVMe solid-state drives. SplinterDB is built around a novel data structure, the STBε-tree, which exposes I/O and CPU concurrency and reduces write amplification without sacrificing query performance. The STBε-tree combines ideas from log-structured merge trees and Bε-trees to reduce both the write amplification and the CPU cost of compaction. The SplinterDB memtable and cache are designed to be highly concurrent and to reduce cache misses. We evaluate SplinterDB on a number of micro- and macro-benchmarks, and show that SplinterDB outperforms RocksDB, a state-of-the-art key-value store, by a factor of 6–10x on insertions and 2–2.6x on point queries, while matching RocksDB on small range queries. Furthermore, SplinterDB reduces write amplification by 2x compared to RocksDB.

Buffer-and-flush is a technique for transforming standard external-memory search trees into write-optimized search trees. In exchange for faster amortized insertions, buffer-and-flush can sometimes significantly increase the latency of operations by causing cascades of flushes. In this paper, we show that flushing cascades are not a fundamental consequence of the buffer-flushing technique, and can be removed entirely using randomization techniques. The underlying implementation of buffer flushing relies on a buffer-eviction strategy at each node in the tree. The ability for the user to select the buffer-eviction strategy based on the workload has been shown to be important for performance, both in theory and in practice. In order to support arbitrary buffer-eviction strategies, we introduce the notion of a universal flush, which uses a universal eviction policy that can simulate any other eviction policy. This abstracts away the underlying eviction strategy, even allowing for workload-specific strategies that change dynamically. Our deamortization preserves the amortized throughput of the underlying flushing strategy on all workloads. In particular, with our deamortization and a node cache of size polylogarithmic in the number of insertions performed on the tree, the amortized insertion cost matches the lower bound of Brodal and Fagerberg. For typical parameters, the lower bound is less than 1 I/O per insertion. For such parameters, our worst-case insertion cost is O(1) I/Os.
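To make the "cascade" concrete, here is a minimal sketch of buffer-and-flush on a fixed two-level tree. This is an illustration of the general technique only, not the paper's data structure; the buffer size, fanout, and routing are arbitrary choices for the example.

```python
# Minimal sketch of buffer-and-flush (illustrative, not the paper's
# structure): inserts land in the root buffer; a full buffer is flushed
# to children, which can recursively trigger further flushes -- the
# cascade that this paper shows how to deamortize.
BUFFER_SIZE = 4

class Node:
    def __init__(self, pivots=None, children=None):
        self.buffer = []                 # pending (key, value) messages
        self.pivots = pivots or []       # split keys for routing
        self.children = children or []   # empty list => leaf

    def insert(self, key, value):
        self.buffer.append((key, value))
        if len(self.buffer) > BUFFER_SIZE and self.children:
            self.flush()

    def flush(self):
        # Route every buffered message to the child responsible for it.
        for key, value in self.buffer:
            idx = sum(1 for p in self.pivots if key >= p)
            self.children[idx].insert(key, value)  # may cascade further
        self.buffer = []

    def lookup(self, key):
        # Check the buffer first: it holds the newest messages.
        for k, v in reversed(self.buffer):
            if k == key:
                return v
        if not self.children:
            return None
        idx = sum(1 for p in self.pivots if key >= p)
        return self.children[idx].lookup(key)

root = Node(pivots=[50], children=[Node(), Node()])
for i in range(10):
    root.insert(i * 13 % 100, str(i))
```

A worst-case insertion here can trigger a flush at every level; the paper's contribution is removing that worst case without hurting amortized throughput.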

We design and implement a fully concurrent dynamic hash table for GPUs with performance comparable to state-of-the-art static hash tables. We propose a warp-cooperative work-sharing strategy that reduces branch divergence and provides an efficient alternative to the traditional way of per-thread (or per-warp) work assignment and processing. Using this strategy, we build a dynamic, nonblocking, concurrent linked list, the slab list, that supports asynchronous, concurrent updates (insertions and deletions) as well as search queries. We use the slab list to implement a dynamic hash table with chaining (the slab hash). On an NVIDIA Tesla K40c GPU, the slab hash performs updates at up to 512 M updates/s and processes search queries at up to 937 M queries/s. We also design a warp-synchronous dynamic memory allocator, SlabAlloc, that suits the high-performance needs of the slab hash. SlabAlloc dynamically allocates memory at a rate of 600 M allocations/s, which is up to 37x faster than alternative methods in similar scenarios.
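The slab-list layout can be sketched sequentially. The real structure is a lock-free CUDA data structure that uses warp-wide ballots and CAS; this plain-Python toy only illustrates the layout idea: a linked list whose nodes ("slabs") each hold a warp-width batch of keys, so one warp can examine an entire slab in a single coalesced memory access.

```python
# Toy sketch of the slab-list layout (not the paper's GPU code): each
# slab holds SLAB_SIZE key slots plus a next pointer. On the GPU, the
# 32 threads of a warp inspect one slab's 32 slots simultaneously.
SLAB_SIZE = 32   # one warp's width on NVIDIA GPUs
EMPTY = None

class Slab:
    def __init__(self):
        self.keys = [EMPTY] * SLAB_SIZE
        self.next = None

class SlabList:
    def __init__(self):
        self.head = Slab()

    def insert(self, key):
        slab = self.head
        while True:
            for i in range(SLAB_SIZE):       # done in parallel by a warp
                if slab.keys[i] is EMPTY:
                    slab.keys[i] = key       # an atomic CAS on the GPU
                    return
            if slab.next is None:
                slab.next = Slab()           # SlabAlloc's job on the GPU
            slab = slab.next

    def search(self, key):
        slab = self.head
        while slab:
            if key in slab.keys:             # one coalesced read per slab
                return True
            slab = slab.next
        return False

    def delete(self, key):
        slab = self.head
        while slab:
            for i in range(SLAB_SIZE):
                if slab.keys[i] == key:
                    slab.keys[i] = EMPTY
                    return True
            slab = slab.next
        return False
```

The slab hash is then an array of such lists, one per bucket, with keys routed by hash.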

In this paper, we present our GPU implementation of the quotient filter, a compact data structure designed to implement approximate membership queries. The quotient filter is similar to the more well-known Bloom filter; however, in addition to set insertion and membership queries, the quotient filter also supports deletions and merging filters without requiring rehashing of the data set. Furthermore, the quotient filter can be extended to include counters without increasing the memory footprint. This paper describes our GPU implementation of two types of quotient filters: the standard quotient filter and the rank-and-select-based quotient filter. We describe the parallelization of all filter operations, including a comparison of the four different methods we devised for parallelizing quotient filter construction. In solving this problem, we found that we needed an operation similar to a parallel scan, but for non-associative operators. One outcome of this work is a variety of methods for computing parallel scan-type operations on a non-associative operator. For membership queries, we achieve a throughput of up to 1.13 billion items/second for the rank-and-select-based quotient filter: a speedup of 3x over the BloomGPU filter. Our fastest filter build method achieves a speedup of 2.1–3.1x over BloomGPU, with a peak throughput of 621 million items/second, and a rate of 516 million items/second for a 70% full filter. However, we find that our filters do not perform incremental updates as fast as the BloomGPU filter. For a batch of 2 million items, we perform incremental inserts at a rate of 81 million items/second, a 2.5x slowdown compared to BloomGPU's throughput of 201 million items/second. The quotient filter's memory footprint is comparable to that of a Bloom filter.
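The "quotienting" idea that gives the filter its name can be shown in a few lines. The sketch below is illustrative only, not the GPU implementation: it splits a fingerprint into a quotient (which selects a slot) and a remainder (which is stored), and it sidesteps the real structure's metadata bits and slot-shifting by chaining remainders per slot. The bit widths and hash choice are assumptions for the example.

```python
# Toy sketch of quotienting (not the paper's filter): a (q + r)-bit
# fingerprint splits into a q-bit quotient (slot index) and an r-bit
# remainder (stored value). Unlike a Bloom filter, individual items
# can be deleted by removing their remainder.
import hashlib

Q_BITS, R_BITS = 8, 8          # 2^8 = 256 slots, 8-bit remainders

def fingerprint(item: str) -> int:
    digest = hashlib.sha256(item.encode()).digest()
    return int.from_bytes(digest[:2], "big")   # 16-bit fingerprint

class ToyQuotientFilter:
    def __init__(self):
        # The real structure stores remainders in a flat slot array with
        # three metadata bits each; a set per slot is a simplification.
        self.slots = [set() for _ in range(1 << Q_BITS)]

    def insert(self, item):
        fp = fingerprint(item)
        quotient, remainder = fp >> R_BITS, fp & ((1 << R_BITS) - 1)
        self.slots[quotient].add(remainder)

    def may_contain(self, item):
        fp = fingerprint(item)
        quotient, remainder = fp >> R_BITS, fp & ((1 << R_BITS) - 1)
        return remainder in self.slots[quotient]   # false positives possible

    def delete(self, item):
        fp = fingerprint(item)
        self.slots[fp >> R_BITS].discard(fp & ((1 << R_BITS) - 1))
```

Because only fingerprints are stored, a query can return a false positive when two items share both quotient and remainder, but never a false negative.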

Hash tables are a ubiquitous class of dictionary data structures. However, standard hash table implementations do not translate well into the external-memory model, because they do not incorporate locality for insertions. Iacono and Pătrașcu established an update/query tradeoff curve for external-memory hash tables: a hash table that performs insertions in O(λ/B) amortized I/Os requires Ω(log_λ N) expected I/Os for queries, where N is the number of items that can be stored in the data structure, B is the size of a memory transfer, M is the size of memory, and λ is a tuning parameter. They provide a complicated hashing data structure, which we call the IP hash table, that meets this curve for λ that is Ω(log log M + log_M N). In this paper, we present a simpler external-memory hash table, the Bundle of Arrays Hash Table (BOA), that is optimal for a narrower range of λ. The simplicity of BOAs allows them to be readily modified to achieve the following results: a new external-memory data structure, the Bundle of Trees Hash Table (BOT), that matches the performance of the IP hash table while retaining some of the simplicity of the BOAs; and the Cache-Oblivious Bundle of Trees Hash Table (COBOT), the first cache-oblivious hash table. This latter data structure matches the optimality of BOTs and IP hash tables over the same range of λ.

We develop a dynamic dictionary data structure for the GPU, supporting fast insertions and deletions, based on the Log-Structured Merge tree (LSM). Our implementation on an NVIDIA K40c GPU has an average update (insertion or deletion) rate of 225 M elements/s, 13.5x faster than merging items into a sorted array. The GPU LSM supports the retrieval operations of lookup, count, and range query at average rates of 75 M, 32 M, and 23 M queries/s, respectively. The tradeoff for the dynamic updates is that the sorted array is almost twice as fast on retrievals. We believe that our GPU LSM is the first dynamic general-purpose dictionary data structure for the GPU.
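The LSM idea the GPU version builds on can be sketched sequentially on a CPU. This is a generic illustration, not the paper's implementation: updates accumulate in a small in-memory batch; a full batch is sorted and merged down a hierarchy of sorted levels, and lookups probe the batch and then the levels newest-first. The batch size is an arbitrary choice for the example.

```python
# Generic CPU sketch of an LSM dictionary (illustrative only): level 0
# is newest; merging a full batch cascades down through occupied levels.
import bisect

BATCH = 4

class ToyLSM:
    def __init__(self):
        self.batch = {}    # newest updates: key -> value (None = delete)
        self.levels = []   # each level is a sorted list of (key, value)

    def _merge_down(self):
        run = sorted(self.batch.items())
        self.batch = {}
        level = 0
        while level < len(self.levels) and self.levels[level]:
            # Merge with the occupied level; newer values win on ties.
            merged = dict(self.levels[level])
            merged.update(dict(run))
            run = sorted(merged.items())
            self.levels[level] = []
            level += 1
        if level == len(self.levels):
            self.levels.append([])
        self.levels[level] = run

    def upsert(self, key, value):
        self.batch[key] = value
        if len(self.batch) >= BATCH:
            self._merge_down()

    def delete(self, key):
        self.upsert(key, None)          # tombstone, resolved on merge

    def lookup(self, key):
        if key in self.batch:
            return self.batch[key]
        for level in self.levels:       # newest level first
            i = bisect.bisect_left(level, (key,))
            if i < len(level) and level[i][0] == key:
                return level[i][1]
        return None
```

Each lookup does a binary search per level, which is why the abstract's sorted array (a single level) is faster on retrievals while the LSM is far faster on updates.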

Dictionaries remain the most well-studied class of data structures. A dictionary supports insertions, deletions, membership queries, and usually successor, predecessor, and extract-min. In a RAM, all such operations take O(log n) time on n elements. Dictionaries are often cross-referenced as follows. Consider a set of tuples {⟨a_i, b_i, c_i, …⟩}. A database might include more than one dictionary on such a set, for example, one indexed on the a's, another on the b's, and so on. Once again, in a RAM, inserting into a set of L cross-referenced dictionaries takes O(L log n) time, as does deleting. The situation is more interesting in external memory. On a Disk Access Machine (DAM), B-trees achieve O(log_B N) I/Os for insertions and deletions on a single dictionary, and K-element range queries take optimal O(log_B N + K/B) I/Os. These bounds are also achievable by a B-tree on cross-referenced dictionaries, with a slowdown of an L factor on insertions and deletions. In recent years, both the theory and practice of external-memory dictionaries have been revolutionized by write-optimization techniques. A dictionary is write-optimized if it is close to a B-tree for query time while beating B-trees on insertions. The best (and optimal) dictionaries achieve a substantially improved insertion and deletion cost of O((log_{1+B^ε} N)/B^{1−ε}) amortized I/Os on a single dictionary while maintaining optimal O(log_{1+B^ε} N + K/B) I/O range queries. Although write optimization still helps for insertions into cross-referenced dictionaries, its value for deletions would seem to be greatly reduced. A deletion into a cross-referenced dictionary only specifies a key a. It seems to be necessary to look up the associated values b, c, … in order to delete them from the other dictionaries.
This takes Ω(log_B N) I/Os, well above the per-dictionary write-optimization budget of O((log_{1+B^ε} N)/B^{1−ε}). So the total deletion cost is O(log_B N + L·(log_{1+B^ε} N)/B^{1−ε}). In short, for deletions, write optimization offers an advantage over B-trees in that L multiplies a lower-order term, but when L = 2, write optimization seems to offer no asymptotic advantage over B-trees. That is, no known query-optimal solution for pairs of cross-referenced dictionaries seems to beat B-trees for deletions. In this paper, we show a lower bound establishing that a pair of cross-referenced dictionaries that is optimal for range queries and that supports deletions cannot match the write-optimization bound available to insert-only dictionaries. This result thus establishes a limit to the applicability of write-optimization techniques on which many new databases and file systems are based. Read more: http://epubs.siam.org/doi/10.1137/1.9781611974782.99
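The obstacle to write-optimized deletion can be seen in a toy example (not from the paper): with two dictionaries cross-referencing the same tuples, deleting by key a must first look up the associated b before the tuple can be removed from the dictionary indexed on b.

```python
# Toy pair of cross-referenced dictionaries (L = 2). The lookup inside
# delete_by_a is the step that costs Omega(log_B N) I/Os in external
# memory: a blind, buffered "delete a" message cannot tell the
# b-indexed dictionary which entry to remove.
class CrossRefPair:
    def __init__(self):
        self.by_a = {}   # a -> (a, b)
        self.by_b = {}   # b -> (a, b)

    def insert(self, a, b):
        self.by_a[a] = (a, b)
        self.by_b[b] = (a, b)

    def delete_by_a(self, a):
        _, b = self.by_a.pop(a)   # the unavoidable-seeming query
        del self.by_b[b]
```

Insertions need no such query, which is why write optimization still pays off for them.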

File systems must allocate space for files without knowing what will be added or removed in the future. Over the life of a file system, this may cause suboptimal file placement decisions which eventually lead to slower performance, or aging. Traditional file systems employ heuristics, such as collocating related files and data blocks, to avoid aging, and many file system implementors treat aging as a solved problem. However, this paper describes realistic as well as synthetic workloads that can cause these heuristics to fail, inducing large performance declines due to aging. For example, on ext4 and ZFS, a few hundred git pull operations can reduce read performance by a factor of 2; performing a thousand pulls can reduce performance by up to a factor of 30. We further present microbenchmarks demonstrating that common placement strategies are extremely sensitive to file-creation order; varying the creation order of a few thousand small files in a real-world directory structure can slow down reads by 15–175×, depending on the file system. We argue that these slowdowns are caused by poor layout. We demonstrate a correlation between read performance of a directory scan and the locality within a file system's access patterns, using a dynamic layout score. In short, many file systems are exquisitely prone to read aging for a variety of write workloads. We show, however, that aging is not inevitable. BetrFS, a file system based on write-optimized dictionaries, exhibits almost no aging in our experiments. BetrFS typically outperforms the other file systems in our benchmarks; aged BetrFS even outperforms the unaged versions of these file systems, excepting Btrfs. We present a framework for understanding and predicting aging, and identify the key features of BetrFS that avoid aging.

The skip list is an elegant dictionary data structure that is commonly deployed in RAM. A skip list with N elements supports searches, inserts, and deletes in O(log N) operations with high probability (w.h.p.) and range queries returning K elements in O(log N + K) operations w.h.p. A seemingly natural way to generalize the skip list to external memory with block size B is to "promote" with probability 1/B, rather than 1/2. However, there are practical and theoretical obstacles to getting the skip list to retain its efficient performance, space bounds, and high-probability guarantees. We give an external-memory skip list that achieves write-optimized bounds. That is, for 0 < ε < 1, range queries take O(log_{B^ε} N + K/B) I/Os w.h.p. and insertions and deletions take O((log_{B^ε} N)/B^{1−ε}) amortized I/Os w.h.p. Our write-optimized skip list inherits the virtue of simplicity from RAM skip lists. Moreover, it matches or beats the asymptotic bounds of prior write-optimized data structures such as Bε-trees or LSM trees. These data structures are deployed in high-performance databases and file systems.
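The "natural" generalization the abstract mentions can be sketched directly: a standard RAM skip list with the promotion probability exposed as a parameter, so setting it to 1/B makes each level roughly B times sparser than the one below. This is only the naive structure whose obstacles the paper discusses, not the paper's write-optimized skip list; the parameter values are arbitrary choices for the example.

```python
# Standard skip list with a tunable promotion probability (illustrative
# only). With promote_prob = 1/2 this is the classic RAM structure;
# with promote_prob = 1/B it is the naive external-memory variant.
import random

class SkipNode:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level   # next pointer per level

class SkipList:
    def __init__(self, promote_prob=1 / 16, max_level=8, seed=0):
        self.p = promote_prob
        self.max_level = max_level
        self.rng = random.Random(seed)
        self.head = SkipNode(None, max_level)   # sentinel

    def _random_level(self):
        lvl = 1
        while lvl < self.max_level and self.rng.random() < self.p:
            lvl += 1    # promoted to the next level with probability p
        return lvl

    def insert(self, key):
        update = [self.head] * self.max_level
        node = self.head
        for i in range(self.max_level - 1, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node    # rightmost node before key on level i
        new = SkipNode(key, self._random_level())
        for i in range(len(new.forward)):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def contains(self, key):
        node = self.head
        for i in range(self.max_level - 1, -1, -1):
            while node.forward[i] and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key
```

With promotion probability 1/B the expected height drops to about log_B N levels, but the high-probability guarantees degrade, which is one of the obstacles the paper addresses.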