The skip list is an elegant dictionary data structure that is commonly deployed in RAM. A skip list with N elements supports searches, inserts, and deletes in O(log N) operations with high probability (w.h.p.) and range queries returning K elements in O(log N + K) operations w.h.p. A seemingly natural way to generalize the skip list to external memory with block size B is to “promote” with probability 1/B, rather than 1/2. However, there are practical and theoretical obstacles to getting the skip list to retain its efficient performance, space bounds, and high-probability guarantees. We give an external-memory skip list that achieves write-optimized bounds. That is, for 0 < ε < 1, range queries take O(log_{B^ε} N + K/B) I/Os w.h.p. and insertions and deletions take O((log_{B^ε} N)/B^{1−ε}) amortized I/Os w.h.p. Our write-optimized skip list inherits the virtue of simplicity from RAM skip lists. Moreover, it matches or beats the asymptotic bounds of prior write-optimized data structures such as B^ε-trees or LSM-trees. These data structures are deployed in high-performance databases and file systems.
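As a concrete illustration of the promotion idea, here is a minimal Python sketch of level sampling under the naive 1/B generalization the abstract describes. The names (B, random_level, max_level) are illustrative assumptions, and this captures only the sampling step, none of the machinery the paper adds to recover efficient bounds.

```python
import random

# Hypothetical block size of the external-memory model; a RAM skip
# list would use promotion probability 1/2 instead of 1/B.
B = 64

def random_level(p: float = 1.0 / B, max_level: int = 32) -> int:
    """Sample a node's height: promote to the next level independently
    with probability p. This is only the sampling step; the paper's
    structure needs more machinery for its w.h.p. guarantees."""
    level = 1
    while level < max_level and random.random() < p:
        level += 1
    return level

# With p = 1/B, an element reaches height h with probability B^-(h-1),
# so each level is roughly a factor of B sparser than the one below.
```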
Cross-Referenced Dictionaries and the Limits of Write Optimization
Dictionaries remain the most well-studied class of data structures. A dictionary supports insertions, deletions, membership queries, and usually successor, predecessor, and extract-min. In a RAM, all such operations take O(log n) time on n elements.

Dictionaries are often cross-referenced as follows. Consider a set of tuples {⟨a_i, b_i, c_i, …⟩}. A database might include more than one dictionary on such a set, for example, one indexed on the a's, another on the b's, and so on. Once again, in a RAM, inserting into a set of L cross-referenced dictionaries takes O(L log n) time, as does deleting.

The situation is more interesting in external memory. On a Disk Access Machine (DAM), B-trees achieve O(log_B N) I/Os for insertions and deletions on a single dictionary, and K-element range queries take optimal O(log_B N + K/B) I/Os. These bounds are also achievable by a B-tree on cross-referenced dictionaries, with a factor-of-L slowdown on insertions and deletions.

In recent years, both the theory and practice of external-memory dictionaries have been revolutionized by write-optimization techniques. A dictionary is write optimized if it is close to a B-tree for query time while beating B-trees on insertions. The best (and optimal) dictionaries achieve a substantially improved insertion and deletion cost of O((log_{1+B^ε} N)/B^{1−ε}) amortized I/Os on a single dictionary while maintaining optimal O(log_{1+B^ε} N + K/B)-I/O range queries.

Although write optimization still helps for insertions into cross-referenced dictionaries, its value for deletions would seem to be greatly reduced. A deletion from a cross-referenced dictionary only specifies a key a. It seems to be necessary to look up the associated values b, c, … in order to delete them from the other dictionaries. This takes Ω(log_B N) I/Os, well above the per-dictionary write-optimization budget of O((log_{1+B^ε} N)/B^{1−ε}), so the total deletion cost is O(log_B N + L·(log_{1+B^ε} N)/B^{1−ε}). In short, for deletions, write optimization offers an advantage over B-trees in that L multiplies a lower-order term, but when L = 2, write optimization seems to offer no asymptotic advantage over B-trees. That is, no known query-optimal solution for pairs of cross-referenced dictionaries seems to beat B-trees for deletions.

In this paper, we show a lower bound establishing that a pair of cross-referenced dictionaries that are optimal for range queries and that support deletions cannot match the write-optimization bound available to insert-only dictionaries. This result thus establishes a limit to the applicability of the write-optimization techniques on which many new databases and file systems are based.

Read More: http://epubs.siam.org/doi/10.1137/1.9781611974782.99
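To make the asymmetry concrete, here is a toy RAM model of cross-referenced dictionaries, a minimal sketch with illustrative names that are not from the paper: an insertion is handed every key, while a deletion must first look up the companion keys before the other dictionaries can be updated.

```python
# Toy cross-referenced dictionaries over tuples <a, b, c>.
class CrossReferencedDictionaries:
    def __init__(self):
        self.by_a = {}  # a -> (b, c)
        self.by_b = {}  # b -> (a, c)
        self.by_c = {}  # c -> (a, b)

    def insert(self, a, b, c):
        # An insertion supplies every key, so each dictionary can
        # absorb the write blindly, at write-optimized cost.
        self.by_a[a] = (b, c)
        self.by_b[b] = (a, c)
        self.by_c[c] = (a, b)

    def delete(self, a):
        # A deletion supplies only `a`. Recovering b and c before the
        # other dictionaries can be updated is a query, which costs
        # Omega(log_B N) I/Os in external memory -- the bottleneck the
        # paper's lower bound makes rigorous.
        b, c = self.by_a.pop(a)
        del self.by_b[b]
        del self.by_c[c]
```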
- Award ID(s):
- 1637458
- PAR ID:
- 10027801
- Date Published:
- Journal Name:
- Proceedings of the Twenty-Eighth Annual ACM-SIAM Symposium on Discrete Algorithms
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
- We develop a dynamic dictionary data structure for the GPU, supporting fast insertions and deletions, based on the Log Structured Merge tree (LSM). Our implementation on an NVIDIA K40c GPU has an average update (insertion or deletion) rate of 225 M elements/s, 13.5x faster than merging items into a sorted array. The GPU LSM supports the retrieval operations lookup, count, and range query at average rates of 75 M, 32 M, and 23 M queries/s, respectively. The trade-off for the dynamic updates is that the sorted array is almost twice as fast on retrievals. We believe that our GPU LSM is the first dynamic general-purpose dictionary data structure for the GPU. (A minimal sketch of the LSM idea appears after this list.)
- Many key-value stores and database systems use log-structured merge-trees (LSM-trees) as their storage engines because of their excellent write performance. However, the read performance of LSM-trees is suboptimal due to the overlapping sorted runs. Most existing efforts rely on filters to reduce unnecessary I/Os, but filters fundamentally do not help locate items and often become the bottleneck of the system. We identify that the lack of an efficient index is the root cause of subpar read performance in LSM-trees. In this paper, we propose Disco: a compact index for LSM-trees. Disco indexes all the keys in an LSM-tree, so a query does not have to search every run of the LSM-tree. It records compact key representations to minimize the number of key comparisons and thereby the cache misses and I/Os for both point and range queries. Disco guarantees that both point queries and seeks issue at most one I/O to the underlying runs, achieving an I/O efficiency close to that of a B+-tree. Disco improves upon REMIX's pioneering multi-run index design with additional compact key representations that help improve read performance. The representations are compact, so the cost of persisting Disco to disk is small. Moreover, while a traditional LSM-tree has to choose a more aggressive compaction policy that slows down write performance in order to get better read performance, a Disco-indexed LSM-tree can employ a write-efficient policy and still have good read performance. Experimental results show that Disco can save I/Os and improve point and range query performance by up to 220% over RocksDB while maintaining efficient writes. (A simplified multi-run index sketch appears after this list.)
- Buffer-and-flush is a technique for transforming standard external-memory search trees into write-optimized search trees. In exchange for faster amortized insertions, buffer-and-flush can sometimes significantly increase the latency of operations by causing cascades of flushes. In this paper, we show that flushing cascades are not a fundamental consequence of the buffer-flushing technique, and can be removed entirely using randomization techniques. The underlying implementation of buffer flushing relies on a buffer-eviction strategy at each node in the tree. The ability for the user to select the buffer-eviction strategy based on the workload has been shown to be important for performance, both in theory and in practice. In order to support arbitrary buffer-eviction strategies, we introduce the notion of a universal flush, which uses a universal eviction policy that can simulate any other eviction policy. This abstracts away the underlying eviction strategy, even allowing for workload-specific strategies that change dynamically. Our deamortization preserves the amortized throughput of the underlying flushing strategy on all workloads. In particular, with our deamortization and a node cache of size poly-logarithmic in the number of insertions performed on the tree, the amortized insertion cost matches the lower bound of Brodal and Fagerberg. For typical parameters, the lower bound is less than 1 I/O per insertion; for such parameters, our worst-case insertion cost is O(1) I/Os. (A schematic flush sketch appears after this list.)
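The sketches below illustrate, under stated assumptions, the mechanisms the three abstracts above describe; all are simplified Python stand-ins, not the papers' implementations.

For the GPU LSM work, here is a minimal single-threaded sketch of the log-structured-merge skeleton it builds on: updates land in a small buffer, and flushed runs of equal size are merged, as in the classic logarithmic method. Everything GPU-specific (batched operations, parallel merge paths) that the reported throughput depends on is omitted, and all names are illustrative.

```python
from bisect import bisect_left

class TinyLSM:
    def __init__(self, buffer_cap: int = 4):
        self.buffer_cap = buffer_cap
        self.buffer = []   # unsorted recent updates
        self.runs = []     # sorted runs, smallest/newest first

    def insert(self, key):
        self.buffer.append(key)
        if len(self.buffer) == self.buffer_cap:
            # Flush: sort the buffer into a new run, then repeatedly
            # merge equal-sized runs (the logarithmic method).
            run = sorted(self.buffer)
            self.buffer = []
            while self.runs and len(self.runs[0]) == len(run):
                run = sorted(run + self.runs.pop(0))
            self.runs.insert(0, run)

    def lookup(self, key) -> bool:
        if key in self.buffer:
            return True
        for run in self.runs:  # binary-search every run
            i = bisect_left(run, key)
            if i < len(run) and run[i] == key:
                return True
        return False
```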
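For the Disco abstract, here is a rough stand-in for a multi-run index: one sorted view over all runs lets a point query locate a (run, offset) pair with a single index search instead of probing every run, in miniature the I/O behavior the abstract attributes to Disco. This omits Disco's compact key representations and on-disk layout; all names are illustrative.

```python
from bisect import bisect_left

class MultiRunIndex:
    def __init__(self, runs):
        # runs: sorted key lists, assumed newest-first so duplicate
        # keys resolve to the newest run (smallest run id sorts first).
        self.runs = runs
        self.index = sorted(
            (key, r, i)
            for r, run in enumerate(runs)
            for i, key in enumerate(run)
        )
        self.keys = [k for k, _, _ in self.index]  # search column

    def point_query(self, key):
        # One search in the index pinpoints (run, offset), so at most
        # one read of an underlying run is needed.
        j = bisect_left(self.keys, key)
        if j < len(self.keys) and self.keys[j] == key:
            _, r, i = self.index[j]
            return self.runs[r][i]
        return None
```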
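For the buffer-and-flush abstract, here is a schematic buffered tree node whose flush pushes one child's share of the buffer down a level; the recursive flushing is exactly the cascade the paper de-amortizes. The flush-to-fullest-child policy is just one example of the eviction strategies a universal flush must be able to simulate; all names and parameters are illustrative.

```python
class BufferedNode:
    def __init__(self, pivots, children=None, buffer_cap=8):
        self.pivots = pivots            # sorted separator keys
        self.children = children or []  # empty at a leaf
        self.buffer = []                # pending inserted keys
        self.buffer_cap = buffer_cap

    def child_for(self, key):
        i = sum(1 for p in self.pivots if p <= key)
        return self.children[i]

    def insert(self, key):
        self.buffer.append(key)
        if len(self.buffer) > self.buffer_cap and self.children:
            self.flush()  # leaves simply accumulate in this sketch

    def flush(self):
        # Evict toward the child owed the most buffered keys; the
        # recursive insert below is the potential flushing cascade.
        target = max(self.children,
                     key=lambda c: sum(1 for k in self.buffer
                                       if self.child_for(k) is c))
        keep, push = [], []
        for k in self.buffer:
            (push if self.child_for(k) is target else keep).append(k)
        self.buffer = keep
        for k in push:
            target.insert(k)
```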