skip to main content
US FlagAn official website of the United States government
dot gov icon
Official websites use .gov
A .gov website belongs to an official government organization in the United States.
https lock icon
Secure .gov websites use HTTPS
A lock ( lock ) or https:// means you've safely connected to the .gov website. Share sensitive information only on official, secure websites.


This content will become publicly available on February 10, 2026

Title: Disco: A Compact Index for LSM-trees
Many key-value stores and database systems use log-structured merge-trees (LSM-trees) as their storage engines because of their excellent write performance. However, the read performance of LSM-trees is suboptimal due to the overlapping sorted runs. Most existing efforts rely on filters to reduce unnecessary I/Os, but filters fundamentally do not help locate items and often become the bottleneck of the system. We identify that the lack of efficient index is the root cause of subpar read performance in LSM-trees. In this paper, we propose Disco: a compact index for LSM-trees. Disco indexes all the keys in an LSM-tree, so a query does not have to search every run of the LSM-tree. It records compact key representations to minimize the number of key comparisons so as to minimize cache misses and I/Os for both point and range queries. Disco guarantees that both point queries and seeks issue at most one I/O to the underlying runs, achieving an I/O efficiency close to a B+-tree. Disco improves upon REMIX's pioneering multi-run index design with additional compact key representations to help improve read performance. The representations are compact so the cost of persisting Disco to disk is small. Moreover, while a traditional LSM-tree has to choose a more aggressive compaction policy that slows down write performance to have better read performance, a Disco-indexed LSM-tree can employ a write-efficient policy and still have good read performance. Experimental results show that Disco can save I/Os and improve point and range query performance by up to 220% over RocksDB while maintaining efficient writes.  more » « less
Award ID(s):
2114218
PAR ID:
10573375
Author(s) / Creator(s):
; ; ;
Publisher / Repository:
ACM Digital Library
Date Published:
Journal Name:
Proceedings of the ACM on Management of Data
Volume:
3
Issue:
1
ISSN:
2836-6573
Page Range / eLocation ID:
1 to 27
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Log-structured merge (LSM) trees offer efficient ingestion by appending incoming data, and thus, are widely used as the storage layer of production NoSQL data stores. To enable competitive read performance, LSM-trees periodically re-organize data to form a tree with levels of exponentially increasing capacity, through iterative compactions. Compactions fundamentally influence the performance of an LSM-engine in terms of write amplification, write throughput, point and range lookup performance, space amplification, and delete performance. Hence, choosing the appropriate compaction strategy is crucial and, at the same time, hard as the LSM-compaction design space is vast, largely unexplored, and has not been formally defined in the literature. As a result, most LSM-based engines use a fixed compaction strategy, typically hand-picked by an engineer, which decides how and when to compact data. In this paper, we present the design space of LSM-compactions, and evaluate state-of-the-art compaction strategies with respect to key performance metrics. Toward this goal, our first contribution is to introduce a set of four design primitives that can formally define any compaction strategy: (i) the compaction trigger, (ii) the data layout, (iii) the compaction granularity, and (iv) the data movement policy. Together, these primitives can synthesize both existing and completely new compaction strategies. Our second contribution is to experimentally analyze 10 compaction strategies. We present 12 observations and 7 high-level takeaway messages, which show how LSM systems can navigate the compaction design space. 
    more » « less
  2. Dictionaries remain the most well studied class of data structures. A dictionary supports insertions, deletions, membership queries, and usually successor, predecessor, and extract-min. In a RAM, all such operations take O(log n) time on n elements. Dictionaries are often cross-referenced as follows. Consider a set of tuples {〈ai,bi,ci…〉}. A database might include more than one dictionary on such a set, for example, one indexed on the a ‘s, another on the b‘s, and so on. Once again, in a RAM, inserting into a set of L cross-referenced dictionaries takes O(L log n) time, as does deleting. The situation is more interesting in external memory. On a Disk Access Machine (DAM), B-trees achieve O(logB N) I/Os for insertions and deletions on a single dictionary and K-element range queries take optimal O(logB N + K/B) I/Os. These bounds are also achievable by a B-tree on cross-referenced dictionaries, with a slowdown of an L factor on insertion and deletions. In recent years, both the theory and practice of external- memory dictionaries has been revolutionized by write- optimization techniques. A dictionary is write optimized if it is close to a B-tree for query time while beating B-trees on insertions. The best (and optimal) dictionaries achieve a substantially improved insertion and deletion cost of amortized I/Os on a single dictionary while maintaining optimal O(log1+B∊ N + K/B)- I/O range queries. Although write optimization still helps for insertions into cross-referenced dictionaries, its value for deletions would seem to be greatly reduced. A deletion into a cross- referenced dictionary only specifies a key a. It seems to be necessary to look up the associated values b, c … in order to delete them from the other dictionaries. This takes Ω(logB N) I/Os, well above the per-dictionary write-optimization budget of So the total deletion cost is In short, for deletions, write optimization offers an advantage over B-trees in that L multiplies a lower order term, but when L = 2, write optimization seems to offer no asymptotic advantage over B-trees. That is, no known query- optimal solution for pairs of cross-referenced dictionaries seem to beat B-trees for deletions. In this paper, we show a lower bound establishing that a pair of cross-referenced dictionaries that are optimal for range queries and that supports deletions cannot match the write optimization bound available to insert-only dictionaries. This result thus establishes a limit to the applicability of write-optimization techniques on which many new databases and file systems are based. Read More: http://epubs.siam.org/doi/10.1137/1.9781611974782.99 
    more » « less
  3. Computer systems utilizing byte-addressable Non-Volatile Memory ( NVM ) as memory/storage can provide low-latency data persistence. The widely used key-value stores using Log-Structured Merge Tree ( LSM-Tree ) are still beneficial for NVM systems in aspects of the space and write efficiency. However, the significant write amplification introduced by the leveled compaction of LSM-Tree degrades the write performance of the key-value store and shortens the lifetime of the NVM devices. The existing studies propose new compaction methods to reduce write amplification. Unfortunately, they result in a relatively large read amplification. In this article, we propose NVLSM, a key-value store for NVM systems using LSM-Tree with new accumulative compaction. By fully utilizing the byte-addressability of NVM, accumulative compaction uses pointers to accumulate data into multiple floors in a logically sorted run to reduce the number of compactions required. We have also proposed a cascading searching scheme for reads among the multiple floors to reduce read amplification. Therefore, NVLSM reduces write amplification with small increases in read amplification. We compare NVLSM with key-value stores using LSM-Tree with two other compaction methods: leveled compaction and fragmented compaction. Our evaluations show that NVLSM reduces write amplification by up to 67% compared with LSM-Tree using leveled compaction without significantly increasing the read amplification. In write-intensive workloads, NVLSM reduces the average latency by 15.73%–41.2% compared to other key-value stores. 
    more » « less
  4. The skip list is an elegant dictionary data structure that is com- monly deployed in RAM. A skip list with N elements supports searches, inserts, and deletes in O(logN) operations with high probability (w.h.p.) and range queries returning K elements in O(log N + K) operations w.h.p. A seemingly natural way to generalize the skip list to external memory with block size B is to “promote” with probability 1/B, rather than 1/2. However, there are practical and theoretical obsta- cles to getting the skip list to retain its efficient performance, space bounds, and high-probability guarantees. We give an external-memory skip list that achieves write- optimized bounds. That is, for 0 < ε < 1, range queries take O(logBε N + K/B) I/Os w.h.p. and insertions and deletions take O((logBε N)/B1−ε) amortized I/Os w.h.p. Our write-optimized skip list inherits the virtue of simplicity from RAM skip lists. Moreover, it matches or beats the asymptotic bounds of prior write-optimized data structures such as Bε trees or LSM trees. These data structures are deployed in high-performance databases and file systems. 
    more » « less
  5. Modern NVMe solid state drives offer significantly higher bandwidth and low latency than prior storage devices. Current key-value stores struggle to fully utilize the bandwidth of such devices. This paper presents SplinterDB, a new key-value store explicitly designed for NVMe solid state drives. SplinterDB is designed around a novel data structure (the STBε-tree), that exposes I/O and CPU concurrency and reduces write amplification without sacrificing query performance. STBε-tree combines ideas from log-structured merge trees and Bε-trees to reduce write amplification and CPU costs of compaction. The SplinterDB memtable and cache are designed to be highly concurrent and to reduce cache misses. We evaluate SplinterDB on a number of micro- and macro-benchmarks, and show that SplinterDB outperforms RocksDB, a state-of-the-art key-value store, by a factor of 6–10x on insertions and 2–2.6x on point queries, while matching RocksDB on small range queries. Furthermore, SplinterDB reduces write amplification by 2x compared to RocksDB. 
    more » « less