

Title: Append is Near: Log-based Data Management on ZNS SSDs
Log-based data management systems use storage as if it were an append-only medium, transforming random writes into sequential writes, which delivers significant benefits when logs are persisted on hard disks. Although solid-state drives (SSDs) offer improved random write capabilities, sequential writes continue to be advantageous due to locality and space efficiency. However, the inherent properties of flash-based SSDs induce major disadvantages when used with a random write block interface, causing write amplification, uneven wear, log stacking, and garbage collection overheads. To eliminate these disadvantages, Zoned Namespace (ZNS) SSDs have recently been introduced. They offer increased capacity, reduced write amplification, and open up data placement and garbage collection to the host through zones, which have sequential-write semantics and must be explicitly reset. We explain how the new ZNS Zone Append primitive, which supports pushing fine-grained data placement onto the device, along with our proposal for “Group Append”, which enables sub-block sized appends, could benefit log-structured data management systems. We explore advantages of ZNS SSDs with Zone Append, Group Append, and computational storage in four log-based data management areas: (i) log-based file systems, (ii) LSM trees such as RocksDB, (iii) database systems, and (iv) event logs/shared logs. Furthermore, we propose research directions for each of these data management systems using ZNS SSDs.
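As a rough illustration of the interface described above, the sketch below models the Zone Append contract (the host submits a block to a zone without naming an offset, and the device returns the location it chose), together with one host-side reading of the proposed “Group Append” as batching of sub-block records into a single block-sized append. This is a conceptual Python model; it is not the NVMe ZNS command set or any real library API, and all names and constants are illustrative.

```python
# Conceptual model of the ZNS Zone Append contract (not a real device API):
# the host hands a zone a payload without naming an offset; the device picks
# the location at its write pointer and returns it. "Group Append" is
# sketched as host-side batching of sub-block records into one block.

BLOCK_SIZE = 4096  # illustrative logical block size

class Zone:
    def __init__(self, start_lba: int, capacity_blocks: int):
        self.start_lba = start_lba
        self.capacity_blocks = capacity_blocks
        self.write_pointer = 0  # in blocks, relative to start_lba

    def zone_append(self, block: bytes) -> int:
        """Append one block; the device chooses the LBA and returns it."""
        assert len(block) == BLOCK_SIZE
        if self.write_pointer >= self.capacity_blocks:
            raise IOError("zone full; host must reset the zone first")
        lba = self.start_lba + self.write_pointer
        self.write_pointer += 1  # the modeled device advances its pointer
        return lba

    def reset(self) -> None:
        """Explicit reset: reclaims the zone and rewinds the write pointer."""
        self.write_pointer = 0

def group_append(zone: Zone, records: list[bytes]) -> list[tuple[int, int]]:
    """Pack sub-block records into one block and append it once.
    Returns (lba, offset_in_block) for each record."""
    payload = b"".join(records)
    assert len(payload) <= BLOCK_SIZE
    lba = zone.zone_append(payload.ljust(BLOCK_SIZE, b"\0"))
    locations, offset = [], 0
    for r in records:
        locations.append((lba, offset))
        offset += len(r)
    return locations
```

Because the modeled device serializes appends and reports the resulting address back to the host, multiple log writers can target the same zone without coordinating a shared write pointer, which is the property that makes the primitive attractive for log-structured systems.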
Award ID(s):
1841545 1942754
NSF-PAR ID:
10315205
Author(s) / Creator(s):
Date Published:
Journal Name:
12th Annual Conference on Innovative Data Systems Research (CIDR ’22).
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Consolidating multiple workloads on a single flash-based storage device is now a common practice. We identify a new problem related to lifetime management in such settings: how should one partition device resources among consolidated workloads such that their allowed contributions to the device's wear (resulting from their writes, including hidden writes due to garbage collection) may be deemed fairly assigned? When flash is used as a cache/buffer, such fairness is important because it impacts what and how much traffic from the various workloads may be serviced using flash, which in turn affects their performance. We first clarify why the write attribution problem (i.e., which workload contributed how many writes) is non-trivial. We then present a technique for it inspired by the Shapley value, a classical concept from cooperative game theory (see the sketch after this list), and demonstrate that it is accurate, fair, and feasible. We next consider how to treat an overall "write budget" (i.e., total allowable writes during a given time period) for the device as a first-class resource worthy of explicit management. Towards this, we propose a novel write budget allocation technique. Finally, we construct a dynamic lifetime management framework for consolidated devices by putting the above elements together. Our experiments using real-world workloads demonstrate that our write allocation and attribution techniques lead to performance fairness across consolidated workloads.
  2. Many modern key-value stores, such as RocksDB, rely on log-structured merge trees (LSMs). Originally designed for spinning disks, LSMs optimize for write performance by only making sequential writes. But this optimization comes at the cost of reads: LSMs must rely on expensive compaction jobs and Bloom filters---all to maintain reasonable read performance. For NVMe SSDs, we argue that trading off read performance for write performance is no longer always needed. With enough parallelism, NVMe SSDs have comparable random and sequential access performance. This change makes update-in-place designs, which traditionally provide excellent read performance, a viable alternative to LSMs. In this paper, we close the gap between log-structured and update-in-place designs on modern SSDs with the help of new components that take advantage of data and workload patterns. Specifically, we explore three key ideas: (A) record caching for efficient point operations, (B) page grouping for high-performance range scans, and (C) insert forecasting to reduce the reorganization costs of accommodating new records (a sketch of idea C appears after this list). We evaluate these ideas by implementing them in a prototype update-in-place key-value store called TreeLine. On YCSB, we find that TreeLine outperforms RocksDB and LeanStore by 2.20× and 2.07× respectively on average across the point workloads, and by up to 10.95× and 7.52× overall.
  3. Modern solid-state disks achieve high data transfer rates due to their massive internal parallelism. However, out-of-place updates for flash memory incur garbage collection costs when valid data needs to be copied during space reclamation. The root cause of this extra cost is that solid-state disks are not always able to accurately determine data lifetime and group together data that expires before the space needs to be reclaimed. Real-time systems found in autonomous vehicles, industrial control systems, and assembly-line robots store data from hundreds of sensors and often have predictable data lifetimes. These systems require guaranteed high storage bandwidth for read and write operations by mission-critical real-time tasks. In this article, we depart from the traditional block device interface to guarantee the high throughput needed to process large volumes of data. Using data lifetime information from the application layer, our proposed real-time design, called Telomere, is able to intelligently lay out data in NAND flash memory and eliminate valid page copies during garbage collection (see the sketch after this list). Telomere's real-time admission control is able to guarantee tasks their required read and write operations within their periods. Under randomly generated tasksets containing 500 tasks, Telomere achieves 30% higher throughput with a 5% storage cost compared to pre-existing techniques.
  4. Modern SSDs achieve high throughput by utilizing multiple independent channels and chips in parallel. However, we find that excessive parallelism inadvertently amplifies the garbage collection (GC) overhead due to the larger unit of space reclamation. Based on this observation, we design PLAN, a novel SSD parallelism management and data placement scheme that allocates different levels of parallelism to workloads with different needs in order to minimize the GC overhead (see the sketch after this list). We demonstrate the effectiveness of PLAN by evaluating it against other state-of-the-art designs across various real-world workloads. PLAN reduces write amplification while delivering performance comparable to or better than the other designs, which always operate at full parallelism.
  5. Shingled Magnetic Recording (SMR) may be combined with conventional (re-writable) recording on the same drive; in host-managed drives shipping today this capability is used to provide a small number of re-writable zones, typically totaling a few tens of GB. Although these re-writable zones are widely used by SMR-aware applications, the literature to date has ignored them and focused on fully-shingled devices. We describe μCache, an SMR translation layer (STL) that uses re-writable (mutable) zones to exploit both the spatial and temporal locality of workloads and reduce the garbage collection overhead resulting from out-of-place writes. In μCache the volume LBA space is divided into fixed-size buckets and, on write access, the corresponding bucket is copied (promoted) to the re-writable zones, allowing subsequent writes to the same bucket to be served in place, resulting in fewer garbage collection cycles (see the sketch after this list). We evaluate μCache in simulation against real-world traces and show that, with appropriate parameters, it is able to hold the entire write working set of most workloads in re-writable storage, virtually eliminating garbage collection overhead. We also emulate μCache by replaying its translated traces against an actual drive and show that 1) it outperforms its examined counterpart, an E-region based translation approach, by 2x on average and up to 5.1x, and 2) it incurs additional latency only for a small fraction of write operations (up to 10%) compared with conventional non-shingled disks.
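For item 1 above, here is a minimal sketch of exact Shapley-value write attribution. It assumes a stand-in cost function that reports total device-level writes (including garbage-collection writes) for any subset of consolidated workloads; the paper's actual estimator and cost model are not reproduced, and the toy numbers are invented.

```python
# Minimal Shapley-value sketch for write attribution (illustrative only).
# Each workload's share is its average marginal contribution to total device
# writes over all orderings of the consolidated workloads.

from itertools import permutations

def shapley_write_attribution(workloads, device_writes):
    """workloads: list of workload ids.
    device_writes: callable mapping a frozenset of workloads to the total
    device-level writes (logical + GC) they induce when run together."""
    shares = {w: 0.0 for w in workloads}
    orderings = list(permutations(workloads))
    for order in orderings:
        placed = frozenset()
        for w in order:
            marginal = device_writes(placed | {w}) - device_writes(placed)
            shares[w] += marginal / len(orderings)
            placed = placed | {w}
    return shares

# Toy example: "a" writes sequentially (little GC), "b" writes randomly, and
# co-running them amplifies GC. The numbers are made up.
costs = {
    frozenset(): 0,
    frozenset({"a"}): 100,
    frozenset({"b"}): 300,
    frozenset({"a", "b"}): 500,
}
print(shapley_write_attribution(["a", "b"], lambda s: costs[frozenset(s)]))
# -> {'a': 150.0, 'b': 350.0}
```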
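For item 2, one possible reading of the "insert forecasting" idea (C): reserve slack in fixed-size pages proportional to the inserts forecast for each key range so that future inserts land in place rather than forcing a reorganization. The names, constants, and forecast model below are assumptions, not TreeLine's implementation.

```python
# Rough sketch of insert forecasting for an update-in-place store: when
# packing sorted records into fixed-size pages, close each page early enough
# that the forecast number of future inserts for its key range still fits.

PAGE_CAPACITY = 64  # records per page, illustrative

def pack_pages(sorted_keys, forecast_inserts):
    """sorted_keys: existing keys in order.
    forecast_inserts: callable(lo_key, hi_key) -> expected inserts into that
    key range before the next reorganization (a model stand-in)."""
    pages, page = [], []
    for i, key in enumerate(sorted_keys):
        page.append(key)
        hi = sorted_keys[i + 1] if i + 1 < len(sorted_keys) else key
        expected = forecast_inserts(page[0], hi)
        # Close the page once existing records plus forecast inserts would
        # overflow it, keeping the forecast amount as free slack.
        if len(page) + expected >= PAGE_CAPACITY:
            pages.append(page)
            page = []
    if page:
        pages.append(page)
    return pages

# Usage: a flat forecast of 16 inserts per range packs about 48 records per
# page, leaving room for the expected growth.
pages = pack_pages(list(range(1000)), lambda lo, hi: 16)
print(len(pages), [len(p) for p in pages[:3]])  # -> 21 [48, 48, 48]
```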
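For item 3, a simplified model of lifetime-based placement in the spirit of Telomere: pages with similar application-supplied expiry times are steered into the same flash block so that whole blocks expire together and can be erased without copying valid pages. The class and constants are illustrative, not Telomere's actual layout or admission-control logic.

```python
# Illustrative lifetime-based placement: writes carry an expiry time from the
# application, and pages with similar expiry land in the same block.

from collections import defaultdict

LIFETIME_BUCKET_SECONDS = 60  # group data expiring within the same minute

class LifetimePlacer:
    def __init__(self):
        self.open_blocks = defaultdict(list)  # expiry bucket -> pages

    def write(self, page_data: bytes, expiry_time: float) -> tuple[int, int]:
        """Place a page by its expiry bucket; returns (bucket, page_index)."""
        bucket = int(expiry_time // LIFETIME_BUCKET_SECONDS)
        block = self.open_blocks[bucket]
        block.append(page_data)
        return bucket, len(block) - 1

    def reclaim(self, now: float) -> int:
        """Erase blocks whose expiry bucket lies wholly in the past: no valid
        pages remain, so no copy-forward is needed during garbage collection."""
        expired = [b for b in self.open_blocks
                   if (b + 1) * LIFETIME_BUCKET_SECONDS <= now]
        for b in expired:
            del self.open_blocks[b]
        return len(expired)
```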
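For item 4, a toy illustration of parallelism-aware allocation: each workload receives only as many channels as its bandwidth requires, on the premise that striping across all channels enlarges the reclamation unit and with it the garbage-collection cost. The policy, constants, and names are assumptions, not PLAN's actual algorithm.

```python
# Sketch of bandwidth-driven channel allocation: narrow stripes keep the
# space-reclamation unit small for workloads that do not need full parallelism.

import math

TOTAL_CHANNELS = 16
CHANNEL_BANDWIDTH_MBPS = 400  # illustrative per-channel write bandwidth

def allocate_channels(demands_mbps: dict[str, float]) -> dict[str, int]:
    """demands_mbps: workload -> required write bandwidth (MB/s).
    Returns workload -> number of dedicated channels, capped at the total."""
    alloc = {}
    remaining = TOTAL_CHANNELS
    # Serve the most demanding workloads first so the cap hits the right ones.
    for w, demand in sorted(demands_mbps.items(), key=lambda kv: -kv[1]):
        need = max(1, math.ceil(demand / CHANNEL_BANDWIDTH_MBPS))
        alloc[w] = min(need, remaining)
        remaining -= alloc[w]
    return alloc

# A bandwidth-heavy workload gets a wide stripe; a small metadata workload
# gets a narrow one, so its reclamation unit (and GC copy cost) stays small.
print(allocate_channels({"analytics": 3200, "metadata": 150}))
# -> {'analytics': 8, 'metadata': 1}
```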
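For item 5, a minimal sketch of the bucket-promotion behavior: the first write to a bucket promotes it into the re-writable zones, and later writes to that bucket are served in place. Constants and names are illustrative; this is not the actual μCache STL.

```python
# Minimal sketch of bucket promotion for an SMR translation layer: the LBA
# space is split into fixed-size buckets, and a promoted bucket absorbs all
# subsequent writes in place, avoiding out-of-place writes and the garbage
# collection they would otherwise trigger.

BUCKET_SIZE_LBAS = 2048  # illustrative bucket size in logical blocks

class BucketPromotionSketch:
    def __init__(self, rewritable_capacity_buckets: int):
        self.capacity = rewritable_capacity_buckets
        self.promoted = {}  # bucket id -> slot in the re-writable zones

    def on_write(self, lba: int) -> str:
        """Classify a write: in-place hit, promotion, or shingled fallback."""
        bucket = lba // BUCKET_SIZE_LBAS
        if bucket in self.promoted:
            return "in-place"    # served by the re-writable zone
        if len(self.promoted) < self.capacity:
            self.promoted[bucket] = len(self.promoted)
            return "promote"     # copy the bucket up, then serve in place
        return "shingled"        # falls back to out-of-place (GC-prone) writes
```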