

Title: BAOverlay: a block-accessible overlay file system for fast and efficient container storage
Container storage commonly relies on overlay file systems to interpose read-only container images upon backing file systems. While being transparent to and compatible with most existing backing file systems, the overlay file-system approach imposes nontrivial I/O overhead on containerized applications, especially for writes: to write to a file originating from a read-only container image, the whole file must first be copied to a separate, writable storage layer, resulting in long write latency and inefficient use of container storage. In this paper, we present BAOverlay, a lightweight, block-accessible overlay file system: equipped with a new block-accessibility attribute, BAOverlay not only exploits the benefit of an asynchronous copy-on-write mechanism for fast file updates but also enables a new file format for efficient use of container storage space. We have developed a prototype of BAOverlay upon Linux Ext4. Our evaluation with both micro-benchmarks and real-world applications demonstrates the effectiveness of BAOverlay, with improved write performance and on-demand container storage usage.
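To make the mechanism concrete, here is a minimal Python sketch of copy-on-write at block rather than file granularity, the core idea behind block accessibility as the abstract describes it; the names (BlockCowFile, BLOCK_SIZE) and the synchronous copy path are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch (not BAOverlay's code): copy-on-write at block granularity.
# A per-block bitmap records which blocks have been copied up to the writable
# layer; unmodified blocks are still read from the read-only image layer, so
# a small write no longer forces a whole-file copy-up.

BLOCK_SIZE = 4096  # assumed block size

class BlockCowFile:
    def __init__(self, lower, upper, nblocks):
        self.lower = lower                  # file object: read-only image layer
        self.upper = upper                  # file object: sparse writable layer
        self.copied = [False] * nblocks     # per-block copy-up bitmap

    def write_block(self, idx, data):
        assert len(data) == BLOCK_SIZE, "partial writes need read-modify-write"
        self.upper.seek(idx * BLOCK_SIZE)
        self.upper.write(data)              # copy up only the touched block
        self.copied[idx] = True

    def read_block(self, idx):
        # Serve each block from whichever layer holds its latest version.
        src = self.upper if self.copied[idx] else self.lower
        src.seek(idx * BLOCK_SIZE)
        return src.read(BLOCK_SIZE)
```

A partial-block write would first read the old block from the image layer and merge it in, and, per the abstract, BAOverlay performs its copies asynchronously to keep them off the write's critical path; the sketch keeps everything synchronous for brevity.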
Award ID(s):
1909877
NSF-PAR ID:
10297237
Author(s) / Creator(s):
Date Published:
Journal Name:
Proceedings of the 11th ACM Symposium on Cloud Computing (SoCC 2020)
Page Range / eLocation ID:
90 to 104
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. With expanding data storage capacity needs, DNA offers potential advantages as an alternative archival storage medium, including higher density and longer data retention [1,2]. However, most DNA-based memory systems are write-once and read-only, although a few studies have suggested overwriting digital data on existing DNA using chemical modifications of bases [3]. Those strategies require constantly updating the entire data encoding and iteratively resynthesizing the DNA pool; considering their complexity and cost, they need further refinement to become industrially scalable. Inspired by magnetic tapes [4] and multisession CDs [5], in this work we created a DNA storage system, coined the Molecular File System (MolFS), to organize, store, and edit digital information in a DNA pool. MolFS uses DNA pools consisting of multiple sessions, where each session contains a data-block section and a unique index section used to store and edit files. The indexes describe the file system hierarchy, locate files and their blocks, identify the sessions, and track file versions. This approach reduces editing cost compared to state-of-the-art methods: editing or adding data requires synthesizing only a new DNA session containing the differential file blocks. As a proof of concept, we encoded 2.3 Kbytes of graphic and text data into 2 DNA pools. To edit the existing DNA pool, we added 8 new differential data blocks to existing pools, reaching 13.8 Kbytes of data stored across sessions 1 to 5. We performed nanopore sequencing and recovered the data from the MolFS sessions accurately and precisely.
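The session-plus-index layout this abstract describes can be pictured with a short, purely illustrative Python sketch; the Session/Pool names and fields are assumptions, not MolFS's actual on-DNA format. Later sessions carry only differential blocks, and the newest index entry for a file shadows older ones.

```python
# Illustrative multi-session pool: each session holds data blocks plus an
# index mapping file names to versions and block locations (not MolFS's API).

from dataclasses import dataclass, field

@dataclass
class Session:
    number: int
    blocks: list = field(default_factory=list)   # data blocks (bytes)
    index: dict = field(default_factory=dict)    # name -> (version, block ids)

class Pool:
    def __init__(self):
        self.sessions = []

    def add_session(self, files):
        """Append a session holding only the differential blocks."""
        s = Session(number=len(self.sessions) + 1)
        for name, (version, data) in files.items():
            block_id = len(s.blocks)
            s.blocks.append(data)
            s.index[name] = (version, [block_id])
        self.sessions.append(s)

    def latest(self, name):
        """Newest version wins: later sessions shadow earlier ones."""
        for s in reversed(self.sessions):
            if name in s.index:
                version, ids = s.index[name]
                return version, b"".join(s.blocks[i] for i in ids)
        raise FileNotFoundError(name)
```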
  2. Fast, byte-addressable persistent memory (PM) is becoming a reality in products. However, porting legacy kernel file systems to fully support PM requires substantial effort and encounters the challenge of bridging the gap between block-based access granularity and byte-addressability. Moreover, new PM-specific file systems remain far from production-ready, preventing them from being widely used. In this paper, we propose P2CACHE, a novel in-kernel caching mechanism to explore how legacy kernel file systems can effectively evolve in the face of fast, byte-addressable PM. P2CACHE exploits a read/write-distinguishable memory hierarchy upon a tiered memory system involving both PM and DRAM. P2CACHE leverages PM to serve all write requests for instant data durability and strong crash consistency while using DRAM to serve most read I/Os for high I/O performance. Further, P2CACHE employs a simple yet effective synchronization model between PM and DRAM by leveraging device-level parallelism. Our evaluation shows that P2CACHE can significantly increase the performance of legacy kernel file systems -- e.g., by 200x for RocksDB on Ext4 -- while equipping them with instant data durability and strong crash consistency, similar to PM-specialized file systems.
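As a rough sketch of the read/write-distinguishable idea: writes land in PM first for durability, while reads are served from a DRAM copy. The class and names below are hypothetical, and real PM requires cache-line flushes and fences rather than Python dicts.

```python
# Toy model of a read/write-distinguishable tiered cache (not P2CACHE's code).

class TieredPageCache:
    def __init__(self):
        self.pm = {}    # stand-in for persistent memory: page -> bytes
        self.dram = {}  # stand-in for the DRAM read cache: page -> bytes

    def write(self, page_no, data):
        # 1. Persist to PM synchronously: instant durability and crash
        #    consistency (a real system would flush cache lines + fence).
        self.pm[page_no] = data
        # 2. Update the DRAM copy; conceptually this can proceed in
        #    parallel with device-level work, off the critical path.
        self.dram[page_no] = data

    def read(self, page_no):
        # Most reads hit DRAM; fall back to PM on a miss and refill.
        if page_no in self.dram:
            return self.dram[page_no]
        data = self.pm[page_no]
        self.dram[page_no] = data
        return data
```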
  3. Log-based data management systems use storage as if it were an append-only medium, transforming random writes into sequential writes, which delivers significant benefits when logs are persisted on hard disks. Although solid-state drives (SSDs) offer improved random write capabilities, sequential writes continue to be advantageous due to locality and space efficiency. However, the inherent properties of flash-based SSDs induce major disadvantages when used with a random write block interface, causing write amplification, uneven wear, log stacking, and garbage collection overheads. To eliminate these disadvantages, Zoned Namespace (ZNS) SSDs have recently been introduced. They offer increased capacity, reduced write amplification, and open up data placement and garbage collection to the host through zones, which have sequential-write semantics and must be explicitly reset. We explain how the new ZNS Zone Append primitive, which supports pushing fine-grained data placement onto the device, along with our proposal for "Group Append", which enables sub-block sized appends, could benefit log-structured data management systems. We explore the advantages of ZNS SSDs with Zone Append, Group Append, and computational storage in four log-based data management areas: (i) log-based file systems, (ii) LSM trees such as RocksDB, (iii) database systems, and (iv) event logs/shared logs. Furthermore, we propose research directions for each of these data management systems using ZNS SSDs.
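A toy model of the zone semantics this abstract relies on: sequential-write-only zones with an explicit reset, plus a Zone Append that lets the device choose the offset and report it back. This is illustrative Python, not the NVMe/ZNS interface.

```python
# Toy ZNS zone: sequential writes only, explicit reset, and Zone Append
# where the device assigns the location and returns it to the caller.

class Zone:
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = bytearray()

    def append(self, payload):
        """Zone Append: the device picks the offset and returns it."""
        if len(self.data) + len(payload) > self.capacity:
            raise IOError("zone full")
        offset = len(self.data)      # current write pointer
        self.data += payload
        return offset                # caller records where the data landed

    def write_at(self, offset, payload):
        """A plain zone write must match the write pointer exactly."""
        if offset != len(self.data):
            raise IOError("non-sequential write rejected")
        return self.append(payload)

    def reset(self):
        """Explicit reset makes the zone writable from the start again."""
        self.data = bytearray()
```

Zone Append pays off under concurrency: multiple writers can issue appends to the same zone without serializing on the write pointer, since each learns its offset from the completion.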
  4. Server systems with large amounts of physical memory can benefit from using some of the available capacity for in-memory snapshots of ongoing computations. In-memory snapshots are useful for services such as scaling new workload instances, debugging, and scheduling, which do not require snapshot persistence across node crashes or reboots. Since servers increasingly run containerized workloads, using technologies such as Docker, snapshot and restore mechanisms would be applied at the granularity of containers. However, CRIU, the current approach to snapshotting and restoring containers, suffers from expensive filesystem write/read operations on image files containing memory pages; these operations dominate the runtime costs and limit the potential benefits of manipulating in-memory process state. In this paper, we demonstrate that these overheads can be eliminated by using MVAS -- kernel support for multiple independent virtual address spaces (VAS), designed specifically for machines with large memory capacities. The resulting VAS-CRIU stores application memory as a separate snapshot address space in DRAM and avoids costly file system operations. This accelerates the snapshot/restore of address spaces by two orders of magnitude, reducing overall snapshot time by up to 10× and restore time by up to 9×. We demonstrate the utility of VAS-CRIU for container management services such as fine-grained snapshot generation and container instance scaling.
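The contrast the abstract draws can be sketched as follows; the classes are hypothetical stand-ins, since MVAS address spaces are a kernel mechanism rather than Python objects, and the point is only where the snapshot bytes live.

```python
# Conceptual contrast (not VAS-CRIU's code): file-backed CRIU-style images
# versus keeping snapshot pages in a separate in-memory "address space".

import copy
import pickle

class FileBackedSnapshot:
    """CRIU-style: serialize memory pages to image files (the slow path)."""
    def save(self, pages, path):
        with open(path, "wb") as f:
            pickle.dump(pages, f)    # filesystem writes dominate the cost
    def restore(self, path):
        with open(path, "rb") as f:
            return pickle.load(f)

class VasSnapshot:
    """VAS-CRIU-style: keep page copies in DRAM, no file I/O on the path."""
    def __init__(self):
        self.spaces = {}             # snapshot id -> page copies in DRAM
    def save(self, snap_id, pages):
        self.spaces[snap_id] = copy.deepcopy(pages)
    def restore(self, snap_id):
        return copy.deepcopy(self.spaces[snap_id])
```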
  5. Modern NoSQL database systems use log-structured merge (LSM) storage architectures to support high write throughput. LSM architectures aggregate writes in a mutable MemTable (stored in memory), which is regularly flushed to disk, creating a new immutable file called an SSTable. Some of the SSTables are chosen to be periodically merged, i.e., replaced with a single SSTable containing their union. A merge policy (a.k.a. compaction policy) specifies when to do merges and which SSTables to combine. A bounded depth merge policy is one that guarantees that the number of SSTables never exceeds a given parameter k, typically in the range 3–10. Bounded depth policies are useful in applications where low read latency is crucial, but they and their underlying combinatorics are not yet well understood. This paper compares several bounded depth policies, including representative policies from industrial NoSQL databases and two new ones based on recent theoretical modeling, as well as the standard Tiered and Leveled policies. The results validate the proposed theoretical model and show that, compared to the existing policies, the newly proposed policies can have substantially lower write amplification with comparable read amplification.
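For intuition, a minimal bounded-depth policy can be sketched in a few lines of Python; the choice to merge the two smallest SSTables when the bound is hit is illustrative and is not one of the policies the paper evaluates.

```python
# Toy bounded-depth merge policy: never let the SSTable count exceed k.
# SSTables are represented by their sizes only.

def flush(sstables, memtable_size, k=4):
    """Flush a MemTable as a new SSTable, then enforce the depth bound."""
    sstables.append(memtable_size)
    if len(sstables) > k:
        # Merge the two smallest SSTables into one. Which tables a policy
        # merges is exactly what trades write amplification (bytes rewritten
        # by merges) against read amplification (tables a read may consult).
        sstables.sort()
        merged = sstables[0] + sstables[1]
        sstables = [merged] + sstables[2:]
    return sstables

tables = []
for size in [10, 10, 10, 10, 10, 10]:
    tables = flush(tables, size)
assert len(tables) <= 4  # the depth bound k holds after every flush
```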