Title: MadFS: Per-File Virtualization for Userspace Persistent Memory Filesystems
Persistent memory (PM) can be accessed directly from userspace without kernel involvement, but most PM filesystems still perform metadata operations in the kernel for security and rely on the kernel for cross-process synchronization. We present per-file virtualization, where a virtualization layer implements a complete set of file functionalities, including metadata management, crash consistency, and concurrency control, in userspace. We observe that not all file metadata need to be maintained by the kernel and propose embedding insensitive metadata into the file for userspace management. For crash consistency, copy-on-write (CoW) benefits from the embedding of the block mapping since the mapping can be efficiently updated without kernel involvement. For cross-process synchronization, we introduce lock-free optimistic concurrency control (OCC) at user level, which tolerates process crashes and provides better scalability. Based on per-file virtualization, we implement MadFS, a library PM filesystem that maintains the embedded metadata as a compact log. Experimental results show that on concurrent workloads, MadFS achieves up to 3.6× the throughput of ext4-DAX. For real-world applications, MadFS provides up to 48% speedup for YCSB on LevelDB and 85% for TPC-C on SQLite compared to NOVA.
Award ID(s):
1900758
PAR ID:
10476649
Author(s) / Creator(s):
; ; ; ; ; ;
Publisher / Repository:
USENIX Association
Date Published:
Journal Name:
Proceedings of 21st USENIX Conference on File and Storage Technologies (FAST ’23)
Subject(s) / Keyword(s):
Persistent memory file system userspace synchronization
Format(s):
Medium: X
Location:
Santa Clara, CA
Sponsoring Org:
National Science Foundation
More Like this
  1. We present SplitFS, a file system for persistent memory (PM) that reduces software overhead significantly compared to state-of-the-art PM file systems. SplitFS presents a novel split of responsibilities between a user-space library file system and an existing kernel PM file system. The user-space library file system handles data operations by intercepting POSIX calls, memory-mapping the underlying file, and serving reads and overwrites using processor loads and stores. Metadata operations are handled by the kernel PM file system (ext4 DAX). SplitFS introduces a new primitive termed relink to efficiently support file appends and atomic data operations. SplitFS provides three consistency modes, which different applications can choose from, without interfering with each other. SplitFS reduces software overhead by up to 4× compared to the NOVA PM file system, and 17× compared to ext4 DAX. On a number of micro-benchmarks and applications such as the LevelDB key-value store running the YCSB benchmark, SplitFS increases application performance by up to 2× compared to ext4 DAX and NOVA while providing similar consistency guarantees.
  2. Fast, byte-addressable persistent memory (PM) is becoming a reality in products. However, porting legacy kernel file systems to fully support PM requires substantial effort and encounters the challenge of bridging the gap between block-based access granularity and byte-addressability. Moreover, new PM-specific file systems remain far from production-ready, preventing them from being widely used. In this paper, we propose P2CACHE, a novel in-kernel caching mechanism to explore how legacy kernel file systems can effectively evolve in the face of fast, byte-addressable PM. P2CACHE exploits a read/write-distinguishable memory hierarchy upon a tiered memory system involving both PM and DRAM. P2CACHE leverages PM to serve all write requests for instant data durability and strong crash consistency while using DRAM to serve most read I/Os for high I/O performance. Further, P2CACHE employs a simple yet effective synchronization model between PM and DRAM by leveraging device-level parallelism. Our evaluation shows that P2CACHE can significantly increase the performance of legacy kernel file systems (e.g., by 200x for RocksDB on Ext4) while equipping them with instant data durability and strong crash consistency, similar to PM-specialized file systems.
  3. Wasm is gaining popularity outside the Web as a well-specified low-level binary format with ISA portability, low memory footprint and polyglot targetability, enabling efficient in-process sandboxing of untrusted code. Despite these advantages, Wasm adoption for new domains is often hindered by the lack of many standard system interfaces, which precludes reusability of existing software and slows ecosystem growth. This paper proposes thin kernel interfaces for Wasm, which directly expose OS userspace syscalls without breaking intra-process sandboxing, enabling a new class of virtualization with Wasm as a universal binary format. By virtualizing the bottom layer of userspace, kernel interfaces enable effortless application ISA portability and compiler backend reusability, and armor programs with Wasm’s built-in control flow integrity and arbitrary code execution protection. Furthermore, existing capability-based APIs for Wasm, such as WASI, can be implemented as a Wasm module over kernel interfaces, improving reuse, robustness, and portability through better layering. We present an implementation of this concept for two kernels, Linux and Zephyr, by extending a modern Wasm engine and evaluate our system’s performance on a number of sophisticated applications which can run for the first time on Wasm.
  4. Current hardware and application storage trends put immense pressure on the operating system's storage subsystem. On the hardware side, the market for storage devices has diversified to a multi-layer storage topology spanning multiple orders of magnitude in cost and performance. Above the file system, applications increasingly need to process small, random IO on vast data sets with low latency, high throughput, and simple crash consistency. File systems designed for a single storage layer cannot support all of these demands together. We present Strata, a cross-media file system that leverages the strengths of one storage medium to compensate for weaknesses of another. In doing so, Strata provides performance, capacity, and a simple, synchronous IO model all at once, while having a simpler design than that of file systems constrained by a single storage device. At its heart, Strata uses a log-structured approach with a novel split of responsibilities among user mode, kernel, and storage layers that separates the concerns of scalable, high-performance persistence from storage layer management. We quantify the performance benefits of Strata using a 3-layer storage hierarchy of emulated NVM, a flash-based SSD, and a high-density HDD. Strata has 20-30% better latency and throughput, across several unmodified applications, compared to file systems purpose-built for each layer, while providing synchronous and unified access to the entire storage hierarchy. Finally, Strata achieves up to 2.8x better throughput than a block-based 2-layer cache provided by Linux's logical volume manager.
  5. We present Chipmunk, a new framework to test persistent-memory (PM) file systems for crash-consistency bugs. Using Chipmunk, we discovered 23 new bugs across five PM file systems; most bugs have been confirmed and fixed by developers. The discovered bugs have serious consequences, including making the file system un-mountable or breaking rename atomicity. We present a detailed study of the bugs found using Chipmunk and discuss important lessons learned for designing and testing PM file systems. 