Title: Twizzler: A Data-centric OS for Non-volatile Memory
Byte-addressable, non-volatile memory (NVM) presents an opportunity to rethink the entire system stack. We present Twizzler, an operating system redesign for this near-future. Twizzler removes the kernel from the I/O path, provides programs with memory-style access to persistent data using small (64-bit), object-relative cross-object pointers, and enables simple and efficient long-term sharing of data both between applications and between runs of an application. Twizzler provides a clean-slate programming model for persistent data, realizing the vision of Unix in a world of persistent RAM. We show that Twizzler is simpler, more extensible, and more secure than existing I/O models and implementations by building software for Twizzler and evaluating it on NVM DIMMs. Most persistent pointer operations in Twizzler impose less than 0.5 ns of added latency. Twizzler operations are up to 13× faster than Unix, and SQLite queries are up to 4.2× faster than on PMDK. YCSB workloads ran 1.1–2.9× faster on Twizzler than on native and NVM-optimized SQLite backends.
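As context for the pointer design, the sketch below illustrates how an object-relative cross-object pointer can be resolved: the 64-bit value packs an index into the source object's foreign-object table (FOT) together with an offset within the referenced object. The field widths, struct layouts, and the map_object() call are assumptions made for illustration, not Twizzler's actual ABI or runtime API.

    #include <stdint.h>

    /* Illustrative sketch only: a 64-bit persistent pointer packs an
     * index into the source object's foreign-object table (FOT) and an
     * object-relative offset. Widths and names are assumptions. */
    enum { FOT_BITS = 20, OFF_BITS = 64 - FOT_BITS };

    struct fot_entry {
        uint64_t target_id;   /* identifies the referenced object (simplified) */
        uint32_t flags;       /* e.g. requested protections */
    };

    struct object_hdr {
        struct fot_entry *fot;    /* this object's foreign-object table */
    };

    /* Hypothetical runtime call: map the named object if needed and
     * return its base address in this address space. */
    void *map_object(uint64_t target_id);

    /* Resolve a persistent cross-object pointer to a virtual address.
     * In this sketch, FOT index 0 means "within the source object". */
    static void *ptr_resolve(const struct object_hdr *src, void *src_base,
                             uint64_t p)
    {
        uint64_t idx = p >> OFF_BITS;
        uint64_t off = p & ((1ull << OFF_BITS) - 1);

        if (idx == 0)
            return (char *)src_base + off;                  /* intra-object */
        return (char *)map_object(src->fot[idx].target_id) + off;
    }

Because the pointer only names an FOT entry and an offset, the same persistent data structure remains valid across processes and across runs, regardless of where objects are mapped in any particular address space.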
Award ID(s):
1841545
PAR ID:
10315206
Author(s) / Creator(s):
Bittman, Daniel; Alvaro, Peter; Mehra, Pankaj; Long, Darrell D. E.; Miller, Ethan L.
Date Published:
Journal Name:
ACM Transactions on Storage
Volume:
17
Issue:
2
ISSN:
1553-3077
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Persistent memory presents a great opportunity for crash-consistent computing in large-scale computing systems. The ability to recover data upon power outage or crash events can significantly improve the availability of large-scale systems, while improving the performance of persistent-data applications (e.g., database applications). However, persistent memory suffers from high write latency and requires a specific programming model (e.g., Intel's PMDK) to guarantee crash consistency, which results in long latency to persist data. To mitigate these problems, recent standards advocate for sufficient back-up power to flush the whole cache hierarchy to persistent memory upon detection of an outage, i.e., extending the persistence domain to include the cache hierarchy. In secure NVM with an extended persistence domain (EPD), extra actions must be taken, beyond flushing the cache hierarchy, to protect the flushed cache data. These extra secure-memory operations can place a significant burden on energy costs and battery size. We demonstrate that naive implementations can significantly expand the required power-holdup budget (e.g., 10.3x more operations than an EPD system without secure memory support). This overhead is caused by memory accesses to secure metadata. In this paper, we present Horus, a novel EPD-aware secure memory implementation. Horus reduces the overhead of the EPD draining period by reducing memory accesses to secure metadata. Experimental results show that Horus reduces draining time by 5x compared with the naive baseline design.
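To see why secure metadata dominates the drain cost, consider a back-of-the-envelope model under generic counter-mode encryption and integrity-tree assumptions (an illustrative sketch, not Horus's exact cost model). For D dirty cache lines and an integrity tree of height h, a naive secure drain performs roughly

    \[
      N_{\text{drain}} \;\approx\;
        \underbrace{D}_{\text{data lines}}
        + \underbrace{D}_{\text{counters}}
        + \underbrace{D}_{\text{MACs}}
        + \underbrace{D\,h}_{\text{integrity-tree nodes}}
      \;=\; D\,(3 + h)
    \]

persistent writes, versus only the D data writes for an EPD drain without secure memory; the metadata terms are what Horus attacks by reducing secure-metadata accesses during draining.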
  2. Parallel I/O is an effective method to optimize data movement between memory and storage for many scientific applications. Poor performance of traditional disk-based file systems has led to the design of I/O libraries that take advantage of faster memory layers, such as on-node memory, present in high-performance computing (HPC) systems. By allowing caching and prefetching of data for applications that alternate computation and I/O phases, a faster memory layer also provides opportunities for hiding the latency of I/O phases by overlapping them with computation phases, a technique called asynchronous I/O. Since asynchronous parallel I/O in HPC systems is still in the initial stages of development, there has not been a systematic study of the factors affecting its performance. In this paper, we systematically study the factors affecting the performance and efficacy of asynchronous I/O, develop a performance model to estimate the aggregate I/O bandwidth achievable by iterative applications using synchronous and asynchronous I/O based on past observations, and evaluate the recently developed asynchronous I/O feature of a parallel I/O library (HDF5) using benchmarks and real-world science applications. Our study covers parallel file systems on two large-scale HPC systems, Summit and Cori, the former with GPFS storage and the latter with a Lustre parallel file system.
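A first-order version of such a model (an illustrative sketch under simplifying assumptions, not necessarily the paper's exact formulation) treats the application as n iterations, each with a computation phase of length T_comp followed by an I/O phase of length T_io that moves D bytes:

    \[
      T_{\text{sync}} = n\,(T_{\text{comp}} + T_{\text{io}}),
      \qquad
      T_{\text{async}} \approx T_{\text{comp}}
        + (n-1)\,\max(T_{\text{comp}}, T_{\text{io}})
        + T_{\text{io}}
    \]

The aggregate I/O bandwidths are then roughly nD / T_sync and nD / T_async: asynchronous I/O hides the I/O phases almost completely when T_io <= T_comp, and degenerates to the synchronous case when computation time is negligible.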
  3. Scientific data analysis pipelines face scalability bottlenecks when processing massive datasets that consist of millions of small files. Such datasets commonly arise in domains as diverse as detecting supernovae and post-processing computational fluid dynamics simulations. Furthermore, applications often use inference frameworks such as TensorFlow and PyTorch whose naive I/O methods exacerbate I/O bottlenecks. One solution is to use scientific file formats, such as HDF5 and FITS, to organize small arrays in one big file. However, storing everything in one file does not fully leverage the heterogeneous data storage capabilities of modern clusters. This paper presents Henosis, a system that intercepts data accesses inside the HDF5 library and transparently redirects I/O to the in-memory Redis object store or the disk-based TileDB array store. During this process, Henosis consolidates small arrays into bigger chunks and intelligently places them in data stores. A critical research aspect of Henosis is that it formulates object consolidation and data placement as a single optimization problem. Henosis carefully constructs a graph to capture the I/O activity of a workload and produces an initial solution to the optimization problem using graph partitioning. Henosis then refines the solution using a hill-climbing algorithm which migrates arrays between data stores to minimize I/O cost. The evaluation on two real scientific data analysis pipelines shows that consolidation with Henosis makes I/O 300× faster than directly reading small arrays from TileDB and 3.5× faster than workload-oblivious consolidation methods. Moreover, jointly optimizing consolidation and placement in Henosis makes I/O 1.7× faster than strategies that perform consolidation and placement independently. 
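The refinement step can be pictured as a generic greedy hill-climbing loop over chunk placements (a sketch of the general technique only; the two-store model, the cost() estimator, and all names below are hypothetical, not Henosis's actual code):

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical model: each consolidated chunk is assigned to one of
     * two stores (in-memory or on-disk); cost() estimates the workload's
     * total I/O cost for a given assignment. */
    enum store { STORE_MEMORY, STORE_DISK };

    double cost(const enum store *placement, size_t nchunks);

    /* Greedy hill climbing: repeatedly apply the single migration that
     * most reduces estimated I/O cost; stop at a local minimum. */
    static void refine(enum store *placement, size_t nchunks)
    {
        bool improved = true;
        while (improved) {
            improved = false;
            double best = cost(placement, nchunks);
            size_t best_i = nchunks;                /* "no move" sentinel */

            for (size_t i = 0; i < nchunks; i++) {
                enum store old = placement[i];
                placement[i] = (old == STORE_MEMORY) ? STORE_DISK
                                                     : STORE_MEMORY;
                double c = cost(placement, nchunks);
                if (c < best) { best = c; best_i = i; }
                placement[i] = old;                 /* undo trial move */
            }
            if (best_i < nchunks) {
                placement[best_i] = (placement[best_i] == STORE_MEMORY)
                                    ? STORE_DISK : STORE_MEMORY;
                improved = true;
            }
        }
    }

Seeding this loop with the graph-partitioning solution, as the abstract describes, gives the climber a good starting point so it converges quickly instead of wandering from a random assignment.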
  4. Cloud-native microservice applications use different communication paradigms to network microservices, including both synchronous and asynchronous I/O for exchanging data. Existing solutions depend on kernel-based networking, incurring significant overheads. The interdependence between microservices in these applications involves considerable communication, including contention between multiple concurrent flows or user sessions. In this paper, we design X-IO, a high-performance unified I/O interface built on top of shared-memory processing with lock-free producer/consumer rings, eliminating kernel networking overheads and contention. X-IO offers a feature-rich interface. X-IO's zero-copy interface provides true zero-copy data transfers between microservices, achieving high performance. X-IO also provides a POSIX-like socket interface and an HTTP/REST API to achieve seamless porting of microservices to X-IO without any change to the application code. X-IO supports concurrent connections for microservices that require distinct user sessions operating in parallel. Our preliminary experimental results show that X-IO's zero-copy interfaces achieve a 2.8x–4.1x performance improvement over kernel-based interfaces. Its socket interfaces outperform kernel TCP sockets and achieve performance close to that of UNIX-domain sockets. The HTTP/REST APIs in X-IO perform 1.4x–2.3x better than kernel-based alternatives with concurrent connections.
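As an illustration of the underlying transport, the following is a minimal single-producer/single-consumer lock-free ring of the kind that can live in a shared-memory segment mapped by two microservices (a generic sketch; the names, slot format, and sizes are assumptions, not X-IO's actual interface). The slots carry only small descriptors, e.g., offsets into a shared buffer pool, which is what makes zero-copy handoff possible.

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdint.h>

    #define RING_SLOTS 1024u                  /* must be a power of two */

    struct ring {
        _Atomic uint32_t head;                /* next slot to consume */
        _Atomic uint32_t tail;                /* next slot to fill */
        uint64_t slot[RING_SLOTS];            /* message descriptors */
    };

    /* Producer side: returns false when the ring is full. */
    static bool ring_push(struct ring *r, uint64_t desc)
    {
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
        uint32_t head = atomic_load_explicit(&r->head, memory_order_acquire);
        if (tail - head == RING_SLOTS)
            return false;                     /* full */
        r->slot[tail & (RING_SLOTS - 1)] = desc;
        atomic_store_explicit(&r->tail, tail + 1, memory_order_release);
        return true;
    }

    /* Consumer side: returns false when the ring is empty. */
    static bool ring_pop(struct ring *r, uint64_t *desc)
    {
        uint32_t head = atomic_load_explicit(&r->head, memory_order_relaxed);
        uint32_t tail = atomic_load_explicit(&r->tail, memory_order_acquire);
        if (head == tail)
            return false;                     /* empty */
        *desc = r->slot[head & (RING_SLOTS - 1)];
        atomic_store_explicit(&r->head, head + 1, memory_order_release);
        return true;
    }

With one producer and one consumer per ring, only release/acquire ordering on the free-running head and tail counters is needed; no locks or system calls appear on the data path.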
  5. Persistent memory (PMem) is a low-latency storage technology connected to the processor memory bus. The Direct Access (DAX) interface promises fast access to PMem by mapping it directly into processes' virtual address spaces. However, virtual memory operations (e.g., paging) limit its performance and scalability. Through an analysis of Linux/x86 memory mapping, we find that current systems fall short of what the hardware can provide due to numerous software inefficiencies stemming from the OS assumption that memory mapping is for DRAM. In this paper we propose DaxVM, a design that extends the OS virtual memory and file system layers, leveraging persistent memory attributes to provide a fast and scalable DAX-mmap interface. DaxVM eliminates paging costs through pre-populated file page tables, supports faster and more scalable virtual address space management for ephemeral mappings, performs unmappings asynchronously, bypasses kernel-space dirty-page tracking, and adopts asynchronous block pre-zeroing. We implement DaxVM in Linux and the ext4 file system, targeting the x86-64 architecture. DaxVM mmap achieves 4.9× higher throughput than default mmap for the Apache webserver and up to 1.5× better performance than read system calls. It provides similar benefits for text search. It also provides fast boot times and up to 2.95× better throughput than default mmap for PMem-optimized key-value stores running on a fragmented ext4 image. Although designed for direct access to byte-addressable storage, various aspects of DaxVM are also relevant to efficient access to other high-performance storage media.
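For context, the baseline DAX path that DaxVM accelerates looks roughly like the standard Linux sequence below, which maps a file on a DAX-mounted ext4/xfs filesystem with MAP_SYNC so that CPU stores plus cache-line flushes persist without msync()/fsync() for the data. This is a generic sketch of the existing kernel interface (the path /mnt/pmem/data is made up), not DaxVM's modified mmap.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>

    int main(void)
    {
        size_t len = 1ull << 20;                           /* 1 MiB file */
        int fd = open("/mnt/pmem/data", O_CREAT | O_RDWR, 0644);
        if (fd < 0 || ftruncate(fd, (off_t)len) != 0)
            return 1;

        /* MAP_SYNC (with MAP_SHARED_VALIDATE) guarantees that while a
         * page is writably mapped, the file metadata needed to reach it
         * is durable, so stores flushed from the CPU caches persist
         * without a later msync()/fsync() for the data. */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_SHARED_VALIDATE | MAP_SYNC, fd, 0);
        if (p == MAP_FAILED)
            return 1;

        memcpy(p, "hello, pmem", 12);      /* load/store access to the file */
        /* A real program would flush the written cache lines here,
         * e.g. with clwb + sfence or libpmem's pmem_persist(). */

        munmap(p, len);
        close(fd);
        return 0;
    }

The page-fault, page-table, and unmap costs that this path incurs in the default kernel are exactly the virtual memory operations the abstract identifies as the scalability bottleneck.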