Current hardware and application storage trends put immense pressure on the operating system's storage subsystem. On the hardware side, the market for storage devices has diversified to a multi-layer storage topology spanning multiple orders of magnitude in cost and performance. Above the file system, applications increasingly need to process small, random IO on vast data sets with low latency, high throughput, and simple crash consistency. File systems designed for a single storage layer cannot support all of these demands together.
We present Strata, a cross-media file system that leverages the strengths of one storage medium to compensate for the weaknesses of another. In doing so, Strata provides performance, capacity, and a simple, synchronous IO model all at once, while having a simpler design than that of file systems constrained by a single storage device. At its heart, Strata uses a log-structured approach with a novel split of responsibilities among user mode, kernel, and storage layers that separates the concerns of scalable, high-performance persistence from storage layer management. We quantify the performance benefits of Strata using a 3-layer storage hierarchy of emulated NVM, a flash-based SSD, and a high-density HDD. Strata has 20-30% better latency and throughput, across several unmodified applications, compared to file systems purpose-built for each layer, while providing synchronous and unified access to the entire storage hierarchy. Finally, Strata achieves up to 2.8x better throughput than a block-based 2-layer cache provided by Linux's logical volume manager.
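As a rough, hypothetical illustration of the split described above (synchronous writes absorbed by a fast per-application log, with data later digested down to larger, slower layers), the following Python sketch uses invented names and trivially simplified layers; it is not Strata's actual interface or implementation.

```python
# Illustrative sketch only: a user-level write path that appends to a fast
# per-application log (standing in for NVM) and a background "digest" step
# that migrates data to slower, larger layers. Names are hypothetical.

class CrossMediaStore:
    def __init__(self):
        self.log = []    # stand-in for an NVM-resident operation log
        self.ssd = {}    # stand-in for the SSD layer
        self.hdd = {}    # stand-in for the HDD layer

    def write(self, path, offset, data):
        # Synchronous IO model: the write is "durable" once appended to the log.
        self.log.append((path, offset, data))

    def digest(self, ssd_capacity=4):
        # Periodically move logged updates into the SSD layer, spilling
        # excess data to the HDD layer when the SSD layer is "full".
        while self.log:
            path, offset, data = self.log.pop(0)
            self.ssd[(path, offset)] = data
        while len(self.ssd) > ssd_capacity:
            victim = next(iter(self.ssd))          # naive eviction policy
            self.hdd[victim] = self.ssd.pop(victim)

store = CrossMediaStore()
store.write("/a", 0, b"hello")
store.digest()
```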
Neal, Ian; Zuo, Gefei; Shiple, Eric; Khan, Tanvir; Kwon, Youngjin; Peter, Simon; Kasikci, Baris. 19th USENIX Conference on File and Storage Technologies (FAST 21).
Persistent main memory (PM) dramatically improves IO performance. We find that this results in file systems on PM spending as much as 70% of the IO path performing file mapping (mapping file offsets to physical locations on storage media) on real workloads. However, even PM-optimized file systems perform file mapping based on decades-old assumptions. It is now critical to revisit file mapping for PM.
We explore the design space for PM file mapping by building and evaluating several file-mapping designs, including different data structures, caching strategies, and metadata and block allocation approaches, within the context of a PM-optimized file system. Based on our findings, we design HashFS, a hash-based file mapping approach. HashFS uses a single hash operation for all mapping and allocation operations, bypassing the file system cache and instead prefetching mappings via SIMD parallelism and caching translations explicitly. HashFS's resulting low latency provides superior performance compared to the alternatives. HashFS increases the throughput of YCSB on LevelDB by up to 45% over page-cached extent trees in the state-of-the-art Strata PM-optimized file system.
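To make the hash-based mapping idea concrete, here is a minimal illustrative sketch: a single hash of (inode, logical block) indexes a flat table, and an empty slot doubles as an allocation. The layout and names are assumptions for illustration, not HashFS's actual on-PM structure.

```python
# Minimal sketch of hash-based file mapping: one hash of (inode, logical block)
# locates the mapping entry; an empty slot doubles as an implicit allocation.
# Hypothetical layout, not HashFS's on-PM structure.
TABLE_SIZE = 1 << 20
table = [None] * TABLE_SIZE   # each slot: (inode, lblk, physical block index)

def lookup_or_allocate(inode: int, lblk: int) -> int:
    slot = hash((inode, lblk)) % TABLE_SIZE
    # Linear probing resolves collisions.
    while True:
        entry = table[slot]
        if entry is None:
            # Unmapped: "allocate" by claiming this slot; here the physical
            # block is just the slot index, standing in for a real allocator.
            table[slot] = (inode, lblk, slot)
            return slot
        if entry[0] == inode and entry[1] == lblk:
            return entry[2]            # already mapped
        slot = (slot + 1) % TABLE_SIZE

pb = lookup_or_allocate(inode=42, lblk=7)
```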
Anderson, Thomas; Canini, Marco; Kim, Jongyul; Kostic, Dejan; Kwon, Youngjin; Peter, Simon; Reda, Waleed; Schuh, Henry; Witchel, Emmett. 14th USENIX Symposium on Operating Systems Design and Implementation (OSDI 20).
The adoption of low latency persistent memory modules (PMMs) upends the long-established model of remote storage for distributed file systems. Instead, by colocating computation with PMM storage, we can provide applications with much higher IO performance, sub-second application failover, and strong consistency. To demonstrate this, we built the Assise distributed file system, based on a persistent, replicated coherence protocol that manages client-local PMM as a linearizable and crash-recoverable cache between applications and slower (and possibly remote) storage. Assise maximizes locality for all file IO by carrying out IO on process-local, socket-local, and client-local PMM whenever possible. Assise minimizes coherence overhead by maintaining consistency at IO operation granularity, rather than at fixed block sizes.
We compare Assise to Ceph/BlueStore, NFS, and Octopus on a cluster with Intel Optane DC PMMs and SSDs for common cloud applications and benchmarks, such as LevelDB, Postfix, and FileBench. We find that Assise improves write latency up to 22x, throughput up to 56x, fail-over time up to 103x, and scales up to 6x better than its counterparts, while providing stronger consistency semantics.
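A minimal sketch of the locality preference described above, serving each IO from the nearest persistent-memory tier that holds the data and falling back outward; the tier names and structure here are assumptions, not Assise's implementation.

```python
# Hypothetical sketch: serve a read from the nearest PMM tier that holds the
# block, falling back outward toward remote storage. Tier names are illustrative.
def read(block, tiers):
    # tiers: list of (name, cache_dict) ordered nearest-first, e.g.
    # process-local PMM, socket-local PMM, client-local PMM, remote storage.
    for name, cache in tiers:
        if block in cache:
            return name, cache[block]
    raise KeyError(block)

tiers = [("process-local", {1: b"a"}),
         ("socket-local", {2: b"b"}),
         ("client-local", {}),
         ("remote", {3: b"c"})]
print(read(3, tiers))   # falls back to remote storage
```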
Abulila, Ahmed; Mailthody, Vikram Sharma; Qureshi, Zaid; Huang, Jian; Kim, Nam Sung; Xiong, Jinjun; Hwu, Wen-mei. Proceedings of the Twenty-Fourth International Conference on Architectural Support for Programming Languages and Operating Systems.
Using flash-based solid state drives (SSDs) as main memory has been proposed as a practical solution towards scaling memory capacity for data-intensive applications. However, almost all existing approaches rely on the paging mechanism to move data between SSDs and host DRAM. This inevitably incurs significant performance overhead and extra I/O traffic. Thanks to the byte-addressability supported by the PCIe interconnect and the internal memory in SSD controllers, it is feasible to access SSDs at both byte and block granularity today. Exploiting the benefits of the SSD's byte-accessibility in today's memory-storage hierarchy is, however, challenging as it lacks systems support and abstractions for programs. In this paper, we present FlatFlash, an optimized unified memory-storage hierarchy, to efficiently use byte-addressable SSD as part of the main memory. We extend the virtual memory management to provide a unified memory interface so that programs can access data across SSD and DRAM at byte granularity seamlessly. We propose a lightweight, adaptive page promotion mechanism between SSD and DRAM to gain benefits from both the byte-addressable large SSD and fast DRAM concurrently and transparently, while avoiding unnecessary page movements. Furthermore, we propose an abstraction of byte-granular data persistence to exploit the persistent nature of SSDs, upon which we rethink the design primitives of crash consistency of several representative software systems that require data persistence, such as file systems and databases. Our evaluation with a variety of applications demonstrates that, compared to the current unified memory-storage systems, FlatFlash improves the performance for memory-intensive applications by up to 2.3x, reduces the tail latency for latency-critical applications by up to 2.8x, scales the throughput for transactional databases by up to 3.0x, and decreases the metadata persistence overhead for file systems by up to 18.9x. FlatFlash also improves the cost-effectiveness by up to 3.8x compared to DRAM-only systems, while enhancing the SSD lifetime significantly.
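The following is a minimal sketch of the adaptive page promotion idea described above: pages on the byte-addressable SSD are accessed in place until they become hot, then promoted to DRAM. The threshold and bookkeeping are illustrative assumptions, not FlatFlash's actual policy.

```python
# Illustrative sketch of adaptive page promotion between a byte-addressable
# SSD and DRAM: pages accessed more than a threshold are promoted to DRAM,
# others keep being accessed in place over the byte-addressable interface.
PROMOTE_THRESHOLD = 4   # hypothetical hotness threshold

class UnifiedMemory:
    def __init__(self, ssd_pages):
        self.ssd = dict(ssd_pages)   # page id -> contents (stand-in for SSD)
        self.dram = {}               # promoted (hot) pages
        self.hits = {}               # access counts for SSD-resident pages

    def access(self, page):
        if page in self.dram:
            return self.dram[page]                 # fast DRAM access
        data = self.ssd[page]                      # byte-granular SSD access
        self.hits[page] = self.hits.get(page, 0) + 1
        if self.hits[page] >= PROMOTE_THRESHOLD:
            self.dram[page] = self.ssd.pop(page)   # promote hot page to DRAM
        return data

mem = UnifiedMemory({0: b"cold", 1: b"hot"})
for _ in range(5):
    mem.access(1)          # page 1 becomes hot and is promoted to DRAM
```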
Campbell, C.; Mecca, N.; Duong, T.; Obeid, I.; Picone, J. IEEE Signal Processing in Medicine and Biology Symposium (SPMB). Obeid, Iyad; Selesnick, Ivan; Picone, Joseph (Eds.).
The goal of this work was to design a low-cost computing facility that can support the development of an open source digital pathology corpus containing 1M images [1]. A single image from a clinical-grade digital pathology scanner can range in size from hundreds of megabytes to five gigabytes. A 1M image database requires over a petabyte (PB) of disk space. To do meaningful work in this problem space requires a significant allocation of computing resources. The improvements and expansions to our HPC (high-performance computing) cluster, known as Neuronix [2], required to support working with digital pathology fall into two broad categories: computation and storage. To handle the increased computational burden and increase job throughput, we are using Slurm [3] as our scheduler and resource manager. For storage, we have designed and implemented a multi-layer filesystem architecture to distribute a filesystem across multiple machines. These enhancements, which are entirely based on open source software, have extended the capabilities of our cluster and increased its cost-effectiveness.
Slurm has numerous features that allow it to generalize to a number of different scenarios. Among the most notable is its support for GPU (graphics processing unit) scheduling. GPUs can offer a tremendous performance increase in machine learning applications [4], and Slurm's built-in mechanisms for handling them were a key factor in making this choice. Slurm has a general resource (GRES) mechanism that can be used to configure and enable support for resources beyond the ones provided by the traditional HPC scheduler (e.g., memory, wall-clock time), and GPUs are among the GRES types that Slurm supports [5]. In addition to tracking resources, Slurm strictly enforces resource allocations. This becomes very important as the computational demands of jobs increase: each job gets all the resources it needs and cannot take resources away from other jobs. It is a common practice among GPU-enabled frameworks to query the CUDA runtime library/drivers and iterate over the list of GPUs, attempting to establish a context on all of them. Slurm is able to affect the hardware discovery process of these jobs, which enables a number of them to run alongside each other, even if the GPUs are in exclusive-process mode.
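As a small illustration of that behavior: inside a Slurm job, GPU-enabled frameworks typically see only the devices the scheduler allocated, commonly exposed through the CUDA_VISIBLE_DEVICES environment variable. The snippet below simply reports what a job has been given; it is illustrative, not part of the Neuronix configuration.

```python
# Small illustration: within a Slurm job, the scheduler typically restricts the
# CUDA runtime to the allocated GPUs (commonly via CUDA_VISIBLE_DEVICES), so a
# framework that enumerates "all" GPUs sees only its own share.
import os

visible = os.environ.get("CUDA_VISIBLE_DEVICES")
if visible is None:
    print("No GPU restriction set; the CUDA runtime may see every device.")
else:
    print(f"This job may use GPU device id(s): {visible}")
```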
To store large quantities of digital pathology slides, we developed a robust, extensible distributed storage solution. We utilized a number of open source tools to create a single filesystem, which can be mounted by any machine on the network. At the lowest layer of abstraction are the hard drives, which were split into four 60-disk chassis using 8TB drives. To support these disks, we have two server units, each equipped with Intel Xeon CPUs and 128GB of RAM. At the filesystem level, we have implemented a multi-layer solution that: (1) connects the disks together into a single filesystem/mountpoint using ZFS (the Zettabyte File System) [6], and (2) connects filesystems on multiple machines together to form a single mountpoint using Gluster [7].

ZFS, initially developed by Sun Microsystems, provides disk-level awareness and a filesystem which takes advantage of that awareness to provide fault tolerance. At the filesystem level, ZFS protects against data corruption and the infamous RAID write-hole bug by implementing a journaling scheme (the ZFS intent log, or ZIL) and copy-on-write functionality. Each machine (1 controller + 2 disk chassis) has its own separate ZFS filesystem. Gluster, essentially a meta-filesystem, takes each of these and provides the means to connect them together over the network using distributed (similar to RAID 0, but without striping individual files) and mirrored (similar to RAID 1) configurations [8].
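For concreteness, the commands below sketch the kind of configuration this two-layer design implies: a ZFS pool per server, then a Gluster volume joining the servers' filesystems. The pool, volume, brick, and host names are hypothetical placeholders (not the cluster's real configuration), and the administrative commands are wrapped in Python only for illustration.

```python
# Hedged sketch of the two filesystem layers: a per-server ZFS pool, then a
# Gluster volume spanning the servers. All names are hypothetical placeholders.
import subprocess

# Layer 1: combine one chassis's disks into a single ZFS pool/filesystem.
subprocess.run(["zpool", "create", "tank",
                "raidz2", "/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"],
               check=True)

# Layer 2: join the per-server ZFS filesystems into one Gluster volume that any
# machine on the network can mount (distributed and/or mirrored layouts).
subprocess.run(["gluster", "volume", "create", "neuronix-vol", "replica", "2",
                "server1:/tank/brick", "server2:/tank/brick"],
               check=True)
subprocess.run(["gluster", "volume", "start", "neuronix-vol"], check=True)
```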
By implementing these improvements, it has been possible to expand the storage and computational power of the Neuronix cluster arbitrarily to support the most computationally intensive endeavors by scaling horizontally. We have greatly improved the scalability of the cluster while maintaining its excellent price/performance ratio [1].
Parallel I/O is an effective method to optimize data movement between memory and storage for many scientific applications. Poor performance of traditional disk-based file systems has led to the design of I/O libraries which take advantage of faster memory layers, such as on-node memory, present in high-performance computing (HPC) systems. By allowing caching and prefetching of data for applications alternating computation and I/O phases, a faster memory layer also provides opportunities for hiding the latency of I/O phases by overlapping them with computation phases, a technique called asynchronous I/O. Since asynchronous parallel I/O in HPC systems is still in the initial stages of development, there has not been a systematic study of the factors affecting its performance. In this paper, we perform a systematic study of various factors affecting the performance and efficacy of asynchronous I/O, we develop a performance model to estimate the aggregate I/O bandwidth achievable by iterative applications using synchronous and asynchronous I/O based on past observations, and we evaluate the performance of the recently developed asynchronous I/O feature of a parallel I/O library (HDF5) using benchmarks and real-world science applications. Our study covers parallel file systems on two large-scale HPC systems: Summit and Cori, the former with GPFS storage and the latter with a Lustre parallel file system.
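As a back-of-the-envelope illustration (a simplified assumption of our own, not the paper's actual performance model): for an iterative application with per-iteration compute time T_c, I/O volume D, and I/O time T_io, fully overlapped asynchronous I/O can at best hide the shorter of the two phases:

```latex
% Simplified overlap model (illustrative assumption, not the paper's model):
% per-iteration time and effective I/O bandwidth with synchronous vs.
% fully overlapped asynchronous I/O.
\begin{align*}
  T_{\text{sync}}  &= T_c + T_{io}, &
  B_{\text{sync}}  &= \frac{D}{T_c + T_{io}},\\
  T_{\text{async}} &\approx \max(T_c,\; T_{io}), &
  B_{\text{async}} &\approx \frac{D}{\max(T_c,\; T_{io})}.
\end{align*}
```

Under this simplification, asynchronous I/O approaches the compute-only iteration time when T_c is at least T_io; otherwise the I/O phase remains the bottleneck.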
Kwon, Youngjin; Fingler, Henrique; Hunt, Tyler; Peter, Simon; Witchel, Emmett; Anderson, Thomas. "Strata: A Cross Media File System." Proceedings of the 26th Symposium on Operating Systems Principles. https://doi.org/10.1145/3132747.3132770. https://par.nsf.gov/biblio/10188582.
@article{osti_10188582,
title = {Strata: A Cross Media File System},
url = {https://par.nsf.gov/biblio/10188582},
DOI = {10.1145/3132747.3132770},
journal = {Proceedings of the 26th Symposium on Operating Systems Principles},
author = {Kwon, Youngjin and Fingler, Henrique and Hunt, Tyler and Peter, Simon and Witchel, Emmett and Anderson, Thomas},
}