Copy-on-Abundant-Write for Nimble File System Clones
Making logical copies, or clones, of files and directories is critical to many real-world applications and workflows, including backups, virtual machines, and containers. An ideal clone implementation meets the following performance goals: (1) creating the clone has low latency; (2) reads are fast in all versions (i.e., spatial locality is always maintained, even after modifications); (3) writes are fast in all versions; (4) the overall system is space efficient. Implementing a clone operation that realizes all four properties, which we call a nimble clone, is a long-standing open problem. This article describes nimble clones in the Bε-tree File System (BetrFS), an open-source, full-path-indexed, and write-optimized file system. The key observation behind our work is that standard copy-on-write heuristics can be too coarse to be space efficient, or too fine-grained to preserve locality. On the other hand, a write-optimized key-value store, such as a Bε-tree or a log-structured merge-tree (LSM-tree), can decouple the logical application of updates from the granularity at which data is physically copied. In our write-optimized clone implementation, data sharing among clones is only broken when a clone has changed enough to warrant making a copy, a policy we call copy-on-abundant-write. We demonstrate that the algorithmic work …
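To make the policy concrete, here is a minimal sketch in Python, assuming a reference-counted node shared among clones and an illustrative COPY_THRESHOLD. It models only the copy-on-abundant-write idea; it is not BetrFS's Bε-tree implementation, and every name in it is hypothetical.

```python
# Illustrative sketch of copy-on-abundant-write (hypothetical model, not
# BetrFS code). Clones share a reference-counted node; writes are buffered
# logically, and the shared data is physically copied only once enough
# changes accumulate to justify breaking the sharing.

COPY_THRESHOLD = 64  # hypothetical cutoff for "abundant" writes

class Node:
    def __init__(self, data=None):
        self.data = dict(data or {})  # materialized key -> value mapping
        self.refcount = 1             # number of clones sharing this node

class Clone:
    def __init__(self, node):
        node.refcount += 1
        self.node = node
        self.pending = {}             # updates applied logically, not physically

    def write(self, key, value):
        if self.node.refcount == 1:   # sole owner: nothing shared to break
            self.node.data[key] = value
            return
        self.pending[key] = value
        if len(self.pending) >= COPY_THRESHOLD:
            # Copy-on-abundant-write: this clone changed enough that a
            # private copy is worth the space, so break the sharing now.
            self.node.refcount -= 1
            self.node = Node(self.node.data)  # the physical copy happens here
            self.node.data.update(self.pending)
            self.pending.clear()

    def read(self, key):
        # Buffered updates shadow the shared data.
        return self.pending.get(key, self.node.data.get(key))

# Usage: a few small writes stay buffered and cheap; the shared node is
# untouched until the threshold is crossed.
base = Node({"a": 1})
c = Clone(base)
c.write("a", 2)
assert base.data["a"] == 1 and c.read("a") == 2
```

The point of the policy is visible in miniature: small edits to a clone stay cheap and space efficient because sharing is preserved, while a heavily modified clone regains physical locality by materializing a private copy.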
- Publication Date:
- NSF-PAR ID: 10298724
- Journal Name: ACM Transactions on Storage
- Volume: 17
- Issue: 1
- Page Range or eLocation-ID: 1 to 27
- ISSN: 1553-3077
- Sponsoring Org: National Science Foundation
More Like this
- Code optimization is an intricate task that is getting more complex as computing systems evolve. Managing the program optimization process, including the implementation and evaluation of code variants, is tedious and inefficient, and errors are likely to be introduced in the process. Moreover, because each platform typically requires a different sequence of transformations to fully harness its computing power, the optimization process grows more complex as new platforms are adopted. To address these issues, systems and frameworks have been proposed to automate the code optimization process. They have not, however, been widely adopted, and are primarily used by experts with deep knowledge of the underlying architecture and compiler intricacies. This article describes the requirements that we believe are necessary for making automatic performance tuning more broadly used, especially in complex, long-lived high-performance computing applications. Besides discussing the limitations of current systems and strategies to overcome them, we describe the design of a system that is able to semi-automatically generate efficient platform-specific code. In the proposed system, code optimization is programmer-guided, separately from the application code, in an external file, in what we call optimization programming. The language used to program the optimization process is able to represent complex collections of transformations and, as a result, generate …
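The separation described in that abstract, with tuning decisions kept outside the application code, can be sketched as a toy autotuning loop. The `SPEC` parameter space, the stand-in kernel, and the timing driver below are all hypothetical; this is not the article's optimization-programming language, only an illustration of evaluating externally specified code variants.

```python
# Toy autotuning loop (illustrative only): an external "recipe" of candidate
# transformation parameters drives generation and timing of code variants,
# separately from the application source.
import itertools
import time

# Hypothetical external optimization spec: the parameter space to explore.
SPEC = {"tile": [16, 32, 64], "unroll": [1, 2, 4]}

def make_variant(tile, unroll):
    # Stand-in for a generated, transformed kernel; a real system would
    # emit and compile platform-specific code at this step.
    def kernel(n=1 << 20):
        s = 0
        for i in range(0, n, unroll * tile):
            s += i
        return s
    return kernel

best = None
for tile, unroll in itertools.product(SPEC["tile"], SPEC["unroll"]):
    kernel = make_variant(tile, unroll)
    t0 = time.perf_counter()
    kernel()
    dt = time.perf_counter() - t0
    if best is None or dt < best[0]:
        best = (dt, tile, unroll)
print(f"fastest variant: tile={best[1]}, unroll={best[2]} ({best[0] * 1e6:.0f} us)")
```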
- File systems must allocate space for files without knowing what will be added or removed in the future. Over the life of a file system, this may cause suboptimal file placement decisions which eventually lead to slower performance, or aging. Traditional file systems employ heuristics, such as collocating related files and data blocks, to avoid aging, and many file system implementors treat aging as a solved problem. However, this paper describes realistic as well as synthetic workloads that can cause these heuristics to fail, inducing large performance declines due to aging. For example, on ext4 and ZFS, a few hundred git pull operations can reduce read performance by a factor of 2; performing a thousand pulls can reduce performance by up to a factor of 30. We further present microbenchmarks demonstrating that common placement strategies are extremely sensitive to file-creation order; varying the creation order of a few thousand small files in a real-world directory structure can slow down reads by 15–175×, depending on the file system. We argue that these slowdowns are caused by poor layout. We demonstrate a correlation between read performance of a directory scan and the …
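The creation-order sensitivity described above can be probed with a short microbenchmark. A hedged sketch, not the paper's actual workload: it creates the same set of small files in sorted versus shuffled order and times a full scan of each directory. (On a warm page cache the effect is muted; the paper measures cold-cache reads on aged, real file systems.)

```python
# Illustrative creation-order microbenchmark (paths, counts, and sizes are
# hypothetical, not the paper's exact workload).
import os
import random
import shutil
import time

def create_files(root, names, order):
    os.makedirs(root, exist_ok=True)
    for name in order(names):
        with open(os.path.join(root, name), "wb") as f:
            f.write(os.urandom(4096))  # one small file, ~4 KiB

def scan(root):
    # Time a full read of every file in the directory, in name order.
    t0 = time.perf_counter()
    for entry in sorted(os.listdir(root)):
        with open(os.path.join(root, entry), "rb") as f:
            f.read()
    return time.perf_counter() - t0

names = [f"f{i:05d}" for i in range(2000)]
create_files("ordered", names, lambda ns: ns)                       # sorted order
create_files("shuffled", names, lambda ns: random.sample(ns, len(ns)))
print("ordered scan:  %.3fs" % scan("ordered"))
print("shuffled scan: %.3fs" % scan("shuffled"))
shutil.rmtree("ordered")
shutil.rmtree("shuffled")
```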
- Obeid, Iyad; Selesnick, Ivan; Picone, Joseph (Ed.) The goal of this work was to design a low-cost computing facility that can support the development of an open-source digital pathology corpus containing 1M images [1]. A single image from a clinical-grade digital pathology scanner can range in size from hundreds of megabytes to five gigabytes. A 1M-image database requires over a petabyte (PB) of disk space. To do meaningful work in this problem space requires a significant allocation of computing resources. The improvements and expansions to our HPC (high-performance computing) cluster, known as Neuronix [2], required to support working with digital pathology fall into two broad categories: computation and storage. To handle the increased computational burden and increase job throughput, we are using Slurm [3] as our scheduler and resource manager. For storage, we have designed and implemented a multi-layer filesystem architecture to distribute a filesystem across multiple machines. These enhancements, which are entirely based on open-source software, have extended the capabilities of our cluster and increased its cost-effectiveness. Slurm has numerous features that allow it to generalize to a number of different scenarios. Among the most notable is its support for GPU (graphics processing unit) scheduling. GPUs can offer a tremendous performance increase in …
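As a back-of-envelope check on the storage estimate above (illustrative arithmetic only, using the per-image size range quoted in the abstract):

```python
# Rough storage sizing for a 1M-image corpus; bounds come from the quoted
# per-image range (hundreds of MB up to 5 GB), not from measured data.
images = 1_000_000
low, high = 300e6, 5e9  # bytes per image
print(f"low estimate:  {images * low / 1e15:.1f} PB")   # ~0.3 PB
print(f"high estimate: {images * high / 1e15:.1f} PB")  # ~5.0 PB
```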
- Container storage commonly relies on overlay file systems to interpose read-only container images upon backing file systems. While transparent to and compatible with most existing backing file systems, the overlay approach imposes nontrivial I/O overhead on containerized applications, especially for writes: to write a file originating from a read-only container image, the whole file must first be copied to a separate, writable storage layer, resulting in long write latency and inefficient use of container storage. In this paper, we present BAOverlay, a lightweight, block-accessible overlay file system: equipped with a new block-accessibility attribute, BAOverlay not only exploits the benefit of an asynchronous copy-on-write mechanism for fast file updates but also enables a new file format for efficient use of container storage space. We have developed a prototype of BAOverlay upon Linux Ext4. Our evaluation with both micro-benchmarks and real-world applications demonstrates the effectiveness of BAOverlay, with improved write performance and on-demand container storage usage.
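To illustrate the block-accessibility idea, here is a minimal Python sketch of block-granular copy-up for an overlay file. It is an illustration of the mechanism the abstract describes, not BAOverlay's actual format or code; `BLOCK`, `OverlayFile`, and the in-memory layers are assumptions for the example.

```python
# Block-granular copy-on-write for an overlay file (illustrative sketch).
# Only the blocks a write touches are copied up to the writable layer;
# reads fall through to the read-only image for unmodified blocks.

BLOCK = 4096  # hypothetical block size

class OverlayFile:
    def __init__(self, lower: bytes):
        self.lower = lower  # read-only image data
        self.upper = {}     # block index -> writable copy of that block

    def _block(self, i: int) -> bytearray:
        if i not in self.upper:  # copy up one block, not the whole file
            lo = self.lower[i * BLOCK:(i + 1) * BLOCK]
            self.upper[i] = bytearray(lo.ljust(BLOCK, b"\0"))
        return self.upper[i]

    def write(self, off: int, data: bytes):
        while data:
            i, pos = divmod(off, BLOCK)
            n = min(BLOCK - pos, len(data))
            self._block(i)[pos:pos + n] = data[:n]
            off, data = off + n, data[n:]

    def read(self, off: int, n: int) -> bytes:
        out = bytearray()
        while n:
            i, pos = divmod(off, BLOCK)
            m = min(BLOCK - pos, n)
            src = self.upper.get(i) or \
                self.lower[i * BLOCK:(i + 1) * BLOCK].ljust(BLOCK, b"\0")
            out += src[pos:pos + m]
            off, n = off + m, n - m
        return bytes(out)

# Usage: one small write copies up a single block; the rest of the file is
# still served from the read-only lower layer.
img = bytes(range(256)) * 64  # 16 KiB "image", four blocks
f = OverlayFile(img)
f.write(5, b"hello")
assert f.read(5, 5) == b"hello"
assert f.read(BLOCK, 4) == img[BLOCK:BLOCK + 4]
assert len(f.upper) == 1
```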