- Award ID(s):
- 1816577
- PAR ID:
- 10104370
- Date Published:
- Journal Name:
- Proceedings of the 33rd ACM International Conference on Supercomputing - ICS '19
- Page Range / eLocation ID:
- 392 to 402
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
Many applications are increasingly becoming I/O-bound. To improve scalability, analytical models of parallel I/O performance are often consulted to determine possible I/O optimizations. However, I/O performance modeling has predominantly focused on applications that directly issue I/O requests to a parallel file system or a local storage device. These I/O models are not directly usable by applications that access data through standardized I/O libraries, such as HDF5, FITS, and NetCDF, because a single I/O request to an object can trigger a cascade of I/O operations to different storage blocks. The I/O performance characteristics of applications that rely on these libraries is a complex function of the underlying data storage model, user-configurable parameters and object-level access patterns. As a consequence, I/O optimization is predominantly an ad-hoc process that is performed by application developers, who are often domain scientists with limited desire to delve into nuances of the storage hierarchy of modern computers.This paper presents an analytical cost model to predict the end-to-end execution time of applications that perform I/O through established array management libraries. The paper focuses on the HDF5 and Zarr array libraries, as examples of I/O libraries with radically different storage models: HDF5 stores every object in one file, while Zarr creates multiple files to store different objects. We find that accessing array objects via these I/O libraries introduces new overheads and optimizations. Specifically, in addition to I/O time, it is crucial to model the cost of transforming data to a particular storage layout (memory copy cost), as well as model the benefit of accessing a software cache. We evaluate the model on real applications that process observations (neuroscience) and simulation results (plasma physics). The evaluation on three HPC clusters reveals that I/O accounts for as little as 10% of the execution time in some cases, and hence models that only focus on I/O performance cannot accurately capture the performance of applications that use standard array storage libraries. In parallel experiments, our model correctly predicts the fastest storage library between HDF5 and Zarr 94% of the time, in contrast with 70% of the time for a cutting-edge I/O model.more » « less
-
Parallel I/O is an effective method to optimize data movement between memory and storage for many scientific applications. Poor performance of traditional disk-based file systems has led to the design of I/O libraries which take advantage of faster memory layers, such as on-node memory, present in high-performance computing (HPC) systems. By allowing caching and prefetching of data for applications alternating computation and I/O phases, a faster memory layer also provides opportunities for hiding the latency of I/O phases by overlapping them with computation phases, a technique called asynchronous I/O. Since asynchronous parallel I/O in HPC systems is still in the initial stages of development, there hasn't been a systematic study of the factors affecting its performance.In this paper, we perform a systematic study of various factors affecting the performance and efficacy of asynchronous I/O, we develop a performance model to estimate the aggregate I/O bandwidth achievable by iterative applications using synchronous and asynchronous I/O based on past observations, and we evaluate the performance of the recently developed asynchronous I/O feature of a parallel I/O library (HDF5) using benchmarks and real-world science applications. Our study covers parallel file systems on two large-scale HPC systems: Summit and Cori, the former with a GPFS storage and the latter with a Lustre parallel file system.more » « less
-
The imbalanced I/O load on large parallel file systems affects the parallel I/O performance of high-performance computing (HPC) applications. One of the main reasons for I/O imbalances is the lack of a global view of system-wide resource consumption. While approaches to address the problem already exist, the diversity of HPC workloads combined with different file striping patterns prevents widespread adoption of these approaches. In addition, load-balancing techniques should be transparent to client applications. To address these issues, we propose
Tarazu , an end-to-end control plane where clients transparently and adaptively write to a set of selected I/O servers to achieve balanced data placement. Our control plane leverages real-time load statistics for global data placement on distributed storage servers, while our design model employs trace-based optimization techniques to minimize latency for I/O load requests between clients and servers and to handle multiple striping patterns in files. We evaluate our proposed system on an experimental cluster for two common use cases: the synthetic I/O benchmark IOR and the scientific application I/O kernel HACC-I/O. We also use a discrete-time simulator with real HPC application traces from emerging workloads running on the Summit supercomputer to validate the effectiveness and scalability ofTarazu in large-scale storage environments. The results show improvements in load balancing and read performance of up to 33% and 43%, respectively, compared to the state-of-the-art. -
Array management libraries, such as HDF5, Zarr, etc., depend on a complex software stack that consists of parallel I/O middleware (MPI-IO), POSIX-IO, and file systems. Components in the stack are interdependent, such that effort in tuning the parameters in these software libraries for optimal performance is non-trivial. On the other hand, it is challenging to choose an array management library based on the array configuration and access patterns. In this poster, we investigate the performance aspect of two array management libraries, i.e., HDF5 and Zarr, in the context of a neuroscience use case. We highlight the performance variability of HDF5 and Zarr in our preliminary results and discuss potential optimization strategies.more » « less
-
We present FusionFS, a direct-access firmware-level in-storage filesystem that exploits the near-storage computational capability for fast I/O and data processing, consequently reducing I/O bottlenecks. In FusionFS, we introduce a new abstraction, CISCOps, that combines multiple I/O and data processing operations into one fused operation and offloaded for near-storage processing. By offloading, CISCOps significantly reduces dominant I/O overheads such as system calls, data movement, communication, and other software overheads. Further, to enhance the use of CISCOps, we introduce MicroTx for fine-grained crash consistency and fast (automatic) recovery of I/O and data processing operations. We also explore scheduling techniques to ensure fair and efficient use of in-storage compute and memory resources across tenants. Evaluation of FusionFS against the state-of-the-art user-level, kernel-level, and firmware-level file systems using microbenchmarks, macrobenchmarks, and real-world applications shows up to 6.12X, 5.09X and 2.07X performance gains, and 2.65X faster recovery for applications.more » « less