skip to main content


Title: Predicting and Comparing the Performance of Array Management Libraries
Many applications are increasingly becoming I/O-bound. To improve scalability, analytical models of parallel I/O performance are often consulted to determine possible I/O optimizations. However, I/O performance modeling has predominantly focused on applications that directly issue I/O requests to a parallel file system or a local storage device. These I/O models are not directly usable by applications that access data through standardized I/O libraries, such as HDF5, FITS, and NetCDF, because a single I/O request to an object can trigger a cascade of I/O operations to different storage blocks. The I/O performance characteristics of applications that rely on these libraries is a complex function of the underlying data storage model, user-configurable parameters and object-level access patterns. As a consequence, I/O optimization is predominantly an ad-hoc process that is performed by application developers, who are often domain scientists with limited desire to delve into nuances of the storage hierarchy of modern computers.This paper presents an analytical cost model to predict the end-to-end execution time of applications that perform I/O through established array management libraries. The paper focuses on the HDF5 and Zarr array libraries, as examples of I/O libraries with radically different storage models: HDF5 stores every object in one file, while Zarr creates multiple files to store different objects. We find that accessing array objects via these I/O libraries introduces new overheads and optimizations. Specifically, in addition to I/O time, it is crucial to model the cost of transforming data to a particular storage layout (memory copy cost), as well as model the benefit of accessing a software cache. We evaluate the model on real applications that process observations (neuroscience) and simulation results (plasma physics). The evaluation on three HPC clusters reveals that I/O accounts for as little as 10% of the execution time in some cases, and hence models that only focus on I/O performance cannot accurately capture the performance of applications that use standard array storage libraries. In parallel experiments, our model correctly predicts the fastest storage library between HDF5 and Zarr 94% of the time, in contrast with 70% of the time for a cutting-edge I/O model.  more » « less
Award ID(s):
1816577
NSF-PAR ID:
10177969
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)
Page Range / eLocation ID:
906 to 915
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. Access libraries such as ROOT[1] and HDF5[2] allow users to interact with datasets using high level abstractions, like coordinate systems and associated slicing operations. Unfortunately, the implementations of access libraries are based on outdated assumptions about storage systems interfaces and are generally unable to fully benefit from modern fast storage devices. For example, access libraries often implement buffering and data layout that assume that large, single-threaded sequential access patterns are causing less overall latency than small parallel random access: while this is true for spinning media, it is not true for flash media. The situation is getting worse with rapidly evolving storage devices such as non-volatile memory and ever larger datasets. This project explores distributed dataset mapping infrastructures that can integrate and scale out existing access libraries using Ceph’s extensible object model, avoiding re-implementation or even modifications of these access libraries as much as possible. These programmable storage extensions coupled with our distributed dataset mapping techniques enable: 1) access library operations to be offloaded to storage system servers, 2) the independent evolution of access libraries and storage systems and 3) fully leveraging of the existing load balancing, elasticity, and failure management of distributed storage systems like Ceph. They also create more opportunities to conduct storage server-local optimizations specific to storage servers. For example, storage servers might include local key/value stores combined with chunk stores that require different optimizations than a local file system. As storage servers evolve to support new storage devices like non-volatile memory, these server-local optimizations can be implemented while minimizing disruptions to applications. We will report progress on the means by which distributed dataset mapping can be abstracted over particular access libraries, including access libraries for ROOT data, and how we address some of the challenges revolving around data partitioning and composability of access operations. 
    more » « less
  2. Lossy compression is one of the most efficient solutions to reduce storage overhead and improve I/O performance for HPC applications. However, existing parallel I/O libraries cannot fully utilize lossy compression to accelerate parallel write due to the lack of deep understanding on compression-write performance. To this end, we propose to deeply integrate predictive lossy compression with HDF5 to significantly improve the parallel-write performance. Specifically, we propose analytical models to predict the time of compression and parallel write before the actual compression to enable compression-write overlapping. We also introduce an extra space in the process to handle possible data overflows resulting from prediction uncertainty in compression ratios. Moreover, we propose an optimization to reorder the compression tasks to increase the overlapping efficiency. Experiments with up to 4,096 cores from Summit show that our solution improves the write performance by up to 4.5× and 2.9× over the non-compression and lossy compression solutions, respectively, with only 1.5% storage overhead (compared to original data) on two real-world HPC applications. 
    more » « less
  3. Scientific data analysis pipelines face scalability bottlenecks when processing massive datasets that consist of millions of small files. Such datasets commonly arise in domains as diverse as detecting supernovae and post-processing computational fluid dynamics simulations. Furthermore, applications often use inference frameworks such as TensorFlow and PyTorch whose naive I/O methods exacerbate I/O bottlenecks. One solution is to use scientific file formats, such as HDF5 and FITS, to organize small arrays in one big file. However, storing everything in one file does not fully leverage the heterogeneous data storage capabilities of modern clusters. This paper presents Henosis, a system that intercepts data accesses inside the HDF5 library and transparently redirects I/O to the in-memory Redis object store or the disk-based TileDB array store. During this process, Henosis consolidates small arrays into bigger chunks and intelligently places them in data stores. A critical research aspect of Henosis is that it formulates object consolidation and data placement as a single optimization problem. Henosis carefully constructs a graph to capture the I/O activity of a workload and produces an initial solution to the optimization problem using graph partitioning. Henosis then refines the solution using a hill-climbing algorithm which migrates arrays between data stores to minimize I/O cost. The evaluation on two real scientific data analysis pipelines shows that consolidation with Henosis makes I/O 300× faster than directly reading small arrays from TileDB and 3.5× faster than workload-oblivious consolidation methods. Moreover, jointly optimizing consolidation and placement in Henosis makes I/O 1.7× faster than strategies that perform consolidation and placement independently. 
    more » « less
  4. Parallel I/O is an effective method to optimize data movement between memory and storage for many scientific applications. Poor performance of traditional disk-based file systems has led to the design of I/O libraries which take advantage of faster memory layers, such as on-node memory, present in high-performance computing (HPC) systems. By allowing caching and prefetching of data for applications alternating computation and I/O phases, a faster memory layer also provides opportunities for hiding the latency of I/O phases by overlapping them with computation phases, a technique called asynchronous I/O. Since asynchronous parallel I/O in HPC systems is still in the initial stages of development, there hasn't been a systematic study of the factors affecting its performance.In this paper, we perform a systematic study of various factors affecting the performance and efficacy of asynchronous I/O, we develop a performance model to estimate the aggregate I/O bandwidth achievable by iterative applications using synchronous and asynchronous I/O based on past observations, and we evaluate the performance of the recently developed asynchronous I/O feature of a parallel I/O library (HDF5) using benchmarks and real-world science applications. Our study covers parallel file systems on two large-scale HPC systems: Summit and Cori, the former with a GPFS storage and the latter with a Lustre parallel file system. 
    more » « less
  5. Array management libraries, such as HDF5, Zarr, etc., depend on a complex software stack that consists of parallel I/O middleware (MPI-IO), POSIX-IO, and file systems. Components in the stack are interdependent, such that effort in tuning the parameters in these software libraries for optimal performance is non-trivial. On the other hand, it is challenging to choose an array management library based on the array configuration and access patterns. In this poster, we investigate the performance aspect of two array management libraries, i.e., HDF5 and Zarr, in the context of a neuroscience use case. We highlight the performance variability of HDF5 and Zarr in our preliminary results and discuss potential optimization strategies. 
    more » « less