Parallel I/O on Compressed Data Files: Semantics, Algorithms, and Performance Evaluation

Singh, Siddhesh Pratap; Gabriel, Edgar


Many scientific applications operate on data sets that span hundreds of gigabytes or even terabytes. Large data sets often use compression to reduce file size. Yet as of today, parallel I/O libraries do not support reading and writing compressed files, necessitating either expensive sequential compression/decompression operations before/after the simulation, or omitting advanced features of parallel I/O libraries, such as collective I/O operations. This paper introduces parallel I/O on compressed data files and discusses the key challenges, requirements, and solutions for supporting compressed data files in MPI I/O, as well as limitations on some MPI I/O operations when using compressed data files. The paper details the handling of individual read and write operations on compressed data files, and presents an extension to the two-phase collective I/O algorithm to support data compression. The paper further presents and evaluates an implementation based on the Snappy compression library and the OMPIO parallel I/O framework. The performance evaluation using multiple data sets demonstrates significant performance benefits when using data compression on a parallel BeeGFS file system.

Award ID(s): 1663887

NSF-PAR ID: 10159546

Author(s) / Creator(s): Singh, Siddhesh Pratap; Gabriel, Edgar

Date Published: 2020-05-01

Journal Name: 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID)

Page Range / eLocation ID: 192-201

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Conference Paper:
The DOI is not currently available.
