Understanding and Finding Crash-Consistency Bugs in Parallel File Systems

Sun, Jinghan; Huang, Jian; Snir, Marc


Parallel file systems (PFSes) and parallel I/O libraries have been the backbone of high-performance computing (HPC) infrastructures for decades. However, their crash-consistency bugs have not been extensively studied, and tools for finding or testing for them are lacking. In this paper, we first conduct a thorough bug study of popular PFSes, such as BeeGFS and OrangeFS, with a cross-stack approach that covers the HPC I/O library, the PFS, and their interactions with local file systems. The study results drive our design of a scalable testing framework named PFSCHECK. PFSCHECK is easy to use: it automatically generates test cases for triggering potential crash-consistency bugs and traces essential file operations with low performance overhead. PFSCHECK also scales to large HPC clusters, as it exploits parallelism to facilitate the verification of persistent storage states.

Award ID(s): 1850317

PAR ID: 10219689

Author(s) / Creator(s): Sun, Jinghan; Huang, Jian; Snir, Marc

Date Published: 2020-07-13

Journal Name: Proceedings of the 12th USENIX Workshop on Hot Topics in Storage and File Systems (HotStorage '20)

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper: The DOI is not currently available.
