

Title: A Study of Failure Recovery and Logging of High-Performance Parallel File Systems
Large-scale parallel file systems (PFSs) play an essential role in high-performance computing (HPC). However, despite their importance, their reliability is much less studied or understood compared with that of local storage systems or cloud storage systems. Recent failure incidents at real HPC centers have exposed latent defects in PFS clusters as well as the urgent need for a systematic analysis. To address the challenge, we perform a study of the failure recovery and logging mechanisms of PFSs in this article. First, to trigger the failure recovery and logging operations of the target PFS, we introduce a black-box fault injection tool called PFault, which is transparent to PFSs and easy to deploy in practice. PFault emulates the failure state of individual storage nodes in the PFS based on a set of pre-defined fault models and enables systematic examination of PFS behavior under faults. Next, we apply PFault to study two widely used PFSs: Lustre and BeeGFS. Our analysis reveals the unique failure recovery and logging patterns of the target PFSs and identifies multiple cases where the PFSs are imperfect in terms of failure handling. For example, Lustre includes a recovery component called LFSCK to detect and fix PFS-level inconsistencies, but we find that LFSCK itself may hang or trigger kernel panics when scanning a corrupted Lustre. Even after the recovery attempt of LFSCK, subsequent workloads applied to Lustre may still behave abnormally (e.g., hang or report I/O errors). Similar issues have also been observed in BeeGFS and its recovery component BeeGFS-FSCK. We analyze in depth the root causes of the observed abnormal symptoms, which has led to a new patch set to be merged into the upcoming Lustre release. In addition, we characterize in detail the extensive logs generated in the experiments and identify the unique patterns and limitations of PFSs in terms of failure logging. We hope this study and the resulting tool and dataset can facilitate follow-up research in the communities and help improve PFSs for reliable high-performance computing.
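The workflow summarized above can be illustrated with a short sketch. The following Python pseudocode is a minimal, hypothetical harness in the spirit of the study described in the abstract; it is not PFault's actual interface, and all helper names, fault-model names, node names, and commands shown are assumptions for illustration only.

```python
# Hypothetical sketch of a black-box fault-injection study loop in the spirit
# of PFault; these helpers are placeholders, not part of the real tool.
import subprocess

FAULT_MODELS = ["whole_device_failure", "global_inconsistency", "network_partition"]  # assumed names
STORAGE_NODES = ["oss1", "oss2", "mds1"]  # assumed PFS server nodes

def emulate_fault(node: str, model: str) -> None:
    """Emulate the failure state of one storage node (placeholder logic)."""
    print(f"injecting {model} on {node}")
    # e.g., detach the backing device, corrupt on-disk state, or drop traffic

def run_checker() -> int:
    """Run the PFS-level checker (e.g., LFSCK for Lustre) and return its exit code."""
    return subprocess.call(["ssh", "mds1", "lctl", "lfsck_start", "-A"])

def run_workload() -> int:
    """Apply a post-recovery workload and record whether it hangs or errors."""
    return subprocess.call(["mpirun", "-np", "8", "ior", "-a", "POSIX"])

def collect_logs(node: str) -> str:
    """Gather server-side logs for later characterization."""
    return subprocess.check_output(["ssh", node, "dmesg"]).decode()

for model in FAULT_MODELS:
    for node in STORAGE_NODES:
        emulate_fault(node, model)
        checker_rc = run_checker()     # does recovery itself hang or crash?
        workload_rc = run_workload()   # does the PFS still serve I/O correctly?
        logs = {n: collect_logs(n) for n in STORAGE_NODES}
        # classify (checker_rc, workload_rc, logs) into failure/recovery patterns
```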
Award ID(s):
1910747 1853714 1717630 1943204 1910727 1718336
NSF-PAR ID:
10336970
Author(s) / Creator(s):
Date Published:
Journal Name:
ACM Transactions on Storage
Volume:
18
Issue:
2
ISSN:
1553-3077
Page Range / eLocation ID:
1 to 44
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1.
    Parallel filesystems (PFSs) are one of the most critical high-availability components of High Performance Computing (HPC) systems. Most HPC workloads depend on the availability of a POSIX-compliant parallel filesystem that provides a globally consistent view of data to all compute nodes of an HPC system. Because of this central role, failure or performance degradation events in the PFS can impact every user of an HPC resource. There is typically insufficient information available to users, and even to many HPC staff, to identify the causes of these PFS events, impeding the implementation of timely and targeted remedies to PFS issues. The relevant information is distributed across PFS servers; however, access to these servers is highly restricted due to the sensitive role they play in the operations of an HPC system. Additionally, the information is challenging to aggregate and interpret, relegating diagnosis and treatment of PFS issues to a select few experts with privileged system access. To democratize this information, we are developing an open-source and user-facing Parallel FileSystem TRacing and Analysis SErvice (PFSTRASE) that analyzes the requisite data to establish causal relationships between PFS activity and events detrimental to stability and performance. We are implementing the service for the open-source Lustre filesystem, which is the most commonly used PFS at large-scale HPC sites. Server loads for specific PFS I/O operations (IOPs) will be measured and aggregated by the service to automatically estimate an effective load generated by every client, job, and user. The infrastructure provides a real-time, user-accessible text-based interface and a publicly accessible web interface displaying both real-time and historical data. A simplified sketch of the per-client load-aggregation idea is shown after this item.
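To make the load-attribution idea above concrete, here is a minimal sketch, assuming a hypothetical record format in which each server-side sample names a client (or job/user), an operation type, and a count, and an assumed per-operation cost weight converts counts into an effective load. This illustrates only the aggregation step and is not PFSTRASE's actual data model.

```python
# Hypothetical aggregation of server-side per-operation counts into an
# effective per-client load; the record format and cost weights are assumptions.
from collections import defaultdict

# assumed relative server-side cost of each operation type
OP_COST = {"read": 1.0, "write": 1.5, "open": 0.2, "setattr": 0.5}

def effective_load(samples):
    """samples: iterable of (client_id, op_name, count) tuples."""
    load = defaultdict(float)
    for client, op, count in samples:
        load[client] += OP_COST.get(op, 0.1) * count
    return dict(load)

# toy usage
samples = [("nid0001", "write", 5000), ("nid0002", "read", 20000),
           ("nid0001", "open", 300)]
print(effective_load(samples))  # {'nid0001': 7560.0, 'nid0002': 20000.0}
```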
  2. Parallel File Systems (PFSs) are frequently deployed on leadership High Performance Computing (HPC) systems to ensure efficient I/O, persistent storage, and scalable performance. Emerging Deep Learning (DL) applications impose new I/O and storage requirements on HPC systems, with batched input of small random files. This requires PFSs to have commensurate features that can meet the needs of DL applications. BeeGFS is a recently emerging PFS that has attracted attention from both research and industry because of its performance, scalability, and ease of use. While emphasizing a systematic performance analysis of BeeGFS, in this paper we present the architectural and system features of BeeGFS and perform an experimental evaluation using cutting-edge I/O, metadata, and DL application benchmarks. In particular, we have utilized the AlexNet and ResNet-50 models for classification of the ImageNet dataset using the Livermore Big Artificial Neural Network Toolkit (LBANN) and an ImageNet data reader pipeline atop TensorFlow and Horovod. Through extensive performance characterization of BeeGFS, our study provides useful documentation on how to leverage BeeGFS for emerging DL applications.
  3.
    International Ocean Discovery Program (IODP) Expedition 372 combined two research topics, slow slip events (SSEs) on subduction faults (IODP Proposal 781A-Full) and actively deforming gas hydrate–bearing landslides (IODP Proposal 841-APL). Our study area on the Hikurangi margin, east of the coast of New Zealand, provided unique locations for addressing both research topics. SSEs at subduction zones are an enigmatic form of creeping fault behavior. They typically occur on subduction zones at depths beyond the capabilities of ocean floor drilling. However, at the northern Hikurangi subduction margin they are among the best-documented and shallowest on Earth. Here, SSEs may extend close to the trench, where clastic and pelagic sediments about 1.0–1.5 km thick overlie the subducting, seamount-studded Hikurangi Plateau. Geodetic data show that these SSEs recur about every 2 years and are associated with measurable seafloor displacement. The northern Hikurangi subduction margin thus provides an excellent setting to use IODP capabilities to discern the mechanisms behind slow slip fault behavior. Expedition 372 acquired logging-while-drilling (LWD) data at three subduction-focused sites to depths of 600, 650, and 750 meters below seafloor (mbsf), respectively. These include two sites (U1518 and U1519) above the plate interface fault that experiences SSEs and one site (U1520) in the subducting “inputs” sequence in the Hikurangi Trough, 15 km east of the plate boundary. Overall, we acquired excellent logging data and reached our target depths at two of these sites. Drilling and logging at Site U1520 did not reach the planned depth due to operational time constraints. These logging data will be augmented by coring and borehole observatories planned for IODP Expedition 375. Gas hydrates have long been suspected of being involved in seafloor failure; not much evidence, however, has been found to date for gas hydrate–related submarine landslides. Solid, ice-like gas hydrate in sediment pores is generally thought to increase seafloor strength, as confirmed by a number of laboratory measurements. Dissociation of gas hydrate to water and overpressured gas, on the other hand, may weaken and destabilize sediments, potentially causing submarine landslides. The Tuaheni Landslide Complex (TLC) on the Hikurangi margin shows evidence for active, creeping deformation. Intriguingly, the landward edge of creeping coincides with the pinch-out of the base of gas hydrate stability on the seafloor. We therefore hypothesized that gas hydrate may be linked to creep-like deformation and presented several hypotheses that may link gas hydrates to slow deformation. Alternatively, creeping may not be related to gas hydrates but instead be caused by repeated pressure pulses or linked to earthquake-related liquefaction. Expedition 372 comprised a coring and LWD program to test our landslide hypotheses. Due to weather-related downtime, the gas hydrate-related program was reduced, and we focused on a set of experiments at Site U1517 in the creeping part of the TLC. We conducted a successful LWD and coring program to 205 mbsf, the latter with almost complete recovery, through the TLC and gas hydrate stability zone, followed by temperature and pressure tool deployments. 
  4. Parallel I/O is an effective method to optimize data movement between memory and storage for many scientific applications. Poor performance of traditional disk-based file systems has led to the design of I/O libraries that take advantage of faster memory layers, such as on-node memory, present in high-performance computing (HPC) systems. By allowing caching and prefetching of data for applications that alternate between computation and I/O phases, a faster memory layer also provides opportunities for hiding the latency of I/O phases by overlapping them with computation phases, a technique called asynchronous I/O. Since asynchronous parallel I/O in HPC systems is still in the early stages of development, there has not been a systematic study of the factors affecting its performance. In this paper, we perform a systematic study of various factors affecting the performance and efficacy of asynchronous I/O, develop a performance model to estimate the aggregate I/O bandwidth achievable by iterative applications using synchronous and asynchronous I/O based on past observations, and evaluate the performance of the recently developed asynchronous I/O feature of a parallel I/O library (HDF5) using benchmarks and real-world science applications. Our study covers parallel file systems on two large-scale HPC systems, Summit and Cori, the former with GPFS storage and the latter with a Lustre parallel file system. A simplified back-of-the-envelope version of such a bandwidth model is sketched after this item.
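The bandwidth model mentioned in the previous item can be illustrated with a simplified back-of-the-envelope version (an assumption for illustration, not the paper's exact model): with synchronous I/O an iteration costs compute time plus I/O time, while with fully overlapped asynchronous I/O it costs roughly the maximum of the two, and the achievable aggregate bandwidth follows directly.

```python
# Simplified sketch of a sync-vs-async aggregate-bandwidth estimate for an
# iterative application; an illustrative model, not the paper's actual one.

def aggregate_bandwidth(data_per_iter_gb, t_compute_s, t_io_s, asynchronous):
    """Return the effective I/O bandwidth (GB/s) over one compute+I/O iteration."""
    if asynchronous:
        # I/O overlaps computation; only the longer of the two phases is exposed
        t_iter = max(t_compute_s, t_io_s)
    else:
        t_iter = t_compute_s + t_io_s
    return data_per_iter_gb / t_iter

print(aggregate_bandwidth(100, 30, 20, asynchronous=False))  # 2.0 GB/s
print(aggregate_bandwidth(100, 30, 20, asynchronous=True))   # ~3.33 GB/s
```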
  5.
    Progress in high-performance computing (HPC) systems has led to complex applications that stress the I/O subsystem by creating vast amounts of data. Lossy compression reduces data size considerably, but a single error renders lossy compressed data unusable. This sensitivity stems from the high information content per bit in compressed data and is a critical issue as soft errors that cause bit-flips have become increasingly commonplace in HPC systems. While many works have improved lossy compressor performance, few have sought to address this critical weakness. This paper presents ARC: Automated Resiliency for Compression. Given user-defined constraints on storage, throughput, and resiliency, ARC automatically determines the optimal error-correcting code (ECC) configuration before encoding data. We conduct an extensive fault injection study to fully understand the effects of soft errors on lossy compressed data and how best to protect it. We evaluate ARC's scalability, performance, resiliency, and ease of use. We find that, on a 40-core node, encoding and decoding achieve throughput of up to 3730 MB/s and 3602 MB/s, respectively. ARC also detects and corrects multi-bit errors with a tunable overhead in terms of storage and throughput. Finally, we demonstrate the ease of using ARC and how to take a system's failure rate into account when determining the constraints. A simplified sketch of such constraint-driven ECC selection is shown after this item.
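The constraint-driven selection that the previous item attributes to ARC can be sketched as follows, using a hypothetical candidate table: given user limits on storage overhead and minimum throughput, pick the feasible configuration with the strongest correction capability. The candidate schemes and numbers below are illustrative assumptions, not ARC's actual configurations.

```python
# Hypothetical ECC-configuration chooser in the spirit of ARC; candidate
# schemes, overheads, and throughputs are made-up illustrative values.

CANDIDATES = [
    # (name, storage_overhead_fraction, encode_mb_per_s, correctable_bits)
    ("none",           0.000, 4000,  0),
    ("hamming(72,64)", 0.125, 3700,  1),
    ("rs(255,239)",    0.067, 2900,  8),
    ("rs(255,223)",    0.143, 2400, 16),
]

def choose_ecc(max_overhead, min_throughput_mb_s):
    """Pick the candidate with the most correctable bits under both constraints."""
    feasible = [c for c in CANDIDATES
                if c[1] <= max_overhead and c[2] >= min_throughput_mb_s]
    if not feasible:
        raise ValueError("no ECC configuration satisfies the constraints")
    return max(feasible, key=lambda c: c[3])

print(choose_ecc(max_overhead=0.10, min_throughput_mb_s=2500))
# -> ('rs(255,239)', 0.067, 2900, 8)
```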