NSF PAR Search | NSF Public Access Repository

PROV-IO+: A Cross-Platform Provenance Framework for Scientific Data on HPC Systems

https://doi.org/10.1109/TPDS.2024.3374555

Han, Runzhou; Zheng, Mai; Byna, Suren; Tang, Houjun; Dong, Bin; Dai, Dong; Chen, Yong; Kim, Dongkyun; Hassoun, Joseph; Thorsley, David (May 2024, IEEE Transactions on Parallel and Distributed Systems)

Full Text Available
Analyzing Configuration Dependencies of DAX File Systems

Mahmud, Tabassum; Gatla, Om R.; Zhang, Duo; Love, Carson; Bumann, Ryan; Zheng, Mai (March 2023, 14th Annual Non-Volatile Memories Workshop)

Full Text Available
ConfD: Analyzing Configuration Dependencies of File Systems for Fun and Profit

Mahmud, Tabassum; Gatla, Om R.; Zhang, Duo; Love, Carson; Bumann, Ryan; Zheng, Mai (February 2023, 21st USENIX Conference on File and Storage Technologies (FAST))

Full Text Available
On the Reproducibility of Bugs in File-System Aware Storage Applications

Zhang, Duo; Mahmud, Tabassum; Gatla, Om Rameshwar; Han, Runzhou; Chen, Yong; Zheng, Mai (October 2022, Proceedings of the 16th IEEE International Conference on Networking, Architecture, and Storage)

Full Text Available
Understanding configuration dependencies of file systems

https://doi.org/10.1145/3538643.3539756

Mahmud, Tabassum; Zhang, Duo; Gatla, Om Rameshwar; Zheng, Mai (June 2022, Proceedings of the 14th ACM Workshop on Hot Topics in Storage and File Systems)

File systems have many configuration parameters. Such flexibility comes at the price of additional complexity which could lead to subtle configuration-related issues. To address the challenge, we study the potential configuration dependencies of a representative file system (i.e., Ext4), and identify a prevalent pattern called multi-level configuration dependencies. We build a static analyzer to extract the dependencies and leverage the information to address different configuration issues. Our preliminary prototype is able to extract 64 multi-level dependencies with a low false positive rate. Additionally, we can identify multiple configuration issues effectively.
Full Text Available
PROV-IO: An I/O-Centric Provenance Framework for Scientific Data on HPC Systems

https://doi.org/10.1145/3502181.3531477

Han, Runzhou; Byna, Suren; Tang, Houjun; Dong, Bin; Zheng, Mai (June 2022, Proceedings of the 31st International Symposium on High-Performance Parallel and Distributed Computing (HPDC))

Full Text Available
A Study of Failure Recovery and Logging of High-Performance Parallel File Systems

https://doi.org/10.1145/3483447

Han, Runzhou; Gatla, Om Rameshwar; Zheng, Mai; Cao, Jinrui; Zhang, Di; Dai, Dong; Chen, Yong; Cook, Jonathan (May 2022, ACM Transactions on Storage)

Large-scale parallel file systems (PFSs) play an essential role in high-performance computing (HPC). However, despite their importance, their reliability is much less studied or understood compared with that of local storage systems or cloud storage systems. Recent failure incidents at real HPC centers have exposed the latent defects in PFS clusters as well as the urgent need for a systematic analysis. To address the challenge, we perform a study of the failure recovery and logging mechanisms of PFSs in this article. First, to trigger the failure recovery and logging operations of the target PFS, we introduce a black-box fault injection tool called PFault, which is transparent to PFSs and easy to deploy in practice. PFault emulates the failure state of individual storage nodes in the PFS based on a set of pre-defined fault models and enables examining the PFS behavior under fault systematically. Next, we apply PFault to study two widely used PFSs: Lustre and BeeGFS. Our analysis reveals the unique failure recovery and logging patterns of the target PFSs and identifies multiple cases where the PFSs are imperfect in terms of failure handling. For example, Lustre includes a recovery component called LFSCK to detect and fix PFS-level inconsistencies, but we find that LFSCK itself may hang or trigger kernel panics when scanning a corrupted Lustre. Even after the recovery attempt of LFSCK, the subsequent workloads applied to Lustre may still behave abnormally (e.g., hang or report I/O errors). Similar issues have also been observed in BeeGFS and its recovery component BeeGFS-FSCK. We analyze the root causes of the abnormal symptoms observed in depth, which has led to a new patch set to be merged into the coming Lustre release. In addition, we characterize the extensive logs generated in the experiments in detail and identify the unique patterns and limitations of PFSs in terms of failure logging.
We hope this study and the resulting tool and dataset can facilitate follow-up research in the communities and help improve PFSs for reliable high-performance computing.
Full Text Available
SentiLog: Anomaly Detecting on Parallel File Systems via Log-based Sentiment Analysis

https://doi.org/10.1145/3465332.3470873

Zhang, Di; Dai, Dong; Han, Runzhou; Zheng, Mai (July 2021, Proceedings of The 13th ACM Workshop on Hot Topics in Storage and File Systems (HotStorage '21))
Full Text Available
Fingerprinting the Checker Policies of Parallel File Systems

https://doi.org/10.1109/PDSW51947.2020.00013

Han, Runzhou; Zhang, Duo; Zheng, Mai (November 2020, 2020 IEEE/ACM Fifth International Parallel Data Systems Workshop (PDSW))
Full Text Available
A Performance Study of Lustre File System Checker: Bottlenecks and Potentials

https://doi.org/10.1109/MSST.2019.00-20

Dai, Dong; Gatla, Om Rameshwar; Zheng, Mai (May 2019, 2019 35th Symposium on Mass Storage Systems and Technologies (MSST))

Full Text Available
