Title: Live Forensics for HPC Systems: A Case Study on Distributed Storage Systems
Large-scale high-performance computing systems frequently experience a wide range of failure modes, such as reliability failures (e.g., hangs or crashes) and resource overload-related failures (e.g., congestion collapse), which impact both systems and applications. Despite the adverse effects of these failures, current systems do not provide methodologies for proactively detecting, localizing, and diagnosing failures. We present Kaleidoscope, a near real-time failure detection and diagnosis framework consisting of hierarchical domain-guided machine learning models that identify the failing components, the corresponding failure mode, and the most likely cause of the failure in near real-time (within one minute of failure occurrence). Kaleidoscope has been deployed on the Blue Waters supercomputer and evaluated with more than two years of production telemetry data. Our evaluation shows that Kaleidoscope successfully localized 99.3% and pinpointed the root causes of 95.8% of 843 real-world production issues, with less than 0.01% runtime overhead.
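To make the hierarchical, domain-guided idea concrete, here is a minimal Python sketch (not the authors' implementation): it scores each storage component's latest telemetry against that component's own baseline, ranks components to localize the suspect, and maps the dominant metric to a coarse failure mode. The component names, metric names, and failure-mode mapping are assumptions chosen only for illustration.

    # Minimal sketch (not Kaleidoscope's actual models): score each component's
    # newest sample against its own baseline, rank components to localize the
    # suspect, and map the dominant metric to a coarse failure mode.
    from statistics import mean, pstdev

    TELEMETRY = {  # hypothetical component -> recent metric samples
        "ost-12": {"io_latency_ms": [3, 4, 3, 400], "load": [0.4, 0.5, 0.4, 0.6]},
        "mds-01": {"io_latency_ms": [2, 2, 3, 2], "load": [0.3, 0.3, 0.4, 0.3]},
    }

    def deviation(samples):
        """How far the latest sample sits from the component's own baseline."""
        baseline, latest = samples[:-1], samples[-1]
        sigma = pstdev(baseline) or 1e-9
        return abs(latest - mean(baseline)) / sigma

    def localize(telemetry):
        scores = {comp: max(deviation(s) for s in metrics.values())
                  for comp, metrics in telemetry.items()}
        suspect = max(scores, key=scores.get)
        worst = max(telemetry[suspect], key=lambda m: deviation(telemetry[suspect][m]))
        mode = "resource overload" if worst == "load" else "reliability failure"
        return suspect, mode, round(scores[suspect], 1)

    print(localize(TELEMETRY))  # -> suspect component, coarse failure mode, score

A real deployment would replace the per-component z-score with the paper's domain-guided models and feed the ranked suspects to an operator dashboard; the sketch only shows the detect-localize-diagnose flow.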
Award ID(s): 2029049
NSF-PAR ID: 10293041
Author(s) / Creator(s): ; ; ; ; ; ; ;
Date Published:
Journal Name: Proceedings of the International Conference for High-Performance Computing, Networking, Storage and Analysis (SC 2020)
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Partial failures occur frequently in cloud systems and can cause serious damage including inconsistency and data loss. Unfortunately, these failures are not well understood. Nor can they be effectively detected. In this paper, we first study 100 real-world partial failures from five mature systems to understand their characteristics. We find that these failures are caused by a variety of defects that require the unique conditions of the production environment to be triggered. Manually writing effective detectors to systematically detect such failures is both time-consuming and error-prone. We thus propose OmegaGen, a static analysis tool that automatically generates customized watchdogs for a given program by using a novel program reduction technique. We have successfully applied OmegaGen to six large distributed systems. In evaluating 22 real-world partial failure cases in these systems, the generated watchdogs can detect 20 cases with a median detection time of 4.2 seconds, and pinpoint the failure scope for 18 cases. The generated watchdogs also expose an unknown, confirmed partial failure bug in the latest version of ZooKeeper. 
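The watchdog idea can be illustrated with a small, hedged Python sketch (this is not OmegaGen's generated code): a reduced "mimic" of a production operation runs periodically under a deadline, and a hang or exception in the mimic flags a suspected partial failure. The probe operation and the timing values are assumptions chosen for illustration.

    # Hedged sketch of a generated watchdog: periodically run a reduced mimic
    # of a production operation and flag a partial failure if it errors out
    # or exceeds a deadline.
    import concurrent.futures, os, tempfile, time

    def mimic_log_write():
        """Reduced mimic of a hypothetical 'append to write-ahead log' operation."""
        with tempfile.NamedTemporaryFile(delete=False) as f:
            f.write(b"watchdog-probe")
            f.flush()
            os.fsync(f.fileno())
        os.unlink(f.name)

    def watchdog(probe, deadline_s=1.0, period_s=5.0, rounds=3):
        with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
            for _ in range(rounds):
                future = pool.submit(probe)
                try:
                    future.result(timeout=deadline_s)
                    print("probe ok")
                except concurrent.futures.TimeoutError:
                    print("suspected partial failure: probe hung past its deadline")
                except Exception as exc:
                    print(f"suspected partial failure: probe raised {exc!r}")
                time.sleep(period_s)

    watchdog(mimic_log_write, rounds=1, period_s=0)

OmegaGen's contribution is deriving such probes automatically from the program's own code via program reduction; the sketch hard-codes one probe by hand to show what the runtime check looks like.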
  2. Short time-to-localize and time-to-fix for production bugs are extremely important for any 24x7 service-oriented application (SOA). Debugging buggy behavior in deployed applications is hard, as it requires careful reproduction of a similar environment and workload. Prior approaches for automatically reproducing production failures do not scale to large SOA systems. Our key insight is that for many failures in SOA systems (e.g., many semantic and performance bugs), a failure can automatically be reproduced solely by relaying network packets to replicas of suspect services, an insight that we validated through a manual study of 16 real bugs across five different systems. This paper presents Parikshan, an application monitoring framework that leverages user-space virtualization and network proxy technologies to provide a sandbox “debug” environment. In this “debug” environment, developers are free to attach debuggers and analysis tools without impacting performance or correctness of the production environment. In comparison to existing monitoring solutions that can slow down production applications, Parikshan allows application monitoring at significantly lower overhead.
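As a rough illustration of the traffic-duplication idea (this is not Parikshan's implementation; the host names and ports are placeholders), a tiny TCP proxy can forward client bytes to the production service while mirroring the same bytes, best-effort, to a debug replica whose responses are ignored:

    # Illustrative sketch only: duplicate client traffic to a debug replica
    # without letting the replica affect the production path.
    import socket, threading

    LISTEN = ("0.0.0.0", 9000)        # where clients connect (placeholder)
    PROD = ("prod-svc", 8080)         # real production service (placeholder)
    DEBUG = ("debug-replica", 8080)   # sandboxed replica under a debugger (placeholder)

    def pump(src, dst):
        """Copy production responses back to the client."""
        try:
            while (data := src.recv(4096)):
                dst.sendall(data)
        except OSError:
            pass

    def handle(client):
        prod = socket.create_connection(PROD)
        try:
            mirror = socket.create_connection(DEBUG, timeout=0.2)
        except OSError:
            mirror = None             # debug sandbox unavailable: never block production
        threading.Thread(target=pump, args=(prod, client), daemon=True).start()
        while (data := client.recv(4096)):
            prod.sendall(data)        # production path, unchanged
            if mirror:
                try:
                    mirror.sendall(data)  # best-effort copy; replica responses are ignored
                except OSError:
                    mirror = None
        prod.close()

    server = socket.create_server(LISTEN)
    while True:
        conn, _ = server.accept()
        threading.Thread(target=handle, args=(conn,), daemon=True).start()

The key design point the sketch mirrors is asymmetry: the production connection stays on the critical path, while any slowness or failure in the debug replica is silently dropped.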
  3. Large-scale cloud services deploy hundreds of configuration changes to production systems daily. At such velocity, configuration changes have inevitably become prevalent causes of production failures. Existing misconfiguration detection and configuration validation techniques only check configuration values. These techniques cannot detect common types of failure-inducing configuration changes, such as those that cause code to fail or those that violate hidden constraints. We present ctests, a new type of tests for detecting failure-inducing configuration changes to prevent production failures. The idea behind ctests is simple: connecting production system configurations to software tests so that configuration changes can be tested in the context of code affected by the changes. So, ctests can detect configuration changes that expose dormant software bugs and diverse misconfigurations. We show how to generate ctests by transforming the many existing tests in mature systems. The key challenge that we address is the automated identification of test logic and oracles that can be reused in ctests. We generated thousands of ctests from the existing tests in five cloud systems. Our results show that ctests are effective in detecting failure-inducing configuration changes before deployment. We evaluate ctests on real-world failure-inducing configuration changes, injected misconfigurations, and deployed configuration files from public Docker images. Ctests effectively detect real-world failure-inducing configuration changes and misconfigurations in the deployed files.
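A minimal sketch of the ctest idea, using assumed file names and configuration keys: take an existing unit test's logic and oracle, but feed it the value from the production configuration file that is about to be deployed, so a bad value fails the test before rollout.

    # Hedged sketch: an existing test re-parameterized with a to-be-deployed
    # configuration value. Paths, keys, and the code under test are stand-ins.
    import json, unittest

    def load_production_config(path="deploy/prod-config.json"):   # hypothetical path
        with open(path) as f:
            return json.load(f)

    def make_buffer(size_bytes):
        """Stand-in for code under test: allocates an I/O buffer."""
        if size_bytes <= 0 or size_bytes > 2**31:
            raise ValueError("invalid buffer size")
        return bytearray(size_bytes)

    class BufferCTest(unittest.TestCase):
        def test_buffer_with_production_value(self):
            cfg = load_production_config()
            # Same logic and oracle as the original unit test, but exercised
            # with the value that is actually about to be deployed.
            buf = make_buffer(cfg["io.buffer.size"])               # hypothetical key
            self.assertEqual(len(buf), cfg["io.buffer.size"])

    if __name__ == "__main__":
        unittest.main()

The paper's tooling performs this re-parameterization automatically across thousands of existing tests; the sketch hand-wires a single test to show the connection between a deployed config value and a reused oracle.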
  4. The successful deployment of autonomous real-time systems is contingent on their ability to recover from performance degradation of sensors, actuators, and other electro-mechanical subsystems with low latency. In this article, we introduce ALERA, a novel framework for real-time control law adaptation in nonlinear control systems assisted by system state encodings that generate an error signal when the code properties are violated in the presence of failures. The fundamental contributions of this methodology are twofold—first, we show that the time-domain error signal contains perturbed system parameters’ diagnostic information that can be used for quick control law adaptation to failure conditions and second, this quick adaptation is performed via reinforcement learning algorithms that relearn the control law of the perturbed system from a starting condition dictated by the diagnostic information, thus achieving significantly faster recovery. The fast (up to 80X faster than traditional reinforcement learning paradigms) performance recovery enabled by ALERA is demonstrated on an inverted pendulum balancing problem, a brake-by-wire system, and a self-balancing robot. 
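A toy sketch of the warm-start intuition (not ALERA itself; the one-dimensional plant, the injected actuator degradation, and the local search standing in for the reinforcement-learning step are all illustrative assumptions): the diagnostic estimate of how much actuator effectiveness dropped is used to rescale the nominal gain, so re-learning starts near the solution instead of from scratch.

    # Toy sketch of diagnostic-guided warm starting for control adaptation.
    def simulate(gain, effectiveness, steps=200, dt=0.01):
        """1-D unstable plant x' = x + effectiveness * u with u = -gain * x."""
        x, cost = 1.0, 0.0
        for _ in range(steps):
            u = -gain * x
            x += dt * (x + effectiveness * u)
            cost += x * x
        return cost

    def local_search(start_gain, effectiveness, iters=20, step=0.2):
        """Simple hill climb standing in for the reinforcement-learning update."""
        best = start_gain
        for _ in range(iters):
            for cand in (best - step, best + step):
                if simulate(cand, effectiveness) < simulate(best, effectiveness):
                    best = cand
        return best

    nominal_gain, nominal_eff = 3.0, 1.0
    degraded_eff = 0.4                            # injected failure: weaker actuator

    # Diagnostic step (illustrative): the error signal suggests effectiveness
    # dropped to ~0.4, so rescale the nominal gain before re-learning.
    warm_start = nominal_gain * (nominal_eff / degraded_eff)

    print("recovered gain, cold start:", local_search(nominal_gain, degraded_eff))
    print("recovered gain, warm start:", local_search(warm_start, degraded_eff))

The point the sketch makes is only the starting-condition argument: when the diagnostic information narrows down which parameter was perturbed and by how much, the search for a new control law begins close to a good answer, which is where the reported speedups come from.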
  5. In this paper, we i) analyze and classify real-world failures of Kubernetes (the most popular container orchestration system), ii) develop a framework to perform a fault/error injection campaign targeting the data store preserving the cluster state, and iii) compare the results of our fault/error injection experiments with real-world failures, showing that our fault/error injections can recreate many real-world failure patterns. The paper aims to address the lack of systematic analyses of Kubernetes failures to date. Our results show that even a single fault/error (e.g., a bit-flip) in the stored data can propagate, causing cluster-wide failures (3% of injections), service networking issues (4%), and service under/over provisioning (24%). Errors in the fields tracking dependencies between objects caused 51% of such cluster-wide failures. We argue that controlled fault/error injection-based testing should be employed to proactively assess Kubernetes' resiliency and guide the design of failure mitigation strategies.
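A hedged sketch of the injection primitive only (the object shape and the JSON serialization are simplifications, not the paper's tooling, which targets the cluster's backing data store): flip one bit in a serialized cluster-state object before it is written back, then observe whether the corruption is rejected at parse time or silently accepted and propagated.

    # Simplified fault-injection primitive: flip a single bit in a serialized
    # cluster-state object, mimicking the bit-flip experiments at a high level.
    import json, random

    def flip_bit(payload: bytes, bit_index: int) -> bytes:
        data = bytearray(payload)
        data[bit_index // 8] ^= 1 << (bit_index % 8)
        return bytes(data)

    # Stand-in for an object kept in the cluster data store (e.g., a pod record).
    obj = {"kind": "Pod", "metadata": {"name": "web-1", "ownerReferences": ["rs/web"]}}
    serialized = json.dumps(obj).encode()

    corrupted = flip_bit(serialized, random.randrange(len(serialized) * 8))
    try:
        print(json.loads(corrupted))   # may parse into a subtly wrong object
    except ValueError:
        print("corruption caught at deserialization time")

The interesting cases in the study are the ones the except branch does not catch: a flipped bit that still parses, for example in a dependency-tracking field such as an owner reference, can then drive cluster-wide misbehavior.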