Machine Learning Models for GPU Error Prediction in a Large Scale HPC System

Nie, Bin; Xue, Ji; Gupta, Saurabh; Patel, Tirthak; Engelmann, Christian; Smirni, Evgenia; Tiwari, Devesh

doi:10.1109/DSN.2018.00022

GPUs are widely deployed on large-scale HPC systems to provide powerful computational capability for scientific applications from various domains. As those applications are normally long-running, investigating the characteristics of GPU errors becomes imperative for reliability. In this paper, we first study the system conditions that trigger GPU errors using six-month trace data collected from a large-scale, operational HPC system. Then, we use machine learning to predict the occurrence of GPU errors, by taking advantage of temporal and spatial dependencies of the trace data. The resulting machine learning prediction framework is robust and accurate under different workloads.

Award ID(s): 1649087; 1717532

PAR ID: 10065578

Author(s) / Creator(s): Nie, Bin; Xue, Ji; Gupta, Saurabh; Patel, Tirthak; Engelmann, Christian; Smirni, Evgenia; Tiwari, Devesh

Date Published: 2018-06-01

Journal Name: 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN)

Page Range / eLocation ID: 95 to 106

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper: https://doi.org/10.1109/DSN.2018.00022
