GPU Reliability Assessment: Insights Across the Abstraction Layers

Yang, Lishan; Papadimitriou, George; Sartzetakis, Dimitris; Jog, Adwait; Smirni, Evgenia; Gizopoulos, Dimitris

doi:10.1109/CLUSTER59578.2024.00008

Graphics Processing Units (GPUs) are widely deployed and utilized across various computing domains, including cloud and high-performance computing. Given their extensive use and increasing popularity, ensuring GPU reliability is crucial. Software-based reliability evaluation methodologies, though fast, often neglect the complex hardware details of modern GPU designs. This oversight can lead to misleading measurements and misguided decisions regarding protection strategies. This paper breaks new ground by conducting an in-depth examination of well-established vulnerability assessment methods for modern GPU architectures, from the microarchitecture all the way to the software layers. It highlights divergences between popular software-based vulnerability evaluation methods and the ground-truth cross-layer evaluation, which persist even under strong protections such as triple modular redundancy. Accurate evaluation requires considering the fault distribution from hardware to software. Our comprehensive measurements offer valuable insights into the accurate assessment of GPU reliability.

Award ID(s): 2402942; 2402940; 2417718

PAR ID: 10592189

Author(s) / Creator(s): Yang, Lishan; Papadimitriou, George; Sartzetakis, Dimitris; Jog, Adwait; Smirni, Evgenia; Gizopoulos, Dimitris

Publisher / Repository: IEEE

Date Published: 2024-09-24

ISBN: 979-8-3503-5871-1

Page Range / eLocation ID: 1 to 13

Format(s): Medium: X

Location: Kobe, Japan

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper:
https://doi.org/10.1109/CLUSTER59578.2024.00008
