Design and Evaluation of GPU-FPX: A Low-Overhead tool for Floating-Point Exception Detection in NVIDIA GPUs

Li, Xinyi (ORCID: 0009-0005-7276-7715); Laguna, Ignacio (ORCID: 0000-0002-9374-4433); Fang, Bo (ORCID: 0000-0001-9721-3982); Swirydowicz, Katarzyna (ORCID: 0000-0001-5758-5394); Li, Ang (ORCID: 0000-0003-3734-9137); Gopalakrishnan, Ganesh (ORCID: 0000-0002-3705-0031)

doi:10.1145/3588195.3592991

Citation Details

Floating-point exceptions occurring during numerical computations can be a serious threat to the validity of the computed results if they are not caught and diagnosed. Unfortunately, on NVIDIA GPUs (today's most widely used type of GPU, and one that lacks hardware exception traps), this task must be carried out in software. Given the prevalence of closed-source kernels, efficient binary-level exception tracking is essential. It is also important to know how exceptions flow through the code, whether they alter the code's behavior, and whether they can be detected at the program outputs or are killed inside program flow paths. In this paper, we introduce GPU-FPX, a tool that has low overhead and allows for a deep understanding of the origin and flow of exceptions, as well as of how exceptions are modified by code optimizations. We measure GPU-FPX's performance over 151 widely used GPU programs from HPC and ML, detecting 26 serious exceptions that were previously unreported. Our results show that GPU-FPX is 16× faster in geometric-mean runtime than the only comparable prior tool, while also helping debug a larger class of codes more effectively.

Award ID(s): 2124100

PAR ID: 10639932

Author(s) / Creator(s): Li, Xinyi; Laguna, Ignacio; Fang, Bo; Swirydowicz, Katarzyna; Li, Ang; Gopalakrishnan, Ganesh

Publisher / Repository: ACM

Date Published: 2023-08-07

Page Range / eLocation ID: 59 to 71

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper: https://doi.org/10.1145/3588195.3592991
