NSF PAR Search | NSF Public Access Repository

Data-centric Reliability Management in GPUs

https://doi.org/10.1109/DSN48987.2021.00040

Kadam, Gurunath; Smirni, Evgenia; Jog, Adwait (June 2021, 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN))
Full Text Available
Enabling Software Resilience in GPGPU Applications via Partial Thread Protection

https://doi.org/10.1109/ICSE43902.2021.00114

Yang, Lishan; Nie, Bin; Jog, Adwait; Smirni, Evgenia (May 2021, 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE))
Full Text Available
SUGAR: Speeding Up GPGPU Application Resilience Estimation with Input Sizing

https://doi.org/10.1145/3447375

Yang, Lishan; Nie, Bin; Jog, Adwait; Smirni, Evgenia (February 2021, Proceedings of the ACM on Measurement and Analysis of Computing Systems)
As Graphics Processing Units (GPUs) are becoming a de facto solution for accelerating a wide range of applications, their reliable operation is becoming increasingly important. One of the major challenges in the domain of GPU reliability is to accurately measure GPGPU application error resilience. This challenge stems from the fact that a typical GPGPU application spawns a huge number of threads and then utilizes a large amount of potentially unreliable compute and memory resources available on the GPUs. As the number of possible fault locations can be in the billions, evaluating every fault and examining its effect on the application error resilience is impractical. Application resilience is therefore evaluated via extensive fault injection campaigns based on sampling of an extensive fault site space. Typically, the larger the input of the GPGPU application, the longer the experimental campaign. In this work, we devise a methodology, SUGAR (Speeding Up GPGPU Application Resilience Estimation with input sizing), that dramatically speeds up the evaluation of GPGPU application error resilience by judicious input sizing. We show how analyzing a small fraction of the input is sufficient to estimate the application resilience with high accuracy and dramatically reduce the duration of experimentation. Key to our estimation methodology is the discovery of repeating patterns as a function of the input size. Using the well-established fact that error resilience in GPGPU applications is mostly determined by the dynamic instruction count at the thread level, we identify the patterns that allow us to accurately predict application error resilience for arbitrarily large inputs. For the cases that we examine in this paper, this new resilience estimation mechanism provides significant speedups (up to 1336 times, and 97.0 times on average), while keeping estimation errors to less than 1%.
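The sampling-based campaign that the abstract describes as the baseline can be sketched as follows. This is only an illustration of why the fault site space grows with input size and how a sample of it yields a resilience estimate; it is not the SUGAR methodology itself. The 32-bit register width, the thread and instruction counts, and the `is_masked` callback (a stand-in for actually running the application with an injected fault) are all assumptions made for the example.

```python
import random

BITS_PER_REGISTER = 32  # assumption: a fault flips one bit of a 32-bit destination register


def fault_site_count(dyn_inst_per_thread, bits=BITS_PER_REGISTER):
    """Total single-bit fault sites: one per (thread, dynamic instruction, bit)."""
    return sum(dyn_inst_per_thread) * bits


def sample_fault_sites(dyn_inst_per_thread, n, rng, bits=BITS_PER_REGISTER):
    """Draw n fault sites uniformly at random from the full fault site space."""
    threads = range(len(dyn_inst_per_thread))
    # Weight each thread by its instruction count so every site is equally likely.
    picked = rng.choices(threads, weights=dyn_inst_per_thread, k=n)
    return [
        (t, rng.randrange(dyn_inst_per_thread[t]), rng.randrange(bits))
        for t in picked
    ]


def estimate_masked_fraction(sites, is_masked):
    """Fraction of sampled sites whose fault leaves the output unchanged.

    is_masked(thread, instruction, bit) is a hypothetical oracle; in a real
    campaign it would be one fault injection run of the application.
    """
    return sum(1 for site in sites if is_masked(*site)) / len(sites)


# Hypothetical kernel: 65,536 threads, each executing 1,000 dynamic instructions.
print(fault_site_count([1000] * 65536))  # 2097152000
```

Doubling the input in this toy model doubles the thread count and therefore the fault site space, which is the cost that input sizing aims to avoid.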
Full Text Available
Practical Resilience Analysis of GPGPU Applications in the Presence of Single- and Multi-Bit Faults

https://doi.org/10.1109/TC.2020.2980541

Yang, Lishan; Nie, Bin; Jog, Adwait; Smirni, Evgenia (January 2021, IEEE Transactions on Computers)
Full Text Available
Characterizing Accuracy-Aware Resilience of GPGPU Applications

https://doi.org/10.1109/CCGrid49817.2020.00-82

Nie, Bin; Jog, Adwait; Smirni, Evgenia (May 2020, 20th IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing (CCGRID))
Full Text Available
BCoal: Bucketing-Based Memory Coalescing for Efficient and Secure GPUs

https://doi.org/10.1109/HPCA47549.2020.00053

Kadam, Gurunath; Zhang, Danfeng; Jog, Adwait (February 2020, IEEE International Symposium on High Performance Computer Architecture (HPCA))

Full Text Available
SSD failures in the field: symptoms, causes, and prediction models

https://doi.org/10.1145/3295500.3356172

Alter, Jacob; Xue, Ji; Dimnaku, Alma; Smirni, Evgenia (November 2019, Supercomputing 2019)

Full Text Available
Fault Site Pruning for Practical Reliability Analysis of GPGPU Applications

https://doi.org/10.1109/MICRO.2018.00066

Nie, Bin; Yang, Lishan; Jog, Adwait; Smirni, Evgenia (October 2018, 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO))

Graphics Processing Units (GPUs) have rapidly evolved to enable energy-efficient data-parallel computing for a broad range of scientific areas. While GPUs achieve exascale performance at a stringent power budget, they are also susceptible to soft errors, often caused by high-energy particle strikes, that can significantly affect the application output quality. Understanding the resilience of general purpose GPU applications is the purpose of this study. To this end, it is imperative to explore the range of application output by injecting faults at all the potential fault sites. This problem is especially challenging because, unlike CPU applications, which are mostly single-threaded, GPGPU applications can contain hundreds to thousands of threads, resulting in a tremendously large fault site space, on the order of billions even for some simple applications. In this paper, we present a systematic way to progressively prune the fault site space, aiming to dramatically reduce the number of fault injections so that assessing GPGPU application error resilience becomes practical. The key insight behind our proposed methodology stems from the fact that GPGPU applications spawn a lot of threads; however, many of them execute the same set of instructions. Therefore, several fault sites are redundant and can be pruned by a careful analysis of faults across threads and instructions. We identify important features across a set of 10 applications (16 kernels) from the Rodinia and Polybench suites and conclude that threads can be first classified based on the number of dynamic instructions they execute. We achieve significant fault site reduction by analyzing only a small subset of threads that are representative of the dynamic instruction behavior (and therefore error resilience behavior) of the GPGPU applications.
Further pruning is achieved by identifying and analyzing: a) the dynamic instruction commonalities (and differences) across code blocks within this representative set of threads, b) a subset of loop iterations within the representative threads, and c) a subset of destination register bit positions. The above steps result in a tremendous reduction of fault sites by up to seven orders of magnitude. Yet, this reduced fault site space accurately captures the error resilience profile of GPGPU applications.
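The first pruning step, classifying threads by dynamic instruction count and keeping one representative per class, can be sketched as below. This is a minimal illustration under stated assumptions, not the paper's implementation: the per-thread instruction counts are made up, the representative is simply the first thread seen in each class, and each dynamic instruction is assumed to write one 32-bit destination register. The later steps (code blocks, loop iterations, bit positions) are not modeled.

```python
from collections import defaultdict

BITS_PER_REGISTER = 32  # assumption: one 32-bit destination register per instruction


def group_threads_by_icnt(dyn_inst_per_thread):
    """Map each dynamic instruction count to the thread ids that execute it."""
    groups = defaultdict(list)
    for tid, icnt in enumerate(dyn_inst_per_thread):
        groups[icnt].append(tid)
    return dict(groups)


def representative_threads(dyn_inst_per_thread):
    """Pick one thread per class (here: the lowest thread id, an arbitrary choice)."""
    groups = group_threads_by_icnt(dyn_inst_per_thread)
    return {icnt: tids[0] for icnt, tids in groups.items()}


def full_fault_sites(dyn_inst_per_thread, bits=BITS_PER_REGISTER):
    """Fault sites when every thread is considered."""
    return sum(dyn_inst_per_thread) * bits


def pruned_fault_sites(dyn_inst_per_thread, bits=BITS_PER_REGISTER):
    """Fault sites when only one representative thread per class is kept."""
    return sum(icnt * bits for icnt in group_threads_by_icnt(dyn_inst_per_thread))


# Hypothetical kernel: 1,000 threads run 500 instructions, 24 boundary threads run 120.
counts = [500] * 1000 + [120] * 24
print(full_fault_sites(counts))    # 16092160
print(pruned_fault_sites(counts))  # 19840
```

In this toy case thread-level pruning alone shrinks the space by roughly three orders of magnitude; the remaining steps listed in the abstract account for the rest of the reported reduction.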
Full Text Available
Machine Learning Models for GPU Error Prediction in a Large Scale HPC System

https://doi.org/10.1109/DSN.2018.00022

Nie, Bin; Xue, Ji; Gupta, Saurabh; Patel, Tirthak; Engelmann, Christian; Smirni, Evgenia; Tiwari, Devesh (June 2018, 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN))

GPUs are widely deployed on large-scale HPC systems to provide powerful computational capability for scientific applications from various domains. As those applications are normally long-running, investigating the characteristics of GPU errors becomes imperative for reliability. In this paper, we first study the system conditions that trigger GPU errors using six-month trace data collected from a large-scale, operational HPC system. Then, we use machine learning to predict the occurrence of GPU errors, by taking advantage of temporal and spatial dependencies of the trace data. The resulting machine learning prediction framework is robust and accurate under different workloads.
Full Text Available
RCoal: Mitigating GPU Timing Attack via Subwarp-Based Randomized Coalescing Techniques

https://doi.org/10.1109/HPCA.2018.00023

Kadam, Gurunath; Zhang, Danfeng; Jog, Adwait (February 2018, 2018 IEEE International Symposium on High Performance Computer Architecture (HPCA))

Graphics processing units (GPUs) are becoming default accelerators in many domains such as high-performance computing (HPC), deep learning, and virtual/augmented reality. Recently, GPUs have also shown significant speedups for a variety of security-sensitive applications such as encryption. These speedups have largely benefited from the high memory bandwidth and compute throughput of GPUs. One of the key features to optimize the memory bandwidth consumption in GPUs is intra-warp memory access coalescing, which merges memory requests originating from different threads of a single warp into as few cache lines as possible. However, this coalescing feature has also been shown to make GPUs prone to correlation timing attacks, as it exposes the relationship between the execution time and the number of coalesced accesses. Consequently, an attacker is able to correctly reveal an AES private key by repeatedly gathering encrypted data and execution time on a GPU. In this work, we propose a series of defense mechanisms to alleviate such timing attacks by carefully trading off performance for improved security. Specifically, we propose to randomize the coalescing logic such that the attacker finds it hard to guess the correct number of coalesced accesses generated. To this end, we propose to randomize: a) the granularity (called a subwarp) at which warp threads are grouped together for coalescing, and b) the threads selected by each subwarp for coalescing. Such randomization techniques result in three mechanisms: fixed-sized subwarp (FSS), random-sized subwarp (RSS), and random-threaded subwarp (RTS). We find that the combination of these security mechanisms offers 24- to 961-times improvement in the security against correlation timing attacks with 5 to 28% performance degradation. Online copy: http://adwaitjog.github.io/docs/pdf/rcoal-hpca18.pdf
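A small software model can make the coalescing argument concrete. The sketch below counts coalesced accesses as the number of distinct cache lines touched by a warp, then recounts when coalescing is restricted to subwarps. It is an assumption-laden illustration, not the hardware design from the paper: the 128-byte line size, the 32-thread warp, the equal-sized subwarps, and the shuffle-based thread assignment are all choices made for this example, and the random-sized variant (RSS) is not modeled.

```python
import random

LINE_SIZE = 128  # assumption: bytes per cache line
WARP_SIZE = 32   # assumption: threads per warp


def coalesced_accesses(addresses, line_size=LINE_SIZE):
    """Number of memory accesses after merging requests to the same cache line."""
    return len({addr // line_size for addr in addresses})


def subwarp_accesses(addresses, subwarps, line_size=LINE_SIZE):
    """Total accesses when coalescing happens only within each subwarp.

    subwarps is a list of lists of thread indices into addresses.
    """
    return sum(
        coalesced_accesses([addresses[t] for t in subwarp], line_size)
        for subwarp in subwarps
    )


def fixed_size_subwarps(num_subwarps, warp_size=WARP_SIZE):
    """Split the warp into equal subwarps of consecutive threads.

    Assumes num_subwarps divides warp_size evenly.
    """
    size = warp_size // num_subwarps
    return [list(range(i, i + size)) for i in range(0, warp_size, size)]


def random_threaded_subwarps(num_subwarps, rng, warp_size=WARP_SIZE):
    """Equal-sized subwarps whose member threads are chosen at random."""
    threads = list(range(warp_size))
    rng.shuffle(threads)
    size = warp_size // num_subwarps
    return [threads[i:i + size] for i in range(0, warp_size, size)]


# Threads 0-15 hit one cache line and threads 16-31 hit the next one.
addrs = [8 * t for t in range(WARP_SIZE)]
print(coalesced_accesses(addrs))                        # 2
print(subwarp_accesses(addrs, fixed_size_subwarps(4)))  # 4
print(subwarp_accesses(addrs, random_threaded_subwarps(4, random.Random())))
```

With random thread assignment the last count varies between 4 and 8 from run to run for the same addresses, so the access count an attacker infers from timing no longer tracks the access pattern directly. The extra accesses are also the source of the performance cost the abstract mentions.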
Full Text Available
