skip to main content

Title: Fault Site Pruning for Practical Reliability Analysis of GPGPU Applications
Graphics Processing Units (GPUs) have rapidly evolved to enable energy-efficient data-parallel computing for a broad range of scientific areas. While GPUs achieve exascale performance at a stringent power budget, they are also susceptible to soft errors, often caused by high-energy particle strikes, that can significantly affect the application output quality. Understanding the resilience of general purpose GPU applications is the purpose of this study. To this end, it is imperative to explore the range of application output by injecting faults at all the potential fault sites. This problem is especially challenging because unlike CPU applications, which are mostly single-threaded, GPGPU applications can contain hundreds to thousands of threads, resulting in a tremendously large fault site space - in the order of billions even for some simple applications. In this paper, we present a systematic way to progressively prune the fault site space aiming to dramatically reduce the number of fault injections such that assessment for GPGPU application error resilience can be practical. The key insight behind our proposed methodology stems from the fact that GPGPU applications spawn a lot of threads, however, many of them execute the same set of instructions. Therefore, several fault sites are redundant and can be pruned by a careful analysis of faults across threads and instructions. We identify important features across a set of 10 applications (16 kernels) from Rodinia and Polybench suites and conclude that threads can be first classified based on the number of the dynamic instructions they execute. We achieve significant fault site reduction by analyzing only a small subset of threads that are representative of the dynamic instruction behavior (and therefore error resilience behavior) of the GPGPU applications. Further pruning is achieved by identifying and analyzing: a) the dynamic instruction commonalities (and differences) across code blocks within this representative set of threads, b) a subset of loop iterations within the representative threads, and c) a subset of destination register bit positions. The above steps result in a tremendous reduction of fault sites by up to seven orders of magnitude. Yet, this reduced fault site space accurately captures the error resilience profile of GPGPU applications.  more » « less
Award ID(s):
1717532 1649087
Author(s) / Creator(s):
; ; ;
Date Published:
Journal Name:
51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO)
Page Range / eLocation ID:
749 to 761
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. null (Ed.)
    As Graphics Processing Units (GPUs) are becoming a de facto solution for accelerating a wide range of applications, their reliable operation is becoming increasingly important. One of the major challenges in the domain of GPU reliability is to accurately measure GPGPU application error resilience. This challenge stems from the fact that a typical GPGPU application spawns a huge number of threads and then utilizes a large amount of potentially unreliable compute and memory resources available on the GPUs. As the number of possible fault locations can be in the billions, evaluating every fault and examining its effect on theapplication error resilience is impractical. Application resilience is evaluated via extensive fault injection campaigns based on sampling of an extensive fault site space. Typically, the larger the input of the GPGPU application, the longer the experimental campaign. In this work, we devise a methodology, SUGAR (Speeding Up GPGPU Application Resilience Estimation with input sizing), that dramatically speeds up the evaluation of GPGPU application error resilience by judicious input sizing. We show how analyzing a small fraction of the input is sufficient to estimate the application resilience with high accuracy and dramatically reduce the duration of experimentation. Key of our estimation methodology is the discovery of repeating patterns as a function of the input size. Using the well-established fact that error resilience in GPGPU applications is mostly determined by the dynamic instruction count at the thread level, we identify the patterns that allow us to accurately predict application error resilience for arbitrarily large inputs. For the cases that we examine in this paper, this new resilience estimation mechanism provides significant speedups (up to 1336 times) and 97.0 on the average, while keeping estimation errors to less than 1%. 
    more » « less
  2. Graphics Processing Units (GPUs) exploit large amounts of thread-level parallelism to provide high instruction throughput and to efficiently hide long-latency stalls. The resulting high throughput, along with continued programmability improvements, have made GPUs an essential computational resource in many domains. Applications from different domains can have vastly different compute and memory demands on the GPU. In a large-scale computing environment, to efficiently accommodate such wide-ranging demands without leaving GPU resources underutilized, multiple applications can share a single GPU, akin to how multiple applications execute concurrently on a CPU. Multi-application concurrency requires several support mechanisms in both hardware and software. One such key mechanism is virtual memory, which manages and protects the address space of each application. However, modern GPUs lack the extensive support for multi-application concurrency available in CPUs, and as a result suffer from high performance overheads when shared by multiple applications, as we demonstrate. We perform a detailed analysis of which multi-application concurrency support limitations hurt GPU performance the most. We find that the poor performance is largely a result of the virtual memory mechanisms employed in modern GPUs. In particular, poor address translation performance is a key obstacle to efficient GPU sharing. State-of-the-art address translation mechanisms, which were designed for single-application execution, experience significant inter-application interference when multiple applications spatially share the GPU. This contention leads to frequent misses in the shared translation lookaside buffer (TLB), where a single miss can induce long-latency stalls for hundreds of threads. As a result, the GPU often cannot schedule enough threads to successfully hide the stalls, which diminishes system throughput and becomes a first-order performance concern. Based on our analysis, we propose MASK, a new GPU framework that provides low-overhead virtual memory support for the concurrent execution of multiple applications. MASK consists of three novel address-translation-aware cache and memory management mechanisms that work together to largely reduce the overhead of address translation: (1) a token-based technique to reduce TLB contention, (2) a bypassing mechanism to improve the effectiveness of cached address translations, and (3) an application-aware memory scheduling scheme to reduce the interference between address translation and data requests. Our evaluations show that MASK restores much of the throughput lost to TLB contention. Relative to a state-of-the-art GPU TLB, MASK improves system throughput by 57.8%, improves IPC throughput by 43.4%, and reduces application-level unfairness by 22.4%. MASK's system throughput is within 23.2% of an ideal GPU system with no address translation overhead. 
    more » « less
  3. null (Ed.)
    Deterministic execution for GPUs is a desirable property as it helps with debuggability and reproducibility. It is also important for safety regulations, as safety critical workloads are starting to be deployed onto GPUs. Prior deterministic architectures, such as GPUDet, attempt to provide strong determinism for all types of workloads, incurring significant performance overheads due to the many restrictions that are required to satisfy determinism. We observe that a class of reduction workloads, such as graph applications and neural architecture search for machine learning, do not require such severe restrictions to preserve determinism. This motivates the design of our system, Deterministic Atomic Buffering (DAB), which provides deterministic execution with low area and performance overheads by focusing solely on ordering atomic instructions instead of all memory instructions. By scheduling atomic instructions deterministically with atomic buffering, the results of atomic operations are isolated initially and made visible in the future in a deterministic order. This allows the GPU to execute deterministically in parallel without having to serialize its threads for atomic operations as opposed to GPUDet. Our simulation results show that, for atomic-intensive applications, DAB performs 4× better than GPUDet and incurs only a 23% slowdown on average compared to a non-deterministic GPU architecture. We also characterize the bottlenecks and provide insights for future optimizations. 
    more » « less
  4. Arbitrary-precision integer multiplication is the core kernel of many applications including scientific computing, cryptographic algorithms, etc. Existing acceleration of arbitrary-precision integer multiplication includes CPUs, GPUs, FPGAs, and ASICs. To leverage the hardware intrinsics low-bit function units (32/64-bit), arbitrary-precision integer multiplication can be calculated using Karatsuba decomposition, and Schoolbook decomposition by decomposing the two large operands into several small operands, generating a set of low-bit multiplications that can be processed either in a spatial or sequential manner on the low-bit function units, e.g., CPU vector instructions, GPU CUDA cores, FPGA digital signal processing (DSP) blocks. Among these accelerators, reconfigurable computing, e.g., FPGA accelerators are promised to provide both good energy efficiency and flexibility. We implement the state-of-the-art (SOTA) FPGA accelerator and compare it with the SOTA libraries on CPUs and GPUs. Surprisingly, in terms of energy efficiency, we find that the FPGA has the lowest energy efficiency, i.e., 0.29x of the CPU and 0.17x of the GPU with the same generation fabrication. Therefore, key questions arise: Where do the energy efficiency gains of CPUs and GPUs come from? Can reconfigurable computing do better? If can, how to achieve that? We first identify that the biggest energy efficiency gains of the CPUs and GPUs come from the dedicated vector units, i.e., vector instruction units in CPUs and CUDA cores in GPUs. FPGA uses DSPs and lookup tables (LUTs) to compose the needed computation, which incurs overhead when compared to using vector units directly. New reconfigurable computing, e.g., “FPGA+vector units” is a novel and feasible solution to improve energy efficiency. In this paper, we propose to map arbitrary-precision integer multiplication onto such a “FPGA+vector units” platform, i.e., AMD/Xilinx Versal ACAP architecture, a heterogeneous reconfigurable computing platform that features 400 AI engine tensor cores (AIE) running at 1 GHz, FPGA programmable logic (PL), and a general-purpose CPU in the system fabricated with the TSMC 7nm technology. Designing on Versal ACAP incurs several challenges and we propose AIM: Arbitrary-precision Integer Multiplication on Versal ACAP to automate and optimize the design. AIM accelerator is composed of AIEs, PL, and CPU. AIM framework includes analytical models to guide design space exploration and AIM automatic code generation to facilitate the system design and on-board design verification. We deploy the AIM framework on three different applications, including large integer multiplication (LIM), RSA, and Mandelbrot, on the AMD/Xilinx Versal ACAP VCK190 evaluation board. Our experimental results show that compared to existing accelerators, AIM achieves up to 12.6x, and 2.1x energy efficiency gains over the Intel Xeon Ice Lake 6346 CPU, and NVidia A5000 GPU respectively, which brings reconfigurable computing the most energy-efficient platform among CPUs and GPUs. 
    more » « less
  5. With the growing effort to reduce power consumption in machines, fault tolerance becomes more of a concern. This holds particularly for large-scale computing, where execution failures due to soft faults waste excessive time and resources. These large-scale applications are normally parallel in nature and rely on control structures tailored specifically for parallel computing, such as locks and barriers. While there are many studies on resilient software, to our knowledge none of them focus on protecting these parallel control structures. In this work, we present a method of ensuring the correct operation of both locks and barriers in parallel applications. Our method tracks the memory locations used within parallel sections and detects a violation of the control structures. Upon detecting any violation, the violating thread is rolled back to the beginning of the structure and reattempts it, similar to rollback mechanisms in transactional memory systems. We test the method on representative samples of the BigDataBench kernels and find it exhibits a mean error reduction of 93.6% for basic mutex locks and barriers with a mean 6.55% execution time overhead at 64 threads. Additionally, we provide a comparison to transactional memory methods and demonstrate up to a mean 57.5% execution time overhead reduction. 
    more » « less