

Title: SUGAR: Speeding Up GPGPU Application Resilience Estimation with Input Sizing
As Graphics Processing Units (GPUs) are becoming a de facto solution for accelerating a wide range of applications, their reliable operation is becoming increasingly important. One of the major challenges in the domain of GPU reliability is to accurately measure GPGPU application error resilience. This challenge stems from the fact that a typical GPGPU application spawns a huge number of threads and then utilizes a large amount of potentially unreliable compute and memory resources available on the GPUs. As the number of possible fault locations can be in the billions, evaluating every fault and examining its effect on the application error resilience is impractical. Instead, application resilience is evaluated via fault injection campaigns that sample the extensive fault site space. Typically, the larger the input of the GPGPU application, the longer the experimental campaign. In this work, we devise a methodology, SUGAR (Speeding Up GPGPU Application Resilience Estimation with input sizing), that dramatically speeds up the evaluation of GPGPU application error resilience by judicious input sizing. We show how analyzing a small fraction of the input is sufficient to estimate the application resilience with high accuracy and dramatically reduce the duration of experimentation. Key to our estimation methodology is the discovery of repeating patterns as a function of the input size. Using the well-established fact that error resilience in GPGPU applications is mostly determined by the dynamic instruction count at the thread level, we identify the patterns that allow us to accurately predict application error resilience for arbitrarily large inputs. For the cases that we examine in this paper, this new resilience estimation mechanism provides significant speedups (up to 1336 times, and 97.0 times on average) while keeping estimation errors to less than 1%.
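To make the extrapolation idea concrete, the following is a minimal Python sketch, not the authors' implementation: threads are grouped by their dynamic instruction count on a small input, a short fault-injection campaign measures each group's masking probability, and the group populations are scaled to the target input size. The function names, the uniform scaling factor, and the sample numbers are illustrative assumptions.

```python
from collections import Counter

def group_threads_by_dynamic_count(dyn_counts):
    """Map each distinct dynamic-instruction count to the number of threads that execute it."""
    return Counter(dyn_counts)

def estimate_resilience(groups, masking_prob, scale):
    """Weight per-group masking probabilities by (scaled) group populations and instruction counts."""
    total_insts = sum(count * n * scale for count, n in groups.items())
    masked = sum(masking_prob[count] * count * n * scale for count, n in groups.items())
    return masked / total_insts  # predicted fraction of injected faults that are masked

# Hypothetical profile of a small input: six threads fall into three groups,
# and a short fault-injection campaign measured each group's masking probability.
small_input_counts = [120, 120, 120, 480, 480, 900]
masking = {120: 0.92, 480: 0.85, 900: 0.78}
groups = group_threads_by_dynamic_count(small_input_counts)
print(estimate_resilience(groups, masking, scale=64))  # predicted resilience for a 64x larger input
```

In practice the per-group scaling would follow the repeating patterns identified by the methodology rather than a single uniform multiplier; the sketch only shows how group-level measurements on a small input can be reweighted for a larger one.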
Award ID(s): 1717532
NSF-PAR ID: 10284480
Author(s) / Creator(s): ; ; ;
Date Published:
Journal Name: Proceedings of the ACM on Measurement and Analysis of Computing Systems
Volume: 5
Issue: 1
ISSN: 2476-1249
Page Range / eLocation ID: 1 to 29
Format(s): Medium: X
Sponsoring Org: National Science Foundation
More Like this
  1. Graphics Processing Units (GPUs) have rapidly evolved to enable energy-efficient data-parallel computing for a broad range of scientific areas. While GPUs achieve exascale performance at a stringent power budget, they are also susceptible to soft errors, often caused by high-energy particle strikes, that can significantly affect the application output quality. Understanding the resilience of general-purpose GPU applications is the purpose of this study. To this end, it is imperative to explore the range of application outputs by injecting faults at all the potential fault sites. This problem is especially challenging because, unlike CPU applications, which are mostly single-threaded, GPGPU applications can contain hundreds to thousands of threads, resulting in a tremendously large fault site space, on the order of billions even for some simple applications. In this paper, we present a systematic way to progressively prune the fault site space, aiming to dramatically reduce the number of fault injections so that assessing GPGPU application error resilience becomes practical. The key insight behind our proposed methodology stems from the fact that GPGPU applications spawn a lot of threads; however, many of them execute the same set of instructions. Therefore, several fault sites are redundant and can be pruned by a careful analysis of faults across threads and instructions. We identify important features across a set of 10 applications (16 kernels) from the Rodinia and Polybench suites and conclude that threads can first be classified based on the number of dynamic instructions they execute. We achieve significant fault site reduction by analyzing only a small subset of threads that are representative of the dynamic instruction behavior (and therefore error resilience behavior) of the GPGPU applications. Further pruning is achieved by identifying and analyzing: (a) the dynamic instruction commonalities (and differences) across code blocks within this representative set of threads, (b) a subset of loop iterations within the representative threads, and (c) a subset of destination register bit positions. The above steps result in a tremendous reduction of fault sites, by up to seven orders of magnitude. Yet, this reduced fault site space accurately captures the error resilience profile of GPGPU applications.
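As an illustration of the first pruning step described above (a sketch under stated assumptions, not the paper's artifact), the following Python snippet groups threads by their dynamic instruction count, keeps one representative per group, and reports the resulting reduction in fault sites, assuming for simplicity that every dynamic instruction contributes a fixed number of bit-level fault sites.

```python
from collections import defaultdict

def prune_by_dynamic_count(thread_dyn_counts, bits_per_instruction=32):
    """Keep one representative thread per dynamic-instruction-count class."""
    classes = defaultdict(list)
    for tid, count in thread_dyn_counts.items():
        classes[count].append(tid)
    representatives = {count: tids[0] for count, tids in classes.items()}
    # Fault sites before and after pruning, counting one site per destination-register bit.
    full_sites = sum(thread_dyn_counts.values()) * bits_per_instruction
    pruned_sites = sum(representatives) * bits_per_instruction
    return representatives, full_sites, pruned_sites

# Hypothetical kernel: six threads with three distinct dynamic-instruction counts.
threads = {0: 500, 1: 500, 2: 500, 3: 1200, 4: 1200, 5: 40}
reps, full, pruned = prune_by_dynamic_count(threads)
print(reps, full, pruned, full / pruned)  # ~2.3x reduction from this first step alone
```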
  2. Streaming dataflow applications are an attractive target to parallelize on wide-SIMD processors such as GPUs. These applications can be expressed as a pipeline of compute nodes connected by edges, which feed outputs from one node to the next. Streaming applications often exhibit irregular dataflow, where the amount of output produced for one input is unknown a priori. Inserting finite queues between pipeline nodes can ameliorate the impact of irregularity and improve SIMD lane occupancy. The sizing of these queues is driven by both performance and safety considerations: relative queue sizes should be chosen to reduce runtime overhead and maximize throughput, but each node's output queue must be large enough to accommodate the maximum number of outputs produced by one SIMD vector of inputs to the node. When safety and performance considerations conflict, the application may incur excessive memory usage and runtime overhead. In this work, we identify properties of applications that lead to such undesirable behaviors, with examples from applications implemented in our MERCATOR framework for irregular streaming on GPUs. To address these issues, we propose extensions to support interruptible nodes that can be suspended mid-execution if their output queues fill. We illustrate the impacts of adding interruptible nodes to the MERCATOR framework on representative irregular streaming applications from the domains of branching search and bioinformatics.
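The interruptible-node idea can be illustrated with a small CPU-side sketch; MERCATOR itself targets CUDA on GPUs, and this Python version only mirrors the control flow. A node with irregular fan-out consumes inputs until its bounded output queue would overflow, then suspends and reports where to resume. The class and parameter names are assumptions for illustration.

```python
from collections import deque

class InterruptibleNode:
    def __init__(self, transform, out_capacity):
        self.transform = transform        # maps one input to a variable number of outputs
        self.out_queue = deque()
        self.out_capacity = out_capacity

    def run(self, inputs, start=0):
        """Consume inputs[start:]; return the index of the first input left unconsumed."""
        for i in range(start, len(inputs)):
            outputs = self.transform(inputs[i])
            if len(self.out_queue) + len(outputs) > self.out_capacity:
                return i                  # suspend: the output queue is full, resume here later
            self.out_queue.extend(outputs)
        return len(inputs)

# A node with irregular fan-out: each input x produces x copies of itself.
node = InterruptibleNode(lambda x: [x] * x, out_capacity=8)
data = [3, 5, 2, 4]
resume_at = node.run(data)                   # consumes 3 and 5, then suspends at index 2
node.out_queue.clear()                       # a downstream node drains the queue...
resume_at = node.run(data, start=resume_at)  # ...and the suspended node resumes mid-stream
```

The design point the paper targets is exactly this trade-off: with interruptible nodes, the output queue no longer has to be provisioned for the worst-case fan-out of a full SIMD vector of inputs.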

  3. Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks. However, ViTs' self-attention module is still arguably a major bottleneck, limiting their achievable hardware efficiency and more extensive applications to resource-constrained platforms. Meanwhile, existing accelerators dedicated to NLP Transformers are not optimal for ViTs. This is because there is a large difference between ViTs and Transformers for natural language processing (NLP) tasks: ViTs have a relatively fixed number of input tokens, whose attention maps can be pruned by up to 90% even with fixed sparse patterns, without severely hurting the model accuracy (e.g., ≤1.5% accuracy drop under a 90% pruning ratio); while NLP Transformers need to handle input sequences of varying numbers of tokens and rely on on-the-fly predictions of dynamic sparse attention patterns for each input to achieve a decent sparsity (e.g., ≥50%). To this end, we propose a dedicated algorithm and accelerator co-design framework dubbed ViTCoD for accelerating ViTs. Specifically, on the algorithm level, ViTCoD prunes and polarizes the attention maps to have either denser or sparser fixed patterns for regularizing two levels of workloads without hurting the accuracy, largely reducing the attention computations while leaving room for alleviating the remaining dominant data movements; on top of that, we further integrate a lightweight and learnable auto-encoder module to enable trading the dominant high-cost data movements for lower-cost computations. On the hardware level, we develop a dedicated accelerator to simultaneously coordinate the aforementioned enforced denser and sparser workloads for boosted hardware utilization, while integrating on-chip encoder and decoder engines to leverage ViTCoD's algorithm pipeline for much reduced data movements. Extensive experiments and ablation studies validate that ViTCoD largely reduces the dominant data movement costs, achieving speedups of up to 235.3×, 142.9×, and 86.0× over general computing platforms (CPUs, EdgeGPUs, and GPUs) and of up to 10.1× and 6.8× over the prior-art Transformer accelerators SpAtten and Sanger, respectively, under an attention sparsity of 90%. Our code implementation is available at https://github.com/GATECH-EIC/ViTCoD.
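A rough NumPy sketch of the attention polarization step is shown below; it is an interpretation of the description above rather than ViTCoD's released code. Columns with high average attention are kept as a denser fixed pattern, and a small per-row top-k over the remaining columns forms the sparser fixed pattern. The ratios and the 196-token example are illustrative assumptions.

```python
import numpy as np

def polarize_attention(avg_attn, dense_ratio=0.10, sparse_keep=0.05):
    """Split a fixed attention mask into a denser column block plus a sparser per-row pattern."""
    n = avg_attn.shape[0]
    n_dense = max(1, int(dense_ratio * n))
    dense_cols = np.argsort(avg_attn.mean(axis=0))[-n_dense:]   # most-attended tokens stay dense
    mask = np.zeros_like(avg_attn, dtype=bool)
    mask[:, dense_cols] = True
    remaining = avg_attn.copy()
    remaining[:, dense_cols] = -np.inf                          # exclude the dense columns
    n_sparse = max(1, int(sparse_keep * n))
    sparse_cols = np.argsort(remaining, axis=1)[:, -n_sparse:]  # per-row top-k of what is left
    mask[np.arange(n)[:, None], sparse_cols] = True
    return mask

attn = np.random.rand(196, 196)   # e.g., attention averaged over a calibration set, 196 tokens
mask = polarize_attention(attn)
print(mask.mean())                # fraction of attention entries kept (~0.14 here)
```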
  4. Today's high-performance computing (HPC) applications are producing vast volumes of data, which are challenging to store and transfer efficiently during execution, so data compression is becoming a critical technique to mitigate the storage burden and data movement cost. Huffman coding is arguably the most efficient entropy coding algorithm in information theory, and it appears as a fundamental step in many modern compression algorithms such as DEFLATE. On the other hand, today's HPC applications rely more and more on accelerators such as GPUs on supercomputers, while Huffman encoding suffers from low throughput on GPUs, resulting in a significant bottleneck in the overall data processing. In this paper, we propose and implement an efficient Huffman encoding approach based on modern GPU architectures, which addresses two key challenges: (1) how to parallelize the entire Huffman encoding algorithm, including codebook construction, and (2) how to fully utilize the high memory bandwidth of modern GPU architectures. The detailed contribution is four-fold. (1) We develop an efficient parallel codebook construction on GPUs that scales effectively with the number of input symbols. (2) We propose a novel reduction-based encoding scheme that can efficiently merge the codewords on GPUs. (3) We optimize the overall GPU performance by leveraging state-of-the-art CUDA APIs such as Cooperative Groups. (4) We evaluate our Huffman encoder thoroughly using six real-world application datasets on two advanced GPUs and compare with our own multi-threaded Huffman encoder. Experiments show that our solution can improve the encoding throughput by up to 5.0× and 6.8× on an NVIDIA RTX 5000 and a V100, respectively, over the state-of-the-art GPU Huffman encoder, and by up to 3.3× over the multi-threaded encoder on two 28-core Xeon Platinum 8280 CPUs.
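For reference, the two steps the paper parallelizes, codebook construction and codeword concatenation, look as follows in a serial Python sketch. This is only a baseline illustration: the paper's contribution lies in performing the same work with GPU-friendly parallel codebook construction and a reduction-based merge, which this sketch does not attempt.

```python
import heapq
from collections import Counter

def build_codebook(data):
    """Standard heap-based Huffman codebook construction from symbol frequencies."""
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in enumerate(Counter(data).items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, next_id, merged))
        next_id += 1
    return heap[0][2]

def encode(data, codebook):
    """Concatenate per-symbol codewords; on a GPU this is done with a parallel reduction/prefix sum."""
    return "".join(codebook[sym] for sym in data)

text = b"abracadabra"
book = build_codebook(text)
bits = encode(text, book)
print(book, len(bits))
```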
  5. Electron Backscatter Diffraction (EBSD) is a widely used approach for characterising the microstructure of various materials. However, it is difficult to accurately distinguish similar phases (body-centred cubic and body-centred tetragonal with small tetragonality) in steels using standard EBSD software. One method to tackle the problem of phase distinction is to measure the tetragonality of the phases, which can be done using simulated patterns and cross-correlation techniques to detect distortion away from a perfectly cubic crystal lattice. However, small errors in the determination of microscope geometry (the so-called pattern or projection centre) can cause significant errors in tetragonality measurement and lead to erroneous results. This paper utilises a new approach for accurate pattern centre determination via a strain minimisation routine across a large number of grains in dual-phase steels. Tetragonality maps are then produced and used to identify phase and estimate local carbon content. The technique is implemented using both kinematically simulated and dynamically simulated patterns to determine their relative accuracy. Tetragonality maps, and subsequent phase maps, based on dynamically simulated patterns in a point-by-point and grain-average comparison are found to consistently produce more precise and accurate results, with close to 90% accuracy for grain phase identification, when compared with an image-quality identification method. The error in tetragonality measurements appears to be of the order of 1%, thus producing a commensurate ∼0.2% error in carbon content estimation. Such an error makes the technique unsuitable for estimation of total carbon content of most commercial steels, which often have carbon levels below 0.1%. However, even in the dual-phase steel of this study (0.1 wt.% carbon) it can be used to map carbon in regions with higher accumulation (such as in martensite with nonhomogeneous carbon content).

    Lay Description

Electron Backscatter Diffraction (EBSD) is a widely used approach for characterising the microstructure of various materials. However, it is difficult to accurately distinguish similar (BCC and BCT) phases in steels using standard EBSD software due to the small difference in crystal structure. One method to tackle the problem of phase distinction is to measure the tetragonality, or apparent 'strain' in the crystal lattice, of the phases. This can be done by comparing experimental EBSD patterns with simulated patterns via cross-correlation techniques, to detect distortion away from a perfectly cubic crystal lattice. However, small errors in the determination of microscope geometry (the so-called pattern or projection centre) can cause significant errors in tetragonality measurement and lead to erroneous results. This paper utilises a new approach for accurate pattern centre determination via a strain minimisation routine across a large number of grains in dual-phase steels. Tetragonality maps are then produced and used to identify phase and estimate local carbon content. The technique is implemented using both simpler kinematically simulated and more complex dynamically simulated patterns to determine their relative accuracy. Tetragonality maps, and subsequent phase maps, based on dynamically simulated patterns in a point-by-point and grain-average comparison are found to consistently produce more precise and accurate results, with close to 90% accuracy for grain phase identification, when compared with an image-quality identification method. The error in tetragonality measurements appears to be of the order of 1%, thus producing a commensurate error in carbon content estimation. Such an error makes an estimate of total carbon content particularly unsuitable for low-carbon steels, although maps of local carbon content may still be revealing.

    Application of the method developed in this paper will lead to better understanding of the complex microstructures of steels, and the potential to design microstructures that deliver higher strength and ductility for common applications, such as vehicle components.
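To see why a ~1% tetragonality error translates into roughly a 0.2 wt.% error in carbon, one can use the commonly cited empirical relation c/a ≈ 1 + 0.045 × (wt.% C) for carbon martensite; the paper does not state which calibration it uses, so the constant here is an assumption. A short Python check:

```python
def carbon_from_tetragonality(c_over_a, k=0.045):
    """Invert c/a = 1 + k * (wt.% C) to estimate local carbon content (k is an assumed calibration)."""
    return (c_over_a - 1.0) / k

print(carbon_from_tetragonality(1.013))           # ~0.29 wt.% C for 1.3% tetragonality
print(carbon_from_tetragonality(1.023)            # a 1% error in measured tetragonality...
      - carbon_from_tetragonality(1.013))         # ...shifts the carbon estimate by ~0.22 wt.%
```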
