NSF PAR Search | NSF Public Access Repository

Aspis: Lightweight Neural Network Protection Against Soft Errors

https://doi.org/10.1109/ISSRE62328.2024.00036

Schmedding, Anna; Yang, Lishan; Jog, Adwait; Smirni, Evgenia (October 2024, IEEE)

Convolutional neural networks (CNNs) are incorporated into many image-based tasks across a variety of domains. Some of these are safety-critical tasks such as object classification/detection and lane detection for self-driving cars. These applications have strict safety requirements and must guarantee the reliable operation of the neural networks in the presence of soft errors (i.e., transient faults) in DRAM. Standard safety mechanisms (e.g., triplication of data/computation) provide high resilience, but introduce intolerable overhead. We perform detailed characterization and propose an efficient methodology for pinpointing critical weights by using an efficient proxy, the Taylor criterion. Using this characterization, we design Aspis, an efficient software protection scheme that does selective weight hardening and offers a performance/reliability tradeoff. Aspis provides higher resilience compared with state-of-the-art methods and is integrated into PyTorch as a fully automated library.
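The idea of ranking weights with a Taylor-style proxy can be sketched as below. This is a minimal illustration, not the Aspis implementation: it assumes the first-order form of the criterion, scoring each weight by |w · ∂L/∂w|, and the function names `taylor_scores` and `select_critical` are hypothetical. The paper's exact formulation and its PyTorch integration may differ.

```python
def taylor_scores(weights, grads):
    """Score each weight with an assumed first-order Taylor proxy: |w * dL/dw|."""
    if len(weights) != len(grads):
        raise ValueError("weights and grads must have the same length")
    return [abs(w * g) for w, g in zip(weights, grads)]


def select_critical(weights, grads, k):
    """Return the indices of the k highest-scoring weights, in ascending index order.

    These are the weights a selective hardening scheme would choose to protect.
    """
    scores = taylor_scores(weights, grads)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


if __name__ == "__main__":
    # Toy values only; real gradients would come from backpropagation.
    w = [0.5, -2.0, 0.1, 1.0]
    g = [0.2, 0.1, 3.0, -0.05]
    print(select_critical(w, g, k=2))
```

Choosing `k` is where the performance/reliability tradeoff mentioned in the abstract would show up: protecting more weights costs more but leaves fewer unprotected fault sites.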
Full Text Available
GPU Reliability Assessment: Insights Across the Abstraction Layers

https://doi.org/10.1109/CLUSTER59578.2024.00008

Yang, Lishan; Papadimitriou, George; Sartzetakis, Dimitris; Jog, Adwait; Smirni, Evgenia; Gizopoulos, Dimitris (September 2024, IEEE)

Graphics Processing Units (GPUs) are widely deployed and utilized across various computing domains including cloud and high-performance computing. Considering their extensive usage and increasing popularity, ensuring GPU reliability is crucial. Software-based reliability evaluation methodologies, though fast, often neglect the complex hardware details of modern GPU designs. This oversight could lead to misleading measurements and misguided decisions regarding protection strategies. This paper breaks new ground by conducting an in-depth examination of well-established vulnerability assessment methods for modern GPU architectures, from the microarchitecture all the way to the software layers. It highlights divergences between popular software-based vulnerability evaluation methods and the ground truth cross-layer evaluation, which persist even under strong protections like triple modular redundancy. Accurate evaluation requires considering fault distribution from hardware to software. Our comprehensive measurements offer valuable insights into the accurate assessment of GPU reliability.
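For readers unfamiliar with triple modular redundancy, the sketch below shows the standard bitwise majority vote over three stored copies of a word. It is a generic illustration and not taken from the paper; the helper names `store_triplicated` and `read_triplicated` are hypothetical.

```python
def majority_vote(a, b, c):
    """Bitwise majority: each output bit takes the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def store_triplicated(word):
    """Keep three independent copies of an integer word."""
    return [word, word, word]


def read_triplicated(copies):
    """Recover the word by voting across the three copies."""
    return majority_vote(copies[0], copies[1], copies[2])


if __name__ == "__main__":
    copies = store_triplicated(0xDEADBEEF)
    copies[1] ^= 1 << 7  # simulate a single-bit soft error in one copy
    print(hex(read_triplicated(copies)))
```

The vote corrects any fault confined to one copy. If the same bit is corrupted in two copies, the vote returns the wrong value.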
Full Text Available
Pushing the Performance Envelope of DNN-based Recommendation Systems Inference on GPUs

https://doi.org/10.1109/MICRO61859.2024.00091

Jain, Rishabh; Bhasi, Vivek M; Jog, Adwait; Sivasubramaniam, Anand; Kandemir, Mahmut T; Das, Chita R (November 2024, IEEE)

Personalized recommendation is a ubiquitous application on the internet, with many industries and hyperscalers extensively leveraging Deep Learning Recommendation Models (DLRMs) for their personalization needs (like ad serving or movie suggestions). With growing model and dataset sizes pushing computation and memory requirements, GPUs are being increasingly preferred for executing DLRM inference. However, serving newer DLRMs while meeting acceptable latencies remains challenging, making traditional deployments increasingly GPU-hungry and raising inference serving costs. In this paper, we show that the embedding stage continues to be the primary bottleneck in the GPU inference pipeline, leading to up to a 3.2× embedding-only performance slowdown. To thoroughly grasp the problem, we conduct a detailed microarchitecture characterization and highlight the presence of low occupancy in the standard embedding kernels. By leveraging direct compiler optimizations, we achieve optimal occupancy, pushing the performance by up to 53%. Yet, long memory latency stalls continue to exist. To tackle this challenge, we propose specialized plug-and-play-based software prefetching and L2 pinning techniques, which help in hiding and decreasing the latencies. Further, we propose combining them, as they complement each other. Experimental evaluations using A100 GPUs with large models and datasets show that our proposed techniques improve performance by up to 103% for the embedding stage, and up to 77% for the overall DLRM inference pipeline.
Full Text Available
Optimizing CPU Performance for Recommendation Systems At-Scale

https://doi.org/10.1145/3579371.3589112

Jain, Rishabh; Cheng, Scott; Kalagi, Vishwas; Sanghavi, Vrushabh; Kaul, Samvit; Arunachalam, Meena; Maeng, Kiwan; Jog, Adwait; Sivasubramaniam, Anand; Kandemir, Mahmut Taylan; et al (June 2023, International Symposium on Computer Architecture 2023)

Deep Learning Recommendation Models (DLRMs) are very popular in personalized recommendation systems and are a major contributor to the data-center AI cycles. Due to the high computational and memory bandwidth needs of DLRMs, specifically the embedding stage in DLRM inferences, both CPUs and GPUs are used for hosting such workloads. This is primarily because of the heavy irregular memory accesses in the embedding stage of computation that lead to significant stalls in the CPU pipeline. As the model and parameter sizes keep increasing with newer recommendation models, the computational dominance of the embedding stage also grows, thereby bringing into question the suitability of CPUs for inference. In this paper, we first quantify the cause of irregular accesses and their impact on caches and observe that off-chip memory access is the main contributor to high latency. Therefore, we exploit two well-known techniques: (1) software prefetching, to hide the memory access latency suffered by the demand loads, and (2) overlapping computation and memory accesses, to reduce CPU stalls via hyperthreading and minimize the overall execution time. We evaluate our work on a single-core and a 24-core configuration with the latest recommendation models and recently released production traces. Our integrated techniques speed up the inference by up to 1.59x, and on average by 1.4x.
Improving GPU Throughput through Parallel Execution Using Tensor Cores and CUDA Cores

https://doi.org/10.1109/ISVLSI54635.2022.00051

Ho, Khoa; Zhao, Hui; Jog, Adwait; Mohanty, Saraju (July 2022, 2022 IEEE Computer Society Annual Symposium on VLSI (ISVLSI))

Full Text Available
Data-centric Reliability Management in GPUs

https://doi.org/10.1109/DSN48987.2021.00040

Kadam, Gurunath; Smirni, Evgenia; Jog, Adwait (June 2021, 51st Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN))
Full Text Available
Enabling Software Resilience in GPGPU Applications via Partial Thread Protection

https://doi.org/10.1109/ICSE43902.2021.00114

Yang, Lishan; Nie, Bin; Jog, Adwait; Smirni, Evgenia (May 2021, 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE))
Full Text Available
Asynchronous Automata Processing on GPUs

https://doi.org/10.1145/3579453

Liu, Hongyuan; Pai, Sreepathi; Jog, Adwait (March 2023, Proceedings of the ACM on Measurement and Analysis of Computing Systems)

Finite-state automata serve as compute kernels for many application domains such as pattern matching and data analytics. Existing approaches on GPUs exploit three levels of parallelism in automata processing tasks: 1) input-stream level, 2) automaton level, and 3) state level. Among these, only state-level parallelism is intrinsic to automata, while the other two levels of parallelism depend on the number of automata and input streams to be processed. As GPU resources increase, a parallelism-limited automata processing task can underutilize GPU compute resources. To this end, we propose AsyncAP, a low-overhead approach that optimizes for both scalability and throughput. Our insight is that most automata processing tasks have an additional source of parallelism originating from the input symbols, which has not been leveraged before. Making the matching process associated with the automata tasks asynchronous, i.e., parallel GPU threads start processing an input stream from different input locations instead of processing it serially, improves throughput significantly and scales with input length. When the task does not have enough parallelism to utilize all the GPU cores, detailed evaluation across 12 applications shows that AsyncAP achieves up to 58× speedup on average over the state-of-the-art GPU automata processing engine. When the tasks have enough parallelism to utilize GPU cores, AsyncAP still achieves 2.4× speedup.
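The observation that matching can begin at different input locations independently can be illustrated with a toy CPU sketch. This is not the AsyncAP algorithm: the automaton, the function names, and the chunking scheme are all hypothetical, and Python threads are used only to show that the per-offset work items do not depend on each other (they give no real speedup here).

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical toy automaton: accepts "ab" followed by one or more "c".
TRANSITIONS = {(0, "a"): 1, (1, "b"): 2, (2, "c"): 3, (3, "c"): 3}
ACCEPTING = {3}


def match_from(text, start):
    """Run the automaton from a single start offset; return (start, end) match spans."""
    state, spans = 0, []
    for pos in range(start, len(text)):
        state = TRANSITIONS.get((state, text[pos]))
        if state is None:  # no transition: this start offset is finished
            break
        if state in ACCEPTING:
            spans.append((start, pos + 1))
    return spans


def match_range(text, lo, hi):
    """Serially process every start offset in [lo, hi)."""
    spans = []
    for start in range(lo, hi):
        spans.extend(match_from(text, start))
    return spans


def match_parallel(text, workers=4):
    """Split the start offsets into contiguous ranges and process them independently."""
    n = len(text)
    step = max(1, -(-n // workers))  # ceiling division
    ranges = [(lo, min(lo + step, n)) for lo in range(0, n, step)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        parts = list(pool.map(lambda r: match_range(text, r[0], r[1]), ranges))
    return [span for part in parts for span in part]


if __name__ == "__main__":
    print(match_parallel("xabccab abc"))
```

Each worker may read past the end of its own range of start offsets, since a match that begins in one range can end in another. Only the start offsets are partitioned, not the input itself.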
SUGAR: Speeding Up GPGPU Application Resilience Estimation with Input Sizing

https://doi.org/10.1145/3447375

Yang, Lishan; Nie, Bin; Jog, Adwait; Smirni, Evgenia (February 2021, Proceedings of the ACM on Measurement and Analysis of Computing Systems)
As Graphics Processing Units (GPUs) are becoming a de facto solution for accelerating a wide range of applications, their reliable operation is becoming increasingly important. One of the major challenges in the domain of GPU reliability is to accurately measure GPGPU application error resilience. This challenge stems from the fact that a typical GPGPU application spawns a huge number of threads and then utilizes a large amount of potentially unreliable compute and memory resources available on the GPUs. As the number of possible fault locations can be in the billions, evaluating every fault and examining its effect on the application error resilience is impractical. Application resilience is evaluated via extensive fault injection campaigns based on sampling of an extensive fault site space. Typically, the larger the input of the GPGPU application, the longer the experimental campaign. In this work, we devise a methodology, SUGAR (Speeding Up GPGPU Application Resilience Estimation with input sizing), that dramatically speeds up the evaluation of GPGPU application error resilience by judicious input sizing. We show how analyzing a small fraction of the input is sufficient to estimate the application resilience with high accuracy and dramatically reduce the duration of experimentation. Key to our estimation methodology is the discovery of repeating patterns as a function of the input size. Using the well-established fact that error resilience in GPGPU applications is mostly determined by the dynamic instruction count at the thread level, we identify the patterns that allow us to accurately predict application error resilience for arbitrarily large inputs. For the cases that we examine in this paper, this new resilience estimation mechanism provides significant speedups (up to 1336 times, and 97.0 times on average), while keeping estimation errors to less than 1%.
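The sampling-based fault injection campaign that the abstract takes as its baseline can be sketched on a toy example. This is only an illustration of the general technique, not SUGAR itself (which concerns input sizing): the kernel, the fault model (one bit flip in a single-precision input value), and all names are assumptions.

```python
import random
import struct


def flip_bit(value, bit):
    """Flip one bit (0..31) of a value's IEEE-754 single-precision encoding."""
    (raw,) = struct.unpack("<I", struct.pack("<f", value))
    (out,) = struct.unpack("<f", struct.pack("<I", raw ^ (1 << bit)))
    return out


def kernel(data):
    """Hypothetical toy kernel: index of the largest element (a classification-like output)."""
    return max(range(len(data)), key=lambda i: data[i])


def estimate_masked_fraction(data, samples, seed=0):
    """Sample (element, bit) fault sites and return the fraction whose fault is masked."""
    rng = random.Random(seed)
    golden = kernel(data)  # fault-free reference output
    masked = 0
    for _ in range(samples):
        index = rng.randrange(len(data))
        bit = rng.randrange(32)
        faulty = list(data)
        faulty[index] = flip_bit(faulty[index], bit)
        if kernel(faulty) == golden:  # output unchanged: the fault was masked
            masked += 1
    return masked / samples


if __name__ == "__main__":
    print(estimate_masked_fraction([0.1, 0.9, 0.3, 0.2], samples=1000))
```

The full fault site space here has only 4 × 32 entries, so sampling is unnecessary. In a real GPGPU application the space can reach billions of sites, which is why the campaign length matters.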
Full Text Available
Practical Resilience Analysis of GPGPU Applications in the Presence of Single- and Multi-Bit Faults

https://doi.org/10.1109/TC.2020.2980541

Yang, Lishan; Nie, Bin; Jog, Adwait; Smirni, Evgenia (January 2021, IEEE Transactions on Computers)
Full Text Available