NSF PAR Search | NSF Public Access Repository

Fault Site Pruning for Practical Reliability Analysis of GPGPU Applications

https://doi.org/10.1109/MICRO.2018.00066

Nie, Bin; Yang, Lishan; Jog, Adwait; Smirni, Evgenia (October 2018, 51st Annual IEEE/ACM International Symposium on Microarchitecture (MICRO))

Graphics Processing Units (GPUs) have rapidly evolved to enable energy-efficient data-parallel computing for a broad range of scientific areas. While GPUs achieve exascale performance at a stringent power budget, they are also susceptible to soft errors, often caused by high-energy particle strikes, that can significantly affect the application output quality. Understanding the resilience of general purpose GPU applications is the purpose of this study. To this end, it is imperative to explore the range of application output by injecting faults at all the potential fault sites. This problem is especially challenging because, unlike CPU applications, which are mostly single-threaded, GPGPU applications can contain hundreds to thousands of threads, resulting in a tremendously large fault site space, on the order of billions even for some simple applications. In this paper, we present a systematic way to progressively prune the fault site space, aiming to dramatically reduce the number of fault injections so that assessing GPGPU application error resilience becomes practical. The key insight behind our proposed methodology is that GPGPU applications spawn many threads, yet many of those threads execute the same set of instructions. Therefore, several fault sites are redundant and can be pruned by a careful analysis of faults across threads and instructions. We identify important features across a set of 10 applications (16 kernels) from the Rodinia and Polybench suites and conclude that threads can first be classified based on the number of dynamic instructions they execute. We achieve significant fault site reduction by analyzing only a small subset of threads that are representative of the dynamic instruction behavior (and therefore error resilience behavior) of the GPGPU applications. Further pruning is achieved by identifying and analyzing: a) the dynamic instruction commonalities (and differences) across code blocks within this representative set of threads, b) a subset of loop iterations within the representative threads, and c) a subset of destination register bit positions. The above steps reduce the number of fault sites by up to seven orders of magnitude. Yet, this reduced fault site space accurately captures the error resilience profile of GPGPU applications.
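The first pruning step described in this abstract (classifying threads by dynamic instruction count and keeping one representative per class) can be illustrated with a minimal Python sketch. This is not the authors' tool: the function names, the per-thread instruction counts, and the assumption of 32 destination-register bits per instruction are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def pick_representatives(dyn_inst_counts):
    """Group thread ids by dynamic instruction count.

    Returns a dict mapping each distinct count to one representative
    thread id (the first thread seen with that count).
    """
    groups = defaultdict(list)
    for tid, count in enumerate(dyn_inst_counts):
        groups[count].append(tid)
    return {count: tids[0] for count, tids in groups.items()}


def fault_sites(dyn_inst_counts, thread_ids, bits_per_register=32):
    """Count fault sites as (dynamic instructions x register bits).

    The 32-bit register width is an assumption for this sketch.
    """
    return sum(dyn_inst_counts[t] * bits_per_register for t in thread_ids)


# Hypothetical kernel: six threads execute 100 dynamic instructions,
# two threads execute 250.
counts = [100] * 6 + [250] * 2

reps = pick_representatives(counts)
full = fault_sites(counts, range(len(counts)))
pruned = fault_sites(counts, reps.values())

print(f"representatives: {reps}")
print(f"fault sites: {full} -> {pruned}")
```

In this toy setting only two of the eight threads need fault injection. The later pruning steps named in the abstract (code blocks, loop iterations, bit positions) would shrink the space further and are not modeled here.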
Full Text Available
CEDULE: A Scheduling Framework for Burstable Performance in Cloud Computing

https://doi.org/10.1109/ICAC.2018.00024

Ali, Ahsan; Pinciroli, Riccardo; Yan, Feng; Smirni, Evgenia (September 2018, 2018 IEEE International Conference on Autonomic Computing (ICAC))

Full Text Available
Machine Learning Models for GPU Error Prediction in a Large Scale HPC System

https://doi.org/10.1109/DSN.2018.00022

Nie, Bin; Xue, Ji; Gupta, Saurabh; Patel, Tirthak; Engelmann, Christian; Smirni, Evgenia; Tiwari, Devesh (June 2018, 2018 48th Annual IEEE/IFIP International Conference on Dependable Systems and Networks (DSN))

GPUs are widely deployed on large-scale HPC systems to provide powerful computational capability for scientific applications from various domains. As those applications are normally long-running, investigating the characteristics of GPU errors becomes imperative for reliability. In this paper, we first study the system conditions that trigger GPU errors using six-month trace data collected from a large-scale, operational HPC system. Then, we use machine learning to predict the occurrence of GPU errors, by taking advantage of temporal and spatial dependencies of the trace data. The resulting machine learning prediction framework is robust and accurate under different workloads.
Full Text Available
Spatial–Temporal Prediction Models for Active Ticket Managing in Data Centers

https://doi.org/10.1109/TNSM.2018.2794409

Xue, Ji; Birke, Robert; Chen, Lydia Y.; Smirni, Evgenia (March 2018, IEEE Transactions on Network and Service Management)

Full Text Available
Efficient Deep Neural Network Serving: Fast and Furious

https://doi.org/10.1109/TNSM.2018.2808352

Yan, Feng; He, Yuxiong; Ruwase, Olatunji; Smirni, Evgenia (March 2018, IEEE Transactions on Network and Service Management)

Full Text Available
Fill-in the gaps: Spatial-temporal models for missing data

https://doi.org/10.23919/CNSM.2017.8255983

Xue, Ji; Nie, Bin; Smirni, Evgenia (November 2017, 13th International Conference on Network and Service Management, CNSM 2017)

Effective workload characterization and prediction are instrumental for efficiently and proactively managing large systems. System management primarily relies on the workload information provided by underlying system tracing mechanisms that record system-related events in log files. However, such tracing mechanisms may temporarily fail for various reasons, yielding “holes” in data traces. This missing data phenomenon significantly impedes the effectiveness of data analysis. In this paper, we study real-world data traces collected from over 80K virtual machines (VMs) hosted on 6K physical boxes in the data centers of a service provider. We discover that the usage series of VMs co-located on the same physical box exhibit strong correlation with one another, and that most VM usage series show temporal patterns. By taking advantage of the observed spatial and temporal dependencies, we propose a data-filling method to predict the missing data in the VM usage series. Detailed evaluation using trace data in the wild shows that the proposed method is sufficiently accurate, achieving an average absolute percentage error of 20%. We also illustrate its usefulness via a use case.
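To make the idea of combining spatial and temporal dependencies concrete, here is a minimal Python sketch. It is not the paper's data-filling method: the estimator (an average of a scaled co-located series and the last observed value), the function names, and the sample data are assumptions made for illustration. The error metric is the standard mean absolute percentage error.

```python
def fill_gaps(target, neighbor):
    """Fill None entries in `target` using a co-located `neighbor` series.

    Spatial estimate: neighbor value scaled by the ratio of the two
    series over the points where `target` was observed.
    Temporal estimate: the last observed value of `target`.
    The two estimates are simply averaged (an illustrative choice).
    """
    observed = [(t, n) for t, n in zip(target, neighbor) if t is not None]
    ratio = sum(t for t, _ in observed) / sum(n for _, n in observed)

    filled, last_seen = [], None
    for t, n in zip(target, neighbor):
        if t is None:
            spatial = ratio * n
            temporal = last_seen if last_seen is not None else spatial
            filled.append(0.5 * (spatial + temporal))
        else:
            filled.append(t)
            last_seen = t
    return filled


def mean_abs_pct_error(predicted, truth, indices):
    """Mean absolute percentage error over the given indices."""
    errors = [abs(predicted[i] - truth[i]) / abs(truth[i]) for i in indices]
    return 100.0 * sum(errors) / len(errors)


# Hypothetical CPU usage (%) of two VMs on the same physical box.
truth = [10.0, 12.0, 14.0, 16.0]
target = [10.0, 12.0, None, 16.0]   # one "hole" in the trace
neighbor = [20.0, 24.0, 28.0, 32.0]

filled = fill_gaps(target, neighbor)
mape = mean_abs_pct_error(filled, truth, indices=[2])
print(filled, f"MAPE = {mape:.2f}%")
```

A real evaluation would compute the error over many holes and many VM series rather than a single point.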
Full Text Available
Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities

https://doi.org/10.1109/MASCOTS.2017.12

Nie, Bin; Xue, Ji; Gupta, Saurabh; Engelmann, Christian; Smirni, Evgenia; Tiwari, Devesh (September 2017, 2017 IEEE 25th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS))

GPUs have become part of the mainstream high performance computing facilities that increasingly require more computational power to simulate physical phenomena quickly and accurately. However, GPU nodes also consume significantly more power than traditional CPU nodes, and high power consumption introduces new system operation challenges, including increased temperature, higher power/cooling cost, and lower system reliability. This paper explores how power consumption and temperature characteristics affect reliability, provides insights into the implications of this understanding, and shows how to exploit these insights to predict GPU errors using neural networks.
Full Text Available
How to Supercharge the Amazon T2: Observations and Suggestions

https://doi.org/10.1109/CLOUD.2017.43

Yan, Feng; Ren, Lihua; Dubois, Daniel J.; Casale, Giuliano; Wen, Jiawei; Smirni, Evgenia (June 2017, 2017 IEEE 10th International Conference on Cloud Computing (CLOUD))

Cloud service providers adopt a credit system to allow users to obtain periods of performance bursts without additional cost. For example, the Amazon EC2 T2 instance offers low baseline performance and the capability to achieve short periods of high performance using CPU credits. Once a T2 instance is created and assigned some initial credits, there is a transient period, while its CPU utilization is above the baseline threshold, during which performance is boosted and the assigned CPU credits are consumed. After all credits are used, the maximum achievable performance drops to baseline. Credits accrue periodically when the instance utilization is below the baseline threshold. This paper proposes a methodology to increase the performance benefits of T2 by seamlessly extending the duration of the transient period while maintaining high performance. This extension of the high performance transient period is combined with proactive migration to further take advantage of the initially assigned credits. We conduct experiments to demonstrate the benefits of this methodology for both single-tier and multi-tier applications.
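The credit mechanism summarized in this abstract can be sketched as a toy simulation in Python. It follows only the behavior stated above (initial credits, consumption above the baseline, accrual below it, a drop to baseline once credits run out). The units, parameter values, and the function name are hypothetical and do not reproduce Amazon's actual credit accounting or the paper's methodology.

```python
def simulate_credits(demand_series, baseline, initial_credits, accrual_per_step):
    """Simulate burstable performance under a simple credit scheme.

    All quantities use the same abstract unit (for example, CPU percent
    per time step). Returns the delivered performance per step and the
    remaining credit balance.
    """
    credits = initial_credits
    delivered = []
    for demand in demand_series:
        if demand > baseline:
            # Bursting: spend credits for the portion above the baseline.
            burst = min(demand - baseline, credits)
            credits -= burst
            delivered.append(baseline + burst)
        else:
            # At or below the baseline: credits accrue (toy assumption
            # that accrual also applies exactly at the baseline).
            credits += accrual_per_step
            delivered.append(demand)
    return delivered, credits


# Hypothetical workload: three busy steps, one idle step, one busy step.
delivered, remaining = simulate_credits(
    demand_series=[50, 50, 50, 5, 50],
    baseline=10,
    initial_credits=60,
    accrual_per_step=4,
)
print(delivered, remaining)
```

In this run the instance bursts fully in the first step, partially in the second, and falls back to the baseline in the third once the initial credits are exhausted, which is the transient period the abstract describes.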
Full Text Available
