
Search for: All records

Award ID contains: 1649087


  1. Graphics Processing Units (GPUs) have rapidly evolved to enable energy-efficient data-parallel computing for a broad range of scientific areas. While GPUs achieve exascale performance at a stringent power budget, they are also susceptible to soft errors, often caused by high-energy particle strikes, that can significantly affect application output quality. This study aims to understand the resilience of general-purpose GPU (GPGPU) applications. To this end, it is imperative to explore the range of application output by injecting faults at all potential fault sites. This problem is especially challenging because, unlike CPU applications, which are mostly single-threaded, GPGPU applications can contain hundreds to thousands of threads, resulting in a tremendously large fault site space, on the order of billions even for some simple applications. In this paper, we present a systematic way to progressively prune the fault site space, aiming to dramatically reduce the number of fault injections so that assessing GPGPU application error resilience becomes practical. The key insight behind our methodology is that GPGPU applications spawn many threads, but many of those threads execute the same set of instructions. Therefore, several fault sites are redundant and can be pruned by a careful analysis of faults across threads and instructions. We identify important features across a set of 10 applications (16 kernels) from the Rodinia and Polybench suites and conclude that threads can first be classified by the number of dynamic instructions they execute. We achieve significant fault site reduction by analyzing only a small subset of threads that are representative of the dynamic instruction behavior (and therefore the error resilience behavior) of the GPGPU applications. Further pruning is achieved by identifying and analyzing: a) the dynamic instruction commonalities (and differences) across code blocks within this representative set of threads, b) a subset of loop iterations within the representative threads, and c) a subset of destination register bit positions. Together, these steps reduce the number of fault sites by up to seven orders of magnitude, yet the reduced fault site space still accurately captures the error resilience profile of GPGPU applications.
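The first pruning step described in this abstract, classifying threads by their dynamic instruction count and analyzing only representatives, can be illustrated with a minimal sketch. It is not the authors' implementation. The per-thread instruction counts, the 32-bit register width, the rule of picking the lowest thread ID as representative, and all function names are assumptions made purely for illustration.

```python
from collections import defaultdict

BITS_PER_REGISTER = 32  # assumed destination register width (illustrative)


def total_fault_sites(instr_counts, bits=BITS_PER_REGISTER):
    """Exhaustive space: one site per (thread, dynamic instruction, bit)."""
    return sum(instr_counts.values()) * bits


def group_threads_by_instruction_count(instr_counts):
    """Group thread IDs that execute the same number of dynamic instructions."""
    groups = defaultdict(list)
    for thread_id, count in instr_counts.items():
        groups[count].append(thread_id)
    return dict(groups)


def representative_fault_sites(instr_counts, bits=BITS_PER_REGISTER):
    """Keep one representative thread per group and count only its fault sites."""
    groups = group_threads_by_instruction_count(instr_counts)
    # Assumption: the lowest thread ID stands in for its whole group.
    representatives = {count: min(tids) for count, tids in groups.items()}
    reduced = sum(count * bits for count in representatives)
    return representatives, reduced


# Hypothetical kernel: 1024 threads; every 32nd thread takes a short path.
instr_counts = {tid: (40 if tid % 32 == 0 else 100) for tid in range(1024)}

full = total_fault_sites(instr_counts)
representatives, reduced = representative_fault_sites(instr_counts)
print(full, reduced, representatives)
```

In this toy setup, two representative threads stand in for 1024, which shrinks the space from 3,215,360 candidate sites to 4,480. The further steps named in the abstract (code blocks, loop iterations, and bit positions) would prune within those representatives and are not modeled here.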
  2. GPUs are widely deployed on large-scale HPC systems to provide powerful computational capability for scientific applications from various domains. As those applications are normally long-running, investigating the characteristics of GPU errors becomes imperative for reliability. In this paper, we first study the system conditions that trigger GPU errors using six-month trace data collected from a large-scale, operational HPC system. Then, we use machine learning to predict the occurrence of GPU errors, by taking advantage of temporal and spatial dependencies of the trace data. The resulting machine learning prediction framework is robust and accurate under different workloads. 
  3. Effective workload characterization and prediction are instrumental for efficiently and proactively managing large systems. System management primarily relies on the workload information provided by underlying system tracing mechanisms that record system-related events in log files. However, such tracing mechanisms may temporarily fail for various reasons, yielding “holes” in data traces. This missing data significantly impedes the effectiveness of data analysis. In this paper, we study real-world data traces collected from over 80K virtual machines (VMs) hosted on 6K physical boxes in the data centers of a service provider. We discover that the usage series of VMs co-located on the same physical box exhibit strong correlation with one another, and that most VM usage series show temporal patterns. By taking advantage of the observed spatial and temporal dependencies, we propose a data-filling method to predict the missing data in the VM usage series. Detailed evaluation using trace data in the wild shows that the proposed method is sufficiently accurate, achieving an average absolute percentage error of 20%. We also illustrate its usefulness via a use case.
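To make the idea of combining spatial and temporal dependencies concrete, here is a minimal sketch of one possible data-filling scheme. It is not the method proposed in the paper. The temporal estimate (mean of the nearest observed neighbours in time), the spatial estimate (a least-squares fit against one co-located VM), the equal-weight blend, and all names and sample values are assumptions for illustration only.

```python
def temporal_estimate(series, t):
    """Average the nearest observed values before and after index t."""
    prev = next((series[i] for i in range(t - 1, -1, -1)
                 if series[i] is not None), None)
    nxt = next((series[i] for i in range(t + 1, len(series))
                if series[i] is not None), None)
    known = [v for v in (prev, nxt) if v is not None]
    return sum(known) / len(known) if known else None


def spatial_estimate(target, neighbor, t):
    """Predict target[t] from a co-located VM with a least-squares line."""
    if neighbor[t] is None:
        return None
    pairs = [(x, y) for x, y in zip(neighbor, target)
             if x is not None and y is not None]
    if len(pairs) < 2:
        return None
    mean_x = sum(x for x, _ in pairs) / len(pairs)
    mean_y = sum(y for _, y in pairs) / len(pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0:
        return mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / sxx
    return mean_y + slope * (neighbor[t] - mean_x)


def fill_missing(target, neighbor):
    """Fill each hole (None) with the mean of the available estimates."""
    filled = list(target)
    for t, value in enumerate(target):
        if value is None:
            estimates = [e for e in (temporal_estimate(target, t),
                                     spatial_estimate(target, neighbor, t))
                         if e is not None]
            if estimates:
                filled[t] = sum(estimates) / len(estimates)
    return filled


def absolute_percentage_error(predicted, actual):
    """Absolute percentage error of a single prediction, in percent."""
    return abs(predicted - actual) / abs(actual) * 100.0


# Hypothetical CPU usage (%) of two VMs on the same physical box.
neighbor_usage = [10.0, 20.0, 30.0, 40.0, 50.0]
target_usage = [20.0, 40.0, None, 80.0, 100.0]  # one hole at index 2
filled_usage = fill_missing(target_usage, neighbor_usage)
print(filled_usage)
```

Averaging `absolute_percentage_error` over holes whose true values were deliberately withheld gives the kind of accuracy figure the abstract reports.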
  4. GPUs have become part of the mainstream high performance computing facilities that increasingly require more computational power to simulate physical phenomena quickly and accurately. However, GPU nodes also consume significantly more power than traditional CPU nodes, and high power consumption introduces new system operation challenges, including increased temperature, higher power and cooling costs, and lower system reliability. This paper explores how power consumption and temperature characteristics affect reliability, discusses the implications of this understanding, and shows how to exploit these insights to predict GPU errors using neural networks.
  5. Cloud service providers adopt a credit system to allow users to obtain periods of performance bursts without additional cost. For example, the Amazon EC2 T2 instance offers low baseline performance and the capability to achieve short periods of high performance using CPU credits. Once a T2 instance is created and assigned some initial credits, there is a transient period, while its CPU utilization is above the baseline threshold, during which performance is boosted and the assigned CPU credits are consumed. After all credits are used, the maximum achievable performance drops to baseline. Credits accrue periodically when the instance utilization is below the baseline threshold. This paper proposes a methodology to increase the performance benefits of T2 by seamlessly extending the duration of the transient period while maintaining high performance. This extension of the high performance transient period is combined with proactive migration to further take advantage of the initially assigned credits. We conduct experiments to demonstrate the benefits of this methodology for both single-tier and multi-tier applications.
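The credit mechanism described in this abstract (a boosted transient period that ends when credits run out) can be sketched as a simple per-minute simulation. This models only the baseline behaviour, not the paper's extension or migration methodology. The function name and the default baseline, initial balance, and cap are illustrative assumptions. The one-minute accounting step is a simplification.

```python
def simulate_credits(demand, baseline=0.10, initial_credits=30.0,
                     max_credits=144.0):
    """Simulate burstable CPU credits, one step per minute.

    demand: requested CPU utilization per minute, as a fraction in [0, 1].
    One credit is modeled as one vCPU-minute at 100% utilization.
    Returns the achieved utilization per minute and the final balance.
    All default parameter values are illustrative assumptions.
    """
    balance = initial_credits
    achieved = []
    for requested in demand:
        # Bursting above baseline is limited by the credits on hand.
        used = min(requested, baseline + balance)
        # Credits accrue at the baseline rate and are spent by actual usage.
        balance = max(0.0, min(max_credits, balance + baseline - used))
        achieved.append(used)
    return achieved, balance


# Hypothetical workload: one hour of sustained full CPU demand.
achieved, final_balance = simulate_credits([1.0] * 60)
burst_minutes = sum(1 for u in achieved if u == 1.0)
print(burst_minutes, achieved[-1], final_balance)
```

Under these assumed parameters, the instance runs at full speed for roughly the first half hour and is then held at baseline, which is the transient period the proposed methodology aims to extend.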