GPU memory corruption and in particular double-bit errors (DBEs) remain one of the least understood aspects of HPC system reliability. Albeit rare, their occurrences always lead to job termination and can potentially cost thousands of node-hours, either from wasted computations or as the overhead from regular checkpointing needed to minimize the losses. As supercomputers and their components simultaneously grow in scale, density, failure rates, and environmental footprint, the efficiency of HPC operations becomes both an imperative and a challenge. We examine DBEs using system telemetry data and logs collected from the Summit supercomputer, equipped with 27,648 Tesla V100 GPUs with 2nd-generation high-bandwidth memory (HBM2). Using exploratory data analysis and statistical learning, we extract several insights about memory reliability in such GPUs. We find that GPUs with prior DBE occurrences are prone to experience them again due to otherwise harmless factors, correlate this phenomenon with GPU placement, and suggest manufacturing variability as a factor. On the general population of GPUs, we link DBEs to short- and long-term high power consumption modes while finding no significant correlation with higher temperatures. We also show that the workload type can be a factor in memory's propensity to corruption.
Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities
GPUs have become part of mainstream high-performance computing facilities, which increasingly require more computational power to simulate physical phenomena quickly and accurately. However, GPU nodes also consume significantly more power than traditional CPU nodes, and high power consumption introduces new system operation challenges, including increased temperature, higher power and cooling costs, and lower system reliability. This paper explores how power consumption and temperature characteristics affect reliability, provides insights into the implications of this understanding, and shows how to exploit these insights to predict GPU errors using neural networks.
- Award ID(s):
- 1649087
- PAR ID:
- 10065577
- Date Published:
- Journal Name:
- 2017 IEEE 25th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)
- Page Range / eLocation ID:
- 22 to 31
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
-
Stacking more systems into a compact area or scaling devices to increase the density of integration are two approaches to provide greater functional complexity. Excessive heat generated as a result of these technology advancements leads to an increase in leakage power and degradation in system reliability. Hence, a thermal aware system composed of hundreds of distributed thermal sensor nodes is needed. Such a system requires an efficient thermal sensor placed close to the thermal hotspots, small in size, fast response, and CMOS compatibility. In this paper, two hybrid spintronic/CMOS circuits are proposed. These circuits exhibit a low power consumption of 11.9 μW during the on-state, a linearity (R^2) of 0.96 over the industrial temperature range of operation (-40 to 125 °C), and a sensitivity of 3.78 mV/K.
-
This paper proposes long-term reliability management for spatial multitasking GPU architectures. Specifically, we focus on electromigration (EM)-induced long-term failure of the GPU's power delivery network. A distributed power delivery network model at functional unit granularity is developed and used for our EM analysis of GPU architectures. We use a recently proposed physics-based EM reliability model and consider the EM-induced time-to-failure at the GPU system level as a reliability resource. For GPU scheduling, we mainly focus on spatial multitasking, which allows GPU computing resources to be partitioned among multiple applications. We find that the existing reliability-agnostic thread block scheduler for spatial multitasking is effective in achieving high GPU utilization, but yields poor reliability. We develop and implement a long-term reliability-aware thread block scheduler in GPGPU-Sim, and compare it against the existing reliability-agnostic scheduler. We evaluate several use cases of spatial multitasking and find that our proposed scheduler achieves up to 30% improvement in long-term reliability.
-
Emerging virtual and augmented reality applications are envisioned to significantly enhance user experiences. An important issue related to user experience is thermal management in smartphones widely adopted for virtual and augmented reality applications. Although smartphone overheating has been reported many times, a systematic measurement and analysis of their thermal behaviors is relatively scarce, especially for virtual and augmented reality applications. To address the issue, we build a temperature measurement and analysis framework for virtual and augmented reality applications using a robot, infrared cameras, and smartphones. Using the framework, we analyze a comprehensive set of data including the battery power consumption, smartphone surface temperature, and temperature of key hardware components, such as the battery, CPU, GPU, and WiFi module. When a 360° virtual reality video is streamed to a smartphone, the phone surface temperature reaches near 39 °C. Also, the temperature of the phone surface and its main hardware components generally increases till the end of our 20-minute experiments despite thermal control undertaken by smartphones, such as CPU/GPU frequency scaling. Our thermal analysis results of a popular AR game are even more serious: the battery power consumption frequently exceeds the thermal design power by 20–80%, while the peak battery, CPU, GPU, and WiFi module temperature exceeds 45, 70, 70, and 65 °C, respectively.
-
In the US alone, data centers consumed around $20 billion (200 TWh) of electricity in 2016, and this amount doubles every five years. Data storage alone is estimated to be responsible for about 25% to 35% of data-center power consumption. Servers in data centers generally include multiple HDDs or SSDs, commonly arranged in a RAID level for better performance, reliability, and availability. In this study, we evaluate the impact of HDD- and SSD-based Linux (md) software RAIDs on the energy consumption of popular servers. We used the Filebench workload generator to emulate three common server workloads (web, file, and mail) and measured the energy consumption of the system using the HOBO power meter. We observed some similarities and some differences in the energy consumption characteristics of HDD and SSD RAIDs, and provide our insights for better energy efficiency. We hope that our observations will shed light on new energy-efficient RAID designs tailored to the specific energy consumption characteristics of HDD and SSD RAIDs.

