

Title: Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities
GPUs have become a mainstream part of high-performance computing facilities, which require ever more computational power to simulate physical phenomena quickly and accurately. However, GPU nodes consume significantly more power than traditional CPU nodes, and high power consumption introduces new system operation challenges, including higher temperatures, higher power and cooling costs, and lower system reliability. This paper explores how power consumption and temperature characteristics affect reliability, discusses the implications of these findings, and shows how to exploit them to predict GPU errors using neural networks.
Award ID(s):
1649087
NSF-PAR ID:
10065577
Journal Name:
2017 IEEE 25th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)
Page Range / eLocation ID:
22 to 31
Sponsoring Org:
National Science Foundation
More Like this
  1. Stacking more systems into a compact area and scaling devices to increase integration density are two approaches to providing greater functional complexity. The excessive heat generated by these technology advancements increases leakage power and degrades system reliability. Hence, a thermal-aware system composed of hundreds of distributed thermal sensor nodes is needed. Such a system requires efficient thermal sensors that can be placed close to thermal hotspots, are small, respond quickly, and are CMOS compatible. In this paper, two hybrid spintronic/CMOS circuits are proposed. These circuits exhibit a low power consumption of 11.9 μW during the on-state, a linearity (R^2) of 0.96 over the industrial operating temperature range of −40 to 125 °C, and a sensitivity of 3.78 mV/K.
  2. Deep Neural Networks (DNNs) have gained tremendous popularity in recent years for several computer vision tasks, including classification and object detection. Such techniques have achieved human-level performance in many tasks and have produced results of unprecedented accuracy. Because DNNs have intense computational requirements in most applications, they rely on a cluster of computers or a cutting-edge Graphics Processing Unit (GPU), which often consumes excessive power and generates a lot of heat. In many robotics applications these requirements are a challenge, as on-board power is limited and heat dissipation is always a problem. In underwater robotics in particular, where space is limited, these two requirements have proven prohibitive. As the first of its kind, this paper analyzes and compares the performance of several state-of-the-art DNNs on different platforms. With a focus on the underwater domain, the capabilities of the Jetson TX2 from NVIDIA and the Neural Compute Stick from Intel are of particular interest. Experiments on standard datasets show how usable the different platforms are on an actual robotic system, providing insights into current state-of-the-art embedded systems. Based on these results, we propose some guidelines for choosing the appropriate platform and network architecture for a robotic system.
  3. The continuously increasing demand for availability and ultra-reliability of low-latency and broadband wireless connections is instigating further research in the standardization of next-generation mobile systems. 6G networks, among other benefits, should offer global ubiquitous mobility thanks to the utilization of the Space segment as an intelligent yet autonomous ecosystem. In this framework, multi-layered networks will take charge of providing connectivity by implementing Cloud-Radio Access Network (C-RAN) functionalities on heterogeneous nodes distributed over aerial and orbital segments. Unmanned Aerial Vehicles (UAVs), High-Altitude Platforms (HAPs), and small satellites compose the Space ecosystem encompassing the 3D networks. Recently, considerable interest has arisen in splitting operations to distribute baseband processing functionalities among such nodes to balance the computational load and reduce the power consumption. This work focuses on the hardware development of C-RAN physical (PHY-) layer operations to derive their computational and energy demand. In more detail, the 5G Downlink Shared Channel (DLSCH) and the Physical Downlink Shared Channel (PDSCH) are first simulated in the MATLAB environment to evaluate the variation of computational load depending on the selected splitting options and the number of antennas available at the transmitter (TX) and receiver (RX) sides. Then, the PHY-layer processing chain is implemented in software and the various splitting options are tested on low-cost processors, such as the Raspberry Pi (RP) 3B+ and 4B. By overclocking the RPs, we compute the execution time and derive the instruction count (IC) per program for each considered splitting option, so as to obtain the mega instructions per second (MIPS) for the expected processing time. Finally, by comparing the performance achieved by the employed RPs with that of the Nvidia Jetson Nano (JN) processor used as a benchmark, we discuss size, weight, power and cost (SWaP-C)...
  4. Emerging virtual and augmented reality applications are envisioned to significantly enhance user experiences. An important issue related to user experience is thermal management in smartphones, which are widely adopted for virtual and augmented reality applications. Although smartphone overheating has been reported many times, systematic measurement and analysis of smartphone thermal behavior is relatively scarce, especially for virtual and augmented reality applications. To address this issue, we build a temperature measurement and analysis framework for virtual and augmented reality applications using a robot, infrared cameras, and smartphones. Using the framework, we analyze a comprehensive set of data including the battery power consumption, the smartphone surface temperature, and the temperature of key hardware components, such as the battery, CPU, GPU, and WiFi module. When a 360° virtual reality video is streamed to a smartphone, the phone surface temperature reaches nearly 39 °C. Also, the temperature of the phone surface and its main hardware components generally increases until the end of our 20-minute experiments despite the thermal control undertaken by smartphones, such as CPU/GPU frequency scaling. The thermal analysis results for a popular AR game are even more severe: the battery power consumption frequently exceeds the thermal design power by 20–80%, while the peak battery, CPU, GPU, and WiFi module temperatures exceed 45, 70, 70, and 65 °C, respectively.
  5. In this paper, we propose a new dynamic reliability technique using an accuracy-reconfigurable stochastic computing (ARSC) framework for deep learning computing. Unlike conventional stochastic computing, which makes the trade-off between accuracy and power/energy at design time, the new ARSC design can adjust the bit-width of the data at run time. Hence, ARSC can mitigate long-term aging effects by slowing the system clock frequency, while maintaining the inference throughput by reducing the data bit-width at a small cost in accuracy. We show how to implement the recently proposed counter-based SC multiplication and bit-width reduction on a layer-wise quantization scheme for CNNs with dynamic fixed-point data. We validate an ARSC-based five-layer convolutional neural network design for the MNIST dataset based on Vivado HLS with constraints from the Xilinx Zynq-7000 family xc7z045 platform. Experimental results show that the new ARSC DNN can sufficiently compensate for NBTI-induced aging effects over 10 years with marginal classification accuracy loss while maintaining or even exceeding the pre-aging computing throughput. At the same time, the proposed ARSC computing framework also reduces active power consumption through frequency scaling, which can further improve system reliability because of the reduced temperature.