Characterizing Temperature, Power, and Soft-Error Behaviors in Data Center Systems: Insights, Challenges, and Opportunities

Nie, Bin; Xue, Ji; Gupta, Saurabh; Engelmann, Christian; Smirni, Evgenia; Tiwari, Devesh

doi:10.1109/MASCOTS.2017.12

GPUs have become part of mainstream high-performance computing facilities, which increasingly require more computational power to simulate physical phenomena quickly and accurately. However, GPU nodes also consume significantly more power than traditional CPU nodes, and high power consumption introduces new system operation challenges, including increased temperature, higher power and cooling costs, and lower system reliability. This paper explores how power consumption and temperature characteristics affect reliability, discusses the implications of this understanding, and shows how to exploit these insights to predict GPU errors using neural networks.

Award ID(s): 1649087

PAR ID: 10065577

Author(s) / Creator(s): Nie, Bin; Xue, Ji; Gupta, Saurabh; Engelmann, Christian; Smirni, Evgenia; Tiwari, Devesh

Date Published: 2017-09-01

Journal Name: 2017 IEEE 25th International Symposium on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems (MASCOTS)

Page Range / eLocation ID: 22 to 31

Format(s): Medium: X

Sponsoring Org: National Science Foundation

Free Publicly Accessible Full Text
Accepted Manuscript 1.0
Conference Paper:
https://doi.org/10.1109/MASCOTS.2017.12
