
Title: Real-Time Full-Chip Thermal Tracking: A Post-Silicon, Machine Learning Perspective
In this work, we present a novel machine-learning-based approach to real-time tracking of full-chip heatmaps for commercial off-the-shelf microprocessors. The proposed post-silicon approach, named RealMaps, uses only the existing embedded temperature sensors and workload-independent utilization information, both of which are available in real time. Moreover, RealMaps does not require any knowledge of the proprietary design details or manufacturing process-specific information of the chip. Consequently, the methods presented in this work can be implemented by either the original chip manufacturer or a third party, and are aimed at supplementing, rather than substituting, the temperature data sensed by the existing embedded sensors. The new approach starts with offline acquisition of accurate spatial and temporal heatmaps using an infrared thermal imaging setup while nominal working conditions are maintained on the chip. To build the dynamic thermal model, a temporal-aware long short-term memory (LSTM) neural network is trained with system-level features such as chip frequency, instruction counts, and other high-level performance metrics as inputs. Instead of a pixel-wise heatmap estimation, we perform a 2D spatial discrete cosine transform (DCT) on the heatmaps so that they can be expressed with just a few dominant DCT coefficients. This allows the model to estimate only the dominant spatial features of the 2D heatmaps, rather than the entire heatmap images, making it significantly more efficient. Experimental results from two commercial chips show that RealMaps can estimate the full-chip heatmaps with root-mean-square errors of 0.9 °C and 1.2 °C, respectively, and takes only 0.4 ms per inference, which suits real-time use well. Compared to the state-of-the-art pre-silicon approach, RealMaps shows similar accuracy at a much lower computational cost.
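The DCT-based compression described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the synthetic heatmap, its 24 by 32 size, the 6 by 6 truncation, and all function names are assumptions made for illustration, and the LSTM that would predict the retained coefficients is omitted. The sketch only shows how a smooth heatmap can be reduced to a small block of low-frequency DCT coefficients and reconstructed from them.

```python
import math

def dct_matrix(n):
    """Orthonormal DCT-II basis as an n x n list of rows (row k = frequency k)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                     for i in range(n)])
    return rows

def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def dct2(heatmap):
    """2D DCT computed as C_rows * X * C_cols^T."""
    c_r = dct_matrix(len(heatmap))
    c_c = dct_matrix(len(heatmap[0]))
    return matmul(matmul(c_r, heatmap), transpose(c_c))

def idct2(coeffs):
    """Inverse 2D DCT; the basis is orthonormal, so the inverse is the transpose."""
    c_r = dct_matrix(len(coeffs))
    c_c = dct_matrix(len(coeffs[0]))
    return matmul(matmul(transpose(c_r), coeffs), c_c)

def truncate(coeffs, k):
    """Keep only the k x k lowest-frequency coefficients; zero the rest."""
    return [[v if (r < k and c < k) else 0.0 for c, v in enumerate(row)]
            for r, row in enumerate(coeffs)]

def rmse(a, b):
    n = len(a) * len(a[0])
    return math.sqrt(sum((x - y) ** 2
                         for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n)

def synthetic_heatmap(rows=24, cols=32):
    """Hypothetical heatmap in deg C: 45 C background plus one Gaussian hotspot."""
    return [[45.0 + 30.0 * math.exp(-(((r - 8) / 6.0) ** 2 + ((c - 20) / 8.0) ** 2))
             for c in range(cols)] for r in range(rows)]

heatmap = synthetic_heatmap()
coeffs = dct2(heatmap)
approx = idct2(truncate(coeffs, 6))  # 36 numbers stand in for 768 pixels
print("RMSE with 6 x 6 coefficients (deg C):", rmse(heatmap, approx))
```

In a setup like the one the abstract describes, a model would predict only the retained coefficients at each time step, and the inverse transform would recover the full heatmap. How many coefficients are enough depends on the chip and workload and would need to be checked against measured data.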
Journal Name: IEEE Transactions on Computers
Page Range or eLocation-ID: 1 to 1
Sponsoring Org: National Science Foundation
More Like this
  1. In this work, we propose a novel machine-learning-based approach to real-time estimation of full-chip transient heatmaps for commercial processors. The model derived in this work supplements the temperature data sensed by the existing on-chip sensors, allowing for the development of more robust runtime power and thermal control schemes that can take advantage of thermal information that is otherwise not available. The new approach involves offline acquisition of accurate spatial and temporal heatmaps using an infrared thermal imaging setup while nominal working conditions are maintained on the chip. To build the dynamic thermal model, we apply long short-term memory (LSTM) neural networks with system-level variables such as chip frequency, instruction counts, and other performance metrics as inputs. To reduce the dimensionality of the model, a 2D spatial discrete cosine transform (DCT) is first performed on the heatmaps so that they can be expressed with just their dominant DCT frequencies. Our study shows that only 6 × 6 DCT coefficients are required to maintain sufficient accuracy across a variety of workloads. Experimental results show that the proposed approach can estimate the full-chip heatmaps with a root-mean-square error of less than 1.4 °C and takes only 19 ms per inference, which suits real-time use well.
  2. In this paper, we propose a novel transient full-chip thermal map estimation method for commercial multi-core CPUs based on data-driven generative adversarial learning. We treat the thermal modeling problem as an image-generation problem using generative neural networks. Instead of using traditional functional unit powers as input, the new models are based directly on the measurable real-time high-level chip utilizations and thermal sensor information of commercial chips, without assuming any additional physical sensors. The resulting thermal map estimation method, called ThermGAN, can provide tool-accurate full-chip transient thermal maps from the given performance monitor traces of commercial off-the-shelf multi-core processors. In our work, both the generator and the discriminator are composed of simple convolutional layers, with the Wasserstein distance as the loss function. ThermGAN can provide transient, real-time thermal maps without using any historical data for training or inference, in contrast with a recent RNN-based thermal map estimation method that requires historical data. Experimental results show that the trained model is very accurate in thermal estimation, with an average RMSE of 0.47 °C, or 0.63% of full scale. Our data further show that the model takes less than 7.5 ms per inference, which is two orders of magnitude faster than traditional finite-element-based thermal analysis. Furthermore, the new method is about 4x more accurate than a recently proposed LSTM-based thermal map estimation method and has a faster inference speed. It is also about 2x more accurate, at a much lower computational cost, than a state-of-the-art pre-silicon estimation method.
  3. In this article, we address the problem of accurate full-chip power and thermal map estimation for commercial off-the-shelf multi-core processors. Estimation for processors operating with heat sink cooling remains a challenging problem because direct measurement is difficult. We first propose an accurate full-chip steady-state power density map estimation method for commercial multi-core microprocessors. The new method consists of a few steps. First, a 2D spatial Laplace operation is performed on thermal maps (images) measured without a heat sink to obtain the so-called "raw power maps". Then, a novel scheme is developed to generate the true power density maps from the raw power density maps. The new approach is based on thermal measurements of the processor with back-side cooling using an advanced infrared (IR) thermal imaging system. An FEM thermal model constructed in COMSOL Multiphysics is used to validate the estimated power density maps and thermal conductivity. Later, this work creates a high-fidelity FEM thermal model with a heat sink and reconstructs the full-chip thermal maps while the heat sink is on. After ensuring that the power maps are similar under the back cooling and heat sink cooling settings, the reconstructed thermal maps are verified by matching the on-chip thermal sensor readings against the corresponding elements of the thermal maps. Experiments on an Intel i7-8650U 4-core processor with back cooling show 96% similarity (2D correlation) between the measured thermal maps and the thermal maps reconstructed from the estimated power maps, with 1.3 °C average absolute error. Under heat sink cooling, the average absolute error is 2.2 °C over a 56 °C temperature range, with about 3.9% error between the computed and the real thermal maps at the sensor locations. Furthermore, the proposed power map estimation method achieves higher resolution than, and at least a 100× speedup over, a recently proposed state-of-the-art Blind Power Identification method.
  4. Low-Power Wide-Area Networks (LP-WANs) are seeing widespread deployments connecting millions of sensors, each powered by a ten-year AA battery, to radio infrastructure that is often miles away. By design, iteratively querying all sensors in an LP-WAN may take several hours or even days, given the stringent battery limits of client radios. This precludes obtaining even an approximate real-time view of sensed information across LP-WAN devices over a large area, say in the event of a disaster or fault, or simply for diagnostics. This paper presents QuAiL, a system that provides a coarse aggregate view of sensed data across LP-WAN devices over a wide area within a time span of just one LP-WAN packet. QuAiL achieves this by coordinating multiple LP-WAN radios to transmit their information synchronously in time and frequency despite their power constraints. We design each client's transmission so that the base station can retrieve an approximate heatmap of sensed data by exploiting the spatial correlation of this data across clients. We further show how our system can be optimized for statistical and machine learning queries, all while maintaining the security and privacy of sensed data from individual clients. Our 3 sq. km LP-WAN deployment around the CMU campus in Pittsburgh demonstrates 4x faster information retrieval than state-of-the-art statistical methods for retrieving the spatial sensor heatmap at a desired resolution.
  5. Understanding occupants’ thermal sensation and comfort is essential to defining the operational settings for Heating, Ventilation and Air Conditioning (HVAC) systems in buildings. Due to the continuous impact of human and environmental factors, occupants’ thermal sensation and comfort level can change over time. Thus, to dynamically control the environment, thermal comfort should be monitored in real time. This paper presents a novel non-intrusive infrared thermography framework that estimates an occupant’s thermal comfort level from skin temperature measured in different facial regions using low-cost thermal cameras. Unlike existing methods that rely on placing sensors directly on humans for skin temperature measurement, the proposed framework is able to detect the presence of occupants, extract facial regions, measure skin temperature features, and interpret thermal comfort conditions with minimal interruption to the building occupants. The method is validated by collecting thermal comfort data from a total of twelve subjects under cooling, heating and steady-state experiments. The results demonstrate that the ears, nose and cheeks are most indicative of thermal comfort and that the proposed framework can be used to assess occupants’ thermal comfort with an average accuracy of 85%.