

Title: Hot-Trim: Thermal and Reliability Management for Commercial Multi-core Processors Considering Workload Dependent Hot Spots
This work proposes a new dynamic thermal and reliability management framework that uses task mapping and migration to improve the thermal performance and reliability of commercial multi-core processors, taking workload-dependent thermal hot spot stress into account. The new method is motivated by the observation that different workloads activate different spatial power and thermal hot spots within each core of a processor. Existing run-time thermal management, which relies on on-chip thermal sensors at fixed locations, can lead to suboptimal management decisions because the locations monitored by those sensors may not coincide with the true hot spots. The new method, called Hot-Trim, uses a machine learning-based approach to characterize the power density hot spots across each core; a new task mapping/migration scheme is then developed based on the hot spot stresses. Compared to existing works, the new approach is the first to optimize VLSI reliability by exploring workload-dependent power hot spots. The advantages of the proposed method over the Linux baseline task mapping and the temperature-based mapping method are demonstrated and validated on real commercial chips. Experiments on a real Intel Core i7 quad-core processor executing PARSEC-3.0 and SPLASH-2 benchmarks show that, compared to the existing Linux scheduler, core and hot spot temperatures can be lowered by 1.15 to 1.31 °C. In addition, Hot-Trim improves the chip's EM-, NBTI-, and HCI-related reliability by 30.2%, 7.0%, and 31.1%, respectively, compared to the Linux baseline without any performance degradation. Furthermore, compared to the conventional temperature-based mapping technique, it improves EM- and HCI-related reliability by 29.6% and 19.6%, respectively, while reducing the temperature by a further half a degree.
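The abstract does not spell out the mapping algorithm, so the following is only a minimal sketch of what a hot-spot-stress-aware task mapping could look like. It assumes, hypothetically, that each task comes with a predicted per-hot-spot stress vector and that each core tracks the stress already accumulated on its hot spots. The function name `map_tasks`, the data layout, and the greedy rule are illustrative assumptions, not Hot-Trim's actual scheme.

```python
def map_tasks(task_hotspots, core_stress):
    """Greedily assign each task to a distinct core (illustrative only).

    task_hotspots: {task: [stress the task is predicted to add per hot spot]}
    core_stress:   {core: [stress already accumulated on each hot spot]}
    Returns {task: core}. Assumes there are at least as many cores as tasks.
    """
    free = set(core_stress)
    mapping = {}
    # Place the most stressful tasks first so they get the widest choice of cores.
    ordered = sorted(task_hotspots, key=lambda t: max(task_hotspots[t]), reverse=True)
    for task in ordered:
        added = task_hotspots[task]

        def peak_after(core):
            # Worst hot-spot stress on this core if the task were placed here.
            return max(s + a for s, a in zip(core_stress[core], added))

        best = min(sorted(free), key=peak_after)
        mapping[task] = best
        free.remove(best)
    return mapping


# Hypothetical example: two cores, two hot spots per core.
core_stress = {0: [5.0, 1.0], 1: [1.0, 5.0]}
task_hotspots = {"a": [3.0, 0.0], "b": [0.0, 3.0]}
mapping = map_tasks(task_hotspots, core_stress)
```

In this example each task lands on the core whose already-stressed hot spot it does not activate, which is the intuition behind mapping on hot spot stress rather than on a single fixed-location sensor reading.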
Award ID(s):
1854276 2113928
NSF-PAR ID:
10410016
Author(s) / Creator(s):
; ;
Date Published:
Journal Name:
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
ISSN:
0278-0070
Page Range / eLocation ID:
1 to 1
Format(s):
Medium: X
Sponsoring Org:
National Science Foundation
More Like this
  1. In this work, we propose, for the first time, a machine-learning-based approach for the real-time estimation of chip-level spatial power maps for commercial Google Coral M.2 TPU chips. The new method can enable the development of more robust runtime power and thermal control schemes that take advantage of spatial power information, such as hot spots, that is otherwise not available. Unlike existing commercial multi-core processors, in which real-time performance-related utilization information is available, the TPU from Google does not expose such information. To mitigate this problem, we propose to use features related to the workloads of running different deep neural networks (DNNs), such as the DNN hyperparameters and the TPU resource information generated by the TPU compiler. The new approach involves the offline acquisition of accurate spatial and temporal temperature maps captured by an external infrared thermal imaging camera under nominal working conditions of the chip. To build the dynamic power density map model, we apply generative adversarial networks (GANs) to the workload-related features. Our study shows that the estimated total powers match the manufacturer's total power measurements extremely well. Experimental results further show that the power map predictions are quite accurate, with an RMSE of only 4.98 mW/mm^2, a full-scale error of 2.6%. Inference with the proposed approach on an Intel Core i7-10710U takes as little as 6.9 ms, which is suitable for real-time estimation.
  2. In this work, we propose a novel learning-based task-to-core mapping technique, built on advanced deep reinforcement learning, to improve lifetime and reliability. The new method is based on the observation that on-chip temperature sensors may not capture the true hotspots of the chip, which can lead to sub-optimal control decisions. In the new method, we first perform data-driven learning to model the hotspot activation indicator with respect to the resource utilization of different workloads. On top of this, we employ a recently proposed, highly robust, sample-efficient soft-actor-critic deep reinforcement learning algorithm, which can learn optimal maximum entropy policies, to improve long-term reliability and minimize the performance degradation from NBTI/HCI effects. Lifetime and reliability improvement is achieved through a reward function that penalizes continuously stressing the same hotspots and encourages even stressing of cores. The proposed algorithm is validated on an Intel i7-8650U four-core CPU platform executing CPU benchmark workloads with various hotspot activation profiles. Our experimental results show that the proposed method balances the stress between all cores and hotspots, and achieves 50% and 160% longer lifetime compared to non-hotspot-aware and Linux default scheduling, respectively. The proposed method can also reduce the average temperature by exploiting the true-hotspot information.
  3. Electromigration (EM) is still the most important reliability concern for VLSI systems, especially in the nanometer regime. The EM immortality check is an important step in full-chip EM signoff analysis. In this paper, we propose a new EM immortality check method for multi-segment interconnects that considers the impact of the temperature gradient induced by Joule heating. Atomic migration driven by temperature gradients from metal Joule heating, called thermal migration, can be a significant force for metal atomic migration, and its impact becomes more significant as technology scales down. Compared to existing methods, the new method is the first to consider the spatial temperature gradient due to Joule heating in multi-segment wires. We derive the analytic solution for the resulting steady-state EM-thermal migration stress distribution problem. We then develop the new temperature-aware voltage-based EM immortality check method, which accounts for multi-segment thermal migration effects and carries all the benefits of the recently proposed voltage-based EM immortality method for multi-segment interconnects. Numerical results on an IBM power grid and self-synthesized power delivery networks show that the proposed temperature-aware EM immortality check method is much more accurate than the recently proposed state-of-the-art EM immortality method.
  4. Timely, flexible and accurate information dissemination can make a life-and-death difference in managing disasters. Complex command structures and information organization make such dissemination challenging. Thus, it is vital to have an architecture with appropriate naming frameworks, adaptable to the changing roles of participants, focused on content rather than network addresses. To address this, we propose POISE, a name-based and recipient-based publish/subscribe architecture for efficient content dissemination in disaster management. POISE proposes an information layer, improving on state-of-the-art Information-Centric Networking (ICN) solutions such as Named Data Networking (NDN) in two major ways: 1) support for complex graph-based namespaces, and 2) automatic name-based load-splitting. To capture the complexity and dynamicity of disaster response command chains and information flows, POISE proposes a graph-based naming framework, leveraged in a dissemination protocol which exploits information layer rendezvous points (RPs) that perform name expansions. For improved robustness and scalability, POISE allows load-sharing via multiple RPs each managing a subset of the namespace graph. However, excessive workload on one RP may turn it into a “hot spot”, thus impeding performance and reliability. To eliminate such traffic concentration, we propose an automatic load-splitting mechanism, consisting of a namespace graph partitioning complemented by a seamless, loss-less core migration procedure. Due to the nature of our graph partitioning and its complex objectives, off-the-shelf graph partitioning, e.g., METIS, is inadequate. We propose a hybrid partitioning solution, consisting of an initial and a refinement phase. Our simulation results show that POISE outperforms state-of-the-art solutions, demonstrating its effectiveness in timely delivery and load-sharing. 
  5. In this article, we address the problem of accurate full-chip power and thermal map estimation for commercial off-the-shelf multi-core processors. Estimation for processors operating with heat sink cooling remains a challenging problem due to the difficulty of direct measurement. We first propose an accurate full-chip steady-state power density map estimation method for commercial multi-core microprocessors. The new method consists of a few steps. First, a 2D spatial Laplace operation is performed on the thermal maps (images) measured without a heat sink to obtain the so-called "raw power maps". Then, a novel scheme is developed to generate the true power density maps from the raw power density maps. The new approach is based on thermal measurements of the processor with back-side cooling using an advanced infrared (IR) thermal imaging system. An FEM thermal model constructed in COMSOL Multiphysics is used to validate the estimated power density maps and thermal conductivity. Later, this work creates a high-fidelity FEM thermal model with a heat sink and reconstructs the full-chip thermal maps while the heat sink is on. Ensuring that power maps are similar under back cooling and heat sink cooling settings, the reconstructed thermal maps are verified by matching the on-chip thermal sensor readings against the corresponding elements of the thermal maps. Experiments on an Intel i7-8650U 4-core processor with back cooling show 96% similarity (2D correlation) between the measured thermal maps and the thermal maps reconstructed from the estimated power maps, with 1.3 °C average absolute error. Under heat sink cooling, the average absolute error is 2.2 °C over a 56 °C temperature range, with about 3.9% error between the computed and the real thermal maps at the sensor locations. Furthermore, the proposed power map estimation method achieves higher resolution and a speedup of at least 100× over a recently proposed state-of-the-art Blind Power Identification method.
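The first step described in item 5, a 2D spatial Laplace operation on a measured thermal map, can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. It relies on the steady-state heat conduction relation, in which the heat source density is proportional to the negative Laplacian of temperature. The function name `raw_power_map`, the conductivity `k`, the pixel pitch `h`, and the choice to leave border pixels at zero are all hypothetical. The second step (turning raw maps into true power density maps) is not described in the abstract and is not attempted here.

```python
def raw_power_map(temp, k=1.0, h=1.0):
    """Negative discrete Laplacian of a 2D temperature map, scaled by k.

    temp: list of equal-length rows of temperatures (one value per pixel).
    k:    assumed effective thermal conductivity (hypothetical parameter).
    h:    assumed pixel pitch, same unit in both directions.
    Border pixels are left at 0.0 because the 5-point stencil needs
    all four neighbours.
    """
    rows, cols = len(temp), len(temp[0])
    raw = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows - 1):
        for j in range(1, cols - 1):
            laplacian = (
                temp[i - 1][j] + temp[i + 1][j]
                + temp[i][j - 1] + temp[i][j + 1]
                - 4.0 * temp[i][j]
            ) / (h * h)
            # A local temperature peak has a negative Laplacian,
            # which corresponds to a positive heat source.
            raw[i][j] = -k * laplacian
    return raw


# Hypothetical 3x3 thermal map with a single warm pixel in the centre.
thermal = [
    [40.0, 40.0, 40.0],
    [40.0, 50.0, 40.0],
    [40.0, 40.0, 40.0],
]
raw = raw_power_map(thermal)
```

A uniform thermal map yields an all-zero raw map, while a localized temperature peak yields a positive value at the peak, which is why the Laplacian highlights where power is being dissipated.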