

Title: Hot-Trim: Thermal and Reliability Management for Commercial Multi-core Processors Considering Workload Dependent Hot Spots
This work proposes a new dynamic thermal and reliability management framework via task mapping and migration to improve the thermal performance and reliability of commercial multi-core processors considering workload-dependent thermal hot spot stress. The new method is motivated by the observation that different workloads activate different spatial power and thermal hot spots within each core of a processor. Existing run-time thermal management, which relies on information from on-chip thermal sensors at fixed locations, can lead to suboptimal management solutions because the temperatures reported by those sensors may not reflect the true hot spots. The new method, called Hot-Trim, uses a machine learning-based approach to characterize the power density hot spots across each core; a new task mapping/migration scheme is then developed based on the hot spot stresses. Compared to existing works, the new approach is the first to optimize VLSI reliability by exploring workload-dependent power hot spots. The advantages of the proposed method over the Linux baseline task mapping and the temperature-based mapping method are demonstrated and validated on real commercial chips. Experiments on a real Intel Core i7 quad-core processor executing PARSEC-3.0 and SPLASH-2 benchmarks show that, compared to the existing Linux scheduler, core and hot spot temperatures can be lowered by 1.15 to 1.31 °C. In addition, Hot-Trim can improve the chip's EM-, NBTI-, and HCI-related reliability by 30.2%, 7.0%, and 31.1%, respectively, compared to the Linux baseline without any performance degradation. Furthermore, it improves EM- and HCI-related reliability by 29.6% and 19.6%, respectively, while further reducing the temperature by half a degree compared to the conventional temperature-based mapping technique.
Award ID(s):
1854276 2113928 2007135
NSF-PAR ID:
10410016
Author(s) / Creator(s):
Date Published:
Journal Name:
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
ISSN:
0278-0070
Page Range / eLocation ID:
1 to 1
Sponsoring Org:
National Science Foundation
More Like this
  1. Abstract

    Modern microprocessor performance is limited by local hot spots induced at high frequency by busy integrated circuit elements such as the clock generator. Locally embedded thermoelectric devices (TEDs) are proposed to perform active cooling whereby thermoelectric effects enhance passive cooling by the Fourier law in removing heat from the hot spot to colder regions. To mitigate transient heating events and improve temperature stability, we propose a novel analytical solution that describes the temperature response of a periodically heated hot spot that is actively cooled by a TED driven electrically at the same frequency. The analytical solution that we present is validated by experimental data from frequency domain thermal reflectance (FDTR) measurements made directly on an actively cooled Si thermoelectric device where the pump laser replicates the transient hot spot. We herein demonstrate a practical method to actively cancel the transient temperature variations on circuit elements with TEDs. This result opens a new path to optimize the design of cooling systems for transient localized hot spots in integrated circuits.

     
  2. In this work, we propose a novel learning-based task-to-core mapping technique that uses advanced deep reinforcement learning to improve lifetime and reliability. The new method is based on the observation that on-chip temperature sensors may not capture the true hotspots of the chip, which can lead to sub-optimal control decisions. We first perform data-driven learning to model the hotspot activation indicator with respect to the resource utilization of different workloads. On top of this model, we employ a recently introduced, highly robust, sample-efficient soft actor-critic deep reinforcement learning algorithm, which can learn optimal maximum-entropy policies, to improve long-term reliability and minimize the performance degradation from NBTI/HCI effects. Lifetime and reliability improvements are achieved through a reward function that penalizes continuously stressing the same hotspots and encourages even stressing of cores. The proposed algorithm is validated on an Intel i7-8650U four-core CPU platform executing CPU benchmark workloads with various hotspot activation profiles. Our experimental results show that the proposed method balances the stress between all cores and hotspots, and achieves 50% and 160% longer lifetime compared to non-hotspot-aware and Linux default scheduling, respectively. The proposed method can also reduce the average temperature by exploiting the true-hotspot information.
  3. In this work, we propose, for the first time, a machine-learning-based approach for the real-time estimation of chip-level spatial power maps for commercial Google Coral M.2 TPU chips. The new method can enable the development of more robust runtime power and thermal control schemes that take advantage of spatial power information, such as hot spots, that is otherwise not available. Unlike existing commercial multi-core processors, for which real-time performance-related utilization information is available, the Google TPU does not provide such information. To mitigate this problem, we propose to use features related to the workloads of running different deep neural networks (DNNs), such as the DNN hyperparameters and the TPU resource information generated by the TPU compiler. The new approach involves the offline acquisition of accurate spatial and temporal temperature maps captured by an external infrared thermal imaging camera under nominal working conditions of a chip. To build the dynamic power density map model, we apply generative adversarial networks (GANs) to the workload-related features. Our study shows that the estimated total powers closely match the manufacturer's total power measurements. Experimental results further show that the predicted power maps are quite accurate, with an RMSE of only 4.98 mW/mm², or 2.6% of the full-scale error. Running the proposed approach on an Intel Core i7-10710U takes as little as 6.9 ms, which is suitable for real-time estimation.
  4. Abstract

    Initial hot spot temperatures and temperature evolutions for four polymer-bound explosives under shock compression by laser-driven flyer plates at speeds from 1.5 to 4.5 km s⁻¹ are presented. A new averaging routine improves the signal-to-noise ratio in shock-compressed impactor experiments and yields temperature dynamics that are more accurate than those previously available. The PBX formulations studied here consist of either pentaerythritol tetranitrate (PETN), 1,3,5-trinitro-1,3,5-triazinane (RDX), 2,4,6-trinitrotoluene (TNT), or 1,3,5-triamino-2,4,6-trinitrobenzene (TATB) in an 80/20 wt.% mixture with a silicone elastomer binder. The temperature dynamics demonstrate a unique shock strength dependence for each base explosive. The initial hot spot temperature and its evolution in time are shown to be indicative of chemistry occurring within the reaction zone of the four explosives. The number density of hot spots is qualitatively inferred from the spatially averaged emissivity and appears to increase exponentially with shock strength. An increased emissivity for formulations consisting of TNT and TATB is consistent with carbon-rich explosives and an increased hot spot volume. Qualitative conclusions about sensitivity were drawn from the initial hot spot temperature and the rate at which the number of hot spots appears to grow.

     
  5. Electromigration (EM) remains the most important reliability concern for VLSI systems, especially in the nanometer regime. The EM immortality check is an important step in full-chip EM signoff analysis. In this paper, we propose a new EM immortality check method for multi-segment interconnects that considers the impact of the temperature gradient induced by Joule heating. Temperature gradients from metal Joule heating give rise to thermal migration, which can be a significant driving force for metal atomic migration, and its impact becomes more significant as technology scales down. Compared to existing methods, the new method is the first to consider the spatial temperature gradient due to Joule heating in multi-segment wires. We derive the analytic solution for the resulting steady-state EM-thermal migration stress distribution problem. We then develop the new temperature-aware voltage-based EM immortality check method considering the multi-segment thermal migration effects; it retains all the benefits of the recently proposed voltage-based EM immortality method for multi-segment interconnects. Numerical results on an IBM power grid and self-synthesized power delivery networks show that the proposed temperature-aware EM immortality check method is much more accurate than the recently proposed state-of-the-art EM immortality method.