Electromigration (EM) is still the most important reliability concern for VLSI systems, especially in the nanometer regime. An EM immortality check is an important step in full-chip EM signoff analysis. In this paper, we propose a new EM immortality check method for multi-segment interconnects that considers the impact of the temperature gradient induced by Joule heating. Temperature gradients from metal Joule heating drive thermal migration, which can be a significant force for metal atomic migration, and this effect becomes more significant as technology scales down. Compared to existing methods, the new method is the first to consider the spatial temperature gradient due to Joule heating for multi-segment wires. We derive the analytic solution for the resulting steady-state EM-thermal migration stress distribution problem. We then develop the new temperature-aware voltage-based EM immortality check method, which accounts for multi-segment thermal migration effects and carries all the benefits of the recently proposed voltage-based EM immortality method for multi-segment interconnects. Numerical results on an IBM power grid and self-synthesized power delivery networks show that the proposed temperature-aware EM immortality check method is much more accurate than the recently proposed state-of-the-art EM immortality method.
Hot-Trim: Thermal and Reliability Management for Commercial Multi-core Processors Considering Workload Dependent Hot Spots
This work proposes a new dynamic thermal and reliability management framework that uses task mapping and migration to improve the thermal performance and reliability of commercial multi-core processors, considering workload-dependent thermal hot spot stress. The new method is motivated by the observation that different workloads activate different spatial power and thermal hot spots within each core of a processor. Existing run-time thermal management, which relies on information from on-chip thermal sensors at fixed locations, can lead to suboptimal management solutions because the temperatures reported by those sensors may not be those of the true hot spots. The new method, called Hot-Trim, uses a machine learning-based approach to characterize the power density hot spots across each core; a new task mapping/migration scheme is then developed based on the hot spot stresses. Compared to existing works, the new approach is the first to optimize VLSI reliability by exploring workload-dependent power hot spots. The advantages of the proposed method over the Linux baseline task mapping and the temperature-based mapping method are demonstrated and validated on real commercial chips. Experiments on a real Intel Core i7 quad-core processor executing PARSEC-3.0 and SPLASH-2 benchmarks show that, compared to the existing Linux scheduler, core and hot spot temperatures can be lowered by 1.15 to 1.31 °C. In addition, Hot-Trim improves the chip's EM-, NBTI-, and HCI-related reliability by 30.2%, 7.0%, and 31.1%, respectively, compared to the Linux baseline without any performance degradation. Furthermore, it improves EM- and HCI-related reliability by 29.6% and 19.6%, respectively, while further reducing the temperature by half a degree compared to the conventional temperature-based mapping technique.
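The idea of mapping tasks according to hot spot stress rather than fixed sensor readings can be illustrated with a small sketch. This is not the Hot-Trim algorithm itself, whose details are not given in the abstract; it is a simple greedy heuristic written under stated assumptions. The function name `map_tasks`, the per-task hot spot power vectors (standing in for the output of a learned power density model), and the per-core accumulated stress values are all hypothetical.

```python
from typing import Dict, List


def map_tasks(task_hotspot_power: Dict[str, List[float]],
              core_stress: List[List[float]]) -> Dict[str, int]:
    """Greedily assign each task to its own core (hypothetical heuristic).

    task_hotspot_power: assumed predicted power density per hot spot for
        each task, e.g. produced by some learned model.
    core_stress: assumed accumulated stress per hot spot for each core.
    Returns a mapping from task name to core index.
    """
    # Work on a copy so the caller's stress history is not modified.
    stress = [list(s) for s in core_stress]
    free_cores = set(range(len(stress)))
    mapping: Dict[str, int] = {}

    # Place the most demanding tasks first (largest peak hot spot power).
    ordered = sorted(task_hotspot_power.items(), key=lambda kv: -max(kv[1]))
    for task, power in ordered:
        if not free_cores:
            raise ValueError("more tasks than available cores")
        # Choose the free core whose worst hot spot stress after adding
        # this task would be lowest; break ties by core index.
        best = min(
            free_cores,
            key=lambda c: (max(s + p for s, p in zip(stress[c], power)), c),
        )
        mapping[task] = best
        free_cores.remove(best)
        stress[best] = [s + p for s, p in zip(stress[best], power)]
    return mapping
```

For example, with a core that already carries heavy stress on its first hot spot, a task that exercises that same hot spot is steered to the other core, even if a single fixed-location sensor would not distinguish the two.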
- PAR ID: 10410016
- Date Published:
- Journal Name: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
- ISSN: 0278-0070
- Page Range / eLocation ID: 1 to 1
- Format(s): Medium: X
- Sponsoring Org: National Science Foundation
More Like this
-
In this work, we propose a novel learning-based task-to-core mapping technique to improve lifetime and reliability based on advanced deep reinforcement learning. The new method is based on the observation that on-chip temperature sensors may not capture the true hotspots of the chip, which can lead to sub-optimal control decisions. In the new method, we first perform data-driven learning to model the hotspot activation indicator with respect to the resource utilization of different workloads. On top of this, we employ a recently proposed, highly robust, sample-efficient soft-actor-critic deep reinforcement learning algorithm, which can learn optimal maximum entropy policies to improve long-term reliability and minimize the performance degradation from NBTI/HCI effects. Lifetime and reliability improvement is achieved by assigning a reward function that penalizes continuously stressing the same hotspots and encourages even stressing of cores. The proposed algorithm is validated on an Intel i7-8650U four-core CPU platform executing CPU benchmark workloads for various hotspot activation profiles. Our experimental results show that the proposed method balances the stress between all cores and hotspots, and achieves 50% and 160% longer lifetime compared to non-hotspot-aware and Linux default scheduling, respectively. The proposed method can also reduce the average temperature by exploiting the true-hotspot information.
-
In this work, we propose, for the first time, a novel approach for the real-time estimation of chip-level spatial power maps for commercial Google Coral M.2 TPU chips based on a machine-learning technique. The new method can enable the development of more robust runtime power and thermal control schemes that take advantage of spatial power information, such as hot spots, that is otherwise not available. Unlike existing commercial multi-core processors, for which real-time performance-related utilization information is available, the TPU from Google does not provide such information. To mitigate this problem, we propose to use features related to the workloads of running different deep neural networks (DNNs), such as the hyperparameters of the DNN and the TPU resource information generated by the TPU compiler. The new approach involves the offline acquisition of accurate spatial and temporal temperature maps captured by an external infrared thermal imaging camera under nominal working conditions of a chip. To build the dynamic power density map model, we apply generative adversarial networks (GANs) based on the workload-related features. Our study shows that the estimated total powers match the manufacturer's total power measurements extremely well. Experimental results further show that the predictions of power maps are quite accurate, with an RMSE of only 4.98 mW/mm², or 2.6% full-scale error. Running the proposed approach on an Intel Core i7-10710U takes as little as 6.9 ms, which is suitable for real-time estimation.
-
In this paper, we propose a new dynamic reliability management (DRM) approach with deep reinforcement learning (DRL) for multi-core processors considering device reliability effects (hard errors) and transient signal errors (soft errors). The proposed method is based on a recently proposed physics-based three-phase electromigration model and an exponential soft error model that considers dynamic voltage and frequency scaling (DVFS) effects. Our work is inspired by recent advancements in DRL for various control and game applications. Compared with the traditional Q-learning based method, DRL has better scalability, lower memory usage, and lower computational complexity. A large class of multi-threaded applications is used as the benchmark to validate and compare the proposed dynamic reliability management methods. Experimental results show that the proposed method can significantly reduce memory footprint and computational time compared to the traditional Q-learning based method. Furthermore, we show that the DRL-based DRM method can save 53.50% more energy than the Q-learning based method and 61.29% more than the simple DVFS based method.
-
Magnetospheric accretion models predict that matter from protoplanetary disks accretes onto stars via funnel flows, which follow stellar magnetic field lines and shock on the stellar surfaces [1–3], leaving hot spots with density gradients [4–6]. Previous work has provided observational evidence of varying density in hot spots [7], but these observations were not sensitive to the radial density distribution. Attempts have been made to measure this distribution using X-ray observations [8–10]; however, X-ray emission traces only a fraction of the hot spot [11,12] and also coronal emission [13,14]. Here we report periodic ultraviolet and optical light curves of the accreting star GM Aurigae, which have a time lag of about one day between their peaks. The periodicity arises because the source of the ultraviolet and optical emission moves into and out of view as it rotates along with the star. The time lag indicates a difference in the spatial distribution of ultraviolet and optical brightness over the stellar surface. Within the framework of a magnetospheric accretion model, this finding indicates the presence of a radial density gradient in a hot spot on the stellar surface, because regions of the hot spot with different densities have different temperatures and therefore emit radiation at different wavelengths.

