- NSF-PAR ID:
- 10275551
- Date Published:
- Journal Name:
- IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
- ISSN:
- 0278-0070
- Page Range / eLocation ID:
- 1 to 1
- Format(s):
- Medium: X
- Sponsoring Org:
- National Science Foundation
More Like this
-
In this work, we propose, for the first time, a machine-learning approach to real-time estimation of chip-level spatial power maps for commercial Google Coral M.2 TPU chips. The method enables more robust runtime power and thermal control schemes that exploit spatial power information, such as hot spots, that is otherwise unavailable. Unlike existing commercial multi-core processors, which expose real-time performance-related utilization information, the Google TPU provides no such information. To mitigate this problem, we use features related to the workloads of the deep neural networks (DNNs) being run, such as DNN hyperparameters and TPU resource information generated by the TPU compiler. The approach involves the offline acquisition of accurate spatial and temporal temperature maps captured by an external infrared thermal imaging camera under nominal working conditions of the chip. To build the dynamic power density map model, we apply generative adversarial networks (GANs) to the workload-related features. Our study shows that the estimated total power closely matches the manufacturer's total power measurements. Experimental results further show that the predicted power maps are accurate, with an RMSE of only 4.98 mW/mm^2, or 2.6% of full scale. Running the proposed approach on an Intel Core i7-10710U takes as little as 6.9 ms, which is suitable for real-time estimation.
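The accuracy figures quoted above pair an RMSE with a percentage of full scale. The sketch below shows one way such a pair could be computed. It assumes that "full scale" means the range (maximum minus minimum) of the measured map, which the abstract does not state, and the function names and sample values are hypothetical.

```python
import math


def rmse(predicted, measured):
    """Root-mean-square error between two equally sized, flattened maps."""
    if len(predicted) != len(measured) or not measured:
        raise ValueError("maps must be non-empty and the same size")
    squared = [(p - m) ** 2 for p, m in zip(predicted, measured)]
    return math.sqrt(sum(squared) / len(squared))


def percent_of_full_scale(error, measured):
    """Express an error as a percentage of the measured range.

    Assumption: full scale = max(measured) - min(measured).
    """
    full_scale = max(measured) - min(measured)
    if full_scale == 0:
        raise ValueError("measured map has zero range")
    return 100.0 * error / full_scale


# Hypothetical 2x2 power-density maps in mW/mm^2, flattened row by row.
measured = [0.0, 50.0, 100.0, 200.0]
predicted = [4.0, 46.0, 104.0, 196.0]

err = rmse(predicted, measured)             # every cell is off by 4 mW/mm^2
pct = percent_of_full_scale(err, measured)  # relative to a 200 mW/mm^2 range
print(f"RMSE = {err:.2f} mW/mm^2 ({pct:.1f}% of full scale)")
```

Under the same assumed definition, the reported 4.98 mW/mm^2 at 2.6% would imply a full-scale range of roughly 190 mW/mm^2.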
-
Full-chip thermal map estimation for multi-core commercial CPUs with generative adversarial learning
In this paper, we propose a novel transient full-chip thermal map estimation method for commercial multi-core CPUs based on data-driven generative adversarial learning. We treat thermal modeling as an image-generation problem solved with generative neural networks. Instead of using traditional functional-unit powers as input, the new models rely directly on measurable, real-time, high-level chip utilization and thermal sensor information from commercial chips, without requiring any additional physical sensors. The resulting method, called ThermGAN, provides tool-accurate full-chip transient thermal maps from the performance monitor traces of commercial off-the-shelf multi-core processors. Both the generator and the discriminator are composed of simple convolutional layers, with the Wasserstein distance as the loss function. ThermGAN provides transient, real-time thermal maps without using any historical data for training or inference, in contrast with a recent RNN-based thermal map estimation method that requires historical data. Experimental results show that the trained model is accurate, with an average RMSE of 0.47 °C, or 0.63% of full scale. Our data further show that the model takes less than 7.5 ms per inference, two orders of magnitude faster than traditional finite-element-based thermal analysis. Furthermore, the new method is about 4x more accurate than a recently proposed LSTM-based thermal map estimation method and has faster inference. It is also about 2x more accurate, at much lower computational cost, than a state-of-the-art pre-silicon-based estimation method.
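For readers unfamiliar with the Wasserstein loss mentioned above, the following is a minimal sketch of the standard WGAN objectives computed from critic scores. It is not the ThermGAN implementation: the function names and score values are hypothetical, and the Lipschitz constraint a real WGAN needs (weight clipping or a gradient penalty) is omitted because the abstract does not say which is used.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of floats."""
    return sum(values) / len(values)


def critic_loss(real_scores, fake_scores):
    """Standard WGAN critic loss.

    Minimizing this pushes scores for real samples up and scores for
    generated samples down; its negative estimates the Wasserstein distance.
    """
    return mean(fake_scores) - mean(real_scores)


def generator_loss(fake_scores):
    """Standard WGAN generator loss: raise the critic's score on fakes."""
    return -mean(fake_scores)


# Hypothetical critic outputs for a batch of three real and three
# generated thermal maps.
real_scores = [0.9, 1.1, 1.0]
fake_scores = [-0.5, 0.0, 0.2]

c_loss = critic_loss(real_scores, fake_scores)
g_loss = generator_loss(fake_scores)
print(f"critic loss = {c_loss:.3f}, generator loss = {g_loss:.3f}")
```

In a training loop these two quantities would be minimized in alternation, the critic with respect to its own weights and the generator with respect to its weights.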
-
More than ever before, data centers must deploy robust thermal solutions to host the high-density, high-performance computing that is in high demand. The newer generation of central processing units (CPUs) and graphics processing units (GPUs) has substantially higher thermal power densities than previous generations. In recent years, more data centers have come to rely on liquid cooling for the high-heat processors inside the servers and air cooling for the remaining low-heat information technology equipment. This hybrid cooling approach creates a smaller and more efficient data center. Direct-to-chip cold plate liquid cooling is one of the mainstream approaches to providing concentrated cooling to targeted processors. In this study, a processor-level experimental setup was developed to evaluate the cooling performance of a novel computer numerical control (CNC) machined nickel-plated impinging cold plate on a 1 in. × 1 in. mock heater that represents a functional processing unit. The pressure drop and thermal resistance performance curves of the electroless nickel-plated cold plate are compared to those of a pure copper cold plate. A temperature uniformity analysis is performed using computational fluid dynamics and compared to the actual test data. Finally, the CNC machined pure copper cold plate is compared to other reported cold plates to demonstrate the superiority of its design with respect to cooling performance.
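The thermal resistance curves mentioned above are commonly built from one data point per flow rate. The sketch below uses the conventional definition, the temperature rise of the heater surface above the coolant inlet divided by the heater power. The abstract does not say which reference temperatures the study uses, so this definition, the function name, and the sample numbers are all assumptions.

```python
def thermal_resistance(heater_temp_c, inlet_temp_c, heater_power_w):
    """Cold plate thermal resistance in K/W.

    Assumed definition: (heater surface temperature - coolant inlet
    temperature) / heater power. A temperature difference in degrees
    Celsius equals the same difference in kelvin.
    """
    if heater_power_w <= 0:
        raise ValueError("heater power must be positive")
    return (heater_temp_c - inlet_temp_c) / heater_power_w


# Hypothetical operating point: 350 W heater, 25 C inlet, 60 C heater surface.
r_th = thermal_resistance(heater_temp_c=60.0, inlet_temp_c=25.0, heater_power_w=350.0)
print(f"R_th = {r_th:.3f} K/W")
```

Repeating the calculation at several coolant flow rates and plotting the results against flow rate gives a performance curve of the kind the abstract describes.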
-
To meet the increasing demands of data storage and data processing in modern data centers, a corresponding increase in server performance is necessary. This in turn increases power consumption and heat generation in the servers because of their high-performance processing units. Air cooling is currently the most widely used thermal management technique in data centers, but it is reaching its limits for cooling high-power-density packaging. Industries that rely on data centers are therefore looking to single-phase immersion cooling with various dielectric fluids to reduce operational and cooling costs by improving the thermal management of servers. In this study, heat sinks with triply periodic minimal surface (TPMS) lattice structures were designed for single-phase immersion cooling of data center servers. Because of their complex topologies, these designs are made possible by Electrochemical Additive Manufacturing (ECAM) technology. The ECAM process allows the generation of complex heat sink geometries that were not possible with traditional manufacturing processes. Geometric complexities, including amorphous and porous structures with a high surface-area-to-volume ratio, give ECAM heat sinks superior heat transfer properties. Our objective is to compare various heat sink geometries by minimizing chip junction temperature in a single-phase immersion cooling setup under natural convection flow regimes. Computational fluid dynamics in ANSYS Fluent is used to compare the thermal performance of the additively manufactured ECAM heat sink designs under these conditions. This study presents a novel approach to heat sink design and bolsters the capability of ECAM-produced heat sinks.
-
This work proposes a new dynamic thermal and reliability management framework that uses task mapping and migration to improve the thermal performance and reliability of commercial multi-core processors, taking workload-dependent thermal hot spot stress into account. The method is motivated by the observation that different workloads activate different spatial power and thermal hot spots within each core of a processor. Existing run-time thermal management, which relies on information from on-chip thermal sensors at fixed locations, can lead to suboptimal management solutions because the temperatures reported by those sensors may not be those of the true hot spots. The new method, called Hot-Trim, uses a machine-learning-based approach to characterize the power density hot spots across each core; a new task mapping and migration scheme is then developed based on the hot spot stresses. Compared to existing work, the new approach is the first to optimize VLSI reliability by exploring workload-dependent power hot spots. The advantages of the proposed method over the Linux baseline task mapping and the temperature-based mapping method are demonstrated and validated on real commercial chips. Experiments on a real Intel Core i7 quad-core processor executing the PARSEC-3.0 and SPLASH-2 benchmarks show that, compared to the existing Linux scheduler, core and hot spot temperatures can be lowered by 1.15 to 1.31 °C. In addition, Hot-Trim improves the chip's EM-, NBTI-, and HCI-related reliability by 30.2%, 7.0%, and 31.1%, respectively, compared to the Linux baseline, without any performance degradation. Furthermore, it improves EM- and HCI-related reliability by 29.6% and 19.6%, respectively, while further reducing the temperature by half a degree compared to the conventional temperature-based mapping technique.