<?xml-model href='http://www.tei-c.org/release/xml/tei/custom/schema/relaxng/tei_all.rng' schematypens='http://relaxng.org/ns/structure/1.0'?><TEI xmlns="http://www.tei-c.org/ns/1.0">
	<teiHeader>
		<fileDesc>
			<titleStmt><title level='a'>Chip-level Thermal Simulation for a Multicore Processor Using a Multi-Block Model Enabled by Proper Orthogonal Decomposition</title></titleStmt>
			<publicationStmt>
				<publisher>IEEE</publisher>
				<date>05/31/2022</date>
			</publicationStmt>
			<sourceDesc>
				<bibl> 
					<idno type="par_id">10336190</idno>
					<idno type="doi">10.1109/iTherm54085.2022.9899503</idno>
					
					<author>Lin Jiang</author><author>Anthony Dowling</author><author>Yu Liu</author><author>Ming-C Cheng</author>
				</bibl>
			</sourceDesc>
		</fileDesc>
		<profileDesc>
			<abstract><ab><![CDATA[To perform chip-level thermal simulation effectively for large-scale processors with multicores/manycores, a multi-block model enabled by proper orthogonal decomposition (POD) and domain decomposition is applied. This approach partitions a large-scale processor into smaller building blocks, such as cores, caches, I/O units, etc. For each building block, a set of temperature solution data accounting for parametric variations of interest is collected individually from FEniCS, a finite element simulation platform, to extract its basis functions (or POD modes). Using smaller building blocks, the multi-block approach significantly enhances the computational efficiency of POD mode generation to construct a POD model for the entire chip. In this work, a set of POD modes is trained by the solution data from each of two selected building blocks, a core and a level-2 cache, of AMD Athlon II X4 610e, a quad-core chip. A two-block POD thermal model is developed for Core 1 and L2 Cache by projecting these two blocks to a functional space represented by these 2 sets of POD modes. The discontinuous Galerkin method with the penalty number is applied to ensure the boundary continuity at the block interface. An optimal range of the penalty number for the two-block POD thermal model has been observed to provide an accurate prediction of the dynamic thermal distribution in Core 1 and L2 Cache. For the two-block POD model, a least square error below 3% is achieved with only 3 POD modes in each block. This results in a reduction in the numerical degrees of freedom by almost 4 orders of magnitude and a thermal simulation that is thousands of times faster than FEniCS.]]></ab></abstract>
		</profileDesc>
	</teiHeader>
	<text><body xmlns="http://www.tei-c.org/ns/1.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink">
<div xmlns="http://www.tei-c.org/ns/1.0"><head>I. INTRODUCTION</head><p>Artificial intelligence (AI) and machine learning (ML) are widely used in most domains of technology <ref type="bibr">[1]</ref>, <ref type="bibr">[2]</ref>. The models used in AI and ML are trained by processing millions of crawled data items, giving rise to considerable demand for high-performance processors <ref type="bibr">[1]</ref>. To satisfy this need, more cores are integrated on a semiconductor chip, and the transistor density and power dissipation have increased dramatically in recent years, which has led to high temperatures and hot-spot generation due to severe Joule heating. High temperatures and hot spots contribute not only to degradation of performance but also to deterioration of reliability <ref type="bibr">[3]</ref>, <ref type="bibr">[4]</ref>. To reduce temperature and suppress hot spots in high-performance processors, the general practice is to apply effective thermal-aware task scheduling and thermal management, which, however, requires effective and accurate chip-level thermal-simulation techniques.</p><p>Several approaches have been developed for the thermal simulation of semiconductor chips; each of them offers a different level of efficiency and accuracy. Among these approaches, direct numerical simulations (DNSs) based on either the finite element method (FEM) or the finite difference method (FDM) provide accurate and detailed thermal analysis at the expense of a large number of degrees of freedom (DoF). Many open-source or commercial DNS tools are available for such applications, for example, FEniCS <ref type="bibr">[5]</ref>, ANSYS <ref type="bibr">[6]</ref>, and COMSOL <ref type="bibr">[7]</ref>. 
These DNSs, although offering accurate thermal solutions with fine resolution, demand extensive computational resources and are impractical for chip-level thermal simulations.</p><p>To conduct chip-level thermal simulation efficiently, the lumped RC thermal circuit model has been used to predict the thermal profile in large-scale semiconductor chips; for example, the block model of HotSpot <ref type="bibr">[8]</ref>-<ref type="bibr">[10]</ref> is one of the most popular thermal simulators using the compact RC thermal model for chip-level thermal simulations. Because of its large lumped RC elements, the RC thermal circuit model cannot capture small hot spots in semiconductor chips and offers only average temperatures for the large RC elements. With the approximation associated with large lumped elements, the heat flux at the element interfaces cannot be estimated accurately. The accuracy of the block model of HotSpot has been challenged due to inaccurate thermal predictions for some floorplans, compared to DNS <ref type="bibr">[11]</ref>. To improve the accuracy of the block model of HotSpot, the grid model of HotSpot <ref type="bibr">[12]</ref> was developed, where smaller elements are allowed in order to provide a more detailed and accurate temperature prediction. However, when using very small elements for better accuracy, the grid model of HotSpot is equivalent to the FDM and becomes prohibitive for chip-level simulation.</p><p>To enhance the efficiency of chip-level thermal simulations, another strategy is to develop a spatial impulse response (or Green's function) <ref type="bibr">[13]</ref>-<ref type="bibr">[15]</ref> of the selected chip. The Green's function is usually pre-trained by the thermal solution derived from DNS in response to a unit point heat source at the center of the chip. The spatial temperature solution is then obtained by a convolution of the pre-trained Green's function with the power profile. 
However, for the Green's function method, it is difficult to apply boundary conditions (BCs) <ref type="bibr">[13]</ref>, <ref type="bibr">[14]</ref> or to perform transient thermal simulation <ref type="bibr">[13]</ref>, <ref type="bibr">[15]</ref>. In addition, the training of the Green's function using DNS of the entire chip is extremely time-consuming <ref type="bibr">[13]</ref>, especially if a high resolution is needed to capture the localized hot spots. As the technology node is further reduced and more cores are integrated on a chip, the computation of the Green's function is becoming more intensive and impractical for developing such a thermal model for the entire chip, especially when a high resolution is needed.</p><p>An alternative is to use a reduced-order simulation model enabled by a data-driven approach based on proper orthogonal decomposition (POD) <ref type="bibr">[16]</ref>, <ref type="bibr">[17]</ref>. This approach projects a dynamic thermal problem from the physical domain onto a functional space (POD space) described by a finite set of basis functions (also called POD modes). To derive an optimal set of modes, dynamic thermal data accounting for parametric variations of interest, such as variations of heat excitations and BCs, are obtained from DNSs to train the POD modes. The POD model constructed from these trained, robust modes is therefore able to respond accurately to the parametric variations within or near the training conditions with a very small number of DoF. In addition to its high accuracy and efficiency, the POD model also offers a temperature profile as detailed as that of DNS.</p><p>The POD simulation approach has been shown to be effective in many areas of research <ref type="bibr">[18]</ref>-<ref type="bibr">[27]</ref>, including thermal simulations of integrated circuits and CPUs <ref type="bibr">[20]</ref>-<ref type="bibr">[22]</ref>, <ref type="bibr">[27]</ref>. 
However, similar to the problem encountered in the pre-training of the Green's function, the long simulation time and massive thermal data needed to train the POD modes become prohibitive for larger chips with high resolutions. To overcome this difficulty, the multi-block POD methodology is proposed for large-scale chips, such as multicore/manycore processors. In the multi-block POD model, the domain decomposition technique is implemented to partition a large semiconductor chip into smaller building blocks, such as cores, caches, I/O units, etc. For each small block, a set of POD modes and the model parameters can be generated more efficiently and stored in a technology library. The POD model for the entire chip can then be constructed by gluing these POD blocks together with the discontinuous Galerkin (DG) method <ref type="bibr">[28]</ref>, <ref type="bibr">[29]</ref>. This method is applied to stabilize the numerical solution at the interface by enforcing the heat flux continuity but allowing a small temperature discontinuity (i.e., the weak boundary condition) in an average sense at the interface between any two neighboring blocks. With the multi-block POD model for a large chip partitioned into a large number of building blocks, parallel computing can also be implemented in POD mode generation and thermal simulation to further enhance the computational efficiency.</p><p>Continuing a previous study <ref type="bibr">[27]</ref>, this work investigates a two-block POD model that projects two building blocks (Core 1 and its adjacent L2 cache) in AMD Athlon II X4 610e <ref type="bibr">[30]</ref> to a POD space described by the two-block POD modes. For each building block, DNSs are performed in FEniCS <ref type="bibr">[5]</ref>, an open-source FEM platform, to collect temperature data for the extraction of POD modes. 
The two-block POD model is demonstrated and verified against the DNS, and the POD results are shown to be in very good agreement with the DNS, with a reduction of almost 4 orders of magnitude in the DoF.</p></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>II. THERMAL SIMULATION METHODOLOGY BASED ON POD</head></div>
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Single-block model</head><p>Using the POD method, the physical domain is projected onto a mathematical space represented by a finite number of POD modes. The temperature in space and time, &#119879;(&#119903;&#8401;, &#119905;), can then be represented by a linear combination of the selected POD modes &#120593;_i as <formula notation="TeX">T(\vec{r},t)=\sum_{i=1}^{M}a_i(t)\,\varphi_i(\vec{r}), \qquad (1)</formula></p><p>where &#120593;_i is the i-th POD mode, M is the number of selected POD modes, which determines the accuracy and efficiency of the POD approach, and &#119886;_i(&#119905;) is the time-dependent coefficient of the i-th POD mode.</p><p>To obtain an optimal set of POD modes, each POD mode is obtained by maximizing the mean square inner product of the thermal solution with the mode, <formula notation="TeX">\left\langle\left(\int_{\Omega}T(\vec{r},t)\,\varphi(\vec{r})\,d\Omega\right)^{2}\right\rangle\Big/\int_{\Omega}\varphi(\vec{r})^{2}\,d\Omega, \qquad (2)</formula></p><p>where &#937; is the physical domain of the selected structure and the brackets &#9001; &#9002; denote the average over the collected thermal solution data. For dynamic thermal simulation, the average is computed over temporal samples (snapshots) obtained from DNSs. The maximization process in (<ref type="formula">2</ref>) gives rise to a Fredholm equation for the POD modes, <formula notation="TeX">\int_{\Omega'}\mathbf{R}(\vec{r},\vec{r}\,')\,\varphi(\vec{r}\,')\,d\vec{r}\,'=\lambda\,\varphi(\vec{r}), \qquad (3)</formula></p><p>where &#119825;(&#119903;&#8401;, &#119903;&#8401;&#8242;) is a two-point correlation tensor expressed as <formula notation="TeX">\mathbf{R}(\vec{r},\vec{r}\,')=\left\langle T(\vec{r},t)\otimes T(\vec{r}\,',t)\right\rangle. \qquad (4)</formula></p><p>With the temperature data &#119879;(&#119903;&#8401;, &#119905;) of the simulation domain collected from DNSs, the method of snapshots <ref type="bibr">[25]</ref>, <ref type="bibr">[26]</ref> is applied to solve the eigenvalue problem in (3) for the eigenvalues &#120582;_i and POD modes &#120593;_i.</p><p>With the generated POD modes, the heat conduction equation can be projected onto a POD space represented by the POD modes using the Galerkin projection, <formula notation="TeX">\int_{\Omega}\left(\varphi_i\frac{\partial\rho CT}{\partial t}+\nabla\varphi_i\cdot k\nabla T\right)d\Omega=\int_{\Omega}\varphi_i(\vec{r})\,P_d(\vec{r},t)\,d\Omega-\int_{S}\varphi_i(\vec{r})\left(-k\nabla T\cdot\vec{n}\right)dS, \qquad (5)</formula></p><p>where &#119896; is the thermal conductivity, &#120588; is the density, &#119862; is the specific heat, &#119875;_d(&#119903;&#8401;, &#119905;) is the power density, &#119878; is the boundary surface of the selected domain, and &#119899;&#8401; is the outward normal vector of the boundary surface. Substituting (1) into (<ref type="formula">5</ref>) leads to an M-dimensional ordinary differential equation (ODE) for &#119886;_i(&#119905;), <formula notation="TeX">\sum_{j=1}^{M}c_{i,j}\frac{da_j}{dt}+\sum_{j=1}^{M}g_{i,j}\,a_j=P_i, \qquad i=1,\dots,M, \qquad (6)</formula></p><p>where &#119875;_i, representing the last 2 terms of (<ref type="formula">5</ref>) for the i-th mode, is the power density dissipated in the POD space and can be pre-evaluated since the shape of the power density is predefined, and &#119888;_{i,j} and &#119892;_{i,j} are the elements of the thermal capacitance and thermal conductance matrices in the POD space, defined as <formula notation="TeX">c_{i,j}=\int_{\Omega}\rho C\,\varphi_i\,\varphi_j\,d\Omega, \qquad g_{i,j}=\int_{\Omega}k\,\nabla\varphi_i\cdot\nabla\varphi_j\,d\Omega. \qquad (7)</formula></p><p>As presented above, the POD model development consists of thermal data collection from DNS, calculation of the POD modes and eigenvalues from (3) using the snapshot method, and evaluation of the model parameters in (<ref type="formula">7</ref>). This training process could be computationally intensive for a large simulation domain with a high resolution. To minimize the computational resources needed for training, the large domain is partitioned into smaller building blocks, as presented next.</p></div>
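<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a minimal sketch of the training steps summarized above, the following Python code applies the method of snapshots to synthetic one-dimensional data and then recovers the temperature from the POD coefficients as in (1). The synthetic snapshot data, the trapezoidal quadrature weights, and the function name pod_modes_by_snapshots are illustrative assumptions; they are not taken from the FEniCS workflow used in this work.</p>
<p><code lang="python"><![CDATA[
```python
import numpy as np


def pod_modes_by_snapshots(snapshots, weights, num_modes):
    """Return the leading POD eigenvalues and modes (method of snapshots).

    snapshots: (n_nodes, n_snap) array, one temperature snapshot per column.
    weights:   (n_nodes,) quadrature weights approximating the integral over Omega.
    """
    n_snap = snapshots.shape[1]
    # Small n_snap x n_snap correlation matrix of the snapshots.
    corr = snapshots.T @ (weights[:, None] * snapshots) / n_snap
    eigvals, eigvecs = np.linalg.eigh(corr)  # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:num_modes]
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    # Each mode is a linear combination of the snapshots.
    modes = snapshots @ eigvecs
    # Normalize so that the weighted integral of each squared mode is 1.
    norms = np.sqrt(np.sum(weights[:, None] * modes**2, axis=0))
    return eigvals, modes / norms


# Synthetic data (assumption): two spatial patterns with different time histories.
x = np.linspace(0.0, 1.0, 101)
t = np.linspace(0.0, 1.0, 40)
data = (np.outer(np.cos(np.pi * x), 1.5 + np.sin(2 * np.pi * t))
        + 0.05 * np.outer(np.cos(2 * np.pi * x), np.cos(6 * np.pi * t)))

# Trapezoidal weights on the uniform grid.
weights = np.full(x.size, x[1] - x[0])
weights[0] *= 0.5
weights[-1] *= 0.5

eigvals, modes = pod_modes_by_snapshots(data, weights, num_modes=2)

# Coefficients a_i(t) by projection onto the modes, then reconstruction as in (1).
coeffs = modes.T @ (weights[:, None] * data)
recon = modes @ coeffs
rel_err = np.linalg.norm(data - recon) / np.linalg.norm(data)
print(eigvals, rel_err)
```
]]></code></p>
<p>Because the synthetic data contain only two independent spatial patterns, two modes are enough to reproduce them to round-off; real DNS data would instead show a gradually decaying eigenvalue spectrum.</p></div>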
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Multi-block model</head><p>When blocks are placed together, the last term of (5) needs to be reformulated to account for the heat flux across the interface between adjacent blocks. The DG method <ref type="bibr">[28]</ref>, <ref type="bibr">[29]</ref> is applied to properly enforce the interface thermal continuity, and (5) for the multi-block model becomes</p><p>where &#10214; * &#10215; and { * } indicate the difference and the average across the interface, respectively, and &#120583; is the penalty constant defined as N&#181;/&#119889;&#119903; (&#119889;&#119903; is the size of the local element and N&#181; is the penalty number). &#119878; is the interface surface between two adjacent blocks. N&#181; can be adjusted to balance the discontinuities of temperature and heat flux at the interface in order to minimize the least square (LS) error and to stabilize the numerical solution.</p><p>For a two-block POD model including the heat flux exchange via the interface, the matrix equation for both POD blocks becomes</p><p>where &#119861;_{p,q} indicates the interface between the &#119901;-th and &#119902;-th blocks. &#119914;_p and &#119918;_p are the thermal capacitance and thermal conductance matrices of the &#119901;-th block, and their elements are given by (<ref type="formula">7</ref>).</p><p>Compared to the thermal conductance matrix of the single-block model, an extra thermal conductance matrix, &#119918; !,# !,# , is included for the &#119901;-th block to account for the effect of the &#119902;-th block on the temperature and heat flux at the interface &#119861;_{p,q} and is given as</p><p>where &#119872;_p is the number of selected POD modes of the &#119901;-th block, and the element of &#119918; !,# !,# is given by</p><p>In the matrix equation (<ref type="formula">9</ref>), the &#119901;-th block is coupled with its adjacent &#119902;-th block via &#119918;_{p,q}. 
If the &#119901;-th block is not directly adjacent to the &#119902;-th block, &#119918;_{p,q} = &#120782;. Otherwise, it is given as</p><p>where &#119872;_q is the number of selected modes of the &#119902;-th block and &#119892; !,$,%,&amp; is given by</p></div>
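<div xmlns="http://www.tei-c.org/ns/1.0"><p>The coupling described above can be illustrated with a hedged Python sketch that stacks the per-block POD matrices into one system and advances it in time. The block layout in assemble_two_block, the backward Euler integrator, the element size passed to penalty_constant, and all matrix entries are assumptions made for illustration only; the actual interface matrices come from the DG formulation, and this work solves the resulting ODE system with PETSc solvers in C++.</p>
<p><code lang="python"><![CDATA[
```python
import numpy as np


def penalty_constant(n_mu, dr):
    """Penalty constant mu = N_mu / dr, with dr the local element size."""
    return n_mu / dr


def assemble_two_block(c1, c2, g1, g2, gb1, gb2, g12, g21):
    """Stack per-block POD matrices into one coupled system (assumed layout)."""
    m1, m2 = c1.shape[0], c2.shape[0]
    c = np.block([[c1, np.zeros((m1, m2))], [np.zeros((m2, m1)), c2]])
    # Diagonal blocks: block conductance plus its interface contribution.
    # Off-diagonal blocks: coupling between the two sets of POD coefficients.
    g = np.block([[g1 + gb1, g12], [g21, g2 + gb2]])
    return c, g


def backward_euler_step(c, g, a, p, dt):
    """One implicit step of C da/dt + G a = P."""
    return np.linalg.solve(c / dt + g, c @ a / dt + p)


# Placeholder matrices (not from the paper): 3 modes per block.
rng = np.random.default_rng(0)
m = 3
a_rand = rng.standard_normal((2 * m, 2 * m))
g_full = a_rand @ a_rand.T + 2 * m * np.eye(2 * m)  # symmetric positive definite
gb1 = gb2 = 0.5 * np.eye(m)
g1, g2 = g_full[:m, :m] - gb1, g_full[m:, m:] - gb2
g12, g21 = g_full[:m, m:], g_full[m:, :m]
c1 = c2 = np.eye(m)

c, g = assemble_two_block(c1, c2, g1, g2, gb1, gb2, g12, g21)
p = rng.standard_normal(2 * m)   # constant excitation in POD space
a = np.zeros(2 * m)              # POD coefficients of both blocks
for _ in range(200):
    a = backward_euler_step(c, g, a, p, dt=0.1)

steady = np.linalg.solve(g, p)       # steady state that the steps approach
mu = penalty_constant(7, 1.0e-5)     # N_mu = 7 with a hypothetical dr of 10 um
```
]]></code></p>
<p>With a constant excitation, the time-stepped coefficients settle to the steady solution of the coupled system, which is a convenient sanity check before applying a dynamic power map.</p></div>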
<div xmlns="http://www.tei-c.org/ns/1.0"><head>III. CHIP-LEVEL THERMAL SIMULATION USING MULTI-BLOCK POD MODEL</head><p>The AMD Athlon II X4 610e processor, which consists of four cores, two L2 caches, one northbridge, three I/O units and one DDR3 module, is selected in this investigation, as shown in Fig. <ref type="figure">1</ref>. The dimensions of the quad-core chip are 14 mm &#215; 12 mm &#215; 242 &#181;m (length &#215; width &#215; thickness), and the material properties are listed in Table <ref type="table">I</ref>. In the single-block model, the thermal simulation is performed via DNSs over the entire chip to collect temperature data, which is, as discussed above, computationally intensive for a large chip with a high resolution. In the multi-block model, however, the temperature data are collected independently for each building block in the entire domain. The dynamic power map is applied to the top layer (named the device layer hereafter) of the chip, with a device layer thickness of 55.8 &#181;m. For data collection of each building block, the thermal simulation is performed over a simulation domain that consists of the green blocks and the building block shown in Fig. <ref type="figure">1</ref>. The simulation domain for data collection of each embedded building block is shown in Fig. <ref type="figure">2</ref>. In such a setting, the solution data collected from each building block can account for the variation of the block BCs induced by the power excitations outside the block.</p><table><head>TABLE I. TEMPERATURE-INDEPENDENT MATERIAL PROPERTIES</head><row role="label"><cell>Specific heat, C (J/(kg&#8226;K))</cell><cell>Density, &#120646; (kg/m&#179;)</cell><cell>Thermal conductivity, k (W/(m&#8226;K))</cell></row><row><cell>751.1</cell><cell>2330</cell><cell>100</cell></row></table><p>All outer surfaces of the simulation domain for data collection are assumed adiabatic except for the bottom, where the convection BC is implemented with a constant heat transfer coefficient and an ambient temperature &#119879;_amb of 45 &#8451;. 
The outer boundary surface of the simulation domain is 3 mm from the building block. The dynamic power density in each building block is randomly generated in time; in each time step, it represents an averaged power density over 48k CPU cycles at 3.5 GHz, with a total power approximately equal to 8.9 W for L2 Cache and 16 W for Core 1. In the surroundings of each building block, the dynamic power density is generated with a different random sequence, which provides variation of the interface flux on each side of the building block and allows the POD modes to adapt to realistic BC variations to construct a more effective POD model.</p><p>In this work, the collection of thermal data in the CPU domain for training POD modes and the calculation of the POD model coefficients in (<ref type="formula">10</ref>)-(<ref type="formula">13</ref>) are carried out in FEniCS-FEM. The solution of the ODE matrix equation in (<ref type="formula">9</ref>) and the post-processing calculations for the predicted temperature in (1) are performed in C++ using solvers in the PETSc library.</p></div>
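<div xmlns="http://www.tei-c.org/ns/1.0"><p>As a worked example of the time resolution implied by the power map described above, the short Python sketch below converts 48k CPU cycles at 3.5 GHz into a time-step length and generates independent random power traces. Reading "48k" as 48,000 cycles, the uniform random scaling of the total power, the number of steps, and the helper name random_power_sequence are assumptions for illustration; the text does not specify how the random sequences are generated.</p>
<p><code lang="python"><![CDATA[
```python
import random

CYCLES_PER_STEP = 48_000  # "48k" read as 48,000 cycles (assumption)
CLOCK_HZ = 3.5e9          # 3.5 GHz clock

# Duration covered by one averaged power-density sample, in seconds.
time_step = CYCLES_PER_STEP / CLOCK_HZ


def random_power_sequence(total_power_w, num_steps, seed):
    """Hypothetical trace: a uniform random fraction of the total power per step."""
    rng = random.Random(seed)
    return [total_power_w * rng.random() for _ in range(num_steps)]


# Different seeds give different sequences for each block, so training,
# surrounding, and verification traces are independent of one another.
core_trace = random_power_sequence(16.0, 500, seed=1)    # Core 1, about 16 W
cache_trace = random_power_sequence(8.9, 500, seed=2)    # L2 Cache, about 8.9 W
verify_trace = random_power_sequence(16.0, 500, seed=3)  # separate verification trace

print(f"time step = {time_step * 1e6:.2f} us")  # time step = 13.71 us
```
]]></code></p></div>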
<div xmlns="http://www.tei-c.org/ns/1.0"><head>A. Temperature data collection</head><p>In this work, L2 Cache and Core 1 of the quad-core chip are selected as Block 1 and Block 2, respectively, as shown in Figs. <ref type="figure">1</ref> and <ref type="figure">2</ref>. The dynamic thermal simulation is performed independently for Block 1 and Block 2 in FEniCS-FEM to collect the dynamic temperature data of each building block and to generate its eigenvalues and POD modes. Each eigenvalue represents the mean squared temperature variation captured by the corresponding POD mode, and therefore the eigenvalue spectrum indicates the number of POD modes needed to offer an accurate temperature solution. The eigenvalue spectra of the two building blocks are shown in Fig. <ref type="figure">3</ref>. For both Block 1 and Block 2, the eigenvalue decreases by two orders of magnitude from the first to the second mode and by four orders from the first to the third mode. Based on the rapid reduction in the eigenvalue for the first few modes, it is expected that the two-block POD model with a small number of modes is able to offer an accurate prediction of the dynamic temperature solution. However, this expectation can be met only if the quality of the data collected from the DNSs is reasonably good.</p></div>
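<div xmlns="http://www.tei-c.org/ns/1.0"><p>The way an eigenvalue spectrum suggests a mode count can be sketched in a few lines of Python. The spectrum below is illustrative only: its first three values follow the two-order and four-order drops described above, while the remaining values, the tolerance, and the truncation rule (stop once the discarded eigenvalues fall below a chosen fraction of the total) are assumptions rather than the criterion used in this work.</p>
<p><code lang="python"><![CDATA[
```python
def modes_needed(eigenvalues, tol):
    """Smallest number of leading modes whose discarded eigenvalue fraction is below tol."""
    total = sum(eigenvalues)
    captured = 0.0
    for count, lam in enumerate(eigenvalues, start=1):
        captured += lam
        if (total - captured) / total < tol:
            return count
    return len(eigenvalues)


# Illustrative spectrum, normalized to the first eigenvalue (not the paper's data).
spectrum = [1.0, 1e-2, 1e-4, 1e-5, 1e-6]
num_modes = modes_needed(spectrum, tol=1e-4)
print(num_modes)
```
]]></code></p>
<p>For this illustrative spectrum and tolerance the rule selects three modes; a looser tolerance would select fewer, so the eigenvalue spectrum is only a guide and the LS error against DNS remains the actual check.</p></div>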
<div xmlns="http://www.tei-c.org/ns/1.0"><head>B. Verification of the multi-block model</head><p>To demonstrate the validity of the two-block POD model, thermal simulation based on (9) is performed for the domain consisting of Core 1 (Block 2) and its adjacent L2 Cache (Block 1), shown in Fig. <ref type="figure">1</ref>. The dynamic power density applied to each of the 2 blocks is generated using a random sequence different from those used in the POD mode training. Adiabatic BCs are applied to all surfaces except for the bottom of the chip, where a constant heat transfer coefficient is implemented with an ambient temperature of 45 &#176;C. Thermal simulation is also performed via FEniCS-FEM with identical settings, including heat sources and BCs, to validate the accuracy of the two-block POD model. In this demonstration, the same number of POD modes is used in both blocks.</p><p>The LS error estimated from the equation below for the two-block POD model is a function of the number of POD modes.</p><p>where the index i denotes the time step (snapshot), and &#119879;_i(&#119903;&#8401;) and &#119890;_i(&#119903;&#8401;) are the temperature solution from FEniCS-FEM and the temperature difference between FEniCS-FEM and the POD model, respectively. For the two-block POD model, the DG method with an adjustable penalty number N&#181; <ref type="bibr">[28]</ref>, <ref type="bibr">[29]</ref> is used to enforce the thermal continuity across the interface between Core 1 and L2 Cache. The effect of the penalty number on the LS error is shown in Fig. <ref type="figure">4</ref> for the two-block POD model. It is observed that, when using 2 or more modes in the POD model, the LS error reaches a minimum value with N&#181; near 7. The LS error vs. the number of modes with N&#181; = 7 is thus plotted in Fig. The POD simulation demonstrated above reveals a reduction of 4 orders of magnitude in the numerical DoF, compared to FEniCS-FEM, which results in a significant saving in computing time. 
The POD simulation includes solving the ODE in (<ref type="formula">6</ref>) and the post-processing calculation using (1) to recover the temperature solution. The computational time of thermal simulation for the selected two blocks using FEniCS-FEM and the two-block POD model is shown in Table <ref type="table">II</ref>, where Post1 and Post2 denote the post-processing calculations of temperature in the entire domain and in the device layer, respectively. As shown in Table <ref type="table">II</ref>, thermal simulation based on the two-block POD model with 3 modes is 1959 times faster than FEniCS-FEM. In practice, only the temperature in the device layer is required, which would offer a speedup of 3918 times, compared to FEniCS-FEM.</p><p>Based on the results presented in Fig. <ref type="figure">4</ref>, the optimal penalty number is N&#181; = 7, and thus the detailed comparison of the dynamic thermal distributions obtained from FEniCS-FEM and the two-block POD model is given below with N&#181; = 7. As expected from the eigenvalue spectrum shown in Fig. <ref type="figure">3</ref> and the LS error in Fig. <ref type="figure">5</ref>, the temperature evolution predicted by the two-block POD model with just 3 POD modes is in very good agreement with that obtained from FEniCS-FEM over the entire simulation time. The temperature evolution in time at the center of L2 Cache is given in Fig. <ref type="figure">6</ref>. Similarly, the dynamic temperature at the center of Core 1 is illustrated in Fig. <ref type="figure">7</ref>, where the temperature solutions obtained from the two-block POD model with 3 or 5 POD modes and from FEniCS-FEM almost overlap each other.</p><p>TABLE II. COMPUTATIONAL TIME OF THERMAL SIMULATION FOR THE MULTI-BLOCK POD AND FENICS-FEM METHODS.</p><p>The temperature distribution at t = 6.6 ms from L2 Cache to Core 1 along the centers of these 2 blocks across the interface is illustrated in Fig. <ref type="figure">8</ref>. 
The temperature profile provided by the two-block POD model with 3 POD modes agrees very well with the temperature profile from FEniCS-FEM, which is consistent with the information indicated by the eigenvalue spectrum in Fig. <ref type="figure">3</ref> and the LS error in Fig. <ref type="figure">5</ref>. Compared with the FEniCS-FEM results at the centers of L2 Cache and Core 1, the differences are approximately 2.6% and 1.0% (or 0.09 &#176;C and 0.14 &#176;C), respectively, when using the two-block POD model with 3 POD modes.</p></div><div xmlns="http://www.tei-c.org/ns/1.0"><head>IV. CONCLUSION</head><p>A multi-block thermal simulation methodology enabled by the data-driven POD model and domain decomposition has been investigated. The approach has been applied to develop a two-block POD model for thermal simulation of 2 selected blocks, Core 1 and L2 Cache, from a quad-core chip, AMD Athlon II X4 610e. This study has shown that an appropriate penalty number N&#181; is needed in the two-block POD model to minimize the interface discontinuity for an optimal prediction of the dynamic thermal distribution in the two-block domain. It is found that an LS error below 3% can be achieved for the two-block POD model with 3 modes in each block if 4 &#8804; N&#181; &#8804; 10. This results in a reduction in the numerical DoF by nearly 4 orders of magnitude and leads to a speedup of nearly 2000 times or 4000 times for a thermal prediction of the entire two-block domain or the device layer, respectively, compared to FEniCS-FEM. This work initiates the development of a multi-block POD thermal simulation methodology at the chip level for an entire CPU or GPU. Such a multi-block concept offers a very efficient approach to the generation of the POD modes and the calculation of the POD model parameters to construct a POD model for very large chips like multicore CPUs or many-core GPUs, especially when a higher resolution is needed. 
All CPUs and GPUs are designed and constructed based on building blocks, such as cores, caches, I/O units, memory modules, etc., as in the selected AMD Athlon processor. When developing a multi-block POD model, a useful practice would be to use these standard building blocks to partition the entire processor into a multi-block domain. For a semiconductor chip consisting of a large number of POD blocks, parallel computing can also be applied to further improve the POD simulation efficiency.</p></div><note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_0"><p>978-1-6654-8503-6/22/$31.00 &#169;2022 IEEE</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_1"><p>21st IEEE ITHERM Conference</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_2"><p><formula notation="TeX">\cdots=\int_{\Omega}\varphi_i(\vec{r})\,P_d(\vec{r},t)\,d\Omega-\int_{S}\varphi_i(\vec{r})\left(-k\nabla T\cdot\vec{n}\right)dS, \qquad (5)</formula> where &#119896; is thermal conductivity, &#120588; is the density, &#119862; is the specific heat, &#119875;_d(&#119903;&#8401;, &#119905;) is the power density, &#119878; is the boundary</p></note>
			<note xmlns="http://www.tei-c.org/ns/1.0" place="foot" xml:id="foot_3"><p><formula notation="TeX">g_{i,j}=\int_{\Omega}k\,\nabla\varphi_i\cdot\nabla\varphi_j\,d\Omega. \qquad (7)</formula> Once &#119886;_j is determined from (6), the temperature solution can be evaluated from (1).</p></note>
		</body>
		</text>
</TEI>
