Blind Identification of Thermal Models and Power Sources from Thermal Measurements


In IEEE Sensors Journal

Sherief Reda, Senior Member, IEEE, Kapil Dev, Member, IEEE, and Adel Belouchrani, Senior Member, IEEE

S. Reda and K. Dev are with the School of Engineering, Brown University, 184 Hope St, Providence, RI 02912, USA (sherief_reda@brown.edu). A. Belouchrani is with the Electrical Engineering Department/LDCCP, Ecole Nationale Polytechnique, Algiers, Algeria (adel.belouchrani@enp.edu.dz). An earlier version of this paper appeared at DATE 2017 [16]. This submission contains numerous novel materials, including (1) an entirely new application and results for the proposed method for infrared imaging; (2) improvements to the original blind identification algorithm using a new initialization method that leads to much better results; (3) characterization of noise in modern processors using infrared imaging and analysis of the impact of noise on power estimation; and (4) analysis of the impact of training data size on model accuracy.

Abstract

The ability to sense the temperatures and power consumption of various key components of a chip is central to the operation of modern integrated circuits such as processors. While modern chips often include a number of embedded thermal sensors, they lack the ability to sense power at fine granularity. This paper proposes a new direction to simultaneously identify the thermal models and the fine-grain power consumption of a chip from just the measurements of the thermal sensors and the total power consumption. Our identification technique is blind, as it does not require design knowledge of the thermal-power model to identify the power sources. We investigate the main challenges in blind identification, which are the permutation and scaling ambiguities, and propose novel techniques to resolve these ambiguities. We implement our technique and apply it in three contexts. First, we implement it within a controlled simulation environment, which enables us to verify its accuracy and analyze its sensitivity to relevant issues, such as measurement noise and the number of available training samples. Second, we apply it to a real multi-core CPU+GPU processor-based system, where we show the ability to identify the runtime power consumption of the individual cores using just the total power measurement and the measurements of the embedded thermal sensors under different workloads. Third, we apply it for non-invasive power sensing of chips by inverting the temperatures measured using an external infrared imaging camera. We show that our technique consistently improves the modeling and sensing accuracy of integrated circuits.

I. INTRODUCTION

To enable correct thermal and power management, it is necessary to sense key physical metrics such as the power consumption and temperatures of the various components that make up a chip. For modern chips, one can obtain a few thermal measurements using embedded thermal sensors inside the chip, or get fine-grain thermal maps using external thermal imaging systems that capture infrared emissions. Direct power sensing is more restricted, as its granularity is limited by the number of power domains inside the chip. For instance, modern processors use the running average power limit (RAPL) interface to enable applications to measure the power consumption [9]; however, these measurements are typically coarse-grain, only giving the power consumption of all cores, uncore units and total package power. Further, one can invert the thermal measurements, either from internal thermal sensors or from an external infrared camera, to identify the power consumption of the various components that make up a chip [7].
The goal of this paper is to blindly estimate both the thermal models of the chip and the power consumption of individual chip components from the total power consumption of the chip and the temperature measurements obtained through either internal thermal sensors or an external infrared camera. In contrast to previous work, our blind identification method (1) neither assumes nor needs any prior design-based models for power or temperature; (2) does not require any special conditions or calibrations during runtime; and (3) does not need additional measurements such as performance counters [1], [13], [22]. Our methodology uses only thermal sensor measurements and total power consumption under regular operating conditions to simultaneously identify the thermal model and a fine-grain map of power consumption.

The contributions of this paper are as follows.

We formulate the blind power identification (BPI) problem to estimate the power consumption of individual units in a chip together with the chip's thermal model using only the total power measurements and the thermal measurements through either internal means (e.g., thermal sensors) or external means (e.g., infrared camera).

Existing general blind identification methods cannot pin-point the exact location of the power sources, and their power estimates can be off by a constant factor [2]. To eliminate these ambiguities, a novel methodology for BPI is devised that exploits the physical characteristics of thermal transfer to provide a unique thermal model that is consistent with the measurements. Our method handles steady-state and transient operation seamlessly, enabling users to track power consumption during runtime.

We verify the accuracy of our method within a controlled simulation environment, and show that the proposed BPI method is able to resolve the power consumption of various multi-core processor configurations accurately.
Further, we characterize the noise in thermal sensor measurements, and use the simulation environment to analyze the impact of thermal sensor noise and the number of training samples on the accuracy of our proposed method.

We implement our methodology on a real quad-core CPU+GPU processor and apply it to estimate the power consumption of its cores under various standard benchmarks from the measurements of the internal thermal sensors and the RAPL interface. We also apply our method to obtain fine-grain power maps of a test chip from detailed thermal measurements, which are obtained using an infrared imaging camera that captures the chip's thermal radiation.

The organization of this paper is as follows. In Section II, we provide background on the thermal-power physical model and review the related work. The problem formulation for blind estimation of thermal models and power sources is provided in Section III. The proposed methodology for blind estimation is described in Section IV. The experimental results are provided in Section V. Lastly, the main conclusions of this work are summarized in Section VI.

II. BACKGROUND AND RELATED WORK

A. Thermal-Power Physical Model

The physical relationship between the power and temperature of a chip is governed by the heat diffusion equation [21]:

$$\rho(x, y, z)\, c_p(x, y, z)\, \frac{\partial t(x, y, z, \tau)}{\partial \tau} = \nabla \cdot \left[\kappa(x, y, z)\, \nabla t(x, y, z, \tau)\right] + p(x, y, z, \tau), \tag{1}$$

where $(x, y, z)$ denotes the location (e.g., the location of a core) and $\tau$ denotes the time instance; $\rho(x, y, z)$ and $c_p(x, y, z)$ are the density and specific heat of the material, respectively; $\kappa(x, y, z)$ is the thermal conductivity of the chip at a specific location; and $t(x, y, z, \tau)$ and $p(x, y, z, \tau)$ are the temperature and power dissipation at a specific location of the chip and time, respectively.

In a practical implementation, the heat diffusion equation has to be discretized in both the space and time domains. This discretization is needed because any computer has a finite memory size. Moreover, the spatial resolution of the thermal imaging equipment sets a limit on the discretization granularity in the space domain. Similarly, the sampling rates of the thermal camera or internal thermal sensors determine the discretization granularity in the time domain. The discretized model is commonly termed a lumped model.
Further, using the duality between thermal and electric models, we can express the lumped model as a resistance-capacitance (RC) network, where thermal resistance, capacitance, temperature, and power values in a thermal model are analogous to electrical resistance, capacitance, voltage, and current, respectively.

Spatial Discretization: To discretize the heat diffusion equation in the space domain, the chip and the cooling system are assumed to be made of smaller blocks. For example, it could be assumed that the chip is made up of different building blocks, such as cores, caches, etc. Similarly, the entire geometry could be composed of a heat sink in a forced-air ambient environment, heat spreader, bulk silicon, active layer, and packaging material, or any other geometry and combination of materials. For numerical thermal analysis in the spatial domain, a seven-point finite difference discretization method can be applied [23]; the seven points/blocks are shown in Figure 1. Each block is assumed to be an independent entity, so that the entire block represents a single power source and has a uniform temperature.

Fig. 1. Illustration of the discretized model in the spatial domain. The center point/block in the grid is located at (i, j, l), where the i, j, and l offsets represent discrete blocks along the x, y, and z axes, respectively. Its six neighbors are (i-1, j, l), (i+1, j, l), (i, j-1, l), (i, j+1, l), (i, j, l-1), and (i, j, l+1).
The discretized heat diffusion equation at an interior point (i, j, l) of the discretized grid can be written as:

$$\rho c_p V \frac{d t_{i,j,l}(\tau)}{d\tau} = G_x\left[t_{i-1,j,l}(\tau) - 2 t_{i,j,l}(\tau) + t_{i+1,j,l}(\tau)\right] + G_y\left[t_{i,j-1,l}(\tau) - 2 t_{i,j,l}(\tau) + t_{i,j+1,l}(\tau)\right] + G_z\left[t_{i,j,l-1}(\tau) - 2 t_{i,j,l}(\tau) + t_{i,j,l+1}(\tau)\right] + V p_{i,j,l}(\tau), \tag{2}$$

where i, j, and l represent discrete blocks along the x, y, and z axes, respectively; $V$ denotes the block volume, with $V = \Delta x\, \Delta y\, \Delta z$, where $\Delta x$, $\Delta y$, and $\Delta z$ are the discretization steps along the x, y, and z axes; $t_{i,j,l}(\tau)$ denotes the temperature at location (i, j, l) at time $\tau$; and $G_x$, $G_y$, and $G_z$ are the thermal conductances between adjacent blocks, defined as $G_x = \kappa\, \Delta y\, \Delta z / \Delta x$, $G_y = \kappa\, \Delta x\, \Delta z / \Delta y$, and $G_z = \kappa\, \Delta x\, \Delta y / \Delta z$.

If we divide the entire chip package into N discrete elements, Equation 2 translates into the following thermal modeling equation:

$$C_c \frac{d t(\tau)}{d\tau} = A_c\, t(\tau) + p(\tau), \tag{3}$$

where the thermal capacitance matrix $C_c \in \mathbb{R}^{N \times N}$ is a diagonal matrix; $A_c \in \mathbb{R}^{N \times N}$ is the thermal conductance matrix; and $t(\tau) \in \mathbb{R}^{N \times 1}$ and $p(\tau) \in \mathbb{R}^{N \times 1}$ are the thermal and power vectors, respectively.

Temporal Discretization: We now discretize Equation 3 in the time domain using the (backward) Euler method. The following discrete-time state-space form is obtained:

$$t(k) = t(k-1) + \Delta\tau\, C_c^{-1}\left[A_c\, t(k) + p(k)\right], \tag{4}$$

where $\Delta\tau$ is the sampling interval and $k$ is the sampling instance (i.e., $\tau = k\, \Delta\tau$). Rearranging the above equation, we obtain:

$$t(k) = A\, t(k-1) + B\, p(k), \tag{5}$$

where $A = (I - \Delta\tau\, C_c^{-1} A_c)^{-1}$ and $B = \Delta\tau\, A\, C_c^{-1}$, with $I$ being the identity matrix.
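To make Equations 3 to 5 concrete, the following is a minimal NumPy sketch that builds a lumped model for a small planar grid of blocks and applies the backward Euler step. It is an illustration under stated assumptions rather than the authors' implementation: the helper names `grid_conductance` and `discretize`, the 2 x 2 layout, and all conductance, capacitance, and time-step values are hypothetical, temperatures are taken relative to ambient, and an assumed conductance to ambient is added to every block so that the conductance matrix is invertible.

```python
import numpy as np

def grid_conductance(rows, cols, g_lateral, g_ambient):
    """Build A_c (sign convention of Eq. 3) for a rows x cols grid of blocks."""
    n = rows * cols
    a_c = np.zeros((n, n))
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            for dr, dc in ((0, 1), (1, 0)):      # right and lower neighbours
                rr, cc = r + dr, c + dc
                if rr < rows and cc < cols:
                    j = rr * cols + cc
                    a_c[i, j] += g_lateral       # heat exchanged between i and j
                    a_c[j, i] += g_lateral
                    a_c[i, i] -= g_lateral
                    a_c[j, j] -= g_lateral
            a_c[i, i] -= g_ambient               # assumed heat path to ambient
    return a_c

def discretize(a_c, c_c, dt):
    """Backward Euler: A = (I - dt C_c^-1 A_c)^-1 and B = dt A C_c^-1 (Eq. 5)."""
    n = a_c.shape[0]
    c_inv = np.diag(1.0 / np.diag(c_c))          # C_c is diagonal
    a = np.linalg.inv(np.eye(n) - dt * c_inv @ a_c)
    b = dt * a @ c_inv
    return a, b

# Hypothetical 2 x 2 chip; all numeric values are illustrative only.
a_c = grid_conductance(2, 2, g_lateral=0.5, g_ambient=0.2)
c_c = np.diag(np.full(4, 0.05))
A, B = discretize(a_c, c_c, dt=0.01)

# Steady-state transfer matrix implied by the discrete model: (I - A)^-1 B.
R = np.linalg.solve(np.eye(4) - A, B)
print(np.round(R, 3))
```

For this sketch the discrete steady-state matrix coincides with the continuous-time one, $-A_c^{-1}$, and each column of `R` peaks on the diagonal, i.e., a block heats itself more than it heats its neighbors.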

Typically, all practical measurements contain noise. To account for the measurement noise, we add an additive noise term $\epsilon(k)$ to the above equation. Hence, the final thermal and power model of a many-core processor chip in the discretized state-space form is given by [1], [6], [17], [18], [22]:

$$t(k) = A\, t(k-1) + B\, p(k) + \epsilon(k), \tag{6}$$

where $t(k)$ and $p(k)$ are vectors that denote the temperature and power consumption measurements of the cores at time $k$, respectively; $A$ and $B$ are the two modeling matrices that capture the physical relationship between power and temperature; and $\epsilon(k)$ is a vector that represents the noise in the measurement process at time $k$. Note that if one knows the matrices $A$ and $B$, then one can recover the individual powers of the cores over time (i.e., $p(k)$) relatively easily by just using the measurements of the thermal sensors and applying inversion techniques [7].

B. Model Identification

To identify the state-space model that links temperatures and power, there are two general approaches: a design-time approach and a runtime approach.

The design-time approach requires extensive information about the layout of the chip and its package characteristics [12]. For instance, the user would require knowledge of the layout of the chip, its materials, and its heat sink configuration to generate the appropriate entries in the model matrices. Thus, the design-time approach requires the transfer of proprietary, processor-specific information (i.e., the state-space models) to the users for deployment during runtime. This approach can also be prone to errors due to variabilities arising from manufacturing or ambient conditions.

The runtime approach identifies the state-space models from physical measurements during runtime. The processor is treated as a gray or black box, and machine learning or system identification techniques are used to identify the state-space models from the thermal sensor measurements [1], [8], [15], [18], [22].
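As an illustration of the conventional runtime approach, the sketch below simulates Equation 6 and then fits $A$ and $B$ by ordinary least squares, assuming per-core power measurements are available. This is a generic system-identification sketch, not the specific method of any of the cited works; the matrices `A_true` and `B_true`, the noise level, and the power range are hypothetical values chosen only for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, steps = 3, 400

# Hypothetical ground-truth model (illustrative values only).
A_true = np.array([[0.80, 0.05, 0.02],
                   [0.05, 0.80, 0.05],
                   [0.02, 0.05, 0.80]])
B_true = np.array([[0.30, 0.04, 0.01],
                   [0.04, 0.30, 0.04],
                   [0.01, 0.04, 0.30]])

# Simulate Eq. 6 with known per-core power and small measurement noise.
p = rng.uniform(1.0, 10.0, size=(n, steps))
t = np.zeros((n, steps))
for k in range(1, steps):
    noise = rng.normal(0.0, 0.01, size=n)
    t[:, k] = A_true @ t[:, k - 1] + B_true @ p[:, k] + noise

# Stack the regressors so that t(k) = [A B] [t(k-1); p(k)], then solve
# the least-squares problem for the combined matrix [A B].
X = np.vstack([t[:, :-1], p[:, 1:]])                 # shape (2n, steps-1)
Y = t[:, 1:]                                         # shape (n, steps-1)
theta, *_ = np.linalg.lstsq(X.T, Y.T, rcond=None)    # shape (2n, n)
AB = theta.T
A_hat, B_hat = AB[:, :n], AB[:, n:]

print(np.round(A_hat, 3))
print(np.round(B_hat, 3))
```

The fit needs the matrix `p`, which is exactly the per-core power information that fine-grain sensors would have to provide.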
A key assumption in all previous runtime modeling approaches is that there are sensors for the power sources [1], [15], [18], [22]. However, modern processors lack fine-grain power sensors. For instance, the RAPL interface provides the total power consumption for individual domains (e.g., all cores and uncore units), but it does not provide power measurements for the individual cores [9]. Beneventi et al. developed a regression-based model to estimate the power consumption of individual cores, assuming that when a core is active, it is fully busy running at the maximum instructions per cycle [3]. This method does not work well in practice because (1) workloads have a large impact on power consumption, (2) modern processors automatically adjust the voltage and frequency depending on the number of active cores, and (3) per-core leakage power increases when more cores are activated because of thermal coupling.

To summarize, in previous work researchers assumed either (1) the availability of $p(k)$ and sought to identify $A$ and $B$ [6], [17], [22], or (2) the availability of $A$ and $B$ and sought to identify $p(k)$ [7]. In contrast, we seek blind identification of $A$, $B$ and $p(k)$, with no assumption of or need for prior design-based models for any of them. That is, our methodology uses only runtime measurements (i.e., the total power and thermal sensor measurements) to simultaneously identify $A$, $B$ and $p(k)$. We make no assumptions about the availability of power sensors or prior power models, and our method works seamlessly under various frequency and voltage settings. Our technique enables designers to reduce the number of sensors by eliminating the need for physical power sensors, and to instead use measurements of the thermal sensors and total power to derive per-core power consumption.
Modern processors rely on internal micro-controllers to collect the measurements of the sensors and to orchestrate thermal and power management decisions [19], [2]. Our technique can be implemented to run on internal micro-controllers or as a software thread on the main processor.

III. PROBLEM FORMULATION

The blind estimation problem seeks to identify the matrices $A$ and $B$, together with the power profiles $p(k)$. It is well known that the steady-state thermal model can be derived from Equation 6. If a stable set of power sources, denoted by the vector $p_s$, is applied, then after the transient response, the steady-state temperatures, denoted by the vector $t_s$, will be measured; i.e., $t_s = t(k \to \infty)$. By re-arranging Equation 6, one gets

$$t_s \approx A\, t_s + B\, p_s, \tag{7}$$

$$(I - A)\, t_s \approx B\, p_s, \qquad t_s \approx (I - A)^{-1} B\, p_s, \qquad t_s \approx R\, p_s, \tag{8}$$

where $R = (I - A)^{-1} B$ is the steady-state thermal transfer matrix and $I$ is the identity matrix. If one obtains $m$ thermal steady-state measurements, $[t_{s1}\ t_{s2}\ \ldots\ t_{sm}]$, from different experiments using $m$ different sets of power signals, $[p_{s1}\ p_{s2}\ \ldots\ p_{sm}]$, then we can summarize the results using

$$[t_{s1}\ t_{s2}\ \ldots\ t_{sm}] = R\, [p_{s1}\ p_{s2}\ \ldots\ p_{sm}] \tag{9}$$

$$T = R P. \tag{10}$$

If $R$ and $A$ are identified, matrix $B$ can be calculated by

$$B = (I - A) R. \tag{11}$$

Note that the model of Equation (8) is similar to the one commonly used in array signal processing, particularly in blind source separation [2]. The latter consists of blindly identifying the matrix $R$, i.e., by resorting only to the information carried by the measured signals. Before proceeding, it is important to specify the notion of blind identification.

Challenges in Blind Identification. In the blind context, a full identification of the matrix $R$ from model (8) is impossible because the exchange of a fixed scalar factor between a given source signal (a power source) and the corresponding


column of $R$ does not affect the observations (i.e., thermal measurements). That is, if we divide column $i$ of $R$ by an arbitrary factor $\alpha_i$ and multiply the $i$-th row of $P$ by $\alpha_i$, then $T$ does not change. For example, the temperature vectors in the numerical example of Equation 12 remain identical if we divide the third column of $R$ by .2 and multiply the third row of $P$ by .2.

[Equation (12): numerical example; the matrix entries are missing from this copy]

Note also that the labeling in Equation 9 is arbitrary. For example, the temperature vectors in the numerical example of Equation 13 remain identical if we permute $R$ by exchanging the first and third columns, and correspondingly permute $P$ by exchanging the first and third rows.

[Equation (13): numerical example; the matrix entries are missing from this copy]

Hence, the blind identification of $R$ can be performed up to a permutation and scaling of its columns using blind source separation algorithms, but this is not sufficient for our needs, since we would like to resolve the powers of the individual units of the chip (e.g., cores) and map them to the exact units. In the sequel, we propose to (1) take advantage of the particular physical characteristics of thermal transfer to solve the permutation ambiguity, and (2) solve the scaling ambiguity by using the total power measurements.

IV. PROPOSED METHODOLOGY

In this section, we describe the proposed methodology to accurately estimate the thermal modeling matrices and power sources of processors.

A. Estimating Natural Response Matrix

First, we start by estimating the natural response matrix $A$. By forcing $p(k) = 0$ in Equation (6), we get

$$t(k) = A\, t(k-1) + \epsilon(k). \tag{14}$$

Thus, an estimate of $A$ is obtained by least-squares minimization. If we collect $K$ consecutive transient thermal traces, then we can construct two matrices $T_1 = [t(1)\ \cdots\ t(K-1)]$ and $T_2 = [t(2)\ \cdots\ t(K)]$, and solve the following quadratic programming formulation:

$$\min_{A}\ \| T_2 - A\, T_1 \| \quad \text{under the constraint } A \ge 0, \tag{15}$$

where $A \ge 0$ denotes that all entries of $A$ are non-negative.

Fig. 2. Thermal map in Kelvin for a 3 × 3 unit chip, where the bottom-left unit is activated.
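The constrained least-squares problem of Equation 15 can be sketched in a few lines of NumPy. The example below is a hedged illustration, not the authors' solver: it assumes the squared Frobenius norm as the objective, uses a simple projected-gradient loop in place of a dedicated quadratic programming solver, and generates synthetic cool-down traces from a hypothetical matrix `A_true`. The function name `estimate_natural_response` and every numeric value are assumptions made for the example.

```python
import numpy as np

def estimate_natural_response(t1, t2, iters=5000):
    """Minimise ||T2 - A T1||_F^2 subject to A >= 0 by projected gradient."""
    gram = t1 @ t1.T
    step = 1.0 / np.linalg.eigvalsh(gram).max()   # 1 / Lipschitz constant
    a = np.zeros((t2.shape[0], t1.shape[0]))
    for _ in range(iters):
        grad = (a @ t1 - t2) @ t1.T               # gradient of half the objective
        a = np.maximum(a - step * grad, 0.0)      # project onto A >= 0
    return a

rng = np.random.default_rng(1)

# Hypothetical natural response matrix used only to generate test data.
A_true = np.array([[0.80, 0.05, 0.00],
                   [0.05, 0.80, 0.05],
                   [0.00, 0.05, 0.80]])

# Emulate 20 cool-down traces with p(k) = 0 (Eq. 14): temperatures relative
# to ambient decay from random starting points, with small sensor noise.
t1_cols, t2_cols = [], []
for _ in range(20):
    t = rng.uniform(20.0, 60.0, size=3)
    for _ in range(9):
        t_next = A_true @ t + rng.normal(0.0, 0.05, size=3)
        t1_cols.append(t)                         # column of T1
        t2_cols.append(t_next)                    # matching column of T2
        t = t_next
T1 = np.column_stack(t1_cols)
T2 = np.column_stack(t2_cols)

A_hat = estimate_natural_response(T1, T2)
print(np.round(A_hat, 3))
```

Pairs of consecutive samples are formed within each trace only, so no column of `T2` is paired with the last sample of a different trace.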
B. Estimating the Steady-State and Forced Response Matrices

Second, we describe the process of estimating the steady-state thermal transfer matrix ($R$) and the forced response matrix ($B$) given a matrix $T$ of measured steady-state temperatures. The first step is to apply the NMF (Non-negative Matrix Factorization [14]) algorithm to $T$. The NMF algorithm is considered because it inherently copes with the positivity constraint on the matrix $R$ and the power profiles $P$. However, the solution provided by the NMF algorithm is not unique due to the ambiguities highlighted in Section III. Let us define $\tilde{R}$ and $\tilde{P}$ to be the estimates, up to permutation and scaling, of the $R$ and $P$ matrices, respectively. We introduce techniques to resolve these ambiguities in this section.

Resolving Permutation Ambiguity: To solve this ambiguity, we resort to the physical characteristics of the thermal transfer matrix: the thermal impact of a power source is highest at the source location and smaller at the neighboring locations. This physical phenomenon is illustrated in Figure 2, which gives the thermal map of a chip with 3 × 3 units, where the power source at the bottom-left unit is activated. A 3 × 3 chip leads to a 9 × 9 $R$ matrix. If we activate 1 W of power at the unit in the lower-left corner (i.e., block 7), then the thermal impact on all units is given by the 7th column in $R$, where the element on the diagonal has the highest value and the other elements have lower values because of heat diffusion. Thus, the largest value of each column of the thermal transfer matrix should be on the diagonal.
Hence, the correct position for each column in $\tilde{R}$ and row in $\tilde{P}$ is recovered by identifying the position of the maximum value of each column in $\tilde{R}$ and moving the column to the corresponding column position in the matrix. For example, consider the factorization $T = \tilde{R}\tilde{P}$ of Equation 16.

[Equation (16): numerical example of $T = \tilde{R}\tilde{P}$; the matrix entries are missing from this copy]

We reorganize the columns of $\tilde{R}$ and the rows of $\tilde{P}$ correspondingly, so that the maximum elements of the columns of $\tilde{R}$ line up on the diagonal. That is, we swap column 1 with 4 and column 2 with 3 in $\tilde{R}$, and correspondingly swap row 1 with 4 and row 2 with 3 in $\tilde{P}$, to get Equation 17, where the permutation ambiguities in $\tilde{R}$ and $\tilde{P}$ are resolved. We denote the reordered matrices by $\bar{R}$ and $\bar{P}$.

[Equation (17): the same numerical example after reordering, $T = \bar{R}\bar{P}$; the matrix entries are missing from this copy]

Resolving Scaling Ambiguity: Let $\alpha_1, \ldots, \alpha_N$ be the correct scaling factors. Dividing the $N$ columns of $\bar{R}$ and multiplying the $N$ rows of $\bar{P}$ by these factors is equivalent to

$$T = \bar{R}\ \mathrm{diag}\!\left(\tfrac{1}{\alpha_1}, \tfrac{1}{\alpha_2}, \ldots, \tfrac{1}{\alpha_N}\right)\ \mathrm{diag}\!\left(\alpha_1, \alpha_2, \ldots, \alpha_N\right)\ \bar{P}. \tag{18}$$

We observe that the scaling factors can be resolved by measuring the total power, which is an easy measurement. If $[c_1\ c_2\ \ldots\ c_m]$ denote the total power measurements corresponding to the different sets of power sources $P = [p_{s1}\ p_{s2}\ \ldots\ p_{sm}]$, then we get

$$[1\ 1\ \ldots\ 1]\ \mathrm{diag}\!\left(\alpha_1, \alpha_2, \ldots, \alpha_N\right)\ \bar{P} = [c_1\ c_2\ \ldots\ c_m], \tag{19}$$

which can be simplified to

$$[\alpha_1\ \alpha_2\ \ldots\ \alpha_N]\ \bar{P} = [c_1\ c_2\ \ldots\ c_m]. \tag{20}$$

The solution to Equation 20, and hence to the scaling ambiguity problem, is then given by:

$$[\alpha_1\ \alpha_2\ \ldots\ \alpha_N] = [c_1\ c_2\ \ldots\ c_m]\ \bar{P}^{\dagger}, \tag{21}$$

where $\dagger$ denotes the pseudo-inverse operator. The thermal transfer matrix $R$ is finally obtained by sorting the columns of $\tilde{R}$ and re-scaling them using the scaling factors from Equation 21.

Getting Forced Response Matrix: Once matrix $R$ and matrix $A$ (from Subsection IV-A) are identified, the forced response matrix $B$ is estimated through Equation (11): $B = (I - A) R$.

Initialization: One important consideration during the blind identification process is the initialization of the NMF algorithm. We found that the quality of the solution is particularly sensitive to the initialization of the algorithm. In our earlier work [16], we initialized the NMF algorithm with the fast ICA algorithm [11]. In this paper, we argue for a better initialization algorithm that respects the self-coupling of the $R$ matrix.
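As a brief aside on the two ambiguity-resolution steps (Equations 16 to 21), the following NumPy sketch emulates a blind factorization whose columns are permuted and scaled, and then undoes both effects. It is a sketch under stated assumptions: the matrices `R_true` and `P_true`, the permutation, and the scale factors are hypothetical, the factorization itself is emulated rather than computed by NMF, and the helper names `resolve_permutation` and `resolve_scaling` are introduced here for illustration only.

```python
import numpy as np

def resolve_permutation(r_tilde, p_tilde):
    """Move each column of R~ (and row of P~) to the row index of its maximum."""
    rows = np.argmax(r_tilde, axis=0)
    if len(set(rows.tolist())) != rows.size:
        raise ValueError("column maxima do not define a unique ordering")
    r_bar = np.empty_like(r_tilde)
    p_bar = np.empty_like(p_tilde)
    r_bar[:, rows] = r_tilde
    p_bar[rows, :] = p_tilde
    return r_bar, p_bar

def resolve_scaling(r_bar, p_bar, total_power):
    """Solve alpha P_bar = c with the pseudo-inverse (Eq. 21), then rescale."""
    alpha = total_power @ np.linalg.pinv(p_bar)
    r_hat = r_bar / alpha                 # divide column j of R_bar by alpha_j
    p_hat = alpha[:, None] * p_bar        # multiply row j of P_bar by alpha_j
    return r_hat, p_hat, alpha

rng = np.random.default_rng(2)

# Hypothetical transfer matrix: the diagonal dominates every column.
R_true = np.array([[2.0, 0.6, 0.3],
                   [0.6, 2.0, 0.6],
                   [0.3, 0.6, 2.0]])
P_true = rng.uniform(1.0, 10.0, size=(3, 8))   # 8 steady-state experiments
T = R_true @ P_true
c = P_true.sum(axis=0)                         # total power per experiment

# Emulate the output of a blind factorization: permuted and scaled factors
# that reproduce exactly the same temperature matrix T.
perm = np.array([2, 0, 1])
scale = np.array([0.5, 4.0, 2.0])
R_tilde = R_true[:, perm] * scale
P_tilde = P_true[perm, :] / scale[:, None]
print(np.allclose(R_tilde @ P_tilde, T))

R_bar, P_bar = resolve_permutation(R_tilde, P_tilde)
R_hat, P_hat, alpha = resolve_scaling(R_bar, P_bar, c)
print(np.round(R_hat, 3))
```

Because `P_bar` has full row rank in this example, the pseudo-inverse solution of Equation 20 is unique and the original factors are recovered exactly; with noisy measurements the same call returns the least-squares scaling factors instead.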
Thus, we initialize the NMF estimate of $R$ with the identity matrix $I$, which also has the property that the largest element in every column is on the diagonal. We initialize the NMF estimate of the $P$ matrix such that each block in the circuit has the same power and each column adds up to the total measured power. That is,

$$\text{initial } \tilde{R} = I \tag{22}$$

$$\text{initial } \tilde{P} = \begin{bmatrix} c_1/N & c_2/N & \cdots & c_m/N \\ c_1/N & c_2/N & \cdots & c_m/N \\ \vdots & \vdots & & \vdots \\ c_1/N & c_2/N & \cdots & c_m/N \end{bmatrix} \tag{23}$$

Thus, our initial conditions for both $R$ and $P$ ensure that these two matrices are physically in the right form. The offline analysis procedure is summarized in Figure 3.

Procedure: Blind Identification of Power Profiles
Input: $K$ transient thermal traces $t(k)$, steady-state thermal measurements $T$, and corresponding total power measurements $[c_1 \ldots c_m]$
Output: Natural response matrix $A$, thermal transfer matrix $R$, forced response matrix $B$
1) Let $T_1 = [t(1)\ \cdots\ t(K-1)]$ and $T_2 = [t(2)\ \cdots\ t(K)]$.
2) Find the natural response matrix $A$ using least-squares minimization: $\min \| T_2 - A\, T_1 \|$ under the constraint $A \ge 0$.
3) Find the thermal transfer matrix $\tilde{R}$ and the power profiles $\tilde{P}$, up to permutation and scaling, using the NMF algorithm [14] with the proposed initialization method.
4) Solve the permutation ambiguity: identify the correct location of each column in $\tilde{R}$ by finding the position of the maximum value of each column in $\tilde{R}$. Use the identified positions to sort the columns of $\tilde{R}$ and the rows of $\tilde{P}$ to obtain $\bar{R}$ and $\bar{P}$.
5) Solve the scaling ambiguity: find the scaling factors $\alpha = [\alpha_1 \ldots \alpha_N] = [c_1 \ldots c_m]\, \bar{P}^{\dagger}$.
6) Find the thermal transfer matrix $R$: $R = \bar{R}\ \mathrm{Diag}[\alpha]^{-1}$.
7) Find the forced response matrix $B$: $B = (I - A) R$.
Fig. 3. Offline Power Identification Algorithm.

C. Estimating Runtime Power Consumption

During online tracking, the core power profiles at every time instance are obtained by periodically solving the following quadratic program:

$$\min_{p(k)}\ \| B\, p(k) - (t(k) - A\, t(k-1)) \|_2^2 \quad \text{such that } p(k) \ge 0 \text{ and } \| p(k) \|_1 = \text{total measured power}.$$
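The quadratic program above can be sketched with NumPy alone. The example below is an illustrative stand-in, not the authors' solver: it uses projected gradient descent, with the constraint set handled by a Euclidean projection onto the scaled simplex (for a non-negative vector, the 1-norm equals the sum of its entries). The function names `project_to_simplex` and `estimate_power`, the matrices `A` and `B`, and all temperatures and powers are hypothetical.

```python
import numpy as np

def project_to_simplex(v, total):
    """Euclidean projection of v onto {p >= 0, sum(p) = total}, total > 0."""
    u = np.sort(v)[::-1]                          # entries in descending order
    css = np.cumsum(u) - total
    idx = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / idx > 0)[0][-1]    # last index kept positive
    theta = css[rho] / (rho + 1)
    return np.maximum(v - theta, 0.0)

def estimate_power(a, b, t_now, t_prev, total, iters=500):
    """Minimise ||B p - (t(k) - A t(k-1))||^2 over the scaled simplex."""
    y = t_now - a @ t_prev
    step = 1.0 / np.linalg.eigvalsh(b.T @ b).max()
    p = np.full(b.shape[1], total / b.shape[1])   # start from equal shares
    for _ in range(iters):
        grad = b.T @ (b @ p - y)
        p = project_to_simplex(p - step * grad, total)
    return p

# Hypothetical model matrices, as if produced by the offline analysis.
A = np.array([[0.80, 0.05, 0.02],
              [0.05, 0.80, 0.05],
              [0.02, 0.05, 0.80]])
B = np.array([[0.30, 0.04, 0.01],
              [0.04, 0.30, 0.04],
              [0.01, 0.04, 0.30]])

# One noise-free time step with a known per-core power split.
p_true = np.array([6.0, 3.0, 1.0])
t_prev = np.array([40.0, 42.0, 41.0])
t_now = A @ t_prev + B @ p_true

p_hat = estimate_power(A, B, t_now, t_prev, total=p_true.sum())
print(np.round(p_hat, 4))
```

In a runtime setting, `estimate_power` would be called once per sampling interval with the latest temperature vector and total power reading.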

Procedure: Blind Identification of Power Profiles
Input: Temperatures $t(k)$, total power $c(k)$, matrices $A$ and $B$ from offline BPI analysis
Output: Power profiles $p(k)$
1) Solve the quadratic program: $\min \| B\, p(k) - (t(k) - A\, t(k-1)) \|_2$, such that $p(k) \ge 0$ and $\| p(k) \|_1 =$ total measured power.
2) Return the solution $p(k)$ of the quadratic program.
Fig. 4. Online Blind Power Identification Algorithm.

The online power estimation method is summarized in Figure 4.

V. EXPERIMENTAL RESULTS

To evaluate the proposed BPI algorithm, we consider three sets of experiments.

In Subsection V-A, we first verify the accuracy of the proposed BPI algorithm by using it to derive the per-core estimates of a variety of multi-core processor configurations that are simulated in HotSpot, and compare the estimates against the known per-core power consumptions. We consider both steady-state and transient power estimation. The sample rate here is 1 ms.

In Subsection V-B, we apply the BPI algorithm to estimate the thermal models and per-core power estimates of a real quad-core CPU+GPU processor (Intel Haswell processor Core i7-4790K) using the measurements of internal thermal sensors and the total power consumption as measured by the RAPL interface. We consider both steady-state and transient power estimation. The sampling rate here is 1 second. The online part of our algorithm (Figure 4) takes only 4.75 ms in runtime, which is shorter than the sampling interval; as a result, BPI can track the power estimates at the same granularity as our thermal sensor measurements.

In Subsection V-C, we apply the BPI algorithm to estimate the power consumption of various blocks on a test chip, where the thermal measurements are obtained using an external FLIR SC5600 infrared camera. In this application, we only consider steady-state power estimation, since transient response measurements using infrared imaging require peta-bytes of storage.

The source code and sample data from our experiments are available at github.com/scale-lab/bpi.

A. Verification and Analysis of BPI Using Simulators

BPI Accuracy. The accuracy of the proposed BPI algorithm is analyzed by comparing our per-core power estimates with the actual per-core power consumption. Given that modern processors lack such power sensors, we resort to simulation for verification. We use HotSpot, which is a popular thermal simulator for chip designs [12]. We create multiple multi-core layout configurations, with 4-core (2 × 2), 9-core (3 × 3) and 16-core (4 × 4) layouts. All chip configurations are assumed to be 1 cm × 1 cm in dimension with a maximum power budget of 6 W. We use scaled power traces from our Core i7 processor as input to HotSpot; otherwise, we use the default thermal settings in the simulator. The simulated thermal traces from HotSpot, together with the total power consumption, are given as inputs to our BPI algorithm. The proposed BPI algorithm is used to identify the state-space matrix models and the per-core power estimates. The power estimates produced by the BPI algorithm are then compared against the actual power consumptions of the individual cores that were used as inputs to HotSpot to verify its estimation accuracy.

Fig. 5. Verification of the proposed BPI algorithm using the HotSpot simulator. Subfigure (a) gives the thermal measurements of the four cores using HotSpot (y-axis: Temperature (C)); subfigures (b-e) give the per-core power estimates from BPI and the input power of each core to HotSpot for cores 1 to 4 (y-axis: Power (W)). Dashed red lines give input power, while blue lines give estimated power, which tracks the input power quite accurately.
Figure 5 illustrates the results of our experiment for different thermal traces applied over time. In particular, Figure 5(a) gives the thermal simulation output from HotSpot for the four cores, while Figures 5(b)-(e) give the actual per-core power traces given as inputs to HotSpot (dashed red lines) and the per-core power estimates computed by BPI (solid blue lines) for the four cores. The results in Figure 5 demonstrate that BPI tracks the power accurately. Further, the high accuracy of the power estimated from the simulated temperature traces demonstrates that the proposed BPI algorithm is able to estimate the thermal-to-power model of the given chip quite accurately.

TABLE I
SUMMARY OF BPI ACCURACY. THE AVERAGE ABSOLUTE ERRORS IN PER-CORE POWER ESTIMATES ARE REPORTED AS A FUNCTION OF THE NUMBER OF CORES. HOTSPOT IS USED FOR VALIDATION OF PER-CORE ESTIMATES AGAINST THE ACTUAL POWER CONSUMPTION OF EACH CORE.

                Our earlier work [16]                      This work
num. of cores   avg. abs. error (W)   avg. abs. error (%)   avg. abs. error (W)   avg. abs. error (%)
4               1.3                   6.53%                 .9                    4.5%
9               .84                   7.19%                 .34                   2.89%
16              .81                   9.22%                 .12                   2.…

Fig. 6. Average absolute error in per-core power estimates (y-axis: average absolute power estimation error per core, in %) as a function of the number of thermal traces used to train the BPI algorithm (x-axis: number of training samples, i.e., thermal simulations), for the 4-core, 9-core and 16-core configurations. The HotSpot tool was used to generate the thermal traces for different power traces. About 5 thermal traces are enough to obtain reasonably good power estimates.

Fig. 7. An example IR thermal image of the Core i7 processor die. Such IR thermal images are used to obtain detailed and accurate temperature maps of the processor for characterizing the internal thermal sensor noise of the processor cores.

To understand the scalability of our algorithm as a function of the number of power sources, i.e., the number of cores, we repeat our verification experiment for multi-core processors with 9-core and 16-core configurations. We define the average absolute error in Watts and in percent as

$$\text{avg. abs. error (W)} = \frac{1}{N} \sum_{n=1}^{N} \left| \text{estimated power} - \text{actual power} \right|, \tag{24}$$

and

$$\text{avg. abs. error (\%)} = \frac{1}{N} \sum_{n=1}^{N} \frac{\left| \text{estimated power} - \text{actual power} \right|}{\text{actual power}}, \tag{25}$$

respectively. The average absolute errors are summarized in Table I. We also provide the results from our earlier method [16], where we used the ICA algorithm to initialize the NMF, and contrast them with the new initialization method proposed in this paper. The results show that the proposed initialization method delivers consistently and significantly better results. In particular, we observe that the new initialization method provides up to 7.2% better accuracy for the studied chip configurations.
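The two error metrics of Equations 24 and 25 can be sketched as follows. The function names and the sample values are hypothetical, the average is taken over all entries passed in, and the factor of 100 that converts the ratio of Equation 25 into a percentage is an assumption of this sketch.

```python
import numpy as np

def avg_abs_error_watts(estimated, actual):
    """Mean absolute difference between estimated and actual power (Eq. 24)."""
    return float(np.mean(np.abs(estimated - actual)))

def avg_abs_error_percent(estimated, actual):
    """Mean absolute relative error (Eq. 25), expressed in percent."""
    return float(100.0 * np.mean(np.abs(estimated - actual) / actual))

# Hypothetical per-core powers in Watts, for illustration only.
actual = np.array([10.0, 10.0, 5.0, 4.0])
estimated = np.array([9.0, 11.0, 5.5, 4.0])

err_w = avg_abs_error_watts(estimated, actual)
err_pct = avg_abs_error_percent(estimated, actual)
print(err_w, err_pct)
```

For these sample values the absolute errors are 1, 1, 0.5 and 0 W, giving 0.625 W on average, and the relative errors are 10%, 10%, 10% and 0%, giving 7.5% on average.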
Impact of Number of Samples. To understand the sensitivity of BPI results to the number of training samples, we re-evaluate the accuracy of BPI as a function of the number of thermal measurements. For $N$ sources, there are $N^2$ unknown values in the $R$ matrix, which is a lower bound on the number of training samples. Each steady-state thermal trace gives $N$ thermal measurements, one per core. Figure 6 gives the average per-core absolute error in power estimation (%) as a function of the number of traces for the 4-core, 9-core and 16-core cases. The results show that increasing the number of training traces generally improves the accuracy of the algorithm; however, even with 2-5 thermal traces, one can get reasonably good accuracy. One also observes that for the same number of samples, increasing the number of cores improves the accuracy, which is an important consideration given that future many-core processors will incorporate tens of cores.

Noise Characterization and Impact. Another impact on the performance of BPI comes from the noise in measurements, which is typical for internal thermal sensors in processors. There are mainly two types of noise: (1) noise arising from discretization, since internal sensor measurements are provided as discrete integer values, and (2) inherent sensor noise, where the same temperature can lead to different internal measurements. To understand the magnitude of noise in sensor measurements, we characterize the temperatures of our Core i7 processor using an external infrared camera that has a noise figure of 15 mK. Figure 7 shows an example thermal map of the i7 processor. We then compare the core temperatures reported by the internal thermal sensors against the temperatures measured by the far more accurate camera. The differences between the two temperatures for different cores are plotted as histograms in Figure 8.
In Figure 8, the x-axis denotes the error in thermal sensor measurements, while the y-axis denotes the probability of that error across different experiments. The plot shows that the internal sensor noise falls within a [-2, 2] °C window and that it is Gaussian in nature, as verified by the Kolmogorov-Smirnov normality test. To analyze the impact of noise on the accuracy of the BPI algorithm, we use the HotSpot tool again, since it allows us to control the per-core power consumption easily. The thermal measurements from HotSpot are naturally given as floating-point numbers. We quantify the performance of BPI when the HotSpot measurements are (1) discretized, and (2) discretized with additional noise from the [-2, 2] °C window. We plot in Figure 9 the average absolute per-core power estimation error (%) as a function of the sensor noise mode for the 9-core chip. As expected, discretization and thermal sensor noise degrade the accuracy of the algorithm slightly; however, this degradation is graceful and does not lead to large inversion errors during power estimation.

Fig. 8. Histograms of thermal sensor measurement errors for different cores of the Core i7 processor, one panel per core. The x-axes denote the error in thermal sensor measurements (°C), while the y-axes denote the error probabilities, p(error). The overlaid blue plots are the fitted Gaussian distributions for each core.

Fig. 9. Impact of internal thermal sensor measurement accuracy on the power estimation accuracy of the proposed BPI method. The plot shows the average absolute per-core power estimation error (%) for three sensor output formats. Floating point refers to the estimation accuracy when the simulation results of HotSpot, which are given in floating point, are used. In Integer, we discretize the simulation results by rounding them to the nearest integer, which emulates the measurements from the internal thermal sensors in real processors. In Integer with noise, we round the measurements after adding noise from a Gaussian source with a standard deviation of 2/3 °C.

B. Blind Modeling and Estimation of a Real Quad-Core Processor Using Internal Thermal Sensor Measurements

In the second set of experiments, we apply the BPI algorithm to identify the thermal model and estimate the per-core power of a real quad-core CPU+GPU processor. We use a Linux-based system with an Intel Haswell Core i7-4790K (Devil Canyon) processor, which features four cores, an integrated GPU, and an L3 cache of 8 MB. The RAPL interface enables us to read the total power consumption of all the cores. Further, the lm-sensors module v3.3.4 is used to read the thermal measurements of the four cores. The RAPL power and the thermal sensor measurements are sampled once per second. The frequencies and voltages of the cores are controlled by Intel's speed driver, which adjusts the core frequencies automatically, up to a maximum of 4.2 GHz, depending on the load and the available thermal/power envelope. Thus, the frequency varies during our experiments. While we fix the fan speed in our experiments, a variable fan speed can be incorporated in our technique by repeating our modeling approach under various fan speed settings and then looking up the correct model during power tracking according to the actual fan speed.

We execute a diverse collection of workloads to collect traces that include the internal measurements from the thermal sensors and the total power from RAPL. The initial phase of our BPI algorithm is then executed on the data to blindly estimate the state-space model matrices of the processor. During runtime, our lightweight power estimation (the algorithm of Figure 4) takes about 4.75 ms per sample to compute the per-core power estimates. Hence, the proposed BPI algorithm could easily be used to make runtime decisions to control the chip temperature.

Controlled Stress Microbenchmarks. We first demonstrate that our BPI technique produces correct results on the real system.
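As a side note on the measurement setup, a total-power sample can be derived from RAPL's cumulative energy counter by differencing two reads. The sketch below is an assumption-laden illustration rather than the collection code used in this work: it assumes the Linux powercap sysfs interface (the energy_uj and max_energy_range_uj files), it assumes that the package domain intel-rapl:0 is the domain of interest, and all function names are hypothetical.

```python
import time


def read_counter(path):
    """Read one integer value (e.g. microjoules) from a sysfs file."""
    with open(path) as f:
        return int(f.read().strip())


def average_power_w(energy_prev_uj, energy_now_uj, dt_s, max_range_uj):
    """Average power (W) between two cumulative energy readings in microjoules."""
    if dt_s <= 0:
        raise ValueError("dt_s must be positive")
    delta_uj = energy_now_uj - energy_prev_uj
    if delta_uj < 0:
        # Assumes the counter wrapped around at most once between the reads.
        delta_uj += max_range_uj
    return delta_uj / 1e6 / dt_s


def sample_total_power(domain="/sys/class/powercap/intel-rapl:0", period_s=1.0, n=5):
    """Yield n total-power samples, one per period (needs Linux and read access)."""
    max_range = read_counter(domain + "/max_energy_range_uj")
    prev, t_prev = read_counter(domain + "/energy_uj"), time.monotonic()
    for _ in range(n):
        time.sleep(period_s)
        now, t_now = read_counter(domain + "/energy_uj"), time.monotonic()
        yield average_power_w(prev, now, t_now - t_prev, max_range)
        prev, t_prev = now, t_now


# Hypothetical counter values: 30 J consumed over 1 s.
print(average_power_w(1_000_000, 31_000_000, 1.0, 2**32))
```

Only average_power_w is exercised at import time; sample_total_power touches the file system and is meant to be called on a machine that actually exposes the powercap files.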
We design a multi-threaded stress-generation application that enables us to control the number of threads and the exact cores that are stressed when the application is executed. For space considerations, we demonstrate three of the 16 possible cases: (a) one core is stressed (core 1), (b) two cores are stressed (cores 2 and 3), and (c) all cores are stressed. The results are given in Figure 10, where we report the total power, the measurements from the thermal sensors, and the per-core power estimates from our BPI technique for the three cases. We observe from the plots that our technique breaks down the power consumption and maps it to the four cores correctly, as known from the controlled scheduling. The inactive cores are estimated to consume a small amount of power; this power is mainly attributable to leakage, since our technique identifies the total power (dynamic and leakage). Furthermore, we can see that the cores do not consume exactly the same power when they are all active (case c). This can be attributed to leakage power, which depends on the thermal profile and process variability [5]. We also implemented the regression-based approach reported in [3], which only works in steady state and assumes that active cores are fully busy. We found that it can lead to up to 11.4% deviation in power estimation compared to the actual steady-state total power. This deviation arises because the method does not account for the automatic changes in operating voltage-frequency and leakage power when more cores are activated.

Fig. 10. Illustration of successful operation of BPI in which various cores are stressed and the power dissipation of the cores is blindly identified. Each column plots the measured total power, the thermal sensor measurements (°C), and the estimated power (W) of cores 1 to 4. In column (a), only one core is stressed (core 1); in (b), two cores are stressed (cores 2 and 3); and in (c), all cores are stressed.

Fig. 11. Demonstration of BPI using a mix of SPEC CPU 2006 benchmarks. The panels plot the measured total power, the thermal sensor measurements (°C), and the estimated power (W) of cores 1 to 4. Four benchmarks (hmmer, mcf, milc, and povray) are launched with one benchmark per core. A wait time of about 1 seconds is used between launching two consecutive benchmarks. For example, the spike in the power of core 4 corresponds to the launch time of the povray benchmark on that core.

Multiple SPEC CPU 2006 Benchmarks. We conduct additional experiments using SPEC CPU 2006 and PARSEC benchmarks to demonstrate the ability to track the power consumption of general benchmarks. In the second experiment, we launch four benchmarks from SPEC CPU 2006: hmmer, mcf, milc, and povray on cores 1, 2, 3 and 4, respectively.
We wait for about 1 seconds between launching two consecutive benchmarks. Figure 11 gives the per-core estimates from our BPI algorithm. The plot correctly shows that the estimated power of core 4 spikes at about 4 seconds, when povray is launched on it. Similarly, the power estimate of every core drops to a very low level, as expected, right after the SPEC benchmark running on that core completes. Furthermore, core 1 displays the highest power consumption among all cores, as it executes hmmer, which is the most CPU-intensive application among the selected benchmarks.

Multi-threaded PARSEC Benchmarks. In the third experiment, we use bodytrack from the PARSEC multi-threaded benchmarks. We limit it to a maximum of two threads and use our BPI algorithm to estimate the per-core power. The results are given in Figure 12. Interestingly, the plot shows activity on all cores; however, not all cores are active simultaneously. This result matches the characteristics of bodytrack, which launches one thread per image to analyze 26 image frames. Since we limited bodytrack to two threads, the Linux scheduler automatically seeks to balance the launched threads among the cores; as a result, all the cores are used over the course of execution, but no more than two cores are active at a time. Our BPI algorithm correctly tracks this behavior during runtime, where the power spikes correspond to the launching and termination of threads on the various cores.

Fig. 12. Demonstration of BPI using bodytrack from the PARSEC benchmark suite, configured to run with two threads. The panels plot the measured total power, the thermal sensor measurements (°C), and the estimated power (W) of cores 1 to 4. As observed from the reconstructed power of cores 1 to 4, no more than two cores are active at any time. However, over time, all four cores are used, as the Linux scheduler balances the load across them.

C. Blind Modeling and Estimation Using Infrared Imaging

In the third set of experiments, we demonstrate the applicability of our method to thermal and power modeling from infrared imaging. To this end, we design a test chip composed of 10 × 10 = 100 microheaters embedded in a 90 nm Altera Stratix II FPGA with an area of 7.2 mm × 7.9 mm. Each microheater corresponds to a fine-grained block on the chip and, when enabled, consumes 2 mW of power. Figure 13.a shows the silicon die area, also called the test area. To capture the thermal emissions, we use a FLIR SC5600 infrared camera, and to measure the total power, we intercept the power supply lines using a shunt resistor and measure the shunt's voltage using an Agilent 3441 multimeter. Figure 13.b gives an example of the captured thermal emissions in steady state.

Fig. 13. (a) Layout of the 7.2 mm × 7.9 mm test chip on an Altera FPGA with 100 (10 × 10) custom microheaters; (b) an example of thermal traces from a high-end infrared (IR) camera. Temperatures are in Celsius above room temperature.
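The shunt-based power measurement reduces to Ohm's law. The following sketch shows the arithmetic under stated assumptions that are not taken from the paper: the shunt is placed in series with the supply rail, the supply voltage is measured upstream of the shunt, and the function name and the numeric values are hypothetical.

```python
def chip_power_w(v_shunt, r_shunt, v_supply):
    """Power (W) delivered to the chip through a series shunt resistor.

    v_shunt  : voltage drop measured across the shunt (V)
    r_shunt  : shunt resistance (ohm)
    v_supply : supply voltage measured upstream of the shunt (V)
    """
    if r_shunt <= 0:
        raise ValueError("r_shunt must be positive")
    current = v_shunt / r_shunt   # Ohm's law gives the supply current
    v_chip = v_supply - v_shunt   # voltage remaining across the chip
    return v_chip * current


# Hypothetical values: 10 mV across a 0.1 ohm shunt on a 1.2 V rail.
print(f"{chip_power_w(0.010, 0.1, 1.2):.3f} W")
```

If the supply voltage is instead measured at the chip side of the shunt, the subtraction is unnecessary and the power is simply the product of that voltage and the current.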
For infrared (IR) imaging based experiments, the resolution of the BPI algorithm is dictated by the resolution of the IR camera. In our lab, we use a FLIR SC5600 IR camera, which has a spatial resolution of 5 µm. In practice, however, temperature and power profiles are studied at a coarser granularity, dictated by the size of the functional blocks. In this experiment, we have 100 blocks in 7.2 mm × 7.9 mm, as illustrated in Figure 13, so the horizontal pitch is 0.72 mm and the vertical pitch is 0.79 mm. In principle, the spatial resolution of the estimated power maps could be the same as the resolution of the IR camera. However, in this work, the die area is divided into fine-grained blocks to create microheaters; hence, the resolution of the temperature and power maps is determined by the number of microheaters.

To record thermal emissions for the purpose of blind identification, we enable the microheaters in various pattern configurations on the chip and record the resulting infrared emissions and the total power consumption. We repeat this procedure 200 times with different power patterns to create enough thermal traces to serve as inputs to our BPI method, which identifies the steady-state thermal model and the power consumption of the individual blocks. Figure 13.b gives an example of such a thermal trace. In Figure 14, we plot the histogram of the number of incorrectly estimated microheaters, out of the 100 units in our design, for our 200 power patterns. Overall, the average power estimation error for the 200 power patterns is about 11.5%. In other words, on average 88.5% of the 100 microheater blocks have their power estimated correctly by the proposed blind identification method. Figure 15 provides two examples of such power maps.

VI. CONCLUSIONS

We proposed a new technique for blind identification of the power consumption of individual cores in multicore processors with no need for any a priori thermal-power models.
Our BPI technique simultaneously identifies the state-space thermal model and the power consumption of cores from just the measurements of total power and thermal sensors during runtime. To overcome the challenges in general blind source separation techniques, we proposed methods that exploit the nature of thermal characteristics and the total power measurements to construct the state-space model correctly, with appropriate permutation and scaling factors. We verified the accuracy of our method and demonstrated its resilience to scaling and sensor noise. We have proposed and implemented two applications of our technique: (1) a real multi-core processor, where we used it to track the power consumption of the individual cores under different workloads using the internal thermal sensors, and (2) a test chip, where we used the measurements from an infrared camera as input to our BPI method, which in turn provided fine-grain power maps for the chip. In this paper, we evaluated the proposed blind identification method on regular 2D ICs. As future work, we plan to extend our technique to 3D ICs with arbitrary locations for embedded thermal sensors and power sources.

Fig. 14. Histogram showing the number of inaccurately estimated microheaters out of the 100 microheaters in the chip (x-axis: number of incorrectly estimated microheaters; y-axis: number of power patterns). Overall, 200 different random power patterns were generated using the microheater grid. On average, the BPI algorithm estimates the power dissipation of the microheaters with reasonably high accuracy (88.5%).

Fig. 15. Comparison of two blindly estimated power maps against the reference (actual) power maps.

Acknowledgments: This work is partially supported by NSF grants # , and by an Arab-American Frontiers Fellowship of the U.S. National Academy of Sciences, Engineering and Medicine.

REFERENCES

[1] A. Bartolini, M. Cacciari, A. Tilli and L. Benini, A distributed and self-calibrating model-predictive controller for energy and thermal management of high-performance multicores, IEEE DATE, pp. 1-6, 2011.
[2] A. Belouchrani, K. Abed-Meraim, J.F. Cardoso and E. Moulines, A blind source separation technique using second order statistics, IEEE Trans. on Signal Processing, vol. 45, no. 2, February.
[3] F. Beneventi, A. Bartolini, and L. Benini, Static thermal model learning for high-performance multicore servers, in ICCCN, pp. 1-6, 2011.
[4] F. Beneventi, A. Bartolini, A. Tilli, and L. Benini, An Effective Gray-Box Identification Procedure for Multicore Thermal Modeling, in IEEE Transactions on Computers, Vol. 63(5), 2014.
[5] S. Borkar, T. Karik, S. Narendra, J. Tschanz, A. Keshavarzi and V. De, Parameter variations and impact on circuits and microarchitecture, in ACM/IEEE DAC, 2003.
[6] R. Cochran and S. Reda, Consistent Runtime Thermal Prediction and Control Through Workload Phase Detection, DAC, 2010.
[7] R. Cochran, A. N. Nowroz and S. Reda, Post-Silicon Power Characterization Using Thermal Infrared Emissions, ISLPED, 2010.
[8] A. Coskun, T. Rosing and K. C. Gross, Utilizing Predictors for Efficient Thermal Management in Multiprocessor SoCs, IEEE Transactions on CAD of Integrated Circuits and Systems, Vol 28(1), 2009.
[9] H. David et al., RAPL: Memory Power Estimation and Capping, in ISLPED, 2010.
[10] K. Dev, A. N. Nowroz and S. Reda, Power Mapping and Modeling of Multi-core Processors, in IEEE International Symposium on Low-Power Electronics and Design, 2013.
[11] A. Hyvarinen, Fast and Robust Fixed-Point Algorithms for Independent Component Analysis, IEEE Trans. on Neural Networks 1(3).
[12] W. Huang et al., HotSpot: a compact thermal modeling methodology for early-stage VLSI design, IEEE Transactions on VLSI Systems, vol 14(5), 2006.
[13] C. Isci, G. Contreras and M. Martonosi, Live, Runtime Phase Monitoring and Prediction on Real Systems with Application to Dynamic Power Management, in ISCA, 2006.
[14] D. D. Lee and H. S. Seung, Learning the parts of objects by nonnegative matrix factorization, Nature, 401 (1999).
[15] D. Li, S. X.-D. Tan, E. Pacheco and M. Tirmula, Parameterized Architecture-level Dynamic Thermal Models for Multicore Microprocessors, ACM TODAES, Vol 15(2), pp. 16:1-16:22, 2010.
[16] S. Reda and A. Belouchrani, Blind Identification of Power Sources in Processors, IEEE/ACM Design, Automation & Test in Europe, 2017.
[17] S. Sharifi, C.-C. Liu and T. Rosing, Accurate Temperature Estimation for Efficient Thermal Management, ISQED, 2008.
[18] Y. Wang, K. Ma, and X. Wang, Temperature-Constrained Power Control for Chip Multiprocessors with Online Model Estimation, ISCA, 2009.
[19] K. Dev and S. Reda, Scheduling Challenges and Opportunities in Integrated CPU+GPU Processors, in ACM/IEEE Symposium on Embedded Systems for Real-time Media, 2016.
[20] M. Yuffe et al., A Fully Integrated Multi-CPU, Processor Graphics, and Memory Controller 32-nm Processor, IEEE JSSC, Vol 47(1), 2012.
[21] M. N. Ozisik, Heat Conduction, Natick, MA: Wiley/IEEE, Mar.
[22] F. Beneventi, A. Bartolini, A. Tilli, and L. Benini, An Effective Gray-Box Identification Procedure for Multicore Thermal Modeling, IEEE Trans. Comp., Vol 63(5), 2014.
[23] Y. Yang, Z. Gu, C. Zhu, R.P. Dick, and L. Shang, ISAC: Integrated Space-and-Time-Adaptive Chip-Package Thermal Analysis, IEEE Trans. CAD of Integ. Circ. and Sy., Vol 26(1), 2007.
