Exploiting Power Budgeting in Thermal-Aware Dynamic Placement for Reconfigurable Systems

Similar documents
Online Work Maximization under a Peak Temperature Constraint

Technical Report GIT-CERCS. Thermal Field Management for Many-core Processors

Thermal Scheduling SImulator for Chip Multiprocessors

EEC 216 Lecture #3: Power Estimation, Interconnect, & Architecture. Rajeevan Amirtharajah University of California, Davis

Reducing Delay Uncertainty in Deeply Scaled Integrated Circuits Using Interdependent Timing Constraints

Temperature-aware Task Partitioning for Real-Time Scheduling in Embedded Systems

Reliability-aware Thermal Management for Hard Real-time Applications on Multi-core Processors

Test Generation for Designs with Multiple Clocks

Lecture 2: Metrics to Evaluate Systems

Design Methodology and Tools for NEC Electronics Structured ASIC ISSP

HIGH-PERFORMANCE circuits consume a considerable

System-Level Mitigation of WID Leakage Power Variability Using Body-Bias Islands

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

Energy-Optimal Dynamic Thermal Management for Green Computing

Energy-Efficient Real-Time Task Scheduling in Multiprocessor DVS Systems

TempoMP: Integrated Prediction and Management of Temperature in Heterogeneous MPSoCs

Throughput of Multi-core Processors Under Thermal Constraints

Spatio-Temporal Thermal-Aware Scheduling for Homogeneous High-Performance Computing Datacenters

Architecture-level Thermal Behavioral Models For Quad-Core Microprocessors

Leakage Minimization Using Self Sensing and Thermal Management

Word-length Optimization and Error Analysis of a Multivariate Gaussian Random Number Generator

Optimal Multi-Processor SoC Thermal Simulation via Adaptive Differential Equation Solvers

Continuous heat flow analysis. Time-variant heat sources. Embedded Systems Laboratory (ESL) Institute of EE, Faculty of Engineering

System Level Leakage Reduction Considering the Interdependence of Temperature and Leakage

Optimizing Loop Operation and Dataflow in FPGA Acceleration of Deep Convolutional Neural Networks

Optimal Multi-Processor SoC Thermal Simulation via Adaptive Differential Equation Solvers

Generalized Network Flow Techniques for Dynamic Voltage Scaling in Hard Real-Time Systems

Through Silicon Via-Based Grid for Thermal Control in 3D Chips

Optimal Voltage Allocation Techniques for Dynamically Variable Voltage Processors

Power-Aware Scheduling of Conditional Task Graphs in Real-Time Multiprocessor Systems

The Fast Optimal Voltage Partitioning Algorithm For Peak Power Density Minimization

Clock Skew Scheduling in the Presence of Heavily Gated Clock Networks

On Optimal Physical Synthesis of Sleep Transistors

Intel Stratix 10 Thermal Modeling and Management

Dynamic Power Management under Uncertain Information. University of Southern California Los Angeles CA

FastSpot: Host-Compiled Thermal Estimation for Early Design Space Exploration

Stochastic Dynamic Thermal Management: A Markovian Decision-based Approach. Hwisung Jung, Massoud Pedram

Performance, Power & Energy. ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So

Thermal-Induced Leakage Power Optimization by Redundant Resource Allocation

Methodology to Achieve Higher Tolerance to Delay Variations in Synchronous Circuits

VLSI Design Verification and Test Simulation CMPE 646. Specification. Design(netlist) True-value Simulator

Variations-Aware Low-Power Design with Voltage Scaling

Sequential Equivalence Checking without State Space Traversal

EE241 - Spring 2001 Advanced Digital Integrated Circuits

Determining Appropriate Precisions for Signals in Fixed-Point IIR Filters

JETC: Joint Energy Thermal and Cooling Management for Memory and CPU Subsystems in Servers

Toward More Accurate Scaling Estimates of CMOS Circuits from 180 nm to 22 nm

ECE321 Electronics I

Temperature-Aware Analysis and Scheduling

L16: Power Dissipation in Digital Systems. L16: Spring 2007 Introductory Digital Systems Laboratory

Micro-architecture Pipelining Optimization with Throughput- Aware Floorplanning

VLSI Signal Processing

Temperature-Aware Floorplanning of Microarchitecture Blocks with IPC-Power Dependence Modeling and Transient Analysis

VECTOR REPETITION AND MODIFICATION FOR PEAK POWER REDUCTION IN VLSI TESTING

On Two Class-Constrained Versions of the Multiple Knapsack Problem

Multivariate Gaussian Random Number Generator Targeting Specific Resource Utilization in an FPGA

Featured Articles Advanced Research into AI Ising Computer

Temperature Aware Floorplanning

ESE 570: Digital Integrated Circuits and VLSI Fundamentals

Buffered Clock Tree Sizing for Skew Minimization under Power and Thermal Budgets

Spiral 2 7. Capacitance, Delay and Sizing. Mark Redekopp

Analog computation derives its name from the fact that many physical systems are analogous to

Reducing power in using different technologies using FSM architecture

Blind Identification of Power Sources in Processors

Early-stage Power Grid Analysis for Uncertain Working Modes

Simultaneous Shield Insertion and Net Ordering for Capacitive and Inductive Coupling Minimization

Physical Design of Digital Integrated Circuits (EN0291 S40) Sherief Reda Division of Engineering, Brown University Fall 2006

Performance Metrics & Architectural Adaptivity. ELEC8106/ELEC6102 Spring 2010 Hayden Kwok-Hay So

! Charge Leakage/Charge Sharing. " Domino Logic Design Considerations. ! Logic Comparisons. ! Memory. " Classification. " ROM Memories.

Throughput Maximization for Intel Desktop Platform under the Maximum Temperature Constraint

Performance, Power & Energy

A Novel Software Solution for Localized Thermal Problems

DC-DC Converter-Aware Power Management for Battery-Operated Embedded Systems

Lecture 15: Scaling & Economics

ECE 407 Computer Aided Design for Electronic Systems. Simulation. Instructor: Maria K. Michael. Overview

A Simulation Methodology for Reliability Analysis in Multi-Core SoCs

New Exploration Frameworks for Temperature-Aware Design of MPSoCs. Prof. David Atienza

Revenue Maximization in a Cloud Federation

2. Accelerated Computations

EE241 - Spring 2000 Advanced Digital Integrated Circuits. Announcements

Fault Collapsing in Digital Circuits Using Fast Fault Dominance and Equivalence Analysis with SSBDDs

Administrative Stuff

Mitigating Semiconductor Hotspots

CMOS Ising Computer to Help Optimize Social Infrastructure Systems

On Co-Optimization Of Constrained Satisfiability Problems For Hardware Software Applications

DKDT: A Performance Aware Dual Dielectric Assignment for Tunneling Current Reduction

Session 8C-5: Inductive Issues in Power Grids and Packages. Controlling Inductive Cross-talk and Power in Off-chip Buses using CODECs

Temperature Aware Energy-Reliability Trade-offs for Mapping of Throughput-Constrained Applications on Multimedia MPSoCs

Vectorized 128-bit Input FP16/FP32/ FP64 Floating-Point Multiplier

CSE241 VLSI Digital Circuits Winter Lecture 07: Timing II

Constrained Clock Shifting for Field Programmable Gate Arrays

Digital Integrated Circuits A Design Perspective. Semiconductor. Memories. Memories

Thermal Measurements & Characterizations of Real. Processors. Honors Thesis Submitted by. Shiqing, Poh. In partial fulfillment of the

ESE 570: Digital Integrated Circuits and VLSI Fundamentals

INF2270 Spring Philipp Häfliger. Lecture 8: Superscalar CPUs, Course Summary/Repetition (1/2)

A Novel LUT Using Quaternary Logic

Thermal System Identification (TSI): A Methodology for Post-silicon Characterization and Prediction of the Transient Thermal Field in Multicore Chips

! Memory. " RAM Memory. ! Cell size accounts for most of memory array size. ! 6T SRAM Cell. " Used in most commercial chips

SP-CNN: A Scalable and Programmable CNN-based Accelerator. Dilan Manatunga Dr. Hyesoon Kim Dr. Saibal Mukhopadhyay

Reduced-Area Constant-Coefficient and Multiple-Constant Multipliers for Xilinx FPGAs with 6-Input LUTs

Transcription:

Exploiting Power Budgeting in Thermal-Aware Dynamic Placement for Reconfigurable Systems Shahin Golshan 1, Eli Bozorgzadeh 1, Benamin C Schafer 2, Kazutoshi Wakabayashi 2, Houman Homayoun 1 and Alex Veidenbaum 1 1 Center of Embedded Computer Systems University of California, Irvine CA, 92697, USA {hhomayou, ssgolsha, eli, alexv}@icsuciedu 2 NEC Corp, Central Research Laboratories 1753 Shimonumabe, Kanagawa, 211-8666 Japan {schaderb, wakaba}@blpneccom Abstract In this paper, a novel thermal-aware dynamic placement planner for reconfigurable systems is presented, which targets transient temperature reduction Rather than solving time-consuming differential equations to obtain the hotspots, we propose a fast and accurate heuristic model based on power budgeting to plan the dynamic placements of the design statically, while considering the boundary conditions Based on our heuristic model, we have developed a fast optimization technique to plan the dynamic placements at design time Our results indicate that our technique is two orders of magnitude faster while the quality of the placements generated in terms of temperature and interconnection overhead is the same, if not better, compared to the thermal-aware placement techniques which perform thermal simulations inside the search engine Categories and Subect Descriptors: J6 [Computer Aided Design (CAD)] General Terms: Algorithms, Design, Reliability Keywords: Reconfigurable Systems, Temperature, Dynamic Reconfiguration, Placement, Computer Aided Design 1 Introduction Reconfigurable System-on-Chip (RSoC) platforms are widely used in many application domains as they offer high performance, high flexibility in design and fast reconfiguration time The everincreasing demand for more computations in shorter times as well as technology scaling in these systems will increase the power density and consequently increase the operating temperature High temperature heavily impacts the reliability of the system Increases in leakage power caused by temperature rises can lead to thermal run-away The cost of cooling solutions in chip packaging increases drastically with power density increases [1] and therefore such solutions might be prohibitively expensive in practice In order to cope with the increasing demand for higher computation capacities in reconfigurable systems, several dynamic thermal management (DTM) schemes have been proposed [2,3,4] In general, DTM techniques try to buy in more power saving by The work in this paper is partially supported by the NSF under the award CNS- CAREER #0846129 Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee ISLPED 10, August 18 20, 2010, Austin, Texas, USA Copyright 2010 ACM 978-1-4503-0146-6/10/08 $1000 compromising the computation speed Dynamic voltage/frequency scaling (DVFS) is a common example of DTM approaches which exploits the timing slacks in favor of power saving Figure 1: (a) Sample RSoC (b) Single tile of the DRP Reconfiguration capabilities in RSoC provide a new and orthogonal dimension to all previous DTM approaches Rather than throttling the execution speed to lower the power dissipation in RSoC, reconfiguration allows transporting the computation from hotspots to cooler locations on the chip In this paper, we propose to statically plan and place multiple versions of the design and reconfigure each placement on the dynamic reconfigurable system An interesting feature of coarsegrained reconfigurable systems is the low-overhead of dynamic reconfiguration as opposed to fine-grained reconfigurable systems (eg FPGA) Coarse-grained reconfigurable systems can be abstracted as arrays of processing elements (PE) A prominent example of such reconfigurable systems is Dynamically Reconfigurable Processor (DRP) [5,6] The overhead of reconfiguration of the DRP is a single clock cycle We exploit this feature to reduce the peak transient temperature through reconfiguration At design time, new placements for the same design are planned at certain checkpoints to be executed sequentially on the dynamic reconfigurable system Since we are aiming at peak transient temperature reduction rather than steadystate temperature minimization, instantaneous changes in the power densities of the applications can be captured and handled through dynamic reconfiguration Given the layout of the processing elements on a reconfigurable system, temperature simulation tools can solve the differential equation of Fourier s heat flow model to obtain the temperature value based on the power consumptions over a period of time [7] However, solving the complex finite difference equations is very time-consuming Here we motivate an alternative methodology which infers the hotspots of a placement by solely taking into account the power consumptions of the PEs rather than going through time-consuming differential equation solvers As a result, the optimization tools would no longer need to perform timeconsuming temperature simulations but rather to apply an efficient and yet sufficiently accurate (and thermally safe) technique to 49

distribute the power densities to PEs In order to achieve this, we propose thermal-aware power budgeting 1 Based on the thermal model, we introduce the notion of critical power for each PE, which indicates the maximum safe power density permissible for each PE so that the temperature does not exceed a certain threshold As long as the power densities of each operation on the PEs do not exceed this budget, the temperature does not exceed the threshold In a similar fashion, we also introduce the concept of minimal safe temperature, which can be viewed as the minimal upper bound of the temperatures of all the elements in the reconfigurable system Our proposed solution not only considers the power densities of the PEs in the DRP architecture but also captures the varying power densities of surrounding hard IPs as shown in Figure 1 Hence, reconfiguration in DRP can be invoked when inflexible surrounding hard IPs are contributing to hotspots in DRP as well The main contributions of this paper are: 1 Development of a fast and accurate heuristic model based on power budgeting to plan dynamic placements of the design statically, considering the boundary conditions 2 Designing a fast placement technique based on the heuristic to plan the dynamic placements in design time Our experimental results indicate that our proposed technique outperforms other techniques in terms of the execution time and also quality of solution 2 Related Work In general, temperature-aware techniques can be categorized into design time (static) or dynamic optimizations Design time techniques are applied at high level [8,9,10] or at physical level [11,12] These works aim at improving the average power densities or leverage on lateral heat conduction to improve steady-state temperature Due to their passive nature, these techniques cannot adapt to changes in the operating environment and therefore they do not perform well for transient temperature reduction Online DTM techniques apply several mechanisms (voltage/frequency scaling [2] or task migration [3,4]) to reduce temperature Reactive systems in general require the use of thermal sensors and actuators to detect a thermal emergency and activate the appropriate thermal management scheme, which can in turn increase the complexity of the design Task migration has been widely used as an efficient DTM scheme [13,14] The work in [14] provides a simplistic thermal model in which the layout of the active cores are not taken into account In [13], an elaborate model is used to replicate the hot modules in the design and then alternate the execution of the tasks evenly between the replicas Since the power densities of the tasks might change during the execution of the tasks, evenly alternating the execution between the replicas is not always beneficial The proposed approach is only limited to relocation of hotspot modules and duplications of such modules increases the area overhead In this paper, we develop a new placement technique that utilizes all the processing elements on the reconfigurable system and re-maps the tasks to the processing elements according to their corresponding allocated power budgets 1 Throughout this paper, the terms maximum power budgeting and critical power are used interchangeably 3 RSoC Architecture An RSoC is a collection of a coarse-grained reconfigurable processers (ie DRP [6,5]), and several other IP cores (eg DSP) on a single chip (Figure 1) In our example, we assume that the cores and the DRP communicate using a shared memory through dedicated bus communications DRP, as a coarse-grained reconfigurable processor, consists of several Tiles DRP architectures are scalable in general and can have arbitrary number of tiles In DRP-4 prototype, each tile is an 8x8 processing element (PE) grid with dedicated registers and a state transition controller (STC) Each tile contains a repository of configurations or contexts that are used by the STC to reconfigure the interconnections of the processing element to other elements While the current prototype can store up to 64 contexts in each PE, additional contexts can be loaded on-demand from DRAM It is basically controlled by a uc (ARM/MIPS) on-chip The configuration memory has two ports, one for executing and the other for loading Therefore users do not need to stop the execution while loading the context with the DRP-4 4 Thermal-Aware Dynamic Placement for Reconfigurable System on Chip The flow of our thermal-aware dynamic placement based on power budgeting is depicted in Figure 2 We first divide the whole power trace into several equally-distanced check points At the beginning of every check point, we decide the optimal placement to be applied on the reconfigurable system for the time interval bounded by the next check point Note that the final temperatures calculated at one check point will be the initial temperatures of the next check point The process continues until all the check points in the power trace are covered Our flow includes three maor steps: 1 Calculation of the maximum power budget (P crit (i)) permissible for the time interval following the check point (i), given the layout of the reconfigurable system, the initial temperatures at the check point (T init (i)), and the maximum permissible temperatures (T crit (i)) for the processing elements 2 Optimization of the placement to be reconfigured for the time interval based on P crit (i) and the average powers P ave (i) of the processing elements 3 Calculation of the temperatures at the end of the interval (T f (i)) based on the optimal placement (PL(i)), the initial temperatures (T init (i)) and the average powers over the interval P ave (i) Figure 2: Flow of our thermal-aware dynamic placement based on power budgeting In this section we first elaborate on the thermal model used in our dynamic placement technique for RSoC platforms Based on the thermal model, we formally introduce and formulate the concepts 50

of critical power and minimal safe temperature for each processing element in DRP (Step 1) Then we provide a simulated annealing algorithm to solve the problem of thermal-aware dynamic placement for DRP based on power budgeting (Step 2) In order to calculate the transient temperature for the optimal placement, we adopt the well-known thermal simulator, HotSpot [7] (Step 3) 41 Critical Power and its Application on Thermal-Aware Dynamic Placement for RSoC It is well known that there is duality between electrical and thermal circuits We have adopted a compact thermal model similar to [7], which models discrete heat flow using thermal resistance-capacitance (RC) network In this model, for the die, spreader and sink layers, several nodes are assumed, where the neighboring nodes share lateral resistances to model heat diffusion within the layer The vertical thermal resistance captures the heat across the different layers Given the thermal RC model of the chip, the temperature-power relationship for a short interval can be captured as stated in Equation (1) C f C in in P T T G T (1) t t Where P, T, G, C are the power vector, the temperature vector, the thermal conductivity matrix and the thermal capacitance matrix respectively The critical power is formally defined as: The maximum average power consumption permissible for the processing elements over the interval [t in, t f ], where t in and t f are the start point and end point of the interval, provided that the initial temperatures of the processing elements are given (T in ), such that the temperatures at end of the period, T f are lesser than the critical temperature T crit Conversely, we define the minimal safe temperature (T S ) as: the minimal upper bound of the final temperatures of the processing elements reached at time t f, provided that the initial temperatures of the processing elements (T in ) and the average power consumptions P over the interval [t in, t f ] are given We can think of the minimal safe temperature as the minimum critical temperature that can be realized for given power and initial temperature vectors We can simplify Equation (1) by merging the contribution of the initial temperature in Equation (1) to the left side: T f 1 G P (2) For a small interval, the capacitance matrix can be modeled as a conductivity matrix (G =C/t) Since G represent the conductivity matrix of the thermal network, it belongs to the class of positive-definite matrices, which implies that the matrix is invertible [15] There is a challenge standing in the way of deriving the critical power for the processing elements in the RSoC The number of thermal RC-network nodes is generally more than the number of processing elements in the RSoC due to chip packaging Since we are only restricting the temperatures of the processing elements to be lesser than the critical temperature, the maximum temperature constraints for the internal nodes are unknown We have developed a smart way to derive the critical power vector P by rephrasing the linear equation stated above into two separate equations, the first of which calculates the final temperatures for all the nodes in the thermal network (T f ) and then the second calculates the critical powers based on T f In our formulation, we assume that the number of nodes representing processing elements and the total number of nodes in the RCnetwork are N and M respectively The nodes related to the boundary cores are also included in M The dimensions of the vectors and matrices in the equations are of M the information (power and temperature) pertaining to the nodes representing the processing elements is positioned in the first N elements: P0 T0 (3) AN N CN ( M N ) P N 1 TN 1 B( M N ) N D( M N ) ( M N ) P M 1 TM 1 The matrices A, B, C and D are the sub blocks of G -1 Since G -1 is positive definite, Sylvester s criterion holds [16], which states that the upper left sub blocks have positive determinants, hence it implies that the sub block A is nonsingular Since the internal nodes are passive nodes, they do not dissipate any power As the final temperatures of the processing elements should be equal to T crit, we rewrite the equation above as: P0 Tcrit PN 1 1 A A C (4) P N 1 Tcrit PM 1 The elements in the second term in Equation (4) are the average powers of the internal nodes and of the nodes pertaining to the boundary cores The second term can be fused into the left side as: crit P 0 Tcrit 1 (5) A crit P N 1 Tcrit The following Theorem expresses the significance of critical power computation for temperature optimization in RSoC placement Theorem I For short intervals (the condition under which Equation (2) holds), any average power consumption vector below the critical power vector will yield temperatures below the critical temperature Proof For the sake of simplicity, we only consider the case where there is only one layer in the chip packaging In this case, for each node in the thermal network, there is a vertical resistance connecting the node to the node with constant temperature (ambient temperature) Figure 3 depicts a sample thermal network Since the critical power of each node is calculated for the case that all the nodes reach the critical temperature, we can state the critical power in terms of the critical temperature as (T amb is the ambient temperature): Tcrit Tamb P crit i ( ) (6) Ri Now we prove that for any power vector P less than P crit, if there exists a node with temperature higher than T crit, then we reach a contradiction Without loss of generality, we assume that the temperature at node 0 has exceeded the critical temperature Applying Kirchhoff s current law we reach: amb T0 T T0 T crit Pi P (7) i R0, Ri Note that we have summed up the heat diffusions of the nodes which share a lateral resistance If T 0 becomes greater than T crit then there must be at least one other node, name it node 1 (The indices used here follow the order the high temperature nodes are visited), whose temperature T 1 is greater than T 0 or otherwise Equation (7) cannot be satisfied Now we can repeat the process for node 1 It turns out that there must be another node, like node 2 with temperature higher than node 1 as so on Eventually the last node is visited and since all the previous nodes have lower 51

temperatures than the last node (T crit < T 0 < < T N-1 ), therefore, in the thermal network, given that the power consumptions of the nodes are less than corresponding critical powers, the temperatures of all the nodes remain below the critical temperature Figure 3: Thermal network of single layer (Chip Die) When the temperature reaches the steady-state, the transient term of the right side of Equation (1) is zero (ie P = GT) Hence, the steady-state relationship between the critical power and the critical temperature is analogous to the formulation provided in Equation (2) Hence, the same steps can be taken to derive the critical power for each processing element in the steady-state The critical power for the steady-state is the maximum permissible average power of each block over the entire execution time of interest Corollary Given the average power and the initial temperature vectors, if the critical temperatures T crit is realized, then so is any critical temperature T crit > T crit The outcome of the aforementioned theorem can be viewed in the opposite direction to give a sense on how to calculate the minimal safe temperature: Theorem II Given the average power and the initial temperature vectors, the minimal safe temperature can be computed as: T max ( P / A i, ) (8) S 0 1 in i Proof Based on Theorem I, any average power below the critical power will yield temperatures below the critical temperature If we set the critical temperature to be equal to the minimal safe temperature, the corresponding critical powers can be used as constraints to figure out whether the processing elements reach the critical temperature or not In order to obtain the minimal safe temperature, we need to obtain the minimum critical temperature that can be realized by the given power vector In order to derive the minimal safe temperature, we need to find the minimal temperature that satisfies Equation (5) Rephrasing Equation (8) yields Equation (9), which implies the condition satisfying Equation (5) 0 i N 1 : P T A 1 i, (9) i S For the special cases where no lateral capacitance between neighboring blocks is assumed in the thermal model [7] (C is a diagonal matrix), it can be shown that the matrices A and A -1 are also diagonal Hence, Equation (8) can be simplified as: 1 TS max 0iN ( Pi / A i, i ) (10) The significance of the equations above relies on the fact the computation of the equations takes O(n) time, where n is the number of processing elements in DRP All the matrices can be computed ahead of time and therefore only simple operations (divisions) expressed in Equation (8) are required Such feature makes the heuristic formula desirable for optimization engines 42 Thermal-Aware Dynamic Placement Planner (TADPP) for RSoC The problem of thermal-aware dynamic placement for RSoC can be studied in two similar categories While one tries to guarantee that the final temperatures of the processing elements are below the critical temperature, the other one tries to minimize the final temperatures Since both problems are similar in nature, in this section we only state the problem of thermal-aware dynamic placement for RSoC for temperature minimization: Given the average power vector and the initial temperature vector for the system at time t 0, map the data path resources onto the processing elements of the DRP such that the final temperature vector at time t f of the system and the interconnection complexity between the processing elements is minimal We have developed a simulated annealing search engine which simultaneously optimizes both the wire length and the final temperatures of the placement for each checkpoint In general, simulated annealing based approaches tend to minimize a cost function during random moves of blocks Therefore, the run-time of the algorithm is significantly affected by the time complexity of the cost function as it needs to be executed in every move Traditionally, simulated annealing based thermal-aware placers have used temperature simulations directly to calculate the maximum temperature (eg [7]) Such techniques lead to a slow algorithm because of the need to solve finite difference equations governing the temperature-power relationship upon every move in the placement (iteration) In contrast to the traditional techniques, our power budgeting based heuristic efficiently captures temperature inside the simulated annealing engine, therefore drastically reduces the run time of the annealing engine We have adopted the adaptive annealing schedule of VPR, the state-of-art FPGA placement tool Features of VPR adaptive annealing schedule are explained in details in [17] We have used the following cost function inside our simulated annealing engine: TS WL cost ( 1 ) (11) PREV PREV T WL S Where α is the coefficient denoting the tradeoff between wire length and temperature Half perimeter metric is used to estimate the total wire length of the placement The cost function in Equation (11) uses the minimal safe temperature expressed in Equation (8) rather than the exact maximum temperature obtained by direct finite difference equation solving 5 Experimental Results In this section we first describe our experimental flow and then we will present our results 51 Experimental Setup Figure 4 depicts our experimental flow The benchmarks used in our experiments are DSP and multimedia applications, widely used in high-level synthesis community [18] The operations used in the data flow graphs are adders, multipliers and logic operations 8-bit ALUs and 8-bit multipliers are used in the data path In our experiments we have assumed that only a single tile with 64 processing elements (ALUs) and 8 multipliers is available for our thermal-aware dynamic placement planner We have performed list scheduling under resource constraints and then Left-Edge- Algorithm (LEA) to generate the RTL description of the data path for each benchmark We have applied 10000 randomly generated input vectors to each data path and recorded the activities using ModelSim We assume that the same pattern will be repeated in the power traces throughout our experiments The switching activities are then passed to PowerTheater in order to obtain the dynamic power traces of the data path components PowerTheater simulations were carried out 52

for 130nm technology and 500MHz of operation We have applied the same technique as reported in [9,19] to modify the power densities of the component as technology size is scaled down from 130 nm to 45 nm temperature reduction when minimal safe temperature heuristic is applied in dynamic placement In all our experiments, we have used power traces of ten million cycles and the initial temperatures of the processing elements are assumed to 85C For the first set of experiments we have performed dynamic placement planner with 100 random moves performed per each check point We have used 1000 checkpoints to recalculate the placement of the design The optimization obective in this experiment is to minimize the wire length of the design for all the placements while the maximum temperature of the processing elements does not exceed 100C We perform temperature simulations for the entire power trace based on the dynamic placements to get the transient temperatures over the execution of the design The results in Figure 5 are averaged over the benchmarks Figure 4: Experimental Flow In order to model the interdependence of temperature and leakage power, we first logic synthesized each data path component using Design Compiler For each standard gate we performed leakage calculations for 3 different temperatures 25 C, 85 C and 110 C using HSPICE for 45 nm The sum of the leakage powers of the individual gates is considered as the leakage of the component A second degree polynomial is used to approximate the behavior of temperature-dependant leakage power We have assumed the same chip packaging configuration as modeled by HotSpot [7] Ambient temperature is set to be 45C Different HotSpot parameters are shown in Table 1 Table 1 Thermal packaging parameters C Convection = 1404 J/K R Convection = 01 K/W Heat Sink Side = 60mm Heat Sink Thickness = 69mm Spreader Side = 30mm Spreader Thickness = 1mm Chip Thickness = 015mm Sampling Interval = 20 µs The power traces, chip packaging information, DRP layout and the number of checkpoints are fed to TADPP tool to obtain the dynamic placements for the check points Finally, HotSpot-5 simulator is used to obtain the transient temperature of the DRP over time The experiments are carried out on a 299 GHz Pentium IV machine with 1 GB of RAM running Microsoft Windows XP 52 Experimental Results In order to evaluate the significance of our developed formulation for thermal-aware dynamic placement, we have compared our heuristic formulation (Equation (4)) with two other formulations that can be used to constrain the peak temperature during dynamic placement planning The first formulation is to use the finite difference equation solving used in temperature simulators (eg HotSpot), which gives the most accurate temperature measurement The second formulation is adopted from [11], which uses temperature-weighted-distance (TWD) as a tradeoff between execution time and final temperature values: T i T TempDist (12) d i i, In this scheme the temperature update was modified to occur once out of every n annealing moves During the period between temperature updates the previously simulated temperature obtained by HotSpot is used for calculating the TWD We have performed two sets of experiments The first set of experiments examines the performance of critical power consideration in dynamic placement and the second set explores Figure 5: Comparison of placements obtained for 1000 checkpoints using different temperature metrics As a reference case, we have performed steady-state temperature plus wire length optimization to obtain the fixed placement with the minimum wire length In the other cases, we have used the metrics explained earlier to find the placement with the minimal wire length for each checkpoint subect to the critical temperature of 100C TWD metric was updated every 20 moves As shown in Figure 5, on average, the maximum temperatures acquired for TWD and fixed placement exceed 100C HotSpot and our heuristic metrics maintained the maximum temperatures below 100C While TWD reached 21% increase in the wire length on average, the overhead for our heuristic and HotSpot were 15% and 11% respectively The performance of HotSpot formulation and our heuristic are close in terms of temperature and wire length However, our technique outperforms HotSpot temperature calculation in execution time The execution times for our heuristic, TWD, HotSpot and the fixed placement are 81, 529, 8827 and 62 seconds The runtime of HotSpot is prohibitively high, which is the main reason to adopt critical power metric inside the search engine to acquire the optimal placement Figure 6: Max temperature, comparing our heuristic to TWD [11] for different no of checkpoints In the second set of experiments, we have compared the performance of our thermal-aware dynamic placement planner, 53

when the cost function in Equation (11) is used to minimize the temperature and total wire length, with TWD We have implemented the adaptive annealing schedule of VPR for both techniques We have gathered the average results for three cases: 10, 100 and 1000 checkpoints in the same execution traces The maximum temperatures in the execution traces are presented in Figure 6 In Figure 7, the wire lengths are normalized based on the case where steady-state temperature and wire length minimization are used as the cost function to optimize The notation TWD_100(1000) represents the number of moves performed before the accurate temperatures are obtained through HotSpot We have set the number of moves between updates to be 100(1000) Figure 7: Normalized wire length, comparing our heuristic to TWD [11] for different no of checkpoints As a general trend, the maximum temperature reached reduces as the number of checkpoints goes higher (Figure 6) Also, for TWD technique, as the frequency of temperature updates goes higher, the average wire length and the maximum temperature improves This is due to the fact that the accuracy of temperature calculations in TWD depends on the frequency of invocation of HotSpot Such inaccuracy impacts both wire length and temperature Our cost function used in the simulated annealing engine on the other hand combines both parameters The results suggest that our heuristic outperforms TWD technique in both temperature and wire length Figure 8: Average execution times, comparing our heuristic to TWD [11] for different number of checkpoints In Figure 8 we report the average execution times for the different techniques and different number of checkpoints As shown in the Figure, the execution time of TWD technique deteriorates as the number of check points goes higher The same trend is seen as the number of moves between temperature updates are reduced (TWD_100) Compared to both versions of TWD, our technique performs dynamic placement planning very fast The main reason is that our technique does not need to call timeconsuming differential equation solving inside the simulated annealing engine In fact, the formula provided in the previous section requires simple operations that can be performed very fast 6 Conclusion In this paper, a novel thermal-aware dynamic placement planner for reconfigurable systems for transient temperature reduction is presented We have introduced and modeled the concepts of critical power and minimal safe temperature which directly relates the power densities of the elements in the reconfigurable system to hotspots of the chip We have designed a fast placement technique to be reconfigured dynamically during design time Our results indicate that our technique performs significantly faster (two orders of magnitude) while the quality of the placements generated in terms of temperature and interconnection overhead is the same, if not better, compared to the thermal-aware placement techniques which perform thermal simulations inside the search engine References [1] S H Gunther, et al, "Managing the impact of increasing microprocessor power consumption," Intel Technology ournal, vol 5, pp 1-9, Feb 2001 [2] R Rao, et al, "Efficient online computation of core speeds to maximize the throughput of thermally constrained multi-core processors," in ICCAD, 2008 [3] F Mulas, et al, "Thermal Balancing Policy for Streaming Computing on multiprocessor architectures," in DATE, 2008 [4] A K Coskum, et al, "Temperature management in multiprocessor SoCs using online learning," in DAC, 2008 [5] NEC electronics, "Dynamically Reconfigurable Processor (DRP) What is DRP?," 2004 [6] N Suzuki, et al, "Implementing and Evaluating Stream Applications on DRP," in FCCM'04 [7] K Skadron, et al, "Temperature aware microarchitecture," in ISCA, 2003 [8] P Lim, et al, "Thermal-aware high-level synthesis based on network flow method," in CODES+ISSS, 2006 [9] T Chantem, et al, "Temperature-Aware Scheduling and Assignment for Hard Real-Time Applications on MPSoCs," in DATE, 2008 [10] Z gu, et al, "TAPHS: Thermal-Aware Unified Physical-Level and High-Level Synthesis," in ASPDAC, 2006 [11] M Healy, et al, "Thermal Optimization in Multi-Granularity Multi-Core Floorplanning," in ASPDAC, 2009 [12] J Jaffari, et al, "Thermal-Aware Placement for FPGAs using Electrostatic Charge Model," in ISQED, 2007 [13] H D Mogal, et al, "Thermal-Aware Floorplanning for Task Migration Enabled Active," in ICCAD, 2008 [14] S Heo, et al, "Reducing power density through activity migration," in ISLPED, 2003 [15] S C Chapra, Numerical Methods for Engineers, 4th Ed McGraw Hill, 2002 [16] C D Meyer, Matrix Analysis and Applied Linear Algebra Society for Industrial and Applied Mathematics, 2000 [17] Vaughn Betz, et al, Architecture and CAD for Deep- Submicron FPGAs KLUWER Academic Publishers, 1999 [18] Express group at UCSB:, "http://expresseceucsbedu/benchmark/" [19] G Link, et al, "Thermal trends in emerging technologies," in ISQED, 2006 54