System-Level Power, Thermal and Reliability Optimization


by Changyun Zhu

A thesis submitted to the Department of Electrical and Computer Engineering in conformity with the requirements for the degree of Doctor of Philosophy

Queen's University
Kingston, Ontario, Canada
July 2009

Copyright Changyun Zhu, 2009

Abstract

An integrated circuit can now contain more than one billion transistors. With increasing system integration and technology scaling, power and power-related issues have become the primary challenges of integrated circuit design. This dissertation proposes techniques and algorithms, ranging from system-level synthesis to emerging integration and device technologies, to address the power and power-induced thermal and reliability challenges of modern billion-transistor integrated circuit design.

Chapter 1 introduces the challenges of semiconductor technology scaling. Chapter 2 reviews related work. Chapter 3 focuses on reliability optimization during system-level design. It proposes TASR, a reliable application-specific multiprocessor system-on-chip synthesis system that exploits redundancy and thermal-aware design planning to produce reliable and compact circuit designs. Chapter 4 introduces three-dimensional (3D) integration, a new integrated circuit fabrication and integration technology for which thermal issues are a primary concern. The chapter proposes an analytical framework for heat flow in 3D integrated circuits, then presents and evaluates proactive, continuously-engaged hardware and operating system thermal management techniques that achieve better system performance than state-of-the-art techniques while honoring the same temperature bound. Chapter 5 presents a reconfigurable architecture design using the single-electron tunneling transistor, an ultra-low-power nanometer-scale device. The proposed design has the potential to overcome the power and energy barriers for both high-performance computing and ultra-low-power embedded systems. Conclusions are drawn in Chapter 6.

Co-Authorship

All work regarding Reliable MPSoC Synthesis, 3D CMP Thermal Management and Characterization of SET Transistors in this thesis (i.e., Chapters 3, 4 and 5) was done in collaboration with Zhenyu Gu.

Acknowledgments

First, I would like to gratefully thank my supervisor, Professor Li Shang, not only for his supervision of my research work, but also for his patience and help, which encouraged me to complete my studies. He has all the traits of an excellent research supervisor. I thank my committee members, Professor Robert Knobel, Professor Ahmad Afsahi and Professor Alireza Bakhshai, for their corrections, suggestions and valuable feedback. I would also like to thank Professor Naraig Manjikian for his kind help during my studies at Queen's University. Thanks are also given to Zhenyu Gu, Yonghong Yang, Kun Li, Nicholas Allec, Assem Bsoul, Zyad Mohamed, Professor Robert P. Dick and Professor Qin Lv for their invaluable discussions. Finally, I am grateful to my parents, wife and friends for their support and encouragement over these years.

Table of Contents

Abstract
Co-Authorship
Acknowledgments
Table of Contents
List of Symbols
List of Tables
List of Figures

Chapter 1: Introduction
  1.1 Technology Scaling and Design Challenges
  1.2 Dissertation Overview

Chapter 2: Related works
  2.1 Reliability-aware synthesis
  2.2 Three-dimensional integrated circuit
  2.3 Single-electron tunneling transistors

Chapter 3: Reliable MPSoC Synthesis
  3.1 Introduction
  3.2 TASR: Temperature-Aware Synthesis of Reliable MPSoCs
  3.3 Experimental Results
  3.4 Conclusions and Future Work

Chapter 4: 3D CMP Thermal Management
  4.1 Introduction
  4.2 Contribution
  4.3 Heat Flow in 3D CMPs
  4.4 3D CMP Thermal Management
  4.5 Experimental Setup
  4.6 Experimental Results
  4.7 Conclusions

Chapter 5: Characterization of SET Transistors
  5.1 Introduction
  5.2 SET Modeling
  5.3 IceFlex: A Fault-Tolerant Hybrid SET/CMOS Reconfigurable Architecture
  5.4 Experimental Results
  5.5 Conclusions

Chapter 6: Conclusions and Future Work
  6.1 Thesis Summary
  6.2 Future Work

Bibliography

List of Symbols

A           Thermal conductance matrix
C           Capacitance
C_D         Drain capacitance
C_G         Gate capacitance
C_S         Source capacitance
C_P         Island capacitance
E_aem       Activation energy of electromigration
E_asm       Activation energy of stress migration
F(t)        Cumulative distribution function
G           Gain
I           Current
J           Current density
K           Diagonal matrix containing the thermal conductances of adjacent thermal elements
K_eff       Effective vertical thermal conductivity
K_layer     Thermal conductivity of the region without any vias
K_via       Thermal conductivity of the via material
L           Laplacian matrix
P_d         Power density
R           Resistance
R_D         Drain resistance
R_S         Source resistance
T           Temperature
T_0         Metal deposition temperature during fabrication
T_ambient   Ambient temperature
T_average   Chip average temperature
V_TH        Threshold voltage
β           Alpha power law parameter
κ or κ_B    Boltzmann's constant
µ           Scale parameter of lognormal distribution
ρ_via       Via density
σ           Shape parameter of lognormal distribution
ξ           Run-time switching activity multiplied by the capacitance of the switched nodes
ζ_ij        Thermal impact coefficient for core i due to core j
e           Elementary charge
f           Frequency
f(t)        Probability density function
g           Conductance
h           Planck's constant

3D          Three-dimensional
BIPS        Billion instructions per second
BJT         Bipolar junction transistor
CDF         Cumulative distribution function
CMOS        Complementary metal-oxide-semiconductor
CMP         Chip-level multiprocessor
CR          Component redundancy
DRAM        Dynamic random access memory
DSP         Digital signal processing
DTM         Dynamic thermal management
DVFS        Dynamic voltage and frequency scaling
EEMBC       Embedded Microprocessor Benchmark Consortium
FPGA        Field-programmable gate array
IC          Integrated circuit
IPC         Instructions per cycle
LUT         Lookup table
MPSoC       Multiprocessor system-on-chip
MTTF        Mean time to failure
MVL         Majority voting logic
NoC         Network on chip
OS          Operating system
PDF         Probability density function
PE          Processing element
PRSA        Parallel recombinative simulated annealing
SET         Single-electron tunneling transistor
SMT         Simultaneous multithreading
TIP         Thermal impact per performance

List of Tables

3.1 System MTTF Improvement Under Area Bound [132]
4.1 ThermOS Implementation [134]
4.2 DVFS and Clock Throttling Comparison [134]
4.3 Design Parameters for Alpha [134]
4.4 3D Package Setup [134]
4.5 Benchmark Characteristics [134]
4.6 Benchmark Suites [134]
5.1 Island Size Estimation [133]
5.2 Design Space Characterization [133]
5.3 Impact of Majority Vote Logic on SELB Fault Probability [133]
5.4 Characterization of IceFlex Microarchitecture for C_Σ = e^2/(40 k_B T) [133]
5.5 Characterization of IceFlex Interconnect Fabric for C_Σ = e^2/(40 k_B T) [133]
5.6 Latency and Energy Improvement for Exclusive-Or Design [133]
5.7 IceFlex Performance and Power Consumption at Room Temperature for C_Σ = e^2/(40 k_B T) [133]

List of Figures

1.1 Intel CPU Transistor Count [2]
1.2 Microprocessor Power Consumption
1.3 Temperature Profile for Active Layer and Heatsink [123]
3.1 Reliable MPSoC Synthesis Example [132]
3.2 TASR Flow for the Temperature-Aware Synthesis of Reliable MPSoCs [132]
3.3 Temperature Impact on MTTF [38]
3.4 Comparison of MPSoC Area Reliability Tradeoffs [38]
3.5 Comparison of Different Optimization Heuristics [132]
4.1 (a) Comparison of Face-to-Face (Left) and Face-to-Back (Right) Configurations for Two Stacked Dies, (b) 3D Three Stacked Die Floorplan Used in This Work, and (c) 3D CMP Chip-package Thermal Modeling [134]
4.2 Inter-layer and Intra-layer Thermal Heterogeneity and Dominance in 3D CMPs [134]
4.3 ThermOS: 3D CMP Run-time Thermal Management [134]
4.4 Comparison of ThermOS and Distributed Approach [28, 134]
4.5 Reduction in Temperature Constraint Violations due to Local DVFS and Elimination of Temperature Constraint Violations due to Clock Throttling [134]
4.6 Temporal Temperature Variation for Eight Processor Cores (P0-P7) Running lv-mipc2 Using Local DVFS without (Top) and with (Bottom) Clock Throttling [134]
4.7 Negligible CMP Instruction Throughput Reduction Resulting from Local DVFS and Clock Throttling [134]
4.8 Impact of Global Guidance Interval [134]
4.9 Impact of Lookup Table Size [134]
4.10 Impact of Floorplan Rotation [134]
5.1 SET Structure and Schematic [133]
5.2 SET Coulomb Oscillation (C_g = 3.2 aF, C_s = C_d = 1.0 aF, and R_s = R_d = 10 MΩ) [133]
5.3 IceFlex Microarchitecture [133]
5.4 Multi-gate SET Multiplexer Tree [133]
5.5 SET Configuration Memory [135]
5.6 SET Parity Circuit [133]
5.7 Hybrid SET/CMOS Interface Circuitry [133]
5.8 Power and Performance of the Multi-gate SET Multiplexer Tree for High Performance, C_Σ = e^2/(40 k_B T) [133]
5.9 Performance and Power Characterization of Exclusive-or Logic for Low Power for C_Σ = e^2/(40 k_B T) [133]

Chapter 1

Introduction

1.1 Technology Scaling and Design Challenges

Following Gordon E. Moore's 1965 observation, the number of transistors that can be integrated on a chip has doubled roughly every 18 to 24 months [78]. During the past four decades, semiconductor technology scaling has provided consistent improvements in circuit performance and integration density. Figure 1.1 shows the technology scaling of Intel microprocessors. With increasing system integration and technology scaling, integrated circuit design becomes increasingly complex. Power and power-induced design issues, such as chip temperature and circuit reliability, have become the primary concerns of modern integrated circuit design.

Figure 1.1: Intel CPU Transistor Count [2]. (Plot of transistor count, on a logarithmic scale, versus year for Intel processors from the Pentium to the Core i7 and Itanium families.)

Power Challenges

Although technology scaling provides higher functional integration, more computing resources, better performance and parallel operation capability, the increased operating frequency and transistor density raise the circuit dynamic power consumption. Furthermore, subthreshold leakage is an inverse exponential function of a transistor's threshold voltage (V_TH), and V_TH is reduced with technology scaling under the constant electric field scaling scenario, so chip leakage power increases exponentially [97]. Figure 1.2 shows the power consumption of microprocessors released during the past twenty years. It indicates an exponential increase in power due to increased voltage, frequency and temperature, and decreased threshold voltage.

Figure 1.2: Microprocessor Power Consumption. (Plot of power in watts versus year for Intel, Alpha, SPARC, MIPS, HP PA, PowerPC, AMD and Sun microprocessors.)

Thermal Challenges

As increasingly dense integrated circuits consume more power, they generate more heat, which raises chip temperatures. Temperature has a large impact on IC performance, cooling cost, reliability, and power consumption. The latencies of transistors and metal wires increase with increasing chip temperature, as do the probabilities of many lifetime reliability faults [53, 102]. For example, the electromigration failure rate is an exponential function of temperature. Leakage power consumption is now responsible for a substantial proportion of overall power consumption in commercial designs and increases with temperature [67].

IC chips and packages exhibit significant spatial and temporal temperature variations due to the heterogeneity of thermal conductivity and heat capacity in different materials, as well as the variation of power profiles. This requires accurate chip-package heat flow analysis, which is complex and computationally intensive. In the example shown in Figure 1.3, the steady-state thermal profile of the active layer of the silicon die, together with the top layer of the cooling package, is characterized using a multigrid thermal solver, which must partition the chip and the cooling package into 131,072 homogeneous thermal elements. Compared to steady-state thermal modeling, characterizing an IC dynamic thermal profile is even more time consuming. IC synthesis requires a large number of optimization steps, so thermal modeling can easily become its performance bottleneck [123].

Figure 1.3: Temperature Profile for Active Layer and Heatsink [123]. (Plot of temperature in °C versus position in mm for the IC active layer and the heatsink/IC interface.)
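To make the cost of chip-package heat flow analysis more concrete, the following minimal sketch solves a steady-state thermal network of the form A T = P, where A is a thermal conductance matrix and P holds the power dissipated in each thermal element. It is an illustration under stated assumptions, not the multigrid solver cited in the text: the one-dimensional chain topology, the function names and the example values are hypothetical, and dense Gaussian elimination is swapped in for the multigrid method.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Build the augmented matrix [a | b] without modifying the inputs.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def steady_state_temperatures(g_neighbor, g_ambient, power, t_ambient):
    """Steady-state temperatures of a 1D chain of thermal elements.

    g_neighbor[i]: conductance (W/K) between element i and element i + 1
    g_ambient[i]:  conductance (W/K) from element i to ambient
    power[i]:      power (W) dissipated in element i
    All names and the chain topology are assumptions for illustration.
    """
    n = len(power)
    a = [[0.0] * n for _ in range(n)]
    for i in range(n):
        a[i][i] += g_ambient[i]
    for i, g in enumerate(g_neighbor):
        a[i][i] += g
        a[i + 1][i + 1] += g
        a[i][i + 1] -= g
        a[i + 1][i] -= g
    # Unknowns are temperature rises above ambient.
    rise = solve_linear(a, power)
    return [t_ambient + r for r in rise]


if __name__ == "__main__":
    # Three elements; only the last one has a path to ambient.
    print(steady_state_temperatures([1.0, 1.0], [0.0, 0.0, 0.5],
                                    [1.0, 0.0, 0.0], 45.0))
```

Dense elimination scales cubically with the number of elements, which suggests why a model with on the order of 10^5 elements calls for a specialized solver such as the multigrid method mentioned in the text.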

Reliability Challenges

Aggressive scaling of CMOS process technology also poses serious challenges to the lifetime reliability of ICs. Reduction of feature size and increases in power density have resulted in increasing chip temperature and failure rates. Increased system integration using these vulnerable devices and interconnects results in reduced system reliability. The severity of many reliability problems, such as time-dependent dielectric breakdown in MOS transistors and electromigration in interconnects, increases exponentially with temperature. Lifetime reliability is becoming an important quality metric in high-performance ICs.

Optimizing lifetime reliability requires careful planning during IC design and synthesis. At the architectural level, careful assignment of tasks to processing elements (PEs) can balance the thermal profile of the chip, thereby improving system reliability. Synthesis-time architectural planning and careful use of PE-level and component-level (e.g., functional unit) redundancy permit continued MPSoC operation after the failure of some processors or components, while limiting area overhead. At the physical level, a fast floorplanner is needed to provide physical information for generating the power profile which, in turn, is used to determine the thermal profile. The evaluation and optimization of system reliability and other design metrics, such as area and performance, require a comprehensive and efficient architectural-level and physical-level synthesis infrastructure.

In summary, power, thermal and reliability issues have become dominant constraints in modern nanoscale integrated circuit design. For high-performance applications, temperature affects integration density, performance, power consumption and cost. For battery-powered embedded systems, energy consumption directly determines system lifetime. For any system, reliability strongly depends on the thermal profile during operation.

1.2 Dissertation Overview

This dissertation addresses power, thermal and reliability challenges from three aspects: system-level synthesis algorithms, a recently proposed circuit integration technology, and an emerging device technology. First, reliability considerations are integrated into the system-level synthesis algorithms of the IC design flow. Then, three-dimensional integration, a recently proposed technology that overcomes limitations of 2D technology, is discussed. Finally, an emerging device technology, the single-electron tunneling transistor, is evaluated as a way to overcome the coming challenges for CMOS devices. The rest of this dissertation is organized as follows.

First, technology scaling and increasing power densities are increasing the severity of IC lifetime reliability problems. The lifetime reliability problem cannot be well solved at any single level of the design process. Reliability characterization requires chip-package thermal profiles, which in turn require physical information, including an IC floorplan, power profile, and chip-package thermal model. Reliability-aware IC design requires a unified architectural-level and physical-level design flow. Therefore, Chapter 3 proposes a system-level synthesis flow that conducts architectural synthesis, floorplanning, on-chip network synthesis, chip-package thermal analysis, and reliability analysis. Optimization algorithms within this flow exploit redundancy and temperature-aware design planning to produce reliable, compact IC designs. My major contribution to this chapter is on the MPSoC reliability modeling, temperature-dependent reliability modeling and reliability-aware optimization algorithm design.
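The exponential temperature dependence of electromigration noted in this chapter is commonly modeled with Black's equation, MTTF = A J^(-n) exp(E_a / (k T)). The sketch below is a hedged illustration rather than the reliability model used in this thesis: the function name and the 0.9 eV activation energy are assumptions chosen only for the example. It compares MTTF at two temperatures for a fixed current density, so the prefactor A and the J^(-n) term cancel.

```python
import math

# Boltzmann constant in eV/K.
BOLTZMANN_EV_PER_K = 8.617333262e-5


def em_mttf_ratio(t_kelvin, t_ref_kelvin, e_a_ev=0.9):
    """Return MTTF(T) / MTTF(T_ref) under Black's equation.

    Assumes the same current density at both temperatures.
    e_a_ev is an illustrative activation energy, not a value
    taken from the thesis.
    """
    exponent = (e_a_ev / BOLTZMANN_EV_PER_K) * (1.0 / t_kelvin - 1.0 / t_ref_kelvin)
    return math.exp(exponent)


if __name__ == "__main__":
    for t in (350.0, 360.0, 370.0):
        print(t, em_mttf_ratio(t, 350.0))
```

Under these assumptions, a 10 K rise from 350 K reduces the electromigration MTTF to less than half, which is why a balanced thermal profile matters for lifetime reliability.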

My collaborator, Zhenyu Gu, contributed to the floorplanning and on-chip network synthesis. Two papers have been published on this project [132, 38].

Second, three-dimensional (3D) integration has the potential to improve the communication latency and integration density of IC designs. By stacking multiple device layers connected through inter-die vias, 3D technology significantly reduces on-chip wire length, enables efficient interconnect and logic design, and further boosts logic integration density. However, the stacked high power density layers of 3D chips increase the importance and difficulty of thermal management. Chip power density increases linearly with the number of vertically-stacked active circuit layers. In addition, the bonding layers used in 3D integration have low thermal conductivities, which further exacerbates thermal effects. Chapter 4 identifies and describes the critical concepts required for optimal thermal management and proposes proactive, continuously-engaged hardware and operating system thermal management techniques that achieve better performance than state-of-the-art techniques while honoring the same temperature bound. My major contribution to this chapter is on the characterization of heat flow in 3D CMPs, the derivation of optimal workload assignment and power and thermal budgeting, and the thermal management implementation in the Linux kernel. My collaborator, Zhenyu Gu, contributed to the design of the 3D CMP architecture and technology, the construction of the full simulation framework, and the characterization and generation of the benchmark suites. Two papers have been published on this project [131, 134].

Third, device researchers have seen the coming challenges for CMOS devices and evaluated alternative technologies such as single-electron tunneling transistors (SETs). The International Technology Roadmap for Semiconductors projects that SETs have the potential to achieve the lowest projected energy per switching event of any known device. However, their use poses unique architectural, circuit design and fabrication challenges. Chapter 5 explores the potential use of SETs in low-power embedded systems, evaluates the benefits and limitations of SETs, and characterizes the impacts of SETs on system design metrics. Based on the evaluation of the architectural and circuit-level features, a fault-tolerant, reconfigurable, hybrid SET/CMOS architecture is proposed in this chapter. My major contribution to this chapter is on the SET modeling, SET design space characterization and characterization of the IceFlex architecture. My collaborator, Zhenyu Gu, contributed to the global/local interconnect design and the characterization of embedded applications. Two papers have been published on this project [135, 133].

Finally, Chapter 6 concludes this dissertation and presents potential future research problems.

Chapter 2

Related works

2.1 Reliability-aware synthesis

Our reliable MPSoC synthesis work draws from research in integrated circuit reliability modeling and optimization [103, 21], system synthesis [30, 42, 120, 64], physical design, and thermal analysis [99, 123]. Coskun et al. [21] and Srinivasan et al. [103] provided architectural reliability models and run-time optimization techniques for MPSoCs and microprocessors, respectively. Eles et al. contrasted optimization algorithms for use in hardware-software partitioning [30]. Henkel and Ernst proposed flexible task discretization during hardware-software partitioning [42]. Xie et al. proposed a technique to duplicate tasks on idle processors during embedded system synthesis to tolerate transient faults [120]. Lee and Ha proposed an allocation, assignment, and scheduling algorithm for real-time MPSoCs [64]. Ogras et al. proposed a branch-and-bound algorithm for NoC synthesis [81]. Glaß et al. proposed an evolutionary algorithm that binds tasks to resources with the goal of improving mean time to failure (MTTF) [36]. They considered fault processes with exponential or Weibull distributions; their fault model supports permanent faults. Our system and fault model differs primarily by considering the influence of faults on subsequent fault rates due to the impact of run-time rebinding on the temperature profile.

2.2 Three-dimensional integrated circuit

This section summarizes the current status of 3D integration in microprocessor design, surveys related work in microprocessor thermal management, and indicates the special thermal management challenges 3D CMPs will bring.

Several 3D fabrication technologies have been proposed and developed [109, 108, 95]. Topol et al. reviewed the 3D fabrication process and design techniques developed at IBM [109]. Tezzaron [108] and Samsung [95] developed 3D fabrication technologies, and Intel is planning to use 3D integration in the Terascale project [115].

3D integration increases the importance of, and complicates, thermal management. The 2D heat flux density through the heatsink increases roughly linearly with the number of stacked wafers. As a result, unless per-layer power densities are greatly reduced, 3D CMPs will often operate near their thermal limits. Today's 2D CMPs already operate at or near their thermal limits, and rely on reactive management techniques to maintain thermal safety. In addition to increasing the importance of thermal management, 3D integration complicates thermal management policy design. In contrast with 2D CMPs, the temperatures of some pairs of 3D CMP processor cores, e.g., vertically-adjacent cores, are highly correlated. Moreover, in 2D CMPs, processor cores have similar thermal resistances to the ambient, and high thermal resistances to other cores. In 3D CMPs, core resistance to ambient and thermal interaction between cores are highly heterogeneous. For example, heat generated in cores farther from the heatsink must flow through more layers of silicon and polyimide bonding before reaching the heatsink.

We next survey work in microprocessor thermal management. Initially, thermal control strategies were seen as an infrequently engaged last resort. However, due to increasing transistor densities and limitations in cooling technology, thermal control will be constantly engaged. ThermOS was developed for this emerging thermal management paradigm.

Black et al. evaluated the performance improvement yielded by stacking memory and logic layers [12]. Healy et al. proposed a microarchitecture-level floorplanning algorithm that works for both 2D and 3D ICs [39]. Kgil et al. proposed an architecture in which processing core layers are vertically integrated with main memory consisting of multiple DRAM dies, permitting performance and power consumption improvements compared to 2D designs [57]. Li et al. proposed a 3D topology that combines the benefits of network-on-chip and 3D technology to reduce L2 cache latencies [65]. Tsai et al. explored cache implementation in 3D technologies [110].

Thermal issues are critical for 3D integration. Puttaswamy and Loh evaluated the thermal impact of 3D integration on high-performance microprocessors [89]. They also proposed a family of techniques that reduce 3D power density and assign more power to the die closest to the heatsink [90]. These approaches are principally applied at design time. Skadron et al. described a compact thermal analysis technique that has been extended to support 3D integration [99]. Loi et al. studied processor and memory behavior under temperature constraints for 3D technology [72]. Link and Vijaykrishnan examined thermal effects in 3D technologies [71].
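The observation in this section that heat from cores farther from the heatsink must cross more layers can be illustrated with a one-dimensional thermal resistance ladder. The sketch below is a simplified illustration, not the heat flow model of this thesis: it assumes that all heat leaves through the heatsink, that each bonding interface has the same lumped resistance, and that lateral heat spreading is ignored. The function name and the example values are hypothetical.

```python
def stack_temperatures(layer_power, r_interface, r_heatsink, t_ambient):
    """Steady-state temperature of each layer in a vertical die stack.

    layer_power[0] is the layer closest to the heatsink (W).
    r_interface:   resistance (K/W) of each inter-layer bonding interface
    r_heatsink:    resistance (K/W) from the first layer to ambient
    Assumes a purely vertical heat path (illustrative simplification).
    """
    temps = []
    # Heat generated in this layer and in every layer farther away.
    heat_crossing = sum(layer_power)
    t = t_ambient + r_heatsink * heat_crossing
    for i, p in enumerate(layer_power):
        if i > 0:
            # Heat from layers i and beyond must cross this interface.
            t += r_interface * heat_crossing
        temps.append(t)
        heat_crossing -= p
    return temps


if __name__ == "__main__":
    for layers in (1, 2, 3):
        print(layers, stack_temperatures([10.0] * layers, 0.1, 0.5, 45.0))
```

With equal power per layer, the temperature rise across the heatsink grows linearly with the number of layers in this model, and each additional layer sits hotter than the one below it, consistent with the heterogeneity described in the text.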

Brooks and Martonosi presented one of the first evaluations of dynamic thermal management (DTM) [14]. In essence, DTM allows microprocessor designers to design for the average-case, instead of the worst-case, power profile and to rely on run-time mechanisms to detect and resolve potential thermal emergencies. This yields better overall performance than pessimistically designing systems based on the worst-case power profile. Li et al. examined the impact of several design constraints, including thermal effects, on CMP architecture design [69]. Sun et al. proposed a temperature-aware synthesis technique for 3D CMPs [104], but did not consider run-time OS management.

Migration strategies can improve the use of multi-core processors by distributing heat generation more uniformly across the chip. Heo et al. proposed reducing peak power density by moving computation to another physical location [43]. Powell et al. explored the benefit of OS thermal management for SMTs and CMPs [87]. They proposed the Heat and Run strategy, in which the OS co-schedules and migrates SMT threads to maximize resource utilization before a thermal emergency arises and then migrates computation to an idle core. Kumar et al. examined hardware-software thermal management that uses hardware performance counters to characterize thermal behavior and kernel support to schedule tasks [63]. They evaluated their mechanism on a real system with SMT support and found significant benefits from considering system-level effects that cannot be accounted for with pure hardware techniques. We likewise take advantage of kernel scheduling and performance counters, and additionally consider multi-core management. Recent work by Park et al. examined energy-performance tradeoffs in multi-threaded applications [83].
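The run-time mechanism behind DTM can be sketched as a threshold controller acting on a simple thermal model. The following is a generic, hypothetical illustration, not ThermOS or any of the cited schemes: it assumes a first-order thermal model with one lumped thermal resistance, a discrete set of frequency levels, and dynamic power proportional to the cube of relative frequency (that is, supply voltage scaling linearly with frequency). All parameter names and values are assumptions.

```python
def simulate_dtm(steps, t_limit, t_ambient=45.0, r_th=0.8, tau_steps=20.0,
                 p_full=60.0, levels=(1.0, 0.8, 0.6, 0.4)):
    """Simulate reactive frequency throttling on a first-order thermal model.

    Returns a list of (temperature, relative_frequency) per time step.
    r_th is a lumped thermal resistance (K/W); tau_steps is the thermal
    time constant expressed in simulation steps. Values are illustrative.
    """
    temp = t_ambient
    level = 0
    trace = []
    for _ in range(steps):
        # Reactive policy: slow down one level when over the limit,
        # speed up one level once 2 degrees below it (hysteresis).
        if temp > t_limit and level < len(levels) - 1:
            level += 1
        elif temp < t_limit - 2.0 and level > 0:
            level -= 1
        freq = levels[level]
        # Assumed power model: P ~ f * V^2 with V proportional to f.
        power = p_full * freq ** 3
        t_steady = t_ambient + r_th * power
        # First-order approach toward the steady-state temperature.
        temp += (t_steady - temp) / tau_steps
        trace.append((temp, freq))
    return trace


if __name__ == "__main__":
    trace = simulate_dtm(steps=500, t_limit=80.0)
    print(max(t for t, _ in trace), trace[-1])
```

In this sketch the unthrottled design would settle well above the limit, while the controller holds the temperature near it at the cost of intermittently reduced frequency, which is the average-case versus worst-case tradeoff described in the text.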

2.3 Single-electron tunneling transistors

Since single-electron tunneling transistors were discovered in the 1980s [9, 33], there has been extensive research on the fabrication, design, and modeling of SETs [70]. SET fabrication and use in high-sensitivity amplifiers at cryogenic temperatures has been the main research focus [25]. SETs and simple circuits with a variety of structures were proposed and fabricated using different methods and materials [80, 105, 6]. Recently, researchers have fabricated SETs that operate at room temperature [75, 98, 84]. Various SET-based circuit applications, such as logic [111, 112, 79, 19] and memory [126, 118, 122], have been developed. These works provide a promising start for SET circuit design. However, they did not provide an architectural evaluation. We do not claim to have improved the performance of SET-based logic gates. Instead, we are the first to develop the modules necessary to support architectural design and synthesis and to evaluate the architectural performance and power consumption implications of using SETs. The results indicate orders-of-magnitude improvements in power consumption and energy efficiency compared to CMOS.

SET modeling and simulation has been an active research area. Monte Carlo simulation has been widely used to model SETs. SIMON [117] and MOSES [17] are the two most popular SET simulators. However, they are too slow for the analysis of large circuits. Uchida et al. proposed an analytical SET model and incorporated it into SPICE [113]. Recently, Inokawa et al. extended this model to a more general form to include asymmetric SETs [49]. Mahapatra et al. proposed a simulation framework for hybrid SET/CMOS circuit design and analysis [73]. Their model for SET behavior is similar to that of Uchida et al. These compact modeling techniques are efficient enough for use in SET circuit design and analysis and closely match Monte Carlo simulation results.

Significant challenges still remain for large-scale integration of SETs and for room-temperature operation. SETs that operate reliably at room temperature have critical dimensions of 1-10 nm. They are challenging to fabricate using current top-down lithographic techniques. However, several exciting advances make the evaluation of architectures for high-density logic based on SETs worthwhile. Scanning-probe microscopes can be used to create devices smaller than those made using conventional lithography [75]. Continual progress has been made on bottom-up nano-fabrication techniques, where chemical techniques are used to make individual molecules with useful electronic properties. Molecular quantum dots [40] can display SET behavior. Larger structures, such as carbon nanotubes and nanowires, can act as SETs [6]. These bottom-up techniques can create structures supporting room-temperature SET operation. However, more research is needed in order to integrate individual devices into large-scale circuits. Very recent advances in graphene devices [35] show promise for SETs.

Reliable methods for cooling to very low temperatures without supplies of liquid helium or nitrogen are also becoming more common [114]. For high-performance computing, the added complexity of operating at cryogenic temperatures may not be a limiting factor. Similarly, cryogenic temperatures are readily attained using passive methods in outer space.
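The difficulty of room-temperature operation can be made concrete with the capacitance condition C_Σ = e^2/(40 k_B T) that appears in this thesis's table and figure titles. The sketch below assumes that this condition is read as the total island capacitance at which a design is characterized for temperature T; the function name and the choice of example temperatures are illustrative assumptions.

```python
# Exact SI values of the elementary charge and the Boltzmann constant.
ELEMENTARY_CHARGE_C = 1.602176634e-19
BOLTZMANN_J_PER_K = 1.380649e-23


def island_capacitance(temperature_k, margin=40.0):
    """Return C_sigma = e^2 / (margin * k_B * T) in farads.

    margin=40 matches the factor in the thesis's table titles;
    treating it as a tunable parameter is an assumption made here.
    """
    return ELEMENTARY_CHARGE_C ** 2 / (margin * BOLTZMANN_J_PER_K * temperature_k)


if __name__ == "__main__":
    for t in (4.2, 77.0, 300.0):
        # Print the capacitance in attofarads (1 aF = 1e-18 F).
        print(t, island_capacitance(t) / 1e-18)
```

At 300 K this gives roughly 0.15 aF, and the value grows in inverse proportion to temperature, which suggests why nanometer-scale islands are needed at room temperature while cryogenic operation tolerates larger structures.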

31 Chapter 3 Reliable Multiprocessor System-On-Chip Synthesis This chapter presents a multiprocessor system-on-chip (MPSoC) synthesis algorithm that optimizes system mean time to failure. Given a set of directed acyclic periodic graphs in which nodes present a number of operations and edges represent the communication events, in order to minimize system failure rate and area while meeting functionality and timing constraints, the proposed algorithm determines 1) a processor core allocation,which allocate the necessary processor cores into the MPSoC system; 2) processor-level redundancy, which add identical processor cores to the MP- SoC architecture; 3) component-level structural redundancy, which add appropriate control mechanisms and redundant hardware to individual processor cores; 4) assignment of tasks to processors, which map each specific task in a processor core; 5) floorplan, which estimate the area of each processor core and arrange all these cores within an given region. and 6) scheduling, which determine when each operation is given the access to system resource. Changes to the thermal profile resulting from changes in 15

32 CHAPTER 3. RELIABLE MPSOC SYNTHESIS 16 allocation, assignment, scheduling, and floorplan are modeled and optimized during synthesis, as is the impact of thermal profile on temperature-dependent failure mechanisms. The proposed techniques have the potential to substantially increase MPSoC system mean time to failure compared to area-optimized solutions. If power densities are high and the dominant lifetime failure mechanisms are strongly dependent on temperature, our results indicate that thermal and structural redundancy optimization during synthesis have the potential to greatly increase MPSoC lifetime with low area cost. My major contribution to this chapter is on the MPSoC reliability modeling, temperature-dependent reliability modeling and reliability-aware optimization algorithm design( Section 3.2.1, 3.2.2, 3.2.3, and 3.2.5). My collaborator, Zhenyu Gu, contributed to the floorplanning and on-chip network synthesis( Section 3.2.6). 3.1 Introduction A single integrated circuit can now contain more than one billion transistors. It has been necessary to move to MPSoCs to control design complexity and power consumption. Increasing power density due to continued scaling of CMOS process technology accelerates temperature-dependent and current-dependent failure mechanisms such as electromigration. Lifetime reliability is becoming an important quality metric in highperformance MPSoCs. Optimizing lifetime reliability requires careful planning during MPSoC design and synthesis. This problem cannot be well solved at any single level of the design process. Reliability characterization requires MPSoC thermal profiles, which in turn requires physical information, including an MPSoC floorplan, power profile, and chip-package thermal model. Reliability-aware MPSoC design requires

a unified architectural-level and physical-level design flow.

Contributions

Our work addresses synthesis of MPSoCs capable of reliable operation in the presence of permanent faults. The proposed algorithm generates MPSoC architectures that satisfy the functionality and performance constraints of a specification while simultaneously optimizing die area and MTTF. The problem specification consists of graphs composed of data-dependent, multirate, periodic tasks as well as a database of processor cores. Each processor core executes different tasks with different execution times and power consumptions. This work makes the following main contributions.

1. We have developed and implemented an MPSoC synthesis flow that conducts architectural synthesis, floorplanning, on-chip network synthesis, chip-package thermal analysis, and reliability analysis. Optimization algorithms within this flow exploit redundancy and temperature-aware design planning to produce reliable, compact MPSoC designs.

2. We propose a two-phase reliability optimization flow that builds on a stochastic functionality, performance, and area optimization algorithm and an iterative reliability enhancement algorithm that explores the trade-off between MPSoC reliability and area. This algorithm improves MPSoC system MTTF by an average of 85% with less than 5% area cost and by an average of 436% with less than 25% area cost, compared to area-optimized solutions.

To the best of our knowledge, this is the first work to propose and implement a method of predicting and optimizing the impact of design changes during synthesis

on temperature-dependent MPSoC failure processes.

Figure 3.1: Reliable MPSoC Synthesis Example [132]. (Panels: Solution I, with an AMD K6-2E+ and two PowerPC cores; Solution II, with a PowerPC (RE) and two PowerPC cores.)

System MTTF Definition and Example

We define system MTTF to be the expected amount of time an MPSoC will operate, possibly in the presence of component faults, before its performance drops below some designer-specified constraint or it is no longer able to meet its functionality requirements. Using system MTTF to characterize reliability has the advantage of taking into account performance; this is important for consumer electronics and most other MPSoC applications.

To concurrently optimize the system MTTF and area of an MPSoC, it is necessary to exploit both hardware redundancy and temperature profile. Processor-level redundancy is achieved by adding processors to the MPSoC architecture. Component-level redundancy is achieved by adding appropriate control mechanisms and redundant hardware such as additional arithmetic logic units (ALUs) or cache banks to individual processors [103].

We will illustrate each method of improving system MTTF using an example. Figure 3.1 shows two synthesized solutions for a telecommunication

application based on processor performance data from the Embedded Microprocessor Benchmark Consortium [31]. Each solution contains three embedded processors connected by an on-chip router. The temperature of each on-chip component is indicated by its brightness: brighter components are hotter. The embedded processor used in Solution I, an AMD K6-2E+, is replaced with an IBM PowerPC 405GP-RE in Solution II. The 405GP-RE is a low-power, redundant version of the 405GP; the floating/fixed point units and register files are duplicated. The system MTTFs of Solution I and Solution II are 0.7 years and 1.5 years, respectively; this change roughly doubles MTTF. Further reliability enhancements can be used to increase MTTF to 7 years at small area cost.

In this example, solutions contain processors from different companies. If necessary, the database can be limited to processors from a single company. To simplify the synthesis problem, we ignore the fact that some processors may be better suited than others to a particular task, as long as overall performance meets the deadline requirements.

This example illustrates the potential improvement to system MTTF due to temperature reduction and resource redundancy. First, MPSoC reliability strongly depends on temperature. In Solution II, replacing the K6-2E+ used in Solution I with the 405GP-RE reduces the peak temperature by 5.1 °C, thereby decreasing the run-time fault rate. Second, increasing system redundancy improves fault tolerance. Compared to the K6-2E+, the 405GP-RE can tolerate more run-time faults. This results in an improvement to system MTTF.

Figure 3.2: TASR Flow for the Temperature-Aware Synthesis of Reliable MPSoCs [132]. (Flowchart with two phases. Stochastic optimization of functionality, timing, and area: problem instance, initial construction of solutions, core allocation change, task assignment change, floorplanning, adaptive list scheduling, thermal analysis, and functionality, performance, and area evaluation, iterated until convergence to an area-optimized MPSoC. Reliability/area curve exploration: reliability enhancement through core reinforcement, core addition, and core swapping, followed by thermal analysis, reliability analysis, and functionality, performance, area, and reliability evaluation, iterated until convergence or until the maximum area is reached, yielding an area and reliability optimized MPSoC. Both phases draw on processor core and task performance, power, area, and temperature-dependent reliability models.)

3.2 TASR: Temperature-Aware Synthesis of Reliable MPSoCs

In this section, we describe TASR, the proposed reliable application-specific MPSoC synthesis infrastructure.

TASR Infrastructure

Determining and optimizing MPSoC system MTTF requires substantial infrastructure. Figure 3.2 illustrates the main steps and components in the proposed synthesis flow. Computing system MTTF requires knowledge of component MTTFs and run-time performance constraints. Computing component MTTFs requires knowledge of MPSoC thermal profile and architecture. Computing MPSoC thermal profile during synthesis requires a floorplan, task assignment dependent power modeling, and a thermal analysis algorithm. Finally, determining and optimizing MPSoC architecture requires a system-level synthesis infrastructure that allocates processor

cores, assigns tasks to processors, rapidly generates floorplans, assigns communication events to network links, and schedules operations and communication events.

TASR is composed of algorithms from three domains: system-level synthesis, physical synthesis, and solution analysis. The system-level design contains a single-objective stochastic optimization algorithm that minimizes MPSoC area subject to functionality and performance requirements, and an iterative reliability enhancement algorithm that uses knowledge of redundancy and thermal profile to improve system MTTF at a small cost in MPSoC area. Physical-level synthesis consists of a slicing floorplanning algorithm and an on-chip network synthesis algorithm. In addition, TASR contains a novel statistical lifetime reliability model, and also performance, power, and thermal models to guide MPSoC reliability optimization.

Given

1. Functionality and timing requirements consisting of a directed acyclic graph of periodic graphs of communicating heterogeneous tasks, each of which may have a different deadline;

2. Databases indicating the properties of the available heterogeneous processor cores and on-chip network resources when used with the tasks in the functionality requirements specification, e.g., task execution times and power consumptions on each processor and processor areas; and

3. Temperature-dependent reliability models for the processors and functional units within them,

TASR uses a two-stage optimization flow to determine

1. An allocation of processor cores that are selected based on their performance and reliability characteristics;

2. An assignment of tasks to processor cores that takes task impact on temperature and therefore reliability into account;

3. A schedule of all the tasks and communication events in the system; and

4. A floorplan for the MPSoC.

The solutions are optimized for reliability (maximized MTTF) and area. Each solution is associated with numerous alternative task assignments and schedules to permit continued operation in the event of processor core failure. If a processor fails, the resulting change in task assignment and schedule required to maintain functional correctness and meet timing requirements is pre-planned.

Two-Phase Synthesis Flow

This section explains the two-phase synthesis process used within TASR. The first phase uses a parallel recombinative simulated annealing (PRSA) algorithm, i.e., an advanced form of genetic algorithm, to search for low-area MPSoC architectures that meet functionality and timing requirements without violating area constraints. Previous studies [26] have demonstrated that PRSA allocation and assignment, together with adaptive list scheduling, finds optimal solutions to problems for which optimal solutions are known [88]. For problem instances with previously published results, the PRSA approach rapidly produces solutions of equal or better quality [44, 127]. Adaptive list scheduling makes multiple scheduling attempts with different prioritization metrics in order to meet timing and functionality constraints.
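The retry-with-a-different-priority idea behind adaptive list scheduling can be sketched as follows. This is a minimal illustration, not the TASR scheduler: the function names, the two prioritization metrics (longest task first, earliest deadline first), and the assumption of identical processors are all hypothetical, since the text does not say which metrics are used.

```python
from typing import Callable, Dict, List, Optional, Tuple

Priority = Callable[[str], float]  # smaller value = scheduled earlier


def list_schedule(durations: Dict[str, float], preds: Dict[str, List[str]],
                  num_procs: int, priority: Priority) -> Dict[str, float]:
    """Greedy list scheduling on identical processors (an assumption).

    Repeatedly picks the highest-priority ready task and starts it on the
    processor that becomes free first. Returns the finish time of every task.
    The task graph is assumed to be acyclic.
    """
    finish: Dict[str, float] = {}
    proc_free = [0.0] * num_procs
    remaining = sorted(durations)  # sorted for deterministic tie-breaking
    while remaining:
        ready = [t for t in remaining
                 if all(p in finish for p in preds.get(t, []))]
        task = min(ready, key=priority)
        proc = min(range(num_procs), key=lambda i: proc_free[i])
        start = max([proc_free[proc]] + [finish[p] for p in preds.get(task, [])])
        finish[task] = start + durations[task]
        proc_free[proc] = finish[task]
        remaining.remove(task)
    return finish


def adaptive_list_schedule(durations, preds, deadlines, num_procs,
                           priorities: List[Tuple[str, Priority]]
                           ) -> Optional[Tuple[str, Dict[str, float]]]:
    """Try each prioritization metric in turn; return the first valid schedule."""
    for name, priority in priorities:
        finish = list_schedule(durations, preds, num_procs, priority)
        if all(finish[t] <= deadlines[t] for t in deadlines):
            return name, finish
    return None  # no metric produced a schedule that meets all deadlines


# Hypothetical two-task example on one processor.
durations = {"a": 4.0, "b": 2.0}
deadlines = {"a": 10.0, "b": 2.0}
priorities = [
    ("longest_first", lambda t: -durations[t]),
    ("earliest_deadline", lambda t: deadlines[t]),
]
result = adaptive_list_schedule(durations, {}, deadlines, 1, priorities)
```

In this toy instance the longest-first attempt misses the deadline of task b, so the second attempt with earliest-deadline priority is the one returned.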

The MPSoC lifetime reliability optimization problem can potentially be solved using a PRSA synthesis flow by including system MTTF with the other optimization objectives. However, the addition of reliability optimization to functional, timing, and area optimization greatly increases problem complexity. Moreover, the time cost of determining the reliability impact of a design change is much higher than that of determining the area and performance impact. It becomes necessary to conduct thermal and reliability analysis and to determine multiple task assignments and schedules for each MPSoC in order to support run-time adaptation to processor core failure. Therefore, we propose starting from an area-optimized solution meeting functionality and timing constraints and using a reliability enhancement algorithm to explore the area-reliability tradeoff curve. Lifetime reliability is inversely related to chip temperature. Increasing chip area decreases power density and chip temperature, thereby increasing chip reliability. Structural redundancy, which permits continued processor or MPSoC operation after component failure and generally increases area, can also improve reliability.

Integrated Circuit Failure Mechanisms

In this section, we characterize integrated circuit (IC) failure mechanisms. The lifetime reliability of ICs is primarily affected by the following failure mechanisms: electromigration, thermal cycling, time-dependent dielectric breakdown, and stress migration [103].

Electromigration is the gradual displacement of the atoms in metal wires caused by electrical current. It leads to voids and hillocks that cause open and short circuit

failures. The MTTF due to electromigration is given by the following equation [55]:

\[ \mathrm{MTTF}_{EM} = \frac{A_{EM}}{J^{n}} \, e^{\frac{E_{a_{EM}}}{\kappa T}} \tag{3.1} \]

where $A_{EM}$ is a constant determined by the physical characteristics of the metal interconnect, $J$ is the current density, $E_{a_{EM}}$ is the activation energy of electromigration, $n$ is an empirically-determined constant, $\kappa$ is Boltzmann's constant, and $T$ is the temperature.

Thermal cycling refers to IC fatigue failures caused by thermal mismatch deformation. In the IC chip and package, adjacent material layers such as copper and low-k dielectric have different coefficients of thermal expansion. As a result, run-time thermal variation causes fatigue deformation, leading to failures. The MTTF due to thermal cycling is given by the following equation [55]:

\[ \mathrm{MTTF}_{TC} = \frac{A_{TC}}{(T_{average} - T_{ambient})^{q}} \tag{3.2} \]

where $A_{TC}$ is a constant coefficient, $T_{average}$ is the chip average run-time temperature, $T_{ambient}$ is the ambient temperature, and $q$ is the Coffin-Manson exponent constant.

Time-dependent dielectric breakdown is the deterioration of the gate dielectric layer. This effect depends strongly on temperature, and is becoming increasingly prominent with the reduction of gate-oxide dielectric thickness and non-ideal supply voltage reduction. The MTTF due to time-dependent dielectric breakdown is given by the following equation [55, 103]:

\[ \mathrm{MTTF}_{TDDB} = A_{TDDB} \left( \frac{1}{V} \right)^{(a - bT)} e^{\frac{A + B/T + CT}{\kappa T}} \tag{3.3} \]

where $A_{TDDB}$ is a constant, $V$ is the supply voltage, and $a$, $b$, $A$, $B$, and $C$ are fitting parameters.
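To make the temperature dependence of Equations 3.1 through 3.3 concrete, the sketch below evaluates them directly. It is an illustration only: the function names are hypothetical, every numeric value in the example is a placeholder rather than a fitted parameter from [55] or [103], temperatures are assumed to be in kelvin, and activation energies are assumed to be in electron-volts.

```python
import math

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann's constant in eV/K


def mttf_em(a_em: float, j: float, n: float, ea_em: float, temp_k: float) -> float:
    """Equation 3.1: MTTF due to electromigration."""
    return a_em * j ** (-n) * math.exp(ea_em / (BOLTZMANN_EV_PER_K * temp_k))


def mttf_tc(a_tc: float, t_average_k: float, t_ambient_k: float, q: float) -> float:
    """Equation 3.2: MTTF due to thermal cycling (requires t_average > t_ambient)."""
    return a_tc * (t_average_k - t_ambient_k) ** (-q)


def mttf_tddb(a_tddb: float, v: float, fit_a: float, fit_b: float,
              fit_A: float, fit_B: float, fit_C: float, temp_k: float) -> float:
    """Equation 3.3: MTTF due to time-dependent dielectric breakdown."""
    voltage_term = (1.0 / v) ** (fit_a - fit_b * temp_k)
    thermal_term = math.exp((fit_A + fit_B / temp_k + fit_C * temp_k)
                            / (BOLTZMANN_EV_PER_K * temp_k))
    return a_tddb * voltage_term * thermal_term


# Placeholder example: how much longer does a wire last at 350 K than at 360 K?
# The prefactor and current density cancel in the ratio.
em_ratio = mttf_em(1.0, 1.0, 1.1, 0.9, 350.0) / mttf_em(1.0, 1.0, 1.1, 0.9, 360.0)
```

Because the prefactors cancel, a ratio such as `em_ratio` depends only on the assumed activation energy and the two temperatures, which is why relative comparisons are easier to trust than absolute MTTF values when the constants are uncertain.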

Stress migration is the mass transportation of metal atoms in metal wires due to mechanical stress caused by thermal mismatch among metal and dielectric materials. The MTTF resulting from stress migration is given by the following equation [55]:

\[ \mathrm{MTTF}_{SM} = A_{SM} \, |T_{0} - T|^{-n} \, e^{\frac{E_{a_{SM}}}{\kappa T}} \tag{3.4} \]

where $A_{SM}$ is a constant, $T_{0}$ is the metal deposition temperature during fabrication, $T$ is the run-time temperature of the metal layer, $n$ is an empirically-determined constant, and $E_{a_{SM}}$ is the activation energy for stress migration.

Equations 3.1 to 3.4 indicate that the lifetime reliability of ICs is strongly influenced by temperature. Therefore, thermal analysis and optimization techniques play important roles in reliability optimization. Generally, the MTTF values resulting from the different mechanisms range from 20 to 30 years.

MPSoC Reliability Modeling

The system MTTF of an MPSoC is a function of the lifetime reliabilities of all its processing elements (PEs). In this work, we propose a system-level lifetime reliability model for MPSoCs. Our first step is to derive an efficient modeling method that can accurately predict the lifetime reliability of each MPSoC PE.

Reliability Modeling of On-Chip PEs

The lifetime reliability of an on-chip PE is influenced by numerous design-time and run-time factors, such as architecture-level and circuit-level redundancy, accumulation of wear, and run-time temperature. Accurate lifetime characterization of each PE is challenging.

We propose a PE reliability model that is capable of incorporating the effects of multiple fault mechanisms, component-level resource redundancy, and temperature. The dependence of lifetime failure processes on other parameters, such as current density, is not directly considered. Constant values of these parameters resulting in PE MTTFs of 30 years at 50 °C and 1.8 V are used [103]. For the sake of explanation, our description of PE reliability modeling starts from the simplest case, i.e., a single failure mechanism, single point of failure (no resource redundancy), and constant temperature. These assumptions are later relaxed, and the reliability model generalized.

Lognormal Distribution Reliability Model for Single PE, Single Point of Failure

Statistical modeling is commonly used in IC reliability characterization. Researchers have proposed using various statistical models, e.g., exponential, Weibull, and lognormal, to characterize IC lifetime failures. Compared to other commonly-considered statistical models, the lognormal distribution more accurately models the time-dependent degradation processes of ICs, e.g., diffusion, corrosion, migration, and crack propagation [103], caused by the failure mechanisms described above. However, using the lognormal distribution complicates the derivation of analytical solutions. Numerical methods, such as Monte-Carlo simulation or statistical fitting techniques, are required. These methods are computationally intensive.

Starting from the simplest assumption, for a failure mechanism i, the run-time

fault probability density function (PDF), $f_i(t)$, and the corresponding fault cumulative distribution function (CDF), $F_i(t)$, have two parameters: $\sigma^{i}_{PE}$ (a shape parameter) and $\mu^{i}_{PE}$ (a scale parameter). The MTTF of an on-chip PE due to a particular failure mechanism $i$, $\mathrm{MTTF}^{i}_{PE}$, is then estimated:

\[ \mathrm{MTTF}^{i}_{PE} = \int_{0}^{\infty} t f_i(t)\,dt = \int_{0}^{\infty} t\,dF_i(t) = e^{\mu^{i}_{PE} + (\sigma^{i}_{PE})^{2}/2} \tag{3.5} \]

The overall lifetime reliability of each on-chip PE, $\mathrm{MTTF}_{PE}$, is modeled by a joint lognormal distribution that depends on the major failure mechanisms described above. We assume that the relationships among different failure mechanisms are serial, i.e., each individual failure mechanism can result in the failure of a non-redundant PE. Therefore, for each non-redundant PE, the CDF of its overall lifetime failure probability follows:

\[ F_{PE}(t) = 1 - \prod_{i} \bigl(1 - F_i(t)\bigr) \tag{3.6} \]

where $i$ is the index of different failure mechanisms.

Researchers have often used exponential distributions for statistical modeling due to their convenience. Given $F_i(t)$ with exponential distributions, Equation 3.6 would yield an easily-computed analytical solution. However, as a consequence of using the more accurate lognormal distribution for each $F_i(t)$, Equation 3.6 does not allow straightforward estimation of PE MTTF, $\mathrm{MTTF}_{PE}$. In this work, we use statistical fitting to approximate $\mathrm{MTTF}_{PE}$ using a single lognormal distribution, governed by $\mu_{PE}$ and $\sigma_{PE}$. The parameters for this approximation follow:

\[ \mu_{PE} = \frac{1}{2} \log \left( \frac{\left( \int_{0}^{\infty} t\,dF_{PE}(t) \right)^{4}}{\int_{0}^{\infty} t^{2}\,dF_{PE}(t)} \right) \tag{3.7} \]

\[ \sigma_{PE} = \sqrt{ \log \left( \frac{\int_{0}^{\infty} t^{2}\,dF_{PE}(t)}{\left( \int_{0}^{\infty} t\,dF_{PE}(t) \right)^{2}} \right) } \tag{3.8} \]
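The fitting step in Equations 3.5 through 3.8 can be sketched numerically. The code below is a minimal illustration under stated assumptions, not the fitting procedure used in TASR: it computes the first two moments of $F_{PE}$ by trapezoidal integration of the survival function (using $E[t] = \int S\,dt$ and $E[t^2] = \int 2tS\,dt$), and the shape parameter of 0.5 for each mechanism is a hypothetical placeholder. Function names are likewise illustrative.

```python
import math
from typing import List, Tuple


def lognormal_cdf(t: float, mu: float, sigma: float) -> float:
    """CDF of a lognormal distribution with parameters mu and sigma."""
    if t <= 0.0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(t) - mu) / (sigma * math.sqrt(2.0))))


def lognormal_mttf(mu: float, sigma: float) -> float:
    """Equation 3.5: mean of a lognormal distribution."""
    return math.exp(mu + sigma ** 2 / 2.0)


def series_cdf(t: float, mechanisms: List[Tuple[float, float]]) -> float:
    """Equation 3.6: any single mechanism can fail a non-redundant PE."""
    survival = 1.0
    for mu, sigma in mechanisms:
        survival *= 1.0 - lognormal_cdf(t, mu, sigma)
    return 1.0 - survival


def raw_moments(mechanisms: List[Tuple[float, float]],
                t_max: float, steps: int) -> Tuple[float, float]:
    """First and second moments of F_PE via the trapezoid rule on [0, t_max]."""
    dt = t_max / steps
    m1 = m2 = 0.0
    prev_t, prev_s = 0.0, 1.0  # survival probability is 1 at t = 0
    for k in range(1, steps + 1):
        t = k * dt
        s = 1.0 - series_cdf(t, mechanisms)
        m1 += 0.5 * (prev_s + s) * dt
        m2 += 0.5 * (2.0 * prev_t * prev_s + 2.0 * t * s) * dt
        prev_t, prev_s = t, s
    return m1, m2


def fit_lognormal(m1: float, m2: float) -> Tuple[float, float]:
    """Equations 3.7 and 3.8: match a single lognormal to two moments."""
    mu = 0.5 * math.log(m1 ** 4 / m2)
    sigma = math.sqrt(math.log(m2 / m1 ** 2))
    return mu, sigma


# Two hypothetical mechanisms, each with a 30-year MTTF and sigma = 0.5.
sigma_i = 0.5
mu_i = math.log(30.0) - sigma_i ** 2 / 2.0
mechanisms = [(mu_i, sigma_i), (mu_i, sigma_i)]
m1, m2 = raw_moments(mechanisms, t_max=1000.0, steps=20000)
mu_pe, sigma_pe = fit_lognormal(m1, m2)
mttf_pe = lognormal_mttf(mu_pe, sigma_pe)  # equals m1 by construction
```

With two serial mechanisms, the fitted PE MTTF falls below the 30-year MTTF of either mechanism alone, which is the expected behavior of a series system.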

Reliability Models for Inactive Spare and Active Spare Redundant PEs

PEs may have component redundancy to improve reliability or performance. Such PEs can be designed to continue functioning even after some of their components, e.g., an ALU or a cache bank, fail.

Inactive spares are redundant resources that are not activated until a fault occurs in an active resource. The impact of faults in inactive spares upon the lifetime reliabilities of PEs can be characterized as follows. Assume a PE contains $M$ types of resources. Each type of resource $S_i$, $i \in \{1, \ldots, M\}$, comprises $N_i$ identical elements. Assume the cumulative failure probability of resource element $E_{i,j}$, $i \in \{1, \ldots, M\}$, $j \in \{1, \ldots, N_i\}$, is $F_{i,j}(t)$. Then the cumulative failure probability of resource $S_i$ is $F_{S_i}(t) = \prod_{j} F_{i,j}(t)$. The MIN-MAX approximation [103] may be used to bound the MTTF of a PE with $M$ types of resources as follows:

\[ \mathrm{MTTF}_{PE} = \min_{i=1}^{M} \left( \int_{0}^{\infty} t\,dF_{S_i}(t) \right) \tag{3.9} \]

Active spares are redundant resources that are actively used even before any faults have occurred. Faults in active spares reduce the performance of the affected PE. Determining the reliability impact of faults that result in changes to observable PE behavior involves system-level design decisions, and will be described in detail later in this chapter.

Temperature-Dependent Reliability Model for Potentially Redundant PEs

The lifetime reliability of a PE strongly depends on its temperature. After each MPSoC solution is derived, performance and power analysis are conducted. The

estimated power profile, MPSoC floorplan, and cooling configuration are provided to a thermal analysis algorithm [123] to determine the thermal profile. Note that Equation 3.9 is derived under an assumption of constant PE temperature. Next, we discuss temperature-dependent PE MTTF estimation.

The temperature profile of an MPSoC varies as the tasks assigned to it change. Task assignments change whenever migration is used to compensate for a partial or complete PE failure. The impact of temperature variation on MTTF calculation is illustrated in Figure 3.3. In this example, $T_1$ and $T_2$ are temperatures. The PE is initially hot ($T_1$) and, at time $t_1$, becomes cooler ($T_2$). Functions $f_1(t)$ and $f_2(t)$ are the fault PDFs given temperatures $T_1$ and $T_2$, respectively. The overall fault distribution of the PE should satisfy the following equation, i.e., the overall cumulative fault distribution equals one.

\[ \int_{0}^{t_1} f_1(t)\,dt + \int_{t_2}^{\infty} f_2(t)\,dt = 1 \tag{3.10} \]

Figure 3.3: Temperature Impact on MTTF [38]. (Plot of fault probability density versus time in years at temperatures $T_1$ and $T_2$, with $t_1$ and $t_2$ marked on the time axis.)

When we switch from the fault PDF associated with one temperature, e.g., $T_1$, to that associated with another temperature, e.g., $T_2$, it is necessary to adjust our start time

to the value, in the new time scale, associated with the appropriate amount of wear that had been experienced in the previous time scale, i.e., we must start integrating from the effective age of the PE. For this example the concept can be summarized as follows: $F_1(t_1) = F_2(t_2)$. Given that $\{T_0, T_1, \ldots, T_{N-1}\}$ denote the PE thermal profile, the overall fault distribution should satisfy the following equation:

\[ \int_{t_{s_0}=0}^{t_{e_0}} f_0(t)\,dt + \int_{t_{s_1}}^{t_{e_1}} f_1(t)\,dt + \cdots + \int_{t_{s_{N-1}}}^{\infty} f_{N-1}(t)\,dt = 1 \tag{3.11} \]

where $f_i(t)$ denotes the fault PDF of the PE at temperature $T_i$, $t_{e_i}$ denotes the transition time at which the temperature changes from $T_{i-1}$ to $T_i$, and $t_{s_i}$ denotes the equivalent age of the PE, starting from $t_{e_{i-1}}$, when the temperature switches to $T_i$. The value of $t_{s_i}$ can be determined using Equation 3.11, allowing the MTTF of a PE to be determined using the following equation:

\[ \mathrm{MTTF} = \sum_{i=0}^{N-1} \int_{t_{s_i}}^{t_{e_i}} t f_i(t)\,dt \tag{3.12} \]

This has the effect of breaking time into regions ($\sum_{i=0}^{N-1}$) during which the temperature of the PE is uniform and, during each region, weighting each time instant by the probability of failure at that instant ($t f_i(t)$). Values for $t_{s_i}$ and $t_{e_i}$ are computed as described above.

Reliability analysis may be conducted numerous times during reliability optimization. Therefore, modeling efficiency is critical. An MPSoC consists of numerous PEs. If the cumulative fault probability distributions, $F_i(t)$, are lognormal, then solving Equation 3.9 requires computationally-intensive numerical analysis. To improve computational efficiency, we produce a PE reliability library before reliability optimization by pre-characterizing the reliability distributions of PEs as functions

of temperature and supply voltage. During MPSoC reliability optimization, when solving Equation 3.12, the value of $F_i(t)$ is efficiently obtained using table look-ups.

Reliability Optimization of MPSoCs

Figure 3.2 illustrates the proposed reliability analysis and optimization flow. In TASR, reliability optimization starts by evaluating the system MTTF of area-optimized solutions (using Algorithm 1). Such solutions tend to have high power density, high temperature, low resource redundancy and, therefore, low system MTTF. An iterative reliability enhancement algorithm is invoked if these solutions do not provide the required system MTTF. During each iteration, Algorithm 2 optimizes MTTF by improving processor core and component redundancy and/or optimizing chip thermal profile by introducing new processors. System-level (task assignment and scheduling) and physical-level (floorplanning and network synthesis) algorithms are then invoked to produce valid MPSoC solutions. Through performance, power, thermal, and reliability analyses, the system MTTFs of new solutions are estimated and evaluated. The iterative optimization flow continues until the targeted system MTTF is achieved.

Algorithm 1 estimates system MTTF based on statistical models of MPSoC run-time failure processes. Starting from time t = 0, it determines the minimal MTTF among all the processor cores (line 4). Each fault may result in partial or complete processor core failure. In either case, task migration is used to optimize system performance. The task migration routine moves tasks from the faulty or partially-faulty processor to other processors (line 6). After task migration, if the MPSoC still meets its performance requirements, the algorithm considers the next processor core with minimal MTTF. Task migration results in run-time changes in chip power

consumption and temperature profiles, thereby changing the lifetime reliability of each processor core. To accurately predict subsequent processor MTTFs, power and thermal analysis are conducted (line 8). This process continues until the MPSoC fails to meet its performance or functionality requirements. The system MTTF of the MPSoC solution is then reported (line 10).

Algorithm 1 System MTTF Analysis of an MPSoC Solution
 1: Given an MPSoC solution, set MTTF_MPSoC ← 0
 2: while system schedule is valid do
 3:   MPSoC_Func are the functioning processors in the MPSoC
 4:   Fault interval e_i ← min over p ∈ MPSoC_Func of (MTTF_p)
 5:   MTTF_MPSoC ← MTTF_MPSoC + e_i
 6:   Task migration, scheduling
 7:   if system schedule is valid then
 8:     Power analysis, thermal analysis, compute processor temperatures
 9:   else
10:     Return MTTF_MPSoC
11:   end if
12: end while

At run-time, on-line fault detection algorithms should determine when an execution unit has failed. A proper treatment of on-line fault detection is beyond the scope of this dissertation but can be found in the literature [77]. Upon fault detection, the pre-planned task assignment changes associated with the particular fault are made. If it is acceptable to reboot the system in the presence of a fault (a few times in the system lifespan), no further provisions are necessary. If uninterrupted operation is necessary, distributed system checkpointing may be used.

TASR is equipped with an efficient workload migration algorithm to maintain system functionality and meet performance requirements in the presence of partial and complete processor failures. When an MPSoC fails to meet its performance requirements due to run-time faults, tasks migrate to other processors using the following policy. Tasks on faulty processors are first sorted in order of increasing time slack, the

difference between the task's latest finish time and earliest finish time. They are then migrated from the processor to other processors in this order until the system performance requirements are met and no tasks are assigned to a totally failed processor. When moving a task from one processor to another, the new processor is selected by Pareto-ranking processors in order of increasing utilization ratio (the proportion of time during which the processor is actively executing tasks) and increasing execution time for the task and processor under consideration. Depending on whether a processor is inoperable or partially failed, all or some of the tasks assigned to it migrate to other processors.

TASR optimizes the lifetime reliability of MPSoCs by focusing on architectural changes that improve redundancy and thermal profile, while maintaining low area overhead. Algorithm 2 shows the actions taken by TASR to improve the MTTF of an MPSoC architecture. First, the MTTF of each individual processor is estimated (line 2). The processor with the minimal MTTF is identified as the MPSoC's most vulnerable point, P_vul (line 3). One of the proposed reliability optimization moves is then applied: processor reinforcement, processor swapping, or processor addition (line 4). Processor reinforcement introduces component redundancy (see Section 3.1.2) into the most vulnerable processor. Processor swapping replaces the most vulnerable processor with a different, more reliable, processor. Processor addition introduces a new processor into the MPSoC, enabling tasks to migrate from the vulnerable processor to other processors. These moves consider multiple candidate processors.

TASR uses the relative reliability gain, defined in Equation 3.13, to select the best candidate move. This equation takes power density reduction, resource redundancy

improvement, and area overhead associated with the move into consideration.

\[ G_{TASR} = \frac{e^{P_d} \times \mathrm{MTTF}_{ref}}{A} \tag{3.13} \]

Note that this value is used only to guide changes. The detailed effect of each tentative change is computed using thermal profile and reliability analysis.

MPSoC power profile influences MPSoC temperature profile, which strongly influences reliability. The MTTFs associated with some major fault mechanisms are exponential functions of temperature. Therefore, in Equation 3.13, TASR uses an exponential term, $e^{P_d}$, to characterize the impact of power density reduction on reliability improvement. $P_d$ is the power density reduction resulting from applying a candidate move. In Equation 3.13, the impact of redundancy is characterized by the second term, $\mathrm{MTTF}_{ref}$, the system MTTF improvement resulting from the candidate move. $\mathrm{MTTF}_{ref}$ is calculated under the assumption that other design characteristics, e.g., temperature profile and supply voltage, remain the same. The relative reliability gain introduced by each candidate move is the product of these two terms divided by the area overhead, $A$. The move with the highest gain is applied (line 5).

After each optimization move, system-level and physical-level synthesis algorithms are invoked to update the MPSoC solution. Cost analysis is then conducted to determine the improvement in system reliability, determine the impact on MPSoC area, and validate the system schedule. This optimization process continues until the target system MTTF is achieved.

Two alternative move-selection heuristics were implemented for the sake of comparison. The first considers only power density, $e^{P_d}$, and the second considers only resource redundancy, $\mathrm{MTTF}_{ref}$. Performance comparisons among these three heuristics are provided in Section 3.3.
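Move selection by relative reliability gain (Equation 3.13) can be sketched as below. The class and function names are hypothetical, the candidate values are placeholders in arbitrary units, and the area overhead is assumed to be strictly positive; the text does not specify units or normalization for the three terms.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class CandidateMove:
    name: str
    power_density_reduction: float  # P_d in Equation 3.13
    mttf_improvement: float         # MTTF_ref in Equation 3.13
    area_overhead: float            # A in Equation 3.13, assumed > 0


def relative_gain(move: CandidateMove) -> float:
    """Equation 3.13: exp(P_d) * MTTF_ref / A."""
    return (math.exp(move.power_density_reduction)
            * move.mttf_improvement / move.area_overhead)


def best_move(moves: List[CandidateMove]) -> CandidateMove:
    """Pick the candidate with the highest relative reliability gain."""
    return max(moves, key=relative_gain)


# Placeholder candidates for the three move types named in the text.
candidates = [
    CandidateMove("reinforcement", 0.0, 2.0, 4.0),
    CandidateMove("swapping", 0.5, 1.5, 2.0),
    CandidateMove("addition", 1.0, 1.0, 5.0),
]
chosen = best_move(candidates)
```

As the text notes, the gain only ranks candidates; the actual effect of the chosen move would still need to be confirmed by thermal and reliability analysis.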

Algorithm 2 Reliability-Aware Optimization Algorithm
 1: while MTTF_MPSoC < MTTF_target do
 2:   for all pe ∈ MPSoC, compute MTTF_pe
 3:   Find vulnerable point: P_vul is the processor with minimal MTTF
 4:   Optimization moves (processor reinforcement, processor swapping, processor addition)
 5:   Apply the best move based on Equation 3.13
 6:   System-level synthesis: task assignment and scheduling
 7:   Physical-level synthesis: floorplanning and network synthesis
 8:   Performance, power, thermal, reliability analysis
 9:   if system MTTF does not improve or system schedule is invalid then
10:     Revert this change
11:   end if
12: end while

Floorplanning, Thermal Analysis, and Network Synthesis

We use a fast, constructive, area- and communication-aware floorplanning block placement algorithm based on network partitioning and optimal processor orientation and rotation selection to determine MPSoC power profile as well as communication latency and communication power consumption [26]. A fine-grained MPSoC thermal model is used within a thermal analysis algorithm designed for accuracy and high enough speed for use within the inner loop of synthesis [123]. Finally, we carry out on-chip network synthesis, using network topology to explicitly model communication contention.

3.3 Experimental Results

This section describes the benchmarks used to evaluate TASR and presents the results of evaluation.

Benchmarks

The proposed reliable MPSoC synthesis algorithm was evaluated using a number of benchmarks taken from the E3S embedded systems benchmark suite, which is based on EEMBC benchmark data [31]. This suite contains 17 PEs, e.g., the AMD ElanSC520, the Analog Devices 21065L, the Motorola MPC555, and the Texas Instruments TMS320C6203. These processors are characterized based on the measured execution times of 47 tasks commonly encountered in embedded applications, power numbers derived from datasheets, and additional information, e.g., processor areas, some of which were necessarily estimated, and prices gathered by emailing and calling vendors. Results for any processor whose datasheet reflected a coarser technology were linearly scaled to a 0.18 µm technology. The task sets follow the organization of the EEMBC benchmarks. There is one task set for each of the five application suites: Automotive/Industrial, Consumer, Networking, Office Automation, and Telecommunications. The Office Automation problem contains only five tasks. Our modified version of Office Automation contains four copies of the original task set. In addition, TGFF [27] was used to generate five random benchmarks. The graphs have different structures, ranging from random connectivity to a series-parallel structure commonly encountered in DSP applications. For the random benchmarks, tasks were randomly assigned task types from the EEMBC benchmarks.

The EEMBC processors do not have component redundancy, i.e., each processor will fail if any of its functional units fails. We introduce a redundant version of each processor by duplicating floating/fixed point units and floating/integer register files. We assume that instruction scheduling units and instruction decode units do

not have redundancy [103]; a run-time fault in these units will result in processor failure. On-chip caches have redundancy; a single fault reduces performance but the processor remains operational. We relied on previous work to estimate the cost of component redundancy [103]. Processors with component redundancy suffer a 24% area penalty and, while their additional functional units are still operational, have 25% higher performance and power consumption.

The embedded microprocessors in EEMBC have fairly homogeneous energy-delay products. It is our goal to develop a synthesis algorithm that is effective at improving the reliability of application-specific MPSoCs, which commonly contain heterogeneous processors. Therefore, for each processor, we introduced one corresponding processor operating at a higher voltage and another operating at a lower voltage. A maximum of three voltages need to be provided by off-chip regulators. The alpha power law was used to calculate the impact of voltage scaling on performance. A 0.18 µm process, supply voltage of 1.8 V, and alpha of 1.3 were used [93]. To model high-performance processors, the supply voltage was scaled to 2.5 V, performance increased by 25%, and power consumption increased to 2.4× the nominal value, in line with the $V^2 f$ dependence of dynamic power: $(2.5/1.8)^2 \times 1.25 \approx 2.4$. To model low-power processors, the supply voltage was scaled to 1.28 V, performance was decreased by 25%, and power consumption was decreased accordingly.

TASR vs. Stochastic Area Optimization

As described in Section 3.2.1, TASR consists of a two-stage optimization flow. It first uses a stochastic optimization algorithm to minimize MPSoC area under performance constraints. The area-optimized solution is used as a starting point for the proposed reliability enhancements. The TASR lines in Figure 3.5 illustrate the

solutions produced by the MTTF optimization technique when run on all the benchmarks. The initial area-optimized solutions appear at the left-most points of the lines. TASR applied the optimization moves described earlier until several subsequent moves did not significantly improve system MTTF. Table 3.1 shows the average system MTTF improvement over initial area-optimized solutions under different area overhead constraints for all ten benchmarks.

[Figure 3.4: Comparison of MPSoC Area-Reliability Tradeoffs [38]. Area (mm²) versus MTTF (years) for the auto, consumer, networking, office4x, telecom, and random1 through random5 benchmarks.]

These results illustrate three key points about the reliable application-specific MPSoC synthesis problem.

1. The area cost to improve reliability is initially small. In Figure 3.4, area is shown on a logarithmic scale. As shown in Table 3.1, improving the average system MTTF over all benchmarks by 40%, 85%, and 180% results in maximum

[Figure 3.5: Comparison of Different Optimization Heuristics [132]. One panel per benchmark (auto, consumer, networking, office4x, telecom, and random1 through random5), each plotting Area (mm²) against MTTF (years) for TASR, CR-only, PD-only, and 1PHASE.]

[Table 3.1: System MTTF Improvement Under Area Bound [132]. Columns: area bound (%) and MTTF improvement (%). The MTTF improvement under each area bound is computed by selecting, for each benchmark, the highest-MTTF solution that honors the area bound, and averaging the MTTF improvements of the selected solutions.]

area overheads of 0.0%, 5.0%, and 10.0%. MTTF is not directly considered in the first optimization phase. As a result, TASR can sometimes improve MTTF without area overhead because two solutions with the same area can have different MTTFs. Initial solutions are optimized for area and tend to have high power densities, high temperatures, and low resource redundancy: the fault rates are high and single faults may cause failure. Therefore, the system reliability can be improved at low area cost. TASR introduces processor cores with lower power densities and/or replaces non-redundant cores with redundant ones, thereby optimizing thermal properties and allowing the system to continue operating despite run-time hardware faults.

2. As shown in Table 3.1, TASR automatically trades off system reliability for area, allowing system designers to choose a desirable solution based on problem-specific design constraints.

3. As system MTTF increases, the area penalty associated with further improving system reliability increases. As shown in Table 3.1, TASR achieves 436% average system MTTF improvement with a maximum area overhead of 25%. Further

improvements to system MTTF become prohibitively expensive. Processor core failure cumulative distribution functions are non-decreasing. For a large enough duration, there is a low probability that any processor will operate without a fault. As a result, at very large MTTFs, adding processors or reinforcing a subset of existing processors with redundant components has little impact on MTTF.

Evaluation of Optimization Moves

TASR optimizes system reliability by controlling processor temperatures and improving system redundancy. To evaluate the effectiveness of the proposed optimization moves, we compare TASR with two alternative moves described in Section 3.2.5: power density only (PD-only) and component redundancy only (CR-only) moves. PD-only minimizes power density. CR-only increases resource redundancy.

Figure 3.5 shows the results produced by TASR, CR-only, and PD-only optimization moves. TASR almost always produces architectures with both superior area and system MTTF. In some cases, PD-only or CR-only also does well. PD-only does not consider component redundancy. However, introducing redundant processors in order to improve power density still improves system MTTF. CR-only does not consider processor power density. However, redundant processors tend to have lower power densities than non-redundant processors; although their instantaneous spatial power densities are similar to those of non-redundant processors, they have higher performance, permitting lower temporal power densities. In general, it is necessary to use both structural redundancy and power density to produce high-quality solutions.

Evaluation of Optimization Flow

As explained in Section 3.2.2, it appears that a two-phase optimization flow is superior to a one-phase optimization flow. In the two-phase flow, a stochastic optimization algorithm is first used to find a promising, low-area region of the solution space, and an iterative reliability enhancement algorithm is then used to trade off area for reliability. To determine whether this argument has merit, we compared TASR with a one-phase stochastic optimization algorithm in which functionality, timing, area, and reliability are concurrently optimized. This algorithm, which we call 1PHASE, has the ability to apply all the allocation, assignment, floorplanning, and scheduling changes available to TASR. It optimizes MTTF within its multi-objective cost function. We found that TASR can almost always produce solutions of equal or better quality than 1PHASE. In addition, TASR generally requires less CPU time than 1PHASE, which takes an average of 2,394 s per benchmark.

3.4 Conclusions and Future Work

This chapter has described a synthesis algorithm for reliable application-specific MPSoCs. The dominant failure processes today, and in the near future, have rates exponentially dependent on temperature. Therefore, the impact of tentative design changes on the detailed temperature profile should be considered during the synthesis process. This, in turn, requires power profiles, which depend on floorplanning and power models. Even the fastest detailed thermal analysis and floorplanning algorithms cannot be included within the inner loop of synthesis without greatly reducing the solution space explored in a given amount of time. Therefore, we have proposed a two-stage

59 CHAPTER 3. RELIABLE MPSOC SYNTHESIS 43 synthesis process in which a potentially-slow but high-quality stochastic optimization algorithm is first used to minimize solution area. Starting from this promising location in the solution space, a reliability enhancement heuristic explores the area MTTF tradeoff curve. Our results indicate that this synthesis approach greatly outperforms simply adding MTTF into a stochastic optimization algorithm as another objective. The proposed synthesis flow increases MPSoC system mean time to failure by an average of 85% with less than 5% area cost and by an average of 436% with less than 25% area cost, compared to area-optimized solutions. As long as power densities remain high and the dominant lifetime failure processes remain strongly dependent on temperature, our results indicate that thermal and structural redundancy optimization during synthesis have the potential to increase MPSoC lifetime with low area cost.

60 Chapter 4 Three-Dimensional Chip-Multiprocessor Run-Time Thermal Management Three-dimensional (3D) integration has the potential to improve the communication latency and integration density of chip-level multiprocessors (CMPs). However, the stacked high power density layers of 3D CMPs increase the importance and difficulty of thermal management. In this chapter, we investigate the 3D CMP run-time thermal management problem and describe efficient management techniques. This chapter makes the following main contributions: (1) it identifies and describes the critical concepts required for optimal thermal management, namely the methods by which heterogeneity in both workload power characteristics and processor core thermal characteristics should be exploited and (2) it proposes an efficient, proactive, continuously-engaged hardware and operating system thermal management technique 44

61 CHAPTER 4. 3D CMP THERMAL MANAGEMENT 45

governed by optimal thermal management policies. The proposed technique is evaluated using multiprogrammed and multithreaded benchmarks in an integrated power, performance, and temperature full-system simulation environment. We find that proactive power-thermal budgeting allows a 30% improvement in instruction throughput compared to a proactive thermal management approach that bases decisions only upon local information. The software components of the proposed thermal management technique have been implemented in the Linux kernel. The analysis and technique developed in this chapter provide a general solution for future 3D and 2D CMPs.

My major contribution to this chapter is the characterization of heat flow in 3D CMPs, the derivation of optimal workload assignment and power-thermal budgeting, and the thermal management implementation in the Linux kernel (Sections 4.3 and 4.4). My collaborator, Zhenyu Gu, contributed to the design of the 3D CMP architecture and technology, the framework buildup of the full simulation system, and benchmark suite characteristics and generation (Section 4.5).

4.1 Introduction

Continued increases in integration density, and achieving higher application performance without corresponding increases in processor frequency, are now primary goals for microprocessor designers. As a result, microprocessor design is rapidly moving towards highly-scalable chip-multiprocessor (CMP) architectures. Today's mainstream microprocessors are multi-core [56, 60, 7, 50, 107, 96]. The trend for future CMPs is to increase the number of on-chip cores: 80-core prototypes have recently been demonstrated by Intel [115].

Performance scalability is a major challenge in CMP design. Using the mainstream

two-dimensional (2D) planar CMOS fabrication process, on-chip interconnect shows poor scalability in both performance and power consumption [5]. Three-dimensional (3D) integration has the potential to overcome the limitations of 2D technology [109, 12, 95, 108]. By stacking multiple device layers connected through inter-die vias, 3D integration increases logic integration density significantly and reduces on-chip wire length, especially for global and semi-global wires. This has motivated computer architects to evaluate 3D technology for CMP architecture design [12, 65, 57, 58]. However, none of this work describes a thermal management solution appropriate for 3D CMPs.

Thermal issues are a large and growing concern for CMPs [68, 28, 14, 99]. Increasing chip power consumption and temperature affect circuit reliability (via negative bias temperature instability, electromigration, time-dependent dielectric breakdown, thermal cycling, etc.), power and energy consumption (via increased leakage power), and system cost (via increased cooling and packaging cost). The use of 3D integration magnifies power dissipation problems [12, 89, 90, 71]. Chip cross-sectional power density increases linearly with the number of vertically-stacked active circuit layers. In addition, the interconnect and bonding layers used in 3D integration have low thermal conductivities, which further exacerbates thermal effects. Temperature-related concerns that can sometimes be safely ignored in 2D CMPs, such as temperature-induced performance or reliability degradation, become increasingly prominent in 3D CMPs. 3D integration holds promise, but without solutions to the thermal problems it brings, 3D CMPs will be impractical.

Run-time thermal management techniques, such as dynamic voltage and frequency scaling, clock throttling, execution unit toggling, and workload migration, have been

proposed for 2D high-performance microprocessors [14, 99, 87, 54, 68, 28]. Using these techniques, cooling solutions and packages need not be designed for worst-case power consumption scenarios. Cooling cost can thereby be significantly reduced. Past work, however, cannot effectively optimize the performance-temperature tradeoff in 3D CMPs for the following reasons.

First, the thermal management techniques deployed in current microprocessors and operating systems are primarily used to handle rare, worst-case processor power consumption events and eliminate thermal emergencies. Although they can potentially introduce significant performance overhead, they are rarely invoked. In contrast, the higher power densities of future 3D (and some 2D) CMPs will frequently require operation at or near thermal limits. Already, processors contain reactive techniques to permit the use of reduced-cost packaging and cooling configurations that are not capable of handling maximum power dissipation. Today's laptops frequently invoke thermal management mechanisms that drastically reduce performance, even under normal operating conditions [74]. Power should be viewed as a limited resource and processor cores should spend carefully-budgeted amounts. Thermal management should be used to proactively and continuously optimize CMP performance and temperature, instead of merely reacting to emergencies.

Second, 3D CMPs have heterogeneous power and thermal characteristics. On-chip processor cores have different cooling efficiencies. For instance, cores in the layers closer to the heatsink have higher cooling efficiencies than those farther from the heatsink. Processor cores farther from the heatsink will have higher temperatures than their neighbors nearer the heatsink, even when their power consumptions are lower. Inter-core thermal correlation is heterogeneous. The thermal correlation

between vertically-aligned processor cores is stronger than that between processor cores within the same layer. The power and thermal heterogeneity of 3D CMPs poses unique challenges for run-time thermal management. Achieving optimal 3D CMP performance under a temperature constraint requires careful system-wide control of each processor core's performance and power consumption. Local control, alone, is insufficient.

[Figure 4.1: (a) Comparison of Face-to-Face (Left) and Face-to-Back (Right) Configurations for Two Stacked Dies, (b) 3D Three Stacked Die Floorplan Used in This Work, and (c) 3D CMP Chip-package Thermal Modeling [134]. Labeled elements: I/O and power backside vias, Die 1, Die 2, bulk Si, heat sink, metal layers, device layer, die-to-die vias, L2 cache, and cores.]

In this chapter, we develop the analytical framework necessary to determine the thermal impact of every core in a 3D CMP upon every other core. This framework yields guidelines for near-optimal thermal management. The guidelines are embodied in a proactive global power-thermal budgeting algorithm, performance counter-based workload monitor, and distributed thermal control techniques, which we have implemented in the Linux kernel. The resulting 3D CMP thermal management solution, which we call ThermOS, is evaluated using detailed full-system simulation with M5 [11]. We have integrated power modeling and thermal analysis tools within the simulator, allowing unified architectural/power/thermal simulation of arbitrary single-threaded and multi-threaded applications and the Linux operating system (OS). Our results for a wide range of multiprogrammed and multithreaded

applications indicate that, given a peak temperature constraint, ThermOS improves CMP throughput by an average of 29.84% when compared to state-of-the-art proactive distributed thermal management. This improvement is primarily due to the power-thermal budgeting guidelines used by ThermOS.

4.2 Contribution

Our work is most closely related to Donald and Martonosi's research on CMP thermal management using distributed control-theoretic core management and a global controller that guides migration [28]. Both their thermal management technique and ThermOS are continuously-engaged thermal management techniques. However, existing proactive thermal management techniques are not appropriate for CMPs with heterogeneous thermal environments, such as 3D CMPs. Global guidance and power-thermal budgeting are particularly beneficial for 3D CMPs. By matching core cooling characteristics, application features, and voltage levels, we can improve performance by limiting throttling and migration. We are the first to examine the impact of thermal heterogeneity on thermal management of 3D architectures. We evaluate our proposed policies in a full-system simulator. This experimental setup accounts for the overhead of dynamic thermal management (DTM) in the OS, including migration costs and context switches.

4.3 Heat Flow in 3D CMPs

This section uses examples to explain the special thermal characteristics of 3D CMPs and develop a mathematical model that will be used to derive the thermal management policies described in Section 4.4 and validated in Section 5.4.

[Figure 4.2: Inter-layer and Intra-layer Thermal Heterogeneity and Dominance in 3D CMPs [134]. Labeled elements: Cores I, J, and K; heat capacities C; power sources $P_I$, $P_J$, $P_K$; thermal resistances $1/g_{inter}$, $1/g_{intra}$, and $1/g_{hs}$; ambient temperature $T_{amb}$.]

Introduction to Thermal Modeling

Heat conduction within the CMP chip and package can be modeled using Fourier heat flow analysis, which has been the standard method used by industry and academia for circuit-level and architecture-level IC chip-package thermal analysis during the past few decades [20, 8, 99, 125]. This method is analogous to Georg Simon Ohm's method¹ of modeling electrical current. Using Fourier heat flow analysis, heat flow is analogous to electrical current and temperature is analogous to voltage. The CMP is virtually partitioned into numerous discrete blocks, as shown in Figure 4.2. The thermal conductance of each block is a linear function of the conductivity of its material and its cross-sectional area divided by length; it is analogous to electrical conductance. Blocks also have heat capacities that are analogous to electrical capacitance.

¹ In fact, Ohm borrowed this model from Fourier and it was initially proposed to model heat flow.

Therefore, an instantaneous change in heat generation results in a gradual change in temperature. As a result, the temperature profile of a CMP is essentially its power profile after applying a complicated RC filter. We will deal with this effect in detail in a later section.

For a thermal model to be accurate, each block must be so small that the temperature within it is uniform. A fine-grained, and thus more accurate, model was used to validate ThermOS. However, for the sake of explanation, this section will describe the coarse-grained model shown in Figure 4.2, in which each core is represented with a single thermal model element.

In 3D CMPs fabricated from multiple stacked wafers, the thermal environment varies from layer to layer. Moreover, the intra-layer and inter-layer thermal relationships among CMP cores are heterogeneous. The rest of this section explains the impact of this heterogeneity on heat flow and builds the theoretical foundations for developing near-optimal 3D CMP thermal management policies. This understanding is essential for proper thermal management of 3D CMPs but no prior work is based on it.

Homogeneous Intra-Layer Characteristics

Figure 4.2 illustrates a simplified heat conduction model for a pair of adjacent CMP cores on the same layer (J and K) and a pair of adjacent CMP cores on different layers (I and K) of a 3D CMP. As shown in this figure, since the heat dissipation paths of Cores J and K are nearly identical, the thermal conductances of these two cores are nearly equal. In other words, processor cores within the same layer have similar cooling efficiencies.

Heterogeneous Inter-Layer Characteristics

In contrast to cores on the same layer, Cores I and K have different conductances to the ambient: $g_{hs} = 0.82$ W/K for Core K and $1/(1/g_{hs} + 1/g_{inter}) = 0.73$ W/K for Core I². In addition, the steady-state temperature of Core I is always higher than that of Core K, even if Core I has a lower power consumption. The following equations formalize this effect, which we refer to as thermal dominance. Neglecting the limited intra-layer heat flow,

$$T_K = T_{amb} + (P_K + P_I)/g_{hs} \tag{4.1}$$

$$T_I = T_K + P_I/g_{inter} = T_{amb} + (P_K + P_I)/g_{hs} + P_I/g_{inter} \tag{4.2}$$

where $T_K$ and $T_I$ are the temperatures of Cores K and I, $T_{amb}$ is the ambient temperature, $P_K$ and $P_I$ are the power consumptions of Cores K and I, $g_{hs}$ is the thermal conductance from Core K to the ambient through the cooling solution, and $g_{inter}$ is the inter-layer thermal conductance between Cores I and K. In addition to thermally dominating Core K, Core I also has a higher total resistance to the ambient, i.e., it has a lower cooling efficiency. As a result, a unit of power consumption on Core I will have at least as great an impact on temperature as a unit of power consumption on Core J or K.

² The thermal conductance values in this section are derived using a thermal analysis package developed by Yang et al. [125], which constructs a fine-grained 3D CMP thermal model based on the material properties and physical structure of the chip-package configuration described later in this chapter and in Tables 4.3 and 4.4. For the sake of explanation, a coarse-grained thermal model with compact equations is used in this section to illustrate the fundamental 3D CMP thermal properties.
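Equations 4.1 and 4.2 can be checked numerically with a few lines of Python. The sketch below uses the conductance values quoted above; the ambient temperature and the power values are assumptions chosen only for illustration, and the function names are hypothetical.

```python
def series_conductance(*conductances):
    """Conductance of thermal paths connected in series (W/K)."""
    return 1.0 / sum(1.0 / g for g in conductances)


def stacked_core_temperatures(p_k, p_i, g_hs=0.82, g_inter=6.67, t_amb=45.0):
    """Steady-state temperatures of Core K (near the heatsink) and Core I (above it).

    Intra-layer heat flow is neglected, as in Equations 4.1 and 4.2.
    t_amb = 45.0 is an assumed ambient temperature, not a value from the text.
    """
    t_k = t_amb + (p_k + p_i) / g_hs   # Equation 4.1
    t_i = t_k + p_i / g_inter          # Equation 4.2
    return t_k, t_i


g_core_i = series_conductance(0.82, 6.67)  # Core I's conductance to the ambient
# Assumed powers: Core I dissipates a quarter of what Core K does.
t_k, t_i = stacked_core_temperatures(p_k=20.0, p_i=5.0)
print(round(g_core_i, 2), t_k, t_i)
```

Even with Core I consuming much less power than Core K in this example, `t_i` comes out above `t_k`, which is the thermal dominance effect described above.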

Thermal Coupling

The thermal conductance between J and K ($g_{intra}$) is approximately 0.41 W/K. Heat can flow between Cores J and K. As a result, the power consumption of one can influence the temperature of the other. However, this thermal coupling is relatively minor compared to that between vertically-aligned cores. The thermal conductance between Cores I and K ($g_{inter}$) is approximately 6.67 W/K, roughly $16\,g_{intra}$. The large interface area between Cores I and K results in a high thermal conductance, despite the interposed high thermal resistivity (but thin, and therefore low resistance) 10 µm polyimide bonding layer.

Summary and Open Questions

At this point, we can draw some qualitative conclusions. The temperatures of vertically-aligned cores are highly correlated, relative to the temperatures of horizontally-adjacent cores. Cores farther from the heatsink have higher temperatures than their neighbors closer to the heatsink. In addition, the temperature impact of a unit of power dissipation will be at least as high for Core I as for Cores J and K, due to their differing thermal conductances to the ambient. However, a few questions remain:

1. How can we use this knowledge of thermal environment heterogeneity to guide the development of a CMP thermal management algorithm?
2. What is the impact of the power consumption of each core upon all other cores in the system?

We will now introduce a general analytical framework that answers these questions.

3D CMP Heat Flow Analytical Framework

In this section, we formulate the problem of determining the impact of a unit change in power consumption for any given processor core upon the temperatures of all other cores. This formulation provides the theoretical foundation for determining the principles of near-optimal thermal management. We can represent the thermal characteristics of a 3D CMP using the following notation, which follows naturally from the heat conduction analysis ideas discussed in Section 4.3.1:

$$C\,\frac{dT(t)}{dt} + A\,T(t) = P\,u(t) \tag{4.3}$$

In this equation, given a system of $N$ thermal elements, $C$ is an $N \times N$ matrix with thermal element heat capacities along the diagonal and zeros elsewhere, $T$ is a length-$N$ thermal element temperature vector, $t$ is time, $A$ is an $N \times N$ matrix containing the thermal conductances of adjacent elements at the corresponding row-column intersections and zeros elsewhere, $P$ is a length-$N$ thermal element power vector, and $u(t)$ is a unit step function of time. In addition, matrix $A = L^{T} K L$, where $L$ is a Laplacian matrix and $K$ is a diagonal matrix containing the thermal conductances of adjacent thermal elements. Given an IC chip-package partition with $N$ connected thermal elements plus a ground element that models the ambient temperature, matrix $A$ is full rank, i.e., nonsingular [76]. The impact of the $C\,dT(t)/dt$ term will be explained in detail in a later section. In order to ease explanation, neglect $C$, then solve Equation 4.3 for $T$ as follows:

$$T = A^{-1} P \tag{4.4}$$
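To see how the transient form (Equation 4.3) relates to the steady-state form (Equation 4.4), the sketch below integrates the single-element case $C\,dT/dt + g\,T = P$ with forward Euler. It is an illustration only: the heat capacity and power values are assumptions, the conductance is the $g_{hs}$ value quoted earlier, temperature is measured relative to the ambient, and forward Euler is used for brevity rather than because the thesis uses it.

```python
import math


def step_response(power_w, g_w_per_k, c_j_per_k, t_end_s, dt_s=1e-3):
    """Temperature rise above ambient after a power step, by forward Euler.

    Solves C * dT/dt + g * T = P for a single thermal element with T(0) = 0.
    """
    temp = 0.0
    for _ in range(int(round(t_end_s / dt_s))):
        temp += dt_s * (power_w - g_w_per_k * temp) / c_j_per_k
    return temp


P, G, C = 10.0, 0.82, 0.5        # P and C are assumed; G is g_hs from the text
tau = C / G                      # RC thermal time constant (s)
early = step_response(P, G, C, t_end_s=tau)
late = step_response(P, G, C, t_end_s=10.0)
steady = P / G                   # single-element version of Equation 4.4
print(early, late, steady, 1.0 - math.exp(-1.0))
```

After one time constant the temperature rise has covered roughly 63% of the way to its steady-state value, and after many time constants it matches $P/g$, which is what neglecting $C$ in Equation 4.4 computes directly.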

This leads to an interesting observation: $A^{-1}$ gives the thermal impact of unit changes in power consumption. It is conventionally referred to as the thermal resistance matrix [18] but it would be better to view it as a thermal impact matrix. In order to determine the thermal impact of one core's power consumption on another core's temperature, we need only consider the value in the corresponding row-column intersection in $A^{-1}$.

Let us assume that Core I is currently the hottest in the CMP. $\zeta_{i,j}$ is the thermal impact coefficient for core $i$ due to $j$. This value indicates the change in the temperature for element $i$ as a consequence of a unit change in power consumption for element $j$. To determine the impact of power consumed in Cores J and K upon Core I's temperature, we need only consider the thermal impact coefficients in row I of $A^{-1}$, i.e., $[\zeta_{I,I}, \zeta_{I,J}, \zeta_{I,K}]$. Thus,

$$T_I = P_I\,\zeta_{I,I} + P_J\,\zeta_{I,J} + P_K\,\zeta_{I,K} \tag{4.5}$$

The thermal impact matrix will be used extensively in Section 4.4 to develop thermal management guidelines. It also gives us a new view of thermal heterogeneity in 3D CMPs. For a representative stacked-wafer 3D CMP design, the $\zeta$ value for vertically-adjacent cores is 1.22 K/W and the $\zeta$ value for laterally-adjacent cores is 0.39 K/W, yielding a thermal impact ratio of 3.12 for the two cases.

Power Model, Dynamic Thermal Analysis, and Modeling Granularity

In the previous subsections, we made a number of simplifying assumptions about the thermal environment in order to ease explanation. Our actual analysis and thermal management implementation relaxes many of these assumptions for greater accuracy. We now expound on our thermal model.
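As an illustration of the thermal impact matrix, the sketch below builds the conductance matrix for the coarse three-core model of Figure 4.2 and inverts it. The topology is an assumption read from the figure and the surrounding text (Core I attached to Core K through $g_{inter}$, Cores J and K coupled through $g_{intra}$, and J and K each attached to the ambient through $g_{hs}$). The helper names and the power values are hypothetical, and the coefficients of this coarse model are not expected to reproduce the fine-grained values quoted above.

```python
def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(matrix)
    aug = [list(row) + [1.0 if r == c else 0.0 for c in range(n)]
           for r, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [v / scale for v in aug[col]]
        for r in range(n):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * p for v, p in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def impact_matrix(g_hs, g_inter, g_intra):
    """Thermal impact matrix (inverse of A) for cores ordered [I, J, K]."""
    a = [
        [g_inter, 0.0, -g_inter],                        # Core I: only tied to K
        [0.0, g_hs + g_intra, -g_intra],                 # Core J: ambient and K
        [-g_inter, -g_intra, g_hs + g_inter + g_intra],  # Core K: ambient, I, J
    ]
    return invert(a)


zeta = impact_matrix(g_hs=0.82, g_inter=6.67, g_intra=0.41)
powers = [5.0, 10.0, 20.0]  # assumed P_I, P_J, P_K in watts
# Equation 4.5: temperature rise of Core I above ambient.
rise_i = sum(z * p for z, p in zip(zeta[0], powers))
print(zeta[0], rise_i)
```

Setting `g_intra` to zero recovers the coefficients implied by Equations 4.1 and 4.2, which is a convenient sanity check on the matrix construction.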

72 CHAPTER 4. 3D CMP THERMAL MANAGEMENT 56 In order to determine thermal profile, the power profile must first be known. We model both dynamic power consumption and leakage power consumption [129]. Dependence on voltage, switching activity, capacitance, and temperature are considered. These equations are used together with a Wattch-based EV6 power model [15] to determine the power consumption distribution among architectural units. The power distributions of real multiprogrammed and multithreaded workloads on CMPs may be spatially and temporally heterogeneous. The proposed modeling approach allows us to capture the impact of workload heterogeneity on power and thermal profiles. As explained in Section 4.3.2, the thermal analysis of real ICs must consider heat capacity (C) as well as thermal conductance, i.e., transient analysis is necessary. The thermal analysis infrastructure we use in architectural thermal simulation captures these effects using a frequency-domain moment matching analysis technique. Our on-line thermal management technique continuously adjusts its behavior based on thermal sensor readings. Prior subsections assumed that each CMP core is represented by a single thermal element to simplify explanation. In reality, our analysis infrastructure is capable of dividing each CMP core into numerous three-dimensional thermal elements to permit accurate temperature estimation. Heat capacity plays a role in thermal modeling and management. Considering transient effects complicates the power and thermal analysis infrastructure. Fortunately, heat capacity limits the rate of temperature change, i.e., the maximum temperature change of a CMP core in a given time interval is limited by the RC thermal time constant of the core and the maximum power consumption change. Although we used a thermal analysis infrastructure that considers transient thermal effects in detail, the proposed thermal management technique is designed to react to transient

thermal effects by periodically adapting its behavior based on temperatures measured with thermal sensors or estimated using run-time thermal models.

4.4 3D CMP Thermal Management

In this section, we investigate the 3D CMP run-time thermal management problem and propose efficient management techniques. Given a 3D CMP with $N$ on-chip processor cores, our goal is to maximize the CMP throughput under run-time thermal constraints. CMP throughput is defined as the total number of instructions executed by the CMP per second:

$$CMP_{IPS} = \sum_{i=0}^{N-1} IPC_i \times f_i \tag{4.6}$$

where $IPC_i$ and $f_i$ are the run-time instructions per cycle and frequency of Core $i$. Run-time thermal safety requires that

$$\forall\, i \in \{0, \ldots, N-1\}: \quad T_i \le T_{MAX} \tag{4.7}$$

i.e., the temperature of each processor core cannot exceed the maximum safe temperature, $T_{MAX}$. In the following sections, we analyze the thermal management problem for 3D CMPs and determine the policies necessary for performance optimization under temperature constraints. This study will be used to guide the development of our run-time thermal management techniques.
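The objective and the constraint in Equations 4.6 and 4.7 translate directly into code. The following sketch uses hypothetical function names and assumed sample values; it only evaluates the two expressions and does not perform any optimization.

```python
def cmp_throughput(ipc, freq_hz):
    """Equation 4.6: total instructions per second over all cores."""
    return sum(i * f for i, f in zip(ipc, freq_hz))


def thermally_safe(core_temps, t_max):
    """Equation 4.7: every core temperature must stay at or below t_max."""
    return all(t <= t_max for t in core_temps)


# Assumed example: two cores with different IPC and clock frequency.
ips = cmp_throughput(ipc=[1.0, 0.5], freq_hz=[2.0e9, 1.0e9])
safe = thermally_safe(core_temps=[78.0, 84.5], t_max=85.0)
print(ips, safe)
```

A run-time manager would evaluate the constraint on measured or estimated temperatures and treat the throughput expression as the quantity to maximize.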

Conditions Required for Optimal 3D CMP Thermal Management and Derivations of Resulting Policy Guidelines

This section derives performance optimization guidelines. The central theme is to optimize the performance of CMP cores under a constraint on peak temperature during workload assignment and power-thermal budgeting.

Observation: To maximize CMP throughput, processor cores should operate at different voltages and frequencies due to heterogeneous processor core thermal characteristics and heterogeneous run-time workloads.

As described in Section 4.3.1, processor cores in a 3D CMP are thermally correlated. The temperature of each Core $i$ is affected by the power consumptions of all cores, as follows:

$$T_i = \sum_{j=0}^{N-1} \zeta_{i,j}\, p_j \le T_{MAX} \tag{4.8}$$

where $T_i$ is the temperature of processor Core $i$; $\zeta_{i,j}$, $\{i, j\} \in [0, N-1]$, is an inter-core thermal impact coefficient, which indicates the impact of a unit power consumption of Core $j$ on the temperature of Core $i$; $p_j$ is Core $j$'s power consumption; and $N$ is the number of processor cores of the CMP.

We would like to guide migration of tasks among cores, and budget power to cores, in order to optimize CMP throughput under a temperature constraint. To facilitate developing the necessary guidelines, we introduce the concept of thermal impact per performance gain, $TIP$:

$$TIP^{f}_{i,j} = \frac{dT_i}{df_j}, \qquad TIP^{IPC}_{i,j} = \frac{dT_i}{dIPC_j} \tag{4.9}$$

$TIP_{i,j}$ indicates the thermal impact on processor Core $i$ due to the increase in Core $j$'s

performance, whether by increasing its frequency and voltage or by assigning a higher-IPC job to it. Intuitively, TIP is the thermal cost per unit increase in processor core performance. It can be viewed as the inverse of a core's thermal efficiency. Subject to a temperature bound, maximizing CMP performance thus requires that all processor cores have the same thermal impact per performance improvement on the maximum-temperature core $i$, i.e.,

\[ TIP^{f,IPC}_{i,0} = TIP^{f,IPC}_{i,1} = \cdots = TIP^{f,IPC}_{i,N-1} \tag{4.10} \]

Note that the impact on $T_i$ due to the power consumption of Core $j$ is $\zeta_{i,j} P_j$. The dynamic power consumption is $P_j = \xi_j V_j^2 f_j$, where $V_j$ and $f_j$ are the supply voltage and frequency of Core $j$; $V_j \propto f_j^{\beta}$, with $\beta$ as characterized in [13]; and $\xi_j$ is Core $j$'s run-time switching activity multiplied by the capacitance of the switched nodes (which is approximately linearly proportional to the IPC of the job running on Core $j$). Then

\[ \begin{aligned} \zeta_{i,0} f_0^{2\beta+1} &= \zeta_{i,1} f_1^{2\beta+1} = \cdots = \zeta_{i,N-1} f_{N-1}^{2\beta+1} \\ \zeta_{i,0} \xi_0 f_0^{2\beta} &= \zeta_{i,1} \xi_1 f_1^{2\beta} = \cdots = \zeta_{i,N-1} \xi_{N-1} f_{N-1}^{2\beta} \end{aligned} \tag{4.11} \]

This result indicates that processor cores with heterogeneous power and thermal characteristics, i.e., different power thermal impact coefficients $\zeta_{i,j}$, running jobs with different IPCs should be clocked at different frequencies. A similar conclusion can be drawn when both dynamic and leakage power are considered. As shown in Section 4.3.1, the inter-layer and intra-layer thermal characteristics of 3D CMPs show distinct differences. This leads to different thermal management policies for inter-layer and intra-layer processor cores. In the following sections, we determine the conditions required for optimal 3D CMP thermal management and derive the resulting policy guidelines.
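The step from the TIP definition to the per-core frequency conditions can be spelled out as follows. This is a sketch under two simplifying assumptions that are not stated in the text: the voltage-frequency proportionality constant $c$ and the activity-to-IPC constant $k$ are taken to be identical for all cores.

```latex
% Assumptions (illustrative): V_j = c f_j^{\beta} and \xi_j = k\,IPC_j for every core j.
\begin{align*}
P_j &= \xi_j V_j^2 f_j = c^2 \xi_j f_j^{2\beta+1}, \\
T_i &= \sum_{j=0}^{N-1} \zeta_{i,j} P_j
     = c^2 \sum_{j=0}^{N-1} \zeta_{i,j}\, \xi_j f_j^{2\beta+1}, \\
TIP^{f}_{i,j} &= \frac{\partial T_i}{\partial f_j}
     = (2\beta+1)\, c^2\, \zeta_{i,j}\, \xi_j f_j^{2\beta}, \\
TIP^{IPC}_{i,j} &= \frac{\partial T_i}{\partial IPC_j}
     = k\, c^2\, \zeta_{i,j} f_j^{2\beta+1}.
\end{align*}
% Equalizing TIP^{IPC}_{i,j} over j and dropping the common factor k c^2 gives the
% f^{2\beta+1} condition; equalizing TIP^{f}_{i,j} and dropping (2\beta+1) c^2 gives
% the \xi_j f_j^{2\beta} condition.
```

With these constants cancelled, the two equalities are exactly the frequency conditions that the guidelines in the following sections build on.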

Inter-Layer Power Thermal Budgeting and Workload Assignment

Inter-layer processor cores have heterogeneous thermal characteristics. In addition, vertically-aligned cores have strongly-correlated temperatures. We now derive heterogeneity-aware guidelines for power thermal budgeting and workload assignment among vertically-aligned cores.

Guideline I: To maximize CMP throughput, the thermal efficiencies of vertically-aligned processor cores should be optimized under the thermal constraint, i.e., the voltage and frequency assignment among vertically-aligned processor cores should follow the optimality conditions derived above.

As shown in Section 4.3.1, among each group of vertically-aligned processor cores, the Core $i$ farthest from the heat sink is thermally dominant, i.e., it has the highest temperature and also the lowest cooling efficiency. Therefore, given the thermal constraint for processor Core $i$, i.e., $T_i \le T_{MAX}$, the performance-optimal voltage and frequency setup produced by these conditions also guarantees the thermal safety of the other vertically-aligned processor cores. In other words, these conditions provide the performance-optimal power thermal budget policy for vertically-aligned processor cores. Considering Cores I and K in Figure 4.2, $\zeta_I \,(= 1/g_{inter} + 1/g_{hs}) > \zeta_K \,(= 1/g_{hs})$, and $T_I \,(= \zeta_I P_I + \zeta_K P_K) > T_K \,(= \zeta_K P_I + \zeta_K P_K)$. The optimality conditions yield

\[ \frac{f_I}{f_K} = \left( \frac{IPC_K \, \zeta_K}{IPC_I \, \zeta_I} \right)^{\frac{1}{2\beta}} \]

Given a homogeneous workload assignment, i.e., $IPC_I = IPC_K$, this implies that $f_K > f_I$, i.e., to optimize CMP throughput, the processor core with higher cooling efficiency should be clocked at a higher frequency.
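The two-core case can be checked numerically. The sketch below combines the frequency ratio above with the binding constraint $T_I = T_{MAX}$ for the thermally dominant core. It works in normalized units and assumes $P = IPC \cdot f^{2\beta+1}$ with all proportionality constants set to one; the function names, the value $\beta = 1$, and the sample coefficients are hypothetical.

```python
def optimal_frequencies(ipc_i, ipc_k, zeta_i, zeta_k, t_max, beta=1.0):
    """Frequencies of Core I (far from the heat sink) and Core K (near it).

    Uses f_I / f_K = (IPC_K * zeta_K / (IPC_I * zeta_I)) ** (1 / (2 * beta))
    and solves zeta_I * P_I + zeta_K * P_K = T_MAX for f_K.
    """
    ratio = ((ipc_k * zeta_k) / (ipc_i * zeta_i)) ** (1.0 / (2.0 * beta))
    f_k = (t_max / (zeta_k * ipc_k * (1.0 + ratio))) ** (1.0 / (2.0 * beta + 1.0))
    return f_k * ratio, f_k


def core_temperatures(ipc_i, ipc_k, f_i, f_k, zeta_i, zeta_k, beta=1.0):
    """Steady-state temperature rises T_I and T_K in the same normalized units."""
    p_i = ipc_i * f_i ** (2.0 * beta + 1.0)
    p_k = ipc_k * f_k ** (2.0 * beta + 1.0)
    return zeta_i * p_i + zeta_k * p_k, zeta_k * (p_i + p_k)


# Homogeneous workload, Core I twice as hard to cool as Core K (illustrative values).
f_i, f_k = optimal_frequencies(1.0, 1.0, zeta_i=2.0, zeta_k=1.0, t_max=1.0)
t_i, t_k = core_temperatures(1.0, 1.0, f_i, f_k, zeta_i=2.0, zeta_k=1.0)
print(f"f_I={f_i:.4f}  f_K={f_k:.4f}  T_I={t_i:.4f}  T_K={t_k:.4f}")
```

In this toy setting Core K runs faster than Core I, Core I sits exactly at the limit, and Core K stays below it, which is the behavior Guideline I describes. The same helper can be reused to compare alternative job placements by swapping the two IPC arguments.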

Guideline II: Given jobs with different IPCs, the maximal CMP throughput can only be achieved by maximizing the IPC heterogeneity during workload distribution. To maximize throughput, jobs with higher IPCs should be assigned to cores with higher thermal efficiencies.

This guideline indicates how to distribute run-time workload among vertically-aligned processor cores. We will again use Figure 4.2 to illustrate the reason for this guideline. Given a temperature constraint $T_{MAX}$ and an arbitrary workload assignment with Core I's IPC equal to $IPC_I$ and Core K's IPC equal to $IPC_K$, the optimality conditions yield the following performance-optimal power and thermal budget assignment under the given workload distribution:

\[ f_I = f_K \left( \frac{IPC_K \, \zeta_K}{IPC_I \, \zeta_I} \right)^{\frac{1}{2\beta}} \tag{4.12} \]

\[ f_K = \left( \frac{T_{MAX}}{\zeta_K \, IPC_K \left( 1 + \left( \frac{\zeta_K \, IPC_K}{\zeta_I \, IPC_I} \right)^{\frac{1}{2\beta}} \right)} \right)^{\frac{1}{2\beta+1}} \tag{4.13} \]

Next, we switch the workload between Core I and Core K. The optimality conditions yield the following performance-optimal power and thermal budget assignment for the new distribution, whose frequencies are written $f'_I$ and $f'_K$:

\[ f'_I = f'_K \left( \frac{IPC_I \, \zeta_K}{IPC_K \, \zeta_I} \right)^{\frac{1}{2\beta}} \tag{4.14} \]

\[ f'_K = \left( \frac{T_{MAX}}{\zeta_K \, IPC_I \left( 1 + \left( \frac{\zeta_K \, IPC_I}{\zeta_I \, IPC_K} \right)^{\frac{1}{2\beta}} \right)} \right)^{\frac{1}{2\beta+1}} \tag{4.15} \]

A simple calculation then shows that the difference in CMP throughput between

these two workload distributions satisfies

\[ (IPC_I f_I + IPC_K f_K) - (IPC_K f'_I + IPC_I f'_K) \ge 0, \quad \forall \, IPC_I \le IPC_K \tag{4.16} \]

where $f'_I$ and $f'_K$ are the frequencies of Core I and Core K after the workload is swapped. In other words, assigning jobs with higher IPCs to cores with higher thermal efficiencies yields higher overall throughput under the same temperature constraint.

Intra-Layer Power Thermal Budgeting

Intra-layer cores have mostly-homogeneous thermal characteristics with almost identical cooling efficiencies (see Section 4.3.1), i.e., $\zeta_{i,i} \approx \zeta_{j,j}$ when Core $i$ and Core $j$ are in the same layer. In addition, the inter-core thermal impact is significantly lower than the self power thermal impact of each core, i.e., $\zeta_{i,i} \gg \zeta_{i,j}$ when $i \ne j$. We derive the following policies for intra-layer power thermal budgeting and workload assignment.

Guideline III: To maximize aggregate CMP frequency or instruction throughput, power thermal budget and workload should be balanced among intra-layer processor cores.

Consider two intra-layer processor cores J and K with $\zeta_{J,J} \approx \zeta_{K,K} \gg \zeta_{J,K} \approx \zeta_{K,J}$. The temperature of each core depends mainly on its own power consumption, i.e., $T_J \approx \zeta_{J,J} P_J$ and $T_K \approx \zeta_{K,K} P_K$ (steady-state). Given the thermal constraint $T_J, T_K \le T_{MAX}$, performance optimization yields $P_J \approx P_K$ and $TIP_J \approx TIP_K$, i.e., both cores should be clocked at the same frequency and execute workload with the same IPC. This guideline can also be motivated as follows. Assume both cores are assigned the same voltage $V$, frequency $f$, and workload ($\xi$ and IPC). Therefore, $T_J \approx T_K$. Next, by adjusting the workload assignment, we increase the IPCs of the

jobs assigned to one core and decrease the IPCs of the jobs assigned to the other core. Since $\zeta_{J,J}, \zeta_{K,K} \gg \zeta_{J,K}, \zeta_{K,J}$, the temperature of one of the cores increases and the peak temperature of these two cores increases. As a result, frequency reduction and performance degradation are required to meet temperature constraints.

[Figure 4.3: ThermOS: 3D CMP Run-time Thermal Management [134]. Operating system: global power-thermal budgeting, distributed thermal-aware workload migration. CMP hardware: temperature monitoring, workload monitoring, distributed run-time thermal management.]

ThermOS: 3D CMP Thermal Management

Based on the thermal management guidelines developed in Section 4.4.1, we have developed ThermOS, a unified hardware and OS thermal management solution for 3D CMPs. As shown in Figure 4.3 and Table 4.1, ThermOS consists of hardware-based temperature and workload monitoring and distributed run-time thermal management built into a 3D CMP microarchitecture, as well as a temperature-aware Linux kernel equipped for global power thermal budgeting and distributed temperature-aware workload migration. ThermOS is a proactive, continuously-engaged solution designed to handle 3D CMP power thermal heterogeneities, distribute run-time workload, and manage the limited power thermal budget to optimize performance under

temperature constraint. ThermOS is built upon the Linux kernel, whose scheduler has O(1) time complexity. Our temperature-aware scheduling algorithm maintains the same time complexity. Table 4.1 summarizes the proposed offline, run-time, and hardware management techniques.

Temperature Monitoring

ThermOS gathers CMP temperature profiles at run-time, which are used to guide temperature-aware workload migration as well as power thermal budgeting. Either thermal sensors or online thermal analysis may be used for online temperature monitoring. Thermal sensors have been widely used in high-performance microprocessors [85, 56]. Efficient software-based online thermal analysis techniques have also been developed [99].

Workload Monitoring

In addition to the CMP thermal profile, ThermOS gathers run-time performance and power characteristics to guide job migration as well as power thermal budgeting. A processor core's activity factor is a function of the capacitances of its functional units and the corresponding run-time activity factors resulting from its workload. Most modern processors provide hardware performance counters for monitoring specific events [56, 101]. These performance counters can be used to inform accurate and efficient regression-based run-time performance and power models [52, 63]. ThermOS uses this technique for linear regression estimation of run-time processor core activity factors. The model was developed offline and integrated with the OS. During execution, each processor core's hardware performance counter values are gathered

Table 4.1: ThermOS Implementation [134].

| Phase | Level | Component | Function |
|---|---|---|---|
| Offline | | Offline computation | Given the activity factor range of the on-chip processor cores, derive the look-up table, which contains the optimal voltages and frequencies yielded by the optimality conditions of Section 4.4.1. |
| Online | OS | rebalance_tick() | Invoke cluster_opt() and group_opt() at the beginning of each workload migration time interval (every 20 ms). |
| Online | OS | cluster_opt() | Conduct inter-layer migration according to Guideline II. |
| Online | OS | group_opt() | Conduct intra-layer migration according to Guideline III. |
| Online | OS | scheduler_tick() | 1) Monitor the activity factors of run-time processes using hardware performance counters. 2) Determine the global power thermal budgeting using run-time table lookup. |
| Online | Hardware | Local DVFS | Proactive distributed DVFS based on global guidance and local variation. |
| Online | Hardware | Local clock throttling | Reactive distributed clock throttling to guarantee thermal safety. |

periodically when triggered by OS timer interrupts (every 1 ms in the Linux kernel). These performance counter values are used for run-time workload activity and IPC estimation.

Distributed Thermal-Aware Workload Migration

ThermOS contains a distributed online workload migration technique to support performance optimization. The proposed technique follows the guidelines derived above and carefully handles 3D CMP inter-layer thermal heterogeneity and run-time workload heterogeneity. ThermOS uses a distributed approach that swaps jobs with high IPCs onto processor cores with higher thermal efficiencies. Consider two vertically-adjacent processor cores, Core I and Core K, and assume Core K has higher cooling efficiency than Core I. To optimize instruction throughput, ThermOS compares the jobs stored in each processor core's job queue. It first identifies the lowest-IPC job ($IPC_{MIN}^{K}$) on Core K and the highest-IPC job ($IPC_{MAX}^{I}$) on Core I. If $IPC_{MIN}^{K} < IPC_{MAX}^{I}$, ThermOS swaps the corresponding jobs.

Intra-layer thermal heterogeneity and thermal correlation are small. Therefore, ThermOS balances the intra-layer IPC distribution to optimize instruction throughput. The average IPCs of jobs on horizontally-adjacent cores are compared. If appropriate, jobs are swapped to further balance the distribution. The proposed distributed thermal-aware workload migration technique has been integrated within the default Linux kernel workload balancing policy. In the current implementation, workload migration occurs every 20 ms.
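The inter-layer swap rule described above can be sketched as follows. This is only an illustration of the comparison between the lowest-IPC job on Core K and the highest-IPC job on Core I; the function name, the list-of-tuples queue representation, and the job names are assumptions, and the actual kernel implementation is not shown in the text.

```python
from typing import List, Tuple

Job = Tuple[str, float]  # (job name, estimated IPC); illustrative representation


def inter_layer_swap(queue_i: List[Job], queue_k: List[Job]) -> Tuple[List[Job], List[Job], bool]:
    """One swap step between Core I (lower cooling efficiency) and Core K (higher).

    Swaps the highest-IPC job on Core I with the lowest-IPC job on Core K when
    the latter has the smaller IPC. Returns the new queues and a swap flag.
    """
    new_i, new_k = list(queue_i), list(queue_k)
    if not new_i or not new_k:
        return new_i, new_k, False
    hi = max(range(len(new_i)), key=lambda n: new_i[n][1])  # highest IPC on Core I
    lo = min(range(len(new_k)), key=lambda n: new_k[n][1])  # lowest IPC on Core K
    if new_k[lo][1] < new_i[hi][1]:
        new_i[hi], new_k[lo] = new_k[lo], new_i[hi]
        return new_i, new_k, True
    return new_i, new_k, False


core_i = [("a", 0.5), ("b", 1.8)]
core_k = [("c", 0.7), ("d", 1.2)]
core_i, core_k, swapped = inter_layer_swap(core_i, core_k)
print(core_i, core_k, swapped)
```

Repeating this step at each migration interval moves high-IPC jobs toward the better-cooled core until no job on Core K has a lower IPC than the best job on Core I.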

Global Power Thermal Budgeting

ThermOS dynamically adjusts the power thermal budgets of processor cores to optimize 3D CMP performance. Following the guidelines in Section 4.4.1, ThermOS balances the power thermal budget assignment among processor cores in the same layer. The optimality conditions derived there are used to guide inter-layer power thermal budgeting. The leakage-temperature dependency introduces temperature variables on both sides of the governing equation. Solving this equation requires numerical iteration and detailed chip-package thermal analysis, which are computationally intensive. To minimize run-time overhead, we have developed a hybrid offline/online budgeting technique. Given the switching activity (or IPC) range of the workload, the optimal voltage and frequency settings for vertically-aligned processor cores are pre-computed. The offline component of the budgeting algorithm is iterative. During each iteration, based on the IPC and the switching activity of each processor core, the optimality conditions are used to determine the optimal processor core power thermal budgets. Thermal analysis is then used to estimate the 3D CMP thermal profile and update the leakage power profile estimate. This process iterates until the chip-package thermal profile converges, subject to feedback from temperature-dependent leakage power consumption. The final voltage and frequency configurations are stored in a look-up table for efficient use during online power thermal budgeting. Given that the number of processor layers is $L$ and the number of activity factor settings is $n$, the look-up table has $n^L$ entries. Increasing $n$, i.e., the resolution of the activity factor index, improves performance but increases storage overhead, as demonstrated later in this chapter. In ThermOS, run-time power thermal budgeting is implemented in the Linux kernel and invoked periodically. Periods ranging from 1 ms to 100 ms are currently supported.
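The hybrid offline/online scheme can be illustrated with a small sketch: an offline pass fills a table with one entry per combination of quantized activity levels across the $L$ layers, and the online pass quantizes the measured activities and performs a lookup. All names below are hypothetical, and `toy_solver` is a placeholder standing in for the iterative thermal and leakage analysis, which is not reproduced here.

```python
import itertools


def build_lookup_table(activity_levels, num_layers, solve_budget):
    """Offline pass: one entry per tuple of level indices, i.e. n ** L entries."""
    table = {}
    for key in itertools.product(range(len(activity_levels)), repeat=num_layers):
        activities = tuple(activity_levels[k] for k in key)
        table[key] = solve_budget(activities)
    return table


def quantize(activity, activity_levels):
    """Index of the stored activity level closest to the measured value."""
    return min(range(len(activity_levels)), key=lambda k: abs(activity_levels[k] - activity))


def lookup(table, activity_levels, measured):
    """Online pass: quantize each layer's measured activity and read the table."""
    key = tuple(quantize(a, activity_levels) for a in measured)
    return table[key]


def toy_solver(activities):
    """Placeholder for the offline solver: returns one made-up frequency per layer."""
    return tuple(round(2.0 - a, 2) for a in activities)


levels = [0.2, 0.5, 0.8]           # n = 3 activity settings (illustrative)
table = build_lookup_table(levels, num_layers=2, solve_budget=toy_solver)
print(len(table))                   # n ** L entries
print(lookup(table, levels, (0.45, 0.9)))
```

The table size grows as $n^L$, which is the storage cost that a finer activity resolution trades against budgeting accuracy.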

Distributed Run-Time Thermal Management

ThermOS uses distributed run-time thermal management to honor the power and thermal budgets described above and adhere to a temperature constraint. Periodically, each processor core adjusts its voltage and frequency based on its assigned power thermal budget. However, transient variations may not be immediately detected by the OS. In order to honor the temperature constraint, ThermOS uses local dynamic voltage and frequency scaling (DVFS) and clock throttling to react to transient variation with lower latency than global power thermal budgeting. Table 4.2 compares these two widely-used power management techniques. DVFS has high area overhead, mainly due to complex power supply circuitry and the need for off-chip capacitors and inductors for each independent voltage domain. It also has a higher response latency than clock throttling. For modern high-performance microprocessors equipped with DVFS, the voltage transition rate is in the range of 10 mV/µs [51]. Clock throttling, on the other hand, has low area overhead and low latency. However, DVFS has less performance impact per unit power reduction than clock throttling, thanks to the superlinear dependence of power on voltage. Note that most modern high-performance processors already support DVFS. We propose to use this existing DVFS hardware to the best effect. In ThermOS, local DVFS continuously tracks temperature changes and clock throttling is used as a final defense to guarantee thermal safety.
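The division of labor between local DVFS and clock throttling can be sketched as a per-core control step. The control law below is an assumption made for illustration: the text does not specify thresholds or step sizes, so the guard band, the single-level steps, and the function name are all hypothetical.

```python
def local_control_step(temp_c, level, budget_level, t_max_c=85.0, guard_c=2.0):
    """One control step for a single core.

    level:        current DVFS level index (0 is the lowest voltage/frequency)
    budget_level: highest level allowed by the global power thermal budget
    Returns (new_level, throttle), where throttle requests clock throttling.
    """
    if temp_c >= t_max_c:
        # Final defense: throttle the clock immediately and also step DVFS down.
        return max(level - 1, 0), True
    if temp_c >= t_max_c - guard_c:
        # Inside the guard band: proactive DVFS step down, no throttling.
        return max(level - 1, 0), False
    # Comfortably below the limit: move back toward the globally budgeted level.
    return min(level + 1, budget_level), False


print(local_control_step(80.0, level=2, budget_level=3))
print(local_control_step(84.0, level=2, budget_level=3))
print(local_control_step(85.5, level=0, budget_level=3))
```

For a sense of scale, at the quoted transition rate of 10 mV/µs, a hypothetical 100 mV voltage step would take about 10 µs, which is why the faster clock throttling path serves as the backstop.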

Table 4.2: DVFS and Clock Throttling Comparison [134].

| Technique | Area overhead | Response | Performance impact |
|---|---|---|---|
| DVFS | High | Slow | Low |
| Clock throttling | Low | Fast | High |

4.5 Experimental Setup

This section describes the experimental setup used to evaluate the proposed 3D CMP dynamic thermal management techniques. We describe our simulation and OS infrastructure, 3D chip and package models, and benchmark suites.

Infrastructure

Performance and temperature estimation for 3D CMP architectures is challenging. Estimating spatial and temporal thermal profiles requires time-varying power profiles. This, in turn, requires timing and power analysis. To accurately estimate the run-time characteristics of 3D CMPs, we developed a full-system out-of-order multiprocessor simulation environment with integrated processor performance, power, and thermal models.

Full-System Simulation Setup

We use the M5 Full System Simulator [11]. M5 provides a detailed, cycle-accurate, out-of-order simulation mode and a faster functional simulation mode. We use a combination of full-system checkpoints and the functional simulation mode to boot the system and fast-forward past the initialization portion of our benchmarks. We then

switch to detailed simulation mode to evaluate thermal and performance characteristics. We added a Wattch-based EV6 power model to M5 [15], scaled to a 90 nm process. Our cache power model is based on CACTI [106]. Static power consumption was estimated using an area-based, temperature-sensitive leakage model [103]. A 3D frequency-domain dynamic thermal analysis package was used [125]. Each active layer was modeled using numerous thermal elements.

Table 4.3: Design Parameters for Alpha [134].

| Parameter | Alpha Configuration (90 nm) |
|---|---|
| Die size | (value missing) mm^2 |
| Frequency and Voltage | 2 GHz, 1.2 V |
| Instruction Queue | 64 entries |
| Functional Units | 4 IXU, 2 FPU, 1 BPU |
| Physical Registers | 80 GPR, 72 FPR |
| Branch Predictor | 1 K local, 4 K global |
| Memory Hierarchy: L1 DCache/core | 32 KB, 2-way, 64 B blocks, 3 cycle lat. |
| Memory Hierarchy: L1 ICache/core | 64 KB, 2-way, 64 B blocks, 1 cycle lat. |
| Memory Hierarchy: Shared L2 Cache | 16 MB, 8-way LRU, 64 B blocks, 25 cycle lat. |

Table 4.4: 3D Package Setup [134].

| Layer | Thermal cond. (W/mK) | Heat cap. (J/m^3 K) | Depth (µm) |
|---|---|---|---|
| Eff. Active Layer (Silicon) | (missing) | (missing) | (missing) |
| Eff. Interface Layer (Polyimide) | (missing) | (missing) | (missing) |
| Heatsink (Cu) | (missing) | (missing) | [...],900 |
| Thermal Grease [94] | 3 to 5 (5 used) | * | 50 |

* From configuration used in HotSpot [99].

Processor Architecture

There are two ways to stack device layers: face-to-face and face-to-back. For designs with more than two layers, face-to-back bonding decreases worst-case inter-wafer via delay. We evaluate a three-layer front-to-back CMP structure. As shown in Figure 4.1, there are eight Alpha microprocessor cores in the top two layers. Each layer contains four microprocessor cores. Layers are connected with polyimide glue. There is 50 µm of thermal grease between the heatsink and die. Parameters for thermal grease and interface material follow Samson et al. [94]. Each processor core has a 32 KB L1 data cache and a 64 KB L1 instruction cache. There is a 16 MB shared L2 cache on Layer 2 and 1,024 MB of main memory. A 90 nm technology is modeled. Details can be found in Table 4.3 and Table 4.4.

We have accounted for inter-layer vias in the thermal model in the following way. The via density in a region follows $\rho_{via} = n A_{via} / (w h)$, where $n$ is the number of vias in the region, $A_{via}$ is the cross-sectional area of each via, $w$ is the width of the region, and $h$ is the height of the region. The relationship between via density and effective vertical thermal conductivity follows:

\[ K_{eff} = \rho_{via} K_{via} + (1 - \rho_{via}) K_{layer} \tag{4.17} \]

where $K_{via}$ is the thermal conductivity of the via material and $K_{layer}$ is the thermal conductivity of the region without any vias. Here, the via is assumed to be copper with a thermal conductivity of 400 W/mK. A typical via size is 15 µm × 15 µm. For the Alpha 21264, there are 587 package pins (389 die pins). Interconnect vias use 0.64% of the core area. This results in the effective bulk silicon layer and interface layer thermal conductivities reported in Table 4.4. There are three types of heat sinks: extruded, folded-fin, and integrated vapor-chamber. In this chapter, we

assume an extruded copper heat sink with a thermal conductivity of 400 W/mK [116].

Operating System

The ThermOS run-time thermal management algorithms are implemented within the Linux kernel. We made two main changes to the kernel:

Performance-counter based power modeling: We enable OS-level power estimation using performance counters. Hardware event counters of the sort typical for modern processors were added to M5. A regression-based power model was added to the OS [52].

Power thermal budgeting, task migration, and thermal management: The proposed power thermal budgeting and temperature-aware task migration techniques were implemented in the Linux kernel. We modified M5 to support kernel control of DVFS, clock throttling, and temperature monitoring through privileged machine registers.

Benchmark Suites

Multithreaded and multiprogrammed benchmarks from SPEC2000, MediaBench, ALPBench [66], and SPLASH2 [100] are used. Phansalkar et al. performed a detailed analysis of SPEC2000 and found that it can be divided into different groups based on several benchmark-specific metrics [86]. In order to build a complete set of test cases for our proposed techniques, we selected two benchmark-specific metrics: IPC and expected temperature variation. Although the absolute values of these metrics depend on microarchitectural characteristics, their relative differences within a set of benchmarks are mostly microarchitecture independent.

Table 4.5: Benchmark Characteristics [134]. Columns: Group, Name, Avg. IPC, Avg. Pow. (W), Max. T, Max. δT. Only the group and benchmark names are recoverable.

| Group | Name |
|---|---|
| SPEC, High IPC | gcc, applu, gzip, mgrid |
| SPEC, Low IPC | twolf, parser, vpr, mcf |
| Media, High IPC | gsmenc, jpegdec |
| Media, Low IPC | g721enc |
| Multithreaded (two threads) | MPGenc, Sphinx3, cholesky, lu, radix, water-nsquared, water-spatial |

Table 4.6: Benchmark Suites [134].

Multiprogrammed test setups:

| Group | Filename | Clusters | Benchmarks |
|---|---|---|---|
| SPEC | hv-hipc | High T var., high IPC | gzip, mgrid |
| SPEC | lv-hipc | Low T var., high IPC | applu, gcc |
| SPEC | hv-lipc | High T var., low IPC | parser, vpr |
| SPEC | lv-lipc | Low T var., low IPC | twolf, mcf |
| SPEC | hv-mipc1 | High T var., mixed IPC | gzip, parser |
| SPEC | hv-mipc2 | High T var., mixed IPC | mgrid, vpr |
| SPEC | lv-mipc1 | Low T var., mixed IPC | applu, mcf |
| SPEC | lv-mipc2 | Low T var., mixed IPC | gcc, twolf |
| Media | media-hipc | High IPC | jpegdec, gsmenc |
| Media | media-mipc | Mixed IPC | gsmenc, g721enc |

Multithreaded test setups: MPGenc, sphinx3, cholesky, lu, radix, water-nsquared, water-spatial.

IPC: IPC is approximately linearly related to power consumption, which has a strong influence on temperature.

Expected temperature variation: The main goal of the proposed 3D CMP thermal management technique is to maximize performance subject to a temperature constraint. In order to evaluate it, we have selected a set of benchmarks with a wide range of spatial and temporal thermal characteristics.

Based on these metrics, the benchmarks were analyzed, yielding the results in Table 4.5. Dynamic power traces were gathered over 500 ms to determine average power consumption, the temporal average of peak temperature, and the maximum peak temperature variation. We created 17 test setups (see Table 4.6). Ten of these were for multiprogrammed benchmarks. Each contains mixes of benchmarks with high and low temperature

variation and IPC. Each test setup contains two SPEC or Media benchmarks. For multithreaded benchmarks, seven test setups were created. Each test setup contains one ALPBench or SPLASH2 benchmark with two parallel threads. During experiments, each run contains eight copies of each test setup, i.e., 16 processes/threads in total with two processes or threads per core on average.

4.6 Experimental Results

This section evaluates ThermOS, the proposed run-time thermal management solution for 3D CMPs.

Comparison of ThermOS With Alternatives

In this section, we first contrast ThermOS with solutions used in existing processors. Then we provide a detailed quantitative comparison with a state-of-the-art continuously-engaged thermal management technique. The following experiments use 85 °C as a predefined thermal constraint.

Most thermal management techniques used in practice react to emergencies instead of being continuously engaged. They detect dangerously-high temperatures and reduce power consumption, generally via hardware clock throttling. Such solutions are adequate when temperatures approach their limits only very rarely. However, high power densities and constraints on cooling costs require proactive thermal management. Some researchers have moved in this direction. Donald and Martonosi [28] proposed a distributed continuously-engaged thermal management technique for 2D CMPs. Their approach is based on closed-loop control

theory, and continuously adjusts the voltage and frequency of each processor core to maintain safe temperatures. Each core has its own controller and the controllers act independently, without knowledge of the conditions of other cores. This permits significantly better performance than reactive approaches because DVFS can generally reduce power consumption by the same amount as clock throttling with a smaller performance penalty. In fact, their results indicate that, compared with a stop-go based thermal control policy, distributed DVFS improves throughput by a factor of 2.5. However, independent local control has limitations. The power consumed in one processor can impact the temperatures of other processors in nonuniform ways. As a result, continuously-engaged global control can permit better performance than continuously-engaged local control. This is especially true for 3D architectures, in which the power consumption of a particular processor core has great impact on the temperature of vertically-aligned cores and relatively less impact on other cores.

ThermOS uses continuously-engaged, distributed global/local control to maximize performance given a temperature bound. It supports both 3D and 2D architectures. It differs from state-of-the-art temperature control techniques in two primary ways. First, it uses global power budgeting that takes into account the thermal interaction between processor cores. Second, it directs temperature-aware migration of threads among processor cores.

Figure 4.4 shows the 3D CMP run-time instruction throughput (BIPS: billion instructions per second) achieved by ThermOS and by Donald and Martonosi's approach. Compared to the distributed local approach, ThermOS improves instruction throughput by 29.84% on average (ranging from 15.22% to 53.79%). This can be explained

as follows. In 3D CMPs, the strong thermal correlation among inter-layer vertically-aligned processor cores has significant impact on the temperature of the processor layer farthest from the heat sink. Using the proposed power thermal budgeting and thermal-aware workload migration techniques, ThermOS determines appropriate power budgets for each group of vertically-aligned processor cores. In addition, it uses DVFS to optimize the power thermal efficiency of each processor core. Together, these techniques maximize overall throughput.

[Figure 4.4: Comparison of ThermOS and Distributed Approach [28, 134]. Axis: Throughput (BIPS). Series: ThermOS, Distributed approach. One group per test setup: hv-hipc, hv-lipc, hv-mipc1, hv-mipc2, lv-hipc, lv-lipc, lv-mipc1, lv-mipc2, media-hipc, media-mipc, MPGenc, Sphinx3, cholesky, lu, radix, water-nsquared, water-spatial.]

Donald and Martonosi's work, on the other hand, is a distributed, processor-local technique. Using this technique, each processor core regulates its power and performance to ensure local thermal safety without considering the thermal impact on neighboring cores. As a result, vertically-aligned processor cores are unable to collaboratively share the power and thermal budget, which can reduce CMP performance. In other words, when a distributed,

local management technique is used, power consumption on processor cores near the heatsink can push processor cores farther from the heatsink to their thermal limits.

[Figure 4.5: Reduction in Temperature Constraint Violations due to Local DVFS and Elimination of Temperature Constraint Violations due to Clock Throttling [134]. Axis: Thermal Violation (%). Series: with local DVFS and clock throttling; with local DVFS, without clock throttling; without local DVFS or clock throttling. One group per test setup.]

Efficiency Impact of Guaranteeing Thermal Safety

In this section, we establish an upper bound on performance by evaluating a thermal management technique with near-optimal performance, but vulnerability to temperature constraint violations due to transient changes in workload. We then show that there is only a small performance reduction resulting from the additional management techniques ThermOS uses to guarantee thermal safety.

ThermOS uses the temperature-aware workload migration and global power thermal budgeting guidelines derived earlier in this chapter. These techniques can potentially offer near-optimal run-time performance subject to a temperature constraint. However, they do not immediately react to transient workload variation occurring in individual processor cores, which may cause run-time temperature constraint violations. ThermOS uses distributed run-time thermal management techniques to guarantee thermal safety, i.e., local DVFS and clock throttling dynamically adjust the voltage and frequency of each processor core to eliminate thermal emergencies. Compared to DVFS, clock throttling is more responsive but degrades performance more for the same thermal improvement. Therefore, in ThermOS, DVFS is continuously engaged and clock throttling is invoked only when local DVFS cannot guarantee thermal safety. These techniques, however, may cause the run-time operation of the processor cores to deviate from the derived guidelines. Straying from these guidelines has the potential to reduce performance.

Figure 4.5 illustrates the levels of thermal safety achieved by various control techniques. As shown in this figure, when distributed control is disabled, the voltage and frequency of each processor core are controlled solely by global power thermal budgeting, which does not consider the temporal workload variation within each processor core. This local workload variation can cause significant run-time power variation, and therefore temperature constraint violations. Local DVFS can adapt to rapid workload variation occurring within each processor core and adjust voltage and frequency accordingly, thereby reducing run-time thermal emergencies. When clock throttling is also enabled, processor thermal emergencies are completely eliminated (see Figure 4.5).
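One plausible way to quantify thermal violations from a sampled temperature trace is the share of samples that exceed the constraint. The exact definition behind the "Thermal Violation (%)" axis is not stated in the text, so the metric below, the function name, and the sample trace are assumptions made for illustration.

```python
def violation_percent(trace_c, t_max_c=85.0):
    """Percentage of temperature samples strictly above the thermal constraint."""
    if not trace_c:
        raise ValueError("temperature trace must not be empty")
    over = sum(1 for t in trace_c if t > t_max_c)
    return 100.0 * over / len(trace_c)


# Hypothetical per-core trace sampled at a fixed interval.
trace = [84.0, 85.2, 84.9, 86.0]
print(violation_percent(trace))
```

Under this definition, a control scheme that eliminates thermal emergencies would report zero for every core and test setup.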

[Figure 4.6: Temporal Temperature Variation for Eight Processor Cores (P0 to P7) Running lv-mipc2 Using Local DVFS without (Top) and with (Bottom) Clock Throttling [134]. Axes: Temperature (°C) versus Time (ms). Panels pair vertically-aligned cores: P4/P0, P5/P1, P6/P2, P7/P3.]

To further illustrate the effectiveness of the distributed run-time control techniques, Figure 4.6 shows the run-time thermal profiles of eight processor cores when running the lv-mipc2 benchmark, with and without local clock throttling. Processors 0 to 3 are adjacent to the heatsink and processors 4 to 7 are farther from it. Local DVFS balances the CMP thermal profile, and run-time temperature constraint violations (exceeding 85 °C, a predefined thermal threshold used in this experiment) occur only rarely. When both local DVFS and clock throttling are enabled, the temperature constraint is never violated.

[Figure 4.7: Negligible CMP Instruction Throughput Reduction Resulting from Local DVFS and Clock Throttling [134]. Axis: Normalized throughput (BIPS). Series: with local DVFS and clock throttling; with local DVFS, without clock throttling; without local DVFS or clock throttling. One group per test setup.]

Figure 4.7 indicates that the performance penalty introduced by the distributed control techniques required to guarantee thermal safety is low. To help quantify the performance impact, we normalize the CMP throughput to the value achieved by

global power-thermal budgeting and then evaluate the CMP throughput with local DVFS only and with both local DVFS and clock throttling. These results indicate that local DVFS degrades instruction throughput by 0.55% on average. Since local DVFS is capable of eliminating most run-time thermal emergencies, clock throttling is rarely invoked. As shown in these figures, enabling both local DVFS and clock throttling results in an average instruction throughput penalty of only 0.60%. In summary, the proposed distributed run-time thermal control technique achieves thermal safety with little performance impact.

Robustness to Changes in 3D Integration

In order to show the robustness of ThermOS to variation in 3D integration style, we evaluated the performance improvement it provides for CMPs using front-to-back and front-to-front wafer integration (see Section 4.5.1). We simulated the proposed technique and Donald's and Martonosi's distributed local approach [28] for both integration styles using all benchmark mixes shown in Table 4.6. The average CMP instruction throughput improvement was 29.84% for front-to-back integration and 23.77% for front-to-front integration. For all combinations of benchmarks and packages, the instruction throughput improvements were greater than 7%. We conclude that ThermOS permits substantial improvements in performance over Donald's and Martonosi's distributed local technique for different 3D integration styles.

Scalability Analysis of ThermOS

ThermOS uses distributed temperature-aware workload migration, global power-thermal budgeting, and distributed run-time thermal control techniques to optimize

3D CMP throughput and guarantee thermal safety. In contrast with purely local distributed techniques, run-time power-thermal budgeting is global. This might raise concerns about the scalability of ThermOS when used on many-core 3D CMPs. In this section, we evaluate the scalability of the proposed global power-thermal budgeting technique.

Figure 4.8: Impact of Global Guidance Interval [134]. The chart plots throughput (BIPS) for hv-hipc, hv-lipc, hv-mipc1, hv-mipc2, lv-hipc, cholesky, and radix at different global guidance intervals (the legend includes 10 ms, 50 ms, and 100 ms).

Performance Impact

ThermOS periodically decides power-thermal budgets for processor cores. This involves inter-layer and intra-layer assignment. Run-time inter-layer assignment uses efficient table lookup. Intra-layer assignment uses an efficient homogeneous assignment policy, i.e., processor cores within the same layer are assigned the same power-thermal budgets. In the current setup, i.e., an eight-core 3D CMP with a 1 ms global

guidance interval, detailed simulation shows that the overall run-time overhead introduced by global power-thermal budgeting is only 0.22%. The run-time overhead of global power-thermal budgeting is inversely proportional to the run-time global guidance/budgeting interval, i.e., linearly proportional to the budgeting frequency. In general, shorter global guidance intervals can more accurately track run-time workload variation but may introduce more run-time overhead and communication contention when aggregating data from different CMP cores. It might therefore be useful to reduce this overhead by increasing the global guidance interval. In the current setup, a 1 ms guidance interval is used. This is frequent enough to allow adjustments in the global power-thermal budget before temporal workload variation can produce large temperature changes, i.e., a higher frequency is unnecessary.

To evaluate the impact of increasing the global guidance interval on system performance, we run all six benchmarks with high workload variation from Table 4.6. One low-variation benchmark (lv-hipc) is also included for the sake of comparison. The results are shown in Figure 4.8. They indicate that, for guidance intervals up to and including 100 ms, ThermOS maintains nearly identical performance. Only hv-hipc, cholesky, and radix experience noticeable performance degradation, due to their high temporal workload variation. Even for these benchmarks, changing the global guidance interval from 1 ms to 100 ms reduces CMP instruction throughput by only 1.81%, 1.06%, and 2.61% for hv-hipc, cholesky, and radix, respectively. We conclude that even if it were necessary to increase the global guidance interval by two orders of magnitude in order to maintain low global power-thermal budgeting run-time overhead in many-core 3D CMPs, there would be little reduction in thermally-safe performance.
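As a rough illustration of this trade-off, the sketch below assumes that every budgeting invocation has a fixed cost, so the fractional overhead shrinks in inverse proportion to the guidance interval. The only measured input is the 0.22% overhead reported for the 1 ms interval; the fixed-cost model and the function name `estimated_overhead_pct` are illustrative assumptions, not part of ThermOS.

```python
# Illustrative model (assumption): each budgeting invocation has a fixed cost,
# so the fractional overhead is inversely proportional to the guidance interval.

BASE_INTERVAL_MS = 1.0     # guidance interval used in the reported setup
BASE_OVERHEAD_PCT = 0.22   # overhead reported in the text for the 1 ms interval


def estimated_overhead_pct(interval_ms: float) -> float:
    """Estimate the budgeting overhead (in percent) for a given interval."""
    if interval_ms <= 0:
        raise ValueError("guidance interval must be positive")
    return BASE_OVERHEAD_PCT * BASE_INTERVAL_MS / interval_ms


if __name__ == "__main__":
    # Intervals evaluated in the guidance-interval experiment.
    for interval in (1, 10, 50, 100):
        overhead = estimated_overhead_pct(interval)
        print(f"{interval:>4} ms interval: ~{overhead:.4f}% estimated overhead")
```

Under this assumed model, a 100 ms interval would cut the estimated overhead by two orders of magnitude. The text sets that saving against measured throughput reductions of at most 2.61%.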

Figure 4.9: Impact of Lookup Table Size [134]. The chart plots throughput (BIPS) for each benchmark (water-spatial, water-nsquared, radix, lu, cholesky, Sphinx3, MPGenc, media-mipc, media-hipc, lv-mipc2, lv-mipc1, lv-lipc, lv-hipc, hv-mipc2, hv-mipc1, hv-lipc, hv-hipc) using 6 × 6, 11 × 11, and 51 × 51 lookup tables.

Storage Impact

As described earlier in this chapter, ThermOS uses an offline iterative budgeting algorithm to precompute some power-thermal budgeting decisions, which are stored in a lookup table in main memory for efficient run-time usage. This lookup table has $n^L$ entries, and each entry requires 4 B of storage. $L$ is the number of processor layers. It is expected that the number of processor layers in 3D CMPs will be limited. $n$ is the number of activity factor settings, which affects the power-thermal budgeting resolution. Higher resolution improves the accuracy of the run-time power-thermal budgeting decisions, but also increases the storage requirements for the table. In the current setup, we use a two-dimensional lookup table with 51 × 51 entries (10.4 KB), which provides sufficient resolution for accurate power-thermal budgeting. It might be useful to decrease lookup table resolution for many-core systems in

order to limit storage overhead. We evaluated the impact of decreasing lookup table resolution on thermally-safe CMP performance by running all benchmark mixes using 51 × 51, 11 × 11, and 6 × 6 tables. As shown in Figure 4.9, compared to the 51 × 51 lookup table, the 11 × 11 lookup table reduces memory usage from 10,404 B to 484 B, with an average CMP instruction throughput reduction of 0.75%. When the table is reduced to 6 × 6 entries, memory usage decreases to 144 B, with an average CMP instruction throughput reduction of 2.87%. We conclude that ThermOS requires little storage and that its performance degrades slowly with reduced lookup table size.

Figure 4.10: Impact of Floorplan Rotation [134]. The chart plots throughput (BIPS) for each benchmark (lu, cholesky, Sphinx3, MPGenc, media-mipc, media-hipc, lv-mipc2, lv-mipc1, lv-lipc, lv-hipc, hv-mipc2, hv-mipc1, hv-lipc, hv-hipc, water-spatial, water-nsquared, radix) for ThermOS and the distributed approach, each with and without floorplan rotation.
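The storage figures quoted for the lookup table follow from the entry count and the 4 B entry size. The short sketch below reproduces that arithmetic; the function name `lookup_table_bytes` and its parameter names are illustrative, and two processor layers are assumed because the tables evaluated here are two-dimensional.

```python
BYTES_PER_ENTRY = 4  # each table entry requires 4 B (from the text)


def lookup_table_bytes(settings_per_layer: int, processor_layers: int = 2) -> int:
    """Storage for a table with n**L entries, where n is the number of
    activity factor settings and L is the number of processor layers."""
    if settings_per_layer < 1 or processor_layers < 1:
        raise ValueError("n and L must be positive integers")
    return (settings_per_layer ** processor_layers) * BYTES_PER_ENTRY


if __name__ == "__main__":
    for n in (51, 11, 6):
        print(f"{n} x {n} table: {lookup_table_bytes(n)} B")
    # prints:
    # 51 x 51 table: 10404 B
    # 11 x 11 table: 484 B
    # 6 x 6 table: 144 B
```

Because the entry count is $n^L$, the same function also shows how quickly storage would grow if more processor layers were stacked at a fixed resolution, which is one reason a coarser table may be attractive for larger systems.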

Interaction with 3D CMP Floorplan Optimization

This experiment evaluates ThermOS for 3D CMPs with different floorplans. The CMP thermal profile is strongly influenced by on-die power distribution. In 3D CMPs, vertically aligned processor cores in different layers have strong thermal correlation. If all cores have identical floorplans, functional units with high power densities are vertically aligned, potentially creating local thermal hotspots. Intelligent inter-layer floorplan arrangement can potentially balance the inter-layer power profile and minimize chip peak temperature. Using the three-layer 3D CMP setup with two processor core layers and one L2 cache layer, detailed thermal analysis shows that, by rotating the floorplan of top-layer processor cores by 180 degrees, the chip power profile becomes more balanced, intra-core local hotspots are minimized, and chip peak temperature is reduced by 1.99 °C on average and by up to 4.24 °C among the multiprogramming and multithreading benchmarks. Figure 4.10 compares ThermOS and the baseline distributed technique, with and without floorplan rotation. It shows that both run-time techniques can leverage the temperature reduction offered by floorplan rotation and achieve higher throughput under the same temperature constraint. In addition, ThermOS consistently outperforms the distributed technique, by 31.45% and 29.84% on average with and without floorplan rotation, respectively.

4.7 Conclusions

3D integration has the potential to significantly improve performance and integration density. However, it will also increase power density, thereby increasing the importance of using continuously-engaged thermal management techniques. It will

also increase the heterogeneity in thermal interaction among processor cores. This requires careful consideration during thermal management policy design. We have developed a mathematical formulation for optimizing workload assignment, power-thermal budgeting, and voltage mode selection for 3D CMP thermal management. This formulation has been used to develop a continuously-engaged hardware-software thermal management solution for 3D CMPs. The proposed solution has been implemented within the Linux kernel and evaluated using full-system 3D CMP and OS simulation. Our strategy outperforms a state-of-the-art proactive thermal management technique that does not make use of power-thermal budgeting.

Chapter 5

Characterization of Single-Electron Tunneling Transistors for Designing Low-Power Embedded Systems

Minimizing power consumption is vitally important in embedded system design; power consumption determines battery lifespan. Ultra-low-power designs may even permit embedded systems to operate without batteries by scavenging energy from the environment. Moreover, managing power dissipation is now a key factor in integrated circuit packaging and cooling. As a result, embedded system price, size, weight, and reliability are all strongly dependent on power dissipation. Recent developments in nanoscale devices open new alternatives for low-power embedded system design. Among these, single-electron tunneling transistors (SETs)

hold the promise of achieving the lowest power consumption. Unfortunately, most analysis of SETs has focused on single devices instead of architectures, making it difficult to determine whether they are appropriate for low-power embedded systems. Evaluating the use of SETs in large-scale digital systems requires novel architectural and circuit design. SET-based design poses numerous challenges resulting from low driving strength, relatively large static power consumption, and reliability problems caused by random background charge effects. We propose a fault-tolerant, hybrid SET/CMOS, reconfigurable architecture, named IceFlex, that can be tailored to specific requirements and allows trade-offs among power consumption, performance requirements, operating temperature, fabrication cost, and reliability. Using IceFlex as a testbed, we characterize the benefits and limitations of SETs in embedded system designs. In particular, we focus on the use of SETs in room-temperature ultra-low-power embedded systems such as wireless sensor network nodes. We also consider higher-performance applications such as multimedia consumer electronics. We see this work as a first step in determining the potential of ultra-low-power embedded system design using SETs.

My major contributions to this chapter are the SET modeling, the SET design space characterization, and the characterization of the IceFlex architecture (Sections 5.2, 5.3.1, and others). My collaborator, Zhenyu Gu, contributed to the global/local interconnect design and the characterization of embedded applications (Section 5.4.2 and others).

Introduction

Energy consumption and thermal issues are now central issues in electronic system design. In high-performance applications, temperature affects integration density, performance, reliability, power consumption, and cost. For battery-powered embedded systems, power consumption determines system lifetime. Power consumption crises were historically solved by moving to new technologies that decreased energy per operation, allowing increases in density and eventually performance. Power and thermal concerns were primary motivations for replacing vacuum tubes with semiconductor devices in the 1960s and replacing bipolar junction transistors with CMOS in the 1990s. Although CMOS is the mainstream fabrication technology used today, as IC and system integration further increase, it will reach fabrication, power consumption, and thermal limits; it may soon be time for another transition to a dramatically different technology.

Device researchers have seen the coming challenges for CMOS devices and evaluated alternative technologies such as carbon nanotube transistors [29], nanowires [46], and single-electron tunneling transistors (SETs) [70]. The International Technology Roadmap for Semiconductors projects that SETs have the potential to achieve the lowest energy per switching event of any known device [53]. However, their use poses unique architectural, circuit design, and fabrication challenges. For example, SETs are susceptible to reliability problems caused by random background offset charges. They have cyclic I-V curves (see Figure 5.2) that can complicate design but permit highly efficient implementation of some useful logic functions that have proven inefficient using CMOS and threshold logic. Although the fabrication of SETs capable of operating at low temperatures is now common, feature

sizes of only a few nanometers are required for room-temperature operation, making fabrication challenging.

Past Work

Since the discovery of SETs in the 1980s [9, 33], there has been extensive research on their fabrication, design, and modeling [70]. SET fabrication and use in high-sensitivity amplifiers at cryogenic temperatures has been the main research focus [25]. SETs and simple circuits with a variety of structures were proposed and fabricated using different methods and materials [80, 105, 6]. Recently, researchers have fabricated SETs that operate at room temperature [75, 98, 84]. Various SET-based circuit applications, such as logic [111, 112, 79, 19] and memory [126, 118, 122], have been developed. These works provide a promising start for SET circuit design. However, these articles did not provide an architectural evaluation. We do not claim to have improved the performance of SET-based logic gates. Instead, we are the first to develop the modules necessary to support architectural design and synthesis and to evaluate the architectural performance and power consumption implications of using SETs. Our results demonstrate orders-of-magnitude improvements in power consumption and energy efficiency compared to CMOS.

SET modeling and simulation has been an active research area. Monte Carlo simulation has been widely used to model SETs. SIMON [117] and MOSES [17] are the two most popular SET simulators. However, they are too slow for analysis of large circuits. Uchida et al. proposed an analytical SET model and incorporated it into SPICE [113]. Recently, Inokawa et al. extended this model to a more general form to include asymmetric SETs [49]. Mahapatra et al. propose a simulation framework for

hybrid SET/CMOS circuit design and analysis [73]. Their model for SET behavior is similar to that of Uchida et al. These compact modeling techniques are efficient enough for use in SET circuit design and analysis and closely match Monte Carlo simulation results.

Significant challenges still remain for large-scale integration of SETs and for room-temperature operation. SETs that operate reliably at room temperature have critical dimensions of 1–10 nm. They are challenging to fabricate using current top-down lithographic techniques. However, several exciting advances make the evaluation of architectures for high-density logic based on SETs worthwhile. Scanning-probe microscopes can be used to create devices smaller than those using conventional lithography [75]. Continual progress has been made on bottom-up nano-fabrication techniques, where chemical techniques are used to make individual molecules with useful electronic properties. Molecular quantum dots [40] can display SET behavior. Larger structures, such as carbon nanotubes and nanowires, can act as SETs [6]. These bottom-up techniques can create structures supporting room-temperature SET operation. However, more research is needed in order to integrate individual devices into large-scale circuits. Very recent advances in graphene [35] devices show promise for SETs.

Reliable methods for cooling to very low temperatures without supplies of liquid helium or nitrogen are also becoming more common [114]. For high-performance computing, the added complexity of operating at cryogenic temperatures may not be a limiting factor. Similarly, cryogenic temperatures are readily attained using passive methods in outer space.
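To make the link between device size and operating temperature more concrete, the sketch below compares the single-electron charging energy of an island with the thermal energy. Single-electron effects are observable only when the charging energy is well above the thermal energy, which is why room-temperature operation calls for extremely small capacitances and hence nanometer-scale features. The capacitance values, the factor-of-ten margin, and the use of e^2/C rather than e^2/(2C) for the charging energy are illustrative assumptions, not values taken from this chapter.

```python
ELEMENTARY_CHARGE_C = 1.602176634e-19  # coulombs (exact SI value)
BOLTZMANN_J_PER_K = 1.380649e-23       # joules per kelvin (exact SI value)


def charging_to_thermal_ratio(total_capacitance_f: float, temperature_k: float) -> float:
    """Ratio of the charging energy e^2/C to the thermal energy k_B*T."""
    charging_energy_j = ELEMENTARY_CHARGE_C ** 2 / total_capacitance_f
    thermal_energy_j = BOLTZMANN_J_PER_K * temperature_k
    return charging_energy_j / thermal_energy_j


def max_operating_temperature_k(total_capacitance_f: float, margin: float = 10.0) -> float:
    """Highest temperature at which e^2/C still exceeds k_B*T by `margin`.

    The default margin of 10 is an assumed rule of thumb.
    """
    return ELEMENTARY_CHARGE_C ** 2 / (margin * BOLTZMANN_J_PER_K * total_capacitance_f)


if __name__ == "__main__":
    # Hypothetical total island capacitances, in farads (1 aF = 1e-18 F).
    for capacitance_f in (1e-16, 1e-17, 1e-18, 1e-19):
        ratio = charging_to_thermal_ratio(capacitance_f, 300.0)
        t_max = max_operating_temperature_k(capacitance_f)
        print(f"C = {capacitance_f:.0e} F: e^2/C is {ratio:.2f} x k_B*T at 300 K, "
              f"T_max ~ {t_max:.1f} K with a 10x margin")
```

In this simplified picture, each tenfold reduction in island capacitance raises the usable temperature tenfold, which matches the qualitative trade-off between cryogenic operation and very small feature sizes discussed above.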

Contributions

In this chapter, we explore the potential use of SETs in low-power embedded systems. In order to take advantage of the power efficiency of SETs, it is critical to bring SET-based design to the system level, characterize the impacts of SETs on system design metrics, and evaluate the benefits and limitations of SETs. Our work starts from design space characterization of SET-based architectures. We evaluate the impacts of using SETs upon architectural, circuit-level, and device-level design, considering metrics such as energy efficiency, performance, reliability, maximum operating temperature, and ease of fabrication.

Based on our evaluation of the architectural and circuit-level features that can most effectively exploit the strengths of SETs while working within their limitations, we propose a fault-tolerant, reconfigurable, hybrid SET/CMOS based architecture called IceFlex. IceFlex is regular and cell-based. It is reconfigurable, permitting compensation for fabrication defects. It incorporates flexible, modular circuits to enable tolerance of run-time faults. In addition to compensating for the weaknesses of SETs, IceFlex exploits their strengths, e.g., we develop a two-set design to implement Boolean functions that are not linearly separable.

We tailor IceFlex to both high-performance and battery-powered embedded systems and characterize its energy efficiency, performance, and power consumption by using it for a number of instruction processors and application-specific cores. Compared to CMOS-based designs, IceFlex improves energy efficiency by two orders of magnitude for both battery-powered and high-performance applications, while maintaining good performance. However, our results also indicate great challenges to the use of SET-based designs in portable embedded systems. Their use will either require

advances in compact cooling technologies or the fabrication of features with sizes approaching physical limits.

Figure 5.1: SET Structure and Schematic [133]. The figure labels the island, source (S), drain (D), gate (G), optional second gate (G2), and tunnel junctions, with the following parameters: $C_G$, gate capacitance; $C_{G2}$, optional second gate capacitance; $C_S$, source tunnel junction capacitance; $C_D$, drain tunnel junction capacitance; $R_S$, source tunnel junction resistance; $R_D$, drain tunnel junction resistance.

5.2 SET Modeling

In this section, we introduce the physical properties of SETs and discuss analytical SET device modeling.

SET Basics

The operation of a single-electron tunneling device is governed by the Coulomb charging effect. As shown in Figure 5.1, a single-electron tunneling device consists of a nanometer-scale conductive island embedded in an insulating material. Electrons travel between the island, source (S), and drain (D) through thin insulating tunnel junctions. When an electron tunnels into the island, the overall electrostatic potential of the island increases by $e^2/C_\Sigma$, where $e$ is the elementary charge and $C_\Sigma$ is the island
