Tradeoff between Reliability and Power Management

1 Tradeoff between Reliability and Power Management 9/1/2005 FORGE Lee, Kyoungwoo

2 Contents 1. Overview of the relationship between reliability and power management 2. Dakai Zhu, Rami Melhem and Daniel Mossé, "The Effects of Energy Management on Reliability in Real-Time Embedded Systems", in ICCAD 2004 3. Kresimir Mihic, Tajana Simunic and Giovanni De Micheli, "Reliability and Power Management of Integrated Systems", in DSD 2004

3 1. Overview [Figure: power management techniques (DPM, DVS, DFS) and reliability approaches (checkpointing, redundancy) shown against the system layers (application, middleware, operating system, hardware, network), each layer carrying a PA label and an FT label; the concerns compared are power, reliability, time, and performance.]

4 1. Overview (cont.) Relationship between power management techniques and reliability approaches. Voltage scaling decreases reliability: it increases the soft error rate and reduces the number of possible recoveries. Checkpointing and recovery mechanisms increase energy consumption: frequent fault-tolerance activity decreases energy savings. Dynamic power management affects reliability: low-power modes present a lower failure rate than the active mode, but frequent transitions between active and low-power states cause a larger number of failures. Redundant modules increase energy usage: without PM, they cause a significant increase in average power consumption.

5 2. "The Effects of Energy Management on Reliability in Real-Time Embedded Systems" in ICCAD 2004 Dakai Zhu, Rami Melhem and Daniel Mossé, from the PARTS project at the University of Pittsburgh

6 Outline: Motivation; Problem; Main Idea; Simulation Results; Conclusion and Contribution

7 Motivation Autonomous, critical real-time embedded applications (satellite and surveillance systems) need both high reliability and low energy consumption. Slack time in real-time applications can be used either for reliability, through temporal redundancy such as recovery, or for energy saving, through DVS (Dynamic Voltage Scaling). Power management and reliability trade off against each other: frequent fault-tolerance activity increases energy consumption, while saving more energy through DVS or DFS lowers reliability because less slack time is left for fault tolerance. In addition, the rate of soft errors depends on operating frequency and supply voltage.

8 Soft Error Also called a transient fault or single-event upset: a glitch in a semiconductor device. A charged particle strikes an electronic circuit and changes the amount of charge stored at sensitive nodes, which affects the logic state. Soft errors are random, non-catastrophic, non-destructive, and recoverable. Causes: radiation, neutrons, alpha particles, high-energy cosmic rays, solar particles. Soft Error Rate (SER): 50,000 Failures in Time (FIT*) per complex chip (about once every two years); 2-4 soft errors per year in a server with 256 MB of DRAM; one error per week for 1 GB of memory at 0.25 µm; IBM experiments suggest even higher error rates. The SER becomes worse with shrinking geometries, higher-density circuits, and lower supply voltages. FIT*: one FIT corresponds to one error in a billion hours, or about 114,000 years. Acknowledgement: Photo by Tom Way and Julie Lee, Courtesy of IBM Corporation. Picture from De Micheli.
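
The FIT definition on this slide can be turned into a mean time between failures with one division. The following is a minimal sketch; the function names and the 8,760-hour year (leap years ignored) are assumptions made for illustration, not part of the original slides.

```python
HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored (assumption)


def fit_to_mtbf_hours(fit: float) -> float:
    """Mean time between failures, in hours, for a rate given in FIT.

    One FIT is one failure per 1e9 device-hours.
    """
    if fit <= 0:
        raise ValueError("FIT rate must be positive")
    return 1e9 / fit


def fit_to_mtbf_years(fit: float) -> float:
    """Same conversion, expressed in years."""
    return fit_to_mtbf_hours(fit) / HOURS_PER_YEAR


if __name__ == "__main__":
    for fit in (1, 50_000):
        print(f"{fit} FIT -> {fit_to_mtbf_years(fit):,.2f} years between failures")
```

With these assumptions, 1 FIT works out to roughly 114,000 years between failures, and 50,000 FIT to a little over two years, which matches the "once every two years" figure quoted for a complex chip.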

9 Problem Explore the tradeoff between reliability and energy consumption in real-time embedded systems, taking fault-rate changes into account. Propose two fault-rate models, one for frequency scaling and one for voltage scaling. Analyze the effects of energy management on reliability.

10 Main Work Propose and analyze models for the application, power, and faults. Application model: a frame-based real-time application (with slack time). Power model: $P = P_s + h\,(P_{ind} + P_d)$, with DFS and DVS modeled by normalized $f$ and $V$ ($f_{max} = 1$, $V_{max} = 1$). Fault model: a linear model for DFS (fixed voltage), $\lambda(f, V) = \lambda(f) = \lambda_0 f^b$, and an exponential model for DVS, $\lambda(f, V) = \lambda(f) = \lambda_0 \cdot 10^{d(1-f)/(1-f_{min})}$. Compare the effects of frequency scaling and voltage scaling on reliability with respect to performability and energy consumption. Performability is the probability of finishing the application correctly within its deadline in the presence of faults: $R_f = 1 - \rho_f^{k_f+1}$, where $\rho_f^{k_f+1}$ is the probability of having fault(s) during every execution, that is, the original execution (1) plus the recovery executions ($k_f$).

11 Application Model A frame-based real-time application that is executed repeatedly within every frame. $L$: worst-case execution time at $f_{max}$. $D$: deadline ($L \le D$). Assumption: execution time is inversely proportional to frequency (e.g. if the frequency is reduced by half, the execution time doubles).

12 Power Model $P = P_s + h\,(P_{ind} + P_d)$, where $P_d = C V^2 f$. $P_s$: sleep power. $P_{ind}$: frequency-independent active power. $P_d$: frequency-dependent active power. Assumptions: the system is always on because of the large overhead of turning the system on and off, and $P_s$ is taken as zero since it does not affect energy savings.

13 Energy Model (Frequency Scaling) Save $P_d$ by reducing $f$ without changing $V$. At frequency $f$ the power is $P = P_{ind} + C V_{max}^2 f$ and the execution time is $T = L\,(f_{max}/f) = L/f$ since $f_{max} = 1$, so $E = P \cdot T = P_{ind}\,(L/f) + C V_{max}^2 L$. Frequency scaling therefore consumes more energy to execute an application at a lower frequency.

14 Energy Model (Voltage Scaling) Reduce the supply voltage along with the frequency. For frequency $f$, the corresponding voltage is $V = f \cdot V_{max} = f$, since $V_{max} = 1$ under normalization. The power is $P = P_{ind} + C V^2 f$ and the execution time is $T = L\,(f_{max}/f) = L/f$, so $E = P \cdot T = P_{ind}\,(L/f) + C f^2 L$. A lower frequency allows a lower supply voltage and hence less frequency-dependent energy, but it also means more execution time and hence more frequency-independent energy.
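
The two energy models (frequency scaling on the previous slide, voltage scaling here) are easy to compare numerically. The sketch below assumes the normalization used in the slides ($f_{max} = V_{max} = 1$) and, as an illustrative default, $C = 1$; the function names are hypothetical.

```python
def _check_frequency(f: float) -> None:
    """Normalized frequency must lie in (0, 1]."""
    if not 0.0 < f <= 1.0:
        raise ValueError("normalized frequency must be in (0, 1]")


def energy_frequency_scaling(f: float, L: float, p_ind: float,
                             c: float = 1.0, v_max: float = 1.0) -> float:
    """DFS: E = P_ind * (L / f) + C * V_max^2 * L (voltage stays at V_max)."""
    _check_frequency(f)
    return p_ind * (L / f) + c * v_max ** 2 * L


def energy_voltage_scaling(f: float, L: float, p_ind: float,
                           c: float = 1.0) -> float:
    """DVS: V = f under normalization, so E = P_ind * (L / f) + C * f^2 * L."""
    _check_frequency(f)
    return p_ind * (L / f) + c * f ** 2 * L


if __name__ == "__main__":
    L, p_ind = 30.0, 0.2  # L = 30 and P_ind = 0.2 appear in the simulation setup
    for f in (1.0, 0.75, 0.5):
        print(f, energy_frequency_scaling(f, L, p_ind),
              energy_voltage_scaling(f, L, p_ind))
```

For $L = 30$ and $P_{ind} = 0.2$, halving the frequency raises the DFS energy from 36 to 42, while the DVS energy drops to 19.5, which illustrates why only voltage scaling saves energy in this model.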

15 Fault Model The amount of slack time is $D - L$ (deadline minus worst-case execution time). Performability: the probability of finishing the application correctly within its deadline in the presence of faults. Average fault rate at $f$ and $V$: $\lambda(f, V) = \lambda_0\, g(f, V)$, where $\lambda_0$ is the average fault rate corresponding to $V_{max}$ and $f_{max}$.

16 Linear Model for DFS The fault rate was shown to decrease linearly when the frequency is reduced: the safety margin in clock cycles becomes relatively larger when the frequency decreases. $\lambda(f, V) = \lambda(f) = \lambda_0 f^b$, where $b$ is a constant. When $b = 1$, the fault rate increases linearly with the frequency.

17 Exponential Model for DVS The fault rate in processors as well as memory increases exponentially when the supply voltage decreases. Reason: with a reduced supply voltage the critical charge becomes smaller, which results in an exponentially increased fault rate. $\lambda(f, V) = \lambda(f) = \lambda_0 \cdot 10^{d(1-f)/(1-f_{min})}$ with $V = f \cdot V_{max} = f$ ($V_{max} = 1$), so $\lambda_{max} = \lambda_0 \cdot 10^{d}$ at $f = f_{min}$. Reducing the supply voltage for a lower frequency results in exponentially increased fault rates. A larger $d$ indicates that the fault rate is more sensitive to voltage scaling.
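
Both fault-rate models can be written as small functions of the normalized frequency. This is a sketch under the slides' normalization; the function names are hypothetical, and the value of $f_{min}$ used in the example (0.1) is an assumption, since the slides do not state one.

```python
def fault_rate_dfs(f: float, lambda0: float, b: float) -> float:
    """Frequency-scaling model from the slides: lambda(f) = lambda0 * f^b."""
    if not 0.0 < f <= 1.0:
        raise ValueError("normalized frequency must be in (0, 1]")
    return lambda0 * f ** b


def fault_rate_dvs(f: float, lambda0: float, d: float, f_min: float) -> float:
    """Voltage-scaling model: lambda(f) = lambda0 * 10^(d * (1 - f) / (1 - f_min))."""
    if not 0.0 < f_min < 1.0:
        raise ValueError("f_min must be in (0, 1)")
    if not f_min <= f <= 1.0:
        raise ValueError("normalized frequency must be in [f_min, 1]")
    return lambda0 * 10.0 ** (d * (1.0 - f) / (1.0 - f_min))


if __name__ == "__main__":
    lambda0 = 1e-6   # fault rate at f_max and V_max, from the simulation setup
    f_min = 0.1      # assumed for illustration only
    for d in (0, 2, 4, 6):
        print(d, fault_rate_dvs(f_min, lambda0, d, f_min))
```

At $f = f_{max} = 1$ the exponent vanishes and the rate is $\lambda_0$; at $f = f_{min}$ it reaches $\lambda_0 \cdot 10^{d}$, so with $d = 6$ the fault rate at the lowest speed is a million times the nominal one.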

18 Performability $R_f = 1 - \rho_f^{k_f+1}$, where $\rho_f^{k_f+1}$ is the probability of having fault(s) during every execution: the original (1) plus the recovery executions ($k_f$). $\rho_f = 1 - e^{-\lambda(f,V)\,L/f}$ is the probability of having at least one fault during one run of the application. For frequency scaling, $\rho_f = 1 - e^{-\lambda(f,V)\,L/f} = 1 - e^{-\lambda_0 L f^{b-1}}$. If $b \le 1$, $\rho_f$ does not decrease when $f$ decreases (it increases for $b < 1$ and stays constant for $b = 1$), and lower frequencies leave room for fewer recoveries; hence $R_f$ decreases when $f$ decreases. When $b > 1$, $\rho_f$ decreases when $f$ decreases; however, a lower frequency leads to fewer recoveries, which can decrease $R_f$ dramatically.

19 Performability (cont.) With voltage scaling, $\rho_f = 1 - e^{-\lambda(f,V)\,L/f} = 1 - e^{-\lambda_0 \cdot 10^{d(1-f)/(1-f_{min})}\,L/f}$. $\rho_f$ increases when the frequency $f$ decreases, and lower frequencies leave room for fewer recoveries. Therefore, $R_f$ decreases when the supply voltage is reduced for a lower frequency.
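
The performability formula can be evaluated directly once the number of recoveries $k_f$ is known. The sketch below assumes that each recovery re-executes the whole application at the same frequency, so $k_f$ is the number of extra runs of length $L/f$ that fit before the deadline $D$; that rule and the function names are assumptions for illustration, not taken from the slides.

```python
import math


def recoveries(f: float, L: float, D: float) -> int:
    """Number of full re-executions that fit in the slack (assumed recovery model)."""
    if not 0.0 < f <= 1.0:
        raise ValueError("normalized frequency must be in (0, 1]")
    if L / f > D:
        raise ValueError("the application cannot meet its deadline at this frequency")
    # Small epsilon guards against floating-point error at exact boundaries.
    return math.floor(D * f / L + 1e-9) - 1


def fault_probability(rate: float, f: float, L: float) -> float:
    """rho_f = 1 - exp(-rate * L / f): at least one fault during one run."""
    return -math.expm1(-rate * L / f)


def performability(rate: float, f: float, L: float, D: float) -> float:
    """R_f = 1 - rho_f^(k_f + 1): not every execution is hit by a fault."""
    rho = fault_probability(rate, f, L)
    k = recoveries(f, L, D)
    return 1.0 - rho ** (k + 1)


if __name__ == "__main__":
    L, D, rate = 30.0, 100.0, 1e-6  # values from the simulation setup slide
    for f in (1.0, 0.7, 0.5):
        print(f, recoveries(f, L, D), performability(rate, f, L, D))
```

The `rate` argument is the fault rate at the chosen operating point, so it should come from whichever fault model (frequency or voltage scaling) is being studied. With $D = 100$ and $L = 30$, two recoveries fit at $f = 1$ but none at $f = 0.5$, which is the mechanism behind the drop in $R_f$ at low frequencies.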

20 Energy Consumption Expected energy consumption: $EE_f = E_f\,(1 + \rho_f + \dots + \rho_f^{k_f}) = E_f\,(1 - \rho_f^{k_f+1})/(1 - \rho_f)$, with $E_f = (P_{ind} + C V^2 f)\,L/f$. For frequency scaling, $EE_f$ increases when $f$ decreases because of increased $E_f$ and/or increased $\rho_f$. Under voltage scaling, $EE_f$ is always less than at $f_{max}$. In specific cases, however, $EE_f$ may increase as frequency and voltage are lowered, when the probability of executing recoveries becomes high because of the rapidly increasing fault rate.
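
The expected-energy expression is a finite geometric series: the first execution always runs, the first recovery runs with probability $\rho_f$, the second with probability $\rho_f^2$, and so on up to $k_f$ recoveries. A minimal sketch, with hypothetical function names, that evaluates both the sum and its closed form:

```python
def expected_energy(e_f: float, rho: float, k: int) -> float:
    """EE_f = E_f * (1 + rho + rho^2 + ... + rho^k)."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be a probability")
    if k < 0:
        raise ValueError("k must be non-negative")
    return e_f * sum(rho ** i for i in range(k + 1))


def expected_energy_closed_form(e_f: float, rho: float, k: int) -> float:
    """Closed form E_f * (1 - rho^(k+1)) / (1 - rho); rho == 1 handled separately."""
    if not 0.0 <= rho <= 1.0:
        raise ValueError("rho must be a probability")
    if k < 0:
        raise ValueError("k must be non-negative")
    if rho == 1.0:
        return e_f * (k + 1)
    return e_f * (1.0 - rho ** (k + 1)) / (1.0 - rho)


if __name__ == "__main__":
    # Illustrative numbers only: E_f = 10, rho = 0.5, two recoveries.
    print(expected_energy(10.0, 0.5, 2))
    print(expected_energy_closed_form(10.0, 0.5, 2))
```

Because $\rho_f$ is tiny for realistic fault rates, $EE_f$ stays very close to $E_f$; the recovery term only matters when the fault rate has been driven up sharply, as in the large-$d$ voltage-scaling cases.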

21 Simulation Setup Assumptions: normalized frequency and voltage, $f_{max} = 1$ and $V_{max} = 1$; $P_d^{max} = C V_{max}^2 f_{max} = 1$ is the maximum frequency-dependent power. Power consumption: Pentium M processor, 25 W peak power and 1 W sleep power; Rambus memory, 300 mW active and 30 mW sleep power; $P_{ind}$ = 0.0, 0.2, and 0.4. Rate of radiation-induced faults: $10^{-6}$ at $V_{max}$ and $f_{max}$. $b$: 0.1, 1, and 10. $d$: 0, 2, 4, and 6. Application with $D = 100$ time units and $L = 30$ at $f_{max}$.

22 Simulation Results (DFS) Performability decreases monotonically as $f$ decreases when $b \le 1$. When $b > 1$, performability increases as long as a lower $f$ still allows the same number of recoveries; each sharp jump reflects one fewer recovery. The expected energy (EE) increases when $f$ decreases because of increased frequency-independent energy consumption. Scaling down the frequency with a fixed supply voltage both increases energy consumption and decreases performability, and should not be employed.

23 Simulation Results (DVS) Larger values of $d$ lead to worse performability at lower frequencies. Performability with a fixed fault rate ($d = 0$) is much better than with a variable fault rate. Decreasing frequency and voltage decreases energy consumption. For $d = 6$, energy consumption increases at lower frequencies in the range $0.6 < f < 0.7$ because of the high probability of recovery. Ignoring the effects of voltage scaling on the fault rate may lead to unmet performability requirements.

24 Conclusion and Contribution Energy management through frequency and voltage scaling has significant effects on system reliability. Ignoring the effects of energy management on the fault rate is too optimistic and may lead to unmet reliability goals. This is the first attempt to model the relationship between reliability and power management with fault-rate changes caused by voltage scaling.

25 3. "Reliability and Power Management of Integrated Systems" in DSD 2004 Kresimir Mihic, Tajana Simunic and Giovanni De Micheli, from CSL at Stanford University

26 Outline: Motivation; Problem; Main Idea; Simulation Results; Conclusion and Contribution

27 Motivation Advances in silicon technology for integrated systems such as SoCs/NoCs: higher device density (more than one billion transistors), higher operating frequency, and lower supply voltage. As a result, on-chip computation, storage, and information transmission will be subject to malfunctions. Dynamic Power Management (DPM): low-power states have a lower fault rate than the active state, but frequent transitions between active and low-power states can reduce reliability. Hence the need to evaluate reliability along with power and performance.

28 Problem 1. To analyze system-level reliability: to understand the relationship between run-time power management and reliability analysis; to model system-level reliability as a function of failure rates, system configuration, and DPM policies; to determine whether the effects of DPM are beneficial for reliability; to introduce design constraints such as MTTF (Mean Time To Failure). 2. To incorporate reliability as an objective into DPM policy optimization. 3. To choose components and topologies to achieve reliable low-energy design.

29 MTTF MTTF (Mean Time To Failure): a useful parameter to specify the quality of a system; the expected time that a system will operate before the first failure occurs. $MTTF = \frac{1}{N}\sum_{i=1}^{N} t_i$ for $N$ identical systems, where system $i$ operates for a time $t_i$ before encountering its first failure. MTBF (Mean Time Between Failures): the average time between failures of a system, that is, the total operation time $T$ divided by the average number of failures experienced during $T$. $MTBF = T / n_{avg}$ with $n_{avg} = \frac{1}{N}\sum_{i=1}^{N} n_i$, where each of the $N$ systems is operated for a time $T$ and $n_i$ is the number of failures of system $i$ during $T$. MTTR (Mean Time To Repair): the average time to repair the system and place it back into operation. $MTBF = MTTF + MTTR$.
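
The MTTF and MTBF definitions above are sample averages, so they can be estimated from observation data in a few lines. This is a sketch with hypothetical function names and made-up sample data, shown only to make the two formulas concrete.

```python
from typing import Sequence


def mttf(first_failure_times: Sequence[float]) -> float:
    """MTTF = (1/N) * sum of t_i, the time each system ran before its first failure."""
    if not first_failure_times:
        raise ValueError("need at least one observed system")
    return sum(first_failure_times) / len(first_failure_times)


def mtbf(total_time: float, failure_counts: Sequence[int]) -> float:
    """MTBF = T / n_avg, where n_avg is the mean failure count over N systems."""
    if not failure_counts:
        raise ValueError("need at least one observed system")
    n_avg = sum(failure_counts) / len(failure_counts)
    if n_avg == 0:
        raise ValueError("no failures observed; MTBF is undefined for this sample")
    return total_time / n_avg


if __name__ == "__main__":
    # Illustrative data: three systems, times in hours.
    print(mttf([100.0, 200.0, 300.0]))  # average time to first failure
    print(mtbf(1000.0, [2, 4, 6]))      # each system observed for 1000 hours
```

Given an MTBF and an MTTF measured on the same systems, the relation $MTBF = MTTF + MTTR$ then yields the mean repair time by subtraction.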

30 Main Work Overall system modeling. Reliability model: a reliability network, that is, a graph abstracting integrated systems (nodes: resources; edges: functional relations), expressed as a function of time, failure rate, and configuration (series/parallel). DPM model: a PSM (Power State Machine), expressed as a function of power consumption and time usage. The overall system is represented as a combination of the components' PSMs. System analysis is done by simulation.

31 Failure Rate The rate at which components are likely to fail. Assuming a unit works correctly in $[0, t]$, the failure rate $\lambda(t)$ is the conditional probability, per unit time, that the unit fails in $[t, t + \Delta t]$. Typically the failure rate $\lambda$ depends on: temperature (exponentially); environmental exposure; soft errors; mechanical and thermal stress; time (burn-in and aging). Often the component failure rate is assumed to be constant for simplicity.

32 Reliability of Integrated Systems A reliability network: a graph abstracting integrated systems. Nodes: resources (computation, storage, and communication resources). Edges: functional relations. Series/parallel configurations.

33 Example For reliability analysis, consider a system that consists of two components, a processor and a cache. All components have to be up at the same time to accomplish the mission, so the two components form a series configuration. The system reliability is the product of the component reliabilities (if the failures are independent): $R(t) = R_1(t)\,R_2(t)$. Assuming constant failure rates, the system failure rate is the sum of the failure rates, and the MTTF is its inverse, $1/(\lambda_1 + \lambda_2)$. Now consider a system that consists of two processors, where one working processor suffices to accomplish the mission: a parallel configuration. The system unreliability is the product of the component unreliabilities: $R(t) = 1 - [1 - R_1(t)]\,[1 - R_2(t)]$. Assuming constant failure rates, the MTTF is $1/\lambda_1 + 1/\lambda_2 - 1/(\lambda_1 + \lambda_2)$. Other relevant configurations: standby, triple modular redundancy.
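
The series and parallel cases can be checked numerically. The sketch below assumes constant, independent failure rates, so each component has $R_i(t) = e^{-\lambda_i t}$; integrating the two-component parallel reliability $e^{-\lambda_1 t} + e^{-\lambda_2 t} - e^{-(\lambda_1 + \lambda_2) t}$ over time gives an MTTF of $1/\lambda_1 + 1/\lambda_2 - 1/(\lambda_1 + \lambda_2)$. Function names and the example rates are hypothetical.

```python
import math
from typing import Sequence


def series_reliability(t: float, rates: Sequence[float]) -> float:
    """All components must work: R(t) is the product of exp(-lambda_i * t)."""
    return math.exp(-sum(rates) * t)


def series_mttf(rates: Sequence[float]) -> float:
    """The series failure rate is the sum of the rates; the MTTF is its inverse."""
    return 1.0 / sum(rates)


def parallel2_reliability(t: float, rate1: float, rate2: float) -> float:
    """One working component suffices: R = 1 - (1 - R1) * (1 - R2)."""
    r1, r2 = math.exp(-rate1 * t), math.exp(-rate2 * t)
    return 1.0 - (1.0 - r1) * (1.0 - r2)


def parallel2_mttf(rate1: float, rate2: float) -> float:
    """Integral of the parallel R(t) over [0, inf) for constant failure rates."""
    return 1.0 / rate1 + 1.0 / rate2 - 1.0 / (rate1 + rate2)


if __name__ == "__main__":
    lam = 1e-3  # illustrative failure rate, failures per hour
    print(series_mttf([lam, lam]))   # two components in series
    print(parallel2_mttf(lam, lam))  # two components in parallel
```

With two identical components of rate $10^{-3}$ per hour, the series MTTF is 500 hours and the parallel MTTF is 1,500 hours, compared with 1,000 hours for a single component.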

34 SA-1100 DPM modeling: single-unit PSM (Power State Machine). RUN: operational at different $f$ and $V$. IDLE: a software routine may stop the CPU when it is not in use, while still monitoring interrupts. SLEEP: shutdown of on-chip activity.

35 Overall System Modeling Combine the power-state machine model with the reliability model. The overall system is represented as a combination of the components' PSMs. Failure rates depend on the system state. System control aims at setting the system state so as to increase energy efficiency and to enhance reliability.

36 System Model

37 Simulation Processing engine with four cores, two of which are redundant (active or standby).

38 Simulation Results A low-power state (standby) presents higher reliability, but PM does not necessarily deliver better reliability. In Figure 5, for larger feature sizes, aggressive PM improves power consumption at little cost in system reliability. In Figure 6, for smaller feature sizes, it is critical to carefully trade off the design of PM against reliability.

39 Simulation Results (cont.) The system with standby redundancy shows better reliability than the system with active redundancy. Frequent switching causes a proportionally larger probability of core failure.

40 Conclusion and Contribution This is the first attempt to model reliability measures jointly with DPM. System-level reliability is determined as a function of failure rates, configuration, and PM policies. A strong relationship is shown between PM policy and system reliability. This methodology enables designers to quickly evaluate their system design in terms of three main objectives: minimum power consumption, maximum system reliability, and optimum performance.

41 Discussion Relationship between power management techniques and reliability approaches. Voltage scaling decreases reliability: it increases the soft error rate and reduces the number of possible recoveries. Checkpointing and recovery mechanisms increase energy consumption: frequent fault-tolerance activity decreases energy savings. Dynamic power management affects reliability: low-power modes present a lower failure rate than the active mode, but frequent transitions between active and low-power states cause a larger number of failures. Redundant modules increase energy usage: without PM, they cause a significant increase in average power consumption.

42 Relationship between Power, Reliability and Time [Figure: a power vs. time plot and a fault rate vs. time plot, both running up to $T_{deadline}$. Power plot: active phases at maximum, high, and low speed plus idle and sleep phases, separated by transition intervals $T_{tr}$, with $P_{act,1/2max} = \frac{1}{4} P_{act,max}$ and $P_{act,1/4max} = \frac{1}{16} P_{act,max}$. Fault rate plot: levels $r_{act,max}$, $r_{act,high}$, $r_{act,low}$, $r_{idle}$, $r_{sleep}$, and $r_{tr}$, annotated as follows: decreasing $V$ increases the fault rate exponentially; highly frequent transitions decrease reliability; low-power modes present a lower fault rate than high-power modes such as the active mode.]
