Fault-Tolerant Computing Motivation, Background, and Tools Slide 1
About This Presentation This presentation has been prepared for the graduate course ECE 257A (Fault-Tolerant Computing) by Behrooz Parhami, Professor of Electrical and Computer Engineering at University of California, Santa Barbara. The material contained herein can be used freely in classroom teaching or any other educational setting. Unauthorized uses are prohibited. Behrooz Parhami Edition Released Revised Revised First Slide 2
for Dependability Slide 3
Slide 4
Impairments to Dependability Hazard Bug Fault Degradation Defect ERROR Malfunction Crash Intrusion Slide 5 Flaw
The Fault-Error- Cycle Includes both components and design Aspect Structure State Behavior Impairment Fault Error Fault 0 Correct signal 0 0 Replaced with NAND? Schematic diagram of the Newcastle hierarchical model and the impairments within one level. Slide 6
The Four-Universe Model Universe Physical Logical Informational External Impairment Fault Error Crash Cause-effect diagram for Avižienis four-universe model of impairments to dependability. Slide 7
Unrolling the Fault-Error- Cycle Abstraction Impairment Aspect Structure State Behavior Impairment Fault Error First Cycle Second Cycle Component Logic Information System Service Result Defect Fault Error Malfunction Degradation Low- Level Mid- Level High- Level Cause-effect diagram for an extended six-level view of impairments to dependability. Slide 8
Multilevel Model Ideal Component Logic Information System Legend: Legned: Initial Entry Entry Deviation Defective Faulty Erroneous Malfunctioning Low-Level Impaired Mid-Level Impaired Service Result Remedy Tolerance Degraded Failed High-Level Impaired Slide 9
Analogy for the Multilevel Model An analogy for our multi-level model of dependable computing. Defects, faults, errors, malfunctions, degradations, and failures are represented by pouring water from above. Valves represent avoidance and tolerance techniques. The goal is to avoid overflow. Wall heights represent inter-level latencies Concentric reservoirs are analogs of the six model levels, with defect being innermost Inlet valves represent avoidance techniques I I I I I I I I I I I Drain valves represent tolerance techniques I Slide 10
Why Our Concern with Dependability? Reliability of n-transistor system, each having failure rate λ R(t) = e nλt There are only 3 ways of making systems more reliable Reduce λ Reduce n Reduce t Alternative: Change the reliability formula by introducing redundancy in system e n λ t 1.0 0.8 0.6 0.4 0.2 0.0 10 4.9999.9990.9900 6 10 nt 8 10.9048.3679 10 10 Slide 11
Highly Dependable Computer Systems Long-life systems: Fail-slow, Rugged, High-reliability Spacecraft with multiyear missions, systems in inaccessible locations Methods: Replication (spares), error coding, monitoring, shielding Safety-critical systems: Fail-safe, Sound, High-integrity Flight control computers, nuclear-plant shutdown, medical monitoring Methods: Replication with voting, time redundancy, design diversity High-availability: Fail-soft, Robust, High-availability Telephone switching centers, transaction processing, e-commerce Methods: HW/info redundancy, backup schemes, hot-swap, recovery Just as performance enhancement techniques gradually migrate from supercomputers to desktops, so too dependability enhancement methods find their way from exotic systems into personal computers Slide 12
Aspects of Dependability Safety Resilience Risk, consequence Testability Security Controllability, observability Availability Pointwise av., Interval av., MTBF, MTTR RELIABILITY Reliability, MTTF = MTFF Maintainability Performability Robustness Integrity Performability, MCBF Slide 13 Serviceability
Concepts from Probability Theory Probability density function: pdf f(t) = prob[t x t + dt] / dt = df(t) / dt Cumulative distribution function: CDF t F(t) = prob[x t] = 0 f(x) dx Liftimes of 20 identical systems Expected value of x + E x = xf(x) dx = k x k f(x k ) Variance of x 2 + σ x = (x E x ) 2 f(x) dx = k (x k E x ) 2 f(x k ) Covariance of x and y ψ x,y = E [(x E x )(y E y )] = E [xy] E x E y 1.0 0.8 0.6 0.4 0.2 0 10 20 30 40 50 Time CDF 0.0 0 10 20 30 40 50 0.05 0.04 0.03 0.02 pdf f(t) Time 0.01 0.00 0 10 20 30 40 50 Time F(t) Slide 14
Some Simple Probability Distributions F(x) 1 CDF CDF CDF CDF f(x) pdf pdf pdf Uniform Exponential Normal Binomial Slide 15
Reliability and MTTF Reliability: R(t) Probability that system remains in the Good state through the interval [0, t] R(t + dt) = R(t) [1 z(t) dt] Hazard function Start State Good Two-state nonrepairable system Failed R(t) = 1 F(t) CDF of the system lifetime, or its unreliability Constant hazard function z(t) = λ R(t) = e λt (system failure rate is independent of its age) Mean time to failure: MTTF + + MTTF = 0 tf(t) dt = 0 R(t) dt Expected value of lifetime Area under the reliability curve (easily provable) Exponential reliability law Slide 16
Distributions of Interest Exponential: z(t) = λ R(t) = e λt MTTF = 1/λ Rayleigh: z(t) = 2λ(λt) R(t) = e ( λt)2 MTTF = (1/λ) π / 2 Discrete versions Geometric R(k) = q k Weibull: z(t) = αλ(λt) α 1 R(t) = e ( λt)α MTTF = (1/λ) Γ(1 + 1/α) Discrete Weibull Erlang: MTTF = k/λ Gamma: Erlang and exponential are special cases Normal: Reliability and MTTF formulas are complicated Binomial Slide 17
Reliability difference: R 2 R 1 Comparing Reliabilities Reliability gain: R 2 / R 1 Reliability improvement factor RIF 2/1 = [1 R 1 (t M )] / [1 R 2 (t M )] Example: [1 0.9] / [1 0.99] = 10 Reliability improv. index RII = log R 1 (t M ) / log R 2 (t M ) Mission time extension MTE 2/1 (r G ) = T 2 (r G ) T 1 (r G ) System Reliability (R) 1.0 R 2(t M) r R (t ) 1 G M Reliability functions for Systems 1/2 R (t) 2 R (t) 1 Mission time improv. factor: MTIF 2/1 (r G ) = T 2 (r G ) / T 1 (r G ) 0.0 T 1(r ) t T (r ) G M 2 G MTTF 2 Time (t) MTTF 1 Slide 18
Availability, MTTR, and MTBF (Interval) Availability: A(t) Fraction of time that system is in the Up state during the interval [0, t] Steady-state availability: A = lim t A(t) Pointwise availability: a(t) Probability that system available at time t t A(t) = (1/t) 0 a(x) dx Start State Two-state repairable system Up Repair Down Availability = Reliability, when there is no repair Availability is a function not only of how rarely a system fails (reliability) but also of how quickly it can be repaired (time to repair) MTTF MTTF μ A = = = MTTF + MTTR MTBF λ + μ In general, μ >> λ, leading to A 1 Repair rate 1/μ = MTTR (Will justify this equation later) Slide 19
System Up and Down Times Short repair time implies good Maintainability (serviceability) Start State Up Repair Down Time to first failure Time between failures Repair time Up Down 0 t 1 t' 1 t 2 t' 2 t Time Slide 20
Performability and MCBF Performability: P Composite measure, incorporating both performance and reliability Simple example Worth of Up2 twice that of Up1 t p Upi = probability system is in state Upi P = 2p Up2 + p Up1 Start State Repair Up 2 Up 1 Partial failure Three-state degradable system Partial repair Down Question: What is system availability here? p Up2 = 0.92, p Up1 = 0.06, p Down = 0.02, P = 1.90 (system performance equiv. To that of 1.9 processors on average) Performability improvement factor of this system (akin to RIF) relative to a fail-hard system that goes down when either processor fails: PIF = (2 2 0.92) / (2 1.90) = 1.6 Slide 21
System Up, Partially Up, and Down Times Important to prevent direct transitions to the Down state (coverage) Start State Repair Up 2 Up 1 Partial failure Partial repair Down Up Partial Partially Up Total Partial Repair Down 0 t 1 t 2 t' 2 t' 1 t 3 t' 3 t Time Slide 22
Integrity and Safety Risk: Prob. of being in Unsafe Failed state There may be multiple unsafe states, each with a different consequence (cost) Simple analysis Lump Safe Failed state with Good state; proceed as in reliability analysis More detailed analysis Even though Safe Failed state is more desirable than Unsafe Failed, it is still not as desirable as the Good state; so keeping it separate makes sense Start State Good Three-state fail-safe system Safe Failed Unsafe Failed For example, if a repair transition is introduced between Safe Failed and Good states, we can tackle questions such as the expected outage of the system in safe mode, and thus its availability Slide 23