Fault-Tolerant Computing


Fault-Tolerant Computing: Motivation, Background, and Tools Slide 1

About This Presentation
This presentation has been prepared for the graduate course ECE 257A (Fault-Tolerant Computing) by Behrooz Parhami, Professor of Electrical and Computer Engineering at University of California, Santa Barbara. The material contained herein can be used freely in classroom teaching or any other educational setting. Unauthorized uses are prohibited.
Behrooz Parhami. Edition: First. Slide 2

for Dependability Slide 3

Slide 4

Impairments to Dependability
Hazard, Bug, Fault, Degradation, Defect, Error, Malfunction, Crash, Intrusion, Flaw Slide 5

The Fault-Error-Failure Cycle
Aspect: Structure (includes both components and design), State, Behavior
Impairment: Fault, Error, Failure
[Figure: schematic diagram of the Newcastle hierarchical model and the impairments within one level; the example circuit is annotated "Correct signal 0" and "Replaced with NAND?"] Slide 6

The Four-Universe Model
Universe: Physical, Logical, Informational, External
Impairment: Failure, Fault, Error, Crash
[Figure: cause-effect diagram for Avižienis's four-universe model of impairments to dependability] Slide 7

Unrolling the Fault-Error-Failure Cycle
Abstraction: Component, Logic, Information, System, Service, Result
Impairment: Defect, Fault, Error, Malfunction, Degradation, Failure
Level: Low-level (Component, Logic), Mid-level (Information, System), High-level (Service, Result)
The aspects Structure, State, and Behavior repeat over a first cycle and a second cycle.
[Figure: cause-effect diagram for an extended six-level view of impairments to dependability] Slide 8

Multilevel Model
Level: Component, Logic, Information, System, Service, Result
State: Ideal; Defective, Faulty (low-level impaired); Erroneous, Malfunctioning (mid-level impaired); Degraded, Failed (high-level impaired)
Legend: Initial entry, Deviation, Remedy, Tolerance Slide 9

Analogy for the Multilevel Model
An analogy for our multilevel model of dependable computing. Defects, faults, errors, malfunctions, degradations, and failures are represented by pouring water from above. Valves represent avoidance and tolerance techniques. The goal is to avoid overflow.
Concentric reservoirs are analogs of the six model levels, with defect being innermost.
Wall heights represent inter-level latencies.
Inlet valves represent avoidance techniques.
Drain valves represent tolerance techniques. Slide 10

Why Our Concern with Dependability?
Reliability of an n-transistor system, each transistor having failure rate λ: $R(t) = e^{-n\lambda t}$
There are only 3 ways of making systems more reliable: reduce λ, reduce n, reduce t.
Alternative: change the reliability formula by introducing redundancy in the system.
[Figure: $e^{-n\lambda t}$ (vertical axis, 0.0 to 1.0) plotted against nt from $10^4$ to $10^{10}$, with marked values .9999, .9990, .9900, .9048, .3679] Slide 11
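The marked values in the plot can be reproduced numerically. The following is a minimal sketch, assuming a per-transistor failure rate of λ = 10^-10 per unit time; that value is not stated on the slide and is chosen only because it makes nt = 10^6 through 10^10 line up with the marked reliabilities. The names `series_reliability` and `reliability_table` are illustrative.

```python
import math

def series_reliability(n, lam, t):
    """R(t) = exp(-n * lam * t) for n components that must all survive."""
    return math.exp(-n * lam * t)

LAM = 1e-10  # assumed failure rate per transistor (not given on the slide)

def reliability_table(nt_values, lam=LAM):
    # Only the product n*t matters, so pass n = nt and t = 1.
    return [(nt, round(series_reliability(nt, lam, 1.0), 4)) for nt in nt_values]

if __name__ == "__main__":
    for nt, r in reliability_table([1e6, 1e7, 1e8, 1e9, 1e10]):
        print(f"nt = {nt:.0e}  R = {r:.4f}")
```

Under this assumption, each tenfold increase in n or t moves the system one step down the curve, which is why reducing λ, n, or t are the only levers short of redundancy.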

Highly Dependable Computer Systems
Long-life systems: Fail-slow, Rugged, High-reliability
  Examples: spacecraft with multiyear missions, systems in inaccessible locations
  Methods: replication (spares), error coding, monitoring, shielding
Safety-critical systems: Fail-safe, Sound, High-integrity
  Examples: flight control computers, nuclear-plant shutdown, medical monitoring
  Methods: replication with voting, time redundancy, design diversity
High-availability systems: Fail-soft, Robust, High-availability
  Examples: telephone switching centers, transaction processing, e-commerce
  Methods: HW/info redundancy, backup schemes, hot-swap, recovery
Just as performance enhancement techniques gradually migrate from supercomputers to desktops, so too dependability enhancement methods find their way from exotic systems into personal computers. Slide 12

Aspects of Dependability
Reliability: MTTF = MTFF
Availability: pointwise availability, interval availability, MTBF, MTTR
Maintainability, Serviceability
Performability: MCBF
Safety: risk, consequence
Testability: controllability, observability
Security, Integrity, Robustness, Resilience Slide 13

Concepts from Probability Theory
Probability density function (pdf): $f(t) = \mathrm{prob}[t \le x \le t + dt] / dt = dF(t)/dt$
Cumulative distribution function (CDF): $F(t) = \mathrm{prob}[x \le t] = \int_0^t f(x)\,dx$
Expected value of x: $E_x = \int_{-\infty}^{+\infty} x f(x)\,dx = \sum_k x_k f(x_k)$
Variance of x: $\sigma_x^2 = \int_{-\infty}^{+\infty} (x - E_x)^2 f(x)\,dx = \sum_k (x_k - E_x)^2 f(x_k)$
Covariance of x and y: $\psi_{x,y} = E[(x - E_x)(y - E_y)] = E[xy] - E_x E_y$
[Figure: lifetimes of 20 identical systems, with the corresponding CDF F(t) and pdf f(t) plotted against time from 0 to 50] Slide 14
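The discrete forms of these definitions can be tried on a small sample. This is a minimal sketch in which every observed lifetime is treated as equally likely, so f(x_k) = 1/N; the lifetime values and helper names are hypothetical and do not come from the slides.

```python
def empirical_cdf(samples, t):
    """Fraction of samples with lifetime <= t, an estimate of F(t)."""
    return sum(1 for x in samples if x <= t) / len(samples)

def mean(samples):
    """E_x as the sum of x_k * f(x_k) with f(x_k) = 1/N."""
    return sum(samples) / len(samples)

def variance(samples):
    """sigma_x^2 as the sum of (x_k - E_x)^2 * f(x_k)."""
    m = mean(samples)
    return sum((x - m) ** 2 for x in samples) / len(samples)

def covariance(xs, ys):
    """psi_{x,y} = E[(x - E_x)(y - E_y)] over paired samples."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

# Hypothetical lifetimes (arbitrary time units), for illustration only.
lifetimes = [5, 9, 12, 14, 17, 20, 21, 24, 26, 29]

if __name__ == "__main__":
    print("F(20) =", empirical_cdf(lifetimes, 20))
    print("E_x   =", mean(lifetimes))
    print("var   =", variance(lifetimes))
```

The covariance of a variable with itself reduces to its variance, which gives a quick sanity check on the two definitions.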

Some Simple Probability Distributions
Uniform, Exponential, Normal, Binomial
[Figure: CDF F(x) and pdf f(x) for each distribution] Slide 15

Reliability and MTTF
Reliability R(t): probability that the system remains in the Good state through the interval [0, t]
[Figure: two-state nonrepairable system with states Good (Start state) and Failed]
Hazard function z(t): $R(t + dt) = R(t)\,[1 - z(t)\,dt]$
$R(t) = 1 - F(t)$, where F(t) is the CDF of the system lifetime, or its unreliability
Exponential reliability law: a constant hazard function $z(t) = \lambda$ gives $R(t) = e^{-\lambda t}$ (system failure rate is independent of its age)
Mean time to failure: $\mathrm{MTTF} = \int_0^{+\infty} t f(t)\,dt = \int_0^{+\infty} R(t)\,dt$, that is, the expected value of the lifetime equals the area under the reliability curve (easily provable) Slide 16
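The two results on this slide follow in a few steps from the definitions. The derivation below is a sketch, assuming R(0) = 1 and that t R(t) vanishes as t grows, which holds for the exponential law and the other distributions discussed here.

```latex
\begin{align*}
% Hazard relation rearranged into a differential equation
R(t + dt) = R(t)\,[1 - z(t)\,dt]
  &\;\Rightarrow\; \frac{dR(t)}{dt} = -z(t)\,R(t) \\
% Solve with R(0) = 1
  &\;\Rightarrow\; R(t) = \exp\!\Big(-\int_0^t z(x)\,dx\Big) \\
% Constant hazard z(t) = \lambda
z(t) = \lambda
  &\;\Rightarrow\; R(t) = e^{-\lambda t} \\[6pt]
% MTTF by parts, using f(t) = dF(t)/dt = -dR(t)/dt
\mathrm{MTTF} = \int_0^{+\infty} t\,f(t)\,dt
  &= \Big[-t\,R(t)\Big]_0^{+\infty} + \int_0^{+\infty} R(t)\,dt
   = \int_0^{+\infty} R(t)\,dt \\
% Exponential case
\text{For } R(t) = e^{-\lambda t}:\quad
\mathrm{MTTF} &= \int_0^{+\infty} e^{-\lambda t}\,dt = \frac{1}{\lambda}
\end{align*}
```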

Distributions of Interest
Exponential: $z(t) = \lambda$, $R(t) = e^{-\lambda t}$, $\mathrm{MTTF} = 1/\lambda$
Rayleigh: $z(t) = 2\lambda(\lambda t)$, $R(t) = e^{-(\lambda t)^2}$, $\mathrm{MTTF} = (1/\lambda)\sqrt{\pi}/2$
Weibull: $z(t) = \alpha\lambda(\lambda t)^{\alpha - 1}$, $R(t) = e^{-(\lambda t)^{\alpha}}$, $\mathrm{MTTF} = (1/\lambda)\,\Gamma(1 + 1/\alpha)$
Erlang: $\mathrm{MTTF} = k/\lambda$
Gamma: Erlang and exponential are special cases
Normal: reliability and MTTF formulas are complicated
Discrete versions: Geometric, $R(k) = q^k$; Discrete Weibull; Binomial Slide 17
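The exponential and Rayleigh rows are the Weibull row with α = 1 and α = 2, which can be checked numerically. The sketch below uses `math.gamma` from the Python standard library for the closed form and a plain trapezoidal rule for the area under R(t); the function names, the value λ = 0.01, and the integration bound are illustrative assumptions.

```python
import math

def weibull_reliability(t, lam, alpha):
    """R(t) = exp(-(lam * t) ** alpha)."""
    return math.exp(-((lam * t) ** alpha))

def weibull_hazard(t, lam, alpha):
    """z(t) = alpha * lam * (lam * t) ** (alpha - 1)."""
    return alpha * lam * (lam * t) ** (alpha - 1)

def weibull_mttf(lam, alpha):
    """MTTF = (1 / lam) * Gamma(1 + 1 / alpha)."""
    return (1.0 / lam) * math.gamma(1.0 + 1.0 / alpha)

def mttf_by_integration(reliability, upper, steps=100_000):
    """Trapezoidal estimate of the area under R(t) on [0, upper]."""
    h = upper / steps
    total = 0.5 * (reliability(0.0) + reliability(upper))
    for i in range(1, steps):
        total += reliability(i * h)
    return total * h

if __name__ == "__main__":
    lam = 0.01  # assumed failure-rate parameter, for illustration
    for alpha in (1, 2, 3):
        closed = weibull_mttf(lam, alpha)
        area = mttf_by_integration(
            lambda t: weibull_reliability(t, lam, alpha), upper=20 / lam)
        print(f"alpha = {alpha}: closed form {closed:.4f}, integral {area:.4f}")
```

The integration bound of 20/λ is a judgment call: beyond it the remaining area under these curves is negligible for the α values shown.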

Comparing Reliabilities
Reliability difference: $R_2 - R_1$
Reliability gain: $R_2 / R_1$
Reliability improvement factor: $\mathrm{RIF}_{2/1} = [1 - R_1(t_M)] / [1 - R_2(t_M)]$
  Example: [1 − 0.9] / [1 − 0.99] = 10
Reliability improvement index: $\mathrm{RII} = \log R_1(t_M) / \log R_2(t_M)$
Mission time extension: $\mathrm{MTE}_{2/1}(r_G) = T_2(r_G) - T_1(r_G)$
Mission time improvement factor: $\mathrm{MTIF}_{2/1}(r_G) = T_2(r_G) / T_1(r_G)$
[Figure: reliability functions $R_1(t)$ and $R_2(t)$ for Systems 1 and 2, plotted as system reliability (R) against time (t), marking $R_1(t_M)$, $R_2(t_M)$, $r_G$, $T_1(r_G)$, $T_2(r_G)$, $t_M$, $\mathrm{MTTF}_1$, and $\mathrm{MTTF}_2$] Slide 18
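These comparison measures are simple to compute once the two reliability functions are known. The sketch below assumes both systems follow the exponential law, reads t_M as the mission time and T_i(r_G) as the time at which system i's reliability falls to the goal r_G, and uses hypothetical failure rates; none of these specifics are fixed by the slide.

```python
import math

def rif(r1, r2):
    """Reliability improvement factor of system 2 over system 1."""
    return (1 - r1) / (1 - r2)

def rii(r1, r2):
    """Reliability improvement index: log R1 / log R2."""
    return math.log(r1) / math.log(r2)

def mission_time(lam, r_goal):
    """T(r_G) for an exponential system: solve exp(-lam * T) = r_goal."""
    return -math.log(r_goal) / lam

def mte(lam1, lam2, r_goal):
    """Mission time extension T2 - T1."""
    return mission_time(lam2, r_goal) - mission_time(lam1, r_goal)

def mtif(lam1, lam2, r_goal):
    """Mission time improvement factor T2 / T1."""
    return mission_time(lam2, r_goal) / mission_time(lam1, r_goal)

if __name__ == "__main__":
    print("RIF for R1 = 0.9, R2 = 0.99:", rif(0.9, 0.99))
    lam1, lam2 = 2e-4, 5e-5  # hypothetical failure rates
    print("MTE  at r_G = 0.95:", mte(lam1, lam2, 0.95))
    print("MTIF at r_G = 0.95:", mtif(lam1, lam2, 0.95))
```

For two exponential systems, both RII and MTIF reduce to the ratio λ1/λ2, independent of the mission time or the reliability goal.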

Availability, MTTR, and MTBF
(Interval) availability A(t): fraction of time that the system is in the Up state during the interval [0, t]
Steady-state availability: $A = \lim_{t \to \infty} A(t)$
Pointwise availability a(t): probability that the system is available at time t; $A(t) = (1/t) \int_0^t a(x)\,dx$
[Figure: two-state repairable system with states Up (Start state) and Down, and a Repair transition]
Availability = Reliability when there is no repair.
Availability is a function not only of how rarely a system fails (reliability) but also of how quickly it can be repaired (time to repair):
$A = \mathrm{MTTF} / (\mathrm{MTTF} + \mathrm{MTTR}) = \mathrm{MTTF} / \mathrm{MTBF} = \mu / (\lambda + \mu)$ (will justify this equation later)
Here μ is the repair rate, with $1/\mu = \mathrm{MTTR}$. In general, $\mu \gg \lambda$, leading to $A \approx 1$. Slide 19
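The steady-state formula can be compared against a simulated alternation of up and down periods. This is a minimal sketch, assuming exponentially distributed times to failure and repair times and hypothetical values MTTF = 1000 and MTTR = 10; the simulation is a seeded Monte Carlo estimate, so it only approximates the closed form.

```python
import random

def steady_state_availability(mttf, mttr):
    """A = MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

def availability_from_rates(lam, mu):
    """A = mu / (lam + mu), with lam = 1 / MTTF and mu = 1 / MTTR."""
    return mu / (lam + mu)

def simulated_availability(lam, mu, cycles=20_000, seed=1):
    """Fraction of time spent Up over many failure/repair cycles."""
    rng = random.Random(seed)
    up = down = 0.0
    for _ in range(cycles):
        up += rng.expovariate(lam)    # time until the next failure
        down += rng.expovariate(mu)   # time to complete the repair
    return up / (up + down)

if __name__ == "__main__":
    mttf, mttr = 1000.0, 10.0  # hypothetical values, same time unit
    print("formula:   ", steady_state_availability(mttf, mttr))
    print("from rates:", availability_from_rates(1 / mttf, 1 / mttr))
    print("simulated: ", simulated_availability(1 / mttf, 1 / mttr))
```

With μ a hundred times larger than λ, as in this example, the availability is already about 0.99, which illustrates the remark that A is close to 1 in practice.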

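System Up and Down Times
Short repair time implies good maintainability (serviceability).
[Figure: two-state repairable system (states Up and Down, with a Repair transition) and a timeline of Up and Down periods with marked instants 0, $t_1$, $t'_1$, $t_2$, $t'_2$, t, showing time to first failure, time between failures, and repair time] Slide 20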

Performability and MCBF
Performability P: composite measure, incorporating both performance and reliability
[Figure: three-state degradable system with states Up2, Up1, and Down, and transitions labeled Partial failure, Partial repair, and Repair]
Simple example: the worth of Up2 is twice that of Up1; $p_{\mathrm{Up}i}$ = probability that the system is in state Upi
$P = 2\,p_{\mathrm{Up2}} + p_{\mathrm{Up1}}$
With $p_{\mathrm{Up2}} = 0.92$, $p_{\mathrm{Up1}} = 0.06$, $p_{\mathrm{Down}} = 0.02$: P = 1.90 (system performance equivalent to that of 1.9 processors on average)
Question: What is system availability here?
Performability improvement factor of this system (akin to RIF) relative to a fail-hard system that goes down when either processor fails: PIF = (2 − 2 × 0.92) / (2 − 1.90) = 1.6 Slide 21
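The arithmetic of this example can be reproduced directly. In the sketch below, the function names are illustrative, and the availability figure rests on an assumption not stated on the slide: that the system counts as available in either Up state.

```python
def performability(p_up2, p_up1):
    """P = 2 * p_Up2 + p_Up1 (Up2 is worth twice Up1)."""
    return 2 * p_up2 + p_up1

def pif(p_up2, p_up1, full_worth=2.0):
    """Performability improvement factor over a fail-hard system.

    The fail-hard system delivers full worth only while both
    processors are up, so its performability is full_worth * p_up2.
    """
    fail_hard = full_worth * p_up2
    return (full_worth - fail_hard) / (full_worth - performability(p_up2, p_up1))

if __name__ == "__main__":
    p_up2, p_up1, p_down = 0.92, 0.06, 0.02
    print("P   =", performability(p_up2, p_up1))
    print("PIF =", pif(p_up2, p_up1))
    # Assumption: both Up2 and Up1 count as "available".
    print("A   =", p_up2 + p_up1)
```

Under that reading, availability is 1 minus the probability of the Down state.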

System Up, Partially Up, and Down Times
Important to prevent direct transitions to the Down state (coverage).
[Figure: three-state degradable system (states Up2, Up1, and Down; transitions Partial failure, Partial repair, and Repair) and a timeline of Up, Partially Up, and Down periods with marked instants 0, $t_1$, $t_2$, $t'_2$, $t'_1$, $t_3$, $t'_3$, t] Slide 22

Integrity and Safety
Risk: probability of being in the Unsafe Failed state. There may be multiple unsafe states, each with a different consequence (cost).
[Figure: three-state fail-safe system with states Good (Start state), Safe Failed, and Unsafe Failed]
Simple analysis: lump the Safe Failed state with the Good state; proceed as in reliability analysis.
More detailed analysis: even though the Safe Failed state is more desirable than Unsafe Failed, it is still not as desirable as the Good state, so keeping it separate makes sense. For example, if a repair transition is introduced between the Safe Failed and Good states, we can tackle questions such as the expected outage of the system in safe mode, and thus its availability. Slide 23