Terminology and Concepts


Prof. Naga Kandasamy

(These notes are adapted from B. W. Johnson, Design and Analysis of Fault-Tolerant Digital Systems, Addison-Wesley, 1989.)

1 Goals of Fault Tolerance

Dependability is an umbrella term encompassing the concepts of reliability, availability, performability, safety, maintainability, and testability. We will now define these terms in an intuitive fashion.

1.1 Reliability

The reliability R(t) of a system is a function of time, defined as the conditional probability that the system will perform correctly throughout the interval [t_0, t], given that it was performing correctly at time t_0. In other words, reliability is the probability that the system operates correctly throughout a complete interval of time. Reliability is used to characterize systems in which even momentary periods of incorrect performance are unacceptable, or in which it is impossible to repair the system (e.g., a spacecraft, where the time interval of concern may be years). In other applications, such as flight control, the time interval of concern may be a few hours. Fault tolerance can improve a system's reliability by keeping the system operational when hardware and software failures occur.

1.2 Availability

Availability A(t) is a function of time, defined as the probability that a system is operating correctly and is available to perform its functions at the instant of time t. Availability differs from reliability in that reliability depends on an interval of time, whereas availability is taken at an instant of time. A system can therefore be highly available yet experience frequent periods of downtime, as long as each downtime period is very short. The most common measure of availability is the expected fraction of time that a system is available to correctly perform its functions.

1.3 Performability

In many cases, it is possible to design systems that continue to perform correctly after the occurrence of hardware and software failures, but at a diminished level of performance. The performability P(L, t) of a system is a function of time, defined as the probability that the system performance is at, or above, some level L at the instant of time t. Performability differs from reliability in that reliability measures the likelihood that all of the functions are performed correctly, whereas performability measures the likelihood that some subset of the functions is performed correctly. Graceful degradation is the ability of a system to automatically decrease its level of performance to compensate for hardware and software failures. Fault tolerance can provide graceful degradation, and hence improve performability, by eliminating failed hardware and software components and allowing operation at some reduced level of performance.
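As a concrete illustration of the availability measure defined in Section 1.2, here is a minimal Python sketch (the up-interval log is hypothetical) that computes the expected fraction of time a system is available from a record of its up periods:

```python
from typing import List, Tuple

def availability(up_intervals: List[Tuple[float, float]], total_time: float) -> float:
    """Expected fraction of time the system is available: total length
    of the up intervals divided by the observation window."""
    up_time = sum(end - start for start, end in up_intervals)
    return up_time / total_time

# Hypothetical log: (start, end) hours during which the system was up,
# over a 100-hour observation window.
log = [(0.0, 40.0), (40.5, 70.0), (70.2, 100.0)]
print(f"A = {availability(log, 100.0):.3f}")  # 0.993
```

Note that this system fails twice during the window yet is still 99.3% available, because each downtime period is short.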

1.4 Safety

Safety S(t) is the probability that a system will either perform its functions correctly or will discontinue its functions in a manner that does not compromise the safety of any people associated with the system (fail-safe capability). Safety and reliability differ in that reliability is the probability that a system performs its functions correctly, whereas safety is the probability that a system either performs its functions correctly or discontinues them in a fail-safe manner.

1.5 Maintainability and Testability

Maintainability is a measure of the ease with which a system can be repaired once it has failed. More precisely, the maintainability M(t) is the probability that a failed system will be restored to an operational state within a specified period of time t. The restoration process includes locating and diagnosing the problem, repairing and reconfiguring the system, and bringing the system back to its operational state.
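The notes above do not give a formula for M(t); a commonly used model, assuming repairs occur at a constant repair rate µ (analogous to the constant failure rate λ used later for reliability), is M(t) = 1 - e^{-µt}. A minimal Python sketch of this assumed model:

```python
import math

def maintainability(t: float, mu: float) -> float:
    """M(t) = 1 - exp(-mu * t): probability that a failed system is
    restored within t hours, assuming a constant repair rate mu."""
    return 1.0 - math.exp(-mu * t)

mu = 0.5  # assumed repair rate: 0.5 repairs per hour (i.e., MTTR = 2 h)
for t in (1, 2, 4, 8):
    print(f"M({t} h) = {maintainability(t, mu):.3f}")
# M(1 h) = 0.393, M(2 h) = 0.632, M(4 h) = 0.865, M(8 h) = 0.982
```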

2 Faults, Errors, and System Failures

We define the following basic terms:

- A fault is a defect within the system.
- An error is a deviation from the required operation of the (sub)system.
- A system failure occurs when the system delivers a function that deviates from the specified one.

There is a cause-and-effect relationship between faults, errors, and failures: faults result in errors, and errors can lead to system failures. In other words, errors are the effect of faults, and failures are the effect of errors.

The full-adder circuit shown in Fig. 1 illustrates the distinction between faults and errors. The inputs A_i, B_i, and C_i are the two operand bits and the carry bit, respectively.

Fig. 1: The full-adder circuit used to illustrate the distinction between faults and errors.

The truth table showing the correct behavior of this circuit is given in Fig. 2.

Fig. 2: The truth table for the fault-free full-adder circuit.

If a short occurs between line L and the power-supply line, so that line L becomes permanently fixed at a logic 1 value, then a fault (or defect) has occurred in the circuit; the fault is the actual short within the circuit. Fig. 3 shows the truth table of the circuit containing this physical fault.

Fig. 3: The truth table for the full-adder circuit when line L is stuck at logic value 1.

Comparing Figs. 2 and 3, we see that the circuit performs correctly for the input combinations 010, 011, 110, and 111, but not for 000, 001, 100, and 101. Whenever an input pattern supplied to the circuit results in an incorrect output, an error has occurred. If the output of the circuit is used to control a relay, and the relay is opened when it should be closed, a failure has occurred.
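Since Figs. 1-3 are not reproduced here, the following Python sketch makes an assumption about the circuit: it models the fault as the internal line carrying operand bit B stuck at logic 1, which reproduces the correct/incorrect input sets listed above. It regenerates both truth tables and flags the rows where an error occurs:

```python
def full_adder(a: int, b: int, c: int, stuck: bool = False) -> tuple:
    """Gate-level full adder on bits (a, b, c). If stuck is True, the
    internal line carrying operand bit B (assumed here to be line L in
    Fig. 1) is shorted to the power supply, i.e., stuck at logic 1."""
    line = 1 if stuck else b
    s = a ^ line ^ c                      # sum output
    cout = (a & line) | ((a ^ line) & c)  # carry output
    return s, cout

# Regenerate the truth tables of Figs. 2 and 3 and flag error rows.
for a in (0, 1):
    for b in (0, 1):
        for c in (0, 1):
            good = full_adder(a, b, c)
            bad = full_adder(a, b, c, stuck=True)
            flag = "  <-- error" if good != bad else ""
            print(f"A={a} B={b} C={c}  fault-free={good}  faulty={bad}{flag}")
```

Running this flags exactly the rows 000, 001, 100, and 101: the fault (the stuck line) is always present, but an error appears only when an input pattern exercises it.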

2.1 Characteristics of Faults

System faults can be classified by their duration:

- Permanent faults remain in existence indefinitely if no corrective action is taken. Many are residual design or manufacturing faults; others are caused by catastrophic events such as accidents.
- Intermittent faults appear, disappear, and reappear repeatedly. They are difficult to predict, but their effects are highly correlated. Most intermittent faults are due to marginal design, testing, or manufacturing, and manifest themselves under certain environmental or system conditions.
- Transient faults appear and disappear quickly, and are not correlated with one another. They are most commonly induced by random environmental disturbances such as electromagnetic interference.

Faults can also be characterized by their underlying cause: (1) specification mistakes, (2) implementation mistakes, (3) external disturbances, and (4) component failures.

2.2 Fault Models

To design a fault-tolerant system, it is necessary to assume that the underlying faults behave according to some fault model. Although, in practice, faults can be transient in nature and exhibit complex behavior, fault models make the problem of designing fault-tolerant systems more manageable, and serve as a way to restrict our attention to a subset of all the faults that can occur. A commonly used fault model for capturing the behavior of faulty digital circuits is the logical stuck-at fault model, in which a signal line is assumed to be permanently fixed at logic 0 or logic 1.

2.3 Failure Response Strategies

A taxonomy of the primary techniques used to design systems that operate in a fault-prone environment is shown in Fig. 4.

Fig. 4: A taxonomy of possible failure-response strategies. System reliability is pursued through non-redundant systems (fault avoidance) or redundant systems (fault detection, masking redundancy, and fault-tolerant systems employing dynamic redundancy: on-line detection/masking, reconfiguration, retry, and on-line repair).

Broadly speaking, there are three primary methods: fault avoidance (e.g., shielding against EMI), fault masking (e.g., TMR systems), and fault tolerance. Fault detection does not tolerate faults, but provides a warning that a fault has occurred. Masking redundancy (also called static redundancy) tolerates failures, but provides no warning of them. Dynamic redundancy covers those systems whose configuration can be changed dynamically in response to a fault, or in which masking redundancy is augmented with on-line fault detection, allowing on-line repair.
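As a concrete instance of masking (static) redundancy, here is a minimal Python sketch of triple modular redundancy (TMR): three copies of a module feed a majority voter, which masks the failure of any single copy without providing any warning of it. The `parity` module and the fault-injection mechanism are illustrative placeholders:

```python
def majority(a: int, b: int, c: int) -> int:
    """2-of-3 majority vote on single bits."""
    return (a & b) | (b & c) | (a & c)

def tmr(module, x: int, faulty_copy: int = -1) -> int:
    """Run three copies of `module` on the same input and vote.
    If faulty_copy is 0, 1, or 2, that copy's output bit is inverted
    to inject a fault; the voter masks it silently."""
    outs = [module(x) for _ in range(3)]
    if 0 <= faulty_copy <= 2:
        outs[faulty_copy] ^= 1
    return majority(*outs)

# `parity` is an illustrative placeholder module producing one bit.
parity = lambda x: bin(x).count("1") & 1
x = 0b1011
assert all(tmr(parity, x, faulty_copy=i) == parity(x) for i in range(3))
print("any single faulty module is masked by the voter")
```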

3 Quantitative Evaluation of System Reliability

The reliability R(t) of a system is defined as the probability that a component or system functions correctly over the time period [t_0, t] under a given set of operating conditions. Consider a set of N identical components, all of which begin operating at the same time t_0 = 0. At some time t, let N_o(t) denote the number of components operating correctly and N_f(t) the number of failed components. The reliability of a component at time t is then

R(t) = \frac{N_o(t)}{N} = \frac{N_o(t)}{N_o(t) + N_f(t)},

which is simply the probability that a component has survived the interval [t_0, t]. We can also define the unreliability Q(t), also called the probability of failure, as the probability that a system will not function correctly over a given period of time:

Q(t) = \frac{N_f(t)}{N} = \frac{N_f(t)}{N_o(t) + N_f(t)}.

From these definitions, Q(t) = 1 - R(t).

If we write the reliability function as R(t) = 1 - N_f(t)/N and differentiate R(t) with respect to time, we obtain

\frac{dR(t)}{dt} = -\frac{1}{N} \frac{dN_f(t)}{dt},

which can be rewritten as

\frac{dN_f(t)}{dt} = -N \frac{dR(t)}{dt}.

The derivative dN_f(t)/dt is simply the instantaneous rate at which components are failing. At time t, there are still N_o(t) components operating correctly. Dividing dN_f(t)/dt by N_o(t), we obtain

z(t) = \frac{1}{N_o(t)} \frac{dN_f(t)}{dt},

where z(t) is called the hazard function, hazard rate, or failure rate. Its unit is failures per unit of time. Since N_o(t) = N R(t), the failure-rate function can also be written in terms of the reliability function:

z(t) = \frac{1}{N_o(t)} \frac{dN_f(t)}{dt} = -\frac{N}{N_o(t)} \frac{dR(t)}{dt} = -\frac{1}{R(t)} \frac{dR(t)}{dt}.

Rearranging, we obtain the following differential equation:

\frac{dR(t)}{dt} = -z(t) R(t).

The failure-rate function z(t) of electronic components exhibits a bathtub curve, shown in Fig. 5, comprising three distinct regions: burn-in, useful life, and wear-out.

Fig. 5: The bathtub form of the failure-rate curve.

It is typically assumed that the failure rate is constant during a component's useful life, so z(t) = λ. The differential equation then becomes

\frac{dR(t)}{dt} = -\lambda R(t),

and solving it gives

R(t) = e^{-\lambda t}.   (1)

This exponential relationship between reliability and time is known as the exponential failure law: the probability of a system working correctly throughout a given period of time decreases exponentially with the length of the period. The exponential failure law is extremely valuable for the analysis of electronic components, and is by far the most commonly used relationship between reliability and time.
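A short numerical illustration of equation (1), with an assumed failure rate of λ = 10^{-4} failures per hour:

```python
import math

def reliability(t: float, lam: float) -> float:
    """R(t) = exp(-lambda * t): the exponential failure law, eq. (1)."""
    return math.exp(-lam * t)

lam = 1e-4  # assumed constant failure rate: 1e-4 failures per hour
for t in (100, 1_000, 10_000):
    print(f"R({t:6d} h) = {reliability(t, lam):.4f}")
# R(   100 h) = 0.9900;  R(  1000 h) = 0.9048;  R( 10000 h) = 0.3679
```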

Mean time to failure. The mean time to failure (MTTF) is another way to quantify system reliability: it is the expected time that a system operates before the first failure occurs. If we have N identical components operating from time t = 0, and component i operates for a time t_i before encountering its first failure, then

MTTF = \frac{1}{N} \sum_{i=1}^{N} t_i.

We can also calculate the MTTF as the expected value of the time of failure. From probability theory, the expected value of a random variable X with probability density function f(x) is

E[X] = \int_{-\infty}^{\infty} x f(x)\, dx.

From a reliability viewpoint, the random variable of interest is the time to failure, whose probability density function is the failure density function

f(t) = \frac{dQ(t)}{dt} = -\frac{dR(t)}{dt}.

So the MTTF can be written as

MTTF = \int_0^{\infty} t f(t)\, dt = -\int_0^{\infty} t \frac{dR(t)}{dt}\, dt.

Using integration by parts, we can show that

MTTF = \left[ -t R(t) \right]_0^{\infty} + \int_0^{\infty} R(t)\, dt = \int_0^{\infty} R(t)\, dt.

If the reliability function obeys the exponential failure law, the MTTF is given by

MTTF = \int_0^{\infty} e^{-\lambda t}\, dt = \frac{1}{\lambda}.

This leads to a very simple result: the MTTF is the inverse of the (constant) failure rate. Thus, a system with a constant failure rate of 0.001 failures per hour has a mean time to failure of 1000 hours. Finally, the reliability of a system at a time equal to its MTTF under the exponential failure law is

R(MTTF) = e^{-\lambda (1/\lambda)} = e^{-1} \approx 0.37.   (2)

In other words, the system has only a 37% chance of operating correctly for a length of time equal to its MTTF.
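Both results are easy to check by simulation. The sketch below (assuming λ = 0.001 failures per hour, as in the example above) draws exponentially distributed lifetimes, whose average converges to 1/λ, and confirms that only about 37% of components survive past the MTTF:

```python
import random

random.seed(1)
lam = 0.001      # assumed constant failure rate (failures per hour)
n = 100_000      # number of simulated components

# Exponentially distributed lifetimes have mean 1/lambda.
lifetimes = [random.expovariate(lam) for _ in range(n)]
mttf = sum(lifetimes) / n
frac_surviving = sum(t > mttf for t in lifetimes) / n

print(f"estimated MTTF      = {mttf:.0f} h  (theory: {1 / lam:.0f} h)")
print(f"surviving past MTTF = {frac_surviving:.2f}  (theory: e^-1 = 0.37)")
```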

Mean time to repair. The mean time to repair (MTTR) is the average time required to repair a failed system. Just as we describe the reliability of a system using its failure rate λ, we can quantify the repairability of a system using its repair rate µ. The MTTR is then

MTTR = \frac{1}{\mu}.

Mean time between failures. If a failed system can be repaired and made as good as new, the mean time between failures (MTBF) is

MTBF = MTTF + MTTR.   (3)

The availability of the system is the probability that the system is functioning correctly at any given time; in other words, it is the fraction of time for which the system is operational:

Availability = \frac{\text{Time system is operational}}{\text{Total time}} = \frac{MTTF}{MTTF + MTTR} = \frac{MTTF}{MTBF}.   (4)
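For example, with assumed figures of MTTF = 1000 hours and MTTR = 2 hours, equation (4) gives:

```python
def steady_state_availability(mttf: float, mttr: float) -> float:
    """A = MTTF / (MTTF + MTTR) = MTTF / MTBF, from eq. (4)."""
    return mttf / (mttf + mttr)

a = steady_state_availability(mttf=1000.0, mttr=2.0)  # assumed figures
print(f"availability = {a:.5f}")  # 0.99800, i.e. roughly 17.5 h of downtime per year
```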