UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Fault Tolerant Computing ECE 655

Size: px
Start display at page:

Download "UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Fault Tolerant Computing ECE 655"

Transcription

1 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Fault Tolerant Computing ECE 655 Part 1 Introduction C. M. Krishna Fall 2006 ECE655/Krishna Part.1.1 Prerequisites Basic courses in Digital Design Hardware Organization/Computer Architecture Probability ECE655/Krishna Part.1.2 Page 1

2 Fault Tolerance - Basic definition Fault-tolerant systems - ideally systems capable of executing their tasks correctly regardless of either hardware failures or software errors In practice - we can never guarantee the flawless execution of tasks under any circumstances Limit ourselves to types of failures and errors which are more likely to occur ECE655/Krishna Part.1.3 Need For Fault Tolerance ECE655/Krishna Part.1.4 Page 2

3 Need For Fault Tolerance - Life Critical Applications Life-critical applications such as; aircrafts, nuclear reactors, chemical plants, and medical equipment A malfunction of a computer in such applications can lead to catastrophe Their probability of failure must be extremely low, possibly one in a billion per hour of operation ECE655/Krishna Part.1.5 Need for Fault Tolerance - Harsh Environment A computing system operating in a harsh environment where it is subjected to electromagnetic disturbances particle hits and alike Very large number of failures means: the system will not produce useful results unless some fault-tolerance is incorporated ECE655/Krishna Part.1.6 Page 3

4 Need For Fault Tolerance - Highly Complex Systems Complex systems consist of millions of devices Every physical device has a certain probability of failure A very large number of devices implies that the likelihood of failures is high The system will experience faults at such a frequency which renders it useless ECE655/Krishna Part.1.7 Fault Tolerance Measures It is important to have proper yardsticks - measures - by which to measure the effect of fault tolerance A measure is a mathematical abstraction, which expresses only some subset of the object's nature Measures? ECE655/Krishna Part.1.8 Page 4

5 Traditional Measures Assumption: The system can be in one of two states: up or down Examples: A lightbulb is either good or burned out; A wire is either connected or broken Two traditional measures: Reliability and Availability Reliability, R(t): probability that the system is up during the interval [0,t], given it was up at time 0 Availability, A(t), is the fraction of time that the system is up during the interval [0,t] Point Availability, Ap(t), is the probability that the system is up at time t A related measure is MTTF - Mean Time To Failure - the average time the system remains up before it goes down and has to be repaired or replaced ECE655/Krishna Part.1.9 Need For More Measures The assumption of the system being in state up or down is very limiting Example: A processor with one of its several hundreds of millions of gates stuck at logic value 0 and the rest is functional - may affect the output of the processor once in every 25,000 hours of use The processor is not fault-free, but cannot be defined as being down More general measures than the traditional reliability and availability are needed ECE655/Krishna Part.1.10 Page 5

6 More General Measures Capacity Reliability - probability that the system capacity (as measured, for example, by throughput) at time t exceeds some given threshold at that time Another extension - consider everything from the perspective of the application This approach was taken to define the measure known as Performability ECE655/Krishna Part.1.11 Computational Capacity Example: N processors in a gracefully degrading system System recovers from failures of processors and is useful as long as at least one processor remains operational Let Pi = Prob {i processors are operational} R(t) = S Pi Let c be the computational capacity of a processor (e.g., number of fixed-size tasks it can execute) Computational capacity of i processors: Ci = i c Computational capacity of the system: S Ci Pi ECE655/Krishna Part.1.12 Page 6

7 Performability Application is used to define accomplishment levels L1, L2,...,Ln Each represents a level of quality of service delivered by the application Example: Li indicates i system crashes during the mission time period T Performability is a vector (P(L1),P(L2),...,P(Ln)) where P(Li) is the probability that the computer functions well enough to permit the application to reach up to accomplishment level Li ECE655/Krishna Part.1.13 Network Connectivity Measures Focus on the network that connects the processors Classical Node and Line Connectivity - the minimum number of nodes and lines, respectively, that have to fail before the network becomes disconnected Measure indicates how vulnerable the network is to disconnection A network disconnected by the failure of just one (critically-positioned) node is potentially more vulnerable than another which requires several nodes to fail before it becomes disconnected ECE655/Krishna Part.1.14 Page 7

8 Connectivity - Examples ECE655/Krishna Part.1.15 Network Resilience Measures Classical connectivity distinguishes between only two network states: connected and disconnected It says nothing about how the network degrades as nodes fail before becoming disconnected Two possible resilience measures: Average node-pair distance Network diameter - the maximum node-pair distance Both calculated given the probability of node and/or link failure ECE655/Krishna Part.1.16 Page 8

9 More Network Measures What happens upon network disconnection A network that splits into one large connected component and several small pieces may be able to function, vs. a network that splinters into a large number of small pieces Another measure of a network's resilience to failure: probability distribution of the largest component upon disconnection ECE655/Krishna Part.1.17 Redundancy Redundancy is at the heart of fault tolerance Redundancy can be defined as the incorporation of extra parts in the design of a system in such a way that its function is not impaired in the event of a failure We will study four forms of redundancy: ECE655/Krishna Part.1.18 Page 9

10 Hardware Redundancy Extra hardware is added to override the effects of a failed component Static Hardware Redundancy - for immediate masking of a failure Example: Use three processors (instead of one), each performing the same function. The majority output of these processors can override the wrong output of a single faulty processor Dynamic Hardware Redundancy - spare components are activated upon the failure of a currently active component Hybrid Hardware Redundancy - A combination of static and dynamic redundancy techniques ECE655/Krishna Part.1.19 Software Redundancy Software redundancy is provided by having multiple teams of programmers Write different versions of software for the same function The hope is that such diversity will ensure that not all the copies will fail on the same set of input data ECE655/Krishna Part.1.20 Page 10

11 Information and Time Redundancy Information redundancy: provided by adding bits to the original data bits so that an error in the data bits can be detected and even corrected Error detecting and correcting codes have been developed and are being used Information redundancy often requires hardware redundancy to process the additional bits Time redundancy: provided by having additional time during which a failed execution can be repeated Most failures are transient - they go away after some time If enough slack time is available, the failed unit can recover and redo the affected computation ECE655/Krishna Part.1.21 Hardware Faults Classification Three types of faults: Transient Faults - disappear after a relatively short time Example - a memory cell whose contents are changed spuriously due to some electromagnetic interference Overwriting the memory cell with the right content will make the fault go away Permanent Faults - never go away, component has to be repaired or replaced Intermittent Faults - cycle between active and benign states Example - a loose connection ECE655/Krishna Part.1.22 Page 11

12 Failure Rate The rate at which a component suffers faults depends on its age, the ambient temperature, any voltage or physical shocks that it suffers, and the technology The dependence on age is usually captured by the bathtub curve: ECE655/Krishna Part.1.23 Bathtub Curve When components are very young, the failure rate is quite high: there is a good chance that some units with manufacturing defects slipped through manufacturing quality control and were released As time goes on, these units are weeded out, and the unit spends the bulk of its life showing a fairly constant failure rate As it becomes very old, aging effects start to take over, and the failure rate rises again ECE655/Krishna Part.1.24 Page 12

13 Empirical Formula for l - Failure Rate l = pl pq (C1 pt pv + C2 pe) pl: Learning factor, (how mature the technology is) pq: Manufacturing process Quality factor (0.25 to 20.00) pt: Temperature factor, (from 0.1 to 1000), proportional to exp(-ea/kt) where Ea is the activation energy in electronvolts associated with the technology, k is the Boltzmann constant and T is the temperature in Kelvin pv: Voltage stress factor for CMOS devices (from 1 to 10 depending on the supply voltage and the temperature); does not apply to other technologies (set to 1) pe: Environment shock factor: from about 0.4 (airconditioned environment), to 13.0 (harsh environment) C1, C2: Complexity factors; functions of number of gates on the chip and number of pins in the package Further details: MIL-HDBK-217E handbook ECE655/Krishna Part.1.25 Environment Impact Devices operating in space, which is replete with energy-charged particles and can subject devices to severe temperature swings, can be expected to fail more often Similarly for computers in automobiles (high temperature and vibration) and industrial applications ECE655/Krishna Part.1.26 Page 13

14 Faults Vs. Errors A fault can be either a hardware defect or a software/programming mistake By contrast, an error is a manifestation of a fault For example, consider an adder circuit one of whose output lines is stuck at 1. This is a fault, but not (yet) an error. This fault causes an error when the adder is used and the result on that line is supposed to have been a 0, rather than a 1 ECE655/Krishna Part.1.27 Propagation of Faults and Errors Both faults and errors can spread through the system If a chip shorts out power to ground, it may cause nearby chips to fail as well Errors can spread because the output of one processor is frequently used as input by other processors Adder example: the erroneous result of the faulty adder can be fed into further calculations, thus propagating the error ECE655/Krishna Part.1.28 Page 14

15 Containment Zones To limit such situations, designers incorporate containment zones into systems Barriers that reduce the chance that a fault or error in one zone will propagate to another A fault-containment zone can be created by providing an independent power supply to each zone The designer tries to electrically isolate one zone from another An error-containment zone can be created by using redundant units and voting on their output ECE655/Krishna Part.1.29 Time to Failure - Analytic Model Consider the following model: N identical components, all operational at time t=0 Each component remains operational until it is hit by a failure All failures are permanent and occur in each component independently from other components We first concentrate on one component T - the lifetime of one component - the time until it fails T is a random variable f(t) - the density function of T F(t) - the cumulative distribution function of T ECE655/Krishna Part.1.30 Page 15

16 Probabilistic Interpretation F(t) - the probability that the component will fail at or before time t F(t) = Prob (T t) f(t) - the momentary rate of failure f(t)dt = Prob (t T t+dt) Like any density function (defined for t 0) f(t) dt =1 0 f(t) 0, for all t 0 These functions are related through f(t)=df(t)/dt F(t)= f(s) ds ECE655/Krishna Part t Reliability and Failure (Hazard) Rate The reliability of a single component - R(t) R(t) = Prob (T>t) = 1- F(t) The failure probability of a component at time t, p(t) - the conditional probability that the component will fail at time t, given it has not failed before p(t) = Prob (t T t+dt T t) = Prob (t T t+dt) / Prob(T t) = f(t)dt / (1-F(t)) The failure rate (or hazard rate) of a component at time t, h(t), is defined as p(t)/dt h(t) = f(t)/(1- F(t)) Since dr(t)/dt = -f(t), we get h(t) = -1/R(t) dr(t)/dt ECE655/Krishna Part.1.32 Page 16

17 Constant Failure Rate If the failure rate is constant over time - h(t) = l - then dr(t) / dt = - l R(t) ; R(0)=1 The solution of the differential equation is - l t R(t) = e - l t f(t)= l e - l t F(t)=1- e A constant failure rate is obtained if and only if T, the lifetime of the component, has an exponential distribution ECE655/Krishna Part.1.33 Mean Time to Failure MTTF - the expected value of the lifetime T MTTF = E[T] = t f(t) dt 0 dr(t)/dt= - f(t) MTTF = - t dr(t)/dt dt = [-t R(t) ] + R(t) dt 0 -t R(t) = 0 when t=0 and when t= since R( )=0 Therefore, MTTF = R(t) dt 0 If the failure rate is a constant l - l t R(t) = e - l t MTTF = e dt = 1/ l ECE655/Krishna Part.1.34 Page 17

18 Weibull Distribution - Introduction In most calculations of reliability, a constant failure rate l is assumed, or equivalently - the Exponential distribution for the component lifetime T There are cases in which this simplifying assumption is inappropriate Example - during the infant mortality and wear-out phases of the bathtub curve In such cases, the Weibull distribution for the lifetime T is often used for reliability calculation ECE655/Krishna Part.1.35 Weibull distribution - Equation The Weibull distribution has two parameters, l and b The density function of the component lifetime T: b b-1 - l t f(t)= l b t e The failure rate for the Weibull distribution is b-1 h(t)= l b t The failure rate h(t) is decreasing with time for b<1, increasing with time for b>1, constant for b=1, appropriate for the infant mortality, wearout and middle phases, respectively ECE655/Krishna Part.1.36 Page 18

19 MTTF for the Weibull Distribution The reliability for the Weibull distribution is b - l t R(t) = e The MTTF for the Weibull distribution is 1/b MTTF = G(1/b)/(b l ) where G(x) is the Gamma function The special case b = 1 is the Exponential distribution with a constant failure rate ECE655/Krishna Part.1.37 Page 19

Terminology and Concepts

Terminology and Concepts Terminology and Concepts Prof. Naga Kandasamy 1 Goals of Fault Tolerance Dependability is an umbrella term encompassing the concepts of reliability, availability, performability, safety, and testability.

More information

Quantitative evaluation of Dependability

Quantitative evaluation of Dependability Quantitative evaluation of Dependability 1 Quantitative evaluation of Dependability Faults are the cause of errors and failures. Does the arrival time of faults fit a probability distribution? If so, what

More information

Quantitative evaluation of Dependability

Quantitative evaluation of Dependability Quantitative evaluation of Dependability 1 Quantitative evaluation of Dependability Faults are the cause of errors and failures. Does the arrival time of faults fit a probability distribution? If so, what

More information

Fault Tolerance. Dealing with Faults

Fault Tolerance. Dealing with Faults Fault Tolerance Real-time computing systems must be fault-tolerant: they must be able to continue operating despite the failure of a limited subset of their hardware or software. They must also allow graceful

More information

Reliability of Technical Systems

Reliability of Technical Systems Main Topics 1. Introduction, Key Terms, Framing the Problem 2. Reliability Parameters: Failure Rate, Failure Probability, etc. 3. Some Important Reliability Distributions 4. Component Reliability 5. Software

More information

Fault-Tolerant Computing

Fault-Tolerant Computing Fault-Tolerant Computing Motivation, Background, and Tools Slide 1 About This Presentation This presentation has been prepared for the graduate course ECE 257A (Fault-Tolerant Computing) by Behrooz Parhami,

More information

Chapter 6. a. Open Circuit. Only if both resistors fail open-circuit, i.e. they are in parallel.

Chapter 6. a. Open Circuit. Only if both resistors fail open-circuit, i.e. they are in parallel. Chapter 6 1. a. Section 6.1. b. Section 6.3, see also Section 6.2. c. Predictions based on most published sources of reliability data tend to underestimate the reliability that is achievable, given that

More information

VLSI Design I. Defect Mechanisms and Fault Models

VLSI Design I. Defect Mechanisms and Fault Models VLSI Design I Defect Mechanisms and Fault Models He s dead Jim... Overview Defects Fault models Goal: You know the difference between design and fabrication defects. You know sources of defects and you

More information

Lecture 5 Fault Modeling

Lecture 5 Fault Modeling Lecture 5 Fault Modeling Why model faults? Some real defects in VLSI and PCB Common fault models Stuck-at faults Single stuck-at faults Fault equivalence Fault dominance and checkpoint theorem Classes

More information

EECS150 - Digital Design Lecture 26 - Faults and Error Correction. Types of Faults in Digital Designs

EECS150 - Digital Design Lecture 26 - Faults and Error Correction. Types of Faults in Digital Designs EECS150 - Digital Design Lecture 26 - Faults and Error Correction April 25, 2013 John Wawrzynek 1 Types of Faults in Digital Designs Design Bugs (function, timing, power draw) detected and corrected at

More information

Signal Handling & Processing

Signal Handling & Processing Signal Handling & Processing The output signal of the primary transducer may be too small to drive indicating, recording or control elements directly. Or it may be in a form which is not convenient for

More information

EECS150 - Digital Design Lecture 26 Faults and Error Correction. Recap

EECS150 - Digital Design Lecture 26 Faults and Error Correction. Recap EECS150 - Digital Design Lecture 26 Faults and Error Correction Nov. 26, 2013 Prof. Ronald Fearing Electrical Engineering and Computer Sciences University of California, Berkeley (slides courtesy of Prof.

More information

Reliable Computing I

Reliable Computing I Instructor: Mehdi Tahoori Reliable Computing I Lecture 5: Reliability Evaluation INSTITUTE OF COMPUTER ENGINEERING (ITEC) CHAIR FOR DEPENDABLE NANO COMPUTING (CDNC) National Research Center of the Helmholtz

More information

Rel: Estimating Digital System Reliability

Rel: Estimating Digital System Reliability Rel 1 Rel: Estimating Digital System Reliability Qualitatively, the reliability of a digital system is the likelihood that it works correctly when you need it. Marketing and sa les people like to say that

More information

Chapter 2 Fault Modeling

Chapter 2 Fault Modeling Chapter 2 Fault Modeling Jin-Fu Li Advanced Reliable Systems (ARES) Lab. Department of Electrical Engineering National Central University Jungli, Taiwan Outline Why Model Faults? Fault Models (Faults)

More information

Dependable Systems. ! Dependability Attributes. Dr. Peter Tröger. Sources:

Dependable Systems. ! Dependability Attributes. Dr. Peter Tröger. Sources: Dependable Systems! Dependability Attributes Dr. Peter Tröger! Sources:! J.C. Laprie. Dependability: Basic Concepts and Terminology Eusgeld, Irene et al.: Dependability Metrics. 4909. Springer Publishing,

More information

EE 445 / 850: Final Examination

EE 445 / 850: Final Examination EE 445 / 850: Final Examination Date and Time: 3 Dec 0, PM Room: HLTH B6 Exam Duration: 3 hours One formula sheet permitted. - Covers chapters - 5 problems each carrying 0 marks - Must show all calculations

More information

Dependable Computer Systems

Dependable Computer Systems Dependable Computer Systems Part 3: Fault-Tolerance and Modelling Contents Reliability: Basic Mathematical Model Example Failure Rate Functions Probabilistic Structural-Based Modeling: Part 1 Maintenance

More information

Figure 1.1: Schematic symbols of an N-transistor and P-transistor

Figure 1.1: Schematic symbols of an N-transistor and P-transistor Chapter 1 The digital abstraction The term a digital circuit refers to a device that works in a binary world. In the binary world, the only values are zeros and ones. Hence, the inputs of a digital circuit

More information

Reliability Analysis of Moog Ultrasonic Air Bubble Detectors

Reliability Analysis of Moog Ultrasonic Air Bubble Detectors Reliability Analysis of Moog Ultrasonic Air Bubble Detectors Air-in-line sensors are vital to the performance of many of today s medical device applications. The reliability of these sensors should be

More information

Combinational Techniques for Reliability Modeling

Combinational Techniques for Reliability Modeling Combinational Techniques for Reliability Modeling Prof. Naga Kandasamy, ECE Department Drexel University, Philadelphia, PA 19104. January 24, 2009 The following material is derived from these text books.

More information

10 Introduction to Reliability

10 Introduction to Reliability 0 Introduction to Reliability 10 Introduction to Reliability The following notes are based on Volume 6: How to Analyze Reliability Data, by Wayne Nelson (1993), ASQC Press. When considering the reliability

More information

We are IntechOpen, the world s leading publisher of Open Access books Built by scientists, for scientists. International authors and editors

We are IntechOpen, the world s leading publisher of Open Access books Built by scientists, for scientists. International authors and editors We are IntechOpen, the world s leading publisher of Open Access books Built by scientists, for scientists 3,900 116,000 120M Open access books available International authors and editors Downloads Our

More information

CHAPTER 10 RELIABILITY

CHAPTER 10 RELIABILITY CHAPTER 10 RELIABILITY Failure rates Reliability Constant failure rate and exponential distribution System Reliability Components in series Components in parallel Combination system 1 Failure Rate Curve

More information

Engineering Risk Benefit Analysis

Engineering Risk Benefit Analysis Engineering Risk Benefit Analysis 1.155, 2.943, 3.577, 6.938, 10.816, 13.621, 16.862, 22.82, ESD.72, ESD.721 RPRA 3. Probability Distributions in RPRA George E. Apostolakis Massachusetts Institute of Technology

More information

Advanced Testing. EE5375 ADD II Prof. MacDonald

Advanced Testing. EE5375 ADD II Prof. MacDonald Advanced Testing EE5375 ADD II Prof. MacDonald Functional Testing l Original testing method l Run chip from reset l Tester emulates the outside world l Chip runs functionally with internally generated

More information

Failure rate in the continuous sense. Figure. Exponential failure density functions [f(t)] 1

Failure rate in the continuous sense. Figure. Exponential failure density functions [f(t)] 1 Failure rate (Updated and Adapted from Notes by Dr. A.K. Nema) Part 1: Failure rate is the frequency with which an engineered system or component fails, expressed for example in failures per hour. It is

More information

Concept of Reliability

Concept of Reliability Concept of Reliability Prepared By Dr. M. S. Memon Department of Industrial Engineering and Management Mehran University of Engineering and Technology Jamshoro, Sindh, Pakistan RELIABILITY Reliability

More information

Lecture 16: Circuit Pitfalls

Lecture 16: Circuit Pitfalls Introduction to CMOS VLSI Design Lecture 16: Circuit Pitfalls David Harris Harvey Mudd College Spring 2004 Outline Pitfalls Detective puzzle Given circuit and symptom, diagnose cause and recommend solution

More information

ECE-470 Digital Design II Memory Test. Memory Cells Per Chip. Failure Mechanisms. Motivation. Test Time in Seconds (Memory Size: n Bits) Fault Types

ECE-470 Digital Design II Memory Test. Memory Cells Per Chip. Failure Mechanisms. Motivation. Test Time in Seconds (Memory Size: n Bits) Fault Types ECE-470 Digital Design II Memory Test Motivation Semiconductor memories are about 35% of the entire semiconductor market Memories are the most numerous IPs used in SOC designs Number of bits per chip continues

More information

ELE 491 Senior Design Project Proposal

ELE 491 Senior Design Project Proposal ELE 491 Senior Design Project Proposal These slides are loosely based on the book Design for Electrical and Computer Engineers by Ford and Coulston. I have used the sources referenced in the book freely

More information

Practical Applications of Reliability Theory

Practical Applications of Reliability Theory Practical Applications of Reliability Theory George Dodson Spallation Neutron Source Managed by UT-Battelle Topics Reliability Terms and Definitions Reliability Modeling as a tool for evaluating system

More information

Tradeoff between Reliability and Power Management

Tradeoff between Reliability and Power Management Tradeoff between Reliability and Power Management 9/1/2005 FORGE Lee, Kyoungwoo Contents 1. Overview of relationship between reliability and power management 2. Dakai Zhu, Rami Melhem and Daniel Moss e,

More information

Comparative Distributions of Hazard Modeling Analysis

Comparative Distributions of Hazard Modeling Analysis Comparative s of Hazard Modeling Analysis Rana Abdul Wajid Professor and Director Center for Statistics Lahore School of Economics Lahore E-mail: drrana@lse.edu.pk M. Shuaib Khan Department of Statistics

More information

Fault Modeling. 李昆忠 Kuen-Jong Lee. Dept. of Electrical Engineering National Cheng-Kung University Tainan, Taiwan. VLSI Testing Class

Fault Modeling. 李昆忠 Kuen-Jong Lee. Dept. of Electrical Engineering National Cheng-Kung University Tainan, Taiwan. VLSI Testing Class Fault Modeling 李昆忠 Kuen-Jong Lee Dept. of Electrical Engineering National Cheng-Kung University Tainan, Taiwan Class Fault Modeling Some Definitions Why Modeling Faults Various Fault Models Fault Detection

More information

Chapter 5. System Reliability and Reliability Prediction.

Chapter 5. System Reliability and Reliability Prediction. Chapter 5. System Reliability and Reliability Prediction. Problems & Solutions. Problem 1. Estimate the individual part failure rate given a base failure rate of 0.0333 failure/hour, a quality factor of

More information

Chapter 15. System Reliability Concepts and Methods. William Q. Meeker and Luis A. Escobar Iowa State University and Louisiana State University

Chapter 15. System Reliability Concepts and Methods. William Q. Meeker and Luis A. Escobar Iowa State University and Louisiana State University Chapter 15 System Reliability Concepts and Methods William Q. Meeker and Luis A. Escobar Iowa State University and Louisiana State University Copyright 1998-2008 W. Q. Meeker and L. A. Escobar. Based on

More information

ECE 3060 VLSI and Advanced Digital Design. Testing

ECE 3060 VLSI and Advanced Digital Design. Testing ECE 3060 VLSI and Advanced Digital Design Testing Outline Definitions Faults and Errors Fault models and definitions Fault Detection Undetectable Faults can be used in synthesis Fault Simulation Observability

More information

Evaluation and Validation

Evaluation and Validation Evaluation and Validation Peter Marwedel TU Dortmund, Informatik 12 Germany Graphics: Alexandra Nolte, Gesine Marwedel, 2003 2011 06 18 These slides use Microsoft clip arts. Microsoft copyright restrictions

More information

Importance of the Running-In Phase on the Life of Repairable Systems

Importance of the Running-In Phase on the Life of Repairable Systems Engineering, 214, 6, 78-83 Published Online February 214 (http://www.scirp.org/journal/eng) http://dx.doi.org/1.4236/eng.214.6211 Importance of the Running-In Phase on the Life of Repairable Systems Salima

More information

Load-strength Dynamic Interaction Principle and Failure Rate Model

Load-strength Dynamic Interaction Principle and Failure Rate Model International Journal of Performability Engineering Vol. 6, No. 3, May 21, pp. 25-214. RAMS Consultants Printed in India Load-strength Dynamic Interaction Principle and Failure Rate Model LIYANG XIE and

More information

Reliability of Bulk Power Systems (cont d)

Reliability of Bulk Power Systems (cont d) Reliability of Bulk Power Systems (cont d) Important requirements of a reliable electric power service Voltage and frequency must be held within close tolerances Synchronous generators must be kept running

More information

Key Words: Lifetime Data Analysis (LDA), Probability Density Function (PDF), Goodness of fit methods, Chi-square method.

Key Words: Lifetime Data Analysis (LDA), Probability Density Function (PDF), Goodness of fit methods, Chi-square method. Reliability prediction based on lifetime data analysis methodology: The pump case study Abstract: The business case aims to demonstrate the lifetime data analysis methodology application from the historical

More information

Understanding Integrated Circuit Package Power Capabilities

Understanding Integrated Circuit Package Power Capabilities Understanding Integrated Circuit Package Power Capabilities INTRODUCTION The short and long term reliability of National Semiconductor s interface circuits like any integrated circuit is very dependent

More information

Reliability Engineering I

Reliability Engineering I Happiness is taking the reliability final exam. Reliability Engineering I ENM/MSC 565 Review for the Final Exam Vital Statistics What R&M concepts covered in the course When Monday April 29 from 4:30 6:00

More information

Reliability and Availability Simulation. Krige Visser, Professor, University of Pretoria, South Africa

Reliability and Availability Simulation. Krige Visser, Professor, University of Pretoria, South Africa Reliability and Availability Simulation Krige Visser, Professor, University of Pretoria, South Africa Content BACKGROUND DEFINITIONS SINGLE COMPONENTS MULTI-COMPONENT SYSTEMS AVAILABILITY SIMULATION CONCLUSION

More information

Introduction to Reliability Theory (part 2)

Introduction to Reliability Theory (part 2) Introduction to Reliability Theory (part 2) Frank Coolen UTOPIAE Training School II, Durham University 3 July 2018 (UTOPIAE) Introduction to Reliability Theory 1 / 21 Outline Statistical issues Software

More information

Understanding Integrated Circuit Package Power Capabilities

Understanding Integrated Circuit Package Power Capabilities Understanding Integrated Circuit Package Power Capabilities INTRODUCTION The short and long term reliability of s interface circuits, like any integrated circuit, is very dependent on its environmental

More information

Semiconductor Reliability

Semiconductor Reliability Semiconductor Reliability. Semiconductor Device Failure Region Below figure shows the time-dependent change in the semiconductor device failure rate. Discussions on failure rate change in time often classify

More information

Cyber Physical Power Systems Power in Communications

Cyber Physical Power Systems Power in Communications 1 Cyber Physical Power Systems Power in Communications Information and Communications Tech. Power Supply 2 ICT systems represent a noticeable (about 5 % of total t demand d in U.S.) fast increasing load.

More information

IoT Network Quality/Reliability

IoT Network Quality/Reliability IoT Network Quality/Reliability IEEE PHM June 19, 2017 Byung K. Yi, Dr. of Sci. Executive V.P. & CTO, InterDigital Communications, Inc Louis Kerofsky, PhD. Director of Partner Development InterDigital

More information

Part 3: Fault-tolerance and Modeling

Part 3: Fault-tolerance and Modeling Part 3: Fault-tolerance and Modeling Course: Dependable Computer Systems 2012, Stefan Poledna, All rights reserved part 3, page 1 Goals of fault-tolerance modeling Design phase Designing and implementing

More information

Semiconductor Technologies

Semiconductor Technologies UNIVERSITI TUNKU ABDUL RAHMAN Semiconductor Technologies Quality Control Dr. Lim Soo King 1/2/212 Chapter 1 Quality Control... 1 1. Introduction... 1 1.1 In-Process Quality Control and Incoming Quality

More information

Statistics for Engineers Lecture 4 Reliability and Lifetime Distributions

Statistics for Engineers Lecture 4 Reliability and Lifetime Distributions Statistics for Engineers Lecture 4 Reliability and Lifetime Distributions Chong Ma Department of Statistics University of South Carolina chongm@email.sc.edu February 15, 2017 Chong Ma (Statistics, USC)

More information

Contents. Introduction Field Return Rate Estimation Process Used Parameters

Contents. Introduction Field Return Rate Estimation Process Used Parameters Contents Introduction Field Return Rate Estimation Process Used Parameters Parts Count Reliability Prediction Accelerated Stress Testing Maturity Level Field Return Rate Indicator Function Field Return

More information

Fault Tolerant Computing CS 530 Fault Modeling

Fault Tolerant Computing CS 530 Fault Modeling CS 53 Fault Modeling Yashwant K. Malaiya Colorado State University Fault Modeling Why fault modeling? Stuck-at / fault model The single fault assumption Bridging and delay faults MOS transistors and CMOS

More information

Confidence Intervals. Confidence interval for sample mean. Confidence interval for sample mean. Confidence interval for sample mean

Confidence Intervals. Confidence interval for sample mean. Confidence interval for sample mean. Confidence interval for sample mean Confidence Intervals Confidence interval for sample mean The CLT tells us: as the sample size n increases, the sample mean is approximately Normal with mean and standard deviation Thus, we have a standard

More information

Dictionary-Less Defect Diagnosis as Surrogate Single Stuck-At Faults

Dictionary-Less Defect Diagnosis as Surrogate Single Stuck-At Faults Dictionary-Less Defect Diagnosis as Surrogate Single Stuck-At Faults Chidambaram Alagappan and Vishwani D. Agrawal Department of Electrical and Computer Engineering Auburn University, Auburn, AL 36849,

More information

Lecture 16: Circuit Pitfalls

Lecture 16: Circuit Pitfalls Lecture 16: Circuit Pitfalls Outline Variation Noise Budgets Reliability Circuit Pitfalls 2 Variation Process Threshold Channel length Interconnect dimensions Environment Voltage Temperature Aging / Wearout

More information

Temperature and Humidity Acceleration Factors on MLV Lifetime

Temperature and Humidity Acceleration Factors on MLV Lifetime Temperature and Humidity Acceleration Factors on MLV Lifetime With and Without DC Bias Greg Caswell Introduction This white paper assesses the temperature and humidity acceleration factors both with and

More information

CRYOGENIC DRAM BASED MEMORY SYSTEM FOR SCALABLE QUANTUM COMPUTERS: A FEASIBILITY STUDY

CRYOGENIC DRAM BASED MEMORY SYSTEM FOR SCALABLE QUANTUM COMPUTERS: A FEASIBILITY STUDY CRYOGENIC DRAM BASED MEMORY SYSTEM FOR SCALABLE QUANTUM COMPUTERS: A FEASIBILITY STUDY MEMSYS-2017 SWAMIT TANNU DOUG CARMEAN MOINUDDIN QURESHI Why Quantum Computers? 2 Molecule and Material Simulations

More information

Time-varying failure rate for system reliability analysis in large-scale railway risk assessment simulation

Time-varying failure rate for system reliability analysis in large-scale railway risk assessment simulation Time-varying failure rate for system reliability analysis in large-scale railway risk assessment simulation H. Zhang, E. Cutright & T. Giras Center of Rail Safety-Critical Excellence, University of Virginia,

More information

CMP 338: Third Class

CMP 338: Third Class CMP 338: Third Class HW 2 solution Conversion between bases The TINY processor Abstraction and separation of concerns Circuit design big picture Moore s law and chip fabrication cost Performance What does

More information

Memory Thermal Management 101

Memory Thermal Management 101 Memory Thermal Management 101 Overview With the continuing industry trends towards smaller, faster, and higher power memories, thermal management is becoming increasingly important. Not only are device

More information

Lecture 5 Probability

Lecture 5 Probability Lecture 5 Probability Dr. V.G. Snell Nuclear Reactor Safety Course McMaster University vgs 1 Probability Basic Ideas P(A)/probability of event A 'lim n64 ( x n ) (1) (Axiom #1) 0 # P(A) #1 (1) (Axiom #2):

More information

Considering Security Aspects in Safety Environment. Dipl.-Ing. Evzudin Ugljesa

Considering Security Aspects in Safety Environment. Dipl.-Ing. Evzudin Ugljesa Considering Security spects in Safety Environment Dipl.-ng. Evzudin Ugljesa Overview ntroduction Definitions of safety relevant parameters Description of the oo4-architecture Calculation of the FD-Value

More information

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah

PERFORMANCE METRICS. Mahdi Nazm Bojnordi. CS/ECE 6810: Computer Architecture. Assistant Professor School of Computing University of Utah PERFORMANCE METRICS Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Jan. 17 th : Homework 1 release (due on Jan.

More information

Why the fuss about measurements and precision?

Why the fuss about measurements and precision? Introduction In this tutorial you will learn the definitions, rules and techniques needed to record measurements in the laboratory to the proper precision (significant figures). You should also develop

More information

EE241 - Spring 2000 Advanced Digital Integrated Circuits. Announcements

EE241 - Spring 2000 Advanced Digital Integrated Circuits. Announcements EE241 - Spring 2 Advanced Digital Integrated Circuits Lecture 11 Low Power-Low Energy Circuit Design Announcements Homework #2 due Friday, 3/3 by 5pm Midterm project reports due in two weeks - 3/7 by 5pm

More information

B.H. Far

B.H. Far SENG 637 Dependability, Reliability & Testing of Software Systems Chapter 3: System Reliability Department of Electrical & Computer Engineering, University of Calgary B.H. Far (far@ucalgary.ca) http://www.enel.ucalgary.ca/people/far/lectures/seng637/

More information

System-Level Modeling and Microprocessor Reliability Analysis for Backend Wearout Mechanisms

System-Level Modeling and Microprocessor Reliability Analysis for Backend Wearout Mechanisms System-Level Modeling and Microprocessor Reliability Analysis for Backend Wearout Mechanisms Chang-Chih Chen and Linda Milor School of Electrical and Comptuer Engineering, Georgia Institute of Technology,

More information

CHAPTER 3 ANALYSIS OF RELIABILITY AND PROBABILITY MEASURES

CHAPTER 3 ANALYSIS OF RELIABILITY AND PROBABILITY MEASURES 27 CHAPTER 3 ANALYSIS OF RELIABILITY AND PROBABILITY MEASURES 3.1 INTRODUCTION The express purpose of this research is to assimilate reliability and its associated probabilistic variables into the Unit

More information

DVClub Europe Formal fault analysis for ISO fault metrics on real world designs. Jörg Große Product Manager Functional Safety November 2016

DVClub Europe Formal fault analysis for ISO fault metrics on real world designs. Jörg Große Product Manager Functional Safety November 2016 DVClub Europe Formal fault analysis for ISO 26262 fault metrics on real world designs Jörg Große Product Manager Functional Safety November 2016 Page 1 11/27/2016 Introduction Functional Safety The objective

More information

Incremental Latin Hypercube Sampling

Incremental Latin Hypercube Sampling Incremental Latin Hypercube Sampling for Lifetime Stochastic Behavioral Modeling of Analog Circuits Yen-Lung Chen +, Wei Wu *, Chien-Nan Jimmy Liu + and Lei He * EE Dept., National Central University,

More information

Fundamentals of Reliability Engineering and Applications

Fundamentals of Reliability Engineering and Applications Fundamentals of Reliability Engineering and Applications E. A. Elsayed elsayed@rci.rutgers.edu Rutgers University Quality Control & Reliability Engineering (QCRE) IIE February 21, 2012 1 Outline Part 1.

More information

Availability and Reliability Analysis for Dependent System with Load-Sharing and Degradation Facility

Availability and Reliability Analysis for Dependent System with Load-Sharing and Degradation Facility International Journal of Systems Science and Applied Mathematics 2018; 3(1): 10-15 http://www.sciencepublishinggroup.com/j/ijssam doi: 10.11648/j.ijssam.20180301.12 ISSN: 2575-5838 (Print); ISSN: 2575-5803

More information

Design for Manufacturability and Power Estimation. Physical issues verification (DSM)

Design for Manufacturability and Power Estimation. Physical issues verification (DSM) Design for Manufacturability and Power Estimation Lecture 25 Alessandra Nardi Thanks to Prof. Jan Rabaey and Prof. K. Keutzer Physical issues verification (DSM) Interconnects Signal Integrity P/G integrity

More information

Single Stuck-At Fault Model Other Fault Models Redundancy and Untestable Faults Fault Equivalence and Fault Dominance Method of Boolean Difference

Single Stuck-At Fault Model Other Fault Models Redundancy and Untestable Faults Fault Equivalence and Fault Dominance Method of Boolean Difference Single Stuck-At Fault Model Other Fault Models Redundancy and Untestable Faults Fault Equivalence and Fault Dominance Method of Boolean Difference Copyright 1998 Elizabeth M. Rudnick 1 Modeling the effects

More information

Non-observable failure progression

Non-observable failure progression Non-observable failure progression 1 Age based maintenance policies We consider a situation where we are not able to observe failure progression, or where it is impractical to observe failure progression:

More information

Introduction to Engineering Reliability

Introduction to Engineering Reliability Introduction to Engineering Reliability Robert C. Patev North Atlantic Division Regional Technical Specialist (978) 318-8394 Topics Reliability Basic Principles of Reliability Analysis Non-Probabilistic

More information

Lecture 2: Metrics to Evaluate Systems

Lecture 2: Metrics to Evaluate Systems Lecture 2: Metrics to Evaluate Systems Topics: Metrics: power, reliability, cost, benchmark suites, performance equation, summarizing performance with AM, GM, HM Sign up for the class mailing list! Video

More information

Design of Reliable Processors Based on Unreliable Devices Séminaire COMELEC

Design of Reliable Processors Based on Unreliable Devices Séminaire COMELEC Design of Reliable Processors Based on Unreliable Devices Séminaire COMELEC Lirida Alves de Barros Naviner Paris, 1 July 213 Outline Basics on reliability Technology Aspects Design for Reliability Conclusions

More information

Fault Tolerant Computing CS 530 Fault Modeling. Yashwant K. Malaiya Colorado State University

Fault Tolerant Computing CS 530 Fault Modeling. Yashwant K. Malaiya Colorado State University CS 530 Fault Modeling Yashwant K. Malaiya Colorado State University 1 Objectives The number of potential defects in a unit under test is extremely large. A fault-model presumes that most of the defects

More information

Multi-State Availability Modeling in Practice

Multi-State Availability Modeling in Practice Multi-State Availability Modeling in Practice Kishor S. Trivedi, Dong Seong Kim, Xiaoyan Yin Depart ment of Electrical and Computer Engineering, Duke University, Durham, NC 27708 USA kst@ee.duke.edu, {dk76,

More information

Quality and Reliability Report

Quality and Reliability Report Quality and Reliability Report GaAs Schottky Diode Products 051-06444 rev C Marki Microwave Inc. 215 Vineyard Court, Morgan Hill, CA 95037 Phone: (408) 778-4200 / FAX: (408) 778-4300 Email: info@markimicrowave.com

More information

Fault Modeling. Fault Modeling Outline

Fault Modeling. Fault Modeling Outline Fault Modeling Outline Single Stuck-t Fault Model Other Fault Models Redundancy and Untestable Faults Fault Equivalence and Fault Dominance Method of oolean Difference Copyright 1998 Elizabeth M. Rudnick

More information

Overview ECE 553: TESTING AND TESTABLE DESIGN OF. Memory Density. Test Time in Seconds (Memory Size n Bits) 10/28/2014

Overview ECE 553: TESTING AND TESTABLE DESIGN OF. Memory Density. Test Time in Seconds (Memory Size n Bits) 10/28/2014 ECE 553: TESTING AND TESTABLE DESIGN OF DIGITAL SYSTES Memory testing Overview Motivation and introduction Functional model of a memory A simple minded test and its limitations Fault models March tests

More information

6. STRUCTURAL SAFETY

6. STRUCTURAL SAFETY 6.1 RELIABILITY 6. STRUCTURAL SAFETY Igor Kokcharov Dependability is the ability of a structure to maintain its working parameters in given ranges for a stated period of time. Dependability is a collective

More information

Quantitative Reliability Analysis

Quantitative Reliability Analysis Quantitative Reliability Analysis Moosung Jae May 4, 2015 System Reliability Analysis System reliability analysis is conducted in terms of probabilities The probabilities of events can be modelled as logical

More information

Chapter 2 Reliability Metrology

Chapter 2 Reliability Metrology Chapter 2 Reliability Metrology Abstract In this chapter, first, reliability is defined. Then, different ways of modeling reliability are discussed. Empirical models are based on field data and are easy

More information

ECE 512 Digital System Testing and Design for Testability. Model Solutions for Assignment #3

ECE 512 Digital System Testing and Design for Testability. Model Solutions for Assignment #3 ECE 512 Digital System Testing and Design for Testability Model Solutions for Assignment #3 14.1) In a fault-free instance of the circuit in Fig. 14.15, holding the input low for two clock cycles should

More information

Maintenance free operating period an alternative measure to MTBF and failure rate for specifying reliability?

Maintenance free operating period an alternative measure to MTBF and failure rate for specifying reliability? Reliability Engineering and System Safety 64 (1999) 127 131 Technical note Maintenance free operating period an alternative measure to MTBF and failure rate for specifying reliability? U. Dinesh Kumar

More information

DC/DC converter Input 9-36 and Vdc Output up to 0.5A/3W

DC/DC converter Input 9-36 and Vdc Output up to 0.5A/3W PKV 3000 I PKV 5000 I DC/DC converter Input 9-36 and 18-72 Vdc up to 0.5A/3W Key Features Industry standard DIL24 Wide input voltage range, 9 36V, 18 72V High efficiency 74 83% typical Low idling power

More information

AN Reliability of High Power Bipolar Devices Application Note AN September 2009 LN26862 Authors: Dinesh Chamund, Colin Rout

AN Reliability of High Power Bipolar Devices Application Note AN September 2009 LN26862 Authors: Dinesh Chamund, Colin Rout Reliability of High Power Bipolar Devices Application Note AN5948-2 September 2009 LN26862 Authors: Dinesh Chamund, Colin Rout INTRODUCTION We are often asked What is the MTBF or FIT rating of this diode

More information

Statistical Analysis of BTI in the Presence of Processinduced Voltage and Temperature Variations

Statistical Analysis of BTI in the Presence of Processinduced Voltage and Temperature Variations Statistical Analysis of BTI in the Presence of Processinduced Voltage and Temperature Variations Farshad Firouzi, Saman Kiamehr, Mehdi. B. Tahoori INSTITUTE OF COMPUTER ENGINEERING (ITEC) CHAIR FOR DEPENDABLE

More information

Example: sending one bit of information across noisy channel. Effects of the noise: flip the bit with probability p.

Example: sending one bit of information across noisy channel. Effects of the noise: flip the bit with probability p. Lecture 20 Page 1 Lecture 20 Quantum error correction Classical error correction Modern computers: failure rate is below one error in 10 17 operations Data transmission and storage (file transfers, cell

More information

FAULT TOLERANT DESIGN: AN INTRODUCTION

FAULT TOLERANT DESIGN: AN INTRODUCTION FAULT TOLERANT DESIGN: AN INTRODUCTION ELENA DUBROVA Department of Microelectronics and Information Technology Royal Institute of Technology Stockholm, Sweden Kluwer Academic Publishers Boston/Dordrecht/London

More information

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51

2.6 Complexity Theory for Map-Reduce. Star Joins 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 2.6. COMPLEXITY THEORY FOR MAP-REDUCE 51 Star Joins A common structure for data mining of commercial data is the star join. For example, a chain store like Walmart keeps a fact table whose tuples each

More information

Weibull models for software reliability: case studies

Weibull models for software reliability: case studies Weibull models for software reliability: case studies Pavel Praks a Tomáš Musil b Petr Zajac c Pierre-Etienne Labeau d mailto:pavel.praks@vsb.cz December 6, 2007 avšb - Technical University of Ostrava,

More information

Radiation Effects in Nano Inverter Gate

Radiation Effects in Nano Inverter Gate Nanoscience and Nanotechnology 2012, 2(6): 159-163 DOI: 10.5923/j.nn.20120206.02 Radiation Effects in Nano Inverter Gate Nooshin Mahdavi Sama Technical and Vocational Training College, Islamic Azad University,

More information