Rel: Estimating Digital System Reliability


Qualitatively, the reliability of a digital system is the likelihood that it works correctly when you need it. Marketing and sales people like to say that the systems they sell have high reliability, meaning that the systems are pretty likely to keep working. However, savvy customers ask questions that require more concrete answers, such as, "If I buy 100 of these systems, how many will fail in a year?" To provide these answers, digital design engineers are often responsible for calculating the reliability of the systems they design, and in any case they should be aware of the factors that affect reliability.

Quantitatively, reliability is expressed as a mathematical function of time:

    R(t) = the probability that the system still works correctly at time t

Reliability is a real number between 0 and 1; that is, at any time t, 0 ≤ R(t) ≤ 1. We assume that R(t) is a monotonically decreasing function; that is, failures are permanent and we do not consider the effects of repair. Figure Rel-1 is a typical reliability function.

Figure Rel-1. Typical reliability function for a system: R(t) falls exponentially from 1.00 at 0 months to about 0.14 at 24 months.

The foregoing definition of reliability assumes that you know the mathematical definition of probability. If you don't, reliability and the equivalent probability are easiest to define in terms of an experiment. Suppose that we were to build and operate N identical copies of the system in question. Let W_N(t) denote the number of them that would still be working at time t. Then,

    R(t) = lim (N → ∞) W_N(t) / N

That is, if we build lots of systems, R(t) is the fraction of them that are still working at time t. When we talk about the reliability of a single system, we are simply using our experience with a large population to estimate our chances with a single unit.

It would be very expensive if the only way to compute R(t) were by experiment, that is, to build and monitor N copies of the system. Worse, for any t, we wouldn't know the value of R(t) until time t had elapsed in real time. Thus, to answer the customer's question posed earlier, we'd have to build a bunch of systems and wait a year; by then, our potential customer would have purchased something else.
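The experimental definition above lends itself to a quick simulation. The Python sketch below is illustrative only: the function name, the choice of an exponentially distributed lifetime model, and all of the numbers are assumptions for the sake of the example, not anything specified in the text. It builds N hypothetical copies of a system and estimates R(t) as the fraction still working at time t.

import math
import random

def estimate_reliability(n_copies, t_months, mean_life_months):
    """Estimate R(t) as W_N(t)/N over n_copies simulated systems.

    Purely for illustration, each copy's lifetime is assumed to be
    exponentially distributed with the given mean life.
    """
    lifetimes = [random.expovariate(1.0 / mean_life_months) for _ in range(n_copies)]
    still_working = sum(1 for life in lifetimes if life > t_months)
    return still_working / n_copies

# 10,000 copies with a mean life of 12 months (lambda = 1 failure/year),
# evaluated at t = 12 months; the estimate should land near e^-1 = 0.37.
print(estimate_reliability(10_000, 12, 12))
print(math.exp(-1.0))  # theoretical value for comparison

With a large n_copies the estimated fraction converges on the theoretical curve, which is exactly the limit in the definition above.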

Instead, we can estimate the reliability of a system by combining reliability information for its individual components, using a simple mathematical model. The reliability of mature components (e.g., 74FCT CMOS chips) may be known and published based on actual experimental evidence, while the reliability of new components (e.g., a Sexium microprocessor) may be estimated or extrapolated from experience with something similar. In any case, a component's reliability is typically described by a single number, the failure rate described next.

Rel.1 Failure Rates

The failure rate is the number of failures that occur per unit time in a component or system. In mathematical formulas, failure rate is usually denoted by the Greek letter λ. Since failures occur infrequently in electronic equipment, the failure rate is measured or estimated using many identical copies of a component or system. For example, if we operate 10,000 microprocessor chips for 1,000 hours, and eight of them fail, we would say that the failure rate is

    λ = 8 failures / (10⁴ chips × 10³ hours) = 8 × 10⁻⁷ failures/hour per chip

That is, the failure rate of a single chip is 8 × 10⁻⁷ failures/hour. The actual process of estimating the reliability of a batch of chips is not nearly as simple as we've just portrayed it; for more information, see the References. However, we can use individual component failure rates, derived by whatever means, in a straightforward mathematical model to predict overall system reliability, as we'll show later in this section.

Since the failure rates for typical electronic components are so small, there are several scaled units that are commonly used for expressing them: percent failures per 10³ hours, failures per 10⁶ hours, and failures per 10⁹ hours. The last unit is called a FIT:

    1 FIT = 1 failure per 10⁹ hours

In our earlier example, we would say that λ_microprocessor = 800 FITs.

The failure rate of a typical electronic component is a function of time. As shown in Figure Rel-2, a typical component has a high failure rate during its early life, during which most manufacturing defects make themselves visible; failures during this period are called infant mortality. Infant mortality is why manufacturers of high-quality equipment perform burn-in, operating the equipment for 8 to 168 hours before shipping it to customers. With burn-in, most infant mortality occurs at the factory rather than at the customer's premises. Even without a thorough burn-in period, the seemingly stingy 90-day warranty offered by most electronic equipment manufacturers does in fact cover most of the failures that occur in the first few years of the equipment's operation (but if it fails on the 91st day, that's tough!).
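Returning to the failure-rate arithmetic above, the calculation and the FIT conversion are easy to capture in a couple of helper functions. The sketch below simply reproduces the 10,000-chip example in Python; the function names are ours, not anything defined in the text.

def failure_rate_per_hour(failures, units, hours):
    """Per-device failure rate, as in the text's 10,000-chip example."""
    return failures / (units * hours)

def to_fits(rate_per_hour):
    """Convert a per-hour failure rate to FITs (failures per 10^9 hours)."""
    return rate_per_hour * 1e9

# 10,000 microprocessor chips operated for 1,000 hours, with 8 failures:
lam = failure_rate_per_hour(8, 10_000, 1_000)
print(lam)           # 8e-07 failures/hour per chip
print(to_fits(lam))  # 800.0 FITs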

Figure Rel-2. The bathtub curve for electronic-component failure rates: failure rate plotted against time, with infant-mortality, normal-working-life, and wear-out regions.

This is quite different from the situation with an automobile or other piece of mechanical equipment, where wear and tear increases the failure rate as a function of time. Once an electronic component has successfully passed the burn-in phase, its failure rate can be expected to be pretty much constant. Depending on the component, there may be wear-out mechanisms that occur late in the component's life, increasing its failure rate. In the old days, vacuum tubes often wore out after a few thousand hours because their filaments deteriorated from thermal stress. Nowadays, most electronic equipment reaches obsolescence before its solid-state components start to experience wear-out failures. For example, even though it's been over 25 years since the widespread use of EPROMs began, many of which were guaranteed to store data for only 10 years, we haven't seen a rash of equipment failures caused by their bits leaking away. (Do you know anyone with a 10-year-old PC or VCR?)

Thus, in practice, the infant-mortality and wear-out phases of electronic-component lifetime are ignored, and reliability is calculated on the assumption that the failure rate is constant during the normal working life of electronic equipment. This assumption, which says that a failure is equally likely at any time in a component's working life, allows us to use a simplified mathematical model to predict system reliability, as we'll show later in this section.

There are some other factors that can affect component failure rates, including temperature, humidity, shock, vibration, and power cycling. The most significant of these for ICs is temperature. Many IC failure mechanisms involve chemical reactions between the chip and some kind of contaminant, and these are accelerated by higher temperatures. Likewise, electrically overstressing a transistor, which heats it up too much and eventually destroys it, is worse if the device temperature is high to begin with. Both theoretical and empirical evidence support the following widely used rule of thumb:

    An IC's failure rate roughly doubles for every 10°C rise in temperature.

This rule is true to a greater or lesser degree for most other electronic parts. Note that the temperature of interest in the foregoing rule is the internal temperature of the IC, not the ambient temperature of the surrounding air.
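The 10°C rule of thumb is easy to apply numerically. The sketch below encodes it directly; the function name and the example temperatures are illustrative assumptions, and the rule itself is only an approximation, as the text notes.

def scale_failure_rate(rate_fits, temp_c, ref_temp_c):
    """Scale a failure rate using the rule of thumb that it roughly
    doubles for every 10 C rise in internal (not ambient) temperature.
    """
    return rate_fits * 2 ** ((temp_c - ref_temp_c) / 10.0)

# A part rated at 500 FITs at 55 C, running 20 C hotter:
print(scale_failure_rate(500, 75, 55))   # about 2000 FITs
# The same part cooled to 10 C below the reference temperature:
print(scale_failure_rate(500, 45, 55))   # about 250 FITs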

A power-hogging component in a system without forced-air cooling may have an internal temperature as much as 40-50°C higher than ambient. A well-placed fan may reduce the temperature rise to 10-20°C, reducing the component's failure rate by perhaps a factor of 10.

Rel.2 Reliability and MTBF

For components with a constant failure rate λ, it can be shown that reliability is an exponential function of time:

    R(t) = e^(–λt)

The reliability curve in Figure Rel-1 is such a function; it happens to use the value λ = 1 failure/year.

Another measure of the reliability of a component or system is the mean time between failures (MTBF), the average time that it takes for a component to fail. For components with a constant failure rate λ, it can be shown that MTBF is simply the reciprocal of λ:

    MTBF = 1 / λ

Rel.3 System Reliability

Suppose that we build a system with m components, each with a different failure rate, λ_1, λ_2, ..., λ_m. Let us assume that for the system to operate properly, all of its components must operate properly. Basic probability theory says that the system reliability is then given by the formula

    R_sys(t) = R_1(t) · R_2(t) · ... · R_m(t)
             = e^(–λ_1·t) · e^(–λ_2·t) · ... · e^(–λ_m·t)
             = e^(–(λ_1 + λ_2 + ... + λ_m)·t)
             = e^(–λ_sys·t)

where

    λ_sys = λ_1 + λ_2 + ... + λ_m

Thus, system reliability is also an exponential function, using a composite failure rate λ_sys that is the sum of the individual component failure rates.

The constant-failure-rate assumption makes it very easy to determine the reliability of a system: simply add the failure rates of the individual components to get the system failure rate. Individual component failure rates can be obtained from manufacturers, reliability handbooks, or company standards. For example, a portion of one company's standard is listed in Table Rel-1. Since failure rates are just estimates, this company simplifies the designer's job by considering only broad categories of components, rather than giving exact failure rates for each component. Other companies use more detailed lists, and some CAD systems maintain a component failure-rate database, so that a circuit's composite failure rate can be calculated automatically from its parts list.
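Since R(t) and MTBF both follow directly from a constant λ, and a system's λ is just the sum of its components' rates, the whole model fits in a few lines of Python. The sketch below is a minimal illustration under those constant-failure-rate assumptions; the function names and the example component values (drawn from the Table Rel-1 categories) are our own choices.

import math

HOURS_PER_YEAR = 8760

def reliability(rate_fits, t_hours):
    """R(t) = e^(-lambda*t), with the constant failure rate given in FITs."""
    return math.exp(-(rate_fits / 1e9) * t_hours)

def mtbf_hours(rate_fits):
    """MTBF = 1/lambda, with the failure rate given in FITs."""
    return 1e9 / rate_fits

def system_failure_rate(component_fits):
    """Composite failure rate: the sum of the individual component rates."""
    return sum(component_fits)

# Three components at 500, 250, and 90 FITs:
lam_sys = system_failure_rate([500, 250, 90])
print(lam_sys)                                # 840 FITs
print(reliability(lam_sys, HOURS_PER_YEAR))   # probability of surviving one year
print(mtbf_hours(lam_sys) / HOURS_PER_YEAR)   # MTBF in years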

Table Rel-1. Typical component failure rates at 55°C.

    Component                 Failure Rate (FITs)
    SSI IC                      90
    MSI IC                     160
    LSI IC                     250
    VLSI microprocessor        500
    Resistor                    10
    Decoupling capacitor        15
    Connector (per pin)         10
    Printed-circuit board     1000

Suppose that we were to build a single-board system with a VLSI microprocessor, 16 memory and other LSI ICs, 2 SSI ICs, 4 MSI ICs, 10 resistors, 24 decoupling capacitors, and a 50-pin connector. Using the numbers in Table Rel-1, we can calculate the composite failure rate:

    λ_sys = 1000 + 16×250 + 2×90 + 4×160 + 10×10 + 24×15 + 50×10 + 500 FITs
          = 7280 failures per 10⁹ hours

The MTBF of our single-board system is 1/λ_sys, or about 15.7 years. Yes, small circuits really can be that reliable. Of course, if we include a typical power supply, with a failure rate of over 10,000 FITs, the MTBF is more than halved.
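As a check on the single-board example, the short sketch below recomputes the composite failure rate and MTBF from the Table Rel-1 numbers; the dictionary layout is just one way to organize the parts list, in the spirit of the CAD-database approach mentioned above.

# Failure rates from Table Rel-1, in FITs (failures per 10^9 hours).
FAILURE_RATE_FITS = {
    "Printed-circuit board": 1000,
    "VLSI microprocessor": 500,
    "LSI IC": 250,
    "MSI IC": 160,
    "SSI IC": 90,
    "Resistor": 10,
    "Decoupling capacitor": 15,
    "Connector (per pin)": 10,
}

# The single-board system described in the text.
parts_list = {
    "Printed-circuit board": 1,
    "VLSI microprocessor": 1,
    "LSI IC": 16,
    "MSI IC": 4,
    "SSI IC": 2,
    "Resistor": 10,
    "Decoupling capacitor": 24,
    "Connector (per pin)": 50,
}

lam_sys = sum(qty * FAILURE_RATE_FITS[part] for part, qty in parts_list.items())
print(lam_sys)                          # 7280 FITs
print(1e9 / lam_sys / 8760)             # MTBF, about 15.7 years
print(1e9 / (lam_sys + 10_000) / 8760)  # with a 10,000-FIT power supply: well under half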