ECE 254 / CPS 225 Fault Tolerant and Testable Computing Systems: Modeling and Evaluation

Transcription:

Page 1

ECE 254 / CPS 225 Fault Tolerant and Testable Computing Systems
Modeling and Evaluation
Copyright 2004 Daniel J. Sorin, Duke University

Outline
  Experimental Methodology and Modeling
  Modeling Random Variables
  Probabilistic Models
  Queuing Network Models
  Markov Chain and Petri Net Models
  Evaluation Using Simulation

Experimental Methodology
  Follow the scientific rule ("strong inference")
    Devise hypothesis
    Conduct experiment that will prove or disprove hypothesis
  Keys to good evaluations
    Using appropriate experimental tools
    Using appropriate metrics
    Designing experiments which will produce unambiguous results
    NOT running experiments and THEN retro-fitting your hypothesis to match the results you obtained
  Goals
    Conclusive and persuasive results
      » Find situations for which your system is better/worse than existing systems
      » Determine which system to use for which scenario

Modeling
  Model: a representation of a system and its behavior
    Simulation is a form of modeling (e.g., SimpleScalar)
  Why use models?
    Can study systems quickly
    Can study designs that haven't been built yet
  Caveat: GIGO (garbage in, garbage out)
    If your model isn't good, its outputs won't be useful/correct
  Analytical model = mathematical model
    Numerous types of analytical modeling techniques exist

Page 2

Modeling Techniques
  Simulation
  Analytical
    Combinatorial
    Probabilistic
    Queuing network
    Markov chain
    Petri Net
    Etc.
  Each type of model has pros and cons
    More complexity can model more detail, but tougher to solve

Inputs and Outputs
  Inputs
    System model (# of components, how they're connected, etc.)
    Component characteristics
      » Size, speed, failure rate, etc.
    Mean Time To Repair (MTTR) is often an input
      » But not if system self-repair is something we're studying
  Outputs
    Performance (latency & throughput)
    Reliability R(t)
    Availability A(t)
    Mean Time To Failure (MTTF)
    Mean Time Between Failures (MTBF)

A Paper with a Simple Model
  "Modeling the Effect of Technology Trends on the Soft Error Rate of Combinational Logic"
  by Shivakumar, Kistler, Keckler, Burger, and Alvisi
  U. of Texas and IBM/Austin

Outline
  Experimental Methodology and Modeling
  Random Variables
  Probabilistic Models
  Queuing Network Models
  Markov Chain and Petri Net Models
  Evaluation Using Simulation

Page 3

Random Variables
  Informally, a random variable X is a variable that takes on a value as a result of an experiment
    Time between failures
    Time to repair a failure
    Time between requests made to a server
  Many types of random variables
    Deterministic
    Exponential (aka Markovian)
    Gaussian (aka Normal)
    Weibull
    Hyper-exponential
    Coxian
    Etc.

Characterizing Random Variables
  Can characterize random variable X in several ways
    Mean (aka "expected value"), often denoted as $\mu$ or $E[X]$
    Variance, often denoted as $\sigma^2$
      » $\sigma^2 = E[(X - \mu)^2] = E[X^2] - \mu^2$
      » Note that, in general, $E[X^2] \neq (E[X])^2$
    Standard deviation, often denoted as $\sigma$
      » $\sigma = \sqrt{\sigma^2}$
    Coefficient of variation (CV)
      » $CV = \sigma / \mu$
    Probability density function (pdf)
    Cumulative distribution function (CDF)
  The pdf or CDF completely describes a RV
    Looking at pdf or CDF, can determine mean, variance, etc.

CDF: Cumulative Distribution Function
  X is a random variable
  CDF: $F_X(y) = P[X \le y]$
  Monotonically increasing function
  For an exponential RV: $F_X(y) = 1 - e^{-\lambda y}$
  [Figure: plot of $F_X(y)$ versus y, rising toward 1.0]

pdf: Probability Density Function
  X is a random variable
  pdf: $f_X(y)$ corresponds to probability of X = y
    Rough intuitive definition, not mathematically precise
  $P[a \le X < b] = \int_a^b f_X(y)\,dy$
  $f_X(y) \ge 0$ for all y
  $\lim_{y \to \infty} f_X(y) = 0$
  For exponential RV: $f_X(y) = \lambda e^{-\lambda y}$
  [Figure: plot of $f_X(y)$ versus y]
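The quantities on the "Characterizing Random Variables" slide can be checked numerically for the exponential case. The sketch below is only an illustration: the rate `lam = 2.0`, the helper names, and the midpoint-rule integrator are assumptions made for this example, not part of the lecture.

```python
import math

def exp_pdf(y, lam):
    """pdf of an Exponential(lam) random variable: lam * exp(-lam * y) for y >= 0."""
    return lam * math.exp(-lam * y) if y >= 0 else 0.0

def exp_cdf(y, lam):
    """CDF of an Exponential(lam) random variable: 1 - exp(-lam * y) for y >= 0."""
    return 1.0 - math.exp(-lam * y) if y >= 0 else 0.0

def integrate(g, a, b, steps=100_000):
    """Midpoint-rule approximation of the integral of g over [a, b]."""
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))

lam = 2.0              # hypothetical rate, chosen only for the example
upper = 50.0 / lam     # truncation point; the tail beyond it has mass e^-50

mean = integrate(lambda y: y * exp_pdf(y, lam), 0.0, upper)
second_moment = integrate(lambda y: y * y * exp_pdf(y, lam), 0.0, upper)
variance = second_moment - mean ** 2          # sigma^2 = E[X^2] - mu^2
cv = math.sqrt(variance) / mean               # CV = sigma / mu

# P[a <= X < b] from the pdf should match F(b) - F(a) from the CDF.
prob_from_pdf = integrate(lambda y: exp_pdf(y, lam), 0.5, 1.0)
prob_from_cdf = exp_cdf(1.0, lam) - exp_cdf(0.5, lam)

print(mean, variance, cv, prob_from_pdf, prob_from_cdf)
```

For an exponential random variable the mean should come out near 1/λ and the CV near 1, which is one reason the CV is a handy quick test of whether measured data looks exponential.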

Page 4

Expected Value (mean)
  $E[X] = \text{mean} = \int_{-\infty}^{\infty} y\, f_X(y)\,dy$
  For exponential RV (do the math at home): $E[X] = 1/\lambda$
  Important notes
    In an experiment, E[X] is the expected value, but not the one that will always occur!
    There is a distribution of values around the expected value.
    Exception to above rule: deterministic distributions always have the same value for every experiment

Poisson Processes
  Let X(t) = a random variable that is the number of events that occur between time 0 and time t
  X(t) is a Poisson process if events occur at a constant rate, λ
    I.e., the event rate does not depend on t
  Many system behaviors can be modeled as Poisson processes (even if they aren't exactly Poisson in reality)
  X(t) is a Poisson process => the time between events is an Exponential random variable

Exponential Distributions
  Denoted by: X ~ Exponential(λ)
  Can show that $P[X \le t + x \mid X > t] = 1 - e^{-\lambda x}$
    Not dependent on t => "memory-less" or "Markovian" property
    Markov chains (later!) depend on this property
  Bus stop analogy for memory-less property
    How long you will wait for a bus is independent of how long you've already been waiting for a bus

Outline
  Experimental Methodology and Modeling
  Random Variables
  Probabilistic Models
  Markov Chain and Petri Net Models
  Evaluation Using Simulation
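The memory-less property stated on the "Exponential Distributions" slide can be observed empirically. The following is a minimal sketch, assuming a hypothetical rate `lam = 0.5`, a fixed seed, and a sample size picked only to keep the estimates stable; the helper name is illustrative.

```python
import math
import random

def conditional_cdf(samples, t, x):
    """Estimate P[X <= t + x | X > t] from a list of samples."""
    survivors = [s for s in samples if s > t]
    return sum(1 for s in survivors if s <= t + x) / len(survivors)

rng = random.Random(42)        # fixed seed so the run is repeatable
lam = 0.5                      # hypothetical rate
samples = [rng.expovariate(lam) for _ in range(200_000)]

x = 1.0
expected = 1.0 - math.exp(-lam * x)   # 1 - e^(-lam * x), independent of t
estimates = {t: conditional_cdf(samples, t, x) for t in (0.0, 1.0, 3.0)}
sample_mean = sum(samples) / len(samples)   # should be close to 1 / lam

for t, est in estimates.items():
    print(f"t={t}: estimated {est:.4f}, expected {expected:.4f}")
```

Each estimate should land close to the same value regardless of how long the "wait" t has already been, which is the bus stop analogy in numerical form.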

Page 5

Probabilistic Models
  Model system behaviors and failure rates (inputs) with probabilities (probability distributions)
  Simple example:
  [Figure: data enters C1, then takes one of two C2 paths (P=0.5 each), then C3]
    Assume C2 components are redundant paths
    Assume you're given failure rates for each component
    What is the probability of getting correct output?

Hazard Rate
  Hazard rate z(t) = failure rate at time t
  Simplest model is non-time-varying: $z(t) = \lambda$
    Models failures as a Poisson process
    Leads to reliability function that has Exponential distribution
  More sophisticated model: $z(t) = \alpha\lambda(\lambda t)^{\alpha - 1}$
    Choose α and λ to model the component's failure behavior
    Leads to reliability function that has Weibull distribution
    Denoted Weibull(α, λ)
  For $z(t) = \lambda$:
    $R(t) = \text{Prob}[0 \text{ failures from time } 0 \text{ to time } t] = e^{-\lambda t}$
    Intuitively, $R(t) \to 0$ as $t \to \infty$
    R(t) decreases at an exponential rate determined by λ

Weibull Reliability
  Weibull: for $z(t) = \alpha\lambda(\lambda t)^{\alpha - 1}$
    $R(t) = \text{Prob}[0 \text{ failures from time } 0 \text{ to time } t] = e^{-(\lambda t)^{\alpha}}$
    Intuitively, $R(t) \to 0$ as $t \to \infty$
    R(t) decreases at a rate determined by α and λ

Other Probability Distributions
  We have only looked at Exponential and Weibull
  Other distributions exist, but we won't cover them in this course
  In general, many behaviors can be represented reasonably well by Exponential distributions
    This is convenient since they're very easy to work with
    We'll see more about this later with Markov chains
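The two hazard-rate models and their reliability functions translate directly into code. The sketch below uses function names chosen for this example. It also relies on the standard relation R(t) = exp(-∫ z(s) ds), which the slides do not state explicitly, to cross-check the Weibull formula by numerical integration.

```python
import math

def hazard_weibull(t, alpha, lam):
    """Weibull hazard rate z(t) = alpha * lam * (lam * t)^(alpha - 1), for t > 0."""
    return alpha * lam * (lam * t) ** (alpha - 1)

def reliability_exponential(t, lam):
    """R(t) = exp(-lam * t) for a constant hazard rate z(t) = lam."""
    return math.exp(-lam * t)

def reliability_weibull(t, alpha, lam):
    """R(t) = exp(-(lam * t)^alpha) for the Weibull hazard rate."""
    return math.exp(-((lam * t) ** alpha))

def reliability_from_hazard(z, t, steps=10_000):
    """Assumed standard relation: R(t) = exp(-integral of z from 0 to t), midpoint rule."""
    h = t / steps
    cumulative = h * sum(z((i + 0.5) * h) for i in range(steps))
    return math.exp(-cumulative)

lam, t = 0.1, 5.0     # hypothetical failure rate and mission time
# With alpha = 1 the Weibull model collapses to the exponential model.
same_as_exponential = reliability_weibull(t, 1.0, lam)
# With alpha = 2 the hazard rate grows linearly with time (wear-out behavior).
r_closed_form = reliability_weibull(t, 2.0, lam)
r_integrated = reliability_from_hazard(lambda s: hazard_weibull(s, 2.0, lam), t)

print(same_as_exponential, reliability_exponential(t, lam))
print(r_closed_form, r_integrated)
```

Choosing α below 1 would model a hazard rate that decreases over time; note that `hazard_weibull` is then undefined at exactly t = 0.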

Page 6

Mean Time To Failure (MTTF)
  $\text{MTTF} = \int_0^{\infty} R(t)\,dt$
  For exponential hazard function
    $\text{MTTF} = \int_0^{\infty} e^{-\lambda t}\,dt = 1/\lambda$
  Intuitively, MTTF shrinks as failure rate (λ) increases

Availability
  Availability A = MTTF / (MTTF + MTTR)
    Denominator is often called MTBF (mean time between failures)
    Note that this is not a function of t
  Rewriting this equation a bit
    Let λ = failure rate
    Let µ = repair rate
    Then $A = \mu / (\mu + \lambda)$

Example Application of Modeling
  In backward error recovery (BER), a system that detects an error recovers to a previous checkpoint and resumes execution from there
  What is the performance penalty of BER (as opposed to a forward error recovery (FER) system that doesn't recover backwards)?
  Assume:
    Error rate is λ, and recovery rate is µ
    Note that RecoveryTime = 1/(RecoveryRate)
    These are exponential distributions (and not functions of t)
  What's the answer?

Another Example Application of Modeling
  In a BER system, do we need to consider the possibility of an error occurring during recovery?
  Assume:
    Error rate is λ, and recovery rate is µ
  What's the answer?
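The MTTF integral and the two availability formulas can be checked against each other. This is a sketch under assumed numbers: the failure rate, repair rate, and integration horizon below are hypothetical values chosen for illustration.

```python
import math

def mttf_numeric(reliability, horizon, steps=200_000):
    """Approximate MTTF = integral of R(t) from 0 to infinity, truncated at `horizon`."""
    h = horizon / steps
    return h * sum(reliability((i + 0.5) * h) for i in range(steps))

def availability_from_times(mttf, mttr):
    """A = MTTF / (MTTF + MTTR)."""
    return mttf / (mttf + mttr)

def availability_from_rates(lam, mu):
    """A = mu / (mu + lam), with lam = failure rate and mu = repair rate."""
    return mu / (mu + lam)

lam = 1e-3    # hypothetical failure rate: one failure per 1000 hours
mu = 0.5      # hypothetical repair rate: a repair takes 2 hours on average

mttf = mttf_numeric(lambda t: math.exp(-lam * t), horizon=40.0 / lam)
a_times = availability_from_times(1.0 / lam, 1.0 / mu)
a_rates = availability_from_rates(lam, mu)

print(mttf)               # close to 1 / lam = 1000 hours
print(a_times, a_rates)   # the two forms agree
```

The two availability forms agree because MTTF = 1/λ and MTTR = 1/µ when failure and repair times are exponential, so dividing numerator and denominator of the first form by MTTF times MTTR gives the second.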

Page 7

Series System Models
  Series connection of components
  [Figure: C1, C2, ..., Cn connected in series]
  Every component must work for system to work
  $R(\text{system}) = \prod_{i=1}^{n} R(i) = e^{-\sum_i \lambda_i t}$
  Intuitively, R increases as:
    Number of components decreases
    Failure rates of components decrease

Parallel System Models
  [Figure: C1, C2, ..., Cn connected in parallel]
  Only one component must work for system to work
  $R(\text{system}) = 1 - \prod_{i=1}^{n} [1 - R(i)]$
  Intuitively, R increases as:
    Number of components increases
    Failure rates of components decrease

Mixed System Models
  It's rare that real systems are either purely serial or purely parallel
  We must construct models that are hybrids and analyze them accordingly
  [Figure: data enters C1, then takes one of two C2 paths (P=0.5 each), then C3]

A Paper with a More Exciting Model
  "The Impact of Technology Scaling on Lifetime Reliability"
  by Srinivasan, Adve, Bose, and Rivers
  U. of Illinois and IBM/Watson
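The series and parallel formulas compose naturally, which is how mixed models are usually analyzed. The sketch below is illustrative: the component reliabilities and failure rates are made-up values, and the mixed example (C1 in series with two redundant C2 components and C3) is one assumed reading of a hybrid system, not necessarily the one in the slide's figure.

```python
import math

def series_reliability(reliabilities):
    """Every component must work: R = product of R(i)."""
    return math.prod(reliabilities)

def parallel_reliability(reliabilities):
    """Only one component must work: R = 1 - product of (1 - R(i))."""
    return 1.0 - math.prod(1.0 - r for r in reliabilities)

# Exponential components in series: the failure rates simply add up.
t = 100.0
rates = [1e-3, 2e-3, 5e-4]                      # hypothetical failure rates
r_series = series_reliability(math.exp(-lam * t) for lam in rates)
r_series_closed = math.exp(-sum(rates) * t)

# Hypothetical mixed system: C1, then a redundant pair of C2s, then C3.
r_c1, r_c2, r_c3 = 0.99, 0.90, 0.95
r_mixed = series_reliability([r_c1, parallel_reliability([r_c2, r_c2]), r_c3])

print(r_series, r_series_closed)
print(r_mixed)
```

Duplicating C2 raises that stage from 0.90 to 0.99, so in this example the remaining series components dominate the system reliability.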

Page 8

Outline
  Experimental Methodology and Modeling
  Random Variables
  Probabilistic Models
  Queuing Network Models
  Markov Chain and Petri Net Models
  Evaluation Using Simulation

Queuing Networks
  Modeling technique used in many situations
  More often used for performance than reliability
    Thus we'll skim this topic in this class
  Good reference textbook: Quantitative System Performance: Computer System Analysis Using Queueing Network Models, by Lazowska, Zahorjan, Graham, and Sevcik (1984)

Queuing Networks
  Can model a computer system as a bunch of service centers with queues, in which customers flow from service center to service center
  Example: multiprocessor (simplified!)
  [Figure: CPU and mem service centers connected by a network]

Queuing Network Issues
  Two types of queuing networks
    Open: customers arrive from outside world at some rate
    Closed: fixed number of customers in system
  Service times and arrival rates (for open systems) generally modeled probabilistically
  Common distributions used:
    Deterministic
    Exponential (Markovian)
  When might each of these be appropriate? When might neither be appropriate?

Page 9

Solving Queuing Networks
  Can sometimes derive closed form solution, if we make certain assumptions about distributions (and other more subtle issues)
  Usually pretty easy to solve numerically, if model isn't too enormous and we make a few simplifying assumptions
  Outputs of solution:
    Customer throughput, response times, waiting times
    Service center utilization (e.g., which one is the bottleneck!)
  Several approaches for solving models:
    Mean value analysis (MVA): simple solution technique, but only reveals mean values

Outline
  Experimental Methodology and Modeling
  Random Variables
  Probabilistic Models
  Queuing Network Models
  Markov Chain and Petri Net Models
  Evaluation Using Simulation

Markov Chains and Petri Nets
  Modeling techniques used in many situations
    Not just for modeling reliability
  ECE 255 & 256 cover these topics in much more detail
  Good reference textbook: Probability and Statistics with Reliability, Queuing, and Computer Science Applications, by Kishor Trivedi, 2002 (2nd edition)

Markov Chains
  Models states and transitions between them
    Not necessarily states in the sense of a finite state machine, but often more global states that represent features of the entire system
  Example: 2 states, with 2 transitions
  [Figure: states "functional" and "broken", connected by a "failure" transition and a "repair" transition]

Page 10

Discrete Time Markov Chains
  Edges in graph (i.e., arcs) represent probabilities
  At each discrete time step, there is a certain probability of transitioning to each other connected state
  [Figure: states "functional" and "broken"; self-edges labeled 1-P[failure] and 1-P[repair]; edges between the states labeled P[failure] and P[repair]]

Continuous Time Markov Chains
  Edges in graph (i.e., arcs) represent rates
  Continuously, there is a certain rate of transitioning to each other connected state (no need for self-edges)
  Transition times are assumed to be exponentially distributed
    Exponential is synonym for Markovian (hence "Markov chains")
  [Figure: states "OK" and "broken"; edges labeled λ and µ]

Markov Chain Example
  Memory scrubbing walks through memory (reading each line) to uncover latent errors
  Let's create Markov chain for a given line of memory
    Assume errors occur one at a time with rate λ per second
    Assume we scrub line every 1/µ seconds
    Assume we can repair all single-bit errors
  [Figure: states "All bits OK", "1 bit bad", and ">1 bits bad"; arcs labeled λ, µ, and λ]

Twist on Markov Chain Example
  Still looking at memory scrubbing
  New assumptions
    Assume single-bit errors occur with rate $\lambda_1$ per second
    Assume double-bit errors occur with rate $\lambda_2$ per second
    Assume we scrub line every 1/µ seconds
    Assume we can repair all 1-bit and 2-bit errors, can detect >2-bit
  [Figure: states "All bits OK", "1 bit bad", "2 bits bad", and "Detected, but Uncorrectable Error"; arcs labeled $\lambda_1$, $\lambda_2$, and µ]
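For the two-state functional/broken chain, the long-run state probabilities can be computed in a few lines. The sketch below is a minimal illustration, not how the course tools solve chains: the per-step probabilities and the rates are hypothetical, the discrete-time chain is solved by repeated multiplication with the transition matrix, and the continuous-time chain uses the balance equation λ·π_OK = µ·π_broken, a standard result that the slides do not derive.

```python
def dtmc_steady_state(P, iterations=10_000):
    """Iterate pi <- pi * P from a uniform start until it settles (power iteration)."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

def ctmc_two_state_steady_state(lam, mu):
    """Two-state CTMC: balance lam * pi_ok = mu * pi_broken, with pi_ok + pi_broken = 1."""
    pi_ok = mu / (lam + mu)
    return [pi_ok, 1.0 - pi_ok]

# Discrete time: hypothetical per-step probabilities of failure and repair.
p_fail, p_repair = 0.01, 0.2
P = [
    [1 - p_fail, p_fail],        # from "functional": stay, or fail
    [p_repair, 1 - p_repair],    # from "broken": get repaired, or stay
]
pi_dtmc = dtmc_steady_state(P)

# Continuous time: hypothetical failure rate lam and repair rate mu.
pi_ctmc = ctmc_two_state_steady_state(lam=1e-3, mu=0.5)

print(pi_dtmc)   # long-run fraction of steps spent functional / broken
print(pi_ctmc)   # long-run fraction of time spent OK / broken
```

The continuous-time probability of the OK state equals µ/(µ + λ), the same expression as the availability formula given earlier in the lecture.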

Page 11

Solving Markov Chains
  After creating a Markov chain model, we'd like to be able to solve it and determine:
    Probability distribution of being in each state
    Reachability (can we actually reach all states)
  Except for very simple models, it is impossible to derive a closed form solution
  In general, Markov chains are solved with software tools that can crunch through them numerically (i.e., without finding a closed form solution)
  This course will not address how to solve Markov chains; take ECE 255 and/or ECE 257

Petri Nets
  Markov chains are not always the most intuitive way to represent a system
  Petri Nets (1962) developed to provide more intuitive interface, while still mathematically equivalent to Markov chains
  [Figure: example Petri net with transitions t1 and t2]
  Petri net components:
    places (states) -- large black circles
    transitions -- arcs
    multiplicities (widths of transitions) -- not shown
    tokens -- small blue circles
  A transition is enabled if each of its input places has at least as many tokens as the multiplicity of the corresponding input arc
    Thus, only t1 is currently enabled in example

Petri Nets
  [Figure: example Petri net with transitions t1 and t2]
  Once a transition is enabled, it can fire
  At firing, a number of tokens equal to the multiplicity of the input arc are removed from each input place, and a number of tokens equal to the multiplicity of the output arc are placed in each output place
  So, where will that leave us?

Petri Nets
  [Figure: example Petri net after t1 has fired]
  Notice that we started with 2 tokens and now we have 3: no conservation law for tokens
  Now both t1 and t2 are enabled and can fire
  Petri Net Rule: either is allowed to fire first, but they are NOT allowed to fire at the same time
  What will happen if t2 fires first?
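The enabling and firing rules described above can be sketched in code. The net below is hypothetical: the slide's figure did not survive transcription, so the places `A` through `D` and the arc structure are invented here to reproduce the behavior the slides describe (2 tokens at the start, only t1 enabled, 3 tokens after t1 fires).

```python
class PetriNet:
    """Minimal untimed Petri net: a marking plus transitions with arc multiplicities."""

    def __init__(self, marking, transitions):
        self.marking = dict(marking)       # place name -> token count
        self.transitions = transitions     # name -> (input arcs, output arcs)

    def enabled(self, name):
        """Enabled if every input place holds at least the input arc's multiplicity."""
        inputs, _ = self.transitions[name]
        return all(self.marking.get(p, 0) >= m for p, m in inputs.items())

    def fire(self, name):
        """Remove tokens from the input places and add tokens to the output places."""
        if not self.enabled(name):
            raise ValueError(f"transition {name} is not enabled")
        inputs, outputs = self.transitions[name]
        for place, mult in inputs.items():
            self.marking[place] -= mult
        for place, mult in outputs.items():
            self.marking[place] = self.marking.get(place, 0) + mult

    def token_count(self):
        return sum(self.marking.values())

def make_example_net():
    """Hypothetical net (not the slide's figure): t1 consumes 1 token and produces 2."""
    return PetriNet(
        marking={"A": 2},
        transitions={
            "t1": ({"A": 1}, {"B": 1, "C": 1}),
            "t2": ({"A": 1, "B": 1}, {"D": 1}),
        },
    )

net = make_example_net()
print(net.enabled("t1"), net.enabled("t2"))   # only t1 is enabled initially
net.fire("t1")
print(net.token_count())                      # tokens are not conserved
net.fire("t2")
print(net.marking)                            # the resulting marking
```

Transitions fire one at a time here, matching the rule that t1 and t2 may fire in either order but never simultaneously.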

Page 12

Petri Nets
  [Figure: example Petri net after t2 has fired]
  This is the result if t2 fired first
  No more transitions are possible when tokens are in this global state (called a "marking" in Petri Net lingo)

Stochastic Petri Nets
  The Petri net we just looked at had no concept of time
    Transitions could fire immediately (no delay)
  Stochastic Petri Nets (SPNs) add timing information by adding a firing time to each transition
    Firing times often restricted to exponential distributions
  A SPN is mathematically equivalent to a Continuous Time Markov Chain
    Can be automatically converted by software
    Can be solved by software

Simple SPN Example
  Rectangles represent non-instantaneous transitions
    Labeled with mean of their exponential distribution
  This example is an M/M/1 queue
    M stands for Markov (= exponential distribution)
    First M means arrival process is Markovian
    Second M means service process is Markovian
    The 1 means that there is one service center
  Note: t_arrival has no input arc => always enabled
  [Figure: transition t_arrival (λ) feeds place P_queue, which feeds transition t_service (µ)]

SPN Example From Software Aging
  Models system with n nodes, initially all up. Tokens represent nodes.
  [Figure: SPN with places P_up, P_failprob, P_nodefail1, P_nodefail2, and P_sysfail; transitions t_common-mode-fail, t_node-repair, t_sys-repair, t_failprob, and t_nodefail; guards g1, g2, g3; arc multiplicities n; branch probabilities Prob[c] and Prob[1-c]]
  g1=true, if token in P_sysfail
  g2=true, if token in P_up
  g3=true, if a<=n nodes down
  Figure 22 from Castelli et al.
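The M/M/1 SPN from the "Simple SPN Example" slide is small enough to simulate directly. The sketch below is one possible approach, not the method used by SPN tools: it treats enabled transitions as racing exponential timers, so the time to the next firing is exponential with the summed rate and each transition wins with probability proportional to its rate. The rates, seed, and function name are assumptions. The closed-form mean ρ/(1-ρ) with ρ = λ/µ is the standard M/M/1 result and does not appear in the slides.

```python
import random

def simulate_mm1_spn(lam, mu, num_firings, seed=0):
    """Return the time-averaged number of tokens in P_queue."""
    rng = random.Random(seed)
    tokens = 0      # current marking of P_queue
    now = 0.0       # simulated time
    area = 0.0      # integral of the token count over time
    for _ in range(num_firings):
        # t_arrival is always enabled; t_service needs a token in P_queue.
        total_rate = lam + (mu if tokens > 0 else 0.0)
        dwell = rng.expovariate(total_rate)   # time until the next firing
        area += tokens * dwell
        now += dwell
        if rng.random() < lam / total_rate:
            tokens += 1    # t_arrival fires: a customer joins the queue
        else:
            tokens -= 1    # t_service fires: a customer departs
    return area / now

lam, mu = 1.0, 2.0                     # hypothetical arrival and service rates
rho = lam / mu
mean_tokens_closed_form = rho / (1.0 - rho)
mean_tokens_simulated = simulate_mm1_spn(lam, mu, num_firings=200_000, seed=1)

print(mean_tokens_simulated, mean_tokens_closed_form)
```

The simulated average only approaches the closed-form value as the number of firings grows, and the queue is stable only when λ < µ.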

Page 13

More About Petri Nets
  There are many other flavors of Petri Nets and many features that you can add to them
  There are many software tools for solving them (by first converting them to Markov chains)
  But this is all beyond the scope of this course (i.e., you will not be tested on it)

Outline
  Experimental Methodology and Modeling
  Random Variables
  Probabilistic Models
  Queuing Network Models
  Markov Chain and Petri Net Models
  Evaluation Using Simulation

Simulating
  Write a program that mimics system behavior
  Advantages
    + Very flexible
    + Relatively quick to develop
  Disadvantages
    - Runs slowly (e.g., 30,000 times slower than hardware)
  Some examples from computer engineering:
    SimpleScalar: simulates microprocessors
    nsim: simulates networks
    Simics: simulates entire computer system

Describing Simulated System
  How detailed must our simulator be?
  Model every transistor in the system?
    Would take too long
  Abstract away details of system?
    Could miss important effects
    Could achieve wrong conclusion
  Need balance
    Model in detail only where necessary
    E.g., model memory system in detail, but abstract disks

Page 14

Simulators for Evaluating Fault Tolerance
  Simulators for evaluating fault tolerance may be quite a bit different than simulators for evaluating performance (which is what most computer engineers are used to using)
  Issue #1: need to be able to model behavior of faults
  Issue #2: often need to be able to model probabilistic behaviors

Fault/Error Injection
  To simulate how a system tolerates faults/errors, we must inject them into the system
    E.g., periodically have the simulator flip bits in the system
  Issues in fault/error injection
    What kinds of faults/errors to inject
    Where to inject them
    When to inject them
      » Once
      » Periodically
      » According to probability distribution

Monte Carlo Simulation
  To simulate systems with probabilistic behaviors, we often use Monte Carlo Simulation
  Idea: each time we need the value of a probabilistic behavior (e.g., memory access time), we randomly choose it from the distribution
  Example
    Assume memory access times are either 1 cycle (L1 hit), 5 cycles (L2 hit), or 200 cycles (memory hit)
    Assume 90% hit rate in L1, 80% hit rate in L2 (if miss in L1), and 100% hit rate in memory (if miss in L1 and L2)
    For each memory access, simulator will choose among 100 options:
      » 90 options are 1 cycle
      » 8 options are 5 cycles
      » 2 options are 200 cycles

Outline
  Experimental Methodology and Modeling
  Random Variables
  Probabilistic Models
  Queuing Network Models
  Markov Chain and Petri Net Models
  Evaluation Using Simulation
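The Monte Carlo memory-access example maps onto a weighted random choice. The sketch below is a minimal illustration using the latencies and the 90/8/2 split from the slide. The seed, sample count, and names are assumptions made for this example.

```python
import random

LATENCIES = [1, 5, 200]   # cycles: L1 hit, L2 hit, memory hit
WEIGHTS = [90, 8, 2]      # out of 100 options; 8 = 10% L1 misses * 80% L2 hits

def sample_access_times(rng, count):
    """Draw `count` memory access times from the discrete distribution."""
    return rng.choices(LATENCIES, weights=WEIGHTS, k=count)

def expected_access_time():
    """Exact mean of the distribution, for comparison with the sampled average."""
    return sum(lat * w for lat, w in zip(LATENCIES, WEIGHTS)) / sum(WEIGHTS)

rng = random.Random(7)     # fixed seed so the run is repeatable
samples = sample_access_times(rng, 200_000)
average = sum(samples) / len(samples)

print(expected_access_time())   # 0.90*1 + 0.08*5 + 0.02*200 cycles
print(average)                  # sampled estimate of the same quantity
```

Although only 2% of accesses go to memory, they contribute 4 of the 5.3 expected cycles, so a simulator run needs enough samples for those rare, expensive events to be well represented.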