Analysis of Simulated Data


Mean
Population mean: the sum of all values of the variable in the population, divided by the number of units in the population (N):
μ = ( Σ_{i=1}^{N} X_i ) / N
where N is the number of elements in the population and X_i is the i-th observation of the variable.
Sample mean: the sum of all values of the variable in a subset of the population, divided by the number of units in that sample (n):
X̄ = ( Σ_{i=1}^{n} X_i ) / n

Variance
Population variance, a measure that characterizes the variability of a population very well:
σ² = Σ_{i=1}^{N} (X_i − μ)² / N
where N is the number of observations in the whole population, μ is the population mean, and X_i is the i-th observed value.
Sample variance:
s² = Σ_{i=1}^{n} (X_i − X̄)² / (n − 1)
where n is the number of observations in the sample, X̄ is the sample mean, and X_i is the i-th observed value.
When n is large, the difference between the two formulas is minimal; when n is small, the difference is noticeable.
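The two divisors can be checked numerically. The following Python sketch (the data values are made up purely for illustration) computes both the population variance (divide by N) and the sample variance (divide by n − 1):

```python
# Population variance: divide by N (we know every unit of the population)
population = [18, 20, 22, 24]
N = len(population)
mu = sum(population) / N
var_pop = sum((x - mu) ** 2 for x in population) / N      # 5.0

# Sample variance: divide by n - 1 (Bessel's correction)
sample = [18, 22, 24]   # a made-up sample for illustration
n = len(sample)
xbar = sum(sample) / n
var_samp = sum((x - xbar) ** 2 for x in sample) / (n - 1)
```

For a sample this small the two divisors give noticeably different answers, which is exactly the point made in the slide.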

Central Limit Theorem
When the sample size becomes large enough, the distribution of the sample means approximates a normal distribution.

Even when the population does not follow a normal distribution:
Central tendency: μ_x̄ = μ
Variation: σ_x̄ = σ / √n
(Figure: a population distribution with μ = 50 and σ = 10, and the corresponding sampling distributions, all centered at μ_x̄ = 50.)

Sampling distribution
Suppose there is a population in which EVERYONE is one of 4 ages. The random variable X is the age of an individual, measured in years, and its values are 18, 20, 22, 24 (individuals A, B, C, D).

Population characteristics
μ = Σ_{i=1}^{N} X_i / N = (18 + 20 + 22 + 24) / 4 = 21
σ = √( Σ_{i=1}^{N} (X_i − μ)² / N ) = 2.236
The population distribution is uniform: P(X) = .25 for each of A (18), B (20), C (22), D (24).

All possible samples of size n = 2
16 samples taken with replacement (1st observation by 2nd observation):

1st \ 2nd   18      20      22      24
18          18,18   18,20   18,22   18,24
20          20,18   20,20   20,22   20,24
22          22,18   22,20   22,22   22,24
24          24,18   24,20   24,22   24,24

The 16 corresponding sample means:

1st \ 2nd   18   20   22   24
18          18   19   20   21
20          19   20   21   22
22          20   21   22   23
24          21   22   23   24

Sampling distribution (of all the sample means)
The 16 sample means above form the sampling distribution of X̄:

value:      18  19  20  21  22  23  24
frequency:   1   2   3   4   3   2   1

(n in each sample = 2; number of means in the sampling distribution = 16)

Mean and standard deviation of the sampling distribution
μ_x̄ = Σ X̄_i / 16 = (18 + 19 + 19 + … + 24) / 16 = 21
σ_x̄ = √( Σ (X̄_i − μ_x̄)² / 16 ) = √( ((18 − 21)² + (19 − 21)² + … + (24 − 21)²) / 16 ) = 1.58

Population vs. sampling distribution
Population: μ = 21, σ = 2.236 (uniform over 18, 20, 22, 24).
Sampling distribution of the mean for n = 2: μ_x̄ = 21, σ_x̄ = 1.58.
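This little population is small enough to enumerate every possible sample, so the claimed values μ_x̄ = 21 and σ_x̄ = 1.58 can be verified directly. A Python sketch of the enumeration:

```python
from itertools import product
import math

ages = [18, 20, 22, 24]

# Population mean and standard deviation
mu = sum(ages) / len(ages)                                        # 21.0
sigma = math.sqrt(sum((x - mu) ** 2 for x in ages) / len(ages))   # ~2.236

# All 16 samples of size n = 2 drawn with replacement, and their means
means = [(a + b) / 2 for a, b in product(ages, repeat=2)]

mu_xbar = sum(means) / len(means)                                 # 21.0
sigma_xbar = math.sqrt(
    sum((m - mu_xbar) ** 2 for m in means) / len(means))          # ~1.581

# The familiar relation sigma_xbar = sigma / sqrt(n) holds exactly here
assert abs(sigma_xbar - sigma / math.sqrt(2)) < 1e-9
```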

Normal curve: properties
Approximate percentage of the area within a given number of standard deviations of the mean (the empirical rule): 68% within 1, 95% within 2, and 99.7% within 3 standard deviations.

Confidence Interval for a Mean when you have a small sample...

As long as you have a large sample, a confidence interval for a population mean is:
x̄ ± Z s / √n
where the average x̄, the standard deviation s, and n come from the sample, and Z depends on the confidence level.

Example
A random sample of 59 students spent an average of $273.20 on Spring 1998 textbooks. The sample standard deviation was $94.40.
273.20 ± 1.96 × (94.4 / √59) = 273.20 ± 24.09
We can be 95% confident that the average amount spent by all students was between $249.11 and $297.29.
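The arithmetic of this interval takes only a few lines; the following Python sketch uses the z value 1.96 for 95% confidence, as in the example:

```python
import math

xbar, s, n, z = 273.20, 94.40, 59, 1.96

half = z * s / math.sqrt(n)   # half-length of the interval, ~24.09
lo, hi = xbar - half, xbar + half
print(f"95% CI: ({lo:.2f}, {hi:.2f})")   # (249.11, 297.29)
```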

What happens if you can only take a small sample? A random sample of 15 students slept an average of 6.4 hours last night, with a standard deviation of 1 hour. What is the average amount all students slept last night?

If you have a small sample, replace the Z value with a t value to get:
x̄ ± t s / √n
where t comes from Student's t distribution and depends on the sample size through the degrees of freedom, n − 1.

Student's t distribution versus the normal Z distribution
(Figure: density curves of the standard normal Z distribution and the t distribution with 5 d.f.; the t curve is lower at the center and heavier in the tails.)

T distribution
Very similar to the standard normal distribution, except: t depends on the degrees of freedom, n − 1, and extreme t values are more likely than extreme Z values.

Let's compare t and Z values

Confidence level   t value (5 d.f.)   Z value
90%                2.015              1.65
95%                2.571              1.96
99%                4.032              2.58

For small samples, the t value is larger than the Z value, so the t interval is longer than the Z interval.

OK, enough theorizing! Let's get back to our example. A sample of 15 students slept an average of 6.4 hours last night with a standard deviation of 1 hour. We need t with n − 1 = 15 − 1 = 14 d.f. For 95% confidence, t_14 = 2.145:
x̄ ± t s / √n = 6.4 ± 2.145 × (1 / √15) = 6.4 ± 0.55

That is, we can be 95% confident that the average amount slept last night by all students is between 5.85 and 6.95 hours.
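The same computation as the large-sample case, with the t quantile swapped in (t_14 = 2.145 is taken from the slide's table lookup):

```python
import math

xbar, s, n = 6.4, 1.0, 15
t14 = 2.145   # t quantile for 95% confidence, 14 d.f. (from the text)

half = t14 * s / math.sqrt(n)   # ~0.55
lo, hi = xbar - half, xbar + half
print(f"95% CI: ({lo:.2f}, {hi:.2f})")   # (5.85, 6.95)
```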

What happens as the sample gets larger?
(Figure: the density of the t distribution with 60 d.f. is almost indistinguishable from that of the Z distribution.)

What happens to the CI as the sample gets larger?
x̄ ± Z s / √n   versus   x̄ ± t s / √n
For large samples, the Z and t values become almost identical, so the CIs are almost identical.

Example
A random sample of 64 students spent an average of 3.8 hours on homework last night, with a sample standard deviation of 3.1 hours.

Z Confidence Intervals (the assumed sigma = 3.10)
Variable   N    Mean    StDev   95.0 % CI
Homework   64   3.797   3.100   (3.037, 4.556)

T Confidence Intervals
Variable   N    Mean    StDev   95.0 % CI
Homework   64   3.797   3.100   (3.022, 4.571)

Output analysis for single system

Why? Often most of the emphasis is on simulation model development and programming, and very few resources (time and money) are budgeted for analyzing the output of the simulation experiment. In fact, it is not uncommon to see a single run of the simulation experiment carried out and its results reported as the answer. That single run is also of arbitrary length, and its output is treated as the truth. Since simulation models are driven by random inputs from different probability distributions, this single output is just one realization of these random variables.

Why? If the random parameters of the experiment have a large variance, one realization of the run may differ greatly from another. There is a real danger of making erroneous inferences about the system we are trying to simulate, because a single data point has practically no statistical significance!

Why? A simulation experiment is a computer-based statistical sampling experiment; hence, if the results of the simulation are to have any significance and the inferences any confidence, appropriate statistical techniques must be used. Most of the time, the output data of a simulation experiment are non-stationary and autocorrelated, so classical statistical techniques that require IID data cannot be applied directly.

Typical output process
Let Y_1, Y_2, …, Y_m be the output stochastic process from a single simulation run. Let the realizations of these random variables over n replications be:

y_11   y_12   …   y_1m
y_21   y_22   …   y_2m
…
y_n1   y_n2   …   y_nm

It is very common to observe that, within the same run, the output process is correlated. However, independence across replications can be achieved, and the output analyses depend on this independence.

Transient and steady-state behavior
Consider the stochastic process Y_i as before. In many experiments, the distribution of the output process depends on the initial conditions to a certain extent. This conditional distribution of the output stochastic process, given the initial conditions, is called the transient distribution. If this sequence of distributions converges as i → ∞ for any initial condition, then we call the limiting distribution the steady-state distribution.

Types of simulation
- Terminating simulation
- Non-terminating simulation:
  - steady-state parameters
  - steady-state cycle parameters
  - other parameters

Terminating simulation
A terminating simulation is one in which there is a natural event E that specifies the length of each run (replication). If we use a different set of independent random variables at the input for each replication, with the same initial conditions, then the comparable output parameters are IID. Often the initial conditions of a terminating simulation affect the output parameters to a great extent. Examples of terminating simulations: 1. The bank-queue example, when it is specified that the bank operates between 9 am and 5 pm. 2. The inventory-planning example (calculating cost over a finite time horizon).

Non-terminating simulation
There is no natural event E to specify the end of the run. A measure of performance for such a simulation is said to be a steady-state parameter if it is a characteristic of the steady-state distribution of some output process. The stochastic processes of most real systems do not have steady-state distributions, since the characteristics of the system change over time. On the other hand, a simulation model may have a steady-state distribution, since we often assume that the characteristics of the model do not change with time.

Non-terminating simulation
Consider a stochastic process Y_1, Y_2, … for a non-terminating simulation that does not have a steady-state distribution. Divide the time axis into equal-length, contiguous time intervals called cycles, and let Y_i^C be the random variable defined over the i-th cycle. Suppose this new stochastic process has a steady-state distribution. A measure of performance is called a steady-state cycle parameter if it is a characteristic of Y^C.

Non-terminating simulation
For a non-terminating simulation, suppose that a stochastic process does not have a steady-state distribution, and that there is no appropriate cycle definition such that the corresponding cycle process has a steady-state distribution. This can occur if the parameters of the model continue to change over time. In these cases there will typically be a fixed amount of data describing how the input parameters change over time. This provides, in effect, a terminating event E for the simulation, and thus the analysis techniques for terminating simulations are appropriate.

Statistical analysis of terminating simulations
Suppose that we have n replications of a terminating simulation, where each replication is terminated by the same event E and begun with the same initial conditions. Assume that there is only one measure of performance, and let X_j be the value of the performance measure in the j-th replication, j = 1, 2, …, n. These are IID random variables. For a bank, X_j might be the average waiting time over a day from the j-th replication,
X_j = ( Σ_{i=1}^{N} W_i ) / N,
where N is the number of customers served in a day. Note that N itself can be a random variable for a replication.

Statistical analysis of terminating simulations
For a simulation of a war game, X_j might be the number of tanks destroyed in the j-th replication. For an inventory system, X_j could be the average cost in the j-th replication. Suppose that we would like to obtain a point estimate and a confidence interval for the mean E[X], where X is the random variable defined on a replication as described above. We make n independent replications of the simulation and let X_j be the resulting IID variable in the j-th replication, j = 1, 2, …, n.

Statistical analysis of terminating simulations
We know that an approximate 100(1 − α)% confidence interval for μ = E[X] is given by:
X̄(n) ± t_{n−1, 1−α/2} √( S²(n) / n )
where we use a fixed sample of n replications and take the sample variance S²(n) from it. Hence this procedure is called a fixed-sample-size procedure.

Statistical analysis of terminating simulations
One disadvantage of the fixed-sample-size procedure based on n replications is that the analyst has no control over the confidence-interval half-length (the precision of X̄(n)). If the estimate X̄(n) is such that |X̄(n) − μ| = β, then we say that X̄(n) has an absolute error of β. Suppose that we have constructed a confidence interval for μ based on a fixed number of replications n. We assume that our estimate S²(n) of the population variance will not change appreciably as the number of replications increases.

Statistical analysis of terminating simulations
Then an expression for the approximate total number of replications required to obtain an absolute error of β is:
n_a*(β) = min{ i ≥ n : t_{i−1, 1−α/2} √( S²(n) / i ) ≤ β }.
If n_a*(β) > n, we take the additional n_a*(β) − n replications of the simulation; the estimate of the mean E[X] based on all the replications should then have an absolute error of approximately β.
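A minimal sketch of this search, assuming the fixed S²(n) stays valid as i grows. To keep it dependency-free, the standard normal quantile stands in for t_{i−1, 1−α/2} (for i beyond roughly 30 the two are nearly identical; a t quantile function would be used in practice):

```python
import math
from statistics import NormalDist

def replications_needed(s2, n, beta, alpha=0.05):
    """Smallest i >= n with q * sqrt(s2 / i) <= beta, where q approximates
    t_{i-1, 1-alpha/2} by the normal quantile z_{1-alpha/2}."""
    q = NormalDist().inv_cdf(1 - alpha / 2)
    i = n
    while q * math.sqrt(s2 / i) > beta:
        i += 1
    return i

# With S2(n) = 25 from n = 10 pilot replications and target beta = 1,
# roughly z^2 * s2 / beta^2 ~ 96 replications are needed in total.
total = replications_needed(25, 10, 1.0)
```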

Statistical analysis of terminating simulations
A sequential procedure for estimating the confidence interval for μ. Let
δ(k, α) = t_{k−1, 1−α/2} √( S²(k) / k ).
1. Make k_0 replications of the simulation and set k = k_0.
2. Compute X̄(k) and δ(k, α) from the current sample.
3. If δ(k, α) < β, use X̄(k) as the point estimate of μ and stop.
4. Otherwise, replace k with k + 1, make an additional replication of the simulation, and go to Step 2.
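The four steps can be sketched as follows. Here run_simulation is a hypothetical stand-in for one replication of the model, and (as in the previous sketch) a fixed z value approximates t_{k−1, 1−α/2}:

```python
import math
import random

def sequential_ci(run_simulation, k0=10, beta=0.2, z=1.96):
    """Keep replicating until the half-width delta(k) = z * sqrt(S2(k)/k)
    falls below beta; return the point estimate and final half-width."""
    xs = [run_simulation() for _ in range(k0)]      # Step 1
    while True:
        k = len(xs)
        mean = sum(xs) / k                          # Step 2: X-bar(k)
        s2 = sum((x - mean) ** 2 for x in xs) / (k - 1)
        delta = z * math.sqrt(s2 / k)
        if delta < beta:                            # Step 3: stop
            return mean, delta
        xs.append(run_simulation())                 # Step 4: one more run

# Toy "simulation": each replication returns a N(5, 1) performance measure
random.seed(1)
mean, delta = sequential_ci(lambda: random.gauss(5.0, 1.0), beta=0.2)
```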

A method for determining when to stop
Choose an acceptable value d for the standard deviation of the estimator. Generate at least 100 data values, then continue to generate additional values, stopping once you have generated k values with S/√k < d, where S is the sample standard deviation based on those k values. The estimate of the mean is then given by X̄ (as reported in the textbook).

Example
Consider a serving system in which no new customers are allowed to enter after 5 pm, and suppose we are interested in estimating the expected time at which the last customer departs the system. Suppose we want to be at least 95% certain that our estimated answer will not differ from the true value by more than 15 seconds.

Choosing initial conditions
The measures of performance for a terminating simulation depend explicitly on the state of the system at time 0, so it is extremely important to choose the initial conditions with care. Suppose that we want to analyze the average delay of customers who arrive and complete their delays between 12 noon and 1 pm (the busiest period for a bank). Since the bank would probably be very congested by noon, starting the simulation then with no customers present (the usual initial condition for a queueing problem) would not be useful. We discuss two heuristic approaches to this problem.

Choosing initial conditions: first approach
Assume that the bank opens at 9 am with no customers present. Start the simulation at 9 am with no customers present and run it for 4 simulated hours. In estimating the desired expected average delay, use only those customers who arrive and complete their delays between noon and 1 pm. The evolution of the simulation from 9 am to noon (the "warm-up period") determines the appropriate conditions for the simulation at noon.
Disadvantage: 3 hours of simulated time are not used directly in the estimation. One might propose a compromise and start the simulation at some other time, say 11 am, with no customers present; however, there is then no guarantee that the conditions in the simulation at noon will be representative of the actual conditions in the bank at noon.

Choosing initial conditions: second approach
Collect data on the number of customers present in the bank at noon on several different days. Let p_i be the proportion of those days on which i customers (i = 0, 1, …) are present at noon. Then simulate the bank from noon to 1 pm, with the number of customers present at noon drawn at random from the distribution {p_i}. If more than one simulation run is required, a different draw from {p_i} is made for each run, so that the performance measures are IID.

Estimating the probabilities from the data

Customers at noon (x)   # of days   f(x)
0                       80          0.40
1                       50          0.25
2                       40          0.20
3                       10          0.05
4                       20          0.10
Total                   200         1.00
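The second approach then amounts to drawing each run's initial state from this empirical distribution. A minimal Python sketch using the probabilities in the table (the function name is illustrative, not from the source):

```python
import random

# Empirical distribution of the number of customers present at noon,
# taken from the table above (proportions out of 200 observed days)
p = {0: 0.40, 1: 0.25, 2: 0.20, 3: 0.05, 4: 0.10}

def initial_customers():
    """Draw the initial number of customers at noon from {p_i}."""
    values, weights = zip(*p.items())
    return random.choices(values, weights=weights, k=1)[0]

random.seed(0)
# One independent draw per replication keeps the runs' outputs IID
draws = [initial_customers() for _ in range(1000)]
```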

Statistical analysis of steady-state parameters
Let Y_1, Y_2, …, Y_m be the output stochastic process from a single run of a non-terminating simulation. Suppose that P(Y_i ≤ y) = F_i(y) → F(y) = P(Y ≤ y) as i → ∞. Here Y is the steady-state random variable of interest, with distribution function F. Then φ is a steady-state parameter if it is a characteristic of Y, such as E[Y] or a probability F(y). One problem in estimating φ is that the distribution function of Y_i is different from F, since it is generally not possible to choose i to be representative of steady-state behavior.

Statistical analysis of steady-state parameters
This causes an estimator based on the observations Y_1, Y_2, …, Y_m not to be representative. This is called the problem of the initial transient. Suppose that we want to estimate the steady-state mean E[Y], which is generally defined as
ν = lim_{i→∞} E[Y_i].
The most serious problem is that E[Ȳ(m)] ≠ ν for any m.

Statistical analysis of steady-state parameters
The technique most commonly used is warming up the model, or initial-data deletion. The idea is to delete some number of observations from the beginning of a run and use only the remaining observations to estimate the mean:
Ȳ(m, l) = ( Σ_{i=l+1}^{m} Y_i ) / (m − l).
The question now is: how do we choose the warm-up period l? We look for the point at which the transient mean curve E[Y_i] flattens out at level ν.
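The deletion estimator itself is a one-liner; a sketch with a made-up output process that has an obvious transient:

```python
def warmed_mean(ys, l):
    """Y-bar(m, l): discard the first l observations, average the rest."""
    m = len(ys)
    assert 0 <= l < m
    return sum(ys[l:]) / (m - l)

# Toy output process (values made up for illustration): it starts high
# and settles around a steady-state level of 5
ys = [12, 9, 7, 5, 5, 5, 5, 5]

biased = warmed_mean(ys, 0)    # includes the transient: 6.625
warmed = warmed_mean(ys, 3)    # deletes the warm-up: 5.0
```

Choosing l is the hard part; in practice one plots averaged, smoothed output across replications (Welch's procedure) and picks l where the curve flattens.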

Bootstrapping the Mean: an example
We are interested in finding the confidence interval for a mean from a sample of only 4 observations. Suppose we are interested in the difference in income between husbands and wives: we have four cases, with the following income differences (in $1000s): 6, −3, 5, 3, for a mean of 2.75 and a standard deviation of 4.031. We can calculate the confidence interval:
μ = X̄(n) ± t_{n−1, .025} √( S²(n) / n ) = 2.75 ± 4.30 × (4.031 / √4) = 2.75 ± 4.30 × 2.015 = 2.75 ± 8.66
Now we'll compare this confidence interval to one found using bootstrapping.

Defining the Random Variable
The first thing bootstrapping does is estimate the population distribution of X from the four observations in the sample. In other words, the random variable X* is defined:

x*    p(x*)
6     .25
−3    .25
5     .25
3     .25

The mean of X* is then simply the mean of the sample: X̄ = E(X*) = 2.75.

The Sample as the Population
We now treat the sample as if it were the population and resample from it. In this case we take all possible samples with replacement, meaning that we take n^n = 4^4 = 256 different samples. Since we took all possible samples, the mean of these sample means is simply the original mean. The standard error of X̄ from these samples is:
SE*(X̄*) = √( Σ_{k=1}^{n^n} (X̄*_k − X̄)² / n^n ) = 1.745.
We now make an adjustment for the sample size:
SE(X̄) = √( n / (n − 1) ) × SE*(X̄*) = 2.015.
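With only 4^4 = 256 resamples, the full enumeration is cheap, and both the 1.745 and the adjusted 2.015 figures can be reproduced exactly in Python:

```python
from itertools import product
import math

sample = [6, -3, 5, 3]
n = len(sample)
xbar = sum(sample) / n                                   # 2.75

# All n**n = 256 resamples of size n drawn with replacement
boot_means = [sum(r) / n for r in product(sample, repeat=n)]

# Bootstrap standard error of the mean over the full enumeration
se_star = math.sqrt(
    sum((m - xbar) ** 2 for m in boot_means) / len(boot_means))   # ~1.745

# Sample-size adjustment; this recovers the usual s / sqrt(n)
se_adj = math.sqrt(n / (n - 1)) * se_star                # ~2.015
s = math.sqrt(sum((x - xbar) ** 2 for x in sample) / (n - 1))
assert abs(se_adj - s / math.sqrt(n)) < 1e-9
```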

The Sample as the Population
In this example, because we used all possible resamples of our sample, the bootstrap standard error (2.015) is exactly the same as the original standard error. Still, the same approach can be used for statistics for which we do not have standard-error formulas, or when we have small sample sizes. In summary, the following analogies can be made to sampling from the population:
- Bootstrap observations ↔ original observations
- Bootstrap mean ↔ original sample mean
- Original sample mean ↔ unknown population mean μ
- Distribution of the bootstrap means ↔ unknown sampling distribution of the original sample mean