Simulation
Alberto Ceselli
MSc in Computer Science, Univ. of Milan
Part 4 - Statistical Analysis of Simulated Data
A. Ceselli, Simulation P.4, Analysis of Sim. data, 1 / 15
Statistical analysis of simulated data: Outline
Outline:
- estimators and interval estimates
- bootstrapping
- variance reduction by antithetic variables
- variance reduction by control variates
- variance reduction by conditioning and stratified sampling
- goodness of fit tests
Statistical analysis of simulated data: Recall
Recall:
- sample mean: $\bar{X} = \sum_{i=1}^n X_i / n$
- def. unbiased and reliable estimators (blackboard)
- sample variance: $S^2 = \sum_{i=1}^n (X_i - \bar{X})^2 / (n-1)$
- thm: the sample variance is an unbiased estimator of the variance ($E[S^2] = \sigma^2$); proof on the blackboard
- sample standard deviation: $S = \sqrt{S^2}$
Statistical analysis of simulated data: A data generation stopping criterion
Suppose you need to estimate some parameter θ up to an acceptable value d for the standard deviation of the estimator:
- fix a confidence level (e.g. 95%) and a precision (e.g. 1.96 d)
- generate at least 100 data values X_i (empirical rule)
- keep on generating, recomputing S, and stop when you have k data values and $S/\sqrt{k} < d$
- estimate θ as $\bar{X} = \sum_{i=1}^k X_i / k$
Observation (efficiency): mean and variance can be updated incrementally:
$\bar{X}_{j+1} = \frac{j \bar{X}_j + X_{j+1}}{j+1} = \bar{X}_j + \frac{X_{j+1} - \bar{X}_j}{j+1}$
$S^2_{j+1} = \left(1 - \frac{1}{j}\right) S^2_j + (j+1) \left(\bar{X}_{j+1} - \bar{X}_j\right)^2$
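The stopping criterion and the incremental recurrences above can be sketched as follows; the target distribution (an Exponential(1), whose mean is 1) is only a hypothetical stand-in for a real simulation run.

```python
import random

def simulate_until_precise(draw, d, min_runs=100):
    """Generate data from draw() until the estimated standard deviation
    of the sample mean, S/sqrt(k), drops below d.
    Uses the incremental mean/variance recurrences from the slide."""
    mean, s2 = 0.0, 0.0
    k = 0
    while True:
        x = draw()
        k += 1
        new_mean = mean + (x - mean) / k          # X-bar_{j+1}
        if k > 1:
            # S^2_{j+1} = (1 - 1/j) S^2_j + (j+1)(X-bar_{j+1} - X-bar_j)^2
            s2 = (1 - 1 / (k - 1)) * s2 + k * (new_mean - mean) ** 2
        mean = new_mean
        if k >= min_runs and (s2 / k) ** 0.5 < d:
            return mean, k

random.seed(42)
# toy target: estimate the mean of an Exponential(1) R.V. (true value 1)
est, runs = simulate_until_precise(lambda: random.expovariate(1.0), d=0.01)
```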
Statistical analysis of simulated data: Interval estimates
Idea: instead of giving a single (mean) value as an estimate, provide a range in which we are confident the parameter value lies.
Definition: if the observed values of the sample mean and the sample standard deviation are $\bar{X} = \bar{x}$ and $S = s$, call the interval $\bar{x} \pm z_{\alpha/2} \, s / \sqrt{n}$ an (approximate) 100(1 − α) percent confidence interval estimate of θ.
Discussion about normal distributions and Slutsky's theorem (blackboard).
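A minimal sketch of the interval estimate; the data source (an Exponential(1) sample) is an illustrative assumption, and z = 1.96 corresponds to the 95% level.

```python
import math
import random
from statistics import mean, stdev

def conf_interval(data, z=1.96):
    """Approximate 100(1 - alpha)% confidence interval for the mean;
    z = z_{alpha/2} (1.96 for a 95% interval)."""
    n = len(data)
    xbar, s = mean(data), stdev(data)
    half = z * s / math.sqrt(n)
    return xbar - half, xbar + half

random.seed(1)
sample = [random.expovariate(1.0) for _ in range(1000)]
lo, hi = conf_interval(sample)  # should bracket the true mean 1 most of the time
```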
Statistical analysis of simulated data: Bootstrapping for MSE
What if the parameter to estimate is not the mean? (e.g. median or variance)
Let $X_1 \ldots X_n$ be our observations (i.i.d. random variables with CDF F); let θ(F) be the parameter to estimate, and $g(X_1 \ldots X_n)$ a corresponding estimator.
To control its quality we measure (estimate) its mean square error:
$MSE(F) = E_F[(g(X_1 \ldots X_n) - \theta(F))^2]$
Discussion: empirical CDF and bootstrap approximation of the MSE (blackboard).
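One way the bootstrap approximation can be sketched: resample from the empirical CDF and average the squared error of the estimator, plugging in the value of θ on the original sample. The choice of the median as estimator and of an Exponential(1) sample are illustrative assumptions.

```python
import random
from statistics import median

def bootstrap_mse(data, estimator, true_value, b=500, rng=None):
    """Bootstrap approximation of MSE(F): draw b resamples from the
    empirical CDF (sampling with replacement) and average the squared
    error of the estimator against the plug-in value of theta."""
    rng = rng or random.Random()
    n = len(data)
    errs = []
    for _ in range(b):
        resample = [rng.choice(data) for _ in range(n)]
        errs.append((estimator(resample) - true_value) ** 2)
    return sum(errs) / b

rng = random.Random(0)
data = [rng.expovariate(1.0) for _ in range(200)]
mse_median = bootstrap_mse(data, median, median(data), rng=rng)
```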
Variance reduction
Let:
- θ be a parameter to estimate
- X_i be n measurements of that value by simulation runs
- $\bar{X} = \sum_{i=1}^n X_i / n$ the sample mean, an unbiased estimate of θ
- $MSE = E[(\bar{X} - \theta)^2] = Var[\bar{X}] = Var[X]/n$
Idea: if you obtain a different unbiased estimate, you may have a smaller variance!
Unfortunately, not so simple (Quality Control example, blackboard).
Variance reduction: Antithetic variables
Let θ = E[X] and X_1, X_2 be i.i.d. R.V.s with expected value θ:
$Var\left[\frac{X_1 + X_2}{2}\right] = \frac{1}{4}\left(Var[X_1] + Var[X_2] + 2\,Cov[X_1, X_2]\right)$
i.e. if X_1 and X_2 were negatively correlated (instead of independent), the variance would be smaller.
Question: how to make them negatively correlated?
Idea: see X_1 as a function h() of the set $U_1 \ldots U_m$ of random numbers used in the simulation; instead of simulating X_2 with new random numbers, take $(1 - U_1) \ldots (1 - U_m)$.
Theorem: when h() is a monotone function, X_1 and X_2 are negatively correlated.
Example: simulating a reliability function (blackboard)
Example: computing e (blackboard)
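A common way the "computing e" blackboard example is set up (an assumption here, since the deck only names it): estimate θ = E[e^U] = e − 1 for U uniform on (0,1), pairing each U with its antithetic 1 − U.

```python
import math
import random

def antithetic_estimate(n, rng):
    """Antithetic estimate of E[e^U] = e - 1: pair each U with 1 - U.
    Since h(u) = e^u is monotone, e^U and e^(1-U) are negatively
    correlated, so averaging the pair reduces variance."""
    pairs = n // 2
    total = 0.0
    for _ in range(pairs):
        u = rng.random()
        total += (math.exp(u) + math.exp(1 - u)) / 2
    return total / pairs

rng = random.Random(7)
est = antithetic_estimate(10_000, rng)  # close to e - 1
```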
Variance reduction: Control Variates
Let θ = E[X], where X is a R.V. output of the simulation. Suppose you have another output R.V. Y whose expected value µ you already know.
def. The R.V. Y is a control variate for the simulation estimator X.
claim: for any constant c, Z = X + c (Y − µ) is also an unbiased estimator of θ (discussion on the blackboard).
claim: for the variance-minimizing choice $c^* = -Cov[X,Y]/Var[Y]$, the variance of Z is not greater than that of X (discussion on the blackboard).
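A sketch of the technique on an assumed toy problem: estimate θ = E[e^U] using Y = U as control variate, since E[U] = 1/2 is known. Estimating c* from the same sample is a pragmatic simplification, not part of the slide.

```python
import math
import random

def control_variate_estimate(n, rng):
    """Estimate theta = E[e^U] (= e - 1) with Y = U as control variate.
    The coefficient c* = -Cov(X, Y) / Var(Y) is estimated from the
    sample itself (a common practical shortcut)."""
    us = [rng.random() for _ in range(n)]
    xs = [math.exp(u) for u in us]
    x_bar, y_bar = sum(xs) / n, sum(us) / n
    cov = sum((x - x_bar) * (u - y_bar) for x, u in zip(xs, us)) / (n - 1)
    var_y = sum((u - y_bar) ** 2 for u in us) / (n - 1)
    c = -cov / var_y
    # Z = X + c (Y - mu), with mu = E[U] = 1/2 known exactly
    zs = [x + c * (u - 0.5) for x, u in zip(xs, us)]
    return sum(zs) / n

rng = random.Random(3)
est = control_variate_estimate(10_000, rng)  # close to e - 1
```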
Variance reduction: Variance Reduction by Conditioning
Let θ = E[X], where X is a R.V. output of the simulation. Suppose you have another output R.V. Y such that E[X | Y] is known in closed form and its value can be obtained through a simulation run.
claim: E[X | Y] is also an unbiased estimator of θ (discussion on the blackboard).
claim: the variance of E[X | Y] is not greater than that of X (discussion on the blackboard).
Example: estimating π (discussion on the blackboard).
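One standard setup of the π example (assumed here, since the deck leaves it to the blackboard): X is the indicator that a uniform point of [−1,1]² falls in the unit circle, with E[X] = π/4; conditioning on the first coordinate gives E[X | V1] = √(1 − V1²), which replaces the 0/1 indicator.

```python
import math
import random

def pi_crude(n, rng):
    """Crude estimate: indicator that a random point of [-1,1]^2
    falls inside the unit circle; E[X] = pi/4."""
    hits = sum(rng.uniform(-1, 1) ** 2 + rng.uniform(-1, 1) ** 2 <= 1
               for _ in range(n))
    return 4 * hits / n

def pi_conditioned(n, rng):
    """Conditioned estimate: E[X | V1] = sqrt(1 - V1^2) is smooth,
    so it has smaller variance than the indicator itself."""
    total = sum(math.sqrt(1 - rng.uniform(-1, 1) ** 2) for _ in range(n))
    return 4 * total / n

rng = random.Random(5)
est = pi_conditioned(10_000, rng)  # close to pi
```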
Variance reduction: Variance Reduction by Stratified Sampling
Let θ = E[X], where X is a R.V. output of the simulation. Suppose you have another discrete output R.V. Y, with values $y_1 \ldots y_k$, such that:
- the probabilities $p_i = P[Y = y_i]$ are known for each i = 1 ... k,
- we can simulate the value of X conditional on $Y = y_i$.
Instead of taking $\bar{X} = \sum_{i=1}^n X_i / n$ after n simulation runs, take
$\varepsilon = \sum_{j=1}^k \bar{X}_j \, p_j$
with $\bar{X}_j$ obtained by simulating with $Y = y_j$.
claim: the variance of ε is never higher than that of $\bar{X}$ (proof on the blackboard).
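A sketch on an assumed toy problem: estimate E[e^U] by letting Y be the index of the stratum of U among k equal-width subintervals of (0,1), so that p_j = 1/k and X conditional on Y = j is simulated by drawing U inside stratum j.

```python
import math
import random

def stratified_estimate(n, k, rng):
    """Stratified estimate of E[e^U]: Y is the stratum index among k
    equal-width strata of (0,1), so p_j = 1/k; within each stratum the
    conditional draws have much smaller variance."""
    runs_per_stratum = n // k
    total = 0.0
    for j in range(k):
        # simulate X conditional on Y = j: U uniform in (j/k, (j+1)/k)
        xbar_j = sum(math.exp((j + rng.random()) / k)
                     for _ in range(runs_per_stratum)) / runs_per_stratum
        total += xbar_j * (1 / k)   # epsilon = sum_j Xbar_j p_j
    return total

rng = random.Random(11)
est = stratified_estimate(10_000, 10, rng)  # close to e - 1
```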
Statistical Validation Techniques
Our simulations are often hypothesis-driven:
- we have a conjecture about the probability distribution of random elements of our system (e.g. the daily number of accidents in a road network)
- we check by simulation whether observations match the conjecture, i.e. whether the simulation data is consistent with the distribution we suppose to have: goodness of fit tests
Statistical Validation Techniques: Chi-square test for discrete data
When data is discrete (e.g. categories): Chi-square test (see R code)
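The course's R code (R's `chisq.test`) is not reproduced here; as an illustration, a stdlib Python sketch of the same statistic on a hypothetical fair-die check:

```python
import random
from collections import Counter

def chi2_statistic(observed, expected):
    """Pearson chi-square statistic: sum of (O_i - E_i)^2 / E_i."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# Hypothetical check: are 600 die rolls consistent with a fair die?
rng = random.Random(9)
rolls = [rng.randint(1, 6) for _ in range(600)]
cnt = Counter(rolls)
observed = [cnt[face] for face in range(1, 7)]
expected = [600 / 6] * 6  # fair-die conjecture

stat = chi2_statistic(observed, expected)
# compare with the 95% chi-square quantile for k - 1 = 5 degrees of
# freedom (about 11.07): reject the conjecture if stat exceeds it
reject = stat > 11.07
```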
Statistical Validation Techniques: Kolmogorov-Smirnov test for continuous data
When data is continuous: Kolmogorov-Smirnov test (see R code)
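Again the R code (R's `ks.test`) is not shown here; a stdlib Python sketch of the one-sample statistic, checking a hypothetical sample against an Exponential(1) conjecture:

```python
import math
import random

def ks_statistic(data, cdf):
    """One-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDF of the data and the conjectured CDF."""
    xs = sorted(data)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

# Hypothetical check: is the sample consistent with an Exponential(1)?
rng = random.Random(13)
sample = [rng.expovariate(1.0) for _ in range(500)]
d = ks_statistic(sample, lambda x: 1 - math.exp(-x))
# compare with the approximate 95% critical value 1.36 / sqrt(n)
reject = d > 1.36 / math.sqrt(len(sample))
```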
Statistical Validation Techniques: Missing parameters
What if we have a conjecture about the distribution, but not about its parameters? E.g., I conjecture accidents to be distributed as in a Poisson process, whose rate is unknown. (see R code)
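The standard approach (sketched here in Python on fabricated illustrative data, not the course's R code): estimate the unknown rate from the data itself, run a chi-square test against the fitted distribution, and lose one degree of freedom per estimated parameter.

```python
import math
import random
from collections import Counter

def poisson_pmf(k, lam):
    return math.exp(-lam) * lam ** k / math.factorial(k)

# Hypothetical data: daily accident counts over 365 days
rng = random.Random(17)
counts = [sum(rng.random() < 0.01 for _ in range(300)) for _ in range(365)]

# step 1: estimate the unknown rate from the data itself
lam_hat = sum(counts) / len(counts)

# step 2: chi-square statistic against Poisson(lam_hat); the last cell
# collects the upper tail so expected frequencies stay positive
n = len(counts)
kmax = max(counts)
cnt = Counter(counts)
observed = [cnt[k] for k in range(kmax + 1)]
expected = [n * poisson_pmf(k, lam_hat) for k in range(kmax)]
expected.append(n * (1 - sum(poisson_pmf(k, lam_hat) for k in range(kmax))))
stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
# note: one degree of freedom is lost for the estimated rate, so the
# reference distribution is chi-square with (cells - 1 - 1) d.o.f.
```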