Statistical Computing Session 4: Random Simulation

Size: px

Start display at page:

Download "Statistical Computing Session 4: Random Simulation"

Tamsin White
5 years ago
Views:

1 Statistical Computing Session 4: Random Simulation Paul Eilers & Dimitris Rizopoulos Department of Biostatistics, Erasmus University Medical Center Masters Track Statistical Sciences, 2010

2 The plan for today Short introduction: who am I? Random simulation, what is it good for? The R system of random number generation Tools for presentation and analysis of simulation output Simulation techniques Getting insight in classical procedures by simulation Masters Track Statistical Sciences 2009 Statistical Computing 1

3 Why simulation? Statistics is about random numbers We assume a model for generating the data The model contains parameters We estimate them from the data, using recipes Examples: regression, ANOVA, time series analysis,... Crucial question: how good is a recipe? Classical (frequentist) statistics: confidence intervals Bayesian: combine prior and data to get posterior distribution Testing (mostly for zero values) is a special case Masters Track Statistical Sciences 2009 Statistical Computing 2

4 How to evaluate performance by analysis In some cases exact results can be obtained Especially for normally distributed data Examples: χ 2, t and F distributions (ANOVA) You will encounter them in other courses here In the old days we used tables Nowadays they are built in as functions in statistical software It is always wise to use analytical results Masters Track Statistical Sciences 2009 Statistical Computing 3

5 How to evaluate performance by simulation Simulate data according to the model Apply the recipe Repeat this many times Study the distributions of parameter estimates Confidence intervals follow from them Probabilities of outcomes of tests can be determined Disadvantage: generalization can be hard What if I have more/less observations, larger variances,...? You might have to simulate again for every case Masters Track Statistical Sciences 2009 Statistical Computing 4

6 The R system for generating random numbers For a given distribution you have four choices I use the normal distribution as an example Give me 100 drawings from standard normal: rnorm(100) The density over a vector x: dnorm(x) The cumulative distribution over a vector x: pnorm(x) The quantiles over a vector of probabilities p: qnorm(p) Let s take a better look Masters Track Statistical Sciences 2009 Statistical Computing 5

7 Example: normal density x = seq(-4, 4 by = 0.1); y = dnorm(x); plot(x, y, type = l, col = blue, lwd = 2) y x Masters Track Statistical Sciences 2009 Statistical Computing 6

8 Example: cumulative normal distribution x = seq(-4, 4 by = 0.1); y = pnorm(x); plot(x, y, type = l, col = blue, lwd = 2) y x Masters Track Statistical Sciences 2009 Statistical Computing 7

9 Example: normal quantiles p = seq(0.001, 0.999, by =0.001) y = qnorm(p) plot(p, y, type = l, lwd = 2, col = blue ) y p Masters Track Statistical Sciences 2009 Statistical Computing 8

10 How to look at random numbers The result of rnorm() is a series of numbers There is little use in plotting them individually Instead we use the histogram We divide x-axis in a number of bins In each bin we count the number of observations We plot it as a bar chart Masters Track Statistical Sciences 2009 Statistical Computing 9

11 Example of a histogram y = rnorm(200) hist(y) Histogram of y Frequency y Masters Track Statistical Sciences 2009 Statistical Computing 10

12 Playing with the histogram y = rnorm(200) b = seq(-4, 4, by = 0.2) # More narrow bins hist(y, breaks = b, col = gray, border = white ) Histogram of y Frequency y Masters Track Statistical Sciences 2009 Statistical Computing 11

13 Adding the theoretical normal density y = rnorm(200); b = seq(-4, 4, by = 0.2) hist(y, breaks = b, col = gray, border = white, freq = F) lines(b, dnorm(b), col = blue, lwd = 2) Histogram of y Density y Masters Track Statistical Sciences 2009 Statistical Computing 12

14 A different result every time? Look at the histograms carefully They seem to be based on different numbers That s indeed the case Every call to rnorm() gives a different result You might not like that (and you should not) It makes it hard to compare results Set the random seed to a starting value Like set.seed(123) or set.seed(911) Any integer number you like (see R help for details) Masters Track Statistical Sciences 2009 Statistical Computing 13

15 The empirical cumulative distribution n = 20; y = rnorm(n) p = ((1:n) - 0.5) / n # Empirical probabilities plot(sort(y), p, type = s, col = red ) p sort(y) Masters Track Statistical Sciences 2009 Statistical Computing 14

16 Adding the theoretical cumulative distribution n = 40; y = rnorm(n); p = ((1:n) - 0.5) / n plot(sort(y), p, type = s, col = red ) x = seq(-3, 3, by = 0.1); lines(x, pnorm(x), col = blue, lwd = 2) p sort(y) Masters Track Statistical Sciences 2009 Statistical Computing 15

17 Randomness at work The theoretical distribution fits not impressively This is generally the case for small samples Run these programs many times (Without set.seed(), of course) You will see randomness at work You wil appreciate variation in small samples better Masters Track Statistical Sciences 2009 Statistical Computing 16

18 The Q-Q plot We have our sorted observations And the empirical cumulative probabilities For the latter we can compute normal quantiles (with qnorm()) Plot one against the other We ll program it ourselves That has some educative value, I hope The function qqnorm() is easier to use Plot line, based on mean and variance, with qqplot() Of course, this only compares to the normal distribution Masters Track Statistical Sciences 2009 Statistical Computing 17

19 Q-Q plot from scratch n = 40; y = rnorm(n); p = ((1:n) - 0.5) / n plot(qnorm(p), sort(y), col = red ) sort(y) qnorm(p) Masters Track Statistical Sciences 2009 Statistical Computing 18

20 Q-Q plot the easy way n = 40; y = rnorm(n); qqnorm(y) qqline(y, col = blue, lwd = 2) Normal QQ Plot Sample Quantiles Theoretical Quantiles Masters Track Statistical Sciences 2009 Statistical Computing 19

21 An improvement on the histogram The histogram is a great idea But it is sensitive to the width of the bins And to their positions (midpoints) An improvement is the kernel density estimator It is explained in Chapter 10 of the book We will discuss that later And develop a further improvement Masters Track Statistical Sciences 2009 Statistical Computing 20

22 Comparing histogram and kernels y = rnorm(200); b = seq(-4, 4, by = 0.2) plot(density(y), col = red, lwd = 2) lines(b, dnorm(b), col = blue, lwd = 2) Histogram of y density.default(x = y) Density Density y N = 200 Bandwidth = Masters Track Statistical Sciences 2009 Statistical Computing 21

23 . Part II Masters Track Statistical Sciences 2009 Statistical Computing 22

24 How to generate random numbers The uniform distribution is always the starting point Sum 6 (or more) uniforms to approximate normal distribution This is of limited usefulness Apply a function: -log(runif(100)) gives exponential distribution Inverse transform: compute quantiles of desired distribution (if possible) Acceptance-rejection sampling Combinations of distributions Mixtures Masters Track Statistical Sciences 2009 Statistical Computing 23

25 Sums of uniform random variables The uniform distribution runs from 0 to 1 Mean 0.5, variance 1/12 Sum of n (independent) uniforms has mean n/2 The variance is n/12 Subtract n/2 Divide by n/12 To get standard normal distribution This is an approximation, but a good one Masters Track Statistical Sciences 2009 Statistical Computing 24

26 Illustration of sums Y = matrix(runif(10000 * 10), 10000, 10); y = apply(y, 1, sum) plot(density(y), col = red, lwd = 2) x = seq(0, 10, by = 0.1); lines(x, dnorm(x, 10 * 0.5, sqrt(10 / 12)), density.default(x = y) Density N = Bandwidth = Masters Track Statistical Sciences 2009 Statistical Computing 25

27 Simple transformations Apply a function to uniformly distributed random numbers Example: y = -log(runif(n)) gives exponential distribution Sums of these leads to the gamma distribution Many of such tricks exist See examples in the book Masters Track Statistical Sciences 2009 Statistical Computing 26

28 Inverse transform or quantiles Remember the cumulative distribution: p = F (x) Where F (x) = x f(t)dt And f(x) is the probability density We have that 0 < p < 1 Turn things around: x = F 1 (p) Generate uniformly distributed p Compute x = F 1 (p) Then x will have density f Of course, you need a procedure to invert F Example: qnorm(runif(n)) is equivalent to rnorm(n) Masters Track Statistical Sciences 2009 Statistical Computing 27

29 Illustration of the quantile approach pnorm(x) p p x x x Masters Track Statistical Sciences 2009 Statistical Computing 28

30 A special case: the exponential distribution Density: fx) = e x (positive x) Cumulative distribution: F (x) = 1 e x You can check this by integrating it yourself From p = 1 e x follows x = log(1 p) But if p from uniform distribution, so is 1 p Hence x = log(p) has exponential distribution Masters Track Statistical Sciences 2009 Statistical Computing 29

31 Inverting the cumulative distribution You need a formula for F 1 (p) In many cases an exact formula is not available But over the years very good approximations were found Try?qnorm to see pointers to literature Luckily, R has many built-in procedures Masters Track Statistical Sciences 2009 Statistical Computing 30

32 Acceptance-rejection sampling Suppose you have a formula for f(x), the density But none for F (x), the cumulative distribution Acceptance sampling is a solution Select a domain in which x will fall (almost surely) Let x 1 (x 2 ) be left (right) boundary Let u = maxf(x) (over this domain) Generate y uniformly distributed between x 1 and x 2 Generate z uniformly distributed betweeen 0 and u If z f(y), accept y If not, repeat Masters Track Statistical Sciences 2009 Statistical Computing 31

33 Illustration of acceptance-rejection sampling xlo = -3 xhi = 3 x = seq(xlo, xhi, by = 0.1) f = dnorm(x) plot(x, f, lwd =2, type = l, col = blue ) u = max(f) n = 1000 a = rep(0, n) y = runif(n) * (xhi - xlo) xlo z = runif(n) * u a = 1 * z < dnorm(y) g = c( -, ) points(y, z, pch = g[a 1]) Masters Track Statistical Sciences 2009 Statistical Computing 32

34 The result of acceptance sampling x f Masters Track Statistical Sciences 2009 Statistical Computing 33

35 Combinations of distributions Example: t distribution Let s be sum of n squared standard normal variates The s has χ 2 n distribution (n degrees of freedom) Let x be one standard normal variate Then x/ s has t distribution Ratio of gammas has beta distribution Ratio of normals has Cauchy distribution Ratio of χ 2 has F distribution Masters Track Statistical Sciences 2009 Statistical Computing 34

36 Mixtures Mixtures are a very flexible tool Basic idea: probability p 1 to draw from density f 1 Probability p 2 to draw from density f 2 And so on Typical application: outlier simulation Two normals: f 1 (x) = N(x; µ, σ), f 2 (x) = N(x; µ, 10 σ) Probabilities: p 1 = 0.98, p 2 = 0.02 Chance of 2% to get a large outlier Masters Track Statistical Sciences 2009 Statistical Computing 35

37 The use of random numbers You probably will not often aim for a specific distribution Instead you program a data generating procedure Mimicking a model you are working on You study the distribution of statistical summaries Educational example: ANOVA and the F distribution Simulate from the null hypothesis, 5 groups, 10 obs., in each Compare empirical (n = 5000) distributions (χ 2 and F ) with theory Masters Track Statistical Sciences 2009 Statistical Computing 36

38 ANOVA simulation: examples of simulated data Masters Track Statistical Sciences 2009 Statistical Computing 37

39 ANOVA: simulated sums of squares and ratio Sum of squares within (SSW) ssw Index Sum of squares between (SSB) ssb Index Ratio MSB/MSW msb/msw Index Masters Track Statistical Sciences 2009 Statistical Computing 38

40 ANOVA: histograms and expected values Sum of squares within (SSW) Frequency ssw Sum of squares between (SSB) Frequency ssb F = MSB / MSW Frequency msb/msw Masters Track Statistical Sciences 2009 Statistical Computing 39

41 ANOVA: histograms and theoretical densities Sum of squares within (SSW) Frequency ssw Sum of squares between (SSB) Frequency ssb F = MSB / MSW Frequency msb/msw Masters Track Statistical Sciences 2009 Statistical Computing 40

Reading hints Theoretical statistical background: Chapter 2 Random number simulation: Section 3.1-3.

42 Reading hints Theoretical statistical background: Chapter 2 Random number simulation: Section Kernel density estimation: Section 10.2 Histogram and qqplot: R on-line documentation Masters Track Statistical Sciences 2009 Statistical Computing 41

Robustness and Distribution Assumptions

Chapter 1 Robustness and Distribution Assumptions 1.1 Introduction In statistics, one often works with model assumptions, i.e., one assumes that data follow a certain model. Then one makes use of methodology