STAT440/840: Statistical Computing


Paul Marriott, MC 6096
February 2, 2005

Chapter 3: Data resampling: the bootstrap

Suppose we are interested in a quantity θ = θ(F) that depends on the distribution function F, and that we have available an independent random sample X = (X_1, ..., X_n), where each X_i ~ F. For example, θ might be

- the mean, in which case θ = ∫ x dF(x),
- the median, θ = F⁻¹(1/2),
- the variance, θ = ∫ (x − ∫ x dF(x))² dF(x).

Estimate of θ

From the observed data we have the single value θ̂_n = θ̂_n(X) as an estimate of θ. For example, θ̂_n might be

- the sample mean, in which case θ̂_n = Σ x_i / n,
- the sample median,
- the sample variance.

The question then arises: how accurate is our single value θ̂_n as an estimate of θ?

Idealised Monte Carlo

1. Generate Y_j = (Y_{j1}, ..., Y_{jn}), where Y_{ji} ~ F(·) for i = 1, ..., n and j = 1, ..., N.
2. Calculate θ̂_n(Y_j) for each j = 1, ..., N.
3. Estimate the variability of θ̂_n using the N samples.

The thing that prevents us getting started with this scheme, of course, is that we cannot generate the Y_j, because the distribution function F is not known.
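
As a concrete R sketch of this idealised scheme, assume (purely for illustration; none of these choices come from the slides) that F is a known Exponential(1) distribution, that θ̂_n is the sample median, and that n = 50 and N = 1000:

n <- 50
N <- 1000
theta.hat <- function(y) median(y)            # the statistic of interest
reps <- replicate(N, theta.hat(rexp(n, 1)))   # steps 1 and 2: simulate from F, evaluate theta-hat
sd(reps)                                      # step 3: variability of theta-hat over the N samples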

Idealised Monte Carlo

A way of getting round this is to approximate the unknown distribution function F by another distribution function F̂, where F̂ is obtained from the observed data. We can then proceed as above, but generating from F̂ instead of F. This is the approach that underpins the bootstrap method as a computational technique.

Empirical distribution function

The empirical distribution function is defined by

F̂(x) = #{X_i ≤ x} / n = (1/n) Σ_{i=1}^{n} I(X_i ≤ x).

This puts a mass of probability 1/n at each point in the sample. Sampling from F̂ means drawing X_i from the sample with probability 1/n.
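
In R, the empirical distribution function and sampling from it can be sketched as follows; x here is a small hypothetical data vector, not one of the course data sets:

x <- c(4.2, 5.1, 3.8, 6.0, 4.9)      # hypothetical observed sample
Fhat <- ecdf(x)                       # step function putting mass 1/n on each X_i
Fhat(5)                               # proportion of observations <= 5
xstar <- sample(x, size = length(x), replace = TRUE)   # one sample of size n from Fhat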

The nonparametric bootstrap

We are interested in the distribution of θ̂, but we cannot work this out because we do not know F. However, F̂(x) converges to F(x) as the sample size n → ∞. Thus, provided n is large enough, F̂(·) may be a good approximation to F(·). The key step here is that the unknown distribution function F(·) is replaced by the (known) empirical distribution function F̂(·).

The approximations

The unknown distribution function F(·) is approximated by the empirical distribution function F̂(·). The exact distribution of θ̂ given F̂ is called the ideal bootstrap distribution of θ̂_n, and it is often difficult to calculate. The approximate behaviour of the random variable θ(F̂_n) may instead be examined through simulation: we generate N independent samples of size n from F̂(·) and evaluate the sample value of θ̂ for each of these N samples. Clearly, the larger the value of N, the closer this empirical distribution will be to the ideal bootstrap distribution of θ̂_n.

Sampling from the empirical distribution function

For the nonparametric bootstrap:

1. Generate independent integers i_1, ..., i_n from the discrete uniform distribution on {1, 2, ..., n}.
2. Set X* = (X_{i_1}, X_{i_2}, ..., X_{i_n}).

Then X* is a sample of size n from F̂(·). This scheme is equivalent to sampling n values randomly with replacement from the observed data X_1, ..., X_n.
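
A direct R translation of this two-step scheme, with a hypothetical stand-in for the observed data:

x <- rnorm(20)                                 # stand-in for the observed data
n <- length(x)
idx <- sample(1:n, size = n, replace = TRUE)   # step 1: uniform indices on {1, ..., n}
xstar <- x[idx]                                # step 2: the bootstrap sample X*
# equivalently: xstar <- sample(x, size = n, replace = TRUE)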

The nonparametric bootstrap algorithm

The nonparametric bootstrap algorithm proceeds as follows:

- Generate N independent bootstrap samples X*_1, X*_2, ..., X*_N, each consisting of n data values drawn randomly with replacement from the original data X.
- Evaluate the sample value of θ corresponding to each bootstrap sample: θ̂*_i = θ(X*_i) for i = 1, ..., N.

These are called the bootstrap replications of θ̂_n.

Output

Bootstrap estimate of the standard error of θ̂_n. This is defined as the sample standard error of the N bootstrap replications, i.e.

ŝe_N = { (1/(N − 1)) Σ_{i=1}^{N} (θ̂*_i − θ̄*)² }^{1/2},  where  θ̄* = (1/N) Σ_{i=1}^{N} θ̂*_i.

The standard error is sometimes called the standard deviation. The value of ŝe_N converges to the standard error of θ(F̂_n) as N → ∞. This limiting value is (of course) the standard error of the ideal bootstrap distribution of θ̂_n.
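
Given a vector bootvals of N bootstrap replications, ŝe_N is just their sample standard deviation; a sketch (bootvals below is a random stand-in for real replications):

bootvals <- rnorm(1000)        # stand-in for N bootstrap replications
theta.bar <- mean(bootvals)
se.N <- sqrt(sum((bootvals - theta.bar)^2) / (length(bootvals) - 1))
se.N                           # identical to sd(bootvals)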

Output

Bootstrap estimate of the bias of θ̂_n. The bias of θ̂_n as an estimator of θ is defined to be E{θ̂} − θ(F), where θ(F) means the theoretical value of θ under the true model F. The bootstrap estimate of bias based on N bootstrap replications is then

bias_N = θ̄* − θ̂_n,  where  θ̄* = (1/N) Σ_{i=1}^{N} θ̂*_i.
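
Continuing the same sketch (theta.hat stands for the estimate computed from the original data; both quantities below are stand-ins):

bootvals <- rnorm(1000)                  # stand-in for the N bootstrap replications
theta.hat <- 0                           # stand-in for the original-sample estimate
bias.N <- mean(bootvals) - theta.hat     # bootstrap estimate of bias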

Output

Bootstrap-based confidence intervals. Instead of using the bootstrap standard error as a measure of precision, we may use the bootstrap to construct approximate confidence intervals. (However, N typically needs to be much larger in this case.) Let θ̂*_(1) < ... < θ̂*_(N) denote the ordered bootstrap replications of θ. Then an approximate 100(1 − 2α)% bootstrap-based confidence interval for θ is

[ θ̂*_(αN), θ̂*_((1−α)N) ].

This is easiest to implement when N is chosen so that αN is an integer, but this need not be the case.
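
A sketch of this percentile interval in R (bootvals again a stand-in for the replications; α = 0.025 gives an approximate 95% interval):

bootvals <- rnorm(1000)       # stand-in for the N bootstrap replications
alpha <- 0.025
N <- length(bootvals)
sorted <- sort(bootvals)
ci <- c(sorted[ceiling(alpha * N)], sorted[floor((1 - alpha) * N)])
ci
# letting R interpolate when alpha * N is not an integer:
quantile(bootvals, probs = c(alpha, 1 - alpha))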

Simple bootstrapping in R

bootstrap <- function(x, nboot, theta, ...) {
  # draw nboot resamples of size length(x), one per row
  data <- matrix(sample(x, size = length(x) * nboot, replace = T),
                 nrow = nboot)
  # apply the statistic theta to each row (i.e. to each bootstrap sample)
  answer <- apply(data, 1, theta, ...)
  answer
}

Example: The Old Faithful geyser

[Figure: histogram of the observed geyser data (relative frequency against observed values).]
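
The bootstrap distribution shown on the next slide can be reproduced along the following lines using the bootstrap function above; this sketch uses R's built-in faithful data as a stand-in for the course's geyser data:

observed <- faithful$eruptions               # stand-in for the observed geyser data
bootvals <- bootstrap(observed, 1000, mean)  # 1000 bootstrap replications of the sample mean
hist(bootvals, freq = FALSE)                 # relative-frequency histogram, as on the next slide
sd(bootvals)                                 # bootstrap estimate of the standard error of the mean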

Bootstrap distribution of the sample mean

[Figure: histogram of bootvals (relative frequency against the bootstrap replications of the mean).]

How many bootstrap replications?

Note that the amount of computer time required increases linearly with N. It can be shown that

var(ŝe_N) ≈ c_1/n² + c_2/(nN),

where c_1 and c_2 are constants that depend on the underlying population distribution F, but not on n or N. The first term represents sampling variation, and it tends to zero as the sample size increases. The second term represents the resampling variation, and it approaches zero as N → ∞ for fixed n.

How many bootstrap replications?

Thus ŝe_N always has a greater standard deviation than ŝe_∞, but the practical question is: how much greater? An approximate but quite satisfactory answer can be obtained by looking at the coefficient of variation of ŝe_N, i.e. the ratio of the standard error of ŝe_N to its expected value.

How many replications?

It can be shown that

cv(ŝe_N) ≈ { cv(ŝe_∞)² + (E(Δ̂) + 2)/(4N) }^{1/2}.

Here, Δ̂ is a parameter that depends on how long the tail of the distribution of θ(F̂_n) is, and ŝe_∞ is the ideal bootstrap estimate of standard error.

How many replications?

In practice, Δ̂ is very likely to be less than 10, and the smallest possible value of Δ̂ is −2. An important consequence of this is that, for the values of cv(ŝe_∞) and Δ̂ that are likely to arise in practice, cv(ŝe_N) is unlikely to be much greater than cv(ŝe_∞) for N > 200.
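
The practical message can be checked by evaluating the cv formula over a range of N; the values of cv(ŝe_∞) and E(Δ̂) below are illustrative assumptions, not taken from the slides:

cv.inf <- 0.25                 # assumed ideal coefficient of variation
EDelta <- 0                    # assumed value of E(Delta-hat)
N <- c(25, 50, 100, 200, 500, 1000)
cv.N <- sqrt(cv.inf^2 + (EDelta + 2) / (4 * N))
round(cbind(N, cv.N), 3)       # cv.N settles down close to cv.inf once N is a few hundred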

How many replications?

The following rules of thumb are given by Efron and Tibshirani:

- Even a small number of bootstrap samples, N = 25 say, is usually informative and is often sufficient to give a good estimate of the standard error of θ̂_n.
- It is seldom that more than N = 200 bootstrap replications are needed for estimating a standard error.
- However, much bigger values of N are required for constructing bootstrap-based confidence intervals.

Some worked examples: more complicated data structures

American Law Schools. Two measurements were made on the entering class of each school in 1973: LSAT, the average score for the class on a national law test, and GPA, an average of the student grades for the whole class.

[Table: the 15 observed (LSAT, GPA) pairs.]

We are interested in the standard error of the estimated correlation between these two measurements.

Law School example

law <- matrix(ncol = 2, nrow = 15)
law[,1] <- c(576, 635, 558, 578, ...)      # LSAT column
law[,2] <- c(3.39, 3.30, 2.81, 3.03, ...)  # GPA column

# statistic: correlation of the two columns for the resampled row indices
theta.fn1 <- function(selected, xdata) {
  answer <- cor(xdata[selected,1], xdata[selected,2])
  answer
}

# bootstrap the row indices 1:15, with 1000 replications
bootvals <- bootstrap(1:15, 1000, theta.fn1, xdata = law)
hist(bootvals)
abline(v = cor(law[,1], law[,2]))        # observed correlation
mean(bootvals) - cor(law[,1], law[,2])   # bootstrap estimate of bias

Law school example

[Figure: histogram of bootvals (frequency against the bootstrap replications of the correlation).]

Comparing two samples: the mouse data

In this example we are interested in testing the difference in means of two random samples.

[Table: survival times (in days) for the Treatment and No-treatment groups.]

The mean survival time for the treatment group differs from that for the no-treatment group; we bootstrap to see whether this difference is significant.

Mouse data

Let F and G denote the population distributions of the treatment and no-treatment data respectively. The statistical question (hypothesis) of interest is whether the means of F and G are equal, i.e. whether θ(F, G) = µ(F) − µ(G) = 0 in the obvious notation. From the observed data we have a single value of the random variable θ̂_{7,9}(F, G), the difference between the sample means of independent samples of sizes 7 and 9 from F and G respectively: take independent values x_1, ..., x_7 from F and independent values y_1, ..., y_9 from G, and construct θ̂_{7,9} = x̄ − ȳ, where x̄ is the sample mean of the x_i values, etc.
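
The slides do not show code for this two-sample bootstrap; a minimal sketch, resampling each group independently, with treat and control as hypothetical stand-ins for the two sets of survival times:

treat   <- rexp(7, rate = 1/80)    # stand-in for the 7 treatment survival times
control <- rexp(9, rate = 1/55)    # stand-in for the 9 no-treatment survival times
obs.diff <- mean(treat) - mean(control)     # observed value of theta-hat_{7,9}
boot.diff <- replicate(1000, {
  mean(sample(treat, replace = TRUE)) -     # resample within the treatment group
    mean(sample(control, replace = TRUE))   # resample within the no-treatment group
})
sd(boot.diff)                               # bootstrap standard error of the difference
obs.diff / sd(boot.diff)                    # crude check of how large the observed difference is

Resampling within each group keeps the two sample sizes at 7 and 9, matching the construction above.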

Mouse data

[Figure: histogram of bootvals (frequency against the bootstrap replications of the difference in means).]

Regression

[Figure: scatterplot of strength against curing.time.] Observed values of strength and curing time, with the best-fitting (in a least-squares sense) linear regression line.

Regression

There are two possible nonparametric bootstrap approaches:

Bootstrapping pairs. Construct a bootstrap sample of size 50 by sampling uniformly with replacement from the (y_i, t_i) pairs. Estimate α and β by least squares for each of these bootstrap samples. Repeat many times, and thus obtain the required bootstrap distributions of α̂ and β̂.

Bootstrapping residuals. From the fitted model ŷ = α̂ + β̂t, construct the residuals r_i = y_i − α̂ − β̂t_i. Construct a bootstrap sample of residuals r*_1, ..., r*_50 by sampling uniformly with replacement from r_1, ..., r_50. From r*_1, ..., r*_50, construct a bootstrap replication of the y-data via y*_i = α̂ + β̂t_i + r*_i. Re-fit the model to these (y*_i, t_i) values to obtain bootstrap estimates of α and β. Repeat, and proceed as usual to obtain the required bootstrap distributions.

Code

theta.lr <- function(selected, xdata, drawplot = FALSE) {
  # regress strength (column 1) on curing time (column 2) for the resampled rows
  dummy <- lm(xdata[selected,1] ~ xdata[selected,2])
  if (drawplot == T) abline(dummy)
  answer <- dummy$coef
  answer
}

# bootstrapping pairs: resample the 50 row indices, 25 replications
bootvals <- bootstrap(1:50, 25, theta.lr,
                      xdata = cbind(strength, curing.time), drawplot = T)

Regression

[Figure: scatterplot of strength against curing.time, with the bootstrap regression lines from bootstrapping pairs overlaid.]

Regression

[Figure: histograms of bootvals[1, ] (intercept parameter, alpha) and bootvals[2, ] (slope parameter, beta).]

Regression

theta.res <- function(res, xdata, alpha, beta, drawplot = FALSE) {
  # xdata contains the curing times
  # build a bootstrap replication of the y-data: fitted line plus resampled residuals
  y.boot <- alpha + beta * xdata + res
  answer <- lm(y.boot ~ xdata)
  if (drawplot == T) abline(answer)
  answer <- answer$coef
  answer
}

# bootstrapping residuals: resample res.vals (the fitted residuals), 25 replications;
# al and be hold the least-squares estimates of alpha and beta from the original fit
bootvals <- bootstrap(res.vals, 25, theta.res, xdata = curing.time,
                      alpha = al, beta = be, drawplot = T)

Regression

[Figure: scatterplot of strength against curing.time, with the bootstrap regression lines from bootstrapping residuals overlaid.]

Regression

[Figure: histograms of bootvals[1, ] (intercept parameter, alpha) and bootvals[2, ] (slope parameter, beta) from bootstrapping residuals.]

Regression

We have two approaches that give very similar answers for this set of data. Which method is best in general? The answer depends on how far we believe the assumed structure of the regression model. Under the second approach, we assume that the order of the residuals is not important, so that the residual corresponding to any t_i value is equally likely to have arisen with any other t_j value. This corresponds to assuming that the distribution of the error ε_i does not depend on t_i. Bootstrapping pairs does not make this assumption, and so is more robust than bootstrapping residuals.

An example of the parametric bootstrap

boot.bvn <- function(nboot, ndata, m1, m2, v1, v2, rho, ...) {
  # nboot is the number of bootstrap repetitions
  # ndata is the number of data points
  how.many <- nboot * ndata
  # generate standard bivariate normal pairs with correlation rho,
  # then rescale to the fitted means and variances
  X <- rnorm(how.many, 0, 1)
  X1 <- m1 + sqrt(v1) * X
  X.mat <- matrix(X1, nrow = nboot)
  Y <- rnorm(how.many, rho * X, sqrt(1 - rho^2))
  Y1 <- m2 + sqrt(v2) * Y
  Y.mat <- matrix(Y1, nrow = nboot)
  # each row is one bootstrap sample: ndata x-values followed by ndata y-values
  data <- cbind(X.mat, Y.mat)
  answer <- apply(data, 1, theta.bvn, n = ndata)
  answer
}
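
The statistic function theta.bvn is called above but not shown on the slides; a plausible sketch (an assumption, not the course's code) that splits one simulated row back into its x- and y-parts and returns the sample correlation:

theta.bvn <- function(row, n) {
  # row holds n x-values followed by n y-values (see the layout in boot.bvn)
  x <- row[1:n]
  y <- row[(n + 1):(2 * n)]
  cor(x, y)
}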

Parametric or nonparametric bootstrap?

If the family has distribution F_α, where α is the (vector of) unknown parameters, then fitting F_α to the data involves choosing some particular value α̂ according to a goodness-of-fit criterion, maximum likelihood for example. Once α̂ has been found, bootstrap samples of size n are generated from F_α̂ and used exactly as before to obtain bootstrap replications of θ̂_n. This is called the parametric bootstrap: you resample from an estimated member of the family.

Parametric Bootstrap

To illustrate the parametric bootstrap, we will refer back to the Law School data and assume that the observed values are from a bivariate normal distribution. Let (x_i, y_i), i = 1, ..., 15, denote the observed (LSAT, GPA) values. Thus we assume that the (x_i, y_i) values are an independent sample from N_2(µ_1, µ_2, σ_1², σ_2²; ρ). The first thing we need to do is pick a particular member of the bivariate normal family.

Parametric Bootstrap

Here (µ_1, µ_2) can be estimated reasonably by (x̄, ȳ). The covariance matrix

[ var(X)     cov(X, Y) ]   [ σ_1²        ρ σ_1 σ_2 ]
[ cov(X, Y)  var(Y)    ] = [ ρ σ_1 σ_2   σ_2²      ]

may be estimated reasonably by

(1/14) [ Σ(x_i − x̄)²           Σ(x_i − x̄)(y_i − ȳ) ]
       [ Σ(x_i − x̄)(y_i − ȳ)   Σ(y_i − ȳ)²         ].

This yields µ̂_1 = 600.3 and µ̂_2 = 3.1, together with the corresponding sample variances and sample correlation ρ̂. From this estimated distribution we can resample.
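
Putting the pieces together, a sketch of the parametric bootstrap for the correlation, assuming the law matrix defined in the Law School example holds all 15 (LSAT, GPA) pairs and using boot.bvn and theta.bvn as above:

# plug-in (fitted) parameters of the bivariate normal
m1 <- mean(law[,1]);  m2 <- mean(law[,2])
v1 <- var(law[,1]);   v2 <- var(law[,2])
rho <- cor(law[,1], law[,2])

# 1000 parametric bootstrap replications of the correlation, samples of size 15
bootvals <- boot.bvn(1000, 15, m1, m2, v1, v2, rho)
hist(bootvals)
sd(bootvals)    # parametric bootstrap estimate of the standard error of the correlation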

Parametric Bootstrap

[Figure: histogram of bootvals (frequency against the parametric bootstrap replications of the correlation).]
