

Myron Collin Ford
5 years ago

STAT 675 Statistical Computing
Solutions to Homework Exercises, Chapter 7

Note that some outputs may differ, depending on machine settings, generating seeds, random variate generation, etc.

For the law-school data from Example 7.2, sample R code is

    require(bootstrap); attach(law)
    n = length( LSAT )
    R.hat = cor( LSAT,GPA )
    R.jack = numeric( n )
    for (i in 1:n) { R.jack[i] = cor( LSAT[-i],GPA[-i] ) }
    bias.jack = (n-1)*( mean(R.jack) - R.hat )
    R.bar = mean(R.jack)
    se.jack = sqrt( (n-1)*mean( (R.jack-R.bar)^2 ) )

We find [1] for the bias, and [1] for the std. error.

For the air-conditioning data in the boot package, assume the observations (in hours) are $X_i \sim$ i.i.d. Exp($\lambda$) with p.d.f. $f(x) = \lambda e^{-\lambda x} I_{(0,\infty)}(x)$. The MLE of the hazard rate parameter is the reciprocal of the sample mean, $\hat{\lambda} = 1/\bar{X}$. Sample R code for this calculation, and for the bootstrap bias and std. error estimates, is

    require(boot)
    attach(aircondit)
    lambda.hat = 1/mean( hours )   #MLE of rate lambda
    B = 5000                       #no. bootstrap resamples
    set.seed(74)
    boot(data=aircondit, statistic=function(x,i){1/mean(x[i,])}, R=B)

This produces (edited)

    Call: boot(data = aircondit, statistic = function(x, i) { 1/mean(x[i, ]) }, R = B)
    t1*

from which we see $\hat{\lambda}$ = , with $\widehat{\mathrm{bias}}_{boot}(\hat{\lambda}^*)$ = and $\widehat{\mathrm{se}}_{boot}(\hat{\lambda}^*)$ = .

For the air-conditioning data from Exercise 7.4, 95% bootstrap confidence intervals on $1/\lambda$ can be found via the following sample R code.

Start with the data load and perform the bootstrapping:

    require(boot); attach(aircondit)
    x = hours
    gaptime.hat = mean(x)   #MLE of 1/lambda
    B = 5000                #no. bootstrap resamples
    set.seed(75)
    exc705.boot = boot( data=aircondit, statistic=function(x,i){mean(x[i,])}, R=B )

The bootstrap summary (edited) reports:

    Call: boot(data = aircondit, statistic = function(x, i) { mean(x[i, ])}, R = B)
    t1*

We see the MLE is $1/\hat{\lambda}$ = , with estimated std. error of $\widehat{\mathrm{se}}_{boot}$ = .

For the bootstrap intervals, sample R code is

    einf.jack = empinf(exc705.boot, type='jack')
    boot.ci( exc705.boot, type=c('basic','perc','bca'), L=einf.jack )

where the call to empinf() is used to generate an empirical influence object using the usual jackknife. This produces (edited)

    BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
    Based on 5000 bootstrap replicates
    CALL : boot.ci(boot.out = exc705.boot, type = c("basic", "perc", "bca"), L = einf.jack)
    Intervals :
    Level      Basic         Percentile        BCa
    95%   ( 25.3,     )   ( 46.5,     )   ( 56.3,     )
    Calculations and Intervals on Original Scale

We see the intervals are quite disparate. A primary reason is that the bootstrap distribution is still skewed, which affects the simpler methods and their appeal to the Central Limit Theorem. For example,

    hist(exc705.boot$t, main='', xlab=expression(1/lambda), prob=TRUE)
    points(exc705.boot$t0, 0, pch = 19)

gives a histogram of the bootstrapped $1/\lambda$ values, in which the solid dot indicates the MLE. The BCa interval incorporates an acceleration adjustment for skew, and may be preferred here.

The scatterplot matrix is available via

    require(bootstrap); attach( scor )
    pairs( scor, pch=19 )

[Plot: scatterplot matrix of the scor data]

Similarly, the sample correlation matrix is found via

    cor( scor )

yielding

         mec  vec  alg  ana  sta
    mec
    vec
    alg
    ana
    sta

The plot and correlation matrix impart similar information: potentially strong positive correlations exist between most of the variable pairs.
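To read off which pairs are most strongly correlated, the upper triangle of a correlation matrix can be flattened into a sorted list. The following is only a sketch in base R: the data are simulated, and the column names t1 to t4 and the shared ability factor are hypothetical stand-ins, not the scor data.

```r
# Hypothetical simulated test scores sharing one latent factor (not the scor data)
set.seed(3)
n <- 88
ability <- rnorm(n)
scores <- data.frame(
  t1 = 50 + 8 * ability + rnorm(n, sd = 10),
  t2 = 50 + 8 * ability + rnorm(n, sd = 8),
  t3 = 50 + 8 * ability + rnorm(n, sd = 6),
  t4 = 50 + 8 * ability + rnorm(n, sd = 4)
)

r <- cor(scores)

# Row/column positions of the upper triangle, i.e. each distinct pair once
idx <- which(upper.tri(r), arr.ind = TRUE)

pairs_tbl <- data.frame(
  var1 = rownames(r)[idx[, "row"]],
  var2 = colnames(r)[idx[, "col"]],
  r    = r[idx]
)

# Strongest correlations first
pairs_tbl <- pairs_tbl[order(-pairs_tbl$r), ]
print(pairs_tbl, row.names = FALSE)
```

With the real data, replacing scores by scor would give the same kind of listing for the five test variables.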

Estimates of the various correlations are essentially the observed correlations found in the correlation matrix. For example, with $\rho_{12}$ = cor[vec, mec], the following sample R code

    B = 5000   #no. bootstrap resamples
    set.seed(76)
    exc706.boot = boot( data=cbind(mec,vec), statistic=function(x,i){cor(x[i,1],x[i,2])}, R=B )

yields the summary output (edited)

    Call: boot(data = cbind(mec, vec), statistic = function(x, i) { cor(x[i, 1], x[i, 2])}, R = B)
    t1*

where we see the estimate is simply $\hat{\rho}_{12}$ = , as in the correlation matrix. One could, however, use the information in the bootstrap distribution to estimate $\rho_{12}$ as, say, the bootstrap mean

    mean( exc706.boot$t )
    [1]

or the bootstrap median

    median( exc706.boot$t )
    [1]

Here, these give roughly similar values.

(Readers do not need to know the details of Principal Component Analysis to undertake these calculations.) We continue here with the test score data from Exercise 7.6. The textbook does not specify the actual likelihood for these data, so instead of the MLE for the true covariance matrix, we find the sample covariance matrix $\hat{\Sigma}$ from Exercise 7.6. To find its eigenvalues, $\hat{\lambda}_i$, start with

    cov( scor )

This produces

         mec  vec  alg  ana  sta
    mec
    vec
    alg
    ana
    sta

Next, apply the eigen() function:

    lambda.hat = eigen(cov(scor))$values

This gives the eigenvalues (ordered from largest to smallest, as is standard)

    [1]

The estimate of the proportion of variance, $\theta = \lambda_1 / \sum_i \lambda_i$, is calculated, e.g., via

    theta.hat = lambda.hat[1]/sum(lambda.hat)

We find $\hat{\theta}$ = . Bootstrap estimates of its bias and std. error are available via the sample R code

    theta.i = function(x,i){
      eigen(cov(x[i,]))$values[1]/sum(eigen(cov(x[i,]))$values)
    } #end function
    B = 5000   #no. bootstrap resamples
    require(boot)
    set.seed(77)
    exc707.boot = boot( data=scor, statistic=theta.i, R=B )
    detach(package:boot)

(Note that B = 5000 is a bit large and may stress some less efficient computers with this complex a calculation.) This produces the output (edited)

    Call: boot(data = scor, statistic = theta.i, R = B)
    t1*

from which we see $\hat{\theta}$ = (as expected), $\widehat{\mathrm{bias}}_{boot}(\hat{\theta}^*)$ = , and $\widehat{\mathrm{se}}_{boot}(\hat{\theta}^*)$ = .

Continuing with the test score data from Exercise 7.6, to find jackknife estimates of bias and std. error, sample R code to attach the data and define the core function is

    require(bootstrap); attach( scor )
    theta = function(x){
      eigen(cov(x))$values[1]/sum(eigen(cov(x))$values)
    } #end function

Sample R code to complete the jackknife calculations is

    n = length( scor[,1] ); x = as.matrix(scor)
    theta.hat = theta( x )   #full-sample estimate
    theta.jack = numeric( n )
    for (i in 1:n) { theta.jack[i] = theta( x[-i,] ) }
    bias.jack = (n-1)*( mean(theta.jack) - theta.hat )
    theta.bar = mean(theta.jack)
    se.jack = sqrt( (n-1)*mean( (theta.jack-theta.bar)^2 ) )
    print( list(theta.hat=theta(scor), bias=bias.jack, se=se.jack) )
    detach(package:bootstrap)

We find

    $theta.hat
    [1]
    $bias
    [1]
    $se
    [1]

so $\widehat{\mathrm{bias}}_{jack}(\hat{\theta})$ = , and $\widehat{\mathrm{se}}_{jack}(\hat{\theta})$ = .

Continuing with the test score data from Exercise 7.6, to find 95% bootstrap confidence intervals for $\theta$, sample R code is

    require(bootstrap); attach( scor )
    theta.i = function(x,i){
      eigen(cov(x[i,]))$values[1]/sum(eigen(cov(x[i,]))$values)
    } #end function
    B = 5000   #no. bootstrap resamples
    set.seed(79)
    require(boot)
    exc709.boot = boot( data=scor, statistic=theta.i, R=B )

(As in Exercise 7.7, B = 5000 is a bit large and may stress some less efficient computers with this complex a calculation.) This produces the following output (edited)

    Call: boot(data = scor, statistic = theta.i, R = B)
    t1*

(Notice the change in $\widehat{\mathrm{bias}}_{boot}(\hat{\theta}^*)$, and the slight change in $\widehat{\mathrm{se}}_{boot}(\hat{\theta}^*)$, from those in Exercise 7.7, due to the different seed.) To find the bootstrap intervals, we can use

    einf.jack = empinf(exc709.boot, type='jack')
    boot.ci( exc709.boot, type=c('perc','bca'), L=einf.jack )

which produces (output edited)

    BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
    Based on 5000 bootstrap replicates
    CALL : boot.ci(boot.out = exc709.boot, type = c("perc", "bca"), L = einf.jack)
    Intervals :
    Level     Percentile          BCa
    95%   (     ,     )   (     ,     )
    Calculations and Intervals on Original Scale

Here, the 95% intervals are fairly similar. A histogram of the 5000 bootstrapped values of $\hat{\theta}^*$ (not shown) is unimodal and exhibits only a slight left skew. This suggests that convergence to normality under the Central Limit Theorem is active, so that the percentile method produces relatively stable confidence limits.

For the iron-slag data from Example 7.18, sample R code for performing cross-validation on the new cubic model #5, $Y = \beta_0 + \beta_1 X + \beta_2 X^2 + \beta_3 X^3 + \varepsilon$, mimics that given in the Example:

    require( DAAG )
    attach( ironslag )
    n = length( magnetic )
    e5 = numeric( n )
    for (k in 1:n) {
      y = magnetic[-k]
      x = chemical[-k]
      J5 = lm(y ~ x + I(x^2) + I(x^3))
      e5[k] = magnetic[k] - predict( J5, newdata=data.frame(x=chemical[k]) )
    } #end for loop
    mean(e5^2)

This gives the MSPE for the cubic model as

    [1]

which is larger than the MSPE for the quadratic model ( ) seen in the Example. The quadratic model remains the best of the candidate group.
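The leave-one-out loop above refits the model n times. For ordinary least-squares fits with an untransformed response, the deleted prediction error can also be obtained from a single fit as $e_k/(1 - h_{kk})$, where $h_{kk}$ is the k-th hat value. The sketch below checks that the two routes agree. It uses simulated data, and the helper names loo_loop and loo_hat are hypothetical, so treat it as an illustration rather than a replacement for the ironslag analysis.

```r
# Hypothetical simulated data standing in for (chemical, magnetic)
set.seed(1)
n <- 40
x <- runif(n, 10, 30)
y <- 5 + 1.2 * x - 0.02 * x^2 + rnorm(n, sd = 2)
dat <- data.frame(x = x, y = y)

# Leave-one-out prediction errors by explicit refitting (as in the loop above)
loo_loop <- function(formula, data) {
  n <- nrow(data)
  e <- numeric(n)
  resp <- all.vars(formula)[1]   # name of the response variable
  for (k in 1:n) {
    fit <- lm(formula, data = data[-k, ])
    e[k] <- data[[resp]][k] - predict(fit, newdata = data[k, , drop = FALSE])
  }
  e
}

# The same errors from one fit, using the hat values
loo_hat <- function(formula, data) {
  fit <- lm(formula, data = data)
  residuals(fit) / (1 - hatvalues(fit))
}

f_quad  <- y ~ x + I(x^2)
f_cubic <- y ~ x + I(x^2) + I(x^3)

mspe_loop <- c(quad  = mean(loo_loop(f_quad, dat)^2),
               cubic = mean(loo_loop(f_cubic, dat)^2))
mspe_hat  <- c(quad  = mean(loo_hat(f_quad, dat)^2),
               cubic = mean(loo_hat(f_cubic, dat)^2))

print(rbind(mspe_loop, mspe_hat))
```

The shortcut does not carry over directly to the exponential model of the Example, because that model is fitted to log(y) while its prediction error is measured on the original scale.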

To compare these results with another model selection criterion, such as maximizing the adjusted coefficient of determination, $R^2_{adj}$, sample R code is

    y = magnetic
    x = chemical
    L1 = lm( y ~ x )
    L2 = lm( y ~ x + I(x^2) )
    L3 = lm( log(y) ~ x )
    L5 = lm( y ~ x + I(x^2) + I(x^3) )
    c(summary(L1)$adj.r.squared, summary(L2)$adj.r.squared,
      summary(L3)$adj.r.squared, summary(L5)$adj.r.squared)

This produces (in model order: linear, quadratic, exponential, cubic)

    [1]

from which we see the quadratic model has the highest $R^2_{adj}$ and would again be preferred.
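For reference, the quantity reported as adj.r.squared for a model with an intercept and p predictors is $R^2_{adj} = 1 - \frac{RSS/(n-p-1)}{TSS/(n-1)}$. The sketch below computes it by hand and compares it with the value from summary(). The data are simulated and the helper adj_r2 is a hypothetical name, so this only illustrates the formula.

```r
# Hypothetical simulated data (not the ironslag data)
set.seed(2)
n <- 50
x <- runif(n, 10, 30)
y <- 5 + 1.2 * x - 0.02 * x^2 + rnorm(n, sd = 2)

# Adjusted R^2 by hand, for a model that includes an intercept
adj_r2 <- function(fit) {
  y_obs <- model.response(model.frame(fit))  # response as used in the fit
  n <- length(y_obs)
  p <- length(coef(fit)) - 1                 # predictors, excluding the intercept
  rss <- sum(residuals(fit)^2)
  tss <- sum((y_obs - mean(y_obs))^2)
  1 - (rss / (n - p - 1)) / (tss / (n - 1))
}

fits <- list(
  linear    = lm(y ~ x),
  quadratic = lm(y ~ x + I(x^2)),
  cubic     = lm(y ~ x + I(x^2) + I(x^3))
)

by_hand  <- sapply(fits, adj_r2)
built_in <- sapply(fits, function(f) summary(f)$adj.r.squared)
print(rbind(by_hand, built_in))
```

Note that for a model fitted to log(y), such as the exponential model above, both TSS and RSS are computed on the log scale, so its $R^2_{adj}$ describes variation in log(y) rather than in y.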


4 Resampling Methods: The Bootstrap

Situation: Let $x_1, x_2, \ldots, x_n$ be an SRS of size $n$ taken from a distribution that is unknown. Let $\theta$ be a parameter of interest associated with this distribution and