Exercise 5.4 Solution
Niels Richard Hansen
University of Copenhagen
May 7, 2010

1 5.4(a)

> leukemia <- data.frame(y = c(65, 156, 100, 134, 16, 108, 121,
+     4, 39, 143, 56, 26, 22, 1, 1, 5, 65), x = c(3.36, 2.88, 3.63,
+     3.41, 3.78, 4.02, 4, 4.23, 3.73, 3.85, 3.97, 4.51, 4.54,
+     5, 5, 4.72, 5))

We consider the exponential regression model as in Exercise 4.2. When we are asked to use the Wald statistic for the construction of confidence intervals, we understand this as the computation of standard confidence intervals based on the estimated standard error. This corresponds to using the combinant

$$R = \frac{(\hat{\beta}_1 - \beta_1)^2}{\widehat{\mathrm{se}}^2}$$

with an approximating $\chi^2$-distribution with one degree of freedom. In R these computations are easily done using the confint.default function. However, the dispersion parameter cannot be controlled here, and since we use the general Gamma family, where the dispersion parameter is estimated, we do not get exactly what we want.

> leukemiaglm <- glm(y ~ x, family = Gamma(link = "log"), data = leukemia)
> confint.default(leukemiaglm)
                2.5 %     97.5 %
(Intercept)  5.334837 11.6201502
x           -1.868283 -0.3503104

To get exactly what we want under the exponential distributional assumption we need to do the computations by hand.

> coefficients(leukemiaglm) + t(c(-1.96, 1.96) %*% t(sqrt(diag(vcov(leukemiaglm,
+     dispersion = 1)))))
                 [,1]       [,2]
(Intercept)  5.234071 11.7209168
x           -1.892620 -0.3259742

We then turn to bootstrapping. We do this by hand.

> bootexp <- function(theta, B = 999) {
+     # Each column of tmp is one simulated response vector with rates theta
+     tmp <- replicate(B, rexp(length(theta), theta))
+     # The source line is cut off after 'link = "'; "log" is assumed, as in leukemiaglm
+     return(apply(tmp, 2, function(y) glm(y ~ leukemia$x, family = Gamma(link = "log"))))
+ }

The function above implements the parametric resampling and the reestimation of the models. Then we need to decide on a combinant to use, and which kind of interval we want. First we do the resampling using the estimated model. Then we compute statistics corresponding to the combinants

$$\hat{\beta}_i(Y) - \beta_i \quad \text{and} \quad \frac{\hat{\beta}_i(Y) - \beta_i}{\widehat{\mathrm{se}}_i}.$$

Finally, we also compute the parameter estimates used for the percentile interval.

> bootsim <- bootexp(1/fitted(leukemiaglm))
> t1 <- sapply(bootsim, function(m) coefficients(m) - coefficients(leukemiaglm))
> # The standard-error term of t2 is cut off in the source; sqrt(diag(vcov(m, ...))) is assumed
> t2 <- sapply(bootsim, function(m) (coefficients(m) - coefficients(leukemiaglm))/
+     sqrt(diag(vcov(m, dispersion = 1))))
> t1.5 <- sapply(bootsim, function(m) coefficients(m))

Now we compute the different confidence intervals.

> coefficients(leukemiaglm) - t(apply(t1, 1, function(t) quantile(t,
+     c(0.975, 0.025))))
                97.5%       2.5%
(Intercept)  5.079363 12.0461146
leukemia$x  -1.942136 -0.2201165

> coefficients(leukemiaglm) - sqrt(diag(vcov(leukemiaglm, dispersion = 1))) *
+     t(apply(t2, 1, function(t) quantile(t, c(0.975, 0.025))))
                97.5%       2.5%
(Intercept)  5.079363 12.0461146
leukemia$x  -1.942136 -0.2201165

> t(apply(t1.5, 1, function(t) quantile(t, c(0.025, 0.975))))
                 2.5%      97.5%
(Intercept)  4.908873 11.8756239
leukemia$x  -1.998477 -0.2764581

Perhaps surprisingly the two former are identical! This is explained by the fact that the Fisher information does not depend upon the estimated parameters, only on the explanatory variables, and is thus the same for every resampled data set. This is due to two model choices that play together: first, the fixed value of the dispersion parameter, and second, the use of the log-link function. These two choices imply that the weight matrix becomes the identity matrix, so the Fisher information is $X^T X$ and the estimated covariance matrix is $(X^T X)^{-1}$ for the exponential distribution with a log-link function.

The most interesting thing about computing the second confidence intervals is the quantiles:

> t(apply(t2, 1, function(t) quantile(t, c(0.025, 0.975))))
                 2.5%    97.5%
(Intercept) -2.156517 2.053489
leukemia$x  -2.224873 2.083897

They are generally slightly larger in absolute value (depending a little on sampling errors from the resampling) than the 1.96 from the normal approximation.
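
The claim that the covariance matrix reduces to $(X^T X)^{-1}$ can be checked numerically. The following is a minimal sketch, not part of the original solution: the names X, V_model, V_design, y_sim, sim_fit and se_ratio are introduced here for illustration only, and the seed is arbitrary. It compares the model-based covariance matrix (with the dispersion fixed at 1) to the one computed from the design matrix alone, and then refits the model on one simulated response to see whether the standard errors change.

```r
# Data and model as defined at the start of the solution.
leukemia <- data.frame(
  y = c(65, 156, 100, 134, 16, 108, 121, 4, 39, 143, 56, 26, 22, 1, 1, 5, 65),
  x = c(3.36, 2.88, 3.63, 3.41, 3.78, 4.02, 4, 4.23, 3.73, 3.85, 3.97,
        4.51, 4.54, 5, 5, 4.72, 5)
)
leukemiaglm <- glm(y ~ x, family = Gamma(link = "log"), data = leukemia)

# Covariance matrix from the fitted model with the dispersion fixed at 1 ...
V_model <- vcov(leukemiaglm, dispersion = 1)

# ... and the one computed from the design matrix alone, (X^T X)^{-1}.
X <- model.matrix(leukemiaglm)
V_design <- solve(crossprod(X))

# Largest absolute entrywise difference; expected to be at rounding-error level.
max_abs_diff <- max(abs(V_model - V_design))
print(max_abs_diff)

# Refit on one parametric bootstrap sample (hypothetical seed, for illustration).
set.seed(1)
y_sim <- rexp(nrow(leukemia), rate = 1 / fitted(leukemiaglm))
sim_fit <- glm(y_sim ~ x, family = Gamma(link = "log"), data = leukemia)

# Ratio of bootstrap standard errors to the original ones; expected to be 1.
se_ratio <- sqrt(diag(vcov(sim_fit, dispersion = 1))) / sqrt(diag(V_model))
print(se_ratio)
```

If the ratio is 1 for every resampled fit, dividing by the standard error only rescales the first combinant by a constant, which is why the first two bootstrap intervals coincide.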

2 5.4(b)

> summary(leukemiaglm, dispersion = 1)

Call:
glm(formula = y ~ x, family = Gamma(link = "log"), data = leukemia)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.9922  -1.2102  -0.2242   0.2102   1.5646

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   8.4775     1.6548   5.123 3.01e-07 ***
x            -1.1093     0.3997  -2.776  0.00551 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 1)

    Null deviance: 26.282  on 16  degrees of freedom
Residual deviance: 19.457  on 15  degrees of freedom
AIC: 173.97

Number of Fisher Scoring iterations: 8

To compute the difference in deviance from the null model we estimate the model with only an intercept and compute the deviances.

> leukemiaglm0 <- glm(y ~ 1, family = Gamma(link = "log"), data = leukemia)
> summary(leukemiaglm0, dispersion = 1)

Call:
glm(formula = y ~ 1, family = Gamma(link = "log"), data = leukemia)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.5103  -1.1120  -0.1074   0.6023   1.0789

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   4.1347     0.2425   17.05   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 1)

    Null deviance: 26.282  on 16  degrees of freedom
Residual deviance: 26.282  on 16  degrees of freedom
AIC: 178.09

Number of Fisher Scoring iterations: 6

> deviance(leukemiaglm0) - deviance(leukemiaglm)
[1] 6.825567

In R the deviance is by definition the unscaled deviance for the Gamma family. This means that it equals minus twice the log-likelihood ratio statistic only up to a scaling factor (the dispersion parameter). In other words, the deviance does not depend upon the dispersion parameter and equals the deviance as if the model were an exponential model. We can compute the p-value using the $\chi^2$-distribution with one degree of freedom. Another way to do this (which is more convenient for more complicated, successive tests) is to use the anova function. For generalized linear models we need to tell the function which test statistic to use, and we need to be explicit about the dispersion parameter; otherwise it is estimated and used for the $\chi^2$-approximation.

> pchisq(deviance(leukemiaglm0) - deviance(leukemiaglm), 1, lower.tail = FALSE)
[1] 0.008986204

> anova(leukemiaglm, test = "Chisq", dispersion = 1)
Analysis of Deviance Table

Model: Gamma, link: log

Response: y

Terms added sequentially (first to last)

     Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                    16     26.282
x     1   6.8256        15     19.456  0.008986 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

One should note that the p-values computed from the deviance test and from the Wald z-test in the summary table are different. This is in contrast to ordinary linear models. One can encounter situations where one test is significant and the other is not.
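
To see the two p-values side by side, the comparison can be sketched as follows. This is an illustrative sketch rather than part of the original solution: the names fit1, fit0, wald_p, lr_stat and lr_p are introduced here, and the Wald p-value is extracted by column position from the coefficient table on the assumption that the fourth column holds the p-value, as in the summary output shown earlier.

```r
# Data as defined at the start of the solution.
leukemia <- data.frame(
  y = c(65, 156, 100, 134, 16, 108, 121, 4, 39, 143, 56, 26, 22, 1, 1, 5, 65),
  x = c(3.36, 2.88, 3.63, 3.41, 3.78, 4.02, 4, 4.23, 3.73, 3.85, 3.97,
        4.51, 4.54, 5, 5, 4.72, 5)
)

# Full model and intercept-only model, both Gamma with log link.
fit1 <- glm(y ~ x, family = Gamma(link = "log"), data = leukemia)
fit0 <- glm(y ~ 1, family = Gamma(link = "log"), data = leukemia)

# Wald test for the slope with the dispersion fixed at 1 (exponential model).
# Column 4 of the coefficient table is assumed to be the p-value column.
wald_p <- coef(summary(fit1, dispersion = 1))["x", 4]

# Deviance (likelihood ratio) test: difference in unscaled deviances,
# compared to a chi-squared distribution with one degree of freedom.
lr_stat <- deviance(fit0) - deviance(fit1)
lr_p <- pchisq(lr_stat, df = 1, lower.tail = FALSE)

print(c(wald = wald_p, deviance = lr_p))
```

The two values should correspond to the 0.00551 reported in the summary table and the 0.008986 reported by pchisq and anova above.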