Exercise 5.4 Solution


Niels Richard Hansen
University of Copenhagen
May 7,

(a)

> leukemia <- data.frame(y = c(65, 156, 100, 134, 16, 108, 121,
+     4, 39, 143, 56, 26, 22, 1, 1, 5, 65), x = c(3.36, 2.88, 3.63, , 3.78, 4.02, 4, 4.23, 3.73, 3.85, 3.97, 4.51, 4.54,
+     5, 5, 4.72, 5))

We consider the exponential regression model as in Exercise 4.2. When we are asked to use the Wald statistic for the construction of confidence intervals, we understand this as the computation of standard confidence intervals based on the estimated standard error. This corresponds to using the combinant

$$R = \frac{(\hat\beta_1 - \beta_1)^2}{\widehat{\mathrm{se}}^2}$$

with an approximating $\chi^2$-distribution with one degree of freedom. In R these computations are easily done using the confint.default function. However, the dispersion parameter cannot be controlled here, and since we use the general Gamma-family we don't get exactly what we want.

> leukemiaglm <- glm(y ~ x, family = Gamma(link = "log"), data = leukemia)
> confint.default(leukemiaglm)
            2.5 % 97.5 %
(Intercept)
x

To get exactly what we want under the exponential distributional assumption we need to do the computations by hand.
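The link between the combinant $R$ and the standard interval can be made explicit. As a short sketch, write $z_{0.975} \approx 1.96$ for the 97.5% quantile of the standard normal distribution (notation introduced here for illustration, not taken from the exercise). The 95% quantile of the $\chi^2$-distribution with one degree of freedom is then $z_{0.975}^2 \approx 3.84$, and

```latex
\[
R = \frac{(\hat\beta_1 - \beta_1)^2}{\widehat{\mathrm{se}}^2} \le z_{0.975}^2
\quad\Longleftrightarrow\quad
\hat\beta_1 - z_{0.975}\,\widehat{\mathrm{se}}
\;\le\; \beta_1 \;\le\;
\hat\beta_1 + z_{0.975}\,\widehat{\mathrm{se}} .
\]
```

Under the exponential model the standard error entering this interval is the one obtained with the dispersion parameter fixed at 1.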

> coefficients(leukemiaglm) + t(c(-1.96, 1.96) %*% t(sqrt(diag(vcov(leukemiaglm,
+     dispersion = 1)))))
            [,1] [,2]
(Intercept)
x

We then turn to bootstrapping. We do this by hand.

> bootexp <- function(theta, B = 999) {
+     tmp <- replicate(B, rexp(length(theta), theta))
+     return(apply(tmp, 2, function(y) glm(y ~ leukemia$x, family = Gamma(link = "log"))))
+ }

The function above implements the parametric resampling and the re-estimation of the models. Then we need to decide on a combinant to use, and which kind of interval we want. First we do the resampling using the estimated model. Then we compute statistics corresponding to the combinants

$$\hat\beta_i(Y) - \beta_i \quad\text{and}\quad \frac{\hat\beta_i(Y) - \beta_i}{\widehat{\mathrm{se}}_i}.$$

Finally, we also compute the parameter estimates used for the percentile interval.

> bootsim <- bootexp(1/fitted(leukemiaglm))
> t1 <- sapply(bootsim, function(m) coefficients(m) - coefficients(leukemiaglm))
> t2 <- sapply(bootsim, function(m) (coefficients(m) - coefficients(leukemiaglm))/
+     sqrt(diag(vcov(m, dispersion = 1))))
> t1.5 <- sapply(bootsim, function(m) coefficients(m))

Now we compute the different confidence intervals.

> coefficients(leukemiaglm) - t(apply(t1, 1, function(t) quantile(t,
+     c(0.975, 0.025))))

            97.5% 2.5%
(Intercept)
leukemia$x

> coefficients(leukemiaglm) - sqrt(diag(vcov(leukemiaglm, dispersion = 1))) *
+     t(apply(t2, 1, function(t) quantile(t, c(0.975, 0.025))))
            97.5% 2.5%
(Intercept)
leukemia$x

> t(apply(t1.5, 1, function(t) quantile(t, c(0.025, 0.975))))
            2.5% 97.5%
(Intercept)
leukemia$x

Perhaps surprisingly, the first two are identical! This is explained by the fact that the Fisher information does not depend on the estimated parameters, only on the explanatory variables, and is thus constant. This in turn is due to two model choices that play together: the fixed value of the dispersion parameter and the use of the log-link function. These two choices imply that the weight matrix becomes the identity matrix, so for the exponential distribution with a log-link function the Fisher information is in fact equal to $X^T X$, with inverse $(X^T X)^{-1}$.

The most interesting thing about computing the second confidence intervals is the quantiles:

> t(apply(t2, 1, function(t) quantile(t, c(0.025, 0.975))))
            2.5% 97.5%
(Intercept)
leukemia$x

In absolute value they are generally slightly larger (depending a little on sampling errors from the resampling) than the 1.96 from the normal approximation.
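The claim that the estimated covariance matrix is free of the response can be checked numerically. The following is a minimal sketch on simulated data, not on the leukemia data: the covariate grid, the coefficients, and the names x, X, y1, y2, fit1 and fit2 are all hypothetical choices made for illustration.

```r
# Simulated setup; the covariate values and coefficients are hypothetical.
set.seed(1)
x <- seq(3, 5, length.out = 17)
X <- cbind("(Intercept)" = 1, x = x)

# Two unrelated exponential responses with different regression parameters.
y1 <- rexp(length(x), rate = exp(-(8 - 1.0 * x)))
y2 <- rexp(length(x), rate = exp(-(2 + 0.5 * x)))

fit1 <- glm(y1 ~ x, family = Gamma(link = "log"))
fit2 <- glm(y2 ~ x, family = Gamma(link = "log"))

# With the log link and the dispersion fixed at 1 the working weights are 1,
# so the covariance matrix should be (X^T X)^{-1} for any response.
V  <- solve(crossprod(X))
V1 <- vcov(fit1, dispersion = 1)
V2 <- vcov(fit2, dispersion = 1)

# Largest absolute deviations from (X^T X)^{-1}.
max(abs(V1 - V))
max(abs(V2 - V))
```

Both deviations should be at the level of floating-point rounding error. Since the standard errors are then the same in every bootstrap sample, dividing by them and multiplying back changes nothing, which is why the first two bootstrap intervals coincide.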

(b)

> summary(leukemiaglm, dispersion = 1)

Call:
glm(formula = y ~ x, family = Gamma(link = "log"), data = leukemia)

Deviance Residuals:
Min 1Q Median 3Q Max

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                         e-07 ***
x                                        **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 1)

Null deviance: on 16 degrees of freedom
Residual deviance: on 15 degrees of freedom
AIC:

Number of Fisher Scoring iterations: 8

To compute the difference in deviance from the null model we estimate the model with only an intercept and compute the deviances.

> leukemiaglm0 <- glm(y ~ 1, family = Gamma(link = "log"), data = leukemia)
> summary(leukemiaglm0, dispersion = 1)

Call:
glm(formula = y ~ 1, family = Gamma(link = "log"), data = leukemia)

Deviance Residuals:
Min 1Q Median 3Q Max

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)                         <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 1)

Null deviance: on 16 degrees of freedom
Residual deviance: on 16 degrees of freedom
AIC:

Number of Fisher Scoring iterations: 6

> deviance(leukemiaglm0) - deviance(leukemiaglm)
[1]

In R the deviance is by definition the unscaled deviance for the Gamma-family. This means that it equals minus twice the log-likelihood ratio test statistic only up to a scaling factor (the dispersion parameter). In other words, the deviance does not depend upon the dispersion parameter and equals the deviance as if the model were an exponential model.

We can compute the p-value using the $\chi^2$-distribution with one degree of freedom. Another way to do this (which is more convenient for more complicated, successive tests) is to use the anova function. For generalized linear models we need to tell the function which test statistic to use, and we need to be explicit about the dispersion parameter; otherwise it is estimated and used for the $\chi^2$-approximation.

> pchisq(deviance(leukemiaglm0) - deviance(leukemiaglm), 1, lower.tail = FALSE)
[1]
> anova(leukemiaglm, test = "Chisq", dispersion = 1)
Analysis of Deviance Table

Model: Gamma, link: log

Response: y

Terms added sequentially (first to last)

     Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL
x                                     **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

One should note that the p-values computed from the deviance test and from the Wald test (the z-test in the summary table) are different. This is in contrast to ordinary linear models. One can encounter situations where one is significant and the other is not.
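The two p-values can be put side by side in a few lines. This is a minimal sketch on simulated exponential data rather than the leukemia data; the covariate grid, the coefficients, and the names fit, fit0, p_wald and p_dev are hypothetical and chosen only for illustration.

```r
# Simulated exponential data; all values below are hypothetical.
set.seed(2)
x <- seq(3, 5, length.out = 17)
y <- rexp(length(x), rate = exp(-(6 - 0.8 * x)))

fit  <- glm(y ~ x, family = Gamma(link = "log"))
fit0 <- glm(y ~ 1, family = Gamma(link = "log"))

# Wald test: estimate divided by its standard error, dispersion fixed at 1.
z <- coef(fit)["x"] / sqrt(vcov(fit, dispersion = 1)["x", "x"])
p_wald <- 2 * pnorm(abs(z), lower.tail = FALSE)

# Deviance (likelihood ratio) test on one degree of freedom.
p_dev <- pchisq(deviance(fit0) - deviance(fit), df = 1, lower.tail = FALSE)

c(wald = unname(p_wald), deviance = p_dev)
```

The Wald p-value computed this way should agree with the Pr(>|z|) column of summary(fit, dispersion = 1), while the deviance p-value corresponds to the pchisq computation used in the text.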
