Exercise 5.4 Solution


Niels Richard Hansen
University of Copenhagen
May 7, 2010

1  5.4(a)

> leukemia <- data.frame(y = c(65, 156, 100, 134, 16, 108, 121,
+     4, 39, 143, 56, 26, 22, 1, 1, 5, 65), x = c(3.36, 2.88, 3.63,
+     3.41, 3.78, 4.02, 4, 4.23, 3.73, 3.85, 3.97, 4.51, 4.54,
+     5, 5, 4.72, 5))

We consider the exponential regression model as in Exercise 4.2. When we are asked to use the Wald statistic for the construction of confidence intervals, we understand this as the computation of standard confidence intervals based on the estimated standard error. This corresponds to using the combinant

    R = (β̂₁ − β₁)² / ŝe²

with an approximating χ²-distribution with one degree of freedom. In R these computations are easily done using the confint.default function. However, the dispersion parameter cannot be controlled here, and since we use the general Gamma family we do not get exactly what we want.

> leukemiaglm <- glm(y ~ x, family = Gamma(link = "log"), data = leukemia)
> confint.default(leukemiaglm)
                2.5 %     97.5 %
(Intercept)  5.334837 11.6201502
x           -1.868283 -0.3503104

To get exactly what we want under the exponential distributional assumption we need to do the computations by hand.
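As a consistency check of the Wald construction, inverting R ≤ qchisq(0.95, 1) reproduces the familiar estimate ± 1.96 · standard error interval. The helper below is an illustrative sketch (wald.ci is our own name, not part of the exercise):

```r
# Hypothetical helper: invert R = (beta.hat - beta)^2 / se.hat^2 <= qchisq(level, 1)
# to obtain the standard Wald interval beta.hat -/+ z * se.hat.
wald.ci <- function(beta.hat, se.hat, level = 0.95) {
  z <- sqrt(qchisq(level, df = 1))  # same as qnorm(1 - (1 - level)/2), approx. 1.96
  c(lower = beta.hat - z * se.hat, upper = beta.hat + z * se.hat)
}
wald.ci(beta.hat = -1.1093, se.hat = 0.3997)  # roughly the interval for x computed by hand below
```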

> coefficients(leukemiaglm) + t(c(-1.96, 1.96) %*% t(sqrt(diag(vcov(leukemiaglm,
+     dispersion = 1)))))
                 [,1]       [,2]
(Intercept)  5.234071 11.7209168
x           -1.892620 -0.3259742

We then turn to bootstrapping. We do this by hand.

> bootexp <- function(theta, B = 999) {
+     tmp <- replicate(B, rexp(length(theta), theta))
+     return(apply(tmp, 2, function(y) glm(y ~ leukemia$x,
+         family = Gamma(link = "log"))))
+ }

The function above implements the parametric resampling and the reestimation of the models. Then we need to decide on a combinant to use and which kind of interval we want. First we do the resampling using the estimated model. Then we compute statistics corresponding to the combinants

    β̂ᵢ(Y) − βᵢ   and   (β̂ᵢ(Y) − βᵢ) / ŝeᵢ.

Finally, we also compute the parameter estimates used for the percentile interval.

> bootsim <- bootexp(1/fitted(leukemiaglm))
> t1 <- sapply(bootsim, function(m) coefficients(m) - coefficients(leukemiaglm))
> t2 <- sapply(bootsim, function(m) (coefficients(m) - coefficients(leukemiaglm))/
+     sqrt(diag(vcov(m, dispersion = 1))))
> t1.5 <- sapply(bootsim, function(m) coefficients(m))

Now we compute the different confidence intervals.

> coefficients(leukemiaglm) - t(apply(t1, 1, function(t) quantile(t,
+     c(0.975, 0.025))))

                97.5%       2.5%
(Intercept)  5.079363 12.0461146
leukemia$x  -1.942136 -0.2201165

> coefficients(leukemiaglm) - sqrt(diag(vcov(leukemiaglm, dispersion = 1))) *
+     t(apply(t2, 1, function(t) quantile(t, c(0.975, 0.025))))
                97.5%       2.5%
(Intercept)  5.079363 12.0461146
leukemia$x  -1.942136 -0.2201165

> t(apply(t1.5, 1, function(t) quantile(t, c(0.025, 0.975))))
                 2.5%      97.5%
(Intercept)  4.908873 11.8756239
leukemia$x  -1.998477 -0.2764581

Perhaps surprisingly, the two former intervals are identical! This is explained by the fact that the Fisher information does not depend upon the estimated parameters, only on the explanatory variables, and is thus constant. This is in turn due to two model choices that play together: the fixed value of the dispersion parameter and the use of the log-link function. Together these choices imply that the weight matrix becomes the identity matrix, so for the exponential distribution with a log-link function the Fisher information equals XᵀX and the estimated covariance matrix is the constant (XᵀX)⁻¹. The most interesting thing about the second confidence intervals is the quantiles:

> t(apply(t2, 1, function(t) quantile(t, c(0.025, 0.975))))
                 2.5%    97.5%
(Intercept) -2.156517 2.053489
leukemia$x  -2.224873 2.083897

They are generally slightly larger in absolute value (depending a little on the sampling error from the resampling) than the ±1.96 from the normal approximation.
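The constancy of the covariance matrix can be checked directly. A minimal sketch on simulated data (the simulated values are assumptions for illustration, not the leukemia data):

```r
# Sketch (simulated data): for a Gamma GLM with log-link and dispersion fixed
# at 1, the working weights are all 1, so the reported covariance matrix is
# (X^T X)^{-1}, which depends only on the explanatory variables.
set.seed(1)
x <- runif(17, min = 3, max = 5)
y <- rexp(17, rate = exp(-(8 - x)))        # exponential responses
fit <- glm(y ~ x, family = Gamma(link = "log"))
X <- model.matrix(fit)
all.equal(vcov(fit, dispersion = 1), solve(crossprod(X)))
```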

2  5.4(b)

> summary(leukemiaglm, dispersion = 1)

Call:
glm(formula = y ~ x, family = Gamma(link = "log"), data = leukemia)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.9922  -1.2102  -0.2242   0.2102   1.5646

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   8.4775     1.6548   5.123 3.01e-07 ***
x            -1.1093     0.3997  -2.776  0.00551 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 1)

    Null deviance: 26.282  on 16  degrees of freedom
Residual deviance: 19.457  on 15  degrees of freedom
AIC: 173.97

Number of Fisher Scoring iterations: 8

To compute the difference in deviance from the null model we estimate the model with only an intercept and compute the deviances.

> leukemiaglm0 <- glm(y ~ 1, family = Gamma(link = "log"), data = leukemia)
> summary(leukemiaglm0, dispersion = 1)

Call:
glm(formula = y ~ 1, family = Gamma(link = "log"), data = leukemia)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.5103  -1.1120  -0.1074   0.6023   1.0789
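The z value and p-value for x in the dispersion = 1 summary are just the Wald statistic and its normal-approximation p-value, which can be recomputed by hand from the rounded estimate and standard error in the table above:

```r
# Sketch: recompute the Wald z statistic and p-value for the coefficient of x
# from the rounded summary values above.
est <- -1.1093   # Estimate for x
se  <-  0.3997   # Std. Error for x
z   <- est / se
p   <- 2 * pnorm(-abs(z))
c(z = z, p = p)  # close to the reported z value -2.776 and Pr(>|z|) 0.00551
```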

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   4.1347     0.2425   17.05   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for Gamma family taken to be 1)

    Null deviance: 26.282  on 16  degrees of freedom
Residual deviance: 26.282  on 16  degrees of freedom
AIC: 178.09

Number of Fisher Scoring iterations: 6

> deviance(leukemiaglm0) - deviance(leukemiaglm)
[1] 6.825567

In R the deviance is by definition the unscaled deviance for the Gamma family. This means that it equals minus twice the log-likelihood ratio test statistic only up to a scaling factor (the dispersion parameter). In other words, the deviance does not depend upon the dispersion parameter and equals the deviance as if the model were an exponential model. We can compute the p-value using the χ²-distribution with one degree of freedom. Another way to do this (which is more convenient for more complicated, successive tests) is to use the anova function. For generalized linear models we need to tell the function which test statistic to use, and we need to be explicit about the dispersion parameter; otherwise it is estimated and used for the χ²-approximation.

> pchisq(deviance(leukemiaglm0) - deviance(leukemiaglm), 1, lower.tail = FALSE)
[1] 0.008986204
> anova(leukemiaglm, test = "Chisq", dispersion = 1)
Analysis of Deviance Table

Model: Gamma, link: log

Response: y

Terms added sequentially (first to last)

     Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                    16     26.282
x     1   6.8256        15     19.456  0.008986 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

One should note that the p-values computed from the deviance test and from the z-test in the summary table are different. This is in contrast to ordinary linear models, and one can encounter situations where one test is significant and the other is not.
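This discrepancy can be illustrated on simulated data, where both approximate p-values are computed side by side (the simulated values are assumptions for illustration, not the leukemia data):

```r
# Sketch (simulated data): the likelihood-ratio (deviance) p-value and the
# Wald z-test p-value from a dispersion = 1 summary generally differ.
set.seed(2)
x <- runif(17, min = 3, max = 5)
y <- rexp(17, rate = exp(-(8 - x)))
fit1 <- glm(y ~ x, family = Gamma(link = "log"))
fit0 <- glm(y ~ 1, family = Gamma(link = "log"))
p.lrt  <- pchisq(deviance(fit0) - deviance(fit1), df = 1, lower.tail = FALSE)
z      <- coef(summary(fit1, dispersion = 1))["x", "z value"]
p.wald <- 2 * pnorm(-abs(z))
c(LRT = p.lrt, Wald = p.wald)  # two valid but different approximations
```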