STAC51: Categorical data Analysis

Mahinda Samarakoon
April 6, 2016

Table of contents

1 Building and applying logistic regression models (Chap 6)

Model Checking for Logistic Regression

Let's look at the malformation data set again.

> # R code for Example 5.3.3, p176: Alcohol Use and Infant Malformation
> alcohol <- factor(c("0","<1","1-2","3-5",">=6"), levels=c("0","<1","1-2","3-5",">=6"))
> present <- c(48,38,5,1,1)
> absent <- c(17066,14464,788,126,37)
> n <- present+absent
> #-------------------------------------------------------------
> scores <- c(0,.5,1.5,4,7)
> malformation <- data.frame(present, absent, n, scores)
> malformation
  present absent     n scores
1      48  17066 17114    0.0
2      38  14464 14502    0.5
3       5    788   793    1.5
4       1    126   127    4.0
5       1     37    38    7.0

> linearlogitmodel <- glm(cbind(present, absent) ~ scores, family=binomial)
> summary(linearlogitmodel)

Call:
glm(formula = cbind(present, absent) ~ scores, family = binomial)

Deviance Residuals:
      1        2        3        4        5
 0.5921  -0.8801   0.8865  -0.1449   0.1291

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -5.9605     0.1154 -51.637   <2e-16 ***
scores        0.3166     0.1254   2.523   0.0116 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.2020  on 4  degrees of freedom
Residual deviance: 1.9487  on 3  degrees of freedom

> # Another way: model the proportion present/n, weighted by the group size n
> linearlogitmodel2 <- glm(formula = present/n ~ scores, weights = n, family = binomial)
> summary(linearlogitmodel2)

Call:
glm(formula = present/n ~ scores, family = binomial, weights = n)

Deviance Residuals:
      1        2        3        4        5
 0.5921  -0.8801   0.8865  -0.1449   0.1291

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -5.9605     0.1154 -51.637   <2e-16 ***
scores        0.3166     0.1254   2.523   0.0116 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.2020  on 4  degrees of freedom
Residual deviance: 1.9487  on 3  degrees of freedom

> # Another way: model the proportion absent/n, so all signs are reversed
> linearlogitmodel3 <- glm(formula = absent/n ~ scores, weights = n, family = binomial)
> summary(linearlogitmodel3)

Call:
glm(formula = absent/n ~ scores, family = binomial, weights = n)

Deviance Residuals:
      1        2        3        4        5
-0.5921   0.8801  -0.8865   0.1449  -0.1291

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   5.9605     0.1154  51.637   <2e-16 ***
scores       -0.3166     0.1254  -2.523   0.0116 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.2020  on 4  degrees of freedom
Residual deviance: 1.9487  on 3  degrees of freedom
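The three specifications above describe the same fitted model. A minimal sketch, assuming only the data vectors from the malformation example (the object names `m1`, `m2`, `m3` are illustrative and do not appear in the slides), checks that the first two forms give the same coefficients and that modelling `absent/n` only flips their signs:

```r
# Data from the alcohol / malformation example
present <- c(48, 38, 5, 1, 1)
absent  <- c(17066, 14464, 788, 126, 37)
n       <- present + absent
scores  <- c(0, 0.5, 1.5, 4, 7)

# Three equivalent ways to specify the grouped binomial model
m1 <- glm(cbind(present, absent) ~ scores, family = binomial)
m2 <- glm(present / n ~ scores, weights = n, family = binomial)
m3 <- glm(absent / n ~ scores, weights = n, family = binomial)

# Compare coefficients and residual deviances
same.12    <- isTRUE(all.equal(coef(m1), coef(m2), tolerance = 1e-6))
flipped.13 <- isTRUE(all.equal(coef(m1), -coef(m3), tolerance = 1e-6))
same.dev   <- isTRUE(all.equal(deviance(m1), deviance(m3), tolerance = 1e-6))
print(c(same.12 = same.12, flipped.13 = flipped.13, same.dev = same.dev))
```

The residual deviance is unchanged because swapping the roles of success and failure does not change the likelihood.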

Model Checking for Logistic Regression

This data set can be viewed as a 5 × 2 contingency table. The residual for each cell is

residual = observed cell count - estimated cell count from the fitted model.

Example: the predicted probability at scores = 4.0 is

$$\hat\pi = \frac{\exp(-5.9605 + 0.3166 \times 4.0)}{1 + \exp(-5.9605 + 0.3166 \times 4.0)} = 0.009066.$$

There are 127 mothers with score = 4.0, so the predicted number of babies with malformation present is $127 \times 0.009066 = 1.15$, and the estimated number of absences is $127 - 1.15 = 125.85$.

residual for the number present: $1 - 1.15 = -0.15$
residual for the number absent: $126 - 125.85 = 0.15$

We calculate only one of these, usually the residual for the number present (i.e., successes).
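The hand calculation above can be reproduced for all five rows at once. This is a minimal sketch, assuming the data vectors from the malformation example; the names `fit`, `pi.hat`, `fit.present`, `fit.absent`, `raw.present` and `raw.absent` are illustrative and not from the slides.

```r
# Data from the alcohol / malformation example
present <- c(48, 38, 5, 1, 1)
absent  <- c(17066, 14464, 788, 126, 37)
n       <- present + absent
scores  <- c(0, 0.5, 1.5, 4, 7)

fit <- glm(cbind(present, absent) ~ scores, family = binomial)

pi.hat      <- fitted(fit)         # estimated P(malformation) in each row
fit.present <- n * pi.hat          # estimated count with malformation present
fit.absent  <- n * (1 - pi.hat)    # estimated count with malformation absent
raw.present <- present - fit.present
raw.absent  <- absent - fit.absent

print(round(cbind(scores, pi.hat, fit.present, fit.absent,
                  raw.present, raw.absent), 4))
```

In each row the two raw residuals are equal in size and opposite in sign, which is why only one of them needs to be reported.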

Pearson Residuals

$$e_i = \frac{\text{Observed} - \text{Predicted}}{\sqrt{\widehat{\mathrm{Var}}(\text{Observed})}} = \frac{y_i - n_i \hat\pi_i}{\sqrt{n_i \hat\pi_i (1 - \hat\pi_i)}}.$$

The standardized residual is

$$r_i = \frac{e_i}{\sqrt{1 - h_i}},$$

where $h_i$ is the $i$th diagonal element of the hat matrix

$$H = \hat W^{1/2} X (X^T \hat W X)^{-1} X^T \hat W^{1/2}, \tag{1}$$

$X$ is the design matrix, and $\hat W = \mathrm{diag}(n_i \hat\pi_i (1 - \hat\pi_i))$.

Values of $|r_i| > 3$ (or 2) may indicate an outlier or an influential explanatory variable pattern.
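To make the hat-matrix formula concrete, the sketch below builds $H$ directly from equation (1) for the malformation data and compares the result with R's built-in `hatvalues()` and `rstandard()`. It assumes the data vectors from the earlier example; the names `fit`, `w`, `H`, `h`, `e` and `r` are illustrative.

```r
# Data from the alcohol / malformation example
present <- c(48, 38, 5, 1, 1)
absent  <- c(17066, 14464, 788, 126, 37)
n       <- present + absent
scores  <- c(0, 0.5, 1.5, 4, 7)

fit    <- glm(cbind(present, absent) ~ scores, family = binomial)
pi.hat <- unname(fitted(fit))

X      <- model.matrix(fit)            # design matrix (intercept, scores)
w      <- n * pi.hat * (1 - pi.hat)    # diagonal of W-hat
W      <- diag(w)
W.half <- diag(sqrt(w))                # W-hat^(1/2); W-hat is diagonal

# Hat matrix from equation (1)
H <- W.half %*% X %*% solve(t(X) %*% W %*% X) %*% t(X) %*% W.half
h <- diag(H)

e <- (present - n * pi.hat) / sqrt(w)  # Pearson residuals
r <- e / sqrt(1 - h)                   # standardized Pearson residuals

print(round(cbind(h, e, r), 4))
```

The leverages sum to the number of estimated parameters (here 2). The built-in values can differ from the manual ones in the last few digits because R takes its weights from the final fitting iteration.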

Pearson Residuals: Example

$$e_i = \frac{\text{Observed} - \text{Predicted}}{\sqrt{\widehat{\mathrm{Var}}(\text{Observed})}} = \frac{y_i - n_i \hat\pi_i}{\sqrt{n_i \hat\pi_i (1 - \hat\pi_i)}}.$$

Example:

$$e_4 = \frac{y_4 - n_4 \hat\pi_4}{\sqrt{n_4 \hat\pi_4 (1 - \hat\pi_4)}} = \frac{1 - 127 \times 0.009066}{\sqrt{127 \times 0.009066 \times (1 - 0.009066)}} = -0.14.$$

R code

> # Residuals
> pear.res <- resid(linearlogitmodel, type="pearson")
> pear.res
         1          2          3          4          5
 0.6008415 -0.8604371  0.9557511 -0.1416210  0.1319486

Pearson Residuals

An overall measure of goodness of fit is the sum of squares of the Pearson residuals, called the Pearson statistic:

$$\chi^2 = \sum_{i=1}^{N} e_i^2.$$

Its distribution can be approximated by a $\chi^2_{N-(k+1)}$ distribution, where $k$ is the number of $\beta$s in the model.

The Pearson statistic tests the hypotheses

$H_0$: $\mathrm{logit}(\pi_i) = \alpha + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i}$, $i = 1, \ldots, N$
$H_1$: saturated model ($N$ parameters)

The saturated model is the model in which a separate parameter is estimated for each explanatory variable group ($N$ different parameters). This means $\pi_i$ is estimated by the sample proportion $y_i / n_i$.
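As a hedged illustration, the Pearson statistic and its approximate p-value for the linear logit model on the malformation data can be computed as follows. The names `fit`, `X2`, `df` and `p.value` are illustrative; with fitted counts as small as those in the last rows of this table, the chi-square approximation should be read with caution.

```r
# Data from the alcohol / malformation example
present <- c(48, 38, 5, 1, 1)
absent  <- c(17066, 14464, 788, 126, 37)
scores  <- c(0, 0.5, 1.5, 4, 7)

fit <- glm(cbind(present, absent) ~ scores, family = binomial)

e  <- resid(fit, type = "pearson")      # Pearson residuals e_i
X2 <- sum(e^2)                          # Pearson statistic
df <- length(e) - length(coef(fit))     # N - (k + 1) = 5 - 2
p.value <- pchisq(X2, df = df, lower.tail = FALSE)

print(c(X2 = X2, df = df, p.value = p.value))
```

A large p-value here means there is no evidence against the linear logit model relative to the saturated model.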

Deviance Residuals

Deviance residuals are defined by $\sqrt{d_i}\,\mathrm{sign}(y_i - n_i \hat\pi_i)$, where

$$d_i = 2\left( y_i \log\left(\frac{y_i}{n_i \hat\pi_i}\right) + (n_i - y_i) \log\left(\frac{n_i - y_i}{n_i - n_i \hat\pi_i}\right) \right).$$

Example: for the alcohol data,

$$d_4 = 2\left( y_4 \log\left(\frac{y_4}{n_4 \hat\pi_4}\right) + (n_4 - y_4) \log\left(\frac{n_4 - y_4}{n_4 - n_4 \hat\pi_4}\right) \right) = 2\left( 1 \cdot \log\left(\frac{1}{127 \times 0.009066}\right) + (127 - 1) \log\left(\frac{127 - 1}{127 - 127 \times 0.009066}\right) \right) \approx 0.0210201028,$$

and the deviance residual is

$$\sqrt{d_4}\,\mathrm{sign}(y_4 - n_4 \hat\pi_4) = \sqrt{0.0210201028} \times (-1) = -0.14498.$$
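The deviance residual formula can be checked numerically. The sketch below evaluates $d_i$ by hand for every row of the malformation data and compares the signed square roots with `resid(..., type = "deviance")`. The names `fit`, `mu`, `d` and `dev.manual` are illustrative, and the direct formula relies on every $y_i$ and $n_i - y_i$ being positive, as they are here.

```r
# Data from the alcohol / malformation example
present <- c(48, 38, 5, 1, 1)
absent  <- c(17066, 14464, 788, 126, 37)
n       <- present + absent
scores  <- c(0, 0.5, 1.5, 4, 7)

fit <- glm(cbind(present, absent) ~ scores, family = binomial)
mu  <- n * unname(fitted(fit))          # fitted counts n_i * pi.hat_i

# d_i from the definition (valid here because 0 < y_i < n_i in every row)
d <- 2 * (present * log(present / mu) +
          (n - present) * log((n - present) / (n - mu)))

dev.manual <- sign(present - mu) * sqrt(d)
dev.R      <- unname(resid(fit, type = "deviance"))

print(round(cbind(d, dev.manual, dev.R), 6))
```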

Deviance Residuals using R

R code

> # Residuals
> dev.res <- resid(linearlogitmodel, type="deviance")
> dev.res
         1          2          3          4          5
 0.5921323 -0.8801096  0.8864796 -0.1448759  0.1291218

Likelihood Ratio Test of Goodness of Fit of the Model

The LRT uses the test statistic

$$G^2 = \sum_{i=1}^{N} d_i, \quad \text{where} \quad d_i = 2\left( y_i \log\left(\frac{y_i}{n_i \hat\pi_i}\right) + (n_i - y_i) \log\left(\frac{n_i - y_i}{n_i - n_i \hat\pi_i}\right) \right).$$

The distribution of $G^2$ can be approximated by a $\chi^2_{N-(k+1)}$ distribution.
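Since each $d_i$ is the square of a deviance residual, $G^2$ is the residual deviance that `summary()` reports. A minimal sketch for the malformation data, with illustrative names `fit`, `G2`, `df` and `p.value`:

```r
# Data from the alcohol / malformation example
present <- c(48, 38, 5, 1, 1)
absent  <- c(17066, 14464, 788, 126, 37)
scores  <- c(0, 0.5, 1.5, 4, 7)

fit <- glm(cbind(present, absent) ~ scores, family = binomial)

G2 <- sum(resid(fit, type = "deviance")^2)  # sum of the d_i
df <- df.residual(fit)                      # N - (k + 1)
p.value <- pchisq(G2, df = df, lower.tail = FALSE)

# G2 should agree with the residual deviance stored in the fit
print(c(G2 = G2, deviance = deviance(fit), df = df, p.value = p.value))
```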

Logistic Regression Diagnostics: Example: Heart disease data, p217

> BP<-factor(c("<117","117-126","127-136","137-146","147-156","157-166","167-186
> # Logistic Regression Diagnostics
> # Coronary Heart Disease and Blood Pressure example 216
> CHD <- c(3,17,12,16,12,8,16,8)
> n <- c(156,252,284,271,139,85,99,43)
> structure(cbind(n, CHD), dimnames =
+ list(BP, c("n", "CHD")))
          n CHD
<117    156   3
117-126 252  17
127-136 284  12
137-146 271  16
147-156 139  12
157-166  85   8
167-186  99  16
>186     43   8

> # Independence Model
> reschd <- glm(CHD/n ~ 1, family=binomial, weights=n)
> summary(reschd)

Call:
glm(formula = CHD/n ~ 1, family = binomial, weights = n)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.8853  -0.9877   0.3281   1.2792   3.1269

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.5987     0.1081  -24.05   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 30.023  on 7  degrees of freedom
Residual deviance: 30.023  on 7  degrees of freedom

> reschd$deviance
[1] 30.02257
> pred.indep <- n*predict(reschd, type="response")
> dev.indep <- resid(reschd, type="deviance")
> pear.indep <- resid(reschd, type="pearson")
> pear.std.indep <- resid(reschd, type="pearson")/sqrt(1-lm.influence(reschd)$hat)
> structure(cbind(pred.indep, dev.indep, pear.indep, pear.std.indep), dimnames =
+ list(BP, c("fitted", "deviance resid", "pearson resid", "pearson std resid")))
           fitted deviance resid pearson resid pearson std resid
<117    10.799097     -2.8852550    -2.4599611        -2.6184346
117-126 17.444695     -0.1107980    -0.1103592        -0.1225923
127-136 19.659895     -1.9213176    -1.7906464        -2.0193620
137-146 18.759970     -0.6765040    -0.6604895        -0.7402622
147-156  9.622272      0.7670346     0.7945128         0.8396338
157-166  5.884123      0.8603984     0.9041221         0.9345002
167-186  6.853273      3.1269309     3.6215487         3.7644737
>186     2.976674      2.5357746     3.0178895         3.0679293


> # Linear Logit Model:
> scores <- c(seq(from=111.5,to=161.5,by=10),176.5,191.5)
> resll <- glm(CHD/n ~ scores, family=binomial, weights=n)
> summary(resll)

Call:
glm(formula = CHD/n ~ scores, family = binomial, weights = n)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.082033   0.724320  -8.397  < 2e-16 ***
scores       0.024338   0.004843   5.025 5.03e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 30.0226  on 7  degrees of freedom
Residual deviance:  5.9092  on 6  degrees of freedom

> pred.ll <- n*predict(resll, type="response")
> dev.ll <- resid(resll, type = "deviance")
> pear.ll <- resid(resll, type = "pearson")
> pear.std.ll <- resid(resll, type = "pearson")/sqrt(1 - lm.influence(resll)$hat)
> structure(cbind(pred.ll, dev.ll, pear.ll, pear.std.ll), dimnames =
+ list(as.character(scores), c("fitted", "deviance resid", "pearson resid", "pearson std res
         fitted deviance resid pearson resid pearson std resid
111.5  5.194858     -1.0616803    -0.9794311        -1.1057850
121.5 10.606750      1.8501114     2.0057103         2.3746058
131.5 15.072724     -0.8419625    -0.8133348        -0.9452701
141.5 18.081604     -0.5162271    -0.5067270        -0.5727440
151.5 11.616355      0.1170033     0.1175833         0.1260886
161.5  8.856985     -0.3087740    -0.3042459        -0.3260730
176.5 14.208764      0.5049655     0.5134721         0.6519547
191.5  8.361960     -0.1402441    -0.1394648        -0.1773473
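The two residual tables can be screened with the rule of thumb $|r_i| > 2$ from the Pearson residuals slide. The sketch below assumes the heart disease data as given above and uses `rstandard(..., type = "pearson")` in place of the manual division by `sqrt(1 - hat)`; the names `r.indep`, `r.ll`, `flag.indep` and `flag.ll` are illustrative.

```r
# Heart disease data from the slides
CHD    <- c(3, 17, 12, 16, 12, 8, 16, 8)
n      <- c(156, 252, 284, 271, 139, 85, 99, 43)
scores <- c(seq(from = 111.5, to = 161.5, by = 10), 176.5, 191.5)

reschd <- glm(CHD / n ~ 1,      family = binomial, weights = n)  # independence
resll  <- glm(CHD / n ~ scores, family = binomial, weights = n)  # linear logit

# Standardized Pearson residuals for both models
r.indep <- rstandard(reschd, type = "pearson")
r.ll    <- rstandard(resll,  type = "pearson")

# Blood pressure groups whose standardized residual exceeds 2 in absolute value
flag.indep <- which(abs(r.indep) > 2)
flag.ll    <- which(abs(r.ll) > 2)

print(round(cbind(scores, r.indep, r.ll), 3))
print(list(independence = unname(flag.indep), linear.logit = unname(flag.ll)))
```

Under the independence model four groups are flagged, including the two highest blood pressure groups. Under the linear logit model only the second group (score 121.5) exceeds 2.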


Strategies in model selection p207

- What explanatory variables should be in the model?
- Should interactions or quadratic terms be included?

Step 1: Make a list of candidate variables.

- Fit all possible one-variable logistic regression models.
- Perform a Wald test or LRT to determine whether each variable is important ($H_0: \beta = 0$ vs. $H_a: \beta \neq 0$ for each variable). Use a larger than normal $\alpha$ level for the tests.
- An LRT is generally the preferred way to test model parameters in a logistic regression model. The $\chi^2$ approximation for the LRT statistic is usually better for smaller sample sizes than the standard normal approximation for a Wald statistic.
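As an illustration of the two tests for a single variable, the sketch below reuses the heart disease data from the diagnostics example, with blood pressure score as the one candidate variable. The LRT statistic is the drop in deviance from the intercept-only model to the one-variable model. The names `lrt.stat`, `lrt.df`, `lrt.p`, `wald.z` and `wald.p` are illustrative.

```r
# Heart disease data from the diagnostics example
CHD    <- c(3, 17, 12, 16, 12, 8, 16, 8)
n      <- c(156, 252, 284, 271, 139, 85, 99, 43)
scores <- c(seq(from = 111.5, to = 161.5, by = 10), 176.5, 191.5)

null.fit <- glm(CHD / n ~ 1,      family = binomial, weights = n)
one.fit  <- glm(CHD / n ~ scores, family = binomial, weights = n)

# Likelihood ratio test of H0: beta = 0
lrt.stat <- deviance(null.fit) - deviance(one.fit)
lrt.df   <- df.residual(null.fit) - df.residual(one.fit)
lrt.p    <- pchisq(lrt.stat, df = lrt.df, lower.tail = FALSE)

# Wald test of the same hypothesis, read from the coefficient table
wald.z <- coef(summary(one.fit))["scores", "z value"]
wald.p <- coef(summary(one.fit))["scores", "Pr(>|z|)"]

print(c(lrt.stat = lrt.stat, lrt.df = lrt.df, lrt.p = lrt.p,
        wald.z = wald.z, wald.p = wald.p))
```

The same LRT is available through `anova(null.fit, one.fit, test = "Chisq")`.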

Step 2: Fit a logistic regression model with all the variables found in Step 1 and perform backward elimination.

- Do the backward elimination in a similar manner as in ordinary least squares regression.
- Perform a Wald test or LRT to determine whether each variable is important. The LRT is performed in a similar manner as discussed earlier.
- Continue this procedure until no more variables can be dropped.
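One backward-elimination step can be sketched with `drop1()`, which reports an LRT for removing each term in turn. The slides give no multi-variable data set, so the data below are simulated purely for illustration: the variables `x1`, `x2`, `x3`, the seed, and the true coefficients are all assumptions, not course data.

```r
# SIMULATED data for illustration only (not from the slides):
# only x1 truly affects the response.
set.seed(51)
N   <- 500
x1  <- rnorm(N)
x2  <- rnorm(N)
x3  <- rnorm(N)
y   <- rbinom(N, size = 1, prob = plogis(-1 + 1.0 * x1))
dat <- data.frame(y, x1, x2, x3)

# Start from the model containing all candidate variables
full <- glm(y ~ x1 + x2 + x3, family = binomial, data = dat)

# LRT for dropping each variable, one at a time
d1 <- drop1(full, test = "Chisq")
print(d1)

# The variable with the largest p-value is the candidate for removal
pvals        <- d1[-1, "Pr(>Chi)"]        # first row is "<none>"
names(pvals) <- rownames(d1)[-1]
weakest      <- names(which.max(pvals))

# Refit without it; repeat drop1() on `reduced` until nothing can be dropped
reduced <- update(full, as.formula(paste(". ~ . -", weakest)))
print(formula(reduced))
```

Whether the weakest variable is actually removed depends on the chosen $\alpha$ level, which this sketch leaves to the analyst.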

Step 3: Determine whether quadratic and/or interaction terms are needed in the model.

- This is usually done by performing hypothesis tests for the intended quadratic or interaction terms.
- If an interaction or quadratic term is included in the model, one should also include the corresponding lower-order terms, just like in regular regression.
- Perform a residual analysis of the selected model and make necessary improvements to the model.
- Once the final model that satisfies all of the model assumptions is found, interpret the model and make inferences to the population.
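A test for a quadratic term can be sketched on the heart disease data from the diagnostics example. The score is centered first (an illustrative choice that keeps the squared term numerically well behaved and does not change the fit of the linear model), and the linear term is kept alongside the quadratic one. The names `sc`, `lin`, `quad` and `cmp` are illustrative.

```r
# Heart disease data from the diagnostics example
CHD    <- c(3, 17, 12, 16, 12, 8, 16, 8)
n      <- c(156, 252, 284, 271, 139, 85, 99, 43)
scores <- c(seq(from = 111.5, to = 161.5, by = 10), 176.5, 191.5)

sc <- scores - mean(scores)   # centered blood pressure score

lin  <- glm(CHD / n ~ sc,           family = binomial, weights = n)
quad <- glm(CHD / n ~ sc + I(sc^2), family = binomial, weights = n)

# LRT of H0: coefficient of the quadratic term is 0
cmp <- anova(lin, quad, test = "Chisq")
print(cmp)
```

The second row of the table gives the drop in deviance on 1 degree of freedom and its chi-square p-value, which is compared with the chosen $\alpha$ level to decide whether the quadratic term is kept.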

Strategies in model selection: Example

- Example on web
- Skip Chapter 7