STAC51: Categorical Data Analysis

STAC51: Categorical Data Analysis
Mahinda Samarakoon
April 6, 2016

Table of contents

1 Building and applying logistic regression models (Chap 6)

Model Checking for Logistic Regression

Let's look at the malformation data set again.

> # R code for Example 5.3.3 p176: Alcohol Use and Infant Malformation
> alcohol <- factor(c("0", "<1", "1-2", "3-5", ">=6"),
+                   levels = c("0", "<1", "1-2", "3-5", ">=6"))
> present <- c(48, 38, 5, 1, 1)
> absent  <- c(17066, 14464, 788, 126, 37)
> n <- present + absent
> # -------------------------------------------------------------
> scores <- c(0, .5, 1.5, 4, 7)
> malformation <- data.frame(present, absent, n, scores)
> malformation
  present absent     n scores
1      48  17066 17114    0.0
2      38  14464 14502    0.5
3       5    788   793    1.5
4       1    126   127    4.0
5       1     37    38    7.0

Model Checking for Logistic Regression

> linearlogitmodel <- glm(cbind(present, absent) ~ scores, family = binomial)
> summary(linearlogitmodel)

Call:
glm(formula = cbind(present, absent) ~ scores, family = binomial)

Deviance Residuals:
      1        2        3        4        5
 0.5921  -0.8801   0.8865  -0.1449   0.1291

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -5.9605     0.1154 -51.637   <2e-16 ***
scores        0.3166     0.1254   2.523   0.0116 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.2020  on 4  degrees of freedom
Residual deviance: 1.9487  on 3  degrees of freedom

Model Checking for Logistic Regression

> # Another way
> linearlogitmodel2 <- glm(formula = present/n ~ scores, weights = n, family = binomial)
> summary(linearlogitmodel2)

Call:
glm(formula = present/n ~ scores, family = binomial, weights = n)

Deviance Residuals:
      1        2        3        4        5
 0.5921  -0.8801   0.8865  -0.1449   0.1291

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -5.9605     0.1154 -51.637   <2e-16 ***
scores        0.3166     0.1254   2.523   0.0116 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.2020  on 4  degrees of freedom
Residual deviance: 1.9487  on 3  degrees of freedom

Model Checking for Logistic Regression

> # Another way
> linearlogitmodel3 <- glm(formula = absent/n ~ scores, weights = n, family = binomial)
> summary(linearlogitmodel3)

Call:
glm(formula = absent/n ~ scores, family = binomial, weights = n)

Deviance Residuals:
      1        2        3        4        5
-0.5921   0.8801  -0.8865   0.1449  -0.1291

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   5.9605     0.1154  51.637   <2e-16 ***
scores       -0.3166     0.1254  -2.523   0.0116 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.2020  on 4  degrees of freedom
Residual deviance: 1.9487  on 3  degrees of freedom

Model Checking For Logistic Regression

This data set can be considered as a 5 × 2 contingency table. The residual for each cell is

    residual = observed cell count - estimated cell count under the fitted model.

Example: The predicted probability at scores = 4.0 is

    exp(-5.9605 + 0.3166 × 4.0) / (1 + exp(-5.9605 + 0.3166 × 4.0)) = 0.009066.

There are 127 mothers with score = 4.0, so the predicted number of babies with malformation present is 127 × 0.009066 = 1.15, and the estimated number absent is 127 - 1.15 = 125.85.

    residual for the number present = 1 - 1.15 = -0.15
    residual for the number absent  = 126 - 125.85 = 0.15

We usually calculate only one of these: the residuals for the numbers present (i.e., the successes).
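
The arithmetic above can be checked in R. A minimal sketch; it refits the model from the earlier slides so that it is self-contained:

```r
# Malformation data and linear logit model, as on the earlier slides
present <- c(48, 38, 5, 1, 1)
absent  <- c(17066, 14464, 788, 126, 37)
n <- present + absent
scores <- c(0, .5, 1.5, 4, 7)
linearlogitmodel <- glm(cbind(present, absent) ~ scores, family = binomial)

# Fitted probabilities and expected cell counts
pihat <- predict(linearlogitmodel, type = "response")
fitted.present <- n * pihat        # about 1.15 for the scores = 4 group
fitted.absent  <- n * (1 - pihat)

# Raw residuals: observed minus fitted cell counts
present - fitted.present           # about -0.15 for scores = 4
absent  - fitted.absent            # about  0.15 for scores = 4
```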

Pearson Residuals

    e_i = (Observed - Predicted) / sqrt(Var-hat(Observed)) = (y_i - n_i π̂_i) / sqrt(n_i π̂_i (1 - π̂_i))

The standardized residual is

    r_i = e_i / sqrt(1 - h_i),

where h_i is the ith diagonal element of the hat matrix

    H = Ŵ^(1/2) X (X^T Ŵ X)^(-1) X^T Ŵ^(1/2),    (1)

X is the design matrix and Ŵ = diag(n_i π̂_i (1 - π̂_i)).

Values of |r_i| > 3 (or 2) may indicate an outlier or an influential explanatory-variable pattern.
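
In R the hat-matrix diagonal h_i and the standardized residuals can be obtained without forming H explicitly; a sketch using the malformation model from the earlier slides:

```r
# Refit the malformation model so the sketch is self-contained
present <- c(48, 38, 5, 1, 1)
absent  <- c(17066, 14464, 788, 126, 37)
scores  <- c(0, .5, 1.5, 4, 7)
fit <- glm(cbind(present, absent) ~ scores, family = binomial)

e <- resid(fit, type = "pearson")  # Pearson residuals e_i
h <- hatvalues(fit)                # diagonal elements h_i of the hat matrix
r <- e / sqrt(1 - h)               # standardized residuals r_i

# rstandard() computes the same standardized residuals in one call
all.equal(r, rstandard(fit, type = "pearson"))
```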

Pearson Residuals: Example

Example:

    e_4 = (y_4 - n_4 π̂_4) / sqrt(n_4 π̂_4 (1 - π̂_4))
        = (1 - 127 × 0.009066) / sqrt(127 × 0.009066 × (1 - 0.009066)) = -0.14.

R code:

> # Residuals
> pear.res <- resid(linearlogitmodel, type = "pearson")
> pear.res
         1          2          3          4          5
 0.6008415 -0.8604371  0.9557511 -0.1416210  0.1319486

Pearson Residuals

An overall measure of goodness of fit is the sum of squares of the Pearson residuals, called the Pearson statistic:

    χ² = Σ_{i=1}^{N} e_i²

This statistic can be approximated by a χ²_{N-(k+1)} distribution, where k is the number of βs in the model. The Pearson statistic tests the hypotheses

    H_0: logit(π_i) = α + β_1 x_{1,i} + ... + β_k x_{k,i},  i = 1, ..., N
    H_1: saturated model (N parameters)

The saturated model is defined as a model where a parameter is estimated for EACH explanatory-variable group (N different parameters). This means π_i is estimated by the sample proportion, y_i / n_i.
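
The Pearson statistic and its p-value can be computed directly from the fitted model; a sketch for the malformation example (N = 5 groups, k = 1 slope, so df = 3):

```r
# Refit the malformation model so the sketch is self-contained
present <- c(48, 38, 5, 1, 1)
absent  <- c(17066, 14464, 788, 126, 37)
scores  <- c(0, .5, 1.5, 4, 7)
fit <- glm(cbind(present, absent) ~ scores, family = binomial)

X2 <- sum(resid(fit, type = "pearson")^2)  # Pearson statistic: sum of e_i^2
df <- df.residual(fit)                     # N - (k + 1) = 5 - 2 = 3
pchisq(X2, df, lower.tail = FALSE)         # large p-value: no evidence of lack of fit
```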

Deviance Residuals

Deviance residuals are defined by √d_i × sign(y_i - n_i π̂_i), where

    d_i = 2 [ y_i log( y_i / (n_i π̂_i) ) + (n_i - y_i) log( (n_i - y_i) / (n_i - n_i π̂_i) ) ]

Example: For the alcohol data,

    d_4 = 2 [ y_4 log( y_4 / (n_4 π̂_4) ) + (n_4 - y_4) log( (n_4 - y_4) / (n_4 - n_4 π̂_4) ) ]
        = 2 [ 1 × log( 1 / (127 × 0.009066) ) + (127 - 1) × log( (127 - 1) / (127 - 127 × 0.009066) ) ]
        ≈ 0.0210201028

and the deviance residual is

    √d_4 × sign(y_4 - n_4 π̂_4) = √0.0210201028 × (-1) = -0.14498.

Deviance Residuals using R

R code:

> # Residuals
> dev.res <- resid(linearlogitmodel, type = "deviance")
> dev.res
         1          2          3          4          5
 0.5921323 -0.8801096  0.8864796 -0.1448759  0.1291218

Likelihood Ratio Test of Goodness of Fit of the Model

The LRT uses the test statistic

    G² = Σ_{i=1}^{N} d_i,

where

    d_i = 2 [ y_i log( y_i / (n_i π̂_i) ) + (n_i - y_i) log( (n_i - y_i) / (n_i - n_i π̂_i) ) ].

G² can be approximated by a χ²_{N-(k+1)} distribution.
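
In R, G² is simply the residual deviance of the fitted model, so no hand computation is needed; a sketch for the malformation example:

```r
# Refit the malformation model so the sketch is self-contained
present <- c(48, 38, 5, 1, 1)
absent  <- c(17066, 14464, 788, 126, 37)
scores  <- c(0, .5, 1.5, 4, 7)
fit <- glm(cbind(present, absent) ~ scores, family = binomial)

G2 <- sum(resid(fit, type = "deviance")^2)        # sum of the d_i
all.equal(G2, deviance(fit))                      # equals the residual deviance, 1.9487
pchisq(G2, df.residual(fit), lower.tail = FALSE)  # p-value of the goodness-of-fit LRT
```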

Logistic Regression Diagnostics: Example: Heart disease data, p217

> # Logistic Regression Diagnostics
> # Coronary Heart Disease and Blood Pressure example, p216
> BP <- factor(c("<117","117-126","127-136","137-146","147-156","157-166","167-186",">186"))
> CHD <- c(3,17,12,16,12,8,16,8)
> n <- c(156,252,284,271,139,85,99,43)
> structure(cbind(n, CHD), dimnames =
+   list(BP, c("n", "CHD")))
          n CHD
<117    156   3
117-126 252  17
127-136 284  12
137-146 271  16
147-156 139  12
157-166  85   8
167-186  99  16
>186     43   8

Logistic Regression Diagnostics: Example: Heart disease data, p217

> # Independence Model
> reschd <- glm(CHD/n ~ 1, family = binomial, weights = n)
> summary(reschd)

Call:
glm(formula = CHD/n ~ 1, family = binomial, weights = n)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.8853  -0.9877   0.3281   1.2792   3.1269

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -2.5987     0.1081  -24.05   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 30.023  on 7  degrees of freedom
Residual deviance: 30.023  on 7  degrees of freedom

Logistic Regression Diagnostics: Example: Heart disease data, p217

> reschd$deviance
[1] 30.02257
> pred.indep <- n * predict(reschd, type = "response")
> dev.indep  <- resid(reschd, type = "deviance")
> pear.indep <- resid(reschd, type = "pearson")
> pear.std.indep <- resid(reschd, type = "pearson") / sqrt(1 - lm.influence(reschd)$hat)
> structure(cbind(pred.indep, dev.indep, pear.indep, pear.std.indep), dimnames =
+   list(BP, c("fitted", "deviance resid", "pearson resid", "pearson std resid")))
           fitted deviance resid pearson resid pearson std resid
<117    10.799097     -2.8852550    -2.4599611        -2.6184346
117-126 17.444695     -0.1107980    -0.1103592        -0.1225923
127-136 19.659895     -1.9213176    -1.7906464        -2.0193620
137-146 18.759970     -0.6765040    -0.6604895        -0.7402622
147-156  9.622272      0.7670346     0.7945128         0.8396338
157-166  5.884123      0.8603984     0.9041221         0.9345002
167-186  6.853273      3.1269309     3.6215487         3.7644737
>186     2.976674      2.5357746     3.0178895         3.0679293

Logistic Regression Diagnostics: Example: Heart disease data, p217

[Figure: residual plot for the independence model; not reproduced in this transcript.]

Logistic Regression Diagnostics: Example: Heart disease data, p217

> # Linear Logit Model:
> scores <- c(seq(from = 111.5, to = 161.5, by = 10), 176.5, 191.5)
> resll <- glm(CHD/n ~ scores, family = binomial, weights = n)
> summary(resll)

Call:
glm(formula = CHD/n ~ scores, family = binomial, weights = n)

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -6.082033   0.724320  -8.397  < 2e-16 ***
scores       0.024338   0.004843   5.025 5.03e-07 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 30.0226  on 7  degrees of freedom
Residual deviance:  5.9092  on 6  degrees of freedom

Logistic Regression Diagnostics: Example: Heart disease data, p217

> pred.ll <- n * predict(resll, type = "response")
> dev.ll  <- resid(resll, type = "deviance")
> pear.ll <- resid(resll, type = "pearson")
> pear.std.ll <- resid(resll, type = "pearson") / sqrt(1 - lm.influence(resll)$hat)
> structure(cbind(pred.ll, dev.ll, pear.ll, pear.std.ll), dimnames =
+   list(as.character(scores), c("fitted", "deviance resid", "pearson resid", "pearson std resid")))
          fitted deviance resid pearson resid pearson std resid
111.5   5.194858     -1.0616803    -0.9794311        -1.1057850
121.5  10.606750      1.8501114     2.0057103         2.3746058
131.5  15.072724     -0.8419625    -0.8133348        -0.9452701
141.5  18.081604     -0.5162271    -0.5067270        -0.5727440
151.5  11.616355      0.1170033     0.1175833         0.1260886
161.5   8.856985     -0.3087740    -0.3042459        -0.3260730
176.5  14.208764      0.5049655     0.5134721         0.6519547
191.5   8.361960     -0.1402441    -0.1394648        -0.1773473

Logistic Regression Diagnostics: Example: Heart disease data, p217

[Figure: residual plot for the linear logit model; not reproduced in this transcript.]

Strategies in model selection p207

What explanatory variables should be in the model?
Should interactions or quadratic terms be included?

Strategies in model selection p207

Step 1: Make a list of candidate variables.

Fit all possible one-variable logistic regression models. Perform a Wald test or LRT to determine if a variable is important (H_0: β = 0 vs. H_a: β ≠ 0 for each variable). Use a larger-than-normal α level for these tests.

An LRT is generally the preferred way to test model parameters in a logistic regression model: the χ² approximation for the LRT statistic is usually better for smaller sample sizes than the standard normal approximation for a Wald statistic.
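
Step 1 can be automated. A sketch with simulated data; the data frame `dat` and the variable names `x1`-`x3` are hypothetical, not from the course data:

```r
# Hypothetical data for illustration only
set.seed(1)
dat <- data.frame(y  = rbinom(100, 1, 0.4),
                  x1 = rnorm(100), x2 = rnorm(100), x3 = rnorm(100))

# Fit each one-variable logistic model and record its LRT p-value
# (null deviance - residual deviance is chi-square on 1 df under H0: beta = 0)
candidates <- c("x1", "x2", "x3")
pvals <- sapply(candidates, function(v) {
  fit <- glm(reformulate(v, response = "y"), family = binomial, data = dat)
  pchisq(fit$null.deviance - fit$deviance, df = 1, lower.tail = FALSE)
})

# Keep variables passing a deliberately liberal cutoff, e.g. alpha = 0.25
candidates[pvals < 0.25]
```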

Strategies in model selection p207

Step 2: Fit a logistic regression model with all the variables found in Step 1 and perform backward elimination.

Do the backward elimination in a similar manner as in ordinary least squares regression. Perform a Wald test or LRT to determine if each variable is important; the LRT is performed as discussed earlier. Continue this procedure until no more variables can be dropped.
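
The backward-elimination step maps onto `drop1()`, which computes the LRT for removing each term from the current model; a sketch with hypothetical data (the names below are placeholders, not course data):

```r
# Hypothetical data; x1-x3 stand for the variables retained in Step 1
set.seed(2)
dat <- data.frame(y  = rbinom(120, 1, 0.5),
                  x1 = rnorm(120), x2 = rnorm(120), x3 = rnorm(120))
fit <- glm(y ~ x1 + x2 + x3, family = binomial, data = dat)

# LRT for dropping each variable in turn; refit without the least
# significant term and repeat until every remaining term is significant
drop1(fit, test = "LRT")
```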

Strategies in model selection p207

Step 3: Determine if quadratic and/or interaction terms are needed in the model.

This is usually done by performing hypothesis tests for the intended quadratic or interaction terms. If an interaction or quadratic term is included in the model, one should also include the corresponding lower-order terms, just as in regular regression.

Perform a residual analysis of the selected model and make any necessary improvements. Once a final model satisfying all of the model assumptions is found, interpret the model and make inferences about the population.
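
The Step 3 checks can be done with `add1()` and `rstandard()`; a sketch with hypothetical data (the variable names are placeholders):

```r
# Hypothetical data for illustration only
set.seed(3)
dat <- data.frame(y = rbinom(120, 1, 0.5), x1 = rnorm(120), x2 = rnorm(120))
fit <- glm(y ~ x1 + x2, family = binomial, data = dat)

# LRTs for adding a quadratic or an interaction term to the selected model
add1(fit, scope = ~ x1 + x2 + I(x1^2) + x1:x2, test = "LRT")

# Residual analysis: flag standardized Pearson residuals beyond +/- 3
r <- rstandard(fit, type = "pearson")
which(abs(r) > 3)
```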

Strategies in model selection: Example

Example on Web.
Skip Chapter 7.