Analysis of Categorical Data
Nick Jackson
University of Southern California, Department of Psychology
10/11/2013

Overview
  Data Types
  Contingency Tables
  Logit Models
    Binomial
    Ordinal
    Nominal

Things not covered (but still fit into the topic)
  Matched pairs/repeated measures
  McNemar's Chi-Square
  Reliability
  Cohen's Kappa
  ROC
  Poisson (Count) models
  Categorical SEM
  Tetrachoric Correlation
  Bernoulli Trials

Data Types (Levels of Measurement)
  Two broad classes: Discrete/Categorical/Qualitative and Continuous/Quantitative.
  The categorical types are:

  Nominal/Multinomial
    Properties: Values arbitrary (no magnitude); no direction (no ordering)
    Example: Race: 1=AA, 2=Ca, 3=As
    Measures: Mode, relative frequency
  Rank Order/Ordinal
    Properties: Values semi-arbitrary (no magnitude?); have direction (ordering)
    Example: Likert scales (pronounced LICK-urt): 1-5, Strongly Disagree to Strongly Agree
    Measures: Mode, relative frequency, median. Mean?
  Binary/Dichotomous/Binomial
    Properties: 2 levels; special case of Ordinal or Multinomial
    Examples: Gender (Multinomial), Disease (Y/N)
    Measures: Mode, relative frequency. Mean?

Contingency Tables (Code 1.1)
  Often called two-way tables or cross-tabs
  Have dimensions I x J
  Can be used to test hypotheses of association between categorical variables

  2 x 3 table: Gender by Age Group
  Gender    <40 Years   40-50 Years   >50 Years
  Female        25           68           63
  Male         240          223          201

Contingency Tables: Test of Independence
  Chi-Square Test of Independence ($\chi^2$)
    Calculate $\chi^2$
    Determine DF: (I-1) * (J-1)
    Compare to the $\chi^2$ critical value for the given DF

  2 x 3 table: Gender by Age Group
  Gender    <40 Years   40-50 Years   >50 Years   Total
  Female        25           68           63      R1=156
  Male         240          223          201      R2=664
  Total      C1=265       C2=291       C3=264     N=820

  $$\chi^2 = \sum_{i=1}^{n} \frac{(O_i - E_i)^2}{E_i}$$
  Where:
    $O_i$ = observed frequency in cell i
    $E_i$ = expected frequency in cell i; for the cell in row i and column j, $E_{i,j} = \frac{R_i C_j}{N}$
    n = number of cells in the table

Contingency Tables: Test of Independence (Code 1.2)
  Pearson Chi-Square Test of Independence ($\chi^2$)
    $H_0$: No association
    $H_A$: Association... but where, and how?
    Not appropriate when an expected cell frequency ($E_i$) is < 5; use Fisher's exact test instead

  2 x 3 table: Gender by Age Group
  Gender    <40 Years   40-50 Years   >50 Years   Total
  Female        25           68           63      R1=156
  Male         240          223          201      R2=664
  Total      C1=265       C2=291       C3=264     N=820

  Result: $\chi^2$ (df = 2) = 23.39, p < 0.001

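The calculation on this slide can be retraced by hand. The sketch below is not the deck's own Code 1.2; it is a minimal standard-library Python illustration that rebuilds the margins, the expected counts and the statistic from the six observed cells. The closed-form p-value is an added shortcut that holds only for df = 2.

```python
import math

# Observed counts from the 2 x 3 table (rows: Female, Male; columns: age groups).
observed = [
    [25, 68, 63],
    [240, 223, 201],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: E_ij = R_i * C_j / N.
expected = [[r * c / n for c in col_totals] for r in row_totals]

# Pearson chi-square: sum over cells of (O - E)^2 / E.
chi2 = sum(
    (o - e) ** 2 / e
    for obs_row, exp_row in zip(observed, expected)
    for o, e in zip(obs_row, exp_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)

# For df = 2 only, the chi-square upper-tail probability is exp(-x / 2).
p_value = math.exp(-chi2 / 2) if df == 2 else None

print(row_totals, col_totals, n)
print(round(chi2, 2), df, p_value)
```

The statistic comes out at roughly 23.4 on 2 degrees of freedom, in line with the value reported on the slide, and every expected count is well above 5, so the Pearson test is appropriate here.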
Contingency Tables: 2x2

                          Disorder (Outcome)
  Risk Factor/Exposure    Yes      No       Total
  Yes                     a        b        a+b
  No                      c        d        c+d
  Total                   a+c      b+d      a+b+c+d

Contingency Tables: Measures of Association

                  Alcohol Use
  Depression    Yes       No        Total
  Yes           a = 25    c = 20      45
  No            b = 10    d = 45      55
  Total           35        65       100

  Probability
    Depression given alcohol use: $P(D \mid A) = \frac{a}{a+b} = \frac{25}{35} = 0.714$
    Depression given NO alcohol use: $P(D \mid \bar{A}) = \frac{c}{c+d} = \frac{20}{65} = 0.308$
  Odds
    Depression given alcohol use: $\text{Odds}(D \mid A) = \frac{P(D \mid A)}{1 - P(D \mid A)} = \frac{0.714}{1 - 0.714} = 2.5$
    Depression given NO alcohol use: $\text{Odds}(D \mid \bar{A}) = \frac{P(D \mid \bar{A})}{1 - P(D \mid \bar{A})} = \frac{0.308}{1 - 0.308} \approx 0.44$
  Contrasting probabilities: Relative Risk
    $RR = \frac{P(D \mid A)}{P(D \mid \bar{A})} = \frac{0.714}{0.308} = 2.32$
    Individuals who used alcohol were 2.32 times as likely to have depression as those who did not use alcohol.
  Contrasting odds: Odds Ratio
    $OR = \frac{\text{Odds}(D \mid A)}{\text{Odds}(D \mid \bar{A})} = \frac{2.5}{0.444} \approx 5.63$ (exactly $\frac{ad}{bc} = 5.625$)
    The odds of depression were about 5.6 times greater in alcohol users than in nonusers.

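The same quantities can be computed directly from the four cell counts. This is a small illustrative sketch in plain Python; the variable names are chosen here and do not come from the deck.

```python
# Cell counts from the depression-by-alcohol table.
a, b = 25, 10   # alcohol users: depressed, not depressed
c, d = 20, 45   # non-users: depressed, not depressed

def odds(p):
    """Convert a probability into odds."""
    return p / (1 - p)

p_users = a / (a + b)        # P(D | A)
p_nonusers = c / (c + d)     # P(D | not A)

relative_risk = p_users / p_nonusers
odds_ratio = odds(p_users) / odds(p_nonusers)
cross_product = (a * d) / (b * c)   # shortcut form of the same odds ratio

print(round(p_users, 3), round(p_nonusers, 3))
print(round(odds(p_users), 2), round(odds(p_nonusers), 2))
print(round(relative_risk, 2), round(odds_ratio, 3), cross_product)
```

Working from the unrounded probabilities avoids the small drift that appears when 0.714 and 0.308 are divided after rounding.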
Why Odds Ratios?

                  Alcohol Use
  Depression    Yes          No           Total
  Yes           a = 25       c = 20       45
  No            b = 10*i     d = 45*i     55*i
  Total         25 + 10*i    20 + 45*i    45 + 55*i
  (i = 1 to 45)

  [Figure: RR and OR (y-axis "OR / RR", 2 to 6) plotted against the overall probability of depression (x-axis, 0 to 0.5)]

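The plot on this slide can be approximated numerically. The sketch below, a plain-Python illustration rather than the deck's own code, scales the "not depressed" row by i as in the table and recomputes both measures. The odds ratio does not depend on i, while the relative risk moves toward the odds ratio as depression becomes rarer.

```python
def rr_and_or(i):
    """Overall P(depression), RR and OR when the 'not depressed' row is scaled by i."""
    a, c = 25, 20
    b, d = 10 * i, 45 * i
    p_overall = (a + c) / (a + b + c + d)
    rr = (a / (a + b)) / (c / (c + d))
    odds_ratio = (a * d) / (b * c)
    return p_overall, rr, odds_ratio

results = [rr_and_or(i) for i in range(1, 46)]

for i in (1, 5, 15, 45):
    p_overall, rr, odds_ratio = results[i - 1]
    print(i, round(p_overall, 3), round(rr, 3), round(odds_ratio, 3))
```

This is one common argument for reporting odds ratios: they are unaffected by how the outcome margin is sampled, which is not true of the relative risk.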
The Generalized Linear Model
  General Linear Model (LM)
    Continuous outcomes (DV)
    Linear regression, t-test, Pearson correlation, ANOVA, ANCOVA
  Generalized Linear Model (GLM)
    John Nelder and Robert Wedderburn
    Maximum likelihood estimation
    Continuous, categorical, and count outcomes
    Distribution family and link functions
    Error distributions that are not normal

Logistic Regression
  "This is the most important model for categorical response data" (Agresti, Categorical Data Analysis, 2nd Ed.)
  Binary response
  Predicting probability (related to the Probit model)
  Assume (the usual):
    Independence
    NOT homoscedasticity or normal errors
    Linearity (in the log odds)
    Also: adequate cell sizes

Logistic Regression: The Model
  In terms of the probability of success $\pi(x)$:
    $$Y = \pi(x) = \frac{e^{\alpha + \beta_1 x_1}}{1 + e^{\alpha + \beta_1 x_1}}$$
  In terms of logits (log odds):
    $$\text{logit}[\pi(x)] = \ln\frac{\pi(x)}{1 - \pi(x)} = \alpha + \beta_1 x_1$$
  The logit transform gives us a linear equation.

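To see why the logit transform is useful, it can help to tabulate both forms of the model side by side. In this minimal sketch the coefficients `alpha` and `beta_1` are hypothetical values chosen only for illustration; they are not estimates from the deck.

```python
import math

def inv_logit(eta):
    """Probability pi(x) for a given log odds eta = alpha + beta_1 * x_1."""
    return math.exp(eta) / (1 + math.exp(eta))

def logit(p):
    """Log odds ln(p / (1 - p)) for a probability p."""
    return math.log(p / (1 - p))

# Hypothetical coefficients, for illustration only.
alpha, beta_1 = -2.0, 0.5

for x_1 in (0, 2, 4, 6, 8):
    eta = alpha + beta_1 * x_1
    pi = inv_logit(eta)
    # Columns: x, log odds, probability, and the logit recovered from the probability.
    print(x_1, round(eta, 2), round(pi, 3), round(logit(pi), 2))
```

Equal steps in x produce equal steps in the log odds but unequal steps in the probability, which is bounded between 0 and 1.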
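Logistic Regression: Example (Code 2.1)
  The output as logits ($H_0$: $\beta = 0$)

                  Freq.   Percent
  Not Depressed    672     81.95
  Depressed        148     18.05

  Y=Depressed      Coef     SE      Z       P        CI
  α (_constant)   -1.51    0.091   -16.7   <0.001   -1.69, -1.34

  Conversion to probability:
    $\frac{e^{\beta}}{1 + e^{\beta}} = \frac{e^{-1.51}}{1 + e^{-1.51}} \approx 0.18$, matching the observed 18.05% depressed
  What does $H_0$: $\beta = 0$ mean?
    $\frac{e^{\beta}}{1 + e^{\beta}} = \frac{e^{0}}{1 + e^{0}} = 0.5$, so for the constant the null hypothesis is a 50% probability of depression
  Conversion to odds:
    $e^{\beta} = e^{-1.51} = 0.22$
    Also = 0.1805 / 0.8195 = 0.22
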
Logistic Regression: Example (Code 2.2)
  The output as odds ratios ($H_0$: OR = 1, which is the same hypothesis as $\beta = 0$ on the logit scale)

                  Freq.   Percent
  Not Depressed    672     81.95
  Depressed        148     18.05

  Y=Depressed      OR      SE      Z       P        CI
  α (_constant)   0.220   0.020   -16.7   <0.001   0.184, 0.263

  Conversion to probability:
    $\frac{OR}{1 + OR} = \frac{0.220}{1 + 0.220} \approx 0.18$
  Conversion to logit (log odds!):
    ln(OR) = logit, so ln(0.220) = -1.51

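For an intercept-only model, the constant is just the observed log odds, so all three scales can be recovered from the frequency table. A short sketch under that assumption, using only the two counts shown on the slide:

```python
import math

depressed, not_depressed = 148, 672

probability = depressed / (depressed + not_depressed)   # proportion depressed
odds = depressed / not_depressed                        # exponentiated constant
alpha = math.log(odds)                                  # constant on the logit scale

# Round trip: logit -> probability.
probability_from_alpha = math.exp(alpha) / (1 + math.exp(alpha))

print(round(probability, 4), round(odds, 3), round(alpha, 2))
print(round(probability_from_alpha, 4))
```

The small mismatches on the slides (for example 0.220 / 1.220 giving 0.1803 rather than 0.1805) come from rounding the coefficient before converting it.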
Logistic Regression: Example (Code 2.3)
  Logistic regression with a single continuous predictor:
    $$\log\frac{\pi(\text{depressed})}{1 - \pi(\text{depressed})} = \alpha + \beta(\text{age})$$

  AS LOGITS:
  Y=Depressed      Coef     SE      Z       P        CI
  α (_constant)   -2.24    0.489   -4.58   <0.001   -3.20, -1.28
  β (age)          0.013   0.009    1.52    0.127   -0.004, 0.030

  Interpretation:
    A 1 unit increase in age results in a 0.013 increase in the log-odds of depression.
    Hmmmm... I have no concept of what a log-odds is. Interpret as something else.
    Logit > 0, so as age increases the risk of depression increases.
    OR = e^0.013 = 1.013
    For a 1 unit increase in age, the odds of depression are multiplied by 1.013.
    We could also say: for a 1 unit increase in age there is a 1.3% increase in the odds of depression [(OR - 1) * 100 = % change].

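The conversions in this interpretation are easy to check numerically. The sketch below uses the rounded coefficients from the table, so its results are approximate; the ages passed to `predicted_probability` are arbitrary illustrative inputs, not values taken from the data.

```python
import math

# Rounded estimates from the table above.
alpha, beta_age = -2.24, 0.013

or_per_year = math.exp(beta_age)            # odds ratio for a 1-year increase
pct_change = (or_per_year - 1) * 100        # percent change in odds per year
or_per_decade = math.exp(10 * beta_age)     # odds ratio for a 10-year increase

def predicted_probability(age):
    """Model-implied probability of depression at a given age."""
    eta = alpha + beta_age * age
    return 1 / (1 + math.exp(-eta))

print(round(or_per_year, 3), round(pct_change, 1), round(or_per_decade, 3))
for age in (20, 40, 60, 80):   # illustrative ages only
    print(age, round(predicted_probability(age), 3))
```

Note that odds ratios multiply across units: a 10-year difference corresponds to exp(10 * 0.013), not 10 times the one-year percentage.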
Logistic Regression: GOF (Overall Model)
  Likelihood-Ratio Chi-Square
    Omnibus test for the model
    Overall model fit? Relative to other models
    Compares the specified model with the null model (no predictors)
    $\chi^2 = -2(LL_0 - LL_1)$, DF = K (the number of parameters estimated beyond the null model)

Logistic Regression: GOF, Summary Measures (Code 2.4)
  Pseudo-$R^2$
    Not the same meaning as in linear regression
    There are many of them (Cox and Snell, McFadden)
    Only comparable within nested models of the same outcome
  Hosmer-Lemeshow
    For models with continuous predictors
    Tests whether the observed outcomes agree with the model's predicted probabilities
    $H_0$: good fit for the data, so we want p > 0.05
    Order the predicted probabilities, group them (g = 10) by quantiles, then compute a chi-square of group by outcome with DF = g - 2
    Conservative (rarely rejects the null)
  Pearson Chi-Square
    For models with categorical predictors
    Similar to Hosmer-Lemeshow
  ROC: Area Under the Curve
    Predictive accuracy/classification

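The Hosmer-Lemeshow recipe described above (order, group, compare observed with expected) can be sketched in a few lines. This is a simplified illustration, not the routine any particular statistics package uses: it assumes equal-count groups, does no special handling of tied probabilities, and is run here on simulated data generated from a seeded random number generator.

```python
import random

def hosmer_lemeshow(y, p, groups=10):
    """Return (statistic, df) for binary outcomes y and predicted probabilities p."""
    # Order cases by predicted probability (stable sort keeps tied cases in input order).
    pairs = sorted(zip(p, y), key=lambda pair: pair[0])
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        # Split the ordered cases into near-equal-sized groups.
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        n_g = len(chunk)
        observed = sum(yi for _, yi in chunk)    # observed events in the group
        expected = sum(pi for pi, _ in chunk)    # expected events in the group
        mean_p = expected / n_g
        stat += (observed - expected) ** 2 / (n_g * mean_p * (1 - mean_p))
    return stat, groups - 2

# Simulated, well-calibrated data (illustration only).
rng = random.Random(0)
p_hat = [0.05 + 0.9 * i / 499 for i in range(500)]
y_obs = [1 if rng.random() < pi else 0 for pi in p_hat]

hl_stat, hl_df = hosmer_lemeshow(y_obs, p_hat)
print(round(hl_stat, 2), hl_df)
```

A small statistic relative to a chi-square distribution with g - 2 degrees of freedom gives a large p-value, which is the "good fit" outcome the slide says we want.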
Logistic Regression: GOF, Diagnostic Measures (Code 2.5)
  Outliers in Y (outcome)
    Pearson residuals: square root of the contribution to the Pearson $\chi^2$
    Deviance residuals: square root of the contribution to the likelihood-ratio test statistic of a saturated model vs. the fitted model
  Outliers in X (predictors)
    Leverage (hat matrix/projection matrix): maps the influence of observed on fitted values
  Influential observations
    Pregibon's Delta-Beta influence statistic
    Similar to Cook's D in linear regression
  Detecting problems
    Residuals vs. predictors
    Leverage vs. residuals
    Boxplot of Delta-Beta

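As a concrete reading of the two residual definitions, here is a minimal sketch for ungrouped binary data (one 0/1 outcome per case). The formulas are the standard ones for that case; the example probability of 0.1 is a made-up value, and leverage and Delta-Beta are left out because they require the fitted model's design matrix.

```python
import math

def pearson_residual(y, p):
    """(observed - predicted) scaled by the binomial standard deviation."""
    return (y - p) / math.sqrt(p * (1 - p))

def deviance_residual(y, p):
    """Signed square root of the case's contribution to the deviance."""
    contribution = -2 * (y * math.log(p) + (1 - y) * math.log(1 - p))
    return math.copysign(math.sqrt(contribution), y - p)

# A case predicted to have a 10% chance of depression (hypothetical value).
p_hat = 0.1
for y in (0, 1):
    print(y, round(pearson_residual(y, p_hat), 3), round(deviance_residual(y, p_hat), 3))
```

Squaring and summing each type of residual over all cases gives the Pearson chi-square and the deviance respectively, which is why large individual values flag poorly fitted observations.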
Logistic Regression: GOF
  $$\log\frac{\pi(\text{depressed})}{1 - \pi(\text{depressed})} = \alpha + \beta_1(\text{age})$$

  L-R $\chi^2$ (df = 1): 2.47, p = 0.1162
  H-L GOF:
    Number of groups: 10
    H-L Chi2: 7.12
    DF: 8
    P: 0.5233
  McFadden's $R^2$: 0.0030

  Y=Depressed      Coef     SE      Z       P        CI
  α (_constant)   -2.24    0.489   -4.58   <0.001   -3.20, -1.28
  β (age)          0.013   0.009    1.52    0.127   -0.004, 0.030

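The two p-values on this slide can be reproduced from the reported statistics without a statistics library. The sketch below relies on two closed forms for the chi-square upper tail, one for df = 1 and one for even df; the helper names are introduced here for illustration. Because the inputs are rounded to two decimals, the results agree with the slide only approximately.

```python
import math

def chi2_sf_df1(x):
    """Upper-tail probability of a chi-square variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))

def chi2_sf_even_df(x, df):
    """Upper-tail probability for even df: exp(-x/2) * sum_{k < df/2} (x/2)^k / k!."""
    half = x / 2
    term, total = 1.0, 0.0
    for k in range(df // 2):
        total += term
        term *= half / (k + 1)
    return math.exp(-half) * total

p_lr = chi2_sf_df1(2.47)          # likelihood-ratio test, df = 1
p_hl = chi2_sf_even_df(7.12, 8)   # Hosmer-Lemeshow test, df = 8

print(round(p_lr, 4), round(p_hl, 4))
```

Both tests tell the same story from different directions: the L-R test gives no strong evidence that age improves on the null model, while the H-L test gives no evidence of poor calibration.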
Logistic Regression: Diagnostics (Code 2.6)
  Linearity in the log-odds
    Use a lowess (loess) plot of Depressed vs. Age
  [Figure: lowess smoother, logit-transformed smooth; y-axis "Depressed (Logit)", -3 to 1; x-axis "age", 20 to 80; bandwidth = 0.8]

Logistic Regression: Example (Code 2.7)
  Logistic regression with a single categorical predictor:
    $$\log\frac{\pi(\text{depressed})}{1 - \pi(\text{depressed})} = \alpha + \beta_1(\text{gender})$$

  AS OR:
  Y=Depressed      OR      SE      Z       P        CI
  α (_constant)   0.545   0.091   -3.63   <0.001   0.392, 0.756
  β (male)        0.299   0.060   -5.99   <0.001   0.202, 0.444

  Interpretation:
    The odds of depression for males are 0.299 times the odds for females.
    We could also say: the odds of depression are (1 - 0.299 = 0.701) 70.1% lower in males compared to females.
    Or why not just make males the reference group, so the OR is greater than 1?
    Or we could just take the inverse and accomplish the same thing: 1/0.299 = 3.34.

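The alternative phrasings above are all transformations of the same two numbers. A brief sketch using the rounded estimates from the table (so the probabilities are approximate):

```python
# Rounded estimates from the table above.
odds_female = 0.545   # exponentiated constant: odds of depression for females
or_male = 0.299       # odds ratio, males vs. females

odds_male = odds_female * or_male
pct_lower = (1 - or_male) * 100          # percent reduction in odds for males
or_female_vs_male = 1 / or_male          # same effect with males as the reference

# Convert each group's odds back to a probability.
p_female = odds_female / (1 + odds_female)
p_male = odds_male / (1 + odds_male)

print(round(pct_lower, 1), round(or_female_vs_male, 2))
print(round(p_female, 3), round(p_male, 3))
```

Inverting the odds ratio only changes the reference group; the test statistic and p-value are unchanged.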
Ordinal Logistic Regression
  Also called Ordered Logistic or Proportional Odds Model
  Extension of the binary logistic model
  >2 ordered responses
  New assumption! Proportional odds
    BMI3GRP (1=Normal Weight, 2=Overweight, 3=Obese)
    The predictor's effect on the outcome is the same across levels of the outcome:
      Bmi3grp (1 vs 2,3) = B(age)
      Bmi3grp (1,2 vs 3) = B(age)

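Ordinal Logistic Regression: The Model
  A latent variable model (Y*)
  j indexes the cumulative splits: j = 1, ..., (number of levels - 1)
    $$Y = \text{logit}(p_1 + p_2 + \dots + p_j) = \ln\frac{p_1 + p_2 + \dots + p_j}{1 - p_1 - p_2 - \dots - p_j} = \alpha_j + \beta x$$
  From the equation we can see that the odds ratio is assumed to be independent of the category j: each split has its own intercept $\alpha_j$ but shares the same slope $\beta$.
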
Ordinal Logistic Regression: Example (Code 3.1)

  AS LOGITS:
  Y=bmi3grp           Coef     SE       Z       P        CI
  β1 (age)           -0.026   0.006    -4.15   <0.001   -0.038, -0.014
  β2 (blood_press)    0.012   0.005     2.48    0.013    0.002, 0.021
  Threshold1/cut1    -0.696   0.6678                    -2.004, 0.613
  Threshold2/cut2     0.773   0.6680                    -0.536, 2.082

  For a 1 unit increase in blood pressure there is a 0.012 increase in the log-odds of being in a higher BMI category.

  AS OR:
  Y=bmi3grp           OR       SE       Z       P        CI
  β1 (age)            0.974   0.006    -4.15   <0.001    0.962, 0.986
  β2 (blood_press)    1.012   0.005     2.48    0.013    1.002, 1.022
  Threshold1/cut1    -0.696   0.6678                    -2.004, 0.613
  Threshold2/cut2     0.773   0.6680                    -0.536, 2.082
  (The thresholds are shown on the logit scale in both tables.)

  For a 1 unit increase in blood pressure the odds of being in a higher BMI category are 1.012 times greater.

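To connect the thresholds and slopes to category probabilities, the sketch below assumes the cumulative model is parameterized as P(Y <= j) = inv_logit(cut_j - xb). That sign convention is an assumption made here because it matches the slide's reading that a positive coefficient raises the odds of a higher category; other texts write the linear predictor with the opposite sign. The age and blood pressure inputs are hypothetical.

```python
import math

def inv_logit(eta):
    return 1 / (1 + math.exp(-eta))

# Rounded estimates from the tables above.
beta_age, beta_bp = -0.026, 0.012
cut1, cut2 = -0.696, 0.773

def bmi_group_probabilities(age, blood_press):
    """P(normal), P(overweight), P(obese), ASSUMING P(Y <= j) = inv_logit(cut_j - xb)."""
    xb = beta_age * age + beta_bp * blood_press
    p_le_1 = inv_logit(cut1 - xb)   # P(normal weight)
    p_le_2 = inv_logit(cut2 - xb)   # P(normal weight or overweight)
    return p_le_1, p_le_2 - p_le_1, 1 - p_le_2

# Hypothetical people: same age, blood pressure 20 units apart.
low_bp = bmi_group_probabilities(age=50, blood_press=120)
high_bp = bmi_group_probabilities(age=50, blood_press=140)

print([round(p, 3) for p in low_bp])
print([round(p, 3) for p in high_bp])
```

Under proportional odds, the odds of being above any given cut point change by the same factor, exp(0.012) per unit of blood pressure, whichever cut point is used.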
Ordinal Logistic Regression: GOF (Code 3.2)
  Assessing the proportional odds assumption
    Brant Test of Parallel Regression
      $H_0$: proportional odds, thus want p > 0.05
      Tests each predictor separately and overall
    Score Test of Parallel Regression
      $H_0$: proportional odds, thus want p > 0.05
    Approximate Likelihood-Ratio Test
      $H_0$: proportional odds, thus want p > 0.05

Ordinal Logistic Regression: GOF (Code 3.3)
  Pseudo $R^2$
  Diagnostic measures
    Performed on the j-1 binomial logistic regressions

Multinomial Logistic Regression
  Also called multinomial logit or polytomous logistic regression
  Same assumptions as the binary logistic model
  Use for >2 non-ordered responses
  Or when you've failed to meet the proportional odds (parallel regression) assumption of the ordinal logistic model

Multinomial Logistic Regression: The Model
  j = levels of the outcome; J = reference level
  $\pi_j(x) = P(Y = j \mid x)$, where x is a fixed setting of an explanatory variable
    $$\text{logit}[\pi_j(x)] = \ln\frac{\pi_j(x)}{\pi_J(x)} = \alpha_j + \beta_{j1} x_1 + \dots + \beta_{jp} x_p$$
  Notice how it appears we are estimating a relative risk and not an odds ratio. It's actually an OR.
  Similar to conducting separate binary logistic models, but with better Type I error control

Multinomial Logistic Regression: Example (Code 4.1)
  Does degree of supernatural belief indicate a religious preference?

  AS OR:
  Y=religion (ref=Catholic (1))
                        OR      SE      Z       P        CI
  Protestant (2)
    β (supernatural)   1.126   0.090    1.47    0.141   0.961, 1.317
    α (_constant)      1.219   0.097    2.49    0.013   1.043, 1.425
  Evangelical (3)
    β (supernatural)   1.218   0.117    2.06    0.039   1.010, 1.469
    α (_constant)      0.619   0.059   -5.02   <0.001   0.512, 0.746

  For a 1 unit increase in supernatural belief, there is a 21.8% increase [(OR - 1) * 100 = % change] in the odds of being Evangelical rather than Catholic.

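The exponentiated coefficients describe odds relative to the Catholic reference group, and predicted probabilities follow by normalizing those relative odds so they sum to 1. The sketch below uses the rounded estimates from the table; the scale of the supernatural-belief score is not given on the slide, so the input values 0 and 1 are purely illustrative.

```python
# (exp(alpha_j), exp(beta_j)) for each non-reference category, from the table above.
coefficients = {
    "Protestant": (1.219, 1.126),
    "Evangelical": (0.619, 1.218),
}

def predicted_probabilities(supernatural):
    """Category probabilities at a given belief score, with Catholic as the reference."""
    relative_odds = {"Catholic": 1.0}
    for religion, (base_odds, odds_ratio) in coefficients.items():
        # Odds of this category vs. Catholic: exp(alpha_j + beta_j * x).
        relative_odds[religion] = base_odds * odds_ratio ** supernatural
    total = sum(relative_odds.values())
    return {religion: value / total for religion, value in relative_odds.items()}

at_0 = predicted_probabilities(0)   # illustrative score
at_1 = predicted_probabilities(1)   # one unit higher

for label, probs in (("x=0", at_0), ("x=1", at_1)):
    print(label, {religion: round(p, 3) for religion, p in probs.items()})
```

The ratio P(Evangelical) / P(Catholic) grows by exactly 1.218 per unit, while the probability of being Evangelical itself grows by a smaller and non-constant amount, which is why the effect should be stated in terms of odds relative to the reference group.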
Multinomial Logistic Regression: GOF
  Limited GOF tests
    Look at the LR chi-square and compare nested models
    "Essentially, all models are wrong, but some are useful" (George E.P. Box)
  Pseudo $R^2$
  Similar to ordinal
    Perform tests on the j-1 binomial logistic regressions

Resources
  Categorical Data Analysis by Alan Agresti
  UCLA Stat Computing: http://www.ats.ucla.edu/stat/