Analysis of Categorical Data
Nick Jackson, University of Southern California, Department of Psychology
10/11/2013


Overview
- Data Types
- Contingency Tables
- Logit Models: Binomial, Ordinal, Nominal

Things not covered (but still fit into the topic)
- Matched pairs/repeated measures: McNemar's Chi-Square
- Reliability: Cohen's Kappa
- ROC
- Poisson (count) models
- Categorical SEM
- Tetrachoric correlation
- Bernoulli trials

Data Types (Levels of Measurement)
Discrete/Categorical/Qualitative vs. Continuous/Quantitative

Nominal/Multinomial
- Properties: values arbitrary (no magnitude), no direction (no ordering)
- Example: Race: 1=AA, 2=Ca, 3=As
- Measures: mode, relative frequency

Rank Order/Ordinal
- Properties: values semi-arbitrary (no magnitude?), have direction (ordering)
- Example: Likert scales ("LICK-urt"): 1-5, Strongly Disagree to Strongly Agree
- Measures: mode, relative frequency, median. Mean?

Binary/Dichotomous/Binomial
- Properties: 2 levels; special case of ordinal or multinomial
- Examples: Gender (multinomial), Disease (Y/N)
- Measures: mode, relative frequency. Mean?

Code 1.1  Contingency Tables
- Often called two-way tables or cross-tabs
- Have dimensions I x J
- Can be used to test hypotheses of association between categorical variables

2 x 3 Table (Gender by Age Group)
Gender     <40 Years   40-50 Years   >50 Years
Female         25           68            63
Male          240          223           201

Contingency Tables: Test of Independence
Chi-Square Test of Independence (χ²)
- Calculate χ²
- Determine DF: (I − 1) × (J − 1)
- Compare to the χ² critical value for the given DF

2 x 3 Table (Gender by Age Group)
Gender     <40 Years   40-50 Years   >50 Years   Total
Female         25           68            63      R1 = 156
Male          240          223           201      R2 = 664
Total       C1 = 265     C2 = 291      C3 = 264   N = 820

    χ² = Σ_{i=1}^{n} (O_i − E_i)² / E_i

where O_i = observed frequency for cell i, E_{i,j} = (R_i × C_j) / N is the expected frequency, and n = the number of cells in the table.
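A minimal sketch of this test for the 2 x 3 table above, using the cell counts from the slide. The hand computation follows E_ij = R_i × C_j / N; scipy's chi2_contingency is shown only as a cross-check and is not part of the course's own code.

```python
import numpy as np
from scipy.stats import chi2, chi2_contingency

observed = np.array([[25, 68, 63],      # Female
                     [240, 223, 201]])  # Male

row_totals = observed.sum(axis=1, keepdims=True)   # R_i
col_totals = observed.sum(axis=0, keepdims=True)   # C_j
n = observed.sum()                                 # N = 820

expected = row_totals @ col_totals / n             # E_ij = R_i * C_j / N
chi2_stat = ((observed - expected) ** 2 / expected).sum()
df = (observed.shape[0] - 1) * (observed.shape[1] - 1)   # (I-1)*(J-1) = 2
p_value = chi2.sf(chi2_stat, df)
print(chi2_stat, df, p_value)          # ~23.4 on 2 df, p < 0.001

# Same test via scipy (also returns the expected counts).
# For small expected counts in a 2x2 table, scipy.stats.fisher_exact applies.
stat, p, dof, exp = chi2_contingency(observed)
```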

Code 1.2  Contingency Tables: Test of Independence
Pearson Chi-Square Test of Independence (χ²)
- H0: no association
- HA: association ... but where, and how?
- Not appropriate when an expected (E_i) cell frequency is < 5; use Fisher's Exact Test instead

Result for the 2 x 3 table above: χ²(df = 2) = 23.39, p < 0.001

Contingency Tables: 2 x 2

                               Disorder (Outcome)
                               Yes        No
Risk Factor/     Yes            a          b        a+b
Exposure         No             c          d        c+d
                               a+c        b+d     a+b+c+d

Contingency Tables: Measures of Association

                        Alcohol Use
                        Yes        No
Depression    Yes     a = 25     c = 20       45
              No      b = 10     d = 45       55
                        35         65         100

Probability:
- Depression given alcohol use: P(D|A) = a / (a + b) = 25/35 = 0.714
- Depression given no alcohol use: P(D|~A) = c / (c + d) = 20/65 = 0.308

Odds:
- Depression given alcohol use: Odds(D|A) = P(D|A) / (1 − P(D|A)) = 0.714 / (1 − 0.714) = 2.5
- Depression given no alcohol use: Odds(D|~A) = P(D|~A) / (1 − P(D|~A)) = 0.308 / (1 − 0.308) = 0.44

Contrasting probabilities:
Relative Risk (RR) = P(D|A) / P(D|~A) = 0.714 / 0.308 = 2.31
Individuals who used alcohol were 2.31 times more likely to have depression than those who did not use alcohol.

Contrasting odds:
Odds Ratio (OR) = Odds(D|A) / Odds(D|~A) = 2.5 / 0.44 = 5.62
The odds of depression were 5.62 times greater in alcohol users compared to nonusers.
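A minimal sketch reproducing the measures above from the slide's 2 x 2 counts. The hand calculations use only the definitions shown; statsmodels' Table2x2 is one convenient way to get the same RR and OR with confidence intervals, shown here as an illustration rather than as the course's own code.

```python
import numpy as np
from statsmodels.stats.contingency_tables import Table2x2

a, b, c, d = 25, 10, 20, 45   # a/b = alcohol users with/without depression; c/d = nonusers

p_dep_alc = a / (a + b)             # P(D|A)  = 0.714
p_dep_noalc = c / (c + d)           # P(D|~A) = 0.308
rr = p_dep_alc / p_dep_noalc        # relative risk ~ 2.32

odds_alc = p_dep_alc / (1 - p_dep_alc)         # 2.5
odds_noalc = p_dep_noalc / (1 - p_dep_noalc)   # ~0.44
odds_ratio = odds_alc / odds_noalc             # ~5.62

# Rows = exposure (alcohol yes/no), columns = outcome (depression yes/no)
table = Table2x2(np.array([[a, b], [c, d]]))
print(table.riskratio, table.oddsratio)        # same RR and OR as above
print(table.oddsratio_confint())               # 95% CI for the OR
```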

Why Odds Ratios?

Keep the depressed cells fixed and scale the non-depressed cells by i = 1 to 45:

                        Alcohol Use
                        Yes           No
Depression    Yes     a = 25       c = 20          45
              No      b = 10*i     d = 45*i        55*i
                    (25 + 10*i)  (20 + 45*i)    (45 + 55*i)

[Figure: RR and OR plotted against the overall probability of depression. The OR stays constant while the RR changes with the baseline prevalence, approaching the OR as depression becomes rare.]
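A minimal sketch of the demonstration above, looping over a few values of the scaling factor i from the slide; it reproduces the pattern in the figure: the OR is constant at 5.625 while the RR drifts toward it as depression becomes rarer.

```python
a, c = 25, 20
for i in (1, 2, 5, 10, 45):
    b, d = 10 * i, 45 * i
    p_overall = (a + c) / (a + b + c + d)    # overall probability of depression
    rr = (a / (a + b)) / (c / (c + d))       # relative risk changes with prevalence
    orr = (a * d) / (b * c)                  # odds ratio stays at 5.625
    print(f"i={i:2d}  P(D)={p_overall:.3f}  RR={rr:.2f}  OR={orr:.3f}")
```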

The Generalized Linear Model

General Linear Model (LM)
- Continuous outcomes (DV)
- Linear regression, t-test, Pearson correlation, ANOVA, ANCOVA

Generalized Linear Model (GLM)
- John Nelder and Robert Wedderburn
- Maximum likelihood estimation
- Continuous, categorical, and count outcomes
- Distribution family and link functions
- Error distributions that are not normal

Logistic Regression
"This is the most important model for categorical response data" (Agresti, Categorical Data Analysis, 2nd Ed.)
- Binary response
- Predicting probability (related to the probit model)
- Assume (the usual): independence, but NOT homoscedasticity or normal errors
- Linearity (in the log odds)
- Also ... adequate cell sizes

Logistic Regression: The Model

In terms of the probability of success π(x):

    π(x) = e^(α + β₁x₁) / (1 + e^(α + β₁x₁))

In terms of logits (log odds):

    logit[π(x)] = ln[ π(x) / (1 − π(x)) ] = α + β₁x₁

The logit transform gives us a linear equation.
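A minimal sketch of the two forms of the model above: the inverse logit gives the probability of success, and the logit transform recovers the linear predictor. The α and β values plugged in are the age-model estimates shown later on these slides, used here only for illustration.

```python
import numpy as np

def inv_logit(eta):
    """pi(x) = exp(eta) / (1 + exp(eta))"""
    return np.exp(eta) / (1.0 + np.exp(eta))

def logit(p):
    """ln(p / (1 - p)) -- the log odds"""
    return np.log(p / (1.0 - p))

alpha, beta1 = -2.24, 0.013            # from the age example later in the deck
age = np.array([20, 40, 60, 80])
p = inv_logit(alpha + beta1 * age)     # predicted probability of depression
print(p)
print(logit(p))                        # recovers alpha + beta1 * age
```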

Code 2.1  Logistic Regression Example: The Output as Logits (intercept-only model)

Frequencies: Not Depressed 672 (81.95%), Depressed 148 (18.05%)

Y = Depressed          Coef     SE       Z       P        CI
α (_constant)         -1.51    0.091   -16.7   <0.001   -1.69, -1.34

Logits: H0: β = 0
What does H0: β = 0 mean?  e^β / (1 + e^β) = e^0 / (1 + e^0) = 0.5

Conversion to probability: e^β / (1 + e^β) = e^(-1.51) / (1 + e^(-1.51)) = 0.1805
Conversion to odds: e^β = e^(-1.51) = 0.22  (also 0.1805 / 0.8195 = 0.22)
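A minimal sketch of the intercept-only model above, rebuilding the outcome vector from the frequencies on the slide (148 depressed, 672 not depressed) and converting the logit to odds and probability.

```python
import numpy as np
import statsmodels.api as sm

y = np.repeat([1, 0], [148, 672])           # 148 depressed, 672 not depressed
X = np.ones((len(y), 1))                    # intercept only
fit0 = sm.Logit(y, X).fit(disp=0)

alpha = fit0.params[0]                      # ~ -1.51 (a logit / log odds)
print(np.exp(alpha))                        # odds ~ 0.22  (= 148/672)
print(np.exp(alpha) / (1 + np.exp(alpha)))  # probability ~ 0.1805 (= 148/820)
```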

Code 2.2  Logistic Regression Example: The Output as Odds Ratios

Odds ratios: H0: OR = 1 (equivalent to β = 0)

Y = Depressed          OR      SE       Z       P        CI
α (_constant)         0.220   0.020   -16.7   <0.001   0.184, 0.263

Conversion to probability: OR / (1 + OR) = 0.220 / 1.220 = 0.1805
Conversion to logit (log odds): ln(OR) = ln(0.220) = -1.51

Code 2.3  Logistic Regression Example: Single Continuous Predictor

    log[ π(depressed) / (1 − π(depressed)) ] = α + β(age)

As logits:
Y = Depressed          Coef     SE      Z       P        CI
α (_constant)         -2.24    0.489  -4.58   <0.001   -3.20, -1.28
β (age)                0.013   0.009   1.52    0.127   -0.004, 0.030

Interpretation: a 1-unit increase in age results in a 0.013 increase in the log odds of depression.
Hmmm ... I have no concept of what a log odds is. Interpret it as something else:
- The logit is > 0, so as age increases the risk of depression increases.
- OR = e^0.013 = 1.013: for a 1-unit increase in age, the odds of depression are multiplied by 1.013.
- We could also say: for a 1-unit increase in age there is a 1.3% increase in the odds of depression [(OR − 1) × 100 = % change].
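A minimal sketch of this model, assuming a pandas DataFrame `df` with a 0/1 `depressed` column and an `age` column (hypothetical names; the actual course data and Stata code are not shown here).

```python
import numpy as np
import statsmodels.formula.api as smf

fit = smf.logit("depressed ~ age", data=df).fit()
print(fit.summary())                  # coefficients on the logit (log-odds) scale

odds_ratios = np.exp(fit.params)      # OR for a 1-unit increase in each predictor
or_ci = np.exp(fit.conf_int())        # CI transformed to the OR scale
pct_change = (odds_ratios - 1) * 100  # e.g. OR 1.013 -> ~1.3% increase in the odds
print(odds_ratios, or_ci, pct_change)
```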

Logistic Regression: GOF (Overall Model)

Likelihood-Ratio Chi-Square
- Omnibus test for the model
- Overall model fit? Relative to other models
- Compares the specified model with the null model (no predictors)
- χ² = −2 × (LL₀ − LL₁), DF = K parameters estimated
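A minimal sketch of this likelihood-ratio test, reusing the `fit` object from the age sketch above; the `llnull`, `llf`, and `df_model` attributes are statsmodels conventions, not the course's own output.

```python
from scipy.stats import chi2

lr_stat = -2 * (fit.llnull - fit.llf)   # chi2 = -2 * (LL_0 - LL_1)
df = fit.df_model                        # parameters beyond the intercept
p = chi2.sf(lr_stat, df)
print(lr_stat, p)                        # statsmodels also reports fit.llr, fit.llr_pvalue
```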

Code 2.4  Logistic Regression: GOF (Summary Measures)

Pseudo-R²
- Not the same meaning as in linear regression
- There are many of them (Cox and Snell, McFadden)
- Only comparable within nested models of the same outcome

Hosmer-Lemeshow (models with continuous predictors; see the sketch below)
- Do the observed outcomes agree with the model's predicted probabilities? (χ²)
- H0: good fit for the data, so we want p > 0.05
- Order the predicted probabilities, group them (g = 10) by quantiles, then chi-square of group × outcome, df = g − 2
- Conservative (rarely rejects the null)

Pearson Chi-Square (models with categorical predictors)
- Similar to Hosmer-Lemeshow

ROC: Area Under the Curve
- Predictive accuracy/classification
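A minimal sketch of these summary measures, assuming the `fit` and `df` objects from the age sketch above. The Hosmer-Lemeshow test is not built into statsmodels, so the grouping is done by hand; McFadden's pseudo-R² and the ROC AUC come from statsmodels and scikit-learn.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2
from sklearn.metrics import roc_auc_score

p_hat = fit.predict()                      # predicted probabilities
y = df["depressed"].to_numpy()

groups = pd.qcut(p_hat, q=10, duplicates="drop")   # ~10 groups by quantile
obs = pd.Series(y).groupby(groups).sum()           # observed events per group
exp = pd.Series(p_hat).groupby(groups).sum()       # expected events per group
n_g = pd.Series(y).groupby(groups).count()

hl_stat = (((obs - exp) ** 2) / (exp * (1 - exp / n_g))).sum()
hl_p = chi2.sf(hl_stat, df=len(obs) - 2)           # df = g - 2; want p > 0.05
print(hl_stat, hl_p)

print(fit.prsquared)                               # McFadden's pseudo-R2
print(roc_auc_score(y, p_hat))                     # area under the ROC curve
```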

Code 2.5  Logistic Regression: GOF (Diagnostic Measures)

Outliers in Y (outcome)
- Pearson residuals: square root of the contribution to the Pearson χ²
- Deviance residuals: square root of the contribution to the likelihood-ratio test statistic of a saturated model vs. the fitted model

Outliers in X (predictors)
- Leverage (hat matrix/projection matrix): maps the influence of observed values on fitted values

Influential observations
- Pregibon's delta-beta influence statistic: similar to Cook's D in linear regression

Detecting problems
- Residuals vs. predictors
- Leverage vs. residuals
- Boxplot of delta-beta
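A minimal sketch of these diagnostics, refitting the same model as a binomial GLM so statsmodels' influence tools are available. The dfbetas used here is an analogue of the delta-beta idea, not Pregibon's exact statistic, and `df` is the hypothetical DataFrame assumed earlier.

```python
import statsmodels.api as sm
import statsmodels.formula.api as smf

glm_fit = smf.glm("depressed ~ age", data=df,
                  family=sm.families.Binomial()).fit()

pearson_resid = glm_fit.resid_pearson     # outliers in Y
deviance_resid = glm_fit.resid_deviance   # contribution to the deviance

infl = glm_fit.get_influence()
leverage = infl.hat_matrix_diag           # outliers in X (hat/projection matrix)
dfbetas = infl.dfbetas                    # influence of each case on each coefficient
```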

Logistic Regression: GOF (Example)

    log[ π(depressed) / (1 − π(depressed)) ] = α + β₁(age)

LR χ²(df = 1) = 2.47, p = 0.1162
H-L GOF: number of groups = 10, H-L χ² = 7.12, df = 8, p = 0.5233
McFadden's R² = 0.0030

Y = Depressed          Coef     SE      Z       P        CI
α (_constant)         -2.24    0.489  -4.58   <0.001   -3.20, -1.28
β (age)                0.013   0.009   1.52    0.127   -0.004, 0.030

Code 2.6  Logistic Regression: Diagnostics

Linearity in the log odds: use a lowess (loess) plot of depression vs. age.

[Figure: lowess smoother with logit-transformed smooth; y-axis: Depressed (logit), x-axis: age 20-80; bandwidth = 0.8]
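A minimal sketch of this check, mirroring the plot on the slide: lowess-smooth the 0/1 outcome against age, then logit-transform the smoothed values (clipped to avoid log(0)) and look for a roughly straight line. Assumes the hypothetical `df` from earlier.

```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.nonparametric.smoothers_lowess import lowess

smoothed = lowess(df["depressed"], df["age"], frac=0.8)   # bandwidth ~ 0.8
age_s = smoothed[:, 0]
p_s = np.clip(smoothed[:, 1], 0.01, 0.99)
logit_s = np.log(p_s / (1 - p_s))        # logit-transformed smooth

plt.plot(age_s, logit_s)
plt.xlabel("age")
plt.ylabel("Depressed (logit of smoothed probability)")
plt.show()
```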

Code 2.7  Logistic Regression Example: Single Categorical Predictor

    log[ π(depressed) / (1 − π(depressed)) ] = α + β₁(gender)

As ORs:
Y = Depressed          OR      SE      Z       P        CI
α (_constant)         0.545   0.091  -3.63   <0.001   0.392, 0.756
β (male)              0.299   0.060  -5.99   <0.001   0.202, 0.444

Interpretation: the odds of depression are 0.299 times as large for males as for females.
- We could also say: the odds of depression are (1 − 0.299 = 0.701) 70.1% lower in males compared to females.
- Or make males the reference category so the OR is greater than 1.
- Or take the inverse and accomplish the same thing: 1 / 0.299 = 3.34.
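A minimal sketch of this model, assuming `df` also has a 0/1 `male` indicator (females as the reference); the column name is hypothetical.

```python
import numpy as np
import statsmodels.formula.api as smf

fit_gender = smf.logit("depressed ~ male", data=df).fit()
or_male = np.exp(fit_gender.params["male"])   # ~0.299: odds lower in males
print(or_male)
print((1 - or_male) * 100)                    # ~70.1% lower odds in males
print(1 / or_male)                            # ~3.34: same effect, females vs. males
```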

Ordinal Logistic Regression
- Also called ordered logistic or the proportional odds model
- Extension of the binary logistic model to >2 ordered responses
- Example outcome: BMI3GRP (1 = normal weight, 2 = overweight, 3 = obese)
- New assumption: proportional odds. The predictor's effect on the outcome is the same across levels of the outcome:
  bmi3grp (1 vs. 2,3) = β(age)
  bmi3grp (1,2 vs. 3) = β(age)

Ordinal Logistic Regression: The Model
- A latent variable model (Y*)
- j = 1, ..., number of levels − 1 cumulative logits:

    logit(p₁ + p₂ + ... + p_j) = ln[ (p₁ + p₂ + ... + p_j) / (1 − p₁ − p₂ − ... − p_j) ] = α_j + βx

From the equation we can see that the odds ratio is assumed to be independent of the category j.

Code 3.1  Ordinal Logistic Regression Example

As logits:
Y = bmi3grp              Coef     SE      Z       P        CI
β1 (age)                -0.026   0.006  -4.15   <0.001   -0.038, -0.014
β2 (blood_press)         0.012   0.005   2.48    0.013    0.002, 0.021
Threshold1/cut1         -0.696   0.668                   -2.004, 0.613
Threshold2/cut2          0.773   0.668                   -0.536, 2.082

For a 1-unit increase in blood pressure there is a 0.012 increase in the log odds of being in a higher BMI category.

As ORs:
Y = bmi3grp              OR      SE      Z       P        CI
β1 (age)                0.974   0.006  -4.15   <0.001   0.962, 0.986
β2 (blood_press)        1.012   0.005   2.48    0.013   1.002, 1.022
Threshold1/cut1        -0.696   0.668                  -2.004, 0.613
Threshold2/cut2         0.773   0.668                  -0.536, 2.082

For a 1-unit increase in blood pressure the odds of being in a higher BMI category are 1.012 times greater.
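A minimal sketch of this proportional odds model, assuming `df` has an ordered `bmi3grp` column coded 1/2/3 plus `age` and `blood_press` columns (hypothetical names). OrderedModel is available in statsmodels 0.12 and later.

```python
import numpy as np
from statsmodels.miscmodels.ordinal_model import OrderedModel

ord_fit = OrderedModel(df["bmi3grp"], df[["age", "blood_press"]],
                       distr="logit").fit(method="bfgs", disp=0)
print(ord_fit.summary())                             # slopes on the logit scale plus cutpoints
print(np.exp(ord_fit.params[["age", "blood_press"]]))  # ORs for age and blood pressure
```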

Code 3.2  Ordinal Logistic Regression: GOF

Assessing the proportional odds assumption (see the sketch below):
- Brant test of parallel regression: H0: proportional odds, so we want p > 0.05; tests each predictor separately and overall
- Score test of parallel regression: H0: proportional odds, so we want p > 0.05
- Approximate likelihood-ratio test: H0: proportional odds, so we want p > 0.05
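There is no built-in Brant test in statsmodels, so the sketch below only illustrates the idea behind it: fit a separate binary logit for each cumulative split of the outcome and check whether the slopes look similar (a formal Brant or score test compares them with a chi-square statistic). Column names are the hypothetical ones assumed above.

```python
import statsmodels.formula.api as smf

df["ge2"] = (df["bmi3grp"] >= 2).astype(int)   # overweight or obese vs. normal
df["ge3"] = (df["bmi3grp"] >= 3).astype(int)   # obese vs. the rest

fit_ge2 = smf.logit("ge2 ~ age + blood_press", data=df).fit(disp=0)
fit_ge3 = smf.logit("ge3 ~ age + blood_press", data=df).fit(disp=0)
print(fit_ge2.params[["age", "blood_press"]])
print(fit_ge3.params[["age", "blood_press"]])  # roughly equal slopes support proportional odds
```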

Code 3.3  Ordinal Logistic Regression: GOF (continued)
- Pseudo-R²
- Diagnostic measures: performed on the j − 1 binomial logistic regressions

Multinomial Logistic Regression
- Also called multinomial logit or polytomous logistic regression
- Same assumptions as the binary logistic model
- >2 non-ordered responses
- Or: you've failed to meet the parallel (proportional) odds assumption of the ordinal logistic model

Multinomial Logistic Regression: The Model
- j = a level of the outcome; J = the reference level
- π_j(x) = P(Y = j | x), where x is a fixed setting of an explanatory variable

    logit[π_j(x)] = ln[ π_j(x) / π_J(x) ] = α_j + β_j1 x_1 + ... + β_jp x_p

Notice how it appears we are estimating a relative risk and not an odds ratio; it's actually an OR.
Similar to conducting separate binary logistic models, but with better Type I error control.

Code 4.1  Multinomial Logistic Regression Example
Does degree of supernatural belief indicate a religious preference?

As ORs (reference = Catholic (1)):

Protestant (2)           OR      SE      Z       P        CI
β (supernatural)        1.126   0.090   1.47    0.141   0.961, 1.317
α (_constant)           1.219   0.097   2.49    0.013   1.043, 1.425

Evangelical (3)          OR      SE      Z       P        CI
β (supernatural)        1.218   0.117   2.06    0.039   1.010, 1.469
α (_constant)           0.619   0.059  -5.02   <0.001   0.512, 0.746

For a 1-unit increase in supernatural belief, the odds of being Evangelical rather than Catholic increase by 21.8% [(OR − 1) × 100 = % change].
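A minimal sketch of this model, assuming `df` has a `religion` column coded 1 = Catholic, 2 = Protestant, 3 = Evangelical and a `supernatural` score (hypothetical names). MNLogit treats the lowest outcome value as the reference category.

```python
import numpy as np
import statsmodels.formula.api as smf

mn_fit = smf.mnlogit("religion ~ supernatural", data=df).fit(disp=0)
print(mn_fit.summary())        # one set of logit coefficients per non-reference level
print(np.exp(mn_fit.params))   # ORs vs. the reference (Catholic) category
```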

Multinomial Logistic Regression: GOF
- Limited GOF tests: look at the LR chi-square and compare nested models
- "Essentially, all models are wrong, but some are useful" (George E. P. Box)
- Pseudo-R²
- Similar to ordinal: perform tests on the j − 1 binomial logistic regressions

Resources
- Categorical Data Analysis by Alan Agresti
- UCLA Stat Computing: http://www.ats.ucla.edu/stat/