ST3241 Categorical Data Analysis I Two-way Contingency Tables. Odds Ratio and Tests of Independence



Inference For Odds Ratio (p. 24)

For small to moderate sample sizes, the distribution of the sample odds ratio $\hat\theta$ is highly skewed. When $\theta = 1$, $\hat\theta$ cannot be much smaller than $\theta$, but it can be much larger than $\theta$ with non-negligible probability.

Consider instead the log odds ratio, $\log\theta$. If X and Y are independent, then $\log\theta = 0$.

Log Odds Ratio

The log odds ratio is symmetric about zero, in the sense that reversing the rows or reversing the columns only changes its sign.

The sample log odds ratio, $\log\hat\theta$, has a less skewed distribution and is well approximated by the normal distribution.

The asymptotic standard error of $\log\hat\theta$ is
$$\mathrm{ASE}(\log\hat\theta) = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}}$$

Confidence Intervals

A large-sample confidence interval for $\log\theta$ is given by
$$\log\hat\theta \pm z_{\alpha/2}\,\mathrm{ASE}(\log\hat\theta)$$

A large-sample confidence interval for $\theta$ is given by
$$\exp\left[\log\hat\theta \pm z_{\alpha/2}\,\mathrm{ASE}(\log\hat\theta)\right]$$

Example: Aspirin Usage

Sample odds ratio: $\hat\theta = 1.832$.

Sample log odds ratio: $\log\hat\theta = \log(1.832) = 0.605$.

ASE of $\log\hat\theta$: $\sqrt{\dfrac{1}{189} + \dfrac{1}{10933} + \dfrac{1}{10845} + \dfrac{1}{104}} = 0.123$.

The 95% confidence interval for $\log\theta$ equals $0.605 \pm 1.96 \times 0.123$.

The corresponding confidence interval for $\theta$ is $(e^{0.365}, e^{0.846})$, or $(1.44, 2.33)$.

Recall SAS Output

Estimates of the Relative Risk (Row1/Row2)

Type of Study                  Value     95% Confidence Limits
Case-Control (Odds Ratio)      1.8321    1.4400    2.3308
Cohort (Col1 Risk)             1.8178    1.4330    2.3059
Cohort (Col2 Risk)             0.9922    0.9892    0.9953

Sample Size = 22071

A Simple R Function For Odds Ratio

> odds.ratio <- function(x, pad.zeros = FALSE, conf.level = 0.95) {
    # x: a 2x2 table of counts
    if (pad.zeros) {
      if (any(x == 0)) x <- x + 0.5   # add 0.5 to every cell if any cell is empty
    }
    theta <- x[1, 1] * x[2, 2] / (x[2, 1] * x[1, 2])   # sample odds ratio
    ASE <- sqrt(sum(1 / x))                            # ASE of log odds ratio
    CI <- exp(log(theta) + c(-1, 1) * qnorm(0.5 * (1 + conf.level)) * ASE)
    list(estimator = theta, ASE = ASE, conf.interval = CI,
         conf.level = conf.level)
  }
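As a quick check (a usage sketch, not part of the original slides), the function can be applied to the aspirin data from the earlier example, assuming the standard layout of that study with the placebo group in row 1 (189 myocardial infarctions, 10845 none) and the aspirin group in row 2 (104 and 10933):

> aspirin <- matrix(c(189, 10845, 104, 10933), byrow = TRUE, ncol = 2)
> odds.ratio(aspirin)   # estimator = 1.832, ASE = 0.123, 95% CI (1.44, 2.33)

The pad.zeros argument implements a variant of the +0.5 adjustment discussed on the next slide, applied only when some cell is empty.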

Notes (p. 25)

Recall the formula for the sample odds ratio:
$$\hat\theta = \frac{n_{11} n_{22}}{n_{12} n_{21}}$$

The sample odds ratio is 0 or $\infty$ if any $n_{ij} = 0$, and it is undefined if both entries in a row or column are zero.

Consider the slightly modified estimator
$$\tilde\theta = \frac{(n_{11} + 0.5)(n_{22} + 0.5)}{(n_{12} + 0.5)(n_{21} + 0.5)}$$

In the ASE formula as well, each $n_{ij}$ is replaced by $n_{ij} + 0.5$.

Observations

A sample odds ratio of 1.832 does not mean that $p_1$ is 1.832 times $p_2$.

A simple relation:
$$\text{Odds Ratio} = \frac{p_1/(1-p_1)}{p_2/(1-p_2)} = \text{Relative Risk} \times \frac{1-p_2}{1-p_1}$$

If $p_1$ and $p_2$ are both close to 0, the odds ratio and relative risk take similar values.

This relationship between the odds ratio and the relative risk is useful: when only the odds ratio can be estimated, as in a case-control study, it can serve as an approximation to the relative risk for rare outcomes.

Example: Smoking Status and Myocardial Infarction

               Myocardial
Ever Smoker    Infarction    Controls
Yes            172           173
No             90            346

Odds Ratio = ? (3.8222)

How do we get the relative risk? (2.4152)
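A short R sketch (not part of the slides) reproducing both numbers from the table above:

> mi <- matrix(c(172, 173, 90, 346), byrow = TRUE, ncol = 2)
> mi[1, 1] * mi[2, 2] / (mi[1, 2] * mi[2, 1])           # odds ratio = 3.8222
> (mi[1, 1] / sum(mi[1, ])) / (mi[2, 1] / sum(mi[2, ])) # relative risk = 2.4152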

Chi-Squared Tests (p. 27)

To test $H_0$ that the cell probabilities equal certain fixed values $\{\pi_{ij}\}$:

Let $\{n_{ij}\}$ be the cell frequencies and $n$ the total sample size. Then $\mu_{ij} = n\pi_{ij}$ are the expected cell frequencies under $H_0$.

Pearson's (1900) chi-squared test statistic:
$$\chi^2 = \sum \frac{(n_{ij} - \mu_{ij})^2}{\mu_{ij}}$$

Some Properties

The statistic takes its minimum value of zero when all $n_{ij} = \mu_{ij}$.

For a fixed sample size, greater differences between $\{n_{ij}\}$ and $\{\mu_{ij}\}$ produce larger $\chi^2$ values and stronger evidence against $H_0$.

For large sample sizes, the $\chi^2$ statistic has approximately a chi-squared distribution with the appropriate degrees of freedom.

Likelihood-Ratio Test (p. 28)

The likelihood ratio is
$$\Lambda = \frac{\text{maximum likelihood when } H_0 \text{ is true}}{\text{maximum likelihood when parameters are unrestricted}}$$

In mathematical notation, if $L(\theta)$ denotes the likelihood function with $\theta$ as the set of parameters, and the null hypothesis is $H_0: \theta \in \Theta_0$ against $H_1: \theta \in \Theta_1$, the likelihood ratio is given by
$$\Lambda = \frac{\sup_{\theta \in \Theta_0} L(\theta)}{\sup_{\theta \in \Theta_0 \cup \Theta_1} L(\theta)}$$

Properties

The likelihood ratio cannot exceed 1; a small likelihood ratio indicates deviation from $H_0$.

The likelihood-ratio test statistic is $-2\log\Lambda$, which has a chi-squared distribution with the appropriate degrees of freedom for large samples.

For a two-way contingency table, this statistic reduces to
$$G^2 = 2\sum n_{ij}\log\left(\frac{n_{ij}}{\mu_{ij}}\right)$$

The test statistics $\chi^2$ and $G^2$ have the same large-sample distribution under the null hypothesis.

Tests of Independence (p. 30)

To test $H_0: \pi_{ij} = \pi_{i+}\pi_{+j}$ for all $i$ and $j$; equivalently, $H_0: \mu_{ij} = n\pi_{ij} = n\pi_{i+}\pi_{+j}$.

Usually $\{\pi_{i+}\}$ and $\{\pi_{+j}\}$ are unknown. We estimate them using the sample proportions:
$$\hat\mu_{ij} = n\,p_{i+}\,p_{+j} = n\,\frac{n_{i+}\,n_{+j}}{n^2} = \frac{n_{i+}\,n_{+j}}{n}$$

These $\{\hat\mu_{ij}\}$ are called the estimated expected cell frequencies.

Test Statistics

Pearson's chi-squared test statistic:
$$\chi^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{(n_{ij} - \hat\mu_{ij})^2}{\hat\mu_{ij}}$$

Likelihood-ratio test statistic:
$$G^2 = 2\sum_{i=1}^{I}\sum_{j=1}^{J} n_{ij}\log\left(\frac{n_{ij}}{\hat\mu_{ij}}\right)$$

Both have a large-sample chi-squared distribution with $(I-1)(J-1)$ degrees of freedom.
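Since R's chisq.test reports $\chi^2$ but not $G^2$, both statistics can be computed directly from the formulas above. A minimal sketch (the name gof2way is illustrative, not from the notes; all cell counts are assumed positive so the log is defined):

> gof2way <- function(tab) {
    # estimated expected counts under independence
    mu.hat <- outer(rowSums(tab), colSums(tab)) / sum(tab)
    X2 <- sum((tab - mu.hat)^2 / mu.hat)     # Pearson chi-squared
    G2 <- 2 * sum(tab * log(tab / mu.hat))   # likelihood-ratio statistic
    df <- (nrow(tab) - 1) * (ncol(tab) - 1)
    c(X2 = X2, G2 = G2, df = df,
      p.X2 = pchisq(X2, df, lower.tail = FALSE),
      p.G2 = pchisq(G2, df, lower.tail = FALSE))
  }

Applied to the party identification table built a few slides below (the gendergap matrix), it reproduces the values reported there: $\chi^2 = 7.01$ and $G^2 = 7.00$ on 2 degrees of freedom.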

Party Identification By Gender (p. 31)

                      Party Identification
Gender      Democrat    Independent    Republican    Total
Females     279         73             225           577
            (261.4)     (70.7)         (244.9)
Males       165         47             191           403
            (182.6)     (49.3)         (171.1)
Total       444         120            416           980

(Estimated expected cell frequencies in parentheses.)

Example: Continued

The test statistics are $\chi^2 = 7.01$ and $G^2 = 7.00$.

Degrees of freedom $= (I-1)(J-1) = (2-1)(3-1) = 2$; p-value $= 0.03$.

Thus both test statistics suggest that party identification and gender are associated.

SAS Codes: Read The Data

data Survey;
  length Party $ 12;
  input Gender $ Party $ count;
  datalines;
Female Democrat 279
Female Independent 73
Female Republican 225
Male Democrat 165
Male Independent 47
Male Republican 191
;
run;

SAS Codes: Use Proc Freq

proc freq data=survey order=data;
  weight count;
  tables gender*party / chisq expected nopercent norow nocol;
run;

Output

The FREQ Procedure
Table of Gender by Party

Gender       Democrat    Independent    Republican    Total
(Frequency,
 Expected)
Female       279         73             225           577
             261.42      70.653         244.93
Male         165         47             191           403
             182.58      49.347         171.07
Total        444         120            416           980

Output

Statistics for Table of Gender by Party

Statistic                       DF    Value     Prob
-----------------------------------------------------
Chi-Square                      2     7.0095    0.0301
Likelihood Ratio Chi-Square     2     7.0026    0.0302
Mantel-Haenszel Chi-Square      1     6.7581    0.0093
Phi Coefficient                       0.0846
Contingency Coefficient               0.0843
Cramer's V                            0.0846

Sample Size = 980

R Codes

> gendergap <- matrix(c(279, 73, 225, 165, 47, 191),
                      byrow = TRUE, ncol = 3)
> dimnames(gendergap) <- list(Gender = c("Female", "Male"),
                              PartyID = c("Democrat", "Independent",
                                          "Republican"))
> gendergap
        PartyID
Gender   Democrat Independent Republican
  Female      279          73        225
  Male        165          47        191

R Codes

> chisq.test(gendergap)

        Pearson's Chi-squared test

data:  gendergap
X-squared = 7.0095, df = 2, p-value = 0.03005

An Alternative Way

> Gender <- c("Female", "Female", "Female", "Male", "Male", "Male")
> Party <- c("Democrat", "Independent", "Republican",
             "Democrat", "Independent", "Republican")
> count <- c(279, 73, 225, 165, 47, 191)
> gender1 <- data.frame(Gender, Party, count)
> gender <- xtabs(count ~ Gender + Party, data = gender1)
> gender
> summary(gender)

Output

        Party
Gender   Democrat Independent Republican
  Female      279          73        225
  Male        165          47        191

Call: xtabs(formula = count ~ Gender + Party, data = gender1)
Number of cases in table: 980
Number of factors: 2
Test for independence of all factors:
        Chisq = 7.01, df = 2, p-value = 0.03005

Table of Expected Cell Counts

> rowsum <- apply(gender, 1, sum)    # row totals
> colsum <- apply(gender, 2, sum)    # column totals
> n <- sum(gender)
> gd <- outer(rowsum, colsum / n)    # outer product; R carries the names over

Table of Expected Cell Counts

> gd
        Democrat Independent Republican
Female  261.4163    70.65306   244.9306
Male    182.5837    49.34694   171.0694

Residuals (p. 31)

To better understand the nature of the evidence against $H_0$, a cell-by-cell comparison of the observed and estimated expected frequencies is necessary.

Define the adjusted residuals
$$r_{ij} = \frac{n_{ij} - \hat\mu_{ij}}{\sqrt{\hat\mu_{ij}(1 - p_{i+})(1 - p_{+j})}}$$

If $H_0$ is true, each $r_{ij}$ has a large-sample standard normal distribution. If $|r_{ij}|$ exceeds about 2 in a cell, this indicates lack of fit of $H_0$ in that cell. The sign also describes the nature of the association.

Computing Residuals in R

> rowp <- rowsum / n                 # row marginal proportions
> colp <- colsum / n                 # column marginal proportions
> pd <- outer(1 - rowp, 1 - colp)
> resid <- (gender - gd) / sqrt(gd * pd)
> resid

Residuals Output

        Party
Gender     Democrat  Independent  Republican
  Female  2.2931603    0.4647941  -2.6177798
  Male   -2.2931603   -0.4647941   2.6177798
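The same adjusted residuals are also available without the manual computation: the stdres component returned by chisq.test standardizes each cell by (observed - expected)/sqrt(V) with the same cell variance, so it reproduces the table above:

> chisq.test(gendergap)$stdres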

Some Comments (p. 33)

Pearson's $\chi^2$ tests only indicate the degree of evidence for an association; they cannot answer other questions, such as the nature of the association.

These $\chi^2$ tests are not always applicable. We need large data sets to apply them; the approximation is often poor when $n/(IJ) < 5$.

The values of $\chi^2$ and $G^2$ do not depend on the ordering of the rows or columns. Thus we ignore some information when the data are ordinal.

Testing Independence For Ordinal Data (p. 34)

For ordinal data, it is important to look for particular types of association when there is dependence.

It is quite common to assume that as the level of X increases, responses on Y tend to increase, or responses on Y tend to decrease, toward higher levels of X.

The simplest and most common analysis assigns scores to the categories and measures the degree of linear trend or correlation. The method used is known as the Mantel-Haenszel chi-square test (Mantel and Haenszel 1959).

Linear Trend Alternative to Independence

Let $u_1 \le u_2 \le \cdots \le u_I$ denote scores for the rows, and $v_1 \le v_2 \le \cdots \le v_J$ denote scores for the columns. The scores have the same ordering as the category levels.

Define the correlation between X and Y as
$$r = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J} u_i v_j n_{ij} - \left(\sum_{i=1}^{I} u_i n_{i+}\right)\left(\sum_{j=1}^{J} v_j n_{+j}\right)\big/n}{\sqrt{\left[\sum_{i=1}^{I} u_i^2 n_{i+} - \left(\sum_{i=1}^{I} u_i n_{i+}\right)^2\big/n\right]\left[\sum_{j=1}^{J} v_j^2 n_{+j} - \left(\sum_{j=1}^{J} v_j n_{+j}\right)^2\big/n\right]}}$$

Test For Linear Trend Alternative

Independence between the variables implies that the true value of the correlation equals zero. The larger the correlation is in absolute value, the farther the data fall from independence in this linear dimension.

A test statistic is given by $M^2 = (n-1)r^2$. For large samples, it has approximately a chi-squared distribution with 1 degree of freedom.

Infant Malformation and Mothers' Alcohol Consumption

                              Malformation
Alcohol Consumption    Absent (0)    Present (1)    Total
0                      17,066        48             17,114
<1                     14,464        38             14,502
1-2                    788           5              793
3-5                    126           1              127
>=6                    37            1              38

Infant Malformation and Mothers' Alcohol Consumption

Alcohol            Malformation                    Percent    Adjusted
Consumption    Absent (0)    Present (1)   Total   Present    Residual
0              17,066        48            17,114  0.28       -0.18
<1             14,464        38            14,502  0.26       -0.71
1-2            788           5             793     0.63        1.84
3-5            126           1             127     0.79        1.06
>=6            37            1             38      2.63        2.71

Example: Tests For Independence

Pearson's $\chi^2 = 12.1$, d.f. = 4, p-value = 0.02.

Likelihood-ratio test: $G^2 = 6.2$, d.f. = 4, p-value = 0.19.

The two tests give inconsistent signals. The percent present and the adjusted residuals suggest that there may be a linear trend.

Test For Linear Trend

Assign scores $v_1 = 0$, $v_2 = 1$ for the columns, and $u_1 = 0$, $u_2 = 0.5$, $u_3 = 1.5$, $u_4 = 4.0$, $u_5 = 7.0$ for the rows.

We obtain $r = 0.014$, $n = 32{,}574$, and $M^2 = 6.6$ with p-value = 0.01.

This suggests strong evidence of a linear trend between infant malformation and mothers' alcohol consumption.
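A minimal R sketch (not from the slides; variable names are illustrative) that reproduces these numbers by expanding the cell counts into raw row and column scores:

> malform <- matrix(c(17066, 48, 14464, 38, 788, 5, 126, 1, 37, 1),
                    byrow = TRUE, ncol = 2)
> u <- c(0, 0.5, 1.5, 4, 7)                # row scores
> v <- c(0, 1)                             # column scores
> x <- rep(u[row(malform)], malform)       # row score for each observation
> y <- rep(v[col(malform)], malform)       # column score for each observation
> M2 <- (sum(malform) - 1) * cor(x, y)^2   # M^2 = 6.57, cf. SAS 6.5699
> pchisq(M2, df = 1, lower.tail = FALSE)   # p-value = 0.0104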

SAS Codes

data infants;
  input malform alcohol count @@;
  datalines;
1 0 17066  2 0 48
1 0.5 14464  2 0.5 38
1 1.5 788  2 1.5 5
1 4.0 126  2 4.0 1
1 7.0 37  2 7.0 1
;
run;

proc format;
  value malform 2='Present' 1='Absent';
  value alcohol 0='0' 0.5='<1' 1.5='1-2' 4.0='3-5' 7.0='>=6';
run;

SAS Codes

proc freq data=infants;
  format malform malform. alcohol alcohol.;
  weight count;
  tables alcohol*malform / chisq cmh1 norow nocol nopercent;
run;

Partial Output

Statistic                       DF    Value      Prob
------------------------------------------------------
Chi-Square                      4     12.0821    0.0168
Likelihood Ratio Chi-Square     4     6.2020     0.1846
Mantel-Haenszel Chi-Square      1     6.5699     0.0104
Phi Coefficient                       0.0193
Contingency Coefficient               0.0193
Cramer's V                            0.0193

Partial Output

Cochran-Mantel-Haenszel Statistics (Based on Table Scores)

Statistic    Alternative Hypothesis    DF    Value     Prob
------------------------------------------------------------
1            Nonzero Correlation       1     6.5699    0.0104

Total Sample Size = 32574

Notes

The correlation $r$ has limited use as a descriptive measure of tables.

Different choices of monotone scores usually give similar results. However, this may not hold when the data are very unbalanced, i.e. when some categories have many more observations than others.

If we had taken (1, 2, 3, 4, 5) as the row scores in our example, then $M^2 = 1.83$ with p-value = 0.18 gives a much weaker conclusion (see the recomputation below).

It is usually better to use one's own judgment and select scores that reflect the distances between the categories.
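Continuing the earlier R sketch (reusing malform and y from above), the sensitivity to the choice of scores can be checked directly; note that equally spaced scores give the same correlation whether coded 1-5 or 0-4:

> x <- rep(c(1, 2, 3, 4, 5)[row(malform)], malform)  # equally spaced row scores
> M2 <- (sum(malform) - 1) * cor(x, y)^2             # M^2 = 1.83
> pchisq(M2, df = 1, lower.tail = FALSE)             # p-value = 0.18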

SAS Codes

data infantsx;
  input malform alcoholx count @@;
  datalines;
1 0 17066  2 0 48
1 1 14464  2 1 38
1 2 788  2 2 5
1 3 126  2 3 1
1 4 37  2 4 1
;
run;

proc freq data=infantsx;
  weight count;
  tables alcoholx*malform / cmh1 norow nocol nopercent;
run;

Partial Output

Cochran-Mantel-Haenszel Statistics (Based on Table Scores)

Statistic    Alternative Hypothesis    DF    Value     Prob
------------------------------------------------------------
1            Nonzero Correlation       1     1.8278    0.1764

Total Sample Size = 32574

Fisher's Tea Tasting Experiment (p. 39)

                Guess Poured First
Poured First    Milk    Tea    Total
Milk            3       1      4
Tea             1       3      4
Total           4       4      8

Example: Tea Tasting

We want to test whether she can guess accurately, i.e. to test $H_0: \theta = 1$ against $H_1: \theta > 1$.

We cannot use the previously discussed tests, as we have a very small sample size.

Fisher's Exact Test

For a $2 \times 2$ table, under the assumption of independence, i.e. $\theta = 1$, the conditional distribution of $n_{11}$ given the row and column totals is hypergeometric.

For given row and column marginal totals, the value of $n_{11}$ determines the other three cell counts. Thus the hypergeometric formula expresses probabilities for the four cell counts in terms of $n_{11}$ alone.

Fisher's Exact Test

When $\theta = 1$, the probability of a particular value $n_{11}$ for that count equals
$$p(n_{11}) = \frac{\dbinom{n_{1+}}{n_{11}}\dbinom{n_{2+}}{n_{+1}-n_{11}}}{\dbinom{n}{n_{+1}}}$$

To test independence, the p-value is the sum of the hypergeometric probabilities for outcomes at least as favorable to the alternative hypothesis as the observed outcome.

Example: Tea Tasting

Given the row and column totals, the outcomes at least as favorable as the observed data are $n_{11} = 3$ and $n_{11} = 4$. Hence
$$p(3) = \frac{\dbinom{4}{3}\dbinom{4}{1}}{\dbinom{8}{4}} = \frac{16}{70} = 0.2286, \qquad p(4) = \frac{\dbinom{4}{4}\dbinom{4}{0}}{\dbinom{8}{4}} = \frac{1}{70} = 0.0143.$$

Example: Tea Tasting

Therefore, p-value $= p(3) + p(4) = 0.243$.

There is not much evidence against the null hypothesis of independence: the experiment did not establish an association between the actual order of pouring and the guess.
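These hypergeometric probabilities can be reproduced in R with dhyper, whose arguments are, in order: successes drawn, successes in the population, failures in the population, and number of draws:

> dhyper(3, 4, 4, 4)          # p(3) = 16/70 = 0.2286
> dhyper(4, 4, 4, 4)          # p(4) = 1/70  = 0.0143
> sum(dhyper(3:4, 4, 4, 4))   # one-sided p-value = 0.2429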

SAS Codes: Exact Test

data tea;
  input poured $ guess $ count @@;
  datalines;
Milk Milk 3  Milk Tea 1  Tea Milk 1  Tea Tea 3
;

proc freq data=tea order=data;
  weight count;
  tables poured*guess / exact;
run;

Partial Output

Fisher's Exact Test
----------------------------------
Cell (1,1) Frequency (F)         3
Left-sided Pr <= F          0.9857
Right-sided Pr >= F         0.2429

Table Probability (P)       0.2286
Two-sided Pr <= P           0.4857

Sample Size = 8

R Codes

> Poured <- c("Milk", "Milk", "Tea", "Tea")
> Guess <- c("Milk", "Tea", "Milk", "Tea")
> count <- c(3, 1, 1, 3)
> teadata <- data.frame(Poured, Guess, count)
> tea <- xtabs(count ~ Poured + Guess, data = teadata)
> fisher.test(tea, alternative = "greater")

Output

        Fisher's Exact Test for Count Data

data:  tea
p-value = 0.2429
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
 0.3135693       Inf
sample estimates:
odds ratio
  6.408309