ST3241 Categorical Data Analysis I Two-way Contingency Tables. Odds Ratio and Tests of Independence

Prudence Underwood
5 years ago

Inference For Odds Ratio (p. 24)
For small to moderate sample sizes, the distribution of the sample odds ratio $\hat{\theta}$ is highly skewed.
For θ = 1, $\hat{\theta}$ cannot be much smaller than θ (it is bounded below by 0), but it can be much larger than θ with non-negligible probability.
We therefore work with the log odds ratio, log θ.
Independence of X and Y implies log θ = 0.

Log Odds Ratio
The log odds ratio is symmetric about zero in the sense that reversing the rows or reversing the columns changes only its sign.
The sample log odds ratio, $\log\hat{\theta}$, has a less skewed distribution and is well approximated by the normal distribution.
The asymptotic standard error of $\log\hat{\theta}$ is
$\mathrm{ASE}(\log\hat{\theta}) = \sqrt{\frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}}$

Confidence Intervals
A large-sample confidence interval for log θ is
$\log\hat{\theta} \pm z_{\alpha/2}\,\mathrm{ASE}(\log\hat{\theta})$
A large-sample confidence interval for θ is
$\exp[\log\hat{\theta} \pm z_{\alpha/2}\,\mathrm{ASE}(\log\hat{\theta})]$

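Example: Aspirin Usage
Sample odds ratio: $\hat{\theta} = 1.832$.
Sample log odds ratio: $\log\hat{\theta} = \log(1.832) = 0.605$.
ASE of $\log\hat{\theta}$ = 0.123.
A 95% confidence interval for log θ is $0.605 \pm 1.96 \times 0.123$, or (0.365, 0.846).
The corresponding confidence interval for θ is $(e^{0.365}, e^{0.846})$, or (1.44, 2.33).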

Recall SAS Output
Estimates of the Relative Risk (Row1/Row2)
Type of Study                Value    95% Confidence Limits
Case-Control (Odds Ratio)
Cohort (Col1 Risk)
Cohort (Col2 Risk)
Sample Size =

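A Simple R Function For Odds Ratio
> odds.ratio <- function(x, pad.zeros = FALSE, conf.level = 0.95) {
    # x is a 2 x 2 table of counts
    # Optionally add 0.5 to every cell when the table contains a zero
    if (pad.zeros) {
      if (any(x == 0)) x <- x + 0.5
    }
    theta <- x[1,1] * x[2,2] / (x[2,1] * x[1,2])
    ASE <- sqrt(sum(1/x))   # asymptotic standard error of log(theta)
    CI <- exp(log(theta) + c(-1,1) * qnorm(0.5 * (1 + conf.level)) * ASE)
    list(estimator = theta, ASE = ASE, conf.interval = CI, conf.level = conf.level)
  }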

Notes (p. 25)
Recall the formula for the sample odds ratio:
$\hat{\theta} = \frac{n_{11} n_{22}}{n_{12} n_{21}}$
The sample odds ratio is 0 or ∞ if any $n_{ij} = 0$, and it is undefined if both entries in a row or column are zero.
Consider the slightly modified formula
$\tilde{\theta} = \frac{(n_{11} + 0.5)(n_{22} + 0.5)}{(n_{12} + 0.5)(n_{21} + 0.5)}$
In the ASE formula also, the $n_{ij}$ are replaced by $n_{ij} + 0.5$.
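A minimal sketch of the zero-cell adjustment, in R. The 2 x 2 counts below are hypothetical and chosen only so that one cell is empty; the variable names are illustrative, not from the course material.

```r
# Hypothetical 2x2 table with one empty cell (illustrative counts only)
tab <- matrix(c(12, 0,
                 5, 9), nrow = 2, byrow = TRUE)

# Unadjusted sample odds ratio: n11*n22 / (n12*n21); here the denominator is 0
theta_hat <- tab[1, 1] * tab[2, 2] / (tab[1, 2] * tab[2, 1])

# Adjusted estimator: add 0.5 to every cell count
adj <- tab + 0.5
theta_tilde <- adj[1, 1] * adj[2, 2] / (adj[1, 2] * adj[2, 1])

# ASE of the log odds ratio, computed from the same adjusted counts
ase_tilde <- sqrt(sum(1 / adj))

theta_hat     # Inf
theta_tilde   # finite
ase_tilde     # finite
```

The adjusted estimate and its standard error stay finite, so a confidence interval can still be formed from them.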

Observations
A sample odds ratio of 1.832 does not mean that $p_1$ is 1.832 times $p_2$.
A simple relation:
$\text{Odds Ratio} = \frac{p_1/(1-p_1)}{p_2/(1-p_2)} = \text{Relative Risk} \times \frac{1-p_2}{1-p_1}$
If $p_1$ and $p_2$ are both close to 0, the factor $(1-p_2)/(1-p_1)$ is close to 1, so the odds ratio and the relative risk take similar values.
This relationship is useful when the relative risk cannot be estimated directly but the odds ratio can.
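The relation between the odds ratio and the relative risk can be checked numerically. The sketch below uses hypothetical probabilities (not data from the course) to show that the two measures nearly agree when both probabilities are small and separate when they are not.

```r
# Hypothetical success probabilities in two groups (illustrative only)
p1 <- 0.02
p2 <- 0.01

relative_risk <- p1 / p2
odds_ratio <- (p1 / (1 - p1)) / (p2 / (1 - p2))

# Identity: odds ratio = relative risk * (1 - p2) / (1 - p1)
identity_rhs <- relative_risk * (1 - p2) / (1 - p1)
all.equal(odds_ratio, identity_rhs)

# With larger probabilities the two measures separate
q1 <- 0.6
q2 <- 0.3
rr_large <- q1 / q2                              # 2
or_large <- (q1 / (1 - q1)) / (q2 / (1 - q2))    # 3.5
c(rr = rr_large, or = or_large)
```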

Example: Smoking Status and Myocardial Infarction
Table: Ever Smoker (Yes, No) by Myocardial Infarction versus Controls.
Odds ratio = ? (3.8222)
How do we get the relative risk? (2.4152)

Chi-Squared Tests (p.27)
Consider testing the null hypothesis $H_0$ that the cell probabilities equal certain fixed values $\{\pi_{ij}\}$.
Let $\{n_{ij}\}$ be the cell frequencies and n the total sample size.
Then $\mu_{ij} = n\pi_{ij}$ are the expected cell frequencies under $H_0$.
Pearson's (1900) chi-squared test statistic is
$\chi^2 = \sum_{i,j} \frac{(n_{ij} - \mu_{ij})^2}{\mu_{ij}}$

Some Properties
This statistic takes its minimum value of zero when all $n_{ij} = \mu_{ij}$.
For a fixed sample size, greater differences between $\{n_{ij}\}$ and $\{\mu_{ij}\}$ produce larger $\chi^2$ values and stronger evidence against $H_0$.
For large sample sizes, the $\chi^2$ statistic has approximately a chi-squared distribution with appropriate degrees of freedom.

Likelihood-Ratio Test (p. 28)
The likelihood ratio is
$\Lambda = \frac{\text{maximum likelihood when } H_0 \text{ is true}}{\text{maximum likelihood when parameters are unrestricted}}$
In mathematical notation, if L(θ) denotes the likelihood function with θ as the set of parameters, and the null hypothesis is $H_0: \theta \in \Theta_0$ against $H_1: \theta \in \Theta_1$, the likelihood ratio is
$\Lambda = \frac{\sup_{\theta \in \Theta_0} L(\theta)}{\sup_{\theta \in \Theta_0 \cup \Theta_1} L(\theta)}$

Properties
The likelihood ratio cannot exceed 1.
A small likelihood ratio indicates deviation from $H_0$.
The likelihood-ratio test statistic is $-2\log\Lambda$, which has a chi-squared distribution with appropriate degrees of freedom for large samples.
For a two-way contingency table, this statistic reduces to
$G^2 = 2\sum_{i,j} n_{ij} \log\left(\frac{n_{ij}}{\mu_{ij}}\right)$
The test statistics $\chi^2$ and $G^2$ have the same large-sample distribution under the null hypothesis.
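A short derivation sketch of why $-2\log\Lambda$ takes the $G^2$ form. It assumes multinomial sampling with fixed total n, which the slides do not state explicitly here; the notation follows the slides.

```latex
% Multinomial likelihood (up to a constant that cancels in the ratio):
%   L(\pi) \propto \prod_{i,j} \pi_{ij}^{n_{ij}}
% Unrestricted maximum: \hat{\pi}_{ij} = n_{ij}/n.
% Under H_0 the probabilities are \pi_{ij} = \mu_{ij}/n.
\begin{align*}
\Lambda
  &= \frac{\prod_{i,j} (\mu_{ij}/n)^{n_{ij}}}{\prod_{i,j} (n_{ij}/n)^{n_{ij}}}
   = \prod_{i,j} \left(\frac{\mu_{ij}}{n_{ij}}\right)^{n_{ij}}, \\
-2\log\Lambda
  &= -2\sum_{i,j} n_{ij}\log\left(\frac{\mu_{ij}}{n_{ij}}\right)
   = 2\sum_{i,j} n_{ij}\log\left(\frac{n_{ij}}{\mu_{ij}}\right)
   = G^2 .
\end{align*}
```

Because $\Lambda \le 1$, each form makes clear that $G^2 \ge 0$, with equality when every $n_{ij} = \mu_{ij}$.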

Tests of Independence (p.30)
To test: $H_0: \pi_{ij} = \pi_{i+}\pi_{+j}$ for all i and j.
Equivalently, $H_0: \mu_{ij} = n\pi_{ij} = n\pi_{i+}\pi_{+j}$.
Usually, $\{\pi_{i+}\}$ and $\{\pi_{+j}\}$ are unknown. We estimate them using the sample proportions:
$\hat{\mu}_{ij} = n p_{i+} p_{+j} = n \frac{n_{i+} n_{+j}}{n^2} = \frac{n_{i+} n_{+j}}{n}$
These $\{\hat{\mu}_{ij}\}$ are called estimated expected cell frequencies.

Test Statistics
Pearson's chi-squared test statistic:
$\chi^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{(n_{ij} - \hat{\mu}_{ij})^2}{\hat{\mu}_{ij}}$
Likelihood-ratio test statistic:
$G^2 = 2\sum_{i=1}^{I}\sum_{j=1}^{J} n_{ij} \log\left(\frac{n_{ij}}{\hat{\mu}_{ij}}\right)$
Both have a large-sample chi-squared distribution with $(I-1)(J-1)$ degrees of freedom.

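Party Identification By Gender (p.31)
Estimated expected frequencies under independence are shown in parentheses.
                     Party Identification
Gender     Democrat       Independent    Republican     Total
Females    279 (261.4)    73 (70.7)      225 (244.9)    577
Males      165 (182.6)    47 (49.3)      191 (171.1)    403
Total      444            120            416            980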

Example: Continued
The test statistics are $\chi^2 = 7.01$ and $G^2 = 7.00$.
Degrees of freedom = $(I-1)(J-1) = (2-1)(3-1) = 2$.
p-value = 0.03.
Thus, both test statistics suggest that party identification and gender are associated.
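The statistics quoted for this example can be reproduced directly from the definitions. The following is a minimal sketch using the observed counts from the party identification data; the variable names are illustrative.

```r
# Observed counts from the party identification example
obs <- matrix(c(279, 73, 225,
                165, 47, 191), nrow = 2, byrow = TRUE,
              dimnames = list(Gender = c("Female", "Male"),
                              Party = c("Democrat", "Independent", "Republican")))

n <- sum(obs)

# Estimated expected frequencies: n_i+ * n_+j / n
expected <- outer(rowSums(obs), colSums(obs)) / n

X2 <- sum((obs - expected)^2 / expected)    # Pearson statistic
G2 <- 2 * sum(obs * log(obs / expected))    # likelihood-ratio statistic
dof <- (nrow(obs) - 1) * (ncol(obs) - 1)    # (I - 1)(J - 1)

# Upper-tail chi-squared probabilities
p_X2 <- pchisq(X2, dof, lower.tail = FALSE)
p_G2 <- pchisq(G2, dof, lower.tail = FALSE)

round(c(X2 = X2, G2 = G2, dof = dof, p_X2 = p_X2, p_G2 = p_G2), 4)
```

The formula for $G^2$ as written requires every observed count to be positive; a cell with $n_{ij} = 0$ contributes 0 to the sum and would need to be handled separately.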

SAS Codes: Read The Data
data Survey;
  length Party $ 12;
  input Gender $ Party $ count;
  datalines;
Female Democrat 279
Female Independent 73
Female Republican 225
Male Democrat 165
Male Independent 47
Male Republican 191
;
run;

SAS Codes: Use Proc Freq
proc freq data=survey order=data;
  weight count;
  tables gender*party / chisq expected nopercent norow nocol;
run;

Output
The FREQ Procedure
Table of Gender by Party
Gender       Party
Frequency
Expected     Democrat   Independent   Republican   Total
Female
Male
Total

Output
Statistics for Table of Gender by Party
Statistic                      DF    Value    Prob
Chi-Square
Likelihood Ratio Chi-Square
Mantel-Haenszel Chi-Square
Phi Coefficient
Contingency Coefficient
Cramer's V
Sample Size =

R Codes
> gendergap <- matrix(c(279,73,225,165,47,191), byrow=TRUE, ncol=3)
> dimnames(gendergap) <- list(Gender=c("Female","Male"), PartyID=c("Democrat","Independent","Republican"))
> gendergap
        PartyID
Gender   Democrat Independent Republican
  Female      279          73        225
  Male        165          47        191

R Codes
> chisq.test(gendergap)
        Pearson's Chi-squared test
data:  gendergap
X-squared = , df = 2, p-value =

An Alternative Way
> Gender <- c("Female","Female","Female","Male","Male","Male")
> Party <- c("Democrat","Independent","Republican","Democrat","Independent","Republican")
> count <- c(279,73,225,165,47,191)
> gender1 <- data.frame(Gender, Party, count)
> gender <- xtabs(count ~ Gender + Party, data=gender1)
> gender
> summary(gender)

Output
        Party
Gender   Democrat Independent Republican
  Female
  Male
Call: xtabs(formula = count ~ Gender + Party, data = gender1)
Number of cases in table: 980
Number of factors: 2
Test for independence of all factors:
        Chisq = 7.01, df = 2, p-value =

Table of Expected Cell Counts
> rowsum <- apply(gender, 1, sum)
> colsum <- apply(gender, 2, sum)
> n <- sum(gender)
> gd <- outer(rowsum, colsum/n)   # in R, outer() carries the names over as dimnames

Table of Expected Cell Counts
> gd
         Democrat Independent Republican
Female
Male

Residuals (p.31)
To understand better the nature of the evidence against $H_0$, a cell-by-cell comparison of observed and estimated expected frequencies is necessary.
Define the adjusted residuals
$r_{ij} = \frac{n_{ij} - \hat{\mu}_{ij}}{\sqrt{\hat{\mu}_{ij}(1 - p_{i+})(1 - p_{+j})}}$
If $H_0$ is true, each $r_{ij}$ has a large-sample standard normal distribution.
If $|r_{ij}|$ in a cell exceeds about 2, it indicates lack of fit of $H_0$ in that cell.
The sign of the residual also describes the nature of the association.

Computing Residuals in R
> rowp <- rowsum/n   # row marginal proportions
> colp <- colsum/n   # column marginal proportions
> pd <- outer(1-rowp, 1-colp)
> resid <- (gender-gd)/sqrt(gd*pd)
> resid

Residuals Output
        Party
Gender   Democrat Independent Republican
  Female
  Male

Some Comments (p.33)
Pearson's $\chi^2$ and $G^2$ tests only indicate the degree of evidence for an association; they cannot answer other questions, such as the nature of the association.
These chi-squared tests are not always applicable. They are large-sample tests, and the approximation is often poor when $n/(IJ) < 5$.
The values of $\chi^2$ and $G^2$ do not depend on the ordering of the rows or columns. Thus we ignore some information when the data are ordinal.

Testing Independence For Ordinal Data (p.34)
For ordinal data, it is important to look for types of association when there is dependence.
It is quite common to assume that, as the level of X increases, responses on Y tend to increase toward higher levels, or tend to decrease toward lower levels.
The simplest and most common analysis assigns scores to the categories and measures the degree of linear trend or correlation.
The method used is known as the Mantel-Haenszel chi-square test (Mantel and Haenszel 1959).

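Linear Trend Alternative to Independence
Let $u_1 \le u_2 \le \dots \le u_I$ denote scores for the rows.
Let $v_1 \le v_2 \le \dots \le v_J$ denote scores for the columns.
The scores have the same ordering as the category levels.
Define the correlation between X and Y as
$r = \frac{\sum_{i=1}^{I}\sum_{j=1}^{J} u_i v_j n_{ij} - \left(\sum_{i=1}^{I} u_i n_{i+}\right)\left(\sum_{j=1}^{J} v_j n_{+j}\right)/n}{\sqrt{\left[\sum_{i=1}^{I} u_i^2 n_{i+} - \left(\sum_{i=1}^{I} u_i n_{i+}\right)^2/n\right]\left[\sum_{j=1}^{J} v_j^2 n_{+j} - \left(\sum_{j=1}^{J} v_j n_{+j}\right)^2/n\right]}}$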

Test For Linear Trend Alternative
Independence between the variables implies that the true value of the correlation equals zero.
The larger the correlation is in absolute value, the farther the data fall from independence in this linear dimension.
A test statistic is given by $M^2 = (n-1)r^2$.
For large samples, it has approximately a chi-squared distribution with 1 degree of freedom.
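A minimal sketch of the linear trend test, computed term by term from the formula for r. The 3 x 3 counts and the equally spaced scores below are hypothetical and serve only to illustrate the computation; they are not data from the course.

```r
# Hypothetical 3x3 table with ordered rows and columns (illustrative counts only)
tab <- matrix(c(20, 15, 10,
                15, 20, 15,
                10, 15, 25), nrow = 3, byrow = TRUE)
u <- c(1, 2, 3)   # row scores (an assumption; any monotone scores could be used)
v <- c(1, 2, 3)   # column scores (also an assumption)

n  <- sum(tab)
ni <- rowSums(tab)   # row totals n_i+
nj <- colSums(tab)   # column totals n_+j

# Numerator and denominator of r, following the formula term by term
num <- sum(outer(u, v) * tab) - sum(u * ni) * sum(v * nj) / n
den <- sqrt((sum(u^2 * ni) - sum(u * ni)^2 / n) *
            (sum(v^2 * nj) - sum(v * nj)^2 / n))
r <- num / den

M2 <- (n - 1) * r^2                                  # test statistic
p_value <- pchisq(M2, df = 1, lower.tail = FALSE)    # chi-squared with 1 df

c(r = r, M2 = M2, p_value = p_value)
```

The value of r here is the ordinary Pearson correlation of the scores when each observation is given the score of its row and column, so it can be cross-checked against `cor()` on the expanded data.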

Infant Malformation and Mother's Alcohol Consumption
Alcohol          Malformation
Consumption      Absent(0)    Present(1)    Total
0 17, ,114 <1 14, ,

Infant Malformation and Mother's Alcohol Consumption
Alcohol          Malformation                          Percent    Adjusted
Consumption      Absent(0)    Present(1)    Total      Present    Residual
0 17, , <1 14, ,

Example: Tests For Independence
Pearson's $\chi^2 = 12.1$, d.f. = 4, p-value = .02.
Likelihood-ratio test: $G^2 = 6.2$, d.f. = 4, p-value = .19.
The two tests give inconsistent signals.
The percentages present and the adjusted residuals suggest that there may be a linear trend.

Test For Linear Trend
Assign column scores $v_1 = 0$, $v_2 = 1$ and row scores $u_1 = 0$, $u_2 = 0.5$, $u_3 = 1.5$, $u_4 = 4.0$ and $u_5 = 7.0$.
We have r = 0.014, n = 32,574 and $M^2 = 6.6$ with p-value = 0.01.
This suggests strong evidence of a linear trend in infant malformation with the mother's alcohol consumption.

SAS Codes
data infants;
  input malform alcohol count;
  datalines;
;
run;
proc format;
  value malform 2='Present' 1='Absent';
  value alcohol 0='0' 0.5='<1' 1.5='1-2' 4.0='3-5' 7.0='>=6';
run;

SAS Codes
proc freq data = infants;
  format malform malform. alcohol alcohol.;
  weight count;
  tables alcohol*malform / chisq cmh1 norow nocol nopercent;
run;

Partial Output
Statistic                      DF    Value    Prob
Chi-Square
Likelihood Ratio Chi-Square
Mantel-Haenszel Chi-Square
Phi Coefficient
Contingency Coefficient
Cramer's V

Partial Output
Cochran-Mantel-Haenszel Statistics (Based on Table Scores)
Statistic    Alternative Hypothesis    DF    Value    Prob
             Nonzero Correlation
Total Sample Size =

Notes
The correlation r has limited use as a descriptive measure for tables.
Different choices of monotone scores usually give similar results.
However, this may not hold when the data are very unbalanced, i.e. when some categories have many more observations than others.
If we had taken (1, 2, 3, 4, 5) as the row scores in our example, then $M^2 = 1.83$ with p-value = 0.18, a much weaker conclusion.
It is usually better to use one's own judgment and select scores that reflect the distances between categories.

SAS Codes
data infantsx;
  input malform alcoholx count;
  datalines;
;
run;
proc freq data = infantsx;
  weight count;
  tables alcoholx*malform / cmh1 norow nocol nopercent;
run;

Partial Output
Cochran-Mantel-Haenszel Statistics (Based on Table Scores)
Statistic    Alternative Hypothesis    DF    Value    Prob
             Nonzero Correlation
Total Sample Size =

Fisher's Tea Tasting Experiment (p.39)
                Guess Poured First
Poured First    Milk    Tea    Total
Milk            3       1      4
Tea             1       3      4
Total           4       4      8

Example: Tea Tasting
The aim is to test whether she can tell accurately which was poured first.
We test $H_0: \theta = 1$ against $H_1: \theta > 1$.
We cannot use the previously discussed tests because the sample size is very small.

Fisher's Exact Test
For a 2 × 2 table, under the assumption of independence, i.e. θ = 1, the conditional distribution of $n_{11}$ given the row and column totals is hypergeometric.
For given row and column marginal totals, the value of $n_{11}$ determines the other three cell counts.
Thus, the hypergeometric formula expresses probabilities for the four cell counts in terms of $n_{11}$ alone.

Fisher's Exact Test
When θ = 1, the probability of a particular value $n_{11}$ for that count equals
$p(n_{11}) = \frac{\binom{n_{1+}}{n_{11}}\binom{n_{2+}}{n_{+1} - n_{11}}}{\binom{n}{n_{+1}}}$
To test independence, the p-value is the sum of the hypergeometric probabilities for outcomes at least as favorable to the alternative hypothesis as the observed outcome.

Example: Tea Tasting
Given the row and column totals, the outcomes at least as favorable to the alternative as the observed data are $n_{11} = 3$ and $n_{11} = 4$. Hence,
$p(3) = \frac{\binom{4}{3}\binom{4}{1}}{\binom{8}{4}} = \frac{16}{70} = .2286$,
$p(4) = \frac{\binom{4}{4}\binom{4}{0}}{\binom{8}{4}} = \frac{1}{70} = .0143$.

Example: Tea Tasting
Therefore, p-value = P(3) + P(4) = .243.
There is not much evidence against the null hypothesis of independence.
The experiment did not establish an association between the actual order of pouring and the guess.
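The hypergeometric probabilities for the tea-tasting table can be computed directly from binomial coefficients. This is a minimal sketch in R; the helper `p()` and the variable names are illustrative, and the counts are those of the tea-tasting data.

```r
# Tea-tasting table: rows = poured first, columns = guess
tea <- matrix(c(3, 1,
                1, 3), nrow = 2, byrow = TRUE,
              dimnames = list(Poured = c("Milk", "Tea"),
                              Guess = c("Milk", "Tea")))

n1 <- sum(tea[1, ])   # n_1+ : cups with milk poured first
n2 <- sum(tea[2, ])   # n_2+ : cups with tea poured first
m1 <- sum(tea[, 1])   # n_+1 : cups guessed as milk first

# Hypergeometric probability of a given n11 under theta = 1
p <- function(n11) choose(n1, n11) * choose(n2, m1 - n11) / choose(n1 + n2, m1)

p3 <- p(3)            # 16/70
p4 <- p(4)            # 1/70
p_value <- p3 + p4    # one-sided P(n11 >= 3) = 17/70

c(p3 = p3, p4 = p4, p_value = p_value)

# Same tail probability from the built-in hypergeometric distribution
phyper(2, n1, n2, m1, lower.tail = FALSE)
```

The result should agree with the one-sided p-value reported by `fisher.test(tea, alternative = "greater")`.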

SAS Codes: Exact Test
data tea;
  input poured $ guess $ count;
  datalines;
Milk Milk 3
Milk Tea 1
Tea Milk 1
Tea Tea 3
;
proc freq data=tea order=data;
  weight count;
  tables poured*guess / exact;
run;

Partial Output
Fisher's Exact Test
Cell (1,1) Frequency (F)      3
Left-sided Pr <= F
Right-sided Pr >= F
Table Probability (P)
Two-sided Pr <= P
Sample Size = 8

R Codes
> Poured <- c("Milk","Milk","Tea","Tea")
> Guess <- c("Milk","Tea","Milk","Tea")
> count <- c(3,1,1,3)
> teadata <- data.frame(Poured, Guess, count)
> tea <- xtabs(count ~ Poured + Guess, data=teadata)
> fisher.test(tea, alternative="greater")

Output
        Fisher's Exact Test for Count Data
data:  tea
p-value =
alternative hypothesis: true odds ratio is greater than 1
95 percent confidence interval:
   Inf
sample estimates:
odds ratio
