STAT 526 Advanced Statistical Methodology


STAT 526 Advanced Statistical Methodology
Fall 2017
Lecture Note 7: Contingency Table

Outline
- Introduction to Contingency Tables
- Testing Independence in Two-Way Contingency Tables
- Modeling Ordinal Associations
- Correspondence Analysis
- Models for Matched Pairs
- Three-Way Contingency Tables

Dabao Zhang

Introduction to Contingency Tables

Contingency table: a table whose cells contain frequency counts of outcomes classified according to certain variables (Karl Pearson, 1904). Contingency tables are used to display relationships between categorical variables.

Two-way table: used to study the relationship between two categorical variables, e.g., X and Y.
- Suppose that X has I categories and Y has J categories.
- Classifying subjects on both variables gives I × J possible combinations, i.e., I × J cells in a rectangular table with I rows for the categories of X and J columns for the categories of Y.
- A contingency table with I rows and J columns is called an I × J (I-by-J) table.

Example: Cross-Classification of Smoking by Lung Cancer

            Lung Cancer
Smoking    Cases   Controls   Total
Yes          688        650    1338
No            21         59      80
Total        709        709    1418

Three-way table: used to study the relationships among three categorical variables, e.g., X, Y and Z.
- Suppose that X has I categories, Y has J categories, and Z has K categories.
- Classifying subjects on all possible combinations gives an I × J × K contingency table.

Example: Alcohol, Cigarette, and Marijuana Use for High School Seniors

Alcohol Use   Cigarette Use   Marijuana Use: Yes   Marijuana Use: No
Yes           Yes
Yes           No
No            Yes                              3                  43
No            No

Testing Independence in Two-Way Contingency Tables

Multinomial Sampling

When the total sample size $n$ is fixed but the row and column totals are not, a multinomial sampling model applies. Usually both X and Y are response variables, so the joint distribution is used to describe their association:
$$P(X = i, Y = j) = \pi_{ij}, \quad i = 1,\ldots,I;\ j = 1,\ldots,J$$

Let $n_{ij}$ be the count in cell $(i,j)$. The probability mass function of the cell counts is
$$\frac{n!}{n_{11}!\cdots n_{IJ}!}\prod_{i=1}^{I}\prod_{j=1}^{J}\pi_{ij}^{n_{ij}} \;\Longrightarrow\; l = \sum_{i=1}^{I}\sum_{j=1}^{J} n_{ij}\log(\pi_{ij}) + \text{constant}$$

Independence of Categorical Variables

X and Y are independent $\iff \pi_{ij} = \pi_{i+}\pi_{+j}$, $i = 1,\ldots,I$, $j = 1,\ldots,J$

Marginal distributions: $P(X = i) = \pi_{i+}$, $P(Y = j) = \pi_{+j}$, where the subscript $+$ denotes the sum over that index.

Hypothesis test

$H_0: \pi_{ij} = \pi_{i+}\pi_{+j}$ for all $i$ and $j$
$H_a: \pi_{ij} \neq \pi_{i+}\pi_{+j}$ for some $i$ and $j$

- Under the full model, the MLE of $\pi_{ij}$ is $\hat\pi_{ij} = n_{ij}/n_{++}$.
- Under the null model, the MLEs are $\hat\pi_{i+} = n_{i+}/n_{++}$ and $\hat\pi_{+j} = n_{+j}/n_{++}$.
- The LRT (or deviance-based test) is
$$2\sum_{i=1}^{I}\sum_{j=1}^{J} n_{ij}\log\frac{n_{ij}\,n_{++}}{n_{i+}\,n_{+j}} \overset{asy.}{\sim} \chi^2_{(I-1)(J-1)}, \quad \text{under } H_0$$

Example: Cross-Classification of Smoking by Lung Cancer (Continued)
> y <- c(688,21,650,59);
> smoke <- gl(2,1,4,labels=c("yes","no"));   #gl: generate a factor with given levels
> cancer <- gl(2,2,labels=c("cases","controls"));
> lcancer <- data.frame(y,cancer,smoke); lcancer;
    y   cancer smoke
1 688    cases   yes
2  21    cases    no
3 650 controls   yes
4  59 controls    no

> (lcct <- xtabs(y~smoke+cancer));   #xtabs: create a contingency table
     cancer
smoke cases controls
  yes   688      650
  no     21       59
> (fpi <- prop.table(xtabs(y~smoke+cancer)))
     cancer
smoke cases controls
  yes
  no
> spi <- prop.table(xtabs(y~smoke));
> cpi <- prop.table(xtabs(y~cancer));
> (npi <- outer(spi,cpi))
     cancer
smoke cases controls
  yes
  no
> pchisq(2*sum(lcct*log(fpi/npi)),1,lower.tail=FALSE)
[1] e-06

Conclusion?

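2 × 2 Table

Independence of X and Y can be stated in terms of the odds ratio:
$$X \text{ and } Y \text{ are independent} \iff \theta = \frac{\pi_{11}/\pi_{12}}{\pi_{21}/\pi_{22}} = \frac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}} = 1$$

This is because, when $\theta = 1$ (similarly for the other $\pi_{ij}$),
$$\pi_{12} = \pi_{+1}\pi_{12} + \pi_{+2}\pi_{12} = (\pi_{11}\pi_{12} + \pi_{21}\pi_{12}) + \pi_{+2}\pi_{12} = (\pi_{11}\pi_{12} + \pi_{11}\pi_{22}) + \pi_{+2}\pi_{12} = \pi_{+2}\pi_{11} + \pi_{+2}\pi_{12} = \pi_{1+}\pi_{+2}$$

MLE of the odds ratio:
$$\hat\theta = \frac{n_{11}/n_{12}}{n_{21}/n_{22}} = \frac{n_{11}n_{22}}{n_{12}n_{21}}$$

Asymptotically, $\log(\hat\theta) \sim N(\log(\theta), \hat\sigma^2)$, where
$$\hat\sigma^2 = \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}$$

When some $n_{ij} = 0$, $\hat\theta$ is not a good estimator. It is amended by adding 0.5 to each cell count:
$$\tilde\theta = \frac{(n_{11}+0.5)(n_{22}+0.5)}{(n_{12}+0.5)(n_{21}+0.5)}$$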

Example: Cross-Classification of Smoking by Lung Cancer (Continued)
> (etheta <- lcct[1,1]*lcct[2,2]/(lcct[1,2]*lcct[2,1]))   # sample odds ratio
[1]
> (sele <- sqrt(sum(1/lcct)))   # standard error of log(etheta)
[1]
> log(etheta)+sele*c(-1.96,1.96)   # 95% CI for the log odds ratio
[1]

Conclusion?
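The interval above is on the log scale. A minimal sketch of the same calculation wrapped in a function, with the 0.5 amendment applied when a cell is zero, might look as follows. The function name `odds_ratio_ci` and its arguments are illustrative and not part of the notes; using the amended counts in the standard error as well is an assumption.

```r
# Sketch: Wald interval for the odds ratio of a 2x2 table.
# `odds_ratio_ci`, `level` and `add_half` are hypothetical names.
odds_ratio_ci <- function(tab, level = 0.95, add_half = any(tab == 0)) {
  stopifnot(all(dim(tab) == c(2, 2)))
  n <- if (add_half) tab + 0.5 else tab       # amended counts when a cell is zero
  theta <- n[1, 1] * n[2, 2] / (n[1, 2] * n[2, 1])
  se <- sqrt(sum(1 / n))                      # SE of log(theta-hat)
  z <- qnorm(1 - (1 - level) / 2)
  c(estimate = theta,
    lower = exp(log(theta) - z * se),         # back-transform to the odds-ratio scale
    upper = exp(log(theta) + z * se))
}

# Smoking by lung cancer counts from the notes
lcct <- matrix(c(688, 21, 650, 59), nrow = 2,
               dimnames = list(smoke = c("yes", "no"),
                               cancer = c("cases", "controls")))
res <- odds_ratio_ci(lcct)
print(res)
```

An interval that excludes 1 points to an association between smoking and lung cancer status.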

Independent (or Product) Multinomial Sampling

When the row totals, i.e., $n_{i+}$, $i = 1,\ldots,I$, are fixed, an independent multinomial sampling model applies. Usually X is an explanatory variable, and observations on a response Y occur separately at each setting of X, so the conditional distribution is used to describe their association:
$$P(Y = j \mid X = i) = \pi_{j|i}, \quad i = 1,\ldots,I;\ j = 1,\ldots,J$$

Let $n_{ij}$ be the count in cell $(i,j)$. The counts $\{n_{ij}, j = 1,\ldots,J\}$, satisfying $\sum_{j=1}^{J} n_{ij} = n_{i+}$, follow a multinomial distribution:
$$\frac{n_{i+}!}{n_{i1}!\cdots n_{iJ}!}\prod_{j=1}^{J}\pi_{j|i}^{n_{ij}}$$

Independence of Categorical Variables

X and Y are independent $\iff \pi_{j|1} = \cdots = \pi_{j|I}$, $j = 1,\ldots,J$

Independence is then often referred to as homogeneity of the conditional distributions.

$\pi_{ij} = \pi_{i+}\pi_{+j}$ for all $i$ and $j$ $\iff$ $\pi_{j|1} = \cdots = \pi_{j|I}$ for all $j$

($\Rightarrow$) $\pi_{j|i} = \pi_{ij}/\pi_{i+} = (\pi_{i+}\pi_{+j})/\pi_{i+} = \pi_{+j}$
($\Leftarrow$) Let $\pi_{j|i} = c_j$; then $\pi_{+j} = \sum_{i=1}^{I}\pi_{ij} = \sum_{i=1}^{I}\pi_{i+}c_j = c_j \;\Longrightarrow\; \pi_{ij} = \pi_{i+}\pi_{+j}$

Q: How to test the homogeneity of the conditional distributions?

                      Column
Row     1                       ...   J                       Total
1       π_11 (π_{1|1})          ...   π_1J (π_{J|1})          π_1+
...
I       π_I1 (π_{1|I})          ...   π_IJ (π_{J|I})          π_I+
Total   π_+1                    ...   π_+J                    π_++

Consider the new notation $\pi_j(x) = P(Y = j \mid X = x)$ $\Longrightarrow$ consider a model for multinomial responses!

Example: Cross-Classification of Smoking by Lung Cancer (Continued)
> (mnlc <- matrix(y,nrow=2))
     [,1] [,2]
[1,]  688  650
[2,]   21   59
> mnmod <- glm(mnlc~1,family=binomial);
> deviance(mnmod)
[1]
> 2*sum(lcct*log(fpi/npi))   # deviance in the multinomial sampling
[1]

Conclusion?

Poisson Sampling

Denote the count of cell $(i,j)$ as $Y_{ij}$. A Poisson sampling model assumes that the $Y_{ij}$ follow independent Poisson distributions with rates $\{\mu_{ij}\}$:
$$Y_{ij} \overset{ind.}{\sim} \text{Poisson}(\mu_{ij}) \;\Longrightarrow\; \sum_{i=1}^{I}\sum_{j=1}^{J} Y_{ij} \sim \text{Poisson}\Big(\sum_{i=1}^{I}\sum_{j=1}^{J}\mu_{ij}\Big)$$

Denote $\sum_{i=1}^{I}\sum_{j=1}^{J}\mu_{ij} = \mu_{++}$.

Given $\sum_{i=1}^{I}\sum_{j=1}^{J} Y_{ij} = n_{++}$, $(Y_{11},\ldots,Y_{ij},\ldots,Y_{IJ})$ follows a multinomial distribution with $E[Y_{ij} \mid n_{++}] = n_{++}\pi_{ij}$, $\pi_{ij} = \mu_{ij}/\mu_{++}$.

Independence of X and Y has the following form:
$$\log(\mu_{ij}) = \lambda + \alpha_i + \beta_j$$

- The above model is called the loglinear model of independence for two-way contingency tables: the log expected frequency is an additive function of a row effect $\alpha_i$ and a column effect $\beta_j$.
- An independence test is also a goodness-of-fit test of the above loglinear model.

Example: Cross-Classification of Smoking by Lung Cancer (Continued)
> pmod <- glm(y~smoke+cancer,family=poisson);
> summary(pmod)
...
Coefficients:
                Estimate Std. Error z value Pr(>|z|)
(Intercept)     6.506e e <2e-16 ***
smokeno         e e <2e-16 ***
cancercontrols  e e e
...
Null deviance: on 3 degrees of freedom
Residual deviance: on 1 degrees of freedom
AIC:

An alternative to the deviance-based test is Pearson's $X^2$ test:
$$X^2 = \sum_{i=1}^{I}\sum_{j=1}^{J}\frac{(y_{ij} - \hat\mu_{ij})^2}{\hat\mu_{ij}}$$
> emu <- npi*sum(lcct); sum((lcct-emu)^2/emu)
[1]

- Yates' continuity correction: move each difference $y_{ij} - \hat\mu_{ij}$ by 0.5 toward zero before squaring, i.e., use $|y_{ij} - \hat\mu_{ij}| - 0.5$.
- The deviance-based test is preferred to Pearson's $X^2$.
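The claim that the independence test is also a goodness-of-fit test of the loglinear model can be checked numerically. The sketch below recomputes the LRT statistic and Pearson's $X^2$ directly from their formulas and compares the former with the residual deviance of the Poisson model; the names `expected`, `G2` and `X2` are illustrative only.

```r
# Smoking by lung cancer counts from the notes
y      <- c(688, 21, 650, 59)
smoke  <- gl(2, 1, 4, labels = c("yes", "no"))
cancer <- gl(2, 2, labels = c("cases", "controls"))
tab    <- xtabs(y ~ smoke + cancer)

# Fitted counts under independence: n_i+ * n_+j / n_++
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)

G2 <- 2 * sum(tab * log(tab / expected))     # LRT (deviance) statistic
X2 <- sum((tab - expected)^2 / expected)     # Pearson statistic, no continuity correction

# Loglinear model of independence fitted by Poisson regression
pmod <- glm(y ~ smoke + cancer, family = poisson)

print(c(G2 = G2, glm_deviance = deviance(pmod), X2 = X2))
print(pchisq(c(G2 = G2, X2 = X2), df = df.residual(pmod), lower.tail = FALSE))
```

Both statistics are referred to a $\chi^2$ distribution with $(I-1)(J-1) = 1$ degree of freedom here.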

Hypergeometric Sampling

When both row and column margins are fixed, the appropriate sampling distribution is the hypergeometric. This situation is less common in practice.

When X and Y are independent, $\{n_{ij}\}$, given the row and column margins, follows the hypergeometric distribution
$$\frac{\big(\prod_{i=1}^{I} n_{i+}!\big)\big(\prod_{j=1}^{J} n_{+j}!\big)}{n_{++}!\,\prod_{i=1}^{I}\prod_{j=1}^{J} n_{ij}!}$$

An exact test of independence can be developed by defining an ordering of the tables.

For a 2 × 2 table, the hypergeometric distribution is
$$P(n_{11} = k) = \frac{\binom{n_{1+}}{k}\binom{n_{2+}}{n_{+1}-k}}{\binom{n_{++}}{n_{+1}}}, \quad \max(0, n_{1+} + n_{+1} - n_{++}) \le k \le \min(n_{1+}, n_{+1})$$

Fisher's exact test: the p-value equals the total probability of all outcomes at least as extreme as the one observed.

Example: Cross-Classification of Smoking by Lung Cancer (Continued)
> fisher.test(lcct)

        Fisher's Exact Test for Count Data

data:  lcct
p-value = 1.476e-05
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:

sample estimates:
odds ratio

> etheta*exp(sele*c(-1.96,1.96))   # CI based on asymptotic approximation
[1]
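To connect the hypergeometric formula with the test output, the two-sided p-value can be rebuilt from `dhyper`. This is a sketch under one common convention, in which "at least as extreme" means "no more probable than the observed table"; the helper name `fisher_p_2x2` is hypothetical.

```r
# Sketch: two-sided Fisher exact p-value for a 2x2 table from the
# hypergeometric distribution of n11 given both margins.
fisher_p_2x2 <- function(tab) {
  r1 <- sum(tab[1, ])                       # n_{1+}
  r2 <- sum(tab[2, ])                       # n_{2+}
  c1 <- sum(tab[, 1])                       # n_{+1}
  k  <- max(0, r1 + c1 - sum(tab)):min(r1, c1)    # support of n11
  p  <- dhyper(k, m = r1, n = r2, k = c1)         # P(n11 = k) under independence
  p_obs <- dhyper(tab[1, 1], m = r1, n = r2, k = c1)
  sum(p[p <= p_obs * (1 + 1e-7)])           # small tolerance guards against rounding ties
}

lcct <- matrix(c(688, 21, 650, 59), nrow = 2,
               dimnames = list(smoke = c("yes", "no"),
                               cancer = c("cases", "controls")))
p_manual <- fisher_p_2x2(lcct)
print(c(manual = p_manual, fisher.test = fisher.test(lcct)$p.value))
```

The two values should agree, since `fisher.test` applies the same ordering of tables for the two-sided alternative in the 2 × 2 case.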

Modeling Ordinal Associations

Treating ordered categories as nominal categories ignores important information.

Example: US 1996 National Election Study (Continued)
Here we consider the association between party identification and level of education.
> library(faraway); data(nes96); xtabs(~PID+educ,nes96);
         educ
PID       MS HSdrop HS Coll CCdeg BAdeg MAdeg
  strDem
  weakDem
  indDem
  indind
  indRep
  weakRep
  strRep
> (partyed <- as.data.frame.table(xtabs(~PID+educ,nes96)))   #convert to data.frame
      PID  educ Freq
1  strDem    MS
...
   strRep MAdeg   25

> nomod <- glm(Freq~PID+educ,family=poisson,data=partyed);
> pchisq(deviance(nomod),df.residual(nomod),lower.tail=FALSE)
[1]
When both variables are treated as nominal, we have no evidence against independence.
> presid <- residuals(nomod,type="pearson");
> xtabs(presid~partyed$PID+partyed$educ);
           partyed$educ
partyed$PID MS HSdrop HS Coll CCdeg BAdeg MAdeg
    strDem
    weakDem
    indDem
    indind
    indRep
    weakRep
    strRep

Cross-classifications of ordinal variables often exhibit their greatest deviations from independence in the corner cells.
- Sample counts are much larger than independence predicts when both responses are at the lowest order or both at the highest order.
- The counts are much smaller than the fitted values where one response is at the highest order and the other is at the lowest order.

The above residuals table indicates lack of fit in the form of a positive trend: subjects with a higher level of education also tend to be more strongly Republican.

Linear-by-Linear Association in Two-Way Tables

Assign row scores and column scores, respectively,
$$u_1 \le u_2 \le \cdots \le u_I, \qquad v_1 \le v_2 \le \cdots \le v_J$$

A simple model for these two ordinal variables is the linear-by-linear association model (L × L):
$$\log(\mu_{ij}) = \lambda + \alpha_i + \beta_j + \gamma u_i v_j$$

- $\gamma u_i v_j$ represents the deviation of $\log(\mu_{ij})$ from independence.
- The deviation is linear in the Y scores at a fixed level of X, and linear in the X scores at a fixed level of Y, hence the name L × L model.
- The model has its greatest departures from independence in the corners of the table.
- $\gamma = 0$ implies independence of X and Y.
- When $\gamma > 0$, Y tends to increase as X increases.
- When $\gamma < 0$, Y tends to decrease as X increases.

Example: US 1996 National Election Study (Continued)
We assign evenly spaced scores, i.e., one to seven (you can also try other scores), to both PID and educ, and fit the L × L model.
> partyed$oPID <- unclass(partyed$PID); partyed$oeduc <- unclass(partyed$educ);
> lblmod <- glm(Freq~PID+educ+I(oPID*oeduc),family=poisson,data=partyed);
> summary(lblmod)
...
Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
...
I(oPID * oeduc)                                       **
...
Null deviance: on 48 degrees of freedom
Residual deviance: on 35 degrees of freedom
AIC:
> anova(nomod,lblmod,test="Chisq");
Analysis of Deviance Table

Model 1: Freq ~ PID + educ
Model 2: Freq ~ PID + educ + I(oPID * oeduc)
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)

Interpretation of $\gamma$

The log odds ratio for a subtable whose cells are adjacent in both rows and columns, e.g., cells $(i,j)$, $(i,j+1)$, $(i+1,j)$, and $(i+1,j+1)$, is
$$\log\frac{\mu_{ij}\,\mu_{i+1,j+1}}{\mu_{i,j+1}\,\mu_{i+1,j}} = \gamma(u_{i+1} - u_i)(v_{j+1} - v_j)$$

- This log odds ratio is stronger as $|\gamma|$ increases and for pairs of categories that are farther apart.
- For evenly spaced scores, these odds ratios are all equal. For instance, when $\{u_i = i\}$ and $\{v_j = j\}$, we have the constant local odds ratios
$$\theta_{ij} = \frac{\pi_{ij}\,\pi_{i+1,j+1}}{\pi_{i,j+1}\,\pi_{i+1,j}} = e^{\gamma}$$
- The case of constant local odds ratios was called uniform association by Goodman (1979).

As a Baseline-Category Logit Model:
$$\log\frac{\pi_{j|i}}{\pi_{1|i}} = \log\frac{\mu_{ij}}{\mu_{i1}} = (\beta_j - \beta_1) + \gamma(v_j - v_1)u_i$$

We may fit a baseline-category logit model and expect the coefficients of $\{u_i\}$ to be $\{\gamma(v_j - v_1)\}$.
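The constant local odds ratio can be verified on fitted values. The sketch below fits an L × L model to a small made-up 3 × 4 table (the counts are hypothetical and serve only as an illustration) with unit-spaced scores, and checks that every local odds ratio of the fitted means equals $e^{\hat\gamma}$.

```r
# Hypothetical 3x4 table of counts, listed column by column (illustration only)
counts <- c(20, 12,  6,
            15, 14, 10,
             8, 13, 16,
             4, 10, 22)
d <- expand.grid(row = factor(1:3), col = factor(1:4))   # row varies fastest
d$Freq <- counts
d$u <- as.integer(d$row)    # evenly spaced row scores u_i = i
d$v <- as.integer(d$col)    # evenly spaced column scores v_j = j

# Linear-by-linear association model
lbl <- glm(Freq ~ row + col + I(u * v), family = poisson, data = d)
gamma_hat <- coef(lbl)[["I(u * v)"]]

# Local odds ratios of the fitted means for all adjacent 2x2 subtables
mu <- matrix(fitted(lbl), nrow = 3)
local_or <- (mu[-3, -4] * mu[-1, -1]) / (mu[-3, -1] * mu[-1, -4])

print(local_or)
print(exp(gamma_hat))
```

With other score choices, the same ratio becomes $\exp\{\hat\gamma(u_{i+1}-u_i)(v_{j+1}-v_j)\}$ and is no longer constant across the table.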

Example: US 1996 National Election Study (Continued)
> library(nnet);   # provides multinom
> nes96$oeduc <- unclass(nes96$educ);
> nes96.mn1 <- multinom(PID~oeduc,data=nes96); summary(nes96.mn1);
...
Coefficients:
        (Intercept) oeduc
weakDem
indDem
indind
indRep
weakRep
strRep
> nes96$oPID <- unclass(nes96$PID);
> nes96.mn2 <- multinom(educ~oPID,data=nes96); summary(nes96.mn2);
...
Coefficients:
       (Intercept) oPID
HSdrop
HS
Coll
CCdeg
BAdeg
MAdeg

Under an L × L model, we expect monotonically increasing (or decreasing) coefficients of oeduc and oPID. While the coefficients of oPID are more or less increasing, the coefficients of oeduc apparently are not. We may treat only PID as ordinal and educ as nominal.

Column Effects Model

The columns are not assigned scores, as Y is considered a nominal variable. $\{\gamma_j\}$ are called the column effects.
$$\log(\mu_{ij}) = \lambda + \alpha_i + \beta_j + \gamma_j u_i$$

- The zero-sum constraint is $\sum_{i=1}^{I}\alpha_i = \sum_{j=1}^{J}\beta_j = \sum_{j=1}^{J}\gamma_j = 0$.
- The baseline constraint is $\alpha_1 = \beta_1 = \gamma_1 = 0$.
- $\gamma_1 = \gamma_2 = \cdots = \gamma_J$ implies independence of X and Y.

As a Baseline-Category Logit Model:
$$\log\frac{\pi_{j|i}}{\pi_{1|i}} = \log\frac{\mu_{ij}}{\mu_{i1}} = (\beta_j - \beta_1) + (\gamma_j - \gamma_1)u_i$$

A row effects model is effectively the same model with the roles of the variables reversed.

Example: US 1996 National Election Study (Continued)
> cmod <- glm(Freq~PID+educ+educ:oPID,family=poisson,data=partyed);
> mcoeff <- summary(cmod)$coeff;
> mcoeff[8:13,1]   #beta_j-beta_1
educHSdrop educHS educColl educCCdeg educBAdeg educMAdeg

> mcoeff[15:19,1]-mcoeff[14,1]   #gamma_j-gamma_1
educHSdrop:oPID educHS:oPID educColl:oPID educCCdeg:oPID educBAdeg:oPID

> -mcoeff[14,1]   #gamma_J-gamma_1 as gamma_J=0
[1]
These values are similar to the coefficients in nes96.mn2.
> anova(nomod,cmod,test="Chisq")
Analysis of Deviance Table

Model 1: Freq ~ PID + educ
Model 2: Freq ~ PID + educ + educ:oPID
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)

The above comparison of cmod with the independence model nomod implies that the column effects model is preferred. What about comparing lblmod and cmod?

Correspondence Analysis

Correspondence analysis is a graphical way to represent associations in two-way contingency tables. It is very helpful in understanding the dependence between a category of X and a category of Y.

This method is based on the Pearson residuals:
$R_{I\times J} = (r_{ij})_{I\times J}$, where $r_{ij}$ is the Pearson residual for cell $(i,j)$.

Perform the singular value decomposition
$$R_{I\times J} = U_{I\times w} D_{w\times w} V^T_{J\times w} \;\Longrightarrow\; r_{ij} = \sum_{k=1}^{w} u_{ik}\,d_k\,v_{jk}$$

- $w = \min(I,J)$.
- $U_{I\times w} = (u_{ij})_{I\times w}$ and $V_{J\times w} = (v_{ij})_{J\times w}$ have orthogonal column vectors, called the left and right singular vectors, respectively.
- $D = \text{diagonal}\{d_1,\ldots,d_w\}$ with $d_1 \ge d_2 \ge \cdots \ge d_w$, which are called singular values.
- $\sum_{i=1}^{w} d_i^2 = $ Pearson's $X^2$ is called the inertia.

Usually $d_1^2 + d_2^2$ accounts for most of $\sum_{i=1}^{w} d_i^2 = X^2$. Therefore, $u_{i1}d_1v_{j1} + u_{i2}d_2v_{j2}$ accounts for most of the Pearson residual $r_{ij}$, i.e.,
$$r_{ij} \approx u_{i1}d_1v_{j1} + u_{i2}d_2v_{j2}$$

Denote, for $k = 1, 2$,
$$U_k = \sqrt{d_k}\,(u_{1k},\ldots,u_{Ik})^T, \qquad V_k = \sqrt{d_k}\,(v_{1k},\ldots,v_{Jk})^T$$

The two-dimensional correspondence plot displays $U_2$ against $U_1$, and $V_2$ against $V_1$, on the same graph.
- Plotting $U_2$ vs. $U_1$ shows the influence on residuals when ignoring the row effect. A row point far from the origin indicates the peculiarity of that row's profile.
- Plotting $V_2$ vs. $V_1$ shows the influence on residuals when ignoring the column effect. A column point far from the origin indicates the peculiarity of that column's profile.
- If a row level and a column level appear close together on the plot and far from the origin, there will be a large positive residual for this combination, indicating a strong positive association.
- If a row level and a column level are situated diametrically apart on either side of the origin, we may expect a large negative residual, indicating a strong negative association.
- If points representing two row levels or two column levels are close together, the two levels have a similar pattern of association. In some cases, one might consider combining the two levels.
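A short numerical check of the decomposition may help here. The sketch uses `HairEyeColor` from base R's `datasets` package, collapsed over sex, on the assumption that it holds the same 592 students as the hair and eye color example; variable names such as `inertia` and `share2` are illustrative.

```r
# Hair x eye counts, summed over sex; unclass() leaves a plain matrix
obs <- unclass(margin.table(HairEyeColor, c(1, 2)))

E  <- outer(rowSums(obs), colSums(obs)) / sum(obs)   # fitted counts under independence
R  <- (obs - E) / sqrt(E)                            # Pearson residuals r_ij
sv <- svd(R)

inertia <- sum(sv$d^2)                      # should equal Pearson's X^2
share2  <- sum(sv$d[1:2]^2) / inertia       # share of inertia in two dimensions

# Plot coordinates: singular vectors scaled by sqrt(d_k)
U <- sv$u[, 1:2] %*% diag(sqrt(sv$d[1:2]))
V <- sv$v[, 1:2] %*% diag(sqrt(sv$d[1:2]))
R2 <- U %*% t(V)                            # rank-2 approximation of R

print(c(inertia = inertia, share2 = share2))
print(max(abs(R - R2)))                     # size of what two dimensions miss
```

Scaling both sets of singular vectors by $\sqrt{d_k}$ makes the inner product of a row point and a column point approximate the corresponding residual, which is what justifies reading closeness on the plot as association.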

Example: Hair and Eye Color
Data were collected from 592 students in an introductory statistics class by counting the number of students with each hair/eye color combination.
> library(faraway); data(haireye); (ct <- xtabs(y~hair+eye,haireye));
       eye
hair    green hazel blue brown
  BLACK
  BROWN
  RED
  BLOND
> modc <- glm(y~hair+eye,family=poisson,data=haireye);
> pchisq(deviance(modc),df.residual(modc),lower.tail=FALSE)   # no third positional argument: it would be read as ncp
[1] e-25
The above GOF test shows that hair and eye color are not independent.
> z <- xtabs(residuals(modc,type="pearson")~hair+eye,data=haireye);
> svdz <- svd(z,2,2);
> leftsv <- svdz$u %*% diag(sqrt(svdz$d[1:2]));
> rightsv <- svdz$v %*% diag(sqrt(svdz$d[1:2]));
> bd <- 1.1*max(abs(rightsv),abs(leftsv));

> plot(rbind(leftsv,rightsv),asp=1,xlim=c(-bd,bd),ylim=c(-bd,bd),xlab="SV1",
       ylab="SV2",type="n")
> abline(h=0,v=0);
> text(leftsv,dimnames(z)[[1]]); text(rightsv,dimnames(z)[[2]]);

[Figure: correspondence plot of SV2 against SV1, showing the hair levels BLACK, BROWN, RED, BLOND and the eye levels brown, hazel, green, blue]

- BLOND is far from the origin, indicating that the distribution of eye colors within this group is not typical. In contrast, BROWN is close to the origin, indicating an eye color distribution close to the overall average.
- blue and BLOND occur close together on the plot and far from the origin, indicating a strong association between blue eyes and blond hair. On the other hand, there are relatively fewer people with BLOND hair and brown eyes than would be expected under independence.
- hazel and green are close together, indicating that people with hazel or green eyes have similar hair color distributions; we might choose to combine these two categories.

Models for Matched Pairs

Matched-pairs data:
- occur in studies that compare categorical responses for two samples when each observation in one sample pairs with an observation in the other;
- arise from repeated measurement of subjects, such as longitudinal studies that observe subjects over time;
- are summarized by a square two-way contingency table with the same row and column categories.

Example: Rating Performance of the Prime Minister
In a poll of a random sample of 1600 voting-age British citizens, 944 indicated approval of the Prime Minister's performance in office. Six months later, of these same 1600 people, 880 indicated approval. A strong association exists between opinions six months apart, as the sample odds ratio is (794 × 570)/(150 × 86) ≈ 35.1 (Q: confidence interval?)

                 Second Survey
First Survey   Approve   Disapprove   Total
Approve            794          150     944
Disapprove          86          570     656
Total              880          720    1600

Example: Grading of Eye Pairs for Distance Vision
A sample of women are rated for the performance of distance vision in each eye.
> library(faraway); data(eyegrade);
> (ct <- xtabs(y~right+left,eyegrade))
        left
right    best second third worst
  best
  second
  third
  worst
> summary(ct)
Call: xtabs(formula = y ~ right + left, data = eyegrade)
Number of cases in table: 7477
Number of factors: 2
Test for independence of all factors:
        Chisq = 8097, df = 9, p-value = 0

It is not surprising to find strong evidence against independence. A more interesting hypothesis for matched-pair data is whether $\pi_{ij} = \pi_{ji}$ for all $i$ and $j$.

An I × I distribution $\{\pi_{ij}\}$ satisfies symmetry if
$$\pi_{ij} = \pi_{ji}, \quad i = 1,\ldots,I;\ j = 1,\ldots,J\ (J = I)$$

Under symmetry, $\pi_{i+} = \sum_{j=1}^{J}\pi_{ij} = \sum_{j=1}^{J}\pi_{ji} = \pi_{+i}$, which implies marginal homogeneity.
- For $I = 2$, symmetry is equivalent to marginal homogeneity.
- For $I > 2$, marginal homogeneity can occur without symmetry.

Symmetry as Logit Models
$$\log\frac{\pi_{ij}}{\pi_{ji}} = 0, \quad \text{for all } i < j$$

The MLE of $\pi_{ij} = \pi_{ji}$ is $\hat\pi_{ij} = \hat\pi_{ji} = \frac{n_{ij} + n_{ji}}{2n_{++}}$, and the LRT is
$$2\sum_{i \neq j} n_{ij}\log\frac{2n_{ij}}{n_{ij} + n_{ji}} \overset{asy.}{\sim} \chi^2_{I(I-1)/2}, \quad \text{under } H_0$$

Symmetry as Loglinear Models
$$\log(\mu_{ij}) = \lambda + \alpha_i + \alpha_j + \gamma_{ij}$$
- Symmetry $\Longrightarrow \pi_{i+} = \pi_{+i}$
- $\gamma_{ij} = \gamma_{ji} \Longrightarrow \mu_{ij} = \mu_{ji}$

The MLE of $\mu_{ij} = \mu_{ji}$ is $\hat\mu_{ij} = \hat\mu_{ji} = (n_{ij} + n_{ji})/2$, and the LRT is
$$2\sum_{i \neq j} n_{ij}\log\frac{2n_{ij}}{n_{ij} + n_{ji}} \overset{asy.}{\sim} \chi^2_{I(I-1)/2}, \quad \text{under } H_0$$
This is equivalent to the goodness-of-fit test for a loglinear model with properly defined dummy variables!

Bowker's Test of symmetry (Bowker, 1948)
$$X^2 = \sum_{i=1}^{I-1}\sum_{j=i+1}^{I}\frac{(n_{ij} - n_{ji})^2}{n_{ij} + n_{ji}} \overset{asy.}{\sim} \chi^2_{I(I-1)/2}, \quad \text{under } H_0$$
When $I = J = 2$, the above test is called McNemar's test (McNemar, 1947).

Example: Grading of Eye Pairs for Distance Vision (Continued)
> mct <- matrix(ct,nrow=4);
> 2*sum(mct*log(2*mct/(mct+t(mct))))   # LRT
[1]
> pchisq(2*sum(mct*log(2*mct/(mct+t(mct)))),6,lower.tail=FALSE)
[1]
> (symfac <- factor(apply(eyegrade[,2:3],1,function(x) paste(sort(x),collapse="-"))))
 best-best    best-second   best-third   best-worst
 best-second  second-second second-third second-worst
 best-third   second-third  third-third  third-worst
 best-worst   second-worst  third-worst  worst-worst
10 Levels: best-best best-second best-third best-worst second-second ... worst-worst
> mods <- glm(y ~ symfac, family=poisson, data=eyegrade);
> c(deviance(mods),df.residual(mods));
[1]
> pchisq(deviance(mods), df.residual(mods),lower.tail=FALSE);   # GOF of loglinear model
[1]
> sum((mct-t(mct))^2/(mct+t(mct)))/2   # Bowker's test of symmetry; each pair is counted twice, hence /2
[1]
> pchisq(sum((mct-t(mct))^2/(mct+t(mct)))/2,6,lower.tail=FALSE)
[1]
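For the $I = 2$ case, the same statistic reduces to McNemar's test. The sketch below applies a small helper (the name `bowker_test` is hypothetical) to the Prime Minister approval data from the matched-pairs example. The cell counts 794 and 570 are not quoted in the text but follow from the stated margins (944 and 880 approvals out of 1600) together with the off-diagonal counts 150 and 86.

```r
# Sketch: Bowker's test of symmetry for a square table of counts.
bowker_test <- function(tab) {
  stopifnot(nrow(tab) == ncol(tab))
  up  <- upper.tri(tab)
  nij <- tab[up]                 # counts above the diagonal
  nji <- t(tab)[up]              # their mirror images below the diagonal
  keep <- (nij + nji) > 0        # skip pairs with no observations
  stat <- sum((nij[keep] - nji[keep])^2 / (nij[keep] + nji[keep]))
  df <- nrow(tab) * (nrow(tab) - 1) / 2
  c(statistic = stat, df = df,
    p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Rows: first survey; columns: second survey
pm <- matrix(c(794, 86, 150, 570), nrow = 2,
             dimnames = list(first  = c("approve", "disapprove"),
                             second = c("approve", "disapprove")))
res <- bowker_test(pm)
print(res)
print(mcnemar.test(pm, correct = FALSE))   # base R equivalent for comparison
```

Only the discordant pairs (150 and 86) enter the statistic; the concordant cells carry no information about a change in approval.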

Quasi-symmetry: allows the main-effect terms in the symmetry loglinear model to differ, to accommodate marginal heterogeneity:
$$\log(\mu_{ij}) = \lambda + \alpha_i + \beta_j + \gamma_{ij}, \quad \gamma_{ij} = \gamma_{ji}$$

$\hat\pi_{ij} = ?$ (note that $\mu_{ij} \neq \mu_{ji}$)

Q: What is the LRT?
$$2(l_{full} - l_{null}) \overset{asy.}{\sim} \chi^2_{(I-1)(I-2)/2}, \quad \text{under } H_0$$
This is equivalent to the goodness-of-fit test for a loglinear model with properly defined dummy variables!

Example: Grading of Eye Pairs for Distance Vision (Continued)
> modq <- glm(y ~ right+left+symfac, family=poisson, data=eyegrade);
> c(deviance(modq),df.residual(modq));
[1]
> pchisq(deviance(modq), df.residual(modq),lower.tail=FALSE);   # GOF of loglinear model
[1]
> anova(mods,modq,test="Chisq");
Model 1: y ~ symfac
Model 2: y ~ right + left + symfac
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)

A square contingency table satisfies quasi-independence when the variables are independent, given that the row and column outcomes differ:
$$\log(\mu_{ij}) = \lambda + \alpha_i + \beta_j + \delta_i 1_{\{i=j\}}$$

- The first three terms specify independence, and $\{\delta_i\}$ permit $\{\mu_{ii}\}$ to depart from this pattern and take arbitrary positive values.
- Quasi-independence is the special case of quasi-symmetry in which $\{\gamma_{ij}, i \neq j\}$ are identical. The two are equivalent when $I = 3$.

Q: What is the LRT ($I \ge 3$)?
$$2(l_{full} - l_{null}) \overset{asy.}{\sim} \chi^2_{(I-1)^2 - I}, \quad \text{under } H_0$$
This is equivalent to the goodness-of-fit test for a loglinear model with properly defined dummy variables!

Example: Grading of Eye Pairs for Distance Vision (Continued)
> modqi <- glm(y ~ right+left, family=poisson, subset=-c(1,6,11,16), data=eyegrade);   # drop the diagonal cells
> c(deviance(modqi),df.residual(modqi));
[1]
> pchisq(deviance(modqi), df.residual(modqi),lower.tail=FALSE);   # GOF of loglinear model
[1] e-41

Three-Way Contingency Tables

Example: Mortality Due to Smoking in Women
A survey of one in six residents of Whickham, near Newcastle, England, was made; twenty years later, the outcome was recorded in a follow-up study. Only women who were current smokers or who had never smoked are included.
> library(faraway); data(femsmoke);
> cbind(femsmoke[1:14,], index=15:28,femsmoke[15:28,])
    y smoker dead age index y smoker dead age
1   2    yes  yes
...
                                          75+

Simpson's Paradox
> (ct <- xtabs(y~smoker+dead,femsmoke))
      dead
smoker yes no
   yes
   no
> fisher.test(ct)
...
p-value =
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:

sample estimates:
odds ratio

Can we conclude that smoking has a beneficial effect on longevity?
> ct3 <- xtabs(y~smoker+dead+age,femsmoke)
> apply(ct3,3,function(x){tr<-fisher.test(x); tr$estimate})

All odds ratios are greater than one, with the exception of one age group. But how do we test independence in a 2 × 2 table across K strata?
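The same reversal between marginal and stratum-specific odds ratios can be reproduced without the `faraway` package. The sketch below swaps in a different data set, `UCBAdmissions` from base R (admission by gender within six departments), purely as an illustration; `or2x2` is a hypothetical helper name.

```r
# Sample odds ratio of a 2x2 table
or2x2 <- function(x) x[1, 1] * x[2, 2] / (x[1, 2] * x[2, 1])

# Marginal table: Admit x Gender, collapsed over department
marginal_or <- or2x2(margin.table(UCBAdmissions, c(1, 2)))

# Conditional tables: one Admit x Gender table per department
conditional_or <- apply(UCBAdmissions, 3, or2x2)

print(marginal_or)
print(round(conditional_or, 3))
```

Collapsing over the stratifying variable changes the apparent direction of the association because the strata differ both in their mix of the explanatory variable and in their baseline response rates, which is the mechanism at work with age in the smoking data.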

I × J × K Table: the three categorical variables, e.g., X, Y and Z, have I, J and K categories, respectively.
- Multinomial sampling: assumes a multinomial distribution with cell probabilities $\{\pi_{ijk}\}$, where $\sum_i\sum_j\sum_k \pi_{ijk} = 1$.
- Poisson sampling: assumes each cell count $n_{ijk}$ follows a Poisson distribution with rate $\mu_{ijk}$. So $n_{+++} \sim \text{Poisson}(\mu_{+++})$ with $\sum_i\sum_j\sum_k \mu_{ijk} = \mu_{+++}$.
- $\mu_{ijk} = \mu_{+++}\pi_{ijk}$

Mutual Independence

X, Y and Z are mutually independent when, for all $i$, $j$ and $k$,
$$\pi_{ijk} = \pi_{i++}\pi_{+j+}\pi_{++k}$$

Mutual independence has the loglinear form
$$\log\mu_{ijk} = \lambda + \alpha_i + \beta_j + \gamma_k$$

Test of mutual independence
- Pearson's $\chi^2$ test
- GOF test for the loglinear model

Example: Mortality Due to Smoking in Women (Continued)
> summary(ct3)   # Pearson's chi-sq test
Call: xtabs(formula = y ~ smoker + dead + age, data = femsmoke)
Number of cases in table: 1314
Number of factors: 3
Test for independence of all factors:
        Chisq = 790.6, df = 19, p-value = 2.140e-155
> modi <- glm(y~smoker+dead+age,family=poisson,data=femsmoke);
> c(deviance(modi),df.residual(modi))
[1]
> pchisq(deviance(modi),df.residual(modi),lower.tail=FALSE)
[1] e-143

Conclusion?

Joint Independence

Z is jointly independent of X and Y when, for all $i$, $j$ and $k$,
$$\pi_{ijk} = \pi_{ij+}\pi_{++k}$$

Joint independence has the loglinear form
$$\log\mu_{ijk} = \lambda + \alpha_{ij} + \beta_k$$

Mutual independence implies joint independence of any one variable from the others.

Test of joint independence
- Pearson's $\chi^2$ test (after combining the levels of X and Y)
- GOF test for the loglinear model

Example: Mortality Due to Smoking in Women (Continued)
We want to investigate whether age is jointly independent of smoking and life status.
> femsmoke$sdead <- factor(apply(femsmoke[,2:3],1,
       function(x) paste(x,collapse="-")))
> (ct2 <- xtabs(y~sdead+age,femsmoke))
         age
sdead
  no-no
  no-yes
  yes-no
  yes-yes
> summary(ct2)
Call: xtabs(formula = y ~ sdead + age, data = femsmoke)
Number of cases in table: 1314
Number of factors: 2
Test for independence of all factors:
        Chisq = 734.7, df = 18, p-value = 2.455e-144
> modj <- glm(y~smoker*dead+age,family=poisson,data=femsmoke)
> c(deviance(modj),df.residual(modj))
[1]
> pchisq(deviance(modj),df.residual(modj),lower.tail=FALSE)
[1] e-142

Conclusion?

Conditional Independence

X and Y are conditionally independent given Z when, for all $i$, $j$ and $k$,
$$\pi_{ij|k} = \pi_{i+|k}\pi_{+j|k} \iff \pi_{ijk} = \pi_{i+k}\pi_{+jk}/\pi_{++k}$$
where $\pi_{ij|k} = P(X = i, Y = j \mid Z = k)$, $\pi_{i+|k} = P(X = i \mid Z = k)$, and $\pi_{+j|k} = P(Y = j \mid Z = k)$.

Conditional independence has the loglinear form
$$\log\mu_{ijk} = \lambda + \alpha_{ik} + \beta_{jk}$$

It is a weaker condition than mutual or joint independence.

Test of conditional independence
- 2 × 2 × K tables: Cochran-Mantel-Haenszel test
$$CMH = \frac{\big(\big|\sum_k n_{11k} - \sum_k \mu_{11k}\big| - 0.5\big)^2}{\sum_k \sigma^2_{11k}} \overset{asy.}{\sim} \chi^2_1, \quad \text{under } H_0$$
$$\mu_{11k} = n_{1+k}n_{+1k}/n_{++k}, \qquad \sigma^2_{11k} = n_{1+k}n_{2+k}n_{+1k}n_{+2k}/[n^2_{++k}(n_{++k} - 1)]$$
- Test which of the three possible two-way interactions does not appear in the loglinear model.

Example: Mortality Due to Smoking in Women (Continued)
We want to investigate whether smoking and life status are independent given age.
> mantelhaen.test(ct3,exact=TRUE)

        Exact conditional test of independence in 2 x 2 x k tables

data:  ct3
S = 139, p-value =
alternative hypothesis: true common odds ratio is not equal to 1
95 percent confidence interval:

sample estimates:
common odds ratio

> modc <- glm(y~smoker*age+dead*age,family=poisson,data=femsmoke)
> c(deviance(modc),df.residual(modc))
[1]
> pchisq(deviance(modc),df.residual(modc),lower.tail=FALSE)
[1]

Conclusion?
Caution: the GOF test may not work well when some cell counts are small!
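The Cochran-Mantel-Haenszel statistic can be computed directly from its definition and compared with the asymptotic (non-exact) version of `mantelhaen.test`. The sketch uses base R's `UCBAdmissions` (a 2 × 2 × 6 table) as stand-in data so that it runs without `faraway`; `cmh_stat` is a hypothetical helper name.

```r
# Sketch: continuity-corrected CMH statistic for a 2 x 2 x K table.
cmh_stat <- function(tab3) {
  n11 <- tab3[1, 1, ]
  n1p <- tab3[1, 1, ] + tab3[1, 2, ]      # n_{1+k}: first-row totals
  n2p <- tab3[2, 1, ] + tab3[2, 2, ]      # n_{2+k}: second-row totals
  np1 <- tab3[1, 1, ] + tab3[2, 1, ]      # n_{+1k}: first-column totals
  np2 <- tab3[1, 2, ] + tab3[2, 2, ]      # n_{+2k}: second-column totals
  n   <- n1p + n2p                        # n_{++k}: stratum sizes
  mu  <- n1p * np1 / n                    # E[n_11k] under conditional independence
  s2  <- n1p * n2p * np1 * np2 / (n^2 * (n - 1))   # Var[n_11k]
  (abs(sum(n11) - sum(mu)) - 0.5)^2 / sum(s2)
}

stat <- cmh_stat(UCBAdmissions)
print(c(CMH = stat, p.value = pchisq(stat, 1, lower.tail = FALSE)))
print(mantelhaen.test(UCBAdmissions))     # asymptotic test with continuity correction
```

Summing the deviations across strata before squaring gives the test its power against a common direction of association, at the cost of sensitivity when the strata disagree in sign.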

Homogeneous Association

Homogeneous association implies that the conditional relationship between any pair of variables, given the third one, is the same at each level of the third variable. It is also known as a no three-factor interaction model or a no second-order interaction model.

The loglinear model of homogeneous association is
$$\log\mu_{ijk} = \lambda + \alpha_{ij} + \beta_{jk} + \gamma_{ik}$$

At a fixed level $k$ of Z, consider the conditional local odds ratios
$$\theta_{ij(k)} = \frac{\pi_{ij|k}\,\pi_{i+1,j+1|k}}{\pi_{i,j+1|k}\,\pi_{i+1,j|k}} = \frac{\pi_{ijk}\,\pi_{i+1,j+1,k}}{\pi_{i,j+1,k}\,\pi_{i+1,j,k}} = \frac{\mu_{ijk}\,\mu_{i+1,j+1,k}}{\mu_{i,j+1,k}\,\mu_{i+1,j,k}} = \frac{\exp\{\alpha_{ij} + \alpha_{i+1,j+1}\}}{\exp\{\alpha_{i+1,j} + \alpha_{i,j+1}\}},$$
which does not depend on $k$. A similar conclusion holds when fixing the level of X or Y.

> modh <- glm(y~(smoker+age+dead)^2,family=poisson,data=femsmoke);
> ctf <- xtabs(fitted(modh)~smoker+dead+age,femsmoke)
> apply(ctf,3,function(x) (x[1,1]*x[2,2])/(x[1,2]*x[2,1]) )

> anova(modc,modh,test="Chisq")   # p-value =

Conclusion?
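The four loglinear models for three-way tables discussed above can be fitted side by side to compare their goodness of fit. The sketch uses `UCBAdmissions` from base R as stand-in data (an assumption made so the code runs without `faraway`); the list name `fits` and the labels are illustrative.

```r
d <- as.data.frame(UCBAdmissions)    # columns Admit, Gender, Dept, Freq (24 cells)

fits <- list(
  # all three variables mutually independent
  mutual      = glm(Freq ~ Admit + Gender + Dept, family = poisson, data = d),
  # Dept jointly independent of (Admit, Gender)
  joint       = glm(Freq ~ Admit * Gender + Dept, family = poisson, data = d),
  # Admit and Gender conditionally independent given Dept
  conditional = glm(Freq ~ Admit * Dept + Gender * Dept, family = poisson, data = d),
  # all two-way interactions, no three-factor interaction
  homogeneous = glm(Freq ~ (Admit + Gender + Dept)^2, family = poisson, data = d)
)

# Deviance-based GOF test of each model against the saturated model
gof <- t(sapply(fits, function(m) {
  c(deviance = deviance(m),
    df       = df.residual(m),
    p.value  = pchisq(deviance(m), df.residual(m), lower.tail = FALSE))
}))
print(gof)

# Nested comparison: is the Admit:Gender interaction needed given the others?
print(anova(fits$conditional, fits$homogeneous, test = "Chisq"))
```

Note that the joint and conditional models listed here are not nested in each other, so only the pairs mutual/joint, mutual/conditional and conditional/homogeneous can be compared with `anova`.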
