STAT 526 Advanced Statistical Methodology


Fall 2017, Lecture Note 7: Contingency Table

Outline

- Introduction to Contingency Tables
- Testing Independence in Two-Way Contingency Tables
- Modeling Ordinal Associations
- Correspondence Analysis
- Models for Matched Pairs
- Three-Way Contingency Tables

Dabao Zhang

Introduction to Contingency Tables

Contingency table: a table whose cells contain frequency counts of outcomes classified according to certain variables (Karl Pearson, 1904). Contingency tables are used to display relationships between categorical variables.

Two-way table: used to study the relationship between two categorical variables, e.g., X and Y.
- Suppose that X has I categories and Y has J categories.
- Classifying subjects on both variables gives $I \times J$ possible combinations, i.e., $I \times J$ cells in a rectangular table with I rows for the categories of X and J columns for the categories of Y.
- A contingency table with I rows and J columns is called an $I \times J$ (I-by-J) table.

Example: Cross-Classification of Smoking by Lung Cancer

           Lung Cancer
Smoking    Cases   Controls   Total
Yes          688        650    1338
No            21         59      80
Total        709        709    1418

Three-way table: used to study the relationships among three categorical variables, e.g., X, Y and Z.
- Suppose that X has I categories, Y has J categories, and Z has K categories.
- Classifying subjects on all possible combinations gives an $I \times J \times K$ contingency table.

Example: Alcohol, Cigarette, and Marijuana Use for High School Seniors

Alcohol   Cigarette   Marijuana Use
Use       Use         Yes     No
Yes       Yes         ...     ...
Yes       No          ...     ...
No        Yes           3      43
No        No          ...     ...

Testing Independence in Two-Way Contingency Tables

Multinomial Sampling

When the total sample size n is fixed but the row and column totals are not, a multinomial sampling model applies. Usually both X and Y are response variables, so the joint distribution is used to describe their association:
$$P(X = i, Y = j) = \pi_{ij}, \quad i = 1, \dots, I; \; j = 1, \dots, J.$$
Let $n_{ij}$ be the count in cell (i, j). Then the probability mass function of the cell counts is
$$\frac{n!}{n_{11}! \cdots n_{IJ}!} \prod_{i=1}^{I} \prod_{j=1}^{J} \pi_{ij}^{n_{ij}} \;\Longrightarrow\; l = \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij} \log(\pi_{ij}) + \text{constant}.$$

Independence of Categorical Variables

X and Y are independent $\iff \pi_{ij} = \pi_{i+} \pi_{+j}$, $i = 1, \dots, I$, $j = 1, \dots, J$.

Marginal distributions: $P(X = i) = \pi_{i+}$ and $P(Y = j) = \pi_{+j}$, where the subscript + denotes the sum over that index.

Hypothesis test

$H_0: \pi_{ij} = \pi_{i+} \pi_{+j}$ for all i and j
$H_a: \pi_{ij} \ne \pi_{i+} \pi_{+j}$ for some i and j

Under the full model, the MLE of $\pi_{ij}$ is $\hat\pi_{ij} = n_{ij}/n_{++}$.
Under the null model, the MLEs are $\hat\pi_{i+} = n_{i+}/n_{++}$ and $\hat\pi_{+j} = n_{+j}/n_{++}$.
The LRT (or deviance-based test) is
$$2 \sum_{i=1}^{I} \sum_{j=1}^{J} n_{ij} \log \frac{n_{ij} n_{++}}{n_{i+} n_{+j}} \;\overset{asy.}{\sim}\; \chi^2_{(I-1)(J-1)}, \quad \text{under } H_0.$$

Example: Cross-Classification of Smoking by Lung Cancer (Continued)

> y <- c(688,21,650,59);
> smoke <- gl(2,1,4,labels=c("yes","no")); #gl: generate a factor with given levels
> cancer <- gl(2,2,labels=c("cases","controls"));
> lcancer <- data.frame(y,cancer,smoke); lcancer;
    y   cancer smoke
1 688    cases   yes
2  21    cases    no
3 650 controls   yes
4  59 controls    no

> (lcct <- xtabs(y~smoke+cancer)) #xtabs: create a contingency table
     cancer
smoke cases controls
  yes   688      650
  no     21       59
> (fpi <- prop.table(xtabs(y~smoke+cancer)))
     cancer
smoke cases controls
  yes   ...      ...
  no    ...      ...
> spi <- prop.table(xtabs(y~smoke));
> cpi <- prop.table(xtabs(y~cancer));
> (npi <- outer(spi,cpi))
     cancer
smoke cases controls
  yes   ...      ...
  no    ...      ...
> pchisq(2*sum(lcct*log(fpi/npi)),1,lower.tail=FALSE)
[1] ...e-06

Conclusion?
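
The deviance-based test above can be wrapped in a small function. The following is a minimal sketch, assuming a table of counts as input; `lrt_independence` is an illustrative name introduced here, not part of the lecture code, and cells with zero counts are assumed to contribute zero to the statistic.

```r
# Likelihood-ratio (deviance) test of independence for a two-way table of counts.
# `lrt_independence` is a hypothetical helper written for illustration.
lrt_independence <- function(tab) {
  tab <- as.matrix(tab)
  n <- sum(tab)
  expected <- outer(rowSums(tab), colSums(tab)) / n   # n_{i+} n_{+j} / n_{++}
  pos <- tab > 0                                      # treat 0 * log(0) as 0
  g2 <- 2 * sum(tab[pos] * log(tab[pos] / expected[pos]))
  df <- (nrow(tab) - 1) * (ncol(tab) - 1)
  list(statistic = g2, df = df,
       p.value = pchisq(g2, df, lower.tail = FALSE))
}

# Smoking by lung cancer counts from the example
lcct <- matrix(c(688, 21, 650, 59), nrow = 2,
               dimnames = list(smoke = c("yes", "no"),
                               cancer = c("cases", "controls")))
res <- lrt_independence(lcct)
res
```

Running it on the smoking table should reproduce the statistic and p-value computed step by step in the example.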

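$2 \times 2$ Table

Independence of X and Y can be stated in terms of the odds ratio:
$$X \text{ and } Y \text{ are independent} \iff \theta = \frac{\pi_{11}/\pi_{12}}{\pi_{21}/\pi_{22}} = \frac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}} = 1.$$
This is because, when $\theta = 1$ (and similarly for the other $\pi_{ij}$),
$$\pi_{12} = \pi_{+1}\pi_{12} + \pi_{+2}\pi_{12} = (\pi_{11}\pi_{12} + \pi_{21}\pi_{12}) + \pi_{+2}\pi_{12} = (\pi_{11}\pi_{12} + \pi_{11}\pi_{22}) + \pi_{+2}\pi_{12} = \pi_{+2}\pi_{11} + \pi_{+2}\pi_{12} = \pi_{1+}\pi_{+2}.$$

MLE of the odds ratio:
$$\hat\theta = \frac{n_{11}/n_{12}}{n_{21}/n_{22}} = \frac{n_{11}n_{22}}{n_{12}n_{21}}.$$
Asymptotically, $\log(\hat\theta) \sim N(\log(\theta), \hat\sigma^2)$, where
$$\hat\sigma^2 = \frac{1}{n_{11}} + \frac{1}{n_{12}} + \frac{1}{n_{21}} + \frac{1}{n_{22}}.$$
When some $n_{ij} = 0$, $\hat\theta$ is not a good estimator. It is amended by adding 0.5 to each cell count:
$$\tilde\theta = \frac{(n_{11} + 0.5)(n_{22} + 0.5)}{(n_{12} + 0.5)(n_{21} + 0.5)}.$$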

Example: Cross-Classification of Smoking by Lung Cancer (Continued)

> (etheta <- lcct[1,1]*lcct[2,2]/(lcct[1,2]*lcct[2,1]))
[1] ...
> (sele <- sqrt(sum(1/lcct)))
[1] ...
> log(etheta)+sele*c(-1.96,1.96)
[1] ... ...

Conclusion?
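
The odds-ratio estimate and its Wald interval can be collected into one function. This is a sketch under the asymptotic normality of $\log(\hat\theta)$ stated earlier; `odds_ratio_ci` and its `add` argument are hypothetical names introduced for illustration (`add = 0.5` gives the amended estimator).

```r
# Sample odds ratio for a 2x2 table with a Wald interval built on the log scale.
# `odds_ratio_ci` is an illustrative helper, not part of the lecture code.
odds_ratio_ci <- function(tab, level = 0.95, add = 0) {
  tab <- as.matrix(tab) + add                     # add = 0.5 for the amended estimator
  theta <- tab[1, 1] * tab[2, 2] / (tab[1, 2] * tab[2, 1])
  se <- sqrt(sum(1 / tab))                        # standard error of log(theta-hat)
  z <- qnorm(1 - (1 - level) / 2)
  ci <- exp(log(theta) + c(-1, 1) * z * se)       # back-transform to the odds-ratio scale
  c(estimate = theta, lower = ci[1], upper = ci[2])
}

lcct <- matrix(c(688, 21, 650, 59), nrow = 2)
r <- odds_ratio_ci(lcct)              # plain estimator
r
odds_ratio_ci(lcct, add = 0.5)        # amended estimator
```

An interval for $\theta$ that excludes 1 corresponds to rejecting independence at the matching level.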

Independent (or Product) Multinomial Sampling

When the row totals, i.e., $n_{i+}$, $i = 1, \dots, I$, are fixed, an independent multinomial sampling model applies. Usually X is an explanatory variable, and observations on a response Y occur separately at each setting of X. So the conditional distribution is used to describe their association:
$$P(Y = j \mid X = i) = \pi_{j|i}, \quad i = 1, \dots, I; \; j = 1, \dots, J.$$
Let $n_{ij}$ be the count in cell (i, j). Then the counts $\{n_{ij}, j = 1, \dots, J\}$, satisfying $\sum_{j=1}^{J} n_{ij} = n_{i+}$, follow a multinomial distribution
$$\frac{n_{i+}!}{n_{i1}! \cdots n_{iJ}!} \prod_{j=1}^{J} \pi_{j|i}^{n_{ij}}.$$

Independence of Categorical Variables

X and Y are independent $\iff \pi_{j|1} = \cdots = \pi_{j|I}$, $j = 1, \dots, J$.

Independence is then often referred to as homogeneity of the conditional distributions.

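$\pi_{ij} = \pi_{i+}\pi_{+j}$ for all i and j $\iff \pi_{j|1} = \cdots = \pi_{j|I}$ for all j

($\Rightarrow$) $\pi_{j|i} = \pi_{ij}/\pi_{i+} = (\pi_{i+}\pi_{+j})/\pi_{i+} = \pi_{+j}$.
($\Leftarrow$) Let $\pi_{j|i} = c_j$. Then
$$\pi_{+j} = \sum_{i=1}^{I} \pi_{ij} = \sum_{i=1}^{I} \pi_{i+} c_j = c_j \;\Longrightarrow\; \pi_{ij} = \pi_{i+}\pi_{j|i} = \pi_{i+}\pi_{+j}.$$

Q: How to test the homogeneity of the conditional distributions?

         Column
Row      1                          ...   J                          Total
1        $\pi_{11}$ ($\pi_{1|1}$)   ...   $\pi_{1J}$ ($\pi_{J|1}$)   $\pi_{1+}$
...
I        $\pi_{I1}$ ($\pi_{1|I}$)   ...   $\pi_{IJ}$ ($\pi_{J|I}$)   $\pi_{I+}$
Total    $\pi_{+1}$                 ...   $\pi_{+J}$                 $\pi_{++}$

Consider the new notation $\pi_j(x) = P(Y = j \mid X = x)$ $\Longrightarrow$ consider a model for multinomial responses!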

Example: Cross-Classification of Smoking by Lung Cancer (Continued)

> (mnlc <- matrix(y,nrow=2))
     [,1] [,2]
[1,]  688  650
[2,]   21   59
> mnmod <- glm(mnlc~1,family=binomial);
> deviance(mnmod)
[1] ...
> 2*sum(lcct*log(fpi/npi)) # deviance in the multinomial sampling
[1] ...

Conclusion?

Poisson Sampling

Denote the count of cell (i, j) as $Y_{ij}$. A Poisson sampling model assumes that each $Y_{ij}$ follows an independent Poisson distribution with rate $\mu_{ij}$:
$$Y_{ij} \overset{ind.}{\sim} \text{Poisson}(\mu_{ij}) \;\Longrightarrow\; \sum_{i=1}^{I}\sum_{j=1}^{J} Y_{ij} \sim \text{Poisson}\Big(\sum_{i=1}^{I}\sum_{j=1}^{J} \mu_{ij}\Big).$$
Denote $\sum_{i=1}^{I}\sum_{j=1}^{J} \mu_{ij} = \mu_{++}$.

Given $\sum_{i=1}^{I}\sum_{j=1}^{J} Y_{ij} = n_{++}$, $(Y_{11}, \dots, Y_{ij}, \dots, Y_{IJ})$ follows a multinomial distribution with $E[Y_{ij} \mid n_{++}] = n_{++}\pi_{ij}$, where $\pi_{ij} = \mu_{ij}/\mu_{++}$.

Independence of X and Y has the following form:
$$\log(\mu_{ij}) = \lambda + \alpha_i + \beta_j.$$
This model is called the loglinear model of independence for two-way contingency tables: the log expected frequency is an additive function of a row effect $\alpha_i$ and a column effect $\beta_j$.

An independence test is also a goodness-of-fit test of the above loglinear model.
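
The three sampling models lead to the same deviance for testing independence. The sketch below checks this numerically on the smoking data, assuming only base R; the object names (`g2_multinomial`, `dev_binomial`, `dev_poisson`) are introduced here for illustration.

```r
# Same independence test under three sampling models (smoking by lung cancer data).
y <- c(688, 21, 650, 59)
smoke <- gl(2, 1, 4, labels = c("yes", "no"))
cancer <- gl(2, 2, labels = c("cases", "controls"))

# Multinomial sampling: LRT computed directly from observed and expected counts
tab <- matrix(y, nrow = 2)
expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
g2_multinomial <- 2 * sum(tab * log(tab / expected))

# Independent multinomial (here binomial) sampling: intercept-only model, row totals fixed
dev_binomial <- deviance(glm(tab ~ 1, family = binomial))

# Poisson sampling: loglinear model of independence
dev_poisson <- deviance(glm(y ~ smoke + cancer, family = poisson))

c(multinomial = g2_multinomial, binomial = dev_binomial, poisson = dev_poisson)
```

The three numbers should agree up to the convergence tolerance of `glm`, which is why the Poisson loglinear fit can be used regardless of the sampling scheme.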

Example: Cross-Classification of Smoking by Lung Cancer (Continued)

> pmod <- glm(y~smoke+cancer,family=poisson);
> summary(pmod)
...
Coefficients:
                Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)     6.506e... ...e...     ...      <2e-16 ***
smokeno         ...e...   ...e...     ...      <2e-16 ***
cancercontrols  ...e...   ...e...     ...      ...
...
    Null deviance: ... on 3 degrees of freedom
Residual deviance: ... on 1 degrees of freedom
AIC: ...

An alternative to the deviance-based test is Pearson's $X^2$ test:
$$X^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{(y_{ij} - \hat\mu_{ij})^2}{\hat\mu_{ij}}.$$

> emu <- npi*sum(lcct); sum((lcct-emu)^2/emu)
[1] ...

Yates' continuity correction shrinks each deviation toward zero by 0.5: $y_{ij} - \hat\mu_{ij}$ is replaced by $y_{ij} - (\hat\mu_{ij} + 0.5)$ if $y_{ij} - \hat\mu_{ij} > 0$, and by $y_{ij} - (\hat\mu_{ij} - 0.5)$ if $y_{ij} - \hat\mu_{ij} < 0$.

The deviance-based test is preferred to Pearson's $X^2$.
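
Pearson's $X^2$, with and without Yates' correction, can be computed by hand and cross-checked against base R's `chisq.test`. This is a sketch assuming every $|y_{ij} - \hat\mu_{ij}|$ is at least 0.5, as is the case for the smoking data; the object names are illustrative.

```r
# Pearson's X^2 for the smoking table, with and without Yates' continuity correction.
lcct <- matrix(c(688, 21, 650, 59), nrow = 2)
emu <- outer(rowSums(lcct), colSums(lcct)) / sum(lcct)    # fitted counts under independence

x2 <- sum((lcct - emu)^2 / emu)                    # uncorrected Pearson statistic
x2_yates <- sum((abs(lcct - emu) - 0.5)^2 / emu)   # each deviation shrunk toward 0 by 0.5

# Cross-check against chisq.test
x2_ref <- unname(chisq.test(lcct, correct = FALSE)$statistic)
x2_yates_ref <- unname(chisq.test(lcct, correct = TRUE)$statistic)

c(pearson = x2, yates = x2_yates)
pchisq(c(x2, x2_yates), df = 1, lower.tail = FALSE)
```

The corrected statistic is always the smaller of the two, so the correction makes the test more conservative.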

Hypergeometric Sampling

When both row and column margins are fixed, the appropriate sampling distribution is the hypergeometric. This situation is less common in practice.

When X and Y are independent, $\{n_{ij}\}$, given the row and column margins, follows the hypergeometric distribution
$$\frac{\big(\prod_{i=1}^{I} n_{i+}!\big)\big(\prod_{j=1}^{J} n_{+j}!\big)}{n_{++}! \prod_{i=1}^{I}\prod_{j=1}^{J} n_{ij}!}.$$
An exact test of independence can be developed by defining an ordering of the tables.

For a $2 \times 2$ table, the hypergeometric distribution is
$$P(n_{11} = k) = \frac{\binom{n_{1+}}{k}\binom{n_{2+}}{n_{+1} - k}}{\binom{n_{++}}{n_{+1}}}, \quad \max(0, n_{1+} + n_{+1} - n_{++}) \le k \le \min(n_{1+}, n_{+1}).$$

Fisher's exact test: the p-value equals the total probability of all outcomes at least as extreme as the one observed.

Example: Cross-Classification of Smoking by Lung Cancer (Continued)

> fisher.test(lcct)

        Fisher's Exact Test for Count Data

data:  lcct
p-value = 1.476e-05
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 ... ...
sample estimates:
odds ratio
       ...

> etheta*exp(sele*c(-1.96,1.96)) # CI based on asymptotic approximation
[1] ... ...
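
For a $2 \times 2$ table, the exact p-value can be assembled directly from the hypergeometric probabilities. The sketch below assumes the two-sided convention of summing the probabilities of all tables no more likely than the observed one, with a small relative tolerance for ties; `fisher_2x2` is a hypothetical helper, and the result is compared with `fisher.test`.

```r
# Two-sided Fisher exact p-value for a 2x2 table, built from dhyper().
# `fisher_2x2` is an illustrative helper, not part of the lecture code.
fisher_2x2 <- function(tab) {
  tab <- as.matrix(tab)
  r1 <- sum(tab[1, ])                      # n_{1+}
  r2 <- sum(tab[2, ])                      # n_{2+}
  c1 <- sum(tab[, 1])                      # n_{+1}
  support <- max(0, c1 - r2):min(r1, c1)   # possible values of n_11 given the margins
  p <- dhyper(support, m = r1, n = r2, k = c1)        # P(n_11 = k) under independence
  p_obs <- dhyper(tab[1, 1], m = r1, n = r2, k = c1)
  sum(p[p <= p_obs * (1 + 1e-7)])          # tables at least as extreme as the observed one
}

lcct <- matrix(c(688, 21, 650, 59), nrow = 2)
p_exact <- fisher_2x2(lcct)
c(by_hand = p_exact, fisher.test = fisher.test(lcct)$p.value)
```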

Modeling Ordinal Associations

Treating ordered categories as nominal categories ignores important information.

Example: US 1996 National Election Study (Continued)

Here we consider the association between party identification and level of education.

> data(nes96); xtabs(~PID+educ,nes96);
         educ
PID       MS HSdrop HS Coll CCdeg BAdeg MAdeg
  strdem  ...
  weakdem ...
  inddem  ...
  indind  ...
  indrep  ...
  weakrep ...
  strrep  ...
> (partyed <- as.data.frame.table(xtabs(~PID+educ,nes96))) #convert to data.frame
      PID  educ Freq
1  strdem    MS  ...
...
49 strrep MAdeg   25

> nomod <- glm(Freq~PID+educ,family=poisson,data=partyed);
> pchisq(deviance(nomod),df.residual(nomod),lower.tail=FALSE)
[1] ...

When both variables are treated as nominal, we have no evidence against independence.

> presid <- residuals(nomod,type="pearson");
> xtabs(presid~partyed$PID+partyed$educ);
           partyed$educ
partyed$PID MS HSdrop HS Coll CCdeg BAdeg MAdeg
    strdem  ...
    weakdem ...
    inddem  ...
    indind  ...
    indrep  ...
    weakrep ...
    strrep  ...

Cross-classifications of ordinal variables often exhibit their greatest deviations from independence in the corner cells.
- Sample counts are much larger than independence predicts when both responses are at the lowest order or both are at the highest order.
- The counts are much smaller than the fitted values where one response is at the highest order and the other is at the lowest order.

The above table of residuals indicates lack of fit in the form of a positive trend: subjects with a higher level of education also tend to be stronger Republicans.

Linear-by-Linear Association in Two-Way Tables

Assign the following row scores and column scores, respectively:
$$u_1 \le u_2 \le \cdots \le u_I, \qquad v_1 \le v_2 \le \cdots \le v_J.$$
A simple model for these two ordinal variables is the linear-by-linear association model ($L \times L$):
$$\log(\mu_{ij}) = \lambda + \alpha_i + \beta_j + \gamma u_i v_j.$$
- $\gamma u_i v_j$ represents the deviation of $\log(\mu_{ij})$ from independence.
- The deviation is linear in the Y scores at a fixed level of X, and linear in the X scores at a fixed level of Y, hence the name $L \times L$ model.
- The model has its greatest departures from independence in the corners of the table.
- $\gamma = 0$ implies independence of X and Y.
- When $\gamma > 0$, Y tends to increase as X increases.
- When $\gamma < 0$, Y tends to decrease as X increases.

Example: US 1996 National Election Study (Continued)

We assign evenly spaced scores, i.e., one to seven (you can also try other scores), to both PID and educ, and fit the $L \times L$ model.

> partyed$oPID <- unclass(partyed$PID); partyed$oeduc <- unclass(partyed$educ);
> lblmod <- glm(Freq~PID+educ+I(oPID*oeduc),family=poisson,data=partyed);
> summary(lblmod)
...
Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
...
I(oPID * oeduc)  ...      ...        ...     ...     **
...
    Null deviance: ... on 48 degrees of freedom
Residual deviance: ... on 35 degrees of freedom
AIC: ...

> anova(nomod,lblmod,test="Chisq");
Analysis of Deviance Table

Model 1: Freq ~ PID + educ
Model 2: Freq ~ PID + educ + I(oPID * oeduc)
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 ...
2 ...

Interpretation of $\gamma$

Consider the log odds ratio for a subtable whose cells are adjacent in both rows and columns, e.g., cells (i, j), (i, j+1), (i+1, j), and (i+1, j+1):
$$\log \frac{\mu_{ij}\,\mu_{i+1,j+1}}{\mu_{i,j+1}\,\mu_{i+1,j}} = \gamma (u_{i+1} - u_i)(v_{j+1} - v_j).$$
This log odds ratio is stronger as $|\gamma|$ increases and for pairs of categories that are farther apart.

For evenly spaced scores, these odds ratios are all equal. For instance, when $\{u_i = i\}$ and $\{v_j = j\}$, we have the constant local odds ratios
$$\theta_{ij} = \frac{\pi_{ij}\,\pi_{i+1,j+1}}{\pi_{i,j+1}\,\pi_{i+1,j}} = e^{\gamma}.$$
The case of constant local odds ratios was called uniform association by Goodman (1979).

As a Baseline-Category Logit Model:
$$\log \frac{\pi_{j|i}}{\pi_{1|i}} = \log \frac{\mu_{ij}}{\mu_{i1}} = (\beta_j - \beta_1) + \gamma (v_j - v_1) u_i.$$
We may fit a baseline-category logit model and expect the coefficients of $\{u_i\}$ to be $\{\gamma (v_j - v_1)\}$.
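
The constant local odds ratio under unit-spaced scores can be checked numerically. The sketch below builds expected counts from the $L \times L$ model with arbitrary, made-up parameter values (`alpha`, `beta`, `gamma` and the table size are illustrative assumptions, not estimates from any data) and verifies that every local odds ratio equals $e^{\gamma}$.

```r
# Local odds ratios implied by the linear-by-linear model with scores u_i = i, v_j = j.
# All parameter values below are arbitrary and chosen only for illustration.
I <- 4; J <- 5
lambda <- 1
alpha <- c(0, 0.2, -0.1, 0.4)          # row effects
beta <- c(0, 0.5, 0.1, -0.3, 0.2)      # column effects
gamma <- 0.3
u <- 1:I; v <- 1:J

log_mu <- lambda + outer(alpha, beta, "+") + gamma * outer(u, v)
mu <- exp(log_mu)

# theta_ij = mu_ij * mu_{i+1,j+1} / (mu_{i,j+1} * mu_{i+1,j}), i < I, j < J
local_or <- mu[-I, -J] * mu[-1, -1] / (mu[-I, -1] * mu[-1, -J])
local_or
exp(gamma)
```

The row and column effects cancel in each $2 \times 2$ block, so only $\gamma$ and the score differences remain.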

Example: US 1996 National Election Study (Continued)

> library(nnet); # provides multinom
> nes96$oeduc <- unclass(nes96$educ);
> nes96.mn1 <- multinom(PID~oeduc,data=nes96); summary(nes96.mn1);
...
Coefficients:
        (Intercept) oeduc
weakdem ...         ...
inddem  ...         ...
indind  ...         ...
indrep  ...         ...
weakrep ...         ...
strrep  ...         ...

> nes96$oPID <- unclass(nes96$PID);
> nes96.mn2 <- multinom(educ~oPID,data=nes96); summary(nes96.mn2);
...
Coefficients:
       (Intercept) oPID
HSdrop ...         ...
HS     ...         ...
Coll   ...         ...
CCdeg  ...         ...
BAdeg  ...         ...
MAdeg  ...         ...

Under the $L \times L$ model, we expect monotonically increasing (or decreasing) coefficients of oeduc and oPID. While the coefficients of oPID are more or less increasing, the coefficients of oeduc are apparently not. We may therefore treat only PID as ordinal and educ as nominal.

Column Effects Model

The columns are not assigned scores, as Y is considered a nominal variable:
$$\log(\mu_{ij}) = \lambda + \alpha_i + \beta_j + \gamma_j u_i.$$
- $\{\gamma_j\}$ are called the column effects.
- The zero-sum constraint is $\sum_{i=1}^{I} \alpha_i = \sum_{j=1}^{J} \beta_j = \sum_{j=1}^{J} \gamma_j = 0$.
- The baseline constraint is $\alpha_1 = \beta_1 = \gamma_1 = 0$.
- $\gamma_1 = \gamma_2 = \cdots = \gamma_J$ implies independence of X and Y.

As a Baseline-Category Logit Model:
$$\log \frac{\pi_{j|i}}{\pi_{1|i}} = \log \frac{\mu_{ij}}{\mu_{i1}} = (\beta_j - \beta_1) + (\gamma_j - \gamma_1) u_i.$$

A row effects model is effectively the same model with the roles of the variables reversed.

Example: US 1996 National Election Study (Continued)

> cmod <- glm(Freq~PID+educ+educ:oPID,family=poisson,data=partyed);
> mcoeff <- summary(cmod)$coeff;
> mcoeff[8:13,1] #beta_j-beta_1
educHSdrop     educHS   educColl  educCCdeg  educBAdeg  educMAdeg
       ...        ...        ...        ...        ...        ...
> mcoeff[15:19,1]-mcoeff[14,1] #gamma_j-gamma_1
educHSdrop:oPID     educHS:oPID   educColl:oPID  educCCdeg:oPID  educBAdeg:oPID
            ...             ...             ...             ...             ...
> -mcoeff[14,1] #gamma_J-gamma_1 for the last level (MAdeg), as gamma_J=0
[1] ...

These values are similar to the coefficients in nes96.mn2.

> anova(nomod,cmod,test="Chisq")
Analysis of Deviance Table

Model 1: Freq ~ PID + educ
Model 2: Freq ~ PID + educ + educ:oPID
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 ...
2 ...

The above comparison of cmod to the independence model nomod implies that the column effects model is preferred.

What about comparing lblmod and cmod?

Correspondence Analysis

Correspondence analysis is a graphical way to represent associations in two-way contingency tables. It is very helpful in understanding the dependence between a category of X and a category of Y.

This method is based on the Pearson residuals $R_{I \times J} = (r_{ij})_{I \times J}$, where $r_{ij}$ is the Pearson residual for cell (i, j).

Perform the singular value decomposition
$$R_{I \times J} = U_{I \times w} D_{w \times w} V_{J \times w}^{T} \;\Longrightarrow\; r_{ij} = \sum_{k=1}^{w} u_{ik} d_k v_{jk}.$$
- $w = \min(I, J)$.
- $U_{I \times w} = (u_{ik})_{I \times w}$ and $V_{J \times w} = (v_{jk})_{J \times w}$ have orthogonal column vectors, which are called the left and right singular vectors, respectively.
- $D = \text{diag}\{d_1, \dots, d_w\}$ with $d_1 \ge d_2 \ge \cdots \ge d_w$, which are called the singular values.
- $\sum_{i=1}^{w} d_i^2$, which equals Pearson's $X^2$, is called the inertia.

Usually $d_1^2 + d_2^2$ accounts for most of $\sum_{i=1}^{w} d_i^2 = X^2$. Therefore, $u_{i1} d_1 v_{j1} + u_{i2} d_2 v_{j2}$ will account for most of the Pearson residual $r_{ij}$, i.e.,
$$r_{ij} \approx u_{i1} d_1 v_{j1} + u_{i2} d_2 v_{j2}.$$
Denote, for k = 1, 2,
$$U_k = \sqrt{d_k}\,(u_{1k}, \dots, u_{Ik})^{T}, \qquad V_k = \sqrt{d_k}\,(v_{1k}, \dots, v_{Jk})^{T}.$$
The two-dimensional correspondence plot displays $U_2$ against $U_1$, and $V_2$ against $V_1$, on the same graph.
- Plotting $U_2$ vs. $U_1$ shows the influence on the residuals of ignoring the row effect. A large $U_i$ indicates that the profile of row i is peculiar.
- Plotting $V_2$ vs. $V_1$ shows the influence on the residuals of ignoring the column effect. A large $V_i$ indicates that the profile of column i is peculiar.
- If a row level and a column level appear close together on the plot and far from the origin, there will be a large positive residual associated with this particular combination, indicating a strong positive association.
- If a row level and a column level are situated diametrically apart on either side of the origin, we may expect a large negative residual, indicating a strong negative association.
- If points representing two row levels or two column levels are close together, the two levels have a similar pattern of association. In some cases, one might consider combining the two levels.
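
The two facts used above, that the inertia equals Pearson's $X^2$ and that the first two singular triplets approximate the residuals, can be verified on a small table. The $3 \times 3$ counts below are made up for illustration only; the coordinates split each singular value as $\sqrt{d_k}$ between rows and columns, which is an assumption matching the `sqrt(svdz$d[1:2])` scaling in the hair and eye color example.

```r
# SVD of Pearson residuals for a small, made-up 3x3 table of counts.
tab <- matrix(c(20, 10,  5,
                10, 15, 10,
                 5, 10, 25), nrow = 3, byrow = TRUE)
emu <- outer(rowSums(tab), colSums(tab)) / sum(tab)   # fitted counts under independence
R <- (tab - emu) / sqrt(emu)                          # Pearson residuals

sv <- svd(R)
inertia <- sum(sv$d^2)      # sum of squared singular values
x2 <- sum(R^2)              # Pearson's X^2

# Two-dimensional row and column coordinates
U <- sv$u[, 1:2] %*% diag(sqrt(sv$d[1:2]))
V <- sv$v[, 1:2] %*% diag(sqrt(sv$d[1:2]))
approx_R <- U %*% t(V)      # rank-2 reconstruction of the residuals

c(inertia = inertia, X2 = x2, max_error = max(abs(R - approx_R)))
```

For a $3 \times 3$ table the residual matrix has rank at most 2, so the reconstruction here is exact up to rounding; for larger tables the error reflects the singular values that were dropped.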

Example: Hair and Eye Color

Data collected from 592 students in an introductory statistics class by counting the numbers of students with given hair/eye combinations.

> library(faraway); data(haireye); (ct <- xtabs(y~hair+eye,haireye));
       eye
hair    green hazel blue brown
  BLACK ...
  BROWN ...
  RED   ...
  BLOND ...
> modc <- glm(y~hair+eye,family=poisson,data=haireye);
> pchisq(modc$deviance,modc$df.resid,lower.tail=FALSE)
[1] ...

The above GOF test shows that hair and eye color are not independent.

> z <- xtabs(residuals(modc,type="pearson")~hair+eye,data=haireye);
> svdz <- svd(z,2,2);
> leftsv <- svdz$u %*% diag(sqrt(svdz$d[1:2]));
> rightsv <- svdz$v %*% diag(sqrt(svdz$d[1:2]));
> bd <- 1.1*max(abs(rightsv),abs(leftsv));

> plot(rbind(leftsv,rightsv),asp=1,xlim=c(-bd,bd),ylim=c(-bd,bd),xlab="SV1",
       ylab="SV2",type="n")
> abline(h=0,v=0);
> text(leftsv,dimnames(z)[[1]]); text(rightsv,dimnames(z)[[2]]);

[Figure: correspondence plot with SV1 on the horizontal axis and SV2 on the vertical axis; points are labeled BLACK, BROWN, RED, BLOND (hair) and brown, hazel, green, blue (eye).]

- BLOND is far from the origin, indicating that the distribution of eye colors within this group of people is not typical. In contrast, BROWN is close to the origin, indicating an eye color distribution that is close to the overall average.
- blue and BLOND occur close together on the plot and far from the origin, indicating a strong association between blue eyes and blond hair. On the other hand, there are relatively fewer people with BLOND hair and brown eyes than would be expected under independence.
- hazel and green are close together, indicating that people with hazel or green eyes have similar hair color distributions, so we might choose to combine these two categories.

Models for Matched Pairs

Matched-pairs data:
- occur in studies that compare categorical responses for two samples when each observation in one sample pairs with an observation in the other;
- arise from repeated measurement of subjects, such as longitudinal studies that observe subjects over time;
- are summarized by a square two-way contingency table with the same row and column categories.

Example: Rating Performance of the Prime Minister

For a poll of a random sample of 1600 voting-age British citizens, 944 indicated approval of the Prime Minister's performance in office. Six months later, of these same 1600 people, 880 indicated approval.

A strong association exists between opinions six months apart, as the sample odds ratio is $(794 \times 570)/(150 \times 86) \approx 35.1$ (Q: confidence interval?).

First        Second Survey
Survey       Approve   Disapprove   Total
Approve          794          150     944
Disapprove        86          570     656
Total            880          720    1600

Example: Grading of Eye Pairs for Distance Vision

A sample of women are rated for the performance of distance vision in each eye.

> library(faraway); data(eyegrade);
> (ct<-xtabs(y~right+left,eyegrade))
        left
right    best second third worst
  best   ...
  second ...
  third  ...
  worst  ...
> summary(ct)
Call: xtabs(formula = y ~ right + left, data = eyegrade)
Number of cases in table: 7477
Number of factors: 2
Test for independence of all factors:
        Chisq = 8097, df = 9, p-value = 0

It is not surprising to find strong evidence against independence. A more interesting hypothesis for matched-pairs data is whether $\pi_{ij} = \pi_{ji}$ for all i and j.

An $I \times I$ distribution $\{\pi_{ij}\}$ satisfies symmetry if
$$\pi_{ij} = \pi_{ji}, \quad i = 1, \dots, I; \; j = 1, \dots, J \; (J = I).$$
Under symmetry, $\pi_{i+} = \sum_{j=1}^{J} \pi_{ij} = \sum_{j=1}^{J} \pi_{ji} = \pi_{+i}$, which implies marginal homogeneity.
- For I = 2, symmetry is equivalent to marginal homogeneity.
- For I > 2, marginal homogeneity can occur without symmetry.

Symmetry as Logit Models
$$\log \frac{\pi_{ij}}{\pi_{ji}} = 0, \quad \text{for all } i < j.$$
The MLE of $\pi_{ij} = \pi_{ji}$ is $\hat\pi_{ij} = \hat\pi_{ji} = \frac{n_{ij} + n_{ji}}{2 n_{++}}$, and the LRT is
$$2 \sum_{i \ne j} n_{ij} \log \frac{2 n_{ij}}{n_{ij} + n_{ji}} \;\overset{asy.}{\sim}\; \chi^2_{I(I-1)/2}, \quad \text{under } H_0.$$

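Symmetry as Loglinear Models
$$\log(\mu_{ij}) = \lambda + \alpha_i + \alpha_j + \gamma_{ij}, \quad \gamma_{ij} = \gamma_{ji}.$$
- Symmetry $\Longrightarrow \pi_{i+} = \pi_{+i}$.
- $\gamma_{ij} = \gamma_{ji} \Longrightarrow \mu_{ij} = \mu_{ji}$.

The MLE of $\mu_{ij} = \mu_{ji}$ is $\hat\mu_{ij} = \hat\mu_{ji} = (n_{ij} + n_{ji})/2$, and the LRT is
$$2 \sum_{i \ne j} n_{ij} \log \frac{2 n_{ij}}{n_{ij} + n_{ji}} \;\overset{asy.}{\sim}\; \chi^2_{I(I-1)/2}, \quad \text{under } H_0.$$
This is equivalent to the goodness-of-fit test for a loglinear model with properly defined dummy variables!

Bowker's Test of Symmetry (Bowker, 1948)
$$X^2 = \sum_{i=1}^{I-1} \sum_{j=i+1}^{I} \frac{(n_{ij} - n_{ji})^2}{n_{ij} + n_{ji}} \;\overset{asy.}{\sim}\; \chi^2_{I(I-1)/2}, \quad \text{under } H_0.$$
When I = J = 2, the above test is called McNemar's test (McNemar, 1947).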
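
Bowker's statistic is short enough to compute directly. The sketch below applies it to the Prime Minister approval counts (794, 150, 86, 570, the cells implied by the totals 944, 880 and 1600 quoted in that example) and compares the result with base R's `mcnemar.test` without continuity correction; `bowker` is a hypothetical helper that assumes every off-diagonal pair has $n_{ij} + n_{ji} > 0$.

```r
# Bowker's test of symmetry for a square table; reduces to McNemar's test when I = 2.
# `bowker` is an illustrative helper, not part of the lecture code.
bowker <- function(tab) {
  tab <- as.matrix(tab)
  up <- upper.tri(tab)
  nij <- tab[up]                # counts above the diagonal
  nji <- t(tab)[up]             # matching counts below the diagonal
  x2 <- sum((nij - nji)^2 / (nij + nji))
  df <- sum(up)                 # I(I-1)/2
  c(statistic = x2, df = df, p.value = pchisq(x2, df, lower.tail = FALSE))
}

# Rows: first survey (approve, disapprove); columns: second survey
pm <- matrix(c(794, 86, 150, 570), nrow = 2,
             dimnames = list(first = c("approve", "disapprove"),
                             second = c("approve", "disapprove")))
b <- bowker(pm)
b
mcnemar.test(pm, correct = FALSE)
```

Only the discordant cells (150 and 86) enter the statistic; the diagonal counts carry no information about symmetry.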

Example: Grading of Eye Pairs for Distance Vision (Continued)

> mct <- matrix(ct,nrow=4);
> 2*sum(mct*log(2*mct/(mct+t(mct)))) # LRT
[1] ...
> pchisq(2*sum(mct*log(2*mct/(mct+t(mct)))),6,lower.tail=FALSE)
[1] ...
> (symfac <- factor(apply(eyegrade[,2:3],1,function(x) paste(sort(x),collapse="-"))))
best-best     best-second   best-third    best-worst
best-second   second-second second-third  second-worst
best-third    second-third  third-third   third-worst
best-worst    second-worst  third-worst   worst-worst
10 Levels: best-best best-second best-third best-worst second-second ... worst-worst
> mods <- glm(y ~ symfac, family=poisson, data=eyegrade);
> c(deviance(mods),df.residual(mods));
[1] ... ...
> pchisq(deviance(mods), df.residual(mods),lower.tail=FALSE); # GOF of loglinear model
[1] ...
> sum((mct-t(mct))^2/(mct+t(mct)))/2 # Bowker's test of symmetry
[1] ...
> pchisq(sum((mct-t(mct))^2/(mct+t(mct)))/2,6,lower.tail=FALSE)
[1] ...

Quasi-symmetry: allows the main-effect terms in the symmetry loglinear model to differ, to accommodate marginal heterogeneity:
$$\log(\mu_{ij}) = \lambda + \alpha_i + \beta_j + \gamma_{ij}, \quad \gamma_{ij} = \gamma_{ji}.$$
$\pi_{ij} = ?$ (note that $\mu_{ij} \ne \mu_{ji}$)

Q: What is the LRT?
$$2(l_{full} - l_{null}) \;\overset{asy.}{\sim}\; \chi^2_{(I-1)(I-2)/2}, \quad \text{under } H_0.$$
This is equivalent to the goodness-of-fit test for a loglinear model with properly defined dummy variables!

Example: Grading of Eye Pairs for Distance Vision (Continued)

> modq <- glm(y ~ right+left+symfac, family=poisson, data=eyegrade);
> c(deviance(modq),df.residual(modq));
[1] ... ...
> pchisq(deviance(modq), df.residual(modq),lower.tail=FALSE); # GOF of loglinear model
[1] ...
> anova(mods,modq,test="Chisq");
Model 1: y ~ symfac
Model 2: y ~ right + left + symfac
  Resid. Df Resid. Dev Df Deviance P(>|Chi|)
1 ...
2 ...

A square contingency table satisfies quasi-independence when the variables are independent, given that the row and column outcomes differ:
$$\log(\mu_{ij}) = \lambda + \alpha_i + \beta_j + \delta_i 1_{\{i=j\}}.$$
- The first three terms specify independence, and $\{\delta_i\}$ permit $\{\mu_{ii}\}$ to depart from this pattern and have arbitrary positive values.
- Quasi-independence is the special case of quasi-symmetry in which $\{\gamma_{ij}, i \ne j\}$ are identical. The two are equivalent when I = 3.

Q: What is the LRT ($I \ge 3$)?
$$2(l_{full} - l_{null}) \;\overset{asy.}{\sim}\; \chi^2_{(I-1)^2 - I}, \quad \text{under } H_0.$$
This is equivalent to the goodness-of-fit test for a loglinear model with properly defined dummy variables!

Example: Grading of Eye Pairs for Distance Vision (Continued)

> modqi <- glm(y ~ right+left, family=poisson, subset=-c(1,6,11,16), data=eyegrade);
> c(deviance(modqi),df.residual(modqi));
[1] ... ...
> pchisq(deviance(modqi), df.residual(modqi),lower.tail=FALSE); # GOF of loglinear model
[1] ...e-41

Three-Way Contingency Tables

Example: Mortality Due to Smoking in Women

A survey of one in six residents of Whickham, near Newcastle, England was made in ... . Twenty years later, the data below were recorded in a follow-up study. Only women who are current smokers or who have never smoked are included.

> library(faraway); data(femsmoke);
> cbind(femsmoke[1:14,], index=15:28,femsmoke[15:28,])

[Printed listing: 28 rows shown side by side (rows 1 to 14 and 15 to 28), with columns y, smoker, dead and age; the first row has y = 2, smoker = yes, dead = yes, and the last age group is 75+. The remaining counts and age labels were lost in transcription.]

Simpson's Paradox

> (ct <- xtabs(y~smoker+dead,femsmoke))
      dead
smoker yes  no
   yes ... ...
   no  ... ...
> fisher.test(ct)
...
p-value = ...
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 ... ...
sample estimates:
odds ratio
       ...

Can we conclude that smoking has a beneficial effect on longevity?

> ct3 <- xtabs(y~smoker+dead+age,femsmoke)
> apply(ct3,3,function(x){tr<-fisher.test(x); tr$estimate})
...

All odds ratios are greater than one, with the exception of one age group.

But how do we test independence in $2 \times 2$ tables across K strata?

$I \times J \times K$ Table: the three categorical variables, e.g., X, Y and Z, have I, J and K categories, respectively.
- Multinomial sampling: assumes a multinomial distribution with cell probabilities $\{\pi_{ijk}\}$, where $\sum_i \sum_j \sum_k \pi_{ijk} = 1$.
- Poisson sampling: assumes each cell count $n_{ijk}$ follows a Poisson distribution with rate $\mu_{ijk}$. So $n_{+++} \sim \text{Poisson}(\mu_{+++})$ with $\sum_i \sum_j \sum_k \mu_{ijk} = \mu_{+++}$.
- $\mu_{ijk} = \mu_{+++} \pi_{ijk}$.

Mutual Independence

X, Y and Z are mutually independent when, for all i, j and k,
$$\pi_{ijk} = \pi_{i++}\,\pi_{+j+}\,\pi_{++k}.$$
Mutual independence has the loglinear form
$$\log \mu_{ijk} = \lambda + \alpha_i + \beta_j + \gamma_k.$$
Test of mutual independence:
- Pearson's $\chi^2$ test
- GOF test for the loglinear model

Example: Mortality Due to Smoking in Women (Continued)

> summary(ct3) # Pearson's chi-sq test
Call: xtabs(formula = y ~ smoker + dead + age, data = femsmoke)
Number of cases in table: 1314
Number of factors: 3
Test for independence of all factors:
        Chisq = 790.6, df = 19, p-value = 2.140e-155
> modi <- glm(y~smoker+dead+age,family=poisson,data=femsmoke);
> c(deviance(modi),df.residual(modi))
[1] ... ...
> pchisq(deviance(modi),df.residual(modi),lower.tail=FALSE)
[1] ...e-143

Conclusion?

Joint Independence

Z is jointly independent of X and Y when, for all i, j and k,
$$\pi_{ijk} = \pi_{ij+}\,\pi_{++k}.$$
Joint independence has the loglinear form
$$\log \mu_{ijk} = \lambda + \alpha_{ij} + \beta_k.$$
Mutual independence implies joint independence of any one variable from the others.

Test of joint independence:
- Pearson's $\chi^2$ test (after combining the levels of X and Y)
- GOF test for the loglinear model

Example: Mortality Due to Smoking in Women (Continued)

We want to investigate whether age is jointly independent of smoking and life status.

> femsmoke$sdead <- factor(apply(femsmoke[,2:3],1, function(x) paste(x,collapse="-")))
> (ct2 <- xtabs(y~sdead+age,femsmoke))
         age
sdead     ...
  no-no   ...
  no-yes  ...
  yes-no  ...
  yes-yes ...
> summary(ct2)
Call: xtabs(formula = y ~ sdead + age, data = femsmoke)
Number of cases in table: 1314
Number of factors: 2
Test for independence of all factors:
        Chisq = 734.7, df = 18, p-value = 2.455e-144
> modj <- glm(y~smoker*dead+age,family=poisson,data=femsmoke)
> c(deviance(modj),df.residual(modj))
[1] ... ...
> pchisq(deviance(modj),df.residual(modj),lower.tail=FALSE)
[1] ...e-142

Conclusion?

Conditional Independence

X and Y are conditionally independent given Z when, for all i, j and k,
$$\pi_{ij|k} = \pi_{i+|k}\,\pi_{+j|k} \iff \pi_{ijk} = \pi_{i+k}\,\pi_{+jk}/\pi_{++k},$$
where $\pi_{ij|k} = P(X = i, Y = j \mid Z = k)$, $\pi_{i+|k} = P(X = i \mid Z = k)$, and $\pi_{+j|k} = P(Y = j \mid Z = k)$.

Conditional independence has the loglinear form
$$\log \mu_{ijk} = \lambda + \alpha_{ik} + \beta_{jk}.$$
It is a weaker condition than mutual or joint independence.

Test of conditional independence:
- $2 \times 2 \times K$ tables: Cochran-Mantel-Haenszel test
$$CMH = \frac{\big(\big|\sum_k n_{11k} - \sum_k \mu_{11k}\big| - 0.5\big)^2}{\sum_k \sigma^2_{11k}} \;\overset{asy.}{\sim}\; \chi^2_1, \quad \text{under } H_0,$$
$$\mu_{11k} = n_{1+k}\,n_{+1k}/n_{++k}, \qquad \sigma^2_{11k} = n_{1+k}\,n_{2+k}\,n_{+1k}\,n_{+2k}/[n^2_{++k}(n_{++k} - 1)].$$
- Test which of the three possible two-way interactions does not appear in the loglinear model.
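
The CMH statistic can be computed stratum by stratum from the formulas above. The sketch below uses `UCBAdmissions`, a $2 \times 2 \times 6$ table shipped with base R, purely as a stand-in dataset that needs no extra package (it is not the lecture's data); `cmh_2x2xK` is a hypothetical helper, and the result is compared with `mantelhaen.test`.

```r
# Cochran-Mantel-Haenszel statistic for a 2 x 2 x K array of counts.
# `cmh_2x2xK` is an illustrative helper; UCBAdmissions is only a stand-in dataset.
cmh_2x2xK <- function(arr, correct = TRUE) {
  n11 <- arr[1, 1, ]
  r1 <- apply(arr, 3, function(x) sum(x[1, ]))    # n_{1+k}
  r2 <- apply(arr, 3, function(x) sum(x[2, ]))    # n_{2+k}
  c1 <- apply(arr, 3, function(x) sum(x[, 1]))    # n_{+1k}
  c2 <- apply(arr, 3, function(x) sum(x[, 2]))    # n_{+2k}
  n <- r1 + r2                                    # n_{++k}
  mu <- r1 * c1 / n                               # E[n_11k] under conditional independence
  s2 <- r1 * r2 * c1 * c2 / (n^2 * (n - 1))       # Var[n_11k]
  delta <- abs(sum(n11 - mu))
  adj <- if (correct && delta >= 0.5) 0.5 else 0  # continuity correction
  stat <- (delta - adj)^2 / sum(s2)
  c(statistic = stat, p.value = pchisq(stat, 1, lower.tail = FALSE))
}

cmh <- cmh_2x2xK(UCBAdmissions)
cmh
mantelhaen.test(UCBAdmissions)   # asymptotic CMH test from base R
```

Each stratum needs at least two observations, since the variance term divides by $n_{++k} - 1$.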

Example: Mortality Due to Smoking in Women (Continued)

We want to investigate whether smoking and life status are independent given age.

> mantelhaen.test(ct3,exact=TRUE)

        Exact conditional test of independence in 2 x 2 x k tables

data:  ct3
S = 139, p-value = ...
alternative hypothesis: true common odds ratio is not equal to 1
95 percent confidence interval:
 ... ...
sample estimates:
common odds ratio
              ...

> modc <- glm(y~smoker*age+dead*age,family=poisson,data=femsmoke)
> c(deviance(modc),df.residual(modc))
[1] ... ...
> pchisq(deviance(modc),df.residual(modc),lower.tail=FALSE)
[1] ...

Conclusion?

Caution: the GOF test may not work well when some cell counts are small!

Homogeneous Association

Homogeneous association implies that the conditional relationship between any pair of variables, given the third one, is the same at each level of the third variable. It is also known as the no-three-factor-interaction model or the no-second-order-interaction model.

The loglinear model of homogeneous association is
$$\log \mu_{ijk} = \lambda + \alpha_{ij} + \beta_{jk} + \gamma_{ik}.$$
At a fixed level k of Z, consider the conditional local odds ratios
$$\theta_{ij|k} = \frac{\pi_{ij|k}\,\pi_{i+1,j+1|k}}{\pi_{i,j+1|k}\,\pi_{i+1,j|k}} = \frac{\pi_{ijk}\,\pi_{i+1,j+1,k}}{\pi_{i,j+1,k}\,\pi_{i+1,j,k}} = \frac{\mu_{ijk}\,\mu_{i+1,j+1,k}}{\mu_{i,j+1,k}\,\mu_{i+1,j,k}} = \frac{\exp\{\alpha_{ij} + \alpha_{i+1,j+1}\}}{\exp\{\alpha_{i+1,j} + \alpha_{i,j+1}\}},$$
which does not depend on k. A similar conclusion holds when fixing the level of X or Y.

> modh <- glm(y~(smoker+age+dead)^2,family=poisson,data=femsmoke);
> ctf <- xtabs(fitted(modh)~smoker+dead+age,femsmoke)
> apply(ctf,3,function(x) (x[1,1]*x[2,2])/(x[1,2]*x[2,1]) )
...
> anova(modc,modh,test="Chisq") # p-value = ...

Conclusion?
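
The claim that the fitted conditional odds ratios do not depend on k can be checked on any $2 \times 2 \times K$ table. The sketch below again uses base R's `UCBAdmissions` as a stand-in dataset (an assumption made so the code runs without the faraway package); the variable names `Admit`, `Gender`, `Dept` and `Freq` come from `as.data.frame` applied to that table.

```r
# Fitted conditional odds ratios under the homogeneous association model.
# UCBAdmissions is a stand-in 2 x 2 x 6 table; it is not the lecture's data.
dat <- as.data.frame(UCBAdmissions)     # columns: Admit, Gender, Dept, Freq
modh <- glm(Freq ~ (Admit + Gender + Dept)^2, family = poisson, data = dat)

# Fitted counts arranged as a 2 x 2 x K array, then one odds ratio per stratum
fit <- xtabs(fitted(modh) ~ Admit + Gender + Dept, data = dat)
or_k <- apply(fit, 3, function(x) x[1, 1] * x[2, 2] / (x[1, 2] * x[2, 1]))
or_k

# The common value is the exponentiated Admit:Gender interaction coefficient
exp(coef(modh)["AdmitRejected:GenderFemale"])
```

All strata share one fitted odds ratio because the stratum-specific terms cancel in each $2 \times 2$ slice, leaving only the two-way interaction between the first two variables.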


More information

Three-Way Contingency Tables

Three-Way Contingency Tables Newsom PSY 50/60 Categorical Data Analysis, Fall 06 Three-Way Contingency Tables Three-way contingency tables involve three binary or categorical variables. I will stick mostly to the binary case to keep

More information

Homework 10 - Solution

Homework 10 - Solution STAT 526 - Spring 2011 Homework 10 - Solution Olga Vitek Each part of the problems 5 points 1. Faraway Ch. 4 problem 1 (page 93) : The dataset parstum contains cross-classified data on marijuana usage

More information

Log-linear Models for Contingency Tables

Log-linear Models for Contingency Tables Log-linear Models for Contingency Tables Statistics 149 Spring 2006 Copyright 2006 by Mark E. Irwin Log-linear Models for Two-way Contingency Tables Example: Business Administration Majors and Gender A

More information

MSH3 Generalized linear model

MSH3 Generalized linear model Contents MSH3 Generalized linear model 7 Log-Linear Model 231 7.1 Equivalence between GOF measures........... 231 7.2 Sampling distribution................... 234 7.3 Interpreting Log-Linear models..............

More information

NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION. ST3241 Categorical Data Analysis. (Semester II: ) April/May, 2011 Time Allowed : 2 Hours

NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION. ST3241 Categorical Data Analysis. (Semester II: ) April/May, 2011 Time Allowed : 2 Hours NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION Categorical Data Analysis (Semester II: 2010 2011) April/May, 2011 Time Allowed : 2 Hours Matriculation No: Seat No: Grade Table Question 1 2 3 4 5 6 Full marks

More information

Categorical Data Analysis Chapter 3

Categorical Data Analysis Chapter 3 Categorical Data Analysis Chapter 3 The actual coverage probability is usually a bit higher than the nominal level. Confidence intervals for association parameteres Consider the odds ratio in the 2x2 table,

More information

BIOS 625 Fall 2015 Homework Set 3 Solutions

BIOS 625 Fall 2015 Homework Set 3 Solutions BIOS 65 Fall 015 Homework Set 3 Solutions 1. Agresti.0 Table.1 is from an early study on the death penalty in Florida. Analyze these data and show that Simpson s Paradox occurs. Death Penalty Victim's

More information

ij i j m ij n ij m ij n i j Suppose we denote the row variable by X and the column variable by Y ; We can then re-write the above expression as

ij i j m ij n ij m ij n i j Suppose we denote the row variable by X and the column variable by Y ; We can then re-write the above expression as page1 Loglinear Models Loglinear models are a way to describe association and interaction patterns among categorical variables. They are commonly used to model cell counts in contingency tables. These

More information

3 Way Tables Edpsy/Psych/Soc 589

3 Way Tables Edpsy/Psych/Soc 589 3 Way Tables Edpsy/Psych/Soc 589 Carolyn J. Anderson Department of Educational Psychology I L L I N O I S university of illinois at urbana-champaign c Board of Trustees, University of Illinois Spring 2017

More information

Two Hours. Mathematical formula books and statistical tables are to be provided THE UNIVERSITY OF MANCHESTER. 26 May :00 16:00

Two Hours. Mathematical formula books and statistical tables are to be provided THE UNIVERSITY OF MANCHESTER. 26 May :00 16:00 Two Hours MATH38052 Mathematical formula books and statistical tables are to be provided THE UNIVERSITY OF MANCHESTER GENERALISED LINEAR MODELS 26 May 2016 14:00 16:00 Answer ALL TWO questions in Section

More information

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator UNIVERSITY OF TORONTO Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS Duration - 3 hours Aids Allowed: Calculator LAST NAME: FIRST NAME: STUDENT NUMBER: There are 27 pages

More information

Multinomial Logistic Regression Models

Multinomial Logistic Regression Models Stat 544, Lecture 19 1 Multinomial Logistic Regression Models Polytomous responses. Logistic regression can be extended to handle responses that are polytomous, i.e. taking r>2 categories. (Note: The word

More information

STAC51: Categorical data Analysis

STAC51: Categorical data Analysis STAC51: Categorical data Analysis Mahinda Samarakoon January 26, 2016 Mahinda Samarakoon STAC51: Categorical data Analysis 1 / 32 Table of contents Contingency Tables 1 Contingency Tables Mahinda Samarakoon

More information

Sections 3.4, 3.5. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis

Sections 3.4, 3.5. Timothy Hanson. Department of Statistics, University of South Carolina. Stat 770: Categorical Data Analysis Sections 3.4, 3.5 Timothy Hanson Department of Statistics, University of South Carolina Stat 770: Categorical Data Analysis 1 / 22 3.4 I J tables with ordinal outcomes Tests that take advantage of ordinal

More information

Lecture 25. Ingo Ruczinski. November 24, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University

Lecture 25. Ingo Ruczinski. November 24, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University Lecture 25 Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University November 24, 2015 1 2 3 4 5 6 7 8 9 10 11 1 Hypothesis s of homgeneity 2 Estimating risk

More information

Chapter 11: Analysis of matched pairs

Chapter 11: Analysis of matched pairs Chapter 11: Analysis of matched pairs Timothy Hanson Department of Statistics, University of South Carolina Stat 770: Categorical Data Analysis 1 / 42 Chapter 11: Models for Matched Pairs Example: Prime

More information

2 Describing Contingency Tables

2 Describing Contingency Tables 2 Describing Contingency Tables I. Probability structure of a 2-way contingency table I.1 Contingency Tables X, Y : cat. var. Y usually random (except in a case-control study), response; X can be random

More information

Matched Pair Data. Stat 557 Heike Hofmann

Matched Pair Data. Stat 557 Heike Hofmann Matched Pair Data Stat 557 Heike Hofmann Outline Marginal Homogeneity - review Binary Response with covariates Ordinal response Symmetric Models Subject-specific vs Marginal Model conditional logistic

More information

Simple logistic regression

Simple logistic regression Simple logistic regression Biometry 755 Spring 2009 Simple logistic regression p. 1/47 Model assumptions 1. The observed data are independent realizations of a binary response variable Y that follows a

More information

ST3241 Categorical Data Analysis I Two-way Contingency Tables. 2 2 Tables, Relative Risks and Odds Ratios

ST3241 Categorical Data Analysis I Two-way Contingency Tables. 2 2 Tables, Relative Risks and Odds Ratios ST3241 Categorical Data Analysis I Two-way Contingency Tables 2 2 Tables, Relative Risks and Odds Ratios 1 What Is A Contingency Table (p.16) Suppose X and Y are two categorical variables X has I categories

More information

Homework 1 Solutions

Homework 1 Solutions 36-720 Homework 1 Solutions Problem 3.4 (a) X 2 79.43 and G 2 90.33. We should compare each to a χ 2 distribution with (2 1)(3 1) 2 degrees of freedom. For each, the p-value is so small that S-plus reports

More information

Testing Independence

Testing Independence Testing Independence Dipankar Bandyopadhyay Department of Biostatistics, Virginia Commonwealth University BIOS 625: Categorical Data & GLM 1/50 Testing Independence Previously, we looked at RR = OR = 1

More information

CDA Chapter 3 part II

CDA Chapter 3 part II CDA Chapter 3 part II Two-way tables with ordered classfications Let u 1 u 2... u I denote scores for the row variable X, and let ν 1 ν 2... ν J denote column Y scores. Consider the hypothesis H 0 : X

More information

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models SCHOOL OF MATHEMATICS AND STATISTICS Linear and Generalised Linear Models Autumn Semester 2017 18 2 hours Attempt all the questions. The allocation of marks is shown in brackets. RESTRICTED OPEN BOOK EXAMINATION

More information

n y π y (1 π) n y +ylogπ +(n y)log(1 π).

n y π y (1 π) n y +ylogπ +(n y)log(1 π). Tests for a binomial probability π Let Y bin(n,π). The likelihood is L(π) = n y π y (1 π) n y and the log-likelihood is L(π) = log n y +ylogπ +(n y)log(1 π). So L (π) = y π n y 1 π. 1 Solving for π gives

More information

Lecture 25: Models for Matched Pairs

Lecture 25: Models for Matched Pairs Lecture 25: Models for Matched Pairs Dipankar Bandyopadhyay, Ph.D. BMTRY 711: Analysis of Categorical Data Spring 2011 Division of Biostatistics and Epidemiology Medical University of South Carolina Lecture

More information

Analysis of Categorical Data Three-Way Contingency Table

Analysis of Categorical Data Three-Way Contingency Table Yu Lecture 4 p. 1/17 Analysis of Categorical Data Three-Way Contingency Table Yu Lecture 4 p. 2/17 Outline Three way contingency tables Simpson s paradox Marginal vs. conditional independence Homogeneous

More information

13.1 Categorical Data and the Multinomial Experiment

13.1 Categorical Data and the Multinomial Experiment Chapter 13 Categorical Data Analysis 13.1 Categorical Data and the Multinomial Experiment Recall Variable: (numerical) variable (i.e. # of students, temperature, height,). (non-numerical, categorical)

More information

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form:

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form: Outline for today What is a generalized linear model Linear predictors and link functions Example: fit a constant (the proportion) Analysis of deviance table Example: fit dose-response data using logistic

More information

Chapter 2: Describing Contingency Tables - I

Chapter 2: Describing Contingency Tables - I : Describing Contingency Tables - I Dipankar Bandyopadhyay Department of Biostatistics, Virginia Commonwealth University BIOS 625: Categorical Data & GLM [Acknowledgements to Tim Hanson and Haitao Chu]

More information

8 Nominal and Ordinal Logistic Regression

8 Nominal and Ordinal Logistic Regression 8 Nominal and Ordinal Logistic Regression 8.1 Introduction If the response variable is categorical, with more then two categories, then there are two options for generalized linear models. One relies on

More information

Discrete Multivariate Statistics

Discrete Multivariate Statistics Discrete Multivariate Statistics Univariate Discrete Random variables Let X be a discrete random variable which, in this module, will be assumed to take a finite number of t different values which are

More information

STA 303 H1S / 1002 HS Winter 2011 Test March 7, ab 1cde 2abcde 2fghij 3

STA 303 H1S / 1002 HS Winter 2011 Test March 7, ab 1cde 2abcde 2fghij 3 STA 303 H1S / 1002 HS Winter 2011 Test March 7, 2011 LAST NAME: FIRST NAME: STUDENT NUMBER: ENROLLED IN: (circle one) STA 303 STA 1002 INSTRUCTIONS: Time: 90 minutes Aids allowed: calculator. Some formulae

More information

Research Methodology: Tools

Research Methodology: Tools MSc Business Administration Research Methodology: Tools Applied Data Analysis (with SPSS) Lecture 05: Contingency Analysis March 2014 Prof. Dr. Jürg Schwarz Lic. phil. Heidi Bruderer Enzler Contents Slide

More information

Review of Statistics 101

Review of Statistics 101 Review of Statistics 101 We review some important themes from the course 1. Introduction Statistics- Set of methods for collecting/analyzing data (the art and science of learning from data). Provides methods

More information

STAT 526 Spring Midterm 1. Wednesday February 2, 2011

STAT 526 Spring Midterm 1. Wednesday February 2, 2011 STAT 526 Spring 2011 Midterm 1 Wednesday February 2, 2011 Time: 2 hours Name (please print): Show all your work and calculations. Partial credit will be given for work that is partially correct. Points

More information

Poisson Regression. Gelman & Hill Chapter 6. February 6, 2017

Poisson Regression. Gelman & Hill Chapter 6. February 6, 2017 Poisson Regression Gelman & Hill Chapter 6 February 6, 2017 Military Coups Background: Sub-Sahara Africa has experienced a high proportion of regime changes due to military takeover of governments for

More information

For more information about how to cite these materials visit

For more information about how to cite these materials visit Author(s): Kerby Shedden, Ph.D., 2010 License: Unless otherwise noted, this material is made available under the terms of the Creative Commons Attribution Share Alike 3.0 License: http://creativecommons.org/licenses/by-sa/3.0/

More information

1 Comparing two binomials

1 Comparing two binomials BST 140.652 Review notes 1 Comparing two binomials 1. Let X Binomial(n 1,p 1 ) and ˆp 1 = X/n 1 2. Let Y Binomial(n 2,p 2 ) and ˆp 2 = Y/n 2 3. We also use the following notation: n 11 = X n 12 = n 1 X

More information

9 Generalized Linear Models

9 Generalized Linear Models 9 Generalized Linear Models The Generalized Linear Model (GLM) is a model which has been built to include a wide range of different models you already know, e.g. ANOVA and multiple linear regression models

More information

Review of One-way Tables and SAS

Review of One-way Tables and SAS Stat 504, Lecture 7 1 Review of One-way Tables and SAS In-class exercises: Ex1, Ex2, and Ex3 from http://v8doc.sas.com/sashtml/proc/z0146708.htm To calculate p-value for a X 2 or G 2 in SAS: http://v8doc.sas.com/sashtml/lgref/z0245929.htmz0845409

More information

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014 LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R Liang (Sally) Shan Nov. 4, 2014 L Laboratory for Interdisciplinary Statistical Analysis LISA helps VT researchers

More information

Chapter 11: Models for Matched Pairs

Chapter 11: Models for Matched Pairs : Models for Matched Pairs Dipankar Bandyopadhyay Department of Biostatistics, Virginia Commonwealth University BIOS 625: Categorical Data & GLM [Acknowledgements to Tim Hanson and Haitao Chu] D. Bandyopadhyay

More information

Linear Regression Models P8111

Linear Regression Models P8111 Linear Regression Models P8111 Lecture 25 Jeff Goldsmith April 26, 2016 1 of 37 Today s Lecture Logistic regression / GLMs Model framework Interpretation Estimation 2 of 37 Linear regression Course started

More information

Statistics 3858 : Contingency Tables

Statistics 3858 : Contingency Tables Statistics 3858 : Contingency Tables 1 Introduction Before proceeding with this topic the student should review generalized likelihood ratios ΛX) for multinomial distributions, its relation to Pearson

More information

Sections 4.1, 4.2, 4.3

Sections 4.1, 4.2, 4.3 Sections 4.1, 4.2, 4.3 Timothy Hanson Department of Statistics, University of South Carolina Stat 770: Categorical Data Analysis 1/ 32 Chapter 4: Introduction to Generalized Linear Models Generalized linear

More information

Generalized linear models

Generalized linear models Generalized linear models Outline for today What is a generalized linear model Linear predictors and link functions Example: estimate a proportion Analysis of deviance Example: fit dose- response data

More information

Statistics of Contingency Tables - Extension to I x J. stat 557 Heike Hofmann

Statistics of Contingency Tables - Extension to I x J. stat 557 Heike Hofmann Statistics of Contingency Tables - Extension to I x J stat 557 Heike Hofmann Outline Testing Independence Local Odds Ratios Concordance & Discordance Intro to GLMs Simpson s paradox Simpson s paradox:

More information

Analysis of data in square contingency tables

Analysis of data in square contingency tables Analysis of data in square contingency tables Iva Pecáková Let s suppose two dependent samples: the response of the nth subject in the second sample relates to the response of the nth subject in the first

More information

Review of Multinomial Distribution If n trials are performed: in each trial there are J > 2 possible outcomes (categories) Multicategory Logit Models

Review of Multinomial Distribution If n trials are performed: in each trial there are J > 2 possible outcomes (categories) Multicategory Logit Models Chapter 6 Multicategory Logit Models Response Y has J > 2 categories. Extensions of logistic regression for nominal and ordinal Y assume a multinomial distribution for Y. 6.1 Logit Models for Nominal Responses

More information

STAT 135 Lab 11 Tests for Categorical Data (Fisher s Exact test, χ 2 tests for Homogeneity and Independence) and Linear Regression

STAT 135 Lab 11 Tests for Categorical Data (Fisher s Exact test, χ 2 tests for Homogeneity and Independence) and Linear Regression STAT 135 Lab 11 Tests for Categorical Data (Fisher s Exact test, χ 2 tests for Homogeneity and Independence) and Linear Regression Rebecca Barter April 20, 2015 Fisher s Exact Test Fisher s Exact Test

More information

STAT 7030: Categorical Data Analysis

STAT 7030: Categorical Data Analysis STAT 7030: Categorical Data Analysis 5. Logistic Regression Peng Zeng Department of Mathematics and Statistics Auburn University Fall 2012 Peng Zeng (Auburn University) STAT 7030 Lecture Notes Fall 2012

More information

Solution to Tutorial 7

Solution to Tutorial 7 1. (a) We first fit the independence model ST3241 Categorical Data Analysis I Semester II, 2012-2013 Solution to Tutorial 7 log µ ij = λ + λ X i + λ Y j, i = 1, 2, j = 1, 2. The parameter estimates are

More information

Poisson Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University

Poisson Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University Poisson Regression James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) Poisson Regression 1 / 49 Poisson Regression 1 Introduction

More information

Cohen s s Kappa and Log-linear Models

Cohen s s Kappa and Log-linear Models Cohen s s Kappa and Log-linear Models HRP 261 03/03/03 10-11 11 am 1. Cohen s Kappa Actual agreement = sum of the proportions found on the diagonals. π ii Cohen: Compare the actual agreement with the chance

More information

Topic 21 Goodness of Fit

Topic 21 Goodness of Fit Topic 21 Goodness of Fit Contingency Tables 1 / 11 Introduction Two-way Table Smoking Habits The Hypothesis The Test Statistic Degrees of Freedom Outline 2 / 11 Introduction Contingency tables, also known

More information

One-Way Tables and Goodness of Fit

One-Way Tables and Goodness of Fit Stat 504, Lecture 5 1 One-Way Tables and Goodness of Fit Key concepts: One-way Frequency Table Pearson goodness-of-fit statistic Deviance statistic Pearson residuals Objectives: Learn how to compute the

More information

Elementary Statistics Lecture 3 Association: Contingency, Correlation and Regression

Elementary Statistics Lecture 3 Association: Contingency, Correlation and Regression Elementary Statistics Lecture 3 Association: Contingency, Correlation and Regression Chong Ma Department of Statistics University of South Carolina chongm@email.sc.edu Chong Ma (Statistics, USC) STAT 201

More information

Lecture 9. Selected material from: Ch. 12 The analysis of categorical data and goodness of fit tests

Lecture 9. Selected material from: Ch. 12 The analysis of categorical data and goodness of fit tests Lecture 9 Selected material from: Ch. 12 The analysis of categorical data and goodness of fit tests Univariate categorical data Univariate categorical data are best summarized in a one way frequency table.

More information

McGill University. Faculty of Science. Department of Mathematics and Statistics. Statistics Part A Comprehensive Exam Methodology Paper

McGill University. Faculty of Science. Department of Mathematics and Statistics. Statistics Part A Comprehensive Exam Methodology Paper Student Name: ID: McGill University Faculty of Science Department of Mathematics and Statistics Statistics Part A Comprehensive Exam Methodology Paper Date: Friday, May 13, 2016 Time: 13:00 17:00 Instructions

More information

MARGINAL HOMOGENEITY MODEL FOR ORDERED CATEGORIES WITH OPEN ENDS IN SQUARE CONTINGENCY TABLES

MARGINAL HOMOGENEITY MODEL FOR ORDERED CATEGORIES WITH OPEN ENDS IN SQUARE CONTINGENCY TABLES REVSTAT Statistical Journal Volume 13, Number 3, November 2015, 233 243 MARGINAL HOMOGENEITY MODEL FOR ORDERED CATEGORIES WITH OPEN ENDS IN SQUARE CONTINGENCY TABLES Authors: Serpil Aktas Department of

More information

Statistics in medicine

Statistics in medicine Statistics in medicine Lecture 3: Bivariate association : Categorical variables Proportion in one group One group is measured one time: z test Use the z distribution as an approximation to the binomial

More information

Multiple Sample Categorical Data

Multiple Sample Categorical Data Multiple Sample Categorical Data paired and unpaired data, goodness-of-fit testing, testing for independence University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html

More information

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages: Glossary The ISI glossary of statistical terms provides definitions in a number of different languages: http://isi.cbs.nl/glossary/index.htm Adjusted r 2 Adjusted R squared measures the proportion of the

More information

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification,

Normal distribution We have a random sample from N(m, υ). The sample mean is Ȳ and the corrected sum of squares is S yy. After some simplification, Likelihood Let P (D H) be the probability an experiment produces data D, given hypothesis H. Usually H is regarded as fixed and D variable. Before the experiment, the data D are unknown, and the probability

More information

Ch 6: Multicategory Logit Models

Ch 6: Multicategory Logit Models 293 Ch 6: Multicategory Logit Models Y has J categories, J>2. Extensions of logistic regression for nominal and ordinal Y assume a multinomial distribution for Y. In R, we will fit these models using the

More information

Statistics for Managers Using Microsoft Excel

Statistics for Managers Using Microsoft Excel Statistics for Managers Using Microsoft Excel 7 th Edition Chapter 1 Chi-Square Tests and Nonparametric Tests Statistics for Managers Using Microsoft Excel 7e Copyright 014 Pearson Education, Inc. Chap

More information

Categorical data analysis Chapter 5

Categorical data analysis Chapter 5 Categorical data analysis Chapter 5 Interpreting parameters in logistic regression The sign of β determines whether π(x) is increasing or decreasing as x increases. The rate of climb or descent increases

More information

Poisson Regression. The Training Data

Poisson Regression. The Training Data The Training Data Poisson Regression Office workers at a large insurance company are randomly assigned to one of 3 computer use training programmes, and their number of calls to IT support during the following

More information

Longitudinal Modeling with Logistic Regression

Longitudinal Modeling with Logistic Regression Newsom 1 Longitudinal Modeling with Logistic Regression Longitudinal designs involve repeated measurements of the same individuals over time There are two general classes of analyses that correspond to

More information

1 Interaction models: Assignment 3

1 Interaction models: Assignment 3 1 Interaction models: Assignment 3 Please answer the following questions in print and deliver it in room 2B13 or send it by e-mail to rooijm@fsw.leidenuniv.nl, no later than Tuesday, May 29 before 14:00.

More information

STA6938-Logistic Regression Model

STA6938-Logistic Regression Model Dr. Ying Zhang STA6938-Logistic Regression Model Topic 2-Multiple Logistic Regression Model Outlines:. Model Fitting 2. Statistical Inference for Multiple Logistic Regression Model 3. Interpretation of

More information

STAT 526 Advanced Statistical Methodology

STAT 526 Advanced Statistical Methodology STAT 526 Advanced Statistical Methodology Fall 2017 Lecture Note 10 Analyzing Clustered/Repeated Categorical Data 0-0 Outline Clustered/Repeated Categorical Data Generalized Linear Mixed Models Generalized

More information

Analysis of Categorical Data. Nick Jackson University of Southern California Department of Psychology 10/11/2013

Analysis of Categorical Data. Nick Jackson University of Southern California Department of Psychology 10/11/2013 Analysis of Categorical Data Nick Jackson University of Southern California Department of Psychology 10/11/2013 1 Overview Data Types Contingency Tables Logit Models Binomial Ordinal Nominal 2 Things not

More information

1. Hypothesis testing through analysis of deviance. 3. Model & variable selection - stepwise aproaches

1. Hypothesis testing through analysis of deviance. 3. Model & variable selection - stepwise aproaches Sta 216, Lecture 4 Last Time: Logistic regression example, existence/uniqueness of MLEs Today s Class: 1. Hypothesis testing through analysis of deviance 2. Standard errors & confidence intervals 3. Model

More information

Lecture 12: Effect modification, and confounding in logistic regression

Lecture 12: Effect modification, and confounding in logistic regression Lecture 12: Effect modification, and confounding in logistic regression Ani Manichaikul amanicha@jhsph.edu 4 May 2007 Today Categorical predictor create dummy variables just like for linear regression

More information

Solutions for Examination Categorical Data Analysis, March 21, 2013

Solutions for Examination Categorical Data Analysis, March 21, 2013 STOCKHOLMS UNIVERSITET MATEMATISKA INSTITUTIONEN Avd. Matematisk statistik, Frank Miller MT 5006 LÖSNINGAR 21 mars 2013 Solutions for Examination Categorical Data Analysis, March 21, 2013 Problem 1 a.

More information

Lecture 14: Introduction to Poisson Regression

Lecture 14: Introduction to Poisson Regression Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu 8 May 2007 1 / 52 Overview Modelling counts Contingency tables Poisson regression models 2 / 52 Modelling counts I Why

More information

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview Modelling counts I Lecture 14: Introduction to Poisson Regression Ani Manichaikul amanicha@jhsph.edu Why count data? Number of traffic accidents per day Mortality counts in a given neighborhood, per week

More information

Stat 642, Lecture notes for 04/12/05 96

Stat 642, Lecture notes for 04/12/05 96 Stat 642, Lecture notes for 04/12/05 96 Hosmer-Lemeshow Statistic The Hosmer-Lemeshow Statistic is another measure of lack of fit. Hosmer and Lemeshow recommend partitioning the observations into 10 equal

More information

Hypothesis Testing hypothesis testing approach

Hypothesis Testing hypothesis testing approach Hypothesis Testing In this case, we d be trying to form an inference about that neighborhood: Do people there shop more often those people who are members of the larger population To ascertain this, we

More information

Goodness-of-Fit Tests for the Ordinal Response Models with Misspecified Links

Goodness-of-Fit Tests for the Ordinal Response Models with Misspecified Links Communications of the Korean Statistical Society 2009, Vol 16, No 4, 697 705 Goodness-of-Fit Tests for the Ordinal Response Models with Misspecified Links Kwang Mo Jeong a, Hyun Yung Lee 1, a a Department

More information

Lecture 6 Multiple Linear Regression, cont.

Lecture 6 Multiple Linear Regression, cont. Lecture 6 Multiple Linear Regression, cont. BIOST 515 January 22, 2004 BIOST 515, Lecture 6 Testing general linear hypotheses Suppose we are interested in testing linear combinations of the regression

More information

Lecture 23. November 15, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University.

Lecture 23. November 15, Department of Biostatistics Johns Hopkins Bloomberg School of Public Health Johns Hopkins University. This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike License. Your use of this material constitutes acceptance of that license and the conditions of use of materials on this

More information

Lab 3: Two levels Poisson models (taken from Multilevel and Longitudinal Modeling Using Stata, p )

Lab 3: Two levels Poisson models (taken from Multilevel and Longitudinal Modeling Using Stata, p ) Lab 3: Two levels Poisson models (taken from Multilevel and Longitudinal Modeling Using Stata, p. 376-390) BIO656 2009 Goal: To see if a major health-care reform which took place in 1997 in Germany was

More information

POLI 443 Applied Political Research

POLI 443 Applied Political Research POLI 443 Applied Political Research Session 6: Tests of Hypotheses Contingency Analysis Lecturer: Prof. A. Essuman-Johnson, Dept. of Political Science Contact Information: aessuman-johnson@ug.edu.gh College

More information