Correspondence Analysis

Q: when independence of a 2-way contingency table is rejected, how can we tell where the dependence is coming from? The interaction terms in a GLM contain the dependence information; however, interactions can be difficult to interpret. Correspondence analysis is a visual residual analysis for contingency tables.

Singular value decomposition. Let R be an r x c matrix. W.l.o.g., assume r >= c and rank(R) = c. Then R = U D V^T, where
- U: an r x c column-orthonormal matrix, i.e., U^T U = I_{c x c}; its columns are called left singular vectors
- V: a c x c column-orthonormal matrix, i.e., V^T V = I_{c x c}; its columns are called right singular vectors
- D = diag(d_1, ..., d_c), where d_1 >= d_2 >= ... >= d_c >= 0, called the singular values

Some properties:
- the columns of U_{r x c} are eigenvectors of (R R^T)_{r x r}
- the columns of V_{c x c} are eigenvectors of (R^T R)_{c x c}
- {d_1^2, ..., d_c^2} are the eigenvalues of both R R^T and R^T R

Procedure of correspondence analysis on Pearson residuals:
a) fit the GLM corresponding to independence on the contingency table and compute its Pearson residuals r_p (Q: what information is contained in the r_p's?)
b) write the r_p's in matrix form R = [R_ij]_{r x c}, arranged as in the contingency table
c) perform the singular value decomposition on R: R = U D V^T, i.e., R_ij = sum_{k=1}^{c} U_ik d_k V_jk
d) it is not uncommon for the first few singular values of R to be much larger than the rest. Suppose the first two dominate. Then
   R_ij ~= U_i1 d_1 V_j1 + U_i2 d_2 V_j2 = (U_i1 sqrt(d_1))(V_j1 sqrt(d_1)) + (U_i2 sqrt(d_2))(V_j2 sqrt(d_2)) = U*_i1 V*_j1 + U*_i2 V*_j2,
   where U*_ik = U_ik sqrt(d_k) and V*_jk = V_jk sqrt(d_k)
e) the 2-dimensional correspondence plot displays U*_i2 against U*_i1 and V*_j2 against V*_j1 (writing U*_ik = U_ik sqrt(d_k) and V*_jk = V_jk sqrt(d_k)) on the same graph (note: because the distances between points are of interest, it is important that the plot be scaled so that visual distances are proportionately correct)

In matrix form, R ~= U*_(1) V*_(1)^T + U*_(2) V*_(2)^T, where U*_(k) = (U*_1k, ..., U*_rk)^T and V*_(k) = (V*_1k, ..., V*_ck)^T.

Some notes:
- Q: what does a large positive R_ij mean? a large negative R_ij?
- sum_k d_k^2 = Pearson's X^2 (because sum_ij r_p^2 = trace(R^T R) = sum_k d_k^2)

Q: what should we look for in a correspondence plot?
- Large values in U*_(k) (and V*_(k)): in the contingency table, the profiles of the rows (or columns) corresponding to the large values are different from the typical profile.
  E.g., BLOND hair: the distribution of eye colors within this group is not typical. E.g., BROWN hair: the distribution of eye colors within this group is close to the marginal distribution of the columns.
- A row level and a column level appear close together and far from the origin: a large positive R_ij would be associated with the combination.
  E.g., BLOND hair and blue eyes: strong association.
- A row level and a column level sit diametrically apart on either side of the origin: a large negative R_ij would be associated with the combination.
  E.g., BLOND hair and brown eyes: relatively fewer people than independence would predict.
- Points of two row (or two column) levels are close together: the two rows/columns have a similar pattern of association; one might consider combining the two categories.
  E.g., hazel eyes and green eyes: similar hair-color distributions.

Other methods: corresp in the MASS package of R (Venables and Ripley, 2002); Blasius and Greenacre (1998).
Reading: Faraway, 4.2
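Steps a)-d) above can be sketched in a few lines of pure Python (no R or numpy; the 3x3 table below is made-up for illustration): Pearson residuals from the independence fit, Pearson's X^2 as the sum of squared residuals, and the leading squared singular value d_1^2 obtained by power iteration on R^T R, using the eigen-properties listed above.

```python
import math

def pearson_residuals(table):
    # R_ij = (y_ij - e_ij) / sqrt(e_ij), with e_ij = y_i+ * y_+j / y_++
    r, c = len(table), len(table[0])
    rows = [sum(table[i]) for i in range(r)]
    cols = [sum(table[i][j] for i in range(r)) for j in range(c)]
    n = sum(rows)
    return [[(table[i][j] - rows[i] * cols[j] / n)
             / math.sqrt(rows[i] * cols[j] / n)
             for j in range(c)] for i in range(r)]

def top_singular_value_sq(R, iters=500):
    # Power iteration on S = R^T R; its largest eigenvalue is d_1^2.
    r, c = len(R), len(R[0])
    S = [[sum(R[i][a] * R[i][b] for i in range(r)) for b in range(c)]
         for a in range(c)]
    v = [1.0 / (a + 1) for a in range(c)]  # deterministic, generic start
    lam = 0.0
    for _ in range(iters):
        w = [sum(S[a][b] * v[b] for b in range(c)) for a in range(c)]
        lam = math.sqrt(sum(x * x for x in w))
        v = [x / lam for x in w]
    return lam

tab = [[20, 10, 5],     # hypothetical 3x3 contingency table
       [10, 25, 15],
       [5, 10, 30]]
R = pearson_residuals(tab)
X2 = sum(x * x for row in R for x in row)   # Pearson's X^2 = trace(R^T R)
d1_sq = top_singular_value_sq(R)
share = d1_sq / X2   # fraction of X^2 captured by the first dimension
```

Since sum_k d_k^2 = X^2, a value of `share` close to 1 means a one-dimensional display already captures most of the dependence; the 2-d correspondence plot keeps the first two dimensions.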
Matched Pairs

Data: observe one categorical measure on two matched objects.
E.g., left-eye and right-eye performance of a person. In contrast, in the typical 2-way contingency table, we observe two different categorical measures on one object.

Let X_1 and X_2 denote the two measures, each with the same I levels, and let [pi_ij]_{I x I} be the cell probabilities of the resulting I x I table.

Q: what questions might we be interested in for matched-pair data?
- Are X_1 and X_2 independent, i.e., pi_ij = pi_i+ pi_+j for all i and j?
- Is [pi_ij]_{I x I} a symmetric matrix, i.e., pi_ij = pi_ji?
- Are the row and column marginals homogeneous, i.e., pi_i+ = pi_+i?
  Symmetry implies marginal homogeneity (the reverse statement is not necessarily true).
- When the row and column marginal totals are quite different, we might be interested in whether pi_ij = pi_i+ pi_+j gamma_ij, where gamma_ij = gamma_ji. This hypothesis is called quasi-symmetry.
  Marginal homogeneity + quasi-symmetry = symmetry.
- Whether pi_ij = pi_i+ pi_+j for i != j? This is called quasi-independence.

Tests for these hypotheses are based on GLMs. E.g., for I = 3, write the counts column-major as Y = (y_11, y_21, y_31, y_12, y_22, y_32, y_13, y_23, y_33)^T, corresponding to the 3 x 3 table with entries y_ij, row totals y_i+, column totals y_+j, and grand total y_++.

Test for the symmetry hypothesis:
- Generate a vector with I^2 components for an (I(I+1)/2)-level nominal factor with the structure symfactor = (l_1, l_2, l_3, l_2, l_4, l_5, l_3, l_5, l_6)^T, so that cells (i, j) and (j, i) share a level.
- Fit Y ~ symfactor, denoted S_sym.
- Deviance-based / Pearson X^2 goodness-of-fit test for S_sym.

Test for the quasi-symmetry hypothesis:
- log(pi_ij) = log(pi_i+ pi_+j gamma_ij) = log(pi_i+) + log(pi_+j) + log(gamma_ij)
- Fit Y ~ X_1 + X_2 + symfactor, denoted S_qsym.
- Deviance-based / Pearson X^2 goodness-of-fit test for S_qsym.

Test for the marginal homogeneity hypothesis:
- Deviance-based test for H_0: S_sym vs. H_1: S_qsym \ S_sym, i.e., compare the deviances of the nested models S_sym and S_qsym.
- The test is only appropriate when S_qsym already holds.

Test for the quasi-independence hypothesis:
- Omit the diagonal data, i.e., Y = (y_21, y_31, y_12, y_32, y_13, y_23)^T, and fit Y ~ X_1 + X_2, denoted S_qindep.
- Deviance-based / Pearson X^2 goodness-of-fit test for S_qindep.

Reading: Faraway, 4.3
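The symfactor construction above generalizes to any I; a minimal pure-Python sketch (the numeric level labels are arbitrary, only the grouping of cells matters):

```python
def symfactor(I):
    # One level per unordered pair {i, j}: cells (i, j) and (j, i) share a
    # level, giving I(I+1)/2 distinct levels in total.  Cells are listed
    # column-major, matching Y = (y_11, y_21, ..., y_I1, y_12, ...)^T.
    levels, out = {}, []
    for j in range(1, I + 1):
        for i in range(1, I + 1):
            key = (min(i, j), max(i, j))
            out.append(levels.setdefault(key, len(levels) + 1))
    return out

# symfactor(3) reproduces the pattern (l1, l2, l3, l2, l4, l5, l3, l5, l6)
```

In R, this vector would be wrapped as a nominal factor before fitting the Poisson GLM Y ~ symfactor.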
Three-Way Contingency Table

The pi's and y's are defined in the same manner as in the 2-way table. Use the Poisson GLM approach to investigate how X_1 (1 <= i <= I), X_2 (1 <= j <= J), X_3 (1 <= k <= K) interact.

Mutual independence (X_1, X_2, X_3 are independent): pi_ijk = pi_i++ pi_+j+ pi_++k
- log(pi_ijk) = log(pi_i++ pi_+j+ pi_++k) = log(pi_i++) + log(pi_+j+) + log(pi_++k)
- Model: Y ~ X_1 + X_2 + X_3, denoted S_1
- The estimates of the parameters in this model correspond only to the marginal totals y_i++, y_+j+, and y_++k.
- The coding we use determines exactly how the parameters relate to the marginal totals. E.g., let beta be a main effect of X_1 that codes categories i_1 and i_2 as 0 and 1; then
  e^{beta-hat}/(1 + e^{beta-hat}) = pi-hat_{i_2++} / (pi-hat_{i_1++} + pi-hat_{i_2++}) = y_{i_2++} / (y_{i_1++} + y_{i_2++}).
- An insignificant factor, say X_1, suggests pi_1++ = pi_2++ = ... = pi_I++.

Joint independence ({X_1, X_2} and X_3 are independent): pi_ijk = pi_ij+ pi_++k
- log(pi_ijk) = log(pi_ij+ pi_++k) = log(pi_ij+) + log(pi_++k)
- Model: Y ~ X_1 + X_2 + X_1:X_2 + X_3, denoted S_2 (contains S_1)

Conditional independence (X_1, X_2 are independent given X_3): pi_ijk = pi_i+k pi_+jk / pi_++k
- log(pi_ijk) = log(pi_i+k pi_+jk / pi_++k) = log(pi_i+k) + log(pi_+jk) - log(pi_++k)
- Model: Y ~ X_1 + X_2 + X_3 + X_1:X_3 + X_2:X_3, denoted S_3
- Note that S_3 does not contain S_2 (it has no X_1:X_2 term); however, the condition that {X_1, X_3} and X_2 are jointly independent implies that X_1 and X_2 are independent given X_3.
- Q: can conditional independence imply independence between X_1 and X_2, i.e., pi_ij+ = pi_i++ pi_+j+? (Hint: singular value decomposition)

Uniform association: consider the model with all two-factor interactions
- Model: Y ~ X_1 + X_2 + X_3 + X_1:X_2 + X_1:X_3 + X_2:X_3, denoted S_4 (contains S_3)
- S_4 is not saturated, so some degrees of freedom are left for a goodness-of-fit test.
- S_4 has no simple interpretation in terms of independence; it asserts that for every level of one variable, say X_3, we have the same association between X_1 and X_2.
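For S_1 the maximum-likelihood fitted values have the closed form mu-hat_ijk = y_i++ y_+j+ y_++k / n^2, so the Pearson X^2 goodness-of-fit statistic can be computed directly. A minimal pure-Python sketch (the 2x2x2 counts are made-up for illustration):

```python
def fit_mutual_independence(y):
    # Fitted values of S1 (Y ~ X1 + X2 + X3): mu_ijk = y_i++ y_+j+ y_++k / n^2
    I, J, K = len(y), len(y[0]), len(y[0][0])
    n = float(sum(y[i][j][k] for i in range(I) for j in range(J) for k in range(K)))
    yi = [sum(y[i][j][k] for j in range(J) for k in range(K)) for i in range(I)]
    yj = [sum(y[i][j][k] for i in range(I) for k in range(K)) for j in range(J)]
    yk = [sum(y[i][j][k] for i in range(I) for j in range(J)) for k in range(K)]
    return [[[yi[i] * yj[j] * yk[k] / n**2 for k in range(K)]
             for j in range(J)] for i in range(I)]

y = [[[12, 6], [7, 11]],    # hypothetical 2x2x2 counts
     [[9, 13], [8, 10]]]
mu = fit_mutual_independence(y)
X2 = sum((y[i][j][k] - mu[i][j][k]) ** 2 / mu[i][j][k]
         for i in range(2) for j in range(2) for k in range(2))
# compare X2 with a chi-square on IJK - I - J - K + 2 = 4 degrees of freedom
```

The fitted values reproduce all three one-way margins exactly, which is what "the estimates correspond only to the marginal totals" means in practice.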
- For each level k of X_3, the reduced model of S_4 (the 2-way model at that slice) has different coefficients for the main effects of X_1 and X_2, but the same coefficient for the interaction X_1:X_2.
- E.g., I = J = 2: the same fitted odds ratio between X_1 and X_2 for each category of X_3. Note that
  fitted odds ratio = (y-hat_11k y-hat_22k) / (y-hat_12k y-hat_21k) = (pi-hat_11k pi-hat_22k) / (pi-hat_12k pi-hat_21k) = e^{beta_12},
  where beta_12 is the coefficient of the X_1:X_2 term (under the 0-1 coding) in the reduced model at X_3 = k.
- Q: What does uniform association mean? How to interpret the association? How does it connect with the interaction terms?
- A saturated model corresponds to a 3-way table with different associations between, say, X_1 and X_2 across the K levels of X_3, whereas Y ~ 1 corresponds to a 3-way table with constant cell probabilities.
- Q: how to examine whether the X_1, X_2, X_3 in a 3-way table are mutually independent, jointly independent, conditionally independent, or uniformly associated?
  Ans: perform deviance-based / Pearson X^2 goodness-of-fit tests for S_1, S_2, S_3, S_4, respectively.
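S_4 has no closed-form fitted values, but they can be obtained by iterative proportional fitting (IPF), cycling through the three two-way margins that S_4 preserves. The sketch below is pure Python with made-up 2x2x2 counts (not data from the notes); it shows that the fitted odds ratio between X_1 and X_2 is the same at both levels of X_3 even when the raw odds ratios differ.

```python
def ipf_uniform_association(y, cycles=1000):
    # IPF for S4 (all two-factor interactions, no three-way term): repeatedly
    # rescale mu to match the y_ij+, y_i+k and y_+jk margins of the data.
    I, J, K = len(y), len(y[0]), len(y[0][0])
    mu = [[[1.0] * K for _ in range(J)] for _ in range(I)]
    for _ in range(cycles):
        for i in range(I):                          # match y_ij+ margins
            for j in range(J):
                s = sum(mu[i][j])
                for k in range(K):
                    mu[i][j][k] *= sum(y[i][j]) / s
        for i in range(I):                          # match y_i+k margins
            for k in range(K):
                s = sum(mu[i][j][k] for j in range(J))
                t = sum(y[i][j][k] for j in range(J))
                for j in range(J):
                    mu[i][j][k] *= t / s
        for j in range(J):                          # match y_+jk margins
            for k in range(K):
                s = sum(mu[i][j][k] for i in range(I))
                t = sum(y[i][j][k] for i in range(I))
                for i in range(I):
                    mu[i][j][k] *= t / s
    return mu

y = [[[10, 6], [8, 12]],    # hypothetical counts; raw odds ratios are
     [[5, 14], [9, 7]]]     # 2.25 at k = 0 and 0.25 at k = 1
mu = ipf_uniform_association(y)
fitted_or = [mu[0][0][k] * mu[1][1][k] / (mu[0][1][k] * mu[1][0][k])
             for k in range(2)]
# the two fitted odds ratios agree (up to convergence tolerance)
```

This is the same computation that a Poisson GLM fit of S_4 performs implicitly; the common fitted odds ratio equals e^{beta_12} from the slide above.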
However, be careful of zero or small y_ijk: there will be some doubt as to the accuracy of the chi-square approximation in the goodness-of-fit test. The chi-square approximation is better for comparing models than for assessing goodness of fit.

Analysis strategy: start with a complex Poisson GLM (such as the saturated model) and see how far the model can be reduced (by using deviance-based tests to compare models).

Binomial (multinomial) GLM approach for the 3-way table:
- When the y_ij+'s are regarded as fixed, we can treat X_3 as a response and X_1, X_2 as covariates.
- Q_1: what information is gone? Q_2: what information is still attainable?
- Ans for Q_1: information about pi_ij+. Ans for Q_2: information about pi_{k|ij}.
- If K = 2: y_ij1 ~ binomial(y_ij+, pi_{1|ij}).
- If K > 2: (y_ij1, ..., y_ijK) ~ multinomial(y_ij+, pi_{1|ij}, ..., pi_{K|ij}).
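Conditioning on the y_ij+ totals, the saturated estimates of pi_{k|ij} are just the within-cell proportions; a small pure-Python sketch (made-up counts):

```python
def conditional_probs(y):
    # MLE of pi_{k|ij} = y_ijk / y_ij+ when the y_ij+ totals are fixed,
    # i.e. the multinomial view of the 3-way table with X3 as the response.
    return [[[y[i][j][k] / sum(y[i][j]) for k in range(len(y[i][j]))]
             for j in range(len(y[i]))] for i in range(len(y))]

y = [[[3, 1], [2, 2]],   # hypothetical 2x2x2 counts
     [[1, 4], [5, 5]]]
p = conditional_probs(y)   # p[i][j] sums to 1 for every (i, j)
```

With K = 2 this is the success-probability estimate of the binomial model y_ij1 ~ binomial(y_ij+, pi_{1|ij}); a logistic regression on X_1 and X_2 would then model how pi_{1|ij} depends on the covariates, which is the binomial-GLM counterpart of the Poisson models above.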