Analysis of data in square contingency tables

Iva Pecáková

Let us suppose two dependent samples: the response of the nth subject in the second sample is related to the response of the nth subject in the first sample. There are two common forms of sample dependency: (1) the same subjects are surveyed at different points in time (before-after studies, including panel studies); (2) different subjects with a natural pairing are surveyed (a husband and his wife, a parent and his child, two people rating the same object, etc.). The first form is often called repeated measures or longitudinal data, the second one matched pairs data.

In such a case, the responses of a categorical variable are summarized by a two-way contingency table in which the row and column classifications have the same categories. Thus, the table is square, $r = c$ ($r$ is the number of rows and $c$ is the number of columns in the table). There are usually large values on the main diagonal of such a table, and cell probabilities or associations may exhibit a more or less symmetric pattern about this diagonal. The two marginal distributions may agree (there is marginal homogeneity) or they may differ in some systematic way.

Table 1 The binary variable (r = 2)

                    Occasion 2
Occasion 1     X = 0       X = 1       Σ
X = 0          $n_{11}$    $n_{12}$    $n_{1+}$
X = 1          $n_{21}$    $n_{22}$    $n_{2+}$
Σ              $n_{+1}$    $n_{+2}$    $n$

Here $n_{ij}$ ($i = 1, 2$; $j = 1, 2$) denote frequencies and $p_{ij} = n_{ij}/n$ denote relative frequencies.

If $r = 2$ (Table 1) and the hypothesis

$$\pi_{1+} = \pi_{+1} \iff \pi_{12} = \pi_{21} \qquad (1)$$

holds (for a 2 x 2 table it is both marginal homogeneity and symmetry), the frequency $n_{12}$ has, given the number of discordant pairs, a binomial distribution with parameters $n_{12} + n_{21}$ and 0.5. The p-value of the two-sided test is then twice the probability $P[n_{12} \le \min(n_{12}^*, n_{21}^*)]$, where the asterisk denotes observed frequencies. For large samples, as is known, the statistic

$$U = \frac{n_{12} - 0.5\,(n_{12} + n_{21})}{0.5\sqrt{n_{12} + n_{21}}} = \frac{n_{12} - n_{21}}{\sqrt{n_{12} + n_{21}}} \qquad (2)$$

has a standard normal distribution and

$$U^2 = \frac{(n_{12} - n_{21})^2}{n_{12} + n_{21}} \qquad (3)$$

has a chi-square distribution with one degree of freedom (the significance test based on this statistic is known as the McNemar test).
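The exact and the large-sample versions of the test in (1) to (3) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `mcnemar` and the discordant counts 25 and 9 are hypothetical, the exact p-value is capped at 1, and the normal tail is obtained from the standard library function `math.erfc`.

```python
from math import comb, erfc, sqrt

def mcnemar(n12, n21):
    """Exact and large-sample McNemar tests from the two discordant counts."""
    m = n12 + n21                      # number of discordant pairs
    k = min(n12, n21)
    # Exact two-sided p-value: twice the lower tail of Binomial(m, 0.5), capped at 1.
    tail = sum(comb(m, x) for x in range(k + 1)) / 2 ** m
    p_exact = min(1.0, 2 * tail)
    u = (n12 - n21) / sqrt(m)          # statistic (2), approximately N(0, 1)
    chi2 = u * u                       # statistic (3), approximately chi-square, 1 df
    p_asym = erfc(abs(u) / sqrt(2))    # two-sided p-value of the normal approximation
    return p_exact, u, chi2, p_asym

# Hypothetical discordant counts, chosen only for illustration.
p_exact, u, chi2, p_asym = mcnemar(n12=25, n21=9)
print(f"exact p = {p_exact:.4f}, U = {u:.3f}, U^2 = {chi2:.3f}, asymptotic p = {p_asym:.4f}")
```

Only the two off-diagonal counts enter the computation; the diagonal counts of the 2 x 2 table carry no information about hypothesis (1).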

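If this test is significant, we can estimate the true difference between $\pi_{1+}$ and $\pi_{+1}$ by the interval

$$(p_{1+} - p_{+1}) \pm u_{1-\alpha/2}\, \widehat{SE}(p_{1+} - p_{+1}), \qquad (4)$$

where $u_{1-\alpha/2}$ is the quantile of the standard normal distribution and

$$\widehat{SE}(p_{1+} - p_{+1}) = \sqrt{\left[\, p_{1+}(1 - p_{1+}) + p_{+1}(1 - p_{+1}) - 2\,(p_{11}p_{22} - p_{12}p_{21}) \,\right] / n}.$$

Table 2 The categorical variable (r > 2)

                    Occasion 2
Occasion 1     $x_1$       ...     $x_r$       Σ
$x_1$          $n_{11}$    ...     $n_{1r}$    $n_{1+}$
...            ...         ...     ...         ...
$x_r$          $n_{r1}$    ...     $n_{rr}$    $n_{r+}$
Σ              $n_{+1}$    ...     $n_{+r}$    $n$

If $r > 2$ (Table 2), the hypothesis

$$\pi_{i+} = \pi_{+i}, \quad i = 1, 2, \dots, r \qquad (5)$$

is marginal homogeneity and the hypothesis

$$\pi_{ij} = \pi_{ji}, \quad i = 1, 2, \dots, r; \; j = 1, 2, \dots, r \qquad (6)$$

is symmetry. Symmetry implies marginal homogeneity, but for $r > 2$ marginal homogeneity does not imply symmetry (see the example in Table 3).

Table 3 Marginal homogeneity, not symmetry

         X1     X2     X3      Σ
X1       20     10     20     50
X2       30     55      5     90
X3        0     25     35     60
Σ        50     90     60    200

The saturated loglinear model for such a square contingency table can be written as

$$\ln m_{ij} = \lambda + \lambda_i + \lambda_j + \lambda_{ij}, \qquad (7)$$

where

$$\lambda = \frac{1}{r^2}\sum_i \sum_j \ln m_{ij}, \qquad \lambda_i = \frac{1}{r}\sum_j \ln m_{ij} - \lambda, \qquad \lambda_j = \frac{1}{r}\sum_i \ln m_{ij} - \lambda, \qquad \lambda_{ij} = \ln m_{ij} - \lambda_i - \lambda_j - \lambda.$$

The parameters of this model are linear combinations of the logarithms of the expected frequencies $m_{ij}$, $i = 1, 2, \dots, r$; $j = 1, 2, \dots, r$, and their number is $1 + (r - 1) + (r - 1) + (r - 1)^2 = r^2$ (their identifiability requires the constraints $\sum_i \lambda_i = 0$, $\sum_j \lambda_j = 0$, $\sum_i \lambda_{ij} = \sum_j \lambda_{ij} = 0$). The cell expected frequencies $m_{ij}$ are estimated by $n_{ij}$.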
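The interval (4) and the pattern of Table 3 can be checked numerically. The following is a minimal sketch: the function name `diff_ci`, the 2 x 2 counts and the quantile 1.96 (for a 95 % interval) are assumptions made only for illustration, while `table3` holds the counts of Table 3.

```python
from math import sqrt

def diff_ci(table, u=1.96):
    """Interval (4) for p_{1+} - p_{+1} in a 2x2 table [[n11, n12], [n21, n22]]."""
    (n11, n12), (n21, n22) = table
    n = n11 + n12 + n21 + n22
    p11, p12, p21, p22 = n11 / n, n12 / n, n21 / n, n22 / n
    p_row, p_col = p11 + p12, p11 + p21          # p_{1+} and p_{+1}
    var = (p_row * (1 - p_row) + p_col * (1 - p_col)
           - 2 * (p11 * p22 - p12 * p21)) / n    # squared standard error from (4)
    d = p_row - p_col
    return d - u * sqrt(var), d + u * sqrt(var)

# Hypothetical 2x2 counts, for illustration only.
lo, hi = diff_ci([[40, 25], [9, 26]])
print(f"95% interval for the difference: ({lo:.3f}, {hi:.3f})")

# Table 3: equal marginal totals, yet the table is not symmetric.
table3 = [[20, 10, 20], [30, 55, 5], [0, 25, 35]]
rows = [sum(r) for r in table3]
cols = [sum(c) for c in zip(*table3)]
homogeneous = rows == cols
symmetric = all(table3[i][j] == table3[j][i] for i in range(3) for j in range(3))
print(rows, cols, homogeneous, symmetric)
```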

When the independence model holds, all the association parameters $\lambda_{ij}$ in (7) are zero. The cell expected frequencies are estimated by $n\,p_{i+}\,p_{+j}$. To test the goodness of fit of this model, the well-known Pearson statistic $X^2$ or the likelihood ratio chi-squared statistic (deviance) $G^2$,

$$G^2 = 2 \sum_{i=1}^{r} \sum_{j=1}^{r} n_{ij} \ln \frac{n_{ij}}{\hat{m}_{ij}}, \qquad (8)$$

can be used. The number of degrees of freedom is $(r - 1)^2$. However, square tables for repeated measures or matched pairs data usually have large counts on the main diagonal and this model is not useful. In this case, the structure of the frequencies off the main diagonal is important. The variables are quasi-independent when they are independent given that the row response differs from the column response. While the independence loglinear model can be written as

$$\ln m_{ij} = \lambda + \lambda_i + \lambda_j, \qquad (9)$$

the quasi-independence loglinear model can be written as

$$\ln m_{ij} = \lambda + \lambda_i + \lambda_j + \delta_i I_{ij}, \qquad (10)$$

where $I_{ij}$ indicates the diagonal elements of the table ($I_{ij} = 1$ for $i = j$ and $I_{ij} = 0$ for $i \ne j$). In this model $\hat{m}_{ii} = n_{ii}$ holds, but the remaining expected frequencies have no direct estimates. To obtain the maximum likelihood estimates, the set of likelihood equations has to be solved. These equations do not have a direct solution and are solved by an iterative algorithm (the Newton-Raphson method, for example). The number of parameters of the quasi-independence model is $1 + 2(r - 1) + r$ and the residual degrees of freedom are then $df = r^2 - [1 + 2(r - 1) + r] = (r - 1)^2 - r$.

For the symmetry model, all $\lambda_{ij} = \lambda_{ji}$ in (7). The parameters $\lambda_i$, $i = 1, 2, \dots, r$, are the same for both classifications (there is marginal homogeneity). The expected frequencies are estimated as $\hat{m}_{ij} = (n_{ij} + n_{ji})/2$ in this case. It follows that $\hat{m}_{ii} = n_{ii}$. The number of parameters of the model is now $1 + (r - 1) + r(r - 1)/2$ and the residual degrees of freedom are $df = r^2 - [1 + (r - 1) + r(r - 1)/2] = r(r - 1)/2$. The Pearson statistic $X^2$ simplifies for this model to the form

$$X^2 = \sum_{i<j} \frac{(n_{ij} - n_{ji})^2}{n_{ij} + n_{ji}}. \qquad (11)$$

For $r = 2$ this is the statistic (3).

The symmetry model is often too simple to fit a table, because it imposes identical marginals. In the quasi-symmetry model, marginal homogeneity no longer holds; the parameters $\lambda_i$, $i = 1, 2, \dots, r$, are not the same for the two classifications. In this model $\hat{m}_{ii} = n_{ii}$, too, but there are no direct estimates of the other expected frequencies. To obtain these estimates, the Newton-Raphson method, iterative proportional fitting or other iterative methods must be used again. This model has the property of symmetric association (symmetry of odds ratios), when

$$\theta_{ij} = \frac{m_{ij}\, m_{rr}}{m_{ir}\, m_{rj}} = \frac{m_{ji}\, m_{rr}}{m_{jr}\, m_{ri}} = \theta_{ji} \quad \text{for all } i \text{ and } j. \qquad (12)$$

The number of parameters of this model is $1 + (r - 1) + (r - 1) + r(r - 1)/2$ and the residual degrees of freedom are now $df = r^2 - [1 + 2(r - 1) + r(r - 1)/2] = (r - 1)(r - 2)/2$.

Some loglinear models imply marginal homogeneity. If a table satisfies symmetry, it also satisfies both quasi-symmetry and marginal homogeneity. As shown, for example, in [1], the converse holds too. When quasi-symmetry holds, marginal homogeneity is equivalent to symmetry, and we can test marginal homogeneity by comparing the goodness-of-fit statistics (deviances) of the symmetry (S) and quasi-symmetry (QS) models:

$$G^2(S \mid QS) = G^2(S) - G^2(QS). \qquad (13)$$

This difference has asymptotically a chi-squared distribution with $r - 1$ degrees of freedom. Let us recall that the well-known Stuart-Maxwell test can be used to test marginal homogeneity, too. The Stuart-Maxwell statistic

$$X^2 = d' S^{-1} d, \qquad (14)$$

where $d = [d_1, d_2, \dots, d_{r-1}]'$, $d_i = n_{i+} - n_{+i}$, $i = 1, 2, \dots, r - 1$, and $S$ denotes the $(r - 1) \times (r - 1)$ covariance matrix of the elements of $d$, has asymptotically a chi-square distribution with $r - 1$ degrees of freedom. The results of both tests are usually very similar.

The following data (Table 4) were provided by Factum Invenio, s.r.o. The data come from election surveys carried out in June 2003, in April 2004 (shortly before the end of Špidla's cabinet), in June 2005 (after the end of the Gross cabinet) and in April 2006 (shortly before the parliamentary election). All these data files include the same two questions: which party the respondent voted for in the last election, and which party the respondent would vote for at the moment (the variable preference). Thus, each respondent expresses whether his inclination has changed or not since the last election.

Table 4 Data from election surveys (rows: party voted for in the last election; columns: current preference; parties numbered 1 to 6)

Vote * preference 2003 Crosstabulation

Vote        1      2      3      4      5      6   Total
1         107     17      7      3     10     18     162
2           2    149      0      2      6      9     168
3           2      0     80      1      0      4      87
4           2      2      1     50      5      2      62
5           0      3      0      2     27      2      34
6           6     19      5      8     16    144     198
Total     119    190     93     66     64    179     711

Vote * preference 2004 Crosstabulation

Vote        1      2      3      4      5      6   Total
1         102     25     19      6     22     17     191
2           9    190      2      1      6      7     215
3           2      0     82      0      1      5      90
4           2     10      1     37      5      3      58
5           1      2      0      1     27      2      33
6           8     27      5      3     15    166     224
Total     124    254    109     48     76    200     811

Vote * preference 2005 Crosstabulation

Vote        1      2      3      4      5      6   Total
1         111     11     10      3     15     13     163
2           6    187      0      3      5      9     210
3           3      4     88      3      1      3     102
4           2      4      1     49      5      4      65
5           2      6      1      1     24      1      35
6          12     31      3      4     23    187     260
Total     136    243    103     63     73    217     835

Vote * preference 2006 Crosstabulation

Vote        1      2      3      4      5      6   Total
1         157      9      6      0     12      7     191
2           8    174      0      1     11      4     198
3           3      2     87      0      0      8     100
4           3      2      1     51      2      1      60
5           7      2      2      0     32      4      47
6          14     46      5      7     32    132     236
Total     192    235    101     59     89    156     832

As we could expect, independence is strongly rejected for all four data files ($X^2$ runs from 1876 at 25 degrees of freedom). The symmetry model is also unpromising. For example, in 2004 only 9 people changed their inclination from party 2 to party 1 (row 2, column 1 of the 2004 table) and 25 people did so in the opposite direction. Only 2 people changed their inclination from party 3 to party 1 and 19 people did so in the opposite direction, and so on. The results of this model are contained in Table 5.
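The symmetry fit can be reproduced directly from a crosstabulation, because its expected frequencies have the closed form $(n_{ij} + n_{ji})/2$. The sketch below applies (11) and the deviance (8) to the 2004 counts. It rests on two assumptions: cells printed blank in the source table are read as zeros, and pairs of cells that are both zero are skipped; the names `table_2004` and `symmetry_fit` are illustrative only.

```python
from math import log

# 2004 crosstabulation of Table 4 (blank cells of the printed table read as 0: an assumption).
table_2004 = [
    [102,  25, 19,  6, 22,  17],
    [  9, 190,  2,  1,  6,   7],
    [  2,   0, 82,  0,  1,   5],
    [  2,  10,  1, 37,  5,   3],
    [  1,   2,  0,  1, 27,   2],
    [  8,  27,  5,  3, 15, 166],
]

def symmetry_fit(n):
    """Pearson X^2 (11), deviance G^2 (8) and df for the symmetry model."""
    r = len(n)
    x2 = g2 = 0.0
    for i in range(r):
        for j in range(i + 1, r):
            a, b = n[i][j], n[j][i]
            if a + b == 0:          # an empty pair contributes nothing
                continue
            m = (a + b) / 2         # fitted frequency for both cells of the pair
            x2 += (a - b) ** 2 / (a + b)
            g2 += 2 * sum(c * log(c / m) for c in (a, b) if c > 0)
    df = r * (r - 1) // 2
    return x2, g2, df

x2, g2, df = symmetry_fit(table_2004)
print(f"X^2 = {x2:.1f}, G^2 = {g2:.1f}, df = {df}")
```

Diagonal cells are fitted exactly under symmetry, so they drop out of both statistics.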

The majority of voters did not change their preference, so the largest frequencies are always on the main diagonal. This suggests fitting a quasi-independence model, omitting the diagonal. The results are also contained in Table 5. As we can see, for the years 2003 and 2005 this model fits well: it is not possible to prove differences in the pattern of changed preferences among the parties in these years. However, such a difference is proved for the years 2004 and 2006. The quasi-symmetry model fails to fit only for the year 2006. The last test confirms the expected marginal heterogeneity in all files. Table 6 with sign schemas (for quasi-independence in 2004 and 2006, quasi-symmetry in 2006) adds the most interesting results.

Let us remark that the true distribution of the statistics used for testing fit may be far from chi-squared when expected frequencies are small. Our tables are sparse and the fitted cell counts small, but for loglinear models the expected values refer to marginal totals and the chi-squared approximation is likely to be adequate. (In the next paper we would like to verify our results with exact tests.)

Table 5 Results of the analysis

Model                   Year     X^2   p-value    df     G^2   p-value
Symmetry                2003    51.3     0.00     15    59.2     0.00
                        2004    83.4     0.00     15    95.7     0.00
                        2005    55.6     0.00     15    64.4     0.00
                        2006    83.5     0.00     15    97.8     0.00
Quasi-independence      2003    21.4     0.32     19    25.9     0.13
                        2004    29.7     0.05     19    30.7     0.04
                        2005    27.9     0.09     19    28.1     0.08
                        2006    44.6     0.00     19    47.9     0.00
Quasi-symmetry          2003     4.6     0.92     10     4.4     0.93
                        2004    11.8     0.30     10    12.8     0.24
                        2005    11.9     0.29     10    13.1     0.22
                        2006    29.1     0.00     10    30.6     0.00
Marginal homogeneity    2003                       5    54.8     0.00
                        2004                       5    82.9     0.00
                        2005                       5    51.3     0.00
                        2006                       5    67.2     0.00
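Unlike the symmetry model, the quasi-independence model has no closed-form estimates, so an iterative fit is needed. The sketch below uses iterative proportional fitting on the off-diagonal cells only, which is one of several possible algorithms and not necessarily the one behind Table 5. The names `table_2003` and `quasi_independence_fit` are illustrative, blank cells of the printed 2003 table are read as zeros (an assumption), and every off-diagonal row and column total is assumed to be positive.

```python
from math import log

# 2003 crosstabulation of Table 4 (blank cells of the printed table read as 0: an assumption).
table_2003 = [
    [107,  17,  7,  3, 10,  18],
    [  2, 149,  0,  2,  6,   9],
    [  2,   0, 80,  1,  0,   4],
    [  2,   2,  1, 50,  5,   2],
    [  0,   3,  0,  2, 27,   2],
    [  6,  19,  5,  8, 16, 144],
]

def quasi_independence_fit(n, max_iter=1000, tol=1e-10):
    """Fit model (10) by iterative proportional fitting off the diagonal."""
    r = len(n)
    off = [(i, j) for i in range(r) for j in range(r) if i != j]
    row_obs = [sum(n[i][j] for j in range(r) if j != i) for i in range(r)]
    col_obs = [sum(n[i][j] for i in range(r) if i != j) for j in range(r)]
    # Start from 1 off the diagonal; the diagonal is held at 0 during the iterations.
    m = [[0.0 if i == j else 1.0 for j in range(r)] for i in range(r)]
    for _ in range(max_iter):
        for i in range(r):                       # match off-diagonal row totals
            f = row_obs[i] / sum(m[i])
            m[i] = [f * v for v in m[i]]
        change = 0.0
        for j in range(r):                       # match off-diagonal column totals
            f = col_obs[j] / sum(m[i][j] for i in range(r))
            change = max(change, abs(f - 1.0))
            for i in range(r):
                m[i][j] *= f
        if change < tol:                         # column step no longer changes anything
            break
    for i in range(r):
        m[i][i] = float(n[i][i])                 # diagonal cells are fitted exactly
    x2 = sum((n[i][j] - m[i][j]) ** 2 / m[i][j] for i, j in off)
    g2 = 2 * sum(n[i][j] * log(n[i][j] / m[i][j]) for i, j in off if n[i][j] > 0)
    df = (r - 1) ** 2 - r
    return m, x2, g2, df

fit, x2, g2, df = quasi_independence_fit(table_2003)
print(f"X^2 = {x2:.1f}, G^2 = {g2:.1f}, df = {df}")
```

The fitted off-diagonal row and column totals reproduce the observed ones, which is the defining property of the maximum likelihood fit for this model; the printed statistics may differ from Table 5 if the original software treated empty cells differently.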

Table 6 Sign schemas

References:
[1] Agresti, A.: Categorical Data Analysis, John Wiley & Sons, 1995.
[2] Anděl, J.: Matematická statistika, SNTL, Praha 1978.
[3] Jobson, J. D.: Applied Multivariate Data Analysis, Volume II: Categorical and Multivariate Methods, 1991.
[4] Řeháková, B. - Řehák, J.: Analýza kategorizovaných dat v sociologii, Academia, Praha 1986.
[5] SPSS Manuals, SPSS Inc., 1994-1999.
[6] Simonoff, J. S.: Analyzing Categorical Data, Springer-Verlag Inc., New York 2003.
[7] Stokes, M. E. - Davis, C. S. - Koch, G. G.: Categorical Data Analysis Using the SAS System, SAS Institute Inc., 1995.

Doc. Ing. Iva Pecáková, CSc.
University of Economics
Faculty of Informatics and Statistics
Department of Statistics and Probability
Prague, Czech Republic
e-mail: pecakova@vse.cz