Methods of Psychological Research Online 1998, Vol. 3, No. 2. © 1998 Pabst Science Publishers


Modeling Turn-Over Tables Using the GSK Approach*

Christof Schuster
University of Michigan

Alexander von Eye
Michigan State University

Abstract

The GSK approach is reviewed, and it is shown how a large variety of models for turn-over tables can be fitted by specifying functions of the cell probabilities that are zero if the model under analysis fits the data. It is demonstrated that the GSK approach can easily handle functions that contain sums of cell probabilities. Such functions usually cannot be dealt with easily in a log-linear modeling framework.

1 Introduction

When a categorical variable is observed repeatedly, the data are typically arranged in what is known as a turn-over table (Hagenaars, 1990 [14]). Consider the simple case where a categorical variable is observed twice. Arranging the categorical outcomes observed at the first occasion as the rows and the categorical outcomes observed at the second occasion as the columns of a square contingency table, we obtain a two-dimensional turn-over table. For a variable with three different outcomes, the turn-over table has the following form:

                $T_2(1)$    $T_2(2)$    $T_2(3)$
    $T_1(1)$    $n_{11}$    $n_{12}$    $n_{13}$
    $T_1(2)$    $n_{21}$    $n_{22}$    $n_{23}$
    $T_1(3)$    $n_{31}$    $n_{32}$    $n_{33}$

$T_k(l)$ denotes the $l$th category observed at the $k$th occasion, and the $n_{ij}$ in the cells of the table denote the frequencies of observations that fall in category $i$ at the first and in category $j$ at the second occasion. The marginal frequencies of the table are denoted as $n_{i+} = \sum_j n_{ij}$ and $n_{+j} = \sum_i n_{ij}$. The notation for three- and higher-dimensional turn-over tables is a simple generalization of the notation for two-dimensional tables.

Turn-over tables have received much attention in the literature on log-linear models (see, for example, Agresti, 1990; Bishop, Fienberg, & Holland, 1975; Clogg, Eliason, & Grego, 1990; Hagenaars, 1990, and the references therein).
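As an illustration of this layout, the following minimal sketch cross-classifies paired observations into a turn-over table and computes the marginal frequencies $n_{i+}$ and $n_{+j}$. The function name `turnover_table` and the observation lists are hypothetical and chosen only for this example; they do not come from the data analyzed in this paper.

```python
from collections import Counter


def turnover_table(first, second, categories):
    """Cross-classify paired observations into a square turn-over table.

    Rows index the category at the first occasion, columns the category
    at the second occasion.
    """
    counts = Counter(zip(first, second))
    return [[counts[(r, c)] for c in categories] for r in categories]


# Hypothetical paired observations of a three-category variable.
t1 = [1, 1, 2, 3, 2, 1, 3, 3, 2, 1]
t2 = [1, 2, 2, 3, 1, 1, 2, 3, 2, 3]

n = turnover_table(t1, t2, categories=[1, 2, 3])
row_margins = [sum(row) for row in n]        # n_{i+}
col_margins = [sum(col) for col in zip(*n)]  # n_{+j}
```

Both sets of margins sum to the sample size, which is a quick sanity check on any table built this way.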
There is a considerable number of standard models of varying complexity that can be applied to turn-over tables. These are well known under the names of quasi-independence models, quasi-symmetry models, symmetry models, marginal homogeneity models, and so forth. All these models can be applied to two-dimensional turn-over tables but can also easily be extended to three- and higher-dimensional contingency tables.

* Address for correspondence: Christof Schuster, Institute for Social Research, Ann Arbor, MI. Alexander von Eye's work on this article was supported in part by NIAAA grant 2R01 AA7065.

Most often these models are discussed in a log-linear model framework¹ when practical aspects such as fitting models, evaluating their goodness of fit, and so forth become relevant. This makes good sense because numerical fitting routines for log-linear models are widely distributed through their implementation in virtually all standard statistical computer packages. Most of these routines rely on one of two algorithms: the iterative proportional fitting (IPF) algorithm (Deming & Stephan, 1940a [6], 1940b [7]) and the Newton-Raphson algorithm. Both yield maximum-likelihood estimators of model parameters, and hence the resulting estimators have very desirable characteristics. In addition, both algorithms are numerically inexpensive and easy to apply.

However, these algorithms are most useful when analyzing data using hierarchical log-linear models. Substantive questions addressed by these models are typically whether independence, conditional independence, or some kind of homogeneous association holds among several variables. Questions of these types are usually not of interest when analyzing turn-over tables. Instead, the aim is to analyze the association among the repeated observations of a variable. When models for turn-over tables are fitted, both algorithms can usually still be applied, but the ease with which hierarchical log-linear models are fitted is gone. Usually large design matrices have to be set up by hand to employ the Newton-Raphson algorithm. How this is done is described in von Eye and Spiel (1996) [26].
The IPF algorithm often requires that turn-over tables be amended. For instance, Agresti (1990, p. 382) [1] explains how to fit the quasi-symmetry model using the IPF algorithm: first, transform the $I \times I$ table to an $I \times I \times 2$ table, and then draw conclusions concerning the $I \times I$ table by fitting log-linear models to the amended $I \times I \times 2$ table.

Yet there is another drawback when analyzing turn-over tables using log-linear models. Testing for marginal homogeneity can only be done by comparing the fit of the symmetry and the quasi-symmetry models (see, for instance, Agresti, 1990, p. 359 [1]). However, this assumes that the quasi-symmetry model is "true", and therefore this test is only of limited use. With the GSK approach, a test that is not conditional in this sense can easily be performed, as will be demonstrated later in this article.

It is well known that besides the log-linear model framework there is another framework for analyzing categorical data that relies on a different estimation rationale.² This is the weighted least-squares (WLS) approach suggested by Grizzle, Starmer, and Koch (1969) [13]. To honor the authors of this classic article, the approach is generally known in the literature as the GSK approach. This approach is even more flexible for setting up and fitting models for categorical data because it is not limited to modeling the logarithm of cell probabilities or, equivalently, of expected cell frequencies, but allows modeling very general functions of the cell probabilities. Hence, models that can be fitted by the IPF or Newton-Raphson algorithms can also be fitted by the WLS algorithm of the GSK approach, although the estimates will differ to a degree that depends mainly on sample size.
However, the preferred estimation method in log-linear models has been maximum likelihood because (1) the estimators have generally desirable characteristics, and (2) the large generality of the GSK approach is not needed if only hierarchical log-linear models are considered. In addition, the GSK approach is most naturally used when categorical variables can be subdivided into dependent and independent variables, that is, within a regression framework.

¹ Note that we would like to restrict the meaning of the term log-linear model for the present paper to models for manifest variables, like those models contained in standard textbooks on log-linear models. The range of models that can be considered log-linear is considerably enlarged when mixtures of latent and manifest variables are included in models.

² For yet another very general approach to analyzing categorical data, see Rindskopf (1992) [23].

In this paper we show how the GSK approach to categorical data can be applied in a straightforward way to analyzing data given in turn-over tables. Although in turn-over tables variables have equal status, that is, are not divided into dependent and independent variables, the GSK approach can be applied in a natural way because it allows specifying null hypotheses that involve several almost arbitrary functions of the cell probabilities. Hence, by showing how models for turn-over tables translate into hypotheses of this kind, we can show how the test statistics for these models can be calculated and evaluated. Since the GSK approach is very flexible in the type of functions of the cell probabilities that can be handled, sums of cell probabilities pose no special problems.

In Section 2 we give a brief outline of the GSK approach. In Section 3 we show how several of the models mentioned above, including the model of marginal homogeneity, can be fitted with the GSK approach, giving results similar to those obtained by maximum-likelihood estimation. Section 4 discusses and summarizes the results.

2 The GSK Approach

In this section, we give an outline of the main ideas of the GSK approach without proofs of the main results. More detailed discussion of the approach can be found in the relevant literature. Accessible presentations are given in Wickens (1989) [27], Christensen (1997) [4], and Agresti (1990) [1], while more advanced presentations are contained in Grizzle et al. (1969) [13], Koch, Landis, Freeman, Freeman, and Lehnen (1977) [16], Koch, Amara, Davis, and Gillings [15], Landis and Koch (1979) [17], and Landis and Miller (1988) [18].
The GSK approach is fully implemented in SAS PROC CATMOD. Stokes, Davis, and Koch (1995) [25] give numerous applications of the GSK approach with both cross-sectional and longitudinal data. Since we do not use the GSK approach in a regression framework where covariates or, equivalently, several populations are involved, we describe the approach considering only a single population. Knowing how the approach works for a single population, one is able to extend it to multiple populations. Nevertheless, we do not dwell on using the GSK approach as a tool for fitting regression models to categorical data.

Consider the simple case of a two-dimensional contingency table where the first variable has $J$ and the second variable has $K$ categories. Let $\pi' = (\pi_{11}, \pi_{12}, \ldots, \pi_{JK})$. Note that $\sum_{(jk)} \pi_{jk} = 1$, so that only $JK - 1$ different $\pi$'s are free to vary. The large generality of the GSK approach stems from the fact that it allows testing research hypotheses that can be expressed as vector-valued functions $F(\pi)$. The test procedure based on weighted least squares allows evaluating whether $F(\pi) = 0$ holds. Written explicitly, $F$ is given as

\[ F(\pi) = [F_1(\pi), F_2(\pi), \ldots, F_m(\pi)]', \]

where $m \le JK - 1$.

Before further pursuing technical details, it should be noted that hypotheses for contingency tables are often expressed as functions of $\pi$. Consider the case of a $2 \times 2$ table. Testing for independence of the two categorical variables is equivalent to testing whether the odds ratio $\theta$ is equal to one, that is,

\[ H_0: \theta = \frac{\pi_{11}\pi_{22}}{\pi_{12}\pi_{21}} = 1. \]

Of course, this is easily tested with the GSK approach since for $m = 1$, that is, when $F$ is real valued, we specify the null hypothesis as

\[ H_0: F(\pi) = F_1(\pi) = \log \pi_{11} - \log \pi_{12} - \log \pi_{21} + \log \pi_{22} = 0. \]

Rather than specifying a log-linear model for the cell counts assuming independence between the two variables, we focus on the functions of $\pi$ that are assumed to be zero in the population if independence holds.

Another simple example of recasting hypotheses in terms of $F$ is McNemar's test for determining whether there is systematic change in a dichotomous variable observed at two different occasions. The hypothesis tested by McNemar's test is $H_0: \pi_{12} = \pi_{21}$. Again, this can easily be rearranged to take the form $F(\pi) = 0$ by expressing $H_0$ as

\[ H_0: F(\pi) = F_1(\pi) = \pi_{12} - \pi_{21} = 0. \]

McNemar's test can equivalently be considered a test for symmetry or marginal homogeneity in a $2 \times 2$ table.

Now we consider testing for symmetry in a $3 \times 3$ table. Here the null hypothesis is given as $H_0: \pi_{jk} = \pi_{kj}$ for all $j < k$. For a $3 \times 3$ table this means that $\pi_{12} = \pi_{21}$, $\pi_{13} = \pi_{31}$, and $\pi_{23} = \pi_{32}$ are simultaneously true if symmetry holds. Expressing this null hypothesis as $F(\pi) = 0$ yields

\[ H_0: F(\pi) = \begin{pmatrix} F_1(\pi) \\ F_2(\pi) \\ F_3(\pi) \end{pmatrix} = \begin{pmatrix} \pi_{12} - \pi_{21} \\ \pi_{13} - \pi_{31} \\ \pi_{23} - \pi_{32} \end{pmatrix} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}. \]

Although the function $F(\pi)$ can in principle be very general (see below for more details on regularity assumptions on $F$), it turns out that the substantively interesting $F(\pi)$ can often be expressed as $F(\pi) = A_1 \pi$ or $F(\pi) = A_2 \log(A_1 \pi)$, where $A_1$ and $A_2$ are suitably chosen matrices. The SAS system allows almost arbitrary functions of $\pi$ that can be composed of matrix multiplications and applications of the functions exp and log. Landis and Koch (1979, p. 260) [17] give an example of a complex function of $\pi$ that can be expressed using only matrix multiplication combined with exp and log.
For suitably chosen matrices $A_1, A_2, A_3, A_4$, the authors express Yule's $Q$ as

\[ Q = \exp(A_4 \log(A_3 \exp(A_2 \log(A_1 \pi)))). \]

Having outlined the basics of the GSK approach, we now present the inferential procedure. Since the $\pi$'s are unknown, we estimate them using relative frequencies, that is,

\[ p = \hat{\pi} = \frac{1}{N} (n_{11}, n_{12}, \ldots, n_{JK})', \]

where $N = n_{++} = \sum_{jk} n_{jk}$ denotes the sample size. Standard asymptotic theory shows that, for large $N$, $p$ approximately follows a multivariate normal distribution with mean $\pi$ and covariance matrix $V(\pi)$.

Under the distributional assumptions just given, a test statistic for evaluating $F(p)$ can be constructed by using the multivariate version of the so-called delta method if $F$ has partial derivatives up to the second order with respect to the $\pi_{jk}$ (see, for instance, Agresti, 1990 [1]; Bishop et al., 1975 [3]; Sen & Singer, 1993 [24]). Without further derivations we now present the test statistic for evaluating whether $F(\pi) = 0$ is likely to hold. Let $H(p)$ be defined as the matrix whose $l$th row contains the partial derivatives of the $l$th response function with respect to $\pi$ evaluated at $p$, that is,

\[ H(p)_{l,jk} = \left. \frac{\partial F_l(\pi)}{\partial \pi_{jk}} \right|_{\pi_{jk} = p_{jk}}, \]

and let $V(p)$ be given as

\[ V(p) = \frac{1}{N} \left[ \mathrm{Diag}(p) - p p' \right], \]

where $\mathrm{Diag}(p)$ denotes the diagonal matrix that contains the values of $p$ in its diagonal cells and zeros in all off-diagonal cells. Then we can calculate $S(p)$ as $H(p) V(p) H(p)'$, and the test statistic $X^2$ for testing whether $F(\pi) = 0$ is given as

\[ X^2 = F(p)' S^{-1}(p) F(p). \]

$X^2$ is asymptotically distributed as $\chi^2$ with degrees of freedom equal to the number $m$ of real-valued functions in $F(\pi)$. The development can easily be extended to situations where there are more than two response variables.

3 Fitting Models for Turn-Over Tables using the GSK Approach

3.1 Quasi-Symmetry

The case of quasi-symmetry can be handled by noting that the corresponding hypothesis can be expressed in terms of odds ratios. We consider a square two-dimensional contingency table with $I$ rows and $I$ columns. For such a table Agresti (1990, p. 355 [1]) gives the following conditions for quasi-symmetry to hold:

\[ \frac{\pi_{ij}\,\pi_{II}}{\pi_{iI}\,\pi_{Ij}} = \frac{\pi_{ji}\,\pi_{II}}{\pi_{jI}\,\pi_{Ii}} \quad \text{for } i, j = 1, \ldots, I. \]

Note that $\pi_{II}$ appears on both the left-hand and the right-hand side of this equation. Hence, we can omit this term and obtain

\[ \frac{\pi_{ij}}{\pi_{iI}\,\pi_{Ij}} = \frac{\pi_{ji}}{\pi_{jI}\,\pi_{Ii}}. \]

We can use this expression to derive restrictions of the form $F(\pi) = 0$ if quasi-symmetry holds.
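Before the restrictions are derived, the test statistic $X^2$ of Section 2 can be sketched in code. The following is a minimal illustration in plain Python, not the SAS implementation: it handles only the two linear cases $F(\pi) = A\pi$ and $F(\pi) = A \log \pi$, it assumes that all cell counts are positive when logarithms are used, and the function names `solve` and `gsk_statistic` as well as the example counts are hypothetical.

```python
import math


def solve(S, b):
    """Solve S x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [list(S[i]) + [b[i]] for i in range(m)]
    for c in range(m):
        piv = max(range(c, m), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, m):
            f = M[r][c] / M[c][c]
            for k in range(c, m + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * m
    for r in reversed(range(m)):
        tail = sum(M[r][k] * x[k] for k in range(r + 1, m))
        x[r] = (M[r][m] - tail) / M[r][r]
    return x


def gsk_statistic(counts, A, use_log=False):
    """X^2 for H0: A pi = 0 (use_log=False) or A log(pi) = 0 (use_log=True).

    counts: cell frequencies in a fixed order; A: list of m coefficient rows.
    """
    N = sum(counts)
    p = [c / N for c in counts]
    g = [math.log(v) for v in p] if use_log else p
    F = [sum(a * v for a, v in zip(row, g)) for row in A]
    # H is the Jacobian of F at p; in the log case column k is divided by p_k.
    H = [[a / v if use_log else a for a, v in zip(row, p)] for row in A]
    # S = H V H' with V = (Diag(p) - p p') / N, computed without forming V.
    Hp = [sum(h * v for h, v in zip(row, p)) for row in H]
    m = len(A)
    S = [[(sum(H[r][k] * p[k] * H[s][k] for k in range(len(p)))
           - Hp[r] * Hp[s]) / N for s in range(m)] for r in range(m)]
    return sum(f * v for f, v in zip(F, solve(S, F)))


# Hypothetical 2x2 table, cells ordered (11, 12, 21, 22).
counts = [40, 10, 20, 30]
x2_mcnemar = gsk_statistic(counts, [[0, 1, -1, 0]])                # pi_12 = pi_21
x2_indep = gsk_statistic(counts, [[1, -1, -1, 1]], use_log=True)   # log odds ratio = 0
```

For the log odds ratio, $S(p)$ reduces to the familiar sum of reciprocal cell counts, so `x2_indep` is the squared log odds ratio divided by $\sum_{jk} 1/n_{jk}$.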
In general, we have to derive as many restrictions on the $\pi$'s as there are degrees of freedom. The quasi-symmetry model has $(I - 1)(I - 2)/2$ degrees of freedom. Since the above expression has to hold for all $i$ and $j$, it contains redundancy. As can be seen from the following arguments, there are only as many independent restrictions as there are degrees of freedom for the quasi-symmetry model. Consider the case where $i = j$. It can easily be checked that in this case equality holds regardless of the values of $\pi$.

Hence, from the $I^2$ restrictions we have already identified $I$ restrictions that are superfluous. Now assume that either $i = I$ or $j = I$ but $i \ne j$. We assume without loss of generality that $j = I$. Then we have

\[ \frac{\pi_{iI}}{\pi_{iI}\,\pi_{II}} = \frac{\pi_{Ii}}{\pi_{II}\,\pi_{Ii}}, \]

from which we can readily see that this expression is trivially always true. This means that an additional $2 \times (I - 1)$ of the $I^2$ conditions are superfluous. Finally, we note that the restrictions in terms of odds ratios for $(i, j) = (k, l)$ and $(j, i) = (l, k)$ are equivalent. Hence, we have another $(I - 1)(I - 2)/2$ superfluous restrictions. Calculating the number of superfluous restrictions and subtracting this number from $I^2$ results in

\[ I^2 - \left( I + 2(I - 1) + \frac{(I - 1)(I - 2)}{2} \right) = \frac{1}{2}(I - 2)(I - 1), \]

which is exactly equal to the number of degrees of freedom of the quasi-symmetry model. Hence, we can deduce that just the restrictions with $1 \le i < j \le I - 1$ are non-redundant. For a $3 \times 3$ table there is only one condition that needs to be tested for quasi-symmetry. For a $4 \times 4$ table there are $\frac{1}{2}(4 - 2)(4 - 1) = 3$ conditions that need to be tested. We already know which these conditions are. For quasi-symmetry we have to check for a $4 \times 4$ table whether

\[ \frac{\pi_{12}}{\pi_{14}\,\pi_{42}} = \frac{\pi_{21}}{\pi_{24}\,\pi_{41}}, \qquad \frac{\pi_{13}}{\pi_{14}\,\pi_{43}} = \frac{\pi_{31}}{\pi_{34}\,\pi_{41}}, \qquad \frac{\pi_{23}}{\pi_{24}\,\pi_{43}} = \frac{\pi_{32}}{\pi_{34}\,\pi_{42}} \]

are simultaneously true. Equivalently, we can take the logarithm of both sides of each equation and move all terms to the left-hand side. This yields

\[ \log \pi_{12} - \log \pi_{14} - \log \pi_{42} - \log \pi_{21} + \log \pi_{24} + \log \pi_{41} = 0, \]
\[ \log \pi_{13} - \log \pi_{14} - \log \pi_{43} - \log \pi_{31} + \log \pi_{34} + \log \pi_{41} = 0, \]
\[ \log \pi_{23} - \log \pi_{24} - \log \pi_{43} - \log \pi_{32} + \log \pi_{34} + \log \pi_{42} = 0. \]

Hence, the function $F(\pi) = 0$ has been found.

We now demonstrate the fitting of the quasi-symmetry model to a $4 \times 4$ table. The data are given in Agresti (1990, p. 357 [1]) and were obtained from a sample of 55,981 residences sampled by the U.S. Bureau of the Census; see Table 1.
The four categories are the four regions of the USA (Northeast, Midwest, South, and West), and interest lies in analyzing the migration that took place between 1980, when the first observation was made, and 1985, the year when the sample was revisited.

    Residence      Residence in 1985
    in 1980        Northeast   Midwest   South   West
    Northeast      [cell frequencies not recoverable from the source]
    Midwest
    South
    West

    Table 1: Migration of U.S. residences between 1980 and 1985.

Fitting the model of quasi-symmetry by maximum likelihood yields a deviance value of $G^2 = 2.99$ based on $df = 3$. Now let us fit the model with the GSK approach. The only problem is implementing the three response functions in a SAS job file. Here is how the model can be specified and fitted in SAS.

    proc catmod data=sasuser.agr10_2;
      weight n;
      response 0 1 0 -1 -1 0 0  1  0  0 0 0 1 -1  0 0,
               0 0 1 -1  0 0 0  0 -1  0 0 1 1  0 -1 0,
               0 0 0  0  0 0 1 -1  0 -1 0 1 0  1 -1 0 log;
      model res80*res85 = / noint;
    run;

The data are assumed to be contained in a file named agr10_2 in the sasuser library. The data file contains three variables: res80 and res85, the indicator variables for geographical region in 1980 and 1985, respectively, where the index for res85 changes fastest, and n, which contains the numbers of residences that fall in the cells of the contingency table. Most important is the response statement. First, we can see that it contains the log keyword at the end. This means that the logarithm of the vector of estimated probabilities $p$ is used for calculations. The previous three lines define three linear functions of the log cell probabilities according to the three response functions $F_1(\pi)$, $F_2(\pi)$, and $F_3(\pi)$ given above. This can be seen by expressing $\log \pi$ as

\[ \log \pi = \log(\pi_{11}, \ldots, \pi_{14}, \pi_{21}, \ldots, \pi_{24}, \pi_{31}, \ldots, \pi_{34}, \pi_{41}, \ldots, \pi_{44})' \qquad (1) \]

and calculating $A' \log \pi$, where $A$ is the $16 \times 3$ matrix given in the response statement of the SAS commands. The model statement merely tells the SAS system to test whether all three response functions are simultaneously equal to zero. These SAS commands yield the following slightly abbreviated output:

    CATMOD PROCEDURE

    Response: RES80*RES85      Response Levels (R)=  16
    Weight Variable: N         Populations (S)=      1
    Data Set: AGR10_2          Total Frequency (N)=  55981
    Frequency Missing: 0       Observations (Obs)=   16

    Sample    Response Functions
    [values not recoverable from the source]

    ANALYSIS-OF-VARIANCE TABLE
    Source      DF    Chi-Square    Prob
    RESIDUAL    [values not recoverable from the source]

The test statistic yields a value of 2.98 based on 3 degrees of freedom. This is almost exactly the result obtained from the maximum-likelihood analysis given above. In both cases we come to the same substantive conclusion that the migration of U.S. residences between 1980 and 1985 can be explained well by a quasi-symmetry model.

For symmetry and quasi-symmetry models, log-linear models and the procedure just outlined in this section are closely related to each other. This relation stems from the fact that testing for $F(\pi) = 0$ is done by calculating linear functions of the logarithm of the cell probabilities, that is, $F(\pi) = A' \log \pi = 0$. Therefore, the following paragraphs apply also to the diagonals-parameter symmetry model of Goodman discussed in a later section of this paper.

The connection between fitting log-linear models and the GSK approach can be seen by using the design-matrix approach to log-linear models (Evers & Namboodiri, 1979 [9]). Let $X$ be the design matrix for fitting a log-linear model. Let $\eta_{jk} = \log m_{jk}$, where the $m_{jk}$ are expected frequencies in a two-dimensional contingency table. Using matrix notation, the log-linear model can be written as

\[ \eta = X\beta, \qquad (2) \]

where $X$ is called the design matrix. For explanations of how design matrices for symmetry and quasi-symmetry models are set up, see von Eye and Spiel (1996) [26]. Finding a log-linear model that fits the data well is done by adding parameters to or deleting parameters from the model equation and adding or deleting the corresponding column vectors of the design matrix $X$ or, in a more abstract sense, by enlarging or diminishing the column space of $X$.
If $X$ has $q$ rows and $p$ columns and $q \ge p$, model selection of log-linear models can be seen as the problem of selecting a $p$-dimensional subspace in $q$-dimensional space. Positing that $C(X)$ yields a good model fit is equivalent to positing that the orthogonal complement of $C(X)$, that is, $C(X)^{\perp}$, adds nothing to the model. Since the GSK approach tests for $F(\pi) = 0$, this is equivalent to examining whether $C(X)^{\perp}$ can be ignored. Hence, while log-linear modeling focuses on specifying $C(X)$, the GSK approach focuses on finding $C(X)^{\perp}$. More specifically, let $X$ be the design matrix for a quasi-symmetry model and let $A$ be a matrix such that $F(\pi) = A' \log \pi = 0$; then $C(A) = C(X)^{\perp} = C(I - X(X'X)^{-1}X')$. In the example for the quasi-symmetry model from above, $A$ was given by the three response functions that were assumed to be simultaneously equal to zero if the quasi-symmetry model holds; see the matrix given in the response statement of the last SAS command file.
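The relation between $A$ and $X$ can be checked numerically. The sketch below generates one coefficient row per non-redundant quasi-symmetry restriction (these rows are the columns of $A$) and verifies that each is orthogonal to every column of a design matrix for the quasi-symmetry model. The design matrix used here is an over-parameterized indicator coding chosen for simplicity, which is an assumption of this example and not necessarily the coding of von Eye and Spiel (1996); only its column space matters for the check. All function names are hypothetical.

```python
def cell(i, j, I):
    """0-based position of cell (i, j), given 1-based indices, in row-major order."""
    return (i - 1) * I + (j - 1)


def quasi_symmetry_rows(I):
    """One coefficient row per pair 1 <= i < j <= I-1, applied to log(pi)."""
    rows = []
    for i in range(1, I):
        for j in range(i + 1, I):
            a = [0] * (I * I)
            signs = {(i, j): 1, (i, I): -1, (I, j): -1,
                     (j, i): -1, (j, I): 1, (I, i): 1}
            for (r, c), sign in signs.items():
                a[cell(r, c, I)] = sign
            rows.append(a)
    return rows


def quasi_symmetry_design(I):
    """Columns of an over-parameterized design matrix spanning quasi-symmetry."""
    cells = [(r, c) for r in range(1, I + 1) for c in range(1, I + 1)]
    cols = []
    for k in range(1, I + 1):
        cols.append([1 if r == k else 0 for r, c in cells])  # row effect
        cols.append([1 if c == k else 0 for r, c in cells])  # column effect
    for i in range(1, I + 1):
        for j in range(i, I + 1):
            # symmetric association term shared by cells (i, j) and (j, i)
            cols.append([1 if {r, c} == {i, j} else 0 for r, c in cells])
    return cols


A = quasi_symmetry_rows(4)
X = quasi_symmetry_design(4)
orthogonal = all(sum(a * x for a, x in zip(row, col)) == 0
                 for row in A for col in X)
```

For $I = 4$ the first generated row reproduces the first response function of the quasi-symmetry example, and `orthogonal` confirms that the three rows lie in $C(X)^{\perp}$.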

3.2 Symmetry

Having already shown in Section 2 how the function $F(\pi)$ for a symmetry model is set up for a $3 \times 3$ table, we do not further comment on symmetry models for higher-dimensional contingency tables. The setup of $F(\pi)$ for these models can easily be generalized from two-dimensional tables, and fitting the models in SAS is very similar to fitting quasi-symmetry models. Mainly the response statement has to be altered. From examining $F(\pi)$ it should be clear how this can be done.

3.3 Marginal Homogeneity

Testing marginal homogeneity in log-linear models is difficult because marginal probabilities are given as sums of cell probabilities. As has already been mentioned in Section 1, marginal homogeneity can be tested using log-linear models if it can be assumed that the quasi-symmetry model holds in the population. Therefore, this test is often called a "conditional" test. If quasi-symmetry does not hold in the population, marginal homogeneity can still be tested using the GSK approach. Other unconditional tests for marginal homogeneity are given in Agresti (1990, p. 359) [1].

The hypothesis of marginal homogeneity can be expressed in terms of the cell probabilities. If marginal homogeneity holds, then

\[ H_0: \pi_{i+} = \pi_{+i} \quad \text{for } i = 1, \ldots, (I - 1) \]

is fulfilled.³ Note that the model of marginal homogeneity has $(I - 1)$ degrees of freedom.
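Because each restriction $\pi_{i+} - \pi_{+i} = 0$ is linear in the cell probabilities, the hypothesis has the form $A\pi = 0$. The following minimal sketch generates the coefficient rows for a general $I \times I$ table with cells in row-major order; the function name `marginal_homogeneity_rows` and the probabilities are hypothetical and serve only as an illustration.

```python
def marginal_homogeneity_rows(I):
    """Rows a_i with a_i . pi = pi_{i+} - pi_{+i} for i = 1, ..., I-1."""
    rows = []
    for i in range(1, I):
        a = [0] * (I * I)
        for k in range(1, I + 1):
            a[(i - 1) * I + (k - 1)] += 1  # + pi_{ik}, part of row margin i
            a[(k - 1) * I + (i - 1)] -= 1  # - pi_{ki}, part of column margin i
        rows.append(a)
    return rows


A = marginal_homogeneity_rows(4)

# Hypothetical cell probabilities of a symmetric table: the margins agree,
# so every response function evaluates to zero.
pi = [0.10, 0.05, 0.03, 0.02,
      0.05, 0.20, 0.04, 0.01,
      0.03, 0.04, 0.15, 0.06,
      0.02, 0.01, 0.06, 0.13]
F = [sum(a * p for a, p in zip(row, pi)) for row in A]
```

The diagonal cell $\pi_{ii}$ enters both margins and therefore cancels, which is why each row has only $2(I - 1)$ non-zero entries.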
Assuming again that the data are given in a $4 \times 4$ table, this model can be written as

\[ \pi_{i1} + \pi_{i2} + \pi_{i3} + \pi_{i4} = \pi_{1i} + \pi_{2i} + \pi_{3i} + \pi_{4i} \quad \text{for } i = 1, \ldots, 3, \]

or, expressed in the form $F_i(\pi) = 0$, where the diagonal term $\pi_{ii}$ cancels on both sides,

\[ F_1(\pi) = \pi_{12} + \pi_{13} + \pi_{14} - \pi_{21} - \pi_{31} - \pi_{41} = 0, \]
\[ F_2(\pi) = \pi_{21} + \pi_{23} + \pi_{24} - \pi_{12} - \pi_{32} - \pi_{42} = 0, \]
\[ F_3(\pi) = \pi_{31} + \pi_{32} + \pi_{34} - \pi_{13} - \pi_{23} - \pi_{43} = 0. \]

Using the same data example as before, the SAS statements for testing marginal homogeneity can be set up as:

    proc catmod data=sasuser.agr10_2;
      weight n;
      response 0  1  1 1 -1 0  0 0 -1  0 0 0 -1  0  0 0,
               0 -1  0 0  1 0  1 1  0 -1 0 0  0 -1  0 0,
               0  0 -1 0  0 0 -1 0  1  1 0 1  0  0 -1 0;
      model res80*res85 = / noint;
    quit;

The SAS statements are almost the same as in the data example for fitting the quasi-symmetry model. Only the setup of the response functions has changed. Running these SAS commands yields the following output:

³ Note that this null hypothesis implies $\pi_{I+} = \pi_{+I}$. Therefore, when this hypothesis is expressed in other presentations of marginal homogeneity, $i$ often ranges from 1 to $I$.

    CATMOD PROCEDURE

    Response: RES80*RES85      Response Levels (R)=  16
    Weight Variable: N         Populations (S)=      1
    Data Set: AGR10_2          Total Frequency (N)=  55981
    Frequency Missing: 0       Observations (Obs)=   16

    Sample    Response Functions
    [values not recoverable from the source]

    ANALYSIS-OF-VARIANCE TABLE
    Source      DF    Chi-Square    Prob
    RESIDUAL    [values not recoverable from the source]

Since we already know that the quasi-symmetry model fits the data reasonably well, the conditional test should yield a test statistic very similar to that of the unconditional test presented here. Agresti (1990) [1] reports a deviance of $G^2 = 240.56$ based on $df = 3$ for the conditional test procedure based on log-linear models. The $\chi^2$-value calculated by the GSK approach is 236.49 based on $df = 3$, which corresponds quite closely to the value of the conditional approach. Both tests suggest that marginal homogeneity does not hold for the $4 \times 4$ table. Hence, there is a time trend in the migration patterns, meaning that certain regions in the U.S. have been preferred by the people who moved between 1980 and 1985.

It is an interesting result of the above analysis that the $\chi^2$-value resulting from the GSK approach is identical to the $\chi^2$-value of the unconditional test of Bhapkar (1966) [2] as reported in Agresti (1990, p. 360) [1]. Note that all the reported test statistics for marginal homogeneity are almost identical because quasi-symmetry holds for the contingency table under study. If quasi-symmetry does not hold, results from the conditional test based on log-linear models and the unconditional test based on the GSK approach are not necessarily asymptotically equivalent. If the test statistics differ, the result of the GSK approach should be used to draw substantive conclusions.

3.4 Diagonal-Parameter Symmetry Model

We now present how the diagonal-parameter symmetry model, which is one of the many models Goodman proposed for the analysis of turn-over tables, can be fitted using the GSK approach. For a discussion of the model see Goodman (1979 [11], 1981 [12]).
Using Goodman's notation in this section, the symmetry model can be expressed as

\[ F_{ij} = \phi_{ij} \quad \text{for } i \ne j, \]

where the $F_{ij}$ now denote the expected cell frequencies in the turn-over table⁴ and $\phi_{ij} = \phi_{ji}$. The $\phi_{ij}$ are called symmetry parameters. The symmetry model rarely fits the data well. If the variables of the turn-over table have ordinal scale quality, it might be reasonable to examine a model that treats all changes by one unit from observation Time 1 to observation Time 2 alike, regardless of whether the change was from level 1 to level 2, or from level 2 to level 3, and so forth. Similarly, changes by two scale units are treated alike, regardless of whether the change was from category 1 to category 3, or from category 2 to category 4, and so forth. The model that expresses this is

\[ F_{ij} = \phi_{ij}\,\delta_k \quad \text{for } i \ne j,\; k = j - i. \]

⁴ Readers should not confuse the function $F$ that is evaluated by the GSK approach with the expected cell frequencies.

The diagonal elements are not part of the model; rather, they are completely ignored. We see that in addition to the symmetry parameters we have $2(I - 1)$ additional $\delta$-parameters, where $I$ denotes the number of rows/columns of the table. To make them uniquely defined we have to impose $I - 1$ constraints on them, that is, three constraints for a $4 \times 4$ table. We choose $\delta_{-1} = \delta_1^{-1}$, $\delta_{-2} = \delta_2^{-1}$, $\delta_{-3} = \delta_3^{-1}, \ldots$. Table 2 illustrates for a $4 \times 4$ table to which cells each of the three different $\delta$-parameters belongs.

                   $\delta_1$         $\delta_2$         $\delta_3$
    $\delta_1^{-1}$                   $\delta_1$         $\delta_2$
    $\delta_2^{-1}$  $\delta_1^{-1}$                     $\delta_1$
    $\delta_3^{-1}$  $\delta_2^{-1}$  $\delta_1^{-1}$

    Table 2: Assignment of $\delta$-parameters to cells in a $4 \times 4$ turn-over table.

As can be seen from this table, each of the upper minor diagonals is assigned a separate $\delta$-parameter, and without loss of generality the parameters used for the lower minor diagonals are the inverses of the corresponding upper minor diagonal parameters.

The degrees of freedom for this model are calculated as follows. Let $I$ be the number of rows/columns of the table. We have $I^2 - I$ off-diagonal cells, and there are as many symmetry parameters as there are cells below (or equivalently above) the main diagonal, that is, $I(I - 1)/2$. In addition there are $I - 1$ $\delta$-parameters. Hence, the degrees of freedom are

\[ (I^2 - I) - \frac{I(I - 1)}{2} - (I - 1) = \frac{1}{2}(I - 2)(I - 1). \]

This is exactly equal to the number of degrees of freedom for the quasi-symmetry model. For a $4 \times 4$ table we have 3 degrees of freedom left. Hence, to use the GSK approach to fit this model we have to find three functions of the cell probabilities that are zero assuming the model holds true. The following arguments for a $4 \times 4$ table show which functions these are. Assuming $j > i$, we see that

\[ \frac{F_{ij}}{F_{ji}} = \frac{\phi_{ij}\,\delta_k}{\phi_{ij}\,\delta_k^{-1}} = \delta_k^2. \]

This means that the ratio $F_{ij}/F_{ji}$ is a constant that depends only on $k = j - i$.
Hence, for $k = 1$ we have

\[ \frac{F_{12}}{F_{21}} = \frac{F_{23}}{F_{32}} = \frac{F_{34}}{F_{43}}, \]

while for $k = 2$ we obtain

\[ \frac{F_{13}}{F_{31}} = \frac{F_{24}}{F_{42}}. \]

The first of these two equations defines two equality restrictions and the second defines an additional restriction on the expected cell frequencies or, equivalently, on the cell probabilities.
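Taking logarithms turns each of these ratio equalities into a linear contrast in the log cell probabilities. The following minimal sketch generates the contrasts for a general $I \times I$ table by comparing adjacent cells on each minor diagonal, and checks them on expected frequencies constructed to satisfy the model. The function name `diagonals_parameter_rows` and all parameter values are hypothetical.

```python
import math


def diagonals_parameter_rows(I):
    """Contrasts in log(pi) equating log(pi_ij / pi_ji) along each diagonal k = j - i."""
    rows = []
    for k in range(1, I - 1):        # diagonals that contain at least two cells
        for i in range(1, I - k):    # compare cell (i, i+k) with (i+1, i+1+k)
            a = [0] * (I * I)
            terms = (((i, i + k), 1), ((i + k, i), -1),
                     ((i + 1, i + 1 + k), -1), ((i + 1 + k, i + 1), 1))
            for (r, c), sign in terms:
                a[(r - 1) * I + (c - 1)] = sign
            rows.append(a)
    return rows


# Hypothetical symmetry parameters (phi) and diagonal parameters (delta).
I = 4
phi = {(1, 2): 5.0, (1, 3): 3.0, (1, 4): 2.0,
       (2, 3): 6.0, (2, 4): 4.0, (3, 4): 7.0}
delta = {1: 1.2, 2: 0.8, 3: 1.5}

m = [[1.0] * I for _ in range(I)]  # diagonal cells are ignored by the model
for (i, j), value in phi.items():
    m[i - 1][j - 1] = value * delta[j - i]  # above the diagonal: phi * delta_k
    m[j - 1][i - 1] = value / delta[j - i]  # below the diagonal: phi / delta_k

log_m = [math.log(v) for row in m for v in row]
rows = diagonals_parameter_rows(I)
F = [sum(a * g for a, g in zip(row, log_m)) for row in rows]
```

Because each contrast sums to zero, it makes no difference whether it is applied to log expected frequencies, as here, or to log cell probabilities.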

    Right eye      Left eye grade
    grade          best   second   third   worst
    best           [cell frequencies not recoverable from the source]
    second
    third
    worst

    Table 3: Turn-over table of women according to right and left eye grade with respect to unaided distance vision.

Taking logarithms and expressing the three restrictions using cell probabilities, we obtain the function $F$ as

\[ F_1(\pi) = \log \pi_{12} - \log \pi_{21} - \log \pi_{23} + \log \pi_{32} = 0, \]
\[ F_2(\pi) = \log \pi_{23} - \log \pi_{32} - \log \pi_{34} + \log \pi_{43} = 0, \]
\[ F_3(\pi) = \log \pi_{13} - \log \pi_{24} - \log \pi_{31} + \log \pi_{42} = 0. \]

We now demonstrate the calculations using a data example. The data have been repeatedly analyzed in the literature and are given, for instance, in Grizzle et al. (1969) [13] or Goodman (1979 [11], 1981 [12]). Table 3 contains the data on unaided vision. The vector of probabilities is set up exactly as in Equation (1), which includes the probabilities for the diagonal elements although these are not part of the model. This is done to facilitate the presentation of the response statement in the following SAS job file, which uses SAS CATMOD for evaluating the functions in $F$ simultaneously.

    proc catmod data=sasuser.eyes;
      weight n;
      response 0 1 0 0 -1 0 -1  0  0  1 0  0 0 0 0 0,
               0 0 0 0  0 0  1  0  0 -1 0 -1 0 0 1 0,
               0 0 1 0  0 0  0 -1 -1  0 0  0 0 1 0 0 log;
      model A*B = / noint;
    run;

Except for the response statement, this job is almost identical to the ones given earlier. The only differences are that the SAS file containing the data has a different name and that the variables are now labeled A and B. The slightly abbreviated SAS output is

    CATMOD PROCEDURE

    Response: A*B              Response Levels (R)=  16
    Weight Variable: N         Populations (S)=      1
    Data Set: EYES             Total Frequency (N)=  7477
    Frequency Missing: 0       Observations (Obs)=   16

    Sample    Response Functions
    [values not recoverable from the source]

    ANALYSIS-OF-VARIANCE TABLE
    Source      DF    Chi-Square    Prob
    RESIDUAL    [values not recoverable from the source]

The value of the test statistic is 0.50 based on 3 degrees of freedom. This is exactly the value reported by Goodman (1979) [11] for the diagonal-parameter symmetry model. This is a very small value that shows that this model fits the data almost perfectly. The difference between the test statistics based on maximum likelihood and weighted least squares is within rounding error.

4 Discussion

The GSK approach requires almost no distributional assumptions about the observed random variables. Only the conditions that are needed for the multivariate central limit theorem to hold must be fulfilled, for instance, finite variance and independence of observations. Therefore, the GSK approach can be applied almost universally if the sample size is large. However, the GSK approach is not a likelihood-based approach. From a theoretical point of view this can be criticized. From an applied point of view, however, the GSK approach is an extremely versatile technique. Its generality makes it well suited for fitting models that have been developed in other frameworks. In applied research the latest software is often not available, or other practical problems make fitting the model by the most desirable estimation procedure difficult or impossible. Then the GSK approach might help to obtain a reasonable approximation to the more advanced methods. Although this was not demonstrated in this paper, so-called "marginal models" (see Liang, Zeger, & Qaqish, 1992 [20]; Diggle, Liang, & Zeger, 1994 [8]) that are nowadays usually fitted using GEE methodology (Liang & Zeger, 1986 [19]; Zeger & Liang, 1986 [28]) can be analyzed using the GSK approach if the covariates are categorical. Another example is McCullagh's proportional odds model (McCullagh, 1980 [21]).
Since this model takes functions of cumulative probabilities as the response, it cannot easily be rewritten as a log-linear model (Fienberg, 1980, p. 111 [10]). If a software package does not include a procedure, or an option to a procedure as in `PROC LOGISTIC' of the SAS program, that can handle this model, fitting the proportional odds model requires programming one's own fitting routine. However, this model can also easily be fitted using the GSK approach. In addition, the GSK approach can be very helpful when new models are being developed that can be formulated by specifying almost arbitrary functions in the $\pi$'s. With the GSK approach such models can be fitted to the data, and parameter estimates can be obtained to evaluate whether a new model fits the data reasonably well.

The results presented in Section 3.1 suggested that migration patterns exist. A look at Table 1 suggests that the South of the U.S. may be the prime target for relocations. Meiser, von Eye, and Spiel (1997) [22] presented log-linear methods for the analysis of such trends.

One may wonder why, in light of the clear advantages of the GSK approach, the development of log-linear methods progresses at a faster pace than the development of methods based on the GSK approach. One answer was implicitly given above. There is only one full implementation of the GSK approach in general-purpose statistical software packages, namely the SAS CATMOD module. In addition, the manual for this module is perceived by many as hard to comprehend.
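Because a dedicated implementation is not always at hand, it may help to see how little is needed to reproduce a GSK residual statistic outside CATMOD. The following is a minimal sketch, not the authors' program, of the weighted least-squares test of $F(\pi) = 0$ for the three diagonal-parameter symmetry functions of the vision example. The cell counts are an assumption: they are the values commonly reproduced for this data set (rows taken as right eye grade, columns as left eye grade, best to worst) and are not taken from the text; they sum to 7477, which agrees with the total frequency in the CATMOD output. The helper names `idx` and `contrast` are hypothetical.

```python
import numpy as np

# ASSUMED counts for the unaided-vision table (rows: right eye grade,
# columns: left eye grade, ordered best to worst). They are not given in
# the text; their total of 7477 matches the CATMOD output.
counts = np.array([
    [1520,  266,  124,  66],
    [ 234, 1512,  432,  78],
    [ 117,  362, 1772, 205],
    [  36,   82,  179, 492],
], dtype=float)

n = counts.sum()
p = counts.ravel() / n  # row-major order: pi_11, pi_12, ..., pi_44


def idx(i, j):
    """Position of pi_ij (1-based indices) in the row-major vector p."""
    return 4 * (i - 1) + (j - 1)


def contrast(plus, minus):
    """Row of A with +1 for cells in `plus` and -1 for cells in `minus`."""
    row = np.zeros(16)
    for i, j in plus:
        row[idx(i, j)] = 1.0
    for i, j in minus:
        row[idx(i, j)] = -1.0
    return row


# F(pi) = A log(pi), one row per restriction F_1, F_2, F_3.
A = np.vstack([
    contrast([(1, 2), (3, 2)], [(2, 1), (2, 3)]),  # F_1
    contrast([(2, 3), (4, 3)], [(3, 2), (3, 4)]),  # F_2
    contrast([(1, 3), (4, 2)], [(2, 4), (3, 1)]),  # F_3
])

F = A @ np.log(p)                          # sample response functions
V_p = (np.diag(p) - np.outer(p, p)) / n    # multinomial covariance of p
H = A @ np.diag(1.0 / p)                   # Jacobian of A log(p)
V_F = H @ V_p @ H.T                        # delta-method covariance of F

Q = float(F @ np.linalg.solve(V_F, F))     # Wald-type residual statistic
df = A.shape[0]
print(f"Q = {Q:.2f} on {df} degrees of freedom")
```

Under these assumed counts the statistic should land close to the value of 0.50 on 3 degrees of freedom quoted for the CATMOD run. Changing the rows of `A` is all that is needed to test a different set of log-contrast restrictions on the same table.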

As a consequence of this limited software support, there is less incentive to use and develop GSK methods than log-linear models, for which many very user-friendly programs exist (SPSS, SYSTAT). Yet there is a second answer to the question of why development and application seem to emphasize log-linear models. In many textbooks on the analysis of categorical data, the GSK approach is presented only as an aside (for example, Agresti, 1990 [1]), while the main emphasis is given to log-linear models. Other textbooks play down the usefulness of the GSK approach by emphasizing that it is justified on the basis of its large-sample optimality. Christensen (1997, p. 376 [4]) concludes that "there is no apparent reason to use GSK for small samples." One problem with these arguments is that, while they are correct, they apply, at least in part, also to the log-linear approach.

We thus conclude that methods for log-linear modeling have experienced more rapid and more far-reaching development than methods within the GSK approach. There can be no doubt that the GSK approach allows one to test hypotheses that the log-linear approach can test only via awkward detours or not at all. Therefore, we propose that further development of the GSK approach is well worth our while.

References

[1] Agresti, A. (1996). An introduction to categorical data analysis. New York: John Wiley & Sons.
[2] Bhapkar, V. P. (1966). A note on the equivalence of two test criteria for hypotheses in categorical data. Journal of the American Statistical Association, 61.
[3] Bishop, Y. M. M., Fienberg, S. E., & Holland, P. (1975). Discrete multivariate analysis: Theory and practice. Cambridge, MA: MIT Press.
[4] Christensen, R. R. (1997). Log-linear models (2nd ed.). New York: Springer-Verlag.
[5] Clogg, C. C., Eliason, S. R., & Grego, J. M. (1990). Models for the analysis of change in discrete variables. In A. von Eye (Ed.), Statistical methods in longitudinal research, Vol. II. New York: Academic Press.
[6] Deming, W. E., & Stephan, F. F. (1940a). On a least squares adjustment of a sampled frequency table when the expected marginal totals are known. Annals of Mathematical Statistics, 11.
[7] Deming, W. E., & Stephan, F. F. (1940b). The sampling procedure of the 1940 population census. Journal of the American Statistical Association, 35.
[8] Diggle, P. J., Liang, K., & Zeger, S. L. (1994). Analysis of longitudinal data. Oxford: Clarendon Press.
[9] Evers, M., & Namboodiri, N. K. (1979). On the design matrix strategy in the analysis of categorical data. In K. F. Schuessler (Ed.), Sociological methodology 1979. San Francisco: Jossey-Bass.
[10] Fienberg, S. E. (1980). The analysis of cross-classified data (2nd ed.). Cambridge: MIT Press.
[11] Goodman, L. A. (1979). Multiplicative models for square contingency tables with ordered categories. Biometrika, 66.
[12] Goodman, L. A. (1981). Three elementary views of log-linear models for the analysis of cross-classifications having ordered categories. Sociological Methodology.
[13] Grizzle, J. E., Starmer, C. F., & Koch, G. G. (1969). Analysis of categorical data by linear models. Biometrics, 25.
[14] Hagenaars, J. A. (1990). Categorical longitudinal data. Newbury Park, CA: Sage.
[15] Koch, G. G., Amara, I. A., Davis, G. W., & Gillings, D. B. (1982). A review of some statistical methods for covariance analysis of categorical data. Biometrics, 38.

[16] Koch, G. G., Landis, J. R., Freeman, J. L., Freeman, D. H., & Lehnen, R. G. (1977). A general method for the analysis of experiments with repeated measurement of categorical data. Biometrics, 33.
[17] Landis, J. R., & Koch, G. G. (1979). The analysis of categorical data in longitudinal studies of behavioral development. In J. R. Nesselroade & P. B. Baltes (Eds.), Longitudinal methodology in the study of behavior and development. New York: Academic Press.
[18] Landis, J. R., & Miller, M. E. (1988). Some general methods for the analysis of categorical data in longitudinal studies. Statistics in Medicine, 7.
[19] Liang, K. Y., & Zeger, S. L. (1986). Longitudinal data analysis using generalized linear models. Biometrika, 73.
[20] Liang, K. Y., Zeger, S. L., & Qaqish, B. (1992). Multivariate regression analyses for categorical data. Journal of the Royal Statistical Society, B, 54.
[21] McCullagh, P. (1980). Regression models for ordinal data. Journal of the Royal Statistical Society, B, 42.
[22] Meiser, T., von Eye, A., & Spiel, C. (1997). Loglinear symmetry and quasi-symmetry models for the analysis of change. Biometrical Journal, 39(3).
[23] Rindskopf, D. (1992). A general approach to categorical data analysis with missing data, using generalized linear models with composite links. Psychometrika, 57(1).
[24] Sen, P. K., & Singer, J. M. (1993). Large sample methods in statistics. London: Chapman and Hall.
[25] Stokes, M. E., Davis, C. S., & Koch, G. G. (1995). Categorical data analysis using the SAS system. Cary, NC: SAS Institute Inc.
[26] von Eye, A., & Spiel, C. (1996). Standard and non-standard log-linear symmetry models for measuring change in categorical variables. The American Statistician, 50.
[27] Wickens, T. (1989). Multiway contingency tables for the social sciences. Hillsdale, NJ: Lawrence Erlbaum Associates.
[28] Zeger, S. L., & Liang, K. Y. (1986). Longitudinal data analysis for discrete and continuous outcomes. Biometrics, 42.
