Loglinear Models

Loglinear models describe association and interaction patterns among categorical variables. They are commonly used to model cell counts in contingency tables. A loglinear model specifies how the expected count in a cell depends on the levels of the categorical variables that define that cell, and this specification reflects the association and interaction structure among the variables. We begin with the simplest case, that of two-way tables.

Loglinear Models for Two-way Tables

Consider an $I \times J$ table that cross-classifies $n$ subjects on two categorical variables. The cell counts $n_{ij}$ follow a multinomial distribution with $IJ$ categories. The probabilities $\pi_{ij}$ of this multinomial form the joint distribution of the two categorical variables. The variables are statistically independent when

$$\pi_{ij} = \pi_{i+}\,\pi_{+j} \quad \text{for all } i \text{ and } j.$$

The expression for the expected frequencies, $m_{ij} = n\,\pi_{ij}$, then simplifies to

$$m_{ij} = n\,\pi_{i+}\,\pi_{+j} \quad \text{for all } i \text{ and } j.$$

We construct loglinear models using $m_{ij}$ instead of $\pi_{ij}$ so that they also apply under the Poisson sampling model. On a logarithmic scale, independence has the additive form

$$\log m_{ij} = \log n + \log \pi_{i+} + \log \pi_{+j}.$$

Denote the row variable by $X$ and the column variable by $Y$. We can then rewrite the above expression as

$$\log m_{ij} = \mu + \lambda_i^X + \lambda_j^Y,$$

where

$$\lambda_i^X = \log \pi_{i+} - \frac{1}{I}\sum_{h=1}^{I} \log \pi_{h+},$$
$$\lambda_j^Y = \log \pi_{+j} - \frac{1}{J}\sum_{h=1}^{J} \log \pi_{+h},$$

and

$$\mu = \log n + \frac{1}{I}\sum_{h=1}^{I} \log \pi_{h+} + \frac{1}{J}\sum_{h=1}^{J} \log \pi_{+h}.$$

This model is called the loglinear model of independence for two-way contingency tables.

NOTE:

1. The ANOVA two-way design is $E(Y_{ijk}) = \mu + \alpha_i + \beta_j$, where $\sum_i \alpha_i = \sum_j \beta_j = 0$.
2. The parameters $\{\lambda_i^X\}$ and $\{\lambda_j^Y\}$ satisfy $\sum_i \lambda_i^X = \sum_j \lambda_j^Y = 0$.

Zero-sum constraints like these are often used in experimental design to make model parameters identifiable. (Other parameter definitions are possible.)

In the above model, the log expected frequency for cell $(i,j)$ is an additive function of a row effect $\lambda_i^X$ and a column effect $\lambda_j^Y$. The parameter $\lambda_i^X$ represents the effect of classification in row $i$ for variable $X$. The larger the value of $\lambda_i^X$, the larger each expected frequency in row $i$ of the table. When $\lambda_h^X = \lambda_l^X$, each expected frequency in row $h$ equals the corresponding expected frequency in row $l$. Similarly, the parameter $\lambda_j^Y$ represents the effect of classification in column $j$ for variable $Y$.

The null hypothesis that this loglinear model holds is equivalent to the null hypothesis of independence between the two categorical variables $X$ and $Y$. The fitted values that satisfy the model are then

$$\hat m_{ij} = \frac{n_{i+}\,n_{+j}}{n},$$
which are the same as the estimated expected frequencies for the test of independence we saw earlier. Chi-square tests of independence using $X^2$ and $G^2$ are therefore also goodness-of-fit tests of this loglinear model.

Once we have fitted the model, we turn to interpreting it. To understand the interpretation of the parameters in the model of independence, consider a contingency table that has only 2 columns, i.e. an $I \times 2$ table. For the $i$th row, the log odds of being in column 1 instead of column 2 is

$$\log\frac{\pi_{i1}}{\pi_{i2}} = \log\frac{m_{i1}}{m_{i2}} = \log m_{i1} - \log m_{i2} = (\mu + \lambda_i^X + \lambda_1^Y) - (\mu + \lambda_i^X + \lambda_2^Y) = \lambda_1^Y - \lambda_2^Y = 2\lambda_1^Y,$$

using the fact that $\lambda_1^Y + \lambda_2^Y = 0$. For each row, the odds of response in column 1 instead of column 2 is therefore

$$e^{2\lambda_1^Y}.$$

Because this quantity does not depend on $i$, the probability of classification in a particular column is the same for all rows.

Note: For the special case of a $2 \times 2$ table, the log odds ratio is

$$\log\theta = \log\frac{m_{11}\,m_{22}}{m_{12}\,m_{21}} = \log m_{11} + \log m_{22} - \log m_{12} - \log m_{21},$$

and using our model,

$$\log\theta = (\mu + \lambda_1^X + \lambda_1^Y) + (\mu + \lambda_2^X + \lambda_2^Y) - (\mu + \lambda_1^X + \lambda_2^Y) - (\mu + \lambda_2^X + \lambda_1^Y) = 0,$$

so that $\theta = 1$ under the null hypothesis that our model holds. This agrees with the fact that the odds ratio equals 1 when the two variables are independent.
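The zero-sum parametrization and the constant-odds result above can be checked numerically. The following Python sketch is illustrative only: the sample size and marginal probabilities are hypothetical values chosen for the check (they do not come from these notes), and the helper name `independence_parameters` is likewise an assumption. It builds $\mu$, $\lambda_i^X$ and $\lambda_j^Y$ from the marginal probabilities of an $I \times 2$ table, rebuilds the expected frequencies from the loglinear form, and computes the row-wise odds of column 1 versus column 2.

```python
import math

# Hypothetical inputs for illustration (not taken from the notes)
n = 1000
row_probs = [0.2, 0.3, 0.5]   # pi_{i+}, so I = 3 rows
col_probs = [0.6, 0.4]        # pi_{+j}, so J = 2 columns


def independence_parameters(n, row_probs, col_probs):
    """Return (mu, lambda_X, lambda_Y) under the zero-sum constraints."""
    log_rows = [math.log(p) for p in row_probs]
    log_cols = [math.log(p) for p in col_probs]
    mean_row = sum(log_rows) / len(log_rows)
    mean_col = sum(log_cols) / len(log_cols)
    lam_x = [v - mean_row for v in log_rows]   # row effects
    lam_y = [v - mean_col for v in log_cols]   # column effects
    mu = math.log(n) + mean_row + mean_col     # overall level
    return mu, lam_x, lam_y


mu, lam_x, lam_y = independence_parameters(n, row_probs, col_probs)

# Expected frequencies rebuilt from log m_ij = mu + lambda_i^X + lambda_j^Y
m = [[math.exp(mu + lx + ly) for ly in lam_y] for lx in lam_x]

# Odds of column 1 versus column 2 in each row
row_odds = [row[0] / row[1] for row in m]

print("sum of row effects:   ", sum(lam_x))
print("sum of column effects:", sum(lam_y))
print("row-wise odds:        ", row_odds)
print("exp(2 * lambda_1^Y):  ", math.exp(2 * lam_y[0]))
```

Under these assumptions the two sums should be zero up to floating-point error, each rebuilt $m_{ij}$ should match $n\,\pi_{i+}\,\pi_{+j}$, and every row should show the same odds, equal to $e^{2\lambda_1^Y}$.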
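Example: The following $2 \times 2$ table classifies $n = 3566$ individuals according to smoking status and sleep problems:

                          Sleep Problems
                      Yes             No              Total
    Smoking   Yes     n_11 = 346      n_12 = 1198     n_1+ = 1544
    Status    No      n_21 = 320      n_22 = 1702     n_2+ = 2022
    Total             n_+1 = 666      n_+2 = 2900     n = 3566

Are the variables Smoking Status and Sleep Problems independent? We approach this question by fitting the loglinear model of independence,

$$\log m_{ij} = \mu + \lambda_i^X + \lambda_j^Y,$$

to these data.

Under the null hypothesis of independence, the M.L.E.s of $\pi_{i+}$ and $\pi_{+j}$ are

$$\hat\pi_{i+} = \frac{n_{i+}}{n} = p_{i+} \quad \text{and} \quad \hat\pi_{+j} = \frac{n_{+j}}{n} = p_{+j}.$$

Thus we determine estimates of $\mu$, $\lambda_i^X$, and $\lambda_j^Y$ using

$$\hat\lambda_i^X = \log p_{i+} - \frac{1}{I}\sum_{h=1}^{I} \log p_{h+},$$

$$\hat\mu = \log n + \frac{1}{I}\sum_{h=1}^{I} \log p_{h+} + \frac{1}{J}\sum_{h=1}^{J} \log p_{+h},$$

$$\hat\lambda_j^Y = \log p_{+j} - \frac{1}{J}\sum_{h=1}^{J} \log p_{+h},$$

which yields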
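$$\hat\lambda_1^X = -0.1349, \quad \hat\lambda_2^X = 0.1349, \quad \hat\lambda_1^Y = -0.7356, \quad \hat\lambda_2^Y = 0.7356,$$

and

$$\hat\mu = 6.5346.$$

Using these estimates, we obtain estimates of $\log m_{ij}$ under the model of independence:

$$\log \hat m_{11} = 6.5346 - 0.1349 - 0.7356 = 5.6641$$
$$\log \hat m_{12} = 6.5346 - 0.1349 + 0.7356 = 7.1353$$
$$\log \hat m_{21} = 6.5346 + 0.1349 - 0.7356 = 5.9339$$
$$\log \hat m_{22} = 6.5346 + 0.1349 + 0.7356 = 7.4051$$

Thus

$$\hat m_{11} = 288.36, \quad \hat m_{12} = 1255.64, \quad \hat m_{21} = 377.64, \quad \hat m_{22} = 1644.36.$$

This results in Pearson's

$$X^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{(n_{ij} - \hat m_{ij})^2}{\hat m_{ij}} = 24.98$$

and the likelihood ratio test statistic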
$$G^2 = 2\sum_{i=1}^{I}\sum_{j=1}^{J} n_{ij} \log\frac{n_{ij}}{\hat m_{ij}} = 24.79.$$

The degrees of freedom associated with both of these statistics in the loglinear model of independence equal the number of cells in the table minus the number of independent parameters in the model. Thus, for an $I \times J$ table,

$$df = IJ - [1 + (I - 1) + (J - 1)] = (I - 1)(J - 1).$$

Our $X^2$ and $G^2$ each have $df = 1$, with an associated p-value of approximately 0 (below 0.0001). Thus there is strong evidence that the loglinear model of independence is not appropriate for these data.

SAS code:

    /* One record per cell: smoking status, sleep problems, cell count */
    data smoking;
      input smoke $ Slpprob $ count;
      cards;
    Yes Yes 346
    No Yes 320
    Yes No 1198
    No No 1702
    ;
    /* Fit the loglinear model of independence (main effects only) */
    proc catmod order=data;
      model smoke*slpprob=_response_ / covb pred=freq;
      loglin smoke Slpprob;
      weight count;
    run;
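As a cross-check on the hand calculations in the smoking and sleep example, the arithmetic can be reproduced in a few lines of Python. This is a minimal sketch using only the standard library, not a substitute for the SAS fit: the variable names are illustrative assumptions, and the p-value line relies on the identity $P(\chi^2_1 > x) = \operatorname{erfc}(\sqrt{x/2})$, which holds only for 1 degree of freedom.

```python
import math

# Observed counts from the example:
# rows = smoking status (Yes, No), columns = sleep problems (Yes, No)
observed = [[346, 1198], [320, 1702]]

n = sum(sum(row) for row in observed)
row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]

# Fitted values under independence: m_ij = n_i+ * n_+j / n
fitted = [[r * c / n for c in col_totals] for r in row_totals]

# Zero-sum parameter estimates built from the sample marginal proportions
log_p_rows = [math.log(r / n) for r in row_totals]
log_p_cols = [math.log(c / n) for c in col_totals]
mean_row = sum(log_p_rows) / len(log_p_rows)
mean_col = sum(log_p_cols) / len(log_p_cols)
lam_x = [v - mean_row for v in log_p_rows]
lam_y = [v - mean_col for v in log_p_cols]
mu = math.log(n) + mean_row + mean_col

# Pair each observed count with its fitted value
cells = [(o, e) for obs_row, fit_row in zip(observed, fitted)
         for o, e in zip(obs_row, fit_row)]

pearson_x2 = sum((o - e) ** 2 / e for o, e in cells)
g2 = 2 * sum(o * math.log(o / e) for o, e in cells)
df = (len(row_totals) - 1) * (len(col_totals) - 1)

# Valid for df = 1 only: upper tail of a chi-square with 1 degree of freedom
p_value_x2 = math.erfc(math.sqrt(pearson_x2 / 2))
p_value_g2 = math.erfc(math.sqrt(g2 / 2))

print("fitted values:", fitted)
print("mu, lambda_X, lambda_Y:", mu, lam_x, lam_y)
print("X^2, G^2, df:", pearson_x2, g2, df)
print("p-values:", p_value_x2, p_value_g2)
```

The printed quantities should agree with the values in the example up to rounding in the last reported digit. For tables with more than one degree of freedom, the `erfc` shortcut does not apply and a general chi-square tail function would be needed instead.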