Analysis of data in square contingency tables

Iva Pecáková

Let us suppose two dependent samples: the response of the $n$th subject in the second sample relates to the response of the $n$th subject in the first sample. There are two common forms of sample dependency: (1) the same subjects are surveyed at different points in time (before-after studies, including panel studies); (2) different subjects with a natural pairing are surveyed (a husband and his wife, a parent and his child, two people rating the same object, etc.). The first form is often called repeated measures or longitudinal data, the second one matched pairs data.

In such a case, the responses of a categorical variable are summarized by a two-way contingency table in which the row and column classifications have the same categories. Thus, the table is square, $r = c$ ($r$ is the number of rows and $c$ is the number of columns in the table). There are usually large values on the main diagonal of such a table; cell probabilities or associations may exhibit a more or less symmetric pattern about this diagonal. The two marginal distributions may agree (there is marginal homogeneity) or they may differ in some systematic way.

Table 1 The binary variable (r = 2)

                 Occasion 2
Occasion 1       X = 0     X = 1     Σ
X = 0            n_11      n_12      n_1+
X = 1            n_21      n_22      n_2+
Σ                n_+1      n_+2      n

$n_{ij}$ ($i = 1, 2$; $j = 1, 2$) denote frequencies, $p_{ij} = n_{ij}/n$ denote relative frequencies.

If $r = 2$ (Table 1) and the hypothesis

\[ \pi_{1+} = \pi_{+1} \quad (\text{equivalently } \pi_{12} = \pi_{21}) \tag{1} \]

holds (for a 2 x 2 table this is both marginal homogeneity and symmetry), the frequency $n_{12}$ has a binomial distribution with parameters $n^*_{12} + n^*_{21}$ and 0.5. The p-value of the two-sided test is then twice the probability $P[n_{12} \le \min(n^*_{12}, n^*_{21})]$; the asterisk denotes observed frequencies. For large samples, as is known, the statistic

\[ U = \frac{n_{12} - 0.5\,(n_{12} + n_{21})}{0.5\,\sqrt{n_{12} + n_{21}}} = \frac{n_{12} - n_{21}}{\sqrt{n_{12} + n_{21}}} \tag{2} \]

has a standard normal distribution and

\[ U^2 = \frac{(n_{12} - n_{21})^2}{n_{12} + n_{21}} \tag{3} \]

has a chi-square distribution with one degree of freedom (the significance test based on this statistic is known as the McNemar test).
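The two versions of the test for $r = 2$ can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `mcnemar` and the counts 15 and 5 are hypothetical and do not come from the paper, and the exact p-value is capped at 1, a common convention that the text does not spell out.

```python
from math import comb, sqrt

def mcnemar(n12: int, n21: int):
    """Exact and large-sample McNemar tests from the two off-diagonal counts."""
    n = n12 + n21
    if n == 0:
        raise ValueError("at least one off-diagonal observation is needed")
    k = min(n12, n21)
    # Exact two-sided p-value: twice the lower tail of Binomial(n, 0.5), capped at 1.
    lower_tail = sum(comb(n, x) for x in range(k + 1)) / 2 ** n
    p_exact = min(1.0, 2.0 * lower_tail)
    u = (n12 - n21) / sqrt(n)   # statistic (2), approximately N(0, 1)
    return p_exact, u, u * u    # u * u is statistic (3), approximately chi-square, 1 df

# Hypothetical off-diagonal counts, for illustration only.
p_exact, u, u_squared = mcnemar(n12=15, n21=5)
print(p_exact, u, u_squared)
```

Only the off-diagonal counts enter either statistic; the diagonal cells $n_{11}$ and $n_{22}$ carry no information about the hypothesis (1).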

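If this test is significant, we can estimate the true difference between $\pi_{1+}$ and $\pi_{+1}$ as

\[ (p_{1+} - p_{+1}) \pm u_{1-\alpha/2}\, \widehat{SE}(p_{1+} - p_{+1}), \tag{4} \]

where $u_{1-\alpha/2}$ is the quantile of the standard normal distribution and

\[ \widehat{SE}(p_{1+} - p_{+1}) = \sqrt{\left[\, p_{1+}(1 - p_{1+}) + p_{+1}(1 - p_{+1}) - 2\,(p_{11} p_{22} - p_{12} p_{21}) \,\right] / n}. \]

Table 2 The categorical variable (r > 2)

                 Occasion 2
Occasion 1       x_1       ...       x_r       Σ
x_1              n_11      ...       n_1r      n_1+
...              ...       ...       ...       ...
x_r              n_r1      ...       n_rr      n_r+
Σ                n_+1      ...       n_+r      n

If $r > 2$ (Table 2), the hypothesis

\[ \pi_{i+} = \pi_{+i}, \quad i = 1, 2, \dots, r \tag{5} \]

is marginal homogeneity and the hypothesis

\[ \pi_{ij} = \pi_{ji}, \quad i = 1, 2, \dots, r;\ j = 1, 2, \dots, r \tag{6} \]

is symmetry. Symmetry implies marginal homogeneity, but for $r > 2$ marginal homogeneity does not imply symmetry (see the example in Table 3).

Table 3 Marginal homogeneity, not symmetry

         X_1    X_2    X_3    Σ
X_1      20     10     20     50
X_2      30     55      5     90
X_3       0     25     35     60
Σ        50     90     60     200

The saturated loglinear model for such a square contingency table can be written as

\[ \ln m_{ij} = \lambda + \lambda_i + \lambda_j + \lambda_{ij}, \tag{7} \]

where $\lambda_i$ is the row effect, $\lambda_j$ the column effect and

\[ \lambda = \frac{1}{r^2} \sum_i \sum_j \ln m_{ij}, \quad \lambda_i = \frac{1}{r} \sum_j \ln m_{ij} - \lambda, \quad \lambda_j = \frac{1}{r} \sum_i \ln m_{ij} - \lambda, \quad \lambda_{ij} = \ln m_{ij} - \lambda_i - \lambda_j - \lambda. \]

The parameters of this model are linear combinations of the logarithms of the expected frequencies $m_{ij}$, $i = 1, 2, \dots, r$; $j = 1, 2, \dots, r$, and their number is $1 + (r - 1) + (r - 1) + (r - 1)^2 = r^2$ (their identifiability requires the constraints $\sum_i \lambda_i = 0$, $\sum_j \lambda_j = 0$, $\sum_i \lambda_{ij} = \sum_j \lambda_{ij} = 0$). The cell expected frequencies $m_{ij}$ are estimated by $n_{ij}$.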
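The claim attached to Table 3 can be checked directly. The snippet below is a small sketch that treats the counts of Table 3 as the pattern of interest; the variable names are illustrative only.

```python
# Counts of Table 3 (rows X_1..X_3, columns X_1..X_3).
table3 = [
    [20, 10, 20],
    [30, 55, 5],
    [0, 25, 35],
]
r = len(table3)

row_totals = [sum(row) for row in table3]
col_totals = [sum(table3[i][j] for i in range(r)) for j in range(r)]

# Marginal homogeneity (5): every row total equals the matching column total.
marginal_homogeneity = row_totals == col_totals
# Symmetry (6): every cell equals its mirror image about the main diagonal.
symmetry = all(table3[i][j] == table3[j][i] for i in range(r) for j in range(i))

print(row_totals, col_totals)
print("marginal homogeneity:", marginal_homogeneity, "| symmetry:", symmetry)
```

Both margins are (50, 90, 60), so marginal homogeneity holds, while for instance the cells (1, 2) and (2, 1) contain 10 and 30, so symmetry does not.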

When the independence model holds, all the association parameters $\lambda_{ij}$ in (7) are zero. The cell expected frequencies are estimated by $n\, p_{i+}\, p_{+j}$. To test the goodness of fit of this model, the well-known Pearson statistic $X^2$ or the likelihood ratio chi-squared statistic (deviance) $G^2$,

\[ G^2 = 2 \sum_i \sum_j n_{ij} \ln \frac{n_{ij}}{\hat{m}_{ij}}, \tag{8} \]

can be used. The number of degrees of freedom is $(r - 1)^2$. However, square tables for repeated measures or matched pairs data usually have large counts on the main diagonal, and this model is not useful. In this case, the structure of the frequencies off the main diagonal is important. When the row and column responses are independent given that they differ, the variables are quasi-independent. While the independence loglinear model can be written as

\[ \ln m_{ij} = \lambda + \lambda_i + \lambda_j, \tag{9} \]

the quasi-independence loglinear model can be written as

\[ \ln m_{ij} = \lambda + \lambda_i + \lambda_j + \delta_i I_{ij}, \tag{10} \]

where $I_{ij}$ indicates the diagonal elements of the table ($I_{ij} = 1$ for $i = j$ and $I_{ij} = 0$ for $i \ne j$). In this model $\hat{m}_{ii} = n_{ii}$ holds, but the remaining expected frequencies have no direct estimates. To obtain the maximum likelihood estimates, the set of likelihood equations has to be solved. These equations do not have a direct solution and are solved by an iterative algorithm (the Newton-Raphson method, for example). The number of parameters of the quasi-independence model is $1 + 2(r - 1) + r$ and the residual degrees of freedom are then $df = r^2 - [1 + 2(r - 1) + r] = (r - 1)^2 - r$.

For the symmetry model, all $\lambda_{ij} = \lambda_{ji}$ in (7). The parameters $\lambda_i$, $i = 1, 2, \dots, r$, are the same for both classifications (there is marginal homogeneity). The expected frequencies $m_{ij}$ are estimated as $(n_{ij} + n_{ji})/2$ in this case; it follows that $\hat{m}_{ii} = n_{ii}$. The number of parameters of the model is now $1 + (r - 1) + r(r - 1)/2$ and the residual degrees of freedom are $df = r^2 - [1 + (r - 1) + r(r - 1)/2] = r(r - 1)/2$. The Pearson statistic $X^2$ simplifies for this model to the form

\[ X^2 = \sum_{i<j} \frac{(n_{ij} - n_{ji})^2}{n_{ij} + n_{ji}}. \tag{11} \]

For $r = 2$ this is the statistic (3). The symmetry model is often too simple to fit a table, because it imposes identical marginals.

In the quasi-symmetry model, marginal homogeneity no longer holds; the row and column parameters $\lambda_i$, $i = 1, 2, \dots, r$, are not the same. In this model $\hat{m}_{ii} = n_{ii}$ too, but there are no direct estimates of the other expected frequencies. To obtain these estimates, the Newton-Raphson method, iterative proportional fitting or other iterative methods must be used again. This model has the property of symmetric association (symmetry of odds ratios):

\[ \theta_{ij} = \frac{m_{ij}\, m_{rr}}{m_{ir}\, m_{rj}} = \frac{m_{ji}\, m_{rr}}{m_{jr}\, m_{ri}} = \theta_{ji} \quad \text{for all } i \text{ and } j. \tag{12} \]

The number of parameters of this model is $1 + (r - 1) + (r - 1) + r(r - 1)/2$ and the residual degrees of freedom are now $df = r^2 - [1 + 2(r - 1) + r(r - 1)/2] = (r - 1)(r - 2)/2$.

Some loglinear models imply marginal homogeneity. If a table satisfies symmetry, it also satisfies both quasi-symmetry and marginal homogeneity. As shown, for example, in [1], the converse holds too: a table that satisfies both quasi-symmetry and marginal homogeneity is symmetric. Hence, when quasi-symmetry holds, marginal homogeneity is equivalent to symmetry and we can test marginal homogeneity by comparing the goodness-of-fit statistics (deviances) of the symmetry (S) and quasi-symmetry (QS) models:

\[ G^2(S \mid QS) = G^2(S) - G^2(QS). \tag{13} \]

This difference has a chi-squared distribution with $r - 1$ degrees of freedom. Let us recall that the well-known Stuart-Maxwell test can be used to test marginal homogeneity, too. The Stuart-Maxwell statistic

\[ X^2 = d' S^{-1} d, \tag{14} \]

where $d = [d_1, d_2, \dots, d_{r-1}]'$, $d_i = n_{i+} - n_{+i}$, $i = 1, 2, \dots, r - 1$, and $S$ denotes the $(r - 1) \times (r - 1)$ covariance matrix of the elements of $d$, has asymptotically a chi-square distribution with $r - 1$ degrees of freedom. The results of both tests are usually very similar.

The following data (Table 4) were provided by Factum Invenio, s.r.o. The data come from election surveys carried out in June 2003, in April 2004 (shortly before the end of the Špidla cabinet), in June 2005 (after the end of the Gross cabinet) and in April 2006 (shortly before the parliamentary election). All these data files include the same questions: "Which party did you vote for in the election" (the variable is vote) and "Which party would you vote for at the moment" (the variable is preference). Thus, each respondent expresses whether his inclination has changed or not since the last election.

Table 4 Data from election surveys

vote * preference 2003 Crosstabulation
(rows: vote; columns: preference 2003; only the party label "US" is legible in the source, so categories are numbered in table order; blank cells are shown as 0)

vote      1      2      3      4      5      6      Σ
1       107     17      7      3     10     18    162
2         2    149      0      2      6      9    168
3         2      0     80      1      0      4     87
4         2      2      1     50      5      2     62
5         0      3      0      2     27      2     34
6         6     19      5      8     16    144    198
Σ       119    190     93     66     64    179    711

vote * preference 2004 Crosstabulation
(rows: vote; columns: preference 2004; categories numbered in table order; blank cells shown as 0)

vote      1      2      3      4      5      6      Σ
1       102     25     19      6     22     17    191
2         9    190      2      1      6      7    215
3         2      0     82      0      1      5     90
4         2     10      1     37      5      3     58
5         1      2      0      1     27      2     33
6         8     27      5      3     15    166    224
Σ       124    254    109     48     76    200    811

vote * preference 2005 Crosstabulation

vote      1      2      3      4      5      6      Σ
1       111     11     10      3     15     13    163
2         6    187      0      3      5      9    210
3         3      4     88      3      1      3    102
4         2      4      1     49      5      4     65
5         2      6      1      1     24      1     35
6        12     31      3      4     23    187    260
Σ       136    243    103     63     73    217    835

vote * preference 2006 Crosstabulation

vote      1      2      3      4      5      6      Σ
1       157      9      6      0     12      7    191
2         8    174      0      1     11      4    198
3         3      2     87      0      0      8    100
4         3      2      1     51      2      1     60
5         7      2      2      0     32      4     47
6        14     46      5      7     32    132    236
Σ       192    235    101     59     89    156    832

As we could expect, independence is strongly rejected for all four data files (the smallest $X^2$ is 1876, with 25 degrees of freedom). The symmetry model is also unpromising. For example, in 2004 only 9 people changed their inclination from category 2 to category 1, while 25 people did so in the opposite direction. Only 2 people changed their inclination from category 3 to category 1, while 19 people did so in the opposite direction, and so on. The results of this model are contained in Table 5.
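The symmetry fit for 2004 can be retraced with formulas (8) and (11). The sketch below is an illustration, not the author's software: the function name is hypothetical, the counts are the 2004 crosstabulation as read from the source (rows vote, columns preference), and placing the blank cells as zeros is an assumption chosen so that the printed row and column totals are reproduced.

```python
from math import log

# 2004 crosstabulation; zeros stand for cells left blank in the source (assumption).
vote_by_pref_2004 = [
    [102, 25, 19, 6, 22, 17],
    [9, 190, 2, 1, 6, 7],
    [2, 0, 82, 0, 1, 5],
    [2, 10, 1, 37, 5, 3],
    [1, 2, 0, 1, 27, 2],
    [8, 27, 5, 3, 15, 166],
]

def symmetry_fit(table):
    """Pearson X^2 (11), deviance G^2 (8) and df for the symmetry model."""
    r = len(table)
    x2 = g2 = 0.0
    for i in range(r):
        for j in range(i + 1, r):
            a, b = table[i][j], table[j][i]
            if a + b == 0:
                continue  # an empty pair of cells contributes nothing
            m = (a + b) / 2  # fitted value for both cells of the pair
            x2 += (a - b) ** 2 / (a + b)
            # Zero counts contribute 0 to the deviance (limit of n * ln n).
            g2 += 2 * sum(n * log(n / m) for n in (a, b) if n > 0)
    df = r * (r - 1) // 2
    return x2, g2, df

x2, g2, df = symmetry_fit(vote_by_pref_2004)
print(round(x2, 1), round(g2, 1), df)
```

With this reading of the table, the rounded values agree with the 2004 symmetry row of Table 5 ($X^2$ = 83.4, $G^2$ = 95.7, 15 degrees of freedom).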

The majority of voters did not change their preference, so the largest frequencies are always on the main diagonal. This suggests fitting a quasi-independence model, omitting the diagonal. The results are also contained in Table 5. As we can see, for the years 2003 and 2005 this model fits well: it is not possible to demonstrate differences in the pattern of changed preferences among the parties in these years. However, such a difference is demonstrated for the years 2004 and 2006. The quasi-symmetry model fails to fit only for the year 2006. The last test confirms the expected marginal heterogeneity in all files. Table 6 with sign schemas (for quasi-independence in 2004 and 2006, quasi-symmetry in 2006) adds the most interesting results.

Let us remark that the true distribution of the statistics used for testing fit may be far from chi-squared when the expected frequencies are small. Our tables are sparse and the fitted cell counts small, but for loglinear models the expected values refer to marginal totals and the chi-squared approximation is likely to be adequate. (In the next paper we would like to verify our results with exact tests.)

Table 5 Results of the analysis

Model                   Year    X^2     p-value   df    G^2     p-value
Symmetry                2003    51.3    0.00      15    59.2    0.00
                        2004    83.4    0.00      15    95.7    0.00
                        2005    55.6    0.00      15    64.4    0.00
                        2006    83.5    0.00      15    97.8    0.00
Quasi-independence      2003    21.4    0.32      19    25.9    0.13
                        2004    29.7    0.05      19    30.7    0.04
                        2005    27.9    0.09      19    28.1    0.08
                        2006    44.6    0.00      19    47.9    0.00
Quasi-symmetry          2003     4.6    0.92      10     4.4    0.93
                        2004    11.8    0.30      10    12.8    0.24
                        2005    11.9    0.29      10    13.1    0.22
                        2006    29.1    0.00      10    30.6    0.00
Marginal homogeneity    2003     -      -          5    54.8    0.00
                        2004     -      -          5    82.9    0.00
                        2005     -      -          5    51.3    0.00
                        2006     -      -          5    67.2    0.00
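A quasi-independence fit of the kind reported in Table 5 can be sketched with iterative proportional fitting on the off-diagonal cells. This is a minimal illustration under stated assumptions, not necessarily the algorithm used to produce Table 5: the function names and the number of sweeps are hypothetical, the counts are the 2003 crosstabulation as read from the source with blank cells taken as zeros, and the code assumes every row and column has at least one off-diagonal observation.

```python
from math import log

# 2003 crosstabulation; zeros stand for cells left blank in the source (assumption).
vote_by_pref_2003 = [
    [107, 17, 7, 3, 10, 18],
    [2, 149, 0, 2, 6, 9],
    [2, 0, 80, 1, 0, 4],
    [2, 2, 1, 50, 5, 2],
    [0, 3, 0, 2, 27, 2],
    [6, 19, 5, 8, 16, 144],
]

def fit_quasi_independence(table, sweeps=1000):
    """Fitted counts under model (10): IPF off the diagonal, m_ii = n_ii."""
    r = len(table)
    off = [[0 if i == j else table[i][j] for j in range(r)] for i in range(r)]
    row_t = [sum(row) for row in off]
    col_t = [sum(off[i][j] for i in range(r)) for j in range(r)]
    # Start from 1 in every off-diagonal cell; the diagonal stays structurally 0.
    m = [[0.0 if i == j else 1.0 for j in range(r)] for i in range(r)]
    for _ in range(sweeps):
        for i in range(r):  # rescale rows to the observed off-diagonal row totals
            s = sum(m[i])
            m[i] = [v * row_t[i] / s for v in m[i]]
        for j in range(r):  # rescale columns to the observed off-diagonal column totals
            s = sum(m[i][j] for i in range(r))
            for i in range(r):
                m[i][j] *= col_t[j] / s
    for i in range(r):
        m[i][i] = float(table[i][i])  # the diagonal is fitted exactly
    return m

def deviance(table, fitted):
    """G^2 of formula (8); zero counts contribute nothing."""
    return 2.0 * sum(n * log(n / m)
                     for row_n, row_m in zip(table, fitted)
                     for n, m in zip(row_n, row_m) if n > 0)

fitted = fit_quasi_independence(vote_by_pref_2003)
g2 = deviance(vote_by_pref_2003, fitted)
r = len(vote_by_pref_2003)
df = (r - 1) ** 2 - r
print(round(g2, 1), df)
```

The fitted table reproduces the observed diagonal and both sets of margins, which is what the likelihood equations of the quasi-independence model require; the deviance is then referred to a chi-square distribution with $(r - 1)^2 - r$ degrees of freedom.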

Table 6 Sign schemas

References:
[1] Agresti, A.: Categorical Data Analysis, John Wiley & Sons, 1995.
[2] Anděl, J.: Matematická statistika, SNTL, Praha 1978.
[3] Jobson, J. D.: Applied Multivariate Data Analysis, Volume II: Categorical and Multivariate Methods, 1991.
[4] Řeháková, B. - Řehák, J.: Analýza kategorizovaných dat v sociologii, Academia, Praha 1986.
[5] SPSS Manuals, SPSS Inc., 1994-1999.
[6] Simonoff, J. S.: Analyzing Categorical Data, Springer-Verlag Inc., New York 2003.
[7] Stokes, M. E. - Davis, C. S. - Koch, G. G.: Categorical Data Analysis Using the SAS System, SAS Institute Inc., 1995.

Doc. Ing. Iva Pecáková, CSc.
The University of Economics
Faculty of Informatics and Probability
Department of Statistics and Probability
Prague, Czech Republic
e-mail: pecakova@vse.cz