Statistics 3858 : Contingency Tables

1 Introduction

Before proceeding with this topic the student should review generalized likelihood ratios Λ(X) for multinomial distributions, their relation to Pearson's chi-squared statistic, and the theorem giving the chi-square limiting distribution of −2 log(Λ(X)).

Contingency table data are counts of categorical outcomes and have the form

                          columns
   row       1         2       ...       J
    1     n_{1,1}   n_{1,2}    ...    n_{1,J}    n_{1·}
    2     n_{2,1}   n_{2,2}    ...    n_{2,J}    n_{2·}
    .
    I     n_{I,1}   n_{I,2}    ...    n_{I,J}    n_{I·}
          n_{·1}    n_{·2}     ...    n_{·J}

This table has J columns and I rows, and we refer to it as an I by J contingency table. The count for cell (i, j) is n_{i,j}. The row totals, the last column in the table, are not part of the data but are calculated as the row sums over the column categories 1, 2, ..., J and are denoted

   n_{i·} = n_{i,1} + n_{i,2} + · · · + n_{i,J}.

The subscript "i·" is a convenient notation telling us that we are summing over the second index. Similarly the column totals are

   n_{·j} = n_{1,j} + n_{2,j} + · · · + n_{I,j}.

Finally the total number of observations or counts, n_{··}, is the sum of the row totals, or equivalently of the column totals,

   n_{··} = Σ_{i=1}^{I} Σ_{j=1}^{J} n_{i,j} = Σ_{i=1}^{I} n_{i·} = Σ_{j=1}^{J} n_{·j}.
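
The marginal-total notation translates directly into array operations; here is a minimal sketch with made-up counts (not data from the text).

```python
# A minimal sketch of the marginal-total notation; the counts are made up.
import numpy as np

n = np.array([[12,  7,  9],
              [ 5, 11,  8]])        # n_{i,j} for an I = 2 by J = 3 table

row_totals = n.sum(axis=1)          # n_{i.} = sum over j
col_totals = n.sum(axis=0)          # n_{.j} = sum over i
grand_total = n.sum()               # n_{..}

print(row_totals, col_totals, grand_total)
```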

A random contingency table will have random counts N_{i,j} in the various categories.

2 Data Generating Mechanisms

There are different sampling mechanisms that can generate random contingency tables. In our statistical inference language we have to recognize that these correspond to different statistical models, and hence different parameter spaces. When we study hypothesis testing in these models, the hypotheses depend very much on the statistical model and parameter space. There are two main mechanisms for generating random data, that is, two main statistical models.

Mechanism 1

Data are collected from J experiments. Experiment j consists of n_{·j} trials, and each trial falls into one of the categories 1, 2, ..., I. This is the type of data obtained from the Jane Austen example. For experiment j the data are

   N_{·j} = (N_{1,j}, N_{2,j}, ..., N_{I,j}),

which has a multinomial distribution with n_{·j} trials and parameter

   p_{·j} = (π_{1,j}, π_{2,j}, ..., π_{I,j}),

which belongs to the simplex of order I, say S_I. For the whole experiment the parameter space is

   Θ = S_I × · · · × S_I   (J factors).

This is a parameter space of dimension J(I − 1).

Mechanism 2

Each trial has an outcome in one of IJ categories, that is, all possible (i, j) categories. The experiment consists of n = n_{··} independent multinomial single trials. The random data are

   N = (N_{i,j})_{i=1,...,I, j=1,...,J},

which has a multinomial distribution with parameters n and

   p = (p_{1,1}, ..., p_{1,J}, p_{2,1}, ..., p_{I,J}),

which belongs to the simplex of order IJ, that is, Θ = S_{IJ}. This parameter space has dimension IJ − 1.
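
The difference between the two mechanisms is easiest to see in how one would simulate a random table. The sketch below uses made-up probabilities and sample sizes (nothing here is from the text): under mechanism 1 the column totals are fixed by the experimenter, while under mechanism 2 only the grand total is fixed.

```python
# A minimal sketch contrasting the two sampling mechanisms; all probabilities,
# sample sizes, and dimensions below are made-up illustration values.
import numpy as np

rng = np.random.default_rng(0)
I, J = 3, 2

# Mechanism 1: J independent multinomial experiments; column totals are fixed.
p_cols = np.array([[0.2, 0.5, 0.3],      # p_{.1}, a point in the simplex S_I
                   [0.4, 0.4, 0.2]])     # p_{.2}
n_cols = [50, 80]                        # n_{.j}, chosen by the experimenter
table1 = np.column_stack([rng.multinomial(n_cols[j], p_cols[j]) for j in range(J)])

# Mechanism 2: one multinomial with n = n_{..} trials over all I*J cells.
p_cells = np.array([[0.10, 0.15],
                    [0.25, 0.20],
                    [0.15, 0.15]])       # a point in S_{IJ}
n_total = 130
table2 = rng.multinomial(n_total, p_cells.ravel()).reshape(I, J)

print(table1.sum(axis=0))  # equals n_cols: column totals fixed under mechanism 1
print(table2.sum())        # equals n_total; row and column totals are random
```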

The notation a = (a_{i,j})_{i=1,...,I, j=1,...,J} for a vector is convenient shorthand for the more cumbersome notation

   a = (a_{1,1}, ..., a_{1,J}, a_{2,1}, ..., a_{I,J})

and means that we write the components in this specific order. Actually the specific order does not matter, as long as we are consistent, so in particular we use the ordering given here. This notation is used in the remainder of the handout where appropriate.

Notice that the two sampling mechanisms and their models are quite different. Under mechanism 1 the numbers n_{·j} are typically not random; they are set by the mechanism generating the data, i.e. the experimenter, and are the numbers of trials for the J experiments. Under mechanism 1 the N_{i,j} are random, and for a given j we have Σ_{i=1}^{I} N_{i,j} = n_{·j}, due to a basic property of the multinomial distribution. Under mechanism 2 the column totals N_{·j} are random, and in fact (N_{·1}, ..., N_{·J}) has a multinomial distribution with probability vector

   π_Col = (p_{·1}, p_{·2}, ..., p_{·J}).

Similarly under mechanism 2 the row totals are random, with probability vector

   π_Row = (p_{1·}, p_{2·}, ..., p_{I·}).

For mechanism 1 it makes sense to consider a hypothesis such as: the J probability vectors p_{·j} are all equal. This question does not make sense for mechanism 2. For mechanism 2 we can consider the two marginal distributions, for Row and for Column, and the two marginal random vectors corresponding to Row and Column. Each individual trial falls into one row and one column, and so may be thought of as a bivariate random variable (R, C). We can then consider the hypothesis that the two random variables R and C are independent. This question is sensible for mechanism 2, but not for mechanism 1. These are not the only types of hypotheses that one might test under mechanisms 1 and 2, but they are the most basic and common null hypotheses of interest. We will not study other hypotheses for contingency tables in this course.

Mechanism 1 is well suited to studying categorical data from J different experiments; the column totals n_{·j} are typically not random. Mechanism 2 is well suited to studying an experiment in which each individual provides categorical data on two variables, falling into I and J categories respectively. For example, we may take measurements on patients with a particular type of disease and study two physiological variables, such as blood type and diet type; or we may take those given some treatment (say a cancer treatment) and record tumor size (small, medium or large) after 1 month together with gender (male or female). It is interesting to know whether the pair of variables is independent. If so, then knowing gender will not help to predict whether the tumor is large or small after treatment. In an insurance setting one may look at some classes of accidents and consider similar questions. Contingency tables may have more than 2 dimensions, but we only consider those of dimension 2, or sometimes 3.

The method used in the hypothesis testing is the GLR, the generalized likelihood ratio. Recall that in the multinomial case −2 log(Λ(N)) is approximately equal to Pearson's chi-square statistic; this was obtained by a Taylor's formula approximation. Also, −2 log(Λ(N)) converges in distribution, as the sample size n → ∞ and under the null hypothesis assumption, to a χ²(df) distribution with the appropriate degrees of freedom df. The student should review how df is calculated. For sampling mechanism 1 we also need to think about what it means for the sample size to tend to infinity. Students should also recall one of the calculations about the GLR and its relation to Pearson's chi-square statistic. Specifically we found, with the notation changed so that our cells are indexed by (i, j) instead of j,

   −2 log(Λ(N)) = −2 Σ_{i=1}^{I} Σ_{j=1}^{J} N_{i,j} log( p̂^{(0)}_{i,j} / p̂_{i,j} )
                ≈ X² = Σ_{i=1}^{I} Σ_{j=1}^{J} (N_{i,j} − Ê_{i,j})² / Ê_{i,j},

where p̂^{(0)}_{i,j} is the MLE calculated under the null hypothesis assumption, which differs between mechanisms 1 and 2, and Ê_{i,j} is the expected count for cell (i, j), which also differs between mechanisms 1 and 2. For mechanism 1, N_{·j} is typically not random (so we should actually write n_{·j} in place of N_{·j}), but it is random for mechanism 2. It is an algebraic artifact that the final formula for Pearson's chi-square X² happens to be the same algebraic formula under both mechanisms 1 and 2. This happens because of the different expressions for the MLE under the null hypothesis and an algebraic coincidence, or cancellation.
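
The relation between −2 log(Λ) and X² can be checked numerically. The sketch below uses a single multinomial with a fully specified null hypothesis and made-up counts, so the degrees of freedom are simply the number of cells minus one; the contingency-table versions of p̂^{(0)} and Ê, and their degrees of freedom, are derived in the following sections.

```python
# A minimal sketch comparing -2 log(Lambda) with Pearson's X^2 for one multinomial
# with a fully specified null; the counts and null probabilities are made up.
import numpy as np
from scipy.stats import chi2

obs = np.array([18, 55, 27])            # observed counts, n = 100
p0  = np.array([0.25, 0.50, 0.25])      # cell probabilities under H_0
n   = obs.sum()

p_hat = obs / n                         # unrestricted MLE of the cell probabilities
E     = n * p0                          # expected counts under H_0

G2 = -2 * np.sum(obs * np.log(p0 / p_hat))   # -2 log(Lambda)
X2 = np.sum((obs - E) ** 2 / E)              # Pearson's chi-square

df = len(obs) - 1                       # simple null: dim(Theta) - dim(Theta_0) = 2
print(G2, X2)                           # close for moderate n, as the Taylor step suggests
print(chi2.sf(X2, df))                  # approximate p-value
```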

3 Test of Homogeneity or Equal Population Probability Vectors

In this problem the data are obtained under mechanism 1. There are J populations with probability vectors

   p_{·j} = (π_{1,j}, π_{2,j}, ..., π_{I,j}),

each belonging to the simplex of order I, say S_I. For the experiment the parameter space is Θ = S_I × · · · × S_I. The null hypothesis of interest is

   H_0 : the p_{·j} are all equal (= π),   j = 1, ..., J.

The vector π is the common value of these vectors. This null hypothesis is also referred to as homogeneity, or equality, of these probability vectors. The notation means that for these vectors of length I the corresponding components are equal, that is,

   π_{i,1} = π_{i,2} = · · · = π_{i,J}   for each component index i = 1, 2, ..., I.

The vector notation is clearer; the Rice textbook uses the component notation. We might also write p^{(j)} = (π_{1,j}, π_{2,j}, ..., π_{I,j}) or π^{(j)} = (π_{1,j}, π_{2,j}, ..., π_{I,j}) as other ways of writing p_{·j}.

The set Θ_0 is

   Θ_0 = { (p_{·1}, ..., p_{·J}) ∈ Θ : p_{·j} = p = (π_1, ..., π_I) for some p ∈ S_I }.

We have dim(Θ_0) = I − 1. The alternative is the general alternative

   H_A : (p_{·1}, ..., p_{·J}) ∈ Θ_0^c.

Thus the limiting χ² distribution has degrees of freedom

   df = J(I − 1) − (I − 1) = (J − 1)(I − 1).

The student should review the multinomial MLE handout. Following that method the student should verify that the argmax for the denominator of the GLR is

   π̂_{i,j} = n_{i,j} / n_{·j}.

Next we need to calculate p̂^{(0)}, the MLE under the null hypothesis. The MLE in this case is

   π̂^{(0)} = (π̂_1^{(0)}, ..., π̂_I^{(0)}) = ( n_{1·}/n_{··}, ..., n_{I·}/n_{··} ).

For convenience we use the superscript (0) to denote the MLE under the null hypothesis assumption. For the student: write out the log likelihood under the null hypothesis and derive this MLE.

We now calculate Pearson's chi-square formula. Notice that the estimated expected count for cell (i, j) is

   Ê_{i,j} = n_{·j} π̂_i^{(0)} = n_{i·} n_{·j} / n_{··}.

Recall that under mechanism 1 the random variables in this experiment are the counts N_{i,j}, whereas the number of trials in experiment j, n_{·j}, is fixed. Pearson's chi-square is then

   X² = Σ_{i=1}^{I} Σ_{j=1}^{J} (O_{i,j} − Ê_{i,j})² / Ê_{i,j} = Σ_{i=1}^{I} Σ_{j=1}^{J} (N_{i,j} − N_{i·} n_{·j}/n_{··})² / (N_{i·} n_{·j}/n_{··}).

The degrees of freedom for the limiting chi-square distribution are J(I − 1) − (I − 1) = (J − 1)(I − 1). Thus the Pearson chi-square r.v. satisfies X² → χ²((J − 1)(I − 1)) in distribution as n_{··} → ∞. In terms of the observed data, that is the numbers n_{i,j} in the contingency table,

   X² = Σ_{i=1}^{I} Σ_{j=1}^{J} (n_{i,j} − n_{i·} n_{·j}/n_{··})² / (n_{i·} n_{·j}/n_{··}),

which is exactly the same formula that we obtain below in mechanism 2 to test independence of rows and columns.

See the Jane Austen examples using the data from the text.
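
As a numerical sketch of this test (the 3 by 2 table below is made up and is not the Jane Austen data), the expected counts n_{i·} n_{·j}/n_{··} and Pearson's X² can be computed directly and referred to the χ²((J − 1)(I − 1)) limit.

```python
# A minimal sketch of the homogeneity test; the 3 by 2 table of counts is made up.
import numpy as np
from scipy.stats import chi2

n = np.array([[20, 30],
              [35, 25],
              [45, 45]])                  # n_{i,j}, I = 3 rows, J = 2 columns

row = n.sum(axis=1)                       # n_{i.}
col = n.sum(axis=0)                       # n_{.j}, the fixed numbers of trials
tot = n.sum()                             # n_{..}

E  = np.outer(row, col) / tot             # expected counts n_{i.} n_{.j} / n_{..}
X2 = np.sum((n - E) ** 2 / E)             # Pearson's chi-square
df = (n.shape[0] - 1) * (n.shape[1] - 1)  # (I - 1)(J - 1) = 2
p  = chi2.sf(X2, df)                      # approximate p-value from the chi^2 limit

print(X2, df, p)
```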

4 Test of Independence for Contingency Table Data

For this problem the experiment has data collected by mechanism 2, that is, each random variable is an observation from a multinomial trial with IJ categories (or classes). Here the parameter space is Θ = S_{IJ}.

Since each random observation falls into a given row and column, we may also think of it as a bivariate random variable. That is, an observable random variable, say Y, is also Y = (R, C), where R and C are the random row and column for observation Y. The null hypothesis is that the random variables R and C are independent. If the marginal distributions of these random variables are

   π_Row = (π_{1·}, π_{2·}, ..., π_{I·})
   π_Col = (π_{·1}, π_{·2}, ..., π_{·J}),

then the null hypothesis is that for each i, j

   p_{i,j} = π_{i·} π_{·j}.

The null hypothesis does not specify these marginal probability vectors, so this is a composite null hypothesis. Formally

   H_0 : p = (p_{i,j})_{i=1,...,I, j=1,...,J} = (π_{i·} π_{·j})_{i=1,...,I, j=1,...,J} for some π_Row ∈ S_I, π_Col ∈ S_J,

or equivalently H_0 : p ∈ Θ_0, where

   Θ_0 = { (p_{i,j})_{i=1,...,I, j=1,...,J} : p_{i,j} = π_{i·} π_{·j} for some π_Row ∈ S_I, π_Col ∈ S_J }.

The dimension of Θ_0 is (I − 1) + (J − 1).

MLE under the model and null hypothesis. In this case p_{i,j} = π_{i·} π_{·j} and the log likelihood is

   log L(p) = Σ_{i=1}^{I} Σ_{j=1}^{J} n_{i,j} ( log(π_{i·}) + log(π_{·j}) ).

It is easiest to maximize using the Lagrange multiplier method, since there are two linear constraints. The student should find the MLE using this method and show that we obtain

   π̂_{i·} = n_{i·}/n_{··},   π̂_{·j} = n_{·j}/n_{··}.

Thus under the null hypothesis assumption we have

   Ê_{i,j} = n_{··} π̂_{i·} π̂_{·j} = n_{i·} n_{·j} / n_{··}.

Completing the calculation, we find that Pearson's chi-square formula (in terms of the r.v. N) is

   X² = Σ_{i=1}^{I} Σ_{j=1}^{J} (O_{i,j} − Ê_{i,j})² / Ê_{i,j} = Σ_{i=1}^{I} Σ_{j=1}^{J} (N_{i,j} − N_{i·} N_{·j}/n_{··})² / (N_{i·} N_{·j}/n_{··}).

The degrees of freedom are IJ − 1 − (I − 1) − (J − 1) = IJ − I − J + 1 = (I − 1)(J − 1), and by Theorem 9.4A, X² → χ²((I − 1)(J − 1)) in distribution as n_{··} → ∞. In terms of the observed data, that is the numbers n_{i,j} in the contingency table,

   X² = Σ_{i=1}^{I} Σ_{j=1}^{J} (n_{i,j} − n_{i·} n_{·j}/n_{··})² / (n_{i·} n_{·j}/n_{··}),

which is exactly the same formula that we obtained above in mechanism 1 to test equality of the J population probability vectors (a short computational sketch is given below, just before Section 5.1).

5 Other Hypotheses and Contingency Tables

As the Rice text mentions, there are other null hypotheses of interest. These are often special cases of interest in biostatistics or sometimes the social sciences. They also apply to insurance data. In these various applications the sampling mechanism is mechanism 2, and the applications are often for a 2 by 2 contingency table. Since the sampling mechanism is mechanism 2, the data are of the type (R, C); that is, each individual sampled corresponds to a bivariate observation indicating the row and column random variables.
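
Before turning to these special cases, here is a minimal sketch of the independence test of Section 4 on a made-up 3 by 4 table, using scipy.stats.chi2_contingency; it reproduces the expected counts n_{i·} n_{·j}/n_{··}, Pearson's X², and the (I − 1)(J − 1) degrees of freedom derived above, and the lambda_="log-likelihood" option returns the −2 log(Λ) form of the statistic for comparison.

```python
# A minimal sketch of the independence test; the 3 by 4 table below is made up.
import numpy as np
from scipy.stats import chi2_contingency

n = np.array([[22, 18, 25, 15],
              [10, 14, 12, 24],
              [18, 12, 13, 17]])

# Pearson's X^2 with expected counts n_{i.} n_{.j} / n_{..} and df = (I-1)(J-1).
X2, p_X2, dof, expected = chi2_contingency(n, correction=False)

# The generalized likelihood ratio statistic -2 log(Lambda) for the same hypothesis.
G2, p_G2, _, _ = chi2_contingency(n, correction=False, lambda_="log-likelihood")

print(dof)              # (I - 1)(J - 1) = 6 here
print(X2, p_X2)
print(G2, p_G2)         # close to X2, as the Taylor approximation suggests
```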

5.1 Odds Ratio

Rice uses the notation of rows and columns being numbered 0 and 1. The null hypothesis is

   H_0 : Odds(C = 1 | R = 0) = Odds(C = 1 | R = 1),

where for an event A we define

   Odds(A) = P(A)/P(A^c) = P(A)/(1 − P(A)).

Therefore

   Odds(C = 1 | R = 0) = P(C = 1 | R = 0) / (1 − P(C = 1 | R = 0)) = π_{01}/π_{00}

and

   Odds(C = 1 | R = 1) = P(C = 1 | R = 1) / (1 − P(C = 1 | R = 1)) = π_{11}/π_{10}.

Thus H_0 may be written as

   H_0 : π_{01}/π_{00} = π_{11}/π_{10}.

There are different methods by which the data may be obtained, the so-called retrospective and prospective methods. We will not discuss this method further in the course (a short numerical sketch of the sample odds ratio is given at the end of this handout).

5.2 Matched Pairs

This method is also used in biostatistics. It is specialized to this type of data problem and can be used in other areas such as the social sciences. The distinction from the general mechanism 2 lies in the population from which the experimenter draws the sample. The data are drawn from a population of pairs of individuals (for example patients, or siblings), each pair being one sampling unit. A given sampling unit then has recorded random observations (R, C). The null hypothesis of interest is that the marginal distribution of R is the same as the marginal distribution of C. Since there are two rows and two columns, R and C can take the values 1 or 2. The null hypothesis is then

   H_0 : P(R = 2) = P(C = 2).

Further simplification can be made. We will not discuss this method further in the course.
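
The sample version of the odds ratio in Section 5.1, together with a standard large-sample (Wald) interval for its logarithm, can be computed as in the sketch below; the 2 by 2 counts are made up.

```python
# A minimal sketch of the sample odds ratio for Section 5.1; the counts are made up.
# Rows and columns are labelled 0 and 1 as in Rice, so n[r, c] counts (R = r, C = c).
import numpy as np

n = np.array([[30, 20],    # R = 0: C = 0, C = 1
              [15, 35]])   # R = 1: C = 0, C = 1

odds_R0 = n[0, 1] / n[0, 0]          # estimates pi_01 / pi_00
odds_R1 = n[1, 1] / n[1, 0]          # estimates pi_11 / pi_10
or_hat  = odds_R0 / odds_R1          # sample odds ratio; H_0 corresponds to a ratio of 1

# Standard large-sample approximation: the standard error of log(or_hat) is
# the square root of the sum of reciprocal cell counts.
se_log_or = np.sqrt((1.0 / n).sum())
ci = np.exp(np.log(or_hat) + np.array([-1.96, 1.96]) * se_log_or)

print(or_hat, ci)        # H_0 looks implausible if the interval excludes 1
```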