Log-linear Models for Contingency Tables


Statistics 149, Spring 2006
Copyright 2006 by Mark E. Irwin

Log-linear Models for Two-way Contingency Tables

Example: Business Administration Majors and Gender

A study of the career plans of young men and women sent questionnaires to all 722 members of the senior class in the College of Business Administration at the University of Illinois. One question asked which major within the business program the student had chosen.

Major            Women   Men
Accounting          68    56
Administration      91    40
Economics            5     6
Finance             61    59
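A sketch of how these data might be entered in R for the analyses that follow; the variable and level names are assumptions chosen to match the coefficient names in the output later in the notes.

# Hypothetical setup of the business-majors data (names assumed, not from the source)
business <- data.frame(
  major  = rep(c("accounting", "administration", "economics", "finance"), times = 2),
  gender = rep(c("female", "male"), each = 4),
  n      = c(68, 91, 5, 61,    # women
             56, 40, 6, 59))   # men
# Two-way table form, used later by chisq.test()
business.tab <- xtabs(n ~ major + gender, data = business)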

Model for the data

Joint PDF of (X, Y): P[X = x_i, Y = y_j] = π_{ij}
Marginal PDF of X: P[X = x_i] = π_{i+}
Marginal PDF of Y: P[Y = y_j] = π_{+j}

Expected cell counts: μ_{ij} = n π_{ij}, where n = n_{++} is the total count. N = rc is the effective sample size (number of observations).

Each cell count is treated as Poisson with rate μ_{ij}, and the log-linear model is placed on log μ_{ij}.

Independence Model for Two-way Tables

If X and Y are independent, then

  P[X = x_i, Y = y_j] = P[X = x_i] P[Y = y_j] = π_{i+} π_{+j}

and the expected count is

  μ_{ij} = n π_{ij} = n π_{i+} π_{+j}

This implies that the log-linear model satisfies

  log μ_{ij} = log n + log π_{i+} + log π_{+j} = λ + λ^X_i + λ^Y_j

The estimates for the marginal probabilities are

  π̂_{i+} = n_{i+} / n    π̂_{+j} = n_{+j} / n

The fitted values for this model are

  μ̂_{ij} = n π̂_{i+} π̂_{+j} = n_{i+} n_{+j} / n

This can be seen by maximizing the log likelihood (under the constraints Σ_i π_{i+} = Σ_j π_{+j} = Σ_{i,j} π_{ij} = 1), which has the form

  l(π) = Σ_{i=1}^r Σ_{j=1}^c { y_{ij} log(n π_{ij}) − n π_{ij} }
       = Σ_{i=1}^r Σ_{j=1}^c { y_{ij} log π_{i+} + y_{ij} log π_{+j} + y_{ij} log n } − n
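As a quick check of this formula, the fitted counts under independence can be computed directly from the margins of the table (a sketch, using the business.tab object from the earlier setup):

# n_{i+} n_{+j} / n for every cell, computed from the margins
row.tot <- rowSums(business.tab)   # n_{i+}
col.tot <- colSums(business.tab)   # n_{+j}
n.tot   <- sum(business.tab)       # n
mu.hat  <- outer(row.tot, col.tot) / n.tot
mu.hat   # same as the fitted values of the Poisson independence fit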

If using the parameterization log μ_{ij} = λ + λ^X_i + λ^Y_j, you need to deal with constraints on the λ^X_i and λ^Y_j. These constraints are needed since Σ_i π_{i+} = Σ_j π_{+j} = Σ_{i,j} π_{ij} = 1 must hold. This is not a problem in R, as its approach to generating contrasts for categorical predictors is a valid way of imposing them.

One consequence of this is that, for the independence model, the number of parameters to be estimated is r + c − 1: there is λ, r − 1 of the λ^X_i, and c − 1 of the λ^Y_j.

As in the ANOVA setting, there is no unique approach for dealing with constraints. Common choices in this setting are λ̂^X_1 = λ̂^Y_1 = 0 (the R default) or λ̂^X_r = λ̂^Y_c = 0. Any valid constraint must satisfy

  λ̂^X_r = log( 1 − e^{λ̂^X_1} − ... − e^{λ̂^X_{r−1}} )

(similarly for λ̂^Y_c).

What is unique are the differences λ̂^X_i − λ̂^X_j and λ̂^Y_i − λ̂^Y_j (we need to deal with contrasts).

One reason this is important is to consider the odds

  π_{ij} / π_{ik} = P[X = x_i, Y = y_j] / P[X = x_i, Y = y_k] = μ_{ij} / μ_{ik}

Under the independence assumption, the log odds satisfy

  log( π_{ij} / π_{ik} ) = log π_{ij} − log π_{ik}
                         = (λ + λ^X_i + λ^Y_j) − (λ + λ^X_i + λ^Y_k)
                         = λ^Y_j − λ^Y_k

Note that this does not depend on the level of X, which should be the case when X and Y are independent. Also note that a similar relationship holds for log( π_{ij} / π_{kj} ) (fix Y, vary X).

Saturated Model for Two-way Tables

The saturated model for the two-way table can be written

  log μ_{ij} = λ + λ^X_i + λ^Y_j + λ^{XY}_{ij}

The terms λ^{XY}_{ij} represent the interaction between X and Y. In this case, the effect of Y depends on the level of X, and similarly the effect of X depends on the level of Y. One way to see this is to look at the odds ratio

  φ_{ik,jl} = (π_{ij} π_{kl}) / (π_{kj} π_{il}) = (μ_{ij} μ_{kl}) / (μ_{kj} μ_{il})

The odds ratios are useful measures for describing the dependence between categorical variables.

To compare accounting and administration for men and women,

  φ̂ = (68 × 40) / (56 × 91) = 0.53

So the odds of women to men in accounting are about half of those in administration.

Major            Women   Men
Accounting          68    56
Administration      91    40
Economics            5     6
Finance             61    59

log φ_{ik,jl} can be shown to have the form

  λ^{XY}_{ij} − λ^{XY}_{kj} − λ^{XY}_{il} + λ^{XY}_{kl}

Under independence, all λ^{XY}_{ij} = 0, so the log odds ratio is 0 (odds ratio = 1). Note that this goes the other way as well: if all the λ^{XY}_{ij} = 0, then X and Y must be independent.
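The same odds ratio can also be computed from the expected counts under independence, where it is exactly 1, matching λ^{XY}_{ij} = 0 (a sketch, reusing the mu.hat object from the earlier margin calculation):

# Observed odds ratio: accounting vs administration, women vs men
(68 * 40) / (56 * 91)
# Odds ratio of the expected counts under independence (equals 1)
(mu.hat["accounting", "female"] * mu.hat["administration", "male"]) /
  (mu.hat["administration", "female"] * mu.hat["accounting", "male"])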

So another way of thinking of the λ^{XY}_{ij} is that they describe the form of the dependency between X and Y.

For the saturated model, there are rc different parameters to be fit: λ, r − 1 λ^X_i's, c − 1 λ^Y_j's, and (r − 1)(c − 1) λ^{XY}_{ij}'s. A consequence of this is that the fitted cell counts satisfy μ̂_{ij} = n_{ij}.

Deviances in Contingency Tables

This leads to the deviance of a contingency table model having the form

  X² = Σ_{all cells} 2 y_{ij} log( y_{ij} / μ̂_{ij} )

i.e. Σ 2 Obs log(Obs / Exp).

Note that this has the same form as the deviance for the binomial regression models discussed earlier. This makes sense since, if we condition on n, the total count,

  (n_{11}, ..., n_{rc}) | n ~ Multinomial(n, π)

As before, X² is approximately distributed χ²_{df}, where df = rc − (# params). For the independence model,

  df = rc − (r + c − 1) = (r − 1)(c − 1)
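As a numerical check of this formula, the deviance of the independence model for the business-majors example (fit here as business.ind, matching the call shown in the output below) can be computed directly; a sketch:

# Independence fit and its deviance computed from the definition above
business.ind <- glm(n ~ major + gender, family = poisson(), data = business)
obs <- business$n
fit <- fitted(business.ind)
2 * sum(obs * log(obs / fit))   # agrees with deviance(business.ind), about 11.0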

For the saturated model, df = 0.

As before, X² can be used for goodness of fit. For the business major example:

> summary(business.ind)

Call:
glm(formula = n ~ major + gender, family = poisson(), data = business)

Coefficients:
                    Estimate Std. Error z value Pr(>|z|)
(Intercept)          4.28054    0.09959  42.981  < 2e-16 ***
majoradministration  0.05492    0.12529   0.438  0.66117
majoreconomics      -2.42239    0.31460  -7.700 1.36e-14 ***
majorfinance        -0.03279    0.12805  -0.256  0.79790
gendermale          -0.33470    0.10323  -3.242  0.00119 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 168.473  on 7  degrees of freedom
Residual deviance:  11.017  on 3  degrees of freedom
AIC: 63.832

> pchisq(deviance(business.ind), df.residual(business.ind), lower.tail = F)
[1] 0.01163662

So there is some evidence of lack of fit. This is also supported by the Pearson chi-square test

  X²_P = Σ_{all cells} (n_{ij} − μ̂_{ij})² / μ̂_{ij}

which also has approximately a χ²_{df} distribution.
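The Pearson statistic can likewise be computed straight from its definition (a sketch, using the business.ind fit above):

# Pearson X^2 from the observed and fitted counts
sum((business$n - fitted(business.ind))^2 / fitted(business.ind))   # about 10.83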

> pearson.ind <- sum(resid(business.ind, type = "pearson")^2)
> pearson.ind
[1] 10.82673
> pchisq(pearson.ind, df.residual(business.ind), lower.tail = F)
[1] 0.01270068

Note that for two-way tables, this test can be done in R by

> chisq.test(business.tab)

        Pearson's Chi-squared test

data:  business.tab
X-squared = 10.8267, df = 3, p-value = 0.0127

Warning message:
Chi-squared approximation may be incorrect in: chisq.test(business.tab)

As discussed before, the χ²_{df} approximation works better in both tests when the μ_{ij} are big. In this case one of the fitted cell counts is just under 5.

To get a handle on where the problems are, let's look at the fits and residuals.

                     Observed          Expected
Major            Female    Male    Female    Male
Accounting           68      56     72.28   51.72
Administration       91      40     76.36   54.64
Economics             5       6      6.41    4.59
Finance              61      59     69.95   50.05

                 Deviance Residuals    Pearson Residuals
Major             Female      Male      Female      Male
Accounting         -0.51      0.59       -0.50      0.60
Administration      1.63     -2.08        1.68     -1.98
Economics          -0.58      0.63       -0.56      0.66
Finance            -1.09      1.23       -1.07      1.26
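These fitted counts and residuals can be reproduced from the independence fit (a sketch, assuming the business data frame and the business.ind fit from earlier):

# Observed counts, expected counts and both residual types, cell by cell
data.frame(major    = business$major,
           gender   = business$gender,
           observed = business$n,
           expected = round(fitted(business.ind), 2),
           dev.res  = round(resid(business.ind, type = "deviance"), 2),
           pear.res = round(resid(business.ind, type = "pearson"), 2))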

It appears that the big difference occurs with administration: more women (and fewer men) go into that major than would be expected under independence. The other majors tend to be closer to 50:50 women to men.

This can also be seen by looking at the φ̂_{ik,jl}. For example, when comparing accounting and finance,

  φ̂ = (68 × 59) / (56 × 61) = 1.17

Another way of looking for problems in the fit is to examine the λ̂^{XY}_{ij}. For the example,

> summary(business.int)

Call:
glm(formula = n ~ major * gender, family = poisson(), data = business)

Deviance Residuals:
[1]  0  0  0  0  0  0  0  0

Coefficients:
                               Estimate Std. Error z value Pr(>|z|)
(Intercept)                      4.2195     0.1213  34.795  < 2e-16 ***
majoradministration              0.2914     0.1603   1.818   0.0691 .
majoreconomics                  -2.6101     0.4634  -5.633 1.77e-08 ***
majorfinance                    -0.1086     0.1764  -0.616   0.5379
gendermale                      -0.1942     0.1805  -1.076   0.2820
majoradministration:gendermale  -0.6278     0.2618  -2.398   0.0165 *
majoreconomics:gendermale        0.3765     0.6318   0.596   0.5513
majorfinance:gendermale          0.1608     0.2567   0.626   0.5310
---

(Dispersion parameter for poisson family taken to be 1)

    Null deviance:  1.6847e+02  on 7  degrees of freedom
Residual deviance: -8.8818e-16  on 0  degrees of freedom
AIC: 58.815
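With the default treatment contrasts (accounting and female as the baseline levels), the administration-by-male interaction coefficient above should equal the log odds ratio computed earlier from the 2 × 2 sub-table; a quick check (a sketch, assuming the business.int fit shown above):

# lambda^XY for administration:male vs the directly computed log odds ratio
coef(business.int)["majoradministration:gendermale"]
log((68 * 40) / (56 * 91))   # both about -0.628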

> anova(business.int, test = "Chisq")
Analysis of Deviance Table

Model: poisson, link: log

Response: n

Terms added sequentially (first to last)

             Df Deviance Resid. Df  Resid. Dev  P(>|Chi|)
NULL                             7     168.473
major         3  146.796         4      21.677  1.294e-31
gender        1   10.661         3      11.017      0.001
major:gender  3   11.017         0  -8.882e-16      0.012

How this gets exhibited can vary depending on the constraints placed on the λ^X_i's, λ^Y_j's, and λ^{XY}_{ij}'s.

For example, using the constraints

  Σ_i λ^X_i = 0        Σ_j λ^Y_j = 0
  Σ_i λ^{XY}_{ij} = 0 for each j
  Σ_j λ^{XY}_{ij} = 0 for each i

leads to the estimated parameters

> options(contrasts = c("contr.sum", "contr.poly"))
> business.int2 <- glm(n ~ major * gender, family = poisson(), data = business)

> summary(business.int2)

Deviance Residuals:
[1]  0  0  0  0  0  0  0  0

Coefficients:
               Estimate Std. Error z value Pr(>|z|)
(Intercept)     3.50428    0.08556  40.955  < 2e-16 ***
major1          0.61815    0.10673   5.792 6.97e-09 ***
major2          0.59559    0.10872   5.478 4.30e-08 ***
major3         -1.80368    0.23055  -7.823 5.15e-15 ***
gender1         0.10839    0.08556   1.267  0.20522
major1:gender1 -0.01132    0.10673  -0.106  0.91557
major2:gender1  0.30260    0.10872   2.783  0.00538 **
major3:gender1 -0.19955    0.23055  -0.866  0.38674
---

(Dispersion parameter for poisson family taken to be 1)

    Null deviance:  1.6847e+02  on 7  degrees of freedom
Residual deviance: -9.1038e-15  on 0  degrees of freedom
AIC: 58.815

> anova(business.int2, test = "Chisq")
Analysis of Deviance Table

Model: poisson, link: log

Response: n

Terms added sequentially (first to last)

             Df Deviance Resid. Df  Resid. Dev  P(>|Chi|)
NULL                             7     168.473
major         3  146.796         4      21.677  1.294e-31
gender        1   10.661         3      11.017      0.001
major:gender  3   11.017         0  -9.104e-15      0.012
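Since only the parameterization changes, the two fits should give identical fitted values and deviances; a quick check (a sketch, using the business.int and business.int2 fits above):

# Same model, different contrasts: fitted values and deviances agree
all.equal(as.numeric(fitted(business.int)), as.numeric(fitted(business.int2)))
c(deviance(business.int), deviance(business.int2))   # both numerically zero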

Note that we are not changing the model, just how it is described, so it is reasonable that different parameterizations lead to different descriptions of how the data depart from independence. Notice that the deviances reported under the two parameterizations are the same, up to rounding differences.

Example: Belief in the afterlife

As part of the 1991 General Social Survey, the National Opinion Research Center asked participants whether they believed in an afterlife. The data, broken down by gender, are

Belief   Females   Males   Total
Yes          435     375     810
No           147     134     281
Total        582     509    1091

Let's examine whether belief and gender are associated.
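A sketch of how the afterlife data might be entered and the saturated (interaction) model fit; the variable and level names are assumptions chosen to match the output below.

# Hypothetical data setup (names assumed, not from the source)
afterlife <- data.frame(
  belief = rep(c("no", "yes"), times = 2),
  gender = rep(c("female", "male"), each = 2),
  n      = c(147, 435, 134, 375))
# If contr.sum was set above, reset to the default treatment contrasts first
options(contrasts = c("contr.treatment", "contr.poly"))
afterlife.int <- glm(n ~ belief * gender, family = poisson(), data = afterlife)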

> summary(afterlife.int)

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
(Intercept)           4.99043    0.08248  60.506   <2e-16 ***
beliefyes             1.08491    0.09540  11.372   <2e-16 ***
gendermale           -0.09259    0.11944  -0.775    0.438
beliefyes:gendermale -0.05583    0.13868  -0.403    0.687

> anova(afterlife.int, test = "Chisq")

              Df Deviance Resid. Df  Resid. Dev  P(>|Chi|)
NULL                              3     272.685
belief         1  267.635         2       5.050  3.718e-60
gender         1    4.888         1       0.162      0.027
belief:gender  1    0.162         0   1.197e-13      0.687

In this case there is little evidence for including the interaction, and thus belief in an afterlife and gender appear to be independent.

Log-linear Model versus Logistic Regression

Let's suppose that X takes two levels, such as in the afterlife belief example. Then

  log( μ_{1j} / μ_{2j} ) = log( π_j / (1 − π_j) )

is the log odds, where π_j = P[X = 1 | Y = y_j]. So log φ_{12,jk} is the log odds ratio comparing Y at level j with Y at level k.

Thus we are effectively modeling the same parameters in the saturated log-linear model as we are in logistic regression with Y as a categorical predictor. It can be shown that inference in the two approaches leads to the same answer. To exhibit this, let's fit the afterlife belief data both ways.
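One construction of the logistic-regression fit that reproduces the printed output is sketched below; the response layout and the numeric coding of gen (1 for females, 2 for males) are assumptions inferred from the coefficients shown on the next slide.

# Successes = believers, failures = non-believers, one row per gender
agree <- cbind(yes = c(435, 375), no = c(147, 134))
gen   <- c(1, 2)                 # assumed coding: 1 = female, 2 = male
afterlife.logit <- glm(agree ~ gen, family = binomial())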

> summary(afterlife.int)

Call:
glm(formula = n ~ belief * gender, family = poisson())

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
(Intercept)           4.99043    0.08248  60.506   <2e-16 ***
beliefyes             1.08491    0.09540  11.372   <2e-16 ***
gendermale           -0.09259    0.11944  -0.775    0.438
beliefyes:gendermale -0.05583    0.13868  -0.403    0.687

> summary(afterlife.logit)

Call:
glm(formula = agree ~ gen, family = binomial())

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  1.14074    0.21572   5.288 1.24e-07 ***
gen         -0.05583    0.13868  -0.403    0.687

> anova(afterlife.int, test = "Chisq")
Analysis of Deviance Table

Model: poisson, link: log

              Df Deviance Resid. Df  Resid. Dev  P(>|Chi|)
NULL                              3     272.685
belief         1  267.635         2       5.050  3.718e-60
gender         1    4.888         1       0.162      0.027
belief:gender  1    0.162         0   1.197e-13      0.687

> anova(afterlife.logit, test = "Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

      Df Deviance Resid. Df Resid. Dev P(>|Chi|)
NULL                      1    0.16200
gen    1  0.16200         0  8.415e-14   0.68733

Note that the deviance drop for belief:gender in the log-linear fit (0.162 on 1 df, p = 0.687) matches the deviance drop for gen in the logistic fit.

Other Sampling Schemes

So far only tables generated by Poisson sampling schemes have been considered. However, many other schemes lead to the same analysis. These come from fixing different marginal counts.

Multinomial sampling: fix n. In this case we are still modeling the joint probabilities π_{ij}; however, it is with one multinomial sample instead of rc Poisson samples.

Product multinomial: fix the row totals (n_{i+}) or the column totals (n_{+j}). In the case of fixing the row totals, each row is considered a multinomial sample and we are modeling P[Y = y_j | X = x_i], so there are r different independent multinomial samples. Instead of looking for independence, normally we would instead look for homogeneity of probabilities (i.e. P[Y = y_j | X = x_i] = P[Y = y_j] for all x_i).

The justification of these statements can be made by showing that the likelihoods in these cases have equivalent forms when dealing with the π_{ij}'s. This relates to the fact that if Y_1 ~ Poisson(μ_1) and Y_2 ~ Poisson(μ_2) are independent, then

  Y_1 | Y_1 + Y_2 = n ~ Bin(n, π),   where π = μ_1 / (μ_1 + μ_2).
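A small simulation sketch illustrating this conditioning fact (the rates 4 and 6 and the conditioning total 10 are arbitrary choices):

# Conditional on Y1 + Y2 = 10, Y1 should look Binomial(10, 0.4)
set.seed(1)
mu1 <- 4; mu2 <- 6
y1 <- rpois(1e5, mu1); y2 <- rpois(1e5, mu2)
keep <- (y1 + y2) == 10
mean(y1[keep]) / 10                      # close to pi = mu1 / (mu1 + mu2) = 0.4
round(prop.table(table(y1[keep])), 3)    # close to dbinom(0:10, 10, 0.4)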