
Various Issues in Fitting Contingency Tables

Statistics 149, Spring 2006
Copyright 2006 by Mark E. Irwin

Complete Tables with Zero Entries

In contingency tables, it is possible to have zero entries in a table. There are two ways this can occur.

Structural zeros: These occur when the sampling scheme forces them to be 0. For example, in the Wave Damage to Cargo Ships example, the number of damage incidents was classified by 3 factors:

    Ship type: A - E
    Year of construction: 1960-64, 1965-69, 1970-74, 1975-79
    Period of operation: 1960-74, 1975-79

In this example, any cell involving operation period = 1960-74 and construction year = 1975-79 must have a count of 0. There is actually one more structural zero: the cell for ship type = E, construction year = 1960-64, and operation period = 1975-79 had service time = 0, which forces the number of damage incidents to be 0.

These situations are easily handled by just removing those observations from the data set. Note that the total number of observations won't be IJK but IJK - ν_s, where ν_s is the number of structural zeros in the data set. In this example ν_s = 6: one for each ship type (the five cells with operation period 1960-74 and construction year 1975-79) plus the one cell that had 0 service time, even though its count could otherwise have been positive. So the effective number of observations is 5 × 4 × 2 - 6 = 34.

> summary(wave.glm)

Call:
glm(formula = Damage ~ Type + Construct + Operation,
    family = poisson(), data = wave2, offset = log(service))

    Null deviance: 146.328  on 33  degrees of freedom
Residual deviance:  38.695  on 25  degrees of freedom
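For reference, the fit above could be produced along these lines (a minimal sketch, assuming a data frame wave with one row per cell and columns Type, Construct, Operation, service, and Damage; only wave2 appears on the slides):

wave2 <- subset(wave, service > 0)   # drop the 6 structural zeros
wave.glm <- glm(Damage ~ Type + Construct + Operation,
                family = poisson(), data = wave2,
                offset = log(service))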

Sampling zeros: Cell counts of 0 can also occur at random. This will tend to happen when µ is small (equivalently, when π is close to 0).

Example: Food poisoning (Bishop, Fienberg & Holland, pp. 90-91)

The following data are from an epidemiologic study following an outbreak of food poisoning at an outing held for personnel of an insurance company. Questionnaires were completed by 304 of the 320 persons attending. Of the food eaten, interest focused on potato salad and crabmeat.

                    Crabmeat          No Crabmeat
    Potato Salad   Yes     No        Yes     No     Total
    Ill            120      4         22      0       146
    Not Ill         80     31         24     23       158

There is no reason, a priori, to assume that people who got ill must have eaten at least one of the potato salad or crabmeat.
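The poison data frame used in the fits below isn't shown on the slides; here is one way it might be entered (a sketch, with variable names inferred from the coefficient names in the output):

poison <- expand.grid(ill = c("no", "yes"), crab = c("no", "yes"),
                      potato = c("no", "yes"))
# counts in expand.grid order (ill varies fastest, then crab, then potato)
poison$count <- c(23, 0, 31, 4, 24, 22, 80, 120)
poison.ic.ip.cp <- glm(count ~ .^2, family = poisson(), data = poison)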

    Model           df    X_P^2     X^2    µ̂_122
    (IP, IC, PC)     1     1.70    2.74     1.08
    (IP, IC)         2     7.21    7.64     0.60
    (IP, PC)         2     5.09    6.48     1.59
    (IC, PC)         2    44.35   53.66     7.33

The 0 cell doesn't seem to cause a problem for these models. For example, for the model (IP, IC, PC):

> summary(poison.ic.ip.cp)

Call:
glm(formula = count ~ .^2, family = poisson(), data = poison)

Deviance Residuals:
       1         2         3         4         5         6
-0.09857   0.59995   0.23481  -1.47176   0.12164  -0.19230
       7         8
-0.21783   0.22947

Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
(Intercept)         3.0873     0.2102  14.686  < 2e-16 ***
illyes             -3.0075     0.5676  -5.299 1.17e-07 ***
crabyes             0.3811     0.2697   1.413   0.1577
potatoyes           0.1349     0.2837   0.476   0.6343
illyes:crabyes      0.6097     0.3170   1.923   0.0544 .
illyes:potatoyes    2.8259     0.5362   5.270 1.36e-07 ***
crabyes:potatoyes   0.7651     0.3432   2.229   0.0258 *
---
(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 295.2526  on 7  degrees of freedom
Residual deviance:   2.7427  on 1  degrees of freedom
AIC: 53.074

Number of Fisher Scoring iterations: 5

Everything looks fine here. Now let's fit the saturated model.

> summary(poison.icp)

Call:
glm(formula = count ~ .^3, family = poisson(), data = poison,
    epsilon = 1e-16, maxit = 50)

Deviance Residuals:
[1]  0  0  0  0  0  0  0  0

This is to be expected, since µ̂_ijk = n_ijk for the saturated model.

Coefficients:
                           Estimate Std. Error   z value Pr(>|z|)
(Intercept)               3.135e+00  2.085e-01    15.037   <2e-16 ***
illyes                   -5.544e+01  6.711e+07 -8.26e-07    1.000
crabyes                   2.985e-01  2.752e-01     1.085    0.278
potatoyes                 4.256e-02  2.918e-01     0.146    0.884
illyes:crabyes            5.339e+01  6.711e+07  7.96e-07    1.000
illyes:potatoyes          5.535e+01  6.711e+07  8.25e-07    1.000
crabyes:potatoyes         9.055e-01  3.604e-01     2.512    0.012 *
illyes:crabyes:potatoyes -5.290e+01  6.711e+07 -7.88e-07    1.000
---
(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 2.9525e+02  on 7  degrees of freedom
Residual deviance: 1.1546e-14  on 0  degrees of freedom
AIC: 52.332

Number of Fisher Scoring iterations: 50

This doesn't look so good. The glm function didn't converge here: the number of Fisher scoring iterations equals maxit (50). Also notice that some of the standard errors for the λs are huge.

It turns out that sampling zeros can cause problems with estimating the λs in some cases. Whether a problem occurs depends on the model to be fit and the layout of the zeros in the table. Note that even though there may be problems with estimating the λs, the model may still be well defined in terms of the πs and odds-ratio type measures (possibly requiring more complicated structures).

Let's look at a couple of hypothetical 2 × 2 tables and try to fit the independence model in each case.

Observed:

    [ 20   0 ]      [ 20   5 ]
    [  0   5 ]      [  0   0 ]

Fitted:

    [ 16   4 ]      [ 20   5 ]
    [  4   1 ]      [  0   0 ]

If we use the upper left cell as the reference cell, the MLEs for the λs satisfy

    λ̂   = log µ̂_11
    λ̂_X = (1/2)(log µ̂_22 + log µ̂_21 - log µ̂_12 - log µ̂_11)
    λ̂_Y = (1/2)(log µ̂_22 - log µ̂_21 + log µ̂_12 - log µ̂_11)

where µ̂_ij = n_i+ n_+j / n.

For the first table we get

    λ̂   = log 16 = 2.7726
    λ̂_X = (1/2)(log 1 + log 4 - log 4 - log 16) = -1.3863
    λ̂_Y = (1/2)(log 1 - log 4 + log 4 - log 16) = -1.3863

However, for the second table we get

    λ̂   = log 20 = 2.9957
    λ̂_X = (1/2)(log 0 + log 0 - log 5 - log 20) = -∞
    λ̂_Y = (1/2)(log 0 - log 0 + log 5 - log 20), which involves log 0 and is not even well defined

In this case the fitted cells with 0 are causing the trouble.
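These fitted tables are easy to check numerically; a minimal sketch:

obs1 <- matrix(c(20, 0,
                  0, 5), 2, 2, byrow = TRUE)
obs2 <- matrix(c(20, 5,
                  0, 0), 2, 2, byrow = TRUE)
# independence fits: mu-hat_ij = n_i+ n_+j / n
fit.indep <- function(n) outer(rowSums(n), colSums(n)) / sum(n)
fit.indep(obs1)       # 16 4 / 4 1 : all cells positive, lambda-hats finite
log(fit.indep(obs2))  # -Inf entries: the corresponding lambda-hats blow up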

Let's consider another example:

                 Z1              Z2
             Y1      Y2      Y1      Y2
    X1        0       b       e       f
    X2        c       d       g       0

Let's consider the model (XY, XZ, YZ). Under this model

    µ̂_ij+ = n_ij+        µ̂_i+k = n_i+k        µ̂_+jk = n_+jk

Now suppose that µ̂_111 = δ > 0. Then the table of expected counts must look like

                 Z1                    Z2
             Y1        Y2          Y1        Y2
    X1        δ        b - δ       e - δ     f + δ
    X2        c - δ    d + δ       g + δ     -δ

which gives a negative µ̂. Thus µ̂_111 = µ̂_222 = 0, which further implies µ̂_ijk = n_ijk. In addition, the λ̂s aren't well defined, as the calculations for many of them involve log 0. If we fit an example of this table we get
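(The zero3 data set isn't shown on the slides; one way such a table might be entered, as a sketch with arbitrary nonzero counts standing in for b through g, so the output below, from the author's version, has different numbers:)

zero3 <- expand.grid(X = c("X1", "X2"), Y = c("Y1", "Y2"),
                     Z = c("Z1", "Z2"))
# counts in the order (X1,Y1,Z1), (X2,Y1,Z1), (X1,Y2,Z1), (X2,Y2,Z1), ...
# with zeros in cells (X1,Y1,Z1) and (X2,Y2,Z2), as in the table above
zero3$count <- c(0, 5, 8, 7, 6, 9, 4, 0)
zero3.glm <- glm(count ~ (X + Y + Z)^2, family = poisson(), data = zero3)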

Call:
glm(formula = count ~ (X + Y + Z)^2, family = poisson(), data = zero3)

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)   -23.81   46570.93  -0.001        1
X2             25.42   46570.93   0.001        1
Y2             26.80   46570.93   0.001        1
Z2             26.11   46570.93   0.001        1
X2:Y2         -25.70   46570.93  -0.001        1
X2:Z2         -25.93   46570.93  -0.001        1
Y2:Z2         -25.70   46570.93  -0.001        1

In this case the intercept is actually log 0 = -∞. The finite value reported is an artifact of the convergence criterion used in glm and the finite precision of calculations in R.

Note that changing the parameterization can't solve the problem. Different parameterizations will exhibit the problem differently, but it will always be there.

Effects on the Degrees of Freedom

In general, the degrees of freedom are determined by

    df = T_e - T_p

where T_e is the number of cells in the table where µ is estimated and T_p is the number of parameters to be fitted. Note this is just the same formula as before. If there are no cells with fitted zeros, this formula is correct. However, it is not correct if there are fitted zeros, which can occur in two ways:

1. λ terms in the model cannot be estimated due to the arrangement of the sampling zeros, even though the specifying configurations may have all non-zero values.

2. Zero cells lead to empty cells for the specifying configuration.

In the case of fitted zeros, the formula for the degrees of freedom needs to be modified to

    df = (T_e - z_e) - (T_p - z_p)

where z_e is the number of cells with fitted zeros and z_p is the number of parameters that can't be estimated. So for the artificial 3-way table with the model (XY, XZ, YZ),

    T_e = 8    T_p = 7    z_e = 2    z_p = 1

giving

    df = (8 - 2) - (7 - 1) = 0

which seems reasonable, since µ̂_ijk = n_ijk for this model.

Note that z_p is often difficult to determine. For this problem it happens to be 1: it turns out you can only fit 2 of the 3 two-factor interactions at one time.
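In practice z_e can be read off the fitted values, while z_p usually has to be reasoned out by hand. A rough sketch (the cutoffs are arbitrary), assuming a fitted Poisson glm object fit:

z.e <- sum(fitted(fit) < 1e-8)   # cells with fitted means numerically zero
# a crude screen for inestimable lambdas: standard errors that have blown up
sum(summary(fit)$coefficients[, "Std. Error"] > 1e4)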

Separating Hyperplanes in Logistic Regression

There is a related problem in logistic regression where it is not possible to get parameter estimates. The following plot shows an example of when the problem occurs.

[Figure: Success (0 or 1) plotted against X (0 to 10); all failures fall at x < 5 and all successes at x > 5.]

> sp.glm <- glm(ysp ~ xsp, family=binomial())
Warning messages:
1: algorithm did not converge in: glm.fit(x = X, y = Y, weights = weights,
   start = start, etastart = etastart,
2: fitted probabilities numerically 0 or 1 occurred in: glm.fit(x = X, y = Y,
   weights = weights, start = start, etastart = etastart,

> summary(sp.glm)

Call:
glm(formula = ysp ~ xsp, family = binomial())

Deviance Residuals:
       Min          1Q      Median          3Q         Max
-6.008e-05  -2.107e-08   0.000e+00   2.107e-08   5.200e-05

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -234.32  119856.80  -0.002    0.998
xsp            50.93   26078.75   0.002    0.998

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.9315e+01  on 49  degrees of freedom
Residual deviance: 6.7379e-09  on 48  degrees of freedom
AIC: 4

Number of Fisher Scoring iterations: 25

In this example the data were generated by y = I(x > 5). When you try to fit this example, the iterates of β̂_1 in the IRLS algorithm will continue to increase without bound. The only reason that R stopped and gave an answer is that the glm function has a maxit option which forces the iterations to terminate.

When there is a single predictor x, this problem will exhibit itself whenever you have a dataset with the property

    y = I(x > x_c)    or    y = I(x < x_c)

When there is a higher dimensional predictor (more xs), this problem can also occur. In two dimensions, the problem occurs if you can draw a line through the scatterplot of the predictors and have all the successes on one side and all the failures on the other.

[Figure: scatterplot of X2 against X1 (both 0 to 10), with a line separating all the successes from all the failures.]

In 3 dimensions you look for a separating plane, and in 4 and higher dimensions you look for separating hyperplanes.

Usually this problem occurs when you have sparse data, particularly in the range of xs where π(x) changes rapidly. When it comes to study design, you need to choose the levels of the predictors and the number of observations so that the sample proportions are bounded away from 0 and 1.
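A minimal sketch reproducing the one-predictor example (the seed and x values are arbitrary stand-ins; the slides' xsp and ysp are not shown):

set.seed(1)
xsp <- sort(runif(50, 0, 10))   # 50 predictor values on (0, 10)
ysp <- as.numeric(xsp > 5)      # complete separation at x_c = 5
sp.glm <- glm(ysp ~ xsp, family = binomial())  # warnings: did not converge,
                                               # fitted probabilities 0 or 1
coef(sp.glm)  # huge slope, limited only by when the iterations stop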

Direct Estimates

While being able to get direct estimates of the µ̂s is less important today, due to modern computing hardware and software, it does have its uses. For example, Monte Carlo techniques may require generating these fits many times, and it may be quicker to compute them directly than to use iterative procedures such as iteratively reweighted least squares.

As mentioned before, not every log-linear model has direct estimates. There is a simple algorithm that will determine whether a model has direct estimates based on the sufficient configuration (the notation discussed last time for describing models). The algorithm involves the following steps:

1. Relabel any group of variables that always appear together as a single variable.

2. Delete any variable that occurs in every configuration.

3. Delete any variable that occurs in only one configuration.

4. Remove any redundant configuration (one contained in another).

5. Repeat steps 1-4 until
   (a) no more than 2 configurations remain ⇒ direct estimates, or
   (b) no further steps are possible ⇒ no direct estimates.

Examples:

1. (AB, AC, BD)
   Step 3: (AB, AC, B)   D occurs only once
   Step 4: (AB, AC)      B redundant [contained in AB]
   Step 3: (AB, A)       C occurs only once
   Step 4: (AB)          A redundant [contained in AB]
   No more than 2 configurations remain ⇒ direct estimates

2. (AB, AC, BC)
   Nothing can be relabeled by step 1.
   No variable occurs in every configuration, so step 2 drops nothing.
   No variable occurs in just one configuration, so step 3 drops nothing.
   There are no redundant configurations, so step 4 drops nothing.
   No further steps are possible and more than 2 configurations remain ⇒ no direct estimates

3. (ABC, BCD, AD)
   Step 1: (A[BC], [BC]D, AD) = (AE, ED, AD)   relabel BC as E
   No further steps are possible; this is equivalent to example 2.
   More than 2 configurations remain ⇒ no direct estimates

4. (BCE, ACF, EG, ABD, ABC)
   Step 3: (BCE, AC, E, ABD, ABC)   F & G removed, as they occur only once
   Step 4: (BCE, ABD, ABC)          AC [part of ABC] & E [part of BCE] are now redundant
   Step 3: (BC, AB, ABC)            D & E occur only once
   Step 4: (ABC)                    AB & BC both occur in ABC
   No more than 2 configurations remain ⇒ direct estimates
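The steps above are mechanical enough to automate. Below is a sketch of such a check (the function name and representation are mine, not from the slides); each configuration is a character vector of variable names. Step 1 is left implicit, since variables that always appear together already behave identically under steps 2-4.

has.direct <- function(config) {
  repeat {
    old <- config
    vars <- unique(unlist(config))
    # number of configurations each variable appears in
    occ <- sapply(vars, function(v) sum(sapply(config, function(s) v %in% s)))
    # steps 2 and 3: drop variables in every configuration or in only one
    config <- lapply(config, setdiff, vars[occ == length(config) | occ == 1])
    # step 4: drop duplicate and subset (redundant) configurations
    config <- unique(lapply(config, sort))
    keep <- sapply(seq_along(config), function(i)
      !any(sapply(seq_along(config), function(j)
        j != i && length(config[[i]]) < length(config[[j]]) &&
          all(config[[i]] %in% config[[j]]))))
    config <- config[keep]
    if (length(config) <= 2) return(TRUE)       # direct estimates
    if (identical(config, old)) return(FALSE)   # stuck: no direct estimates
  }
}

It reproduces the four examples above:

has.direct(list(c("A","B"), c("A","C"), c("B","D")))           # TRUE
has.direct(list(c("A","B"), c("A","C"), c("B","C")))           # FALSE
has.direct(list(c("A","B","C"), c("B","C","D"), c("A","D")))   # FALSE
has.direct(list(c("B","C","E"), c("A","C","F"), c("E","G"),
                c("A","B","D"), c("A","B","C")))               # TRUE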

Now that we can figure out when direct estimates exist, what are they? There is a simple scheme to construct them:

1. The numerator has entries (marginal counts) from the sufficient configuration.

2. The denominator has entries from the redundant configurations created by the overlaps.

3. Powers of n are included to ensure the right order of magnitude.

Note: When applying this, the sufficient configuration should contain no components that would be eliminated by step 4 of the algorithm. So if the model (ACD, BCD, A) is given, the sufficient configuration is (ACD, BCD).

Examples:

1. (ACD, BCD): Conditional on C & D, A & B are independent.

    µ̂_ijkl = n_i+kl n_+jkl / n_++kl

2. (AB, AC, BD): Many conditional independence relationships.

    µ̂_ijkl = n_ij++ n_i+k+ n_+j+l / (n_i+++ n_+j++)
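For example 1, the closed form is easy to verify against the iterative fit; a sketch on a toy 2 × 2 × 2 × 2 table (the counts are made up, and loglin() from the stats package does the iterative proportional fitting):

set.seed(1)
tab <- array(rpois(16, 10) + 1, dim = c(2, 2, 2, 2))  # dimensions A, B, C, D
nACD <- margin.table(tab, c(1, 3, 4))   # n_{i+kl}
nBCD <- margin.table(tab, c(2, 3, 4))   # n_{+jkl}
nCD  <- margin.table(tab, c(3, 4))      # n_{++kl}
mu <- array(0, dim(tab))
for (i in 1:2) for (j in 1:2) for (k in 1:2) for (l in 1:2)
  mu[i, j, k, l] <- nACD[i, k, l] * nBCD[j, k, l] / nCD[k, l]
fit <- loglin(tab, margin = list(c(1, 3, 4), c(2, 3, 4)), fit = TRUE)$fit
all.equal(mu, fit, check.attributes = FALSE)  # TRUE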

Slight notation change: Let n^X denote the marginal table obtained by fixing the levels of the variables listed in X and adding over the other variables. For example, in a model with variables A, B, C, and D,

    n^AB = n_ij++        n^C = n_++k+        n = n_++++

3. (ABC, ABD, ABE): C, D, and E are mutually independent conditional on A and B.

    µ̂ = n^ABC n^ABD n^ABE / (n^AB)^2

4. (BCE, ACF, EG, ABD, ABC):

    µ̂ = n^ABC n^ABD n^BCE n^ACF n^EG / (n^AB n^AC n^BC n^E)

This algorithm is related to the fact that if X is an element of the sufficient configuration describing a model, then

    µ̂^X = n^X

Another way of thinking of this: the sufficient configuration describes which marginal tables are fixed (matched exactly) in the corresponding estimated table for a model.
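On the same toy fit from the sketch above, this can be checked directly:

all.equal(margin.table(fit, c(1, 3, 4)), margin.table(tab, c(1, 3, 4)),
          check.attributes = FALSE)  # TRUE: the ACD margin is reproduced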