Checking the Poisson assumption in the Poisson generalized linear model


The Poisson regression model is a generalized linear model (GLM) satisfying the following assumptions:

The responses y_i are independent of one another, and each y_i is a non-negative integer, y_i in {0, 1, 2, ...}.

Each y_i follows the Poisson distribution with mean lambda_i, P(y_i = k | lambda_i) = lambda_i^k e^{-lambda_i} / k!.

theta_i = log lambda_i = X_i beta, where beta = (beta_0, beta_1, ..., beta_p)^T and X_i = (1, X_i1, X_i2, ..., X_ip) is the vector of predictors.

A reasonable question to ask is: if confronted with data, how do we know whether to use the Poisson regression model? There are two basic ideas that I want to discuss here. The first idea is easy to implement; the second is much less easy:

Are the responses all non-negative integers that have no natural upper bound?

Is the distribution of y consistent with the Poisson distribution?

The first idea is easy: if the data are anything but non-negative integers that are (in principle, at least) unbounded, Poisson regression is the wrong model to use. The second idea sounds easy but is a little tricky. I will illustrate with an example from Gelman & Hill, which we also looked at in class.

> roachdata <- read.csv("roachdata.csv")
> # this is data from an experiment on the effectiveness of
> # an "integrated pest management system" in apartment buildings
> # in a particular city.
> str(roachdata)
'data.frame':   262 obs. of  6 variables:
 $ X        : int  1 2 3 4 5 6 7 8 ...    [observation number]
 $ y        : int  153 127 7 7 0 0 ...    [# of roaches trapped after expmt]
 $ roach1   : num  308 331.25 1.67 ...    [# of roaches before experiment]
 $ treatment: int  1 1 1 1 1 1 1 1 ...    [pest mgmt tx in this apt bldg?]
 $ senior   : int  0 0 0 0 0 0 0 0 ...    [apts restricted to sr citzns?]
 $ exposure2: num  0.8 0.6 1 1 1.14 ...   [avg # of trap-days per apt for y]
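The first check can be automated before fitting anything. A minimal sketch, using the first few roach counts shown in the str() output above (any numeric response vector works the same way):

```r
# First few values of y from str(roachdata) above
y <- c(153, 127, 7, 7, 0, 0)

# TRUE iff every response is a finite non-negative integer,
# as the Poisson model requires
is_count <- all(is.finite(y) & y >= 0 & y == round(y))
is_count
```

Of course this cannot check the "no natural upper bound" part; that requires knowing how the data were collected.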

> attach(roachdata)
> glm.1 <- glm(y ~ roach1 + treatment + senior, family=poisson,
+              offset=log(exposure2))
> summary(glm.1)

Call:
glm(formula = y ~ roach1 + treatment + senior, family = poisson,
    offset = log(exposure2))

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-17.9430   -5.1529   -3.8059    0.1452   26.7771

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept)  3.089e+00  2.123e-02  145.49   <2e-16 ***
roach1       6.983e-03  8.874e-05   78.69   <2e-16 ***
treatment   -5.167e-01  2.474e-02  -20.89   <2e-16 ***
senior      -3.799e-01  3.342e-02  -11.37   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

    Null deviance: 16954  on 261  degrees of freedom
Residual deviance: 11429  on 258  degrees of freedom
AIC: 12192

Number of Fisher Scoring iterations: 6

The model obviously fits much better than the intercept-only model, but is the Poisson assumption really met? The first possibility that comes to mind for assessing this is to make a histogram of the y's and see if it looks like the right Poisson distribution.

> hist(y)
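Before looking at any plots, one quick numerical check is already available from the output above: under the Poisson assumption, the residual deviance should be roughly equal to its degrees of freedom. A sketch using the figures printed by summary(glm.1):

```r
# Figures taken from summary(glm.1) above
resid_dev <- 11429   # residual deviance
df_resid  <- 258     # residual degrees of freedom

# A ratio near 1 is consistent with the Poisson assumption;
# here it is about 44, a strong hint that something is wrong
dispersion <- resid_dev / df_resid
dispersion

# Upper-tail chi-squared p-value: essentially 0,
# so the Poisson fit is firmly rejected
pchisq(resid_dev, df_resid, lower.tail = FALSE)
```

This rough deviance/df comparison does not say *how* the Poisson assumption fails, which is what the graphical checks below try to uncover.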

However, is it the right Poisson distribution? We could compare it with the Poisson distribution whose mean is estimated from the y's alone:

> length(y)
[1] 262
> (lambda <- mean(y))
[1] 25.65
> tbl <- NULL
> for (k in 0:50) {
+   tbl <- rbind(tbl, c(obs=sum(y==k),
+                       exp=262*exp(-25.65) * 25.65^k / factorial(k)))
+ }
> barplot(tbl[,2])
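The loop above can be written more compactly with dpois(), which also makes it easy to draw the observed and expected frequencies side by side. A sketch with a small hypothetical count vector (substitute the real y from roachdata):

```r
y   <- c(0, 0, 0, 1, 2, 2, 3, 5, 25, 60)   # hypothetical counts
lam <- mean(y)
k   <- 0:max(y)

# Observed frequency of each count k, and the expected frequency
# under a single Poisson(lam) distribution fitted to all the data
tbl <- cbind(obs = tabulate(factor(y, levels = k)),
             exp = length(y) * dpois(k, lam))
rownames(tbl) <- k

# Side-by-side bars: observed vs. expected frequencies
barplot(t(tbl), beside = TRUE, legend.text = c("observed", "expected"))
```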

The two distributions don't look very similar, a fact confirmed when we compare the actual counts for y = 0, 1, 2, etc., with the counts that come from the Poisson(25.65) distribution, for the first few integers:

> dimnames(tbl)[[1]] <- paste(0:50)
> tbl[1:5,]
  obs          exp
0  94 1.899537e-09
1  20 4.872313e-08
2  11 6.248742e-07
3  10 5.342674e-06
4   7 3.425990e-05

The problem is that, while each y_i is Poisson with mean lambda_i = e^{X_i beta}, these means may be very different from one another, and different from the overall mean 25.65, so the raw distribution of the y's may not look very much like a Poisson(25.65) distribution (and it does not, in this case!).
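This observation also suggests what the right reference distribution would be: under the fitted model, the marginal distribution of y is a mixture of Poisson(lambda_i)'s, so the expected number of observations equal to k is the sum over i of P(y_i = k | lambda_i). A sketch with hypothetical fitted means (in the real analysis these would come from predict(glm.1, type="response")):

```r
# Hypothetical fitted means lambda_i, assumed for illustration
lambda_i <- c(0.5, 2, 2, 10, 40, 100)

# Expected frequency of each count k = 0..5 under the fitted model:
# sum the Poisson pmf at k over all the fitted means
exp_mix <- sapply(0:5, function(k) sum(dpois(k, lambda_i)))
exp_mix
```

Comparing the observed frequencies with exp_mix, rather than with a single Poisson(25.65), is the fair test of the Poisson assumption.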

One thing we can do is to compare the values predicted from the model with the actual y's. The predicted values are easy to compute:

> E.y. <- predict(glm.1, type="response")

and then we have to think of a way to compare them with the actual y's. One possibility is just to plot raw residuals y_i - E[y_i] or standardized residuals (y_i - E[y_i])/sqrt(E[y_i]). Here are four possible plots:

> par(mfrow=c(2,2))
> plot(E.y., y - E.y.)
> plot(E.y., (y - E.y.)/sqrt(E.y.))
> plot(y - E.y.)
> plot((y - E.y.)/sqrt(E.y.))

The top two plots are just standard residual plots (with raw and standardized residuals) and they don't help diagnose the problem too much. The bottom two plots begin to suggest something

that we could use to diagnose what's wrong, but the presence of some pretty big outliers makes it hard to see what's actually happening. Let's try a scatterplot with E[y_i] on the x-axis and y_i on the y-axis. After I made the plot on the left, I tried taking the log of both E[y_i] and y_i to get a better sense of what is going on (note that I added 1 before taking the log, to avoid problems with taking the log of zero; this won't change the pattern in the data). This produces the plot on the right.

> plot(E.y., y)
> plot(log(E.y.+1), log(y+1), xlim=c(0,6))

There are many interesting things to see in these plots, but the most striking is that there is a long string of zeros in the observed counts y (if y is zero, then log(y + 1) will still be zero!), corresponding to values of log(E[y] + 1) that range from 1 to 4 or so. There seem to be too many zeros in the observed data for the Poisson regression model! This leads us to consider modeling the zeros, and the other counts, separately, as suggested in the class notes.
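The "too many zeros" impression can be made quantitative: under the fitted model, the expected number of zeros is the sum over i of P(y_i = 0 | lambda_i) = sum of e^{-lambda_i}, which we can compare with the observed number of zeros (94 in the roach data). A sketch with assumed fitted means; with the real data, set lambda_i <- predict(glm.1, type="response"):

```r
# Assumed fitted means for illustration:
# 50 small, 100 moderate, and 112 large apartments' rates
lambda_i <- c(rep(0.2, 50), rep(5, 100), rep(50, 112))

# Expected number of zero counts under the fitted Poisson model;
# compare with sum(y == 0), which is 94 in the roach data
exp_zeros <- sum(exp(-lambda_i))
exp_zeros   # about 41.6 with these assumed means
```

A large gap between the observed and expected zero counts is exactly the pattern that zero-inflated or hurdle models are designed to capture.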