Poisson Regression. The Training Data

Similar documents
Sample solutions. Stat 8051 Homework 8

Exercise 5.4 Solution

Logistic Regressions. Stat 430

Linear Regression Models P8111

Interactions in Logistic Regression

9 Generalized Linear Models

PAPER 206 APPLIED STATISTICS

The Multinomial Model

Poisson Regression. Gelman & Hill Chapter 6. February 6, 2017

Two Hours. Mathematical formula books and statistical tables are to be provided THE UNIVERSITY OF MANCHESTER. 26 May :00 16:00

UNIVERSITY OF TORONTO Faculty of Arts and Science

Log-linear Models for Contingency Tables

R Hints for Chapter 10

Modeling Overdispersion

R Output for Linear Models using functions lm(), gls() & glm()

SCHOOL OF MATHEMATICS AND STATISTICS. Linear and Generalised Linear Models

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

Generalized linear models for binary data. A better graphical exploratory data analysis. The simple linear logistic regression model

A Generalized Linear Model for Binomial Response Data. Copyright c 2017 Dan Nettleton (Iowa State University) Statistics / 46

Regression models. Generalized linear models in R. Normal regression models are not always appropriate. Generalized linear models. Examples.

Checking the Poisson assumption in the Poisson generalized linear model

ssh tap sas913, sas

7/28/15. Review Homework. Overview. Lecture 6: Logistic Regression Analysis

Logistic Regression 21/05

Generalized Linear Models in R

Exam Applied Statistical Regression. Good Luck!

Generalized linear models

Loglinear models. STAT 526 Professor Olga Vitek

Classification. Chapter Introduction. 6.2 The Bayes classifier

Let s see if we can predict whether a student returns or does not return to St. Ambrose for their second year.

Logistic Regression - problem 6.14

Simple logistic regression

Booklet of Code and Output for STAD29/STA 1007 Midterm Exam

Homework 10 - Solution

STA 4504/5503 Sample Exam 1 Spring 2011 Categorical Data Analysis. 1. Indicate whether each of the following is true (T) or false (F).

Statistical Methods III Statistics 212. Problem Set 2 - Answer Key

Poisson Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University

Reaction Days

R code and output of examples in text. Contents. De Jong and Heller GLMs for Insurance Data R code and output. 1 Poisson regression 2

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn

Introduction to General and Generalized Linear Models

Various Issues in Fitting Contingency Tables

12 Modelling Binomial Response Data

STA 450/4000 S: January

A Handbook of Statistical Analyses Using R 2nd Edition. Brian S. Everitt and Torsten Hothorn

Clinical Trials. Olli Saarela. September 18, Dalla Lana School of Public Health University of Toronto.

ST3241 Categorical Data Analysis I Multicategory Logit Models. Logit Models For Nominal Responses

Stat 8053, Fall 2013: Poisson Regression (Faraway, 2.1 & 3)

MS&E 226: Small Data

Matched Pair Data. Stat 557 Heike Hofmann

1. Logistic Regression, One Predictor 2. Inference: Estimating the Parameters 3. Multiple Logistic Regression 4. AIC and BIC in Logistic Regression

MSH3 Generalized linear model Ch. 6 Count data models

Survival Analysis I (CHL5209H)

Ch 6: Multicategory Logit Models

Generalized Linear Models

Analysis of binary repeated measures data with R

1. Hypothesis testing through analysis of deviance. 3. Model & variable selection - stepwise aproaches

Non-Gaussian Response Variables

Lecture 14: Introduction to Poisson Regression

Modelling counts. Lecture 14: Introduction to Poisson Regression. Overview

Neural networks (not in book)

PAPER 218 STATISTICAL LEARNING IN PRACTICE

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Review: what is a linear model. Y = β 0 + β 1 X 1 + β 2 X 2 + A model of the following form:

On the Inference of the Logistic Regression Model

8 Nominal and Ordinal Logistic Regression

Stat 8053, Fall 2013: Multinomial Logistic Models

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator

ST430 Exam 1 with Answers

Chapter 3: Generalized Linear Models

Stat 5102 Final Exam May 14, 2015

Binary Regression. GH Chapter 5, ISL Chapter 4. January 31, 2017

Leftovers. Morris. University Farm. University Farm. Morris. yield

Statistics 203 Introduction to Regression Models and ANOVA Practice Exam

Generalized Linear Models 1

Duration of Unemployment - Analysis of Deviance Table for Nested Models

The GENMOD Procedure. Overview. Getting Started. Syntax. Details. Examples. References. SAS/STAT User's Guide. Book Contents Previous Next

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014

Today. HW 1: due February 4, pm. Aspects of Design CD Chapter 2. Continue with Chapter 2 of ELM. In the News:

Lab 3: Two levels Poisson models (taken from Multilevel and Longitudinal Modeling Using Stata, p )

SAS Software to Fit the Generalized Linear Model

Tento projekt je spolufinancován Evropským sociálním fondem a Státním rozpočtem ČR InoBio CZ.1.07/2.2.00/

ESP 178 Applied Research Methods. 2/23: Quantitative Analysis

22s:152 Applied Linear Regression. RECALL: The Poisson distribution. Let Y be distributed as a Poisson random variable with the single parameter λ.

Explanatory variables are: weight, width of shell, color (medium light, medium, medium dark, dark), and condition of spine.

Consider fitting a model using ordinary least squares (OLS) regression:

Week 7 Multiple factors. Ch , Some miscellaneous parts

Joseph O. Marker Marker Actuarial a Services, LLC and University of Michigan CLRS 2010 Meeting. J. Marker, LSMWP, CLRS 1

Ordinary Least Squares Regression Explained: Vartanian

Model Estimation Example

Hurricane Forecasting Using Regression Modeling

How to work correctly statistically about sex ratio

NATIONAL UNIVERSITY OF SINGAPORE EXAMINATION. ST3241 Categorical Data Analysis. (Semester II: ) April/May, 2011 Time Allowed : 2 Hours

Age 55 (x = 1) Age < 55 (x = 0)

Generalized Linear Models

Cherry.R. > cherry d h v <portion omitted> > # Step 1.

Generalised linear models. Response variable can take a number of different formats

Booklet of Code and Output for STAD29/STA 1007 Midterm Exam

Logistic Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University

Generalized linear models

Transcription:

The Training Data Poisson Regression Office workers at a large insurance company are randomly assigned to one of 3 computer use training programmes, and their number of calls to IT support during the following month is recorded. Additional information on each worker includes years of experience and score on a computer literacy test (out of 100). It is reasonable to model calls to IT support as a Poisson process, and the question is whether training programme affects the rate of the process. Could test H 0 : λ 1 =λ 2 =λ 3 with a likelihood ratio test, but... > train = read.table("http://fisher.utstat.utoronto.ca/~brunner/appliedf12/data/train ing.data") > train[1:4,] Program Experience Score Support 1 A 3.92 60 6 2 A 5.83 64 3 3 A 0.92 51 8 4 A 8.50 58 2 > attach(train) > table(support) Support 0 1 2 3 4 5 6 7 8 9 10 11 12 6 27 42 61 70 39 23 17 9 2 2 1 1 > aggregate(support,by=list(program),fun=mean) Group.1 x 1 A 4.07 2 B 3.47 3 C 4.05 > aggregate(support,by=list(program),fun=length) Group.1 x 1 A 100 2 B 100 3 C 100 > Page 1 of 7

> model1 = glm(support ~ Program, family=poisson) > summary(model1) Call: glm(formula = Support ~ Program, family = poisson) Deviance Residuals: Min 1Q Median 3Q Max -2.8531-0.6319-0.0348 0.4552 3.1765 Coefficients: Estimate Std. Error z value Pr(> z ) (Intercept) 1.403643 0.049567 28.318 <2e-16 *** ProgramB -0.159488 0.073066-2.183 0.0291 * ProgramC -0.004926 0.070185-0.070 0.9440 (Dispersion parameter for poisson family taken to be 1) Null deviance: 330.39 on 299 degrees of freedom Residual deviance: 324.26 on 297 degrees of freedom AIC: 1250.2 Number of Fisher Scoring iterations: 4 > anova(model1,test="chisq") # Overall likelihood ratio test Analysis of Deviance Table Model: poisson, link: log Response: Support Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev Pr(>Chi) NULL 299 330.39 Program 2 6.122 297 324.26 0.04684 * Page 2 of 7

> # Include covariates > model2 = glm(support ~ Score+Experience+Program, family=poisson) > summary(model2) Call: glm(formula = Support ~ Score + Experience + Program, family = poisson) Deviance Residuals: Min 1Q Median 3Q Max -2.9625-0.6957-0.1018 0.5362 2.9386 Coefficients: Estimate Std. Error z value Pr(> z ) (Intercept) 1.992744 0.159223 12.515 < 2e-16 *** Score -0.009205 0.003019-3.049 0.00230 ** Experience -0.028014 0.010317-2.715 0.00662 ** ProgramB -0.170519 0.073163-2.331 0.01977 * ProgramC -0.007833 0.070218-0.112 0.91118 (Dispersion parameter for poisson family taken to be 1) Null deviance: 330.39 on 299 degrees of freedom Residual deviance: 305.90 on 295 degrees of freedom AIC: 1235.8 Number of Fisher Scoring iterations: 4 > anova(model2,test="chisq") # Sequential Analysis of Deviance Table Model: poisson, link: log Response: Support Terms added sequentially (first to last) Df Deviance Resid. Df Resid. Dev Pr(>Chi) NULL 299 330.39 Score 1 9.9766 298 320.41 0.001585 ** Experience 1 7.6333 297 312.78 0.005730 ** Program 2 6.8767 295 305.90 0.032118 * Page 3 of 7

> # Wald test for program > > WaldTest = function(l,thetahat,vn,h=0) # H0: L theta = h + # Note Vn is the asymptotic covariance matrix, so it's the + # Consistent estimator divided by n. For true Wald tests + # based on numerical MLEs, just use the inverse of the Hessian. + { + WaldTest = numeric(3) + names(waldtest) = c("w","df","p-value") + r = dim(l)[1] + W = t(l%*%thetahat-h) %*% solve(l%*%vn%*%t(l)) %*% + (L%*%thetahat-h) + W = as.numeric(w) + pval = 1-pchisq(W,r) + WaldTest[1] = W; WaldTest[2] = r; WaldTest[3] = pval + WaldTest + } # End function WaldTest > > Lprog = rbind(c(0,0,0,1,0), + c(0,0,0,0,1) ) > WaldTest(L=Lprog,thetahat=model2$coefficients,Vn=vcov(model2)) W df p-value 6.73350088 2.00000000 0.03450157 > # Compare G^2 = 6.8767, df=2, p=0.032118 Page 4 of 7

Back to the Multinomial Jobs Example Students at a trade school are either employed in their field of study, employed outside their field of study, or unemployed. University administrators recognize that the percentage of students who are unemployed after graduation will vary depending upon economic conditions, but they claim that still, about twice as many students will be employed in a job related to their field of study, compared to those who get an unrelated job. To test this hypothesis, they select a random sample of 200 students from the most recent class, and observe 106 employed in a job related to their field of study, 74 employed in a job unrelated to their field of study, and 20 unemployed. Under a multinomial model, Senseless H 0 : π 1 =π 2 =π 3 yielded G 2 = 65.6, df=2 H 0 : π 1 =2π 2, G 2 = 4.739, df=1 Maybe the numbers of students in each category are independent Poisson RVs. Recall that independent Poissons conditional on their total are multinomial, with So what we did earlier could be justified under the present assumptions. > jobz = read.table(stdin()) # Read from standard input 0: Job Freq 1: 1 Related 106 2: 2 Unrelated 74 3: 3 Unemployed 20 4: > # End with Ctrl-D on Unix (Mac) or Ctrl-Z on Windows > jobz Job Freq 1 Related 106 2 Unrelated 74 3 Unemployed 20 Page 5 of 7

> freq = jobz$freq > job = factor(jobz$job) > contrasts(job) Unemployed Unrelated Related 0 0 Unemployed 1 0 Unrelated 0 1 Job Status d 1 d 2 log λ = β 0 + β 1 d 1 + β 2 d 2 λ Related 0 0 β 0 exp(β 0 ) Unemployed 1 0 β 0 + β 1 exp(β 0 ) exp(β 1 ) Unrelated 0 1 β 0 + β 2 exp(β 0 ) exp(β 2 ) On average, we expect exp(β 1 ) times as many unemployed students as students with jobs related to their fields of study. > full0 = glm(freq~job,family=poisson) > summary(full0) Call: glm(formula = freq ~ job, family = poisson) Deviance Residuals: [1] 0 0 0 Coefficients: Estimate Std. Error z value Pr(> z ) (Intercept) 4.66344 0.09713 48.013 < 2e-16 *** jobunemployed -1.66771 0.24379-6.841 7.88e-12 *** jobunrelated -0.35937 0.15148-2.372 0.0177 * (Dispersion parameter for poisson family taken to be 1) Null deviance: 6.5598e+01 on 2 degrees of freedom Residual deviance: -7.9936e-15 on 0 degrees of freedom AIC: 23.489 Page 6 of 7

Number of Fisher Scoring iterations: 3 > full0$null.deviance # LR test, compare G^2 = 65.6 [1] 65.59798 Better null hypothesis > # Offset "can be used to specify an a priori known component > # to be included in the linear predictor during fitting. This should > # be NULL or a numeric vector of length either one or equal to the > # number of cases." > freq [1] 106 74 20 > d1 = c(0,1,0) > d2 = c(0,0,1) > red0 = glm(freq ~ d2, offset=-log(2)*d1,family=poisson) > anova(red0,full0,test='chisq') # Compare 4.739 Analysis of Deviance Table Model 1: freq ~ d2 Model 2: freq ~ job Resid. Df Resid. Dev Df Deviance Pr(>Chi) 1 1 4.7395 2 0 0.0000 1 4.7395 0.02948 * Page 7 of 7