Model selection and comparison

Similar documents
Flexible Regression Models for Count Data Based on Renewal Processes: The Countr Package

Improving the Precision of Estimation by fitting a Generalized Linear Model, and Quasi-likelihood.

Parametric Modelling of Over-dispersed Count Data. Part III / MMath (Applied Statistics) 1

Are Goals Poisson Distributed?

Generalized Additive Models

A Handbook of Statistical Analyses Using R. Brian S. Everitt and Torsten Hothorn

Class Notes: Week 8. Probit versus Logit Link Functions and Count Data

Homework 5: Answer Key. Plausible Model: E(y) = µt. The expected number of arrests arrests equals a constant times the number who attend the game.

Outline. Mixed models in R using the lme4 package Part 3: Longitudinal data. Sleep deprivation data. Simple longitudinal data

Overdispersion Workshop in generalized linear models Uppsala, June 11-12, Outline. Overdispersion

You can specify the response in the form of a single variable or in the form of a ratio of two variables denoted events/trials.

Generalized linear models

Chi-Squared Tests. Semester 1. Chi-Squared Tests

A Handbook of Statistical Analyses Using R 2nd Edition. Brian S. Everitt and Torsten Hothorn

Testing and Model Selection

Subject CS1 Actuarial Statistics 1 Core Principles

Generalized Linear Models

robmixglm: An R package for robust analysis using mixtures

Application of Poisson and Negative Binomial Regression Models in Modelling Oil Spill Data in the Niger Delta

Stat/F&W Ecol/Hort 572 Review Points Ané, Spring 2010

MISCELLANEOUS TOPICS RELATED TO LIKELIHOOD. Copyright c 2012 (Iowa State University) Statistics / 30

9. Model Selection. statistical models. overview of model selection. information criteria. goodness-of-fit measures

Zero inflated negative binomial-generalized exponential distribution and its applications

Tento projekt je spolufinancován Evropským sociálním fondem a Státním rozpočtem ČR InoBio CZ.1.07/2.2.00/

Prediction of Bike Rental using Model Reuse Strategy

Institute of Actuaries of India

Generalized Linear Models (GLZ)

CHOOSING AMONG GENERALIZED LINEAR MODELS APPLIED TO MEDICAL DATA

ADVANCED STATISTICAL ANALYSIS OF EPIDEMIOLOGICAL STUDIES. Cox s regression analysis Time dependent explanatory variables

Model Estimation Example

Joseph O. Marker Marker Actuarial a Services, LLC and University of Michigan CLRS 2010 Meeting. J. Marker, LSMWP, CLRS 1

Subject-specific observed profiles of log(fev1) vs age First 50 subjects in Six Cities Study

Two Hours. Mathematical formula books and statistical tables are to be provided THE UNIVERSITY OF MANCHESTER. 26 May :00 16:00

Analysis of Count Data A Business Perspective. George J. Hurley Sr. Research Manager The Hershey Company Milwaukee June 2013

Mohammed. Research in Pharmacoepidemiology National School of Pharmacy, University of Otago

Chapter 1 Statistical Inference

The GENMOD Procedure (Book Excerpt)

Poisson Regression. Ryan Godwin. ECON University of Manitoba

A strategy for modelling count data which may have extra zeros

Bivariate Weibull-power series class of distributions

CHAPTER 1: BINARY LOGIT MODEL

Regression modeling for categorical data. Part II : Model selection and prediction

Package flexcwm. R topics documented: July 11, Type Package. Title Flexible Cluster-Weighted Modeling. Version 1.0.

Generalised linear models. Response variable can take a number of different formats

A brief introduction to mixed models

UNIVERSITY OF TORONTO. Faculty of Arts and Science APRIL 2010 EXAMINATIONS STA 303 H1S / STA 1002 HS. Duration - 3 hours. Aids Allowed: Calculator

ME3620. Theory of Engineering Experimentation. Spring Chapter IV. Decision Making for a Single Sample. Chapter IV

Non-Gaussian Response Variables

MODELING COUNT DATA Joseph M. Hilbe

An Introduction to Path Analysis

Aedes egg laying behavior Erika Mudrak, CSCU November 7, 2018

Exercise 5.4 Solution

Logistic Regression. Interpretation of linear regression. Other types of outcomes. 0-1 response variable: Wound infection. Usual linear regression

LISA Short Course Series Generalized Linear Models (GLMs) & Categorical Data Analysis (CDA) in R. Liang (Sally) Shan Nov. 4, 2014

SAS/STAT 14.2 User s Guide. The GENMOD Procedure

Topic 21 Goodness of Fit

Preface Introduction to Statistics and Data Analysis Overview: Statistical Inference, Samples, Populations, and Experimental Design The Role of

Multilevel Statistical Models: 3 rd edition, 2003 Contents

Poisson Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University

CHAPTER 6: SPECIFICATION VARIABLES

lme4 Luke Chang Last Revised July 16, Fitting Linear Mixed Models with a Varying Intercept

Statistical Model Of Road Traffic Crashes Data In Anambra State, Nigeria: A Poisson Regression Approach

Weighted tests of homogeneity for testing the number of components in a mixture

Stat 5101 Lecture Notes

Tables Table A Table B Table C Table D Table E 675

Group Comparisons: Differences in Composition Versus Differences in Models and Effects

Hypothesis testing:power, test statistic CMS:

Analytics Software. Beyond deterministic chain ladder reserves. Neil Covington Director of Solutions Management GI

Poisson Data. Handout #4

Modeling Overdispersion

Using R in 200D Luke Sonnet

Bootstrap Procedures for Testing Homogeneity Hypotheses

Mixtures of Negative Binomial distributions for modelling overdispersion in RNA-Seq data

Unit 5 Logistic Regression Practice Problems

Analysis of Longitudinal Data: Comparison Between PROC GLM and PROC MIXED. Maribeth Johnson Medical College of Georgia Augusta, GA

Outline. Mixed models in R using the lme4 package Part 5: Generalized linear mixed models. Parts of LMMs carried over to GLMMs

Machine Learning Linear Regression. Prof. Matteo Matteucci

Gaussian Mixtures. ## Type 'citation("mclust")' for citing this R package in publications.

UNIVERSITY OF TORONTO Faculty of Arts and Science

Testing Statistical Hypotheses

Generalized Linear Models

GLM models and OLS regression

A (Brief) Introduction to Crossed Random Effects Models for Repeated Measures Data

Generalized Linear. Mixed Models. Methods and Applications. Modern Concepts, Walter W. Stroup. Texts in Statistical Science.

EXAMINATIONS OF THE ROYAL STATISTICAL SOCIETY

Glossary. The ISI glossary of statistical terms provides definitions in a number of different languages:

SAS Software to Fit the Generalized Linear Model

Section Poisson Regression

Phd Program in Transportation. Transport Demand Modeling. Session 8

Determining the number of components in mixture models for hierarchical data

Q30b Moyale Observed counts. The FREQ Procedure. Table 1 of type by response. Controlling for site=moyale. Improved (1+2) Same (3) Group only

Generalized linear models

How to deal with non-linear count data? Macro-invertebrates in wetlands

Lecture 8. Poisson models for counts

A course in statistical modelling. session 06b: Modelling count data

Logistic Regression. James H. Steiger. Department of Psychology and Human Development Vanderbilt University

Mixed models in R using the lme4 package Part 5: Generalized linear mixed models

Introduction to the Generalized Linear Model: Logistic regression and Poisson regression

Package flexcwm. R topics documented: July 2, Type Package. Title Flexible Cluster-Weighted Modeling. Version 1.1.

Regression modeling for categorical data. Part II : Model selection and prediction

Transcription:

Model selection and comparison an example with package Countr Tarak Kharrat 1 and Georgi N. Boshnakov 2 1 Salford Business School, University of Salford, UK. 2 School of Mathematics, University of Manchester, UK. November 15, 2017 Abstract This document describes a strategy to choose between various possible count models. The computation described in this document is done in R (R Core Team, 2016 using the contributed package Countr (Kharrat and Boshnakov, 2017 and the quine data shipped with the MASS package (Venables and Ripley, 2002. The ideas used here are inspired by the demand for medical care example detailed in Cameron and Trivedi (2013, Section 6.3. 1 Prerequisites We will do the analysis of the data with package Countr, so we load it: library(countr library("mass" # for glm.nb( Packages dplyr (Wickham and Francois, 2016 and xtable (Dahl, 2016 provide usefull facilities for data manipulation and presentation: library(dplyr library(xtable 2 Data The dataset used in this example is the quine data shipped with package MASS (Venables and Ripley, 2002 and first analysed in Aitkin (1978. The data can be loaded in the usual way: data(quine, package = "MASS" The dataset gives the number of days absent from school (Days of 146 children in a particular school year. A number of explanatory variables are available describing the children s ethnic background (Eth, sex (Sex, age (Age and learner status (Lrn. The count variable Days is characterised by large overdispersion the variance is more than 16 times larger the mean, 264.2 versus 16.46. Table 1 gives an idea about its distribution. The entries in the table were calculated as follows: breaks_ <- c(0, 1, 3, 5:7, 9, 12, 15, 17, 23, 27, 32 freqtable <- count_table(count = quine$days, breaks = breaks_, formatchar = TRUE 1

0 1-2 3-4 5 6 7-8 9-11 Frequency 9 10 7 19 8 10 13 Relative frequency 0.062 0.068 0.048 0.13 0.055 0.068 0.089 12-14 15-16 17-22 23-26 27-31 >= 32 Frequency 13 6 14 6 6 25 Relative frequency 0.089 0.041 0.096 0.041 0.041 0.17 Table 1: quine data: Frequency distribution of column Days. 3 Models for quine data Given the extreme over-dispersion observed in the quine data, we do not expect the Poisson data to perform well. Nevertheless, we can still use this model as a starting point and treat it as a benchmark (any model worse than Poisson should be strongly rejected. Besides, the negative binomial (as implemented in MASS:glm.nb( and 3 renewal-count models based on weibull, gamma and generalised-gamma inter-arrival times are also considered giving in total 5 models to choose from. The code used to fit the 5 models is provided below: quine_form <- as.formula(days ~ Eth + Sex + Age + Lrn pois <- glm(quine_form, family = poisson(, data = quine nb <- glm.nb(quine_form, data = quine ## various renewal models wei <- renewalcount(formula = quine_form, data = quine, dist = "weibull", method = c("nlminb", "Nelder-Mead","BFGS", convpars = list(convmethod = "depril" gam <- renewalcount(formula = quine_form, data = quine, dist = "gamma", method = "nlminb", convpars = list(convmethod = "depril" gengam <- renewalcount(formula = quine_form, data = quine, dist = "gengamma", method = "nlminb", convpars = list(convmethod = "depril" Note that it is possible to try several optimisation algorithms in a single call to renewalcount(, as in the computation of wei above for the weibull-count model. The function will choose the best performing one in terms of value of the objective function (largest log-likelihood. It also 2

reports the computation time of each method, which may be useful for future computations (for example, bootstrap. 4 Model Selection and Comparison The different models considered here are fully parametric. Therefore, a straightforward method to use to discriminate between models is a likelihood ratio test (LR. This is possible when models are nested and in this case the LR statistic has the usual χ 2 (p distribution, where p is the difference in the number of parameters in the model. In this current study, we can compare all the renewal-count models against Poisson, negative-binomial against Poisson, weibull-count against generalised-gamma and gamma against the generalised-gamma. For non-nested models, the standard approach is to use information criteria such as the Akaike information criterion (AIC and the Bayesian information criterion (BIC. This method can be applied to discriminate between weibull and gamma renewal count models and between these two models and the negative binomial. Therefore, a possible strategy can be summarised as follows: Use the LR test to compare Poisson with negative binomial. Use the LR test to compare Poisson with weibull-count. Use the LR test to compare Poisson with gamma-count. Use the LR test to compare Poisson with generalised-gamma-count. Use the LR test to compare weibull-count with generalised-gamma-count. Use the LR test to compare gamma-count with generalised-gamma-count. Use information criteria to compare gamma-count with weibull-count. Use information criteria to compare weibull-count to negative binomial. Alternative.model Chisq Df Critical_value 1 negative-binomial 1192.03 1.00 3.84 2 weibull 1193.21 1.00 3.84 3 gamma 1193.36 1.00 3.84 4 generalised-gamma 1194.46 2.00 5.99 Table 2: LR results against Poisson model. Each row compares an alternative model vs the Poisson model. All alternatives are preferable to Poisson. As observed in Table 2, the LR test rejects the null hypothesis and all the alternative models are preferred to Poisson. This due to the large over-dispersion discussed in the previous section. Next, we use the LR test to discriminate between the renewal count models: Model Chisq Df Critical_value 1 weibull 1.25 1.00 3.84 2 gamma 1.10 1.00 3.84 Table 3: LR results against generalised-gamma model The results of Table 3 suggest that the weibull and gamma models should be preferred to the generalised gamma model. Finally, we use information criteria to choose the best model among the weibull and gamma renewal models and the negative binomial: ic <- data.frame(model = c("weibull", "gamma", "negative-binomial", AIC = c(aic(wei, AIC(gam, AIC(nb, 3

BIC = c(bic(wei, BIC(gam, BIC(nb, stringsasfactors = FALSE print(xtable(ic, caption = "Information criteria results", label = "tab:ic_models" Model AIC BIC 1 weibull 1107.98 1131.84 2 gamma 1107.83 1131.70 3 negative-binomial 1109.15 1133.02 Table 4: Information criteria results It is worth noting here that all models have the same degree of freedom and hence comparing information criteria is equivalent to comparing the likelihood value. According to Table 4, the gamma renewal model is slightly preferred to the weibull model. 5 Goodness-of-fit We conclude this analysis by running a formal chi-square goodness of fit test (Cameron and Trivedi, 2013, Section 5.3.4 to the results of the previously selected model. gof <- chisq_gof(gam, breaks = breaks_ print(gof chi-square goodness-of-fit test Cells considered 0 1-2 3-4 5 6 7-8 9-11 12-14 15-16 17-22 23-26 27-31 >= 32 DF Chisq Pr(>Chisq 1 12 17.502 0.1317 The null hypothesis cannot be rejected at standard confidence levels and conclude that the selected model presents a good fit to the data. User can check that the same test yields similar results for the weibull and negative binomial models but comfortably rejects the null hypothesis for the Poisson and generalised gamma models. These results confirm what we observed in the previous section. We conclude this document by saving the work space to avoid re-running the computation in future exportation of the document: save.image( References Aitkin, M. (1978. The analysis of unbalanced cross-classifications. Journal of the Royal Statistical Society. Series A (General, pages 195 223. Cameron, A. C. and Trivedi, P. K. (2013. Regression analysis of count data, volume 53. Cambridge university press. 4

Dahl, D. B. (2016. xtable: Export Tables to LaTeX or HTML. R package version 1.8-2. Kharrat, T. and Boshnakov, G. N. (2017. Countr: Flexible Univariate Count Models Based on Renewal Processes. R package version 3.4.0. R Core Team (2016. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. Venables, W. N. and Ripley, B. D. (2002. Modern Applied Statistics with S. Springer, New York, fourth edition. ISBN 0-387-95457-0. Wickham, H. and Francois, R. (2016. dplyr: A Grammar of Data Manipulation. R package version 0.5.0. 5