MODELING COUNT DATA Joseph M. Hilbe

Arizona State University

Count models are a subset of discrete response regression models. Count data are distributed as non-negative integers, are intrinsically heteroskedastic and right skewed, and have a variance that increases with the mean. Examples of count data include length of hospital stay, the number of a certain species of fish per defined area in the ocean, the number of lights displayed by fireflies over specified time periods, and the classic case of the number of deaths among Prussian soldiers resulting from being kicked by a horse, recorded by Bortkiewicz for the years 1875 to 1894.

Poisson regression is the basic model on which a variety of count models are based. It is derived from the Poisson probability mass function, which can be expressed as

$$ f(y_i;\lambda_i) = \frac{e^{-t_i\lambda_i}\,(t_i\lambda_i)^{y_i}}{y_i!}, \qquad y_i = 0,1,2,\ldots;\ \lambda_i > 0 \tag{1} $$

with $y_i$ as the count response, $\lambda_i$ as the predicted count or rate parameter, and $t_i$ the area or time in which counts enter the model. When $\lambda_i$ is understood as applying to individual counts without consideration of size or time, $t_i = 1$. When $t_i > 1$, it is commonly referred to as an exposure, and its logarithm enters the model as an offset.

Estimation of the Poisson model is based on the log-likelihood parameterization of the Poisson probability distribution, which is aimed at determining the parameter values that make the data most likely. In exponential family form it is given as

$$ L(\mu_i;y_i) = \sum_{i=1}^{n} \left\{ y_i \ln(\mu_i) - \mu_i - \ln(y_i!) \right\}, \tag{2} $$

where $\mu_i$ is typically used to symbolize the predicted count in place of $\lambda_i$. Equation 2, or the deviance function based on it, is used when the Poisson model is estimated as a generalized linear model (GLM) (see Generalized linear models). When estimation employs a full maximum likelihood algorithm, $\mu_i$ is expressed in terms of the linear predictor, $x_i'\beta$. As such it appears as

$$ \mu_i = \exp(x_i'\beta). \tag{3} $$

In this form, the Poisson log-likelihood function is expressed as

$$ L(\beta;y_i) = \sum_{i=1}^{n} \left\{ y_i (x_i'\beta) - \exp(x_i'\beta) - \ln(y_i!) \right\}. \tag{4} $$
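As an illustration of Equation 4, the following minimal sketch evaluates the Poisson log-likelihood and maximizes it by Newton-Raphson for a model with an intercept and one binary predictor. The function names (`poisson_loglik`, `fit_poisson_2param`) and the twelve data values are hypothetical, made up for this example; this is not the estimation routine of any particular package.

```python
import math

def poisson_loglik(beta, X, y):
    """Equation 4: sum of y_i*(x_i'beta) - exp(x_i'beta) - ln(y_i!)."""
    total = 0.0
    for xi, yi in zip(X, y):
        eta = sum(b * x for b, x in zip(beta, xi))  # linear predictor x_i'beta
        total += yi * eta - math.exp(eta) - math.lgamma(yi + 1)
    return total

def fit_poisson_2param(X, y, iterations=25):
    """Newton-Raphson for a model with an intercept and one predictor."""
    ybar = sum(y) / len(y)
    b0, b1 = math.log(ybar), 0.0  # start at the intercept-only solution
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for (x0, x1), yi in zip(X, y):
            mu = math.exp(b0 * x0 + b1 * x1)  # Equation 3
            resid = yi - mu
            g0 += resid * x0          # gradient: sum (y - mu) x
            g1 += resid * x1
            h00 += mu * x0 * x0       # negative Hessian: sum mu x x'
            h01 += mu * x0 * x1
            h11 += mu * x1 * x1
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det  # solve the 2x2 system by hand
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1

# Hypothetical data: six observations in each of two groups.
group = [0] * 6 + [1] * 6
y = [0, 1, 2, 1, 0, 3, 2, 4, 3, 5, 1, 3]
X = [(1.0, float(g)) for g in group]

b0, b1 = fit_poisson_2param(X, y)
rate_ratio = math.exp(b1)  # ratio of the two group means
print(b0, b1, rate_ratio, poisson_loglik((b0, b1), X, y))
```

With a single binary predictor the maximum likelihood fit reproduces the group means, so `rate_ratio` equals the ratio of the mean counts in the two groups (18/7 for these made-up data).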

A key feature of the Poisson model is the equality of the mean and variance functions. When the variance of a Poisson model exceeds its mean, the model is termed overdispersed. Simulation studies have demonstrated that overdispersion is indicated when the Pearson $\chi^2$ dispersion is greater than 1.0 (Hilbe, 2007). The dispersion statistic is defined as the Pearson $\chi^2$ divided by the model residual degrees of freedom. Overdispersion, common to most Poisson models, leads to underestimated standard errors, so that predictors may appear to be significant when they are not. When Poisson overdispersion is real, and not merely apparent (Hilbe, 2007), a count model other than Poisson is required.

Several methods have been used to accommodate Poisson overdispersion. Two common methods are quasi-Poisson and negative binomial regression. Quasi-Poisson models have generally been understood in two distinct manners. The traditional manner has the Poisson variance multiplied by a constant term. The second, employed in the glm() function that is part of the default R installation, is to multiply the standard errors by the square root of the Pearson dispersion statistic. This method of adjustment to the variance has traditionally been referred to as scaling. Using R's quasipoisson() function is the same as what is known in standard GLM terminology as the scaling of standard errors.

The traditional negative binomial model is a Poisson-gamma mixture model with a second ancillary or heterogeneity parameter, $\alpha$. The mixture nature of the variance is reflected in its form, $\mu_i + \alpha\mu_i^2$, or $\mu_i(1+\alpha\mu_i)$. The Poisson variance is $\mu_i$, and the two-parameter gamma variance is $\mu_i^2/\nu$. The parameter $\nu$ is inverted so that $\alpha = 1/\nu$, which allows for a direct relationship between $\mu_i$ and $\nu$. As a Poisson-gamma mixture model, counts are gamma distributed as they enter into the model. The parameter $\alpha$ describes the shape with which counts enter the model and is also a measure of the amount of Poisson overdispersion in the data.
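The Poisson-gamma mixture and the Pearson dispersion statistic described above can be illustrated with a small simulation. This is a sketch under stated assumptions: the values $\mu = 3$ and $\alpha = 0.5$, the sample size, and the helper names are all hypothetical, and the Poisson model fitted here is the intercept-only model, whose fitted mean is simply the sample mean.

```python
import math
import random

def draw_poisson(lam, rng):
    """Knuth's multiplication method; adequate for small lam."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k

def simulate_poisson_gamma(n, mu, alpha, seed=0):
    """Counts whose Poisson mean is mu times a gamma variate with mean 1."""
    rng = random.Random(seed)
    shape = 1.0 / alpha  # nu = 1/alpha
    counts = []
    for _ in range(n):
        g = rng.gammavariate(shape, alpha)  # mean 1, variance alpha
        counts.append(draw_poisson(mu * g, rng))
    return counts

def pearson_dispersion(y):
    """Pearson chi-square over residual df for an intercept-only Poisson fit."""
    n = len(y)
    ybar = sum(y) / n  # fitted mean of the intercept-only model
    chi2 = sum((v - ybar) ** 2 / ybar for v in y)
    return chi2 / (n - 1)

mu, alpha = 3.0, 0.5  # assumed values for illustration
y = simulate_poisson_gamma(100_000, mu, alpha)
dispersion = pearson_dispersion(y)

# Model standard error of the intercept, and the scaled version.
ybar = sum(y) / len(y)
naive_se = 1.0 / math.sqrt(len(y) * ybar)
scaled_se = naive_se * math.sqrt(dispersion)

print("dispersion:", dispersion, "theoretical:", 1.0 + alpha * mu)
print("naive SE:", naive_se, "scaled SE:", scaled_se)
```

Because the variance of the mixture is $\mu(1+\alpha\mu)$, the dispersion statistic should be close to $1 + \alpha\mu = 2.5$ here, well above the value of 1.0 expected under a Poisson model.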
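The negative binomial probability mass function (see Geometric and negative binomial distributions) may be formulated as

$$ f(y_i;\mu_i,\alpha) = \binom{y_i + 1/\alpha - 1}{1/\alpha - 1} \left(\frac{1}{1+\alpha\mu_i}\right)^{1/\alpha} \left(\frac{\alpha\mu_i}{1+\alpha\mu_i}\right)^{y_i}, \tag{5} $$

with a log-likelihood function specified as

$$ L(\mu_i;y_i,\alpha) = \sum_{i=1}^{n} \left\{ y_i \ln\!\left(\frac{\alpha\mu_i}{1+\alpha\mu_i}\right) - \frac{1}{\alpha}\ln(1+\alpha\mu_i) + \ln\Gamma\!\left(y_i + \frac{1}{\alpha}\right) - \ln\Gamma(y_i+1) - \ln\Gamma\!\left(\frac{1}{\alpha}\right) \right\}. \tag{6} $$

In terms of $\mu_i = \exp(x_i'\beta)$, required for maximum likelihood estimation, the negative binomial log-likelihood appears as

$$ L(\beta;y_i,\alpha) = \sum_{i=1}^{n} \left\{ y_i \ln\!\left(\frac{\alpha\exp(x_i'\beta)}{1+\alpha\exp(x_i'\beta)}\right) - \frac{1}{\alpha}\ln\bigl(1+\alpha\exp(x_i'\beta)\bigr) + \ln\Gamma\!\left(y_i + \frac{1}{\alpha}\right) - \ln\Gamma(y_i+1) - \ln\Gamma\!\left(\frac{1}{\alpha}\right) \right\}. \tag{7} $$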
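The negative binomial log-likelihood in Equation 6 can be checked numerically. The sketch below, with hypothetical function names and assumed values $\mu = 3$ and $\alpha = 0.5$, evaluates the per-observation term of Equation 6, confirms that the implied probabilities sum to one with mean $\mu$ and variance $\mu + \alpha\mu^2$, and compares the result with the Poisson log-probability for a very small $\alpha$.

```python
import math

def nb2_logpmf(y, mu, alpha):
    """Per-observation term of Equation 6."""
    inv = 1.0 / alpha
    return (y * math.log(alpha * mu / (1.0 + alpha * mu))
            - inv * math.log1p(alpha * mu)
            + math.lgamma(y + inv) - math.lgamma(y + 1) - math.lgamma(inv))

def poisson_logpmf(y, mu):
    """Per-observation term of Equation 2."""
    return y * math.log(mu) - mu - math.lgamma(y + 1)

mu, alpha = 3.0, 0.5  # assumed values for illustration

# Probabilities over a range wide enough that the omitted tail is negligible.
probs = [math.exp(nb2_logpmf(y, mu, alpha)) for y in range(400)]
total = sum(probs)
mean = sum(y * p for y, p in enumerate(probs))
var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))

# As alpha approaches 0 the log-probabilities approach the Poisson values.
gap = max(abs(nb2_logpmf(y, mu, 1e-6) - poisson_logpmf(y, mu))
          for y in range(15))

print(total, mean, var, mu + alpha * mu ** 2, gap)
```

The function is undefined at exactly $\alpha = 0$, so the Poisson limit is examined with a small positive value rather than zero itself.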

This form of negative binomial has been termed NB2, due to the quadratic nature of its variance function. It should be noted that the NB2 model reduces to the Poisson when $\alpha = 0$. When $\alpha = 1$, the model is geometric, taking the shape of the discrete correlate of the continuous negative exponential distribution. Several fit tests exist that evaluate whether data should be modeled as Poisson or NB2 based on the degree to which $\alpha$ differs from 0.

When exponentiated, Poisson and NB2 parameter estimates may be interpreted as incidence rate ratios. For example, given a random sample of 1000 patient observations from the German Health Survey for the year 1984, the following Poisson model output explains the year's expected number of doctor visits on the basis of gender and marital status, both recorded as binary (1/0) variables, and the continuous predictor, age.

docvis     IRR         OIM Std. Err.   z       P>|z|   [95% Conf. Interval]
female     1.516855    .054906         11.51   0.000   1.41297     1.628378
married    .8418408    .0341971        -4.24   0.000   .7774145    .9116063
age        1.018807    .0016104        11.79   0.000   1.015656    1.021968

The estimates may be interpreted as:

Females are expected to visit the doctor some 50% more times during the year than males, holding marital status and age constant.

Married patients are expected to visit the doctor some 16% fewer times during the year than unmarried patients, holding gender and age constant.

For a one-year increase in age, the rate of visits to the doctor increases by some 2%, with marital status and gender held constant.

It is important to understand that the canonical form of the negative binomial, when considered as a GLM, is not NB2. Nor is the canonical negative binomial model, NB-C, appropriate for evaluating the amount of Poisson overdispersion in a data situation. The NB-C parameterization of the negative binomial is directly derived from the negative binomial log-likelihood as expressed in Equation 6. As such, the link function is calculated as $\ln(\alpha\mu/(1+\alpha\mu))$. The inverse link function, or mean, expressed in terms of $x'\beta$, is $1/(\alpha(\exp(-x'\beta) - 1))$. When estimated as a GLM, NB-C can be amended to NB2 form by substituting $\ln(\mu)$ and $\exp(x'\beta)$ respectively for the two above expressions. Additional amendments need to be made to have the GLM-estimated NB2 display the same parameter standard errors as are calculated using full maximum likelihood estimation.

The NB-C log-likelihood, expressed in terms of $\mu$, is identical to that of the NB2 function. However, when parameterized in terms of $x'\beta$, the two differ, with the NB-C appearing as

$$ L(\beta;y_i,\alpha) = \sum_{i=1}^{n} \left\{ y_i (x_i'\beta) + \frac{1}{\alpha}\ln\bigl(1 - \exp(x_i'\beta)\bigr) + \ln\Gamma\!\left(y_i + \frac{1}{\alpha}\right) - \ln\Gamma(y_i+1) - \ln\Gamma\!\left(\frac{1}{\alpha}\right) \right\}. \tag{8} $$
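The relationship between the NB-C link, its inverse, and the two forms of the log-likelihood can be verified with a short numerical check. In this sketch the function names and the values of $y$, $\mu$, and $\alpha$ are hypothetical. It confirms that the inverse link recovers $\mu$ and that the per-observation term of Equation 8, evaluated at the canonical linear predictor, equals the per-observation term of Equation 6.

```python
import math

def nbc_link(mu, alpha):
    """Canonical link: eta = ln(alpha*mu / (1 + alpha*mu)); always negative."""
    return math.log(alpha * mu / (1.0 + alpha * mu))

def nbc_inverse_link(eta, alpha):
    """Mean in terms of eta: 1 / (alpha * (exp(-eta) - 1))."""
    return 1.0 / (alpha * (math.exp(-eta) - 1.0))

def nb_loglik_mu(y, mu, alpha):
    """Per-observation term of Equation 6, parameterized by mu."""
    inv = 1.0 / alpha
    return (y * math.log(alpha * mu / (1.0 + alpha * mu))
            - inv * math.log(1.0 + alpha * mu)
            + math.lgamma(y + inv) - math.lgamma(y + 1) - math.lgamma(inv))

def nbc_loglik_eta(y, eta, alpha):
    """Per-observation term of Equation 8, parameterized by eta = x'beta."""
    inv = 1.0 / alpha
    return (y * eta + inv * math.log(1.0 - math.exp(eta))
            + math.lgamma(y + inv) - math.lgamma(y + 1) - math.lgamma(inv))

y_obs, mu, alpha = 4, 2.5, 0.8  # assumed values for illustration
eta = nbc_link(mu, alpha)
mu_back = nbc_inverse_link(eta, alpha)
ll_mu = nb_loglik_mu(y_obs, mu, alpha)
ll_eta = nbc_loglik_eta(y_obs, eta, alpha)
print(eta, mu_back, ll_mu, ll_eta)
```

Since $\alpha\mu/(1+\alpha\mu)$ lies strictly between 0 and 1, the canonical linear predictor is always negative, which is one way to see that NB-C fitted values are not on the log scale.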

The NB-C model better fits certain types of count data than NB2, or any other variety of count model. However, since its fitted values are not on the log scale, comparisons cannot be made to Poisson or NB2.

The NB2 model, in a similar manner to the Poisson, can also be overdispersed if the model variance exceeds its nominal variance. In such a case one must attempt to determine the source of the extra correlation and model it accordingly.

The extra correlation that can exist in count data, but which cannot be accommodated by simple adjustments to the Poisson and negative binomial algorithms, has stimulated the creation of a number of enhancements to the two base count models. The differences among these enhanced models relate to the attempt to identify the various sources of overdispersion.

For instance, both the Poisson and negative binomial models assume that zero counts are possible. If a given set of count data excludes that possibility, the resultant Poisson or negative binomial model will likely be overdispersed. Modifying the log-likelihood function of these two models in order to adjust for the non-zero distribution of counts will eliminate the overdispersion, if there are no other sources of extra correlation. Such models are called, respectively, zero-truncated Poisson and zero-truncated negative binomial models.

Likewise, if the data consist of far more zero counts than allowed by the distributional assumptions of the Poisson or negative binomial models, a zero-inflated set of models may need to be designed. Zero-inflated models are mixture models, with one part consisting of a 1/0 binary response model, usually a logistic regression, in which the probability of a zero count is estimated as opposed to a non-zero count. A second component generally consists of a Poisson or negative binomial model that estimates the full range of count data, adjusting for the overlap in estimated zero counts.
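The zero-truncated and zero-inflated adjustments described above can be written out for the Poisson case. The sketch below uses the standard textbook forms of the two probability functions; the symbol `pi` for the probability of a structural zero, the function names, and the numerical values are assumptions introduced for illustration and do not appear in the text.

```python
import math

def poisson_pmf(y, mu):
    """Ordinary Poisson probability of count y."""
    return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))

def ztp_pmf(y, mu):
    """Zero-truncated Poisson: rescale by the probability of a non-zero count."""
    if y < 1:
        return 0.0
    return poisson_pmf(y, mu) / (1.0 - math.exp(-mu))

def zip_pmf(y, mu, pi):
    """Zero-inflated Poisson: structural zero with probability pi,
    otherwise an ordinary Poisson count (which may itself be zero)."""
    base = (1.0 - pi) * poisson_pmf(y, mu)
    return pi + base if y == 0 else base

mu, pi = 2.0, 0.3  # assumed values for illustration
support = range(200)

ztp_total = sum(ztp_pmf(y, mu) for y in support)
ztp_mean = sum(y * ztp_pmf(y, mu) for y in support)

zip_total = sum(zip_pmf(y, mu, pi) for y in support)
zip_mean = sum(y * zip_pmf(y, mu, pi) for y in support)
zip_var = sum((y - zip_mean) ** 2 * zip_pmf(y, mu, pi) for y in support)

print("ZTP:", ztp_total, ztp_mean, mu / (1.0 - math.exp(-mu)))
print("ZIP:", zip_total, zip_mean, zip_var)
```

The zero-inflated variance exceeds the zero-inflated mean whenever `pi` is positive, which is how excess zeros show up as overdispersion when an ordinary Poisson model is fitted to such data.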
The point of a zero-inflated model is (1) to determine the estimates that account for zero counts, and (2) to estimate the adjusted count model data.

Hurdle models are another type of mixture model designed for excessive zero counts. However, unlike the zero-inflated models, the hurdle-binary model estimates the probability of being a non-zero count in comparison to a zero count; the hurdle-count component is estimated on the basis of a zero-truncated count model. Zero-truncated, zero-inflated, and hurdle models all address abnormal zero-count situations, which violate essential Poisson and negative binomial assumptions.

Some of the more recently developed count models include finite mixture models and exact Poisson regression. Finite mixture models allow the count response to have been created from two or more separate generating mechanisms. For example, a portion of the counts may have a Poisson distribution with a mean of 0.5, with another portion having a Poisson distribution with a mean of 4. A response may thus consist of two separate underlying distributions. Such a model allows estimation of more complex structures of counts than do standard Poisson and negative binomial models.

Exact Poisson models are not based on the asymptotic methods characteristic of maximum likelihood or generalized linear models estimation; rather, they are based on the construction of a statistical distribution that can be thoroughly enumerated. This highly iterative technique allows appropriate estimation of parameters and confidence intervals for small and unbalanced data that could not otherwise be modeled using conventional estimation methods.

Other violations of the distributional assumptions of Poisson and negative binomial probability distributions exist. The table below summarizes major types of violations that have resulted in the creation of specialized count models.

Table 1. Models to adjust for violations of Poisson/NB distributional assumptions

Response              Example models
1: no zeros           zero-truncated models (ZTP; ZTNB)
2: excessive zeros    zero-inflated (ZIP; ZINB); hurdle models
3: truncated          truncated count models
4: censored           econometric and survival censored count models
5: panel              GEE; fixed, random, and mixed effects count models
6: separable          sample selection, finite mixture models
7: two-responses      bivariate count models
8: other              quantile, exact, and Bayesian count models

Alternative count models have also been constructed based on an adjustment to the Poisson variance function, $\mu$. We have previously addressed two of these. Table 2 provides a summary of major types of adjustments.

Table 2. Methods to directly adjust the variance (from Hilbe, 2007)

Variance function              Example models
0: $\mu$                       Poisson
1: $\mu(\phi)$                 quasi-Poisson; scaled SE; robust SE
2: $\mu(1+\alpha)$             linear NB (NB1)
3: $\mu(1+\mu)$                geometric
4: $\mu(1+\alpha\mu)$          standard NB (NB2); quadratic NB
5: $\mu(1+(\alpha\nu)\mu)$     heterogeneous NB (NB-H)
6: $\mu(1+\alpha\mu^{\rho})$   generalized NB (NB-P)
7: $V[R]V'$                    generalized estimating equations

The four texts listed in the References below are specifically devoted to describing the theory and variety of count models, and are currently regarded as standard resources on the subject. A number of journal articles and book chapters have been written on the subject.
Other texts dealing with discrete response models in general, as well as texts on generalized linear models (see Generalized linear models), also have descriptions of count models, although only a few go beyond examining basic Poisson and negative binomial regression.

References

[1] Cameron, A. C. and P. K. Trivedi (1986). Econometric models based on count data: Comparisons and applications of some estimators, Journal of Applied Econometrics, 1: 29-53.
[2] Cameron, A. C. and P. K. Trivedi (1998). Regression analysis of count data. New York: Cambridge University Press.
[3] Hilbe, J. M. (1993). Log-negative binomial regression as a generalized linear model, Technical report COS 93/94-5-26, Department of Sociology, Arizona State University.
[4] Hilbe, J. M. (2007). Negative binomial regression. Cambridge, UK: Cambridge University Press.
[5] Hilbe, J. M. (2011). Negative binomial regression. 2nd edition. Cambridge, UK: Cambridge University Press. In press.
[6] Hilbe, J. M. and W. H. Greene (2007). Count response regression models, in C. R. Rao, J. P. Miller, and D. C. Rao (eds), Epidemiology and Medical Statistics, Elsevier Handbook of Statistics Series, London: Elsevier.
[7] Hinde, J. and C. G. B. Demetrio (1998). Overdispersion: models and estimation, Computational Statistics and Data Analysis, 27, 2: 151-170.
[8] Lawless, J. F. (1987). Negative binomial and mixed Poisson regression, Canadian Journal of Statistics, 15, 3: 209-225.
[9] Long, J. S. (1997). Regression models for categorical and limited dependent variables. Thousand Oaks, CA: Sage.
[10] Simon, L. J. (1960). The negative binomial and Poisson distributions compared. Proceedings of the Casualty Actuarial Society, XLVII: 20-24.
[11] Winkelmann, R. (2008). Econometric Analysis of Count Data. 5th edition. Heidelberg, Ger: Springer.

Reprinted with permission from Lovric, Miodrag (2011), International Encyclopedia of Statistical Science. Heidelberg: Springer Science+Business Media, LLC.