Count Data Models
See Book Chapter 11, 2nd Edition (Chapter 10, 1st Edition)

Count data consist of non-negative integer values. Examples: the number of driver route changes per week, the number of trip departure changes per week, drivers' frequency of use of ITS technologies over some time period, and the number of accidents observed on road segments per year. Count data can be properly modeled using a number of methods, the most popular of which are Poisson and negative binomial regression models.

Poisson Regression Model
Consider the number of accidents occurring per year at various intersections in a city. In a Poisson regression model, the probability of intersection $i$ having $y_i$ accidents per year (where $y_i$ is a non-negative integer) is given by:

$$P(y_i) = \frac{\mathrm{EXP}(-\lambda_i)\,\lambda_i^{y_i}}{y_i!}$$

Where: $P(y_i)$ is the probability of intersection $i$ having $y_i$ accidents per year, and $\lambda_i$ is the Poisson parameter for intersection $i$, which is equal to intersection $i$'s expected number of accidents per year, $E[y_i]$.
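
The Poisson probability above can be sketched in a few lines. This is only an illustration: the function name `poisson_pmf` and the value of 2.0 expected accidents per year are assumptions, not part of the notes.

```python
import math

def poisson_pmf(y: int, lam: float) -> float:
    """P(y) = EXP(-lam) * lam**y / y! for a non-negative integer y and lam > 0."""
    if y < 0:
        return 0.0
    # Evaluate on the log scale so that large counts do not overflow y!.
    return math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))

lam = 2.0  # hypothetical expected number of accidents per year at one intersection
probs = [poisson_pmf(y, lam) for y in range(6)]
print([round(p, 4) for p in probs])
```

Summing `poisson_pmf(y, lam)` over all non-negative `y` gives one, which is a quick sanity check on the formula.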
Poisson regression models are estimated by specifying the Poisson parameter $\lambda_i$ (the expected number of events per period) as a function of explanatory variables. The most common relationship between explanatory variables and the Poisson parameter is the log-linear model,

$$\lambda_i = \mathrm{EXP}(\beta X_i), \quad \text{or, equivalently,} \quad \mathrm{LN}(\lambda_i) = \beta X_i$$

Where: $X_i$ is a vector of explanatory variables and $\beta$ is a vector of estimable coefficients.
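
As a small sketch of the log-linear specification, the snippet below computes the Poisson parameter for one intersection. The coefficient values and the two explanatory variables (traffic volume and a signalization indicator) are hypothetical and chosen only for illustration.

```python
import math

def poisson_parameter(beta, x):
    """lambda_i = EXP(beta . x_i); x should carry a leading 1.0 for the constant."""
    return math.exp(sum(b * v for b, v in zip(beta, x)))

# Hypothetical coefficients: constant, traffic volume (10,000 veh/day), signalized (0/1)
beta = [-0.5, 0.30, 0.40]
x_i = [1.0, 2.5, 1.0]

lam_i = poisson_parameter(beta, x_i)
print(round(lam_i, 3))

# Raising one explanatory variable by one unit multiplies lambda_i by EXP(coefficient).
x_more_volume = [1.0, 3.5, 1.0]
ratio = poisson_parameter(beta, x_more_volume) / lam_i
print(round(ratio, 3), round(math.exp(beta[1]), 3))
```

Because the model is linear in $\mathrm{LN}(\lambda_i)$, each coefficient acts multiplicatively on the expected count, which is what the last two lines illustrate.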
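In this formulation, the expected number of events per period is given by

$$E[y_i] = \lambda_i = \mathrm{EXP}(\beta X_i)$$

For model estimation, note the likelihood function is:

$$L(\beta) = \prod_i P(y_i)$$

So, with the Poisson equation,

$$L(\beta) = \prod_i \frac{\mathrm{EXP}(-\lambda_i)\,\lambda_i^{y_i}}{y_i!}$$

Since $\lambda_i = \mathrm{EXP}(\beta X_i)$,

$$L(\beta) = \prod_i \frac{\mathrm{EXP}[-\mathrm{EXP}(\beta X_i)]\,[\mathrm{EXP}(\beta X_i)]^{y_i}}{y_i!}$$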
Which gives the log-likelihood,

$$LL(\beta) = \sum_{i=1}^{n} \left[ -\mathrm{EXP}(\beta X_i) + y_i \beta X_i - \mathrm{LN}(y_i!) \right]$$

Poisson Regression Model Goodness of Fit Measures
The likelihood ratio test is a common test used to assess two competing models. It provides evidence in support of one model over the other. The likelihood ratio test statistic is,

$$-2\,[LL(\beta_R) - LL(\beta_U)]$$

where
$LL(\beta_R)$ is the log-likelihood at convergence of the "restricted" model (sometimes considered to have all coefficients in $\beta$ equal to 0, or just to include the constant term, to test overall fit of the model), and $LL(\beta_U)$ is the log-likelihood at convergence of the unrestricted model. This statistic is $\chi^2$ distributed with degrees of freedom equal to the difference in the numbers of coefficients in the restricted and unrestricted models (the difference in the number of coefficients in the $\beta_R$ and the $\beta_U$ coefficient vectors).

Another measure of overall model fit is the $\rho^2$ statistic. The $\rho^2$ statistic is,

$$\rho^2 = 1 - \frac{LL(\beta)}{LL(0)}$$

Where:
$LL(\beta)$ is the log-likelihood at convergence with coefficient vector $\beta$, and $LL(0)$ is the initial log-likelihood (with all coefficients set to zero). The perfect model would have a likelihood function equal to one (all selected alternative outcomes would be predicted by the model with probability one, and the product of these across the observations would also be one) and the log-likelihood would be zero, giving a $\rho^2$ of one. The $\rho^2$ statistic will be between zero and one, and the closer it is to one, the more variance the estimated model is explaining.
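
The log-likelihood, likelihood ratio statistic, and $\rho^2$ can be sketched together as follows. The six counts are made-up data, and the comparison is deliberately simple: the restricted model has all coefficients at zero and the unrestricted model has only a constant, whose maximum likelihood estimate for a Poisson model is the log of the sample mean. Function names are illustrative.

```python
import math

def poisson_loglik(beta, X, y):
    """LL(beta) = sum_i [ -EXP(beta.X_i) + y_i * beta.X_i - LN(y_i!) ]."""
    ll = 0.0
    for x_i, y_i in zip(X, y):
        bx = sum(b * v for b, v in zip(beta, x_i))
        ll += -math.exp(bx) + y_i * bx - math.lgamma(y_i + 1)
    return ll

def likelihood_ratio(ll_restricted, ll_unrestricted):
    """-2 [LL(beta_R) - LL(beta_U)], chi-squared under the restricted model."""
    return -2.0 * (ll_restricted - ll_unrestricted)

def rho_squared(ll_beta, ll_zero):
    """rho^2 = 1 - LL(beta) / LL(0)."""
    return 1.0 - ll_beta / ll_zero

y = [0, 1, 3, 2, 4, 2]            # hypothetical accident counts
X = [[1.0] for _ in y]            # constant term only

ll_zero = poisson_loglik([0.0], X, y)            # all coefficients set to zero
beta_const = [math.log(sum(y) / len(y))]         # ML estimate of the constant
ll_const = poisson_loglik(beta_const, X, y)

lr = likelihood_ratio(ll_zero, ll_const)
print(round(lr, 3), round(rho_squared(ll_const, ll_zero), 3))
print(lr > 3.841)  # 3.841 is the 95% chi-squared critical value for 1 degree of freedom
```

With more explanatory variables the same three functions apply; only the coefficient vectors and the degrees of freedom change.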

Truncated Poisson Regression Model
Truncation of data can occur in the routine collection of transportation data. For example, if the data record the number of times per week an in-vehicle navigation system is used on the morning commute to work during weekdays, the data are right truncated at 5, which is the maximum number of uses in any given week. Estimating a Poisson regression model without accounting for this truncation will result in biased estimates of the parameter vector $\beta$, and erroneous inferences will be drawn. Fortunately, the Poisson model is adapted easily to account for such truncation. The right-truncated Poisson model is written as:

$$P(y_i) = \frac{\lambda_i^{y_i} / y_i!}{\sum_{m_i=0}^{r} \left( \lambda_i^{m_i} / m_i! \right)}$$
Where: $P(y_i)$ is the probability of commuter $i$ using the system $y_i$ times per week; $\lambda_i$ is the Poisson parameter for commuter $i$; $m_i$ is the number of uses per week; and $r$ is the right truncation (in this case, 5 times per week).

Negative Binomial Regression Model
The Poisson distribution restricts the mean and variance to be equal: $E[y_i] = \mathrm{VAR}[y_i]$. If this equality does not hold, the data are said to be underdispersed ($E[y_i] > \mathrm{VAR}[y_i]$) or overdispersed ($E[y_i] < \mathrm{VAR}[y_i]$), and the coefficient vector will be biased if corrective measures are not taken.
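
A rough first screen for the mean-variance equality is to compare the sample mean and sample variance of the counts. The sketch below uses made-up data and an illustrative function name. It ignores the explanatory variables, so it is only a preliminary check and not a substitute for estimating the overdispersion parameter.

```python
from statistics import mean, variance

def dispersion_check(counts):
    """Compare the sample mean and sample variance of a list of counts."""
    m, v = mean(counts), variance(counts)
    if v > m:
        label = "overdispersed"
    elif v < m:
        label = "underdispersed"
    else:
        label = "equidispersed"
    return m, v, label

counts = [0, 0, 1, 0, 7, 2, 0, 12, 1, 0]  # hypothetical yearly accident counts
m, v, label = dispersion_check(counts)
print(m, round(v, 2), label)
```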
To account for cases when $E[y_i] \neq \mathrm{VAR}[y_i]$, a negative binomial model is used. The negative binomial model is derived by rewriting the $\lambda_i$ equation such that,

$$\lambda_i = \mathrm{EXP}(\beta X_i + \varepsilon_i)$$

where $\mathrm{EXP}(\varepsilon_i)$ is a gamma-distributed error term with mean 1 and variance $\alpha$. The addition of this term allows the variance to differ from the mean as below,

$$\mathrm{VAR}[y_i] = E[y_i]\,[1 + \alpha E[y_i]] = E[y_i] + \alpha E[y_i]^2$$

The Poisson regression model is regarded as a limiting model of the negative binomial regression model as $\alpha$ approaches zero, which means that the selection between these two models is dependent upon the value of $\alpha$.
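
The variance relationship can be illustrated numerically. The sketch below evaluates the negative binomial variance for a given mean and $\alpha$, and inverts it to get a crude moment-based value of $\alpha$ from a sample mean and variance. The inversion is an illustrative assumption, not the maximum likelihood estimate used in practice, and the function names are hypothetical.

```python
def nb_variance(mu, alpha):
    """VAR[y] = E[y] + alpha * E[y]**2."""
    return mu + alpha * mu ** 2

def alpha_from_moments(sample_mean, sample_var):
    """Solve VAR = mean + alpha * mean**2 for alpha (rough moment-based value)."""
    return (sample_var - sample_mean) / sample_mean ** 2

print(nb_variance(2.0, 0.0))   # alpha = 0 reproduces the Poisson case: variance equals mean
print(nb_variance(2.0, 0.5))   # a positive alpha inflates the variance above the mean
print(alpha_from_moments(2.0, 4.0))
```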
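The parameter $\alpha$ is referred to as the overdispersion parameter. The negative binomial distribution has the form,

$$P(y_i) = \frac{\Gamma((1/\alpha) + y_i)}{\Gamma(1/\alpha)\, y_i!} \left( \frac{1/\alpha}{(1/\alpha) + \lambda_i} \right)^{1/\alpha} \left( \frac{\lambda_i}{(1/\alpha) + \lambda_i} \right)^{y_i}$$

where $\Gamma(.)$ is a gamma function. This results in the likelihood function,

$$L(\lambda_i) = \prod_i \frac{\Gamma((1/\alpha) + y_i)}{\Gamma(1/\alpha)\, y_i!} \left( \frac{1/\alpha}{(1/\alpha) + \lambda_i} \right)^{1/\alpha} \left( \frac{\lambda_i}{(1/\alpha) + \lambda_i} \right)^{y_i}$$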
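
A minimal sketch of the negative binomial probability and log-likelihood follows, evaluated on the log scale with `math.lgamma` for numerical stability. The parameter values are hypothetical and the function names are illustrative.

```python
import math

def nb_logpmf(y, lam, alpha):
    """Log of the negative binomial probability with mean lam and overdispersion alpha."""
    inv_a = 1.0 / alpha
    return (math.lgamma(inv_a + y) - math.lgamma(inv_a) - math.lgamma(y + 1)
            + inv_a * math.log(inv_a / (inv_a + lam))
            + y * math.log(lam / (inv_a + lam)))

def nb_pmf(y, lam, alpha):
    return math.exp(nb_logpmf(y, lam, alpha))

def nb_loglik(ys, lams, alpha):
    """Sum of log probabilities over observations (log of the likelihood above)."""
    return sum(nb_logpmf(y, lam, alpha) for y, lam in zip(ys, lams))

lam, alpha = 2.0, 0.5  # hypothetical values
mean_y = sum(y * nb_pmf(y, lam, alpha) for y in range(400))
var_y = sum((y - mean_y) ** 2 * nb_pmf(y, lam, alpha) for y in range(400))
print(round(mean_y, 4), round(var_y, 4))  # variance should match lam + alpha * lam**2

# As alpha approaches zero the probabilities approach the Poisson values.
poisson_3 = math.exp(-lam) * lam ** 3 / math.factorial(3)
print(round(nb_pmf(3, lam, 1e-4), 4), round(poisson_3, 4))
```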

Zero-Inflated Poisson and Negative Binomial Regression Models
Zero events can arise from two qualitatively different conditions.
1. One condition may result from simply failing to observe an event during the observation period.
2. Another qualitatively different condition may result from an inability to ever experience an event.
Two states can be present, one being a normal-count process state and the other being a zero-count state. A zero-count state may refer to situations where the likelihood of an event occurring is extremely rare in comparison to the normal-count state, where event occurrence is inevitable and follows some known count process.
Two aspects of this nonqualitative distinction of the zero state are noteworthy:
1. There is a preponderance of zeros in the data, more than would be expected under a Poisson process.
2. A sampling unit is not required to be in the zero or near-zero state into perpetuity, and can move from the zero or near-zero state to the normal-count state with positive probability.
Data obtained from two-state regimes (normal-count and zero-count states) often suffer from overdispersion if considered as part of a single, normal-count state because the number of zeros is inflated by the zero-count state.
The zero-inflated Poisson (ZIP) model assumes that the events, $Y = (y_1, y_2, \ldots, y_n)$, are independent and the model is

$$y_i = 0 \text{ with probability } p_i + (1 - p_i)\,\mathrm{EXP}(-\lambda_i)$$

$$y_i = y \text{ with probability } \frac{(1 - p_i)\,\mathrm{EXP}(-\lambda_i)\,\lambda_i^{y}}{y!}$$

where $y$ is the number of events per period. The zero-inflated negative binomial (ZINB) regression model follows a similar formulation with events, $Y = (y_1, y_2, \ldots, y_n)$, being independent and,
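
$$y_i = 0 \text{ with probability } p_i + (1 - p_i) \left[ \frac{1/\alpha}{(1/\alpha) + \lambda_i} \right]^{1/\alpha}$$

$$y_i = y \text{ with probability } (1 - p_i) \left[ \frac{\Gamma((1/\alpha) + y)\, u_i^{1/\alpha} (1 - u_i)^{y}}{\Gamma(1/\alpha)\, y!} \right], \quad y = 1, 2, 3, \ldots$$

where $u_i = (1/\alpha) / [(1/\alpha) + \lambda_i]$. Zero-inflated models imply that the underlying data-generating process has a splitting regime that provides for two types of zeros. The splitting process can be assumed to follow a logit (logistic) or probit (normal) probability process, or other probability processes.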
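
The two zero-inflated formulations can be sketched as probability functions. The values of `p`, `lam`, and `alpha` below are hypothetical, with `p` standing for the probability of being in the zero-count state, and the function names are illustrative. The last lines show numerically that a ZIP process has a variance larger than its mean, which is why data from a two-state regime look overdispersed when treated as a single count process.

```python
import math

def zip_pmf(y, lam, p):
    """Zero-inflated Poisson probability of observing y events."""
    pois = math.exp(-lam + y * math.log(lam) - math.lgamma(y + 1))
    return p + (1.0 - p) * pois if y == 0 else (1.0 - p) * pois

def zinb_pmf(y, lam, alpha, p):
    """Zero-inflated negative binomial probability, with u = (1/alpha)/((1/alpha)+lam)."""
    inv_a = 1.0 / alpha
    u = inv_a / (inv_a + lam)
    nb = math.exp(math.lgamma(inv_a + y) - math.lgamma(inv_a) - math.lgamma(y + 1)
                  + inv_a * math.log(u) + y * math.log(1.0 - u))
    return p + (1.0 - p) * nb if y == 0 else (1.0 - p) * nb

lam, alpha, p = 2.0, 0.5, 0.3  # hypothetical values
print(round(zip_pmf(0, lam, p), 4), round(zinb_pmf(0, lam, alpha, p), 4))

zip_mean = sum(y * zip_pmf(y, lam, p) for y in range(200))
zip_var = sum((y - zip_mean) ** 2 * zip_pmf(y, lam, p) for y in range(200))
print(round(zip_mean, 4), round(zip_var, 4))  # variance exceeds the mean
```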
A point to remember is that there must be underlying justification to believe the splitting process exists (resulting in two distinct states) prior to fitting this type of statistical model. There should be a basis for believing that part of the process is in a zero-count state. To test the appropriateness of using a zero-inflated model rather than a traditional model, Vuong (1989) proposed a test statistic for non-nested models that is well suited for situations where the distributions (Poisson or negative binomial) are specified. The statistic is calculated as (for each observation $i$),

$$m_i = \mathrm{LN}\left( \frac{f_1(y_i \mid X_i)}{f_2(y_i \mid X_i)} \right)$$

where: $f_1(y_i \mid X_i)$ is the probability density function of model 1, and
$f_2(y_i \mid X_i)$ is the probability density function of model 2. Using this, Vuong's statistic for testing the non-nested hypothesis of model 1 versus model 2 is (Greene, 2000; Shankar et al., 1997),

$$V = \frac{\sqrt{n}\left[ \frac{1}{n} \sum_{i=1}^{n} m_i \right]}{\sqrt{\frac{1}{n} \sum_{i=1}^{n} (m_i - \bar{m})^2}} = \frac{\sqrt{n}\,\bar{m}}{S_m}$$

Where: $\bar{m}$ is the mean, $(1/n) \sum_{i=1}^{n} m_i$; $S_m$ is the standard deviation; Vuong's value is asymptotically standard normal distributed (to be compared to z-values); and if $|V|$ is less than $V_{critical}$ (1.96 for a 95% confidence level), the test does not support the selection of one model over another.
Large positive values of $V$ greater than $V_{critical}$ favor model 1 over model 2, whereas large negative values support model 2.

Decision guidelines using the Vuong statistic for the ZINB ($f_1(.)$) and NB ($f_2(.)$) comparison, and the t-statistic of the NB overdispersion parameter $\alpha$:

Vuong statistic    t-statistic of alpha < |1.96|            t-statistic of alpha > |1.96|
< -1.96            ZIP or Poisson as alternative to NB      NB
> 1.96             ZIP                                      ZINB
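
The Vuong calculation and the decision guidelines can be sketched as below. The per-observation probabilities are made-up numbers standing in for the fitted values of two competing models, and the function names are illustrative. The `select_model` helper encodes the guideline table as read here, so treat it as an assumption to check against the original table.

```python
import math

def vuong_statistic(f1_probs, f2_probs):
    """V = sqrt(n) * mean(m) / S_m, with m_i = LN(f1_i / f2_i)."""
    m = [math.log(a / b) for a, b in zip(f1_probs, f2_probs)]
    n = len(m)
    m_bar = sum(m) / n
    s_m = math.sqrt(sum((mi - m_bar) ** 2 for mi in m) / n)
    return math.sqrt(n) * m_bar / s_m

def select_model(v, t_alpha, critical=1.96):
    """Guideline for the ZINB (model 1) versus NB (model 2) comparison."""
    alpha_significant = abs(t_alpha) > critical
    if v > critical:
        return "ZINB" if alpha_significant else "ZIP"
    if v < -critical:
        return "NB" if alpha_significant else "ZIP or Poisson as alternative to NB"
    return "inconclusive"

# Hypothetical fitted probabilities of the observed counts under each model
f1 = [0.5, 0.6, 0.4, 0.7]
f2 = [0.25, 0.2, 0.2, 0.3]

v = vuong_statistic(f1, f2)
print(round(v, 2), select_model(v, t_alpha=3.1))
```

Swapping the two models changes only the sign of $V$, so the same function serves either ordering.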

Because overdispersion will almost always include excess zeros, it is not always easy to determine whether excess zeros arise from true overdispersion or from an underlying splitting regime. This could lead one to erroneously choose a negative binomial model when the correct model may be a zero-inflated Poisson. The use of a zero-inflated model may be simply capturing model misspecification that could result from factors such as unobserved effects (heterogeneity) in the data.

LIMDEP ZIP/ZINB
Beta Tau model: Uses the same Xs that predict frequency in the zero-state splitting model, but multiplied by tau
Separate functions model: Uses different Xs for the splitting and frequency models
Splitting function: Logistic (default) or Normal
Vuong statistic