Application of the hyper-Poisson generalized linear model for analyzing motor vehicle crashes


S. Hadi Khazraee (corresponding author)
Graduate Research Assistant, Zachry Department of Civil Engineering, Texas A&M University
Tel. (979), hadikhazraee@tamu.edu

Antonio Jose Sáez-Castillo, Ph.D.
Associate Professor, Department of Statistics and Operations Research, University of Jaén, Spain
ajsaez@ujaen.es

Srinivas Reddy Geedipally, Ph.D., P.E.
Assistant Research Engineer, Texas A&M Transportation Institute, Texas A&M University System
Tel. (817), srinivas-g@ttimail.tamu.edu

Dominique Lord, Ph.D., P.Eng.
Associate Professor, Zachry Department of Civil Engineering, Texas A&M University
Tel. (979), d-lord@tamu.edu

ABSTRACT

The hyper-Poisson distribution can handle both over- and under-dispersion, and its generalized linear model formulation allows the dispersion of the distribution to be observation-specific and dependent on model covariates. The objective of this study is to examine the potential applicability of a newly proposed generalized linear model framework for the hyper-Poisson distribution in analyzing motor vehicle crash count data. The hyper-Poisson generalized linear model was first fitted to intersection crash data from Toronto, characterized by over-dispersion, and then to crash data from railway-highway crossings in Korea, characterized by under-dispersion.

The results of this study are promising. When fitted to the Toronto data set, the goodness-of-fit measures indicated that the hyper-Poisson model with a variable dispersion parameter provided a statistical fit as good as the traditional negative binomial model. The hyper-Poisson model was also successful in handling the under-dispersed data from Korea; the model performed as well as the gamma probability model and the Conway-Maxwell-Poisson model previously developed for the same data set.

The advantages of the hyper-Poisson model studied in this paper are noteworthy. Unlike the negative binomial model, which has difficulties in handling under-dispersed data, the hyper-Poisson model can handle both over- and under-dispersed crash data. In addition, although interpretability is not a major issue for the Conway-Maxwell-Poisson model, the effect of each variable on the expected mean of crashes is easily interpretable in this new model.

Keywords: hyper-Poisson, under-dispersion, dispersion parameter

1. INTRODUCTION

Motor vehicle crash count data are often characterized by over-dispersion, meaning that the variance of crash counts on a roadway entity is greater than the mean. It is, however, possible, although rare, to find crash data sets with under-dispersion, i.e., variance lower than the mean (1), especially in crash data with low sample means (2). The most commonly used distribution in crash count data modeling, the negative binomial (NB) or Poisson-gamma, can only accommodate over-dispersion; when applied to under-dispersed data, it has convergence issues and produces incorrect parameter estimates (1).

Researchers in various fields have proposed numerous alternative models to handle under-dispersed count data. For instance, the generalized Poisson (3), the weighted Poisson (4), and the Poisson polynomial (5) models are all extensions of the Poisson model that can handle both over- and under-dispersed count data.

Of all the models capable of handling both over- and under-dispersion, the Conway-Maxwell-Poisson (COM-Poisson) distribution has probably gained the most attention, especially in highway safety. The COM-Poisson distribution was first introduced by Conway and Maxwell (6) for modeling queues and service rates, and its statistical properties were later explored by Shmueli et al. (7) (1). The COM-Poisson generalized linear model (GLM) has been applied to crash data by Lord et al. (8; 9) and Geedipally and Lord (10). Several studies have found both the COM-Poisson distribution and its regression model to be very flexible in dealing with count data with a wide range of characteristics (e.g., 11; 12). Despite this flexibility, Francis et al. (13) have warned about the limitation of the COM-Poisson GLM in dealing with over-dispersed data sets with low sample mean values.

Another approach used to handle under-dispersion in crash count data modeling is the gamma probability distribution. This approach has been used with two different parameterizations. The first parameterization, proposed by Winkelmann (14) and first applied to crash data by Oh et al. (2), assumes that the time elapsed between two successive crashes (the waiting time) follows a gamma distribution. This approach implies that crash events are dependent in the sense that the occurrence of at least one event (in contrast to none) up to time t influences the probability of a further occurrence in t + Δt (14). Nonetheless, while crash counts can sometimes have a temporal correlation, they are often described as independent observations (15). Recently, Daniels et al. (16; 17) used a different parameterization of the gamma model in which they assumed that the crash frequency itself follows a continuous gamma density function. Two major theoretical shortcomings exist for this assumption: it implies that crash counts of zero are not possible, and that non-integer crash counts may be observed (15). Both implications are obviously fallacious.

The final model worth mentioning is the double-Poisson distribution model proposed by Efron et al. (18). Although the model is not very popular among researchers, Zou et al. (15) applied it to crash count data and found it to be flexible. Nonetheless, they noticed that the distribution does not handle under-dispersion as reliably as it does over-dispersion.

Very recently, Saez-Castillo and Conde-Sanchez (19) formulated a generalized linear model (GLM) framework for a two-parameter generalization of the Poisson distribution, called the hyper-Poisson distribution (20). The primary objective of this study is to examine the potential application of the hyper-Poisson GLM in the field of highway safety to model crash count data. The hyper-Poisson distribution can handle both under- and over-dispersion. In addition, the regression model examined in this study allows the dispersion of the distribution to vary among observations. Such an observation-specific dispersion structure for crash counts on roadway entities is consistent with the findings of recent research in highway safety. A handful of studies have addressed shortcomings in the assumption of fixed dispersion among all observations and have suggested that the model dispersion can potentially depend on the covariates (e.g., 21; 22; 23). Mitra and Washington (24) advised that the observation-specific structure can be especially important when the mean function is misspecified, such as in models where the mean depends only on the entering traffic flow. In the hyper-Poisson model, the covariates enter the mean function and, at the same time, influence the dispersion of the distribution. The dual link structure of the hyper-Poisson GLM is similar to that suggested by Guikema and Coffelt (25) for the COM-Poisson regression model.

In this research study, the hp GLM is first fitted to crash data from signalized intersections in Toronto to examine the model performance in handling over-dispersed count data. The objective is to ensure that the model can provide an adequate fit to the majority of crash count data sets, which are characterized by over-dispersion. The modeling results for the Toronto data are compared to those obtained by the NB GLM. The hp model is also fitted to a data set from railway-highway crossings (RHXs) in Korea, which is characterized by under-dispersion. For this data set, the modeling results are compared to those for the gamma probability distribution from Oh et al. (2) and the COM-Poisson GLM from Lord et al. (9).

2. BACKGROUND

This section describes the characteristics of the hp distribution and the corresponding generalized linear regression model. The first part discusses the hp distribution and its characteristics, and the second part describes an extension of the distribution to model crash frequency data.

2.1. Hyper-Poisson Distribution

Bardwell and Crow (20) derived a two-parameter generalization of the Poisson distribution. They called the proposed distribution the hyper-Poisson (hp, hereafter) family because it turned out to be a subclass of the three-parameter hypergeometric series distribution and reduced to the Poisson distribution in a special case. Using the original notation, the probability mass function (pmf) of the hp distribution with parameters λ and θ2 is stated as follows:

$$f(Y = y \mid \lambda, \theta_2) = \frac{1}{{}_1F_1(1; \lambda; \theta_2)} \, \frac{\theta_2^{\,y}}{(\lambda)_y} \qquad (1)$$

$$(\lambda)_y = \frac{\Gamma(\lambda + y)}{\Gamma(\lambda)} \qquad (2)$$

$${}_1F_1(1; \lambda; \theta_2) = \sum_{r=0}^{\infty} \frac{\theta_2^{\,r}}{(\lambda)_r} \qquad (3)$$

where Y is the response variable (the discrete crash count in this study), θ2 is the location parameter, λ is defined as the dispersion parameter, and 1F1(1; λ; θ2) is the confluent hypergeometric function with first argument equal to 1 (26). If λ = 1, the distribution reduces to the Poisson (with variance equal to the mean); λ > 1 results in an over-dispersed distribution (super-Poisson), whereas λ < 1 produces an under-dispersed distribution (sub-Poisson) (20).

It can be verified from Equation (1) that the hp distribution satisfies the following recurrence condition, where f_y denotes f(Y = y):

$$(y + \lambda) \, f_{y+1} = \theta_2 \, f_y \qquad (4)$$

Summing Equation (4) over all values of y yields the following expression for the mean (µ):

$$\mu = \theta_2 - (\lambda - 1)(1 - f_0) \qquad (5)$$

$$\mu = \theta_2 - (\lambda - 1) \, \frac{{}_1F_1(1; \lambda; \theta_2) - 1}{{}_1F_1(1; \lambda; \theta_2)} \qquad (6)$$
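As a minimal numerical sketch of Equations (1) to (6), the Python snippet below evaluates the hp pmf and mean using the series in Equation (3). The function names (`hyp1f1_one`, `hp_pmf`, `hp_mean`), the truncation tolerance, and any parameter values used with it are illustrative assumptions and are not taken from the original study, which relied on R code.

```python
import math


def hyp1f1_one(lam, theta, tol=1e-14, max_terms=10_000):
    """Approximate 1F1(1; lam; theta) = sum_r theta^r / (lam)_r (Equation 3).

    The series is truncated once the newest term is negligible relative to
    the running total. `tol` and `max_terms` are illustrative choices.
    """
    term, total = 1.0, 1.0  # r = 0 term is theta^0 / (lam)_0 = 1
    for r in range(max_terms):
        term *= theta / (lam + r)  # theta^(r+1)/(lam)_(r+1) from the r-th term
        total += term
        if term < tol * total:
            break
    return total


def hp_pmf(y, lam, theta):
    """hp probability mass at count y (Equation 1), for lam > 0 and theta > 0."""
    # log of the Pochhammer symbol (lam)_y = Gamma(lam + y) / Gamma(lam), Equation (2)
    log_poch = math.lgamma(lam + y) - math.lgamma(lam)
    return math.exp(y * math.log(theta) - log_poch) / hyp1f1_one(lam, theta)


def hp_mean(lam, theta):
    """Mean of the hp distribution in terms of theta2 and lambda (Equation 6)."""
    f = hyp1f1_one(lam, theta)
    return theta - (lam - 1.0) * (f - 1.0) / f
```

With λ = 1 these functions should reproduce the Poisson pmf and return a mean equal to θ2, and for other values of λ the mean from Equation (6) can be compared against a direct summation of y times the pmf as a sanity check.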

It is clear from Equation (6) that when λ = 1, the location parameter θ2 matches the mean. In this case, Equation (2) gives (λ)_y = y! and Equation (3) yields 1F1(1; λ; θ2) = e^θ2, so the distribution in Equation (1) reduces to the Poisson with mean θ2. However, as indicated by Equation (6), θ2 is not equal to the mean in any other case. The mean and θ2 can become significantly different as λ deviates from 1.

Equation (6) provides an explicit expression of the mean in terms of θ2 and λ. Nonetheless, θ2 and λ cannot be directly expressed in terms of the mean and the other parameter because they appear as the arguments of the hypergeometric series, which does not have an explicit inverse function. This gives rise to a major computational difficulty in regression modeling, as described later in this section.

From Equation (4) and using the method of moments, the following relationship between the distribution variance (σ²) and mean (µ) is obtained (19):

$$\sigma^2 = \theta_2 + \mu \left( \theta_2 - (\lambda - 1) \right) - \mu^2 \qquad (7)$$

It is instructive to compare the hp distribution variance shown above with that of the NB distribution. The relationship between the variance and the mean in the negative binomial distribution is stated below:

$$\sigma^2 = \mu + \alpha \mu^2 \qquad (8)$$

where α is the over-dispersion parameter. A negative estimate of α is indicative of under-dispersion (σ² < µ). However, the NB model is inappropriate for modeling under-dispersed data because the estimated variance will be negative for observations with α < -1/µi (27). In this paper, for the sake of convenience in comparing the NB and hp distributions, α is referred to as the dispersion parameter of the NB distribution. As Equation (8) indicates, in the NB distribution the coefficient of the second-degree term in the variance function is allowed to vary, whereas in the hp distribution variance function it is the coefficient of the first-degree term of the mean that can vary, and the second-degree coefficient is constantly -1 (see Equation 7). This gives the NB distribution higher flexibility than the hp distribution in dealing with highly over-dispersed data sets, as demonstrated later in the results section.

Furthermore, Saez-Castillo and Conde-Sanchez (19) showed how the over-dispersed case of the hp distribution (i.e., when λ > 1) can be viewed as a Poisson compound distribution with a confluent hypergeometric distribution. An interested reader is referred to their work for the derivation. This finding provides an interpretational basis for application of the hp distribution and regression model to crash data: crash counts are Poisson distributed with a mean which itself follows a probability distribution (confluent hypergeometric in this case) to account for the heterogeneity among the individual entities (sites). Indeed, the confluent hypergeometric error term captures the variation in the mean caused by the factors not accounted for by the model. Hence, in the over-dispersion context, the hp distribution is comparable to other compound Poisson distributions, such as the negative binomial/Poisson-gamma distribution.

2.2. Generalized Linear Model

Saez-Castillo and Conde-Sanchez (19) developed an hp GLM framework to model discrete count data. In this approach, both the mean and the dispersion parameter of the hp distribution can depend on the covariates. Denoting Yi as the observed crash count at site i, the GLM assumes that Yi follows an hp distribution with the mean and dispersion parameter stated below:

$$\ln(\mu_i) = \beta_0 + \sum_{j=1}^{p} \beta_j x_{ij} \qquad (9)$$

$$\ln(\lambda_i) = \delta_0 + \sum_{k=1}^{q} \delta_k z_{ik} \qquad (10)$$

where the xij's and zik's are the covariates used to estimate the mean and dispersion parameter of observation i, respectively, and the βj's and δk's are the regression parameters to be estimated by the model. The p covariates used to estimate the mean are not necessarily identical to the q covariates used to estimate the dispersion parameter. This study adopted the GLM as formulated above to model motor vehicle crashes.

The dual link structure of the hyper-Poisson GLM is similar to that suggested by Guikema and Coffelt (25) for the COM-Poisson regression model. The first link function, in Equation (9), describes the mean as a function of covariates. The covariate-dependent mean function allows for inference about the influence of changes in the covariates on the expected number of crashes (µ). The same would not be possible had the location parameter instead been modeled as a function of the covariates. Given the estimated values of µ and λ, the location parameter (θ2) can be determined by Equation (6). The variance of each observation can then be determined by Equation (7).

The second link function of the GLM, in Equation (10), is added to increase the flexibility of the distribution and enable analysis of data with potential over- or under-dispersion depending on the values of the covariates. As mentioned earlier in the introduction, there are notable advantages in allowing the dispersion characteristic of the crash count distribution to depend on the covariates.

3. METHODOLOGY

This section describes the methodology used to fit the hp regression models to the crash data. The first part presents the functional form of each model, and the second part describes the procedure adopted to estimate the models.

3.1. Model Functional Form

Toronto data

For the Toronto intersection crash data, the following common and simple functional form was adopted:

$$\mu_i = \beta_0 \, F_{Maj\_i}^{\,\beta_1} \, F_{Min\_i}^{\,\beta_2} \qquad (11)$$

where FMaj_i and FMin_i denote the average annual daily traffic (AADT) on the major and minor approaches to the intersection, respectively. Such a flow-only crash model for intersections is consistent with the base safety prediction models suggested by the Highway Safety Manual (28) and also with several other studies that have modeled the Toronto data set in the past (e.g., 23; 8).

The hp GLM was applied to the Toronto data in two steps: first, with a constant dispersion parameter (i.e., λi = δ0 for all i), and next, with an observation-specific dispersion parameter. The observation-specific structure is especially important here because the mean function is misspecified, since the mean is allowed to depend on entering traffic flows only. The dispersion parameter has the following form:

$$\lambda_i = \delta_0 \, F_{Maj\_i}^{\,\delta_1} \, F_{Min\_i}^{\,\delta_2} \qquad (12)$$

This was done to evaluate the improvement in fit when the dispersion parameter is allowed to vary depending on the covariates.

The hp model results were compared to those obtained by using the NB GLMs (estimated with the maximum likelihood method) with a fixed and a variable dispersion parameter. When variable, the dispersion parameter of the NB model (αi) followed a functional form similar to that in Equation (12):

$$\alpha_i = \delta_0 \, F_{Maj\_i}^{\,\delta_1} \, F_{Min\_i}^{\,\delta_2} \qquad (13)$$

Korea RHX data

For the Korea RHX data, the objective was to compare the hp model fit mainly with that obtained by the gamma probability model, documented by Oh et al. (2), and the COM-Poisson GLM, documented by Lord et al. (9). An interested reader may refer to their work for background information on the COM-Poisson and gamma probability models. The same functional form for the expected number of crashes was therefore used here:

$$\mu_i = \beta_0 \, F_i^{\,\beta_1} \exp\!\left( \sum_{j=2}^{n} \beta_j x_{ij} \right) \qquad (14)$$

where Fi is the average daily vehicle traffic (ADT) at site i, and xij is covariate j at site i. Various functional forms with different variables were evaluated to model the dispersion parameter, but none of them were found to be significant. This supports the previous finding that, when the functional form describing the mean contains several covariates, a varying dispersion parameter is not needed (24).

3.2. Model Estimation

The GLMs in this study were estimated using the method of maximum likelihood. The goal was to find the set of βj and δk parameters that maximizes the joint likelihood (or, equivalently, the log-likelihood) of the observations y1, ..., yn. From Equation (1), the log-likelihood function is:

$$\log L(y_1, \ldots, y_n) = \sum_{i=1}^{n} \log \Gamma(\lambda_i) + \sum_{i=1}^{n} y_i \log(\theta_{2i}) - \sum_{i=1}^{n} \log \Gamma(\lambda_i + y_i) - \sum_{i=1}^{n} \log\!\left( {}_1F_1(1; \lambda_i; \theta_{2i}) \right) \qquad (15)$$

The optimization was carried out using an iterative procedure that evaluates the log-likelihood function at different combinations of the βj's and δk's until the maximum log-likelihood is reached. Nevertheless, as Equation (15) indicates, the log-likelihood function depends on θ2i and λi, while µi and λi are the quantities modeled as functions of the covariates. θ2i in Equation (15) must therefore be replaced with its expression in terms of µi. As specified earlier, no closed-form expression exists for θ2i. Consequently, evaluation of the log-likelihood function at each iteration required solving the nonlinear Equation (6) to find the value of θ2 corresponding to the estimated µi and λi for each observation.

The code developed by Saez-Castillo and Conde-Sanchez (19) in the software R (29) was used in this study. The program uses the functions nlm and optim to maximize the log-likelihood, and optimize to solve Equation (6) numerically.

4. DATA DESCRIPTION

This section provides an overview of the two data sets used in this research. As discussed above, the data sets come from Toronto and Korea.

The Toronto data set contains crash count data collected in 1995 at 868 four-legged signalized intersections in Toronto. Several research studies (e.g., 30; 23; 31) have used this data set for the purpose of crash count modeling and have found it to be of good quality. The Toronto intersection data are characterized by over-dispersion, as commonly seen in most crash data sets. Table I presents the summary statistics of the variables in this data set.

The Korea data set contains crash count data collected at 162 railway-highway crossings in Korea. This data set was first used by Oh et al. (2) to fit Poisson and gamma probability models, and later by Lord et al. (9) to fit a COM-Poisson model. Although the raw data show signs of slight over-dispersion (sample mean = 0.33, sample variance = 0.36), both studies observed under-dispersion when crashes were modeled conditional on the mean. Out of the many explanatory variables initially considered for model estimation in these studies, only a few were found to be statistically significant at the 10% level and were included in the final model. The hp model in this study was estimated using the variables (covariates) that were found to be significant in the Poisson, gamma probability, or COM-Poisson models.

Table I presents these variables and their characteristics.

5. RESULTS

This section presents the modeling results for the hp GLM. The first part of this section presents the results for the model fitted to the Toronto intersection data, and the second part shows the results for the data from Korea railway-highway crossings.

5.1. Toronto Data

Table II summarizes the modeling results for the hp GLM with a fixed and a varying dispersion parameter and compares the results with those obtained from the NB model. The NB GLM with a fixed dispersion parameter was estimated with glm.nb in R, whereas the NB GLM with a variable dispersion parameter was estimated with PROC NLMIXED in SAS (32). All models were estimated using the maximum likelihood method. The values in parentheses indicate the standard errors of the parameter estimates.

As Table II indicates, there is no significant difference in the MPB, MAD, and MSPE of the models considered for the Toronto data. The only notable trend is the reduction in the bias (MPB) in both the hp and NB models when the dispersion parameter is allowed to vary. The MAD and MSPE measures of fit vary only slightly from one model to the other. This is due to the very similar estimates of the mean function parameters (β's). Note that the MPB, MAD, and MSPE all depend only on the mean function and not on the dispersion parameter. Similar β parameters have therefore resulted in similar values for these measures of fit.

On the other hand, the AIC measure depends not only on the mean function but also on the dispersion parameter. The reason is that the AIC depends on the model likelihood function which, in both the hp and NB model cases, has the dispersion parameter as an input. Thus, models with similar mean function parameters (β's) may have significantly different AICs (e.g., compare the hp models with fixed and varying dispersion parameters in Table II).

Table II indicates that, when the dispersion parameter is constant, the AIC of the NB model (5077.3) is considerably lower than that of the hp model (5157.3). The difference in AIC is large enough to infer that the NB model with a fixed dispersion parameter outperforms the hp model under the same condition. Nonetheless, when the dispersion parameter is allowed to vary depending on the covariates, the hp model's fit improves notably (its AIC decreases from the fixed-dispersion value of 5157.3; see Table II). Conversely, the NB model with a variable dispersion parameter is not a significant improvement, as two of the dispersion parameter function coefficients (δ0 and δ1) are found to be statistically insignificant (at α = 0.10) and the reduction in AIC from the fixed-dispersion value of 5077.3 is also marginal (see Table II). As a rule of thumb, when the change in AIC is less than 10, the difference is usually deemed to be insignificant (9). Thus, with a variable dispersion parameter, the hp model performs almost as well as the NB model.

The variance-mean relationship structure of the hp and NB distributions is the key to explaining the findings above. In the variance-mean function of the NB distribution shown by Equation (8), the over-dispersion parameter is the coefficient of the second-degree term of the mean, whereas in the hp distribution variance-mean function shown by Equation (7), the dispersion parameter can only affect the first-degree coefficient of the mean. Thus, the variance of the NB distribution is more sensitive to changes in the dispersion parameter and can increase at a faster rate. Figure 1(a) shows the mean-variance relationship of the hp and NB models with fixed dispersion parameters for the Toronto data. Clearly, the NB model variance increases more rapidly, and so the NB model fits the over-dispersed Toronto data set better than the hp model with a fixed dispersion parameter.

Once the dispersion parameter of the hp distribution is allowed to vary, the variance-mean relationship becomes more flexible and the hp model becomes more capable of fitting over-dispersed crash counts. As illustrated in Figure 1(b) for models with variable dispersion, the hp model mean-variance relationship becomes more similar to that of the NB model. When the mean is less than 25 crashes, the variances of the two distributions closely resemble each other. As the mean gets larger, however, the variance of the NB model increases at a higher rate than that of the hp model, and the difference between the variances becomes more pronounced.

Figure 2(a) illustrates the frequency distribution of the varying dispersion parameter of the hp distribution across all observations. It is important to note that even for such an over-dispersed data set, two of the observations have values of λ less than 1 and are therefore under-dispersed (conditional on the mean). Despite the very small number of under-dispersed observations in the Toronto data set, this finding illustrates how the hp model (with a variable dispersion parameter) can identify data points with under-dispersion, while the NB model fails to do so. Figure 2(b) shows the distribution of the varying dispersion parameter (α) of the NB model. The NB distribution is under-dispersed if α < 0, equi-dispersed if α = 0, and over-dispersed otherwise. As shown by Figure 2(b), the NB model did not identify any under-dispersed observations. It is probable that the NB model would not have performed as well if a large number of observations were under-dispersed (conditional on the mean).
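The variance-mean comparison discussed above can be illustrated with a short Python sketch. It assumes Equation (7) in the form σ² = θ2 + µ(θ2 − (λ − 1)) − µ², which is the form that follows from the recurrence in Equation (4), and Equation (8) as σ² = µ + αµ². Because θ2 has no closed form in terms of µ and λ, Equation (6) is inverted here by simple bisection, which is a stand-in for the `optimize` call in the R code and not the authors' implementation. The function names and the dispersion values λ = 3 and α = 0.2 are hypothetical and are not the fitted Toronto estimates.

```python
import math


def hyp1f1_one(lam, theta, tol=1e-14, max_terms=10_000):
    """Approximate 1F1(1; lam; theta) = sum_r theta^r / (lam)_r (Equation 3)."""
    term, total = 1.0, 1.0
    for r in range(max_terms):
        term *= theta / (lam + r)
        total += term
        if term < tol * total:
            break
    return total


def hp_mean(lam, theta):
    """Mean of the hp distribution in terms of theta2 and lambda (Equation 6)."""
    f = hyp1f1_one(lam, theta)
    return theta - (lam - 1.0) * (f - 1.0) / f


def solve_theta(mu, lam, iters=100):
    """Invert Equation (6) by bisection to get theta2 for a given mean.

    The mean increases with theta2, and Equation (6) implies that theta2
    lies between 0 and mu + max(lam - 1, 0), which is used as the bracket.
    """
    lo, hi = 0.0, mu + max(lam - 1.0, 0.0)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if hp_mean(lam, mid) < mu:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def hp_variance(mu, lam):
    """hp variance for a given mean and dispersion (Equation 7, as assumed above)."""
    theta = solve_theta(mu, lam)
    return theta + mu * (theta - (lam - 1.0)) - mu ** 2


def nb_variance(mu, alpha):
    """NB variance for a given mean and dispersion parameter (Equation 8)."""
    return mu + alpha * mu ** 2


if __name__ == "__main__":
    LAM, ALPHA = 3.0, 0.2  # hypothetical dispersion values, for illustration only
    for mu in (1.0, 5.0, 25.0, 50.0):
        print(mu, round(hp_variance(mu, LAM), 3), round(nb_variance(mu, ALPHA), 3))
```

Under these assumptions, the hp variance with a fixed λ exceeds the mean by an amount that levels off near λ − 1 as the mean grows, while the NB variance grows quadratically. This is one way to see why a fixed-dispersion hp model has less room to accommodate strongly over-dispersed sites.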

It is also interesting to compare the hp model performance in fitting over-dispersed crash data with that obtained by using the COM-Poisson model. Geedipally and Lord (10) fitted the COM-Poisson GLM with a variable shape parameter to the Toronto data using a full Bayesian (FB) approach with non-informative (vague) prior distributions on the parameters. Figure 3 illustrates the comparison of the mean-variance relationship of the hp and COM-Poisson models. The variances from the two models closely resemble each other for the entire range of the mean. The hp model can thus be expected to perform as well as the COM-Poisson model.

5.2. Korea RHX Data

Both Oh et al. (2) and Lord et al. (9) examined the application of the NB model to the under-dispersed (conditional on the mean) data from Korea railway-highway crossings, and deemed it to be inappropriate. These two studies also considered the Poisson model; despite the relatively good fit it provided, the authors stated that the Poisson model should not be used because the data are under-dispersed. Lord et al. (9) also noted that fitting the Poisson GLM to such under-dispersed data can have a significant effect on standard errors. Therefore, the current study compared the hp model fit to the two models found successful by the aforementioned researchers, i.e., the gamma probability and COM-Poisson models.

The Poisson, gamma probability, and COM-Poisson models for the Korea RHX data (2; 9) were originally developed using 31 candidate explanatory variables. According to Lord et al. (9), eight of these variables were found significant in at least one of the three models. These eight variables constituted the pool of candidate explanatory variables for the hp model developed in this study (see Table I). Disregarding the remaining 23 variables, it can be assumed that all final models were estimated using a common set of candidate variables.

To obtain greater accuracy and prevent inclusion of correlated variables in the model, a stepwise forward procedure with the likelihood ratio test (LRT) was adopted to identify the significant variables in this study. First, the dominant traffic flow (AADT) variable was introduced into the model (mean function), and the resulting log-likelihood value was recorded. Then, the other covariates entered the model in the order in which they contributed to the increase in log-likelihood per parameter. A variable was added to the model only if the increase in the log-likelihood was significant according to the LRT. The significance level of the LRT was set at α = 0.1 for the sake of consistency with the other models developed for the Korea data, with which the hp model was to be compared. The final model obtained from this stepwise procedure includes the following six variables in its mean function: AADT, presence of speed hump, train detector distance, presence of commercial area, presence of track circuit controller, and presence of a guide.

Table III presents the modeling results for the hp distribution model and the comparison with the other models. All models were estimated using the maximum likelihood method. The same set of variables as those in the COM-Poisson model were found significant in the hp model. However, it is necessary to note that the coefficients estimated for the COM-Poisson model are for the centering parameter and not for the mean (E[Y]), as is the case for the other distributions in Table III (see (9) for more details on the COM-Poisson GLM). The dispersion parameter of the hp model (0.298) confirms the finding of the previous studies that the Korea data are under-dispersed (conditional on the mean) (see also 15). Based on the AIC values, the hp model provides a fit as good as that of the COM-Poisson and gamma models.
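The decision rule used in the stepwise procedure above, together with the AIC used for model comparison, can be sketched in a few lines of Python. The sketch assumes each candidate adds exactly one parameter, so the LRT statistic is compared against a chi-square distribution with one degree of freedom, whose upper tail equals erfc(sqrt(x/2)). The function names are hypothetical, and any log-likelihood values passed to them are made-up inputs for illustration, not results from this study.

```python
import math


def lrt_p_value(loglik_reduced, loglik_full):
    """p-value of the likelihood ratio test for ONE added parameter.

    The statistic 2 * (logL_full - logL_reduced) is referred to a chi-square
    distribution with 1 degree of freedom (an assumption of this sketch).
    """
    stat = max(2.0 * (loglik_full - loglik_reduced), 0.0)
    return math.erfc(math.sqrt(stat / 2.0))


def should_add(loglik_reduced, loglik_full, alpha=0.10):
    """Forward-selection rule: keep the candidate if the LRT is significant at alpha."""
    return lrt_p_value(loglik_reduced, loglik_full) < alpha


def aic(loglik, n_params):
    """Akaike information criterion: 2k - 2 log L (lower is better)."""
    return 2.0 * n_params - 2.0 * loglik
```

At α = 0.1 and one degree of freedom, this rule admits a variable when it raises the log-likelihood by more than about 1.35, since the corresponding chi-square critical value is about 2.71.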

It is important to note that despite the similar quality of statistical fit, the three models compared in Table III each include a distinct set of variables. This comparison is still meaningful because all three models were estimated using a common pool of explanatory variables. The presence of a certain variable in one model and not in the other is attributable to the correlation among variables, meaning that the inclusion of a certain set of variables eliminates the need for one or more other variables. The considerably large difference between parameter estimates in different models is due to the distinct set of significant variables in each model.

Similar to the Toronto data application, the hp model for the Korea data performs very well in terms of the bias; the MPB of the hp model is very close to zero, indicating that the model neither over-predicts nor under-predicts the crashes. The COM-Poisson model also has a relatively small bias, but the value of MPB for the gamma model indicates that this model over-predicts the crashes. The MAD and MSPE of the hp distribution are almost as low as those of the COM-Poisson model, and better than those of the gamma model. Overall, the hp and COM-Poisson models performed almost equally well, slightly outperforming the gamma model.

6. CONCLUSIONS

The results of this study for the application of the hp GLM to crash data modeling are promising. The hp GLM with a covariate-dependent dispersion parameter could fit the over-dispersed data from Toronto almost as well as the popular NB model. When applied to the under-dispersed data from Korea, the hp model performed as well as the COM-Poisson and gamma probability models.

The hp model can handle under-dispersion, while the NB model cannot do so properly. Lord et al. (9) showed that application of the NB model to under-dispersed data can result in unstable and unreliable parameter estimates, hence mis-specified models. In modeling over-dispersed crash data, however, the authors acknowledge that the NB model is usually preferable to the hp model because the variance-mean relationship structure of the NB model offers more flexibility when the variance increases very rapidly with the mean. The NB model becomes especially useful when the data are highly over-dispersed. Nonetheless, this study showed that the hp GLM with covariate-dependent dispersion can perform satisfactorily even with an over-dispersed data set.

The GLM formulation of the hp model studied in this research has an advantage over the COM-Poisson GLM. In the hp model, the mean (E[Y]) is expressed in terms of the covariates, whereas in the COM-Poisson model, the centering parameter, which is approximately equal to the mode, is a function of the covariates. Thus, the hp GLM permits direct interpretation of the effect of each variable on the expected mean of crashes, while the COM-Poisson GLM permits interpretation of the effect on the approximate mode of the crash distribution. For instance, one might look at the sign of the variable coefficients in the hp model and directly quantify the effect on the expected mean of crashes of an increase in the value of each variable.

When compared to the gamma model, the hp model is preferred because it does not suffer from the theoretical issues involved with the gamma model formulation, as discussed in the first section of the paper.

This paper reports on the first steps of ongoing research on the application of the hp GLM in crash data modeling. Many aspects of the application need to be further investigated. For example, the hp model performance should be examined over a greater range of dispersion characteristics, likely through simulated data. It is also recommended to examine the hp model fit to crash frequency data from roadway segments and its use for identifying hazardous sites.


REFERENCES

1. Lord D, Mannering F. The Statistical Analysis of Crash-Frequency Data: A Review and Assessment of Methodological Alternatives. Transportation Research - Part A, 2010;44(5).
2. Oh J, Washington SP, Nam D. Accident Prediction Model for Railway-Highway Interfaces. Accident Analysis & Prevention, 2006;38(2).
3. Consul P, Famoye F. Generalized Poisson Regression-Model. Communications in Statistics-Theory and Methods, 1992;21(1).
4. Castillo J, Pérez-Casany M. Overdispersed and Underdispersed Poisson Generalizations. Journal of Statistical Planning and Inference, 2005;134.
5. Cameron AC, Johansson P. Count Data Regression Using Series Expansions: with Applications. Journal of Applied Econometrics, 1997;12(3).
6. Conway RW, Maxwell WL. A Queuing Model with State Dependent Service Rates. Journal of Industrial Engineering, 1962;12.
7. Shmueli G, Minka T, Kadane JB, Borle S, Boatwright P. A Useful Distribution for Fitting Discrete Data: Revival of the Conway-Maxwell-Poisson Distribution. Journal of the Royal Statistical Society Series C, 2005;54(1).
8. Lord D, Guikema SD, Geedipally S. Application of the Conway-Maxwell-Poisson Generalized Linear Model for Analyzing Motor Vehicle Crashes. Accident Analysis & Prevention, 2008;40(3).
9. Lord D, Geedipally SR, Guikema SD. Extension of the Application of Conway-Maxwell-Poisson Models: Analyzing Traffic Crash Data Exhibiting Underdispersion. Risk Analysis, 2010;30(8).
10. Geedipally SR, Lord D. Examination of Crash Variances Estimated by Poisson-Gamma and Conway-Maxwell-Poisson Models. Transportation Research Record, 2011;2241.
11. Sellers KF, Shmueli G. A Flexible Regression Model for Count Data. Annals of Applied Statistics, 2010;4(2).
12. Sellers K, Borle S, Shmueli G. The COM-Poisson Model for Count Data: A Survey of Methods and Application. Applied Stochastic Models in Business and Industry, 2012;28(2).
13. Francis RA, Geedipally SR, Guikema SD, Dhavala SS, Lord D, LaRocca S. Characterizing the Performance of the Conway-Maxwell-Poisson Generalized Linear Model. Risk Analysis, 2012;32(1).
14. Winkelmann R. Duration Dependence and Dispersion in Count-Data Models. Journal of Business & Economic Statistics, 1995;13(4).
15. Zou Y, Geedipally SR, Lord D. Evaluating the Double Poisson Generalized Linear Model. Accident Analysis & Prevention, 2013; forthcoming.
16. Daniels S, Brijs T, Nuyts E, Wets G. Explaining Variation in Safety Performance of Roundabouts. Accident Analysis & Prevention, 2010;42(2).
17. Daniels S, Brijs T, Nuyts E, Wets G. Extended Prediction Models for Crashes at Roundabouts. Safety Science, 2011;49(2).
18. Efron B. Double Exponential-Families and their Use in Generalized Linear-Regression. Journal of the American Statistical Association, 1986;81(395).
19. Sáez-Castillo AJ, Conde-Sánchez A. A Hyper-Poisson Regression Model for Overdispersed and Underdispersed Count Data. Computational Statistics and Data Analysis, 2012;61.
20. Bardwell GE, Crow EL. A Two-Parameter Family of Hyper-Poisson Distributions. Journal of the American Statistical Association, 1964;59(305).
21. Hauer E. Overdispersion in Modeling Accidents on Road Sections and in Empirical Bayes Estimation. Accident Analysis & Prevention, 2001;33(6).
22. Heydecker BG, Wu J. Identification of Sites for Road Accident Remedial Work by Bayesian Statistical Methods: An Example of Uncertain Inference. Advances in Engineering Software, 2001;32.
23. Miaou SP, Lord D. Modeling Traffic Flow Relationships at Signalized Intersections: Dispersion Parameter, Functional Form and Bayes vs Empirical Bayes. Transportation Research Record, 2003;1840.
24. Mitra S, Washington SP. On the Nature of Over-Dispersion in Motor Vehicle Crash Prediction Models. Accident Analysis & Prevention, 2007;39(3).
25. Guikema SD, Coffelt JP. A Flexible Count Data Regression Model for Risk Analysis. Risk Analysis, 2008;28(1).
26. Johnson NL, Kotz S, Kemp AW. Univariate Discrete Distributions, 3rd ed. New York: Wiley.
27. Saha K, Paul S. Bias-Corrected Maximum Likelihood Estimator of the Negative Binomial Dispersion Parameter. Biometrics, 2005;61(1).
28. American Association of State Highway and Transportation Officials (AASHTO). Highway Safety Manual, 1st ed. AASHTO.
29. R Development Core Team. R: A Language and Environment for Statistical Computing. Vienna (Austria): R Foundation for Statistical Computing.
30. Lord D. The Prediction of Accidents on Digital Networks: Characteristics and Issues Related to the Application of Accident Prediction Models [dissertation]. Toronto (ON): University of Toronto.
31. Miranda-Moreno LF, Fu L. Traffic Safety Study: Empirical Bayes or Full Bayes? 84th Annual Meeting of the Transportation Research Board, Washington, DC.
32. SAS Institute Inc. SAS System for Windows, 9th ver. Cary (NC).
33. Burnham KP, Anderson DR. Model Selection and Multimodel Inference: A Practical Information-Theoretic Approach, 2nd ed. Springer-Verlag.
34. Oh J, Lyon C, Washington SP, Persaud BN, Bared J. Validation of the FHWA Crash Models for Rural Intersections: Lessons Learned. Transportation Research Record, 2003;1840.

TABLES AND FIGURES

Table I: Summary statistics of the data sets in this study

Variables; Min.; Max.; Average (SD); Frequency

Toronto Data
Crashes (10.02) 868
Major approach AADT 5,469 72,178 28, (10,660.4) 868
Minor approach AADT 53 42,644 11, (8,599.40) 868

Korea Data
Crashes (0.60) 162
Highway AADT 10 61, ( ) 162
Average daily highway traffic (rail.trf) (37.34) 162
Train detector distance (dist.trn.dtc) 0 1, (328.38) 162
Time duration between activation of warning signals and gates (wrn.time) (25.71) 162
Presence of commercial area (p.comm): 1 (yes) 149 (91.98%); 0 (no) 13 (8.02%)
Presence of a speed hump (p.hump): 1 (yes) 134 (82.72%); 0 (no) 28 (17.28%)
Presence of a track circuit controller (p.trck.cric.cont): 1 (yes) 113 (69.75%); 0 (no) 49 (30.25%)
Presence of a guide (p.guide): 1 (yes) 126 (77.78%); 0 (no) 36 (22.22%)
= not applicable

Table II: Modeling results for the hp and NB GLMs with the Toronto data

Columns: Estimate; Hyper-Poisson model, fixed dispersion parameter; Hyper-Poisson model, varying dispersion parameter; Negative-Binomial model, fixed dispersion parameter; Negative-Binomial model, varying dispersion parameter

Ln(β0) (0.4464) (0.4325) (0.465) (0.4555)
β (0.0462) ( ) ( ) ( )
β ( ) ( ) ( ) ( )
λ
α (0.0122)
Ln(δ0) (2.709) (2.4381)
δ (0.2677) (0.2345)
δ (0.1073) (0.1002)
AIC
MPB
MAD
MSPE

AIC = Akaike information criterion (33); MPB = mean prediction bias (34); MAD = mean absolute deviance (34); MSPE = mean squared predictive error (34); = not applicable

Table III: Parameter Estimates and GOF Measures of Three Different Models for the Korea Data

Variables; COM-Poisson; Gamma; Hyper-Poisson

Constant (1.206) a (1.008) a (0.756)
Ln(ADT) 0.648 (0.139) 0.230 (0.076) 0.472 (0.057)
Average daily railway traffic (0024) -
Presence of commercial area 1.474 (0.513) 0.651 (0.287) 0.965 (0.370)
Train detector distance (0.0007) 0.001 (0.0004) (0.0006)
Time duration between the activation of warning signals and gates (0.002) -
Presence of track circuit controller (0.431) (0.303)
Presence of guide -88(0.512) (0.294)
Presence of speed hump (0.531) -1.58 (0.859) (0.441)
Shape parameter 2.349 (0.634) 2.062 (0.758) -
Dispersion parameter (0.189)
AIC
MPB
MAD
MSPE

a Standard error; - = not applicable
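The MPB, MAD and MSPE rows in Tables II and III summarize prediction errors. The sketch below shows one way to compute them, assuming the usual definitions from reference (34) with errors taken as predicted minus observed, so that a positive MPB indicates over-prediction. The function name `gof_measures` and its arguments are hypothetical and not taken from the paper.

```python
def gof_measures(observed, predicted):
    """Return MPB, MAD and MSPE for paired observed/predicted crash counts.

    Assumed sign convention: error = predicted - observed, so MPB > 0
    means the model over-predicts on average.
    """
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("observed and predicted must be non-empty and equal in length")
    n = len(observed)
    errors = [p - o for o, p in zip(observed, predicted)]
    return {
        "MPB": sum(errors) / n,                   # mean prediction bias
        "MAD": sum(abs(e) for e in errors) / n,   # mean absolute deviance
        "MSPE": sum(e * e for e in errors) / n,   # mean squared predictive error
    }


if __name__ == "__main__":
    # Toy numbers for illustration only; they are not data from the study.
    print(gof_measures([0, 1, 2, 3], [0.5, 1.0, 1.5, 3.0]))
```

An MPB near zero with a non-trivial MAD, as in the toy example, shows why the three measures are reported together: errors of opposite sign cancel in the bias but not in the absolute or squared measures.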

Figure 1: Crash variance vs. mean for the Toronto data obtained by the models with (a) fixed and (b) variable dispersion parameter. Each panel plots variance against mean for the hp and NB models.
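The variance-mean behavior shown in Figure 1 can be explored numerically. The sketch below assumes the standard two-parameter hyper-Poisson form of Bardwell and Crow (20), in which P(Y = y) is proportional to lam**y / (gamma)_y with (gamma)_y the rising factorial. The names `lam` and `gamma` are assumptions for this sketch and may not match the symbols used in the paper's tables.

```python
def hyper_poisson_pmf(lam, gamma, rel_tol=1e-16, max_terms=100_000):
    """Return [P(Y=0), P(Y=1), ...] for a hyper-Poisson variable (truncated).

    Unnormalized terms follow t_0 = 1 and t_{y+1} = t_y * lam / (gamma + y),
    which is lam**y / (gamma)_y. Their sum approximates 1F1(1; gamma; lam).
    """
    if lam <= 0 or gamma <= 0:
        raise ValueError("lam and gamma must be positive")
    terms = [1.0]
    total = 1.0
    for y in range(max_terms):
        nxt = terms[-1] * lam / (gamma + y)
        terms.append(nxt)
        total += nxt
        # Past y > lam the terms shrink; stop when they are negligible.
        if y + 1 > lam and nxt < rel_tol * total:
            break
    return [t / total for t in terms]


def mean_and_variance(pmf):
    """Mean and variance of a count distribution given as a list of probabilities."""
    mean = sum(y * p for y, p in enumerate(pmf))
    second = sum(y * y * p for y, p in enumerate(pmf))
    return mean, second - mean * mean


if __name__ == "__main__":
    for g in (0.5, 1.0, 2.0):
        m, v = mean_and_variance(hyper_poisson_pmf(3.0, g))
        print(f"gamma={g}: mean={m:.4f}, variance={v:.4f}")
```

In this parameterization, gamma = 1 recovers the Poisson distribution, gamma > 1 gives variance above the mean, and gamma < 1 gives variance below the mean, which is the property that lets a single model cover both the Toronto and Korea data sets.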

Figure 2: Frequency distribution of (a) the varying dispersion parameter of the hp model for the Toronto data and (b) the varying dispersion parameter of the NB model for the Toronto data.

Figure 3: Crash variance-mean relationship of the COM-Poisson vs. the hp model for the Toronto data. The plot compares the hp model (variable dispersion) with the COM-Poisson model (variable shape parameter), with variance plotted against mean.
