Comparison of Confidence and Prediction Intervals for Different Mixed-Poisson Regression Models

Submitted by

John E. Ash
Research Assistant
Department of Civil and Environmental Engineering, University of Washington
Box 00, Seattle, WA -00
Tel: () -
jeash@uw.edu

Yajie Zou, Ph.D.
Key Laboratory of Road and Traffic Engineering, Ministry of Education
Tongji University, Shanghai 00, China
Tel: () 0
yajiezou@hotmail.com

Dominique Lord, Ph.D.
Professor
Department of Civil Engineering, Texas A&M University, TAMU
College Station, TX -
Tel: () -, fax: () -
d-lord@tamu.edu

Yinhai Wang, Ph.D. (Corresponding Author)
Professor
Department of Civil and Environmental Engineering, University of Washington
Box 00, Seattle, WA -00
Tel: (0) -, Fax: (0) -
yinhai@uw.edu

Word Count:, + ( figure * 0) + ( tables * 0) =, words

Submitted for Presentation at the th Annual Meeting of the Transportation Research Board, Washington, D.C., January -, 0
Submitted on August, st, 0
Revised November 0

ABSTRACT

A major focus for transportation safety analysts is the development of crash prediction models, a task for which an extremely wide selection of model types is available. Perhaps the most common crash prediction model is the negative binomial (NB) regression model. The NB model gained popularity due to its relative ease of implementation and its ability to handle overdispersion in crash data. Recently, many new models, including the Poisson-inverse-Gaussian, Sichel, Poisson-lognormal, and Poisson-Weibull models, have been introduced; they can also accommodate overdispersion and could potentially replace the NB model, since many have been found to perform better. All five of the aforementioned models, including the NB model, can be classified as mixed-Poisson models. A mixed-Poisson model arises when an error term, following a chosen mixture distribution, enters the functional form for the Poisson parameter. For the NB model, the mixture distribution is the gamma distribution, hence the alternate name Poisson-gamma model. In this paper, confidence intervals (CIs) for the Poisson mean (μ) and the Poisson parameter (m, alternately referred to as the safety), as well as prediction intervals (PIs) for the predicted number of crashes at a new site, are derived for each of the aforementioned types of mixed-Poisson models. After the derivations, the theory is put into practice by estimating CIs and PIs for mixed-Poisson models developed from a Texas crash dataset. Ultimately, this study provides safety analysts with tools to express the uncertainty associated with estimates from safety-modeling efforts instead of simply providing point estimates.

INTRODUCTION

Transportation safety analysts often develop statistical models to predict crash frequencies from a variety of factors, including the geometric characteristics of facilities and traffic volumes, among many others (). With constant advances in statistical methodology, a variety of model types are available for analyzing crash frequency data (). Early efforts in crash frequency modeling typically relied on a Poisson regression model (-). Although the Poisson model is very straightforward to use, it cannot handle the overdispersion (or, for that matter, underdispersion) that is commonly observed in crash data (). Overdispersion occurs when the variance of the crash counts is greater than the mean (). Lord et al. () noted that overdispersion is a consequence of considering crash data as resulting from Poisson trials (i.e., Bernoulli trials where the probability of a crash in each trial is not constant). Common features of overdispersed crash datasets include a high frequency of zero-valued and/or large-valued crash counts that cannot be modeled properly by a simple Poisson distribution ().

In an effort to accommodate the overdispersion in crash data, a variety of models have been introduced. Perhaps the most popular is the negative binomial (NB) model, which has been used by many researchers to model overdispersed crash data (; -). A key feature of the NB model is the assumption that the mean crash frequency (i.e., the Poisson parameter) for any site i, λi, follows a gamma distribution (). Thus, a formulation for the marginal mean and variance of a crash count, yi, is obtained in which the variance can exceed the mean (; ; ).
The NB model is alternately referred to as the Poisson-gamma model because the crash count for site i, yi, conditioned on the Poisson parameter λi (which follows the gamma distribution), is itself Poisson distributed. That being said, there is no reason to assume that λi must be gamma distributed (; ). In fact, researchers have investigated a variety of other distributions for the λ parameter, which result in several other types of mixed-Poisson regression models (i.e., models in which the crash count follows a Poisson distribution when conditioned on the Poisson parameter, whose own distribution is known as the mixing or mixture distribution) (; ). For example, one alternate choice of mixture distribution is the generalized inverse Gaussian (GIG) distribution, which gives rise to the Sichel (SI) model for overdispersed count data (). Zou et al. () applied a Sichel model to a highly dispersed crash dataset from Texas and compared the results to those obtained from a traditional NB model. They found that the SI model yielded lower values of both the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) (i.e., better statistical fit) than the NB model. Yet another choice of mixture distribution is the inverse Gaussian (IG) distribution, which gives rise to the Poisson-inverse-Gaussian (PIG) model (). Zha et al. (0) analyzed the aforementioned Texas crash dataset as well as a crash dataset from Washington and found that PIG regression models provided better fit (in terms of AIC and BIC) than traditional NB models. Other possible choices for the mixture distribution include, but are not limited to, the Weibull and lognormal distributions, leading to the Poisson-Weibull (PW) and Poisson-lognormal (PLN) models, respectively. Both types of models can accommodate overdispersion; the PW model has been applied to crash data by Cheng et al.
() and the PLN model has been investigated by Lord and Miranda-Moreno () and Aguero-Valverde and Jovanis (), among others.

All of the aforementioned mixed-Poisson regression models provide only a point estimate of the expected crash frequency at a given site. Although point estimates may provide some benefit in prediction, in many cases a confidence interval for a given estimate is preferable (). Notably, confidence intervals are important in safety decision-making because they express the uncertainty of a given point estimate (; ). Wood () derived formulae for the prediction intervals (PIs) for the predicted response (i.e., crash frequency at a new site, yi), and confidence intervals (CIs) for the gamma mean (mi) and the true mean crash frequency (alternately referred to as the mean response or Poisson mean, μi), from the NB (Poisson-gamma) regression model. It is important to note the distinction here between the Poisson parameter and the Poisson mean. In standard Poisson regression, the two values are equal. In a mixed-Poisson model, however, the error term introduced into the Poisson parameter means that the two are no longer equal. Further detail on this issue is provided later in the paper. Ultimately, Wood noted that PIs and CIs may be especially useful for predicting the number of crashes expected to occur at different sites with features similar to those of the sites considered in model development. Lord () developed a methodology for calculating the predicted confidence intervals for the product of NB regression models and crash modification factors (CMFs). Geedipally and Lord () compared PIs for the predicted response (y) and CIs for the gamma mean (m) and mean response/Poisson mean (μ) as computed from NB models with fixed and varying dispersion parameters, as well as for univariate and bivariate NB models (). Lord et al. () estimated PIs for the number of crashes (y) as obtained from an NB model with several covariates and compared them to those estimated from a baseline model (i.e., flow-only model) that was adjusted with accident modification factors (AMFs). Connors et al. () plotted CIs for predicted values of μ and m for variable flows and segment lengths for the NB and PLN cases; they did not, however, provide explicit formulae for the CIs and PIs associated with the PLN model.

The goal of this study is to extend the work of Wood () and develop the associated CIs and PIs for a broader range of mixed-Poisson regression models commonly used by safety analysts today.
To ensure this work aligns with Wood (), PI is used henceforth in reference to y, and CI in reference to m and μ. Besides reviewing the derivation of the PI for y and the CIs for m and μ for the NB model (Model ) per Wood (), derivations of CIs and PIs for these three values (where m is now generalized to be the Poisson parameter that follows the given mixture distribution, alternately referred to as the safety) will be provided for the Poisson-inverse-Gaussian (Model ), Sichel (Model ), Poisson-Weibull (Model ), and Poisson-lognormal (Model ) regression models. Then, a case study using the aforementioned Texas crash dataset will be conducted, in which the five mixed-Poisson models under consideration will be estimated. Once the models have been estimated, the PI for y and the CIs for m and μ will be estimated and plotted for each model. These CIs and PIs will then be compared and discussed, both with regard to the case study and in general terms. Ultimately, this study provides safety analysts with tools to express the uncertainty associated with estimates from safety-modeling efforts instead of simply providing point estimates.

DERIVATION OF CONFIDENCE AND PREDICTION INTERVALS

This section provides the derivations of the confidence and prediction intervals for each type of mixed-Poisson model considered in this study. First, however, brief background information on mixed-Poisson models is provided. In general, there are three levels in the hierarchy of a mixed-Poisson model. At the lowest level is the mean response (μi), also known as the Poisson mean, whose estimator is treated as following a normal distribution, N(μ0, σ0²). One level up is the Poisson parameter (mi), alternately known as the safety, which, when conditioned on the mean response, follows the mixture distribution under consideration. Finally, there is the predicted response (yi), i.e., the crash frequency at site i, which, when conditioned on the Poisson parameter (mi), follows a Poisson distribution.
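The upper two levels of this hierarchy can be illustrated with a short simulation. The sketch below is not part of the original study: it assumes a gamma mixture with unit mean and variance `alpha` (the NB case), holds the mean response `mu` fixed instead of drawing it from its normal distribution, and uses hypothetical helper names (`poisson_draw`, `simulate_counts`) with arbitrary parameter values.

```python
import math
import random


def poisson_draw(rng, lam):
    """Draw one Poisson(lam) variate by Knuth's multiplication method.

    Adequate for the small means used here; not intended for large lam.
    """
    threshold = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= threshold:
            return k
        k += 1


def simulate_counts(mu, alpha, n, seed=0):
    """Simulate n crash counts y | m ~ Poisson(m), with m = mu * nu.

    Assumption for illustration: nu ~ gamma with shape 1/alpha and
    scale alpha, so that E[nu] = 1 and Var[nu] = alpha.
    """
    rng = random.Random(seed)
    shape, scale = 1.0 / alpha, alpha
    counts = []
    for _ in range(n):
        nu = rng.gammavariate(shape, scale)   # multiplicative error term
        m = mu * nu                           # Poisson parameter ("safety")
        counts.append(poisson_draw(rng, m))   # observed crash count
    return counts


if __name__ == "__main__":
    y = simulate_counts(mu=2.0, alpha=0.5, n=20000, seed=1)
    mean = sum(y) / len(y)
    var = sum((c - mean) ** 2 for c in y) / (len(y) - 1)
    print(f"sample mean = {mean:.3f}, sample variance = {var:.3f}")
```

For a gamma mixture the marginal variance is μ + αμ², so the sample variance should exceed the sample mean, which is the overdispersion a plain Poisson model cannot reproduce.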

Mixed-Poisson Models and Formulation

A mixed-Poisson model is defined by two primary criteria. First, the count under consideration (i.e., the number of crashes yi) follows a Poisson distribution conditional on the Poisson parameter λi (). Following the terminology and notation of Wood (), the Poisson parameter λi will be referred to as the safety and denoted mi:

$$f(y_i \mid m_i) = \frac{\exp(-m_i)\, m_i^{y_i}}{y_i!}, \qquad y_i = 0, 1, 2, \ldots$$ ()

Second, the Poisson parameter, mi, has a multiplicative error term following a chosen mixture distribution (e.g., gamma, inverse Gaussian, etc.) that is expressed in the conditional mean (i.e., the safety conditioned on the mean response μi) (). Note that without the error term, the expression reduces to that of the mean response (μi), alternately referred to as the Poisson mean.

$$m_i = \exp\Big(\beta_0 + \sum_{j=1}^{K} x_{ij}\beta_j + \varepsilon_i\Big) = \exp\Big(\beta_0 + \sum_{j=1}^{K} x_{ij}\beta_j\Big)\exp(\varepsilon_i) = \mu_i\,\nu_i$$ ()

Where,
i = site index;
βj = jth regression coefficient;
xij = jth covariate for site i; and
εi = error term such that exp(εi), itself referred to as νi, follows the chosen mixture distribution.

As aforementioned, Y|mi ~ Poisson(mi). The marginal distribution for Y is derived by integrating out the error term νi as follows, where h(νi) is the density of the mixture distribution ():

$$f(y_i \mid \mu_i) = \int_0^{\infty} g(y_i \mid \mu_i, \nu_i)\, h(\nu_i)\, d\nu_i = E_{\nu}\left[g(y_i \mid \mu_i, \nu_i)\right]$$ ()

It can then be shown, via the equality mi = μiνi, that the Poisson parameter mi does indeed follow the mixture distribution (just like νi) ().

Parametrizations of Mixture Distributions

A total of five mixed-Poisson models are considered in this study. The models, the corresponding mixture distributions for mi and νi, and the parameterizations of the mixture distributions are presented in the following subsections, where each header gives the model type and the distribution in parentheses is the mixture distribution. The subscript i is left out without loss of generality.
Negative Binomial [NB] Model (Gamma)

The NB model arises when the mixture distribution for the Poisson-mixture model is chosen to be the gamma distribution. Specifically, ν~gamma(δ, φ); however, in order to properly identify the intercept in the regression equation, E[ν]=1 is required. This result is obtained by setting δ=φ, leading to a one-parameter gamma distribution. It then follows that Var[ν]=1/φ, alternately stated Var[ν]=α, and further that m|φ,μ ~ gamma(φ, φ/μ) ().

Poisson-Inverse-Gaussian [PIG] Model (Inverse Gaussian)

The PIG model arises when the inverse Gaussian (IG) distribution is selected as the mixture distribution. Specifically, ν~IG(μIG, λ), where the subscript IG is used to distinguish μIG (the mean of the IG distribution) from the Poisson mean (μ). As was the case with the NB model, the intercept identification condition calls for E[ν]=1. If μIG=1, then E[ν]=1, and further Var[ν]=1/λ (0).

Sichel [SI] (Generalized Inverse Gaussian)

If the mixture distribution is selected to be the generalized inverse Gaussian (GIG) distribution, the Sichel model is obtained. Here, ν~GIG(μGIG, σ, νGIG), where the subscript GIG is used to distinguish the mean (μGIG) and shape parameter (νGIG) of the GIG distribution from the Poisson mean (μ) and error term (ν), respectively. If E[ν]=1 (intercept identification condition), the variance of ν is expressed as follows (; 0):

$$\mathrm{Var}[\nu] = \frac{2\sigma(\nu_{GIG}+1)}{c} + \frac{1}{c^{2}} - 1$$ ()

Where,

$$c = R_{\nu_{GIG}}\!\left(\frac{1}{\sigma}\right); \qquad R_{\lambda}(t) = \frac{K_{\lambda+1}(t)}{K_{\lambda}(t)}; \qquad K_{\lambda}(t) = \frac{1}{2}\int_0^{\infty} x^{\lambda-1}\exp\!\left[-\frac{1}{2}t\left(x + x^{-1}\right)\right]dx$$

(where Kλ(t) is the modified Bessel function of the third kind).

Poisson-lognormal [PLN] (Lognormal)

When the mixture distribution is selected as the lognormal distribution, the Poisson-lognormal model is obtained. Here, ν~logN(d, σ²LN), where the LN subscript denotes that σ²LN is the variance parameter of the lognormal distribution for ν, not the variance of the normally distributed Poisson mean, σ0². The mean of the lognormal distribution is expressed as follows ():

$$E[\nu] = \exp\!\left(d + \frac{\sigma_{LN}^{2}}{2}\right)$$ ()

The intercept identification condition E[ν]=1 is then obtained by requiring d = -σ²LN/2.
The variance of ν is then obtained by substituting the aforementioned expression for d into the following equation for Var[ν]:

$$\mathrm{Var}[\nu] = \left(e^{\sigma_{LN}^{2}} - 1\right)e^{2d+\sigma_{LN}^{2}} = e^{\sigma_{LN}^{2}} - 1$$ ()
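The lognormal substitution can be checked numerically. The following is a minimal sketch, not taken from the paper: the function names and the value chosen for σ²LN are hypothetical, and the formulas are the standard lognormal mean and variance used in the two equations above.

```python
import math


def lognormal_location(sigma2_ln):
    """Location d that enforces the identification condition E[nu] = 1."""
    return -sigma2_ln / 2.0


def lognormal_mean(d, sigma2_ln):
    """E[nu] for nu ~ logN(d, sigma2_ln)."""
    return math.exp(d + sigma2_ln / 2.0)


def lognormal_variance(d, sigma2_ln):
    """Var[nu] for nu ~ logN(d, sigma2_ln)."""
    return (math.exp(sigma2_ln) - 1.0) * math.exp(2.0 * d + sigma2_ln)


if __name__ == "__main__":
    sigma2_ln = 0.8                      # arbitrary illustrative value
    d = lognormal_location(sigma2_ln)
    print("E[nu]   =", lognormal_mean(d, sigma2_ln))
    print("Var[nu] =", lognormal_variance(d, sigma2_ln))
    print("exp(sigma2_ln) - 1 =", math.exp(sigma2_ln) - 1.0)
```

With d = -σ²LN/2 the factor exp(2d + σ²LN) equals one, which is why the variance collapses to exp(σ²LN) - 1.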

Poisson-Weibull [PW] (Weibull)

When the error term, ν, follows the Weibull distribution, the Poisson-Weibull model is obtained. Specifically, ν~Weibull(μWEI, σ), where the subscript WEI is used to distinguish the Weibull parameter μWEI from the Poisson mean (μ). The mean of the Weibull distribution is expressed as follows (; 0):

$$E[\nu] = \mu_{WEI}^{-1/\sigma}\,\Gamma\!\left(\frac{1}{\sigma}+1\right)$$ ()

Thus, in order to meet the intercept identification condition of E[ν]=1, the following must hold:

$$\mu_{WEI} = \left[\Gamma\!\left(\frac{1}{\sigma}+1\right)\right]^{\sigma}$$

With E[ν]=1, the variance of ν can thus be obtained by substituting the aforementioned expression for μWEI into the following equation for Var[ν]:

$$\mathrm{Var}[\nu] = \mu_{WEI}^{-2/\sigma}\left[\Gamma\!\left(\frac{2}{\sigma}+1\right) - \left(\Gamma\!\left(\frac{1}{\sigma}+1\right)\right)^{2}\right] = \frac{\Gamma\!\left(\frac{2}{\sigma}+1\right)}{\left(\Gamma\!\left(\frac{1}{\sigma}+1\right)\right)^{2}} - 1$$ ()

Derivation of Confidence Intervals for Poisson Mean (True Mean Crash Frequency) (μ)

For this study, we consider a generalized linear model (GLM) for crash prediction, where each site of interest is a road segment, of the form shown in Equation ( a) and ( b).

$$\eta = \log\!\left(\frac{\mu}{L\,t}\right) = \log(\beta_0) + \sum_{j=1}^{n} x_{ij}\beta_j$$ ( a), ( b)

Where,
η = linear predictor;
βj = jth regression coefficient;
xij = jth predictor for segment (site) i;
L = segment length (mi); and
t = time period over which crash data were collected.

The product of segment length and the duration over which crash data were collected for each site is treated as an offset, leading to a reformulation of the regression in Equation () as follows:

$$\eta = \log(\mu) = \log(\beta_0) + \sum_{j=1}^{n} x_{ij}\beta_j + \log(L\,t)$$ ()

Under Equation (), if the first covariate is taken to be traffic volume (F), entered in logarithmic form, we have:

$$\mu = \beta_0\, F^{\beta_1}\,(L\,t)\,\exp\!\Big(\sum_{j=2}^{n} x_{ij}\beta_j\Big)$$

In the GLM, the estimators of the regression coefficients, $\hat{\beta}_j$, follow a multivariate normal distribution, $[\hat{\beta}_0, \ldots, \hat{\beta}_n] \sim N([\beta_0, \ldots, \beta_n], \Sigma)$ (). From Equation (), it is clear that μ=exp(η). As done previously, the subscript i for the values of η and μ at site i is omitted without loss of generality. We can use this fact to derive an approximate 100(1-α)% confidence interval for the Poisson mean (alternately, the true mean crash count), μ, as follows (where $Z_{1-\alpha/2}$ is the 1-α/2 quantile of the standard normal distribution) ():

$$\exp\!\left(\hat{\eta} \pm Z_{1-\alpha/2}\sqrt{\mathrm{Var}(\hat{\eta})}\right) = \exp(\hat{\eta})\exp\!\left(\pm Z_{1-\alpha/2}\sqrt{\mathrm{Var}(\hat{\eta})}\right) = \hat{\mu}\exp\!\left(\pm Z_{1-\alpha/2}\sqrt{\mathrm{Var}(\hat{\eta})}\right)$$

$$\left[\frac{\hat{\mu}}{\exp\!\left(Z_{1-\alpha/2}\sqrt{\mathrm{Var}(\hat{\eta})}\right)},\ \hat{\mu}\exp\!\left(Z_{1-\alpha/2}\sqrt{\mathrm{Var}(\hat{\eta})}\right)\right]$$ ()

Thus, regardless of the choice of mixture distribution, the approximate 100(1-α)% CI for the Poisson mean (alternately, the true mean crash count), μ, is as formulated in Equation (). The reader is referred to () for the steps to calculate the variance of the linear predictor.

Derivation of Confidence Intervals for Poisson Parameter (m)

Here, the derivation of an approximate 100(1-α)% confidence interval for the Poisson parameter, alternately referred to as the safety, m, is presented based on the procedure outlined in Wood (). Before beginning the derivation, it is important to note a useful result that is used in several subsequent calculations: the distribution of the estimator of the Poisson mean ($\hat{\mu}$), although technically lognormal, can be approximated as normal (). Hence, $\hat{\mu} \sim N(\mu_0 = \hat{\mu},\ \sigma_0^2 = \hat{\mu}^2\,\mathrm{Var}(\hat{\eta}))$. Following Wood (), the basic formulation of an approximate 100(1-α)% CI for the Poisson parameter, alternately the safety, m, is presented in Equation ().

$$\hat{\mu} \pm Z_{1-\alpha/2}\sqrt{\mathrm{Var}(m)}$$ ()

Mathematically, it is possible for the lower bound of the CI in Equation () to be negative, though physically speaking, negative values of m are not sensible.
Hence, the CI for m in Equation () is reformulated as:

$$\left[\max\left\{0,\ \hat{\mu} - Z_{1-\alpha/2}\sqrt{\mathrm{Var}(m)}\right\},\ \hat{\mu} + Z_{1-\alpha/2}\sqrt{\mathrm{Var}(m)}\right]$$ ()

The variance of m is formulated as follows:

$$\begin{aligned}
\mathrm{Var}(m) &= \mathrm{Var}(\mu\nu) \\
&= E(\mu^{2}\nu^{2}) - [E(\mu\nu)]^{2} \\
&= E(\mu^{2})E(\nu^{2}) - E(\mu)^{2}E(\nu)^{2} \quad \text{[by independence of } \mu \text{ and } \nu\text{]} \\
&= \left[\mathrm{Var}(\mu) + E(\mu)^{2}\right]\left[\mathrm{Var}(\nu) + E(\nu)^{2}\right] - E(\mu)^{2}E(\nu)^{2}
\end{aligned}$$ ()

The derived expressions for the variance of m for each of the mixture distributions considered in this study are presented in Table.

TABLE Variance of m for Mixture Distributions

Mixture Distribution: Var(m)
Gamma: $\alpha(\sigma_0^2 + \mu_0^2) + \sigma_0^2$ [Note: $\alpha = 1/\varphi$]
Inverse Gaussian: $\frac{1}{\lambda}(\sigma_0^2 + \mu_0^2) + \sigma_0^2$
Generalized Inverse Gaussian: $[\sigma_0^2 + \mu_0^2]\left(\frac{2\sigma(\nu_{GIG}+1)}{c} + \frac{1}{c^2}\right) - \mu_0^2$
Lognormal: $e^{\sigma_{LN}^2}[\sigma_0^2 + \mu_0^2] - \mu_0^2$
Weibull: $[\sigma_0^2 + \mu_0^2]\left[\frac{\Gamma(2/\sigma+1)}{(\Gamma(1/\sigma+1))^2}\right] - \mu_0^2$

With formulations for Var(m) in hand, and recalling that the distribution of $\hat{\mu}$ can be approximated as normal (which leads to the key substitution $\sigma_0^2 = \hat{\mu}^2\,\mathrm{Var}(\hat{\eta})$), the derived CIs for m for each type of mixed-Poisson model considered in this study are presented in Table. For simplicity, % CIs are shown.

TABLE % Confidence Intervals for m

Each interval has the form $[\max(0,\ \hat{\mu} - z\,\hat{\mu}\sqrt{A}),\ \hat{\mu} + z\,\hat{\mu}\sqrt{A}]$, where z is the standard normal critical value $Z_{1-\alpha/2}$ for the stated level and A depends on the model:

Model: A
Negative Binomial (NB): $\alpha(\mathrm{Var}(\hat{\eta}) + 1) + \mathrm{Var}(\hat{\eta})$
Poisson-Inverse-Gaussian (PIG): $\frac{1}{\lambda}(\mathrm{Var}(\hat{\eta}) + 1) + \mathrm{Var}(\hat{\eta})$
Sichel (SI): $[\mathrm{Var}(\hat{\eta}) + 1]\left(\frac{2\sigma(\nu_{GIG}+1)}{c} + \frac{1}{c^2}\right) - 1$
Poisson-Lognormal (PLN): $e^{\sigma_{LN}^2}(\mathrm{Var}(\hat{\eta}) + 1) - 1$
Poisson-Weibull (PW): $[\mathrm{Var}(\hat{\eta}) + 1]\left[\frac{\Gamma(2/\sigma+1)}{(\Gamma(1/\sigma+1))^2}\right] - 1$

Derivation of Prediction Intervals for Predicted Crash Count (Y)

The final interval of interest for a mixed-Poisson model is the prediction interval for the predicted crash count (y) at a new site. The formulation of the PI is based on Chebyshev's inequality and further assumes the following: (1) the lower bound for y is zero (which makes the PI more conservative and follows the convention of Wood ()); and (2) y must be integer-valued (). In general, a 100(1-α)% PI for y is shown in Equation (), where the floor of the upper bound is taken to ensure it is integer-valued. As an example, to obtain a % PI for y, the expression under the first radical would evaluate to.

$$\left[0,\ \left\lfloor \hat{\mu} + \sqrt{\frac{1-\alpha}{\alpha}}\,\sqrt{\mathrm{Var}(y)} \right\rfloor\right]$$ ()

The variance of Y is evaluated as follows:

$$\begin{aligned}
\mathrm{Var}(Y) &= E\{\mathrm{Var}(Y \mid M)\} + \mathrm{Var}\{E(Y \mid M)\} \\
&= E(M) + \mathrm{Var}(M) \\
&= E(\mu\nu) + \mathrm{Var}(M) \\
&= E(\mu)E(\nu) + \mathrm{Var}(M) \\
&= \mu_0 + \mathrm{Var}(M)
\end{aligned}$$ ()

Hence, the 100(1-α)% PI for Y can be re-expressed as shown in Equation ().

$$\left[0,\ \left\lfloor \hat{\mu} + \sqrt{\frac{1-\alpha}{\alpha}}\,\sqrt{\hat{\mu} + \mathrm{Var}(m)} \right\rfloor\right]$$ ()

Thus, using the formulation for Var(M) as provided in Equation (), the % PIs for Y for each of the five mixed-Poisson models are developed and shown in Table.

TABLE % Prediction Intervals for y

Each interval has the form $[0,\ \lfloor \hat{\mu} + k\sqrt{\hat{\mu} + \hat{\mu}^2 A} \rfloor]$, where $k = \sqrt{(1-\alpha)/\alpha}$ for the stated level and A depends on the model:

Model: A
Negative Binomial (NB): $\alpha(\mathrm{Var}(\hat{\eta}) + 1) + \mathrm{Var}(\hat{\eta})$
Poisson-Inverse-Gaussian (PIG): $\frac{1}{\lambda}(\mathrm{Var}(\hat{\eta}) + 1) + \mathrm{Var}(\hat{\eta})$
Sichel (SI): $[\mathrm{Var}(\hat{\eta}) + 1]\left(\frac{2\sigma(\nu_{GIG}+1)}{c} + \frac{1}{c^2}\right) - 1$
Poisson-Lognormal (PLN): $e^{\sigma_{LN}^2}(\mathrm{Var}(\hat{\eta}) + 1) - 1$
Poisson-Weibull (PW): $[\mathrm{Var}(\hat{\eta}) + 1]\left[\frac{\Gamma(2/\sigma+1)}{(\Gamma(1/\sigma+1))^2}\right] - 1$

Full derivations for any of the aforementioned expressions for Var(M), the CIs for μ and m, or the PIs for y are available from the authors upon request.
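The interval formulas of this section can be gathered into a small calculator. The sketch below is an illustration under stated assumptions rather than the authors' implementation: the function names (`var_nu`, `intervals`), the default `alpha=0.05`, and all numeric inputs are hypothetical. For the Sichel model the Bessel-function ratio c is taken as a supplied input, since the Python standard library has no modified Bessel function.

```python
import math
from statistics import NormalDist


def var_nu(model, **p):
    """Var[nu] under the identification condition E[nu] = 1."""
    if model == "NB":
        return p["alpha"]
    if model == "PIG":
        return 1.0 / p["lam"]
    if model == "SI":  # c = R_nu(1/sigma) must be supplied by the caller
        return 2.0 * p["sigma"] * (p["nu_gig"] + 1.0) / p["c"] + 1.0 / p["c"] ** 2 - 1.0
    if model == "PLN":
        return math.exp(p["sigma2_ln"]) - 1.0
    if model == "PW":
        s = p["sigma"]
        return math.gamma(2.0 / s + 1.0) / math.gamma(1.0 / s + 1.0) ** 2 - 1.0
    raise ValueError(f"unknown model: {model}")


def intervals(mu_hat, var_eta, v_nu, alpha=0.05):
    """Return (CI for mu, CI for m, PI for y) at level 100(1 - alpha)%."""
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    # CI for the Poisson mean: exponentiate the normal interval for eta.
    factor = math.exp(z * math.sqrt(var_eta))
    ci_mu = (mu_hat / factor, mu_hat * factor)
    # Var(m) = (sigma0^2 + mu0^2)(Var[nu] + 1) - mu0^2, sigma0^2 = mu_hat^2 Var(eta).
    var_m = mu_hat ** 2 * ((var_eta + 1.0) * (v_nu + 1.0) - 1.0)
    half_width = z * math.sqrt(var_m)
    ci_m = (max(0.0, mu_hat - half_width), mu_hat + half_width)
    # PI for y: Chebyshev-type bound, floored so that it is integer-valued.
    k = math.sqrt((1.0 - alpha) / alpha)
    pi_y = (0, math.floor(mu_hat + k * math.sqrt(mu_hat + var_m)))
    return ci_mu, ci_m, pi_y


if __name__ == "__main__":
    v = var_nu("NB", alpha=0.5)              # hypothetical dispersion value
    for name, interval in zip(("mu", "m", "y"), intervals(2.0, 0.02, v)):
        print(name, interval)
```

Passing Var[ν] as a single number keeps the interval code identical across the five models; only `var_nu` changes with the mixture distribution.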

CASE STUDY

This section provides a case study in which mixed-Poisson models are developed from a crash dataset collected in Texas. Following model development, confidence and prediction intervals for the aforementioned quantities are estimated and displayed graphically. Although the dataset provides several explanatory variables, the models considered in this study were of the flow-only variety (i.e., traffic volume was the only covariate in each model) because of constraints on visualizing the results, discussed momentarily. Flow-only models are not without their drawbacks, perhaps most notably omitted variable bias (). That being said, these types of models permit simple graphical representations of their associated CIs and PIs. The confidence and prediction intervals developed in the previous section are general in the sense that they can be estimated for models with any number of covariates, as specified in the linear predictor η; however, they cannot be plotted unless the number of independent variables is two or fewer, or unless the values of several covariates are fixed (conditioned on) such that only two or fewer predictors are allowed to vary.

Data Description

The dataset used in this study was collected as part of the efforts for the NCHRP - ("Methodology to Predict the Safety Performance of Rural Multilane Highways") project (). Overall,, segments with an average length of 0. miles were considered. In total,, crashes were reported during the five-year study period, though these crashes only occurred on of the segments; thus, segments (%) did not experience any crashes. The mean value of crashes was. and the variance was., yielding a variance-to-mean ratio of.. Average daily traffic values over the five-year study period ranged from to,00, with a mean of,. and a standard deviation of,0.0. More detailed information and summary statistics of the dataset are available in Zou et al. ().

Model Development

A total of five mixed-Poisson models were estimated from the Texas dataset. Each model had one covariate (ADT) and an offset term as previously described. The negative binomial, Poisson-inverse-Gaussian, and Sichel models were estimated through a maximum likelihood (ML) approach in the GAMLSS package for the R statistical software (). The Poisson-lognormal and Poisson-Weibull likelihood functions do not have a closed form; hence, these models had to be estimated through a Bayesian approach in the WinBUGS software package () (using 0,000 iterations following 000 iterations for burn-in). Model parameters and goodness-of-fit statistics are presented in Table (a) and (b).
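Once a flow-only model has been estimated, the quantity needed by every interval is Var(η̂). The sketch below shows one common way to obtain it, as the quadratic form x'Σx in the estimated coefficient covariance matrix; it is an assumption-laden illustration, not the procedure of the cited reference. The covariance matrix, the ADT value, and the function name `var_linear_predictor` are all hypothetical, and the offset log(L t) is treated as a known constant that contributes no variance.

```python
import math


def var_linear_predictor(x, cov):
    """Var(eta_hat) = x' * cov * x for covariate vector x and covariance matrix cov."""
    n = len(x)
    return sum(x[a] * cov[a][b] * x[b] for a in range(n) for b in range(n))


if __name__ == "__main__":
    # Hypothetical covariance matrix of the estimators of (log(beta_0), beta_1).
    cov = [[0.04, -0.004],
           [-0.004, 0.0005]]
    adt = 5000.0                         # hypothetical traffic volume
    x = [1.0, math.log(adt)]             # intercept term and log(ADT)
    print("Var(eta_hat) =", var_linear_predictor(x, cov))
```

Because log(ADT) enters the quadratic form, Var(η̂) changes with traffic volume, so the interval widths vary along the ADT axis even when the coefficient estimates are held fixed.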

TABLE (a) Model Results Estimated with ML Approach

Model: NB, PIG, SI (Value and SE reported for each)
Intercept log(β0)
log(ADT) β 0.*** *** *** 0.0
Dispersion parameter α (=1/φ) 0.***
Shape parameter λ - -.*** - -
Scale parameter σ
Scale parameter νGIG ***
Global Deviance...
AIC.. 0.
BIC
Note: *** denotes variable significant at 0.00 level

TABLE (b) Model Results Estimated with Bayesian Approach

Poisson-Lognormal (PLN) Model (columns: Mean, SD, MC Error,.0%, Median,.0%)
Intercept log(β0)
log(ADT) β E
Scale Parameter σLN

Poisson-Weibull (PW) Model (columns: Mean, SD, MC Error,.0%, Median,.0%)
Intercept log(β0)
log(ADT) β E
Scale Parameter σ..e

Confidence and Prediction Intervals

In order to compare and contrast the prediction intervals for y and the confidence intervals for m and μ associated with each of the five types of mixed-Poisson model considered, plots were constructed to show the intervals for each model (Figure (a)-(e)). In each plot, ADT values were considered to range from 0 to,000 (as this was approximately the range in the Texas dataset) and segment length was fixed at one mile for the offset term. The notations LB and UB refer to the lower and upper bounds, respectively, of the interval of interest. Note that since the likelihood functions for the Poisson-lognormal and Poisson-Weibull models do not have a closed form, the estimation method developed by Connors et al. () was adopted to calculate confidence and prediction intervals using WinBUGS.

We first consider the % CI for the Poisson mean (μ). From Table (a) and (b) it can be seen that, regardless of the type of model considered, the estimates of the model coefficients (log(β0) and β) are quite similar. Hence, the estimates of the Poisson mean were quite similar between models, with maximum values ranging between. for the Poisson-Weibull model to as high as. for the Sichel model, when ADT=,000. From Figure (a) through (e), it can be seen that the lower and upper bounds of the % CI for the Poisson mean are nearly identical to each other, and further nearly identical to the estimate of the mean, for ADT values of approximately,000 or less. As ADT increases beyond,000, the distance between the lower and upper bounds of each CI begins to increase. Ultimately, at ADT=,000, the tightest interval around the estimate of μ resulted from the Poisson-Weibull model ([.,.], width=.) and the widest interval resulted from the Sichel model ([.,0.], width=.). As the true value of the Poisson mean is not known, conclusions on whether or not the narrowest interval is best cannot be made.

When examining the % CIs for the safety, m, in Figure (a) through (e), one may first notice that for the NB, PIG, and SI models, the lower bound always has a value of zero, regardless of the ADT value. These models produced negative values for the lower bounds when calculated; however, as noted in Equation (), negative values of the safety (m) are not sensible and hence the lowest reasonable value is zero. The lower bounds for the PLN and PW models (the models estimated through a Bayesian approach) were found to be non-negative in all cases. For the PLN model, the width of the interval at an ADT value of,000 was., and said interval ranged from. to., the maximum value of m estimated from any of the % CIs. The lowest value of an upper bound for the % CIs for m at,000 ADT, across all models, was. as obtained from the Poisson-Inverse-Gaussian model. As was the case with the Poisson mean, the true value of m is unknown; thus, no comments can be made on which interval is narrowest while still capturing the true parameter value.

Regardless of the model considered, the lower bound of the % prediction interval for the predicted response at a new site (y) is always zero. Additionally, one will likely notice that the upper bounds of the PIs for y are much greater (specifically,. to. times greater at,000 ADT) than the respective upper bounds of the % CIs for m. Besides yielding the largest values, the curves for the PIs for y are notably less smooth than those representing the CIs for μ and m. This step-function appearance is a result of the floor function used in calculating the upper bound of the PI, as shown in Equation (), since the number of crashes predicted to occur at a new site should be integer-valued. The upper bounds of the PIs for y ranged from as high as 0 for the Poisson-Lognormal model to as low as for the Poisson-Inverse-Gaussian model. The upper bound of the PI for y as predicted from the Sichel model,, was found to be quite close to that from the PLN model.

FIGURE (a) % CIs and PI for Negative Binomial Model [plot of Number of Crashes versus ADT; series: mu, mu LB, mu UB, m LB, m UB, y LB, y UB]

FIGURE (b) % CIs and PI for Poisson-Inverse-Gaussian Model [plot of Number of Crashes versus ADT; same series]

FIGURE (c) % CIs and PI for Sichel Model [plot of Number of Crashes versus ADT; same series]

FIGURE (d) % CIs and PI for Poisson-Lognormal Model [plot of Number of Crashes versus ADT; same series]

[FIGURE (e): % CIs and PI for Poisson-Weibull Model. Number of Crashes plotted against ADT; series: mu, mu LB, mu UB, m LB, m UB, y LB, y UB.]

SUMMARY AND CONCLUSIONS

Building on the initial work of Wood (), who derived intervals for the negative binomial (Poisson-gamma) model, this study developed confidence intervals for the Poisson mean (μ) and the safety or Poisson parameter (m), and prediction intervals for the predicted response (i.e., the number of crashes at a new site, y), for four additional types of mixed-Poisson models. Formulae for these intervals are now available for researchers and practitioners who want to report a window of uncertainty around a prediction rather than a sole point estimate. The additional models considered were the Poisson-Inverse Gaussian, Sichel, Poisson-Lognormal, and Poisson-Weibull models, all of which arise when a multiplicative error term following the corresponding mixture distribution enters the functional form of the Poisson parameter m. After motivating mixed-Poisson models, derivations for the aforementioned confidence and prediction intervals were provided.

Once the formulae for the CIs and PIs had been established, the theory was put into practice by investigating the intervals associated with each of the five aforementioned types of mixed-Poisson models. Flow-only models were estimated for each model type to obtain the parameters needed to calculate the intervals. Since real-life, observed crash data were used, the true values of the Poisson mean (μ) and the safety or Poisson parameter (m) are unknown, so no comment can be made on which intervals perform best. Nonetheless, several important conclusions can be drawn from the case study of the Texas data:

(1) For small ADT values, the lower and upper bounds of the % CI for the Poisson mean (μ) were quite similar in value and also close to the estimate of the Poisson mean predicted by the models;

(2) Of the models developed in this study, the Sichel model yielded the widest intervals for the Poisson mean (μ) and safety (m). It also yielded the second greatest upper bound among all % PIs for the predicted response y. That said, there is no way to confirm that narrower intervals on μ, m, and y are necessarily better, since the true values of these quantities are unknown;

(3) For the Poisson mean (μ), the Poisson-Weibull model yielded the narrowest % CI. The Poisson-Inverse-Gaussian model yielded the narrowest % CI for m and the narrowest PI for y among all models considered;

(4) All three models estimated via a maximum likelihood approach (NB, PIG, and SI) yielded negative values for the lower bound on m (before these were coerced to zero), a behavior that was not observed for the models estimated via a Bayesian approach (PLN and PW); and

(5) At the largest ADT values considered, the upper bounds on the PIs for y ranged from. to. times the values of the upper bounds of m at the same ADT.

In terms of future work, this study introduces a few possibilities currently under investigation by the authors. First, it would be beneficial to compare the behavior of the CIs and PIs for functional forms of the mixed-Poisson models that include more covariates than traffic flow alone. This will be especially important for overcoming the likely omitted-variable bias in the flow-only models developed in this study. Additionally, a simulation study that generates values of the Poisson mean (μ), safety (m), and response (i.e., crash count, y) at a new site could help determine which CIs and PIs best represent the true intervals.
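The mechanics of such a simulation study could be sketched roughly as follows. This is only an illustrative sketch for the Poisson-gamma (NB) case, not the authors' procedure: the values of `mu_true` and `alpha_true`, the function names, and the candidate interval (a normal-approximation interval with its lower bound coerced to zero) are all assumptions made for illustration. In an actual study the candidate bounds would come from the interval formulae derived in the paper.

```python
import numpy as np


def simulate_new_site(mu, alpha, n_draws, rng):
    """Simulate safety m and crash count y at a site with Poisson mean mu.

    Assumes the Poisson-gamma (NB) case: m = mu * eps, where the
    multiplicative error eps is gamma distributed with mean 1 and
    variance alpha (shape 1/alpha, scale alpha), and y | m ~ Poisson(m).
    """
    eps = rng.gamma(shape=1.0 / alpha, scale=alpha, size=n_draws)
    m = mu * eps
    y = rng.poisson(m)
    return m, y


def empirical_coverage(values, lower, upper):
    """Share of simulated values that fall inside [lower, upper]."""
    return float(np.mean((values >= lower) & (values <= upper)))


rng = np.random.default_rng(2016)
mu_true = 3.0      # hypothetical Poisson mean at the new site
alpha_true = 0.5   # hypothetical dispersion (variance of eps)
m_sim, y_sim = simulate_new_site(mu_true, alpha_true, 100_000, rng)

# Hypothetical candidate PI for y, used only to show the coverage check.
# Under the NB model, Var(y) = mu + alpha * mu^2.
sd_y = np.sqrt(mu_true + alpha_true * mu_true**2)
lower_y = max(0.0, mu_true - 1.96 * sd_y)  # negative bound coerced to zero
upper_y = mu_true + 1.96 * sd_y

coverage_y = empirical_coverage(y_sim, lower_y, upper_y)
print(f"Empirical coverage of the candidate PI for y: {coverage_y:.3f}")
```

Swapping the gamma draw for another mean-one mixture distribution would give the analogous check for the other mixed-Poisson models, and comparing `coverage_y` against the nominal level is one way to judge how well a given interval represents the true one.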

REFERENCES

[] Mannering, F. L., and C. R. Bhat. Analytic methods in accident research: Methodological frontier and future directions. Analytic Methods in Accident Research, Vol., 0, pp. -.
[] Lord, D., and F. Mannering. The statistical analysis of crash-frequency data: A review and assessment of methodological alternatives. Transportation Research Part A: Policy and Practice, Vol., No., 0, pp. -0.
[] Gustavsson, J. On the use of regression models in the study of road accidents. Accident Analysis & Prevention, Vol., No.,, pp. -.
[] Gustavsson, J., and Å. Svensson. A Poisson regression model applied to classes of road accidents with small frequencies. Scandinavian Journal of Statistics,, pp. -0.
[] Jovanis, P. P., and H.-L. Chang. Modeling the relationship of accidents to miles traveled. Transportation Research Record, Vol.,, pp. -.
[] Lord, D., S. P. Washington, and J. N. Ivan. Poisson, Poisson-gamma and zero-inflated regression models of motor vehicle crashes: balancing statistical fit and theory. Accident Analysis & Prevention, Vol., No., 00, pp. -.
[] Maycock, G., and R. Hall. Accidents at -arm roundabouts..
[] Hauer, E., J. C. Ng, and J. Lovell. Estimation of safety at signalized intersections (with discussion and closure)..
[] Connors, R. D., M. Maher, A. Wood, L. Mountain, and K. Ropkins. Methodology for fitting and updating predictive accident models with trend. Accident Analysis & Prevention, Vol., 0, pp. -.
[] Park, E. S., P. J. Carlson, R. J. Porter, and C. K. Andersen. Safety effects of wider edge lines on rural, two-lane highways. Accident Analysis & Prevention, Vol., 0, pp. -.
[] El-Basyouny, K., and T. Sayed. Comparison of two negative binomial regression techniques in developing accident prediction models. Transportation Research Record: Journal of the Transportation Research Board, Vol. 0, No., 00, pp. -.
[] Ye, X., R. M. Pendyala, V. Shankar, and K. C. Konduri. A simultaneous equations model of crash frequency by severity level for freeway sections. Accident Analysis & Prevention, Vol., 0, pp. 0-.
[] Hauer, E. Empirical Bayes approach to the estimation of unsafety: the multivariate regression method. Accident Analysis & Prevention, Vol., No.,, pp. -.
[] Lawless, J. F. Negative binomial and mixed Poisson regression. Canadian Journal of Statistics, Vol., No.,, pp. 0-.
[] Hauer, E. Observational Before/After Studies in Road Safety. Estimating the Effect of Highway and Traffic Engineering Measures on Road Safety..
[] Cameron, A. C., and P. K. Trivedi. Regression analysis of count data. Cambridge University Press, 0.
[] Rigby, R., D. Stasinopoulos, and C. Akantziliotou. A framework for modelling overdispersed count data, including the Poisson-shifted generalized inverse Gaussian distribution. Computational Statistics & Data Analysis, Vol., No., 00, pp. -.
[] Zou, Y., L. Wu, and D. Lord. Modeling over-dispersed crash data with a long tail: Examining the accuracy of the dispersion parameter in Negative Binomial models. Analytic Methods in Accident Research, Vol., No. 0, 0, pp. -.
[] Dean, C., J. Lawless, and G. Willmot. A mixed Poisson-inverse-Gaussian regression model. The Canadian Journal of Statistics/La Revue Canadienne de Statistique,, pp. -.
[0] Zha, L., D. Lord, and Y. Zou. The Poisson Inverse Gaussian (PIG) Generalized Linear Regression Model for Analyzing Motor Vehicle Crash Data. Journal of Transportation Safety & Security, No. just-accepted, 0, pp.
[] Cheng, L., S. R. Geedipally, and D. Lord. The Poisson Weibull generalized linear model for analyzing motor vehicle crash data. Safety Science, Vol., 0, pp. -.
[] Lord, D., and L. F. Miranda-Moreno. Effects of low sample mean values and small sample size on the estimation of the fixed dispersion parameter of Poisson-gamma models for modeling motor vehicle crashes: a Bayesian perspective. Safety Science, Vol., No., 00, pp. -0.
[] Aguero-Valverde, J., and P. P. Jovanis. Analysis of road crash frequency with spatial models. Transportation Research Record: Journal of the Transportation Research Board, Vol. 0, No., 00, pp. -.
[] Casella, G., and R. L. Berger. Statistical inference. Duxbury Pacific Grove, CA, 00.
[] Lord, D. Methodology for estimating the variance and confidence intervals for the estimate of the product of baseline models and AMFs. Accident Analysis & Prevention, Vol. 0, No., 00, pp. -.
[] Lord, D., P.-F. Kuo, and S. Geedipally. Comparison of Application of Product of Baseline Models and Accident-Modification Factors and Models with Covariates: Predicted Mean Values and Variance. Transportation Research Record: Journal of the Transportation Research Board, No., 0, pp. -.
[] Wood, G. Confidence and prediction intervals for generalised linear accident models. Accident Analysis & Prevention, Vol., No., 00, pp. -.
[] Geedipally, S., and D. Lord. Effects of Varying Dispersion Parameter of Poisson-Gamma Models on Estimation of Confidence Intervals of Crash Prediction Models. Transportation Research Record: Journal of the Transportation Research Board, No. 0, 00, pp. -.
[] Geedipally, S. R., and D. Lord. Investigating the effect of modeling single-vehicle and multi-vehicle crashes separately on confidence intervals of Poisson gamma models. Accident Analysis & Prevention, Vol., No., 0, pp. -.
[0] Rigby, R., and D. Stasinopoulos. A flexible regression approach using GAMLSS in R. London Metropolitan University, London, 00.
[] Rodríguez, G. Lectures notes about generalized linear models.in, 00.
[] Lord, D., S. R. Geedipally, B. N. Persaud, S. P. Washington, I. van Schalkwyk, J. N. Ivan, C. Lyon, and T. Jonsson. Methodology to predict the safety performance of rural multilane highways.in, 00.
[] Rigby, R., and D. Stasinopoulos. Generalized additive models for location, scale and shape. Appl Statist, Vol., No. part, 00, pp. -.
[] Lunn, D. J., A. Thomas, N. Best, and D. Spiegelhalter. WinBUGS-a Bayesian modelling framework: concepts, structure, and extensibility. Statistics and Computing, Vol., No., 000, pp. -.
