A Flexible Modeling Approach Using Dirichlet Process Mixtures: Application to Municipality-Level Railway Grade Crossing Crash Data

S. Heydari, L. Fu, D. Lord, and B. K. Mallick

Shahram Heydari (Corresponding Author)
PhD Candidate
Department of Civil and Environmental Engineering
University of Waterloo
00 University Avenue W., Ontario NL G, Canada
shahram.heydari@uwaterloo.ca, --
&
Zachary Department of Civil Engineering
Texas A&M University
College Station, Texas, USA

Liping Fu
Professor
Department of Civil and Environmental Engineering
University of Waterloo
00 University Avenue W., Ontario NL G, Canada
lfu@uwaterloo.ca, --

Dominique Lord
Associate Professor
Zachary Department of Civil Engineering
Texas A&M University
College Station, Texas, USA
d-lord@tamu.edu, --

Bani K. Mallick
Distinguished Professor
Department of Statistics
Texas A&M University
College Station, Texas, USA
bmallick@stat.tamu.edu, --

Total # of words = (Text) ( tables) = + references
Submission July 0 for TRB 0

Abstract

This paper introduces a new approach to two of the most challenging issues in road safety research: how to account for unobserved heterogeneity and how to identify latent subpopulations in data. In contrast to random effects/parameters models and finite mixtures, the proposed approach employs a Bayesian semi-parametric methodology based on Dirichlet process mixtures. The method has four noteworthy advantages: (i) it allows the robustness of distributional assumptions in random effects/parameters models to be examined; (ii) it identifies latent clusters in data; (iii) it identifies outliers (extreme observations) and accommodates them in the analysis without compromising the quality of estimates; and (iv) it estimates the number of latent clusters in data using an elegant mathematical structure. We evaluate the proposed method on a municipality-level railway grade crossing crash dataset with a hierarchical (multilevel) structure from Canada for the years 00 to 0. We use cross-validation predictive densities and the pseudo Bayes factor for Bayesian model selection. While confirming the need for the multilevel modeling approach, the results point to the inadequacy of the parametric assumption: the proposed method improved model fit significantly for the municipality-level data. In a fully probabilistic framework, we also identified the expected number of latent clusters with similar unknown/unmeasured features among Canadian municipalities. The reasons behind such similarities and dissimilarities can therefore be investigated further, which could have important policy implications for the safety management process.

INTRODUCTION

Unobserved heterogeneity has received increasing attention in the transportation safety literature in recent years. Employing random effects/parameters models (RPM), including random intercept and/or slope models, is a viable approach to account for such issues (-). However, parametric assumptions (e.g., normally distributed random parameters), an inherent component of RPMs, impose a strong restriction on any statistical model (, ). As a result, distributional assumptions might compromise the quality of the analysis or even produce misleading results.

When using RPMs, the main goal is to allow one or more model parameters to vary across observations (in single-level model settings) (,,) or groups of observations (in multilevel model settings) (-). For example, Anastasopoulos and Mannering () used an RPM to analyze roadway segments in the state of Indiana, allowing model parameters to vary across observations. The authors concluded that the RPM provided a superior fit and therefore better represented the observed data. In another study, Yannis et al. () employed RPMs for a multilevel dataset in which observations were nested within regions of Greece. The authors analyzed the data using both random intercept and random slope models, allowing model parameters to vary between regions (in contrast to the former study, in which parameters varied between observations). The latter paper concluded that regional variations were significant when considering the effect of enforcement on crash frequencies. As these examples suggest, regardless of the model setting (single-level or multilevel), distributional assumptions (such as normally distributed random parameters) might not hold, might not properly represent the observed data, or might fail to capture unobserved heterogeneity.
To clarify the problem, suppose the modeller is interested only in potential variations in the intercept (random intercept model), and the effects of the other covariates are assumed to be fixed across observations. In this scenario, one basic approach is to assume that all observations have exactly the same intercept and that there is no extra variability in the data. This assumption is simplistic: it ignores the possibility that unknown and/or unmeasured attributes change between observations. There are two main alternatives for tackling unobserved heterogeneity through the intercept. The first is to estimate one intercept for each observation, based on the belief that each observation differs completely from the others; that is, the assumption of complete independence (). This assumption is not realistic, since observations (e.g., intersections) are not totally dissimilar and certainly share some features. The second and most common approach is to assume that the intercepts for the various observations are generated from a single distribution; e.g., a normal distribution. Depending on how well standard distributional assumptions capture the heterogeneity in a given dataset, say, in the form of random parameters, the results will be biased to varying degrees. It should be noted that standard assumptions usually do not accommodate skewness, kurtosis, and multimodality (0).

As an example of a problem that may arise under parametric assumptions, let us consider a multilevel dataset in which observations are nested in different groups such as geographical regions (see () for a review of multilevel models in the road safety literature). When there are outlier regions (extreme cases) in the data, large outlier regions affect other regions excessively.
Consequently, estimates relating to smaller outlier regions erroneously tend to approach the overall mean. In these circumstances, a more flexible modeling approach is necessary. The flexible model must satisfy two requirements: (i) it should avoid the complete independence assumption described above; and (ii) it should relax the distributional assumption while adapting itself to the complexity of the observed data.

While the use of RPMs has been relatively limited for single-level (observational level) crash data, their application in multilevel settings, which have gained popularity in recent years, has been inevitable and frequent. In fact, allowing parameters (e.g., intercepts) to vary across higher levels or groups (e.g., counties, municipalities, regions, etc.) is common practice in analyzing datasets characterized by hierarchical structures (,, ). This is because observations nested in the same group are likely to share similar unknown and/or unmeasured traits and are thus correlated (). RPMs can therefore account for such correlation and reduce unobserved heterogeneity in the data.

With this background, our paper takes advantage of an innovative class of flexible statistical models, based on Dirichlet process mixtures (), that has been developed recently in the Bayesian nonparametric literature (and the computer science literature). The Flexible Dirichlet Process Model (FDPM) not only allows examining the robustness of parametric assumptions, but also enables identifying outliers and latent subpopulations in data. This model is highly flexible and adjusts its complexity to the observed data; for example, the number of mass points or clusters (components of the mixture) increases as the complexity of the observed data increases. Note that when using finite mixture models (), the number of clusters in the data must be anticipated (this number is usually decided without any sound justification). Different models are then estimated using various pre-specified numbers of clusters, and the number of clusters that provides the best fit is eventually selected as the optimal number.
In contrast, FDPMs directly estimate the required number of latent clusters in a mathematically elegant framework.

We evaluate our FDPM on a Canadian railway grade crossing crash dataset that contains observations nested in different geographical regions: municipalities. For a review of railway grade crossing crash analysis see (), (), and (). In this research, as a first step, the dataset is analyzed by accounting for its hierarchical structure. To the best of our knowledge, no attempts have been made so far, especially in Canada, to accommodate the hierarchical form of grade crossing crash data. We examine the presence and the extent of dependencies between crossings nested in different regions. We explore the robustness of our standard parametric assumption in a multilevel setting using FDPMs. Lastly, we investigate the presence of outlier municipalities across Canada and identify latent clusters (subpopulations) among different Canadian regions. To our knowledge, this is the first instance in the multilevel crash data literature of clustering groups (sites, regions, municipalities, etc.) that belong to higher levels of the hierarchy using such a flexible and efficient statistical model. Our flexible model can be implemented in the freely available software WinBUGS (), making its use more convenient. It can also be easily adapted to cluster observations; for example, intersections or road segments in single-level model settings.

This paper contributes methodologically to the transportation safety literature by presenting a flexible modeling framework with several theoretical and practical capabilities (not only in road safety analysis, but also in other areas of transportation such as travel demand research). It also contributes empirically to the grade crossing safety literature by establishing reliable safety performance functions, providing evidence on the presence of hierarchical levels in grade crossing crash data, and identifying latent similarities and dissimilarities between different Canadian regions.

DATA DESCRIPTION

The data used to illustrate the proposed FDPM are railway grade crossing crash data from Canada. This nationwide crash dataset, provided by the Transportation Safety Board of Canada, consists of, public crossings equipped with flashing lights and bells (FLB) for a six-year period. We obtained the crash data by combining two databases: IRIS (Integrated Railway Information System) and RODS (Railway Occurrence Database System). The IRIS database contains inventory data on railway crossings across Canada. The main pieces of information that can be extracted from IRIS are geometric/operational characteristics, train and vehicle flows, and protection type. The RODS database, in contrast, is mainly designed to record information related to accident occurrences, such as crash date and time, number of fatalities, number of serious injuries, etc. Since regional dependencies in the data are suspected, we prepared the data to account for its hierarchical structure, as explained below.

Municipality-Level Data

To prepare this dataset, municipalities with at least 0 FLB crossings within their boundaries were considered. The final municipality-level data included, crossings located in municipalities, which come from major Canadian provinces: British Columbia, Alberta, Saskatchewan, Manitoba, Ontario, Quebec, New Brunswick, and Nova Scotia. A total of crashes were observed in these data. The municipality-level dataset includes all major Canadian cities such as Toronto, Montreal, Winnipeg, Edmonton, Vancouver, etc. Several factors (e.g., driver behavior, climate, regulations, etc.) might differ between municipalities. One aim of this research was to verify the existence of dependencies among FLB crossings nested in the same municipality.
More importantly, we aimed to examine the standard parametric assumption for the data while accounting for its multilevel form. We were also interested in identifying outlier municipalities (those that differ from the rest of the data) and municipalities that manifest similar patterns (latent subpopulations) in terms of crash frequency at FLB crossings. Among, FLB crossings in this dataset,.% were located in urban areas and whistle prohibition was applied to.% of them. A host of explanatory variables were available, but many of them were not statistically significant in describing crash frequencies. Table provides summary statistics of the data for the most important variables.

As discussed in the introduction, the most widespread multilevel modeling approach assumes a common normal distribution for parameters that vary between municipalities. For the municipality-level data, however, it may not be reasonable to assume that all municipalities are generated from the same distribution. In other words, we suspect the presence of latent subpopulations among these municipalities. The methodology section describes our flexible modelling framework, which can deal efficiently with such circumstances.

Table Summary Statistics of the Municipality-Level Data

Variable | Mean | Std. Dev. | Min | Max
Train flow (average annual daily)
Vehicle flow (average annual daily)
Log of exposure (product of train and vehicle flows)
Number of tracks
Number of lanes
Track angle (deviation from 0 )
Road speed (km/h)
Train speed (km/h)
Whistle prohibition ( if present, 0 otherwise)
Urban area ( if urban area, 0 otherwise)
Crash frequency ( years)
Number of FLB crossings: 0 0

METHODOLOGY

This section first provides a brief methodological background on the proposed modeling approach and its origins. It then describes the main component of our approach, the Dirichlet process, followed by details of the proposed FDPM adopted for the study in context. The section concludes by discussing model selection criteria based on cross-validation predictive densities.

Methodological Background

This paper illustrates a class of flexible RPMs developed in the Bayesian nonparametric literature (0, ) based on Dirichlet process mixtures (, ). In this regard, Escobar and West () state that Bayesian models involving Dirichlet process mixtures are at the heart of the modern nonparametric Bayesian movement. The Bayesian models used in this paper are, however, semi-parametric, since parametric distributional assumptions are not relaxed for all model parameters in our flexible model. The original ideas of nonparametric Bayesian inference were developed and discussed by Ferguson () and Antoniak (); however, their application was very limited due to computational complexity. It was mainly in the 0s that Bayesian nonparametric models attracted the attention of more researchers, owing to improvements in Markov Chain Monte Carlo (MCMC) schemes and substantial computational advances during those years. At that stage, several developments were made in various aspects of Bayesian nonparametric modeling (, ).
Consequently, Bayesian nonparametric concepts have been used in scientific articles mainly in biostatistics and computer science research (, ), whereas their use in transportation research, especially transportation safety, has been extremely rare if not nonexistent. One of the main motivations behind nonparametric Bayesian inference is to remove the constraints associated with specific parametric assumptions. These constraints may affect inferences made by parametric models. Employing a nonparametric Bayesian approach therefore enables us to circumvent restrictive distributional assumptions and to make parametric models more robust in terms of statistical inference. It is important to mention that the term Bayesian nonparametric does not mean that the model is parameter-free. On the contrary, it may have an infinite number of parameters (0). In Bayesian nonparametrics, in effect, the number of parameters increases as the complexity of the data increases. This characteristic marks an important difference from the finite mixture modeling approach (), which fixes the number of latent clusters in advance. In Bayesian nonparametric modeling, the number of latent clusters is estimated as part of the estimation process, which is more realistic, convenient, and flexible.

Dirichlet Process (DP) and Truncated Dirichlet Process

In the parametric modelling approach, we assume a specific density function G(.) with a limited number of unknown parameters. In contrast, the nonparametric Bayesian approach treats G(.) as unknown, with possibly an infinite number of parameters (the number of parameters depends on the complexity of the observed data), and assumes a continuous baseline distribution (prior) for G(.). In other words, the unknown density G(.) is centered around the baseline distribution (G0), and its variation around G0 is determined by a positive real precision parameter, κ (, ). As κ approaches infinity, G(.) becomes more similar to G0. In a random intercept model, for example, larger values of κ imply that each unit effect, say ηi (i = 1, 2, ..., l, where l denotes the number of observations), tends to be in a distinct subpopulation. This condition is similar to the standard random effects/parameters assumption of a common normal distribution for all random effects (here, intercepts) (); i.e., ηi ~ Normal(m, ν), where m and ν are the mean and the variance. A κ approaching zero, however, indicates a major difference between G(.) and G0. In this condition, all unit effects (in our example, intercepts) tend to be in the same cluster. This is similar to the common intercept assumption: η1 = η2 = ... = ηl. The latter condition does not occur in most real applications (e.g., due to unobserved heterogeneity).
Moreover, the former condition, which is more likely to occur, might be too strict, since it hypothesizes that all random units are generated from a unique distribution. Dirichlet process mixing helps build a flexible model that relies on the aforementioned conditions. Given G0 and κ as explained above, a generic Dirichlet process GDP can be defined as

$$G_{DP} \sim \mathrm{Dirichlet}(\kappa G_0)$$

This is a random density measure on the space of all probability measures GDP (). That is, for any partition p1, ..., pn of the parameter space, the vector of random densities (G(p1), ..., G(pn)) follows a Dirichlet distribution (0):

$$\big(G(p_1), \ldots, G(p_n)\big) \sim \mathrm{Dirichlet}\big(\kappa G_0(p_1), \ldots, \kappa G_0(p_n)\big)$$

To clarify this procedure, and for the sake of simplicity, let us assume that the real line represents the entire sample space of a given parameter, as in (). This line can be partitioned into several intervals: (-∞, p1), (p1, p2), ..., (pn-2, pn-1), (pn-1, ∞), where p stands for the partition points. The probability (PR) of falling into each interval is then

$$PR_1 = G(p_1), \quad PR_2 = G(p_2) - G(p_1), \quad \ldots, \quad PR_{n-1} = G(p_{n-1}) - G(p_{n-2}), \quad PR_n = 1 - G(p_{n-1})$$

Similarly, the probability (PR0) for the baseline distribution G0 can be obtained from, for example,

$$PR_{0,n-1} = G_0(p_{n-1}) - G_0(p_{n-2})$$

Considering the partitioned line discussed above, the probabilities PR1 to PRn follow a Dirichlet distribution:

$$(PR_1, PR_2, \ldots, PR_n) \sim \mathrm{Dirichlet}(\kappa PR_{0,1}, \kappa PR_{0,2}, \ldots, \kappa PR_{0,n})$$

To generate random density functions from a Dirichlet process, the stick-breaking procedure () can be employed. Here, we adopt the description provided by (), according to which the steps of stick-breaking are as follows:

(i) From the baseline distribution G0, generate a vector of random variables θ1, θ2, ...;

(ii) From the density function Beta(1, κ), generate a vector of random variables ξ1, ξ2, ...; thus, we can write

$$PR(\xi_n) = \kappa (1 - \xi_n)^{\kappa - 1}$$

$$E(\xi_n) = \frac{1}{1 + \kappa}$$

(iii) Assign probabilities PR1, PR2, ..., PRn to the random variables θ1, θ2, ..., θn, respectively, where

$$PR_1 = \xi_1$$
$$PR_2 = (1 - \xi_1)\,\xi_2$$
$$PR_3 = (1 - \xi_1)(1 - \xi_2)\,\xi_3$$
$$PR_n = (1 - \xi_1)(1 - \xi_2) \cdots (1 - \xi_{n-2})(1 - \xi_{n-1})\,\xi_n$$

Let Iθ be an indicator function and f(.) be the probability function (an infinite mixture of point masses) that corresponds to GDP. We can then write

$$f(\cdot) = \sum_{n=1}^{\infty} PR_n\, I_{\theta_n}; \quad \theta_n \sim G_0$$

This expression shows that random draws from a Dirichlet process are discrete probability distributions (). This can be problematic when the underlying distribution is continuous. A modification is therefore required that substitutes the indicator function Iθ with a continuous density function, denoted here by γ(.|θn). This results in Dirichlet process mixing:

$$f(\cdot) = \sum_{n=1}^{\infty} PR_n\, \gamma(\cdot \mid \theta_n); \quad \theta_n \sim G_0$$

Because of the computational complexity of a full Dirichlet process, a truncation approach can be employed to obtain an approximation (). This results in a truncated Dirichlet process, in which the main idea is to limit the maximum number of possible partitions (say, on the partitioned real line discussed above). Instead of allowing n to go to infinity, n can grow only up to a certain discrete value C; i.e., the maximum number of clusters. GDP thus depends on κ, G0, and also C.
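The stick-breaking steps and the truncation at C described above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function names, the normal baseline distribution, and the values κ = 2 and C = 20 are assumptions chosen only for the example.

```python
import random


def stick_breaking_weights(kappa, C, rng):
    """Draw truncated stick-breaking weights PR_1, ..., PR_C for precision kappa."""
    weights = []
    remaining = 1.0  # length of the stick not yet assigned to any partition
    for _ in range(C - 1):
        xi = rng.betavariate(1.0, kappa)  # xi_n ~ Beta(1, kappa)
        weights.append(remaining * xi)    # PR_n = xi_n * prod_{m<n} (1 - xi_m)
        remaining *= 1.0 - xi
    weights.append(remaining)  # last partition takes the leftover, so weights sum to 1
    return weights


def draw_truncated_dp(kappa, C, base_mean, base_sd, rng):
    """One random draw: C atoms theta_n from a normal baseline G0, plus their weights."""
    atoms = [rng.gauss(base_mean, base_sd) for _ in range(C)]  # theta_n ~ G0 (assumed normal)
    weights = stick_breaking_weights(kappa, C, rng)
    return atoms, weights


# Hypothetical settings, for illustration only.
rng = random.Random(0)
atoms, weights = draw_truncated_dp(kappa=2.0, C=20, base_mean=0.0, base_sd=1.0, rng=rng)
```

Small values of κ tend to put most of the mass on the first few atoms (few clusters), whereas large values spread the mass over many atoms, which mirrors the role of κ discussed earlier.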
One condition that should be satisfied here relates to the probability of the final partition. The probability of the last partition is expected to be a very small value, ε, such as 0.0 (). When C is a relatively large value like 0, while κ is limited to 0, the probability of the last partition approaches zero (). Under the truncated Dirichlet process, the following condition should be satisfied for the last partition:

$$PR_C = 1 - \sum_{n=1}^{C-1} PR_n$$

In a truncated Dirichlet process, the Dirichlet process mixing density given above can be written as

$$f(\cdot) = \sum_{n=1}^{C} PR_n\, \gamma(\cdot \mid \theta_n); \quad \theta_n \sim G_0$$

It can be shown that the maximum number of clusters, C, depends on ε and κ ():

$$C \approx 1 + \frac{\log(\varepsilon)}{\log\left[\dfrac{\kappa}{1 + \kappa}\right]}$$

We can then approximate a full Dirichlet process by choosing C, instead of ∞, as the maximum possible number of clusters. Note that C cannot be greater than the total number of observations (or groups) in the data. For instance, if the data include 00 observations, C cannot be greater than 00. Note also that a prior distribution can be placed on κ so that its value is estimated as part of the analysis. This prior should be in accordance with C and with the data in general. Further discussion is provided in the results section.

Proposed model framework applied to the study in context

As a starting point, the Simple Poisson-Lognormal Model (SPLM) was used as the base model to analyze the crash frequency data:

$$y_j \sim \mathrm{Poisson}(\lambda_j)$$
$$\lambda_j = \mu_j e^{\varepsilon_j}$$
$$\log(\mu_j) = \eta + \beta X_j$$
$$\log(\lambda_j) = \eta + \beta X_j + \varepsilon_j$$
$$\varepsilon_j \sim \mathrm{normal}(0, \nu_\varepsilon)$$

where yj and λj denote the observed and expected crash frequencies for site j, respectively; η is the intercept; β is the vector of coefficients; X is the vector of covariates; e^ε is a lognormally distributed error term that accounts for overdispersion; and νε is the variance of the error term.

In this study, we adopted a multilevel modeling approach, as the hierarchical structure of the data necessitates. Such a hierarchical structure, which occurs often in transportation safety studies (), requires allowing one or more model parameters to vary across groups of observations (here, regions).
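Returning briefly to the truncation relation for C stated earlier in this section, the short sketch below evaluates it numerically. It relies on the fact that, under stick-breaking with ξ ~ Beta(1, κ), the expected mass left for the last partition after C - 1 breaks is (κ/(1 + κ))^(C-1). The function names and the values ε = 0.01 and κ = 5 are hypothetical and are not taken from the paper.

```python
import math


def truncation_level(eps, kappa):
    """Smallest integer C with C >= 1 + log(eps) / log(kappa / (1 + kappa))."""
    return math.ceil(1.0 + math.log(eps) / math.log(kappa / (1.0 + kappa)))


def expected_last_weight(C, kappa):
    """Expected probability of the last partition: (kappa / (1 + kappa)) ** (C - 1)."""
    return (kappa / (1.0 + kappa)) ** (C - 1)


# Hypothetical values, for illustration only.
C = truncation_level(eps=0.01, kappa=5.0)
leftover = expected_last_weight(C, kappa=5.0)
```

Rounding up guarantees that the expected leftover mass does not exceed ε. Because the required C grows with κ, the upper bound of any prior placed on κ and the chosen C need to be set together.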
In many instances, a normal distribution is used for any random parameter of interest, resulting in a fully parametric model. In the subsequent sections, we first describe a generic parametric random intercept multilevel model and then a semi-parametric random intercept multilevel model.

Parametric Random Intercept Multilevel Model

Let us consider the data in context (Section ) with grade crossings nested in different municipalities. Assume also a Random Intercept Multilevel Poisson-Lognormal Model (RIMPLM) in which the intercept varies between regions, based on the belief that each region is likely to have its own characteristics and unobserved/unmeasured attributes. Let r denote regions. A typical parametric multilevel model with an intercept that varies across regions can be obtained by extending the previously discussed SPLM as follows:

$$y_{rj} \mid X_{rj}, \varepsilon_{rj}, \eta_r \sim \mathrm{Poisson}(\lambda_{rj})$$
$$\log(\lambda_{rj}) = \eta_r + \beta X_{rj} + \varepsilon_{rj}$$
$$\eta_r \sim \mathrm{normal}(m_\eta, \nu_\eta)$$
$$\varepsilon_{rj} \sim \mathrm{normal}(0, \nu_\varepsilon)$$

where mη and νη are, respectively, the mean and the variance of the varying intercept ηr. The RIMPLM thus assumes a common normally distributed random intercept at the region level. For the municipality-level dataset, for example, the RIMPLM assumes a common normal distribution for all municipalities. This imposes a strong assumption: that all municipalities come from the same population. The RIMPLM neglects the possibility that some municipalities (outliers) might behave very differently from the rest of the municipalities in the data. In the next section, we relax this assumption with a flexible model that adapts itself to the complexity of the observed data.

Semi-parametric random intercept multilevel model

Standard parametric assumptions on random parameters might compromise the quality of analyses. Our Flexible Dirichlet Process Multilevel Model (FDPMM) examines the quality of such parametric models while providing further insights into the data; e.g., identifying latent clusters and outliers. Many outlier detection methods are designed only to identify outliers; the modeller must then exclude them from the data and conduct the analysis without them to avoid biased or less reliable estimates. Our flexible modeling approach, in contrast, allows us to accommodate outliers in the analysis without undermining its quality.
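To make the parametric baseline concrete before it is relaxed, the following sketch simulates crash counts from a random intercept Poisson-lognormal structure of the kind defined earlier in this section, in which every region's intercept is drawn from one common normal distribution. It is a hypothetical illustration only: the single uniform covariate, all parameter values, and the function names are assumptions, and the simple Poisson sampler is adequate only for small means.

```python
import math
import random


def poisson_draw(lam, rng):
    """Knuth's multiplication method for a Poisson draw; fine for small lam."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_rimplm(n_regions, n_sites, m_eta, sd_eta, beta, sd_eps, rng):
    """Simulate (region, covariate, count) records with a normal region-level intercept."""
    records = []
    for r in range(n_regions):
        eta_r = rng.gauss(m_eta, sd_eta)      # eta_r ~ normal(m_eta, nu_eta), sd_eta = sqrt(nu_eta)
        for _ in range(n_sites):
            x_rj = rng.uniform(0.0, 1.0)      # hypothetical single covariate
            eps_rj = rng.gauss(0.0, sd_eps)   # eps_rj ~ normal(0, nu_eps), sd_eps = sqrt(nu_eps)
            lam_rj = math.exp(eta_r + beta * x_rj + eps_rj)
            records.append((r, x_rj, poisson_draw(lam_rj, rng)))
    return records


# Hypothetical parameter values, for illustration only.
rng = random.Random(1)
data = simulate_rimplm(n_regions=5, n_sites=10, m_eta=-1.0, sd_eta=0.5,
                       beta=0.8, sd_eps=0.3, rng=rng)
```

Replacing the single `rng.gauss(m_eta, sd_eta)` draw with a draw from a clustered or heavy-tailed distribution is one way to generate data for which the common-normal assumption no longer holds.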
Note that the flexible model presented here is used in a multilevel framework with the Poisson-lognormal model for crash frequency. However, it can similarly be adopted in single-level settings (non-multilevel analysis) and/or with different statistical models such as the Poisson, Poisson-gamma, etc. Besides its application to count models for crash frequency datasets, the proposed flexible model can also be employed in different contexts such as injury-severity analysis, travel demand research, etc. For the purpose of this research, the FDPMM can be defined as follows:

$$y_{rj} \mid X_{rj}, \varepsilon_{rj}, \eta_r \sim \mathrm{Poisson}(\lambda_{rj})$$
$$\log(\lambda_{rj}) = \eta_r + \beta X_{rj} + \varepsilon_{rj}$$
$$\varepsilon_{rj} \sim \mathrm{normal}(0, \nu_\varepsilon)$$
$$\eta_r = \eta_{DP} \sim \mathrm{Dirichlet}(\kappa \eta_0); \quad \theta_r \sim \eta_0, \quad r = 1, 2, \ldots, C$$
$$\theta_r \sim \mathrm{normal}(m_0, \nu_0); \quad \kappa \sim g(\cdot)$$

where θ0 (with unknown parameters, the mean m0 and the variance ν0) is the realization of the baseline distribution η0 for ηr, and κ is the precision parameter explained earlier in the section on the Dirichlet process. Recall that r denotes latent clusters and C stands for the maximum possible number of latent clusters (see the truncation relation for C in the section on the truncated Dirichlet process). In the previous model (i.e., the RIMPLM), the varying intercept ηr was normally distributed, whereas under the FDPMM it is defined non-parametrically using Dirichlet process mixing. In doing so, we remove the restriction of the standard distributional assumption and allow the observed data to determine the proper form of the varying intercept. If the FDPMM provides a significantly better fit to the data than the RIMPLM, one can doubt the appropriateness of the parametric assumption. Note also that the parameters of the baseline distribution, η0, are estimated here as part of the modeling process, allowing us to account for the uncertainty associated with the baseline distribution of the varying intercept.

To maintain the interpretability of the model, as can be seen in the representation of the FDPMM above, the vector of coefficients β associated with the known covariate vector X (site characteristics) does not follow a Dirichlet process and is fixed. Other extensions are possible; for example, one might allow the effect of one or more covariates to vary across regions. Note that Dirichlet process mixing over the intercept (as in our FDPMM) allows us to deal with heterogeneity in the data with respect to the mean (); that is, mean crash frequency in our paper. In the study in context, such mixing thus enables the identification of latent clusters among regions, here municipalities.

Elicitation of Priors

Bayesian analysis requires the elicitation of priors for the parameters of interest. In this research, we used non-informative normal priors with mean zero for β, mη, and m0. For the inverses of the variances νε, νη, and ν0, we used a diffuse gamma prior with shape and scale parameters equal to 0.0. It is also necessary to define a prior distribution on the precision parameter κ, for which different priors are possible, such as gamma, exponential, and uniform.
This prior should agree with the maximum number of allowed clusters C (Eq. 0). Here, we used two different uniform priors for each dataset as it is important to choose this prior based on data characteristics; for example, the number of observations over which we want to cluster the data. For the municipality-level dataset, we set the maximum number of cluster C to be 0 given the number of municipalities; i.e.,. Doing so, a better approximation of a full Dirichlet process can be obtained (). This also allows larger values of κ, which means that we do not force κ to be a small value. Therefore, we chose a uniform prior with an upper bound of 0 that corresponds to approximately 0 clusters based on Eq. 0. A lower bound of 0. was selected here to allow smaller values of κ and also to circumvent problems associated with the estimation of PRn. Therefore, we assume κ ~ uniform(0., 0). Model Selection: Conditional Predictive Ordinate and Pseudo Bayes Factor In this research, we used Conditional Predictive Ordinate (CPO) to estimate Log Pseudo Marginal Likelihood (LPML) and Pseudo Bayes Factor (PBF) (-0) to compare the three models described previously: SPLM, RIMPLM, and FDPMM. The use of CPOs for model selection in road safety literature has been extremely rare (, ). With regard to LPML and PBF, this would be probably the first instance of using such model selection criteria in transportation safety studies. One should take into account that CPO and PBF are in general more robust than the commonly used Deviance Information Criterion (DIC) (). It is important to mention that DIC is known for its problematic issues; for example, in terms of significant sensitivity to parameterization () or in situations in which the posterior density is not unimodal. In fact, WinBugs cannot estimate the DIC value when estimating the FDPMM, which involves multimodal posteriors. For a further

discussion on the DIC drawbacks, readers are referred to (). In this section, we therefore focus on the estimation of LPML and PBF. The main idea behind cross-validation methods forms the basis for the estimation of CPOs. In cross-validation, a given dataset is divided into two groups. One is used to make the posterior inference, whereas the second group is used to validate the previously estimated model. The problem here is the sensitivity of the results to how these groups are selected. CPO circumvents this problem by leaving out only one observation each time (0). Consider a full set of observed data $Y$ including $i = 1, 2, \ldots, l$ observations. For a given observation $Y_i$, the leave-one-out cross-validation predictive density, as in (0), is

$$\mathrm{CPO}_i = f(y_i \mid y_{-i}) = \int f(y_i \mid \psi)\, f(\psi \mid y_{-i})\, d\psi$$

where $y_{-i}$ consists of $Y$ when $Y_i$ (the $i$th observation) is excluded from the data, and $\psi$ denotes the model parameters. Therefore, the CPO of an observation in a given dataset is the likelihood of that observation given the rest of the observations in that dataset (). CPO can therefore be used to identify data points that are in conflict with the rest of the observations in a given dataset (0). The CPOs can be readily estimated from the adopted MCMC algorithm as

$$\mathrm{CPO}_i = \left( \frac{1}{T} \sum_{t=1}^{T} \frac{1}{f(y_i \mid \psi^{(t)})} \right)^{-1}$$

where $T$ stands for the total number of iterations ($t = 1, 2, \ldots, T$) in the MCMC runs. CPO is thus the harmonic mean of the likelihood evaluated at observation $Y_i$ for each $\psi^{(t)}$. The product of CPOs is referred to as the pseudo marginal likelihood (PML) ():

$$\mathrm{PML} = \prod_{i=1}^{l} \mathrm{CPO}_i$$

Similar to the log likelihood, LPML is usually computed:

$$\mathrm{LPML} = \log \left\{ \prod_{i=1}^{l} \mathrm{CPO}_i \right\} = \sum_{i=1}^{l} \log(\mathrm{CPO}_i)$$

PML or LPML can be used as a measure of Bayesian model fit and selection. The model with the largest LPML provides the best fit to the data. As another model selection criterion, the pseudo Bayes factor (PBF) can be easily estimated by dividing the PML of two models ().
For example, to verify whether model 1 fits the data better than model 2, the PBF is given by

$$\mathrm{PBF} = \frac{\mathrm{PML}_{\text{model 1}}}{\mathrm{PML}_{\text{model 2}}}$$

Table shows how model selection can be carried out based on Bayes factor values as reported in (0). Note that the interpretation of the PBF is similar to that of the Bayes factor.
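The CPO, LPML, and PBF definitions above can be sketched in a few lines of code. The following is a minimal illustration, not the authors' WinBUGS implementation: it assumes that the pointwise log-likelihoods log f(y_i | ψ^(t)) have already been saved from the MCMC output, and the function names (`log_cpo`, `lpml`, `pseudo_bayes_factor`) and the toy numbers are hypothetical. The harmonic mean is computed on the log scale for numerical stability.

```python
import math


def log_mean_exp(values):
    """Numerically stable log of the arithmetic mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))


def log_cpo(loglik_draws):
    """log CPO_i from the draws log f(y_i | psi^(t)), t = 1..T.

    CPO_i is the harmonic mean of the likelihood over the MCMC draws:
    CPO_i = [ (1/T) * sum_t 1 / f(y_i | psi^(t)) ]^(-1).
    """
    return -log_mean_exp([-ll for ll in loglik_draws])


def lpml(loglik_matrix):
    """LPML = sum_i log(CPO_i); loglik_matrix[i][t] = log f(y_i | psi^(t))."""
    return sum(log_cpo(row) for row in loglik_matrix)


def pseudo_bayes_factor(lpml_model_1, lpml_model_2):
    """PBF = PML_1 / PML_2 = exp(LPML_1 - LPML_2)."""
    return math.exp(lpml_model_1 - lpml_model_2)


# Toy example (hypothetical numbers): 2 observations, 3 MCMC draws per model.
loglik_model_1 = [[-1.0, -1.2, -0.9], [-2.0, -2.1, -1.8]]
loglik_model_2 = [[-1.5, -1.6, -1.4], [-2.4, -2.6, -2.2]]

lpml_1 = lpml(loglik_model_1)
lpml_2 = lpml(loglik_model_2)
pbf = pseudo_bayes_factor(lpml_1, lpml_2)  # > 1 favours model 1
```

Working with LPML differences rather than raw PML products avoids numerical underflow when the number of observations is large.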

Table Bayesian Model Selection via Bayes Factor

Bayes Factor    Degree of support for the model of interest
-               No evidence of support
-0              Support
0-0             Strong support
>0              Very strong support

RESULTS AND DISCUSSION

The models explained above were implemented in the statistical software WinBUGS. A total of 0000 MCMC iterations, in addition to 000 burn-in iterations, with chains were used to obtain posterior inferences. All three models ran smoothly and converged relatively quickly. For example, the FDPMM converged at around 000 iterations. This is an indication of well-defined models and priors. MCMC convergence was verified through the history plots, trace plots, and Gelman-Rubin diagrams available in WinBUGS. Table presents the results of the analyses (at % level of confidence) for the municipality-level data. The standard model (the SPLM) provided a poor fit compared to the other two models, which account for the hierarchy in the data. The results highlighted that traffic exposure, urban area, whistle prohibition, and train speed are positively associated with crash frequencies at FLB crossings. The significant variance of the varying intercept in the multilevel framework indicates that crossings nested in the same municipality are dependent to some degree. Therefore, the SPLM is not a proper choice. Interestingly, whistle prohibition is significant at a level of confidence of 0.0 in the SPLM, but this variable is only significant at a level of confidence of 0.0 in the RIMPLM and the FDPMM. This is in accordance with previous research (, ). As discussed in (), single-level models such as the SPLM employed here assume that all observations are generated from a single homogeneous population. This in turn implies that the residuals are independent; when this assumption does not hold, standard errors are underestimated and, consequently, confidence intervals are erroneous. Our flexible model provided the best fit to the municipality-level data.
The log pseudo marginal likelihood of the FDPMM is the highest (see Table ). When comparing the FDPMM with the RIMPLM, a pseudo Bayes factor of. indicates strong support (see also Table ) for the proposed flexible model. This calls into question the adequacy of the standard parametric assumption on the varying intercept for the municipality-level data. In other words, assuming a common distribution for all municipalities is not appropriate. It can be seen in Table that the expected number of non-empty clusters is. in the FDPMM. Since the number of municipalities is large, we do not report all clusters here. For illustration, a small part of the clustering results is reported in Table. As an example, we found that the following municipalities share the same cluster with a probability greater than 0.0: Calgary, Edmonton, Regina, Saskatoon, Winnipeg, Grande Prairie, and Nanaimo.
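The clustering statements above (probability that municipalities share a cluster, expected number of non-empty clusters) can be illustrated with a short sketch. The paper does not describe how these quantities were extracted from the WinBUGS output, so the code below is an assumption: it supposes that the cluster label of each municipality was saved at every MCMC iteration, and all function names, the threshold, and the toy labels are hypothetical.

```python
from itertools import combinations


def coclustering_probability(label_draws):
    """Share of MCMC iterations in which each pair of units has the same label.

    label_draws[t][j] is the cluster label of unit j at iteration t.
    Returns a dict mapping (j, k) with j < k to an estimated probability.
    """
    n_draws = len(label_draws)
    n_units = len(label_draws[0])
    prob = {}
    for j, k in combinations(range(n_units), 2):
        same = sum(1 for draw in label_draws if draw[j] == draw[k])
        prob[(j, k)] = same / n_draws
    return prob


def similar_units(prob, unit, threshold):
    """Units sharing a cluster with `unit` with probability above `threshold`."""
    out = []
    for (j, k), p in prob.items():
        if p > threshold:
            if j == unit:
                out.append(k)
            elif k == unit:
                out.append(j)
    return sorted(out)


def expected_nonempty_clusters(label_draws):
    """Average number of distinct labels in use per iteration."""
    return sum(len(set(draw)) for draw in label_draws) / len(label_draws)


# Toy example (hypothetical labels): 4 units, 4 MCMC iterations.
draws = [
    [0, 0, 1, 2],
    [0, 0, 1, 1],
    [3, 3, 0, 1],
    [1, 1, 1, 0],
]
prob = coclustering_probability(draws)
partners_of_0 = similar_units(prob, unit=0, threshold=0.5)  # unit 1 only
partners_of_3 = similar_units(prob, unit=3, threshold=0.5)  # empty: outlier-like
mean_clusters = expected_nonempty_clusters(draws)
```

Only label equality within an iteration is used, so the result does not depend on how clusters happen to be numbered from one iteration to the next. A unit with no partners at the chosen threshold behaves like an outlier in the sense used in this paper.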

Table Estimation Results for FLB Crossings, Municipality-Level Data

Variable    Posterior mean    Std. dev.    Credible intervals (.%, .%)

SPLM
  Log of exposure
  Urban area
  Whistle prohibition
  Train speed
  Intercept
  Variance (νε)
  LPML (log pseudo marginal likelihood)

RIMPLM
  Log of exposure
  Urban area
  Whistle prohibition
  Train speed
  Intercept mean
  Intercept variance
  Variance (νε)
  LPML (log pseudo marginal likelihood)

FDPMM
  Log of exposure
  Urban area
  Whistle prohibition
  Train speed
  Intercept mean
  Intercept variance
  Intercept's baseline mean (m0)
  Intercept's baseline variance (ν0)
  Variance (νε)
  Dirichlet precision parameter (κ)
  Expected number of non-empty clusters
  LPML (log pseudo marginal likelihood)

Model Comparison based on PBF (Pseudo Bayes Factor)
  PBF (FDPMM vs. RIMPLM) =.

Whistle prohibition is significant at a significance level of 0.0 in multilevel models.
Note: SPLM is the Standard Poisson-Lognormal Model; RIMPLM is the Random Intercept Multilevel Poisson-Lognormal Model; and FDPMM is the Flexible Dirichlet Process Multilevel Model.

Table Cluster and Outlier Identification Results

Municipality    Average size of cluster (% interval)    Similar municipalities
                                                         Probability > 0.
                                                         Probability > 0.
                                                         Probability > 0.

Note: size of cluster is the median of the number of municipalities in the same cluster.

It should be mentioned that no outlier municipality was identified, and the smallest cluster included 0 municipalities. An outlier municipality is detected when no other municipality shares its cluster; that is, the outlier forms a cluster by itself. Note that Table uses different threshold probabilities, for illustration, to define clusters among municipalities. Alternative threshold values naturally result in different cluster memberships, and larger threshold probabilities result in a higher number of clusters. In other words, as the threshold probability increases, fewer of the remaining observations share the same cluster with a given observation i.

SUMMARY AND CONCLUSIONS

To overcome unobserved heterogeneity in data, random effects/parameters models and mixture models are often used in the transportation safety literature. Standard distributional assumptions are an intrinsic part of random effects/parameters models. Because sensitivity to such assumptions might be a major concern in some datasets or applications, this paper proposes a class of advanced flexible statistical models to investigate the adequacy of these parametric assumptions. The adopted approach has several additional advantages, such as the ability to identify outliers and latent subpopulations in data. The method is also capable of accommodating outliers in analyses while preventing them from affecting the quality of the estimates. It should be noted that the mixture modeling approach is an alternative method that can deal with some concerns associated with random effects/parameters models. In mixture models, however, the number of latent components in the data should be specified in advance. In most applications, there is no sound justification for selecting the number of components.
Our proposed technique treats the number of latent components as an unknown parameter and estimates its expectation as part of the estimation algorithm. We used a multilevel dataset containing crash frequencies for FLB grade crossings in Canada to show the feasibility of the adopted flexible model. Log pseudo marginal likelihoods and pseudo Bayes factors computed from conditional predictive ordinates were used for model selection. The results confirmed the need for a multilevel modeling approach. We found that the single-level model underestimated the standard error of the coefficient associated with whistle prohibition in the municipality-level data. Traffic exposure, location of crossing (urban vs. non-urban), train speed, and whistle prohibition were positively associated with crash frequencies. The results called into question the adequacy of the standard parametric assumption for the municipality-level data. We identified latent subpopulations among Canadian municipalities. Finally, in terms of outliers, the results indicated that there is no outlier municipality among those analyzed in this paper. It should be noted that the identification of clusters among regions has significant interpretative value. Membership in the same subgroup is an indicator of common unmeasured/unknown covariates among those regions. Based on the identified clusters, further investigations can be conducted to detect the presence (or extent) of such unmeasured/unknown covariates and attributes. Latent similarities and dissimilarities are expected among regions due to variations in regional policies, population demography, driver behaviour, climate, traffic regulations, etc.

Acknowledgments

The authors would like to acknowledge the Natural Sciences and Engineering Research Council of Canada for their financial support. We would also like to thank Transport Canada (Rail Safety Directorate) for providing the data and financial support.

References

Anastasopoulos, P., and F. Mannering. A note on modeling vehicle accident frequencies with random-parameters count models. Accident Analysis and Prevention, Vol., 00, pp.
Chen, E., and A. Tarko. Modeling safety of highway work zones with random parameters and random effects models. Analytic Methods in Accident Research, Vol., 0, pp. -.
Mannering, F., and C. R. Bhat. Analytic Methods in Accident Research: Methodological Frontier and Future Directions. Analytic Methods in Accident Research, Vol., 0, pp. -.
Ohlssen, D. I., L. D. Sharples, and D. J. Spiegelhalter. Flexible Random-Effects Models Using Bayesian Semi-Parametric Models: Application to Institutional Comparisons. Statistics in Medicine, Vol., 00, pp.
Wu, Z., A. Sharma, F. Mannering, and S. Wang. Safety impacts of signal-warning flashers and speed control at high-speed signalized intersections. Accident Analysis and Prevention, Vol., 0, pp. 0.
Jones, A., and S. Jørgensen. The use of multilevel models for the prediction of road accident outcomes. Accident Analysis and Prevention, Vol., 00, pp.
Yannis, G., E. Papadimitriou, and C. Antoniou. Multilevel Modelling for the Regional Effect of Enforcement on Road Accidents. Accident Analysis and Prevention, Vol., 00, pp. -.
Huang, H., H. C. Chin, and M. M. Haque. Severity of driver injury and vehicle damage in traffic crashes at intersections: a Bayesian hierarchical analysis. Accident Analysis and Prevention, Vol. 0, 00, pp.
Papadimitriou, E., A. Theofilatos, G. Yannis, J. Cestac, and S. Kraïem. Motorcycle Riding Under the Influence of Alcohol: Results from the SARTRE- Survey. Accident Analysis and Prevention, Vol. 0, 0, pp.
Xiong, Y., and F. Mannering. The Heteroscedastic Effects of Guardian Supervision on Adolescent Driver-Injury Severities: A Finite Mixture-Random Parameters Approach. Transportation Research Part B, Vol., 0, pp. -.
Dupont, E., E. Papadimitriou, H. Martensen, and G. Yannis. Multilevel Analysis in Road Safety Research. Accident Analysis and Prevention, Vol. 0, 0, pp.
Huang, H., and M. Abdel-Aty. Multilevel data and Bayesian analysis in traffic safety. Accident Analysis and Prevention, Vol., 00, pp.
Heydari, S., Miranda-Moreno, L.F., Liping, F. Speed limit reduction in urban areas: A before-after study using Bayesian generalized mixed linear models. Accident Analysis and Prevention, Vol., pp. -.
Escobar, M., and M. West. Bayesian Density Estimation and Inference Using Mixtures. Journal of the American Statistical Association, Vol. 0,, pp. -.
Park, B. J., and D. Lord. Application of Finite Mixture Models for Vehicle Crash Data Analysis. Accident Analysis and Prevention, Vol., 00, pp. -.

Saccomanno, F. F., and X. Lai. A Model for Evaluating Countermeasures at Highway-Railway Grade Crossings. Transportation Research Record: Journal of the Transportation Research Board, No., 00, pp. -.
Oh, J., S. P. Washington, and N. Doohee. Accident Prediction Models for Railway-Highway Interfaces. Accident Analysis and Prevention, Vol., 00, pp.
Yan, X., S. Richards, and X. Su. Using Hierarchical Tree-Based Regression Model to Predict Train-Vehicle Crashes at Passive Highway-Rail Grade Crossings. Accident Analysis and Prevention, Vol., 00, pp. -.
Spiegelhalter, D. J., A. Thomas, and N. G. Best. WinBUGS User Manual. MRC Biostatistics Unit and Imperial College, 00. Available from
Muller, P., and F. A. Quintana. Nonparametric Bayesian data analysis. Statistical Science, Vol., 00, pp. 0.
Hjort, N., C. Holmes, P. Müller, and S. G. Walker. Bayesian Nonparametrics: Principles and Practice. Cambridge University Press, 00.
Ferguson, T. S. A Bayesian Analysis of Some Nonparametric Problems. The Annals of Statistics, Vol.,, pp.
Antoniak, C. E. Mixtures of Dirichlet Processes with Applications to Nonparametric Problems. The Annals of Statistics, Vol.,, pp. -.
Bush, C. A., and S. N. MacEachern. A Semi-Parametric Bayesian Model for Randomized Block Designs. Biometrika, Vol.,, pp. -.
Mukhopadhyay, S., and A. E. Gelfand. Dirichlet Process Mixed Generalized Linear Models. Journal of the American Statistical Association, Vol.,, pp. -.
Dhavala, S. S., S. Datta, B. K. Mallick, R. J. Carroll, S. Khare, S. D. Lawhon, and L. G. Adams. Bayesian Modeling of MPSS Data: Gene Expression Analysis of Bovine Salmonella Infection. Journal of the American Statistical Association, Vol. 0, 00, pp. -.
Ishwaran, H., and L. F. James. Gibbs Sampling Methods for Stick-Breaking Priors. Journal of the American Statistical Association, Vol., 00, pp.
Gelfand, A. Model determination using sampling-based methods, in W. Gilks, S. Richardson, and D. Spiegelhalter, eds., Markov Chain Monte Carlo in Practice, Chapman & Hall, Suffolk,.
Carlin, B. P., and T. A. Louis. Bayesian Methods for Data Analysis, third edition. Boca Raton: Chapman & Hall/CRC,
Ntzoufras, I. Bayesian Modeling Using WinBUGS. John Wiley & Sons, 00.
Yang, H., K. Ozbay, O. Ozturk, and M. Yildirimoglu. Modeling Work Zone Crash Frequency by Quantifying Measurement Errors in Work Zone Length. Accident Analysis and Prevention, Vol., 0, pp.
Kun, X., X. Wang, K. Ozbay, and H. Yang. Crash Frequency Modeling for Signalized Intersections in a High-Density Urban Road Network. Analytic Methods in Accident Research, Vol., 0, pp. -.
Spiegelhalter, D. J., N. G. Best, B. P. Carlin, and A. van der Linde. Bayesian Measures of Complexity and Fit (with Discussion). Journal of the Royal Statistical Society, Series B, Vol., 00,.
Geedipally, S. R., D. Lord, and S. S. Dhavala. A Caution about Using Deviance Information Criterion While Modelling Traffic Crashes. Safety Science, Vol., 0, pp. -.

Kim, D. G., Y. Lee, S. Washington, and K. Choi. Modeling Crash Outcome Probabilities at Rural Intersections: Application of Hierarchical Binomial Logistic Models. Accident Analysis and Prevention, Vol., 00, pp. -.
