Confidence and prediction intervals for generalised linear accident models G.R. Wood September 8, 2004 Department of Statistics, Macquarie University, NSW 2109, Australia E-mail address: gwood@efs.mq.edu.au Fax: 61 2 9850 7669 Abstract Generalised linear models, with log link and either Poisson or negative binomial errors, are commonly used for relating accident rates to explanatory variables. This paper adds to the toolkit for such models. It describes how confidence intervals (for example, for the true accident rate at given flows) and prediction intervals (for example, for the number of accidents at a new site with given flows) can be produced using spreadsheet technology. Running head: Confidence intervals for accident models Key words and phrases: Generalised linear model; negative binomial; Poisson
1 Introduction Generalised linear models have gathered recognition in recent years (Maycock and Hall, 1984; Hauer et al., 1988; Maher and Summersgill, 1996) as useful tools for relating the number of accidents, of a specifed type, to explanatory variables such as vehicle flows. For the single flow model, the true mean number of accidents µ is modelled as β 0 x β 1, where x denotes the flow. The distribution of the observed number of accidents, for a given flow, is assumed to be either Poisson, or more generally, negative binomially distributed about this mean value. The negative binomial distribution occurs naturally when we allow for variation of safety M between sites, with a given flow, to be modelled by a gamma distribution, and then variation of the number of accidents Y within a site, with safety M, to be modelled by a Poisson distribution with mean M. A detailed description of these models has been given in the companion paper (Wood, 2002), where methods for assessing goodness of fit were described. Once goodness of fit is established for a model it is of interest to provide confidence intervals (for model parameters) and prediction intervals (for dependent variables); this is routinely carried out when working with linear models. Such intervals provide information about the extent of variation in these quantities. In this context the intervals of interest, for a given flow, are: i) A confidence interval for µ, the true accident rate ii) a) For a Poisson model, a prediction interval for y, the accident rate at a new site 1
b) For a negative binomial model, a prediction interval for m, the safety of a new site, and a prediction interval for y, the accident rate at a new site. The purpose of this paper is to provide formulae, in Section 2, which enable construction of these intervals, and to illustrate their use with real accident data in Section 3. The required calculations can be carried out on a spreadsheet. Exposition is generally in terms of models with a single flow; models with more than one explanatory variable are handled in an extended, but similar, fashion. Notation and terminology used in this paper are as in Wood (2002). Standard texts, for example McCullough and Nelder (1989), discuss confidence intervals for generalised linear model parameters; the author, however, has not found the approach discussed here in the literature, other than in Maher and Summersgill (1996). Here we clarify, amplify and extend that work. Specifically, approximate confidence and prediction intervals appropriate for a given flow are developed. We caution that the confidence level necessarily decreases if we wish to make statements about many flow values. For this, so-called simultaneous (and necessarily wider) confidence bands are needed. The development of simultaneous confidence bands is a topic of current research; the work of Sun et al. (2000) produces such confidence bands for the mean in a generalised linear model. This paper can be read in two ways. A reader interested in the practical construction of confidence and prediction intervals should skim Section 2, then work carefully through the examples of Section 3, referring to Section 2 and the appendix for formulae as needed 2
(Table 4 provides an overall summary). For the reader interested in the underlying theory, careful reading of Section 2 and the appendix is recommended. 2 Confidence and prediction intervals A confidence interval for the true mean, for both the Poisson and negative binomial models, is developed in Section 2.1. In Section 2.2 a prediction interval for a predicted number of accidents at a new site is derived for the Poisson model, while in Section 2.3 prediction intervals for safety and predicted number of accidents at a new site are produced for negative binomial models. 2.1 Confidence interval for µ The generalised linear model we have described uses a log link function; the logarithm of µ is linear in the model parameters β 0 and β 1, since η = log µ = log β 0 + β 1 log x = β 0 + β 1 log x, for the single flow model. Standard generalised linear model theory gives that asymptotically the estimates b 0 and b 1, of β 0 and β 1 respectively, have a bivariate normal distribution (Dobson, 1990), in particular b 0 N β 0, I 1, b 1 β 1 so they are unbiased, with covariance matrix the inverse of the information matrix I. It follows that ˆη = b 0 + b 1 log x has asymptotically a normal distribution and since ˆη = log ˆµ, where ˆµ = e b 0 x b 1, ˆµ has an approximately lognormal distribution. 3
This enables us to write down an approximate 95% confidence interval for η, when the flow is x, as b 0 + b 1 log x ± 1.96 Var(b 0 + b 1 log x), whence a 95% confidence interval for µ = e η is given by [ e b 0 +b 1 log x 1.96 Var(b 0 +b 1 log x), e b 0 +b 1 log x+1.96 Var(b 0 +b 1 log x)] The lower boundary is closer to the estimate ˆµ of µ than is the higher boundary, reflecting the right skewed lognormal distribution of the estimate ˆµ. Here Var(b 0 + b 1 log x) = Var(b 0) + 2 log xcov(b 0, b 1 ) + (log x) 2 Var(b 1 ) = I11 1 + 2 log xi12 1 + (log x) 2 I22 1 Illustrative real examples are given in Section 3. Note that ˆη = (1, log x)(b 0, b 1 ) T (where T denotes transpose) so in practice Var(ˆη) is most easily calculated as Var(ˆη) = (1, log x)i 1 (1, log x) T There are two ways to find the components of I 1. If the model is fitted using a statistical package then options are generally available which output the covariance matrix I 1 of the paramaters. On the other hand, if using the first principles method described in (Wood, 2002, A.3) then the required covariance matrix is (X T W X) 1, where X is the design matrix and W a diagonal matrix. A final remark in this subsection: the log-normal distribution of ˆµ discussed can be approximated by a normal distribution, or ˆµ N ( µ 0 = µ, σ0 2 = µ 2 Var(ˆη) ) 4
(as in Maher and Summersgill (1996), Equation (14)). This approximate sampling distribution for ˆµ is fundamental in the sequel. 2.2 Poisson model We consider the case of the Poisson model and an interval for a predicted number of accidents, y. Under the model, given a true mean accident rate of µ, the conditional distribution of accidents Y is Poisson with mean µ. A confidence interval for the number of accidents Y, however, must now accommodate the approximately normal variation in ˆµ, our estimator of µ, as N(µ 0, σ0). 2 Table 1 summarises the variables involved. Variable description Variable notation Distribution Accident rate, given true rate µ Y µ Poisson(µ) Estimator of true mean accident rate ˆµ N(µ 0, σ0) 2 Table 1: The two levels of variation, first in µ, then in Y given µ, to be considered when forming a prediction interval for y in the Poisson model. The marginal distribution of Y is thus a mixture of Poisson distributions, on the mean, by a normal distribution. It can be shown that the distribution of Y, supported by {0, 1, 2,...} has mean µ 0 and variance σ0 2 + µ 0. (The key to this calculation is the observation that a central moment of a mixture is the mixture of the central moments of the distributions being mixed.) Our intuition does tell us that this variance should depend on that of ˆµ, namely σ0, 2 and also should increase as µ 0 increases, since Poisson distributions with larger mean have greater variance. 5
Chebyshev s inequality (Feller, 1966), namely P ( Y µ Y tσ Y ) 1 t 2 for t > 0, is a most useful result and can be used to produce a prediction interval. When the interval is one-sided (so for low mean values) Chebyshev s one-sided inequality (Feller, 1966, Section V.7, Example (a)), namely P (Y µ Y tσ Y ) 1 1 + t 2 for t > 0, provides a tighter interval. A formula, exploiting the discreteness of the distribution of Y, however, has been developed and offers a slight strengthening of this approach. It is described in the appendix and its use illustrated in Section 3. Further tightening of this confidence interval is doubtless still possible, for example, by making use of third and higher moments. 2.3 Negative binomial model We now develop intervals for safety and predicted number of accidents, for the negative binomial model; there are three mixtures involved. We first study construction of a prediction interval for site safety M, which rests on a mixing process analogous to the Poisson case. We then see that the negative binomial conditional distribution of Y, given µ and k, is a mixture, as is the marginal distribution of Y. 6
Prediction interval for safety m Here we find a prediction interval for m, the underlying site safety described in Hauer et al. (1988), as the flow x varies. We answer the question If we selected another site with flow x, where would m lie? Table 2 summarises the variables involved. Variable description Variable notation Distribution Safety, given true rate µ M (µ, k) Gamma(k, µ/k) Estimator of true mean accident rate ˆµ N(µ 0, σ0) 2 Table 2: The two levels of variation, first in µ, then in M given µ and k, needed when producing a prediction interval for safety m, for the negative binomial model. Here we mix gamma distributions with a normal; it can be shown that the marginal distribution of M has mean µ 0 and variance σ0 2 +(σ0 2 +µ 2 0)/k. If we assume approximate normality (this will improve as µ increases) and recalling that k is estimated as ˆk during the fitting process, an approximate 95% prediction interval for site safety m is ˆµ ± 1.96 ˆµ2 Var(ˆη) + ˆµ2 Var(ˆη) + ˆµ 2 ˆk Estimates ˆµ and Var(ˆη) can be readily evaluated, as described earlier. The lower limit of this interval may be negative, due to the use of a normal approximation to the lognormally distributed ˆµ; in this case the lower limit should be set to zero. Simulation tests, using typical parameter values, have shown this to provide a satisfactory prediction interval approximation. 7
Prediction interval for number of accidents y Here we find a prediction interval for the number of accidents y at a site, randomly chosen from those with flow x. The relevant variables are summarised in Table 3. Variable description Variable notation Distribution Accident rate, given safety m Y m Poisson(m) Safety, given true rate µ M µ, k Gamma(k, k/µ) Estimator of true mean accident rate ˆµ N(µ 0, σ0) 2 Table 3: The three levels of variation involved in forming a prediction interval for y in the negative binomial model; first in µ, then in M given µ and k, and finally in Y given m. The model builds the distribution of accidents across all sites with flow x, first as a mixture of the Poisson within site variation Y m by the gamma across site variation M µ, k, well known to be negative binomial with parameters k and p = k/(µ + k), as described in (Wood, 2002, p.425). Second, we must now recognise that µ itself is unknown, so the accident rate is really a mixture of negative binomial Y k, p variables by a normal distribution on µ. The marginal distribution of Y, the number of accidents at a site with flow x, having support in {0, 1, 2,...}, can be shown to have mean µ 0 and variance σ0 2 +(σ0 2 +µ 2 0)/k+µ 0. All quantities can be estimated during model fitting, as earlier. Note that as k increases the variance shrinks to that found in the Poisson case. Chebyshev s one-sided inequality, or the slightly stronger method of the appendix, can be used to produce a prediction 8
interval for y. 3 Examples Illustrations of each of the intervals described are now given, using three New Zealand accident datasets. 3.1 Poisson model with one independent variable A Poisson model was used to relate loss-of-control accidents to incoming flow at 289 arms of both priority and uncontrolled T-intersections throughout New Zealand, yielding b 0 = 4.5260 and b 1 = 0.2883. The grouped G 2 method described in (Wood, 2002, Section 4.3) was used to test the goodness of fit of the model; there was no evidence of poor fit. In order to find a confidence interval for µ, the covariance matrix (X T W X) 1 was obtained as a by-product of the fitting process, as I 1 = 2.6724 0.5140 0.5140 0.1018 For a flow of x = 600, for example, this gives Var(b 0 + b 1 log x) = 0.2621, whence an approximate 95% confidence interval for µ is [0.0251, 0.1867]. The full confidence band, as x varies, is shown (dashed line) around the fitted curve (solid line) in Figure 1. For each flow, µ 0 = ˆµ and σ0 2 = ˆµ 2 Var(ˆη) were then calculated, as described in Section 2.2. The formula given in the appendix was then applied, yielding the 95% band for a predicted y value, shown as the stepped line in Figure 1. (The 90% band 9
was everywhere {0}, so the horizontal axis in the figure.) Figure 1 here Figure 1: A Poisson model (solid curve) relates accident rate to flow for the loss-ofcontrol data. A 95% confidence band for the true accident rate µ is shown with the dashed lines, while a 95% prediction band for the number of accidents y at at new site is shown with the stepped line. 3.2 Negative binomial model with one independent variable Rear-end accidents at 392 arms of signalised crossroads throughout New Zealand were related to flow using the negative binomial model, producing b 0 = 16.3141, b 1 = 1.6330 and ˆk = 0.60. Goodness of fit of the model was tested using the method presented in (Wood, 2002, Section 4.4), revealing no evidence of poor fit. Again, the covariance matrix for b 0 and b 1, (X T W X) 1 was obtained as I 1 = 8.4048 0.9347 0.9347 0.1042 allowing construction of a 95% confidence band for µ, as described for the Poisson model. For example, for x = 10000, we find ˆµ = 0.2798 and Var(ˆη) = Var(b 0+b 1 log x) = 0.0263, whence an approximate 95% confidence interval is [0.2036, 0.3845]. The full confidence band is shown (long dashes) around the fitted curve (solid line) in Figure 2. The variance σ0 2 +(σ0 2 +µ 2 0)/k of M, for a given flow, was then estimated as described 10
in Section 2.3. Continuing the example, for x = 10000, and recalling that ˆk = 0.60, this leads to a 95% prediction interval for m of [0, 1.003]. The full prediction band is shown (short dashes) in Figure 2. Finally, the variance of a predicted Y is calculated using the formula given in Section 2.3, for each flow level, and the formula described in the appendix used to calculate the upper limit of the prediction interval. For example, for x = 10000 and using ˆµ, Var(ˆη) and ˆk as before we find that {0, 1, 2} provides a 90% interval. The full prediction band is shown in Figure 2 (stepped horizontal lines). Figure 2 here Figure 2: A negative binomial model (solid curve) relates accident rate to flow for the rear-end data. A 95% confidence band for the true accident rate µ is shown (long-dashes) and a 95% prediction band for the safety m (short dashes), while a 90% prediction band for the number of accidents at a new site is shown (stepped horizontal lines). 3.3 Negative binomial model with two independent variables Right turn against vehicle accidents at four-arm, two-way signalised intersections were related to the two traffic flows (annual average daily through flow x 1 and annual average daily right-turning flow x 2 ) in a recent New Zealand study, using a negative binomial model with µ = β 0 x β 1 1 x β 2 2. Maximum likelihood estimation of the parameters gave b 0(= log b 0 ) = 11.2598, b 1 = 0.8175, b 2 = 0.4883, ˆk = 1.8 and a covariance matrix for 11
the three parameters of I 1 = 1.6784 0.1353 0.0688 0.1353 0.0158 0.0005 0.0688 0.0005 0.0104 Using centrally placed flows of x 1 = 10000 and x 2 = 5000, a 95% confidence interval for the associated true accident rate µ will be [ ] eˆη 1.96 Var(ˆη), eˆη+1.96 Var(ˆη) where ˆη = log ˆµ = b 0 + b 1 log x 1 + b 2 log x 2 and Var(ˆη) = Var(b 0) + (log x 1 ) 2 Var(b 1 ) + (log x 2 ) 2 Var(b 2 ) + 2 log x 1 Cov(b 0, b 1 ) + 2 log x 2 Cov(b 0, b 2 ) + 2 log x 1 log x 2 Cov(b 1, b 2 ) As remarked earlier, these calculations are most easily carried out by noting that ˆη = (1, log x 1, log x 2 )(b 0, b 1, b 2 ) T whence Var(ˆη) = (1, log x 1, log x 2 )I 1 (1, log x 1, log x 2 ) T Substituting values gives ˆη = 0.4286 and Var(ˆη) = 0.0304 whence a 95% confidence interval for µ is [1.0907, 2.1605]. A 95% prediction interval for safety m and the given flows is still given by ˆµ ± 1.96 ˆµ2 Var(ˆη) + ˆµ2 Var(ˆη) + ˆµ 2 ˆk 12
Since ˆµ = eˆη = 1.5351, substituting values and raising the lower boundary to zero produces the wider interval of [0, 3.8712]. Finally, the number of accidents Y at a site with the given flows will have mean estimated as ˆµ = 1.5351 and variance as ˆµ 2 Var(ˆη) + ˆµ2 Var(ˆη) + ˆµ 2 ˆk + ˆµ = 2.9557 Since ˆµ > 1 we must use the one-sided Chebyshev inequality to find the upper limit of a 95% prediction interval for y. This yields P (Y µ Y + 19σ Y ) 1 20 whence substitution of the estimated mean and standard deviation for Y gives an upper limit of 1.5351 + 19 2.9557 = 9.0290. Thus we can quote [0, 9] as a 95% prediction interval for the number of right turn against accidents at a signalised intersection with the given flows. 4 Discussion and summary Comment on the approximations used to produce the intervals is now presented. First, model parameter estimates (b 0, b 1, etc.) are only approximately normally distributed, the approximation improving with sample size; this will influence the accuracy of a confidence interval produced for ˆµ. Second, the lognormal distribution of ˆµ is approximated by a normal distribution, so tending to widen a prediction interval for m and y. Finally, use of the Chebyshev inequality is conservative, so again tending to widen a prediction 13
interval. The method given in the appendix was developed in an attempt to improve on the Chebyshev inequality. Once the model is fitted, for any finite number of independent variables (whether measurement or categorical), all intervals discussed can be calculated using a spreadsheet. In general we have ˆη = ab T, Var(ˆη) = ai 1 a T and ˆµ = eˆη, where a is the row vector of coefficients of the parameter estimates in row vector b. For example, for the model µ = β 0 x β 1 e zβ 2, with one measurement variable x and one categorical variable z, a = (1, log x, z) while b = (b 0, b 1, b 2 ). The form of all 95% confidence and prediction intervals discussed in this paper is summarised in Table 4, with intervals for y produced using the one-sided Chebyshev inequality. For a 90% prediction interval for y, the factor in the formulae is 3, rather than 19, being then the solution to 1/(1 + t 2 ) = 1/10; for a 99% prediction interval for y, the factor would be 99. When ˆµ 1 it may be possible to tighten the interval for y using the formulae in the appendix; for high ˆµ values a prediction interval for y excluding zero may possibly be found using the two-sided Chebyshev inequality. An assumption underlying the models used is that the explanatory variables are measured without error. As was pointed out in Section 8 of Maher and Summersgill (1996), for geometric and control variables this presents no problem. Flow variables, however, should be annual average daily traffic (AADT) over the study period. In practice, they are often scaled figures, based on a count in a single day, so estimates of AADT. Maher and Summersgill (1996) ran simulation experiments and concluded that 14
Poisson model µ y [ˆµ/e 1.96 Var(ˆη), ˆµe 1.96 Var(ˆη) ] [ 0, ˆµ + 19 ˆµ 2 Var(ˆη) + ˆµ ] Negative binomial model µ m y [ˆµ/e 1.96 Var(ˆη), ˆµe 1.96 Var(ˆη) ] max 0, ˆµ 1.96 ˆµ 2 Var(ˆη) + ˆµ2 Var(ˆη) + ˆµ 2, ˆµ + 1.96 ˆµ ˆk 2 Var(ˆη) + ˆµ2 Var(ˆη) + ˆµ 2 ˆk 0, ˆµ + 19 ˆµ 2 Var(ˆη) + ˆµ2 Var(ˆη) + ˆµ 2 + ˆµ ˆk Table 4: A summary of the 95% confidence and prediction intervals discussed in the paper ( x denotes the largest integer less than or equal to x). for their data, 12 or 16 hour single day counts provided adequate accuracy. The intervals discussed in this paper rely not only on ˆη (and hence the flows and parameter estimates) but also on Var(ˆη) and ˆk. The effect of error in the AADTs on estimation of the covariance matrix (and hence Var(ˆη)) and the parameter k remains to be investigated. As a general guide, however, counts taken over longer periods will provide for greater confidence and prediction interval accuracy. Collinearity of columns of the design matrix leads to uncertainty in parameter estimates. This uncertainty (seen as larger entries in the covariance matrix I 1 ) is incorporated into the intervals constructed. The upshot is that use of independent explanatory variables will tend to yield tighter confidence and prediction intervals. For example, 15
the through and right-turning flows used in the two variable example in Section 3.3 exhibited extremely low correlation, leading to the low second and third figures on the diagonal of the covariance matrix. To summarise, approximate confidence intervals for the true mean have been presented (Section 2.1) for Poisson and negative binomial generalised linear accident models. An approximate prediction interval for a predicted accident rate has been developed (Section 2.2) for the Poisson model. The form of this interval for the negative binomial model has also been presented (Section 2.3), together with a prediction interval for a predicted safety. Examples have been used (Section 3) to illustrate the ideas. Acknowledgements Dr. Shane Turner is thanked for alerting the author to the importance of this problem and for kindly supplying (together with Aaron Roozenburg) the three data sets. The research was partly supported by the Land Transport Safety Authority of New Zealand. Pertinent comments from a referee led to improvements in the paper. Appendix A strengthening of the one-sided Chebyshev inequality, exploiting the integer support of Y and appropriate for the typically low mean values (µ < 1) encountered in accident modelling, is presented. It provides a formula for the upper limit of a one-sided 16
prediction interval for y. In the remainder, Y is a random variable taking values in {0, 1, 2,...}, with mean µ and variance σ 2. Table 5, in summary form, shows the prediction 100(1 α)% interval as µ varies. µ interval 100(1 α)% prediction interval 0 µ α {0} α < µ 0.5 {0, 1,..., µ + 0.5 < µ < 1 {0, 1,..., µ + µ 2 (µ 2 σ 2 )/α } 1 + µ 2 + (µ 2 + σ 2 µ(1 + 2α))/α } Table 5: Upper limits for the prediction interval for y, as µ varies ( x denotes the largest integer less than or equal to x). The reasoning behind these formulae follows now. We let p i = Pr(Y = i) for i = 0, 1,.... Then µ = ip i = ip i p i = 1 p 0 i=0 i=1 i=1 So if 0 µ α, p 0 1 µ 1 α whence {0} serves as a 100(1 α)% prediction interval. Now suppose that α < µ 0.5. It is always the case that x 1 σ 2 = p 0 (0 µ) 2 + p i (i µ) 2 + p i (i µ) 2 i=1 i=x where x 1 is the largest positive integer such that Pr(Y x) α. At least probability 1 µ sits at zero and at least probability α sits to the right of and including x. By placing the maximum possible balance of the probability, µ α, at the domain point 17
closest to µ, namely zero, we can conservatively find x as the largest solution of σ 2 (1 µ)µ 2 + (µ α)µ 2 + α(x µ) 2 or, with manipulation, the largest solution of x 2 2µx + 1 α (µ2 σ 2 ) 0 The larger root of this quadratic is µ+ µ 2 (µ 2 σ 2 )/α, from which the result follows. When 0.5 < µ < 1 we follow the same path, but must place the balance of probability at the closest domain point to µ, now 1, so must find the largest integer x for which or µ + σ 2 (1 µ)µ 2 + (µ α)(1 µ) 2 + α(x µ) 2 1 + µ 2 + (µ 2 + σ 2 µ(1 + 2α))/α, so demonstrating the result. References [1] Dobson, A., 1990. An Introduction to Generalized Linear Models. Chapman and Hall, London. [2] Feller, W., 1966. An Introduction to Probability Theory and its Applications, Vol. 2. Wiley, New York. [3] Hauer, E., Ng, J.C.N., Lovell, J., 1988. Estimation of safety at signalized intersections. Transportation Research Record 1185, 48-61. [4] Maher, M.J., Summersgill, I., 1996. A comprehensive methodology for the fitting of predictive accident models. Accident Analysis and Prevention 28, 281-296. 18
[5] Maycock, G., Hall, R.D., 1984. Accidents at 4-arm roundabouts. Laboratory Report LR1120. Crowthorne, Berks, U.K. Transport Research Laboratory. [6] McCullagh, P., Nelder, J.A., 1989. Generalised Linear Models (2nd edition). Chapman and Hall, London, New York. [7] Sun, J., Loader, C., McCormick, W.P., 2000. Confidence bands in generalized linear models. The Annals of Statistics 28, 429-460. [8] Wood, G.R., 2002. Generalised linear accident models and goodness of fit testing. Accident Analysis and Prevention 34, 417-427. 19