Confidence and prediction intervals for. generalised linear accident models

Similar documents
PLANNING TRAFFIC SAFETY IN URBAN TRANSPORTATION NETWORKS: A SIMULATION-BASED EVALUATION PROCEDURE

Institute of Actuaries of India

Delta Method. Example : Method of Moments for Exponential Distribution. f(x; λ) = λe λx I(x > 0)

Accident Analysis and Prevention xxx (2006) xxx xxx. Dominique Lord

LOGISTIC REGRESSION Joseph M. Hilbe

Subject CS1 Actuarial Statistics 1 Core Principles

Statistical Methods in HYDROLOGY CHARLES T. HAAN. The Iowa State University Press / Ames

Model Based Statistics in Biology. Part V. The Generalized Linear Model. Chapter 16 Introduction

STAT 302 Introduction to Probability Learning Outcomes. Textbook: A First Course in Probability by Sheldon Ross, 8 th ed.

COPYRIGHTED MATERIAL CONTENTS. Preface Preface to the First Edition

GLM models and OLS regression

Chapter 3: Polynomial and Rational Functions

Statistical Models for Management. Instituto Superior de Ciências do Trabalho e da Empresa (ISCTE) Lisbon. February 24 26, 2010

Statistics - Lecture One. Outline. Charlotte Wickham 1. Basic ideas about estimation

MAS223 Statistical Inference and Modelling Exercises

CHAIN LADDER FORECAST EFFICIENCY

IE 230 Probability & Statistics in Engineering I. Closed book and notes. 120 minutes.

TABLE OF CONTENTS CHAPTER 1 COMBINATORIAL PROBABILITY 1

Lecture 3: Statistical sampling uncertainty

From Practical Data Analysis with JMP, Second Edition. Full book available for purchase here. About This Book... xiii About The Author...

1 Data Arrays and Decompositions

Chapter 14. Linear least squares

Week 1 Quantitative Analysis of Financial Markets Distributions A

BIAS OF MAXIMUM-LIKELIHOOD ESTIMATES IN LOGISTIC AND COX REGRESSION MODELS: A COMPARATIVE SIMULATION STUDY

IV. The Normal Distribution

Preface Introduction to Statistics and Data Analysis Overview: Statistical Inference, Samples, Populations, and Experimental Design The Role of

X = X X n, + X 2

FINITE MIXTURES OF LOGNORMAL AND GAMMA DISTRIBUTIONS

Statistical Data Analysis

ABSTRACT KEYWORDS 1. INTRODUCTION

2008 Winton. Statistical Testing of RNGs

SYSM 6303: Quantitative Introduction to Risk and Uncertainty in Business Lecture 4: Fitting Data to Distributions

STA 2201/442 Assignment 2

Generalized Linear Models (GLZ)

Sampling: A Brief Review. Workshop on Respondent-driven Sampling Analyst Software

Freeway rear-end collision risk for Italian freeways. An extreme value theory approach

SuperMix2 features not available in HLM 7 Contents

ACTEX CAS EXAM 3 STUDY GUIDE FOR MATHEMATICAL STATISTICS

TRB Paper # Examining the Crash Variances Estimated by the Poisson-Gamma and Conway-Maxwell-Poisson Models

Fall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.

Math 180B Problem Set 3

Standard Error of Technical Cost Incorporating Parameter Uncertainty

ANALYSING BINARY DATA IN A REPEATED MEASUREMENTS SETTING USING SAS

Including Statistical Power for Determining. How Many Crashes Are Needed in Highway Safety Studies

Hypothesis Testing for Var-Cov Components

EVALUATION OF SAFETY PERFORMANCES ON FREEWAY DIVERGE AREA AND FREEWAY EXIT RAMPS. Transportation Seminar February 16 th, 2009

BOOTSTRAPPING WITH MODELS FOR COUNT DATA

Biased Urn Theory. Agner Fog October 4, 2007

D-optimal Designs for Factorial Experiments under Generalized Linear Models

Practice Problems Section Problems

Conditional distributions (discrete case)

Conjugate Predictive Distributions and Generalized Entropies

Sample size calculations for logistic and Poisson regression models

Statistical Methods for Astronomy

x. Figure 1: Examples of univariate Gaussian pdfs N (x; µ, σ 2 ).

Economics 583: Econometric Theory I A Primer on Asymptotics

The Negative Binomial Lindley Distribution as a Tool for Analyzing Crash Data Characterized by a Large Amount of Zeros

Algebra and Trigonometry 2006 (Foerster) Correlated to: Washington Mathematics Standards, Algebra 2 (2008)

MA 575 Linear Models: Cedric E. Ginestet, Boston University Mixed Effects Estimation, Residuals Diagnostics Week 11, Lecture 1

Chapter 1 Statistical Inference

Part 6: Multivariate Normal and Linear Models

The Mixture Approach for Simulating New Families of Bivariate Distributions with Specified Correlations

Practical Statistics for the Analytical Scientist Table of Contents

arxiv: v1 [stat.me] 5 Apr 2013

Discrete Mathematics and Probability Theory Fall 2015 Lecture 21

Discrete Mathematics and Probability Theory Fall 2013 Vazirani Note 16. A Brief Introduction to Continuous Probability

MA/ST 810 Mathematical-Statistical Modeling and Analysis of Complex Systems

Statistics 3858 : Maximum Likelihood Estimators

Parameter estimation! and! forecasting! Cristiano Porciani! AIfA, Uni-Bonn!

Sample size determination for logistic regression: A simulation study

APPENDICES APPENDIX A. STATISTICAL TABLES AND CHARTS 651 APPENDIX B. BIBLIOGRAPHY 677 APPENDIX C. ANSWERS TO SELECTED EXERCISES 679

2. Variance and Higher Moments

Chapter 5 continued. Chapter 5 sections

Two hours. Statistical Tables to be provided THE UNIVERSITY OF MANCHESTER. 14 January :45 11:45

Progression Factors in the HCM 2000 Queue and Delay Models for Traffic Signals

Mixtures and Random Sums

THE PRINCIPLES AND PRACTICE OF STATISTICS IN BIOLOGICAL RESEARCH. Robert R. SOKAL and F. James ROHLF. State University of New York at Stony Brook

2010 HSC NOTES FROM THE MARKING CENTRE MATHEMATICS EXTENSION 1

3.3 Estimator quality, confidence sets and bootstrapping

Introduction to General and Generalized Linear Models

I used college textbooks because they were the only resource available to evaluate measurement uncertainty calculations.

The Not-Formula Book for C2 Everything you need to know for Core 2 that won t be in the formula book Examination Board: AQA

Modeling Recurrent Events in Panel Data Using Mixed Poisson Models

CS 5014: Research Methods in Computer Science. Bernoulli Distribution. Binomial Distribution. Poisson Distribution. Clifford A. Shaffer.

Probability and Stochastic Processes

This appendix provides a very basic introduction to linear algebra concepts.

Solution: chapter 2, problem 5, part a:

Introduction to the Mathematical and Statistical Foundations of Econometrics Herman J. Bierens Pennsylvania State University

CHOOSING AMONG GENERALIZED LINEAR MODELS APPLIED TO MEDICAL DATA

Approximation of Posterior Means and Variances of the Digitised Normal Distribution using Continuous Normal Approximation

STATISTICS; An Introductory Analysis. 2nd hidition TARO YAMANE NEW YORK UNIVERSITY A HARPER INTERNATIONAL EDITION

Lawrence D. Brown* and Daniel McCarthy*

Unsupervised Learning with Permuted Data

Chapter 5. Chapter 5 sections

Probability Distributions

Algebra 2 Early 1 st Quarter

Lecture 4: Two-point Sampling, Coupon Collector s problem

TRB Paper Examining Methods for Estimating Crash Counts According to Their Collision Type

Statistical Data Analysis Stat 3: p-values, parameter estimation

Contents. Preface to Second Edition Preface to First Edition Abbreviations PART I PRINCIPLES OF STATISTICAL THINKING AND ANALYSIS 1

Transcription:

Confidence and prediction intervals for generalised linear accident models G.R. Wood September 8, 2004 Department of Statistics, Macquarie University, NSW 2109, Australia E-mail address: gwood@efs.mq.edu.au Fax: 61 2 9850 7669 Abstract Generalised linear models, with log link and either Poisson or negative binomial errors, are commonly used for relating accident rates to explanatory variables. This paper adds to the toolkit for such models. It describes how confidence intervals (for example, for the true accident rate at given flows) and prediction intervals (for example, for the number of accidents at a new site with given flows) can be produced using spreadsheet technology. Running head: Confidence intervals for accident models Key words and phrases: Generalised linear model; negative binomial; Poisson

1 Introduction Generalised linear models have gathered recognition in recent years (Maycock and Hall, 1984; Hauer et al., 1988; Maher and Summersgill, 1996) as useful tools for relating the number of accidents, of a specifed type, to explanatory variables such as vehicle flows. For the single flow model, the true mean number of accidents µ is modelled as β 0 x β 1, where x denotes the flow. The distribution of the observed number of accidents, for a given flow, is assumed to be either Poisson, or more generally, negative binomially distributed about this mean value. The negative binomial distribution occurs naturally when we allow for variation of safety M between sites, with a given flow, to be modelled by a gamma distribution, and then variation of the number of accidents Y within a site, with safety M, to be modelled by a Poisson distribution with mean M. A detailed description of these models has been given in the companion paper (Wood, 2002), where methods for assessing goodness of fit were described. Once goodness of fit is established for a model it is of interest to provide confidence intervals (for model parameters) and prediction intervals (for dependent variables); this is routinely carried out when working with linear models. Such intervals provide information about the extent of variation in these quantities. In this context the intervals of interest, for a given flow, are: i) A confidence interval for µ, the true accident rate ii) a) For a Poisson model, a prediction interval for y, the accident rate at a new site 1

b) For a negative binomial model, a prediction interval for m, the safety of a new site, and a prediction interval for y, the accident rate at a new site. The purpose of this paper is to provide formulae, in Section 2, which enable construction of these intervals, and to illustrate their use with real accident data in Section 3. The required calculations can be carried out on a spreadsheet. Exposition is generally in terms of models with a single flow; models with more than one explanatory variable are handled in an extended, but similar, fashion. Notation and terminology used in this paper are as in Wood (2002). Standard texts, for example McCullough and Nelder (1989), discuss confidence intervals for generalised linear model parameters; the author, however, has not found the approach discussed here in the literature, other than in Maher and Summersgill (1996). Here we clarify, amplify and extend that work. Specifically, approximate confidence and prediction intervals appropriate for a given flow are developed. We caution that the confidence level necessarily decreases if we wish to make statements about many flow values. For this, so-called simultaneous (and necessarily wider) confidence bands are needed. The development of simultaneous confidence bands is a topic of current research; the work of Sun et al. (2000) produces such confidence bands for the mean in a generalised linear model. This paper can be read in two ways. A reader interested in the practical construction of confidence and prediction intervals should skim Section 2, then work carefully through the examples of Section 3, referring to Section 2 and the appendix for formulae as needed 2

(Table 4 provides an overall summary). For the reader interested in the underlying theory, careful reading of Section 2 and the appendix is recommended. 2 Confidence and prediction intervals A confidence interval for the true mean, for both the Poisson and negative binomial models, is developed in Section 2.1. In Section 2.2 a prediction interval for a predicted number of accidents at a new site is derived for the Poisson model, while in Section 2.3 prediction intervals for safety and predicted number of accidents at a new site are produced for negative binomial models. 2.1 Confidence interval for µ The generalised linear model we have described uses a log link function; the logarithm of µ is linear in the model parameters β 0 and β 1, since η = log µ = log β 0 + β 1 log x = β 0 + β 1 log x, for the single flow model. Standard generalised linear model theory gives that asymptotically the estimates b 0 and b 1, of β 0 and β 1 respectively, have a bivariate normal distribution (Dobson, 1990), in particular b 0 N β 0, I 1, b 1 β 1 so they are unbiased, with covariance matrix the inverse of the information matrix I. It follows that ˆη = b 0 + b 1 log x has asymptotically a normal distribution and since ˆη = log ˆµ, where ˆµ = e b 0 x b 1, ˆµ has an approximately lognormal distribution. 3

This enables us to write down an approximate 95% confidence interval for η, when the flow is x, as b 0 + b 1 log x ± 1.96 Var(b 0 + b 1 log x), whence a 95% confidence interval for µ = e η is given by [ e b 0 +b 1 log x 1.96 Var(b 0 +b 1 log x), e b 0 +b 1 log x+1.96 Var(b 0 +b 1 log x)] The lower boundary is closer to the estimate ˆµ of µ than is the higher boundary, reflecting the right skewed lognormal distribution of the estimate ˆµ. Here Var(b 0 + b 1 log x) = Var(b 0) + 2 log xcov(b 0, b 1 ) + (log x) 2 Var(b 1 ) = I11 1 + 2 log xi12 1 + (log x) 2 I22 1 Illustrative real examples are given in Section 3. Note that ˆη = (1, log x)(b 0, b 1 ) T (where T denotes transpose) so in practice Var(ˆη) is most easily calculated as Var(ˆη) = (1, log x)i 1 (1, log x) T There are two ways to find the components of I 1. If the model is fitted using a statistical package then options are generally available which output the covariance matrix I 1 of the paramaters. On the other hand, if using the first principles method described in (Wood, 2002, A.3) then the required covariance matrix is (X T W X) 1, where X is the design matrix and W a diagonal matrix. A final remark in this subsection: the log-normal distribution of ˆµ discussed can be approximated by a normal distribution, or ˆµ N ( µ 0 = µ, σ0 2 = µ 2 Var(ˆη) ) 4

(as in Maher and Summersgill (1996), Equation (14)). This approximate sampling distribution for ˆµ is fundamental in the sequel. 2.2 Poisson model We consider the case of the Poisson model and an interval for a predicted number of accidents, y. Under the model, given a true mean accident rate of µ, the conditional distribution of accidents Y is Poisson with mean µ. A confidence interval for the number of accidents Y, however, must now accommodate the approximately normal variation in ˆµ, our estimator of µ, as N(µ 0, σ0). 2 Table 1 summarises the variables involved. Variable description Variable notation Distribution Accident rate, given true rate µ Y µ Poisson(µ) Estimator of true mean accident rate ˆµ N(µ 0, σ0) 2 Table 1: The two levels of variation, first in µ, then in Y given µ, to be considered when forming a prediction interval for y in the Poisson model. The marginal distribution of Y is thus a mixture of Poisson distributions, on the mean, by a normal distribution. It can be shown that the distribution of Y, supported by {0, 1, 2,...} has mean µ 0 and variance σ0 2 + µ 0. (The key to this calculation is the observation that a central moment of a mixture is the mixture of the central moments of the distributions being mixed.) Our intuition does tell us that this variance should depend on that of ˆµ, namely σ0, 2 and also should increase as µ 0 increases, since Poisson distributions with larger mean have greater variance. 5

Chebyshev s inequality (Feller, 1966), namely P ( Y µ Y tσ Y ) 1 t 2 for t > 0, is a most useful result and can be used to produce a prediction interval. When the interval is one-sided (so for low mean values) Chebyshev s one-sided inequality (Feller, 1966, Section V.7, Example (a)), namely P (Y µ Y tσ Y ) 1 1 + t 2 for t > 0, provides a tighter interval. A formula, exploiting the discreteness of the distribution of Y, however, has been developed and offers a slight strengthening of this approach. It is described in the appendix and its use illustrated in Section 3. Further tightening of this confidence interval is doubtless still possible, for example, by making use of third and higher moments. 2.3 Negative binomial model We now develop intervals for safety and predicted number of accidents, for the negative binomial model; there are three mixtures involved. We first study construction of a prediction interval for site safety M, which rests on a mixing process analogous to the Poisson case. We then see that the negative binomial conditional distribution of Y, given µ and k, is a mixture, as is the marginal distribution of Y. 6

Prediction interval for safety m Here we find a prediction interval for m, the underlying site safety described in Hauer et al. (1988), as the flow x varies. We answer the question If we selected another site with flow x, where would m lie? Table 2 summarises the variables involved. Variable description Variable notation Distribution Safety, given true rate µ M (µ, k) Gamma(k, µ/k) Estimator of true mean accident rate ˆµ N(µ 0, σ0) 2 Table 2: The two levels of variation, first in µ, then in M given µ and k, needed when producing a prediction interval for safety m, for the negative binomial model. Here we mix gamma distributions with a normal; it can be shown that the marginal distribution of M has mean µ 0 and variance σ0 2 +(σ0 2 +µ 2 0)/k. If we assume approximate normality (this will improve as µ increases) and recalling that k is estimated as ˆk during the fitting process, an approximate 95% prediction interval for site safety m is ˆµ ± 1.96 ˆµ2 Var(ˆη) + ˆµ2 Var(ˆη) + ˆµ 2 ˆk Estimates ˆµ and Var(ˆη) can be readily evaluated, as described earlier. The lower limit of this interval may be negative, due to the use of a normal approximation to the lognormally distributed ˆµ; in this case the lower limit should be set to zero. Simulation tests, using typical parameter values, have shown this to provide a satisfactory prediction interval approximation. 7

Prediction interval for number of accidents y Here we find a prediction interval for the number of accidents y at a site, randomly chosen from those with flow x. The relevant variables are summarised in Table 3. Variable description Variable notation Distribution Accident rate, given safety m Y m Poisson(m) Safety, given true rate µ M µ, k Gamma(k, k/µ) Estimator of true mean accident rate ˆµ N(µ 0, σ0) 2 Table 3: The three levels of variation involved in forming a prediction interval for y in the negative binomial model; first in µ, then in M given µ and k, and finally in Y given m. The model builds the distribution of accidents across all sites with flow x, first as a mixture of the Poisson within site variation Y m by the gamma across site variation M µ, k, well known to be negative binomial with parameters k and p = k/(µ + k), as described in (Wood, 2002, p.425). Second, we must now recognise that µ itself is unknown, so the accident rate is really a mixture of negative binomial Y k, p variables by a normal distribution on µ. The marginal distribution of Y, the number of accidents at a site with flow x, having support in {0, 1, 2,...}, can be shown to have mean µ 0 and variance σ0 2 +(σ0 2 +µ 2 0)/k+µ 0. All quantities can be estimated during model fitting, as earlier. Note that as k increases the variance shrinks to that found in the Poisson case. Chebyshev s one-sided inequality, or the slightly stronger method of the appendix, can be used to produce a prediction 8

interval for y. 3 Examples Illustrations of each of the intervals described are now given, using three New Zealand accident datasets. 3.1 Poisson model with one independent variable A Poisson model was used to relate loss-of-control accidents to incoming flow at 289 arms of both priority and uncontrolled T-intersections throughout New Zealand, yielding b 0 = 4.5260 and b 1 = 0.2883. The grouped G 2 method described in (Wood, 2002, Section 4.3) was used to test the goodness of fit of the model; there was no evidence of poor fit. In order to find a confidence interval for µ, the covariance matrix (X T W X) 1 was obtained as a by-product of the fitting process, as I 1 = 2.6724 0.5140 0.5140 0.1018 For a flow of x = 600, for example, this gives Var(b 0 + b 1 log x) = 0.2621, whence an approximate 95% confidence interval for µ is [0.0251, 0.1867]. The full confidence band, as x varies, is shown (dashed line) around the fitted curve (solid line) in Figure 1. For each flow, µ 0 = ˆµ and σ0 2 = ˆµ 2 Var(ˆη) were then calculated, as described in Section 2.2. The formula given in the appendix was then applied, yielding the 95% band for a predicted y value, shown as the stepped line in Figure 1. (The 90% band 9

was everywhere {0}, so the horizontal axis in the figure.) Figure 1 here Figure 1: A Poisson model (solid curve) relates accident rate to flow for the loss-ofcontrol data. A 95% confidence band for the true accident rate µ is shown with the dashed lines, while a 95% prediction band for the number of accidents y at at new site is shown with the stepped line. 3.2 Negative binomial model with one independent variable Rear-end accidents at 392 arms of signalised crossroads throughout New Zealand were related to flow using the negative binomial model, producing b 0 = 16.3141, b 1 = 1.6330 and ˆk = 0.60. Goodness of fit of the model was tested using the method presented in (Wood, 2002, Section 4.4), revealing no evidence of poor fit. Again, the covariance matrix for b 0 and b 1, (X T W X) 1 was obtained as I 1 = 8.4048 0.9347 0.9347 0.1042 allowing construction of a 95% confidence band for µ, as described for the Poisson model. For example, for x = 10000, we find ˆµ = 0.2798 and Var(ˆη) = Var(b 0+b 1 log x) = 0.0263, whence an approximate 95% confidence interval is [0.2036, 0.3845]. The full confidence band is shown (long dashes) around the fitted curve (solid line) in Figure 2. The variance σ0 2 +(σ0 2 +µ 2 0)/k of M, for a given flow, was then estimated as described 10

in Section 2.3. Continuing the example, for x = 10000, and recalling that ˆk = 0.60, this leads to a 95% prediction interval for m of [0, 1.003]. The full prediction band is shown (short dashes) in Figure 2. Finally, the variance of a predicted Y is calculated using the formula given in Section 2.3, for each flow level, and the formula described in the appendix used to calculate the upper limit of the prediction interval. For example, for x = 10000 and using ˆµ, Var(ˆη) and ˆk as before we find that {0, 1, 2} provides a 90% interval. The full prediction band is shown in Figure 2 (stepped horizontal lines). Figure 2 here Figure 2: A negative binomial model (solid curve) relates accident rate to flow for the rear-end data. A 95% confidence band for the true accident rate µ is shown (long-dashes) and a 95% prediction band for the safety m (short dashes), while a 90% prediction band for the number of accidents at a new site is shown (stepped horizontal lines). 3.3 Negative binomial model with two independent variables Right turn against vehicle accidents at four-arm, two-way signalised intersections were related to the two traffic flows (annual average daily through flow x 1 and annual average daily right-turning flow x 2 ) in a recent New Zealand study, using a negative binomial model with µ = β 0 x β 1 1 x β 2 2. Maximum likelihood estimation of the parameters gave b 0(= log b 0 ) = 11.2598, b 1 = 0.8175, b 2 = 0.4883, ˆk = 1.8 and a covariance matrix for 11

the three parameters of I 1 = 1.6784 0.1353 0.0688 0.1353 0.0158 0.0005 0.0688 0.0005 0.0104 Using centrally placed flows of x 1 = 10000 and x 2 = 5000, a 95% confidence interval for the associated true accident rate µ will be [ ] eˆη 1.96 Var(ˆη), eˆη+1.96 Var(ˆη) where ˆη = log ˆµ = b 0 + b 1 log x 1 + b 2 log x 2 and Var(ˆη) = Var(b 0) + (log x 1 ) 2 Var(b 1 ) + (log x 2 ) 2 Var(b 2 ) + 2 log x 1 Cov(b 0, b 1 ) + 2 log x 2 Cov(b 0, b 2 ) + 2 log x 1 log x 2 Cov(b 1, b 2 ) As remarked earlier, these calculations are most easily carried out by noting that ˆη = (1, log x 1, log x 2 )(b 0, b 1, b 2 ) T whence Var(ˆη) = (1, log x 1, log x 2 )I 1 (1, log x 1, log x 2 ) T Substituting values gives ˆη = 0.4286 and Var(ˆη) = 0.0304 whence a 95% confidence interval for µ is [1.0907, 2.1605]. A 95% prediction interval for safety m and the given flows is still given by ˆµ ± 1.96 ˆµ2 Var(ˆη) + ˆµ2 Var(ˆη) + ˆµ 2 ˆk 12

Since ˆµ = eˆη = 1.5351, substituting values and raising the lower boundary to zero produces the wider interval of [0, 3.8712]. Finally, the number of accidents Y at a site with the given flows will have mean estimated as ˆµ = 1.5351 and variance as ˆµ 2 Var(ˆη) + ˆµ2 Var(ˆη) + ˆµ 2 ˆk + ˆµ = 2.9557 Since ˆµ > 1 we must use the one-sided Chebyshev inequality to find the upper limit of a 95% prediction interval for y. This yields P (Y µ Y + 19σ Y ) 1 20 whence substitution of the estimated mean and standard deviation for Y gives an upper limit of 1.5351 + 19 2.9557 = 9.0290. Thus we can quote [0, 9] as a 95% prediction interval for the number of right turn against accidents at a signalised intersection with the given flows. 4 Discussion and summary Comment on the approximations used to produce the intervals is now presented. First, model parameter estimates (b 0, b 1, etc.) are only approximately normally distributed, the approximation improving with sample size; this will influence the accuracy of a confidence interval produced for ˆµ. Second, the lognormal distribution of ˆµ is approximated by a normal distribution, so tending to widen a prediction interval for m and y. Finally, use of the Chebyshev inequality is conservative, so again tending to widen a prediction 13

interval. The method given in the appendix was developed in an attempt to improve on the Chebyshev inequality. Once the model is fitted, for any finite number of independent variables (whether measurement or categorical), all intervals discussed can be calculated using a spreadsheet. In general we have ˆη = ab T, Var(ˆη) = ai 1 a T and ˆµ = eˆη, where a is the row vector of coefficients of the parameter estimates in row vector b. For example, for the model µ = β 0 x β 1 e zβ 2, with one measurement variable x and one categorical variable z, a = (1, log x, z) while b = (b 0, b 1, b 2 ). The form of all 95% confidence and prediction intervals discussed in this paper is summarised in Table 4, with intervals for y produced using the one-sided Chebyshev inequality. For a 90% prediction interval for y, the factor in the formulae is 3, rather than 19, being then the solution to 1/(1 + t 2 ) = 1/10; for a 99% prediction interval for y, the factor would be 99. When ˆµ 1 it may be possible to tighten the interval for y using the formulae in the appendix; for high ˆµ values a prediction interval for y excluding zero may possibly be found using the two-sided Chebyshev inequality. An assumption underlying the models used is that the explanatory variables are measured without error. As was pointed out in Section 8 of Maher and Summersgill (1996), for geometric and control variables this presents no problem. Flow variables, however, should be annual average daily traffic (AADT) over the study period. In practice, they are often scaled figures, based on a count in a single day, so estimates of AADT. Maher and Summersgill (1996) ran simulation experiments and concluded that 14

Poisson model µ y [ˆµ/e 1.96 Var(ˆη), ˆµe 1.96 Var(ˆη) ] [ 0, ˆµ + 19 ˆµ 2 Var(ˆη) + ˆµ ] Negative binomial model µ m y [ˆµ/e 1.96 Var(ˆη), ˆµe 1.96 Var(ˆη) ] max 0, ˆµ 1.96 ˆµ 2 Var(ˆη) + ˆµ2 Var(ˆη) + ˆµ 2, ˆµ + 1.96 ˆµ ˆk 2 Var(ˆη) + ˆµ2 Var(ˆη) + ˆµ 2 ˆk 0, ˆµ + 19 ˆµ 2 Var(ˆη) + ˆµ2 Var(ˆη) + ˆµ 2 + ˆµ ˆk Table 4: A summary of the 95% confidence and prediction intervals discussed in the paper ( x denotes the largest integer less than or equal to x). for their data, 12 or 16 hour single day counts provided adequate accuracy. The intervals discussed in this paper rely not only on ˆη (and hence the flows and parameter estimates) but also on Var(ˆη) and ˆk. The effect of error in the AADTs on estimation of the covariance matrix (and hence Var(ˆη)) and the parameter k remains to be investigated. As a general guide, however, counts taken over longer periods will provide for greater confidence and prediction interval accuracy. Collinearity of columns of the design matrix leads to uncertainty in parameter estimates. This uncertainty (seen as larger entries in the covariance matrix I 1 ) is incorporated into the intervals constructed. The upshot is that use of independent explanatory variables will tend to yield tighter confidence and prediction intervals. For example, 15

the through and right-turning flows used in the two variable example in Section 3.3 exhibited extremely low correlation, leading to the low second and third figures on the diagonal of the covariance matrix. To summarise, approximate confidence intervals for the true mean have been presented (Section 2.1) for Poisson and negative binomial generalised linear accident models. An approximate prediction interval for a predicted accident rate has been developed (Section 2.2) for the Poisson model. The form of this interval for the negative binomial model has also been presented (Section 2.3), together with a prediction interval for a predicted safety. Examples have been used (Section 3) to illustrate the ideas. Acknowledgements Dr. Shane Turner is thanked for alerting the author to the importance of this problem and for kindly supplying (together with Aaron Roozenburg) the three data sets. The research was partly supported by the Land Transport Safety Authority of New Zealand. Pertinent comments from a referee led to improvements in the paper. Appendix A strengthening of the one-sided Chebyshev inequality, exploiting the integer support of Y and appropriate for the typically low mean values (µ < 1) encountered in accident modelling, is presented. It provides a formula for the upper limit of a one-sided 16

prediction interval for y. In the remainder, Y is a random variable taking values in {0, 1, 2,...}, with mean µ and variance σ 2. Table 5, in summary form, shows the prediction 100(1 α)% interval as µ varies. µ interval 100(1 α)% prediction interval 0 µ α {0} α < µ 0.5 {0, 1,..., µ + 0.5 < µ < 1 {0, 1,..., µ + µ 2 (µ 2 σ 2 )/α } 1 + µ 2 + (µ 2 + σ 2 µ(1 + 2α))/α } Table 5: Upper limits for the prediction interval for y, as µ varies ( x denotes the largest integer less than or equal to x). The reasoning behind these formulae follows now. We let p i = Pr(Y = i) for i = 0, 1,.... Then µ = ip i = ip i p i = 1 p 0 i=0 i=1 i=1 So if 0 µ α, p 0 1 µ 1 α whence {0} serves as a 100(1 α)% prediction interval. Now suppose that α < µ 0.5. It is always the case that x 1 σ 2 = p 0 (0 µ) 2 + p i (i µ) 2 + p i (i µ) 2 i=1 i=x where x 1 is the largest positive integer such that Pr(Y x) α. At least probability 1 µ sits at zero and at least probability α sits to the right of and including x. By placing the maximum possible balance of the probability, µ α, at the domain point 17

closest to µ, namely zero, we can conservatively find x as the largest solution of σ 2 (1 µ)µ 2 + (µ α)µ 2 + α(x µ) 2 or, with manipulation, the largest solution of x 2 2µx + 1 α (µ2 σ 2 ) 0 The larger root of this quadratic is µ+ µ 2 (µ 2 σ 2 )/α, from which the result follows. When 0.5 < µ < 1 we follow the same path, but must place the balance of probability at the closest domain point to µ, now 1, so must find the largest integer x for which or µ + σ 2 (1 µ)µ 2 + (µ α)(1 µ) 2 + α(x µ) 2 1 + µ 2 + (µ 2 + σ 2 µ(1 + 2α))/α, so demonstrating the result. References [1] Dobson, A., 1990. An Introduction to Generalized Linear Models. Chapman and Hall, London. [2] Feller, W., 1966. An Introduction to Probability Theory and its Applications, Vol. 2. Wiley, New York. [3] Hauer, E., Ng, J.C.N., Lovell, J., 1988. Estimation of safety at signalized intersections. Transportation Research Record 1185, 48-61. [4] Maher, M.J., Summersgill, I., 1996. A comprehensive methodology for the fitting of predictive accident models. Accident Analysis and Prevention 28, 281-296. 18

[5] Maycock, G., Hall, R.D., 1984. Accidents at 4-arm roundabouts. Laboratory Report LR1120. Crowthorne, Berks, U.K. Transport Research Laboratory. [6] McCullagh, P., Nelder, J.A., 1989. Generalised Linear Models (2nd edition). Chapman and Hall, London, New York. [7] Sun, J., Loader, C., McCormick, W.P., 2000. Confidence bands in generalized linear models. The Annals of Statistics 28, 429-460. [8] Wood, G.R., 2002. Generalised linear accident models and goodness of fit testing. Accident Analysis and Prevention 34, 417-427. 19