Lecture 8: Poisson models for counts
Jesper Rydén, Department of Mathematics, Uppsala University
jesper.ryden@math.uu.se
Statistical Risk Analysis, Spring 2014
Absolute risks

The failure intensity λ(t) describes the variability of life lengths in a population of components, objects or humans. Sometimes we do not know the individual life lengths, but only the total number of failures/accidents (e.g. failures during a specified period or in a certain region). By absolute risk is meant the probability that a person is involved in a serious accident over a given time period. Often a distinction is made between voluntary risks (e.g. mountaineering) and background risks (e.g. collapse of a structure).
Tolerable risks

Risk of death per person per year | Characteristic response
10^-3 | Immediate action is taken to reduce the hazard.
10^-4 | People spend money, especially public money, to control the hazard (e.g. traffic signs, police, laws).
10^-5 | Parents warn their children of the hazard (e.g. fire, drowning, firearms, poison).
10^-6 | Not of great concern to the average person; aware of the hazard, but not of a personal nature.

Otway et al. (1970). A risk analysis of the Omega West reactor.
Example: Number of perished in traffic

Perished in traffic accidents 1998: U.S. 41 500, Sweden 500. To compare these numbers, one needs to compensate for the sizes of the populations by using frequencies of deaths. U.S.: 1:6 000 (1.7 · 10^-4), Sweden: 1:17 000 (0.6 · 10^-4). About three times lower frequency in Sweden. Explanation? Exposure to the hazard: does the average inhabitant in the U.S. spend more time in a car than a person in Sweden?
Comparative death risks

Comparison of activities/causes with absolute risk of death, measured per hour of exposure. Numbers from the U.K. (1970–1973).

Activity/cause | Risk per hour of exposure
Mountaineering (international) | 2700 · 10^-8
Air travel (international) | 120 · 10^-8
Car travel | 56 · 10^-8
Accidents at home (all) | 2.1 · 10^-8
Accidents at home (able-bodied people) | 0.7 · 10^-8
Fire at home | 0.1 · 10^-8

Assume the same numbers for Sweden and that an average Swede spends 15 minutes in a car per day. With 10^7 Swedes, the estimated average number of deaths in traffic is found as 0.25 · 365 · 10^7 · 56 · 10^-8 ≈ 511.
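The exposure calculation above can be checked directly; a minimal sketch (Python used here purely for the arithmetic):

```python
# Estimated yearly traffic deaths = (hours in car per day) * (days per year)
#   * (population) * (risk per hour of exposure)
hours_per_day = 15 / 60        # 15 minutes in a car per day
days_per_year = 365
population = 1e7               # about 10^7 Swedes
risk_per_hour = 56e-8          # car travel, U.K. figure

deaths = hours_per_day * days_per_year * population * risk_per_hour
print(round(deaths))  # -> 511
```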
Poisson counts

Denote by N_i the number of accidents in year i. We assume that N_i ∈ Po(µ_i), i.e. µ_i = E[N_i]. If the random mechanism generating accidents can be assumed to be stationary, then µ_i = µ for all i. For the situation where the µ_i are not constant, the expected value is modelled as a function of other, explanatory variables.
Example: Number of fires with perished

Sweden: number of fires with perished, and number of perished in the fires.

Year | Fires | Perished in fires
1999 | 100 | 110
2000 | 101 | 107
2001 | 117 | 133
2002 | 127 | 138
2003 | 117 | 134
2004 |  62 |  65
2005 | 101 | 104
2006 |  80 |  83
2007 |  84 |  97
2008 | 108 | 115
Assumption of Poisson distribution

If N ∈ Po(µ), then V[N] = E[N]. We have overdispersion if V[N] > E[N]. Test for the Poisson distribution e.g. by a χ² test. In case of overdispersion, try to fit another distribution, e.g. the negative binomial distribution.
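As a first indication, the sample mean and sample variance can be compared; a sketch using the yearly fire counts from the table above. A variance far above the mean points towards overdispersion, or towards the µ_i not being constant across years:

```python
import statistics

# Yearly numbers of fires with perished, Sweden 1999-2008 (table above)
fires = [100, 101, 117, 127, 117, 62, 101, 80, 84, 108]

m = statistics.mean(fires)       # estimate of E[N]
v = statistics.variance(fires)   # sample variance, estimate of V[N]
print(m, round(v, 1))            # -> 99.7 385.8
```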
Deviance

Observations: n_1, ..., n_k. ML estimates:
Simpler model (all µ_i = µ): µ* = (1/k) Σ n_i.
More complex model: µ*_i = n_i.
Likelihood theory gives the test quantity, the deviance:
D := 2 (l(µ*_1, ..., µ*_k) − l(µ*)).
For large k, D is χ²(k − 1) distributed if the simpler model is true. Test: if D > χ²_α(k − 1), the difference between the log-likelihoods cannot be explained by the statistical variability, and hence the simpler model should be rejected.
Deviance for counts

A formula for computation by hand is given as follows:
D = 2 Σ_{i=1}^{k} n_i (ln(n_i) − ln(n̄)),
where for n_i = 0 we let n_i ln(n_i) = 0.
(Example: Fires, blackboard)
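The hand formula can be applied to the fire counts from the earlier slide; a minimal Python sketch (the critical value χ²_0.05(9) = 16.92 is taken from a χ² table):

```python
from math import log

def poisson_deviance(counts):
    """D = 2 * sum n_i (ln n_i - ln nbar), with the convention 0 * ln 0 = 0."""
    nbar = sum(counts) / len(counts)
    return 2 * sum(n * (log(n) - log(nbar)) for n in counts if n > 0)

# Fires with perished, Sweden 1999-2008 (table above)
fires = [100, 101, 117, 127, 117, 62, 101, 80, 84, 108]
D = poisson_deviance(fires)
print(round(D, 1))  # -> 36.6, compared with chi2_0.05(9) = 16.92
```

Since D clearly exceeds the critical value, the hypothesis of a constant yearly mean is rejected at the 5% level for these data.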
Example 7.13: Daily rains

(Continuation of earlier example; rain in Venezuela.) Event of interest: A := daily rain exceeds 50 mm. Monthly observed values during 39 years:

 J  F  M  A  M  J  J  A  S  O  N  D
 4  0  3  4  3  2  3  3  3  2  7 10

Test, by using deviance, whether the means are equal, i.e. µ_i = µ, i = 1, ..., 12. (Blackboard)
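A sketch of the deviance computation for these counts (the February count is 0, handled by the convention 0 · ln 0 = 0; the critical value χ²_0.05(11) ≈ 19.68 is taken from a χ² table):

```python
from math import log

# Monthly counts of days with rain > 50 mm (slide above)
counts = [4, 0, 3, 4, 3, 2, 3, 3, 3, 2, 7, 10]

nbar = sum(counts) / len(counts)
# zero counts are skipped, implementing the convention 0 * ln 0 = 0
D = 2 * sum(n * (log(n) - log(nbar)) for n in counts if n > 0)
print(round(D, 2))  # -> 19.64, compared with chi2_0.05(11) = 19.68
```

Here the deviance falls just below the 5% critical value, so the hypothesis of equal monthly means is (only barely) not rejected at that level.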
Generalized linear model (GLM)

A GLM has the basic structure
g(µ_i) = X_i β,
where µ_i = E[Y_i], g is a smooth monotonic link function, X_i is the ith row of a model matrix X, and β is a vector of unknown parameters. Usually the Y_i are assumed to be independent and to follow some exponential-family distribution. The exponential family of distributions includes many distributions useful for practical modelling, such as the Poisson, binomial, gamma and normal distributions.
Remark: The GLM was introduced by Nelder and Wedderburn (1972).
Generalized linear model (GLM)

The part Xβ (sometimes called the linear predictor) resembles a linear-regression model. A link function and a distribution must be chosen. (With the identity function as link and the normal distribution, ordinary linear regression is recovered as a special case.) The generalization comes at some cost: model fitting must be done iteratively, e.g. using IRLS (iteratively reweighted least squares), and the distributional results used for inference are approximate, justified by large-sample limiting results.
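To make the IRLS step concrete, here is a self-contained sketch for Poisson regression with log link and a single covariate; the data are made up for illustration and are not the lecture's example. Each iteration is one weighted least-squares fit of a working response:

```python
from math import exp, log

# Made-up data: ln(mu_i) = b0 + b1 * x_i is assumed
x = [0.0, 1.0, 2.0, 3.0, 4.0]
n = [1.0, 2.0, 4.0, 8.0, 15.0]

b0, b1 = log(sum(n) / len(n)), 0.0   # start from the intercept-only fit
for _ in range(25):
    mu = [exp(b0 + b1 * xi) for xi in x]
    # working response z_i = eta_i + (n_i - mu_i)/mu_i, weights w_i = mu_i
    z = [(b0 + b1 * xi) + (ni - mi) / mi for xi, ni, mi in zip(x, n, mu)]
    w = mu
    # weighted least squares: solve the 2x2 normal equations by hand
    sw = sum(w)
    swx = sum(wi * xi for wi, xi in zip(w, x))
    swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    swz = sum(wi * zi for wi, zi in zip(w, z))
    swxz = sum(wi * xi * zi for wi, xi, zi in zip(w, x, z))
    det = sw * swxx - swx * swx
    b0 = (swxx * swz - swx * swxz) / det
    b1 = (sw * swxz - swx * swz) / det

mu = [exp(b0 + b1 * xi) for xi in x]
# with an intercept in the model, the fitted means reproduce the total count
print(round(sum(mu), 6), sum(n))  # -> 30.0 30.0
```

In practice one uses a fitted routine such as R's glm(); the point of the sketch is only that each IRLS step is an ordinary weighted least-squares problem.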
Exponential family

The response variable in a GLM can have any distribution from the exponential family, where by definition the pdf or pmf can be written as
f_θ(y) = exp( (yθ − b(θ)) / a(φ) + c(y, φ) ),
where a, b, c are arbitrary functions, φ an arbitrary scale parameter, and θ is known as the canonical parameter. (In the GLM context, this depends completely on the model parameters β.)
With Y ∈ Po(µ), we have
f(y) = (µ^y / y!) e^{−µ}, y = 0, 1, ...
and
θ = ln(µ), φ = 1, a(φ) = φ (= 1), b(θ) = e^θ (= µ), c(y, φ) = −ln(y!).
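The identification in the table can be verified numerically; a small Python check (µ = 3 is chosen arbitrarily):

```python
from math import exp, factorial, log

# Check that the Poisson pmf equals the exponential-family form
# exp(y*theta - b(theta) + c(y, phi)) with theta = ln(mu), b(theta) = e^theta,
# a(phi) = 1 and c(y, phi) = -ln(y!)
mu = 3.0
theta = log(mu)
for y in range(8):
    pmf = mu**y / factorial(y) * exp(-mu)
    ef = exp(y * theta - exp(theta) - log(factorial(y)))
    assert abs(pmf - ef) < 1e-12
print("forms agree")
```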
Poisson regression in GLM

The canonical link is g(µ) = ln(µ), and hence we have
µ_i = g^{-1}(β_0 + β_1 x_{i1} + ... + β_p x_{ip}),
that is,
µ_i = exp(β_0 + β_1 x_{i1} + ... + β_p x_{ip}).
In risk analysis, sometimes an extra quantity t_i is introduced, measuring the exposure to the risk (e.g. t_i = 1 if every observation relates to, say, one year). Then:
µ_i = t_i exp(β_0 + β_1 x_{i1} + ... + β_p x_{ip}).
(Example 7.15, blackboard.)
More on exposure and offsets

Often the expected counts will depend on an observation time or an observation area. For instance, when observing over twice as long a period, one expects the counts to double. The mean is then µ_i = t_i r_i, where t_i is the observation time and r_i is the rate (expected count per observation unit). The log-linear model for this situation:
ln µ_i = ln(t_i r_i) = ln t_i + ln r_i = ln t_i + β_0 + β_1 x_{i1} + ... + β_p x_{ip}.
The quantity ln t_i is often referred to as the offset.
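In the special case with an offset but no covariates, setting the score for β_0 to zero gives the ML estimate of the common rate as the total count divided by the total exposure; a small sketch with made-up data:

```python
from math import log

# With only an intercept and an offset, the Poisson MLE of the common rate is
# r = sum(n_i) / sum(t_i), and beta0 = ln(r). The data below are made up.
t = [120.0, 300.0, 80.0, 500.0]   # observation times (e.g. months)
n = [3, 10, 2, 18]                # observed counts

r_hat = sum(n) / sum(t)           # events per month
beta0_hat = log(r_hat)
print(round(r_hat, 4))  # -> 0.033
```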
Example: Wave damage to cargo ships

Data collected by Lloyd's Register of Shipping, investigating the damage caused by waves to the forward section of certain cargo-carrying vessels. Three factors are believed to affect the number of damage incidents:
Ship type: A–E.
Year of construction: 1960–64, 1965–69, 1970–74, 1975–79.
Period of operation: 1960–74, 1975–79.
The observation times varied greatly (45 to 44 882 months) and thus must be taken into account in the analysis.
Data in R: library(MASS); data(ships)
Example: Wave damage to cargo ships

Example of data:

Ship type | Year of construction | Period of operation | Aggregate service time | Incidents | Damage rate (per 1000 months)
B | 1960-64 | 1960-74 | 44882 | 39 | 0.869
B | 1960-64 | 1975-79 | 17176 | 29 | 1.688
B | 1965-69 | 1960-74 | 28609 | 58 | 2.027
B | 1965-69 | 1975-79 | 20370 | 53 | 2.602
Example: Wave damage to cargo ships

R code:

library(MASS); data(ships)
shipsf = ships
# --- Make as factors ---
shipsf$type = factor(shipsf$type)
shipsf$year = factor(shipsf$year)
shipsf$period = factor(shipsf$period)
# --- Fit a model; log1p(service) = log(1 + service) avoids log(0)
#     for ships with zero recorded service time ---
mod1 = glm(incidents ~ type + year + period + offset(log1p(service)),
           family = poisson,
           control = glm.control(epsilon = 0.0001, maxit = 100),
           data = shipsf)
Poisson regression: rate ratio

The rate ratio is of interest:
RR_j := exp(β_j), j = 1, ..., p.
The rate ratio measures the multiplicative increase of the intensity of events when x_ij is increased by one unit. Estimate of the rate ratio: RR*_j = exp(β*_j), where β*_j is the ML estimate of β_j. Using the asymptotic normality of ML estimators, confidence intervals for RR_j can be found.
(Example 7.16, blackboard.)
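Given an estimated coefficient and its standard error (both hypothetical numbers below, for illustration only), the rate ratio and an approximate 95% confidence interval follow from the asymptotic normality on the log scale:

```python
from math import exp, log

# Hypothetical ML estimate and standard error of one coefficient
beta_hat = log(2)   # made-up estimated coefficient
se = 0.2            # made-up standard error

# Interval is computed on the log scale, then transformed by exp
rr = exp(beta_hat)
lo = exp(beta_hat - 1.96 * se)
hi = exp(beta_hat + 1.96 * se)
print(round(rr, 2), round(lo, 2), round(hi, 2))  # -> 2.0 1.35 2.96
```

An increase of x_ij by one unit is thus estimated to double the intensity, with the interval (1.35, 2.96) quantifying the uncertainty.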
Poisson regression: Model selection

The deviance can be used in model selection. Consider two candidate models: a more general one with p covariates, and a simpler one with q < p covariates. Estimated parameter vectors: β*_p and β*_q. The test quantity
DEV = 2 (l(β*_p) − l(β*_q))
is χ²(p − q) distributed for large samples if the simpler model is true. Test: if DEV > χ²_α(p − q), the simpler model should be rejected (the difference between the log-likelihoods cannot be explained by the statistical variability).
Hand calculations:
DEV = 2 Σ_{i=1}^{k} n_i (ln(µ*_ic) − ln(µ*_is)),
where µ*_ic are the estimated means from the more complex model and µ*_is those from the simpler model.
(Example 7.17, blackboard.)