Lecture 8. Poisson models for counts

Jesper Rydén, Department of Mathematics, Uppsala University, jesper.ryden@math.uu.se
Statistical Risk Analysis, Spring 2014

Absolute risks
The failure intensity λ(t) describes the variability of life lengths in a population of components, objects or humans. Sometimes the individual life lengths are unknown and only the total number of failures or accidents is available (e.g. failures during a specified period or in a certain region). The absolute risk is the probability that a person is involved in a serious accident during a given time period. A distinction is often made between voluntary risks (e.g. mountaineering) and background risks (e.g. collapse of a structure).

Tolerable risks

Risk of death per person per year   Characteristic response
$10^{-3}$                           Immediate action is taken to reduce the hazard.
$10^{-4}$                           People spend money, especially public money, to control the hazard (e.g. traffic signs, police, laws).
$10^{-5}$                           Parents warn their children of the hazard (e.g. fire, drowning, firearms, poison).
$10^{-6}$                           Not of great concern to the average person; aware of the hazard, but not of a personal nature.

Source: Otway et al. (1970). A risk analysis of the Omega West reactor.

Example: Number of perished in traffic
Perished in traffic accidents in 1998: U.S. 41 500, Sweden 500. To compare these numbers, one needs to compensate for the size of the populations by using frequencies of deaths. U.S.: 1:6 000 ($1.7 \times 10^{-4}$); Sweden: 1:17 000 ($0.6 \times 10^{-4}$). The frequency is about three times lower in Sweden. Explanation? Exposure to the hazard: does the average inhabitant of the U.S. spend more time in a car than a person in Sweden?

Comparative death risks
Comparison of activities/causes by absolute risk of death, measured per hour of exposure. Numbers from the U.K. (1970-1973).

Activity/cause                           Risk of death per hour of exposure
Mountaineering (international)           $2700 \times 10^{-8}$
Air travel (international)               $120 \times 10^{-8}$
Car travel                               $56 \times 10^{-8}$
Accidents at home (all)                  $2.1 \times 10^{-8}$
Accidents at home (able-bodied people)   $0.7 \times 10^{-8}$
Fire at home                             $0.1 \times 10^{-8}$

Assume the same numbers for Sweden and that an average Swede spends 15 minutes (0.25 hours) in a car per day. With $10^{7}$ Swedes, the estimated average number of deaths in traffic per year is
$$0.25 \cdot 365 \cdot 10^{7} \cdot 56 \times 10^{-8} = 511.$$
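The traffic estimate is a product of exposure time, population size and risk per hour of exposure. A minimal R sketch of that arithmetic follows; the variable names are illustrative and not taken from the lecture.

```r
# Expected yearly traffic deaths = hours exposed per person per year
#                                  * population * risk of death per hour
hours_per_day <- 0.25      # 15 minutes in a car per day (from the slide)
days_per_year <- 365
population    <- 1e7       # approximate number of Swedes (from the slide)
risk_per_hour <- 56e-8     # car travel, U.K. 1970-1973 (from the slide)

expected_deaths <- hours_per_day * days_per_year * population * risk_per_hour
print(round(expected_deaths))
```

Changing `hours_per_day` shows directly how the estimate scales with exposure, which is the point of the comparison with the U.S. figures.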

Poisson counts
Denote by $N_i$ the number of accidents in year $i$. We assume that $N_i \sim \mathrm{Po}(\mu_i)$, so that $\mu_i = E[N_i]$. If the random mechanism generating accidents can be assumed to be stationary, then $\mu_i = \mu$ for all $i$. When $\mu_i$ is not constant, the expected value is modelled as a function of other, explanatory variables.

Example: Number of fires with perished
Sweden: number of fires with perished, and number of perished in the fires.

Year   Fires   Perished in fires
1999   100     110
2000   101     107
2001   117     133
2002   127     138
2003   117     134
2004    62      65
2005   101     104
2006    80      83
2007    84      97
2008   108     115

Assumption of Poisson distribution
If $N \sim \mathrm{Po}(\mu)$, then $V[N] = E[N]$. We have overdispersion if $V[N] > E[N]$. Test for a Poisson distribution e.g. by a $\chi^2$ test. If there is overdispersion, try to fit another distribution, e.g. the negative binomial distribution.
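A quick informal check of the equal mean and variance property is to compare the sample mean and sample variance. The sketch below does this for the yearly fire counts from the example above. The index-of-dispersion statistic $(k-1)s^2/\bar{n}$, referred to a $\chi^2(k-1)$ distribution, is one common choice of $\chi^2$ test and is an assumption here; the lecture does not say which test it has in mind.

```r
# Yearly number of fires with perished, Sweden 1999-2008 (from the slide)
fires <- c(100, 101, 117, 127, 117, 62, 101, 80, 84, 108)

m <- mean(fires)             # sample mean
v <- var(fires)              # sample variance
dispersion_index <- v / m    # close to 1 under a Poisson model

# Index-of-dispersion test (assumed choice of chi-square test)
k <- length(fires)
stat <- (k - 1) * v / m
p_value <- pchisq(stat, df = k - 1, lower.tail = FALSE)

print(c(mean = m, variance = v, index = dispersion_index, p = p_value))
```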

Deviance
Observations: $n_1, \dots, n_k$. ML estimates:
- Simpler model (all $\mu_i = \mu$): $\hat{\mu} = \frac{1}{k}\sum_{i=1}^{k} n_i = \bar{n}$.
- More complex model: $\hat{\mu}_i = n_i$.

Likelihood theory gives the test quantity, the deviance:
$$D := 2\,\bigl(l(\hat{\mu}_1, \dots, \hat{\mu}_k) - l(\hat{\mu})\bigr).$$
For large $k$, $D$ is approximately $\chi^2(k-1)$ distributed if the simpler model is true. Test: if $D > \chi^2_\alpha(k-1)$, the difference between the log-likelihoods cannot be explained by the statistical variability, and hence the simpler model should be rejected.

Deviance for counts
A formula for computation by hand is
$$D = 2\sum_{i=1}^{k} n_i\,\bigl(\ln(n_i) - \ln(\bar{n})\bigr),$$
where for $n_i = 0$ we let $n_i \ln(n_i) = 0$. (Example: Fires, blackboard)
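The hand formula translates directly into a few lines of R. The sketch below applies it to the fire counts from the earlier example. The helper name `poisson_deviance` and the 5% level are illustrative choices, not part of the lecture.

```r
# Deviance for testing mu_1 = ... = mu_k from observed counts n_1, ..., n_k.
# Terms with n_i = 0 are set to 0, following the convention n_i * ln(n_i) = 0.
poisson_deviance <- function(n) {
  n_bar <- mean(n)
  terms <- ifelse(n > 0, n * (log(n) - log(n_bar)), 0)
  2 * sum(terms)
}

# Yearly number of fires with perished, Sweden 1999-2008 (from the slide)
fires <- c(100, 101, 117, 127, 117, 62, 101, 80, 84, 108)

D <- poisson_deviance(fires)
crit <- qchisq(0.95, df = length(fires) - 1)   # chi-square quantile, alpha = 0.05 (assumed level)
reject <- D > crit

print(c(D = D, critical = crit))
print(reject)
```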

Example 7.13: Daily rains
(Continuation of earlier example; rain in Venezuela.) Event of interest: A := "Daily rain exceeds 50 mm". Monthly observed values during 39 years:

Month   J  F  M  A  M  J  J  A  S  O  N  D
Count   4  0  3  4  3  2  3  3  3  2  7  10

Test, by using the deviance, if the means are equal, i.e. $\mu_i = \mu$, $i = 1, \dots, 12$. (Blackboard)
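As a cross-check of a hand calculation, the same deviance can be read off an intercept-only Poisson fit. With only an intercept, the fitted mean is $\bar{n}$ for every month, so the residual deviance reported by `glm` coincides with the hand formula, and the zero count in February is handled automatically. This is a sketch of one possible check, not the blackboard solution.

```r
# Monthly counts of the event "daily rain exceeds 50 mm" (from the slide)
rain <- c(4, 0, 3, 4, 3, 2, 3, 3, 3, 2, 7, 10)
names(rain) <- month.abb

# Simpler model: one common mean for all months
fit0 <- glm(rain ~ 1, family = poisson)

D  <- deviance(fit0)       # residual deviance against the saturated model
df <- df.residual(fit0)    # k - 1 = 11
p_value <- pchisq(D, df = df, lower.tail = FALSE)

print(c(D = D, df = df, p = p_value))
```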

Generalized linear model (GLM)
A GLM has the basic structure
$$g(\mu_i) = X_i \beta,$$
where $\mu_i = E[Y_i]$, $g$ is a smooth monotonic link function, $X_i$ is the $i$th row of a model matrix $X$, and $\beta$ is a vector of unknown parameters. Usually the $Y_i$ are assumed to be independent and to follow some exponential family distribution. The exponential family includes many distributions useful for practical modelling, such as the Poisson, binomial, gamma and normal distributions. Remark: GLMs were introduced by Nelder and Wedderburn (1972).

Generalized linear model (GLM)
The part $X\beta$ (sometimes called the linear predictor) resembles a linear regression model. A link function and a distribution must be chosen. (With the identity function as link and the normal distribution, ordinary linear regression is a special case.) Generalization comes at some cost:
- Model fitting must be done iteratively, e.g. using IRLS (Iteratively Reweighted Least Squares).
- Distributional results used for inference are approximate and justified by large-sample limiting results.
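To make the iterative fitting concrete, here is a minimal IRLS sketch for a Poisson GLM with log link. For this link the working response is $z = \eta + (y - \mu)/\mu$ and the working weights are $w = \mu$. The function name `irls_poisson`, the starting values and the stopping rule are assumptions made for illustration. The data are the fire counts from the earlier example with a year index as a purely illustrative covariate, a model the lecture itself does not fit.

```r
# Minimal IRLS for Poisson regression with log link (illustrative sketch)
irls_poisson <- function(X, y, tol = 1e-10, maxit = 50) {
  beta <- rep(0, ncol(X))
  mu   <- y + 0.5                  # start near the data, avoids log(0)
  eta  <- log(mu)
  for (iter in seq_len(maxit)) {
    z <- eta + (y - mu) / mu       # working response
    w <- mu                        # working weights
    # Weighted least squares step: solve (X' W X) beta = X' W z
    beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
    eta <- drop(X %*% beta_new)
    mu  <- exp(eta)
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  beta
}

fires <- c(100, 101, 117, 127, 117, 62, 101, 80, 84, 108)
year_index <- 0:9                          # 1999, ..., 2008 coded as 0, ..., 9
X <- cbind(Intercept = 1, year_index = year_index)

beta_irls <- irls_poisson(X, fires)
beta_glm  <- coef(glm(fires ~ year_index, family = poisson))
print(rbind(irls = beta_irls, glm = beta_glm))
```

The two rows should agree to several decimals, since `glm` uses the same kind of iteration internally.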

Exponential family
The response variable in a GLM can have any distribution from the exponential family, where by definition the pdf or pmf can be written as
$$f_\theta(y) = \exp\left(\frac{y\theta - b(\theta)}{a(\phi)} + c(y, \phi)\right),$$
where $a$, $b$, $c$ are arbitrary functions, $\phi$ is an arbitrary scale parameter, and $\theta$ is known as the canonical parameter. (In the GLM context, $\theta$ depends completely on the model parameters $\beta$.) With $Y \sim \mathrm{Po}(\mu)$, we have
$$f(y) = \frac{\mu^{y}}{y!}\, e^{-\mu}, \quad y = 0, 1, \dots$$
and

$\theta$     $\phi$   $a(\phi)$       $b(\theta)$            $c(y, \phi)$
$\ln(\mu)$   $1$      $\phi\ (= 1)$   $e^{\theta}\ (= \mu)$   $-\ln(y!)$
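The identification of $\theta$, $a$, $b$ and $c$ for the Poisson case can be checked numerically by plugging them into the exponential family form and comparing with R's built-in `dpois`. The value $\mu = 3.5$ and the range of $y$ are arbitrary choices made for this sketch.

```r
mu <- 3.5          # arbitrary illustrative mean
y  <- 0:10

theta <- log(mu)           # canonical parameter
a_phi <- 1                 # a(phi) = phi = 1
b_theta <- exp(theta)      # b(theta) = e^theta = mu
c_y <- -lfactorial(y)      # c(y, phi) = -ln(y!)

f_expfam <- exp((y * theta - b_theta) / a_phi + c_y)
max_abs_diff <- max(abs(f_expfam - dpois(y, lambda = mu)))
print(max_abs_diff)        # should be at rounding-error level
```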

Poisson regression in GLM
The canonical link is $g(\mu) = \ln(\mu)$ and hence we have
$$\mu_i = g^{-1}(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}),$$
that is,
$$\mu_i = \exp(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}).$$
In risk analysis, an extra quantity $t_i$ is sometimes introduced, measuring the exposure to the risk (e.g. $t_i = 1$ if every observation relates to, say, one year). Then:
$$\mu_i = t_i \exp(\beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}).$$
(Example 7.15, blackboard.)

More on exposure and offsets
Often the expected counts will depend on an observation time or an observation area. For instance, if the observation period is twice as long, one expects the counts to double. The mean is then $\mu_i = t_i r_i$, where $t_i$ is the observation time and $r_i$ is the rate (expected count per observation unit). The log-linear model for this situation:
$$\ln \mu_i = \ln(t_i r_i) = \ln t_i + \ln r_i = \ln t_i + \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}.$$
The quantity $\ln t_i$ is often referred to as the offset.
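A small sketch may help to see what the offset does. In an intercept-only Poisson model with offset $\ln t_i$, the ML estimate of the common rate $e^{\beta_0}$ is the total count divided by the total exposure. The numbers below are the four type-B rows printed in the cargo-ship example of these notes; pooling them into one common rate is done only for illustration.

```r
# Aggregate service time (months) and damage incidents, ship type B (from the notes)
service   <- c(44882, 17176, 28609, 20370)
incidents <- c(39, 29, 58, 53)

# Intercept-only model: ln(mu_i) = ln(t_i) + beta_0
fit <- glm(incidents ~ 1 + offset(log(service)), family = poisson)

rate_glm    <- unname(exp(coef(fit)))          # estimated incidents per month
rate_direct <- sum(incidents) / sum(service)   # total count / total exposure

print(c(glm = rate_glm, direct = rate_direct) * 1000)   # per 1000 months
```

Without the offset, the intercept would instead estimate the mean count per row, which ignores the large differences in service time.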

Example: Wave damage to cargo ships
Data collected by Lloyd's Register of Shipping, investigating the damage caused by waves to the forward section of certain cargo-carrying vessels. Three factors are believed to affect the number of damage incidents:
- Ship type: A-E
- Year of construction: 1960-64, 1965-69, 1970-74, 1975-79
- Period of operation: 1960-74, 1975-79

The observation times varied greatly (45 to 44 882 months) and thus must be taken into account in the analysis. Data in R: library(MASS); data(ships)

Example: Wave damage to cargo ships
Example of data:

Ship type   Year of construction   Period of operation   Aggregate service time   Incidents   Damage rate (per 1000 months)
B           1960-64                1960-74               44882                    39          0.869
B           1960-64                1975-79               17176                    29          1.688
B           1965-69                1960-74               28609                    58          2.027
B           1965-69                1975-79               20370                    53          2.602
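The damage rate column is simply incidents per 1000 months of service. As a quick sketch, the four rows can be entered by hand and the rates recomputed; the data frame and column names are illustrative.

```r
# The four type-B rows shown in the table, typed in by hand
ships_b <- data.frame(
  year      = c("1960-64", "1960-64", "1965-69", "1965-69"),
  period    = c("1960-74", "1975-79", "1960-74", "1975-79"),
  service   = c(44882, 17176, 28609, 20370),
  incidents = c(39, 29, 58, 53)
)

# Damage rate per 1000 months of aggregate service
ships_b$rate_per_1000 <- 1000 * ships_b$incidents / ships_b$service
print(ships_b)
```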

Example: Wave damage to cargo ships
R code:

    library(MASS); data(ships)
    shipsf <- ships
    # --- Make as factors ---
    shipsf$type   <- factor(shipsf$type)
    shipsf$year   <- factor(shipsf$year)
    shipsf$period <- factor(shipsf$period)
    # --- Fit a model ---
    # The offset log1p(service) = log(1 + service) stays finite
    # even if a row has zero service time
    mod1 <- glm(incidents ~ type + year + period + offset(log1p(service)),
                family = poisson,
                control = glm.control(epsilon = 0.0001, maxit = 100),
                data = shipsf)

Poisson regression: rate ratio
The rate ratio is of interest:
$$RR_j := \exp(\beta_j), \quad j = 1, \dots, p.$$
The rate ratio measures the multiplicative change in the intensity of events when $x_{ij}$ is increased by one unit, with the other covariates held fixed. Estimate of the rate ratio: $\widehat{RR}_j = \exp(\hat{\beta}_j)$, where $\hat{\beta}_j$ is the ML estimate of $\beta_j$. Using asymptotic normality of ML estimators, confidence intervals for $RR_j$ can be found. (Example 7.16, blackboard.)
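One common way to obtain such intervals is the Wald construction: form $\hat{\beta}_j \pm z_{\alpha/2}\,\mathrm{se}(\hat{\beta}_j)$ on the log scale and exponentiate the end points. The sketch below assumes that construction and a 95% level. It uses only the four type-B rows printed in these notes, so the numbers are illustrative and are not the result of Example 7.16.

```r
# The four type-B rows from the cargo-ship example, typed in by hand
ships_b <- data.frame(
  year      = factor(c("1960-64", "1960-64", "1965-69", "1965-69")),
  period    = factor(c("1960-74", "1975-79", "1960-74", "1975-79")),
  service   = c(44882, 17176, 28609, 20370),
  incidents = c(39, 29, 58, 53)
)

fit <- glm(incidents ~ year + period + offset(log(service)),
           family = poisson, data = ships_b)

est      <- coef(summary(fit))
beta_hat <- est[-1, "Estimate"]      # drop the intercept
se       <- est[-1, "Std. Error"]
z        <- qnorm(0.975)             # 95% two-sided Wald interval (assumed level)

rr_table <- cbind(RR    = exp(beta_hat),
                  lower = exp(beta_hat - z * se),
                  upper = exp(beta_hat + z * se))
print(round(rr_table, 2))
```

Each row compares a factor level with its reference level, for example period 1975-79 against 1960-74.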

Poisson regression: Model selection
The deviance can be used in model selection. Consider two candidate models: a more general one with $p$ covariates and a simpler one with $q < p$ covariates. Estimated parameters: $\hat{\beta}_p$ and $\hat{\beta}_q$. The test quantity
$$DEV = 2\,\bigl(l(\hat{\beta}_p) - l(\hat{\beta}_q)\bigr)$$
is approximately $\chi^2(p - q)$ distributed for large samples if the simpler model is true. Test: if $DEV > \chi^2_\alpha(p - q)$, the simpler model should be rejected (the difference between the log-likelihoods cannot be explained by the statistical variability). Hand calculations:
$$DEV = 2\sum_{i=1}^{k} n_i\,\bigl(\ln(\hat{\mu}_{ic}) - \ln(\hat{\mu}_{is})\bigr),$$
where $\hat{\mu}_{ic}$ are the estimates from the more complex model and $\hat{\mu}_{is}$ those from the simpler model. (Example 7.17, blackboard.)
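The following sketch compares the log-likelihood form of DEV with the hand formula for two nested models. The two agree when both models contain an intercept, because the fitted means then sum to the observed total in each model. The data are again only the four type-B rows printed in these notes, and the choice of models (with and without period of operation) is illustrative, not Example 7.17.

```r
# The four type-B rows from the cargo-ship example, typed in by hand
ships_b <- data.frame(
  year      = factor(c("1960-64", "1960-64", "1965-69", "1965-69")),
  period    = factor(c("1960-74", "1975-79", "1960-74", "1975-79")),
  service   = c(44882, 17176, 28609, 20370),
  incidents = c(39, 29, 58, 53)
)

mod_simple  <- glm(incidents ~ year + offset(log(service)),
                   family = poisson, data = ships_b)
mod_complex <- glm(incidents ~ year + period + offset(log(service)),
                   family = poisson, data = ships_b)

# DEV from the log-likelihoods
dev_loglik <- 2 * (as.numeric(logLik(mod_complex)) - as.numeric(logLik(mod_simple)))

# DEV from the hand formula, using the fitted means of both models
dev_hand <- 2 * sum(ships_b$incidents *
                    (log(fitted(mod_complex)) - log(fitted(mod_simple))))

df <- length(coef(mod_complex)) - length(coef(mod_simple))   # p - q
p_value <- pchisq(dev_loglik, df = df, lower.tail = FALSE)

print(c(dev_loglik = dev_loglik, dev_hand = dev_hand, df = df, p = p_value))
```

The built-in call `anova(mod_simple, mod_complex, test = "Chisq")` reports the same deviance difference.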