Regression models: Generalized linear models in R
Dr Peter K Dunn
http://www.usq.edu.au
Department of Mathematics and Computing, University of Southern Queensland
ASC, July 00

The usual linear regression models assume data come from a Normal distribution, with the mean related to predictors. Generalized linear models (GLMs) assume data come from some distribution, with a function of the mean related to predictors.

  Model              Randomness     Structure
  Regression model   Y ~ N(µ, φ)    µ = Xβ
  GLM                Y ~ P(µ, φ)    g(µ) = Xβ

Generalized linear models
Generalized linear models have two main components:
1 The model for the randomness: Y ~ P(µ, φ)
2 The model for the structure: g(µ) = Xβ
We can choose from many distributions P, and we can choose from many link functions g(µ) in a separate decision. (Using a transformation in regression approximately makes both decisions at once.)

Normal regression models are not always appropriate
There are obvious occasions when a Normal distribution is inappropriate:
- Counts cannot have normal distributions: they are non-negative integers
- Proportions cannot have normal distributions: they are constrained between 0 and 1
- Lots of continuous data are non-negative and have non-constant variance
In all cases, the variance cannot be constant, since a boundary on the responses exists.

Counts may be modelled using a Poisson distribution
Usually, use a log link. Define µ = E[Y] as the expected count. The model is
  Y_i ~ Poisson(µ_i)    (random)
  log µ_i = Xβ          (systematic)
The log link ensures µ = exp(xβ) is always positive. The log link also means the effect of the covariates x_j on µ is multiplicative, not additive.

Proportions may be modelled using a binomial distribution
Often, use a logit link (to get a logistic regression model). Define µ = E[Y] as the expected proportion. The model is
  Y_i ~ Binomial(µ_i)                        (random)
  logit(µ_i) = log( µ_i / (1 − µ_i) ) = Xβ   (systematic)

Basic fitting of glms in R
Fit a regression model in R using
  lm( y ~ x1 + log(x2) + x3 )
To fit a glm, R must know the distribution and link function. Fit a glm in R using (for example)
  glm( y ~ x1 + log(x2) + x3, family=poisson( link="log" ) )

What distributions can I choose?
- gaussian: a Gaussian (Normal) distribution
- binomial: a binomial distribution for proportions
- poisson: a Poisson distribution for counts
- Gamma: a gamma distribution for positive continuous data
- inverse.gaussian: an inverse Gaussian distribution for positive continuous data
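As a quick check of these choices, each family function in R returns a family object that records the distribution and the (default or requested) link; a minimal sketch:

```r
# Family objects record the distribution and link that glm() will use.
fam <- poisson(link = "sqrt")   # Poisson family with a sqrt link
fam$family                      # "poisson"
fam$link                        # "sqrt"
fam$linkfun(4)                  # eta = sqrt(mu), so this returns 2

binomial()$link                 # default link: "logit"
Gamma()$link                    # default link: "inverse"
inverse.gaussian()$link         # default link: "1/mu^2"
```

Passing one of these family objects to glm() is all that is needed to change the assumed distribution and link.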
What link function can I choose?
For the gaussian, binomial and poisson families:
  identity    µ = η
  log         log µ = η
  inverse     1/µ = η
  sqrt        √µ = η
  logit       logit(µ) = η
  probit      probit(µ) = η
  cauchit     cauchit(µ) = η
  cloglog     cloglog(µ) = η
For the Gamma and inverse.gaussian families:
  identity    µ = η
  log         log µ = η
  inverse     1/µ = η
  1/mu^2      1/µ² = η

In R...
To fit a glm in R, we need to specify:
- The linear predictor: x1 + x2 + log(x3)
- The distribution: family=poisson
- The link function: link="log"
They work together like this:
  glm( y ~ x1 + x2 + log(x3), family=poisson(link = "log") )

Glms in R?
Fitting glms is locally like fitting a standard regression model, so most regression concepts have (approximate) analogies for glms. For example, R allows the user to:
- fit glms (use glm)
- find important predictors (F-tests using anova; t-tests using summary)
- compute residuals (using resid; quantile residuals in package statmod strongly recommended: qresid)
- perform diagnostics (using plot, hatvalues, cooks.distance, etc.)

Example: Poisson
Counts cross-classified by number of children (C), SLE status (S) and depression status (D):

                     3 children (C = 1)      Others (C = 0)
                     SLE       No SLE        SLE       No SLE
                     (S = 1)   (S = 0)       (S = 1)   (S = 0)
  Depres. (D = 1)       9         0
  OK      (D = 0)       1         0           119        31

To fit the minimal model in R:
  dep.glm <- glm( Counts ~ C + S + D, family=poisson )
To fit the full model in R:
  dep.full <- glm( Counts ~ C * S * D, family=poisson )
We assume all qualitative variables are declared as factors. The data are counts, so use a poisson family (and the default log link). Initially, use the linear predictor C + S + D.

What predictors are significant?
Sequential test:
> anova(dep.full, test = "Chisq")
Analysis of Deviance Table
Model: poisson, link: log
Response: Counts
Terms added sequentially (first to last)

          Df  Deviance  Resid. Df  Resid. Dev  P(>|Chi|)
  NULL                     7          717.3
  D        1    330.3                 3.9       7.005e-7
  S        1     19.9      5          3.77      .0e-0
  C        1     31.1      5          .35       .505e-70
  D:S      1      .3       3          9.99      .75e-11
  D:C      1     7.5                  .5        0.01
  S:C      1     0.5                  1.00      0.
  D:S:C    1     .00                  0.1e-     0.1

The main effects D, S and C and the interaction D:S are clearly significant; the remaining interactions are not.
What predictors are significant?
Post-fit test:
> summary(dep.full)
Call:
glm(formula = Counts ~ D * S * C, family = poisson(link = log), data = dep)

Deviance Residuals:
[1] 0 0 0 0 0 0 0 0

Coefficients:
              Estimate  Std. Error  z value  Pr(>|z|)
(Intercept)     5.        0.05        .71    < 2e-16  ***
D1             -.051      0.503      -.03     .77e-1  ***
S1             -0.33      0.11       -5.7     .15e-09 ***
C1             -.7        0.331      -.97    < 2e-16  ***
D1:S1           .550      0.5517      .50     .0e-0   ***
D1:C1          -1.        7.157     -0.001    1.00
S1:C1           0.155     0.3        0.399    0.9
D1:S1:C1        .555      7.157      0.001    1.00
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for poisson family taken to be 1)

Null deviance: 7.173e+02 on 7 degrees of freedom
Residual deviance: .13e- on 0 degrees of freedom
AIC: 51.
Number of Fisher Scoring iterations: 0

(The full model is saturated, so its deviance residuals and residual deviance are essentially zero.)
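The sequential tests shown above can be reproduced on any glm fit; a small sketch on simulated data (the variables x1 and x2 here are made up, not the depression data):

```r
# Sequential (type I) chi-square tests for a Poisson glm on simulated counts.
set.seed(5)
d <- data.frame(x1 = runif(40), x2 = runif(40))
d$y <- rpois(40, lambda = exp(1 + 2 * d$x1))   # only x1 truly affects the mean
fit <- glm(y ~ x1 + x2, family = poisson, data = d)
anova(fit, test = "Chisq")   # rows: NULL, x1, x2; expect a small p-value for x1
```

The anova table adds terms in the order they appear in the formula, so reordering the formula changes the sequential tests.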
To fit one suggested model in R:
  dep.opt <- glm( Counts ~ C + S * D, family=poisson )
Note that S * D means S + D and the interaction S:D.

Plots: Hat diagonals
> plot(hatvalues(dep.opt), type = "h", lwd = 2)
[Plot: hat values for each observation]

Plots: Cook's distance
> plot(cooks.distance(dep.opt), type = "h", lwd = 2)
[Plot: Cook's distance for each observation]

Plots: Q-Q plots
> library(statmod)
> qqnorm(qresid(dep.opt))
[Normal Q-Q plot of the quantile residuals]

Typing plot( glm.object ) produces six plots, four by default:
1 Residuals r_i vs fitted values µ̂ (default)
2 √|r_i| vs µ̂ (default)
3 a Q-Q plot (default)
4 A plot of Cook's distance D_i
5 A plot of r_i vs h_i, with contours of equal D_i (default)
6 A plot of D_i vs h_i/(1 − h_i), with contours of equal D_i

> par(mfrow = c(2, 2))
> plot(dep.opt)
> par(mfrow = c(1, 1))
[Four default diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage]

> plot(dep.opt, which = 5)
[Plot: Residuals vs Leverage, with Cook's distance contours, for glm(Counts ~ D * S + C)]

  Hours  No. turbines  No. fissures  Prop. fissures
    400       39             0           0.00
   1000       53             4           0.08
   1400       33             2           0.06
   1800       73             7           0.10
   2200       30             5           0.17
   2600       39             9           0.23
   3000       42             9           0.21
   3400       13             6           0.46
   3800       34            22           0.65
   4200       40            21           0.53
   4600       36            21           0.58

The data are proportions: use binomial family
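Before fitting, it helps to know that the two standard ways of giving binomial data to glm() are equivalent; a toy check (the numbers below are made up for illustration, not the turbine data):

```r
# Two equivalent binomial specifications: proportions with group sizes as
# weights, or a two-column matrix of (successes, failures).
Hours    <- c(400, 1000, 2000, 3000, 4000)
Turbines <- c(10, 12, 9, 11, 10)
Fissures <- c(0, 1, 2, 5, 7)
prop <- Fissures / Turbines

f1 <- glm(prop ~ Hours, weights = Turbines, family = binomial)
f2 <- glm(cbind(Fissures, Turbines - Fissures) ~ Hours, family = binomial)
all.equal(coef(f1), coef(f2))   # TRUE: both give identical estimates
```

Both forms maximise the same binomial likelihood, so the fitted coefficients agree exactly.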
Three ways to fit binomial glms in R; here are two:
1 Model the observed proportions, with the group sizes as weights:
  td.glm <- glm( prop ~ Hours, weights=Turbines, family=binomial(link=logit) )
2 Model the two-column matrix of (successes, failures):
  td.glm <- glm( cbind(Fissures, Turbines - Fissures) ~ Hours, family=binomial(link=logit) )
Can use alternative links:
  td.glm <- glm( prop ~ Hours, weights=Turbines, family=binomial(link=probit) )
  td.glm <- glm( prop ~ Hours, weights=Turbines, family=binomial(link=cloglog) )
[Plot: Proportion of turbines with fissures vs Hours of use]

We use the default logit link. The fitted model is:
> summary(td.glm)
Call:
glm(formula = prop ~ Hours, family = binomial(link = logit), weights = Turbines)

Deviance Residuals:
     Min       1Q   Median      3Q     Max
 -1.5055   -0.77   -0.303   0.901    .093

Coefficients:
              Estimate   Std. Error  z value  Pr(>|z|)
(Intercept)   -3.9359    0.377959    -.31     < 2e-16 ***
Hours          0.000999  0.00011      .75     < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

Null deviance: 11.70 on 10 degrees of freedom
Residual deviance: .331 on 9 degrees of freedom
AIC: 9.0
Number of Fisher Scoring iterations:

> td.cf <- signif(coef(td.glm), 3)
> td.cf
(Intercept)       Hours
  -3.90000    0.000999

From the R output, the fitted model is
  log( µ_i / (1 − µ_i) ) = −3.9 + 0.000999 × Hours
where µ_i is the expected proportion of turbines with fissures.

Plots: Hat diagonals
> plot(hatvalues(td.glm), type = "h", lwd = 2, col = "blue")
[Plot: hat values]

Plots: Cook's distance
> plot(cooks.distance(td.glm), type = "h", lwd = 2)
[Plot: Cook's distances]

Plots: Q-Q plots
> qqnorm(qresid(td.glm))
[Normal Q-Q plot of the quantile residuals]

Lung cancer cases (C) and population (P) in four Danish cities, Fredericia (F), Horsens (H), Kolding (K) and Vejle (V), by age group:

             F            H            K            V
  Age      C    P       C    P       C    P       C    P
  40-54   11  3059     13  2879      4  3142      5  2520
  55-59   11   800      6  1083      8  1050      7   878
  60-64   11   710     15   923      7   895     10   839
  65-69   10   581     10   834     11   702     14   631
  70-74   11   509     12   634      9   535      8   539
  75+     10   605      2   782     12   659      7   619
Rates
The number of lung cancer patients is a count, so we could use a Poisson glm:
  glm( Cases ~ City + Age, family=poisson )
But the lung cancer rate is probably more useful. The expected cancer rate is E[Y_i/T_i] = E[Y_i]/T_i = µ_i/T_i, where µ_i is the expected number of cancers and T_i is the population; note T_i is known and not random. Using a logarithmic link, model the cancer rate as
  log(µ_i/T_i) = Xβ
or
  log µ_i = log T_i + Xβ
Here log T_i is an offset: a component of the linear predictor with a known parameter value, here one.

Plots: Number of cancers
[Bar plot: number of lung cancer cases by age group (40-54 to 75+) and city (Fredericia, Horsens, Kolding, Vejle)]

Plots: Rates of cancer
[Bar plot: lung cancer rate by age group and city]

Rates
To model the lung cancer rate, use a Poisson glm with an offset:
  lc.glm <- glm( Cases ~ offset( log(Population) ) + City + Age, family=poisson )

Plots: Hat diagonals
> plot(hatvalues(lc.glm), type = "h", lwd = 2, col = "blue")
[Plot: hat values for each observation]

Plots: Cook's distance
> plot(cooks.distance(lc.glm), type = "h", lwd = 2)
[Plot: Cook's distances]

Plots: Q-Q plots
> library(statmod)
> qqnorm(qresid(lc.glm))
[Normal Q-Q plot of the quantile residuals]
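The offset idea can be checked numerically: in an intercept-only rate model, the fitted rate equals total cases over total population. A sketch with made-up numbers (not the lung cancer data):

```r
# Offset sketch: log(Population) enters the linear predictor with a fixed
# coefficient of one, so exp(intercept) estimates the overall rate.
set.seed(3)
Population <- c(3000, 800, 700, 600, 500, 600)
Cases <- rpois(6, lambda = 0.01 * Population)   # simulate a rate of about 0.01

fit <- glm(Cases ~ offset(log(Population)), family = poisson)
exp(coef(fit))                  # fitted overall rate
sum(Cases) / sum(Population)    # the same number: total cases per person
```

Equivalently, the offset can be supplied as an argument: glm(Cases ~ 1, offset = log(Population), family = poisson).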
Other models
We have looked at fitting glms to:
- Proportions
- Counts
- Rates
We can also fit glms to:
- Positive continuous data (family=Gamma or family=inverse.gaussian)
- Overdispersed counts (family=quasipoisson)
- Overdispersed proportions (family=quasibinomial)
- Positive continuous data with exact zeros (family=tweedie, using package statmod)
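For instance, overdispersed counts can be handled with quasipoisson, which keeps the Poisson mean model but estimates a dispersion parameter; a sketch on simulated data:

```r
# quasipoisson: same coefficient estimates as poisson, but standard errors
# are scaled by an estimated dispersion parameter.
set.seed(4)
x <- runif(50)
y <- rnbinom(50, mu = exp(1 + x), size = 2)   # counts with extra-Poisson variation

fit.q <- glm(y ~ x, family = quasipoisson)
fit.p <- glm(y ~ x, family = poisson)
all.equal(coef(fit.q), coef(fit.p))   # TRUE: identical point estimates
summary(fit.q)$dispersion             # estimated dispersion; near 1 means no overdispersion
```

A dispersion estimate well above 1 signals that the Poisson standard errors would be too small.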