Regression models: Generalized linear models in R. Normal regression models are not always appropriate; generalized linear models; examples.


Regression models: Generalized linear models in R
Dr Peter K Dunn
http://www.usq.edu.au
Department of Mathematics and Computing, University of Southern Queensland
ASC, July 00

The usual linear regression models assume data come from a Normal distribution, with the mean related to predictors. Generalized linear models (GLMs) assume data come from some distribution, with a function of the mean related to predictors.

    Model             Randomness     Structure
    Regression model  Y ~ N(µ, φ)    µ = Xβ
    GLM               Y ~ P(µ, φ)    g(µ) = Xβ

Generalized linear models

Generalized linear models have two main components:
1. The model for the randomness: Y ~ P(µ, φ)
2. The model for the structure: g(µ) = Xβ
We can choose from many distributions P, and we can choose from many link functions g(µ) in a separate decision. (Using a transformation in ordinary regression approximately makes both decisions at once.)

Normal regression models are not always appropriate

There are obvious occasions when a Normal distribution is inappropriate:
- Counts cannot have Normal distributions: they are non-negative integers.
- Proportions cannot have Normal distributions: they are constrained between 0 and 1.
- Lots of continuous data are non-negative and have non-constant variance.
In all cases the variance cannot be constant, since a boundary on the responses exists.

Counts may be modelled using a Poisson distribution, usually with a log link. Define µ = E[Y] as the expected count. The model is

    Y_i ~ Poisson(µ_i)    (random)
    log µ_i = Xβ          (systematic)

The log link ensures µ = exp(Xβ) is always positive, and it means the effect of a covariate x_j on µ is multiplicative, not additive.

Proportions may be modelled using a binomial distribution, often with a logit link (to get a logistic regression model). Define µ = E[Y] as the expected proportion. The model is

    Y_i ~ Binomial(µ_i)                         (random)
    logit(µ_i) = log( µ_i / (1 - µ_i) ) = Xβ    (systematic)

Basic fitting of glms in R

Fit a regression model in R using

    lm( y ~ x1 + x2 + log(x3) )

To fit a glm, R must also know the distribution and link function; fit a glm in R using (for example)

    glm( y ~ x1 + x2 + log(x3), family=poisson( link="log" ) )

What distributions can I choose?
- gaussian: a Gaussian (Normal) distribution
- binomial: a binomial distribution for proportions
- poisson: a Poisson distribution for counts
- Gamma: a gamma distribution for positive continuous data
- inverse.gaussian: an inverse Gaussian distribution for positive continuous data
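As a sketch of how these pieces fit together, the following fits a Poisson GLM to simulated data. All variable names and values here are made up for illustration; they are not from the lecture.

```r
# Simulated count data (hypothetical).
set.seed(1)
n  <- 200
x1 <- runif(n)
x2 <- runif(n)
# Log link: the mean is exp(linear predictor), so it is always positive.
mu <- exp(0.5 + 1.2 * x1 - 0.8 * x2)
y  <- rpois(n, mu)

# Poisson GLM; link = "log" is the default for family = poisson.
fit <- glm(y ~ x1 + x2, family = poisson(link = "log"))
coef(fit)  # estimates should be close to 0.5, 1.2, -0.8
```

Note that exp(coef(fit)) gives the multiplicative effect of a one-unit change in each covariate on the mean, matching the "multiplicative not additive" remark above.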

What link function can I choose?

For the gaussian, binomial and poisson families, the available links include:
- identity: µ = η
- log: log µ = η
- inverse: 1/µ = η
- sqrt: √µ = η
- logit: logit(µ) = η
- probit: probit(µ) = η
- cauchit: cauchit(µ) = η
- cloglog: cloglog(µ) = η

For the Gamma and inverse.gaussian families:
- identity: µ = η
- log: log µ = η
- inverse: 1/µ = η
- 1/mu^2: 1/µ² = η

In R...

To fit a glm in R, we need to specify:
- The linear predictor: x1 + x2 + log(x3)
- The distribution: family=poisson
- The link function: link="log"

They work together like this:

    glm( y ~ x1 + x2 + log(x3), family=poisson(link = "log") )

Glms in R?

Fitting glms is locally like fitting a standard regression model, so most regression concepts have (approximate) analogies for glms. For example, R allows the user to:
- fit glms (use glm)
- find important predictors (F-tests using anova; t-tests using summary)
- compute residuals (using resid; quantile residuals, using qresid in package statmod, are strongly recommended)
- perform diagnostics (using plot, hatvalues, cooks.distance, etc.)

Example: Poisson

[Table: counts of subjects cross-classified by number of children (C = 1: 3 children; C = 0: others), severe life events (S = 1: SLE; S = 0: no SLE) and depression status (D = 1: depressed; D = 0: OK)]

To fit the minimal model in R:

    dep.glm <- glm( Counts ~ C + S + D, family=poisson, data=dep )

To fit the full model in R:

    dep.full <- glm( Counts ~ C * S * D, family=poisson, data=dep )

We assume all qualitative variables are declared as factors. The data are counts, so use a poisson family (and the default log link). Initially, use the linear predictor C + S + D.

What predictors are significant? Sequential test:

    > anova(dep.full, test = "Chisq")
    Analysis of Deviance Table
    Model: poisson, link: log
    Response: Counts
    Terms added sequentially (first to last)

           Df  Deviance  Resid. Df  Resid. Dev  P(>|Chi|)
    NULL                         7       717.3
    D       1     330.3          6         ...  7.005e-7...
    S       1      19.9          5         ...         ...
    C       1      31.1          4         ...         ...
    D:S     1       ...          3         ...     ...e-11
    D:C     1       7.5          2         ...        0.01
    S:C     1       0.5          1         ...         ...
    D:S:C   1       ...          0         ...         ...

What predictors are significant?
Post-fit test:

    > summary(dep.full)

    Call:
    glm(formula = Counts ~ D * S * C, family = poisson(link = log),
        data = dep)

    Deviance Residuals:
    [1] 0 0 0 0 0 0 0 0

    Coefficients:
                 Estimate  Std. Error  z value  Pr(>|z|)
    (Intercept)   5.          0.05       .71    < 2e-16  ***
    D1            -.051       0.503      -.03   .77e-1   ***
    S1            -0.33       0.11       -5.7   .15e-09  ***
    C1            -.7         0.331      -.97   < 2e-16  ***
    D1:S1         .550        0.5517     .50    .0e-0    ***
    D1:C1         -1.         7.157      -0.001 1.00
    S1:C1         0.155       0.3        0.399  0.9
    D1:S1:C1      .555        7.157      0.001  1.00
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for poisson family taken to be 1)

        Null deviance: 7.173e+02  on 7 degrees of freedom
    Residual deviance: .13e-      on 0 degrees of freedom
    AIC: 51.

    Number of Fisher Scoring iterations: 0
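The behaviour of the full (three-way interaction) model above can be reproduced on any small contingency table. A minimal sketch with made-up counts (not the lecture's depression data): because the three-way interaction model is saturated for a 2 x 2 x 2 table, it reproduces every cell exactly, so all its deviance residuals are zero, as in the summary() output.

```r
# Hypothetical 2x2x2 table of counts (made-up numbers).
dep <- expand.grid(D = factor(0:1), S = factor(0:1), C = factor(0:1))
dep$Counts <- c(119, 16, 30, 22, 45, 40, 12, 25)

dep.full <- glm(Counts ~ D * S * C, family = poisson, data = dep)

# Sequential (type I) chi-square tests, as on the slide:
anova(dep.full, test = "Chisq")

# The saturated model reproduces every cell exactly:
deviance(dep.full)                       # essentially zero
max(abs(fitted(dep.full) - dep$Counts))  # essentially zero
```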

To fit one suggested model in R:

    dep.opt <- glm( Counts ~ C + S * D, family=poisson, data=dep )

Note that S * D means S + D and the interaction S:D.

Plots: Hat diagonals

    > plot(hatvalues(dep.opt), type = "h", lwd = 2)

[Figure: index plot of hatvalues(dep.opt)]

Plots: Cook's distance

    > plot(cooks.distance(dep.opt), type = "h", lwd = 2)

[Figure: index plot of cooks.distance(dep.opt)]

Plots: Q-Q plots

    > library(statmod)
    > qqnorm(qresid(dep.opt))

[Figure: Normal Q-Q plot of the quantile residuals of dep.opt]

Typing plot( glm.object ) produces six plots, four by default:
1. Residuals r_i vs fitted values µ̂ (default)
2. √|r_i| vs µ̂ (Scale-Location) (default)
3. A Q-Q plot of the residuals (default)
4. A plot of Cook's distance D_i
5. A plot of r_i vs leverage h_i, with contours of equal D_i (default)
6. A plot of D_i vs h_i/(1 - h_i), with contours of equal D_i

    > par(mfrow = c(2, 2))
    > plot(dep.opt)
    > par(mfrow = c(1, 1))

[Figure: the four default diagnostic plots for dep.opt: Residuals vs Fitted, Normal Q-Q, Scale-Location, Residuals vs Leverage]

    > plot(dep.opt, which = 5)

[Figure: Residuals vs Leverage plot for glm(Counts ~ D * S + C), with contours of equal Cook's distance]

[Table: for each group of turbines: Hours of use, No. of turbines, No. with fissures, Prop. with fissures]

The data are proportions: use the binomial family.
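The diagnostic calls above can be run on any fitted glm. A small self-contained sketch on simulated data (hypothetical values; qresid() is from the statmod package, as recommended above):

```r
library(statmod)  # for qresid(): randomized quantile residuals

# Simulated Poisson data (hypothetical).
set.seed(2)
x   <- runif(100)
y   <- rpois(100, exp(1 + x))
fit <- glm(y ~ x, family = poisson)

h  <- hatvalues(fit)       # leverages: between 0 and 1, sum to no. of parameters
cd <- cooks.distance(fit)  # influence of each observation
qr <- qresid(fit)          # approximately N(0,1) if the model is adequate

qqnorm(qr)                 # Q-Q plot, as on the slide
```

For discrete responses, ordinary deviance or Pearson residuals show distracting banding patterns in the Q-Q plot; randomized quantile residuals avoid this, which is why the lecture recommends them.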

Three ways to fit binomial glms in R; here are two:

1. As proportions, with the group sizes as weights:

    td.glm <- glm( prop ~ Hours, weights=Turbines,
                   family=binomial(link=logit) )

2. As a two-column matrix of (successes, failures):

    td.glm <- glm( cbind(Fissures, Turbines - Fissures) ~ Hours,
                   family=binomial(link=logit) )

[Figure: proportion of turbines with fissures vs hours of use]

Can use alternative links:

    td.glm <- glm( prop ~ Hours, weights=Turbines,
                   family=binomial(link=probit) )

    td.glm <- glm( prop ~ Hours, weights=Turbines,
                   family=binomial(link=cloglog) )

We use the default logit link. The fitted model is:

    > summary(td.glm)

    Call:
    glm(formula = prop ~ Hours, family = binomial(link = logit),
        weights = Turbines)

    Deviance Residuals:
        Min       1Q   Median      3Q     Max
    -1.5055  -0.77    -0.303    0.901   .093

    Coefficients:
                  Estimate  Std. Error  z value  Pr(>|z|)
    (Intercept)  -3.9359    0.377959    -.31     < 2e-16  ***
    Hours         0.000999  0.00011      .75     < 2e-16  ***
    ---
    Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 11.70 on 10 degrees of freedom
    Residual deviance: .331  on 9 degrees of freedom
    AIC: 9.0

    Number of Fisher Scoring iterations: ...

    > td.cf <- signif(coef(td.glm), 3)
    > td.cf
    (Intercept)        Hours
      -3.940000     0.000999

From the R output, the fitted model is

    log( µ_i / (1 - µ_i) ) = -3.94 + 0.000999 × Hours

where µ_i is the expected proportion of turbines with fissures.

[Table: number of lung cancer cases (C) and population (P) by age group for each of four cities: F = Fredericia, H = Horsens, K = Kolding, V = Vejle]
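The two binomial specifications above describe the same model, and R fits them identically. A quick check with hypothetical turbine-style numbers (made up for illustration, not the data from the table):

```r
# Made-up grouped binomial data in the style of the turbine example.
Hours    <- c(400, 1000, 1400, 1800, 2200, 3000)
Turbines <- c( 39,   53,   33,   73,   30,   39)
Fissures <- c(  0,    1,    2,    7,    5,    9)
prop     <- Fissures / Turbines

# Way 1: response is a proportion, with group sizes as weights.
fit1 <- glm(prop ~ Hours, weights = Turbines,
            family = binomial(link = "logit"))

# Way 2: response is a two-column matrix of (successes, failures).
fit2 <- glm(cbind(Fissures, Turbines - Fissures) ~ Hours,
            family = binomial(link = "logit"))

all.equal(coef(fit1), coef(fit2))  # TRUE: identical fitted models
```

The same equivalence holds for the probit and cloglog links; only the link argument changes.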

Plots: Number of cancers

[Figure: number of lung cancer cases by age group for each city (Fredericia, Horsens, Kolding, Vejle)]

Rates

The number of lung cancer patients is a count, so use a Poisson glm:

    glm( Cases ~ City + Age, family=poisson )

But the lung cancer rate is probably more useful. The expected cancer rate is E[Y_i / T_i] = E[Y_i] / T_i = µ_i / T_i, where µ_i is the expected number of cancers and T_i is the known, non-random population. Using a logarithmic link, model the cancer rate as

    log( µ_i / T_i ) = Xβ    or    log µ_i = log T_i + Xβ

Here log T_i is an offset: a component of the linear predictor with a known parameter value, here one.

Plots: Rates of cancer

[Figure: lung cancer rates by age group for each city (Fredericia, Horsens, Kolding, Vejle)]

Rates

To model the lung cancer rate, use a Poisson glm with an offset:

    lc.glm <- glm( Cases ~ offset( log(Population) ) + City + Age,
                   family=poisson )

Plots: Hat diagonals

    > plot(hatvalues(lc.glm), type = "h", lwd = 2, col = "blue")

[Figure: index plot of hatvalues(lc.glm)]

Plots: Cook's distance

    > plot(cooks.distance(lc.glm), type = "h", lwd = 2)

[Figure: index plot of cooks.distance(lc.glm)]

Plots: Q-Q plots

    > library(statmod)
    > qqnorm(qresid(lc.glm))

[Figure: Normal Q-Q plot of the quantile residuals of lc.glm]
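A sketch of the offset idea with made-up populations and counts (not the lecture's lung cancer data): with one saturated factor, the fitted rates recover the observed rates exactly, showing that log T_i enters the linear predictor with its coefficient fixed at one rather than estimated.

```r
# Hypothetical populations (person-years) and case counts by age group.
Population <- c(3000, 2800, 3100, 2500)
Cases      <- c(  15,   28,   47,   50)
Age        <- factor(c("a1", "a2", "a3", "a4"))  # made-up group labels

# Poisson rate model: log mu_i = log T_i + X beta.
fit <- glm(Cases ~ offset(log(Population)) + Age, family = poisson)

# One parameter per age group, so fitted counts match observed counts,
# and hence fitted rates match observed rates:
cbind(fitted = fitted(fit) / Population,
      observed = Cases / Population)
```

Omitting the offset would instead model the raw counts, silently attributing differences in population size to the Age effect.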

Other models

We have looked at fitting glms to:
- Proportions
- Counts
- Rates

We can also fit glms to:
- Positive continuous data (family=Gamma or family=inverse.gaussian)
- Overdispersed counts (family=quasipoisson)
- Overdispersed proportions (family=quasibinomial)
- Positive continuous data with exact zeros (family=tweedie, using package statmod)
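For example, overdispersed counts can be handled with family=quasipoisson: the point estimates are unchanged, but standard errors are inflated by the square root of an estimated dispersion. A sketch on simulated overdispersed (negative binomial) data, with all values made up:

```r
# Simulated overdispersed counts: Var(y) = mu + mu^2/2 > mu (hypothetical).
set.seed(4)
x  <- runif(200)
mu <- exp(1 + x)
y  <- rnbinom(200, mu = mu, size = 2)

fit.p  <- glm(y ~ x, family = poisson)
fit.qp <- glm(y ~ x, family = quasipoisson)

all.equal(coef(fit.p), coef(fit.qp))  # TRUE: same point estimates
summary(fit.qp)$dispersion            # estimated dispersion; > 1 here
```

The quasipoisson p-values are therefore more conservative than the Poisson ones, which is the appropriate correction when the data are more variable than a Poisson model allows.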