Generalized linear models
Douglas Bates
November 01, 2010

Contents
1 Definition
2 Links
3 Estimating parameters
4 Example
5 Model building
6 Conclusions
7 Summary

1 Generalized Linear Models

When using linear models (LMs) we assume that the response being modeled is on a continuous scale. Sometimes we can bend this assumption a bit if the response is an ordinal response with a moderate to large number of levels. For example, the Scottish secondary school test results in the mlmrev package are integer values on the scale of 1 to 10, but we analyze them on a continuous scale.

However, an LM is not suitable for modeling a binary response, an ordinal response with few levels, or a response that represents a count. For these we use generalized linear models (GLMs).

To describe GLMs we return to the representation of the response as an n-dimensional, vector-valued random variable, Y.

Parts of LMs carried over to GLMs

Random variables
- Y, the response variable

Parameters
- β, the fixed-effects coefficients
- σ, a scale parameter (not always used)

The linear predictor
- Xβ, where X is the n × p model matrix for β

The probability model

For GLMs we retain some of the properties of the LM probability model

$$Y \sim \mathcal{N}(\mu_Y, \sigma^2 I) \quad \text{where} \quad \mu_Y = X\beta$$

Specifically:
- The distribution of Y depends on β only through the mean, $\mu_Y = X\beta$.
- Elements of Y are independent. That is, the distribution of Y is completely specified by the univariate distributions of $Y_i$, $i = 1, \dots, n$.
- These univariate distributions all have the same form. They differ only in their means.

GLMs differ from LMs in the form of the univariate distributions and in how $\mu_Y$ depends on the linear predictor, Xβ.

2 Specific distributions and links

Some choices of univariate distributions

Typical choices of univariate distributions are:

- The Bernoulli distribution for binary (0/1) data, which has probability mass function

$$p(y \mid \mu) = \mu^{y} (1 - \mu)^{1 - y}, \quad 0 < \mu < 1, \quad y = 0, 1$$

- Several independent binary responses can be represented as a binomial response, but only if all the Bernoulli distributions have the same mean.
- The Poisson distribution for count (0, 1, ...) data, which has probability mass function

$$p(y \mid \mu) = e^{-\mu} \frac{\mu^{y}}{y!}, \quad 0 < \mu, \quad y = 0, 1, 2, \dots$$

All of these distributions are completely specified by the mean. This is different from the normal (or Gaussian) distribution, which also requires the scale parameter, σ.
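As a quick check of the two mass functions above, the following sketch writes them out by hand and compares them with R's built-in `dbinom` and `dpois`. The helper names `bernoulli_pmf` and `poisson_pmf` and the values of `mu_b` and `mu_p` are illustrative choices, not part of the original notes.

```r
# Bernoulli pmf as written in the text: mu^y * (1 - mu)^(1 - y)
bernoulli_pmf <- function(y, mu) mu^y * (1 - mu)^(1 - y)

# Poisson pmf as written in the text: exp(-mu) * mu^y / y!
poisson_pmf <- function(y, mu) exp(-mu) * mu^y / factorial(y)

mu_b <- 0.3   # arbitrary illustrative mean in (0, 1)
mu_p <- 2.5   # arbitrary illustrative mean > 0

# Compare with R's built-in mass functions; a Bernoulli is a binomial with size 1
bern_ok <- all.equal(bernoulli_pmf(0:1, mu_b), dbinom(0:1, size = 1, prob = mu_b))
pois_ok <- all.equal(poisson_pmf(0:10, mu_p), dpois(0:10, lambda = mu_p))
```

In both cases the mean is the only argument needed besides y, which is the sense in which these distributions are completely specified by the mean.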

The link function, g

When the univariate distributions have constraints on µ, such as $0 < \mu < 1$ (Bernoulli) or $0 < \mu$ (Poisson), we cannot define the mean, $\mu_Y$, to be equal to the linear predictor, Xβ, which is unbounded.

We choose an invertible, univariate link function, g, such that η = g(µ) is unconstrained. The vector-valued link function, $\mathbf{g}$, is defined by applying g component-wise:

$$\eta = \mathbf{g}(\mu) \quad \text{where} \quad \eta_i = g(\mu_i), \quad i = 1, \dots, n$$

We require that g be invertible so that $\mu = g^{-1}(\eta)$ is defined for $-\infty < \eta < \infty$ and is in the appropriate range ($0 < \mu < 1$ for the Bernoulli or $0 < \mu$ for the Poisson). The vector-valued inverse link, $\mathbf{g}^{-1}$, is defined component-wise.

Canonical link functions

There are many choices of invertible scalar link functions, g, that we could use for a given set of constraints. For the Bernoulli and Poisson distributions, however, one link function arises naturally from the definition of the probability mass function. (The same is true for a few other, related but less frequently used, distributions, such as the gamma distribution.)

To derive the canonical link, we consider the logarithm of the probability mass function (or, for continuous distributions, the probability density function). For distributions in this exponential family, the logarithm of the probability mass or density can be written as a sum of terms, some of which depend on the response, y, only and some of which depend on the mean, µ, only. However, only one term depends on both y and µ, and this term has the form $y \cdot g(\mu)$, where g is the canonical link.

The canonical link for the Bernoulli distribution

The logarithm of the probability mass function is

$$\log(p(y \mid \mu)) = \log(1 - \mu) + y \log\left(\frac{\mu}{1 - \mu}\right), \quad 0 < \mu < 1, \quad y = 0, 1.$$

Thus, the canonical link function is the logit link

$$\eta = g(\mu) = \log\left(\frac{\mu}{1 - \mu}\right).$$

Because $\mu = P[Y = 1]$, the quantity $\mu / (1 - \mu)$ is the odds (in the range $(0, \infty)$) and g is the logarithm of the odds, usually called the log odds.

The inverse link is

$$\mu = g^{-1}(\eta) = \frac{e^{\eta}}{1 + e^{\eta}} = \frac{1}{1 + e^{-\eta}}$$
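The decomposition of the Bernoulli log mass function and the two equivalent forms of the inverse link can be checked numerically. This is only a sketch: the function names `logit` and `inv_logit` and the grid of µ values are assumptions made for illustration.

```r
# Logit link and its inverse, written out as in the formulas above
logit     <- function(mu)  log(mu / (1 - mu))
inv_logit <- function(eta) 1 / (1 + exp(-eta))

mu  <- c(0.1, 0.25, 0.5, 0.9)   # arbitrary illustrative means
eta <- logit(mu)

# The inverse link recovers mu, and both algebraic forms agree
round_trip <- inv_logit(eta)
alt_form   <- exp(eta) / (1 + exp(eta))

# log p(y | mu) = log(1 - mu) + y * g(mu), checked here for y = 1 and y = 0
lhs1 <- dbinom(1, size = 1, prob = mu, log = TRUE)
rhs1 <- log(1 - mu) + 1 * logit(mu)
lhs0 <- dbinom(0, size = 1, prob = mu, log = TRUE)
rhs0 <- log(1 - mu) + 0 * logit(mu)
```

The only term involving both y and µ is `y * logit(mu)`, which is what identifies the logit as the canonical link.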

[Figure: Plot of canonical link for the Bernoulli distribution, $\eta = \log\left(\frac{\mu}{1 - \mu}\right)$ against µ for µ from 0 to 1.]

[Figure: Plot of inverse canonical link for the Bernoulli distribution, $\mu = \frac{1}{1 + \exp(-\eta)}$ against η for η from -5 to 5.]

Evaluating links for the Bernoulli

As a monotone increasing function that maps $(-\infty, \infty)$ to $(0, 1)$, the inverse link, $g^{-1}$, is a cumulative distribution function for a continuous random variable. The link, g, is the corresponding quantile function.

The canonical link is the quantile function for the logistic distribution (R functions qlogis and plogis).
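The correspondence between the link and the logistic quantile function can be seen directly in the family object that R uses for binomial GLMs. The sketch below relies only on the `linkfun` and `linkinv` components of the object returned by `binomial()`; the grid of µ values is an arbitrary choice.

```r
fam <- binomial()                  # the default link for this family is "logit"
mu  <- seq(0.05, 0.95, by = 0.05)  # arbitrary grid of means in (0, 1)

eta <- fam$linkfun(mu)             # link: mean scale -> linear predictor scale

# The link agrees with the logistic quantile function,
# and the inverse link agrees with the logistic CDF
link_matches_quantile <- all.equal(eta, qlogis(mu))
inverse_matches_cdf   <- all.equal(fam$linkinv(eta), plogis(eta))
```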

In the past the "probit" link was sometimes used instead of the logit link. For this the link is the standard normal quantile function and the inverse link is the standard normal cumulative distribution function, sometimes written Φ(z) (R functions qnorm and pnorm).

As described in R's help page for the binomial function,

    the binomial family (allows) the links "logit", "probit", "cauchit", (corresponding to logistic, normal and Cauchy CDFs respectively) "log" and "cloglog" (complementary log-log)

The canonical link for the Poisson distribution

The logarithm of the probability mass is

$$\log(p(y \mid \mu)) = -\log(y!) - \mu + y \log(\mu)$$

Thus, the canonical link function for the Poisson is the log link

$$\eta = g(\mu) = \log(\mu)$$

The inverse link is

$$\mu = g^{-1}(\eta) = e^{\eta}$$

The canonical link related to the variance

For the canonical link function, the derivative of its inverse is the variance of the response.

For the Bernoulli, the canonical link is the logit and the inverse link is $\mu = g^{-1}(\eta) = 1/(1 + e^{-\eta})$. Then

$$\frac{d\mu}{d\eta} = \frac{e^{-\eta}}{(1 + e^{-\eta})^2} = \frac{1}{1 + e^{-\eta}} \cdot \frac{e^{-\eta}}{1 + e^{-\eta}} = \mu(1 - \mu) = \mathrm{Var}(Y)$$

For the Poisson, the canonical link is the log and the inverse link is $\mu = g^{-1}(\eta) = e^{\eta}$. Then

$$\frac{d\mu}{d\eta} = e^{\eta} = \mu = \mathrm{Var}(Y)$$

3 Estimating parameters

We determine the maximum likelihood estimates (MLEs) of the coefficients, β, in the linear predictor and, if used, the scale parameter.

There is no direct algorithm for determining the MLEs in a GLM, so we must use iterative algorithms. Fortunately, there is a very effective iterative algorithm which is based on weighted least squares to update parameter estimates.

The weights are inversely proportional to the variance of the response, but the variance depends on the mean which, in turn, depends on the parameters.
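A minimal sketch of such a weighted least squares iteration for a Bernoulli response with the logit link is given here. It is one common textbook formulation, not necessarily the exact code path inside `glm()`. The data are simulated, and the function name `irls_logistic`, the true coefficients, the tolerance and the iteration cap are all assumptions made for illustration.

```r
set.seed(1)
n <- 500
x <- rnorm(n)
X <- cbind(1, x)                         # model matrix: intercept and one covariate
beta_true <- c(-0.5, 1)                  # hypothetical coefficients for the simulation
y <- rbinom(n, 1, plogis(drop(X %*% beta_true)))

irls_logistic <- function(X, y, tol = 1e-10, max_iter = 25) {
  beta <- rep(0, ncol(X))
  converged <- FALSE
  for (iter in seq_len(max_iter)) {
    eta <- drop(X %*% beta)              # linear predictor
    mu  <- plogis(eta)                   # inverse logit link
    w   <- mu * (1 - mu)                 # dmu/deta, which equals Var(Y) for the canonical link
    z   <- eta + (y - mu) / w            # working response; its variance is 1 / w
    # Weighted least squares of z on X with the weights held fixed
    beta_new <- drop(solve(crossprod(X, w * X), crossprod(X, w * z)))
    converged <- max(abs(beta_new - beta)) < tol
    beta <- beta_new
    if (converged) break
  }
  list(coefficients = beta, iterations = iter, converged = converged)
}

fit <- irls_logistic(X, y)
ref <- glm(y ~ x, family = binomial)     # reference fit from R's own routine
```

On data like these the hand-written iteration and `glm()` should agree to several decimal places after only a handful of iterations.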

The IRLS (iteratively reweighted least squares) algorithm fixes the weights, determines the parameter values that minimize the weighted sum of squared residuals, then updates the weights and repeats the process until the weights stabilize.

This algorithm converges very quickly. The original description of IRLS from McCullagh and Nelder's book is, I feel, overly complex.

4 Data description and initial exploration

Contraception data

One of the data sets in the "mlmrev" package, derived from data files available on the multilevel modelling web site, is from a fertility survey of women in Bangladesh.

One of the (binary) responses recorded is whether or not the woman currently uses artificial contraception.

Covariates included the woman's age (on a centered scale), the number of live children she had, whether she lived in an urban or rural setting, and the district in which she lived.

Instead of plotting such data as points, we use the 0/1 response to generate scatterplot smoother curves versus age for the different groups.

[Figure: Contraception use versus age by urban and livch. Panels labeled 0, 1, 2 and 3+; groups N and Y; vertical axis "Proportion" (0.0 to 1.0); horizontal axis "Centered age" (about -10 to 20).]
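Smoother curves of this kind can be obtained by smoothing the 0/1 response directly against age within each group. The sketch below uses base R's `lowess` on simulated stand-in data. The data frame `d`, its coefficients and the use of `lowess` itself are assumptions for illustration, since the notes do not say which smoother produced the figure.

```r
set.seed(2)
n <- 400
d <- data.frame(
  age   = runif(n, -14, 20),                                # stand-in for centered age
  urban = factor(sample(c("N", "Y"), n, replace = TRUE))    # stand-in for the urban factor
)
# Hypothetical probability of use, quadratic in age on the logit scale
p <- plogis(-0.8 + 0.01 * d$age - 0.004 * d$age^2 + 0.7 * (d$urban == "Y"))
d$use <- rbinom(n, 1, p)                                    # binary 0/1 response

# One smoother curve per group: lowess returns x (sorted) and the smoothed y
curves <- lapply(split(d, d$urban), function(g) lowess(g$age, g$use))
```

Each element of `curves` can be passed to `lines()` on an existing plot to draw the estimated proportion against age for that group.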

Comments on the data plot

- These observational data are unbalanced (some districts have only 2 observations, some have nearly 120). They are not longitudinal (no time variable).
- Binary responses have low per-observation information content (exactly one bit per observation). Districts with few observations will not contribute strongly to estimates of random effects.
- Within-district plots will be too imprecise so we only examine the global effects in plots.
- The comparisons on the multilevel modelling site are for fits of a model that is linear in age, which is clearly inappropriate.
- The form of the curves suggests at least a quadratic in age.
- The urban versus rural differences may be additive.
- It appears that the livch factor could be dichotomized into 0 versus 1 or more.

Preliminary model fit

    Call:
    glm(formula = use ~ age + I(age^2) + urban + livch, family = binomial,
        data = Contraception)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.4738  -1.0369  -0.6683   1.2401   1.9765

    Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
    (Intercept) -0.9499521  0.1560118  -6.089 1.14e-09
    age          0.0045837  0.0089084   0.515    0.607
    I(age^2)    -0.0042865  0.0007002  -6.122 9.23e-10
    urbanY       0.7680975  0.1061916   7.233 4.72e-13
    livch1       0.7831128  0.1569096   4.991 6.01e-07
    livch2       0.8549040  0.1783573   4.793 1.64e-06
    livch3+      0.8060251  0.1784817   4.516 6.30e-06

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 2590.9  on 1933  degrees of freedom
    Residual deviance: 2417.7  on 1927  degrees of freedom
    AIC: 2431.7

    Number of Fisher Scoring iterations: 4

Comments on the model fit

- There is a highly significant quadratic term in age.
- The linear term in age is not significant but we retain it because the age scale has been centered at an arbitrary value (which, unfortunately, is not provided with the data).
- The urban factor is highly significant (as indicated by the plot).
- Levels of livch greater than 0 are significantly different from 0 but may not be different from each other.
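To make the idea of dichotomizing livch concrete, the sketch below builds a two-level factor and fits both the full and the reduced formula. Because it is meant to run without the mlmrev package, it uses a simulated stand-in data frame `sim` with the same column names as Contraception. The simulated coefficients, and the way `ch` is constructed, are assumptions and not the author's code.

```r
set.seed(3)
n <- 1000
sim <- data.frame(
  age   = runif(n, -14, 20),                                          # stand-in for centered age
  urban = factor(sample(c("N", "Y"), n, replace = TRUE)),
  livch = factor(sample(c("0", "1", "2", "3+"), n, replace = TRUE))
)
# Hypothetical linear predictor used only to generate a response
eta <- -1 + 0.005 * sim$age - 0.004 * sim$age^2 +
  0.8 * (sim$urban == "Y") + 0.8 * (sim$livch != "0")
sim$use <- factor(rbinom(n, 1, plogis(eta)), levels = 0:1, labels = c("N", "Y"))

# Dichotomize livch: "N" for no live children, "Y" for one or more
sim$ch <- factor(sim$livch != "0", labels = c("N", "Y"))

# Full model (four-level livch) and reduced model (two-level ch)
cm1 <- glm(use ~ age + I(age^2) + urban + livch, family = binomial, data = sim)
cm2 <- glm(use ~ age + I(age^2) + urban + ch, family = binomial, data = sim)
```

The reduced model replaces the three livch coefficients with a single coefficient for `ch`, so the two fits differ by two degrees of freedom.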

5 Model building

Reduced model with dichotomized livch

    Call:
    glm(formula = use ~ age + I(age^2) + urban + ch, family = binomial,
        data = Contraception)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.4544  -1.0371  -0.6682   1.2378   1.9790

    Coefficients:
                  Estimate Std. Error z value Pr(>|z|)
    (Intercept) -0.9429355  0.1497300  -6.298 3.02e-10
    age          0.0049675  0.0075889   0.655    0.513
    I(age^2)    -0.0043275  0.0006918  -6.255 3.97e-10
    urbanY       0.7663616  0.1059463   7.233 4.71e-13
    chY          0.8062172  0.1422619   5.667 1.45e-08

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 2590.9  on 1933  degrees of freedom
    Residual deviance: 2417.9  on 1929  degrees of freedom
    AIC: 2427.9

    Number of Fisher Scoring iterations: 4

Comparing the model fits

A likelihood ratio test can be used to compare these nested models.

    > anova(cm2, cm1, test = "Chisq")
    Analysis of Deviance Table

    Model 1: use ~ age + I(age^2) + urban + ch
    Model 2: use ~ age + I(age^2) + urban + livch
      Resid. Df Resid. Dev Df Deviance P(>|Chi|)
    1      1929     2417.9
    2      1927     2417.7  2  0.20525    0.9025

The large p-value indicates that we would not reject cm2 in favor of cm1; hence we prefer the more parsimonious cm2.

The plot of the scatterplot smoothers according to live children or none indicates that there may be a difference in the age pattern between these two groups.

6 Conclusions from the example

- Carefully plotting the data is enormously helpful in formulating the model.

- Observational data tend to be unbalanced and have many more covariates than data from a designed experiment. Formulating a model is typically more difficult than in a designed experiment.
- A generalized linear model is fit with the function glm(), which takes a family argument. The default is gaussian, so for a binary or count response the family must be given explicitly; typical values are binomial or poisson.
- By default, the family will use the canonical link. Other links can be specified as, e.g., binomial(link="cloglog").
- We use likelihood-ratio tests in model building. The z-tests provided in the model summary provide a good indication of the results but should be confirmed by fitting the reduced model and performing a likelihood-ratio test.

A word about overdispersion

In many application areas using pseudo distribution families, such as quasibinomial and quasipoisson, is a popular and well-accepted technique for accommodating variability that is apparently larger than would be expected from a binomial or a Poisson distribution.

This amounts to adding an extra parameter, like σ, the common scale parameter in an LM, to the distribution of the response.

It is possible to form an estimate of such a quantity during the IRLS algorithm but it is an artificial construct. There is no probability distribution with such a parameter.

7 Summary

- GLMs allow for the distribution of Y to be other than a Gaussian. A Bernoulli (or, more generally, a binomial) distribution is used to model binary or binomial responses. A Poisson distribution is used to model responses that are counts.
- The mean depends upon the linear predictor, Xβ, through the inverse link function, $g^{-1}$.
- The MLEs of the parameters are determined through iteratively reweighted least squares (IRLS).
- Wald-type tests on the coefficients are displayed in the summary but to be more confident of the results we should fit both models and use a comparative anova with the optional argument test="Chisq".
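The likelihood-ratio comparison recommended here reduces to a chi-squared tail probability, which can be checked by hand from the figures reported in the analysis of deviance table for the example. The variable names below are illustrative; the numbers are those printed in the table.

```r
# Residual degrees of freedom reported for the reduced (ch) and full (livch) models
df_reduced <- 1929
df_full    <- 1927

# Difference in residual deviance as printed in the analysis of deviance table
lrt_stat <- 0.20525
lrt_df   <- df_reduced - df_full           # 2 extra coefficients in the full model

# Upper-tail chi-squared probability: the p-value of the likelihood-ratio test
p_value <- pchisq(lrt_stat, df = lrt_df, lower.tail = FALSE)
```

Rounded to four decimal places, `p_value` reproduces the 0.9025 shown in the table, which is why the simpler model is retained.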