
Introduction to General and Generalized Linear Models
Generalized Linear Models - part III

Henrik Madsen, Poul Thyregod
Informatics and Mathematical Modelling
Technical University of Denmark
DK-2800 Kgs. Lyngby

October 2010

Today

Test for model reduction
Inference on individual parameters
Confidence intervals
Example
Odds ratio

Test for model reduction

The principles for model reduction in generalized linear models are essentially the same as for classical GLMs. In classical GLMs the deviance is calculated as a (weighted) sum of squares, whereas in generalized linear models the deviance is calculated using the expression for the unit deviance. Apart from this, the major difference is that instead of the exact F-tests used for classical GLMs, the tests in generalized linear models are only approximate tests based on the χ²-distribution. In particular, the principles of successive testing in hypothesis chains, using a Type I or Type III partition of the deviance, carry over to generalized linear models.
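
As a minimal, illustrative sketch of such an approximate χ²-based reduction test (simulated binomial data, not part of the original slides), two nested logistic regression fits can be compared by their drop in deviance:

## Illustrative sketch with simulated binomial data (not from the slides)
set.seed(1)
x <- 1:20
y <- rbinom(length(x), size = 50, prob = plogis(-3 + 0.3 * x))
fit_red  <- glm(cbind(y, 50 - y) ~ 1, family = binomial)  # reduced model
fit_full <- glm(cbind(y, 50 - y) ~ x, family = binomial)  # full model
## The drop in deviance is judged against a chi-square distribution
## (approximate, unlike the exact F-tests of the classical GLM)
anova(fit_red, fit_full, test = "Chisq")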

Test of individual parameters β_j

Theorem (Test of individual parameters β_j - Wald test)
A hypothesis H: β_j = β_j^0 related to a specific value of a parameter is tested by means of the test statistic

u_j = (β̂_j − β_j^0) / sqrt(σ̂² σ̂_jj),

where σ̂² denotes the estimated dispersion parameter (if relevant), and σ̂_jj denotes the j'th diagonal element of Σ̂. Under the hypothesis, u_j is approximately distributed as a standardized normal variable. The test statistic is compared with the quantiles of a standardized normal distribution (some software packages use a t(n − k) distribution instead). The hypothesis is rejected for large values of |u_j|. The p-value is found as p = 2(1 − Φ(|u_j|)).

Test of individual parameters β_j

Theorem (Test of individual parameters β_j - Wald test, continued)
In particular, the test statistic for the hypothesis H: β_j = 0 is

u_j = β̂_j / sqrt(σ̂² σ̂_jj).

An equivalent test is obtained by considering the test statistic z_j = u_j² and rejecting the hypothesis for z_j > χ²_{1−α}(1).
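
Continuing the simulated sketch above (illustrative only), the Wald statistics for H: β_j = 0 can be computed by hand and compared with the z values reported by summary():

## Wald statistics by hand for the simulated fit 'fit_full' (sigma^2 = 1 for the binomial family)
beta_hat <- coef(fit_full)                 # parameter estimates
se_hat   <- sqrt(diag(vcov(fit_full)))     # sqrt(sigma^2 * sigma_jj)
u        <- beta_hat / se_hat              # Wald statistics for H: beta_j = 0
cbind(u = u, p = 2 * (1 - pnorm(abs(u))))  # matches the z values and p-values in summary(fit_full)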

Confidence intervals

Wald interval for individual parameters
An approximate 100(1 − α)% Wald-type confidence interval is obtained as

β̂_j ± u_{1−α/2} sqrt(σ̂² σ̂_jj)

Confidence intervals for fitted values
An approximate 100(1 − α)% confidence interval for the linear prediction is obtained as

η̂_i ± u_{1−α/2} sqrt(σ̂² σ̂_ii)

with σ̂_ii denoting the i'th diagonal element of XΣ̂Xᵀ. The corresponding interval for the fitted value μ̂_i is obtained by applying the inverse link transformation g⁻¹(·) to the confidence limits.
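
Still using the simulated sketch above (illustrative only), both types of interval can be computed directly; the interval for a fitted value is formed on the linear-predictor scale and then transformed back with the inverse link:

## Wald intervals for the parameters of the simulated fit 'fit_full'
cbind(coef(fit_full) - qnorm(0.975) * sqrt(diag(vcov(fit_full))),
      coef(fit_full) + qnorm(0.975) * sqrt(diag(vcov(fit_full))))

## Interval for a fitted value: compute on the link scale, then back-transform
pr    <- predict(fit_full, se.fit = TRUE, type = "link")
lower <- plogis(pr$fit - qnorm(0.975) * pr$se.fit)   # g^{-1} applied to the limits
upper <- plogis(pr$fit + qnorm(0.975) * pr$se.fit)
cbind(fitted(fit_full), lower, upper)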

Example: Link functions for binary response regression

An experiment testing the insulation effect of a gas (SF6) was conducted. In the experiment a gaseous insulation was subjected to 100 high-voltage pulses at a specified voltage, and it was recorded whether the insulation broke down (spark) or not. After each pulse the insulation was re-established. The experiment was repeated at twelve voltage levels from 1065 kV to 1135 kV.

Voltage (kV)   1065  1071  1075  1083  1089  1094  1100  1107  1111  1120  1128  1135
Breakdowns        2     3     5    11    10    21    29    48    56    88    98    99
Trials          100   100   100   100   100   100   100   100   100   100   100   100

Table: The insulation effect of a gas (SF6)

Example: Link functions for binary response regression

As the insulation was restored after each voltage application, it seems reasonable to assume that the trials were independent. At each trial the response is binary (breakdown / no breakdown), and therefore it seems appropriate to use a binomial distribution model for the experiment.

We shall assume that the data are stored in an R object dat with the variables Volt, Breakd and Trials, as sketched below.

Let Z_i denote the number of breakdowns out of the n_i pulses at voltage x_i. We shall then use the model Z_i ~ B(n_i, p_i) with n_i = 100 and p_i = p(x_i), where p(x) is some suitable dose-response function.
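
A minimal sketch of how dat could be entered from the table on the previous slide (the variable names are those stated above):

## Enter the SF6 breakdown data from the table (one row per voltage level)
dat <- data.frame(
  Volt   = c(1065, 1071, 1075, 1083, 1089, 1094,
             1100, 1107, 1111, 1120, 1128, 1135),
  Breakd = c(2, 3, 5, 11, 10, 21, 29, 48, 56, 88, 98, 99),
  Trials = rep(100, 12))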

Logit transformation - logistic regression

The logistic regression model is of the form

g(p) = η = ln( p / (1 − p) ) = β_1 + β_2 x

p(x) = exp(η) / (1 + exp(η)) = exp(β_1 + β_2 x) / (1 + exp(β_1 + β_2 x))

We use the following commands to fit the model:

dat$resp <- cbind(dat$Breakd, dat$Trials - dat$Breakd)
fit1 <- glm(resp ~ Volt, family = binomial(link = logit), data = dat)

Logit link

> summary(fit1)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.7572  -0.8518   0.9359   1.2977   2.3466

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.277e+02  7.061e+00  -18.08   <2e-16 ***
Volt         1.155e-01  6.396e-03   18.05   <2e-16 ***
---
(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 783.122  on 11  degrees of freedom
Residual deviance:  21.018  on 10  degrees of freedom
AIC: 70.613

Number of Fisher Scoring iterations: 4

Logit link

From the output we can make a deviance table:

Source             f   Deviance   Mean deviance
Model H_M          1    762.10        762.10
Residual (Error)  10     21.018         2.102
Corrected total   11    783.12         71.193

The p-value corresponding to the goodness-of-fit statistic D(y; μ(β̂)) = 21.018 is assessed by calculating

> pval <- 1 - pchisq(21.01776, 10)

leading to pval = 0.02097. Thus, H_logist is rejected at any significance level greater than 2%.

Logit link

Also, a look at the deviance residuals:

> residuals(fit1)
         1           2          3          4           5          6
  1.016293   0.8554846    1.22456   1.586976  -0.7922914  0.1608718
         7           8          9         10          11         12
 -1.030256   -1.083212  -1.757153   1.204659    2.346577   1.517043

They indicate underestimation in the tails and overestimation in the central part of the curve.

The probit link

The transformation

g(p) = η = Φ⁻¹(p) = β_1 + β_2 x

p(x) = Φ(η) = Φ(β_1 + β_2 x)

with Φ(·) denoting the cumulative distribution function of the standardized normal distribution, is termed the probit transformation. The function tends towards 0 and 1 as x → −∞ and x → +∞, respectively. The convergence is faster than for the logistic transformation. There is a long tradition in the biomedical literature of using the probit transformation.

We fit the model with:

fit2 <- glm(resp ~ Volt, family = binomial(link = probit), data = dat)

The probit link

> summary(fit2)

Deviance Residuals:
   Min      1Q  Median      3Q     Max
-1.653  -1.252   1.250   1.421   2.395

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -71.105019   3.451897  -20.60   <2e-16 ***
Volt          0.064325   0.003129   20.56   <2e-16 ***
---
(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 783.122  on 11  degrees of freedom
Residual deviance:  26.215  on 10  degrees of freedom
AIC: 75.81

Number of Fisher Scoring iterations: 5

The probit link

From the output we can make a deviance table:

Source             f   Deviance   Mean deviance
Model H_M          1    756.907       756.907
Residual (Error)  10     26.215         2.622
Corrected total   11    783.122        71.193

The p-value corresponding to the goodness-of-fit statistic D(y; μ(β̂)) = 26.215 is assessed by calculating

> pval <- 1 - pchisq(26.215, 10)

leading to pval = 0.00346. Thus, H_probit is rejected at any significance level greater than 0.35%. The fit is not satisfactory.

The probit link

Again, a look at the deviance residuals:

> residuals(fit2)
         1           2          3          4           5           6
  1.666163    1.238981     1.3965    1.26034   -1.357938  -0.5152912
         7           8          9         10          11          12
 -1.562847   -1.216113  -1.653013   1.492566    2.394602    1.280603

They indicate systematic underestimation in both tails, and overestimation in the central part.

Complementary log-log link

The transformation

g(p) = η = ln( −ln(1 − p) ) = β_0 + β_1 x

p(x) = 1 − exp[ −exp(β_0 + β_1 x) ]

is termed the complementary log-log transformation. The response function is asymmetrical: it increases slowly away from 0, whereas it approaches 1 in a rather steep manner.

We fit the model with:

fit3 <- glm(resp ~ Volt, family = binomial(link = cloglog), data = dat)

Complementary log-log link

> summary(fit3)

Deviance Residuals:
     Min        1Q    Median        3Q       Max
-1.36540  -0.37472   0.04829   0.33605   1.00952

Coefficients:
              Estimate Std. Error z value Pr(>|z|)
(Intercept) -91.106296   4.601803  -19.80   <2e-16 ***
Volt          0.081900   0.004147   19.75   <2e-16 ***
---
(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 783.122  on 11  degrees of freedom
Residual deviance:   5.671  on 10  degrees of freedom
AIC: 55.266

Number of Fisher Scoring iterations: 4

Complementary log-log link

From the output we can make a deviance table:

Source             f   Deviance   Mean deviance
Model H_M          1    777.451       777.451
Residual (Error)  10      5.671         0.567
Corrected total   11    783.122        71.193

The p-value corresponding to the goodness-of-fit statistic D(y; μ(β̂)) = 5.671 is assessed by calculating

> pval <- 1 - pchisq(5.671, 10)

leading to pval = 0.8421. Thus, the data do not provide any evidence against the cloglog model.

Complementary log-log link

This is further supported by the deviance residuals:

> residuals(fit3)
          1           2           3           4           5           6
-0.02716415  -0.1760891   0.2060522   0.8231332    -1.11485   0.2832367
          7           8           9          10          11          12
 -0.2986162    0.123737  -0.6030393    1.009516   0.4944959   -1.365395

There is no systematic pattern in the residuals, and all residuals lie within ±2.
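
As a quick cross-check (not on the original slide), the three fits can also be compared by AIC, assuming fit1, fit2 and fit3 from the previous slides:

## Compare the three link functions by AIC (smaller is better)
AIC(fit1, fit2, fit3)   # logit: 70.613, probit: 75.81, cloglog: 55.266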

Logit/probit/cloglog

Figure: Probability of breakdown p(x) for an insulator as a function of applied pulse voltage (Volt, roughly 1070-1130 kV). The curves correspond to the different assumptions on the functional form of the relation (logit, probit and cloglog links).
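
A minimal sketch of how a figure like this could be redrawn from the fitted models (assuming dat, fit1, fit2 and fit3 from the previous slides):

## Observed breakdown fractions and the three fitted dose-response curves
newdat <- data.frame(Volt = seq(1065, 1135, length.out = 200))
plot(dat$Volt, dat$Breakd / dat$Trials, xlab = "Volt", ylab = "p(x)", ylim = c(0, 1))
lines(newdat$Volt, predict(fit1, newdata = newdat, type = "response"), lty = 1)  # logit
lines(newdat$Volt, predict(fit2, newdata = newdat, type = "response"), lty = 2)  # probit
lines(newdat$Volt, predict(fit3, newdata = newdat, type = "response"), lty = 3)  # cloglog
legend("topleft", c("Logit", "Probit", "cloglog"), lty = 1:3)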

Odds ratio

If an event occurs with probability p, then the odds in favor of the event are

Odds = p / (1 − p)

A comparison between two groups can be made by computing the odds ratio:

OR = [ p_1 / (1 − p_1) ] / [ p_2 / (1 − p_2) ]

An odds ratio larger than 1 indicates that the event is more likely in the first group than in the second group.
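
A small illustrative sketch (not from the slides): computing an odds ratio directly, and the odds ratio implied by the logistic fit fit1 for a hypothetical 10 kV increase in voltage (recall that under the logit link, ln OR is linear in x):

## Odds ratio for two probabilities
odds <- function(p) p / (1 - p)
odds(0.8) / odds(0.5)            # OR = 4: higher odds of the event in the first group

## Under the logit model, a difference of dx in the covariate gives OR = exp(beta_2 * dx)
exp(10 * coef(fit1)["Volt"])     # odds ratio of breakdown per +10 kV (hypothetical contrast)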