R Hints for Chapter 10

The multiple logistic regression model assumes that the success probability $p$ for a binomial random variable depends on independent variables or design variables $x_1, x_2, \ldots, x_k$. A factor variable with $m$ levels is numerically coded with $m-1$ indicator variables that have values of either 0 or 1, in the manner described for Chapter 9. We will assume that all the factor variables have been coded this way, so the $x$'s are all numeric. The relationship between $p$ and the design variables is given by the logistic regression equation

$$\operatorname{logit}(p) = \log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k.$$

So, it is the log odds on success that is expressed as a linear function of the design variables.

The data consist of $N$ values of each of the design variables and corresponding values of the binomial random variable arising from them:

$$Y_i \sim \mathrm{Binom}(n_i, p_i), \qquad i = 1, \ldots, N,$$

where

$$\operatorname{logit}(p_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}.$$

The logistic regression coefficients $\beta_0, \ldots, \beta_k$ are unknown and must be estimated from the data. They are not estimated by least squares, but rather by maximum likelihood estimation. Estimates $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$ are chosen to maximize the log-likelihood function

$$\ell = \sum_{i=1}^{N} \left[ Y_i \log \hat p_i + (n_i - Y_i) \log(1 - \hat p_i) \right] \tag{1}$$

with $\operatorname{logit}(\hat p_i) = \hat\beta_0 + \hat\beta_1 x_{i1} + \cdots + \hat\beta_k x_{ik}$. There are no explicit solutions that you can write down using elementary functions. The numerical maximization procedure is a variant of the Newton-Raphson procedure called Fisher scoring. R does all the calculations for you and reports everything you need to know about the estimates with the function glm(), which stands for generalized linear model.
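To see concretely what maximum likelihood estimation is doing, here is a minimal sketch that maximizes the log-likelihood (1) directly with optim() and compares the answer to glm(). The data and coefficients are simulated purely for illustration; in practice glm() does all of this for you.

# Simulated Bernoulli data (all n_i = 1); the true coefficients are invented
set.seed(1)
N <- 200
x <- rnorm(N)
y <- rbinom(N, size = 1, prob = plogis(-0.5 + 1.2 * x))

# Negative of the log-likelihood (1), as a function of beta = (beta0, beta1)
negloglik <- function(beta) {
  phat <- plogis(beta[1] + beta[2] * x)  # inverse logit of the linear predictor
  -sum(y * log(phat) + (1 - y) * log(1 - phat))
}

optim(c(0, 0), negloglik)$par        # direct numerical maximization
coef(glm(y ~ x, family = binomial))  # Fisher scoring finds essentially the same values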

Here is an example where all the $n_i$ are equal to 1 and all the $Y_i$ are Bernoulli variables. Shown below are the first 20 rows of the paindata data set. We will take trt (treatment) and age as the independent variables and painimproved as the response. trt is a factor with two levels A and B, and age is a continuous numeric variable. painimproved is a logical variable with values TRUE and FALSE. In R, TRUE has a numeric value of 1 and FALSE has a numeric value of 0, so we do not have to convert painimproved to a numeric vector ourselves.

> paindata[1:20,]
   trt female age injurysource pain0 pain30 painchange painimproved
1    A      N  31            Y     2      1         -1         TRUE
2    A      Y  50            N     1      2          1        FALSE
3    A      N  31            Y     2      2          0        FALSE
4    A      Y  55            Y     2      4          2        FALSE
5    A      Y  35            N     4      2         -2         TRUE
6    A      Y  46            N     4      4          0        FALSE
7    A      N  51            N     1      2          1        FALSE
8    A      Y  52            Y     2      2          0        FALSE
9    A      N  46            Y     3      2         -1         TRUE
10   A      Y  48            N     3      2         -1         TRUE
11   A      Y  46            Y     1      4          3        FALSE
12   A      Y  34            N     2      2          0        FALSE
13   A      N  48            Y     2      1         -1         TRUE
14   A      N  41            N     2      1         -1         TRUE
15   A      N  40            N     3      3          0        FALSE
16   A      Y  53            Y     3      2         -1         TRUE
17   A      N  55            Y     4      2         -2         TRUE
18   A      Y  40            N     3      3          0        FALSE
19   A      Y  48            Y     3      2         -1         TRUE
20   A      N  33            Y     3      1         -2         TRUE

> pain.glm=glm(painimproved~trt+age,data=paindata,family=binomial)
> summary(pain.glm)

Call:
glm(formula = painimproved ~ trt + age, family = binomial, data = paindata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.2680  -0.9211  -0.8535   1.1564   1.5587

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept)  0.400475   1.372952   0.292    0.771
trtB        -0.875767   0.613143  -1.428    0.153
age         -0.007307   0.029940  -0.244    0.807

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 68.029  on 49  degrees of freedom
Residual deviance: 65.901  on 47  degrees of freedom
AIC: 71.901

Number of Fisher Scoring iterations: 4

The intercept $\beta_0$ is the log odds on pain improvement when trt has its base level A and when age = 0. Its estimated value is 0.400475. The next coefficient $\beta_1$, with an estimated value of -0.875767, is the difference in log odds on improvement between treatment B and treatment A. In other words, it is the log of the odds ratio. For any fixed age, the log odds ratio on improvement for the two treatments is estimated to be -0.875767. The age coefficient $\beta_2$, with an estimated value of -0.007307, is the change in log odds on improvement for a unit increase in age; the negative sign means the log odds on improvement decrease with age.
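Because each of these coefficients is a log odds (the intercept) or a log odds ratio, exponentiating them moves everything to the odds scale. A quick sketch, assuming the pain.glm fit above; confint.default() gives approximate Wald-type intervals.

exp(coef(pain.glm))             # roughly 1.49 (odds), 0.417 and 0.993 (odds ratios)
exp(confint.default(pain.glm))  # approximate confidence intervals on the odds scale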

Here is an exercise in using this information.

Question: What is the estimated odds ratio on improvement for two patients receiving the same treatment and 10 years apart in age?

Answer: The log odds ratio is the difference in log odds: $10(-0.007307) = -0.07307$. Therefore, the odds ratio is $e^{-0.07307} = 0.9295$.

Question: What is the difference in log odds between a patient receiving treatment A and another patient 10 years older receiving treatment B?

Answer: $-0.875767 + 10(-0.007307) = -0.948837$.

This is called an additive model because the effects of treatment level and age on the log odds add together in this simple fashion. There are no interactions between treatment and age.

Question: What are the odds on improvement for a 60 year old patient who is receiving treatment B?

Answer: The log odds are $0.400475 - 0.875767 + 60(-0.007307) = -0.913712$. The odds are $e^{-0.913712} = 0.4010$.

The R function predict() will calculate the fitted log odds for you, like this:

> predict(pain.glm,newdata=data.frame(trt="B",age=60))
         1 
-0.9137311 

Question: What is the probability that this patient improves?

Answer: Pr(improvement) = odds/(1 + odds) = 0.4010/1.4010 = 0.2862.

By default, the predict() function returns the predicted log odds. You can get the predicted probability by including the type argument.

> predict(pain.glm,newdata=data.frame(trt="B",age=60),type="response")
        1 
0.286237 
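The hand calculations above are easy to script. A small sketch, assuming the additive pain.glm fit; plogis() is R's inverse logit function, exp(x)/(1 + exp(x)).

b <- coef(pain.glm)                                  # (Intercept), trtB, age
lo <- b["(Intercept)"] + b["trtB"] + 60 * b["age"]   # log odds: treatment B, age 60
exp(lo)     # odds, about 0.4010
plogis(lo)  # probability, about 0.2862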

In the preceding example, since interactions were not allowed, there are two log odds functions of age with the same slope and different intercepts, one for each level of the factor trt. They would be plotted as parallel lines. If interactions are allowed, the slopes will also differ. In other words, the treatment type alters the rate at which increasing age affects the log odds on improvement. Below is the refitted model allowing interactions.

> pain.glm=update(pain.glm,.~trt*age)
> summary(pain.glm)

Call:
glm(formula = painimproved ~ trt + age + trt:age, family = binomial,
    data = paindata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.3318  -1.0277  -0.7884   1.1522   1.6893

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.00134    2.21876  -0.451    0.652
trtB         1.22766    2.66647   0.460    0.645
age          0.02468    0.04980   0.496    0.620
trtB:age    -0.05069    0.06272  -0.808    0.419

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 68.029  on 49  degrees of freedom
Residual deviance: 65.240  on 46  degrees of freedom
AIC: 73.24

Number of Fisher Scoring iterations: 4

Question: What are the odds on improvement for a 60 year old patient who is receiving treatment B?

Answer: For treatment A the intercept is -1.00134 and the age slope is 0.02468. For treatment B the intercept is $-1.00134 + 1.22766 = 0.22632$ and the slope is $0.02468 - 0.05069 = -0.02601$. Therefore, the log odds are $0.22632 - 60(0.02601) = -1.33428$, the odds are $e^{-1.33428} = 0.2633$, and the probability of improvement is $0.2633/(1 + 0.2633) = 0.2085$.

The predict() function gives the same answers.

> predict(pain.glm,newdata=data.frame(trt="B",age=60))
       1 
-1.33437 
> predict(pain.glm,newdata=data.frame(trt="B",age=60),type="response")
        1 
0.2084374 
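One way to see the effect of the interaction is to plot the fitted log odds against age separately for each treatment; without the interaction term the two lines are parallel, with it they are not. A sketch, assuming the interaction fit pain.glm above; the age range is arbitrary.

ages <- data.frame(age = 20:70)
loA <- predict(pain.glm, newdata = cbind(ages, trt = "A"))  # link (log odds) scale
loB <- predict(pain.glm, newdata = cbind(ages, trt = "B"))
plot(ages$age, loA, type = "l", xlab = "age",
     ylab = "fitted log odds of improvement", ylim = range(loA, loB))
lines(ages$age, loB, lty = 2)
legend("topright", c("treatment A", "treatment B"), lty = 1:2)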

Aggregated Data

Data in raw form is like that in paindata, where each observation of the response is Bernoulli, with only two possible values such as Yes/No, Male/Female, TRUE/FALSE, or 0/1. Sometimes data is presented in aggregated form, where the numbers of successes and failures for each distinct value of $(x_1, x_2, \ldots, x_k)$ are tabulated. Here is the part of Table E6.21 for myocardial infarction (heart attack).

  cases controls drink gender
1   142      197     N      M
2   136      201     Y      M
3    41      144     N      F
4    19      122     Y      F

The two independent variables are binary factors: drink = N or Y (was the subject a drinker?) and gender = F or M. For each combination of factor levels, cases is the number of subjects who suffered a heart attack (success) and controls is the number who didn't. For data aggregated like this, the response term in the R formula must be a two-column matrix, successes in the first column and failures in the second.

> prob6.21.glm=glm(cbind(cases,controls)~drink+gender,data=prob6.21,family=binomial)
> summary(prob6.21.glm)

Call:
glm(formula = cbind(cases, controls) ~ drink + gender, family = binomial,
    data = prob6.21)

Deviance Residuals:
      1        2        3        4  
-0.5268   0.5350   0.8754  -1.1110  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -1.4135     0.1535  -9.209  < 2e-16 ***
drinkY       -0.1807     0.1379  -1.311     0.19    
genderM       1.1441     0.1634   7.000 2.56e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 58.2988  on 3  degrees of freedom
Residual deviance:  2.5645  on 1  degrees of freedom
AIC: 31.004

Number of Fisher Scoring iterations: 4

The estimated log of the ratio of odds on a heart attack for drinkers compared to non-drinkers is -0.1807. In other words, drinking appears to lessen the odds on a heart attack. Notice, however, that the p-value is 19%, so we are not justified in drawing this conclusion.
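An equivalent way to hand glm() aggregated binomial data is to give the proportion of successes as the response and the group totals as prior weights; the two forms produce the same fit. A sketch, assuming the prob6.21 data frame above; the column name total is invented.

prob6.21$total <- prob6.21$cases + prob6.21$controls
glm(cases/total ~ drink + gender, weights = total,
    data = prob6.21, family = binomial)  # same coefficients as the cbind() form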

The variable pain0 (baseline pain level) in the paindata data frame is numeric but has only 5 distinct values. If you want to construct a logistic regression model with pain0 and trt as independent variables, you can aggregate the data as follows to create a new data frame.

> paindata2=aggregate(cbind(painimproved,1-painimproved)~trt+pain0,data=paindata,FUN=sum)
> paindata2
  trt pain0 painimproved V2
1   A     1            0  3
2   B     1            0  5
3   A     2            3  6
4   B     2            0  7
5   A     3            8  2
6   B     3            6  5
7   A     4            2  1
8   B     5            2  0
> names(paindata2)[4]="notimproved"
> paindata2
  trt pain0 painimproved notimproved
1   A     1            0           3
2   B     1            0           5
3   A     2            3           6
4   B     2            0           7
5   A     3            8           2
6   B     3            6           5
7   A     4            2           1
8   B     5            2           0

Then fit the model.

> pain.glm2=glm(cbind(painimproved,notimproved)~trt+pain0,data=paindata2,family=binomial)
> summary(pain.glm2)

Call:
glm(formula = cbind(painimproved, notimproved) ~ trt + pain0,
    family = binomial, data = paindata2)

Deviance Residuals:
      1        2        3        4        5        6        7        8  
-0.5616  -0.4069   0.3280  -1.2707   0.4223   0.4850  -1.6016   0.2871  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)   
(Intercept)  -4.9106     1.6453  -2.985  0.00284 **
trtB         -1.1734     0.7501  -1.564  0.11777   
pain0         1.9911     0.6261   3.180  0.00147 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 27.5866  on 7  degrees of freedom
Residual deviance:  5.2644  on 5  degrees of freedom
AIC: 20.764

Number of Fisher Scoring iterations: 5
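Aggregating the data changes the reported deviances (because the saturated model changes) but not the fitted coefficients. As a check, assuming paindata and pain.glm2 from above, fitting the same model to the raw Bernoulli data should reproduce the estimates.

pain.raw <- glm(painimproved ~ trt + pain0, data = paindata, family = binomial)
cbind(raw = coef(pain.raw), aggregated = coef(pain.glm2))  # the columns should agree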

The formula cbind(painimproved,1-painimproved)~trt+pain0 in the aggregate function worked because painimproved is a logical variable with numeric values 0 and 1. Also, the numeric variable pain0 has only a small number of values. When the response is a factor rather than a logical variable, it is better to aggregate as below. As an example, we will use the radon.leukemia data, with the case/control indicator DIS as the binary response and with DOWNS and RADON as independent variables. RADON is a continuous variable with many distinct values, so we will discretize it by locating each measurement in a class interval, similar to the way it is done with the histogram function. The R function for doing this is cut(). The intervals begin at 0 and end at 20 with width 4.

> aggregate(DIS~DOWNS+cut(RADON,seq(0,20,4)),data=radon.leukemia,FUN=table)
   DOWNS cut(RADON, seq(0, 20, 4)) DIS.case DIS.control
1      1                      (0,4]        1           0
2      2                      (0,4]       32          49
3      1                      (4,8]        5           2
4      2                      (4,8]       10          14
5      1                     (8,12]       16          24
6      2                     (8,12]        9          12
7      1                    (12,16]       12          16
8      2                    (12,16]       10          12
9      1                    (16,20]        4           2
10     2                    (16,20]        2           1
> leukdata=.Last.value

This is a data frame with cumbersome names. You can change them if you like.

> names(leukdata)=c("Downs","Radon.grp","Disease")
> leukdata
   Downs Radon.grp Disease.case Disease.control
1      1     (0,4]            1               0
2      2     (0,4]           32              49
3      1     (4,8]            5               2
4      2     (4,8]           10              14
5      1    (8,12]           16              24
6      2    (8,12]            9              12
7      1   (12,16]           12              16
8      2   (12,16]           10              12
9      1   (16,20]            4               2
10     2   (16,20]            2               1
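If cut() is unfamiliar, here is a quick stand-alone illustration with made-up values; each value is labeled with the half-open interval it falls into.

cut(c(1.2, 7.5, 19.0), breaks = seq(0, 20, 4))
# [1] (0,4]   (4,8]   (16,20]
# Levels: (0,4] (4,8] (8,12] (12,16] (16,20]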

In the leukdata data frame, Disease is already a two-column matrix of successes and failures, so you don't have to use the cbind function in the formula.

> leukdata.glm=glm(Disease~Downs+Radon.grp,data=leukdata,family=binomial)
> summary(leukdata.glm)

Call:
glm(formula = Disease ~ Downs + Radon.grp, family = binomial,
    data = leukdata)

Deviance Residuals:
       1         2         3         4         5         6         7         8         9        10  
 1.27170  -0.12579   1.05679  -0.56229  -0.31584   0.43923  -0.32715   0.37100  -0.06982   0.09686  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)      -0.04040    0.72340  -0.056    0.955
Downs            -0.17856    0.34590  -0.516    0.606
Radon.grp(4,8]    0.29262    0.43072   0.679    0.497
Radon.grp(8,12]  -0.08489    0.41148  -0.206    0.837
Radon.grp(12,16]  0.05588    0.41064   0.136    0.892
Radon.grp(16,20]  0.97279    0.77506   1.255    0.209

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.6580  on 9  degrees of freedom
Residual deviance: 3.6176  on 4  degrees of freedom
AIC: 45.131

Number of Fisher Scoring iterations: 3

Deviances and ANOVA

Consider the log-likelihood function in (1) as a function of estimates $\hat p_1, \hat p_2, \ldots, \hat p_N$ of the success probabilities for the $N$ replications of the experiment. If we don't assume that they are given by the logistic regression equation and instead allow them to be completely unrestricted, then the log-likelihood function is maximized when $\hat p_i = Y_i / n_i$. Its maximum value is called the saturated log-likelihood, denoted by $\ell_{sat}$.

The model log-likelihood is the maximum value of (1) when the $\hat p_i$ are the maximum likelihood estimators assuming the logistic regression model. It is designated by $\ell_{model}$.

The null log-likelihood is the maximum value of (1) when it is assumed that all the regression parameters $\beta_1, \beta_2, \ldots, \beta_k$ except the intercept $\beta_0$ are equal to zero. In other words, it is assumed that $p_1, \ldots, p_N$ all have a common value $p$. The null log-likelihood is denoted by $\ell_{null}$.

The residual deviance is

$$D(\text{resid}) = 2(\ell_{sat} - \ell_{model}).$$

The null deviance is

$$D(\text{null}) = 2(\ell_{sat} - \ell_{null})$$

and the regression deviance is

$$D(\text{regr}) = 2(\ell_{model} - \ell_{null}).$$

Think of these quantities as being analogous to the residual sum of squares, the total sum of squares, and the regression sum of squares in multiple linear regression problems. They satisfy a similar equation:

$$D(\text{null}) = D(\text{regr}) + D(\text{resid}).$$

If the null hypothesis

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$$

is true and $N$ is large, $D(\text{regr})$ has an approximate chi-square distribution with $k$ degrees of freedom, so it can be used to test $H_0$. Reject $H_0$ if the p-value of $D(\text{regr})$ is too small. In the example just above, the observed value is

$$D(\text{regr}) = 6.6580 - 3.6176 = 3.0404,$$

with p-value

> 1-pchisq(3.0404,df=5)
[1] 0.6937572

Since the p-value is so large, we cannot conclude that any of the regression coefficients are different from 0.

An anova table breaks $D(\text{regr})$ down into contributions from each variable in the model. It is constructed step by step, starting with the null model corresponding to $H_0$ above and adding one variable at a time. The increment in the regression deviance for each variable is shown, as well as the residual deviance after the variable is added.

> anova(leukdata.glm,test="Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: Disease

Terms added sequentially (first to last)

          Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL                          9     6.6580         
Downs      1  0.46089          8     6.1971   0.4972
Radon.grp  4  2.57943          4     3.6176   0.6305

The p-value of 0.6305 indicates that the increment in regression deviance, which equals the decrement in residual deviance, 2.57943 = 6.1971 - 3.6176, is not significant: adding the variable Radon.grp to the model that already contains Downs does not significantly improve the fit.
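The deviance identities above can be verified directly from the fitted object. A short sketch, assuming the leukdata.glm fit.

D.null  <- leukdata.glm$null.deviance  # 2*(l_sat - l_null)  = 6.6580
D.resid <- deviance(leukdata.glm)      # 2*(l_sat - l_model) = 3.6176
D.regr  <- D.null - D.resid            # 2*(l_model - l_null) = 3.0404
1 - pchisq(D.regr, df = 5)             # p-value of the overall test, about 0.694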