R Hints for Chapter 10


Mervin Palmer

The multiple logistic regression model assumes that the success probability $p$ of a binomial random variable depends on independent variables, or design variables, $x_1, x_2, \ldots, x_k$. A factor variable with $m$ levels is coded numerically with $m-1$ indicator variables that take the values 0 or 1, in the manner described for Chapter 9. We will assume that all the factor variables have been coded this way, so the $x$'s are all numeric. The relationship between $p$ and the design variables is given by the logistic regression equation

\[ \operatorname{logit}(p) = \log\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_k x_k. \]

So it is the log-odds on success that is expressed as a linear function of the design variables. The data consist of $N$ values of each of the design variables and the corresponding values of the binomial random variables arising from them:

\[ Y_i \sim \operatorname{Binom}(n_i, p_i), \quad i = 1, \ldots, N, \]

where

\[ \operatorname{logit}(p_i) = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_k x_{ik}. \]

The logistic regression coefficients $\beta_0, \ldots, \beta_k$ are unknown and must be estimated from the data. They are estimated not by least squares but by maximum likelihood. Estimates $\hat\beta_0, \hat\beta_1, \ldots, \hat\beta_k$ are chosen to maximize the log-likelihood function

\[ l = \sum_{i=1}^{N} \left[ Y_i \log p_i + (n_i - Y_i) \log(1 - p_i) \right] \tag{1} \]

with $\operatorname{logit}(p_i) = \beta_0 + \beta_1 x_{i1} + \cdots + \beta_k x_{ik}$. There are no explicit solutions that you can write down using elementary functions. The numerical maximization procedure is a variant of the Newton-Raphson procedure called Fisher scoring. R does all the calculations for you and reports everything you need to know about the estimates with the function glm( ), which stands for generalized linear model.

Here is an example where all the $n_i$ are equal to 1 and all the $Y_i$ are Bernoulli variables. Shown below are the first 20 rows of the paindata data set. We will take trt (treatment) and age as the independent variables and painimproved as the response. trt is a factor with two levels, A and B, and age is a continuous numeric variable. painimproved is a logical variable with values TRUE and FALSE. In R, TRUE has a numeric value of 1 and FALSE has a numeric value of 0, so we do not have to convert painimproved to a numeric vector ourselves.
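The Fisher scoring procedure mentioned above can be sketched in a few lines of R. This is only an illustration on simulated data, not the internals of glm( ): the data, the true coefficients (-0.5 and 1.2) and the variable names are all made up for the example. For the logit link, Fisher scoring coincides with Newton-Raphson, so each step solves a linear system built from the score vector and the information matrix.

```r
# Hypothetical simulated Bernoulli data (all n_i = 1); not the paindata set.
set.seed(1)
n <- 200
x <- rnorm(n)
y <- rbinom(n, size = 1, prob = plogis(-0.5 + 1.2 * x))

X    <- cbind(1, x)   # design matrix with an intercept column
beta <- c(0, 0)       # starting values

for (iter in 1:25) {
  p     <- as.vector(plogis(X %*% beta))   # current fitted probabilities
  score <- t(X) %*% (y - p)                # gradient of the log-likelihood (1)
  info  <- t(X) %*% (X * (p * (1 - p)))    # Fisher information matrix
  step  <- solve(info, score)
  beta  <- beta + as.vector(step)
  if (max(abs(step)) < 1e-10) break        # stop once the update is negligible
}

fit <- glm(y ~ x, family = binomial)
cbind(by_hand = beta, glm = coef(fit))     # the two columns should agree closely
```

The hand-rolled loop and glm( ) maximize the same log-likelihood, so their estimates should match up to numerical tolerance.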

> paindata[1:20,]
   trt female age injurysource pain0 pain30 painchange painimproved
1    A      N  31            Y   ...    ...        ...         TRUE
2    A      Y  50            N   ...    ...        ...        FALSE
3    A      N  31            Y   ...    ...        ...        FALSE
4    A      Y  55            Y   ...    ...        ...        FALSE
5    A      Y  35            N   ...    ...        ...         TRUE
6    A      Y  46            N   ...    ...        ...        FALSE
7    A      N  51            N   ...    ...        ...        FALSE
8    A      Y  52            Y   ...    ...        ...        FALSE
9    A      N  46            Y   ...    ...        ...         TRUE
10   A      Y  48            N   ...    ...        ...         TRUE
11   A      Y  46            Y   ...    ...        ...        FALSE
12   A      Y  34            N   ...    ...        ...        FALSE
13   A      N  48            Y   ...    ...        ...         TRUE
14   A      N  41            N   ...    ...        ...         TRUE
15   A      N  40            N   ...    ...        ...        FALSE
16   A      Y  53            Y   ...    ...        ...         TRUE
17   A      N  55            Y   ...    ...        ...         TRUE
18   A      Y  40            N   ...    ...        ...        FALSE
19   A      Y  48            Y   ...    ...        ...         TRUE
20   A      N  33            Y   ...    ...        ...         TRUE

> pain.glm=glm(painimproved~trt+age,data=paindata,family=binomial)
> summary(pain.glm)

Call:
glm(formula = painimproved ~ trt + age, family = binomial, data = paindata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
    ...      ...      ...      ...      ...

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)      ...        ...     ...      ...
trtB             ...        ...     ...      ...
age              ...        ...     ...      ...

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: ... on 49 degrees of freedom
Residual deviance: ... on 47 degrees of freedom
AIC: ...

Number of Fisher Scoring iterations: 4

The intercept $\beta_0$ is the log-odds on pain improvement when trt has its base level A and age = 0. Its estimated value is [...]. The next coefficient $\beta_1$, with an estimated value of [...], is the difference in log odds on improvement between treatment B and treatment A. In other words, it is the log of the odds ratio. For any fixed age, the log odds ratio on improvement for the two treatments is estimated to be [...]. The age coefficient $\beta_2$, with an estimated value of [...], is the increase in log odds on improvement for a unit increase in age. The negative sign means that it is actually a decrease in log odds. Here is an exercise in using this information.

Question: What is the estimated odds ratio on improvement for two patients receiving the same treatment and 10 years apart in age?

Answer: The log odds ratio is the difference in log odds: $10\hat\beta_2$ = 10 × ([...]) = [...]. Therefore, the odds ratio is $e^{[...]}$ = [...].

Question: What is the difference in log odds between a patient receiving treatment A and another patient 10 years older receiving treatment B?

Answer: $\hat\beta_1 + 10\hat\beta_2$ = [...] + 10 × ([...]) = [...]. This is called an additive model because the effects of treatment level and age on the log odds add together in this simple fashion. There are no interactions between treatment and age.

Question: What are the odds on improvement for a 60 year old patient who is receiving treatment B?

Answer: The log odds are $\hat\beta_0 + \hat\beta_1 + 60\hat\beta_2$ = [...] + [...] + 60 × ([...]) = [...]. The odds are $e^{[...]}$ = [...]. The R function predict( ) will calculate the fitted log odds for you, like this:

> predict(pain.glm,newdata=data.frame(trt="B",age=60))
[...]

Question: What is the probability that this patient improves?

Answer: Pr(improvement) = odds / (1 + odds) = [...]. By default, the predict( ) function returns the predicted log odds. You can get the predicted probability instead by including the type argument.

> predict(pain.glm,newdata=data.frame(trt="B",age=60),type="response")
[...]
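The chain from coefficients to log odds, odds and probability in the answers above can be traced in code. The sketch below uses a simulated stand-in for paindata (the data frame sim and its generating coefficients are hypothetical), so its numbers are not those of the handout. Only the relationships between the quantities carry over.

```r
# Hypothetical stand-in for paindata: 50 patients, two treatments, ages 30 to 60.
set.seed(10)
n   <- 50
sim <- data.frame(
  trt = factor(rep(c("A", "B"), each = n / 2)),
  age = round(runif(n, 30, 60))
)
# Logical response, as in paindata; TRUE counts as a success.
sim$painimproved <-
  rbinom(n, 1, plogis(1 - 0.03 * sim$age + 0.5 * (sim$trt == "B"))) == 1

fit <- glm(painimproved ~ trt + age, data = sim, family = binomial)
b   <- coef(fit)

# Odds ratio for two patients on the same treatment, 10 years apart in age.
or_10_years <- exp(10 * b["age"])

# Log odds for a 60 year old on treatment B, by hand and with predict().
new           <- data.frame(trt = "B", age = 60)
log_odds_hand <- b["(Intercept)"] + b["trtB"] + 60 * b["age"]
log_odds      <- predict(fit, newdata = new)            # default type = "link"

# Odds and probability, by hand and with type = "response".
odds      <- exp(log_odds)
prob      <- odds / (1 + odds)
prob_pred <- predict(fit, newdata = new, type = "response")
```

Here odds / (1 + odds) is the inverse of the logit, which R also provides as plogis( ).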

In the preceding example, since interactions were not allowed, there are two log odds functions of age with the same slope and different intercepts, one for each level of the factor trt. They would be plotted as parallel lines. If interactions are allowed, the slopes will also be different. In other words, the treatment type alters the rate at which increasing age affects the log odds on improvement. Below is the refitted model allowing interactions.

> pain.glm=update(pain.glm,.~trt*age)
> summary(pain.glm)

Call:
glm(formula = painimproved ~ trt + age + trt:age, family = binomial, data = paindata)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
    ...      ...      ...      ...      ...

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)      ...        ...     ...      ...
trtB             ...        ...     ...      ...
age              ...        ...     ...      ...
trtB:age         ...        ...     ...      ...

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: ... on 49 degrees of freedom
Residual deviance: ... on 46 degrees of freedom
AIC: ...

Number of Fisher Scoring iterations: 4

Question: What are the odds on improvement for a 60 year old patient who is receiving treatment B?

Answer: For treatment A the intercept is [...] and the age slope is [...]. For treatment B the intercept is the sum of the (Intercept) and trtB estimates, [...] + [...] = [...], and the slope is the sum of the age and trtB:age estimates, [...] + [...] = [...]. Therefore, the log odds are [...] + 60 × ([...]) = [...], the odds are $e^{[...]}$ = [...], and the probability of improvement is odds / (1 + odds) = [...]. The predict( ) function will give the same answers.

> predict(pain.glm,newdata=data.frame(trt="B",age=60))
[...]

> predict(pain.glm,newdata=data.frame(trt="B",age=60),type="response")
[...]

Aggregated Data

Data in raw form are like those in paindata, where each observation of the response is Bernoulli, with only two possible values such as Yes/No, Male/Female, TRUE/FALSE, or 0/1. Sometimes data are presented in aggregated form, where the numbers of successes and failures for each distinct value of $(x_1, x_2, \ldots, x_k)$ are tabulated. Here is the part of Table E6.21 for myocardial infarction (heart attack).

  cases controls drink gender
1   ...      ...     N      M
2   ...      ...     Y      M
3   ...      ...     N      F
4   ...      ...     Y      F

The two independent variables are binary factors: drink = N or Y (was the subject a drinker?) and gender = F or M. For each combination of factor levels, cases is the number of subjects who suffered a heart attack (success) and controls is the number who didn't. For data aggregated like this, the response term in the R formula must be a two-column matrix, with successes in the first column and failures in the second.

> prob6.21.glm=glm(cbind(cases,controls)~drink+gender,data=prob6.21,family=binomial)
> summary(prob6.21.glm)

Call:
glm(formula = cbind(cases, controls) ~ drink + gender, family = binomial, data = prob6.21)

Deviance Residuals:
...

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)      ...        ...     ...   <2e-16 ***
drinkY           ...        ...     ...      ...
genderM          ...        ...     ... ...e-12 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: ... on 3 degrees of freedom
Residual deviance: ... on 1 degrees of freedom
AIC: ...

Number of Fisher Scoring iterations: 4
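One way to convince yourself that the two-column response is the right form for aggregated data is to fit the same model both ways and compare. The sketch below does this on simulated Bernoulli data. The data frame raw, its generating coefficients and the variable names are hypothetical and only loosely modeled on the drink and gender example.

```r
# Hypothetical raw Bernoulli data: one row per subject, y = 1 for a case.
set.seed(4)
n   <- 400
raw <- data.frame(
  drink  = factor(sample(c("N", "Y"), n, replace = TRUE)),
  gender = factor(sample(c("F", "M"), n, replace = TRUE))
)
raw$y <- rbinom(n, 1, plogis(-0.5 - 0.3 * (raw$drink == "Y") + 0.8 * (raw$gender == "M")))

# Fit to the raw 0/1 response.
fit_raw <- glm(y ~ drink + gender, data = raw, family = binomial)

# Aggregate to counts of successes and failures per factor combination.
agg <- aggregate(cbind(y, 1 - y) ~ drink + gender, data = raw, FUN = sum)
names(agg)[3:4] <- c("cases", "controls")

# Fit to the aggregated two-column response.
fit_agg <- glm(cbind(cases, controls) ~ drink + gender, data = agg, family = binomial)

cbind(raw = coef(fit_raw), aggregated = coef(fit_agg))
```

The coefficient estimates and standard errors should agree between the two fits. The deviances differ, because the saturated model is different for raw and aggregated data.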

The estimated log of the ratio of odds on a heart attack for drinkers compared to non-drinkers is [...]. In other words, drinking appears to lessen the odds on a heart attack. Notice, however, that the p-value is 19%, so we aren't justified in drawing this conclusion.

The variable pain0 (baseline pain level) in the paindata data frame is numeric but has only 5 distinct values. If you want to construct a logistic regression model with pain0 and trt as independent variables, you can aggregate the data as follows to create a new data frame.

> paindata2=aggregate(cbind(painimproved,1-painimproved)~trt+pain0,data=paindata,FUN=sum)
> paindata2
  trt pain0 painimproved  V2
1   A   ...          ... ...
2   B   ...          ... ...
3   A   ...          ... ...
4   B   ...          ... ...
5   A   ...          ... ...
6   B   ...          ... ...
7   A   ...          ... ...
8   B   ...          ... ...
> names(paindata2)[4]="notimproved"
> paindata2
  trt pain0 painimproved notimproved
1   A   ...          ...         ...
2   B   ...          ...         ...
3   A   ...          ...         ...
4   B   ...          ...         ...
5   A   ...          ...         ...
6   B   ...          ...         ...
7   A   ...          ...         ...
8   B   ...          ...         ...

Then fit the model.

> pain.glm2=glm(cbind(painimproved,notimproved)~trt+pain0,data=paindata2,family=binomial)
> summary(pain.glm2)

Call:
glm(formula = cbind(painimproved, notimproved) ~ trt + pain0, family = binomial, data = paindata2)

Deviance Residuals:
...

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)      ...        ...     ...      ... **
trtB             ...        ...     ...      ...
pain0            ...        ...     ...      ... **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: ... on 7 degrees of freedom
Residual deviance: ... on 5 degrees of freedom
AIC: ...

Number of Fisher Scoring iterations: 5

The formula cbind(painimproved,1-painimproved)~trt+pain0 in the aggregate function worked because painimproved is a logical variable that has numeric values 0 and 1. Also, the numeric variable pain0 has only a small number of values. When the response is a factor rather than a logical variable, it is better to aggregate as below. As an example, we will use the radon.leukemia data with case/control status (DIS) as the binary response and with independent variables DOWNS and RADON. RADON is a continuous variable with many distinct values, so we will discretize it by locating each measurement in a class interval, similar to the way it is done by the histogram function. The R function for doing this is cut( ). The intervals begin at 0 and end at 20, with width 4.

> aggregate(DIS~DOWNS+cut(RADON,seq(0,20,4)),data=radon.leukemia,FUN=table)
   DOWNS cut(RADON, seq(0, 20, 4)) DIS.case DIS.control
1      1                     (0,4]      ...         ...
2    ...                     (0,4]      ...         ...
3    ...                     (4,8]      ...         ...
4    ...                     (4,8]      ...         ...
5    ...                    (8,12]      ...         ...
6    ...                    (8,12]      ...         ...
7    ...                   (12,16]      ...         ...
8    ...                   (12,16]      ...         ...
9    ...                   (16,20]      ...         ...
10   ...                   (16,20]        2           1
> leukdata=.Last.value

This is a data frame with cumbersome names. You can change them if you like.

> names(leukdata)=c("Downs","Radon.grp","Disease")
> leukdata
   Downs Radon.grp Disease.case Disease.control
1      1     (0,4]          ...             ...
2    ...     (0,4]          ...             ...
3    ...     (4,8]          ...             ...
4    ...     (4,8]          ...             ...
5    ...    (8,12]          ...             ...
6    ...    (8,12]          ...             ...
7    ...   (12,16]          ...             ...
8    ...   (12,16]          ...             ...
9    ...   (16,20]          ...             ...
10   ...   (16,20]            2               1
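The cut( ) and aggregate( ) steps can be tried out on simulated data before applying them to a real data set. In the sketch below the data frame sim and its columns are hypothetical stand-ins for radon.leukemia. One detail worth checking is that the default intervals are open on the left, so a value of exactly 0 falls outside (0,4] and becomes NA unless include.lowest=TRUE is given.

```r
# Hypothetical stand-in for radon.leukemia.
set.seed(3)
n   <- 300
sim <- data.frame(
  dis   = factor(sample(c("case", "control"), n, replace = TRUE)),
  downs = rbinom(n, 1, 0.3),
  radon = runif(n, 0.1, 20)      # kept strictly inside (0, 20]
)

breaks <- seq(0, 20, 4)

# Default intervals are (a, b]: 0 is not covered, 4 belongs to (0,4].
cut(c(0, 4, 4.1, 20), breaks)
# include.lowest = TRUE closes the first interval on the left.
cut(c(0, 4, 4.1, 20), breaks, include.lowest = TRUE)

# Discretize, then tabulate the factor response within each group.
sim$radon.grp <- cut(sim$radon, breaks)
agg <- aggregate(dis ~ downs + radon.grp, data = sim, FUN = table)

is.matrix(agg$dis)    # the response column is a matrix of counts
colnames(agg$dis)     # one column per level of dis
```

Because table( ) returns one count per factor level, the aggregated response is stored as a single matrix-valued column, which is why it prints with names such as dis.case and dis.control.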

In this data frame, Disease is already a two-column matrix of successes and failures, so you don't have to use the cbind function in the formula.

> leukdata.glm=glm(Disease~Downs+Radon.grp,data=leukdata,family=binomial)
> summary(leukdata.glm)

Call:
glm(formula = Disease ~ Downs + Radon.grp, family = binomial, data = leukdata)

Deviance Residuals:
...

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)           ...        ...     ...      ...
Downs                 ...        ...     ...      ...
Radon.grp(4,8]        ...        ...     ...      ...
Radon.grp(8,12]       ...        ...     ...      ...
Radon.grp(12,16]      ...        ...     ...      ...
Radon.grp(16,20]      ...        ...     ...      ...

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: ... on 9 degrees of freedom
Residual deviance: ... on 4 degrees of freedom
AIC: ...

Number of Fisher Scoring iterations: 3

Deviances and ANOVA

Consider the log-likelihood function in (1) as a function of estimates $\hat p_1, \hat p_2, \ldots, \hat p_N$ of the success probabilities for the $N$ replications of the experiment. If we don't assume that they are given by the logistic regression equation and instead allow them to be completely unrestricted, then the log-likelihood function is maximized when $\hat p_i = Y_i / n_i$. Its maximum value is called the saturated log-likelihood and is denoted by $l_{sat}$. The model log-likelihood is the maximum value of (1) when the $\hat p_i$ are the maximum likelihood estimators assuming the logistic regression model. It is denoted by $l_{model}$. The null log-likelihood is the maximum value of (1) when it is assumed that all the regression parameters $\beta_1, \beta_2, \ldots, \beta_k$ except the intercept $\beta_0$ are equal to zero. In other words, it is assumed that $p_1, \ldots, p_N$ all have a common value $p$. The null log-likelihood is denoted by $l_{null}$. The residual deviance is

\[ D(\text{resid}) = 2(l_{sat} - l_{model}). \]

The null deviance is

\[ D(\text{null}) = 2(l_{sat} - l_{null}) \]

and the regression deviance is

\[ D(\text{regr}) = 2(l_{model} - l_{null}). \]

Think of these quantities as being analogous to the residual sum of squares, the total sum of squares and the regression sum of squares in multiple linear regression problems. They satisfy a similar equation:

\[ D(\text{null}) = D(\text{regr}) + D(\text{resid}). \]

The regression deviance can be used to test the hypothesis $H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$. If the logistic regression model holds with $H_0$ true and $N$ is large, D(regr) has an approximate chi-square distribution with $k$ degrees of freedom. Reject $H_0$ if the p-value of D(regr) is too small. In the example just above, the observed value is D(regr) = [...] − [...] = 3.0404, with p-value

> 1-pchisq(3.0404,df=5)
[1] [...]

Since the p-value is so large, we cannot conclude that any of the regression coefficients are different from 0.

An anova table breaks D(regr) down into contributions from each variable in the model. It is constructed step by step, starting with the null model corresponding to $H_0$ above and adding one variable at a time. The increment in the regression deviance for each variable is shown, as well as the residual deviance after the variable is added.

> anova(leukdata.glm,test="Chisq")
Analysis of Deviance Table

Model: binomial, link: logit

Response: Disease

Terms added sequentially (first to last)

          Df Deviance Resid. Df Resid. Dev Pr(>Chi)
NULL                          9        ...
Downs      1      ...         8        ...      ...
Radon.grp  4      ...         4        ...      ...

The p-value of [...] indicates that adding the variable Radon.grp to a model that already contains the variable Downs does not significantly improve the fit. The increment in regression deviance, which is equal to the decrement in residual deviance, is [...] − [...] = [...].
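The deviance identities in this section can be checked numerically from a fitted glm object. The following is a minimal sketch on simulated aggregated data with no real effects. The data frame d, the group sizes and the variable names are hypothetical, chosen only to mimic the layout of leukdata with one numeric variable and one five-level factor.

```r
# Hypothetical aggregated data: 10 cells, 40 trials each, common success probability.
set.seed(2)
d <- expand.grid(downs = c(0, 1), grp = factor(1:5))
d$cases    <- rbinom(nrow(d), size = 40, prob = 0.5)
d$controls <- 40 - d$cases

fit      <- glm(cbind(cases, controls) ~ downs + grp, data = d, family = binomial)
null_fit <- update(fit, . ~ 1)            # intercept-only (null) model

D_null  <- fit$null.deviance              # 2 * (l_sat - l_null)
D_resid <- fit$deviance                   # 2 * (l_sat - l_model)
D_regr  <- D_null - D_resid               # 2 * (l_model - l_null)

# The same regression deviance computed directly from the log-likelihoods.
D_regr_loglik <- 2 * (as.numeric(logLik(fit)) - as.numeric(logLik(null_fit)))

# Degrees of freedom k and the chi-square p-value for H0: all slopes are zero.
k       <- fit$df.null - fit$df.residual
p_value <- pchisq(D_regr, df = k, lower.tail = FALSE)

# The sequential anova table splits D_regr into one increment per term.
tab <- anova(fit, test = "Chisq")
sum(tab$Deviance, na.rm = TRUE)           # should reproduce D_regr
```

Using pchisq( ) with lower.tail = FALSE gives the same upper-tail probability as 1-pchisq( ) and is slightly more accurate for very small p-values.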
