Logistic Regression - problem 6.14

Let $x_1, x_2, \ldots, x_m$ be given values of an input variable $x$ and let $Y_1, \ldots, Y_m$ be independent binomial random variables whose distributions depend on the corresponding values of $x$. They depend on $x$ through a function $p(x)$ that gives the success probability in the binomial experiment for each given value of $x$. We assume that the numbers of trials are predetermined, so that each $Y_i \sim \mathrm{Binom}(n_i, p(x_i))$. In logistic regression, we assume that the log-odds on success is a linear function of the variable $x$:

$$\log\frac{p(x)}{1 - p(x)} = \beta_0 + \beta_1 x,$$

where $\beta_0$ and $\beta_1$ are unknown constants. Here $x$ is a numeric variable, although if it is a two-level factor variable it will be numerically coded as 0 for the first level and 1 for the second level. For example, if it is male or female, it will be coded as 0 for female and 1 for male simply because f comes before m in the alphabet. Thus,

$$\log\frac{p(0)}{1 - p(0)} = \beta_0 \quad\text{and}\quad \log\frac{p(1)}{1 - p(1)} = \beta_0 + \beta_1.$$

Therefore, $\beta_1$ is the log odds ratio on success for the two populations, males and females.

Estimates of the parameters $\beta_0$ and $\beta_1$ are found by the method of maximum likelihood. The estimates $\hat\beta_0$ and $\hat\beta_1$ maximize the log-likelihood function, which up to an additive constant not involving the parameters is

$$l(\beta_0, \beta_1) = \sum_{j=1}^{m} \Big[ Y_j \log p(x_j) + (n_j - Y_j) \log\big(1 - p(x_j)\big) \Big],$$

which simplifies to

$$l(\beta_0, \beta_1) = \sum_{j=1}^{m} Y_j (\beta_0 + \beta_1 x_j) - \sum_{j=1}^{m} n_j \log\big(1 + e^{\beta_0 + \beta_1 x_j}\big).$$

The maximum likelihood estimates cannot be found by elementary calculus and must be numerically approximated. Theorems in advanced mathematical
statistics ensure that for large sample sizes $m$ the estimators are approximately normally distributed with means equal to the true parameters. Furthermore, their variances approach zero as $m \to \infty$.

R has a function glm, which stands for generalized linear model, that calculates the maximum likelihood estimates, their standard errors, and many other things. We will illustrate it with the data set stenosis, which is in the Van Belle data folder. Part of the data set is given below.

       smoke disease sex
    1    yes       1   m
    2    yes       1   m
    3    yes       1   m
    4    yes       1   m
    5    yes       1   m
    6    yes       1   m
    7    yes       1   m
    8    yes       1   m
    9    yes       1   m
    10   yes       1   m
    62    no       1   m
    63   yes       0   m
    64   yes       0   m
    65   yes       0   m
    66   yes       0   m
    67   yes       0   m

Disease is the response variable and has values either 0 for no disease or 1 for the presence of disease. Each $n_i$ here is equal to 1 and the response $Y$ is Bernoulli. For the moment, we ignore the variable sex and treat the disease probability as a function of smoking status only. For Bernoulli responses, glm is simple to use.

    > stenosis.glm=glm(disease~smoke,data=stenosis,family=binomial)
    > summary(stenosis.glm)

    Call:
    glm(formula = disease ~ smoke, family = binomial, data = stenosis)

    Deviance Residuals:
       Min      1Q  Median      3Q     Max
    -1.251  -1.087  -1.087   1.188   1.270

    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -0.2157     0.1829  -1.180    0.238
    smokeyes      0.3863     0.2762   1.399    0.162
    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 297.94  on 214  degrees of freedom
    Residual deviance: 295.97  on 213  degrees of freedom
    AIC: 299.97

    Number of Fisher Scoring iterations: 3

The variable smoke has two levels, no and yes. The intercept parameter, with an estimated value of -0.2157, is the log-odds on disease for nonsmokers. The parameter estimate labelled smokeyes is the difference in log-odds between smokers and nonsmokers. In other words, it is the log of the odds ratio on disease for smokers relative to nonsmokers. You can get the odds and odds ratio by exponentiating these estimates:

    > exp(coef(stenosis.glm))
    (Intercept)    smokeyes
      0.8059701   1.4715762

The odds on disease for nonsmokers is estimated to be 0.80597, and the odds on disease for smokers is estimated to be 1.472 times as great. However, the p-value of 0.162 is not significant, so we cannot conclude that the true odds ratio is different from 1. Confidence intervals for the log-odds and log odds ratio, and then for the odds and odds ratio, are obtained as follows.

    > confint(stenosis.glm)
    Waiting for profiling to be done...
                     2.5 %    97.5 %
    (Intercept) -0.5774201 0.1413800
    smokeyes    -0.1536262 0.9309511
    > exp(.Last.value)
                    2.5 %   97.5 %
    (Intercept) 0.5613447 1.151862
    smokeyes    0.8575925 2.536921

If both smoking status and sex are included in the model, the fit is as follows.

    > stenosis.glm2=glm(disease~smoke+sex,data=stenosis,family=binomial)
    > summary(stenosis.glm2)

    Call:
    glm(formula = disease ~ smoke + sex, family = binomial, data = stenosis)

    Deviance Residuals:
        Min       1Q   Median       3Q      Max
    -1.3630  -1.0555  -0.9783   1.0807   1.3905
    Coefficients:
                Estimate Std. Error z value Pr(>|z|)
    (Intercept)  -0.4882     0.2159  -2.261   0.0238 *
    smokeyes      0.1946     0.2903   0.670   0.5026
    sexm          0.7199     0.2881   2.499   0.0125 *
    ---
    Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

    (Dispersion parameter for binomial family taken to be 1)

        Null deviance: 297.94  on 214  degrees of freedom
    Residual deviance: 289.64  on 212  degrees of freedom
    AIC: 295.64

    Number of Fisher Scoring iterations: 4

Now the intercept parameter is the log-odds on disease for nonsmoking females. To get the log-odds on disease for smokers, add 0.1946. To get the log-odds on disease for males, add 0.7199. The estimated odds and odds ratios are:

    > exp(coef(stenosis.glm2))
    (Intercept)    smokeyes        sexm
      0.6137494   1.2148232   2.0542127

The odds ratio for smokers vs. nonsmokers is 1.215. The odds ratio for males vs. females is 2.054. It appears that being a male is riskier than being a smoker. Because the model is additive on the log-odds scale, the odds ratio for male smokers vs. female nonsmokers is the product 1.215 × 2.054 = 2.496.

In problem 6.14 the data in the book is given in a format suitable for the Mantel-Haenszel test. I have arranged it in tabular form for importation as a data frame. The data file is in the Van Belle data folder. It looks like this once you have imported it.

       MI Control Coffee   Cigs
    1   7      31    >=5  never
    2  55     269     <5  never
    3   7      18    >=5 former
    4  20     112     <5 former
    5   7      24    >=5  1to14
    6  33     114     <5  1to14
    7  40      45    >=5 15to24
    8  88     172     <5 15to24
    9  34      24    >=5 25to34
    10 50      55     <5 25to34
    11 27      24    >=5 35to44
    12 55      58     <5 35to44
    13 30      17    >=5    45+
    14 34      17     <5    45+
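If the data file is not at hand, the table above can be typed in directly. The following is only a minimal sketch: the lowercase column names (mi, control, coffee, cigs) are an assumption chosen to match the names used in the R commands for this problem, and the explicit factor level orders are an illustrative choice, not something the data file is known to specify.

```r
# Hypothetical reconstruction of the problem 6.14 data frame from the printed table.
# Assumption: lowercase column names; the imported file may use different names.
prob6.14 <- data.frame(
  mi      = c(7, 55, 7, 20, 7, 33, 40, 88, 34, 50, 27, 55, 30, 34),
  control = c(31, 269, 18, 112, 24, 114, 45, 172, 24, 55, 24, 58, 17, 17),
  # Rows alternate >=5 and <5 cups within each smoking category.
  coffee  = factor(rep(c(">=5", "<5"), times = 7), levels = c("<5", ">=5")),
  # Each smoking category occupies two consecutive rows.
  cigs    = factor(rep(c("never", "former", "1to14", "15to24",
                         "25to34", "35to44", "45+"), each = 2),
                   levels = c("never", "former", "1to14", "15to24",
                              "25to34", "35to44", "45+"))
)

# Inspect the structure and the first few rows.
str(prob6.14)
head(prob6.14)
```

With the levels ordered this way, "<5" and "never" become the baseline categories in any model fitted to this data frame; a data frame read from a file would normally get alphabetical level order instead.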
Here the response is not given as individual Bernoulli responses as in the stenosis example. Rather, the first two columns are the numbers of successes (MI) and failures (Control) corresponding to each combination of factor levels. For data in this form, the response in the glm formula must be a two-column matrix of successes and failures. Invoke it like this:

    > prob6.14.glm=glm(cbind(mi,control)~.,data=prob6.14,family=binomial)

The cbind function puts vectors together as columns of a matrix. The dot . on the right of the model formula after ~ is a shortcut meaning to include all the variables in the data frame that are not named on the left. R is case-sensitive, so the names inside cbind must match the column names of your imported data frame exactly. You can name the fitted object anything you like.

If you don't want to do the Mantel-Haenszel calculations by hand, you can put the data into the format given in the book and use the mantelhaen.test function in R. It takes some work. Here are the steps. The transpose in the loop puts the coffee levels in the rows and mi/control in the columns of each 2 x 2 slice, so the dimnames are listed in that order.

    > attach(prob6.14)
    > prob6.14b=array(rbind(mi,control),dim=c(2,2,7))
    > for(k in 1:7) prob6.14b[,,k]=t(prob6.14b[,,k])
    > dimnames(prob6.14b)=list(c(">=5","<5"),c("mi","control"),c("never",
    + "former","1to14","15to24","25to34","35to44","45+"))
    > prob6.14b

The result should look like the data in the book. After this, all you have to do is call the mantelhaen.test function:

    > mantelhaen.test(prob6.14b)

Compare the results. They should agree.
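The hand calculation can itself be sketched in R as a check. This example assumes the standard Mantel-Haenszel estimate of the common odds ratio, $\sum_k (a_k d_k / n_k) \big/ \sum_k (b_k c_k / n_k)$, where $a_k, b_k, c_k, d_k$ are the cells of the $k$-th 2 x 2 table and $n_k$ is its total. The object names tab and or.mh are illustrative, and the counts are typed in from the printed table rather than read from the data file.

```r
# Counts from the printed table; rows alternate >=5 and <5 cups within each stratum.
mi      <- c(7, 55, 7, 20, 7, 33, 40, 88, 34, 50, 27, 55, 30, 34)
control <- c(31, 269, 18, 112, 24, 114, 45, 172, 24, 55, 24, 58, 17, 17)

# Build a 2 x 2 x 7 array: rows = mi/control, columns = coffee level, strata = cigs.
tab <- array(rbind(mi, control), dim = c(2, 2, 7),
             dimnames = list(c("mi", "control"), c(">=5", "<5"),
                             c("never", "former", "1to14", "15to24",
                               "25to34", "35to44", "45+")))

# Cell counts per stratum.
n11 <- tab[1, 1, ]; n12 <- tab[1, 2, ]
n21 <- tab[2, 1, ]; n22 <- tab[2, 2, ]
n   <- n11 + n12 + n21 + n22

# Mantel-Haenszel estimate of the common odds ratio, computed by hand.
or.mh <- sum(n11 * n22 / n) / sum(n12 * n21 / n)
or.mh

# The same estimate as reported by mantelhaen.test.
mantelhaen.test(tab)$estimate
```

The odds ratio of a 2 x 2 table is unchanged by transposing the table, so the estimate does not depend on whether case status or coffee level is placed in the rows.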