Stat 3302 (Spring 2017)
Peter F. Craigmile

Simple linear logistic regression (part 1)
[Dobson and Barnett, 2008, Sections 7.1–7.3]

- Generalized linear models for binary data
- Beetles dose-response example
- Exploratory analysis
- A better graphical exploratory data analysis
- The simple linear logistic regression model
- Finding the maximum likelihood estimates of β_0 and β_1
- Fitting the model in R
- Summarizing the R output
- Interpreting the slope parameter β_1
- Where we go next
Generalized linear models for binary data

Previously we considered statistical inference on a binomial proportion, p, for Y ~ B(m, p). We begin by considering binary data:

- Each random variable (RV) Y_i has only two possible values: 0 or 1.
- We assume that Pr(Y_i = 0) = 1 - p_i and Pr(Y_i = 1) = p_i.
- For each observation i, we have a vector of covariates or explanatory variables x_i = (x_{i,1}, ..., x_{i,p})^T.
- The RVs {Y_1, ..., Y_n} are independent.

Our aim is to model the relationship between the success probabilities p_i and the explanatory variables x_i.
Beetles dose-response example

(See Table 7.2 of Dobson and Barnett [2008]; the original reference is Bliss [1935].)

Historically, this type of experiment was an important step in developing statistical models for binomial outcomes.

In an experiment, adult flour beetles were exposed for five hours to a number of different concentrations (doses) of gaseous carbon disulfide, and each beetle was assessed as to whether or not it had died.

The variables are:

- The log10 dose of gaseous carbon disulfide (in units of log10 CS2 mg/l).
- Whether the beetle died (1) or remained alive (0).

We wish to understand the relationship between (log) dose and the probability of dying.

R code for this example can be downloaded from the class website.
Exploratory analysis

A scatterplot of dead/alive versus the log10 dose is meaningless:

[Figure: scatterplot of dead (1) or alive (0) versus log10 dose]

Instead we borrow strength over the doses. With repeated values of the doses, we have:

  log10 dose   number alive   number dead   total number   prop. killed
  1.6907           53              6             59            0.102
  1.7242           47             13             60            0.217
  1.7552           44             18             62            0.290
  1.7842           28             28             56            0.500
  1.8113           11             52             63            0.825
  1.8369            6             53             59            0.898
  1.8610            1             61             62            0.984
  1.8839            0             60             60            1.000
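The "prop. killed" column, and the empirical logits plotted on the next slide, can be computed directly from the counts. Here is a small sketch in Python (the course code is in R, but the arithmetic is identical); the half-count correction used when every beetle in a group died is a common convention, not something from the notes:

```python
import math

# Beetle mortality data from Bliss (1935), as tabulated in
# Dobson & Barnett (2008), Table 7.2.
log10_dose = [1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839]
killed     = [6, 13, 18, 28, 52, 53, 61, 60]
total      = [59, 60, 62, 56, 63, 59, 62, 60]

# Observed proportion killed at each dose.
prop = [k / n for k, n in zip(killed, total)]

def empirical_logit(k, n):
    """Empirical logit of k successes out of n. At the highest dose all
    beetles died, so the raw logit is infinite; adding 1/2 to both counts
    is one standard fix (an assumption here, not part of the notes)."""
    return math.log((k + 0.5) / (n - k + 0.5))

logits = [empirical_logit(k, n) for k, n in zip(killed, total)]
```

Both the proportions and the empirical logits increase steadily with log10 dose, which is what motivates a linear model on the logit scale.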
A better graphical exploratory data analysis

[Figure, two panels: proportion killed versus log10 dose (left); logit of the proportion killed versus log10 dose (right)]

Summary?
The simple linear logistic regression model

Suppose that Y_i (i = 1, ..., n) are n independent Bern(p_i) RVs. Let

  η_i = log( p_i / (1 - p_i) ) = β_0 + β_1 x_i,

where x_i (i = 1, ..., n) is some explanatory variable.

For our example, Y_i =          and x_i =

The above model is equivalent to assuming that Y_i (i = 1, ..., n) are n independent Bern(p_i) RVs where

  p_i = e^{β_0 + β_1 x_i} / (1 + e^{β_0 + β_1 x_i}).
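The two displays above say the same thing: the expression for p_i is exactly the inverse of the logit map. This can be checked numerically with a minimal Python sketch (function names are mine):

```python
import math

def logit(p):
    """Log-odds of p: log(p / (1 - p))."""
    return math.log(p / (1 - p))

def inv_logit(eta):
    """Inverse map: p = exp(eta) / (1 + exp(eta))."""
    return math.exp(eta) / (1 + math.exp(eta))

# The maps are inverses: eta -> p -> eta recovers eta.
eta = 0.7  # illustrative value
p = inv_logit(eta)
recovered = logit(p)

# A probability of 1/2 corresponds to a log-odds of 0.
half = inv_logit(0.0)
```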
Finding the maximum likelihood estimates of the parameters in the logistic regression model

The log-likelihood function for our data y = (y_1, ..., y_n)^T is

  l_n = Σ_{i=1}^n [ log C(1, y_i) + y_i log p_i + (1 - y_i) log(1 - p_i) ]
      = Σ_{i=1}^n [ log C(1, y_i) + y_i log( p_i / (1 - p_i) ) + log(1 - p_i) ],

where

  p_i = e^{β_0 + β_1 x_i} / (1 + e^{β_0 + β_1 x_i}),   i = 1, ..., n.

(Here C(1, y_i) is the binomial coefficient, which equals 1 when y_i is 0 or 1, so log C(1, y_i) = 0.)

We have

  log( p_i / (1 - p_i) ) = β_0 + β_1 x_i

and

  1 - p_i = 1 - e^{β_0 + β_1 x_i} / (1 + e^{β_0 + β_1 x_i})
          = 1 / (1 + e^{β_0 + β_1 x_i}).

Thus the log-likelihood function simplifies to

  l_n = Σ_{i=1}^n [ log C(1, y_i) + y_i (β_0 + β_1 x_i) - log(1 + e^{β_0 + β_1 x_i}) ].

How do we find the maximum likelihood estimates?
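The algebraic simplification above is easy to verify numerically: both forms of l_n (with the constant log C(1, y_i) term dropped, since it is zero for binary y_i) give the same value for any (β_0, β_1). A Python sketch on made-up binary data:

```python
import math

def loglik_full(beta0, beta1, x, y):
    """l = sum[ y_i log p_i + (1 - y_i) log(1 - p_i) ]."""
    total = 0.0
    for xi, yi in zip(x, y):
        p = math.exp(beta0 + beta1 * xi) / (1 + math.exp(beta0 + beta1 * xi))
        total += yi * math.log(p) + (1 - yi) * math.log(1 - p)
    return total

def loglik_simplified(beta0, beta1, x, y):
    """l = sum[ y_i (beta0 + beta1 x_i) - log(1 + exp(beta0 + beta1 x_i)) ]."""
    return sum(yi * (beta0 + beta1 * xi)
               - math.log(1 + math.exp(beta0 + beta1 * xi))
               for xi, yi in zip(x, y))

# Illustrative binary data (made up for this check).
x = [0.1, 0.4, 0.9, 1.3]
y = [0, 0, 1, 1]
```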
Finding the maximum likelihood estimates, cont.

So what is the problem? There is no closed-form solution for the maximum likelihood estimates of β_0 and β_1. The computer (in our case R) has to solve the score equations numerically. In particular, R uses an algorithm similar to the Newton-Raphson method called Fisher scoring or iteratively weighted least squares (IWLS).
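To see what such an algorithm does, here is a minimal Newton-Raphson sketch in Python, applied to the grouped beetle counts from the earlier table (for the logit link, Fisher scoring coincides with Newton-Raphson). This is an illustration, not the course's R code:

```python
import math

# Grouped beetle data: log10 dose, number killed, group size (Table 7.2).
x = [1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839]
y = [6, 13, 18, 28, 52, 53, 61, 60]
m = [59, 60, 62, 56, 63, 59, 62, 60]

def fit_logistic(x, y, m, max_iter=50, tol=1e-10):
    """Newton-Raphson for (beta0, beta1) in a simple linear logistic model.
    Repeatedly solves (information matrix) * step = (score vector)."""
    b0, b1 = 0.0, 0.0
    for _ in range(max_iter):
        u0 = u1 = 0.0              # score vector
        i00 = i01 = i11 = 0.0      # 2x2 information matrix
        for xi, yi, mi in zip(x, y, m):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            r = yi - mi * p        # observed minus expected deaths
            w = mi * p * (1.0 - p)
            u0 += r
            u1 += r * xi
            i00 += w
            i01 += w * xi
            i11 += w * xi * xi
        # Solve the 2x2 system by hand for the Newton step.
        det = i00 * i11 - i01 * i01
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < tol:
            break
    return b0, b1

b0_hat, b1_hat = fit_logistic(x, y, m)
```

Starting from (0, 0), the iterations converge to estimates matching the R output on the next slide (about -60.7 and 34.3); the grouped binomial likelihood has the same maximizer as the Bernoulli likelihood on the individual beetles.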
Fitting the model in R

The R code to fit the simple linear logistic model is:

beetle.logit.log10dose <- glm(dead ~ log10.dose,
                              data=beetles, family=binomial)
summary(beetle.logit.log10dose)

Try to understand the output from R below.

Call:
glm(formula = dead ~ log10.dose, family = binomial, data = beetles)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-2.4922  -0.5986   0.2058   0.4512   2.3820

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)  -60.717      5.181  -11.72   <2e-16 ***
log10.dose    34.270      2.912   11.77   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 645.44  on 480  degrees of freedom
Residual deviance: 372.47  on 479  degrees of freedom
AIC: 376.47

Number of Fisher Scoring iterations: 5
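The estimated coefficients can be plugged back into p_i = e^{β_0 + β_1 x_i} / (1 + e^{β_0 + β_1 x_i}) to compare fitted and observed proportions at each dose. A short Python check using the estimates reported above (a sketch, not part of the class R code):

```python
import math

# Estimates from the R summary output.
b0, b1 = -60.717, 34.270

# Doses and observed proportions killed from the exploratory table.
log10_dose = [1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839]
observed   = [0.102, 0.217, 0.290, 0.500, 0.825, 0.898, 0.984, 1.000]

# Fitted success probability at each dose.
fitted = [1.0 / (1.0 + math.exp(-(b0 + b1 * x))) for x in log10_dose]

for x, obs, fit in zip(log10_dose, observed, fitted):
    print(f"log10 dose {x:.4f}:  observed {obs:.3f}  fitted {fit:.3f}")
```

The fitted probabilities increase smoothly with dose and track the observed proportions closely, which is a first informal indication that the model fits well.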
Summarizing the R output
Interpreting the slope parameter β_1

Let p(x_i) denote the probability of success at covariate value x_i, and let η(x_i) denote the logit of this probability.

The odds at value x_i is

  odds(x_i) = p(x_i) / (1 - p(x_i)).

The odds ratio comparing the odds at values x_i + 1 and x_i is

  odds(x_i + 1) / odds(x_i) = exp(η(x_i + 1)) / exp(η(x_i))
                            = exp(η(x_i + 1) - η(x_i))
                            = exp([β_0 + β_1 (x_i + 1)] - [β_0 + β_1 x_i])
                            = exp(β_1).

Thus e^{β_1} is
Interpreting the slope parameter β_1, continued

In practice we estimate e^{β_1} by e^{β̂_1}. For the beetles dataset, e^{β̂_1} = e^{34.270} (a very large number!)

The key problem is that for this application it makes no sense to consider an increase of one unit in the log10 dose.

Ex: What is the multiplicative change in the odds for an increase of 0.01 units in the log10 dose? Produce a 95% confidence interval (CI) for this change.
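One way to carry out this computation is sketched below in Python, assuming a Wald-type interval built on the log-odds scale, i.e. exponentiating the endpoints of the interval for 0.01 β_1 (one standard approach; the estimate and standard error come from the R output):

```python
import math

# Slope estimate and standard error from the R summary output.
b1, se = 34.270, 2.912

delta = 0.01          # increase of 0.01 in log10 dose
z = 1.96              # 95% normal quantile

# Multiplicative change in the odds for a delta increase in log10 dose.
or_hat = math.exp(delta * b1)

# 95% CI: exponentiate the endpoints of the CI for delta * beta1.
lo = math.exp(delta * (b1 - z * se))
hi = math.exp(delta * (b1 + z * se))
```

So an increase of 0.01 in the log10 dose multiplies the estimated odds of death by roughly 1.41, a far more interpretable quantity than e^{34.270}.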
Interpreting the slope parameter β_1, continued
Where we go next

So far we are in a position to interpret the coefficients of the simple logistic regression model. Here is a wish list of things we want to do:

1. Produce an estimate of the probability at a given covariate value, along with a measure of uncertainty.
2. Compare different statistical models that we fit using deviances (still to be defined!).
3. Check the fit of models using residuals appropriate for this statistical model.
4. Extend to more complicated logistic regression models.

References

C. I. Bliss. The calculation of the dosage-mortality curve. Annals of Applied Biology, 22:134–167, 1935.

A. J. Dobson and A. Barnett. An Introduction to Generalized Linear Models, Third Edition. Chapman & Hall, New York, NY, 2008.