Lecture: Introduction to Logistic Regression
Ani Manichaikul
amanicha@jhsph.edu

Logistic Regression
Regression for a response variable that follows a binomial distribution.

Binomial Model
Recall the binomial model:
n independent trials (e.g., coin tosses)
p = probability of success on each trial (e.g., p = 1/2 = Pr of heads)
Y = number of successes out of n trials (e.g., Y = number of heads)

Binomial Distribution
And the binomial distribution:
P(Y = y) = C(n, y) · p^y · (1 − p)^(n − y)
where C(n, y) denotes the binomial coefficient "n choose y".
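To make the binomial setup concrete, here is a quick numeric sketch (Python with scipy is my choice of tools here; the lecture itself uses Stata):

```python
from scipy.stats import binom

# Binomial model: n independent coin tosses with success probability p
n, p = 10, 0.5

# Pr(Y = 4 heads out of 10 tosses) from the binomial pmf
print(binom.pmf(4, n, p))                 # ~0.205

# Mean and variance of the count Y: E(Y) = n*p, Var(Y) = n*p*(1-p)
print(binom.mean(n, p), binom.var(n, p))  # 5.0, 2.5
```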
Why can't we use regular regression (SLR or MLR)?

Example
Consider a phase I clinical trial in which 35 independent patients are given a new medication for pain relief. Of the 35 patients, 22 report significant relief one hour after medication.
Question: How effective is the drug?

Cannot use Linear Regression
The response, Y, is NOT normally distributed.
The variability of Y is NOT constant, since the variance of each binary response, pq, depends on the expected response, p.
The predicted/fitted values must be such that the corresponding probabilities are between 0 and 1.

Model
Y = # patients who get relief
n = 35 patients (trials)
p = probability of relief for any patient

The truth we seek in the population: How effective is the drug? What is p?
Get the best estimate of p given the data.
Determine the margin of error: the range of plausible values for p.
Maximum Likelihood Method
The method of maximum likelihood estimation chooses values for the parameter estimates which make the observed data maximally likely under the specified model.

Maximum Likelihood
For the binomial model, we have observed Y = y, and
P(Y = y) = C(n, y) · p^y · (1 − p)^(n − y)
So for this example,
P(Y = 22) = C(35, 22) · p^22 · (1 − p)^13

So, estimate p by choosing the value of p which makes the observed data maximally likely, i.e., choose the p that makes the value of Pr(Y = 22) maximal.
The ML estimate is p̂ = y/n = 22/35 = 0.63, the estimated proportion of patients who will experience relief.

[Figure: the likelihood function Pr(Y = 22 out of 35) plotted against p = Pr(event); the curve peaks at the maximum likelihood estimate, p̂ = 0.63.]
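A short numeric check of the maximum likelihood idea (a minimal Python sketch, again my own choice of tools): evaluate the binomial likelihood over a grid of p values and confirm that it peaks at y/n.

```python
import numpy as np
from scipy.stats import binom

n, y = 35, 22

# Likelihood L(p) = Pr(Y = 22 | n = 35, p), evaluated over a grid of candidate p values
p_grid = np.linspace(0.01, 0.99, 981)
likelihood = binom.pmf(y, n, p_grid)

# The grid maximizer agrees with the closed-form MLE, y/n = 0.63
print(p_grid[np.argmax(likelihood)], y / n)
```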
Confidence Interval for p
Variance of p̂: Var(p̂) = p(1 − p)/n = pq/n
Standard error of p̂: sqrt(pq/n)
Estimated standard error of p̂: sqrt(p̂q̂/n)

95% confidence interval for the true proportion, p:
p̂ ± 1.96·sqrt(p̂q̂/n) = 0.63 ± 1.96·sqrt((0.63)(0.37)/35)
= (0.63 − 1.96(0.082), 0.63 + 1.96(0.082))
= (0.47, 0.79)

Conclusion
Based upon our clinical trial, in which 22 of 35 patients experience relief, we estimate that 63% of persons who receive the new drug experience relief within 1 hour (95% CI: 47% to 79%).
Whether 63% (47% to 79%) represents an effective drug will depend on many things, especially on the science of the problem. Sore throat pain? Arthritis pain? Accidentally-cut-your-leg-off pain?
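The same interval, computed directly (a small Python sketch of the slide's arithmetic):

```python
import numpy as np

y, n = 22, 35
p_hat = y / n                          # 0.63
se = np.sqrt(p_hat * (1 - p_hat) / n)  # ~0.082

# Wald 95% confidence interval for the true proportion p
lo, hi = p_hat - 1.96 * se, p_hat + 1.96 * se
print(f"{p_hat:.2f} (95% CI: {lo:.2f} to {hi:.2f})")  # 0.63 (0.47 to 0.79)
```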
Aside: Probabilities and Odds
The odds of an event are defined as:
odds(Y = 1) = P(Y = 1) / P(Y = 0) = P(Y = 1) / [1 − P(Y = 1)] = p / (1 − p)

We can go back and forth between odds and probabilities:
odds = p / (1 − p)
p = odds / (odds + 1)

Odds Ratio
We saw that an odds ratio (OR) can be helpful for comparisons. Recall the Vitamin A trial:
OR = odds(death | Vit. A) / odds(death | No Vit. A)
The OR here describes the benefits of Vitamin A therapy. We saw for this example that OR = 0.59, an estimated 41% reduction in the odds of mortality (1 − 0.59 = 0.41).
The OR is a building block for logistic regression.
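The odds/probability conversion in code (a minimal sketch; the helper names are mine):

```python
def odds_from_prob(p: float) -> float:
    """odds = p / (1 - p)"""
    return p / (1 - p)

def prob_from_odds(odds: float) -> float:
    """p = odds / (odds + 1)"""
    return odds / (odds + 1)

print(odds_from_prob(0.63))                  # ~1.70
print(prob_from_odds(odds_from_prob(0.63)))  # 0.63: the round trip recovers p
```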
Logistic Regression
Suppose we want to ask whether the new drug is better than a placebo, and we have the following observed data:

            Relief?
            No    Yes   Total
Drug        13     22      35
Placebo     20     15      35

Odds Ratio
OR = odds(relief | Drug) / odds(relief | Placebo)
   = { P(relief | Drug) / [1 − P(relief | Drug)] } / { P(relief | Placebo) / [1 − P(relief | Placebo)] }
   = [0.63/(1 − 0.63)] / [0.43/(1 − 0.43)]
   = 2.26

[Figure: 95% confidence intervals for p in the Placebo and Drug groups, plotted on a common probability axis; the two intervals overlap.]

Confidence Interval for OR
The CI uses Woolf's method for the standard error of log(ÔR):
se(log(ÔR)) = sqrt(1/22 + 1/13 + 1/15 + 1/20) = 0.489
Then find (L, U) = log(ÔR) ± 1.96·se(log(ÔR)), and take (e^L, e^U).
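Putting the 2×2 table, the OR, and Woolf's interval together (a Python sketch of the slide's arithmetic):

```python
import numpy as np

# 2x2 table: relief / no relief by treatment arm
a, b = 22, 13   # Drug:    relief, no relief
c, d = 15, 20   # Placebo: relief, no relief

or_hat = (a * d) / (b * c)                   # ~2.26
se_log_or = np.sqrt(1/a + 1/b + 1/c + 1/d)   # Woolf's SE, ~0.489

# Build the CI on the log scale, then exponentiate back
L = np.log(or_hat) - 1.96 * se_log_or
U = np.log(or_hat) + 1.96 * se_log_or
print(f"OR = {or_hat:.2f}, 95% CI ({np.exp(L):.2f}, {np.exp(U):.2f})")
```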
Interpretation
OR = 2.26, 95% CI: (0.86, 5.90)
The Drug is an estimated 2¼ times better than the Placebo. But could the difference be due to chance alone?

Logistic Regression
Can we set up a model for this similar to what we've done in ANOVA and regression?
Idea: model the log odds of the event (in this example, relief) as a function of predictor variables.

Model
log[ odds(relief | Tx) ] = log{ P(relief | Tx) / P(no relief | Tx) } = β₀ + β₁·Tx
where: Tx = 1 if Drug, 0 if Placebo

Then:
log[ odds(relief | Drug) ] = β₀ + β₁
log[ odds(relief | Placebo) ] = β₀
log[ odds(relief | Drug) ] − log[ odds(relief | Placebo) ] = β₁
And ' odds(r D) $ & odds(r P) # % " Thus: log % " = ( And: OR = exp(( ) = e (!! So: exp(( ) = odds ratio of relief for patients taking the Drug-vs-patients taking the Placebo. It s the same! So, why go to all the trouble of setting up a linear model? What if there is a biologic reason to expect that the rate of relief (and perhaps drug efficacy) is age dependent? Logistic Regression Logit estimates Number of obs = 7 LR chi2() = 2.83 Prob > chi2 =.926 Log likelihood = -46.9969 Pseudo R2 =.292 ------------------------------------------------------------------------------ y Coef. Std. Err. z P> z [95% Conf. Interval] -------------+---------------------------------------------------------------- drug.837752.48892.66.96 -.444926.77243 _cons -.287682.34565 -.84.4 -.957372.38773 ------------------------------------------------------------------------------ Estimates: log( odds(relief) ) = ˆ (ˆ ( + Drug = -.288 +.84(Drug) Adding other variables What if Pr(relief) = function of Drug or Placebo AND Age We could easily include age in a model such as: log( odds(relief) ) = ( + ( Drug + ( 2 Age Therefore: OR = exp(.84) = 2.26!
Logistic Regression
As in MLR, we can include many additional covariates. For a logistic regression model with p predictors:
log[ odds(Y = 1) ] = β₀ + β₁X₁ + ... + β_p X_p
where: odds(Y = 1) = Pr(Y = 1) / [1 − Pr(Y = 1)] = Pr(Y = 1) / Pr(Y = 0)

Thus:
log{ Pr(Y = 1) / [1 − Pr(Y = 1)] } = β₀ + β₁X₁ + ... + β_p X_p
But why use log(odds)?

Linear models for binary outcomes
We would like to use something like what we know from linear regression:
Continuous outcome = β₀ + β₁X₁ + β₂X₂ + ...
How can we turn a proportion into a continuous outcome?
Linear regression might estimate anything in (−∞, +∞), not just a proportion in the range 0 to 1. Logistic regression is a way to estimate a proportion (between 0 and 1), as well as some related quantities.
Transforming a proportion
The odds are always positive:
odds = p / (1 − p) ∈ [0, +∞)
The log odds is continuous:
log odds = ln[ p / (1 − p) ] ∈ (−∞, +∞)

Logit Function
The logit function relates the log-odds (logit) to p = Pr(Y = 1).
[Figure: the logit function, log-odds plotted against probability of success; the curve runs from −∞ to +∞ as p goes from 0 to 1, crossing 0 at p = 0.5.]

Key Relationships

Measure                                 Min    Max    Name
Pr(Y = 1)                                 0      1    probability
Pr(Y = 1) / [1 − Pr(Y = 1)]               0      ∞    odds
log{ Pr(Y = 1) / [1 − Pr(Y = 1)] }       −∞      ∞    log-odds, or logit

Logit transformation
Relating log-odds, probabilities, and parameters in logistic regression:
Suppose the model: logit(p) = β₀ + β₁X
i.e., log[ p / (1 − p) ] = β₀ + β₁X
Take anti-logs:
p / (1 − p) = exp(β₀ + β₁X)
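The logit and its inverse in code (a minimal sketch; the inverse form is derived algebraically on the next slide):

```python
import numpy as np

def logit(p):
    """Log-odds: maps a probability in (0, 1) to the whole real line."""
    return np.log(p / (1 - p))

def inv_logit(x):
    """Inverse logit (expit): maps any real number to a probability in (0, 1)."""
    return np.exp(x) / (1 + np.exp(x))

print(logit(0.5))             # 0.0
print(inv_logit(logit(0.9)))  # 0.9: the round trip recovers p
```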
Solve for p:
p = (1 − p)·exp(β₀ + β₁X)
p = exp(β₀ + β₁X) − p·exp(β₀ + β₁X)
p + p·exp(β₀ + β₁X) = exp(β₀ + β₁X)
p·{1 + exp(β₀ + β₁X)} = exp(β₀ + β₁X)
p = exp(β₀ + β₁X) / [1 + exp(β₀ + β₁X)]

What's the point?
We can determine the probability of success for a specific set of covariates, X, after running a logistic regression model.

Example: Dependence of Blindness on Age
The following data concern the Aegean island of Kalytos, where inhabitants suffer from a congenital eye disease whose effects become more marked with age. Samples of 50 people were taken at five different ages, and the numbers of blind people were counted.

Data
Age                  20     35     45     55     70
Number blind / 50   6/50   7/50  26/50  37/50  44/50
Question
The scientific question of interest is to determine how the probability of blindness is related to age in this population.
Let pᵢ = Pr(a person in age class i is blind)

Model 1
logit(pᵢ) = β₀*
β₀* = log-odds of blindness for all ages
exp(β₀*) = odds of blindness for all ages
There is no age dependence in this model.

Model 2
logit(pᵢ) = β₀ + β₁(ageᵢ − 45)
β₀ = log-odds of blindness among 45 year olds
exp(β₀) = odds of blindness among 45 year olds
β₁ = difference in log-odds of blindness comparing a group that is one year older than another
exp(β₁) = odds ratio of blindness comparing a group that is one year older than another

Results: Model 1

Iteration 0:  log likelihood = -173.08674

Logit estimates                                   Number of obs   =        250
                                                  LR chi2(0)      =       0.00
                                                  Prob > chi2     =          .
Log likelihood = -173.08674                       Pseudo R2       =     0.0000
------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |  -.0800427   .1265924    -0.63   0.527    -.3281593    .1680739
------------------------------------------------------------------------------

logit(p̂ᵢ) = −0.08, or p̂ᵢ = exp(−0.08) / [1 + exp(−0.08)] = 0.48
Results: Model 2

Logit estimates                                   Number of obs   =        250
                                                  LR chi2(1)      =      99.30
                                                  Prob > chi2     =     0.0000
Log likelihood = -123.43444                       Pseudo R2       =     0.2869
------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0940683   .0119755     7.86   0.000     .0705967    .1175399
       _cons |  -4.356181   .5700966    -7.64   0.000    -5.473549   -3.238813
------------------------------------------------------------------------------

logit(p̂ᵢ) = −4.4 + 0.094·ageᵢ
(equivalently, on the centered scale of Model 2: logit(p̂ᵢ) = −0.12 + 0.094·(ageᵢ − 45))
and
p̂ᵢ = exp(−4.4 + 0.094·ageᵢ) / [1 + exp(−4.4 + 0.094·ageᵢ)]

What about the Odds Ratio?
Maximum likelihood estimates:
ÔR = exp(β̂₁) = 1.10, s.e.(ÔR) = 0.013
z-test of H₀: exp(β₁) = 1: z = 7.86, p-value = 0.000
95% C.I.: (1.07, 1.12) (done on the log scale)
It appears that blindness is age dependent.
Note: exp(0) = 1. Where is this fact useful? (Testing H₀: β₁ = 0 on the log-odds scale is the same as testing H₀: OR = 1.)

Test of significance
Is the addition of the age variable to the model important?
Maximum likelihood estimates:
β̂₁ = 0.094, s.e.(β̂₁) = 0.012
z-test of H₀: β₁ = 0: z = 7.855; p-value = 0.000
95% C.I.: (0.07, 0.12)

[Figure: Model 1, plot of observed proportions vs. predicted proportions using an intercept-only model; the predicted probability of blindness is flat at 0.48 across all ages.]
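To check the example end to end, here is a sketch fitting Model 2 to the grouped counts (assuming Python's statsmodels; the counts are those in the data table above):

```python
import numpy as np
import statsmodels.api as sm

# Grouped binomial data: number blind out of 50 in each age class
age = np.array([20., 35., 45., 55., 70.])
blind = np.array([6, 7, 26, 37, 44])
n = np.full(5, 50)

# Model 2 (uncentered form): logit(p_i) = b0 + b1 * age_i
X = sm.add_constant(age)
fit = sm.GLM(np.column_stack([blind, n - blind]), X,
             family=sm.families.Binomial()).fit()

print(fit.params)             # ~ [-4.36, 0.094]
print(np.exp(fit.params[1]))  # ~ 1.10: per-year odds ratio of blindness
print(fit.fittedvalues)       # predicted Pr(blind) in each age class
```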
[Figure: Model 2, plot of observed proportions vs. predicted proportions with age in the model; the predicted probabilities track the observed proportions closely across ages.]

Summary
Logistic regression gives us a framework in which to model binary outcomes.
It uses the structure of linear models, with outcomes modelled as a function of covariates.
Many concepts carry over from linear regression:
Interactions
Linear splines
Tests of significance for coefficients
All coefficients will have different interpretations in logistic regression.

Conclusion
Model 2 clearly fits better than Model 1! Including age in our model is better than the intercept alone.