Lecture 10: Introduction to Logistic Regression

Ani Manichaikul (amanicha@jhsph.edu), 2 May 2007

Logistic Regression
Regression for a response variable that follows a binomial distribution. Recall the binomial model and the binomial distribution.

Binomial Model
- n independent trials (e.g., coin tosses)
- p = probability of success on each trial (e.g., p = ½ = Pr(heads))
- Y = number of successes out of n trials (e.g., Y = number of heads)

Binomial Distribution
P(Y = y) = (n choose y) p^y (1 - p)^(n-y)
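
As a quick numeric check of this formula (a minimal Python sketch, not part of the original lecture; it assumes SciPy is available):

```python
from scipy.stats import binom

# P(Y = 5 heads) in n = 10 fair-coin tosses: C(10,5) * 0.5^5 * 0.5^5
print(binom.pmf(5, n=10, p=0.5))  # ~0.246
```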

Why can't we use regular regression (SLR or MLR)?

Cannot use Linear Regression
- The response, Y, is NOT normally distributed.
- The variability of Y is NOT constant, since the variance, Var(Y) = pq, depends on the expected response, E(Y) = p.
- The predicted/fitted values must be such that the corresponding probabilities are between 0 and 1.

Example
Consider a phase I clinical trial in which 35 independent patients are given a new medication for pain relief. Of the 35 patients, 22 report significant relief one hour after medication. Question: How effective is the drug?

Model
- Y = number of patients who get relief
- n = 35 patients (trials)
- p = probability of relief for any patient (the truth we seek in the population)
How effective is the drug? What is p?
- Get the best estimate of p given the data
- Determine the margin of error: a range of plausible values for p

Maximum Likelihood Method
The method of maximum likelihood estimation chooses values for the parameter estimates that make the observed data maximally likely under the specified model.

Maximum Likelihood
For the binomial model, we have observed Y = y and
P(Y = y) = (n choose y) p^y (1 - p)^(n-y)
So for this example:
P(Y = 22) = (35 choose 22) p^22 (1 - p)^13

Maximum Likelihood
So, estimate p by choosing the value of p that makes the observed data maximally likely, i.e., choose the p that maximizes Pr(Y = 22). The ML estimate is p̂ = y/n = 22/35 = 0.63, the estimated proportion of patients who will experience relief.

Maximum Likelihood
[Figure: the likelihood function Pr(Y = 22 of 35) plotted against p = prob(event) from 0 to 1; the curve peaks at the MLE, p̂ = 0.63.]
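
The curve above can be reproduced numerically. A short sketch (assuming NumPy and SciPy) that evaluates the likelihood over a grid of p values and locates its maximum:

```python
import numpy as np
from scipy.stats import binom

# Likelihood of the observed data, Pr(Y = 22 | n = 35, p), over a grid of p
p_grid = np.linspace(0.01, 0.99, 99)
lik = binom.pmf(22, n=35, p=p_grid)
print(p_grid[np.argmax(lik)])  # 0.63, matching the MLE y/n = 22/35
```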

Confidence Interval for p
Variance of p̂: Var(p̂) = p(1 - p)/n = pq/n (where q = 1 - p)
Standard error of p̂: sqrt(pq/n)
Estimated standard error of p̂: sqrt(p̂q̂/n)

Confidence Interval for p
95% confidence interval for the true proportion, p:
p̂ ± 1.96·sqrt(p̂q̂/n) = 0.63 ± 1.96·sqrt((0.63)(0.37)/35)
= (0.63 - 1.96(0.082), 0.63 + 1.96(0.082))
= (0.47, 0.79)
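
The same interval in a few lines of Python (a sketch of the Wald calculation above):

```python
from math import sqrt

y, n = 22, 35
p_hat = y / n                          # ~0.63
se = sqrt(p_hat * (1 - p_hat) / n)     # ~0.082
print(round(p_hat - 1.96 * se, 2),
      round(p_hat + 1.96 * se, 2))     # 0.47 0.79
```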

Conclusion
Based upon our clinical trial, in which 22 of 35 patients experienced relief, we estimate that 63% of persons who receive the new drug will experience relief within 1 hour (95% CI: 47% to 79%).

Conclusion
Whether 63% (47% to 79%) represents an effective drug will depend on many things, especially on the science of the problem. Sore throat pain? Arthritis pain? Accidentally-cut-your-leg-off pain?

Aside: Probabilities and Odds
The odds of an event are defined as:
odds(Y=1) = P(Y = 1)/P(Y = 0) = P(Y = 1)/[1 - P(Y = 1)] = p/(1-p)

Probabilities and Odds
We can go back and forth between odds and probabilities:
odds = p/(1-p)
p = odds/(odds + 1)
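
These conversions are one-liners; a small sketch (not from the original lecture) showing the round trip:

```python
def odds(p):
    """Probability -> odds."""
    return p / (1 - p)

def prob(o):
    """Odds -> probability."""
    return o / (o + 1)

print(odds(0.63))        # ~1.70
print(prob(odds(0.63)))  # 0.63: the round trip recovers p
```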

Odds Ratio
We saw that an odds ratio (OR) can be helpful for comparisons. Recall the Vitamin A trial:
OR = odds(death | Vit. A) / odds(death | No Vit. A)

Odds Ratio
The OR here describes the benefit of Vitamin A therapy. We saw for this example that OR = 0.59: an estimated 40% reduction in the odds of mortality. The OR is a building block for logistic regression.

Logistic Regression
Suppose we want to ask whether the new drug is better than a placebo, and we have the following observed data:

                Relief?
           No     Yes    Total
Drug       13      22      35
Placebo    20      15      35

Confidence Intervals for p
[Figure: 95% confidence intervals for p for the Placebo and Drug groups, plotted side by side on the probability scale from 0 to 1.]

Odds Ratio
OR = odds(relief | Drug) / odds(relief | Placebo)
   = {P(relief | Drug)/[1 - P(relief | Drug)]} / {P(relief | Placebo)/[1 - P(relief | Placebo)]}
   = [0.63/(1 - 0.63)] / [0.43/(1 - 0.43)]
   = 2.26

Confidence Interval for OR
The CI uses Woolf's method for the standard error of log(ÔR):
se(log(ÔR)) = sqrt(1/22 + 1/13 + 1/15 + 1/20) = 0.489
Then find (L, U) = log(ÔR) ± 1.96·se(log(ÔR)), and take (e^L, e^U).
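
A sketch of Woolf's calculation on the 2x2 table above (cell counts: drug 22/13, placebo 15/20):

```python
from math import exp, log, sqrt

a, b, c, d = 22, 13, 15, 20          # drug yes/no, placebo yes/no
or_hat = (a / b) / (c / d)           # ~2.26
se = sqrt(1/a + 1/b + 1/c + 1/d)     # ~0.489
L = log(or_hat) - 1.96 * se
U = log(or_hat) + 1.96 * se
print(exp(L), exp(U))  # ~(0.87, 5.88); the lecture reports (0.86, 5.9)
```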

Interpretation
OR = 2.26, 95% CI: (0.86, 5.9). The odds of relief on the drug are an estimated 2¼ times the odds on placebo. But could the difference be due to chance alone?

Logistic Regression
Can we set up a model for this similar to what we've done in ANOVA and regression? Idea: model the log odds of the event (in this example, relief) as a function of predictor variables.

Model
log[odds(relief | Tx)] = log{ P(relief | Tx) / P(no relief | Tx) } = β0 + β1·Tx
where: Tx = 0 if Placebo, 1 if Drug

Then
log(odds(relief | Drug)) = β0 + β1
log(odds(relief | Placebo)) = β0
log(odds(relief | Drug)) - log(odds(relief | Placebo)) = β1

And
Thus: log[ odds(relief | Drug) / odds(relief | Placebo) ] = β1
And: OR = exp(β1) = e^β1
So: exp(β1) = the odds ratio of relief for patients taking the Drug vs. patients taking the Placebo.

Logistic Regression

Logit estimates                                   Number of obs   =         70
                                                  LR chi2(1)      =       2.83
                                                  Prob > chi2     =     0.0926
Log likelihood = -46.99169                        Pseudo R2       =     0.0292

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        drug |   .8137752   .4889211     1.66   0.096    -.1444926    1.772043
       _cons |  -.2876821    .341565    -0.84   0.400    -.9571372    .3817731
------------------------------------------------------------------------------

Estimates: log(odds(relief)) = β̂0 + β̂1·Drug = -0.288 + 0.814(Drug)
Therefore: OR = exp(0.814) = 2.26
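
The same fit can be reproduced outside Stata. A sketch using Python's statsmodels, rebuilding individual 0/1 records from the 2x2 table:

```python
import numpy as np
import statsmodels.api as sm

# 22 relief / 13 none on drug; 15 relief / 20 none on placebo
drug = np.repeat([1, 1, 0, 0], [22, 13, 15, 20])
y    = np.repeat([1, 0, 1, 0], [22, 13, 15, 20])

fit = sm.Logit(y, sm.add_constant(drug)).fit(disp=0)
print(fit.params)             # const ~ -0.288, drug ~ 0.814
print(np.exp(fit.params[1]))  # OR ~ 2.26
```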

It's the same!
So, why go to all the trouble of setting up a linear model? What if there is a biologic reason to expect that the rate of relief (and perhaps drug efficacy) is age dependent?

Adding other variables
What if Pr(relief) is a function of Drug-vs-Placebo AND age? We could easily include age in a model such as:
log(odds(relief)) = β0 + β1·Drug + β2·Age

Logistic Regression
As in MLR, we can include many additional covariates. For a logistic regression model with p predictors:
log(odds(Y=1)) = β0 + β1X1 + ... + βpXp
where: odds(Y=1) = Pr(Y = 1)/Pr(Y = 0) = Pr(Y = 1)/[1 - Pr(Y = 1)]

Logistic Regression
Thus: log[ Pr(Y = 1)/Pr(Y = 0) ] = β0 + β1X1 + ... + βpXp
But why use log(odds)?

Logistic regression
Linear regression might estimate anything in (-∞, +∞), not just a proportion in the range 0 to 1. Logistic regression is a way to estimate a proportion (between 0 and 1), as well as some related quantities.

Linear models for binary outcomes
We would like to use something like what we know from linear regression:
Continuous outcome = β0 + β1X1 + β2X2 + ...
How can we turn a proportion into a continuous outcome?

Transforming a proportion
The odds are always positive:
odds = p/(1-p) ∈ [0, +∞)
The log odds is continuous:
log odds = ln[p/(1-p)] ∈ (-∞, +∞)

Logit transformation

Measure                                 Min    Max    Name
Pr(Y = 1)                                0      1     probability
Pr(Y = 1) / [1 - Pr(Y = 1)]              0      ∞     odds
log{ Pr(Y = 1) / [1 - Pr(Y = 1)] }      -∞      ∞     log-odds, or logit

Logit Function
Relates the log-odds (logit) to p = Pr(Y = 1).
[Figure: the logit function, with log-odds (y-axis, -10 to 10) plotted against probability of success (x-axis, 0 to 1); the curve passes through 0 at p = 0.5.]

Key Relationships
Relating log-odds, probabilities, and parameters in logistic regression. Suppose the model:
logit(p) = β0 + β1X, i.e., log[p/(1-p)] = β0 + β1X
Take anti-logs:
p/(1-p) = exp(β0 + β1X)

Solve for p
p = (1 - p)·exp(β0 + β1X)
p = exp(β0 + β1X) - p·exp(β0 + β1X)
p + p·exp(β0 + β1X) = exp(β0 + β1X)
p·{1 + exp(β0 + β1X)} = exp(β0 + β1X)
p = exp(β0 + β1X) / [1 + exp(β0 + β1X)]

What's the point?
We can determine the probability of success for a specific set of covariates, X, after running a logistic regression model.
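
For example, applying the inverse-logit formula with the fitted drug-trial coefficients from the earlier output (a sketch, not from the original lecture):

```python
from math import exp

def inv_logit(x):
    """Log-odds -> probability: exp(x) / (1 + exp(x))."""
    return exp(x) / (1 + exp(x))

b0, b1 = -0.288, 0.814      # intercept and Drug coefficient from the fit above
print(inv_logit(b0))        # Pr(relief | Placebo) ~ 0.43
print(inv_logit(b0 + b1))   # Pr(relief | Drug)    ~ 0.63
```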

Dependence of Blindness on Age
The following data concern the Aegean island of Kalytos, where inhabitants suffer from a congenital eye disease whose effects become more marked with age. Samples of 50 people were taken at five different ages, and the numbers of blind people were counted.

Example: Data

Age                 20      35      45      55      70
Number blind/50    6/50    7/50   26/50   37/50   44/50

Question
The scientific question of interest is to determine how the probability of blindness is related to age in this population. Let p_i = Pr(a person in age class i is blind).

Model 1
logit(p_i) = β0*
β0* = log-odds of blindness for all ages
exp(β0*) = odds of blindness for all ages
There is no age dependence in this model.

Model 2
logit(p_i) = β0 + β1(age_i - 45)
β0 = log-odds of blindness among 45-year-olds
exp(β0) = odds of blindness among 45-year-olds
β1 = difference in log-odds of blindness comparing a group that is one year older than another
exp(β1) = odds ratio of blindness comparing a group that is one year older than another

Results: Model 1

Iteration 0:   log likelihood = -173.08674

Logit estimates                                   Number of obs   =        250
                                                  LR chi2(0)      =       0.00
                                                  Prob > chi2     =          .
Log likelihood = -173.08674                       Pseudo R2       =     0.0000

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
       _cons |  -.0800427   .1265924    -0.63   0.527    -.3281593    .1680739
------------------------------------------------------------------------------

logit(p̂_i) = -0.08, or p̂_i = exp(-0.08)/[1 + exp(-0.08)] = 0.48

Results: Model 2

Logit estimates                                   Number of obs   =        250
                                                  LR chi2(1)      =      99.30
                                                  Prob > chi2     =     0.0000
Log likelihood = -123.43444                       Pseudo R2       =     0.2869

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         age |   .0940683   .0119755     7.86   0.000     .0705967    .1175399
       _cons |  -4.356181   .5700966    -7.64   0.000    -5.473549   -3.238812
------------------------------------------------------------------------------

logit(p̂_i) = -4.36 + 0.094·age_i (equivalently, -0.12 + 0.094(age_i - 45) in the centered form of Model 2), or
p̂_i = exp(-4.36 + 0.094·age_i) / [1 + exp(-4.36 + 0.094·age_i)]
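
A sketch of the same fit on the grouped data (successes and failures per age group), using statsmodels' binomial GLM:

```python
import numpy as np
import statsmodels.api as sm

age = np.array([20, 35, 45, 55, 70])
blind = np.array([6, 7, 26, 37, 44])

# Endog as (successes, failures) pairs fits the grouped binomial model
counts = np.column_stack([blind, 50 - blind])
fit = sm.GLM(counts, sm.add_constant(age),
             family=sm.families.Binomial()).fit()
print(fit.params)  # const ~ -4.36, age ~ 0.094
```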

Test of significance
Is the addition of the age variable to the model important?
Maximum likelihood estimates: β̂1 = 0.094, s.e.(β̂1) = 0.012
z-test of H0: β1 = 0: z = 7.855, p-value = 0.000
95% C.I.: (0.07, 0.12)

What about the Odds Ratio?
Maximum likelihood estimates: OR = exp(β̂1) = 1.10, s.e. = 0.013
z-test of H0: exp(β1) = 1: z = 7.86, p-value = 0.000
95% C.I.: (1.07, 1.13) (computed on the log scale)
It appears that blindness is age dependent.
Note: exp(0) = 1; where is this fact useful?
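
The OR and its CI come straight from the coefficient scale; a sketch of the exponentiation step, using the Model 2 output:

```python
from math import exp

b1, se = 0.0940683, 0.0119755   # age coefficient and its s.e. from Model 2
print(round(exp(b1), 2))        # OR ~ 1.10
print(round(exp(b1 - 1.96 * se), 2),
      round(exp(b1 + 1.96 * se), 2))  # ~(1.07, 1.12); the lecture reports (1.07, 1.13)
```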

Model 1
[Figure: observed vs. predicted proportions from the intercept-only model; probability of blindness (0 to 1) against age (20 to 80). The predicted probability is constant at 0.48 across ages.]

Model 2
[Figure: observed vs. predicted proportions with age in the model; probability of blindness (0 to 1) against age (20 to 80). The fitted logistic curve tracks the observed proportions closely.]

Conclusion
Model 2 clearly fits better than Model 1! Including age in our model is better than an intercept alone.

Summary
- Logistic regression gives us a framework in which to model binary outcomes.
- It uses the structure of linear models, with outcomes modelled as a function of covariates.
- Many concepts carry over from linear regression: interactions, linear splines, tests of significance for coefficients.
- All coefficients will have different interpretations in logistic regression.