REGRESSION METHODS. Logistic regression

Darcy Richardson
5 years ago
1 REGRESSION METHODS. Logistic regression

2 RECAP
Binary outcome?
  NO: Continuous outcome?
    YES: Linear regression/ANOVA
    NO: Other methods
  YES: Which measure of association?
    Odds ratio: Logistic regression
    Relative risk: GLM w/ log link
    Risk difference: GLM w/ identity link

3 Logistic Regression: Motivation
Many scientific questions of interest involve a binary outcome (e.g. disease/no disease).
Let's investigate if genetic factors are associated with presence/absence of coronary heart disease (CHD).

4 Logistic Regression: Motivation
Scientific questions of interest:
Assess the effect of rs on CHD
Assess the effect of cholesterol on CHD
Assess the effect of rs on CHD after accounting for cholesterol

5 Logistic Regression: Motivation
Scientific question: Assess the effect of rs on risk of CHD
rs is coded as the number of minor alleles: 0 = C/C, 1 = C/T, 2 = T/T.

6 Motivation: rs and CHD
Here is a contingency table for the SNP and CHD:
> table(rs ,chd)
     chd
rs      0    1
  0   154   48
  1   104   66
  2    15   13
Prevalence of CHD in C/C: 48/(48+154) = 0.238
Prevalence of CHD in C/T: 66/(66+104) = 0.388
Prevalence of CHD in T/T: 13/(13+15) = 0.464
Does the prevalence of CHD differ across the groups?
Without using regression, what tool could we use to look for an association between rs and CHD?

7 Motivation: rs and CHD
Here is a contingency table for the SNP and CHD:
> table(rs ,chd)
Without using regression, what tool could we use to look for an association?
> chisq.test(rs ,chd)
Pearson's Chi-squared test
data: rs and chd
X-squared = , df = 2, p-value =
In addition to hypothesis testing, we need to summarize the strength of association between the two variables.
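The prevalences and the chi-squared test can be reproduced from the cell counts alone. The following is a minimal base R sketch, assuming the counts implied by the prevalence calculations on the previous slide (48/154, 66/104, 13/15); the object names `tab`, `prevalence`, and `test` are illustrative and do not come from the slides.

```r
# Cell counts implied by the prevalence calculations on the slide.
# Rows = genotype (minor-allele count 0, 1, 2); columns = CHD status.
tab <- matrix(c(154, 48,
                104, 66,
                 15, 13),
              nrow = 3, byrow = TRUE,
              dimnames = list(genotype = c("C/C", "C/T", "T/T"),
                              chd = c("0", "1")))

# Prevalence of CHD within each genotype group
prevalence <- tab[, "1"] / rowSums(tab)
print(round(prevalence, 3))

# Pearson chi-squared test of independence; df = (3 - 1) * (2 - 1) = 2
test <- chisq.test(tab)
print(test)
```

Passing the table of counts to `chisq.test()` is equivalent to passing the two raw vectors, as the slide does.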

8 Measures of association for binary outcomes
                 Outcome
                 No    Yes
Exposure   Yes   a     b
           No    c     d
Risk difference (RD) = P(outcome | exposed) - P(outcome | not exposed) = (b/(a+b)) - (d/(c+d))
> table(rs ,chd)
RD(T/T vs C/C) = 13/(13+15) - 48/(48+154) = 0.464 - 0.238 = 0.227

9 Measures of association for binary outcomes
                 Outcome
                 No    Yes
Exposure   Yes   a     b
           No    c     d
Risk difference interpretation:
Additive difference in probability (risk) between exposed and unexposed
Also called excess risk
-1 < RD < 1
RD = 0: no association; risk of outcome same for exposed and unexposed

10 Measures of association for binary outcomes
                 Outcome
                 No    Yes
Exposure   Yes   a     b
           No    c     d
Relative risk (RR) = P(outcome | exposed) / P(outcome | not exposed) = (b/(a+b)) / (d/(c+d))
> table(rs ,chd)
RR(T/T vs C/C) = (13/(13+15)) / (48/(48+154)) = 0.464 / 0.238 = 1.95

11 Measures of association for binary outcomes
                 Outcome
                 No    Yes
Exposure   Yes   a     b
           No    c     d
Relative risk interpretation:
Multiplicative difference in probability (risk) of outcome among exposed compared to unexposed
0 < RR < ∞
RR = 1: no association; risk of outcome same for exposed and unexposed

12 Measures of association for binary outcomes
The odds is the ratio of the risk of having an outcome to the risk of not having the outcome.
If p is the risk of an outcome, then the odds of the outcome are p/(1-p).
The odds ratio (OR) is the ratio of the odds of the outcome in the exposed to the odds of the outcome in the unexposed:
$\text{OR} = \frac{p_1/(1-p_1)}{p_0/(1-p_0)}$
where $p_1$ = risk in exposed and $p_0$ = risk in unexposed.
Like the relative risk, the odds ratio provides a measure of association in a ratio (rather than a difference).
The odds ratio is the ratio of two ratios (i.e. the ratio of odds).
The OR approximates the RR for rare events.
The OR is more complicated to interpret than the RR (except for rare events), but there are some study designs (namely, case-control studies) where it is not possible to directly estimate the risk ratio, whereas one can always estimate the odds ratio.

13 Measures of association for binary outcomes
Say the chance of disease (D) if you're exposed (E) = 0.25.
Then the odds of getting D (for those who are exposed) are 0.25/0.75 = 1/3 or 1:3.
Say the chance of disease if you're not exposed = 0.1.
Then the odds of getting D (for those who are not exposed) are 0.1/0.9 = 1/9 or 1:9.
Then the disease odds ratio (ratio of the odds of disease in the exposed to the odds of disease in the unexposed) is (1/3)/(1/9) = 3.
Q: What is the risk ratio here?
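The arithmetic in this example can be checked directly. This is a small sketch in which the helper `odds()` and the variable names are assumptions introduced for illustration; it also computes the risk ratio asked about in the question.

```r
# Helper (illustrative name): convert a risk p into odds p / (1 - p)
odds <- function(p) p / (1 - p)

p_exposed <- 0.25    # chance of disease if exposed (from the slide)
p_unexposed <- 0.10  # chance of disease if not exposed (from the slide)

odds_exposed <- odds(p_exposed)      # 0.25 / 0.75 = 1/3
odds_unexposed <- odds(p_unexposed)  # 0.10 / 0.90 = 1/9

odds_ratio <- odds_exposed / odds_unexposed  # (1/3) / (1/9) = 3
risk_ratio <- p_exposed / p_unexposed        # 0.25 / 0.10 = 2.5

print(c(OR = odds_ratio, RR = risk_ratio))
```

The outcome is not rare here (risks of 10% and 25%), which is why the odds ratio of 3 is noticeably larger than the risk ratio of 2.5.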

14 Measures of association for binary outcomes
                 Outcome
                 No    Yes
Exposure   Yes   a     b
           No    c     d
Odds = P/(1-P)
Odds ratio (OR) = Odds(outcome | exposed) / Odds(outcome | not exposed)
= ((b/(a+b))/(a/(a+b))) / ((d/(c+d))/(c/(c+d)))
= (b/a)/(d/c)
= (bc)/(ad)
> table(rs ,chd)
OR(T/T vs C/C) = (13/15) / (48/154) = 2.78
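The three measures of association can be computed side by side from the four cells of the 2x2 table. The sketch below uses a hypothetical helper, `assoc_measures()`, whose name and argument names are not from the slides; the counts are those of the T/T vs C/C comparison.

```r
# Hypothetical helper: RD, RR and OR from the four cells of a 2x2 table.
# In the slide's layout: a = exp_no, b = exp_yes, c = unexp_no, d = unexp_yes.
assoc_measures <- function(exp_no, exp_yes, unexp_no, unexp_yes) {
  risk_exposed <- exp_yes / (exp_no + exp_yes)
  risk_unexposed <- unexp_yes / (unexp_no + unexp_yes)
  c(RD = risk_exposed - risk_unexposed,
    RR = risk_exposed / risk_unexposed,
    OR = (exp_yes * unexp_no) / (exp_no * unexp_yes))  # (bc)/(ad)
}

# T/T (treated as "exposed") vs C/C (treated as "unexposed")
tt_vs_cc <- assoc_measures(exp_no = 15, exp_yes = 13,
                           unexp_no = 154, unexp_yes = 48)
print(round(tt_vs_cc, 3))
```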

15 Measures of association for binary outcomes
                 Outcome
                 No    Yes
Exposure   Yes   a     b
           No    c     d
Odds ratio interpretation:
Multiplicative difference in odds of outcome between exposed and unexposed
0 < OR < ∞
OR = 1: no association; odds of outcome same for exposed and unexposed

16 Pros and cons of measures of association
RD is appealing because it directly communicates absolute increase in risk; it is often more policy relevant than relative measures.
RR is more directly interpretable than OR (most people don't have an intuitive understanding of odds).
OR is estimable in case-control studies where RR and RD are not.
For rare outcomes, OR ≈ RR.

17 Logistic Regression: Motivation
The chi-squared test is adequate for investigating the association between two categorical variables.
But what if we want to investigate the association between a continuous predictor like cholesterol and a binary outcome like CHD?
Or what if we want to adjust for potential confounders?
Logistic regression will provide us with a tool for this.

18 Binary outcome and continuous exposure
Objective: Estimate association between binary outcome and continuous exposure
Y = binary response (0 = no, 1 = yes)
X = continuous exposure
$p = E(Y \mid X) = P(Y = 1 \mid X)$
One solution: fit a linear model, $E(Y \mid X) = \beta_0 + \beta_1 X$
This is just a standard linear model except our outcome is binary.
Interpretation of $\beta_1$?
Problems with this approach?

19 Motivating example: CHD and cholesterol
> lm.mod1 <- lm(chd ~ chol, data = cholesterol)
> summary(lm.mod1)
Call:
lm(formula = chd ~ chol, data = cholesterol)
Residuals:
Min 1Q Median 3Q Max
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) e-15 ***
chol < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: on 398 degrees of freedom
Multiple R-squared: 0.202, Adjusted R-squared: 0.2
F-statistic: on 1 and 398 DF, p-value: < 2.2e-16
What is the interpretation of the cholesterol parameter estimate?

20 Binary outcome and continuous exposure
Alternative: use a transformation that maps $P(Y = 1 \mid X)$ to the real line.
Let $\text{logit}(p) = \log(p / (1 - p))$
$p \in (0, 1)$
$p/(1 - p) \in (0, \infty)$
$\log(p/(1 - p)) \in (-\infty, \infty)$
[Figure: logit(p) plotted against p]
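The mapping from (0, 1) to the real line can be tried out numerically. This is a minimal sketch in which `logit()` and `inv_logit()` are illustrative helper names; base R already provides the same transformations as `qlogis()` and `plogis()`.

```r
# Illustrative helpers for the logit transformation and its inverse
logit <- function(p) log(p / (1 - p))
inv_logit <- function(eta) 1 / (1 + exp(-eta))

p <- c(0.01, 0.25, 0.50, 0.75, 0.99)
eta <- logit(p)

# Probabilities near 0 map to large negative values, near 1 to large
# positive values, and p = 0.5 maps to 0
print(round(eta, 3))

# The inverse transformation always lands back inside (0, 1)
print(inv_logit(eta))

# Agreement with the built-in functions
print(all.equal(eta, qlogis(p)))
print(all.equal(inv_logit(eta), plogis(eta)))
```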

21 Logistic regression
$\text{logit}(p) = \log(p / (1 - p))$; this ensures that p lies between 0 and 1.
Regress logit(p) on X:
$\text{logit}[E(Y \mid X)] = \log[P(Y=1 \mid X) / (1 - P(Y=1 \mid X))] = \beta_0 + \beta_1 X$
It turns out that the slope coefficients in logistic regression are readily interpretable: they are just log odds ratios!

22 Interpretation of logistic regression parameters
On the log-odds scale:
$\log[\text{odds}(Y=1 \mid X = c+1)] = \beta_0 + \beta_1 (c+1)$
$\log[\text{odds}(Y=1 \mid X = c)] = \beta_0 + \beta_1 c$
$\log[\text{odds}(Y=1 \mid X = c+1)] - \log[\text{odds}(Y=1 \mid X = c)] = \beta_1$
$\log[\text{odds}(Y=1 \mid X = c+1) / \text{odds}(Y=1 \mid X = c)] = \beta_1$
$\log[\text{OR}] = \beta_1$
The ratio inside the last logarithm is the odds ratio (OR).
That is, for two observations that differ by one unit in X there is a difference of $\beta_1$ in their log odds of Y = 1.
Or, equivalently, the log of the ratio of the odds of Y = 1 (i.e. the log OR) for two units that differ in X by one unit is $\beta_1$.

23 Interpretation of logistic regression parameters
By exponentiating we arrive at a simpler interpretation:
$\exp(\log(\text{OR})) = \exp(\beta_1)$
$\text{OR} = \exp(\beta_1)$
So for two observations that differ in X by one unit there is a multiplicative difference in their odds of Y = 1 of $\exp(\beta_1)$.
Or, equivalently, the ratio of the odds of Y = 1 (i.e., the odds ratio) for two observations that differ in X by one unit is $\exp(\beta_1)$.
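The identity between the slope and the log odds ratio can be verified numerically on a fitted model. The sketch below uses simulated data, not the course's cholesterol data set; the variable names, the seed, and the true coefficient values are all assumptions made for illustration.

```r
# Simulated data (assumption): a cholesterol-like exposure and a binary outcome
set.seed(1)
n <- 400
x <- rnorm(n, mean = 180, sd = 20)
y <- rbinom(n, 1, plogis(-10 + 0.05 * x))

fit <- glm(y ~ x, family = binomial)
b1 <- unname(coef(fit)["x"])

# Fitted log odds for two observations that differ in x by one unit
log_odds <- predict(fit, newdata = data.frame(x = c(180, 181)), type = "link")

# The difference in log odds equals the slope ...
print(c(difference = unname(diff(log_odds)), slope = b1))

# ... and the ratio of the odds equals exp(slope), the odds ratio
fitted_odds <- exp(log_odds)
print(c(ratio = unname(fitted_odds[2] / fitted_odds[1]), OR = exp(b1)))
```

The same check works for any pair of values one unit apart, because the model is linear on the log-odds scale.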

24 Motivating example: CHD and cholesterol
> glm.mod1 <- glm(chd ~ chol, family = "binomial", data = cholesterol)
> summary(glm.mod1)
Call:
glm(formula = chd ~ chol, family = "binomial", data = cholesterol)
Deviance Residuals:
Min 1Q Median 3Q Max
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) < 2e-16 ***
chol e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: on 399 degrees of freedom
Residual deviance: on 398 degrees of freedom
AIC:
Number of Fisher Scoring iterations: 4
What do these results tell us about the relationship between cholesterol and CHD?

25 Motivating example: CHD and cholesterol
> summary(glm.mod1) (same model and output as on slide 24)
Comparing two people who differ in cholesterol by 1 mg/dl, the log odds of CHD are higher, by the estimated chol coefficient, for the individual with higher cholesterol.

26 Motivating example: CHD and cholesterol
Differences in log odds are pretty spectacularly difficult to interpret!
It would be much better to exponentiate the coefficients and report odds ratios.
> exp(glm.mod1$coef)
(Intercept) chol
e e+00
> exp(confint(glm.mod1))
Waiting for profiling to be done...
2.5 % 97.5 %
(Intercept) e
chol e
Comparing two people who differ in cholesterol by 1 mg/dl, the odds of CHD are higher by a factor of 1.06 (95% CI: 1.04, 1.07) for the individual with higher cholesterol.

27 Motivating example: CHD and cholesterol
A 1 mg/dl difference is very small, so we might be interested in estimating the OR associated with a larger difference such as 10 mg/dl.
In this case, just as in linear regression, we just need to multiply our coefficient by the appropriate factor.
> exp(10*glm.mod1$coef)
(Intercept) chol
e e+00
Comparing two people whose cholesterol levels differ by 10 mg/dl, the person with the higher cholesterol has 1.73 times higher odds of CHD compared to the person with lower cholesterol.
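Rescaling works on the coefficient (log odds ratio) scale, so the per-10 odds ratio is the per-1 odds ratio raised to the 10th power. The sketch below uses only the two rounded odds ratios quoted on the slides (1.06 and 1.73); the variable names are illustrative, and the back-calculated coefficient is an approximation derived from the rounded 1.73, not a value reported in the slides.

```r
# Odds ratio for a 10 mg/dl difference, as quoted on the slide
or_per_10 <- 1.73

# Implied log odds ratio per 1 mg/dl (approximate, from the rounded 1.73)
beta_per_1 <- log(or_per_10) / 10
or_per_1 <- exp(beta_per_1)
print(round(or_per_1, 2))  # consistent with the 1.06 reported per 1 mg/dl

# exp(10 * beta) is the same as exp(beta)^10
print(all.equal(exp(10 * beta_per_1), or_per_1^10))

# Rescale the unrounded coefficient: raising the rounded 1.06 to the
# 10th power gives about 1.79 rather than 1.73
print(round(1.06^10, 2))
```

Note that `exp(10*glm.mod1$coef)` also rescales the intercept, which has no useful interpretation here; only the `chol` entry is of interest.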

28 Multivariable logistic regression
Often we are interested in examining associations between multiple predictors simultaneously and a binary outcome.
Multiple logistic regression follows the same pattern as linear regression:
$\text{logit}[E(Y \mid X)] = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p$
$\exp(\beta_j)$ is interpreted as the OR associated with a one unit change in the j-th predictor, among individuals with other predictors at the same levels (or holding other predictors constant/controlling for/adjusting for etc.).

29 Motivating example
> glm.mod2 <- glm(chd ~ chol+factor(rs ), family = "binomial", data = cholesterol)
> summary(glm.mod2)
Call:
glm(formula = chd ~ chol + factor(rs ), family = "binomial", data = cholesterol)
Deviance Residuals:
Min 1Q Median 3Q Max
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) < 2e-16 ***
chol e-16 ***
factor(rs )1 **
factor(rs )2 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: on 399 degrees of freedom
Residual deviance: on 396 degrees of freedom
AIC:
Number of Fisher Scoring iterations: 4

30 Motivating example
As we have seen before, exponentiating the coefficients gives us odds ratios.
> exp(glm.mod2$coef)
(Intercept) chol factor(rs )1 factor(rs )2
e e e e+00
A one mg/dl increase in cholesterol is associated with 1.06 times higher odds of CHD after adjusting for genotype.
We can also obtain confidence intervals for the odds ratios.
> exp(confint(glm.mod2))
2.5 % 97.5 %
(Intercept) e
chol e
factor(rs )1 e
factor(rs )2 e

31 Hypothesis testing for logistic regression
Maximum likelihood is the standard method of estimating parameters from logistic models and is based on finding the estimates which maximize the joint probability for the observed data under the chosen model.
The Wald test uses maximum likelihood estimates (MLE) and their standard errors to conduct hypothesis tests.
Test: $H_0: \beta_j = 0$ (no association) vs. $H_A: \beta_j \neq 0$
Construct a z-score (Wald test):
$z = \frac{\hat{\beta}_j}{SE(\hat{\beta}_j)} \sim N(0, 1)$
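The Wald z-scores and p-values reported by `summary()` can be rebuilt by hand from the estimates and their standard errors. The following sketch uses simulated data (an assumption, not the course data), with illustrative variable names.

```r
# Simulated data (assumption): one predictor and a binary outcome
set.seed(42)
n <- 400
x <- rnorm(n)
y <- rbinom(n, 1, plogis(-1 + 0.5 * x))

fit <- glm(y ~ x, family = binomial)

est <- coef(fit)              # maximum likelihood estimates
se <- sqrt(diag(vcov(fit)))   # their estimated standard errors
z <- est / se                 # Wald z-score for H0: beta_j = 0
p_value <- 2 * pnorm(-abs(z)) # two-sided p-value from N(0, 1)

wald <- cbind(Estimate = est, SE = se, z = z, p = p_value)
print(wald)

# For comparison: the coefficient table produced by summary()
print(summary(fit)$coefficients)
```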

32 Motivating example
> summary(glm.mod2) (same model and output as on slide 29)
The z value and Pr(>|z|) columns of the coefficient table give the Wald statistics and p-values for each parameter.

33 Likelihood ratio test
The likelihood ratio statistic is useful in comparing nested models (LRT = likelihood ratio test).
This allows us to test hypotheses about multiple parameters simultaneously, such as
$H_0: \beta_1 = \beta_2 = 0$ vs $H_A$: at least one parameter not equal to 0
In order to use the LRT we must fit a nested hierarchy of models. For example:
Model 1: $\text{logit}(p_i) = \beta_0 + \beta_1 \text{chol}_i$
Model 2: $\text{logit}(p_i) = \beta_0 + \beta_1 \text{chol}_i + \beta_2 \text{SNP}_{1i} + \beta_3 \text{SNP}_{2i}$

34 Likelihood ratio test
The LRT allows us to test the significance of the additional parameters in the larger model.
Example: Compare model 1 to model 2
$H_0: \beta_2 = \beta_3 = 0$
$\text{LRT} = -2[L_1 - L_2] \sim \chi^2_2$
where $L_1$ and $L_2$ are the maximized log-likelihoods of models 1 and 2
df = # parameters being tested
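The LRT statistic can be computed by hand from the two maximized log-likelihoods. The sketch below is a minimal base R version using simulated data; the variable names (`chol`, `snp`, `small`, `big`), the seed, and the true coefficients are assumptions for illustration and are not the course's cholesterol data set.

```r
# Simulated data (assumption): cholesterol-like covariate and a 3-level SNP
set.seed(7)
n <- 400
chol <- rnorm(n, mean = 180, sd = 20)
snp <- sample(0:2, n, replace = TRUE, prob = c(0.5, 0.4, 0.1))
y <- rbinom(n, 1, plogis(-10 + 0.05 * chol + 0.4 * snp))

small <- glm(y ~ chol, family = binomial)                # Model 1
big <- glm(y ~ chol + factor(snp), family = binomial)    # Model 2

# LRT = -2 * [logLik(Model 1) - logLik(Model 2)]
lrt <- as.numeric(-2 * (logLik(small) - logLik(big)))

# Degrees of freedom = number of extra parameters in the larger model
df_diff <- attr(logLik(big), "df") - attr(logLik(small), "df")

p_value <- pchisq(lrt, df = df_diff, lower.tail = FALSE)
print(c(LRT = lrt, df = df_diff, p = p_value))

# The same comparison via the built-in analysis-of-deviance table
print(anova(small, big, test = "Chisq"))
```

`anova()` on two nested `glm` fits reports the same statistic in its `Deviance` column, which is a base R alternative to `lrtest()`.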

35 Example: Likelihood ratio test
> library(lmtest)
> lrtest(glm.mod1,glm.mod2)
Likelihood ratio test
Model 1: chd ~ chol
Model 2: chd ~ chol + factor(rs )
#Df LogLik Df Chisq Pr(>Chisq)
**
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
After accounting for cholesterol, there is a statistically significant association between rs and CHD.

36 Logistic Regression: Assumptions
1. $\text{logit}(E[Y \mid x])$ is related linearly to x
2. Y's are independent of each other

37 Summary
We have considered:
Measures of association for binary outcomes
Logistic regression: interpretation, estimation, hypothesis testing

38 REGRESSION METHODS. Generalized linear models

39 Generalized linear models
So far we have considered:
Continuous outcomes: linear regression/ANOVA
Binary outcomes: logistic regression
Generalized linear models (GLMs) provide a way to model:
Continuous and binary outcomes
Additional types of outcome variables (e.g. counts)
Additional functional forms for the relationship between outcomes and predictors

40 Generalized Linear Models
GLMs allow us to estimate regression models for outcomes arising from exponential family distributions.
This family includes many familiar distributions including Normal, Binomial and Poisson.
A GLM is specified based on three components:
Outcome distribution
Linear predictor
Link function
We will see that linear and logistic regression are both GLMs with specific choices of outcome distribution and link function!

41 Outcome distribution
The first step in fitting a GLM is to choose an appropriate distribution for your outcome.
Examples:
Continuous outcome: Normal
Binary outcome: Binomial
Count outcome: Poisson

42 Linear predictor
After specifying a distribution for the outcome, we specify the linear predictor:
$g[E(Y)] = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$
This is just the systematic piece of our regression model.
As in other regression models we have seen, we need to identify the set of covariates to be included.

43 Link function
Finally, we specify a link function, $g[E(Y)]$:
$g[E(Y)] = \beta_0 + \beta_1 x_1 + \dots + \beta_p x_p$
This describes the functional form of the relationship between E(Y) and the linear predictor.
In linear regression, we use the identity link function: $g[E(Y)] = E(Y)$
In logistic regression, we use the logit link function: $g[E(Y)] = \log[E(Y)/(1-E(Y))]$

44 Generalized linear models
A few example GLMs:
Distribution   Link function                            Model
Normal         Identity: g[E(Y)] = E(Y)                 Linear regression
Binomial       Logit: g[E(Y)] = log[E(Y)/(1-E(Y))]      Logistic regression
Poisson        Log: g[E(Y)] = log[E(Y)]                 Poisson GLM
Gamma          Log: g[E(Y)] = log[E(Y)]                 Gamma GLM
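Each row of this table corresponds to a `family` argument in R's `glm()`. The sketch below fits the first three rows to simulated data (an assumption; all variable names and true coefficients are illustrative) and checks that the Normal/identity GLM reproduces ordinary linear regression.

```r
# Simulated data (assumption), one predictor and three kinds of outcome
set.seed(3)
n <- 100
x <- rnorm(n)
y_cont <- 2 + 0.5 * x + rnorm(n)               # continuous outcome
y_bin <- rbinom(n, 1, plogis(0.5 * x))         # binary outcome
y_count <- rpois(n, exp(0.2 + 0.3 * x))        # count outcome

# Normal distribution + identity link = linear regression
fit_lm <- lm(y_cont ~ x)
fit_gauss <- glm(y_cont ~ x, family = gaussian(link = "identity"))
print(all.equal(coef(fit_lm), coef(fit_gauss)))

# Binomial distribution + logit link = logistic regression
fit_logit <- glm(y_bin ~ x, family = binomial(link = "logit"))

# Poisson distribution + log link = Poisson GLM
fit_pois <- glm(y_count ~ x, family = poisson(link = "log"))

print(rbind(gaussian = coef(fit_gauss),
            binomial = coef(fit_logit),
            poisson = coef(fit_pois)))
```

The coefficients in the three rows live on different scales (mean, log odds, log mean), which is exactly what the link function determines.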

45 Alternatives to logistic regression
The odds ratio is limited by difficulty of interpretation; the relative risk is more interpretable.
To estimate a relative risk using regression we can use the log-linear model:
$\log[E(Y \mid x)] = \beta_0 + \beta_1 x$
This is sometimes referred to as relative risk regression.
$\exp(\beta_1)$ is the relative risk associated with a one-unit increase in x.

46 Modified Poisson regression
To estimate the relative risk, we could use a binomial GLM with log link.
It turns out that estimation for this model is very challenging and results are sensitive to outliers in X.
An alternative approach that performs better in practice is modified Poisson regression.
This method uses a Poisson GLM with log link.
Using a Poisson model for binary data will give incorrect standard errors because the variance for binary outcomes differs from the variance for Poisson outcomes.
We can combine the Poisson GLM with a robust variance estimator to account for this violation of the model's assumptions.
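The combination of a Poisson GLM with a robust variance estimator can be sketched in base R by building the sandwich estimator by hand. This is an illustrative reconstruction on simulated data, not the slides' method, which uses the gee package; the variable names, the seed, and the true coefficients are assumptions.

```r
# Simulated data (assumption): binary outcome with a log-linear risk model
set.seed(2)
n <- 400
x <- rnorm(n)
risk <- pmin(exp(-1.5 + 0.3 * x), 1)  # cap at 1 so these are valid probabilities
y <- rbinom(n, 1, risk)

# Poisson GLM with log link fitted to the 0/1 outcome
fit <- glm(y ~ x, family = poisson(link = "log"))

X <- model.matrix(fit)
mu <- fitted(fit)

# Sandwich estimator: bread %*% meat %*% bread
bread <- solve(t(X) %*% (X * mu))       # model-based (naive) covariance
meat <- t(X) %*% (X * (y - mu)^2)       # empirical variance of the scores
robust_vcov <- bread %*% meat %*% bread

naive_se <- sqrt(diag(vcov(fit)))
robust_se <- sqrt(diag(robust_vcov))
print(cbind(RR = exp(coef(fit)), naive_se, robust_se))
```

The exponentiated slope estimates a relative risk; the robust column is the one to use for inference, since the naive standard errors assume Poisson variance.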

47 Modified Poisson regression
> library(gee)
> glm.rr <- gee(chd ~ chol+factor(rs ), family = "poisson", id = seq(1,nrow(cholesterol)), data = cholesterol)
> summary(glm.rr)
GEE: GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
gee S-function, version 4.13 modified 98/01/27 (1998)
Model:
Link: Logarithm
Variance to Mean Relation: Poisson
Correlation Structure: Independent
Coefficients:
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept)
chol
factor(rs )1
factor(rs )2
Estimated Scale Parameter:
Number of Iterations: 1

48 Modified Poisson regression
The relative risk of CHD associated with a 1 mg/dl increase in cholesterol is the exponentiated chol coefficient:
> exp(glm.rr$coef)
(Intercept) chol factor(rs )1 factor(rs )2
Compare this to the odds ratio we obtained earlier using logistic regression:
> exp(glm.mod2$coef)
(Intercept) chol factor(rs )1 factor(rs )2
e e e e

49 Relative risk regression: Assumptions
1. $\log(E[Y \mid x]) = \log(P(Y=1 \mid x))$ is related linearly to x
Warning: this can lead to predicted probabilities > 1
2. Y's are independent of each other

50 Risk difference regression
Recall, we also considered fitting a linear model to binary outcome data.
This allows us to estimate differences in risk associated with a 1 unit difference in the predictor.
By using robust standard errors, we can account for violation of the assumptions of normality and equal variance.
> glm.rd <- gee(chd ~ chol+factor(rs ), id = seq(1,nrow(cholesterol)), data = cholesterol)
> summary(glm.rd)
Coefficients:
Estimate Naive S.E. Naive z Robust S.E. Robust z
(Intercept)
chol
factor(rs )1
factor(rs )2

51 Risk difference regression
A 1 mg/dl difference is very small, so we might be interested in estimating the RD associated with a larger difference such as 10 mg/dl.
Comparing two people with the same rs genotype whose cholesterol levels differ by 10 mg/dl, the risk of CHD for the person with the higher cholesterol is 9.4% higher (in absolute terms) compared to the person with lower cholesterol.
Comparing two people with the same cholesterol level, a person with rs C/T is estimated to have risk of CHD 14.3% higher (in absolute terms) than a person with rs C/C.
Comparing two people with the same cholesterol level, a person with rs T/T is estimated to have risk of CHD 21.2% higher (in absolute terms) than a person with rs C/C.

52 Risk difference regression: Assumptions
1. $E[Y \mid x] = P(Y=1 \mid x)$ is related linearly to x
Warning: this can lead to predicted probabilities > 1 or < 0
2. Y's are independent of each other
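The warning about out-of-range predictions is easy to demonstrate. The sketch below fits a linear model to a simulated binary outcome (an assumption; the data, seed, and names are illustrative) and compares its predictions with those of a logistic model at the edges of the observed range.

```r
# Simulated data (assumption): risk rises steeply with x
set.seed(11)
x <- seq(-3, 3, length.out = 400)
y <- rbinom(400, 1, plogis(2 * x))

# Risk difference regression: linear model for the binary outcome
lpm <- lm(y ~ x)
# Logistic regression on the same data, for comparison
logit_fit <- glm(y ~ x, family = binomial)

new_x <- data.frame(x = c(-3, 0, 3))
pred_linear <- predict(lpm, newdata = new_x)
pred_logit <- predict(logit_fit, newdata = new_x, type = "response")

# The linear model can predict "probabilities" below 0 and above 1;
# the logistic model cannot
print(round(cbind(x = new_x$x, linear = pred_linear, logistic = pred_logit), 3))
```

The slope of the linear fit is still a valid summary of the average risk difference per unit of x; the issue only affects predicted probabilities for individuals far from the center of the data.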

53 Summary
We have considered:
Logistic regression: interpretation, estimation
Generalized linear models
Relative risk regression
Risk difference regression

54 Module summary
In this module we have covered a variety of regression methods that can be used to analyze continuous and binary outcomes:
Continuous outcomes: simple linear regression, multiple linear regression, ANOVA
Binary outcomes: logistic regression, relative risk regression, risk difference regression
These methods are foundational for many statistical analyses, and we hope you will be able to apply them to your future research!

55 Everything is regression! (Professor Scott Emerson)
