REGRESSION METHODS. Logistic regression

RECAP

Binary outcome?
  NO:  Continuous outcome?
         YES -> Linear regression/ANOVA
         NO  -> Other methods
  YES: Odds ratio as measure of association?      -> Logistic regression
       Relative risk as measure of association?   -> GLM w/ log link
       Risk difference as measure of association? -> GLM w/ identity link

Logistic Regression: Motivation

- Many scientific questions of interest involve a binary outcome (e.g. disease/no disease)
- Let's investigate if genetic factors are associated with presence/absence of coronary heart disease (CHD)

Logistic Regression: Motivation

Scientific questions of interest:
- Assess the effect of rs4775401 on CHD
- Assess the effect of cholesterol on CHD
- Assess the effect of rs4775401 on CHD after accounting for cholesterol

Logistic Regression: Motivation

Scientific question: Assess the effect of rs4775401 on risk of CHD

rs4775401 is coded as the number of minor alleles: 0 = C/C, 1 = C/T, 2 = T/T.

Motivation: rs4775401 and CHD

Here is a contingency table for the SNP and CHD:

> table(rs4775401,chd)
         chd
rs4775401   0   1
        0 154  48
        1 104  66
        2  15  13

Prevalence of CHD in C/C: 48/(48+154) = 0.238
Prevalence of CHD in C/T: 66/(66+104) = 0.388
Prevalence of CHD in T/T: 13/(13+15) = 0.464

Does the prevalence of CHD differ across the groups? Without using regression, what tool could we use to look for an association between rs4775401 and CHD?

Motivation: rs4775401 and CHD

One tool that does not require regression is the chi-squared test, applied to the same contingency table:

> chisq.test(rs4775401,chd)

        Pearson's Chi-squared test

data:  rs4775401 and chd
X-squared = 12.657, df = 2, p-value = 0.001785

In addition to hypothesis testing, we need to summarize the strength of association between the two variables.

Measures of association for binary outcomes

                  Outcome
                  No    Yes
Exposure   Yes    a     b
           No     c     d

Risk difference (RD) = P(outcome | exposed) - P(outcome | not exposed)
                     = b/(a+b) - d/(c+d)

> table(rs4775401,chd)
      0   1
  0 154  48
  1 104  66
  2  15  13

RD(T/T vs C/C) = 13/(13+15) - 48/(48+154) = 0.464 - 0.238 = 0.226

Measures of association for binary outcomes

Risk difference interpretation:
- Additive difference in probability (risk) between exposed and unexposed
- Also called excess risk
- -1 < RD < 1
- RD = 0: no association; risk of outcome is the same for exposed and unexposed

Measures of association for binary outcomes

Relative risk (RR) = P(outcome | exposed) / P(outcome | not exposed)
                   = (b/(a+b)) / (d/(c+d))

where a, b, c, d are the cell counts of the 2x2 exposure-by-outcome table.

RR(T/T vs C/C) = (13/(13+15)) / (48/(48+154)) = 0.464 / 0.238 = 1.95

Measures of association for binary outcomes

Relative risk interpretation:
- Multiplicative difference in probability (risk) of outcome among exposed compared to unexposed
- 0 < RR < ∞
- RR = 1: no association; risk of outcome is the same for exposed and unexposed

Measures of association for binary outcomes

- The odds is the ratio of the risk of having an outcome to the risk of not having the outcome
- If p is the risk of an outcome, then the odds of the outcome are p/(1-p)
- The odds ratio (OR) is the ratio of the odds of the outcome in the exposed to the odds of the outcome in the unexposed:

    OR = [p_1/(1 - p_1)] / [p_0/(1 - p_0)]

  where p_1 = risk in exposed and p_0 = risk in unexposed
- Like the relative risk, the odds ratio provides a measure of association in a ratio (rather than a difference)
- The odds ratio is the ratio of two ratios (i.e. the ratio of odds)
- The OR approximates the RR for rare events
- The OR is more complicated to interpret than the RR (except for rare events), but there are some study designs (namely, case-control studies) where it is not possible to directly estimate the risk ratio, but one can always estimate the odds ratio

Measures of association for binary outcomes

- Say the chance of disease (D) if you're exposed (E) = 0.25
  Then the odds of getting D (for those who are exposed) are 0.25/0.75 = 1/3, or 1:3
- Say the chance of disease if you're not exposed = 0.1
  Then the odds of getting D (for those who are not exposed) are 0.1/0.9 = 1/9, or 1:9
- Then the disease odds ratio (ratio of the odds of disease in the exposed to the odds of disease in the unexposed) is (1/3)/(1/9) = 3

Q: What is the risk ratio here?
A: 0.25/0.1 = 2.5

Measures of association for binary outcomes

Odds = P/(1-P)

Odds ratio (OR) = odds(outcome | exposed) / odds(outcome | not exposed)
                = ((b/(a+b)) / (a/(a+b))) / ((d/(c+d)) / (c/(c+d)))
                = (b/a) / (d/c)
                = (bc)/(ad)

where a, b, c, d are the cell counts of the 2x2 exposure-by-outcome table.

OR(T/T vs C/C) = (13/15) / (48/154) = 2.78

Measures of association for binary outcomes

Odds ratio interpretation:
- Multiplicative difference in odds of outcome between exposed and unexposed
- 0 < OR < ∞
- OR = 1: no association; odds of outcome are the same for exposed and unexposed
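
The three measures can be computed directly from the rs4775401-by-CHD counts given in the slides. This is a minimal base R sketch; the object names (`tab`, `risk`, `odds`, `odds_ratio`) are chosen here for illustration and are not part of the course code.

```r
# Counts from the rs4775401-by-CHD contingency table in the slides
# rows: genotype (0 = C/C, 1 = C/T, 2 = T/T); columns: CHD (0 = no, 1 = yes)
tab <- matrix(c(154, 48,
                104, 66,
                 15, 13),
              nrow = 3, byrow = TRUE,
              dimnames = list(rs4775401 = c("0", "1", "2"),
                              chd = c("0", "1")))

# Risk (prevalence) of CHD within each genotype group
risk <- tab[, "1"] / rowSums(tab)

# Odds of CHD within each genotype group
odds <- risk / (1 - risk)

# Compare T/T (row "2") with C/C (row "0")
rd         <- unname(risk["2"] - risk["0"])   # risk difference
rr         <- unname(risk["2"] / risk["0"])   # relative risk
odds_ratio <- unname(odds["2"] / odds["0"])   # odds ratio

round(c(RD = rd, RR = rr, OR = odds_ratio), 3)
```

Changing the row labels in the last three assignments gives the corresponding C/T versus C/C comparisons.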

Pros and cons of measures of association

- RD is appealing because it directly communicates the absolute increase in risk
- RD is often more policy relevant than relative measures
- RR is more directly interpretable than OR (most people don't have an intuitive understanding of odds)
- OR is estimable in case-control studies, where RR and RD are not
- For rare outcomes, OR ≈ RR

Logistic Regression: Motivation

- The chi-squared test is adequate for investigating the association between two categorical variables
- But what if we want to investigate the association between a continuous predictor like cholesterol and a binary outcome like CHD?
- Or what if we want to adjust for potential confounders?
- Logistic regression will provide us with a tool for this

Binary outcome and continuous exposure

Objective: Estimate the association between a binary outcome and a continuous exposure

  Y = binary response (0 = no, 1 = yes)
  X = continuous exposure
  p = E(Y | X) = P(Y = 1 | X)

One solution: fit a linear model

  E(Y | X) = β_0 + β_1 X

This is just a standard linear model, except our outcome is binary.
- Interpretation of β_1?
- Problems with this approach?

Motivating example: CHD and cholesterol

> lm.mod1 <- lm(chd ~ chol, data = cholesterol)
> summary(lm.mod1)

Call:
lm(formula = chd ~ chol, data = cholesterol)

Residuals:
    Min      1Q  Median      3Q     Max
-0.7067 -0.3301 -0.1289  0.3975  1.0227

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.4245087  0.1747852   -8.15 4.77e-15 ***
chol         0.0094718  0.0009436   10.04  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.4169 on 398 degrees of freedom
Multiple R-squared:  0.202,  Adjusted R-squared:  0.2
F-statistic: 100.8 on 1 and 398 DF,  p-value: < 2.2e-16

What is the interpretation of the cholesterol parameter estimate?
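
One problem with the linear model for a binary outcome is that its fitted "probabilities" are not constrained to lie between 0 and 1. The sketch below plugs the coefficients reported in the lm() output into the fitted line. The three cholesterol values are hypothetical inputs chosen only for illustration, not observations from the cholesterol data.

```r
# Coefficients reported in the lm() output for chd ~ chol
b0 <- -1.4245087
b1 <-  0.0094718

# Hypothetical cholesterol values (mg/dl), chosen only for illustration
chol_new <- c(150, 200, 260)

# Fitted "probabilities" of CHD from the linear model
p_hat <- b0 + b1 * chol_new

data.frame(chol = chol_new, p_hat = round(p_hat, 3))
```

The lowest input gives a fitted value just below 0 and the highest gives one above 1, neither of which is a valid probability.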

Binary outcome and continuous exposure

- Alternative: use a transformation that maps P(Y = 1 | X) to the real line
- Let logit(p) = log(p / (1 - p))
  - p ∈ (0, 1)
  - p/(1 - p) ∈ (0, ∞)
  - log(p/(1 - p)) ∈ (-∞, ∞)

[Figure: logit(p), ranging from about -6 to 6, plotted against p from 0 to 1]
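
A short sketch of the logit transformation and its inverse may make the mapping concrete. The helper names `logit` and `expit` are defined here for illustration; base R already provides the same functions as `qlogis()` and `plogis()`.

```r
# logit maps a probability in (0, 1) to the whole real line
logit <- function(p) log(p / (1 - p))

# its inverse (often called expit) maps any real number back into (0, 1)
expit <- function(x) 1 / (1 + exp(-x))

p <- c(0.01, 0.25, 0.5, 0.75, 0.99)
x <- logit(p)

x          # log odds: negative below p = 0.5, zero at 0.5, positive above
expit(x)   # recovers the original probabilities

# Agreement with the built-in logistic quantile and distribution functions
all.equal(x, qlogis(p))
all.equal(expit(x), plogis(x))
```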

Logistic regression

- logit(p) = log(p / (1 - p)); this ensures that p lies between 0 and 1
- Regress logit(p) on X:

    logit[E(Y | X)] = log[P(Y=1 | X) / (1 - P(Y=1 | X))] = β_0 + β_1 X

- It turns out that the slope coefficients in logistic regression are readily interpretable: they are just log odds ratios!

Interpretation of logistic regression parameters

On the log-odds scale:

  log[odds(Y=1 | X = c+1)] = β_0 + β_1 (c+1)
  log[odds(Y=1 | X = c)]   = β_0 + β_1 c

Subtracting the second line from the first:

  log[odds(Y=1 | X = c+1)] - log[odds(Y=1 | X = c)] = β_1
  log[odds(Y=1 | X = c+1) / odds(Y=1 | X = c)] = β_1
  log[OR] = β_1

- That is, for two observations that differ by one unit in X, there is a difference of β_1 in their log odds of Y = 1
- Or, equivalently, the log of the ratio of the odds of Y = 1 (i.e. the log OR) for two units that differ in X by one unit is β_1

Interpretation of logistic regression parameters

By exponentiating we arrive at a simpler interpretation:

  exp(log(OR)) = exp(β_1)
  OR = exp(β_1)

- So for two observations that differ in X by one unit, there is a multiplicative difference in their odds of Y = 1 of exp(β_1)
- Or, equivalently, the ratio of the odds of Y = 1 (i.e., the odds ratio) for two observations that differ in X by one unit is exp(β_1)
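
The link between exp(β) and the odds ratio can be checked on the rs4775401-by-CHD counts from the slides. The sketch below fits a logistic regression to the grouped counts and compares the exponentiated coefficients with odds ratios computed by hand. The data frame `snp` and its column names are made up for this illustration, and the model is unadjusted (cholesterol is not included), so these are crude odds ratios.

```r
# Grouped counts from the rs4775401-by-CHD table in the slides
snp <- data.frame(
  genotype = factor(c("C/C", "C/T", "T/T")),
  chd_yes  = c(48, 66, 13),
  chd_no   = c(154, 104, 15)
)

# Logistic regression on grouped binomial data: cbind(successes, failures)
fit <- glm(cbind(chd_yes, chd_no) ~ genotype, family = binomial, data = snp)

# Exponentiated slopes are odds ratios relative to the reference level (C/C)
or_glm <- exp(coef(fit))[c("genotypeC/T", "genotypeT/T")]

# Odds ratios computed by hand from the same counts
or_hand <- c((66 / 104) / (48 / 154), (13 / 15) / (48 / 154))

rbind(glm = unname(or_glm), by_hand = or_hand)
```

Because genotype is the only predictor, the model reproduces the observed group odds, so the two rows agree up to numerical tolerance.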

Motivating example: CHD and cholesterol

> glm.mod1 <- glm(chd ~ chol, family = "binomial", data = cholesterol)
> summary(glm.mod1)

Call:
glm(formula = chd ~ chol, family = "binomial", data = cholesterol)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.7437  -0.8219  -0.4852   0.9096   2.4536

Coefficients:
             Estimate Std. Error z value Pr(>|z|)
(Intercept) -11.09600    1.29881  -8.543  < 2e-16 ***
chol          0.05498    0.00678   8.109 5.12e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 409.71  on 398  degrees of freedom
AIC: 413.71

Number of Fisher Scoring iterations: 4

What do these results tell us about the relationship between cholesterol and CHD?

Motivating example: CHD and cholesterol

Interpretation of the chol estimate (0.05498):
- Comparing two people who differ in cholesterol by 1 mg/dl, the log odds of CHD are higher by 0.055 for the individual with higher cholesterol

Motivating example: CHD and cholesterol

- Differences in log odds are pretty spectacularly difficult to interpret!
- It would be much better to exponentiate the coefficients and report odds ratios

> exp(glm.mod1$coef)
 (Intercept)         chol
1.517293e-05 1.056515e+00
> exp(confint(glm.mod1))
Waiting for profiling to be done...
                   2.5 %       97.5 %
(Intercept) 1.061838e-06 0.0001744859
chol        1.043101e+00 1.0712556915

- Comparing two people who differ in cholesterol by 1 mg/dl, the odds of CHD are higher by a factor of 1.06 (95% CI: 1.04, 1.07) for the individual with higher cholesterol

Motivating example: CHD and cholesterol

- A 1 mg/dl difference is very small, so we might be interested in estimating the OR associated with a larger difference such as 10 mg/dl
- In this case, just as in linear regression, we just need to multiply our coefficient by the appropriate factor before exponentiating

> exp(10*glm.mod1$coef)
 (Intercept)         chol
6.466861e-49 1.732831e+00

- Comparing two people whose cholesterol levels differ by 10 mg/dl, the odds of CHD for the person with the higher cholesterol are 1.73 times the odds for the person with lower cholesterol
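
The rescaling can be verified from the reported chol coefficient alone. The sketch below uses the rounded estimate 0.05498 from the summary output, so the last digits differ slightly from the slide's value, which was computed from the unrounded coefficient. The variable names are illustrative.

```r
beta_chol <- 0.05498   # log odds ratio per 1 mg/dl, from the summary output

or_1  <- exp(beta_chol)        # odds ratio for a 1 mg/dl difference
or_10 <- exp(10 * beta_chol)   # odds ratio for a 10 mg/dl difference

# Multiplying on the log-odds scale equals raising the OR to a power
all.equal(or_10, or_1^10)

round(c(or_1 = or_1, or_10 = or_10), 3)
```

Note that the 10 mg/dl odds ratio is the 1 mg/dl odds ratio raised to the tenth power, not multiplied by ten.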

Multivariable logistic regression

- Often we are interested in examining associations between multiple predictors simultaneously and a binary outcome
- Multiple logistic regression follows the same pattern as linear regression:

    logit[E(Y | X)] = β_0 + β_1 X_1 + β_2 X_2 + … + β_p X_p

- exp(β_j) is interpreted as the OR associated with a one unit change in the j-th predictor, among individuals with other predictors at the same levels (or holding other predictors constant/controlling for/adjusting for, etc.)

Motivating example

> glm.mod2 <- glm(chd ~ chol+factor(rs4775401), family = "binomial", data = cholesterol)
> summary(glm.mod2)

Call:
glm(formula = chd ~ chol + factor(rs4775401), family = "binomial",
    data = cholesterol)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.5528  -0.7810  -0.4585   0.8037   2.6275

Coefficients:
                     Estimate Std. Error z value Pr(>|z|)
(Intercept)        -11.625209   1.335335  -8.706  < 2e-16 ***
chol                 0.055443   0.006872   8.069 7.11e-16 ***
factor(rs4775401)1   0.794212   0.259257   3.063  0.00219 **
factor(rs4775401)2   1.138308   0.464317   2.452  0.01422 *
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 499.98  on 399  degrees of freedom
Residual deviance: 397.27  on 396  degrees of freedom
AIC: 405.27

Number of Fisher Scoring iterations: 4

Motivating example

As we have seen before, exponentiating the coefficients gives us odds ratios:

> exp(glm.mod2$coef)
       (Intercept)               chol factor(rs4775401)1 factor(rs4775401)2
      8.937908e-06       1.057009e+00       2.212697e+00       3.121483e+00

A one mg/dl increase in cholesterol is associated with 1.06 times higher odds of CHD after adjusting for genotype.

We can also obtain confidence intervals for the odds ratios:

> exp(confint(glm.mod2))
                          2.5 %       97.5 %
(Intercept)        5.776075e-07 0.0001096301
chol               1.043422e+00 1.0719733312
factor(rs4775401)1 1.336145e+00 3.6998174205
factor(rs4775401)2 1.250542e+00 7.8187264825

Hypothesis testing for logistic regression

- Maximum likelihood is the standard method of estimating parameters from logistic models. It is based on finding the estimates that maximize the joint probability of the observed data under the chosen model.
- The Wald test uses the maximum likelihood estimates (MLEs) and their standard errors to conduct hypothesis tests

Wald test:
  H_0: β_j = 0 (no association) vs. H_A: β_j ≠ 0
  Construct a z-score: z = β̂_j / SE(β̂_j) ~ N(0, 1)
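
As a check on the formula, the Wald statistic for cholesterol can be reproduced from the estimate and standard error printed in the single-predictor logistic regression output. This is a sketch with illustrative variable names. The Wald confidence interval it computes is an assumption-labelled alternative to the profile-likelihood interval that `confint()` reports in the slides, so the two need not match exactly.

```r
est <- 0.05498   # estimate for chol from the logistic regression output
se  <- 0.00678   # its standard error

z       <- est / se               # Wald z statistic
p_value <- 2 * pnorm(-abs(z))     # two-sided p-value from N(0, 1)

# Wald 95% confidence interval, exponentiated to the odds ratio scale
ci_or <- exp(est + c(-1, 1) * qnorm(0.975) * se)

c(z = z, p = p_value)
round(ci_or, 3)
```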

Likelihood ratio test

- The likelihood ratio statistic is useful in comparing nested models (LRT = likelihood ratio test)
- This allows us to test hypotheses about multiple parameters simultaneously, such as
    H_0: β_1 = β_2 = 0 vs. H_A: at least one parameter not equal to 0
- In order to use the LRT we must fit a nested hierarchy of models

For example:
  Model 1: logit(p_i) = β_0 + β_1 chol_i
  Model 2: logit(p_i) = β_0 + β_1 chol_i + β_2 SNP_1i + β_3 SNP_2i

Likelihood ratio test

- The LRT allows us to test the significance of the additional parameters in the larger model
- Example: compare Model 1 to Model 2

    H_0: β_2 = β_3 = 0
    LRT = -2 [L_1 - L_2] ~ χ²_2

  where L_1 and L_2 are the maximized log-likelihoods of Models 1 and 2
- df = number of parameters being tested

Example: Likelihood ratio test

> lrtest(glm.mod1,glm.mod2)
Likelihood ratio test

Model 1: chd ~ chol
Model 2: chd ~ chol + factor(rs4775401)
  #Df  LogLik Df Chisq Pr(>Chisq)
1   2 -204.85
2   4 -198.63  2 12.44   0.001989 **
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

After accounting for cholesterol, there is a statistically significant association between rs4775401 and CHD.
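
The test statistic and p-value can be reproduced by hand from the two log-likelihoods in the lrtest output. This is a sketch with illustrative variable names, using only base R.

```r
ll_small <- -204.85   # log-likelihood of Model 1 (chd ~ chol)
ll_large <- -198.63   # log-likelihood of Model 2 (adds the two genotype terms)

lrt <- -2 * (ll_small - ll_large)   # likelihood ratio statistic
df  <- 4 - 2                        # number of parameters being tested

# Upper-tail probability of a chi-squared distribution with df degrees of freedom
p_value <- pchisq(lrt, df = df, lower.tail = FALSE)

c(LRT = lrt, df = df, p = p_value)
```

The same statistic is the difference in residual deviances between the two fits, 409.71 - 397.27 = 12.44.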

Logistic Regression: Assumptions

1. logit(E[Y | x]) is related linearly to x
2. The Y's are independent of each other

Summary

We have considered:
- Measures of association for binary outcomes
- Logistic regression
  - Interpretation
  - Estimation
  - Hypothesis testing

REGRESSION METHODS. Generalized linear models

Generalized linear models

So far we have considered:
- Continuous outcomes: linear regression/ANOVA
- Binary outcomes: logistic regression

Generalized linear models (GLMs) provide a way to model:
- Continuous and binary outcomes
- Additional types of outcome variables (e.g. counts)
- Additional functional forms for the relationship between outcomes and predictors

Generalized Linear Models

- GLMs allow us to estimate regression models for outcomes arising from exponential family distributions
- This family includes many familiar distributions, including Normal, Binomial and Poisson
- A GLM is specified based on three components:
  - Outcome distribution
  - Linear predictor
  - Link function
- We will see that linear and logistic regression are both GLMs with specific choices of outcome distribution and link function!

Outcome distribution

The first step in fitting a GLM is to choose an appropriate distribution for your outcome.

Examples:
- Continuous outcome: Normal
- Binary outcome: Binomial
- Count outcome: Poisson

Linear predictor

- After specifying a distribution for the outcome, we specify the linear predictor:

    g[E(Y)] = β_0 + β_1 x_1 + … + β_p x_p

- This is just the systematic piece of our regression model
- As in other regression models we have seen, we need to identify the set of covariates to be included

Link function

- Finally, we specify a link function, g[E(Y)]:

    g[E(Y)] = β_0 + β_1 x_1 + … + β_p x_p

- This describes the functional form of the relationship between E(Y) and the linear predictor
- In linear regression, we use the identity link function: g[E(Y)] = E(Y)
- In logistic regression, we use the logit link function: g[E(Y)] = log[E(Y)/(1 - E(Y))]

Generalized linear models

A few example GLMs:

Distribution   Link function                             Model
Normal         Identity: g[E(Y)] = E(Y)                  Linear regression
Binomial       Logit:    g[E(Y)] = log[E(Y)/(1 - E(Y))]  Logistic regression
Poisson        Log:      g[E(Y)] = log[E(Y)]             Poisson GLM
Gamma          Log:      g[E(Y)] = log[E(Y)]             Gamma GLM
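
In R, each row of this table corresponds to a `family` argument of `glm()`. The sketch below uses simulated data (not the cholesterol data) to show the first row in action: a Normal GLM with identity link gives the same coefficients as `lm()`. All object names and the simulation settings are assumptions made for this illustration.

```r
set.seed(1)   # simulated data, for illustration only

n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)

# Linear regression fitted two ways
fit_lm  <- lm(y ~ x)
fit_glm <- glm(y ~ x, family = gaussian(link = "identity"))

# Same coefficient estimates from both fits
cbind(lm = coef(fit_lm), glm = coef(fit_glm))

# The other rows of the table correspond to other family/link choices:
#   family = binomial(link = "logit")   logistic regression
#   family = poisson(link = "log")      Poisson GLM
#   family = Gamma(link = "log")        Gamma GLM
```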

Alternatives to logistic regression

- The odds ratio is limited by difficulty of interpretation
- The relative risk is more interpretable
- To estimate a relative risk using regression we can use the log-linear model:

    log[E(Y | x)] = β_0 + β_1 x

- This is sometimes referred to as relative risk regression
- exp(β_1) is the relative risk associated with a one-unit increase in x

Modified Poisson regression

- To estimate the relative risk, we could use a binomial GLM with log link
- It turns out that estimation for this model is very challenging, and results are sensitive to outliers in X
- An alternative approach that performs better in practice is modified Poisson regression
- This method uses a Poisson GLM with log link
- Using a Poisson model for binary data will give incorrect standard errors, because the variance for binary outcomes differs from the variance for Poisson outcomes
- We can combine the Poisson GLM with a robust variance estimator to account for this violation of the model's assumptions
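
The idea can be sketched in base R on simulated data. The slides obtain robust standard errors from `gee()`; here a hand-computed sandwich (HC0-type) variance estimator is swapped in so that the example has no package dependencies, and its numbers need not match what `gee()` would report. The simulated risk model, sample size and all object names are assumptions made for this illustration.

```r
set.seed(2)   # simulated data, for illustration only

n <- 2000
x <- runif(n)
p <- exp(-2 + 0.5 * x)              # true risk follows a log-linear model
y <- rbinom(n, size = 1, prob = p)  # binary outcome

# Poisson GLM with log link fitted to the binary outcome
fit <- glm(y ~ x, family = poisson(link = "log"))

X  <- model.matrix(fit)
mu <- fitted(fit)

# Sandwich (robust) variance: bread^{-1} %*% meat %*% bread^{-1}
bread <- crossprod(X, X * mu)            # X' diag(mu) X
meat  <- crossprod(X, X * (y - mu)^2)    # X' diag((y - mu)^2) X
v_robust <- solve(bread) %*% meat %*% solve(bread)

naive_se  <- sqrt(diag(vcov(fit)))   # model-based (Poisson) standard errors
robust_se <- sqrt(diag(v_robust))    # robust standard errors

cbind(RR = exp(coef(fit)), naive_se, robust_se)
```

For a binary outcome the true variance p(1 - p) is smaller than the Poisson variance p, so the naive standard errors tend to be too large and the robust ones correct for this.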

Modified Poisson regression

> glm.rr <- gee(chd ~ chol+factor(rs4775401), family = "poisson", id = seq(1,nrow(cholesterol)), data = cholesterol)
> summary(glm.rr)

 GEE:  GENERALIZED LINEAR MODELS FOR DEPENDENT DATA
 gee S-function, version 4.13 modified 98/01/27 (1998)

Model:
 Link:                      Logarithm
 Variance to Mean Relation: Poisson
 Correlation Structure:     Independent

Coefficients:
                      Estimate  Naive S.E.    Naive z Robust S.E.   Robust z
(Intercept)        -7.06494205 0.673034850 -10.497141 0.586040831 -12.055375
chol                0.02963414 0.003372133   8.787949 0.002752366  10.766786
factor(rs4775401)1  0.41510943 0.155971511   2.661444 0.144444876   2.873826
factor(rs4775401)2  0.63841622 0.256170049   2.492158 0.200023351   3.191708

Estimated Scale Parameter:  0.671285
Number of Iterations:  1

Modified Poisson regression

- The relative risk of CHD associated with a 1 mg/dl increase in cholesterol is 1.03

> exp(glm.rr$coef)
       (Intercept)               chol factor(rs4775401)1 factor(rs4775401)2
      0.0008545444       1.0300775972       1.5145364657       1.8934796543

- Compare this to the odds ratios we obtained earlier using logistic regression

> exp(glm.mod2$coef)
       (Intercept)               chol factor(rs4775401)1 factor(rs4775401)2
      8.937908e-06       1.057009e+00       2.212697e+00       3.121483e+00

Relative risk regression: Assumptions

1. log(E[Y | x]) = log(P(Y=1 | x)) is related linearly to x
   Warning: this can lead to predicted probabilities > 1
2. The Y's are independent of each other

Risk difference regression

- Recall that we also considered fitting a linear model to binary outcome data
- This allows us to estimate differences in risk associated with a 1 unit difference in the predictor
- By using robust standard errors, we can account for violation of the assumptions of normality and equal variance

> glm.rd <- gee(chd ~ chol+factor(rs4775401), id = seq(1,nrow(cholesterol)), data = cholesterol)
> summary(glm.rd)

Coefficients:
                       Estimate   Naive S.E.   Naive z  Robust S.E.   Robust z
(Intercept)        -1.485416572 0.1729186937 -8.590260 0.1372414095 -10.823385
chol                0.009392399 0.0009293498 10.106420 0.0007615605  12.333097
factor(rs4775401)1  0.142743141 0.0427305918  3.340537 0.0423172293   3.373168
factor(rs4775401)2  0.212108382 0.0827885757  2.562049 0.0822370553   2.579231

Risk difference regression

- A 1 mg/dl difference is very small, so we might be interested in estimating the RD associated with a larger difference such as 10 mg/dl
- Comparing two people with the same rs4775401 genotype whose cholesterol levels differ by 10 mg/dl, the risk of CHD for the person with the higher cholesterol is 9.4% higher (in absolute terms) compared to the person with lower cholesterol
- Comparing two people with the same cholesterol level, a person with rs4775401 C/T is estimated to have risk of CHD 14.3% higher (in absolute terms) than a person with rs4775401 C/C
- Comparing two people with the same cholesterol level, a person with rs4775401 T/T is estimated to have risk of CHD 21.2% higher (in absolute terms) than a person with rs4775401 C/C

Risk difference regression: Assumptions

1. E[Y | x] = P(Y=1 | x) is related linearly to x
   Warning: this can lead to predicted probabilities > 1 or < 0
2. The Y's are independent of each other

Summary

We have considered:
- Logistic regression
  - Interpretation
  - Estimation
- Generalized linear models
- Relative risk regression
- Risk difference regression

Module summary

In this module we have covered a variety of regression methods that can be used to analyze continuous and binary outcomes:

- Continuous outcomes
  - Simple linear regression
  - Multiple linear regression
  - ANOVA
- Binary outcomes
  - Logistic regression
  - Relative risk regression
  - Risk difference regression

These methods are foundational for many statistical analyses, and we hope you will be able to apply them to your future research!

"Everything is regression!" (Professor Scott Emerson)