Lecture 12: Effect modification, and confounding in logistic regression Ani Manichaikul amanicha@jhsph.edu 4 May 2007
Today Categorical predictor create dummy variables just like for linear regression Comparing nested models that differ by two or more variables for logistic regression X 2 Test of Deviance analogous to the F test in linear regression Effect Modification and Confounding
Example Mean SAT scores were compared for the 50 US states. The goal of the study was to compare overall SAT scores using state-wide predictors such as perpupil expenditures and average teachers salary. The investigators also considered the proportion of student eligible to take the SAT who actually took the examination.
Variables Outcome Total SAT score [sat_low] 1=low, 0=high Primary predictor Average expenditures per pupil [expen] in thousands Continuous, range: 3.65-9.77, mean: 5.9
Variables Secondary predictors Percent of pupils taking the SAT, in quartiles percent1 lowest quartile percent2 2 nd quartile percent3 3 rd quartile percent4 highest quartile Mean teacher salary in thousands, in quartiles salary1 lowest quartile salary2 2 nd quartile salary3 3 rd quartile salary4 highest quartile
Modifications to variables Expenditures: continuous, doesn t include 0: center at $5,000 per pupil Percent: four dummy variables for four categories; must exclude one category to create a reference group Salary: four dummy variables for four categories; must exclude one category to create a reference group
Plan Assess primary relationship Add each secondary predictor separately Determine which secondary predictor is more statistically significant Add other secondary predictor to model with better secondary predictor
The X 2 Test of Deviance We would like to consider adding salary quartiles to our model We want to compare parent model to an extended model, which differs by the three dummy variables for the four salary quartiles. The X 2 test of deviance compares nested models We use it for nested models that differ by two or more variables because the Wald test cannot be used in that situation
1. Get the Log Likelihood from both models The log likelihood is shown in the upper right corner of the logit or logistic output Null model: LL = -28.94 Extended model B: LL = -28.25
2. Find the deviance for each model Deviance = -2x(log likelihood) Deviance is analogous to residual sums of squares (RSS) in linear regression; it measures the deviation still available in the model A saturated model is one in which every Y is perfectly predicted Null model: Deviance = -2(-28.94) = 57.88 Extended model B: Deviance = -2(-28.25) = 56.50
3. Find the change in deviance between the nested models Null model: Deviance = 57.88 Extended model B: Deviance = 56.50 Change in deviance = deviance null deviance extended = 57.88-56.50 = 1.38
4. Evaluate the change in deviance The change in deviance from the parent model to the nested model is an observed Chi-square statistic df = # of variables added H 0 : all new s are 0 in the population or H 0 : the parent model is better
4. Evaluate the change in deviance H 0 : After adjusting for per-pupil expenditures, teachers salary is not an important predictor of SAT score. X 2 obs = 1.38 df = 3 with 3 df and =0.05, X 2 cr is 7.81 Fail to reject H 0
Notes about deviance test The deviance test gives us a framework in which to add several predictors to a model simultaneously Can only handle nested models Analogous to F-test for linear regression Also known as a "likelihood ratio test"
Conclusions per-pupil expenditure is associated with SAT score After adjusting for per-pupil expenditure Percent of students taking the SAT is statistically significant Teachers salary is not statistically significant Is salary significant after adjusting for both expenditure and percent?
Possible ways to improve this model: Add an interaction variable Does the effect of expenditures on odds of low mean SAT score vary between states with low and high percentages of students taking the SAT? Add a spline Does the effect of expenditures on odds of low mean SAT score vary over the level of expenditures?
Effect Modification in Logistic Regression Heart Disease Smoking and Coffee
Effect modification Just like with linear regression, we may want to allow different relationships between the primary predictor and outcome across levels of another covariate Can model such relationships by fitting interaction terms Modelling effect modification will require dealing with two or more covariates
Logistic models with two covariates logit(p) = β 0 + β 1 X 1 + β 2 X 2 Then: logit(p X 1 =X 1 +1,X 2 =X 2 ) = β 0 + β 1 (X 1 +1)+ β 2 X 2 logit(p X 1 =X 1,X 2 =X 2 ) = β 0 + β 1 (X 1 )+ β 2 X 2 in log-odds = β 1 β 1 is the change in log-odds for a 1 unit change in X 1 provided X 2 is held constant.
Interpretation in General Also: log = β 1 odds(y = 1 X + 1,X ) 1 2 odds(y = 1 X,X ) 1 2 And: OR = exp(β 1 )!! exp(β 1 ) is the Multiplicative change in odds for a 1 unit increase in X 1 provided X 2 is held constant. The result is similar for X 2
Risk of CHD from Smoking and Coffee n = 151
Study Information Study Facts: Case-Control study 40-50 year-old males previously in good health Study questions: Is smoking and/or coffee related to an increased odds of CHD? Is the association of coffee with CHD higher among smokers? That is, is smoking an effect modifier of the coffee-chd associations?
Fraction with CHD by smoking and coffee
Pooled data, ignoring smoking Odds ratio = (40 * 50) / (26 * 35) = 2.2 95% CI = (1.14, 4.24)
Among Non-Smokers Odds ratio = (15 * 42) / (15 * 21) = 2.0 95% CI = (0.82, 4.9)
Among Smokers Odds ratio = (25 * 8) / (11 * 14) = 1.3 95% CI = (.42, 4.0)
Plot Odds Ratios and 95% CIs
Define Variables Y i = 1 if CHD case, 0 if control COF i = 1 if Coffee Drinker, 0 if not SMK i = 1 if Smoker, 0 if not p i = Pr (Y i = 1) n i = Number observed at pattern i of Xs
Logistic Regression Model Y i are from a Binomial (n i, p i ) distribution Yi are independent log odds (Y i =1) (or, logit( Y i =1) ) is a function of Coffee Smoking and coffee x smoking interaction
Logistic Regression Model Which implies that Pr(Y i =1) is the logistic function 2 1 3 2 2 1 1 0 2 1 3 2 2 1 1 0 e 1 e X i X i X i X i i i i i X X X X i p β β β β + + + + = + + + i i i i i i COF SMK SMK COF p p 3 2 1 0 1 log β β β β + + + =
Probabilities of CHD as a function of coffee and smoking history Coffee No Yes No e 0 1+ e e 0 1+ e + 0 1 0+ 1 Smoke e 0 1+ e 1+ Yes + e 0 + β e 1 2 0 + β2 + β + β 2 3 0+ 1+ β2+ β3
Among Non-Smokers: ( ) ( ) 0 0 0 1 0 1 0 1 0 1 1 1 1 1 1 No Coffee Case Odds Coffee Case Odds β β β β β β β β β e e e e e e + + + + = + + + Odds Ratio 1 0 1 0 = = = + β β β β e e e
Interpretations exp{ 1 }: odds ratio of being a CHD case for coffee drinkers -vs- non-drinkers among non-smokers exp{ 1 3 }: odds ratio of being a CHD case for coffee drinkers -vs- nondrinkers among smokers
Interpretations exp{ 2 }: odds ratio of being a CHD case for smokers -vs- non-smokers among non-coffee drinkers exp{ 2 3 }: odds ratio of being case for smokers -vs- non-smokers among coffee drinkers
Interpretations e β 0 1 + e β 0 fraction of cases among nonsmoking non-coffee drinking individuals in the sample (determined by sampling plan) exp{ 3 }: ratio of odds ratios
exp{ 3 } Interpretations exp{ 3 }: factor by which odds ratio of being a CHD case for coffee drinkers -vsnondrinkers is multiplied for smokers as compared to non-smokers or exp{ 3 }: factor by which odds ratio of being a CHD case for smokers -vs- non-smokers is multiplied for coffee drinkers as compared to non-coffee drinkers
Some Special Cases Given Pr( Y = 1) log = β 0 + β1cof + β2smk + β3cof Pr( Y = 0) If 1 = 2 = 3 = 0 * SMK Neither smoking no coffee drinking is associated with increased risk of CHD
Some Special Cases Given Pr( Y = 1) log = β 0 + β1cof + β2smk + β3cof Pr( Y = 0) If 1 = 3 = 0 * SMK Smoking, but not coffee drinking, is associated with increased risk of CHD
Some Special Cases If 3 = 0 Smoking and coffee drinking are both associated with risk of CHD but the odds ratio of CHD-smoking is the same at levels of coffee Smoking and coffee drinking are both associated with risk of CHD but the odds ratio of CHD-coffee is the same at levels of smoking.
CHD ~ Coffee: Coefficients Logit estimates Number of obs = 151 LR chi2(1) = 5.65 Prob > chi2 = 0.0175 Log likelihood = -100.64332 Pseudo R2 = 0.0273 ------------------------------------------------------------------------------ chd Coef. Std. Err. z P> z [95% Conf. Interval] -------------+---------------------------------------------------------------- cof.7874579.3347123 2.35 0.019.1314338 1.443482 _cons -.6539265.2417869-2.70 0.007-1.12782 -.1800329 ------------------------------------------------------------------------------
Adding Smoke: Coefficients. logit chd cof smk Logit estimates Number of obs = 151 LR chi2(2) = 15.19 Prob > chi2 = 0.0005 Log likelihood = -95.869718 Pseudo R2 = 0.0734 ------------------------------------------------------------------------------ chd Coef. Std. Err. z P> z [95% Conf. Interval] -------------+---------------------------------------------------------------- cof.5269764.3541932 1.49 0.137 -.1672295 1.221182 smk 1.101978.3609954 3.05 0.002.3944404 1.809516 _cons -.9572328.2703086-3.54 0.000-1.487028 -.4274377 ------------------------------------------------------------------------------
Adding Interaction: Coefficient. logit chd cof smk cof_smk Logit estimates Number of obs = 151 LR chi2(3) = 15.55 Prob > chi2 = 0.0014 Log likelihood = -95.694169 Pseudo R2 = 0.0751 ------------------------------------------------------------------------------ chd Coef. Std. Err. z P> z [95% Conf. Interval] -------------+---------------------------------------------------------------- cof.6931472.4525062 1.53 0.126 -.1937487 1.580043 smk 1.348073.5535208 2.44 0.015.2631923 2.432954 cof_smk -.4317824.7294515-0.59 0.554-1.861481.9979163 _cons -1.029619.3007926-3.42 0.001-1.619162 -.4400768 ------------------------------------------------------------------------------
Comparing Models Variable Intercept Coffee Est se Model1 -.65.24.79.33 2.4 z -2.7 Intercept Coffee Smoking Model 2 -.96.27.53.35 1.10.36-3.5 1.5 3.1
Question: Is smoking a confounder of the coffee-chd association?
Confounding In epidemiological terms, Z is a confounder of the relationship of Y with X if Z is related to both X and Y and Z is not in the causal pathway between X and Y In statistical terms, Z is a confounder of the relationship of Y with X if the X coefficient changes when Z is added to a regression of Y on X
Confounding For example, consider the two models Y = 0 + 1 X + 1 Y = 0 + 1 X + 2 Z + 2 then Z is a confounder of the X, Y relationship if 1 1
Comparing Models Variable Intercept Coffee Est se Model1 -.65.24.79.33 z -2.7 2.4 Intercept Coffee Smoking Model 2 -.96.27.53.35 1.10.36-3.5 1.5 3.1
Look at Confidence Intervals Without Smoking OR = e 0.79 = 2.2 95% CI for log(or): 0.79 ± 1.96(0.33) = (0.13, 1.44) 95% CI for OR: (e 0.13, e 1.44 ) = (1.14, 4.24)
Look at Confidence Intervals With Smoking (adjusting for smoking) OR = e 0.53 = 1.7 95% CI for log(or): 0.53 ± 1.96(0.35) = (-0.17, 1.22) 95% CI for OR: (e -0.17, e 1.22 ) = (0.85, 3.39)
Conclusion So, ignoring smoking, the CHD and coffee OR is 2.2 (95%CI: 1.14-4.26) Adjusting for smoking, gives more modest evidence for a coffee effect In this case-control study, smoking is a weak-to-moderate confounder of the coffee-chd association
Question: Is smoking an effect modifier of CHD-coffee association?
Interaction Model Variable Intercept Coffee Smoking Coffee* Smoking Est se Model 3-1.0.30.69.45 1.3.55 -.43.73 z -3.4 1.5 2.4 -.59
Testing Interaction Term Among non-smokers: OR = e 0.69 = 1.99 95% CI for log(or): 0.69 ± 1.96(0.45) = (-0.19, 1.58) 95% CI for OR: (e -0.19, e 1.58 ) = (0.82, 4.86)
Testing Interaction Term Among smokers OR = e 0.69-0.43 = e 0.26 = 1.30 95% CI for log(or): 0.26 ± 1.96(.57) = (-0.86, 1.38) 95% CI for OR:(e -0.86, e 1.38 ) = (0.42, 3.99)
Testing Interaction Term Z= -0.59, p-value = 0.554 95% Confidence interval for 1 3 (0.42, 3.99) Both of the above suggest that there is little evidence that smoking is an effect modifier!
Note ˆ βˆ β + Calculating the SE for 1 3.57 = sqrt(.329)
Question: What model should we choose to describe the relationship of coffee and smoking with CHD?
Fitted Values We can use the logistic models to calculate fitted values for comparison with observed frequencies using each of the three models Model 1: pˆ = e 1+ -.65+.79Coffee e -.65+.79Coffee
Fitted Values Model 2: Model 3: pˆ = e -.96+.53Coffee+ 1.1Smoking 1+ e -.96+.53Coffee+ 1.1Smoking pˆ = e -.1.03+.69Coffee+ 1.3Smoking-.43(Coffee*Smoking) 1+ e -.1.03+.69Coffee+ 1.3Smoking-.43(Coffee*Smoking)
Observed vs Fitted Values
Saturated Model Note that fitted values from Model 3 exactly match the observed values indicating a saturated model that gives perfect predictions Although the saturated model will always result in a perfect fit, it is usually not the best model (e.g., when there are continuous covariates or many covariates)
Likelihood Ratio Test The Likelihood Ratio Test will help decide whether or not additional term(s) significantly improve the model fit Likelihood Ratio Test (LRT) statistic for comparing nested models is -2 times the difference between the log likelihoods (LLs) for the Null -vs- Extended models the obtained is identical to from an analysis of variance test for linear regression models
Likelihood Ratio Test Deviance is a term used for the difference in -2*log likelihood relative to the best possible value from a perfectly predicting model. Change in deviance is the same as change in -2LL.
LRT Example
Model comparisons using likelihood ratio test
Summary A case-control study was conducted with 151 subjects, 66 (44%) of whom had CHD, to assess the relative importance of smoking and coffee drinking as risk factors. The observed fractions of CHD cases by smoking, coffee strata are
Summary: Unadjusted ORs The odds of CHD was estimated to be 3.4 times higher among smokers compared to non-smokers 95% CI: (1.7, 7.9) The odds of CHD was estimated to be 2.2 times higher among coffee drinkers compared to non-coffee drinkers 95% CI: (1.1, 4.3)
Summary: Adjusted ORs Controlling for the potential confounding of smoking, the coffee odds ratio was estimated to be 1.7 with 95% CI: (.85, 3.4). Hence, the evidence in these data are insufficient to conclude coffee has an independent effect on CHD beyond that of smoking.
Summary Finally, we estimated the coffee odds ratio separately for smokers and non-smokers to assess whether smoking is an effect modifier of the coffee-chd relationship. For the smokers and non-smokers, the coffee odds ratio was estimated to be 1.3 (95% CI:.42, 4.0) and 2.0 (95% CI:.82, 4.9) respectively. There is little evidence of effect modification in these data.
Note: Retrospective Studies Ratio of odds of CHD for coffee vs. noncoffee drinkers is equivalent to ratio of coffee drinking for cases of CHD vs. controls Thus, can estimate odds ratio of CHD (prospective question) using retrospective data -- key property of odds ratios This is one reason why logistic regression is so popular with epidemiologists