Lecture 12: Effect modification, and confounding in logistic regression


Ani Manichaikul
amanicha@jhsph.edu
4 May 2007

Today
- Categorical predictors: create dummy variables, just as for linear regression
- Comparing nested models that differ by two or more variables in logistic regression: the $\chi^2$ test of deviance, analogous to the F test in linear regression
- Effect modification and confounding

Example
Mean SAT scores were compared for the 50 US states. The goal of the study was to compare overall SAT scores using state-wide predictors such as per-pupil expenditures and average teachers' salary. The investigators also considered the proportion of students eligible to take the SAT who actually took the examination.

Variables
- Outcome: total SAT score [sat_low]; 1 = low, 0 = high
- Primary predictor: average expenditures per pupil [expen], in thousands; continuous, range 3.65-9.77, mean 5.9

Variables
- Secondary predictors:
  - Percent of pupils taking the SAT, in quartiles: percent1 (lowest quartile), percent2 (2nd quartile), percent3 (3rd quartile), percent4 (highest quartile)
  - Mean teacher salary in thousands, in quartiles: salary1 (lowest quartile), salary2 (2nd quartile), salary3 (3rd quartile), salary4 (highest quartile)

Modifications to variables
- Expenditures: continuous, and the range doesn't include 0, so center at $5,000 per pupil
- Percent: four dummy variables for four categories; must exclude one category to create a reference group
- Salary: four dummy variables for four categories; must exclude one category to create a reference group

Plan
- Assess the primary relationship
- Add each secondary predictor separately
- Determine which secondary predictor is more statistically significant
- Add the other secondary predictor to the model with the better secondary predictor

The $\chi^2$ Test of Deviance
- We would like to consider adding salary quartiles to our model
- We want to compare the parent model to an extended model, which differs by the three dummy variables for the four salary quartiles
- The $\chi^2$ test of deviance compares nested models
- We use it for nested models that differ by two or more variables, because the Wald z test reported for each coefficient tests only one coefficient at a time

1. Get the log likelihood from both models
- The log likelihood is shown in the header of the logit or logistic output
- Null model: LL = -28.94
- Extended model B: LL = -28.25

2. Find the deviance for each model
- Deviance = -2 × (log likelihood)
- Deviance is analogous to the residual sum of squares (RSS) in linear regression; it measures the variation left unexplained by the model
- A saturated model is one in which every Y is perfectly predicted
- Null model: deviance = -2(-28.94) = 57.88
- Extended model B: deviance = -2(-28.25) = 56.50

3. Find the change in deviance between the nested models
- Null model: deviance = 57.88
- Extended model B: deviance = 56.50
- Change in deviance = deviance(null) - deviance(extended) = 57.88 - 56.50 = 1.38

4. Evaluate the change in deviance
- The change in deviance from the parent model to the extended model is an observed chi-square statistic
- df = number of variables added
- $H_0$: all new $\beta$'s are 0 in the population, or equivalently $H_0$: the parent model is better

4. Evaluate the change in deviance
- $H_0$: after adjusting for per-pupil expenditures, teachers' salary is not an important predictor of SAT score
- $\chi^2_{obs} = 1.38$, df = 3
- With 3 df and $\alpha = 0.05$, $\chi^2_{cr}$ is 7.81
- Fail to reject $H_0$
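The four steps above can be sketched in a few lines of Python. This is a minimal illustration, not the lecture's own code (the slides use Stata): the helper names `chi2_sf` and `deviance_test` are made up for this example, and the chi-square tail probability is computed with a standard-library recurrence that is valid for integer degrees of freedom only.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability P(X > x) for a chi-square with integer df.

    Uses Q(a + 1, y) = Q(a, y) + y**a * exp(-y) / Gamma(a + 1), starting
    from Q(1/2, y) = erfc(sqrt(y)) for odd df or Q(1, y) = exp(-y) for even df.
    """
    if x <= 0:
        return 1.0
    y = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-y)
    else:
        a, q = 0.5, math.erfc(math.sqrt(y))
    while a < df / 2.0:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q

def deviance_test(ll_parent, ll_extended, df):
    """Return both deviances, their difference, and the chi-square p-value."""
    dev_parent = -2.0 * ll_parent
    dev_extended = -2.0 * ll_extended
    change = dev_parent - dev_extended
    return dev_parent, dev_extended, change, chi2_sf(change, df)

# Log likelihoods from the slides: parent (null) model vs. extended model B,
# which adds the three salary-quartile dummy variables (df = 3).
dev_parent, dev_extended, change, p_value = deviance_test(-28.94, -28.25, df=3)
print(round(dev_parent, 2), round(dev_extended, 2), round(change, 2))
print("reject H0 at alpha = 0.05:", p_value < 0.05)
```

Comparing the p-value with 0.05 is equivalent to comparing the observed statistic with the critical value 7.81 used on the slide.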

Notes about deviance test
- The deviance test gives us a framework in which to add several predictors to a model simultaneously
- Can only handle nested models
- Analogous to the F test for linear regression
- Also known as a "likelihood ratio test"

Conclusions
- Per-pupil expenditure is associated with SAT score
- After adjusting for per-pupil expenditure:
  - Percent of students taking the SAT is statistically significant
  - Teachers' salary is not statistically significant
- Is salary significant after adjusting for both expenditure and percent?

Possible ways to improve this model:
- Add an interaction variable: does the effect of expenditures on odds of low mean SAT score vary between states with low and high percentages of students taking the SAT?
- Add a spline: does the effect of expenditures on odds of low mean SAT score vary over the level of expenditures?

Effect Modification in Logistic Regression
Heart Disease, Smoking and Coffee

Effect modification
- Just like with linear regression, we may want to allow different relationships between the primary predictor and outcome across levels of another covariate
- Can model such relationships by fitting interaction terms
- Modelling effect modification will require dealing with two or more covariates

Logistic models with two covariates
$\mathrm{logit}(p) = \beta_0 + \beta_1 X_1 + \beta_2 X_2$
Then:
$\mathrm{logit}(p \mid X_1 = x_1 + 1, X_2 = x_2) = \beta_0 + \beta_1 (x_1 + 1) + \beta_2 x_2$
$\mathrm{logit}(p \mid X_1 = x_1, X_2 = x_2) = \beta_0 + \beta_1 x_1 + \beta_2 x_2$
$\Delta$ in log-odds $= \beta_1$
$\beta_1$ is the change in log-odds for a 1-unit change in $X_1$, provided $X_2$ is held constant.

Interpretation in General
Also: $\log\left[\frac{\mathrm{odds}(Y = 1 \mid X_1 + 1, X_2)}{\mathrm{odds}(Y = 1 \mid X_1, X_2)}\right] = \beta_1$
And: $OR = \exp(\beta_1)$
$\exp(\beta_1)$ is the multiplicative change in odds for a 1-unit increase in $X_1$, provided $X_2$ is held constant. The result is similar for $X_2$.

Risk of CHD from Smoking and Coffee n = 151

Study Information
Study facts:
- Case-control study
- 40-50 year-old males previously in good health
Study questions:
- Are smoking and/or coffee drinking related to increased odds of CHD?
- Is the association of coffee with CHD higher among smokers? That is, is smoking an effect modifier of the coffee-CHD association?

Fraction with CHD by smoking and coffee

Pooled data, ignoring smoking Odds ratio = (40 * 50) / (26 * 35) = 2.2 95% CI = (1.14, 4.24)

Among Non-Smokers Odds ratio = (15 * 42) / (15 * 21) = 2.0 95% CI = (0.82, 4.9)

Among Smokers Odds ratio = (25 * 8) / (11 * 14) = 1.3 95% CI = (.42, 4.0)
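The pooled and stratified odds ratios above, with their confidence intervals, can be reproduced with a short sketch. It assumes the intervals are Woolf (log-scale) intervals, which matches the slide values to the precision shown. The function name `odds_ratio_ci` is hypothetical, and the assignment of counts to cells (cases vs. controls, coffee vs. no coffee) is inferred from the cross-product formulas on the slides, since the original 2x2 tables did not survive transcription.

```python
import math

def odds_ratio_ci(a, b, c, d, z=1.96):
    """Cross-product odds ratio (a*d)/(b*c) with a log-scale (Woolf) CI.

    Assumed cell layout (inferred, not stated in the transcript):
      a = coffee-drinking cases,    b = non-coffee cases,
      c = coffee-drinking controls, d = non-coffee controls.
    """
    or_hat = (a * d) / (b * c)
    se_log_or = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(or_hat) - z * se_log_or)
    upper = math.exp(math.log(or_hat) + z * se_log_or)
    return or_hat, lower, upper

# Counts taken from the cross-product formulas on the three slides above.
pooled = odds_ratio_ci(40, 26, 35, 50)
non_smokers = odds_ratio_ci(15, 15, 21, 42)
smokers = odds_ratio_ci(25, 11, 14, 8)

for label, (or_hat, lower, upper) in [("pooled", pooled),
                                      ("non-smokers", non_smokers),
                                      ("smokers", smokers)]:
    print(f"{label}: OR = {or_hat:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```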

Plot Odds Ratios and 95% CIs

Define Variables
- $Y_i$ = 1 if CHD case, 0 if control
- $COF_i$ = 1 if coffee drinker, 0 if not
- $SMK_i$ = 1 if smoker, 0 if not
- $p_i = \Pr(Y_i = 1)$
- $n_i$ = number observed at pattern $i$ of the Xs

Logistic Regression Model
- $Y_i$ are from a Binomial($n_i$, $p_i$) distribution
- $Y_i$ are independent
- The log odds of $Y_i = 1$, i.e. $\mathrm{logit}(p_i)$, is a function of coffee, smoking, and the coffee × smoking interaction

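Logistic Regression Model
$$\log\left(\frac{p_i}{1 - p_i}\right) = \beta_0 + \beta_1 COF_i + \beta_2 SMK_i + \beta_3 COF_i \times SMK_i$$
Which implies that $\Pr(Y_i = 1)$ is the logistic function
$$p_i = \frac{e^{\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i}}}{1 + e^{\beta_0 + \beta_1 X_{1i} + \beta_2 X_{2i} + \beta_3 X_{1i} X_{2i}}}$$
where $X_{1i} = COF_i$ and $X_{2i} = SMK_i$.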

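Probabilities of CHD as a function of coffee and smoking history
- Smoke = No, Coffee = No: $\frac{e^{\beta_0}}{1 + e^{\beta_0}}$
- Smoke = No, Coffee = Yes: $\frac{e^{\beta_0 + \beta_1}}{1 + e^{\beta_0 + \beta_1}}$
- Smoke = Yes, Coffee = No: $\frac{e^{\beta_0 + \beta_2}}{1 + e^{\beta_0 + \beta_2}}$
- Smoke = Yes, Coffee = Yes: $\frac{e^{\beta_0 + \beta_1 + \beta_2 + \beta_3}}{1 + e^{\beta_0 + \beta_1 + \beta_2 + \beta_3}}$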

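Among Non-Smokers:
$$\mathrm{Odds}(\text{case} \mid \text{coffee}) = \frac{e^{\beta_0 + \beta_1} / (1 + e^{\beta_0 + \beta_1})}{1 / (1 + e^{\beta_0 + \beta_1})} = e^{\beta_0 + \beta_1}$$
$$\mathrm{Odds}(\text{case} \mid \text{no coffee}) = \frac{e^{\beta_0} / (1 + e^{\beta_0})}{1 / (1 + e^{\beta_0})} = e^{\beta_0}$$
$$\text{Odds Ratio} = \frac{e^{\beta_0 + \beta_1}}{e^{\beta_0}} = e^{\beta_1}$$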

Interpretations
- $\exp\{\beta_1\}$: odds ratio of being a CHD case for coffee drinkers vs. non-drinkers among non-smokers
- $\exp\{\beta_1 + \beta_3\}$: odds ratio of being a CHD case for coffee drinkers vs. non-drinkers among smokers

Interpretations
- $\exp\{\beta_2\}$: odds ratio of being a CHD case for smokers vs. non-smokers among non-coffee drinkers
- $\exp\{\beta_2 + \beta_3\}$: odds ratio of being a CHD case for smokers vs. non-smokers among coffee drinkers

Interpretations
- $\frac{e^{\beta_0}}{1 + e^{\beta_0}}$: fraction of cases among non-smoking, non-coffee-drinking individuals in the sample (determined by the sampling plan)
- $\exp\{\beta_3\}$: ratio of odds ratios

$\exp\{\beta_3\}$ Interpretations
- $\exp\{\beta_3\}$: factor by which the odds ratio of being a CHD case for coffee drinkers vs. non-drinkers is multiplied for smokers as compared to non-smokers
or
- $\exp\{\beta_3\}$: factor by which the odds ratio of being a CHD case for smokers vs. non-smokers is multiplied for coffee drinkers as compared to non-coffee drinkers
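The "ratio of odds ratios" reading follows from repeating the non-smoker odds calculation among smokers. A worked derivation under the interaction model defined earlier (the subscripted $OR$ labels are shorthand introduced here, not notation from the slides):

```latex
\begin{aligned}
\mathrm{Odds}(\text{case} \mid COF = 1, SMK = 1) &= e^{\beta_0 + \beta_1 + \beta_2 + \beta_3} \\
\mathrm{Odds}(\text{case} \mid COF = 0, SMK = 1) &= e^{\beta_0 + \beta_2} \\
OR_{\text{coffee} \mid \text{smokers}} &= \frac{e^{\beta_0 + \beta_1 + \beta_2 + \beta_3}}{e^{\beta_0 + \beta_2}} = e^{\beta_1 + \beta_3} \\
\frac{OR_{\text{coffee} \mid \text{smokers}}}{OR_{\text{coffee} \mid \text{non-smokers}}} &= \frac{e^{\beta_1 + \beta_3}}{e^{\beta_1}} = e^{\beta_3}
\end{aligned}
```

Swapping the roles of coffee and smoking gives the second interpretation in the same way, with $e^{\beta_2 + \beta_3} / e^{\beta_2} = e^{\beta_3}$.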

Some Special Cases
Given $\log\left[\frac{\Pr(Y = 1)}{\Pr(Y = 0)}\right] = \beta_0 + \beta_1 COF + \beta_2 SMK + \beta_3 COF \times SMK$
If $\beta_1 = \beta_2 = \beta_3 = 0$: neither smoking nor coffee drinking is associated with increased risk of CHD

Some Special Cases
Given $\log\left[\frac{\Pr(Y = 1)}{\Pr(Y = 0)}\right] = \beta_0 + \beta_1 COF + \beta_2 SMK + \beta_3 COF \times SMK$
If $\beta_1 = \beta_3 = 0$: smoking, but not coffee drinking, is associated with increased risk of CHD

Some Special Cases
If $\beta_3 = 0$:
- Smoking and coffee drinking are both associated with risk of CHD, but the CHD-smoking odds ratio is the same at both levels of coffee drinking
- Smoking and coffee drinking are both associated with risk of CHD, but the CHD-coffee odds ratio is the same at both levels of smoking

CHD ~ Coffee: Coefficients
Logit estimates                                   Number of obs   =        151
                                                  LR chi2(1)      =       5.65
                                                  Prob > chi2     =     0.0175
Log likelihood = -100.64332                       Pseudo R2       =     0.0273
------------------------------------------------------------------------------
         chd |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         cof |   .7874579   .3347123     2.35   0.019     .1314338    1.443482
       _cons |  -.6539265   .2417869    -2.70   0.007     -1.12782   -.1800329
------------------------------------------------------------------------------

Adding Smoke: Coefficients
. logit chd cof smk
Logit estimates                                   Number of obs   =        151
                                                  LR chi2(2)      =      15.19
                                                  Prob > chi2     =     0.0005
Log likelihood = -95.869718                       Pseudo R2       =     0.0734
------------------------------------------------------------------------------
         chd |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         cof |   .5269764   .3541932     1.49   0.137    -.1672295    1.221182
         smk |   1.101978   .3609954     3.05   0.002     .3944404    1.809516
       _cons |  -.9572328   .2703086    -3.54   0.000    -1.487028   -.4274377
------------------------------------------------------------------------------

Adding Interaction: Coefficients
. logit chd cof smk cof_smk
Logit estimates                                   Number of obs   =        151
                                                  LR chi2(3)      =      15.55
                                                  Prob > chi2     =     0.0014
Log likelihood = -95.694169                       Pseudo R2       =     0.0751
------------------------------------------------------------------------------
         chd |      Coef.   Std. Err.      z    P>|z|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
         cof |   .6931472   .4525062     1.53   0.126    -.1937487    1.580043
         smk |   1.348073   .5535208     2.44   0.015     .2631923    2.432954
     cof_smk |  -.4317824   .7294515    -0.59   0.554    -1.861481    .9979163
       _cons |  -1.029619   .3007926    -3.42   0.001    -1.619162   -.4400768
------------------------------------------------------------------------------

Comparing Models
Model 1
Variable      Est     se      z
Intercept    -.65    .24   -2.7
Coffee        .79    .33    2.4
Model 2
Variable      Est     se      z
Intercept    -.96    .27   -3.5
Coffee        .53    .35    1.5
Smoking      1.10    .36    3.1

Question: Is smoking a confounder of the coffee-CHD association?

Confounding
- In epidemiological terms, Z is a confounder of the relationship of Y with X if Z is related to both X and Y and Z is not in the causal pathway between X and Y
- In statistical terms, Z is a confounder of the relationship of Y with X if the X coefficient changes when Z is added to a regression of Y on X

Confounding
For example, consider the two models
$Y = \beta_0 + \beta_1 X + \varepsilon_1$
$Y = \beta_0^* + \beta_1^* X + \beta_2^* Z + \varepsilon_2$
Then Z is a confounder of the X, Y relationship if $\beta_1 \neq \beta_1^*$

Comparing Models
Model 1
Variable      Est     se      z
Intercept    -.65    .24   -2.7
Coffee        .79    .33    2.4
Model 2
Variable      Est     se      z
Intercept    -.96    .27   -3.5
Coffee        .53    .35    1.5
Smoking      1.10    .36    3.1

Look at Confidence Intervals
Without smoking:
- $OR = e^{0.79} = 2.2$
- 95% CI for log(OR): 0.79 ± 1.96(0.33) = (0.13, 1.44)
- 95% CI for OR: $(e^{0.13}, e^{1.44})$ = (1.14, 4.24)

Look at Confidence Intervals
With smoking (adjusting for smoking):
- $OR = e^{0.53} = 1.7$
- 95% CI for log(OR): 0.53 ± 1.96(0.35) = (-0.17, 1.22)
- 95% CI for OR: $(e^{-0.17}, e^{1.22})$ = (0.85, 3.39)
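The two interval calculations above follow one recipe: build the interval on the log-odds scale, then exponentiate the endpoints. A minimal sketch, with `wald_or_ci` as a hypothetical helper name; it uses the unrounded coefficients and standard errors from the Stata output rather than the two-decimal values on the slide, which is why its results agree with the slide's final intervals.

```python
import math

def wald_or_ci(coef, se, z=1.96):
    """Odds ratio and Wald CI from a logistic coefficient and its standard error."""
    lower_log, upper_log = coef - z * se, coef + z * se
    return math.exp(coef), math.exp(lower_log), math.exp(upper_log)

# Coffee coefficient without smoking in the model (logit chd cof)
unadjusted = wald_or_ci(0.7874579, 0.3347123)
# Coffee coefficient adjusting for smoking (logit chd cof smk)
adjusted = wald_or_ci(0.5269764, 0.3541932)

for label, (or_hat, lower, upper) in [("unadjusted", unadjusted),
                                      ("adjusted", adjusted)]:
    print(f"{label}: OR = {or_hat:.2f}, 95% CI = ({lower:.2f}, {upper:.2f})")
```

The adjusted interval includes 1 while the unadjusted one does not, which is the comparison the conclusion draws on.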

Conclusion
- So, ignoring smoking, the CHD and coffee OR is 2.2 (95% CI: 1.14-4.24)
- Adjusting for smoking gives more modest evidence for a coffee effect
- In this case-control study, smoking is a weak-to-moderate confounder of the coffee-CHD association

Question: Is smoking an effect modifier of the CHD-coffee association?

Interaction Model
Model 3
Variable            Est     se      z
Intercept          -1.0    .30   -3.4
Coffee              .69    .45    1.5
Smoking             1.3    .55    2.4
Coffee*Smoking     -.43    .73   -.59

Testing Interaction Term
Among non-smokers:
- $OR = e^{0.69} = 1.99$
- 95% CI for log(OR): 0.69 ± 1.96(0.45) = (-0.19, 1.58)
- 95% CI for OR: $(e^{-0.19}, e^{1.58})$ = (0.82, 4.86)

Testing Interaction Term
Among smokers:
- $OR = e^{0.69 - 0.43} = e^{0.26} = 1.30$
- 95% CI for log(OR): 0.26 ± 1.96(0.57) = (-0.86, 1.38)
- 95% CI for OR: $(e^{-0.86}, e^{1.38})$ = (0.42, 3.99)

Testing Interaction Term
- Z = -0.59, p-value = 0.554
- 95% confidence interval for $\exp(\beta_1 + \beta_3)$, the coffee odds ratio among smokers: (0.42, 3.99)
- Both of the above suggest that there is little evidence that smoking is an effect modifier!

Note
Calculating the SE for $\hat\beta_1 + \hat\beta_3$: $0.57 = \sqrt{0.329}$
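The standard error of a sum of two coefficients needs their covariance: Var(b1 + b3) = Var(b1) + Var(b3) + 2 Cov(b1, b3). The transcript does not show the covariance, so the sketch below makes two labelled assumptions: the cell counts are those inferred from the stratified odds-ratio slides, and, because Model 3 is saturated, each variance is a sum of reciprocal cell counts with Cov(b1, b3) = -Var(b1). The variance it produces is about 0.327 rather than the slide's 0.329, which agrees with the reported 0.57 to two decimals.

```python
import math

# Inferred cell counts (cases, controls); not shown as a table in the transcript.
non_smoker_cells = [15, 21, 15, 42]   # coffee cases/controls, no-coffee cases/controls
smoker_cells = [25, 14, 11, 8]

# Saturated model: the coffee log OR among non-smokers is b1, so
# Var(b1) is the sum of reciprocals of the four non-smoker cells.
var_b1 = sum(1 / n for n in non_smoker_cells)
# b3 = (log OR among smokers) - (log OR among non-smokers); strata are independent.
var_b3 = var_b1 + sum(1 / n for n in smoker_cells)
# Assumption that holds for the saturated model: Cov(b1, b3) = -Var(b1).
cov_b1_b3 = -var_b1

var_sum = var_b1 + var_b3 + 2 * cov_b1_b3
se_sum = math.sqrt(var_sum)

print("SE(b1) =", round(math.sqrt(var_b1), 4))
print("SE(b3) =", round(math.sqrt(var_b3), 4))
print("SE(b1 + b3) =", round(se_sum, 2))
```

The first two printed standard errors can be compared with the cof and cof_smk rows of the Stata output as a check on the inferred counts.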

Question: What model should we choose to describe the relationship of coffee and smoking with CHD?

Fitted Values
We can use the logistic models to calculate fitted values for comparison with observed frequencies using each of the three models.
Model 1: $\hat{p} = \frac{e^{-.65 + .79\,\text{Coffee}}}{1 + e^{-.65 + .79\,\text{Coffee}}}$

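Fitted Values
Model 2: $\hat{p} = \frac{e^{-.96 + .53\,\text{Coffee} + 1.1\,\text{Smoking}}}{1 + e^{-.96 + .53\,\text{Coffee} + 1.1\,\text{Smoking}}}$
Model 3: $\hat{p} = \frac{e^{-1.03 + .69\,\text{Coffee} + 1.3\,\text{Smoking} - .43\,(\text{Coffee} \times \text{Smoking})}}{1 + e^{-1.03 + .69\,\text{Coffee} + 1.3\,\text{Smoking} - .43\,(\text{Coffee} \times \text{Smoking})}}$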

Observed vs Fitted Values

Saturated Model
- Note that the fitted values from Model 3 exactly match the observed values, indicating a saturated model that reproduces the observed cell fractions perfectly
- Although the saturated model will always result in a perfect fit, it is usually not the best model (e.g., when there are continuous covariates or many covariates)
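The observed-versus-fitted comparison did not survive transcription, but it can be sketched from the Stata coefficients. The observed counts below are an assumption: they are the cell counts inferred from the stratified odds-ratio slides, not a table taken from the lecture. The names `expit`, `models`, and `fitted` are hypothetical.

```python
import math

def expit(x):
    """Inverse logit: turns a log-odds value into a probability."""
    return 1.0 / (1.0 + math.exp(-x))

# (intercept, coffee, smoking, coffee*smoking) from the three Stata outputs;
# terms absent from a model are set to 0.
models = {
    "Model 1": (-0.6539265, 0.7874579, 0.0, 0.0),
    "Model 2": (-0.9572328, 0.5269764, 1.101978, 0.0),
    "Model 3": (-1.029619, 0.6931472, 1.348073, -0.4317824),
}

# Inferred (cases, total) for each (coffee, smoking) pattern.
observed = {(0, 0): (15, 57), (1, 0): (15, 36), (0, 1): (11, 19), (1, 1): (25, 39)}

def fitted(beta, cof, smk):
    """Fitted probability of CHD for one covariate pattern."""
    b0, b1, b2, b3 = beta
    return expit(b0 + b1 * cof + b2 * smk + b3 * cof * smk)

for (cof, smk), (cases, total) in observed.items():
    fits = ", ".join(f"{name}: {fitted(beta, cof, smk):.3f}"
                     for name, beta in models.items())
    print(f"coffee={cof} smoking={smk} observed={cases / total:.3f} | {fits}")
```

Only Model 3 has as many parameters as covariate patterns (four), which is why its fitted values coincide with the observed fractions.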

Likelihood Ratio Test
- The likelihood ratio test helps decide whether or not additional term(s) significantly improve the model fit
- The likelihood ratio test (LRT) statistic for comparing nested models is -2 times the difference between the log likelihoods (LLs) for the null vs. extended models
- The statistic plays the same role as the F statistic from an analysis of variance test for linear regression models

Likelihood Ratio Test Deviance is a term used for the difference in -2*log likelihood relative to the best possible value from a perfectly predicting model. Change in deviance is the same as change in -2LL.

LRT Example

Model comparisons using likelihood ratio test
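The comparison table for this slide is missing from the transcript, so the following is a reconstruction under stated assumptions rather than the lecture's own numbers: it applies the LRT definition to the three log likelihoods printed in the Stata outputs. Each step adds one term, so df = 1, and for 1 df the chi-square tail probability is `erfc(sqrt(x / 2))`. The function name `lrt_1df` is hypothetical.

```python
import math

def lrt_1df(ll_null, ll_extended):
    """LRT statistic and p-value for nested models differing by one parameter."""
    stat = -2.0 * (ll_null - ll_extended)
    p_value = math.erfc(math.sqrt(stat / 2.0))  # chi-square upper tail, 1 df
    return stat, p_value

ll_model1 = -100.64332   # logit chd cof
ll_model2 = -95.869718   # logit chd cof smk
ll_model3 = -95.694169   # logit chd cof smk cof_smk

stat_12, p_12 = lrt_1df(ll_model1, ll_model2)   # does adding smoking help?
stat_23, p_23 = lrt_1df(ll_model2, ll_model3)   # does adding the interaction help?

print(f"Model 1 vs 2: LRT = {stat_12:.2f}, p = {p_12:.4f}")
print(f"Model 2 vs 3: LRT = {stat_23:.2f}, p = {p_23:.3f}")
```

The second p-value is close to the Wald p-value of 0.554 for the interaction term, as expected when both tests address the same single coefficient.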

Summary A case-control study was conducted with 151 subjects, 66 (44%) of whom had CHD, to assess the relative importance of smoking and coffee drinking as risk factors. The observed fractions of CHD cases by smoking, coffee strata are

Summary: Unadjusted ORs
- The odds of CHD were estimated to be 3.4 times as high among smokers as among non-smokers; 95% CI: (1.7, 7.9)
- The odds of CHD were estimated to be 2.2 times as high among coffee drinkers as among non-coffee drinkers; 95% CI: (1.1, 4.3)

Summary: Adjusted ORs
Controlling for the potential confounding of smoking, the coffee odds ratio was estimated to be 1.7 with 95% CI: (.85, 3.4). Hence, the evidence in these data is insufficient to conclude that coffee has an independent effect on CHD beyond that of smoking.

Summary
Finally, we estimated the coffee odds ratio separately for smokers and non-smokers to assess whether smoking is an effect modifier of the coffee-CHD relationship. For smokers and non-smokers, the coffee odds ratio was estimated to be 1.3 (95% CI: .42, 4.0) and 2.0 (95% CI: .82, 4.9), respectively. There is little evidence of effect modification in these data.

Note: Retrospective Studies
- The ratio of the odds of CHD for coffee vs. non-coffee drinkers is equivalent to the ratio of the odds of coffee drinking for cases of CHD vs. controls
- Thus, we can estimate the odds ratio of CHD (a prospective question) using retrospective data, a key property of odds ratios
- This is one reason why logistic regression is so popular with epidemiologists
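The symmetry described above is easy to verify numerically. A small sketch using the pooled counts from the earlier odds-ratio slide; the variable names and the assignment of counts to cells are assumptions made for illustration.

```python
# Pooled counts (cell assignment inferred from the cross-product formula).
cases_coffee, cases_no_coffee = 40, 26
controls_coffee, controls_no_coffee = 35, 50

# "Prospective" form: odds of being a case, coffee drinkers vs. non-drinkers.
or_disease = (cases_coffee / controls_coffee) / (cases_no_coffee / controls_no_coffee)

# "Retrospective" form: odds of coffee drinking, cases vs. controls.
or_exposure = (cases_coffee / cases_no_coffee) / (controls_coffee / controls_no_coffee)

print(round(or_disease, 3), round(or_exposure, 3))
```

Both expressions reduce to the same cross-product ratio, (40 × 50) / (26 × 35), so the case-control sampling fractions cancel.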