Lecture 9: Linear regression: centering, hypothesis testing, multiple covariates, and confounding

Sandy Eckel, seckel@jhsph.edu
6 May 2008

Recall: main idea of linear regression
Linear regression can be used to study an outcome as a linear function of a predictor.
Example: 60 cities in the US were evaluated for numerous characteristics, including:
  Outcome: the percentage of the population that had low income
  Predictor: median education level

Recall: Where is our intercept?
We will introduce the idea of centering using the city % low income example.
[Figure: scatterplot of % of population with income < $3000 (y-axis) against median education (x-axis)]

Need for centering
\(\beta_0\) makes no sense! We don't observe any cities with median education = 0.
We can change X to fix this problem by a process called centering:
1. Pick a value of X (c) within the range of the data.
2. For each observation, generate \(X_{centered} = X - c\).
3. Redo the regression with \(X_{centered}\).

Centering
We'll use c = 12, a high school degree.
[Figure: scatterplot of % of population with income < $3000 (y-axis) against median education (x-axis)]

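Centering: new equation
\[ \hat{Y} = \beta_0 + \beta_1 X_{centered} = \beta_0 + \beta_1 (X - 12) \]
\[ \hat{Y} = 12.2 - 2.0\,(X - 12) \]
\(\beta_1\) has not changed.
\(\beta_0\) now corresponds to the average of Y when \(X_{centered} = 0\) or, equivalently, X = 12 (not X = 0).
Note: with X = 0, we have \(\hat{Y} = 12.2 - 2.0\,(0 - 12) = 12.2 + 24 = 36.2\).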
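A small sketch of centering on made-up numbers may help here. The five (education, % low income) pairs below are hypothetical, not the 60-city data, and `fit_line` is an illustrative helper rather than anything from the lecture. It suggests that refitting with X - 12 leaves the slope alone and moves the intercept to the fitted value at X = 12.

```python
def fit_line(xs, ys):
    """Return (intercept, slope) of the least squares line through (xs, ys)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical data: median education (years) and % low income.
education = [9.0, 10.0, 11.0, 12.0, 13.0]
pct_low_income = [19.0, 16.5, 14.0, 12.5, 10.0]

c = 12  # centering value: a high school degree
b0_raw, b1_raw = fit_line(education, pct_low_income)
b0_ctr, b1_ctr = fit_line([x - c for x in education], pct_low_income)

print(f"raw:      intercept={b0_raw:.2f}, slope={b1_raw:.2f}")
print(f"centered: intercept={b0_ctr:.2f}, slope={b1_ctr:.2f}")
```

The centered intercept equals the raw line evaluated at c, which is why only the interpretation of the intercept changes.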

Centering: interpretation
\(\beta_0\) is the mean outcome for the reference group, or the group for which \(X_{centered} = X - 12 = 0\), that is, when X = 12.
Here, \(\beta_0\) (12.2) is the average percent of the population that is disadvantaged for cities with a median education level of 12, the equivalent of a high school degree.
The interpretation of \(\beta_1\) has not changed.

Hypothesis testing and confidence intervals of regression coefficients

Drawing conclusions about population association using our data (a sample)
\(\beta_0\): changes depending on the centering of X, which doesn't affect the association of interest.
Real concern: is X associated with Y?
Assess by testing \(\beta_1\): does \(\beta_1 = 0\) in the population from which this sample was drawn?
  Hypothesis testing
  Confidence interval

Hypothesis testing: null and alternative hypotheses, test statistic
\(H_0: \beta_1 = 0\)
\(H_1: \beta_1 \neq 0\)
Test statistic: \( t_{obs} = \dfrac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} \), with \( df = n - k - 1 \)
n = number of observations
k = number of predictors (X's)

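Hypothesis testing: education example
\(H_0: \beta_1 = 0\)
Test statistic: \( t_{obs} = \dfrac{-2.0 - 0}{0.59} = -3.36 \) (with the rounded values shown here the ratio is about -3.39)
\( df = n - k - 1 = 60 - 1 - 1 = 58 \)
n = number of observations = 60
k = number of predictors (X's) = 1
Calculate our p-value in R: 2*pt(-3.36, df=58) returns about 0.0014, so p-value = 0.001.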
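The p-value step can be reproduced without R. Python's standard library has no t distribution, so the sketch below integrates the t density numerically with Simpson's rule; `t_pdf` and `two_sided_p` are illustrative names, and the numerical integration is a stand-in chosen for this example, not the method R uses internally.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c) * (1 + x * x / df) ** (-(df + 1) / 2)

def two_sided_p(t_obs, df, steps=10_000):
    """Two-sided p-value: 1 minus the area between -|t_obs| and |t_obs|.

    The area from 0 to |t_obs| is approximated with Simpson's rule
    (steps must be even) and doubled by symmetry.
    """
    b = abs(t_obs)
    h = b / steps
    total = t_pdf(0.0, df) + t_pdf(b, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 1 - 2 * (total * h / 3)

n, k = 60, 1             # cities and predictors in the education example
df = n - k - 1           # degrees of freedom
p_value = two_sided_p(-3.36, df)
print(f"df = {df}, p-value = {p_value:.4f}")
```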

Hypothesis testing, education example: interpretation and conclusion
If there were no association between median education and percentage of disadvantaged citizens in the population, there would be about a 0.1% chance of observing data as or more extreme than ours.
This probability under the null hypothesis is very small, so we:
  reject the null hypothesis
  conclude that median education level and percentage of disadvantaged citizens are associated in the population

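Confidence interval for regression coefficients
We calculate the CI using the usual formula:
\[ \hat{\beta}_1 \pm t_{CR} \times SE(\hat{\beta}_1), \qquad \text{df of } t_{CR} = n - k - 1 \]
For the education example, the 95% CI for \(\beta_1\) is:
\[ -2.0 \pm 2.0 \times 0.59 = (-3.2,\ -0.8) \]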
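The interval arithmetic can be checked in a few lines. This is only a sketch of the formula above, using the rounded slope, standard error, and critical value from the education example.

```python
beta1_hat = -2.0   # estimated slope
se_beta1 = 0.59    # its standard error
t_cr = 2.0         # critical value for a 95% CI with df = 58

margin = t_cr * se_beta1
ci = (beta1_hat - margin, beta1_hat + margin)
contains_zero = ci[0] <= 0 <= ci[1]

print(f"95% CI: ({ci[0]:.1f}, {ci[1]:.1f}); contains 0: {contains_zero}")
```

An interval that excludes 0 leads to the same conclusion as rejecting the null hypothesis at the 5% level.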

Confidence interval, education example: interpretation and conclusion
We are 95% confident that the true population decrease in percentage of low income citizens per additional year of median education is between 0.8 and 3.2 percentage points.
Since this interval does not contain 0, we believe percentage of low income citizens and median education are associated among cities in the United States.

Multiple covariates and confounding

Dataset
Hourly wage information from 998 workers, along with information regarding age, gender, years of experience, etc.
We'll focus on predicting hourly wage with the available information.
  Outcome: hourly wage
  Multiple predictors: age, gender, years of experience, etc.

Regression: hourly wage vs. years of experience
Simple linear regression, since there is only one covariate (years of experience).
[Figure: scatterplot of hourly wage (y-axis) against years of experience (x-axis)]

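How do we estimate the coefficients? Use least squares
For each person i, their actual hourly wage (\(Y_i\)) and predicted hourly wage (\(\hat{Y}_i\)) are known.
\[ \hat{\varepsilon}_i = Y_i - \hat{Y}_i = Y_i - (\hat{\beta}_0 + \hat{\beta}_1 X_i) \]
is the residual or error.
The coefficient estimates are found by minimizing the sum of the squared errors:
\[ \min_{\beta_0,\,\beta_1} \sum_{i=1}^{n} \bigl( Y_i - (\beta_0 + \beta_1 X_i) \bigr)^2 \]
The resulting coefficients are the least squares estimates.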
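To make the minimization concrete, the sketch below uses the standard closed-form solution for one predictor and then checks that nudging either coefficient raises the sum of squared errors. The (experience, wage) pairs are hypothetical and the function names are illustrative.

```python
def least_squares(xs, ys):
    """Closed-form least squares estimates (b0, b1) for one predictor."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1

def sse(b0, b1, xs, ys):
    """Sum of squared residuals for the line b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Hypothetical (years of experience, hourly wage) pairs.
experience = [0, 5, 10, 20, 30]
wage = [7.5, 9.0, 8.0, 10.5, 9.0]

b0, b1 = least_squares(experience, wage)
best = sse(b0, b1, experience, wage)

# Nudge each coefficient away from its estimate and recompute the SSE.
nudged = [
    sse(b0 + d0, b1 + d1, experience, wage)
    for d0 in (-0.5, 0.0, 0.5)
    for d1 in (-0.05, 0.0, 0.05)
    if (d0, d1) != (0.0, 0.0)
]

print(f"b0={b0:.3f}, b1={b1:.3f}, SSE={best:.3f}")
print("every nudged line has a larger SSE:", all(s > best for s in nudged))
```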

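Notes on regression analysis
Recall the regression line equation: \( Y = \beta_0 + \beta_1 X + \varepsilon \)
\( \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X \) for any known point on the line.
\( \bar{Y} = \hat{\beta}_0 + \hat{\beta}_1 \bar{X} \) is always true: the fitted line passes through the point of means.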

Model 1: years of experience
Model 1: predict income by years of experience
\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X \]
\[ \hat{Y} = 8.38 + 0.04X \]
\(\hat{\beta}_0 = 8.38\), so the average hourly wage for someone with no experience at all is about $8.40.
\(\hat{\beta}_1 = 0.04\), so for every additional year of experience, the predicted hourly wage increases about 4 cents.
For 10 years of additional experience, the predicted hourly wage increases about 40 cents.
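As a quick check of the two interpretations, here is a minimal sketch that evaluates the fitted Model 1 line, taking the coefficients as 8.38 and 0.04; the function name is illustrative.

```python
def predict_wage_model1(experience_years):
    """Predicted hourly wage (dollars) from Model 1: 8.38 + 0.04 * experience."""
    return 8.38 + 0.04 * experience_years

no_experience = predict_wage_model1(0)
gain_10_years = predict_wage_model1(10) - predict_wage_model1(0)

print(f"no experience: ${no_experience:.2f}")
print(f"gain from 10 more years: ${gain_10_years:.2f}")
```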

Should we center X?
0 years of experience is within the range of the data.
The average hourly wage corresponding to 0 years of experience makes sense.
No need to center X.

Model 2: years of experience, gender (0 = man, 1 = woman)
What happens if we also consider gender?
[Figure: hourly wage (y-axis) against years of experience (x-axis), with men's and women's hourly wages plotted separately and a fitted line for each group]

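Model 2: for no experience
Years of experience, gender (0 = man, 1 = woman)
\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1(\text{Experience}) + \hat{\beta}_2(\text{Gender}) \]
\[ \hat{Y} = 9.27 + 0.04(\text{Experience}) - 2.20(\text{Gender}) \]
For a man with no experience: \( \hat{Y} = 9.27 + 0.04(0) - 2.20(0) = \hat{\beta}_0 \), or $9.27
For a woman with no experience: \( \hat{Y} = 9.27 + 0.04(0) - 2.20(1) = \hat{\beta}_0 + \hat{\beta}_2 \), or $7.07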

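Model 2: for 10 years of experience
Years of experience, gender (0 = man, 1 = woman)
\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1(\text{Experience}) + \hat{\beta}_2(\text{Gender}) \]
\[ \hat{Y} = 9.27 + 0.04(\text{Experience}) - 2.20(\text{Gender}) \]
For a man with 10 years of experience: \( \hat{Y} = 9.27 + 0.04(10) - 2.20(0) = \hat{\beta}_0 + \hat{\beta}_1(10) \), or $9.67
For a woman with 10 years of experience: \( \hat{Y} = 9.27 + 0.04(10) - 2.20(1) = \hat{\beta}_0 + \hat{\beta}_1(10) + \hat{\beta}_2(1) \), or $7.47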

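Model 2: for males
Years of experience, gender (0 = man, 1 = woman)
\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1(\text{Experience}) + \hat{\beta}_2(\text{Gender}) \]
\[ \hat{Y} = 9.27 + 0.04(\text{Experience}) - 2.20(\text{Gender}) \]
For a man with no experience: \( \hat{Y} = 9.27 + 0.04(0) - 2.20(0) = \hat{\beta}_0 \), or $9.27
For a man with 10 years of experience: \( \hat{Y} = 9.27 + 0.04(10) - 2.20(0) = \hat{\beta}_0 + \hat{\beta}_1(10) \), or $9.67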

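Model 2: for females
Years of experience, gender (0 = man, 1 = woman)
\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1(\text{Experience}) + \hat{\beta}_2(\text{Gender}) \]
\[ \hat{Y} = 9.27 + 0.04(\text{Experience}) - 2.20(\text{Gender}) \]
For a woman with no experience: \( \hat{Y} = 9.27 + 0.04(0) - 2.20(1) = \hat{\beta}_0 + \hat{\beta}_2 \), or $7.07
For a woman with 10 years of experience: \( \hat{Y} = 9.27 + 0.04(10) - 2.20(1) = \hat{\beta}_0 + \hat{\beta}_1(10) + \hat{\beta}_2 \), or $7.47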

Model 2: interpretation
\[ \hat{Y} = 9.27 + 0.04(\text{Experience}) - 2.20(\text{Gender}) \]
\(\hat{\beta}_0 = 9.27\): the average hourly wage for a man with no experience at all is about $9.30.
\(\hat{\beta}_1 = 0.04\): for every additional year of experience, the predicted hourly wage increases about 4 cents for both men and women.
\(\hat{\beta}_2 = -2.20\): the expected hourly wage is $2.20 lower for women than it is for men at any experience level.
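The four worked cases for Model 2 can be reproduced with a short sketch, taking the coefficients as 9.27, 0.04, and -2.20 (the function name is illustrative). Because the model has no interaction term, the gap between men and women is the same at every experience level, which is what makes the two fitted lines parallel.

```python
def predict_wage_model2(experience_years, gender):
    """Predicted hourly wage (dollars) from Model 2; gender: 0 = man, 1 = woman."""
    return 9.27 + 0.04 * experience_years - 2.20 * gender

for years in (0, 10):
    man = predict_wage_model2(years, 0)
    woman = predict_wage_model2(years, 1)
    print(f"{years:>2} years: man ${man:.2f}, woman ${woman:.2f}, gap ${man - woman:.2f}")
```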

Confounding by gender? Model 1 vs. Model 2
Model 1: \( \hat{Y} = 8.38 + 0.04(\text{Experience}) \)
Model 2: \( \hat{Y} = 9.27 + 0.04(\text{Experience}) - 2.20(\text{Gender}) \)
95% CI for \(\beta_1\) in Model 1: (0.00, 0.07), and \(\hat{\beta}_1\) from Model 2 is within this CI.
Gender is not a confounder.
The association between salary and experience does not change when we control for gender.

Confounding: the epidemiologic definition
C is a confounder of the relation between X and Y if:
[Diagram relating Outcome Y, Confounder C, and Predictor X]

Confounding: example
Smoking is a confounder of the relation between coffee consumption (X) and lung cancer (Y) since:
[Diagram relating Lung Cancer (Y), Smoking (C), and Coffee Consumption (X)]
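A toy illustration of what confounding does to a regression slope, using entirely made-up numbers: the third variable C shifts both X and Y, while X has no relation to Y within either level of C. The crude slope of Y on X is then clearly positive even though the slope inside each stratum of C is zero. Stratifying by C is used here as a simple stand-in for adjusting for C in a regression model.

```python
def slope(xs, ys):
    """Least squares slope of ys on xs."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    return sxy / sxx

# Made-up data. Observations with C = 1 have higher X and higher Y,
# but within each level of C, Y does not trend with X.
x_c0, y_c0 = [1, 2, 3], [5.1, 4.8, 5.1]   # C = 0
x_c1, y_c1 = [3, 4, 5], [8.1, 7.8, 8.1]   # C = 1

crude = slope(x_c0 + x_c1, y_c0 + y_c1)   # ignores C
within_c0 = slope(x_c0, y_c0)             # holds C fixed at 0
within_c1 = slope(x_c1, y_c1)             # holds C fixed at 1

print(f"crude slope: {crude:.2f}")
print(f"slope within C=0: {within_c0:.2f}, within C=1: {within_c1:.2f}")
```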

Model 3: years of experience, age
Let's try a model that includes age instead of gender:
\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1(\text{Experience}) + \hat{\beta}_2(\text{Age} - 40) \]
The relationship is harder to graph with two continuous predictors, since now the regression is in a 3-dimensional space. We will learn AV plots in a few days.
Notice that age is centered at 40 years. Age ranged between 18 and 64 in this dataset.

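Model 3: no experience
Years of experience, age
\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1(\text{Experience}) + \hat{\beta}_2(\text{Age} - 40) \]
\[ \hat{Y} = 26.5 - 0.82(\text{Experience}) + 0.92(\text{Age} - 40) \]
For a 40-year-old with no experience: \( \hat{Y} = 26.5 - 0.82(0) + 0.92(40 - 40) = \hat{\beta}_0 \), or $26.50
For a 41-year-old with no experience: \( \hat{Y} = 26.5 - 0.82(0) + 0.92(41 - 40) = \hat{\beta}_0 + \hat{\beta}_2 \), or $27.42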

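Model 3: 10 years of experience
Years of experience, age
\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1(\text{Experience}) + \hat{\beta}_2(\text{Age} - 40) \]
\[ \hat{Y} = 26.5 - 0.82(\text{Experience}) + 0.92(\text{Age} - 40) \]
For a 40-year-old with 10 years of experience: \( \hat{Y} = 26.5 - 0.82(10) + 0.92(40 - 40) = \hat{\beta}_0 + 10\hat{\beta}_1 \), or $18.30
For a 41-year-old with 10 years of experience: \( \hat{Y} = 26.5 - 0.82(10) + 0.92(41 - 40) = \hat{\beta}_0 + 10\hat{\beta}_1 + \hat{\beta}_2 \), or $19.22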

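Model 3: 40 years of age
Years of experience, age
\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1(\text{Experience}) + \hat{\beta}_2(\text{Age} - 40) \]
\[ \hat{Y} = 26.5 - 0.82(\text{Experience}) + 0.92(\text{Age} - 40) \]
For a 40-year-old with no experience: \( \hat{Y} = 26.5 - 0.82(0) + 0.92(40 - 40) = \hat{\beta}_0 \), or $26.50
For a 40-year-old with 10 years of experience: \( \hat{Y} = 26.5 - 0.82(10) + 0.92(40 - 40) = \hat{\beta}_0 + 10\hat{\beta}_1 \), or $18.30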

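Model 3: 41 years of age
Years of experience, age
\[ \hat{Y} = \hat{\beta}_0 + \hat{\beta}_1(\text{Experience}) + \hat{\beta}_2(\text{Age} - 40) \]
\[ \hat{Y} = 26.5 - 0.82(\text{Experience}) + 0.92(\text{Age} - 40) \]
For a 41-year-old with no experience: \( \hat{Y} = 26.5 - 0.82(0) + 0.92(41 - 40) = \hat{\beta}_0 + \hat{\beta}_2 \), or $27.42
For a 41-year-old with 10 years of experience: \( \hat{Y} = 26.5 - 0.82(10) + 0.92(41 - 40) = \hat{\beta}_0 + 10\hat{\beta}_1 + \hat{\beta}_2 \), or $19.22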

Model 3: interpretation
\(\hat{\beta}_0 = 26.5\): the average hourly wage for a 40-year-old with no experience at all is about $26.50.
\(\hat{\beta}_1 = -0.82\): for every additional year of experience, the predicted hourly wage decreases about 82 cents for two people of the same age (or "adjusting for age").
\(\hat{\beta}_2 = 0.92\): for every additional year of age, the expected hourly wage increases about 92 cents for two people with the same amount of experience (or "adjusting for experience").
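The Model 3 worked cases can be reproduced with a short sketch, taking the coefficients as 26.5, -0.82, and 0.92 with age centered at 40 (the function name is illustrative). Holding one predictor fixed and moving the other by one unit recovers each "adjusted" coefficient.

```python
def predict_wage_model3(experience_years, age):
    """Predicted hourly wage (dollars) from Model 3, with age centered at 40."""
    return 26.5 - 0.82 * experience_years + 0.92 * (age - 40)

for age in (40, 41):
    for years in (0, 10):
        print(f"age {age}, {years:>2} years experience: "
              f"${predict_wage_model3(years, age):.2f}")

# One more year of experience at a fixed age, and one more year of age
# at fixed experience.
per_year_experience = predict_wage_model3(1, 40) - predict_wage_model3(0, 40)
per_year_age = predict_wage_model3(0, 41) - predict_wage_model3(0, 40)
```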

Model 1 vs. Model 3
What happens when we include age?
Model 1: \( \hat{Y} = 8.38 + 0.04(\text{Experience}) \)
Model 3: \( \hat{Y} = 26.5 - 0.82(\text{Experience}) + 0.92(\text{Age} - 40) \)
95% CI for \(\beta_1\) in Model 1: (0.00, 0.07), and \(\hat{\beta}_1\) from Model 3 is outside this CI.
Age is a confounder!
When we adjust for age, the apparent effect of experience on wage changes.
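The comparison rule used for gender and for age can be written as one small check: does the adjusted experience coefficient fall inside the 95% CI from the unadjusted model? The sketch below applies that rule of thumb to the values quoted for Models 1, 2, and 3; the helper name is illustrative.

```python
def inside(value, interval):
    """True when value lies within the closed interval (low, high)."""
    low, high = interval
    return low <= value <= high

ci_beta1_model1 = (0.00, 0.07)   # 95% CI for the experience slope, Model 1
beta1_model2 = 0.04              # experience slope after adding gender
beta1_model3 = -0.82             # experience slope after adding age

gender_is_confounder = not inside(beta1_model2, ci_beta1_model1)
age_is_confounder = not inside(beta1_model3, ci_beta1_model1)

print("gender flagged as confounder:", gender_is_confounder)
print("age flagged as confounder:", age_is_confounder)
```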

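The coefficient of determination, \(R^2\)
\(R^2\) measures the ability to predict Y using X.
Variability explained by X is \( SSM = \sum_i (\hat{y}_i - \bar{y})^2 \)
Total variability is \( SST = \sum_i (y_i - \bar{y})^2 \)
\(R^2\) is defined as
\[ R^2 = \frac{SSM}{SST} = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2} \]
It measures the proportion of total variability explained by the model.
In simple linear regression, \(R^2\) is the square of r, Pearson's correlation coefficient.
Recall: r is a rough way of evaluating the association between two continuous variables.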
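A brief sketch of the definition on hypothetical data: it fits a one-predictor line, computes SSM and SST, and confirms numerically that SSM/SST matches the square of Pearson's r in this one-predictor setting. All data values and names are illustrative.

```python
import math

# Hypothetical data for one predictor and one outcome.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.0, 2.9, 4.5, 4.1, 6.2, 6.0]

n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)
syy = sum((y - y_bar) ** 2 for y in ys)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Least squares fit and fitted values.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
fitted = [b0 + b1 * x for x in xs]

ssm = sum((f - y_bar) ** 2 for f in fitted)   # variability explained by X
sst = syy                                     # total variability
r_squared = ssm / sst

r = sxy / math.sqrt(sxx * syy)                # Pearson's correlation

print(f"R^2 = {r_squared:.4f}, r^2 = {r ** 2:.4f}")
```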

Using \(R^2\) as a model selection criterion
The coefficient of determination, \(R^2\), evaluates the entire model.
\(R^2\) shows the proportion of the total variation in Y that has been predicted by this model.
  Model 1: 0.0076; 0.8% of variation explained
  Model 2: 0.05; 5% of variation explained
  Model 3: 0.20; 20% of variation explained
You want a model with large \(R^2\).

Adjusted \(R^2\): another model selection criterion
In both Models 2 and 3, the new predictor added a great deal to the model: \(R^2\) increased a lot.
More importantly, both new predictors were statistically significant.
\(R^2\) increases any time you include a new variable!
The adjusted \(R^2\) is adjusted for the number of X's in the model, so it only increases when helpful predictors are added.
Balance the need for:
  a simple model
  good predictive ability
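The lecture does not write out the adjustment. A commonly used definition is \( 1 - (1 - R^2)\frac{n - 1}{n - k - 1} \), and the sketch below assumes that form. The sample size and the \(R^2\) values are hypothetical, chosen only to show that a predictor which barely raises \(R^2\) lowers the adjusted value, while a helpful one raises it.

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 for a model with k predictors fit to n observations.

    Assumes the usual definition 1 - (1 - R^2) * (n - 1) / (n - k - 1).
    """
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

n = 100  # hypothetical sample size

base = adjusted_r2(0.200, n, k=2)      # two-predictor model
useless = adjusted_r2(0.201, n, k=3)   # third predictor adds almost nothing
helpful = adjusted_r2(0.300, n, k=3)   # third predictor adds a lot

print(f"base: {base:.4f}, useless added: {useless:.4f}, helpful added: {helpful:.4f}")
```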

Summary of Lecture 9
Centering
Hypothesis testing and confidence intervals
Regression by least squares
Interpreting regression coefficients
Adding a 2nd predictor to a model
  Binary X added: parallel lines
  Continuous X added: 3-dimensional graph
  For both, a new interpretation reflecting the new model
Is the new X a confounder? Compare \(\hat{\beta}_1\) across models.