Lecture 9: Linear regression: centering, hypothesis testing, multiple covariates, and confounding
Sandy Eckel
seckel@jhsph.edu
6 May 8

Recall: main idea of linear regression
Linear regression can be used to study an outcome as a linear function of a predictor.
Example: 60 cities in the US were evaluated for numerous characteristics, including:
Outcome: the percentage of the population that had low income
Predictor: median education level

Recall: Where is our intercept?
[Figure: % of population with low income versus median education]
\(\beta_0\) makes no sense! We don't observe any cities with median education = 0.

Need for centering
We will introduce the idea of centering using the city % low income example.
We can change X to fix this problem by a process called centering:
1. Pick a value of X (c) within the range of the data.
2. For each observation, generate \(X_{centered} = X - c\).
3. Redo the regression with \(X_{centered}\).
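The three centering steps can be sketched in R. This is a minimal illustration, not the lecture's own analysis: the six data points and all variable names are invented, and c = 12 is assumed as the centering value.

```r
# Hypothetical data, invented for illustration only:
# median education (years) and % low income for six cities.
educ       <- c(9.5, 10.2, 11.0, 11.4, 12.1, 12.3)
low_income <- c(20.1, 17.9, 15.2, 14.8, 12.0, 11.5)

c_value <- 12                      # step 1: a value of X within the range of the data
educ_centered <- educ - c_value    # step 2: X_centered = X - c

fit_raw      <- lm(low_income ~ educ)
fit_centered <- lm(low_income ~ educ_centered)   # step 3: redo the regression

coef(fit_raw)        # intercept refers to median education = 0
coef(fit_centered)   # intercept refers to median education = 12; slope is unchanged
```

Only the intercept moves: the centered intercept equals the uncentered intercept plus c times the slope.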
Centering
We'll use c = 12, a high school degree.
[Figure: % of population with low income versus median education]

Centering: New equation
\(\hat{Y} = \beta_0 + \beta_1 X_{centered} = \beta_0 + \beta_1 (X - 12) \approx 12 - 2.0(X - 12)\)
\(\beta_1\) has not changed.
\(\beta_0\) now corresponds to the average of Y when \(X_{centered} = 0\) or, equivalently, X = 12 (not X = 0).
Note: with X = 0, we have \(\hat{Y} \approx 12 - 2.0(0 - 12) = 12 + 24 = 36\).

Centering: Interpretation
\(\beta_0\) is the mean outcome for the reference group, or the group for which \(X_{centered} = X - 12 = 0\), that is, when X = 12.
Here, \(\hat\beta_0\) (about 12) is the average percent of the population that is disadvantaged for cities with a median education level of 12, the equivalent of a high school degree.
The interpretation of \(\beta_1\) has not changed.

Hypothesis testing and confidence intervals of regression coefficients
Drawing conclusions about population association using our data (a sample)
Hypothesis testing: null and alternative hypotheses, test statistic
Confidence interval

Hypothesis testing
\(\beta_0\): changes depending on centering of X, which doesn't affect the association of interest.
Real concern: is X associated with Y?
Assess by testing \(\beta_1\): does \(\beta_1 = 0\) in the population from which this sample was drawn?

Hypothesis testing
\(H_0: \beta_1 = 0\)
\(H_a: \beta_1 \neq 0\)
Test statistic: \(t_{obs} = \hat\beta_1 / SE(\hat\beta_1)\)
df = n - k - 1
n = number of observations
k = number of predictors (X's)

Hypothesis testing: Education example
\(H_0: \beta_1 = 0\)
Test statistic: \(t_{obs} = \hat\beta_1 / SE(\hat\beta_1) = -3.36\), with \(SE(\hat\beta_1) = 0.59\)
df = n - k - 1 = 60 - 1 - 1 = 58
n = number of observations = 60
k = number of predictors (X's) = 1
Calculate our p-value: 2*pt(-3.36, df=58)
p-value ≈ 0.001

Hypothesis testing: Education example, interpretation and conclusion
If there were no association between median education and percentage of disadvantaged citizens in the population, there would be about a 0.1% chance of observing data as or more extreme than ours.
This probability under the null is very small, so:
reject the null hypothesis
conclude that median education level and percentage of disadvantaged citizens are associated in the population
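The p-value step can be reproduced in R. A minimal sketch, assuming the education example's values (t statistic of -3.36, n = 60 cities, k = 1 predictor); the variable names are illustrative.

```r
t_obs <- -3.36        # observed test statistic from the education example
n <- 60               # number of observations (assumed: 60 cities)
k <- 1                # number of predictors

df <- n - k - 1                       # degrees of freedom: 58
p_value <- 2 * pt(-abs(t_obs), df)    # two-sided p-value from the t distribution

df
p_value

reject_null <- p_value < 0.05         # decision at the 5% level
reject_null
```

Taking `-abs(t_obs)` before doubling the lower tail gives the same two-sided p-value whether the statistic is negative or positive.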
Confidence interval for regression coefficients
We calculate the CI using the usual formula: \(\hat\beta_1 \pm t_{CR} \times SE(\hat\beta_1)\)
df of \(t_{CR}\) = n - k - 1
For the education example, the 95% CI for \(\beta_1\) is: \(-2.0 \pm 2.00 \times 0.59\) = (-3.2, -0.8)

Confidence interval: Education example, interpretation and conclusion
We are 95% confident that the true population decrease in percentage of low income citizens per additional year of median education is between 0.8 and 3.2.
Since this interval does not contain 0, we believe percentage of low income citizens and median education are associated among cities in the United States.

Multiple covariates and confounding

Dataset
Hourly wage information from 998 workers, along with information regarding age, gender, years of experience, etc.
We'll focus on predicting hourly wage with available information.
Outcome: hourly wage
Multiple predictors: age, gender, years of experience, etc.
Regression: Hourly wage vs. years of experience
Simple linear regression since only one covariate (years of experience)
[Figure: hourly wage versus years of experience]

How do we estimate the coefficients?
Use least squares.
For each person, their actual hourly wage (\(Y_i\)) and predicted hourly wage (\(\hat{Y}_i\)) are known.
\(\hat\varepsilon_i = Y_i - \hat{Y}_i = Y_i - (\hat\beta_0 + \hat\beta_1 X_i)\) is the residual or error.
The coefficient estimates are found by minimizing the sum of the squared errors:
\(\min \sum_{i=1}^{n} \left( Y_i - (\beta_0 + \beta_1 X_i) \right)^2\)
The coefficients are the least squares estimates.

Notes on regression analysis
Recall the regression line equation: \(\hat{Y} = \beta_0 + \beta_1 X\)
\(Y = \beta_0 + \beta_1 X\) is always true for any known point on the line.
\(Y = \beta_0 + \beta_1 X + \varepsilon\)

Model 1: years of experience
Model 1: Predict income by years of experience
\(\hat{Y} = \hat\beta_0 + \hat\beta_1 X\)
\(\hat{Y} = 8.38 + 0.04X\)
\(\hat\beta_0 = 8.38\), so the average hourly wage for someone with no experience at all is about $8.40.
\(\hat\beta_1 = 0.04\), so for every additional year of experience, the predicted hourly wage increases about 4 cents.
For 10 years of additional experience, the predicted hourly wage increases about 40 cents.
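The least squares idea can be checked numerically. A minimal R sketch on invented data (the seven observations and the helper `sse` are hypothetical, not from the wage dataset): it computes the estimates from the standard closed-form solution for one predictor and compares them with `lm()`.

```r
# Hypothetical data, invented for illustration only.
experience <- c(0, 2, 5, 10, 15, 20, 30)
wage       <- c(7.9, 8.8, 8.1, 9.4, 8.6, 9.9, 9.2)

# Closed-form least squares estimates for one predictor
b1 <- sum((experience - mean(experience)) * (wage - mean(wage))) /
      sum((experience - mean(experience))^2)
b0 <- mean(wage) - b1 * mean(experience)

# Sum of squared errors for any candidate pair of coefficients
sse <- function(beta0, beta1) sum((wage - (beta0 + beta1 * experience))^2)

fit <- lm(wage ~ experience)   # R's built-in least squares fit

c(b0, b1)
coef(fit)
sse(b0, b1)          # the minimum
sse(b0 + 0.1, b1)    # any other candidate gives a larger sum
```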
Should we center X?
0 years of experience is within the range of the data.
The average hourly wage corresponding to 0 years of experience makes sense.
No need to center X.

Model 2: years of experience, gender (0 = man, 1 = woman)
What happens if we also consider gender?
[Figure: hourly wage versus years of experience, showing men's hourly wage with fit_men and women's hourly wage with fit_women]

Model 2: for no experience
years of experience, gender (0 = man, 1 = woman)
\(\hat{Y} = \hat\beta_0 + \hat\beta_1(\text{Experience}) + \hat\beta_2(\text{Gender})\)
\(\hat{Y} = 9.27 + 0.04(\text{Experience}) - 2.20(\text{Gender})\)
For a man with no experience: \(\hat{Y} = 9.27 + 0.04(0) - 2.20(0)\) = $9.27
For a woman with no experience: \(\hat{Y} = 9.27 + 0.04(0) - 2.20(1)\) = $7.07

Model 2: for 10 years experience
years of experience, gender (0 = man, 1 = woman)
\(\hat{Y} = 9.27 + 0.04(\text{Experience}) - 2.20(\text{Gender})\)
For a man with 10 years of experience: \(\hat{Y} = 9.27 + 0.04(10) - 2.20(0)\) = $9.67
For a woman with 10 years of experience: \(\hat{Y} = 9.27 + 0.04(10) - 2.20(1)\) = $7.47
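The four predicted wages can be reproduced with a small R function. A sketch only: it assumes the fitted Model 2 coefficients are 9.27, 0.04 and -2.20, as the worked arithmetic on the slides implies, and the function name is hypothetical.

```r
# Assumed Model 2 coefficients: intercept 9.27, experience 0.04, gender -2.20.
# gender is coded 0 = man, 1 = woman.
predict_wage_model2 <- function(experience, gender) {
  9.27 + 0.04 * experience - 2.20 * gender
}

predict_wage_model2(0, 0)    # man, no experience
predict_wage_model2(0, 1)    # woman, no experience
predict_wage_model2(10, 0)   # man, 10 years of experience
predict_wage_model2(10, 1)   # woman, 10 years of experience

# The gap between men and women is the same at every experience level,
# which is why the two fitted lines are parallel.
predict_wage_model2(c(0, 10, 20), 0) - predict_wage_model2(c(0, 10, 20), 1)
```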
Model 2: for males
years of experience, gender (0 = man, 1 = woman)
\(\hat{Y} = \hat\beta_0 + \hat\beta_1(\text{Experience}) + \hat\beta_2(\text{Gender})\)
\(\hat{Y} = 9.27 + 0.04(\text{Experience}) - 2.20(\text{Gender})\)
For a man with no experience: \(\hat{Y} = 9.27 + 0.04(0) - 2.20(0)\) = $9.27
For a man with 10 years of experience: \(\hat{Y} = 9.27 + 0.04(10) - 2.20(0)\) = $9.67

Model 2: for females
years of experience, gender (0 = man, 1 = woman)
\(\hat{Y} = 9.27 + 0.04(\text{Experience}) - 2.20(\text{Gender})\)
For a woman with no experience: \(\hat{Y} = 9.27 + 0.04(0) - 2.20(1)\) = $7.07
For a woman with 10 years of experience: \(\hat{Y} = 9.27 + 0.04(10) - 2.20(1)\) = $7.47

Model 2: Interpretation
\(\hat{Y} = 9.27 + 0.04(\text{Experience}) - 2.20(\text{Gender})\)
\(\hat\beta_0 = 9.27\): the average hourly wage for a man with no experience at all is about $9.30.
\(\hat\beta_1 = 0.04\): for every additional year of experience, the predicted hourly wage increases about 4 cents for both men and women.
\(\hat\beta_2 = -2.20\): the expected hourly wage is $2.20 lower for women than it is for men at any experience level.

Confounding by gender? Model 1 vs. Model 2
Model 1: \(\hat{Y} = 8.38 + 0.04(\text{Experience})\)
Model 2: \(\hat{Y} = 9.27 + 0.04(\text{Experience}) - 2.20(\text{Gender})\)
95% CI for \(\beta_1\) in Model 1: (0.0?, 0.07), and \(\hat\beta_1\) from Model 2 is within this CI.
Gender is not a confounder.
The association between salary and experience does not change when we control for gender.
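The comparison used here (is the adjusted slope inside the unadjusted 95% CI?) can be sketched in R. The data are invented and deliberately balanced, so that experience and gender are uncorrelated in the sample; all names are illustrative.

```r
# Hypothetical balanced data, invented for illustration: each experience level
# appears once for men (gender = 0) and once for women (gender = 1).
experience <- rep(c(0, 5, 10, 15, 20), times = 2)
gender     <- rep(c(0, 1), each = 5)
wage       <- c(9.1, 9.6, 9.5, 10.2, 10.0, 7.0, 7.3, 7.6, 7.4, 8.1)

fit1 <- lm(wage ~ experience)            # unadjusted model
fit2 <- lm(wage ~ experience + gender)   # adjusted for gender

ci1 <- confint(fit1, "experience", level = 0.95)   # 95% CI for the unadjusted slope
b1_adjusted <- unname(coef(fit2)["experience"])    # slope after adjusting for gender

# TRUE means the adjusted slope lies within the unadjusted CI
inside_ci <- b1_adjusted >= ci1[1, 1] && b1_adjusted <= ci1[1, 2]
inside_ci
```

In this constructed example the slope for experience is identical in both fits, because the added variable is uncorrelated with experience in the sample.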
Confounding: the epidemiologic definition
C is a confounder of the relation between X and Y if:
[Diagram: Confounder C, Predictor X, Outcome Y]

Confounding: example
Smoking is a confounder of the relation between coffee consumption (X) and lung cancer (Y) since:
[Diagram: Smoking (C), Coffee consumption (X), Lung cancer (Y)]

Model 3: years of experience, age
Let's try a model that includes age instead of gender:
\(\hat{Y} = \hat\beta_0 + \hat\beta_1(\text{Experience}) + \hat\beta_2(\text{Age} - 40)\)
The relationship is harder to graph with two continuous predictors, since now the regression is in a 3-dimensional space.
We will learn AV plots in a few days.
Notice that age is centered at 40 years. Age ranged between 18 and 64 in this dataset.

Model 3: no experience
years of experience, age
\(\hat{Y} = \hat\beta_0 + \hat\beta_1(\text{Experience}) + \hat\beta_2(\text{Age} - 40)\)
\(\hat{Y} = {?}6.5 - 0.82(\text{Experience}) + 0.9(\text{Age} - 40)\)
For a 40-year-old with no experience: \(\hat{Y} = {?}6.5 - 0.82(0) + 0.9(40 - 40)\) = $?6.5
For a 41-year-old with no experience: \(\hat{Y} = {?}6.5 - 0.82(0) + 0.9(41 - 40)\) = $?7.4
Model 3: 10 years experience
years of experience, age
\(\hat{Y} = \hat\beta_0 + \hat\beta_1(\text{Experience}) + \hat\beta_2(\text{Age} - 40)\)
\(\hat{Y} = {?}6.5 - 0.82(\text{Experience}) + 0.9(\text{Age} - 40)\)
For a 40-year-old with 10 years of experience: \(\hat{Y} = {?}6.5 - 0.82(10) + 0.9(40 - 40)\) = $?8.3
For a 41-year-old with 10 years of experience: \(\hat{Y} = {?}6.5 - 0.82(10) + 0.9(41 - 40)\) = $?9.2

Model 3: 40 years of age
years of experience, age
For a 40-year-old with no experience: \(\hat{Y} = {?}6.5 - 0.82(0) + 0.9(40 - 40)\) = $?6.5
For a 40-year-old with 10 years of experience: \(\hat{Y} = {?}6.5 - 0.82(10) + 0.9(40 - 40)\) = $?8.3

Model 3: 41 years of age
years of experience, age
For a 41-year-old with no experience: \(\hat{Y} = {?}6.5 - 0.82(0) + 0.9(41 - 40)\) = $?7.4
For a 41-year-old with 10 years of experience: \(\hat{Y} = {?}6.5 - 0.82(10) + 0.9(41 - 40)\) = $?9.2

Model 3: Interpretation
\(\hat\beta_0 = {?}6.5\): the average hourly wage for a 40-year-old with no experience at all is about $?6.5.
\(\hat\beta_1 = -0.82\): for every additional year of experience, the predicted hourly wage decreases about 82 cents for two people of the same age (or adjusting for age).
\(\hat\beta_2 = 0.9\): for every additional year of age, the expected hourly wage increases about 90 cents for two people with the same amount of experience (or adjusting for experience).
Model 1 vs. Model 3
What happens when we include age?
Model 1: \(\hat{Y} = 8.38 + 0.04(\text{Experience})\)
Model 3: \(\hat{Y} = {?}6.5 - 0.82(\text{Experience}) + 0.9(\text{Age} - 40)\)
95% CI for \(\beta_1\) in Model 1: (0.0?, 0.07), and \(\hat\beta_1\) from Model 3 is outside this CI.
Age is a confounder!
When we adjust for age, the apparent effect of experience on wage changes.

The Coefficient of Determination, \(R^2\)
\(R^2\) measures the ability to predict Y using X.
Variability explained by X is \(SSM = \sum_i (\hat{y}_i - \bar{y})^2\)
Total variability is \(SST = \sum_i (y_i - \bar{y})^2\)
\(R^2\) is defined as \(R^2 = \frac{SSM}{SST} = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2}\)
Measures the proportion of total variability explained by the model.
\(R^2\) is the square of r, Pearson's correlation coefficient (in simple linear regression).
Recall: r is a rough way of evaluating the association between two continuous variables.

Using \(R^2\) as a model selection criterion
The coefficient of determination, \(R^2\), evaluates the entire model.
\(R^2\) shows the proportion of the total variation in Y that has been predicted by this model.
Model 1: 0.0076; 0.8% of variation explained
Model 2: 0.?5; ?5% of variation explained
Model 3: 0.??; ??% of variation explained
In both models 2 and 3, the new predictor added a great deal to the model: \(R^2\) increased a lot.
More importantly, both new predictors were statistically significant.

Adjusted \(R^2\): another model selection criterion
\(R^2\) increases any time you include a new variable!
The adjusted \(R^2\) is adjusted for the number of X's in the model, so it only increases when helpful predictors are added.
Balance the need for:
Simple model
Good predictive ability
You want a model with large \(R^2\).
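The definitions of \(R^2\) and adjusted \(R^2\) can be checked in R. A minimal sketch on invented data: `x`, `y` and the extra predictor `z` are hypothetical, and the adjusted \(R^2\) line uses the standard textbook formula, which the slides do not spell out.

```r
# Hypothetical data, invented for illustration only.
x <- c(1, 2, 3, 4, 5, 6, 7, 8)
y <- c(2.1, 2.9, 3.2, 4.8, 5.1, 5.7, 7.4, 7.9)
z <- c(1, -1, 1, -1, -1, 1, -1, 1)   # an arbitrary extra predictor

fit <- lm(y ~ x)
ssm <- sum((fitted(fit) - mean(y))^2)   # variability explained by the model
sst <- sum((y - mean(y))^2)             # total variability
r2  <- ssm / sst

n <- length(y)
k <- 1                                  # number of predictors in fit
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - k - 1)   # standard adjusted R^2 formula

# Adding a predictor can never lower R^2; adjusted R^2 can go either way.
fit_z    <- lm(y ~ x + z)
r2_z     <- summary(fit_z)$r.squared
adj_r2_z <- summary(fit_z)$adj.r.squared

c(r2 = r2, adj_r2 = adj_r2, r2_z = r2_z, adj_r2_z = adj_r2_z)
```

With one predictor, `r2` also equals `cor(x, y)^2`, matching the statement that \(R^2\) is the square of Pearson's r.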
Summary of Lecture 9
Centering
Hypothesis testing and confidence intervals
Regression by least squares
Interpreting regression coefficients
Adding a 2nd predictor to a model
Binary X added: parallel lines
Continuous X added: 3-dimensional graph
For both, new interpretation reflecting new model
Is the new X a confounder? Compare \(\hat\beta_1\) across models.