Lecture 6: Introduction to Linear Regression

Ani Manichaikul, amancha@jhsph.edu, 24 April 2007

Linear regression: main idea
Linear regression can be used to study an outcome as a linear function of a predictor.
Example: 60 cities in the US were evaluated for numerous characteristics, including:
- the percentage of the population that was disadvantaged
- median education level

Binary education variable
[Figure: % of population with income < $3 (y-axis) for the Low Education and High Education groups]

Linear regression vs. ANOVA
These means could be compared by a t-test or ANOVA.
Mean in low education group: 15.7%
Mean in high education group: 13.2%
Regression provides a unified equation:
\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_i = 15.7 - 2.5 X_i \]
where \(X_i = 1\) for high education and \(X_i = 0\) for low education (X is a dummy variable or indicator variable that designates group).

Interpreting the model
\(\hat{Y}_i\) is the predicted mean of the outcome for \(X_i\), that observation's value for X.
\(X_i = 0\) (Low education): \(\hat{Y}_i = 15.7 - 2.5(0) = 15.7\)
\(X_i = 1\) (High education): \(\hat{Y}_i = 15.7 - 2.5(1) = 13.2\)

Interpretation
\(\beta_0\) is the mean outcome for the reference group, or the group for which X = 0.
Here, \(\beta_0\) is the average percent of the population that is disadvantaged for cities with low education.

Interpretation
\(\beta_1\) is the difference in the mean outcome between the two groups (when X = 1 vs. when X = 0).
Here, \(\beta_1\) is the difference in the average percent of the population that is disadvantaged for cities with high education compared to cities with low education.
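A small sketch of why the dummy-variable regression reproduces the two group means. The data values and the helper name `fit_line` are hypothetical (they are not the lecture's 60 cities); the point is only that the fitted intercept equals the mean of the X = 0 group and the fitted slope equals the difference in means.

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope

# Hypothetical percent-disadvantaged values, for illustration only
low = [16.0, 14.5, 17.0, 15.3]    # cities with X = 0 (low education)
high = [13.0, 12.5, 14.1, 13.2]   # cities with X = 1 (high education)

x = [0] * len(low) + [1] * len(high)   # dummy variable
y = low + high

b0, b1 = fit_line(x, y)
mean_low = sum(low) / len(low)
mean_high = sum(high) / len(high)

print(f"intercept = {b0:.3f}, mean of low group = {mean_low:.3f}")
print(f"slope = {b1:.3f}, difference in means = {mean_high - mean_low:.3f}")
```

With a binary X, a t-test on the slope therefore asks the same question as a two-group comparison of means.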

Why use linear regression?
Linear regression is very powerful. It can be used for many things:
- Binary X
- Continuous X
- Categorical X
- Adjustment for confounding
- Interaction
- Curved relationships between X and Y

Regression Analysis
A regression is a description of a response measure, Y, the dependent variable, as a function of an explanatory variable, X, the independent variable.
Goal: prediction or estimation of the value of one variable, Y, based on the value of the other variable, X.

Regression Analysis
A simple relationship between the two variables is a linear relationship (straight line relationship).
Other names: linear, simple linear, least squares regression

Galton's Example
Records of heights of family groups.
Really tall fathers tend on average to have tall sons, but not quite as tall as the really tall fathers.
There is a regression of a son's height toward the average height for sons.

Galton's Example
[Figure: Regression of Son's Stature on Father's; fitted line E(Y) = 33.73 + 0.516X; x-axis: Father's Height (inches); y-axis: Son's Height (inches)]

Regression Analysis: Population Model
Probability Model: independent responses \(y_1, y_2, \ldots, y_n\) are sampled from \(Y_i \sim N(\mu_i, \sigma^2)\)
Systematic Model: \(\mu_i = E(y_i \mid x_i) = \beta_0 + \beta_1 x_i\)
where: \(\beta_0\) = intercept, \(\beta_1\) = slope

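Another way to write the model
Systematic: \(y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\)
Probability: \(\varepsilon_i \sim N(0, \sigma^2)\)
The response, \(Y_i\), is a linear function of \(X_i\) plus some random, normally distributed error, \(\varepsilon_i\).
Data = Signal + noise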

Geometric Interpretation
[Figure: geometric interpretation of the regression model]

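Model
1) \(Y_i \sim N(\mu_i, \sigma^2)\)
2) \(\mu_i = E(y_i \mid x_i) = \beta_0 + \beta_1 x_i\)
OR
1) \(y_i = \beta_0 + \beta_1 x_i + \varepsilon_i\)
2) \(\varepsilon_i \sim N(0, \sigma^2)\)
where: \(\beta_0\) = intercept, \(\beta_1\) = slope
The response, \(Y_i\), is a linear function of \(X_i\) plus some random, normally distributed error, \(\varepsilon_i\).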

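Interpretation of Coefficients
Mean Model: \(\mu = E(y \mid x) = \beta_0 + \beta_1 x\)
\(\beta_0\) = expected response when X = 0
Since: \(E(y \mid x=0) = \beta_0 + \beta_1(0) = \beta_0\)
\(\beta_1\) = change in expected response per 1 unit increase in X
Since: \(E(y \mid x+1) = \beta_0 + \beta_1(x+1)\)
And: \(E(y \mid x) = \beta_0 + \beta_1 x\)
\(\Delta E(y)\) from x to x+1 = \(\beta_1\)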

From Galton's Example
\(E(Y \mid x) = \beta_0 + \beta_1 x\)
\(E(Y \mid x) = 33.7 + 0.52x\)
where: Y = son's height (inches), x = father's height (inches)
Expected son's height = 33.7 inches when father's height is 0 inches.
Expected difference in heights for sons whose fathers' heights differ by one inch = 0.52 inches.
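As a quick check on the fitted line, the sketch below evaluates E(Y | x) = 33.7 + 0.52x at a few father's heights. The function name and the heights chosen are illustrative assumptions; only the two coefficients come from the slide.

```python
def expected_son_height(father_height_in):
    """Expected son's height (inches) from the fitted line 33.7 + 0.52x."""
    return 33.7 + 0.52 * father_height_in

# Illustrative father's heights in inches (chosen for this sketch)
for father in (64, 68, 72):
    son = expected_son_height(father)
    print(f"father {father} in -> expected son {son:.2f} in")

# One extra inch of father's height shifts the expected son's height by the slope
one_inch_difference = expected_son_height(73) - expected_son_height(72)
```

The 72-inch father has an expected son's height below 72 inches, and the 64-inch father an expected son's height above 64 inches, which is the regression toward the average that Galton described.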

City/Education Example
[Figure: scatterplot; y-axis: % of population with income < $3; x-axis: Median education]

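Model
\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_i = 36.2 - 2.0 X_i \]
where \(X_i\) = the median education level in city i
when X = 0: \(\hat{Y} = 36.2 - 2.0(0) = 36.2\)
when X = 1: \(\hat{Y} = 36.2 - 2.0(1) = 34.2\)
when X = 2: \(\hat{Y} = 36.2 - 2.0(2) = 32.2\)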

Interpretation
\(\beta_0\) is the mean outcome for the reference group, or the group for which X = 0.
Here, \(\beta_0\) is the average percent of the population that is disadvantaged for cities with median education level of 0.

Interpretation
\(\beta_1\) is the difference in the mean outcome for a one unit change in X.
Here, \(\beta_1\) is the difference in the average percent of the population that is disadvantaged between two cities, when the first city has a median education level 1 year higher than the second city.

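Finding \(\beta\)'s from the graph
\(\beta_0\) is the Y-intercept of the line, or the average value of Y when X = 0.
\(\beta_1\) is the slope of the line, or the average change in Y per unit change in X.
y = mx + b: b = \(\beta_0\), m = \(\beta_1\)
\[ \hat\beta_1 = \frac{y_2 - y_1}{x_2 - x_1} \]
Notation: \(\beta_1\) represents the true slope (in the population); \(b_1\) and \(\hat\beta_1\) are sample estimates of the slope.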

Where is our intercept?
[Figure: city scatterplot with the x-axis (Median education) extended down to 0; y-axis: % of population with income < $3]

Centering
\(\beta_0\) makes no sense!
We can change X to fix this problem by a process called centering.
1. Pick a value of X (c) within the range of the data
2. For each observation, generate X_centered = X - c
3. Redo the regression with X_centered

We'll use c = 12, a high school degree
[Figure: city scatterplot; y-axis: % of population with income < $3; x-axis: Median education]

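New equation
\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1 (X_i - 12) = 12.2 - 2.0 (X_i - 12) \]
\(\beta_1\) has not changed.
\(\beta_0\) now corresponds to X = 12, not X = 0.
Note: with X = 0, we have \(12.2 - 2.0(0 - 12) = 12.2 + 24 = 36.2\)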

Interpretation
\(\beta_0\) is the mean outcome for the reference group, or the group for which X - 12 = 0, or when X = 12.
Here, \(\beta_0\) (12.2%) is the average percent of the population that is disadvantaged for cities with a median education level of 12, the equivalent of a high school degree.
The interpretation of \(\beta_1\) has not changed.
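The three centering steps can be sketched as follows. The (education, percent disadvantaged) pairs and the helper `fit_line` are hypothetical; the sketch only illustrates that refitting with X - c leaves the slope unchanged and moves the intercept to the fitted value at X = c. The last line repeats the lecture's own arithmetic, 36.2 - 2.0(12).

```python
def fit_line(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

# Hypothetical city data, for illustration only
education = [9.5, 10.0, 11.0, 12.0, 12.5, 13.0]
disadvantaged = [17.5, 16.0, 14.5, 12.0, 11.5, 10.0]

c = 12                                            # step 1: pick c within the data range
education_centered = [x - c for x in education]   # step 2: X_centered = X - c

b0, b1 = fit_line(education, disadvantaged)
b0_c, b1_c = fit_line(education_centered, disadvantaged)   # step 3: refit

print(f"uncentered: intercept {b0:.2f}, slope {b1:.2f}")
print(f"centered:   intercept {b0_c:.2f}, slope {b1_c:.2f}")

# Same relationship using the coefficients reported in the lecture
lecture_centered_intercept = 36.2 - 2.0 * 12
```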

Centering in Galton Example
Make 6 feet (72 inch) fathers the reference group.
Create a new X variable, X*, by subtracting 72 from our old X variable: X* = X - 72
Then: \(E(Y \mid x^*) = \beta_0 + \beta_1 x^* = \beta_0 + \beta_1 (x - 72)\)
So, \(\beta_0\) = expected response when X = 72, since \(E(Y \mid x=72) = \beta_0 + \beta_1(72 - 72) = \beta_0\)
Center X's whenever interpretations call for it!

Population Comparisons
\(\beta_0\): changes depending on centering of X, which doesn't affect the association of interest
Real concern: is X associated with Y?
Assess by testing \(\beta_1\): Does \(\beta_1 = 0\) in the population from which this sample was drawn?
- Hypothesis testing
- Confidence interval

Hypothesis testing
\(H_0: \beta_1 = 0\)
Test statistic:
\[ t_{obs} = \frac{\hat\beta_1}{SE(\hat\beta_1)} \]
df = n - k - 1
n = number of observations
k = number of predictors (X's)

Hypothesis testing for education example
\(H_0: \beta_1 = 0\)
Test statistic: \(t_{obs} = -2.0 / 0.59 = -3.36\)
df = n - k - 1 = 60 - 1 - 1 = 58
n = number of observations = 60
k = number of predictors (X's) = 1
p < 2*(1 - 0.995)
p < 0.01

Interpretation and conclusion
If there were no association between median education and percentage of disadvantaged citizens in the population, there would be less than a 1% chance of observing data as or more extreme than ours.
The null probability is very small, so:
- reject the null hypothesis
- conclude that median education level and percentage of disadvantaged citizens are associated in the population
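The test can be retraced with the rounded numbers from the slides. In this sketch the critical values are approximate t-table entries for 58 degrees of freedom and are an assumption of the example, not something the lecture computes. Note that the rounded inputs give about -3.39 rather than the reported -3.36, presumably because the lecture used unrounded estimates.

```python
beta1_hat = -2.0    # estimated slope from the education example
se_beta1 = 0.59     # its standard error
n, k = 60, 1        # observations and predictors

t_obs = beta1_hat / se_beta1
df = n - k - 1

# Approximate two-sided critical values for df = 58 (assumed table values)
t_crit_05 = 2.002   # alpha = 0.05
t_crit_01 = 2.663   # alpha = 0.01

reject_at_05 = abs(t_obs) > t_crit_05
reject_at_01 = abs(t_obs) > t_crit_01

print(f"t_obs = {t_obs:.2f} on {df} df")
print(f"reject H0 at 0.05: {reject_at_05}; at 0.01: {reject_at_01}")
```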

Confidence Interval
No need to specify a hypothesis:
\[ \hat\beta_1 \pm t_{cr}\, SE(\hat\beta_1) = -2.0 \pm 2.002(0.59) = (-3.2, -0.8) \]

Interpretation and conclusion
We are 95% confident that the true population decrease in percentage of disadvantaged citizens per additional year of median education is between 0.8 and 3.2.
Since this interval does not contain 0, we believe percentage of disadvantaged citizens and median education are associated among cities in the United States.
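A short sketch of the interval arithmetic, using the slope, standard error and critical value quoted for the education example (the variable names are illustrative):

```python
beta1_hat = -2.0   # estimated slope
se_beta1 = 0.59    # standard error of the slope
t_crit = 2.002     # critical t value for a 95% interval with 58 df

margin = t_crit * se_beta1
lower = beta1_hat - margin
upper = beta1_hat + margin

# The interval supports an association only if it excludes 0
contains_zero = lower <= 0 <= upper

print(f"95% CI for the slope: ({lower:.1f}, {upper:.1f})")
print(f"contains 0: {contains_zero}")
```

Because the whole interval lies below 0, the interval and the hypothesis test lead to the same conclusion here.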

So far
Linear regression is used for continuous outcome variables
\(\beta_0\): mean outcome when X = 0
Binary X = dummy variable for group
- \(\beta_1\): mean difference in outcome between groups
Continuous X
- \(\beta_1\): mean difference in outcome corresponding to a 1-unit increase in X
Center X to give meaning to \(\beta_0\)
Test \(\beta_1 = 0\) in the population

Linear Regression: Multiple covariates and confounding

Dataset
Hourly wage information from 9,98 workers, along with information regarding age, gender, years of experience, etc.
We'll focus on predicting hourly wage with available information.

Regression: Hourly wage vs. Years of experience
[Figure: scatterplot; y-axis: Hourly Wage; x-axis: Years of Experience]

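What are the parameters?
For each person, their actual hourly wage (\(Y_i\)) and predicted hourly wage (\(\hat{Y}_i\)) are known.
\(Y_i - \hat{Y}_i = Y_i - (\hat\beta_0 + \hat\beta_1 X_i) = \hat\varepsilon_i\) is the residual or error.
The parameters are found by minimizing the sum of the squared error:
\[ \min_{\beta_0, \beta_1} \sum_{i=1}^{n} \left( Y_i - (\beta_0 + \beta_1 X_i) \right)^2 \]
The parameters are the least squares estimates.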

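Notes
\(\hat{Y} = \hat\beta_0 + \hat\beta_1 X\) for any known point on the line
\(\bar{Y} = \hat\beta_0 + \hat\beta_1 \bar{X}\) is always true
The regression line equation: \(\hat{Y} = \hat\beta_0 + \hat\beta_1 X\)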
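A minimal sketch of least squares for one predictor, using the standard closed-form solution. The (experience, wage) pairs and the function names are hypothetical, not the lecture's wage dataset. It checks three properties of the least squares line: nudging either estimate increases the sum of squared errors, the line passes through the point of means, and the residuals sum to zero.

```python
def sse(b0, b1, x, y):
    """Sum of squared errors for the line b0 + b1 * x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

def least_squares(x, y):
    """Closed-form least squares estimates (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1

# Hypothetical data, for illustration only
experience = [0, 2, 5, 10, 20, 30]
wage = [8.0, 9.1, 8.2, 9.5, 8.9, 10.0]

b0, b1 = least_squares(experience, wage)
best_sse = sse(b0, b1, experience, wage)

x_bar = sum(experience) / len(experience)
y_bar = sum(wage) / len(wage)
residuals = [y - (b0 + b1 * x) for x, y in zip(experience, wage)]

print(f"b0 = {b0:.3f}, b1 = {b1:.3f}, SSE = {best_sse:.3f}")
print(f"fitted value at x_bar: {b0 + b1 * x_bar:.3f}, y_bar: {y_bar:.3f}")
```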

Model 1
Model 1: Predict income by years of experience
\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1 X_i = 8.38 + 0.04 X_i \]
\(\hat\beta_0 = 8.38\), so the average hourly wage for someone with no experience at all is about $8.40.
\(\hat\beta_1 = 0.04\), so for every additional year of experience, the predicted hourly wage increases about 4 cents. For 10 years of additional experience, the predicted hourly wage increases about 40 cents.

Should we center X?
0 years of experience is within the range of the data.
The average hourly wage corresponding to 0 years of experience makes sense.
No need to center X.

What happens if we also consider gender? (Model 2)
[Figure: Hourly Wage vs. Years of Experience, showing men's and women's hourly wages with fitted lines fit2_men and fit2_women]

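Model 2: Gender effect, no experience
\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1(\text{Experience}_i) + \hat\beta_2(\text{Gender}_i) = 9.27 + 0.04(\text{Experience}_i) - 2.20(\text{Gender}_i) \]
(Gender = 0 for men, 1 for women)
For a man with no experience: \(9.27 + 0.04(0) - 2.20(0) = \hat\beta_0\) = $9.27
For a woman with no experience: \(9.27 + 0.04(0) - 2.20(1) = \hat\beta_0 + \hat\beta_2\) = $7.07

Model 2: Gender effect, 10 years experience
\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1(\text{Experience}_i) + \hat\beta_2(\text{Gender}_i) = 9.27 + 0.04(\text{Experience}_i) - 2.20(\text{Gender}_i) \]
For a man with 10 years of experience: \(9.27 + 0.04(10) - 2.20(0) = \hat\beta_0 + \hat\beta_1(10)\) = $9.67
For a woman with 10 years of experience: \(9.27 + 0.04(10) - 2.20(1) = \hat\beta_0 + \hat\beta_1(10) + \hat\beta_2(1)\) = $7.47

Model 2: Experience effect, males
\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1(\text{Experience}_i) + \hat\beta_2(\text{Gender}_i) = 9.27 + 0.04(\text{Experience}_i) - 2.20(\text{Gender}_i) \]
For a man with no experience: \(9.27 + 0.04(0) - 2.20(0) = \hat\beta_0\) = $9.27
For a man with 10 years of experience: \(9.27 + 0.04(10) - 2.20(0) = \hat\beta_0 + \hat\beta_1(10)\) = $9.67

Model 2: Experience effect, females
\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1(\text{Experience}_i) + \hat\beta_2(\text{Gender}_i) = 9.27 + 0.04(\text{Experience}_i) - 2.20(\text{Gender}_i) \]
For a woman with no experience: \(9.27 + 0.04(0) - 2.20(1) = \hat\beta_0 + \hat\beta_2\) = $7.07
For a woman with 10 years of experience: \(9.27 + 0.04(10) - 2.20(1) = \hat\beta_0 + \hat\beta_1(10) + \hat\beta_2\) = $7.47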

Interpretation: Model 2
\(\hat\beta_0 = 9.27\): the average hourly wage for a man with no experience at all is about $9.30.
\(\hat\beta_1 = 0.04\): for every additional year of experience, the predicted hourly wage increases about 4 cents for both men and women.
\(\hat\beta_2 = -2.20\): the expected hourly wage is $2.20 lower for women than it is for men at any experience level.
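The four Model 2 predictions can be reproduced with a small function. The function name and the 0/1 coding argument are illustrative; the coefficients are the ones reported for Model 2, with Gender coded 1 for women as implied by the worked values.

```python
def predict_wage_model2(experience_years, is_woman):
    """Predicted hourly wage under Model 2; is_woman is 1 for women, 0 for men."""
    return 9.27 + 0.04 * experience_years - 2.20 * is_woman

for label, years, is_woman in [
    ("man, no experience", 0, 0),
    ("woman, no experience", 0, 1),
    ("man, 10 years", 10, 0),
    ("woman, 10 years", 10, 1),
]:
    print(f"{label}: ${predict_wage_model2(years, is_woman):.2f}")

# The gender gap is the same at every experience level (two parallel lines)
gap_at_0 = predict_wage_model2(0, 0) - predict_wage_model2(0, 1)
gap_at_10 = predict_wage_model2(10, 0) - predict_wage_model2(10, 1)
```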

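Model 1 vs. Model 2
Model 1: \(\hat{Y} = 8.38 + 0.04(\text{Experience})\)
Model 2: \(\hat{Y} = 9.27 + 0.04(\text{Experience}) - 2.20(\text{Gender})\)
95% CI for \(\beta_1\) in Model 1: (0.01, 0.07), and \(\hat\beta_1\) from Model 2 is within this CI
Gender is not a confounder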

What happens if we consider age, instead? (Model 3)
\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1(\text{Experience}_i) + \hat\beta_2(\text{Age}_i - 40) \]
The relationship is harder to graph with two continuous predictors, since now the regression is in a 3-dimensional space.
Notice that age is centered at 40 years. Age ranged between 18 and 64 in this dataset.

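Model 3: Age effect, no experience
\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1(\text{Experience}_i) + \hat\beta_2(\text{Age}_i - 40) = 26.50 - 0.82(\text{Experience}_i) + 0.92(\text{Age}_i - 40) \]
For a 40-year-old with no experience: \(26.50 - 0.82(0) + 0.92(40 - 40) = \hat\beta_0\) = $26.50
For a 41-year-old with no experience: \(26.50 - 0.82(0) + 0.92(41 - 40) = \hat\beta_0 + \hat\beta_2\) = $27.42

Model 3: Age effect, 10 years experience
\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1(\text{Experience}_i) + \hat\beta_2(\text{Age}_i - 40) = 26.50 - 0.82(\text{Experience}_i) + 0.92(\text{Age}_i - 40) \]
For a 40-year-old with 10 years of experience: \(26.50 - 0.82(10) + 0.92(40 - 40) = \hat\beta_0 + \hat\beta_1(10)\) = $18.30
For a 41-year-old with 10 years of experience: \(26.50 - 0.82(10) + 0.92(41 - 40) = \hat\beta_0 + \hat\beta_1(10) + \hat\beta_2\) = $19.22

Model 3: Experience effect, 40 year old
\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1(\text{Experience}_i) + \hat\beta_2(\text{Age}_i - 40) = 26.50 - 0.82(\text{Experience}_i) + 0.92(\text{Age}_i - 40) \]
For a 40-year-old with no experience: \(26.50 - 0.82(0) + 0.92(40 - 40) = \hat\beta_0\) = $26.50
For a 40-year-old with 10 years of experience: \(26.50 - 0.82(10) + 0.92(40 - 40) = \hat\beta_0 + \hat\beta_1(10)\) = $18.30

Model 3: Experience effect, 41 year old
\[ \hat{Y}_i = \hat\beta_0 + \hat\beta_1(\text{Experience}_i) + \hat\beta_2(\text{Age}_i - 40) = 26.50 - 0.82(\text{Experience}_i) + 0.92(\text{Age}_i - 40) \]
For a 41-year-old with no experience: \(26.50 - 0.82(0) + 0.92(41 - 40) = \hat\beta_0 + \hat\beta_2\) = $27.42
For a 41-year-old with 10 years of experience: \(26.50 - 0.82(10) + 0.92(41 - 40) = \hat\beta_0 + \hat\beta_1(10) + \hat\beta_2\) = $19.22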

Interpretation: Model 3
\(\hat\beta_0 = 26.50\): the average hourly wage for a 40-year-old with no experience at all is about $26.50.
\(\hat\beta_1 = -0.82\): for every additional year of experience, the predicted hourly wage decreases about 82 cents for two people of the same age (or "adjusting for age").
\(\hat\beta_2 = 0.92\): for every additional year of age, the expected hourly wage increases about 92 cents for two people with the same amount of experience (or "adjusting for experience").
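The Model 3 predictions, with age centered at 40, can be reproduced in the same way. The function name is illustrative; the coefficients are those reported for Model 3.

```python
def predict_wage_model3(experience_years, age_years):
    """Predicted hourly wage under Model 3 (age centered at 40 years)."""
    return 26.50 - 0.82 * experience_years + 0.92 * (age_years - 40)

for age, years in [(40, 0), (41, 0), (40, 10), (41, 10)]:
    wage = predict_wage_model3(years, age)
    print(f"age {age}, {years} years of experience: ${wage:.2f}")

# Holding age fixed, one more year of experience changes the prediction by the slope
experience_effect = predict_wage_model3(11, 40) - predict_wage_model3(10, 40)
```

Comparing two people of the same age isolates the experience coefficient, which is what "adjusting for age" means in the interpretation above.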

Model 1 vs. Model 3
Model 1: \(\hat{Y} = 8.38 + 0.04(\text{Experience})\)
Model 3: \(\hat{Y} = 26.50 - 0.82(\text{Experience}) + 0.92(\text{Age} - 40)\)
95% CI for \(\beta_1\) in Model 1: (0.01, 0.07), and \(\hat\beta_1\) from Model 3 is outside this CI
Age is a confounder. When we adjust for age, the apparent effect of experience on wage changes.
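The comparison used in the two "Model 1 vs." slides amounts to a simple check: does the adjusted experience coefficient still fall inside the 95% CI from the unadjusted model? The sketch below applies that rule of thumb to the reported numbers; the function and variable names are illustrative.

```python
def within_interval(value, interval):
    """True if value lies inside the closed interval (low, high)."""
    low, high = interval
    return low <= value <= high

ci_beta1_model1 = (0.01, 0.07)   # 95% CI for the experience slope in Model 1

adjusted_slopes = {
    "gender (Model 2)": 0.04,    # experience slope after adding gender
    "age (Model 3)": -0.82,      # experience slope after adding age
}

for added_variable, slope in adjusted_slopes.items():
    inside = within_interval(slope, ci_beta1_model1)
    verdict = "not a confounder" if inside else "a confounder"
    print(f"{added_variable}: slope {slope:+.2f} -> {verdict}")
```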

The Coefficient of Determination
\(R^2\) is the coefficient of determination
\(R^2\) measures the ability to predict Y using X
Variability explained by X is \(SSM = \sum_i (\hat{y}_i - \bar{y})^2\)
Total variability is \(SST = \sum_i (y_i - \bar{y})^2\)

The Coefficient of Determination
\(R^2\) is defined as
\[ R^2 = \frac{SSM}{SST} = \frac{\sum_i (\hat{y}_i - \bar{y})^2}{\sum_i (y_i - \bar{y})^2} \]
Measures the proportion of total variability explained by the model

The Coefficient of Determination
\(R^2\) is the square of r, Pearson's correlation coefficient.
r is a rough way of evaluating the association between two continuous variables.
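A sketch that computes SSM, SST and R² for a one-predictor fit and compares R² with the squared Pearson correlation. The data values are hypothetical and the variable names are illustrative.

```python
import math

# Hypothetical data, for illustration only
x = [1, 2, 3, 4, 5, 6]
y = [2.1, 2.9, 4.2, 4.8, 6.1, 6.8]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
syy = sum((yi - y_bar) ** 2 for yi in y)

# Least squares fit and fitted values
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * xi for xi in x]

ssm = sum((yh - y_bar) ** 2 for yh in y_hat)   # variability explained by X
sst = syy                                      # total variability
r_squared = ssm / sst

r = sxy / math.sqrt(sxx * syy)                 # Pearson's correlation coefficient

print(f"R^2 = {r_squared:.4f}, r^2 = {r ** 2:.4f}")
```

The equality of R² and r² holds for a single predictor; with several predictors R² is still SSM/SST but is no longer the square of one pairwise correlation.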

So, what is \(R^2\)?
The coefficient of determination, \(R^2\), evaluates the entire model.
\(R^2\) shows the proportion of the total variation in Y that has been predicted by this model.
Model 1: .76; .8% of variation explained
Model 2: .5; 5% of variation explained
Model 3: .2; 2% of variation explained

What is the adjusted \(R^2\)?
In both models 2 and 3, the new predictor added a great deal to the model:
- \(R^2\) increased a lot
- More importantly, both new predictors were statistically significant
\(R^2\) always goes up!
The adjusted \(R^2\) is adjusted for the number of X's in the model, so it only goes up when helpful predictors are added.
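The lecture does not give a formula for the adjusted R². The sketch below assumes the usual definition, 1 - (1 - R²)(n - 1)/(n - k - 1), with n observations and k predictors. The R² values are made up to show the behavior and are not the wage models' results.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R^2 for n observations and k predictors (standard definition)."""
    return 1 - (1 - r_squared) * (n - 1) / (n - k - 1)

n = 60  # hypothetical sample size

one_predictor = adjusted_r_squared(0.200, n, k=1)
# Adding an unhelpful predictor barely raises R^2, so adjusted R^2 falls
unhelpful_added = adjusted_r_squared(0.201, n, k=2)
# Adding a helpful predictor raises R^2 enough that adjusted R^2 rises
helpful_added = adjusted_r_squared(0.300, n, k=2)

print(f"one predictor:         {one_predictor:.4f}")
print(f"plus unhelpful second: {unhelpful_added:.4f}")
print(f"plus helpful second:   {helpful_added:.4f}")
```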

Summary
Regression by least squares
Interpreting regression coefficients
Adding a 2nd predictor to a model
- Binary X added: 2 parallel lines
- Continuous X added: 3-dimensional graph
- for both, new interpretation reflecting new model
Is the new X a confounder? Compare \(\hat\beta_1\) across models