Lecture 9: Interactions, Quadratic Terms and Splines
Ani Manichaikul
amancha@jhsph.edu
April 2007

Reminder: Nested models
- Parent model contains one set of variables
- Extended model adds one or more new variables to the parent model
  - one variable added: compare models with a t test
  - two or more variables added: compare models with an F test
- Return to the example of wage versus experience

Effect Modification
- The phenomenon in which the relationship between the primary predictor and the outcome varies across levels of another predictor
- We say the other predictor modifies the effect of the primary predictor on the outcome
- In linear regression, coded by including an interaction term between the primary predictor and the other predictor

Model 1
\[ E[\text{Wage}] = \hat\beta_0 + \hat\beta_1(\text{Experience}) + \hat\beta_2(\text{Gender}) \]
This model allows the average wage to differ for men and women, but the difference in average wage between men and women is always the same regardless of experience level.
Model 1: output

      Source |       SS         df       MS            Number of obs =     534
-------------+--------------------------------         F(  2,   531) =   61.62
       Model |  2651.49936       2   1325.74968        Prob > F      =  0.0000
    Residual |  11425.1992     531   21.5163827        R-squared     =  0.1884
-------------+--------------------------------         Adj R-squared =  0.1853
       Total |  14076.6985     533   26.4103162        Root MSE      =  4.6386

      wagehr |    Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+--------------------------------------------------------------
     educyrs |    0.751      0.077     9.78    0.000      0.600      0.902
      gender |   -2.124      0.403    -5.27    0.000     -2.915     -1.333
       _cons |    0.218      1.036     0.21    0.834     -1.818      2.254

Model 2: Creating the interaction variable
- gender: 0 for men, 1 for women
- gender*experience = 0*experience = 0 for men
- gender*experience = 1*experience = experience for women

Model 2
\[ E[\text{Wage}] = \hat\beta_0 + \hat\beta_1(\text{Experience}) + \hat\beta_2(\text{Gender}) + \hat\beta_3(\text{Gender} \times \text{Experience}) \]
What is the interaction variable?

Model 2: output

. generate gender_educ = gender*educ
. reg wagehr educyrs gender gender_educ

      Source |       SS         df       MS            Number of obs =     534
-------------+--------------------------------         F(  3,   530) =   41.50
       Model |  2677.43224       3   892.477414        Prob > F      =  0.0000
    Residual |  11399.2663     530   21.5080496        R-squared     =  0.1902
-------------+--------------------------------         Adj R-squared =  0.1856
       Total |  14076.6985     533   26.4103162        Root MSE      =  4.6377

      wagehr |    Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+--------------------------------------------------------------
     educyrs |    0.683      0.099     6.92    0.000      0.489      0.877
      gender |   -4.370      2.085    -2.10    0.037     -8.466     -0.274
 gender_educ |    0.173      0.157     1.10    0.273     -0.136      0.481
       _cons |    1.105      1.314     0.84    0.401     -1.476      3.685

(Coefficient rows are shown to three decimal places.)
Model 2: Interpretation
Equation for men:
\[ E[\text{Wage}] = \hat\beta_0 + \hat\beta_1(\text{Experience}) = 1.10 + 0.68(\text{Experience}) \]
Equation for women:
\[ E[\text{Wage}] = (\hat\beta_0 + \hat\beta_2) + (\hat\beta_1 + \hat\beta_3)(\text{Experience}) = (1.10 - 4.37) + (0.68 + 0.17)(\text{Experience}) \]
- \(\beta_2\): change in mean wage for women vs. men with no experience
- \(\beta_3\): change in slope (of experience) for women vs. men

Model 2: Predictions by gender, 1 year of experience
Men with 1 year of experience:
\[ E[\text{Wage}] = 1.10 + 0.68(1) - 4.37(0) + 0.17(0 \times 1) = 1.10 + 0.68 = \hat\beta_0 + \hat\beta_1 \]
Women with 1 year of experience:
\[ E[\text{Wage}] = 1.10 + 0.68(1) - 4.37(1) + 0.17(1 \times 1) = 1.10 + 0.68 - 4.37 + 0.17 = \hat\beta_0 + \hat\beta_1 + \hat\beta_2 + \hat\beta_3 \]
\(\hat\beta_2 + \hat\beta_3\) is the difference in mean wage between women and men with one year of experience.

Model 2: Predictions by gender, no experience
Men with no experience:
\[ E[\text{Wage}] = 1.10 + 0.68(0) - 4.37(0) + 0.17(0 \times 0) = 1.10 = \hat\beta_0 \]
Women with no experience:
\[ E[\text{Wage}] = 1.10 + 0.68(0) - 4.37(1) + 0.17(1 \times 0) = 1.10 - 4.37 = \hat\beta_0 + \hat\beta_2 \]
\(\hat\beta_2\) is the difference in mean wage between women and men with no experience.

Model 2: Predictions by gender, 2 years of experience
Men with 2 years of experience:
\[ E[\text{Wage}] = 1.10 + 0.68(2) - 4.37(0) + 0.17(0 \times 2) = 1.10 + 0.68(2) = \hat\beta_0 + 2\hat\beta_1 \]
Women with 2 years of experience:
\[ E[\text{Wage}] = 1.10 + 0.68(2) - 4.37(1) + 0.17(1 \times 2) = 1.10 + 0.68(2) - 4.37 + 0.17(2) = \hat\beta_0 + 2\hat\beta_1 + \hat\beta_2 + 2\hat\beta_3 \]
\(\hat\beta_2 + 2\hat\beta_3\) is the difference in mean wage between women and men with two years of experience.
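The predictions above can be checked with a few lines of code. This is only a sketch: the coefficients are the rounded estimates quoted on the slides (1.10, 0.68, -4.37, 0.17), and the function names `predicted_wage` and `gender_gap` are made up for this illustration.

```python
# Rounded Model 2 estimates as quoted on the slides (assumed values).
B0, B1, B2, B3 = 1.10, 0.68, -4.37, 0.17


def predicted_wage(x, gender):
    """Mean hourly wage for predictor value x; gender is 0 (men) or 1 (women)."""
    return B0 + B1 * x + B2 * gender + B3 * gender * x


def gender_gap(x):
    """Difference in mean wage, women minus men, at predictor value x."""
    return predicted_wage(x, 1) - predicted_wage(x, 0)


# With an interaction term the gap is B2 + B3*x, so it changes with x.
for x in (0, 1, 2):
    print(x, round(predicted_wage(x, 0), 2), round(predicted_wage(x, 1), 2),
          round(gender_gap(x), 2))
```

Without the interaction term (B3 = 0) the gap would equal B2 at every value of x, which is exactly what Model 1 assumes.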
Model 2: Interpretation
- \(\beta_0\): the average wage for men with no experience
- \(\beta_1\): the difference in average wage for a one-year increase in experience among men
- \(\beta_2\): the difference in average wage between women and men with no experience
- \(\beta_3\): the difference of the difference in average wage for a one-year increase in experience between women and men
  - the change in slope between women and men
  - the slope for women is \(\beta_1 + \beta_3\)

Is the change in slope statistically significant?
- Test model 1 vs. model 2
  - only 1 variable added
  - use the t test for that variable to compare models
- \(H_0: \beta_3 = 0\) in the population
- From the t-statistic, p = 0.27
- Fail to reject \(H_0\)
- Conclude that model 1 is better

Compare to model 1
- In the parent model
  - \(\beta_1\) was the slope for both men and women
  - \(\beta_2\) was the difference between women and men at every experience level
- In the extended model (with interaction)
  - \(\beta_1\) is the slope for men
  - \(\beta_2\) is the difference between women and men for experience = 0
  - \(\beta_3\) is the change in slope per year of experience between men and women

Model 3: Interaction of two binary predictors
- Model 2: continuous \(X_1\), binary \(X_2\), their interaction
  - slope changes by group
- Model 3: binary \(X_1\), binary \(X_2\), their interaction
  - difference in mean changes by group
Model 3: output

      Source |       SS         df       MS            Number of obs =     534
-------------+--------------------------------         F(  3,   530) =   13.94
       Model |  1029.58518       3   343.195059        Prob > F      =  0.0000
    Residual |  13047.1134     530   24.617195         R-squared     =  0.0731
-------------+--------------------------------         Adj R-squared =  0.0679
       Total |  14076.6985     533   26.4103162        Root MSE      =  4.9616

      wagehr |    Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+--------------------------------------------------------------
      gender |   -0.095      0.735    -0.13    0.897     -1.539      1.349
     married |    2.521      0.612     4.12    0.000      1.319      3.724
gender_mar~d |   -3.097      0.907    -3.41    0.001     -4.880     -1.315
       _cons |    8.355      0.494    16.92    0.000      7.385      9.325

(Coefficient rows are shown to three decimal places.)

Graph for Model 3
\(\beta_3\) = Difference of differences
[Figure: mean hourly wage for unmarried men, unmarried women, married men and married women, with the differences between groups labelled in terms of \(\beta_1\), \(\beta_2\) and \(\beta_3\).]

Model 3: Creating the interaction variable
- gender: 0 for men, 1 for women
- married: 0 if unmarried, 1 if married
- gender*married
  - = 0*0 = 0 for unmarried men
  - = 1*0 = 0 for unmarried women
  - = 0*1 = 0 for married men
  - = 1*1 = 1 for married women

Model 3: Interpretation
- \(\beta_0\): the average wage for unmarried men
- \(\beta_1\): the difference in average wage between unmarried women and unmarried men
- \(\beta_1 + \beta_3\): the difference in average wage between married women and married men
- \(\beta_3\): the difference of the difference in average wage between married women and married men and between unmarried women and unmarried men
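To make the "difference of differences" concrete, the four group means can be rebuilt from the coefficients. This is a sketch under assumptions: the values 8.35, -0.10, 2.52 and -3.10 are rounded readings of the Model 3 output, and `mean_wage` is a name invented for this example.

```python
# Rounded Model 3 estimates (assumed readings of the output):
# intercept, gender, married, gender*married.
B0, B1, B2, B3 = 8.35, -0.10, 2.52, -3.10


def mean_wage(gender, married):
    """Fitted mean hourly wage; gender 0 = men, 1 = women; married 0 = no, 1 = yes."""
    return B0 + B1 * gender + B2 * married + B3 * gender * married


gap_unmarried = mean_wage(1, 0) - mean_wage(0, 0)  # women minus men, unmarried: B1
gap_married = mean_wage(1, 1) - mean_wage(0, 1)    # women minus men, married: B1 + B3
diff_of_diffs = gap_married - gap_unmarried        # B3

print(round(gap_unmarried, 2), round(gap_married, 2), round(diff_of_diffs, 2))
```

The same B3 appears if the comparison is made the other way round, married minus unmarried within each gender, which is why both readings of the interaction are valid.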
Model 3: Interpretation
- \(\beta_0\): the average wage for unmarried men
- \(\beta_2\): the difference in average wage between married men and unmarried men
- \(\beta_2 + \beta_3\): the difference in average wage between married women and unmarried women
- \(\beta_3\): the difference of the difference in average wage between married women and unmarried women and between married men and unmarried men

Model 3: conclusion
- The interaction variable is statistically significantly different from 0 (p = 0.001, 95% CI: -4.9 to -1.3)
- The difference in mean hourly wage between women and men is greater for married people than for unmarried people.
  -or-
- The difference in mean hourly wage between married people and unmarried people is greater for men than for women.

Summary: Interaction
- interaction = var1*var2
- The interaction variable changes the interpretation of the entire model
  - with interaction, the effect of one variable changes according to the level of the second variable
- Test for interaction by testing the new variable
  - if significant (p < \(\alpha\), 0 not in CI), keep it
  - if not significant, go back to the parent model without the interaction variable

Flexibility in linear models
- In linear regression, we assume the outcome, Y, has a linear relationship with the predictors, X
- However, we have flexibility in defining the predictors
  - transform X, such as \(X^2\) or \(X^3\)
  - use linear splines to fit "broken arrow" models
Example: Hospital Expenditures ($$)
The data are similar to an example from the book by Pagano and Gauvreau, Principles of Biostatistics.
Data:
- Y: average hospital expenditure ($) per admission
- \(X_1\): average length of stay (days)
- \(X_2\): average employee salary ($)
- n = 51: 50 U.S. states + DC

Scientific Question
How is per capita expenditure (Y) related to:
- Length of stay (\(X_1\))
- Employee salary (\(X_2\))

Model
We might formulate a MLR:
1) \( Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \varepsilon \)
2) \( \varepsilon \sim N(0, \sigma^2) \)
where:
- Y = expenditures per admission in $
- \(X_1\) = length of stay (LOS) in days
- \(X_2\) = salary in $

Model: \( E(Y \mid X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 \)
Parameter interpretations:
- \(\beta_0\): expected expenditure when LOS = 0 and salary = 0 (need to center the model!)
- \(\beta_1\): difference in expected expenditure ($) for two states with the same average salary but LOS that differs by one day
- \(\beta_2\): difference in expected per capita expenditure ($) for two states with the same average LOS but salary that differs by one dollar
Basic Model

      Source |       SS         df       MS            Number of obs =      51
-------------+--------------------------------         F(  2,    48) =   46.08
       Model |  25555145.4       2   12777572.7        Prob > F      =  0.0000
    Residual |  13311254.7      48   277317.807        R-squared     =  0.6575
-------------+--------------------------------         Adj R-squared =  0.6432
       Total |  38866400.2      50   777328.003        Root MSE      =  526.61

      expend |    Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+--------------------------------------------------------------
         los |  313.530     73.442     4.27    0.000    165.866    461.194
      salary |    0.333      0.038     8.79    0.000      0.257      0.410
       _cons | -4662.343   808.702    -5.77    0.000  -6288.346  -3036.339

(Coefficient rows are shown to three decimal places.)

Diagnosis
Check for curvature and other patterns of interest:
[Figure: standardized residuals against length of stay (days) and against salary ($); added-variable plots e(expend | X) against e(los | X) and e(salary | X).]
The Alaskan outlier appears here, as well as some curvature in the salary relationship.

New Model
There appears to be a non-linear relationship between expenditures (Y) and salary (\(X_2\)). How could we incorporate this in our model?
Define a new variable, salary squared, and include it in the model:
\[ E(Y \mid X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_2^2 \]
- Linear relationship with \(X_1\)
- Quadratic relationship with \(X_2\)
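One step worth spelling out, added here as an illustration in the notation of the new model: once a squared term is included, the expected change in expenditure per dollar of salary is no longer a single number but depends on the salary level.

```latex
\[
E(Y \mid X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_2^2
\quad\Longrightarrow\quad
\frac{\partial\, E(Y \mid X)}{\partial X_2} = \beta_2 + 2\beta_3 X_2 .
\]
```

So \(\beta_2\) on its own is the slope in salary only at \(X_2 = 0\), and \(\beta_3\) describes how quickly that slope changes; this is why the two salary coefficients are best interpreted, and tested, together.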
Quadratic Term
Expenditures are linearly related to length of stay, but have a quadratic relationship with salary.
Define a new variable: salary2 = salary^2 and include it in the regression.

Model Output

      Source |       SS         df       MS            Number of obs =      50
-------------+--------------------------------         F(  3,    46) =  142.76
       Model |  17552265.1       3   5850755.03        Prob > F      =  0.0000
    Residual |  1885257.79      46   40983.8651        R-squared     =  0.9030
-------------+--------------------------------         Adj R-squared =  0.8967
       Total |  19437522.9      49   396684.14         Root MSE      =  202.44

      expend |    Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+--------------------------------------------------------------
         los |  441.999     29.343    15.06    0.000    382.935    501.063
      salary |   -2.88       0.29     -9.84    0.000     -3.47      -2.29
     salary2 | .0001002   9.58e-06    10.46    0.000   .0000809   .0001195
       _cons | 19724.65   2206.543     8.94    0.000   15283.11   24166.19

(Coefficient rows are rounded.)

Interpretations
- \(\beta_0\): ???
- \(\beta_1\): We estimate that expected expenditures per admission will be $442 higher (95% CI: $383 to $501) in a state whose average LOS is one day longer than another state with the same average employee salary
- \(\beta_2\): ???
- \(\beta_3\): ???

Inferences
Is salary related to expenditures?
- Could test: \(H_0: \beta_2 = 0\)? \(H_0: \beta_3 = 0\)?
- But we really want \(H_0: \beta_2 = \beta_3 = 0\), the overall test for salary
Hospital Example
Recall the model:
\[ E(Y \mid X) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \beta_3 X_2^2 \]
\(H_0: \beta_2 = \beta_3 = 0\) (Test by hand: need \(SSE_E\), \(SSE_F\))

Null Model Results
Null model: \( E(Y \mid X) = \beta_0 + \beta_1 X_1 \)

      Source |       SS         df       MS            Number of obs =      50
-------------+--------------------------------         F(  1,    48) =   47.04
       Model |  9621038.76       1   9621038.76        Prob > F      =  0.0000
    Residual |  9816484.1       48   204510.086        R-squared     =  0.4950
-------------+--------------------------------         Adj R-squared =  0.4845
       Total |  19437522.9      49   396684.14         Root MSE      =  452.23

      expend |    Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+--------------------------------------------------------------
         los |  443.357     64.640     6.86    0.000    313.390    573.324
       _cons | -786.609    490.408    -1.60    0.115  -1772.641    199.423

\(SSE_E = 9816484.1\), s = 2

Full Model Results
The full model is the quadratic fit shown under Model Output: F(3, 46) = 142.76, residual SS = 1885257.79 on 46 df.
\(SSE_F = 1885257.79\), n - p - s - 1 = 50 - 1 - 2 - 1 = 46

F-test Results
\[ F_{2,46} = \frac{(SSE_E - SSE_F)/s}{SSE_F/(n-p-s-1)} = \frac{7931226.3/2}{1885257.79/(50-1-2-1)} \approx 96.76 \]
(p < 0.001; \(F_{0.05,2,46} = 3.20\))
Reject the null: conclude that the salary effects are statistically significant in the regression model.
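The by-hand F-test can be reproduced directly from the two residual sums of squares. A minimal sketch: the function name `nested_f` is invented here, and the two SSE values are the readings of the null and full model outputs (with dropped digits restored), so treat them as assumptions.

```python
def nested_f(sse_reduced, sse_full, n_added, df_resid_full):
    """F statistic comparing a reduced (null) model with a full (extended) model.

    n_added is the number of variables added to the reduced model;
    df_resid_full is the residual degrees of freedom of the full model.
    """
    numerator = (sse_reduced - sse_full) / n_added
    denominator = sse_full / df_resid_full
    return numerator / denominator


SSE_E = 9816484.1    # residual SS, null model with LOS only (assumed reading)
SSE_F = 1885257.79   # residual SS, full model with salary and salary^2 (assumed reading)
n, p, s = 50, 1, 2   # observations, predictors in the null model, predictors added

F = nested_f(SSE_E, SSE_F, s, n - p - s - 1)
print(round(F, 2))   # compare with the critical value F(0.05; 2, 46) = 3.20
```

In Stata the same comparison is usually obtained with the `test` command after fitting the full model; the hand calculation is shown only to make the formula explicit.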
Linear Splines: set-up
The "broken arrow" model
- Example: A researcher tells you most Health Maintenance Organizations (HMOs) will usually pay for the first week of a hospital stay only
- She expects expenditures to increase dramatically if LOS is longer than one week
- How should we set up the model?

Defining a New Variable
- Similar to what we did in ANCOVA, we could just define a new variable that checks whether the slope is indeed different if LOS is greater than 7.
- Idea: include a term
\[ (\text{LOS}-7)^+ = \begin{cases} \text{LOS}-7 & \text{if LOS} > 7 \\ 0 & \text{if LOS} \le 7 \end{cases} \]
- The spline allows you to change the magnitude of the slope!

The researcher thought the LOS regression line should look like:
[Figure: Broken Arrow Model. Expenditures against length of stay (days), with a steeper slope beyond 7 days.]

When to use a spline?
- When a continuous predictor is used, a typical regression equation assumes there is a straight-line relationship between X and Y in the population.
- If the relationship between X and Y is
  - a bent line
  - a curve
  adding a spline may more accurately model the relationship between X and Y.
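The spline term is simple to compute. The sketch below is illustrative only: `hinge` and `broken_arrow` are made-up names, and the coefficients 1000, 200 and 350 are hypothetical numbers chosen to show the change in slope, not estimates from the data.

```python
def hinge(x, knot):
    """Linear spline term (x - knot)^+ : 0 up to the knot, x - knot beyond it."""
    return max(0.0, x - knot)


def broken_arrow(los, b0, b1, b2, knot=7.0):
    """Expected expenditure under the broken arrow model with one knot."""
    return b0 + b1 * los + b2 * hinge(los, knot)


# Hypothetical coefficients, for illustration only.
b0, b1, b2 = 1000.0, 200.0, 350.0

slope_below = broken_arrow(6, b0, b1, b2) - broken_arrow(5, b0, b1, b2)   # b1
slope_above = broken_arrow(9, b0, b1, b2) - broken_arrow(8, b0, b1, b2)   # b1 + b2
print(slope_below, slope_above)
```

Because the hinge term is zero at the knot, the two line segments meet at LOS = 7, so the fitted line bends but does not jump.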
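Visualizing the Model
[Figure: Broken Arrow Model. Expenditures against length of stay (days); slope = \(\beta_1\) up to 7 days, slope = \(\beta_1 + \beta_2\) beyond 7 days.]
Then:
\[ E(\text{expenditures} \mid \text{LOS} \le 7) = \beta_0 + \beta_1 \text{LOS} \]
\[ E(\text{expenditures} \mid \text{LOS} > 7) = \beta_0 + \beta_1 \text{LOS} + \beta_2(\text{LOS} - 7) = (\beta_0 - 7\beta_2) + (\beta_1 + \beta_2)\text{LOS} = \beta_0^* + \beta_1^* \text{LOS} \]

The Model
Model:
\[ E(\text{expenditures}) = \beta_0 + \beta_1 \text{LOS} + \beta_2 (\text{LOS}-7)^+ \]
where
\[ (\text{LOS}-7)^+ = \begin{cases} \text{LOS}-7 & \text{if LOS} > 7 \\ 0 & \text{if LOS} \le 7 \end{cases} \]

New Model
\[ E(Y \mid X) = \beta_0 + \beta_1 X_1 + \beta_2 (X_1 - 7)^+ + \beta_3 X_2 + \beta_4 X_2^2 \]
- Broken arrow relationship with \(X_1\)
- Quadratic relationship with \(X_2\)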
Adding Spline to Quadratic
Expenditures have a different linear relationship before and after a 7-day length of stay, and have a quadratic relationship with salary.
We'll just define a new variable: los7 = (los-7)*(los>7) and include it in the regression.

Results

      Source |       SS         df       MS            Number of obs =      50
-------------+--------------------------------         F(  4,    45) =  126.01
       Model |  17844348.0       4   4461087.0         Prob > F      =  0.0000
    Residual |  1593174.87      45   35403.8861        R-squared     =  0.9180
-------------+--------------------------------         Adj R-squared =  0.9108
       Total |  19437522.9      49   396684.14         Root MSE      =  188.16

      expend |    Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+--------------------------------------------------------------
         los |  212.536     84.415     2.52    0.015     42.515    382.558
        los7 |  347.778    121.08      2.87    0.006    103.909    591.647
      salary |   -3.14       0.287   -10.95    0.000     -3.72      -2.57
     salary2 | .0001082   9.32e-06    11.60    0.000   .0000894   .0001269
       _cons | 23276.97   2394.89      9.72    0.000   18453.41   28100.53

Centering LOS in the expenditures model
- Y: average hospital expenditure ($) per admission
- \(X_1\): average length of stay (days)
- \(X_2\): average employee salary ($1000s)
Centered model:
\[ E(Y \mid X) = \beta_0 + \beta_1 (X_1 - 7) + \beta_2 (X_1 - 7)^+ + \beta_3 (X_2 - 15) + \beta_4 (X_2 - 15)^2 \]

Final Model for Expenditures

      Source |       SS         df       MS            Number of obs =      50
-------------+--------------------------------         F(  4,    45) =  126.01
       Model |  17844345.3       4   4461086.31        Prob > F      =  0.0000
    Residual |  1593177.63      45   35403.9473        R-squared     =  0.9180
-------------+--------------------------------         Adj R-squared =  0.9108
       Total |  19437522.9      49   396684.14         Root MSE      =  188.16

      expend |    Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+--------------------------------------------------------------
        losc |  212.537     84.416     2.52    0.015     42.515    382.558
       losc7 |  347.777    121.08      2.87    0.006    103.908    591.646
        salc |  101.687     19.696     5.16    0.000     62.016    141.357
       salc2 |  108.158      9.325    11.60    0.000     89.377    126.939
       _cons | 1954.413     68.700    28.45    0.000   1816.045   2092.782

(Coefficient rows in both tables are rounded.)

\[ E(Y \mid X) = 1954 + 213(X_1 - 7) + 348(X_1 - 7)^+ + 102(X_2 - 15) + 108(X_2 - 15)^2 \]
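As a quick check of how the centered final model is read, it can be evaluated at a few covariate values. This is a sketch under assumptions: the rounded coefficients 1954, 213, 348, 102 and 108 are readings of the final equation, salary is taken to be in $1000s, and `expected_expenditure` is a name invented for this example.

```python
def expected_expenditure(los_days, salary_thousands):
    """Fitted expenditure per admission from the centered final model (rounded coefficients)."""
    los_c = los_days - 7            # LOS centered at the knot, 7 days
    sal_c = salary_thousands - 15   # salary centered at $15,000
    return (1954
            + 213 * los_c
            + 348 * max(0, los_c)   # spline term (X1 - 7)^+
            + 102 * sal_c
            + 108 * sal_c ** 2)


print(expected_expenditure(7, 15))   # the intercept: LOS = 7 days, salary = $15,000
print(expected_expenditure(6, 15))   # one day shorter: slope below the knot applies
print(expected_expenditure(8, 15))   # one day longer: slope above the knot applies
```

Centering is what makes the intercept interpretable: it is the expected expenditure for a state with a 7-day average stay and a $15,000 average salary, rather than for a state with LOS = 0 and salary = 0.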
Back to modelling wages
[Figure: standardized residuals against age.]
We removed an outlier, but do we still need a spline?

How should we add the spline?
- Goal: let the regression line bend
- Model:
\[ E(\text{Wage}) = \beta_0 + \beta_1(\text{age}-35) + \beta_2(\text{age}-35)^+ \]
- What is \((\text{age}-35)^+\)?
\[ (\text{age}-35)^+ = \begin{cases} 0 & \text{if age} < 35 \\ \text{age}-35 & \text{if age} \ge 35 \end{cases} \]

Fitted model with spline at 35

      Source |       SS         df       MS            Number of obs =     533
-------------+--------------------------------         F(  2,   530) =   28.18
       Model |  1231.65577       2   615.827885        Prob > F      =  0.0000
    Residual |  11584.1395     530   21.8568669        R-squared     =  0.0961
-------------+--------------------------------         Adj R-squared =  0.0927
       Total |  12815.7952     532   24.0898407        Root MSE      =  4.6751

      wagehr |    Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+--------------------------------------------------------------
    age_cent |    0.333      0.047     7.07    0.000      0.240      0.425
  age_spline |   -0.375      0.066    -5.65    0.000     -0.505     -0.244
       _cons |   10.454      0.358    29.22    0.000      9.751     11.157

(Coefficient rows are shown to three decimal places.)

Fitted Graph (with spline)
[Figure: wage ($/hour) against age, with the fitted line bending at age 35.]
Better Interpretation
\[ E(\text{Wage}) = 10.45 + 0.33(\text{age}-35) - 0.37(\text{age}-35)^+ \]
For a person under 35, the spline term is 0:
\[ E(\text{Wage}) = 10.45 + 0.33(\text{age}-35) \]
For a person 35 or older:
\[ E(\text{Wage}) = 10.45 + 0.33(\text{age}-35) - 0.37(\text{age}-35) = 10.45 - 0.04(\text{age}-35) \]
- The average wage for people who are 35 years old is $10.45/hour (95% CI: $9.75, $11.16)
- For each additional year of age, those under age 35 earn an average of $0.33 more per hour (95% CI: $0.24, $0.43)
- For each additional year of age, those over age 35 earn an average of $0.04 less per hour (95% CI: -$.1, $.1)
- \(\beta_1 + \beta_2\) = new slope for those over 35

Interpretation
- \(\beta_0\) is the average wage for people who are 35 years old
- \(\beta_1\) is the change in average wage per additional year of age for those under 35
- \(\beta_2\) is the difference in the change in average wage per additional year of age for those over age 35 as compared to those under age 35
  - \(\beta_2\) is the change in the slope for over 35 vs. under 35

Is the change in slope statistically significant?
One variable was added to create the change in slope, so compare the nested models with a t test.

. regress wagehr age_cent age_spline if sres_age<6

The output is the fitted model with spline at 35: the coefficient on age_spline is -0.375 (t = -5.65, P>|t| = 0.000, 95% CI: -0.505 to -0.244).
- \(H_0\): the spline is not needed (no change in slope in the population)
- p < 0.001, or CI does not include 0: reject \(H_0\)
- Conclude that the slope differs for those over vs. under 35 in the population
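The two slopes can be read off the fitted equation in code. A minimal sketch, assuming the rounded coefficients 10.45, 0.33 and -0.37 from the fitted equation; `expected_wage` is a hypothetical helper name.

```python
# Rounded estimates from the fitted spline model (assumed values).
B0, B1, B2 = 10.45, 0.33, -0.37
KNOT = 35


def expected_wage(age):
    """Fitted mean hourly wage from the linear spline model with a knot at age 35."""
    centered = age - KNOT
    return B0 + B1 * centered + B2 * max(0, centered)


slope_under_35 = B1        # change in mean wage per year of age before the knot
slope_over_35 = B1 + B2    # change in mean wage per year of age after the knot

print(round(slope_under_35, 2), round(slope_over_35, 2))
print(round(expected_wage(25), 2), round(expected_wage(35), 2), round(expected_wage(45), 2))
```

Note that the t test on age_spline is a test of B2, the change in slope; a confidence interval for the slope over 35 (B1 + B2) needs the covariance between the two estimates, for example through Stata's `lincom` command.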
L: Linear relationship
- With the spline, there is no longer any pattern in the residuals
- After removing the one outlier, no others appear to stand out

I: Independence
- We cannot check this by looking at the data

N: Normality of the residuals
- The residuals are slightly skewed to positive values
  - the estimated regression coefficients are still correct
  - their confidence intervals may be misleading

E: Equal variance of the residuals across X
- The vertical spread of the residuals may be smaller for the youngest workers
  - the estimated regression coefficients are still correct
  - their confidence intervals may be misleading
Conclusion
The increase in hourly wage with increasing age is statistically significant for those who recently entered the workforce (ages 18-35): for each additional year, these workers earn an average of 33 cents more per hour. However, this increase in wage with increasing age levels off for those over age 35, so that no appreciable increase in average wage is observed for those over age 35.

One 21-year-old had much higher earnings ($44.50 per hour) than other young workers. This person's results were so unlike the rest of the sample that the observation was dropped from the analysis. It is possible that the data were incorrectly entered for this person, but we are unable to assess the data entry since the original completed surveys are unavailable.

Splines
- Splines are used to allow the regression line to bend
  - the breakpoint is arbitrary and decided graphically
  - the actual slope above and below the breakpoint is usually of more interest than the coefficient for the spline (i.e., the change in slope)