Example: Multiple linear regression. Least squares regression. Repetition: Simple linear regression. Tron Anders Moger

Example: Multiple linear regression
Tron Anders Moger, 2007

[Figure: scatter plot of birthweight against weight in pounds]

Repetition: Simple linear regression
- We define a model $Y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, where the $\varepsilon_i$ are independent, normally distributed, with equal variance $\sigma^2$
- We wish to fit a line as close to the observed data (two normally distributed variables) as possible
- Example: Birth weight $= \beta_0 + \beta_1 \cdot$ mother's weight

Least squares regression
[Figure: scatter plot of birthweight against weight in pounds with the fitted line; R Sq Linear = 0.035]

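How to compute the line fit with the least squares method?
- Let $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$ denote the points in the plane.
- Find a and b so that $y = a + bx$ fits the points by minimizing
  $$S = (a + bx_1 - y_1)^2 + (a + bx_2 - y_2)^2 + \dots + (a + bx_n - y_n)^2 = \sum (a + bx_i - y_i)^2$$
- Solution:
  $$b = \frac{n \sum x_i y_i - (\sum x_i)(\sum y_i)}{n \sum x_i^2 - (\sum x_i)^2} = \frac{\sum x_i y_i - n \bar{x} \bar{y}}{\sum x_i^2 - n \bar{x}^2}, \qquad a = \frac{\sum y_i - b \sum x_i}{n} = \bar{y} - b \bar{x}$$
  where $\bar{x} = \frac{1}{n} \sum x_i$, $\bar{y} = \frac{1}{n} \sum y_i$ and all sums are done for $i = 1, \dots, n$.

How do you get this answer?
- Differentiate S with respect to a and b, and set the result to 0:
  $$\frac{\partial S}{\partial a} = 2 \sum (a + bx_i - y_i) = 0$$
  $$\frac{\partial S}{\partial b} = 2 \sum (a + bx_i - y_i) x_i = 0$$
- We get:
  $$na + b \left( \sum x_i \right) - \sum y_i = 0$$
  $$a \left( \sum x_i \right) + b \left( \sum x_i^2 \right) - \sum x_i y_i = 0$$
- These are two equations with two unknowns, and their solution gives the answer.

How close are the data to the fitted line? $R^2$
- Define
  SSE: Error sum of squares $\sum (y_i - (a + bx_i))^2$
  SSR: Regression sum of squares $\sum (a + bx_i - \bar{y})^2$
  SST: Total sum of squares $\sum (y_i - \bar{y})^2$
- We can show that SST = SSR + SSE
- Define $R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST} = \mathrm{corr}(x, y)^2$
- $R^2$ is the coefficient of determination

What is the logic behind $R^2$?
[Figure: a data point $(x_i, y_i)$, the fitted line $\hat{y} = a + bx$ and the mean $\bar{y}$. The distance $y_i - \bar{y}$ (SST) splits into the residual $y_i - \hat{y}_i$ (SSE) and the explained part $\hat{y}_i - \bar{y}$ (SSR).]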
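The closed-form solution and the sums of squares above can be checked numerically. The following is a minimal sketch in plain Python; the function names and the four data points are made up for illustration and are not taken from the lecture.

```python
def fit_line(xs, ys):
    """Return (a, b) minimizing sum((a + b*x - y)**2) over the points."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # b = (sum x*y - n*x_bar*y_bar) / (sum x^2 - n*x_bar^2)
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    sxx = sum(x * x for x in xs) - n * x_bar ** 2
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


def sums_of_squares(xs, ys, a, b):
    """Return (SST, SSR, SSE) for the fitted line y = a + b*x."""
    y_bar = sum(ys) / len(ys)
    fitted = [a + b * x for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)
    ssr = sum((f - y_bar) ** 2 for f in fitted)
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))
    return sst, ssr, sse


# Made-up illustration data (not from the lecture).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]

a, b = fit_line(xs, ys)
sst, ssr, sse = sums_of_squares(xs, ys, a, b)
r_squared = ssr / sst  # same value as 1 - sse / sst

print("a =", a, "b =", b)
print("SST =", sst, "SSR + SSE =", ssr + sse)
print("R^2 =", r_squared)
```

For these points the slope comes out as 1.9 and the intercept as 0 (up to floating-point rounding), and SST equals SSR + SSE as claimed.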

Example: Regression of birth weight with mother's weight as independent variable

Model Summary (predictors: (Constant), weight in pounds; dependent variable: birthweight)
  R      R Square   Adjusted R Square   Std. Error of the Estimate
  .186   .035       .029                718.24270

ANOVA (predictors: (Constant), weight in pounds; dependent variable: birthweight)
               Sum of Squares   df    Mean Square   F       Sig.
  Regression   3448881          1     3448881.30    6.686   .010
  Residual     96468171         187   515872.574
  Total        99917053         188

Coefficients (dependent variable: birthweight)
                     Unstandardized           Standardized                    95% Confidence Interval for B
                     B         Std. Error     Beta           t        Sig.    Lower Bound   Upper Bound
  (Constant)         2369.67   228.43                        10.374   .000    1919.040      2820.304
  weight in pounds   4.429     1.713          .186           2.586    .010    1.050         7.809

Interpretation:
- Have fitted the line Birth weight = 2369.67 + 4.429 * mother's weight
- If mother's weight increases by 20 pounds, what is the predicted impact on the infant's birth weight? 4.429 * 20 = 89 grams
- What is the predicted birth weight of an infant with a 150 pound mother? 2369.67 + 4.429 * 150 = 3034 grams

But how to answer questions like:
- Given that a positive slope (b) has been estimated: Does it give a reproducible indication that there is a positive trend, or is it a result of random variation?
- What is a confidence interval for the estimated slope?
- What is the prediction, with uncertainty, at a new x value?

Confidence intervals for simple regression
- In a simple regression model,
  a estimates $\beta_0$
  b estimates $\beta_1$
  $\hat{\sigma}^2 = SSE / (n - 2)$ estimates $\sigma^2$
- Also, $(b - \beta_1) / S_b \sim t_{n-2}$, where $S_b^2 = \dfrac{\hat{\sigma}^2}{(n - 1) s_x^2}$ estimates the variance of b
- So a confidence interval for $\beta_1$ is given by $b \pm t_{n-2, \alpha/2} S_b$
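The interval for the slope can be computed directly from the definitions of $\hat{\sigma}^2$ and $S_b$. The sketch below assumes plain Python, made-up data, and a critical value `t_crit` that is looked up separately (the standard library has no t-distribution); 4.303 is the 97.5th percentile of the t-distribution with 2 degrees of freedom, which matches the four made-up points used here.

```python
import math


def slope_confidence_interval(xs, ys, t_crit):
    """Return (b, s_b, low, high) for the slope of a simple regression.

    t_crit is the upper alpha/2 percentile of the t-distribution with
    n - 2 degrees of freedom; it must be supplied by the caller.
    """
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)  # equals (n - 1) * s_x^2
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    a = y_bar - b * x_bar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sigma2_hat = sse / (n - 2)       # estimate of sigma^2
    s_b = math.sqrt(sigma2_hat / sxx)  # estimated standard deviation of b
    return b, s_b, b - t_crit * s_b, b + t_crit * s_b


# Made-up illustration data (not from the lecture); n = 4, so n - 2 = 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 5.0, 8.0]
b, s_b, low, high = slope_confidence_interval(xs, ys, t_crit=4.303)
print("b =", b, "S_b =", s_b)
print("95% CI for the slope:", (low, high))
```

With so few points the interval is wide; in the birth weight example the same formula is applied with 187 degrees of freedom.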

Hypothesis testing for simple regression
- Choose hypotheses: $H_0: \beta_1 = 0$, $H_1: \beta_1 \neq 0$
- Test statistic: $b / S_b \sim t_{n-2}$
- Reject $H_0$ if $b / S_b < -t_{n-2, \alpha/2}$ or $b / S_b > t_{n-2, \alpha/2}$
- For the example: Test $H_0: \beta_{\text{mother's weight}} = 0$ on the 5% significance level
  Get 4.429 / 1.713 = 2.586.
  Look up the 2.5 and 97.5 percentiles in the t-distribution with 187 degrees of freedom (approx. normal dist.)
  Find p-value < 0.05, reject $H_0$

Prediction from a simple regression model
- A regression model can be used to predict the response at a new value $x_{n+1}$
- The uncertainty in this prediction comes from two sources:
  The uncertainty in the regression line
  The uncertainty of any response, given the regression line
- A confidence interval for the prediction:
  $$a + b x_{n+1} \pm t_{n-2, \alpha/2} \, \hat{\sigma} \sqrt{1 + \frac{1}{n} + \frac{(x_{n+1} - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

Example: The confidence interval of the predicted birth weight of an infant with a 150 pound mother
- Found that the predicted weight was 3034 grams
- The confidence interval for the prediction is:
  $$2369.67 + 4.43 \cdot 150 \pm t_{187, 0.025} \cdot 718.24 \cdot \sqrt{1 + \frac{1}{189} + \frac{(150 - 129.8)^2}{175798.5}}$$
  where $t_{187, 0.025} = 1.96$
- The sum $\sum (x_i - \bar{x})^2 = 175798.5$ is not given directly in the SPSS output. Calculated as: $MSE / S_b^2 = 515872 / 1.71^2$
- Which becomes (1620.9, 4447.1)

More than one independent variable: Multiple regression
- Assume we have data of the type $(x_{11}, x_{12}, x_{13}, y_1), (x_{21}, x_{22}, x_{23}, y_2), \dots$
- We want to explain y from the x-values by fitting the following model: $y = a + b x_1 + c x_2 + d x_3$
- Just like before, one can produce formulas for a, b, c, d minimizing the sum of the squares of the errors.
- $x_1, x_2, x_3$ can be transformations of different variables, or transformations of the same variable
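For the model with three x-variables, setting the partial derivatives of the sum of squared errors to zero gives a linear system (the normal equations), just as in the two-unknown case for a single x. The following is a rough sketch in plain Python that builds and solves that system by Gaussian elimination. The helper names and the data are made up for illustration, and a statistics package would normally do this step.

```python
def solve_linear_system(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        # Swap in the row with the largest pivot to limit rounding error.
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - s) / aug[r][r]
    return x


def fit_multiple(rows, ys):
    """Least squares coefficients [a, b, c, ...] for y = a + b*x1 + c*x2 + ..."""
    design = [[1.0] + list(row) for row in rows]  # leading 1 for the constant
    p = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(p)] for i in range(p)]
    xty = [sum(d[i] * y for d, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)


# Made-up data generated exactly from y = 1 + 2*x1 - 1*x2 + 0.5*x3.
rows = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (1, 1, 0), (1, 0, 1), (2, 1, 3)]
ys = [1 + 2 * x1 - 1 * x2 + 0.5 * x3 for x1, x2, x3 in rows]

coef = fit_multiple(rows, ys)
print("a, b, c, d =", coef)
```

Because the made-up responses contain no noise, the fit recovers the generating coefficients up to rounding.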

Multiple regression model
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_K x_{iK} + \varepsilon_i$$
- The errors $\varepsilon_i$ are independent random (normal) variables with expectation zero and variance $\sigma^2$
- The explanatory variables $x_1, x_2, \dots, x_K$ cannot be linearly related

New example: Traffic deaths in 1976 (from file crash on textbook CD)
- Want to find if there is any relationship between highway death rate (deaths per 1000 per state) in the U.S. and the following variables:
  Average car age (in months)
  Average car weight (in 1000 pounds)
  Percentage light trucks
  Percentage imported cars
- All data are per state

First: Scatter plots:
[Figure: four scatter plots of deaths against carage, vehwt, lighttrks and impcars]

Univariate effects (one independent variable at a time!):

Deaths per 1000 = a + b * car age (in months)

Model Summary (predictors: (Constant), carage; dependent variable: deaths)
  R      R Square   Adjusted R Square   Std. Error of the Estimate
  .492   .242       .226                .05206

Coefficients (dependent variable: deaths)
               Unstandardized         Standardized                    95% Confidence Interval for B
               B       Std. Error     Beta           t        Sig.    Lower Bound   Upper Bound
  (Constant)   4.516   1.134                         3.98     .000    2.233         6.800
  carage       -.062   .016           -.492          -3.834   .000    -.094         -.029

Hence: If all else is equal, if average car age increases by one month, you get 0.062 fewer deaths per 1000 inhabitants; increase age by 12 months, you get 12 * 0.062 = 0.74 fewer deaths per 1000 inhabitants

Deaths per 1000 = a + b * car weight (in 1000 pounds)

Model Summary (predictors: (Constant), vehwt; dependent variable: deaths)
  R      R Square   Adjusted R Square   Std. Error of the Estimate
  .281   .079       .059                .05740

Coefficients (dependent variable: deaths)
               Unstandardized         Standardized                    95% Confidence Interval for B
               B       Std. Error     Beta           t        Sig.    Lower Bound   Upper Bound
  (Constant)   -.271   .221                          -1.227   .226    -.716         .174
  vehwt        .124    .062           .281           1.983    .053    -.002         .249

Univariate effects cont'd (one independent variable at a time!):

Model Summary (predictors: (Constant), lighttrks; dependent variable: deaths)
  R      R Square   Adjusted R Square   Std. Error of the Estimate
  .716   .512       .501                .04178

Coefficients (dependent variable: deaths)
               Unstandardized         Standardized                    95% Confidence Interval for B
               B       Std. Error     Beta           t        Sig.    Lower Bound   Upper Bound
  (Constant)   .046    .018                          2.478    .017    .009          .083
  lighttrks    .007    .001           .716           6.947    .000    .005          .010

Hence: Increase prop. light trucks by 20 means 20 * 0.007 = 0.14 more deaths per 1000 inhabitants

Model Summary (predictors: (Constant), impcars; dependent variable: deaths)
  R      R Square   Adjusted R Square   Std. Error of the Estimate
  .308   .095       .075                .05690

Coefficients (dependent variable: deaths)
               Unstandardized         Standardized                         95% Confidence Interval for B
               B       Std. Error     Beta           t             Sig.    Lower Bound   Upper Bound
  (Constant)   .206    .020                          [illegible]   .000    .166          .246
  impcars      -.004   .002           -.308          -2.193        .033    -.007         .000

Predicted number of deaths per 1000 if prop. imported cars is 10%: 0.206 - 0.004 * 10 = 0.17

Building a multiple regression model:
- Forward regression:
  1. Try all independent variables, one at a time, and keep the variable with the lowest p-value
  2. Repeat step 1, with the independent variable from the first round now included in the model
  3. Repeat until no more variables can be added to the model (no more significant variables)
- Backward regression: Include all independent variables in the model, remove the variable with the highest p-value
- Continue until only significant variables are left
- However: These methods are not always correct to use in practice!

For the traffic deaths, end up with:
Deaths per 1000 = 2.7 - 0.037 * car age + 0.006 * perc. light trucks

Model Summary (predictors: (Constant), lighttrks, carage; dependent variable: deaths)
  R      R Square   Adjusted R Square   Std. Error of the Estimate
  .768   .590       .572                .03871

Coefficients (dependent variable: deaths)
               Unstandardized         Standardized                    95% Confidence Interval for B
               B       Std. Error     Beta           t        Sig.    Lower Bound   Upper Bound
  (Constant)   2.668   .895                          2.98     .005    .865          4.470
  carage       -.037   .013           -.295          -2.930   .005    -.063         -.012
  lighttrks    .006    .001           .62            6.18     .000    .004          .009

Conclusion: Did a multiple linear regression on traffic deaths, with car age, car weight, prop. light trucks and prop. imported cars as independent variables. Car age (in months, β = -0.037, 95% CI = (-0.063, -0.012)) and prop. light trucks (β = 0.006, 95% CI = (0.004, 0.009)) were significant on 5%-level

Check of assumptions:
[Figure: histogram of the regression standardized residual, dependent variable: deaths; mean approximately 0, Std. Dev. = 0.978, N = 48]
[Figure: normal P-P plot of the regression standardized residual, dependent variable: deaths; expected cumulative probability against observed cumulative probability]
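The forward regression procedure described above is a simple loop, which can be sketched as follows. This is only an outline under stated assumptions: `p_value_when_added` is a hypothetical callback that would refit the model with the candidate added and return that candidate's p-value, and the table of p-values below is made up so that the example runs without a regression library.

```python
def forward_selection(candidates, p_value_when_added, alpha=0.05):
    """Add variables one at a time, always taking the lowest p-value.

    p_value_when_added(selected, candidate) is a hypothetical callback:
    it should return the p-value of `candidate` in a model that already
    contains the variables in `selected`.
    """
    selected = []
    remaining = list(candidates)
    while remaining:
        p_values = {v: p_value_when_added(selected, v) for v in remaining}
        best = min(p_values, key=p_values.get)
        if p_values[best] >= alpha:
            break  # no more significant variables
        selected.append(best)
        remaining.remove(best)
    return selected


# Made-up p-values, keyed by the set of variables already in the model.
TOY_P = {
    frozenset(): {"x1": 0.001, "x2": 0.03, "x3": 0.40},
    frozenset({"x1"}): {"x2": 0.02, "x3": 0.50},
    frozenset({"x1", "x2"}): {"x3": 0.60},
}


def toy_p_value(selected, candidate):
    """Look up the made-up p-value instead of fitting a real model."""
    return TOY_P[frozenset(selected)][candidate]


chosen = forward_selection(["x1", "x2", "x3"], toy_p_value)
print("selected variables:", chosen)
```

In this toy run `x1` enters first, `x2` enters second, and `x3` is never added because its p-value stays above 0.05. As the slides warn, such automatic procedures are not always appropriate in practice.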

Check of assumptions cont'd:
[Figure: scatterplot of the regression standardized residual against the regression standardized predicted value, dependent variable: deaths]

Least squares estimation
$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_K x_{iK} + \varepsilon_i$$
- The least squares estimates of $\beta_0, \beta_1, \dots, \beta_K$ are the values $b_0, b_1, \dots, b_K$ minimizing
  $$SSE = \sum (b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_K x_{iK} - y_i)^2$$
- They can be computed with similar but more complex formulas as with simple regression

Explanatory power
- Defining
  $\hat{y}_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \dots + b_K x_{iK}$
  $SST = \sum (y_i - \bar{y})^2$
  $SSE = \sum (y_i - \hat{y}_i)^2$
  $SSR = \sum (\hat{y}_i - \bar{y})^2$
- We get as before SST = SSR + SSE
- We define the coefficient of determination $R^2 = \frac{SSR}{SST} = 1 - \frac{SSE}{SST}$
- We also get that $R = \mathrm{Corr}(y, \hat{y})$

Adjusted coefficient of determination
- Adding more independent variables will generally increase SSR and decrease SSE
- Thus the coefficient of determination will tend to indicate that models with many variables always fit better.
- To avoid this effect, the adjusted coefficient of determination may be used:
  $$\bar{R}^2 = 1 - \frac{SSE / (n - K - 1)}{SST / (n - 1)}$$
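The adjusted coefficient of determination is a one-line computation. As a hedged check, the sketch below applies it to the traffic deaths model, taking R Square = 0.590, N = 48 and K = 2 independent variables from the SPSS output; the sums of squares are scaled so that SST = 1, which is an assumption made only to avoid needing the raw data.

```python
def r_squared(sse, sst):
    """Coefficient of determination, 1 - SSE/SST."""
    return 1 - sse / sst


def adjusted_r_squared(sse, sst, n, k):
    """Adjusted coefficient of determination for k independent variables."""
    return 1 - (sse / (n - k - 1)) / (sst / (n - 1))


# Scaled sums of squares: SST = 1 and SSE = 1 - R^2 = 0.410 (assumption).
sst = 1.0
sse = 1.0 - 0.590

plain = r_squared(sse, sst)
adjusted = adjusted_r_squared(sse, sst, n=48, k=2)
print("R^2 =", round(plain, 3))
print("adjusted R^2 =", round(adjusted, 3))
```

The adjusted value is slightly below the plain one, and the gap grows as more variables are added for a fixed number of observations.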

Drawing inference about the model parameters
- Similar to simple regression, we get that the following statistic has a t distribution with n-K-1 degrees of freedom:
  $$t_{b_j} = \frac{b_j - \beta_j}{s_{b_j}}$$
  where $b_j$ is the least squares estimate for $\beta_j$ and $s_{b_j}$ is its estimated standard deviation
- $s_{b_j}$ is computed from SSE and the correlation between independent variables

Confidence intervals and hypothesis tests
- A confidence interval for $\beta_j$ becomes $b_j \pm t_{n-K-1, \alpha/2} \, s_{b_j}$
- Testing the hypothesis $H_0: \beta_j = 0$ vs $H_1: \beta_j \neq 0$:
  Reject if $\dfrac{b_j}{s_{b_j}} < -t_{n-K-1, \alpha/2}$ or $\dfrac{b_j}{s_{b_j}} > t_{n-K-1, \alpha/2}$

Testing sets of parameters
- We can also test the null hypothesis that a specific set of the betas are simultaneously zero. The alternative hypothesis is that at least one beta in the set is nonzero.
- The test statistic has an F distribution, and is computed by comparing the SSE in the full model, and the SSE when setting the parameters in the set to zero.

Making predictions from the model
- As in simple regression, we can use the estimated coefficients to make predictions
- As in simple regression, the uncertainty in the predictions has two sources:
  The variance around the regression estimate
  The variance of the estimated regression model
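The slides do not spell out the F statistic for testing a set of parameters. A common form, given here as a sketch rather than as the lecture's own formula, divides the increase in SSE per dropped parameter by the residual mean square of the full model; under the null hypothesis it follows an F distribution with (q, n-K-1) degrees of freedom. The function name and all numbers below are made up for illustration.

```python
def partial_f_statistic(sse_reduced, sse_full, n, k_full, q):
    """F statistic for H0: a set of q coefficients are all zero.

    sse_full    : SSE of the full model with k_full independent variables
    sse_reduced : SSE after setting the q coefficients to zero
    n           : number of observations
    """
    numerator = (sse_reduced - sse_full) / q
    denominator = sse_full / (n - k_full - 1)
    return numerator / denominator


# Made-up numbers: dropping 2 of 4 variables raises SSE from 100 to 120.
f_stat = partial_f_statistic(sse_reduced=120.0, sse_full=100.0,
                             n=50, k_full=4, q=2)
print("F =", f_stat, "with (2, 45) degrees of freedom")
```

The resulting value is compared with the upper percentile of the F distribution with (q, n-K-1) degrees of freedom, looked up in a table or a statistics package.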

What if the relationship is nonlinear?
- Most common thing to do is to categorize the independent variable
- E.g. categorize age into 0-20 yrs, 21-40 yrs, 41-60 yrs and so on
- Choose a baseline category, and estimate a slope b for each of the other categories
- Does not matter what relationship you have between the outcome and the independent variable
- Will talk more about this next time

Other options if the relationship is non-linear: Transformed variables
- The relationship between variables may not be linear
- Example: The natural model may be $y = a e^{bx}$
- We want to find a and b so that the curve $y = a e^{bx}$ approximates the points as well as possible
[Figure: scatter plot of data points following an exponential curve]

Example (cont.)
- When $y = a e^{bx}$ then $\log(y) = \log(a) + bx$
- Use standard formulas on the pairs $(x_1, \log(y_1)), (x_2, \log(y_2)), \dots, (x_n, \log(y_n))$
- We get estimates for log(a) and b, and thus a and b
[Figure: the same data with the fitted exponential curve]

Another example of transformed variables
- Another natural model may be $y = a x^b$
- We get that $\log(y) = \log(a) + b \log(x)$
- Use standard formulas on the pairs $(\log(x_1), \log(y_1)), (\log(x_2), \log(y_2)), \dots, (\log(x_n), \log(y_n))$
- Note: In this model (with b > 0), the curve goes through (0,0)
[Figure: scatter plot of data points with the fitted power curve]
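Both transformations reduce to an ordinary straight-line fit on logged data. The sketch below assumes plain Python and made-up, noise-free data generated from $y = 2e^{0.3x}$ and $y = 3x^{1.5}$, so the fits should recover those constants.

```python
import math


def fit_line(xs, ys):
    """Least squares intercept and slope for y = a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / sxx
    return y_bar - b * x_bar, b


xs = [1.0, 2.0, 3.0, 4.0, 5.0]

# Exponential model y = a * exp(b*x): regress log(y) on x.
ys_exp = [2.0 * math.exp(0.3 * x) for x in xs]  # made-up data
log_a, b_exp = fit_line(xs, [math.log(y) for y in ys_exp])
a_exp = math.exp(log_a)

# Power model y = a * x**b: regress log(y) on log(x).
ys_pow = [3.0 * x ** 1.5 for x in xs]  # made-up data
log_a2, b_pow = fit_line([math.log(x) for x in xs],
                         [math.log(y) for y in ys_pow])
a_pow = math.exp(log_a2)

print("exponential fit: a =", a_exp, "b =", b_exp)
print("power fit:       a =", a_pow, "b =", b_pow)
```

With real, noisy data the least squares criterion is applied on the log scale, so the fitted curve minimizes squared errors in log(y), not in y itself.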

A third example:
- Assume data $(x_1, y_1), \dots, (x_n, y_n)$ seem to follow a third degree polynomial
- We use multiple regression on $(x_1, x_1^2, x_1^3, y_1), (x_2, x_2^2, x_2^3, y_2), \dots$
- We get estimated a, b, c, d in a third degree polynomial curve $y = a + bx + cx^2 + dx^3$
[Figure: scatter plot of data points with a fitted third degree polynomial curve]

Doing a regression analysis
- Plot the data first, to investigate whether there is a natural relationship
- Linear or transformed model?
- Are there outliers which will unduly affect the result?
- Fit a model. Different models with the same number of parameters may be compared with $R^2$
- Check the assumptions!
- Make tests / confidence intervals for parameters
- A lot of practice is needed!

Conclusion and further options
- Regression versus correlation:
  Can include more independent variables in regression
  Gets a more detailed picture of the effect an independent variable has on the dependent variable
- What if the dependent variable only has two possible values?
- Logistic regression
  Similar to linear regression
  But the interpretations of the β's are different: $e^{\beta}$ is interpreted as an odds ratio instead of β being the slope of a line
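The odds-ratio interpretation can be made concrete with a small calculation. The sketch below assumes the usual logistic model, in which the log-odds of the outcome equal $\beta_0 + \beta_1 x$; the coefficient values are made up for illustration.

```python
import math


def probability(b0, b1, x):
    """P(outcome = 1) when the log-odds are b0 + b1*x."""
    return 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))


def odds(p):
    """Odds corresponding to probability p."""
    return p / (1.0 - p)


# Made-up coefficients (not from any fitted model).
b0, b1 = -1.0, 0.7

# Ratio of the odds at x = 3 to the odds at x = 2 (a one-unit increase).
ratio = odds(probability(b0, b1, 3.0)) / odds(probability(b0, b1, 2.0))
odds_ratio_from_beta = math.exp(b1)

print("odds ratio for a one-unit increase:", ratio)
print("exp(b1):", odds_ratio_from_beta)
```

The two printed values agree, and the same ratio is obtained for any starting value of x: a one-unit increase multiplies the odds by exp(b1), while the change in probability depends on where x starts.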