Topic 9. Regression and Correlation

BE540W Regression and Correlation, Page 1 of 43

Topics
1. Definition of the Linear Regression Model
2. Estimation
3. The Analysis of Variance Table
4. Assumptions for the Straight Line Regression
5. Hypothesis Testing
6. Confidence Interval Estimation
7. Introduction to Correlation
8. Hypothesis Test for Correlation

1. Definition of the Linear Regression Model

In the last unit, topic 8, the setting was that of two categorical (discrete) variables, such as smoking and low birth weight, and the use of chi-square tests of association and homogeneity. In this unit, topic 9, our focus is the setting of two continuous variables, such as age and weight. This topic is an introduction to simple linear regression and correlation.

Linear Regression

Linear regression models the mean µ of one random variable as a linear function of one or more other variables that are treated as fixed. The estimation and hypothesis testing involved are extensions of ideas and techniques that we have already seen. In linear regression, we observe an outcome or dependent variable Y at several levels of the independent or predictor variable X (there may be more than one predictor X, as seen later). A linear regression model assumes that the values of the predictor X have been fixed in advance of observing Y. However, this is not always the reality. Often Y and X are observed jointly and are both random variables.

Correlation

Correlation considers the association of two random variables. The techniques of estimation and hypothesis testing are the same for linear regression and correlation analyses. Exploring the relationship begins with fitting a line to the points.

We develop the linear regression model analysis for a simple example involving one predictor and one outcome.

Example. Source: Kleinbaum, Kupper, and Muller 1988

Available are pairs of observations of age and weight for n=11 chicken embryos.

    WT=Y     AGE=X    LOGWT=Z
    0.029      6      -1.538
    0.052      7      -1.284
    0.079      8      -1.102
    0.125      9      -0.903
    0.181     10      -0.742
    0.261     11      -0.583
    0.425     12      -0.372
    0.738     13      -0.132
    1.130     14       0.053
    1.882     15       0.275
    2.812     16       0.449

We'll use a familiar notation. The data are 11 pairs of $(X_i, Y_i)$ where X=AGE and Y=WT:

    $(X_1, Y_1) = (6, 0.029)$, ..., $(X_{11}, Y_{11}) = (16, 2.812)$

and equivalently, 11 pairs of $(X_i, Y_i)$ where X=AGE and Y=LOGWT:

    $(X_1, Y_1) = (6, -1.538)$, ..., $(X_{11}, Y_{11}) = (16, 0.449)$

Though simple, it helps to be clear in the research question: Does weight change with age?

In the language of analysis of variance we are asking the following: Can the variability in weight be explained, to a significant extent, by variations in age?

What is a good functional form that relates age to weight?

We begin with a plot of X=AGE versus Y=WT.

[Figure: Scatter Plot of WT vs AGE. Horizontal axis: AGE; vertical axis: WT.]

We check and learn about the following:
- The average and median of X
- The range and pattern of variability in X
- The average and median of Y
- The range and pattern of variability in Y
- The nature of the relationship between X and Y
- The strength of the relationship between X and Y
- The identification of any points that might be influential

For these data:
- The plot suggests a relationship between AGE and WT
- A straight line might fit well, but another model might be better
- We have adequate ranges of values for both AGE and WT
- There are no outliers

We might have gotten any of a variety of plots.

[Figure, three panels of y versus x:
 (a) No relationship between X and Y
 (b) Linear relationship between X and Y
 (c) Non-linear relationship between X and Y]

[Figure, three panels of y versus x, each with an arrow pointing to an outlying point:
 (a) Fit of a linear model will yield an estimated slope that is spuriously non-zero.
 (b) Fit of a linear model will yield an estimated slope that is spuriously near zero.
 (c) Fit of a linear model will yield an estimated slope that is spuriously high.]

The bowl shape of our scatter plot suggests that perhaps a better model relates the logarithm of WT to AGE:

[Figure: Scatter Plot of LOGWT vs AGE. Horizontal axis: AGE; vertical axis: LOGWT.]

We'll investigate two models.

1) $WT = \beta_0 + \beta_1 \, AGE$
2) $LOGWT = \beta_0 + \beta_1 \, AGE$
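The relation between the WT and LOGWT columns can be checked directly. The sketch below is only an illustration: it assumes the logarithm is base 10 (the notes do not say so, but the tabulated values are consistent with it) and uses the weights as read from the example table.

```python
import math

# Chick embryo weights as read from the example table (an assumption of
# this sketch: values are in the order AGE = 6, 7, ..., 16).
WT = [0.029, 0.052, 0.079, 0.125, 0.181, 0.261,
      0.425, 0.738, 1.130, 1.882, 2.812]

# Assumed transformation: base-10 logarithm, rounded to 3 decimals.
LOGWT = [round(math.log10(w), 3) for w in WT]

for w, z in zip(WT, LOGWT):
    print(f"WT = {w:5.3f}   LOGWT = {z:6.3f}")
```

If the base-10 assumption is right, the printed LOGWT values reproduce the third column of the data table.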

Recall what you might have learned in an old math class about the equation of a line.

[Figure: the equation of a line]

You might recall, too, a feel for the slope.

[Figure, three panels: Slope > 0, Slope = 0, Slope < 0]

Definition of the Straight Line Model $Y = \beta_0 + \beta_1 X$

Population:  $Y = \beta_0 + \beta_1 X + \varepsilon$
Sample:      $Y = \hat{\beta}_0 + \hat{\beta}_1 X + e$

Population:
- $Y = \beta_0 + \beta_1 X$ is the relationship in the population. It is measured with error.
- $\varepsilon$ = measurement error
- We do NOT know the value of $\beta_0$ or $\beta_1$ or $\varepsilon$.

Sample:
- $\hat{\beta}_0$, $\hat{\beta}_1$, and $e$ are our guesses of $\beta_0$, $\beta_1$ and $\varepsilon$.
- $e$ = residual
- We do have values of $\hat{\beta}_0$, $\hat{\beta}_1$ and $e$.

The values of $\hat{\beta}_0$, $\hat{\beta}_1$ and $e$ are obtained by the method of least squares estimation.

To see if $\hat{\beta}_0 \approx \beta_0$ and $\hat{\beta}_1 \approx \beta_1$ we perform regression diagnostics. Note: This is not discussed in this course; see PubHlth 640, Intermediate Biostatistics.

A little notation, sorry!

    $Y$                   = the outcome or dependent variable
    $X$                   = the predictor or independent variable
    $\mu_Y$               = the expected value of Y for all persons in the population
    $\mu_{Y|X=x}$         = the expected value of Y for the sub-population for whom X=x
    $\sigma^2_Y$          = variability of Y among all persons in the population
    $\sigma^2_{Y|X=x}$    = variability of Y for the sub-population for whom X=x

2. Estimation

We will use the method of least squares to obtain guesses of $\beta_0$ and $\beta_1$.

Goal

From the many possible lines through the scatter of points, choose the one line that is closest to the data.

What do we mean by "close"?

We'd like the vertical distance between each observed $Y_i$ and its corresponding fitted $\hat{Y}_i$ to be as small as possible.

It is not possible to choose $\hat{\beta}_0$ and $\hat{\beta}_1$ so that it minimizes $(Y_1 - \hat{Y}_1)^2$ and minimizes individually $(Y_2 - \hat{Y}_2)^2$ and ... minimizes individually $(Y_n - \hat{Y}_n)^2$.

So, instead, we choose $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize their total:

$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \left( Y_i - [\hat{\beta}_0 + \hat{\beta}_1 X_i] \right)^2$$

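A picture gives a feel for the fact that many lines are possible and that we seek the closest in the sense of vertical distances being as small as possible.

[Figure: scatter of points with several candidate lines and vertical distances to one of them]

The expression to be minimized,

$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \left( Y_i - [\hat{\beta}_0 + \hat{\beta}_1 X_i] \right)^2 ,$$

has a variety of names:
- residual sum of squares
- sum of squares about the regression line
- sum of squares due error (SSE)

Related notation: $\sigma^2_{Y|X}$, the variance of Y about the regression line.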
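To make "closest" concrete, the following sketch evaluates the sum of squared vertical distances for a few candidate lines through the chick embryo data (Y=LOGWT, X=AGE). The first two candidate lines are arbitrary guesses made up for this illustration; the third uses the coefficients that the SAS output reports later in these notes.

```python
# Chick embryo data as read from the example table (AGE in days, LOGWT).
AGE = list(range(6, 17))
LOGWT = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
         -0.372, -0.132, 0.053, 0.275, 0.449]

def sse(b0, b1):
    """Sum of squared vertical distances from the points to the line b0 + b1*x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(AGE, LOGWT))

# Hypothetical candidate lines, plus the line reported by SAS.
candidates = {
    "flat line":  (-0.53, 0.0),
    "too steep":  (-3.80, 0.30),
    "SAS fit":    (-2.68925, 0.19589),
}
for name, (b0, b1) in candidates.items():
    print(f"{name:10s}  SSE = {sse(b0, b1):.5f}")
```

Any line other than the least squares line gives a larger total, which is the sense in which the fitted line is the closest one.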

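For the calculus lover,

A little calculus yields the solution for the guesses $\hat{\beta}_0$ and $\hat{\beta}_1$.

Consider

$$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \left( Y_i - [\hat{\beta}_0 + \hat{\beta}_1 X_i] \right)^2$$

Step #1: Differentiate with respect to $\hat{\beta}_1$. Set the derivative equal to 0 and solve.

Step #2: Differentiate with respect to $\hat{\beta}_0$. Set the derivative equal to 0, insert $\hat{\beta}_1$ and solve.

For the non-calculus lover, here are the estimates of $\beta_0$ and $\beta_1$:

- $\beta_1$ is the slope. Its estimate is denoted $\hat{\beta}_1$ or $b_1$.
- $\beta_0$ is the intercept. Its estimate is denoted $\hat{\beta}_0$ or $b_0$.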
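For readers who want to see the two steps written out, here is one way the calculus can go. It is a sketch of the standard derivation, not necessarily the order of algebra the notes had in mind; it uses only the symbols defined above.

```latex
% Setting both partial derivatives of SSE to zero gives the normal equations.
\begin{align*}
\frac{\partial\, SSE}{\partial \hat{\beta}_0}
  &= -2 \sum_{i=1}^{n} \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i \right) = 0
  &&\Longrightarrow\quad \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \\[4pt]
\frac{\partial\, SSE}{\partial \hat{\beta}_1}
  &= -2 \sum_{i=1}^{n} X_i \left( Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i \right) = 0
\end{align*}
% Substituting the expression for the intercept into the second equation:
\begin{align*}
\sum_{i=1}^{n} X_i \left[ (Y_i - \bar{Y}) - \hat{\beta}_1 (X_i - \bar{X}) \right] = 0
  \quad\Longrightarrow\quad
\hat{\beta}_1
  = \frac{\sum_{i=1}^{n} X_i (Y_i - \bar{Y})}{\sum_{i=1}^{n} X_i (X_i - \bar{X})}
  = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}
\end{align*}
```

The last equality holds because $\sum (Y_i - \bar{Y}) = 0$ and $\sum (X_i - \bar{X}) = 0$, so subtracting $\bar{X}$ from each $X_i$ in the numerator and denominator changes nothing.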

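Some very helpful preliminary calculations

$$S_{xx} = \sum (X_i - \bar{X})^2 = \sum X_i^2 - N \bar{X}^2$$
$$S_{yy} = \sum (Y_i - \bar{Y})^2 = \sum Y_i^2 - N \bar{Y}^2$$
$$S_{xy} = \sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum X_i Y_i - N \bar{X} \bar{Y}$$

Note: These expressions make use of a special notation called the summation notation. The capital S indicates summation. In $S_{xy}$, the first subscript x is saying $(x - \bar{x})$. The second subscript y is saying $(y - \bar{y})$.

Slope

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{\widehat{cov}(X, Y)}{\widehat{var}(X)} = \frac{S_{xy}}{S_{xx}}$$

Intercept

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

Prediction of Y

$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X = b_0 + b_1 X$$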
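The preliminary calculations translate almost line for line into code. The sketch below applies them to the chick embryo data with Y=LOGWT and X=AGE, using the data values as read from the example table; the variable names are chosen here for illustration.

```python
# Chick embryo data as read from the example table.
AGE = list(range(6, 17))
LOGWT = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
         -0.372, -0.132, 0.053, 0.275, 0.449]

n = len(AGE)
xbar = sum(AGE) / n
ybar = sum(LOGWT) / n

# Preliminary calculations.
Sxx = sum((x - xbar) ** 2 for x in AGE)
Syy = sum((y - ybar) ** 2 for y in LOGWT)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(AGE, LOGWT))

# Least squares estimates of slope and intercept.
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar

print(f"Sxx = {Sxx:.3f}, Syy = {Syy:.5f}, Sxy = {Sxy:.3f}")
print(f"slope b1 = {b1:.5f}, intercept b0 = {b0:.5f}")
```

The slope and intercept printed here can be compared with the SAS parameter estimates for the LOGWT model shown later in the notes.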

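Do these estimates make sense?

Slope

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{\widehat{cov}(X, Y)}{\widehat{var}(X)}$$

The linear movement in Y with linear movement in X is measured relative to the variability in X.

$\hat{\beta}_1 = 0$ says: With a unit change in X, overall there is a 50-50 chance that Y increases versus decreases.

$\hat{\beta}_1 \neq 0$ says: With a unit increase in X, Y increases also ($\hat{\beta}_1 > 0$) or Y decreases ($\hat{\beta}_1 < 0$).

Intercept

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

If the linear model is incorrect, or if the true model does not have a linear component, we obtain $\hat{\beta}_1 = 0$ and $\hat{\beta}_0 = \bar{Y}$ as our best guess of an unknown Y.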

Illustration

SAS Code.

    data temp;
      input wt age logwt;
      label wt="Weight, Y"
            age="Age, X"
            logwt="Log(Weight), Y";
      cards;
    0.029  6 -1.538
    0.052  7 -1.284
    0.079  8 -1.102
    0.125  9 -0.903
    0.181 10 -0.742
    0.261 11 -0.583
    0.425 12 -0.372
    0.738 13 -0.132
    1.130 14  0.053
    1.882 15  0.275
    2.812 16  0.449
    ;
    run; quit;

    proc reg data=temp simple;  /* option simple produces simple descriptives */
      title "Regression of Y=Weight on X=Age";
      model wt=age;
    run; quit;

Partial listing of output ...

    Parameter Estimates
                                   Parameter    Standard
    Variable    Label        DF     Estimate       Error    t Value    Pr > |t|
    Intercept   Intercept     1     -1.88453     0.52584      -3.58      0.0059
    age         Age, X        1      0.23507     0.04594       5.12      0.0006

Annotated

    Parameter Estimates
                                   Parameter    Standard
    Variable    Label        DF     Estimate       Error    t Value    Pr > |t|
    Intercept   Intercept     1     -1.88453     0.52584      -3.58      0.0059
                                    = intercept = $\hat{\beta}_0$
    age         Age, X        1      0.23507     0.04594       5.12      0.0006
                                    = slope = $\hat{\beta}_1$

The fitted line is therefore WT = -1.88453 + 0.23507*AGE

Let's overlay the fitted line on our scatterplot.

[Figure: Scatter Plot of WT vs AGE with the fitted line overlaid. Horizontal axis: AGE; vertical axis: WT.]

As we might have guessed, the straight line model may not be the best choice. The bowl shape of the scatter plot does have a linear component, however. Without the plot, we might have believed the straight line fit is okay.

Let's try a straight line model fit to Y=LOGWT versus X=AGE.

Partial listing of output ...

    Parameter Estimates
                                   Parameter    Standard
    Variable    Label        DF     Estimate       Error    t Value    Pr > |t|
    Intercept   Intercept     1     -2.68925     0.03064     -87.78      <.0001
    age         Age, X        1      0.19589     0.00268      73.18      <.0001

Annotated

    Parameter Estimates
                                   Parameter    Standard
    Variable    Label        DF     Estimate       Error    t Value    Pr > |t|
    Intercept   Intercept     1     -2.68925     0.03064     -87.78      <.0001
                                    = intercept = $\hat{\beta}_0$
    age         Age, X        1      0.19589     0.00268      73.18      <.0001
                                    = slope = $\hat{\beta}_1$

Thus, the fitted line is LOGWT = -2.68925 + 0.19589*AGE

Now the scatterplot with the overlay of the fitted line looks much better. Further discussion will consider the model that relates Y=LOGWT to X=AGE.

[Figure: Scatter Plot of LOGWT vs AGE with the fitted line overlaid. Horizontal axis: AGE; vertical axis: LOGWT.]

Now You Try: Prediction of Weight from Height
Source: Dixon and Massey (1969)

    Individual    Height (X)    Weight (Y)
         1            60           110
         2            60           135
         3            60           120
         4            62           120
         5            62           140
         6            62           130
         7            62           135
         8            64           150
         9            64           145
        10            70           170
        11            70           185
        12            70           160

It helps to do the preliminary calculations:

    $\bar{X} = 63.833$        $\sum X_i^2 = 49,068$
    $\bar{Y} = 141.667$       $\sum Y_i^2 = 246,100$
    $\sum X_i Y_i = 109,380$

    $S_{xx} = 171.667$        $S_{yy} = 5,266.667$        $S_{xy} = 863.333$

Slope

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{863.333}{171.667} = 5.0291$$

Intercept

$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = 141.667 - (5.0291)(63.8333) = -179.3573$$
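The hand calculation can be checked with a few lines of code. This is a sketch under a stated assumption: the twelve (height, weight) pairs below are one reading of the data table that reproduces the printed means and the values of $S_{xx}$, $S_{yy}$ and $S_{xy}$, but the individual values should be confirmed against the source.

```python
# Height (X) and weight (Y) for 12 individuals.
# ASSUMPTION: this reading of the table is chosen because it reproduces the
# printed summaries (Sxx = 171.667, Syy = 5266.667, Sxy = 863.333).
HEIGHT = [60, 60, 60, 62, 62, 62, 62, 64, 64, 70, 70, 70]
WEIGHT = [110, 135, 120, 120, 140, 130, 135, 150, 145, 170, 185, 160]

n = len(HEIGHT)
xbar = sum(HEIGHT) / n
ybar = sum(WEIGHT) / n

Sxx = sum((x - xbar) ** 2 for x in HEIGHT)
Syy = sum((y - ybar) ** 2 for y in WEIGHT)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(HEIGHT, WEIGHT))

b1 = Sxy / Sxx           # slope
b0 = ybar - b1 * xbar    # intercept

print(f"Sxx = {Sxx:.3f}, Syy = {Syy:.3f}, Sxy = {Sxy:.3f}")
print(f"slope = {b1:.4f}, intercept = {b0:.4f}")
```

Carrying full precision through the calculation gives an intercept that differs from the hand-computed value in the fourth significant digit, which is what rounding the slope to 5.0291 before multiplying would be expected to do.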

3. The Analysis of Variance Table

In Topic 1, Summarizing Data, we learned that the numerator of the sample variance of the Y data is $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$. In regression settings where Y is the outcome variable, this same quantity $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$ is appreciated as the total variance of the Y's. As we will see, other names for this are total sum of squares, total corrected, and SSY. (Note: "corrected" has to do with subtracting the mean before squaring.)

An analysis of variance table is all about partitioning the total variance of the Y's (corrected) into two components:

1. Due residual (the individual Y about the individual prediction $\hat{Y}$)
2. Due regression (the prediction $\hat{Y}$ about the overall mean $\bar{Y}$)

Aside: Why are we interested in such a partition? We'd like to know if, within the data, there exists the suggestion of a linear relationship ("signal") that can be discerned from chance variability ("noise"):

1) the leftover variability of the observed Y about the predicted $\hat{Y}$ ("noise")
2) the explained variability of the predicted $\hat{Y}$ about the overall mean $\bar{Y}$ ("signal")

Here is the partition (Note: Look closely and you'll see that both sides are the same):

$$(Y_i - \bar{Y}) = (Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y})$$

Some algebra (not shown) reveals a nice partition of the total variability:

$$\sum (Y_i - \bar{Y})^2 = \sum (Y_i - \hat{Y}_i)^2 + \sum (\hat{Y}_i - \bar{Y})^2$$

Total Sum of Squares = Due Error Sum of Squares + Due Model Sum of Squares
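The algebra that is "not shown" comes down to one fact: the cross-product term vanishes for the least squares line. A sketch of the standard argument, using the same symbols as above and writing $e_i = Y_i - \hat{Y}_i$ for the residual:

```latex
\begin{align*}
\sum_{i=1}^{n} (Y_i - \bar{Y})^2
  &= \sum_{i=1}^{n} \left[ (Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y}) \right]^2 \\
  &= \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2
   + \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2
   + 2 \sum_{i=1}^{n} e_i \, (\hat{Y}_i - \bar{Y})
\end{align*}
% Because \hat{Y}_i - \bar{Y} = \hat{\beta}_1 (X_i - \bar{X}), the cross term is
\begin{align*}
2 \hat{\beta}_1 \sum_{i=1}^{n} e_i (X_i - \bar{X})
  = 2 \hat{\beta}_1 \left[ \sum_{i=1}^{n} e_i X_i - \bar{X} \sum_{i=1}^{n} e_i \right]
  = 0
\end{align*}
```

Both sums in the bracket are zero for the least squares line: they are exactly the two derivative conditions (set equal to 0) that define $\hat{\beta}_0$ and $\hat{\beta}_1$.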

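A closer look

Total Sum of Squares = Due Model Sum of Squares + Due Error Sum of Squares

$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

- $(Y_i - \bar{Y})$ = deviation of $Y_i$ from $\bar{Y}$ that is to be explained
- $(\hat{Y}_i - \bar{Y})$ = "due model", "signal", "systematic", "due regression"
- $(Y_i - \hat{Y}_i)$ = "due error", "noise", or "residual"

In the world of regression analyses, we seek to explain the total variability $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$.

What happens when $\beta_1 \neq 0$?
- A straight line relationship is helpful.
- Best guess is $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$.
- Due model is LARGE because
  $(\hat{Y} - \bar{Y}) = ([\hat{\beta}_0 + \hat{\beta}_1 X] - \bar{Y}) = \bar{Y} - \hat{\beta}_1 \bar{X} + \hat{\beta}_1 X - \bar{Y} = \hat{\beta}_1 (X - \bar{X})$.
- Due error has to be small.
- So due(model)/due(error) will be large.

What happens when $\beta_1 = 0$?
- A straight line relationship is not helpful.
- Best guess is $\hat{Y} = \hat{\beta}_0 = \bar{Y}$.
- Due error is nearly the TOTAL because
  $(Y - \hat{Y}) = (Y - \hat{\beta}_0) = (Y - \bar{Y})$.
- Due regression has to be small.
- So due(model)/due(error) will be small.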

How to Partition the Total Variance

1. The "total" or "total, corrected" refers to the variability of Y about $\bar{Y}$.
   - $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$ is called the "total sum of squares".
   - Degrees of freedom = df = (n-1)
   - Division of the total sum of squares by its df yields the "total mean square".

2. The "residual" or "due error" refers to the variability of Y about $\hat{Y}$.
   - $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$ is called the "residual sum of squares".
   - Degrees of freedom = df = (n-2)
   - Division of the residual sum of squares by its df yields the "residual mean square".

3. The "regression" or "due model" refers to the variability of $\hat{Y}$ about $\bar{Y}$.
   - $\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 = \hat{\beta}_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2$ is called the "regression sum of squares".
   - Degrees of freedom = df = 1
   - Division of the regression sum of squares by its df yields the "regression mean square" or "model mean square". This is an example of a variance component.

    Source              df       Sum of Squares                                   Mean Square
    Regression          1        SSR = $\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$   SSR/1
    Error               (n-2)    SSE = $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$       SSE/(n-2)
    Total, corrected    (n-1)    SST = $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$

Hint: Mean square = (Sum of squares)/(df)
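The three sums of squares can be computed directly from their definitions. The sketch below does so for the chick embryo data (Y=LOGWT, X=AGE, values as read from the example table) and confirms numerically that the total splits into the two pieces.

```python
# Chick embryo data as read from the example table.
AGE = list(range(6, 17))
LOGWT = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
         -0.372, -0.132, 0.053, 0.275, 0.449]

n = len(AGE)
xbar = sum(AGE) / n
ybar = sum(LOGWT) / n
Sxx = sum((x - xbar) ** 2 for x in AGE)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(AGE, LOGWT))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar
fitted = [b0 + b1 * x for x in AGE]

SST = sum((y - ybar) ** 2 for y in LOGWT)                # total, corrected
SSE = sum((y - f) ** 2 for y, f in zip(LOGWT, fitted))   # residual
SSR = sum((f - ybar) ** 2 for f in fitted)               # regression

print(f"{'Source':18s}{'df':>4s}{'Sum of Squares':>16s}{'Mean Square':>14s}")
print(f"{'Regression':18s}{1:4d}{SSR:16.5f}{SSR / 1:14.5f}")
print(f"{'Error':18s}{n - 2:4d}{SSE:16.5f}{SSE / (n - 2):14.8f}")
print(f"{'Total, corrected':18s}{n - 1:4d}{SST:16.5f}")
```

The regression sum of squares also equals $\hat{\beta}_1^2 S_{xx}$, which gives a quick check on the arithmetic.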

Be careful! The question we may ask from an analysis of variance table is a limited one.

Does the fit of the straight line model explain a significant portion of the variability of the individual Y about $\bar{Y}$? Is this better than using $\bar{Y}$ alone?

We are NOT asking:
- Is the choice of the straight line model correct?
- Would another functional form be a better choice?

We'll use a hypothesis test approach and the method of "proof by contradiction".

- We begin with a null hypothesis that says $\beta_1 = 0$ ("no linear relationship").
- Evaluation will focus on the comparison of the "due regression" mean square to the "residual" mean square.
- Recall that we reasoned the following:
  If $\beta_1 \neq 0$, then due(regression)/due(residual) will be LARGE.
  If $\beta_1 = 0$, then due(regression)/due(residual) will be SMALL.
- Our p-value calculation will answer the question: If $\beta_1 = 0$ truly, what are the chances of obtaining a value of due(regression)/due(residual) as large or larger than that observed?

To calculate "chances" we need a probability model. So far, we have not needed one.

4. Assumptions for a Straight Line Regression Analysis

In performing least squares estimation, we did not use a probability model. We were doing geometry. Hypothesis testing requires some assumptions and a probability model.

Assumptions

- The separate observations $Y_1, Y_2, \ldots, Y_n$ are independent.
- The values of the predictor variable X are fixed and measured without error.
- For each value of the predictor variable X=x, the distribution of values of Y follows a normal distribution with mean equal to $\mu_{Y|X=x}$ and common variance equal to $\sigma^2_{Y|x}$.
- The separate means $\mu_{Y|X=x}$ lie on a straight line; that is, $\mu_{Y|X=x} = \beta_0 + \beta_1 x$.

Schematically, here is what the situation looks like (courtesy: Stan Lemeshow)

[Figure: normal distributions of Y at several values of X, with means lying on a straight line]

With these assumptions, we can assess the significance of the variance explained by the model.

$$F = \frac{msq(model)}{msq(residual)} \quad \text{with df} = 1, (n-2)$$

When $\beta_1 = 0$:
- Due model MSR has expected value $\sigma^2_{Y|X}$
- Due residual MSE has expected value $\sigma^2_{Y|X}$
- F = (MSR)/MSE will be close to 1

When $\beta_1 \neq 0$:
- Due model MSR has expected value $\sigma^2_{Y|X} + \beta_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2$
- Due residual MSE has expected value $\sigma^2_{Y|X}$
- F = (MSR)/MSE will be LARGER than 1

We obtain the analysis of variance table for the model of Y=LOGWT to X=AGE. The following is the SAS output with annotations.

    Analysis of Variance
                                 Sum of          Mean
    Source             DF       Squares        Square    F Value    Pr > F
    Model               1       4.22106       4.22106    5355.60    <.0001
    Error               9       0.00709    0.00078816
    Corrected Total    10       4.22815

    Annotation: F Value = MSQ(Regression)/MSQ(Residual)

    Root MSE            0.02807    R-Square    0.9983   = SSQ(regression)/SSQ(total)
    Dependent Mean     -0.53445    Adj R-Sq    0.9981   = R-squared adjusted for n and # predictors
    Coeff Var          -5.25286

This output corresponds to the following.

    Source              df            Sum of Squares                                             Mean Square
    Regression          1             SSR = $\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$ = 4.22063   SSR/1 = 4.22063
    Error               (n-2) = 9     SSE = $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$ = 0.00705       SSE/(n-2) = 7.838E-4
    Total, corrected    (n-1) = 10    SST = $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$ = 4.22768

Other information in this output:

R-SQUARED = Sum of squares regression/Sum of squares total is the proportion of the "total" that we have been able to explain with the fit of the straight line model.
- Be careful! As predictors are added to the model, R-SQUARED can only increase. Eventually, we need to "adjust" this measure to take this into account. See ADJUSTED R-SQUARED.

We also get an overall F test of the null hypothesis that the simple linear model does not explain significantly more variability in LOGWT than the average LOGWT.

    F = MSQ(Regression)/MSQ(Residual) = 4.22063/0.0007838 = 5384.94 with df = 1, 9

Achieved significance < 0.0001. Reject $H_O$. Conclude that the fitted line is a significant improvement over the average LOGWT.
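The F statistic and R-squared follow from the sums of squares in a couple of lines. The sketch below uses the shortcut forms SST = $S_{yy}$ and SSR = $S_{xy}^2 / S_{xx}$ for the chick embryo data. The adjusted R-squared line uses the usual formula $1 - (1 - R^2)(n-1)/(n-p-1)$ with p = 1 predictor, which is an assumption of this sketch since the notes do not state the formula.

```python
# Chick embryo data as read from the example table.
AGE = list(range(6, 17))
LOGWT = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
         -0.372, -0.132, 0.053, 0.275, 0.449]

n = len(AGE)
xbar = sum(AGE) / n
ybar = sum(LOGWT) / n
Sxx = sum((x - xbar) ** 2 for x in AGE)
Syy = sum((y - ybar) ** 2 for y in LOGWT)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(AGE, LOGWT))

SST = Syy               # total, corrected
SSR = Sxy ** 2 / Sxx    # regression (equals slope^2 * Sxx)
SSE = SST - SSR         # residual

MSR = SSR / 1
MSE = SSE / (n - 2)
F = MSR / MSE           # overall F, df = 1, n-2
R2 = SSR / SST
# Assumed formula for adjusted R-squared with one predictor.
adj_R2 = 1 - (1 - R2) * (n - 1) / (n - 2)

print(f"F = {F:.2f} on 1 and {n - 2} df")
print(f"R-squared = {R2:.4f}, adjusted R-squared = {adj_R2:.4f}")
```

Computed at full precision, F agrees with the SAS value; the hand-computed 5384.94 in the notes differs from it only because the sums of squares were rounded before dividing.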

5. Hypothesis Testing

Straight Line Model: $Y = \beta_0 + \beta_1 X$

1) Overall F-Test
2) Test of slope
3) Test of intercept

1) Overall F-Test

Research Question: Does the fitted model, the $\hat{Y}$, explain significantly more of the total variability of the Y about $\bar{Y}$ than does $\bar{Y}$?

Assumptions: As before.

$H_O$ and $H_A$:

    $H_O: \beta_1 = 0$
    $H_A: \beta_1 \neq 0$

Test Statistic:

$$F = \frac{msq(regression)}{msq(residual)}, \quad df = 1, (n-2)$$

Evaluation rule:

When the null hypothesis is true, the value of F should be close to 1.

Alternatively, when $\beta_1 \neq 0$, the value of F will be LARGER than 1.

Thus, our p-value calculation answers: "What are the chances of obtaining our value of the F or one that is larger if we believe the null hypothesis that $\beta_1 = 0$?"

Calculations:

For our data, we obtain

$$\text{p-value} = \Pr\left[ F_{1,(n-2)} \geq \frac{msq(model)}{msq(residual)} \;\middle|\; \beta_1 = 0 \right] = \Pr\left[ F_{1,9} \geq 5384.94 \right] \ll 0.0001$$

Evaluate:

Under the null hypothesis that $\beta_1 = 0$, the chances of obtaining a value of F that is so far away from the expected value of 1, with a value of 5384.94, is less than 1 chance in 10,000. This is a very small likelihood!

Interpret:

We have learned that, at least, the fitted straight line model does a much better job of explaining the variability in LOGWT than a model that allows only for the average LOGWT.

Later (BE640, Intermediate Biostatistics), we'll see that the analysis does not stop here.

2) Test of the Slope, $\beta_1$

Some interesting notes!
- The overall F test and the test of the slope are equivalent.
- The test of the slope uses a t-score approach to hypothesis testing.
- It can be shown that { t-score for slope }$^2$ = { overall F }

Research Question: Is the slope $\beta_1 = 0$?

Assumptions: As before.

$H_O$ and $H_A$:

    $H_O: \beta_1 = 0$
    $H_A: \beta_1 \neq 0$

Test Statistic:

To compute the t-score, we need an estimate of the standard error of $\hat{\beta}_1$:

$$\widehat{SE}(\hat{\beta}_1) = \sqrt{ msq(residual) \left[ \frac{1}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right] }$$

Our t-score is therefore:

$$t\text{-score} = \left[ \frac{(observed) - (expected)}{\widehat{se}(expected)} \right] = \left[ \frac{\hat{\beta}_1 - 0}{\widehat{se}(\hat{\beta}_1)} \right], \quad df = (n-2)$$

We can find this information in our output. The following is the SAS output with annotations.

    Parameter Estimates
                                   Parameter    Standard
    Variable    Label        DF     Estimate       Error    t Value    Pr > |t|
    Intercept   Intercept     1     -2.68925     0.03064     -87.78      <.0001
    age         Age, X        1      0.19589     0.00268      73.18      <.0001

    Annotation: t Value = Estimate/Error; for the slope, 73.18 = 0.19589/0.00268

Recall what we mean by a t-score:
T=73.38 says "the estimated slope is estimated to be 73.38 standard error units away from its expected value of zero".

Check that { t-score }$^2$ = { Overall F }:
$[73.38]^2 = 5384.62$, which is close.

Evaluation rule:

When the null hypothesis is true, the value of t should be close to zero.

Alternatively, when $\beta_1 \neq 0$, the value of t will be DIFFERENT from 0.

Here, our p-value calculation answers: "What are the chances of obtaining our value of the t or one that is farther away from 0 if we believe the null hypothesis that $\beta_1 = 0$?"

Calculations:

For our data, we obtain

$$\text{p-value} = \Pr\left[ \, |t_{(n-2)}| \geq \left| \frac{\hat{\beta}_1 - 0}{\widehat{se}(\hat{\beta}_1)} \right| \, \right] = \Pr\left[ \, |t_9| \geq 73.38 \, \right] \ll 0.0001$$

Evaluate:

Under the null hypothesis that $\beta_1 = 0$, the chances of obtaining a t-score value that is 73.38 or more standard error units away from the expected value of 0 is less than 1 chance in 10,000.

Interpret:

The inference is the same as that for the overall F test. The fitted straight line model does a much better job of explaining the variability in LOGWT than the sample mean.
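The equivalence between the slope t-score and the overall F can be verified numerically. The sketch below computes the standard error of the slope from the formula above, forms the t-score, and compares its square with F, all for the chick embryo data (Y=LOGWT, X=AGE).

```python
import math

# Chick embryo data as read from the example table.
AGE = list(range(6, 17))
LOGWT = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
         -0.372, -0.132, 0.053, 0.275, 0.449]

n = len(AGE)
xbar = sum(AGE) / n
ybar = sum(LOGWT) / n
Sxx = sum((x - xbar) ** 2 for x in AGE)
Syy = sum((y - ybar) ** 2 for y in LOGWT)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(AGE, LOGWT))

b1 = Sxy / Sxx
SSR = Sxy ** 2 / Sxx
MSE = (Syy - SSR) / (n - 2)       # msq(residual)

se_b1 = math.sqrt(MSE / Sxx)      # standard error of the slope
t_slope = (b1 - 0) / se_b1        # t-score under H0: beta1 = 0
F = SSR / MSE                     # overall F

print(f"se(b1) = {se_b1:.5f}, t = {t_slope:.2f}, t^2 = {t_slope ** 2:.2f}, F = {F:.2f}")
```

At full precision the squared t-score and F coincide, so the small gaps seen in the hand calculations are consistent with rounding.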

3) Test of the Intercept, $\beta_0$

This pertains to whether or not the straight line relationship passes through the origin. It is rarely of interest.

Research Question: Is the intercept $\beta_0 = 0$?

Assumptions: As before.

$H_O$ and $H_A$:

    $H_O: \beta_0 = 0$
    $H_A: \beta_0 \neq 0$

Test Statistic:

To compute the t-score for the intercept, we need an estimate of the standard error of $\hat{\beta}_0$:

$$\widehat{SE}(\hat{\beta}_0) = \sqrt{ msq(residual) \left[ \frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right] }$$

Our t-score is therefore:

$$t\text{-score} = \left[ \frac{(observed) - (expected)}{\widehat{se}(expected)} \right] = \left[ \frac{\hat{\beta}_0 - 0}{\widehat{se}(\hat{\beta}_0)} \right], \quad df = (n-2)$$

We can find this information in our output. The following is the SAS output with annotations.

    Parameter Estimates
                                   Parameter    Standard
    Variable    Label        DF     Estimate       Error    t Value    Pr > |t|
    Intercept   Intercept     1     -2.68925     0.03064     -87.78      <.0001
    age         Age, X        1      0.19589     0.00268      73.18      <.0001

    Annotation: t Value = Estimate/Error; for the intercept, -87.78 = -2.68925/0.03064

This t=-87.78 says "the estimated intercept is estimated to be 87.78 standard error units away from its expected value of zero".

Evaluation rule:

When the null hypothesis is true, the value of t should be close to zero.

Alternatively, when $\beta_0 \neq 0$, the value of t will be DIFFERENT from 0.

Our p-value calculation answers: "What are the chances of obtaining our value of the t or one that is farther away from 0 if we believe the null hypothesis that $\beta_0 = 0$?"

Calculations:

$$\text{p-value} = \Pr\left[ \, |t_{(n-2)}| \geq \left| \frac{\hat{\beta}_0 - 0}{\widehat{se}(\hat{\beta}_0)} \right| \, \right] = \Pr\left[ \, |t_9| \geq 87.78 \, \right] \ll 0.0001$$

Evaluate:

Under the null hypothesis that $\beta_0 = 0$, the chances of obtaining a t-score value that is 87.78 or more standard error units away from the expected value of 0 is less than 1 chance in 10,000.

Interpret:

The inference is that the straight line relationship between Y=LOGWT and X=AGE does not pass through the origin.
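The same check works for the intercept. The sketch below applies the standard error formula for $\hat{\beta}_0$ to the chick embryo data and reproduces the t-score reported by SAS.

```python
import math

# Chick embryo data as read from the example table.
AGE = list(range(6, 17))
LOGWT = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
         -0.372, -0.132, 0.053, 0.275, 0.449]

n = len(AGE)
xbar = sum(AGE) / n
ybar = sum(LOGWT) / n
Sxx = sum((x - xbar) ** 2 for x in AGE)
Syy = sum((y - ybar) ** 2 for y in LOGWT)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(AGE, LOGWT))

b1 = Sxy / Sxx
b0 = ybar - b1 * xbar
MSE = (Syy - Sxy ** 2 / Sxx) / (n - 2)   # msq(residual)

# Standard error of the intercept and its t-score under H0: beta0 = 0.
se_b0 = math.sqrt(MSE * (1 / n + xbar ** 2 / Sxx))
t_intercept = (b0 - 0) / se_b0

print(f"b0 = {b0:.5f}, se(b0) = {se_b0:.5f}, t = {t_intercept:.2f}")
```

The $\bar{X}^2 / S_{xx}$ term dominates here because the ages (6 to 16 days) lie well away from X = 0, which is one reason the intercept is estimated less precisely than the slope.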

6. Confidence Interval Estimation

Straight Line Model: $Y = \beta_0 + \beta_1 X$

Recall (Topic 6, Estimation) that there are 3 elements of a confidence interval:

1) Best single guess (estimate)
2) Standard error of the best single guess (SE[estimate])
3) Confidence coefficient
   These will be percentiles from the t distribution with df=(n-2).
   For a 95% confidence interval, this will be the 97.5th percentile.
   For a $(1-\alpha)100\%$ confidence interval, this will be the $(1-\alpha/2)100$th percentile.

The generic form of a confidence interval is then

    Lower limit = ( Estimate ) - ( confidence coefficient )*SE( estimate )
    Upper limit = ( Estimate ) + ( confidence coefficient )*SE( estimate )

We might want confidence interval estimates of the following 4 parameters:

(1) Slope
(2) Intercept
(3) Mean of the subset of the population for whom $X = x_0$
(4) Individual response for a person for whom $X = x_0$

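1) SLOPE

    estimate = $\hat{\beta}_1$

$$\widehat{se}(b_1) = \sqrt{ msq(residual) \, \frac{1}{\sum_{i=1}^{n} (X_i - \bar{X})^2} }$$

2) INTERCEPT

    estimate = $\hat{\beta}_0$

$$\widehat{se}(b_0) = \sqrt{ msq(residual) \left[ \frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right] }$$

3) MEAN at $X = x_0$

    estimate = $\hat{Y}_{X=x_0} = \hat{\beta}_0 + \hat{\beta}_1 x_0$

$$\widehat{se} = \sqrt{ msq(residual) \left[ \frac{1}{n} + \frac{(x_0 - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right] }$$

4) INDIVIDUAL with $X = x_0$

    estimate = $\hat{Y}_{X=x_0} = \hat{\beta}_0 + \hat{\beta}_1 x_0$

$$\widehat{se} = \sqrt{ msq(residual) \left[ 1 + \frac{1}{n} + \frac{(x_0 - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right] }$$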
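Cases 3) and 4) differ only by the extra "1 +" inside the square root, and a short function makes that visible. The sketch below computes both intervals at a chosen $x_0$ for the chick embryo data. The function name `intervals` and the hard-coded critical value 2.262 (the tabled 97.5th percentile of Student's t with 9 df) are choices made for this illustration.

```python
import math

# Chick embryo data as read from the example table.
AGE = list(range(6, 17))
LOGWT = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
         -0.372, -0.132, 0.053, 0.275, 0.449]

n = len(AGE)
xbar = sum(AGE) / n
ybar = sum(LOGWT) / n
Sxx = sum((x - xbar) ** 2 for x in AGE)
Syy = sum((y - ybar) ** 2 for y in LOGWT)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(AGE, LOGWT))
b1 = Sxy / Sxx
b0 = ybar - b1 * xbar
MSE = (Syy - Sxy ** 2 / Sxx) / (n - 2)

T_CRIT = 2.262  # assumed: tabled 97.5th percentile of t with 9 df

def intervals(x0):
    """Return (estimate, 95% CI for the mean, 95% CI for an individual) at X = x0."""
    est = b0 + b1 * x0
    core = 1 / n + (x0 - xbar) ** 2 / Sxx
    se_mean = math.sqrt(MSE * core)          # case 3: mean at x0
    se_indiv = math.sqrt(MSE * (1 + core))   # case 4: individual at x0
    ci_mean = (est - T_CRIT * se_mean, est + T_CRIT * se_mean)
    ci_indiv = (est - T_CRIT * se_indiv, est + T_CRIT * se_indiv)
    return est, ci_mean, ci_indiv

est, ci_mean, ci_indiv = intervals(6)
print(f"x0 = 6: estimate = {est:.4f}")
print(f"  95% CI, mean:       ({ci_mean[0]:.4f}, {ci_mean[1]:.4f})")
print(f"  95% CI, individual: ({ci_indiv[0]:.4f}, {ci_indiv[1]:.4f})")
```

The interval for an individual is always wider than the interval for the mean at the same $x_0$, and both widen as $x_0$ moves away from $\bar{X}$.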

Illustration for the model which fits Y=LOGWT to X=AGE.

Recall that we obtained the following fit:

    Parameter Estimates
                                   Parameter    Standard
    Variable    Label        DF     Estimate       Error    t Value    Pr > |t|
    Intercept   Intercept     1     -2.68925     0.03064     -87.78      <.0001
    age         Age, X        1      0.19589     0.00268      73.18      <.0001

95% Confidence Interval for the Slope, $\beta_1$

1) Best single guess (estimate) = $\hat{\beta}_1 = 0.19589$
2) Standard error of the best single guess (SE[estimate]) = $\widehat{se}(\hat{\beta}_1) = 0.00268$
3) Confidence coefficient = 97.5th percentile of Student t = $t_{.975;\, df=9} = 2.26$

95% Confidence Interval for Slope $\beta_1$ = Estimate ± ( confidence coefficient )*SE
    = 0.19589 ± (2.26)(0.00268)
    = (0.1898, 0.2019)

95% Confidence Interval for the Intercept, $\beta_0$

1) Best single guess (estimate) = $\hat{\beta}_0 = -2.68925$
2) Standard error of the best single guess (SE[estimate]) = $\widehat{se}(\hat{\beta}_0) = 0.03064$
3) Confidence coefficient = 97.5th percentile of Student t = $t_{.975;\, df=9} = 2.26$

95% Confidence Interval for Intercept $\beta_0$ = Estimate ± ( confidence coefficient )*SE
    = -2.68925 ± (2.26)(0.03064)
    = (-2.7585, -2.6200)
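The generic form "estimate ± (confidence coefficient)*SE" is a one-line function. The sketch below applies it to the slope and intercept using the estimates and standard errors from the SAS output and the confidence coefficient 2.26 used in the notes; the helper name `conf_int` is made up for this illustration.

```python
def conf_int(estimate, se, coef):
    """Generic confidence interval: estimate -/+ (confidence coefficient) * SE."""
    return estimate - coef * se, estimate + coef * se

T_COEF = 2.26  # 97.5th percentile of Student t with 9 df, as used in the notes

# Estimates and standard errors as read from the SAS parameter estimates.
slope_lo, slope_hi = conf_int(0.19589, 0.00268, T_COEF)
int_lo, int_hi = conf_int(-2.68925, 0.03064, T_COEF)

print(f"95% CI for slope:     ({slope_lo:.4f}, {slope_hi:.4f})")
print(f"95% CI for intercept: ({int_lo:.4f}, {int_hi:.4f})")
```

Neither interval contains 0, which agrees with the two t-tests: both null hypotheses ($\beta_1 = 0$ and $\beta_0 = 0$) were rejected.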

Confidence Intervals for Predictions

Code.

    proc reg data=temp alpha=.05;  /* alpha=.05 is the type I error */
      title "Regression of Y=Log(Weight) on X=Age";
      model logwt=age / cli clm;   /* cli for individual, clm for mean */
    run; quit;

Output.

    Output Statistics
           Dependent   Predicted    Std Error
    Obs     Variable       Value  Mean Predict      95% CL Mean          95% CL Predict      Residual
      1      -1.538     -1.5139      0.0158     -1.5497   -1.4781     -1.5868   -1.4410      -0.0241
      2      -1.284     -1.3180      0.0136     -1.3489   -1.2871     -1.3886   -1.2474       0.0340
      3      -1.102     -1.1221      0.0117     -1.1485   -1.0957     -1.1909   -1.0534       0.0201
      4      -0.903     -0.9262      0.0100     -0.9489   -0.9036     -0.9937   -0.8588       0.0232
      5      -0.742     -0.7303    0.008878     -0.7504   -0.7103     -0.7970   -0.6637      -0.0117
      6      -0.583     -0.5345    0.008465     -0.5536   -0.5153     -0.6008   -0.4681      -0.0485
      7      -0.372     -0.3386    0.008878     -0.3586   -0.3185     -0.4052   -0.2720      -0.0334
      8      -0.132     -0.1427      0.0100     -0.1653   -0.1200     -0.2101   -0.0752       0.0107
      9       0.053      0.0532      0.0117      0.0268    0.0796     -0.0156    0.1220    -0.000218
     10       0.275      0.2491      0.0136      0.2182    0.2800      0.1785    0.3197       0.0259
     11       0.449      0.4450      0.0158      0.4092    0.4808      0.3721    0.5179       0.0040

7. Introduction to Correlation

Definition of Correlation

A correlation coefficient is a measure of the association between two paired random variables (e.g. height and weight).

The Pearson product moment correlation, in particular, is a measure of the strength of the straight line relationship between the two random variables.

Another correlation measure (not discussed here) is the Spearman correlation. It is a measure of the strength of the monotone increasing (or decreasing) relationship between the two random variables. The Spearman correlation is a non-parametric (meaning model free) measure. It is introduced in PubHlth 640, Intermediate Biostatistics.

Formula for the Pearson Product Moment Correlation $\rho$

- The population parameter designation is rho, written as $\rho$.
- The estimate of $\rho$, based on information in a sample, is represented using r.

Some preliminaries:

(1) Suppose we are interested in the correlation between X and Y.

(2) $\widehat{cov}(X,Y) = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)} = \dfrac{S_{xy}}{(n-1)}$. This is the covariance(X,Y).

(3) $\widehat{var}(X) = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{(n-1)} = \dfrac{S_{xx}}{(n-1)}$, and similarly

(4) $\widehat{var}(Y) = \dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{(n-1)} = \dfrac{S_{yy}}{(n-1)}$

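Formula for Estimate of Pearson Product Moment Correlation from a Sample

$$\hat{\rho} = r = \frac{\widehat{cov}(X,Y)}{\sqrt{\widehat{var}(X) \, \widehat{var}(Y)}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$

If you absolutely have to do it by hand, an equivalent (more calculator friendly) formula is

$$\hat{\rho} = r = \frac{\sum_{i=1}^{n} x_i y_i - \dfrac{\left( \sum_{i=1}^{n} x_i \right) \left( \sum_{i=1}^{n} y_i \right)}{n}}{\sqrt{ \left[ \sum_{i=1}^{n} x_i^2 - \dfrac{\left( \sum_{i=1}^{n} x_i \right)^2}{n} \right] \left[ \sum_{i=1}^{n} y_i^2 - \dfrac{\left( \sum_{i=1}^{n} y_i \right)^2}{n} \right] }}$$

- The correlation r can take on values between -1 and +1 only.
- The correlation coefficient is said to be dimensionless: it is independent of the units of x or y.
- Sign of the correlation coefficient (positive or negative) = sign of the estimated slope $\hat{\beta}_1$.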

There is a relationship between the slope of the straight line, $\hat{\beta}_1$, and the estimated correlation r.

Relationship between slope $\hat{\beta}_1$ and the sample correlation r

Because

$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$

a little algebra reveals that

$$r = \left[ \sqrt{\frac{S_{xx}}{S_{yy}}} \, \right] \hat{\beta}_1$$

Thus, beware!!!

It is possible to have a very large (positive or negative) r accompanying a very non-zero slope, inasmuch as
- A very large r might reflect a very large $S_{xx}$, all other things equal
- A very large r might reflect a very small $S_{yy}$, all other things equal.

In other words, the size of r reflects the ratio $S_{xx}/S_{yy}$ as well as the slope itself.
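The two routes to r, directly from $S_{xy}$, $S_{xx}$, $S_{yy}$ and by rescaling the slope, can be compared numerically. The sketch below does this for the chick embryo data (Y=LOGWT, X=AGE) and also shows that $r^2$ reproduces the R-Square in the SAS output.

```python
import math

# Chick embryo data as read from the example table.
AGE = list(range(6, 17))
LOGWT = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
         -0.372, -0.132, 0.053, 0.275, 0.449]

n = len(AGE)
xbar = sum(AGE) / n
ybar = sum(LOGWT) / n
Sxx = sum((x - xbar) ** 2 for x in AGE)
Syy = sum((y - ybar) ** 2 for y in LOGWT)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(AGE, LOGWT))

b1 = Sxy / Sxx                             # estimated slope
r_direct = Sxy / math.sqrt(Sxx * Syy)      # Pearson correlation
r_from_slope = math.sqrt(Sxx / Syy) * b1   # same quantity via the slope

print(f"slope b1       = {b1:.5f}")
print(f"r (direct)     = {r_direct:.4f}")
print(f"r (from slope) = {r_from_slope:.4f}")
print(f"r squared      = {r_direct ** 2:.4f}")
```

Here the slope is about 0.2 while r is close to 1: the scaling factor $\sqrt{S_{xx}/S_{yy}}$ is roughly 5 for these data, which is the situation the warning above describes.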

8. Hypothesis Test of Correlation

The null hypothesis of zero correlation is equivalent to the null hypothesis of zero slope.

Research Question: Is the correlation $\rho = 0$? Is the slope $\beta_1 = 0$?

Assumptions: As before.

$H_O$ and $H_A$:

    $H_O: \rho = 0$
    $H_A: \rho \neq 0$

Test Statistic:

A little algebra (not shown) yields a very nice formula for the t-score that we need.

$$t\text{-score} = \left[ \frac{r \sqrt{(n-2)}}{\sqrt{1 - r^2}} \right], \quad df = (n-2)$$

We can find this information in our output. Recall the first example and the model of Y=LOGWT to X=AGE:

The Pearson Correlation, r, is the square root of the R-squared in the output.

    Root MSE            0.02807    R-Square    0.9983
    Dependent Mean     -0.53445    Adj R-Sq    0.9981
    Coeff Var          -5.25286

    Pearson Correlation, $r = \sqrt{0.9983} = 0.9991$

Substitution into the formula for the t-score yields

$$t\text{-score} = \left[ \frac{r \sqrt{(n-2)}}{\sqrt{1 - r^2}} \right] = \left[ \frac{0.9991 \sqrt{9}}{\sqrt{1 - 0.9983}} \right] = \frac{2.9974}{0.0412} = 72.69$$

Note: The value 0.9991 in the numerator is $r = \sqrt{R^2} = \sqrt{0.9983} = 0.9991$.

This is very close to the value of the t-score that was obtained for testing the null hypothesis of zero slope. The discrepancy is probably rounding error. I did the calculations on my calculator using 4 significant digits. SAS probably used more significant digits - cb.
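The rounding explanation can be tested by doing the calculation twice: once at full precision and once with the 4-digit values used in the hand calculation. The sketch below uses the chick embryo data (Y=LOGWT, X=AGE); the variable names are chosen for illustration.

```python
import math

# Chick embryo data as read from the example table.
AGE = list(range(6, 17))
LOGWT = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
         -0.372, -0.132, 0.053, 0.275, 0.449]

n = len(AGE)
xbar = sum(AGE) / n
ybar = sum(LOGWT) / n
Sxx = sum((x - xbar) ** 2 for x in AGE)
Syy = sum((y - ybar) ** 2 for y in LOGWT)
Sxy = sum((x - xbar) * (y - ybar) for x, y in zip(AGE, LOGWT))

# Full-precision correlation and its t-score.
r = Sxy / math.sqrt(Sxx * Syy)
t_full = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# Same formula with the 4-significant-digit values from the hand calculation.
t_rounded = 0.9991 * math.sqrt(9) / math.sqrt(1 - 0.9983)

print(f"t from full-precision r: {t_full:.2f}")
print(f"t from rounded r and R2: {t_rounded:.2f}")
```

Because $1 - r^2$ is so small here, rounding $R^2$ to four digits changes the denominator noticeably, and that accounts for the gap between the hand-computed t-score and the t Value for the slope in the SAS output.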