Unit 9 Regression and Correlation


BIOSTATS 540 - Fall 2015

"Assume that a statistical model such as a linear model is a good first start only" - Gerald van Belle

Is higher blood pressure in the mom associated with a lower birth weight of her baby? Simple linear regression explores the relationship of one continuous outcome (Y=birth weight) with one continuous predictor (X=blood pressure).

At the heart of statistics is the fitting of models to observed data, followed by an examination of how they perform.

-1- Somewhat useful: The fitted model is a sufficiently good fit to the data if it permits exploration of hypotheses such as "higher blood pressure during pregnancy is associated with statistically significant lower birth weight" and it permits assessment of confounding, effect modification, and mediation. These are ideas that will be developed in BIOSTATS 640, Multivariable Linear Regression.

-2- More useful: The fitted model can be used to predict the outcomes of future observations. For example, we might be interested in predicting the birth weight of the baby born to a mom with systolic blood pressure 145 mm Hg.

-3- Most useful: Sometimes, but not so much in public health, the fitted model derives from a physical equation. An example is Michaelis-Menten kinetics. A Michaelis-Menten model is fit to the data for the purpose of estimating the actual rate of a particular chemical reaction.

Hence: "A linear model is a good first start only."

Table of Contents

Topic
1.  Unit Roadmap
2.  Learning Objectives
3.  Definition of the Linear Regression Model
4.  Estimation
5.  The Analysis of Variance Table
6.  Assumptions for the Straight Line Regression
7.  Hypothesis Testing
8.  Confidence Interval Estimation
9.  Introduction to Correlation
10. Hypothesis Test for Correlation

1. Unit Roadmap

[Unit roadmap diagram: Populations / Relationships / Modeling / Analysis-Synthesis, with Unit 9, Regression & Correlation, highlighted.]

Simple linear regression is used when there is one response (dependent, Y) variable and one explanatory (independent, X) variable, and both are continuous. Examples of explanatory (independent) and response (dependent) variable pairs are height and weight, age and blood pressure, etc.

-1- A simple linear regression analysis begins with a scatterplot of the data to see if a straight line model is appropriate:

    y = β0 + β1 x

where Y = the response or dependent variable and X = the explanatory or independent variable.

-2- The sample data are used to estimate the parameter values and their standard errors.

    β1 = slope (the change in y per unit change in x)
    β0 = intercept (the value of y when x = 0)

-3- The fitted model is then compared to the simpler model

    y = β0

which says that y is not linearly related to x.

2. Learning Objectives

When you have finished this unit, you should be able to:

- Explain what is meant by independent versus dependent variable and what is meant by a linear relationship;
- Produce and interpret a scatterplot;
- Define and explain the intercept and slope parameters of a linear relationship;
- Explain the theory of least squares estimation of the intercept and slope parameters of a linear relationship;
- Calculate by hand least squares estimates of the intercept and slope parameters of a linear relationship;
- Explain the theory of the analysis of variance of simple linear regression;
- Calculate by hand the analysis of variance of simple linear regression;
- Explain, compute, and interpret R² in the context of simple linear regression;
- State and explain the assumptions required for estimation and hypothesis tests in regression;
- Explain, compute, and interpret the overall F-test in simple linear regression;
- Interpret the computer output of a simple linear regression analysis from a package such as R, Stata, SAS, SPSS, Minitab, etc.;
- Define and interpret the value of a Pearson Product Moment Correlation, r;
- Explain the relationship between the Pearson product moment correlation r and the linear regression slope parameter; and
- Calculate by hand confidence interval estimation and statistical hypothesis testing of the Pearson product moment correlation r.

3. Definition of the Linear Regression Model

Unit 8 considered two categorical (discrete) variables, such as smoking (yes/no) and low birth weight (yes/no). It was an introduction to chi-square tests of association. Unit 9 considers two continuous variables, such as age and weight. It is an introduction to simple linear regression and correlation.

A wonderful introduction to the intuition of linear regression can be found in the text by Freedman, Pisani, and Purves (Statistics. WW Norton & Co., 1978). The following is excerpted from their text:

"How is weight related to height? For example, consider the men aged 18 to 24 in Cycle I of the Health Examination Survey. Their average height was 5 feet 8 inches = 68 inches, with an overall average weight of 158 pounds. But those men who were one inch above average in height had a somewhat higher average weight. Those men who were two inches above average in height had a still higher average weight. And so on. On the average, how much of an increase in weight is associated with each unit increase in height? The best way to get started is to look at the scattergram for these heights and weights. The object is to see how weight depends on height, so height is taken as the independent variable and plotted horizontally ...

The regression line is to a scatter diagram as the average is to a list. The regression line estimates the average value for the dependent variable corresponding to each value of the independent variable."

Linear Regression
Linear regression models the mean µ = E[Y] of one random variable Y as a linear function of one or more other variables (called predictors or explanatory variables) that are treated as fixed. The estimation and hypothesis testing involved are extensions of ideas and techniques that we have already seen.

In linear regression, Y is the outcome or dependent variable that we observe. We observe its values for individuals with various combinations of values of a predictor or explanatory variable X. There may be more than one predictor X; this will be discussed in BIOSTATS 640.

In simple linear regression the values of the predictor X are assumed to be fixed. Often, however, the variables Y and X are both random variables.

Correlation
Correlation considers the association of two random variables. The techniques of estimation and hypothesis testing are the same for linear regression and correlation analyses. Exploring the relationship begins with fitting a line to the points.

Development of a simple linear regression model analysis

Example. Source: Kleinbaum, Kupper, and Muller 1988
The following are observations of age (days) and weight (kg) for n = 11 chicken embryos.

      WT=Y     AGE=X    LOGWT=Z
      0.029      6      -1.538
      0.052      7      -1.284
      0.079      8      -1.102
      0.125      9      -0.903
      0.181     10      -0.742
      0.261     11      -0.583
      0.425     12      -0.372
      0.738     13      -0.132
      1.13      14       0.053
      1.882     15       0.275
      2.812     16       0.449

Notation
The data are n = 11 pairs of (Xi, Yi), where X=AGE and Y=WT:

      (X1, Y1) = (6, 0.029), ..., (X11, Y11) = (16, 2.812)

This table also provides pairs of (Xi, Zi), where X=AGE and Z=LOGWT:

      (X1, Z1) = (6, -1.538), ..., (X11, Z11) = (16, 0.449)
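If you would like to follow along in Stata, here is a minimal sketch of how the data above might be entered and plotted. The variable names x, y, and z are chosen here to match the Stata illustrations that appear later in this unit; the commands themselves are not part of the original handout.

. * Enter the n = 11 chicken embryo observations (values from the table above)
. clear
. input x y
   6  0.029
   7  0.052
   8  0.079
   9  0.125
  10  0.181
  11  0.261
  12  0.425
  13  0.738
  14  1.13
  15  1.882
  16  2.812
end

. * Z = LOGWT is the base-10 logarithm of weight
. generate z = log10(y)

. * Begin with a scatter plot of Y = WT versus X = AGE
. scatter y x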

Research question
There are a variety of possible research questions:
(1) Does weight change with age?
(2) In the language of analysis of variance we are asking the following: Can the variability in weight be explained, to a significant extent, by variations in age?
(3) What is a good functional form that relates age to weight?

Tip! Begin with a scatter plot. Here we plot X=AGE versus Y=WT.

We check and learn about the following:
- The average and median of X
- The range and pattern of variability in X
- The average and median of Y
- The range and pattern of variability in Y
- The nature of the relationship between X and Y
- The strength of the relationship between X and Y
- The identification of any points that might be influential

Example, continued

[Figure: scatter plot of Y=WT versus X=AGE.]

- The plot suggests a relationship between AGE and WT.
- A straight line might fit well, but another model might be better.
- We have adequate ranges of values for both AGE and WT.
- There are no outliers.

The bowl shape of our scatter plot suggests that perhaps a better model relates the logarithm of WT (Z=LOGWT) to AGE:

[Figure: scatter plot of Z=LOGWT versus X=AGE.]

We might have gotten any of a variety of plots:

[Figure, three panels:
 (1) No relationship between X and Y
 (2) Linear relationship between X and Y
 (3) Non-linear relationship between X and Y]

[Figure, three panels, each showing a scatter with an outlying point:
 (1) Note the outlying point. Here, a fit of a linear model will yield an estimated slope that is spuriously non-zero.
 (2) Note the outlying point. Here, a fit of a linear model will yield an estimated slope that is spuriously near zero.
 (3) Note the outlying point. Here, a fit of a linear model will yield an estimated slope that is spuriously high.]

Review of the Straight Line

Way back when, in your high school days, you may have been introduced to the straight line function, defined as y = mx + b, where m is the slope and b is the intercept. Nothing new here. All we're doing is changing the notation a bit:

(1) Slope:     m  becomes  β1
(2) Intercept: b  becomes  β0

[Figure, three panels illustrating slope > 0, slope = 0, and slope < 0.]

Definition of the Straight Line Model Y = β0 + β1X

Population:
    Y = β0 + β1X + ε is the relationship in the population. Y = β0 + β1X is measured with error ε, defined as

        ε = [Y] - [β0 + β1X]

    β0, β1 and ε are all unknown!!

Sample:
    Y = β̂0 + β̂1X + e
    β̂0, β̂1 and e are estimates of β0, β1 and ε. Note: so you know, these may also be written as b0, b1, and e.
    The residual e is now the difference between the observed and the fitted (not the true):

        e = [Y] - [β̂0 + β̂1X]

We obtain guesses of these unknowns, called β̂0, β̂1 and e, by the method of least squares estimation.

Notation (sorry)
    Y = the outcome or dependent variable
    X = the predictor or independent variable
    β̂0, β̂1 and e are known (they are computed from the sample)

How close did we get? To see if β̂0 is close to β0 and β̂1 is close to β1, we perform regression diagnostics. Regression diagnostics are discussed in BIOSTATS 640.

    µY        = the expected value of Y for all persons in the population
    µY|X=x    = the expected value of Y for the sub-population for whom X=x
    σ²Y       = variability of Y among all persons in the population
    σ²Y|X=x   = variability of Y for the sub-population for whom X=x

4. Estimation

Least squares estimation is used to obtain guesses of β0 and β1. When the outcome Y is distributed normal, least squares estimation is the same as maximum likelihood estimation. Note: if you are not familiar with maximum likelihood estimation, don't worry. This is introduced in BIOSTATS 640.

Least Squares, "Close", and Least Squares Estimation

Theoretically, it is possible to draw many lines through an X-Y scatter of points. Which to choose? Least squares estimation is one approach to choosing a line that is "closest" to the data.

    di = [observed Y - fitted Ŷ] for the i-th person

Perhaps we'd like di = [observed Y - fitted Ŷ] = smallest possible. Note that this is a vertical distance, since it is a distance on the vertical axis:

    di = Yi - Ŷi

Better yet, perhaps we'd like to minimize the squared difference:

    di² = [observed Y - fitted Ŷ]² = smallest possible

Glitch. We can't minimize each di² separately. In particular, it is not possible to choose common values of β̂0 and β̂1 that minimize (Y1 - Ŷ1)² for subject 1, and minimize (Y2 - Ŷ2)² for subject 2, ..., and minimize (Yn - Ŷn)² for the n-th subject.

So, instead, we choose values for β̂0 and β̂1 that, upon insertion, minimize the total

    Σ di² = Σ (Yi - Ŷi)² = Σ (Yi - [β̂0 + β̂1Xi])²    (sums over i = 1, ..., n)

The quantity

    Σ di² = Σ (Yi - Ŷi)² = Σ (Yi - [β̂0 + β̂1Xi])²

has a variety of names:

- residual sum of squares, SSE or SSQ(residual)
- sum of squares about the regression line
- sum of squares due error (SSE)

Divided by its degrees of freedom, it yields σ̂²Y|X, the estimate of the variance of Y about the fitted line.

Least Squares Estimation of the Slope and Intercept

In case you're interested... Consider

    SSE = Σ di² = Σ (Yi - Ŷi)² = Σ (Yi - [β̂0 + β̂1Xi])²

Step #1: Differentiate with respect to β̂1. Set the derivative equal to 0 and solve for β̂1.
Step #2: Differentiate with respect to β̂0. Set the derivative equal to 0, insert β̂1, and solve for β̂0.

Least Squares Estimation Solutions
Note: the estimates are denoted either using Greek letters with a caret or with Roman letters.

Estimate of the slope, β̂1 or b1:

    β̂1 = Σ (Xi - X̄)(Yi - Ȳ) / Σ (Xi - X̄)²

Estimate of the intercept, β̂0 or b0:

    β̂0 = Ȳ - β̂1 X̄

A closer look...

Some very helpful preliminary calculations:

    Sxx = Σ (X - X̄)²         =  Σ X²  - n X̄²
    Syy = Σ (Y - Ȳ)²         =  Σ Y²  - n Ȳ²
    Sxy = Σ (X - X̄)(Y - Ȳ)   =  Σ XY  - n X̄ Ȳ

Note - These expressions make use of the summation notation introduced in Unit 1. The capital S indicates summation. In Sxy, the first subscript x is saying (x - x̄); the second subscript y is saying (y - ȳ):

    Sxy = Σ (x - x̄)(y - ȳ)

Slope:

    β̂1 = Σ (Xi - X̄)(Yi - Ȳ) / Σ (Xi - X̄)²  =  côv(X,Y) / vâr(X)  =  Sxy / Sxx

Intercept:

    β̂0 = Ȳ - β̂1 X̄

Prediction of Y:

    Ŷ = β̂0 + β̂1X = b0 + b1X

Do these estimates make sense?
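The building blocks Sxx and Sxy, and the least squares estimates, can also be computed "by hand" in Stata. The sketch below assumes the chicken embryo data are in variables x (AGE) and y (WT), as in the data-entry sketch earlier; it is one of several equally good ways to do this.

. * Deviations from the means
. quietly summarize x
. scalar xbar = r(mean)
. generate double xdev = x - r(mean)
. quietly summarize y
. scalar ybar = r(mean)
. generate double ydev = y - r(mean)

. * Sxx = sum of squared x-deviations; Sxy = sum of cross-products
. generate double xdevsq = xdev^2
. generate double xydev  = xdev*ydev
. quietly summarize xdevsq
. scalar Sxx = r(sum)
. quietly summarize xydev
. scalar Sxy = r(sum)

. * Least squares slope and intercept
. scalar b1 = Sxy/Sxx
. scalar b0 = ybar - b1*xbar
. display "Sxx = " Sxx "   Sxy = " Sxy "   b1 = " b1 "   b0 = " b0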

Slope

    β̂1 = Σ (Xi - X̄)(Yi - Ȳ) / Σ (Xi - X̄)²  =  côv(X,Y) / vâr(X)

The linear movement in Y with linear movement in X is measured relative to the variability in X.

β̂1 = 0 says: with a unit change in X, overall there is a 50-50 chance that Y increases versus decreases.
β̂1 ≠ 0 says: with a unit increase in X, Y increases also (β̂1 > 0) or Y decreases (β̂1 < 0).

Intercept

    β̂0 = Ȳ - β̂1 X̄

If the linear model is incorrect, or if the true model does not have a linear component, we obtain β̂1 = 0 and β̂0 = Ȳ as our best guess of an unknown Y.

Illustration in Stata
Y=WT and X=AGE

. regress y x

Partial listing of output ...
------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .2350727   .0459452     5.12   0.001     .1311367    .3390087
       _cons |  -1.884527   .5258354    -3.58   0.006    -3.074045    -.695005
------------------------------------------------------------------------------

Annotated ...
------------------------------------------------------------------------------
  y = WEIGHT |      Coef.            Std. Err.     t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
     x = AGE |   .2350727 = b1       .0459452    5.12   0.001     .1311367   .3390087
_cons = Intercept
             |  -1.884527 = b0       .5258354   -3.58   0.006    -3.074045   -.695005
------------------------------------------------------------------------------

The fitted line is therefore

    WT = -1.88453 + 0.23507*AGE

It says that each unit (one day) increase in AGE is estimated to predict a 0.23507 increase in weight, WT.

Here is an overlay of the fitted line on our scatterplot:

[Figure: Scatter Plot of WT vs AGE, with the fitted line overlaid.]
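A sketch of one way to produce an overlay like this in Stata; lfit adds the least squares line on top of the scatter of points.

. twoway (scatter y x) (lfit y x), ytitle("WT") xtitle("AGE")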

As we might have guessed, the straight line model may not be the best choice. The bowl shape of the scatter plot does have a linear component, however. Without the plot, we might have believed the straight line fit is okay.

Illustration in Stata - continued
Z=LOGWT and X=AGE

. regress z x

Partial listing of output ...
------------------------------------------------------------------------------
           z |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .1958909   .0026768    73.18   0.000     .1898356    .2019462
       _cons |  -2.689255   .0306369   -87.78   0.000    -2.758562   -2.619949
------------------------------------------------------------------------------

Annotated ...
------------------------------------------------------------------------------
   z = LOGWT |      Coef.            Std. Err.     t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
     x = AGE |   .1958909 = b1       .0026768   73.18   0.000     .1898356   .2019462
_cons = INTERCEPT
             |  -2.689255 = b0       .0306369  -87.78   0.000    -2.758562  -2.619949
------------------------------------------------------------------------------

Thus, the fitted line is

    LOGWT = -2.6893 + 0.19589*AGE

Now the overlay plot looks better:

[Figure: Scatter plot of LOGWT vs AGE, with the fitted line overlaid.]

Now You Try ...
Prediction of Weight from Height
Source: Dixon and Massey (1969)

    Individual    Height (X)    Weight (Y)
        1            60            110
        2            60            135
        3            60            120
        4            62            120
        5            62            140
        6            62            130
        7            62            135
        8            64            150
        9            64            145
       10            70            170
       11            70            185
       12            70            160

Preliminary calculations:

    X̄ = 63.833          Ȳ = 141.667
    Σ Xi²   = 49,068
    Σ Yi²   = 246,100
    Σ Xi Yi = 109,380
    Sxx = 171.667
    Syy = 5,266.667
    Sxy = 863.333

Slope:

    β̂1 = Sxy / Sxx = 863.333 / 171.667 = 5.029

Intercept:

    β̂0 = Ȳ - β̂1 X̄ = 141.667 - (5.029)(63.8333) = -179.357
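If you enter these 12 pairs into Stata yourself, say as variables height and weight (hypothetical names, not used elsewhere in this unit), the hand calculation can be checked directly; a sketch:

. regress weight height
. display "slope b1     = " _b[height]
. display "intercept b0 = " _b[_cons]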

5. The Analysis of Variance Table

Recall the sample variance introduced in Unit 1, Summarizing Data. The numerator of the sample variance (S²) of the Y data is

    Σ (Yi - Ȳ)²    (sum over i = 1, ..., n)

This same quantity, Σ (Yi - Ȳ)², is a central figure in regression. It has a new name - several, actually:

    Σ (Yi - Ȳ)²  =  total variability of the Y's
                 =  total sum of squares
                 =  total, corrected
                 =  SSY

(Note: "corrected" refers to subtracting the mean before squaring.)

The analysis of variance table is all about Σ (Yi - Ȳ)² and partitioning it into two components:

1. Due residual (the individual Y about the individual prediction Ŷ)
2. Due regression (the prediction Ŷ about the overall mean Ȳ)

Here is the partition (note - look closely and you'll see that both sides are the same):

    (Yi - Ȳ) = (Yi - Ŷi) + (Ŷi - Ȳ)

Some algebra (not shown) reveals a nice partition of the total variability:

    Σ (Yi - Ȳ)² = Σ (Yi - Ŷi)² + Σ (Ŷi - Ȳ)²

    Total Sum of Squares = Due Error Sum of Squares + Due Model Sum of Squares

A closer look...

    Total Sum of Squares = Due Model Sum of Squares + Due Error Sum of Squares

    Σ (Yi - Ȳ)²  =  Σ (Ŷi - Ȳ)²  +  Σ (Yi - Ŷi)²

    (Yi - Ȳ)  = deviation of Yi from Ȳ that is to be explained
    (Ŷi - Ȳ)  = "due model", "signal", "systematic", "due regression"
    (Yi - Ŷi) = "due error", "noise", or "residual"

We seek to explain the total variability Σ (Yi - Ȳ)² with a fitted model.

What happens when β1 ≠ 0?
- A straight line relationship is helpful.
- Best guess is Ŷ = β̂0 + β̂1X.
- The due model sum of squares tends to be LARGE, because

      Σ (Ŷ - Ȳ)² = Σ (β̂0 + β̂1X - Ȳ)² = Σ (Ȳ - β̂1X̄ + β̂1X - Ȳ)² = β̂1² Σ (X - X̄)²

- The due error sum of squares has to be small, so due(model)/due(error) will be large.

What happens when β1 = 0?
- A straight line relationship is not helpful.
- Best guess is Ŷ = β̂0 = Ȳ.
- The due error sum of squares tends to be nearly the TOTAL, because

      Σ (Y - Ŷ)² = Σ (Y - β̂0)² = Σ (Y - Ȳ)²

- The due regression sum of squares has to be small, so due(model)/due(error) will be small.

How to Partition the Total Variance

Think: carving a pie into wedges/pieces: (explained) + (remainder).

1. The total pie
The "total" or "total, corrected" refers to the variability of Y about Ȳ.
    Σ (Yi - Ȳ)² is called the total sum of squares.
    Degrees of freedom = df = (n-1).
    Division of the total sum of squares by its df yields the total mean square.

2. Carve out the piece of the pie explained by the model
The "regression" or "due model" refers to the variability of Ŷ about Ȳ.
    Σ (Ŷi - Ȳ)² = β̂1² Σ (Xi - X̄)² is called the regression sum of squares.
    Degrees of freedom = df = 1.
    Division of the regression sum of squares by its df yields the regression mean square or model mean square. It is an example of a variance component.

3. The remainder of the pie
The "residual" or "due error" refers to the variability of Y about Ŷ.
    Σ (Yi - Ŷi)² is called the residual sum of squares.
    Degrees of freedom = df = (n-2).
    Division of the residual sum of squares by its df yields the residual mean square.

    Source                     df       Sum of Squares          Mean Square
    Regression (due model)      1       SSR = Σ (Ŷi - Ȳ)²       SSR/1
    Residual (due error)      (n-2)     SSE = Σ (Yi - Ŷi)²      SSE/(n-2)
    Total, corrected          (n-1)     SST = Σ (Yi - Ȳ)²

Tip! Mean square = (sum of squares)/(degrees of freedom, df)
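Once a regression has been fit, Stata stores the pieces of this table, so the partition can be verified without any hand work. A sketch, using the fit of Z=LOGWT on X=AGE that appears later in this unit (variables z and x):

. quietly regress z x
. display "SS(regression) = " e(mss)
. display "SS(residual)   = " e(rss)
. display "SS(total)      = " e(mss) + e(rss)
. display "df: model = " e(df_m) ", residual = " e(df_r) ", total = " e(df_m) + e(df_r)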

Be careful! The question we may ask from an analysis of variance table is a limited one: Does the fit of the straight line model explain a significant portion of the variability of the individual Y about Ȳ? Is this fitted model better than using Ȳ alone?

We are NOT asking: Is the choice of the straight line model correct? or Would another functional form be a better choice?

We'll use a hypothesis test approach (another "proof by contradiction" reasoning, just like we did in Unit 7!).

- Assume, provisionally, the "nothing is going on" null hypothesis that says β1 = 0 ("no linear relationship").
- Use least squares estimation to estimate a "closest" line.
- The analysis of variance table provides a comparison of the due regression mean square to the residual mean square.

Where does least squares estimation take us, vis a vis the slope β1?

    If β1 ≠ 0    then due(regression)/due(residual) will be LARGE
    If β1 = 0    then due(regression)/due(residual) will be SMALL

Our p-value calculation will answer the question: If the null hypothesis is true and β1 = 0 truly, what were the chances of obtaining a value of due(regression)/due(residual) as large or larger than that observed?

To calculate "chances of extremeness" under some assumed null hypothesis, we need a null hypothesis probability model! But did you notice? So far, we have not actually used one!

6. Assumptions for a Straight Line Regression Analysis

In performing least squares estimation, we did not use a probability model. We were doing geometry. Confidence interval estimation and hypothesis testing require some assumptions and a probability model. Here you go!

Assumptions for Simple Linear Regression
- The separate observations Y1, Y2, ..., Yn are independent.
- The values of the predictor variable X are fixed and measured without error.
- For each value of the predictor variable X=x, the distribution of values of Y follows a normal distribution with mean equal to µY|X=x and common variance equal to σ²Y|x.
- The separate means µY|X=x lie on a straight line; that is,

      µY|X=x = β0 + β1 x

[Figure: at each value of X, there is a population of Y values for persons with X=x.]

With these assumptions, we can assess the significance of the variance explained by the model:

    F = msq(model) / msq(residual)    with df = 1, (n-2)

    When β1 = 0:
        The due model MSR has expected value σ²Y|X.
        The due residual MSE has expected value σ²Y|X.
        F = MSR/MSE will be close to 1.

    When β1 ≠ 0:
        The due model MSR has expected value σ²Y|X + β1² Σ (Xi - X̄)².
        The due residual MSE has expected value σ²Y|X.
        F = MSR/MSE will be LARGER than 1.

We obtain the analysis of variance table for the model of Z=LOGWT to X=AGE.

Stata illustration, with annotations:

      Source |       SS       df       MS              Number of obs =      11
-------------+------------------------------           F(  1,     9) = 5355.60   = MSQ(model)/MSQ(residual)
       Model |  4.22105734     1  4.22105734           Prob > F      =  0.0000   = p-value for Overall F Test
    Residual |   .00709346     9  .000788162           R-squared     =  0.9983   = SSQ(model)/SSQ(total)
-------------+------------------------------           Adj R-squared =  0.9981   = R² adjusted for n and # of X
       Total |  4.22815076    10  .422815076           Root MSE      =  .02807   = square root of MSQ(residual)

This output corresponds to the following. (Note: in this example our dependent variable is actually Z, not Y.)

    Source                     df          Sum of Squares                 Mean Square
    Regression (due model)      1          SSR = Σ (Ẑi - Z̄)²  = 4.22063   SSR/1 = 4.22063        (msr = mean square regression)
    Residual (due error)     (n-2) = 9     SSE = Σ (Zi - Ẑi)² = 0.00705   SSE/(n-2) = 7.838E-04  (mse = mean square error)
    Total, corrected         (n-1) = 10    SST = Σ (Zi - Z̄)²  = 4.22768

Other information in this output:

R-SQUARED = (sum of squares regression)/(sum of squares total) = proportion of the total that we have been able to explain with the fit = "percent of variance explained by the model". Be careful! As predictors are added to the model, R-SQUARED can only increase. Eventually, we need to adjust this measure to take this into account. See ADJUSTED R-SQUARED.

We also get an overall F test of the null hypothesis that the simple linear model does not explain significantly more variability in LOGWT than the average LOGWT:

    F = MSQ(Regression)/MSQ(Residual) = 4.22063/0.0007838 = 5384.94, with df = 1, 9

    p-value = achieved significance < 0.0001

This is a highly unlikely outcome! Reject HO. Conclude that the fitted line explains statistically significantly more of the variability in Z=LOGWT than is explained by the intercept-only null hypothesis model.
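The overall F statistic and its p-value can be recovered from Stata's stored results as well; a sketch (again assuming the fitted model of z on x):

. quietly regress z x
. scalar Fstat = (e(mss)/e(df_m)) / (e(rss)/e(df_r))    // msq(model) / msq(residual)
. display "F = " Fstat "   p-value = " Ftail(e(df_m), e(df_r), Fstat)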

7. Hypothesis Testing

Straight Line Model: Y = β0 + β1X

1) Overall F-Test

Research Question: Does the fitted model, the Ŷ, explain significantly more of the total variability of the Y about Ȳ than does Ȳ?

A bit of clarification here, in case you're wondering. When the null hypothesis is true, at least two things happen: (1) β1 = 0 and (2) the correct model (the null one) says Y = β0 + error. In this situation, the least squares estimate of β0 turns out to be Ȳ (that seems reasonable, right?).

Assumptions: As before.

HO and HA:
    HO: β1 = 0
    HA: β1 ≠ 0

Test Statistic:

    F = msq(regression) / msq(residual)    with df = 1, (n-2)

Evaluation rule: When the null hypothesis is true, the value of F should be close to 1. Alternatively, when β1 ≠ 0, the value of F will be LARGER than 1. Thus, our p-value calculation answers: What are the chances of obtaining our value of the F, or one that is larger, if we believe the null hypothesis that β1 = 0?

Calculations: For our data, we obtain

    p-value = Pr[ F(1, n-2) ≥ msq(model)/msq(residual) | β1 = 0 ]
            = Pr[ F(1, 9) ≥ 5384.94 ]
            << .0001

Evaluate: The assumption of the null hypothesis that β1 = 0 has led to an extremely unlikely outcome (an F-statistic value of 5384.94), with chances of being observed less than 1 chance in 10,000. The null hypothesis is rejected.

Interpret: We have learned that, at the least, the fitted straight line model does a much better job of explaining the variability in Z = LOGWT than a model that allows only for the average LOGWT. Later (BIOSTATS 640, Intermediate Biostatistics), we'll see that the analysis does not stop here.

2) Test of the Slope, β1

Notes - The overall F test and the test of the slope are equivalent. The test of the slope uses a t-score approach to hypothesis testing. It can be shown that { t-score for slope }² = { overall F }.

Research Question: Is the slope β1 = 0?

Assumptions: As before.

HO and HA:
    HO: β1 = 0
    HA: β1 ≠ 0

Test Statistic: To compute the t-score, we need an estimate of the standard error of β̂1:

    SÊ(β̂1) = sqrt[ msq(residual) / Σ (Xi - X̄)² ]

Our t-score is therefore:

    t-score = [ (observed) - (expected) ] / sê(expected)
            = ( β̂1 - 0 ) / sê(β̂1)        with df = (n-2)

We can find this information in our Stata output. Annotations are shown alongside.

------------------------------------------------------------------------------
           z |      Coef.   Std. Err.     t = Coef/Std. Err.       P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .1958909   .0026768    73.18 = .19589/.0026768    0.000      .1898356    .2019462
       _cons |  -2.689255   .0306369   -87.78                      0.000     -2.758562   -2.619949
------------------------------------------------------------------------------

Recall what we mean by a t-score: t = 73.38 says the estimated slope is estimated to be 73.38 standard error units away from the null hypothesis expected value of zero.

Check that { t-score }² = { Overall F }: [73.38]² = 5384.6, which is close.

Evaluation rule: When the null hypothesis is true, the value of t should be close to zero. Alternatively, when β1 ≠ 0, the value of t will be DIFFERENT from 0. Here, our p-value calculation answers: Under the assumption of the null hypothesis that β1 = 0, what were our chances of obtaining a t-statistic value 73.38 standard error units away from its null hypothesis expected value of zero?
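The same t-score and its two-sided p-value can be reproduced from the stored regression results; a sketch:

. quietly regress z x
. scalar tslope = _b[x]/_se[x]                     // (estimated slope - 0) / se(slope)
. display "t = " tslope "   two-sided p = " 2*ttail(e(df_r), abs(tslope))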

Calculations: For our data, we obtain

    p-value = Pr[ |t(n-2)| ≥ |β̂1 - 0| / sê(β̂1) ]
            = Pr[ |t(9)| ≥ 73.38 ]
            << .0001

Evaluate: Under the null hypothesis that β1 = 0, the chances of obtaining a t-score value that is 73.38 or more standard error units away from the expected value of 0 is less than 1 chance in 10,000.

Interpret: The inference is the same as that for the overall F test. The fitted straight line model does a statistically significantly better job of explaining the variability in LOGWT than the sample mean.

3) Test of the Intercept, β0

This addresses the question: Does the straight line relationship pass through the origin? It is rarely of interest.

Research Question: Is the intercept β0 = 0?

Assumptions: As before.

HO and HA:
    HO: β0 = 0
    HA: β0 ≠ 0

Test Statistic: To compute the t-score for the intercept, we need an estimate of the standard error of β̂0:

    SÊ(β̂0) = sqrt[ msq(residual) * ( 1/n + X̄² / Σ (Xi - X̄)² ) ]

Our t-score is therefore:

    t-score = [ (observed) - (expected) ] / sê(expected)
            = ( β̂0 - 0 ) / sê(β̂0)        with df = (n-2)

Again, we can find this information in our Stata output. Annotations are shown alongside.

------------------------------------------------------------------------------
           z |      Coef.   Std. Err.     t = Coef/Std. Err.         P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .1958909   .0026768    73.18                        0.000      .1898356    .2019462
       _cons |  -2.689255   .0306369   -87.78 = -2.689255/.0306369   0.000     -2.758562   -2.619949
------------------------------------------------------------------------------

Here, t = -87.78 says the estimated intercept is estimated to be 87.78 standard error units away from its null hypothesis expected value of zero.

Evaluation rule: When the null hypothesis is true, the value of t should be close to zero. Alternatively, when β0 ≠ 0, the value of t will be DIFFERENT from 0. Our p-value calculation answers: Under the assumption of the null hypothesis that β0 = 0, what were our chances of obtaining a t-statistic value 87.78 standard error units away from its null hypothesis expected value of zero?

Calculations:

    p-value = Pr[ |t(n-2)| ≥ |β̂0 - 0| / sê(β̂0) ]
            = Pr[ |t(9)| ≥ 87.78 ]
            << .0001

Evaluate: Under the null hypothesis that the line passes through the origin, that β0 = 0, the chances of obtaining a t-score value that is 87.78 or more standard error units away from the expected value of 0 is less than 1 chance in 10,000, again prompting statistical rejection of the null hypothesis.

Interpret: The inference is that there is statistically significant evidence that the straight line relationship between Z=LOGWT and X=AGE does not pass through the origin.

8. Confidence Interval Estimation

Straight Line Model: Y = β0 + β1X

The confidence intervals here have the usual three elements (for review, see again Unit 6):
1) Best single guess (estimate)
2) Standard error of the best single guess (SE[estimate])
3) Confidence coefficient: this will be a percentile from the Student t distribution with df = (n-2)

We might want confidence interval estimates of the following four parameters:
(1) Slope
(2) Intercept
(3) Mean of the subset of the population for whom X = x0
(4) Individual response for a person for whom X = x0

1) SLOPE

    estimate = β̂1
    sê(β̂1) = sqrt[ msq(residual) / Σ (Xi - X̄)² ]

2) INTERCEPT

    estimate = β̂0
    sê(β̂0) = sqrt[ msq(residual) * ( 1/n + X̄² / Σ (Xi - X̄)² ) ]

3) MEAN at X = x0

    estimate = Ŷ(X=x0) = β̂0 + β̂1 x0
    sê = sqrt[ msq(residual) * ( 1/n + (x0 - X̄)² / Σ (Xi - X̄)² ) ]

4) INDIVIDUAL with X = x0

    estimate = Ŷ(X=x0) = β̂0 + β̂1 x0
    sê = sqrt[ msq(residual) * ( 1 + 1/n + (x0 - X̄)² / Σ (Xi - X̄)² ) ]

Example, continued - Z=LOGWT to X=AGE. Stata yielded the following fit:

------------------------------------------------------------------------------
           z |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .1958909   .0026768    73.18   0.000     .1898356    .2019462   <- 95% CI for slope β1
       _cons |  -2.689255   .0306369   -87.78   0.000    -2.758562   -2.619949
------------------------------------------------------------------------------

95% Confidence Interval for the Slope, β1

1) Best single guess (estimate) = β̂1 = 0.19589
2) Standard error of the best single guess = sê(β̂1) = 0.00268
3) Confidence coefficient = 97.5th percentile of Student t = t(.975; df=9) = 2.26

95% Confidence Interval for slope β1 = estimate ± (confidence coefficient)*SE
                                     = 0.19589 ± (2.26)(0.00268)
                                     = (0.1898, 0.2019)

95% Confidence Interval for the Intercept, β0

------------------------------------------------------------------------------
           z |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .1958909   .0026768    73.18   0.000     .1898356    .2019462
       _cons |  -2.689255   .0306369   -87.78   0.000    -2.758562   -2.619949   <- 95% CI for intercept β0
------------------------------------------------------------------------------

1) Best single guess (estimate) = β̂0 = -2.68925
2) Standard error of the best single guess = sê(β̂0) = 0.03064
3) Confidence coefficient = 97.5th percentile of Student t = t(.975; df=9) = 2.26

95% Confidence Interval for intercept β0 = estimate ± (confidence coefficient)*SE
                                         = -2.68925 ± (2.26)(0.03064)
                                         = (-2.7585, -2.6200)
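Both of these 95% confidence intervals can be reproduced "by hand" from the stored regression results; a sketch:

. quietly regress z x
. scalar tmult = invttail(e(df_r), .025)           // 97.5th percentile of t on n-2 df
. display "95% CI for slope:     " _b[x] - tmult*_se[x] "  to  " _b[x] + tmult*_se[x]
. display "95% CI for intercept: " _b[_cons] - tmult*_se[_cons] "  to  " _b[_cons] + tmult*_se[_cons]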

For the brave ... Stata Example, continued - Confidence Intervals for the MEAN of Z at Each Value of X

. * Fit Z to x
. regress z x

. * Save fitted values xb (this is internal to Stata) to a new variable called zhat
. predict zhat, xb

. ** Obtain SE for the MEAN of Z at each X (this is internal to Stata) in a new variable called semeanz
. predict semeanz, stdp

. ** Obtain confidence coefficient = 97.5th percentile of t on df=9
. generate tmult=invttail(9,.025)

. ** Generate lower and upper 95% CI limits for the MEAN of Z at each X
. generate lowmeanz=zhat-tmult*semeanz
. generate highmeanz=zhat+tmult*semeanz

. list x z zhat lowmeanz highmeanz, clean

        x        z        zhat     lowmeanz   highmeanz
  1.    6   -1.538   -1.513909    -1.549733   -1.478086
  2.    7   -1.284   -1.318018    -1.348894   -1.287142
  3.    8   -1.102   -1.122127    -1.148522   -1.095733
  4.    9    -.903   -.9262364    -.9488933   -.9035797
  5.   10    -.742   -.7303454    -.7504284   -.7102624
  6.   11    -.583   -.5344545    -.5536029   -.5153061
  7.   12    -.372   -.3385637    -.3586467   -.3184806
  8.   13    -.132   -.1426727    -.1653294    -.120016
  9.   14     .053    .0532182     .0268239    .0796125
 10.   15     .275    .2491091     .2182328    .2799851
 11.   16     .449        .445     .4091766    .4808234

Stata Example, continued - Confidence Intervals for an INDIVIDUAL PREDICTED Z at Each Value of X

. * Fit Z to x
. regress z x

. * Save fitted values to a new variable called zhat
. predict zhat, xb

. ** Obtain SE for an INDIVIDUAL PREDICTION of Z at a given X (internal to Stata) in a new variable sepredictz
. predict sepredictz, stdf

. ** Obtain confidence coefficient = 97.5th percentile of t on df=9
. generate tmult=invttail(9,.025)

. ** Generate lower and upper 95% CI limits for the INDIVIDUAL PREDICTED Z at each X
. generate lowpredictz=zhat-tmult*sepredictz
. generate highpredictz=zhat+tmult*sepredictz

. *** List individual predictions with 95% CI limits
. list x z zhat lowpredictz highpredictz, clean

        x        z        zhat    lowpred~z   highpre~z
  1.    6   -1.538   -1.513909    -1.586824   -1.440994
  2.    7   -1.284   -1.318018    -1.388634   -1.247402
  3.    8   -1.102   -1.122127    -1.190902   -1.053353
  4.    9    -.903   -.9262364    -.9936649   -.8588079
  5.   10    -.742   -.7303454    -.7969533   -.6637375
  6.   11    -.583   -.5344545    -.6007866   -.4681225
  7.   12    -.372   -.3385637    -.4051715   -.2719558
  8.   13    -.132   -.1426727    -.2101013   -.0752442
  9.   14     .053    .0532182    -.0155564    .1219927
 10.   15     .275    .2491091     .1784923    .3197255
 11.   16     .449        .445      .372085     .517915
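Rather than listing the limits, the two kinds of bands can also be drawn. A sketch using Stata's lfitci plot type; the stdf option gives the wider band for an individual prediction, while the default gives the narrower band for the mean:

. twoway (lfitci z x, stdf) (lfitci z x) (scatter z x), legend(off) ytitle("LOGWT") xtitle("AGE")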

9. Introduction to Correlation

Definition of Correlation

A correlation coefficient is a measure of the association between two paired random variables (e.g. height and weight). The Pearson product moment correlation, in particular, is a measure of the strength of the straight line relationship between the two random variables.

Another correlation measure (not discussed here) is the Spearman correlation. It is a measure of the strength of the monotone increasing (or decreasing) relationship between the two random variables. The Spearman correlation is a non-parametric (meaning model-free) measure. It is introduced in BIOSTATS 640, Intermediate Biostatistics.

Formula for the Pearson Product Moment Correlation ρ

Population product moment correlation = ρ. Sample-based estimate = r.

Some preliminaries:
(1) Suppose we are interested in the correlation between X and Y.
(2) côv(X,Y) = Σ (xi - x̄)(yi - ȳ) / (n-1) = Sxy / (n-1). This is the covariance(X,Y).
(3) vâr(X) = Σ (xi - x̄)² / (n-1) = Sxx / (n-1), and similarly
(4) vâr(Y) = Σ (yi - ȳ)² / (n-1) = Syy / (n-1)

Formula for the Estimate of the Pearson Product Moment Correlation from a Sample

    ρ̂ = r = côv(X,Y) / sqrt[ vâr(X) vâr(Y) ]  =  Sxy / sqrt[ Sxx Syy ]

If you absolutely have to do it by hand, an equivalent (more calculator/Excel friendly) formula is

    r = [ Σ xi yi - (Σ xi)(Σ yi)/n ] / sqrt{ [ Σ xi² - (Σ xi)²/n ] [ Σ yi² - (Σ yi)²/n ] }

The correlation r can take on values between -1 and +1 only. Thus, the correlation coefficient is said to be dimensionless - it is independent of the units of x or y.

Sign of the correlation coefficient (positive or negative) = sign of the estimated slope β̂1.
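In practice r is usually obtained from software rather than from the hand formula. A sketch of the corresponding Stata commands for the chicken embryo example (variables z and x as before):

. correlate z x                 // Pearson correlation; the two-variable result is saved in r(rho)
. display "r = " r(rho)
. pwcorr z x, sig               // same estimate, printed together with a p-value for HO: rho = 0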

There is a relationship between the slope of the straight line, β̂1, and the estimated correlation r.

Relationship between the slope β̂1 and the sample correlation r - Tip! This is very handy.

Because

    β̂1 = Sxy / Sxx    and    r = Sxy / sqrt[ Sxx Syy ]

a little algebra reveals that

    r = sqrt[ Sxx / Syy ] * β̂1

Thus, beware!!! The size of r is not, by itself, a statement about the size of the slope: a very large (positive or negative) r can accompany a slope that is not far from zero, inasmuch as

- a very large r might reflect a very large Sxx, all other things equal; and
- a very large r might reflect a very small Syy, all other things equal.

10. Hypothesis Test of Correlation

The null hypothesis of zero correlation is equivalent to the null hypothesis of zero slope.

Research Question: Is the correlation ρ = 0? Is the slope β1 = 0?

Assumptions: As before.

HO and HA:
    HO: ρ = 0
    HA: ρ ≠ 0

Test Statistic: A little algebra (not shown) yields a very nice formula for the t-score that we need:

    t-score = r * sqrt(n-2) / sqrt(1 - r²)        with df = (n-2)

We can find this information in our output. Recall the first example and the model of Z=LOGWT to X=AGE. The Pearson correlation, r, is the square root of the R-squared in the output.

      Source |       SS       df       MS              Number of obs =      11
-------------+------------------------------           F(  1,     9) = 5355.60
       Model |  4.22105734     1  4.22105734           Prob > F      =  0.0000
    Residual |   .00709346     9  .000788162           R-squared     =  0.9983
-------------+------------------------------           Adj R-squared =  0.9981
       Total |  4.22815076    10  .422815076           Root MSE      =  .02807

    Pearson correlation, r = sqrt(0.9983) = 0.9991

Substitution into the formula for the t-score yields

    t-score = r * sqrt(n-2) / sqrt(1 - r²)
            = (.9991) * sqrt(9) / sqrt(1 - .9983)
            = 2.9973 / .0412
            = 72.69

Note: The value .9991 in the numerator is r = sqrt(R²) = sqrt(.9983) = .9991.

This is very close to the value of the t-score that was obtained for testing the null hypothesis of zero slope. The discrepancy is probably rounding error. I did the calculations on my calculator using 4 significant digits. Stata probably used more significant digits. - cb
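The same calculation, carried out at full precision from Stata's saved correlation, removes most of the rounding discrepancy; a sketch:

. quietly correlate z x
. scalar rxy = r(rho)
. scalar tstat = rxy*sqrt(9)/sqrt(1 - rxy^2)       // n-2 = 9 for these data
. display "t = " tstat "   two-sided p = " 2*ttail(9, abs(tstat))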