Topic 9. Regression and Correlation


BE540W Regression and Correlation Page 1 of 43

Topic 9 Regression and Correlation

Topics
1. Definition of the Linear Regression Model
2. Estimation
3. The Analysis of Variance Table
4. Assumptions for the Straight Line Regression
5. Hypothesis Testing
6. Confidence Interval Estimation
7. Introduction to Correlation
8. Hypothesis Test for Correlation

1. Definition of the Linear Regression Model

In the last unit, topic 8, the setting was that of two categorical (discrete) variables, such as smoking and low birth weight, and the use of chi-square tests of association and homogeneity. In this unit, topic 9, our focus is the setting of two continuous variables, such as age and weight. This topic is an introduction to simple linear regression and correlation.

Linear Regression
Linear regression models the mean µ of one random variable as a linear function of one or more other variables that are treated as fixed. The estimation and hypothesis testing involved are extensions of ideas and techniques that we have already seen. In linear regression, we observe an outcome or dependent variable Y at several levels of the independent or predictor variable X (there may be more than one predictor X, as seen later). A linear regression model assumes that the values of the predictor X have been fixed in advance of observing Y. However, this is not always the reality. Often Y and X are observed jointly and are both random variables.

Correlation
Correlation considers the association of two random variables. The techniques of estimation and hypothesis testing are the same for linear regression and correlation analyses. Exploring the relationship begins with fitting a line to the points.

We develop the linear regression model analysis for a simple example involving one predictor and one outcome.

Example. Source: Kleinbaum, Kupper, and Muller 1988

Available are pairs of observations of age and weight for n = 11 chicken embryos.

WT=Y     AGE=X   LOGWT=Z
0.029      6     -1.538
0.052      7     -1.284
0.079      8     -1.102
0.125      9     -0.903
0.181     10     -0.742
0.261     11     -0.583
0.425     12     -0.372
0.738     13     -0.132
1.130     14      0.053
1.882     15      0.275
2.812     16      0.449

We'll use a familiar notation. The data are 11 pairs of $(X_i, Y_i)$ where X=AGE and Y=WT:
$(X_1, Y_1) = (6, 0.029)$ ... $(X_{11}, Y_{11}) = (16, 2.812)$
and equivalently, 11 pairs of $(X_i, Y_i)$ where X=AGE and Y=LOGWT:
$(X_1, Y_1) = (6, -1.538)$ ... $(X_{11}, Y_{11}) = (16, 0.449)$

Though simple, it helps to be clear in the research question: Does weight change with age? In the language of analysis of variance we are asking the following: Can the variability in weight be explained, to a significant extent, by variations in age? What is a good functional form that relates age to weight?

We begin with a plot of X=AGE versus Y=WT.

[Figure: Scatter Plot of WT vs AGE, with WT on the vertical axis and AGE on the horizontal axis]

We check and learn about the following:
- The average and median of X
- The range and pattern of variability in X
- The average and median of Y
- The range and pattern of variability in Y
- The nature of the relationship between X and Y
- The strength of the relationship between X and Y
- The identification of any points that might be influential

For these data:
- The plot suggests a relationship between AGE and WT
- A straight line might fit well, but another model might be better
- We have adequate ranges of values for both AGE and WT
- There are no outliers

We might have gotten any of a variety of plots.

[Figure: y versus x, no relationship between X and Y]
[Figure: y versus x, linear relationship between X and Y]
[Figure: y versus x, non-linear relationship between X and Y]

[Figure: y versus x, with an arrow pointing to the outlying point] A fit of a linear model will yield an estimated slope that is spuriously non-zero.

[Figure: y versus x, with an arrow pointing to the outlying point] A fit of a linear model will yield an estimated slope that is spuriously near zero.

[Figure: y versus x, with an arrow pointing to the outlying point] A fit of a linear model will yield an estimated slope that is spuriously high.

The bowl shape of our scatter plot suggests that perhaps a better model relates the logarithm of WT to AGE:

[Figure: Scatter Plot of LOGWT vs AGE, with LOGWT on the vertical axis and AGE on the horizontal axis]

We'll investigate two models.
1) $WT = \beta_0 + \beta_1\,AGE$
2) $LOGWT = \beta_0 + \beta_1\,AGE$
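The notes do not say which logarithm produces the LOGWT column. The tabulated values look like base-10 logarithms of WT rounded to three decimals, and the following minimal Python sketch checks that reading. Treat the base-10 interpretation as an assumption inferred from the table; the variable names are illustrative only.

```python
import math

# Chicken embryo data as tabulated in the example (WT and LOGWT columns).
WT = [0.029, 0.052, 0.079, 0.125, 0.181, 0.261,
      0.425, 0.738, 1.130, 1.882, 2.812]
LOGWT = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
         -0.372, -0.132, 0.053, 0.275, 0.449]

# Assumption under test: LOGWT is log10(WT) rounded to 3 decimals.
max_gap = max(abs(math.log10(w) - z) for w, z in zip(WT, LOGWT))
print(f"largest |log10(WT) - LOGWT| = {max_gap:.4f}")
```

A gap no larger than ordinary rounding error supports the base-10 reading; a natural logarithm would give values about 2.3 times larger in magnitude.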

Recall what you might have learned in an old math class about the equation of a line.

[Figure: the equation of a line]

You might recall, too, a feel for the slope.

[Figure: three lines illustrating Slope > 0, Slope = 0, and Slope < 0]

Definition of the Straight Line Model $Y = \beta_0 + \beta_1 X$

Population
$Y = \beta_0 + \beta_1 X + \varepsilon$
- $Y = \beta_0 + \beta_1 X$ is the relationship in the population. It is measured with error.
- $\varepsilon$ = measurement error
- We do NOT know the value of $\beta_0$ or $\beta_1$ or $\varepsilon$

Sample
$Y = \hat{\beta}_0 + \hat{\beta}_1 X + e$
- $\hat{\beta}_0$, $\hat{\beta}_1$, and $e$ are our guesses of $\beta_0$, $\beta_1$ and $\varepsilon$
- $e$ = residual
- We do have values of $\hat{\beta}_0$, $\hat{\beta}_1$ and $e$

The values of $\hat{\beta}_0$, $\hat{\beta}_1$ and $e$ are obtained by the method of least squares estimation. To see if $\hat{\beta}_0 \approx \beta_0$ and $\hat{\beta}_1 \approx \beta_1$ we perform regression diagnostics. Note: This is not discussed in this course; see PubHlth 640, Intermediate Biostatistics.

A little notation, sorry!
$Y$ = the outcome or dependent variable
$X$ = the predictor or independent variable
$\mu_Y$ = the expected value of Y for all persons in the population
$\mu_{Y|X=x}$ = the expected value of Y for the sub-population for whom X=x
$\sigma^2_Y$ = variability of Y among all persons in the population
$\sigma^2_{Y|X=x}$ = variability of Y for the sub-population for whom X=x

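2. Estimation

We will use the method of least squares to obtain guesses of $\beta_0$ and $\beta_1$.

Goal
From the many possible lines through the scatter of points, choose the one line that is closest to the data.

What do we mean by "close"?
We'd like the vertical distance between each observed $Y_i$ and its corresponding fitted $\hat{Y}_i$ to be as small as possible.

It is not possible to choose $\hat{\beta}_0$ and $\hat{\beta}_1$ so that they minimize $(Y_1 - \hat{Y}_1)^2$ individually, and minimize $(Y_2 - \hat{Y}_2)^2$ individually, and so on through $(Y_n - \hat{Y}_n)^2$ individually.

So, instead, we choose $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize their total

$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \left( Y_i - [\hat{\beta}_0 + \hat{\beta}_1 X_i] \right)^2$$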

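A picture gives a feel for the fact that many lines are possible and that we seek the closest in the sense of vertical distances being as small as possible.

[Figure: scatter of points with several candidate lines and the vertical distances to one of them]

The expression to be minimized,

$$\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \left( Y_i - [\hat{\beta}_0 + \hat{\beta}_1 X_i] \right)^2$$

has a variety of names:
- residual sum of squares
- sum of squares about the regression line
- sum of squares due error (SSE)
- $\sigma_{Y|X}$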

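For the calculus lover, a little calculus yields the solution for the guesses $\hat{\beta}_0$ and $\hat{\beta}_1$.

Consider
$$SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \left( Y_i - [\hat{\beta}_0 + \hat{\beta}_1 X_i] \right)^2$$

Step #1: Differentiate with respect to $\hat{\beta}_1$. Set the derivative equal to 0 and solve.
Step #2: Differentiate with respect to $\hat{\beta}_0$. Set the derivative equal to 0, insert $\hat{\beta}_1$ and solve.

For the non-calculus lover, here are the estimates of $\beta_0$ and $\beta_1$.
- $\beta_1$ is the slope. Its estimate is denoted $\hat{\beta}_1$ or $b_1$.
- $\beta_0$ is the intercept. Its estimate is denoted $\hat{\beta}_0$ or $b_0$.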

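Some very helpful preliminary calculations

$$S_{xx} = \sum (X - \bar{X})^2 = \sum X^2 - N\bar{X}^2$$
$$S_{yy} = \sum (Y - \bar{Y})^2 = \sum Y^2 - N\bar{Y}^2$$
$$S_{xy} = \sum (X - \bar{X})(Y - \bar{Y}) = \sum XY - N\bar{X}\bar{Y}$$

Note: These expressions make use of a special notation called the summation notation. The capital S indicates summation. In $S_{xy}$, the first subscript x is saying $(x - \bar{x})$. The second subscript y is saying $(y - \bar{y})$.

Slope
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{\widehat{cov}(X,Y)}{\widehat{var}(X)} = \frac{S_{xy}}{S_{xx}}$$

Intercept
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

Prediction of Y
$$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X = b_0 + b_1 X$$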
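The preliminary calculations and the slope and intercept formulas above translate almost line for line into code. The following is a minimal Python sketch applied to the chicken embryo data (Y=WT, X=AGE); the function name `least_squares` is illustrative and not part of the notes, which use SAS. It also checks that the definitional and shortcut forms of $S_{xy}$ agree.

```python
def least_squares(x, y):
    """Return Sxx, Syy, Sxy, slope and intercept for paired data x, y."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    slope = sxy / sxx                  # b1 = Sxy / Sxx
    intercept = ybar - slope * xbar    # b0 = Ybar - b1 * Xbar
    return sxx, syy, sxy, slope, intercept

AGE = list(range(6, 17))
WT = [0.029, 0.052, 0.079, 0.125, 0.181, 0.261,
      0.425, 0.738, 1.130, 1.882, 2.812]

sxx, syy, sxy, b1, b0 = least_squares(AGE, WT)

# Shortcut form: Sxy = sum(XY) - N * Xbar * Ybar
n = len(AGE)
xbar, ybar = sum(AGE) / n, sum(WT) / n
sxy_shortcut = sum(a * w for a, w in zip(AGE, WT)) - n * xbar * ybar

print(f"Sxx = {sxx:.3f}, Sxy = {sxy:.3f} (shortcut {sxy_shortcut:.3f})")
print(f"slope b1 = {b1:.5f}, intercept b0 = {b0:.5f}")
```

The shortcut forms are convenient on a calculator; in floating point the definitional (centered) forms are usually the safer choice because they avoid subtracting two large, nearly equal numbers.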

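Do these estimates make sense?

Slope
$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2} = \frac{\widehat{cov}(X,Y)}{\widehat{var}(X)}$$

The linear movement in Y with linear movement in X is measured relative to the variability in X.
- $\hat{\beta}_1 = 0$ says: With a unit change in X, overall there is a 50-50 chance that Y increases versus decreases.
- $\hat{\beta}_1 \neq 0$ says: With a unit increase in X, Y increases also ($\hat{\beta}_1 > 0$) or Y decreases ($\hat{\beta}_1 < 0$).

Intercept
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$$

If the linear model is incorrect, or, if the true model does not have a linear component, we obtain $\hat{\beta}_1 = 0$ and $\hat{\beta}_0 = \bar{Y}$ as our best guess of an unknown Y.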

Illustration

SAS Code.

    data temp;
       input wt age logwt;
       label wt="Weight, Y"
             age="Age, X"
             logwt="Log(Weight),Y";
    cards;
    0.029 6 -1.538
    0.052 7 -1.284
    0.079 8 -1.102
    0.125 9 -0.903
    0.181 10 -0.742
    0.261 11 -0.583
    0.425 12 -0.372
    0.738 13 -0.132
    1.130 14 0.053
    1.882 15 0.275
    2.812 16 0.449
    ;
    run; quit;

    proc reg data=temp simple;  /* option simple produces simple descriptives */
       title "Regression of Y=Weight on X=Age";
       model wt=age;
    run; quit;

Partial listing of output ...

    Parameter Estimates
                                Parameter    Standard
    Variable    Label      DF    Estimate       Error   t Value   Pr > |t|
    Intercept   Intercept   1    -1.88453     0.52584     -3.58     0.0059
    age         Age, X      1     0.23507     0.04594      5.12     0.0006

Annotated

    Parameter Estimates
                                Parameter                           Standard
    Variable    Label      DF    Estimate                              Error   t Value   Pr > |t|
    Intercept   Intercept   1    -1.88453 = intercept = beta0-hat    0.52584     -3.58     0.0059
    age         Age, X      1     0.23507 = slope = beta1-hat        0.04594      5.12     0.0006

The fitted line is therefore WT = -1.88453 + 0.23507*AGE

Let's overlay the fitted line on our scatterplot.

[Figure: Scatter Plot of WT vs AGE with the fitted line overlaid]

As we might have guessed, the straight line model may not be the best choice. The bowl shape of the scatter plot does have a linear component, however. Without the plot, we might have believed the straight line fit is okay.

Let's try a straight line model fit to Y=LOGWT versus X=AGE.

Partial listing of output ...

    Parameter Estimates
                                Parameter    Standard
    Variable    Label      DF    Estimate       Error   t Value   Pr > |t|
    Intercept   Intercept   1    -2.68925     0.03064    -87.78     <.0001
    age         Age, X      1     0.19589     0.00268     73.18     <.0001

Annotated

    Parameter Estimates
                                Parameter                           Standard
    Variable    Label      DF    Estimate                              Error   t Value   Pr > |t|
    Intercept   Intercept   1    -2.68925 = intercept = beta0-hat    0.03064    -87.78     <.0001
    age         Age, X      1     0.19589 = slope = beta1-hat        0.00268     73.18     <.0001

Thus, the fitted line is LOGWT = -2.68925 + 0.19589*AGE

Now the scatterplot with the overlay of the fitted line looks much better.

[Figure: Scatter Plot of LOGWT vs AGE with the fitted line overlaid]

Further discussion will consider the model that relates Y=LOGWT to X=AGE.
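The fitted line for LOGWT can be reproduced without SAS from the slope and intercept formulas given earlier. This is a minimal Python sketch, not the notes' own method; all names are illustrative. It also computes the residuals and shows a standard property of a least squares fit with an intercept: the residuals sum to zero.

```python
AGE = list(range(6, 17))
LOGWT = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
         -0.372, -0.132, 0.053, 0.275, 0.449]

n = len(AGE)
xbar = sum(AGE) / n
ybar = sum(LOGWT) / n
sxx = sum((x - xbar) ** 2 for x in AGE)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(AGE, LOGWT))

b1 = sxy / sxx           # estimated slope
b0 = ybar - b1 * xbar    # estimated intercept

fitted = [b0 + b1 * x for x in AGE]
residuals = [y - f for y, f in zip(LOGWT, fitted)]

print(f"LOGWT = {b0:.5f} + {b1:.5f}*AGE")
print(f"sum of residuals = {sum(residuals):.2e}")
print(f"largest |residual| = {max(abs(e) for e in residuals):.4f}")
```

The small residuals relative to the range of LOGWT are the numerical counterpart of the scatterplot looking "much better" for this model.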

Now You Try: Prediction of Weight from Height
Source: Dixon and Massey (1969)

Individual   Height (X)   Weight (Y)
     1           60          110
     2           60          135
     3           60          120
     4           62          120
     5           62          140
     6           62          130
     7           62          135
     8           64          150
     9           64          145
    10           70          170
    11           70          185
    12           70          160

It helps to do the preliminary calculations:

$\bar{X} = 63.833$    $\bar{Y} = 141.667$
$\sum X_i^2 = 49{,}068$    $\sum Y_i^2 = 246{,}100$    $\sum X_i Y_i = 109{,}380$
$S_{xx} = 171.667$    $S_{yy} = 5{,}266.667$    $S_{xy} = 863.333$

Slope
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} = \frac{863.333}{171.667} = 5.0291$$

Intercept
$$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = 141.667 - (5.0291)(63.8333) = -179.3573$$
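The hand calculation for the height and weight exercise can be checked with a few lines of code. This minimal Python sketch uses the calculator-friendly shortcut forms of $S_{xx}$, $S_{yy}$ and $S_{xy}$; the variable names are illustrative. Because it carries full precision, the intercept differs in the third decimal from the hand value, which was computed from the rounded slope 5.0291.

```python
HEIGHT = [60, 60, 60, 62, 62, 62, 62, 64, 64, 70, 70, 70]
WEIGHT = [110, 135, 120, 120, 140, 130, 135, 150, 145, 170, 185, 160]

n = len(HEIGHT)
xbar = sum(HEIGHT) / n
ybar = sum(WEIGHT) / n

# Raw sums used by the shortcut formulas.
sum_x2 = sum(x * x for x in HEIGHT)
sum_y2 = sum(y * y for y in WEIGHT)
sum_xy = sum(x * y for x, y in zip(HEIGHT, WEIGHT))

sxx = sum_x2 - n * xbar ** 2
syy = sum_y2 - n * ybar ** 2
sxy = sum_xy - n * xbar * ybar

b1 = sxy / sxx
b0 = ybar - b1 * xbar

print(f"Xbar = {xbar:.3f}, Ybar = {ybar:.3f}")
print(f"Sxx = {sxx:.3f}, Syy = {syy:.3f}, Sxy = {sxy:.3f}")
print(f"slope = {b1:.4f}, intercept = {b0:.4f}")
```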

3. The Analysis of Variance Table

In the earlier topic, Summarizing Data, we learned that the numerator of the sample variance of the Y data is $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$. In regression settings where Y is the outcome variable, this same quantity $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$ is appreciated as the total variance of the Y's. As we will see, other names for this are "total sum of squares", "total, corrected", and "SSY". (Note: "corrected" has to do with subtracting the mean before squaring.)

An analysis of variance table is all about partitioning the total variance of the Y's (corrected) into two components:
1. Due residual (the individual Y about the individual prediction $\hat{Y}$)
2. Due regression (the prediction $\hat{Y}$ about the overall mean $\bar{Y}$)

Aside: Why are we interested in such a partition? We'd like to know if, within the data, there exists the suggestion of a linear relationship ("signal") that can be discerned from chance variability ("noise").
1) the leftover variability of the observed Y about the predicted $\hat{Y}$ ("noise")
2) the explained variability of the predicted $\hat{Y}$ about the overall mean $\bar{Y}$ ("signal")

Here is the partition (Note: Look closely and you'll see that both sides are the same)

$$(Y_i - \bar{Y}) = (Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y})$$

Some algebra (not shown) reveals a nice partition of the total variability.

$$\sum (Y_i - \bar{Y})^2 = \sum (Y_i - \hat{Y}_i)^2 + \sum (\hat{Y}_i - \bar{Y})^2$$

Total Sum of Squares = Due Error Sum of Squares + Due Model Sum of Squares

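A closer look

Total Sum of Squares = Due Model Sum of Squares + Due Error Sum of Squares

$$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$$

- $(Y_i - \bar{Y})$ = deviation of $Y_i$ from $\bar{Y}$ that is to be explained
- $(\hat{Y}_i - \bar{Y})$ = "due model", "signal", "systematic", "due regression"
- $(Y_i - \hat{Y}_i)$ = "due error", "noise", or "residual"

In the world of regression analyses, we seek to explain the total variability $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$.

What happens when $\beta_1 \neq 0$?
- A straight line relationship is helpful.
- Best guess is $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$.
- Due model is LARGE because $(\hat{Y} - \bar{Y}) = ([\hat{\beta}_0 + \hat{\beta}_1 X] - \bar{Y}) = \bar{Y} - \hat{\beta}_1 \bar{X} + \hat{\beta}_1 X - \bar{Y} = \hat{\beta}_1 (X - \bar{X})$.
- Due error has to be small.
- due(model)/due(error) will be large.

What happens when $\beta_1 = 0$?
- A straight line relationship is not helpful.
- Best guess is $\hat{Y} = \hat{\beta}_0 = \bar{Y}$.
- Due error is nearly the TOTAL because $(Y - \hat{Y}) = (Y - \hat{\beta}_0) = (Y - \bar{Y})$.
- Due regression has to be small.
- due(model)/due(error) will be small.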

How to Partition the Total Variance

1. The "total" or "total, corrected" refers to the variability of Y about $\bar{Y}$.
   - $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$ is called the "total sum of squares".
   - Degrees of freedom = df = (n-1).
   - Division of the "total sum of squares" by its df yields the "total mean square".

2. The "residual" or "due error" refers to the variability of Y about $\hat{Y}$.
   - $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$ is called the "residual sum of squares".
   - Degrees of freedom = df = (n-2).
   - Division of the "residual sum of squares" by its df yields the "residual mean square".

3. The "regression" or "due model" refers to the variability of $\hat{Y}$ about $\bar{Y}$.
   - $\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 = \hat{\beta}_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2$ is called the "regression sum of squares".
   - Degrees of freedom = df = 1.
   - Division of the "regression sum of squares" by its df yields the "regression mean square" or "model mean square". This is an example of a variance component.

Source             df      Sum of Squares                                        Mean Square
Regression         1       $SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$        SSR/1
Error              (n-2)   $SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$            SSE/(n-2)
Total, corrected   (n-1)   $SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2$

Hint: Mean square = (Sum of squares)/(df)
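The partition of the total sum of squares can be verified numerically. The following minimal Python sketch computes SST, SSE and SSR for the model of Y=LOGWT on X=AGE directly from their definitions, then checks that SST = SSE + SSR and that SSR equals $\hat{\beta}_1^2 S_{xx}$. Names are illustrative.

```python
AGE = list(range(6, 17))
LOGWT = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
         -0.372, -0.132, 0.053, 0.275, 0.449]

n = len(AGE)
xbar = sum(AGE) / n
ybar = sum(LOGWT) / n
sxx = sum((x - xbar) ** 2 for x in AGE)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(AGE, LOGWT))
b1 = sxy / sxx
b0 = ybar - b1 * xbar
fitted = [b0 + b1 * x for x in AGE]

sst = sum((y - ybar) ** 2 for y in LOGWT)               # total, corrected
sse = sum((y - f) ** 2 for y, f in zip(LOGWT, fitted))  # due error
ssr = sum((f - ybar) ** 2 for f in fitted)              # due regression

print(f"SST = {sst:.5f}  (df = {n - 1})")
print(f"SSE = {sse:.5f}  (df = {n - 2})")
print(f"SSR = {ssr:.5f}  (df = 1)")
print(f"SSE + SSR = {sse + ssr:.5f}")
print(f"b1^2 * Sxx = {b1 ** 2 * sxx:.5f}")
```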

Be careful! The question we may ask from an analysis of variance table is a limited one.

Does the fit of the straight line model explain a significant portion of the variability of the individual Y about $\bar{Y}$? Is this better than using $\bar{Y}$ alone?

We are NOT asking:
- Is the choice of the straight line model correct?
- Would another functional form be a better choice?

We'll use a hypothesis test approach and the method of "proof by contradiction".
- We begin with a null hypothesis that says $\beta_1 = 0$ ("no linear relationship").
- Evaluation will focus on the comparison of the "due regression" mean square to the "residual" mean square.
- Recall that we reasoned the following:
  If $\beta_1 \neq 0$, then due(regression)/due(residual) will be LARGE.
  If $\beta_1 = 0$, then due(regression)/due(residual) will be SMALL.
- Our p-value calculation will answer the question: If $\beta_1 = 0$ truly, what are the chances of obtaining a value of due(regression)/due(residual) as large or larger than that observed?

To calculate "chances" we need a probability model. So far, we have not needed one.

4. Assumptions for a Straight Line Regression Analysis

In performing least squares estimation, we did not use a probability model. We were doing geometry. Hypothesis testing requires some assumptions and a probability model.

Assumptions
- The separate observations $Y_1, Y_2, \ldots, Y_n$ are independent.
- The values of the predictor variable X are fixed and measured without error.
- For each value of the predictor variable X=x, the distribution of values of Y follows a normal distribution with mean equal to $\mu_{Y|X=x}$ and common variance equal to $\sigma^2_{Y|x}$.
- The separate means $\mu_{Y|X=x}$ lie on a straight line; that is, $\mu_{Y|X=x} = \beta_0 + \beta_1 x$.

Schematically, here is what the situation looks like (courtesy: Stan Lemeshow).

[Figure: normal distributions of Y at each value of X, with means lying on a straight line]

With these assumptions, we can assess the significance of the variance explained by the model.

$$F = \frac{msq(model)}{msq(residual)} \quad \text{with df} = 1, (n-2)$$

When $\beta_1 = 0$:
- Due model MSR has expected value $\sigma^2_{Y|X}$
- Due residual MSE has expected value $\sigma^2_{Y|X}$
- F = (MSR)/MSE will be close to 1

When $\beta_1 \neq 0$:
- Due model MSR has expected value $\sigma^2_{Y|X} + \beta_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2$
- Due residual MSE has expected value $\sigma^2_{Y|X}$
- F = (MSR)/MSE will be LARGER than 1

We obtain the analysis of variance table for the model of Y=LOGWT to X=AGE. The following is SAS output with annotations.

    Analysis of Variance
                                 Sum of          Mean
    Source             DF       Squares        Square    F Value    Pr > F
    Model               1       4.22106       4.22106    5355.60    <.0001
    Error               9       0.00709    0.00078816
    Corrected Total    10       4.22815

    F Value = MSQ(Regression)/MSQ(Residual)

    Root MSE            0.02807    R-Square    0.9983   = SSQ(regression)/SSQ(total)
    Dependent Mean     -0.53445    Adj R-Sq    0.9981   = R-squared adjusted for n and # predictors
    Coeff Var          -5.25286

This output corresponds to the following.

Source             df           Sum of Squares                                            Mean Square
Regression         1            $SSR = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 = 4.22063$   SSR/1 = 4.22063
Error              (n-2) = 9    $SSE = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = 0.00705$       SSE/(n-2) = 7.838E-4
Total, corrected   (n-1) = 10   $SST = \sum_{i=1}^{n} (Y_i - \bar{Y})^2 = 4.22768$

Other information in this output:

R-SQUARED = Sum of squares regression/Sum of squares total is the proportion of the "total" that we have been able to explain with the fit of the straight line model.
- Be careful! As predictors are added to the model, R-SQUARED can only increase. Eventually, we need to "adjust" this measure to take this into account. See ADJUSTED R-SQUARED.

We also get an overall F test of the null hypothesis that the simple linear model does not explain significantly more variability in LOGWT than the average LOGWT.

F = MSQ(Regression)/MSQ(Residual) = 4.22063/0.0007838 = 5384.94 with df = 1, 9

Achieved significance < 0.0001. Reject $H_O$. Conclude that the fitted line is a significant improvement over the average LOGWT.
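The mean squares, the overall F statistic, R-squared and adjusted R-squared all follow from the sums of squares. The sketch below is a minimal Python illustration for the LOGWT model, with illustrative names. The adjusted R-squared line uses the common formula 1 - (1 - R²)(n - 1)/(n - p - 1) with p = 1 predictor; the notes do not state the formula, so treat it as an assumption. The p-value is not computed here because it needs the F distribution, which the Python standard library does not provide.

```python
import math

AGE = list(range(6, 17))
LOGWT = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
         -0.372, -0.132, 0.053, 0.275, 0.449]

n = len(AGE)
xbar = sum(AGE) / n
ybar = sum(LOGWT) / n
sxx = sum((x - xbar) ** 2 for x in AGE)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(AGE, LOGWT))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

sst = sum((y - ybar) ** 2 for y in LOGWT)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(AGE, LOGWT))
ssr = sst - sse

msr = ssr / 1            # regression mean square, df = 1
mse = sse / (n - 2)      # residual mean square, df = n - 2
f_stat = msr / mse
r2 = ssr / sst
adj_r2 = 1 - (1 - r2) * (n - 1) / (n - 1 - 1)   # assumed formula, p = 1
root_mse = math.sqrt(mse)

print(f"MSR = {msr:.5f}, MSE = {mse:.8f}, F = {f_stat:.2f}")
print(f"R-square = {r2:.4f}, Adj R-sq = {adj_r2:.4f}, Root MSE = {root_mse:.5f}")
```

Carried at full precision, the F statistic agrees with the SAS listing rather than with the hand-calculated table, which used rounded sums of squares.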

5. Hypothesis Testing

Straight Line Model: $Y = \beta_0 + \beta_1 X$
1) Overall F-Test
2) Test of slope
3) Test of intercept

1) Overall F-Test

Research Question: Does the fitted model, the $\hat{Y}$, explain significantly more of the total variability of the Y about $\bar{Y}$ than does $\bar{Y}$?

Assumptions: As before.

$H_O$ and $H_A$:
$H_O: \beta_1 = 0$
$H_A: \beta_1 \neq 0$

Test Statistic:
$$F = \frac{msq(regression)}{msq(residual)}, \quad df = 1, (n-2)$$

Evaluation rule: When the null hypothesis is true, the value of F should be close to 1. Alternatively, when $\beta_1 \neq 0$, the value of F will be LARGER than 1. Thus, our p-value calculation answers: "What are the chances of obtaining our value of the F, or one that is larger, if we believe the null hypothesis that $\beta_1 = 0$?"

Calculations: For our data, we obtain

$$\text{p-value} = pr\left[ F_{1,(n-2)} \geq \frac{msq(model)}{msq(residual)} \,\Big|\, \beta_1 = 0 \right] = pr\left[ F_{1,9} \geq 5384.94 \right] << 0.0001$$

Evaluate: Under the null hypothesis that $\beta_1 = 0$, the chances of obtaining a value of F that is so far away from the expected value of 1, with a value of 5384.94, are less than 1 chance in 10,000. This is a very small likelihood!

Interpret: We have learned that, at least, the fitted straight line model does a much better job of explaining the variability in LOGWT than a model that allows only for the average LOGWT.

Later (BE640, Intermediate Biostatistics), we'll see that the analysis does not stop here.

2) Test of the Slope, $\beta_1$

Some interesting notes!
- The overall F test and the test of the slope are equivalent.
- The test of the slope uses a t-score approach to hypothesis testing.
- It can be shown that { t-score for slope }$^2$ = { overall F }.

Research Question: Is the slope $\beta_1 = 0$?

Assumptions: As before.

$H_O$ and $H_A$:
$H_O: \beta_1 = 0$
$H_A: \beta_1 \neq 0$

Test Statistic: To compute the t-score, we need an estimate of the standard error of $\hat{\beta}_1$:

$$\widehat{SE}(\hat{\beta}_1) = \sqrt{ msq(residual) \left[ \frac{1}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right] }$$

Our t-score is therefore:

$$t\text{-score} = \frac{(observed) - (expected)}{\widehat{se}(expected)} = \frac{\hat{\beta}_1 - 0}{\widehat{se}(\hat{\beta}_1)}, \quad df = (n-2)$$

We can find this information in our output. The following is SAS output with annotations.

    Parameter Estimates
                                Parameter    Standard
    Variable    Label      DF    Estimate       Error   t Value                      Pr > |t|
    Intercept   Intercept   1    -2.68925     0.03064    -87.78                        <.0001
    age         Age, X      1     0.19589     0.00268     73.18 = 0.19589/0.00268      <.0001

    t Value = Estimate/Error

Recall what we mean by a t-score: T=73.38 says "the estimated slope is 73.38 standard error units away from its expected value of zero".

Check that { t-score }$^2$ = { Overall F }: $[73.38]^2 = 5384.62$, which is close.

Evaluation rule: When the null hypothesis is true, the value of t should be close to zero. Alternatively, when $\beta_1 \neq 0$, the value of t will be DIFFERENT from 0. Here, our p-value calculation answers: "What are the chances of obtaining our value of the t, or one that is farther away from 0, if we believe the null hypothesis that $\beta_1 = 0$?"
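The standard error of the slope, its t-score, and the identity { t-score }² = { overall F } can be checked directly. This is a minimal Python sketch for the LOGWT model with illustrative names. At full precision the identity holds exactly (up to floating point), which suggests that the small gap between 73.18 and 73.38 in the notes comes from rounding in the hand calculation.

```python
import math

AGE = list(range(6, 17))
LOGWT = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
         -0.372, -0.132, 0.053, 0.275, 0.449]

n = len(AGE)
xbar = sum(AGE) / n
ybar = sum(LOGWT) / n
sxx = sum((x - xbar) ** 2 for x in AGE)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(AGE, LOGWT))
b1 = sxy / sxx
b0 = ybar - b1 * xbar

sst = sum((y - ybar) ** 2 for y in LOGWT)
sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(AGE, LOGWT))
mse = sse / (n - 2)
f_stat = (sst - sse) / mse          # overall F, df = 1, n-2

se_b1 = math.sqrt(mse / sxx)        # standard error of the slope
t_slope = (b1 - 0) / se_b1          # t-score under H0: beta1 = 0

print(f"se(b1) = {se_b1:.5f}, t = {t_slope:.2f}")
print(f"t^2 = {t_slope ** 2:.2f}, F = {f_stat:.2f}")
```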

Calculations: For our data, we obtain

$$\text{p-value} = pr\left[ \,|t_{(n-2)}| \geq \left| \frac{\hat{\beta}_1 - 0}{\widehat{se}(\hat{\beta}_1)} \right| \,\right] = pr\left[ \,|t_9| \geq 73.38 \,\right] << 0.0001$$

Evaluate: Under the null hypothesis that $\beta_1 = 0$, the chances of obtaining a t-score value that is 73.38 or more standard error units away from the expected value of 0 are less than 1 chance in 10,000.

Interpret: The inference is the same as that for the overall F test. The fitted straight line model does a much better job of explaining the variability in LOGWT than the sample mean.

3) Test of the Intercept, $\beta_0$

This pertains to whether or not the straight line relationship passes through the origin. It is rarely of interest.

Research Question: Is the intercept $\beta_0 = 0$?

Assumptions: As before.

$H_O$ and $H_A$:
$H_O: \beta_0 = 0$
$H_A: \beta_0 \neq 0$

Test Statistic: To compute the t-score for the intercept, we need an estimate of the standard error of $\hat{\beta}_0$:

$$\widehat{SE}(\hat{\beta}_0) = \sqrt{ msq(residual) \left[ \frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right] }$$

Our t-score is therefore:

$$t\text{-score} = \frac{(observed) - (expected)}{\widehat{se}(expected)} = \frac{\hat{\beta}_0 - 0}{\widehat{se}(\hat{\beta}_0)}, \quad df = (n-2)$$

We can find this information in our output. The following is SAS output with annotations.

    Parameter Estimates
                                Parameter    Standard
    Variable    Label      DF    Estimate       Error   t Value                        Pr > |t|
    Intercept   Intercept   1    -2.68925     0.03064    -87.78 = -2.68925/0.03064       <.0001
    age         Age, X      1     0.19589     0.00268     73.18                          <.0001

    t Value = Estimate/Error

This t = -87.78 says "the estimated intercept is 87.78 standard error units away from its expected value of zero".

Evaluation rule: When the null hypothesis is true, the value of t should be close to zero. Alternatively, when $\beta_0 \neq 0$, the value of t will be DIFFERENT from 0. Our p-value calculation answers: "What are the chances of obtaining our value of the t, or one that is farther away from 0, if we believe the null hypothesis that $\beta_0 = 0$?"

Calculations:

$$\text{p-value} = pr\left[ \,|t_{(n-2)}| \geq \left| \frac{\hat{\beta}_0 - 0}{\widehat{se}(\hat{\beta}_0)} \right| \,\right] = pr\left[ \,|t_9| \geq 87.78 \,\right] << 0.0001$$

Evaluate: Under the null hypothesis that $\beta_0 = 0$, the chances of obtaining a t-score value that is 87.78 or more standard error units away from the expected value of 0 are less than 1 chance in 10,000.

Interpret: The inference is that the straight line relationship between Y=LOGWT and X=AGE does not pass through the origin.

6. Confidence Interval Estimation

Straight Line Model: $Y = \beta_0 + \beta_1 X$

Recall (Topic 6, Estimation) that there are 3 elements of a confidence interval:
1) Best single guess (estimate)
2) Standard error of the best single guess (SE[estimate])
3) Confidence coefficient

These will be percentiles from the t distribution with df = (n-2).
- For a 95% confidence interval, this will be the 97.5th percentile.
- For a $(1-\alpha)100\%$ confidence interval, this will be the $(1-\alpha/2)100$th percentile.

Generic Form of Confidence Interval
Lower limit = ( Estimate ) - ( confidence coefficient )*SE( estimate )
Upper limit = ( Estimate ) + ( confidence coefficient )*SE( estimate )

We might want confidence interval estimates of the following 4 parameters:
(1) Slope
(2) Intercept
(3) Mean of subset of population for whom X=x
(4) Individual response for person for whom X=x

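1) SLOPE

estimate = $\hat{\beta}_1$

$$\widehat{se}(b_1) = \sqrt{ msq(residual) \, \frac{1}{\sum_{i=1}^{n} (X_i - \bar{X})^2} }$$

2) INTERCEPT

estimate = $\hat{\beta}_0$

$$\widehat{se}(b_0) = \sqrt{ msq(residual) \left[ \frac{1}{n} + \frac{\bar{X}^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right] }$$

3) MEAN at X=x

estimate = $\hat{Y}_{X=x} = \hat{\beta}_0 + \hat{\beta}_1 x$

$$\widehat{se} = \sqrt{ msq(residual) \left[ \frac{1}{n} + \frac{(x - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right] }$$

4) INDIVIDUAL with X=x

estimate = $\hat{Y}_{X=x} = \hat{\beta}_0 + \hat{\beta}_1 x$

$$\widehat{se} = \sqrt{ msq(residual) \left[ 1 + \frac{1}{n} + \frac{(x - \bar{X})^2}{\sum_{i=1}^{n} (X_i - \bar{X})^2} \right] }$$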

Illustration for the model which fits Y=LOGWT to X=AGE. Recall that we obtained the following fit:

    Parameter Estimates
                                Parameter    Standard
    Variable    Label      DF    Estimate       Error   t Value   Pr > |t|
    Intercept   Intercept   1    -2.68925     0.03064    -87.78     <.0001
    age         Age, X      1     0.19589     0.00268     73.18     <.0001

95% Confidence Interval for the Slope, $\beta_1$
1) Best single guess (estimate) = $\hat{\beta}_1 = 0.19589$
2) Standard error of the best single guess (SE[estimate]) = $\widehat{se}(\hat{\beta}_1) = 0.00268$
3) Confidence coefficient = 97.5th percentile of Student t = $t_{.975;\,df=9} = 2.26$

95% Confidence Interval for Slope $\beta_1$ = Estimate ± ( confidence coefficient )*SE
= 0.19589 ± (2.26)(0.00268)
= (0.1898, 0.2019)

95% Confidence Interval for the Intercept, $\beta_0$
1) Best single guess (estimate) = $\hat{\beta}_0 = -2.68925$
2) Standard error of the best single guess (SE[estimate]) = $\widehat{se}(\hat{\beta}_0) = 0.03064$
3) Confidence coefficient = 97.5th percentile of Student t = $t_{.975;\,df=9} = 2.26$

95% Confidence Interval for Intercept $\beta_0$ = Estimate ± ( confidence coefficient )*SE
= -2.68925 ± (2.26)(0.03064)
= (-2.7585, -2.6200)
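The two intervals above follow the generic form estimate ± (confidence coefficient) × SE. The minimal Python sketch below recomputes them from the raw data. The constant `T_CRIT = 2.262` is the tabled 97.5th percentile of Student's t with 9 degrees of freedom, entered by hand as an assumption because the Python standard library has no t quantile function; the notes round it to 2.26. Other names are illustrative.

```python
import math

AGE = list(range(6, 17))
LOGWT = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
         -0.372, -0.132, 0.053, 0.275, 0.449]
T_CRIT = 2.262   # assumed table value: t(.975; df = 9)

n = len(AGE)
xbar = sum(AGE) / n
ybar = sum(LOGWT) / n
sxx = sum((x - xbar) ** 2 for x in AGE)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(AGE, LOGWT))
b1 = sxy / sxx
b0 = ybar - b1 * xbar
mse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(AGE, LOGWT)) / (n - 2)

se_b1 = math.sqrt(mse / sxx)
se_b0 = math.sqrt(mse * (1 / n + xbar ** 2 / sxx))

def conf_int(estimate, se, coef=T_CRIT):
    """Generic form: estimate -/+ (confidence coefficient) * SE."""
    return estimate - coef * se, estimate + coef * se

slope_ci = conf_int(b1, se_b1)
intercept_ci = conf_int(b0, se_b0)
print(f"slope:     {b1:.5f}, 95% CI ({slope_ci[0]:.4f}, {slope_ci[1]:.4f})")
print(f"intercept: {b0:.5f}, 95% CI ({intercept_ci[0]:.4f}, {intercept_ci[1]:.4f})")
```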

Code. Confidence Intervals for Predictions

    proc reg data=temp alpha=.05;  /* alpha=.05 is type I error */
       title "Regression of Y=Weight on X=Age";
       model logwt=age / cli clm;  /* cli for individual, clm for mean */
    run; quit;

Output.

    Output Statistics
          Dependent   Predicted     Std Error
    Obs    Variable       Value  Mean Predict       95% CL Mean        95% CL Predict     Residual
      1     -1.5380     -1.5139        0.0158   -1.5497   -1.4781   -1.5868   -1.4410      -0.0241
      2     -1.2840     -1.3180        0.0136   -1.3489   -1.2871   -1.3886   -1.2474       0.0340
      3     -1.1020     -1.1221        0.0117   -1.1485   -1.0957   -1.1909   -1.0534       0.0201
      4     -0.9030     -0.9262        0.0100   -0.9489   -0.9036   -0.9937   -0.8588       0.0232
      5     -0.7420     -0.7303      0.008878   -0.7504   -0.7103   -0.7970   -0.6637      -0.0117
      6     -0.5830     -0.5345      0.008465   -0.5536   -0.5153   -0.6008   -0.4681      -0.0485
      7     -0.3720     -0.3386      0.008878   -0.3586   -0.3185   -0.4052   -0.2720      -0.0334
      8     -0.1320     -0.1427        0.0100   -0.1653   -0.1200   -0.2101   -0.0752       0.0107
      9      0.0530      0.0532        0.0117    0.0268    0.0796   -0.0156    0.1220    -0.000218
     10      0.2750      0.2491        0.0136    0.2182    0.2800    0.1785    0.3197       0.0259
     11      0.4490      0.4450        0.0158    0.4092    0.4808    0.3721    0.5179     0.004000
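The "95% CL Mean" and "95% CL Predict" columns apply the standard error formulas for the mean at X=x and for an individual with X=x. The following minimal Python sketch recomputes both intervals for every observed age in the LOGWT model. The function name `intervals` and the hand-entered constant `T_CRIT = 2.262` (tabled t percentile for 9 degrees of freedom) are assumptions made for illustration.

```python
import math

AGE = list(range(6, 17))
LOGWT = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
         -0.372, -0.132, 0.053, 0.275, 0.449]
T_CRIT = 2.262   # assumed table value: t(.975; df = 9)

n = len(AGE)
xbar = sum(AGE) / n
ybar = sum(LOGWT) / n
sxx = sum((x - xbar) ** 2 for x in AGE)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(AGE, LOGWT))
b1 = sxy / sxx
b0 = ybar - b1 * xbar
mse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(AGE, LOGWT)) / (n - 2)

def intervals(x0):
    """Return (prediction, CI for the mean, CI for an individual) at X = x0."""
    y0 = b0 + b1 * x0
    core = 1 / n + (x0 - xbar) ** 2 / sxx
    se_mean = math.sqrt(mse * core)          # mean at X = x0
    se_indiv = math.sqrt(mse * (1 + core))   # individual with X = x0
    ci_mean = (y0 - T_CRIT * se_mean, y0 + T_CRIT * se_mean)
    ci_indiv = (y0 - T_CRIT * se_indiv, y0 + T_CRIT * se_indiv)
    return y0, ci_mean, ci_indiv

for x0 in AGE:
    y0, cm, ci = intervals(x0)
    print(f"AGE={x0:2d}  pred={y0:8.4f}  "
          f"mean CI=({cm[0]:8.4f}, {cm[1]:8.4f})  "
          f"indiv CI=({ci[0]:8.4f}, {ci[1]:8.4f})")
```

The interval for an individual is always wider than the interval for the mean at the same X, and both are narrowest at the mean age of 11.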

7. Introduction to Correlation

Definition of Correlation

A correlation coefficient is a measure of the association between two paired random variables (e.g. height and weight).

The Pearson product moment correlation, in particular, is a measure of the strength of the straight line relationship between the two random variables.

Another correlation measure (not discussed here) is the Spearman correlation. It is a measure of the strength of the monotone increasing (or decreasing) relationship between the two random variables. The Spearman correlation is a non-parametric (meaning model free) measure. It is introduced in PubHlth 640, Intermediate Biostatistics.

Formula for the Pearson Product Moment Correlation ρ
- The population parameter designation is rho, written as ρ.
- The estimate of ρ, based on information in a sample, is represented using r.

Some preliminaries:
(1) Suppose we are interested in the correlation between X and Y.

(2) $\widehat{cov}(X,Y) = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{(n-1)} = \dfrac{S_{xy}}{(n-1)}$. This is the covariance(X,Y).

(3) $\widehat{var}(X) = \dfrac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{(n-1)} = \dfrac{S_{xx}}{(n-1)}$, and similarly

(4) $\widehat{var}(Y) = \dfrac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{(n-1)} = \dfrac{S_{yy}}{(n-1)}$

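Formula for Estimate of Pearson Product Moment Correlation from a Sample

$$\hat{\rho} = r = \frac{\widehat{cov}(X,Y)}{\sqrt{\widehat{var}(X)\,\widehat{var}(Y)}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$

If you absolutely have to do it by hand, an equivalent (more calculator friendly) formula is

$$\hat{\rho} = r = \frac{\sum_{i=1}^{n} x_i y_i - \dfrac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sqrt{\left[\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]\left[\sum_{i=1}^{n} y_i^2 - \dfrac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}\right]}}$$

- The correlation r can take on values between -1 and +1 only.
- Thus, the correlation coefficient is said to be dimensionless: it is independent of the units of x or y.
- Sign of the correlation coefficient (positive or negative) = sign of the estimated slope $\hat{\beta}_1$.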

There is a relationship between the slope of the straight line, $\hat{\beta}_1$, and the estimated correlation r.

Relationship between slope $\hat{\beta}_1$ and the sample correlation r

Because
$$\hat{\beta}_1 = \frac{S_{xy}}{S_{xx}} \quad \text{and} \quad r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$

a little algebra reveals that
$$r = \sqrt{\frac{S_{xx}}{S_{yy}}} \; \hat{\beta}_1$$

Thus, beware!!! It is possible for a very large (positive or negative) r to accompany a very non-zero slope, inasmuch as
- A very large r might reflect a very large $S_{xx}$, all other things equal.
- A very large r might reflect a very small $S_{yy}$, all other things equal.
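The link between r and the slope, and the claim that r does not depend on the units of measurement, can both be illustrated numerically. In the minimal Python sketch below, `pearson_and_slope` is an illustrative helper, and the rescaling of AGE by a factor of 24 is a hypothetical change of units chosen only to show the effect: the slope changes by the same factor while r is unchanged.

```python
import math

def pearson_and_slope(x, y):
    """Return (r, slope, Sxx, Syy) for paired data."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    syy = sum((yi - ybar) ** 2 for yi in y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    return sxy / math.sqrt(sxx * syy), sxy / sxx, sxx, syy

AGE = list(range(6, 17))
LOGWT = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
         -0.372, -0.132, 0.053, 0.275, 0.449]

r, b1, sxx, syy = pearson_and_slope(AGE, LOGWT)
r_from_slope = math.sqrt(sxx / syy) * b1     # r = sqrt(Sxx/Syy) * b1

# Hypothetical change of units for X: multiply every age by 24.
AGE_RESCALED = [24 * a for a in AGE]
r2, b1_rescaled, _, _ = pearson_and_slope(AGE_RESCALED, LOGWT)

print(f"r = {r:.5f}, sqrt(Sxx/Syy)*b1 = {r_from_slope:.5f}")
print(f"slope = {b1:.5f}, slope after rescaling X = {b1_rescaled:.6f}")
print(f"r after rescaling X = {r2:.5f}")
```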

8. Hypothesis Test of Correlation

The null hypothesis of zero correlation is equivalent to the null hypothesis of zero slope.

Research Question: Is the correlation ρ = 0? Is the slope $\beta_1 = 0$?

Assumptions: As before.

$H_O$ and $H_A$:
$H_O: \rho = 0$
$H_A: \rho \neq 0$

Test Statistic: A little algebra (not shown) yields a very nice formula for the t-score that we need.

$$t\text{-score} = \frac{r \sqrt{(n-2)}}{\sqrt{1 - r^2}}, \quad df = (n-2)$$

We can find this information in our output. Recall the first example and the model of Y=LOGWT to X=AGE. The Pearson Correlation, r, is the square root of the R-squared in the output.

    Root MSE            0.02807    R-Square    0.9983
    Dependent Mean     -0.53445    Adj R-Sq    0.9981
    Coeff Var          -5.25286

Pearson Correlation, $r = \sqrt{0.9983} = 0.9991$

Substitution into the formula for the t-score yields

$$t\text{-score} = \frac{r \sqrt{(n-2)}}{\sqrt{1 - r^2}} = \frac{0.9991 \sqrt{9}}{\sqrt{1 - 0.9983}} = \frac{2.9974}{0.0412} = 72.69$$

Note: The value 0.9991 in the numerator is $r = \sqrt{R^2} = \sqrt{0.9983} = 0.9991$.

This is very close to the value of the t-score that was obtained for testing the null hypothesis of zero slope. The discrepancy is probably rounding error. I did the calculations on my calculator using 4 significant digits. SAS probably used more significant digits. - cb
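The rounding explanation can be checked by carrying full precision. The minimal Python sketch below (illustrative names) computes the correlation t-score and the slope t-score from the raw LOGWT data, and then repeats the correlation t-score with the rounded inputs r = 0.9991 and R² = 0.9983 used in the hand calculation.

```python
import math

AGE = list(range(6, 17))
LOGWT = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
         -0.372, -0.132, 0.053, 0.275, 0.449]

n = len(AGE)
xbar = sum(AGE) / n
ybar = sum(LOGWT) / n
sxx = sum((x - xbar) ** 2 for x in AGE)
syy = sum((y - ybar) ** 2 for y in LOGWT)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(AGE, LOGWT))

# t-score from the correlation, full precision
r = sxy / math.sqrt(sxx * syy)
t_corr = r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# t-score from the slope, full precision
b1 = sxy / sxx
b0 = ybar - b1 * xbar
mse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(AGE, LOGWT)) / (n - 2)
t_slope = b1 / math.sqrt(mse / sxx)

# Same formula with the rounded values used in the hand calculation
t_rounded = 0.9991 * math.sqrt(9) / math.sqrt(1 - 0.9983)

print(f"t from r (full precision)  = {t_corr:.2f}")
print(f"t from slope               = {t_slope:.2f}")
print(f"t from rounded r and R^2   = {t_rounded:.2f}")
```

With unrounded inputs the two t-scores coincide, so the difference seen in the hand calculation is attributable to rounding; the denominator $\sqrt{1 - r^2}$ is very sensitive to the last digits of r when r is close to 1.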