Unit 9 Regression and Correlation

PubHlth 540 - Fall 2014 Regression and Correlation Page 1 of 44

"Assume that a statistical model such as a linear model is a good first start only" - Gerald van Belle

Is higher blood pressure in the mom associated with a lower birth weight of her baby? Simple linear regression explores the relationship of one continuous outcome (Y=birth weight) with one continuous predictor (X=blood pressure). At the heart of statistics is the fitting of models to observed data followed by an examination of how they perform.

-1- "somewhat useful"
The fitted model is a sufficiently good fit to the data if it permits exploration of hypotheses such as "higher blood pressure during pregnancy is associated with statistically significant lower birth weight" and it permits assessment of confounding, effect modification, and mediation. These are ideas that will be developed in the PubHlth 640 unit on Multivariable Linear Regression.

-2- "more useful"
The fitted model can be used to predict the outcomes of future observations. For example, we might be interested in predicting the birth weight of the baby born to a mom with systolic blood pressure 145 mm Hg.

-3- "most useful"
Sometimes, but not so much in public health, the fitted model derives from a physical equation. An example is Michaelis-Menten kinetics. A Michaelis-Menten model is fit to the data for the purpose of estimating the actual rate of a particular chemical reaction.

Hence "A linear model is a good first start only".

Table of Contents

1. Unit Roadmap
2. Learning Objectives
3. Definition of the Linear Regression Model
4. Estimation
5. The Analysis of Variance Table
6. Assumptions for the Straight Line Regression
7. Hypothesis Testing
8. Confidence Interval Estimation
9. Introduction to Correlation
10. Hypothesis Test for Correlation

1. Unit Roadmap

Simple linear regression is used when there is one response (dependent, Y) variable and one explanatory (independent, X) variable and both are continuous. Examples of explanatory (independent) and response (dependent) variable pairs are height and weight, age and blood pressure, etc.

-1- A simple linear regression analysis begins with a scatterplot of the data to see if a straight line model is appropriate:

$y = \beta_0 + \beta_1 x$

where
Y = the response or dependent variable
X = the explanatory or independent variable
$\beta_1$ = slope (the change in y per 1 unit change in x)
$\beta_0$ = intercept (the value of y when x = 0)

-2- The sample data are used to estimate the parameter values and their standard errors.

-3- The fitted model is then compared to the simpler model $y = \beta_0$, which says that y is not linearly related to x.

2. Learning Objectives

When you have finished this unit, you should be able to:

- Explain what is meant by independent versus dependent variable and what is meant by a linear relationship;
- Produce and interpret a scatterplot;
- Define and explain the intercept and slope parameters of a linear relationship;
- Explain the theory of least squares estimation of the intercept and slope parameters of a linear relationship;
- Calculate by hand least squares estimation of the intercept and slope parameters of a linear relationship;
- Explain the theory of the analysis of variance of simple linear regression;
- Calculate by hand the analysis of variance of simple linear regression;
- Explain, compute, and interpret $R^2$ in the context of simple linear regression;
- State and explain the assumptions required for estimation and hypothesis tests in regression;
- Explain, compute, and interpret the overall F-test in simple linear regression;
- Interpret the computer output of a simple linear regression analysis from a package such as Stata, SAS, SPSS, Minitab, etc.;
- Define and interpret the value of a Pearson Product Moment Correlation, r;
- Explain the relationship between the Pearson product moment correlation r and the linear regression slope parameter; and
- Calculate by hand confidence interval estimation and statistical hypothesis testing of the Pearson product moment correlation r.

3. Definition of the Linear Regression Model

Unit 8 considered two categorical (discrete) variables, such as smoking (yes/no) and low birth weight (yes/no). It was an introduction to chi-square tests of association.

Unit 9 considers two continuous variables, such as age and weight. It is an introduction to simple linear regression and correlation.

A wonderful introduction to the intuition of linear regression can be found in the text by Freedman, Pisani, and Purves (Statistics. WW Norton & Co., 1978). The following is excerpted from pp 146 and 148 of their text:

"How is weight related to height? For example, there were 411 men aged 18 to 24 in Cycle I of the Health Examination Survey. Their average height was 5 feet 8 inches = 68 inches, with an overall average weight of 158 pounds. But those men who were one inch above average height had a somewhat higher average weight. Those men who were two inches above average height had a still higher average weight. And so on. On the average, how much of an increase in weight is associated with each unit increase in height? The best way to get started is to look at the scattergram for these heights and weights. The object is to see how weight depends on height, so height is taken as the independent variable and plotted horizontally ...

... The regression line is to a scatter diagram as the average is to a list. The regression line estimates the average value for the dependent variable corresponding to each value of the independent variable."

Linear Regression

Linear regression models the mean µ = E[Y] of one random variable Y as a linear function of one or more other variables (called predictors or explanatory variables) that are treated as fixed. The estimation and hypothesis testing involved are extensions of ideas and techniques that we have already seen. In linear regression,

- Y is the outcome or dependent variable that we observe. We observe its values for individuals with various combinations of values of a predictor or explanatory variable X. There may be more than one predictor "X"; this will be discussed in PubHlth 640.
- In simple linear regression the values of the predictor "X" are assumed to be fixed.

Often, however, the variables Y and X are both random variables.

Correlation

Correlation considers the association of two random variables. The techniques of estimation and hypothesis testing are the same for linear regression and correlation analyses. Exploring the relationship begins with fitting a line to the points.

Development of a simple linear regression model analysis

Example. Source: Kleinbaum, Kupper, and Muller 1988

The following are observations of age (days) and weight (kg) for n = 11 chicken embryos.

WT=Y     AGE=X   LOGWT=Z
0.029      6     -1.538
0.052      7     -1.284
0.079      8     -1.102
0.125      9     -0.903
0.181     10     -0.742
0.261     11     -0.583
0.425     12     -0.372
0.738     13     -0.132
1.130     14      0.053
1.882     15      0.275
2.812     16      0.449

Notation

The data are 11 pairs of $(X_i, Y_i)$ where X=AGE and Y=WT:
$(X_1, Y_1) = (6, 0.029)$ ... $(X_{11}, Y_{11}) = (16, 2.812)$

This table also provides 11 pairs of $(X_i, Z_i)$ where X=AGE and Z=LOGWT:
$(X_1, Z_1) = (6, -1.538)$ ... $(X_{11}, Z_{11}) = (16, 0.449)$
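The Z column can be checked against the Y column. The short sketch below assumes that LOGWT is the base-10 logarithm of WT rounded to three decimals; the notes do not state the base, but the listed values are consistent with it. The variable names are ours.

```python
import math

# Chick embryo data as we read it from the example table.
age = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
wt = [0.029, 0.052, 0.079, 0.125, 0.181, 0.261,
      0.425, 0.738, 1.130, 1.882, 2.812]

# Assumption: Z = LOGWT is log base 10 of WT, rounded to 3 decimals.
logwt = [round(math.log10(w), 3) for w in wt]

for x, y, z in zip(age, wt, logwt):
    print(x, y, z)
```

If the assumption holds, the printed third column reproduces the LOGWT values of the example.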

Research question

There are a variety of possible research questions:
(1) Does weight change with age?
(2) In the language of analysis of variance we are asking the following: Can the variability in weight be explained, to a significant extent, by variations in age?
(3) What is a "good" functional form that relates age to weight?

Tip! Begin with a scatter plot. Here we plot X=AGE versus Y=WT.

[Figure: scatter plot of Y=WT versus X=AGE]

We check and learn about the following:
- The average and median of X
- The range and pattern of variability in X
- The average and median of Y
- The range and pattern of variability in Y
- The nature of the relationship between X and Y
- The strength of the relationship between X and Y
- The identification of any points that might be influential

Example, continued

- The plot suggests a relationship between AGE and WT
- A straight line might fit well, but another model might be better
- We have adequate ranges of values for both AGE and WT
- There are no outliers

The "bowl" shape of our scatter plot suggests that perhaps a better model relates the logarithm of WT (Z=LOGWT) to AGE:

[Figure: scatter plot of Z=LOGWT versus X=AGE]

We might have gotten any of a variety of plots.

[Figure, panel 1: No relationship between X and Y]
[Figure, panel 2: Linear relationship between X and Y]
[Figure, panel 3: Non-linear relationship between X and Y]

[Figure, panel 1: Note the outlying point. Here, a fit of a linear model will yield an estimated slope that is spuriously non-zero.]
[Figure, panel 2: Note the outlying point. Here, a fit of a linear model will yield an estimated slope that is spuriously near zero.]
[Figure, panel 3: Note the outlying point. Here, a fit of a linear model will yield an estimated slope that is spuriously high.]

Review of the Straight Line

Way back when, in your high school days, you may have been introduced to the straight line function, defined as "y = mx + b" where m is the slope and b is the intercept. Nothing new here. All we're doing is changing the notation a bit:
(1) Slope: m → $\beta_1$
(2) Intercept: b → $\beta_0$

[Figure: three lines illustrating Slope > 0, Slope = 0, and Slope < 0]

Definition of the Straight Line Model $Y = \beta_0 + \beta_1 X$

Population
$Y = \beta_0 + \beta_1 X + \varepsilon$
- $Y = \beta_0 + \beta_1 X$ is the relationship in the population.
- It is measured with error $\varepsilon$, defined as $\varepsilon = [Y] - [\beta_0 + \beta_1 X]$.
- $\beta_0$, $\beta_1$ and $\varepsilon$ are unknown.

Sample
$Y = \hat{\beta}_0 + \hat{\beta}_1 X + e$
- $\hat{\beta}_0$, $\hat{\beta}_1$ and $e$ are estimates of $\beta_0$, $\beta_1$ and $\varepsilon$.
- residual = $e$ is now the difference between the observed and the fitted (not the true): $e = [Y] - [\hat{\beta}_0 + \hat{\beta}_1 X]$.
- $\hat{\beta}_0$, $\hat{\beta}_1$ and $e$ are known.
- The values of $\hat{\beta}_0$, $\hat{\beta}_1$ and $e$ are obtained by the method of least squares estimation.

How close did we get? To see if $\hat{\beta}_0 \approx \beta_0$ and $\hat{\beta}_1 \approx \beta_1$ we perform regression diagnostics. Regression diagnostics are discussed in PubHlth 640.

Notation
Y = the outcome or dependent variable
X = the predictor or independent variable
$\mu_Y$ = the expected value of Y for all persons in the population
$\mu_{Y|X=x}$ = the expected value of Y for the sub-population for whom X=x
$\sigma^2_Y$ = variability of Y among all persons in the population
$\sigma^2_{Y|X=x}$ = variability of Y for the sub-population for whom X=x

4. Estimation

Least squares estimation is used to obtain guesses of $\beta_0$ and $\beta_1$. When the outcome Y is distributed normal, least squares estimation is the same as maximum likelihood estimation.

Note - If you are not familiar with maximum likelihood estimation, don't worry. This is introduced in PubHlth 640.

"Least Squares", "Close" and Least Squares Estimation

Theoretically, it is possible to draw many lines through an X-Y scatter of points. Which to choose? "Least squares" estimation is one approach to choosing a line that is "closest" to the data.

- $d_i$: Perhaps we'd like $d_i$ = [observed Y - fitted $\hat{Y}$] = smallest possible. Note that this is a vertical distance, since it is a distance on the vertical axis.
- $d_i^2$: Better yet, perhaps we'd like to minimize the squared difference: $d_i^2$ = [observed Y - fitted $\hat{Y}$]$^2$ = smallest possible.

We can't do this minimization separately for each X-Y pair. That is, it is not possible to choose common values of $\hat{\beta}_0$ and $\hat{\beta}_1$ that minimize
$d_1^2 = (Y_1 - \hat{Y}_1)^2$ for subject 1 and
$d_2^2 = (Y_2 - \hat{Y}_2)^2$ for subject 2 and
... and
$d_n^2 = (Y_n - \hat{Y}_n)^2$ for the nth subject.

So, instead, we choose values for $\hat{\beta}_0$ and $\hat{\beta}_1$ that, upon insertion, minimize the total

$\sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \left( Y_i - [\hat{\beta}_0 + \hat{\beta}_1 X_i] \right)^2$
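To make the idea of "closest" concrete, the following sketch evaluates the total of squared vertical distances for the least squares line and for a few nearby lines, using the chick embryo data (Y=WT, X=AGE) as we read it from the example. The competing lines are hypothetical, chosen only for illustration; the closed-form estimates used here are the ones given in the least squares solutions of this unit.

```python
# Chick embryo data (AGE in days, WT), as we read it from the example.
age = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
wt = [0.029, 0.052, 0.079, 0.125, 0.181, 0.261,
      0.425, 0.738, 1.130, 1.882, 2.812]


def sse(b0, b1, x, y):
    """Total of squared vertical distances between y and the line b0 + b1*x."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))


# Least squares estimates from the closed-form solution.
n = len(age)
xbar = sum(age) / n
ybar = sum(wt) / n
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(age, wt))
sxx = sum((xi - xbar) ** 2 for xi in age)
b1 = sxy / sxx
b0 = ybar - b1 * xbar

best = sse(b0, b1, age, wt)

# Hypothetical competing lines: nudge the intercept and/or the slope.
others = [
    sse(b0 + 0.1, b1, age, wt),
    sse(b0, b1 + 0.01, age, wt),
    sse(b0 - 0.1, b1 - 0.01, age, wt),
]
print(best, others)
```

Every nudged line gives a larger total than the least squares line, which is what "least squares" means.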

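$\sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \left( Y_i - [\hat{\beta}_0 + \hat{\beta}_1 X_i] \right)^2$ has a variety of names:
- residual sum of squares
- sum of squares about the regression line
- sum of squares due error (SSE)

Divided by its degrees of freedom, (n-2), it gives $\hat{\sigma}^2_{Y|X}$, the estimate of the variance of Y about the line.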

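Least Squares Estimation of the Slope and Intercept

In case you're interested ...

Consider SSE = $\sum_{i=1}^{n} d_i^2 = \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2 = \sum_{i=1}^{n} \left( Y_i - [\hat{\beta}_0 + \hat{\beta}_1 X_i] \right)^2$

Step #1: Differentiate with respect to $\hat{\beta}_1$. Set the derivative equal to 0 and solve for $\hat{\beta}_1$.
Step #2: Differentiate with respect to $\hat{\beta}_0$. Set the derivative equal to 0, insert $\hat{\beta}_1$ and solve for $\hat{\beta}_0$.

Least Squares Estimation Solutions

Note - the estimates are denoted either using Greek letters with a caret or with Roman letters.

Slope, $\hat{\beta}_1$ or $b_1$:
$\hat{\beta}_1 = \dfrac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}$

Intercept, $\hat{\beta}_0$ or $b_0$:
$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$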
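For readers who want the intermediate algebra, here is one way the two differentiation steps can be carried out. It is a sketch of the standard derivation, not necessarily in the order the notes intend, and it solves the two equations jointly.

```latex
\begin{align*}
\mathrm{SSE} &= \sum_{i=1}^{n}\left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right)^2 \\
\frac{\partial\,\mathrm{SSE}}{\partial \hat{\beta}_0}
  &= -2\sum_{i=1}^{n}\left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right) = 0
  \;\Longrightarrow\; \hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} \\
\frac{\partial\,\mathrm{SSE}}{\partial \hat{\beta}_1}
  &= -2\sum_{i=1}^{n} X_i\left(Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_i\right) = 0 \\
\text{insert } \hat{\beta}_0:\quad
  & \sum_{i=1}^{n} X_i\left[(Y_i - \bar{Y}) - \hat{\beta}_1 (X_i - \bar{X})\right] = 0 \\
\hat{\beta}_1
  &= \frac{\sum_{i=1}^{n} X_i (Y_i - \bar{Y})}{\sum_{i=1}^{n} X_i (X_i - \bar{X})}
   = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}
\end{align*}
```

The last equality uses $\sum_{i}(Y_i - \bar{Y}) = 0$ and $\sum_{i}(X_i - \bar{X}) = 0$, so subtracting $\bar{X}$ from each $X_i$ in the numerator and the denominator changes nothing.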

A closer look

Some very helpful preliminary calculations:

$S_{xx} = \sum (X - \bar{X})^2 = \sum X^2 - N\bar{X}^2$
$S_{yy} = \sum (Y - \bar{Y})^2 = \sum Y^2 - N\bar{Y}^2$
$S_{xy} = \sum (X - \bar{X})(Y - \bar{Y}) = \sum XY - N\bar{X}\bar{Y}$

Note - These expressions make use of a special notation called the "summation notation". The capital "S" indicates "summation". In $S_{xy}$, the first subscript "x" is saying $(x - \bar{x})$. The second subscript "y" is saying $(y - \bar{y})$.

Slope
$\hat{\beta}_1 = \dfrac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \dfrac{\widehat{\mathrm{cov}}(X, Y)}{\widehat{\mathrm{var}}(X)} = \dfrac{S_{xy}}{S_{xx}}$

Intercept
$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$

Prediction of Y
$\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X = b_0 + b_1 X$
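The preliminary calculations translate directly into a few lines of code. The sketch below uses the computational forms of $S_{xx}$ and $S_{xy}$ on the chick embryo data (Y=WT, X=AGE) as we read it from the example; the function name `least_squares` is ours, not part of any package.

```python
def least_squares(x, y):
    """Return (b0, b1, sxx, sxy) for the straight line fit of y on x."""
    n = len(x)
    xbar = sum(x) / n
    ybar = sum(y) / n
    # Computational forms: Sxx = sum(X^2) - N*Xbar^2, Sxy = sum(XY) - N*Xbar*Ybar
    sxx = sum(xi * xi for xi in x) - n * xbar ** 2
    sxy = sum(xi * yi for xi, yi in zip(x, y)) - n * xbar * ybar
    b1 = sxy / sxx           # slope
    b0 = ybar - b1 * xbar    # intercept
    return b0, b1, sxx, sxy


age = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
wt = [0.029, 0.052, 0.079, 0.125, 0.181, 0.261,
      0.425, 0.738, 1.130, 1.882, 2.812]

b0, b1, sxx, sxy = least_squares(age, wt)
print(round(sxx, 3), round(sxy, 3), round(b1, 4), round(b0, 4))
```

For these data the slope is about 0.2351 and the intercept about -1.8845, which can be compared with the coefficients in the Stata listing for Y=WT and X=AGE.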

Do these estimates make sense?

Slope
$\hat{\beta}_1 = \dfrac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sum (X_i - \bar{X})^2} = \dfrac{\widehat{\mathrm{cov}}(X, Y)}{\widehat{\mathrm{var}}(X)}$

The linear movement in Y with linear movement in X is measured relative to the variability in X.

$\hat{\beta}_1 = 0$ says: With a unit change in X, overall there is a 50-50 chance that Y increases versus decreases.
$\hat{\beta}_1 \neq 0$ says: With a unit increase in X, Y increases also ($\hat{\beta}_1 > 0$) or Y decreases ($\hat{\beta}_1 < 0$).

Intercept
$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X}$

If the linear model is incorrect, or if the true model does not have a linear component, we obtain $\hat{\beta}_1 = 0$ and $\hat{\beta}_0 = \bar{Y}$ as our best guess of an unknown Y.

Illustration in Stata

Y=WT and X=AGE

. regress y x

Partial listing of output ...

------------------------------------------------------------------------------
           y |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .2350727   .0459425     5.12   0.001     .1311437    .3390018
       _cons |  -1.884527   .5258354    -3.58   0.006     -3.07405   -.6950051
------------------------------------------------------------------------------

Annotated

y = WEIGHT
x = AGE: Coef. .2350727 = $b_1$
_cons = Intercept: Coef. -1.884527 = $b_0$

The fitted line is therefore WT = -1.884527 + 0.23507*AGE. It says that each unit increase in AGE of 1 day is estimated to predict a 0.23507 increase in weight, WT.

Here is an overlay of the fitted line on our scatterplot.

[Figure: Scatter Plot of WT vs AGE with the fitted line overlaid; WT on the vertical axis, AGE on the horizontal axis]

- As we might have guessed, the straight line model may not be the best choice.
- The "bowl" shape of the scatter plot does have a linear component, however.
- Without the plot, we might have believed the straight line fit is okay.

Illustration in Stata - continued

Z=LOGWT and X=AGE

. regress z x

Partial listing of output ...

------------------------------------------------------------------------------
           z |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .1958909   .0026768    73.18   0.000     .1898356    .2019462
       _cons |  -2.689255    .030637   -87.78   0.000    -2.758561   -2.619949
------------------------------------------------------------------------------

Annotated

Z = LOGWT
x = AGE: Coef. .1958909 = $b_1$
_cons = INTERCEPT: Coef. -2.689255 = $b_0$

Thus, the fitted line is LOGWT = -2.68925 + 0.19589*AGE

Now the overlay plot looks better:

[Figure: scatter plot of LOGWT vs AGE with the fitted line overlaid]

Now You Try ...

Prediction of Weight from Height
Source: Dixon and Massey (1969)

Individual   Height (X)   Weight (Y)
     1           60          110
     2           60          135
     3           60          120
     4           62          120
     5           62          140
     6           62          130
     7           62          135
     8           64          150
     9           64          145
    10           70          170
    11           70          185
    12           70          160

Preliminary calculations

$\bar{X} = 63.833$, $\bar{Y} = 141.667$
$\sum X_i^2 = 49,068$, $\sum Y_i^2 = 246,100$, $\sum X_i Y_i = 109,380$
$S_{xx} = 171.667$, $S_{yy} = 5,266.667$, $S_{xy} = 863.333$

Slope
$\hat{\beta}_1 = \dfrac{S_{xy}}{S_{xx}} = \dfrac{863.333}{171.667} = 5.0291$

Intercept
$\hat{\beta}_0 = \bar{Y} - \hat{\beta}_1 \bar{X} = 141.667 - (5.0291)(63.8333) = -179.3573$
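The hand calculation can be checked from the raw data. The sketch below uses the twelve height and weight pairs as we read them from the exercise table and recomputes the sums, $S_{xx}$, $S_{yy}$, $S_{xy}$, and the two estimates. Because it carries the slope at full precision, the intercept comes out near -179.359 rather than the hand value, which rounds the slope to 5.0291 before multiplying.

```python
# Height (X, inches) and weight (Y, pounds) for the 12 individuals,
# as we read them from the exercise table.
height = [60, 60, 60, 62, 62, 62, 62, 64, 64, 70, 70, 70]
weight = [110, 135, 120, 120, 140, 130, 135, 150, 145, 170, 185, 160]

n = len(height)
xbar = sum(height) / n
ybar = sum(weight) / n

sum_x2 = sum(x * x for x in height)
sum_y2 = sum(y * y for y in weight)
sum_xy = sum(x * y for x, y in zip(height, weight))

sxx = sum_x2 - n * xbar ** 2
syy = sum_y2 - n * ybar ** 2
sxy = sum_xy - n * xbar * ybar

b1 = sxy / sxx           # slope: pounds per inch of height
b0 = ybar - b1 * xbar    # intercept

print(sum_x2, sum_y2, sum_xy)
print(round(sxx, 3), round(syy, 3), round(sxy, 3))
print(round(b1, 4), round(b0, 3))
```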

5. The Analysis of Variance Table

Recall the sample variance introduced in the earlier unit on summarizing data. The numerator of the sample variance ($S^2$) of the Y data is $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$.

This same numerator $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$ is a central figure in regression. It has a new name, several actually.

$\sum_{i=1}^{n} (Y_i - \bar{Y})^2$ = "total variance of the Y's"
= "total sum of squares",
= "total, corrected", and
= "SSY".
(Note - "corrected" refers to subtracting the mean before squaring.)

The analysis of variance table is all about $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$ and partitioning it into two components:
1. Due residual (the individual Y about the individual prediction $\hat{Y}$)
2. Due regression (the prediction $\hat{Y}$ about the overall mean $\bar{Y}$)

Here is the partition (Note - Look closely and you'll see that both sides are the same):

$(Y_i - \bar{Y}) = (Y_i - \hat{Y}_i) + (\hat{Y}_i - \bar{Y})$

Some algebra (not shown) reveals a nice partition of the total variability:

$\sum (Y_i - \bar{Y})^2 = \sum (Y_i - \hat{Y}_i)^2 + \sum (\hat{Y}_i - \bar{Y})^2$

Total Sum of Squares = Due Error Sum of Squares + Due Model Sum of Squares

A closer look ...

Total Sum of Squares = Due Model Sum of Squares + Due Error Sum of Squares

$\sum_{i=1}^{n} (Y_i - \bar{Y})^2 = \sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 + \sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$

$(Y_i - \bar{Y})$ = deviation of $Y_i$ from $\bar{Y}$ that is to be explained
$(\hat{Y}_i - \bar{Y})$ = "due model", "signal", "systematic", "due regression"
$(Y_i - \hat{Y}_i)$ = "due error", "noise", or "residual"

We seek to explain the total variability $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$ with a fitted model:

What happens when $\beta_1 \neq 0$?
- A straight line relationship is helpful.
- Best guess is $\hat{Y} = \hat{\beta}_0 + \hat{\beta}_1 X$.
- Due model is LARGE because $(\hat{Y} - \bar{Y}) = ([\hat{\beta}_0 + \hat{\beta}_1 X] - \bar{Y}) = \bar{Y} - \hat{\beta}_1 \bar{X} + \hat{\beta}_1 X - \bar{Y} = \hat{\beta}_1 (X - \bar{X})$.
- Due error has to be small.
- due(model)/due(error) will be large.

What happens when $\beta_1 = 0$?
- A straight line relationship is not helpful.
- Best guess is $\hat{Y} = \hat{\beta}_0 = \bar{Y}$.
- Due error is nearly the TOTAL because $(Y - \hat{Y}) = (Y - [\hat{\beta}_0]) = (Y - \bar{Y})$.
- Due regression has to be small.
- due(model)/due(error) will be small.

How to Partition the Total Variance

1. The "total" or "total, corrected" refers to the variability of Y about $\bar{Y}$.
- $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$ is called the "total sum of squares".
- Degrees of freedom = df = (n-1).
- Division of the "total sum of squares" by its df yields the "total mean square".

2. The "residual" or "due error" refers to the variability of Y about $\hat{Y}$.
- $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$ is called the "residual sum of squares".
- Degrees of freedom = df = (n-2).
- Division of the "residual sum of squares" by its df yields the "residual mean square".

3. The "regression" or "due model" refers to the variability of $\hat{Y}$ about $\bar{Y}$.
- $\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2 = \hat{\beta}_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2$ is called the "regression sum of squares".
- Degrees of freedom = df = 1.
- Division of the "regression sum of squares" by its df yields the "regression mean square" or "model mean square". It is an example of a variance component.

Source             df      Sum of Squares                                  Mean Square
Regression         1       SSR = $\sum_{i=1}^{n} (\hat{Y}_i - \bar{Y})^2$   SSR/1
Error              (n-2)   SSE = $\sum_{i=1}^{n} (Y_i - \hat{Y}_i)^2$       SSE/(n-2)
Total, corrected   (n-1)   SST = $\sum_{i=1}^{n} (Y_i - \bar{Y})^2$

Tip! Mean square = (Sum of squares)/(degrees of freedom, df)
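The partition can be verified numerically. The following sketch fits Z=LOGWT on X=AGE for the chick embryo data (values as we read them from the example), computes the three sums of squares from their definitions, and confirms that the total equals the regression piece plus the residual piece.

```python
# Chick embryo data: X = AGE (days), Z = LOGWT, as we read them from the example.
age = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
logwt = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
         -0.372, -0.132, 0.053, 0.275, 0.449]

n = len(age)
xbar = sum(age) / n
zbar = sum(logwt) / n

# Least squares fit of Z on X.
sxx = sum((x - xbar) ** 2 for x in age)
sxz = sum((x - xbar) * (z - zbar) for x, z in zip(age, logwt))
b1 = sxz / sxx
b0 = zbar - b1 * xbar
zhat = [b0 + b1 * x for x in age]

# Sums of squares from their definitions.
sst = sum((z - zbar) ** 2 for z in logwt)                # total, corrected
ssr = sum((zh - zbar) ** 2 for zh in zhat)               # due regression
sse = sum((z - zh) ** 2 for z, zh in zip(logwt, zhat))   # due error

# Mean square = (sum of squares) / (degrees of freedom).
ms_regression = ssr / 1
ms_error = sse / (n - 2)
ms_total = sst / (n - 1)

print(round(ssr, 5), round(sse, 5), round(sst, 5))
print(ms_regression, ms_error, ms_total)
```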

Be careful! The question we may ask from an analysis of variance table is a limited one.

Does the fit of the straight line model explain a significant portion of the variability of the individual Y about $\bar{Y}$? Is this fitted model better than using $\bar{Y}$ alone?

We are NOT asking: Is the choice of the straight line model correct? or Would another functional form be a better choice?

We'll use a hypothesis test approach (another "proof by contradiction").

- Start with the "nothing is going on" null hypothesis that says $\beta_1 = 0$ ("no linear relationship").
- Use least squares estimation to estimate a "closest" line.
- The analysis of variance table provides a comparison of the due regression mean square to the residual mean square.
- Recall that we reasoned the following:
  If $\beta_1 \neq 0$, then due(regression)/due(residual) will be LARGE.
  If $\beta_1 = 0$, then due(regression)/due(residual) will be SMALL.
- Our p-value calculation will answer the question: If the null hypothesis is true and $\beta_1 = 0$ truly, what were the chances of obtaining a value of due(regression)/due(residual) as large or larger than that observed?

To calculate "chances" we need a probability model. So far, we have not needed one.

6. Assumptions for a Straight Line Regression Analysis

In performing least squares estimation, we did not use a probability model. We were doing geometry. Hypothesis testing requires some assumptions and a probability model.

Assumptions

- The separate observations $Y_1, Y_2, \ldots, Y_n$ are independent.
- The values of the predictor variable X are fixed and measured without error.
- For each value of the predictor variable X=x, the distribution of values of Y follows a normal distribution with mean equal to $\mu_{Y|X=x}$ and common variance equal to $\sigma^2_{Y|x}$.
- The separate means $\mu_{Y|X=x}$ lie on a straight line; that is, $\mu_{Y|X=x} = \beta_0 + \beta_1 X$.

[Figure: At each value of X, there is a population of Y for persons with X=x]

With these assumptions, we can assess the significance of the variance explained by the model.

$F = \dfrac{\text{msq(model)}}{\text{msq(residual)}}$ with df = 1, (n-2)

When $\beta_1 = 0$:
- Due model MSR has expected value $\sigma^2_{Y|X}$
- Due residual MSE has expected value $\sigma^2_{Y|X}$
- F = (MSR)/MSE will be close to 1

When $\beta_1 \neq 0$:
- Due model MSR has expected value $\sigma^2_{Y|X} + \beta_1^2 \sum_{i=1}^{n} (X_i - \bar{X})^2$
- Due residual MSE has expected value $\sigma^2_{Y|X}$
- F = (MSR)/MSE will be LARGER than 1

We obtain the analysis of variance table for the model of Z=LOGWT to X=AGE:

Stata illustration with annotations (shown after the second = sign on each line).

      Source |       SS       df       MS              Number of obs =      11
-------------+------------------------------           F(  1,     9) = 5355.60    = MSQ(model)/MSQ(residual)
       Model |  4.22105734     1  4.22105734           Prob > F      =  0.0000    = p-value for Overall F Test
    Residual |  .007093416     9  .000788157           R-squared     =  0.9983    = SSQ(model)/SSQ(TOTAL)
-------------+------------------------------           Adj R-squared =  0.9981    = R-squared adjusted for n and # of X
       Total |  4.22815076    10  .422815076           Root MSE      =  .02807    = Square root of MSQ(residual)

This output corresponds to the following. Note - In this example our dependent variable is actually Z, not Y.

Source             Df           Sum of Squares                                            Mean Square
Regression         1            SSR = $\sum_{i=1}^{n} (\hat{Z}_i - \bar{Z})^2$ = 4.22063   SSR/1 = 4.22063
Error              (n-2) = 9    SSE = $\sum_{i=1}^{n} (Z_i - \hat{Z}_i)^2$ = 0.00705       SSE/(n-2) = 7.838E-04
Total, corrected   (n-1) = 10   SST = $\sum_{i=1}^{n} (Z_i - \bar{Z})^2$ = 4.22768

Other information in this output:

R-SQUARED = [(Sum of squares regression)/(Sum of squares total)]
= proportion of the "total" that we have been able to explain with the fit
- Be careful! As predictors are added to the model, R-SQUARED can only increase. Eventually, we need to "adjust" this measure to take this into account. See ADJUSTED R-SQUARED.

We also get an overall F test of the null hypothesis that the simple linear model does not explain significantly more variability in LOGWT than the average LOGWT.

F = MSQ(Regression)/MSQ(Residual) = 4.22063/0.0007838 = 5384.94 with df = 1, 9

Achieved significance < 0.0001. Reject $H_O$. Conclude that the fitted line explains statistically significantly more of the variability in Z=LOGWT than is explained by the null model that contains the average LOGWT only.
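The summary quantities in the output are simple functions of the sums of squares. The sketch below starts from the model and residual sums of squares as we read them from the Stata listing for Z=LOGWT on X=AGE (n = 11) and recomputes F, R-squared, adjusted R-squared, and root MSE. The adjusted R-squared formula used, 1 - MSQ(residual)/MSQ(total), is the usual one and is our assumption about what the listing reports.

```python
import math

# Sums of squares as we read them from the Stata listing (Z = LOGWT on X = AGE).
ss_model = 4.22105734
ss_resid = 0.007093416
n = 11

ss_total = ss_model + ss_resid
df_model, df_resid, df_total = 1, n - 2, n - 1

ms_model = ss_model / df_model
ms_resid = ss_resid / df_resid

f_stat = ms_model / ms_resid          # overall F, df = 1, n-2
r2 = ss_model / ss_total              # proportion of total explained
# Assumed formula: adjusted R^2 = 1 - MSQ(residual)/MSQ(total).
adj_r2 = 1 - ms_resid / (ss_total / df_total)
root_mse = math.sqrt(ms_resid)

print(round(f_stat, 2), round(r2, 4), round(adj_r2, 4), round(root_mse, 5))
```

The hand calculation in the text carries fewer digits, which is why its F (5384.94) differs a little from the value computed here.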

7. Hypothesis Testing

Straight Line Model: $Y = \beta_0 + \beta_1 X$

1) Overall F-Test

Research Question: Does the fitted model, the $\hat{Y}$, explain significantly more of the total variability of the Y about $\bar{Y}$ than does $\bar{Y}$?

Assumptions: As before.

$H_O$ and $H_A$:
$H_O: \beta_1 = 0$
$H_A: \beta_1 \neq 0$

Test Statistic:
$F = \dfrac{\text{msq(regression)}}{\text{msq(residual)}}$, df = 1, (n-2)

Evaluation rule: When the null hypothesis is true, the value of F should be close to 1. Alternatively, when $\beta_1 \neq 0$, the value of F will be LARGER than 1. Thus, our p-value calculation answers: "What are the chances of obtaining our value of the F or one that is larger if we believe the null hypothesis that $\beta_1 = 0$?"

Calculations: For our data, we obtain

p-value = $\Pr\left[ F_{1,(n-2)} \geq \dfrac{\text{msq(model)}}{\text{msq(residual)}} \,\Big|\, \beta_1 = 0 \right] = \Pr\left[ F_{1,9} \geq 5384.94 \right] << 0.0001$

Evaluate: Assumption of the null hypothesis that $\beta_1 = 0$ has led to an extremely unlikely outcome (F-statistic value of 5384.94), with chances of being observed less than 1 chance in 10,000. The null hypothesis is rejected.

Interpret: We have learned that, at least, the fitted straight line model does a much better job of explaining the variability in Z = LOGWT than a model that allows only for the average LOGWT.
... later (PubHlth 640, Intermediate Biostatistics), we'll see that the analysis does not stop here ...

2) Test of the Slope, $\beta_1$

Notes - The overall F test and the test of the slope are equivalent. The test of the slope uses a t-score approach to hypothesis testing. It can be shown that { t-score for slope }$^2$ = { overall F }.

Research Question: Is the slope $\beta_1 = 0$?

Assumptions: As before.

$H_O$ and $H_A$:
$H_O: \beta_1 = 0$
$H_A: \beta_1 \neq 0$

Test Statistic: To compute the t-score, we need an estimate of the standard error of $\hat{\beta}_1$:

$\widehat{SE}(\hat{\beta}_1) = \sqrt{ \text{msq(residual)} \cdot \dfrac{1}{\sum_{i=1}^{n} (X_i - \bar{X})^2} }$

Our t-score is therefore:

$t\text{-score} = \dfrac{(\text{observed}) - (\text{expected})}{\widehat{se}(\text{expected})} = \dfrac{(\hat{\beta}_1) - (0)}{\widehat{se}(\hat{\beta}_1)}$, df = (n-2)

We can find this information in our Stata output, annotated here.

------------------------------------------------------------------------------
           z |      Coef.   Std. Err.      t = Coef/Std. Err.            P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .1958909   .0026768    73.18 = .1958909/.0026768        0.000     .1898356    .2019462
       _cons |  -2.689255    .030637   -87.78                            0.000    -2.758561   -2.619949
------------------------------------------------------------------------------

Recall what we mean by a t-score: t = 73.38 says "the estimated slope is estimated to be 73.38 standard error units away from the null hypothesis expected value of zero".

Check that { t-score }$^2$ = { Overall F }: $[73.38]^2 = 5384.62$, which is close.

Evaluation rule: When the null hypothesis is true, the value of t should be close to zero. Alternatively, when $\beta_1 \neq 0$, the value of t will be DIFFERENT from 0. Here, our p-value calculation answers: "What are the chances of obtaining our value of the t or one that is more far away from 0 if we believe the null hypothesis that $\beta_1 = 0$?"
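The standard error of the slope, the t-score, and the identity between the squared t-score and the overall F can all be reproduced from the data. This sketch uses the chick embryo values (X=AGE, Z=LOGWT) as we read them from the example. Carried at full precision, the t-score is about 73.18, the value in the Stata listing; the 73.38 quoted in the text comes from the rounded hand calculation.

```python
import math

# Chick embryo data: X = AGE (days), Z = LOGWT, as we read them from the example.
age = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
logwt = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
         -0.372, -0.132, 0.053, 0.275, 0.449]

n = len(age)
xbar = sum(age) / n
zbar = sum(logwt) / n
sxx = sum((x - xbar) ** 2 for x in age)
b1 = sum((x - xbar) * (z - zbar) for x, z in zip(age, logwt)) / sxx
b0 = zbar - b1 * xbar

# Residual mean square, df = n - 2.
sse = sum((z - (b0 + b1 * x)) ** 2 for x, z in zip(age, logwt))
mse = sse / (n - 2)

se_b1 = math.sqrt(mse / sxx)       # standard error of the slope
t_slope = (b1 - 0) / se_b1         # observed minus expected (0), in SE units
f_stat = (b1 ** 2 * sxx) / mse     # MSQ(regression) / MSQ(residual)

print(round(se_b1, 7), round(t_slope, 2), round(t_slope ** 2, 2), round(f_stat, 2))
```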

Calculations: For our data, we obtain

p-value = $2 \Pr\left[ t_{(n-2)} \geq \dfrac{|\hat{\beta}_1 - 0|}{\widehat{se}(\hat{\beta}_1)} \right] = 2 \Pr\left[ t_9 \geq 73.38 \right] << 0.0001$

Evaluate: Under the null hypothesis that $\beta_1 = 0$, the chances of obtaining a t-score value that is 73.38 or more standard error units away from the expected value of 0 is less than 1 chance in 10,000.

Interpret: The inference is the same as that for the overall F test. The fitted straight line model does a statistically significantly better job of explaining the variability in LOGWT than the sample mean.

3) Test of the Intercept, $\beta_0$

This addresses the question: Does the straight line relationship pass through the origin? It is rarely of interest.

Research Question: Is the intercept $\beta_0 = 0$?

Assumptions: As before.

$H_O$ and $H_A$:
$H_O: \beta_0 = 0$
$H_A: \beta_0 \neq 0$

Test Statistic: To compute the t-score for the intercept, we need an estimate of the standard error of $\hat\beta_0$:

$$\widehat{SE}(\hat\beta_0) = \sqrt{\text{msq(residual)}\left[\frac{1}{n} + \frac{\bar X^2}{\sum_{i=1}^{n}(X_i - \bar X)^2}\right]}$$

Our t-score is therefore:

$$t\text{-score} = \frac{(\text{observed}) - (\text{expected})}{\widehat{se}(\text{observed})} = \frac{\hat\beta_0 - 0}{\widehat{se}(\hat\beta_0)}, \qquad df = (n-2)$$

Again, we can find this information in our Stata output. Annotations are marked with #.

------------------------------------------------------------------------------
           z |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .1958909   .0026768    73.18   0.000     .1898356   .2019462
       _cons |  -2.689255    .030637   -87.78   0.000     -2.75856  -2.619949
------------------------------------------------------------------------------
# t = Coef./Std. Err.; for _cons, -87.78 = -2.689255/.030637

Here, t = -87.78 says the estimated intercept is 87.78 standard error units away from its null hypothesis expected value of zero.

Evaluation rule: When the null hypothesis is true, the value of t should be close to zero. Alternatively, when $\beta_0 \ne 0$, the value of t will be DIFFERENT from 0. Our p-value calculation answers: What are the chances of obtaining our value of t, or one that is farther away from 0, if we believe the null hypothesis that $\beta_0 = 0$?

Calculations:

$$p\text{-value} = \Pr\left[\, |t_{(n-2)}| \ge \left|\frac{\hat\beta_0}{\widehat{se}(\hat\beta_0)}\right| \,\right] = \Pr\left[\, |t_9| \ge 87.78 \,\right] \ll 0.0001$$

Evaluate: Under the null hypothesis that $\beta_0 = 0$, the chance of obtaining a t-score value that is 87.78 or more standard error units away from the expected value of 0 is less than 1 chance in 10,000, again prompting statistical rejection of the null hypothesis.

Interpret: The inference is that the straight line relationship between Z = LOGWT and X = AGE does not pass through the origin.
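The intercept test follows the same arithmetic as the slope test, with a different standard error. Below is a minimal Python sketch, not part of the original notes, assuming the eleven (AGE, LOGWT) pairs from the Stata listing of this example; the variable names are illustrative.

```python
import math

# (AGE, LOGWT) pairs as they appear in the Stata listing of this example.
x = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
z = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
     -0.372, -0.132, 0.053, 0.275, 0.449]

n = len(x)
xbar = sum(x) / n
zbar = sum(z) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
szz = sum((zi - zbar) ** 2 for zi in z)
sxz = sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z))

b1 = sxz / sxx                          # estimated slope
b0 = zbar - b1 * xbar                   # estimated intercept
ms_resid = (szz - b1 * sxz) / (n - 2)   # msq(residual)

# Standard error of the intercept: msq(residual) * [1/n + xbar^2 / Sxx]
se_b0 = math.sqrt(ms_resid * (1 / n + xbar ** 2 / sxx))
t_intercept = (b0 - 0) / se_b0

print(f"intercept = {b0:.6f}  se = {se_b0:.6f}  t = {t_intercept:.2f}")
```

The extra term $\bar X^2 / \sum (X_i - \bar X)^2$ is what makes the intercept much less precisely estimated than the slope when the X values sit far from zero, as they do here.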

8. Confidence Interval Estimation

Straight Line Model: $Y = \beta_0 + \beta_1 X$

The confidence intervals here have 3 elements:
1) Best single guess (estimate)
2) Standard error of the best single guess (SE[estimate])
3) Confidence coefficient: This will be a percentile from the Student t distribution with df = (n-2)

We might want confidence interval estimates of the following 4 parameters:
(1) Slope
(2) Intercept
(3) Mean of subset of population for whom $X = x_0$
(4) Individual response for person for whom $X = x_0$

1) SLOPE

estimate = $\hat\beta_1$

$$\widehat{se}(\hat\beta_1) = \sqrt{\frac{\text{msq(residual)}}{\sum_{i=1}^{n}(X_i - \bar X)^2}}$$

2) INTERCEPT

estimate = $\hat\beta_0$

$$\widehat{se}(\hat\beta_0) = \sqrt{\text{msq(residual)}\left[\frac{1}{n} + \frac{\bar X^2}{\sum_{i=1}^{n}(X_i - \bar X)^2}\right]}$$

3) MEAN at $X = x_0$

estimate = $\hat Y_{X = x_0} = \hat\beta_0 + \hat\beta_1 x_0$

$$\widehat{se} = \sqrt{\text{msq(residual)}\left[\frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^{n}(X_i - \bar X)^2}\right]}$$

4) INDIVIDUAL with $X = x_0$

estimate = $\hat Y_{X = x_0} = \hat\beta_0 + \hat\beta_1 x_0$

$$\widehat{se} = \sqrt{\text{msq(residual)}\left[1 + \frac{1}{n} + \frac{(x_0 - \bar X)^2}{\sum_{i=1}^{n}(X_i - \bar X)^2}\right]}$$

Example, continued: Z = LOGWT to X = AGE. Stata yielded the following fit:

------------------------------------------------------------------------------
           z |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .1958909   .0026768    73.18   0.000     .1898356   .2019462   # 95% CI for Slope β1
       _cons |  -2.689255    .030637   -87.78   0.000     -2.75856  -2.619949
------------------------------------------------------------------------------

95% Confidence Interval for the Slope, $\beta_1$

1) Best single guess (estimate) = $\hat\beta_1$ = 0.19589
2) Standard error of the best single guess (SE[estimate]) = $\widehat{se}(\hat\beta_1)$ = 0.00268
3) Confidence coefficient = 97.5th percentile of Student t = $t_{.975;\, df=9}$ = 2.26

95% Confidence Interval for Slope $\beta_1$ = Estimate ± (confidence coefficient)*SE
= 0.19589 ± (2.26)(0.00268)
= (0.1898, 0.2019)

95% Confidence Interval for the Intercept, $\beta_0$

------------------------------------------------------------------------------
           z |      Coef.   Std. Err.       t    P>|t|    [95% Conf. Interval]
-------------+----------------------------------------------------------------
           x |   .1958909   .0026768    73.18   0.000     .1898356   .2019462
       _cons |  -2.689255    .030637   -87.78   0.000     -2.75856  -2.619949   # 95% CI for intercept β0
------------------------------------------------------------------------------

1) Best single guess (estimate) = $\hat\beta_0$ = -2.68925
2) Standard error of the best single guess (SE[estimate]) = $\widehat{se}(\hat\beta_0)$ = 0.03064
3) Confidence coefficient = 97.5th percentile of Student t = $t_{.975;\, df=9}$ = 2.26

95% Confidence Interval for Intercept $\beta_0$ = Estimate ± (confidence coefficient)*SE
= -2.68925 ± (2.26)(0.03064)
= (-2.7585, -2.6200)

For the brave: Stata Example, continued

Confidence Intervals for MEAN of Z at Each Value of X

. regress z x
. predict zhat, xb
. ** Obtain SE for MEAN of Z given X
. predict semeanz, stdp
. ** Obtain confidence coefficient = 97.5th percentile of T on df=9
. generate tmult=invttail(9,.025)
. ** Generate lower and upper 95% CI limits for MEAN of Z at Each X
. generate lowmeanz=zhat-tmult*semeanz
. generate highmeanz=zhat+tmult*semeanz
. list x z zhat lowmeanz highmeanz, clean

        x        z        zhat    lowmeanz   highmeanz
  1.    6   -1.538   -1.513909   -1.549733   -1.478086
  2.    7   -1.284   -1.318018   -1.348894   -1.287142
  3.    8   -1.102   -1.122127   -1.148522   -1.095733
  4.    9    -.903   -.9262364   -.9488931   -.9035797
  5.   10    -.742   -.7303454   -.7504284   -.7102624
  6.   11    -.583   -.5344545   -.5536029   -.5153061
  7.   12    -.372   -.3385637   -.3586467   -.3184806
  8.   13    -.132   -.1426727   -.1653294    -.120016
  9.   14     .053    .0532182    .0268239    .0796125
 10.   15     .275    .2491091    .2182332     .279985
 11.   16     .449        .445    .4091766    .4808234

Stata Example, continued

Confidence Intervals for INDIVIDUAL PREDICTED Z at Each Value of X

. regress z x
. predict zhat, xb
. ** Obtain SE for INDIVIDUAL PREDICTION of Z at given X
. predict sepredictz, stdf
. ** Obtain confidence coefficient = 97.5th percentile of T on df=9
. generate tmult=invttail(9,.025)
. ** Generate lower and upper 95% CI limits for INDIVIDUAL PREDICTED Z at Each X
. generate lowpredictz=zhat-tmult*sepredictz
. generate highpredictz=zhat+tmult*sepredictz
. *** List Individual Predictions with 95% CI Limits
. list x z zhat lowpredictz highpredictz, clean

        x        z        zhat   lowpred~z   highpre~z
  1.    6   -1.538   -1.513909   -1.586824   -1.440994
  2.    7   -1.284   -1.318018   -1.388634   -1.247402
  3.    8   -1.102   -1.122127   -1.190902   -1.053353
  4.    9    -.903   -.9262364   -.9936649   -.8588079
  5.   10    -.742   -.7303454   -.7969533   -.6637375
  6.   11    -.583   -.5344545   -.6007866   -.4681225
  7.   12    -.372   -.3385637   -.4051715   -.2719558
  8.   13    -.132   -.1426727   -.2101013   -.0752442
  9.   14     .053    .0532182   -.0155564    .1219927
 10.   15     .275    .2491091    .1784932     .319725
 11.   16     .449        .445     .372085     .517915
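The two Stata listings differ only in the standard error used: `stdp` for the mean of Z at a given X, `stdf` for an individual prediction. The following minimal Python sketch, not part of the original notes, applies the two standard error formulas from the confidence interval section directly. It assumes the (AGE, LOGWT) pairs from the listings, hard-codes 2.262157 as the 97.5th percentile of Student t on 9 df (a t-table value), and uses a hypothetical helper name `interval_at`.

```python
import math

# (AGE, LOGWT) pairs as they appear in the Stata listings of this example.
x = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
z = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
     -0.372, -0.132, 0.053, 0.275, 0.449]

# Assumption: 97.5th percentile of Student t with df = 9, from a t table.
T_MULT = 2.262157

n = len(x)
xbar = sum(x) / n
zbar = sum(z) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
szz = sum((zi - zbar) ** 2 for zi in z)
sxz = sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z))

b1 = sxz / sxx
b0 = zbar - b1 * xbar
ms_resid = (szz - b1 * sxz) / (n - 2)


def interval_at(x0, individual=False):
    """Return (fitted value, lower limit, upper limit) at X = x0.

    individual=False uses the SE of the estimated MEAN of Z at x0;
    individual=True adds 1 inside the bracket, giving the SE for a
    single new INDIVIDUAL response at x0.
    """
    extra = 1.0 if individual else 0.0
    se = math.sqrt(ms_resid * (extra + 1 / n + (x0 - xbar) ** 2 / sxx))
    fit = b0 + b1 * x0
    return fit, fit - T_MULT * se, fit + T_MULT * se


for x0 in x:
    fit, lo_mean, hi_mean = interval_at(x0)
    _, lo_ind, hi_ind = interval_at(x0, individual=True)
    print(f"{x0:3d} {fit:10.6f}  mean: ({lo_mean:.6f}, {hi_mean:.6f})"
          f"  individual: ({lo_ind:.6f}, {hi_ind:.6f})")
```

At every X the individual interval is wider than the mean interval, and both are narrowest at $X = \bar X = 11$, because the $(x_0 - \bar X)^2$ term vanishes there.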

9. Introduction to Correlation

Definition of Correlation

A correlation coefficient is a measure of the association between two paired random variables (e.g. height and weight).

The Pearson product moment correlation, in particular, is a measure of the strength of the straight line relationship between the two random variables.

Another correlation measure (not discussed here) is the Spearman correlation. It is a measure of the strength of the monotone increasing (or decreasing) relationship between the two random variables. The Spearman correlation is a non-parametric (meaning model free) measure. It is introduced in PubHlth 640, Intermediate Biostatistics.

Formula for the Pearson Product Moment Correlation $\rho$

Population product moment correlation = $\rho$
Sample based estimate = $r$

Some preliminaries:

(1) Suppose we are interested in the correlation between X and Y

(2) This is the covariance(X,Y):
$$\widehat{\text{cov}}(X,Y) = \frac{\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)}{(n-1)} = \frac{S_{xy}}{(n-1)}$$

(3)
$$\widehat{\text{var}}(X) = \frac{\sum_{i=1}^{n}(x_i - \bar x)^2}{(n-1)} = \frac{S_{xx}}{(n-1)}$$

and similarly

(4)
$$\widehat{\text{var}}(Y) = \frac{\sum_{i=1}^{n}(y_i - \bar y)^2}{(n-1)} = \frac{S_{yy}}{(n-1)}$$

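Formula for Estimate of Pearson Product Moment Correlation from a Sample

$$\hat\rho = r = \frac{\widehat{\text{cov}}(X,Y)}{\sqrt{\widehat{\text{var}}(X)\,\widehat{\text{var}}(Y)}} = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$

If you absolutely have to do it by hand, an equivalent (more calculator friendly) formula is

$$\hat\rho = r = \frac{\sum_{i=1}^{n} x_i y_i - \dfrac{\left(\sum_{i=1}^{n} x_i\right)\left(\sum_{i=1}^{n} y_i\right)}{n}}{\sqrt{\left[\sum_{i=1}^{n} x_i^2 - \dfrac{\left(\sum_{i=1}^{n} x_i\right)^2}{n}\right]\left[\sum_{i=1}^{n} y_i^2 - \dfrac{\left(\sum_{i=1}^{n} y_i\right)^2}{n}\right]}}$$

The correlation r can take on values between -1 and +1 only. The factors of (n-1) and the units of x and y cancel in the ratio; thus, the correlation coefficient is said to be dimensionless: it is independent of the units of x or y.

Sign of the correlation coefficient (positive or negative) = Sign of the estimated slope $\hat\beta_1$.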
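The definitional formula and the calculator friendly formula give the same value. As a check, here is a minimal Python sketch, not part of the original notes, assuming the (X = AGE, Z = LOGWT) pairs from the Stata listing of the running example; the names `r_def` and `r_calc` are illustrative.

```python
import math

# (AGE, LOGWT) pairs as they appear in the Stata listing of the example.
x = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
y = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
     -0.372, -0.132, 0.053, 0.275, 0.449]
n = len(x)

# Definitional form: r = Sxy / sqrt(Sxx * Syy)
xbar = sum(x) / n
ybar = sum(y) / n
sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
sxx = sum((xi - xbar) ** 2 for xi in x)
syy = sum((yi - ybar) ** 2 for yi in y)
r_def = sxy / math.sqrt(sxx * syy)

# Calculator friendly form, built from raw sums only.
sum_x = sum(x)
sum_y = sum(y)
sum_xy = sum(xi * yi for xi, yi in zip(x, y))
sum_x2 = sum(xi ** 2 for xi in x)
sum_y2 = sum(yi ** 2 for yi in y)
numerator = sum_xy - sum_x * sum_y / n
denominator = math.sqrt((sum_x2 - sum_x ** 2 / n) * (sum_y2 - sum_y ** 2 / n))
r_calc = numerator / denominator

print(f"r (definitional) = {r_def:.6f}")
print(f"r (raw sums)     = {r_calc:.6f}")
```

The raw-sums form is convenient on a calculator but subtracts large, nearly equal quantities, so it is the more sensitive of the two to rounding of intermediate results.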

There is a relationship between the slope of the straight line, $\hat\beta_1$, and the estimated correlation r.

Relationship between slope $\hat\beta_1$ and the sample correlation r

Because

$$\hat\beta_1 = \frac{S_{xy}}{S_{xx}} \qquad \text{and} \qquad r = \frac{S_{xy}}{\sqrt{S_{xx} S_{yy}}}$$

a little algebra reveals that

$$r = \left[\frac{\sqrt{S_{xx}}}{\sqrt{S_{yy}}}\right]\hat\beta_1$$

Thus, beware!!! A very large (positive or negative) r does not by itself mean a steep slope. It is possible for a very large r to accompany a slope that is close to zero, inasmuch as
- A very large r might reflect a very large $S_{xx}$, all other things equal
- A very large r might reflect a very small $S_{yy}$, all other things equal.
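One way to see that the size of the slope and the size of r are separate matters is to change the units of X. The minimal Python sketch below is not part of the original notes. It assumes the (AGE, LOGWT) pairs from the Stata listing of the running example, and the factor of 7 is a purely illustrative rescaling (a hypothetical choice, not something in the notes). The hypothetical helper `slope_and_r` returns the slope and the correlation, and also verifies the identity $r = \sqrt{S_{xx}/S_{yy}}\,\hat\beta_1$.

```python
import math

# (AGE, LOGWT) pairs as they appear in the Stata listing of the example.
x = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
z = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
     -0.372, -0.132, 0.053, 0.275, 0.449]


def slope_and_r(xs, ys):
    """Return (estimated slope, sample correlation) for paired data."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((a - xbar) ** 2 for a in xs)
    syy = sum((b - ybar) ** 2 for b in ys)
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(xs, ys))
    slope = sxy / sxx
    r = sxy / math.sqrt(sxx * syy)
    # Identity from the text: r = sqrt(Sxx / Syy) * slope
    assert abs(r - math.sqrt(sxx / syy) * slope) < 1e-12
    return slope, r


slope_orig, r_orig = slope_and_r(x, z)

# Illustrative assumption: re-express X in a unit 7 times larger.
x_rescaled = [xi / 7 for xi in x]
slope_resc, r_resc = slope_and_r(x_rescaled, z)

print(f"original units: slope = {slope_orig:.6f}, r = {r_orig:.6f}")
print(f"rescaled units: slope = {slope_resc:.6f}, r = {r_resc:.6f}")
```

The slope changes by the rescaling factor while r does not move, because $S_{xx}$ shrinks by the square of that factor and exactly offsets the change in slope.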

10. Hypothesis Test of Correlation

The null hypothesis of zero correlation is equivalent to the null hypothesis of zero slope.

Research Question: Is the correlation $\rho = 0$? Is the slope $\beta_1 = 0$?

Assumptions: As before.

$H_O$ and $H_A$:
$H_O: \rho = 0$
$H_A: \rho \ne 0$

Test Statistic: A little algebra (not shown) yields a very nice formula for the t-score that we need.

$$t\text{-score} = \frac{r\sqrt{(n-2)}}{\sqrt{1 - r^2}}, \qquad df = (n-2)$$

We can find this information in our output. Recall the first example and the model of Z = LOGWT to X = AGE. The Pearson Correlation, r, is the square root of the R-squared in the output, taken with the sign of the estimated slope.

      Source |       SS       df       MS              Number of obs =      11
-------------+------------------------------           F(  1,     9) = 5355.60
       Model |  4.22105734     1  4.22105734           Prob > F      =  0.0000
    Residual |  .007093416     9  .000788157           R-squared     =  0.9983
-------------+------------------------------           Adj R-squared =  0.9981
       Total |  4.22815076    10  .422815076           Root MSE      =  .02807

Pearson Correlation, $r = \sqrt{0.9983} = 0.9991$

Substitution into the formula for the t-score yields

$$t\text{-score} = \frac{r\sqrt{(n-2)}}{\sqrt{1 - r^2}} = \frac{0.9991\sqrt{9}}{\sqrt{1 - 0.9983}} = \frac{2.9974}{0.0412} = 72.69$$

Note: The value 0.9991 in the numerator is $r = \sqrt{R^2} = \sqrt{0.9983} = 0.9991$.

This is very close to the value of the t-score that was obtained for testing the null hypothesis of zero slope. The discrepancy is probably rounding error. I did the calculations on my calculator using 4 significant digits. Stata probably used more significant digits - cb.
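The rounding explanation can be checked numerically. The following minimal Python sketch is not part of the original notes; it assumes the (AGE, LOGWT) pairs from the Stata listing of this example, and the helper name `t_from_r` is hypothetical. It evaluates the t-score formula once with r at full precision and once with the 4-digit values r = 0.9991 and r² = 0.9983 used in the hand calculation.

```python
import math

# (AGE, LOGWT) pairs as they appear in the Stata listing of this example.
x = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]
z = [-1.538, -1.284, -1.102, -0.903, -0.742, -0.583,
     -0.372, -0.132, 0.053, 0.275, 0.449]

n = len(x)
xbar = sum(x) / n
zbar = sum(z) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
szz = sum((zi - zbar) ** 2 for zi in z)
sxz = sum((xi - xbar) * (zi - zbar) for xi, zi in zip(x, z))

r = sxz / math.sqrt(sxx * szz)   # sample correlation at full precision


def t_from_r(r_value, r_squared, n_obs):
    """t-score for H0: rho = 0, given r and r squared, on n_obs - 2 df."""
    return r_value * math.sqrt(n_obs - 2) / math.sqrt(1 - r_squared)


t_full = t_from_r(r, r ** 2, n)          # full precision
t_rounded = t_from_r(0.9991, 0.9983, n)  # 4-digit hand calculation

# t-score for the slope, for comparison.
b1 = sxz / sxx
ms_resid = (szz - b1 * sxz) / (n - 2)
t_slope = b1 / math.sqrt(ms_resid / sxx)

print(f"t from r, full precision : {t_full:.2f}")
print(f"t from r, 4-digit values : {t_rounded:.2f}")
print(f"t for the slope          : {t_slope:.2f}")
```

Because r is so close to 1, the denominator $\sqrt{1 - r^2}$ is tiny and very sensitive to the fourth decimal place of r², which is why the hand calculation drifts from the regression output while the full-precision version matches the slope t-score.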