Correlation and Simple Linear Regression

Berlin Chen
Department of Computer Science & Information Engineering
National Taiwan Normal University

Reference:
1. W. Navidi. Statistics for Engineers and Scientists. Chapter 7 (7.1-7.3) & Teaching Material

Introduction (1/2)
Often, scientists and engineers collect data in order to determine the nature of the relationship between two quantities.
An example is the heights and forearm lengths of men.
The points tend to slope upward and to the right, indicating that taller men tend to have longer forearms.
This is a positive association between height and forearm length.
Statistics-Berlin Chen

Introduction (2/2)
Many times, these ordered pairs of measurements fall approximately along a straight line when plotted.
In those situations, the data can be used to compute an equation for the line that best fits the data.
This line can be used for various things; one is predicting future values.

Correlation
Something we may be interested in is how closely related two physical characteristics are, for example the height and weight of a two-year-old child.
The quantity called the correlation coefficient is a measure of this.
We look at the direction of the relationship (positive or negative) and the strength of the relationship, and then we find a line that best fits the data.
In computing correlation, we can only use quantitative data (instead of qualitative data).

Example
[Figure: scatterplot of height vs. forearm length for men, with the least-squares line superimposed]
This is a plot of height vs. forearm length for men.
We say that there is a positive association between height and forearm length, because the plot indicates that taller men tend to have longer forearms.
The slope is roughly constant throughout the plot, indicating that the points are clustered around a straight line.
The line superimposed on the plot is a special line known as the least-squares line.

Correlation Coefficient
The degree to which the points in a scatterplot tend to cluster around a line reflects the strength of the linear relationship between x and y.
The correlation coefficient is a numerical measure of the strength of the linear relationship between two variables.
The correlation coefficient is usually denoted by the letter r.
It is also called the sample correlation (cf. the population correlation):

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y} = \frac{E\big[(X - E[X])(Y - E[Y])\big]}{\sqrt{E[X^2] - (E[X])^2}\,\sqrt{E[Y^2] - (E[Y])^2}}$$

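Computing Correlation Coefficient r
Let $(x_1, y_1), \ldots, (x_n, y_n)$ represent n points on a scatterplot.
Compute the means $\bar{x}, \bar{y}$ and the standard deviations $s_x, s_y$ of the x's and y's.
Then convert each x and y to standard units. That is, compute the z-scores $(x_i - \bar{x})/s_x$ and $(y_i - \bar{y})/s_y$.
The correlation coefficient is the average of the products of the z-scores, except that we divide by $n-1$ instead of $n$:

$$r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$$

Sometimes, this computation is more useful:

$$r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}$$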
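The two ways of computing r described on this slide can be checked against each other with a short sketch. The height and forearm values are made up for illustration, and the function names are hypothetical; only the Python standard library is used.

```python
import math

def correlation_from_z_scores(xs, ys):
    """Sum of products of z-scores, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    # Sample standard deviations use the n - 1 divisor.
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y)
                for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

def correlation_from_sums(xs, ys):
    """Equivalent form based on sums of deviations from the means."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Made-up heights and forearm lengths (inches), for illustration only.
heights = [64.0, 66.0, 68.0, 70.0, 72.0, 74.0]
forearms = [9.8, 10.1, 10.3, 10.9, 11.0, 11.6]

r_z = correlation_from_z_scores(heights, forearms)
r_sums = correlation_from_sums(heights, forearms)
print(f"r (z-scores) = {r_z:.4f}, r (sums) = {r_sums:.4f}")
```

Both functions should agree up to floating-point rounding, since the $n-1$ factors in the z-score form cancel.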

Comments on Correlation Coefficient
In principle, the correlation coefficient can be calculated for any set of points.
In many cases, the points constitute a random sample from a population of points. In this case, the correlation coefficient is called the sample correlation, and it is an estimate of the population correlation.
It is a fact that r is always between -1 and 1.
Positive values of r indicate that the least-squares line has a positive slope: greater values of one variable are associated with greater values of the other.
Negative values of r indicate that the least-squares line has a negative slope: greater values of one variable are associated with lesser values of the other.

Examples of Various Levels of Correlation
[Figure: scatterplots illustrating various levels of correlation]

More Comments
Values of r close to -1 or 1 indicate a strong linear relationship.
Values of r close to 0 indicate a weak linear relationship.
When r is equal to -1 or 1, all the points on the scatterplot lie exactly on a straight line.
If the points lie exactly on a horizontal or vertical line, then r is undefined.
If $r \neq 0$, then x and y are said to be correlated. If $r = 0$, then x and y are uncorrelated.

Properties of Correlation Coefficient r (1/2)
An important feature of r is that it is unitless. It is a pure number that can be compared between different samples.
r remains unchanged under each of the following operations:
Multiplying each value of a variable by a positive constant
Adding a constant to each value of a variable
Interchanging the values of x and y
If $r = 0$, this does not imply that there is no relationship between x and y. It just indicates that there is no linear relationship.
Example: a quadratic relationship such as $y = 64x - 16x^2$ can give $r = 0$, because $\sum_{i} (x_i - \bar{x})(y_i - \bar{y}) = 0$ when the x-values are spread symmetrically about the vertex.
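A quick numerical check of these properties, as a sketch: the helper `corr` and all data values are assumptions made for illustration. The quadratic used here, $y = 64x - 16x^2$ evaluated on x-values symmetric about $x = 2$, is chosen so that the deviations cancel exactly.

```python
import math

def corr(xs, ys):
    """Sample correlation coefficient r (hypothetical helper)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

# Made-up data with a roughly linear trend.
a = [1.0, 2.0, 3.0, 4.0, 5.0]
b = [2.1, 3.9, 6.2, 7.8, 10.1]

r_base = corr(a, b)
r_scaled = corr([3.0 * v for v in a], b)    # multiply by a positive constant
r_shifted = corr(a, [v + 10.0 for v in b])  # add a constant
r_swapped = corr(b, a)                      # interchange x and y

# Exact quadratic relationship, x-values symmetric about the vertex x = 2.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
y = [64.0 * t - 16.0 * t ** 2 for t in x]
r_quadratic = corr(x, y)

print(r_base, r_scaled, r_shifted, r_swapped, r_quadratic)
```

The first four values should coincide up to rounding, while the last is zero even though y is completely determined by x.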

Properties of Correlation Coefficient r (2/2)
Outliers can greatly distort r, especially in small data sets, and present a serious problem for data analysts.
[Figure: scatterplot with an outlier, annotated with the correlation coefficient r]
Correlation is not causation.
For example, vocabulary size is strongly correlated with shoe size, but this is because both increase with age. Learning more words does not cause feet to grow, and vice versa. Age is confounding the results.

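Inference on the Population Correlation
If the random variables X and Y have a certain joint distribution called a bivariate normal distribution, then the sample correlation r can be used to construct confidence intervals and perform hypothesis tests on the population correlation, $\rho$. The following results make this possible.

Bivariate normal notation:

$$Z = \begin{pmatrix} X \\ Y \end{pmatrix}, \quad \mu_Z = \begin{pmatrix} \mu_X \\ \mu_Y \end{pmatrix}, \quad \Sigma_Z = \begin{pmatrix} \sigma_X^2 & \sigma_{X,Y} \\ \sigma_{X,Y} & \sigma_Y^2 \end{pmatrix}$$

Let $(x_1, y_1), \ldots, (x_n, y_n)$ be a random sample from the joint distribution of X and Y, and let r be the sample correlation of the n points. Then the quantity (a function of r)

$$W = \frac{1}{2} \ln \frac{1 + r}{1 - r}$$

is approximately normal with mean

$$\mu_W = \frac{1}{2} \ln \frac{1 + \rho}{1 - \rho}$$

and variance

$$\sigma_W^2 = \frac{1}{n - 3}$$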

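Example 7.3
Question: Find a 95% confidence interval for the correlation between the reaction time to a visual stimulus (x) and that to an audio stimulus (y), given the following sample:

Visual (x):  161  203  235  176  201  188  228  211  191  178
Audio (y):   159  206  241  163  197  193  209  189  169  201

The sample correlation between x and y is $r = 0.8159$, so

$$W = \frac{1}{2} \ln \frac{1 + r}{1 - r} = \frac{1}{2} \ln \frac{1 + 0.8159}{1 - 0.8159} = 1.1444$$

$\sigma_W$ is given by

$$\sigma_W = \sqrt{\frac{1}{10 - 3}} = 0.3780$$

A 95% (two-sided) confidence interval for $\mu_W$:

$$1.1444 - 1.96(0.3780) < \mu_W < 1.1444 + 1.96(0.3780)$$
$$0.4036 < \mu_W < 1.8852$$

Note that the population correlation $\rho$ can be expressed as

$$\rho = \frac{e^{2\mu_W} - 1}{e^{2\mu_W} + 1}$$

The corresponding 95% confidence interval for $\rho$:

$$\frac{e^{2(0.4036)} - 1}{e^{2(0.4036)} + 1} < \rho < \frac{e^{2(1.8852)} - 1}{e^{2(1.8852)} + 1}$$
$$0.383 < \rho < 0.955$$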
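The steps of this example can be retraced in a few lines. This is a sketch using only the standard library: the ten data pairs are the Example 7.3 sample as read back from the slide (treat the exact values as an assumption), and the function and variable names are hypothetical.

```python
import math

# Reaction times as read from Example 7.3 (assumed values).
visual = [161, 203, 235, 176, 201, 188, 228, 211, 191, 178]
audio = [159, 206, 241, 163, 197, 193, 209, 189, 169, 201]

def sample_correlation(xs, ys):
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)

n = len(visual)
r = sample_correlation(visual, audio)

# W = (1/2) ln((1 + r) / (1 - r)), approximately normal.
w = 0.5 * math.log((1 + r) / (1 - r))
sigma_w = math.sqrt(1 / (n - 3))

# 95% two-sided interval for mu_W, using the normal quantile 1.96.
z = 1.96
w_lo, w_hi = w - z * sigma_w, w + z * sigma_w

def to_rho(mu_w):
    """Invert the transform: rho = (e^{2 mu} - 1) / (e^{2 mu} + 1)."""
    return (math.exp(2 * mu_w) - 1) / (math.exp(2 * mu_w) + 1)

rho_lo, rho_hi = to_rho(w_lo), to_rho(w_hi)
print(f"r = {r:.4f}, W = {w:.4f}, sigma_W = {sigma_w:.4f}")
print(f"{w_lo:.4f} < mu_W < {w_hi:.4f}")
print(f"{rho_lo:.3f} < rho < {rho_hi:.3f}")
```

Small differences in the last digit relative to the slide can arise because the slide rounds r before transforming it.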

Linear Model
When two variables have a linear relationship, the scatterplot tends to be clustered around a line known as the least-squares line.
The line that we are trying to fit is

ideal value: $l_i = \beta_0 + \beta_1 x_i$
measured value: $y_i = l_i + \varepsilon_i = \beta_0 + \beta_1 x_i + \varepsilon_i$

where $y_i$ is the dependent variable, $x_i$ is the independent variable, and $\varepsilon_i$ is the measurement error.
$\beta_0$ and $\beta_1$ are called the regression coefficients.
We only know the values of x and y; we must estimate the other quantities.
This is what we call simple linear regression: we use the data to estimate these quantities.

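The Least-Squares Line
$\beta_0$ and $\beta_1$ cannot be determined exactly because of measurement error, but they can be estimated by calculating the least-squares line

$$y = \hat{\beta}_0 + \hat{\beta}_1 x$$

$\hat{\beta}_0$ and $\hat{\beta}_1$ are called the least-squares coefficients.
The least-squares line is the line that fits the data best (?).
[Figure: data contaminated with random errors, showing the fitted value $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$ and the residual $e_i = y_i - \hat{y}_i$]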

Residuals
For each data point $(x_i, y_i)$, the vertical distance to the point $(x_i, \hat{y}_i)$ on the least-squares line is $e_i = y_i - \hat{y}_i$.
The quantity $\hat{y}_i$ is called the fitted value and the quantity $e_i$ is called the residual associated with the point $(x_i, y_i)$.
Points above the least-squares line have positive residuals. Points below the line have negative residuals.
The closer the residuals are to 0, the closer the fitted values are to the observations and the better the line fits the data.
The least-squares line is the one that minimizes the sum of squared residuals $\sum_{i=1}^{n} e_i^2$.

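Finding the Equation of the Line
To find the least-squares line, we must determine estimates for the slope $\beta_1$ and intercept $\beta_0$ that minimize the sum of the squared residuals

$$E = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left(y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i\right)^2$$

These quantities are

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}$$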

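Some Shortcut Formulas
The expressions on the right are equivalent to those on the left, and are often easier to compute:

$$\sum_{i=1}^{n} (x_i - \bar{x})^2 = \sum_{i=1}^{n} x_i^2 - n\bar{x}^2$$
$$\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} y_i^2 - n\bar{y}^2$$
$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) = \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}$$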
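As a sketch of how the least-squares coefficients can be computed, the snippet below evaluates the slope and intercept from the deviation sums and again from the shortcut sums. The data are made up for illustration and all names are hypothetical.

```python
# Made-up data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.1, 7.9, 10.2]
n = len(xs)
x_bar = sum(xs) / n
y_bar = sum(ys) / n

# Deviation form of the sums.
s_xx = sum((x - x_bar) ** 2 for x in xs)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Shortcut form: raw sums minus n times the product of means.
s_xx_short = sum(x * x for x in xs) - n * x_bar ** 2
s_xy_short = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar

# Least-squares slope and intercept.
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

beta1_short = s_xy_short / s_xx_short
beta0_short = y_bar - beta1_short * x_bar

print(f"slope = {beta1_hat:.4f}, intercept = {beta0_hat:.4f}")
```

The shortcut sums avoid a second pass over the data, but they subtract two large numbers and can lose precision when the values are far from zero, so the deviation form is often the safer choice in floating point.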

Cautions
Do not extrapolate the fitted line (such as the least-squares line) outside the range of the data. The linear relationship may not hold there.
We learned that we should not use the correlation coefficient when the relationship between x and y is not linear. The same holds for the least-squares line. When the scatterplot follows a curved pattern, it does not make sense to summarize it with a straight line.
If the relationship is curved, then we would want to fit a regression curve that contains squared terms (i.e., polynomial regression).

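Measures of Goodness of Fit
A goodness-of-fit statistic is a quantity that measures how well a model explains a given set of data.
The quantity $r^2$ is the square of the correlation coefficient, and we call it the coefficient of determination:

$$r^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

Here $\sum (y_i - \bar{y})^2$ is the total sum of squares and $\sum (y_i - \hat{y}_i)^2$ is the error sum of squares.
The interpretation of $r^2$ is the proportion of variance explained by regression.
The difference $\sum (y_i - \bar{y})^2 - \sum (y_i - \hat{y}_i)^2$ measures the reduction in the spread of the points obtained by using the least-squares line rather than the line $y = \bar{y}$.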

Sums of Squares
$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the error sum of squares (SSE) and measures the overall spread of the points around the least-squares line.
$\sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares (SST) and measures the overall spread of the points around the line $y = \bar{y}$.
The difference

$$\sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2$$

is called the regression sum of squares (SSR).
Clearly, the following relationship holds (the analysis of variance identity):

Total sum of squares (SST) = regression sum of squares (SSR) + error sum of squares (SSE)
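The analysis of variance identity and its link to $r^2$ can be verified numerically. The following is a minimal sketch on made-up data, with hypothetical variable names.

```python
import math

# Made-up data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.1, 7.9, 10.2]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

# Least-squares fit and fitted values.
beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar
fitted = [beta0_hat + beta1_hat * x for x in xs]

sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))  # error sum of squares
sst = s_yy                                           # total sum of squares
ssr = sum((f - y_bar) ** 2 for f in fitted)          # regression sum of squares

r = s_xy / math.sqrt(s_xx * s_yy)
r_squared = (sst - sse) / sst

print(f"SST = {sst:.4f}, SSR = {ssr:.4f}, SSE = {sse:.4f}")
print(f"r^2 from sums = {r_squared:.6f}, r^2 from r = {r ** 2:.6f}")
```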

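Uncertainties in the Least-Squares Coefficients
Assumptions for Errors in Linear Models: $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$
In the simplest situation, the following assumptions are satisfied:
1. The errors $\varepsilon_1, \ldots, \varepsilon_n$ are random and independent. In particular, the magnitude of any error $\varepsilon_i$ does not influence the value of the next error $\varepsilon_{i+1}$.
2. The errors $\varepsilon_1, \ldots, \varepsilon_n$ all have mean 0.
3. The errors $\varepsilon_1, \ldots, \varepsilon_n$ all have the same variance, which we denote by $\sigma^2$ (the variance of the error).
4. The errors $\varepsilon_1, \ldots, \varepsilon_n$ are normally distributed.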

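Distribution of $y_i$
In the linear model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, under assumptions 1 through 4, the observations $y_1, \ldots, y_n$ are independent random variables that follow the normal distribution.
The mean and variance of $y_i$ are given by

$$\mu_{y_i} = l_i = \beta_0 + \beta_1 x_i, \qquad \sigma_{y_i}^2 = \sigma^2$$

The slope $\beta_1$ represents the change in the mean of y associated with an increase of one unit in the value of x.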

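Distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$ (1/2)
Under assumptions 1 through 4, the quantities $\hat{\beta}_0$ and $\hat{\beta}_1$ are normally distributed random variables.
After further manipulation, each can be written as a linear combination of the $y_i$:

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2} = \sum_{i=1}^{n} \left[\frac{x_i - \bar{x}}{\sum_{j=1}^{n} (x_j - \bar{x})^2}\right] y_i$$

$$\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x} = \sum_{i=1}^{n} \left[\frac{1}{n} - \frac{\bar{x}(x_i - \bar{x})}{\sum_{j=1}^{n} (x_j - \bar{x})^2}\right] y_i$$

Their means are

$$\mu_{\hat{\beta}_0} = \beta_0, \qquad \mu_{\hat{\beta}_1} = \beta_1$$

so $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased estimates.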
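Unbiasedness can be illustrated with a small simulation. This is a sketch under assumed values: the "true" coefficients, the error standard deviation, the x-values, and the number of trials are all chosen arbitrarily, and the seed is fixed only to make the run repeatable.

```python
import random

def least_squares(xs, ys):
    """Return (intercept, slope) of the least-squares line."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

random.seed(0)

# Assumed true model: y = 2.0 + 0.5 x + error, error ~ Normal(0, 1).
BETA0, BETA1, SIGMA = 2.0, 0.5, 1.0
xs = [float(i) for i in range(10)]
n_trials = 2000

b0_draws, b1_draws = [], []
for _ in range(n_trials):
    # Simulate one data set that satisfies assumptions 1 through 4.
    ys = [BETA0 + BETA1 * x + random.gauss(0.0, SIGMA) for x in xs]
    b0, b1 = least_squares(xs, ys)
    b0_draws.append(b0)
    b1_draws.append(b1)

mean_b0 = sum(b0_draws) / n_trials
mean_b1 = sum(b1_draws) / n_trials
print(f"average intercept = {mean_b0:.3f}, average slope = {mean_b1:.3f}")
```

Averaged over many simulated samples, the estimates should land close to the assumed true coefficients, while any single sample can miss by a noticeable amount.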

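Distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$ (2/2)
The standard deviations of $\hat{\beta}_0$ and $\hat{\beta}_1$ are

$$\sigma_{\hat{\beta}_0} = \sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}}, \qquad \sigma_{\hat{\beta}_1} = \frac{\sigma}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}$$

Since $\sigma$ is unknown, they are estimated with $s_{\hat{\beta}_0}$ and $s_{\hat{\beta}_1}$, obtained by replacing $\sigma$ with

$$s = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n - 2}} = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n - 2}} = \sqrt{\frac{(1 - r^2) \sum_{i=1}^{n} (y_i - \bar{y})^2}{n - 2}}$$

s is an estimate of the error standard deviation $\sigma$.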
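The estimate s and the two estimated standard deviations can be computed directly from a fitted line. The sketch below uses made-up data and hypothetical names, and it also checks that the residual-based and $r^2$-based expressions for s agree.

```python
import math

# Made-up data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.2, 4.1, 6.1, 7.9, 10.2]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

s_xx = sum((x - x_bar) ** 2 for x in xs)
s_yy = sum((y - y_bar) ** 2 for y in ys)
s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))

beta1_hat = s_xy / s_xx
beta0_hat = y_bar - beta1_hat * x_bar

# Estimate of the error standard deviation from the residuals.
residuals = [y - (beta0_hat + beta1_hat * x) for x, y in zip(xs, ys)]
s_from_residuals = math.sqrt(sum(e ** 2 for e in residuals) / (n - 2))

# Same estimate from the correlation coefficient.
r = s_xy / math.sqrt(s_xx * s_yy)
s_from_r = math.sqrt((1 - r ** 2) * s_yy / (n - 2))

# Estimated standard deviations of the least-squares coefficients.
s = s_from_residuals
s_beta0 = s * math.sqrt(1 / n + x_bar ** 2 / s_xx)
s_beta1 = s / math.sqrt(s_xx)

print(f"s = {s:.4f}, s_beta0 = {s_beta0:.4f}, s_beta1 = {s_beta1:.4f}")
```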

Notes
1. Since a measure of the variation of x, $\sum_{i=1}^{n} (x_i - \bar{x})^2$, appears in the denominator of both of the uncertainties we just defined, the more spread out the x's are, the smaller the uncertainties in $\hat{\beta}_0$ and $\hat{\beta}_1$.
2. Use caution: if the range of x values extends beyond the range where the linear model holds, the results will not be valid.
3. The quantities $(\hat{\beta}_0 - \beta_0)/s_{\hat{\beta}_0}$ and $(\hat{\beta}_1 - \beta_1)/s_{\hat{\beta}_1}$ have Student's t distribution with $n - 2$ degrees of freedom.

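Confidence Intervals for $\beta_0$ and $\beta_1$
Level $100(1 - \alpha)\%$ two-sided confidence intervals for $\beta_0$ and $\beta_1$ are given by

$$\hat{\beta}_0 \pm t_{n-2,\,\alpha/2}\, s_{\hat{\beta}_0} \qquad \text{and} \qquad \hat{\beta}_1 \pm t_{n-2,\,\alpha/2}\, s_{\hat{\beta}_1}$$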
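A minimal sketch of the interval computation follows. The estimate, its standard error, and the sample size are hypothetical inputs, and the critical value 2.306 is the usual t-table entry for 8 degrees of freedom and an upper-tail area of 0.025; it is hard-coded here because the Python standard library has no t quantile function.

```python
def confidence_interval(estimate, std_error, t_crit):
    """Two-sided interval: estimate +/- t_crit * std_error."""
    half_width = t_crit * std_error
    return estimate - half_width, estimate + half_width

# Hypothetical fit summary, for illustration only.
n = 10                   # sample size, so n - 2 = 8 degrees of freedom
beta1_hat = 0.5          # estimated slope
s_beta1 = 0.11           # estimated standard deviation of the slope
T_CRIT_8_DF_95 = 2.306   # t_{8, 0.025} from a standard t-table

lo, hi = confidence_interval(beta1_hat, s_beta1, T_CRIT_8_DF_95)
print(f"95% confidence interval for the slope: ({lo:.4f}, {hi:.4f})")
```

The same function applies to the intercept by passing $\hat{\beta}_0$ and $s_{\hat{\beta}_0}$ with the same critical value.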

Summary
We discussed:
Correlation
Least-squares line / regression
Uncertainties in the least-squares coefficients
Confidence intervals (and hypothesis tests) for least-squares coefficients