Correlation and Simple Linear Regression

Berlin Chen
Department of Computer Science & Information Engineering
National Taiwan Normal University

Reference:
1. W. Navidi. Statistics for Engineers and Scientists. Chapter 7 (7.1-7.3) & Teaching Material

Introduction (1/2)

Often, scientists and engineers collect data in order to determine the nature of the relationship between two quantities.
An example: heights and forearm lengths of men.
The points tend to slope upward and to the right, indicating that taller men tend to have longer forearms.
There is a positive association between height and forearm length.

Statistics-Berlin Chen 2

Introduction (2/2)

Many times, these ordered pairs of measurements fall approximately along a straight line when plotted.
In those situations, the data can be used to compute an equation for the line that best fits the data.
This line can be used for various things:
- Summarize the data
- Predict future (unseen) values

Correlation

Something we may be interested in is how closely related two physical characteristics are, for example, the height and weight of a two-year-old child.
The quantity called the correlation coefficient is a measure of this.
We look at the direction of the relationship (positive or negative) and the strength of the relationship, and then we find a line that best fits the data.
In computing correlation, we can only use quantitative data (instead of qualitative data).

Example

This is a plot of height vs. forearm length for men.
We say that there is a positive association between height and forearm length, because the plot indicates that taller men tend to have longer forearms.
The slope is roughly constant throughout the plot, indicating that the points are clustered around a straight line.
The line superimposed on the plot is a special line known as the least-squares line.

Correlation Coefficient

The degree to which the points in a scatterplot tend to cluster around a line reflects the strength of the linear relationship between x and y.
The correlation coefficient is a numerical measure of the strength of the linear relationship between two variables.
The correlation coefficient is usually denoted by the letter r.
It is also called the sample correlation (cf. the population correlation $\rho_{X,Y}$):

\[
\rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y}
= \frac{E\big[(X - E[X])(Y - E[Y])\big]}{\sqrt{E\big[(X - E[X])^2\big]}\,\sqrt{E\big[(Y - E[Y])^2\big]}}
\]

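Computing Correlation Coefficient r

Let $(x_1, y_1), \ldots, (x_n, y_n)$ represent n points on a scatterplot.
Compute the means and the standard deviations of the x's and y's.
Then convert each x and y to standard units. That is, compute the z-scores $(x_i - \bar{x})/s_x$ and $(y_i - \bar{y})/s_y$.
The correlation coefficient is the average of the products of the z-scores, except that we divide by n-1 instead of n:

\[
r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right) \left(\frac{y_i - \bar{y}}{s_y}\right)
\]

Sometimes, this computation is more useful:

\[
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad
s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}},
\quad
s_y = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}}
\]
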
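The two ways of computing r described on this slide can be sketched in Python as follows. This is a minimal illustration, not part of the original slides: the function names and the five data points are hypothetical, chosen only so the arithmetic is easy to follow.

```python
import math


def mean(values):
    """Arithmetic mean of a sequence of numbers."""
    return sum(values) / len(values)


def sample_std(values):
    """Sample standard deviation (divides by n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def corr_from_z_scores(xs, ys):
    """r as the sum of the z-score products divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = sample_std(xs), sample_std(ys)
    products = [((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)


def corr_from_sums(xs, ys):
    """r from the sums of squares and cross-products (no z-scores needed)."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical illustrative data (not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(corr_from_z_scores(xs, ys), corr_from_sums(xs, ys))
```

For these points both functions give about 0.7746, since the cross-product sum is 6 and the two sums of squares are 10 and 6, so r = 6/sqrt(60).
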
Comments on Correlation Coefficient

In principle, the correlation coefficient can be calculated for any set of points.
In many cases, the points constitute a random sample from a population of points. In this case, the correlation coefficient is called the sample correlation, and it is an estimate of the population correlation.
It is a fact that r is always between -1 and 1.
Positive values of r indicate that the least-squares line has a positive slope: greater values of one variable are associated with greater values of the other.
Negative values of r indicate that the least-squares line has a negative slope: greater values of one variable are associated with lesser values of the other.

Examples of Various Levels of Correlation

[Figure: scatterplots illustrating various levels of correlation]

More Comments

Values of r close to -1 or 1 indicate a strong linear relationship.
Values of r close to 0 indicate a weak linear relationship.
When r is equal to -1 or 1, all the points on the scatterplot lie exactly on a straight line.
If the points lie exactly on a horizontal or vertical line, then r is undefined.
If $r \neq 0$, then x and y are said to be correlated. If $r = 0$, then x and y are uncorrelated.

Properties of Correlation Coefficient r (1/2)

An important feature of r is that it is unitless. It is a pure number that can be compared between different samples.
r remains unchanged under each of the following operations:
- Multiplying each value of a variable by a positive constant
- Adding a constant to each value of a variable
- Interchanging the values of x and y

\[
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
\]

If r = 0, this does not imply that there is no relationship between x and y. It just indicates that there is no linear relationship.
Example of a quadratic relationship: $y = 64x - 16x^2$.

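The invariance properties and the quadratic example can be checked numerically with a short sketch. The height and forearm values below are hypothetical, and the x values for the quadratic case are deliberately chosen symmetric about x = 2 (the vertex of y = 64x - 16x^2), which is an assumption that makes r come out exactly zero.

```python
import math


def corr(xs, ys):
    """Sample correlation coefficient from sums of squares and cross-products."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


# Hypothetical heights and forearm lengths in inches (illustration only).
heights = [66.0, 68.0, 70.0, 72.0, 74.0]
forearms = [9.8, 10.1, 10.6, 10.7, 11.2]

r = corr(heights, forearms)
r_scaled = corr([2.54 * h for h in heights], forearms)   # inches to centimeters
r_shifted = corr([h + 10.0 for h in heights], forearms)  # add a constant
r_swapped = corr(forearms, heights)                      # interchange x and y

# Quadratic relationship sampled symmetrically about x = 2.
x_quad = [0, 1, 2, 3, 4]
y_quad = [64 * x - 16 * x ** 2 for x in x_quad]
r_quad = corr(x_quad, y_quad)

print(r, r_scaled, r_shifted, r_swapped, r_quad)
```

The first four values agree up to rounding, while `r_quad` is zero even though y is completely determined by x.
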
Properties of Correlation Coefficient r (2/2)

Outliers can greatly distort r, especially in small data sets, and present a serious problem for data analysts.

[Figure: scatterplot with an outlier; correlation coefficient r = 0.26]

Correlation is not causation.
For example, vocabulary size is strongly correlated with shoe size, but this is because both increase with age. Learning more words does not cause feet to grow, and vice versa. Age is confounding the results.

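Inference on the Population Correlation

If the random variables X and Y have a certain joint distribution called a bivariate normal distribution, then the sample correlation r can be used to construct confidence intervals and perform hypothesis tests on the population correlation $\rho$. The following results make this possible.

Notation for the bivariate normal distribution:

\[
Z = \begin{bmatrix} X \\ Y \end{bmatrix}, \qquad
\mu_Z = \begin{bmatrix} \mu_X \\ \mu_Y \end{bmatrix}, \qquad
\Sigma_Z = \begin{bmatrix} \sigma_{X,X} & \sigma_{X,Y} \\ \sigma_{X,Y} & \sigma_{Y,Y} \end{bmatrix}
\]

Let $(x_1, y_1), \ldots, (x_n, y_n)$ be a random sample from the joint distribution of X and Y, and let r be the sample correlation of the n points. Then the quantity

\[
W = \frac{1}{2} \ln \frac{1 + r}{1 - r} \quad \text{(a function of } r\text{)}
\]

is approximately normal with mean and variance

\[
\mu_W = \frac{1}{2} \ln \frac{1 + \rho}{1 - \rho}, \qquad \sigma_W^2 = \frac{1}{n - 3}.
\]
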
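Example 7.3

Question: Find a 95% confidence interval for the correlation between the reaction time to a visual stimulus (x) and that to an audio stimulus (y), given the following sample:

  x:  161  203  235  176  201  188  228  211  191  178
  y:  159  206  241  163  197  193  209  189  169  201

The sample correlation between x and y is r = 0.8159, so

\[
W = \frac{1}{2} \ln \frac{1 + r}{1 - r} = \frac{1}{2} \ln \frac{1 + 0.8159}{1 - 0.8159} = 1.1444,
\qquad
\sigma_W = \sqrt{\frac{1}{10 - 3}} = 0.3780.
\]

A 95% (two-sided) confidence interval for $\mu_W$ is

\[
1.1444 - 1.96(0.3780) < \mu_W < 1.1444 + 1.96(0.3780),
\qquad \text{i.e.} \qquad
0.4036 < \mu_W < 1.8852.
\]

Note that the population correlation $\rho$ can be expressed as

\[
\rho = \frac{e^{2\mu_W} - 1}{e^{2\mu_W} + 1}.
\]

The corresponding 95% confidence interval for $\rho$ is

\[
\frac{e^{2(0.4036)} - 1}{e^{2(0.4036)} + 1} < \rho < \frac{e^{2(1.8852)} - 1}{e^{2(1.8852)} + 1},
\qquad \text{i.e.} \qquad
0.383 < \rho < 0.955.
\]
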
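The interval calculation in Example 7.3 can be reproduced with a few lines of Python. This is a sketch under the stated bivariate normal assumption; the function name `correlation_ci` and the default critical value 1.96 (for a 95% two-sided interval) are choices made here for illustration.

```python
import math


def correlation_ci(r, n, z_crit=1.96):
    """Approximate confidence interval for the population correlation.

    Transforms r to W = 0.5 * ln((1 + r) / (1 - r)), builds a normal
    interval for mu_W with standard deviation sqrt(1 / (n - 3)), and
    maps both endpoints back to the correlation scale.
    """
    w = 0.5 * math.log((1 + r) / (1 - r))
    sigma_w = math.sqrt(1 / (n - 3))
    lower_w = w - z_crit * sigma_w
    upper_w = w + z_crit * sigma_w

    def to_rho(mu_w):
        # Inverse transform: rho = (e^(2 mu_W) - 1) / (e^(2 mu_W) + 1)
        return (math.exp(2 * mu_w) - 1) / (math.exp(2 * mu_w) + 1)

    return to_rho(lower_w), to_rho(upper_w)


# Values from Example 7.3: r = 0.8159 with n = 10 observations.
lower, upper = correlation_ci(0.8159, 10)
print(round(lower, 3), round(upper, 3))
```

With r = 0.8159 and n = 10 this gives roughly 0.383 and 0.955, matching the example. The interval is not symmetric about r because the back-transformation is nonlinear.
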
Linear Model

When two variables have a linear relationship, the scatterplot tends to be clustered around a line known as the least-squares line.
The line that we are trying to fit is

\[
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i, \qquad l_i = \beta_0 + \beta_1 x_i
\]

where $l_i$ is the ideal value, $y_i$ is the measured value, y is the dependent variable, x is the independent variable, and $\varepsilon_i$ is the measurement error.
$\beta_0$ and $\beta_1$ are called the regression coefficients.
We only know the values of x and y; we must estimate the other quantities, and we use the data to do so.
This is what we call simple linear regression: there is only one independent variable.

The Least-Squares Line

$\beta_0$ and $\beta_1$ cannot be determined because of measurement error, but they can be estimated by calculating the least-squares line

\[
y = \hat{\beta}_0 + \hat{\beta}_1 x
\]

$\hat{\beta}_0$ and $\hat{\beta}_1$ are called the least-squares coefficients.
The least-squares line is the line that best fits the data, which are contaminated with random errors.
Fitted value: $\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i$. Residual: $e_i = y_i - \hat{y}_i$.

Residuals

For each data point $(x_i, y_i)$, the vertical distance to the point $(x_i, \hat{y}_i)$ on the least-squares line is $e_i = y_i - \hat{y}_i$. The quantity $\hat{y}_i$ is called the fitted value and the quantity $e_i$ is called the residual associated with the point $(x_i, y_i)$.
Points above the least-squares line have positive residuals.
Points below the line have negative residuals.
The closer the residuals are to 0, the closer the fitted values are to the observations and the better the line fits the data.
The least-squares line is the one that minimizes the sum of squared residuals

\[
S = \sum_{i=1}^{n} e_i^2
\]

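Finding the Equation of the Line

To find the least-squares line, we must determine estimates for the intercept $\beta_0$ and the slope $\beta_1$ that minimize the sum of the squared residuals

\[
S = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right)^2
\]

These quantities are

\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\]
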
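A brief sketch of where these two formulas come from, using standard calculus (this derivation is not spelled out on the slide). Setting the partial derivatives of the sum of squared residuals to zero gives

\[
\begin{aligned}
\frac{\partial S}{\partial \hat{\beta}_0} &= -2 \sum_{i=1}^{n} \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) = 0
&&\Rightarrow\quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \\
\frac{\partial S}{\partial \hat{\beta}_1} &= -2 \sum_{i=1}^{n} x_i \left( y_i - \hat{\beta}_0 - \hat{\beta}_1 x_i \right) = 0
&&\Rightarrow\quad \sum_{i=1}^{n} x_i \left[ (y_i - \bar{y}) - \hat{\beta}_1 (x_i - \bar{x}) \right] = 0 .
\end{aligned}
\]

Because $\sum_i \bar{x}\,(y_i - \bar{y}) = 0$ and $\sum_i \bar{x}\,(x_i - \bar{x}) = 0$, each $x_i$ in the last equation can be replaced by $(x_i - \bar{x})$, which yields

\[
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.
\]
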
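Some Shortcut Formulas

The expressions on the right are equivalent to those on the left, and are often easier to compute:

\[
\begin{aligned}
\sum_{i=1}^{n} (x_i - \bar{x})^2 &= \sum_{i=1}^{n} x_i^2 - n\bar{x}^2 \\
\sum_{i=1}^{n} (y_i - \bar{y})^2 &= \sum_{i=1}^{n} y_i^2 - n\bar{y}^2 \\
\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) &= \sum_{i=1}^{n} x_i y_i - n\bar{x}\bar{y}
\end{aligned}
\]
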
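As a minimal sketch, the least-squares coefficients can be computed either from the centered sums or from the shortcut sums, and the two routes should agree. The function names and the small data set are hypothetical, used only for illustration.

```python
def fit_line_centered(xs, ys):
    """Least-squares intercept and slope from centered sums."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def fit_line_shortcut(xs, ys):
    """Same coefficients using the shortcut formulas for the sums."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    s_xx = sum(x * x for x in xs) - n * x_bar ** 2
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical illustrative data (not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
print(fit_line_centered(xs, ys))
print(fit_line_shortcut(xs, ys))
```

For these points both functions give an intercept of about 2.2 and a slope of about 0.6. The shortcut form needs only running totals, but it subtracts two large numbers when the data are far from zero, so the centered form is usually the safer choice in floating-point arithmetic.
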
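Cautions (1/2)

The estimates $\hat{\beta}_0$ and $\hat{\beta}_1$ are not the same as the true values $\beta_0$ and $\beta_1$:
- $\hat{\beta}_0$ and $\hat{\beta}_1$ are random variables
- $\beta_0$ and $\beta_1$ are constants whose values are unknown

The residuals $e_i$ are not the same as the errors $\varepsilon_i$:
- The residuals $e_i$ can be computed; they are the differences between $y_i$ and $\hat{y}_i$
- The errors $\varepsilon_i$ cannot be computed, since the true values $\beta_0$ and $\beta_1$ are unknown (that is, $l_i$ is unknown)

\[
e_i = y_i - \hat{y}_i, \qquad \varepsilon_i = y_i - l_i
\]
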
Cautions (2/2)

Do not extrapolate the fitted line (such as the least-squares line) outside the range of the data. The linear relationship may not hold there.
We learned that we should not use the correlation coefficient when the relationship between x and y is not linear. The same holds for the least-squares line. When the scatterplot follows a curved pattern, it does not make sense to summarize it with a straight line.
If the relationship is curved, then we would want to fit a regression curve that contains squared terms (i.e., polynomial regression).

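Another Representation of the Line

Another way to compute an estimate of $\beta_1$ is

\[
\hat{\beta}_1 = r \frac{s_y}{s_x},
\qquad
s_y = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \bar{y})^2}{n-1}},
\quad
s_x = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}
\]

The units (but not the value) of $\hat{\beta}_1$ must be the same as those of y/x.
The slope is proportional to the correlation coefficient.
The least-squares line can be rewritten as

\[
\hat{y} - \bar{y} = r \frac{s_y}{s_x} (x - \bar{x})
\]

So the line passes through the center of mass $(\bar{x}, \bar{y})$ of the scatterplot with slope $r(s_y/s_x)$.
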
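The relation between the slope and the correlation coefficient can be verified numerically. The sketch below is illustrative only: the helper name `slope_two_ways` and the data are hypothetical.

```python
import math


def slope_two_ways(xs, ys):
    """Return the least-squares slope computed directly and as r * s_y / s_x."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)

    slope_direct = s_xy / s_xx

    r = s_xy / math.sqrt(s_xx * s_yy)
    s_x = math.sqrt(s_xx / (n - 1))  # sample standard deviation of x
    s_y = math.sqrt(s_yy / (n - 1))  # sample standard deviation of y
    slope_from_r = r * s_y / s_x

    return slope_direct, slope_from_r


# Hypothetical illustrative data (not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
direct, from_r = slope_two_ways(xs, ys)
print(direct, from_r)
```

Both values are about 0.6 here. Since $s_y/s_x$ is positive, the slope always has the same sign as r.
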
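Measures of Goodness of Fit

A goodness-of-fit statistic is a quantity that measures how well a model explains a given set of data.
The quantity $r^2$ is the square of the correlation coefficient, and we call it the coefficient of determination:

\[
r^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
\]

Here $\sum (y_i - \bar{y})^2$ is the total sum of squares and $\sum (y_i - \hat{y}_i)^2$ is the error sum of squares.
The interpretation of $r^2$ is the proportion of the variance in y explained by regression.
The difference $\sum (y_i - \bar{y})^2 - \sum (y_i - \hat{y}_i)^2$ measures the reduction in the spread of the points obtained by using the least-squares line rather than the line $y = \bar{y}$. It is a goodness-of-fit statistic that has units.
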
Sums of Squares (1/2)

$\sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is the error sum of squares (SSE) and measures the overall spread of the points around the least-squares line.
$\sum_{i=1}^{n} (y_i - \bar{y})^2$ is the total sum of squares (SST) and measures the overall spread of the points around the line $y = \bar{y}$.
The difference $\sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$ is called the regression sum of squares (SSR).
Clearly, the following relationship holds:

Total sum of squares (SST) = regression sum of squares (SSR) + error sum of squares (SSE)

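Sums of Squares (2/2)

The analysis of variance identity has the form

\[
\sum_{i=1}^{n} (y_i - \bar{y})^2 = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2 + \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2
\]

$r^2$ is also called the proportion of the variance in y explained by regression:

\[
r^2 = \frac{\sum_{i=1}^{n} (y_i - \bar{y})^2 - \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}
\]
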
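The analysis of variance identity and the coefficient of determination can be checked with a short sketch. The function name `sums_of_squares` and the data are hypothetical, and the regression sum of squares is computed here directly from the fitted values rather than as a difference, so that the identity is actually tested.

```python
def sums_of_squares(xs, ys):
    """Return (SST, SSR, SSE) for the least-squares line through the data."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar

    fitted = [intercept + slope * x for x in xs]
    sst = sum((y - y_bar) ** 2 for y in ys)                 # total
    ssr = sum((f - y_bar) ** 2 for f in fitted)             # regression
    sse = sum((y - f) ** 2 for y, f in zip(ys, fitted))     # error
    return sst, ssr, sse


# Hypothetical illustrative data (not from the slides).
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
sst, ssr, sse = sums_of_squares(xs, ys)
r_squared = (sst - sse) / sst
print(sst, ssr, sse, r_squared)
```

For these points SST is 6, SSR is about 3.6, and SSE is about 2.4, so $r^2$ is about 0.6: the line accounts for roughly 60% of the variation in y.
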
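Uncertainties in the Least-Squares Coefficients

Assumptions for Errors in Linear Models

\[
y_i = \beta_0 + \beta_1 x_i + \varepsilon_i
\]

In the simplest situation, the following assumptions are satisfied:
1. The errors $\varepsilon_1, \ldots, \varepsilon_n$ are random and independent. In particular, the magnitude of any error $\varepsilon_i$ does not influence the value of the next error $\varepsilon_{i+1}$
2. The errors $\varepsilon_1, \ldots, \varepsilon_n$ all have mean 0
3. The errors $\varepsilon_1, \ldots, \varepsilon_n$ all have the same variance, which we denote by $\sigma^2$ (variance of the error)
4. The errors $\varepsilon_1, \ldots, \varepsilon_n$ are normally distributed
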
Distribution of $y_i$

In the linear model $y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$, under assumptions 1 through 4, the observations $y_1, \ldots, y_n$ are independent random variables that follow the normal distribution. The mean and variance of $y_i$ are given by

\[
\mu_{y_i} = l_i = \beta_0 + \beta_1 x_i, \qquad \sigma_{y_i}^2 = \sigma^2
\]

The slope $\beta_1$ represents the change in the mean of y associated with an increase of one unit in the value of x.

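Distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$ (1/2)

Under assumptions 1 through 4, the quantities $\hat{\beta}_0$ and $\hat{\beta}_1$ are normally distributed random variables, because each is a linear combination of the $y_i$:

\[
\hat{\beta}_1
= \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
= \frac{\sum_{i=1}^{n} (x_i - \bar{x})\, y_i - \bar{y} \sum_{i=1}^{n} (x_i - \bar{x})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}
= \sum_{i=1}^{n} \left[ \frac{x_i - \bar{x}}{\sum_{j=1}^{n} (x_j - \bar{x})^2} \right] y_i,
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
\]

After further manipulation, we have

\[
\mu_{\hat{\beta}_0} = \beta_0, \qquad \mu_{\hat{\beta}_1} = \beta_1
\]

so $\hat{\beta}_0$ and $\hat{\beta}_1$ are unbiased estimates.
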
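Unbiasedness can be illustrated with a small simulation: generate many data sets from a known line with normal errors, fit each one, and average the estimates. Everything specific below is an assumption made for illustration (the true coefficients 1.0 and 2.0, error standard deviation 1.0, the x values 0 through 9, 2000 replications, and the function names).

```python
import random


def fit_line(xs, ys):
    """Least-squares intercept and slope."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    return y_bar - slope * x_bar, slope


def average_estimates(beta0, beta1, sigma, xs, n_reps, seed=0):
    """Average the fitted intercept and slope over n_reps simulated data sets."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    total_b0 = 0.0
    total_b1 = 0.0
    for _ in range(n_reps):
        # y_i = beta0 + beta1 * x_i + eps_i with eps_i ~ Normal(0, sigma^2)
        ys = [beta0 + beta1 * x + rng.gauss(0.0, sigma) for x in xs]
        b0, b1 = fit_line(xs, ys)
        total_b0 += b0
        total_b1 += b1
    return total_b0 / n_reps, total_b1 / n_reps


# Assumed true model for the simulation (illustration only).
mean_b0, mean_b1 = average_estimates(1.0, 2.0, 1.0, list(range(10)), 2000)
print(mean_b0, mean_b1)
```

Individual fits scatter around the true values, but the averages should land close to 1.0 and 2.0, which is what unbiasedness means in practice.
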
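Distributions of $\hat{\beta}_0$ and $\hat{\beta}_1$ (2/2)

The standard deviations of $\hat{\beta}_0$ and $\hat{\beta}_1$ are

\[
\sigma_{\hat{\beta}_0} = \sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum_{i=1}^{n} (x_i - \bar{x})^2}},
\qquad
\sigma_{\hat{\beta}_1} = \frac{\sigma}{\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}}
\]

Since $\sigma$ is unknown, they are estimated by replacing $\sigma$ with s, giving $s_{\hat{\beta}_0}$ and $s_{\hat{\beta}_1}$, where

\[
s = \sqrt{\frac{\sum_{i=1}^{n} e_i^2}{n-2}} = \sqrt{\frac{(1 - r^2) \sum_{i=1}^{n} (y_i - \bar{y})^2}{n-2}}
\]

s is an estimate of the error standard deviation $\sigma$.
The residuals tend to be a little smaller than the errors, which is why the divisor is n-2 rather than n.
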
Notes

1. Since there is a measure of the variation of x in the denominator of both of the uncertainties we just defined, the more spread out the x's are, the smaller the uncertainties in $\hat{\beta}_0$ and $\hat{\beta}_1$
2. Use caution: if the range of x values extends beyond the range where the linear model holds, the results will not be valid
3. The quantities $(\hat{\beta}_0 - \beta_0)/s_{\hat{\beta}_0}$ and $(\hat{\beta}_1 - \beta_1)/s_{\hat{\beta}_1}$ have Student's t distribution with n-2 degrees of freedom

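Confidence Intervals for $\beta_0$ and $\beta_1$

Level $100(1-\alpha)\%$ two-sided confidence intervals for $\beta_0$ and $\beta_1$ are given by

\[
\hat{\beta}_0 \pm t_{n-2,\,\alpha/2}\, s_{\hat{\beta}_0}
\qquad \text{and} \qquad
\hat{\beta}_1 \pm t_{n-2,\,\alpha/2}\, s_{\hat{\beta}_1}
\]
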
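Putting the pieces together, the standard errors and confidence intervals can be sketched as below. The function name and data are hypothetical. The critical value is passed in as a parameter because the Python standard library has no Student's t quantile function; the value 3.182 used here is the tabulated $t_{3,\,0.025}$ for the n - 2 = 3 degrees of freedom of this five-point example.

```python
import math


def coefficient_intervals(xs, ys, t_crit):
    """Estimates, standard errors, and two-sided intervals for beta0 and beta1.

    t_crit is the critical value t_{n-2, alpha/2}, taken from a t table.
    Returns two tuples: (estimate, standard_error, (lower, upper)).
    """
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b1 = s_xy / s_xx
    b0 = y_bar - b1 * x_bar

    # Estimate of the error standard deviation from the residuals.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))

    se_b0 = s * math.sqrt(1 / n + x_bar ** 2 / s_xx)
    se_b1 = s / math.sqrt(s_xx)

    ci_b0 = (b0 - t_crit * se_b0, b0 + t_crit * se_b0)
    ci_b1 = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
    return (b0, se_b0, ci_b0), (b1, se_b1, ci_b1)


# Hypothetical illustrative data (not from the slides); n = 5, so 3 degrees of freedom.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
intercept_result, slope_result = coefficient_intervals(xs, ys, t_crit=3.182)
print(intercept_result)
print(slope_result)
```

With so few points the 95% interval for the slope runs from about -0.30 to about 1.50 and therefore includes zero, which shows how wide these intervals become when n is small.
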
Summary

We discussed:
- Correlation
- Least-squares line / regression
- Uncertainties in the least-squares coefficients
- Confidence intervals (and hypothesis tests) for least-squares coefficients