Correlation and Covariance
Tom Ilvento
FREC 9

What is Next? Correlation and Regression

Regression
- We specify a dependent variable as a linear function of one or more independent variables, based on co-variance.
- Regression provides estimates of the relationship between the dependent variable and the independent variable(s) via an equation of a line.
- The estimates, called coefficients, can be based on a sample and can be tested via a hypothesis test or confidence interval.
[Figure: scatterplot with a fitted regression line]

Correlation
- A measure of association between two variables.
- Expressed as a linear relationship.
- Based on the co-variance - how two variables vary about their means together.
- Can be shown in a visual way via a scatterplot.
[Figure: bivariate fit scatterplot annotated with r]

Correlation and Regression
- A focus on the variance:
  $$\sum (X - \bar{X})^2 = TSS \quad \text{(Total Sum of Squares Deviations)}$$
  $$\frac{\sum (X - \bar{X})^2}{n - 1} = MS \quad \text{(Mean Squared Deviation)}$$
- A focus on the co-variance:
  $$Cov_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n}$$
- A focus on the equation of a line: Y = a + b*X, where a is the intercept and b is the slope.

Let's Revisit the Variance
- In statistics we are interested in how a variable varies about its mean.
- We represented this as the Variance - the Mean Squared Deviation.
  $$\sum (X - \bar{X})^2 = TSS \quad \text{(Total Sum of Squares Deviations)}$$
  $$\frac{\sum (X - \bar{X})^2}{n - 1} = MS \quad \text{(Mean Squared Deviation)}$$

Basics of Co-Variance
- Let's start with a basic graph of a Y variable vs. an X variable.
- I will dissect the graph with the mean of X and the mean of Y.
[Figure: plot area split into four quadrants by a vertical line at the X-mean and a horizontal line at the Y-mean]
- Quadrant I: above the Y-mean, above the X-mean
- Quadrant II: above the Y-mean, below the X-mean
- Quadrant III: below the Y-mean, below the X-mean
- Quadrant IV: below the Y-mean, above the X-mean

The Co-Variance
- The covariance looks at how two variables, X and Y, vary about their means together.
- We express it as an average, divided by n (not n - 1).
  $$Cov_{XY} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n} = \frac{SS_{XY}}{n}$$
- The covariance is a basic building block of correlation, regression, and the General Linear Model.

Basics of Co-Variance
[Figure: the four quadrants around the X-mean and Y-mean]
- Values that tend to fall in quadrants II and IV reflect negative covariance.

Basics of Co-Variance
[Figure: the four quadrants around the X-mean and Y-mean]
- Values that tend to fall in quadrants I and III reflect positive covariance.

States with the Smartest Kids
- Here are distribution statistics for math at two grade levels.
[Distribution output for the two math variables: quantiles, moments (mean, standard deviation, standard error of the mean, upper and lower confidence limits for the mean, N, variance, skewness, kurtosis, CV, N missing), and stem-and-leaf plots; the numeric values are not recoverable]

States with the Smartest Kids
- This is some data on states plus Washington D.C. and army bases overseas.
- The key variables are the percent of students in 9 who scored at an advanced level or higher for math at two grade levels and for reading at two grade levels.
- Any thoughts on this data?

Smartest Kids Data: Covariance of the Two Math Variables
- Most of the data points fall into quadrants I and III.
- Positive co-variance.
- As the percent for one grade increases, so does the percent for the other grade.
[Figure: bivariate fit scatterplot of the two math variables, with the fitted mean line]
[Covariance matrix of the two math variables; values not recoverable]
- The numbers on the diagonal are the variances - the covariance of a variable with itself is the variance.
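
The covariance calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data values are made up (they are not the Smartest Kids data), the helper name `covariance` is hypothetical, and NumPy is assumed to be available. The `ddof` argument switches between dividing by n, as these slides do, and dividing by n - 1.

```python
import numpy as np

def covariance(x, y, ddof=0):
    """Average cross-product of deviations about the means.

    ddof=0 divides by n (the convention stated in these slides);
    ddof=1 divides by n - 1 (the usual sample estimate).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.shape != y.shape:
        raise ValueError("x and y must have the same number of observations")
    ss_xy = np.sum((x - x.mean()) * (y - y.mean()))  # SS_XY
    return ss_xy / (x.size - ddof)

# Hypothetical illustration data
x = np.array([2.0, 4.0, 6.0, 8.0])
y = np.array([1.0, 3.0, 2.0, 6.0])

cov_xy = covariance(x, y)

# 2 x 2 covariance matrix; each row of the stacked array is one variable
cov_matrix = np.cov(np.vstack([x, y]), ddof=0)

print(cov_xy)
print(cov_matrix)
```

The diagonal of `cov_matrix` holds the variances of `x` and `y`, and the off-diagonal entries both equal `cov_xy`, which is the symmetry the slides point out.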

Shortcomings of Co-Variance
- The covariance between two variables is a useful concept; it is the building block for regression and other multivariate techniques.
- But as a measure of association it has limits:
  - It is symmetrical - not a bad thing.
  - It is unbounded: there is no known high or low value.
  - It is difficult to determine what the value represents - a lot? a little? just how much?
  - It is expressed in awkward cross-product units.
[Covariance matrix and bivariate fit scatterplot for the Smartest Kids data; values not recoverable]

Pearson Correlation Coefficient - r
- The correlation coefficient (r) is the co-variance adjusted for the standard deviations of both variables.
- The adjustment is simple, and it makes r so much easier to interpret.
  $$r = \frac{Cov_{XY}}{s_X s_Y}$$
  $$r = \frac{\sum (X - \bar{X})(Y - \bar{Y})}{\sqrt{\sum (X - \bar{X})^2 \sum (Y - \bar{Y})^2}}$$
  $$r = \frac{SS_{XY}}{\sqrt{SS_X \, SS_Y}}$$

Properties of r
- Correlation coefficient r is based on a linear measure of association.
- Bounded between -1 and 1.
- Symmetrical relationship: $r_{XY} = r_{YX}$.
- Easier to interpret.
- Invariant to linear scaling: adding or subtracting a constant, or multiplying or dividing by a (positive) constant, does not change the value of r between two variables.
- Example: the correlation between the respondent's education and income does not change if you express income in total dollars or in units of some fixed dollar amount.
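
A quick way to convince yourself of the scaling property is to compute r before and after rescaling. The sketch below is illustrative only: the education and income values are invented, `pearson_r` is a hypothetical helper written from the SS formula above, and NumPy is assumed.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: SS_XY / sqrt(SS_X * SS_Y)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return np.sum(dx * dy) / np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))

# Hypothetical data: years of education and income in dollars
education = np.array([10.0, 12.0, 12.0, 14.0, 16.0, 18.0])
income_dollars = np.array([21000.0, 30000.0, 27000.0, 41000.0, 52000.0, 60000.0])

r_dollars = pearson_r(education, income_dollars)
r_rescaled = pearson_r(education, income_dollars / 1000.0)       # change of units
r_shifted = pearson_r(education + 5.0, 3.0 * income_dollars - 100.0)  # shift and scale
r_reversed = pearson_r(income_dollars, education)                # swap X and Y

print(r_dollars, r_rescaled, r_shifted, r_reversed)
```

All four values agree. One caveat worth keeping in mind: multiplying one variable by a negative constant keeps the magnitude of r but flips its sign.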

Interpretation of r
- The closer the correlation is to 1, the more perfect the positive linear relationship.
  - If r = 1, then all values would fall on a straight line with an upward slope.
- The closer the correlation is to -1, the more perfect the negative linear relationship.
  - If r = -1, then all values would fall on a straight line with a downward slope.
- The scatterplot is a visual depiction of the correlation coefficient.

Interpretation of r
- r = 0 means no linear relationship.
[Figure: Non-Linear Relationship with a Near-Zero r]

Scatter Plot of Crab Force by Height
- The correlation is strong and positive.
[Figure: bivariate fit scatterplot of Force by Height]

Interpretation of r
- One other interesting interpretation of r: the square of r is equal to R-square, a measure of association in regression.
  - This holds only in the case of a bivariate regression - one independent variable.
  - And it moves us toward defining one variable as explaining the other.
- This means that r² can be interpreted as the percent of variability in a variable that is explained by the other variable.
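
Two of the points above are easy to check numerically: that r² matches R-square from a one-variable regression, and that a clearly non-linear relationship can still give r near zero. The following is a sketch with made-up numbers, assuming NumPy; the least-squares line comes from `np.polyfit`, which is one convenient choice and not necessarily what the lecture software uses.

```python
import numpy as np

# Hypothetical data with a roughly linear relationship
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 2.9, 4.2, 4.8, 6.1, 6.8])

r = np.corrcoef(x, y)[0, 1]

# Bivariate least-squares line Y = a + b*X (polyfit returns slope first)
slope, intercept = np.polyfit(x, y, 1)
fitted = intercept + slope * x

sse = np.sum((y - fitted) ** 2)     # unexplained variation
tss = np.sum((y - y.mean()) ** 2)   # total sum of squares
r_squared = 1.0 - sse / tss         # R-square from the regression

print(r ** 2, r_squared)

# A perfect but non-linear (U-shaped) relationship on symmetric X values
x_sym = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0])
y_curve = x_sym ** 2
r_curve = np.corrcoef(x_sym, y_curve)[0, 1]

print(r_curve)
```

The first pair of numbers agrees, while `r_curve` is essentially zero even though Y is completely determined by X. That is why a scatterplot should accompany any correlation.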

Interpreting a Correlation Coefficient: Rules of Thumb for Narratives
- The following table gives guidelines for narratives involving correlations.
- For simplicity's sake, the table is based on the absolute value of the correlation (|r|).
- The exact description depends upon the subject and discipline.

Correlation Range    Percent Variability Explained (r²)    Description
. to .               to %                                  Weak
. to .9              % to %                                Moderate
. to .7              % to %                                Moderately Strong
.7 to .              7% to %                               Strong

Some Pointers in Correlation and Covariance
- Correlation and co-variance require the number of observations to be the same for all variables; cases with missing values are excluded. With Excel, this is even more of a problem.
- Try to put the variable you are most interested in (i.e., the dependent variable) in the first column. Then you read down the first column to find the relationship of the dependent variable with the other variables.
- Reading the correlation between other variables requires you to move across rows and down columns.
- Getting the covariance and correlation in Excel is easy:
  Tools > Data Analysis > Covariance or Correlation > Input Range (click to the right and grab the data - in all four columns including labels) > Grouped by Columns > Labels in first row (Yes)
- In JMP you need to go to Multivariate Methods > Multivariate, list the variables, and click the Hot Point to get correlations or covariances.

Omni-Bar Study
- You are the marketing manager for OmniFoods and you are planning a nation-wide introduction of an energy bar, OmniPower.
- The bar was first marketed to high-end athletes and mountain climbers, but now is more popular with the general public.
- The company wants to test market the bars and determine the effect of price and in-store promotions on the sales of the bars.
- They design a study and test OmniPower in a sample of stores in a supermarket chain.
- The dependent variable is Sales in dollars.
- The independent variables are price and promotion. Whole values have been carefully chosen for the study.
  - Price in three levels: $.9, $.79, and $.99
  - In-store promotion in three dollar levels

A Closer Look at Sales
[Distribution output for sales: quantiles, moments (mean, standard deviation, standard error of the mean, upper and lower confidence limits for the mean, N, variance, skewness, kurtosis, CV, N missing), and a stem-and-leaf plot; the numeric values are not recoverable]
- The median is considerably higher than the mean level of sales.
- A fair amount of spread in the data, as shown by the CV and the standard deviation.

Covariance and Correlation
[Covariance matrix of sales, PRICE, and PROMOTION; values not recoverable]
[Correlation matrix of sales, PRICE, and PROMOTION; values not recoverable]
The correlations are estimated by REML method.
Note:
- Covariance is difficult to interpret.
- Correlations are relatively straightforward.
- Little correlation between Price and Promotion.

Let's Look at the Relationship of Sales with Price and Sales with Promotion
- Sales by Price has a downward-sloping relationship.
  - As Price goes up, sales go down - negative covariance.
  - It looks linear and moderately strong.
- Sales by Promotion has an upward-sloping relationship.
  - As Promotion goes up, sales go up - positive covariance.
  - It looks linear and moderately strong.
[Figure: bivariate fit of sales by PRICE]
[Figure: bivariate fit of sales by PROMOTION]

It is an Easy Step to Regression
[Figure: bivariate fits of sales by PRICE and by PROMOTION]

Real Life Correlation Example
- Client: Nicholas Hidell, Quip Laboratories
- The company had two ways to measure how clean the labs were: CFU and RLU.
  - One was more expensive and preferred by the company.
  - The other was cheaper and preferred by the client.
- They wanted to show the client that the two measures were not the same.

Let's Look at an Example of Correlation and Covariance
[Distribution output for RATING, SALARY, and YEARS (quantiles and moments) and a frequency table for ORIGIN with levels Inside Company and Outside Company; the numeric values are not recoverable]
- The following is some data about mid-level managers in a company. The variables are:
  - RATING, a rating scale of the managers;
  - SALARY, the salary of the manager in dollars;
  - YEARS, years of service at the company;
  - ORIGIN, a dummy variable indicating whether they were promoted inside the company or were recruited from outside the company.
- At this point we won't worry about a dependent or independent variable.

The Covariance Matrix for Manager Ratings Data
- The covariance matrix has the variances on the diagonal (population variance based on n) and the co-variances on the off-diagonal.
- It is a symmetric matrix.
[Covariance matrix of RATING, SALARY, YEARS, and ORIGIN; values not recoverable]

The Correlation Matrix for Manager Ratings Data
- The covariances are standardized to fall between -1 and 1.
- The diagonal is now 1 - a variable is perfectly correlated with itself.
- It is a symmetrical matrix.
[Correlation matrix of RATING, SALARY, YEARS, and ORIGIN; values not recoverable]

Interpretation of Manager Ratings Data
- There is a moderately strong positive relationship between SALARY and RATING - those that get higher salaries tend to have higher ratings.
- Almost no relationship between YEARS in the company and the RATING (r =.77), but there is a negative relationship between YEARS and SALARY.
[Figure: bivariate fit of SALARY by RATING]
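
The step from a covariance matrix to a correlation matrix can be sketched as follows. This is an illustration, not the managers data: the four columns are randomly generated stand-ins, `cov_to_corr` is a hypothetical helper name, and NumPy is assumed. Each covariance is divided by the product of the two standard deviations taken from the diagonal.

```python
import numpy as np

def cov_to_corr(cov):
    """Standardize a covariance matrix: r_ij = cov_ij / (s_i * s_j)."""
    cov = np.asarray(cov, dtype=float)
    sd = np.sqrt(np.diag(cov))       # standard deviations from the diagonal
    return cov / np.outer(sd, sd)

# Hypothetical data: 30 cases (rows) by 4 variables (columns)
rng = np.random.default_rng(0)
data = rng.normal(size=(30, 4))

cov = np.cov(data, rowvar=False)     # columns are the variables
corr = cov_to_corr(cov)

print(np.round(corr, 3))
```

The result has 1s on the diagonal, is symmetric, and matches what `np.corrcoef(data, rowvar=False)` returns directly. The choice of n or n - 1 in the covariance does not matter here because it cancels in the ratio.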