Can we take the Mysticism Out of the Pearson Coefficient of Linear Correlation?

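Introduction

As the title of this tutorial indicates, our purpose is to engender a clear understanding of the Pearson coefficient of linear correlation in students' minds, to dilute the mysticism of the formula to a minimum, and to help students feel at home with applications of the formula.

Consider the set of data:

$$D = \{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}.$$

The Pearson coefficient of linear correlation is given by the formula:

$$r = \frac{\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{s_x\, s_y}, \tag{1}$$

where

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}, \qquad \bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n},$$

$$s_x = \sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}} = \text{standard deviation of } x, \qquad s_y = \sqrt{\frac{\sum_{i=1}^{n}(y_i-\bar{y})^2}{n-1}} = \text{standard deviation of } y.$$

For convenience of reference, in this tutorial we name the quantity

$$\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) \tag{2}$$

the total covariation (called the total covariance in what follows), the point $(\bar{x}, \bar{y})$ the average point, the quantity $\sum_{i=1}^{n}(x_i-\bar{x})^2$ the x-variance, and the quantity $\sum_{i=1}^{n}(y_i-\bar{y})^2$ the y-variance.

From Equation (1), we have:

$$r = \frac{\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\dfrac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}}\;\sqrt{\dfrac{\sum_{i=1}^{n}(y_i-\bar{y})^2}{n-1}}}.$$

Hence, the coefficient r can be written in the form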

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}. \tag{3}$$

The formula can also be given as

$$r = \frac{\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{s_x\, s_y} = \frac{\text{average covariance of } x \text{ and } y}{(\text{standard deviation of } x)(\text{standard deviation of } y)},$$

where we have termed the quantity

$$\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{n-1}$$

the average covariance.

Now, we will explore how a mathematician could have thought when she was constructing this formula for the first time in history.

An Example of a Set of Points with Linear Orientation

This formula appears mystical to many students. They just know it has to work; they have no idea why it works the way it does. The students feel that the formula arrived in the textbooks of mathematics by autogenesis. In this tutorial, we try to look at how the derivation of this formula could have been motivated in some mathematician's mind. In order to do so, we will look at the following set of points, which has some linear orientation.
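As a numerical aside, the claim that Equations (1) and (3) describe the same quantity can be checked on a small data set. The sketch below is only an illustration: the function names `pearson_eq1` and `pearson_eq3` and the sample data are assumptions made for this example and are not part of the original tutorial.

```python
import math

def pearson_eq1(xs, ys):
    """r via Equation (1): average covariance divided by s_x * s_y (divisor n - 1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    total_cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return (total_cov / (n - 1)) / (s_x * s_y)

def pearson_eq3(xs, ys):
    """r via Equation (3): total covariance over the square roots of the x- and y-variance."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    total_cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    x_var = sum((x - x_bar) ** 2 for x in xs)  # "x-variance" in the tutorial's sense
    y_var = sum((y - y_bar) ** 2 for y in ys)  # "y-variance" in the tutorial's sense
    return total_cov / (math.sqrt(x_var) * math.sqrt(y_var))

# Hypothetical, nearly linear sample data (illustration only).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

r1 = pearson_eq1(xs, ys)
r3 = pearson_eq3(xs, ys)
print(r1, r3)  # the two values agree up to rounding error
```

The two functions differ only in where the divisor n - 1 appears, which is why the results coincide.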

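In this data set of $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$,

$$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n} \quad \text{and} \quad \bar{y} = \frac{y_1 + y_2 + \cdots + y_n}{n}.$$

The average point, $(\bar{x}, \bar{y})$, has been marked in blue in the diagram above. Also, for one data point, $(x_i, y_i)$, the covariance components are marked in the diagram, in pink and green, respectively. For this point, both components are positive, so the covariance product is

$$\underbrace{(x_i-\bar{x})}_{>0}\,\underbrace{(y_i-\bar{y})}_{>0} > 0.$$

Also, for another data point, $(x_j, y_j)$, the covariance components are marked in the diagram. For this point, both components are negative, so the covariance product is

$$\underbrace{(x_j-\bar{x})}_{<0}\,\underbrace{(y_j-\bar{y})}_{<0} > 0.$$

That is, both $(x_i-\bar{x})(y_i-\bar{y})$ and $(x_j-\bar{x})(y_j-\bar{y})$ are positive. This fact is valid for all the other data points in the set. This happens due to the orientation of the points as a whole. More explicitly, the points are situated only in QI and QIII with respect to the axes through the point $(\bar{x}, \bar{y})$. Therefore, both covariance components of each point have the same sign, the covariance product for each of the points is positive, and the total covariance is positive.

Also, note that if we add more and more points to just Quadrants I and III, in such a manner that the average point is still $(\bar{x}, \bar{y})$, then the total covariance becomes larger and larger, without limit.

Another Set of Points Which Has Points in the Other Two Quadrants

To further see this, consider the following diagram.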
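The sign argument above can be illustrated with a few lines of Python. This is a minimal sketch under stated assumptions: the point coordinates and the helper name `covariance_products` are hypothetical, chosen so that every point lies in QI or QIII relative to the average point (4, 5.5).

```python
def covariance_products(points):
    """Return the list of products (x - x_bar) * (y - y_bar), one per point."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    return [(x - x_bar) * (y - y_bar) for x, y in points]

# Hypothetical points, all in QI or QIII relative to the average point (4, 5.5).
points = [(1, 2), (2, 3), (3, 5), (5, 6), (6, 8), (7, 9)]
products = covariance_products(points)
total = sum(products)
print(products, total)

# Add two more points, one in QIII and one in QI, chosen so that the
# average point stays at (4, 5.5); the total covariance grows.
extended = points + [(0, 1.5), (8, 9.5)]
extended_total = sum(covariance_products(extended))
print(extended_total)
```

Every product in the first list is positive, and the second total is larger than the first, in line with the observation that adding points to QI and QIII increases the total covariance.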

In this diagram, we have added two new points to each of the quadrants QII and QIV, in such a manner that the average point remains the same. The covariance product of each of these new points is negative, since the two components of each product have different signs. Consequently, the total sum of the covariance products,

$$\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}),$$

and the average covariance,

$$\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{n-1},$$

become smaller.

Now imagine that we keep adding points to QII and QIV, in such a manner that the average point, $(\bar{x}, \bar{y})$, remains the same. Because of this, two things will happen. First, the total covariance and the average covariance become smaller and smaller. Second, the points, as a whole, become more and more spread out and less and less linear.

Hence, if the magnitude of the total covariance (just the quantity without the sign) is close to zero, the points are spread around the average point, $(\bar{x}, \bar{y})$, almost symmetrically, and the points, as a whole, will not lie close to a straight line.

We further observe that

$$r = \frac{\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{s_x\, s_y} = 0 \iff \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = 0.$$

That is, the value of r being zero has nothing to do with the values of any of the divisors, $n-1$, $s_x$ or $s_y$. We will look at the role of these divisors later. By looking at the following diagram, we can clearly see how the scenario works out when

$$r = \frac{\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{s_x\, s_y} = 0.$$

Notice that the X-axis has not been shown in the diagram below. Again the point $(\bar{x}, \bar{y})$ is the average point of the data set. Recall, and convince yourself, that the average of the set of all data points on a circle is its centre. The subset of points, $\{T, A, B, \ldots, C', B', A'\}$, under our consideration has been chosen from the circle in such a manner that the average point is the centre of the circle.

A Set of Points Strategically Situated On A Circle

We can quickly verify that the value of r = 0, since the total covariance is zero. For instance, each of the points $T$ and $T'$ has a zero covariance product, since these points are on the line $x = \bar{x}$ through the point $(\bar{x}, \bar{y})$. Also, the sum of the covariance products for the two points $D$ and $D'$ is zero due to the symmetry of the circle. Explicitly, the y covariance components for both $D$ and $D'$ are the same (as $DV = D'V'$), and the x covariance components for $D$ and $D'$ have the same magnitude with different signs, since $x - \bar{x}$ for $D$ is $UV$ and $x - \bar{x}$ for $D'$ is $UV' = -UV$. Therefore, the sum of these covariance products is

$$D'V' \cdot UV' + DV \cdot UV = -DV \cdot UV + DV \cdot UV = 0.$$

All the points can be paired in this manner. Therefore, the total covariance is 0. Consequently, the value of r is also 0.

Summary of Our Observations So Far

(i) The more nonlinear the set of points is, the smaller the magnitude of the total covariance (when the sign is stripped off).
(ii) The more linear the set of points, the bigger the magnitude of the total covariance becomes.
(iii) In a linear-like set with a negative orientation, more and more points means that we add more negative quantities than positive quantities.
(iv) In a linear-like set with a positive orientation, more and more points means that we add many more positive quantities than negative quantities.
(v) The features (iii) and (iv) and our examples show that the size of the total covariance can be huge for a large linear data set.

Now, we calculate the total covariance for a set of points lying on a straight line. This might help us to solve the central mystery about the formula in Equation (1). The motivation to follow this path is not accidental or random. It comes from the summary above about the total covariance, $\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$, of Equation (2).

Total Covariance of A Set Of Linear Points

Consider the linear set of data of the linear relation $y = mx + c$:

$$\{(x_1, mx_1 + c), (x_2, mx_2 + c), \ldots, (x_n, mx_n + c)\}.$$

Then,

$$\begin{aligned}
\bar{y} &= \frac{(mx_1 + c) + (mx_2 + c) + \cdots + (mx_n + c)}{n} \\
&= \frac{(mx_1 + mx_2 + \cdots + mx_n) + (c + c + \cdots + c)}{n} && \text{(collect the } x\text{'s and the } c\text{'s)} \\
&= \frac{m(x_1 + x_2 + \cdots + x_n) + nc}{n} && \text{(take the common factor } m\text{)} \\
&= m\,\frac{x_1 + x_2 + \cdots + x_n}{n} + \frac{nc}{n} && \text{(apply the distributive law of division)} \\
&= m\bar{x} + c.
\end{aligned}$$
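The averaging step above can also be checked numerically. The following is a small sketch, assuming a hypothetical gradient `m`, intercept `c` and x-values picked only for illustration.

```python
# Hypothetical line y = m*x + c and x-values (illustration only).
m, c = -1.5, 4.0
xs = [0.0, 1.0, 2.5, 4.0, 7.5]
ys = [m * x + c for x in xs]  # every point lies exactly on the line

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

# The average of the y-values should equal m * x_bar + c.
print(y_bar, m * x_bar + c)
```

Both printed numbers coincide, so the average of the y-values equals the value of the line at the average of the x-values.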

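Therefore, $\bar{y} = m\bar{x} + c$. This means that the point $(\bar{x}, \bar{y})$ lies on the straight line $y = mx + c$. Now, we can calculate the total covariance:

$$\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = \sum_{i=1}^{n}(x_i-\bar{x})(mx_i + c - m\bar{x} - c) = \sum_{i=1}^{n}(x_i-\bar{x})\,m(x_i-\bar{x}) = m\sum_{i=1}^{n}(x_i-\bar{x})^2.$$

This result almost resolves the mystery of the formula: if we divide the total covariance, $m\sum_{i=1}^{n}(x_i-\bar{x})^2$, by the x-variance, $\sum_{i=1}^{n}(x_i-\bar{x})^2$, then we just get the gradient of the linear relation. We can improve this further if we divide the total covariance by the product of the square roots of the x-variance and the y-variance. In this case, using Equation (3) and $y_i - \bar{y} = m(x_i - \bar{x})$, we obtain

$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} = \frac{m\sum_{i=1}^{n}(x_i-\bar{x})^2}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\;\sqrt{m^2\sum_{i=1}^{n}(x_i-\bar{x})^2}} = \frac{m\sum_{i=1}^{n}(x_i-\bar{x})^2}{\sqrt{m^2}\,\sum_{i=1}^{n}(x_i-\bar{x})^2} = \frac{m}{|m|},$$

so that

$$r = \frac{m}{|m|} = \begin{cases} 1, & \text{if } m > 0, \\ \text{undefined}, & \text{if } m = 0, \\ -1, & \text{if } m < 0. \end{cases}$$

Here, we have used the fact that, for a real number $m$, $\sqrt{m^2}$ is not simply $m$ or $-m$, but $\sqrt{m^2} = |m|$. To see this, note that $\sqrt{2^2} = 2$ and $\sqrt{(-2)^2} = 2$ (not $-2$). That is,

$$|m| = \begin{cases} m, & \text{if } m \ge 0, \\ -m, & \text{if } m < 0. \end{cases}$$

The work above resolves the mystery of the formula, with the exception of the mystery in the use of the divisor $n-1$ and the terms $s_x$ and $s_y$. To resolve this issue, we again look at Equation (3):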

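$$r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} = \frac{\dfrac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})^2}\;\sqrt{\dfrac{1}{n}\sum_{i=1}^{n}(y_i-\bar{y})^2}} \quad \text{(the averages of the variances are taken over } n\text{)}$$

$$r = \frac{\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2}\;\sqrt{\dfrac{1}{n-1}\sum_{i=1}^{n}(y_i-\bar{y})^2}} \quad \text{(the averages of the variances are taken over } n-1\text{)}$$

That is,

$$r = \frac{\dfrac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{s_x\, s_y} \quad \text{or} \quad r = \frac{\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{s_x\, s_y},$$

depending on whether the averages of the variances (inside $s_x$ and $s_y$) are taken over $n$ or $n-1$.

Whether the averages of the variances are taken over $n$ or $n-1$ has no consequence for the value of r, provided the same divisor is used throughout. However, when taking averages, the use of $n-1$ is preferred due to another statistical concept, which we do not discuss in this tutorial. Then why do we write the linear correlation formula in the form given in Equation (1)? The reason is that it involves the standard deviations of the variables rather than just the square roots of the totals of the squared variations.

Summary of Features Of r

So far, we have proved that if a set of points is linear (with nonzero gradient), then $r = \pm 1$. We have also argued that the larger the size of r, the more linear the data set is, and the smaller the size of r, the more nonlinear the data tends to be. We have constructed (not just derived) this formula so that it has the property that

$$-1 \le r \le 1,$$

where $r = -1$ means perfectly linear with negative gradient and $r = 1$ means perfectly linear with positive gradient.

Now, we have to prove the converse direction: if $r = \pm 1$ for a set of points, then the data entertains a linear relation.

Does r = ±1 Imply Linearity?

What we have proved up to this point is that if a set of points is linear then $r = \pm 1$. This raises the question: is the set of points linear if $r = \pm 1$?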
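Before turning to that question, the two facts established so far can be checked numerically: points on a line with gradient m give r = m/|m|, and the choice of divisor (n or n - 1) does not change r. The sketch below is an illustration only; the helper `pearson_r`, its `divisor` parameter and the sample values are assumptions made for this example.

```python
import math

def pearson_r(xs, ys, divisor):
    """r as average covariance over s_x * s_y, with the same divisor
    (n or n - 1) used for every average."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    total_cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_x = math.sqrt(sum((x - x_bar) ** 2 for x in xs) / divisor)
    s_y = math.sqrt(sum((y - y_bar) ** 2 for y in ys) / divisor)
    return (total_cov / divisor) / (s_x * s_y)

# Hypothetical x-values and two lines y = m*x + 3, one rising and one falling.
xs = [1.0, 2.0, 4.0, 7.0, 11.0]
n = len(xs)
results = {}
for m in (2.0, -0.5):
    ys = [m * x + 3.0 for x in xs]
    # First entry uses divisor n - 1, second entry uses divisor n.
    results[m] = (pearson_r(xs, ys, n - 1), pearson_r(xs, ys, n))
    print(m, results[m])
```

For the positive gradient both values are 1 up to rounding, and for the negative gradient both are -1, whichever divisor is used.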

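If $r = \pm 1$, then we have

$$\frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\;\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} = \pm 1.$$

Cross multiply and square both sides:

$$\left(\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})\right)^2 = \sum_{i=1}^{n}(x_i-\bar{x})^2 \sum_{i=1}^{n}(y_i-\bar{y})^2.$$

Say $a_i = x_i - \bar{x}$ and $b_i = y_i - \bar{y}$. Then

$$\left(\sum_{i=1}^{n} a_i^2\right)\left(\sum_{i=1}^{n} b_i^2\right) - \left(\sum_{i=1}^{n} a_i b_i\right)^2 = 0. \tag{4}$$

The formula is trivially true for n = 1. What happens if we have just two terms (n = 2)? Then,

$$(a_1^2 + a_2^2)(b_1^2 + b_2^2) - (a_1 b_1 + a_2 b_2)^2 = 0. \tag{5}$$

Expanding the left-hand side, using $(p + q)^2 = p^2 + 2pq + q^2$:

$$\begin{aligned}
(a_1^2 + a_2^2)(b_1^2 + b_2^2) - (a_1 b_1 + a_2 b_2)^2
&= a_1^2 b_1^2 + a_1^2 b_2^2 + a_2^2 b_1^2 + a_2^2 b_2^2 - a_1^2 b_1^2 - 2 a_1 b_1 a_2 b_2 - a_2^2 b_2^2 \\
&= a_1^2 b_2^2 - 2 a_1 b_2 a_2 b_1 + a_2^2 b_1^2 \\
&= (a_1 b_2 - a_2 b_1)^2.
\end{aligned}$$

Thus we have proved that

$$(a_1^2 + a_2^2)(b_1^2 + b_2^2) - (a_1 b_1 + a_2 b_2)^2 = (a_1 b_2 - a_2 b_1)^2. \tag{6}$$

That is, Equation (6) is true for any four real numbers $a_1$, $a_2$, $b_1$ and $b_2$. We could also have derived Equation (6) by using the difference of two squares on the LHS (left hand side) of the equation, and then manipulating the result.

We would like to investigate whether we can extend the formula in Equation (6) to n points. The suggested formula is given in Equation (*):

$$\left(\sum_{i=1}^{n} a_i^2\right)\left(\sum_{i=1}^{n} b_i^2\right) - \left(\sum_{i=1}^{n} a_i b_i\right)^2 = \sum_{1 \le i < j \le n} (a_i b_j - a_j b_i)^2; \quad n = 1, 2, \ldots \tag{*}$$

At this point, if you prefer, you may skip what follows up to the end of the derivation of Equation (11).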
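Readers who skip the proof may still want some evidence that Equation (*) holds. The following sketch compares both sides of Equation (*) on randomly chosen integers, so the comparison is exact; it is a spot check, not a proof. The function names `lhs` and `rhs`, the seed and the value ranges are assumptions made for this example.

```python
import random
from itertools import combinations

def lhs(a, b):
    """Left side of Equation (*): (sum a_i^2)(sum b_i^2) - (sum a_i b_i)^2."""
    return sum(x * x for x in a) * sum(y * y for y in b) - sum(x * y for x, y in zip(a, b)) ** 2

def rhs(a, b):
    """Right side of Equation (*): sum over i < j of (a_i b_j - a_j b_i)^2."""
    return sum((a[i] * b[j] - a[j] * b[i]) ** 2 for i, j in combinations(range(len(a)), 2))

random.seed(0)  # fixed seed so that the run is repeatable
checks = []
for n in range(1, 7):
    a = [random.randint(-9, 9) for _ in range(n)]
    b = [random.randint(-9, 9) for _ in range(n)]
    checks.append((n, lhs(a, b), rhs(a, b)))

for n, left, right in checks:
    print(n, left, right, left == right)
```

For n = 1 the sum on the right is empty and both sides are zero, which matches the remark that the formula is trivially true in that case.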

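Proving the Formula (*) For A Set of n Points

Now, we assume that the extension of Equation (6) given in Equation (*) is true for $a_1, a_2, \ldots, a_k$ and $b_1, b_2, \ldots, b_k$, up to some positive integer $k$. We already know the equation is true for n = 1 and 2. Hence, we have that

$$(a_1^2 + a_2^2 + \cdots + a_k^2)(b_1^2 + b_2^2 + \cdots + b_k^2) - (a_1 b_1 + a_2 b_2 + \cdots + a_k b_k)^2 = (a_1 b_2 - a_2 b_1)^2 + (a_1 b_3 - a_3 b_1)^2 + \cdots + (a_{k-1} b_k - a_k b_{k-1})^2. \tag{7}$$

Equation (7) can be written as:

$$\left(\sum_{i=1}^{n} a_i^2\right)\left(\sum_{i=1}^{n} b_i^2\right) - \left(\sum_{i=1}^{n} a_i b_i\right)^2 = \sum_{1 \le i < j \le n} (a_i b_j - a_j b_i)^2; \quad n = 1, 2, \ldots, k. \tag{8}$$

Again, I ask you to keep in mind that the result is trivially true for n = 1, and that we have proved the case n = 2 in Equation (6).

Now, we set $P = a_1^2 + a_2^2 + \cdots + a_k^2$, $Q = b_1^2 + b_2^2 + \cdots + b_k^2$ and $M = a_1 b_1 + a_2 b_2 + \cdots + a_k b_k$. With this terminology, we can write

$$(a_1^2 + a_2^2 + \cdots + a_k^2)(b_1^2 + b_2^2 + \cdots + b_k^2) - (a_1 b_1 + a_2 b_2 + \cdots + a_k b_k)^2 = PQ - M^2 = \sum_{1 \le i < j \le k} (a_i b_j - a_j b_i)^2. \tag{9}$$

Next, we consider the statement for n = k + 1:

$$(a_1^2 + \cdots + a_k^2 + a_{k+1}^2)(b_1^2 + \cdots + b_k^2 + b_{k+1}^2) - (a_1 b_1 + \cdots + a_k b_k + a_{k+1} b_{k+1})^2 = (P + a_{k+1}^2)(Q + b_{k+1}^2) - (M + a_{k+1} b_{k+1})^2. \tag{10}$$

Then, consider the expression on the right hand side of Equation (10):

$$\begin{aligned}
(P + a_{k+1}^2)(Q + b_{k+1}^2) - (M + a_{k+1} b_{k+1})^2
&= PQ + P b_{k+1}^2 + a_{k+1}^2 Q + a_{k+1}^2 b_{k+1}^2 - M^2 - 2M a_{k+1} b_{k+1} - a_{k+1}^2 b_{k+1}^2 \\
&= (PQ - M^2) + P b_{k+1}^2 + a_{k+1}^2 Q - 2M a_{k+1} b_{k+1} \\
&= \sum_{1 \le i < j \le k} (a_i b_j - a_j b_i)^2 + P b_{k+1}^2 + a_{k+1}^2 Q - 2M a_{k+1} b_{k+1} && \text{(by Equation (8))}.
\end{aligned}$$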

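Now,

$$\begin{aligned}
P b_{k+1}^2 + a_{k+1}^2 Q - 2M a_{k+1} b_{k+1}
&= (a_1^2 + \cdots + a_k^2)\, b_{k+1}^2 + a_{k+1}^2 (b_1^2 + \cdots + b_k^2) - 2(a_1 b_1 + \cdots + a_k b_k)\, a_{k+1} b_{k+1} \\
&= \underbrace{(a_1^2 b_{k+1}^2 - 2 a_1 b_{k+1} a_{k+1} b_1 + a_{k+1}^2 b_1^2)}_{\text{square of a difference}} + \cdots + \underbrace{(a_k^2 b_{k+1}^2 - 2 a_k b_{k+1} a_{k+1} b_k + a_{k+1}^2 b_k^2)}_{\text{square of a difference}} \\
&= (a_1 b_{k+1} - a_{k+1} b_1)^2 + (a_2 b_{k+1} - a_{k+1} b_2)^2 + \cdots + (a_k b_{k+1} - a_{k+1} b_k)^2.
\end{aligned}$$

Finally, this establishes that, if Equation (7) is true for n = 1, 2, ..., k, then the equation

$$\begin{aligned}
(a_1^2 + \cdots + a_k^2 + a_{k+1}^2)(b_1^2 + \cdots + b_k^2 + b_{k+1}^2) - (a_1 b_1 + \cdots + a_k b_k + a_{k+1} b_{k+1})^2
&= \sum_{1 \le i < j \le k} (a_i b_j - a_j b_i)^2 + P b_{k+1}^2 + a_{k+1}^2 Q - 2M a_{k+1} b_{k+1} \\
&= \sum_{1 \le i < j \le k} (a_i b_j - a_j b_i)^2 + \sum_{i=1}^{k} (a_i b_{k+1} - a_{k+1} b_i)^2 \\
&= \sum_{1 \le i < j \le k+1} (a_i b_j - a_j b_i)^2
\end{aligned}$$

is also true. Since we know that Equation (7) is true for n = 1 and 2, the equation is true for any subsequent integer n. That is,

$$\left(\sum_{i=1}^{n} a_i^2\right)\left(\sum_{i=1}^{n} b_i^2\right) - \left(\sum_{i=1}^{n} a_i b_i\right)^2 = \sum_{1 \le i < j \le n} (a_i b_j - a_j b_i)^2, \quad \text{for any integer } n > 0. \tag{11}$$

(Equation (11) is a famous result.)

Proof that r = ±1 Implies Perfect Linearity

Suppose that $r = \pm 1$ for a set of data, $\{(x_1, y_1), (x_2, y_2), \ldots, (x_n, y_n)\}$. By Equation (4), we have

$$\left(\sum_{i=1}^{n} a_i^2\right)\left(\sum_{i=1}^{n} b_i^2\right) - \left(\sum_{i=1}^{n} a_i b_i\right)^2 = 0.$$

By Equation (11), we then have: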

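$$\sum_{1 \le i < j \le n} \underbrace{(a_i b_j - a_j b_i)^2}_{\text{each of the squares } \ge 0} = 0 \;\Rightarrow\; \text{each of the squares} = 0 \;\Rightarrow\; a_i b_j - a_j b_i = 0 \;\Rightarrow\; a_i b_j = a_j b_i \;\Rightarrow\; \frac{b_i}{a_i} = \frac{b_j}{a_j} = m, \text{ for some } m, \text{ for all } i, j.$$

That is,

$$\frac{y_i - \bar{y}}{x_i - \bar{x}} = \frac{y_j - \bar{y}}{x_j - \bar{x}} = m.$$

(Here we divide by $a_i = x_i - \bar{x}$. If some $a_i = 0$, then $a_i b_j = a_j b_i$ with any $a_j \ne 0$ forces $b_i = 0$, so that point coincides with the average point and lies on the same line. Some $a_j \ne 0$ must exist, since otherwise $s_x = 0$ and r would be undefined.)

We write this in more explicit form:

$$\frac{y_1 - \bar{y}}{x_1 - \bar{x}} = \frac{y_2 - \bar{y}}{x_2 - \bar{x}} = \frac{y_3 - \bar{y}}{x_3 - \bar{x}} = \cdots = \frac{y_n - \bar{y}}{x_n - \bar{x}} = m.$$

This says that each of the following sets of three points,

$$\{(x_1, y_1), (x_2, y_2), (\bar{x}, \bar{y})\},\ \{(x_1, y_1), (x_3, y_3), (\bar{x}, \bar{y})\},\ \ldots,\ \{(x_1, y_1), (x_n, y_n), (\bar{x}, \bar{y})\},$$

$$\{(x_2, y_2), (x_3, y_3), (\bar{x}, \bar{y})\},\ \ldots,\ \{(x_{n-1}, y_{n-1}), (x_n, y_n), (\bar{x}, \bar{y})\},$$

lies on a straight line which passes through the point $(\bar{x}, \bar{y})$ and has the same gradient, $m$. Therefore, all the points lie on the same straight line.

Conclusion

The above work completes the proof that a set of data is linear (with nonzero gradient) if and only if $r = \pm 1$. Moreover, during the discussion, we established that the size of r is a measure of the linearity of the set of data points. In the next tutorial, we derive the linear least squares regression equation and deal with the mystic nature of the coefficient of determination.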