Can we take the Mysticism Out of the Pearson Coefficient of Linear Correlation?

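Introduction

As the title of this tutorial indicates, our purpose is to engender a clear understanding of the Pearson coefficient of linear correlation in students' minds, dilute the mysticism of the formula to a minimum, and help students feel at home with applications of the formula.

Consider the set of data:

\[ D = \{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}. \]

The Pearson coefficient of linear correlation is given by the formula

\[ r = \frac{\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{s_x\, s_y}, \tag{1} \]

where

\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}, \qquad \bar{y} = \frac{y_1 + y_2 + \dots + y_n}{n}, \]

\[ s_x = \sqrt{\frac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}} \;\text{(standard deviation of } x\text{)}, \qquad s_y = \sqrt{\frac{\sum_{i=1}^{n}(y_i-\bar{y})^2}{n-1}} \;\text{(standard deviation of } y\text{)}. \]

For convenience of reference, in this tutorial we name the quantity

\[ \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) \tag{2} \]

the total covariation (also called the total covariance below), the point $(\bar{x}, \bar{y})$ the average point, the quantity $\sum_{i=1}^{n}(x_i-\bar{x})^2$ the x-variance, and the quantity $\sum_{i=1}^{n}(y_i-\bar{y})^2$ the y-variance.

From Equation (1), we have:

\[ r = \frac{\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\dfrac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}}\,\sqrt{\dfrac{\sum_{i=1}^{n}(y_i-\bar{y})^2}{n-1}}}. \]

The factors of $n-1$ cancel. Hence, the coefficient $r$ can be written in the form

\[ r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}. \tag{3} \]

The formula can also be read as

\[ r = \frac{\dfrac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{n-1}}{s_x\, s_y} = \frac{\text{average covariance of } x \text{ and } y}{(\text{standard deviation of } x)(\text{standard deviation of } y)}, \]

where we have termed the quantity $\dfrac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{n-1}$ the average covariance.

Now, we will explore how a mathematician could have thought when she was constructing this formula for the first time in history.

An Example of a Set of Points with Linear Orientation

This formula appears mystical to many students. They just know it has to work; they have no idea why it works the way it does. The students feel that the formula arrived in the textbooks of mathematics by autogenesis. In this tutorial, we try to look at how the idea behind the derivation of this formula could have been motivated in some mathematician's mind. In order to do so, we will look at the following set of points, which has some linear orientation.

[Figure: a set of points with some linear orientation.]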

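In this data set, $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$,

\[ \bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} \quad\text{and}\quad \bar{y} = \frac{y_1 + y_2 + \dots + y_n}{n}. \]

The average point, $(\bar{x}, \bar{y})$, has been marked in blue in the diagram above. Also, for one data point, $(x_i, y_i)$, the covariance components are marked in the diagram, in pink and green, respectively. For this point, both components are positive, so the covariance product is

\[ \underbrace{(x_i-\bar{x})}_{>0}\,\underbrace{(y_i-\bar{y})}_{>0} > 0. \]

For another data point, $(x_j, y_j)$, the covariance components are also marked in the diagram. For this point, both components are negative, so the covariance product is

\[ \underbrace{(x_j-\bar{x})}_{<0}\,\underbrace{(y_j-\bar{y})}_{<0} > 0. \]

That is, both $(x_i-\bar{x})(y_i-\bar{y})$ and $(x_j-\bar{x})(y_j-\bar{y})$ are positive. This fact is valid for all the other data points in the set. It happens because of the orientation of the points as a whole. More explicitly, the points are situated only in QI and QIII, with respect to the axes through the point $(\bar{x}, \bar{y})$. Therefore, both covariance components of each point have the same sign, so the covariance product for each of the points is positive. Therefore, the total covariance is positive.

Also, note that if we add more and more points to just Quadrants I and III, in such a manner that the average point is still $(\bar{x}, \bar{y})$, then the total covariance becomes larger and larger, without limit.

Another Set of Points Which Has Points in the Other Two Quadrants

To further see this, consider the following diagram.

[Figure: the same set of points, with new points added in QII and QIV.]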

In this diagram, we have added two new points to each of the quadrants QII and QIV, in such a manner that the average point remains the same. The covariance product of each of these new points is negative, since the two components of each product have different signs. Consequently, the total sum of the covariance products,

\[ \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}), \]

and the average covariance,

\[ \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{n-1}, \]

become smaller.

Now imagine that we keep adding points to QII and QIV, in such a manner that the average point, $(\bar{x}, \bar{y})$, remains the same. Two things will happen. First, the total covariance and the average covariance become smaller and smaller. Second, the points, as a whole, become more and more spread out and less and less linear.

Hence, if the magnitude of the total covariance (the quantity without its sign) is close to zero, the points are spread around the average point, $(\bar{x}, \bar{y})$, almost symmetrically and, as a whole, do not lie close to a straight line.

We further observe that

\[ r = \frac{\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{s_x\, s_y} = 0 \iff \sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) = 0. \]

That is, the value of $r$ being zero has nothing to do with the values of any of the divisors, $n-1$, $s_x$ or $s_y$. We will look at the role of these divisors later. By looking at the following diagram, we can clearly see how the scenario works out when

\[ r = \frac{\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{s_x\, s_y} = 0. \]

[Figure: A Set of Points Strategically Situated on a Circle. The X-axis is not shown.]

Again the point $(\bar{x}, \bar{y})$ is the average point of the data set. Recall, and convince yourself, that the average of the set of all points on a circle is its centre. The subset of points, $\{T, A, B, \dots, C', B', A'\}$, under our consideration has been chosen from the circle in such a manner that the average point is the centre of the circle.

We can quickly verify that $r = 0$, since the total covariance is zero. For instance, each of the points $T$ and $T'$ has a zero covariance product, since these points are on the line $x = \bar{x}$. Also, the sum of the covariance products for the two points $D$ and $D'$ is zero, due to the symmetry of the circle about the line $x = \bar{x}$ through the point $(\bar{x}, \bar{y})$. Explicitly, the y-covariance components for $D$ and $D'$ are the same (as $DV = D'V'$), and the x-covariance components for $D$ and $D'$ have the same magnitude with different signs, since $x - \bar{x}$ for $D$ is $UV$ and $x - \bar{x}$ for $D'$ is $UV' = -UV$. Therefore, the sum of these covariance products is

\[ D'V' \cdot UV' + DV \cdot UV = -DV \cdot UV + DV \cdot UV = 0. \]

All the points can be paired in this manner. Therefore, the total covariance is 0. Consequently, the value of $r$ is also 0.

Summary of Our Observations So Far

(i) The more nonlinear the set of points is, the smaller the magnitude of the total covariance (the value with the sign stripped off).
(ii) The more linear the set of points is, the bigger the magnitude of the total covariance becomes.
(iii) In a linear-like set with a negative orientation, more and more points means that we add many more negative quantities than positive quantities.
(iv) In a linear-like set with a positive orientation, more and more points means that we add many more positive quantities than negative quantities.
(v) The features (iii) and (iv) and our examples show that the size of the total covariance can be huge for a large linear data set.

Now, we calculate the total covariance for a set of points lying on a straight line. This might help us to solve the central mystery about the formula in Equation (1). The motivation to follow this path is not accidental or random; it comes from the summary above about the total covariance, $\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})$, of Equation (2).

Total Covariance of a Set of Linear Points

Consider a linear set of data satisfying the linear relation $y = mx + c$:

\[ \{(x_1, mx_1 + c), (x_2, mx_2 + c), \dots, (x_n, mx_n + c)\}. \]

Then,

\[
\begin{aligned}
\bar{y} &= \frac{y_1 + y_2 + \dots + y_n}{n} = \frac{(mx_1 + c) + (mx_2 + c) + \dots + (mx_n + c)}{n} \\
&= \frac{(mx_1 + mx_2 + \dots + mx_n) + nc}{n} && \text{(collect the } x\text{'s and the } c\text{'s)} \\
&= \frac{m(x_1 + x_2 + \dots + x_n) + nc}{n} && \text{(take out the common factor } m\text{)} \\
&= m\,\frac{x_1 + x_2 + \dots + x_n}{n} + \frac{nc}{n} && \text{(apply the distributive law of division)} \\
&= m\bar{x} + c.
\end{aligned}
\]
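Two of the claims made so far lend themselves to a quick numerical check: the total covariance of points placed symmetrically on a circle is zero, and the average point of data on a line $y = mx + c$ lies on that line. The sketch below is only an illustration; the helper `points_on_circle`, the circle's centre and radius, and the values of `m`, `c` and `xs` are assumptions chosen for the example and do not come from the tutorial's diagrams.

```python
import math

def total_covariation(points):
    """Sum of (x - x_bar) * (y - y_bar) over all points, as in Equation (2)."""
    n = len(points)
    x_bar = sum(x for x, _ in points) / n
    y_bar = sum(y for _, y in points) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in points)

def points_on_circle(centre_x, centre_y, radius, count):
    """Hypothetical helper: `count` equally spaced points on a circle."""
    return [
        (centre_x + radius * math.cos(2 * math.pi * k / count),
         centre_y + radius * math.sin(2 * math.pi * k / count))
        for k in range(count)
    ]

# Eight equally spaced points: each point has a mirror image across x = x_bar,
# so the covariance products cancel in pairs (zero up to rounding error).
circle = points_on_circle(3.0, 5.0, 2.0, 8)
circle_total = total_covariation(circle)

# Points on the line y = m*x + c (illustrative values of m and c).
m, c = 2.5, -1.0
xs = [1.0, 2.0, 4.0, 7.0]
ys = [m * x + c for x in xs]
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

print("total covariation on the circle:", circle_total)
print("y_bar:", y_bar, " m * x_bar + c:", m * x_bar + c)
```

Equally spaced points are used because they pair up across the vertical line through the centre, which is the same pairing argument applied to $D$ and $D'$.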

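Therefore, $\bar{y} = m\bar{x} + c$. This means that the point $(\bar{x}, \bar{y})$ lies on the straight line $y = mx + c$. Now, we can calculate the total covariance:

\[
\begin{aligned}
\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y}) &= \sum_{i=1}^{n}(x_i-\bar{x})(mx_i + c - m\bar{x} - c) \\
&= \sum_{i=1}^{n}(x_i-\bar{x})\,m(x_i-\bar{x}) \\
&= m\sum_{i=1}^{n}(x_i-\bar{x})^2.
\end{aligned}
\]

This result almost resolves the mystery of the formula: if we divide the total covariation, $m\sum_{i=1}^{n}(x_i-\bar{x})^2$, by the x-variance, $\sum_{i=1}^{n}(x_i-\bar{x})^2$, then we just get the gradient of the linear relation. We can improve this further if we divide the total covariance by the product of the square roots of the x-variance and the y-variance. In this case, using Equation (3) and $y_i - \bar{y} = m(x_i - \bar{x})$, we obtain

\[
r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}
= \frac{m\sum_{i=1}^{n}(x_i-\bar{x})^2}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{m^2\sum_{i=1}^{n}(x_i-\bar{x})^2}}
= \frac{m\sum_{i=1}^{n}(x_i-\bar{x})^2}{|m|\sum_{i=1}^{n}(x_i-\bar{x})^2}
= \frac{m}{|m|},
\]

so that

\[
r = \frac{m}{|m|} =
\begin{cases}
1, & \text{if } m > 0, \\
\text{undefined}, & \text{if } m = 0, \\
-1, & \text{if } m < 0.
\end{cases}
\]

Here, we have used the fact that, for a real number $m$, $\sqrt{m^2} = |m|$, and not $\sqrt{m^2} = m$ or $\sqrt{m^2} = \pm m$. To see this, note that, for example, $\sqrt{2^2} = 2$ and $\sqrt{(-2)^2} = 2$ (not $-2$). That is,

\[
\sqrt{m^2} = |m| =
\begin{cases}
m, & \text{if } m \ge 0, \\
-m, & \text{if } m < 0.
\end{cases}
\]

The work above resolves the mystery of the formula, with the exception of the use of the divisor $n-1$ and the terms $s_x$ and $s_y$. To resolve this issue, we again look at Equation (3):

\[
r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}}
= \underbrace{\frac{\dfrac{1}{n}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\dfrac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n}}\,\sqrt{\dfrac{\sum_{i=1}^{n}(y_i-\bar{y})^2}{n}}}}_{\text{averages of the variances taken over } n}
= \underbrace{\frac{\dfrac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\dfrac{\sum_{i=1}^{n}(x_i-\bar{x})^2}{n-1}}\,\sqrt{\dfrac{\sum_{i=1}^{n}(y_i-\bar{y})^2}{n-1}}}}_{\text{averages of the variances taken over } n-1}.
\]

Hence

\[
r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{n\, s_x s_y}
\quad\text{or}\quad
r = \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{(n-1)\, s_x s_y},
\]

depending on whether the averages of the variances (and hence $s_x$ and $s_y$) are taken over $n$ or $n-1$.

Whether the averages of the variances are taken over $n$ or $n-1$ has no consequence for the value of $r$. However, when taking averages, the use of $n-1$ is preferred due to another statistical concept, which we do not discuss in this tutorial. Then why do we write the linear correlation formula in the form given in Equation (1)? The reason is that it involves the standard deviations of the variables rather than just the square roots of the totals of the squared variations.

Summary of Features of r

So far, we have proved that if a set of points is linear, then $r = \pm 1$. We have also seen that the larger the size of $r$, the more linear the data set is, and the smaller the size of $r$, the more nonlinear the data tends to be. We have constructed (not just derived) this formula so that it has the property that

\[ -1 \le r \le 1, \]

where $r = -1$ corresponds to perfectly linear data with a negative gradient and $r = 1$ to perfectly linear data with a positive gradient. Now, we have to prove the converse direction: if $r = \pm 1$ for a set of points, then the data satisfies a linear relation.

Does r = ±1 Imply Linearity?

What we have proved up to this point is that if a set of points is linear, then $r = \pm 1$. This raises the question: is the set of points linear if $r = \pm 1$?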
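The forward direction argued so far (perfectly linear data gives $r = \pm 1$, and the choice of divisor $n$ or $n-1$ does not change $r$) can be tried out numerically. The following is a minimal sketch, not a reference implementation; the function names `pearson_r` and `pearson_r_with_divisor` and all sample data are assumptions made for illustration.

```python
import math

def _deviation_sums(xs, ys):
    """Total covariation, x-variance and y-variance as named in the tutorial."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    total_cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    x_var = sum((x - x_bar) ** 2 for x in xs)
    y_var = sum((y - y_bar) ** 2 for y in ys)
    return total_cov, x_var, y_var

def pearson_r(xs, ys):
    """Equation (3): no divisor appears at all."""
    total_cov, x_var, y_var = _deviation_sums(xs, ys)
    return total_cov / (math.sqrt(x_var) * math.sqrt(y_var))

def pearson_r_with_divisor(xs, ys, divisor):
    """Equation (1) style: average covariance over the product of standard
    deviations, with the same divisor (n or n - 1) used throughout."""
    total_cov, x_var, y_var = _deviation_sums(xs, ys)
    s_x = math.sqrt(x_var / divisor)
    s_y = math.sqrt(y_var / divisor)
    return (total_cov / divisor) / (s_x * s_y)

xs = [1.0, 2.0, 4.0, 7.0]
r_positive = pearson_r(xs, [3.0 * x + 2.0 for x in xs])    # gradient m = 3
r_negative = pearson_r(xs, [-0.5 * x + 1.0 for x in xs])   # gradient m = -0.5

# Scattered (non-linear) data: both divisors give the same r.
ys_scattered = [2.0, 1.0, 5.0, 4.0]
n = len(xs)
r_plain = pearson_r(xs, ys_scattered)
r_over_n = pearson_r_with_divisor(xs, ys_scattered, n)
r_over_n_minus_1 = pearson_r_with_divisor(xs, ys_scattered, n - 1)

print(r_positive, r_negative)
print(r_plain, r_over_n, r_over_n_minus_1)
```

For a horizontal line ($m = 0$) the y-variance is zero, so the quotient is not defined, in line with the "undefined" case above.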

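If $r = \pm 1$, then we have

\[ \frac{\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_{i=1}^{n}(x_i-\bar{x})^2}\,\sqrt{\sum_{i=1}^{n}(y_i-\bar{y})^2}} = \pm 1. \]

Cross-multiplying and squaring both sides gives

\[ \left(\sum_{i=1}^{n}(x_i-\bar{x})(y_i-\bar{y})\right)^2 = \sum_{i=1}^{n}(x_i-\bar{x})^2 \sum_{i=1}^{n}(y_i-\bar{y})^2. \]

Writing $a_i = x_i - \bar{x}$ and $b_i = y_i - \bar{y}$, this becomes

\[ \left(\sum_{i=1}^{n} a_i^2\right)\left(\sum_{i=1}^{n} b_i^2\right) - \left(\sum_{i=1}^{n} a_i b_i\right)^2 = 0. \tag{4} \]

Equation (4) is trivially true for $n = 1$. What happens if we have just two terms ($n = 2$)? Then Equation (4) reads

\[ (a_1^2 + a_2^2)(b_1^2 + b_2^2) - (a_1 b_1 + a_2 b_2)^2 = 0. \tag{5} \]

Expanding the left-hand side, using $(p+q)^2 = p^2 + 2pq + q^2$:

\[
\begin{aligned}
(a_1^2 + a_2^2)(b_1^2 + b_2^2) - (a_1 b_1 + a_2 b_2)^2
&= a_1^2 b_1^2 + a_1^2 b_2^2 + a_2^2 b_1^2 + a_2^2 b_2^2 - a_1^2 b_1^2 - 2 a_1 b_1 a_2 b_2 - a_2^2 b_2^2 \\
&= a_1^2 b_2^2 - 2 a_1 b_2 a_2 b_1 + a_2^2 b_1^2 \\
&= (a_1 b_2 - a_2 b_1)^2.
\end{aligned}
\]

Thus we have proved that

\[ (a_1^2 + a_2^2)(b_1^2 + b_2^2) - (a_1 b_1 + a_2 b_2)^2 = (a_1 b_2 - a_2 b_1)^2. \tag{6} \]

That is, Equation (6) is true for any four real numbers $a_1$, $a_2$, $b_1$ and $b_2$. We could also have derived Equation (6) by using the difference of two squares on the LHS (left-hand side) of the equation and then manipulating the result.

We would like to investigate whether we can extend the formula in Equation (6) to any number of points. The suggested formula is given in Equation (*):

\[ \left(\sum_{i=1}^{n} a_i^2\right)\left(\sum_{i=1}^{n} b_i^2\right) - \left(\sum_{i=1}^{n} a_i b_i\right)^2 = \sum_{1 \le i < j \le n} (a_i b_j - a_j b_i)^2; \qquad n = 1, 2, \dots \tag{*} \]

At this point, if you prefer, you may skip what follows up to the end of the derivation of Equation (11).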
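Before proving Equation (*), it can be reassuring to see it hold on concrete numbers. The sketch below compares both sides for the two-term case of Equation (6) and for a four-term case; the function names and the integer sample values are assumptions chosen so that the arithmetic is exact. A numerical check of this kind is not a proof.

```python
from itertools import combinations

def left_side(a, b):
    """(sum of a_i^2)(sum of b_i^2) - (sum of a_i * b_i)^2."""
    sum_a_sq = sum(x * x for x in a)
    sum_b_sq = sum(y * y for y in b)
    sum_ab = sum(x * y for x, y in zip(a, b))
    return sum_a_sq * sum_b_sq - sum_ab ** 2

def right_side(a, b):
    """Sum over all index pairs i < j of (a_i * b_j - a_j * b_i)^2."""
    return sum(
        (a[i] * b[j] - a[j] * b[i]) ** 2
        for i, j in combinations(range(len(a)), 2)
    )

# Two terms, as in Equation (6).
a2, b2 = [1, 2], [3, 5]
# Four terms, as in the proposed Equation (*).
a4, b4 = [1, -2, 3, 4], [2, 0, -1, 5]

print(left_side(a2, b2), right_side(a2, b2))
print(left_side(a4, b4), right_side(a4, b4))
```

`itertools.combinations(range(n), 2)` yields each pair with $i < j$ exactly once, which matches the index set of the sum on the right-hand side.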

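Proving the Formula (*) for a Set of n Points

Now, we assume that the extension of Equation (6) is true for $a_1, a_2, \dots, a_n$ and $b_1, b_2, \dots, b_n$, for every $n$ up to some positive integer $k$. We already know that the equation is true for $n = 1$ and $n = 2$. Hence, we have that

\[
(a_1^2 + a_2^2 + \dots + a_n^2)(b_1^2 + b_2^2 + \dots + b_n^2) - (a_1 b_1 + a_2 b_2 + \dots + a_n b_n)^2
= (a_1 b_2 - a_2 b_1)^2 + (a_1 b_3 - a_3 b_1)^2 + \dots + (a_{n-1} b_n - a_n b_{n-1})^2. \tag{7}
\]

Equation (7) can be written as:

\[ \left(\sum_{i=1}^{n} a_i^2\right)\left(\sum_{i=1}^{n} b_i^2\right) - \left(\sum_{i=1}^{n} a_i b_i\right)^2 = \sum_{1 \le i < j \le n} (a_i b_j - a_j b_i)^2; \qquad n = 1, 2, \dots, k. \tag{8} \]

Again, keep in mind that the result is trivially true for $n = 1$, and that we have proved the case $n = 2$ in Equation (6).

Now, we set

\[ M = a_1 b_1 + a_2 b_2 + \dots + a_k b_k, \qquad P^2 = a_1^2 + a_2^2 + \dots + a_k^2, \qquad Q^2 = b_1^2 + b_2^2 + \dots + b_k^2. \]

With this terminology, we can write

\[ (a_1^2 + \dots + a_k^2)(b_1^2 + \dots + b_k^2) - (a_1 b_1 + \dots + a_k b_k)^2 = P^2 Q^2 - M^2 = \sum_{1 \le i < j \le k} (a_i b_j - a_j b_i)^2. \tag{9} \]

Next, we consider the statement for $n = k + 1$:

\[
(a_1^2 + \dots + a_k^2 + a_{k+1}^2)(b_1^2 + \dots + b_k^2 + b_{k+1}^2) - (a_1 b_1 + \dots + a_k b_k + a_{k+1} b_{k+1})^2
= (P^2 + a_{k+1}^2)(Q^2 + b_{k+1}^2) - (M + a_{k+1} b_{k+1})^2. \tag{10}
\]

Then, consider the expression on the right-hand side of Equation (10):

\[
\begin{aligned}
(P^2 + a_{k+1}^2)(Q^2 + b_{k+1}^2) - (M + a_{k+1} b_{k+1})^2
&= P^2 Q^2 + P^2 b_{k+1}^2 + a_{k+1}^2 Q^2 + a_{k+1}^2 b_{k+1}^2 - M^2 - 2 M a_{k+1} b_{k+1} - a_{k+1}^2 b_{k+1}^2 \\
&= (P^2 Q^2 - M^2) + P^2 b_{k+1}^2 + a_{k+1}^2 Q^2 - 2 M a_{k+1} b_{k+1} \\
&= \sum_{1 \le i < j \le k} (a_i b_j - a_j b_i)^2 + P^2 b_{k+1}^2 + a_{k+1}^2 Q^2 - 2 M a_{k+1} b_{k+1} && \text{(by Equation (8))}.
\end{aligned}
\]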

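Now,

\[
\begin{aligned}
P^2 b_{k+1}^2 + a_{k+1}^2 Q^2 - 2 M a_{k+1} b_{k+1}
&= (a_1^2 + \dots + a_k^2)\, b_{k+1}^2 + a_{k+1}^2 (b_1^2 + \dots + b_k^2) - 2 (a_1 b_1 + \dots + a_k b_k)\, a_{k+1} b_{k+1} \\
&= \underbrace{\left(a_1^2 b_{k+1}^2 - 2 a_1 b_1 a_{k+1} b_{k+1} + a_{k+1}^2 b_1^2\right)}_{\text{square of a difference}}
 + \underbrace{\left(a_2^2 b_{k+1}^2 - 2 a_2 b_2 a_{k+1} b_{k+1} + a_{k+1}^2 b_2^2\right)}_{\text{square of a difference}} \\
&\qquad + \dots + \underbrace{\left(a_k^2 b_{k+1}^2 - 2 a_k b_k a_{k+1} b_{k+1} + a_{k+1}^2 b_k^2\right)}_{\text{square of a difference}} \\
&= (a_1 b_{k+1} - a_{k+1} b_1)^2 + (a_2 b_{k+1} - a_{k+1} b_2)^2 + \dots + (a_k b_{k+1} - a_{k+1} b_k)^2.
\end{aligned}
\]

Finally, this establishes that, if Equation (7) is true for $n = 1, 2, \dots, k$, then the equation

\[
\begin{aligned}
(a_1^2 + \dots + a_{k+1}^2)(b_1^2 + \dots + b_{k+1}^2) - (a_1 b_1 + \dots + a_{k+1} b_{k+1})^2
&= \sum_{1 \le i < j \le k} (a_i b_j - a_j b_i)^2 + P^2 b_{k+1}^2 + a_{k+1}^2 Q^2 - 2 M a_{k+1} b_{k+1} \\
&= \sum_{1 \le i < j \le k} (a_i b_j - a_j b_i)^2 + \sum_{i=1}^{k} (a_i b_{k+1} - a_{k+1} b_i)^2 \\
&= \sum_{1 \le i < j \le k+1} (a_i b_j - a_j b_i)^2
\end{aligned}
\]

is also true. Since we know that Equation (7) is true for $n = 1$ and $n = 2$, the equation is true for any subsequent integer. That is,

\[ \left(\sum_{i=1}^{n} a_i^2\right)\left(\sum_{i=1}^{n} b_i^2\right) - \left(\sum_{i=1}^{n} a_i b_i\right)^2 = \sum_{1 \le i < j \le n} (a_i b_j - a_j b_i)^2 \tag{11} \]

for any integer $n > 0$. (Equation (11) is a famous result, known as Lagrange's identity.)

Proof that r = ±1 Implies Perfect Linearity

Suppose that $r = \pm 1$ for a set of data, $\{(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)\}$. By Equation (4), we have

\[ \left(\sum_{i=1}^{n} a_i^2\right)\left(\sum_{i=1}^{n} b_i^2\right) - \left(\sum_{i=1}^{n} a_i b_i\right)^2 = 0. \]

By Equation (11), we then have:

\[ \sum_{1 \le i < j \le n} (a_i b_j - a_j b_i)^2 = 0. \]

Each of the squares is $\ge 0$, so each of the squares must be $= 0$. Hence, for all $i, j$,

\[ a_i b_j - a_j b_i = 0 \;\Rightarrow\; a_i b_j = a_j b_i \;\Rightarrow\; \frac{b_i}{a_i} = \frac{b_j}{a_j} = m \quad \text{for some } m \]

(wherever $a_i, a_j \neq 0$), that is,

\[ \frac{y_i - \bar{y}}{x_i - \bar{x}} = \frac{y_j - \bar{y}}{x_j - \bar{x}} = m. \]

We write this in more explicit form:

\[ \frac{y_1 - \bar{y}}{x_1 - \bar{x}} = \frac{y_2 - \bar{y}}{x_2 - \bar{x}} = \frac{y_3 - \bar{y}}{x_3 - \bar{x}} = \dots = \frac{y_n - \bar{y}}{x_n - \bar{x}} = m. \]

This says that each of the following sets of three points,

\[ \{(x_1, y_1), (\bar{x}, \bar{y}), (x_2, y_2)\},\ \{(x_1, y_1), (\bar{x}, \bar{y}), (x_3, y_3)\},\ \dots,\ \{(x_1, y_1), (\bar{x}, \bar{y}), (x_n, y_n)\}, \]
\[ \{(x_2, y_2), (\bar{x}, \bar{y}), (x_3, y_3)\},\ \dots,\ \{(x_{n-1}, y_{n-1}), (\bar{x}, \bar{y}), (x_n, y_n)\}, \]

lies on a straight line which passes through the point $(\bar{x}, \bar{y})$ and has the same gradient, $m$. Therefore, all the points lie on the same straight line.

Conclusion

The above work completes the proof that a set of data is linear if and only if $r = \pm 1$. Moreover, during the discussion, we established that the size of $r$ is a measure of the linearity of the set of data points. In the next tutorial, we derive the linear least squares regression equation and deal with the mystic nature of the coefficient of determination.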
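The converse argument can be illustrated in code. Dividing Equation (11) by the product of the x-variance and the y-variance gives $1 - r^2$ as the sum of the squared cross terms over that product, so $r = \pm 1$ exactly when every cross term vanishes. The sketch below uses exact rational arithmetic; the function names and both data sets are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import combinations

def deviations(values):
    """Deviations from the mean, kept exact with Fraction."""
    mean = Fraction(sum(values), len(values))
    return [v - mean for v in values]

def cross_term_sum(xs, ys):
    """Right-hand side of Equation (11) with a_i = x_i - x_bar, b_i = y_i - y_bar."""
    a, b = deviations(xs), deviations(ys)
    return sum(
        (a[i] * b[j] - a[j] * b[i]) ** 2
        for i, j in combinations(range(len(a)), 2)
    )

def r_squared(xs, ys):
    """Square of Equation (3), computed without taking square roots."""
    a, b = deviations(xs), deviations(ys)
    total_cov = sum(p * q for p, q in zip(a, b))
    return total_cov ** 2 / (sum(p * p for p in a) * sum(q * q for q in b))

xs = [1, 2, 3, 6]
ys_linear = [2 * x + 1 for x in xs]   # all points on y = 2x + 1
ys_bent = [3, 5, 8, 12]               # two points moved off that line

x_var = sum(p * p for p in deviations(xs))
y_var_bent = sum(q * q for q in deviations(ys_bent))

print(cross_term_sum(xs, ys_linear), r_squared(xs, ys_linear))
print(cross_term_sum(xs, ys_bent), r_squared(xs, ys_bent))
```

In the linear case the cross-term sum is zero and $r^2 = 1$; in the second case the cross-term sum is positive and $r^2$ falls short of 1 by exactly that sum divided by the product of the two variances.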