4th Indian Institute of Astrophysics - PennState Astrostatistics School, July 2013, Vainu Bappu Observatory, Kavalur. Correlation and Regression. Rahul Roy, Indian Statistical Institute, Delhi.
Correlation. Consider a two-dimensional random vector (X, Y). As seen earlier, the random vector is governed by a joint probability mass function

p(x, y) = P(X = x_i, Y = y_j) for x = x_i, y = y_j, and p(x, y) = 0 otherwise,

where (X, Y) takes discrete values (x_i, y_j) for i = 1, ..., m, j = 1, ..., n; or by a joint probability density function f(x, y) with

P(X ≤ a, Y ≤ b) = ∫_{−∞}^{a} ∫_{−∞}^{b} f(x, y) dx dy,

where (X, Y) takes continuous values.
We restrict ourselves to the random vector (X, Y) taking discrete values (x_i, y_j) for i = 1, ..., m, j = 1, ..., n. Also, as seen earlier, the marginal probability mass functions of X and Y are respectively

p_X(x) = Σ_{j=1}^{n} p(x_i, y_j) for x = x_i, and 0 otherwise,
p_Y(y) = Σ_{i=1}^{m} p(x_i, y_j) for y = y_j, and 0 otherwise,
and the means and variances of the marginals X and Y are

µ_X = Σ_{i=1}^{m} x_i p_X(x_i),   Var(X) = Σ_{i=1}^{m} (x_i − µ_X)² p_X(x_i),
µ_Y = Σ_{j=1}^{n} y_j p_Y(y_j),   Var(Y) = Σ_{j=1}^{n} (y_j − µ_Y)² p_Y(y_j).
Covariance and Correlation Coefficient. The covariance of X and Y is given by

Cov(X, Y) = E[(X − µ_X)(Y − µ_Y)]
          = Σ_{i=1}^{m} Σ_{j=1}^{n} (x_i − µ_X)(y_j − µ_Y) p(x_i, y_j)
          = Σ_{i=1}^{m} Σ_{j=1}^{n} x_i y_j p(x_i, y_j) − µ_X µ_Y.

Note that in the above the expectation is taken with respect to the joint distribution of (X, Y).
The correlation coefficient between X and Y is given by

ρ_XY = Cov(X, Y) / √(Var(X) Var(Y))
     = (Σ_{i=1}^{m} Σ_{j=1}^{n} x_i y_j p(x_i, y_j) − µ_X µ_Y) / √((Σ_{i=1}^{m} x_i² p_X(x_i) − µ_X²)(Σ_{j=1}^{n} y_j² p_Y(y_j) − µ_Y²)).
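The formulas above can be sketched directly in Python for a small discrete joint pmf. The particular pmf table below is an illustrative example, not data from the lecture:

```python
# Correlation coefficient of a discrete random vector (X, Y),
# computed directly from its joint probability mass function.
# The joint pmf table is made up for illustration.
from math import sqrt

xs = [0, 1]          # values taken by X
ys = [0, 1, 2]       # values taken by Y
# p[i][j] = P(X = xs[i], Y = ys[j]); entries must sum to 1
p = [[0.10, 0.20, 0.10],
     [0.20, 0.10, 0.30]]

# marginal pmfs: row sums for X, column sums for Y
pX = [sum(row) for row in p]
pY = [sum(p[i][j] for i in range(len(xs))) for j in range(len(ys))]

muX = sum(x * px for x, px in zip(xs, pX))
muY = sum(y * py for y, py in zip(ys, pY))

# Cov(X, Y) = sum_i sum_j x_i y_j p(x_i, y_j) - muX * muY
cov = sum(xs[i] * ys[j] * p[i][j]
          for i in range(len(xs)) for j in range(len(ys))) - muX * muY

varX = sum(x * x * px for x, px in zip(xs, pX)) - muX ** 2
varY = sum(y * y * py for y, py in zip(ys, pY)) - muY ** 2

rho = cov / sqrt(varX * varY)
```

The expectation here is taken with respect to the joint distribution, exactly as the note above says; the marginals are obtained by summing the joint pmf over the other variable.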
Linear regression. Suppose we have 2-dimensional data regarding the following:
(i) age and height of children, or
(ii) amount of exercise and the rate of heart beat of people,
(iii) income and age at death of people,
(iv) level of education and caste bias.
Note here that the data comes in pairs now, e.g. the (height, weight) of the same person. Thus unlike earlier we do not have (x_i, y_j) as a data point, but just (x_i, y_i). In all these examples, we expect a relation between the two variables: if a child is much shorter than another child, it is more likely that she is younger than the other one, etc. Although in the last example, caste bias vis-a-vis level of education, the relationship is not that simple. We want to use statistical techniques to determine whether there is a mathematical equation connecting the variables.
Such mathematical equations are important because they would allow us to predict and plan. For example, if we knew that a group of 9 and 10 year olds was coming for a summer camp at this observatory, then we could have a fair idea of the lengths of beds and mattresses needed for their sleep. We first collect the data, tabulate it and express it as a scatter diagram.
[Scatter diagram of the first data set (Series1): x ranging over 0–12, y over 0–25.]
[Scatter diagram of the second data set (Series1): x ranging over 0–12, y over 0–450.]
Linear regression. The first chart suggests a linear relationship between the X and Y variables.
[Figure: a candidate line through the data; for each x_i the line gives the predicted point (x_i, ŷ_i), the data give the observation (x_i, y_i), and E_i denotes the error.]
[Figure: as above, with the error E_i = |ŷ_i − y_i|.]
Note that here we took the error as E_i = |ŷ_i − y_i|, the absolute value being important because if the observed point was above the predicted point, then we do not want to have a negative error. So clearly a criterion to obtain the line giving the predicted linear relation between the x and the y variables is that the sum of the errors must be minimized, i.e. the line should minimize Σ_{i=1}^{n} |ŷ_i − y_i|.
Unfortunately minimizing Σ_{i=1}^{n} |ŷ_i − y_i| is difficult and can only be done through numerical methods. A simpler method is to use calculus. First note that we took the absolute error so as not to have negative errors. Another way to avoid negative errors is to take the square, i.e. E_i = (ŷ_i − y_i)². In this case, our criterion should be: the line should minimize Σ_{i=1}^{n} (ŷ_i − y_i)².
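To see what such a numerical method might look like, here is a minimal sketch that minimizes the sum of absolute errors by a crude grid search over slopes and intercepts. The data and the search ranges are made up for illustration; a real analysis would use a proper optimization routine:

```python
# Crude grid search for the line y = b*x + c minimizing the sum of
# absolute errors sum_i |b*x_i + c - y_i|.  Data are made up for
# illustration (roughly y = 2x).
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.8]

def sum_abs_err(b, c):
    return sum(abs(b * x + c - y) for x, y in zip(xs, ys))

best = None
for bi in range(0, 401):          # slopes 0.00 .. 4.00 in steps of 0.01
    for ci in range(-200, 201):   # intercepts -2.00 .. 2.00
        b, c = bi / 100, ci / 100
        e = sum_abs_err(b, c)
        if best is None or e < best[0]:
            best = (e, b, c)

err, b_hat, c_hat = best
```

Even this brute-force search is noticeably clumsier than the closed-form calculus solution for squared errors that follows.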
[Figure: as above, with the squared error E_i = (ŷ_i − y_i)².]
To do this is easy because the equation of a straight line is y = bx + c, where b is the slope and c the intercept. So ŷ_i = bx_i + c, and we have to minimize

E(b, c) := Σ_{i=1}^{n} (bx_i + c − y_i)².
So we need to set ∂E(b, c)/∂b = 0 and ∂E(b, c)/∂c = 0 and solve for b and c.
For the data {(x_i, y_i) : i = 1, ..., n}, calculus yields (Homework)

ˆb = (n Σ_{i=1}^{n} x_i y_i − Σ_{i=1}^{n} x_i Σ_{i=1}^{n} y_i) / (n Σ_{i=1}^{n} x_i² − (Σ_{i=1}^{n} x_i)²)
   = (Σ_{i=1}^{n} x_i y_i − n x̄ ȳ) / (Σ_{i=1}^{n} x_i² − n x̄²),

ĉ = ȳ − ˆb x̄.
x:  1.2  0.8  1.0  1.3  0.7  0.8  1.0  0.6  0.9  1.1
y:  101   92  110  120   90   82   93   75   91  105
  x      y      x²      xy       y²
 1.2    101    1.44   121.2    10201
 0.8     92    0.64    73.6     8464
 1.0    110    1.00   110.0    12100
 1.3    120    1.69   156.0    14400
 0.7     90    0.49    63.0     8100
 0.8     82    0.64    65.6     6724
 1.0     93    1.00    93.0     8649
 0.6     75    0.36    45.0     5625
 0.9     91    0.81    81.9     8281
 1.1    105    1.21   115.5    11025
Sum  9.4    959    9.28   924.8    93569

ˆb = 52.568, ĉ = 46.486, so ŷ = 46.486 + 52.568x.
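The arithmetic in the table can be checked with a few lines of Python, a sketch that applies the formulas for ˆb and ĉ derived above to the same ten data pairs:

```python
# Least-squares slope and intercept for the worked example,
# using b_hat = (sum x_i y_i - n*xbar*ybar) / (sum x_i^2 - n*xbar^2)
# and c_hat = ybar - b_hat * xbar.
xs = [1.2, 0.8, 1.0, 1.3, 0.7, 0.8, 1.0, 0.6, 0.9, 1.1]
ys = [101, 92, 110, 120, 90, 82, 93, 75, 91, 105]

n = len(xs)
xbar = sum(xs) / n
ybar = sum(ys) / n

sxy = sum(x * y for x, y in zip(xs, ys))   # the "xy" column sum, 924.8
sxx = sum(x * x for x in xs)               # the "x^2" column sum, 9.28

b_hat = (sxy - n * xbar * ybar) / (sxx - n * xbar ** 2)
c_hat = ybar - b_hat * xbar
```

The column sums and the fitted line ŷ = 46.486 + 52.568x agree with the table up to rounding.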
Linear Correlation Analysis. Suppose the data comes in pairs as in the case of linear regression, i.e. {(x_i, y_i) : i = 1, ..., n}, and we assign equal probability to each data point: P(X = x_i, Y = y_i) = 1/n. Note: this takes care of the inherent distribution of the data points, because if the values of the random variable X are more concentrated around a given value a, say, then there will be more data points x_i around the value a.
For example, if we are measuring the height and weight of individuals, then we will obtain more data points with the height value around 165 cm rather than around 180 cm, because there are more people whose heights are concentrated around 165 cm than there are basketball players.
In this case the correlation coefficient becomes

ρ_XY = (Σ_{i=1}^{m} Σ_{j=1}^{n} x_i y_j p(x_i, y_j) − µ_X µ_Y) / √((Σ_{i=1}^{m} x_i² p_X(x_i) − µ_X²)(Σ_{j=1}^{n} y_j² p_Y(y_j) − µ_Y²))
     = (Σ_{i=1}^{n} x_i y_i − n x̄ ȳ) / √((Σ_{i=1}^{n} x_i² − n x̄²)(Σ_{i=1}^{n} y_i² − n ȳ²)).

To emphasize that this is a sample correlation coefficient we write

r_{X,Y} = (Σ_{i=1}^{n} x_i y_i − n x̄ ȳ) / √((Σ_{i=1}^{n} x_i² − n x̄²)(Σ_{i=1}^{n} y_i² − n ȳ²)).
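The sample correlation coefficient is easy to turn into a small function, a sketch of the formula above. The test data here is a made-up perfectly linear set (not the lecture's table, which is left as homework later):

```python
# Sample correlation coefficient r_{X,Y} for paired data, using
# r = (sum x_i y_i - n*xbar*ybar) /
#     sqrt((sum x_i^2 - n*xbar^2) * (sum y_i^2 - n*ybar^2)).
from math import sqrt

def sample_corr(xs, ys):
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    num = sum(x * y for x, y in zip(xs, ys)) - n * xbar * ybar
    den = sqrt((sum(x * x for x in xs) - n * xbar ** 2)
               * (sum(y * y for y in ys) - n * ybar ** 2))
    return num / den

# Perfectly linear data with positive slope (y = 2x + 1) should give
# r = 1 up to floating-point rounding.
r = sample_corr([1, 2, 3, 4], [3, 5, 7, 9])
```

Because the data points all lie exactly on a line of positive slope, the numerator and denominator coincide and r comes out as 1.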
Recall from regression analysis

ˆb = (Σ_{i=1}^{n} x_i y_i − n x̄ ȳ) / (Σ_{i=1}^{n} x_i² − n x̄²),

while the correlation coefficient is

r_{X,Y} = (Σ_{i=1}^{n} x_i y_i − n x̄ ȳ) / √((Σ_{i=1}^{n} x_i² − n x̄²)(Σ_{i=1}^{n} y_i² − n ȳ²)).
The denominator in ˆb, being n times a sample variance, is positive, while we take the positive square root in the denominator of r_{X,Y}. The numerator is the same for both. Thus the sign of the slope ˆb is the same as that of the correlation coefficient r_{X,Y}. For the data given earlier we can now compute the correlation coefficient (Homework).
Correlation and regression analysis are related: both deal with relationships among variables. The correlation coefficient is a measure of linear association between two variables. Values of the correlation coefficient are always between −1 and +1.
ρ_XY = +1 implies the variables are perfectly correlated in the positive linear sense. ρ_XY = −1 implies the variables are perfectly correlated in the negative linear sense. ρ_XY = 0 implies the variables are uncorrelated; this does not mean that the variables are independent.
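A standard example of the last point (a sketch, not from the lecture): take X uniform on {−1, 0, 1} and Y = X². Then Y is a function of X, so the variables are clearly dependent, yet their covariance, and hence their correlation, is zero:

```python
# X uniform on {-1, 0, 1}, Y = X^2: dependent but uncorrelated.
xs = [-1, 0, 1]
p = 1.0 / 3.0                       # P(X = x) for each value

muX = sum(x * p for x in xs)        # E[X] = 0
muY = sum(x * x * p for x in xs)    # E[Y] = E[X^2] = 2/3

# Cov(X, Y) = E[X * X^2] - muX * muY = E[X^3] - 0
cov = sum(x * (x * x) * p for x in xs) - muX * muY
```

The covariance vanishes because E[X³] = 0 by symmetry, even though knowing X determines Y completely.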
Neither regression nor correlation analyses establish cause-and-effect relationships. They indicate only how, or to what extent, variables are associated with each other. The correlation coefficient measures only the degree of linear association between two variables. A cause-and-effect relationship must be based on the judgment of the analyst.