Correlation analysis 2: Measures of correlation


1 Correlation analysis 2: Measures of correlation Ryan Tibshirani Data Mining: / February

2 Review: correlation
Pearson's correlation is a measure of linear association.
In the population: for random variables X, Y ∈ R,
    Cor(X, Y) = Cov(X, Y) / sqrt(Var(X) Var(Y))
In the sample: for vectors x, y ∈ R^n,
    cor(x, y) = cov(x, y) / sqrt(var(x) var(y)) = (x − x̄1)^T (y − ȳ1) / (‖x − x̄1‖_2 ‖y − ȳ1‖_2)
If x, y have been centered, then
    cor(x, y) = x^T y / (‖x‖_2 ‖y‖_2)
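As a concrete check of the sample formula, here is a minimal pure-Python sketch (the name pearson_cor is ours, not from the slides): center both vectors, then divide their inner product by the product of their Euclidean norms.

```python
import math

def pearson_cor(x, y):
    """Sample correlation cor(x, y): center both vectors, then take the
    inner product divided by the product of Euclidean norms."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    xc = [v - mx for v in x]
    yc = [v - my for v in y]
    num = sum(a * b for a, b in zip(xc, yc))
    den = math.sqrt(sum(a * a for a in xc)) * math.sqrt(sum(b * b for b in yc))
    return num / den

print(pearson_cor([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]))  # 1.0: y = 2x exactly
```

Since the (n − 1) factors in cov and var cancel, working with centered vectors and norms is enough.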

3 Review: properties of population correlation
Recall: Cor(X, Y) = Cov(X, Y) / sqrt(Var(X) Var(Y))
Properties of Cor:
1. Cor(X, X) = 1
2. Cor(X, Y) = Cor(Y, X)
3. Cor(aX + b, Y) = sign(a) Cor(X, Y) for any a, b ∈ R
4. −1 ≤ Cor(X, Y) ≤ 1, with Cor(X, Y) > 0 indicating a positive (linear) relationship, and < 0 indicating a negative one
5. |Cor(X, Y)| = 1 if and only if Y = aX + b for some a, b ∈ R, with a ≠ 0
6. If X, Y are independent, then Cor(X, Y) = 0
7. If Cor(X, Y) = 0, then X, Y need not be independent
8. If (X, Y) is bivariate normal and Cor(X, Y) = 0, then X, Y are independent

4 Bivariate normal distribution
Assume that the random vector Z = (X, Y) ∈ R² has a bivariate normal distribution, Z ~ N(μ, Σ), where μ ∈ R² and Σ ∈ R^{2×2},
    μ = (μ_X, μ_Y),   Σ = [ σ_X²      ρσ_Xσ_Y ]
                          [ ρσ_Xσ_Y   σ_Y²    ]
Note that we have E[X] = μ_X, E[Y] = μ_Y, Var(X) = σ_X², Var(Y) = σ_Y², Cov(X, Y) = ρσ_Xσ_Y, and Cor(X, Y) = ρ.
The density of Z = (X, Y) is
    f_{X,Y}(z) = 1 / (2π sqrt(det(Σ))) · exp(−(1/2)(z − μ)^T Σ⁻¹ (z − μ))
Fact: ρ = 0 implies that X, Y are independent (Homework 3)
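The parameterization can be sanity-checked by simulation. A pure-Python sketch (our construction, not from the slides: setting Y = ρZ₁ + sqrt(1 − ρ²)Z₂ from independent standard normals Z₁, Z₂ is a standard way to induce Cor(X, Y) = ρ):

```python
import math
import random

def bivariate_normal_sample(n, rho, seed=0):
    # X = Z1, Y = rho*Z1 + sqrt(1 - rho^2)*Z2 gives Var(X) = Var(Y) = 1,
    # Cov(X, Y) = rho, and hence Cor(X, Y) = rho
    rng = random.Random(seed)
    xs, ys = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        xs.append(z1)
        ys.append(rho * z1 + math.sqrt(1 - rho ** 2) * z2)
    return xs, ys

def cor(x, y):
    # sample correlation: cosine of the angle between the centered vectors
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    xc = [v - mx for v in x]
    yc = [v - my for v in y]
    return sum(a * b for a, b in zip(xc, yc)) / math.sqrt(
        sum(a * a for a in xc) * sum(b * b for b in yc))

xs, ys = bivariate_normal_sample(50_000, rho=0.6)
print(cor(xs, ys))  # should come out close to 0.6
```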

5 Review: properties of sample correlation
Recall: cor(x, y) = (x − x̄1)^T (y − ȳ1) / (‖x − x̄1‖_2 ‖y − ȳ1‖_2)
Properties of cor:
1. cor(x, x) = 1
2. cor(x, y) = cor(y, x)
3. cor(ax + b, y) = sign(a) cor(x, y) for any a, b ∈ R
4. −1 ≤ cor(x, y) ≤ 1, with cor(x, y) > 0 indicating a positive (linear) relationship, and < 0 indicating a negative one
5. |cor(x, y)| = 1 if and only if y = ax + b for some a, b ∈ R, with a ≠ 0
6. cor(x, y) = 0 if and only if the centered vectors x − x̄1, y − ȳ1 are orthogonal
7. If x, y are centered, then cor(x, y) = cos θ, where θ is the angle between the vectors x, y ∈ R^n:
    cos θ = x^T y / (‖x‖_2 ‖y‖_2)

6 Examples: sample correlation [Figure: scatter plots "Perfect linear" and "Noisy linear", each annotated with its value of cor]

7 [Figure: scatter plots "Independent" and "Ball", each annotated with its value of cor]

8 [Figure: scatter plots "Perfect cubic" and "Outliers", each annotated with its value of cor]

9 Rank correlation
Spearman's rank correlation is only defined in the sample. It goes beyond linearity, and measures a monotone association between x, y ∈ R^n.
Given vectors x, y ∈ R^n, we first define the rank vector r_x ∈ R^n that ranks the components of x; e.g., if x = (0.7, 0.1, 1) then r_x = (2, 1, 3). We also define the ranks r_y based on y. Rank correlation is now the usual (sample) correlation of r_x and r_y:
    rcor(x, y) = cor(r_x, r_y)
Key property: rcor(x, y) = 1 if and only if there is a monotone increasing function f : R → R such that y_i = f(x_i) for each i = 1, ..., n
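The definition can be sketched in a few lines of pure Python (the helper names ranks and rcor are ours; the sketch assumes no ties, as in the slide's example):

```python
import math

def cor(x, y):
    """Usual sample correlation of two vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    xc = [v - mx for v in x]
    yc = [v - my for v in y]
    return sum(a * b for a, b in zip(xc, yc)) / math.sqrt(
        sum(a * a for a in xc) * sum(b * b for b in yc))

def ranks(x):
    """Rank vector: 1 for the smallest component, n for the largest (no ties)."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def rcor(x, y):
    """Spearman's rank correlation: usual correlation of the rank vectors."""
    return cor(ranks(x), ranks(y))

print(ranks([0.7, 0.1, 1.0]))   # [2.0, 1.0, 3.0], as in the slide's example
x = [-2.0, -1.0, 0.0, 1.0, 2.0, 3.0]
y = [v ** 3 for v in x]         # monotone increasing but not linear
print(rcor(x, y))               # 1.0: rank correlation detects the monotone link
```

For the cubic example, cor(x, y) is strictly below 1 while rcor(x, y) equals 1, which is exactly the point of the key property.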

10 Examples: rank correlation [Figure: scatter plots "Perfect linear" and "Noisy linear", each annotated with its values of rcor and cor]

11 [Figure: scatter plots "Independent" and "Ball", each annotated with its values of rcor and cor]

12 [Figure: scatter plots "Perfect cubic" and "Outliers", each annotated with its values of rcor and cor]

13 [Figure: scatter plots "Perfect quadratic" and "Perfect circle", each annotated with its values of rcor and cor]

14 Maximal correlation
Maximal correlation [1] is a notion of population correlation. It has no preference for linearity or monotonicity, and it characterizes independence completely.
Given two random variables X, Y ∈ R, the maximal correlation between X, Y is defined as
    mcor(X, Y) = max_{f,g} Cor(f(X), g(Y))
where the maximum is taken over all functions [2] f, g : R → R. Note that 0 ≤ mcor(X, Y) ≤ 1.
Key property: mcor(X, Y) = 0 if and only if X, Y are independent
[1] Gebelein (1947), "Das statistische Problem der Korrelation..."; Rényi (1959), "On Measures of Dependence"
[2] Actually, f, g have to be such that Var(f(X)) > 0, Var(g(Y)) > 0

15 Review: independence
Two random variables X, Y ∈ R are called independent if for any sets A, B ⊆ R,
    P(X ∈ A, Y ∈ B) = P(X ∈ A) P(Y ∈ B)
If X, Y have densities f_X, f_Y, and (X, Y) has a density f_{X,Y}, then this is equivalent to f_{X,Y}(s, t) = f_X(s) f_Y(t) for any s, t ∈ R. In other words, the joint density is the product of the marginal densities.
Important fact: if X, Y are independent, then for any functions f, g, we have
    E[f(X) g(Y)] = E[f(X)] E[g(Y)]
Hence X, Y independent implies that Cor(f(X), g(Y)) = 0 for any functions f, g, which means mcor(X, Y) = 0

16 Review: characteristic functions
The characteristic function of a random variable X ∈ R is the function h_X : R → C defined as
    h_X(t) = E[exp(itX)]
The characteristic function of a random vector (X, Y) ∈ R² is h_{X,Y} : R² → C,
    h_{X,Y}(s, t) = E[exp(i(sX + tY))]
Characteristic functions completely characterize the distribution of a random variable (hence the name). I.e., if h_X(t) = h_X'(t) for every t ∈ R, then X and X' must have the same distribution.
Important fact: X, Y are independent if and only if h_{X,Y}(s, t) = h_X(s) h_Y(t) for any s, t ∈ R. In other words, the joint characteristic function is the product of the marginal ones
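To make the definition concrete: for a standard normal, h_X(t) = exp(−t²/2), a real-valued function. A Monte Carlo sketch in pure Python (the function name, sample size, and seed are our arbitrary choices):

```python
import cmath
import math
import random

def char_fn_mc(t, n=200_000, seed=1):
    # Monte Carlo estimate of h_X(t) = E[exp(i t X)] for X ~ N(0, 1)
    rng = random.Random(seed)
    return sum(cmath.exp(1j * t * rng.gauss(0, 1)) for _ in range(n)) / n

est = char_fn_mc(1.0)
print(est.real, math.exp(-0.5))  # estimate vs. exact value exp(-1/2) ~ 0.607
print(est.imag)                  # near 0, since the density is symmetric
```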

17 Zero maximal correlation implies independence
Suppose that mcor(X, Y) = 0. Then taking [3] f(x) = exp(isx) and g(y) = exp(ity), we have
    Cor(exp(isX), exp(itY)) = 0
This means that
    Cov(exp(isX), exp(itY)) = 0
i.e.,
    E[exp(isX) exp(itY)] = E[exp(isX)] E[exp(itY)]
This holds for each s, t ∈ R, so X, Y are independent.
[3] Strictly speaking, f, g are supposed to take values in R (not C), but to get around this we can write exp(iθ) = cos θ + i sin θ and then break things up into real and imaginary parts

18 Maximal correlation in the sample
Given two vectors x, y ∈ R^n, and functions f, g : R → R, write f(x) ∈ R^n for the vector with ith component f(x_i), i.e., f(x) = (f(x_1), ..., f(x_n)), and similarly for g(y).
We'd like to define maximal correlation in the sample, analogous to its population definition (so that we can use it in practice!). Consider
    mcor(x, y) = max_{f,g} cor(f(x), g(y))
There's a big problem here: as defined, mcor(x, y) = 1 for any x, y ∈ R^n. (Why?)
We'll derive an algorithm to compute mcor in the population. This inspires an algorithm in the sample, and the sample version mcor is then defined as the output of this algorithm

19 Fixed points of maximal correlation
We define a norm on random variables Z ∈ R by ‖Z‖ = sqrt(E[Z²]).
Given any function f, we can always make f(Z) have the following properties:
- E[f(Z)] = 0, by letting f(Z) ← f(Z) − E[f(Z)] (centering)
- ‖f(Z)‖ = 1, by letting f(Z) ← f(Z)/‖f(Z)‖ (scaling)
Notice that for such an f, we have Var(f(Z)) = ‖f(Z)‖² = 1.
Therefore, when computing maximal correlation, we can restrict our attention to functions f, g such that E[f(X)] = E[g(Y)] = 0 and ‖f(X)‖ = ‖g(Y)‖ = 1. (Otherwise, we simply center and scale as needed, and this doesn't change the correlation.) Hence
    mcor(X, Y) = max_{E[f(X)]=E[g(Y)]=0, ‖f(X)‖=‖g(Y)‖=1} E[f(X) g(Y)]

20 Now notice that for such f, g,
    E[(f(X) − g(Y))²] = E[f²(X)] + E[g²(Y)] − 2 E[f(X) g(Y)] = 2 − 2 E[f(X) g(Y)]
and hence f, g are optimal functions for mcor if and only if they are also optimal for the minimization problem
    min_{E[f(X)]=E[g(Y)]=0, ‖f(X)‖=‖g(Y)‖=1} E[(f(X) − g(Y))²]
Key point: consider just minimizing over the function g (assuming that f is fixed and satisfies E[f(X)] = 0 and ‖f(X)‖ = 1). To do so, we can minimize the conditional expectation E[(f(X) − g(y))² | Y = y] for each fixed y. And to minimize this over g(y), we simply take
    g(y) = E[f(X) | Y = y]
As a function of Y, this is written as g(Y) = E[f(X) | Y]
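The minimization over g(y) in the key point is the familiar fact that conditional expectation minimizes conditional mean squared error. Spelled out, with c standing for the candidate value g(y):

```latex
\mathbb{E}\big[(f(X) - c)^2 \,\big|\, Y = y\big]
  = \mathbb{E}\big[f(X)^2 \,\big|\, Y = y\big]
    - 2c\,\mathbb{E}\big[f(X) \,\big|\, Y = y\big] + c^2
```

This is a convex quadratic in c, with derivative −2 E[f(X) | Y = y] + 2c, which vanishes at c = E[f(X) | Y = y].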

21 Remember, though, that we are restricting g to have expectation zero and norm one. Therefore we let
    g(Y) = E[f(X) | Y] / ‖E[f(X) | Y]‖
(Check: this satisfies both properties.)
The exact same arguments, but in reverse, show that minimizing over f for fixed g yields
    f(X) = E[g(Y) | X] / ‖E[g(Y) | X]‖
These are called the fixed point equations of the maximal correlation problem. That is, if there exist f, g that achieve the maximum mcor(X, Y) = E[f(X) g(Y)], then they must satisfy the above equations

22 Alternating conditional expectations algorithm
The alternating conditional expectations (ACE) algorithm [4] is motivated directly by these fixed point equations. The idea is just to start with a guess at one of the functions, and then iterate the two equations until convergence.
ACE algorithm: Set f_0(X) = (X − E[X])/‖X − E[X]‖. For k = 1, 2, 3, ...:
1. Let g_k(Y) = E[f_{k−1}(X) | Y] / ‖E[f_{k−1}(X) | Y]‖
2. Let f_k(X) = E[g_k(Y) | X] / ‖E[g_k(Y) | X]‖
3. Stop if E[f_k(X) g_k(Y)] = E[f_{k−1}(X) g_{k−1}(Y)]
Upon convergence, mcor(X, Y) = E[f_k(X) g_k(Y)].
This has been proven to converge under very general assumptions.
[4] Breiman and Friedman (1985), "Estimating Optimal Transformations for Multiple Regression and Correlation"

23 Back to maximal correlation in the sample
Given x, y ∈ R^n. As in the population case, if we are considering cor(f(x), g(y)) over all functions f, g, we can restrict our attention to functions f, g with
    1^T f(x) = 1^T g(y) = 0 and ‖f(x)‖_2 = ‖g(y)‖_2 = 1
where now ‖·‖_2 is the usual Euclidean norm. For such functions f, g, we have cor(f(x), g(y)) = f(x)^T g(y). Further,
    ‖f(x) − g(y)‖_2² = ‖f(x)‖_2² + ‖g(y)‖_2² − 2 f(x)^T g(y) = 2 − 2 f(x)^T g(y)
for such functions f, g, so maximizing cor(f(x), g(y)) is the same as minimizing ‖f(x) − g(y)‖_2².
We're going to derive a sample version of the ACE algorithm to approximately minimize ‖f(x) − g(y)‖_2², and then we define its output as the sample maximal correlation

24 The ACE algorithm in the sample
We define the ACE algorithm analogously to its definition in the population case. That is, we repeat iterations of the form
    g(y) = condexp[f(x) | y], then center and scale g(y)
    f(x) = condexp[g(y) | x], then center and scale f(x)
where condexp[· | y] denotes a sample version of the conditional expectation given y, and similarly for condexp[· | x].
A good question is: how do we compute these sample conditional expectations? (There's not a simple convenient sample analog like there is for unconditional expectation, variance, or covariance.)
Answer: use a smoother S. We use the notation S(y | x) to mean that the result is an estimate for y = (y_1, ..., y_n) as a function of x = (x_1, ..., x_n)

25 A smoother S(y | x) could be, e.g., something as simple as a histogram or linear regression. It could also be something more fancy like kernel regression or local linear regression. (We'll see examples of smoothers later in the course.)
[Figure: two example smoothers, "smoother 1" and "smoother 2", fit to the same scatter plot]

26 Assume that we've picked a smoother S to use in the ACE algorithm. (We'd like a smoother that can fit very general shapes of functions, i.e., not just a histogram or linear regression.)
ACE algorithm: Set f_0(x) = (x − x̄1)/‖x − x̄1‖_2. For k = 1, 2, 3, ...:
1. Let G(y) = S(f_{k−1}(x) | y), and center and scale:
    g_k(y) = (G(y) − Ḡ(y)1)/‖G(y) − Ḡ(y)1‖_2
2. Let F(x) = S(g_k(y) | x), and center and scale:
    f_k(x) = (F(x) − F̄(x)1)/‖F(x) − F̄(x)1‖_2
3. Stop if |f_k(x)^T g_k(y) − f_{k−1}(x)^T g_{k−1}(y)| is small
Upon convergence, define mcor(x, y) = f_k(x)^T g_k(y).
Unfortunately, the ACE algorithm in the sample is only guaranteed to converge under some restrictive conditions. But in practice, it still tends to perform quite well
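The loop can be sketched end to end in pure Python. This is our illustrative stand-in, not the slides' implementation: the smoother is a crude k-nearest-neighbor average, and for simplicity it runs a fixed number of iterations instead of the stopping rule in step 3 (it also assumes the smoothed vectors never come out constant, so the normalizations are well defined).

```python
import math

def center_scale(v):
    """Center to mean zero and scale to unit Euclidean norm."""
    n = len(v)
    m = sum(v) / n
    c = [vi - m for vi in v]
    nrm = math.sqrt(sum(ci * ci for ci in c))
    return [ci / nrm for ci in c]

def knn_smoother(z, x, k=3):
    """Toy stand-in for S(z | x): average z over the k nearest x-values."""
    out = []
    for xi in x:
        nearest = sorted(range(len(x)), key=lambda j: abs(x[j] - xi))[:k]
        out.append(sum(z[j] for j in nearest) / k)
    return out

def sample_ace(x, y, iters=20, k=3):
    f = center_scale(x)  # f_0(x) = (x - xbar 1)/||x - xbar 1||_2
    for _ in range(iters):
        g = center_scale(knn_smoother(f, y, k))  # g_k(y) from S(f_{k-1}(x) | y)
        f = center_scale(knn_smoother(g, x, k))  # f_k(x) from S(g_k(y) | x)
    return sum(a * b for a, b in zip(f, g))      # mcor(x, y) = f_k(x)^T g_k(y)

x = [i / 10 - 2 for i in range(41)]  # grid on [-2, 2]
y = [v ** 3 for v in x]              # monotone but nonlinear
print(sample_ace(x, y))              # close to 1, and above cor(x, y) ~ 0.92
```

On the cubic example, the iterates learn transformations that nearly linearize the relationship, so the output is close to 1, exceeding the plain Pearson correlation.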

27 Measures of correlation in R
The cor function available in the base R distribution can be used to compute (Pearson's) correlation, and also (Spearman's) rank correlation. E.g.,
    cor(x, y)                     # default is method="pearson"
    cor(x, y, method="spearman")  # rank correlation
The function ace in the package acepack implements the alternating conditional expectations algorithm to compute maximal correlation. E.g.,
    a = ace(x, y)
    cor(a$tx, a$ty)  # maximal correlation
Note: this ace implementation doesn't scale the vectors in the way discussed in class, so the maximal correlation is not simply sum(a$tx * a$ty)

28 Recap: measures of correlation
In this lecture we reviewed the basic facts about correlation, both in the population (random variables) and in the sample (vectors). If random variables X, Y ∈ R are jointly normal, then independence and uncorrelatedness are the same thing.
Rank correlation is simply the usual sample correlation, except replacing the vectors x, y ∈ R^n with the ranks of their components r_x, r_y ∈ R^n. This aims to capture monotone associations that are not necessarily linear.
Maximal correlation is defined in the population as the maximum correlation over functions of our two random variables. The maximal correlation equals zero if and only if the random variables are independent.
The alternating conditional expectations (ACE) algorithm is an elegant way to compute maximal correlation in the population, and furthermore, it allows us to define maximal correlation in the sample

29 Next time: more measures of correlation
More about maximal correlation; distance correlation
[Figure: original data and transformed data, showing the transformations f(x) of x and g(y) of y]


More information

3.3.1 Linear functions yet again and dot product In 2D, a homogenous linear scalar function takes the general form:

3.3.1 Linear functions yet again and dot product In 2D, a homogenous linear scalar function takes the general form: 3.3 Gradient Vector and Jacobian Matri 3 3.3 Gradient Vector and Jacobian Matri Overview: Differentiable functions have a local linear approimation. Near a given point, local changes are determined by

More information

An Information Theory For Preferences

An Information Theory For Preferences An Information Theor For Preferences Ali E. Abbas Department of Management Science and Engineering, Stanford Universit, Stanford, Ca, 94305 Abstract. Recent literature in the last Maimum Entrop workshop

More information

Chapter 6. Self-Adjusting Data Structures

Chapter 6. Self-Adjusting Data Structures Chapter 6 Self-Adjusting Data Structures Chapter 5 describes a data structure that is able to achieve an epected quer time that is proportional to the entrop of the quer distribution. The most interesting

More information

Regression Analysis. Ordinary Least Squares. The Linear Model

Regression Analysis. Ordinary Least Squares. The Linear Model Regression Analysis Linear regression is one of the most widely used tools in statistics. Suppose we were jobless college students interested in finding out how big (or small) our salaries would be 20

More information

P1 Chapter 4 :: Graphs & Transformations

P1 Chapter 4 :: Graphs & Transformations P1 Chapter 4 :: Graphs & Transformations jfrost@tiffin.kingston.sch.uk www.drfrostmaths.com @DrFrostMaths Last modified: 14 th September 2017 Use of DrFrostMaths for practice Register for free at: www.drfrostmaths.com/homework

More information

Homework Assignments Math /02 Spring 2015

Homework Assignments Math /02 Spring 2015 Homework Assignments Math 1-01/0 Spring 015 Assignment 1 Due date : Frida, Januar Section 5.1, Page 159: #1-, 10, 11, 1; Section 5., Page 16: Find the slope and -intercept, and then plot the line in problems:

More information

Unit 2 Notes Packet on Quadratic Functions and Factoring

Unit 2 Notes Packet on Quadratic Functions and Factoring Name: Period: Unit Notes Packet on Quadratic Functions and Factoring Notes #: Graphing quadratic equations in standard form, verte form, and intercept form. A. Intro to Graphs of Quadratic Equations: a

More information

Problem Set 1. MAS 622J/1.126J: Pattern Recognition and Analysis. Due: 5:00 p.m. on September 20

Problem Set 1. MAS 622J/1.126J: Pattern Recognition and Analysis. Due: 5:00 p.m. on September 20 Problem Set MAS 6J/.6J: Pattern Recognition and Analysis Due: 5:00 p.m. on September 0 [Note: All instructions to plot data or write a program should be carried out using Matlab. In order to maintain a

More information

Machine learning - HT Maximum Likelihood

Machine learning - HT Maximum Likelihood Machine learning - HT 2016 3. Maximum Likelihood Varun Kanade University of Oxford January 27, 2016 Outline Probabilistic Framework Formulate linear regression in the language of probability Introduce

More information

Problem Set #6: OLS. Economics 835: Econometrics. Fall 2012

Problem Set #6: OLS. Economics 835: Econometrics. Fall 2012 Problem Set #6: OLS Economics 835: Econometrics Fall 202 A preliminary result Suppose we have a random sample of size n on the scalar random variables (x, y) with finite means, variances, and covariance.

More information

10.2 The Unit Circle: Cosine and Sine

10.2 The Unit Circle: Cosine and Sine 0. The Unit Circle: Cosine and Sine 77 0. The Unit Circle: Cosine and Sine In Section 0.., we introduced circular motion and derived a formula which describes the linear velocit of an object moving on

More information

Joint Distributions. (a) Scalar multiplication: k = c d. (b) Product of two matrices: c d. (c) The transpose of a matrix:

Joint Distributions. (a) Scalar multiplication: k = c d. (b) Product of two matrices: c d. (c) The transpose of a matrix: Joint Distributions Joint Distributions A bivariate normal distribution generalizes the concept of normal distribution to bivariate random variables It requires a matrix formulation of quadratic forms,

More information

Introduction to bivariate analysis

Introduction to bivariate analysis Introduction to bivariate analysis When one measurement is made on each observation, univariate analysis is applied. If more than one measurement is made on each observation, multivariate analysis is applied.

More information

Business Statistics 41000: Homework # 2 Solutions

Business Statistics 41000: Homework # 2 Solutions Business Statistics 4000: Homework # 2 Solutions Drew Creal February 9, 204 Question #. Discrete Random Variables and Their Distributions (a) The probabilities have to sum to, which means that 0. + 0.3

More information

( ) 2 3x=0 3x(x 3 1)=0 x=0 x=1

( ) 2 3x=0 3x(x 3 1)=0 x=0 x=1 Stewart Calculus ET 5e 05497;4. Partial Derivatives; 4.7 Maimum and Minimum Values. (a) First we compute D(,)= f (,) f (,) [ f (,)] =(4)() () =7. Since D(,)>0 and f (,)>0, f has a local minimum at (,)

More information

What if the characteristic equation has a double root?

What if the characteristic equation has a double root? MA 360 Lecture 17 - Summary of Recurrence Relations Friday, November 30, 018. Objectives: Prove basic facts about basic recurrence relations. Last time, we looked at the relational formula for a sequence

More information

1 Exercises for lecture 1

1 Exercises for lecture 1 1 Exercises for lecture 1 Exercise 1 a) Show that if F is symmetric with respect to µ, and E( X )

More information

Introduction to bivariate analysis

Introduction to bivariate analysis Introduction to bivariate analysis When one measurement is made on each observation, univariate analysis is applied. If more than one measurement is made on each observation, multivariate analysis is applied.

More information

9.07 Introduction to Probability and Statistics for Brain and Cognitive Sciences Emery N. Brown

9.07 Introduction to Probability and Statistics for Brain and Cognitive Sciences Emery N. Brown 9.07 Introduction to Probabilit and Statistics for Brain and Cognitive Sciences Emer N. Brown I. Objectives Lecture 4: Transformations of Random Variables, Joint Distributions of Random Variables A. Understand

More information

Lecture 2: Review of Probability

Lecture 2: Review of Probability Lecture 2: Review of Probability Zheng Tian Contents 1 Random Variables and Probability Distributions 2 1.1 Defining probabilities and random variables..................... 2 1.2 Probability distributions................................

More information

5.1 Consistency of least squares estimates. We begin with a few consistency results that stand on their own and do not depend on normality.

5.1 Consistency of least squares estimates. We begin with a few consistency results that stand on their own and do not depend on normality. 88 Chapter 5 Distribution Theory In this chapter, we summarize the distributions related to the normal distribution that occur in linear models. Before turning to this general problem that assumes normal

More information

Dependence. MFM Practitioner Module: Risk & Asset Allocation. John Dodson. September 11, Dependence. John Dodson. Outline.

Dependence. MFM Practitioner Module: Risk & Asset Allocation. John Dodson. September 11, Dependence. John Dodson. Outline. MFM Practitioner Module: Risk & Asset Allocation September 11, 2013 Before we define dependence, it is useful to define Random variables X and Y are independent iff For all x, y. In particular, F (X,Y

More information

Optimization and Simulation

Optimization and Simulation Optimization and Simulation Variance reduction Michel Bierlaire Transport and Mobility Laboratory School of Architecture, Civil and Environmental Engineering Ecole Polytechnique Fédérale de Lausanne M.

More information

Next is material on matrix rank. Please see the handout

Next is material on matrix rank. Please see the handout B90.330 / C.005 NOTES for Wednesday 0.APR.7 Suppose that the model is β + ε, but ε does not have the desired variance matrix. Say that ε is normal, but Var(ε) σ W. The form of W is W w 0 0 0 0 0 0 w 0

More information

Jointly Distributed Random Variables

Jointly Distributed Random Variables Jointly Distributed Random Variables CE 311S What if there is more than one random variable we are interested in? How should you invest the extra money from your summer internship? To simplify matters,

More information

MATH c UNIVERSITY OF LEEDS Examination for the Module MATH2715 (January 2015) STATISTICAL METHODS. Time allowed: 2 hours

MATH c UNIVERSITY OF LEEDS Examination for the Module MATH2715 (January 2015) STATISTICAL METHODS. Time allowed: 2 hours MATH2750 This question paper consists of 8 printed pages, each of which is identified by the reference MATH275. All calculators must carry an approval sticker issued by the School of Mathematics. c UNIVERSITY

More information

Math 20 Spring 2005 Final Exam Practice Problems (Set 2)

Math 20 Spring 2005 Final Exam Practice Problems (Set 2) Math 2 Spring 2 Final Eam Practice Problems (Set 2) 1. Find the etreme values of f(, ) = 2 2 + 3 2 4 on the region {(, ) 2 + 2 16}. 2. Allocation of Funds: A new editor has been allotted $6, to spend on

More information

Regression and Covariance

Regression and Covariance Regression and Covariance James K. Peterson Department of Biological ciences and Department of Mathematical ciences Clemson University April 16, 2014 Outline A Review of Regression Regression and Covariance

More information

Problem Solving. Correlation and Covariance. Yi Lu. Problem Solving. Yi Lu ECE 313 2/51

Problem Solving. Correlation and Covariance. Yi Lu. Problem Solving. Yi Lu ECE 313 2/51 Yi Lu Correlation and Covariance Yi Lu ECE 313 2/51 Definition Let X and Y be random variables with finite second moments. the correlation: E[XY ] Yi Lu ECE 313 3/51 Definition Let X and Y be random variables

More information

ADVANCED PROBABILITY: SOLUTIONS TO SHEET 1

ADVANCED PROBABILITY: SOLUTIONS TO SHEET 1 ADVANCED PROBABILITY: SOLUTIONS TO SHEET 1 Last compiled: November 6, 213 1. Conditional expectation Exercise 1.1. To start with, note that P(X Y = P( c R : X > c, Y c or X c, Y > c = P( c Q : X > c, Y

More information

Chapter 3. Expectations

Chapter 3. Expectations pectations - Chapter. pectations.. Introduction In this chapter we define five mathematical epectations the mean variance standard deviation covariance and correlation coefficient. We appl these general

More information

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix)

EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) 1 EC212: Introduction to Econometrics Review Materials (Wooldridge, Appendix) Taisuke Otsu London School of Economics Summer 2018 A.1. Summation operator (Wooldridge, App. A.1) 2 3 Summation operator For

More information

Dependence and scatter-plots. MVE-495: Lecture 4 Correlation and Regression

Dependence and scatter-plots. MVE-495: Lecture 4 Correlation and Regression Dependence and scatter-plots MVE-495: Lecture 4 Correlation and Regression It is common for two or more quantitative variables to be measured on the same individuals. Then it is useful to consider what

More information

f X, Y (x, y)dx (x), where f(x,y) is the joint pdf of X and Y. (x) dx

f X, Y (x, y)dx (x), where f(x,y) is the joint pdf of X and Y. (x) dx INDEPENDENCE, COVARIANCE AND CORRELATION Independence: Intuitive idea of "Y is independent of X": The distribution of Y doesn't depend on the value of X. In terms of the conditional pdf's: "f(y x doesn't

More information