Bivariate Paired Numerical Data
Pearson's correlation, Spearman's ρ and Kendall's τ, tests of independence
University of California, San Diego
Instructor: Ery Arias-Castro
http://math.ucsd.edu/~eariasca/teaching.html

Bivariate Paired Numerical Data

Suppose we have paired data of the form
$$(X_1, Y_1), \dots, (X_n, Y_n)$$
where the variables X and Y are both numerical. We want to know how they are related / associated / depend on each other. Note that X and Y may be measurements of different types.

Summary statistics. In addition to summarizing each variable, some form of correlation between the variables is computed.

Graphics. In addition to a boxplot of each variable, a scatterplot helps visualize how the variables vary together.

Example: stopping distance as a function of speed

Consider the cars dataset in the datasets package in R. The data give the speed of cars X (in miles per hour) and the distances taken to stop Y (in feet). Note that the data were recorded in the 1920s.
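As a small illustration of the "summarize each variable" step (using a made-up paired sample in the spirit of the cars data, not the actual dataset; the course itself uses R, this sketch uses Python's standard library):

```python
import statistics

# Hypothetical paired sample (speed, stopping distance) -- NOT the real cars data.
speed = [4, 7, 8, 10, 12, 15, 18, 20, 24]
dist = [2, 4, 16, 18, 24, 26, 46, 52, 93]

# Summarize each variable separately before looking at their association.
for name, v in [("speed", speed), ("dist", dist)]:
    print(name,
          "mean:", round(statistics.mean(v), 2),
          "sd:", round(statistics.stdev(v), 2),
          "median:", statistics.median(v))
```

A scatterplot of `dist` against `speed` (e.g. with a plotting library) would then show how the two variables vary together.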
Pearson's correlation (a measure of linear association)

The covariance between two random variables X and Y, with respective means $\mu_X = E[X]$ and $\mu_Y = E[Y]$, is given by
$$\mathrm{Cov}(X,Y) = E\big[(X - \mu_X)(Y - \mu_Y)\big] = E[XY] - E[X]E[Y]$$
Their correlation is
$$\mathrm{Corr}(X,Y) = \frac{\mathrm{Cov}(X,Y)}{\sqrt{\mathrm{Var}(X)\,\mathrm{Var}(Y)}}$$
It is always in $[-1, 1]$, and equal to $\pm 1$ if and only if $X = aY + b$ for some constants $a \ne 0$ and $b$. If this is the case, we say that X and Y are perfectly correlated. The closer the Pearson correlation is to 1 in absolute value, the stronger the linear association.

NOTE. $\mathrm{Corr}(X,Y) = 0$ does not imply that X and Y are independent. For example, take $X \sim \mathrm{Unif}[-1,1]$ and $Y = X^2$. They are perfectly associated (Y is a deterministic function of X), yet
$$\mathrm{Cov}(X,Y) = E[XY] - E[X]E[Y] = E[X^3] = 0$$
(Indeed, $E[X^k] = 0$ for any odd integer $k \ge 1$, by symmetry.)

Sample covariance and correlation

The sample covariance is
$$S_{XY} = \frac{1}{n-1} \sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y) \qquad (1)$$
$$\phantom{S_{XY}} = \frac{1}{n-1} \sum_{i=1}^n X_i Y_i - \frac{n}{n-1} \bar X \bar Y \qquad (2)$$
The sample correlation is
$$R_{XY} = \frac{\sum_{i=1}^n (X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum_{i=1}^n (X_i - \bar X)^2 \sum_{i=1}^n (Y_i - \bar Y)^2}} = \frac{S_{XY}}{S_X S_Y}$$
where $S_X$ and $S_Y$ are the sample standard deviations for X and Y, respectively.
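Formulas (1)-(2) and the sample correlation can be sketched directly from their definitions (a minimal illustration with helper names of our choosing, not library functions):

```python
import math

def sample_cov(x, y):
    """Sample covariance S_XY, formula (1)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    return sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / (n - 1)

def sample_corr(x, y):
    """Sample correlation R_XY = S_XY / (S_X * S_Y)."""
    return sample_cov(x, y) / math.sqrt(sample_cov(x, x) * sample_cov(y, y))

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 8.0, 9.9]    # roughly linear in x
print(sample_corr(x, y))          # close to 1: strong linear association
```

Note that `sample_cov(x, x)` is just the sample variance, so the same helper supplies the denominator $S_X S_Y$.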
Correlation t-test

Assume we have an i.i.d. sample $(X_1, Y_1), \dots, (X_n, Y_n)$, and want to test
$$H_0: \mathrm{Corr}(X,Y) = 0 \quad\text{versus}\quad H_1: \mathrm{Corr}(X,Y) \ne 0$$
It is natural to reject for large values of $|R|$, where $R = R_{XY}$. Equivalently, we reject for large values of $|T|$, where
$$T = \frac{R \sqrt{n-2}}{\sqrt{1 - R^2}}$$
("Equivalently" because $T$ is a strictly increasing function of $R$.)

Theory. Assuming that $(X_1, Y_1), \dots, (X_n, Y_n)$ are i.i.d. bivariate normal, under the null hypothesis of zero correlation, $T$ has a t-distribution with $n - 2$ degrees of freedom.

NOTE. Even when the sample is not bivariate normal, $T$ is asymptotically standard normal as long as X and Y have finite second moments and are independent (thus under a stronger assumption than having zero correlation).

The bivariate normal distribution

The random vector $(X, Y)$ is said to have a bivariate normal distribution if any (deterministic) linear combination $aX + bY$ is normally distributed. Five parameters define a bivariate normal distribution:
- The marginal means $\mu_X$ and $\mu_Y$.
- The marginal variances $\sigma_X^2$ and $\sigma_Y^2$.
- The correlation $\rho = \mathrm{Corr}(X,Y)$.
If $(X, Y)$ is bivariate normal with these parameters, then for any $a, b \in \mathbb{R}$, $aX + bY$ is normal with mean
$$a \mu_X + b \mu_Y$$
and variance
$$a^2 \sigma_X^2 + b^2 \sigma_Y^2 + 2ab \rho \sigma_X \sigma_Y$$
(This follows from simple moment calculations.)

If $(X, Y)$ is bivariate normal, then $\mathrm{Corr}(X,Y) = 0$ implies that X and Y are independent.
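The test statistic itself is a one-liner (a sketch only: turning $T$ into a p-value requires the CDF of the $t_{n-2}$ distribution, e.g. from `scipy.stats`, which is not shown here):

```python
import math

def corr_t_stat(r, n):
    """T = R * sqrt(n - 2) / sqrt(1 - R^2); compare to t_{n-2} under H0."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)

# With n = 27 pairs and an observed sample correlation R = 0.5:
t = corr_t_stat(0.5, 27)
print(round(t, 3))   # -> 2.887
```

For such a value of $T$, the two-sided p-value against $t_{25}$ would be well below 0.05.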
Permutation test

We can permute the data to break the pairing. What we are really testing is
$$H_0: X \text{ and } Y \text{ are independent}$$
A permutation test based on Pearson's correlation is designed for alternatives where X and Y are linearly associated. In principle, the test works as follows:
1. For each permutation $\pi$ of $\{1, \dots, n\}$, compute $R^\pi$, the sample correlation of $(X_i, Y_{\pi(i)})$, $i = 1, \dots, n$.
2. The p-value here is the fraction of the $R^\pi$ that are at least as large as the observed sample correlation $R_{\mathrm{obs}}$:
$$\mathrm{pval} = \frac{\#\{\pi : R^\pi \ge R_{\mathrm{obs}}\}}{n!}$$

Permutation test

Usually, the number of permutations ($= n!$) is too large to do that, so we estimate the p-value by Monte Carlo, which amounts to sampling $B$ permutations ($B$ is a large integer) $\pi_1, \dots, \pi_B$ uniformly at random and computing the fraction
$$\widehat{\mathrm{pval}} = \frac{\#\{b : R^{\pi_b} \ge R_{\mathrm{obs}}\} + 1}{B + 1}$$

Spearman's rank correlation

This is the same as Pearson's correlation except that the observations are replaced by their ranks. Let $A_i$ denote the rank of $X_i$ within $X_1, \dots, X_n$. Let $B_i$ denote the rank of $Y_i$ within $Y_1, \dots, Y_n$. Spearman's rank correlation, denoted $R^{\mathrm{spear}} = R^{\mathrm{spear}}_{XY}$, is the sample Pearson correlation of $(A_i, B_i)$, $i = 1, \dots, n$. If there are no ties,
$$R^{\mathrm{spear}} = 1 - \frac{6}{n^3 - n} \sum_{i=1}^n (A_i - B_i)^2$$
Note that $R^{\mathrm{spear}} \in [-1, 1]$, and equal to $1$ (resp. $-1$) if and only if there is an increasing (resp. decreasing) function $f$ such that $Y_i = f(X_i)$. The closer the Spearman correlation is to 1 in absolute value, the stronger the monotonic association.
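The Monte Carlo version of the permutation test can be sketched as follows (a minimal illustration: `pearson` is a local helper, the data are made up, and the two-sided variant with $|R|$ is used):

```python
import math
import random

def pearson(x, y):
    """Sample Pearson correlation of the paired sample (x, y)."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxy = sum((a - xbar) * (b - ybar) for a, b in zip(x, y))
    sxx = sum((a - xbar) ** 2 for a in x)
    syy = sum((b - ybar) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def perm_pvalue(x, y, B=999, seed=0):
    """Monte Carlo permutation p-value: (#{b : |R_pi_b| >= |R_obs|} + 1) / (B + 1)."""
    rng = random.Random(seed)
    r_obs = abs(pearson(x, y))
    yy = list(y)
    count = 0
    for _ in range(B):
        rng.shuffle(yy)            # permuting y breaks the pairing
        if abs(pearson(x, yy)) >= r_obs:
            count += 1
    return (count + 1) / (B + 1)

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, 8, 7]       # strongly increasing together
print(perm_pvalue(x, y))            # expect a small p-value
```

The `+ 1` in numerator and denominator guarantees the estimated p-value is never exactly zero, matching the formula above.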
A test for independence versus monotonic association rejects for large values of $|R^{\mathrm{spear}}|$. This test is distribution-free if X and Y are continuous. Equivalently, we reject for large values of $|T^{\mathrm{spear}}|$, where
$$T^{\mathrm{spear}} = \frac{R^{\mathrm{spear}} \sqrt{n-2}}{\sqrt{1 - (R^{\mathrm{spear}})^2}}$$
Theory. Under the null hypothesis, $T^{\mathrm{spear}}$ has asymptotically the standard normal distribution.

NOTE. $R^{\mathrm{spear}}$ is a consistent estimator for Spearman's ρ, defined as
$$\rho = 3\, E\big[\mathrm{sign}\big((X_1 - X_2)(Y_1 - Y_3)\big)\big]$$
where $(X_1, Y_1)$, $X_2$, $Y_3$ are independent. We can have $\rho = 0$ even if X and Y are not independent, and indeed, the test is not universally consistent as a test for independence.

Kendall's tau

Kendall's tau is very similar to Spearman's rho. The sample version is defined as
$$T^{\mathrm{kend}} = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \mathrm{sign}\big((X_j - X_i)(Y_j - Y_i)\big)$$
Note that $T^{\mathrm{kend}} \in [-1, 1]$, and equal to $1$ (resp. $-1$) if and only if there is an increasing (resp. decreasing) function $f$ such that $Y_i = f(X_i)$. The resulting test is also distribution-free if the variables are continuous.

Theory. Under the null hypothesis, $T^{\mathrm{kend}}$, once standardized by its null standard deviation (equal to $\sqrt{2(2n+5)/(9n(n-1))}$ when there are no ties), has asymptotically the standard normal distribution.

NOTE. $T^{\mathrm{kend}}$ is a consistent estimator for
$$\tau = E\big[\mathrm{sign}\big((X_1 - X_2)(Y_1 - Y_2)\big)\big]$$
where $(X_1, Y_1)$ and $(X_2, Y_2)$ are independent. We can have $\tau = 0$ even if X and Y are not independent, and indeed, the test is not universally consistent as a test for independence.

The joint cumulative distribution function

The joint CDF of $(X, Y)$ (a random vector with values in $\mathbb{R}^2$) is defined as
$$F_{XY}(x, y) = P(X \le x, Y \le y)$$
Theory. $F_{XY}(x, y) = F_X(x) F_Y(y)$ for all $(x, y)$ if and only if X and Y are independent.
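Both rank-based coefficients are straightforward to compute from their definitions (a sketch assuming no ties, with helper names of our choosing):

```python
def ranks(v):
    """Ranks 1..n of the entries of v (assumes no ties)."""
    order = sorted(range(len(v)), key=lambda i: v[i])
    r = [0] * len(v)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

def spearman(x, y):
    """R_spear via the no-ties formula 1 - 6 * sum(d_i^2) / (n^3 - n)."""
    a, b = ranks(x), ranks(y)
    n = len(x)
    d2 = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return 1 - 6 * d2 / (n ** 3 - n)

def kendall(x, y):
    """T_kend = (2 / (n(n-1))) * sum_{i<j} sign((X_j - X_i)(Y_j - Y_i))."""
    n = len(x)
    s = 0
    for i in range(n):
        for j in range(i + 1, n):
            prod = (x[j] - x[i]) * (y[j] - y[i])
            s += (prod > 0) - (prod < 0)   # sign, with ties contributing 0
    return 2 * s / (n * (n - 1))

x = [1, 2, 3, 4, 5]
y = [1, 4, 9, 16, 25]      # y = x^2, monotone increasing on this range
print(spearman(x, y))       # -> 1.0: perfect monotone association
print(kendall(x, y))        # -> 1.0
```

This makes the contrast with Pearson concrete: for $y = x^2$ on positive $x$, the Pearson correlation is below 1, but both rank coefficients equal 1 exactly.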
Tests for independence based on the empirical distributions

In the spirit of the Kolmogorov-Smirnov test, Hoeffding (1948) and others later proposed tests of independence based on the empirical CDFs. An example of such a test rejects for large values of
$$H = \sup_{x, y \in \mathbb{R}} \big| F^n_{XY}(x, y) - F^n_X(x) F^n_Y(y) \big|$$
where
- $F^n_X$ is the empirical CDF of $X_1, \dots, X_n$,
- $F^n_Y$ is the empirical CDF of $Y_1, \dots, Y_n$,
- $F^n_{XY}$ is the joint empirical CDF of $(X_1, Y_1), \dots, (X_n, Y_n)$:
$$F^n_{XY}(x, y) = \frac{1}{n} \sum_{i=1}^n I\{X_i \le x, Y_i \le y\}$$
If X and Y are continuous, the test is distribution-free. (The asymptotic distribution is also known in closed form.) The test is universally consistent against any alternative to independence.

Energy statistics

A more recent proposal of Székely, Rizzo and Bakirov (2007) is based on the following distance covariance statistic:
$$V_n(X, Y) = \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n A_{ij} B_{ij}$$
where
$$A_{ij} = a_{ij} - \bar a_{i.} - \bar a_{.j} + \bar a_{..}, \qquad a_{ij} = |X_i - X_j|, \qquad \bar a_{i.} = \frac{1}{n} \sum_{j=1}^n a_{ij}, \qquad \bar a_{..} = \frac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n a_{ij}$$
and $B_{ij}$ is defined similarly based on the Y's.

Theory. If X and Y have densities $f_X$ and $f_Y$ with finite first moment, then with probability one, $V_n(X, Y) \to V(X, Y)$ as $n \to \infty$, where
$$V(X, Y)^2 := \iint \frac{\big(f_{XY}(x, y) - f_X(x) f_Y(y)\big)^2}{x^2 y^2} \, dx \, dy$$
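A brute-force sketch of the statistic $H$ (since the empirical CDFs are step functions that jump only at observed values, it suffices to evaluate the difference on the grid of observed sample points; `ecdf_stat` is a name of our choosing):

```python
def ecdf_stat(x, y):
    """H = max over the grid of observed values of |F^n_XY(x,y) - F^n_X(x) F^n_Y(y)|.

    For ECDFs, the sup over R^2 is attained on this grid, since all three
    functions are constant between observed sample coordinates.
    """
    n = len(x)
    h = 0.0
    for gx in x:
        for gy in y:
            fxy = sum(xi <= gx and yi <= gy for xi, yi in zip(x, y)) / n
            fx = sum(xi <= gx for xi in x) / n
            fy = sum(yi <= gy for yi in y) / n
            h = max(h, abs(fxy - fx * fy))
    return h

x = [1, 2, 3, 4]
y = [1, 2, 3, 4]            # perfectly dependent sample
print(ecdf_stat(x, y))       # -> 0.25
```

This O(n^3) loop is only for exposition; a practical implementation would sort the data first.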
Note that $V(X, Y) = 0$ if and only if X and Y are independent. The authors recommend using the distance correlation, defined as
$$\frac{V_n(X, Y)}{\sqrt{V_n(X, X) \, V_n(Y, Y)}}$$
We have
$$\frac{V_n(X, Y)}{\sqrt{V_n(X, X) \, V_n(Y, Y)}} \longrightarrow \frac{V(X, Y)}{\sqrt{V(X, X) \, V(Y, Y)}}$$
Calibration (whether using the distance covariance or the distance correlation) can be done by permutation in practice. All this remains valid if X and Y are random vectors (of possibly different dimensions).
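The sample distance covariance and distance correlation can be sketched directly from the double-centering recipe above (a minimal illustration for scalar samples; helper names are our own, and calibration by permutation would wrap `dcov` exactly like the Pearson permutation test earlier):

```python
import math

def _centered_dist(v):
    """Double-centered distance matrix: A_ij = a_ij - abar_i. - abar_.j + abar_.."""
    n = len(v)
    a = [[abs(v[i] - v[j]) for j in range(n)] for i in range(n)]
    row = [sum(a[i]) / n for i in range(n)]
    col = [sum(a[i][j] for i in range(n)) / n for j in range(n)]
    grand = sum(row) / n
    return [[a[i][j] - row[i] - col[j] + grand for j in range(n)] for i in range(n)]

def dcov(x, y):
    """V_n(X, Y) = (1/n^2) * sum_ij A_ij B_ij."""
    n = len(x)
    A, B = _centered_dist(x), _centered_dist(y)
    return sum(A[i][j] * B[i][j] for i in range(n) for j in range(n)) / n ** 2

def dcor(x, y):
    """Distance correlation V_n(X,Y) / sqrt(V_n(X,X) V_n(Y,Y))."""
    denom = math.sqrt(dcov(x, x) * dcov(y, y))
    return dcov(x, y) / denom if denom > 0 else 0.0

x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(dcor(x, x))   # ~1.0 for a sample paired with itself
```

For vector-valued X or Y, only `_centered_dist` changes: replace `abs(v[i] - v[j])` by the Euclidean distance between the two vectors.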