A COMPARISON STUDY BETWEEN THE CORRELATION COEFFICIENTS OF PEARSON, SPEARMAN AND KENDALL WITH NUMERICAL APPLICATIONS


Nicolae POPOVICIU
Hyperion University of Bucharest, popoviciunicolae15@yahoo.ro

Abstract. The work systematically presents three types of correlation coefficient: Pearson, Spearman and Kendall. For the last two types the rank correlation coefficient (RCC) is studied. The RCC uses two ordinal random variables $X$ and $Y$ and measures the degree of similarity between them, or assesses the significance of the relation between them. For each RCC a computation algorithm is given. Some numerical examples illustrate the theory. The conclusions section shows how to decide which type of correlation coefficient should be used in a numerical problem.

Keywords: Pearson's coefficient, PCC, Spearman's coefficient, SCC, Kendall's coefficient, KCC, PSK Program Description.

1. Introduction

For random variables we recall some usual notations and meanings:

Discrete random variable: $X = \begin{pmatrix} x_1 & x_2 & \dots & x_n \\ p_1 & p_2 & \dots & p_n \end{pmatrix}$, $n \ge 1$;
Continuous random variable: $X = \begin{pmatrix} x \\ f(x) \end{pmatrix}$, where $f(x)$ is the probability density;
$X$, $Y$ are random variables (discrete or continuous);
Mean value: $M(X) = m = m_X$; dispersion or variance: $D(X) = \sigma^2 = \sigma_X^2 = \mathrm{Var}(X)$;
Reduced (normalized) random variable: $X' = \dfrac{X - m}{\sigma}$; $M(X') = 0$;
Covariance of $X$ and $Y$ is a real number defined as
$\mathrm{cov}(X, Y) = M[(X - m_X)(Y - m_Y)]$ (theoretical formula);
$\mathrm{cov}(X, Y) = M(XY) - M(X)\,M(Y)$ (computational formula);

$\mathrm{cov}(X, Y) = \mathrm{cov}(Y, X)$ (commutativity);
$\mathrm{cov}(X, X) = M(X \cdot X) - M(X)\,M(X) = M(X^2) - [M(X)]^2 = D(X) = \sigma_X^2$;
If $X$, $Y$ are independent, then $\mathrm{cov}(X, Y) = 0$.

2. Correlation coefficient of Pearson

If $\mathrm{cov}(X, Y) \ne 0$, then the random variables are correlated. If $\mathrm{cov}(X, Y) = 0$, then $X$ and $Y$ are not correlated, but we do not know whether they are independent. That is why a new measure of correlation is necessary.

The real number $\rho(X, Y)$ defined as
$\rho(X, Y) = \dfrac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$
is called the correlation coefficient of Pearson (PCC), after Karl Pearson. Sometimes one denotes $\rho(X, Y) = \rho_P(X, Y) = \mathrm{cor}(X, Y)$.

Properties of Pearson's correlation coefficient:
$\rho(X, Y) = M(X' Y')$;
$\rho(X, X) = \dfrac{\mathrm{cov}(X, X)}{\sigma_X \sigma_X} = 1$;
$\rho(X, -X) = \dfrac{\mathrm{cov}(X, -X)}{\sigma_X \sigma_X} = -1$;
$-1 \le \rho(X, Y) \le 1$.

Remark 1. The formulas $\mathrm{cov}(X, Y) = M[(X - m_X)(Y - m_Y)]$ and $\rho(X, Y) = \dfrac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$ are called the theoretical formulas. There also exist the estimated formulas, marked by Estim and based on a sample of volume $N$:

$\bar{X} = \dfrac{1}{N} \sum_{i=1}^{N} x_i$, $M(X) \approx \bar{X}$ (estimation for the mean value);
$s_X^2 = \dfrac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{X})^2$, $s_X^2 \approx D(X)$, $s_X = \sqrt{s_X^2}$ (estimation for the dispersion and the standard deviation);
$X'_{Estim} = \dfrac{X - \bar{X}}{s_X}$; $Y'_{Estim} = \dfrac{Y - \bar{Y}}{s_Y}$;

$\mathrm{cov}(X, Y)_{Estim} = \dfrac{1}{N - 1} \sum_{i=1}^{N} (x_i - \bar{X})(y_i - \bar{Y})$;
$\rho(X, Y)_{Estim} = \dfrac{\mathrm{cov}(X, Y)_{Estim}}{s_X s_Y}$.

The estimation formulas do not use the probabilities $p_i$.

3. Spearman's rank correlation coefficient

Charles Edward Spearman: English statistician and psychologist; founder of factor analysis.

We are interested in studying a dichotomous process $P$ (i.e. one separated into two main parts). For $P$ we have to collect only two sets of data of size $n$: $A = (a_j)$ and $B = (b_j)$; $j = 1, \dots, n$. Generally $a_j$, $b_j$ are real numbers, but by using an appropriate transformation we obtain natural numbers. Example: $a = 4.2$ billion becomes $b = 4200$ million; $a \in \mathbb{R}$, $b \in \mathbb{N}$.

The initial data generate the pairs $(a_j, b_j)$. All numbers use the same unit of measure. Example: $a_j$ represents the student's mark obtained in the theoretical exam; $b_j$ represents the student's mark obtained in the practical exam. These numbers are of ordinal type: the value $a_j$ itself is not important, but its rank (place, position) in the sequence is.

The initial data $A = (a_j)$ and $B = (b_j)$ generate the random variables $X = (x_i)$ and $Y = (y_i)$, where the index $i$ ranges over $i = 1, \dots, n$, and $x_i$ and $y_i$ are natural numbers. We look for the level of correlation between the random variables $X$ and $Y$.

Spearman's rank correlation coefficient (SCC) is denoted by $\rho_S(X, Y)$ and can be calculated by two formulas.

Version 1. $\rho_{S1} = 1 - \dfrac{6 \sum_{i=1}^{n} D_i^2}{n(n^2 - 1)}$, where $D_i = x_i - y_i$.

Version 2. Use the pairs of initial values $(x_i, y_i)$, $i = 1, \dots, n$, where the values $x_i$ are arranged in increasing order. The $x_i$ values, in increasing order, are denoted $u_i$. This generates the ranks $r_X = U$, $r_Y = V$, where $U = (u_i)$, $V = (v_i)$; $i = 1, \dots, n$; $u_i$, $v_i$ are natural numbers.

Then $\rho_{S2} = 1 - \dfrac{6 \sum_{i=1}^{n} D_i^2}{n(n^2 - 1)}$, where $D_i = u_i - v_i$ (difference between ranks).

4. Kendall's rank correlation coefficient

Maurice George Kendall was an English statistician. We denote by $\rho_K(X, Y)$ Kendall's rank correlation coefficient (KCC). The notations and the main idea of the earlier sections still apply. The computation of $\rho_K(X, Y)$ is done in several steps.

Step 1. Construct Table 1 of input data $X = (x_i)$, $Y = (y_i)$ and the pairs $(x_i, y_i)$; $i = 1, \dots, n$, where the values $x_i$ are arranged in increasing order. Hence we use the pairs $(x_i, y_i)$, where the redundant (repeated) data are not eliminated.

Step 2. (Optional.) Find the ranks $r_X$ and $r_Y$ for the random variables $X$ and $Y$.
a) The ranks of $x_i$ are the natural numbers 1, 2, 3, etc. Find the rank of $y_i$ corresponding to $x_i$. So we construct Table 2 of ranks. The repeated data are not eliminated. The new random variable $Y$ is denoted $Y'$. Obtain the ranks $r_Y$: $r_1, r_2, r_3, \dots$
b) Table 1 and Table 2 yield Table 3 of ranks $r_X$ and $r_Y$.

Step 3. Because the values $x_i$ are in natural increasing order, the ranks $r_X$ are not used in the computation of Kendall's correlation coefficient. Only the ranks $r_Y$ are used. Construct the variable $RSY = (u_i)$, $i = 1, \dots, n$, which contains the superior ranks for the variable $Y$. For this construction we take a fixed value $y_j$ from the initial Table 1 and count how many values $y_k$ situated after $y_j$ have the property $y_k > y_j$. The result is Table 4 of superior ranks for the variable $Y$.

Step 4. Construct the variable $RIY = (v_i)$, $i = 1, \dots, n$, which contains the inferior ranks for the variable $Y$. For this construction we take a fixed value $y_j$ from the initial Table 1 and count how many values $y_k$ situated after $y_j$ have the property $y_k < y_j$. The result is Table 5 of inferior ranks for the variable $Y$. The redundant values are not eliminated.

Step 5. Compute the rank differences $d_i = u_i - v_i$, $D = (d_i)$, and construct Table 6.

Step 6. Compute the sum $S = \sum_{i=1}^{n} d_i$.

Step 7. Compute Kendall's correlation coefficient $\rho_K = \rho_K(X, Y) = \dfrac{2S}{n(n - 1)} = \dfrac{S}{C_n^2}$.

Remark 2. For each type of correlation coefficient we have written a C++ program (PSK Program) that computes the specified coefficient.

// PSK Program description
// code=1 for Pearson; code=2 for Spearman; code=3 for Kendall
// Pearson correlation.
// corpxy = covxy/(dx*dy); DX = MX2 - MX*MX; DY = MY2 - MY*MY
// cov(X,Y) = covxy = M[(X-MX)(Y-MY)] or cov(X,Y) = M(XY) - MX*MY
//
// Spearman correlation.
// Version 1. We use the unmodified initial data X and Y and the pairs (xi,yi).
// Version 2. We arrange the values xi in increasing order.
//
// Kendall correlation.
// The vector X has its components xi in increasing order. The redundant values are not eliminated. The vector Y generates the pairs (xi,yi).
// Observation. If redundant values yi appear in the vector Y (for example, 9 and 9; 8 and 8), then we apply a perturbation to these values, chosen so that it does not change the ranks.

// For example, we put 9.01 in place of the first 9 and 8.01 in place of the first 8, then 9.02 for the second 9 and 8.02 for the second 8; the ranks are not changed.

5. Numerical applications

Application 1 (S). For a group of 10 students one knows the marks $X = (x_i)$ obtained in the theoretical exam and the marks $Y = (y_i)$ obtained in the practical exam (Table 1).

Table 1 (S) [7]. Columns: Student; Mark $X$; Mark $Y$.

a) Compute the Pearson correlation coefficient $\rho(X, Y) = \dfrac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y}$.
b) Compute Spearman's rank correlation coefficient $\rho_S = \rho_S(X, Y)$.

Solution. a) $M(X) = m_X = 55/10 = 5.5$; $M(Y) = m_Y = 55/10 = 5.5$; $M(X^2) = 383/10 = 38.3$; $M(Y^2) = 385/10 = 38.5$; $D(X) = 8.05$; $D(Y) = 8.25$; $\sigma_X = 2.8373$; $\sigma_Y = \sqrt{8.25} \approx 2.8723$;
$\mathrm{cov}(X, Y) = M[(X - m_X)(Y - m_Y)] = \dfrac{70.5}{10} = 7.05$; $\rho(X, Y) = \dfrac{\mathrm{cov}(X, Y)}{\sigma_X \sigma_Y} = \dfrac{7.05}{\sigma_X \sigma_Y} = 0.865$; $\rho_P(X, Y) = 0.865$.
The random variables $X$ and $Y$ are strongly correlated.

b) Compute Spearman's rank coefficient $\rho_S = \rho_S(X, Y)$.

Version 1. Use the formula $\rho_{S1} = 1 - \dfrac{6S}{n(n^2 - 1)}$; $S = \sum_{i=1}^{n} D_i^2$, $D_i = x_i - y_i$.

Version 2. Use the formula $\rho_{S2} = 1 - \dfrac{6S}{n(n^2 - 1)}$; $S = \sum_{i=1}^{n} D_i^2$, $D_i = u_i - v_i$.

Version 1. Use the initial data: the marks $x_i$, $y_i$.

Table 2 (S) (initial exam marks). Columns: Student; Mark $X$; Mark $Y$; $D_i$; $D_i^2$ (the sum is 24).

$S = \sum_{i=1}^{10} D_i^2 = 24$; $\rho_{S1} = 1 - \dfrac{6 \cdot 24}{10(100 - 1)} = 1 - \dfrac{144}{990} = 0.854545$.

Version 2. Use the pairs of initial marks $(x_i, y_i)$, $i = 1, \dots, 10$, where the marks $x_i$ are arranged in increasing order. This generates the ranks $r_X = U$, $r_Y = V$, where $U = (u_i)$, $V = (v_i)$; $i = 1, \dots, n$.

Table 3 (S) (of ranks). Columns: Student; Mark $X$; Mark $Y$; $r_X$; $r_Y$; $D_i$; $D_i^2$ (the sum is 24).

$S = \sum_{i=1}^{10} D_i^2 = 24$; $\rho_{S2} = 1 - \dfrac{6 \cdot 24}{10(100 - 1)} = 0.854545$.

Both versions give the same result.

Observation 1. We see that the coefficient values $\rho(X, Y) = 0.865$ and $\rho_S = 0.8545$ are very compatible, but Spearman's coefficient is easier to compute.
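The two versions of Spearman's formula can be sketched in a few lines. This is a minimal illustration in Python, not the author's C++ PSK Program, which the paper does not reproduce. The function names `ranks`, `spearman_from_ranks` and `spearman` are hypothetical, and the sketch assumes there are no repeated values (ties).

```python
from typing import List, Sequence


def spearman_from_ranks(u: Sequence[float], v: Sequence[float]) -> float:
    """rho_S = 1 - 6 * sum(D_i^2) / (n (n^2 - 1)), with D_i = u_i - v_i.

    Assumes u and v already hold ranks (or marks that behave like ranks).
    """
    n = len(u)
    if n != len(v) or n < 2:
        raise ValueError("u and v must have the same length n >= 2")
    s = sum((a - b) ** 2 for a, b in zip(u, v))
    return 1 - 6 * s / (n * (n * n - 1))


def ranks(values: Sequence[float]) -> List[int]:
    """Rank of each value, 1 for the smallest. Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Rank both samples first, then apply the difference formula."""
    return spearman_from_ranks(ranks(x), ranks(y))
```

With hypothetical marks `x = [1, ..., 10]` and `y = [3, 2, 1, 6, 5, 4, 9, 8, 7, 10]` (illustrative data, not the paper's table), the squared differences also sum to 24, so both functions reproduce the value $1 - 144/990$ computed above.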

The pairs $(u_i, v_i)$ with $u_i$ in increasing order are, however, rather tedious to construct by hand. For a large volume of data a computer program is necessary (for example a C++ program).

Application 2 (K). [11] For 17 companies we know two types of data:
$X = (x_i)$: the sums spent on advertising (in millions);
$Y = (y_i)$: the total capital of each company (in millions).
The data $x_i$ are arranged in increasing order. Table 1 contains all 17 pairs $(x_i, y_i)$.

Table 1 (K) (advertising spend and total capital). Rows: $X = (x_i)$; $Y = (y_i)$.

Find Kendall's rank correlation coefficient of the random variables $X$ and $Y$.

Solution. We use several steps.

Step 0. Arrange the values $x_i$ of $X$ in increasing order (in this problem this is already done). The repeated (redundant) data are not eliminated.

Step 1. Construct Table 1 with the pairs $(x_i, y_i)$; $i = 1, \dots, n$ (see the problem formulation).

Step 2. Construct the vector variable $RSY = (u_i)$, $i = 1, \dots, n$, containing the superior ranks of the variable $Y$. We use Table 1 and, for a fixed value $y_j$, we count the values $y_k$ (placed after the value $y_j$) having the property $y_k > y_j$. The result is Table 2 for the variable $Y$.

Table 2 (K). Superior ranks for the variable $Y$. Rows: $Y = (y_i)$; $RSY = (u_i)$.

Step 3. Construct the vector variable $RIY = (v_i)$, $i = 1, \dots, n$, containing the inferior ranks of the variable $Y$. We use Table 1 and, for a fixed value $y_j$, we count the values $y_k$ (placed after the value $y_j$) having the property $y_k < y_j$. The result is Table 3 for the variable $Y$.

Table 3 (K). Inferior ranks for the variable $Y$. Rows: $Y = (y_i)$; $RIY = (v_i)$.

Step 4. Compute the rank differences $d_i = u_i - v_i$ and denote them $D = (d_i)$ (Table 4).

Table 4 (K). Rank differences. Rows: $RSY = (u_i)$; $RIY = (v_i)$; $D = (d_i)$.

Step 5. Compute the sum of the $d_i$ and denote it $S = \sum_{i=1}^{n} d_i$. We obtain $S = 82$.

Step 6. Compute Kendall's rank coefficient $\rho_K = \rho_K(X, Y) = \dfrac{2S}{n(n - 1)} = \dfrac{2 \cdot 82}{17 \cdot 16}$. The result is $\rho_K = 0.603$. The variables $X$ and $Y$ have a good correlation.

Application 3 (PSK). We repeat Application 1 with the data from Table 1 (PSK).

Table 1 (PSK) [7]. Columns: Student; Mark $X$; Mark $Y$.

Compute all the correlation coefficients: $\rho_P(X, Y)$ Pearson; $\rho_S(X, Y)$ Spearman; $\rho_K(X, Y)$ Kendall.

Solution. $D_1 = D_2 = \{1, 2, \dots, 10\}$; $\|D_1\| = 10 - 1 = 9$; $n = 10$. Our computer program (PSK Program) gives the following results:
$\rho_P(X, Y) = 0.854545$; $\rho_{S1}(X, Y) = 0.854545$; $\rho_{S2}(X, Y) = 0.854545$; $\rho_K(X, Y) = 0.\dots$
The results are compatible with one another.
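The Pearson computation and the superior/inferior rank procedure for Kendall's coefficient can be sketched as follows. This is a hedged Python illustration, not the PSK Program itself. The names `pearson` and `kendall` are hypothetical, the Pearson function uses the theoretical formulas with equal weights $1/n$, and the Kendall function assumes no repeated values in $Y$ (the paper handles repeats by perturbation).

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """rho = cov(X, Y) / (sigma_X * sigma_Y), using cov = M(XY) - M(X) M(Y)."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length n >= 2")
    mx = sum(x) / n
    my = sum(y) / n
    cov = sum(a * b for a, b in zip(x, y)) / n - mx * my
    var_x = sum(a * a for a in x) / n - mx * mx
    var_y = sum(b * b for b in y) / n - my * my
    return cov / math.sqrt(var_x * var_y)


def kendall(x: Sequence[float], y: Sequence[float]) -> float:
    """rho_K = 2 S / (n (n - 1)), with S = sum of (u_i - v_i).

    u_i counts later y values greater than y_i (superior rank),
    v_i counts later y values smaller than y_i (inferior rank),
    after the pairs have been arranged by increasing x.
    """
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length n >= 2")
    ys = [b for _, b in sorted(zip(x, y))]  # Step 0: order pairs by x
    s = 0
    for j in range(n):
        later = ys[j + 1:]
        u = sum(1 for yk in later if yk > ys[j])  # superior rank
        v = sum(1 for yk in later if yk < ys[j])  # inferior rank
        s += u - v
    return 2 * s / (n * (n - 1))
```

For illustrative marks `x = [1, ..., 10]` and `y = [3, 2, 1, 6, 5, 4, 9, 8, 7, 10]` (hypothetical data, not the paper's table), `pearson` returns $1 - 144/990 \approx 0.854545$. This agrees with Spearman's formula because both samples are permutations of $1, \dots, 10$, which is consistent with the equal Pearson and Spearman values reported above. `kendall` gives $27/45 = 0.6$ on the same data.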

6. Conclusions

In order to draw several conclusions, we use some special notations. $D_1$ is the domain of the $x_i$; $D_2$ is the domain of the $y_i$; if $D_1 = D_2$ then we denote $D = D_1 = D_2$; $S_1 = \max\{x \mid x \in D_1\}$; $s_1 = \min\{x \mid x \in D_1\}$; $\|D_1\| = S_1 - s_1$; $\|D_2\| = S_2 - s_2$, etc.

Our problem is to determine the level of correlation between the numerical data $A$ and $B$, or between the random variables $X$ and $Y$. The answer can be obtained by several methods.

Method 1. Compute Pearson's correlation coefficient $\rho = \rho_P(X, Y)$.
Method 2. Compute Spearman's rank correlation coefficient $\rho_S(X, Y)$.
Method 3. Compute Kendall's rank correlation coefficient $\rho_K(X, Y)$.

The problem is how to choose the appropriate method. We suggest the following answer.
a) Method 1 can be used for any kind of data $X$ and $Y$.
b) If $D_1 = D_2$, $\|D_1\| = \|D_2\|$ and the norm is not a large number, then we recommend Method 2.
c) If $D_1 \ne D_2$, we recommend Method 3.

REFERENCES

[1] Popoviciu N., Tutorial on Statistical Formulas. Parameters Estimation. Confidence Intervals, University Hyperion of Bucharest Annals, Exact Sciences and Engineering Series, Vol. 1, Victor Publishing, Bucharest, 2013.
[2] Purcaru I., Bâscă O., People, Ideas, and Facts from the History of Mathematics, Economic Publishing, Bucharest.
[3] Tomescu Rodica, Ijacu Daniela, Probability and Mathematical Statistics, PRINTECH Publishing, Bucharest, 2005.
[4] Turban E., Aronson J. E., Decision Support Systems and Intelligent Systems, 5th ed., Prentice Hall, New Jersey.
[5] SSP-IBM Scientific Subroutine Package, IBM Vienna.
[6] Văduva I., Computer-Aided Simulation Models, Technical Publishing House, Bucharest.
[7] Voineagu V., Mitruţ C-tin et al., The Theoretical and Statistical Macroeconomic; Tests, Practical Work, Case Studies, Economic Publishing, Bucharest.
[8] Wikipedia: Multivariate Normal.
[9] Wikipedia: Birth and death processes.
[10]
[11] K
