MS-E2112 Multivariate Statistical Analysis (5cr) Lecture 6: Bivariate Correspondence Analysis - part II


Kenneth Beasley


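4 Independence

The independence between the variables $x$ and $y$ can be tested using the chi-square test. The null hypothesis of the test is

$$H_0 : p_{jk} = p_{j.}\, p_{.k} \quad \text{for all } j, k,$$

and the test statistic is given by

$$\chi^2 = \sum_{j=1}^{J} \sum_{k=1}^{K} \frac{(n_{jk} - n^*_{jk})^2}{n^*_{jk}},$$

where $n^*_{jk}$ denotes the frequency expected under independence.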

5 Independence

Under random sampling, the $n_{jk}$ follow a multinomial distribution with parameters $n, p_{11}, \ldots, p_{JK}$, and $E[n_{jk}] = n p_{jk}$. In the test statistic above, the $n p_{jk}$ are estimated, under the null, by $n^*_{jk} = n_{j.} n_{.k} / n$. When the sample size $n$ is large, the test statistic has, under the null hypothesis, approximately a chi-square distribution with $(K-1)(J-1)$ degrees of freedom. Thus the null hypothesis (independence between the variables $x$ and $y$) is rejected at level $\alpha$ if

$$\chi^2 > \chi^2_{(K-1)(J-1),\,1-\alpha}.$$
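The test statistic and its degrees of freedom can be computed directly from a table of counts. The following is a minimal sketch in plain Python; the function name `chi_square_statistic` and the table `counts` are hypothetical and only serve as an illustration.

```python
def chi_square_statistic(table):
    """Return (chi2, degrees_of_freedom) for a J x K table of counts n_jk."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]        # n_j.
    col_totals = [sum(col) for col in zip(*table)]  # n_.k
    chi2 = 0.0
    for j, row in enumerate(table):
        for k, n_jk in enumerate(row):
            # Expected count under independence: n*_jk = n_j. n_.k / n
            expected = row_totals[j] * col_totals[k] / n
            chi2 += (n_jk - expected) ** 2 / expected
    dof = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, dof


# Hypothetical 3 x 3 contingency table (illustration only).
counts = [[20, 10, 5], [10, 20, 10], [5, 10, 30]]
chi2, dof = chi_square_statistic(counts)
print(chi2, dof)
```

The null hypothesis is rejected at level $\alpha$ when `chi2` exceeds the $(1-\alpha)$ quantile of the chi-square distribution with `dof` degrees of freedom. If SciPy is available, that quantile can be obtained with `scipy.stats.chi2.ppf(1 - alpha, dof)`.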

6 Links

Chi-square distribution

Multinomial distribution


8 Chi-square distance

When the data are in the form of a frequency distribution, the distance between rows (or columns) is measured using a weighted Euclidean distance. The distance between two rows $j_1$ and $j_2$ is given by

$$d^2(j_1, j_2) = \sum_{k=1}^{K} \frac{1}{f_{.k}} \left( \frac{f_{j_1 k}}{f_{j_1 .}} - \frac{f_{j_2 k}}{f_{j_2 .}} \right)^2.$$

The Euclidean distance gives the same weight to each column. The $\chi^2$ distance instead weights each column by the inverse of its average frequency $f_{.k}$, so that every column has the same relative importance. Dividing each squared term by the expected frequency standardizes the variance: it compensates for the larger variance in high frequencies and the smaller variance in low frequencies. Without this standardization, the differences between larger proportions would tend to be large and thus dominate the distance calculation, while the differences between smaller proportions would tend to be swamped. The weighting factors equalize these differences.

9 Chi-square distance

The distance between two columns $k_1$ and $k_2$ is given by

$$d^2(k_1, k_2) = \sum_{j=1}^{J} \frac{1}{f_{j.}} \left( \frac{f_{j k_1}}{f_{.k_1}} - \frac{f_{j k_2}}{f_{.k_2}} \right)^2.$$
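Both distances can be evaluated straight from the definitions. The sketch below is plain Python; the function names and the table `counts` are hypothetical. The column distance is obtained by applying the row formula to the transposed table, which swaps the roles of $f_{j.}$ and $f_{.k}$.

```python
def chi_square_row_distance(table, j1, j2):
    """Squared chi-square distance between the profiles of rows j1 and j2."""
    n = sum(sum(row) for row in table)
    f = [[x / n for x in row] for row in table]  # relative frequencies f_jk
    f_row = [sum(row) for row in f]              # f_j.
    f_col = [sum(col) for col in zip(*f)]        # f_.k
    return sum(
        (f[j1][k] / f_row[j1] - f[j2][k] / f_row[j2]) ** 2 / f_col[k]
        for k in range(len(f_col))
    )


def chi_square_col_distance(table, k1, k2):
    """Squared chi-square distance between the profiles of columns k1 and k2."""
    transposed = [list(col) for col in zip(*table)]
    return chi_square_row_distance(transposed, k1, k2)


# Hypothetical table: rows 0 and 1 are proportional, row 2 is not.
counts = [[10, 20, 30], [20, 40, 60], [5, 5, 5]]
print(chi_square_row_distance(counts, 0, 1))  # proportional rows: (numerically) zero
print(chi_square_row_distance(counts, 0, 2))
print(chi_square_col_distance(counts, 0, 2))
```

Rows with proportional counts have identical profiles, so their distance is zero regardless of how many observations each row contains.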


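11 Let $Z \in \mathbb{R}^{J \times K}$, where

$$Z_{jk} = \frac{f_{jk} - f_{j.} f_{.k}}{\sqrt{f_{j.} f_{.k}}}.$$

Clearly

$$\sum_{j=1}^{J} (f_{jk} - f_{j.} f_{.k}) = \sum_{j=1}^{J} f_{jk} - f_{.k} \sum_{j=1}^{J} f_{j.} = f_{.k} - f_{.k} = 0.$$

Similarly,

$$\sum_{k=1}^{K} (f_{jk} - f_{j.} f_{.k}) = 0.$$

Thus, the matrix $Z$ gives scaled and centered relative frequencies of the variables. Moreover, the scaling is such that the elements

$$Z_{jk} = \frac{f_{jk} - f_{j.} f_{.k}}{\sqrt{f_{j.} f_{.k}}} = \frac{f_{jk} - f^*_{jk}}{\sqrt{f^*_{jk}}}, \qquad f^*_{jk} = f_{j.} f_{.k},$$

are the terms that are squared and summed (up to the factor $n$) in the chi-square test statistic that is used for testing the independence of the variables.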

12 A large positive value $Z_{jk}$ indicates a large contribution to the test statistic. This indicates a positive association between row $j$ and column $k$. (More observations than expected under independence.)

A large negative value $Z_{jk}$ also indicates a large contribution to the test statistic, but this indicates a negative association between row $j$ and column $k$. (Fewer observations than expected under independence.)

Values near zero indicate no contribution to the test statistic. (The number of observations is close to the expected number under independence.)

Let $V = Z^T Z$ and let $W = Z Z^T$. Now

$$\chi^2 = n \,\mathrm{trace}(V) = n \,\mathrm{trace}(W).$$
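The centering property and the trace identity are easy to confirm numerically. The following NumPy sketch uses a hypothetical table `N`; the variable names are chosen for illustration only.

```python
import numpy as np

# Hypothetical 3 x 3 contingency table (illustration only).
N = np.array([[20, 10, 5], [10, 20, 10], [5, 10, 30]], dtype=float)
n = N.sum()
F = N / n               # relative frequencies f_jk
r = F.sum(axis=1)       # row margins f_j.
c = F.sum(axis=0)       # column margins f_.k
E = np.outer(r, c)      # f*_jk = f_j. f_.k

Z = (F - E) / np.sqrt(E)
V = Z.T @ Z
W = Z @ Z.T

# Chi-square statistic from the counts, and via the two traces.
chi2_direct = ((N - n * E) ** 2 / (n * E)).sum()
chi2_trace_V = n * np.trace(V)
chi2_trace_W = n * np.trace(W)
print(chi2_direct, chi2_trace_V, chi2_trace_W)

# Row and column sums of the centered frequencies vanish.
print((F - E).sum(axis=0), (F - E).sum(axis=1))
```

The sign of each entry of `Z` shows whether the corresponding cell holds more or fewer observations than expected under independence.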


14 Principal component analysis is based on maximizing Euclidean distances. In the context of frequency distributions, the proper distance between variables is the chi-square distance. Thus, for frequency distributions, PCA has to be applied to modified data.

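15 The chi-square distance between two row profiles can be written as

$$d^2(j_1, j_2) = \sum_{k=1}^{K} \frac{1}{f_{.k}} \left( \frac{f_{j_1 k}}{f_{j_1 .}} - \frac{f_{j_2 k}}{f_{j_2 .}} \right)^2 = \sum_{k=1}^{K} \left( \frac{f_{j_1 k}}{f_{j_1 .} \sqrt{f_{.k}}} - \frac{f_{j_2 k}}{f_{j_2 .} \sqrt{f_{.k}}} \right)^2.$$

Thus, if the row profiles are scaled, the usual Euclidean metric can be used on the new scaled data.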
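As a quick numerical check of this identity, the sketch below compares the chi-square distance between two row profiles with the plain Euclidean distance between the scaled profiles. The table `N` and the chosen rows are hypothetical.

```python
import numpy as np

# Hypothetical 3 x 3 contingency table (illustration only).
N = np.array([[20, 10, 5], [10, 20, 10], [5, 10, 30]], dtype=float)
F = N / N.sum()
r = F.sum(axis=1)                 # f_j.
c = F.sum(axis=0)                 # f_.k

profiles = F / r[:, None]         # row profiles f_jk / f_j.
scaled = profiles / np.sqrt(c)    # f_jk / (f_j. sqrt(f_.k))

j1, j2 = 0, 2
d2_chi_square = (((profiles[j1] - profiles[j2]) ** 2) / c).sum()
d2_euclidean = ((scaled[j1] - scaled[j2]) ** 2).sum()
print(d2_chi_square, d2_euclidean)
```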

16 Let $R \in \mathbb{R}^{J \times K}$, where

$$R_{jk} = \frac{f_{jk}}{f_{j.} \sqrt{f_{.k}}} - \sqrt{f_{.k}}.$$

The matrix $R$ contains the scaled and shifted row profiles. The shift is the weighted mean of the scaled profiles, since

$$\sum_{j=1}^{J} f_{j.} \frac{f_{jk}}{f_{j.} \sqrt{f_{.k}}} = \sqrt{f_{.k}}.$$

Let $R_j$ denote the $j$th row of $R$. Performing PCA on the row profiles amounts to finding orthonormal vectors (directions) $u_i$ such that the projection $P_i(\cdot)$ onto $u_i$ maximizes the weighted sum of the Euclidean distances,

$$\sum_{j=1}^{J} f_{j.}\, d^2(0, P_i(R_j)),$$

under the constraint that $u_i$ is orthogonal to all $u_l$, $1 \le l < i$.

17 The problem is again a maximization problem under a constraint and, as in the usual PCA, the solution is given by the eigenvalues and the eigenvectors of the matrix

$$V = \sum_{j=1}^{J} f_{j.} R_j^T R_j.$$

Some matrix algebra is needed to show that

$$V = \sum_{j=1}^{J} f_{j.} R_j^T R_j = Z^T Z.$$
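The identity follows from $R_{jk} = Z_{jk} / \sqrt{f_{j.}}$, so that $f_{j.} R_{jk} R_{jl} = Z_{jk} Z_{jl}$. The NumPy sketch below checks this numerically on a hypothetical table `N`; it is an illustration, not a proof.

```python
import numpy as np

# Hypothetical 3 x 3 contingency table (illustration only).
N = np.array([[20, 10, 5], [10, 20, 10], [5, 10, 30]], dtype=float)
F = N / N.sum()
r = F.sum(axis=1)   # f_j.
c = F.sum(axis=0)   # f_.k

Z = (F - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# Scaled and shifted row profiles: R_jk = f_jk / (f_j. sqrt(f_.k)) - sqrt(f_.k)
R = F / (r[:, None] * np.sqrt(c)) - np.sqrt(c)

# Weighted sum of outer products of the rows of R.
V_from_R = sum(r[j] * np.outer(R[j], R[j]) for j in range(len(r)))
V_from_Z = Z.T @ Z

print(np.allclose(V_from_R, V_from_Z))
print(r @ R)  # weighted mean of the rows of R: (numerically) zero
```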

18 Let $\lambda_i$ denote the $i$th largest eigenvalue of the matrix $V$ and let $u_i$ denote the corresponding unit length eigenvector. Let $u_{i,k}$ denote the $k$th element of $u_i$. The value (score) of the row profile $j$ (associated with modality $A_j$) on the $i$th principal component is given by

$$\varphi_{i,j} = \sum_{k=1}^{K} u_{i,k} R_{jk}.$$

It can be proven that $\varphi_i$ is centered such that

$$\sum_{j=1}^{J} f_{j.} \varphi_{i,j} = 0,$$

and that the (weighted) variance of $\varphi_i$ is $\lambda_i$.
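A minimal NumPy sketch of the row analysis, assuming a hypothetical table `N`: the eigenvectors of $V = Z^T Z$ are computed with `numpy.linalg.eigh`, the numerically zero eigenvalues are dropped, and the scores are checked for centering and variance. The cut-off `1e-12` is an arbitrary tolerance chosen for this example.

```python
import numpy as np

# Hypothetical 3 x 3 contingency table (illustration only).
N = np.array([[20, 10, 5], [10, 20, 10], [5, 10, 30]], dtype=float)
F = N / N.sum()
r = F.sum(axis=1)   # f_j.
c = F.sum(axis=0)   # f_.k

Z = (F - np.outer(r, c)) / np.sqrt(np.outer(r, c))
R = F / (r[:, None] * np.sqrt(c)) - np.sqrt(c)

# eigh returns eigenvalues in ascending order; reorder to descending.
eigvals, eigvecs = np.linalg.eigh(Z.T @ Z)
order = np.argsort(eigvals)[::-1]
keep = eigvals[order] > 1e-12      # drop (numerically) zero eigenvalues
lam = eigvals[order][keep]         # lambda_1 >= lambda_2 >= ...
U = eigvecs[:, order][:, keep]     # column i holds u_i

Phi = R @ U                        # Phi[j, i] = phi_{i,j}
weighted_mean = r @ Phi            # sum_j f_j. phi_{i,j}
weighted_var = r @ Phi**2          # sum_j f_j. phi_{i,j}^2

print(lam)
print(weighted_mean, weighted_var)
```

The sign of each eigenvector is arbitrary, so the scores on an axis are determined only up to a common sign flip.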

19 Contribution of modalities

The contribution of the modality $A_j$ to the construction of the axis $u_i$ is given by

$$\frac{f_{j.} (\varphi_{i,j})^2}{\lambda_i}.$$

20 Quality of the representation

The quality of the representation of the centered row profile $R_j$ by the principal axis $i$ is measured by the squared cosine of the angle between the vector $OR_j$ and $u_i$:

$$\cos^2(\alpha) = \left( \frac{\langle OR_j, u_i \rangle}{\| OR_j \| \, \| u_i \|} \right)^2 = \frac{(\varphi_{i,j})^2}{\| OR_j \|^2}.$$

If the value is close to 1, the quality of the representation is good.

Note that the formula above does not contain the weight $f_{j.}$. A modality can therefore be close to the axis $u_i$, and thus well represented (well explained), and still have a low contribution to the axis because of its low weight $f_{j.}$.
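Contributions and squared cosines for the rows can be computed side by side. The sketch below assumes a hypothetical table `N` and recomputes the row scores so that it runs on its own; the names `contribution` and `cos2` are illustrative.

```python
import numpy as np

# Hypothetical 3 x 3 contingency table (illustration only).
N = np.array([[20, 10, 5], [10, 20, 10], [5, 10, 30]], dtype=float)
F = N / N.sum()
r = F.sum(axis=1)   # f_j.
c = F.sum(axis=0)   # f_.k

Z = (F - np.outer(r, c)) / np.sqrt(np.outer(r, c))
R = F / (r[:, None] * np.sqrt(c)) - np.sqrt(c)

eigvals, eigvecs = np.linalg.eigh(Z.T @ Z)
order = np.argsort(eigvals)[::-1]
keep = eigvals[order] > 1e-12
lam = eigvals[order][keep]
U = eigvecs[:, order][:, keep]
Phi = R @ U                                   # Phi[j, i] = phi_{i,j}

# Contribution of modality A_j to axis i: f_j. phi_{i,j}^2 / lambda_i
contribution = r[:, None] * Phi**2 / lam

# Quality of representation: phi_{i,j}^2 / ||R_j||^2
cos2 = Phi**2 / (R**2).sum(axis=1, keepdims=True)

print(contribution)   # each column sums to 1
print(cos2)           # each row sums to 1 over all retained axes
```

The contributions to a given axis sum to one over the modalities, while the squared cosines of a given modality sum to one over all axes with nonzero eigenvalue.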


22 Performing PCA on the column profiles does not differ from performing PCA on the row profiles. The solution is given by the eigenvalues and the eigenvectors of the matrix $W = ZZ^T$.

23 Let $C \in \mathbb{R}^{J \times K}$, where

$$C_{jk} = \frac{f_{jk}}{f_{.k} \sqrt{f_{j.}}} - \sqrt{f_{j.}}.$$

The matrix $C$ contains the scaled and shifted column profiles. Let $C_k$ denote the $k$th column of $C$. Performing PCA on the column profiles amounts to finding orthonormal vectors (directions) $v_h$ such that the projection $P_h(\cdot)$ onto $v_h$ maximizes the weighted sum of the Euclidean distances,

$$\sum_{k=1}^{K} f_{.k}\, d^2(0, P_h(C_k)),$$

under the constraint that $v_h$ is orthogonal to all $v_l$, $1 \le l < h$. The solution is given by the eigenvalues and the eigenvectors of the matrix $W = ZZ^T$.

24 Let $\lambda_h$ denote the $h$th largest eigenvalue of the matrix $W$ and let $v_h$ denote the corresponding unit length eigenvector. Let $v_{h,j}$ denote the $j$th element of $v_h$. The value (score) of the column profile $k$ (associated with modality $B_k$) on the $h$th principal component is given by

$$\psi_{h,k} = \sum_{j=1}^{J} v_{h,j} C_{jk}.$$

It can be proven that $\psi_h$ is centered such that

$$\sum_{k=1}^{K} f_{.k} \psi_{h,k} = 0,$$

and that the (weighted) variance of $\psi_h$ is $\lambda_h$.

25 Contribution of modalities

The contribution of the modality $B_k$ to the construction of the axis $v_h$ is given by

$$\frac{f_{.k} (\psi_{h,k})^2}{\lambda_h}.$$

26 Quality of the representation

The quality of the representation of the centered column profile $C_k$ by the principal axis $h$ is measured by the squared cosine of the angle between the vector $OC_k$ and $v_h$:

$$\cos^2(\beta) = \left( \frac{\langle OC_k, v_h \rangle}{\| OC_k \| \, \| v_h \|} \right)^2 = \frac{(\psi_{h,k})^2}{\| OC_k \|^2}.$$

If the value is close to 1, the quality of the representation is good.
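The column analysis mirrors the row analysis, with $W = ZZ^T$ in place of $V$ and $C$ in place of $R$. The following NumPy sketch, on a hypothetical table `N`, computes the column scores together with their contributions and squared cosines; all variable names are illustrative.

```python
import numpy as np

# Hypothetical 3 x 3 contingency table (illustration only).
N = np.array([[20, 10, 5], [10, 20, 10], [5, 10, 30]], dtype=float)
F = N / N.sum()
r = F.sum(axis=1)   # f_j.
c = F.sum(axis=0)   # f_.k

Z = (F - np.outer(r, c)) / np.sqrt(np.outer(r, c))

# Scaled and shifted column profiles: C_jk = f_jk / (f_.k sqrt(f_j.)) - sqrt(f_j.)
C = F / (np.sqrt(r)[:, None] * c) - np.sqrt(r)[:, None]

eigvals, eigvecs = np.linalg.eigh(Z @ Z.T)
order = np.argsort(eigvals)[::-1]
keep = eigvals[order] > 1e-12
lam = eigvals[order][keep]
V_axes = eigvecs[:, order][:, keep]            # column h holds v_h

Psi = C.T @ V_axes                             # Psi[k, h] = psi_{h,k}
weighted_mean = c @ Psi                        # sum_k f_.k psi_{h,k}
weighted_var = c @ Psi**2                      # sum_k f_.k psi_{h,k}^2

contribution = c[:, None] * Psi**2 / lam       # f_.k psi_{h,k}^2 / lambda_h
cos2 = Psi**2 / (C**2).sum(axis=0)[:, None]    # psi_{h,k}^2 / ||C_k||^2

print(weighted_mean, weighted_var, lam)
```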


28 It can be shown that the matrices $V$ and $W$ have the same nonzero eigenvalues. Moreover, the eigenvectors $u_i$ can be given in terms of $v_i$ and vice versa:

$$u_i = \frac{1}{\sqrt{\lambda_i}} Z^T v_i$$

and

$$v_i = \frac{1}{\sqrt{\lambda_i}} Z u_i.$$
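These relations can be checked numerically. The sketch below uses a hypothetical non-square table `N`, so that $V$ ($4 \times 4$) and $W$ ($3 \times 3$) have different sizes, and verifies the transition formula for the leading eigenpair.

```python
import numpy as np

# Hypothetical 3 x 4 contingency table (illustration only).
N = np.array([[20, 10, 5, 5], [10, 20, 10, 5], [5, 10, 30, 10]], dtype=float)
F = N / N.sum()
r = F.sum(axis=1)
c = F.sum(axis=0)
Z = (F - np.outer(r, c)) / np.sqrt(np.outer(r, c))

V = Z.T @ Z   # K x K
W = Z @ Z.T   # J x J

# eigh sorts eigenvalues in ascending order, so the largest ones come last.
lam_V, _ = np.linalg.eigh(V)
lam_W, W_vecs = np.linalg.eigh(W)

# Leading eigenpair of W, mapped to an eigenvector of V and back.
lam1, v1 = lam_W[-1], W_vecs[:, -1]
u1 = Z.T @ v1 / np.sqrt(lam1)
v1_back = Z @ u1 / np.sqrt(lam1)

print(lam_V[-2:], lam_W[-2:])         # the two largest eigenvalues agree
print(np.allclose(V @ u1, lam1 * u1))  # u1 is an eigenvector of V
print(np.linalg.norm(u1))              # and has unit length
```

Because eigenvectors are defined only up to sign, an independently computed $u_1$ may differ from `u1` by a factor of $-1$.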

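29 Let $H = \mathrm{rank}(V) = \mathrm{rank}(W)$. The coolest thing in correspondence analysis is that the attraction-repulsion indices $d_{jk} = f_{jk} / (f_{j.} f_{.k})$ can be given in terms of $\varphi$ and $\psi$ as follows:

$$d_{jk} = 1 + \sum_{h=1}^{H} \frac{1}{\sqrt{\lambda_h}} \varphi_{h,j} \psi_{h,k}.$$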
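The reconstruction formula requires the signs of $u_h$ and $v_h$ to be linked by $v_h = Z u_h / \sqrt{\lambda_h}$. One convenient way to obtain such a matched pair is the singular value decomposition of $Z$, whose singular values are $\sqrt{\lambda_h}$. The NumPy sketch below takes this route on a hypothetical table `N`; the use of the SVD is an implementation choice made here, not something prescribed by the slides.

```python
import numpy as np

# Hypothetical 3 x 3 contingency table (illustration only).
N = np.array([[20, 10, 5], [10, 20, 10], [5, 10, 30]], dtype=float)
F = N / N.sum()
r = F.sum(axis=1)
c = F.sum(axis=0)

Z = (F - np.outer(r, c)) / np.sqrt(np.outer(r, c))
R = F / (r[:, None] * np.sqrt(c)) - np.sqrt(c)            # row profiles
C = F / (np.sqrt(r)[:, None] * c) - np.sqrt(r)[:, None]   # column profiles

# Z = left @ diag(s) @ right_t, with singular values in descending order.
left, s, right_t = np.linalg.svd(Z, full_matrices=False)
H = int((s > 1e-10).sum())      # numerical rank
lam = s[:H] ** 2                # eigenvalues lambda_h
V_axes = left[:, :H]            # v_h: eigenvectors of W = Z Z^T
U_axes = right_t[:H].T          # u_h: eigenvectors of V = Z^T Z

Phi = R @ U_axes                # Phi[j, h] = phi_{h,j}
Psi = C.T @ V_axes              # Psi[k, h] = psi_{h,k}

D_reconstructed = 1 + (Phi / np.sqrt(lam)) @ Psi.T
D_direct = F / np.outer(r, c)   # d_jk = f_jk / (f_j. f_.k)

print(np.allclose(D_reconstructed, D_direct))
```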

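30 The components are often standardized by defining

$$\hat{\psi}_{h,k} = \frac{1}{\sqrt{\lambda_h}} \psi_{h,k} \quad \text{and} \quad \hat{\varphi}_{h,j} = \frac{1}{\sqrt{\lambda_1}} \varphi_{h,j}.$$

Then

$$d_{jk} = 1 + \sqrt{\lambda_1} \sum_{h=1}^{H} \hat{\varphi}_{h,j} \hat{\psi}_{h,k}.$$

The attraction-repulsion index $d_{jk}$ is now larger than 1 if and only if the smallest angle between $(\hat{\varphi}_{1,j}, \ldots, \hat{\varphi}_{H,j})$ and $(\hat{\psi}_{1,k}, \ldots, \hat{\psi}_{H,k})$ is less than $90^\circ$.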

31 If the row profile $j$ and the column profile $k$ are well represented by the first two principal components, then the attraction-repulsion index

$$d_{jk} \approx 1 + \sqrt{\lambda_1} \sum_{h=1}^{2} \hat{\varphi}_{h,j} \hat{\psi}_{h,k}.$$

We can therefore say that the modalities $A_j$ and $B_k$ are attracted to each other if the angle between $(\hat{\varphi}_{1,j}, \hat{\varphi}_{2,j})$ and $(\hat{\psi}_{1,k}, \hat{\psi}_{2,k})$ is less than $90^\circ$, and that they repel each other if the angle is larger than $90^\circ$. In this case, one can simply read the angle from the (double) biplot of the first two components of $\hat{\varphi}$ and $\hat{\psi}$.
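The angle criterion reduces to the sign of a dot product between the standardized two-dimensional coordinates. The NumPy sketch below illustrates this on a hypothetical $3 \times 3$ table `N`, for which $H = 2$ and the two-component formula is therefore exact; for larger tables it would only be an approximation. Matched signs for the row and column axes are obtained from the SVD of $Z$, which is an assumption of this sketch.

```python
import numpy as np

# Hypothetical 3 x 3 contingency table (illustration only).
N = np.array([[20, 10, 5], [10, 20, 10], [5, 10, 30]], dtype=float)
F = N / N.sum()
r = F.sum(axis=1)
c = F.sum(axis=0)

Z = (F - np.outer(r, c)) / np.sqrt(np.outer(r, c))
R = F / (r[:, None] * np.sqrt(c)) - np.sqrt(c)
C = F / (np.sqrt(r)[:, None] * c) - np.sqrt(r)[:, None]

left, s, right_t = np.linalg.svd(Z, full_matrices=False)
lam = s[:2] ** 2                        # two largest eigenvalues
Phi = R @ right_t[:2].T                 # Phi[j, h] = phi_{h,j}
Psi = C.T @ left[:, :2]                 # Psi[k, h] = psi_{h,k}

# Standardized coordinates used in the biplot.
Phi_hat = Phi / np.sqrt(lam[0])         # every row axis is scaled by sqrt(lambda_1)
Psi_hat = Psi / np.sqrt(lam)            # column axis h is scaled by sqrt(lambda_h)

dots = Phi_hat @ Psi_hat.T              # dots[j, k] > 0  <=>  angle below 90 degrees
D_two_axes = 1 + np.sqrt(lam[0]) * dots
attracted = dots > 0

D_direct = F / np.outer(r, c)
print(attracted)
print(D_direct > 1)
```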

32 Next Week

Next week we will talk about multiple correspondence analysis (MCA).


