Canonical Correlation Analysis
- Sarah Preston
- 5 years ago
Chapter 7 Canonical Correlation Analysis

7.1 Introduction

Let x be the $p \times 1$ vector of predictors, and partition $x = (w^T, y^T)^T$ where w is $m \times 1$ and y is $q \times 1$, where $m = p - q \geq q$ and $m, q \geq 2$. Canonical correlation analysis (CCA) seeks m pairs of linear combinations $(a_1^T w, b_1^T y), \ldots, (a_m^T w, b_m^T y)$ such that $\mathrm{corr}(a_i^T w, b_i^T y)$ is large under some constraints on the $a_i$ and $b_i$, where $i = 1, \ldots, m$. The first pair $(a_1^T w, b_1^T y)$ has the largest correlation. The next pair $(a_2^T w, b_2^T y)$ has the largest correlation among all pairs uncorrelated with the first pair, and the process continues so that $(a_m^T w, b_m^T y)$ is the pair with the largest correlation that is uncorrelated with the first $m - 1$ pairs. The correlations are called canonical correlations while the pairs of linear combinations are called canonical variables.

Some notation is needed to explain CCA. Let the $p \times p$ positive definite symmetric dispersion matrix
$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$
Let $J = \Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1/2}$. Let $\Sigma_a = \Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$, $\Sigma_b = \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}$, $\Sigma_A = J J^T = \Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1/2}$, and $\Sigma_B = J^T J = \Sigma_{22}^{-1/2} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1/2}$. Let the $e_i$ and $g_i$ be sets of orthonormal eigenvectors, so $e_i^T e_i = 1$, $e_i^T e_j = 0$ for $i \neq j$, $g_i^T g_i = 1$, and $g_i^T g_j = 0$ for $i \neq j$. Let the $e_i$ be $m \times 1$ while the $g_i$ are $q \times 1$. Let $\Sigma_a$ have eigenvalue-eigenvector pairs $(\lambda_1, a_1), \ldots, (\lambda_m, a_m)$ where $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m$. Let $\Sigma_A$ have eigenvalue-eigenvector pairs $(\lambda_i, e_i)$ for $i = 1, \ldots, m$. Let $\Sigma_b$ have eigenvalue-eigenvector pairs $(\lambda_1, b_1), \ldots, (\lambda_q, b_q)$. Let $\Sigma_B$ have eigenvalue-eigenvector pairs $(\lambda_i, g_i)$ for $i = 1, \ldots, q$. It can be shown that the m largest eigenvalues of the four matrices are the same. Hence $\lambda_i(\Sigma_a) = \lambda_i(\Sigma_A) = \lambda_i(\Sigma_b) = \lambda_i(\Sigma_B) \equiv \lambda_i$ for $i = 1, \ldots, m$. It can be shown that $a_i = \Sigma_{11}^{-1/2} e_i$ and $b_i = \Sigma_{22}^{-1/2} g_i$. The eigenvectors $a_i$ are not necessarily orthonormal, and the eigenvectors $b_i$ are not necessarily orthonormal.

Theorem 7.1. Assume the $p \times p$ dispersion matrix $\Sigma$ is positive definite. Assume $\Sigma_{11}, \Sigma_{22}, \Sigma_A, \Sigma_a, \Sigma_B$ and $\Sigma_b$ are positive definite, and that $\hat{\Sigma} \xrightarrow{P} c\Sigma$ for some constant $c > 0$. Let $d_i$ be an eigenvector of the corresponding matrix, so $d_i = a_i$, $b_i$, $e_i$ or $g_i$. Let $(\hat{\lambda}_i, \hat{d}_i)$ be the ith eigenvalue-eigenvector pair of $\hat{\Sigma}_\gamma$.
a) $\hat{\Sigma}_\gamma \xrightarrow{P} \Sigma_\gamma$ and $\hat{\lambda}_i(\hat{\Sigma}_\gamma) \xrightarrow{P} \lambda_i(\Sigma_\gamma) = \lambda_i$ where $\gamma = A, a, B$ or $b$.
b) $\Sigma_\gamma \hat{d}_i - \lambda_i \hat{d}_i \xrightarrow{P} 0$ and $\hat{\Sigma}_\gamma d_i - \hat{\lambda}_i d_i \xrightarrow{P} 0$.
c) If the jth eigenvalue $\lambda_j$ is unique, where $j \leq m$, then the absolute value of the correlation of $\hat{d}_j$ with $d_j$ converges to 1 in probability: $|\mathrm{corr}(\hat{d}_j, d_j)| \xrightarrow{P} 1$.

Proof. a) $\hat{\Sigma}_\gamma \xrightarrow{P} \Sigma_\gamma$ since matrix multiplication is a continuous function of the relevant matrices, matrix inversion is a continuous function of a positive definite matrix, and the constant c cancels when $\hat{\Sigma}_\gamma$ is formed from $\hat{\Sigma}$. Then $\hat{\lambda}_i(\hat{\Sigma}_\gamma) \xrightarrow{P} \lambda_i$ since an eigenvalue is a continuous function of its associated matrix.
b) Note that $(\Sigma_\gamma - \lambda_i I)\hat{d}_i = [(\Sigma_\gamma - \lambda_i I) - (\hat{\Sigma}_\gamma - \hat{\lambda}_i I)]\hat{d}_i = o_P(1) O_P(1) \xrightarrow{P} 0$, and $\hat{\Sigma}_\gamma d_i - \hat{\lambda}_i d_i \xrightarrow{P} \Sigma_\gamma d_i - \lambda_i d_i = 0$.
c) If n is large, then $\hat{d}_j \equiv \hat{d}_{j,n}$ is arbitrarily close to either $d_j$ or $-d_j$, and the result follows.

Rule of thumb 7.1. To use CCA, assume the DD plot and the subplots of the scatterplot matrix are linear. Want $n > 10p$ for classical CCA and $n > 20p$ for robust CCA that uses FCH, RFCH or RMVN. Also make the DD plot for the w variables and the DD plot for the y variables.

Definition 7.1. Let the dispersion matrix be $\mathrm{Cov}(x) = \Sigma_x$.
Let $(\lambda_i, e_i)$ and $(\lambda_i, g_i)$ be the eigenvalue-eigenvector pairs of $\Sigma_A$ and $\Sigma_B$. The kth pair of population canonical variables is $U_k = a_k^T w = e_k^T \Sigma_{11}^{-1/2} w$ and $V_k = b_k^T y = g_k^T \Sigma_{22}^{-1/2} y$ for $k = 1, \ldots, m$. Then the population canonical correlations are $\rho_k = \mathrm{corr}(U_k, V_k) = \sqrt{\lambda_k}$ for $k = 1, \ldots, m$. The vectors $a_k = \Sigma_{11}^{-1/2} e_k$ and $b_k = \Sigma_{22}^{-1/2} g_k$ are the kth canonical correlation coefficient vectors for w and y.

Theorem 7.2. Johnson and Wichern (1988): Let the dispersion matrix be $\mathrm{Cov}(x) = \Sigma_x$. Then $V(U_k) = V(V_k) = 1$, and $\mathrm{Cov}(C_k, D_j) = \mathrm{corr}(C_k, D_j) = 0$ for $k \neq j$, where $C_k = U_k$ or $C_k = V_k$, $D_j = U_j$ or $D_j = V_j$, and $j, k = 1, \ldots, m$. That is, $U_k$ is uncorrelated with $V_j$ and $U_j$ for $j \neq k$, and $V_k$ is uncorrelated with $V_j$ and $U_j$ for $j \neq k$. The first pair of canonical variables is the pair of linear combinations $(U, V)$ having unit variances that maximizes $\mathrm{corr}(U, V)$, and this maximum is $\mathrm{corr}(U_1, V_1) = \rho_1$. The kth pair of canonical variables is the pair of linear combinations $(U, V)$ with unit variances that maximizes $\mathrm{corr}(U, V)$ among all choices uncorrelated with the previous $k - 1$ canonical variable pairs.

Definition 7.2. Suppose standardized data $z = (w^T, y^T)^T$ is used, so the dispersion matrix is the correlation matrix $\Sigma = \rho$. Hence $\Sigma_{ii} = \rho_{ii}$ for $i = 1, 2$. Let $(\lambda_i, e_i)$ and $(\lambda_i, g_i)$ be the eigenvalue-eigenvector pairs of $\Sigma_A$ and $\Sigma_B$. The kth pair of population canonical variables is $U_k = a_k^T w = e_k^T \Sigma_{11}^{-1/2} w$ and $V_k = b_k^T y = g_k^T \Sigma_{22}^{-1/2} y$ for $k = 1, \ldots, m$. Then the population canonical correlations are $\rho_k = \mathrm{corr}(U_k, V_k) = \sqrt{\lambda_k}$ for $k = 1, \ldots, m$. Theorem 7.2 holds for the standardized data, and the canonical correlations are unchanged by the standardization.

Let
$$\hat{\Sigma} = \begin{pmatrix} \hat{\Sigma}_{11} & \hat{\Sigma}_{12} \\ \hat{\Sigma}_{21} & \hat{\Sigma}_{22} \end{pmatrix}.$$
Define the estimators $\hat{\Sigma}_a$, $\hat{\Sigma}_A$, $\hat{\Sigma}_b$ and $\hat{\Sigma}_B$ in the same manner as their population analogs but using $\hat{\Sigma}$ instead of $\Sigma$. For example, $\hat{\Sigma}_a = \hat{\Sigma}_{11}^{-1} \hat{\Sigma}_{12} \hat{\Sigma}_{22}^{-1} \hat{\Sigma}_{21}$. Let $\hat{\Sigma}_a$ have eigenvalue-eigenvector pairs $(\hat{\lambda}_i, \hat{a}_i)$, and let $\hat{\Sigma}_A$ have eigenvalue-eigenvector pairs $(\hat{\lambda}_i, \hat{e}_i)$ for $i = 1, \ldots, m$. Let $\hat{\Sigma}_b$ have eigenvalue-eigenvector pairs $(\hat{\lambda}_i, \hat{b}_i)$, and let $\hat{\Sigma}_B$ have eigenvalue-eigenvector pairs $(\hat{\lambda}_i, \hat{g}_i)$ for $i = 1, \ldots, q$. For these four matrices, $\hat{\lambda}_1 \geq \hat{\lambda}_2 \geq \cdots \geq \hat{\lambda}_m$.

Definition 7.3.
Let $\hat{\Sigma} = S$ if data $x = (w^T, y^T)^T$ is used, and let $\hat{\Sigma} = R$ if standardized data $z = (w^T, y^T)^T$ is used. The kth pair of sample canonical variables is $\hat{U}_k = \hat{a}_k^T w = \hat{e}_k^T \hat{\Sigma}_{11}^{-1/2} w$ and $\hat{V}_k = \hat{b}_k^T y = \hat{g}_k^T \hat{\Sigma}_{22}^{-1/2} y$ for $k = 1, \ldots, m$. Then the sample canonical correlations are $\hat{\rho}_k = \mathrm{corr}(\hat{U}_k, \hat{V}_k) = \sqrt{\hat{\lambda}_k}$ for $k = 1, \ldots, m$. The vectors $\hat{a}_k = \hat{\Sigma}_{11}^{-1/2} \hat{e}_k$ and $\hat{b}_k = \hat{\Sigma}_{22}^{-1/2} \hat{g}_k$ are the kth sample canonical correlation vectors for w and y.

Theorem 7.3. Under the conditions of Definition 7.3, the first pair of sample canonical variables $(\hat{U}_1, \hat{V}_1)$ is the pair of linear combinations $(\hat{U}, \hat{V})$ having unit sample variances that maximizes the sample correlation $\mathrm{corr}(\hat{U}, \hat{V})$, and this maximum is $\mathrm{corr}(\hat{U}_1, \hat{V}_1) = \hat{\rho}_1$. The kth pair of canonical variables is the pair of linear combinations $(\hat{U}, \hat{V})$ with unit sample variances that maximizes the sample correlation $\mathrm{corr}(\hat{U}, \hat{V})$ among all choices uncorrelated with the previous $k - 1$ canonical variable pairs.

7.2 Robust CCA

The R function cancor does classical CCA, and the mpack function rcancor does robust CCA by applying cancor to the RMVN set: the subset of the data used to compute RMVN. Some theory is simple: the FCH, RFCH and RMVN methods of RCCA produce consistent estimators of the kth canonical correlation $\rho_k$ on a large class of elliptically contoured distributions. To see this, suppose $\mathrm{Cov}(x) = c_x \Sigma$ and $C \equiv C(X) \xrightarrow{P} c\Sigma$, where $c_x > 0$ and $c > 0$ are some constants. Then $C_{XX}^{-1} C_{XY} C_{YY}^{-1} C_{YX} \xrightarrow{P} \Sigma_A = \Sigma_{XX}^{-1} \Sigma_{XY} \Sigma_{YY}^{-1} \Sigma_{YX}$, and $C_{YY}^{-1} C_{YX} C_{XX}^{-1} C_{XY} \xrightarrow{P} \Sigma_B = \Sigma_{YY}^{-1} \Sigma_{YX} \Sigma_{XX}^{-1} \Sigma_{XY}$. Note that $\Sigma_A$ and $\Sigma_B$ only depend on $\Sigma$ and do not depend on the constants $c$ or $c_x$. (If C is also the classical covariance matrix applied to some subset of the data, then the correlation matrix $G \equiv R_C$ applied to the same subset satisfies $G_{XX}^{-1} G_{XY} G_{YY}^{-1} G_{YX} \xrightarrow{P} R_A = R_{XX}^{-1} R_{XY} R_{YY}^{-1} R_{YX}$, and $G_{YY}^{-1} G_{YX} G_{XX}^{-1} G_{XY} \xrightarrow{P} R_B = R_{YY}^{-1} R_{YX} R_{XX}^{-1} R_{XY}$.) Since eigenvalues are continuous functions of the associated matrix, and the FCH, RFCH and RMVN estimators are consistent estimators of $c_1 \Sigma$, $c_2 \Sigma$ and $c_3 \Sigma$ on a large class of elliptically contoured distributions, Theorem
7.1 holds, so these three RCCA methods and rcancor produce consistent estimators of the kth canonical correlation $\rho_k$ on that class of distributions.

Example 7.1. Example 2.2 describes the mussel data. Log transformations were taken of the muscle mass M, the shell width W and the shell mass S. Then x contained the two log mass measurements while y contained L, H and log(W). The robust and classical CCAs were similar, but the canonical coefficients were difficult to interpret since log(W) has different units than L and H. Hence log transformations were taken of all five variables, and output is shown below. The data set zm contains x and y, and the DD plot showed that case 48 was separated from the bulk of the data, but near the identity line. The DD plot for x showed that two cases, 8 and 48, were separated from the bulk of the data. Also, the plotted points did not cluster tightly about the identity line. The DD plot for y looked fine.

The classical CCA produces the output $cor, $xcoef and $ycoef: the canonical correlations, the $a_i$ and the $b_i$. The labels for the RCCA are $out$cor, $out$xcoef and $out$ycoef. Note that the first correlation was about 0.98 while the second correlation was small. The RCCA is the CCA on the RMVN data set, which is contained in a compact ellipsoidal region. The variability of the truncated data set is less than that of the entire data set; hence expect the robust $a_i$ and $b_i$ to be larger in magnitude, ignoring sign, than the classical $a_i$ and $b_i$, since the variance of each canonical variate is equal to one and RCCA uses the truncated data. Note that $a_1$ was roughly proportional to log(S) while $b_1$ gave slightly higher weight to log(H) than to log(W) and then log(L). Note that the five variables have high pairwise correlations, so log(M) was not important given that log(S) was in x. The second pair $(a_2, b_2)$ might be ignored since the second canonical correlation was very low.

    > cancor(x,y)
    $cor
    [1]

    $xcoef
        [,1] [,2]
    S
    M

    $ycoef
        [,1] [,2] [,3]
    L
    W
    H

    $xcenter
    S M

    $ycenter
    L W H

    > rcancor(x,y)
    $out
    $out$cor
    [1]

    $out$xcoef
        [,1] [,2]
    S
    M

    $out$ycoef
        [,1] [,2] [,3]
    L
    W
    H

    $out$xcenter
    S M

    $out$ycenter
    L W H
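The eigenvalue machinery behind output like the above is easy to check numerically. The sketch below (Python with NumPy, on simulated data rather than the mussel data; all names are invented for illustration) forms $\hat{J} = \hat{\Sigma}_{11}^{-1/2} \hat{\Sigma}_{12} \hat{\Sigma}_{22}^{-1/2}$ and $\hat{\Sigma}_A = \hat{J}\hat{J}^T$ from a sample dispersion matrix and recovers the sample canonical correlations $\hat{\rho}_k = \sqrt{\hat{\lambda}_k}$, which also equal the singular values of $\hat{J}$.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate n observations of x = (w^T, y^T)^T with m = 2, q = 3
# (illustrative only; not the mussel data of Example 7.1).
n, m, q = 200, 2, 3
A = rng.standard_normal((m + q, m + q))
Sigma = A @ A.T + (m + q) * np.eye(m + q)    # positive definite dispersion matrix
X = rng.multivariate_normal(np.zeros(m + q), Sigma, size=n)

S = np.cov(X, rowvar=False)                  # sample dispersion matrix Sigma-hat
S11, S12 = S[:m, :m], S[:m, m:]
S22 = S[m:, m:]

def inv_sqrt(M):
    """Symmetric inverse square root M^{-1/2} via the spectral decomposition."""
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

J = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)      # J-hat = S11^{-1/2} S12 S22^{-1/2}
SigA = J @ J.T                               # Sigma_A-hat = J J^T, symmetric m x m

# Sample canonical correlations: square roots of the m largest eigenvalues.
lam = np.sort(np.linalg.eigvalsh(SigA))[::-1]
rho = np.sqrt(np.clip(lam, 0.0, None))

# Equivalently, the canonical correlations are the singular values of J-hat.
sv = np.linalg.svd(J, compute_uv=False)
```

Since $\hat{\Sigma}_A$, $\hat{\Sigma}_a$, $\hat{\Sigma}_B$ and $\hat{\Sigma}_b$ share the same m largest eigenvalues, any of the four matrices could be used here; $\hat{\Sigma}_A$ is convenient because it is symmetric.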
7.3 Summary

1) Let x be the $p \times 1$ vector of predictors, and partition $x = (w^T, y^T)^T$ where w is $m \times 1$ and y is $q \times 1$, where $m = p - q \geq q$ and $m, q \geq 2$. Canonical correlation analysis (CCA) seeks m pairs of linear combinations $(a_1^T w, b_1^T y), \ldots, (a_m^T w, b_m^T y)$ such that $\mathrm{corr}(a_i^T w, b_i^T y)$ is large under some constraints on the $a_i$ and $b_i$, where $i = 1, \ldots, m$. The first pair $(a_1^T w, b_1^T y)$ has the largest correlation. The next pair $(a_2^T w, b_2^T y)$ has the largest correlation among all pairs uncorrelated with the first pair, and the process continues so that $(a_m^T w, b_m^T y)$ is the pair with the largest correlation that is uncorrelated with the first $m - 1$ pairs. The correlations are called canonical correlations while the pairs of linear combinations are called canonical variables.

2) R output is shown in symbols in the following table.

    $cor      $\hat{\rho}_1, \ldots, \hat{\rho}_m$
    $xcoef    columns $\hat{a}_1, \ldots, \hat{a}_m$ (coefficients for w)
    $ycoef    columns $\hat{b}_1, \ldots, \hat{b}_q$ (coefficients for y)

    $out$cor
    [1]

    $out$xcoef
        [,1] [,2]
    S
    M

    $out$ycoef
        [,1] [,2] [,3]
    L
    W
    H

3) Some notation is needed to explain CCA. Let the $p \times p$ positive definite symmetric dispersion matrix
$$\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}.$$
Let $J = \Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1/2}$. Let $\Sigma_a = \Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21}$, $\Sigma_b = \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}$, $\Sigma_A = J J^T = \Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1} \Sigma_{21} \Sigma_{11}^{-1/2}$, and $\Sigma_B = J^T J = \Sigma_{22}^{-1/2} \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12} \Sigma_{22}^{-1/2}$. Let the $e_i$ and $g_i$ be sets of orthonormal eigenvectors, so $e_i^T e_i = 1$, $e_i^T e_j = 0$ for $i \neq j$, $g_i^T g_i = 1$, and $g_i^T g_j = 0$ for $i \neq j$. Let the $e_i$ be $m \times 1$ while the $g_i$ are $q \times 1$. Let $\Sigma_a$ have eigenvalue-eigenvector pairs $(\lambda_1, a_1), \ldots, (\lambda_m, a_m)$ where $\lambda_1 \geq \lambda_2 \geq \cdots \geq \lambda_m$. Let $\Sigma_A$ have eigenvalue-eigenvector pairs $(\lambda_i, e_i)$ for $i = 1, \ldots, m$. Let $\Sigma_b$ have eigenvalue-eigenvector pairs $(\lambda_1, b_1), \ldots, (\lambda_q, b_q)$. Let $\Sigma_B$ have eigenvalue-eigenvector pairs $(\lambda_i, g_i)$ for $i = 1, \ldots, q$. It can be shown that the m largest eigenvalues of the four matrices are the same. Hence $\lambda_i(\Sigma_a) = \lambda_i(\Sigma_A) = \lambda_i(\Sigma_b) = \lambda_i(\Sigma_B) \equiv \lambda_i$ for $i = 1, \ldots, m$. It can be shown that $a_i = \Sigma_{11}^{-1/2} e_i$ and $b_i = \Sigma_{22}^{-1/2} g_i$. The eigenvectors $a_i$ are not necessarily orthonormal, and the eigenvectors $b_i$ are not necessarily orthonormal.

Theorem 7.1. Assume the $p \times p$ dispersion matrix $\Sigma$ is positive definite. Assume $\Sigma_{11}, \Sigma_{22}, \Sigma_A, \Sigma_a, \Sigma_B$ and $\Sigma_b$ are positive definite, and that $\hat{\Sigma} \xrightarrow{P} c\Sigma$ for some constant $c > 0$. Let $d_i$ be an eigenvector of the corresponding matrix, so $d_i = a_i$, $b_i$, $e_i$ or $g_i$. Let $(\hat{\lambda}_i, \hat{d}_i)$ be the ith eigenvalue-eigenvector pair of $\hat{\Sigma}_\gamma$.
a) $\hat{\Sigma}_\gamma \xrightarrow{P} \Sigma_\gamma$ and $\hat{\lambda}_i(\hat{\Sigma}_\gamma) \xrightarrow{P} \lambda_i(\Sigma_\gamma) = \lambda_i$ where $\gamma = A, a, B$ or $b$.
b) $\Sigma_\gamma \hat{d}_i - \lambda_i \hat{d}_i \xrightarrow{P} 0$ and $\hat{\Sigma}_\gamma d_i - \hat{\lambda}_i d_i \xrightarrow{P} 0$.
c) If the jth eigenvalue $\lambda_j$ is unique, where $j \leq m$, then the absolute value of the correlation of $\hat{d}_j$ with $d_j$ converges to 1 in probability: $|\mathrm{corr}(\hat{d}_j, d_j)| \xrightarrow{P} 1$.

7.4 Complements

Muirhead and Waternaux (1980) showed that if the population canonical correlations $\rho_k$ are distinct and if the underlying population distribution has finite fourth moments, then the limiting joint distribution of $\sqrt{n}(\hat{\rho}_k^2 - \rho_k^2)$ is multivariate normal, where the $\hat{\rho}_k$ are the classical sample canonical correlations and $k = 1, \ldots, p$. If the data are iid from an elliptically contoured distribution with kurtosis $3\kappa$, then the limiting joint distribution of
$$\sqrt{n}\, \frac{\hat{\rho}_k^2 - \rho_k^2}{2 \rho_k (1 - \rho_k^2)}, \quad k = 1, \ldots, p,$$
is $N_p(0, (\kappa + 1) I_p)$. Note that $\kappa = 0$ for multivariate normal data.

Alkenani and Yu (2012), Zhang (2011) and Zhang, Olive and Ye (2012) develop robust CCA based on FCH, RFCH and RMVN.

7.5 Problems

PROBLEMS WITH AN ASTERISK * ARE ESPECIALLY USEFUL.

7.1. Examine the R output in Example 7.1.
a) What is the first canonical correlation $\hat{\rho}_1$?
b) What is $\hat{a}_1$?
c) What is $\hat{b}_1$?

7.2. The R output below is for a canonical correlation analysis on the Venables and Ripley (2003) CPU data. The variables were syct = log(cycle time + 1), mmin = log(minimum main memory + 1), chmin = log(minimum number of channels + 1), chmax = log(maximum number of channels + 1), perf = log(published performance + 1) and estperf = log(estimated performance + 1). These six variables had a linear scatterplot matrix and DD plot and similar variances. Want to compare the two performance variables with the four remaining variables.
a) What is the first canonical correlation $\hat{\rho}_1$?
b) What is $\hat{a}_1$?
c) What is $\hat{b}_1$?
d) Interpret the second canonical variable $\hat{U}_2 = \hat{a}_2^T w$.

    > cancor(w,y)
    $cor
    [1]

    $xcoef
        [,1] [,2]
    perf
    estperf

    $ycoef
         [,1] [,2] [,3] [,4]
    syct
    mmin
    chmin
    chmax

7.3. Edited SAS output for the SAS Institute (1985, p. 146) Fitness Club Data is given below for CCA. Three physiological and three exercise variables were measured on 20 middle-aged men at a fitness club.
a) What is the first canonical correlation $\hat{\rho}_1$?
b) What is $\hat{a}_1$?
c) What is $\hat{b}_1$?

    Canonical Correlation

    Raw Canonical Coefficients for the Physiological Variables
            PHYS1 PHYS2 PHYS3
    weight
    waist
    pulse

    Raw Canonical Coefficients for the Exercise Variables
            Exer1 Exer2 Exer3
    chinups
    situps
    jumps

7.4. The output below is for a canonical correlation analysis on the R Seatbelts data set, where $y_1$ = drivers = number of drivers killed or seriously injured, $y_2$ = front = number of front seat passengers killed or seriously injured, and $y_3$ = rear = number of back seat passengers killed or seriously injured, while $x_1$ = kms = distance driven, $x_2$ = PetrolPrice = petrol price, and $x_3$ = VanKilled = number of van drivers killed. The data consist of 192 monthly totals in Great Britain from January 1969 to December 1984.
a) What is the first canonical correlation $\hat{\rho}_1$?
b) What is $\hat{a}_1$?
c) What is $\hat{b}_1$?
d) Let $z = (x^T, y^T)^T$. From the DD plot, the $z_i$ appeared to follow a multivariate normal distribution. Sketch the DD plot.

    > rcancor(x,y)
    $out
    $out$cor
    [1]

    $out$xcoef
                   [,1] [,2] [,3]
    x.kms
    x.PetrolPrice
    x.VanKilled

    $out$ycoef
               [,1] [,2] [,3]
    y.drivers
    y.front
    y.rear

7.5. The R output below is for a canonical correlation analysis on some iris data. An iris is a flower, and there were 50 observations on 4 variables: sepal length, sepal width, petal length and petal width.
a) What is the first canonical correlation $\hat{\rho}_1$?
b) What is $\hat{a}_1$?
c) What is $\hat{b}_1$?

    w <- iris3[,,3]
    x <- w[,1:2]
    y <- w[,3:4]
    cancor(x,y)
    $cor
    [1]

    $xcoef
             [,1] [,2]
    Sepal L
    Sepal W

    $ycoef
             [,1] [,2]
    Petal L
    Petal W
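The unit-variance and uncorrelatedness claims of Theorems 7.2 and 7.3 can also be verified numerically. The sketch below (Python with NumPy, on simulated data rather than any of the problem data sets above; all names are invented for illustration) computes the sample coefficient vectors $\hat{a}_k = \hat{\Sigma}_{11}^{-1/2}\hat{e}_k$ and $\hat{b}_k = \hat{\Sigma}_{22}^{-1/2}\hat{g}_k$ of Definition 7.3, using the fact that the $\hat{e}_k$ and $\hat{g}_k$ are the left and right singular vectors of $\hat{J}$ (since $\hat{\Sigma}_A = \hat{J}\hat{J}^T$ and $\hat{\Sigma}_B = \hat{J}^T\hat{J}$), and checks that each $\hat{U}_k$ has unit sample variance and that $\mathrm{corr}(\hat{U}_1, \hat{V}_1) = \hat{\rho}_1$.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m, q = 500, 2, 3
B = rng.standard_normal((m + q, m + q))
X = rng.standard_normal((n, m + q)) @ B.T    # correlated data, x = (w^T, y^T)^T
W, Y = X[:, :m], X[:, m:]

S = np.cov(X, rowvar=False)
S11, S12, S22 = S[:m, :m], S[:m, m:], S[m:, m:]

def inv_sqrt(M):
    """Symmetric inverse square root M^{-1/2} via the spectral decomposition."""
    vals, vecs = np.linalg.eigh(M)
    return vecs @ np.diag(vals ** -0.5) @ vecs.T

R11, R22 = inv_sqrt(S11), inv_sqrt(S22)
J = R11 @ S12 @ R22
# Left/right singular vectors of J-hat are the e-hat_k and g-hat_k.
E, sv, Gt = np.linalg.svd(J, full_matrices=False)
Acoef = R11 @ E                   # columns a-hat_k = S11^{-1/2} e-hat_k
Bcoef = R22 @ Gt.T                # columns b-hat_k = S22^{-1/2} g-hat_k

U = (W - W.mean(axis=0)) @ Acoef  # sample canonical variables U-hat_k
V = (Y - Y.mean(axis=0)) @ Bcoef  # sample canonical variables V-hat_k

var_U = U.var(axis=0, ddof=1)     # each should equal 1 (unit sample variance)
rho1 = np.corrcoef(U[:, 0], V[:, 0])[0, 1]   # should equal sv[0] = rho-hat_1
```

Here $\mathrm{Var}(\hat{U}_k) = \hat{a}_k^T \hat{\Sigma}_{11} \hat{a}_k = \hat{e}_k^T \hat{e}_k = 1$, and the cross-correlations such as $\mathrm{corr}(\hat{U}_1, \hat{V}_2)$ vanish, matching Theorem 7.2.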
More informationExercise Sheet 1.
Exercise Sheet 1 You can download my lecture and exercise sheets at the address http://sami.hust.edu.vn/giang-vien/?name=huynt 1) Let A, B be sets. What does the statement "A is not a subset of B " mean?
More information3d scatterplots. You can also make 3d scatterplots, although these are less common than scatterplot matrices.
3d scatterplots You can also make 3d scatterplots, although these are less common than scatterplot matrices. > library(scatterplot3d) > y par(mfrow=c(2,2)) > scatterplot3d(y,highlight.3d=t,angle=20)
More informationPrinciple Components Analysis (PCA) Relationship Between a Linear Combination of Variables and Axes Rotation for PCA
Principle Components Analysis (PCA) Relationship Between a Linear Combination of Variables and Axes Rotation for PCA Principle Components Analysis: Uses one group of variables (we will call this X) In
More informationI L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN
Introduction Edps/Psych/Stat/ 584 Applied Multivariate Statistics Carolyn J Anderson Department of Educational Psychology I L L I N O I S UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN c Board of Trustees,
More informationSummary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016
8. For any two events E and F, P (E) = P (E F ) + P (E F c ). Summary of basic probability theory Math 218, Mathematical Statistics D Joyce, Spring 2016 Sample space. A sample space consists of a underlying
More informationRobust Maximum Association Between Data Sets: The R Package ccapp
Robust Maximum Association Between Data Sets: The R Package ccapp Andreas Alfons Erasmus Universiteit Rotterdam Christophe Croux KU Leuven Peter Filzmoser Vienna University of Technology Abstract This
More informationGaussian random variables inr n
Gaussian vectors Lecture 5 Gaussian random variables inr n One-dimensional case One-dimensional Gaussian density with mean and standard deviation (called N, ): fx x exp. Proposition If X N,, then ax b
More informationP = A(A T A) 1 A T. A Om (m n)
Chapter 4: Orthogonality 4.. Projections Proposition. Let A be a matrix. Then N(A T A) N(A). Proof. If Ax, then of course A T Ax. Conversely, if A T Ax, then so Ax also. x (A T Ax) x T A T Ax (Ax) T Ax
More informationCh.3 Canonical correlation analysis (CCA) [Book, Sect. 2.4]
Ch.3 Canonical correlation analysis (CCA) [Book, Sect. 2.4] With 2 sets of variables {x i } and {y j }, canonical correlation analysis (CCA), first introduced by Hotelling (1936), finds the linear modes
More informationRobust Multivariate Analysis
Robust Multivariate Analysis David J. Olive Robust Multivariate Analysis 123 David J. Olive Department of Mathematics Southern Illinois University Carbondale, IL USA ISBN 978-3-319-68251-8 ISBN 978-3-319-68253-2
More informationIntroduction to Probability and Stocastic Processes - Part I
Introduction to Probability and Stocastic Processes - Part I Lecture 2 Henrik Vie Christensen vie@control.auc.dk Department of Control Engineering Institute of Electronic Systems Aalborg University Denmark
More informationSTAT 100C: Linear models
STAT 100C: Linear models Arash A. Amini April 27, 2018 1 / 1 Table of Contents 2 / 1 Linear Algebra Review Read 3.1 and 3.2 from text. 1. Fundamental subspace (rank-nullity, etc.) Im(X ) = ker(x T ) R
More informationBIOS 2083 Linear Models Abdus S. Wahed. Chapter 2 84
Chapter 2 84 Chapter 3 Random Vectors and Multivariate Normal Distributions 3.1 Random vectors Definition 3.1.1. Random vector. Random vectors are vectors of random variables. For instance, X = X 1 X 2.
More informationTable of Contents. Multivariate methods. Introduction II. Introduction I
Table of Contents Introduction Antti Penttilä Department of Physics University of Helsinki Exactum summer school, 04 Construction of multinormal distribution Test of multinormality with 3 Interpretation
More informationMultivariate probability distributions and linear regression
Multivariate probability distributions and linear regression Patrik Hoyer 1 Contents: Random variable, probability distribution Joint distribution Marginal distribution Conditional distribution Independence,
More informationIndependent Component (IC) Models: New Extensions of the Multinormal Model
Independent Component (IC) Models: New Extensions of the Multinormal Model Davy Paindaveine (joint with Klaus Nordhausen, Hannu Oja, and Sara Taskinen) School of Public Health, ULB, April 2008 My research
More informationLecture Notes 1: Vector spaces
Optimization-based data analysis Fall 2017 Lecture Notes 1: Vector spaces In this chapter we review certain basic concepts of linear algebra, highlighting their application to signal processing. 1 Vector
More informationSecond-Order Inference for Gaussian Random Curves
Second-Order Inference for Gaussian Random Curves With Application to DNA Minicircles Victor Panaretos David Kraus John Maddocks Ecole Polytechnique Fédérale de Lausanne Panaretos, Kraus, Maddocks (EPFL)
More informationFall 2017 STAT 532 Homework Peter Hoff. 1. Let P be a probability measure on a collection of sets A.
1. Let P be a probability measure on a collection of sets A. (a) For each n N, let H n be a set in A such that H n H n+1. Show that P (H n ) monotonically converges to P ( k=1 H k) as n. (b) For each n
More informationUniversity of Michigan School of Public Health
University of Michigan School of Public Health The University of Michigan Department of Biostatistics Working Paper Series Year 003 Paper Weighting Adustments for Unit Nonresponse with Multiple Outcome
More informationClassification Methods II: Linear and Quadratic Discrimminant Analysis
Classification Methods II: Linear and Quadratic Discrimminant Analysis Rebecca C. Steorts, Duke University STA 325, Chapter 4 ISL Agenda Linear Discrimminant Analysis (LDA) Classification Recall that linear
More informationEconomics 573 Problem Set 5 Fall 2002 Due: 4 October b. The sample mean converges in probability to the population mean.
Economics 573 Problem Set 5 Fall 00 Due: 4 October 00 1. In random sampling from any population with E(X) = and Var(X) =, show (using Chebyshev's inequality) that sample mean converges in probability to..
More informationHighest Density Region Prediction
Highest Density Region Prediction David J. Olive Southern Illinois University November 3, 2015 Abstract Practical large sample prediction regions for an m 1 future response vector y f, given x f and past
More informationPreprocessing & dimensionality reduction
Introduction to Data Mining Preprocessing & dimensionality reduction CPSC/AMTH 445a/545a Guy Wolf guy.wolf@yale.edu Yale University Fall 2016 CPSC 445 (Guy Wolf) Dimensionality reduction Yale - Fall 2016
More informationFactor Analysis. Summary. Sample StatFolio: factor analysis.sgp
Factor Analysis Summary... 1 Data Input... 3 Statistical Model... 4 Analysis Summary... 5 Analysis Options... 7 Scree Plot... 9 Extraction Statistics... 10 Rotation Statistics... 11 D and 3D Scatterplots...
More informationComputer exercise 3: PCA, CCA and factors. Principal component analysis. Eigenvalues and eigenvectors
UPPSALA UNIVERSITY Department of Mathematics Måns Thulin Multivariate Methods Spring 2011 thulin@math.uu.se Computer exercise 3: PCA, CCA and factors In this computer exercise the following topics are
More informationProblem Solving. Correlation and Covariance. Yi Lu. Problem Solving. Yi Lu ECE 313 2/51
Yi Lu Correlation and Covariance Yi Lu ECE 313 2/51 Definition Let X and Y be random variables with finite second moments. the correlation: E[XY ] Yi Lu ECE 313 3/51 Definition Let X and Y be random variables
More informationKey Algebraic Results in Linear Regression
Key Algebraic Results in Linear Regression James H. Steiger Department of Psychology and Human Development Vanderbilt University James H. Steiger (Vanderbilt University) 1 / 30 Key Algebraic Results in
More information7 Principal Components and Factor Analysis
7 Principal Components and actor nalysis 7.1 Principal Components a oal. Relationships between two variables can be graphically well captured in a meaningful way. or three variables this is also possible,
More informationCS 195-5: Machine Learning Problem Set 1
CS 95-5: Machine Learning Problem Set Douglas Lanman dlanman@brown.edu 7 September Regression Problem Show that the prediction errors y f(x; ŵ) are necessarily uncorrelated with any linear function of
More informationLinear Dimensionality Reduction
Outline Hong Chang Institute of Computing Technology, Chinese Academy of Sciences Machine Learning Methods (Fall 2012) Outline Outline I 1 Introduction 2 Principal Component Analysis 3 Factor Analysis
More informationOutline Week 1 PCA Challenge. Introduction. Multivariate Statistical Analysis. Hung Chen
Introduction Multivariate Statistical Analysis Hung Chen Department of Mathematics https://ceiba.ntu.edu.tw/972multistat hchen@math.ntu.edu.tw, Old Math 106 2009.02.16 1 Outline 2 Week 1 3 PCA multivariate
More informationMultiple Linear Regression
Multiple Linear Regression University of California, San Diego Instructor: Ery Arias-Castro http://math.ucsd.edu/~eariasca/teaching.html 1 / 42 Passenger car mileage Consider the carmpg dataset taken from
More informationPhysics 403. Segev BenZvi. Propagation of Uncertainties. Department of Physics and Astronomy University of Rochester
Physics 403 Propagation of Uncertainties Segev BenZvi Department of Physics and Astronomy University of Rochester Table of Contents 1 Maximum Likelihood and Minimum Least Squares Uncertainty Intervals
More informationA Peak to the World of Multivariate Statistical Analysis
A Peak to the World of Multivariate Statistical Analysis Real Contents Real Real Real Why is it important to know a bit about the theory behind the methods? Real 5 10 15 20 Real 10 15 20 Figure: Multivariate
More informationMultivariate Distributions
IEOR E4602: Quantitative Risk Management Spring 2016 c 2016 by Martin Haugh Multivariate Distributions We will study multivariate distributions in these notes, focusing 1 in particular on multivariate
More informationStatistics 351 Probability I Fall 2006 (200630) Final Exam Solutions. θ α β Γ(α)Γ(β) (uv)α 1 (v uv) β 1 exp v }
Statistics 35 Probability I Fall 6 (63 Final Exam Solutions Instructor: Michael Kozdron (a Solving for X and Y gives X UV and Y V UV, so that the Jacobian of this transformation is x x u v J y y v u v
More information