Standardization and Singular Value Decomposition in Canonical Correlation Analysis
Melinda Borello
Johanna Hardin, Advisor
David Bachman, Reader

Submitted to Pitzer College in Partial Fulfillment of the Degree of Bachelor of Arts

April 24, 2013
Department of Mathematics
Abstract

Canonical correlation analysis (CCA) is a type of multivariate analysis based on the correlation between linear combinations of variables in two data sets. Biological applications of CCA include large-scale genomic studies that can have multiple phenotypic or genotypic data sets. In these cases CCA can lead to results that lack interpretability, because CCA considers all variables and such analyses usually involve an enormous number of variables, with the number of genes exceeding tens of thousands. Sparse canonical correlation analysis (SCCA) aims to solve the problem of interpretability by providing sparse solutions. In this paper, I examine the relationship between running CCA and SCCA on raw, unstandardized data and running both methods on data that have been standardized to have mean zero and standard deviation one. I also show how both CCA and SCCA relate to the singular value decomposition (SVD) by examining an algorithm for the SVD in the context of CCA and SCCA.
Contents

Abstract
Acknowledgments
1 Canonical Correlation Analysis
2 Sparse Canonical Correlation Analysis
3 Singular Value Decomposition Algorithm
Bibliography
Acknowledgments

I'd like to thank Johanna Hardin for offering to be my thesis adviser at the last minute, when I thought it was too late for me to write a thesis. Her guidance throughout the research and writing process has been invaluable. I would also like to thank Associate Professor of Mathematics Stephen Garcia for his help with SVD and other linear algebra questions that we came upon throughout the semester. Without some of his explanations, Chapter 3 wouldn't exist. Of course, I would like to thank my mother, for all of my accomplishments are really hers. Thanks, Mom, for everything.
Chapter 1
Canonical Correlation Analysis

Canonical correlation analysis (CCA) measures the relationship between two sets of variables. CCA accomplishes this by focusing on the correlation between a linear combination of the variables in one set and a linear combination of the variables in the other set. Canonical correlation can also be thought of as an extension of bivariate correlation, allowing more than two continuous variables in each set. CCA seeks to answer how the best linear combination of one set of variables relates to the best linear combination of the other set. The following presentation of canonical correlation analysis follows that of Johnson and Wichern (1992).

Consider a group of $p$ variables represented by the $(p \times 1)$ random vector $X$, and a second group of $q$ variables represented by the $(q \times 1)$ random vector $Y$. Assume that $p \le q$. The random vectors have $\operatorname{Cov}(X) = \Sigma_{11}$, $\operatorname{Cov}(Y) = \Sigma_{22}$, and $\operatorname{Cov}(X, Y) = \Sigma_{12}$, with $E(X) = \mu_x$ and $E(Y) = \mu_y$. For coefficient vectors
$a$ and $b$ we can form the linear combinations $G = a'X$ and $H = b'Y$. Then
$$\max_{a,b} \operatorname{Corr}(G, H)$$
is attained by the linear combinations (the first canonical variate pair) $G_1 = a_1'X$ and $H_1 = b_1'Y$. The $k$th pair of canonical variates, $k = 2, 3, \ldots, p$,
$$G_k = a_k'X \quad \text{and} \quad H_k = b_k'Y,$$
where $a_k = \Sigma_{11}^{-1/2} u_k$ and $b_k = \Sigma_{22}^{-1/2} v_k$ for $k = 1, \ldots, p$, maximizes $\operatorname{Corr}(G, H)$ among those linear combinations uncorrelated with the preceding $1, 2, \ldots, k-1$ canonical variables. That is, the second canonical variate pair consists of the linear combinations $G_2$ and $H_2$ that maximize the correlation among all linear combinations uncorrelated with the first canonical variate pair. Then we have
$$\max_{a,b} \operatorname{Corr}(a'X, b'Y) = \operatorname{Corr}(a_k'X, b_k'Y) = \operatorname{Corr}(G_k, H_k) = \rho_k.$$
Here the $(p \times 1)$ vectors $u_1, u_2, \ldots, u_p$ and the $(q \times 1)$ vectors $v_1, v_2, \ldots, v_p$ are the left and right singular vectors, respectively, of the matrix
$$K = \Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1/2}.$$
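To make this construction concrete, the following is a minimal numerical sketch (not part of the original thesis) of how one might compute canonical correlations from data via the SVD of $K$. It assumes NumPy, uses sample covariance matrices in place of the population quantities, and the helper names inv_sqrt and canonical_correlations are my own.

```python
import numpy as np

def inv_sqrt(S):
    """Inverse symmetric square root of a positive definite matrix."""
    w, Q = np.linalg.eigh(S)
    return Q @ np.diag(1.0 / np.sqrt(w)) @ Q.T

def canonical_correlations(X, Y):
    """Classical CCA via the SVD of K = S11^{-1/2} S12 S22^{-1/2}.

    X is an (n x p) data matrix and Y is (n x q); sample covariances stand in
    for the population quantities Sigma_11, Sigma_12, Sigma_22.
    """
    n = X.shape[0]
    Xc, Yc = X - X.mean(axis=0), Y - Y.mean(axis=0)   # center each column
    S11 = Xc.T @ Xc / (n - 1)                         # Cov(X)
    S22 = Yc.T @ Yc / (n - 1)                         # Cov(Y)
    S12 = Xc.T @ Yc / (n - 1)                         # Cov(X, Y)
    K = inv_sqrt(S11) @ S12 @ inv_sqrt(S22)
    U, d, Vt = np.linalg.svd(K)        # singular values d are the canonical correlations
    A = inv_sqrt(S11) @ U              # columns a_k = Sigma_11^{-1/2} u_k
    B = inv_sqrt(S22) @ Vt.T           # columns b_k = Sigma_22^{-1/2} v_k
    return d, A, B
```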
The singular values $\rho_1 \ge \rho_2 \ge \cdots \ge \rho_p$ of the matrix $K$ are the canonical correlations.

In some instances, one may want to work with standardized variables, which allows the variables to be compared to each other more easily. There may be other motivations for standardization as well; for example, standardization can simplify computations. Suppose we standardize the original variables as follows:
$$Z_X = V_{11}^{-1/2}(X - \mu_x) \quad \text{and} \quad Z_Y = V_{22}^{-1/2}(Y - \mu_y),$$
where $V_{11}^{-1/2}$ is the $(p \times p)$ diagonal matrix with one over the standard deviation on its diagonal, i.e.,
$$V_{11}^{-1/2} = \operatorname{diag}\!\left(\frac{1}{\sqrt{\sigma_{x_{11}}}},\, \frac{1}{\sqrt{\sigma_{x_{22}}}},\, \ldots,\, \frac{1}{\sqrt{\sigma_{x_{pp}}}}\right).$$
The matrix $V_{22}^{-1/2}$ is similarly defined as a $(q \times q)$ diagonal matrix with one over the standard deviations of $Y$, $1/\sqrt{\sigma_{y_{ii}}}$, on its diagonal. Here $\rho_{11} = \operatorname{Cov}(Z_X)$, $\rho_{22} = \operatorname{Cov}(Z_Y)$, and $\rho_{12} = \operatorname{Cov}(Z_X, Z_Y)$, with $E(Z_X) = E(Z_Y) = 0$. Let the $(p \times 1)$ vectors $e_1, e_2, \ldots, e_p$ and the $(q \times 1)$ vectors $f_1, f_2, \ldots, f_p$ be the left and right singular vectors of $L$, respectively, where
$$L = \rho_{11}^{-1/2}\, \rho_{12}\, \rho_{22}^{-1/2}.$$
Coefficient vectors $\alpha_k$ and $\beta_k$ form the $k$th pair of canonical variates
$$M_k = \alpha_k' Z_X, \qquad N_k = \beta_k' Z_Y.$$
Then we have
$$\max_{\alpha,\beta} \operatorname{Corr}(\alpha' Z_X, \beta' Z_Y) = \operatorname{Corr}(\alpha_k' Z_X, \beta_k' Z_Y) = \operatorname{Corr}(M_k, N_k) = \rho_k.$$
Here $\rho_1 \ge \rho_2 \ge \cdots \ge \rho_p$ are the singular values of the matrix $L$, and $\alpha_k = \rho_{11}^{-1/2} e_k$ and $\beta_k = \rho_{22}^{-1/2} f_k$.

Theorem 1.1. Given coefficient vectors $\alpha = \rho_{11}^{-1/2} e$ and $\beta = \rho_{22}^{-1/2} f$ from scaled data $Z_X$ and $Z_Y$, and coefficient vectors $a = \Sigma_{11}^{-1/2} u$ and $b = \Sigma_{22}^{-1/2} v$ from raw data $X$ and $Y$, if $V_{11}^{1/2}$ and $V_{22}^{1/2}$ are diagonal matrices with $i$th diagonal elements $\sqrt{\sigma_{x_{ii}}}$ and $\sqrt{\sigma_{y_{ii}}}$, respectively, then $\alpha_k = V_{11}^{1/2} a_k$ and $\beta_k = V_{22}^{1/2} b_k$; i.e., the coefficients from scaled data are equal to scaled coefficients from raw data in classical canonical correlation analysis.

Proof: We will show that $e_k' \rho_{11}^{-1/2} = u_k' \Sigma_{11}^{-1/2} V_{11}^{1/2}$, or equivalently, that $\Sigma_{11}^{-1/2} V_{11}^{1/2} = \rho_{11}^{-1/2}$ and that $e_k = u_k$. By showing that these two equalities hold, we will prove the theorem, since (without loss of generality, arguing for $\alpha_k$)
$$\alpha_k' = e_k' \rho_{11}^{-1/2} = u_k' \Sigma_{11}^{-1/2} V_{11}^{1/2},$$
so that
$$\alpha_k' = a_k' V_{11}^{1/2} = \bigl(V_{11}^{1/2} a_k\bigr)', \qquad \text{i.e.,} \qquad \alpha_k = V_{11}^{1/2} a_k.$$
Consider $\rho_{11}$:
$$\begin{aligned}
\rho_{11} = \operatorname{Cov}(Z_X) &= \operatorname{Cov}\!\left(V_{11}^{-1/2}(X - \mu_x)\right) \\
&= V_{11}^{-1/2} \operatorname{Cov}(X - \mu_x)\, V_{11}^{-1/2} \qquad (1.1) \\
&= V_{11}^{-1/2} \operatorname{Cov}(X)\, V_{11}^{-1/2} \qquad (1.2) \\
&= V_{11}^{-1/2} \Sigma_{11} V_{11}^{-1/2} \\
&= V_{11}^{-1/2} \Sigma_{11}^{1/2} \Sigma_{11}^{1/2} V_{11}^{-1/2}.
\end{aligned}$$
Line (1.1) uses the fact that $V_{11}^{-1/2}$ is diagonal and thus symmetric, so it is equal to its transpose; line (1.2) follows from (1.1) because subtracting the constant $\mu_x$ does not change the covariance. Since $\rho_{11}$ is positive semi-definite, taking the square root of $\rho_{11}$ results in
$$\rho_{11}^{1/2} = V_{11}^{-1/2} \Sigma_{11}^{1/2}, \qquad \text{so} \qquad \rho_{11}^{-1/2} = \Sigma_{11}^{-1/2} V_{11}^{1/2}.$$
A similar argument for $\rho_{22}$ shows that $\rho_{22}^{-1/2} = \Sigma_{22}^{-1/2} V_{22}^{1/2}$.

Next, does $u_k = e_k$? In other words, does the left singular vector of $K$ equal the left singular vector of $L$? We just proved that $\rho_{11}^{-1/2} = \Sigma_{11}^{-1/2} V_{11}^{1/2}$ and $\rho_{22}^{-1/2} = \Sigma_{22}^{-1/2} V_{22}^{1/2}$. Recall that $\rho_{12} = \operatorname{Cov}(Z_X, Z_Y)$. Thus we can
write $L$ as
$$\begin{aligned}
L &= \rho_{11}^{-1/2}\, \rho_{12}\, \rho_{22}^{-1/2} \\
&= \Sigma_{11}^{-1/2} V_{11}^{1/2}\, \operatorname{Cov}(Z_X, Z_Y)\, \Sigma_{22}^{-1/2} V_{22}^{1/2} \\
&= \Sigma_{11}^{-1/2} V_{11}^{1/2}\, \operatorname{Cov}\!\left(V_{11}^{-1/2}(X - \mu_x),\, V_{22}^{-1/2}(Y - \mu_y)\right) \Sigma_{22}^{-1/2} V_{22}^{1/2} \\
&= \Sigma_{11}^{-1/2} V_{11}^{1/2} V_{11}^{-1/2}\, \operatorname{Cov}(X - \mu_x, Y - \mu_y)\, V_{22}^{-1/2} \Sigma_{22}^{-1/2} V_{22}^{1/2} \qquad (1.3) \\
&= \Sigma_{11}^{-1/2}\, \operatorname{Cov}(X, Y)\, V_{22}^{-1/2} \bigl(\Sigma_{22}^{-1/2} V_{22}^{1/2}\bigr)' \qquad (1.4) \\
&= \Sigma_{11}^{-1/2}\, \operatorname{Cov}(X, Y)\, V_{22}^{-1/2} \bigl(V_{22}^{1/2}\bigr)' \bigl(\Sigma_{22}^{-1/2}\bigr)' \qquad (1.5) \\
&= \Sigma_{11}^{-1/2}\, \operatorname{Cov}(X, Y)\, V_{22}^{-1/2} V_{22}^{1/2} \Sigma_{22}^{-1/2} \\
&= \Sigma_{11}^{-1/2}\, \operatorname{Cov}(X, Y)\, \Sigma_{22}^{-1/2} \\
&= \Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1/2} \\
&= K.
\end{aligned}$$
Moving from line (1.3) to (1.4), note that, as proven above, $\Sigma_{22}^{-1/2} V_{22}^{1/2} = \rho_{22}^{-1/2}$, which is positive definite and symmetric, so it may be replaced by its transpose. Line (1.5) follows from (1.4) since $\Sigma_{22}^{-1/2}$ and $V_{22}^{1/2}$ are both symmetric. Since $L = K$, it is clear that $L$ and $K$ have the same singular vectors, since they have the same singular value decomposition. Therefore $V_{11}^{1/2} a_k$ is in fact the coefficient vector $\alpha_k$ for the $k$th canonical variate $M_k$ constructed from the standardized variable $Z_X$, and $V_{22}^{1/2} b_k$ is the coefficient vector $\beta_k$ for the $k$th canonical variate $N_k$ constructed from the standardized variable $Z_Y$.

Claim 1.2. If (i) $a_k$ and $b_k$ maximize $\operatorname{Corr}(a'X, b'Y)$ over all $a$ and $b$ subject to being uncorrelated with $a_i$ and $b_i$ for $i = 1, 2, \ldots, k-1$, and (ii) $\operatorname{Corr}(G_k, H_k) =$
$\rho_k$, then (i) $\max_{\alpha,\beta} \operatorname{Corr}(\alpha' Z_X, \beta' Z_Y) = \operatorname{Corr}(\alpha_k' Z_X, \beta_k' Z_Y)$ subject to $\alpha$ and $\beta$ being uncorrelated with $\alpha_i$ and $\beta_i$ for $i = 1, 2, \ldots, k-1$, and (ii) $\operatorname{Corr}(M_k, N_k) = \rho_k$ as well; i.e., the canonical correlations are unchanged by the standardization.

We will now show that raw and standardized canonical variates produce the same maximum correlation. From the raw data, $G_k$ and $H_k$ are the $k$th pair of canonical variates that maximize $\operatorname{Corr}(G, H)$. If this correlation is equal to $\rho_k$, then
$$\begin{aligned}
\rho_k = \operatorname{Corr}(G_k, H_k) &= \operatorname{Corr}(a_k' X, b_k' Y) \\
&= \operatorname{Corr}\!\left(a_k'(X - \mu_x),\, b_k'(Y - \mu_y)\right) \\
&= \operatorname{Corr}\!\left(a_k' V_{11}^{1/2} V_{11}^{-1/2} (X - \mu_x),\, b_k' V_{22}^{1/2} V_{22}^{-1/2} (Y - \mu_y)\right) \\
&= \operatorname{Corr}\!\left(a_k' V_{11}^{1/2} Z_X,\, b_k' V_{22}^{1/2} Z_Y\right) \\
&= \operatorname{Corr}(\alpha_k' Z_X, \beta_k' Z_Y) \\
&= \operatorname{Corr}(M_k, N_k).
\end{aligned}$$
Thus the canonical correlations are unchanged by the standardization. We know that $G_k$ and $H_k$ maximize the correlation for the raw data, but do $M_k$ and $N_k$ maximize the correlation for the standardized data? Suppose not. Suppose there were a larger correlation, $\rho_k^*$, attainable by a pair of standardized canonical variates. Then $\rho_k^*$ would also be attainable by raw canonical variates, since, as shown above, the correlation of a standardized pair equals the correlation of the corresponding raw pair; this contradicts the fact that $\rho_k$ is the maximum correlation attained by $G_k$ and $H_k$. Hence $M_k$ and $N_k$ do maximize the correlation for the standardized data.
Consider the matrices $K$ and $L$. As shown above, $K = L$, and thus they have the same singular vectors. It follows that they also have the same singular values, $\rho_k$. Furthermore, these singular values are the correlations for the canonical variate pairs (detailed in Chapter 3). Hence the pairs $G_k$ and $H_k$, and $M_k$ and $N_k$, have the same (maximum) correlation value. Ultimately, CCA provides the same results whether one uses scaled data or uses raw data and scales the coefficients by $V_{11}^{1/2}$ and $V_{22}^{1/2}$ (respectively) at the end. Both raw and standardized variables maximize the correlation of linear combinations of the variables from the two data sets.
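As a quick numerical sanity check of Theorem 1.1 and Claim 1.2 (my own illustration, not part of the thesis), the sketch below compares CCA on raw and on standardized data. It assumes NumPy and reuses the hypothetical canonical_correlations helper from earlier in this chapter; the canonical correlations should agree, and the raw-data coefficients, scaled by the sample standard deviations, should match the standardized-data coefficients up to sign.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, q = 500, 3, 4
X = rng.normal(size=(n, p)) * np.array([1.0, 5.0, 0.2])   # columns on very different scales
Y = 0.5 * X[:, [0, 1, 2, 0]] + rng.normal(size=(n, q))

d_raw, A_raw, B_raw = canonical_correlations(X, Y)

Z_X = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)          # Z_X = V11^{-1/2}(X - mu_x)
Z_Y = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
d_std, A_std, B_std = canonical_correlations(Z_X, Z_Y)

print(np.allclose(d_raw, d_std))                             # canonical correlations unchanged
A_scaled = X.std(axis=0, ddof=1)[:, None] * A_raw            # alpha_k = V11^{1/2} a_k
print(np.allclose(np.abs(A_scaled), np.abs(A_std)))          # equal up to sign of singular vectors
```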
Chapter 2
Sparse Canonical Correlation Analysis

As presented above, canonical correlation analysis uses all variables from both sets, $X$ and $Y$, to create the canonical vectors. Yet when applied to real data, CCA often fails to produce interpretable results. Data used in microarray analysis and genome-wide linkage analysis usually have an enormous number of variables, with the number of genes exceeding tens of thousands, so results from CCA lack biological interpretability. One way to combat this issue is sparse canonical correlation analysis (SCCA). SCCA helps solve the problem of interpretability by providing sparse sets of associated variables; i.e., the canonical vectors contain sparse loadings. Following a method to select appropriate sparseness parameters, the sparse solution contains the variables that are deemed more
important than others. Thus the solution reduces the dimensionality, which improves interpretability.

The iterative algorithm below is the one presented by Parkhomenko et al. (2009). It uses soft-thresholding as its penalty function. Note that there are a number of different penalty functions one can use for SCCA, which have been outlined and compared by Chalise and Fridley (2012). Following classical CCA, consider two sets of variables $X$ and $Y$, with $p$ variables in $X$ and $q$ variables in $Y$. As before, let $K = \Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1/2}$, where $\operatorname{Cov}(X) = \Sigma_{11}$, $\operatorname{Cov}(Y) = \Sigma_{22}$, and $\operatorname{Cov}(X, Y) = \Sigma_{12}$. The first sparse canonical vectors are identified using the following algorithm (a code sketch of this iteration appears after the discussion below):

1. Select sparseness parameters $\lambda_u$ and $\lambda_v$.
2. Select initial values $u^0$ and $v^0$ and set $i = 0$.
3. Update $u$:
   (a) $u^{i+1} \leftarrow K v^i$
   (b) Normalize: $u^{i+1} \leftarrow u^{i+1} / \|u^{i+1}\|$
   (c) Apply soft-thresholding to obtain a sparse solution: $u_j^{i+1} \leftarrow \left(|u_j^{i+1}| - \tfrac{1}{2}\lambda_u\right)_+ \operatorname{Sign}(u_j^{i+1})$ for $j = 1, \ldots, p$
   (d) Normalize: $u^{i+1} \leftarrow u^{i+1} / \|u^{i+1}\|$
4. Update $v$:
   (a) $v^{i+1} \leftarrow K' u^i$
   (b) Normalize: $v^{i+1} \leftarrow v^{i+1} / \|v^{i+1}\|$
   (c) Apply soft-thresholding to obtain a sparse solution: $v_j^{i+1} \leftarrow \left(|v_j^{i+1}| - \tfrac{1}{2}\lambda_v\right)_+ \operatorname{Sign}(v_j^{i+1})$ for $j = 1, \ldots, q$
   (d) Normalize: $v^{i+1} \leftarrow v^{i+1} / \|v^{i+1}\|$
5. $i \leftarrow i + 1$
6. Repeat steps 3–5 until convergence.

Here $(x)_+$ is equal to $x$ if $x \ge 0$ and $0$ if $x < 0$, and
$$\operatorname{sign}(x) = \begin{cases} -1 & \text{if } x < 0 \\ \phantom{-}1 & \text{if } x > 0 \\ \phantom{-}0 & \text{if } x = 0. \end{cases}$$

Parkhomenko et al. (2009) replace $\Sigma_{11}$ and $\Sigma_{22}$ by $\operatorname{diag}(\Sigma_{11})$ and $\operatorname{diag}(\Sigma_{22})$; thus $K$ becomes an approximation of the sample correlation matrix of $X$ and $Y$ that carries none of the information about how the variables within $X$ are correlated or how the variables within $Y$ are correlated. Using $\operatorname{diag}(\Sigma_{11})$ avoids computational problems: the computation of $K$ would otherwise require $(X'X)^{-1}$ and $(Y'Y)^{-1}$, which may not exist when the number of variables is greater than the number of observations (when $p$ or $q$ is larger than $n$). This is common in the biological applications discussed above, where one can have tens of thousands of genes but only a few hundred observations. The authors also select initial values so that $u^0$ is the vector of row means of $K$ and $v^0$ is the vector of column means of $K$, both standardized to have unit length.
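The sketch below (not from the original thesis) is a minimal implementation of the iteration above, assuming NumPy and assuming $\lambda_u$ and $\lambda_v$ are small enough that the thresholded vectors remain nonzero. The function names soft_threshold and sparse_cca, and the convergence tolerance, are my own choices; $K$ is taken as given, for example built with the diagonal-covariance approximation just described.

```python
import numpy as np

def soft_threshold(w, lam):
    """Componentwise (|w_j| - lam/2)_+ * sign(w_j)."""
    return np.maximum(np.abs(w) - 0.5 * lam, 0.0) * np.sign(w)

def sparse_cca(K, lam_u, lam_v, max_iter=1000, tol=1e-6):
    """First pair of sparse canonical vectors via the soft-thresholded iteration."""
    u = K.mean(axis=1)                        # row means of K as starting value
    v = K.mean(axis=0)                        # column means of K
    u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)
    for _ in range(max_iter):
        u_new = K @ v                         # step 3(a)
        u_new /= np.linalg.norm(u_new)        # 3(b)
        u_new = soft_threshold(u_new, lam_u)  # 3(c)
        u_new /= np.linalg.norm(u_new)        # 3(d)
        v_new = K.T @ u                       # step 4(a), using the previous u as in the listing
        v_new /= np.linalg.norm(v_new)
        v_new = soft_threshold(v_new, lam_v)
        v_new /= np.linalg.norm(v_new)
        converged = (np.linalg.norm(u_new - u) < tol and np.linalg.norm(v_new - v) < tol)
        u, v = u_new, v_new
        if converged:
            break
    return u, v
```

In practice this iteration would be wrapped in the cross-validation loop described next to choose $\lambda_u$ and $\lambda_v$.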
Although the authors assume that $X$ and $Y$ have been standardized to $Z_X$ and $Z_Y$, SCCA applied to raw data produces the same results as SCCA applied to standardized data. As in classical CCA, using scaled and centered data or using raw data does not affect $K$, nor does it affect the left and right singular vectors of $K$. Since the initial values $u^0$ and $v^0$ come from $K$, and since $K = L$, there is no difference in the algorithm between raw and standardized variables. In this context, we are interested in the effect of scaling and centering the variables on the soft-thresholding step.

The sparseness parameters $\lambda_u$ and $\lambda_v$ are chosen using $k$-fold cross-validation, as outlined below (a code sketch follows the list):

1. Choose $\lambda_u$ and $\lambda_v$.
2. Remove $\frac{1}{k}$ of the data (the testing sample).
3. Find canonical coefficients from the remaining $\frac{k-1}{k}$ of the data (the training sample).
4. Find the canonical correlation using the testing sample and the coefficients from the training sample.
5. Repeat steps 2–4 $k$ times and average across the $k$ correlations, keeping this value.
6. Repeat steps 1–5 for new $\lambda_u$ and $\lambda_v$.

From this process, we find the optimal combination of $\lambda_u$ and $\lambda_v$: out of all candidate pairs of sparseness parameters, we keep the pair that corresponds to the highest average test-sample correlation.
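A minimal sketch of this cross-validation loop (my own illustration, not code from Parkhomenko et al.) could look as follows. It assumes NumPy, reuses the hypothetical sparse_cca helper above, and builds $K$ from each split using the diagonal-covariance approximation; for simplicity the test-sample variates are formed directly from the raw test data.

```python
import numpy as np
from itertools import product

def build_K(X, Y):
    """Approximate K: cross-correlations of X and Y columns (diag(S11), diag(S22) in place of S11, S22)."""
    Xs = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    Ys = (Y - Y.mean(axis=0)) / Y.std(axis=0, ddof=1)
    return Xs.T @ Ys / (X.shape[0] - 1)

def choose_lambdas(X, Y, grid_u, grid_v, k=5, seed=0):
    """Pick (lambda_u, lambda_v) with the highest average test-sample correlation."""
    n = X.shape[0]
    folds = np.array_split(np.random.default_rng(seed).permutation(n), k)
    best, best_score = None, -np.inf
    for lam_u, lam_v in product(grid_u, grid_v):
        scores = []
        for test_idx in folds:
            train_idx = np.setdiff1d(np.arange(n), test_idx)
            u, v = sparse_cca(build_K(X[train_idx], Y[train_idx]), lam_u, lam_v)
            g, h = X[test_idx] @ u, Y[test_idx] @ v      # canonical variates on the testing sample
            scores.append(abs(np.corrcoef(g, h)[0, 1]))
        if np.mean(scores) > best_score:
            best, best_score = (lam_u, lam_v), np.mean(scores)
    return best, best_score
```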
This process is not affected by scaling the variables, since it looks at the correlation between canonical variates in the testing and training samples, and we have already proven that this correlation is the same for raw and standardized canonical vectors. Thus the choices of sparseness parameters are not affected by the scaling of the variables. (For more on the selection of sparseness parameters, consult Parkhomenko et al. (2009).)

The relationship between the coefficients from raw data and the coefficients from standardized data is the same in the sparse setting as it was in CCA. The difference is that $u$ and $v$ will have sparse loadings, so our coefficient vectors will also be sparse. As before, if $\alpha_k$ and $\beta_k$ are coefficients for the standardized data $Z_X$ and $Z_Y$, and $a_k$ and $b_k$ are coefficients for the raw data $X$ and $Y$, then $\alpha_k = V_{11}^{1/2} a_k$ and $\beta_k = V_{22}^{1/2} b_k$, where $\operatorname{Corr}(\alpha_k' Z_X, \beta_k' Z_Y) = \operatorname{Corr}(a_k' X, b_k' Y)$. That is, Theorem 1.1 holds for SCCA.
Chapter 3
Singular Value Decomposition Algorithm

One may notice that the sparse algorithm is nothing more than the standard iterative algorithm for the singular value decomposition (SVD) with an added soft-thresholding step to obtain a sparse vector. The standard algorithm is:

1. Select initial values $u^0$ and $v^0$ and set $i = 0$.
2. Update $u$:
   (a) $u^{i+1} \leftarrow K v^i$
   (b) Normalize: $u^{i+1} \leftarrow u^{i+1} / \|u^{i+1}\|$
3. Update $v$:
   (a) $v^{i+1} \leftarrow K' u^i$
   (b) Normalize: $v^{i+1} \leftarrow v^{i+1} / \|v^{i+1}\|$
4. $i \leftarrow i + 1$
5. Repeat steps 2–4 until convergence.

Why does this provide the singular vectors of $K$? Recall that the SVD of $K$ is $K = UDV'$, where we choose $U$ and $V$ to be orthogonal matrices, i.e., $U' = U^{-1}$ and $V' = V^{-1}$. Thus the column vectors of $U$ and $V$, $u_i$ and $v_i$, are unit vectors and are the left and right singular vectors of $K$. The matrix $D$ is a diagonal matrix with the singular values of $K$ on its diagonal, i.e., the square roots of the eigenvalues of $K'K$ or $KK'$. Alternately multiplying $v$ by $K$ and $u$ by $K'$ produces the singular vectors and singular values of $K$, once the following is noted:
$$K v_i = U D V' v_i = d_i u_i.$$
Observe that $V'v_i$ is a column vector of zeros except for a $1$ in the $i$th position, since this is equivalent to taking the inner product of $v_i$ with each of the columns of $V$. This then results in $DV'v_i$, a column vector of zeros except for $d_i$ in the $i$th position. Thus $UDV'v_i$ is the $i$th column of $U$ multiplied by the singular value $d_i$. Next we normalize to get a unit vector, giving us
$$\frac{d_i u_i}{\|d_i u_i\|} = \frac{d_i u_i}{d_i \|u_i\|} = \frac{u_i}{\|u_i\|} = u_i.$$
Similarly, $K'u_i = VDU'u_i = d_i v_i$, which we normalize to get $v_i$. This iterative process will produce the largest diagonal entry of $D$, which is the largest singular value of $K$; in the context of CCA and SCCA, this is the largest correlation, $\rho_1$, between the canonical variates. By alternately multiplying $v$ by $K$ and $u$ by $K'$, the starting vectors $u^0$ and $v^0$ are pulled toward the direction associated with the largest singular value of $K$, until $u$ and $v$ become the corresponding singular vectors of $K$.
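As an illustration (not part of the thesis), a bare-bones version of this power iteration for the leading singular triple might look as follows, assuming NumPy; the function name power_iteration_svd is my own. The only deviation from the listing above is that the $v$ update uses the freshly updated $u$ rather than the previous one, a common variant that keeps the signs of the two vectors aligned.

```python
import numpy as np

def power_iteration_svd(K, max_iter=1000, tol=1e-10):
    """Leading singular value and vectors of K by alternating multiplication by K and K'."""
    rng = np.random.default_rng(0)
    u = rng.normal(size=K.shape[0])
    v = rng.normal(size=K.shape[1])
    u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)
    for _ in range(max_iter):
        u_new = K @ v                       # u <- Kv, then normalize
        u_new /= np.linalg.norm(u_new)
        v_new = K.T @ u_new                 # v <- K'u, using the freshly updated u
        v_new /= np.linalg.norm(v_new)
        converged = (np.linalg.norm(u_new - u) < tol and np.linalg.norm(v_new - v) < tol)
        u, v = u_new, v_new
        if converged:
            break
    d = u @ K @ v                           # largest singular value of K
    return d, u, v
```

Applied to the CCA matrix $K$ from Chapter 1, the returned $d$ is the first canonical correlation $\rho_1$, and $u$ and $v$ are the singular vectors from which $a_1$ and $b_1$ are obtained.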
Notice that if we construct a symmetric matrix $A$ such that
$$A = \begin{pmatrix} 0 & K' \\ K & 0 \end{pmatrix},$$
the problem is reduced to an eigenvalue problem. We wish to solve $Ax = dx$, where $d$ is the dominant singular value of $K$ (which is also the dominant eigenvalue of $A$) and $x$ is the vector $\begin{pmatrix} v \\ u \end{pmatrix}$. Note that applying $A$ to $x$ leads to
$$A \begin{pmatrix} v \\ u \end{pmatrix} = \begin{pmatrix} K'u \\ Kv \end{pmatrix} = \begin{pmatrix} dv \\ du \end{pmatrix}$$
if $u$ and $v$ are the corresponding singular vectors of $K$. Thus putting $K$ and $K'$ into a block matrix results in a symmetric matrix $A$ with dominant eigenvalue $d$ and eigenvector $\begin{pmatrix} v \\ u \end{pmatrix}$. Now, instead of finding singular vectors and singular values, we just need to find the eigenvectors and eigenvalues of $A$.
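A tiny check of this reformulation (my own illustration, assuming NumPy) builds $A$ from a random $K$ and compares its largest eigenvalue with the largest singular value of $K$.

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 4, 6
K = rng.normal(size=(p, q))

# A = [[0, K'], [K, 0]] acting on x = (v; u)
A = np.block([[np.zeros((q, q)), K.T],
              [K, np.zeros((p, p))]])

eigvals, eigvecs = np.linalg.eigh(A)            # A is symmetric
d_top = np.linalg.svd(K, compute_uv=False)[0]   # largest singular value of K
print(np.isclose(eigvals[-1], d_top))           # dominant eigenvalue of A equals d_1

x = eigvecs[:, -1]                              # eigenvector (v; u), up to scaling
v, u = x[:q], x[q:]
print(np.allclose(K @ v, d_top * u))            # Kv = d u
```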
The above algorithm for the SVD can be used for CCA and SCCA: it produces the components needed for the canonical variates (the singular vectors) and it produces the canonical correlations (the singular values). How does the SVD produce canonical correlations? Recall that in CCA we seek coefficient vectors $a$ and $b$ such that
$$\operatorname{Corr}(G, H) = \frac{a' \Sigma_{12} b}{\sqrt{a' \Sigma_{11} a}\, \sqrt{b' \Sigma_{22} b}}$$
is maximized. Notice that we can reduce this expression as follows:
$$\begin{aligned}
\max_{a,b} \operatorname{Corr}(G, H) &= \max_{a,b} \frac{a' \Sigma_{12} b}{\sqrt{a' \Sigma_{11} a}\, \sqrt{b' \Sigma_{22} b}} \qquad (3.1) \\
&= \max_{u,v} \frac{u' \Sigma_{11}^{-1/2} \Sigma_{12} \Sigma_{22}^{-1/2} v}{\sqrt{u' \Sigma_{11}^{-1/2} \Sigma_{11} \Sigma_{11}^{-1/2} u}\, \sqrt{v' \Sigma_{22}^{-1/2} \Sigma_{22} \Sigma_{22}^{-1/2} v}} \qquad (3.2) \\
&= \max_{u,v} \frac{u' K v}{\sqrt{u' I_{p \times p}\, u}\, \sqrt{v' I_{q \times q}\, v}} \\
&= \max_{u,v} \frac{u' K v}{\|u\|\, \|v\|}. \qquad (3.3)
\end{aligned}$$
Line (3.2) follows from (3.1) by the change of variables $u = \Sigma_{11}^{1/2} a$ and $v = \Sigma_{22}^{1/2} b$. Line (3.3) is equivalent to finding $u$ and $v$ that are unit vectors and maximizing over those vectors. Under this condition, (3.3) boils down to $\max_{u,v} u' K v$. Substituting the singular value decomposition for $K$ gives
$$\max_{u,v}\, u' U D V' v. \qquad (3.4)$$
When $u$ and $v$ are singular vectors of $K$, the calculations above show that (3.4) is attained at the first pair:
$$\max_{u,v}\, u' U D V' v = u_1' U D V' v_1 = u_1'\, d_1 u_1 = d_1\, u_1' u_1 = d_1.$$
The first singular vectors of $K$ produce the largest singular value, as we saw in the above algorithm: the iterative process used to find $u_1$ and $v_1$ is always pulled toward the direction associated with the largest singular value of $K$. Thus the SVD produces the canonical correlations.

To summarize, the SVD is a helpful tool that can be used to carry out CCA and SCCA. It provides an algorithm that finds the linear combinations of the variables in each data set that produce the maximum correlation. In both CCA and SCCA, standardizing the variables to have mean zero and standard deviation one does not change the canonical correlations, and the coefficients from the standardized data are equal to the coefficients from the raw data scaled by their standard deviations. Therefore, it is at the discretion of the researcher whether to use raw data or standardized data in CCA or SCCA.
Bibliography

Chalise, P. and Fridley, B. L. (2012). Comparison of penalty functions for sparse canonical correlation analysis. Comput. Statist. Data Anal., 56(2).
Johnson, R. A. and Wichern, D. W. (1992). Applied Multivariate Statistical Analysis. Prentice Hall Inc., Englewood Cliffs, NJ, third edition.
Parkhomenko, E., Tritchler, D., and Beyene, J. (2009). Sparse canonical correlation analysis with application to genomic data integration. Stat. Appl. Genet. Mol. Biol., 8: Art. 1, 36.