Notes on Implementation of Component Analysis Techniques

Dr. Stefanos Zafeiriou

January 205

1 Computing Principal Component Analysis

Assume that we have a matrix of centered data observations

    X = [x_1 − µ, ..., x_N − µ]    (1)

where µ denotes the mean vector. X has size F × N, where F is the number of dimensions and N is the number of observations. The covariance matrix of the data is given by

    S_t = (1/N) Σ_{i=1}^{N} (x_i − µ)(x_i − µ)^T = (1/N) X X^T    (2)

In Principal Component Analysis (PCA), we aim to find the projection matrix W that maximizes the variance of the projected data, i.e.

    W_0 = arg max_W tr(W^T S_t W)  subject to  W^T W = I    (3)

The solution of Eq. 3 can be derived by solving

    S_t W = W Λ    (4)

Thus, we need to perform eigenanalysis on S_t. If we want to keep d principal components, the computational cost of this operation is O(dF^2). If F is large, this computation can be quite expensive.

Lemma 1. Let us assume that B = X X^T and C = X^T X. It can be proven that B and C have the same positive eigenvalues Λ and, assuming that N < F, the eigenvectors U of B and the eigenvectors V of C are related as U = X V Λ^{-1/2}.
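Lemma 1 is easy to check numerically. The following sketch (a NumPy illustration, not part of the original notes; the variable names are ours) draws a random centered F × N matrix with N < F and verifies both claims: the positive eigenvalues of B = X X^T and C = X^T X coincide, and U = X V Λ^{-1/2} gives orthonormal eigenvectors of B.

```python
import numpy as np

rng = np.random.default_rng(0)
F, N = 100, 10                          # high-dimensional data, few samples (N < F)
X = rng.standard_normal((F, N))
X = X - X.mean(axis=1, keepdims=True)   # center the observations

B = X @ X.T                             # F x F (expensive to eigendecompose directly)
C = X.T @ X                             # N x N (cheap)

lam_B = np.linalg.eigvalsh(B)           # eigh/eigvalsh return ascending eigenvalues
lam_C, V = np.linalg.eigh(C)

# The positive eigenvalues coincide; B just has F - N additional zeros
assert np.allclose(lam_B[-N:], lam_C, atol=1e-8)

# Keep the strictly positive eigenvalues (centering makes one of them zero)
pos = lam_C > 1e-10
V, lam = V[:, pos], lam_C[pos]

# Recover the eigenvectors of B via U = X V Lambda^{-1/2}; they are orthonormal
U = X @ V / np.sqrt(lam)
assert np.allclose(U.T @ U, np.eye(pos.sum()), atol=1e-8)
assert np.allclose(B @ U, U * lam, atol=1e-6)   # B U = U Lambda
```

Note that only the N × N matrix C is eigendecomposed; the F × F matrix B is formed here purely for the check.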
Figure 1: Example of data whitening using the PCA projection matrix W = U Λ^{-1/2}.

Using Lemma 1 we can compute the eigenvectors U of S_t in O(N^3). We denote the eigenanalysis of X^T X by

    X^T X = V Λ V^T    (5)

where V is an N × (N − 1) matrix with the eigenvectors as columns and Λ is an (N − 1) × (N − 1) diagonal matrix with the eigenvalues (one eigenvalue is zero because the data are centered). Given that V^T V = I but V V^T ≠ I, we have

    X^T X = V Λ V^T  ⇒  U = X V Λ^{-1/2}

and we can verify that U indeed diagonalizes X X^T:

    U^T X X^T U = Λ^{-1/2} V^T (X^T X)(X^T X) V Λ^{-1/2}
                = Λ^{-1/2} (V^T V) Λ (V^T V) Λ (V^T V) Λ^{-1/2}
                = Λ^{-1/2} Λ Λ Λ^{-1/2}
                = Λ

The pseudocode for computing PCA is

Algorithm 1 Principal Component Analysis
1: procedure PCA
2:   Compute the dot-product matrix X^T X, with entries (x_i − µ)^T (x_j − µ)
3:   Eigenanalysis: X^T X = V Λ V^T
4:   Compute eigenvectors: U = X V Λ^{-1/2}
5:   Keep a specific number of first components: U_d = [u_1, ..., u_d]
6:   Compute the d features: Y = U_d^T X    (6)

The covariance matrix of Y is then Y Y^T = U^T X X^T U = Λ. The final solution of Eq. 3 is given by the projection matrix

    W = U Λ^{-1/2}    (7)

which normalizes the data to have unit variance (Figure 1). This procedure is called whitening (or sphering).
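Algorithm 1 translates almost line by line into code. The sketch below (a NumPy illustration, not part of the original notes; the function name pca_whiten and the toy data are ours) computes the eigenvectors through the N × N dot-product matrix, as appropriate when F ≫ N, and returns both the d projected features and the whitening matrix W = U Λ^{-1/2} of Eq. 7.

```python
import numpy as np

def pca_whiten(data, d):
    """PCA via the N x N dot-product matrix (Algorithm 1), intended for F >> N.

    data : (F, N) array with one observation per column.
    Returns the d features Y, the whitening matrix W = U Lambda^{-1/2}, and the mean.
    """
    mu = data.mean(axis=1, keepdims=True)
    X = data - mu                          # centered observations

    # Step 2: N x N dot-product matrix instead of the F x F covariance
    G = X.T @ X

    # Step 3: eigenanalysis G = V Lambda V^T (eigh returns ascending eigenvalues)
    lam, V = np.linalg.eigh(G)
    order = np.argsort(lam)[::-1][:d]      # keep the d largest
    lam, V = lam[order], V[:, order]

    # Step 4: lift to eigenvectors of X X^T: U = X V Lambda^{-1/2}
    U = X @ V / np.sqrt(lam)

    # Steps 5-6: project, and form the whitening matrix W = U Lambda^{-1/2}
    Y = U.T @ X
    W = U / np.sqrt(lam)
    return Y, W, mu

# Toy data: 20 samples in 50 dimensions, lying near a 2-D subspace
rng = np.random.default_rng(1)
data = rng.standard_normal((50, 2)) @ rng.standard_normal((2, 20)) \
       + 0.01 * rng.standard_normal((50, 20))
Y, W, mu = pca_whiten(data, d=2)

# Whitened features have identity covariance: W^T X X^T W = I
Z = W.T @ (data - mu)
print(np.round(Z @ Z.T, 3))                # ~ 2 x 2 identity
```

Since Y Y^T = Λ, dividing U by the square roots of the eigenvalues is exactly what rescales each principal direction to unit variance.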
2 Computing Linear Discriminant Analysis

As explained before (Section 1), PCA finds the principal components that maximize the data variance without taking the class labels into account. In contrast, Linear Discriminant Analysis (LDA) computes the linear directions that maximize the separation between multiple classes. This is mathematically expressed as maximizing

    W_0 = arg max_W tr(W^T S_b W)  subject to  W^T S_w W = I    (8)

Assume that we have C classes, denoted by c_i = [x_1, ..., x_{N_{c_i}}], i = 1, ..., C, where each x_j has F dimensions and µ(c_i) is the mean vector of class c_i. Thus, the overall data matrix is X = [c_1, ..., c_C] with size F × N (N = Σ_{i=1}^{C} N_{c_i}) and µ is the overall mean (mean of means). S_w is the within-class scatter matrix

    S_w = Σ_{j=1}^{C} S_j = Σ_{j=1}^{C} Σ_{x_i ∈ c_j} (x_i − µ(c_j))(x_i − µ(c_j))^T    (9)

which has rank(S_w) = min(F, N − C). Moreover, S_b is the between-class scatter matrix

    S_b = Σ_{j=1}^{C} N_{c_j} (µ(c_j) − µ)(µ(c_j) − µ)^T    (10)

which has rank(S_b) = min(F, C − 1). The solution of Eq. 8 is given by the generalized eigenvalue problem

    S_b W = S_w W Λ    (11)

thus W_0 corresponds to the eigenvectors of S_w^{-1} S_b that have the largest eigenvalues. In order to deal with the singularity of S_w, we can follow these steps:

1. Perform PCA on the data matrix X to reduce the dimensions to N − C using the eigenvectors U.
2. Solve LDA in this reduced space and get Q, which has C − 1 columns.
3. Compute the total transform as W = U Q.

Unfortunately, if we follow the above procedure, it is possible that important discriminative information is lost in the PCA step. In the following, we show how the components of LDA can be computed by applying a simultaneous diagonalization procedure.

Properties. The scatter matrices have some interesting properties. Let us denote the N × N block-diagonal matrix

    M = diag{E_1, E_2, ..., E_C}    (12)
where E_i is the N_{c_i} × N_{c_i} matrix with all of its elements equal to 1/N_{c_i}. Note that M is idempotent, thus M M = M. Given that the data scatter matrix is

    S_t = X X^T    (13)

the between-class scatter matrix can be written as

    S_b = X M M X^T = X M X^T    (14)

and the within-class scatter matrix as

    S_w = X X^T − X M X^T = X (I − M) X^T    (15)

Thus, we have that S_t = S_w + S_b. Note that, since M is idempotent, I − M is also idempotent. Given the above properties, the objective function of Eq. 8 can be expressed as

    W_0 = arg max_W tr(W^T X M M X^T W)  subject to  W^T X (I − M)(I − M) X^T W = I    (16)

The optimization of this problem involves a procedure called simultaneous diagonalization. Let us assume that the final transform matrix has the form

    W = U Q    (17)

We aim to find a matrix U that diagonalizes S_w = X (I − M)(I − M) X^T. This practically means that, given the constraint of Eq. 16, we want

    W^T X (I − M)(I − M) X^T W = Q^T [U^T X (I − M)(I − M) X^T U] Q = I    (18)

where the bracketed term is required to equal I. Consequently, using Eqs. 17 and 18, the objective function of Eq. 16 can be further expressed as

    Q_0 = arg max_Q tr(Q^T U^T X M M X^T U Q)  subject to  Q^T Q = I    (19)

where the constraint W^T X (I − M)(I − M) X^T W = I now takes the form Q^T Q = I.

Lemma 2. Assume the matrix X (I − M)(I − M) X^T = X_w X_w^T, where X_w is the F × N matrix X_w = X (I − M). By performing eigenanalysis on X_w^T X_w as X_w^T X_w = V_w Λ_w V_w^T, we get N − C positive eigenvalues, thus V_w is an N × (N − C) matrix.

The optimization problem of Eq. 19 can be solved in two steps.

1. Find U such that U^T X (I − M)(I − M) X^T U = I. By applying Lemma 2, we get U = X_w V_w Λ_w^{-1}. Note that U has size F × (N − C).

2. Find Q_0. By denoting X_b = U^T X M the (N − C) × N matrix of projected class means, Eq. 19 becomes

    Q_0 = arg max_Q tr(Q^T X_b X_b^T Q)  subject to  Q^T Q = I    (20)

which is equivalent to applying PCA to the matrix of projected class means. The final Q_0 is the matrix whose columns are the d eigenvectors of X_b X_b^T that correspond to the d largest eigenvalues (d ≤ C − 1). The final projection matrix is given by

    W_0 = U Q_0    (21)

Based on the above, the pseudocode for computing LDA is

Algorithm 2 Linear Discriminant Analysis
1: procedure LDA
2:   Find the eigenvectors of S_w that correspond to its non-zero eigenvalues (usually N − C of them), i.e. U = [u_1, ..., u_{N−C}], by performing eigenanalysis on (I − M) X^T X (I − M) = V_w Λ_w V_w^T and computing U = X (I − M) V_w Λ_w^{-1} (performing whitening on S_w)
3:   Project the data: X_b = U^T X M
4:   Perform PCA on X_b to find Q (i.e., compute the eigenanalysis X_b X_b^T = Q Λ_b Q^T)
5:   The total transform is W = U Q

Locality Preserving Projections (LPP) can be computed in a similar fashion, where S is the connectivity matrix and D the corresponding diagonal degree matrix defined in the slides. The first step is to perform whitening of X D X^T. We do so by applying Lemma 1 and performing eigenanalysis of D^{1/2} X^T X D^{1/2} = V_p Λ_p V_p^T. The whitening transform is then given by U = X D^{1/2} V_p Λ_p^{-1}. The next step is to project the data as X_p = U^T X and find the eigenvectors Q of X_p (D − S) X_p^T that correspond to the lowest (but non-zero) eigenvalues. The total transform is then W = U Q.
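Algorithm 2 can be sketched as follows (again a NumPy illustration, not part of the original notes; the function name lda and the toy data are ours). It builds the idempotent matrix M, whitens S_w through the Gram matrix of X_w = X(I − M) as in Lemma 2, and then applies PCA to the projected class means.

```python
import numpy as np

def lda(data, labels, d):
    """LDA by simultaneous diagonalization (Algorithm 2).

    data : (F, N) array with one observation per column.
    labels : length-N array of class ids.
    Returns the F x d projection matrix W = U Q.
    """
    F, N = data.shape
    X = data - data.mean(axis=1, keepdims=True)   # center by the overall mean

    # Block-diagonal idempotent M: M[i, j] = 1/N_c if samples i, j share class c
    M = np.zeros((N, N))
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        M[np.ix_(idx, idx)] = 1.0 / len(idx)

    # Step 2: whiten S_w = X(I-M)(I-M)X^T via the N x N Gram matrix of X_w
    Xw = X @ (np.eye(N) - M)
    lam_w, Vw = np.linalg.eigh(Xw.T @ Xw)
    keep = lam_w > lam_w.max() * 1e-10            # the positive eigenvalues
    U = Xw @ Vw[:, keep] / lam_w[keep]            # U = X_w V_w Lambda_w^{-1}

    # Step 3: project the class means
    Xb = U.T @ (X @ M)

    # Step 4: PCA on X_b -> Q (at most C - 1 useful directions)
    lam_b, Q = np.linalg.eigh(Xb @ Xb.T)
    Q = Q[:, np.argsort(lam_b)[::-1][:d]]

    # Step 5: total transform
    return U @ Q

# Toy data: 3 classes of 15 samples each in 50 dimensions, F > N
rng = np.random.default_rng(2)
F, n_per, C = 50, 15, 3
X = np.hstack([rng.standard_normal((F, n_per)) + 4 * rng.standard_normal((F, 1))
               for _ in range(C)])
labels = np.repeat(np.arange(C), n_per)
W = lda(X, labels, d=2)

# Check the constraint of Eq. 8: W simultaneously whitens S_w, i.e. W^T S_w W = I
Sw = np.zeros((F, F))
for c in range(C):
    Xi = X[:, labels == c]
    Xi = Xi - Xi.mean(axis=1, keepdims=True)
    Sw += Xi @ Xi.T
assert np.allclose(W.T @ Sw @ W, np.eye(2), atol=1e-6)
```

Note that only N × N and (N − C) × (N − C) eigenproblems are solved, so the F × F scatter matrices are never decomposed directly.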