Principal Component Analysis
CS5240 Theoretical Foundations in Multimedia
Leow Wee Kheng
Department of Computer Science, School of Computing
National University of Singapore
Motivation

How wide is the widest part of NGC 1300?
How thick is the thickest part of NGC 4594? Use principal component analysis.
Maximum Variance Estimate

Consider a set of points x_i, i = 1, ..., n, in an m-dimensional space such that their mean µ = 0, i.e., the centroid is at the origin.

[Figure: points x_i and their projections y_i on a line l through the origin O, with unit vector q along l and perpendicular distances d_i.]

We want to find a line l through the origin that maximizes the variance of the projections y_i of the points x_i on l.
Let q denote the unit vector along line l. Then, the projection y_i of x_i on l is

    y_i = x_i^T q.   (1)

The mean squared projection, which is the variance V over all points, is

    V = (1/n) \sum_{i=1}^n y_i^2 = (1/n) \sum_{i=1}^n (x_i^T q)^2 ≥ 0.   (2)

Expanding Eq. 2 gives

    V = (1/n) \sum_{i=1}^n (x_i^T q)^T (x_i^T q) = q^T [ (1/n) \sum_{i=1}^n x_i x_i^T ] q.   (3)

The middle factor is the covariance matrix C of the data points:

    C = (1/n) \sum_{i=1}^n x_i x_i^T.   (4)
We want to find a unit vector q that maximizes the variance V, i.e.,

    maximize V = q^T C q subject to ||q|| = 1.   (5)

This is a constrained optimization problem. Use the Lagrange multiplier method: combine V and the constraint using a Lagrange multiplier λ:

    maximize V = q^T C q − λ(q^T q − 1).   (6)
Lagrange Multiplier Method

The Lagrange multiplier method solves constrained optimization problems. Consider this problem:

    maximize f(x) subject to g(x) = c.   (7)

The method introduces a Lagrange multiplier λ to combine f(x) and g(x):

    L(x, λ) = f(x) + λ(g(x) − c).   (8)

The sign of λ can be positive or negative. Then, solve for the stationary point of L(x, λ):

    ∂L(x, λ)/∂x = 0.   (9)
Maximum Variance Estimate Lagrange Multiplier Method If x 0 is a solution of the original problem, then there is a λ 0 such that (x 0,λ 0 ) is a stationary point of L. Not all stationary points yield solutions of the original problem. The same method applies to minimization problem. Multiple constraints can be combined by adding multiple terms. is solved with minimize f(x) subject to g 1 (x) = c 1, g 2 (x) = c 2 (10) L(x) = f(x)+λ 1 (g 1 (x) c 1 )+λ 2 (g 2 (x) c 2 ). (11) Leow Wee Kheng (NUS) Principal Component Analysis 8 / 47
Now, we differentiate V with respect to q and set the result to 0:

    ∂V/∂q = 2q^T C − 2λq^T = 0.   (12)

Rearranging the terms gives

    q^T C = λq^T.   (13)

Since the covariance matrix C is symmetric (homework), C^T = C. So, transposing both sides of Eq. 13 gives

    Cq = λq.   (14)

Eq. 14 is called an eigenvector equation: q is the eigenvector and λ is the eigenvalue. Thus, the eigenvector q of C gives the line that maximizes the variance V.
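This fixed-point property can be checked numerically: for centred data, the eigenvector of C with the largest eigenvalue attains a larger projected variance q^T C q than any other unit direction. A minimal NumPy sketch on synthetic data (the array names here are illustrative, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic zero-mean 2-D points, stretched along the first axis
X = rng.normal(size=(500, 2)) * np.array([3.0, 1.0])
X = X - X.mean(axis=0)

C = X.T @ X / len(X)                  # covariance C = (1/n) sum x_i x_i^T
eigvals, eigvecs = np.linalg.eigh(C)  # eigh returns ascending eigenvalues
q = eigvecs[:, -1]                    # eigenvector with the largest eigenvalue

assert np.allclose(C @ q, eigvals[-1] * q)   # Cq = lambda q

# No random unit direction beats q in projected variance q^T C q
for _ in range(100):
    r = rng.normal(size=2)
    r = r / np.linalg.norm(r)
    assert r @ C @ r <= q @ C @ q + 1e-9
```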
The perpendicular distance d_i of x_i from the line l is

    d_i = ||x_i − y_i q|| = ||x_i − (x_i^T q) q||.   (15)
The squared distance is

    d_i^2 = ||x_i − (x_i^T q) q||^2
          = (x_i − (x_i^T q) q)^T (x_i − (x_i^T q) q)
          = x_i^T x_i − x_i^T (x_i^T q) q − (x_i^T q) q^T x_i + (x_i^T q) q^T (x_i^T q) q
          = x_i^T x_i − (x_i^T q)^2.

Averaging over all i gives

    D = (1/n) \sum_{i=1}^n d_i^2 = (1/n) \sum_{i=1}^n x_i^T x_i − V.   (16)

So, maximizing the variance V means minimizing the mean squared distance D to the data points x_i. The eigenvalue λ = V, the variance (homework); so λ ≥ 0.
Name q as the first eigenvector q_1. The component of x_i orthogonal to q_1 is

    x_i' = x_i − (x_i^T q_1) q_1.

Repeat the previous method on x_i' in the (m−1)-D subspace orthogonal to q_1 to get q_2. Then, repeat to get q_3, ..., q_m.

[Figure: the subspace orthogonal to q_1.]
Eigendecomposition

In general, an m × m covariance matrix C has m eigenvectors:

    C q_j = λ_j q_j,  j = 1, ..., m.   (17)

Transpose both sides of the equations to get

    q_j^T C = λ_j q_j^T,  j = 1, ..., m,   (18)

which are row matrices. Stack the row matrices into a column to get

    [ q_1^T ]       [ λ_1        0 ] [ q_1^T ]
    [   ⋮   ] C  =  [      ⋱      ] [   ⋮   ].   (19)
    [ q_m^T ]       [ 0        λ_m ] [ q_m^T ]

Denote the matrices Q and Λ as

    Q = [q_1 ⋯ q_m],  Λ = diag(λ_1, ..., λ_m).   (20)
Then, Eq. 19 becomes

    Q^T C = Λ Q^T   (21)

or

    C = Q Λ Q^T.   (22)

This is the matrix equation for eigendecomposition. The eigenvectors are orthonormal:

    q_j^T q_j = 1,  q_j^T q_k = 0 for k ≠ j.   (23)

So, the eigenmatrix Q is orthogonal:

    Q^T Q = Q Q^T = I.   (24)

The eigenvalues are arranged in sorted order, λ_j ≥ λ_{j+1}.
In general, the mean of the data points x_i is not at the origin. In this case, we subtract µ from each x_i. Now, we can transform x_i into a new vector y_i:

    y_i = Q^T (x_i − µ) = [ q_1^T (x_i − µ) ; ⋮ ; q_m^T (x_i − µ) ],   (25)

where

    y_ij = (x_i − µ)^T q_j   (26)

is the projection of x_i − µ on q_j. The original x_i can be recovered from y_i:

    x_i = Q y_i + µ.   (27)

Caution! x_i ≠ y_i + µ: x_i is in the original input space, but y_i is in the eigenspace.
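The forward map of Eq. 25 and the recovery of Eq. 27 can be sketched directly; the data are synthetic and the variable names are my own:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
mu = X.mean(axis=0)

C = (X - mu).T @ (X - mu) / len(X)
_, Q = np.linalg.eigh(C)              # columns of Q are the eigenvectors

x = X[0]
y = Q.T @ (x - mu)                    # into the eigenspace, Eq. 25
x_back = Q @ y + mu                   # recovery, Eq. 27

assert np.allclose(x_back, x)         # exact when all m eigenvectors are kept
```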
PCA Algorithm 1

Let x_i denote m-dimensional vectors (data points), i = 1, ..., n.

1. Compute the mean vector of the data points:

    µ = (1/n) \sum_{i=1}^n x_i.   (28)

2. Compute the covariance matrix:

    C = (1/n) \sum_{i=1}^n (x_i − µ)(x_i − µ)^T.   (29)

3. Perform eigendecomposition of C:

    C = Q Λ Q^T.   (30)
Some books and papers use the sample covariance

    C = (1/(n−1)) \sum_{i=1}^n (x_i − µ)(x_i − µ)^T.   (31)

Eq. 31 differs from Eq. 29 by only a constant factor.
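Algorithm 1 above can be sketched as a short function (`pca` is a name chosen here; it is not defined in the slides):

```python
import numpy as np

def pca(X):
    """PCA Algorithm 1. X is n x m; rows are the data points x_i."""
    mu = X.mean(axis=0)                   # step 1: mean vector, Eq. 28
    A = X - mu
    C = A.T @ A / len(X)                  # step 2: covariance, Eq. 29
    eigvals, Q = np.linalg.eigh(C)        # step 3: eigendecomposition, Eq. 30
    order = np.argsort(eigvals)[::-1]     # sort lambda_1 >= ... >= lambda_m
    return mu, eigvals[order], Q[:, order]

rng = np.random.default_rng(2)
X = rng.normal(size=(300, 5))
mu, lam, Q = pca(X)
C = (X - mu).T @ (X - mu) / len(X)
assert np.all(lam[:-1] >= lam[1:])              # descending eigenvalues
assert np.allclose(Q @ np.diag(lam) @ Q.T, C)   # C = Q Lambda Q^T
```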
Properties of PCA

- The mean µ_y over all y_i is 0 (homework).
- The variance σ_j^2 along q_j is λ_j (homework).
- Since λ_1 ≥ ⋯ ≥ λ_m ≥ 0, we have σ_1 ≥ ⋯ ≥ σ_m ≥ 0.
- q_1 gives the orientation of the largest variance.
- q_2 gives the orientation of the largest variance orthogonal to q_1 (2nd largest variance).
- q_j gives the orientation of the largest variance orthogonal to q_1, ..., q_{j−1} (j-th largest variance).
- q_m is orthogonal to all other eigenvectors (least variance).
[Figures: a set of data points; the same points with the centroid translated to the origin; their principal components. A second example is shown likewise.]
Notes:

- The covariance matrix C is an m × m matrix.
- Eigendecomposition of a very large matrix is inefficient.
- Example: a 256 × 256 colour image has 256 × 256 × 3 = 196608 values, so the number of dimensions is m = 196608. C has a size of 196608 × 196608!
- The number of images n is usually ≪ m, e.g., 1000.
PCA Algorithm 2

1. Compute the mean µ of the data points x_i:

    µ = (1/n) \sum_{i=1}^n x_i.

2. Form an m × n matrix A, n ≪ m:

    A = [(x_1 − µ) ⋯ (x_n − µ)].   (32)

3. Compute A^T A, which is just an n × n matrix.

4. Apply eigendecomposition on A^T A:

    (A^T A) q_j = λ_j q_j.   (33)

5. Pre-multiply Eq. 33 by A, giving

    A A^T (A q_j) = λ_j (A q_j).   (34)
6. Recover the eigenvectors and eigenvalues. Since A A^T = nC (homework), Eq. 34 is

    nC (A q_j) = λ_j (A q_j),   (35)

i.e.,

    C (A q_j) = (λ_j / n) (A q_j).   (36)

Therefore, the eigenvectors of C are A q_j (normalized to unit length), and the eigenvalues of C are λ_j / n.
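A sketch of this small-matrix trick on synthetic data with n ≪ m; the full covariance C is formed here only to verify the result, not as part of the algorithm:

```python
import numpy as np

rng = np.random.default_rng(3)
n, m = 10, 200                       # far fewer points than dimensions
X = rng.normal(size=(n, m))
mu = X.mean(axis=0)
A = (X - mu).T                       # m x n matrix, Eq. 32

S = A.T @ A                          # step 3: small n x n matrix
lam, Qs = np.linalg.eigh(S)          # step 4: eigenpairs of A^T A, Eq. 33

# Step 6: A q_j is an eigenvector of C with eigenvalue lam_j / n, Eq. 36
q = A @ Qs[:, -1]                    # take the largest eigenvalue
q = q / np.linalg.norm(q)            # normalize to unit length
C = A @ A.T / n                      # full covariance, for checking only
assert np.allclose(C @ q, (lam[-1] / n) * q)
```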
PCA Algorithm 3

1. Compute the mean µ of the data points x_i:

    µ = (1/n) \sum_{i=1}^n x_i.

2. Form an m × n matrix A:

    A = [(x_1 − µ) ⋯ (x_n − µ)].

3. Apply singular value decomposition (SVD) on A:

    A = U Σ V^T.

4. Recover the eigenvectors and eigenvalues (page 31).
Singular Value Decomposition

Singular value decomposition (SVD) decomposes a matrix A into

    A = U Σ V^T.   (37)

- The column vectors of U are the left singular vectors u_j, which are orthonormal.
- The column vectors of V are the right singular vectors v_j, which are orthonormal.
- Σ is diagonal and contains the singular values s_j.
- The rank of A = the number of non-zero singular values.
Notice that

    A A^T = (U Σ V^T)(U Σ V^T)^T = U Σ V^T V Σ U^T = U Σ^2 U^T,

and the eigendecomposition of A A^T is A A^T = Q Λ Q^T. So, the eigenvectors of A A^T are u_j, and the eigenvalues of A A^T are s_j^2. On the other hand,

    A^T A = (U Σ V^T)^T U Σ V^T = V Σ U^T U Σ V^T = V Σ^2 V^T.

Comparing with the eigendecomposition A^T A = Q Λ Q^T: the eigenvectors of A^T A are v_j, and the eigenvalues of A^T A are s_j^2.
4. Recover the eigenvectors and eigenvalues. Compare A A^T = nC with the eigendecomposition of C:

    A A^T = U Σ^2 U^T,  C = Q Λ Q^T.

Therefore, the eigenvectors q_j of C are the u_j in U, and the eigenvalues λ_j of C are s_j^2 / n, for s_j in Σ.
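The SVD route can be verified numerically on synthetic data: each left singular vector u_j of A is an eigenvector of C with eigenvalue s_j^2 / n:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.normal(size=(50, 8))
n = len(X)
mu = X.mean(axis=0)
A = (X - mu).T                       # m x n matrix of centred columns

U, s, Vt = np.linalg.svd(A, full_matrices=False)

C = A @ A.T / n
lam = s**2 / n                       # eigenvalues of C are s_j^2 / n
for j in range(len(s)):
    assert np.allclose(C @ U[:, j], lam[j] * U[:, j])
```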
Application Examples

Identify the 1st and 2nd major axes of these objects.

[Figure: example objects.]
Apply PCA for plane fitting in 3-D. Compute the PCA of the points. Then:

- the mean of the points = a known point on the plane;
- the 3rd eigenvector = the normal of the plane.
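A sketch of plane fitting by PCA on synthetic noisy points near a known plane (the plane coefficients below are made up purely for the check):

```python
import numpy as np

rng = np.random.default_rng(5)
# Points near the plane z = 0.5x - 0.2y + 1, with small noise
xy = rng.uniform(-1, 1, size=(200, 2))
z = 0.5 * xy[:, 0] - 0.2 * xy[:, 1] + 1 + rng.normal(scale=0.01, size=200)
P = np.column_stack([xy, z])

mu = P.mean(axis=0)                   # known point on the fitted plane
C = (P - mu).T @ (P - mu) / len(P)
eigvals, Q = np.linalg.eigh(C)
normal = Q[:, 0]                      # smallest eigenvalue: the 3rd eigenvector

true_normal = np.array([0.5, -0.2, -1.0])
true_normal = true_normal / np.linalg.norm(true_normal)
assert abs(abs(normal @ true_normal) - 1.0) < 1e-3   # agree up to sign
```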
Dimensionality Reduction

If a point represents a 256 × 256 colour image, then the dimensionality is 256 × 256 × 3 = 196608! But not all 196608 eigenvalues are non-zero.
Case 1: The number of data vectors n ≤ m, the number of dimensions.

- If the data vectors are all independent, then the number of non-zero eigenvalues (eigenvectors) = n − 1, i.e., the rank of the covariance matrix = n − 1. Why not = n?
- If the data vectors are not independent, then the rank of the covariance matrix < n − 1.

Case 2: The number of data vectors n > m.

- If m or more data vectors are independent, then the rank of the covariance matrix = m.
- If fewer than m data vectors are independent, what is the rank of the covariance matrix?

In practice, it is often possible to reduce the dimensionality.
The eigenmatrix Q is

    Q = [q_1 ⋯ q_m].   (38)

PCA maps a data point x to a vector y in the eigenspace:

    y = Q^T (x − µ) = \sum_{j=1}^m ((x − µ)^T q_j) q_j,   (39)

which has m dimensions. Pick the l eigenvectors with the largest eigenvalues to form the truncated eigenmatrix Q̃:

    Q̃ = [q_1 ⋯ q_l],   (40)

which spans a subspace of the eigenspace. Then, Q̃ maps x to ŷ, an estimate of y:

    ŷ = Q̃^T (x − µ) = \sum_{j=1}^l ((x − µ)^T q_j) q_j,   (41)

which has l < m dimensions.
[Diagram: x is mapped by Q^T to y; Q̃^T keeps only the first l components, giving ŷ (dimensionality reduction); Q̃ maps ŷ back toward x.]

The difference between y and ŷ is

    y − ŷ = \sum_{j=l+1}^m ((x − µ)^T q_j) q_j.   (42)
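Truncation can be sketched on synthetic data that is essentially 2-D inside a 5-D space, so l = 2 reconstructs x almost exactly (names like `Qt` are mine, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(6)
# Nearly 2-D data embedded in 5 dimensions, plus small noise
B = rng.normal(size=(2, 5))
X = rng.normal(size=(400, 2)) @ B + rng.normal(scale=0.01, size=(400, 5))
mu = X.mean(axis=0)

C = (X - mu).T @ (X - mu) / len(X)
eigvals, Q = np.linalg.eigh(C)
Q = Q[:, ::-1]                        # descending eigenvalue order
l = 2
Qt = Q[:, :l]                         # truncated eigenmatrix, Eq. 40

x = X[0]
y_hat = Qt.T @ (x - mu)               # l-dimensional code, Eq. 41
x_hat = Qt @ y_hat + mu               # approximate recovery

assert y_hat.shape == (l,)
assert np.linalg.norm(x_hat - x) < 0.1   # small error: data is nearly 2-D
```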
Beware:

    y = Q^T (x − µ)  and, therefore,  x = Q y + µ.

But

    ŷ = Q̃^T (x − µ),  whereas  x̂ = Q̃ ŷ + µ ≠ x.   (43)

Why? With n data points x_i, the sum-squared error E between x_i and its estimate x̂_i is

    E = \sum_{i=1}^n ||x_i − x̂_i||^2.   (44)
When all m dimensions are kept, the total variance is

    \sum_{j=1}^m σ_j^2 = \sum_{j=1}^m λ_j.   (45)

When l dimensions are used, the ratio R̄ of unaccounted variance is

    R̄(l) = \sum_{j=l+1}^m σ_j^2 / \sum_{j=1}^m σ_j^2 = \sum_{j=l+1}^m λ_j / \sum_{j=1}^m λ_j.   (46)
[Plot: R̄ decreases from 1 toward 0 as the number of dimensions l increases; small l is "not enough", large l is "enough".]

How do we choose an appropriate l? Choose l such that using more than l dimensions does not reduce R̄ significantly:

    R̄(l) − R̄(l+1) < ε.   (47)
Alternatively, compute the ratio R of accounted variance:

    R(l) = 1 − R̄(l) = \sum_{j=1}^l σ_j^2 / \sum_{j=1}^m σ_j^2 = \sum_{j=1}^l λ_j / \sum_{j=1}^m λ_j.   (48)

l is large enough if R(l+1) − R(l) < ε.

[Plot: R increases from 0 toward 1 as l increases; small l is "not enough", large l is "enough".]
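The accounted-variance rule can be sketched as follows; the eigenvalue spectrum is synthetic, built so that two directions dominate:

```python
import numpy as np

rng = np.random.default_rng(7)
# Synthetic data with two dominant variance directions
X = rng.normal(size=(500, 6)) * np.array([5.0, 3.0, 0.1, 0.1, 0.05, 0.01])
mu = X.mean(axis=0)
C = (X - mu).T @ (X - mu) / len(X)
lam = np.sort(np.linalg.eigvalsh(C))[::-1]   # eigenvalues, descending

R = np.cumsum(lam) / lam.sum()               # accounted variance R(l), Eq. 48
eps = 0.01
# R(l+1) - R(l) = lam_{l+1} / sum(lam): stop at the first small increment
increments = lam / lam.sum()
l = int(np.argmax(increments < eps))
assert l == 2                                # two components are enough here
```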
Summary

- Eigendecomposition of the covariance matrix = PCA.
- In practice, PCA is computed using SVD.
- PCA maximizes the variances along the eigenvectors.
- Eigenvalues = variances along the eigenvectors.
- The best-fitting plane computed by PCA minimizes the distance to the points.
- PCA can be used for dimensionality reduction.
Probing Questions

- When applying PCA to plane fitting, how do you check whether the points really lie close to the plane?
- If you apply PCA to a set of points on a curved surface, where do you expect the eigenvectors to point?
- In dimensionality reduction, the m-D vector y is reduced to an l-D vector ŷ. Since l ≠ m, vector subtraction is undefined. But subtraction of Eqs. 39 and 41 gives

      y − ŷ = \sum_{j=l+1}^m ((x − µ)^T q_j) q_j.

  How is this possible? Is there a contradiction?
- Show that when n ≤ m, there are at most n − 1 eigenvectors with non-zero eigenvalues.
Homework I

1. Describe the essence of PCA in one sentence.
2. Show that the covariance matrix C of a set of data points x_i, i = 1, ..., n, is symmetric about its diagonal.
3. Show that the eigenvalue λ of the covariance matrix C equals the variance V as defined in Eq. 2.
4. Show that the mean µ_y over all y_i in the eigenspace is 0.
5. Show that the variance σ_j^2 along eigenvector q_j is λ_j.
6. Show that A A^T = nC.
7. Derive the difference Q y − Q̃ ŷ in dimensionality reduction.
Homework II

8. Suppose n ≤ m, and the truncated eigenmatrix Q̃ contains all the eigenvectors with non-zero eigenvalues, Q̃ = [q_1 ⋯ q_{n−1}]. Now, map an input vector x to y and ŷ respectively by the complete Q (Eq. 39) and the truncated Q̃ (Eq. 41). What is the error y − ŷ? What is the result of mapping ŷ back to the input space by Q̃ (Eq. 43)?
9. Q3 of AY2015/16 Final Evaluation.