Principal components analysis
COMS 4771
1. Representation learning
Useful representations of data

Representation learning:
Given: raw feature vectors $x_1, x_2, \dots, x_n \in \mathbb{R}^d$.
Goal: learn a useful feature transformation $\phi \colon \mathbb{R}^d \to \mathbb{R}^k$.
(Often $k \ll d$, i.e., dimensionality reduction, but not always.)

Can then use $\phi$ as a feature map for supervised learning.

Some previously encountered examples:
- Feature maps corresponding to pos. def. kernels (+ approximations). (Usually data-oblivious: the feature map doesn't depend on the data.)
- Centering: $x \mapsto x - \mu$. (Effect: resulting features have mean 0.)
- Standardization: $x \mapsto \mathrm{diag}(\sigma_1, \sigma_2, \dots, \sigma_d)^{-1}(x - \mu)$. (Effect: resulting features have mean 0 and unit variance.)

What other properties of a feature representation may be desirable?
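As a concrete illustration of the last two transformations, here is a minimal numpy sketch; the data matrix X and its dimensions are hypothetical.

```python
import numpy as np

X = np.random.randn(100, 5) * 3.0 + 7.0    # hypothetical data matrix, one example per row

mu = X.mean(axis=0)                        # per-feature means
sigma = X.std(axis=0)                      # per-feature standard deviations

X_centered = X - mu                        # centering: resulting features have mean 0
X_standardized = (X - mu) / sigma          # standardization: mean 0 and unit variance
```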
2. Principal components analysis
Dimensionality reduction via projections

Input: $x_1, x_2, \dots, x_n \in \mathbb{R}^d$, target dimensionality $k \in \mathbb{N}$.
Output: a $k$-dimensional subspace, represented by an orthonormal basis $q_1, q_2, \dots, q_k \in \mathbb{R}^d$.

(Orthogonal) projection: the projection of $x \in \mathbb{R}^d$ onto $\mathrm{span}(q_1, q_2, \dots, q_k)$ is
$$\Pi x \;=\; \sum_{i=1}^k q_i q_i^{\mathsf T} x \;=\; \sum_{i=1}^k \langle q_i, x \rangle \, q_i \;\in\; \mathbb{R}^d.$$

Can also represent the projection of $x$ in terms of its coefficients w.r.t. the orthonormal basis $q_1, q_2, \dots, q_k$:
$$\phi(x) := \big( \langle q_1, x \rangle, \langle q_2, x \rangle, \dots, \langle q_k, x \rangle \big) \in \mathbb{R}^k.$$
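A small numpy sketch of these two representations, assuming Q is a $d \times k$ matrix whose columns are the orthonormal basis vectors (the names Q and x are illustrative):

```python
import numpy as np

d, k = 5, 2
# Build an orthonormal basis q_1, ..., q_k via QR of a random matrix (illustrative).
Q, _ = np.linalg.qr(np.random.randn(d, k))   # Q has shape (d, k), orthonormal columns

x = np.random.randn(d)

phi_x = Q.T @ x          # coefficients <q_i, x>, a vector in R^k
proj_x = Q @ phi_x       # projection of x onto span(q_1, ..., q_k), a vector in R^d

# proj_x equals (Q Q^T) x:
assert np.allclose(proj_x, Q @ Q.T @ x)
```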
Projection of minimum residual squared error

Objective: find a $k$-dimensional projector $\Pi \colon \mathbb{R}^d \to \mathbb{R}^d$ such that the average residual squared error
$$\frac{1}{n} \sum_{i=1}^n \| x_i - \Pi x_i \|_2^2$$
is as small as possible.
Projection of minimum residual squared error: $k = 1$ case ($\Pi = q q^{\mathsf T}$)

Objective: find a unit vector $q \in \mathbb{R}^d$ to minimize
$$\frac{1}{n} \sum_{i=1}^n \| x_i - q q^{\mathsf T} x_i \|_2^2
\;=\; \frac{1}{n} \sum_{i=1}^n \| x_i \|_2^2 \;-\; q^{\mathsf T} \Big( \frac{1}{n} \sum_{i=1}^n x_i x_i^{\mathsf T} \Big) q
\;=\; \frac{1}{n} \sum_{i=1}^n \| x_i \|_2^2 \;-\; q^{\mathsf T} \Big( \frac{1}{n} A^{\mathsf T} A \Big) q,$$
where $x_i^{\mathsf T}$ is the $i$-th row of $A \in \mathbb{R}^{n \times d}$.

Therefore
$$\arg\min_{q \in \mathbb{R}^d :\, \|q\|_2 = 1} \frac{1}{n} \sum_{i=1}^n \| x_i - q q^{\mathsf T} x_i \|_2^2
\;=\; \arg\max_{q \in \mathbb{R}^d :\, \|q\|_2 = 1} q^{\mathsf T} \Big( \frac{1}{n} A^{\mathsf T} A \Big) q.$$
Aside: Eigendecompositions

Every symmetric matrix $M \in \mathbb{R}^{d \times d}$ is guaranteed to have an eigendecomposition with real eigenvalues:
$$M \;=\; V \Lambda V^{\mathsf T} \;=\; \sum_{i=1}^d \lambda_i v_i v_i^{\mathsf T},$$
where $M$, $V$, and $\Lambda$ are all $d \times d$;
real eigenvalues $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d$ (so $\Lambda = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_d)$);
corresponding orthonormal eigenvectors $v_1, v_2, \dots, v_d$ (so $V = [v_1 \; v_2 \; \cdots \; v_d]$).

Fixed-point characterization of eigenvectors: $M v_i = \lambda_i v_i$.
Eigendecompositions

Variational characterization of eigenvectors:
$$\max_{q \in \mathbb{R}^d} \; q^{\mathsf T} M q \quad \text{s.t.} \quad \|q\|_2 = 1.$$
Maximum value: $\lambda_1$ (top eigenvalue). Maximizer: $v_1$ (top eigenvector).

For $i > 1$:
$$\max_{q \in \mathbb{R}^d} \; q^{\mathsf T} M q \quad \text{s.t.} \quad \|q\|_2 = 1, \;\; \langle q, v_j \rangle = 0 \;\; \text{for all } j < i.$$
Maximum value: $\lambda_i$ ($i$-th largest eigenvalue). Maximizer: $v_i$ ($i$-th eigenvector).
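A quick numeric check of this characterization with numpy; the symmetric matrix M below is an arbitrary illustrative example.

```python
import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((6, 6))
M = (B + B.T) / 2                       # arbitrary symmetric matrix

# np.linalg.eigh returns eigenvalues in ascending order with orthonormal eigenvectors.
eigvals, eigvecs = np.linalg.eigh(M)
lam1, v1 = eigvals[-1], eigvecs[:, -1]  # top eigenvalue / top eigenvector

# Any unit vector q satisfies q^T M q <= lambda_1, and v1 attains the maximum.
q = rng.standard_normal(6)
q /= np.linalg.norm(q)
assert q @ M @ q <= lam1 + 1e-9
assert np.isclose(v1 @ M @ v1, lam1)
```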
Principal components analysis ($k = 1$)

$k = 1$ case ($\Pi = q q^{\mathsf T}$):
$$\arg\min_{q \in \mathbb{R}^d :\, \|q\|_2 = 1} \frac{1}{n} \sum_{i=1}^n \| x_i - q q^{\mathsf T} x_i \|_2^2
\;=\; \arg\max_{q \in \mathbb{R}^d :\, \|q\|_2 = 1} q^{\mathsf T} \Big( \frac{1}{n} A^{\mathsf T} A \Big) q.$$

Solution: the eigenvector of $A^{\mathsf T} A$ corresponding to the largest eigenvalue (i.e., the top eigenvector $v_1$).

Note that
$$q^{\mathsf T} \Big( \frac{1}{n} A^{\mathsf T} A \Big) q \;=\; \frac{1}{n} \sum_{i=1}^n \langle q, x_i \rangle^2$$
(the variance in direction $q$, assuming $\frac{1}{n} \sum_{i=1}^n x_i = 0$).

Top eigenvector = direction of maximum variance.
Principal components analysis (general $k$)

General $k$ case ($\Pi = Q Q^{\mathsf T}$):
$$\arg\min_{Q \in \mathbb{R}^{d \times k} :\, Q^{\mathsf T} Q = I} \frac{1}{n} \sum_{i=1}^n \| x_i - Q Q^{\mathsf T} x_i \|_2^2
\;=\; \arg\max_{Q \in \mathbb{R}^{d \times k} :\, Q^{\mathsf T} Q = I} \sum_{i=1}^k q_i^{\mathsf T} \Big( \frac{1}{n} A^{\mathsf T} A \Big) q_i.$$

Solution: the $k$ eigenvectors of $A^{\mathsf T} A$ corresponding to the $k$ largest eigenvalues.

Note that
$$\sum_{i=1}^k q_i^{\mathsf T} \Big( \frac{1}{n} A^{\mathsf T} A \Big) q_i \;=\; \sum_{i=1}^k \frac{1}{n} \sum_{j=1}^n \langle q_i, x_j \rangle^2$$
(the sum of variances in the $q_i$ directions, assuming $\frac{1}{n} \sum_{i=1}^n x_i = 0$).

Top $k$ eigenvectors = $k$-dimensional subspace of maximum variance.
Principal components analysis (PCA)

Data matrix $A \in \mathbb{R}^{n \times d}$.

Rank-$k$ PCA ($k$-dimensional linear subspace):
Get the top $k$ eigenvectors $V_k := [v_1 \; v_2 \; \cdots \; v_k]$ of
$$\frac{1}{n} A^{\mathsf T} A \;=\; \frac{1}{n} \sum_{i=1}^n x_i x_i^{\mathsf T}.$$
Feature map: $\phi(x) := \big( \langle v_1, x \rangle, \langle v_2, x \rangle, \dots, \langle v_k, x \rangle \big) \in \mathbb{R}^k$.
Decorrelating property: $\frac{1}{n} \sum_{i=1}^n \phi(x_i) \phi(x_i)^{\mathsf T} = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_k)$.
Approx. reconstruction: $x \approx V_k \phi(x)$.
Principal components analysis (PCA)

Data matrix $A \in \mathbb{R}^{n \times d}$.

Rank-$k$ PCA with centering ($k$-dimensional affine subspace):
Get the top $k$ eigenvectors $V_k := [v_1 \; v_2 \; \cdots \; v_k]$ of
$$\frac{1}{n} \sum_{i=1}^n (x_i - \mu)(x_i - \mu)^{\mathsf T}, \quad \text{where } \mu = \frac{1}{n} \sum_{i=1}^n x_i.$$
Feature map: $\phi(x) := \big( \langle v_1, x - \mu \rangle, \langle v_2, x - \mu \rangle, \dots, \langle v_k, x - \mu \rangle \big) \in \mathbb{R}^k$.
Decorrelating property: $\frac{1}{n} \sum_{i=1}^n \phi(x_i) = 0$ and $\frac{1}{n} \sum_{i=1}^n \phi(x_i) \phi(x_i)^{\mathsf T} = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_k)$.
Approx. reconstruction: $x \approx \mu + V_k \phi(x)$.
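A minimal numpy sketch of rank-$k$ PCA with centering, following the recipe above; the data matrix X and the choice of k are illustrative.

```python
import numpy as np

def pca(X, k):
    """Rank-k PCA with centering. X has one example x_i per row."""
    mu = X.mean(axis=0)
    Xc = X - mu                                # centered data
    C = (Xc.T @ Xc) / X.shape[0]               # (1/n) sum (x_i - mu)(x_i - mu)^T
    eigvals, eigvecs = np.linalg.eigh(C)       # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:k]      # indices of the k largest eigenvalues
    V_k = eigvecs[:, order]                    # d x k matrix of top eigenvectors
    return mu, V_k

# Usage: feature map and approximate reconstruction.
X = np.random.randn(200, 10) @ np.random.randn(10, 10)   # illustrative data
mu, V_k = pca(X, k=3)
phi = (X - mu) @ V_k            # n x k matrix of PCA features phi(x_i)
X_recon = mu + phi @ V_k.T      # approximate reconstructions mu + V_k phi(x_i)
```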
Example: PCA on OCR digits data

Data $\{x_i\}_{i=1}^n$ from $\mathbb{R}^{784}$.

Fraction of residual variance left by the rank-$k$ PCA projection:
$$1 \;-\; \frac{\sum_{j=1}^k \text{variance in direction } v_j}{\text{total variance}}.$$
Fraction of residual variance left by the best $k$ coordinate projections:
$$1 \;-\; \frac{\sum_{j=1}^k \text{variance in direction } e_j}{\text{total variance}}.$$

[Plot: fraction of residual variance vs. dimension of projection $k$ (0 to 800), comparing coordinate projections and PCA projections.]
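A sketch of how these residual-variance curves could be computed; the digits matrix is assumed to be available, and a random stand-in is used here.

```python
import numpy as np

X = np.random.randn(500, 784)                 # stand-in for the OCR digits data
Xc = X - X.mean(axis=0)

# Eigenvalues of the covariance matrix, in descending order.
eigvals = np.linalg.eigvalsh((Xc.T @ Xc) / X.shape[0])[::-1]
total_variance = eigvals.sum()

# Fraction of residual variance after the rank-k PCA projection, for every k.
residual_pca = 1.0 - np.cumsum(eigvals) / total_variance

# Same quantity for the best k coordinate projections (keep the highest-variance coordinates).
coord_vars = np.sort(Xc.var(axis=0))[::-1]
residual_coord = 1.0 - np.cumsum(coord_vars) / total_variance
```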
Example: compressing digits images

16×16-pixel images of handwritten 3s (as vectors in $\mathbb{R}^{256}$).

[Figure: the mean $\mu$ and the top eigenvectors $v_1, v_2, v_3, v_4$, with $\lambda_1 = 3.4 \times 10^5$, $\lambda_2 = 2.8 \times 10^5$, $\lambda_3 = 2.4 \times 10^5$, $\lambda_4 = 1.6 \times 10^5$; reconstructions of an image $x$ for $k = 1, 10, 50, 200$.]

Only have to store $k$ numbers per image, along with the mean $\mu$ and the $k$ eigenvectors ($256(k+1)$ numbers).
Example: eigenfaces

92×112-pixel images of faces (as vectors in $\mathbb{R}^{10304}$), a subset of the 400 images in the full Olivetti Face Database.

[Figure: 100 example training images, and the top $k = 48$ eigenvectors displayed as images.]
Other examples

$x \in \mathbb{R}^d$: movement of stock prices for $d$ different stocks in one day.
Principal component: the combination of stocks that accounts for the most variation in stock price movement.

$x \in \{1, 2, \dots, 5\}^d$: levels at which various terms describe an individual (e.g., "jolly", "impulsive", "outgoing", "conceited", "meddlesome").
Principal components: major personality axes in a population (e.g., "extroversion", "agreeableness", "conscientiousness").

...
3. Computation
Power method

Problem: Given a matrix $A \in \mathbb{R}^{n \times d}$, compute the top eigenvector of $A^{\mathsf T} A$.

Initialize with a random $\hat v \in \mathbb{R}^d$. Repeat:
1. $\hat v := A^{\mathsf T} A \hat v$.
2. $\hat v := \hat v / \|\hat v\|_2$.

Theorem: For any $\varepsilon \in (0, 1)$, with high probability (over the choice of the initial $\hat v$),
$$\hat v^{\mathsf T} A^{\mathsf T} A \hat v \;\ge\; (1 - \varepsilon) \cdot \big( \text{top eigenvalue of } A^{\mathsf T} A \big)$$
after $O\big( \tfrac{1}{\varepsilon} \log \tfrac{d}{\varepsilon} \big)$ iterations.

A similar algorithm can be used to get the top $k$ eigenvectors.
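A direct numpy sketch of the power method as stated above; stopping after a fixed number of iterations is an illustrative choice.

```python
import numpy as np

def power_method(A, num_iters=100, rng=None):
    """Approximate the top eigenvector of A^T A by power iteration."""
    rng = np.random.default_rng() if rng is None else rng
    d = A.shape[1]
    v = rng.standard_normal(d)          # random initialization
    v /= np.linalg.norm(v)
    for _ in range(num_iters):
        v = A.T @ (A @ v)               # v := A^T A v (without forming A^T A explicitly)
        v /= np.linalg.norm(v)          # v := v / ||v||_2
    return v

# Usage: compare the Rayleigh quotient against the exact top eigenvalue.
A = np.random.randn(50, 8)
v_hat = power_method(A, num_iters=200)
top_eigval = np.linalg.eigvalsh(A.T @ A)[-1]
print(v_hat @ (A.T @ A) @ v_hat / top_eigval)   # close to 1 when the spectrum is well separated
```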
4. Singular value decomposition
Singular value decomposition

Every matrix $A \in \mathbb{R}^{n \times d}$ has a singular value decomposition (SVD)
$$A \;=\; U S V^{\mathsf T} \;=\; \sum_{i=1}^r s_i u_i v_i^{\mathsf T},$$
where $A$ is $n \times d$, $U$ is $n \times r$, $S$ is $r \times r$, $V^{\mathsf T}$ is $r \times d$, and $r = \mathrm{rank}(A)$ (so $r \le \min\{n, d\}$);
$U^{\mathsf T} U = I$ (i.e., $U = [u_1 \; u_2 \; \cdots \; u_r]$ has orthonormal columns): left singular vectors;
$S = \mathrm{diag}(s_1, s_2, \dots, s_r)$ where $s_1 \ge s_2 \ge \dots \ge s_r > 0$: singular values;
$V^{\mathsf T} V = I$ (i.e., $V = [v_1 \; v_2 \; \cdots \; v_r]$ has orthonormal columns): right singular vectors.
SVD vs PCA

If the SVD of $A$ is $U S V^{\mathsf T} = \sum_{i=1}^r s_i u_i v_i^{\mathsf T}$, then:
the non-zero eigenvalues of $A^{\mathsf T} A$ are $s_1^2, s_2^2, \dots, s_r^2$ (the squares of the singular values of $A$);
the corresponding eigenvectors are $v_1, v_2, \dots, v_r \in \mathbb{R}^d$ (the right singular vectors of $A$).

By symmetry, also have:
the non-zero eigenvalues of $A A^{\mathsf T}$ are $s_1^2, s_2^2, \dots, s_r^2$ (the squares of the singular values of $A$);
the corresponding eigenvectors are $u_1, u_2, \dots, u_r \in \mathbb{R}^n$ (the left singular vectors of $A$).
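A quick numeric check of this correspondence with numpy, on an arbitrary random matrix:

```python
import numpy as np

A = np.random.randn(20, 6)

# Economy SVD: U is 20 x 6, s has 6 singular values, Vt is 6 x 6.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Eigenvalues of A^T A (descending) match the squared singular values of A.
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
assert np.allclose(eigvals, s**2)

# The right singular vectors are eigenvectors of A^T A.
v1 = Vt[0]
assert np.allclose(A.T @ A @ v1, s[0]**2 * v1)
```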
Low-rank SVD

For any $k \le \mathrm{rank}(A)$, the rank-$k$ SVD approximation is
$$\hat U_k \hat S_k \hat V_k^{\mathsf T} \;=\; \sum_{i=1}^k s_i u_i v_i^{\mathsf T},$$
where $\hat U_k$ is $n \times k$, $\hat S_k$ is $k \times k$, and $\hat V_k^{\mathsf T}$ is $k \times d$.
(Just retain the top $k$ left/right singular vectors and singular values from the SVD.)

Best rank-$k$ approximation:
$$\hat A := \hat U_k \hat S_k \hat V_k^{\mathsf T} \;=\; \arg\min_{M \in \mathbb{R}^{n \times d} :\, \mathrm{rank}(M) \le k} \; \sum_{i=1}^n \sum_{j=1}^d (A_{i,j} - M_{i,j})^2.$$
The minimum value is simply
$$\sum_{i=1}^n \sum_{j=1}^d (A_{i,j} - \hat A_{i,j})^2 \;=\; \sum_{t > k} s_t^2.$$
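A sketch of the rank-$k$ truncation and the error identity, using numpy's SVD on an arbitrary matrix:

```python
import numpy as np

A = np.random.randn(30, 12)
k = 3

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k approximation: keep the top k singular triples.
A_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Squared Frobenius error equals the sum of the discarded squared singular values.
err = np.sum((A - A_hat) ** 2)
assert np.isclose(err, np.sum(s[k:] ** 2))
```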
Example: latent semantic analysis

Represent a corpus of documents by counts of the words they contain:

                 aardvark   abacus   abalone   ...
   document 1        3         0         0     ...
   document 2        7         0         4     ...
   document 3        2         4         0     ...
      ...

One row per document and one column per vocabulary word in $A \in \mathbb{R}^{n \times d}$;
$A_{i,j}$ = number of times word $j$ appears in document $i$.
Example: latent semantic analysis

A statistical model for the document-word count matrix.

Parameters: $\theta = (\beta_1, \beta_2, \dots, \beta_k, \pi_1, \pi_2, \dots, \pi_n, l_1, l_2, \dots, l_n)$.
There are $k \ll \min\{n, d\}$ topics, each represented by a distribution over vocabulary words: $\beta_1, \beta_2, \dots, \beta_k \in \mathbb{R}^d_+$. Each $\beta_t = (\beta_{t,1}, \beta_{t,2}, \dots, \beta_{t,d})$ is a probability vector, so $\sum_{j=1}^d \beta_{t,j} = 1$.
Each document $i$ is associated with a probability distribution $\pi_i = (\pi_{i,1}, \pi_{i,2}, \dots, \pi_{i,k})$ over topics, so $\sum_{t=1}^k \pi_{i,t} = 1$.

The model posits that document $i$'s count vector (the $i$-th row of $A$) follows a multinomial distribution with $l_i$ trials and word probabilities $\sum_{t=1}^k \pi_{i,t} \beta_t$:
$$[A_{i,1} \; A_{i,2} \; \cdots \; A_{i,d}] \;\sim\; \mathrm{Multinomial}\Big( l_i, \; \sum_{t=1}^k \pi_{i,t} \beta_t \Big).$$
Its expected value is $l_i \sum_{t=1}^k \pi_{i,t} \beta_t^{\mathsf T}$.
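A sketch of sampling a count matrix from this generative model; all parameter values below (numbers of documents, words, topics, and document lengths) are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 1000, 5                      # documents, vocabulary size, topics

beta = rng.dirichlet(np.ones(d), size=k)   # k topic distributions over words (rows sum to 1)
pi = rng.dirichlet(np.ones(k), size=n)     # per-document topic proportions (rows sum to 1)
lengths = rng.integers(50, 200, size=n)    # document lengths l_i

# Row i of A ~ Multinomial(l_i, sum_t pi_{i,t} beta_t).
word_probs = pi @ beta                     # n x d matrix of per-document word probabilities
A = np.stack([rng.multinomial(lengths[i], word_probs[i]) for i in range(n)])
```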
Example: latent semantic analysis

Suppose $A \sim P_\theta$.

In expectation, $A$ has rank $k$:
$$\mathbb{E}(A) \;=\; \underbrace{\begin{bmatrix} l_1 \pi_1^{\mathsf T} \\ l_2 \pi_2^{\mathsf T} \\ \vdots \\ l_n \pi_n^{\mathsf T} \end{bmatrix}}_{n \times k} \; \underbrace{\begin{bmatrix} \beta_1^{\mathsf T} \\ \beta_2^{\mathsf T} \\ \vdots \\ \beta_k^{\mathsf T} \end{bmatrix}}_{k \times d}.$$

The observed matrix is $A = \mathbb{E}(A) + (\text{zero-mean noise})$, so $A$ is generally of rank $\min\{n, d\} \gg k$.
Example: latent semantic analysis

Using the SVD: the rank-$k$ SVD $\hat U_k \hat S_k \hat V_k^{\mathsf T}$ of $A$ gives an approximation to $L B^{\mathsf T}$:
$$\hat A := \hat U_k \hat S_k \hat V_k^{\mathsf T} \;\approx\; \mathbb{E}(A).$$
(The SVD helps remove some of the effect of the noise.)

Each of the $n$ documents can be summarized by $k$ numbers:
$$\hat A \hat V_k \;=\; \hat U_k \hat S_k \;\in\; \mathbb{R}^{n \times k}.$$

This new document feature representation is very useful for information retrieval. (Example: cosine similarities between documents become faster to compute and possibly less noisy.)

Actually estimating the $\pi_i$ and $\beta_t$ takes a bit more work.
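A sketch of how these $k$-dimensional document summaries and their cosine similarities might be computed; the count matrix A below is a random stand-in for a real document-word matrix.

```python
import numpy as np

A = np.random.poisson(1.0, size=(100, 2000)).astype(float)   # stand-in document-word counts
k = 20

U, s, Vt = np.linalg.svd(A, full_matrices=False)
doc_embed = U[:, :k] * s[:k]          # n x k document summaries, equal to U_k S_k = A_hat V_k

# Cosine similarity between documents i and j in the k-dimensional representation.
def cosine_sim(i, j, E=doc_embed):
    return (E[i] @ E[j]) / (np.linalg.norm(E[i]) * np.linalg.norm(E[j]))

print(cosine_sim(0, 1))
```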
Recap

- PCA: directions of maximum variance in the data = the subspace that minimizes residual squared error.
- Computation: power method.
- SVD: a general decomposition for arbitrary matrices.
- Low-rank SVD: the best low-rank approximation of a matrix in terms of average squared error.
- PCA/SVD: often useful when low-rank structure is expected (e.g., probabilistic modeling).