CS 340 Lec. 6: Linear Dimensionality Reduction

1 CS 340 Lec. 6: Linear Dimensionality Reduction. AD, January 2011.

2 Linear Dimensionality Reduction: Introduction & Motivation; Brief Review of Linear Algebra; Principal Component Analysis; Applications; Singular Value Decomposition.

3 Lots of High-Dimensional Data. Lots of high-dimensional data: face images, documents (e.g. news articles), gene expression data, MEG readings.

4 Motivation. Why do dimensionality reduction? Computational: compress data for time/space efficiency. Statistical: fewer dimensions give better generalization. Visualization: understand the structure of the data. Anomaly detection: describe normal data, detect outliers. Dimensionality reduction in this course: linear methods (this week), clustering, feature selection.

5 Types of Problems. Supervised learning (classification, regression). Applications: face recognition, gene expression prediction. Techniques: k-NN, SVM, least squares (+ dimensionality reduction preprocessing). Structure discovery: find an alternative representation $z$ of the data $x$. Applications: visualization. Techniques: clustering, linear dimensionality reduction. Density estimation $p(x)$: model the data. Applications: anomaly detection, language modeling. Techniques: clustering, linear dimensionality reduction.

6 What is the true dimensionality of these data?

7 What is the true dimensionality of this data? Figure: simulated data in three classes, near the surface of a half-sphere.

8 Linear Dimensionality Reduction. Which line is best? Which line should I pick?

9 Linear Dimensionality Reduction. Basic idea of linear dimensionality reduction: represent each face, a high-dimensional vector $x \in \mathbb{R}^{361}$, by a lower-dimensional vector, say $z \in \mathbb{R}^{10}$, via $z = U^\top x$. How do we choose $U$?

10 Review of Linear Algebra. A scalar $x = 3.4$ is $1 \times 1$; a column vector $x = (x_1, \ldots, x_d)^\top$ is $d \times 1$; a matrix $A = (a_{i,j})$ is $d \times p$; a matrix $B = (b_{i,j})$ is $p \times n$. Transposition: $x^\top = x$ for a scalar, $x^\top = (x_1 \cdots x_d)$ for a vector, and $(A^\top)_{i,j} = a_{j,i}$. Quantities whose inner dimensions match may be multiplied by summing over this index; the outer dimensions give the dimensions of the answer: $(Ax)_i = \sum_{j=1}^{p} a_{i,j} x_j$ and $(AB)_{i,j} = \sum_{k=1}^{p} a_{i,k} b_{k,j}$. Hence $x^\top x$ is a scalar, $x x^\top$ is $d \times d$, $Ax$ is $d \times 1$, $AB$ is $d \times n$, and $x^\top A x$ is a scalar.

11 Review of Linear Algebra. Simple and valid manipulations: $(AB)C = A(BC)$, $A(B+C) = AB + AC$, $(A+B)^\top = A^\top + B^\top$, $(AB)^\top = B^\top A^\top$. Consider a square matrix $A$; then $u$ is an eigenvector of $A$ and $\lambda$ is its associated eigenvalue iff $Au = \lambda u$. If the matrix is diagonalizable, $AU = UD$, i.e. $A = U D U^{-1}$. Q: Prove that $A^k = U D^k U^{-1}$. Why is this expression useful?
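
To make the last identity concrete, here is a small numerical check (not part of the original slides) of $A^k = U D^k U^{-1}$; the matrix and the power $k$ are arbitrary choices:

```python
import numpy as np

# Numerical check that A^k = U D^k U^{-1} for a diagonalizable square matrix A.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))          # a generic matrix is diagonalizable

eigvals, U = np.linalg.eig(A)            # columns of U are eigenvectors of A

k = 5
A_k_direct = np.linalg.matrix_power(A, k)
A_k_eig = U @ np.diag(eigvals**k) @ np.linalg.inv(U)

# Useful because D^k is just an elementwise power of the diagonal entries,
# so repeated powers of A are cheap once the eigendecomposition is known.
print(np.allclose(A_k_direct, A_k_eig))  # True up to floating-point rounding
```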

12 Review of Linear Algebra: PD Matrices. A real-valued square matrix $A$ is called (semi-)positive definite if $x^\top A x \geq 0$ for all $x$. Q: Prove that for any matrix $M$, the matrix $M^\top M$ is (semi-)positive definite. Q: Prove that a positive definite matrix only admits positive eigenvalues.

13 Review of Linear Algebra: Inner Product. Let $u, v \in \mathbb{R}^d$; then the inner product of $u$ and $v$ is the scalar $u^\top v = v^\top u = \sum_{i=1}^{d} u_i v_i$. The (Euclidean) length/norm of a vector $u$ is written $\|u\|$ and is defined as the square root of the inner product of the vector with itself, $\|u\| = \sqrt{u^\top u} = \sqrt{\sum_{i=1}^{d} u_i^2}$. If the angle between vectors $u$ and $v$ is $\theta$, then $\cos(\theta) = \frac{u^\top v}{\|u\|\,\|v\|}$.

14 Approximating High-dimensional Vectors. We are given $N$ data points $\{x_i\}_{i=1}^{N}$ where $x_i \in \mathbb{R}^d$, and we want to approximate them by $\{\hat{x}_i\}_{i=1}^{N}$ using $\hat{x}_i = \sum_{j=1}^{k} z_{j,i} w_j$, where $z_{j,i} \in \mathbb{R}$ and $\{w_j\}_{j=1}^{k}$ are $\mathbb{R}^d$-valued basis vectors. This can be rewritten as $\hat{x}_i = W z_i$, where $W = (w_1 \cdots w_k)$ is a $d \times k$ matrix and $z_i = (z_{1,i} \cdots z_{k,i})^\top$ is a $k \times 1$ vector.

15 Approximating High-dimensional Vectors. An even more compact notation is $\hat{X} = W Z$, where $\hat{X}$ is $d \times N$, $W$ is $d \times k$ and $Z$ is $k \times N$, with $\hat{X} = (\hat{x}_1 \cdots \hat{x}_N)$ and $Z = (z_1 \cdots z_N)$. We can gain very significantly in terms of storage if $k \ll d$, as we only need to store $W$ (size $d \times k$) and $Z$ (size $k \times N$) to compute $\hat{X}$, instead of storing $X$ (size $d \times N$). Example: for $d = 1000$, $k = 10$ and $N = 10^6$, we have $dN / (dk + kN) \approx 100$.

16 Approximating High-dimensional Vectors. How should we select $W$ and $Z$ to ensure $\hat{X} \approx X$? We introduce the reconstruction error $X - \hat{X}$ and propose to minimize the square of its Frobenius norm, $J(W, Z) = \frac{1}{N}\|X - \hat{X}\|_F^2 = \frac{1}{N}\sum_{i=1}^{N}\|x_i - \hat{x}_i\|^2 = \frac{1}{N}\sum_{j=1}^{d}\sum_{i=1}^{N}(x_{j,i} - \hat{x}_{j,i})^2$, subject to $W$ being an orthonormal matrix, i.e. $w_i^\top w_j = 1$ if $i = j$ and $0$ otherwise, equivalently $W^\top W = I_k$. Q: What is the minimum total squared reconstruction error for $k = d$? What about $k > d$?

17 Preliminaries: Normalization and Centering of the Data. It is standard to normalize and center the data beforehand. This ensures that PCA finds the interesting directions of variation, not the ones which just happen to be large because of the units of measurement that are used. Hence, in practice, if the original data are $\{x_i\}$, we compute $m_j = \frac{1}{N}\sum_{i=1}^{N} x_{j,i}$ and $\sigma_j^2 = \frac{1}{N}\sum_{i=1}^{N}(x_{j,i} - m_j)^2$, and we set $\tilde{x}_i = (\tilde{x}_{1,i} \cdots \tilde{x}_{d,i})^\top$ where $\tilde{x}_{j,i} = \frac{x_{j,i} - m_j}{\sigma_j}$.
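
A minimal NumPy sketch of this preprocessing step (not from the slides), assuming the data are stored as a $d \times N$ array with one column per data point:

```python
import numpy as np

def standardize(X):
    """Center and normalize each feature of X (shape d x N, one column per point)."""
    m = X.mean(axis=1, keepdims=True)       # per-feature means m_j
    sigma = X.std(axis=1, keepdims=True)    # per-feature standard deviations sigma_j
    # Constant features (sigma_j = 0) would need special handling in practice.
    return (X - m) / sigma, m, sigma
```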

18 Finding the first principal component. Consider first the case $k = 1$; then we want to minimize $J(w_1, z^1) = \frac{1}{N}\sum_{i=1}^{N}\|x_i - z_{1,i} w_1\|^2 = \frac{1}{N}\sum_{i=1}^{N}\left(x_i^\top x_i - 2 z_{1,i} x_i^\top w_1 + z_{1,i}^2 w_1^\top w_1\right) = \frac{1}{N}\sum_{i=1}^{N}\left(x_i^\top x_i - 2 z_{1,i} x_i^\top w_1 + z_{1,i}^2\right)$ subject to $w_1^\top w_1 = 1$, with $z^1 = (z_{1,1}\ z_{1,2} \cdots z_{1,N})$. Taking the derivative w.r.t. $z_{1,i}$ and setting it equal to zero, $\frac{\partial J(w_1, z^1)}{\partial z_{1,i}} = -2 x_i^\top w_1 + 2 z_{1,i} = 0 \Rightarrow z_{1,i} = x_i^\top w_1$. Optimal reconstruction weights are obtained by orthogonally projecting the data onto the first principal direction $w_1$.

19 Finding the first principal component. Minimizing $J(w_1)$ is thus equivalent to maximizing $\frac{1}{N}\sum_{i=1}^{N} z_{1,i}^2 = \frac{1}{N}\sum_{i=1}^{N}(w_1^\top x_i)^2 = w_1^\top \underbrace{\left(\frac{1}{N}\sum_{i=1}^{N} x_i x_i^\top\right)}_{\Sigma} w_1$ s.t. $\|w_1\| = 1$. Assume the data have been centered, so that $\sum_{i=1}^{N} x_i = 0$; then $\frac{1}{N}\sum_{i=1}^{N}(w_1^\top x_i)^2 \approx \mathbb{E}\left[(w_1^\top x)^2\right] = \mathrm{Var}(w_1^\top x)$; i.e., we seek the $w_1$ maximizing the variance of the projected data.

20 Note additionally that we have $\Sigma = \frac{1}{N}\sum_{i=1}^{N} x_i x_i^\top \approx \mathbb{E}\left[x x^\top\right] = \mathrm{Cov}(x)$. That is, $\Sigma$ is an estimate of the covariance/correlation matrix of the data.

21 Finding the first principal component. Minimizing $J(w_1)$ is equivalent to maximizing $w_1^\top \Sigma w_1$ s.t. $\|w_1\| = 1$. Proposition: the vector $w_1^{\mathrm{opt}}$ minimizing $J(w_1)$ is the eigenvector (selected such that $\|w_1\| = 1$) associated to the largest eigenvalue of $\Sigma$. Proof: $\Sigma$ is a symmetric matrix, so it is diagonalizable by an orthonormal matrix $U$; i.e. $\Sigma U = U D$, hence $\Sigma = U D U^\top$ with $D$ diagonal. Without loss of generality, we pick $D = \mathrm{diag}(\sigma_1^2, \ldots, \sigma_d^2)$ where $\sigma_1^2 \geq \sigma_2^2 \geq \cdots \geq \sigma_d^2$.

22 Finding the first principal component. It follows that $\arg\max_{w_1 : \|w_1\| = 1} w_1^\top \Sigma w_1 = \arg\max_{w_1 : \|w_1\| = 1} (U^\top w_1)^\top D (U^\top w_1) = \arg\max_{y = U^\top w_1 : \|y\| = 1} y^\top D y = \arg\max_{y = U^\top w_1 : \|y\| = 1} \sum_{i=1}^{d} \sigma_i^2 y_i^2$, so $y^{\mathrm{opt}} = (1\ 0 \cdots 0)^\top$ and $w_1^{\mathrm{opt}} = U y^{\mathrm{opt}} = u_1$.
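
As a hedged numerical sketch of this result (not from the slides), the first principal direction can be obtained from an eigendecomposition of the empirical covariance; the data matrix is assumed centered, with one column per data point:

```python
import numpy as np

def first_principal_component(X):
    """Top eigenvector of Sigma = (1/N) X X^T for centered X of shape (d, N)."""
    N = X.shape[1]
    Sigma = (X @ X.T) / N
    eigvals, eigvecs = np.linalg.eigh(Sigma)   # symmetric matrix: ascending eigenvalues
    w1 = eigvecs[:, -1]                        # eigenvector of the largest eigenvalue
    z1 = w1 @ X                                # optimal scores z_{1,i} = w1^T x_i
    return w1, z1
```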

23 PCA Example. $\hat{x}_i = \sum_{j=1}^{k} z_{ij} v_j$. We have $d = 2$ and $k = 1$. Circles are the original data points, crosses are the reconstructions. The red star is the data mean. (Figure panels: loadings (basis) and scores.)

24 The second principal component. We want to minimize $J(w_1, z^1, w_2, z^2) = \frac{1}{N}\sum_{i=1}^{N}\|x_i - z_{1,i} w_1 - z_{2,i} w_2\|^2$ s.t. $\|w_1\| = \|w_2\| = 1$ and $w_1^\top w_2 = 0$. Optimizing w.r.t. $w_1, z^1$ gives the same results as before. If we optimize w.r.t. $z_{2,i}$, we find $\frac{\partial J(w_1^{\mathrm{opt}}, z^{\mathrm{opt},1}, w_2, z^2)}{\partial z_{2,i}} = -2 x_i^\top w_2 + 2 z_{2,i} = 0 \Rightarrow z_{2,i} = x_i^\top w_2$. Similarly, it can be proved that $w_2^{\mathrm{opt}}$ is the eigenvector (selected such that $\|w_2\| = 1$) associated to the second largest eigenvalue of $\Sigma$.

25 General Case. Compute the eigendecomposition of $\Sigma = \frac{1}{N} X X^\top = U D U^\top$ with $\sigma_1^2 \geq \sigma_2^2 \geq \cdots \geq \sigma_d^2$, and keep only the $k$ associated eigenvectors $U_k = (u_1 \cdots u_k)$. The estimate is given by $\hat{X} = U_k \underbrace{(U_k^\top X)}_{Z^{\mathrm{opt}}} = \sum_{j=1}^{k} u_j \underbrace{(u_j^\top X)}_{\text{loadings}}$. It can additionally be shown that (for $k < d$) $\frac{1}{N}\|X - \hat{X}\|_F^2 = \sum_{j=k+1}^{d} \sigma_j^2$.
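
A hedged sketch of the general case (not from the slides), again assuming centered data of shape $d \times N$; the final comment checks the stated error identity:

```python
import numpy as np

def pca_reconstruct(X, k):
    """Rank-k PCA approximation of centered X (shape d x N)."""
    N = X.shape[1]
    Sigma = (X @ X.T) / N
    eigvals, U = np.linalg.eigh(Sigma)        # ascending eigenvalues
    order = np.argsort(eigvals)[::-1]
    eigvals, U = eigvals[order], U[:, order]  # now sorted in decreasing order
    U_k = U[:, :k]                            # top-k eigenvectors u_1, ..., u_k
    Z = U_k.T @ X                             # scores, shape (k, N)
    return U_k @ Z, eigvals                   # X_hat and all eigenvalues

# Check of (1/N) ||X - X_hat||_F^2 = sum_{j>k} sigma_j^2, e.g.:
#   X_hat, eigvals = pca_reconstruct(X, k)
#   assert np.isclose(((X - X_hat) ** 2).sum() / X.shape[1], eigvals[k:].sum())
```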

26 Important Practical Remark. If you have centered and normalized the data beforehand, don't forget to correct for it later on! Suppose you have considered $\tilde{X} = \Phi^{-1}(X - \mu)$ (with $\mu$ the vector of means and $\Phi = \mathrm{diag}(\sigma_1, \ldots, \sigma_d)$), so that $X = \mu + \Phi\tilde{X}$; then the reconstruction will be $\hat{X} = \mu + \Phi\hat{\tilde{X}}$, where $\hat{\tilde{X}}$ is the PCA approximation of $\tilde{X}$.
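
A hedged sketch of this correction (not from the slides), combining standardization, PCA, and the inverse transform; the function and variable names are illustrative:

```python
import numpy as np

def pca_in_original_units(X, k):
    """PCA approximation of X (shape d x N), mapped back to the original units."""
    mu = X.mean(axis=1, keepdims=True)
    phi = X.std(axis=1, keepdims=True)         # plays the role of Phi = diag(sigma_j)
    X_tilde = (X - mu) / phi                   # standardized data
    Sigma = (X_tilde @ X_tilde.T) / X.shape[1]
    _, U = np.linalg.eigh(Sigma)
    U_k = U[:, ::-1][:, :k]                    # top-k principal directions
    X_tilde_hat = U_k @ (U_k.T @ X_tilde)      # PCA approximation of the standardized data
    return mu + phi * X_tilde_hat              # undo the centering and normalization
```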

27 Reconstruction Error. We have $x_i - \hat{x}_i = \sum_{j=k+1}^{d} u_j (u_j^\top x_i)$. Hence it follows that $\|x_i - \hat{x}_i\|^2 = \left(\sum_{j=k+1}^{d} u_j (u_j^\top x_i)\right)^\top \left(\sum_{j=k+1}^{d} u_j (u_j^\top x_i)\right) = \sum_{j=k+1}^{d} (u_j^\top x_i)^2 = \sum_{j=k+1}^{d} u_j^\top x_i x_i^\top u_j$, as $u_j^\top u_l = 1$ if $j = l$ and $0$ if $j \neq l$.

28 Reconstruction Error. Thus we have $\frac{1}{N}\|X - \hat{X}\|_F^2 = \frac{1}{N}\sum_{i=1}^{N}\|x_i - \hat{x}_i\|^2 = \frac{1}{N}\sum_{i=1}^{N}\sum_{j=k+1}^{d} u_j^\top x_i x_i^\top u_j = \sum_{j=k+1}^{d} u_j^\top \left(\frac{1}{N}\sum_{i=1}^{N} x_i x_i^\top\right) u_j = \sum_{j=k+1}^{d} u_j^\top \Sigma u_j = \sum_{j=k+1}^{d} \sigma_j^2$.

29 How Many Principal Components? Similar to the question of how many clusters. The magnitude of the eigenvalues indicates the fraction of variance captured. Eigenvalues on a face image dataset (plot of $\lambda_i$ versus $i$): eigenvalues typically drop off sharply, so you don't need too many components.
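
One common heuristic (a hedged sketch, not prescribed by the slides) is to keep enough components to explain a target fraction of the total variance:

```python
import numpy as np

def choose_k(eigvals_desc, target=0.95):
    """Smallest k whose leading eigenvalues explain at least `target` of the variance."""
    frac = np.cumsum(eigvals_desc) / np.sum(eigvals_desc)
    return int(np.searchsorted(frac, target) + 1)
```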

30 PCA Example: $\lambda_1 = 0.0938$, $\lambda_2 = \ldots$; an example where PCA is of interest.

31 PCA Example: a difficult example, where PCA is of no interest. PCA will make no distinction between these examples, because the structure on the left is not linear.

32 Image Compression using PCA. Given one single image, how can you use PCA to perform image compression? Many different approaches are possible. Example 1: interpret the columns of the image as different data points $x_i$. Example 2: interpret the rows of the image as different data points $x_i$. Example 3: partition the image into non-overlapping small blocks; the blocks are now the $x_i$. Note: there are better ways to compress images!
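
A hedged sketch of Example 3 (block-based PCA compression of a single grayscale image, not from the slides); the block size `b` and number of components `k` are illustrative choices:

```python
import numpy as np

def compress_image_blocks(img, b=8, k=8):
    """Rank-k PCA reconstruction of the b x b blocks of a 2-D grayscale image."""
    H, W = img.shape
    H, W = H - H % b, W - W % b                      # crop to a multiple of the block size
    blocks = (img[:H, :W]
              .reshape(H // b, b, W // b, b)
              .transpose(0, 2, 1, 3)
              .reshape(-1, b * b).T)                 # shape (b*b, number of blocks)
    mu = blocks.mean(axis=1, keepdims=True)
    Xc = blocks - mu
    Sigma = (Xc @ Xc.T) / Xc.shape[1]
    _, U = np.linalg.eigh(Sigma)
    U_k = U[:, ::-1][:, :k]                          # top-k principal directions
    blocks_hat = mu + U_k @ (U_k.T @ Xc)             # rank-k reconstruction of each block
    return (blocks_hat.T
            .reshape(H // b, W // b, b, b)
            .transpose(0, 2, 1, 3)
            .reshape(H, W))                          # reassemble the image
```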

33 Computing the Principal Components. Computing $\Sigma$ takes $O(N d^2)$ operations, and computing the eigenvectors of the $d \times d$ matrix $\Sigma$ takes $O(d^3)$ operations. This can be very prohibitive! If $d \gg N$, then we can compute the eigenvectors based on the eigenvectors of the so-called $N \times N$ Gram matrix $X^\top X$ in $O(N^3)$ instead. Assume $v_i$ is an eigenvector of $X^\top X$ with $\|v_i\| = 1$ associated to the eigenvalue $\lambda_i$; then by definition $X^\top X v_i = \lambda_i v_i$, so by multiplying both sides by $X$, $\underbrace{X X^\top}_{N\Sigma}(X v_i) = \lambda_i (X v_i)$. That is, $X v_i = \tilde{u}_i$ is an eigenvector of $\Sigma$ associated to the eigenvalue $\lambda_i / N$, and we can get a unit-norm eigenvector by selecting $u_i = \lambda_i^{-1/2}\tilde{u}_i$.
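
A hedged sketch of this trick (not from the slides), assuming centered $X$ of shape $d \times N$ with $d \gg N$:

```python
import numpy as np

def top_k_directions_via_gram(X, k):
    """Top-k eigenvectors of Sigma = (1/N) X X^T computed via the N x N Gram matrix."""
    N = X.shape[1]
    G = X.T @ X                                    # Gram matrix, only N x N
    lam, V = np.linalg.eigh(G)                     # ascending eigenvalues lambda_i
    lam, V = lam[::-1][:k], V[:, ::-1][:, :k]      # keep the k largest
    U_k = (X @ V) / np.sqrt(lam)                   # u_i = lambda_i^{-1/2} X v_i, unit norm
    return U_k, lam / N                            # eigenvalues of Sigma are lambda_i / N
```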

34 Singular Value Decomposition. Given a $d \times N$ matrix $X$, the SVD of $X$ is a factorization of the form $X = U D V^\top = \sum_{i=1}^{r} \lambda_i u_i v_i^\top$, where $r = \min(d, N)$, $U$ are the left singular vectors with $U^\top U = I_r$, and $V$ are the right singular vectors with $V^\top V = I_r$. The right singular vectors are eigenvectors of $X^\top X$: $X^\top X = (U D V^\top)^\top (U D V^\top) = V D U^\top U D V^\top = V D^2 V^\top$, so $X^\top X V = V D^2$. The left singular vectors are eigenvectors of $X X^\top$: $X X^\top = (U D V^\top)(U D V^\top)^\top = U D V^\top V D U^\top = U D^2 U^\top$, so $X X^\top U = U D^2$, and clearly $\sigma_i^2 = \lambda_i^2 / N$.

35 Singular Value Decomposition and PCA. In PCA, we have $\hat{X} = U_k (U_k^\top X)$. If we plug in the SVD decomposition, $\hat{X} = U_k (U_k^\top U) D V^\top = \sum_{i=1}^{k} \lambda_i u_i v_i^\top$; i.e., the truncated SVD yields the PCA approximation. This can be computationally beneficial.
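
A hedged sketch (not from the slides) of the truncated-SVD route to the same rank-$k$ approximation, applied to the centered data matrix:

```python
import numpy as np

def pca_via_svd(X, k):
    """Rank-k PCA approximation of centered X (shape d x N) via a truncated SVD."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)    # X = U diag(s) Vt
    return U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # sum of the k leading terms
```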

36 Visualization. Visualizing data: project 256-dimensional vectors (representing 16x16 images of digits) into 2D.

37 Visualization: the vectors projected onto their $(z_1, z_2)$ coordinates.

38 Eigen-Faces. Each $x_i \in \mathbb{R}^d$ is a face image: $x_{j,i}$ is the intensity of the $j$-th pixel in image $i$, and $d$ is the number of pixels. We approximate $X$ ($d \times N$) by $U$ ($d \times k$) times $Z = (z_1 \cdots z_N)$ ($k \times N$). Idea: $z_i$ is a more meaningful representation of the $i$-th face than $x_i$. We can use $z_i$ for nearest-neighbor classification. Much faster: $O(dk + Nk)$ time instead of $O(dN)$ when $N, d \gg k$.
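
A hedged sketch (not from the slides) of nearest-neighbor matching in the score space; `U_k` and `mu` are assumed to have been learned from the training faces, and all names are illustrative:

```python
import numpy as np

def nearest_face(x_test, X_train, U_k, mu):
    """Index of the training face closest to x_test (a d-vector) in PCA score space."""
    z_train = U_k.T @ (X_train - mu)            # k x N training scores
    z_test = U_k.T @ (x_test - mu.ravel())      # k-dimensional test score
    dists = np.linalg.norm(z_train - z_test[:, None], axis=0)
    return int(np.argmin(dists))                # index of the closest training face
```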

39 Eigen-Faces with K-NN: test images and their closest match in the training set using K = 4.

40 Eigen-Faces with K-NN: test images and their closest match in the training set using K = 10.

41 Eigen-Faces with K-NN: misclassification rate vs. K (plot of the number of misclassification errors on the test set against the PCA dimensionality).

42 Latent Semantic Analysis. PCA can be used to cluster documents and carry out information retrieval by using concepts as opposed to exact word-matching. This enables us to surmount the problems of synonymy (car, auto) and polysemy (money bank, river bank). The data are given as a term-frequency (TF) matrix: $N$ is the number of documents, $d$ is the number of words in the vocabulary, and each $x_i \in \mathbb{R}^d$ is a vector of word counts, with $x_{j,i}$ = number of occurrences of word $j$ in document $i$.

43 Example. Document 1: {I, eat, chips}. Document 2: {computer, chips, chips}. Document 3: {intel, computer, chips}. With the vocabulary ordered as (I, eat, chips, computer, intel), we have $X = \begin{pmatrix} 1 & 0 & 0 \\ 1 & 0 & 0 \\ 1 & 2 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{pmatrix}$, one column of word counts per document.

44 PCA for Latent Semantic Analysis. Using PCA, we obtain $X \approx U_k Z_k$; that is, we approximate the documents by a linear combination of $k$ basis documents. How do we measure similarity between two documents $x_i$ and $x_j$? $\hat{x}_i^\top \hat{x}_j$ is probably better than $x_i^\top x_j$. Applications: information retrieval. Note: usually no computational savings; the original $x$ is already sparse.
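
A hedged sketch (not from the slides) of document similarity in the latent space; the helper name is illustrative and the raw term-frequency matrix is used directly:

```python
import numpy as np

def latent_similarity(X, k, i, j):
    """Similarity x_hat_i^T x_hat_j between documents i and j of the TF matrix X (d x N)."""
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Z = np.diag(s[:k]) @ Vt[:k, :]       # k x N latent scores; x_hat_i = U_k z_i
    return float(Z[:, i] @ Z[:, j])      # equals x_hat_i^T x_hat_j since U_k is orthonormal
```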

45 Network Anomaly Detection [Lakhina, 05]. $x_{j,i}$ = amount of traffic on link $j$ in the network during time interval $i$. Model assumption: the total traffic is the sum of flows along a few paths. Apply PCA: each principal component intuitively represents a path.

46 Network Anomaly Detection. Model assumption: the total traffic is the sum of flows along a few paths. Apply PCA: each principal component intuitively represents a path. An anomaly is flagged when the traffic deviates from the first few principal components.
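
A hedged sketch of the residual idea (an illustration only, not the exact method of Lakhina et al.); `X` holds centered link traffic with one column per time interval, and `threshold` is an assumed tuning parameter:

```python
import numpy as np

def anomalous_intervals(X, k, threshold):
    """Indices of time intervals whose traffic lies far outside the top-k PCA subspace."""
    Sigma = (X @ X.T) / X.shape[1]
    _, U = np.linalg.eigh(Sigma)
    U_k = U[:, ::-1][:, :k]                       # "normal" subspace spanned by top-k components
    residual = X - U_k @ (U_k.T @ X)              # part of the traffic PCA cannot explain
    scores = np.linalg.norm(residual, axis=0)     # residual energy per time interval
    return np.where(scores > threshold)[0]
```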
