Linear Algebra Methods for Data Mining


1 Linear Algebra Methods for Data Mining. Saara Hyvönen, Spring 2007. The Singular Value Decomposition (SVD), continued.

2 The Singular Value Decomposition. Any m × n matrix A, with m ≥ n, can be factorized as A = U [Σ; 0] V^T (Σ stacked on top of an (m−n) × n zero block), where U ∈ R^(m×m) and V ∈ R^(n×n) are orthogonal, and Σ ∈ R^(n×n) is diagonal: Σ = diag(σ_1, σ_2, ..., σ_n), σ_1 ≥ σ_2 ≥ ... ≥ σ_n ≥ 0. "Skinny" version: A = U_1 Σ V^T, with U_1 ∈ R^(m×n).

3 Matrix approximation. Theorem. Let U_k = (u_1 u_2 ... u_k), V_k = (v_1 v_2 ... v_k) and Σ_k = diag(σ_1, σ_2, ..., σ_k), and define A_k = U_k Σ_k V_k^T. Then min_{rank(B) ≤ k} ‖A − B‖_2 = ‖A − A_k‖_2 = σ_{k+1}. That is, the best rank-k approximation of the matrix A is A_k = U_k Σ_k V_k^T.
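
A minimal Matlab sketch (not part of the original slides) illustrating the theorem; the test matrix and the rank k are arbitrary:

% Verify that the truncated SVD attains the best rank-k approximation error,
% i.e. norm(A - A_k, 2) = sigma_(k+1).
A = randn(8,5);                        % any m x n matrix with m >= n
k = 2;
[U,S,V] = svd(A);
Ak = U(:,1:k)*S(1:k,1:k)*V(:,1:k)';    % A_k = U_k Sigma_k V_k'
sigma = diag(S);
disp([norm(A - Ak, 2), sigma(k+1)])    % the two numbers agree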

4 Consequences. The best rank-one approximation of A is A_1 = σ_1 u_1 v_1^T. Assume σ_1 ≥ σ_2 ≥ ... ≥ σ_j > σ_{j+1} = σ_{j+2} = ... = σ_n = 0. Then min_{rank(B) ≤ j} ‖A − B‖_2 = ‖A − A_j‖_2 = σ_{j+1} = 0. So the rank of A is the number of nonzero singular values of A.

5 Perturbation theory. Theorem. If A and A + E are in R^(m×n) with m ≥ n, then for k = 1, ..., n, |σ_k(A + E) − σ_k(A)| ≤ σ_1(E) = ‖E‖_2. Proof. Omitted. Think of E as added noise.
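
A small Matlab check (not from the slides) of this bound on a random matrix and a random perturbation:

% |sigma_k(A+E) - sigma_k(A)| <= norm(E,2) for all k.
A = randn(10,4);
E = 1e-3*randn(10,4);
sA  = svd(A);
sAE = svd(A + E);
max(abs(sAE - sA)) <= norm(E, 2)       % returns 1 (true)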

6 Example: low rank matrix plus noise. [Figure: the singular values plotted against their index.]

7 Example: low rank matrix plus noise. Assume A is a low-rank matrix plus noise: A = Â + N, where Â has low rank and ‖N‖_2 ≪ ‖Â‖_2. The correct rank can be estimated by looking at the singular values: when choosing a good k, look for gaps in the singular values! When N is small, the number of larger singular values is often referred to as the numerical rank of A. The noise can be removed by estimating the numerical rank k from the singular values and approximating A by the truncated SVD U_k Σ_k V_k^T.

8 [Figure: log of the singular values plotted against their index.]

9 In the figure, there is a gap between the 11th and 12th singular values, so we estimate the numerical rank to be 11. To remove the noise, replace A by A_k = U_k Σ_k V_k^T with k = 11.
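
A Matlab sketch of this procedure on synthetic data (the sizes, the rank 11 and the noise level are made up for illustration):

% Build an exact rank-11 matrix plus small noise, look at the gap in the
% singular values, and denoise with the truncated SVD.
m = 200; n = 50; r = 11;
A0 = randn(m,r)*randn(r,n);             % exact rank r
Ah = A0 + 1e-6*randn(m,n);              % observed matrix = low rank + noise
s  = svd(Ah);
semilogy(s, 'o')                        % gap visible after the r-th singular value
[U,S,V] = svd(Ah);
Ak = U(:,1:r)*S(1:r,1:r)*V(:,1:r)';     % truncated SVD removes most of the noise
norm(A0 - Ak, 2)                        % small, of the order of the noise level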

10 Eigenvalue decomposition vs. SVD. For symmetric A the singular value decomposition is closely related to the eigendecomposition A = U Λ U^T, where the columns of U are the eigenvectors and Λ contains the eigenvalues. Computation of both the eigendecomposition and the SVD follows the same pattern.

11 Computation of the eigenvalue decomposition. 1. Use Givens rotations to transform A into tridiagonal form. 2. Use QR iteration with the Wilkinson shift µ to transform the tridiagonal form to diagonal form: repeat until converged, (i) Q_k R_k = T_k − µ_k I, (ii) T_{k+1} = R_k Q_k + µ_k I. The shift µ_k is the eigenvalue of the 2 × 2 submatrix in the lower right corner that is closest to the (n, n) element in the lower right corner. Once an eigenvalue is found, forget it, reduce (deflate) the problem, and go back to step 2.
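
A compact Matlab sketch of this iteration (not the lecture's code; the test matrix and the convergence tolerance are arbitrary choices). Each converged eigenvalue is deflated as described above:

T = full(gallery('tridiag', 6, 1, 4, 1));   % symmetric tridiagonal test matrix
n = size(T,1); evals = zeros(n,1); m = n;
while m > 1
    ev = eig(T(m-1:m, m-1:m));              % eigenvalues of the trailing 2x2 block
    [~,i] = min(abs(ev - T(m,m)));
    mu = ev(i);                             % Wilkinson shift
    [Q,R] = qr(T(1:m,1:m) - mu*eye(m));     % (i)  Q_k R_k = T_k - mu_k I
    T(1:m,1:m) = R*Q + mu*eye(m);           % (ii) T_{k+1} = R_k Q_k + mu_k I
    if abs(T(m,m-1)) < 1e-12*(abs(T(m-1,m-1)) + abs(T(m,m)))
        evals(m) = T(m,m); m = m - 1;       % converged: deflate
    end
end
evals(1) = T(1,1);
sort(evals)'                                % compare with eig of the original matrix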

12 Example. [Matrix A, and A after the first step: reduction to tridiagonal form with Givens rotations. Numerical entries omitted.]

13 [Intermediate results of 4 QR iteration steps. Numerical entries omitted.]

14 [eig(Aorig): the eigenvalues of the original matrix, computed by Matlab for comparison. Values omitted.]

15 Flop counts. Eigenvalues only: about 4n³/3. Accumulating the orthogonal transformations to compute the matrix of eigenvectors: about 9n³ more.

16 How about computing the SVD? We now know how to compute the eigendecomposition. Couldn't we use this to compute the eigenvalues of A^T A to get the singular values? Well, yes... and NO. This is not the way to do it: forming A^T A can lead to loss of information.
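
The classic illustration of this loss of information (a sketch, not from the slides) uses a nearly rank-deficient matrix whose small singular value is destroyed when A^T A is formed in floating point:

% Lauchli-type example: eps0^2 vanishes relative to 1 when A'*A is formed.
eps0 = 1e-8;
A = [1 1; eps0 0; 0 eps0];
sqrt(eig(A'*A))        % the small singular value comes out as 0
svd(A)                 % correctly returns approximately [sqrt(2); 1e-8]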

17 Computing the SVD. 1. Use Householder transformations to transform A into bidiagonal form B. Now B^T B is tridiagonal. 2. Use QR iteration with the Wilkinson shift µ to transform the bidiagonal B to diagonal form, working on the tridiagonal B^T B implicitly (without ever forming it!). Can be computed in roughly 6mn² + 20n³ flops. Efficiently implemented everywhere.

18 Sparse matrices. In many applications only a small fraction of the entries are nonzero (e.g. term-document matrices). Iterative methods are frequently used for solving sparse problems. This is because e.g. the transformation to tridiagonal form would destroy sparsity, leading to excessive storage requirements. Also, the computational complexity might be prohibitively high when the dimension of the data matrix is very large.
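
A short Matlab sketch (not from the slides) of the sparse approach: compute only a few leading singular triplets with the iterative solver svds, without ever forming a dense matrix. The random matrix here is a stand-in for a term-document matrix:

A = sprand(10000, 5000, 0.001);    % sparse random matrix, about 0.1% nonzeros
k = 10;
[U,S,V] = svds(A, k);              % k largest singular values and vectors
diag(S)'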

19 The Singular Value Decomposition. Any m × n matrix A, with m ≥ n, can be factorized as A = U [Σ; 0] V^T, where U ∈ R^(m×m) and V ∈ R^(n×n) are orthogonal, and Σ ∈ R^(n×n) is diagonal: Σ = diag(σ_1, σ_2, ..., σ_n), σ_1 ≥ σ_2 ≥ ... ≥ σ_n ≥ 0. "Skinny" version: A = U_1 Σ V^T, with U_1 ∈ R^(m×n).

20 Facts about SVD. Equivalent forms of the SVD: A^T A v_j = σ_j² v_j and A A^T u_j = σ_j² u_j, where u_j and v_j are the columns of U and V respectively. Let U_k, V_k and Σ_k be the matrices with the k first singular vectors and values, and define A_k = U_k Σ_k V_k^T. Then min_{rank(B) ≤ k} ‖A − B‖_2 = ‖A − A_k‖_2 = σ_{k+1}.

21 Principal components analysis. Idea: look for a direction such that the data projected onto it has maximal variance. When found, continue by seeking the next direction, which is orthogonal to this one (i.e. uncorrelated with it), and which explains as much of the remaining variance in the data as possible. Ergo: we are seeking linear combinations of the original variables. If we are lucky, we can find a few such linear combinations, or directions, or (principal) components, which describe the data fairly accurately. The aim is to capture the intrinsic variability in the data.

22 [Figure: scatter plot of data points with the 1st and 2nd principal component directions drawn in.]

23 Example: Atmospheric data. Data: 1500 days, and for each day the mean and the standard deviation of around 30 measured variables (temperature, wind speed and direction, rainfall, UV-A radiation, CO2 concentration, etc.). Our data matrix is therefore about 1500 × 60. Visualizing things in a 60-dimensional space is challenging! Instead, do PCA and project the days onto the plane defined by the first two principal components.

24 [Figure: the days projected onto the plane defined by the first two principal components, colored by month; the axes are the 1st and 2nd principal components.]

25 Example: spatial data analysis. Data: 9000 dialect words, 500 counties. Word-county matrix A: A(i, j) = 1 if word i appears in county j, and A(i, j) = 0 otherwise. Apply PCA to this.

26 Results obtained by PCA. Data points: words; variables: counties. Each principal component tells which counties explain the most significant part of the variation left in the data. The first principal component is essentially just the number of words in each county! After this, the geographical structure of the principal components is apparent. Note: PCA knows nothing of the geography of the counties.

27-29 [Figure-only slides illustrating the principal components of the dialect data.]

30 PCA = SVD. Let A be an n × m data matrix in which the rows represent the cases: each row is a data vector, each column represents a variable. (Note: usually the roles of rows and columns are the other way around!) A is centered: the estimated mean is subtracted from each column, so each column has zero mean. Let w be the m × 1 column vector of (unknown) projection weights that result in the largest variance when the data A is projected along w. Require w^T w = 1. The projection of a single data vector a onto w is w^T a = sum_{j=1}^m a_j w_j. The projection of the data along w is Aw.

31 The projection of the data along w is Aw. Variance: σ_w² = (Aw)^T (Aw) = w^T A^T A w = w^T C w, where C = A^T A is the covariance matrix of the data (A is centered!). Task: maximize the variance subject to the constraint w^T w = 1. Optimization problem: maximize f = w^T C w − λ(w^T w − 1), where λ is a Lagrange multiplier.

32 Optimization problem: maximize f = w^T C w − λ(w^T w − 1), where λ is a Lagrange multiplier. Differentiating with respect to w yields ∂f/∂w = 2Cw − 2λw = 0. Eigenvalue equation: Cw = λw, where C = A^T A. Solution: the singular values and singular vectors of A!!! More precisely: the first principal component of A is exactly the first right singular vector v_1 of A.

33 The solution of our optimization problem is given by w = v_1, λ = σ_1², where σ_1 and v_1 are the first singular value and the corresponding right singular vector of A, and Cw = λw. Our interest was to maximize the variance, which is given by w^T C w = λ w^T w = σ_1², so the singular value tells about the variance in the direction of the principal component.
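
A small Matlab check (illustration only, with arbitrary random data) that the first right singular vector of the centered data is the leading eigenvector of C = A^T A, and that the variance along it is σ_1²:

X = randn(100,5)*randn(5,5);           % data, rows = cases
A = X - repmat(mean(X,1), 100, 1);     % center the columns
[U,S,V] = svd(A, 0);
C = A'*A;
[W,L] = eig(C);
[~,i] = max(diag(L));
abs(V(:,1)'*W(:,i))                    % = 1: same direction up to sign
[S(1,1)^2, max(diag(L))]               % sigma_1^2 equals the largest eigenvalue of C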

34 What next? Once the first principal component is found, we continue in the same fashion to look for the next one, which is orthogonal to (all) the principal component(s) already found. The solutions are the right singular vectors v_k of A, and the variance in each direction is given by the corresponding squared singular value σ_k².

35 How not to compute the PCA. In the literature one frequently runs across PCA algorithms which start by computing the covariance matrix C = A^T A of the centered data matrix A, and then compute the eigenvalues of this. But we already know that this is a bad idea! The condition number of A^T A is the square of that of A. Loss of information. For a sparse A, C = A^T A is in general no longer sparse!

36 How to compute the PCA. Data matrix A: rows = data points, columns = variables (attributes, parameters). 1. Center the data by subtracting the mean of each column. 2. Compute the SVD (or the k first singular values and vectors) of the centered matrix Â: Â = UΣV^T. 3. The principal components are the columns of V, and the coordinates of the data in the basis defined by the principal components are UΣ.

37 Matlab code for PCA

%Data matrix A, columns: variables, rows: data points.
%Matlab function for computing the first k principal components of A.
function [pc,score] = pca(A,k)
[rows,cols] = size(A);
Ameans = repmat(mean(A,1),rows,1); %matrix whose rows are the column means
A = A - Ameans;                    %centering the data
[U,S,V] = svds(A,k);               %k is the number of pc:s desired
pc = V;
score = U*S;
%now A = score*pc' + Ameans

38 Note on Matlab. PCA is coded in the statistics toolbox in Matlab, BUT... DO NOT USE IT!! Why? We have so few statistics toolbox licenses that we run out of them frequently! Better not to waste scarce resources on this.

39 A vs A^T. Let A be a centered data matrix, and A = UΣV^T. The principal components of A are the columns of V, and the coordinates in the basis defined by the principal components are UΣ. If A = UΣV^T, then A^T = VΣ^T U^T. So aren't the principal components of A^T given by U and the coordinates by VΣ^T? So the PCs of A are the new coordinates of A^T and vice versa, modulo a multiplication by the diagonal matrix Σ (or its inverse)? No.

40 And why not? Because of the centering of the data! In general the transpose of a centered matrix is not the same as the centered transpose: (A − columnmeans(A))^T ≠ A^T − columnmeans(A^T). Only in special cases are the two related.
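
A tiny Matlab check (not from the slides) with an arbitrary 3 × 2 matrix:

A  = [1 2; 3 5; 4 9];
C1 = (A - repmat(mean(A,1), size(A,1), 1))';     % transpose of the column-centered A
C2 = A' - repmat(mean(A',1), size(A',1), 1);     % column-centering of A'
norm(C1 - C2)                                    % nonzero: the two differ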

41 Singular values tell about variance. The variance in the direction of the k-th principal component is given by the corresponding squared singular value σ_k². The singular values can be used to estimate how many principal components to keep. Rule of thumb: keep enough components to explain 85% of the variation, i.e. choose k so that (sum_{j=1}^k σ_j²) / (sum_{j=1}^n σ_j²) ≥ 0.85.
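
A Matlab sketch of the rule of thumb (the test data is arbitrary):

A = randn(200,30)*randn(30,30);            % some data, rows = cases
A = A - repmat(mean(A,1), 200, 1);         % center
s = svd(A);
explained = cumsum(s.^2) / sum(s.^2);      % cumulative fraction of variance
k = find(explained >= 0.85, 1)             % smallest k explaining at least 85%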

42 Why talk about PCA? Why not just stick to the SVD? Singular vectors = principal components:

43 Centering is central. The SVD will give vectors that go through the origin. Centering makes sure that the origin is in the middle of the data set.

44-45 [Figure-only slides illustrating the effect of centering on the SVD directions.]

46 Summary: PCA. PCA is SVD done on centered data. PCA looks for the direction along which the projected data has maximal variance. When found, PCA continues by seeking the next direction, which is orthogonal to all the previously found directions and which explains as much of the remaining variance in the data as possible. Principal components are uncorrelated.

47 PCA is useful for: data exploration, visualizing data, compressing data, outlier detection, ratio rules.

48 References. [1] Lars Eldén: Matrix Methods in Data Mining and Pattern Recognition, SIAM. [2] R. A. Horn and C. R. Johnson: Matrix Analysis, Cambridge University Press. [3] D. Hand, H. Mannila, P. Smyth: Principles of Data Mining, The MIT Press.
