Linear Algebra Methods for Data Mining
Saara Hyvönen, Saara.Hyvonen@cs.helsinki.fi
Spring 2007

The Singular Value Decomposition (SVD), continued

The Singular Value Decomposition

Any m × n matrix A, with m ≥ n, can be factorized as

A = U [Σ; 0] V^T,

where U ∈ R^{m×m} and V ∈ R^{n×n} are orthogonal, and Σ ∈ R^{n×n} is diagonal:

Σ = diag(σ_1, σ_2, ..., σ_n),  σ_1 ≥ σ_2 ≥ ... ≥ σ_n ≥ 0.

"Skinny" version: A = U_1 Σ V^T, with U_1 ∈ R^{m×n}.
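A minimal Matlab sketch of these two factorizations, using a small random matrix as an arbitrary example:

m = 6; n = 3;                   %sizes chosen arbitrarily for illustration
A = randn(m,n);                 %random m-by-n matrix, m >= n
[U,S,V]    = svd(A);            %full SVD: U is m-by-m, S is m-by-n
[U1,S1,V1] = svd(A,0);          %"skinny" (economy-size) SVD: U1 is m-by-n
norm(A - U*S*V')                %both residuals should be of the order of eps
norm(A - U1*S1*V1')
norm(U'*U - eye(m))             %U and V are orthogonal
norm(V'*V - eye(n))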

Matrix approximation

Theorem. Let U_k = (u_1 u_2 ... u_k), V_k = (v_1 v_2 ... v_k) and Σ_k = diag(σ_1, σ_2, ..., σ_k), and define A_k = U_k Σ_k V_k^T. Then

min_{rank(B) ≤ k} ‖A − B‖_2 = ‖A − A_k‖_2 = σ_{k+1}.

In other words, the best rank-k approximation of the matrix A is A_k = U_k Σ_k V_k^T.
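A small Matlab sketch of the truncation A_k = U_k Σ_k V_k^T and of the approximation error ‖A − A_k‖_2 = σ_{k+1} (random test matrix, arbitrary k):

A = randn(8,5);                       %arbitrary test matrix
k = 2;                                %target rank, arbitrary choice
[U,S,V] = svd(A);
Ak = U(:,1:k)*S(1:k,1:k)*V(:,1:k)';   %best rank-k approximation A_k
s = diag(S);
norm(A - Ak)                          %equals sigma_{k+1} ...
s(k+1)                                %... the (k+1)st singular value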

Consequences

The best rank-one approximation of A is A_1 = U_1 Σ_1 V_1^T = σ_1 u_1 v_1^T.

Assume σ_1 ≥ σ_2 ≥ ... ≥ σ_j > σ_{j+1} = σ_{j+2} = ... = σ_n = 0. Then

min_{rank(B) ≤ j} ‖A − B‖_2 = ‖A − A_j‖_2 = σ_{j+1} = 0.

So the rank of A is the number of nonzero singular values of A.

Perturbation theory

Theorem. If A and A + E are in R^{m×n} with m ≥ n, then for k = 1, ..., n

|σ_k(A + E) − σ_k(A)| ≤ σ_1(E) = ‖E‖_2.

Proof: omitted. Think of E as added noise.
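A quick numerical check of this bound in Matlab (random A and perturbation E, sizes arbitrary):

A = randn(10,6);                %arbitrary test matrix
E = 1e-3*randn(10,6);           %small perturbation ("noise")
sA  = svd(A);
sAE = svd(A + E);
max(abs(sAE - sA))              %the largest shift of a singular value ...
norm(E)                         %... is bounded by the 2-norm of E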

Example: low rank matrix plus noise

[Figure: the singular values σ_k of the matrix plotted against the index k = 1, ..., 20.]

Example: low rank matrix plus noise

Assume A is a low-rank matrix plus noise: A = Â + N, where the noise N is small compared to Â.

The correct rank can be estimated by looking at the singular values: when choosing a good k, look for gaps in the singular values! When N is small, the number of larger singular values is often referred to as the numerical rank of A.

The noise can be removed by estimating the numerical rank k from the singular values and approximating A by the truncated SVD U_k Σ_k V_k^T.

[Figure: the logarithm of the singular values plotted against the index k = 1, ..., 20.]

In the figure there is a gap between the 11th and 12th singular values, so we estimate the numerical rank to be 11. To remove the noise, replace A by A_k = U_k Σ_k V_k^T with k = 11.
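A minimal Matlab sketch of this kind of denoising: build a low-rank matrix plus noise, look at the singular values for a gap, and truncate. The sizes, the true rank and the noise level below are arbitrary choices.

m = 40; n = 20; r = 11;               %arbitrary sizes and true rank
Ahat = randn(m,r)*randn(r,n);         %low-rank "signal"
A = Ahat + 0.01*randn(m,n);           %add a small amount of noise
[U,S,V] = svd(A);
semilogy(diag(S),'o')                 %look for a gap in the singular values
k = 11;                               %numerical rank read off from the gap
Ak = U(:,1:k)*S(1:k,1:k)*V(:,1:k)';   %truncated SVD: the denoised matrix
norm(A - Ahat), norm(Ak - Ahat)       %compare both to the noise-free matrix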

Eigenvalue decomposition vs. SVD

For a symmetric A the singular value decomposition is closely related to the eigendecomposition A = U Λ U^T, where the columns of U are the eigenvectors and Λ is the diagonal matrix of eigenvalues.

Computation of the eigendecomposition and of the SVD follows the same pattern.

Computation of the eigenvalue decomposition

1. Use Givens rotations to transform A into tridiagonal form.
2. Use QR iteration with the Wilkinson shift µ to transform the tridiagonal form to diagonal form. Repeat until converged:
   (i) Q_k R_k = T_k − µ_k I
   (ii) T_{k+1} = R_k Q_k + µ_k I
The shift parameter µ_k is the eigenvalue of the 2 × 2 submatrix in the lower right corner of T_k that is closest to the (n, n) element in the lower right corner.
Once an eigenvalue has been found, deflate (reduce) the problem and go back to step 2.
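A bare-bones Matlab sketch of this procedure for a symmetric matrix, with a very simple deflation; it is meant only to show the structure of steps 1-2 (Matlab's hess uses Householder reflections rather than Givens rotations for the reduction, but for symmetric A the result is the same tridiagonal form). The test matrix is the one used in the example on the next slide.

A = [1 2 3 4 5; 2 1 1 0 1; 3 1 2 1 3; 4 0 1 1 1; 5 1 3 1 3];
T = hess(A);                          %step 1: reduce to tridiagonal form
m = size(T,1);
while m > 1
    a = T(m-1,m-1); b = T(m,m-1); c = T(m,m);
    d = (a - c)/2;
    if d == 0, sd = 1; else sd = sign(d); end
    mu = c - sd*b^2/(abs(d) + sqrt(d^2 + b^2));    %Wilkinson shift
    [Q,R] = qr(T(1:m,1:m) - mu*eye(m));            %(i)  QR = T_k - mu_k*I
    T(1:m,1:m) = R*Q + mu*eye(m);                  %(ii) T_{k+1} = R*Q + mu_k*I
    if abs(T(m,m-1)) < 1e-12*(abs(T(m-1,m-1)) + abs(T(m,m)))
        m = m - 1;                                 %eigenvalue found: deflate
    end
end
sort(diag(T)), sort(eig(A))           %the diagonal of T holds the eigenvalues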

Example

A =
     1     2     3     4     5
     2     1     1     0     1
     3     1     2     1     3
     4     0     1     1     1
     5     1     3     1     3

%After 1st step (Givens reduction to tridiagonal form):
A =
    1.0000    7.3485   -0.0000   -0.0000    0.0000
    7.3485    5.5370    2.0066    0.0000   -0.0000
   -0.0000    2.0066    0.8979    0.6300    0.0000
   -0.0000    0.0000    0.6300    0.1894    0.5057
    0.0000   -0.0000    0.0000    0.5057    0.3757

%Intermediate results of 4 QR iteration steps:
    5.9399    7.4855    0.0000   -0.0000   -0.0000
    7.4855    0.5901    0.1715   -0.0000    0.0000
    0.0000    0.1715    0.4214    0.8535   -0.0000
    0.0000   -0.0000    0.8535    0.4877   -0.1693
   -0.0000    0.0000   -0.0000   -0.1693    0.5610

    9.3851    5.0747   -0.0000    0.0000   -0.0000
    5.0747   -2.8580    0.0248    0.0000   -0.0000
   -0.0000    0.0248   -0.0138    0.7308    0.0000
    0.0000    0.0000    0.7308    0.9303    0.0319
    0.0000   -0.0000   -0.0000    0.0319    0.5565

   10.7304    2.7336    0.0000    0.0000   -0.0000
    2.7336   -4.2034    0.0042   -0.0000    0.0000
    0.0000    0.0042   -0.1329    0.6387   -0.0000
    0.0000   -0.0000    0.6387    1.0502    0.0001
   -0.0000    0.0000    0.0000    0.0001    0.5558

   11.0949    1.3766   -0.0000    0.0000   -0.0000
    1.3766   -4.5680    0.0007    0.0000   -0.0000
   -0.0000    0.0007   -0.2227    0.5419    0.0000
    0.0000    0.0000    0.5419    1.1400    0.0000
    0.0000   -0.0000   -0.0000    0.0000    0.5558

eig(Aorig) =
   -4.6881
   -0.4119
    0.5558
    1.3293
   11.2150

Flop counts

Eigenvalues only: about 4n³/3 flops.

Accumulating the orthogonal transformations to compute the matrix of eigenvectors costs about 9n³ flops more.

How about computing the SVD?

We now know how to compute the eigendecomposition. Couldn't we use this to compute the eigenvalues of A^T A to get the singular values?

Well, yes... and NO. This is not the way to do it. Forming A^T A can lead to loss of information.
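A classical illustration of this loss of information, as a Matlab sketch (the small matrix is a standard textbook example; the value of ep is an arbitrary choice below sqrt(eps)):

ep = 1e-9;                      %smaller than sqrt(eps), which is about 1.5e-8
A = [1 1; ep 0; 0 ep];          %exact singular values: sqrt(2+ep^2) and ep
svd(A)                          %computed directly: the small value ep survives
sqrt(eig(A'*A))                 %via A'*A: ep^2 is lost when added to 1, so the
                                %small singular value is computed as (essentially) zero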

Computing the SVD

1. Use Householder transformations to transform A into bidiagonal form B. Now B^T B is tridiagonal.
2. Use QR iteration with the Wilkinson shift µ to transform the bidiagonal B to diagonal form (this works with B^T B implicitly, without ever forming it explicitly!).

Can be computed in about 6mn² + 20n³ flops. Efficiently implemented everywhere.

Sparse matrices

In many applications only a small number of the entries are nonzero (e.g. term-document matrices). Iterative methods are frequently used for solving sparse problems. This is because e.g. the transformation to tridiagonal form would destroy sparsity, which leads to excessive storage requirements. Also, the computational complexity might be prohibitively high when the dimension of the data matrix is very large.
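In Matlab, a sketch of this would use a sparse matrix together with svds, which computes only a few singular triplets by an iterative method; the matrix here is a random sparse stand-in for a term-document matrix:

A = sprand(10000, 5000, 0.001);   %sparse "term-document" matrix, ~0.1% nonzeros
[U,S,V] = svds(A, 10);            %10 largest singular triplets, computed
                                  %iteratively, without forming a dense matrix
diag(S)'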

The Singular Value Decomposition (recap)

Any m × n matrix A, with m ≥ n, can be factorized as

A = U [Σ; 0] V^T,

where U ∈ R^{m×m} and V ∈ R^{n×n} are orthogonal, and Σ ∈ R^{n×n} is diagonal:

Σ = diag(σ_1, σ_2, ..., σ_n),  σ_1 ≥ σ_2 ≥ ... ≥ σ_n ≥ 0.

"Skinny" version: A = U_1 Σ V^T, with U_1 ∈ R^{m×n}.

Facts about SVD

Equivalent forms of the SVD:

A^T A v_j = σ_j² v_j,   A A^T u_j = σ_j² u_j,

where u_j and v_j are the columns of U and V respectively.

Let U_k, V_k and Σ_k be the matrices containing the k first singular vectors and values, and define A_k = U_k Σ_k V_k^T. Then

min_{rank(B) ≤ k} ‖A − B‖_2 = ‖A − A_k‖_2 = σ_{k+1}.

Principal components analysis

Idea: look for the direction such that the data projected onto it has maximal variance. When it is found, continue by seeking the next direction, which is orthogonal to it (i.e. uncorrelated), and which explains as much of the remaining variance in the data as possible.

Ergo: we are seeking linear combinations of the original variables. If we are lucky, we can find a few such linear combinations, or directions, or (principal) components, which describe the data fairly accurately. The aim is to capture the intrinsic variability in the data.

[Figure: a scatter plot of data points with the 1st and 2nd principal component directions drawn through the point cloud.]

Example: Atmospheric data

Data: 1500 days, and for each day we have the mean and the standard deviation of around 30 measured variables (temperature, wind speed and direction, rainfall, UV-A radiation, CO2 concentration, etc.). Therefore, our data matrix is 1500 × 60.

Visualizing things in a 60-dimensional space is challenging! Instead, do PCA, and project the days onto the plane defined by the first two principal components.

[Figure: the days projected onto the plane defined by the first two principal components, colored by month; axes: 1st and 2nd principal component.]

Example: spatial data analysis

Data: 9000 dialect words, 500 counties. Word-county matrix A:

A(i, j) = 1 if word i appears in county j, and 0 otherwise.

Apply PCA to this.

Results obtained by PCA

Data points: words; variables: counties. Each principal component tells which counties explain the most significant part of the variation left in the data.

The first principal component is essentially just the number of words in each county! After this, the geographical structure of the principal components is apparent. Note: PCA knows nothing of the geography of the counties.


PCA = SVD

Let A be an n × m data matrix in which the rows represent the cases: each row is a data vector, each column represents a variable. (Note: usually the roles of rows and columns are the other way around!)

A is centered: the estimated mean is subtracted from each column, so each column has zero mean.

Let w be the m × 1 column vector of (unknown) projection weights that result in the largest variance when the data A is projected along w. Require w^T w = 1.

The projection of a data vector a onto w is w^T a = Σ_{j=1}^{m} a_j w_j. The projection of the data along w is Aw.

The projection of the data along w is Aw. Its variance is

σ_w² = (Aw)^T (Aw) = w^T A^T A w = w^T C w,

where C = A^T A is the covariance matrix of the data (A is centered!).

Task: maximize the variance subject to the constraint w^T w = 1.

Optimization problem: maximize

f = w^T C w − λ(w^T w − 1),

where λ is the Lagrange multiplier.

Optimization problem: maximize f = w^T C w − λ(w^T w − 1), where λ is the Lagrange multiplier.

Differentiating with respect to w yields

∂f/∂w = 2Cw − 2λw = 0.

This is an eigenvalue equation: Cw = λw, where C = A^T A.

Solution: the singular values and singular vectors of A!!! More precisely: the first principal component of A is exactly the first right singular vector v_1 of A.

The solution of our optimization problem is given by

w = v_1,  λ = σ_1²,

where σ_1 and v_1 are the first singular value and the corresponding right singular vector of A, and Cw = λw.

Our interest was to maximize the variance, which is given by

w^T C w = w^T λ w = σ_1² w^T w = σ_1²,

so the singular value tells us the variance in the direction of the principal component.
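A small Matlab check of this correspondence on a random centered data matrix (sizes arbitrary): the leading eigenvector of C = A^T A matches the first right singular vector v_1 up to sign, and the leading eigenvalue matches σ_1².

A = randn(200,5);
A = A - repmat(mean(A,1), size(A,1), 1);   %center the columns
[U,S,V] = svd(A,0);
C = A'*A;                                  %covariance matrix of the centered data
[W,D] = eig(C);
[lam,i] = max(diag(D));                    %leading eigenpair of C
[abs(W(:,i)) abs(V(:,1))]                  %same direction, up to sign
[lam S(1,1)^2]                             %lambda equals sigma_1^2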

What next?

Once the first principal component is found, we continue in the same fashion to look for the next one, which is orthogonal to (all) the principal component(s) already found.

The solutions are the right singular vectors v_k of A, and the variance in each direction is given by the corresponding squared singular value σ_k².

How not to compute the PCA:

In the literature one frequently runs across PCA algorithms that start by computing the covariance matrix C = A^T A of the centered data matrix A and then compute the eigenvalues of this. But we already know that this is a bad idea!

The condition number of A^T A is much larger than that of A: loss of information. Moreover, for a sparse A, C = A^T A is no longer sparse!

How to compute the PCA:

Data matrix A: rows = data points, columns = variables (attributes, parameters).

1. Center the data by subtracting the mean of each column.
2. Compute the SVD of the centered matrix Â (or its k first singular values and vectors): Â = U Σ V^T.
3. The principal components are the columns of V, and the coordinates of the data in the basis defined by the principal components are UΣ.

Matlab code for PCA

%Data matrix A: columns = variables, rows = data points.
%Matlab function for computing the first k principal components of A.
function [pc,score] = pca(A,k)
[rows,cols] = size(A);
Ameans = repmat(mean(A,1),rows,1);  %matrix whose rows are the column means
A = A - Ameans;                     %center the data
[U,S,V] = svds(A,k);                %k is the number of pc's desired
pc = V;                             %principal components
score = U*S;                        %coordinates in the pc basis
%now A (original) is approximately score*pc' + Ameans
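A short usage sketch of the function above (assuming it is saved as pca.m); the test data is an arbitrary rank-5 example, so with k = 5 the reconstruction is essentially exact:

A = randn(100,5)*randn(5,20);             %100 data points, 20 variables, rank 5
[pc,score] = pca(A,5);                    %first 5 principal components
Ahat = score*pc' + repmat(mean(A,1),100,1);
norm(A - Ahat)                            %should be of the order of eps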

Note on Matlab

PCA is implemented in the statistics toolbox in Matlab, BUT... DO NOT USE IT!!

Why? We have so few statistics toolbox licenses that we run out of them frequently! Better not to waste scarce resources on this.

See also: http://www.cis.hut.fi/projects/somtoolbox

A vs. A^T

Let A be a centered data matrix, with A = UΣV^T. The principal components of A are (the columns of) V, and the coordinates in the basis defined by the principal components are UΣ.

If A = UΣV^T, then A^T = VΣ^T U^T. So aren't the principal components of A^T given by U and the coordinates by VΣ^T? So the pc's of A would be the new coordinates of A^T and vice versa, modulo a multiplication by the diagonal matrix Σ (or its inverse)?

No.

And why not?

Because of the centering of the data! In general it does not hold that the transpose of a centered matrix is the same as the centered transpose:

(A − meansofcolumns(A))^T ≠ A^T − meansofcolumns(A^T).

In practice these are related only in special cases.
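A short Matlab illustration with an arbitrary small matrix: centering the columns and then transposing does not give the same result as transposing and then centering the columns.

A = [1 2; 3 5; 4 7];                            %arbitrary 3-by-2 data matrix
C1 = (A - repmat(mean(A,1),size(A,1),1))';      %transpose of the column-centered A
C2 = A' - repmat(mean(A',1),size(A',1),1);      %column-centering of A'
C1, C2                                          %different matrices in general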

Singular values tell about variance

The variance in the direction of the k-th principal component is given by the corresponding squared singular value σ_k².

Singular values can be used to estimate how many principal components to keep. Rule of thumb: keep enough to explain 85% of the variation:

(Σ_{j=1}^{k} σ_j²) / (Σ_{j=1}^{n} σ_j²) ≥ 0.85.
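As a Matlab sketch of this rule of thumb (random test data, arbitrary sizes): compute all singular values of the centered matrix and pick the smallest k whose cumulative share of the total squared singular values reaches 85%.

A = randn(100,5)*randn(5,20) + 0.1*randn(100,20);   %arbitrary test data
A = A - repmat(mean(A,1),size(A,1),1);              %center the columns
s = svd(A);                                         %all singular values
explained = cumsum(s.^2)/sum(s.^2);                 %cumulative share of the variance
k = find(explained >= 0.85, 1)                      %number of components to keep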

Why talk about PCA? Why not just stick to SVD?

Singular vectors = principal components.

Centering is central

SVD will give vectors that go through the origin. Centering makes sure that the origin is in the middle of the data set.

[Figures: plots of a small data set illustrating the effect of centering on the direction found by the SVD.]

Summary: PCA

PCA is SVD done on centered data.

PCA looks for the direction such that the data projected onto it has maximal variance. When it is found, PCA continues by seeking the next direction, which is orthogonal to all the previously found directions and which explains as much of the remaining variance in the data as possible.

Principal components are uncorrelated.

PCA is useful for
- data exploration
- visualizing data
- compressing data
- outlier detection
- ratio rules
