
Singular Value Decomposition

This handout is a review of some basic concepts in linear algebra. For a detailed introduction, consult a linear algebra text; Linear Algebra and its Applications by Gilbert Strang (Harcourt, Brace, Jovanovich, 1988) is excellent.

1 Singular Value Decomposition and the Four Fundamental Subspaces

The SVD decomposes a matrix into the product of three components:

    A = U S V^t,

where V^t means the transpose of V. Here, A is the original NxM matrix, U is an NxN orthonormal matrix, V is an MxM orthonormal matrix, and S is an NxM matrix with non-zero elements only along the main diagonal. The number of non-zero elements in S is (at most) the lesser of M and N. The rank of a matrix is equal to the number of non-zero singular values.

We think of A as a linear transform, A x = b, that transforms x (M-dimensional vectors) into b (N-dimensional vectors). If the matrix is singular, then there is some collection of vectors x_i (some subspace of M-space) that is mapped to zero, A x_i = 0. This is called the row nullspace of A. There is also a subspace of N-space that can be reached by A, i.e., the set of vectors b_i for which there is some x with A x = b_i. This is called the column space of A. The dimension of the column space is called the rank of A.

The SVD explicitly constructs orthonormal bases for the row nullspace and column space of A. The columns of U whose same-numbered elements in S are non-zero are an orthonormal set of basis vectors that span the column space of A. The remaining columns of U span the row nullspace of A^t (also called the column nullspace of A). The columns of V (rows of V^t) whose same-numbered elements in S are zero are an orthonormal set of vectors that span the row nullspace of A. The remaining columns of V span the column space of A^t (also called the row space of A).

Here's a simple example:

    A = (  1  -1   0
           0   1  -1
           1   0  -1 )

The matrix size is 3x3 (M and N are both 3).
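As a concrete check (not part of the original handout), here is a short numpy sketch that computes the SVD of this example matrix and reads off the rank and the subspace bases. Note that numpy's svd returns V^t directly, and the smallest singular value is zero only up to round-off.

    import numpy as np

    A = np.array([[1, -1,  0],
                  [0,  1, -1],
                  [1,  0, -1]], dtype=float)

    # numpy returns A = U @ diag(s) @ Vt, with the singular values s in decreasing order.
    U, s, Vt = np.linalg.svd(A)

    print(np.round(s, 6))                  # two non-zero singular values, one (near) zero
    rank = np.sum(s > 1e-10)
    print(rank)                            # 2

    # Columns of U paired with non-zero singular values span the column space of A;
    # the remaining column spans the column nullspace (left nullspace) of A.
    # Columns of V paired with zero singular values span the row nullspace of A.
    print(np.round(U[:, :rank], 3))        # orthonormal basis for the column space
    print(np.round(Vt[rank:].T, 3))        # orthonormal basis for the row nullspace

    # Sanity check: the factors reproduce A.
    S = np.zeros_like(A)
    np.fill_diagonal(S, s)
    print(np.max(np.abs(U @ S @ Vt - A)))  # ~0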

The rank of the matrix is 2. The vector (1, 1, 1) is in the row nullspace, and so is any scalar multiple of this vector. The vector (1, 1, -1) is in the column nullspace, and so is any scalar multiple of this vector. The vectors (-2, 1, 1) and (1, 0, -1) are in the row space, and so is any linear combination of these two vectors. The vectors (-1, 0, -1) and (-1, 2, 1) are in the column space, and so is any linear combination of these two vectors.

It is easy to see that A has rank 2 by noting that you can get the third column of A by summing the first 2 columns and then multiplying by -1.

The vector (1, 1, -1) is in the column nullspace. It is called the column nullspace because it takes A's columns to zero: (1, 1, -1) A = 0. The vector (1, 1, 1) is in the row nullspace of A. It is called the row nullspace because it takes A's rows to zero: A (1, 1, 1)^t = 0.

How about the column space? Consider all possible combinations of the columns, A x, coming from all choices of x. Those products form the column space of A. In our example, the column space is a 2-dimensional subspace of 3-space (a tilted plane), the linear combinations of (-1, 0, -1) and (-1, 2, 1). Check that all 3 columns of A can be written as linear combinations of these two vectors. In general, the column space of a matrix is orthogonal to the column nullspace. This is easy to see in our example: (1, 1, -1) · (-1, 0, -1) = 0 and (1, 1, -1) · (-1, 2, 1) = 0.

How about the row space? Consider all possible combinations of the rows, b^t A, coming from all choices of b. Those products form the row space of A. In our example, the row space is another 2-dimensional subspace of 3-space (a different tilted plane), the linear combinations of (-2, 1, 1) and (1, 0, -1). Check that all 3 rows of A can be written as linear combinations of these two vectors. In general, the row space of a matrix is orthogonal to the row nullspace. In our example: (1, 1, 1) · (-2, 1, 1) = 0 and (1, 1, 1) · (1, 0, -1) = 0.

2 Linear Systems of Equations

Matrices are a convenient way to solve systems of linear equations, A x = b. Consider the same matrix, A, we used above. The product A x is always a combination of the columns of A:

    A x = x_1 (1, 0, 1)^t + x_2 (-1, 1, 0)^t + x_3 (0, -1, -1)^t.

To solve A x = b is to find a combination of the columns that gives b. We consider all possible combinations A x, coming from all choices of x. Those products form the column space of A.
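Here is a quick numpy check (ours, not from the handout) of the example's subspaces and of this combination-of-columns view of A x:

    import numpy as np

    A = np.array([[1, -1,  0],
                  [0,  1, -1],
                  [1,  0, -1]], dtype=float)

    print(A @ np.array([1.0, 1.0, 1.0]))     # row nullspace:    A (1,1,1)^t = 0
    print(np.array([1.0, 1.0, -1.0]) @ A)    # column nullspace: (1,1,-1) A = 0

    x = np.array([2.0, -1.0, 3.0])           # an arbitrary x (numbers are ours)
    combo = x[0] * A[:, 0] + x[1] * A[:, 1] + x[2] * A[:, 2]
    print(np.allclose(A @ x, combo))         # True: A x is a combination of A's columns

    # The columns only fill out a 2-dimensional plane: the column space has dimension 2.
    print(np.linalg.matrix_rank(A))          # 2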

In the example, the columns lie in 3-dimensional space, but their combinations fill out a plane (the matrix has rank 2). The plane goes through the origin, since one of the combinations has weights x_1 = x_2 = x_3 = 0. Some (in fact, most) vectors b do not lie on the plane, and for them A x = b cannot be solved exactly. The system A x = b has an exact solution only when the right side b is in the column space of A.

The vector b = (2, 3, 4)^t is not in the column space of A, so for this choice of b there is no x that satisfies A x = b. The vector b = (2, 3, 5)^t, on the other hand, does lie in the plane (spanned by the columns of A), so there is a solution. In particular, x = (5, 3, 0)^t is a solution (try it). However, there are other solutions as well; x = (6, 4, 1)^t will work. In fact, we can take x = (5, 3, 0)^t and add any scalar multiple of y = (1, 1, 1)^t, since (1, 1, 1) is in the row nullspace of A. For any scalar constant c, we can write:

    A (x + c y) = A x + A (c y) = A x + c A y = A x.

If an NxN matrix A has linearly independent columns then:

1. The row nullspace contains only the point 0.
2. The solution to A x = b (if there is one) is unique.
3. The rank of A is N.

In general, any two solutions to A x = b differ by a vector in the row nullspace.

3 Regression

When b is not in the column space of A, we can still find a vector x that comes the closest to solving A x = b. Figure 1 shows a simple example in which A and b are both 3x1 vectors and x is a scalar. In other words, b is a point in 3-space and the column space of A is a line in 3-space (different values of x put you at different points along that line). Everything that we do next is true for A's and b's of any size or length. But it's worthwhile keeping this simple picture in mind, because we can't draw diagrams in higher dimensions.

Here's the game: I give you A and b. You pick an x̂ such that A x̂ comes as close as possible to b. That is, you pick x̂ to minimize:

    ||A x̂ - b||^2.

The solution, of course, is the orthogonal projection of b onto A x. This is clear for the simple example in Figure 1; the shortest distance from the point, b, to the line, A x, is the perpendicular distance.
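A short numpy sketch (our own illustration) of these solution properties for the example matrix:

    import numpy as np

    A = np.array([[1, -1,  0],
                  [0,  1, -1],
                  [1,  0, -1]], dtype=float)
    y = np.array([1.0, 1.0, 1.0])            # row-nullspace vector

    x = np.array([5.0, 3.0, 0.0])
    print(A @ x)                             # [2. 3. 5.]  -> solves A x = (2, 3, 5)^t
    print(A @ (x + 7.0 * y))                 # [2. 3. 5.]  -> still a solution

    # b = (2, 3, 4)^t is not in the column space, so there is no exact solution;
    # least squares leaves a non-zero residual.
    b_bad = np.array([2.0, 3.0, 4.0])
    x_ls, res, rank, s = np.linalg.lstsq(A, b_bad, rcond=None)
    print(np.linalg.norm(A @ x_ls - b_bad))  # > 0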

Figure 1: Regression: find the vector x̂ such that A x̂ comes as close as possible to b. The solution is the orthogonal projection of b onto A x. The error vector e = A x̂ - b is perpendicular to A x̂.
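The picture in Figure 1 is easy to reproduce numerically. Here is a minimal numpy sketch; the particular numbers chosen for A (a single column) and b are our own, hypothetical values:

    import numpy as np

    a = np.array([1.0, 2.0, 2.0])     # the single column of A (hypothetical numbers)
    b = np.array([3.0, 1.0, 1.0])

    x_hat = (a @ b) / (a @ a)         # the scalar that minimizes ||A x - b||^2
    e = a * x_hat - b                 # error vector e = A x-hat - b

    print(x_hat)                      # 7/9
    print(e @ a)                      # ~0: the error is perpendicular to the line A x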

Given a value for x̂, the error vector (labeled e in the figure) is given by the difference A x̂ - b. Given the right choice of x̂, this error vector is perpendicular to the column space of A:

    (A x)^t (A x̂ - b) = 0,   or   x^t [A^t A x̂ - A^t b] = 0.

This is true for every x, i.e., it is true for all y = A x in the column space of A. There is only one way in which this can happen: the vector in square brackets has to be the zero vector:

    [A^t A x̂ - A^t b] = 0,   i.e.,   x̂ = (A^t A)^{-1} (A^t b).

Here the matrix (A^t A)^{-1} A^t is called the pseudo-inverse of A.

We have to worry about one more thing: when is A^t A invertible? One case is easy: if A has linearly independent columns, then A^t A is a square, symmetric, and invertible matrix. What if A is not full rank (i.e., what if it does not have linearly independent columns)? We still might be OK. Imagine that A is a 3x2 matrix, but that the two columns are identical. The columns are clearly not independent. We can proceed by dumping one (either one) of them. Make A_1, a new 3x1 vector that contains the first column of A. Proceed as above to find the scalar x̂_1 that minimizes ||A_1 x_1 - b||^2. The solution to our original problem is then simply x̂ = (x̂_1, 0)^t.

Generally, we use the SVD to compute the pseudo-inverse, A^#, of a matrix A. The SVD decomposes A = U S V^t, where S is an NxM matrix with non-zero entries s_i along the main diagonal. The pseudo-inverse is computed as:

    A^# = V S^# U^t,

where S^# is an MxN matrix with non-zero entries:

    s^#_i = 1/s_i   for s_i > t,
    s^#_i = 0       otherwise.

Here t is a threshold that is typically chosen based on the round-off error of the computer you're using. If one of the s_i's is very small (essentially zero), then the matrix is not full rank and we want to ignore one of the columns. Using the SVD is a very reliable method for inverting a matrix.
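The SVD-based pseudo-inverse takes only a few lines of numpy. The sketch below (our own; the threshold value is an arbitrary illustrative choice) applies it to the singular 3x3 example matrix used earlier and compares the result with numpy's built-in pinv:

    import numpy as np

    def pseudo_inverse(A, t=1e-10):
        # A# = V S# U^t, zeroing out singular values at or below the threshold t.
        U, s, Vt = np.linalg.svd(A)
        s_pinv = np.array([1.0 / si if si > t else 0.0 for si in s])
        S_pinv = np.zeros((A.shape[1], A.shape[0]))      # MxN
        np.fill_diagonal(S_pinv, s_pinv)
        return Vt.T @ S_pinv @ U.T

    A = np.array([[1, -1,  0],
                  [0,  1, -1],
                  [1,  0, -1]], dtype=float)
    b = np.array([2.0, 3.0, 4.0])                        # not in the column space of A

    x_hat = pseudo_inverse(A) @ b                        # least-squares solution via A#
    print(np.allclose(x_hat, np.linalg.pinv(A) @ b))     # True
    print(np.linalg.norm(A @ x_hat - b))                 # residual of the best fit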

Figure 2: Top-left: Scatter plot of a 2-dimensional data set. There are M data points; each data point is represented by a vector d^i = (d_0^i, d_1^i), for i = 1, ..., M. Top-right: Scatter plot of the data set after subtracting the mean. The first principal component is a unit vector in the elongated direction of the scatter plot.

4 Covariance, Eigenvectors, and Principal Components Analysis

This section of the handout deals with multi-dimensional data sets, and explains principal component analysis (also called the Karhunen-Loeve transform). Each data point represents one test condition; e.g., a data point might be the height and weight of a person, represented as a vector (height, weight). Or each data point might include the intensity values of all of the pixels in an image, represented as a very long vector (p_1, p_2, ..., p_N), where p_i is the intensity at the ith pixel.

Let's say we have M such data points, each an N-dimensional vector. The average of all the data (the average of the columns) is itself an N-dimensional vector. Subtracting the mean vector from each data vector gives us a new set of N-dimensional vectors. In what follows, we'll be dealing exclusively with these new vectors (after subtracting out the mean); doing it this way makes the notation much simpler. We can put these vectors into a big NxM matrix, D, where each column of this matrix corresponds to a single data point. Figure 2 shows an example in which N = 2. The covariance of the data is:

    cov(D) = (1/M) D D^t.

This definition of the covariance is correct for data sets of arbitrary dimension, N.
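As a sketch of this setup (numpy, with synthetic data of our own; none of the numbers come from the handout), here is the mean-subtraction and covariance computation for a 2-dimensional example like the one in Figure 2:

    import numpy as np

    rng = np.random.default_rng(0)
    M = 500

    # Synthetic elongated cloud, roughly along the (1, 1) direction.
    raw = rng.normal(size=(2, M)) * np.array([[2.0], [0.3]])
    theta = np.pi / 4
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    data = R @ raw                                 # 2 x M, one data point per column

    D = data - data.mean(axis=1, keepdims=True)    # subtract the mean of each row
    cov = (D @ D.T) / M                            # cov(D) = (1/M) D D^t
    print(np.round(cov, 3))                        # [[c_00, c_01], [c_10, c_11]]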

For our simple 2-dimensional data set, we get:

    cov(D) = (1/M) D D^t = ( c_00  c_01 )
                           ( c_10  c_11 )

Here c_00 is the variance of the data along the d_0 axis, c_11 is the variance of the data along the d_1 axis, and c_01 = c_10 is the covariance (or correlation coefficient):

    c_00 = (1/M) sum_i (d_0^i)^2,
    c_11 = (1/M) sum_i (d_1^i)^2,
    c_01 = c_10 = (1/M) sum_i d_0^i d_1^i,

where the sums run over all M data points. For the data set illustrated in Figure 2, the diagonal entries (c_00 and c_11) in the covariance matrix are about equal, and the off-diagonal entries (c_01 and c_10) are also reasonably large, because the data is pretty elongated in that diagonal direction.

Using the SVD, D = U S V^t, the columns of U span the column space of D. It turns out that these are also the eigenvectors of the covariance matrix D D^t. An eigenvector, e, of a square matrix, A, satisfies:

    A e = λ e

for some scalar λ, called an eigenvalue. In other words, the direction of e is unchanged by passing it through the matrix; only the length will change.

Let U_i be the ith column of U. For U_i to be an eigenvector of D D^t, we need D D^t U_i = λ U_i. Using the SVD,

    D D^t = U S V^t V S^t U^t = U S^2 U^t,

since V is orthonormal (V^t V = I is an identity matrix) and S S^t = S^2 is square and diagonal with entries s_i^2. Then,

    D D^t U_i = U S^2 U^t U_i
              = U S^2 e_i
              = s_i^2 U_i.
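Here is a numpy sketch (ours; it regenerates the synthetic data from the previous sketch) that checks this relationship between the SVD of D and the eigenvectors and eigenvalues of D D^t:

    import numpy as np

    rng = np.random.default_rng(0)
    M = 500
    raw = rng.normal(size=(2, M)) * np.array([[2.0], [0.3]])
    theta = np.pi / 4
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    D = R @ raw
    D = D - D.mean(axis=1, keepdims=True)

    U, s, Vt = np.linalg.svd(D, full_matrices=False)
    C = D @ D.T

    # Each column U_i satisfies (D D^t) U_i = s_i^2 U_i.
    for i in range(2):
        print(np.allclose(C @ U[:, i], s[i] ** 2 * U[:, i]))   # True, True

    # Compare with a direct eigendecomposition of D D^t.
    eigvals, eigvecs = np.linalg.eigh(C)
    print(np.allclose(np.sort(eigvals), np.sort(s ** 2)))      # True

    # The first principal component points along the elongated (1, 1) direction (up to sign).
    print(np.round(U[:, 0], 2))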

The second line follows from the fact that U is orthonormal, so that U^t U_i is e_i, the ith standard basis vector. The last line says that the square of the ith element of S is the eigenvalue corresponding to the ith eigenvector of D D^t, and that the eigenvector is given by the ith column of U.

The eigenvector corresponding to the largest eigenvalue is called the first principal component of the covariance matrix. Most implementations of the SVD return the eigenvalues and eigenvectors in order, so U_1 (the first column of U) is the first principal component. The second principal component is the eigenvector with the second largest eigenvalue (that is, the second column of U), and so on.

The key property of the principal components (that we state here without proof) is that they capture the structure of the data. In particular, the first principal component is the unit vector with the largest projection onto the data set, i.e., the following expression is maximized for e = U_1:

    ||e^t D||^2.

As usual, it is helpful to draw the shapes of the matrices on a piece of paper. For the simple example illustrated in Figure 2, the first principal component is a unit vector in the (1, 1) direction, the elongated direction of the scatter plot. In this case, D is a 2xM matrix and U_1 is a 2x1 vector, so U_1^t D is a 1xM row vector. Each element of this row vector is the length of the projection of a single data point onto the unit vector U_1.

The point is that you know a lot about a data point by knowing only the projection of that data point onto U_1. In fact, if you have to characterize each data point with only one number, you should use the projection onto the first principal component. That will give you the best approximation for the entire data set (given that you are only using one number for each data point).

What about the other principal components, the other columns of U? The second principal component is the unit vector, in the subspace orthogonal to the first principal component, that has the largest projection onto the data set. The third principal component is the unit vector, in the subspace orthogonal to both the first and second principal components, that has the largest projection. And so on.

If you have to characterize each data point with 3 numbers, you should use the projections onto each of the first three principal components.
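As a final sketch (numpy, with a synthetic 2 x M data matrix of our own; with 2-dimensional data only k = 1 is a real truncation, but the same code applies with k = 3 to higher-dimensional data), keeping only the projections onto the first k principal components gives the best k-number summary of each data point:

    import numpy as np

    rng = np.random.default_rng(0)
    D = rng.normal(size=(2, 500)) * np.array([[2.0], [0.3]])   # elongated along axis 0
    D = D - D.mean(axis=1, keepdims=True)

    U, s, Vt = np.linalg.svd(D, full_matrices=False)

    k = 1
    coords = U[:, :k].T @ D      # k x M: projections of each data point onto the first k PCs
    approx = U[:, :k] @ coords   # best approximation of D using k numbers per data point

    err = np.linalg.norm(D - approx) ** 2
    print(err, np.sum(s[k:] ** 2))   # the squared error equals the sum of the discarded s_i^2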