School of Computing, National University of Singapore
CS524 Theoretical Foundations of Multimedia

More Linear Algebra

Singular Value Decomposition (SVD)

"The highpoint of linear algebra" (Gilbert Strang)

Any m × n matrix A can be decomposed into

    A = U Σ V^T

where
    U : m × m : columns are left singular vectors
    Σ : m × n : diagonal : singular values
    V : n × n : columns are right singular vectors

e.g. for m > n,

    A = [ u_1 ... u_r  u_{r+1} ... u_m ] [ diag(σ_1, ..., σ_r)  0 ; 0  0 ] [ v_1 ... v_r  v_{r+1} ... v_n ]^T

where σ_1 ≥ σ_2 ≥ ... ≥ σ_r > 0 and r = rank(A).

Economy version:

    A = U_r Σ_r V_r^T,    U_r : m × r,  Σ_r : r × r,  V_r^T : r × n

U, V orthogonal: U^T U = I (m × m), V^T V = I (n × n).

Column space: look at Ax.

    Ax = U Σ V^T x,  and let y = V^T x
       = σ_1 u_1 y_1 + ... + σ_r u_r y_r

so Col(A) = Col(U_r). In fact, u_1, ..., u_r form an orthonormal basis for Col(A).
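This is easy to verify numerically. Below is a minimal NumPy sketch (the matrix A and the rank tolerance are illustrative assumptions, not from the notes) that computes the SVD, forms the economy version, and checks that Ax always lies in the span of u_1, ..., u_r:

    import numpy as np

    # A made-up rank-1 example with m = 3 > n = 2.
    A = np.array([[1.0, 2.0],
                  [2.0, 4.0],
                  [3.0, 6.0]])

    U, s, Vt = np.linalg.svd(A)        # full SVD: U is 3x3, Vt is 2x2
    r = int(np.sum(s > 1e-10))         # numerical rank

    # Economy version: A = U_r Σ_r V_r^T.
    A_econ = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
    print(np.allclose(A, A_econ))      # True

    # Any Ax lies in Col(U_r): projecting Ax onto U_r changes nothing.
    x = np.random.randn(2)
    Ax = A @ x
    Ur = U[:, :r]
    print(np.allclose(Ax, Ur @ (Ur.T @ Ax)))   # True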
Nullspace: look at Ax = U_r Σ_r V_r^T x = 0. Pre-multiply by U_r^T:

    Σ_r V_r^T x = 0

Pre-multiply by Σ_r^{-1}:

    V_r^T x = 0

i.e. we want x to be orthogonal to v_1, ..., v_r. That is precisely the span of v_{r+1}, ..., v_n, since V is orthogonal! Thus, v_{r+1}, ..., v_n form an orthonormal basis for Null(A).

Consider

    A^T A = (U Σ V^T)^T (U Σ V^T) = V Σ^T U^T U Σ V^T = V Σ^2 V^T

But this is the eigen-decomposition of A^T A! So
    V is the eigenvector matrix of A^T A
    Σ^2 is the eigenvalue matrix of A^T A
i.e. singular values are the positive square roots of the eigenvalues.

Similarly, consider

    A A^T = U Σ V^T V Σ^T U^T = U Σ^2 U^T

So U is the eigenvector matrix for A A^T, with the same eigenvalues.

In general, for m × n A:

    Ax = U Σ V^T x = (rotate in R^m)(scale)(rotate in R^n) x

Low-rank approximation

SVD provides the best lower-rank approximation to A, i.e. the rank-k approximation

    A_k = U_k Σ_k V_k^T

The idea is to use only the first k singular values/vectors, so that A_k ≈ A.

Use SVD for compression. Instead of storing A (mn numbers), store
    u_1, ..., u_k : mk numbers
    σ_1, ..., σ_k : k numbers
    v_1, ..., v_k : nk numbers
for a total of (m + n + 1)k numbers (see the sketch below).

Use SVD to filter noise. Typically, small singular values are caused by noise; using a rank-k approximation (k < r) removes noise.
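A short NumPy sketch of the storage count and the approximation quality (the sizes, seed, and rank k are arbitrary assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 80))

    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    k = 10
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation

    print(A.size)                              # mn = 8000 numbers
    print((A.shape[0] + A.shape[1] + 1) * k)   # (m + n + 1)k = 1810 numbers

    # Eckart-Young: the spectral-norm error equals the first
    # discarded singular value.
    print(np.allclose(np.linalg.norm(A - A_k, 2), s[k]))   # True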
Linear Equations Revisited: Ax = b

Key: a solution exists only when b ∈ Col(A).

Case 1: A n × n and invertible.
Then there is a unique solution: x = A^{-1} b. Here rank(A) = n and Col(A) = R^n.

Case 2: A n × n and singular.
Then rank(A) = r < n and nullity = n − r. Two possibilities:
(a) b ∈ Col(A): many solutions
(b) b ∉ Col(A): no exact solution, only a closest solution

(a) b ∈ Col(A): SVD gives a particular solution x_p such that A x_p = b. But we can add any vector x_n from the nullspace, since

    A(x_p + x_n) = A x_p + A x_n = b + 0

Infinitely many solutions! What is the SVD solution? Invert only in the rank-r subspace. Write A = U Σ V^T (all n × n), where

    Σ = diag(σ_1, ..., σ_r, 0, ..., 0)

Let A^+ = V Σ^+ U^T, where

    Σ^+ = diag(1/σ_1, ..., 1/σ_r, 0, ..., 0)

Then x_p = A^+ b. A^+ is the pseudoinverse. See Figure 1.

(b) b ∉ Col(A): No exact solution, but we can find the b̂ ∈ Col(A) closest to b. Solution: x = A^+ b = V Σ^+ U^T b.

Case 3: A m × n with m < n.
Underconstrained: fewer equations than unknowns. r = rank(A) ≤ min(m, n), i.e. r < n, so the nullspace is not trivial, and Col(A) ⊆ R^m. The situation is similar to the previous case: either b ∈ Col(A) or b ∉ Col(A). In practice, usually r = m, so that b ∈ Col(A), i.e. many solutions.

Case 4: A m × n with m > n.
Overconstrained: more equations than unknowns. The rank r is at most n, therefore Col(A) ⊂ R^m. Again, it depends on whether b ∈ Col(A); in general we can only find the closest, or least squares, solution x = A^+ b. The pseudoinverse A^+ solves Ax = b in the least squares sense, i.e. ‖Ax − b‖^2 is minimized.
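A sketch of the Case 2(a) construction above, building A^+ = V Σ^+ U^T by hand (the matrix, right-hand side, and tolerance are illustrative):

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [2.0, 4.0]])         # singular, rank 1
    b = np.array([1.0, 2.0])           # b lies in Col(A)

    U, s, Vt = np.linalg.svd(A)
    tol = 1e-10
    s_plus = np.array([1.0 / si if si > tol else 0.0 for si in s])
    A_plus = Vt.T @ np.diag(s_plus) @ U.T   # invert only in the rank-r subspace

    print(np.allclose(A_plus, np.linalg.pinv(A)))   # matches NumPy's pseudoinverse
    x_p = A_plus @ b
    print(np.allclose(A @ x_p, b))                  # particular solution: A x_p = b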
Figure 1: A singular matrix A has Col(A) ⊂ R^n. This is represented by a plane in the diagram. If b lies outside of Col(A), then the best one can do is to obtain b̂, the vector in Col(A) that is closest to b. This is what the pseudoinverse computes: b̂ = A x̂, where x̂ = A^+ b.

Pseudoinverse:

    A^+ = V Σ^+ U^T          (using SVD)
        = (A^T A)^{-1} A^T   (but this requires rank(A) = n)

Note:

    A^+ A = (A^T A)^{-1} A^T A = I,   but   A A^+ = A (A^T A)^{-1} A^T ≠ I in general

Thus, the pseudoinverse is only a left inverse, not a right inverse. If A is invertible, then the pseudoinverse equals the true inverse:

    A^+ = (A^T A)^{-1} A^T = A^{-1} A^{-T} A^T = A^{-1}

In Matlab, always use A \ b to solve Ax = b; \ will compute A^{-1} or A^+ accordingly.
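A quick numeric illustration of this left-inverse/right-inverse asymmetry for a full-column-rank tall matrix (the sizes and seed are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((5, 2))    # m > n, full column rank

    A_plus = np.linalg.pinv(A)
    print(np.allclose(A_plus @ A, np.eye(2)))   # True: left inverse, A+ A = I
    print(np.allclose(A @ A_plus, np.eye(5)))   # False: A A+ != I in general

    # A A+ is instead the projection onto Col(A) (see Projections below).
    P = A @ A_plus
    print(np.allclose(P @ P, P), np.allclose(P, P.T))   # idempotent and symmetric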
Matrix Inversion Formulas

Excerpt from: Statistical Signal Processing, by Louis L. Scharf, Addison-Wesley, 1991.

Lemma 1 (Inverse of a Partitioned Matrix). Let R denote the partitioned matrix

    R = [ A  B ; C  D ]

The inverse of R is

    R^{-1} = [ E^{-1}  -F H^{-1} ; -H^{-1} G  H^{-1} ]

where

    E = A - B D^{-1} C
    AF = B
    GA = C
    H = D - C A^{-1} B

All indicated inverses are assumed to exist. The matrix E is called the Schur complement of A, and the matrix H is called the Schur complement of D.

Lemma 2 (Matrix Inversion Lemma). Let E denote the Schur complement of A:

    E = A - B D^{-1} C

Then the inverse of E is

    E^{-1} = A^{-1} + F H^{-1} G

with AF = B, GA = C, and H = D - C A^{-1} B.

Lemmas 1 and 2 combine to form the following representation for the inverse of a partitioned matrix.

Theorem (Partitioned Matrix Inverse). The inverse of the partitioned matrix

    R = [ A  B ; C  D ]

is the matrix

    R^{-1} = [ A^{-1}  0 ; 0  0 ] + [ -F ; I ] H^{-1} [ -G  I ]

with AF = B, GA = C, and H = D - C A^{-1} B.
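These identities are easy to sanity-check numerically. A sketch (the block sizes, seed, and the diagonal loading that keeps the blocks invertible are all arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((3, 3)) + 4 * np.eye(3)
    B = rng.standard_normal((3, 2))
    C = rng.standard_normal((2, 3))
    D = rng.standard_normal((2, 2)) + 4 * np.eye(2)

    R = np.block([[A, B], [C, D]])

    E = A - B @ np.linalg.inv(D) @ C      # Schur complement E
    F = np.linalg.inv(A) @ B              # AF = B
    G = C @ np.linalg.inv(A)              # GA = C
    H = D - C @ np.linalg.inv(A) @ B      # Schur complement H

    # Lemma 1: partitioned inverse.
    R_inv = np.block([[np.linalg.inv(E), -F @ np.linalg.inv(H)],
                      [-np.linalg.inv(H) @ G, np.linalg.inv(H)]])
    print(np.allclose(R_inv, np.linalg.inv(R)))   # True

    # Lemma 2: matrix inversion lemma.
    print(np.allclose(np.linalg.inv(E),
                      np.linalg.inv(A) + F @ np.linalg.inv(H) @ G))   # True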
Corollary (Woodbury's Identity). The inverse of the matrix R + γ^2 u u^T is the matrix

    (R + γ^2 u u^T)^{-1} = R^{-1} - (γ^2 / (1 + γ^2 u^T R^{-1} u)) R^{-1} u u^T R^{-1}

Projections

Often we want to project x onto some subspace, i.e. find the y in the subspace closest to x. Geometrically, this occurs when x − y is orthogonal to the subspace. Often the subspace of interest is Col(A). Recall that in the SVD of A, the columns of U_r form an orthonormal basis for Col(A). The projection matrix P_A that projects any vector onto Col(A) is:

    P_A = U_r U_r^T              (SVD)
        = A (A^T A)^{-1} A^T

e.g. to project onto a line (vector) u:

    P_u = u u^T / (u^T u)

In general, a projection matrix P is one that satisfies:
1. P^T = P    (symmetric)
2. P^2 = P    (idempotent)

What are the eigenvalues of P? (A numeric check appears in the sketch below.)
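A sketch checking both the rank-one Woodbury update and the projection properties (R, u, γ^2, and the test matrix are arbitrary assumptions):

    import numpy as np

    rng = np.random.default_rng(3)

    # Woodbury's identity for a rank-one update.
    R = rng.standard_normal((4, 4)) + 4 * np.eye(4)
    u = rng.standard_normal(4)
    g2 = 0.5                                   # γ^2
    Rinv = np.linalg.inv(R)
    lhs = np.linalg.inv(R + g2 * np.outer(u, u))
    rhs = Rinv - (g2 / (1 + g2 * u @ Rinv @ u)) * (Rinv @ np.outer(u, u) @ Rinv)
    print(np.allclose(lhs, rhs))               # True

    # Projection onto Col(A): symmetric, idempotent, eigenvalues 0 and 1.
    A = rng.standard_normal((5, 2))
    U, s, Vt = np.linalg.svd(A)
    Ur = U[:, :2]
    P = Ur @ Ur.T
    print(np.allclose(P, P.T), np.allclose(P @ P, P))   # True True
    print(np.round(np.linalg.eigvalsh(P), 6))  # two ones, three zeros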
Derivatives

Differentiate:
              w.r.t. scalar   w.r.t. vector   w.r.t. matrix
    scalar    scalar          vector          matrix
    vector    vector          matrix
    matrix    matrix

scalar-scalar: e.g. d(x^2)/dx = 2x

vector-scalar: e.g.

    y = [ cos θ ; sin^2 θ ],   dy/dθ = [ -sin θ ; 2 sin θ cos θ ]

matrix-scalar: differentiate entry by entry, e.g.

    A = [ x^2  x ; x  1 ],   dA/dx = [ 2x  1 ; 1  0 ]

scalar-vector: f(x), a scalar function of the vector x = [x_1, ..., x_n]^T:

    df/dx = [ ∂f/∂x_1, ..., ∂f/∂x_n ]^T

vector-vector: y(x), an m-vector function of the vector x ∈ R^n. Then

    dy/dx = [ ∂y_1/∂x_1  ...  ∂y_m/∂x_1
              ...
              ∂y_1/∂x_n  ...  ∂y_m/∂x_n ]

with y : m, x : n, dy/dx : n × m.

scalar-matrix: f(A), a scalar function of the m × n matrix A. Then

    df/dA = [ ∂f/∂a_11  ...  ∂f/∂a_1n
              ∂f/∂a_21  ...  ∂f/∂a_2n
              ...
              ∂f/∂a_m1  ...  ∂f/∂a_mn ]

an m × n matrix.

Commonly used derivatives:
1. dx/dx = I
2. d(A^T x)/dx = A
3. d(y^T x)/dx = y
4. d(x^T A x)/dx = (A + A^T) x if A square; = 2Ax if A symmetric
5. d(u(x)^T v(x))/dx = [du/dx] v + [dv/dx] u    (product rule)
6. d tr(A)/dA = I
7. d det(A)/dA = det(A) A^{-T}
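Identity 4 is the one used most often below; a finite-difference sketch to check it (the matrix, point, and step size are arbitrary):

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.standard_normal((3, 3))
    x = rng.standard_normal(3)

    f = lambda z: z @ A @ z            # f(x) = x^T A x
    eps = 1e-6
    grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                        for e in np.eye(3)])   # central differences

    print(np.allclose(grad_fd, (A + A.T) @ x, atol=1e-5))   # True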
7 d det(a) = det(a)a da Example: to find pseudoinverse Let e = Ax b We want x such that e 2 smallest, ie e 2 2 smallest Hessian: 2 nd derivative Let y = e 2 2 = e e = (Ax b) (Ax b) = x A Ax 2b Ax + b b dy = 2A Ax 2A b = A Ax = A b ( ) x = A A A b } {{ } A Let f(x) be scalar function of x R n Then Hessian: Hessian is symmetric H = d2 f 2 = x 2 x 2 x x x 2 x 2 2 x n x x x n x 2 x n x 2 n Positive semi-definite (psd) A square matrix A is positive semi-definite if x Ax for all x Positive definite x Ax > Note: A is a psd means all eigenvalues If a Hessian matrix is psd, then f has minimum point eg in the pseudoinverse calculation, dy = 2A Ax 2A b So Hessian, H = d ( ) dy = 2A A Now, for any x, x Hx = 2x A Ax = 2 Ax 2 since Ax 2 is the squared norm So H is psd y has minimum point This justifies taking derivatives to find best x 8