School of Computing, National University of Singapore
CS524 Theoretical Foundations of Multimedia

More Linear Algebra

Singular Value Decomposition (SVD)

"The highpoint of linear algebra" (Gilbert Strang)

Any m × n matrix A can be decomposed into

    A = U Σ V^T

where
    U : m × m : columns are left singular vectors
    Σ : m × n : diagonal : singular values
    V : n × n : columns are right singular vectors

e.g. for m > n,

    A = [ u_1 ... u_r  u_{r+1} ... u_m ] [ diag(σ_1, ..., σ_r)  0 ; 0  0 ] [ v_1 ... v_r  v_{r+1} ... v_n ]^T

where σ_1 ≥ σ_2 ≥ ... ≥ σ_r > 0 and r = rank(A).

Economy version:

    A = U_r Σ_r V_r^T,    U_r : m × r,  Σ_r : r × r,  V_r^T : r × n

U, V orthogonal: U^T U = I (m × m), V^T V = I (n × n).

Column space: look at Ax.

    Ax = U Σ V^T x,  and let y = V^T x
       = σ_1 u_1 y_1 + ... + σ_r u_r y_r

so Col(A) = Col(U_r). In fact, u_1, ..., u_r form an orthonormal basis for Col(A).
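This is easy to verify numerically. Below is a minimal NumPy sketch (the matrix A and the rank tolerance are illustrative assumptions, not from the notes) that computes the SVD, forms the economy version, and checks that Ax always lies in the span of u_1, ..., u_r:

    import numpy as np

    # A made-up rank-1 example with m = 3 > n = 2.
    A = np.array([[1.0, 2.0],
                  [2.0, 4.0],
                  [3.0, 6.0]])

    U, s, Vt = np.linalg.svd(A)        # full SVD: U is 3x3, Vt is 2x2
    r = int(np.sum(s > 1e-10))         # numerical rank

    # Economy version: A = U_r Σ_r V_r^T.
    A_econ = U[:, :r] @ np.diag(s[:r]) @ Vt[:r, :]
    print(np.allclose(A, A_econ))      # True

    # Any Ax lies in Col(U_r): projecting Ax onto U_r changes nothing.
    x = np.random.randn(2)
    Ax = A @ x
    Ur = U[:, :r]
    print(np.allclose(Ax, Ur @ (Ur.T @ Ax)))   # True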
Nullspace: look at Ax = U_r Σ_r V_r^T x = 0. Pre-multiply by U_r^T:

    Σ_r V_r^T x = 0

Pre-multiply by Σ_r^{-1}:

    V_r^T x = 0

i.e. we want x to be orthogonal to v_1, ..., v_r. That is precisely the span of v_{r+1}, ..., v_n, since V is orthogonal! Thus, v_{r+1}, ..., v_n form an orthonormal basis for Null(A).

Consider

    A^T A = (U Σ V^T)^T (U Σ V^T) = V Σ^T U^T U Σ V^T = V Σ^2 V^T

But this is the eigen-decomposition of A^T A! So
    V is the eigenvector matrix of A^T A
    Σ^2 is the eigenvalue matrix of A^T A
i.e. singular values are the positive square roots of the eigenvalues.

Similarly, consider

    A A^T = U Σ V^T V Σ^T U^T = U Σ^2 U^T

So U is the eigenvector matrix for A A^T, with the same eigenvalues.

In general, for m × n A:

    Ax = U Σ V^T x = (rotate in R^m)(scale)(rotate in R^n) x

Low-rank approximation

SVD provides the best lower-rank approximation to A, i.e. the rank-k approximation

    A_k = U_k Σ_k V_k^T

The idea is to use only the first k singular values/vectors, so that A_k ≈ A.

Use SVD for compression. Instead of storing A (mn numbers), store
    u_1, ..., u_k : mk numbers
    σ_1, ..., σ_k : k numbers
    v_1, ..., v_k : nk numbers
for a total of (m + n + 1)k numbers (see the sketch below).

Use SVD to filter noise. Typically, small singular values are caused by noise; using a rank-k approximation (k < r) removes noise.
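A short NumPy sketch of the storage count and the approximation quality (the sizes, seed, and rank k are arbitrary assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    A = rng.standard_normal((100, 80))

    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    k = 10
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]   # best rank-k approximation

    print(A.size)                              # mn = 8000 numbers
    print((A.shape[0] + A.shape[1] + 1) * k)   # (m + n + 1)k = 1810 numbers

    # Eckart-Young: the spectral-norm error equals the first
    # discarded singular value.
    print(np.allclose(np.linalg.norm(A - A_k, 2), s[k]))   # True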
Linear Equations Revisited: Ax = b

Key: a solution exists only when b ∈ Col(A).

Case 1: A n × n and invertible.
Then there is a unique solution: x = A^{-1} b. Here rank(A) = n and Col(A) = R^n.

Case 2: A n × n and singular.
Then rank(A) = r < n and nullity = n − r. Two possibilities:
(a) b ∈ Col(A): many solutions
(b) b ∉ Col(A): no exact solution, only a closest solution

(a) b ∈ Col(A): SVD gives a particular solution x_p such that A x_p = b. But we can add any vector x_n from the nullspace, since

    A(x_p + x_n) = A x_p + A x_n = b + 0

Infinitely many solutions! What is the SVD solution? Invert only in the rank-r subspace. Write A = U Σ V^T (all n × n), where

    Σ = diag(σ_1, ..., σ_r, 0, ..., 0)

Let A^+ = V Σ^+ U^T, where

    Σ^+ = diag(1/σ_1, ..., 1/σ_r, 0, ..., 0)

Then x_p = A^+ b. A^+ is the pseudoinverse. See Figure 1.

(b) b ∉ Col(A): No exact solution, but we can find the b̂ ∈ Col(A) closest to b. Solution: x = A^+ b = V Σ^+ U^T b.

Case 3: A m × n with m < n.
Underconstrained: fewer equations than unknowns. r = rank(A) ≤ min(m, n), i.e. r < n, so the nullspace is not trivial, and Col(A) ⊆ R^m. The situation is similar to the previous case: either b ∈ Col(A) or b ∉ Col(A). In practice, usually r = m, so that b ∈ Col(A), i.e. many solutions.

Case 4: A m × n with m > n.
Overconstrained: more equations than unknowns. The rank r is at most n, therefore Col(A) ⊂ R^m. Again, it depends on whether b ∈ Col(A); in general we can only find the closest, or least squares, solution x = A^+ b. The pseudoinverse A^+ solves Ax = b in the least squares sense, i.e. ‖Ax − b‖^2 is minimized.
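A sketch of the Case 2(a) construction above, building A^+ = V Σ^+ U^T by hand (the matrix, right-hand side, and tolerance are illustrative):

    import numpy as np

    A = np.array([[1.0, 2.0],
                  [2.0, 4.0]])         # singular, rank 1
    b = np.array([1.0, 2.0])           # b lies in Col(A)

    U, s, Vt = np.linalg.svd(A)
    tol = 1e-10
    s_plus = np.array([1.0 / si if si > tol else 0.0 for si in s])
    A_plus = Vt.T @ np.diag(s_plus) @ U.T   # invert only in the rank-r subspace

    print(np.allclose(A_plus, np.linalg.pinv(A)))   # matches NumPy's pseudoinverse
    x_p = A_plus @ b
    print(np.allclose(A @ x_p, b))                  # particular solution: A x_p = b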
Figure 1: A singular matrix A has Col(A) ⊂ R^n. This is represented by a plane in the diagram. If b lies outside of Col(A), then the best one can do is to obtain b̂, the vector in Col(A) that is closest to b. This is what the pseudoinverse computes: b̂ = A x̂, where x̂ = A^+ b.

Pseudoinverse:

    A^+ = V Σ^+ U^T          (using SVD)
        = (A^T A)^{-1} A^T   (but this requires rank(A) = n)

Note:

    A^+ A = (A^T A)^{-1} A^T A = I,   but   A A^+ = A (A^T A)^{-1} A^T ≠ I in general

Thus, the pseudoinverse is only a left inverse, not a right inverse. If A is invertible, then the pseudoinverse equals the true inverse:

    A^+ = (A^T A)^{-1} A^T = A^{-1} A^{-T} A^T = A^{-1}

In Matlab, always use A \ b to solve Ax = b; \ will compute A^{-1} or A^+ accordingly.
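A quick numeric illustration of this left-inverse/right-inverse asymmetry for a full-column-rank tall matrix (the sizes and seed are arbitrary):

    import numpy as np

    rng = np.random.default_rng(1)
    A = rng.standard_normal((5, 2))    # m > n, full column rank

    A_plus = np.linalg.pinv(A)
    print(np.allclose(A_plus @ A, np.eye(2)))   # True: left inverse, A+ A = I
    print(np.allclose(A @ A_plus, np.eye(5)))   # False: A A+ != I in general

    # A A+ is instead the projection onto Col(A) (see Projections below).
    P = A @ A_plus
    print(np.allclose(P @ P, P), np.allclose(P, P.T))   # idempotent and symmetric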
Matrix Inversion Formulas

Excerpt from: Statistical Signal Processing, by Louis L. Scharf, Addison-Wesley, 1991.

Lemma 1 (Inverse of a Partitioned Matrix). Let R denote the partitioned matrix

    R = [ A  B ; C  D ]

The inverse of R is

    R^{-1} = [ E^{-1}  -F H^{-1} ; -H^{-1} G  H^{-1} ]

where

    E = A - B D^{-1} C
    AF = B
    GA = C
    H = D - C A^{-1} B

All indicated inverses are assumed to exist. The matrix E is called the Schur complement of A, and the matrix H is called the Schur complement of D.

Lemma 2 (Matrix Inversion Lemma). Let E denote the Schur complement of A:

    E = A - B D^{-1} C

Then the inverse of E is

    E^{-1} = A^{-1} + F H^{-1} G

with AF = B, GA = C, and H = D - C A^{-1} B.

Lemmas 1 and 2 combine to form the following representation for the inverse of a partitioned matrix.

Theorem (Partitioned Matrix Inverse). The inverse of the partitioned matrix

    R = [ A  B ; C  D ]

is the matrix

    R^{-1} = [ A^{-1}  0 ; 0  0 ] + [ -F ; I ] H^{-1} [ -G  I ]

with AF = B, GA = C, and H = D - C A^{-1} B.
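These identities are easy to sanity-check numerically. A sketch (the block sizes, seed, and the diagonal loading that keeps the blocks invertible are all arbitrary choices):

    import numpy as np

    rng = np.random.default_rng(2)
    A = rng.standard_normal((3, 3)) + 4 * np.eye(3)
    B = rng.standard_normal((3, 2))
    C = rng.standard_normal((2, 3))
    D = rng.standard_normal((2, 2)) + 4 * np.eye(2)

    R = np.block([[A, B], [C, D]])

    E = A - B @ np.linalg.inv(D) @ C      # Schur complement E
    F = np.linalg.inv(A) @ B              # AF = B
    G = C @ np.linalg.inv(A)              # GA = C
    H = D - C @ np.linalg.inv(A) @ B      # Schur complement H

    # Lemma 1: partitioned inverse.
    R_inv = np.block([[np.linalg.inv(E), -F @ np.linalg.inv(H)],
                      [-np.linalg.inv(H) @ G, np.linalg.inv(H)]])
    print(np.allclose(R_inv, np.linalg.inv(R)))   # True

    # Lemma 2: matrix inversion lemma.
    print(np.allclose(np.linalg.inv(E),
                      np.linalg.inv(A) + F @ np.linalg.inv(H) @ G))   # True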
Corollary (Woodbury's Identity). The inverse of the matrix R + γ^2 u u^T is the matrix

    (R + γ^2 u u^T)^{-1} = R^{-1} - (γ^2 / (1 + γ^2 u^T R^{-1} u)) R^{-1} u u^T R^{-1}

Projections

Often we want to project x onto some subspace, i.e. find the y in the subspace closest to x. Geometrically, this occurs when x − y is orthogonal to the subspace. Often the subspace of interest is Col(A). Recall that in the SVD of A, the columns of U_r form an orthonormal basis for Col(A). The projection matrix P_A that projects any vector onto Col(A) is:

    P_A = U_r U_r^T              (SVD)
        = A (A^T A)^{-1} A^T

e.g. to project onto a line (vector) u:

    P_u = u u^T / (u^T u)

In general, a projection matrix P is one that satisfies:
1. P^T = P    (symmetric)
2. P^2 = P    (idempotent)

What are the eigenvalues of P? (A numeric check appears in the sketch below.)
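A sketch checking both the rank-one Woodbury update and the projection properties (R, u, γ^2, and the test matrix are arbitrary assumptions):

    import numpy as np

    rng = np.random.default_rng(3)

    # Woodbury's identity for a rank-one update.
    R = rng.standard_normal((4, 4)) + 4 * np.eye(4)
    u = rng.standard_normal(4)
    g2 = 0.5                                   # γ^2
    Rinv = np.linalg.inv(R)
    lhs = np.linalg.inv(R + g2 * np.outer(u, u))
    rhs = Rinv - (g2 / (1 + g2 * u @ Rinv @ u)) * (Rinv @ np.outer(u, u) @ Rinv)
    print(np.allclose(lhs, rhs))               # True

    # Projection onto Col(A): symmetric, idempotent, eigenvalues 0 and 1.
    A = rng.standard_normal((5, 2))
    U, s, Vt = np.linalg.svd(A)
    Ur = U[:, :2]
    P = Ur @ Ur.T
    print(np.allclose(P, P.T), np.allclose(P @ P, P))   # True True
    print(np.round(np.linalg.eigvalsh(P), 6))  # two ones, three zeros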
Derivatives

Differentiate:
              w.r.t. scalar   w.r.t. vector   w.r.t. matrix
    scalar    scalar          vector          matrix
    vector    vector          matrix
    matrix    matrix

scalar-scalar: e.g. d(x^2)/dx = 2x

vector-scalar: e.g.

    y = [ cos θ ; sin^2 θ ],   dy/dθ = [ -sin θ ; 2 sin θ cos θ ]

matrix-scalar: differentiate entry by entry, e.g.

    A = [ x^2  x ; x  1 ],   dA/dx = [ 2x  1 ; 1  0 ]

scalar-vector: f(x), a scalar function of the vector x = [x_1, ..., x_n]^T:

    df/dx = [ ∂f/∂x_1, ..., ∂f/∂x_n ]^T

vector-vector: y(x), an m-vector function of the vector x ∈ R^n. Then

    dy/dx = [ ∂y_1/∂x_1  ...  ∂y_m/∂x_1
              ...
              ∂y_1/∂x_n  ...  ∂y_m/∂x_n ]

with y : m, x : n, dy/dx : n × m.

scalar-matrix: f(A), a scalar function of the m × n matrix A. Then

    df/dA = [ ∂f/∂a_11  ...  ∂f/∂a_1n
              ∂f/∂a_21  ...  ∂f/∂a_2n
              ...
              ∂f/∂a_m1  ...  ∂f/∂a_mn ]

an m × n matrix.

Commonly used derivatives:
1. dx/dx = I
2. d(A^T x)/dx = A
3. d(y^T x)/dx = y
4. d(x^T A x)/dx = (A + A^T) x if A square; = 2Ax if A symmetric
5. d(u(x)^T v(x))/dx = [du/dx] v + [dv/dx] u    (product rule)
6. d tr(A)/dA = I
7. d det(A)/dA = det(A) A^{-T}
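Identity 4 is the one used most often below; a finite-difference sketch to check it (the matrix, point, and step size are arbitrary):

    import numpy as np

    rng = np.random.default_rng(4)
    A = rng.standard_normal((3, 3))
    x = rng.standard_normal(3)

    f = lambda z: z @ A @ z            # f(x) = x^T A x
    eps = 1e-6
    grad_fd = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                        for e in np.eye(3)])   # central differences

    print(np.allclose(grad_fd, (A + A.T) @ x, atol=1e-5))   # True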
7 d det(a) = det(a)a da Example: to find pseudoinverse Let e = Ax b We want x such that e 2 smallest, ie e 2 2 smallest Hessian: 2 nd derivative Let y = e 2 2 = e e = (Ax b) (Ax b) = x A Ax 2b Ax + b b dy = 2A Ax 2A b = A Ax = A b ( ) x = A A A b } {{ } A Let f(x) be scalar function of x R n Then Hessian: Hessian is symmetric H = d2 f 2 = x 2 x 2 x x x 2 x 2 2 x n x x x n x 2 x n x 2 n Positive semi-definite (psd) A square matrix A is positive semi-definite if x Ax for all x Positive definite x Ax > Note: A is a psd means all eigenvalues If a Hessian matrix is psd, then f has minimum point eg in the pseudoinverse calculation, dy = 2A Ax 2A b So Hessian, H = d ( ) dy = 2A A Now, for any x, x Hx = 2x A Ax = 2 Ax 2 since Ax 2 is the squared norm So H is psd y has minimum point This justifies taking derivatives to find best x 8