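STAT 309: MATHEMATICAL COMPUTATIONS I
FALL 2017
LECTURE 5

Date: October 1, 2018, version 1. Comments, bug reports: lekheng@galton.uchicago.edu

1. existence of svd

Theorem 1 (Existence of SVD). Every matrix has a singular value decomposition (condensed version).

Proof. Let $A \in \mathbb{C}^{m \times n}$ and for simplicity we assume that all its nonzero singular values are distinct. We define the matrix $W$ (named after Wielandt, who was the first to consider it)
\[
W = \begin{bmatrix} 0 & A \\ A^* & 0 \end{bmatrix} \in \mathbb{C}^{(m+n) \times (m+n)}.
\]
It is easy to verify that $W = W^*$, and by the spectral theorem for Hermitian matrices, $W$ has an evd,
\[
W = Z \Lambda Z^*,
\]
where $Z \in \mathbb{C}^{(m+n) \times (m+n)}$ is a unitary matrix and $\Lambda \in \mathbb{R}^{(m+n) \times (m+n)}$ is a diagonal matrix with real diagonal elements. If $z$ is an eigenvector of $W$, then we can write
\[
Wz = \sigma z, \qquad z = \begin{bmatrix} x \\ y \end{bmatrix},
\]
and therefore
\[
\begin{bmatrix} 0 & A \\ A^* & 0 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} = \sigma \begin{bmatrix} x \\ y \end{bmatrix},
\]
or, equivalently,
\[
Ay = \sigma x, \qquad A^* x = \sigma y.
\]
Now, suppose that we apply $W$ to the vector obtained from $z$ by negating $y$. Then we have
\[
\begin{bmatrix} 0 & A \\ A^* & 0 \end{bmatrix} \begin{bmatrix} x \\ -y \end{bmatrix}
= \begin{bmatrix} -Ay \\ A^* x \end{bmatrix}
= \begin{bmatrix} -\sigma x \\ \sigma y \end{bmatrix}
= -\sigma \begin{bmatrix} x \\ -y \end{bmatrix}.
\]
In other words, if $\sigma$ is an eigenvalue, then $-\sigma$ is also an eigenvalue. So we may assume without loss of generality that
\[
\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0 = \dots = 0,
\]
where $r = \operatorname{rank}(A)$. So the diagonal matrix $\Lambda$ of eigenvalues of $W$ may be written as
\[
\Lambda = \operatorname{diag}(\sigma_1, \sigma_2, \dots, \sigma_r, -\sigma_1, -\sigma_2, \dots, -\sigma_r, 0, \dots, 0) \in \mathbb{C}^{(m+n) \times (m+n)}.
\]
Observe that there is a zero block of size $(m + n - 2r) \times (m + n - 2r)$ in the bottom right corner of $\Lambda$.

We scale the eigenvector $z$ of $W$ so that $z^* z = 2$. Since $W$ is Hermitian, eigenvectors corresponding to the distinct eigenvalues $\sigma$ and $-\sigma$ are orthogonal, so it follows that
\[
\begin{bmatrix} x^* & y^* \end{bmatrix} \begin{bmatrix} x \\ -y \end{bmatrix} = 0.
\]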
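These yield the system of equations
\[
x^* x + y^* y = 2, \qquad x^* x - y^* y = 0,
\]
which has the unique solution
\[
x^* x = 1, \qquad y^* y = 1.
\]
Now note that we can represent the matrix of normalized eigenvectors of $W$ corresponding to nonzero eigenvalues (note that there are exactly $2r$ of these) as
\[
\widetilde{Z} = \frac{1}{\sqrt{2}} \begin{bmatrix} X & X \\ Y & -Y \end{bmatrix} \in \mathbb{C}^{(m+n) \times 2r}.
\]
Note that the factor $1/\sqrt{2}$ appears because of the way we have chosen the norm of $z$. We also let
\[
\widetilde{\Lambda} = \operatorname{diag}(\sigma_1, \sigma_2, \dots, \sigma_r, -\sigma_1, -\sigma_2, \dots, -\sigma_r) \in \mathbb{C}^{2r \times 2r}.
\]
It is easy to see that
\[
Z \Lambda Z^* = \widetilde{Z} \widetilde{\Lambda} \widetilde{Z}^*
\]
just by multiplying out the zero block in $\Lambda$. So, writing $\Sigma_r = \operatorname{diag}(\sigma_1, \dots, \sigma_r)$, we have
\begin{align*}
\begin{bmatrix} 0 & A \\ A^* & 0 \end{bmatrix} = W = Z \Lambda Z^* = \widetilde{Z} \widetilde{\Lambda} \widetilde{Z}^*
&= \frac{1}{2} \begin{bmatrix} X & X \\ Y & -Y \end{bmatrix} \begin{bmatrix} \Sigma_r & 0 \\ 0 & -\Sigma_r \end{bmatrix} \begin{bmatrix} X^* & Y^* \\ X^* & -Y^* \end{bmatrix} \\
&= \frac{1}{2} \begin{bmatrix} X \Sigma_r & -X \Sigma_r \\ Y \Sigma_r & Y \Sigma_r \end{bmatrix} \begin{bmatrix} X^* & Y^* \\ X^* & -Y^* \end{bmatrix} \\
&= \frac{1}{2} \begin{bmatrix} 0 & 2 X \Sigma_r Y^* \\ 2 Y \Sigma_r X^* & 0 \end{bmatrix} \\
&= \begin{bmatrix} 0 & X \Sigma_r Y^* \\ Y \Sigma_r X^* & 0 \end{bmatrix},
\end{align*}
and therefore
\[
A = X \Sigma_r Y^*, \qquad A^* = Y \Sigma_r X^*,
\]
where $X$ is an $m \times r$ matrix, $\Sigma_r$ is $r \times r$, $Y$ is $n \times r$, and $r$ is the rank of $A$. We have obtained the condensed svd of $A$.

The last missing bit is the orthonormality of the columns of $X$ and $Y$. The columns have unit norm because $x^* x = 1$ and $y^* y = 1$. Orthogonality follows from the fact that distinct columns of
\[
\begin{bmatrix} X & X \\ Y & -Y \end{bmatrix}
\]
are mutually orthogonal since they correspond to distinct eigenvalues, and so if we pick
\[
\begin{bmatrix} x_i \\ y_i \end{bmatrix}, \qquad \begin{bmatrix} x_j \\ y_j \end{bmatrix}, \qquad \begin{bmatrix} x_j \\ -y_j \end{bmatrix}
\]
for $i \ne j$, and take the inner products of the first with each of the other two, we get
\[
x_i^* x_j + y_i^* y_j = 0, \qquad x_i^* x_j - y_i^* y_j = 0.
\]
Adding them gives $x_i^* x_j = 0$ and subtracting them gives $y_i^* y_j = 0$ for all $i \ne j$, as required. $\square$

see Theorem 4.1 in Trefethen and Bau for an alternative non-constructive proof that does not require the use of the spectral theorem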
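The construction in the proof can be tried out numerically. The following is a minimal sketch, assuming NumPy is available (the notes themselves only mention Matlab); the helper name `svd_via_wielandt` and the tolerance `tol` are hypothetical choices made for illustration, not part of the lecture.

```python
import numpy as np

def svd_via_wielandt(A, tol=1e-10):
    """Condensed SVD (X, sigma, Y) of A from the Hermitian matrix W = [[0, A], [A*, 0]].

    Hypothetical helper for illustration; tol separates positive eigenvalues from zero.
    """
    m, n = A.shape
    W = np.block([[np.zeros((m, m)), A],
                  [A.conj().T, np.zeros((n, n))]])
    evals, Z = np.linalg.eigh(W)        # W is Hermitian: real eigenvalues, orthonormal eigenvectors
    keep = evals > tol                  # the r positive eigenvalues are the singular values
    sigma, Zr = evals[keep], Z[:, keep]
    order = np.argsort(-sigma)          # sort in decreasing order
    sigma, Zr = sigma[order], Zr[:, order]
    Zr = np.sqrt(2) * Zr                # rescale so that z* z = 2, as in the proof
    return Zr[:m, :], sigma, Zr[m:, :]  # x-parts form X, y-parts form Y

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3)) + 1j * rng.standard_normal((5, 3))
X, sigma, Y = svd_via_wielandt(A)
print(np.allclose(X @ np.diag(sigma) @ Y.conj().T, A))         # A = X Sigma_r Y*
print(np.allclose(sigma, np.linalg.svd(A, compute_uv=False)))  # agrees with a library SVD
```

Only the eigenvectors for the positive eigenvalues are needed, since the eigenvectors for $-\sigma$ are obtained by negating $y$. This is a check of the proof, not a recommended way to compute the svd in practice.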
2. other characterizations of svd

the proof of the above theorem gives us two more characterizations of singular values and singular vectors:

(i) in terms of eigenvalues and eigenvectors of an $(m + n) \times (m + n)$ Hermitian matrix:
\[
\begin{bmatrix} 0 & A \\ A^* & 0 \end{bmatrix}
= \frac{1}{2} \begin{bmatrix} U & U \\ V & -V \end{bmatrix} \begin{bmatrix} \Sigma & 0 \\ 0 & -\Sigma \end{bmatrix} \begin{bmatrix} U & U \\ V & -V \end{bmatrix}^*
\]

(ii) in terms of a coupled system of equations
\[
\begin{cases} Av = \sigma u, \\ A^* u = \sigma v \end{cases}
\]

the following is yet another way to characterize them, in terms of eigenvalues/eigenvectors of an $m \times m$ Hermitian matrix and an $n \times n$ Hermitian matrix

Lemma 1. The squares of the singular values of a matrix $A$ are eigenvalues of $AA^*$ and $A^*A$. The left singular vectors of $A$ are the eigenvectors of $AA^*$ and the right singular vectors of $A$ are the eigenvectors of $A^*A$.

Proof. From the relationships
\[
Ay = \sigma x, \qquad A^* x = \sigma y,
\]
we obtain
\[
A^* A y = \sigma^2 y, \qquad A A^* x = \sigma^2 x.
\]
Therefore, if $\pm\sigma$ are eigenvalues of $W$, then $\sigma^2$ is an eigenvalue of both $AA^*$ and $A^*A$. Also
\[
AA^* = (U \Sigma V^*)(V \Sigma^{\mathsf{T}} U^*) = U \Sigma \Sigma^{\mathsf{T}} U^*, \qquad
A^*A = (V \Sigma^{\mathsf{T}} U^*)(U \Sigma V^*) = V \Sigma^{\mathsf{T}} \Sigma V^*.
\]
Note that $\Sigma^* = \Sigma^{\mathsf{T}}$ since singular values are real. The matrices $\Sigma^{\mathsf{T}} \Sigma$ and $\Sigma \Sigma^{\mathsf{T}}$ are respectively $n \times n$ and $m \times m$ diagonal matrices with diagonal elements $\sigma_i^2$ and $0$. $\square$

the svd is something like a swiss army knife of linear algebra, matrix theory, and numerical linear algebra: you can do a lot with it

over the next few sections we will see that the singular value decomposition is a singularly powerful tool; once we have it, we could solve just about any problem involving matrices

- given $A \in \mathbb{C}^{m \times n}$ and $b \in \operatorname{im}(A) \subseteq \mathbb{C}^m$, find all solutions of $Ax = b$
- given $A \in \mathrm{GL}(n)$, find $A^{-1}$
- given $A \in \mathbb{C}^{m \times n}$, find $\|A\|_2$ and $\|A\|_F$
- given $A \in \mathbb{C}^{m \times n}$, find $\|A\|_{\sigma,p,k}$ for $p \in [0, \infty]$ and $k \in \mathbb{N}$
- given $A \in \mathbb{C}^{n \times n}$, find $|\det(A)|$
- given $A \in \mathbb{C}^{m \times n}$, find $A^\dagger$
- given $A \in \mathbb{C}^{m \times n}$ and $b \in \mathbb{C}^m$, find all solutions of $\min_{x \in \mathbb{C}^n} \|Ax - b\|_2$

the good news is that unlike the Jordan canonical form, the svd is actually computable

there are two main methods to compute it: Golub-Reinsch and Golub-Kahan; we will look at these briefly later, right now all you need to know is that you can call Matlab to give you the svd, both the full and compact versions

in all of the following we shall assume that we have the full svd $A = U \Sigma V^*$; furthermore $\sigma_1 \ge \sigma_2 \ge \dots \ge \sigma_r > 0$ are the nonzero singular values of $A$ and $r = \operatorname{rank}(A)$

3. solving linear systems

assume that we want to solve a linear system $Ax = b$ that is known to be consistent, i.e., $b \in \operatorname{im}(A)$

first we form the matrix-vector product $c = U^* b$

let $y = V^* x$
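then $Ax = b$ becomes $\Sigma y = c$

since $Ax = b$ is consistent, the system
\[
\begin{bmatrix} \sigma_1 & & & \\ & \ddots & & \\ & & \sigma_r & \\ & & & 0 \end{bmatrix}
\begin{bmatrix} y_1 \\ \vdots \\ y_r \\ y_{r+1} \\ \vdots \\ y_n \end{bmatrix}
=
\begin{bmatrix} c_1 \\ \vdots \\ c_r \\ c_{r+1} \\ \vdots \\ c_m \end{bmatrix}
\]
is consistent, which means that we must have $c_{r+1} = \dots = c_m = 0$, i.e.,
\[
c = \begin{bmatrix} c_1 \\ \vdots \\ c_r \\ 0 \\ \vdots \\ 0 \end{bmatrix}
\]

now let
\[
y = \begin{bmatrix} c_1/\sigma_1 \\ \vdots \\ c_r/\sigma_r \\ y_{r+1} \\ \vdots \\ y_n \end{bmatrix}
\]
where $y_{r+1}, \dots, y_n$ are free parameters

all solutions of $Ax = b$ are given by $x = Vy$ for some choices of $y_{r+1}, \dots, y_n$

to see this, observe that
\[
Ax = U \Sigma V^* V y = U \Sigma y
= U \begin{bmatrix} \sigma_1 & & & \\ & \ddots & & \\ & & \sigma_r & \\ & & & 0 \end{bmatrix}
\begin{bmatrix} c_1/\sigma_1 \\ \vdots \\ c_r/\sigma_r \\ y_{r+1} \\ \vdots \\ y_n \end{bmatrix}
= U \begin{bmatrix} c_1 \\ \vdots \\ c_r \\ 0 \\ \vdots \\ 0 \end{bmatrix}
= U c = U U^* b = b
\]

4. inverting nonsingular matrices

note that only square matrices could have inverses; when we say $A$ is nonsingular or invertible, the fact that $A$ is square is implied

by the way, the set of all nonsingular matrices in $\mathbb{C}^{n \times n}$ forms a group under matrix multiplication; it is called the general linear group and denoted
\[
\mathrm{GL}(n) := \{ A \in \mathbb{C}^{n \times n} : \det(A) \ne 0 \}
\]

if $A \in \mathbb{C}^{n \times n}$ is nonsingular, then $\operatorname{rank}(A) = n$ and so $\sigma_1, \dots, \sigma_n$ are all nonzero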
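so
\[
A^{-1} = (U \Sigma V^*)^{-1} = (V^*)^{-1} \Sigma^{-1} U^{-1} = V \Sigma^{-1} U^*
\]
where $\Sigma^{-1} = \operatorname{diag}(\sigma_1^{-1}, \dots, \sigma_n^{-1})$

5. computing matrix 2-norm and F-norm

recall the definition of the matrix 2-norm,
\[
\|A\|_2 = \max_{x \ne 0} \frac{\|Ax\|_2}{\|x\|_2},
\]
which is also called the spectral norm

we examine the expression
\[
\|Ax\|_2^2 = x^* A^* A x
\]

the matrix $A^*A$ is Hermitian and positive semidefinite, i.e., $x^* (A^*A) x \ge 0$ for all nonzero $x \in \mathbb{C}^n$

exercise: show that if a matrix $M$ is Hermitian positive semidefinite, then its evd and svd coincide

as such, $A^*A$ has svd given by
\[
A^*A = V \Sigma^{\mathsf{T}} \Sigma V^*
\]
where $V$ is a unitary matrix whose columns are the eigenvectors of $A^*A$, and $\Sigma^{\mathsf{T}} \Sigma$ is a diagonal matrix of the form
\[
\Sigma^{\mathsf{T}} \Sigma = \begin{bmatrix} \sigma_1^2 & & \\ & \ddots & \\ & & \sigma_n^2 \end{bmatrix}
\]
where each $\sigma_i^2$ is nonnegative and an eigenvalue of $A^*A$

these eigenvalues can be ordered such that
\[
\sigma_1^2 \ge \sigma_2^2 \ge \dots \ge \sigma_r^2 > 0, \qquad \sigma_{r+1}^2 = \dots = \sigma_n^2 = 0,
\]
where $r = \operatorname{rank}(A)$

let $x \in \mathbb{C}^n$ be nonzero and let $w = V^* x$, then we obtain
\[
\frac{\|Ax\|_2^2}{\|x\|_2^2} = \frac{x^* A^* A x}{x^* x}
= \frac{x^* V \Sigma^{\mathsf{T}} \Sigma V^* x}{x^* V V^* x}
= \frac{w^* \Sigma^{\mathsf{T}} \Sigma w}{w^* w}
= \frac{\sum_{i=1}^n \sigma_i^2 |w_i|^2}{\sum_{i=1}^n |w_i|^2} \le \sigma_1^2
\]

exercise: show that if $a_1, \dots, a_n$ are nonnegative numbers, then
\[
\max \frac{a_1 x_1 + \dots + a_n x_n}{x_1 + \dots + x_n} = \max(a_1, \dots, a_n)
\]
where the maximum is taken over $x_1, \dots, x_n \in [0, \infty)$ not all zero

it follows that
\[
\frac{\|Ax\|_2}{\|x\|_2} \le \sigma_1
\]
for all nonzero $x$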
since $V$ is a unitary matrix, it follows that there exists an $x$ such that
\[
w = V^* x = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix} = e_1,
\]
in which case
\[
x^* A^* A x = e_1^* \Sigma^{\mathsf{T}} \Sigma e_1 = \sigma_1^2
\]

in fact, this vector $x$ is the eigenvector of $A^*A$ corresponding to the eigenvalue $\sigma_1^2$

we conclude that $\|A\|_2 = \sigma_1$

note that we have also shown that $\|A\|_2 = \sqrt{\rho(A^*A)}$, where $\rho$ denotes the spectral radius, since the eigenvalues of $A^*A$ are simply the squares of the singular values of $A$ by Lemma 1

another way to arrive at this same conclusion is to use the fact in Homework 1 that the 2-norm of a vector is invariant under multiplication by a unitary matrix, i.e., if $Q^* Q = I$, then $\|x\|_2 = \|Qx\|_2$, from which it follows that
\[
\|A\|_2 = \|U \Sigma V^*\|_2 = \|\Sigma\|_2 = \sigma_1
\]

by the same problem in Homework 1, the Frobenius norm is also unitarily invariant; this yields an expression in terms of singular values
\[
\|A\|_F = \|U \Sigma V^*\|_F = \|\Sigma\|_F = \sqrt{\sigma_1^2 + \dots + \sigma_r^2}
\]
where $r = \operatorname{rank}(A)$

6. computing other matrix norms

the Schatten and Ky Fan norms can be expressed in terms of singular values

this is cheating a bit, because we will in fact define them in terms of singular values

for any $p \in [1, \infty)$, the Schatten $p$-norm of $A \in \mathbb{C}^{m \times n}$ is
\[
\|A\|_{\sigma,p} := \Biggl[ \sum_{i=1}^{\min(m,n)} \sigma_i(A)^p \Biggr]^{1/p}
\]
and for $p = \infty$, we have
\[
\|A\|_{\sigma,\infty} = \max\{\sigma_1(A), \dots, \sigma_{\min(m,n)}(A)\}
\]

of course, the sum will stop at $r = \operatorname{rank}(A)$ since $\sigma_{r+1}(A) = \dots = \sigma_{\min(m,n)}(A) = 0$

as usual, we also have
\[
\lim_{p \to \infty} \|A\|_{\sigma,p} = \|A\|_{\sigma,\infty} \tag{6.1}
\]
for all $A \in \mathbb{C}^{m \times n}$

for the special values of $p = 1, 2, \infty$, we have

$p = 1$: nuclear norm
\[
\|A\|_{\sigma,1} = \sigma_1(A) + \dots + \sigma_{\min(m,n)}(A) = \|A\|_*
\]

$p = 2$: Frobenius norm
\[
\|A\|_{\sigma,2} = \sqrt{\sigma_1(A)^2 + \dots + \sigma_{\min(m,n)}(A)^2} = \|A\|_F
\]
$p = \infty$: spectral norm
\[
\|A\|_{\sigma,\infty} = \max\{\sigma_1(A), \dots, \sigma_{\min(m,n)}(A)\} = \sigma_1(A) = \|A\|_2
\]

in fact one may also define the Schatten $p$-norm for values of $p \in [0, 1)$ by dropping the root $1/p$ outside the sum
\[
\|A\|_{\sigma,p} := \sum_{i=1}^{\min(m,n)} \sigma_i(A)^p
\]

these will not be norms in the usual sense of the word because they do not satisfy the triangle inequality, but they satisfy an analogue of (6.1)
\[
\lim_{p \to 0} \|A\|_{\sigma,p} = \operatorname{rank}(A)
\]

we may regard matrix rank as the Schatten $0$-norm since
\[
\|A\|_{\sigma,0} = \sigma_1(A)^0 + \dots + \sigma_{\min(m,n)}(A)^0 = \operatorname{rank}(A)
\]
provided we define $0^0 := 0$

Schatten $p$-norms for $p < 1$ are useful in the same way $l^p$-norms are useful for $p < 1$, namely, as continuous surrogates of matrix rank, which is a discontinuous function on $\mathbb{C}^{m \times n}$ (vector $l^p$-norms for $p < 1$ are often used as continuous surrogates of the $l^0$-norm, i.e., a count of the number of nonzero entries)

the nuclear norm, i.e., the Schatten 1-norm, being convex in addition to continuous, is the most popular surrogate for matrix rank

for any $p \in [1, \infty)$ and $k \in \mathbb{N}$, the Ky Fan $(p, k)$-norm of $A \in \mathbb{C}^{m \times n}$ is
\[
\|A\|_{\sigma,p,k} := \Biggl[ \sum_{i=1}^{k} \sigma_i(A)^p \Biggr]^{1/p}
\]

clearly for any $A \in \mathbb{C}^{m \times n}$,
\[
\|A\|_{\sigma,p,\min(m,n)} = \|A\|_{\sigma,p}
\]
and so Ky Fan norms generalize Schatten norms

7. computing magnitude of determinant

note that determinants are only defined for square matrices $A \in \mathbb{C}^{n \times n}$

recall the formula
\[
\det(A) = \prod_{i=1}^n \lambda_i(A) \tag{7.1}
\]
where $\lambda_i(A) \in \mathbb{C}$ is the $i$th eigenvalue of $A$

recall that every $n \times n$ matrix has exactly $n$ eigenvalues counted with multiplicity

by convention we usually order eigenvalues in decreasing order of magnitudes, i.e.,
\[
|\lambda_1(A)| \ge |\lambda_2(A)| \ge \dots \ge |\lambda_n(A)|
\]

now if we have only the singular values of $A$ we cannot get $\det(A)$ like we did with its eigenvalues in (7.1)

however we can get the magnitude of the determinant
\[
|\det(A)| = |\det(U \Sigma V^*)| = |\det(U) \det(\Sigma) \det(V^*)| = |\det(\Sigma)|
= \det \begin{bmatrix} \sigma_1 & & \\ & \ddots & \\ & & \sigma_n \end{bmatrix}
= \prod_{i=1}^n \sigma_i(A)
\]
since $|\det(U)| = |\det(V)| = 1$
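The norm and determinant identities of Sections 5 to 7 can be spot-checked on a random matrix. The following is a sketch under the assumption that NumPy is used; `schatten_norm` and `ky_fan_norm` are hypothetical helper names introduced here, and they cover only $p \ge 1$ (including $p = \infty$ for the Schatten norm).

```python
import numpy as np

def schatten_norm(A, p):
    """Schatten p-norm for p in [1, inf], computed from the singular values of A."""
    s = np.linalg.svd(A, compute_uv=False)
    if np.isinf(p):
        return s.max()
    return np.sum(s ** p) ** (1.0 / p)

def ky_fan_norm(A, p, k):
    """Ky Fan (p, k)-norm: uses only the k largest singular values of A."""
    s = np.linalg.svd(A, compute_uv=False)   # returned in decreasing order
    return np.sum(s[:k] ** p) ** (1.0 / p)

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
s = np.linalg.svd(A, compute_uv=False)

checks = {
    "spectral":  np.isclose(schatten_norm(A, np.inf), np.linalg.norm(A, 2)),
    "frobenius": np.isclose(schatten_norm(A, 2), np.linalg.norm(A, "fro")),
    "nuclear":   np.isclose(schatten_norm(A, 1), np.linalg.norm(A, "nuc")),
    "ky_fan":    np.isclose(ky_fan_norm(A, 3, min(A.shape)), schatten_norm(A, 3)),
    "abs_det":   np.isclose(abs(np.linalg.det(A)), np.prod(s)),
}
print(checks)
```

Each entry compares a quantity built from the singular values with the corresponding NumPy routine, so a failing entry would point to a mismatch with one of the formulas above.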
exercise: show that all eigenvalues of a unitary matrix $U$ must have absolute value $1$ and so $|\det(U)| = 1$ by (7.1)

8. computing pseudoinverse

as we mentioned above, only square matrices $A \in \mathbb{C}^{n \times n}$ may have an inverse, i.e., $X \in \mathbb{C}^{n \times n}$ such that $AX = XA = I$

if such an $X$ exists, we denote it by $A^{-1}$

exercise: show that $A^{-1}$, if it exists, must be unique

can we define something that behaves like an inverse (we will say exactly what this means when we discuss the minimum length least squares problem) for square matrices that are not invertible in the traditional sense?

more generally, can we define some kind of inverse for rectangular matrices?

such considerations lead us to the notion of pseudoinverse

the most famous one is the Moore-Penrose pseudoinverse

Theorem 2 (Moore-Penrose). For any $A \in \mathbb{C}^{m \times n}$, there exists an $X \in \mathbb{C}^{n \times m}$ satisfying
(i) $(AX)^* = AX$,
(ii) $(XA)^* = XA$,
(iii) $XAX = X$,
(iv) $AXA = A$.
Furthermore the $X$ satisfying these four conditions must be unique and is denoted by $X = A^\dagger$.

other types of pseudoinverse, sometimes also called generalized inverse, may be defined by choosing a subset of these four properties

the Moore-Penrose theorem is actually true over any field but for $\mathbb{C}$ (and also $\mathbb{R}$) we can say a lot more; in fact we can compute $A^\dagger$ explicitly using the svd

easy fact: if $A \in \mathbb{C}^{n \times n}$ is invertible, then
\[
A^{-1} = A^\dagger \tag{8.1}
\]
to show this, just see that all four properties in the theorem hold true if we plug in $X = A^{-1}$

another easy fact: if $D = \operatorname{diag}(d_1, \dots, d_{\min(m,n)}) \in \mathbb{C}^{m \times n}$ is diagonal in the sense that $d_{ij} = 0$ for all $i \ne j$, then
\[
D^\dagger = \operatorname{diag}(\delta_1, \dots, \delta_{\min(m,n)}) \in \mathbb{C}^{n \times m} \tag{8.2}
\]
where
\[
\delta_i = \begin{cases} 1/d_i & \text{if } d_i \ne 0, \\ 0 & \text{if } d_i = 0 \end{cases}
\]
to show this, just see that all four properties in the theorem hold true if we plug in $X = \operatorname{diag}(\delta_1, \dots, \delta_{\min(m,n)})$

in general
\[
(AB)^\dagger \ne B^\dagger A^\dagger, \qquad A A^\dagger \ne I, \qquad A^\dagger A \ne I
\]

we can get $A^\dagger$ in terms of the svd of $A \in \mathbb{C}^{m \times n}$: if $A = U \Sigma V^*$, then
\[
A^\dagger = V \Sigma^\dagger U^* \tag{8.3}
\]
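where
\[
\Sigma^\dagger = \begin{bmatrix} \sigma_1^{-1} & & & \\ & \ddots & & \\ & & \sigma_r^{-1} & \\ & & & 0 \end{bmatrix} \in \mathbb{C}^{n \times m}
\]

to see this, just verify that (8.3) satisfies the four properties in Theorem 2

we may also deduce from (8.3) that
\[
A^\dagger = (A^*A)^{-1} A^*
\]
if $A$ has full column rank ($r = n$) and that
\[
A^\dagger = A^* (AA^*)^{-1}
\]
if $A$ has full row rank ($r = m$)

9. solving least squares problems

given $A \in \mathbb{C}^{m \times n}$ and $b \in \mathbb{C}^m$, the least squares problem asks to find $x \in \mathbb{C}^n$ so that $\|b - Ax\|_2^2$ is minimized

note that $x$ minimizes $\|b - Ax\|_2^2$ iff it minimizes $\|b - Ax\|_2$, so whether we write the square or not doesn't really matter

using the svd of $A$ and the unitary invariance of the vector 2-norm from the homework, we can simplify this minimization problem as follows
\begin{align*}
\|b - Ax\|_2^2 &= \|b - U \Sigma V^* x\|_2^2 = \|U^* b - \Sigma V^* x\|_2^2 = \|c - \Sigma y\|_2^2 \\
&= |c_1 - \sigma_1 y_1|^2 + \dots + |c_r - \sigma_r y_r|^2 + |c_{r+1}|^2 + \dots + |c_m|^2
\end{align*}
where $c = U^* b$ and $y = V^* x$

so we see that in order to minimize $\|Ax - b\|_2$, we must set $y_i = c_i/\sigma_i$ for $i = 1, \dots, r$

the unknowns $y_i$, for $i = r+1, \dots, n$, can have any value, since they do not affect the value of $\|c - \Sigma y\|_2$

so all the least squares solutions are of the form
\[
x = Vy = V \begin{bmatrix} c_1/\sigma_1 \\ \vdots \\ c_r/\sigma_r \\ y_{r+1} \\ \vdots \\ y_n \end{bmatrix}
\]
where $y_{r+1}, \dots, y_n \in \mathbb{C}$ are arbitrary

our analysis above also shows that the minimum value of $\|b - Ax\|_2^2$ is given by
\[
\min_{x \in \mathbb{C}^n} \|b - Ax\|_2^2 = |c_{r+1}|^2 + \dots + |c_m|^2
\]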
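The recipe above (form $c = U^* b$, set $y_i = c_i/\sigma_i$ for $i \le r$, choose the remaining $y_i$ freely) can be sketched in code. This is an illustrative sketch assuming NumPy; the function name `least_squares_solution`, its `free` argument, and the relative tolerance used to decide the numerical rank are all hypothetical choices, not part of the notes.

```python
import numpy as np

def least_squares_solution(A, b, free=None, tol=1e-12):
    """One least squares solution of min ||b - Ax||_2 via the full SVD.

    free: optional values for the free parameters y_{r+1}, ..., y_n (default: zeros).
    Returns (x, minimum squared residual). Hypothetical helper for illustration.
    """
    m, n = A.shape
    U, s, Vh = np.linalg.svd(A)                 # full SVD: U is m x m, Vh is n x n
    r = int(np.sum(s > tol * s[0]))             # numerical rank (tolerance is an assumption)
    c = U.conj().T @ b
    y = np.zeros(n, dtype=complex)
    y[:r] = c[:r] / s[:r]                       # forced components y_i = c_i / sigma_i
    if free is not None:
        y[r:] = free                            # arbitrary components y_{r+1}, ..., y_n
    x = Vh.conj().T @ y                         # x = V y
    return x, np.sum(np.abs(c[r:]) ** 2)        # |c_{r+1}|^2 + ... + |c_m|^2

rng = np.random.default_rng(2)
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))   # rank 2, so n - r = 2
b = rng.standard_normal(6)
x0, rmin = least_squares_solution(A, b)
x1, _ = least_squares_solution(A, b, free=[1.0, -2.0])
print(np.linalg.norm(b - A @ x0) ** 2, np.linalg.norm(b - A @ x1) ** 2, rmin)
```

Both choices of the free parameters should give the same residual, which is the point of the derivation: the residual depends only on $c_{r+1}, \dots, c_m$.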
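10. solving minimum length least squares problems

one of the best-known applications of the svd is that it can be used to obtain the solution to the problem
\[
\|b - Ax\|_2 = \min, \qquad \|x\|_2 = \min
\]

alternatively, we can write
\[
\min \Bigl\{ \|x\|_2 : x \in \operatorname*{argmin}_{x \in \mathbb{C}^n} \|b - Ax\|_2 \Bigr\}
\]
or using the normal equations
\[
\min \{ \|x\|_2 : A^*A x = A^* b \}
\]

notation: for $f : X \to \mathbb{R}$ and $S \subseteq X$, write
\[
\operatorname*{argmin}_{x \in S} f(x) \quad \text{or} \quad \operatorname{argmin}\{ f(x) : x \in S \}
\]
for the set of minimizers, i.e.,
\[
\operatorname*{argmin}_{x \in S} f(x) = \Bigl\{ x_* \in S : f(x_*) = \min_{x \in S} f(x) \Bigr\}
\]

we often write (sloppily) $x_* = \operatorname{argmin} f(x)$ or $\operatorname{argmin} f(x) = x_*$ to mean that $x_*$ is a minimizer of $f$ over $S$, even though the proper notation ought to have been $x_* \in \operatorname{argmin} f(x)$

recall from the last section that if $A$ does not have full column rank ($r < n$), then there are infinitely many solutions to the least squares problem, because the residual does not depend on $y_{r+1}, \dots, y_n$ and these are thus free parameters that we may freely choose:
\[
x = Vy = V \begin{bmatrix} c_1/\sigma_1 \\ \vdots \\ c_r/\sigma_r \\ y_{r+1} \\ \vdots \\ y_n \end{bmatrix} \tag{10.1}
\]
where $c = U^* b$

we claim that the unique solution with minimum 2-norm is given by setting $y_{r+1} = \dots = y_n = 0$, i.e.,
\[
x^+ = V \begin{bmatrix} c_1/\sigma_1 \\ \vdots \\ c_r/\sigma_r \\ 0 \\ \vdots \\ 0 \end{bmatrix} \tag{10.2}
\]
where $c = U^* b$

this follows because $V$ is unitary, so that
\[
\|x^+\|_2 = \sqrt{ \left| \frac{c_1}{\sigma_1} \right|^2 + \dots + \left| \frac{c_r}{\sigma_r} \right|^2 }
\le \sqrt{ \left| \frac{c_1}{\sigma_1} \right|^2 + \dots + \left| \frac{c_r}{\sigma_r} \right|^2 + |y_{r+1}|^2 + \dots + |y_n|^2 } = \|x\|_2
\]
whatever our choice of $y_{r+1}, \dots, y_n$ in (10.1), with equality only when $y_{r+1} = \dots = y_n = 0$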
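we may also write (10.2) in terms of the Moore-Penrose pseudoinverse: since
\[
\Sigma^\dagger = \begin{bmatrix} \sigma_1^{-1} & & & \\ & \ddots & & \\ & & \sigma_r^{-1} & \\ & & & 0 \end{bmatrix}
\]
has precisely the right form,
\[
x^+ = V \Sigma^\dagger c = V \Sigma^\dagger U^* b = A^\dagger b
\]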
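The formula $x^+ = V \Sigma^\dagger U^* b = A^\dagger b$ can be checked against the four Moore-Penrose conditions. The following is a minimal sketch assuming NumPy; `pinv_via_svd` is a hypothetical helper, it uses the condensed svd (which gives the same product as the full one), and the relative tolerance for treating a singular value as zero is an assumption.

```python
import numpy as np

def pinv_via_svd(A, tol=1e-12):
    """Moore-Penrose pseudoinverse V Sigma^+ U* built from the condensed SVD of A."""
    U, s, Vh = np.linalg.svd(A, full_matrices=False)
    s_inv = np.zeros_like(s)
    nonzero = s > tol * s.max()              # treat tiny singular values as zero (assumption)
    s_inv[nonzero] = 1.0 / s[nonzero]        # delta_i = 1/sigma_i if sigma_i != 0, else 0
    return (Vh.conj().T * s_inv) @ U.conj().T   # scales the columns of V, then applies U*

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 2)) @ rng.standard_normal((2, 4))   # rank-deficient on purpose
b = rng.standard_normal(6)
Ap = pinv_via_svd(A)
x_plus = Ap @ b                              # minimum length least squares solution

penrose = [
    np.allclose((A @ Ap).conj().T, A @ Ap),  # (i)
    np.allclose((Ap @ A).conj().T, Ap @ A),  # (ii)
    np.allclose(Ap @ A @ Ap, Ap),            # (iii)
    np.allclose(A @ Ap @ A, A),              # (iv)
]
print(penrose)
print(np.allclose(A.T @ A @ x_plus, A.T @ b))   # x_plus satisfies the normal equations
```

The last line checks that $x^+$ is a least squares solution via the normal equations; its minimality in norm is the inequality shown above.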