(VII.E) The Singular Value Decomposition (SVD)

In this section we describe a generalization of the Spectral Theorem to non-normal operators, and even to transformations between different vector spaces. This is a computationally useful generalization, with applications to data fitting and data compression in particular. It is used in many branches of science, in statistics, and even in machine learning. As we develop the underlying mathematics, we will encounter two closely associated constructions, the polar decomposition and the pseudoinverse, in each case presenting the version for operators followed by that for matrices.

To begin, let (V, ⟨ , ⟩) be an n-dimensional inner-product space and T : V → V an operator. By Spectral Theorem III, if T is normal then it has a unitary eigenbasis B, with complex eigenvalues σ₁, ..., σₙ. In particular, T is unitary iff all |σᵢ| = 1 (since preserving the lengths of a unitary eigenbasis is equivalent to preserving the lengths of all vectors), and self-adjoint iff the σᵢ are real. So T is both unitary and self-adjoint iff T has only +1 and −1 as eigenvalues, which is to say T is an orthogonal reflection, acting on V = W ⊕ W⊥ by Id_W on W and −Id_{W⊥} on W⊥. But the broader point here is that we should have in mind the following "table of analogies":

    operators       numbers
    ---------       -------
    normal          ℂ
    unitary         S¹
    self-adjoint    ℝ
    ??              ℝ≥0

where S¹ ⊂ ℂ is the unit circle. So, how do we complete it?
DEFINITION 1. T is positive (resp. strictly positive) if it is normal, with all eigenvalues in ℝ≥0 (resp. ℝ>0).

Since they are unitarily diagonalizable with real eigenvalues, positive operators are self-adjoint. Given a self-adjoint (or normal) operator T, any v ∈ V is a sum of orthogonal eigenvectors wⱼ with eigenvalues σⱼ; so

    ⟨v, Tv⟩ = Σ_{i,j} ⟨wᵢ, Twⱼ⟩ = Σ_{i,j} σⱼ⟨wᵢ, wⱼ⟩ = Σⱼ σⱼ‖wⱼ‖²

is in ℝ≥0 for all v if and only if T is positive. If V is complex, the condition ⟨v, Tv⟩ ≥ 0 (∀v) on its own gives self-adjointness (Exercise VII.D.7), hence positivity, of T. (If V is real, this condition alone isn't enough.)

EXAMPLE 2. Orthogonal projections are positive, since they are self-adjoint with eigenvalues 0 and 1.

Polar decomposition. A nonzero complex number can be written (in "polar form") as a product re^{iθ}, for unique r ∈ ℝ>0 and e^{iθ} ∈ S¹. If T is a normal operator, then it has a unitary eigenbasis B = {v̂₁, ..., v̂ₙ}, with [T]_B = diag{r₁e^{iθ₁}, ..., rₙe^{iθₙ}} (rᵢ ∈ ℝ≥0); and defining |T|, U by [|T|]_B = diag{r₁, ..., rₙ} and [U]_B = diag{e^{iθ₁}, ..., e^{iθₙ}}, we have T = U|T| with U unitary and |T| positive, and U|T| = |T|U.

One might expect such polar decompositions of operators to stop there. But there exists an analogue for arbitrary transformations, one that will lead on to the SVD as well! To formulate this general polar decomposition, let T : V → V be an arbitrary linear transformation. Notice that T*T is self-adjoint² and, in fact, positive:

    ⟨v, T*Tv⟩ = ⟨Tv, Tv⟩ ≥ 0,

since ⟨ , ⟩ is an inner product. So we have a unitary basis B with [T*T]_B = diag{λ₁, ..., λₙ}, and all λᵢ ∈ ℝ≥0. (Henceforth we impose the convention that λ₁ ≥ λ₂ ≥ ... ≥ λₙ.) Taking nonnegative square

²trivially so: (T*T)* = T*T** = T*T.
(VII.E) THE SINGULAR VALUE DECOMPOSITION (SVD) 3 roots µ i = λ i R 0, we define a new operator T by [ T ] B = diag{µ,..., µ n }; clearly T = T T, and T is positive. THEOREM 3. There exists a unitary operator U such that T = U T. This polar decomposition is unique exactly when T is invertible. Moreover, U and T commute exactly when T is normal. PROOF. We have T v = T v, T v = v, T T v = v, T v = T v, T v = T v since T is self-adjoint. Consequently T and T have the same kernel, hence (by Rank + Nullity) the same rank; so while their images (as subspaces of V) may differ, they have the same dimension. By Spectral Theorem I, V = im T ker T. So dim(ker T ) = dim((imt) ), and we fix an invertible transformation U : ker T (imt) sending some choice of unitary basis to another unitary basis. Writing v = T w + v (with T v = 0), we now define U : im T ker T imt (imt) }{{}}{{} V V by U v := T w + U v. Since T w + U v = T w + U v = T w + v = v, U is unitary; and taking v = T w gives U T w = T w ( w V). Now if T is invertible, then so is T (all µ i > 0) = U = T T. If it isn t, then ker T = {0} and non-uniqueness enters in the choice of U. Finally, if U and T commute, then they simultaneously unitarily diagonalize, and therefore so does T, making T normal. If T is normal, we have constructed U and T above, and they commute.
EXAMPLE 4. Consider V = ℝ⁴ with the dot product, and

    A = [T]_ê = [ 2 0 0 0 ]
                [ 0 1 0 0 ]
                [ 0 1 1 0 ]
                [ 0 0 0 0 ].

Since A has a Jordan block, we know it is not diagonalizable; and so T is not normal. However,

    [T*T]_ê = ᵗAA = [ 4 0 0 0 ]
                    [ 0 2 1 0 ]  = S_B D S_B⁻¹
                    [ 0 1 1 0 ]
                    [ 0 0 0 0 ]

with B := {ê₁, ê₂ + φ₋ê₃, ê₂ − φ₊ê₃, ê₄} and D := diag{4, η₊, η₋, 0} (where φ± := (±1+√5)/2 and η± := (3±√5)/2 = φ±²). Taking the positive square root √D := diag{2, φ₊, φ₋, 0}, we get

    [|T|]_ê = |A| := S_B √D S_B⁻¹ = [ 2    0     0     0 ]
                                    [ 0  3/√5  1/√5   0 ]
                                    [ 0  1/√5  2/√5   0 ]
                                    [ 0    0     0     0 ].

On W := span{ê₁, ê₂, ê₃} = im T = im|T|, we must have U|_W = T|_W ∘ (|T||_W)⁻¹; while on span{ê₄} = W⊥, U can be taken to act by any scalar of norm 1. Therefore

    [U]_ê = Q := [ 1    0      0     0 ]
                 [ 0  2/√5  −1/√5   0 ]
                 [ 0  1/√5   2/√5   0 ]
                 [ 0    0      0   e^{iθ} ],

and A = Q|A|. Geometrically, |A| dilates the B-coordinates of a vector, then Q performs a rotation.
SVD for endomorphisms. Now let |T| be the (unique) positive square root of T*T as above, and B = {v̂₁, ..., v̂ₙ} the unitary basis of V under which [|T|]_B = diag{μ₁, ..., μₙ}.

DEFINITION 5. These {μᵢ} are called the singular values of T.

Writing T = U|T| and ŵᵢ := Uv̂ᵢ,

    U*U = Id_V ⟹ ⟨ŵᵢ, ŵⱼ⟩ = ⟨U*Uv̂ᵢ, v̂ⱼ⟩ = ⟨v̂ᵢ, v̂ⱼ⟩ = δᵢⱼ ⟹ C := {ŵ₁, ..., ŵₙ} = U(B) is unitary.

For v ∈ V, we have

(1)    Tv = U|T|( Σᵢ₌₁ⁿ ⟨v̂ᵢ, v⟩v̂ᵢ ) = Σᵢ₌₁ⁿ ⟨v̂ᵢ, v⟩ U|T|v̂ᵢ = Σᵢ₌₁ⁿ μᵢ⟨v̂ᵢ, v⟩ Uv̂ᵢ = Σᵢ₌₁ⁿ μᵢ⟨v̂ᵢ, v⟩ ŵᵢ.

Viewing ⟨v̂ᵢ, ·⟩ =: ℓ_{v̂ᵢ} : V → ℂ as a linear functional, and ŵᵢ : ℂ → V as a map (sending α ∈ ℂ to αŵᵢ), (1) becomes

(2)    T = Σᵢ₌₁ⁿ μᵢ ŵᵢ ∘ ℓ_{v̂ᵢ}.

How does this look in matrix terms? Let A be an arbitrary³ unitary basis of V, and set A := [T]_A. Using {1} as a basis of ℂ, (2) yields

(3)    A = Σᵢ₌₁ⁿ μᵢ · _A[ŵᵢ]_{1} · _{1}[ℓ_{v̂ᵢ}]_A = Σᵢ₌₁ⁿ μᵢ [ŵᵢ]_A ([v̂ᵢ]_A)*.

Here [ŵᵢ]_A([v̂ᵢ]_A)* is an (n×1)(1×n) matrix product, yielding an n×n matrix of rank one. So (2) (resp. (3)) decomposes T (resp. A) into a sum of rank-one endomorphisms (resp. matrices) weighted by the singular values. This is a first version of the Singular Value Decomposition. (The reader may of course replace "unitary" and ℂ by "orthonormal"/"orthogonal" and ℝ in what follows.)

³What you should actually have in mind here, in the case where V = ℂⁿ with dot product, is the standard basis ê.
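The rank-one decomposition (3) can be checked numerically: NumPy's `svd` returns exactly the μᵢ, the ŵᵢ (columns of one unitary matrix), and the v̂ᵢ* (rows of the other), and summing the weighted outer products recovers A. A small sketch, with a random sample matrix of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))

# A = sum_i mu_i * w_i v_i^*, as in (3):
P, mu, Vh = np.linalg.svd(A)   # columns of P are the w_i, rows of Vh the v_i^*
A_sum = sum(mu[i] * np.outer(P[:, i], Vh[i, :]) for i in range(4))
assert np.allclose(A_sum, A)

# each summand is an n x n matrix of rank one
assert np.linalg.matrix_rank(np.outer(P[:, 0], Vh[0, :])) == 1
```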
REMARK 6. If T is positive, we can take U = Id_V, so that ŵᵢ ∘ ℓ_{v̂ᵢ} = v̂ᵢ ∘ ℓ_{v̂ᵢ} =: Pr_{v̂ᵢ} is the orthogonal projection onto span{v̂ᵢ}. Thus (2) merely restates Spectral Theorem I in this case. In fact, if T is normal, then T, T*T, and |T| simultaneously diagonalize. If μ̃₁, ..., μ̃ₙ are the eigenvalues of T, then we evidently have μᵢ = |μ̃ᵢ|; U may then be defined by [U]_B = diag{μ̃ᵢ/|μ̃ᵢ|} (with 0/0 interpreted as 1). Thus (2) reads T = Σᵢ₌₁ⁿ μ̃ᵢ Pr_{v̂ᵢ}, which restates Spectral Theorem III. So in this sense, the SVD generalizes the results in §VII.D.

Now since U sends B to C, we have

    _C[T]_B = _C[U]_B · [|T|]_B = Iₙ · diag{μ₁, ..., μₙ} =: D

(the factor _C[U]_B being Iₙ), and thus

(4)    A = [T]_A = _A[I]_C · _C[T]_B · _B[I]_A =: S₂DS₁,

with S₁ and S₂ unitary matrices (as they change between unitary bases). Roughly speaking, (4) presents A as a composition of the form "rotation⁴, dilation, rotation."

THEOREM 7. Any A ∈ Mₙ(ℂ) can be presented as a product S₂DS₁, where D is a diagonal matrix with entries in ℝ≥0, and SᵢSᵢ* = Iₙ. The diagonal entries of D, called the singular values of A, are unique; and if these are all distinct, nonzero, and in decreasing order, then (S₁, S₂) are unique modulo rescaling (S₁, S₂) by a unitary diagonal matrix.⁵

PROOF. Existence was deduced above. We have D* = D, hence

    A*A = S₁*D S₂*S₂ D S₁ = S₁*D²S₁,

⁴More precisely, "rotation" here can mean a complicated sequence of rotations and reflections.
⁵If A ∈ Mₙ(ℝ), then the above construction shows that S₁ and S₂ can be taken to be real (orthogonal); and such Sᵢ are unique modulo rescaling by a real diagonal unitary matrix, i.e. one with diagonal entries ±1.
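The uniqueness argument in the proof of Theorem 7 — that A*A = S₁*D²S₁ forces D² to consist of the eigenvalues of A*A — can be observed directly in NumPy; the random complex matrix below is our own stand-in for A:

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 3)) + 1j * rng.standard_normal((3, 3))

mu = np.linalg.svd(A, compute_uv=False)           # singular values, decreasing
lam = np.linalg.eigvalsh(A.conj().T @ A)[::-1]    # eigenvalues of A*A, decreasing

# singular values of A = nonnegative square roots of the eigenvalues of A*A
assert np.allclose(mu, np.sqrt(np.maximum(lam, 0.0)))
```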
which is a unitary⁶ diagonalization of the Hermitian matrix A*A. So S₁, hence D, is unique; and if the diagonal entries are distinct, then the unitary eigenbasis (= columns of S₁*) is determined up to e^{iθ} scaling factors. If no singular values are 0, D is invertible and S₂ = AS₁*D⁻¹. ∎

EXAMPLE 8. Let's look at the operator T = d/dt : P₂(ℝ) → P₂(ℝ) with ⟨f, g⟩ := ∫₋₁¹ f(t)g(t)dt. This is a non-normal, nilpotent transformation. We compute its singular values — which are not all zero, even though⁷ 0 is the only eigenvalue of T — and the bases B and C. To compute T* we will need an o.n. basis A: Gram–Schmidt on {1, t, t²} yields A := {1/√2, √(3/2)·t, √(5/8)·(3t² − 1)} and

    [T]_A = [ 0 √3  0  ]        [T*]_A = ᵗ[T]_A = [ 0   0  0 ]
            [ 0  0 √15 ]   ⟹                      [ √3  0  0 ]
            [ 0  0  0  ]                          [ 0 √15  0 ]

    ⟹ [T*T]_A = diag{0, 3, 15}  ⟹  [|T|]_A = diag{0, √3, √15}

    ⟹ singular values are √15, √3, 0, and B = { √(5/8)·(3t² − 1), √(3/2)·t, 1/√2 } =: {v̂₁, v̂₂, v̂₃}.

Now im(|T|) = span{v̂₁, v̂₂}, ker(|T|) = span{v̂₃} = ker(T), im(T) = span{v̂₂, v̂₃}, and im(T)⊥ = span{v̂₁}. So we define

    U : span{v̂₁, v̂₂} ⊕ span{v̂₃} → span{v̂₂, v̂₃} ⊕ span{v̂₁}

⁶in the dot product
⁷Caution: if T is diagonalizable, it may not be unitarily diagonalizable, and it's only in the latter case that T's singular values are the absolute values of T's eigenvalues.
by sending

    v̂₁ ↦ T((1/√15)v̂₁)/‖T((1/√15)v̂₁)‖ = v̂₂,    v̂₂ ↦ T((1/√3)v̂₂)/‖T((1/√3)v̂₂)‖ = v̂₃,

and v̂₃ ↦ v̂₁. We then have

    [U]_A = [ 0 1 0 ]
            [ 0 0 1 ]
            [ 1 0 0 ],

and C = U(B) = {v̂₂, v̂₃, v̂₁}. The⁸ SVD of A = [T]_A now reads (noting that A = {v̂₃, v̂₂, v̂₁})

    A = S₂DS₁ = [ 0 1 0 ] [ √15  0  0 ] [ 0 0 1 ]
                [ 1 0 0 ] [  0  √3  0 ] [ 0 1 0 ]
                [ 0 0 1 ] [  0   0  0 ] [ 1 0 0 ].

We now turn to the more general form of the Singular Value Decomposition.

SVD for transformations. Let T : V → W be an arbitrary linear transformation between finite-dimensional inner-product spaces, and fix unitary bases A resp. A′ of V resp. W. Write n = dim V, m = dim W.

LEMMA 9. The transformation T* : W → V defined by _A[T*]_{A′} := (_{A′}[T]_A)* is the unique transformation satisfying

(5)    ⟨T*w, v⟩ = ⟨w, Tv⟩ for all v ∈ V, w ∈ W.

We call it the adjoint of T.

PROOF. Since A, A′ are unitary, (5) is equivalent to

    ([T*w]_A)* [v]_A = ([w]_{A′})* [Tv]_{A′}

and thus to

    ([w]_{A′})* (_A[T*]_{A′})* [v]_A = ([w]_{A′})* _{A′}[T]_A [v]_A. ∎

⁸Of course, even over ℝ it isn't quite unique: one can play with the signs in S₁ and S₂.
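Example 8 is easy to sanity-check numerically: feeding the matrix [T]_A computed above to NumPy's SVD routine returns the singular values √15, √3, 0, even though the only eigenvalue of this nilpotent matrix is 0. A quick check (the check itself is ours, the matrix is the one from the example):

```python
import numpy as np

# [T]_A for T = d/dt on P_2(R), in the orthonormal Legendre-type basis above
T = np.array([[0., np.sqrt(3.), 0.],
              [0., 0., np.sqrt(15.)],
              [0., 0., 0.]])

mu = np.linalg.svd(T, compute_uv=False)
assert np.allclose(mu, [np.sqrt(15.), np.sqrt(3.), 0.])

# T is nilpotent, so every eigenvalue is 0 -- yet two singular values are not
assert np.allclose(np.linalg.eigvals(T), 0.)
```

This illustrates the caution in footnote 7: for a non-normal operator the singular values need not be absolute values of eigenvalues.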
LEMMA 10. We have V = ker(T) ⊕⊥ im(T*) and W = im(T) ⊕⊥ ker(T*), with T (resp. T*) restricting to an isomorphism from im(T*) → im(T) (resp. im(T) → im(T*)).

PROOF. By Lemma 9, T and T* have the same rank r, so Rank + Nullity gives

    dim(ker T) + dim(im T*) = (n − r) + r = n = dim V,
    dim(im T) + dim(ker T*) = r + (m − r) = m = dim W.

For v ∈ ker(T), ⟨v, T*w⟩ = ⟨Tv, w⟩ = ⟨0, w⟩ = 0 ⟹ ker(T) ⊥ im(T*) ⟹ ker(T) ∩ im(T*) = {0} ⟹ T|_{im(T*)} is injective. This proves the assertions about V and T; those for W and T* follow by symmetry. ∎

The compositions T*T : V → V and TT* : W → W are clearly positive (argue as above), and preserve the orthogonal decompositions in Lemma 10. So T*T (resp. TT*) is an automorphism of im(T*) (resp. im(T)) and the zero map on ker(T) (resp. ker(T*)). The same remarks apply to the positive operator √(T*T) (resp. √(TT*)), and there exists a unitary basis B = {v̂₁, ..., v̂ₙ} of V with

    [√(T*T)]_B = diag{μ₁, ..., μᵣ, 0, ..., 0}   and   μ₁ ≥ μ₂ ≥ ... ≥ μᵣ > 0.

DEFINITION 11. These {μᵢ} are called the (nonzero) singular values of T.

In particular, {v̂₁, ..., v̂ᵣ} ⊂ im(T*) and {v̂ᵣ₊₁, ..., v̂ₙ} ⊂ ker(T); and we define {ŵ₁, ..., ŵᵣ} ⊂ im(T) by ŵᵢ := (1/μᵢ)Tv̂ᵢ. Since

    ⟨ŵᵢ, ŵⱼ⟩ = (1/(μᵢμⱼ))⟨Tv̂ᵢ, Tv̂ⱼ⟩ = (1/(μᵢμⱼ))⟨T*Tv̂ᵢ, v̂ⱼ⟩ = (μᵢ²/(μᵢμⱼ))⟨v̂ᵢ, v̂ⱼ⟩ = (μᵢ/μⱼ)δᵢⱼ = δᵢⱼ,

this gives a unitary basis of im(T), which we complete to a unitary basis C ⊂ W by choosing {ŵᵣ₊₁, ..., ŵₘ} ⊂ ker(T*). This proves
THEOREM 12 (Abstract SVD). There exist unitary bases B ⊂ V, C ⊂ W such that⁹

(6)    _C[T]_B = diag_{m×n}{μ₁, ..., μᵣ, 0, ..., 0}

with μ₁ ≥ ... ≥ μᵣ > 0, and r = rank(T).

REMARK 13. (i) C is an eigenbasis for TT*, and its nonzero eigenvalues are also the {μᵢ²}ᵢ₌₁ʳ:

    TT*ŵᵢ = (1/μᵢ)TT*Tv̂ᵢ = (1/μᵢ)T(μᵢ²v̂ᵢ) = μᵢ²ŵᵢ.

(ii) If B′ ⊂ V, C′ ⊂ W are unitary bases such that _{C′}[T]_{B′} = diag_{m×n}{μ′₁, ..., μ′ₛ, 0, ..., 0} in decreasing order, then s = r and μ′ᵢ = μᵢ (∀i). This is because

    _{B′}[T*T]_{B′} = _{B′}[T*]_{C′} · _{C′}[T]_{B′} = (_{C′}[T]_{B′})* · _{C′}[T]_{B′} = diag_{n×n}{(μ′₁)², ..., (μ′ᵣ)², 0, ..., 0},

so that the {(μ′ᵢ)²} are the eigenvalues of T*T. So the (nonzero) singular values are the unique positive numbers that can appear in (6). (iii) Moreover, if the μᵢ are distinct and T is injective (resp. surjective), the elements of B′ (resp. C′) are e^{iθ}-multiples of the elements of B (resp. C).

DEFINITION 14. The pseudoinverse of T is the transformation T† : W → V given by 0 on ker(T*) and by inverting T (i.e., by (T|_{im(T*)})⁻¹) on im(T).

COROLLARY 15. _B[T†]_C = diag_{n×m}{μ₁⁻¹, ..., μᵣ⁻¹, 0, ..., 0}. Note that if T is invertible, then T† = T⁻¹.

COROLLARY 16. T†T : V → V is the orthogonal projection Pr_{im(T*)}, and TT† : W → W is Pr_{im(T)}.

SVD for m×n matrices. The matrix version of Theorem 12 is one of the most important results in linear algebra. Its efficient computer implementation for large matrices has been the subject of countless articles in numerical analysis. While we won't say anything about these algorithms, they are more efficient (and accurate) than the orthogonal diagonalization of A*A, which remains our algorithm of choice here.

⁹The notation means an m×n matrix whose entries are zero except for the (j, j)th entries μⱼ for j ≤ r (≤ m, n).
THEOREM 17 (Matrix SVD). Let A be an arbitrary m×n matrix of rank r, with complex entries. Then there exist unitary matrices P ∈ Mₘ(ℂ), Q ∈ Mₙ(ℂ) and real numbers μ₁ ≥ ... ≥ μᵣ > 0, such that

(7)    A = PΣQ* = ( p̂₁ | ⋯ | p̂ₘ ) · diag_{m×n}{μ₁, ..., μᵣ, 0, ..., 0} · ( q̂₁ | ⋯ | q̂ₙ )*

(and A* = QᵗΣP*). The map from ℂⁿ → ℂᵐ induced by A decomposes as an orthogonal direct sum under the dot product

    V_col(A*) ⊕ {v | Av = 0}  --A-->  V_col(A) ⊕ {w | A*w = 0}
    = span{q̂₁, ..., q̂ᵣ} ⊕ span{q̂ᵣ₊₁, ..., q̂ₙ}  →  span{p̂₁, ..., p̂ᵣ} ⊕ span{p̂ᵣ₊₁, ..., p̂ₘ},

given by q̂ⱼ ↦ μⱼp̂ⱼ (j = 1, ..., r) on the first factor and zero on the second factor. (The reverse map A* sends p̂ⱼ ↦ μⱼq̂ⱼ.) The nonzero singular values μⱼ are the positive square roots of the nonzero eigenvalues of A*A (resp. AA*), for which the columns of Q (resp. P) give unitary eigenbases.

PROOF. Apply Theorem 12 to the setting: V = ℂⁿ with ⟨ , ⟩ = dot product and A = ê; W = ℂᵐ with ⟨ , ⟩ = dot product and A′ = ê; T : V → W such that A = _{A′}[T]_A. Then _{A′}[T]_A = _{A′}[Id_W]_C · _C[T]_B · _B[Id_V]_A is A = S_C Σ S_B*, with P = S_C and Q = S_B. The remaining details are immediate from the abstract analysis. ∎

DEFINITION 18. The pseudoinverse of A is the n×m matrix A† := _A[T†]_{A′} = QΣ†P*, where Σ† = diag_{n×m}{μ₁⁻¹, ..., μᵣ⁻¹, 0, ..., 0}.

COROLLARY 19. A†A = [Pr_{V_col(A*)}]_ê and AA† = [Pr_{V_col(A)}]_ê are matrices of orthogonal projections (under the dot product).

REMARK 20. In all of this, if A is real then A* = ᵗA ⟹ V_col(A*) = V_row(A). In this case, we can take P, Q orthogonal, and A = PΣᵗQ with the {q̂ᵢ} appearing as the rows of ᵗQ.
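In NumPy, Theorem 17 and Definition 18 translate almost verbatim; the sketch below (our own, with a random sample matrix) builds Σ and Σ† explicitly and checks (7), the eigenbasis property, and that QΣ†P* agrees with the library's pseudoinverse:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))                      # an arbitrary m x n matrix

P, mu, Qh = np.linalg.svd(A, full_matrices=True)     # P: 5x5, mu: (3,), Qh = Q*: 3x3
Sigma = np.zeros((5, 3))
Sigma[:3, :3] = np.diag(mu)                          # diag_{m x n}{mu_1, ..., mu_r}
assert np.allclose(P @ Sigma @ Qh, A)                # (7): A = P Sigma Q*

# columns of Q form a unitary eigenbasis for A*A, eigenvalues mu_i^2
assert np.allclose(A.T @ A @ Qh.T, Qh.T @ np.diag(mu**2))

# pseudoinverse per Definition 18 (here r = 3, all mu_i > 0)
Sigma_dag = np.zeros((3, 5))
Sigma_dag[:3, :3] = np.diag(1.0 / mu)
A_dag = Qh.T @ Sigma_dag @ P.T                       # A^† = Q Sigma^† P*
assert np.allclose(A_dag, np.linalg.pinv(A))
```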
EXAMPLE 21. Consider the rank-1 matrix

    A = [ 1 1 ]
        [ 1 1 ]
        [ 2 2 ].

We have

    ᵗAA = [ 6 6 ] = Q diag{12, 0} ᵗQ   with   Q = (1/√2)[ 1  1 ]
          [ 6 6 ]                                        [ 1 −1 ],

so μ₁ = √12 = 2√3. Now p̂₁ = (1/μ₁)Aq̂₁ = (1/√6)·ᵗ(1, 1, 2), while p̂₂, p̂₃ have to be an o.n. basis of ker(ᵗA). So

    A = [ 1/√6   1/√2   1/√3 ] [ 2√3 0 ]
        [ 1/√6  −1/√2   1/√3 ] [  0  0 ] · (1/√2)[ 1  1 ]
        [ 2/√6    0    −1/√3 ] [  0  0 ]         [ 1 −1 ]

is a SVD, and

    A† = QΣ†P* = (1/12)[ 1 1 2 ]
                       [ 1 1 2 ]

the pseudoinverse. So for instance

    A†A = (1/2)[ 1 1 ]
               [ 1 1 ]

does indeed orthogonally project onto V_row(A) = span{(1, 1)}.

Applications to least squares and linear regression. Let T : V → W be an arbitrary linear transformation, with V = ℂⁿ and W = ℂᵐ equipped with the dot product, and A := _ê[T]_ê. Write v ∈ V, w ∈ W, and [v]_ê = x, [w]_ê = y.
DEFINITION 22. For any fixed w ∈ W, a least-squares solution (LSS) to Tv = w (or equivalently Ax = y) is any ṽ ∈ V minimizing ‖Tṽ − w‖. The minimal LSS ṽ_min is the (unique) LSS minimizing ‖ṽ‖.

The minimum distance from w to im(T) is, of course, the distance from w to the orthogonal projection w̃ := Pr_{im(T)}w; and so ṽ is just any solution to

(8)    Tṽ = w̃    (⟺ Ax̃ = ỹ).

[Figure: w ∈ W and its orthogonal projection w̃ = Tṽ onto the subspace im(T).]

THEOREM 23. (i) The least-squares solutions of Tv = w are the solutions of the normal equations

(9)    T*Tṽ = T*w    (⟺ A*Ax̃ = A*y).

(ii) The minimal LSS of Tv = w is

(10)    ṽ_min = T†w.

PROOF. (i) Since im(T)⊥ = ker(T*), ṽ satisfies (9) ⟺ w − Tṽ ∈ im(T)⊥ ⟺ ṽ satisfies (8). (ii) By Corollary 16, w̃ = TT†w. So ṽ satisfies (8) ⟺ κ := ṽ − T†w ∈ ker(T). But ker(T) ⊥ im(T*) ∋ T†w, so ‖ṽ‖² = ‖T†w‖² + ‖κ‖² is minimized by κ = 0. ∎

Now T*T is invertible iff T is 1-to-1 (cf. Exercise (1)), in which case ṽ (= ṽ_min) is unique and the normal equations become

(11)    ṽ = (T*T)⁻¹T*w    (⟺ x̃ = (A*A)⁻¹A*y).
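Both characterizations in Theorem 23 are easy to exercise numerically: `np.linalg.lstsq` returns a least-squares solution, and for an injective A it agrees with the normal-equations formula (11) and with A†y from (10). A sketch with a random overdetermined system of our own devising:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((6, 2))     # 6 equations, 2 unknowns: generically no exact solution
y = rng.standard_normal(6)

x_lstsq = np.linalg.lstsq(A, y, rcond=None)[0]
x_normal = np.linalg.solve(A.T @ A, A.T @ y)   # (11): (A*A)^{-1} A* y
x_pinv = np.linalg.pinv(A) @ y                 # (10): the minimal LSS, A^† y

assert np.allclose(x_lstsq, x_normal)
assert np.allclose(x_lstsq, x_pinv)
# the residual lies in im(A)^perp = ker(A*), i.e. the normal equations (9) hold:
assert np.allclose(A.T @ (y - A @ x_lstsq), 0.0)
```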
If ker(T) ≠ {0}, then (9) is less convenient, and (10) seems the better result. But it turns out that if A is large, computationally (10) is more useful than the normal equations regardless of invertibility of A*A. This is because the SVD gives

(12)    x̃_min = A†y = QΣ†P*y,

and Σ† involves inverting the {μᵢ}, whereas (A*A)⁻¹ involves inverting the {μᵢ²}. If some nonzero μᵢ's differ by orders of magnitude, the computation of (A*A)⁻¹ will therefore introduce significantly more error than (12). That said, the normal equations are typically simpler for small examples, as we'll now see.

EXAMPLE 24. We find the LSS for the linear system Ax = y, where

    A = [ 1 −1 ]         y = [  2 ]
        [ 1  0 ]             [  0 ]
        [ 1  1 ]             [  2 ]
        [ 1  2 ],            [ −2 ].

Compute

    ᵗAA = [ 4 2 ],   (ᵗAA)⁻¹ = (1/10)[  3 −1 ]   ⟹   x̃ = (ᵗAA)⁻¹ᵗAy = [  1 ]
          [ 2 6 ]                    [ −1  2 ]                          [ −1 ].

Writing μ± = √(5 ± √5), the SVD is A = PΣᵗQ with

    Σ = diag_{4×2}{μ₊, μ₋},   q̂± = (1/√(10 ± 2√5))·ᵗ(2, 1 ± √5),   p̂± = (1/μ±)Aq̂±,

and p̂₃, p̂₄ any o.n. basis of ker(ᵗA); replacing the middle factor by Σ† = diag_{2×4}{μ₊⁻¹, μ₋⁻¹} gives

    A† = (1/10)[  4  3 2 1 ]
               [ −3 −1 1 3 ].

But using A† = (ᵗAA)⁻¹ᵗA (see Exercise (1) below) is much easier!
Note that this example can be seen as a data-fitting (linear regression) problem: the line minimizing the sum of squares of the vertical errors from the data points (Xᵢ, Yᵢ) is Y = 1 − X.

[Figure: scatter plot of the data with the least-squares line.]

Here is a slightly more interesting example.

EXAMPLE 25. Having entered the sports business, Monsanto is engaged in trying to grow a better baseball player. They've tracked a few little-leaguers and got the data

    (X₁, Y₁) = (0, 1),  (X₂, Y₂) = (1, 3),  (X₃, Y₃) = (2, 3),  (X₄, Y₄) = (3, 2)

(X = tens of miles from the Arch, Y = # of home runs/season), which suggests a parabola¹⁰

    Y = f(X) = b₀ + b₁X + b₂X².

¹⁰More seriously, rather than looking at the shape of the data set, one should try to base the form of f on physical principle (see for instance Exercise (2) below).
So write

    Y₁ = b₀ + b₁X₁ + b₂X₁² + ε₁
    Y₂ = b₀ + b₁X₂ + b₂X₂² + ε₂
    Y₃ = b₀ + b₁X₃ + b₂X₃² + ε₃
    Y₄ = b₀ + b₁X₄ + b₂X₄² + ε₄,

which translates to Y = Aβ + ε, where

    A = [ 1 X₁ X₁² ]   [ 1 0 0 ]
        [ 1 X₂ X₂² ] = [ 1 1 1 ]
        [ 1 X₃ X₃² ]   [ 1 2 4 ]
        [ 1 X₄ X₄² ]   [ 1 3 9 ].

Minimizing ‖ε‖ = ‖Aβ − Y‖ just means solving Aβ̃ = Ỹ, or equivalently ᵗAAβ̃ = ᵗAY, which is

    [  4  6 14 ] [ b₀ ]   [  9 ]                       [  1.05 ]
    [  6 14 36 ] [ b₁ ] = [ 15 ]   ⟹ (by RREF)  β̃ =  [  2.55 ]
    [ 14 36 98 ] [ b₂ ]   [ 33 ]                       [ −0.75 ].

So using f(X) = 1.05 + 2.55X − 0.75X², f′(X) = 2.55 − 1.5X ⟹ f is maximized at X = 1.7. Therefore the ideal distance from the Arch for raising baseball stars must be 17 miles. Because our data wasn't questionable at all and our sample size was YUGE.

Applications to data compression. One of the many other computational applications of the SVD is called Principal Component Analysis. Suppose that in a country of N internet users (N in the millions), you want to target ads (or fake news) based on online behavior, where the number M of possible clicks or purchases is also very large. Assume however that only k (cultural, demographic, etc.) attributes of a person generally determine this behavior: that is, there exists an N×k attribute matrix and a k×M behavior matrix whose product is roughly A, the N×M matrix of raw user data. In other words, we
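The normal-equations computation in the baseball example takes only a few lines to reproduce (this check is our own, not part of the text):

```python
import numpy as np

X = np.array([0., 1., 2., 3.])
Y = np.array([1., 3., 3., 2.])
A = np.vander(X, 3, increasing=True)        # columns 1, X, X^2

beta = np.linalg.solve(A.T @ A, A.T @ Y)    # normal equations  tA A beta = tA Y
assert np.allclose(beta, [1.05, 2.55, -0.75])

# vertex of the fitted parabola: f'(X) = b1 + 2 b2 X = 0
X_max = -beta[1] / (2 * beta[2])
assert np.isclose(X_max, 1.7)               # i.e., 17 miles from the Arch
```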
expect A to be well-approximated by a rank-k matrix, with k much smaller than M and N. The SVD in the first form we encountered (cf. (3)) extends to nonsquare matrices: in view of (7),

(13)    A = Σ_{l=1}^{r} μₗ p̂ₗ q̂ₗ*,

where r = rank(A). It turns out that

(14)    Aₖ := Σ_{l=1}^{k} μₗ p̂ₗ q̂ₗ*

is the best rank-k approximation to A (in the sense of minimizing the operator norm¹¹ ‖A − Aₖ‖ among all rank-k N×M matrices). In the situation just described, we'd expect Aₖ to be very close to A, while requiring us to record only k(1 + M + N) numbers (the μₗ and the entries of p̂ₗ, q̂ₗ for l = 1, ..., k) — a significant data compression vs. the MN entries of A.

The formulas (13)–(14) are close in spirit to (discrete) Fourier analysis, which breaks a function down into its constituent frequency components — viz., A_{ij} = Σ_{l,l′} α_{l,l′} e^{2πi·il/N} e^{2πi·jl′/M}. But the SVD allows for many fewer terms in (14), since the p̂ₗ, q̂ₗ are not fixed trigonometric functions, but rather adapted to the matrix A. On the other hand, the discrete Fourier transform doesn't require recording the p̂ₗ and q̂ₗ, so it has that advantage. Fortunately for computational applications, MATLAB has efficient algorithms for both!

Exercises

(1) (a) Show that T is 1-to-1 iff T*T is invertible. (b) Show that if T is 1-to-1, then T† = (T*T)⁻¹T*. (So in this case, (10) and (11) are the same. However, the two formulas (A*A)⁻¹A* and QΣ†P* for A† really do have different strengths.)

¹¹‖M‖ := max_{‖v‖=1} ‖Mv‖; in fact, ‖A − Aₖ‖ = μₖ₊₁. In addition, Aₖ minimizes the "stupid" matrix norm ‖M‖ := (Σ_{i,j}|M_{ij}|²)^{1/2} for A − Aₖ as well.
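The truncation (14) is one line of NumPy, and the optimality fact in footnote 11 — that the operator-norm error of the best rank-k approximation equals μₖ₊₁ — can be observed directly. The random matrix below is our own stand-in for the user-data matrix A:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((40, 30))          # stand-in for the N x M data matrix

P, mu, Qh = np.linalg.svd(A, full_matrices=False)
k = 5
A_k = P[:, :k] @ np.diag(mu[:k]) @ Qh[:k, :]   # (14): truncated SVD, rank k

assert np.linalg.matrix_rank(A_k) == k
# footnote 11: operator-norm error of the truncation is exactly mu_{k+1}
assert np.isclose(np.linalg.norm(A - A_k, 2), mu[k])
# storage: k(1 + 40 + 30) numbers instead of 40 * 30
```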
(2) As luck would have it, at time t = 0 bullies stuffed your backpack full of radioactive material from the Bridgeton landfill. Your legal team thinks there are two substances in there, with half-lives of one and five days respectively, but all they can do is measure the overall masses M₁, M₂, ..., Mₙ of the backpack at the end of days 1, 2, ..., n. (The bullies also superglued it shut, you see.) Measurements have error, so you need to formulate a linear regression model in order to determine the original masses A, B of the two substances at time t = 0. This should involve an n×2 matrix.

(3) Find a polar decomposition for

    A = [ 0 4 0 ]
        [ 0 0 2 ]
        [ 4 0 0 ].

(4) Calculate the matrix SVD for A = 0 .

(5) Calculate the minimal least-squares solution of the system

    x₁ + x₂ = 1
    x₁ + x₂ = 2
    x₁ − x₂ = 0.