(VII.E) The Singular Value Decomposition (SVD)


In this section we describe a generalization of the Spectral Theorem to non-normal operators, and even to transformations between different vector spaces. This is a computationally useful generalization, with applications to data fitting and data compression in particular. It is used in many branches of science, in statistics, and even in machine learning. As we develop the underlying mathematics, we will encounter two closely associated constructions, the polar decomposition and the pseudoinverse, in each case presenting the version for operators followed by that for matrices.

To begin, let $(V,\langle\,,\rangle)$ be an $n$-dimensional inner-product space and $T\colon V\to V$ an operator. By Spectral Theorem III, if $T$ is normal then it has a unitary eigenbasis $\mathcal B$, with complex eigenvalues $\sigma_1,\ldots,\sigma_n$. In particular, $T$ is unitary iff these satisfy $|\sigma_i|=1$ (since preserving lengths of a unitary eigenbasis is equivalent to preserving lengths of all vectors), and self-adjoint iff the $\sigma_i$ are real. So $T$ is both unitary and self-adjoint iff $T$ has only $+1$ and $-1$ as eigenvalues, which is to say $T$ is an orthogonal reflection, acting on $V = W\oplus W^{\perp}$ by $\mathrm{Id}_W$ on $W$ and $-\mathrm{Id}_{W^{\perp}}$ on $W^{\perp}$. But the broader point here is that we should have in mind the following "table of analogies":

    operators        numbers
    normal           $\mathbb{C}$
    unitary          $S^1$
    self-adjoint     $\mathbb{R}$
    ??               $\mathbb{R}_{\geq 0}$

where $S^1\subset\mathbb{C}$ is the unit circle. So, how do we complete it?

DEFINITION 1. $T$ is positive (resp. strictly positive) if it is normal, with all eigenvalues in $\mathbb{R}_{\geq 0}$ (resp. $\mathbb{R}_{>0}$).

Since they are unitarily diagonalizable with real eigenvalues, positive operators are self-adjoint. Given a self-adjoint (or normal) operator $T$, any $v\in V$ is a sum of orthogonal eigenvectors $w_j$ with eigenvalues $\sigma_j$; so
$$\langle v, Tv\rangle \;=\; \sum_{i,j}\langle w_i, Tw_j\rangle \;=\; \sum_{i,j}\sigma_j\langle w_i, w_j\rangle \;=\; \sum_j \sigma_j\|w_j\|^2$$
is in $\mathbb{R}_{\geq 0}$ for all $v$ if and only if $T$ is positive. If $V$ is complex, the condition that $\langle v, Tv\rangle\geq 0$ ($\forall v$) on its own gives self-adjointness (Exercise VII.D.7), hence positivity, of $T$. (If $V$ is real, this condition alone isn't enough.)

EXAMPLE 2. Orthogonal projections are positive, since they are self-adjoint with eigenvalues $0$ and $1$.
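As a quick numerical sanity check of this criterion, here is a minimal NumPy sketch: any matrix of the form $B^*B$ is positive, its eigenvalues are nonnegative, and the quadratic form $\langle v, Tv\rangle$ is nonnegative for every $v$.

import numpy as np

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))
T = B.conj().T @ B                      # T = B*B is positive

eigvals = np.linalg.eigvalsh(T)         # real, since T is Hermitian
print(np.round(eigvals, 6))             # all >= 0 (up to roundoff)

v = rng.standard_normal(4) + 1j * rng.standard_normal(4)
quad = np.vdot(v, T @ v)                # <v, Tv>, with conjugation in the first slot
print(np.round(quad, 6))                # real and >= 0 (up to roundoff)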

Polar decomposition. A nonzero complex number can be written (in "polar form") as a product $re^{i\theta}$, for unique $r\in\mathbb{R}_{>0}$ and $e^{i\theta}\in S^1$. If $T$ is a normal operator, then it has a unitary eigenbasis $\mathcal B = \{\hat v_1,\ldots,\hat v_n\}$ with $[T]_{\mathcal B} = \mathrm{diag}\{r_1 e^{i\theta_1},\ldots,r_n e^{i\theta_n}\}$ ($r_i\in\mathbb{R}_{\geq 0}$); and defining $|T|$, $U$ by $[|T|]_{\mathcal B} = \mathrm{diag}\{r_1,\ldots,r_n\}$ and $[U]_{\mathcal B} = \mathrm{diag}\{e^{i\theta_1},\ldots,e^{i\theta_n}\}$, we have $T = U|T|$ with $U$ unitary and $|T|$ positive, and $U|T| = |T|U$.

One might expect such polar decompositions of operators to stop there. But there exists an analogue for arbitrary transformations, one that will lead on to the SVD as well! To formulate this general polar decomposition, let $T\colon V\to V$ be an arbitrary linear transformation. Notice that $T^*T$ is self-adjoint (trivially so: $(T^*T)^* = T^*T^{**} = T^*T$) and, in fact, positive: $\langle v, T^*Tv\rangle = \langle Tv, Tv\rangle\geq 0$, since $\langle\,,\rangle$ is an inner product. So we have a unitary basis $\mathcal B$ with $[T^*T]_{\mathcal B} = \mathrm{diag}\{\lambda_1,\ldots,\lambda_n\}$ and all $\lambda_i\in\mathbb{R}_{\geq 0}$. (Henceforth we impose the convention that $\lambda_1\geq\lambda_2\geq\cdots\geq\lambda_n$.) Taking nonnegative square roots $\mu_i := \sqrt{\lambda_i}\in\mathbb{R}_{\geq 0}$, we define a new operator $|T|$ by $[|T|]_{\mathcal B} = \mathrm{diag}\{\mu_1,\ldots,\mu_n\}$; clearly $|T|^2 = T^*T$, and $|T|$ is positive.

THEOREM 3. There exists a unitary operator $U$ such that $T = U|T|$. This polar decomposition is unique exactly when $T$ is invertible. Moreover, $U$ and $|T|$ commute exactly when $T$ is normal.

PROOF. We have
$$\||T|v\|^2 = \langle |T|v, |T|v\rangle = \langle v, |T|^2 v\rangle = \langle v, T^*Tv\rangle = \langle Tv, Tv\rangle = \|Tv\|^2$$
since $|T|$ is self-adjoint. Consequently $T$ and $|T|$ have the same kernel, hence (by Rank + Nullity) the same rank; so while their images (as subspaces of $V$) may differ, they have the same dimension. By Spectral Theorem I, $V = \mathrm{im}\,|T|\oplus\ker|T|$. So $\dim(\ker|T|) = \dim((\mathrm{im}\,T)^{\perp})$, and we fix an invertible transformation $U_2\colon \ker|T|\to(\mathrm{im}\,T)^{\perp}$ sending some choice of unitary basis to another unitary basis. Writing $v = |T|w + v_2$ (with $|T|v_2 = 0$), we now define
$$U\colon \underbrace{\mathrm{im}\,|T|\oplus\ker|T|}_{V}\;\longrightarrow\;\underbrace{\mathrm{im}\,T\oplus(\mathrm{im}\,T)^{\perp}}_{V}$$
by $Uv := Tw + U_2 v_2$. Since $\|Tw + U_2 v_2\|^2 = \|Tw\|^2 + \|U_2 v_2\|^2 = \||T|w\|^2 + \|v_2\|^2 = \|v\|^2$, $U$ is unitary; and taking $v = |T|w$ gives $U|T|w = Tw$ ($\forall w\in V$). Now if $T$ is invertible, then so is $|T|$ (all $\mu_i > 0$) $\Rightarrow$ $U = T|T|^{-1}$. If it isn't, then $\ker|T|\neq\{0\}$ and non-uniqueness enters in the choice of $U_2$. Finally, if $U$ and $|T|$ commute, then they simultaneously unitarily diagonalize, and therefore so does $T$, making $T$ normal. If $T$ is normal, we have constructed $U$ and $|T|$ above, and they commute.
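For computation one rarely follows the proof literally: a polar decomposition can be read off from any SVD, since $A = W\Sigma V^*$ gives $A = (WV^*)(V\Sigma V^*)$ with $WV^*$ unitary and $V\Sigma V^*$ positive. Here is a minimal NumPy sketch of that route (polar_decomposition is an illustrative helper, not a library routine; the test matrix is the one used in Example 4 below):

import numpy as np

def polar_decomposition(A):
    """Illustrative helper: return (U, P) with A = U @ P, U unitary, P = |A| positive semidefinite."""
    W, s, Vh = np.linalg.svd(A)          # A = W diag(s) Vh
    U = W @ Vh                           # unitary factor
    P = Vh.conj().T @ np.diag(s) @ Vh    # P = sqrt(A* A)
    return U, P

A = np.array([[2., 0., 0., 0.],
              [0., 1., 0., 0.],
              [0., 1., 1., 0.],
              [0., 0., 0., 0.]])
U, P = polar_decomposition(A)
print(np.allclose(A, U @ P))             # True
print(np.allclose(U.T @ U, np.eye(4)))   # True: U is orthogonal (unitary)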

EXAMPLE 4. Consider $V = \mathbb{R}^4$ with the dot product, and
$$A = [T]_{\hat e} = \begin{pmatrix} 2 & 0 & 0 & 0\\ 0 & 1 & 0 & 0\\ 0 & 1 & 1 & 0\\ 0 & 0 & 0 & 0\end{pmatrix}.$$
Since $A$ has a Jordan block, we know it is not diagonalizable; and so $T$ is not normal. However,
$$[T^*T]_{\hat e} = {}^t\!AA = \begin{pmatrix} 4 & 0 & 0 & 0\\ 0 & 2 & 1 & 0\\ 0 & 1 & 1 & 0\\ 0 & 0 & 0 & 0\end{pmatrix} = S_{\mathcal B}^{-1} D\, S_{\mathcal B}$$
with $\mathcal B := \{\hat e_1,\ \hat e_2 + \varphi_-\hat e_3,\ \hat e_2 - \varphi_+\hat e_3,\ \hat e_4\}$ and $D := \mathrm{diag}\{4, \eta_+, \eta_-, 0\}$ (where $\varphi_{\pm} := \frac{\pm 1 + \sqrt 5}{2}$ and $\eta_{\pm} := \frac{3\pm\sqrt 5}{2} = \varphi_{\pm}^2$). Taking the positive square root $\sqrt D := \mathrm{diag}\{2, \varphi_+, \varphi_-, 0\}$, we get
$$[|T|]_{\hat e} = |A| := S_{\mathcal B}^{-1}\sqrt D\, S_{\mathcal B} = \begin{pmatrix} 2 & 0 & 0 & 0\\ 0 & \tfrac{3}{\sqrt 5} & \tfrac{1}{\sqrt 5} & 0\\ 0 & \tfrac{1}{\sqrt 5} & \tfrac{2}{\sqrt 5} & 0\\ 0 & 0 & 0 & 0\end{pmatrix}.$$
On $W := \mathrm{span}\{\hat e_1, \hat e_2, \hat e_3\} = \mathrm{im}\,T = \mathrm{im}\,|T|$, we must have $U|_W = T|_W\circ\big(|T|\big|_W\big)^{-1}$; while on $\mathrm{span}\{\hat e_4\} = W^{\perp}$, $U$ can be taken to act by any scalar of norm $1$. Therefore
$$[U]_{\hat e} = Q := \begin{pmatrix} 1 & 0 & 0 & 0\\ 0 & \tfrac{2}{\sqrt 5} & -\tfrac{1}{\sqrt 5} & 0\\ 0 & \tfrac{1}{\sqrt 5} & \tfrac{2}{\sqrt 5} & 0\\ 0 & 0 & 0 & e^{i\theta}\end{pmatrix},$$
and $A = Q\,|A|$. Geometrically, $|A|$ dilates the $\mathcal B$-coordinates of a vector, then $Q$ performs a rotation.

SVD for endomorphisms. Now let $|T|$ be the (unique) positive square root of $T^*T$ as above, and $\mathcal B = \{\hat v_1,\ldots,\hat v_n\}$ the unitary$^{2}$ basis of $V$ under which $[|T|]_{\mathcal B} = \mathrm{diag}\{\mu_1,\ldots,\mu_n\}$.

DEFINITION 5. These $\{\mu_i\}$ are called the singular values of $T$.

Writing $T = U|T|$ and $\hat w_i := U\hat v_i$,
$$U^*U = \mathrm{Id}_V\ \Longrightarrow\ \langle\hat w_i, \hat w_j\rangle = \langle U^*U\hat v_i, \hat v_j\rangle = \langle\hat v_i, \hat v_j\rangle = \delta_{ij}\ \Longrightarrow\ \mathcal C := \{\hat w_1,\ldots,\hat w_n\} = U(\mathcal B)\ \text{is unitary.}$$
For $v\in V$, we have
(1) $\quad Tv = U|T|\sum_{i=1}^{n}\langle\hat v_i, v\rangle\,\hat v_i = \sum_{i=1}^{n}\langle\hat v_i, v\rangle\, U|T|\hat v_i = \sum_{i=1}^{n}\mu_i\langle\hat v_i, v\rangle\, U\hat v_i = \sum_{i=1}^{n}\mu_i\langle\hat v_i, v\rangle\,\hat w_i.$

Viewing $\langle\hat v_i,\cdot\rangle =: \ell_{\hat v_i}\colon V\to\mathbb{C}$ as a linear functional, and $\hat w_i\colon\mathbb{C}\to V$ as a map (sending $\alpha\in\mathbb{C}$ to $\alpha\hat w_i$), (1) becomes
(2) $\quad T = \sum_{i=1}^{n}\mu_i\,\hat w_i\circ\ell_{\hat v_i}.$

How does this look in matrix terms? Let $\mathcal A$ be an arbitrary$^{3}$ unitary basis of $V$, and set $A := [T]_{\mathcal A}$. Using $\{1\}$ as a basis of $\mathbb{C}$, (2) yields
(3) $\quad A = \sum_{i=1}^{n}\mu_i\;{}_{\mathcal A}[\hat w_i]_{\{1\}}\;{}_{\{1\}}[\ell_{\hat v_i}]_{\mathcal A} = \sum_{i=1}^{n}\mu_i\,[\hat w_i]_{\mathcal A}\,[\hat v_i]_{\mathcal A}^{*}.$

Here $[\hat w_i]_{\mathcal A}\,[\hat v_i]_{\mathcal A}^{*}$ is an $(n\times 1)\cdot(1\times n)$ matrix product, yielding an $n\times n$ matrix of rank one. So (2) (resp. (3)) decomposes $T$ (resp. $A$) into a sum of rank-one endomorphisms (resp. matrices) weighted by the singular values. This is a first version of the Singular Value Decomposition.

$^{2}$The reader may of course replace "unitary" and $\mathbb{C}$ by "orthonormal"/"orthogonal" and $\mathbb{R}$ in what follows.
$^{3}$What you should actually have in mind here is, in the case where $V = \mathbb{C}^n$ with the dot product, the standard basis $\hat e$.
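Formula (3) is easy to see in action numerically. A minimal NumPy sketch (the columns of the first factor and the rows of the last factor returned by numpy.linalg.svd play the roles of the $\hat w_i$ and $[\hat v_i]^{*}$):

import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))

W, mu, Vh = np.linalg.svd(A)             # columns of W are w_i, rows of Vh are v_i^*
A_rebuilt = sum(mu[i] * np.outer(W[:, i], Vh[i, :]) for i in range(5))
print(np.allclose(A, A_rebuilt))         # True: A is the weighted sum of rank-one pieces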

REMARK 6. If $T$ is positive, we can take $U = \mathrm{Id}_V$, so that $\hat w_i\circ\ell_{\hat v_i} = \hat v_i\circ\ell_{\hat v_i} =: \Pr_{\hat v_i}$ is the orthogonal projection onto $\mathrm{span}\{\hat v_i\}$. Thus (2) merely restates Spectral Theorem I in this case. In fact, if $T$ is normal, then $T$, $T^*T$, and $|T|$ simultaneously diagonalize. If $\tilde\mu_1,\ldots,\tilde\mu_n$ are the (complex) eigenvalues of $T$, then we evidently have $\mu_i = |\tilde\mu_i|$; $U$ may then be defined by $[U]_{\mathcal B} = \mathrm{diag}\{\tilde\mu_i/|\tilde\mu_i|\}$ (with $0/0$ interpreted as $1$). Thus (2) reads $T = \sum_{i=1}^{n}\tilde\mu_i\Pr_{\hat v_i}$, which restates Spectral Theorem III. So in this sense, the SVD generalizes the results in VII.D.

Now since $U$ sends $\mathcal B$ to $\mathcal C$, we have
$${}_{\mathcal C}[T]_{\mathcal B} = \underbrace{{}_{\mathcal C}[U]_{\mathcal B}}_{I_n}\,[|T|]_{\mathcal B} = \mathrm{diag}\{\mu_1,\ldots,\mu_n\} =: D$$
and thus
(4) $\quad A = [T]_{\mathcal A} = {}_{\mathcal A}[I]_{\mathcal C}\;{}_{\mathcal C}[T]_{\mathcal B}\;{}_{\mathcal B}[I]_{\mathcal A} =: S_1^{-1} D\, S_2,$
with $S_1$ and $S_2$ unitary matrices (as they change between unitary bases). Roughly speaking, (4) presents $A$ as a composition of the form rotation,$^{4}$ dilation, rotation.

$^{4}$More precisely, "rotation" here can mean a complicated sequence of rotations and reflections.
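In coordinates, (4) can be computed by orthogonally diagonalizing $A^*A = S_2^* D^2 S_2$ and then, when $A$ is invertible, setting $S_1^{-1} = AS_2^{-1}D^{-1}$ (this is exactly the construction in Theorem 7 below and its proof). A minimal NumPy sketch for a (generically) invertible real matrix; production code should of course call a library SVD instead:

import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4)) + np.eye(4)        # generically invertible

lam, V = np.linalg.eigh(A.T @ A)                   # A*A = V diag(lam) V^T, lam ascending
order = np.argsort(lam)[::-1]                      # sort singular values decreasingly
lam, V = lam[order], V[:, order]
D = np.diag(np.sqrt(lam))                          # D = diag of singular values
S2 = V.T                                           # rows of S2: unitary eigenbasis of A*A
S1_inv = A @ V @ np.linalg.inv(D)                  # S1^{-1} = A S2^{-1} D^{-1}

print(np.allclose(A, S1_inv @ D @ S2))             # A = S1^{-1} D S2
print(np.allclose(S1_inv.T @ S1_inv, np.eye(4)))   # S1^{-1} is orthogonal (unitary)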

THEOREM 7. Any $A\in M_n(\mathbb{C})$ can be presented as a product $S_1^{-1} D\, S_2$, where $D$ is a diagonal matrix with entries in $\mathbb{R}_{\geq 0}$, and $S_i S_i^* = I_n$. The diagonal entries of $D$, called the singular values of $A$, are unique; and if these are all distinct, nonzero, and in decreasing order, then $(S_1, S_2)$ is unique modulo rescaling both $S_1$ and $S_2$ by a (common) unitary diagonal matrix.$^{5}$

PROOF. Existence was deduced above. We have $D^* = D$, hence
$$A^*A = S_2^*\, D\, S_1\, S_1^{-1}\, D\, S_2 = S_2^*\, D^2\, S_2,$$
which is a unitary$^{6}$ diagonalization of the Hermitian matrix $A^*A$. So $S_2$, hence $D$, is unique; and if the diagonal entries are distinct, then the unitary eigenbasis ($=$ columns of $S_2^*$) is determined up to $e^{i\theta}$ scaling factors. If no singular values are $0$, $D$ is invertible and $S_1^{-1} = AS_2^{-1}D^{-1}$.

EXAMPLE 8. Let's look at the operator $T = \frac{d}{dt}\colon P_2(\mathbb{R})\to P_2(\mathbb{R})$ with $\langle f, g\rangle := \int_{-1}^{1} f(t)g(t)\,dt$. This is a non-normal, nilpotent transformation. We compute its singular values — which are not all zero, even though$^{7}$ $0$ is the only eigenvalue of $T$ — and the bases $\mathcal B$ and $\mathcal C$. To compute $T^*$ we will need an o.n. basis $\mathcal A$: Gram–Schmidt on $\{1, t, t^2\}$ yields $\mathcal A := \big\{\tfrac{1}{\sqrt 2},\ \sqrt{\tfrac32}\,t,\ \tfrac32\sqrt{\tfrac52}\,\big(t^2 - \tfrac13\big)\big\}$ and
$$[T]_{\mathcal A} = \begin{pmatrix} 0 & \sqrt 3 & 0\\ 0 & 0 & \sqrt{15}\\ 0 & 0 & 0\end{pmatrix}\ \Longrightarrow\ [T^*T]_{\mathcal A} = \begin{pmatrix} 0 & 0 & 0\\ 0 & 3 & 0\\ 0 & 0 & 15\end{pmatrix}\ \Longrightarrow\ [T^*]_{\mathcal A} = {}^t[T]_{\mathcal A}\ \Longrightarrow\ [|T|]_{\mathcal A} = \begin{pmatrix} 0 & 0 & 0\\ 0 & \sqrt 3 & 0\\ 0 & 0 & \sqrt{15}\end{pmatrix}$$
$\Longrightarrow$ the singular values are $\sqrt{15}$, $\sqrt 3$, $0$, and
$$\mathcal B = \Big\{\tfrac32\sqrt{\tfrac52}\,\big(t^2 - \tfrac13\big),\ \sqrt{\tfrac32}\,t,\ \tfrac{1}{\sqrt 2}\Big\} =: \{\hat v_1, \hat v_2, \hat v_3\}.$$
Now $\mathrm{im}(|T|) = \mathrm{span}\{\hat v_1, \hat v_2\}$, $\ker(|T|) = \mathrm{span}\{\hat v_3\} = \ker(T)$, $\mathrm{im}(T) = \mathrm{span}\{\hat v_2, \hat v_3\}$, and $\mathrm{im}(T)^{\perp} = \mathrm{span}\{\hat v_1\}$. So we define
$$U\colon\ \mathrm{span}\{\hat v_1, \hat v_2\}\oplus\mathrm{span}\{\hat v_3\}\ \longrightarrow\ \mathrm{span}\{\hat v_2, \hat v_3\}\oplus\mathrm{span}\{\hat v_1\}$$

$^{5}$If $A\in M_n(\mathbb{R})$, then the above construction shows that $S_1$ and $S_2$ can be taken to be real (orthogonal); and such $S_i$ are unique modulo rescaling by real diagonal "unitary" matrices, i.e. with diagonal entries $\pm 1$.
$^{6}$in the dot product
$^{7}$Caution: if $T$ is diagonalizable, it may not be unitarily diagonalizable, and it's only in the latter case that $T$'s singular values are the absolute values of $T$'s eigenvalues.

by sending $\hat v_1\mapsto \frac{T(\frac{1}{\sqrt{15}}\hat v_1)}{\|T(\frac{1}{\sqrt{15}}\hat v_1)\|} = \hat v_2$, $\ \hat v_2\mapsto \frac{T(\frac{1}{\sqrt 3}\hat v_2)}{\|T(\frac{1}{\sqrt 3}\hat v_2)\|} = \hat v_3$, and $\hat v_3\mapsto\hat v_1$. We then have
$$[U]_{\mathcal A} = \begin{pmatrix} 0 & 1 & 0\\ 0 & 0 & 1\\ 1 & 0 & 0\end{pmatrix}, \qquad\text{and}\qquad \mathcal C = U(\mathcal B) = \{\hat v_2, \hat v_3, \hat v_1\}.$$
The$^{8}$ SVD of $A = [T]_{\mathcal A}$ now reads (noting that $\mathcal A = \{\hat v_3, \hat v_2, \hat v_1\}$)
$$A = S_1^{-1} D\, S_2 = \begin{pmatrix} 0 & 1 & 0\\ 1 & 0 & 0\\ 0 & 0 & 1\end{pmatrix}\begin{pmatrix} \sqrt{15} & 0 & 0\\ 0 & \sqrt 3 & 0\\ 0 & 0 & 0\end{pmatrix}\begin{pmatrix} 0 & 0 & 1\\ 0 & 1 & 0\\ 1 & 0 & 0\end{pmatrix}.$$
We now turn to the more general form of the Singular Value Decomposition.

SVD for transformations. Let $T\colon V\to W$ be an arbitrary linear transformation between finite-dimensional inner-product spaces, and fix unitary bases $\mathcal A$ resp. $\mathcal A'$ of $V$ resp. $W$. Write $n = \dim V$, $m = \dim W$.

LEMMA 9. The transformation $T^*\colon W\to V$ defined by ${}_{\mathcal A}[T^*]_{\mathcal A'} = ({}_{\mathcal A'}[T]_{\mathcal A})^{*}$ is the unique operator satisfying
(5) $\quad \langle T^*w, v\rangle = \langle w, Tv\rangle\quad$ for all $v\in V$, $w\in W$.
We call it the adjoint of $T$.

PROOF. Since $\mathcal A$, $\mathcal A'$ are unitary, (5) is equivalent to
$$[T^*\vec w]_{\mathcal A}^{*}\,[\vec v]_{\mathcal A} = [\vec w]_{\mathcal A'}^{*}\,[T\vec v]_{\mathcal A'}$$
and thus to
$$[\vec w]_{\mathcal A'}^{*}\,\big({}_{\mathcal A}[T^*]_{\mathcal A'}\big)^{*}\,[\vec v]_{\mathcal A} = [\vec w]_{\mathcal A'}^{*}\;{}_{\mathcal A'}[T]_{\mathcal A}\,[\vec v]_{\mathcal A}.$$
Since this must hold for all $[\vec v]_{\mathcal A}$ and $[\vec w]_{\mathcal A'}$, it forces $\big({}_{\mathcal A}[T^*]_{\mathcal A'}\big)^{*} = {}_{\mathcal A'}[T]_{\mathcal A}$, as claimed.

$^{8}$Of course, even over $\mathbb{R}$ it isn't quite unique: one can play with the signs in $S_1$ and $S_2$.
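In coordinates this is just the statement that the adjoint is the conjugate-transpose matrix; a quick numerical check (a sketch using the standard bases of $\mathbb{C}^4$ and $\mathbb{C}^3$):

import numpy as np

rng = np.random.default_rng(6)
T = rng.standard_normal((3, 4)) + 1j * rng.standard_normal((3, 4))   # T : C^4 -> C^3
v = rng.standard_normal(4) + 1j * rng.standard_normal(4)
w = rng.standard_normal(3) + 1j * rng.standard_normal(3)

Tstar = T.conj().T                                  # matrix of the adjoint
print(np.isclose(np.vdot(Tstar @ w, v),             # <T*w, v>
                 np.vdot(w, T @ v)))                # <w, Tv>  -> True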

LEMMA 10. We have $V = \ker(T)\oplus\mathrm{im}(T^*)$ and $W = \mathrm{im}(T)\oplus\ker(T^*)$, with $T$ (resp. $T^*$) restricting to an isomorphism from $\mathrm{im}(T^*)\to\mathrm{im}(T)$ (resp. $\mathrm{im}(T)\to\mathrm{im}(T^*)$).

PROOF. By Lemma 9, $T$ and $T^*$ have the same rank $r$, so Rank + Nullity gives
$$\begin{cases} \dim(\ker T) + \dim(\mathrm{im}\,T^*) = (n-r) + r = n = \dim V,\\ \dim(\mathrm{im}\,T) + \dim(\ker T^*) = r + (m-r) = m = \dim W.\end{cases}$$
For $v\in\ker(T)$, $\langle v, T^*w\rangle = \langle Tv, w\rangle = \langle 0, w\rangle = 0\ \Rightarrow\ \ker(T)\perp\mathrm{im}(T^*)\ \Rightarrow\ \ker(T)\cap\mathrm{im}(T^*) = \{0\}\ \Rightarrow\ T|_{\mathrm{im}(T^*)}$ is injective. This proves the assertions about $V$ and $T$; those for $W$ and $T^*$ follow by symmetry.

The compositions $T^*T\colon V\to V$ and $TT^*\colon W\to W$ are clearly positive (argue as above), and preserve the $\oplus$'s in Lemma 10. So $T^*T$ (resp. $TT^*$) is an automorphism of $\mathrm{im}(T^*)$ (resp. $\mathrm{im}(T)$) and the zero map on $\ker(T)$ (resp. $\ker(T^*)$). The same remarks apply to the positive operator $\sqrt{T^*T}$ (resp. $\sqrt{TT^*}$), and there exists a unitary basis $\mathcal B = \{\hat v_1,\ldots,\hat v_n\}$ of $V$ with
$$\big[\sqrt{T^*T}\,\big]_{\mathcal B} = \mathrm{diag}\{\mu_1,\ldots,\mu_r, 0,\ldots,0\}$$
and $\mu_1\geq\mu_2\geq\cdots\geq\mu_r > 0$.

DEFINITION 11. These $\{\mu_i\}$ are called the (nonzero) singular values of $T$.

In particular, $\{\hat v_1,\ldots,\hat v_r\}\subset\mathrm{im}(T^*)$ and $\{\hat v_{r+1},\ldots,\hat v_n\}\subset\ker(T)$; and we define $\{\hat w_1,\ldots,\hat w_r\}\subset\mathrm{im}(T)$ by $\hat w_i := \mu_i^{-1}\,T\hat v_i$. Since
$$\langle\hat w_i, \hat w_j\rangle = \tfrac{1}{\mu_i\mu_j}\langle T\hat v_i, T\hat v_j\rangle = \tfrac{\mu_i^2}{\mu_i\mu_j}\langle\hat v_i, \hat v_j\rangle = \tfrac{\mu_i}{\mu_j}\,\delta_{ij} = \delta_{ij},$$
this gives a unitary basis of $\mathrm{im}(T)$, which we complete to a unitary basis $\mathcal C\subset W$ by choosing $\{\hat w_{r+1},\ldots,\hat w_m\}\subset\ker(T^*)$. This proves

THEOREM 12 (Abstract SVD). There exist unitary bases $\mathcal B\subset V$, $\mathcal C\subset W$ such that$^{9}$
(6) $\quad {}_{\mathcal C}[T]_{\mathcal B} = \mathrm{diag}_{m\times n}\{\mu_1,\ldots,\mu_r, 0,\ldots,0\}$
with $\mu_1\geq\cdots\geq\mu_r > 0$, and $r = \mathrm{rank}(T)$.

REMARK 13. (i) $\mathcal C$ is an eigenbasis for $TT^*$, and its nonzero eigenvalues are also the $\{\mu_i^2\}_{i=1}^{r}$:
$$TT^*\hat w_i = \mu_i^{-1}\,TT^*T\hat v_i = \mu_i^{-1}\,T(\mu_i^2\hat v_i) = \mu_i^2\,\hat w_i.$$
(ii) If $\mathcal B'\subset V$, $\mathcal C'\subset W$ are unitary bases such that ${}_{\mathcal C'}[T]_{\mathcal B'} = \mathrm{diag}_{m\times n}\{\mu_1',\ldots,\mu_s', 0,\ldots,0\}$ in decreasing order, then $s = r$ and $\mu_i' = \mu_i$ ($\forall i$). This is because
$${}_{\mathcal B'}[T^*T]_{\mathcal B'} = {}_{\mathcal B'}[T^*]_{\mathcal C'}\;{}_{\mathcal C'}[T]_{\mathcal B'} = \big({}_{\mathcal C'}[T]_{\mathcal B'}\big)^{*}\;{}_{\mathcal C'}[T]_{\mathcal B'} = \mathrm{diag}_{n\times n}\{(\mu_1')^2,\ldots,(\mu_r')^2, 0,\ldots,0\},$$
so that the $\{(\mu_i')^2\}$ are the eigenvalues of $T^*T$. So the (nonzero) singular values are the unique positive numbers that can appear in (6).
(iii) Moreover, if the $\mu_i$ are distinct and $T$ is injective (resp. surjective), the elements of $\mathcal B'$ (resp. $\mathcal C'$) are $e^{i\theta}$-multiples of the elements of $\mathcal B$ (resp. $\mathcal C$).

DEFINITION 14. The pseudoinverse of $T$ is the transformation $T^{\dagger}\colon W\to V$ given by $0$ on $\ker(T^*)$ and by inverting $T\colon\mathrm{im}(T^*)\to\mathrm{im}(T)$ on $\mathrm{im}(T)$.

COROLLARY 15. ${}_{\mathcal B}[T^{\dagger}]_{\mathcal C} = \mathrm{diag}_{n\times m}\{\mu_1^{-1},\ldots,\mu_r^{-1}, 0,\ldots,0\}$. Note that if $T$ is invertible, then $T^{\dagger} = T^{-1}$.

COROLLARY 16. $T^{\dagger}T\colon V\to V$ is the orthogonal projection $\Pr_{\mathrm{im}(T^*)}$, and $TT^{\dagger}\colon W\to W$ is $\Pr_{\mathrm{im}(T)}$.

SVD for $m\times n$ matrices. The matrix version of Theorem 12 is one of the most important results in linear algebra. Its efficient computer implementation for large matrices has been the subject of countless articles in numerical analysis. While we won't say anything about these algorithms, they are more efficient (and accurate) than the orthogonal diagonalization of $A^*A$, which remains our "algorithm of choice" here.

$^{9}$The notation $\mathrm{diag}_{m\times n}\{\mu_1,\ldots,\mu_r, 0,\ldots,0\}$ means an $m\times n$ matrix whose entries are zero except for the $(j,j)^{\text{th}}$ entries ($= \mu_j$) for $j\leq r$ ($\leq m, n$).

THEOREM 17 (Matrix SVD). Let $A$ be an arbitrary $m\times n$ matrix of rank $r$, with complex entries. Then there exist unitary matrices $P\in M_m(\mathbb{C})$, $Q\in M_n(\mathbb{C})$ and real numbers $\mu_1\geq\cdots\geq\mu_r > 0$, such that
(7) $\quad A = P\,\Sigma\,Q^{*} = \big(\hat p_1\,|\,\cdots\,|\,\hat p_m\big)\;\mathrm{diag}_{m\times n}\{\mu_1,\ldots,\mu_r, 0,\ldots,0\}\;\big(\hat q_1\,|\,\cdots\,|\,\hat q_n\big)^{*}$
(and $A^{*} = Q\,{}^t\Sigma\,P^{*}$). The map from $\mathbb{C}^n\to\mathbb{C}^m$ induced by $A$ decomposes as an orthogonal direct sum (under the dot product)
$$V_{\mathrm{col}}(A^*)\;\oplus\;\{\vec v\mid A\vec v = \vec 0\}\ \xrightarrow{\ A\ }\ V_{\mathrm{col}}(A)\;\oplus\;\{\vec w\mid A^*\vec w = \vec 0\}$$
(with $V_{\mathrm{col}}(A^*) = \mathrm{span}\{\hat q_1,\ldots,\hat q_r\}$ and $V_{\mathrm{col}}(A) = \mathrm{span}\{\hat p_1,\ldots,\hat p_r\}$), given by $\hat q_j\mapsto\mu_j\hat p_j$ ($j = 1,\ldots,r$) on the first factor and zero on the second factor. (The reverse map $A^*$ sends $\hat p_j\mapsto\mu_j\hat q_j$.) The nonzero singular values $\mu_j$ are the positive square roots of the nonzero eigenvalues of $A^*A$ (resp. $AA^*$), for which the columns of $Q$ (resp. $P$) give unitary eigenbases.

PROOF. Apply Theorem 12 to the setting: $V = \mathbb{C}^n$ with $\langle\,,\rangle =$ dot product and $\mathcal A = \hat e$; $W = \mathbb{C}^m$ with $\langle\,,\rangle =$ dot product and $\mathcal A' = \hat e$; $T\colon V\to W$ such that $A = {}_{\mathcal A'}[T]_{\mathcal A}$. Then ${}_{\mathcal A'}[T]_{\mathcal A} = {}_{\mathcal A'}[\mathrm{Id}_W]_{\mathcal C}\;{}_{\mathcal C}[T]_{\mathcal B}\;{}_{\mathcal B}[\mathrm{Id}_V]_{\mathcal A}$ is $A = S_{\mathcal C}^{-1}\,\Sigma\,S_{\mathcal B}$. The remaining details are immediate from the abstract analysis.

DEFINITION 18. The pseudoinverse of $A$ is the $n\times m$ matrix $A^{\dagger} := {}_{\mathcal A}[T^{\dagger}]_{\mathcal A'} = Q\,\Sigma^{\dagger}\,P^{*}$, where $\Sigma^{\dagger} = \mathrm{diag}_{n\times m}\{\mu_1^{-1},\ldots,\mu_r^{-1}, 0,\ldots,0\}$.

COROLLARY 19. $A^{\dagger}A = [\Pr_{V_{\mathrm{col}}(A^*)}]_{\hat e}$ and $AA^{\dagger} = [\Pr_{V_{\mathrm{col}}(A)}]_{\hat e}$ are matrices of orthogonal projections (under the dot product).

REMARK 20. In all of this, if $A$ is real then $A^{*} = {}^t\!A\ \Rightarrow\ V_{\mathrm{col}}(A^*) = V_{\mathrm{row}}(A)$. In this case, we can take $P$, $Q$ orthogonal, and $A = P\,\Sigma\,{}^tQ$ with the $\{\hat q_i\}$ appearing as the rows of ${}^tQ$.
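Here is a minimal NumPy sketch of Definition 18 and the projection/Penrose identities it implies; in practice one simply calls numpy.linalg.pinv, which does the same thing via the SVD:

import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3)) @ rng.standard_normal((3, 4))   # 5 x 4, rank <= 3

P, mu, Qh = np.linalg.svd(A)                   # A = P Sigma Qh (full matrices)
r = int(np.sum(mu > 1e-10 * mu[0]))            # numerical rank
Sigma_plus = np.zeros((A.shape[1], A.shape[0]))
Sigma_plus[:r, :r] = np.diag(1.0 / mu[:r])     # invert only the nonzero singular values
A_plus = Qh.conj().T @ Sigma_plus @ P.conj().T # A+ = Q Sigma+ P*

print(np.allclose(A @ A_plus @ A, A))                   # A A+ A = A
print(np.allclose(A_plus @ A @ A_plus, A_plus))         # A+ A A+ = A+
print(np.allclose((A_plus @ A).conj().T, A_plus @ A))   # A+ A is Hermitian: it is Pr onto V_col(A*)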

EXAMPLE 21. Consider the rank-one matrix
$$A = \begin{pmatrix} 1 & 1\\ 1 & 1\\ 2 & 2\end{pmatrix}.$$
We have
$${}^t\!AA = \begin{pmatrix} 6 & 6\\ 6 & 6\end{pmatrix} = Q\,\mathrm{diag}\{\mu_1^2, 0\}\,{}^tQ \qquad\text{with}\quad \mu_1 = 2\sqrt 3\ \text{ and }\ Q = \tfrac{1}{\sqrt 2}\begin{pmatrix} 1 & 1\\ 1 & -1\end{pmatrix}.$$
Now $\hat p_1 = \mu_1^{-1}A\hat q_1 = \tfrac{1}{\sqrt 6}\begin{pmatrix}1\\ 1\\ 2\end{pmatrix}$, while $\hat p_2$, $\hat p_3$ have to be an o.n. basis of $\ker({}^t\!A)$. So
$$A = \begin{pmatrix} \tfrac{1}{\sqrt 6} & \tfrac{1}{\sqrt 2} & \tfrac{1}{\sqrt 3}\\ \tfrac{1}{\sqrt 6} & -\tfrac{1}{\sqrt 2} & \tfrac{1}{\sqrt 3}\\ \tfrac{2}{\sqrt 6} & 0 & -\tfrac{1}{\sqrt 3}\end{pmatrix}\begin{pmatrix} 2\sqrt 3 & 0\\ 0 & 0\\ 0 & 0\end{pmatrix}\,\tfrac{1}{\sqrt 2}\begin{pmatrix} 1 & 1\\ 1 & -1\end{pmatrix}$$
is a SVD, and
$$A^{\dagger} = Q\,\Sigma^{\dagger}\,{}^tP = Q\begin{pmatrix} \tfrac{1}{2\sqrt 3} & 0 & 0\\ 0 & 0 & 0\end{pmatrix}{}^tP = \tfrac{1}{12}\begin{pmatrix} 1 & 1 & 2\\ 1 & 1 & 2\end{pmatrix}$$
the pseudoinverse. So for instance $A^{\dagger}A = \tfrac12\begin{pmatrix} 1 & 1\\ 1 & 1\end{pmatrix}$ does indeed orthogonally project onto $V_{\mathrm{row}}(A) = \mathrm{span}\Big\{\begin{pmatrix}1\\ 1\end{pmatrix}\Big\}$.

Applications to least squares and linear regression. Let $T\colon V\to W$ be an arbitrary linear transformation, with $V = \mathbb{C}^n$ and $W = \mathbb{C}^m$ equipped with the dot product, and $A := {}_{\hat e}[T]_{\hat e}$. Write $\vec v\in V$, $\vec w\in W$, and $[\vec v]_{\hat e} = x$, $[\vec w]_{\hat e} = y$.

DEFINITION 22. For any fixed $w\in W$, a least-squares solution (LSS) to $Tv = w$ (or equivalently $Ax = y$) is any $\tilde v\in V$ minimizing $\|T\tilde v - w\|$. The minimal LSS $\tilde v_{\min}$ is the (unique) LSS minimizing $\|\tilde v\|$.

The minimum distance from $w$ to $\mathrm{im}(T)$ is, of course, the distance from $w$ to the orthogonal projection $\tilde w := \Pr_{\mathrm{im}(T)}w$,

[Figure: the vector $w$, its orthogonal projection $\tilde w$ onto the subspace $\mathrm{im}(T)$, and a point $T\tilde v\in\mathrm{im}(T)$.]

and so $\tilde v$ is just any solution to
(8) $\quad T\tilde v = \tilde w\quad (\Leftrightarrow\ A\tilde x = \tilde y).$

THEOREM 23. (i) The least-squares solutions of $Tv = w$ are the solutions of the normal equations
(9) $\quad T^*T\tilde v = T^*w\quad (\Leftrightarrow\ A^*A\tilde x = A^*y).$
(ii) The minimal LSS of $Tv = w$ is
(10) $\quad \tilde v_{\min} = T^{\dagger}w.$

PROOF. (i) Since $\mathrm{im}(T)^{\perp} = \ker(T^*)$, $\tilde v$ satisfies (9) $\iff$ $w - T\tilde v\perp\mathrm{im}(T)$ $\iff$ $\tilde v$ satisfies (8).
(ii) By Corollary 16, $\tilde w = TT^{\dagger}w$. So $\tilde v$ satisfies (8) $\iff$ $\kappa := \tilde v - T^{\dagger}w\in\ker(T)$. But $\ker(T)\perp\mathrm{im}(T^*)$ $\Rightarrow$ $\|\tilde v\| = \|T^{\dagger}w + \kappa\|$ is minimized by $\kappa = 0$.

Now $T^*T$ is invertible iff $T$ is 1-to-1 (cf. Exercise (1)), in which case $\tilde v$ ($= \tilde v_{\min}$) is unique and the normal equations become
(11) $\quad \tilde v = (T^*T)^{-1}T^*w\quad (\Leftrightarrow\ \tilde x = (A^*A)^{-1}A^*y).$
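A quick numerical sketch of Theorem 23 in coordinates: for a full-column-rank $A$, the normal equations (9)/(11) and the pseudoinverse formula (10) give the same least-squares solution (numpy.linalg.lstsq and pinv are the practical routes):

import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 3))                  # tall, generically of rank 3
y = rng.standard_normal(6)

x_normal = np.linalg.solve(A.T @ A, A.T @ y)     # (A*A)^{-1} A* y, eq. (11)
x_pinv   = np.linalg.pinv(A) @ y                 # A+ y, eq. (10)
x_lstsq  = np.linalg.lstsq(A, y, rcond=None)[0]  # library least squares

print(np.allclose(x_normal, x_pinv), np.allclose(x_pinv, x_lstsq))   # True True
print(np.allclose(A.T @ (A @ x_pinv - y), 0))    # the residual is orthogonal to im(A)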

If $\ker(T)\neq\{0\}$, then (9) is less convenient and (10) seems the better result. But it turns out that if $A$ is large, computationally (10) is more useful than the normal equations regardless of invertibility of $A^*A$. This is because the SVD gives
(12) $\quad \tilde x_{\min} = A^{\dagger}y = Q\,\Sigma^{\dagger}P^{*}y,$
and $\Sigma^{\dagger}$ involves inverting the $\{\mu_i\}$, whereas $(A^*A)^{-1}$ involves inverting the $\{\mu_i^2\}$. If some nonzero $\mu_i$'s differ by orders of magnitude, the computation of $(A^*A)^{-1}$ will therefore introduce significantly more error than (12). That said, the normal equations are typically simpler for small examples, as we'll now see.
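Before the example, a two-line illustration of the $\mu_i$-versus-$\mu_i^2$ point (a generic sketch, not tied to the example below): forming $A^*A$ squares the condition number.

import numpy as np

A = np.array([[1., 1.],
              [1., 1.0001],
              [1., 1.0002]])              # nearly collinear columns
print(np.linalg.cond(A))                  # large: ratio of extreme singular values
print(np.linalg.cond(A.T @ A))            # roughly the square of the above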

EXAMPLE 24. We find the LSS for the linear system $A\vec x = \vec y$, where
$$A = \begin{pmatrix} 1 & -1\\ 1 & 0\\ 1 & 1\\ 1 & 2\end{pmatrix}$$
and $\vec y\in\mathbb{R}^4$ is a given data vector. Compute
$${}^t\!AA = \begin{pmatrix} 4 & 2\\ 2 & 6\end{pmatrix},\qquad ({}^t\!AA)^{-1} = \tfrac{1}{20}\begin{pmatrix} 6 & -2\\ -2 & 4\end{pmatrix}\ \Longrightarrow\ \tilde x = ({}^t\!AA)^{-1}\,{}^t\!A\,\vec y.$$
Writing $\mu_{\pm} = \sqrt{5\pm\sqrt 5}$, the SVD is $A = P\,\Sigma\,{}^tQ$ with $\Sigma = \mathrm{diag}_{4\times 2}\{\mu_+, \mu_-\}$, the columns of $Q$ proportional to $\binom{2}{1+\sqrt 5}$ and $\binom{2}{1-\sqrt 5}$, and the first two columns of $P$ proportional to $A\binom{2}{1\pm\sqrt 5}$; replacing the middle factor by $\Sigma^{\dagger} = \mathrm{diag}_{2\times 4}\{\mu_+^{-1}, \mu_-^{-1}\}$ gives
$$A^{\dagger} = \tfrac{1}{20}\begin{pmatrix} 8 & 6 & 4 & 2\\ -6 & -2 & 2 & 6\end{pmatrix}.$$
But using $A^{\dagger} = ({}^t\!AA)^{-1}\,{}^t\!A$ (see Exercise (1) below) is much easier!

Note that this example can be seen as a data-fitting (linear regression) problem: the line minimizing the sum of squares of the vertical errors between the data points $(X_i, Y_i)$ and the line $Y = \tilde x_1 + \tilde x_2 X$ is the one whose coefficients form the LSS $\tilde x$ computed above. Here is a slightly more interesting example.

EXAMPLE 25. Having entered the sports business, Monsanto is engaged in trying to grow a better baseball player. They've tracked a few little-leaguers and got the data

[Figure: scatter plot of $(X_1, Y_1) = (0,1)$, $(X_2, Y_2) = (1,3)$, $(X_3, Y_3) = (2,3)$, $(X_4, Y_4) = (3,2)$ with vertical errors $\varepsilon_1,\ldots,\varepsilon_4$; vertical axis $Y$ = # of home runs / season, horizontal axis $X$ = tens of miles from the Arch.]

which suggests a parabola$^{10}$
$$Y = f(X) = b_0 + b_1 X + b_2 X^2.$$

$^{10}$More seriously, rather than looking at the shape of the data set, one should try to base the form of $f$ on physical principle (see for instance Exercise (2) below).

So write
$$\begin{aligned} Y_1 &= b_0 + b_1 X_1 + b_2 X_1^2 + \varepsilon_1\\ Y_2 &= b_0 + b_1 X_2 + b_2 X_2^2 + \varepsilon_2\\ Y_3 &= b_0 + b_1 X_3 + b_2 X_3^2 + \varepsilon_3\\ Y_4 &= b_0 + b_1 X_4 + b_2 X_4^2 + \varepsilon_4,\end{aligned}$$
which translates to $\vec Y = A\vec\beta + \vec\varepsilon$, where
$$A = \begin{pmatrix} 1 & X_1 & X_1^2\\ 1 & X_2 & X_2^2\\ 1 & X_3 & X_3^2\\ 1 & X_4 & X_4^2\end{pmatrix} = \begin{pmatrix} 1 & 0 & 0\\ 1 & 1 & 1\\ 1 & 2 & 4\\ 1 & 3 & 9\end{pmatrix}.$$
Minimizing $\|\vec\varepsilon\| = \|A\vec\beta - \vec Y\|$ just means solving $A\tilde\beta = \tilde Y$, or equivalently ${}^t\!AA\,\tilde\beta = {}^t\!A\,\vec Y$, which is
$$\begin{pmatrix} 4 & 6 & 14\\ 6 & 14 & 36\\ 14 & 36 & 98\end{pmatrix}\begin{pmatrix} \tilde b_0\\ \tilde b_1\\ \tilde b_2\end{pmatrix} = \begin{pmatrix} 9\\ 15\\ 33\end{pmatrix}\ \overset{\text{RREF}}{\Longrightarrow}\ \tilde\beta = \begin{pmatrix} 1.05\\ 2.55\\ -0.75\end{pmatrix}.$$
So using $f(X) = 1.05 + 2.55X - 0.75X^2$: setting $f'(X) = 2.55 - 1.5X = 0$ shows that $f$ is maximized at $X = 1.7$. Therefore the ideal distance from the Arch for raising baseball stars must be 17 miles. Because our data wasn't questionable at all and our sample size was YUGE.
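For the record, here is a short NumPy check of the arithmetic in Example 25 (a sketch, using the data points plotted above):

import numpy as np

X = np.array([0., 1., 2., 3.])
Y = np.array([1., 3., 3., 2.])
A = np.vander(X, 3, increasing=True)            # columns 1, X, X^2

beta = np.linalg.solve(A.T @ A, A.T @ Y)        # normal equations
print(beta)                                     # [ 1.05  2.55 -0.75]
print(-beta[1] / (2 * beta[2]))                 # 1.7, i.e. 17 miles from the Arch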

Applications to data compression. One of the many other computational applications of the SVD is called Principal Component Analysis. Suppose that in a country with $N$ internet users ($N$ in the millions), you want to target ads (or fake news) based on online behavior, where the number $M$ of possible clicks or purchases is also very large. Assume however that only $k$ (cultural, demographic, etc.) attributes of a person generally determine this behavior: that is, there exists an $N\times k$ "attribute" matrix and a $k\times M$ "behavior" matrix whose product is roughly $A$, the $N\times M$ matrix of raw user data. In other words, we expect $A$ to be well-approximated by a rank-$k$ matrix, with $k$ much smaller than $M$ and $N$.

The SVD in the first form we encountered (cf. (3)) extends to nonsquare matrices: in view of (7),
(13) $\quad A = \sum_{l=1}^{r}\mu_l\,\hat p_l\,\hat q_l^{*},$
where $r = \mathrm{rank}(A)$. It turns out that
(14) $\quad A_k := \sum_{l=1}^{k}\mu_l\,\hat p_l\,\hat q_l^{*}$
is the best rank-$k$ approximation to $A$ (in the sense of minimizing the operator norm$^{11}$ $\|A - A_k\|$ among all rank-$k$ $N\times M$ matrices). In the situation just described, we'd expect $A_k$ to be very close to $A$, while requiring us to record only $k(1 + M + N)$ numbers (the $\mu_l$ and the entries of $\hat p_l$, $\hat q_l$ for $l = 1,\ldots,k$), a significant data compression vs. the $MN$ entries of $A$.

The formulas (13)–(14) are close in spirit to (discrete) Fourier analysis, which breaks a function down into its constituent frequency components — viz.,
$$A_{ij} = \sum_{l,l'}\alpha_{l,l'}\,e^{2\pi\sqrt{-1}\,il/N}\,e^{2\pi\sqrt{-1}\,jl'/M}.$$
But the SVD allows for many fewer terms in (14), since the $\hat p_l$, $\hat q_l$ are not fixed trigonometric functions, but rather adapted to the matrix $A$. On the other hand, the discrete Fourier transform doesn't require recording the $\hat p_l$ and $\hat q_l$, so it has that advantage. Fortunately for computational applications, MATLAB has efficient algorithms for both!

$^{11}$$\|M\| := \max_{\|\vec v\| = 1}\|M\vec v\|$; in fact, $\|A - A_k\| = \mu_{k+1}$. In addition, $A_k$ minimizes the "stupid" matrix norm $\|M\|' := \big(\sum_{i,j}|M_{ij}|^2\big)^{1/2}$ of $A - A_k$ as well.
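A small NumPy sketch of the truncation (14) and of the storage count, with synthetic low-rank-plus-noise data standing in for the user matrix:

import numpy as np

rng = np.random.default_rng(5)
N, M, k = 200, 120, 5
A = rng.standard_normal((N, k)) @ rng.standard_normal((k, M))   # "attributes x behavior"
A += 0.01 * rng.standard_normal((N, M))                         # small noise

P, mu, Qh = np.linalg.svd(A, full_matrices=False)
A_k = (P[:, :k] * mu[:k]) @ Qh[:k, :]            # sum over the k largest singular values

print(np.linalg.norm(A - A_k, 2))                # operator-norm error = mu_{k+1} (Eckart-Young)
print(mu[k])                                     # mu_{k+1}, for comparison
print(k * (1 + N + M), "numbers stored vs", N * M)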

Exercises

(1) (a) Show that $T$ is 1-to-1 iff $T^*T$ is invertible.
(b) Show that if $T$ is 1-to-1, then $T^{\dagger} = (T^*T)^{-1}T^{*}$. (So in this case, (10) and (11) are the same. However, the two formulas $(A^*A)^{-1}A^{*}$ and $Q\Sigma^{\dagger}P^{*}$ for $A^{\dagger}$ really do have different strengths.)

(2) As luck would have it, at time $t = 0$ bullies stuffed your backpack full of radioactive material from the Bridgeton landfill. Your legal team thinks there are two substances in there, with half-lives of one and five days respectively, but all they can do is measure the overall masses $M_1, M_2, \ldots, M_n$ of the backpack at the end of days $1, \ldots, n$. (The bullies also superglued it shut, you see.) Measurements have error, so you need to formulate a linear regression model in order to determine the original masses $A$, $B$ of the two substances at time $t = 0$. This should involve an $n\times 2$ matrix.

(3) Find a polar decomposition for
$$A = \begin{pmatrix} 0 & 4 & 0\\ 0 & 0 & \ast\\ 4 & 0 & 0\end{pmatrix}.$$

(4) Calculate the matrix SVD for $A = \begin{pmatrix} \ast & \ast\\ \ast & 0\end{pmatrix}$.

(5) Calculate the minimal least-squares solution of the system
$$x_1 + x_2 = \ast,\qquad x_1 + x_2 = \ast,\qquad x_1 - x_2 = 0.$$