The Singular Value Decomposition


We are interested in more than just symmetric positive definite (sym+def) matrices, but the eigenvalue decompositions discussed in the last section of notes will play a major role in solving general systems of equations

    y = Ax,   y ∈ R^M,   A an M × N matrix,   x ∈ R^N.

We have seen that a symmetric positive definite matrix can be decomposed as A = VΛV^T, where V is an orthogonal matrix (V^T V = VV^T = I) whose columns are the eigenvectors of A, and Λ is a diagonal matrix containing the eigenvalues of A. Because both orthogonal and diagonal matrices are trivial to invert, this eigenvalue decomposition makes it very easy to solve systems of equations y = Ax and to analyze the stability of these solutions.

The singular value decomposition (SVD) takes apart an arbitrary M × N matrix A in a similar manner. The SVD of a real-valued M × N matrix A with rank[1] R is

    A = UΣV^T,

where

1. U is an M × R matrix

       U = [u_1 u_2 ... u_R],

   whose columns u_m ∈ R^M are orthonormal. Note that while U^T U = I, in general UU^T ≠ I when R < M. The columns of U are an orthobasis for the range space of A.

[1] Recall that the rank of a matrix is the number of linearly independent columns (which is always equal to the number of linearly independent rows).

2. V is an N × R matrix

       V = [v_1 v_2 ... v_R],

   whose columns v_n ∈ R^N are orthonormal. Again, while V^T V = I, in general VV^T ≠ I when R < N. The columns of V are an orthobasis for the range space of A^T (recall that Range(A^T) consists of everything which is orthogonal to the null space of A).

3. Σ is an R × R diagonal matrix with positive entries:

       Σ = diag(σ_1, σ_2, ..., σ_R).

   We call the σ_r the singular values of A. By convention, we will order them such that σ_1 ≥ σ_2 ≥ ... ≥ σ_R.

4. The v_1, ..., v_R are eigenvectors of the positive semi-definite matrix A^T A. Note that

       A^T A = VΣU^T UΣV^T = VΣ^2 V^T,

   and so the singular values σ_1, ..., σ_R are the square roots of the non-zero eigenvalues of A^T A.

5. Similarly, AA^T = UΣ^2 U^T, and so the u_1, ..., u_R are eigenvectors of the positive semi-definite matrix AA^T. Since the non-zero eigenvalues of A^T A and AA^T are the same, the σ_r are also the square roots of the non-zero eigenvalues of AA^T.

The rank R is the dimension of the space spanned by the columns of A; this is the same as the dimension of the space spanned by the rows. Thus R ≤ min(M, N). We say A is full rank if R = min(M, N).

As before, we will oftentimes find it useful to write the SVD as the sum of R rank-1 matrices:

    A = UΣV^T = ∑_{r=1}^{R} σ_r u_r v_r^T.

When A is overdetermined (M > N), U is a tall M × R matrix, while Σ (R × R) and V^T (R × N) are comparatively small. When A is underdetermined (M < N), V^T is a wide R × N matrix, while U (M × R) and Σ are comparatively small. When A is square and full rank (M = N = R), U, Σ = diag(σ_1, ..., σ_N), and V^T are all N × N.
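These properties are easy to explore numerically. The sketch below assumes NumPy is available; the test matrix, the seed, and the variable names are ours, chosen only for illustration. It builds a rank-deficient matrix, keeps the R nonzero singular values, and checks the orthonormality and rank-1-sum properties listed above.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, R = 8, 5, 3

# Build an M x N matrix of rank R as a product of two random factors.
A = rng.standard_normal((M, R)) @ rng.standard_normal((R, N))

# full_matrices=False gives the "economy" SVD; keep only the R nonzero singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
U, s, Vt = U[:, :R], s[:R], Vt[:R, :]

print(np.linalg.matrix_rank(A))             # R
print(np.allclose(U.T @ U, np.eye(R)))      # columns of U are orthonormal
print(np.allclose(Vt @ Vt.T, np.eye(R)))    # columns of V are orthonormal
print(np.allclose(A, U @ np.diag(s) @ Vt))  # A = U Sigma V^T

# A as the sum of R rank-1 matrices sigma_r u_r v_r^T.
A_sum = sum(s[r] * np.outer(U[:, r], Vt[r, :]) for r in range(R))
print(np.allclose(A, A_sum))
```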

Technical Details: Existence of the SVD

In this section we will prove that any M × N matrix A with rank(A) = R can be written as A = UΣV^T, where U, Σ, and V have the five properties listed at the beginning of the last section.

Since A^T A is symmetric positive semi-definite, we can write

    A^T A = ∑_{n=1}^{N} λ_n v_n v_n^T,

where the v_n are orthonormal and the λ_n are real and non-negative. Since rank(A) = R, we also have rank(A^T A) = R, and so λ_1, ..., λ_R above are all strictly positive and λ_{R+1} = ... = λ_N = 0. Set

    u_m = (1/√λ_m) A v_m,  for m = 1, ..., R,   U = [u_1 ... u_R].

Notice that these u_m are orthonormal, as

    <u_m, u_l> = (1/√(λ_m λ_l)) v_l^T A^T A v_m = (λ_m/√(λ_m λ_l)) v_l^T v_m = { 1, m = l;  0, m ≠ l }.

These u_m also happen to be eigenvectors of AA^T, as

    AA^T u_m = (1/√λ_m) A A^T A v_m = (λ_m/√λ_m) A v_m = λ_m u_m.

Now let u_{R+1}, ..., u_M be an orthobasis for the null space of U^T; concatenating these two sets into u_1, ..., u_M forms an orthobasis for all of R^M.

Let V = [v_1 v_2 ... v_R]. In addition, let

    V_0 = [v_{R+1} v_{R+2} ... v_N],   V_full = [V V_0],

and

    U_0 = [u_{R+1} u_{R+2} ... u_M],   U_full = [U U_0].

It should be clear that V_full is an N × N orthonormal matrix and U_full is an M × M orthonormal matrix. Consider the M × N matrix U_full^T A V_full; the entry in the m-th row and n-th column of this matrix is

    (U_full^T A V_full)[m, n] = u_m^T A v_n = { √λ_n u_m^T u_n,  n = 1, ..., R;   0,  n = R+1, ..., N }
                              = { √λ_n,  m = n = 1, ..., R;   0,  otherwise }.

Thus

    U_full^T A V_full = Σ_full,   where   Σ_full[m, n] = { √λ_n,  m = n = 1, ..., R;   0,  otherwise }.

Since U_full U_full^T = I and V_full V_full^T = I, we have

    A = U_full Σ_full V_full^T.

Since Σ_full is non-zero only in the first R locations along its main diagonal, the above reduces to

    A = UΣV^T,   Σ = diag(√λ_1, √λ_2, ..., √λ_R).
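The construction in this proof can be mirrored numerically: take an eigendecomposition of A^T A, keep the strictly positive eigenvalues, and form u_m = A v_m / √λ_m. A minimal sketch, assuming NumPy (the test matrix and names are ours):

```python
import numpy as np

rng = np.random.default_rng(1)
M, N, R = 7, 5, 3
A = rng.standard_normal((M, R)) @ rng.standard_normal((R, N))

# Eigendecomposition of the symmetric PSD matrix A^T A (eigenvalues come back sorted ascending).
lam, V_all = np.linalg.eigh(A.T @ A)
keep = lam > 1e-10 * lam.max()          # the R strictly positive eigenvalues
lam, V = lam[keep], V_all[:, keep]

U = A @ V / np.sqrt(lam)                # u_m = A v_m / sqrt(lambda_m), column by column
Sigma = np.diag(np.sqrt(lam))           # singular values are square roots of the eigenvalues

print(np.allclose(U.T @ U, np.eye(R)))      # the constructed u_m are orthonormal
print(np.allclose(A, U @ Sigma @ V.T))      # and A = U Sigma V^T
print(np.allclose(A @ A.T @ U, U * lam))    # the u_m are eigenvectors of A A^T
```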

The Least-Squares Problem

We can use the SVD to solve the general system of linear equations y = Ax, where y ∈ R^M, x ∈ R^N, and A is an M × N matrix. Given y, we want to find x in such a way that

1. when there is a unique solution, we return it;
2. when there is no solution, we return something reasonable;
3. when there are an infinite number of solutions, we choose one to return in a smart way.

The least-squares framework revolves around finding an x that minimizes the length of the residual r = y − Ax. That is, we want to solve the optimization problem

    minimize_{x ∈ R^N} ||y − Ax||_2^2,    (1)

where ||·||_2 is the standard Euclidean norm. We will see that the SVD of A,

    A = UΣV^T,    (2)

plays a pivotal role in solving this problem. To start, note that we can write any x ∈ R^N as

    x = Vα + V_0 α_0.    (3)

Here, V is the N × R matrix appearing in the SVD (2), and V_0 is an N × (N − R) matrix whose columns are orthogonal to one another and to the columns of V. We have the relations

    V^T V = I,   V_0^T V_0 = I,   V^T V_0 = 0.

You can think of V_0 as an orthobasis for the null space of A. Of course, V_0 is not unique, as there are many orthobases for Null(A), but any such set of vectors will serve our purposes here.

The decomposition (3) is possible since Range(A^T) and Null(A) partition R^N for any M × N matrix A. Taking

    α = V^T x,   α_0 = V_0^T x,

we see that (3) holds since

    Vα + V_0 α_0 = VV^T x + V_0 V_0^T x = (VV^T + V_0 V_0^T) x = x,

where we have made use of the fact that VV^T + V_0 V_0^T = I, because VV^T and V_0 V_0^T are ortho-projectors onto complementary subspaces[2] of R^N. So we can solve for x ∈ R^N by solving for the pair α ∈ R^R, α_0 ∈ R^{N−R}.

Similarly, we can decompose y as

    y = Uβ + U_0 β_0,    (4)

where U is the M × R matrix from the SVD, and U_0 is an M × (M − R) complementary orthogonal basis.

[2] Subspaces S_1 and S_2 are complementary in R^N if S_1 ⊥ S_2 (everything in S_1 is orthogonal to everything in S_2) and S_1 ⊕ S_2 = R^N. You can think of S_1, S_2 as a partition of R^N into two orthogonal subspaces.

Again, U^T U = I, U_0^T U_0 = I, and U^T U_0 = 0, and we can think of U_0 as an orthogonal basis for everything in R^M that is not in the range of A. As before, we can calculate the decomposition above using

    β = U^T y,   β_0 = U_0^T y.

Using the decompositions (2), (3), and (4) for A, x, and y, we can write the residual r = y − Ax as

    r = Uβ + U_0 β_0 − UΣV^T(Vα + V_0 α_0)
      = Uβ + U_0 β_0 − UΣα    (since V^T V = I and V^T V_0 = 0)
      = U_0 β_0 + U(β − Σα).

We want to choose the α that minimizes the energy of r:

    ||r||_2^2 = <U_0 β_0 + U(β − Σα), U_0 β_0 + U(β − Σα)>
              = <U_0 β_0, U_0 β_0> + 2<U_0 β_0, U(β − Σα)> + <U(β − Σα), U(β − Σα)>
              = ||β_0||_2^2 + ||β − Σα||_2^2,

where the last equality comes from the facts that U_0^T U_0 = I, U^T U = I, and U^T U_0 = 0.

We have no control over ||β_0||_2^2, since it is determined entirely by our observations y. Therefore, our problem has been reduced to finding the α that minimizes the second term ||β − Σα||_2^2 above, which is non-negative. We can make it zero (i.e. as small as possible) by taking

    α̂ = Σ^{-1} β.

Finally, the x which minimizes the residual (and so solves (1)) is

    x̂ = V α̂ = VΣ^{-1} β = VΣ^{-1}U^T y.    (5)

Thus we can calculate the solution to (1) simply by applying the linear operator VΣ^{-1}U^T to the input data y.

There are two interesting facts about the solution x̂ in (5):

1. When y ∈ span({u_1, ..., u_R}), we have β_0 = U_0^T y = 0, and so the residual r = 0. In this case, there is at least one exact solution, and the one we choose satisfies A x̂ = y.

2. Note that if R < N, then the solution is not unique. In this case, V_0 has at least one column, and any part of a vector x in the range of V_0 is not seen by A, since

    A V_0 α_0 = UΣV^T V_0 α_0 = 0    (since V^T V_0 = 0).

As such, x = x̂ + V_0 α_0 for any α_0 ∈ R^{N−R} will have exactly the same residual, since Ax = A x̂. In this case, our solution x̂ is the solution with smallest norm, since

    ||x||_2^2 = <x̂ + V_0 α_0, x̂ + V_0 α_0>
              = <x̂, x̂> + 2<x̂, V_0 α_0> + <V_0 α_0, V_0 α_0>
              = ||x̂||_2^2 + 2<VΣ^{-1}U^T y, V_0 α_0> + ||α_0||_2^2    (since V_0^T V_0 = I)
              = ||x̂||_2^2 + ||α_0||_2^2    (since V^T V_0 = 0),

which is minimized by taking α_0 = 0.
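Equation (5) and the minimum-norm property can be checked against a library solver. A minimal NumPy sketch, using a rank-deficient matrix of our own choosing so that the solution set is genuinely non-unique:

```python
import numpy as np

rng = np.random.default_rng(2)
M, N, R = 6, 8, 3                       # underdetermined and rank deficient
A = rng.standard_normal((M, R)) @ rng.standard_normal((R, N))
y = rng.standard_normal(M)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
U, s, Vt = U[:, :R], s[:R], Vt[:R, :]

# x_hat = V Sigma^{-1} U^T y, the minimum-norm least-squares solution (5).
x_hat = Vt.T @ ((U.T @ y) / s)

# NumPy's least-squares and pseudo-inverse routines return the same vector.
x_lstsq = np.linalg.lstsq(A, y, rcond=None)[0]
print(np.allclose(x_hat, x_lstsq))
print(np.allclose(x_hat, np.linalg.pinv(A) @ y))
```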

To summarize, x̂ = VΣ^{-1}U^T y has the desired properties stated at the beginning of this module, since

1. when y = Ax has a unique exact solution, it must be x̂;
2. when an exact solution is not available, x̂ is the solution to (1);
3. when there are an infinite number of minimizers to (1), x̂ is the one with smallest norm.

Because the matrix VΣ^{-1}U^T gives us such an elegant solution to this problem, we give it a special name: the pseudo-inverse.

The Pseudo-Inverse

The pseudo-inverse of a matrix A with singular value decomposition (SVD) A = UΣV^T is

    A† = VΣ^{-1}U^T.    (6)

Other names for A† include the natural inverse, the Lanczos inverse, and the Moore-Penrose inverse. Given an observation y, taking x̂ = A† y gives us the least-squares solution to y = Ax.

The pseudo-inverse A† always exists, since every matrix with rank R has an SVD A = UΣV^T with Σ an R × R diagonal matrix with Σ[r, r] > 0.

When A is full rank (R = min(M, N)), we can calculate the pseudo-inverse without using the SVD. There are three cases:

When A is square and invertible (R = M = N), then A† = A^{-1}. This is easy to check: here A = UΣV^T where both U and V are N × N, and since in this case VV^T = V^T V = I and UU^T = U^T U = I,

    A† A = VΣ^{-1}U^T UΣV^T = VΣ^{-1}ΣV^T = VV^T = I.

Similarly, AA† = I, and so A† is both a left and a right inverse of A, and thus A† = A^{-1}.

When A has more rows than columns and has full column rank (R = N ≤ M), then A^T A is invertible, and

    A† = (A^T A)^{-1} A^T.    (7)

This type of A is tall and skinny, and its columns are linearly independent. To verify equation (7), recall that

    A^T A = VΣU^T UΣV^T = VΣ^2 V^T,

and so

    (A^T A)^{-1} A^T = VΣ^{-2}V^T VΣU^T = VΣ^{-1}U^T,

which is exactly the content of (6).

When A has more columns than rows and has full row rank (R = M ≤ N), then AA^T is invertible, and

    A† = A^T (AA^T)^{-1}.    (8)

This occurs when A is short and fat, and its rows are linearly independent. To verify equation (8), recall that

    AA^T = UΣV^T VΣU^T = UΣ^2 U^T,

and so

    A^T (AA^T)^{-1} = VΣU^T UΣ^{-2} U^T = VΣ^{-1}U^T,

which again is exactly (6).
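For full-rank A, formulas (7) and (8) are easy to verify numerically against the SVD expression (6). A hedged sketch, assuming NumPy (the test matrices are ours; np.linalg.pinv computes VΣ^{-1}U^T):

```python
import numpy as np

rng = np.random.default_rng(3)

# Tall, full column rank: pinv(A) = (A^T A)^{-1} A^T, equation (7).
A = rng.standard_normal((8, 4))
print(np.allclose(np.linalg.inv(A.T @ A) @ A.T, np.linalg.pinv(A)))

# Short and fat, full row rank: pinv(B) = B^T (B B^T)^{-1}, equation (8).
B = rng.standard_normal((4, 8))
print(np.allclose(B.T @ np.linalg.inv(B @ B.T), np.linalg.pinv(B)))

# Square and invertible: the pseudo-inverse is the ordinary inverse.
C = rng.standard_normal((5, 5))
print(np.allclose(np.linalg.pinv(C), np.linalg.inv(C)))
```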

13 than A. To find the best left-inverse, we look for the matrix which minimizes min H R N M HA I 2 F. (9) Here, F is the Frobenius norm, defined for an N M matrix Q as the sum of the squares of the entries: 3 Q 2 F = M n=1 N Q[m, n] 2 n=1 With (9), we are finding H such that HA is as close to the identity as possible in the least-squares sense. The pseudo-inverse A minimizes (9). To see this, recognize (see the exercise below) that the solution Ĥ to (9) must obey AA T Ĥ T = A. (10) We can see that this is indeed true for Ĥ = A : AA T A T = UΣV T V ΣU T UΣ 1 V T = UΣV T = A. So there is no N M matrix that is closer to being a left inverse than A. 3 It is also true that Q 2 F is the sum of the squares of the singular values of Q: Q 2 F = λ λ 2 p. This is something that you will prove on the next homework. 48

Right inverse. If we re-apply A to our solution x̂ = A† y, we would like the result to be as close as possible to our observations y. That is, we would like AA† to be as close to the identity as possible. Again, achieving this goal exactly is not always possible, especially if A has more rows than columns. But we can attempt to find the best right inverse, in the least-squares sense, by solving

    minimize_{H ∈ R^{N×M}} ||AH − I||_F^2.    (11)

The solution Ĥ to (11) (see the exercise below) must obey

    A^T A Ĥ = A^T.    (12)

Again, we show that A† satisfies (12), and hence is a minimizer of (11):

    A^T A A† = VΣ^2 V^T VΣ^{-1}U^T = VΣU^T = A^T.

Moral: A† = VΣ^{-1}U^T is as close (in the least-squares sense) to an inverse of A as you could possibly have.

Exercise:

1. Show that the minimizer Ĥ to (9) must obey (10). Do this by using the fact that the derivative of the functional ||HA − I||_F^2 with respect to an entry H[k, l] of H must obey

    ∂||HA − I||_F^2 / ∂H[k, l] = 0,   for all 1 ≤ k ≤ N, 1 ≤ l ≤ M,

at a solution to (9). Do the same for (11) and (12).
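The optimality conditions (10) and (12) can also be spot-checked numerically (this does not replace the derivation asked for in the exercise). A small sketch, assuming NumPy, with a wide test matrix of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((6, 9))          # wide, so no exact left inverse exists
A_pinv = np.linalg.pinv(A)

# Condition (10): A A^T H^T = A for the best left inverse H.
print(np.allclose(A @ A.T @ A_pinv.T, A))
# Condition (12): A^T A H = A^T for the best right inverse H.
print(np.allclose(A.T @ A @ A_pinv, A.T))
```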

Stability Analysis of the Pseudo-Inverse

We have seen that if we make indirect observations y ∈ R^M of an unknown vector x_0 ∈ R^N through an M × N matrix A,

    y = A x_0,

then applying the pseudo-inverse of A gives us the least-squares estimate of x_0:

    x̂_ls = A† y = VΣ^{-1}U^T y,

where A = UΣV^T is the singular value decomposition (SVD) of A.

We will now discuss what happens if our measurements contain noise. The analysis here will be very similar to when we looked at the stability of solving square sym+def systems, and in fact this is one of the main reasons we introduced the SVD. Suppose we observe

    y = A x_0 + e,

where e ∈ R^M is an unknown perturbation. Say that we again apply the pseudo-inverse to y in an attempt to recover x_0:

    x̂_ls = A† y = A† A x_0 + A† e.

What effect does the presence of the noise vector e have on our estimate of x_0? We answer this question by comparing x̂_ls to the reconstruction we would obtain if we used standard least-squares on perfectly noise-free observations y_clean = A x_0.

This noise-free reconstruction can be written as

    x_pinv = A† y_clean = A† A x_0 = VΣ^{-1}U^T UΣV^T x_0 = VV^T x_0 = ∑_{r=1}^{R} <x_0, v_r> v_r.

The vector x_pinv is the orthogonal projection of x_0 onto the row space (everything orthogonal to the null space) of A. If A has full column rank (R = N), then x_pinv = x_0. If not, then the application of A destroys the part of x_0 that is not in x_pinv, and so we only attempt to recover the visible components. In some sense, x_pinv contains all of the components of x_0 that A does not completely remove, and has them preserved perfectly.

The reconstruction error (relative to x_pinv) is

    ||x̂_ls − x_pinv||_2^2 = ||A† e||_2^2 = ||VΣ^{-1}U^T e||_2^2.    (13)

Now suppose for a moment that the error has unit norm, ||e||_2^2 = 1. Then the worst case for (13) is given by

    maximize_{e ∈ R^M} ||VΣ^{-1}U^T e||_2^2   subject to   ||e||_2 = 1.

Since the columns of U are orthonormal, ||U^T e||_2^2 ≤ ||e||_2^2, and the above is equivalent to

    max_{β ∈ R^R : ||β||_2 = 1} ||VΣ^{-1} β||_2^2.    (14)

Also, for any vector z ∈ R^R, we have

    ||Vz||_2^2 = <Vz, Vz> = <z, V^T V z> = <z, z> = ||z||_2^2,

since the columns of V are orthonormal. So we can simplify (14) to

    maximize_{β ∈ R^R} ||Σ^{-1} β||_2^2   subject to   ||β||_2 = 1.

The worst-case β (you should verify this at home) will have a 1 in the entry corresponding to the largest entry of Σ^{-1}, and will be zero everywhere else. Thus

    max_{β ∈ R^R : ||β||_2 = 1} ||Σ^{-1} β||_2^2 = max_{r=1,...,R} 1/σ_r^2 = 1/σ_R^2.

(Recall that by convention, we order the singular values so that σ_1 ≥ σ_2 ≥ ... ≥ σ_R.) Returning to the reconstruction error (13), we now see that

    ||x̂_ls − x_pinv||_2^2 = ||VΣ^{-1}U^T e||_2^2 ≤ ||e||_2^2 / σ_R^2.

Since U is an M × R matrix, it is possible when R < M that the reconstruction error is zero. This happens when e is orthogonal to every column of U, i.e. U^T e = 0. Putting this together with the work above means

    0 ≤ ||U^T e||_2^2 / σ_1^2 ≤ ||x̂_ls − x_pinv||_2^2 ≤ ||U^T e||_2^2 / σ_R^2 ≤ ||e||_2^2 / σ_R^2.

Notice that if σ_R is small, the worst-case reconstruction error can be very bad.

We can also relate the average-case error to the singular values. Say that e is additive Gaussian white noise; that is, each entry e[m] is a random variable independent of all the other entries, and distributed e[m] ~ Normal(0, ν^2). Then, as we have argued before, the average measurement error is E[||e||_2^2] = M ν^2, and the average reconstruction error[4] is

    E[||A† e||_2^2] = ν^2 trace((A†)^T A†)
                    = ν^2 (1/σ_1^2 + 1/σ_2^2 + ... + 1/σ_R^2)
                    = (1/M)(1/σ_1^2 + 1/σ_2^2 + ... + 1/σ_R^2) E[||e||_2^2].

Again, if σ_R is tiny, 1/σ_R^2 will dominate the sum above, and the average reconstruction error will be quite large.

Exercise: Let D be a diagonal R × R matrix whose diagonal elements are positive. Show that the maximizer β̂ of

    maximize_{β ∈ R^R} ||Dβ||_2^2   subject to   ||β||_2 = 1

has a 1 in the entry corresponding to the largest diagonal element of D, and is 0 elsewhere.

[4] We are using the fact that if e is a vector of iid Gaussian random variables, e ~ Normal(0, ν^2 I), then for any matrix M, E[||Me||_2^2] = ν^2 trace(M^T M). We will argue this carefully as part of the next homework.
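Both the worst-case factor 1/σ_R^2 and the average factor involving the sum of the 1/σ_r^2 show up clearly in simulation. A sketch, assuming NumPy; the matrix (built with one deliberately tiny singular value), the noise level, and all names are ours:

```python
import numpy as np

rng = np.random.default_rng(5)
M, N = 20, 5
nu = 1e-3                                   # noise standard deviation

# Build A = U diag(sigma) V^T with one tiny singular value.
U, _ = np.linalg.qr(rng.standard_normal((M, N)))
V, _ = np.linalg.qr(rng.standard_normal((N, N)))
sigma = np.array([1.0, 0.8, 0.5, 0.1, 1e-4])
A = U @ np.diag(sigma) @ V.T
A_pinv = np.linalg.pinv(A)

x0 = rng.standard_normal(N)
errs = []
for _ in range(2000):
    e = nu * rng.standard_normal(M)
    x_ls = A_pinv @ (A @ x0 + e)
    errs.append(np.sum((x_ls - x0) ** 2))   # here x_pinv = x0, since A has full column rank

print(np.mean(errs))                        # empirical average reconstruction error
print(nu**2 * np.sum(1.0 / sigma**2))       # predicted nu^2 (1/sigma_1^2 + ... + 1/sigma_R^2)
```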

Stable Reconstruction with the Truncated SVD

We have seen that if A has very small singular values and we apply the pseudo-inverse in the presence of noise, the results can be disastrous. But it doesn't have to be this way. There are several ways to stabilize the pseudo-inverse. We start by discussing the simplest one, where we simply cut out the part of the reconstruction which is causing the problems.

As before, we are given noisy indirect observations of a vector x through an M × N matrix A:

    y = Ax + e.    (15)

The matrix A has SVD A = UΣV^T and pseudo-inverse A† = VΣ^{-1}U^T. We can rewrite A as a sum of rank-1 matrices:

    A = ∑_{r=1}^{R} σ_r u_r v_r^T,

where R is the rank of A, the σ_r are the singular values, and u_r ∈ R^M and v_r ∈ R^N are columns of U and V, respectively. Similarly, we can write the pseudo-inverse as

    A† = ∑_{r=1}^{R} (1/σ_r) v_r u_r^T.

Given y as above, we can write the least-squares estimate of x from the noisy measurements as

    x̂_ls = A† y = ∑_{r=1}^{R} (1/σ_r) <y, u_r> v_r.    (16)

As we can see (and have seen before), if any one of the σ_r is very small, the least-squares reconstruction can be a disaster. A simple way to avoid this is to truncate the sum (16), leaving out the terms where σ_r is too small (1/σ_r is too big). Exactly how many terms to keep depends a great deal on the application, as there are competing interests: on the one hand, we want to ensure that each of the σ_r we include has an inverse of reasonable size; on the other, we want the reconstruction to be accurate (i.e. not to deviate from the noiseless least-squares solution by too much).

We form an approximation A' to A by taking

    A' = ∑_{r=1}^{R'} σ_r u_r v_r^T,

for some R' < R. Again, our final answer will depend on which R' we use, and choosing R' is oftentimes something of an art. It is clear that the approximation A' has rank R'. Note that the pseudo-inverse of A' is also a truncated sum:

    (A')† = ∑_{r=1}^{R'} (1/σ_r) v_r u_r^T.

Given noisy data y as in (15), we reconstruct x by applying the truncated pseudo-inverse to y:

    x̂_trunc = (A')† y = ∑_{r=1}^{R'} (1/σ_r) <y, u_r> v_r.

How good is this reconstruction? To answer this question, we will compare it to the noiseless least-squares reconstruction x_pinv = A† y_clean, where y_clean = Ax are noiseless measurements of x. The difference between these two is the reconstruction error (relative to x_pinv):

    x̂_trunc − x_pinv = (A')† y − A† A x
                      = (A')† A x + (A')† e − A† A x
                      = ((A')† − A†) A x + (A')† e.

Proceeding further, we can write the matrix (A')† − A† as

    (A')† − A† = − ∑_{r=R'+1}^{R} (1/σ_r) v_r u_r^T,

and so the first term in the reconstruction error can be written as

    ((A')† − A†) A x = − ∑_{r=R'+1}^{R} (1/σ_r) <Ax, u_r> v_r
                     = − ∑_{r=R'+1}^{R} (1/σ_r) ∑_{j=1}^{R} σ_j <x, v_j> <u_j, u_r> v_r
                     = − ∑_{r=R'+1}^{R} <x, v_r> v_r    (since <u_r, u_j> = 0 unless j = r).

The second term in the reconstruction error can also be expanded against the v_r:

    (A')† e = ∑_{r=1}^{R'} (1/σ_r) <e, u_r> v_r.

Combining these expressions, the reconstruction error can be written as

    x̂_trunc − x_pinv = ∑_{r=1}^{R'} (1/σ_r) <e, u_r> v_r − ∑_{k=R'+1}^{R} <x, v_k> v_k
                      = Noise error + Approximation error,

where the first sum is the noise error and the second is the approximation error. Since the v_r are mutually orthogonal, and the two sums run over disjoint index sets, the noise error and the approximation error will be orthogonal. Also,

    ||x̂_trunc − x_pinv||_2^2 = ||Noise error||_2^2 + ||Approximation error||_2^2
                             = ∑_{r=1}^{R'} (1/σ_r^2) <e, u_r>^2 + ∑_{r=R'+1}^{R} <x, v_r>^2.

The reconstruction error, then, is signal dependent and will depend on how much of the vector x is concentrated in the subspace spanned by v_{R'+1}, ..., v_R. We will lose everything in this subspace; if it contains a significant part of x, then there is not much least-squares can do for you.

The worst-case noise error occurs when e is aligned with u_{R'}:

    ||Noise error||_2^2 = ∑_{r=1}^{R'} (1/σ_r^2) <e, u_r>^2 ≤ (1/σ_{R'}^2) ||e||_2^2.

As seen before, if the error e is random, this bound is a bit pessimistic. Specifically, if each entry of e is an independent identically distributed Normal random variable with mean zero and variance ν^2, then the expected noise error in the reconstruction will be

    E[||Noise error||_2^2] = (1/M)(1/σ_1^2 + 1/σ_2^2 + ... + 1/σ_{R'}^2) E[||e||_2^2].
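The split into noise error and approximation error can be reproduced numerically by truncating the SVD sum at R' terms. A sketch under the same kind of synthetic setup as before (NumPy assumed; the matrix, the choice R' = 4, and all names are ours):

```python
import numpy as np

rng = np.random.default_rng(6)
M, N = 20, 5
U, _ = np.linalg.qr(rng.standard_normal((M, N)))
V, _ = np.linalg.qr(rng.standard_normal((N, N)))
sigma = np.array([1.0, 0.8, 0.5, 0.1, 1e-4])
A = U @ np.diag(sigma) @ V.T

x0 = rng.standard_normal(N)
e = 1e-3 * rng.standard_normal(M)
y = A @ x0 + e

Rp = 4                                        # keep the 4 largest singular values (R' = 4)
x_trunc = sum((1.0 / sigma[r]) * (U[:, r] @ y) * V[:, r] for r in range(Rp))
x_pinv = np.linalg.pinv(A) @ (A @ x0)         # noise-free least-squares reconstruction

noise_err = sum((U[:, r] @ e) ** 2 / sigma[r] ** 2 for r in range(Rp))
approx_err = sum((x0 @ V[:, r]) ** 2 for r in range(Rp, N))
print(np.sum((x_trunc - x_pinv) ** 2))        # total squared reconstruction error ...
print(noise_err + approx_err)                 # ... equals noise error + approximation error
```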

Stable Reconstruction using Tikhonov Regularization

Tikhonov[5] regularization is another way to stabilize the least-squares recovery. It has the nice features that: 1) it can be interpreted using optimization, and 2) it can be computed without direct knowledge of the SVD of A.

Recall that we motivated the pseudo-inverse by showing that x̂_ls = A† y is a solution to

    minimize_{x ∈ R^N} ||y − Ax||_2^2.    (17)

When A has full column rank, x̂_ls is the unique solution; otherwise it is the solution with smallest energy. When A has full column rank but has singular values which are very small, huge variations in x (in directions of the singular vectors v_k corresponding to the tiny σ_k) can have very little effect on the residual ||y − Ax||_2^2. As such, the solution to (17) can have wildly inaccurate components in the presence of even mild noise.

One way to counteract this problem is to modify (17) with a regularization term that penalizes the size of the solution ||x||_2^2 as well as the residual error ||y − Ax||_2^2:

    minimize_{x ∈ R^N} ||y − Ax||_2^2 + δ ||x||_2^2.    (18)

The parameter δ > 0 gives us a trade-off between accuracy and regularization.

[5] Andrey Tikhonov was a 20th-century Russian mathematician.

We want to choose δ small enough so that the residual for the solution of (18) is close to that of (17), and large enough so that the problem is well-conditioned.

Just as with (17), which is solved by applying the pseudo-inverse to y, we can write the solution to (18) in closed form. To see this, recall that we can decompose any x ∈ R^N as

    x = Vα + V_0 α_0,

where V is the N × R matrix (with orthonormal columns) used in the SVD of A, and V_0 is an N × (N − R) matrix whose columns are an orthogonal basis for the null space of A. This means that the columns of V_0 are orthogonal to each other and to all of the columns of V. Similarly, we can decompose y as

    y = Uβ + U_0 β_0,

where U is the M × R matrix used in the SVD of A, and the columns of U_0 are an orthogonal basis for the left null space of A (everything in R^M that is not in the range of A). For any x, we can write

    y − Ax = Uβ + U_0 β_0 − UΣV^T(Vα + V_0 α_0) = U(β − Σα) + U_0 β_0.

Since the columns of U are orthonormal, U^T U = I, and also U_0^T U_0 = I and U^T U_0 = 0, we have

    ||y − Ax||_2^2 = <U(β − Σα) + U_0 β_0, U(β − Σα) + U_0 β_0> = ||β − Σα||_2^2 + ||β_0||_2^2,

and

    ||x||_2^2 = ||α||_2^2 + ||α_0||_2^2.

Using these facts, we can write the functional in (18) as

    ||y − Ax||_2^2 + δ ||x||_2^2 = ||β − Σα||_2^2 + ||β_0||_2^2 + δ ||α||_2^2 + δ ||α_0||_2^2.    (19)

We want to choose the α and α_0 that minimize (19). It is clear that, just as in the standard least-squares problem, we need α_0 = 0. The part of the functional that depends on α can be rewritten as

    ||β − Σα||_2^2 + δ ||α||_2^2 = ∑_{k=1}^{R} (β[k] − σ_k α[k])^2 + δ α[k]^2.    (20)

We can minimize this sum simply by minimizing each term independently. Since

    d/dα[k] [ (β[k] − σ_k α[k])^2 + δ α[k]^2 ] = −2 σ_k β[k] + 2 σ_k^2 α[k] + 2 δ α[k],

we need

    α[k] = (σ_k / (σ_k^2 + δ)) β[k].

Putting this back in vector form, (20) is minimized by

    α̂_tik = (Σ^2 + δI)^{-1} Σ β,

and so the minimizer of (18) is

    x̂_tik = V α̂_tik = V (Σ^2 + δI)^{-1} Σ U^T y.    (21)
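The closed form (21) can be checked numerically. One convenient cross-check (our own device, not from these notes) is that (18) is an ordinary least-squares problem for the stacked matrix [A; √δ I] and stacked data [y; 0]; a NumPy sketch with sizes and names of our choosing:

```python
import numpy as np

rng = np.random.default_rng(7)
M, N, delta = 12, 6, 0.05
A = rng.standard_normal((M, N))
y = rng.standard_normal(M)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Damped multipliers s_r / (s_r^2 + delta), i.e. equation (21).
x_tik = Vt.T @ ((s / (s**2 + delta)) * (U.T @ y))

# Cross-check: minimize ||y - Ax||^2 + delta ||x||^2 as a stacked least-squares problem.
A_aug = np.vstack([A, np.sqrt(delta) * np.eye(N)])
y_aug = np.concatenate([y, np.zeros(N)])
x_check = np.linalg.lstsq(A_aug, y_aug, rcond=None)[0]
print(np.allclose(x_tik, x_check))
```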

We can get a better feel for what Tikhonov regularization is doing by comparing it directly to the pseudo-inverse. The least-squares reconstruction x̂_ls can be written as

    x̂_ls = VΣ^{-1}U^T y = ∑_{r=1}^{R} (1/σ_r) <y, u_r> v_r,

while the Tikhonov reconstruction x̂_tik derived above is

    x̂_tik = ∑_{r=1}^{R} (σ_r / (σ_r^2 + δ)) <y, u_r> v_r.    (22)

Notice that when σ_r is much larger than √δ,

    σ_r / (σ_r^2 + δ) ≈ 1/σ_r,

but when σ_r is much smaller than √δ,

    σ_r / (σ_r^2 + δ) ≈ 0.

Thus the Tikhonov reconstruction modifies the important parts (components where the σ_r are large) of the pseudo-inverse very little, while ensuring that the unimportant parts (components where the σ_r are small) affect the solution only by a very small amount. This damping of the singular values is illustrated below.

[Figure: the damped multipliers σ_r/(σ_r^2 + δ) plotted against σ_r for several values of δ, together with the least-squares multiplier 1/σ_r.]

Above, we see the damped multipliers σ_r/(σ_r^2 + δ) versus σ_r for δ = 0.1 (blue), δ = 0.05 (red), and δ = 0.01 (green). The black dotted line is 1/σ_r, the least-squares multiplier. Notice that for large σ_r (σ_r > 2√δ, say), the damping has almost no effect.

This damping makes the Tikhonov reconstruction exceptionally stable; large multipliers never appear in the reconstruction (22). In fact, it is easy to check that

    σ_r / (σ_r^2 + δ) ≤ 1/(2√δ),

no matter the value of σ_r.

Tikhonov Error Analysis

Given noisy observations y = A x_0 + e, how well will Tikhonov regularization work? The answer to this question depends on multiple factors, including the choice of δ, the nature of the perturbation e, and how well x_0 can be approximated using a linear combination of the singular vectors v_r corresponding to the large (relative to δ) singular values. Since a closed-form expression for the solution to (18) exists, we can quantify these trade-offs precisely.

We compare the Tikhonov reconstruction to the reconstruction we would obtain if we used standard least-squares on perfectly noise-free observations y_clean = A x_0.

This noise-free reconstruction can be written as

    x_pinv = A† y_clean = A† A x_0 = VΣ^{-1}U^T UΣV^T x_0 = VV^T x_0 = ∑_{r=1}^{R} <x_0, v_r> v_r.

The vector x_pinv is the orthogonal projection of x_0 onto the row space (everything orthogonal to the null space) of A. If A has full column rank, then x_pinv = x_0. If not, then the application of A destroys the part of x_0 that is not in x_pinv, and so we only attempt to recover the visible components. In some sense, x_pinv contains all of the components of x_0 that we could ever hope to recover, and has them preserved perfectly.

The Tikhonov regularized solution is given by

    x̂_tik = ∑_{r=1}^{R} (σ_r / (σ_r^2 + δ)) <y, u_r> v_r
          = ∑_{r=1}^{R} (σ_r / (σ_r^2 + δ)) <e, u_r> v_r + ∑_{r=1}^{R} (σ_r / (σ_r^2 + δ)) <A x_0, u_r> v_r
          = ∑_{r=1}^{R} (σ_r / (σ_r^2 + δ)) <e, u_r> v_r + ∑_{r=1}^{R} (σ_r^2 / (σ_r^2 + δ)) <x_0, v_r> v_r.

The reconstruction error, relative to the best possible reconstruction x_pinv, is then

    x̂_tik − x_pinv = ∑_{r=1}^{R} (σ_r / (σ_r^2 + δ)) <e, u_r> v_r + ∑_{r=1}^{R} (σ_r^2/(σ_r^2 + δ) − 1) <x_0, v_r> v_r
                    = ∑_{r=1}^{R} (σ_r / (σ_r^2 + δ)) <e, u_r> v_r − ∑_{r=1}^{R} (δ / (σ_r^2 + δ)) <x_0, v_r> v_r
                    = Noise error + Approximation error,

where the first sum is the noise error and the second is the approximation error. The approximation error is signal dependent, and depends on δ. Since the v_r are orthonormal,

    ||Approximation error||_2^2 = ∑_{r=1}^{R} (δ^2 / (σ_r^2 + δ)^2) <x_0, v_r>^2.

Note that for the components much smaller than δ, i.e. σ_r^2 << δ,

    δ^2 / (σ_r^2 + δ)^2 ≈ 1,

so this portion of the approximation error will be about the same as if we had simply truncated these components. For the large components, σ_r^2 >> δ,

    δ^2 / (σ_r^2 + δ)^2 ≈ δ^2 / σ_r^4,

and so this portion of the approximation error will be very small.

For the noise error energy, we have

    ||Noise error||_2^2 = ∑_{r=1}^{R} (σ_r / (σ_r^2 + δ))^2 <e, u_r>^2 ≤ (1/(4δ)) ∑_{r=1}^{R} <e, u_r>^2 ≤ (1/(4δ)) ||e||_2^2.

The worst-case error is more or less determined by the choice of δ. The regularization makes the effective condition number of A about 1/(2√δ); no matter how small the smallest singular value is, the noise energy will not increase by more than a factor of 1/(4δ) during the reconstruction process.

As usual, the average-case error is less pessimistic. If each entry of e is an independent identically distributed Normal random variable with mean zero and variance ν^2, then the expected noise error in the reconstruction will be

    E[||Noise error||_2^2] = ∑_{r=1}^{R} (σ_r / (σ_r^2 + δ))^2 E[<e, u_r>^2]
                           = ν^2 ∑_{r=1}^{R} σ_r^2 / (σ_r^2 + δ)^2
                           = (1/M) ( ∑_{r=1}^{R} σ_r^2 / (σ_r^2 + δ)^2 ) E[||e||_2^2].    (23)

Note that

    σ_r^2 / (σ_r^2 + δ)^2 ≤ min(1/σ_r^2, 1/(4δ)),

so we can think of the error in (23) as an average of the 1/σ_r^2, with the large values simply replaced by 1/(4δ).

A Closed Form Expression

Tikhonov regularization is in some sense very similar to the truncated SVD, but with one significant advantage: we do not need to explicitly calculate the SVD to solve (18). Indeed, the solution to (18) can be written as

    x̂_tik = (A^T A + δI)^{-1} A^T y.    (24)

To see that the expression above is equivalent to (21), note that we can write A^T A as

    A^T A = VΣ^2 V^T = V_full Σ̃^2 V_full^T,

where V_full = [V V_0] is the N × N orthogonal matrix from before, and the N × N diagonal matrix Σ̃ is simply Σ padded with zeros:

    Σ̃ = [Σ 0; 0 0].

The verification of (24) is now straightforward:

    (A^T A + δI)^{-1} A^T y = (V_full Σ̃^2 V_full^T + δI)^{-1} VΣU^T y
                            = V_full (Σ̃^2 + δI)^{-1} V_full^T VΣU^T y
                            = V_full [(Σ^2 + δI)^{-1} ΣU^T y; 0]
                            = V (Σ^2 + δI)^{-1} ΣU^T y
                            = x̂_tik.

The expression (24) holds for all M, N, and R. We will leave it as an exercise to show that

    (A^T A + δI)^{-1} A^T y = A^T (AA^T + δI)^{-1} y.
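Both (24) and the exercise identity are straightforward to confirm numerically. A sketch, assuming NumPy (the sizes and names are ours):

```python
import numpy as np

rng = np.random.default_rng(8)
M, N, delta = 10, 15, 0.1
A = rng.standard_normal((M, N))
y = rng.standard_normal(M)

U, s, Vt = np.linalg.svd(A, full_matrices=False)
x_svd = Vt.T @ ((s / (s**2 + delta)) * (U.T @ y))                   # equation (21)/(22)

x_primal = np.linalg.solve(A.T @ A + delta * np.eye(N), A.T @ y)    # equation (24)
x_dual = A.T @ np.linalg.solve(A @ A.T + delta * np.eye(M), y)      # exercise identity

print(np.allclose(x_svd, x_primal), np.allclose(x_primal, x_dual))
```

The second form inverts an M × M matrix rather than an N × N one, which is the cheaper of the two when M < N.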

The importance of not needing to explicitly compute the SVD is significant when we are solving large problems. When A is large (M, N > 10^5, say) it may be expensive or even impossible to construct the SVD and compute with it explicitly. However, if A has special structure (if it is sparse, for example), then it may take many fewer than MN operations to compute a matrix-vector product Ax. In these situations, a matrix-free iterative algorithm can be used to perform the inverse required in (24). A prominent example of such an algorithm is conjugate gradients, which we will see later in this course.

Weighted Least Squares

Standard least-squares tries to fit a vector x to a set of measurements y by solving

    minimize_{x ∈ R^N} ||y − Ax||_2^2.

Now, what if some of the measurements are more reliable than others? Or, what if the errors are closely correlated between measurements? There is a systematic way to treat both of these cases using weighted least-squares. Instead of minimizing the energy in the residual,

    ||r||_2^2 = ||y − Ax||_2^2,

we will minimize

    ||Wr||_2^2 = ||Wy − WAx||_2^2,

for some M × M weighting matrix W.

When W is a diagonal matrix,

    W = diag(w_11, w_22, ..., w_MM),

then the error we are minimizing looks like

    ||Wr||_2^2 = w_11^2 r[1]^2 + w_22^2 r[2]^2 + ... + w_MM^2 r[M]^2.

By adjusting the w_mm, we can penalize some of the components of the error more than others. By adding off-diagonal terms, we can account for correlations in the error (we will explore this further later in these notes).

Solving

    minimize_{x ∈ R^N} ||Wr||_2^2 = minimize_{x ∈ R^N} ||Wy − WAx||_2^2

is simple: we use least-squares with WA as the matrix and Wy as the observations,

    x̂_wls = (WA)† Wy,

where (WA)† is the pseudo-inverse of WA. For the rest of this section, we will assume that M ≥ N (meaning that there are at least as many observations as unknowns) and that A has full column rank. This allows us to write

    x̂_wls = (A^T W^T W A)^{-1} A^T W^T W y.

Example: We measure a patient's pulse 3 times, and record

    y[1] = 70,   y[2] = 80,   y[3] = 120.

In this case, we can take

    A = [1 1 1]^T.

What is the least-squares estimate for the pulse rate x_0?

Now say that we were in a hurry when the third measurement was made, so we would like to weight it less than the others. What is the weighted least-squares estimate when

    W = diag(1, 1, w_33)?

What about the particular case when w_33 = 1/2?
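A quick numerical pass over this example, assuming NumPy; the code is a sketch of the formulas above, not part of the notes:

```python
import numpy as np

y = np.array([70.0, 80.0, 120.0])
A = np.ones((3, 1))                      # three measurements of a single pulse rate

# Ordinary least squares: the sample mean.
x_ls = np.linalg.pinv(A) @ y
print(x_ls)                              # about 90

# Weighted least squares with the third measurement down-weighted by w_33 = 1/2.
W = np.diag([1.0, 1.0, 0.5])
x_wls = np.linalg.pinv(W @ A) @ (W @ y)
print(x_wls)                             # (70 + 80 + 0.25 * 120) / 2.25 = 80
```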

IV. Matrix Approximation using Least-Squares

IV. Matrix Approximation using Least-Squares IV. Matrix Approximation using Least-Squares The SVD and Matrix Approximation We begin with the following fundamental question. Let A be an M N matrix with rank R. What is the closest matrix to A that

More information

Linear Systems. Carlo Tomasi. June 12, r = rank(a) b range(a) n r solutions

Linear Systems. Carlo Tomasi. June 12, r = rank(a) b range(a) n r solutions Linear Systems Carlo Tomasi June, 08 Section characterizes the existence and multiplicity of the solutions of a linear system in terms of the four fundamental spaces associated with the system s matrix

More information

THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR

THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR 1. Definition Existence Theorem 1. Assume that A R m n. Then there exist orthogonal matrices U R m m V R n n, values σ 1 σ 2... σ p 0 with p = min{m, n},

More information

Linear Systems. Carlo Tomasi

Linear Systems. Carlo Tomasi Linear Systems Carlo Tomasi Section 1 characterizes the existence and multiplicity of the solutions of a linear system in terms of the four fundamental spaces associated with the system s matrix and of

More information

EE731 Lecture Notes: Matrix Computations for Signal Processing

EE731 Lecture Notes: Matrix Computations for Signal Processing EE731 Lecture Notes: Matrix Computations for Signal Processing James P. Reilly c Department of Electrical and Computer Engineering McMaster University October 17, 005 Lecture 3 3 he Singular Value Decomposition

More information

Solutions to Final Practice Problems Written by Victoria Kala Last updated 12/5/2015

Solutions to Final Practice Problems Written by Victoria Kala Last updated 12/5/2015 Solutions to Final Practice Problems Written by Victoria Kala vtkala@math.ucsb.edu Last updated /5/05 Answers This page contains answers only. See the following pages for detailed solutions. (. (a x. See

More information

Singular Value Decomposition

Singular Value Decomposition Chapter 6 Singular Value Decomposition In Chapter 5, we derived a number of algorithms for computing the eigenvalues and eigenvectors of matrices A R n n. Having developed this machinery, we complete our

More information

. = V c = V [x]v (5.1) c 1. c k

. = V c = V [x]v (5.1) c 1. c k Chapter 5 Linear Algebra It can be argued that all of linear algebra can be understood using the four fundamental subspaces associated with a matrix Because they form the foundation on which we later work,

More information

Lecture notes: Applied linear algebra Part 1. Version 2

Lecture notes: Applied linear algebra Part 1. Version 2 Lecture notes: Applied linear algebra Part 1. Version 2 Michael Karow Berlin University of Technology karow@math.tu-berlin.de October 2, 2008 1 Notation, basic notions and facts 1.1 Subspaces, range and

More information

DS-GA 1002 Lecture notes 10 November 23, Linear models

DS-GA 1002 Lecture notes 10 November 23, Linear models DS-GA 2 Lecture notes November 23, 2 Linear functions Linear models A linear model encodes the assumption that two quantities are linearly related. Mathematically, this is characterized using linear functions.

More information

σ 11 σ 22 σ pp 0 with p = min(n, m) The σ ii s are the singular values. Notation change σ ii A 1 σ 2

σ 11 σ 22 σ pp 0 with p = min(n, m) The σ ii s are the singular values. Notation change σ ii A 1 σ 2 HE SINGULAR VALUE DECOMPOSIION he SVD existence - properties. Pseudo-inverses and the SVD Use of SVD for least-squares problems Applications of the SVD he Singular Value Decomposition (SVD) heorem For

More information

MATH36001 Generalized Inverses and the SVD 2015

MATH36001 Generalized Inverses and the SVD 2015 MATH36001 Generalized Inverses and the SVD 201 1 Generalized Inverses of Matrices A matrix has an inverse only if it is square and nonsingular. However there are theoretical and practical applications

More information

ECE 275A Homework #3 Solutions

ECE 275A Homework #3 Solutions ECE 75A Homework #3 Solutions. Proof of (a). Obviously Ax = 0 y, Ax = 0 for all y. To show sufficiency, note that if y, Ax = 0 for all y, then it must certainly be true for the particular value of y =

More information

Singular Value Decomposition

Singular Value Decomposition Singular Value Decomposition (Com S 477/577 Notes Yan-Bin Jia Sep, 7 Introduction Now comes a highlight of linear algebra. Any real m n matrix can be factored as A = UΣV T where U is an m m orthogonal

More information

Singular Value Decomposition

Singular Value Decomposition Singular Value Decomposition Motivatation The diagonalization theorem play a part in many interesting applications. Unfortunately not all matrices can be factored as A = PDP However a factorization A =

More information

Linear Algebra Primer

Linear Algebra Primer Linear Algebra Primer David Doria daviddoria@gmail.com Wednesday 3 rd December, 2008 Contents Why is it called Linear Algebra? 4 2 What is a Matrix? 4 2. Input and Output.....................................

More information

Notes on singular value decomposition for Math 54. Recall that if A is a symmetric n n matrix, then A has real eigenvalues A = P DP 1 A = P DP T.

Notes on singular value decomposition for Math 54. Recall that if A is a symmetric n n matrix, then A has real eigenvalues A = P DP 1 A = P DP T. Notes on singular value decomposition for Math 54 Recall that if A is a symmetric n n matrix, then A has real eigenvalues λ 1,, λ n (possibly repeated), and R n has an orthonormal basis v 1,, v n, where

More information

Maths for Signals and Systems Linear Algebra in Engineering

Maths for Signals and Systems Linear Algebra in Engineering Maths for Signals and Systems Linear Algebra in Engineering Lectures 13 15, Tuesday 8 th and Friday 11 th November 016 DR TANIA STATHAKI READER (ASSOCIATE PROFFESOR) IN SIGNAL PROCESSING IMPERIAL COLLEGE

More information

Singular Value Decomposition

Singular Value Decomposition Chapter 5 Singular Value Decomposition We now reach an important Chapter in this course concerned with the Singular Value Decomposition of a matrix A. SVD, as it is commonly referred to, is one of the

More information

The Singular Value Decomposition

The Singular Value Decomposition The Singular Value Decomposition An Important topic in NLA Radu Tiberiu Trîmbiţaş Babeş-Bolyai University February 23, 2009 Radu Tiberiu Trîmbiţaş ( Babeş-Bolyai University)The Singular Value Decomposition

More information

MATH 612 Computational methods for equation solving and function minimization Week # 2

MATH 612 Computational methods for equation solving and function minimization Week # 2 MATH 612 Computational methods for equation solving and function minimization Week # 2 Instructor: Francisco-Javier Pancho Sayas Spring 2014 University of Delaware FJS MATH 612 1 / 38 Plan for this week

More information

(a) If A is a 3 by 4 matrix, what does this tell us about its nullspace? Solution: dim N(A) 1, since rank(a) 3. Ax =

(a) If A is a 3 by 4 matrix, what does this tell us about its nullspace? Solution: dim N(A) 1, since rank(a) 3. Ax = . (5 points) (a) If A is a 3 by 4 matrix, what does this tell us about its nullspace? dim N(A), since rank(a) 3. (b) If we also know that Ax = has no solution, what do we know about the rank of A? C(A)

More information

Numerical Methods I Singular Value Decomposition

Numerical Methods I Singular Value Decomposition Numerical Methods I Singular Value Decomposition Aleksandar Donev Courant Institute, NYU 1 donev@courant.nyu.edu 1 MATH-GA 2011.003 / CSCI-GA 2945.003, Fall 2014 October 9th, 2014 A. Donev (Courant Institute)

More information

Vector Spaces, Orthogonality, and Linear Least Squares

Vector Spaces, Orthogonality, and Linear Least Squares Week Vector Spaces, Orthogonality, and Linear Least Squares. Opening Remarks.. Visualizing Planes, Lines, and Solutions Consider the following system of linear equations from the opener for Week 9: χ χ

More information

Designing Information Devices and Systems II

Designing Information Devices and Systems II EECS 16B Fall 2016 Designing Information Devices and Systems II Linear Algebra Notes Introduction In this set of notes, we will derive the linear least squares equation, study the properties symmetric

More information

Linear Algebra Fundamentals

Linear Algebra Fundamentals Linear Algebra Fundamentals It can be argued that all of linear algebra can be understood using the four fundamental subspaces associated with a matrix. Because they form the foundation on which we later

More information

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A =

Matrices and Vectors. Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = 30 MATHEMATICS REVIEW G A.1.1 Matrices and Vectors Definition of Matrix. An MxN matrix A is a two-dimensional array of numbers A = a 11 a 12... a 1N a 21 a 22... a 2N...... a M1 a M2... a MN A matrix can

More information

Applied Mathematics 205. Unit II: Numerical Linear Algebra. Lecturer: Dr. David Knezevic

Applied Mathematics 205. Unit II: Numerical Linear Algebra. Lecturer: Dr. David Knezevic Applied Mathematics 205 Unit II: Numerical Linear Algebra Lecturer: Dr. David Knezevic Unit II: Numerical Linear Algebra Chapter II.3: QR Factorization, SVD 2 / 66 QR Factorization 3 / 66 QR Factorization

More information

The Singular Value Decomposition and Least Squares Problems

The Singular Value Decomposition and Least Squares Problems The Singular Value Decomposition and Least Squares Problems Tom Lyche Centre of Mathematics for Applications, Department of Informatics, University of Oslo September 27, 2009 Applications of SVD solving

More information

Bindel, Fall 2009 Matrix Computations (CS 6210) Week 8: Friday, Oct 17

Bindel, Fall 2009 Matrix Computations (CS 6210) Week 8: Friday, Oct 17 Logistics Week 8: Friday, Oct 17 1. HW 3 errata: in Problem 1, I meant to say p i < i, not that p i is strictly ascending my apologies. You would want p i > i if you were simply forming the matrices and

More information

Linear Algebra, part 3. Going back to least squares. Mathematical Models, Analysis and Simulation = 0. a T 1 e. a T n e. Anna-Karin Tornberg

Linear Algebra, part 3. Going back to least squares. Mathematical Models, Analysis and Simulation = 0. a T 1 e. a T n e. Anna-Karin Tornberg Linear Algebra, part 3 Anna-Karin Tornberg Mathematical Models, Analysis and Simulation Fall semester, 2010 Going back to least squares (Sections 1.7 and 2.3 from Strang). We know from before: The vector

More information

2. Linear algebra. matrices and vectors. linear equations. range and nullspace of matrices. function of vectors, gradient and Hessian

2. Linear algebra. matrices and vectors. linear equations. range and nullspace of matrices. function of vectors, gradient and Hessian FE661 - Statistical Methods for Financial Engineering 2. Linear algebra Jitkomut Songsiri matrices and vectors linear equations range and nullspace of matrices function of vectors, gradient and Hessian

More information

1 Linearity and Linear Systems

1 Linearity and Linear Systems Mathematical Tools for Neuroscience (NEU 34) Princeton University, Spring 26 Jonathan Pillow Lecture 7-8 notes: Linear systems & SVD Linearity and Linear Systems Linear system is a kind of mapping f( x)

More information

Review problems for MA 54, Fall 2004.

Review problems for MA 54, Fall 2004. Review problems for MA 54, Fall 2004. Below are the review problems for the final. They are mostly homework problems, or very similar. If you are comfortable doing these problems, you should be fine on

More information

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2017 LECTURE 5

STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2017 LECTURE 5 STAT 39: MATHEMATICAL COMPUTATIONS I FALL 17 LECTURE 5 1 existence of svd Theorem 1 (Existence of SVD) Every matrix has a singular value decomposition (condensed version) Proof Let A C m n and for simplicity

More information

Stat 159/259: Linear Algebra Notes

Stat 159/259: Linear Algebra Notes Stat 159/259: Linear Algebra Notes Jarrod Millman November 16, 2015 Abstract These notes assume you ve taken a semester of undergraduate linear algebra. In particular, I assume you are familiar with the

More information

14 Singular Value Decomposition

14 Singular Value Decomposition 14 Singular Value Decomposition For any high-dimensional data analysis, one s first thought should often be: can I use an SVD? The singular value decomposition is an invaluable analysis tool for dealing

More information

Introduction to Numerical Linear Algebra II

Introduction to Numerical Linear Algebra II Introduction to Numerical Linear Algebra II Petros Drineas These slides were prepared by Ilse Ipsen for the 2015 Gene Golub SIAM Summer School on RandNLA 1 / 49 Overview We will cover this material in

More information

Chapter 3 Transformations

Chapter 3 Transformations Chapter 3 Transformations An Introduction to Optimization Spring, 2014 Wei-Ta Chu 1 Linear Transformations A function is called a linear transformation if 1. for every and 2. for every If we fix the bases

More information

Glossary of Linear Algebra Terms. Prepared by Vince Zaccone For Campus Learning Assistance Services at UCSB

Glossary of Linear Algebra Terms. Prepared by Vince Zaccone For Campus Learning Assistance Services at UCSB Glossary of Linear Algebra Terms Basis (for a subspace) A linearly independent set of vectors that spans the space Basic Variable A variable in a linear system that corresponds to a pivot column in the

More information

UNIT 6: The singular value decomposition.

UNIT 6: The singular value decomposition. UNIT 6: The singular value decomposition. María Barbero Liñán Universidad Carlos III de Madrid Bachelor in Statistics and Business Mathematical methods II 2011-2012 A square matrix is symmetric if A T

More information

Math 4A Notes. Written by Victoria Kala Last updated June 11, 2017

Math 4A Notes. Written by Victoria Kala Last updated June 11, 2017 Math 4A Notes Written by Victoria Kala vtkala@math.ucsb.edu Last updated June 11, 2017 Systems of Linear Equations A linear equation is an equation that can be written in the form a 1 x 1 + a 2 x 2 +...

More information

The Singular Value Decomposition

The Singular Value Decomposition The Singular Value Decomposition Philippe B. Laval KSU Fall 2015 Philippe B. Laval (KSU) SVD Fall 2015 1 / 13 Review of Key Concepts We review some key definitions and results about matrices that will

More information

MIT Final Exam Solutions, Spring 2017

MIT Final Exam Solutions, Spring 2017 MIT 8.6 Final Exam Solutions, Spring 7 Problem : For some real matrix A, the following vectors form a basis for its column space and null space: C(A) = span,, N(A) = span,,. (a) What is the size m n of

More information

Singular Value Decomposition (SVD)

Singular Value Decomposition (SVD) School of Computing National University of Singapore CS CS524 Theoretical Foundations of Multimedia More Linear Algebra Singular Value Decomposition (SVD) The highpoint of linear algebra Gilbert Strang

More information

MATH 20F: LINEAR ALGEBRA LECTURE B00 (T. KEMP)

MATH 20F: LINEAR ALGEBRA LECTURE B00 (T. KEMP) MATH 20F: LINEAR ALGEBRA LECTURE B00 (T KEMP) Definition 01 If T (x) = Ax is a linear transformation from R n to R m then Nul (T ) = {x R n : T (x) = 0} = Nul (A) Ran (T ) = {Ax R m : x R n } = {b R m

More information

Proposition 42. Let M be an m n matrix. Then (32) N (M M)=N (M) (33) R(MM )=R(M)

Proposition 42. Let M be an m n matrix. Then (32) N (M M)=N (M) (33) R(MM )=R(M) RODICA D. COSTIN. Singular Value Decomposition.1. Rectangular matrices. For rectangular matrices M the notions of eigenvalue/vector cannot be defined. However, the products MM and/or M M (which are square,

More information

Orthogonality. 6.1 Orthogonal Vectors and Subspaces. Chapter 6

Orthogonality. 6.1 Orthogonal Vectors and Subspaces. Chapter 6 Chapter 6 Orthogonality 6.1 Orthogonal Vectors and Subspaces Recall that if nonzero vectors x, y R n are linearly independent then the subspace of all vectors αx + βy, α, β R (the space spanned by x and

More information

Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012

Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 Instructions Preliminary/Qualifying Exam in Numerical Analysis (Math 502a) Spring 2012 The exam consists of four problems, each having multiple parts. You should attempt to solve all four problems. 1.

More information

Parallel Singular Value Decomposition. Jiaxing Tan

Parallel Singular Value Decomposition. Jiaxing Tan Parallel Singular Value Decomposition Jiaxing Tan Outline What is SVD? How to calculate SVD? How to parallelize SVD? Future Work What is SVD? Matrix Decomposition Eigen Decomposition A (non-zero) vector

More information

Least squares: the big idea

Least squares: the big idea Notes for 2016-02-22 Least squares: the big idea Least squares problems are a special sort of minimization problem. Suppose A R m n where m > n. In general, we cannot solve the overdetermined system Ax

More information

Singular Value Decomposition

Singular Value Decomposition Singular Value Decomposition CS 205A: Mathematical Methods for Robotics, Vision, and Graphics Doug James (and Justin Solomon) CS 205A: Mathematical Methods Singular Value Decomposition 1 / 35 Understanding

More information

B553 Lecture 5: Matrix Algebra Review

B553 Lecture 5: Matrix Algebra Review B553 Lecture 5: Matrix Algebra Review Kris Hauser January 19, 2012 We have seen in prior lectures how vectors represent points in R n and gradients of functions. Matrices represent linear transformations

More information

MODULE 8 Topics: Null space, range, column space, row space and rank of a matrix

MODULE 8 Topics: Null space, range, column space, row space and rank of a matrix MODULE 8 Topics: Null space, range, column space, row space and rank of a matrix Definition: Let L : V 1 V 2 be a linear operator. The null space N (L) of L is the subspace of V 1 defined by N (L) = {x

More information

Linear Algebra, part 3 QR and SVD

Linear Algebra, part 3 QR and SVD Linear Algebra, part 3 QR and SVD Anna-Karin Tornberg Mathematical Models, Analysis and Simulation Fall semester, 2012 Going back to least squares (Section 1.4 from Strang, now also see section 5.2). We

More information

CS 143 Linear Algebra Review

CS 143 Linear Algebra Review CS 143 Linear Algebra Review Stefan Roth September 29, 2003 Introductory Remarks This review does not aim at mathematical rigor very much, but instead at ease of understanding and conciseness. Please see

More information

AMS526: Numerical Analysis I (Numerical Linear Algebra)

AMS526: Numerical Analysis I (Numerical Linear Algebra) AMS526: Numerical Analysis I (Numerical Linear Algebra) Lecture 1: Course Overview & Matrix-Vector Multiplication Xiangmin Jiao SUNY Stony Brook Xiangmin Jiao Numerical Analysis I 1 / 20 Outline 1 Course

More information

Vector and Matrix Norms. Vector and Matrix Norms

Vector and Matrix Norms. Vector and Matrix Norms Vector and Matrix Norms Vector Space Algebra Matrix Algebra: We let x x and A A, where, if x is an element of an abstract vector space n, and A = A: n m, then x is a complex column vector of length n whose

More information

Notes on Eigenvalues, Singular Values and QR

Notes on Eigenvalues, Singular Values and QR Notes on Eigenvalues, Singular Values and QR Michael Overton, Numerical Computing, Spring 2017 March 30, 2017 1 Eigenvalues Everyone who has studied linear algebra knows the definition: given a square

More information

GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis. Massimiliano Pontil

GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis. Massimiliano Pontil GI07/COMPM012: Mathematical Programming and Research Methods (Part 2) 2. Least Squares and Principal Components Analysis Massimiliano Pontil 1 Today s plan SVD and principal component analysis (PCA) Connection

More information

Problem set 5: SVD, Orthogonal projections, etc.

Problem set 5: SVD, Orthogonal projections, etc. Problem set 5: SVD, Orthogonal projections, etc. February 21, 2017 1 SVD 1. Work out again the SVD theorem done in the class: If A is a real m n matrix then here exist orthogonal matrices such that where

More information

The Singular Value Decomposition (SVD) and Principal Component Analysis (PCA)

The Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) Chapter 5 The Singular Value Decomposition (SVD) and Principal Component Analysis (PCA) 5.1 Basics of SVD 5.1.1 Review of Key Concepts We review some key definitions and results about matrices that will

More information

Final Review Written by Victoria Kala SH 6432u Office Hours R 12:30 1:30pm Last Updated 11/30/2015

Final Review Written by Victoria Kala SH 6432u Office Hours R 12:30 1:30pm Last Updated 11/30/2015 Final Review Written by Victoria Kala vtkala@mathucsbedu SH 6432u Office Hours R 12:30 1:30pm Last Updated 11/30/2015 Summary This review contains notes on sections 44 47, 51 53, 61, 62, 65 For your final,

More information

Principal Component Analysis

Principal Component Analysis Machine Learning Michaelmas 2017 James Worrell Principal Component Analysis 1 Introduction 1.1 Goals of PCA Principal components analysis (PCA) is a dimensionality reduction technique that can be used

More information

be a Householder matrix. Then prove the followings H = I 2 uut Hu = (I 2 uu u T u )u = u 2 uut u

be a Householder matrix. Then prove the followings H = I 2 uut Hu = (I 2 uu u T u )u = u 2 uut u MATH 434/534 Theoretical Assignment 7 Solution Chapter 7 (71) Let H = I 2uuT Hu = u (ii) Hv = v if = 0 be a Householder matrix Then prove the followings H = I 2 uut Hu = (I 2 uu )u = u 2 uut u = u 2u =

More information

1. Background: The SVD and the best basis (questions selected from Ch. 6- Can you fill in the exercises?)

1. Background: The SVD and the best basis (questions selected from Ch. 6- Can you fill in the exercises?) Math 35 Exam Review SOLUTIONS Overview In this third of the course we focused on linear learning algorithms to model data. summarize: To. Background: The SVD and the best basis (questions selected from

More information

7 Principal Component Analysis

7 Principal Component Analysis 7 Principal Component Analysis This topic will build a series of techniques to deal with high-dimensional data. Unlike regression problems, our goal is not to predict a value (the y-coordinate), it is

More information

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra.

DS-GA 1002 Lecture notes 0 Fall Linear Algebra. These notes provide a review of basic concepts in linear algebra. DS-GA 1002 Lecture notes 0 Fall 2016 Linear Algebra These notes provide a review of basic concepts in linear algebra. 1 Vector spaces You are no doubt familiar with vectors in R 2 or R 3, i.e. [ ] 1.1

More information

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.

The University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013. The University of Texas at Austin Department of Electrical and Computer Engineering EE381V: Large Scale Learning Spring 2013 Assignment Two Caramanis/Sanghavi Due: Tuesday, Feb. 19, 2013. Computational

More information

Chapter 6 - Orthogonality

Chapter 6 - Orthogonality Chapter 6 - Orthogonality Maggie Myers Robert A. van de Geijn The University of Texas at Austin Orthogonality Fall 2009 http://z.cs.utexas.edu/wiki/pla.wiki/ 1 Orthogonal Vectors and Subspaces http://z.cs.utexas.edu/wiki/pla.wiki/

More information

AM 205: lecture 8. Last time: Cholesky factorization, QR factorization Today: how to compute the QR factorization, the Singular Value Decomposition

AM 205: lecture 8. Last time: Cholesky factorization, QR factorization Today: how to compute the QR factorization, the Singular Value Decomposition AM 205: lecture 8 Last time: Cholesky factorization, QR factorization Today: how to compute the QR factorization, the Singular Value Decomposition QR Factorization A matrix A R m n, m n, can be factorized

More information

MATH 2331 Linear Algebra. Section 2.1 Matrix Operations. Definition: A : m n, B : n p. Example: Compute AB, if possible.

MATH 2331 Linear Algebra. Section 2.1 Matrix Operations. Definition: A : m n, B : n p. Example: Compute AB, if possible. MATH 2331 Linear Algebra Section 2.1 Matrix Operations Definition: A : m n, B : n p ( 1 2 p ) ( 1 2 p ) AB = A b b b = Ab Ab Ab Example: Compute AB, if possible. 1 Row-column rule: i-j-th entry of AB:

More information

Matrix decompositions

Matrix decompositions Matrix decompositions Zdeněk Dvořák May 19, 2015 Lemma 1 (Schur decomposition). If A is a symmetric real matrix, then there exists an orthogonal matrix Q and a diagonal matrix D such that A = QDQ T. The

More information

Problem # Max points possible Actual score Total 120

Problem # Max points possible Actual score Total 120 FINAL EXAMINATION - MATH 2121, FALL 2017. Name: ID#: Email: Lecture & Tutorial: Problem # Max points possible Actual score 1 15 2 15 3 10 4 15 5 15 6 15 7 10 8 10 9 15 Total 120 You have 180 minutes to

More information

Dimensionality Reduction: PCA. Nicholas Ruozzi University of Texas at Dallas

Dimensionality Reduction: PCA. Nicholas Ruozzi University of Texas at Dallas Dimensionality Reduction: PCA Nicholas Ruozzi University of Texas at Dallas Eigenvalues λ is an eigenvalue of a matrix A R n n if the linear system Ax = λx has at least one non-zero solution If Ax = λx

More information

2. Every linear system with the same number of equations as unknowns has a unique solution.

2. Every linear system with the same number of equations as unknowns has a unique solution. 1. For matrices A, B, C, A + B = A + C if and only if A = B. 2. Every linear system with the same number of equations as unknowns has a unique solution. 3. Every linear system with the same number of equations

More information

Math Final December 2006 C. Robinson

Math Final December 2006 C. Robinson Math 285-1 Final December 2006 C. Robinson 2 5 8 5 1 2 0-1 0 1. (21 Points) The matrix A = 1 2 2 3 1 8 3 2 6 has the reduced echelon form U = 0 0 1 2 0 0 0 0 0 1. 2 6 1 0 0 0 0 0 a. Find a basis for the

More information

Pseudoinverse & Orthogonal Projection Operators

Pseudoinverse & Orthogonal Projection Operators Pseudoinverse & Orthogonal Projection Operators ECE 174 Linear & Nonlinear Optimization Ken Kreutz-Delgado ECE Department, UC San Diego Ken Kreutz-Delgado (UC San Diego) ECE 174 Fall 2016 1 / 48 The Four

More information

Linear Least Squares. Using SVD Decomposition.

Linear Least Squares. Using SVD Decomposition. Linear Least Squares. Using SVD Decomposition. Dmitriy Leykekhman Spring 2011 Goals SVD-decomposition. Solving LLS with SVD-decomposition. D. Leykekhman Linear Least Squares 1 SVD Decomposition. For any

More information

Solutions to Review Problems for Chapter 6 ( ), 7.1

Solutions to Review Problems for Chapter 6 ( ), 7.1 Solutions to Review Problems for Chapter (-, 7 The Final Exam is on Thursday, June,, : AM : AM at NESBITT Final Exam Breakdown Sections % -,7-9,- - % -9,-,7,-,-7 - % -, 7 - % Let u u and v Let x x x x,

More information

Singular Value Decomposition and Principal Component Analysis (PCA) I

Singular Value Decomposition and Principal Component Analysis (PCA) I Singular Value Decomposition and Principal Component Analysis (PCA) I Prof Ned Wingreen MOL 40/50 Microarray review Data per array: 0000 genes, I (green) i,i (red) i 000 000+ data points! The expression

More information

Large Scale Data Analysis Using Deep Learning

Large Scale Data Analysis Using Deep Learning Large Scale Data Analysis Using Deep Learning Linear Algebra U Kang Seoul National University U Kang 1 In This Lecture Overview of linear algebra (but, not a comprehensive survey) Focused on the subset

More information

1. Addition: To every pair of vectors x, y X corresponds an element x + y X such that the commutative and associative properties hold

1. Addition: To every pair of vectors x, y X corresponds an element x + y X such that the commutative and associative properties hold Appendix B Y Mathematical Refresher This appendix presents mathematical concepts we use in developing our main arguments in the text of this book. This appendix can be read in the order in which it appears,

More information

Math 224, Fall 2007 Exam 3 Thursday, December 6, 2007

Math 224, Fall 2007 Exam 3 Thursday, December 6, 2007 Math 224, Fall 2007 Exam 3 Thursday, December 6, 2007 You have 1 hour and 20 minutes. No notes, books, or other references. You are permitted to use Maple during this exam, but you must start with a blank

More information

Cheat Sheet for MATH461

Cheat Sheet for MATH461 Cheat Sheet for MATH46 Here is the stuff you really need to remember for the exams Linear systems Ax = b Problem: We consider a linear system of m equations for n unknowns x,,x n : For a given matrix A

More information

Numerical Methods. Elena loli Piccolomini. Civil Engeneering. piccolom. Metodi Numerici M p. 1/??

Numerical Methods. Elena loli Piccolomini. Civil Engeneering.  piccolom. Metodi Numerici M p. 1/?? Metodi Numerici M p. 1/?? Numerical Methods Elena loli Piccolomini Civil Engeneering http://www.dm.unibo.it/ piccolom elena.loli@unibo.it Metodi Numerici M p. 2/?? Least Squares Data Fitting Measurement

More information

2. Review of Linear Algebra

2. Review of Linear Algebra 2. Review of Linear Algebra ECE 83, Spring 217 In this course we will represent signals as vectors and operators (e.g., filters, transforms, etc) as matrices. This lecture reviews basic concepts from linear

More information

Notes on Solving Linear Least-Squares Problems

Notes on Solving Linear Least-Squares Problems Notes on Solving Linear Least-Squares Problems Robert A. van de Geijn The University of Texas at Austin Austin, TX 7871 October 1, 14 NOTE: I have not thoroughly proof-read these notes!!! 1 Motivation

More information

Background Mathematics (2/2) 1. David Barber

Background Mathematics (2/2) 1. David Barber Background Mathematics (2/2) 1 David Barber University College London Modified by Samson Cheung (sccheung@ieee.org) 1 These slides accompany the book Bayesian Reasoning and Machine Learning. The book and

More information

Math 407: Linear Optimization

Math 407: Linear Optimization Math 407: Linear Optimization Lecture 16: The Linear Least Squares Problem II Math Dept, University of Washington February 28, 2018 Lecture 16: The Linear Least Squares Problem II (Math Dept, University

More information

Pseudoinverse & Moore-Penrose Conditions

Pseudoinverse & Moore-Penrose Conditions ECE 275AB Lecture 7 Fall 2008 V1.0 c K. Kreutz-Delgado, UC San Diego p. 1/1 Lecture 7 ECE 275A Pseudoinverse & Moore-Penrose Conditions ECE 275AB Lecture 7 Fall 2008 V1.0 c K. Kreutz-Delgado, UC San Diego

More information

Introduction to Compressed Sensing

Introduction to Compressed Sensing Introduction to Compressed Sensing Alejandro Parada, Gonzalo Arce University of Delaware August 25, 2016 Motivation: Classical Sampling 1 Motivation: Classical Sampling Issues Some applications Radar Spectral

More information

Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for

Bindel, Fall 2016 Matrix Computations (CS 6210) Notes for 1 A cautionary tale Notes for 2016-10-05 You have been dropped on a desert island with a laptop with a magic battery of infinite life, a MATLAB license, and a complete lack of knowledge of basic geometry.

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Numerical Linear Algebra Background Cho-Jui Hsieh UC Davis May 15, 2018 Linear Algebra Background Vectors A vector has a direction and a magnitude

More information

Singular Value Decompsition

Singular Value Decompsition Singular Value Decompsition Massoud Malek One of the most useful results from linear algebra, is a matrix decomposition known as the singular value decomposition It has many useful applications in almost

More information

Properties of Matrices and Operations on Matrices

Properties of Matrices and Operations on Matrices Properties of Matrices and Operations on Matrices A common data structure for statistical analysis is a rectangular array or matris. Rows represent individual observational units, or just observations,

More information

Linear Algebra - Part II

Linear Algebra - Part II Linear Algebra - Part II Projection, Eigendecomposition, SVD (Adapted from Sargur Srihari s slides) Brief Review from Part 1 Symmetric Matrix: A = A T Orthogonal Matrix: A T A = AA T = I and A 1 = A T

More information

Solution of Linear Equations

Solution of Linear Equations Solution of Linear Equations (Com S 477/577 Notes) Yan-Bin Jia Sep 7, 07 We have discussed general methods for solving arbitrary equations, and looked at the special class of polynomial equations A subclass

More information

Matrix Algebra for Engineers Jeffrey R. Chasnov

Matrix Algebra for Engineers Jeffrey R. Chasnov Matrix Algebra for Engineers Jeffrey R. Chasnov The Hong Kong University of Science and Technology The Hong Kong University of Science and Technology Department of Mathematics Clear Water Bay, Kowloon

More information

Computational math: Assignment 1

Computational math: Assignment 1 Computational math: Assignment 1 Thanks Ting Gao for her Latex file 11 Let B be a 4 4 matrix to which we apply the following operations: 1double column 1, halve row 3, 3add row 3 to row 1, 4interchange

More information