Notes on Polar Decomposition and SVD
by Tamara Fusselman, 20 Nov 2012

Assume for these notes that $V, W$ are finite-dimensional vector spaces over $\mathbb{C}$ with inner products.

Observation: Given any $T \in \mathcal{L}(V)$, $(T^*T)^* = T^*T$, so $T^*T$ is self-adjoint, and $\langle T^*Tv, v \rangle = \langle Tv, Tv \rangle \ge 0$ for all $v \in V$, so $T^*T$ satisfies the definition of a positive operator. Because $T^*T$ is a positive operator, by prop 7.28 (p. 146) there exists a unique positive operator $R \in \mathcal{L}(V)$, written $R = \sqrt{T^*T}$, such that $R^2 = T^*T$. (Notice $R^* = R$, because positive operators are also self-adjoint.)

Claim ($\ast$): For every $v \in V$, $\|Tv\| = \|Rv\|$.

Proof: For every $v \in V$,
$$\|Tv\|^2 = \langle Tv, Tv \rangle = \langle T^*Tv, v \rangle = \langle R^2 v, v \rangle = \langle Rv, R^*v \rangle = \langle Rv, Rv \rangle \;(\text{because } R^* = R) = \|Rv\|^2.$$
Now take the square root of both sides.

Since $\|Tv\| = \|Rv\|$ for all $v \in V$, an explicit isometry between $\operatorname{range}(T)$ and $\operatorname{range}(R)$ would be nice. Define $S_1 : \operatorname{range}(R) \to \operatorname{range}(T)$ by $S_1(Rv) = Tv$ for all $v \in V$. As $S_1$ takes each $Rv \in \operatorname{range}(R)$ to $Tv$, and $\|Tv\| = \|Rv\|$ (by $\ast$), $S_1$ satisfies the norm-preserving property of an isometry. (If you are not yet convinced that $S_1$ satisfies the norm-preserving property, check it: given any $w \in \operatorname{range}(R)$, there is $v \in V$ such that $Rv = w$, and then $\|S_1 w\| = \|S_1(Rv)\| = \|Tv\|$ by the definition of $S_1$, which equals $\|Rv\|$ by the claim above, which equals $\|w\|$.)

But is $S_1$ even a map? (What if there were $v_1, v_2 \in V$ such that $Rv_1 = Rv_2$ but $Tv_1 \ne Tv_2$?) Suppose that $v_1, v_2 \in V$ satisfy $Rv_1 = Rv_2$. Then
$$Rv_1 - Rv_2 = 0 \implies R(v_1 - v_2) = 0 \;(\text{since } R \text{ is linear}) \implies \|R(v_1 - v_2)\| = 0 \implies \|T(v_1 - v_2)\| = 0 \;(\text{by the claim, taking } v = v_1 - v_2) \implies T(v_1 - v_2) = 0 \implies Tv_1 - Tv_2 = 0 \implies Tv_1 = Tv_2 \;(\text{because } T \text{ is linear}).$$
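The claim is easy to check numerically. Below is a small sketch of my own (not part of the original notes), in Python with numpy: for a random complex matrix A standing in for T, it builds the positive square root R of A*A from an eigendecomposition and confirms that the norms of Av and Rv agree for a random vector v. All names here are purely illustrative.

```python
# Numerical sketch (illustration only): R = sqrt(A* A) satisfies ||Av|| = ||Rv||.
import numpy as np

rng = np.random.default_rng(0)
n = 4
A = rng.standard_normal((n, n)) + 1j * rng.standard_normal((n, n))

# A* A is self-adjoint and positive, so eigh returns real, nonnegative eigenvalues
# and an orthonormal eigenbasis; the positive square root rescales that basis.
evals, evecs = np.linalg.eigh(A.conj().T @ A)
evals = np.clip(evals, 0.0, None)                 # guard against tiny negative round-off
R = evecs @ np.diag(np.sqrt(evals)) @ evecs.conj().T

v = rng.standard_normal(n) + 1j * rng.standard_normal(n)
print(np.linalg.norm(A @ v), np.linalg.norm(R @ v))   # the two norms agree
print(np.allclose(R @ R, A.conj().T @ A))             # and R^2 = A* A
```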
Thus, $S_1$ is a map.

Note that $S_1$ is bijective: supposing $v_1, v_2 \in V$ satisfy $Tv_1 = Tv_2$ and running the above calculation backwards shows that $Rv_1 = Rv_2$, so $S_1$ is injective. Every $Tv \in \operatorname{range}(T)$ has $Rv \in \operatorname{range}(R)$ with $S_1(Rv) = Tv$, so $S_1$ is surjective.

Is $S_1$ linear? Take $v_1, v_2 \in V$ and $a \in \mathbb{C}$. Then
$$S_1(aRv_1 + Rv_2) = S_1\big(R(av_1 + v_2)\big) \;(\text{since } R \text{ is linear}) = T(av_1 + v_2) \;(\text{by the definition of } S_1) = aTv_1 + Tv_2 \;(\text{because } T \text{ is linear}) = aS_1(Rv_1) + S_1(Rv_2) \;(\text{by the definition of } S_1).$$
Thus, $S_1$ is linear. As $S_1$ is a linear map constructed to have the norm-preserving property, $S_1$ is an isometry between $\operatorname{range}(R)$ and $\operatorname{range}(T)$.

Can $S_1$ be extended to an isometry on all of $V$? Because $S_1$ is bijective (or, as Axler puts it, an injective linear map),
$$\dim\big(\operatorname{range}(R)\big) = \dim\big(\operatorname{range}(T)\big) \implies \dim\big((\operatorname{range}(R))^{\perp}\big) = \dim\big((\operatorname{range}(T))^{\perp}\big).$$
Letting $m = \dim\big((\operatorname{range}(R))^{\perp}\big)$, pick an orthonormal basis $B = (u_1, \ldots, u_m)$ for $(\operatorname{range}(R))^{\perp}$ and an orthonormal basis $B' = (u_1', \ldots, u_m')$ for $(\operatorname{range}(T))^{\perp}$. Writing each $v \in (\operatorname{range}(R))^{\perp}$ as $v = \sum_{i=1}^{m} a_i u_i$ with $a_i \in \mathbb{C}$, define $S_2 : (\operatorname{range}(R))^{\perp} \to (\operatorname{range}(T))^{\perp}$ by
$$S_2\Big(\sum_{i=1}^{m} a_i u_i\Big) = \sum_{i=1}^{m} a_i u_i'.$$
It is easy to check that $S_2$ is linear. Now
$$\|S_2 v\| = \Big\|\sum_{i=1}^{m} a_i u_i'\Big\| = \Big(\sum_{i=1}^{m} |a_i|^2\Big)^{1/2} \;(\text{since } B' \text{ is an orthonormal basis}) = \Big\|\sum_{i=1}^{m} a_i u_i\Big\| \;(\text{since } B \text{ is an orthonormal basis}) = \|v\|,$$
so $S_2$ is an isometry.

$V = \operatorname{range}(R) \oplus (\operatorname{range}(R))^{\perp}$, so for each $v \in V$ there are unique $u \in \operatorname{range}(R)$ and $w \in (\operatorname{range}(R))^{\perp}$ such that $v = u + w$. Now define $S \in \mathcal{L}(V)$ by $Sv = S_1 u + S_2 w$. $S$ is linear, since it is built from the linear maps $S_1$ and $S_2$ composed with the (linear) orthogonal projections onto $\operatorname{range}(R)$ and $(\operatorname{range}(R))^{\perp}$.
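Here is a tiny sketch of my own (Python with numpy, not from the notes) of the $S_2$ idea: once you have orthonormal bases of two subspaces of the same dimension, carrying coefficients from one basis to the other preserves norms. The subspaces, the bases B and Bp, and the name S2 are all made up for illustration.

```python
# Sketch of the S_2 construction: map sum(a_i u_i) in span(B) to sum(a_i u_i') in span(Bp).
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 2

# Orthonormal bases (as matrix columns) of two random m-dimensional subspaces of C^n,
# obtained by orthonormalizing random vectors with QR.
B,  _ = np.linalg.qr(rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m)))
Bp, _ = np.linalg.qr(rng.standard_normal((n, m)) + 1j * rng.standard_normal((n, m)))

def S2(v):
    """Send v = sum_i a_i u_i to sum_i a_i u_i', reading off a_i = <v, u_i>."""
    a = B.conj().T @ v      # coefficients with respect to the orthonormal basis B
    return Bp @ a           # same coefficients over the basis B'

v = B @ (rng.standard_normal(m) + 1j * rng.standard_normal(m))   # a vector in span(B)
print(np.linalg.norm(v), np.linalg.norm(S2(v)))                  # norms agree
```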
$$\|Sv\|^2 = \|S_1 u + S_2 w\|^2 \quad \text{by the definition of } S.$$
Now, since $S_1 u \in \operatorname{range}(T)$ and $S_2 w \in (\operatorname{range}(T))^{\perp}$, this
$$= \|S_1 u\|^2 + \|S_2 w\|^2 \quad \text{by the Pythagorean theorem (p. 102)}$$
$$= \|u\|^2 + \|w\|^2 \quad \text{since } S_1 \text{ and } S_2 \text{ are isometries}$$
$$= \|u + w\|^2 \quad \text{since } u \in \operatorname{range}(R) \text{ and } w \in (\operatorname{range}(R))^{\perp}$$
$$= \|v\|^2.$$
Taking the square root of both sides shows that $S$ is an isometry.

Now, for every $v \in V$, $Rv \in \operatorname{range}(R)$, and so $S(Rv) = S_1(Rv) = Tv$. Recalling that $R = \sqrt{T^*T}$, we have $Tv = S\sqrt{T^*T}\,v$ for all $v \in V$, and so $T = S\sqrt{T^*T}$.

This construction of $S$ (which works for every $T \in \mathcal{L}(V)$) gives the Polar Decomposition Theorem (Theorem 7.41, p. 153): for every $T \in \mathcal{L}(V)$ there exists an isometry $S \in \mathcal{L}(V)$ such that $T = S\sqrt{T^*T}$.

Singular Value Decomposition Theorem (Theorem 7.46, p. 156): For every $T \in \mathcal{L}(V)$ there exist nonnegative scalars $\lambda_1, \ldots, \lambda_n$ and orthonormal bases $B = (u_1, \ldots, u_n)$, $B' = (u_1', \ldots, u_n')$ such that
$$Tv = \lambda_1 \langle v, u_1 \rangle u_1' + \cdots + \lambda_n \langle v, u_n \rangle u_n' \quad \text{for all } v \in V.$$

Proof: We already know that $T$ has polar decomposition $T = SR$, where $R = \sqrt{T^*T}$ and $S \in \mathcal{L}(V)$ is an isometry. Since $R$ is self-adjoint, by the Spectral Theorem (theorem 7.9, p. 133), $R$ has an orthonormal eigenbasis $B_1 = (u_1, \ldots, u_n)$. (An eigenbasis for a linear operator on $V$ is a basis for $V$ whose elements are all eigenvectors of the linear operator.) For each $k$, let $\lambda_k$ be the eigenvalue associated with $u_k$ (so that $(\lambda_k, u_k)$ is an eigenpair). Since $B_1$ is an orthonormal basis for $V$, for every $v \in V$,
$$v = \langle v, u_1 \rangle u_1 + \cdots + \langle v, u_n \rangle u_n,$$
$$Rv = \lambda_1 \langle v, u_1 \rangle u_1 + \cdots + \lambda_n \langle v, u_n \rangle u_n \quad (\text{applying } R \text{ to both sides}),$$
$$Tv = \lambda_1 \langle v, u_1 \rangle Su_1 + \cdots + \lambda_n \langle v, u_n \rangle Su_n \quad (\text{applying } S \text{ to both sides}).$$
Because $S$ is an isometry, $B_2 = (u_1' = Su_1, \ldots, u_n' = Su_n)$ is another orthonormal basis for $V$ (see 7.36, p. 148). Hence
$$Tv = \lambda_1 \langle v, u_1 \rangle u_1' + \cdots + \lambda_n \langle v, u_n \rangle u_n'.$$

Definition (Singular Values of $T$): The singular values of $T$ are the scalars $\lambda_1, \ldots, \lambda_n$ found above. They are the eigenvalues of $\sqrt{T^*T}$, counted with repetition.

Matrix of $T$: Recall some notation. If $B = (v_1, \ldots, v_n)$ and $B' = (v_1', \ldots, v_n')$ are bases for $V$ and $L \in \mathcal{L}(V)$, then $_{B'}[L]_{B} = \mathcal{M}(L, B, B')$ is the matrix whose $k$th column is $[a_{1k}, \ldots, a_{nk}]^{T}$, where $Lv_k = a_{1k} v_1' + \cdots + a_{nk} v_n'$. Now,
$$Tu_k = \lambda_1 \langle u_k, u_1 \rangle u_1' + \cdots + \lambda_n \langle u_k, u_n \rangle u_n' = \lambda_k u_k' \quad \text{for each } k,$$
so the $k$th column of $_{B_2}[T]_{B_1}$ is $[0, \ldots, \lambda_k, \ldots, 0]^{T}$, nonzero only in the $k$th place, making $_{B_2}[T]_{B_1} = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$.
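As a computational aside (a sketch of my own, not part of the notes): in floating-point practice the polar factors are usually obtained from a numerically computed SVD rather than from the construction above. With numpy's svd returning A = W diag(s) Vh (the names W, s, Vh are numpy's outputs, not the notes' notation), S = W Vh is the isometry and Vh* diag(s) Vh is the positive factor.

```python
# Sketch: polar decomposition A = S R recovered from an SVD of A.
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))

W, s, Vh = np.linalg.svd(A)                 # A = W @ diag(s) @ Vh
S = W @ Vh                                  # unitary (isometry) factor
R = Vh.conj().T @ np.diag(s) @ Vh           # positive factor, equal to sqrt(A* A)

print(np.allclose(A, S @ R))                        # A = S R
print(np.allclose(S.conj().T @ S, np.eye(4)))       # S is an isometry
print(np.allclose(R, R.conj().T))                   # R is self-adjoint
```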
The convention is to call this matrix $\Sigma$ (for "singular values") or $D$ (for "diagonal"). We will let $D = \operatorname{diag}(\lambda_1, \ldots, \lambda_n)$.

Assuming the standard basis $E = (e_1, \ldots, e_n)$ is orthonormal with respect to our given inner product for $V$, we write $_{E}[T]_{E}$ as follows:
$$_{E}[T]_{E} = {}_{E}[I]_{B_2} \; {}_{B_2}[T]_{B_1} \; {}_{B_1}[I]_{E},$$
where $_{B_1}[I]_{E} = \big({}_{E}[I]_{B_1}\big)^{-1}$ and both $_{E}[I]_{B_1}$ and $_{E}[I]_{B_2}$ are change-of-basis matrices from $B_1$ and $B_2$, respectively, to the standard basis $E$. Let $U_1 = {}_{E}[I]_{B_1}$ and $U_2 = {}_{E}[I]_{B_2}$. As both $U_1$ and $U_2$ take orthonormal bases to orthonormal bases, both are isometries, hence unitary operators (see theorem 7.36, p. 148). Recalling that $_{B_2}[T]_{B_1} = D$,
$$_{E}[T]_{E} = U_2 D U_1^{-1} = U_2 D U_1^{*}.$$

Summarizing the SVD geometrically

Isometries: You can picture an isometry as rotating and reflecting (reversing the sign of) various basis elements in an orthonormal basis of a space. An isometry might change a shape to its mirror image, but aside from that, it doesn't deform the shape in any way; it just changes its orientation. Usually, when we see something upside-down, or reflected, or tilted, we immediately recognize what it is. In that sense, an isometry doesn't change what we're looking at.

Positive Operators: Since positive operators are self-adjoint, they are diagonal with respect to some orthonormal basis. Additionally, their eigenvalues (the diagonal values) are all nonnegative. So we can think of a positive operator as rescaling the elements of an orthonormal basis without reversing any of their directions.

Polar Decomposition: The Polar Decomposition Theorem shows us that every $T \in \mathcal{L}(V)$ is, up to an isometry, a positive operator: $T = SR$ with $R = \sqrt{T^*T}$.

Spectral Theorem: The spectral theorem tells us that $R$ has an orthonormal eigenbasis, that is, an orthonormal basis $B = (u_1, \ldots, u_n)$ such that $_{B}[R]_{B} = D$ is diagonal. Moreover, the nonzero entries of $D$ are positive, since $R$ is a positive operator.

Altogether: Any linear operator $T$ on a complex inner-product space is, up to an isometry, an operator that simply rescales some orthonormal basis of the space. I find that thinking of the SVD geometrically makes it easier to remember how to construct the matrix form of the SVD.
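To make the $_{E}[T]_{E} = U_2 D U_1^{*}$ recipe concrete, here is a short numerical sketch of my own (not from the notes): it builds $U_1$ and $D$ from an eigendecomposition of $R$, sets $U_2 = S U_1$, and checks the factorization. The polar factors are computed from numpy's SVD as in the previous sketch.

```python
# Sketch: assemble the matrix form U2 @ D @ U1* from the polar factors of A.
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((4, 4)) + 1j * rng.standard_normal((4, 4))

# Polar decomposition A = S R (computed via numpy's SVD, as in the earlier sketch).
W, s, Vh = np.linalg.svd(A)
S = W @ Vh
R = Vh.conj().T @ np.diag(s) @ Vh

# Spectral theorem applied to R: orthonormal eigenbasis U1 and diagonal D.
d, U1 = np.linalg.eigh(R)
D = np.diag(d)
U2 = S @ U1                                  # push the eigenbasis through the isometry

print(np.allclose(A, U2 @ D @ U1.conj().T))            # A = U2 D U1*
print(np.allclose(U1.conj().T @ U1, np.eye(4)))        # U1 is unitary, so U1^{-1} = U1*
```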
Matrix Form of the SVD, Revisited: (Assuming that our standard basis $E$ is orthonormal with respect to the inner product on $V$.) The polar decomposition gives us $T = SR$, where $R = \sqrt{T^*T}$. The Spectral Theorem tells us that $_{B}[R]_{B} = D$, a diagonal matrix, where $B = (u_1, \ldots, u_n)$ is an orthonormal basis of $V$ consisting of eigenvectors of $R$. The matrix $U_1 = [u_1, \ldots, u_n]$ is $_{E}[I]_{B}$, the change-of-basis matrix from $B$ to $E$. As $B$ and $E$ are orthonormal bases, $U_1$ is an isometry and so $U_1^{-1} = U_1^{*}$. Then
$$_{E}[R]_{E} = {}_{E}[I]_{B} \; {}_{B}[R]_{B} \; {}_{B}[I]_{E} = U_1 D U_1^{*}.$$
By the polar decomposition,
$$_{E}[T]_{E} = {}_{E}[S]_{E} \; {}_{E}[R]_{E} = {}_{E}[S]_{E} \, U_1 D U_1^{*}.$$
Let $U_2 = {}_{E}[S]_{E} \, U_1$. Since the composition of isometries is another isometry, $U_2$ is an isometry. So $_{E}[T]_{E} = U_2 D U_1^{*}$.

Generalizing Axler's proof of the SVD to all linear maps

Let $T \in \mathcal{L}(V, W)$, with $\dim(V) = n$ and $\dim(W) = m$. Since $T^* : W \to V$, the composition $T^*T : V \to V$ is a positive operator on $V$, so there exists $R \in \mathcal{L}(V)$ with $R = \sqrt{T^*T}$. As in the case where $T$ is a linear operator, it is possible to show that $\|Tv\| = \|Rv\|$ for all $v \in V$ (even though $Tv \in W$ and $Rv \in V$). Likewise, it is possible to show that $S_1 : \operatorname{range}(R) \to \operatorname{range}(T)$ given by $S_1(Rv) = Tv$ is an isometry between $\operatorname{range}(R)$ and $\operatorname{range}(T)$.

Again, the Spectral Theorem shows that there exists an orthonormal eigenbasis $B$ of $V$ such that $\Sigma = {}_{B}[R]_{B}$ is diagonal. Let $B = (u_1, \ldots, u_r, u_{r+1}, \ldots, u_n)$, where $(\lambda_1, u_1), \ldots, (\lambda_r, u_r)$ are the eigenpairs for the nonzero diagonal entries of $\Sigma$, arranged so that $\lambda_1 \ge \cdots \ge \lambda_r$. Then $u_{r+1}, \ldots, u_n$ are all eigenvectors of $R$ with eigenvalue 0. Notice that $\operatorname{range}(R) = \operatorname{span}(u_1, \ldots, u_r)$ and $\operatorname{null}(R) = \operatorname{span}(u_{r+1}, \ldots, u_n)$.

Since $S_1$ is an isometry from $\operatorname{range}(R)$ to $\operatorname{range}(T)$, $(u_1', \ldots, u_r') = (S_1 u_1, \ldots, S_1 u_r)$ is an orthonormal basis for $\operatorname{range}(T)$, which can be extended to an orthonormal basis $B' = (u_1', \ldots, u_r', u_{r+1}', \ldots, u_m')$ for all of $W$. Now express $T$ with respect to $B$ and $B'$:
$$Tu_k = S_1(Ru_k) = S_1(\lambda_k u_k) = \lambda_k S_1 u_k = \lambda_k u_k' \quad \text{for } 1 \le k \le r.$$
For $r < k \le n$ we have $Ru_k = 0$, so $\|Tu_k\| = \|Ru_k\| = 0$ and hence $Tu_k = 0$. (In particular, $\dim(\operatorname{range}(T)) = \dim(\operatorname{range}(R)) = r$, because $S_1$ is an isometry between the two ranges.) This makes $_{B'}[T]_{B}$ the $m \times n$ matrix whose top-left $r \times r$ block is $\operatorname{diag}(\lambda_1, \ldots, \lambda_r)$ and whose remaining entries are all 0:
$$_{B'}[T]_{B} = \begin{pmatrix} \operatorname{diag}(\lambda_1, \ldots, \lambda_r) & 0 \\ 0 & 0 \end{pmatrix}.$$
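A quick numerical sketch of my own for this rectangular case (sizes and names chosen only for illustration): numpy's svd of an m x n matrix returns orthonormal bases of W and V, and the r = rank(A) nonzero singular values sit in the top-left block of the m x n matrix D.

```python
# Sketch: SVD of a rectangular matrix, with the singular values in an m x n block-diagonal D.
import numpy as np

rng = np.random.default_rng(4)
m, n = 5, 3
A = rng.standard_normal((m, n))

U2, s, U1h = np.linalg.svd(A)          # U2 is m x m, U1h is n x n, s has min(m, n) entries
D = np.zeros((m, n))
D[:len(s), :len(s)] = np.diag(s)       # top-left block holds the singular values

print(np.allclose(A, U2 @ D @ U1h))                   # A = U2 D U1*
print(np.linalg.matrix_rank(A), np.sum(s > 1e-12))    # r nonzero singular values
```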
As long as the standard bases $E_V$ and $E_W$ are orthonormal with respect to the inner products given for $V$ and $W$, the change-of-basis matrices $U = {}_{E_V}[I]_{B}$ and $U' = {}_{E_W}[I]_{B'}$ are unitary, and in particular $U^{-1} = {}_{B}[I]_{E_V} = U^{*}$. Letting $D = {}_{B'}[T]_{B}$,
$$_{E_W}[T]_{E_V} = {}_{E_W}[I]_{B'} \; {}_{B'}[T]_{B} \; {}_{B}[I]_{E_V} = U' D U^{*}.$$

Statement of the SVD Theorem for $T \in \mathcal{L}(V, W)$, Axler style: Let $T \in \mathcal{L}(V, W)$, where $\dim(V) = n$ and $\dim(W) = m$. Then $r = \dim(\operatorname{range}(T)) \le \min(n, m)$, and there exist orthonormal bases $B_1 = (u_1, \ldots, u_n)$ for $V$ and $B_2 = (u_1', \ldots, u_m')$ for $W$ such that for all $v \in V$,
$$Tv = \lambda_1 \langle v, u_1 \rangle u_1' + \cdots + \lambda_r \langle v, u_r \rangle u_r',$$
where $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r$ are the nonzero singular values of $T$.

Is the SVD useful? Yes, very, and not only to mathematicians. As you might imagine, the SVD has some important geometric applications. Chemists use it to compare molecules by rotating them around until as many points match up as possible. It is also used to reconstruct 3-D distances between objects in a photograph. The SVD is used to build separable models in biology and to model entanglement in quantum mechanics. Perhaps the most widespread use of the SVD, though, is in principal component approximation, especially principal component analysis.

Principal Component Approximation

Consider the components $\lambda_i \langle v, u_i \rangle u_i'$ in the expression $Tv = \lambda_1 \langle v, u_1 \rangle u_1' + \cdots + \lambda_r \langle v, u_r \rangle u_r'$. Because $\lambda_1 \ge \lambda_2 \ge \cdots \ge \lambda_r$, the earlier components contribute more towards $Tv$ than the later components. It seems we could get a pretty good approximation of $T$ by throwing away the components whose singular values (the $\lambda_i$) we decide are too small. We can call approximating $T$ by the first $k$ components of the SVD that we consider significant Principal Component Approximation. More formally, define $T_k$ by
$$T_k v = \lambda_1 \langle v, u_1 \rangle u_1' + \cdots + \lambda_k \langle v, u_k \rangle u_k' \quad \text{for all } v \in V,$$
where $k < r$. (You could choose $k \ge r$, but there is not much point; you would just get $T$ back.) Notice that $T_k v$ is the projection of $Tv$ onto $W_k = \operatorname{span}(u_1', \ldots, u_k')$, so $T_k v$ is the closest approximation to $Tv$ in $W_k$. (It can be shown that $T_k$ is the closest rank-$k$ approximation to $T$ according to the Frobenius norm, $\|T\| = \sqrt{\operatorname{trace}(T^* T)}$, but I won't do that here.) The vectors $u_1', \ldots, u_r'$ are called the normalized principal component vectors of $T$. When $T$ is represented by the $m \times n$ matrix $M$ with SVD $M = U_2 D U_1^{*}$, then $M_k = U_2 D_k U_1^{*}$, where $D_k$ is $D$ with all but the first $k$ diagonal entries $\lambda_1, \ldots, \lambda_k$ replaced by 0, is the matrix form of $T_k$.

Uses of Principal Component Approximation

Principal component approximation can be used for image compression, to build pattern-recognizing software (look up "eigenfaces"), or to improve matrix computations in the face of a computer's limited precision (it is often better to approximate a matrix using only the singular values that are well away from a machine's precision than to use the original matrix).
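Below is a small sketch of my own (not part of the notes) of this rank-k principal component approximation: it truncates the SVD of a random matrix and shows the Frobenius-norm error shrinking as k grows.

```python
# Sketch: rank-k principal component approximation M_k via a truncated SVD.
import numpy as np

rng = np.random.default_rng(5)
M = rng.standard_normal((8, 6))

U2, s, U1h = np.linalg.svd(M, full_matrices=False)

def rank_k_approx(k):
    """Keep only the k largest singular values (and their singular vectors)."""
    return U2[:, :k] @ np.diag(s[:k]) @ U1h[:k, :]

for k in (1, 2, 4, 6):
    err = np.linalg.norm(M - rank_k_approx(k), 'fro')
    print(k, round(err, 4))        # the error shrinks as k grows and is 0 at full rank
```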
Principal component approximation is most famously used in principal component analysis (PCA for short), a widely used statistical tool of interest to many people besides statisticians (some of the examples above use PCA).

Principal Component Analysis

Suppose an experiment measuring $n$ random variables is repeated $m$ times and the results are recorded in a table with $n$ columns, each column containing the $m$ data points gathered for a given random variable. The data is mean-centered by subtracting the column average of a given column from every entry in that column, so that what is left is how much each measurement of a random variable deviates from that random variable's sample mean. Let $M$ be a mean-centered $m \times n$ matrix. The sample correlation between any two random variables (column vectors) recorded in $M$ is measured by the cosine of the angle between them, according to the usual dot-product definition of the cosine between two vectors. Random variables are considered uncorrelated if they (or rather, the column vectors in $M$ representing them) are mutually orthogonal. The variance among all random variables recorded in $M$ is given by the covariance matrix
$$C = \frac{1}{m-1} M^{*} M.$$
(Depending on statistical considerations, the covariance matrix may take slightly different forms.) The goal of principal component analysis is to find uncorrelated factors (mutually orthogonal random variables composed of linear combinations of the random variables recorded in the experiment) which explain most of the variance in the experiment. If the SVD of the mean-centered data matrix $M$ is $M = U_2 D U_1^{*}$, then
$$C = \frac{1}{m-1} \big(U_2 D U_1^{*}\big)^{*} \big(U_2 D U_1^{*}\big) = \frac{1}{m-1} U_1 D^{*} U_2^{*} U_2 D U_1^{*} = \frac{1}{m-1} U_1 D^{*} D U_1^{*} = \frac{1}{m-1} U_1 D^2 U_1^{*},$$
where $D^2 = D^* D$ is the $n \times n$ diagonal matrix of squared singular values. It seems reasonable to approximate $C$ by
$$C_k = \frac{1}{m-1} U_1 D_k^2 U_1^{*}, \quad k < \operatorname{rank}(M),$$
where $D_k$ is $D$ with all but the $k$ largest singular values $\lambda_1, \ldots, \lambda_k$ set to 0, and this is indeed what is done. $C_k$ ought to look familiar, for $C_k = \frac{1}{m-1} M_k^{*} M_k$, where $M_k$ is the principal component approximation to $M$ of rank $k$, defined earlier. PCA is often used with $k = 2$ or $3$ in order to graph statistical data in the most revealing way possible. There's even a song about this on YouTube (called "It had to be U").
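Here is a toy PCA sketch of my own (Python with numpy; the data, sizes, and variable names are all made up): it mean-centers a data matrix, checks the covariance identity above, and projects each observation onto the top two principal directions.

```python
# Sketch: PCA of a mean-centered data matrix via the SVD.
import numpy as np

rng = np.random.default_rng(6)
m, n, k = 100, 5, 2                      # m observations of n variables; keep k components
X = rng.standard_normal((m, n)) @ rng.standard_normal((n, n))    # columns made correlated

M = X - X.mean(axis=0)                   # mean-center each column
U2, s, U1h = np.linalg.svd(M, full_matrices=False)

C = (M.T @ M) / (m - 1)                  # sample covariance matrix
print(np.allclose(C, U1h.T @ np.diag(s**2 / (m - 1)) @ U1h))   # C = U1 D^2 U1* / (m-1)

scores = M @ U1h[:k].T                   # coordinates of each observation in the top-k PCs
print(scores.shape)                      # (100, 2): ready to plot
```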
Principal Component Analysis and Thorny Devils

Several years ago, two biologists decided to study what makes the thorny devil, a slow-moving Australian lizard rejoicing in the Latin name Moloch horridus, so good at eating ants. They compared the thorny devil to three other lizards: one related to the thorny devil but not an ant-eating specialist (the bearded dragon, Pogona vitticeps); another that specializes in eating ants but is not closely related to the thorny devil (the horned lizard or horny toad, Phrynosoma platyrhinos); and a third that is neither closely related to the thorny devil nor an ant-eating specialist (the fringe-toed lizard, Uma notata). The biologists used high-speed cameras to record profile views of all four species eating ants, then computed 27 kinematic variables from the footage by measuring the movement of seven anatomical landmarks in the x-y plane. By performing Principal Component Analysis on the variables, they were able to find three principal components that explained 71% of the variation in the data. The first component, which accounted for 45% of the variation, corresponded to the habits that set the ant-eating specialists apart from the generalists. The other two components seemed to correspond to which lizards were most closely related to one another.

What's interesting about this experiment is that it offered at least two opportunities to use the SVD. Obviously, the SVD could have been used to do the Principal Component Analysis. But the SVD could also have been used to recover more, and more accurate, initial data: it's hard to get a profile view of animals eating, so most of the footage of the lizards had to be thrown out. The biologists then took their measurements from footage where the lizards were approximately in profile view, apparently without correcting for the fact that the views weren't perfect profiles. They could have used the SVD to un-project foreshortened footage of the lizards so that more of the footage would have been usable, and the measurements from the footage they did use would have been more accurate.
Practice Problems for SVD and Polar Decomposition

1) Find the SVD of the following real $2 \times 2$ matrix:
$$\begin{pmatrix} -2 & -1/2 \\ -2 & 1/2 \end{pmatrix}.$$
The rotation matrix
$$\begin{pmatrix} \cos x & \sin x \\ -\sin x & \cos x \end{pmatrix}$$
may help.

2) Professor Eigenwichser attempts to demonstrate an SVD for a linear operator $T$ on $\mathbb{C}^4$ to his class. Unfortunately, he has found bases $B = (u_1, u_2, u_3, u_4)$ and $B' = (u_1', u_2', u_3', u_4')$ for which his matrix of singular values $_{B'}[T]_{B}$ has diagonal entries -1, 3, -5, and 0. But singular values are supposed to be nonnegative (since they are the eigenvalues of the square root of $T^*T$, a positive operator). What could he do to his bases to fix this?

3) Suppose $T : V \to W$ is a linear map but not an operator. Letting $R$ be the square root of $T^*T$, show that $\|Tv\| = \|Rv\|$ for all $v \in V$, even though $Tv \in W$ and $Rv \in V$.

4) Again, suppose $T$ is a linear map but not an operator. Show that the map $S_1 : \operatorname{range}(R) \to \operatorname{range}(T)$ defined by $S_1(Rv) = Tv$ is an isometry, even though $\operatorname{range}(R) \subseteq V$ and $\operatorname{range}(T) \subseteq W$.
References

- Axler, Linear Algebra Done Right
- Steven J. Leon, Linear Algebra with Applications (this is the only source I found that made statistics notation and PCA easy to follow for someone with only a linear algebra background)
- "Image Filtering via SVD"
- Wikipedia's articles on SVD and PCA (though the math is presented differently from how we're doing it)
- "It had to be U" (the SVD song)
- Scholarpedia's article on Eigenfaces
- "Prey capture kinematics of ant-eating lizards" by Meyers and Herrel
- Eric Pianka's Thorny Devil page

Special thanks to my husband, Jerry Fusselman, for being my TeX hero until TeX conked out on us.
More information