CSE 521: Design and Analysis of Algorithms I                              Spring 2016
Lecture 9: SVD, Low Rank Approximation
Lecturer: Shayan Oveis Gharan    April 25th    Scribe: Koosha Khalvati

Disclaimer: These notes have not been subjected to the usual scrutiny reserved for formal publications.

9.1 Singular Value Decomposition (SVD)

Claim 9.1. For a matrix $M \in \mathbb{R}^{m \times n}$ ($m \geq n$), there exist two sets of $k$ orthonormal columns $v_1, \dots, v_k$ and $u_1, \dots, u_k$ such that
\[ M = \sum_{i=1}^{k} \sigma_i u_i v_i^T, \]
where for each $i$, $\sigma_i > 0$, and $k \leq \min\{m, n\}$.

In the above representation, $\sigma_i$ is called a singular value, $u_i$ a left singular vector, and $v_i$ a right singular vector.

Proof. To prove the existence of these two sets of columns, we construct them. First, note that $M^T M \in \mathbb{R}^{n \times n}$ is a symmetric and positive semidefinite matrix. To see the latter, observe that for any vector $x \in \mathbb{R}^n$,
\[ x^T (M^T M) x = \langle Mx, Mx \rangle = \|Mx\|^2 \geq 0. \]
Secondly, by the spectral theorem, there exist eigenvalues $\lambda_1, \dots, \lambda_n$ with corresponding orthonormal eigenvectors $v_1, \dots, v_n$ such that
\[ M^T M = \sum_{i=1}^{n} \lambda_i v_i v_i^T. \]
We use these $v_i$'s (the eigenvectors of $M^T M$) as the right singular vectors in the SVD of $M$. Note that in the above representation, $\lambda_i \geq 0$ for all $i$. We also need to throw away all pairs $v_i, \lambda_i$ where $\lambda_i = 0$.

It remains to construct the $u_i$'s. We let each $u_i$ be proportional to $M v_i$. Firstly, observe that for $i \neq j$, $M v_i$ and $M v_j$ are orthogonal:
\[ \langle M v_i, M v_j \rangle = \langle M^T M v_i, v_j \rangle = \lambda_i \langle v_i, v_j \rangle = \begin{cases} \lambda_i & \text{if } i = j, \\ 0 & \text{otherwise.} \end{cases} \tag{9.1} \]
So, we just need to normalize. Define
\[ u_i = \frac{M v_i}{\|M v_i\|} = \frac{M v_i}{\sqrt{\lambda_i}} = \frac{M v_i}{\sigma_i}. \]
Note that the singular values $\sigma_i$ are equal to $\sqrt{\lambda_i}$; since $M^T M$ is PSD, $\lambda_i \geq 0$ and $\sigma_i$ is well defined. In particular, observe that if $M$ is a symmetric matrix, $\sigma_i$ is the absolute value of the $i$-th eigenvalue of $M$.

Now we want to show that these $v_i$'s and $u_i$'s meet the SVD conditions. Recall that the $v_i$'s are orthonormal because they are eigenvectors of $M^T M$, and the $u_i$'s are orthonormal by (9.1). We claim that
\[ M = \sum_{i=1}^{k} \sigma_i u_i v_i^T. \]
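The construction in this proof can be checked numerically. Below is a minimal numpy sketch (the test matrix, random seed, and the $10^{-12}$ cutoff for "zero" eigenvalues are illustrative choices, not part of the notes): we eigendecompose $M^T M$, discard zero eigenvalues, set $\sigma_i = \sqrt{\lambda_i}$ and $u_i = M v_i / \sigma_i$, and verify both the orthonormality from (9.1) and the reconstruction $M = \sum_i \sigma_i u_i v_i^T$.

```python
import numpy as np

rng = np.random.default_rng(0)
M = rng.standard_normal((5, 3))      # arbitrary test matrix with m >= n

# Spectral decomposition of the PSD matrix M^T M (eigenvalues in ascending order).
lam, V = np.linalg.eigh(M.T @ M)
order = np.argsort(lam)[::-1]        # reorder eigenpairs so that lambda_1 >= lambda_2 >= ...
lam, V = lam[order], V[:, order]

keep = lam > 1e-12                   # throw away eigenpairs with lambda_i = 0
sigma = np.sqrt(lam[keep])           # singular values sigma_i = sqrt(lambda_i)
V = V[:, keep]                       # right singular vectors v_i
U = (M @ V) / sigma                  # left singular vectors u_i = M v_i / sigma_i

# Orthonormality of the u_i's, as in (9.1), and the SVD identity itself.
assert np.allclose(U.T @ U, np.eye(U.shape[1]))
assert np.allclose(M, sum(sigma[i] * np.outer(U[:, i], V[:, i]) for i in range(len(sigma))))
```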
One way to prove that two operators are the same is to show that they act identically on a set of basis vectors $w_1, \dots, w_n$. In particular, we show that for any $j$,
\[ M v_j = \Big( \sum_{i=1}^{k} \sigma_i u_i v_i^T \Big) v_j. \tag{9.2} \]
If the above holds for all $j$, then since $v_1, \dots, v_n$ form a basis, any vector $v$ can be written as $v = \sum_{j=1}^{n} \langle v, v_j \rangle v_j$, and therefore
\[ Mv = M \sum_{j=1}^{n} \langle v, v_j \rangle v_j = \sum_{j=1}^{n} \langle v, v_j \rangle M v_j = \sum_{j=1}^{n} \langle v, v_j \rangle \Big( \sum_{i=1}^{k} \sigma_i u_i v_i^T \Big) v_j = \Big( \sum_{i=1}^{k} \sigma_i u_i v_i^T \Big) v. \]
So the two matrices are indeed equal. It remains to prove (9.2) for all $j$:
\[ \Big( \sum_{i=1}^{k} \sigma_i u_i v_i^T \Big) v_j = \sum_{i=1}^{k} \sigma_i u_i \langle v_i, v_j \rangle = \sigma_j u_j = \sigma_j \cdot \frac{M v_j}{\sigma_j} = M v_j. \]
This is because $\langle v_i, v_j \rangle$ is equal to 0 when $i$ and $j$ are different and equal to 1 when $i = j$. In the last equality we used the definition of $u_j$. (For $j > k$ both sides of (9.2) are zero, since $\lambda_j = 0$ implies $\|M v_j\|^2 = \lambda_j = 0$.) This proves (9.2), as desired.

9.2 Low Rank Approximation

In the rest of this lecture and part of the next one we study low rank approximation of matrices. First, let us define the rank of a matrix. There are many ways one can define it: the rank of a matrix $M$, $\mathrm{rank}(M)$, is the number of linearly independent columns of $M$. It is also equal to the number of linearly independent rows of $M$. In addition, it is equal to the number of nonzero singular values of $M$ (and, for a symmetric matrix, the number of nonzero eigenvalues). It is possible to prove that all of the definitions above are equivalent.

9.2.1 Motivation

In low rank approximation, the goal is to approximate a given matrix $M$ with a low rank matrix $M_k$, such that $M - M_k$ is approximately zero. We have not specified the norm; in general one should choose the norm based on the specific application. This area is also known as principal component analysis. It has a similar flavor to the dimension reduction technique that we studied a few lectures ago: we have high dimensional data represented as a matrix and we want to approximate it with a much lower dimensional object. Note that a rank $k$ approximation of $M$ can be expressed using only $k$ singular values and the corresponding $k$ singular vectors. So, in principle, a rank $k$ approximation only needs $O(kn)$ bits of memory; at a very high level, one can use low rank approximation to approximately store an $m \times n$ matrix with only $O(kn)$ bits. In the next lecture we will see some of its applications in optimization. Besides these applications, one can use low rank approximation to reveal hidden structures in a given data set. Let us give several examples of this phenomenon.
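Before the examples, here is a quick numerical illustration of the storage claim above, as a minimal numpy sketch (the sizes, seed, and rank threshold are made up for the example): an exactly rank $k$ matrix is rebuilt from its top $k$ singular triples, i.e. from about $k(m + n + 1)$ numbers instead of $mn$.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n, k = 200, 100, 5
M = rng.standard_normal((m, k)) @ rng.standard_normal((k, n))  # an exactly rank-k matrix

U, s, Vt = np.linalg.svd(M, full_matrices=False)
print(int(np.sum(s > 1e-8)))             # rank(M) = number of nonzero singular values -> 5

M_k = (U[:, :k] * s[:k]) @ Vt[:k, :]     # rebuild from the k singular triples only
assert np.allclose(M, M_k)
print(k * (m + n + 1), "numbers stored instead of", m * n)
```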
9.2.2 Application 1: Consumer-Product Matrix

A consumer-product matrix is a matrix $M \in \mathbb{R}^{n \times d}$ where each row corresponds to a consumer and each column corresponds to a product. Entry $(i, j)$ of the matrix represents the probability that consumer $i$ purchases product $j$. We can hypothesize that there are $k$ hidden features of consumers, such as age, gender, annual income, etc., and that the decision of each consumer is only a function of these hidden features. With this hypothesis we can rewrite $M = AB$ as a product of two matrices: a factor weight matrix $A \in \mathbb{R}^{n \times k}$ and a purchase probability matrix $B \in \mathbb{R}^{k \times d}$. In particular, each row of $A$ represents a consumer as a weighted sum of the $k$ underlying features, and each row of $B$ represents the purchase probabilities of a consumer with only one feature.

Ideally, all entries of the consumer-product matrix are available and we can use a low rank approximation of $M$ to find these $k$ hidden features, for a small value of $k$. In the real world, however, we usually have only some of the entries of the matrix, so our goal is to predict the unknown entries. There are several examples of such a question. In the Netflix challenge, we were given partial ratings of the Netflix users, and our goal was to predict the rating of each user for the rest of the movies. In online advertisement, the goal is to find the best ad to show to a given user, given that they have purchased one or two items in the past.

9.2.3 Application 2: Hidden Partitioning

Given a graph, assume that there are two hidden communities such that the probability $p$ of an edge between nodes of the same community is much larger than the probability $q$ of an edge connecting nodes in different communities. Given a sample of such a graph, our goal is to recover the hidden communities with high probability. If we reorder the vertices such that the first community consists of nodes $1, \dots, n/2$ and the second one consists of nodes $n/2 + 1, \dots, n$, the expected adjacency matrix looks like the following (each entry below denotes an $(n/2) \times (n/2)$ block with that constant value):
\[ M_2 = \begin{bmatrix} p & q \\ q & p \end{bmatrix} \]
Observe that the above matrix is a rank 2 matrix. So one may expect that by using a rank 2 approximation of the adjacency matrix of the given graph, one can recover the hidden partition. This is indeed possible assuming $p$ is sufficiently larger than $q$ [McS01]. (A small numerical simulation of this is sketched below, after the next example.)

9.2.4 Application 3: Image Compression

As each image is represented by a matrix, it is possible to compress an image by approximating it with a lower-rank matrix. To see an example of image compression by lower-rank matrix approximation in MATLAB, please check the course homepage.
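To make the hidden partitioning example of Section 9.2.3 concrete, here is a minimal Python/numpy simulation (the course demo is in MATLAB; the parameters $n$, $p$, $q$ and the seed below are made-up illustrative values). We sample a graph from the two-community model and read the partition off the sign pattern of the second right singular vector, which is the information a rank 2 approximation of the adjacency matrix exposes.

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, q = 200, 0.7, 0.1                    # assumes p sufficiently larger than q
labels = np.repeat([0, 1], n // 2)         # ground-truth communities

# Sample a symmetric 0/1 adjacency matrix whose expectation has the block structure M_2.
P = np.where(labels[:, None] == labels[None, :], p, q)
A = (rng.random((n, n)) < P).astype(float)
A = np.triu(A, 1)
A = A + A.T

# The top singular vector is roughly constant; the second one splits the communities.
U, s, Vt = np.linalg.svd(A)
guess = (Vt[1] > 0).astype(int)

# Communities are recovered up to relabeling.
agreement = max(np.mean(guess == labels), np.mean(guess != labels))
print(f"fraction of correctly classified nodes: {agreement:.2f}")
```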
9.2.5 Principal Component Analysis

In this section we study low rank approximation with respect to the operator norm.

Definition 9.2 (Operator Norm). The operator norm of a matrix $M$ is defined as follows:
\[ \|M\|_2 = \max_{x \neq 0} \frac{\|Mx\|_2}{\|x\|_2}. \]

In words, the operator norm shows how much larger a vector can get once it is multiplied by $M$. It follows from the Rayleigh quotient that the operator norm of $M$ is equal to its largest singular value.

Claim 9.3. $\|M\|_2 = \sigma_{\max}(M)$.

Proof. We can write
\[ \|M\|_2^2 = \max_{x \neq 0} \frac{\|Mx\|_2^2}{\|x\|_2^2} = \max_{x \neq 0} \frac{x^T M^T M x}{x^T x} = \lambda_{\max}(M^T M) = \sigma_{\max}(M)^2, \]
where the second-to-last equality follows from the Rayleigh quotient.

In the following theorem we show that the best rank $k$ approximation with respect to the operator norm satisfies $\|M - M_k\|_2 = \sigma_{k+1}$. Note that this approximation does not directly imply that the entries of $M - M_k$ are close to zero; it only upper bounds the action of $M - M_k$ on vectors.

Theorem 9.4. Let $M \in \mathbb{R}^{m \times n}$ with singular values $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_n$. For any integer $k \geq 1$,
\[ \min_{\widetilde{M}_k : \mathrm{rank}(\widetilde{M}_k) = k} \|M - \widetilde{M}_k\|_2 = \sigma_{k+1}. \tag{9.3} \]

Proof. We need to prove two statements: (i) there exists a rank $k$ matrix $M_k$ such that $\|M - M_k\|_2 = \sigma_{k+1}$; (ii) for any rank $k$ matrix $\widetilde{M}_k$, $\|M - \widetilde{M}_k\|_2 \geq \sigma_{k+1}$.

We start with part (i). We let
\[ M_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T. \]
By definition of $M_k$,
\[ M - M_k = \sum_{i=k+1}^{n} \sigma_i u_i v_i^T, \]
so
\[ \|M - M_k\|_2 = \sigma_{\max}\Big( \sum_{i=k+1}^{n} \sigma_i u_i v_i^T \Big) = \sigma_{k+1}, \]
where the last equality uses Claim 9.3.
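Part (i) is easy to check numerically before we continue with part (ii). A minimal numpy sketch (arbitrary test matrix and seed; illustrative only): truncating the SVD after $k$ terms leaves an operator-norm error of exactly $\sigma_{k+1}$, and random rank $k$ competitors never do better.

```python
import numpy as np

rng = np.random.default_rng(3)
M = rng.standard_normal((8, 6))            # arbitrary test matrix
k = 2

U, s, Vt = np.linalg.svd(M, full_matrices=False)
M_k = (U[:, :k] * s[:k]) @ Vt[:k, :]       # M_k = sum_{i <= k} sigma_i u_i v_i^T

# Part (i): ord=2 is the operator norm; the error is sigma_{k+1} (s[k], 0-indexed).
assert np.isclose(np.linalg.norm(M - M_k, ord=2), s[k])

# Consistent with part (ii): random rank-k matrices never beat sigma_{k+1}.
for _ in range(1000):
    B = rng.standard_normal((8, k)) @ rng.standard_normal((k, 6))   # a rank-k matrix
    assert np.linalg.norm(M - B, ord=2) >= s[k] - 1e-9
```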
Now we prove part (ii). Assume $\widetilde{M}_k$ is an arbitrary rank $k$ matrix. For a matrix $M$, the null space of $M$,
\[ \mathrm{NULL}(M) = \{x : Mx = 0\}, \]
is the set of vectors that $M$ maps to zero. Let $\mathrm{null}(M)$ denote the dimension of the null space of $M$. It is a well known fact that for any $M \in \mathbb{R}^{m \times n}$,
\[ \mathrm{rank}(M) + \mathrm{null}(M) = n. \]
This is because any vector which is orthogonal to the right singular vectors of $M$ is in $\mathrm{NULL}(M)$; so the above equality follows from the fact that $M$ has $\mathrm{rank}(M)$ right singular vectors (with positive singular values). As an application, since $\widetilde{M}_k$ has rank $k$, we must have
\[ \mathrm{null}(\widetilde{M}_k) = n - \mathrm{rank}(\widetilde{M}_k) = n - k. \tag{9.4} \]
By the Rayleigh quotient (in its min-max form), we can write
\begin{align*}
\sigma_{k+1}(M)^2 = \lambda_{k+1}(M^T M) &= \min_{S : \dim(S) = n-k} \; \max_{x \in S,\, x \neq 0} \frac{x^T M^T M x}{x^T x} \\
&\leq \max_{x \in \mathrm{NULL}(\widetilde{M}_k),\, x \neq 0} \frac{x^T M^T M x}{x^T x} \\
&= \max_{x \in \mathrm{NULL}(\widetilde{M}_k),\, x \neq 0} \frac{x^T (M - \widetilde{M}_k)^T (M - \widetilde{M}_k) x}{x^T x} \\
&\leq \max_{x \neq 0} \frac{x^T (M - \widetilde{M}_k)^T (M - \widetilde{M}_k) x}{x^T x}.
\end{align*}
The first inequality uses the fact that, by (9.4), $\mathrm{NULL}(\widetilde{M}_k)$ is an $(n-k)$-dimensional linear space; so $S = \mathrm{NULL}(\widetilde{M}_k)$ is a special case of $S$ being an $(n-k)$-dimensional linear space. The second equality uses that $\widetilde{M}_k x = 0$ for any $x \in \mathrm{NULL}(\widetilde{M}_k)$. Now we are done using another application of the Rayleigh quotient:
\[ \max_{x \neq 0} \frac{x^T (M - \widetilde{M}_k)^T (M - \widetilde{M}_k) x}{x^T x} = \lambda_{\max}\big( (M - \widetilde{M}_k)^T (M - \widetilde{M}_k) \big) = \sigma_{\max}(M - \widetilde{M}_k)^2. \]
This completes the proof of Theorem 9.4.

9.2.6 Approximation in Frobenius Norm

In this section we study low rank approximation with respect to a different norm, known as the Frobenius norm.

Definition 9.5 (Frobenius Norm). For any matrix $M$, the Frobenius norm of $M$ is defined as follows:
\[ \|M\|_F = \sqrt{\sum_{i,j} M_{i,j}^2}. \]

Again, we are looking for the best rank $k$ approximation, now with respect to the Frobenius norm.

Theorem 9.6. Given a matrix $M \in \mathbb{R}^{m \times n}$ with singular values $\sigma_1 \geq \sigma_2 \geq \dots \geq \sigma_n$, we have
\[ \min_{\widetilde{M}_k : \mathrm{rank}(\widetilde{M}_k) = k} \|M - \widetilde{M}_k\|_F^2 = \sum_{i=k+1}^{n} \sigma_i^2. \tag{9.5} \]

Before proving the theorem, let us see how we can relate the Frobenius norm to the singular values of a matrix.

Claim 9.7. For any matrix $M \in \mathbb{R}^{m \times n}$,
\[ \|M\|_F^2 = \sum_{i=1}^{n} \sigma_i^2. \]

Proof. First, observe that the $(i,i)$-th entry of the matrix $M^T M$ is the inner product of the $i$-th column of $M$ with itself:
\[ (M^T M)_{i,i} = \sum_{j} M_{j,i}^2. \]
Summing over all $i$, we have
\[ \sum_{i} (M^T M)_{i,i} = \sum_{i,j} M_{i,j}^2 = \|M\|_F^2. \]
Now, observe that the LHS is exactly $\mathrm{Tr}(M^T M)$. In the last lecture we proved that the trace of a symmetric matrix is equal to the sum of its eigenvalues. So the LHS is equal to $\sum_{i=1}^{n} \lambda_i(M^T M) = \sum_{i=1}^{n} \sigma_i^2$.

To prove Theorem 9.6 we need to prove two statements: (i) there is a rank $k$ matrix $M_k$ such that $\|M - M_k\|_F^2 = \sum_{i=k+1}^{n} \sigma_i^2$; (ii) for any rank $k$ matrix $\widetilde{M}_k$, $\|M - \widetilde{M}_k\|_F^2 \geq \sum_{i=k+1}^{n} \sigma_i^2$. Here, we only prove (i). Similar to Theorem 9.4, let
\[ M_k = \sum_{i=1}^{k} \sigma_i u_i v_i^T. \]
Then, since the singular values of $M - M_k = \sum_{i=k+1}^{n} \sigma_i u_i v_i^T$ are exactly $\sigma_{k+1}, \dots, \sigma_n$, Claim 9.7 gives
\[ \|M - M_k\|_F^2 = \Big\| \sum_{i=k+1}^{n} \sigma_i u_i v_i^T \Big\|_F^2 = \sum_{i=k+1}^{n} \sigma_i^2, \]
as desired.
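Claim 9.7 and the Frobenius-norm guarantee of Theorem 9.6 admit the same kind of numerical check as before. A minimal numpy sketch (sizes and seed are illustrative): the squared Frobenius norm equals the sum of squared singular values, and truncating after $k$ terms leaves squared error exactly $\sum_{i=k+1}^{n} \sigma_i^2$.

```python
import numpy as np

rng = np.random.default_rng(4)
M = rng.standard_normal((8, 6))
k = 2

U, s, Vt = np.linalg.svd(M, full_matrices=False)

# Claim 9.7: ||M||_F^2 = Tr(M^T M) = sum_i sigma_i^2.
assert np.isclose(np.linalg.norm(M, ord='fro') ** 2, np.sum(s ** 2))

# Theorem 9.6, part (i): the truncated SVD leaves squared error sum_{i > k} sigma_i^2.
M_k = (U[:, :k] * s[:k]) @ Vt[:k, :]
assert np.isclose(np.linalg.norm(M - M_k, ord='fro') ** 2, np.sum(s[k:] ** 2))
```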
References

[McS01] F. McSherry. Spectral Partitioning of Random Graphs. In: FOCS. 2001, pp. 529–537.