CS 229r: Algorithms for Big Data Fall 2015
Prof. Jelani Nelson
Lecture 18 Nov 3rd, 2015
Scribe: Jefferson Lee

1 Overview

Low-rank approximation, compressed sensing.

2 Last Time

We looked at three different regression methods. The first was based on $\varepsilon$-subspace embeddings. The second was an iterative approach, building a well-conditioned matrix good for stochastic gradient descent. The third was formulated as follows: for the least squares problem $\min_x \|Sx - b\|_2$, which has optimal solution $x^* = S^+ b$ and approximate solution $\tilde{x} = \operatorname{argmin}_x \|\Pi S x - \Pi b\|_2$, we let $Sx^* = U\alpha$, $w = Sx^* - b$, and $U\beta = S\tilde{x} - Sx^*$, where $S = U\Sigma V^T$. We proved last time that $(\Pi U)^T (\Pi U)\beta = (\Pi U)^T \Pi w$. These results from regression will reappear in our work on low-rank approximation.

3 Low-rank approximation

The basic setting: we have a huge matrix $A \in \mathbb{R}^{n \times d}$ with $n, d$ both very large, say $n$ users rating $d$ movies. We might believe that the users are linear combinations of a few ($k$) basic types, and we want to discover this low-rank structure. More formally: given a matrix $A \in \mathbb{R}^{n \times d}$, we want to compute

$A_k := \operatorname{argmin}_{\operatorname{rank}(B) \le k} \|A - B\|_X.$

Some now argue that we should instead look for a non-negative matrix factorization; nevertheless, this version is still widely used.

Theorem 1 (Eckart-Young). Let $A = U\Sigma V^T$ be a singular value decomposition of $A$, where $\operatorname{rank}(A) = r$ and $\Sigma$ is diagonal with entries $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_r > 0$. Then under $\|\cdot\|_X = \|\cdot\|_F$, the minimizer is $A_k = U_k \Sigma_k V_k^T$, where $U_k$ and $V_k$ are the first $k$ columns of $U$ and $V$, and $\Sigma_k = \operatorname{diag}(\sigma_1, \ldots, \sigma_k)$.

Our output is then $U_k, \Sigma_k, V_k$. We can calculate $A_k$ in $O(nd^2)$ time by computing the SVD of $A$; we would like to do better. First, a few definitions:

Definition 2. $\operatorname{Proj}_A B$ is the projection of the columns of $B$ onto $\operatorname{colspace}(A)$.

Definition 3. Let $A = U\Sigma V^T$ be a singular value decomposition. $A^+ = V\Sigma^{-1}U^T$ is called the Moore-Penrose pseudoinverse of $A$.
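As a quick numerical sanity check of Theorem 1 and Definition 3, here is a toy sketch in numpy (the matrix and all sizes are arbitrary illustrations):

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 50, 30, 5
A = rng.standard_normal((n, d))

# Truncated SVD: A_k = U_k Sigma_k V_k^T, the best rank-k
# approximation to A in Frobenius norm (Eckart-Young).
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]

# Any other rank-k matrix B should do no better.
B = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))
assert np.linalg.norm(A - A_k, 'fro') <= np.linalg.norm(A - B, 'fro')

# Moore-Penrose pseudoinverse A^+ = V Sigma^{-1} U^T (Definition 3).
A_pinv = (Vt.T * (1.0 / s)) @ U.T
assert np.allclose(A_pinv, np.linalg.pinv(A))
```

Computing the full SVD here is exactly the $O(nd^2)$ baseline that the rest of this section tries to beat.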
3.1 Algorithm

Today we are going to use a sketch that serves both as a subspace embedding and for approximate matrix multiplication to compute $\tilde{A}_k$ with rank at most $k$ such that $\|A - \tilde{A}_k\|_F \le (1+\varepsilon)\|A - A_k\|_F$, following Sarlós' approach [8]. The first works obtaining some decent error (like $\varepsilon\|A\|_F$) were due to Papadimitriou et al. [7] and Frieze, Kannan and Vempala [5].

Theorem 4. Define $\tilde{A}_k = \operatorname{Proj}_{A\Pi^T,k}(A)$, where $\operatorname{Proj}_{V,k}(A)$ is the best rank-$k$ approximation to $\operatorname{Proj}_V(A)$, i.e., to the projection of the columns of $A$ onto $V$. As long as $\Pi \in \mathbb{R}^{m \times d}$ is a $1/2$ subspace embedding for a certain $k$-dimensional subspace $V_k$ and satisfies approximate matrix multiplication with error $\sqrt{\varepsilon/k}$, then

$\|A - \tilde{A}_k\|_F \le (1 + O(\varepsilon))\|A - A_k\|_F.$

Before we prove this theorem, let us first convince ourselves that this algorithm is fast, and that we can compute $\operatorname{Proj}_{A\Pi^T,k}(A)$ quickly. To satisfy the conditions in the above theorem, we know that $\Pi \in \mathbb{R}^{m \times d}$ can be chosen with $m = O(k/\varepsilon)$, e.g., using a random sign matrix (or slightly larger $m$ using a faster subspace embedding). We need to multiply $A\Pi^T$. We can use a fast subspace embedding to compute $A\Pi^T$ quickly, then compute the SVD of $A\Pi^T = U'\Sigma'V'^T$ in $O(nm^2)$ time. Let $[\cdot]_k$ denote the best rank-$k$ approximation under the Frobenius norm. We then want to compute $[U'U'^TA]_k = U'[U'^TA]_k$. Computing $U'^TA$ takes $O(mnd)$ time, and computing the SVD of $U'^TA$ takes $O(dm^2)$ time. Note that this is already better than the $O(nd^2)$ time to compute the SVD of $A$, but we can do better if we approximate. In particular, by using the right combination of subspace embeddings, for constant $\varepsilon$ the scheme described here can be made to take $O(\operatorname{nnz}(A)) + \tilde{O}(ndk)$ time (where $\tilde{O}$ hides $\log n$ factors). We will shoot instead for $O(\operatorname{nnz}(A)) + \tilde{O}(nk^2)$. Consider that we want to compute

$\tilde{A}_k = \operatorname{argmin}_{X : \operatorname{rank}(X) \le k} \|U'X - A\|_F^2.$

If $X^*$ is the argmin without the rank constraint, then the argmin with the rank constraint is $[U'X^*]_k = U'[X^*]_k$.
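A minimal numpy sketch of this pipeline (a toy illustration: $\Pi$ is a dense random sign matrix, a QR factorization stands in for the SVD of $A\Pi^T$ since only an orthonormal basis $U'$ for its column space is needed, and the test matrix is synthetic near-rank-$k$ data):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, k, m = 200, 100, 5, 40  # m is the sketch size, on the order of k/eps

# Synthetic data: rank-k signal plus small noise.
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))
A += 0.01 * rng.standard_normal((n, d))

# Random sign sketch Pi in R^{m x d}, then A Pi^T in R^{n x m}.
Pi = rng.choice([-1.0, 1.0], size=(m, d)) / np.sqrt(m)
Uprime, _ = np.linalg.qr(A @ Pi.T)  # orthonormal basis for colspace(A Pi^T)

# Proj_{A Pi^T, k}(A) = U' [U'^T A]_k via a small (m x d) SVD.
G = Uprime.T @ A
Ug, sg, Vgt = np.linalg.svd(G, full_matrices=False)
A_k_tilde = Uprime @ (Ug[:, :k] * sg[:k]) @ Vgt[:k, :]

# Compare against the true best rank-k approximation.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
err_best = np.linalg.norm(A - (U[:, :k] * s[:k]) @ Vt[:k, :], 'fro')
err_sketch = np.linalg.norm(A - A_k_tilde, 'fro')
assert err_sketch <= 1.5 * err_best  # (1 + eps)-type error in practice
```

Only small SVDs (of $m \times d$ matrices) are ever taken; the full $n \times d$ SVD at the end is just for comparison.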
Rather than find $X^*$ exactly, we use approximate regression to find an approximately optimal $\tilde{X}$. That is, we compute

$\tilde{X} = \operatorname{argmin}_X \|\Pi' U' X - \Pi' A\|_F^2,$

where $\Pi'$ is an $\alpha$-subspace embedding for the column space of $U'$ (note that $U'$ has rank at most $m$). Then our final output is $U'[\tilde{X}]_k$.

Why does the above work? (Thanks to Michael Cohen for describing the following simple argument.) First note that

$\frac{1+\alpha}{1-\alpha}\|U'X^* - A\|_F^2 \ge \|U'\tilde{X} - A\|_F^2 = \|(U'X^* - A) + U'(\tilde{X} - X^*)\|_F^2 = \|U'X^* - A\|_F^2 + \|U'(\tilde{X} - X^*)\|_F^2 = \|U'X^* - A\|_F^2 + \|\tilde{X} - X^*\|_F^2$
and thus $\|\tilde{X} - X^*\|_F^2 \le O(\alpha)\|U'X^* - A\|_F^2$. The second equality above holds since $U'$ has orthonormal columns and hence preserves Frobenius norms, and the first equality holds since $U'X^* - A$ has a column space orthogonal to the column space of $U'$.

Next, suppose $f, \tilde{f}$ are two functions mapping the same domain to $\mathbb{R}$ such that $|f(x) - \tilde{f}(x)| \le \eta$ for all $x$ in the domain. Then clearly $f(\operatorname{argmin}_x \tilde{f}(x)) \le \min_x f(x) + 2\eta$. Now, let the domain be the set of all rank-$k$ matrices, and let $f(Z) = \|U'X^* - Z\|_F$ and $\tilde{f}(Z) = \|U'\tilde{X} - Z\|_F$. Then $\eta = \|U'X^* - U'\tilde{X}\|_F = \|X^* - \tilde{X}\|_F$. Thus

$\|U'[\tilde{X}]_k - A\|_F^2 = \|U'[\tilde{X}]_k - U'X^*\|_F^2 + \|(I - U'U'^T)A\|_F^2$
$\le (\|U'[X^*]_k - U'X^*\|_F + 2\|X^* - \tilde{X}\|_F)^2 + \|(I - U'U'^T)A\|_F^2$
$\le (\|U'[X^*]_k - U'X^*\|_F + O(\sqrt{\alpha})\|U'X^* - A\|_F)^2 + \|(I - U'U'^T)A\|_F^2$
$= (\|U'[X^*]_k - U'X^*\|_F + O(\sqrt{\alpha})\|U'X^* - A\|_F)^2 + \|U'X^* - A\|_F^2$
$= \|U'[X^*]_k - U'X^*\|_F^2 + O(\sqrt{\alpha})\|U'[X^*]_k - U'X^*\|_F \|U'X^* - A\|_F + O(\alpha)\|U'X^* - A\|_F^2 + \|U'X^* - A\|_F^2$
$= \|U'[X^*]_k - A\|_F^2 + O(\sqrt{\alpha})\|U'[X^*]_k - U'X^*\|_F \|U'X^* - A\|_F + O(\alpha)\|U'X^* - A\|_F^2 \quad (1)$
$\le (1 + O(\alpha))\|U'[X^*]_k - A\|_F^2 + O(\sqrt{\alpha})\|U'[X^*]_k - U'X^*\|_F \|U'X^* - A\|_F \quad (2)$
$\le (1 + O(\alpha))\|U'[X^*]_k - A\|_F^2 + O(\sqrt{\alpha})\|U'[X^*]_k - A\|_F^2 \quad (3)$
$= (1 + O(\sqrt{\alpha}))\|U'[X^*]_k - A\|_F^2,$

where (1) used that $\|U'[X^*]_k - U'X^*\|_F^2 + \|U'X^* - A\|_F^2 = \|U'[X^*]_k - A\|_F^2$, since $U'X^* - A$ has columns orthogonal to the column space of $U'$. Also, (2) used that $\|U'X^* - A\|_F \le \|U'[X^*]_k - A\|_F$, since $U'X^*$ is the best Frobenius-norm approximation to $A$ in the column space of $U'$. Finally, (3) again used $\|U'X^* - A\|_F \le \|U'[X^*]_k - A\|_F$, and also used the triangle inequality $\|U'[X^*]_k - U'X^*\|_F \le \|U'[X^*]_k - A\|_F + \|U'X^* - A\|_F \le 2\|U'[X^*]_k - A\|_F$.

Thus we have established the following theorem, which follows from the above calculations and Theorem 4.

Theorem 5. Let $\Pi_1 \in \mathbb{R}^{m_1 \times d}$ be a $1/2$ subspace embedding for a certain $k$-dimensional subspace $V_k$, and suppose $\Pi_1$ also satisfies approximate matrix multiplication with error $\sqrt{\varepsilon/k}$. Let $\Pi_2 \in \mathbb{R}^{m_2 \times n}$ be an $\alpha$-subspace embedding for the column space of $U'$, where $A\Pi_1^T = U'\Sigma'V'^T$ is the SVD (and hence $U'$ has rank at most $m_1$). Let $\tilde{A}_k = U'[\tilde{X}]_k$, where $\tilde{X} = \operatorname{argmin}_X \|\Pi_2 U' X - \Pi_2 A\|_F^2$. Then $\tilde{A}_k$ has rank at most $k$ and

$\|A - \tilde{A}_k\|_F \le (1 + O(\varepsilon) + O(\sqrt{\alpha}))\|A - A_k\|_F.$

In particular, the error is $(1 + O(\varepsilon))\|A - A_k\|_F$ for $\alpha = \varepsilon^2$.
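The whole Theorem 5 pipeline can be prototyped as follows (a toy sketch: dense sign matrices stand in for the fast embeddings $\Pi_1$ and $\Pi_2$, numpy's least squares solves the small sketched regression, and the sizes are arbitrary; a serious implementation would use sparse or structured embeddings to get the running times discussed above):

```python
import numpy as np

rng = np.random.default_rng(2)
n, d, k = 300, 120, 4
m1, m2 = 40, 200  # sketch sizes for Pi_1 and Pi_2

# Synthetic data: rank-k signal plus small noise.
A = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))
A += 0.01 * rng.standard_normal((n, d))

# Pi_1 sketches on the right: orthonormal basis U' for colspace(A Pi_1^T).
Pi1 = rng.choice([-1.0, 1.0], size=(m1, d)) / np.sqrt(m1)
Uprime, _ = np.linalg.qr(A @ Pi1.T)  # n x m1

# Pi_2 sketches on the left; solve X~ = argmin ||Pi_2 U' X - Pi_2 A||_F.
Pi2 = rng.choice([-1.0, 1.0], size=(m2, n)) / np.sqrt(m2)
Xt, *_ = np.linalg.lstsq(Pi2 @ Uprime, Pi2 @ A, rcond=None)

# Output A_k~ = U' [X~]_k.
Ux, sx, Vxt = np.linalg.svd(Xt, full_matrices=False)
A_k_tilde = Uprime @ (Ux[:, :k] * sx[:k]) @ Vxt[:k, :]

U, s, Vt = np.linalg.svd(A, full_matrices=False)
err_best = np.linalg.norm(A - (U[:, :k] * s[:k]) @ Vt[:k, :], 'fro')
err = np.linalg.norm(A - A_k_tilde, 'fro')
assert err <= 2.0 * err_best
```

Note that the regression is over an $m_2 \times m_1$ design matrix and the final SVD is of the $m_1 \times d$ matrix $\tilde{X}$; nothing of size $n \times d$ is ever factored.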
In the remaining part of these lecture notes, we show that $\operatorname{Proj}_{A\Pi^T,k}(A)$ actually is a good rank-$k$ approximation to $A$ (i.e., we prove Theorem 4). In the following proof, we denote the first $k$ columns of $U$ and $V$ by $U_k$ and $V_k$, and the remaining columns by $U_{\bar{k}}$ and $V_{\bar{k}}$, so that $A_{\bar{k}} = U_{\bar{k}}\Sigma_{\bar{k}}V_{\bar{k}}^T = A - A_k$.

Proof. Let $Y$ be the column span of $\operatorname{Proj}_{A\Pi^T}(A_k)$, and let $P$ be the orthogonal projection operator onto $Y$. Then

$\|A - \operatorname{Proj}_{A\Pi^T,k}(A)\|_F^2 \le \|A - PA\|_F^2 = \|A_k - PA_k\|_F^2 + \|A_{\bar{k}} - PA_{\bar{k}}\|_F^2,$

where the equality holds since $A_k$ and $A_{\bar{k}}$ have orthogonal row spaces. We can bound the second term in that sum:

$\|A_{\bar{k}} - PA_{\bar{k}}\|_F^2 = \|(I - P)A_{\bar{k}}\|_F^2 \le \|A_{\bar{k}}\|_F^2 = \|A - A_k\|_F^2.$

Now we just need to show that $\|A_k - PA_k\|_F^2 \le \varepsilon\|A - A_k\|_F^2$:

$\|A_k - PA_k\|_F^2 \le \|A_k - (A\Pi^T)(A_k\Pi^T)^+ A_k\|_F^2 = \|A_k^T - A_k^T(\Pi A_k^T)^+(\Pi A^T)\|_F^2 = \sum_{i=1}^n \|(A_k^T)^{(i)} - A_k^T(\Pi A_k^T)^+(\Pi A^T)^{(i)}\|_2^2.$

Here superscript $(i)$ means the $i$th column. Now we have a bunch of different approximate regression problems, of the form $\min_x \|\Pi A_k^T x - \Pi(A^T)^{(i)}\|_2$, which has optimal solution $\tilde{x}_i = (\Pi A_k^T)^+(\Pi A^T)^{(i)}$. Consider $\min_x \|A_k^T x - (A^T)^{(i)}\|_2$ as the original regression problem. In this case the optimal $x$ gives $A_k^T x = \operatorname{Proj}_{A_k^T}((A^T)^{(i)}) = (A_k^T)^{(i)}$. Now we can use the analysis of approximate least squares from last week. In our problem, we have a collection of vectors $w_i, \beta_i, \alpha_i$ with $S = A_k^T = V_k \Sigma_k U_k^T$ and $b_i = (A^T)^{(i)}$. Here, $\|w_i\|_2 = \|Sx - b_i\|_2 = \|(A_k^T)^{(i)} - (A^T)^{(i)}\|_2$. Hence $\sum_i \|w_i\|_2^2 = \|A - A_k\|_F^2$. On the other hand, $\sum_i \|\beta_i\|_2^2 = \|A_k^T - A_k^T(\Pi A_k^T)^+(\Pi A^T)\|_F^2$, which is exactly the quantity we want to bound. Since $(\Pi V_k)^T(\Pi V_k)\beta_i = (\Pi V_k)^T \Pi w_i$, and all singular values of $\Pi V_k$ are at least $1/2$ ($\Pi$ is a $1/2$ subspace embedding for the column space of $V_k$), the eigenvalues of $(\Pi V_k)^T(\Pi V_k)$ are at least $1/4$, so

$\sum_i \|\beta_i\|_2^2 \le 16 \sum_i \|(\Pi V_k)^T(\Pi V_k)\beta_i\|_2^2 = 16 \sum_i \|(\Pi V_k)^T \Pi w_i\|_2^2 = 16\|(\Pi V_k)^T \Pi W\|_F^2,$

where $W$ has $w_i$ as its $i$th column. What does this look like? $(\Pi V_k)^T \Pi W$ looks exactly like approximate matrix multiplication of $V_k$ and $W$. Since the columns of $W$ and the columns of $V_k$ are orthogonal to each other, we have $V_k^T W = 0$. Hence, if $\Pi$ is a sketch for approximate matrix multiplication with error $\varepsilon' = \sqrt{\varepsilon/k}$, then

$\mathbb{P}_\Pi\left(\|(\Pi V_k)^T(\Pi W)\|_F^2 > \varepsilon'^2 \|V_k\|_F^2 \|W\|_F^2\right) < \delta,$

and $\varepsilon'^2 \|V_k\|_F^2 = \varepsilon$ since $\|V_k\|_F^2 = k$.
Since clearly $\|W\|_F^2 = \sum_i \|w_i\|_2^2 = \|A - A_k\|_F^2$, we get the desired result (the constant factors are absorbed into the $O(\varepsilon)$ of Theorem 4). $\square$
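The approximate-matrix-multiplication ingredient at the end of the proof is easy to observe numerically (a toy check with a random sign sketch; all dimensions are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
n, k, c, m = 1000, 5, 8, 200

# V has orthonormal columns and W is orthogonal to it, so V^T W = 0,
# mirroring V_k and W in the proof.
Q, _ = np.linalg.qr(rng.standard_normal((n, k + c)))
V, W = Q[:, :k], Q[:, k:]
assert np.allclose(V.T @ W, 0.0)

# Sign sketch Pi: (Pi V)^T (Pi W) approximates V^T W = 0, with
# Frobenius error on the order of ||V||_F ||W||_F / sqrt(m).
Pi = rng.choice([-1.0, 1.0], size=(m, n)) / np.sqrt(m)
err = np.linalg.norm((Pi @ V).T @ (Pi @ W), 'fro')
bound = 3 * np.linalg.norm(V, 'fro') * np.linalg.norm(W, 'fro') / np.sqrt(m)
assert err < bound
```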
3.2 Further results

What we just described gives a good low-rank approximation, but every column of $\tilde{A}_k$ is a linear combination of potentially all the columns of $A$. In applications (e.g., information retrieval), we want a small number of actual columns of $A$ to span our low-dimensional subspace. There has been work on finding a few columns of $A$ (call them $C$) such that $\|A - (CC^+A)_k\|_F^2$ is small, but we will not talk about it deeply. Boutsidis et al. [1] showed that we can take $C$ with roughly $2k/\varepsilon$ columns and error $(1+\varepsilon)\|A - A_k\|_F$. Guruswami and Sinop [6] got $C$ with $k/\varepsilon + k - 1$ columns such that $\|A - CC^+A\|_F \le (1+\varepsilon)\|A - A_k\|_F$.

3.3 K-Means as a Low-Rank Approximation Problem

The k-means problem, which was stated on the problem set, involves a set of points $x_1, \ldots, x_n \in \mathbb{R}^d$. Let $A$ be the matrix whose $i$th row is $x_i^T$. Given a partition $P = (P_1, \ldots, P_k)$ of the points into $k$ clusters, the best centroids are the averages of the clusters. Define the matrix $X_P \in \mathbb{R}^{n \times k}$ such that

$(X_P)_{i,j} = 1/\sqrt{|P_j|}$ if $i \in P_j$, and $0$ otherwise.

Note that $X_P^T X_P = I$. It can be shown that the $i$th row of $X_P X_P^T A$ is the centroid of the cluster that $x_i$ belongs to. Thus, solving k-means is equivalent to finding $P^* = \operatorname{argmin}_P \|A - X_P X_P^T A\|_F^2$; this is a constrained rank-$k$ approximation problem. Cohen et al. [3] show that $\Pi$ can have $m = O(k/\varepsilon^2)$ for a $(1+\varepsilon)$-approximation, or $m = O((\lg k)/\varepsilon^2)$ for a $(9+\varepsilon)$-approximation (the second bound is specifically for the k-means problem). It is an open problem whether this second bound can be improved to a better approximation factor.

4 Compressed Sensing

4.1 Basic Idea

Consider $x \in \mathbb{R}^n$. If $x$ is a $k$-sparse vector, we could represent it in a far more compressed manner. Thus, we define a measure of how compressible a vector is as a measure of how close it is to being $k$-sparse.

Definition 6. Let $x_{\operatorname{head}(k)}$ be the $k$ elements of largest magnitude in $x$, and let $x_{\operatorname{tail}(k)}$ be the rest of $x$. We call $x$ compressible if $\|x_{\operatorname{tail}(k)}\|$ is small.

The goal here is to approximately recover $x$ from few linear measurements.
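Definition 6 in code (a small illustration; the vector and $k$ are arbitrary):

```python
import numpy as np

x = np.array([5.0, -0.1, 0.02, 4.0, -3.0, 0.05, 0.01, -0.2])
k = 3

# Head: the k entries of largest magnitude; tail: everything else.
head_idx = np.argsort(-np.abs(x))[:k]
x_head = np.zeros_like(x)
x_head[head_idx] = x[head_idx]
x_tail = x - x_head

# This x is compressible: its tail carries little of the l1 mass.
assert np.linalg.norm(x_tail, 1) < 0.1 * np.linalg.norm(x, 1)
```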
Suppose we have a matrix $\Pi$ such that the $i$th entry of $\Pi x$ equals $\langle \alpha_i, x \rangle$ for some $\alpha_1, \ldots, \alpha_m \in \mathbb{R}^n$. We want to recover an $\tilde{x}$ from $\Pi x$ such that

$\|x - \tilde{x}\|_p \le C_{\varepsilon,p,q} \|x_{\operatorname{tail}(k)}\|_q,$

where $C_{\varepsilon,p,q}$ is some constant depending on $\varepsilon$, $p$ and $q$. Depending on the problem formulation, we may or may not get to choose the matrix $\Pi$.
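A toy end-to-end illustration of this measurement model, for the special case of an exactly $k$-sparse $x$. The recovery algorithm used here, $\ell_1$ minimization (basis pursuit) written as a linear program, is the classical choice from the compressed sensing literature; the use of scipy's linprog and all the sizes are illustrative assumptions, not part of these notes:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(4)
n, m, k = 60, 30, 3

# A k-sparse signal and m random Gaussian linear measurements y = Pi x.
x = np.zeros(n)
support = rng.choice(n, size=k, replace=False)
x[support] = rng.standard_normal(k)
Pi = rng.standard_normal((m, n)) / np.sqrt(m)
y = Pi @ x

# Basis pursuit: minimize ||z||_1 subject to Pi z = y. Writing
# z = p - q with p, q >= 0 turns this into a linear program.
c = np.ones(2 * n)
A_eq = np.hstack([Pi, -Pi])
res = linprog(c, A_eq=A_eq, b_eq=y, bounds=[(0, None)] * (2 * n))
x_hat = res.x[:n] - res.x[n:]

assert res.status == 0
assert np.linalg.norm(x_hat - x) < 1e-4  # recovery of the sparse x
```

With $m$ comfortably above $k \lg(n/k)$, Gaussian measurements allow this LP to recover the sparse signal with high probability.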
4.2 Approximate Sparsity

There are many practical applications in which approximately sparse vectors appear. Pixelated images, for example, are usually approximately sparse in some basis $U$: for an $n \times n$ image $x \in \mathbb{R}^{n^2}$, we can write $x = Uy$ for some basis $U$ in which $y$ is approximately sparse, and then take measurements $\Pi x = (\Pi U)y$. Images are typically sparse in the wavelet basis. We will describe how to transform to the Haar wavelet basis here. Assume that $n$ is a power of two. Then:

1. Break the image $x$ into blocks of four pixels ($2 \times 2$ squares).
2. Initialize a new image with four regions $R_1, R_2, R_3, R_4$.
3. Each block $b$ of four pixels in $x$ has a corresponding single pixel $R_{1b}, R_{2b}, R_{3b}, R_{4b}$ in each of the four regions, based on its location. For each block $b$ with pixel values $p_1, p_2$ (top row) and $p_3, p_4$ (bottom row):
$R_{1b} \leftarrow \frac{1}{4}(p_1 + p_2 + p_3 + p_4)$
$R_{2b} \leftarrow \frac{1}{4}(p_1 - p_2 + p_3 - p_4)$
$R_{3b} \leftarrow \frac{1}{4}(p_1 - p_2 - p_3 + p_4)$
$R_{4b} \leftarrow \frac{1}{4}(p_1 + p_2 - p_3 - p_4)$
4. Recurse on $R_1, R_2, R_3$, and $R_4$.

The general idea is this: pixel values are usually relatively constant within regions of an image, so the values in all regions except the first are usually relatively small. If you view an image after this transform, the upper left-hand region will often be close to white, while the rest will be relatively sparse.

Theorem 7 (Candès, Romberg, Tao [2], Donoho [4]). There exists $\Pi \in \mathbb{R}^{m \times n}$ with $m = O(k \lg(n/k))$ and a poly-time algorithm Alg such that if $\tilde{x} = \operatorname{Alg}(\Pi x)$, then

$\|x - \tilde{x}\|_2 \le O(k^{-1/2})\|x_{\operatorname{tail}(k)}\|_1.$

If $x$ is actually $k$-sparse, $2k$ measurements are necessary and sufficient. We will see this by examining Prony's method on one of our problem sets, and we will investigate compressed sensing further next class.

References

[1] Christos Boutsidis, Petros Drineas, Malik Magdon-Ismail. Near-Optimal Column-based Matrix Reconstruction. FOCS, 2011.

[2] Emmanuel J. Candès, Justin K. Romberg, Terence Tao. Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information.
IEEE Transactions on Information Theory, 52(2), 2006.

[3] Michael B. Cohen, Sam Elder, Cameron Musco, Christopher Musco, Madalina Persu. Dimensionality Reduction for k-means Clustering and Low Rank Approximation. STOC, 2015.
[4] David L. Donoho. Compressed sensing. IEEE Transactions on Information Theory, 52(4), 2006.

[5] Alan M. Frieze, Ravi Kannan, Santosh Vempala. Fast Monte-Carlo Algorithms for Finding Low-rank Approximations. J. ACM, 51(6), 2004.

[6] Venkatesan Guruswami, Ali Kemal Sinop. Optimal Column-based Low-rank Matrix Reconstruction. SODA, 2012.

[7] Christos H. Papadimitriou, Prabhakar Raghavan, Hisao Tamaki, Santosh Vempala. Latent Semantic Indexing: A Probabilistic Analysis. J. Comput. Syst. Sci., 61(2), 2000.

[8] Tamás Sarlós. Improved Approximation Algorithms for Large Matrices via Random Projections. FOCS, 2006.
More informationPseudoinverse & Moore-Penrose Conditions
ECE 275AB Lecture 7 Fall 2008 V1.0 c K. Kreutz-Delgado, UC San Diego p. 1/1 Lecture 7 ECE 275A Pseudoinverse & Moore-Penrose Conditions ECE 275AB Lecture 7 Fall 2008 V1.0 c K. Kreutz-Delgado, UC San Diego
More informationTHE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR
THE SINGULAR VALUE DECOMPOSITION MARKUS GRASMAIR 1. Definition Existence Theorem 1. Assume that A R m n. Then there exist orthogonal matrices U R m m V R n n, values σ 1 σ 2... σ p 0 with p = min{m, n},
More informationLinear Algebra for Machine Learning. Sargur N. Srihari
Linear Algebra for Machine Learning Sargur N. srihari@cedar.buffalo.edu 1 Overview Linear Algebra is based on continuous math rather than discrete math Computer scientists have little experience with it
More informationThe University of Texas at Austin Department of Electrical and Computer Engineering. EE381V: Large Scale Learning Spring 2013.
The University of Texas at Austin Department of Electrical and Computer Engineering EE381V: Large Scale Learning Spring 2013 Assignment Two Caramanis/Sanghavi Due: Tuesday, Feb. 19, 2013. Computational
More informationarxiv: v2 [cs.ds] 1 May 2013
Dimension Independent Matrix Square using MapReduce arxiv:1304.1467v2 [cs.ds] 1 May 2013 Reza Bosagh Zadeh Institute for Computational and Mathematical Engineering rezab@stanford.edu Gunnar Carlsson Mathematics
More informationSTAT 309: MATHEMATICAL COMPUTATIONS I FALL 2013 PROBLEM SET 2
STAT 309: MATHEMATICAL COMPUTATIONS I FALL 2013 PROBLEM SET 2 1. You are not allowed to use the svd for this problem, i.e. no arguments should depend on the svd of A or A. Let W be a subspace of C n. The
More informationSingular Value Decomposition
Singular Value Decomposition CS 205A: Mathematical Methods for Robotics, Vision, and Graphics Doug James (and Justin Solomon) CS 205A: Mathematical Methods Singular Value Decomposition 1 / 35 Understanding
More informationSignal Recovery from Permuted Observations
EE381V Course Project Signal Recovery from Permuted Observations 1 Problem Shanshan Wu (sw33323) May 8th, 2015 We start with the following problem: let s R n be an unknown n-dimensional real-valued signal,
More informationData Mining Lecture 4: Covariance, EVD, PCA & SVD
Data Mining Lecture 4: Covariance, EVD, PCA & SVD Jo Houghton ECS Southampton February 25, 2019 1 / 28 Variance and Covariance - Expectation A random variable takes on different values due to chance The
More informationFast Monte Carlo Algorithms for Matrix Operations & Massive Data Set Analysis
Fast Monte Carlo Algorithms for Matrix Operations & Massive Data Set Analysis Michael W. Mahoney Yale University Dept. of Mathematics http://cs-www.cs.yale.edu/homes/mmahoney Joint work with: P. Drineas
More informationTighter Low-rank Approximation via Sampling the Leveraged Element
Tighter Low-rank Approximation via Sampling the Leveraged Element Srinadh Bhojanapalli The University of Texas at Austin bsrinadh@utexas.edu Prateek Jain Microsoft Research, India prajain@microsoft.com
More informationDimensionality Reduction: PCA. Nicholas Ruozzi University of Texas at Dallas
Dimensionality Reduction: PCA Nicholas Ruozzi University of Texas at Dallas Eigenvalues λ is an eigenvalue of a matrix A R n n if the linear system Ax = λx has at least one non-zero solution If Ax = λx
More information9 Searching the Internet with the SVD
9 Searching the Internet with the SVD 9.1 Information retrieval Over the last 20 years the number of internet users has grown exponentially with time; see Figure 1. Trying to extract information from this
More informationEE731 Lecture Notes: Matrix Computations for Signal Processing
EE731 Lecture Notes: Matrix Computations for Signal Processing James P. Reilly c Department of Electrical and Computer Engineering McMaster University September 22, 2005 0 Preface This collection of ten
More informationFast low rank approximations of matrices and tensors
Fast low rank approximations of matrices and tensors S. Friedland, V. Mehrmann, A. Miedlar and M. Nkengla Univ. Illinois at Chicago & Technische Universität Berlin Gene Golub memorial meeting, Berlin,
More informationSensing systems limited by constraints: physical size, time, cost, energy
Rebecca Willett Sensing systems limited by constraints: physical size, time, cost, energy Reduce the number of measurements needed for reconstruction Higher accuracy data subject to constraints Original
More informationConditions for Robust Principal Component Analysis
Rose-Hulman Undergraduate Mathematics Journal Volume 12 Issue 2 Article 9 Conditions for Robust Principal Component Analysis Michael Hornstein Stanford University, mdhornstein@gmail.com Follow this and
More informationCollaborative Filtering: A Machine Learning Perspective
Collaborative Filtering: A Machine Learning Perspective Chapter 6: Dimensionality Reduction Benjamin Marlin Presenter: Chaitanya Desai Collaborative Filtering: A Machine Learning Perspective p.1/18 Topics
More informationQuick Introduction to Nonnegative Matrix Factorization
Quick Introduction to Nonnegative Matrix Factorization Norm Matloff University of California at Davis 1 The Goal Given an u v matrix A with nonnegative elements, we wish to find nonnegative, rank-k matrices
More informationOSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings
OSNAP: Faster numerical linear algebra algorithms via sparser subspace embeddings Jelani Nelson Huy L. Nguy ên Abstract An oblivious subspace embedding (OSE) given some parameters ε, d is a distribution
More informationRandomized Algorithms in Linear Algebra and Applications in Data Analysis
Randomized Algorithms in Linear Algebra and Applications in Data Analysis Petros Drineas Rensselaer Polytechnic Institute Computer Science Department To access my web page: drineas Why linear algebra?
More informationRandom Methods for Linear Algebra
Gittens gittens@acm.caltech.edu Applied and Computational Mathematics California Institue of Technology October 2, 2009 Outline The Johnson-Lindenstrauss Transform 1 The Johnson-Lindenstrauss Transform
More informationLecture 16 Oct. 26, 2017
Sketching Algorithms for Big Data Fall 2017 Prof. Piotr Indyk Lecture 16 Oct. 26, 2017 Scribe: Chi-Ning Chou 1 Overview In the last lecture we constructed sparse RIP 1 matrix via expander and showed that
More informationarxiv: v2 [stat.ml] 29 Nov 2018
Randomized Iterative Algorithms for Fisher Discriminant Analysis Agniva Chowdhury Jiasen Yang Petros Drineas arxiv:1809.03045v2 [stat.ml] 29 Nov 2018 Abstract Fisher discriminant analysis FDA is a widely
More informationStructured matrix factorizations. Example: Eigenfaces
Structured matrix factorizations Example: Eigenfaces An extremely large variety of interesting and important problems in machine learning can be formulated as: Given a matrix, find a matrix and a matrix
More information