Randomized Numerical Linear Algebra: Review and Progresses


1 Randomized Numerical Linear Algebra: Review and Progresses. Zhihua Zhang, Department of Computer Science and Engineering, Shanghai Jiao Tong University. The 12th China Workshop on Machine Learning and Applications, Xi'an, November 2014.

2 Randomized Numerical Linear Algebra is an interdisciplinary area among Theoretical Computer Science (TCS), Numerical Linear Algebra (NLA), and Modern Data Analysis. Many data mining and machine learning algorithms involve matrix decompositions, matrix inverses, and matrix determinants, and some methods are based on low-rank matrix approximation. The Big Data phenomenon brings new challenges and opportunities to machine learning and data mining.

3 Singular Value Decomposition (SVD). Input: an m × n data matrix A of rank r and an integer k less than r. The (condensed) SVD: A = UΣV^T, where U^T U = I_r, V^T V = I_r, and Σ = diag(σ_1, ..., σ_r) with σ_1 ≥ σ_2 ≥ ... ≥ σ_r > 0. Time complexity: O(mn min(m, n)). The truncated SVD: A_k = U_k Σ_k V_k^T, where U_k and V_k are the first k columns of U and V, and Σ_k is the top k × k sub-block of Σ. A_k is the closest rank-k approximation to A; that is, A_k = argmin_{rank(X) ≤ k} ‖A − X‖_ξ, where ξ = 2 is the matrix spectral norm and ξ = F is the matrix Frobenius norm. Time complexity: O(mnk).
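For concreteness, the truncated SVD described above can be computed with standard dense routines; the following numpy sketch (the matrix sizes and noise level are illustrative assumptions) forms A_k and reports its Frobenius error.

```python
import numpy as np

# Illustrative sizes: a 1000 x 300 matrix that is approximately rank 10.
rng = np.random.default_rng(0)
m, n, k = 1000, 300, 10
A = rng.standard_normal((m, k)) @ rng.standard_normal((k, n)) \
    + 0.01 * rng.standard_normal((m, n))

# Condensed SVD: an O(mn min(m, n)) dense computation.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Truncated SVD: keep only the top-k singular triplets.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Frobenius-norm error of the best rank-k approximation.
print(np.linalg.norm(A - A_k, 'fro'))
```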


6 The CUR Decomposition. A CUR decomposition algorithm seeks to find a subset of c columns of A to form a matrix C ∈ R^{m×c}, a subset of r rows to form a matrix R ∈ R^{r×n}, and an intersection matrix U ∈ R^{c×r} such that ‖A − CUR‖_ξ is minimized. The CUR decomposition results in an interpretable matrix approximation to A. There are (n choose c) possible choices for constructing C and (m choose r) possible choices for constructing R, so selecting the best subsets is a hard problem.

7 Kernel Methods. K: an n × n kernel matrix. Matrix inverse: b = (K + αI_n)^{-1} y; time complexity O(n^3); performed by Gaussian process regression, least squares SVM, and kernel ridge regression. Partial eigenvalue decomposition of K: time complexity O(n^2 k); performed by kernel PCA and some manifold learning methods. Space complexity: O(n^2); the iterative algorithms make many passes through the data, so one had better keep the entire kernel matrix in RAM; if the data does not fit in RAM, there is one swap between RAM and disk in each pass.
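As an illustration of the O(n^3) kernel solve mentioned above, here is a minimal numpy sketch of kernel ridge regression; the RBF kernel, the bandwidth, and the synthetic data are assumptions made only for the example.

```python
import numpy as np

# Illustrative synthetic data; the RBF kernel and bandwidth are assumptions.
rng = np.random.default_rng(0)
n, d, alpha = 1000, 5, 1e-2
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

# n x n kernel matrix: O(n^2) storage is already the bottleneck for large n.
sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-sq_dists / (2.0 * d))

# b = (K + alpha * I_n)^{-1} y: an O(n^3) dense solve.
b = np.linalg.solve(K + alpha * np.eye(n), y)
print(b[:5])
```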

8 Approaches for Large-Scale Matrix Computations. Two typical approaches: incremental and distributed. Randomized algorithms have also been used.

9 Outline: 1 Randomized SVD; 2 Subspace Embedding; 3 Column Selection, CUR, and the Nyström Approximation; 4 References.


11 The Johnson–Lindenstrauss lemma was given by Johnson and Lindenstrauss (1984), but the proof was not constructive. Indyk and Motwani (1998) and Dasgupta and Gupta (2003) constructed a result based on a Gaussian random projection matrix R = [r_ij], where the r_ij are i.i.d. N(0, 1). Matoušek (2008) generalized the result to the case where the r_ij are any subgaussian random variables; that is, r_ij are i.i.d. G(ν^2) for ν ≥ 1.

12 Definition (ɛ-isometry). Given ɛ ∈ (0, 1), a map f : R^p → R^q, where p > q, is called an ɛ-isometry on a set X ⊂ R^p if for every pair x, y ∈ X we have (1 − ɛ)‖x − y‖_2^2 ≤ ‖f(x) − f(y)‖_2^2 ≤ (1 + ɛ)‖x − y‖_2^2. We consider the case where f is a linear map R ∈ R^{q×p}. The basic idea is to construct a random projection R ∈ R^{q×p} that is an exact isometry "in expectation"; that is, for every x ∈ R^p, E[‖Rx‖_2^2] = ‖x‖_2^2.

13 Theorem. Let X = {x_1, ..., x_n} ⊂ R^p, and let ɛ, δ ∈ (0, 1). Assume that R ∈ R^{q×p} (p > q) with r_ij ∼ G(ν^2) for some ν ≥ 1. If q ≥ 100 ν^2 ɛ^{-2} log(n/δ), then with probability at least 1 − δ, R is an ɛ-isometry on X; that is,
Pr{ sup_{y ∈ Y} | ‖Ry‖_2^2 − 1 | ≥ ɛ } ≤ δ,
where Y = { (x_i − x_j)/‖x_i − x_j‖_2 : x_i, x_j ∈ X, x_i ≠ x_j }.
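A minimal numerical check of this kind of statement: project a point set with a Gaussian matrix whose entries have variance 1/q (so that E‖Rx‖_2^2 = ‖x‖_2^2) and inspect the pairwise distance distortion. The choice q = 8 log(n)/ɛ^2 below is an illustrative constant, not the 100ν^2 of the theorem.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, eps = 500, 10000, 0.2

# n points in R^p (illustrative data).
X = rng.standard_normal((n, p))

# Gaussian projection with N(0, 1/q) entries, so that E||Rx||^2 = ||x||^2.
q = int(np.ceil(8 * np.log(n) / eps ** 2))     # illustrative target dimension
R = rng.standard_normal((q, p)) / np.sqrt(q)
Y = X @ R.T                                     # projected points in R^q

# Check pairwise squared distances on a few random pairs.
i, j = rng.integers(0, n, 200), rng.integers(0, n, 200)
orig = ((X[i] - X[j]) ** 2).sum(1)
proj = ((Y[i] - Y[j]) ** 2).sum(1)
mask = orig > 0
print(np.max(np.abs(proj[mask] / orig[mask] - 1.0)))   # typically below eps
```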

14 Prototype for Randomized SVD. Given an m × n matrix A, a target number k of singular vectors, and an integer c such that k < c < min(m, n), a proto-algorithm based on random projection for the singular value decomposition (SVD) of A is as follows.
1. Construct an m × c column-orthonormal matrix Q and form B = Q^T A;
2. Compute the SVD of the small matrix: B = U_B Σ_B V_B^T;
3. Set Ũ = Q U_B;
4. Return Ũ Σ_B V_B^T as an approximate SVD of A, and Ũ_k Σ_{B,k} V_{B,k}^T as a truncated rank-k SVD of A.

15 A Proto-Algorithm for the Construction of Q. Let A be an m × n matrix and k a target number of singular vectors.
1. Generate an n × 2k Gaussian test matrix Ω;
2. Form Y = (AA^T)^γ AΩ, where γ = 1 or γ = 2;
3. Construct a matrix Q whose columns form an orthonormal basis for the range of Y.
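Putting the two proto-algorithms together, a short numpy sketch of the randomized SVD with γ power iterations follows; the oversampling to 2k columns matches the slides, while the test matrix used for the comparison at the end is an illustrative assumption.

```python
import numpy as np

def randomized_svd(A, k, gamma=1, seed=None):
    """Proto-algorithm sketch: Gaussian range finder with gamma power
    iterations, then an exact SVD of the small matrix B = Q^T A.
    Oversampling to 2k columns follows the slides; the rest is illustrative."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    Omega = rng.standard_normal((n, 2 * k))     # Gaussian test matrix
    Y = A @ Omega
    for _ in range(gamma):                      # Y = (A A^T)^gamma A Omega
        Y = A @ (A.T @ Y)
    Q, _ = np.linalg.qr(Y)                      # orthonormal basis for range(Y)
    B = Q.T @ A                                 # small 2k x n matrix
    U_B, s, Vt = np.linalg.svd(B, full_matrices=False)
    return Q @ U_B, s, Vt                       # approximate SVD factors of A

# Usage on a matrix with decaying singular values (illustrative).
rng = np.random.default_rng(1)
A = rng.standard_normal((2000, 400)) @ np.diag(0.9 ** np.arange(400))
U, s, Vt = randomized_svd(A, k=10, gamma=1, seed=0)
A_10 = U[:, :10] @ np.diag(s[:10]) @ Vt[:10, :]
print(np.linalg.norm(A - A_10, 2))              # spectral-norm error
```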

16 Computational Complexity of the Randomized SVD. The randomized SVD procedure requires only 2(γ + 1) passes over the matrix. The flop count is (2γ + 2) k T_mult + O(k^2 (m + n)), where T_mult is the flop count of a matrix–vector multiply with A or A^T.

17 Theoretical Analysis of the Randomized SVD. Theorem (Halko et al., 2011). Let A ∈ R^{m×n}. Given an exponent γ and a target number k of singular vectors, where 2 ≤ k ≤ (1/2) min(m, n), the randomized SVD algorithm yields a rank-2k factorization Ũ_{2k} Σ̃_{2k} Ṽ_{2k}^T such that
E ‖A − Ũ_{2k} Σ̃_{2k} Ṽ_{2k}^T‖_2 ≤ [1 + 4√(2 min(m, n)/(k − 1))]^{1/(2γ+1)} σ_{k+1},
where the expectation is taken w.r.t. the random test matrix and σ_{k+1} is the (k+1)-th largest singular value of A.

18 Outline: 1 Randomized SVD; 2 Subspace Embedding; 3 Column Selection, CUR, and the Nyström Approximation; 4 References.

19 The Subspace Embedding Problem. For a fixed m × n matrix A of rank r and an error parameter ɛ ∈ (0, 1), we call S : R^m → R^k a subspace embedding matrix for A if (1 − ɛ)‖Ax‖_2 ≤ ‖SAx‖_2 ≤ (1 + ɛ)‖Ax‖_2 for all x ∈ R^n. The problem is to find such an embedding matrix obliviously. More specifically, one designs a distribution π over linear maps from R^m to R^k such that for any fixed m × n matrix A, if we choose S ∼ π, then with high probability S is a subspace embedding matrix for A.


21 Sparse Embedding Matrices. For a fixed m × n matrix A with m > n, let nnz(A) denote the number of non-zero entries of A. Assume that nnz(A) ≥ m and that there are no all-zero rows or columns in A. Let [m] = {1, 2, ..., m}. For a parameter k, define a random linear map ΦD : R^m → R^k as follows:
h : [m] → [k] is a random map such that for each i ∈ [m], h(i) = t for t ∈ [k] with probability 1/k;
Φ ∈ {0, 1}^{k×m} is a k × m binary matrix with φ_{h(i),i} = 1 and all remaining entries 0;
D is an m × m random diagonal matrix, with each diagonal entry independently chosen to be +1 or −1 with equal probability.
A matrix of the form S = ΦD is referred to as a sparse embedding matrix (Dasgupta et al., 2010; Clarkson and Woodruff, 2013).
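A sketch of this construction in scipy: the map h and the signs D are drawn at random, S = ΦD is stored as a sparse matrix with one nonzero per column, and applying it to a sparse A touches each nonzero of A once. The matrix sizes and the choice of k below are illustrative assumptions.

```python
import numpy as np
import scipy.sparse as sp

def sparse_embedding(m, k, seed=None):
    """Sketch of S = Phi * D: one +/-1 entry per column of S, placed in the
    random row h(i). Stored here as a scipy CSR matrix (an implementation
    choice for the illustration)."""
    rng = np.random.default_rng(seed)
    h = rng.integers(0, k, size=m)                 # random map h: [m] -> [k]
    d = rng.choice([-1.0, 1.0], size=m)            # random signs (the matrix D)
    # S has entry d[i] at position (h[i], i) and zeros elsewhere.
    return sp.csr_matrix((d, (h, np.arange(m))), shape=(k, m))

# Applying S to a sparse A touches each nonzero once: O(nnz(A)) time.
rng = np.random.default_rng(0)
m, n = 100_000, 50
A = sp.random(m, n, density=1e-3, format='csr', random_state=0)
S = sparse_embedding(m, k=20_000, seed=1)          # illustrative sketch size
x = rng.standard_normal(n)
print(np.linalg.norm(A @ x), np.linalg.norm((S @ A) @ x))   # usually close
```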


26 Subspace Embedding in Input-Sparsity Time. Theorem (Meng and Mahoney, 2013). Let S = ΦD ∈ R^{k×m} with k = (n^2 + n)/(ɛ^2 δ). Then with probability at least 1 − δ,
(1 − ɛ)‖Ax‖_2 ≤ ‖SAx‖_2 ≤ (1 + ɛ)‖Ax‖_2 for all x ∈ R^n.
In addition, SA can be computed in O(nnz(A)) time.

27 Spectral Sparsifiers. Theorem (Batson, Spielman, and Srivastava, 2014). Suppose ρ > 1 and {v_1, v_2, ..., v_m} ⊂ R^n with Σ_{i≤m} v_i v_i^T = I_n. Then there exist scalars d_i ≥ 0, with |{i : d_i ≠ 0}| ≤ ρn, such that
(1 − 1/√ρ)^2 I_n ⪯ Σ_{i≤m} d_i v_i v_i^T ⪯ (1 + 1/√ρ)^2 I_n.
This theorem shows that λ_1(Σ_{i≤m} d_i v_i v_i^T) / λ_n(Σ_{i≤m} d_i v_i v_i^T) ≤ (√ρ + 1)^2 / (√ρ − 1)^2.

28 Outline: 1 Randomized SVD; 2 Subspace Embedding; 3 Column Selection, CUR, and the Nyström Approximation; 4 References.

29 Column Selection and the CX Decomposition. Given an m × n matrix A, column selection algorithms aim to find a matrix C consisting of c columns of A such that ‖A − CC^+ A‖_ξ = ‖(I_m − CC^+)A‖_ξ achieves the minimum. Here ξ = 2, ξ = F, and ξ = ∗ respectively denote the matrix spectral norm, the matrix Frobenius norm, and the matrix nuclear norm, and C^+ is the Moore–Penrose inverse of C. Let X be the best rank-k approximation to A in the column span of C; then CX is called the CX decomposition of A. Since there are (n choose c) possible choices for constructing C, selecting the best subset is a hard problem.

30 A Randomized Algorithm for Column Selection. Given an m × n matrix A and a rank parameter k, a random sampling algorithm based on statistical leverage scores is:
Compute the importance sampling probabilities {π_i}_{i=1}^n, where π_i = (1/k)‖v^{(i)}_k‖_2^2, V_k is the n × k orthonormal matrix spanning the top-k right singular subspace of A, and v^{(i)}_k is its i-th row;
Randomly select c = O(k log k / ɛ^2) columns of A according to these probabilities to form the matrix C.
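A minimal numpy sketch of leverage-score column sampling as just described; sampling with replacement, the absence of column rescaling, and the matrix sizes are simplifying assumptions for the illustration.

```python
import numpy as np

def leverage_score_columns(A, k, c, seed=None):
    """Leverage-score column sampling sketch: pi_i = ||i-th row of V_k||^2 / k,
    then c columns drawn i.i.d. from this distribution. Sampling with
    replacement and no column rescaling are simplifications."""
    rng = np.random.default_rng(seed)
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    Vk = Vt[:k, :].T                              # n x k top-k right singular subspace
    pi = (Vk ** 2).sum(axis=1) / k                # probabilities, they sum to 1
    idx = rng.choice(A.shape[1], size=c, replace=True, p=pi)
    return A[:, idx], idx

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 30)) @ rng.standard_normal((30, 300)) \
    + 0.1 * rng.standard_normal((500, 300))
C, idx = leverage_score_columns(A, k=10, c=60, seed=1)
# Relative Frobenius error of projecting A onto the span of the sampled columns.
resid = A - C @ (np.linalg.pinv(C) @ A)
print(np.linalg.norm(resid, 'fro') / np.linalg.norm(A, 'fro'))
```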

31 Theoretical Result for Column Selection (Drineas et al., 2008). Let C_k be the best rank-k approximation to the matrix C, and define the projection matrix P_{C_k} = C_k C_k^+. Then ‖A − P_{C_k} A‖_F ≤ (1 + ɛ)‖A − A_k‖_F, where A_k = U_k Σ_k V_k^T is the best rank-k approximation to A.

32 The Adaptive Sampling Algorithm (Deshpande et al., 2006). Given a matrix A ∈ R^{m×n}, let C_1 ∈ R^{m×c_1} consist of c_1 columns of A, and define the residual B = A − C_1 C_1^+ A. For i = 1, ..., n, define π_i = ‖b_i‖_2^2 / ‖B‖_F^2. Sample a further c_2 columns from A i.i.d., in each trial choosing the i-th column with probability π_i. Let C_2 ∈ R^{m×c_2} contain the c_2 sampled columns and let C = [C_1, C_2] ∈ R^{m×(c_1+c_2)}. Then, for any integer k > 0, the following inequality holds:
E ‖A − CC^+ A‖_F^2 ≤ ‖A − A_k‖_F^2 + (k/c_2)‖A − C_1 C_1^+ A‖_F^2.
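One round of this adaptive sampling scheme can be sketched in a few lines of numpy; the initial column set C_1 and the sizes below are arbitrary illustrative choices.

```python
import numpy as np

def adaptive_sample_columns(A, C1, c2, seed=None):
    """One adaptive sampling round: sample c2 further columns with probability
    proportional to the squared column norms of the residual B = A - C1 C1^+ A."""
    rng = np.random.default_rng(seed)
    B = A - C1 @ (np.linalg.pinv(C1) @ A)      # residual after projecting onto span(C1)
    p = (B ** 2).sum(axis=0)
    p = p / p.sum()                            # pi_i = ||b_i||_2^2 / ||B||_F^2
    idx = rng.choice(A.shape[1], size=c2, replace=True, p=p)
    return np.hstack([C1, A[:, idx]])

rng = np.random.default_rng(0)
A = rng.standard_normal((400, 40)) @ rng.standard_normal((40, 250)) \
    + 0.05 * rng.standard_normal((400, 250))
C1 = A[:, rng.choice(250, size=20, replace=False)]     # arbitrary initial columns
C = adaptive_sample_columns(A, C1, c2=20, seed=1)
print(np.linalg.norm(A - C @ (np.linalg.pinv(C) @ A), 'fro'))
```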

33 The Near-Optimal Column Selection Algorithm. Boutsidis et al. (2013) derived a near-optimal algorithm, which consists of three steps: the approximate SVD via random projection (Halko et al., 2011); a dual-set sparsification algorithm, which extends the BSS spectral sparsifier; and the adaptive sampling algorithm (Deshpande et al., 2006).

34 The Near-Optimal Column Selection Algorithm. Theorem (Boutsidis et al., 2013). Given a matrix A ∈ R^{m×n} of rank ρ, a target rank k (2 ≤ k < ρ), and 0 < ɛ < 1, the algorithm selects c = (2k/ɛ)(1 + o(1)) columns of A to form a matrix C ∈ R^{m×c}. Then the following inequality holds:
E ‖A − CC^+ A‖_F^2 ≤ (1 + ɛ)‖A − A_k‖_F^2,
where the expectation is taken w.r.t. C. Furthermore, the matrix C can be obtained in time O(mk^2 ɛ^{-4/3} + nk^3 ɛ^{-2/3}) + T_Multiply(mnk ɛ^{-2/3}).

35 The CUR Decomposition (Drineas et al., 2008; Mahoney and Drineas, 2009). Given an m × n matrix A and integers c < n and r < m, the CUR decomposition of A finds C ∈ R^{m×c} with c columns from A, R ∈ R^{r×n} with r rows from A, and U ∈ R^{c×r} such that A = CUR + E. Here E = A − CUR is the residual error matrix.

36 The CUR Problem. Definition (the CUR problem). Given an m × n matrix A of rank ρ, a rank parameter k, and an accuracy parameter ɛ ∈ (0, 1), construct a matrix C ∈ R^{m×c} with c columns from A, a matrix R ∈ R^{r×n} with r rows from A, and a matrix U ∈ R^{c×r}, with c, r, and rank(U) as small as possible, such that ‖A − CUR‖_F^2 ≤ (1 + ɛ)‖A − A_k‖_F^2. Here A_k = U_k Σ_k V_k^T ∈ R^{m×n} is the best rank-k matrix obtained via the SVD of A: A = UΣV^T.

37 The Subspace Sampling CUR Algorithm. Drineas et al. (2008) proposed a two-stage randomized CUR algorithm called subspace sampling. The first stage samples c columns of A to construct C, with sampling probabilities proportional to the squared ℓ2-norms of the rows of V_k. The second stage samples r rows from A and C simultaneously to construct R and W, and sets U = W^+. The sampling probabilities in this stage are proportional to the leverage scores of A and C, respectively.

38 The Subspace Sampling CUR Algorithm (Drineas et al., 2008). Given an m × n matrix A and a target rank k ≪ min{m, n}, the subspace sampling algorithm selects c = O(k ɛ^{-2} log k log(1/δ)) columns and r = O(c ɛ^{-2} log c log(1/δ)) rows without replacement. Then
‖A − CUR‖_F = ‖A − CW^+ R‖_F ≤ (1 + ɛ)‖A − A_k‖_F
holds with probability at least 1 − δ, where W consists of the sampled rows of C with scaling. The running time is dominated by the truncated SVD of A, that is, O(mnk).

39 The Adaptive Sampling CUR Algorithm. Wang and Zhang (2013) proposed an adaptive sampling CUR algorithm:
Select c = (2k/ɛ)(1 + o(1)) columns of A to construct C ∈ R^{m×c} using the near-optimal algorithm of Boutsidis et al. (2013);
Select r_1 = c rows of A to construct R_1 ∈ R^{r_1×n} using the same algorithm of Boutsidis et al. (2013);
Adaptively sample r_2 = c/ɛ rows from A according to the residual A − AR_1^+ R_1;
Return C, R = [R_1^T, R_2^T]^T, and U = C^+ A R^+.
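The sketch below keeps the overall structure C, R, U = C^+ A R^+ of the algorithm, but, as a simplifying assumption, replaces the near-optimal and adaptive selection steps with plain leverage-score sampling of columns and rows; it is an illustration, not the algorithm of Wang and Zhang.

```python
import numpy as np

def simple_cur(A, k, c, r, seed=None):
    """Simplified CUR sketch: plain leverage-score sampling of columns and rows
    stands in for the near-optimal/adaptive selection of the actual algorithm;
    the intersection matrix U = C^+ A R^+ follows the slides."""
    rng = np.random.default_rng(seed)
    U_, _, Vt = np.linalg.svd(A, full_matrices=False)
    p_col = (Vt[:k, :] ** 2).sum(axis=0) / k       # column leverage scores
    p_row = (U_[:, :k] ** 2).sum(axis=1) / k       # row leverage scores
    cols = rng.choice(A.shape[1], size=c, replace=True, p=p_col)
    rows = rng.choice(A.shape[0], size=r, replace=True, p=p_row)
    C, R = A[:, cols], A[rows, :]
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)  # intersection matrix
    return C, U, R

rng = np.random.default_rng(0)
A = rng.standard_normal((500, 400)) @ np.diag(0.8 ** np.arange(400))
C, U, R = simple_cur(A, k=10, c=40, r=40, seed=1)
print(np.linalg.norm(A - C @ U @ R, 'fro') / np.linalg.norm(A, 'fro'))
```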

40 The Adaptive Sampling CUR Algorithm (Wang and Zhang, 2013). Given a matrix A ∈ R^{m×n} and a matrix C ∈ R^{m×c} such that rank(C) = rank(CC^+ A) = ρ (ρ ≤ c ≤ n), let R_1 ∈ R^{r_1×n} consist of r_1 rows of A and define the residual B = A − AR_1^+ R_1. For i = 1, ..., m, define p_i = ‖b^{(i)}‖_2^2 / ‖B‖_F^2. Sample a further r_2 rows from A i.i.d., in each trial choosing the i-th row with probability p_i. Let R_2 ∈ R^{r_2×n} contain the r_2 sampled rows and let R = [R_1^T, R_2^T]^T ∈ R^{(r_1+r_2)×n}. Then we have
E ‖A − CC^+ A R^+ R‖_F^2 ≤ ‖A − CC^+ A‖_F^2 + (ρ/r_2)‖A − AR_1^+ R_1‖_F^2.

41 The Adaptive Sampling CUR Algorithm. Theorem (Wang and Zhang, 2013). Given a matrix A ∈ R^{m×n} and a positive integer k ≪ min{m, n}, the adaptive sampling CUR algorithm randomly selects c = (2k/ɛ)(1 + o(1)) columns of A to construct C ∈ R^{m×c}, and then selects r = (c/ɛ)(1 + ɛ) rows of A to construct R ∈ R^{r×n}. Then we have
E ‖A − CUR‖_F = E ‖A − C(C^+ A R^+)R‖_F ≤ (1 + ɛ)‖A − A_k‖_F.
The algorithm costs time O((m + n)k^3 ɛ^{-2/3} + mk^2 ɛ^{-2} + nk^2 ɛ^{-4}) + T_Multiply(mnk ɛ^{-1}) to compute the matrices C, U, and R.

42 The Optimal CUR Algorithm. Boutsidis and Woodruff (2014) proposed an optimal CUR algorithm.
Construct C with O(k + k/ɛ) columns:
compute the top-k singular vectors of A, denoted Z_1;
sample O(k log k) columns from Z_1^T using the leverage scores;
down-sample to c_1 = O(k) columns with the sampling algorithm of Boutsidis et al. (2013);
adaptively sample c_2 = O(k/ɛ) columns of A.
Construct R with O(k + k/ɛ) rows:
find Z_2 in the span of C such that ‖A − Z_2 Z_2^T A‖_F^2 ≤ (1 + ɛ)‖A − A_k‖_F^2;
sample O(k log k) rows from Z_2 using the leverage scores;
down-sample to r_1 = O(k) rows with the sampling algorithm of Boutsidis et al. (2013);
sample r_2 = O(k/ɛ) rows with adaptive sampling.

43 The Optimal CUR Algorithm. Lemma (Boutsidis and Woodruff, 2014). Given a matrix A ∈ R^{m×n}, V ∈ R^{m×c}, and an integer k, let V = YΨ be a QR decomposition of V, let Γ = Y^T A, and let Γ_k = Δ Σ_k V_k^T be a rank-k SVD of Γ with Δ ∈ R^{c×k}. Then YΔΔ^T Y^T satisfies
‖A − YΔΔ^T Y^T A‖_F^2 ≤ ‖A − YΔ Σ_k V_k^T‖_F^2 = ‖A − Π^F_{V,k}(A)‖_F^2.

44 The Optimal CUR Algorithm. Theorem (Boutsidis and Woodruff, 2014). Given a matrix A ∈ R^{m×n} of rank ρ, a target rank 1 ≤ k < ρ, and 0 < ɛ < 1, the optimal CUR algorithm selects at most c = O(k/ɛ) columns and at most r = O(k/ɛ) rows from A to form matrices C ∈ R^{m×c}, R ∈ R^{r×n}, and U ∈ R^{c×r} with rank(U) = k such that, with constant probability,
‖A − CUR‖_F^2 ≤ (1 + O(ɛ))‖A − A_k‖_F^2.
The matrices C, U, and R can be computed in time O(nnz(A) log n + (m + n) poly(log n, k, 1/ɛ)).

45 The Nyström Method: select c (≪ n) columns of K to construct C using some randomized algorithm. After permutation we have
K = [ W, K_21^T ; K_21, K_22 ] and C = [ W ; K_21 ].
The Nyström approximation of K is
K_c^nys = C W^+ C^T,
where K_c^nys is n × n, C is n × c, W^+ is c × c, and C^T is c × n.
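A minimal numpy sketch of the standard Nyström approximation; uniform column sampling and the RBF kernel used to build K are illustrative assumptions.

```python
import numpy as np

def nystrom(K, c, seed=None):
    """Standard Nystrom sketch with uniform column sampling (an illustrative
    choice; any randomized column selection can be plugged in)."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(K.shape[0], size=c, replace=False)
    C = K[:, idx]                       # n x c block of sampled columns
    W = K[np.ix_(idx, idx)]             # c x c intersection block
    return C @ np.linalg.pinv(W) @ C.T  # n x n low-rank approximation

# Illustrative RBF kernel matrix on synthetic points.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 5))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
K_nys = nystrom(K, c=100, seed=1)
print(np.linalg.norm(K - K_nys, 'fro') / np.linalg.norm(K, 'fro'))
```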


47 The Nyström Approximation. The Nyström approximation K ≈ K_c^nys = C W^+ C^T is a low-rank factorization: the n × n matrix K is approximated by the product of an n × c matrix, a c × c matrix, and a c × n matrix.

48 Problem Formulation. Problem: how do we select informative columns of K ∈ R^{n×n} to construct C ∈ R^{n×c}? The approximation error ‖K − CUC^T‖_F or ‖K − CUC^T‖_2 should be as small as possible.

49 Criterion: Upper Error Bounds. Use approximation algorithms to find c good columns (not necessarily the best). Hope that ‖K − CUC^T‖_F / ‖K − K_k‖_F has an upper bound; the smaller, the better.

50 Uniform Sampling: The Simplest Algorithm. Sample c columns of K uniformly at random to construct C. It is the simplest approach, but the most widely used.

51 Adaptive Sampling. The adaptive sampling algorithm [Deshpande et al., 2006]:
1. Sample c_1 columns of K to construct C_1 using some algorithm;
2. Compute the residual B = K − C_1 C_1^+ K;
3. Compute the sampling probabilities p_i = ‖b_i‖_2^2 / ‖B‖_F^2 for i = 1, ..., n;
4. Sample a further c_2 columns of K in c_2 i.i.d. trials, in each trial choosing the i-th column with probability p_i; denote the selected columns by C_2;
5. Return C = [C_1, C_2].

52 Adaptive Sampling. The error term ‖K − CC^+ K‖_F is bounded theoretically, but ‖K − CW^+ C^T‖_F is not. Empirically, the adaptive sampling algorithm works very well.

53 Better Sampling Algorithms? We hope that ‖K − CW^+ C^T‖_F / ‖K − K_k‖_F will be very small if the column sampling algorithm is good enough. But it cannot be arbitrarily small. Lower error bound. Theorem (Wang & Zhang, JMLR 2013). Whatever column sampling algorithm is used to select c columns, there exists a bad case K such that
‖K − CW^+ C^T‖_F^2 / ‖K − K_k‖_F^2 ≥ Ω(1 + nk/c^2).

54 Different Types of Low-Rank Approximation? The ensemble Nyström method [Kumar et al., JMLR 2012]:
K ≈ Σ_{i=1}^t (1/t) C^{(i)} W^{(i)+} C^{(i)T}.
It does not improve the lower error bound. Lower error bound. Theorem (Wang & Zhang, JMLR 2013). Whatever column sampling algorithm is used to select c columns, there exists a bad case K such that
‖K − Σ_{i=1}^t (1/t) C^{(i)} W^{(i)+} C^{(i)T}‖_F^2 / ‖K − K_k‖_F^2 ≥ Ω(1 + nk/c^2).

55 The Modified Nyström Method. The modified Nyström method [Wang & Zhang, JMLR 2013]:
K ≈ C (C^+ K (C^+)^T) C^T,
where the intersection matrix C^+ K (C^+)^T is c × c. Theorem (Wang & Zhang, JMLR 2013). Using a suitable column sampling algorithm, the error incurred by the modified Nyström method satisfies
E ‖K − C (C^+ K (C^+)^T) C^T‖_F^2 ≤ (1 + √(k/c))^2 ‖K − K_k‖_F^2.
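For comparison, a sketch of the modified intersection matrix U_mod = C^+ K (C^+)^T on the same sampled columns; the kernel matrix and the uniform column choice are illustrative. Since U_mod minimizes ‖K − CUC^T‖_F over U, its Frobenius error is never larger than that of the standard method on the same C.

```python
import numpy as np

def modified_nystrom(K, idx):
    """Modified Nystrom sketch: same sampled columns C as the standard method,
    but the intersection matrix is C^+ K (C^+)^T, which costs an extra
    O(n^2 c) multiplication to form."""
    C = K[:, idx]
    Cp = np.linalg.pinv(C)              # c x n pseudo-inverse of C
    U_mod = Cp @ K @ Cp.T               # c x c intersection matrix
    return C @ U_mod @ C.T

# Compare the two intersection matrices on the same uniformly sampled columns.
rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 4))
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
idx = rng.choice(1000, size=80, replace=False)
W = K[np.ix_(idx, idx)]
err_std = np.linalg.norm(K - K[:, idx] @ np.linalg.pinv(W) @ K[:, idx].T, 'fro')
err_mod = np.linalg.norm(K - modified_nystrom(K, idx), 'fro')
print(err_std, err_mod)   # err_mod is never larger: U_mod minimizes the error over U
```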

56 Comparisons between the Two Methods. The standard Nyström method: fast. It costs only T_SVD(c^3) time to compute the intersection matrix U_nys = W^+. The modified Nyström method: slow. It costs T_SVD(nc^2) + T_Multiply(n^2 c) time to compute the intersection matrix U_mod = C^+ K (C^+)^T naively.

57 Comparisons between the Two Methods. The standard Nyström method: inaccurate. It cannot attain a 1 + ɛ Frobenius relative-error bound unless c ≥ √(nk/ɛ) columns are selected, whatever column selection algorithm is used (due to its lower error bound). The modified Nyström method: accurate. Some adaptive-sampling-based algorithms attain a 1 + ɛ Frobenius relative-error bound with c = O(k/ɛ^2) columns (the smaller c, the better).

58 Comparisons between the Two Methods. Theorem (Exact Recovery). For the symmetric matrix K defined previously, the following three statements are equivalent:
1. rank(W) = rank(K);
2. K = CW^+ C^T (i.e., the standard Nyström method is exact);
3. K = C (C^+ K (C^+)^T) C^T (i.e., the modified Nyström method is exact).

59 Outline: 1 Randomized SVD; 2 Subspace Embedding; 3 Column Selection, CUR, and the Nyström Approximation; 4 References.

60 References.
Santosh S. Vempala. The Random Projection Method. American Mathematical Society, 2004.
Michael W. Mahoney. Randomized Algorithms for Matrices and Data. Foundations and Trends in Machine Learning, 3(2), 2011.
N. Halko, P. G. Martinsson, and J. A. Tropp. Finding structure with randomness: probabilistic algorithms for constructing approximate matrix decompositions. SIAM Review, 53(2), 2011.
W. B. Johnson and J. Lindenstrauss. Extensions of Lipschitz mappings into a Hilbert space. Contemporary Mathematics, 1984.
S. Dasgupta and A. Gupta. An elementary proof of a theorem of Johnson and Lindenstrauss. Random Structures & Algorithms, 2003.
J. Matoušek. On variants of the Johnson–Lindenstrauss lemma. Random Structures & Algorithms, 2008.

61 References.
A. Dasgupta, R. Kumar, and T. Sarlós. A sparse Johnson–Lindenstrauss transform. In STOC, 2010.
K. L. Clarkson and D. P. Woodruff. Low rank approximation and regression in input sparsity time. In STOC, 2013.
X. Meng and M. W. Mahoney. Low-distortion subspace embeddings in input-sparsity time and applications to robust linear regression. In STOC, 2013.
J. Nelson and H. L. Nguyễn. OSNAP: faster numerical linear algebra algorithms via sparser subspace embeddings. In FOCS, 2013.
J. Batson, D. Spielman, and N. Srivastava. Twice-Ramanujan sparsifiers. SIAM Review, 2014.

62 References.
A. Frieze, R. Kannan, and S. Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. In FOCS, 1998; Journal of the ACM, 2004.
A. Deshpande, L. Rademacher, S. Vempala, and G. Wang. Matrix approximation and projective clustering via volume sampling. Theory of Computing, 2006.
C. Boutsidis, P. Drineas, and M. Magdon-Ismail. Near-optimal column-based matrix reconstruction. SIAM Journal on Computing, 2014.
V. Guruswami and A. K. Sinop. Optimal column-based low-rank matrix reconstruction. In SODA, 2012.

63 References.
P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Relative-error CUR matrix decompositions. SIAM Journal on Matrix Analysis and Applications, 2008.
M. W. Mahoney and P. Drineas. CUR matrix decompositions for improved data analysis. Proceedings of the National Academy of Sciences, 2009.
Shusen Wang and Zhihua Zhang. Improving CUR matrix decomposition and the Nyström approximation via adaptive sampling. JMLR, 2013.
C. Boutsidis and D. P. Woodruff. Optimal CUR matrix decompositions. In STOC, 2014.
S. Kumar, M. Mohri, and A. Talwalkar. Sampling methods for the Nyström method. JMLR, 2012.
