sublinear time low-rank approximation of positive semidefinite matrices

Cameron Musco (MIT) and David P. Woodruff (CMU)
overview

Our Contributions:

A near optimal low-rank approximation for any positive semidefinite (PSD) matrix can be computed in sublinear time (i.e. without reading the full matrix).

Concrete: Significantly improves on previous, roughly linear time approaches for general matrices, and bypasses a trivial linear time lower bound for general matrices.

High Level: Demonstrates that PSD structure can be exploited in a much stronger way than previously known for low-rank approximation. Opens the possibility of further advances in algorithms for PSD matrices.
low-rank matrix approximation

Low-rank approximation is one of the most widely used methods for general matrix and data compression.

Closely related to principal component analysis, spectral embedding/clustering, and low-rank matrix completion.

Important Special Case: $A$ is positive semidefinite (PSD), i.e. $x^T A x \ge 0$ for all $x \in \mathbb{R}^n$. Includes graph Laplacians, Gram matrices and kernel matrices, covariance matrices, and Hessians of convex functions.
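For concreteness, here is a small numpy sketch (my illustration, not from the talk) that builds matrices from three of the classes above and checks positive semidefiniteness via their eigenvalues:

```python
# Illustrative sketch: three of the PSD classes listed above, built
# explicitly and checked via their smallest eigenvalue.
import numpy as np

rng = np.random.default_rng(0)
n = 50
X = rng.standard_normal((n, 5))

gram = X @ X.T                            # Gram matrix of the rows of X
cov = np.cov(X.T)                         # covariance matrix of the columns of X
kernel = np.exp(-0.5 * np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2))  # RBF kernel

for name, A in [("Gram", gram), ("covariance", cov), ("RBF kernel", kernel)]:
    eigs = np.linalg.eigvalsh(A)
    print(name, "min eigenvalue:", eigs.min())  # >= 0, up to rounding error
```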
optimal low-rank approximation

An optimal low-rank approximation can be computed via the singular value decomposition (SVD):

$$A_k = \operatorname*{argmin}_{B:\,\operatorname{rank}(B)=k} \|A - B\|_F, \qquad \text{where } \|A - B\|_F = \sqrt{\sum_{i,j} (A_{ij} - B_{ij})^2}.$$

Unfortunately, computing the SVD takes $O(nd^2)$ time.
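The optimum is easy to compute (if slowly) with a dense SVD. A minimal sketch of $A_k$, with a check that it beats an arbitrary rank-$k$ competitor:

```python
# Minimal sketch: the truncated SVD gives the optimal rank-k approximation,
# so it should never lose to a random matrix of the same rank.
import numpy as np

rng = np.random.default_rng(1)
n, d, k = 100, 60, 5
A = rng.standard_normal((n, 10)) @ rng.standard_normal((10, d))   # roughly low-rank input

U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_k = (U[:, :k] * s[:k]) @ Vt[:k, :]      # truncated SVD = argmin over rank-k B

B = rng.standard_normal((n, k)) @ rng.standard_normal((k, d))     # arbitrary rank-k matrix
print(np.linalg.norm(A - A_k, "fro") <= np.linalg.norm(A - B, "fro"))  # True
```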
input sparsity time low-rank approximation

Recent work on matrix sketching gives state-of-the-art runtimes.

Theorem (Clarkson, Woodruff '13). There is an algorithm which in $O(\operatorname{nnz}(A) + n \cdot \operatorname{poly}(k, 1/\epsilon))$ time outputs $N \in \mathbb{R}^{n \times k}$, $M \in \mathbb{R}^{d \times k}$ satisfying with probability 99/100:

$$\|A - NM^T\|_F \le (1+\epsilon)\|A - A_k\|_F.$$

When $k, 1/\epsilon$ are not too large, the runtime is linear in the input size. This is the best known runtime for both general and PSD matrices.
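As a rough illustration of the sketch-and-solve idea, here is a generic Gaussian-sketch version of my own; it is not Clarkson-Woodruff's algorithm, which uses sparse subspace embeddings to achieve the $\operatorname{nnz}(A)$ term:

```python
# Simplified sketch-and-solve low-rank approximation: compress A with a
# random sketch, then solve the small problem exactly.
import numpy as np

def sketched_low_rank(A, k, oversample=10, rng=np.random.default_rng(2)):
    n, d = A.shape
    r = k + oversample
    Omega = rng.standard_normal((d, r))
    Q, _ = np.linalg.qr(A @ Omega)        # orthonormal basis for a sketched range of A
    W = Q.T @ A                           # small r x d problem
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    N = Q @ (U[:, :k] * s[:k])            # n x k factor
    M = Vt[:k, :].T                       # d x k factor; N @ M.T approximates A
    return N, M

A = np.random.default_rng(3).standard_normal((200, 150))
N, M = sketched_low_rank(A, k=10)
print(np.linalg.norm(A - N @ M.T, "fro"))
```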
sublinear time low-rank approximation

Theorem (Main Result, Musco, Woodruff '17). There is an algorithm running in $\tilde{O}(nk^2/\epsilon^4)$ time which, given PSD $A$, outputs $N, M \in \mathbb{R}^{n \times k}$ satisfying with probability 99/100:

$$\|A - NM^T\|_F \le (1+\epsilon)\|A - A_k\|_F.$$

Compare to CW '13, which takes $O(\operatorname{nnz}(A)) + n \cdot \operatorname{poly}(k, 1/\epsilon)$ time. If $k, 1/\epsilon$ are not too large compared to $\operatorname{nnz}(A)$, our runtime is significantly sublinear in the size of $A$.
lower bound for general matrices

For general matrices, $\Omega(\operatorname{nnz}(A))$ time is required.

Randomly place a single entry which dominates $A$'s Frobenius norm. Finding it with constant probability requires reading at least a constant fraction of the non-zero entries in $A$.

The lower bound holds for any approximation factor $\|A - NM^T\|_F \le (1+\epsilon)\|A - A_k\|_F$, and even rules out $o(\operatorname{nnz}(A))$ time for weaker guarantees such as

$$\|A - NM^T\|_F \le \|A - A_k\|_F + \epsilon\|A\|_F.$$
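A toy version of this hard instance (my illustration; the formal lower bound is an information-theoretic argument, not this experiment):

```python
# One hidden entry dominates the Frobenius norm, so any good rank-1
# approximation must essentially locate it.
import numpy as np

rng = np.random.default_rng(11)
n = 100
A = rng.standard_normal((n, n))
i, j = rng.integers(n, size=2)
A[i, j] = 100 * np.linalg.norm(A, "fro")  # hidden heavy entry

U, s, Vt = np.linalg.svd(A)
A_1 = s[0] * np.outer(U[:, 0], Vt[0, :])  # best rank-1 approximation
print(np.abs(A_1).argmax() == np.abs(A).argmax())  # top rank-1 term pinpoints (i, j)
```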
what about for psd matrices?

Observation: For PSD $A$, we have for any entry $a_{ij}$:

$$a_{ij} \le \max(a_{ii}, a_{jj}),$$

since otherwise $(e_i - e_j)^T A (e_i - e_j) < 0$.

So we can find any hidden heavy entry by looking at its corresponding diagonal entries.

Question: How can we exploit additional structure arising from positive semidefiniteness to achieve sublinear runtime?
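A quick numerical check of the observation on a random PSD matrix:

```python
# For PSD A, every off-diagonal entry is bounded by the larger of its two
# diagonal entries, so a "hidden" heavy entry announces itself on the diagonal.
import numpy as np

rng = np.random.default_rng(4)
B = rng.standard_normal((30, 30))
A = B.T @ B                               # random PSD matrix

i, j = np.unravel_index(np.abs(A).argmax(), A.shape)
print(A[i, j] <= max(A[i, i], A[j, j]))   # True for every entry, including the largest
```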
every psd matrix is a gram matrix

Very Simple Fact: Every PSD matrix $A \in \mathbb{R}^{n \times n}$ can be written as $B^T B$ for some $B \in \mathbb{R}^{n \times n}$.

$B$ can be any matrix square root of $A$, e.g. if we let $V \Lambda V^T$ be the eigendecomposition of $A$, we can set $B = \Lambda^{1/2} V^T$.

Letting $b_1, \ldots, b_n$ be the columns of $B$, the entries of $A$ contain every pairwise dot product: $a_{ij} = b_i^T b_j$.
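A short sketch verifying this fact numerically, constructing $B = \Lambda^{1/2} V^T$ from the eigendecomposition:

```python
# Build a matrix square root B of PSD A and confirm A = B^T B, a_ij = b_i^T b_j.
import numpy as np

rng = np.random.default_rng(5)
C = rng.standard_normal((20, 20))
A = C.T @ C                               # PSD input

lam, V = np.linalg.eigh(A)                # A = V diag(lam) V^T with lam >= 0
lam = np.clip(lam, 0, None)               # guard against tiny negative rounding
B = np.sqrt(lam)[:, None] * V.T           # B = Lambda^{1/2} V^T

print(np.allclose(B.T @ B, A))            # True: A is the Gram matrix of B's columns
i, j = 3, 7
print(np.isclose(A[i, j], B[:, i] @ B[:, j]))  # a_ij = b_i^T b_j
```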
every psd matrix is a gram matrix

The fact that $A$ is a Gram matrix places a variety of geometric constraints on its entries. The heavy diagonal observation is just one example. By Cauchy-Schwarz:

$$a_{ij} = b_i^T b_j \le \sqrt{(b_i^T b_i)(b_j^T b_j)} = \sqrt{a_{ii} a_{jj}} \le \max(a_{ii}, a_{jj}).$$

Another View: $A$ contains a lot of information about the column span of $B$ in a very compressed form, with every pairwise dot product stored as $a_{ij}$.
factor matrix low-rank approximation

Question: Can we compute a low-rank approximation of $B$ using $o(n^2)$ column dot products, i.e. $o(n^2)$ accesses to $A$?

Why? $B$ has the same (right) singular vectors as $A$, and its singular values are closely related: $\sigma_i(B) = \sqrt{\sigma_i(A)}$.

So the top $k$ singular vectors are the same for the two matrices. An optimal low-rank approximation for $B$ thus gives an optimal low-rank approximation for $A$.

Things will be messier once we introduce approximation, but this simple idea will lead to a sublinear time algorithm for $A$.
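A quick check of the singular value relation $\sigma_i(B) = \sqrt{\sigma_i(A)}$:

```python
# Numerical check that the singular values of A = B^T B are the squares of
# those of B (and hence the two matrices share top singular vectors).
import numpy as np

rng = np.random.default_rng(10)
B = rng.standard_normal((25, 25))
A = B.T @ B

sB = np.linalg.svd(B, compute_uv=False)
sA = np.linalg.svd(A, compute_uv=False)
print(np.allclose(sB, np.sqrt(sA)))       # True
```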
low-rank approximation via adaptive sampling

Theorem (Deshpande, Vempala '06). For any $B \in \mathbb{R}^{n \times n}$, there exists a subset of $\tilde{O}(k^2/\epsilon)$ columns whose span contains $Z \in \mathbb{R}^{n \times k}$ satisfying:

$$\|B - ZZ^T B\|_F \le (1+\epsilon)\|B - B_k\|_F.$$

Adaptive Sampling:
- Initially, start with an empty column subset $S := \{\}$.
- For $t = 1, \ldots, \tilde{O}(k^2/\epsilon)$:
  - Let $P_S$ be the projection onto the columns in $S$.
  - Add $b_i$ to $S$ with probability $\frac{\|b_i - P_S b_i\|^2}{\sum_{i=1}^n \|b_i - P_S b_i\|^2}$.
adaptive sampling

When $S$ is empty, $P_S = 0$, so in the first round each column is sampled with probability

$$\frac{\|b_i - P_S b_i\|^2}{\sum_{i=1}^n \|b_i - P_S b_i\|^2} = \frac{\|b_i\|^2}{\sum_{i=1}^n \|b_i\|^2} = \frac{a_{ii}}{\operatorname{tr}(A)},$$

which can be read directly off the diagonal of $A$. In later rounds, the residual norms $\|b_i - P_S b_i\|^2$ are again dot products of columns of $B$, so they too can be computed from entries of $A$ alone (see the sketch below).
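Here is a sketch of how such a round might be driven purely by accesses to $A$ (my reconstruction using the standard projection identity, not necessarily the paper's exact procedure). With sampled index set $T$, $\|b_i - P_S b_i\|^2 = a_{ii} - a_{T,i}^T (A_{T,T})^+ a_{T,i}$, so each round reads only the diagonal plus the sampled rows of $A$:

```python
# Adaptive sampling for the factor B, driven purely by entries of A = B^T B.
import numpy as np

def adaptive_sample_via_A(A, rounds, rng=np.random.default_rng(6)):
    n = A.shape[0]
    T = []                                    # indices of selected columns of B
    for _ in range(rounds):
        if T:
            pinv = np.linalg.pinv(A[np.ix_(T, T)])
            # ||P_S b_i||^2 = a_{T,i}^T (A_{T,T})^+ a_{T,i} for every i
            proj = np.einsum("ij,jk,ki->i", A[T, :].T, pinv, A[T, :])
            resid = np.clip(np.diag(A) - proj, 0, None)
        else:
            resid = np.diag(A).copy()         # first round: a_ii / tr(A)
        T.append(rng.choice(n, p=resid / resid.sum()))
    return T

C = np.random.default_rng(7).standard_normal((40, 40))
A = C.T @ C
print(adaptive_sample_via_A(A, rounds=5))
```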
sublinear dot product algorithm

Theorem (Factor Matrix Low-Rank Approximation). There is an algorithm using $\tilde{O}(nk^2/\epsilon)$ accesses to $A = B^T B$ which computes $Z \in \mathbb{R}^{n \times k}$ satisfying with probability 99/100:

$$\|B - ZZ^T B\|_F \le (1+\epsilon)\|B - B_k\|_F.$$

How does this translate to low-rank approximation of $A$ itself?
boosting to a psd matrix approximation

Lemma. If $\|B - ZZ^T B\|_F^2 \le \left(1 + \frac{\epsilon^{3/2}}{\sqrt{n}}\right)\|B - B_k\|_F^2$, then for $A = B^T B$:

$$\|A - B^T ZZ^T B\|_F^2 \le (1+\epsilon)\|A - A_k\|_F^2.$$

Since $Z$ lies in the span of sampled columns of $B$, the approximation $B^T ZZ^T B$ can be written in the form $ASCS^T A^T$, where $S$ is the column sampling matrix and $C$ is a small coefficient matrix, so it can be computed from $A$ alone.

This gives a low-rank approximation algorithm which accesses just $\tilde{O}\left(\frac{nk^2}{\epsilon^{3/2}/\sqrt{n}}\right) = n^{3/2} \cdot \operatorname{poly}(k, 1/\epsilon)$ entries of $A$.

Our best algorithm accesses just $\tilde{O}(nk/\epsilon^{2.5})$ entries of $A$ and runs in $\tilde{O}(nk^2/\epsilon^4)$ time.
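One concrete instance of the $ASCS^T A^T$ form is the classical Nyström approximation, which takes $C = (S^T A S)^+$; a small demonstration (illustrative only, the paper chooses $C$ differently via the factor-matrix approximation):

```python
# Nystrom approximation: A S C S^T A^T with C = (S^T A S)^+ for a column
# sampling matrix S. Exact here, since the sampled columns span A's range.
import numpy as np

rng = np.random.default_rng(8)
F = rng.standard_normal((100, 15))
A = F @ F.T                                # PSD, rank 15

idx = rng.choice(100, size=30, replace=False)   # S selects these columns
AS = A[:, idx]
core = np.linalg.pinv(A[np.ix_(idx, idx)])      # small core matrix C = (S^T A S)^+
A_approx = AS @ core @ AS.T                     # = A S C S^T A^T, since A is symmetric

print(np.linalg.norm(A - A_approx, "fro") / np.linalg.norm(A, "fro"))
```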
limitations of column sampling

Recall that our algorithm accesses the diagonal of $A$ along with $\tilde{O}(k^2 \sqrt{n})$ columns.

If we take fewer columns, we can miss a $\sqrt{n} \times \sqrt{n}$ block which contains a constant fraction of $A$'s Frobenius norm.
column and row sampling

Solution: Sample both rows and columns of $A$.

Instead of adaptive sampling we use ridge leverage scores, which can also be computed using an iterative sampling scheme making $\tilde{O}(nk)$ accesses to $A$ (Musco, Musco '17).

Same intuition: select a diverse set of columns which span a near-optimal low-rank approximation of the matrix.

The sample $AS$ is a projection-cost-preserving sketch for $A$ [Cohen et al. '15, '17]: for any rank-$k$ projection $P$,

$$\|AS - PAS\|_F^2 = (1 \pm \epsilon)\|A - PA\|_F^2.$$
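For reference, a definitional (and deliberately non-sublinear) computation of rank-$k$ ridge leverage scores, following the Cohen, Musco, Musco definition $\tau_i = a_i (A^T A + \lambda I)^{-1} a_i^T$ with $\lambda = \|A - A_k\|_F^2 / k$; the point of Musco, Musco '17 is that good overestimates of these scores can be found with only $\tilde{O}(nk)$ accesses to $A$:

```python
# Rank-k ridge leverage scores, evaluated directly from the definition.
import numpy as np

def ridge_leverage_scores(A, k):
    n = A.shape[0]
    s = np.linalg.svd(A, compute_uv=False)
    lam = (s[k:] ** 2).sum() / k                  # lambda = ||A - A_k||_F^2 / k
    Kinv = np.linalg.inv(A.T @ A + lam * np.eye(n))
    return np.einsum("ij,jk,ik->i", A, Kinv, A)   # tau_i = a_i (A^T A + lam I)^{-1} a_i^T

C = np.random.default_rng(9).standard_normal((60, 60))
A = C.T @ C
tau = ridge_leverage_scores(A, k=5)
print(tau.sum())   # at most 2k for the rank-k ridge scores, so few samples suffice
```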
final algorithm

Recover the low-rank approximation using two-sided sampling and the projection-cost-preserving sketch property.
summary of main ideas

View each entry of $A$ as encoding a large amount of information about its square root $B$; in particular, $a_{ij} = b_i^T b_j$.

Use this view to find a low-rank approximation to $B$ using sublinear accesses to $A$.

Since $B$ has the same singular vectors as $A$ and $\sigma_i(B) = \sqrt{\sigma_i(A)}$, a low-rank approximation of $B$ can be used to find one for $A$, albeit with a $\sqrt{n}$ factor loss in quality.

Obtain near-optimal complexity using ridge leverage scores to sample both rows and columns of $A$.
open questions

What else can be done for PSD matrices? We give applications to ridge regression, but what other linear algebraic problems require a second look?

Are there other natural classes of matrices that admit sublinear time low-rank approximation? Starting points are matrices that break the $\Omega(\operatorname{nnz}(A))$ time lower bound: e.g. binary matrices, diagonally dominant matrices.

What can we do when we have PSD matrices with additional structure, e.g. kernel matrices?
Thanks! Questions?