dimensionality reduction for k-means and low rank approximation


1 dimensionality reduction for k-means and low rank approximation. Michael Cohen, Sam Elder, Cameron Musco, Christopher Musco, Mădălina Persu (Massachusetts Institute of Technology)

2 overview Simple techniques to accelerate algorithms for: k-means clustering, principal component analysis (PCA), and constrained low rank approximation.

3 dimensionality reduction Replace a large, high dimensional dataset with a low dimensional sketch: the n × d data matrix A (n data points, d features) is replaced by an n × d' sketch Ã with d' << d.

5 dimensionality reduction The solution on the sketch Ã should approximate the original solution. This simultaneously improves runtime, memory requirements, communication cost, etc.

10 k-means clustering Extremely common objective function for clustering: choose k clusters to minimize total intra-cluster variance.

16 k-means clustering $\min_C \mathrm{Cost}(A, C) = \sum_{i=1}^{n} \|a_i - \mu(C[a_i])\|_2^2$, where $\mu(C[a_i])$ is the center of the cluster that $a_i$ is assigned to. NP-hard, even for fixed dimension d or fixed k. Several (1 + ϵ) and constant factor approximation algorithms exist. In practice: Lloyd's heuristic (i.e. "the k-means algorithm") with k-means++ initialization is used. An O(log k) approximation is guaranteed, and it typically performs much better. Dimensionality reduction can speed up any of these algorithms.
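As a concrete illustration (a minimal usage sketch on made-up data, not code from the talk), this is the standard scikit-learn call for Lloyd's heuristic with k-means++ initialization:

```python
# Minimal example: Lloyd's heuristic ("the k-means algorithm") with k-means++ init.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 50))   # n = 1000 points, d = 50 features (synthetic)
k = 10

km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(A)
print(km.inertia_)                # the k-means objective Cost(A, C)
```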

22 key observation k-means clustering == low rank approximation: $\min_C \sum_{i=1}^{n} \|a_i - \mu(a_i)\|_2^2 = \|A - C(A)\|_F^2$, where C(A) is the matrix that replaces each row $a_i$ of A with its assigned cluster center $\mu(a_i)$.

27 key observation C(A) is actually a projection of A's columns onto a rank k subspace [Boutsidis, Drineas, Mahoney, Zouzias 11]: $C(A) = X_C X_C^\top A$, where $X_C \in \mathbb{R}^{n \times k}$ is the normalized cluster indicator matrix with $X_C(i, j) = 1/\sqrt{|C_j|}$ if $a_i$ is assigned to cluster j and 0 otherwise.

30 key observation $\min_C \sum_{i=1}^{n} \|a_i - \mu(C[a_i])\|_2^2 = \min_{\mathrm{rank}(X)=k,\, X \in S} \|A - XX^\top A\|_F^2$, where X is a rank k orthonormal matrix and, for k-means, S is the set of all cluster indicator matrices. Set S = {all rank k orthonormal matrices} for principal component analysis/singular value decomposition. This is the general form for constrained low rank approximation.
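A small numerical check of this identity (my own example, not the authors' code), with the indicator matrix $X_C$ built exactly as described on the previous slides:

```python
# Verify: k-means cost == ||A - X_C X_C^T A||_F^2 for the normalized cluster
# indicator matrix X_C of a given clustering.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, d, k = 500, 20, 5
A = rng.normal(size=(n, d))

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(A)

# k-means cost: squared distance of every point to its cluster's mean.
cost = sum(((A[labels == j] - A[labels == j].mean(axis=0)) ** 2).sum()
           for j in range(k))

# X_C[i, j] = 1/sqrt(|C_j|) if point i is in cluster j, else 0.
X_C = np.zeros((n, k))
for j in range(k):
    members = labels == j
    X_C[members, j] = 1.0 / np.sqrt(members.sum())

low_rank_cost = np.linalg.norm(A - X_C @ X_C.T @ A, "fro") ** 2
print(np.isclose(cost, low_rank_cost))   # True, up to floating point error
```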

34 so what? We can solve any problem of the form $\min_{\mathrm{rank}(X)=k,\, X \in S} \|A - XX^\top A\|_F^2$ if, for all rank k X, $\|\tilde A - XX^\top \tilde A\|_F^2 \approx \|A - XX^\top A\|_F^2$, where the n × d matrix A is replaced by an n × O(k) sketch Ã. Such an Ã is a projection-cost preserving sketch.

40 projection-cost preservation Specifically, we want: for all rank k X, $\|\tilde A - XX^\top \tilde A\|_F^2 + c = (1 \pm \epsilon)\,\|A - XX^\top A\|_F^2$ for some fixed constant c. If we find an X that gives $\|\tilde A - XX^\top \tilde A\|_F^2 \le \gamma\,\|\tilde A - \tilde X^* \tilde X^{*\top} \tilde A\|_F^2$ (with $\tilde X^*$ the optimum for Ã), then $\|A - XX^\top A\|_F^2 \le \gamma (1 + \epsilon)\,\|A - X^* {X^*}^\top A\|_F^2$ (with $X^*$ the optimum for A). See the coresets of Feldman, Schmidt, Sohler.
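The last implication follows from a short chain of inequalities; here is a sketch (assuming $\gamma \ge 1$ and $c \ge 0$, as in the constructions here; the extra $1/(1-\epsilon)$ factor is absorbed by rescaling ϵ):

```latex
\begin{align*}
\|A - XX^\top A\|_F^2
  &\le \tfrac{1}{1-\epsilon}\big(\|\tilde A - XX^\top \tilde A\|_F^2 + c\big)
     && \text{(projection-cost preservation)}\\
  &\le \tfrac{1}{1-\epsilon}\big(\gamma\,\|\tilde A - \tilde X^* \tilde X^{*\top} \tilde A\|_F^2 + c\big)
     && \text{($X$ is a $\gamma$-approximation on $\tilde A$)}\\
  &\le \tfrac{\gamma}{1-\epsilon}\big(\|\tilde A - X^* {X^*}^\top \tilde A\|_F^2 + c\big)
     && \text{($\tilde X^*$ is optimal for $\tilde A$; $\gamma \ge 1$, $c \ge 0$)}\\
  &\le \tfrac{\gamma(1+\epsilon)}{1-\epsilon}\,\|A - X^* {X^*}^\top A\|_F^2
     && \text{(projection-cost preservation).}
\end{align*}
```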

44 recap What we have seen so far: k-means clustering is just constrained rank k approximation. We can construct a projection-cost preserving sketch Ã that approximates the distance from A to any rank k subspace. This is a stronger guarantee than has been sought in prior work on approximate PCA via sketching.

45 our results What techniques give (1 + ϵ) projection-cost preservation?

Technique | Reference | Previous dimensions | Previous error | Our dimensions | Our error
SVD | Feldman, Schmidt, Sohler 13 | O(k/ϵ^2) | 1 + ϵ | k/ϵ | 1 + ϵ
Approximate SVD | Boutsidis, Drineas, Mahoney, Zouzias 11 | k | 2 + ϵ | k/ϵ | 1 + ϵ
Random Projection | Boutsidis, Drineas, Mahoney, Zouzias 11 | O(k/ϵ^2) | 2 + ϵ | O(k/ϵ^2), or O(log k/ϵ^2) | 1 + ϵ, or 9 + ϵ
Column Sampling | Boutsidis, Drineas, Mahoney, Zouzias 11 | O(k log k/ϵ^2) | 3 + ϵ | O(k log k/ϵ^2) | 1 + ϵ
Deterministic Column Selection | Boutsidis, Magdon-Ismail 13 | r > k | O(n/r) | O(k/ϵ^2) | 1 + ϵ
Non-oblivious Projection | NA | NA | NA | O(k/ϵ) | 1 + ϵ

46 our results SVD: previous work (Feldman, Schmidt, Sohler 13) gives O(k/ϵ^2) dimensions with 1 + ϵ error; our result gives k/ϵ dimensions with 1 + ϵ error. The sketch is $\tilde A = A V_{k/\epsilon}$, i.e. A projected onto its top k/ϵ right singular vectors, so Ã has k/ϵ columns.
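A minimal sketch of this pipeline (sizes and data are hypothetical, not from the paper): compute the top k/ϵ right singular vectors, project, and cluster the lower dimensional points.

```python
# SVD-based projection-cost preserving sketch: A_tilde = A V_{k/eps}.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n, d, k, eps = 2000, 200, 10, 0.5
A = rng.normal(size=(n, d))

m = int(np.ceil(k / eps))                     # sketch dimension k/eps
_, _, Vt = np.linalg.svd(A, full_matrices=False)
A_tilde = A @ Vt[:m].T                        # n x m sketch

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(A_tilde)
# The clustering is found on A_tilde but used and evaluated on the original A.
cost_on_A = sum(((A[labels == j] - A[labels == j].mean(axis=0)) ** 2).sum()
                for j in range(k))
print(A_tilde.shape, cost_on_A)
```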

47 our results Approximate SVD: previous work (Boutsidis, Drineas, Mahoney, Zouzias 11) gives k dimensions with 2 + ϵ error; our result gives k/ϵ dimensions with 1 + ϵ error. The sketch is $\tilde A = A \tilde V_{k/\epsilon}$, where $\tilde V_{k/\epsilon}$ comes from an approximate SVD of A.

48 our results The approximate SVD sketch is practically very useful for k-means: there are no constant factors on k/ϵ, and typically many fewer dimensions are required (Kappmeier, Schmidt, Schmidt 15).
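A variant of the previous snippet (again with hypothetical sizes), where only the top k/ϵ approximate singular vectors are ever computed, here via scikit-learn's randomized_svd:

```python
# Approximate-SVD sketch: project A onto approximate top right singular vectors.
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(2)
n, d, k, eps = 2000, 500, 10, 0.5
A = rng.normal(size=(n, d))

m = int(np.ceil(k / eps))
_, _, Vt_approx = randomized_svd(A, n_components=m, random_state=0)
A_tilde = A @ Vt_approx.T     # n x m sketch, fed to k-means exactly as before
print(A_tilde.shape)
```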

53 our results Random Projection: previous work (Boutsidis, Drineas, Mahoney, Zouzias 11) gives O(k/ϵ^2) dimensions with 2 + ϵ error; our results give O(k/ϵ^2) dimensions with 1 + ϵ error, or O(log k/ϵ^2) dimensions with 9 + ϵ error. The sketch is $\tilde A = A\Pi$, where Π is a d × O(k/ϵ^2) (or d × O(log k/ϵ^2)) random ±1 matrix. This is the first sketch with dimension sublinear in k. Can (9 + ϵ) be improved? The sketch is data oblivious, which gives the lowest communication distributed k-means (improves on Balcan, Kanchanapally, Liang, Woodruff 14) and streaming principal component analysis in a single pass.
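A minimal sketch of the oblivious random projection (the scaling and constants are my own choices, omitted on the slide):

```python
# Oblivious sketch: A_tilde = A @ Pi with Pi a d x m random +/-1 matrix, m = O(k/eps^2).
import numpy as np

rng = np.random.default_rng(3)
n, d, k, eps = 2000, 1000, 10, 0.5
A = rng.normal(size=(n, d))

m = int(np.ceil(k / eps**2))                       # constants omitted
Pi = rng.choice([-1.0, 1.0], size=(d, m)) / np.sqrt(m)
A_tilde = A @ Pi                                   # n x m sketch

# Because Pi is chosen independently of A, each machine in a distributed setting
# (or each update in a stream) can apply the same Pi without any coordination.
print(A_tilde.shape)
```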

55 oblivious projection-cost preserving sketches Standard sketches for low rank approximation (Sarlós 06, Clarkson, Woodruff 13, etc.): with $\tilde A = A\Pi$ for a random Π with O(k/ϵ) columns, $\|A - (P_{\tilde A} A)_k\|_F^2 \le (1 + \epsilon)\,\|A - A_k\|_F^2$, where $P_{\tilde A}$ projects onto the column span of Ã. That is, span(Ã) contains a good low rank approximation for A, but we must return to A to find it.
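For contrast, here is a minimal sketch-and-solve routine of this standard kind (my own toy example; constants and failure probabilities ignored). Note that the second step multiplies back against the full matrix A:

```python
# Standard sketch for low rank approximation: span(A @ Pi) contains a good rank-k
# approximation, but extracting it requires returning to A.
import numpy as np

rng = np.random.default_rng(4)
n, d, k, eps = 1000, 800, 10, 0.5
A = rng.normal(size=(n, k)) @ rng.normal(size=(k, d)) + 0.1 * rng.normal(size=(n, d))

m = int(np.ceil(k / eps))
Pi = rng.choice([-1.0, 1.0], size=(d, m))
Q, _ = np.linalg.qr(A @ Pi)                      # orthonormal basis for span(A @ Pi)

B = Q.T @ A                                      # "return to A"
Ub, Sb, Vtb = np.linalg.svd(B, full_matrices=False)
A_k_sketch = Q @ (Ub[:, :k] * Sb[:k]) @ Vtb[:k]  # best rank-k approx within span(Q)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
best = np.linalg.norm(A - (U[:, :k] * S[:k]) @ Vt[:k], "fro") ** 2
print(np.linalg.norm(A - A_k_sketch, "fro") ** 2 / best)   # roughly within 1 + eps
```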

58 oblivious projection-cost preserving sketches Projection-cost preserving sketch for low rank approximation: with $\tilde A = A\Pi$ for a random Π with O(k/ϵ^2) columns, $\|A - \tilde V_k \tilde V_k^\top A\|_F^2 \le (1 + \epsilon)\,\|A - A_k\|_F^2$, where $\tilde V_k$ spans the top rank k subspace of Ã. There is an additional ϵ dependence for this stronger sketch. Is it required for oblivious approximate PCA? (See Ghashami, Liberty, Phillips, Woodruff 14/15.)

60 our results Non-oblivious Randomized Projection (based on Sarlós 06; no previous projection-cost preserving result): our result gives O(k/ϵ) dimensions with 1 + ϵ error. The sketch projects A onto span(AΠ), giving points of dimension O(k/ϵ). Very good practical performance: we can get away with a narrower Π than approximate SVD, although the final point dimension is a constant factor higher.

64 our results Column Sampling: previous work (Boutsidis, Drineas, Mahoney, Zouzias 11) gives O(k log k/ϵ^2) dimensions with 3 + ϵ error; our result gives O(k log k/ϵ^2) dimensions with 1 + ϵ error. Deterministic Column Selection: previous work (Boutsidis, Magdon-Ismail 13) gives r > k dimensions with O(n/r) error; our result gives O(k/ϵ^2) dimensions with 1 + ϵ error. The sketch Ã is a reweighted subset of Õ(k/ϵ^2) of A's columns. So A not only contains a small set of columns that span a good low rank approximation, but a small (reweighted) set whose top principal components approximate those of A. This is the first single shot sampling based dimensionality reduction for (1 + ϵ) error low rank approximation. Applications to streaming column subset selection? Iterative sampling algorithms for the SVD?

67 analysis for svd Singular Value Decomposition (SVD): $A = U \Sigma V^\top$, where U and V have orthonormal columns and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_d)$ with $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_d$. This is the same as Principal Component Analysis (PCA) if we mean center A's columns/rows.
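A quick check of this statement (my own example): PCA's components match the right singular vectors of the mean-centered matrix, up to sign.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
A = rng.normal(size=(300, 8)) @ rng.normal(size=(8, 8))

A_centered = A - A.mean(axis=0)            # mean center the columns (features)
U, S, Vt = np.linalg.svd(A_centered, full_matrices=False)

pca = PCA(n_components=3).fit(A)
for i in range(3):
    # Each principal component equals a right singular vector up to a sign flip.
    print(np.allclose(np.abs(pca.components_[i]), np.abs(Vt[i]), atol=1e-6))
```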

71 analysis for svd Partial SVD: $A_m = U_m \Sigma_m V_m^\top$, and $\|A - A_m\|_F^2 = \min_{\mathrm{rank}(X)=m} \|A - XX^\top A\|_F^2$. Claim: $\|A_{k/\epsilon} - XX^\top A_{k/\epsilon}\|_F^2 + c = (1 \pm \epsilon)\,\|A - XX^\top A\|_F^2$. (The actual sketch is $U_{k/\epsilon}\Sigma_{k/\epsilon}$, which has the same projection costs since $V_{k/\epsilon}$ has orthonormal columns; we work with $A_{k/\epsilon}$ for simplicity.) Denote $(A - A_m)$ by $A_{\setminus m}$.

78 analysis for svd Claim: $\|(I - XX^\top)A_{k/\epsilon}\|_F^2 + c = (1 \pm \epsilon)\,\|(I - XX^\top)A\|_F^2$. Split A into (row) orthogonal pairs, $A = A_{k/\epsilon} + A_{\setminus k/\epsilon}$. Then, since the cross terms vanish ($A_{k/\epsilon}A_{\setminus k/\epsilon}^\top = 0$), $\|(I - XX^\top)A\|_F^2 = \|(I - XX^\top)A_{k/\epsilon}\|_F^2 + \|(I - XX^\top)A_{\setminus k/\epsilon}\|_F^2 = \|(I - XX^\top)A_{k/\epsilon}\|_F^2 + \|A_{\setminus k/\epsilon}\|_F^2 - \|XX^\top A_{\setminus k/\epsilon}\|_F^2 = \|(I - XX^\top)A_{k/\epsilon}\|_F^2 + c - \|XX^\top A_{\setminus k/\epsilon}\|_F^2$, where we set $c = \|A_{\setminus k/\epsilon}\|_F^2$.

81 analysis for svd In words, the projection cost for A is explained by the cost over $A_{k/\epsilon}$ plus the cost over $A_{\setminus k/\epsilon}$. We want to argue that ignoring the tail term is fine. Need to show that $\|XX^\top A_{\setminus k/\epsilon}\|_F^2 \le \epsilon\,\|(I - XX^\top)A\|_F^2$; it suffices to show $\|XX^\top A_{\setminus k/\epsilon}\|_F^2 \le \epsilon\,\|A - A_k\|_F^2$, since $\|A - A_k\|_F^2 \le \|(I - XX^\top)A\|_F^2$ for any rank k X.

88 analysis for svd Goal: $\|XX^\top A_{\setminus k/\epsilon}\|_F^2 \le \epsilon\,\|A - A_k\|_F^2$. Recall that $\|B\|_F^2 = \sum_i \sigma_i^2(B)$. [Figure: the squared singular value spectrum $\sigma_1^2 \ge \ldots \ge \sigma_d^2$ of A, with $\|A_{\setminus k/\epsilon}\|_F^2$ marked as the tail past index $k/\epsilon$, $\|XX^\top A_{\setminus k/\epsilon}\|_F^2$ as at most the top k squared singular values of that tail ($\sigma_{k/\epsilon+1}^2, \ldots, \sigma_{k/\epsilon+k}^2$), $\|A_{\setminus k}\|_F^2$ as the tail past index k, and a block of $k/\epsilon$ squared singular values lying between indices k and $k/\epsilon + k$.]
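One way to finish the argument sketched by the figure (a sketch with loose constants; the paper's exact choice of sketch size hides the $1/(1-\epsilon)$ factor):

```latex
\begin{align*}
\|XX^\top A_{\setminus k/\epsilon}\|_F^2
  &\le \sum_{i=k/\epsilon+1}^{k/\epsilon+k} \sigma_i^2
     && (XX^\top \text{ is a rank } k \text{ projection})\\
  &\le k\,\sigma_{k/\epsilon+1}^2
     && (\text{singular values are non-increasing})\\
  &\le \frac{k}{k/\epsilon - k}\sum_{i=k+1}^{k/\epsilon} \sigma_i^2
     && (\sigma_{k/\epsilon+1}^2 \text{ is at most the average of } \sigma_{k+1}^2,\ldots,\sigma_{k/\epsilon}^2)\\
  &\le \frac{\epsilon}{1-\epsilon}\,\|A - A_k\|_F^2
     && \Big(\textstyle\sum_{i=k+1}^{k/\epsilon}\sigma_i^2 \le \|A - A_k\|_F^2\Big).
\end{align*}
```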

91 analysis for svd The analysis is very worst case, since we assume $\sigma_{k/\epsilon+k}$ is just as big as $\sigma_{k+1}$. In practice, spectrum decay allows for a much smaller sketching dimension. [Figure: squared singular values plotted against index.]

94 proof sketch for random projection All proofs have a similar flavor: compare $\|A_k - XX^\top A_k\|_F^2 + \|A_{\setminus k} - XX^\top A_{\setminus k}\|_F^2$ with $\|A_k\Pi - XX^\top A_k\Pi\|_F^2 + \|A_{\setminus k}\Pi - XX^\top A_{\setminus k}\Pi\|_F^2$. Except that: we have to worry about cross terms, since in general $\|A\Pi - XX^\top A\Pi\|_F^2 \ne \|A_k\Pi - XX^\top A_k\Pi\|_F^2 + \|A_{\setminus k}\Pi - XX^\top A_{\setminus k}\Pi\|_F^2$; and we have some hope of approximating $\|A_{\setminus k} - XX^\top A_{\setminus k}\|_F^2$ itself.

95 proof sketch for random projection Rely on standard sketching tools. Approximate Matrix Multiplication: $\|A\Pi\Pi^\top B - AB\|_F \le \epsilon\,\|A\|_F\,\|B\|_F$. Subspace Embedding: $\|YA\Pi\|_F^2 = (1 \pm \epsilon)\,\|YA\|_F^2$ for all Y, if A has rank k. Frobenius Norm Preservation: $\|A\Pi\|_F^2 = (1 \pm \epsilon)\,\|A\|_F^2$.
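A quick empirical illustration of all three properties for a dense random sign matrix (my own toy check; constants and failure probabilities are ignored):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, k, m = 200, 1000, 5, 400
Pi = rng.choice([-1.0, 1.0], size=(d, m)) / np.sqrt(m)

A = rng.normal(size=(n, d))
B = rng.normal(size=(d, n))

# Frobenius norm preservation: ||A Pi||_F^2 is close to ||A||_F^2.
print(np.linalg.norm(A @ Pi, "fro")**2 / np.linalg.norm(A, "fro")**2)

# Approximate matrix multiplication: ||A Pi Pi^T B - A B||_F is small relative to
# ||A||_F ||B||_F.
amm_err = np.linalg.norm(A @ Pi @ Pi.T @ B - A @ B, "fro")
print(amm_err / (np.linalg.norm(A, "fro") * np.linalg.norm(B, "fro")))

# Subspace embedding: for a rank-k matrix A_k, ||Y A_k Pi||_F^2 is close to
# ||Y A_k||_F^2 for any Y.
A_k = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))
Y = rng.normal(size=(50, n))
print(np.linalg.norm(Y @ A_k @ Pi, "fro")**2 / np.linalg.norm(Y @ A_k, "fro")**2)
```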

96 proof sketch for random projection See the paper for the details!

99 theoretical takeaway Dimensionality reduction for any constrained low rank approximation can be unified. Projection-cost preserving sketches guarantee $\|\tilde A - XX^\top \tilde A\|_F^2 \approx \|A - XX^\top A\|_F^2$ for all rank k X. Standard proof approach: split A into orthogonal pairs; the top of the spectrum can be preserved multiplicatively, and only the top singular values of the bottom of the spectrum matter.

101 proof sketch for sublinear result Split using the optimal clustering: A = C*(A) + (A - C*(A)), where C*(A) replaces each row of A with its optimal cluster center. C*(A) can be approximated with O(log k/ϵ^2) dimensions. The two parts are not row orthogonal, so we have to use the triangle inequality, which leads to the (9 + ϵ) factor.

103 practical considerations Experiments, implementations, etc. appear in Cameron Musco's thesis, Dimensionality Reduction for k-means Clustering. [Figures: dimensionality reduction time and clustering time, in seconds, for the full dataset versus the SVD, approximate SVD, sampling, approximate sampling, random projection, and non-oblivious random projection (NORP) sketches, at ϵ = .01, compared to baseline Lloyd's with k-means++.]

104 Thank you!
