dimensionality reduction for k-means and low rank approximation


1 dimensionality reduction for k-means and low rank approximation. Michael Cohen, Sam Elder, Cameron Musco, Christopher Musco, Mădălina Persu (Massachusetts Institute of Technology)

2 overview Simple techniques to accelerate algorithms for: k-means clustering, principal component analysis (PCA), and constrained low rank approximation.

3 dimensionality reduction Replace a large, high dimensional dataset with a low dimensional sketch: the n × d data matrix A (n data points, d features) is replaced by an n × d' sketch Ã with d' << d.

5 dimensionality reduction The solution on the sketch Ã should approximate the original solution. This simultaneously improves runtime, memory requirements, communication cost, etc.

10 k-means clustering Extremely common objective function for clustering: choose k clusters to minimize total intra-cluster variance.

16 k-means clustering $\min_C \mathrm{Cost}(A, C) = \sum_{i=1}^{n} \|a_i - \mu(C[a_i])\|_2^2$, where $\mu(C[a_i])$ is the center of the cluster that $a_i$ is assigned to. NP-hard, even for fixed dimension d or fixed k. Several (1 + ϵ) and constant factor approximation algorithms exist. In practice: Lloyd's heuristic (i.e. "the k-means algorithm") with k-means++ initialization is used. An O(log k) approximation is guaranteed, and it typically performs much better. Dimensionality reduction can speed up any of these algorithms.
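As a concrete illustration (a minimal usage sketch on made-up data, not code from the talk), this is the standard scikit-learn call for Lloyd's heuristic with k-means++ initialization:

```python
# Minimal example: Lloyd's heuristic ("the k-means algorithm") with k-means++ init.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
A = rng.normal(size=(1000, 50))   # n = 1000 points, d = 50 features (synthetic)
k = 10

km = KMeans(n_clusters=k, init="k-means++", n_init=10, random_state=0).fit(A)
print(km.inertia_)                # the k-means objective Cost(A, C)
```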

22 key observation k-means clustering == low rank approximation: $\min_C \sum_{i=1}^{n} \|a_i - \mu(a_i)\|_2^2 = \|A - C(A)\|_F^2$, where C(A) is the matrix that replaces each row $a_i$ of A with its assigned cluster center $\mu(a_i)$.

27 key observation C(A) is actually a projection of A's columns onto a rank k subspace [Boutsidis, Drineas, Mahoney, Zouzias 11]: $C(A) = X_C X_C^\top A$, where $X_C \in \mathbb{R}^{n \times k}$ is the normalized cluster indicator matrix with $X_C(i, j) = 1/\sqrt{|C_j|}$ if $a_i$ is assigned to cluster j and 0 otherwise.

30 key observation $\min_C \sum_{i=1}^{n} \|a_i - \mu(C[a_i])\|_2^2 = \min_{\mathrm{rank}(X)=k,\, X \in S} \|A - XX^\top A\|_F^2$, where X is a rank k orthonormal matrix and, for k-means, S is the set of all cluster indicator matrices. Set S = {all rank k orthonormal matrices} for principal component analysis/singular value decomposition. This is the general form for constrained low rank approximation.
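A small numerical check of this identity (my own example, not the authors' code), with the indicator matrix $X_C$ built exactly as described on the previous slides:

```python
# Verify: k-means cost == ||A - X_C X_C^T A||_F^2 for the normalized cluster
# indicator matrix X_C of a given clustering.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, d, k = 500, 20, 5
A = rng.normal(size=(n, d))

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(A)

# k-means cost: squared distance of every point to its cluster's mean.
cost = sum(((A[labels == j] - A[labels == j].mean(axis=0)) ** 2).sum()
           for j in range(k))

# X_C[i, j] = 1/sqrt(|C_j|) if point i is in cluster j, else 0.
X_C = np.zeros((n, k))
for j in range(k):
    members = labels == j
    X_C[members, j] = 1.0 / np.sqrt(members.sum())

low_rank_cost = np.linalg.norm(A - X_C @ X_C.T @ A, "fro") ** 2
print(np.isclose(cost, low_rank_cost))   # True, up to floating point error
```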

34 so what? We can solve any problem of the form $\min_{\mathrm{rank}(X)=k,\, X \in S} \|A - XX^\top A\|_F^2$ if, for all rank k X, $\|\tilde A - XX^\top \tilde A\|_F^2 \approx \|A - XX^\top A\|_F^2$, where the n × d matrix A is replaced by an n × O(k) sketch Ã. Such an Ã is a projection-cost preserving sketch.

40 projection-cost preservation Specifically, we want: for all rank k X, $\|\tilde A - XX^\top \tilde A\|_F^2 + c = (1 \pm \epsilon)\,\|A - XX^\top A\|_F^2$ for some fixed constant c. If we find an X that gives $\|\tilde A - XX^\top \tilde A\|_F^2 \le \gamma\,\|\tilde A - \tilde X^* \tilde X^{*\top} \tilde A\|_F^2$ (with $\tilde X^*$ the optimum for Ã), then $\|A - XX^\top A\|_F^2 \le \gamma (1 + \epsilon)\,\|A - X^* {X^*}^\top A\|_F^2$ (with $X^*$ the optimum for A). See the coresets of Feldman, Schmidt, Sohler.
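The last implication follows from a short chain of inequalities; here is a sketch (assuming $\gamma \ge 1$ and $c \ge 0$, as in the constructions here; the extra $1/(1-\epsilon)$ factor is absorbed by rescaling ϵ):

```latex
\begin{align*}
\|A - XX^\top A\|_F^2
  &\le \tfrac{1}{1-\epsilon}\big(\|\tilde A - XX^\top \tilde A\|_F^2 + c\big)
     && \text{(projection-cost preservation)}\\
  &\le \tfrac{1}{1-\epsilon}\big(\gamma\,\|\tilde A - \tilde X^* \tilde X^{*\top} \tilde A\|_F^2 + c\big)
     && \text{($X$ is a $\gamma$-approximation on $\tilde A$)}\\
  &\le \tfrac{\gamma}{1-\epsilon}\big(\|\tilde A - X^* {X^*}^\top \tilde A\|_F^2 + c\big)
     && \text{($\tilde X^*$ is optimal for $\tilde A$; $\gamma \ge 1$, $c \ge 0$)}\\
  &\le \tfrac{\gamma(1+\epsilon)}{1-\epsilon}\,\|A - X^* {X^*}^\top A\|_F^2
     && \text{(projection-cost preservation).}
\end{align*}
```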

44 recap What we have seen so far: k-means clustering is just constrained rank k approximation. We can construct a projection-cost preserving sketch Ã that approximates the distance from A to any rank k subspace. This is a stronger guarantee than has been sought in prior work on approximate PCA via sketching.

45 our results What techniques give (1 + ϵ) projection-cost preservation?

Technique | Reference | Previous dimensions | Previous error | Our dimensions | Our error
SVD | Feldman, Schmidt, Sohler 13 | O(k/ϵ^2) | 1 + ϵ | k/ϵ | 1 + ϵ
Approximate SVD | Boutsidis, Drineas, Mahoney, Zouzias 11 | k | 2 + ϵ | k/ϵ | 1 + ϵ
Random Projection | Boutsidis, Drineas, Mahoney, Zouzias 11 | O(k/ϵ^2) | 2 + ϵ | O(k/ϵ^2), or O(log k/ϵ^2) | 1 + ϵ, or 9 + ϵ
Column Sampling | Boutsidis, Drineas, Mahoney, Zouzias 11 | O(k log k/ϵ^2) | 3 + ϵ | O(k log k/ϵ^2) | 1 + ϵ
Deterministic Column Selection | Boutsidis, Magdon-Ismail 13 | r > k | O(n/r) | O(k/ϵ^2) | 1 + ϵ
Non-oblivious Projection | NA | NA | NA | O(k/ϵ) | 1 + ϵ

46 our results SVD: previous work (Feldman, Schmidt, Sohler 13) gives O(k/ϵ^2) dimensions with 1 + ϵ error; our result gives k/ϵ dimensions with 1 + ϵ error. The sketch is $\tilde A = A V_{k/\epsilon}$, i.e. A projected onto its top k/ϵ right singular vectors, so Ã has k/ϵ columns.
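A minimal sketch of this pipeline (sizes and data are hypothetical, not from the paper): compute the top k/ϵ right singular vectors, project, and cluster the lower dimensional points.

```python
# SVD-based projection-cost preserving sketch: A_tilde = A V_{k/eps}.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
n, d, k, eps = 2000, 200, 10, 0.5
A = rng.normal(size=(n, d))

m = int(np.ceil(k / eps))                     # sketch dimension k/eps
_, _, Vt = np.linalg.svd(A, full_matrices=False)
A_tilde = A @ Vt[:m].T                        # n x m sketch

labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(A_tilde)
# The clustering is found on A_tilde but used and evaluated on the original A.
cost_on_A = sum(((A[labels == j] - A[labels == j].mean(axis=0)) ** 2).sum()
                for j in range(k))
print(A_tilde.shape, cost_on_A)
```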

47 our results Approximate SVD: previous work (Boutsidis, Drineas, Mahoney, Zouzias 11) gives k dimensions with 2 + ϵ error; our result gives k/ϵ dimensions with 1 + ϵ error. The sketch is $\tilde A = A \tilde V_{k/\epsilon}$, where $\tilde V_{k/\epsilon}$ comes from an approximate SVD of A.

48 our results The approximate SVD sketch is practically very useful for k-means: there are no constant factors on k/ϵ, and typically many fewer dimensions are required (Kappmeier, Schmidt, Schmidt 15).
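A variant of the previous snippet (again with hypothetical sizes), where only the top k/ϵ approximate singular vectors are ever computed, here via scikit-learn's randomized_svd:

```python
# Approximate-SVD sketch: project A onto approximate top right singular vectors.
import numpy as np
from sklearn.utils.extmath import randomized_svd

rng = np.random.default_rng(2)
n, d, k, eps = 2000, 500, 10, 0.5
A = rng.normal(size=(n, d))

m = int(np.ceil(k / eps))
_, _, Vt_approx = randomized_svd(A, n_components=m, random_state=0)
A_tilde = A @ Vt_approx.T     # n x m sketch, fed to k-means exactly as before
print(A_tilde.shape)
```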

53 our results Random Projection: previous work (Boutsidis, Drineas, Mahoney, Zouzias 11) gives O(k/ϵ^2) dimensions with 2 + ϵ error; our results give O(k/ϵ^2) dimensions with 1 + ϵ error, or O(log k/ϵ^2) dimensions with 9 + ϵ error. The sketch is $\tilde A = A\Pi$, where Π is a d × O(k/ϵ^2) (or d × O(log k/ϵ^2)) random ±1 matrix. This is the first sketch with dimension sublinear in k. Can (9 + ϵ) be improved? The sketch is data oblivious, which gives the lowest communication distributed k-means (improves on Balcan, Kanchanapally, Liang, Woodruff 14) and streaming principal component analysis in a single pass.
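A minimal sketch of the oblivious random projection (the scaling and constants are my own choices, omitted on the slide):

```python
# Oblivious sketch: A_tilde = A @ Pi with Pi a d x m random +/-1 matrix, m = O(k/eps^2).
import numpy as np

rng = np.random.default_rng(3)
n, d, k, eps = 2000, 1000, 10, 0.5
A = rng.normal(size=(n, d))

m = int(np.ceil(k / eps**2))                       # constants omitted
Pi = rng.choice([-1.0, 1.0], size=(d, m)) / np.sqrt(m)
A_tilde = A @ Pi                                   # n x m sketch

# Because Pi is chosen independently of A, each machine in a distributed setting
# (or each update in a stream) can apply the same Pi without any coordination.
print(A_tilde.shape)
```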

55 oblivious projection-cost preserving sketches Standard sketches for low rank approximation (Sarlós 06, Clarkson, Woodruff 13, etc.): with $\tilde A = A\Pi$ for a random Π with O(k/ϵ) columns, $\|A - (P_{\tilde A} A)_k\|_F^2 \le (1 + \epsilon)\,\|A - A_k\|_F^2$, where $P_{\tilde A}$ projects onto the column span of Ã. That is, span(Ã) contains a good low rank approximation for A, but we must return to A to find it.
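For contrast, here is a minimal sketch-and-solve routine of this standard kind (my own toy example; constants and failure probabilities ignored). Note that the second step multiplies back against the full matrix A:

```python
# Standard sketch for low rank approximation: span(A @ Pi) contains a good rank-k
# approximation, but extracting it requires returning to A.
import numpy as np

rng = np.random.default_rng(4)
n, d, k, eps = 1000, 800, 10, 0.5
A = rng.normal(size=(n, k)) @ rng.normal(size=(k, d)) + 0.1 * rng.normal(size=(n, d))

m = int(np.ceil(k / eps))
Pi = rng.choice([-1.0, 1.0], size=(d, m))
Q, _ = np.linalg.qr(A @ Pi)                      # orthonormal basis for span(A @ Pi)

B = Q.T @ A                                      # "return to A"
Ub, Sb, Vtb = np.linalg.svd(B, full_matrices=False)
A_k_sketch = Q @ (Ub[:, :k] * Sb[:k]) @ Vtb[:k]  # best rank-k approx within span(Q)

U, S, Vt = np.linalg.svd(A, full_matrices=False)
best = np.linalg.norm(A - (U[:, :k] * S[:k]) @ Vt[:k], "fro") ** 2
print(np.linalg.norm(A - A_k_sketch, "fro") ** 2 / best)   # roughly within 1 + eps
```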

58 oblivious projection-cost preserving sketches Projection-cost preserving sketch for low rank approximation: with $\tilde A = A\Pi$ for a random Π with O(k/ϵ^2) columns, $\|A - \tilde V_k \tilde V_k^\top A\|_F^2 \le (1 + \epsilon)\,\|A - A_k\|_F^2$, where $\tilde V_k$ spans the top rank k subspace of Ã. There is an additional ϵ dependence for this stronger sketch. Is it required for oblivious approximate PCA? (See Ghashami, Liberty, Phillips, Woodruff 14/15.)

60 our results Non-oblivious Randomized Projection (based on Sarlós 06; no previous projection-cost preserving result): our result gives O(k/ϵ) dimensions with 1 + ϵ error. The sketch projects A onto span(AΠ), giving points of dimension O(k/ϵ). Very good practical performance: we can get away with a narrower Π than approximate SVD, although the final point dimension is a constant factor higher.

64 our results Column Sampling: previous work (Boutsidis, Drineas, Mahoney, Zouzias 11) gives O(k log k/ϵ^2) dimensions with 3 + ϵ error; our result gives O(k log k/ϵ^2) dimensions with 1 + ϵ error. Deterministic Column Selection: previous work (Boutsidis, Magdon-Ismail 13) gives r > k dimensions with O(n/r) error; our result gives O(k/ϵ^2) dimensions with 1 + ϵ error. The sketch Ã is a reweighted subset of Õ(k/ϵ^2) of A's columns. So A not only contains a small set of columns that span a good low rank approximation, but a small (reweighted) set whose top principal components approximate those of A. This is the first single shot sampling based dimensionality reduction for (1 + ϵ) error low rank approximation. Applications to streaming column subset selection? Iterative sampling algorithms for the SVD?

67 analysis for svd Singular Value Decomposition (SVD): $A = U \Sigma V^\top$, where U and V have orthonormal columns and $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_d)$ with $\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_d$. This is the same as Principal Component Analysis (PCA) if we mean center A's columns/rows.
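A quick check of this statement (my own example): PCA's components match the right singular vectors of the mean-centered matrix, up to sign.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(5)
A = rng.normal(size=(300, 8)) @ rng.normal(size=(8, 8))

A_centered = A - A.mean(axis=0)            # mean center the columns (features)
U, S, Vt = np.linalg.svd(A_centered, full_matrices=False)

pca = PCA(n_components=3).fit(A)
for i in range(3):
    # Each principal component equals a right singular vector up to a sign flip.
    print(np.allclose(np.abs(pca.components_[i]), np.abs(Vt[i]), atol=1e-6))
```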

71 analysis for svd Partial SVD: $A_m = U_m \Sigma_m V_m^\top$, and $\|A - A_m\|_F^2 = \min_{\mathrm{rank}(X)=m} \|A - XX^\top A\|_F^2$. Claim: $\|A_{k/\epsilon} - XX^\top A_{k/\epsilon}\|_F^2 + c = (1 \pm \epsilon)\,\|A - XX^\top A\|_F^2$. (The actual sketch is $U_{k/\epsilon}\Sigma_{k/\epsilon}$, which has the same projection costs since $V_{k/\epsilon}$ has orthonormal columns; we work with $A_{k/\epsilon}$ for simplicity.) Denote $(A - A_m)$ by $A_{\setminus m}$.

78 analysis for svd Claim: $\|(I - XX^\top)A_{k/\epsilon}\|_F^2 + c = (1 \pm \epsilon)\,\|(I - XX^\top)A\|_F^2$. Split A into (row) orthogonal pairs, $A = A_{k/\epsilon} + A_{\setminus k/\epsilon}$. Then, since the cross terms vanish ($A_{k/\epsilon}A_{\setminus k/\epsilon}^\top = 0$), $\|(I - XX^\top)A\|_F^2 = \|(I - XX^\top)A_{k/\epsilon}\|_F^2 + \|(I - XX^\top)A_{\setminus k/\epsilon}\|_F^2 = \|(I - XX^\top)A_{k/\epsilon}\|_F^2 + \|A_{\setminus k/\epsilon}\|_F^2 - \|XX^\top A_{\setminus k/\epsilon}\|_F^2 = \|(I - XX^\top)A_{k/\epsilon}\|_F^2 + c - \|XX^\top A_{\setminus k/\epsilon}\|_F^2$, where we set $c = \|A_{\setminus k/\epsilon}\|_F^2$.

81 analysis for svd In words, the projection cost for A is explained by the cost over $A_{k/\epsilon}$ plus the cost over $A_{\setminus k/\epsilon}$. We want to argue that ignoring the tail term is fine. Need to show that $\|XX^\top A_{\setminus k/\epsilon}\|_F^2 \le \epsilon\,\|(I - XX^\top)A\|_F^2$; it suffices to show $\|XX^\top A_{\setminus k/\epsilon}\|_F^2 \le \epsilon\,\|A - A_k\|_F^2$, since $\|A - A_k\|_F^2 \le \|(I - XX^\top)A\|_F^2$ for any rank k X.

88 analysis for svd Goal: $\|XX^\top A_{\setminus k/\epsilon}\|_F^2 \le \epsilon\,\|A - A_k\|_F^2$. Recall that $\|B\|_F^2 = \sum_i \sigma_i^2(B)$. [Figure: the squared singular value spectrum $\sigma_1^2 \ge \ldots \ge \sigma_d^2$ of A, with $\|A_{\setminus k/\epsilon}\|_F^2$ marked as the tail past index $k/\epsilon$, $\|XX^\top A_{\setminus k/\epsilon}\|_F^2$ as at most the top k squared singular values of that tail ($\sigma_{k/\epsilon+1}^2, \ldots, \sigma_{k/\epsilon+k}^2$), $\|A_{\setminus k}\|_F^2$ as the tail past index k, and a block of $k/\epsilon$ squared singular values lying between indices k and $k/\epsilon + k$.]
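One way to finish the argument sketched by the figure (a sketch with loose constants; the paper's exact choice of sketch size hides the $1/(1-\epsilon)$ factor):

```latex
\begin{align*}
\|XX^\top A_{\setminus k/\epsilon}\|_F^2
  &\le \sum_{i=k/\epsilon+1}^{k/\epsilon+k} \sigma_i^2
     && (XX^\top \text{ is a rank } k \text{ projection})\\
  &\le k\,\sigma_{k/\epsilon+1}^2
     && (\text{singular values are non-increasing})\\
  &\le \frac{k}{k/\epsilon - k}\sum_{i=k+1}^{k/\epsilon} \sigma_i^2
     && (\sigma_{k/\epsilon+1}^2 \text{ is at most the average of } \sigma_{k+1}^2,\ldots,\sigma_{k/\epsilon}^2)\\
  &\le \frac{\epsilon}{1-\epsilon}\,\|A - A_k\|_F^2
     && \Big(\textstyle\sum_{i=k+1}^{k/\epsilon}\sigma_i^2 \le \|A - A_k\|_F^2\Big).
\end{align*}
```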

91 analysis for svd The analysis is very worst case, since we assume $\sigma_{k/\epsilon+k}$ is just as big as $\sigma_{k+1}$. In practice, spectrum decay allows for a much smaller sketching dimension. [Figure: squared singular values plotted against index.]

94 proof sketch for random projection All proofs have a similar flavor: compare $\|A_k - XX^\top A_k\|_F^2 + \|A_{\setminus k} - XX^\top A_{\setminus k}\|_F^2$ with $\|A_k\Pi - XX^\top A_k\Pi\|_F^2 + \|A_{\setminus k}\Pi - XX^\top A_{\setminus k}\Pi\|_F^2$. Except that: we have to worry about cross terms, since in general $\|A\Pi - XX^\top A\Pi\|_F^2 \ne \|A_k\Pi - XX^\top A_k\Pi\|_F^2 + \|A_{\setminus k}\Pi - XX^\top A_{\setminus k}\Pi\|_F^2$; and we have some hope of approximating $\|A_{\setminus k} - XX^\top A_{\setminus k}\|_F^2$ itself.

95 proof sketch for random projection Rely on standard sketching tools. Approximate Matrix Multiplication: $\|A\Pi\Pi^\top B - AB\|_F \le \epsilon\,\|A\|_F\,\|B\|_F$. Subspace Embedding: $\|YA\Pi\|_F^2 = (1 \pm \epsilon)\,\|YA\|_F^2$ for all Y, if A has rank k. Frobenius Norm Preservation: $\|A\Pi\|_F^2 = (1 \pm \epsilon)\,\|A\|_F^2$.
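A quick empirical illustration of all three properties for a dense random sign matrix (my own toy check; constants and failure probabilities are ignored):

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, k, m = 200, 1000, 5, 400
Pi = rng.choice([-1.0, 1.0], size=(d, m)) / np.sqrt(m)

A = rng.normal(size=(n, d))
B = rng.normal(size=(d, n))

# Frobenius norm preservation: ||A Pi||_F^2 is close to ||A||_F^2.
print(np.linalg.norm(A @ Pi, "fro")**2 / np.linalg.norm(A, "fro")**2)

# Approximate matrix multiplication: ||A Pi Pi^T B - A B||_F is small relative to
# ||A||_F ||B||_F.
amm_err = np.linalg.norm(A @ Pi @ Pi.T @ B - A @ B, "fro")
print(amm_err / (np.linalg.norm(A, "fro") * np.linalg.norm(B, "fro")))

# Subspace embedding: for a rank-k matrix A_k, ||Y A_k Pi||_F^2 is close to
# ||Y A_k||_F^2 for any Y.
A_k = rng.normal(size=(n, k)) @ rng.normal(size=(k, d))
Y = rng.normal(size=(50, n))
print(np.linalg.norm(Y @ A_k @ Pi, "fro")**2 / np.linalg.norm(Y @ A_k, "fro")**2)
```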

96 proof sketch for random projection See the paper for the details!

99 theoretical takeaway Dimensionality reduction for any constrained low rank approximation can be unified. Projection-cost preserving sketches guarantee $\|\tilde A - XX^\top \tilde A\|_F^2 \approx \|A - XX^\top A\|_F^2$ for all rank k X. Standard proof approach: split A into orthogonal pairs; the top of the spectrum can be preserved multiplicatively, and only the top singular values of the bottom of the spectrum matter.

101 proof sketch for sublinear result Split using the optimal clustering: A = C*(A) + (A - C*(A)), where C*(A) replaces each row of A with its optimal cluster center. C*(A) can be approximated with O(log k/ϵ^2) dimensions. The two parts are not row orthogonal, so we have to use the triangle inequality, which leads to the (9 + ϵ) factor.

103 practical considerations Experiments, implementations, etc. appear in Cameron Musco's thesis, Dimensionality Reduction for k-means Clustering. [Figures: dimensionality reduction time and clustering time, in seconds, for the full dataset versus the SVD, approximate SVD, sampling, approximate sampling, random projection, and non-oblivious random projection (NORP) sketches, at ϵ = .01, compared to baseline Lloyd's with k-means++.]

104 Thank you!
