randomized block krylov methods for stronger and faster approximate svd


randomized block krylov methods for stronger and faster approximate svd
Cameron Musco and Christopher Musco
December 2, 2015
Massachusetts Institute of Technology, EECS

singular value decomposition

A = U Σ V^T, where A is n × d, the columns of U are the left singular vectors, Σ = diag(σ_1, σ_2, ..., σ_{d-1}, σ_d) holds the singular values, and the columns of V are the right singular vectors.

Extremely important primitive for dimensionality reduction, low-rank approximation, PCA, etc.

u_i = argmax_{x : ‖x‖ = 1, x ⊥ u_1, ..., u_{i-1}} x^T A A^T x

The truncated SVD A_k = U_k Σ_k V_k^T = U_k U_k^T A is the best rank-k approximation in both the Frobenius and spectral norms:

A_k = argmin_{B : rank(B) = k} ‖A − B‖_F = argmin_{B : rank(B) = k} ‖A − B‖_2

standard svd approximation metrics

Frobenius Norm Low-Rank Approximation (weak):
‖A − Ũ_k Ũ_k^T A‖_F ≤ (1 + ϵ) ‖A − U_k U_k^T A‖_F

Note that ‖A‖_F^2 = Σ_{i=1}^d σ_i^2 and ‖A − U_k U_k^T A‖_F^2 = ‖A − A_k‖_F^2 = Σ_{i=k+1}^d σ_i^2.

[Plot: singular value σ_i vs. index i for the SNAP/email-Enron matrix.]

For many datasets literally any Ũ_k would work!

Spectral Norm Low-Rank Approximation (stronger):
‖A − Ũ_k Ũ_k^T A‖_2 ≤ (1 + ϵ) ‖A − U_k U_k^T A‖_2 = (1 + ϵ) σ_{k+1}

Per Vector Principal Component Error (strongest):
ũ_i^T A A^T ũ_i ≥ (1 − ϵ) u_i^T A A^T u_i for all i ≤ k.
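As a concrete illustration (not from the talk), the following NumPy sketch evaluates a candidate basis Ũ_k against all three metrics above. It assumes a dense matrix A with k < min(n, d), computes an exact SVD purely as the reference, and all function and variable names are illustrative.

```python
import numpy as np

def svd_error_metrics(A, U_tilde, k):
    """Compare a candidate n x k orthonormal basis U_tilde to the exact SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)      # reference (exact) SVD
    A_k = (U[:, :k] * s[:k]) @ Vt[:k]                     # best rank-k approximation
    R = A - U_tilde @ (U_tilde.T @ A)                     # residual of the candidate
    frob_ratio = np.linalg.norm(R, 'fro') / np.linalg.norm(A - A_k, 'fro')  # want <= 1 + eps
    spec_ratio = np.linalg.norm(R, 2) / s[k]                                # want <= 1 + eps
    # Per vector error: u_i^T A A^T u_i equals sigma_i^2, so compare the
    # Rayleigh quotients of the candidate vectors to s[:k]**2.
    rayleigh = np.einsum('ij,ij->j', U_tilde, A @ (A.T @ U_tilde))
    per_vector = rayleigh / s[:k] ** 2                                      # want >= 1 - eps
    return frob_ratio, spec_ratio, per_vector
```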

main research question

Classic Full SVD Algorithms (e.g. the QR Algorithm): all of these goals in roughly O(nd^2) time (the error dependence is log log(1/ϵ) on lower order terms). Unfortunately, this is much too slow for many data sets.

How fast can we approximately compute just u_1, ..., u_k?

frobenius norm error

Weak Approximation Algorithms:

Strong Rank Revealing QR (Gu, Eisenstat 1996):
‖A − Ũ_k Ũ_k^T A‖_F ≤ poly(n, k) ‖A − A_k‖_F in time O(ndk)

Sparse Subspace Embeddings (Clarkson, Woodruff 2013):
‖A − Ũ_k Ũ_k^T A‖_F ≤ (1 + ϵ) ‖A − A_k‖_F in time O(nnz(A)) + Õ(nk^2/ϵ^4)

iterative svd algorithms

Iterative methods are the only game in town for stronger guarantees. Runtime is approximately:

O(nnz(A) · k · #iterations) ≤ O(ndk · #iterations) << O(nd^2)

Power method (Müntz 1913, von Mises 1929)
Krylov/Lanczos methods (Lanczos 1950)
Stochastic methods?

power method review

Traditional Power Method: x_0 ∈ R^d random, x_{i+1} ← A x_i / ‖A x_i‖

Block Power Method (Simultaneous/Subspace Iteration): X_0 ∈ R^{d×k} random, X_{i+1} ← orthonormalize(A X_i)

[Animation: on a small example matrix A, the columns of the iterates X_1, X_2, ..., X_9 rotate steadily toward the top singular vectors U_2.]
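To make the block power method concrete, here is a minimal NumPy sketch of randomized simultaneous iteration for a general rectangular A; the function name, the default iteration count, and the use of A A^T (rather than a symmetric A as in the animation) are illustrative assumptions, not the talk's exact pseudocode.

```python
import numpy as np

def block_power_method(A, k, q=10, seed=0):
    """Randomized block power (subspace) iteration.

    Returns an n x k orthonormal basis whose span approximates the span of
    A's top-k left singular vectors after q iterations.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    X = A @ rng.standard_normal((d, k))       # random Gaussian start block, X_1 = A G
    for _ in range(q):
        X, _ = np.linalg.qr(X)                # orthonormalize(.)
        X = A @ (A.T @ X)                     # one power iteration step with A A^T
    X, _ = np.linalg.qr(X)
    return X

# Usage: U_tilde = block_power_method(A, k=10, q=20); then A ~ U_tilde @ (U_tilde.T @ A)
```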

power method runtime

Runtime for the Block Power Method is roughly:

O( nnz(A) · k · log(d/ϵ) / ((σ_k − σ_{k+1})/σ_k) )

Linear dependence on the singular value gap: gap = (σ_k − σ_{k+1}) / σ_k

While this gap is traditionally assumed to be constant, it is the dominant factor in the iteration count for many datasets.

typical gap values

[Plot: singular values σ_i vs. index i for the Slashdot social network graph from the Stanford Network Analysis Project, with a zoomed view of the flat middle stretch of the spectrum.]

The median value of gap_k = (σ_k − σ_{k+1})/σ_k over the plotted range of k is on the order of a percent, and the minimum is far smaller still.

Plugging that minimum gap into the bound above gives Runtime = O(25,000 · nnz(A) · k · log(d/ϵ)).
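A quick way to see how such gaps drive the iteration count is to compute gap_k from a vector of singular values and look at 1/gap_k, the factor that multiplies nnz(A) · k · log(d/ϵ) in the bound above. The sketch below uses a synthetic slowly decaying spectrum, not the Slashdot data.

```python
import numpy as np

def relative_gaps(singular_values, k_max):
    """gap_k = (sigma_k - sigma_{k+1}) / sigma_k for k = 1, ..., k_max."""
    s = np.asarray(singular_values, dtype=float)
    return (s[:k_max] - s[1:k_max + 1]) / s[:k_max]

# Synthetic example: a flat, slowly decaying spectrum produces tiny gaps,
# so 1/gap_k (the iteration blow-up factor) grows linearly with k.
s = 1.0 / np.sqrt(np.arange(1, 1001))
gaps = relative_gaps(s, 100)
print(gaps.min(), np.median(gaps), 1.0 / gaps.min())
```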

gap independent bounds

Recent work shows that the Block Power Method (with randomized start vectors) gives:

‖A − Ũ_k Ũ_k^T A‖_2 ≤ (1 + ϵ) ‖A − A_k‖_2 in time O( nnz(A) · k · log(d) / ϵ )

Improves on classical bounds when ϵ > (σ_k − σ_{k+1})/σ_k.

Long series of refinements and improvements: Rokhlin, Szlam, Tygert 2009; Halko, Martinsson, Tropp 2011; Boutsidis, Drineas, Magdon-Ismail 2011; Witten, Candès 2014; Woodruff 2014.

randomized block power method

The Randomized Block Power Method is widely cited and implemented: a simple algorithm with simple bounds.

redsvd, ScaleNLP (Breeze), libskylark, GNU Octave, MLlib

But in the numerical linear algebra community, Krylov/Lanczos methods have long been preferred over power iteration.

lanczos/krylov acceleration

Power Method: gap dependent O( nnz(A) · k · log(d/ϵ) / ((σ_k − σ_{k+1})/σ_k) ); gap independent O( nnz(A) · k · log(d) / ϵ )

Krylov Methods: gap dependent O( nnz(A) · k · log(d/ϵ) / √((σ_k − σ_{k+1})/σ_k) ); gap independent: ? — no gap independent analysis of Krylov methods!

Our Contribution: a gap independent bound for Krylov methods, O( nnz(A) · k · log(d) / √ϵ )

our main result

A simple randomized Block Krylov Iteration gives all three of our target error bounds in time:

O( nnz(A) · k · log(d) / √ϵ )

‖A − Ũ_k Ũ_k^T A‖_2 ≤ (1 + ϵ) σ_{k+1}  and  |ũ_i^T A A^T ũ_i − σ_i^2| ≤ ϵ σ_{k+1}^2

Gives a runtime bound that is independent of A.

Beats the runtime of the Block Power Method.

Improves classic Lanczos bounds when (σ_k − σ_{k+1})/σ_k < ϵ.

understanding gap dependence

First Step: Where does gap dependence actually come from?

To prove guarantees like ũ_i^T A A^T ũ_i ≥ (1 − ϵ) σ_i^2, classical analysis argues about convergence to A's true singular vectors.

[Figure: the error is measured as the distance between the true singular vector u_i and its approximation ũ_i.]

Traditional objective function: ‖u_i − ũ_i‖_2.

This objective is a simple potential function, easy to work with.

It can be used to prove strong per vector error or spectral norm guarantees for ‖A − Ũ_k Ũ_k^T A‖_2.

But it inherently requires an iteration count that depends on singular value gaps.

[Animation: on an example matrix whose relevant singular values are nearly equal, the block power iterates X_1, ..., X_9 converge only very slowly toward the true singular vectors U_2.]

key insight

Convergence becomes less necessary precisely when it is difficult to achieve!

[Figure: when two singular values are nearly equal, the corresponding singular directions of A are essentially interchangeable for approximation purposes.]

Minimizing ‖u_i − ũ_i‖_2 is sufficient, but far from necessary.

modern approach to analysis

Iterative methods can be viewed as denoising procedures for random sketching methods.

Choose G ∼ N(0, 1)^{d×k}. If A has rank k then: span(AG) = span(A)

Ũ_k = span(AG) = span(A)  ⟹  ‖A − Ũ_k Ũ_k^T A‖_F = ‖A − A‖_F = 0

If A is not rank k then we have error due to A − A_k:

Ũ_k = span(AG)  ⟹  ‖A − Ũ_k Ũ_k^T A‖_F ≤ poly(d) · ‖A − A_k‖_F

This gives an error bound for a single power method iteration.

It is meaningless unless ‖A − A_k‖_F (the "tail noise") is very small.

How to avoid tail noise? Apply the sketching method to A^q instead.

This is exactly what the Block Power Method does: G → AG → A^2 G → ... → A^q G,  Ũ_k = span(A^q G)

Assuming A is symmetric, if A = U Σ U^T then A^q = U Σ^q U^T: the singular values σ_1, ..., σ_d become σ_1^q, ..., σ_d^q.

[Plot: spectrum of A vs. spectrum of A^q — powering drives every singular value below the top few toward zero.]

‖A^q − A^q_k‖_F^2 = Σ_{i=k+1}^d σ_i^{2q} is extremely small.

q = Õ(1/ϵ) ensures that any singular value below σ_{k+1} becomes extremely small in comparison to any singular value above (1 + ϵ) σ_{k+1}:

(1 − ϵ)^{O(1/ϵ)} << 1

Ũ_k = span(A^q G) must align well with the large (but not necessarily the largest!) singular vectors of A^q to achieve even coarse Frobenius norm error:

‖A^q − Ũ_k Ũ_k^T A^q‖_F ≤ poly(d) · ‖A^q − A^q_k‖_F

A and A^q have the same singular vectors, so Ũ_k is also good for A.
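A back-of-the-envelope check (an illustration under assumed constants, not a claim from the talk): with q on the order of (1/ϵ) · log d, a singular value just below σ_{k+1} shrinks to roughly 1/poly(d) of a singular value above (1 + ϵ)σ_{k+1}, which is what makes the poly(d) tail-noise factor harmless.

```python
import numpy as np

# After q powers, the worst-case ratio between a "tail" singular value (<= sigma_{k+1})
# and a "head" singular value (>= (1 + eps) * sigma_{k+1}) is at most (1 / (1 + eps))**q.
eps, d, c = 0.01, 10_000, 3.0            # c is an assumed constant in q = O(log(d) / eps)
q = int(np.ceil(c * np.log(d) / eps))
print(q, (1.0 / (1.0 + eps)) ** q)       # q = 2764, ratio ~ 1e-12, i.e. about d**(-c)
```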

We use new tools for converting very small Frobenius norm low-rank approximation error into strong spectral norm and per vector error, without arguing about convergence of ũ_i to u_i.

krylov acceleration

There are better polynomials than A^q for denoising A.

[Plot: x^{O(1/ϵ)} vs. the Chebyshev polynomial T_{O(1/√ϵ)}(x) on [0, 1].]

With Chebyshev polynomials we only need degree q = Õ(1/√ϵ).

The Chebyshev polynomial T_q(A) has a very small tail, so returning Ũ_k = span(T_q(A) G) would suffice.

Furthermore, block power iteration computes (at intermediate steps) all of the components needed for:

T_q(A) G = c_0 G + c_1 AG + ... + c_q A^q G

Block Krylov Iteration: K = [ G, AG, A^2 G, ..., A^q G ]   (the Krylov subspace)
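As a sketch of how this Krylov subspace can be built in practice (assuming, as in the slides, a symmetric A; the per-block QR is a stabilization choice, not part of the definition of K):

```python
import numpy as np

def block_krylov_basis(A, k, q, seed=0):
    """Orthonormal basis for K = [G, AG, A^2 G, ..., A^q G], with A symmetric d x d.

    Each block is re-orthonormalized before the next multiplication; this keeps
    the computation well conditioned while preserving the span of every block.
    """
    rng = np.random.default_rng(seed)
    d = A.shape[0]
    block = rng.standard_normal((d, k))        # G
    blocks = [block]
    for _ in range(q):
        block, _ = np.linalg.qr(block)         # stabilize: same span as before
        block = A @ block                      # next block, spanning A^j G
        blocks.append(block)
    Q, _ = np.linalg.qr(np.hstack(blocks))     # orthonormal basis for span(K)
    return Q
```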

implicit use of acceleration polynomial

But... we can't explicitly compute T_q(A), since its parameters depend on A's (unknown) singular values.

Solution: returning the best Ũ_k in the span of K can only be better than returning span(T_q(A) G).

What is the best Ũ_k? A surprisingly difficult question.

For the Block Power Method we did not need to consider this: Ũ_k = span(A^q G) was the only option.

In classical Lanczos/Krylov analysis, convergence to the true singular vectors also lets us avoid this issue.

Answer: use the Rayleigh-Ritz procedure.

rayleigh-ritz post-processing

Project A onto K and take the top k singular vectors (using an accurate classical method):

Ũ_k = span( (P_K A)_k )

Block Krylov Iteration: K = [ G, AG, A^2 G, ..., A^q G ]  (Krylov subspace),  Ũ_k = span( (P_K A)_k )  (best solution in the Krylov subspace)

Equivalent to the classic Block Lanczos algorithm in exact arithmetic.
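A minimal sketch of the Rayleigh-Ritz step, reusing the hypothetical block_krylov_basis from the sketch above: project A onto the Krylov subspace and take the top k singular vectors of the small projected matrix with a classical SVD.

```python
import numpy as np

def rayleigh_ritz_topk(A, Q, k):
    """Top-k left singular vectors of P_K A = Q Q^T A, where Q spans the Krylov subspace."""
    Ub, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)   # small, dense SVD
    U_tilde = Q @ Ub[:, :k]        # lift back: an orthonormal basis for span((P_K A)_k)
    return U_tilde, s[:k]

# Putting the pieces together (names are from the earlier sketches, not the paper):
# Q = block_krylov_basis(A, k=10, q=8)
# U_tilde, s_tilde = rayleigh_ritz_topk(A, Q, k=10)
# A_approx = U_tilde @ (U_tilde.T @ A)
```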

This post-processing step provably gives an optimal Ũ_k for Frobenius norm low-rank approximation error.

Our entire analysis relied on converting very small Frobenius norm error into strong spectral norm and per vector error!

Take away: the modern denoising analysis gives new insight into the practical effectiveness of Rayleigh-Ritz projection.

final implementation

Similar to the randomized Block Power Method: extremely simple (pseudocode in the paper).

[Side-by-side pseudocode in the slides: Block Power Method vs. Block Krylov Iteration.]

performance

Block Krylov beats Block Power Method definitively for small ϵ.

[Plots: error ϵ vs. iteration count q, and error ϵ vs. runtime in seconds, for Block Krylov and Power Method under Frobenius, spectral, and per vector error, on the 20 Newsgroups dataset.]

final comments

Main Takeaway: the first gap independent bound for Krylov methods,

O( nnz(A) · k · log(d/ϵ) / √((σ_k − σ_{k+1})/σ_k) )  →  O( nnz(A) · k · log(d) / √ϵ )

Open Questions:

Full stability analysis.

Master error metric for gap independent results.

Gap independent bounds for other methods (e.g. online and stochastic PCA).

Analysis for small space/restarted block Krylov methods?

Thank you!

stability

Lanczos algorithms are often considered unstable, largely because a recurrence is used to efficiently compute a basis for the Krylov subspace on the fly. Since our subspace is small, we do not use the recurrence: computing the basis explicitly avoids serious stability issues.

There is some loss of orthogonality between blocks, but it only occurs once the algorithm has converged, and we can show that it is not an issue in practice.

On poorly conditioned matrices, Randomized Block Krylov Iteration still significantly outperforms the Block Power Method.

[Plot: per vector error ϵ vs. iteration count q for Block Krylov and Block Power on a poorly conditioned test matrix.]