randomized block krylov methods for stronger and faster approximate svd


randomized block krylov methods for stronger and faster approximate svd
Cameron Musco and Christopher Musco
December 2, 2015
Massachusetts Institute of Technology, EECS

singular value decomposition

A = U Σ V^T, where A is n × d, the columns of U are the left singular vectors, Σ = diag(σ_1, σ_2, ..., σ_{d-1}, σ_d) holds the singular values, and the columns of V are the right singular vectors.

Extremely important primitive for dimensionality reduction, low-rank approximation, PCA, etc.

u_i = argmax_{x : ‖x‖ = 1, x ⊥ u_1, ..., u_{i-1}} x^T A A^T x

The truncated SVD A_k = U_k Σ_k V_k^T = U_k U_k^T A is the best rank-k approximation in both the Frobenius and spectral norms:

A_k = argmin_{B : rank(B) = k} ‖A − B‖_F = argmin_{B : rank(B) = k} ‖A − B‖_2

standard svd approximation metrics

Frobenius Norm Low-Rank Approximation (weak):
‖A − Ũ_k Ũ_k^T A‖_F ≤ (1 + ϵ) ‖A − U_k U_k^T A‖_F

Note that ‖A‖_F^2 = Σ_{i=1}^d σ_i^2 and ‖A − U_k U_k^T A‖_F^2 = ‖A − A_k‖_F^2 = Σ_{i=k+1}^d σ_i^2.

[Plot: singular value σ_i vs. index i for the SNAP/email-Enron matrix.]

For many datasets literally any Ũ_k would work!

Spectral Norm Low-Rank Approximation (stronger):
‖A − Ũ_k Ũ_k^T A‖_2 ≤ (1 + ϵ) ‖A − U_k U_k^T A‖_2 = (1 + ϵ) σ_{k+1}

Per Vector Principal Component Error (strongest):
ũ_i^T A A^T ũ_i ≥ (1 − ϵ) u_i^T A A^T u_i for all i ≤ k.
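As a concrete illustration (not from the talk), the following NumPy sketch evaluates a candidate basis Ũ_k against all three metrics above. It assumes a dense matrix A with k < min(n, d), computes an exact SVD purely as the reference, and all function and variable names are illustrative.

```python
import numpy as np

def svd_error_metrics(A, U_tilde, k):
    """Compare a candidate n x k orthonormal basis U_tilde to the exact SVD of A."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)      # reference (exact) SVD
    A_k = (U[:, :k] * s[:k]) @ Vt[:k]                     # best rank-k approximation
    R = A - U_tilde @ (U_tilde.T @ A)                     # residual of the candidate
    frob_ratio = np.linalg.norm(R, 'fro') / np.linalg.norm(A - A_k, 'fro')  # want <= 1 + eps
    spec_ratio = np.linalg.norm(R, 2) / s[k]                                # want <= 1 + eps
    # Per vector error: u_i^T A A^T u_i equals sigma_i^2, so compare the
    # Rayleigh quotients of the candidate vectors to s[:k]**2.
    rayleigh = np.einsum('ij,ij->j', U_tilde, A @ (A.T @ U_tilde))
    per_vector = rayleigh / s[:k] ** 2                                      # want >= 1 - eps
    return frob_ratio, spec_ratio, per_vector
```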

main research question

Classic Full SVD Algorithms (e.g. the QR Algorithm): all of these goals in roughly O(nd^2) time (the error dependence is log log(1/ϵ) on lower order terms). Unfortunately, this is much too slow for many data sets.

How fast can we approximately compute just u_1, ..., u_k?

frobenius norm error

Weak Approximation Algorithms:

Strong Rank Revealing QR (Gu, Eisenstat 1996):
‖A − Ũ_k Ũ_k^T A‖_F ≤ poly(n, k) ‖A − A_k‖_F in time O(ndk)

Sparse Subspace Embeddings (Clarkson, Woodruff 2013):
‖A − Ũ_k Ũ_k^T A‖_F ≤ (1 + ϵ) ‖A − A_k‖_F in time O(nnz(A)) + Õ(nk^2/ϵ^4)

iterative svd algorithms

Iterative methods are the only game in town for stronger guarantees. Runtime is approximately:

O(nnz(A) · k · #iterations) ≤ O(ndk · #iterations) << O(nd^2)

Power method (Müntz 1913, von Mises 1929)
Krylov/Lanczos methods (Lanczos 1950)
Stochastic methods?

power method review

Traditional Power Method: x_0 ∈ R^d random, x_{i+1} ← A x_i / ‖A x_i‖

Block Power Method (Simultaneous/Subspace Iteration): X_0 ∈ R^{d×k} random, X_{i+1} ← orthonormalize(A X_i)

[Animation: on a small example matrix A, the columns of the iterates X_1, X_2, ..., X_9 rotate steadily toward the top singular vectors U_2.]
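To make the block power method concrete, here is a minimal NumPy sketch of randomized simultaneous iteration for a general rectangular A; the function name, the default iteration count, and the use of A A^T (rather than a symmetric A as in the animation) are illustrative assumptions, not the talk's exact pseudocode.

```python
import numpy as np

def block_power_method(A, k, q=10, seed=0):
    """Randomized block power (subspace) iteration.

    Returns an n x k orthonormal basis whose span approximates the span of
    A's top-k left singular vectors after q iterations.
    """
    rng = np.random.default_rng(seed)
    n, d = A.shape
    X = A @ rng.standard_normal((d, k))       # random Gaussian start block, X_1 = A G
    for _ in range(q):
        X, _ = np.linalg.qr(X)                # orthonormalize(.)
        X = A @ (A.T @ X)                     # one power iteration step with A A^T
    X, _ = np.linalg.qr(X)
    return X

# Usage: U_tilde = block_power_method(A, k=10, q=20); then A ~ U_tilde @ (U_tilde.T @ A)
```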

power method runtime

Runtime for the Block Power Method is roughly:

O( nnz(A) · k · log(d/ϵ) / ((σ_k − σ_{k+1})/σ_k) )

Linear dependence on the singular value gap: gap = (σ_k − σ_{k+1}) / σ_k

While this gap is traditionally assumed to be constant, it is the dominant factor in the iteration count for many datasets.

typical gap values

[Plot: singular values σ_i vs. index i for the Slashdot social network graph from the Stanford Network Analysis Project, with a zoomed view of the flat middle stretch of the spectrum.]

The median value of gap_k = (σ_k − σ_{k+1})/σ_k over the plotted range of k is on the order of a percent, and the minimum is far smaller still.

Plugging that minimum gap into the bound above gives Runtime = O(25,000 · nnz(A) · k · log(d/ϵ)).
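A quick way to see how such gaps drive the iteration count is to compute gap_k from a vector of singular values and look at 1/gap_k, the factor that multiplies nnz(A) · k · log(d/ϵ) in the bound above. The sketch below uses a synthetic slowly decaying spectrum, not the Slashdot data.

```python
import numpy as np

def relative_gaps(singular_values, k_max):
    """gap_k = (sigma_k - sigma_{k+1}) / sigma_k for k = 1, ..., k_max."""
    s = np.asarray(singular_values, dtype=float)
    return (s[:k_max] - s[1:k_max + 1]) / s[:k_max]

# Synthetic example: a flat, slowly decaying spectrum produces tiny gaps,
# so 1/gap_k (the iteration blow-up factor) grows linearly with k.
s = 1.0 / np.sqrt(np.arange(1, 1001))
gaps = relative_gaps(s, 100)
print(gaps.min(), np.median(gaps), 1.0 / gaps.min())
```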

gap independent bounds

Recent work shows that the Block Power Method (with randomized start vectors) gives:

‖A − Ũ_k Ũ_k^T A‖_2 ≤ (1 + ϵ) ‖A − A_k‖_2 in time O( nnz(A) · k · log(d) / ϵ )

Improves on classical bounds when ϵ > (σ_k − σ_{k+1})/σ_k.

Long series of refinements and improvements: Rokhlin, Szlam, Tygert 2009; Halko, Martinsson, Tropp 2011; Boutsidis, Drineas, Magdon-Ismail 2011; Witten, Candès 2014; Woodruff 2014.

randomized block power method

The Randomized Block Power Method is widely cited and implemented: a simple algorithm with simple bounds.

redsvd, ScaleNLP (Breeze), libskylark, GNU Octave, MLlib

But in the numerical linear algebra community, Krylov/Lanczos methods have long been preferred over power iteration.

lanczos/krylov acceleration

Power Method: gap dependent O( nnz(A) · k · log(d/ϵ) / ((σ_k − σ_{k+1})/σ_k) ); gap independent O( nnz(A) · k · log(d) / ϵ )

Krylov Methods: gap dependent O( nnz(A) · k · log(d/ϵ) / √((σ_k − σ_{k+1})/σ_k) ); gap independent: ? — no gap independent analysis of Krylov methods!

Our Contribution: a gap independent bound for Krylov methods, O( nnz(A) · k · log(d) / √ϵ )

our main result

A simple randomized Block Krylov Iteration gives all three of our target error bounds in time:

O( nnz(A) · k · log(d) / √ϵ )

‖A − Ũ_k Ũ_k^T A‖_2 ≤ (1 + ϵ) σ_{k+1}  and  |ũ_i^T A A^T ũ_i − σ_i^2| ≤ ϵ σ_{k+1}^2

Gives a runtime bound that is independent of A.

Beats the runtime of the Block Power Method.

Improves classic Lanczos bounds when (σ_k − σ_{k+1})/σ_k < ϵ.

understanding gap dependence

First Step: Where does gap dependence actually come from?

To prove guarantees like ũ_i^T A A^T ũ_i ≥ (1 − ϵ) σ_i^2, classical analysis argues about convergence to A's true singular vectors.

[Figure: the error is measured as the distance between the true singular vector u_i and its approximation ũ_i.]

Traditional objective function: ‖u_i − ũ_i‖_2.

This objective is a simple potential function, easy to work with.

It can be used to prove strong per vector error or spectral norm guarantees for ‖A − Ũ_k Ũ_k^T A‖_2.

But it inherently requires an iteration count that depends on singular value gaps.

[Animation: on an example matrix whose relevant singular values are nearly equal, the block power iterates X_1, ..., X_9 converge only very slowly toward the true singular vectors U_2.]

key insight

Convergence becomes less necessary precisely when it is difficult to achieve!

[Figure: when two singular values are nearly equal, the corresponding singular directions of A are essentially interchangeable for approximation purposes.]

Minimizing ‖u_i − ũ_i‖_2 is sufficient, but far from necessary.

modern approach to analysis

Iterative methods can be viewed as denoising procedures for random sketching methods.

Choose G ∼ N(0, 1)^{d×k}. If A has rank k then: span(AG) = span(A)

Ũ_k = span(AG) = span(A)  ⟹  ‖A − Ũ_k Ũ_k^T A‖_F = ‖A − A‖_F = 0

If A is not rank k then we have error due to A − A_k:

Ũ_k = span(AG)  ⟹  ‖A − Ũ_k Ũ_k^T A‖_F ≤ poly(d) · ‖A − A_k‖_F

This gives an error bound for a single power method iteration.

It is meaningless unless ‖A − A_k‖_F (the "tail noise") is very small.

How to avoid tail noise? Apply the sketching method to A^q instead.

This is exactly what the Block Power Method does: G → AG → A^2 G → ... → A^q G,  Ũ_k = span(A^q G)

Assuming A is symmetric, if A = U Σ U^T then A^q = U Σ^q U^T: the singular values σ_1, ..., σ_d become σ_1^q, ..., σ_d^q.

[Plot: spectrum of A vs. spectrum of A^q — powering drives every singular value below the top few toward zero.]

‖A^q − A^q_k‖_F^2 = Σ_{i=k+1}^d σ_i^{2q} is extremely small.

q = Õ(1/ϵ) ensures that any singular value below σ_{k+1} becomes extremely small in comparison to any singular value above (1 + ϵ) σ_{k+1}:

(1 − ϵ)^{O(1/ϵ)} << 1

Ũ_k = span(A^q G) must align well with the large (but not necessarily the largest!) singular vectors of A^q to achieve even coarse Frobenius norm error:

‖A^q − Ũ_k Ũ_k^T A^q‖_F ≤ poly(d) · ‖A^q − A^q_k‖_F

A and A^q have the same singular vectors, so Ũ_k is also good for A.
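A back-of-the-envelope check (an illustration under assumed constants, not a claim from the talk): with q on the order of (1/ϵ) · log d, a singular value just below σ_{k+1} shrinks to roughly 1/poly(d) of a singular value above (1 + ϵ)σ_{k+1}, which is what makes the poly(d) tail-noise factor harmless.

```python
import numpy as np

# After q powers, the worst-case ratio between a "tail" singular value (<= sigma_{k+1})
# and a "head" singular value (>= (1 + eps) * sigma_{k+1}) is at most (1 / (1 + eps))**q.
eps, d, c = 0.01, 10_000, 3.0            # c is an assumed constant in q = O(log(d) / eps)
q = int(np.ceil(c * np.log(d) / eps))
print(q, (1.0 / (1.0 + eps)) ** q)       # q = 2764, ratio ~ 1e-12, i.e. about d**(-c)
```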

We use new tools for converting very small Frobenius norm low-rank approximation error into strong spectral norm and per vector error, without arguing about convergence of ũ_i to u_i.

krylov acceleration

There are better polynomials than A^q for denoising A.

[Plot: x^{O(1/ϵ)} vs. the Chebyshev polynomial T_{O(1/√ϵ)}(x) on [0, 1].]

With Chebyshev polynomials we only need degree q = Õ(1/√ϵ).

The Chebyshev polynomial T_q(A) has a very small tail, so returning Ũ_k = span(T_q(A) G) would suffice.

Furthermore, block power iteration computes (at intermediate steps) all of the components needed for:

T_q(A) G = c_0 G + c_1 AG + ... + c_q A^q G

Block Krylov Iteration: K = [ G, AG, A^2 G, ..., A^q G ]   (the Krylov subspace)
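As a sketch of how this Krylov subspace can be built in practice (assuming, as in the slides, a symmetric A; the per-block QR is a stabilization choice, not part of the definition of K):

```python
import numpy as np

def block_krylov_basis(A, k, q, seed=0):
    """Orthonormal basis for K = [G, AG, A^2 G, ..., A^q G], with A symmetric d x d.

    Each block is re-orthonormalized before the next multiplication; this keeps
    the computation well conditioned while preserving the span of every block.
    """
    rng = np.random.default_rng(seed)
    d = A.shape[0]
    block = rng.standard_normal((d, k))        # G
    blocks = [block]
    for _ in range(q):
        block, _ = np.linalg.qr(block)         # stabilize: same span as before
        block = A @ block                      # next block, spanning A^j G
        blocks.append(block)
    Q, _ = np.linalg.qr(np.hstack(blocks))     # orthonormal basis for span(K)
    return Q
```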

implicit use of acceleration polynomial

But... we can't explicitly compute T_q(A), since its parameters depend on A's (unknown) singular values.

Solution: returning the best Ũ_k in the span of K can only be better than returning span(T_q(A) G).

What is the best Ũ_k? A surprisingly difficult question.

For the Block Power Method we did not need to consider this: Ũ_k = span(A^q G) was the only option.

In classical Lanczos/Krylov analysis, convergence to the true singular vectors also lets us avoid this issue.

Answer: use the Rayleigh-Ritz procedure.

rayleigh-ritz post-processing

Project A onto K and take the top k singular vectors (using an accurate classical method):

Ũ_k = span( (P_K A)_k )

Block Krylov Iteration: K = [ G, AG, A^2 G, ..., A^q G ]  (Krylov subspace),  Ũ_k = span( (P_K A)_k )  (best solution in the Krylov subspace)

Equivalent to the classic Block Lanczos algorithm in exact arithmetic.
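A minimal sketch of the Rayleigh-Ritz step, reusing the hypothetical block_krylov_basis from the sketch above: project A onto the Krylov subspace and take the top k singular vectors of the small projected matrix with a classical SVD.

```python
import numpy as np

def rayleigh_ritz_topk(A, Q, k):
    """Top-k left singular vectors of P_K A = Q Q^T A, where Q spans the Krylov subspace."""
    Ub, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)   # small, dense SVD
    U_tilde = Q @ Ub[:, :k]        # lift back: an orthonormal basis for span((P_K A)_k)
    return U_tilde, s[:k]

# Putting the pieces together (names are from the earlier sketches, not the paper):
# Q = block_krylov_basis(A, k=10, q=8)
# U_tilde, s_tilde = rayleigh_ritz_topk(A, Q, k=10)
# A_approx = U_tilde @ (U_tilde.T @ A)
```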

This post-processing step provably gives an optimal Ũ_k for Frobenius norm low-rank approximation error.

Our entire analysis relied on converting very small Frobenius norm error into strong spectral norm and per vector error!

Take away: the modern denoising analysis gives new insight into the practical effectiveness of Rayleigh-Ritz projection.

final implementation

Similar to the randomized Block Power Method: extremely simple (pseudocode in the paper).

[Side-by-side pseudocode in the slides: Block Power Method vs. Block Krylov Iteration.]

performance

Block Krylov beats Block Power Method definitively for small ϵ.

[Plots: error ϵ vs. iteration count q, and error ϵ vs. runtime in seconds, for Block Krylov and Power Method under Frobenius, spectral, and per vector error, on the 20 Newsgroups dataset.]

final comments

Main Takeaway: the first gap independent bound for Krylov methods,

O( nnz(A) · k · log(d/ϵ) / √((σ_k − σ_{k+1})/σ_k) )  →  O( nnz(A) · k · log(d) / √ϵ )

Open Questions:

Full stability analysis.

Master error metric for gap independent results.

Gap independent bounds for other methods (e.g. online and stochastic PCA).

Analysis for small space/restarted block Krylov methods?

Thank you!

stability

Lanczos algorithms are often considered unstable, largely because a recurrence is used to efficiently compute a basis for the Krylov subspace on the fly. Since our subspace is small, we do not use the recurrence: computing the basis explicitly avoids serious stability issues.

There is some loss of orthogonality between blocks, but it only occurs once the algorithm has converged, and we can show that it is not an issue in practice.

On poorly conditioned matrices, Randomized Block Krylov Iteration still significantly outperforms the Block Power Method.

[Plot: per vector error ϵ vs. iteration count q for Block Krylov and Block Power on a poorly conditioned test matrix.]