A Tour of the Lanczos Algorithm and its Convergence Guarantees through the Decades


A Tour of the Lanczos Algorithm and its Convergence Guarantees through the Decades
Qiaochu Yuan, Department of Mathematics, UC Berkeley
Joint work with Prof. Ming Gu and Bo Li
April 17, 2018

Timeline

1950: Lanczos iteration algorithm
1965: Lanczos algorithm for SVD
1966, 1971, 1980: convergence bounds for Lanczos
1974, 1977: block Lanczos algorithm
1975, 1980: convergence bounds for block Lanczos
2009, 2010, 2015: randomized block Lanczos
2015: convergence bounds for randomized block Lanczos

Timeline - Our Work

Our work: 1) convergence bounds for RBL with arbitrary block size b; 2) superlinear convergence for typical data matrices.

Lanczos Iteration Algorithm

Developed by Lanczos in 1950 [Lan50]; a widely used iterative algorithm for computing the extremal eigenvalues, and corresponding eigenvectors, of a large, sparse, symmetric matrix A.

Goal. Given a symmetric matrix $A \in \mathbb{R}^{n \times n}$ with eigenvalues $\lambda_1 > \lambda_2 > \cdots > \lambda_n$ and associated eigenvectors $u_1, \dots, u_n$, we want to find approximations to
- $\lambda_i$, $i = 1, \dots, k$, the k largest eigenvalues of A, and
- $u_i$, $i = 1, \dots, k$, the associated eigenvectors,
where $k \ll n$.

Lanczos - Details

General idea:
1. Select an initial vector v.
2. Construct the Krylov subspace $K(A, v, k) = \mathrm{span}\{v, Av, A^2 v, \dots, A^{k-1} v\}$.
3. Restrict and project A to the Krylov subspace: $T = \mathrm{proj}_K A|_K$.
4. Use the eigenvalues and eigenvectors of T as approximations to those of A.

In matrices:

$$K_k = \begin{bmatrix} v & Av & \cdots & A^{k-1}v \end{bmatrix} \in \mathbb{R}^{n \times k}$$
$$Q_k = \begin{bmatrix} q_1 & q_2 & \cdots & q_k \end{bmatrix} \leftarrow \mathrm{qr}(K_k)$$
$$T_k = Q_k^T A Q_k \in \mathbb{R}^{k \times k}$$

Lanczos - Details

At each step $j = 1, \dots, k$ of the Lanczos iteration, $A Q_j = Q_j T_j + \beta_j q_{j+1} e_j^T$, where $Q_j = [q_1 \; \cdots \; q_j]$ and

$$T_j = \begin{pmatrix} \alpha_1 & \beta_1 & & \\ \beta_1 & \alpha_2 & \ddots & \\ & \ddots & \ddots & \beta_{j-1} \\ & & \beta_{j-1} & \alpha_j \end{pmatrix}.$$

Use the three-term recurrence

$$A q_j = \beta_{j-1} q_{j-1} + \alpha_j q_j + \beta_j q_{j+1}$$

and calculate the $\alpha$'s and $\beta$'s as

$$\alpha_j = q_j^T A q_j, \quad (1)$$
$$r_j = (A - \alpha_j I)\, q_j - \beta_{j-1} q_{j-1}, \quad (2)$$
$$\beta_j = \|r_j\|_2, \qquad q_{j+1} = r_j / \beta_j. \quad (3)$$
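To make the recurrence concrete, here is a minimal NumPy sketch of steps (1)-(3) (our own illustration, not code from the talk); it assumes a dense symmetric A and omits the reorthogonalization that a practical implementation needs in finite precision:

```python
import numpy as np

def lanczos(A, v, k):
    """Minimal Lanczos sketch: build Q_k and the tridiagonal T_k = Q_k^T A Q_k."""
    n = A.shape[0]
    Q = np.zeros((n, k))
    alpha, beta = np.zeros(k), np.zeros(k)
    Q[:, 0] = v / np.linalg.norm(v)
    for j in range(k):
        w = A @ Q[:, j]
        alpha[j] = Q[:, j] @ w              # alpha_j = q_j^T A q_j         (1)
        w -= alpha[j] * Q[:, j]             # r_j = (A - alpha_j I) q_j
        if j > 0:
            w -= beta[j - 1] * Q[:, j - 1]  #       - beta_{j-1} q_{j-1}    (2)
        beta[j] = np.linalg.norm(w)         # beta_j = ||r_j||_2            (3)
        if j + 1 < k:
            if beta[j] < 1e-12:             # lucky breakdown: invariant subspace found
                Q, alpha, beta = Q[:, :j+1], alpha[:j+1], beta[:j+1]
                break
            Q[:, j + 1] = w / beta[j]       # q_{j+1} = r_j / beta_j
    m = len(alpha)
    T = np.diag(alpha) + np.diag(beta[:m-1], 1) + np.diag(beta[:m-1], -1)
    return Q, T

# Ritz values (eigenvalues of T_k) approximate the extremal eigenvalues of A:
rng = np.random.default_rng(0)
M = rng.standard_normal((500, 500)); A = (M + M.T) / 2
Q, T = lanczos(A, rng.standard_normal(500), 30)
print(np.linalg.eigvalsh(T)[-3:])   # compare to np.linalg.eigvalsh(A)[-3:]
```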

Convergence of Lanczos

How well do $\lambda_i^{(k)}$, the eigenvalues of $T_k$, approximate $\lambda_i$, the eigenvalues of A, for $i = 1, \dots, k$? First answered by Kaniel in 1966 [Kan66] and Paige in 1971 [Pai71].

Theorem (Kaniel-Paige Inequality). If v is chosen not orthogonal to the eigenspace associated with $\lambda_1$, then

$$0 \le \lambda_1 - \lambda_1^{(k)} \le (\lambda_1 - \lambda_n)\, \frac{\tan^2 \theta(u_1, v)}{T_{k-1}^2(\gamma_1)} \quad (4)$$

where $T_i(x)$ is the Chebyshev polynomial of degree i, $\theta(\cdot, \cdot)$ is the angle between two vectors, and

$$\gamma_1 = 1 + 2\, \frac{\lambda_1 - \lambda_2}{\lambda_2 - \lambda_n}.$$

Convergence of Lanczos

Later generalized by Saad in 1980 [Saa80].

Theorem (Saad Inequality). For $i = 1, \dots, k$, if v is chosen such that $u_i^T v \ne 0$, then

$$0 \le \lambda_i - \lambda_i^{(k)} \le (\lambda_i - \lambda_n) \left( \frac{L_i^{(k)} \tan \theta(u_i, v)}{T_{k-i}(\gamma_i)} \right)^2 \quad (5)$$

where $T_i(x)$ is the Chebyshev polynomial of degree i, $\theta(\cdot, \cdot)$ is the angle between two vectors, and

$$\gamma_i = 1 + 2\, \frac{\lambda_i - \lambda_{i+1}}{\lambda_{i+1} - \lambda_n}, \qquad L_i^{(k)} = \begin{cases} \prod_{j=1}^{i-1} \frac{\lambda_j^{(k)} - \lambda_n}{\lambda_j^{(k)} - \lambda_i} & \text{if } i \ne 1, \\ 1 & \text{if } i = 1. \end{cases}$$

Aside - Chebyshev Polynomials

Recall

$$T_j(x) = \frac{1}{2} \left[ \left( x + \sqrt{x^2 - 1} \right)^j + \left( x - \sqrt{x^2 - 1} \right)^j \right]. \quad (6)$$

When j is large, $T_j(x) \approx \frac{1}{2}\left(x + \sqrt{x^2 - 1}\right)^j$, and when g is small, $T_j(1 + g) \approx \frac{1}{2}\left(1 + g + \sqrt{2g}\right)^j$. This growth determines the convergence of Lanczos, with

$$g = \Theta\left( \frac{\lambda_i - \lambda_{i+1}}{\lambda_{i+1} - \lambda_n} \right).$$

[Figure: growth of the Chebyshev polynomial (Lanczos) vs. the monomial (power method).]
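The gap in growth rates is easy to check numerically. A small illustration we add here (not from the talk), using the identity $T_j(x) = \cosh(j \cdot \mathrm{arccosh}(x))$ for $x \ge 1$:

```python
import numpy as np

# Compare Chebyshev growth T_j(1+g) (Lanczos) with the monomial (1+g)^j
# (power method) for a small spectral gap g.
g = 0.01
for j in [10, 25, 50, 100]:
    cheb = np.cosh(j * np.arccosh(1 + g))   # T_j(1+g), valid for 1+g >= 1
    mono = (1 + g) ** j
    print(f"j={j:4d}  T_j(1+g)={cheb:12.4e}  (1+g)^j={mono:8.4f}")
```

The Chebyshev value grows like $\frac{1}{2}(1 + g + \sqrt{2g})^j$, which for g = 0.01 is roughly $(1.15)^j$ versus $(1.01)^j$ for the monomial.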

Block Lanczos Algorithm

Introduced by Golub and Underwood in 1977 [GU77] and by Cullum and Donath in 1974 [CD74]. The block generalization of the Lanczos method uses, instead of a single initial vector v, a block of b vectors $V = [v_1 \; \cdots \; v_b]$, and builds the Krylov subspace in q iterations as

$$K(A, V, q) = \mathrm{span}\{V, AV, \dots, A^{q-1}V\}, \qquad k \le b, \quad bq \ll n.$$

Compared to classical Lanczos, block Lanczos
- is more memory- and cache-efficient, using BLAS3 operations;
- has the ability to converge to eigenvalues with cluster size > 1;
- has faster convergence with respect to the number of iterations.

Block Lanczos - Details

At each step $j = 1, \dots, q$ of the block Lanczos iteration, $A [Q_1 \; \cdots \; Q_j] = [Q_1 \; \cdots \; Q_j]\, T_j + Q_{j+1} [\,0 \; \cdots \; 0 \; B_j\,]$, where

$$T_j = \begin{pmatrix} A_1 & B_1^T & & \\ B_1 & A_2 & \ddots & \\ & \ddots & \ddots & B_{j-1}^T \\ & & B_{j-1} & A_j \end{pmatrix}.$$

Use the three-term recurrence

$$A Q_j = Q_{j-1} B_{j-1}^T + Q_j A_j + Q_{j+1} B_j$$

and calculate the A's and B's as

$$A_j = Q_j^T A Q_j, \quad (7)$$
$$Q_{j+1} B_j \leftarrow \mathrm{qr}\left( A Q_j - Q_j A_j - Q_{j-1} B_{j-1}^T \right). \quad (8)$$
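A direct NumPy transcription of recurrences (7)-(8), offered here as an illustrative sketch (again with no reorthogonalization, so only reliable for small q):

```python
import numpy as np

def block_lanczos(A, V, q):
    """Sketch of block Lanczos for symmetric A: returns Q = [Q_1 ... Q_q]
    and the block-tridiagonal projection T = Q^T A Q."""
    n, b = V.shape
    Qs = [np.linalg.qr(V)[0]]                # Q_1 from the starting block
    T = np.zeros((q * b, q * b))
    B_prev = None
    for j in range(q):
        W = A @ Qs[j]
        Aj = Qs[j].T @ W                     # A_j = Q_j^T A Q_j            (7)
        T[j*b:(j+1)*b, j*b:(j+1)*b] = Aj
        W -= Qs[j] @ Aj
        if j > 0:
            W -= Qs[j-1] @ B_prev.T          # subtract Q_{j-1} B_{j-1}^T
        if j + 1 < q:
            Qn, Bj = np.linalg.qr(W)         # Q_{j+1} B_j <- qr(...)       (8)
            T[(j+1)*b:(j+2)*b, j*b:(j+1)*b] = Bj
            T[j*b:(j+1)*b, (j+1)*b:(j+2)*b] = Bj.T
            Qs.append(Qn)
            B_prev = Bj
    return np.hstack(Qs), T
```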

Convergence of Block Lanczos

First analyzed by Underwood in 1975 [Und75] with the introduction of the block algorithm.

Theorem (Underwood Inequality). Let $\lambda_i, u_i$, $i = 1, \dots, n$, be the eigenvalues and eigenvectors of A, and let $U = [u_1 \; \cdots \; u_b]$. If $U^T V$ is of full rank b, then for $i = 1, \dots, b$,

$$0 \le \lambda_i - \lambda_i^{(q)} \le (\lambda_1 - \lambda_n)\, \frac{\tan^2 \Theta(U, V)}{T_{q-1}^2(\rho_i)} \quad (9)$$

where $T_i(x)$ is the Chebyshev polynomial of degree i, $\cos \Theta(U, V) = \sigma_{\min}(U^T V)$, and

$$\rho_i = 1 + 2\, \frac{\lambda_i - \lambda_{b+1}}{\lambda_{b+1} - \lambda_n}.$$

Convergence of Block Lanczos

Theorem (Saad Inequality [Saa80]). Let $\lambda_j, u_j$, $j = 1, \dots, n$, be the eigenvalues and eigenvectors of A, and let $U_i = [u_i \; \cdots \; u_{i+b-1}]$. For $i = 1, \dots, b$, if $U_i^T V$ is of full rank b, then

$$0 \le \lambda_i - \lambda_i^{(q)} \le (\lambda_i - \lambda_n) \left( \frac{L_i^{(q)} \tan \Theta(U_i, V)}{T_{q-i}(\hat{\gamma}_i)} \right)^2 \quad (10)$$

where $T_i(x)$ is the Chebyshev polynomial of degree i, $\cos \Theta(U_i, V) = \sigma_{\min}(U_i^T V)$, and

$$\hat{\gamma}_i = 1 + 2\, \frac{\lambda_i - \lambda_{i+b}}{\lambda_{i+b} - \lambda_n}, \qquad L_i^{(q)} = \begin{cases} \prod_{j=1}^{i-1} \frac{\lambda_j^{(q)} - \lambda_n}{\lambda_j^{(q)} - \lambda_i} & \text{if } i \ne 1, \\ 1 & \text{if } i = 1. \end{cases}$$

Aside - Block Size

Recall, when j is large and g is small,

$$T_j(1 + g) \approx \frac{1}{2} \left( 1 + g + \sqrt{2g} \right)^j. \quad (11)$$

classical: $g = \Theta\left( \frac{\lambda_i - \lambda_{i+1}}{\lambda_{i+1} - \lambda_n} \right)$; block: $g_b = \Theta\left( \frac{\lambda_i - \lambda_{i+b}}{\lambda_{i+b} - \lambda_n} \right)$.

Suppose the eigenvalues are distributed as $\lambda_j > (1 + \epsilon)\lambda_{j+1}$ for all j. Then $\lambda_{i+b} < \lambda_i / (1+\epsilon)^b$, so roughly

$$g_b \gtrsim (1 + \epsilon)^b - 1 \approx b\,\epsilon \approx b\, g,$$

i.e., the block method enjoys an effective gap about b times larger.

[Figure: growth of the Chebyshev polynomial (Lanczos) vs. the monomial (power method), for gaps g and g_b.]

Randomized Block Lanczos

In recent years, there has been increased interest in algorithms to compute low-rank approximations of matrices, with
- high computational-efficiency requirements: running on large matrices;
- low approximation-accuracy requirements: 2-3 digits of accuracy.

Applications are mostly in big-data computations:
- compression of data matrices;
- matrix processing techniques, e.g. PCA;
- optimization of the nuclear-norm objective function.

Randomized Block Lanczos

Grew out of work on randomized algorithms, in particular Randomized Subspace Iteration, by
- Rokhlin, Szlam, and Tygert in 2009 [RST09];
- Halko, Martinsson, Shkolnisky, and Tygert in 2011 [HMST11];
- Gu [Gu15] and Musco and Musco [MM15] in 2015.

Idea: instead of taking an arbitrary initial set of vectors V, an unfortunate choice of which could result in poor convergence, choose V = AΩ, a random projection of the columns of A, to better capture the range space.

RSI & RBL Pseudocode

Algorithm 1: Randomized Subspace Iteration (RSI)
Input: $A \in \mathbb{R}^{m \times n}$, target rank k, block size b, number of iterations q
Output: $B_k \in \mathbb{R}^{m \times n}$, a rank-k approximation
1: Draw $\Omega \in \mathbb{R}^{n \times b}$, $\omega_{ij} \sim N(0, 1)$.
2: Form $K = (AA^T)^{q-1} A\Omega$.
3: Orthogonalize: $\mathrm{qr}(K) \rightarrow Q$.
4: $B_k = Q (Q^T A)_k$.
Here $(M)_k$ denotes the k-truncated SVD of the matrix M.

Algorithm 2: Randomized Block Lanczos (RBL)
Input and Output: as in Algorithm 1.
1: Draw $\Omega \in \mathbb{R}^{n \times b}$, $\omega_{ij} \sim N(0, 1)$.
2: Form $K = [A\Omega,\; (AA^T)A\Omega,\; \dots,\; (AA^T)^{q-1} A\Omega]$.
3: Orthogonalize: $\mathrm{qr}(K) \rightarrow Q$.
4: $B_k = Q (Q^T A)_k$.
Requires $bq \ge k$; numerically unstable as written.
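For concreteness, here is a NumPy sketch of both algorithms (our own code mirroring the pseudocode; the helper name `truncate` is ours). As noted above, forming the whole Krylov block and orthogonalizing once, as RBL does here, is numerically unstable for large q; robust implementations orthogonalize block by block:

```python
import numpy as np

def truncate(Q, A, k):
    """B_k = Q (Q^T A)_k, where (M)_k is the k-truncated SVD of M."""
    U, s, Vt = np.linalg.svd(Q.T @ A, full_matrices=False)
    return Q @ (U[:, :k] * s[:k]) @ Vt[:k]

def rsi(A, k, b, q, seed=0):
    """Randomized Subspace Iteration (Algorithm 1)."""
    rng = np.random.default_rng(seed)
    K = A @ rng.standard_normal((A.shape[1], b))
    for _ in range(q - 1):
        K = A @ (A.T @ K)                    # K = (A A^T)^{q-1} A Omega
    Q, _ = np.linalg.qr(K)
    return truncate(Q, A, k)

def rbl(A, k, b, q, seed=0):
    """Randomized Block Lanczos (Algorithm 2), direct transcription."""
    rng = np.random.default_rng(seed)
    blocks = [A @ rng.standard_normal((A.shape[1], b))]
    for _ in range(q - 1):
        blocks.append(A @ (A.T @ blocks[-1]))
    Q, _ = np.linalg.qr(np.hstack(blocks))   # K = [A Omega, ..., (AA^T)^{q-1} A Omega]
    return truncate(Q, A, k)
```

Both return a rank-k matrix $B_k$; for the same number of matrix-vector products, RBL searches the larger subspace spanned by the full Krylov block.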

Lanczos for SVD

A modification of the Lanczos algorithm can be used to find the extremal singular-value pairs of a non-symmetric matrix $A \in \mathbb{R}^{m \times n}$, utilizing the connections between the singular value decomposition of A and the eigendecompositions of $AA^T$ and $A^T A$:
1. Select a block of initial vectors $V = [v_1 \; \cdots \; v_b]$.
2. Construct Krylov subspaces $K_V(A^T A, V, q)$ and $K_U(AA^T, AV, q)$.
3. Restrict and project A to form $B = \mathrm{proj}_{K_U} A|_{K_V}$.
4. Use the singular values and vectors of B as approximations to those of A.

Golub-Kahan Bidiagonalization [GK65]

$$A [V_1 \; \cdots \; V_j] = [U_1 \; \cdots \; U_j] \begin{pmatrix} A_1 & B_1^T & & \\ & A_2 & \ddots & \\ & & \ddots & B_{j-1}^T \\ & & & A_j \end{pmatrix},$$

$$A^T [U_1 \; \cdots \; U_j] = [V_1 \; \cdots \; V_j \; V_{j+1}] \begin{pmatrix} A_1^T & & & \\ B_1 & A_2^T & & \\ & \ddots & \ddots & \\ & & B_{j-1} & A_j^T \\ & & & B_j \end{pmatrix}.$$

At each step $j = 1, \dots, k$ of the bidiagonalization, calculate the A's and B's as

$$U_j A_j \leftarrow \mathrm{qr}\left( A V_j - U_{j-1} B_{j-1}^T \right), \quad (12)$$
$$V_{j+1} B_j \leftarrow \mathrm{qr}\left( A^T U_j - V_j A_j^T \right). \quad (13)$$
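A minimal NumPy sketch of recurrences (12)-(13) (our illustration; it assumes a dense A and ignores reorthogonalization and deflation):

```python
import numpy as np

def block_bidiag(A, V1, q):
    """Block Golub-Kahan bidiagonalization sketch. The singular values of the
    assembled block-bidiagonal matrix approximate the extremal ones of A."""
    Vs = [np.linalg.qr(V1)[0]]               # orthonormal starting block V_1
    Us, As, Bs = [], [], []
    for j in range(q):
        W = A @ Vs[j]
        if j > 0:
            W -= Us[j-1] @ Bs[j-1].T         # A V_j - U_{j-1} B_{j-1}^T
        Uj, Aj = np.linalg.qr(W)             # U_j A_j <- qr(...)        (12)
        Us.append(Uj); As.append(Aj)
        Vn, Bj = np.linalg.qr(A.T @ Uj - Vs[j] @ Aj.T)
        Vs.append(Vn); Bs.append(Bj)         # V_{j+1} B_j <- qr(...)    (13)
    return Us, Vs, As, Bs
```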

Convergence of RBL

Analyzed by Musco and Musco in 2015 [MM15].

Theorem. For block size $b \ge k$, in $q = \Theta\left(\frac{\log n}{\epsilon}\right)$ iterations of RSI and $q = \Theta\left(\frac{\log n}{\sqrt{\epsilon}}\right)$ iterations of RBL, with constant probability 99/100, the following bounds are satisfied:

$$\sigma_i^2 - \sigma_i^2(B_k) \le \epsilon\, \sigma_{k+1}^2, \qquad i = 1, \dots, k. \quad (14)$$

In the event that $\sigma_{b+1} \le c\,\sigma_k$ with $c < 1$, taking $q = \Theta\left(\frac{\log(n/\epsilon)}{\min(1,\, \sigma_k/\sigma_{b+1} - 1)}\right)$ for RSI and $q = \Theta\left(\frac{\log(n/\epsilon)}{\sqrt{\min(1,\, \sigma_k/\sigma_{b+1} - 1)}}\right)$ for RBL suffices.

Aside: Convergence of RSI

Many of our analysis techniques are similar to those used by Gu to analyze RSI [Gu15], and the bounds for RBL resulting from the current work are similar in form to the bounds for RSI.

Theorem (RSI convergence). Let $B_k$ be the approximation returned by the RSI algorithm on $A = U \Sigma V^T$, with target rank k, block size $b \ge k$, and q iterations. If $\hat{\Omega}_1$ has full row rank in $V^T \Omega = \begin{bmatrix} \hat{\Omega}_1^T & \hat{\Omega}_2^T \end{bmatrix}^T$, then

$$\sigma_j \ge \sigma_j(B_k) \ge \frac{\sigma_j}{\sqrt{1 + \|\hat{\Omega}_2\|_2^2\, \|\hat{\Omega}_1^{\dagger}\|_2^2 \left( \frac{\sigma_{b+1}}{\sigma_j} \right)^{4q+2}}}. \quad (15)$$

Current Work - Core Ideas

The core pieces of our analysis are:
1. the growth behavior of Chebyshev polynomials;
2. the choice of a clever orthonormal basis for the Krylov subspace;
3. the creation of a spectral gap, by separating the spectrum of A into the singular values close to $\sigma_k$ and those sufficiently smaller in magnitude.

Setup & Notation

For block size b and random Gaussian matrix $\Omega \in \mathbb{R}^{n \times b}$, we are interested in the column span of

$$K_q = \begin{bmatrix} A\Omega & (AA^T) A\Omega & \cdots & (AA^T)^{q-1} A\Omega \end{bmatrix}.$$

Let the SVD of $A \in \mathbb{R}^{m \times n}$ be $U \Sigma V^T$. For any $0 \le p \le q$, define

$$\hat{K}_p := U\, T_{2p+1}(\Sigma) \begin{bmatrix} \hat{\Omega} & \Sigma^2 \hat{\Omega} & \cdots & \Sigma^{2(q-p-1)} \hat{\Omega} \end{bmatrix} = U\, T_{2p+1}(\Sigma)\, V_{q-p}.$$

Note $V_{q-p} \in \mathbb{R}^{n \times b(q-p)}$; we require $b(q-p) \ge k$, the target rank.

Lemma. With $\hat{K}_p$ and $K_q$ as previously defined,

$$\mathrm{span}\{\hat{K}_p\} \subseteq \mathrm{span}\{K_q\}. \quad (16)$$

Convergence of RBL

Theorem (Main Result). Let $B_k$ be the approximation returned by the RBL algorithm. With notation as previously defined, if the random starting $\Omega$ is initialized such that the block Vandermonde matrix formed by the sub-blocks of $V_{q-p}$ is invertible, then for all $1 \le j \le k$ and all choices* of s and r,

$$\sigma_j \ge \sigma_j(B_k) \ge \frac{\sigma_j}{\sqrt{1 + C^2\, T_{2p+1}^{-2}\left(1 + 2\, \frac{\sigma_{j+s} - \sigma_{j+s+r+1}}{\sigma_{j+s+r+1}}\right)}} \quad (17)$$

where $p = q - \frac{k+r}{b}$ and C is a constant independent of q.

*s is chosen to be non-zero to handle multiple singular values, and can be set to zero otherwise.

Summary of Convergence Bounds

Rewriting the bounds in comparable forms:

- Saad [Saa80]: $\lambda_j^{(q)} \ge \lambda_j \Big/ \left( 1 + (L_j^{(q)})^2 \tan^2 \Theta(U, V)\, T_{q-j}^{-2}\left(1 + 2\, \frac{\lambda_j - \lambda_{j+b}}{\lambda_{j+b}}\right) \right)$; requires $b \ge k$.
- Musco [MM15], spectrum-independent: $\sigma_j^{(q)} \ge \sigma_j \Big/ \sqrt{1 + C_1^2\, \frac{\log^2 n}{q^2}\, \frac{\sigma_{k+1}^2}{\sigma_j^2}}$; requires $b \ge k$.
- Musco [MM15], spectrum-dependent: $\sigma_j^{(q)} \ge \sigma_j \Big/ \sqrt{1 + C_2^2\, n\, e^{-q \min(1,\, \sigma_k/\sigma_{b+1} - 1)}\, \frac{\sigma_{k+1}^2}{\sigma_j^2}}$; requires $b \ge k$.
- Current work: $\sigma_j^{(q)} \ge \sigma_j \Big/ \sqrt{1 + C_3^2\, T_{2q+1-2(k+r)/b}^{-2}\left(1 + 2\, \frac{\sigma_j - \sigma_{j+r+1}}{\sigma_{j+r+1}}\right)}$; requires $b \ge 1$, $bq \ge k + r$.

Typical Case - Superlinear Convergence

Recall: $\{a_q\}$ converges superlinearly to a if

$$\lim_{q \to \infty} \frac{|a_{q+1} - a|}{|a_q - a|} = 0. \quad (18)$$

In practice, Lanczos algorithms (classical, block, randomized) often exhibit superlinear convergence behavior. It has been shown that classical Lanczos iteration is theoretically superlinearly convergent under certain assumptions about the singular spectrum [Saa94, Li10]. We show this for block Lanczos algorithms, i.e., that under certain assumptions about the singular spectrum, block Lanczos produces rank-k approximations $B_k$ such that $\sigma_j(B_k) \to \sigma_j$ superlinearly.

Typical Case - Superlinear Convergence

A typical data matrix might have a singular value spectrum decaying to 0, i.e., $\sigma_j \to 0$. In this case our bound suggests that convergence is governed by

$$a_q := \left( C(r)\, T_p^{-1}(1 + g) \right)^2 \approx \left( C(r) \cdot 2 \left( 1 + g + \sqrt{2g} \right)^{-p} \right)^2$$

with

$$g = 2\, \frac{\sigma_j - \sigma_{j+r+1}}{\sigma_{j+r+1}}, \qquad p = 2\left( q - \frac{k+r}{b} \right) + 1 = 2q + 1 - 2\, \frac{k+r}{b}.$$

As $\sigma_{j+r+1} \to 0$, $g \to \infty$. We argue that $a_{q+1}/a_q \to 0$ as follows: for all $\epsilon > 0$, choose¹ r so that $1 + g \ge \epsilon^{-1/2}$; then $a_{q+1} \le \epsilon\, a_q$.

¹Recall (1) our main result holds for all r; (2) $k + r = (q - p)b$, so choosing r amounts to choosing q.

Effect of r

There is some optimal value of r, typically non-zero, which achieves the best convergence factor. The balance is between larger (smaller) values of r, which imply a lower (higher) Chebyshev degree but a bigger (smaller) gap.

Figure: value of the reciprocal convergence factor $T_{2q+1-2(k+r)/b}^{-1}\left(1 + 2\, \frac{\sigma_j - \sigma_{j+r+1}}{\sigma_{j+r+1}}\right)$ as r varies, for the Daily Activities and Sports dataset, k = j = 100, b = 10, q = 20.

Numerical Example

Experimentally, choices of smaller block sizes $1 \le b < k$ appear favorable, with superlinear convergence for all block sizes.

Figure: Daily Activities and Sports dataset, $A \in \mathbb{R}^{9120 \times 5625}$, target k = 200, j = 200. Relative error $(\sigma_j - \sigma_j(B_k))/\sigma_j$ versus the number of MATVECs, $b(1 + 2q)$, for RSI with b = 200+2, 200+5, 200+10 and RBL with b = 200, 100, 50, 20, 10, 2.

Numerical Example

Figure: Eigenfaces dataset, $A \in \mathbb{R}^{10304 \times 400}$, target k = 100, j = 100. Relative error $(\sigma_j - \sigma_j(B_k))/\sigma_j$ versus the number of MATVECs, $b(1 + 2q)$, for RSI with b = 100, 100+2, 100+5 and RBL with b = 100, 50, 20, 10, 2, 1.

Conclusions

Both the theoretical analysis and the numerical evidence suggest that, holding the number of matrix-vector operations constant, RBL with a smaller block size b is better, and that for matrices with decaying spectra, RBL achieves superlinear convergence. However, the preference for smaller b must be balanced against the advantages of a larger b for computational efficiency and numerical stability in a practical implementation, and should be investigated further.

References I

Jane Cullum and William E. Donath, A block Lanczos algorithm for computing the q algebraically largest eigenvalues and a corresponding eigenspace of large, sparse, real symmetric matrices, 1974 IEEE Conference on Decision and Control including the 13th Symposium on Adaptive Processes, vol. 13, IEEE, 1974, pp. 505-509.

Gene Golub and William Kahan, Calculating the singular values and pseudo-inverse of a matrix, Journal of the Society for Industrial and Applied Mathematics, Series B: Numerical Analysis 2 (1965), no. 2, 205-224.

Gene Howard Golub and Richard Underwood, The block Lanczos method for computing eigenvalues, Mathematical Software, Elsevier, 1977, pp. 361-377.

References II

Ming Gu, Subspace iteration randomization and singular value problems, SIAM Journal on Scientific Computing 37 (2015), no. 3, A1139-A1173.

Nathan Halko, Per-Gunnar Martinsson, Yoel Shkolnisky, and Mark Tygert, An algorithm for the principal component analysis of large data sets, SIAM Journal on Scientific Computing 33 (2011), no. 5, 2580-2594.

Shmuel Kaniel, Estimates for some computational techniques in linear algebra, Mathematics of Computation 20 (1966), no. 95, 369-378.

Cornelius Lanczos, An iteration method for the solution of the eigenvalue problem of linear differential and integral operators, United States Government Press Office, Los Angeles, CA, 1950.

References III

Ren-Cang Li, Sharpness in rates of convergence for the symmetric Lanczos method, Mathematics of Computation 79 (2010), no. 269, 419-435.

Cameron Musco and Christopher Musco, Randomized block Krylov methods for stronger and faster approximate singular value decomposition, Advances in Neural Information Processing Systems, 2015, pp. 1396-1404.

Christopher Conway Paige, The computation of eigenvalues and eigenvectors of very large sparse matrices, Ph.D. thesis, University of London, 1971.

Vladimir Rokhlin, Arthur Szlam, and Mark Tygert, A randomized algorithm for principal component analysis, SIAM Journal on Matrix Analysis and Applications 31 (2009), no. 3, 1100-1124.

References IV

Yousef Saad, On the rates of convergence of the Lanczos and the block-Lanczos methods, SIAM Journal on Numerical Analysis 17 (1980), no. 5, 687-706.

Yousef Saad, Theoretical error bounds and general analysis of a few Lanczos-type algorithms, 1994.

Richard Underwood, An iterative block Lanczos method for the solution of large sparse symmetric eigenproblems, Tech. report, 1975.