Communication-avoiding Krylov subspace methods

Mark Hoemmen (mhoemmen@cs.berkeley.edu)
University of California Berkeley EECS
MS Numerical Libraries Group visit: 28 April 2008

Overview
- Current Krylov methods: communication-limited
- Can rearrange them to avoid communication
- Can do this in a numerically stable way
- Requires stepping outside the black box

Review: Krylov subspace methods
- Given these kernels:
  - Sparse matrix-vector product (SpMV) with operator A
  - (Possibly) a preconditioner operator M^{-1}
  - Dot products and vector operations
- Solve Ax = b or Ax = λx iteratively
  - By constructing a basis of span{r, Ar, A^2 r, ...}
  - And projecting onto it

Encapsulation in standard Krylov methods
- Krylov methods are great for software engineers!
  - All kernels are called as black boxes
- But each kernel is challenging to optimize!
  - SpMV needs tuning & domain-specific hints
  - Preconditioners are a domain-specific black art
  - Efficient reductions are nontrivial on complex architectures

Problem: Kernels are communication-bound
- Sparse matrix-vector multiplication
  - Parallel: latency (communicate with neighbors)
  - Sequential: bandwidth (read matrix)
- Preconditioner
  - Much like SpMV
- Orthogonalization
  - Dot products and norms
  - Θ(1) reductions per vector

Amortize the communication
- Our goal: amortize the cost of each kernel over s steps
  - Compute many SpMVs for the cost of 1 SpMV
  - Orthogonalize many vectors for the cost of 1 reduction
- ...which we accomplish by breaking encapsulation. Oops!

Matrix powers kernel
- Replace s SpMVs with the matrix powers kernel
  - Marghoob's talk (details in Demmel 2007)
- Compute a basis of span{v, Av, A^2 v, ..., A^s v}...
  - With the same communication cost as one SpMV
  - Via overlapping ghost zones + redundant computation
Figure: three panels for k = 8: local dependencies, type (1) remote dependencies, type (2) remote dependencies.
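The ghost-zone idea can be sketched for the simplest case, a tridiagonal matrix split into contiguous row blocks. This is only an illustration under stated assumptions: the function names (`tridiag`, `matrix_powers_reference`, `matrix_powers_local`) are hypothetical, the matrix is stored densely to keep the code short, and the "communication" is just an array slice. It is not the kernel described in Demmel 2007.

```python
import numpy as np

def tridiag(n):
    """Dense 1-D Poisson matrix tridiag(-1, 2, -1); dense only to keep the sketch short."""
    return 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)

def matrix_powers_reference(A, v, s):
    """[A v, A^2 v, ..., A^s v] computed with s global SpMVs."""
    cols, x = [], v
    for _ in range(s):
        x = A @ x
        cols.append(x)
    return np.column_stack(cols)

def matrix_powers_local(A, v, lo, hi, s):
    """Rows lo:hi of [A v, ..., A^s v], using a single ghost-zone fetch.

    Assumes A is tridiagonal: each SpMV invalidates one more entry at
    every cut edge, so a ghost zone of width s is enough for s steps.
    """
    n = len(v)
    g_lo, g_hi = max(lo - s, 0), min(hi + s, n)
    x = v[g_lo:g_hi].copy()              # the only "communication" step
    A_loc = A[g_lo:g_hi, g_lo:g_hi]      # owned rows plus ghost rows
    cols = []
    for _ in range(s):
        x = A_loc @ x                    # redundant work on ghost entries
        cols.append(x[lo - g_lo:hi - g_lo].copy())
    return np.column_stack(cols)
```

Each "processor" calls `matrix_powers_local` on its own row range; the result should agree with the corresponding rows of `matrix_powers_reference`, at the price of recomputing values inside the ghost zone.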

Tall Skinny QR
- Replace orthogonalization with Tall Skinny QR (TSQR)
  - Demmel et al. 2008 (coming soon)
- Block row reduction in one communication step
  - QR factorization is the reduction operator
- Unconditionally stable
Figure: TSQR on a binary tree of 4 processors.
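A minimal sketch of the reduction, assuming a binary tree over row blocks and using NumPy's dense QR at every node. The name `tsqr_r` is hypothetical, and the sketch returns only the R factor; a real implementation would also keep the implicit Q factors stored at each tree node.

```python
import numpy as np

def tsqr_r(W, nblocks=4):
    """R factor of a tall-skinny W via a binary reduction tree over row blocks."""
    # Leaves: each "processor" factors its own block of rows.
    Rs = [np.linalg.qr(B, mode="r") for B in np.array_split(W, nblocks, axis=0)]
    # Tree: stack pairs of R factors and factor again until one R remains.
    while len(Rs) > 1:
        Rs = [np.linalg.qr(np.vstack(Rs[i:i + 2]), mode="r")
              for i in range(0, len(Rs), 2)]
    return Rs[0]
```

The final R should match the R factor of a direct QR factorization of W up to the signs of its rows, since the QR factorization of a full-rank matrix is unique only up to those signs.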

What stops us?
- What stops us from using both kernels in a Krylov method?

Data dependencies
- Krylov methods advance one vector at a time
  - SpMV, then orthogonalize, then SpMV, ...
Figure: Data dependencies in Krylov subspace methods.

Solution
- Can break the dependency with s-step Krylov methods
  - Mathematically equivalent to the original algorithms
- Compute a basis of span{v, Av, A^2 v, ..., A^s v}
- Orthogonalize the s + 1 vectors
- Reconstruct upper Hessenberg H (Arnoldi) resp. tridiagonal T (Lanczos) using the R factor and basis properties
- Solve for the solution update and continue

Example: GMRES

Original GMRES
  for k = 1 to s do
    w = A v_{k-1}
    Orthogonalize w against v_0, ..., v_{k-1} (Modified Gram-Schmidt); normalize to get v_k
  end for
  Compute solution using H
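One cycle of the standard algorithm might look as follows in NumPy. This is a sketch under assumptions: a zero initial guess, no restarting, no breakdown handling, and a dense least-squares solve instead of Givens rotations. The names `arnoldi_mgs` and `gmres_solution` are hypothetical.

```python
import numpy as np

def arnoldi_mgs(A, v0, s):
    """s steps of Arnoldi with Modified Gram-Schmidt.

    Returns Q (n x (s+1)) and upper Hessenberg H ((s+1) x s)
    with A @ Q[:, :s] == Q @ H up to rounding.
    """
    n = len(v0)
    Q = np.zeros((n, s + 1))
    H = np.zeros((s + 1, s))
    Q[:, 0] = v0 / np.linalg.norm(v0)
    for k in range(s):
        w = A @ Q[:, k]                      # one SpMV per step
        for j in range(k + 1):               # one reduction per inner product
            H[j, k] = Q[:, j] @ w
            w = w - H[j, k] * Q[:, j]
        H[k + 1, k] = np.linalg.norm(w)      # no breakdown check in this sketch
        Q[:, k + 1] = w / H[k + 1, k]
    return Q, H

def gmres_solution(Q, H, beta):
    """Minimize ||beta*e_1 - H y|| and return x = Q[:, :s] y (initial guess x_0 = 0)."""
    rhs = np.zeros(H.shape[0])
    rhs[0] = beta
    y, *_ = np.linalg.lstsq(H, rhs, rcond=None)
    return Q[:, :-1] @ y
```

With x_0 = 0 the starting vector is v0 = b and beta = ||b||. Note the strict alternation of SpMV and orthogonalization inside the loop: this is the data dependency the s-step variants set out to break.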

Version 2: Matrix powers kernel & TSQR
  W = [v_0, A v_0, A^2 v_0, ..., A^s v_0]
  [Q, R] = TSQR(W)
  Compute H using R
  Compute solution using H
- s powers of A for no extra latency cost
- s steps of QR for one step of latency
- But...
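The step "Compute H using R" can be made concrete for the monomial basis. Because A W[:, 0:s] = W[:, 1:s+1] and W = QR, it follows that A Q[:, 0:s] = Q H with H = R[:, 1:s+1] R[0:s, 0:s]^{-1}. The sketch below assumes exactly that derivation, with a plain loop standing in for the matrix powers kernel and NumPy's QR standing in for TSQR; the function name `sstep_arnoldi_monomial` is hypothetical.

```python
import numpy as np

def sstep_arnoldi_monomial(A, v0, s):
    """One s-step Arnoldi group with the monomial basis.

    Returns Q (n x (s+1)) and upper Hessenberg H ((s+1) x s)
    such that A @ Q[:, :s] == Q @ H up to rounding.
    """
    W = np.empty((len(v0), s + 1))
    W[:, 0] = v0 / np.linalg.norm(v0)
    for k in range(s):
        W[:, k + 1] = A @ W[:, k]        # stand-in for the matrix powers kernel
    Q, R = np.linalg.qr(W)               # stand-in for TSQR
    # Solve X R_s = R[:, 1:] for X = H, where R_s = R[:s, :s].
    H = np.linalg.solve(R[:s, :s].T, R[:, 1:].T).T
    return Q, H
```

The solve with R[:s, :s] is where an ill-conditioned basis hurts: any loss of accuracy in R is amplified in H, so this sketch is only trustworthy for small s or a well-conditioned basis.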

Basis computation not stable
- v, Av, A^2 v, ... looks familiar...
- It's the power method!
  - Converges to the principal eigenvector of A
- Basis condition number is exponential in s

Version 3: Different basis
- Just like polynomial interpolation
- Use a different basis, e.g.:
  - Newton basis: W = [v, (A - θ_1 I)v, (A - θ_2 I)(A - θ_1 I)v, ...]
  - Chebyshev basis: W = [v, T_1(A)v, T_2(A)v, ...]
- Parameters via estimated spectral info...
  - Which comes free with the Krylov method!
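The effect of the basis choice on conditioning can be checked numerically. The sketch below makes several assumptions that are not on the slide: the shifts θ_k are Chebyshev points of a known spectral interval [a, b] (a real solver would use estimated spectral information such as Ritz values), they are arranged in a greedy Leja ordering, and every basis vector is normalized. All function names are hypothetical.

```python
import numpy as np

def chebyshev_points(a, b, s):
    """s Chebyshev points on the interval [a, b]."""
    j = np.arange(s)
    return 0.5 * (a + b) + 0.5 * (b - a) * np.cos((2 * j + 1) * np.pi / (2 * s))

def leja_order(points):
    """Greedy Leja ordering: each new point maximizes the product of
    distances to the points already chosen."""
    pts = list(points)
    out = [max(pts, key=abs)]
    pts.remove(out[0])
    while pts:
        nxt = max(pts, key=lambda p: np.prod([abs(p - q) for q in out]))
        out.append(nxt)
        pts.remove(nxt)
    return np.array(out)

def krylov_basis(A, v, s, shifts=None):
    """Columns v, (A - th_1 I)v, (A - th_2 I)(A - th_1 I)v, ..., each normalized.

    shifts=None uses zero shifts, which gives the (normalized) monomial basis.
    """
    shifts = np.zeros(s) if shifts is None else shifts
    W = np.empty((len(v), s + 1))
    W[:, 0] = v / np.linalg.norm(v)
    for k in range(s):
        w = A @ W[:, k] - shifts[k] * W[:, k]
        W[:, k + 1] = w / np.linalg.norm(w)
    return W
```

A possible comparison on a small diagonal test matrix (an arbitrary choice for illustration, not the experiment reported on the slides):

```python
lam = np.linspace(0.1, 10.0, 200)
A, v, s = np.diag(lam), np.ones(200), 12
shifts = leja_order(chebyshev_points(0.1, 10.0, s))
print(np.linalg.cond(krylov_basis(A, v, s)))          # monomial
print(np.linalg.cond(krylov_basis(A, v, s, shifts)))  # Newton
```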

Basis condition number
Figure: Condition number of various bases vs. basis length s. Matrix A is 10^6 × 10^6 2-D Poisson with Dirac delta right-hand side.

Numerical experiments
- Diagonal 10^4 × 10^4 matrix, κ_2(A) = 10^8
- s = 24
  - Newton: basis condition # about 10^14
  - Monomial: basis condition # about 10^16

Better basis pays off: restarting
Plot: GMRES(24,1) residuals, cond(A) = 1e8, n = 1e4. Curves: Standard(24,1), Monomial(24,1), Newton(24,1). Vertical axis: log base 10 of 2-norm relative residual error. Horizontal axis: iteration count.
Figure: Restart after every group of s steps.

Better basis pays off: less restarting
Plot: GMRES(24,8) residuals, cond(A) = 1e8, n = 1e4. Curves: Standard(24,8), Monomial(24,8), Newton(24,8). Vertical axis: log base 10 of 2-norm relative residual error. Horizontal axis: iteration count.
Figure: Restart after 8 groups of s = 24 steps.

Lots of previous work
- s-step CG: Van Rosendale 1983, Chronopoulos et al. 1989, Toledo 1995, ...
- s-step GMRES: Walker 1988, De Sturler 1991, Bai et al. 1991, Joubert et al. 1992, Erhel 1995, ...

Our advances over previous work
- Performance: matrix powers kernel and TSQR
- Fixed numerical stability problems
  - Need not restart after each group of s
- Can do preconditioning...

Matrix powers kernel changes
- GMRES with left preconditioning:
  v, M^{-1}A v, (M^{-1}A)^2 v, ..., (M^{-1}A)^s v
- CG with split preconditioning:
  v, L^{-1}AL^{-T} v, ..., (L^{-1}AL^{-T})^s v
- CG with left preconditioning: M^{-1}A not necessarily symmetric!
  V = [v, M^{-1}A v, ..., (M^{-1}A)^s v], and
  W = [Av, AM^{-1}A v, ..., (AM^{-1})^s A v]
- We know the necessary adjustments for any basis
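For the left-preconditioned CG case, the two bases are linked by W = A V and by the recurrence V_{k+1} = M^{-1} W_k. The sketch below generates both with alternating applications of A and M^{-1}. It assumes the monomial basis and dense matrices, and it performs the Θ(s) separate operator applications that a fused matrix powers kernel is meant to avoid; the function name is hypothetical.

```python
import numpy as np

def left_preconditioned_bases(A, Minv, v, s):
    """Return V = [v, (M^-1 A)v, ..., (M^-1 A)^s v] and
    W = [Av, (A M^-1)Av, ..., (A M^-1)^s Av], which satisfy W = A V."""
    V = [v]
    W = [A @ v]
    for _ in range(s):
        V.append(Minv @ W[-1])   # V_{k+1} = M^-1 W_k = (M^-1 A) V_k
        W.append(A @ V[-1])      # W_{k+1} = A V_{k+1} = (A M^-1) W_k
    return np.column_stack(V), np.column_stack(W)
```

Keeping both bases lets the algorithm work with M^{-1}A without requiring it to be symmetric.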

Computing this is hard
- We couldn't find any author who tried it
- Preconditioner must be part of the matrix powers kernel
  - Otherwise, Θ(s) communication steps vs. Θ(1)
- This changes the abstraction
  - Matrix and preconditioner are not separate black boxes anymore!
  - But encapsulation is preserved at a different level

What preconditioners might work?
- Polynomial preconditioning?
  - Natural fit with matrix powers kernel
  - See e.g., Saad 1985
- Overlapping Schwarz domain decomposition?
  - Use same overlap regions as matrix powers kernel
- Sparse approximate inverse?
  - Place nonzeros to avoid communication

Beyond sparsity?
- Sparsity is the wrong abstraction
  - Quickly disappears with matrix powers
  - Doesn't limit communication to nearest neighbors: e.g., a matrix A (nonzero pattern drawn with x's on the slide) has O(n) nonzeros but requires O(n) messages per SpMV

Dense but low rank
- Matrix powers tend to destroy sparsity
  - Tridiagonal + block Jacobi: dense after four iterations
- But ranks of off-diagonal block rows & columns only increase linearly
  - True for general matrices

Rank = communication
- Let r_ij = rank of block (i, j)
- r_ij: # of source vector words proc j sends to proc i in SpMV
- In matrix powers kernel: proc i redundantly computes s r_ij ghost zone values from proc j
- Bound rank to limit communication
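One way to see why rank bounds communication: if the off-diagonal block A_ij has rank r, it factors as (tall factor) × (r × n_j factor), so proc j only needs to send the r numbers obtained by applying the second factor to its part of the source vector. The sketch below uses an SVD to build such a factorization. That is an illustrative choice, not something the slides prescribe, and the function name `low_rank_message` is hypothetical.

```python
import numpy as np

def low_rank_message(A_ij, x_j, rtol=1e-12):
    """Split A_ij @ x_j into a factor kept on proc i and an r-word message from proc j.

    Assumes A_ij is not identically zero.
    """
    U, sig, Vt = np.linalg.svd(A_ij, full_matrices=False)
    r = int(np.sum(sig > rtol * sig[0]))      # numerical rank of the block
    message = Vt[:r] @ x_j                    # r words on the wire, not len(x_j)
    apply_on_i = U[:, :r] * sig[:r]           # proc i stores this factor locally
    return apply_on_i, message
```

Proc i then forms its contribution as `apply_on_i @ message`. For a sparse block with r nonzero columns this reduces to the familiar case of sending r source-vector entries.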

Caveat: Communication vs. computation
- Low off-diagonal block rank only saves communication
- Need sparsity, else higher computational complexity
  - O( Ω ) term per node of interior domain Ω
  - Asymptotically more complex for 2-, 3-D problems
- We're still working on this...
Figure: Left: low-rank dense off-diagonal blocks. Right: low-rank sparse off-diagonal blocks.

Compatible preconditioners
- Black box in interior, low-rank blocks outside
  - Hierarchical matrices (Hackbusch et al.)
  - Semiseparable matrices
  - Fast multipole method
Figure: (a) log|x - y| on a straight line. Discretization of log(|x - y|) on an interval.

- Performance tuning (choosing s)
- Extension to eigensolvers
- Lanczos biorthogonalization (e.g., Bi-CG)
- Preconditioner implementations
- Combine with block Krylov methods
  - Block methods can already use TSQR
  - Does combining block and s-step pay?

- Can amortize communication in Krylov methods
- Break open the global SpMV black box
  - SpMV on interior of local domain can remain closed
- Rewrite orthogonalization as one reduction
- Can preserve numerical stability in theory

Mentors and colleagues
- Advisor: James Demmel (UC Berkeley)
- Sparse matrix operations:
  - Kathy Yelick (U. Calif. Berkeley and LBNL)
  - Marghoob Mohiyuddin (U. Calif. Berkeley)
- QR factorization:
  - Julien Langou (U. Colorado Denver)
  - Laura Grigori (INRIA, France)

Acknowledgments
Research supported by:
- Microsoft and Intel (ParLab)
- ACM/IEEE
- NSF
- US Department of Energy

Copyright
Copyright 2008 Mark Hoemmen (mhoemmen@cs.berkeley.edu). Licensed under Creative Commons Attribution-Share Alike 3.0 United States. See http://creativecommons.org/licenses/by-sa/3.0/us/ for more information.

Bibliography I
Z. BAI, D. HU, AND L. REICHEL, A Newton basis GMRES implementation, IMA Journal of Numerical Analysis, 14 (1994), pp. 563-581.
A. H. BAKER, J. M. DENNIS, AND E. R. JESSUP, On improving linear solver performance: A block variant of GMRES, SIAM J. Sci. Comp., 27 (2006), pp. 1608-1626.
S. BÖRM, L. GRASEDYCK, AND W. HACKBUSCH, Hierarchical matrices. http://www.mis.mpg.de/scicomp/fulltext/ws_HMatrices.pdf, 2004.

Bibliography II
S. CHANDRASEKARAN, M. GU, AND W. LYONS, A fast and stable adaptive solver for hierarchically semi-separable representations, May 2004.
A. T. CHRONOPOULOS AND C. W. GEAR, s-step iterative methods for symmetric linear systems, J. Comput. Appl. Math., 25 (1989), pp. 153-168.
A. T. CHRONOPOULOS AND A. B. KUCHEROV, A parallel Krylov-type method for nonsymmetric linear systems, in High Performance Computing - HiPC 2001: Eighth International Conference, Hyderabad, India, December 17-20, 2001. Proceedings, Springer, 2001, pp. 104-114.

Bibliography III
E. DE STURLER, A parallel variant of GMRES(m), in Proceedings of the 13th IMACS World Congress on Computation and Applied Mathematics, J. J. H. Miller and R. Vichnevetsky, eds., Dublin, Ireland, 1991, Criterion Press.
J. DEMMEL, M. F. HOEMMEN, M. MOHIYUDDIN, AND K. A. YELICK, Avoiding communication in computing Krylov subspaces, Tech. Rep. UCB/EECS-2007-123, EECS Department, University of California, Berkeley, Oct 2007.

Bibliography IV
J. ERHEL, A parallel GMRES version for general sparse matrices, Electronic Transactions on Numerical Analysis, 3 (1995), pp. 160-176.
W. GAUTSCHI AND G. INGLESE, Lower bounds for the condition number of Vandermonde matrices, Numer. Math., 52 (1988), pp. 241-250.
W. HACKBUSCH, Hierarchische Matrizen Algorithmen und Analysis. http://www.mis.mpg.de/scicomp/fulltext/hmvorlesung.ps, last accessed 22 May 2006, Jan. 2006.

Bibliography V
W. D. JOUBERT AND G. F. CAREY, Parallelizable restarted iterative methods for nonsymmetric linear systems, Part I: Theory, International Journal of Computer Mathematics, 44 (1992), pp. 243-267.
W. D. JOUBERT AND G. F. CAREY, Parallelizable restarted iterative methods for nonsymmetric linear systems, Part II: Parallel implementation, International Journal of Computer Mathematics, 44 (1992), pp. 269-290.

Bibliography VI
C. E. LEISERSON, S. RAO, AND S. TOLEDO, Efficient out-of-core algorithms for linear relaxation using blocking covers, Journal of Computer and System Sciences, 54 (1997), pp. 332-344.
G. MEURANT, The block preconditioned conjugate gradient method on vector computers, BIT, 24 (1984), pp. 623-633.
D. P. O'LEARY, The block conjugate gradient algorithm and related methods, Linear Algebra Appl., 29 (1980), pp. 293-322.

Bibliography VII
Y. SAAD, Practical use of polynomial preconditionings for the conjugate gradient method, SIAM J. Sci. Stat. Comput., 6 (1985), pp. 865-881.
S. A. TOLEDO, Quantitative performance modeling of scientific computations and creating locality in numerical algorithms, PhD thesis, Massachusetts Institute of Technology, 1995.
J. VAN ROSENDALE, Minimizing inner product data dependence in conjugate gradient iteration, in Proc. IEEE Internat. Confer. Parallel Processing, 1983.

Review: Block Krylov methods
- Build up basis of span{B, AB, A^2 B, ...} for block of vectors B
- Original application: hard eigenproblems
  - Accelerates convergence for multiple / clustered eigenvalues
- Can also solve AX = B
  - Best for multiple right-hand sides
  - Can use if only one right-hand side
- SpMV for multiple vectors
  - No extra latency cost
  - Bandwidth cost scales linearly w/ # vectors

Problems with block methods for Ax = b (1 of 2)
- If only one right-hand side:
  - Do one restart cycle with one RHS
  - After each restart cycle, add one error vector to RHS block
- Higher startup cost:
  - Need s cycles of s until at full block size
  - Whereas s-step is always at full optimization

Problems with block methods for Ax = b (2 of 2)
- More complicated convergence & breakdown conditions
  - Convergence benefit isn't as clear as one-vector case
- Must do deflation
  - Remove linearly dependent vectors from block
  - Need rank-revealing factorization

Lanczos(s,t) w/ reorthogonalization
- Get orthogonality estimates from Lanczos recurrence (Paige)
- Each group of s basis vectors is a TSQR Q factor
- Best reorthogonalization:
  - Do TSQR of last group to compute Lanczos coefficients
  - Use Lanczos coeffs in Paige's recurrence
  - If last group not orthogonal w.r.t. previous groups:
    - Compute it explicitly
    - Orthogonalize against previous t - 1 groups
    - Finally take TSQR again of last group
- Converting all groups of s to explicit storage and redoing TSQR on them all is too expensive & unnecessary

What preconditioner structure limits communication?
- Experiment:
  - Partition matrix A & preconditioner M into block rows
  - Compute MA, A(MA), (MA)^2, A(MA)^2, ...
  - Look at:
    - Fill-in (sparsity)
    - Rank of off-diagonal block rows & columns
- Matrices:
  - Matrix A is tridiagonal
  - Preconditioner M is block diagonal
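A small version of this experiment can be sketched in NumPy. The specifics are assumptions chosen for illustration: A = tridiag(-1, 2, -1), M block diagonal with random dense blocks, four block rows of size 10, only the powers (MA)^k, and only block-row ranks (columns would be handled the same way on the transpose). All function names are hypothetical.

```python
import numpy as np

def offdiag_block_row_ranks(P, b, rtol=1e-9):
    """Numerical rank of each block row of P with its diagonal block removed."""
    n = P.shape[0]
    ranks = []
    for lo in range(0, n, b):
        cols = np.r_[0:lo, lo + b:n]                  # every column outside the diagonal block
        sig = np.linalg.svd(P[lo:lo + b][:, cols], compute_uv=False)
        ranks.append(int(np.sum(sig > rtol * sig[0])))
    return ranks

def density(P, rtol=1e-12):
    """Fraction of entries that are numerically nonzero."""
    return np.count_nonzero(np.abs(P) > rtol * np.abs(P).max()) / P.size

def experiment(b=10, nblocks=4, kmax=4, seed=0):
    """Return [(k, off-diagonal block row ranks of (MA)^k, density of (MA)^k)]."""
    rng = np.random.default_rng(seed)
    n = b * nblocks
    A = 2.0 * np.eye(n) - np.eye(n, k=1) - np.eye(n, k=-1)   # tridiagonal A
    M = np.zeros((n, n))                                      # block diagonal M
    for lo in range(0, n, b):
        M[lo:lo + b, lo:lo + b] = rng.standard_normal((b, b)) / np.sqrt(b)
    P = M @ A
    Pk, rows = np.eye(n), []
    for k in range(1, kmax + 1):
        Pk = Pk @ P
        rows.append((k, offdiag_block_row_ranks(Pk, b), density(Pk)))
    return rows

for k, ranks, dens in experiment():
    print(k, ranks, round(dens, 2))
```

In exact arithmetic the off-diagonal block row ranks of (MA)^k are at most k times those of MA, because each extra factor of MA can add at most the rank of its own off-diagonal block row; the density, by contrast, grows quickly.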

Figure: Left is spy plot of A(MA); block row & column ranks are 2, 4, 4, 2. Right is spy plot of (MA)^2; block row & column ranks are 2, 4, 4, 2.

Figure: Left is spy plot of A(MA)^2; block row & column ranks are 3, 6, 6, 3. Right is spy plot of (MA)^3; block row & column ranks are 3, 6, 6, 3.

Figure: Left is spy plot of A(MA)^3; block row & column ranks are 4, 8, 8, 4. Right is spy plot of (MA)^4; block row & column ranks are 4, 8, 8, 4.

Dense but low rank
- In matrix powers, sparsity is quickly lost
- But ranks of off-diagonal block rows & columns only increase linearly
  - True for general matrices

Rank = communication
- Let r_ij = rank of block (i, j)
- r_ij: # of source vector words proc j sends to proc i in SpMV
- In matrix powers kernel: proc i redundantly computes s r_ij ghost zone values from proc j
- Bound rank to limit communication in matrix powers kernel