Communication-avoiding Krylov subspace methods

Mark Hoemmen (mhoemmen@cs.berkeley.edu)
University of California, Berkeley, EECS
MS Numerical Libraries Group visit: 28 April 2008

Overview
- Current Krylov methods are communication-limited.
- We can rearrange them to avoid communication.
- We can do this in a numerically stable way.
- It requires stepping outside the black box.

Review: Krylov subspace methods
Given these kernels:
- sparse matrix-vector product (SpMV) with the operator A,
- (possibly) a preconditioner operator M^{-1},
- dot products and vector operations,
solve Ax = b or Ax = λx iteratively, by constructing a basis of span{r, Ar, A^2 r, ...} and projecting onto it.
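
As a concrete illustration of these kernels, here is a minimal unpreconditioned conjugate gradient sketch in Python/NumPy: the only operation that touches A is SpMV, and the only global reductions are dot products and norms. The function name and interface are illustrative assumptions, not code from the talk.

    import numpy as np

    def cg(A, b, tol=1e-8, maxiter=1000):
        # Plain conjugate gradients for SPD A: a Krylov method built only
        # from SpMV, dot products, and vector updates (no preconditioner).
        x = np.zeros_like(b, dtype=float)
        r = np.array(b, dtype=float)      # residual r = b - A x (with x = 0)
        p = r.copy()                      # search direction
        rr = r @ r
        bnorm = np.linalg.norm(b)
        for _ in range(maxiter):
            Ap = A @ p                    # SpMV kernel
            alpha = rr / (p @ Ap)         # dot-product kernel (a global reduction)
            x += alpha * p
            r -= alpha * Ap
            rr_new = r @ r
            if np.sqrt(rr_new) <= tol * bnorm:
                break
            p = r + (rr_new / rr) * p
            rr = rr_new
        return x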

Encapsulation in standard Krylov methods
- Krylov methods are great for software engineers: all kernels are called as black boxes.
- Good, since each kernel is challenging to optimize:
  - SpMV needs tuning and domain-specific hints,
  - preconditioners are a domain-specific black art,
  - efficient reductions are nontrivial on complex architectures.

Problem: kernels are communication bound
- Sparse matrix-vector multiplication:
  - parallel: latency (communicate with neighbors),
  - sequential: bandwidth (read the matrix).
- Preconditioner application: much like SpMV.
- Orthogonalization: dot products and norms, Θ(1) reductions per vector.

Amortize the communication
Our goal: amortize the cost of each kernel over s steps:
- compute many SpMVs for the communication cost of one SpMV,
- orthogonalize many vectors for the cost of one reduction,
which we accomplish by breaking encapsulation (oops!).

Matrix powers kernel
Replace s SpMVs with the matrix powers kernel (see Marghoob's talk; details in Demmel et al. 2007):
- compute a basis of span{v, Av, A^2 v, ..., A^s v} ...
- ... with the same communication cost as one SpMV,
- via overlapping ghost zones plus redundant computation.
Figure (panels): local dependencies for k = 8; type (1) remote dependencies for k = 8; type (2) remote dependencies for k = 8.
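
A rough sketch of the ghost-zone idea for a 1-D 3-point stencil (so A is tridiagonal): each processor fetches s ghost values from each neighbor once, then computes all s stencil applications locally, redundantly recomputing values near its boundary. The stencil coefficients, partitioning, and helper names below are illustrative assumptions, not the actual kernel of Demmel et al. 2007.

    import numpy as np

    def stencil_step(x):
        # One application of an example 3-point stencil (a tridiagonal A)
        # to a local chunk; the result is one entry shorter on each side.
        return x[:-2] - 2.0 * x[1:-1] + x[2:]

    def local_matrix_powers(x, lo, hi, s):
        # Ghost-zone matrix powers for the processor owning x[lo:hi]:
        # fetch s ghost values from each neighbor ONCE, then do s stencil
        # applications with redundant local work.  For brevity, assume
        # lo - s >= 0 and hi + s <= len(x) (no physical boundary handling).
        chunk = x[lo - s: hi + s].copy()      # the single communication step
        results = []
        for _ in range(s):
            chunk = stencil_step(chunk)       # shrinks by one layer per side
            results.append(chunk.copy())
        # results[k-1] holds (A^k x) on indices [lo-(s-k), hi+(s-k));
        # keep only the owned middle slice of each.
        return [r[s - k: len(r) - (s - k)]
                for k, r in zip(range(1, s + 1), results)]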

Tall Skinny QR
Replace orthogonalization with Tall Skinny QR (TSQR) (Demmel et al. 2008, coming soon):
- block row reduction in one communication step,
- QR factorization is the reduction operator,
- unconditionally stable.
Figure: TSQR on a binary tree of 4 processors.
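
A minimal NumPy sketch of the TSQR reduction, assuming each local block has at least as many rows as columns. It returns only the final R factor (equal, up to row signs, to the R of the stacked matrix); a real implementation would also keep the intermediate Q factors and run the reduction across processors rather than over a Python list.

    import numpy as np

    def tsqr_r(blocks):
        # Local QR on each block (no communication), then reduce the small
        # R factors pairwise up a tree, re-factoring at each level.
        rs = [np.linalg.qr(b, mode='r') for b in blocks]
        while len(rs) > 1:
            pairs = [np.vstack(rs[i:i + 2]) for i in range(0, len(rs), 2)]
            rs = [np.linalg.qr(p, mode='r') for p in pairs]
        return rs[0]

    # Example: a 1000 x 8 matrix split into four row blocks ("processors").
    W = np.random.default_rng(0).standard_normal((1000, 8))
    R = tsqr_r(np.vsplit(W, 4))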

What stops us?
What stops us from using both kernels in a Krylov method?

Data dependencies
Krylov methods advance one vector at a time: SpMV, then orthogonalize, then SpMV, ...
Figure: Data dependencies in Krylov subspace methods.

Solution: s-step Krylov methods
We can break the dependency with s-step Krylov methods, which are mathematically equivalent to the original algorithms:
- compute a basis of span{v, Av, A^2 v, ..., A^s v},
- orthogonalize the s + 1 vectors,
- reconstruct the upper Hessenberg H (Arnoldi) resp. tridiagonal T (Lanczos) using the R factor and basis properties,
- solve for the solution update and continue.

Example: GMRES

Original GMRES
1: for k = 1 to s do
2:   w = A v_{k-1}
3:   Orthogonalize w against v_0, ..., v_{k-1} (modified Gram-Schmidt)
4: end for
5: Compute solution using H
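
For reference, a runnable NumPy sketch of this inner loop (Arnoldi with modified Gram-Schmidt, no breakdown handling): each step couples one SpMV with a sequence of dot products, which is exactly the dependency that prevents batching.

    import numpy as np

    def arnoldi_mgs(A, v0, s):
        # Standard Arnoldi: SpMV, then orthogonalize, then SpMV, ...
        n = len(v0)
        V = np.zeros((n, s + 1))
        H = np.zeros((s + 1, s))
        V[:, 0] = v0 / np.linalg.norm(v0)
        for k in range(s):
            w = A @ V[:, k]                  # SpMV
            for j in range(k + 1):           # modified Gram-Schmidt
                H[j, k] = V[:, j] @ w        # one reduction per dot product
                w -= H[j, k] * V[:, j]
            H[k + 1, k] = np.linalg.norm(w)
            V[:, k + 1] = w / H[k + 1, k]
        # GMRES then minimizes ||beta*e1 - H y|| and sets x = x0 + V[:, :s] @ y.
        return V, H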

Version 2: Matrix powers kernel & TSQR
1: W = [v_0, A v_0, A^2 v_0, ..., A^s v_0]
2: [Q, R] = TSQR(W)
3: Compute H using R
4: Compute solution using H
- s powers of A for no extra latency cost
- s steps of QR for one step of latency
But...
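
A sketch of this version with the monomial basis, using a plain loop in place of the matrix powers kernel and NumPy's QR in place of TSQR. H is recovered from the R factor via the basis property A W[:, :s] = W[:, 1:]; in practice the basis vectors would also be scaled to keep W well conditioned. The function name and interface are illustrative assumptions.

    import numpy as np

    def ca_arnoldi_monomial(A, v, s):
        # Returns Q (n x (s+1), orthonormal) and H ((s+1) x s, upper
        # Hessenberg) with A @ Q[:, :s] = Q @ H, assuming W has full rank.
        n = A.shape[0]
        W = np.empty((n, s + 1))
        W[:, 0] = v / np.linalg.norm(v)
        for j in range(s):                   # matrix powers kernel stand-in
            W[:, j + 1] = A @ W[:, j]
        Q, R = np.linalg.qr(W)               # TSQR stand-in
        # A W[:, :s] = W[:, 1:]  and  W[:, :s] = Q[:, :s] R[:s, :s], hence:
        H = R[:, 1:] @ np.linalg.inv(R[:s, :s])
        return Q, H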

Basis computation not stable
v, Av, A^2 v, ... looks familiar: it's the power method!
- Converges to the principal eigenvector of A.
- Basis condition number is exponential in s.


Version 3: Different basis
Just like in polynomial interpolation, use a different basis, e.g.:
- Newton basis: W = [v, (A - θ_1 I)v, (A - θ_2 I)(A - θ_1 I)v, ...]
- Chebyshev basis: W = [v, T_1(A)v, T_2(A)v, ...]
Parameters come from estimated spectral information, which comes free with the Krylov method!
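
A sketch of the Newton-basis computation; the shifts θ_j would come from Ritz value estimates (typically Leja-ordered), which this sketch simply takes as an input list.

    import numpy as np

    def newton_basis(A, v, shifts):
        # W[:, j+1] = (A - shifts[j] I) W[:, j]; with shifts = [0, 0, ...]
        # this reduces to the monomial basis.
        W = np.empty((A.shape[0], len(shifts) + 1))
        W[:, 0] = v / np.linalg.norm(v)
        for j, theta in enumerate(shifts):
            W[:, j + 1] = A @ W[:, j] - theta * W[:, j]
        return W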

Basis condition number
Figure: Condition number of various bases vs. basis length s. Matrix A is a 10^6 x 10^6 2-D Poisson matrix with a Dirac delta right-hand side.

Numerical experiments
- Diagonal 10^4 x 10^4 matrix, κ_2(A) = 10^8, s = 24.
- Newton basis: condition number about 10^14.
- Monomial basis: condition number about 10^16.

Better basis pays off: restarting
Figure: GMRES(24,1) residuals for cond(A) = 1e8, n = 1e4; curves for Standard(24,1), Monomial(24,1), and Newton(24,1). Axes: log10 of the 2-norm relative residual error vs. iteration count. Restart after every group of s steps.

Better basis pays off: less restarting
Figure: GMRES(24,8) residuals for cond(A) = 1e8, n = 1e4; curves for Standard(24,8), Monomial(24,8), and Newton(24,8). Axes: log10 of the 2-norm relative residual error vs. iteration count. Restart after 8 groups of s = 24 steps.

Lots of previous work
- s-step CG: Van Rosendale 1983, Chronopoulos et al. 1989, Toledo 1995, ...
- s-step GMRES: Walker 1988, De Sturler 1991, Bai et al. 1991, Joubert et al. 1992, Erhel 1995, ...

Our advances over previous work
- Performance: matrix powers kernel and TSQR.
- Fixed numerical stability problems.
- Need not restart after each group of s steps.
- Can do preconditioning...

Matrix powers kernel changes
- GMRES with left preconditioning: v, M^{-1}Av, (M^{-1}A)^2 v, ..., (M^{-1}A)^s v
- CG with split preconditioning: v, L^{-1}AL^{-T} v, ..., (L^{-1}AL^{-T})^s v
- CG with left preconditioning: M^{-1}A is not necessarily symmetric!
  V = [v, M^{-1}Av, ..., (M^{-1}A)^s v] and W = [Av, AM^{-1}Av, ..., (AM^{-1})^s Av]
- We know the necessary adjustments for any basis.
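
For example, the left-preconditioned GMRES basis interleaves applications of A and M^{-1}; making this communication-avoiding is the hard part discussed below. A naive sketch, with Minv assumed to be any operator supporting Minv @ x:

    import numpy as np

    def left_prec_monomial_basis(A, Minv, v, s):
        # [v, M^{-1}Av, (M^{-1}A)^2 v, ..., (M^{-1}A)^s v]: one SpMV and one
        # preconditioner application per step; a CA version must fold the
        # preconditioner into the matrix powers kernel itself.
        V = np.empty((len(v), s + 1))
        V[:, 0] = v
        for j in range(s):
            V[:, j + 1] = Minv @ (A @ V[:, j])
        return V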


Computing this is hard
- We couldn't find an author who had tried it.
- The preconditioner must be part of the matrix powers kernel; otherwise, Θ(s) communication steps instead of Θ(1).
- This changes the abstraction: the matrix and the preconditioner are no longer separate black boxes, but encapsulation is preserved at a different level.


What preconditioners might work?
- Polynomial preconditioning? A natural fit with the matrix powers kernel (see e.g. Saad 1985).
- Overlapping Schwarz domain decomposition? Use the same overlap regions as the matrix powers kernel.
- Sparse approximate inverse? Place nonzeros to avoid communication.

Beyond sparsity?
Sparsity is the wrong abstraction:
- it quickly disappears with matrix powers,
- it doesn't limit communication to nearest neighbors: a matrix with O(n) nonzeros can still require O(n) messages per SpMV (small example matrix omitted).

Dense but low rank
- Matrix powers tend to destroy sparsity: tridiagonal + block Jacobi is dense after four iterations.
- But the ranks of the off-diagonal block rows and columns only increase linearly.
- True for general matrices.

Rank = communication
- Let r_ij = rank of block (i, j).
- r_ij is the number of source-vector words processor j sends to processor i in SpMV.
- In the matrix powers kernel, processor i redundantly computes s * r_ij ghost zone values from processor j.
- Bound the rank to limit communication.

Caveat: communication vs. computation
- Low off-diagonal block rank only saves communication.
- We still need sparsity, else higher computational complexity: an O(|Ω|) term per node of the interior domain Ω, asymptotically more complex for 2-D and 3-D problems.
- We're still working on this...
Figure: Left: low-rank dense off-diagonal blocks. Right: low-rank sparse off-diagonal blocks.

Compatible preconditioners
Black box in the interior, low-rank blocks outside:
- hierarchical matrices (Hackbusch et al.),
- semiseparable matrices,
- fast multipole method.
Figure: Discretization of log|x - y| on an interval (panel (a): log|x - y| on a straight line).

Future directions:
- Performance tuning (choosing s)
- Extension to eigensolvers
- Lanczos biorthogonalization (e.g., Bi-CG)
- Preconditioner implementations
- Combining with block Krylov methods: block methods can already use TSQR; does combining block and s-step pay?

Summary:
- We can amortize communication in Krylov methods.
- Break open the global SpMV black box; SpMV on the interior of the local domain can remain closed.
- Rewrite orthogonalization as one reduction.
- Numerical stability can be preserved, in theory.

Mentors and colleagues
- Advisor: James Demmel (UC Berkeley)
- Sparse matrix operations: Kathy Yelick (UC Berkeley and LBNL), Marghoob Mohiyuddin (UC Berkeley)
- QR factorization: Julien Langou (U. Colorado Denver), Laura Grigori (INRIA, France)

Acknowledgments
Research supported by: Microsoft and Intel (ParLab), ACM/IEEE, NSF, and the US Department of Energy.

Copyright
Copyright 2008 Mark Hoemmen (mhoemmen@cs.berkeley.edu). Licensed under the Creative Commons Attribution-Share Alike 3.0 United States license. See http://creativecommons.org/licenses/by-sa/3.0/us/ for more information.

Appendix: Bibliography

Z. Bai, D. Hu, and L. Reichel, A Newton basis GMRES implementation, IMA Journal of Numerical Analysis, 14 (1994), pp. 563-581.
A. H. Baker, J. M. Dennis, and E. R. Jessup, On improving linear solver performance: A block variant of GMRES, SIAM J. Sci. Comput., 27 (2006), pp. 1608-1626.
S. Börm, L. Grasedyck, and W. Hackbusch, Hierarchical matrices, http://www.mis.mpg.de/scicomp/fulltext/ws_HMatrices.pdf, 2004.
S. Chandrasekaran, M. Gu, and W. Lyons, A fast and stable adaptive solver for hierarchically semi-separable representations, May 2004.
A. T. Chronopoulos and C. W. Gear, s-step iterative methods for symmetric linear systems, J. Comput. Appl. Math., 25 (1989), pp. 153-168.
A. T. Chronopoulos and A. B. Kucherov, A parallel Krylov-type method for nonsymmetric linear systems, in High Performance Computing - HiPC 2001: Eighth International Conference, Hyderabad, India, December 17-20, 2001, Proceedings, Springer, 2001, pp. 104-114.
E. de Sturler, A parallel variant of GMRES(m), in Proceedings of the 13th IMACS World Congress on Computation and Applied Mathematics, J. J. H. Miller and R. Vichnevetsky, eds., Dublin, Ireland, 1991, Criterion Press.
J. Demmel, M. F. Hoemmen, M. Mohiyuddin, and K. A. Yelick, Avoiding communication in computing Krylov subspaces, Tech. Rep. UCB/EECS-2007-123, EECS Department, University of California, Berkeley, Oct. 2007.
J. Erhel, A parallel GMRES version for general sparse matrices, Electronic Transactions on Numerical Analysis, 3 (1995), pp. 160-176.
W. Gautschi and G. Inglese, Lower bounds for the condition number of Vandermonde matrices, Numer. Math., 52 (1988), pp. 241-250.
W. Hackbusch, Hierarchische Matrizen: Algorithmen und Analysis, http://www.mis.mpg.de/scicomp/fulltext/hmvorlesung.ps, last accessed 22 May 2006, Jan. 2006.
W. D. Joubert and G. F. Carey, Parallelizable restarted iterative methods for nonsymmetric linear systems, Part I: Theory, International Journal of Computer Mathematics, 44 (1992), pp. 243-267.
W. D. Joubert and G. F. Carey, Parallelizable restarted iterative methods for nonsymmetric linear systems, Part II: Parallel implementation, International Journal of Computer Mathematics, 44 (1992), pp. 269-290.
C. E. Leiserson, S. Rao, and S. Toledo, Efficient out-of-core algorithms for linear relaxation using blocking covers, Journal of Computer and System Sciences, 54 (1997), pp. 332-344.
G. Meurant, The block preconditioned conjugate gradient method on vector computers, BIT, 24 (1984), pp. 623-633.
D. P. O'Leary, The block conjugate gradient algorithm and related methods, Linear Algebra Appl., 29 (1980), pp. 293-322.
Y. Saad, Practical use of polynomial preconditionings for the conjugate gradient method, SIAM J. Sci. Stat. Comput., 6 (1985), pp. 865-881.
S. A. Toledo, Quantitative performance modeling of scientific computations and creating locality in numerical algorithms, PhD thesis, Massachusetts Institute of Technology, 1995.
J. Van Rosendale, Minimizing inner product data dependence in conjugate gradient iteration, in Proc. IEEE Internat. Confer. Parallel Processing, 1983.

Appendix: Block Krylov methods

Review: Block Krylov methods
- Build up a basis of span{B, AB, A^2 B, ...} for a block of vectors B.
- Original application: hard eigenproblems; accelerates convergence for multiple / clustered eigenvalues.
- Can also solve AX = B: best for multiple right-hand sides, but usable with only one right-hand side.
- SpMV for multiple vectors: no extra latency cost; bandwidth cost scales linearly with the number of vectors.

Problems with block methods for Ax = b (1 of 2)
If there is only one right-hand side:
- do one restart cycle with one RHS,
- after each restart cycle, add one error vector to the RHS block.
This has a higher startup cost than the s-step approach: it needs s cycles of s steps until it reaches full block size, whereas s-step is always at full optimization.

Problems with block methods for Ax = b (2 of 2)
- More complicated convergence and breakdown conditions.
- The convergence benefit isn't as clear as in the one-vector case.
- Must do deflation: remove linearly dependent vectors from the block, which needs a rank-revealing factorization.

Appendix: More slides

Restarting for stability

Extra precision for stability (1 of 3)

Extra precision for stability (2 of 3)

Extra precision for stability (3 of 3)

Lanczos(s,t) with reorthogonalization
- Get orthogonality estimates from the Lanczos recurrence (Paige).
- Each group of s basis vectors is a TSQR Q factor.
- Best reorthogonalization:
  - do TSQR of the last group to compute the Lanczos coefficients,
  - use the Lanczos coefficients in Paige's recurrence,
  - if the last group is not orthogonal w.r.t. previous groups: compute it explicitly, orthogonalize against the previous t - 1 groups, and finally take TSQR of the last group again.
- Converting all groups of s to explicit storage and redoing TSQR on them all is too expensive and unnecessary.

What preconditioner structure limits communication?
Experiment:
- partition the matrix A and the preconditioner M into block rows,
- compute MA, A(MA), (MA)^2, A(MA)^2, ...,
- look at fill-in (sparsity) and the ranks of off-diagonal block rows and columns.
Matrices: A is tridiagonal; the preconditioner M is block diagonal.
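
A small NumPy sketch of this experiment, with arbitrarily chosen sizes (n = 32, four blocks of size 8) and block Jacobi as the block-diagonal M; it prints the rank of each off-diagonal block row of (MA)^k for a few k.

    import numpy as np

    def offdiag_block_row_ranks(B, block):
        # Rank of each block row of B with its diagonal block removed.
        n = B.shape[0]
        ranks = []
        for i in range(0, n, block):
            rows = B[i:i + block, :]
            off = np.hstack([rows[:, :i], rows[:, i + block:]])
            ranks.append(np.linalg.matrix_rank(off))
        return ranks

    n, block = 32, 8
    # Tridiagonal A (1-D Laplacian) and block Jacobi preconditioner M.
    A = (np.diag(2.0 * np.ones(n)) + np.diag(-np.ones(n - 1), 1)
         + np.diag(-np.ones(n - 1), -1))
    M = np.zeros((n, n))
    for i in range(0, n, block):
        M[i:i + block, i:i + block] = np.linalg.inv(A[i:i + block, i:i + block])

    P = M @ A
    for k in range(1, 5):
        print(k, offdiag_block_row_ranks(np.linalg.matrix_power(P, k), block))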

Figure: Left: spy plot of A; right: spy plot of M.

Figure: Left: spy plot of A; right: spy plot of MA. The block row and column ranks of MA are 1, 2, 2, 1.

Figure: Left: spy plot of A(MA); block row and column ranks are 2, 4, 4, 2. Right: spy plot of (MA)^2; block row and column ranks are 2, 4, 4, 2.

Figure: Left: spy plot of A(MA)^2; block row and column ranks are 3, 6, 6, 3. Right: spy plot of (MA)^3; block row and column ranks are 3, 6, 6, 3.

Figure: Left: spy plot of A(MA)^3; block row and column ranks are 4, 8, 8, 4. Right: spy plot of (MA)^4; block row and column ranks are 4, 8, 8, 4.

Dense but low rank
- In matrix powers, sparsity is quickly lost.
- But the ranks of the off-diagonal block rows and columns only increase linearly.
- True for general matrices.

Rank = communication
- Let r_ij = rank of block (i, j).
- r_ij is the number of source-vector words processor j sends to processor i in SpMV.
- In the matrix powers kernel, processor i redundantly computes s * r_ij ghost zone values from processor j.
- Bound the rank to limit communication in the matrix powers kernel.

Components
Figure: Components of communication-avoiding Krylov methods.