State-of-the-art numerical solution of large Hermitian eigenvalue problems. Andreas Stathopoulos

State-of-the-art numerical solution of large Hermitian eigenvalue problems Andreas Stathopoulos Computer Science Department and Computational Sciences Cluster College of William and Mary Acknowledgment: National Science Foundation

The problem FindnumEvals eigenvalues λ i and corresponding eigenvectors x i A x i = λ i x i, i = 1 :numevals A is large, sparse, symmetric N = O(10 6 10 8 ) Applications: materials, structural, data mining, SVD, QCD,... QCD Accelerate linear systems with multiple right hand sides Low rank approximation of matrices Only possible through iterative methods [ 2 ]

Power method: the fundamental iterative method Given initial guess v 0, the iteration for i = 1,2,... t = Av i 1 v i = t/ t converges to the largest modulus eigenpair ( λ N, x N ), i.e., A i v 0 A i v 0 x N, with rate λ N 1 λ N + Requires only matrix-vector multiplications Only for largest eigenpair Slow! [ 3 ]

Krylov methods: the prevailing technique Krylov space consists of the span of all power iterates: K m,v = span { v,av,a 2 v,...,a m 1 v } = {p(a)v : p polynomial of degree < m} Compute V an orthonormal basis for K m,v (for numerical stability) Compute approximations through Rayleigh-Ritz: x i = V y i, where V T AV y i = λ i y i Arnoldi: the above process for non-symmetric matrices Lanczos: a special case of Arnoldi for symmetric matrices [ 4 ]

Krylov methods: the prevailing technique K m,v = span { v,av,a 2 v,...,a m 1 v } = {p(a)v : p polynomial of degree < m} + Approximating extreme eigenpairs + Converges trivially in N steps + Optimal approximations over all polynomials Convergence rate depends on relative separation of eigenvalues Slow for clustered eigenvalues and large sizes O(Nm 2 ) orthogonalization cost, O(mN) storage [ 5 ]

Krylov ideal for linear systems Ax = b Conjugate Gradient (CG) uses a 3-term recurrence to build K m,v and update the approximate solution. O(Nm) cost and O(3N) storage minimizes error A at every step Preconditioning with M 1 A 1 also easy: M 1 Ax = M 1 b (PCG) [ 6 ]

Lanczos problems Lanczos 3-term recurrence still requires O(Nm) storage Unlike CG, orthogonality is important in Lanczos Restarting to limit the basis size destroys optimality Preconditioning is not obvious (M 1 Ax = λm 1 x not an eigenproblem) Goal: Use PCG to derive nearly optimal eigenmethods with smaller bases [ 8 ]

Generalized Davidson: Eigenvalue Preconditioning Let r = Ax λx the residual of an approximate eigenpair (λ,x) Arnoldi/Lanczos: expand basis V by r. Generalized Davidson: expands by the preconditioned r: Generalized Davidson repeat V = [V, M 1 r] V AV y = λy, x = V y x = x/ x r = Ax λx until r < ε append preconditioned residual Rayleigh Ritz normalize new residual No 3-term recurrence, more expensive step, but much faster convergence [ 9 ]

Inverse iteration (inverse power method) Given initial guess v 0, the iteration for i = 1,2,... t = (A σi) 1 v i 1 v i = t/ t converges to the eigenpair closest to σ + The closer σ is to λ k the faster the outer convergence rate λ k σ λ k 1 σ A direct factorization of A may be prohibitive An iterative method for the linear system may take long to converge [ 10 ]

Rayleigh Quotient Iteration Given initial guess v 0 : for i = 1,2,... t = (A σi) 1 v i 1 v i = t/ t σ = v T i 1 Av i 1 All the characteristics of Inverse Iteration but also: + converges to the eigenpair cubically!! if v 0 not close to the required eigenvector it may misconverge Eigenproblem: constrained minimization of Rayleigh quotient λ = xt Ax x T x RQI equivalent to Newton on the unit-sphere manifold [ 12 ]

Inexact RQI, Newton, and Jacobi-Davidson (A σi)t = v i 1 must be solved accurately enough for RQI to converge However, inexact (truncated) Newton does not require high accuracy Newton: x i+1 = x i Hess(x i ) 1 (x i ) computes correction RQI: x i+1 = (A σi) 1 x i updates approximation inexact RQI not exactly inexact Newton! [ 13 ]

Inexact RQI, Newton, and Jacobi-Davidson (A σi)t = v i 1 must be solved quite accurately for RQI to converge However, inexact (truncated) Newton does not require high accuracy Newton: x i+1 = x i Hess(x i ) 1 (x i ) computes correction RQI: x i+1 = (A σi) 1 x i updates approximation inexact RQI not exactly inexact Newton! Note (x) = r = (Ax λx) the residual of (λ,x). Thus the correction δ to x: Jacobi-Davidson: (I xx T )(A ηi)(i xx T )δ = r computes correction JD inexact (truncated) Newton [ 14 ]

Equivalence in the general case Let M A σi a preconditioner Both GD/JD solve approximately the correction equation: Generalized Davidson as δ = M 1 r Jacobi Davidson as δ = M 1 x r Both GD/JD not single vector iterations, they build a space! GD, JD subspace-accelerated inexact Newton [ 15 ]

A different view: Quasi-Newton approaches Mild non-linearity: Nonlinear CG is competitive [Bradbury & Fletcher, 66, Others] Better: locally optimal LOBPCG [D yakonov 83, Knyazev, 91, 01] x i+1 = Rayleigh Ritz ( ) x i 1, x i, M 1 r i Subspace acceleration and recurrence restarting in GD [Murray et al., 92] GD(k,m)+1: Restart with [x i 1, xi 1,...,xi k ] [AS 98, 99] Direct analogy to limited memory quasi Newton methods: GD+1 accelerates LOBCPG Broyden accelerates Nonlinear CG [ 17 ]

So what is optimal? Work unit: Matrix-Vector RQI Newton inner iterations Total MVs * Lanczos QMRopt Optimal: Unrestarted Lanczos or QMRopt, QMR solving (A λi)x = 0 [ 18 ]

So what is optimal? Work unit: Matrix-Vector Inexact RQI Jacobi Davidson Truncated Newton RQI Newton inner iterations Total MVs * Lanczos QMRopt [ 19 ]

So what is optimal? Work unit: Matrix-Vector IRA (m=2) Inexact RQI Jacobi Davidson RQI Steepest Descent Truncated Newton Newton inner iterations Total MVs * Lanczos QMRopt [ 20 ]

So what is optimal? Work unit: Matrix-Vector IRA (m=2) Inexact RQI Jacobi Davidson RQI Steepest Descent Truncated Newton Newton inner iterations Total MVs * * Lanczos QMRopt Nonlinear CG DACG LOBPCG memory [ 21 ]

So what is optimal? Work unit: Matrix-Vector IRA (m=2) Inexact RQI Jacobi Davidson RQI Steepest Descent Truncated Newton Newton inner iterations Total MVs * All these methods can be implemented as JD with different parameters * Lanczos QMRopt Nonlinear CG DACG LOBPCG Limited memory quasi Newton Generalized Davidson+1(m) memory [ 23 ]

Stopping the inner JD iteration (Notay 02) (I xx T )(A ηi)(i xx T )δ = r r eigen 2 = r Linear 2 / f + g k 2 0 Inverse iteration residual Eigenvalue residual vs linear system residual 10 2 Inner iteration run to convergence 10 4 eigenvalue residual linear system residual 10 6 10 8 10 10 0 50 100 150 200 Matrix vector products [ 24 ]

Our JDQMR extension to JDCG Based on symmetric QMR [Freund & Nachtigal 94] with right preconditioning JDQMR new features 1. Can use indefinite preconditioners 2. Works for interior eigenpairs 3. Residual convergence smooth 4. Better stopping criteria JDQMR improves robustness and efficiency [ 25 ]

JDQMR reduces wasted iterations 0 2 4 6 BCSSTK09 Jacobi Davidson with jmin=8, jmax=15, residual tolerance 1e 10. sqmr on (A λ 1 I)x = 0 JDqmr with dynamic stopping criterion 10 2 BCSSTK09 JDqmr with dynamic stopping criterion 10 4 10 6 Eigenvalue and linear system residuals very close 78 80 8 10 8 10 10 10 12 0 20 40 60 80 100 0 50 100 150 200 250 Matrix vector products Matrix vector products [ 26 ]

One eigenvalue with preconditioning NASASRB: Note the plateaus NASASRB: Matvecs of preconditioned methods 10 4 10 2 Residual norm 10 0 10 2 QMRopt GDfull Lobpcg 10 4 GD+1 JDQMR JDCG 10 6 0 50 100 150 200 250 300 350 400 Matvecs [ 27 ]

Sample runs GD+k: minmatvecs JDQMR: mintime 1 smallest eigenvalue of 2 for large matrices 4500 4000 3500 GD+1 JDqmr BLOPEX ARPACK Matvecs for 1 smallest eval 250 200 Time in seconds for 1 smallest eval GD+1 JDqmr BLOPEX ARPACK 16000 14000 12000 3000 2500 2000 1500 150 100 10000 8000 6000 1000 50 4000 500 2000 0 1M ILUT(4,0) 1M no prec 10M no prec Dimension of Laplacian 0 1M ILUT(4,0) 1M no prec 0 10M no prec ARPACK for 1M: 525 Matvecs, 220 seconds [ 28 ]

What is optimal for many eigenvalues? Red: QMRopt (exact eigenvalues as shifts) Black: JDQMR nearly optimal Blue: subspace accelerated GD+1 better than optimal? 4 494BUS Residual convergence versus matvecs: Black: JDqmr, Red: JDqmr with exact eigenvalue shifts, Blue: JD+1 2 0 2 4 6 8 10 12 0 50 100 150 200 250 Matrix vector multiplications [ 29 ]

JD: projecting the locked vectors X projecting the Ritz vector x (I XX T ) (A ηi)k 1 ( I X(X T K 1 X) 1 X T K 1) t = r Left projection needed for definiteness Skew Right projection is it needed? Orthogonalization requirements dominate for nev, so Solve: (A ηi) t = r, w/o preconditioning (I XX T )(A ηi)k 1 t = r, with preconditioning Other choices JDQMR-(Left,Skew,Right) (111), (000), (101), (011), (100) [ 30 ]

Scaling numevals no preconditioner Laplace 7point, 125K, Tol = 1e-15 Matvecs x 10 4 4.5 4 3.5 3 2.5 2 1.5 arpack GD+2 jdqmr000 jdqmr100 jdqmr011 jdbsym Matrix: Lap7pt125K tol = 1E 15 Time in seconds 4500 4000 3500 3000 2500 2000 1500 arpack GD+2 jdqmr000 jdqmr100 jdqmr011 jdbsym Matrix: Lap7pt125K tol = 1E 15 1 1000 0.5 500 0 0 10 20 30 40 50 60 70 80 90 100 Number of smallest eigenvalues found 0 0 10 20 30 40 50 60 70 80 90 100 Number of smallest eigenvalues found JDQMR-000 fastest among all PRIMME variants and ARPACK [ 31 ]

Scaling numevals no preconditioner cfd1, 70K, 26 nonzeros/row Tol = 1e-15 Matvecs x 10 5 2.5 2 1.5 1 arpack GD+2 jdqmr000 jdqmr100 jdqmr011 jdbsym Matrix: cfd1 tol = 1E 15 Time in seconds 18000 16000 14000 12000 10000 8000 6000 arpack GD+2 jdqmr000 jdqmr100 jdqmr011 jdbsym Matrix: cfd1 tol = 1E 15 0.5 4000 2000 0 0 10 20 30 40 50 60 70 80 90 100 Number of smallest eigenvalues found 0 0 10 20 30 40 50 60 70 80 90 100 Number of smallest eigenvalues found ARPACK eventually better for large numevals and denser matrices [ 32 ]

Scaling numevals no preconditioner Ratio: ARPACK / JDQMR-000 for 8 matrices Matvec ratio ARPACK/JDQMR 000, tol = 1E 15 Time ratio ARPACK/JDQMR 000, tol = 1E 15 Matvec ratio 2 1 0.5 Cone_A Andrews cfd1 Plate33K_A0 finan512 Lap7pt15K Lap7pt125K torsion1 Time ratio 4 3.5 3 2.5 2 1.5 Cone_A Andrews cfd1 Plate33K_A0 finan512 Lap7pt15K Lap7pt125K torsion1 0.1 1 0.5 0.05 5 10 50 100 Number of smallest eigenvalues found 0 5 10 50 100 Number of smallest eigenvalues found JDQMR-000 faster fornumevals < 10. Asymptotically depends on sparsity [ 33 ]

Scaling numevals ILUT(80,1e-4) cfd1 70K, Tol = 1e-15 Matrix: cfd1 tol = 1E 15 Matrix: cfd1 tol = 1E 15 Precond/Matvecs 12000 10000 8000 6000 4000 arpack GD+1 jdqmr100 jdqmr011 jdqmr111 jdbsym Time in seconds 3500 3000 2500 2000 1500 1000 arpack GD+1 jdqmr100 jdqmr011 jdqmr111 jdbsym 2000 500 0 0 10 20 30 40 50 60 70 80 90 100 Number of smallest eigenvalues found 0 0 10 20 30 40 50 60 70 80 90 100 Number of smallest eigenvalues found Q-projectors have no effect convergence of JDQMR [ 34 ]

Scaling numevals ILUT(20,1e-6) Laplace 7point, 125K, Tol = 1e-15 Matrix: Lap7pt125K tol = 1E 15 Matrix: Lap7pt125K tol = 1E 15 Precond/Matvecs 7000 6000 5000 4000 3000 arpack GD+1 jdqmr100 jdqmr011 jdqmr111 jdbsym Time in seconds 1600 1400 1200 1000 800 600 arpack GD+1 jdqmr100 jdqmr011 jdqmr111 jdbsym 2000 400 1000 200 0 0 10 20 30 40 50 60 70 80 90 100 Number of smallest eigenvalues found 0 0 10 20 30 40 50 60 70 80 90 100 Number of smallest eigenvalues found Expensive preconditioner fewer MVs means faster (GD+1) [ 35 ]

Software availability PRIMME: PReconditioned Iterative MultiMethod Eigensolver with my Ph.D. student J.R. McCombs Full set of defaults for non expert users Full customizability for expert users Near optimality through GD+k and JDQMR Over 12 methods accessible through PRIMME. Parallel, high performance implementation C and Fortran interfaces, Matlab interface soon. Download: www.cs.wm.edu/ andreas [ 36 ]

Minimal interface End user #include "primme.h" primme_params primme; primme_initialize(&primme); primme.n = n; primme.numevals = 20; The matrix and its size have been read. Number of needed eigenvalues, smallest by default primme.matrixmatvec = MV(x,y,k) primme.applypreconditioner = PR(x,y,k) primme_set_method(method, &primme); ierr = dprimme(evals, evecs, rnorms, &primme); [ 38 ]

Minimal interface End user #include "primme.h" primme_params primme; primme_initialize(&primme); primme.n = n; primme.numevals = 20; primme.matrixmatvec = MV(x,y,k) primme.applypreconditioner = PR(x,y,k) Pointers to functions for block matrix-vectors, and block precondition-vectors primme_set_method(method, &primme); ierr = dprimme(evals, evecs, rnorms, &primme); [ 39 ]

Minimal interface End user #include "primme.h" primme_params primme; primme_initialize(&primme); primme.n = n; primme.numevals = 20; primme.matrixmatvec = MV(x,y,k) primme.applypreconditioner = PR(x,y,k) primme set method(method, &primme); CHOICES: DYNAMIC DEFAULT MIN TIME DEFAULT MIN MATVECS Arnoldi GD GD plusk GD Olsen plusk JD Olsen plusk RQI JDQR JDQMR JDQMR ETol SUBSPACE ITERATION LOBPCG OrthoBasis LOBPCG OrthoBasis Window ierr = dprimme(evals, evecs, rnorms, &primme); [ 40 ]

The full interface Advanced user #include "primme.h" primme_params primme; primme. outputfile = stdout printlevel = 5 numevals = 10 anorm = 1.0 eps = 1.0e-12 maxbasissize = 15 minrestartsize = 7 maxblocksize = 1 maxouteriterations = 10000 maxmatvecs = 300000 target = primme_smallest numtargetshifts = 0 targetshifts = 1.0 2.0 locking = 1 initsize = 0 numorthoconst = 0; iseed = -1 restarting.scheme = primme_thick restarting.maxprevretain = 1 correction.precondition = 1 correction.robustshifts = 1 correction.maxinneriterations = -1 correction.reltolbase = 1.5 correction.convtest = adaptive_etolerance correction.projectors.leftq = 1 correction.projectors.leftx = 1 correction.projectors.rightq = 0 correction.projectors.skewq = 0 correction.projectors.rightx = 1 correction.projectors.skewx = 1 matrixmatvec = MV(x,y,k) applypreconditioner = PR(x,y,k) ierr = dprimme(evals, evecs, rnorms, &primme); [ 41 ]

Minimal Fortran interface include primme_f77.h integer primme call primme_initialize_f77(primme) call primme_set_member_f77(primme, PRIMMEF77_n, n) call primme_set_member_f77(primme, PRIMMEF77_numEvals, 20) call primme_set_member_f77(primme, PRIMMEF77_matrixMatvec, MV) call primme_set_member_f77(primme, PRIMMEF77_applyPreconditioner,PR) call primme_set_method_f77(primme, method, bytesneeded) call dprimme_f77(evals, evecs, rnorms, primme, ierr) Similar to C #include "primme.h" primme_params primme; primme_initialize(&primme); primme.n = n; primme.numevals = 20; primme.matrixmatvec = MV; primme.applypreconditioner = PR; primme_set_method(method, &primme); ierr = dprimme(evals, evecs, rnorms, &primme); [ 42 ]

In QCD we solve a sequence of linear systems Can we use these Krylov spaces to 1. obtain eigenpairs? 2. use these eigenpairs to deflate and thus accelerate subsequent systems? For restarted GMRES(m), the variant GMRESDR(m) IRA(m) GMRESDR computes eigenvalues while solving the system GMRES expensive per iteration Restarting slows convergence for linear system AND eigenvectors Can we be more effective on CG/Lanczos? [ 43 ]

Our recent work (with K. Orginos) Small window of m vectors, V, keeps track of the smallest nev < m eigenvectors V is expanded by the CG residuals When m vectors in V, restart it as in GD(nev,m)+nev CG iterates unaffected Records the Lanczos vector contributions to eigenvectors [ 44 ]

Does it find eigenvalues accurately? 10 0 Convergence of lowest eigenresidual with different (nev,m) Norm of lowest eigenvalue residual 10 1 10 2 10 3 10 4 10 5 10 6 eigcg(1,3) eigcg(3,9) eigcg(3,40) eigcg(8,17) eigcg(8,20) eigcg(8,40) unrestarted Lanczos 0 200 400 600 800 1000 1200 1400 1600 CG iterations [ 45 ]

Does this find eigenvalues accurately? Identical to Lanczos Residual convergence of 8 smallest eigenpairs with eigcg(8,20) 1 10 10 0 Eigenresidual Norm 10 1 10 2 10 3 10 4 10 5 10 6 0 200 400 600 800 1000 1200 1400 1600 CG iterations [ 46 ]

Incrementally improving accuracy and number of eigenvalues Use the CG iterations for k subsequent RHS to improve U: Incremental eigcg U = [ ],Λ = [ ] // accumulated eigenpairs for i = 1 : k x 0 = UΛ 1 U H b i // the init-cg part [x i,v,m ] = eigcg(nev,m,a,x 0,b i ) // eigcg with initial guess x 0 [U, Λ] = RayleighRitz([U,V]); end Typical values: k = 100, k = 12 24, nev = 10, m = 40 [ 47 ]

Convergence improves after every new CG 10 3 10 0 10-3 r 2 10-6 10-9 10-12 0 500 1000 1500 CG iteration [ 48 ]

A realistic experimental case Lattice parameters: 2 flavor Wilson fermions Lattice spacing a s = 0.1 f m (spatial) anisotropic: a t = a s /3 pion mass ( 350-400 MeV ) Two lattice sizes: 16 3 64 for a matrix dimension of 3.1 million 24 3 64 for a matrix dimension of 10.6 million Currently running using Chroma at Jefferson Lab [ 49 ]

Case: 3.1 million Case: 16 16 16 64 12 3000 m π L < 3 Original CG init CG with 240 evecs 2500 sea quark mass Matvecs 2000 1500 1000 500 0 0.425 0.42 0.415 0.41 0.405 0.4 Quark mass [ 50 ]

Case: 10.6 million 5000 m π L < 3 Case: 24 24 24 64 12 Original CG init CG with 240 evecs 4000 sea quark mass Matvecs 3000 2000 1000 0 0.42 0.415 0.41 0.405 0.4 Quark mass [ 51 ]

Conclusions JDQMR subspace accelerated inexact (truncated) Newton GD+1 subspace accelerated quasi Newton Near optimal for just a few eigenpairs Cheaper projectors possible with JDQMR for many eigenvalues PRIMME a state of the art eigensolver Our recent research promising for optimal, limited memory eigensolver [ 52 ]