Iterative Methods for Linear Systems of Equations

Projection methods (3)
ITMAN PhD-course, DTU, 20-10-08 till 24-10-08
Martin van Gijzen, Delft University of Technology

Overview day 4
- Bi-Lanczos method
- Computation of an approximate solution: Bi-CG
- QMR
- CGS and Bi-CGSTAB
- Convergence
- The induced dimension theorem and IDR(s)

Introduction
Yesterday, we discussed Arnoldi-based methods. A method like GMRES minimises the residual norm, but has to compute and store a new orthogonal basis vector in each iteration. The cost of doing this becomes prohibitive if many iterations have to be performed.
Today, we will discuss another class of iterative methods. They are based on the Bi-Lanczos method, and are hence closely related to CG. They use short recurrences, but the iterates have no optimality property.
We will also discuss a new method: IDR(s). This method is not based on the Bi-Lanczos method, but is mathematically related to it.

The Bi-Lanczos method
The symmetric Lanczos method constructs a matrix $Q_k$ such that $Q_k^T Q_k = I$ and $Q_k^T A Q_k = T_k$ is tridiagonal.
The Bi-Lanczos algorithm constructs matrices $W_k$ and $V_k$ such that $W_k^T A V_k = T_k$ is tridiagonal and such that $W_k^T V_k = I$.
Hence $W_k$ and $V_k$ are not orthogonal themselves, but they form a bi-orthogonal pair.

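The Bi-Lanczos method (2)
Choose two vectors $v_1$ and $w_1$ such that $v_1^T w_1 = 1$
$\beta_1 = \delta_1 = 0$, $w_0 = v_0 = 0$   (initialization)
FOR $k = 1, 2, \ldots$ DO   (iteration)
    $\alpha_k = w_k^T A v_k$
    $\hat{v}_{k+1} = A v_k - \alpha_k v_k - \beta_k v_{k-1}$   (new direction $v$, orthogonal to previous $w$)
    $\hat{w}_{k+1} = A^T w_k - \alpha_k w_k - \delta_k w_{k-1}$
    $\delta_{k+1} = |\hat{v}_{k+1}^T \hat{w}_{k+1}|^{1/2}$
    $\beta_{k+1} = \hat{v}_{k+1}^T \hat{w}_{k+1} / \delta_{k+1}$
    $w_{k+1} = \hat{w}_{k+1} / \beta_{k+1}$
    $v_{k+1} = \hat{v}_{k+1} / \delta_{k+1}$
END FOR
Only the product $\beta_{k+1} \delta_{k+1} = \hat{v}_{k+1}^T \hat{w}_{k+1}$ is fixed by the requirement $w_{k+1}^T v_{k+1} = 1$; the square-root scaling of $\delta_{k+1}$ shown here is the usual choice.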
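The recurrence above can be tried out numerically. The following is a minimal NumPy sketch, not a robust implementation: the function name `bi_lanczos`, the test matrix and the start vectors are assumptions made for illustration, the square-root scaling of $\delta_{k+1}$ is one possible choice, and only an exact breakdown is detected.

```python
import numpy as np

def bi_lanczos(A, v1, w1, m):
    """Run m steps of Bi-Lanczos; return V_m, W_m and the tridiagonal T_m."""
    n = A.shape[0]
    V, W, T = np.zeros((n, m)), np.zeros((n, m)), np.zeros((m, m))
    v = v1 / np.linalg.norm(v1)
    w = w1 / (w1 @ v)                 # enforce w_1^T v_1 = 1
    v_old, w_old = np.zeros(n), np.zeros(n)
    beta = delta = 0.0                # beta_1 = delta_1 = 0
    for k in range(m):
        V[:, k], W[:, k] = v, w
        alpha = w @ (A @ v)
        T[k, k] = alpha
        if k == m - 1:
            break
        v_hat = A @ v - alpha * v - beta * v_old
        w_hat = A.T @ w - alpha * w - delta * w_old
        s = v_hat @ w_hat
        if s == 0.0:                  # exact breakdown only; no look-ahead
            raise RuntimeError("Bi-Lanczos breakdown")
        delta = np.sqrt(abs(s))       # subdiagonal entry
        beta = s / delta              # superdiagonal entry
        T[k + 1, k], T[k, k + 1] = delta, beta
        v_old, w_old = v, w
        v, w = v_hat / delta, w_hat / beta
    return V, W, T

# Hypothetical test matrix: diagonal plus a small nonsymmetric perturbation.
rng = np.random.default_rng(0)
A = np.diag(np.arange(1.0, 9.0)) + 0.1 * rng.standard_normal((8, 8))
V, W, T = bi_lanczos(A, np.ones(8), np.ones(8), 4)
print("max |W^T V - I|   :", np.abs(W.T @ V - np.eye(4)).max())
print("max |W^T A V - T| :", np.abs(W.T @ A @ V - T).max())
```

Both printed deviations should be at rounding level, which illustrates the two defining relations $W_k^T V_k = I$ and $W_k^T A V_k = T_k$.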

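The Bi-Lanczos method (3)
Let
$$T_k = \begin{pmatrix} \alpha_1 & \beta_2 & & 0 \\ \delta_2 & \alpha_2 & \ddots & \\ & \ddots & \ddots & \beta_k \\ 0 & & \delta_k & \alpha_k \end{pmatrix}$$
and $V_k = [v_1 \; v_2 \; \ldots \; v_k]$, $W_k = [w_1 \; w_2 \; \ldots \; w_k]$. Then
$$A V_k = V_k T_k + \delta_{k+1} v_{k+1} e_k^T$$
and
$$A^T W_k = W_k T_k^T + \beta_{k+1} w_{k+1} e_k^T.$$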

The Bi-Lanczos method (4)
Note that $\{v_1, \ldots, v_k\}$ is a basis for $K_k(A; v_1)$ and $\{w_1, \ldots, w_k\}$ is a basis for $K_k(A^T; w_1)$.

Breakdown of Bi-Lanczos
Bi-Lanczos breaks down if $\hat{w}_{k+1}^T \hat{v}_{k+1} = 0$. This can happen in the following cases:
- $\hat{v}_{k+1} = 0$: $\{v_1, \ldots, v_k\}$ span an invariant subspace (with respect to $A$)
- $\hat{w}_{k+1} = 0$: $\{w_1, \ldots, w_k\}$ span an invariant subspace (with respect to $A^T$)
- $\hat{w}_{k+1} \neq 0$ and $\hat{v}_{k+1} \neq 0$: serious breakdown

Bi-CG
A CG-type method can now be defined as follows. Take $v_1 = r_0 / \|r_0\|$ and $w_1 = \tilde{r}_0 / \|\tilde{r}_0\|$ (with e.g. $\tilde{r}_0 = r_0$).
- Compute $W_k$ and $V_k$ using Bi-Lanczos
- Compute $x_k = x_0 + V_k y_k$ with $y_k$ such that
$$W_k^T A V_k y_k = W_k^T r_0 \iff T_k y_k = \|r_0\| e_1$$
This algorithm can be cast in a more practical form in the same way as CG. The resulting algorithm is called Bi-CG.

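The Bi-CG algorithm
$r_0 = b - A x_0$. Choose $\tilde{r}_0$; $p_0 = r_0$, $\tilde{p}_0 = \tilde{r}_0$   (initialization)
FOR $k = 0, 1, \ldots$ DO
    $\alpha_k = \dfrac{\tilde{r}_k^T r_k}{\tilde{p}_k^T A p_k}$
    $x_{k+1} = x_k + \alpha_k p_k$   (update iterate)
    $r_{k+1} = r_k - \alpha_k A p_k$   (update residual)
    $\tilde{r}_{k+1} = \tilde{r}_k - \alpha_k A^T \tilde{p}_k$   (shadow residual)
    $\beta_k = \dfrac{\tilde{r}_{k+1}^T r_{k+1}}{\tilde{r}_k^T r_k}$
    $p_{k+1} = r_{k+1} + \beta_k p_k$   (update direction vector)
    $\tilde{p}_{k+1} = \tilde{r}_{k+1} + \beta_k \tilde{p}_k$   (shadow direction vector)
END FOR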
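The Bi-CG recurrences translate almost line by line into NumPy. The sketch below is an illustration under stated assumptions: the function name `bicg`, the stopping test on the relative recursive residual, the choice $\tilde{r}_0 = r_0$ and the small nonsymmetric tridiagonal test matrix are not from the slides, and no breakdown safeguards are included.

```python
import numpy as np

def bicg(A, b, tol=1e-10, maxit=200):
    """Unpreconditioned Bi-CG with shadow residual r~_0 = r_0 (an assumption)."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x
    r_sh = r.copy()                  # shadow residual
    p, p_sh = r.copy(), r_sh.copy()  # direction and shadow direction
    rho = r_sh @ r
    bnorm = np.linalg.norm(b)
    for k in range(maxit):
        if np.linalg.norm(r) <= tol * bnorm:
            return x, k
        Ap = A @ p
        alpha = rho / (p_sh @ Ap)
        x = x + alpha * p                      # update iterate
        r = r - alpha * Ap                     # update residual
        r_sh = r_sh - alpha * (A.T @ p_sh)     # shadow residual
        rho_new = r_sh @ r
        beta = rho_new / rho
        p = r + beta * p                       # update direction vector
        p_sh = r_sh + beta * p_sh              # shadow direction vector
        rho = rho_new
    return x, maxit

# Hypothetical test problem: nonsymmetric, diagonally dominant tridiagonal matrix.
n = 10
A = 4 * np.eye(n) - np.diag(np.ones(n - 1), 1) - 2 * np.diag(np.ones(n - 1), -1)
b = np.ones(n)
x, its = bicg(A, b)
print("iterations:", its, " true residual norm:", np.linalg.norm(b - A @ x))
```

Note that each iteration needs one product with $A$ and one with $A^T$, which is the cost the later slides set out to avoid.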

Properties of Bi-CG
- The method uses limited memory: only five vectors need to be stored.
- The method is not optimal. However, if $A$ is SPD and $\tilde{r}_0 = r_0$, we get CG back (but with twice the number of operations per iteration!).
- The method is finite: all the residuals are orthogonal to the shadow residuals, $\tilde{r}_i^T r_j = 0$ if $i \neq j$.
- The method is not robust: $\tilde{r}_k^T r_k$ may be zero while $r_k \neq 0$.

Quasi minimising the residuals
In analogy to CG and MINRES (and to FOM and GMRES) we can define a quasi-minimal residual algorithm. To this end, we first rewrite the Bi-Lanczos relations.

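The Bi-Lanczos relations
Define the $(k+1) \times k$ matrix $\underline{T}_k$ as
$$\underline{T}_k = \begin{pmatrix} \alpha_1 & \beta_2 & & \\ \delta_2 & \alpha_2 & \ddots & \\ & \ddots & \ddots & \beta_k \\ & & \delta_k & \alpha_k \\ & & & \delta_{k+1} \end{pmatrix}.$$
We can now write
$$A V_k = V_{k+1} \underline{T}_k.$$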

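Quasi minimal residuals (1)
We would like to solve the problem: find $x_k = x_0 + V_k y_k$ such that $\|r_k\|$ is minimal.
$$r_k = b - A x_k = r_0 - A V_k y_k = \|r_0\| v_1 - A V_k y_k$$
Hence the problem is to minimise
$$\|r_k\| = \bigl\| \|r_0\| v_1 - A V_k y_k \bigr\| \qquad (1)$$
$$\phantom{\|r_k\|} = \bigl\| \|r_0\| V_{k+1} e_1 - V_{k+1} \underline{T}_k y_k \bigr\|. \qquad (2)$$
Here $\underline{T}_k$ is the $(k+1) \times k$ matrix from the Bi-Lanczos relations. If $V_{k+1}$ had orthonormal columns this would be equivalent to:
Minimise $\|r_k\| = \bigl\| \|r_0\| e_1 - \underline{T}_k y_k \bigr\|$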

Quasi minimal residuals (2)
However, in general $V_{k+1}$ does not have orthogonal columns. We could ignore this fact and still compute $y_k$ by minimising
$$\bigl\| \|r_0\| e_1 - \underline{T}_k y_k \bigr\|.$$
The resulting algorithm is called QMR.
For many problems, the convergence of QMR is smoother than that of Bi-CG, but not essentially faster.
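Ignoring the non-orthogonality still controls the true residual, as the following short estimate shows. It is a standard norm inequality rather than something taken from the slides; the last bound assumes that the Lanczos vectors are scaled to unit length, which is a different normalisation from the one used in the Bi-Lanczos slide.

```latex
\begin{align*}
\|r_k\|_2
  &= \bigl\| V_{k+1}\bigl(\|r_0\|\,e_1 - \underline{T}_k y_k\bigr) \bigr\|_2
   \le \|V_{k+1}\|_2 \,\bigl\| \|r_0\|\,e_1 - \underline{T}_k y_k \bigr\|_2 , \\
\|V_{k+1}\|_2
  &\le \|V_{k+1}\|_F = \sqrt{k+1}
   \quad \text{if every column of } V_{k+1} \text{ has unit norm.}
\end{align*}
```

So the quantity that QMR minimises bounds the residual norm up to the factor $\|V_{k+1}\|_2$, which explains the name quasi-minimal.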

Bi-CG wastes time
Bi-CG constructs its approximations as a linear combination of the columns of $V_k$. The only reason why we construct shadow vectors is to calculate the parameters $\alpha_k$ and $\beta_k$.
Another disadvantage of Bi-CG and QMR is that operations with $A^T$ have to be performed, and this may be difficult in matrix-free computations.
Next we look at the following questions:
- Can the computational work of Bi-CG be better exploited?
- Can operations with $A^T$ be avoided?

Bi-CG: a closer look
In the Bi-CG method, the residual vector can be expressed as
$$r_k = \phi_k(A) r_0$$
and the direction vector as
$$p_k = \pi_k(A) r_0$$
with $\phi_k$ and $\pi_k$ polynomials of degree $k$.
The shadow vectors $\tilde{r}_k$ and $\tilde{p}_k$ are defined through the same recurrences as $r_k$ and $p_k$. Hence
$$\tilde{r}_k = \phi_k(A^T) \tilde{r}_0, \qquad \tilde{p}_k = \pi_k(A^T) \tilde{r}_0.$$

Towards a faster algorithm
The scalar $\alpha_k$ in Bi-CG is given by
$$\alpha_k = \frac{(\phi_k(A^T) \tilde{r}_0)^T (\phi_k(A) r_0)}{(\pi_k(A^T) \tilde{r}_0)^T (A \pi_k(A) r_0)} = \frac{\tilde{r}_0^T \phi_k^2(A) r_0}{\tilde{r}_0^T A \pi_k^2(A) r_0}.$$
The idea is now to find an algorithm which gives residuals $\hat{r}_k$ that satisfy
$$\hat{r}_k = \phi_k^2(A) r_0.$$
The algorithm that is based on this idea is called CGS (Sonneveld, 1989).

Conjugate Gradients Squared
$x_0$ is an initial guess; $r_0 = b - A x_0$;
$\tilde{r}_0$ arbitrary, such that $\tilde{r}_0^T r_0 \neq 0$;
$\rho_0 = \tilde{r}_0^T r_0$; $\beta_{-1} = \rho_0$; $p_{-1} = q_0 = 0$;
FOR $i = 0, 1, 2, \ldots$ DO
    $u_i = r_i + \beta_{i-1} q_i$;
    $p_i = u_i + \beta_{i-1} (q_i + \beta_{i-1} p_{i-1})$;
    $\hat{v} = A p_i$;
    $\alpha_i = \rho_i / (\tilde{r}_0^T \hat{v})$;
    $q_{i+1} = u_i - \alpha_i \hat{v}$;
    $s_i = u_i + q_{i+1}$;
    $x_{i+1} = x_i + \alpha_i (u_i + q_{i+1})$;
    $r_{i+1} = r_i - \alpha_i A (u_i + q_{i+1})$;
    $\rho_{i+1} = \tilde{r}_0^T r_{i+1}$;
    $\beta_i = \rho_{i+1} / \rho_i$;
END FOR
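A direct transcription of the CGS recurrences into NumPy might look as follows. This is a sketch under assumptions: the function name `cgs`, the relative-residual stopping test, the choice $\tilde{r}_0 = r_0$ and the test matrix are illustrative and not taken from the slides, and there is no protection against breakdown.

```python
import numpy as np

def cgs(A, b, tol=1e-10, maxit=200):
    """Unpreconditioned CGS with shadow residual r~_0 = r_0 (an assumption)."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x
    r_sh = r.copy()
    rho = r_sh @ r
    beta = rho                     # beta_{-1}; irrelevant since p_{-1} = q_0 = 0
    p = np.zeros_like(r)
    q = np.zeros_like(r)
    bnorm = np.linalg.norm(b)
    for i in range(maxit):
        if np.linalg.norm(r) <= tol * bnorm:
            return x, i
        u = r + beta * q
        p = u + beta * (q + beta * p)
        v_hat = A @ p
        alpha = rho / (r_sh @ v_hat)
        q = u - alpha * v_hat
        s = u + q
        x = x + alpha * s
        r = r - alpha * (A @ s)    # second matrix-vector product, no A^T needed
        rho_new = r_sh @ r
        beta = rho_new / rho
        rho = rho_new
    return x, maxit

# Hypothetical test problem: nonsymmetric, diagonally dominant tridiagonal matrix.
n = 10
A = 4 * np.eye(n) - np.diag(np.ones(n - 1), 1) - 2 * np.diag(np.ones(n - 1), -1)
b = np.ones(n)
x, its = cgs(A, b)
print("iterations:", its, " true residual norm:", np.linalg.norm(b - A @ x))
```

Each iteration uses two products with $A$ and none with $A^T$, so the work that Bi-CG spends on the shadow recurrences now contributes to the residual reduction.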

CGS (2)
For many problems CGS converges considerably faster than Bi-CG, sometimes twice as fast. However, convergence is more erratic. Bi-CG shows peaks in the convergence curves. These peaks are squared by CGS.
The peaks can destroy the accuracy of the solution and also the convergence of the method.
The search for a more smoothly converging method has led to the development of Bi-CGSTAB by Van der Vorst.

Bi-CGSTAB
Bi-CGSTAB is based on the following observation. Instead of squaring the Bi-CG polynomial, we can construct other iteration methods in which the iterates $x_k$ are generated so that
$$r_k = \psi_k(A) \phi_k(A) r_0.$$
Bi-CGSTAB takes a polynomial of the form
$$\psi_k(x) = (1 - \omega_1 x)(1 - \omega_2 x) \cdots (1 - \omega_k x).$$
The $\omega_k$ are chosen to minimise the current residual in the same way as in the one-step minimal residual method.
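The one-step minimisation that fixes $\omega_k$ is a scalar least-squares problem. The short derivation below uses the names $s$ and $t$ from the Bi-CGSTAB algorithm for the intermediate residual and its image under $A$.

```latex
\begin{align*}
s &= r_k - \alpha_k A p_k, \qquad t = A s, \qquad r_{k+1} = s - \omega_k t, \\
\frac{d}{d\omega}\,\|s - \omega t\|_2^2 &= -2\,t^T s + 2\,\omega\,t^T t = 0
\quad\Longrightarrow\quad
\omega_k = \frac{t^T s}{t^T t}.
\end{align*}
```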

Bi-CGSTAB: the algorithm
$x_0$ is an initial guess; $r_0 = b - A x_0$;
$\tilde{r}_0$ arbitrary, such that $\tilde{r}_0^T r_0 \neq 0$;
$\rho_{-1} = \alpha_{-1} = \omega_{-1} = 1$; $v_{-1} = p_{-1} = 0$;
FOR $k = 0, 1, 2, \ldots$ DO
    $\rho_k = \tilde{r}_0^T r_k$; $\beta_{k-1} = (\rho_k / \rho_{k-1})(\alpha_{k-1} / \omega_{k-1})$;
    $p_k = r_k + \beta_{k-1} (p_{k-1} - \omega_{k-1} v_{k-1})$;
    $v_k = A p_k$;
    $\alpha_k = \rho_k / (\tilde{r}_0^T v_k)$;
    $s = r_k - \alpha_k v_k$;
    $t = A s$;
    $\omega_k = (t^T s) / (t^T t)$;
    $x_{k+1} = x_k + \alpha_k p_k + \omega_k s$;
    $r_{k+1} = s - \omega_k t$;
END FOR
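The algorithm above can be sketched in NumPy as follows. The function name `bicgstab`, the stopping criterion, the choice $\tilde{r}_0 = r_0$, the early exit on a small intermediate residual $s$ and the test matrix are assumptions added for illustration; they are not part of the slide.

```python
import numpy as np

def bicgstab(A, b, tol=1e-10, maxit=200):
    """Unpreconditioned Bi-CGSTAB with shadow residual r~_0 = r_0 (an assumption)."""
    x = np.zeros_like(b, dtype=float)
    r = b - A @ x
    r_sh = r.copy()
    rho_old = alpha = omega = 1.0
    v = np.zeros_like(r)
    p = np.zeros_like(r)
    bnorm = np.linalg.norm(b)
    for k in range(maxit):
        if np.linalg.norm(r) <= tol * bnorm:
            return x, k
        rho = r_sh @ r
        beta = (rho / rho_old) * (alpha / omega)
        p = r + beta * (p - omega * v)
        v = A @ p
        alpha = rho / (r_sh @ v)
        s = r - alpha * v
        if np.linalg.norm(s) <= tol * bnorm:   # safeguard, not on the slide
            return x + alpha * p, k + 1
        t = A @ s
        omega = (t @ s) / (t @ t)              # one-step minimal residual
        x = x + alpha * p + omega * s
        r = s - omega * t
        rho_old = rho
    return x, maxit

# Hypothetical test problem: nonsymmetric, diagonally dominant tridiagonal matrix.
n = 10
A = 4 * np.eye(n) - np.diag(np.ones(n - 1), 1) - 2 * np.diag(np.ones(n - 1), -1)
b = np.ones(n)
x, its = bicgstab(A, b)
print("iterations:", its, " true residual norm:", np.linalg.norm(b - A @ x))
```

The early exit avoids the division $t^T s / t^T t$ when the intermediate residual is already negligible.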

Convergence of Bi-CG methods
No bounds on the residual norm can be given for the four methods of today: they have no optimality property.
Some general observations can be made:
- All four methods that we discussed today are finite.
- They all show superlinear convergence.
- The rate of convergence of Bi-CG and QMR is basically the same, but convergence of QMR is smoother.
- Convergence of Bi-CGSTAB and CGS is most of the time faster than that of the other two methods. But convergence of CGS can be erratic, which has a negative effect on the accuracy.

Convergence: first example
The figure below shows typical convergence curves for Bi-CGSTAB, CGS, QMR and Bi-CG.
[Figure: residual norm (logarithmic scale) against the number of matrix-vector multiplications (0 to 600) for BICGSTAB, CGS, BI-CG and QMR.]

Convergence: second example
The figure below shows another example of the convergence of Bi-CGSTAB, CGS, QMR and Bi-CG.
[Figure: residual norm (logarithmic scale) against the number of matrix-vector multiplications (0 to 7000) for BICGSTAB, CGS, BI-CG and QMR.]

BiCGSTAB2 and BiCGstab(l)
For some problems, in particular for real problems whose eigenvalues have large imaginary parts, Bi-CGSTAB does not work well. The reason is that for such problems it is not possible to construct a residual minimising polynomial of the form
$$\psi_k(x) = (1 - \omega_1 x)(1 - \omega_2 x) \cdots (1 - \omega_k x).$$
For real $\omega_j$ such a polynomial has only real roots, and these can therefore not approximate eigenvalues with large imaginary parts.
To overcome this problem Gutknecht proposed BiCGSTAB2, which uses a polynomial with quadratic factors. Sleijpen and Fokkema proposed BiCGstab(l), which uses polynomial factors of degree $l$.

Intermezzo
Of the Bi-CG methods, Bi-CGSTAB is by far the most widely used.
Recently, the new method IDR(s) has been proposed by Sonneveld and van Gijzen. IDR(1) and Bi-CGSTAB are mathematically equivalent, but for s > 1 convergence of IDR(s) is often much faster than that of Bi-CGSTAB.
Despite this equivalence, IDR(s) is based on a completely different approach.

The IDR approach for solving Ax = b
Generate residuals $r_n = b - A x_n$ that are in subspaces $G_j$ of decreasing dimension.
These nested subspaces are related by
$$G_j = (I - \omega_j A)(G_{j-1} \cap S)$$
where $S$ is a fixed proper subspace of $\mathbb{C}^N$, and the $\omega_j \in \mathbb{C}$ are non-zero scalars.
Ultimately $r_n \in \{0\}$ (IDR theorem).

The IDR theorem
Theorem 1 (IDR). Let $A$ be any matrix in $\mathbb{C}^{N \times N}$, let $v_0$ be any nonzero vector in $\mathbb{C}^N$, and let $G_0$ be the complete Krylov space $K_N(A; v_0)$. Let $S$ denote any (proper) subspace of $\mathbb{C}^N$ such that $S$ and $G_0$ do not share a nontrivial invariant subspace of $A$, and define the sequence $G_j$, $j = 1, 2, \ldots$ as
$$G_j = (I - \omega_j A)(G_{j-1} \cap S)$$
where the $\omega_j$ are nonzero scalars. Then
(i) $G_j \subset G_{j-1}$ for all $j > 0$.
(ii) $G_j = \{0\}$ for some $j \le N$.
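The dimension reduction stated in the theorem can be observed numerically on a small example. The sketch below is an illustration only: it assumes that $G_0$ is all of $\mathbb{C}^N$ (true for a generic $A$ and $v_0$), takes $S$ as the null space of $P^H$ for a random $N \times s$ matrix $P$, and uses the hypothetical helper names `null_space` and `idr_dimensions`. Ranks are decided with an SVD and a relative tolerance.

```python
import numpy as np

def null_space(M, rtol=1e-10):
    """Orthonormal basis of the null space of M, computed from the SVD."""
    _, sv, Vh = np.linalg.svd(M)
    rank = int(np.sum(sv > rtol * sv[0])) if sv[0] > 0 else 0
    return Vh[rank:].conj().T

def idr_dimensions(A, P, omegas):
    """Return dim(G_0), dim(G_1), ... for G_j = (I - omega_j A)(G_{j-1} ∩ S)."""
    n = A.shape[0]
    G = np.eye(n)                        # basis of G_0 = C^N (assumption)
    dims = [n]
    for omega in omegas:
        if G.shape[1] == 0:              # G_j = {0}: nothing left to reduce
            break
        Y = null_space(P.conj().T @ G)   # G y lies in S  <=>  P^H G y = 0
        G = (np.eye(n) - omega * A) @ (G @ Y)
        if G.shape[1] > 0:               # re-orthonormalise the new basis
            U, sv, _ = np.linalg.svd(G, full_matrices=False)
            G = U[:, : int(np.sum(sv > 1e-10 * sv[0]))]
        dims.append(G.shape[1])
    return dims

# Hypothetical data: random 8 x 8 matrix, s = 2 shadow vectors, fixed omega.
rng = np.random.default_rng(1)
A = rng.standard_normal((8, 8))
P = rng.standard_normal((8, 2))
dims = idr_dimensions(A, P, [0.5, 0.5, 0.5, 0.5])
print(dims)
```

For generic data the dimension drops by $s$ in every step, so $G_j = \{0\}$ is reached after about $N/s$ steps, well before the bound $j \le N$ of the theorem.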

Making an IDR algorithm

Making an IDR algorithm (2)
Definition of S: $S$ can be defined as the orthogonal complement of $\mathrm{span}(p_1, \ldots, p_s)$. Let $P$ be the matrix with $p_1, \ldots, p_s$ as its columns. Then
$$v \in S \iff P^H v = 0.$$
Residual difference vectors: We compute residual difference vectors
$$\Delta r_n = r_{n+1} - r_n = -A \Delta x_n$$
to update the solution vector $x_{n+1}$ together with the residual $r_{n+1}$.

Making an IDR algorithm (3)
Intermediate residuals: Intermediate residuals $r_n$ can be generated by repeating the algorithm. Once $s + 1$ residuals in $G_j$ have been computed, the next residual will be in $G_{j+1}$.
Choice of $\omega$: Every $(s+1)$-th step a new $\omega$ may be chosen. We choose it such that the next residual is minimized in norm.

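Prototype IDR(s) algorithm.
Here $dR_n$ and $dX_n$ denote the $N \times s$ matrices that hold the last $s$ residual differences and solution updates.
while $\|r_n\| >$ TOL and $n <$ MAXIT do
    for $k = 0$ to $s$ do
        Solve $c$ from $P^H dR_n c = P^H r_n$
        $v = r_n - dR_n c$; $t = A v$;
        if $k = 0$ then
            $\omega = (t^H v) / (t^H t)$;
        end if
        $dr_n = -dR_n c - \omega t$; $dx_n = -dX_n c + \omega v$;
        $r_{n+1} = r_n + dr_n$; $x_{n+1} = x_n + dx_n$;
        $n = n + 1$;
        $dR_n = (dr_{n-1} \cdots dr_{n-s})$; $dX_n = (dx_{n-1} \cdots dx_{n-s})$;
    end for
end while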
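The prototype can be turned into a small runnable NumPy sketch for real matrices. Several ingredients are assumptions that the slide does not specify: the function name `idrs`, a random matrix `P` for the shadow space, a start-up phase of $s$ one-step minimal residual iterations to fill `dR` and `dX`, a convergence check inside the inner loop, and the test matrix. The sketch recomputes $P^T dR$ in every step and is not tuned for efficiency or robustness.

```python
import numpy as np

def idrs(A, b, s=4, tol=1e-10, maxit=500, seed=0):
    """Prototype IDR(s) for real A; returns the iterate and the MATVEC count."""
    n = b.size
    P = np.random.default_rng(seed).standard_normal((n, s))  # S = null(P^T)
    x = np.zeros(n)
    r = b - A @ x
    dR, dX = np.zeros((n, s)), np.zeros((n, s))
    bnorm = np.linalg.norm(b)
    # Start-up (assumption): s minimal-residual steps fill dR and dX.
    for k in range(s):
        v = A @ r
        omega = (v @ r) / (v @ v)
        dX[:, k], dR[:, k] = omega * r, -omega * v
        x, r = x + dX[:, k], r + dR[:, k]
    its, oldest = s, 0
    while np.linalg.norm(r) > tol * bnorm and its < maxit:
        for k in range(s + 1):
            c = np.linalg.solve(P.T @ dR, P.T @ r)
            v = r - dR @ c                    # v lies in G_j ∩ S
            t = A @ v
            if k == 0:                        # new omega on entering G_{j+1}
                omega = (t @ v) / (t @ t)
            dr = -dR @ c - omega * t
            dx = -dX @ c + omega * v
            r, x = r + dr, x + dx
            its += 1
            dR[:, oldest], dX[:, oldest] = dr, dx   # overwrite oldest column
            oldest = (oldest + 1) % s
            if np.linalg.norm(r) <= tol * bnorm:    # extra check (assumption)
                break
    return x, its

# Hypothetical test problem: nonsymmetric, diagonally dominant tridiagonal matrix.
n = 50
A = 4 * np.eye(n) - np.diag(np.ones(n - 1), 1) - 2 * np.diag(np.ones(n - 1), -1)
b = np.ones(n)
x, its = idrs(A, b, s=4)
print("MATVECs:", its, " true residual norm:", np.linalg.norm(b - A @ x))
```

The order of the columns in `dR` and `dX` does not matter as long as both matrices use the same order, which is why the oldest column can simply be overwritten in place.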

Convergence of IDR(s)
The figure below shows typical convergence of Bi-CGSTAB, GMRES (optimal) and IDR(s).
[Figure: convection-diffusion problem; $\log_{10}$ of the residual 2-norm against the number of MATVECS (0 to 1200) for GMRES, Bi-CGSTAB, BiCGstab(2), IDR(1), IDR(2), IDR(4) and IDR(8).]

Concluding remarks (1)
Today we discussed short recurrence methods for solving nonsymmetric problems. Of these methods Bi-CGSTAB is by far the most popular.
A logical question to ask is: are there better methods than Bi-CGSTAB possible that use short recurrences? The answer is: sure! For example IDR(s). However, neither Bi-CGSTAB nor IDR(s) reduces an error in an optimal way.
Another question is: is it possible to derive an optimal method for general matrices that uses short recurrences? The answer is: unfortunately not. This was proven by Faber and Manteuffel in 1984.

Concluding remarks (2)
When should we use IDR(s) and when GMRES? It is difficult to give a general answer. Rules of thumb are:
- Are matrix-vector multiplications (and/or preconditioning operations) expensive? GMRES is your best bet, at least if the number of iterations is limited.
- Are matrix-vector multiplications inexpensive and do you expect that many iterations are needed (for example because you use a simple preconditioner)? Use IDR(s) (with s = 4 as a good default).