Chapter 7 Iterative Techniques in Matrix Algebra

Per-Olof Persson (persson@berkeley.edu), Department of Mathematics, University of California, Berkeley. Math 128B Numerical Analysis

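Vector Norms
Definition. A vector norm on $\mathbb{R}^n$ is a function $\|\cdot\|$ from $\mathbb{R}^n$ into $\mathbb{R}$ with the properties:
(i) $\|x\| \ge 0$ for all $x \in \mathbb{R}^n$
(ii) $\|x\| = 0$ if and only if $x = 0$
(iii) $\|\alpha x\| = |\alpha|\,\|x\|$ for all $\alpha \in \mathbb{R}$ and $x \in \mathbb{R}^n$
(iv) $\|x + y\| \le \|x\| + \|y\|$ for all $x, y \in \mathbb{R}^n$
Definition. The Euclidean norm $l_2$ and the infinity norm $l_\infty$ for the vector $x = (x_1, x_2, \ldots, x_n)^t$ are defined by
$$\|x\|_2 = \Big\{ \sum_{i=1}^{n} x_i^2 \Big\}^{1/2} \quad\text{and}\quad \|x\|_\infty = \max_{1 \le i \le n} |x_i|$$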
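The two norms defined above can be sketched in a few lines of plain Python. The function names `norm_2` and `norm_inf` and the sample vector are illustrative choices, not part of the slides.

```python
import math

def norm_2(x):
    """Euclidean (l2) norm: square root of the sum of squared components."""
    return math.sqrt(sum(xi * xi for xi in x))

def norm_inf(x):
    """Infinity (l_inf) norm: largest absolute value among the components."""
    return max(abs(xi) for xi in x)

x = [-1.0, 1.0, -2.0]      # sample vector (assumed for illustration)
print(norm_2(x))           # sqrt(1 + 1 + 4) = sqrt(6)
print(norm_inf(x))         # largest |x_i| is 2.0
```

The distance between two vectors in either norm is then the norm of their componentwise difference.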

Cauchy-Bunyakovsky-Schwarz Inequality for Sums
Theorem. For each $x = (x_1, x_2, \ldots, x_n)^t$ and $y = (y_1, y_2, \ldots, y_n)^t$ in $\mathbb{R}^n$,
$$x^t y = \sum_{i=1}^{n} x_i y_i \le \Big\{ \sum_{i=1}^{n} x_i^2 \Big\}^{1/2} \Big\{ \sum_{i=1}^{n} y_i^2 \Big\}^{1/2} = \|x\|_2 \, \|y\|_2$$

Distances
Definition. The distance between two vectors $x = (x_1, \ldots, x_n)^t$ and $y = (y_1, \ldots, y_n)^t$ is the norm of the difference of the vectors. The $l_2$ and $l_\infty$ distances are
$$\|x - y\|_2 = \Big\{ \sum_{i=1}^{n} (x_i - y_i)^2 \Big\}^{1/2}, \qquad \|x - y\|_\infty = \max_{1 \le i \le n} |x_i - y_i|$$

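Convergence
Definition. A sequence $\{x^{(k)}\}_{k=1}^{\infty}$ of vectors in $\mathbb{R}^n$ is said to converge to $x$ with respect to the norm $\|\cdot\|$ if, given any $\varepsilon > 0$, there exists an integer $N(\varepsilon)$ such that
$$\|x^{(k)} - x\| < \varepsilon, \quad \text{for all } k \ge N(\varepsilon)$$
Theorem. The sequence of vectors $\{x^{(k)}\}$ converges to $x$ in $\mathbb{R}^n$ with respect to $\|\cdot\|_\infty$ if and only if $\lim_{k \to \infty} x_i^{(k)} = x_i$.
Theorem. For each $x \in \mathbb{R}^n$,
$$\|x\|_\infty \le \|x\|_2 \le \sqrt{n}\, \|x\|_\infty$$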

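Matrix Norms
Definition. A matrix norm on $n \times n$ matrices is a real-valued function $\|\cdot\|$ satisfying
(i) $\|A\| \ge 0$
(ii) $\|A\| = 0$ if and only if $A = 0$
(iii) $\|\alpha A\| = |\alpha|\,\|A\|$
(iv) $\|A + B\| \le \|A\| + \|B\|$
(v) $\|AB\| \le \|A\|\,\|B\|$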

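Natural Matrix Norms
Theorem. If $\|\cdot\|$ is a vector norm, the natural (or induced) matrix norm is given by
$$\|A\| = \max_{\|x\| = 1} \|Ax\|$$
Corollary. For any vector $z \ne 0$, matrix $A$, and natural norm $\|\cdot\|$,
$$\|Az\| \le \|A\| \cdot \|z\|$$
Theorem. If $A = (a_{ij})$ is an $n \times n$ matrix, then
$$\|A\|_\infty = \max_{1 \le i \le n} \sum_{j=1}^{n} |a_{ij}|$$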
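The row-sum formula for the natural $l_\infty$ matrix norm is easy to evaluate directly. A minimal sketch, with the function name `matrix_norm_inf` and the sample matrix chosen for illustration only:

```python
def matrix_norm_inf(A):
    """Natural l_inf matrix norm: the maximum absolute row sum of A."""
    return max(sum(abs(a) for a in row) for row in A)

# Sample matrix (assumed for illustration); absolute row sums are 4, 4 and 7.
A = [[1.0, 2.0, -1.0],
     [0.0, 3.0, -1.0],
     [5.0, -1.0, 1.0]]

print(matrix_norm_inf(A))   # 7.0, attained by the third row
```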

Eigenvalues and Eigenvectors
Definition. The characteristic polynomial of a square matrix $A$ is
$$p(\lambda) = \det(A - \lambda I)$$
Definition. The zeros $\lambda$ of the characteristic polynomial are eigenvalues of $A$, and $x \ne 0$ satisfying $(A - \lambda I)x = 0$ is a corresponding eigenvector.
Definition. The spectral radius $\rho(A)$ of a matrix $A$ is
$$\rho(A) = \max |\lambda|, \quad \text{for eigenvalues } \lambda \text{ of } A$$
Theorem. If $A$ is an $n \times n$ matrix, then
(i) $\|A\|_2 = [\rho(A^t A)]^{1/2}$
(ii) $\rho(A) \le \|A\|$, for any natural norm

Convergent Matrices
Definition. An $n \times n$ matrix $A$ is convergent if
$$\lim_{k \to \infty} (A^k)_{ij} = 0, \quad \text{for each } i = 1, 2, \ldots, n \text{ and } j = 1, 2, \ldots, n$$
Theorem. The following statements are equivalent.
(i) $A$ is a convergent matrix
(ii) $\lim_{n \to \infty} \|A^n\| = 0$, for some natural norm
(iii) $\lim_{n \to \infty} \|A^n\| = 0$, for all natural norms
(iv) $\rho(A) < 1$
(v) $\lim_{n \to \infty} A^n x = 0$, for every $x$
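A small numerical check can make the equivalence of (i) and (iv) concrete. The sketch below uses an assumed lower triangular example matrix, whose eigenvalues are its diagonal entries, so $\rho(A) = 0.5 < 1$; the helper names are hypothetical.

```python
def matmul(A, B):
    """Product of two square matrices stored as lists of rows."""
    n = len(A)
    return [[sum(A[i][k] * B[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def matrix_power(A, k):
    """A**k by repeated multiplication, for k >= 1."""
    P = A
    for _ in range(k - 1):
        P = matmul(P, A)
    return P

# Lower triangular example (assumed): eigenvalues 0.5 and 0.5, so rho(A) = 0.5.
A = [[0.5, 0.0],
     [0.25, 0.5]]

for k in (1, 10, 50):
    P = matrix_power(A, k)
    # The largest entry of A^k shrinks toward zero as k grows.
    print(k, max(abs(p) for row in P for p in row))
```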

Iterative Methods for Linear Systems
Direct methods for solving $Ax = b$, e.g. Gaussian elimination, compute an exact solution after a finite number of steps (in exact arithmetic).
Iterative algorithms produce a sequence of approximations $x^{(1)}, x^{(2)}, \ldots$ which hopefully converges to the solution, and
- may require less memory than direct methods
- may be faster than direct methods
- may handle special structures (such as sparsity) in a simpler way
[Figure: residual $r = b - Ax$ on a logarithmic axis versus iteration number, for a direct and an iterative method]

Two Classes of Iterative Methods
Stationary methods (or classical iterative methods) find a splitting $A = M - K$ and iterate
$$x^{(k)} = M^{-1}(Kx^{(k-1)} + b) = Tx^{(k-1)} + c$$
Examples: Jacobi, Gauss-Seidel, Successive Overrelaxation (SOR).
Krylov subspace methods use only multiplication by $A$ (and possibly by $A^T$) and find solutions in the Krylov subspace $\operatorname{span}\{b, Ab, A^2 b, \ldots, A^{k-1} b\}$.
Examples: Conjugate Gradient (CG), Generalized Minimal Residual (GMRES), BiConjugate Gradient (BiCG), etc.

Jacobi's Method
An iterative technique to solve $Ax = b$ starts with an initial approximation $x^{(0)}$ and generates a sequence of vectors $\{x^{(k)}\}_{k=0}^{\infty}$ that converges to $x$.
Jacobi's Method: Solve for $x_i$ in the $i$th equation of $Ax = b$:
$$x_i = \sum_{\substack{j=1 \\ j \ne i}}^{n} \Big( -\frac{a_{ij} x_j}{a_{ii}} \Big) + \frac{b_i}{a_{ii}}, \quad \text{for } i = 1, 2, \ldots, n$$
This leads to the iteration
$$x_i^{(k)} = \frac{1}{a_{ii}} \Big[ \sum_{\substack{j=1 \\ j \ne i}}^{n} \big( -a_{ij} x_j^{(k-1)} \big) + b_i \Big], \quad \text{for } i = 1, 2, \ldots, n$$

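Matrix form of Jacobi's Method
Convert $Ax = b$ into an equivalent system $x = Tx + c$, select an initial vector $x^{(0)}$ and iterate
$$x^{(k)} = Tx^{(k-1)} + c$$
For Jacobi's method, split $A$ into diagonal and off-diagonal parts:
$$
\underbrace{\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix}}_{A}
=
\underbrace{\begin{bmatrix} a_{11} & 0 & \cdots & 0 \\ 0 & a_{22} & \ddots & \vdots \\ \vdots & \ddots & \ddots & 0 \\ 0 & \cdots & 0 & a_{nn} \end{bmatrix}}_{D}
-
\underbrace{\begin{bmatrix} 0 & \cdots & \cdots & 0 \\ -a_{21} & \ddots & & \vdots \\ \vdots & \ddots & \ddots & \vdots \\ -a_{n1} & \cdots & -a_{n,n-1} & 0 \end{bmatrix}}_{L}
-
\underbrace{\begin{bmatrix} 0 & -a_{12} & \cdots & -a_{1n} \\ \vdots & \ddots & \ddots & \vdots \\ \vdots & & \ddots & -a_{n-1,n} \\ 0 & \cdots & \cdots & 0 \end{bmatrix}}_{U}
$$
This transforms $Ax = (D - L - U)x = b$ into $Dx = (L + U)x + b$, and if $D^{-1}$ exists, this leads to the Jacobi iteration:
$$x^{(k)} = D^{-1}(L + U)x^{(k-1)} + D^{-1}b = T_j x^{(k-1)} + c_j$$
where $T_j = D^{-1}(L + U)$ and $c_j = D^{-1}b$.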
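The Jacobi iteration can be sketched componentwise in plain Python. The stopping rule (change between iterates in the $l_\infty$ norm), the tolerance, and the strictly diagonally dominant test system are assumptions made for this illustration; the slides do not prescribe them.

```python
def jacobi(A, b, x0, tol=1e-10, max_iter=500):
    """Jacobi iteration for A x = b.

    Stops when the l_inf change between successive iterates drops below tol
    (an assumed stopping rule). Returns the iterate and the iteration count.
    """
    n = len(b)
    x = list(x0)
    for k in range(1, max_iter + 1):
        # Every component is computed from the previous iterate only.
        x_new = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
                 for i in range(n)]
        if max(abs(x_new[i] - x[i]) for i in range(n)) < tol:
            return x_new, k
        x = x_new
    return x, max_iter

# Strictly diagonally dominant test system with exact solution (1, 2, -1, 1).
A = [[10.0, -1.0, 2.0, 0.0],
     [-1.0, 11.0, -1.0, 3.0],
     [2.0, -1.0, 10.0, -1.0],
     [0.0, 3.0, -1.0, 8.0]]
b = [6.0, 25.0, -11.0, 15.0]

x, iterations = jacobi(A, b, [0.0] * 4)
print(x, iterations)
```

Building the whole new vector before overwriting `x` is what distinguishes this from an in-place update.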

The Gauss-Seidel Method
Improve Jacobi's method by, for $i > 1$, using the already updated components $x_1^{(k)}, \ldots, x_{i-1}^{(k)}$ when computing $x_i^{(k)}$:
$$x_i^{(k)} = \frac{1}{a_{ii}} \Big[ -\sum_{j=1}^{i-1} \big( a_{ij} x_j^{(k)} \big) - \sum_{j=i+1}^{n} \big( a_{ij} x_j^{(k-1)} \big) + b_i \Big]$$
In matrix form, the method can be written $(D - L)x^{(k)} = Ux^{(k-1)} + b$, and if $(D - L)^{-1}$ exists, this leads to the Gauss-Seidel iteration
$$x^{(k)} = (D - L)^{-1} U x^{(k-1)} + (D - L)^{-1} b = T_g x^{(k-1)} + c_g$$
where $T_g = (D - L)^{-1} U$ and $c_g = (D - L)^{-1} b$.
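A minimal Gauss-Seidel sketch follows; the only change from a Jacobi sweep is that each component is overwritten immediately. The stopping rule, tolerance, and test system are assumed for illustration.

```python
def gauss_seidel(A, b, x0, tol=1e-10, max_iter=500):
    """Gauss-Seidel iteration for A x = b with an in-place sweep.

    Stops when the largest componentwise change in a sweep is below tol
    (an assumed stopping rule). Returns the iterate and the sweep count.
    """
    n = len(b)
    x = list(x0)
    for k in range(1, max_iter + 1):
        change = 0.0
        for i in range(n):
            # x[j] with j < i already holds the new value; j > i is still old.
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            new_xi = (b[i] - s) / A[i][i]
            change = max(change, abs(new_xi - x[i]))
            x[i] = new_xi
        if change < tol:
            return x, k
    return x, max_iter

# Strictly diagonally dominant test system with exact solution (1, 2, -1, 1).
A = [[10.0, -1.0, 2.0, 0.0],
     [-1.0, 11.0, -1.0, 3.0],
     [2.0, -1.0, 10.0, -1.0],
     [0.0, 3.0, -1.0, 8.0]]
b = [6.0, 25.0, -11.0, 15.0]

x, sweeps = gauss_seidel(A, b, [0.0] * 4)
print(x, sweeps)
```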

General Iteration Methods
Lemma. If the spectral radius satisfies $\rho(T) < 1$, then $(I - T)^{-1}$ exists, and
$$(I - T)^{-1} = I + T + T^2 + \cdots = \sum_{j=0}^{\infty} T^j$$
Theorem. For any $x^{(0)} \in \mathbb{R}^n$, the sequence
$$x^{(k)} = Tx^{(k-1)} + c$$
converges to the unique solution of $x = Tx + c$ if and only if $\rho(T) < 1$.

General Iteration Methods
Corollary. If $\|T\| < 1$ for any natural matrix norm, then $x^{(k)} = Tx^{(k-1)} + c$ converges for any $x^{(0)} \in \mathbb{R}^n$ to a vector $x \in \mathbb{R}^n$ such that $x = Tx + c$. The following error estimates hold:
(1) $\|x - x^{(k)}\| \le \|T\|^k \, \|x^{(0)} - x\|$
(2) $\|x - x^{(k)}\| \le \dfrac{\|T\|^k}{1 - \|T\|} \, \|x^{(1)} - x^{(0)}\|$
Theorem. If $A$ is strictly diagonally dominant, then Jacobi and Gauss-Seidel converge for any $x^{(0)}$.
Theorem (Stein-Rosenberg). If $a_{ii} > 0$ for all $i$ and $a_{ij} < 0$ for $i \ne j$, then one and only one of the following holds:
(i) $0 \le \rho(T_g) < \rho(T_j) < 1$
(ii) $1 < \rho(T_j) < \rho(T_g)$
(iii) $\rho(T_j) = \rho(T_g) = 0$
(iv) $\rho(T_j) = \rho(T_g) = 1$

The Residual Vector
Definition. The residual vector for $\tilde{x} \in \mathbb{R}^n$ with respect to the linear system $Ax = b$ is $r = b - A\tilde{x}$.
Consider the approximate solution vector in Gauss-Seidel:
$$\mathbf{x}_i^{(k)} = \big( x_1^{(k)}, x_2^{(k)}, \ldots, x_{i-1}^{(k)}, x_i^{(k-1)}, \ldots, x_n^{(k-1)} \big)^t$$
with residual vector
$$\mathbf{r}_i^{(k)} = \big( r_{1i}^{(k)}, r_{2i}^{(k)}, \ldots, r_{ni}^{(k)} \big)^t$$
The Gauss-Seidel method:
$$x_i^{(k)} = \frac{1}{a_{ii}} \Big[ b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{(k)} - \sum_{j=i+1}^{n} a_{ij} x_j^{(k-1)} \Big]$$
can then be written as
$$x_i^{(k)} = x_i^{(k-1)} + \frac{r_{ii}^{(k)}}{a_{ii}}$$

Successive Over-Relaxation
Relaxation methods use an iteration of the form
$$x_i^{(k)} = x_i^{(k-1)} + \omega \frac{r_{ii}^{(k)}}{a_{ii}}$$
for some positive $\omega$. With $\omega > 1$, they can accelerate the convergence of the Gauss-Seidel method, and are called successive over-relaxation (SOR) methods.
Write the SOR method as
$$x_i^{(k)} = (1 - \omega) x_i^{(k-1)} + \frac{\omega}{a_{ii}} \Big[ b_i - \sum_{j=1}^{i-1} a_{ij} x_j^{(k)} - \sum_{j=i+1}^{n} a_{ij} x_j^{(k-1)} \Big]$$
which can be written in the matrix form
$$x^{(k)} = T_\omega x^{(k-1)} + c_\omega$$
where $T_\omega = (D - \omega L)^{-1} [(1 - \omega) D + \omega U]$ and $c_\omega = \omega (D - \omega L)^{-1} b$.
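The componentwise SOR formula translates into a short sweep: compute the Gauss-Seidel value, then blend it with the old component using $\omega$. This is a sketch under assumed choices (stopping rule, tolerance, a small positive definite tridiagonal test system, and $\omega = 1.25$); setting `omega=1.0` reduces it to Gauss-Seidel.

```python
def sor(A, b, x0, omega, tol=1e-10, max_iter=500):
    """SOR iteration for A x = b with relaxation parameter omega.

    Stops when the largest componentwise change in a sweep is below tol
    (an assumed stopping rule). Returns the iterate and the sweep count.
    """
    n = len(b)
    x = list(x0)
    for k in range(1, max_iter + 1):
        change = 0.0
        for i in range(n):
            s = sum(A[i][j] * x[j] for j in range(n) if j != i)
            gs_value = (b[i] - s) / A[i][i]                 # Gauss-Seidel value
            new_xi = (1.0 - omega) * x[i] + omega * gs_value
            change = max(change, abs(new_xi - x[i]))
            x[i] = new_xi
        if change < tol:
            return x, k
    return x, max_iter

# Positive definite tridiagonal test system with exact solution (3, 4, -5).
A = [[4.0, 3.0, 0.0],
     [3.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
b = [24.0, 30.0, -24.0]

x_sor, k_sor = sor(A, b, [1.0, 1.0, 1.0], omega=1.25)
x_gs, k_gs = sor(A, b, [1.0, 1.0, 1.0], omega=1.0)    # omega = 1 is Gauss-Seidel
print(x_sor, k_sor)
print(x_gs, k_gs)
```

Comparing the two sweep counts shows the acceleration that over-relaxation can give on this kind of system.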

Convergence of the SOR Method
Theorem (Kahan). If $a_{ii} \ne 0$ for all $i$, then $\rho(T_\omega) \ge |\omega - 1|$, and the SOR method can converge only if $0 < \omega < 2$.
Theorem (Ostrowski-Reich). If $A$ is PD and $0 < \omega < 2$, then SOR converges for any $x^{(0)}$.
Theorem. If $A$ is PD and tridiagonal, then $\rho(T_g) = [\rho(T_j)]^2 < 1$, and the optimal $\omega$ for SOR is
$$\omega = \frac{2}{1 + \sqrt{1 - [\rho(T_j)]^2}}$$
which gives $\rho(T_\omega) = \omega - 1$.
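As a worked illustration, the optimal $\omega$ can be evaluated by hand for a small positive definite tridiagonal matrix. The matrix below is an assumed example, not one taken from the slides.

```latex
A = \begin{bmatrix} 4 & 3 & 0 \\ 3 & 4 & -1 \\ 0 & -1 & 4 \end{bmatrix},
\qquad
T_j = D^{-1}(L + U)
    = \begin{bmatrix} 0 & -\tfrac{3}{4} & 0 \\ -\tfrac{3}{4} & 0 & \tfrac{1}{4} \\ 0 & \tfrac{1}{4} & 0 \end{bmatrix}

\det(T_j - \lambda I) = -\lambda \left( \lambda^2 - 0.625 \right)
\;\Longrightarrow\;
\rho(T_j) = \sqrt{0.625} \approx 0.7906,
\qquad
\rho(T_g) = [\rho(T_j)]^2 = 0.625

\omega = \frac{2}{1 + \sqrt{1 - 0.625}} = \frac{2}{1 + \sqrt{0.375}} \approx 1.2404,
\qquad
\rho(T_\omega) = \omega - 1 \approx 0.2404
```

For this matrix the spectral radius drops from $0.625$ for Gauss-Seidel to about $0.24$ for SOR with the optimal parameter.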

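Error Bounds
Theorem. Suppose $Ax = b$, $A$ is nonsingular, $\tilde{x} \approx x$, and $r = b - A\tilde{x}$. Then for any natural norm,
$$\|x - \tilde{x}\| \le \|r\| \cdot \|A^{-1}\|$$
and if $x, b \ne 0$,
$$\frac{\|x - \tilde{x}\|}{\|x\|} \le \|A\| \cdot \|A^{-1}\| \, \frac{\|r\|}{\|b\|}$$
Definition. The condition number of a nonsingular matrix $A$ in the norm $\|\cdot\|$ is
$$K(A) = \|A\| \cdot \|A^{-1}\|$$
In terms of $K(A)$, the error bounds can be written:
$$\|x - \tilde{x}\| \le K(A) \frac{\|r\|}{\|A\|}, \qquad \frac{\|x - \tilde{x}\|}{\|x\|} \le K(A) \frac{\|r\|}{\|b\|}$$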
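The following sketch evaluates $K(A)$ in the $l_\infty$ norm for an assumed nearly singular $2 \times 2$ matrix, and shows how a very small residual can coexist with a large relative error. The matrix, right-hand side, approximation, and helper names are all illustrative assumptions.

```python
def vec_norm_inf(x):
    """l_inf vector norm."""
    return max(abs(xi) for xi in x)

def mat_norm_inf(A):
    """Natural l_inf matrix norm (maximum absolute row sum)."""
    return max(sum(abs(a) for a in row) for row in A)

def inverse_2x2(A):
    """Explicit inverse of a nonsingular 2x2 matrix."""
    (a11, a12), (a21, a22) = A
    det = a11 * a22 - a12 * a21
    return [[a22 / det, -a12 / det], [-a21 / det, a11 / det]]

A = [[1.0, 2.0], [1.0001, 2.0]]    # nearly singular example (assumed)
b = [3.0, 3.0001]
x_exact = [1.0, 1.0]
x_approx = [3.0, 0.0]              # poor approximation, yet a tiny residual

K = mat_norm_inf(A) * mat_norm_inf(inverse_2x2(A))
r = [bi - sum(a * xj for a, xj in zip(row, x_approx)) for bi, row in zip(b, A)]

error = [xe - xa for xe, xa in zip(x_exact, x_approx)]
rel_error = vec_norm_inf(error) / vec_norm_inf(x_exact)
bound = K * vec_norm_inf(r) / vec_norm_inf(b)

print(K)                           # about 60002
print(vec_norm_inf(r))             # about 2e-4
print(rel_error, bound)            # the relative error stays below the bound
```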

Iterative Refinement
Algorithm: Iterative Refinement
Solve $Ax^{(1)} = b$
for $k = 1, 2, 3, \ldots$
    $r^{(k)} = b - Ax^{(k)}$            (residual: compute accurately!)
    Solve $Ay^{(k)} = r^{(k)}$          (solve for correction)
    $x^{(k+1)} = x^{(k)} + y^{(k)}$     (improve solution)
Allows for errors in the solution of the linear systems, provided the residual $r$ is computed accurately.
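To illustrate that refinement tolerates an inexact solver, the sketch below replaces the linear solves by a deliberately crude stand-in: two Jacobi sweeps from a zero start. That stand-in (`inexact_solve`), the test system, and the fixed number of refinement steps are assumptions for this example only; in practice the solver would typically be a low-precision factorization.

```python
def matvec(A, x):
    """Matrix-vector product for A stored as a list of rows."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]

def inexact_solve(A, r, sweeps=2):
    """Hypothetical cheap solver: a few Jacobi sweeps on A y = r from y = 0."""
    n = len(r)
    y = [0.0] * n
    for _ in range(sweeps):
        y = [(r[i] - sum(A[i][j] * y[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
    return y

def iterative_refinement(A, b, steps=20):
    """Refine an inexact solution using accurately computed residuals."""
    x = inexact_solve(A, b)                               # solve A x^(1) = b, inexactly
    for _ in range(steps):
        r = [bi - ci for bi, ci in zip(b, matvec(A, x))]  # residual
        y = inexact_solve(A, r)                           # solve for correction
        x = [xi + yi for xi, yi in zip(x, y)]             # improve solution
    return x

# Strictly diagonally dominant test system with exact solution (1, 2, -1, 1).
A = [[10.0, -1.0, 2.0, 0.0],
     [-1.0, 11.0, -1.0, 3.0],
     [2.0, -1.0, 10.0, -1.0],
     [0.0, 3.0, -1.0, 8.0]]
b = [6.0, 25.0, -11.0, 15.0]

x = iterative_refinement(A, b)
print(x)
```

Each refinement step shrinks the error even though no individual solve is accurate, because the residual itself is evaluated with the true matrix.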

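Errors in both matrix and right-hand side
Theorem. Suppose $A$ is nonsingular and
$$\|\delta A\| < \frac{1}{\|A^{-1}\|}$$
The solution $\tilde{x}$ to $(A + \delta A)\tilde{x} = b + \delta b$ approximates the solution $x$ of $Ax = b$ with the error estimate
$$\frac{\|x - \tilde{x}\|}{\|x\|} \le \frac{K(A)\,\|A\|}{\|A\| - K(A)\,\|\delta A\|} \left( \frac{\|\delta b\|}{\|b\|} + \frac{\|\delta A\|}{\|A\|} \right)$$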

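Inner products
Definition. The inner product for $n$-dimensional vectors $x, y$ is
$$\langle x, y \rangle = x^t y$$
Theorem. For any vectors $x, y, z$ and real number $\alpha$:
(a) $\langle x, y \rangle = \langle y, x \rangle$
(b) $\langle \alpha x, y \rangle = \langle x, \alpha y \rangle = \alpha \langle x, y \rangle$
(c) $\langle x + z, y \rangle = \langle x, y \rangle + \langle z, y \rangle$
(d) $\langle x, x \rangle \ge 0$
(e) $\langle x, x \rangle = 0 \iff x = 0$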

Krylov Subspace Algorithms
Create a sequence of Krylov subspaces for $Ax = b$:
$$\mathcal{K}_k = \operatorname{span}\{b, Ab, \ldots, A^{k-1} b\}$$
and find approximate solutions $x_k$ in $\mathcal{K}_k$.
- Only matrix-vector products involved
- For SPD matrices, the most popular algorithm is the Conjugate Gradients method [Hestenes/Stiefel, 1952]
  - Finds the best solution $x_k \in \mathcal{K}_k$ in the norm $\|x\|_A = \sqrt{x^t A x}$
  - Only requires storage of 4 vectors (not all the $k$ vectors in $\mathcal{K}_k$)
  - Remarkably simple and excellent convergence properties
  - Originally invented as a direct algorithm! (converges after $n$ steps in exact arithmetic)

The Conjugate Gradients Method
Algorithm: Conjugate Gradients Method
$x_0 = 0$, $r_0 = b$, $p_0 = r_0$
for $k = 1, 2, 3, \ldots$
    $\alpha_k = (r_{k-1}^t r_{k-1}) / (p_{k-1}^t A p_{k-1})$    (step length)
    $x_k = x_{k-1} + \alpha_k p_{k-1}$                          (approximate solution)
    $r_k = r_{k-1} - \alpha_k A p_{k-1}$                        (residual)
    $\beta_k = (r_k^t r_k) / (r_{k-1}^t r_{k-1})$               (improvement this step)
    $p_k = r_k + \beta_k p_{k-1}$                               (search direction)
- Only one matrix-vector product $A p_{k-1}$ per iteration
- Operation count $O(n)$ per iteration (excluding the matrix-vector product)
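The algorithm above maps almost line by line onto plain Python. This is a minimal sketch: the stopping test on $\|r_k\|_2$, the tolerance, the dense list-of-rows storage, and the SPD test system are assumptions, and a nonzero right-hand side is assumed so that the first step length is well defined.

```python
def dot(u, v):
    """Inner product u^t v."""
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(A, x):
    """Matrix-vector product for A stored as a list of rows."""
    return [dot(row, x) for row in A]

def conjugate_gradients(A, b, tol=1e-12, max_iter=1000):
    """CG for SPD A and nonzero b; returns the iterate and the step count."""
    x = [0.0] * len(b)                  # x_0 = 0
    r = list(b)                         # r_0 = b
    p = list(r)                         # p_0 = r_0
    rr_old = dot(r, r)
    for k in range(1, max_iter + 1):
        Ap = matvec(A, p)               # the only matrix-vector product
        alpha = rr_old / dot(p, Ap)     # step length
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        rr_new = dot(r, r)
        if rr_new ** 0.5 < tol:         # assumed stopping test on ||r_k||_2
            return x, k
        beta = rr_new / rr_old          # improvement this step
        p = [ri + beta * pi for ri, pi in zip(r, p)]
        rr_old = rr_new
    return x, max_iter

# SPD test system with exact solution (1, 2, 3).
A = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
b = [2.0, 4.0, 10.0]

x, steps = conjugate_gradients(A, b)
print(x, steps)    # in exact arithmetic at most n = 3 steps would be needed
```

Only `x`, `r`, `p`, and `Ap` are stored, which matches the four-vector storage requirement.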

Properties of Conjugate Gradients Vectors
Theorem. The spaces spanned by the solutions, the search directions, and the residuals are all equal to the Krylov subspaces:
$$\mathcal{K}_k = \operatorname{span}(\{x_1, x_2, \ldots, x_k\}) = \operatorname{span}(\{p_0, p_1, \ldots, p_{k-1}\}) = \operatorname{span}(\{r_0, r_1, \ldots, r_{k-1}\}) = \operatorname{span}(\{b, Ab, \ldots, A^{k-1} b\})$$
The residuals are orthogonal: $r_k^t r_j = 0$ $(j < k)$
The search directions are $A$-conjugate: $p_k^t A p_j = 0$ $(j < k)$

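Optimality of Conjugate Gradients
Theorem. The errors $e_k = x - x_k$ are minimized in the $A$-norm.
Proof. For any other point $\tilde{x} = x_k - \Delta x \in \mathcal{K}_k$ the error is
$$\|e\|_A^2 = (e_k + \Delta x)^t A (e_k + \Delta x) = e_k^t A e_k + (\Delta x)^t A (\Delta x) + 2 e_k^t A (\Delta x)$$
But $e_k^t A (\Delta x) = r_k^t (\Delta x) = 0$, since $r_k$ is orthogonal to $\mathcal{K}_k$, so $\Delta x = 0$ minimizes $\|e\|_A$.
Theorem. Monotonic: $\|e_k\|_A \le \|e_{k-1}\|_A$, and $e_k = 0$ in $k \le m$ steps (here $m$ denotes the dimension of the system).
Proof. Follows from $\mathcal{K}_k \subseteq \mathcal{K}_{k+1}$, and that $\mathcal{K}_k \subseteq \mathbb{R}^m$ unless converged.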

Optimization in CG
CG can be interpreted as a minimization algorithm.
- We know it minimizes $\|e\|_A$, but this cannot be evaluated, since the exact solution $x$ is unknown
- CG also minimizes the quadratic function $\varphi(x) = \tfrac{1}{2} x^t A x - x^t b$:
$$\|e_k\|_A^2 = e_k^t A e_k = (x - x_k)^t A (x - x_k) = x_k^t A x_k - 2 x_k^t A x + x^t A x = x_k^t A x_k - 2 x_k^t b + x^t b = 2\varphi(x_k) + \text{constant}$$
- At each step $\alpha_k$ is chosen to minimize $\varphi$ at $x_k = x_{k-1} + \alpha_k p_{k-1}$
- The conjugated search directions $p_k$ give minimization over all of $\mathcal{K}_k$

Optimization by Conjugate Gradients
We know that solving $Ax = b$ is equivalent to minimizing the quadratic function $\varphi(x) = \tfrac{1}{2} x^t A x - x^t b$.
The minimization can be done by line searches, where $\varphi(x_k)$ is minimized along a search direction $p_k$.
Theorem. The $\alpha_{k+1}$ that minimizes $\varphi(x_k + \alpha_{k+1} p_k)$ is
$$\alpha_{k+1} = \frac{p_k^t r_k}{p_k^t A p_k}$$
with the residual $r_k = b - Ax_k$.
The residual is also minus the gradient of $\varphi(x_k)$:
$$\nabla \varphi(x_k) = Ax_k - b = -r_k$$

The Method of Steepest Descent
Very simple approach: Set the search direction $p_k$ to the negative gradient $r_k$.
Corresponds to moving in the direction in which $\varphi(x)$ changes the most.
Algorithm: Steepest Descent
$x_0 = 0$, $r_0 = b$
for $k = 1, 2, 3, \ldots$
    $\alpha_k = (r_{k-1}^t r_{k-1}) / (r_{k-1}^t A r_{k-1})$    (step length)
    $x_k = x_{k-1} + \alpha_k r_{k-1}$                          (approximate solution)
    $r_k = r_{k-1} - \alpha_k A r_{k-1}$                        (residual)
Poor convergence, tends to move along previous search directions.
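A minimal steepest descent sketch follows, using the sign convention $x_k = x_{k-1} + \alpha_k r_{k-1}$ with $r = b - Ax$ and $\alpha_k > 0$. The stopping test, tolerance, and SPD test system are assumed for illustration.

```python
def dot(u, v):
    """Inner product u^t v."""
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(A, x):
    """Matrix-vector product for A stored as a list of rows."""
    return [dot(row, x) for row in A]

def steepest_descent(A, b, tol=1e-10, max_iter=10000):
    """Steepest descent for SPD A and nonzero b; returns iterate and step count."""
    x = [0.0] * len(b)                    # x_0 = 0
    r = list(b)                           # r_0 = b
    for k in range(1, max_iter + 1):
        Ar = matvec(A, r)
        alpha = dot(r, r) / dot(r, Ar)    # exact line search along r
        x = [xi + alpha * ri for xi, ri in zip(x, r)]
        r = [ri - alpha * ari for ri, ari in zip(r, Ar)]
        if dot(r, r) ** 0.5 < tol:        # assumed stopping test on ||r_k||_2
            return x, k
    return x, max_iter

# SPD test system with exact solution (1, 2, 3).
A = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
b = [2.0, 4.0, 10.0]

x, steps = steepest_descent(A, b)
print(x, steps)
```

Running the same system through a conjugate gradients implementation is a quick way to see the difference in step counts.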

The Method of Conjugate Directions
The optimization can be improved by better search directions.
Let the search directions be $A$-conjugate, or $p_i^t A p_k = 0$ for $i \ne k$.
Then the algorithm will converge in at most $n$ steps, since the initial error can be decomposed along the $p$'s:
$$e_0 = \sum_{k=0}^{n-1} \delta_k p_k, \quad \text{with } \delta_k = \frac{p_k^t A e_0}{p_k^t A p_k}$$
But this is exactly the $\alpha$ we choose at step $k$:
$$\alpha_{k+1} = \frac{p_k^t r_k}{p_k^t A p_k} = \frac{p_k^t A e_k}{p_k^t A p_k} = \frac{p_k^t A e_0}{p_k^t A p_k}$$
since the error $e_k$ is the initial $e_0$ plus a combination of $p_0, \ldots, p_{k-1}$, which are all $A$-conjugate to $p_k$.
Each component $\delta_k$ is then subtracted out at step $k$, and the method converges after $n$ steps.

Choosing A-conjugate Search Directions
One method to choose $p_k$ which is $A$-conjugate to previous search vectors is by Gram-Schmidt:
$$p_k = p_k^0 - \sum_{j=0}^{k-1} \beta_{kj} p_j, \quad \text{with } \beta_{kj} = \frac{(p_k^0)^t A p_j}{p_j^t A p_j}$$
- The initial $p_k^0$ vectors should be linearly independent, for example column $k + 1$ of the identity matrix
- Drawback: Must store all previous search vectors $p_k$
Conjugate Gradients is simply Conjugate Directions with a particular initial vector in Gram-Schmidt: $p_k^0 = r_k$.
This gives orthogonal residuals $r_k^t r_j = 0$ for $j \ne k$, and $\beta_{kj} = 0$ for $k > j + 1$.
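The Gram-Schmidt construction can be sketched with the identity columns as initial vectors, followed by the expansion of the solution along the resulting directions (with $x_0 = 0$, so that $e_0 = x$ and $p_k^t A e_0 = p_k^t b$). Function names and the SPD test system are assumptions for illustration.

```python
def dot(u, v):
    """Inner product u^t v."""
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(A, x):
    """Matrix-vector product for A stored as a list of rows."""
    return [dot(row, x) for row in A]

def a_conjugate_directions(A):
    """Gram-Schmidt in the A-inner product, starting from the identity columns."""
    n = len(A)
    directions = []
    for k in range(n):
        p0 = [1.0 if i == k else 0.0 for i in range(n)]   # column k+1 of I
        p = list(p0)
        for pj in directions:                             # all previous p_j are needed
            Apj = matvec(A, pj)
            beta = dot(p0, Apj) / dot(pj, Apj)
            p = [pi - beta * pji for pi, pji in zip(p, pj)]
        directions.append(p)
    return directions

def solve_by_conjugate_directions(A, b):
    """Sum the components delta_k p_k with delta_k = p_k^t b / p_k^t A p_k (x_0 = 0)."""
    x = [0.0] * len(b)
    for p in a_conjugate_directions(A):
        delta = dot(p, b) / dot(p, matvec(A, p))
        x = [xi + delta * pi for xi, pi in zip(x, p)]
    return x

# SPD test system with exact solution (1, 2, 3).
A = [[4.0, -1.0, 0.0],
     [-1.0, 4.0, -1.0],
     [0.0, -1.0, 4.0]]
b = [2.0, 4.0, 10.0]

P = a_conjugate_directions(A)
x = solve_by_conjugate_directions(A, b)
print(x)
```

The inner loop over `directions` is the storage drawback noted above; choosing $p_k^0 = r_k$ removes it because all but one $\beta_{kj}$ vanish.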

Preconditioners for Linear Systems
Main idea: Instead of solving $Ax = b$, solve, using a nonsingular $n \times n$ preconditioner $M$,
$$M^{-1} A x = M^{-1} b$$
which has the same solution $x$.
- Convergence properties based on $M^{-1} A$ instead of $A$
- Trade-off between the cost of applying $M^{-1}$ and the improvement of the convergence properties. Extreme cases:
  - $M = A$: perfect conditioning of $M^{-1} A = I$, but expensive $M^{-1}$
  - $M = I$: do nothing, $M^{-1} = I$, but no improvement of $M^{-1} A = A$

Preconditioned Conjugate Gradients
To keep symmetry, solve $(C^{-1} A C^{-*}) C^{*} x = C^{-1} b$ with $C C^{*} = M$.
Can be written in terms of $M^{-1}$ only, without reference to $C$:
Algorithm: Preconditioned Conjugate Gradients Method
$x_0 = 0$, $r_0 = b$, $p_0 = M^{-1} r_0$, $z_0 = p_0$
for $k = 1, 2, 3, \ldots$
    $\alpha_k = (r_{k-1}^T z_{k-1}) / (p_{k-1}^T A p_{k-1})$    (step length)
    $x_k = x_{k-1} + \alpha_k p_{k-1}$                          (approximate solution)
    $r_k = r_{k-1} - \alpha_k A p_{k-1}$                        (residual)
    $z_k = M^{-1} r_k$                                          (preconditioning)
    $\beta_k = (r_k^T z_k) / (r_{k-1}^T z_{k-1})$               (improvement this step)
    $p_k = z_k + \beta_k p_{k-1}$                               (search direction)
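A minimal sketch of the preconditioned algorithm, where the preconditioner is passed as a function that applies $M^{-1}$ to a vector. The Jacobi choice $M = \operatorname{diag}(A)$, the stopping test, and the SPD test system are assumptions for this illustration.

```python
def dot(u, v):
    """Inner product u^T v."""
    return sum(ui * vi for ui, vi in zip(u, v))

def matvec(A, x):
    """Matrix-vector product for A stored as a list of rows."""
    return [dot(row, x) for row in A]

def preconditioned_cg(A, b, apply_Minv, tol=1e-12, max_iter=1000):
    """PCG for SPD A and nonzero b; apply_Minv(r) must return M^{-1} r."""
    x = [0.0] * len(b)                   # x_0 = 0
    r = list(b)                          # r_0 = b
    z = apply_Minv(r)                    # z_0 = M^{-1} r_0
    p = list(z)                          # p_0 = z_0
    rz_old = dot(r, z)
    for k in range(1, max_iter + 1):
        Ap = matvec(A, p)
        alpha = rz_old / dot(p, Ap)      # step length
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        if dot(r, r) ** 0.5 < tol:       # assumed stopping test on ||r_k||_2
            return x, k
        z = apply_Minv(r)                # preconditioning
        rz_new = dot(r, z)
        beta = rz_new / rz_old           # improvement this step
        p = [zi + beta * pi for zi, pi in zip(z, p)]
        rz_old = rz_new
    return x, max_iter

# SPD test system with exact solution (1, 1, 1) and a badly scaled diagonal.
A = [[10.0, 1.0, 0.0],
     [1.0, 5.0, 1.0],
     [0.0, 1.0, 1.0]]
b = [11.0, 7.0, 2.0]

def jacobi_preconditioner(r):
    """Apply M^{-1} for M = diag(A): divide each entry by the diagonal."""
    return [ri / A[i][i] for i, ri in enumerate(r)]

x, steps = preconditioned_cg(A, b, jacobi_preconditioner)
print(x, steps)
```

Passing `lambda r: list(r)` as `apply_Minv` corresponds to $M = I$ and recovers unpreconditioned CG.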

Commonly Used Preconditioners
A preconditioner should approximately solve the problem $Ax = b$.
- Jacobi preconditioning: $M = \operatorname{diag}(A)$, very simple and cheap, might improve certain problems but usually insufficient
- Block-Jacobi preconditioning: Use block-diagonal instead of diagonal. Another variant is using several diagonals (e.g. tridiagonal)
- Classical iterative methods: Precondition by applying one step of Jacobi, Gauss-Seidel, SOR, or SSOR
- Incomplete factorizations: Perform Gaussian elimination but ignore fill, results in approximate factors $A \approx LU$ or $A \approx R^T R$ (more later)
- Coarse-grid approximations: For a PDE discretized on a grid, a preconditioner can be formed by transferring the solution to a coarser grid, solving a smaller problem, then transferring back (multigrid)