Lecture 11: Fast Linear Solvers: Iterative Methods
J. Chaudhry, Department of Mathematics and Statistics, University of New Mexico

Summary: Complexity of Linear Solves $Ax = b$
- diagonal system: $O(n)$
- upper or lower triangular system: $O(n^2)$
- full system with GE or Cholesky: $O(n^3)$; partial pivoting adds $O(n^2)$
- full system with LU: $O(n^3)$; LU back solve: $O(n^2)$
- $m$ different right-hand sides: $O(mn^3)$ or $O(n^3 + mn^2)$
- tridiagonal system: $O(n)$
- $m$-band system: $O(m^2 n)$

Approximate solutions
So far we have sought exact solutions to $Ax = b$. What if we only need an approximation $\hat{x}$ to $x$? We would like some $\hat{x}$ so that $\|\hat{x} - x\| \leq \epsilon$, where $\epsilon$ is some tolerance.

The Residual
We cannot actually evaluate the error $\hat{e} = x - \hat{x}$. But...
For the true solution: $b - Ax = 0$.
For an approximate solution $\hat{x}$: $b - A\hat{x} \neq 0$.
We call $\hat{r} = b - A\hat{x}$ the residual. It is a computable way to measure the error. In fact
$\hat{r} = b - A\hat{x} = Ax - A\hat{x} = A\hat{e}$

How big is the residual?
For a given approximation $\hat{x}$ to $x$, how big is the residual $\hat{r} = b - A\hat{x}$? A norm $\|r\|$ gives a magnitude:
$\|r\|_1 = \sum_{i=1}^{n} |r_i|$
$\|r\|_2 = \left( \sum_{i=1}^{n} r_i^2 \right)^{1/2}$
$\|r\|_\infty = \max_{1 \leq i \leq n} |r_i|$

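A minimal NumPy sketch of computing a residual and its three norms; the small system and the candidate $\hat{x}$ below are made-up illustrations, not values from the lecture.

```python
import numpy as np

# An arbitrary small system for illustration only
A = np.array([[4.0, 1.0], [1.0, 3.0]])
b = np.array([1.0, 2.0])
x_hat = np.array([0.2, 0.6])          # some approximate solution

r = b - A @ x_hat                     # residual r = b - A x_hat

print(np.linalg.norm(r, 1))           # ||r||_1   = sum |r_i|
print(np.linalg.norm(r, 2))           # ||r||_2   = (sum r_i^2)^(1/2)
print(np.linalg.norm(r, np.inf))      # ||r||_inf = max |r_i|
```
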
Approximating x...
Suppose we make a wild guess $x^{(0)} \approx x$ at the solution $x$ of $Ax = b$. How do we improve $x^{(0)}$?
Ideally: $x^{(1)} = x^{(0)} + e^{(0)}$, but to obtain $e^{(0)}$ we must know $x$. Not a viable method.
Ideally (another way):
$x^{(1)} = x^{(0)} + e^{(0)} = x^{(0)} + (x - x^{(0)}) = x^{(0)} + (A^{-1}b - x^{(0)}) = x^{(0)} + A^{-1}(b - Ax^{(0)}) = x^{(0)} + A^{-1} r^{(0)}$

An iteration
Again, the method is nonsense since $A^{-1}$ is needed:
$x^{(1)} = x^{(0)} + A^{-1} r^{(0)}$
What if we approximate $A^{-1}$? Suppose $Q^{-1} \approx A^{-1}$ and is cheap to construct; then
$x^{(1)} = x^{(0)} + Q^{-1} r^{(0)}$
is a good step. Continuing...
$x^{(k)} = x^{(k-1)} + Q^{-1} r^{(k-1)}$

What kind of $Q^{-1}$ do I need?
Rewrite:
$x^{(k)} = x^{(k-1)} + Q^{-1}(b - Ax^{(k-1)})$
This becomes
$Qx^{(k)} = Qx^{(k-1)} + (b - Ax^{(k-1)}) = (Q - A)x^{(k-1)} + b$
This is the form in the text (page 408 in NMC7 or 322 in NMC6). Or
$x^{(k)} = Q^{-1}(Q - A)x^{(k-1)} + Q^{-1}b$

Two Popular Choices
Example: Jacobi iteration approximates $A$ with $Q = \mathrm{diag}(A) = D$.
x = x^(0)
Q = D
for k = 1 to k_max
    r = b - A x
    if ||r|| = ||b - A x|| <= tol, stop
    x = x + Q^{-1} r
end
(Do not implement the algorithm this way! More on implementation later.)

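A minimal NumPy sketch of the slide's (deliberately naive) Jacobi loop, written exactly in the $x \leftarrow x + Q^{-1} r$ form with $Q = D$; the function name, tolerance, and iteration cap are my own choices.

```python
import numpy as np

def jacobi_naive(A, b, x0, tol=1e-8, k_max=500):
    """Naive Jacobi: x <- x + Q^{-1} r with Q = diag(A). For illustration only."""
    x = x0.astype(float).copy()
    Qinv = np.diag(1.0 / np.diag(A))         # Q^{-1} for Q = D
    for k in range(k_max):
        r = b - A @ x                        # residual
        if np.linalg.norm(r) <= tol:
            break
        x = x + Qinv @ r
    return x
```
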
Two Popular Choices
Let $A = D - C_L - C_U$, where $C_L$ and $C_U$ are strictly lower and strictly upper triangular, respectively.
Example: Gauss-Seidel iteration approximates $A$ with $Q = \mathrm{lowertri}(A) = D - C_L$.
x = x^(0)
Q = D - C_L
for k = 1 to k_max
    r = b - A x
    if ||r|| = ||b - A x|| <= tol, stop
    x = x + Q^{-1} r
end
(Do not implement the algorithm this way! More on implementation later.)

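The analogous sketch for Gauss-Seidel, again in the naive $x \leftarrow x + Q^{-1} r$ form but with $Q = D - C_L$ (the lower triangle of $A$); applying $Q^{-1}$ is done with a triangular solve. Names and defaults are my own.

```python
import numpy as np
from scipy.linalg import solve_triangular

def gauss_seidel_naive(A, b, x0, tol=1e-8, k_max=500):
    """Naive Gauss-Seidel: x <- x + Q^{-1} r with Q = lower triangle of A."""
    x = x0.astype(float).copy()
    Q = np.tril(A)                           # Q = D - C_L
    for k in range(k_max):
        r = b - A @ x
        if np.linalg.norm(r) <= tol:
            break
        x = x + solve_triangular(Q, r, lower=True)
    return x
```
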
Example: Let's do Jacobi and Gauss-Seidel.
Example: Consider
$A = \begin{pmatrix} 2 & 1 & 0 \\ 1 & 3 & 1 \\ 0 & 1 & 2 \end{pmatrix}, \quad b = \begin{pmatrix} 1 \\ 8 \\ 5 \end{pmatrix}, \quad x^{(0)} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}$

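A short run of both naive iterations on this example system, reusing the two sketches above and comparing against a direct solve; the printed iterates are whatever the code produces, not numbers quoted from the lecture.

```python
import numpy as np

A  = np.array([[2.0, 1.0, 0.0],
               [1.0, 3.0, 1.0],
               [0.0, 1.0, 2.0]])
b  = np.array([1.0, 8.0, 5.0])
x0 = np.zeros(3)

# assumes jacobi_naive and gauss_seidel_naive from the sketches above
print("Jacobi:      ", jacobi_naive(A, b, x0))
print("Gauss-Seidel:", gauss_seidel_naive(A, b, x0))
print("Exact:       ", np.linalg.solve(A, b))
```
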
Why D and D - C_L?
Look again at the iteration
$x^{(k)} = x^{(k-1)} + Q^{-1} r^{(k-1)}$
Looking at the error (subtract both sides from $x$):
$x - x^{(k)} = x - x^{(k-1)} - Q^{-1} r^{(k-1)}$
Since $r^{(k-1)} = A e^{(k-1)}$, this gives
$e^{(k)} = e^{(k-1)} - Q^{-1} A e^{(k-1)}$
or
$e^{(k)} = (I - Q^{-1}A)\, e^{(k-1)}$
or
$e^{(k)} = (I - Q^{-1}A)^k e^{(0)}$

Why D and D - C_L?
We want $e^{(k)} = (I - Q^{-1}A)^k e^{(0)}$ to converge (to zero). When does $a_k = c^k$ converge? ... when $|c| < 1$.
Likewise, our iteration converges, since
$\|e^{(k)}\| = \|(I - Q^{-1}A)^k e^{(0)}\| \leq \|I - Q^{-1}A\|^k \, \|e^{(0)}\|$,
when $\|I - Q^{-1}A\| < 1$.

Matrix Norms (Recall from earlier lecture)
What is $\|I - Q^{-1}A\|$?
$\|A\|_1 = \max_{1 \leq j \leq n} \sum_{i=1}^{n} |a_{ij}|$
$\|A\|_2 = \sqrt{\rho(A^T A)}$, where $\rho(A) = \max_{1 \leq i \leq n} |\lambda_i|$ is the spectral radius
$\|A\|_2 = \rho(A)$ for symmetric $A$
$\|A\|_\infty = \max_{1 \leq i \leq n} \sum_{j=1}^{n} |a_{ij}|$

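A sketch of checking the contraction condition numerically: form the Jacobi iteration matrix $I - Q^{-1}A$ for the example matrix above and look at its norms and spectral radius; any value below 1 guarantees convergence in that measure.

```python
import numpy as np

A = np.array([[2.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
Qinv = np.diag(1.0 / np.diag(A))          # Jacobi: Q = D
G = np.eye(3) - Qinv @ A                  # iteration matrix I - Q^{-1} A

print(np.linalg.norm(G, 1))               # max absolute column sum
print(np.linalg.norm(G, np.inf))          # max absolute row sum
print(np.linalg.norm(G, 2))               # sqrt of rho(G^T G)
print(max(abs(np.linalg.eigvals(G))))     # spectral radius rho(G)
```
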
Again, why do Jacobi and Gauss-Seidel work?
Jacobi, Gauss-Seidel (sufficient) Convergence Theorem: If $A$ is diagonally dominant, then the Jacobi and Gauss-Seidel methods converge for any initial guess $x^{(0)}$.
Definition (Diagonal Dominance): A matrix is diagonally dominant if
$|a_{ii}| > \sum_{j=1,\, j \neq i}^{n} |a_{ij}|$ for all $i$.

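A small helper (my own, not from the text) that tests the strict diagonal dominance condition row by row.

```python
import numpy as np

def is_diagonally_dominant(A):
    """True if |a_ii| > sum_{j != i} |a_ij| for every row i."""
    A = np.asarray(A, dtype=float)
    diag = np.abs(np.diag(A))
    off  = np.sum(np.abs(A), axis=1) - diag
    return bool(np.all(diag > off))
```
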
Implementation of Jacobi Algorithm (or the Smart Jacobi Algorithm)
The algorithm above uses the matrix representation:
$x^{(k)} = D^{-1}(C_L + C_U)\, x^{(k-1)} + D^{-1} b$
Or, componentwise,
$x_i^{(k)} = -\sum_{j=1,\, j \neq i}^{n} \left( \frac{a_{ij}}{a_{ii}} \right) x_j^{(k-1)} + \frac{b_i}{a_{ii}}$
So each sweep (from $k-1$ to $k$) uses $O(n)$ operations per vector element, or a total of $O(n^2)$ operations for $n$ vector elements. If, for each row $i$, $a_{ij} = 0$ for all but $m$ values of $j$, each sweep uses $O(mn)$ operations.

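A componentwise ("smart") Jacobi sweep following the formula above; the whole new vector is built from the old one, so the two vectors are kept separate. Function name and defaults are my own choices.

```python
import numpy as np

def jacobi(A, b, x0, tol=1e-8, k_max=500):
    """Componentwise Jacobi: each sweep builds x_new entirely from x_old."""
    n = len(b)
    x_old = x0.astype(float).copy()
    for k in range(k_max):
        x_new = np.empty(n)
        for i in range(n):
            s = A[i, :] @ x_old - A[i, i] * x_old[i]   # sum_{j != i} a_ij * x_old_j
            x_new[i] = (b[i] - s) / A[i, i]
        x_old = x_new
        if np.linalg.norm(b - A @ x_old) <= tol:
            break
    return x_old
```
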
Implementation of Gauss-Seidel Algorithm (or the Smart Gauss-Seidel Algorithm)
The algorithm above uses the matrix representation:
$x^{(k)} = (D - C_L)^{-1} C_U\, x^{(k-1)} + (D - C_L)^{-1} b$
Componentwise:
$x_i^{(k)} = -\sum_{j < i} \left( \frac{a_{ij}}{a_{ii}} \right) x_j^{(k)} - \sum_{j > i} \left( \frac{a_{ij}}{a_{ii}} \right) x_j^{(k-1)} + \frac{b_i}{a_{ii}}$
So again each sweep (from $k-1$ to $k$) uses $O(n)$ operations per vector element, or a total of $O(n^2)$ operations for $n$ vector elements. If, for each row $i$, $a_{ij} = 0$ for all but $m$ values of $j$, each sweep uses $O(mn)$ operations.
The difference: in the Jacobi method, updates are saved in a new vector and not used until the next sweep. With Gauss-Seidel, an update to an element $x_i^{(k)}$ is used immediately.

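The matching componentwise Gauss-Seidel sweep; the update overwrites x in place, so the entries $x_j$ with $j < i$ appearing in row $i$'s formula are already the new values. Again, the name and defaults are my own.

```python
import numpy as np

def gauss_seidel(A, b, x0, tol=1e-8, k_max=500):
    """Componentwise Gauss-Seidel: updates are used immediately (in place)."""
    n = len(b)
    x = x0.astype(float).copy()
    for k in range(k_max):
        for i in range(n):
            s = A[i, :] @ x - A[i, i] * x[i]    # uses new x_j for j < i, old for j > i
            x[i] = (b[i] - s) / A[i, i]
        if np.linalg.norm(b - A @ x) <= tol:
            break
    return x
```
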
Good Exercise: Redo the earlier example with Smart Jacobi and Smart Gauss-Seidel.
Example: Consider
$A = \begin{pmatrix} 2 & 1 & 0 \\ 1 & 3 & 1 \\ 0 & 1 & 2 \end{pmatrix}, \quad b = \begin{pmatrix} 1 \\ 8 \\ 5 \end{pmatrix}, \quad x^{(0)} = \begin{pmatrix} 0 \\ 0 \\ 0 \end{pmatrix}$

Conjugate Gradients
Suppose that $A$ is $n \times n$, symmetric, and positive definite. Since $A$ is positive definite, $x^T A x > 0$ for all nonzero $x \in \mathbb{R}^n$. Define a quadratic function
$\phi(x) = \frac{1}{2} x^T A x - x^T b$
It turns out that $-\nabla \phi = b - Ax = r$, so $\phi(x)$ has a minimum at the $x$ such that $Ax = b$.
Optimization methods look in a search direction and pick the best step:
$x^{(k+1)} = x^{(k)} + \alpha s^{(k)}$
Choose $\alpha$ so that $\phi(x^{(k)} + \alpha s^{(k)})$ is minimized in the direction of $s^{(k)}$:
$0 = \frac{d}{d\alpha} \phi(x^{(k+1)}) = \nabla\phi(x^{(k+1)})^T \frac{d}{d\alpha} x^{(k+1)} = (-r^{(k+1)})^T \frac{d}{d\alpha}(x^{(k)} + \alpha s^{(k)}) = (-r^{(k+1)})^T s^{(k)}.$

Conjugate Gradients
Find $\alpha$ so that $\phi$ is minimized: $0 = (r^{(k+1)})^T s^{(k)}$. We also know
$r^{(k+1)} = b - Ax^{(k+1)} = b - A(x^{(k)} + \alpha s^{(k)}) = r^{(k)} - \alpha A s^{(k)}$
So the optimal search parameter is
$\alpha = \frac{(r^{(k)})^T s^{(k)}}{(s^{(k)})^T A s^{(k)}}$
This is CG: take a step in a search direction.

Conjugate Gradients
Neat trick: we can compute $r$ without explicitly forming $b - Ax$:
$r^{(k+1)} = b - Ax^{(k+1)} = b - A(x^{(k)} + \alpha s^{(k)}) = b - Ax^{(k)} - \alpha A s^{(k)} = r^{(k)} - \alpha A s^{(k)}$

Conjugate Gradients
How should we pick $s^{(k)}$? Note that $-\nabla\phi = b - Ax = r$, so $r$ is the direction of steepest descent of $\phi$ (at any $x$), and this is a good direction. Thus, pick $s^{(0)} = r^{(0)} = b - Ax^{(0)}$.
What is $s^{(1)}$? It should be in the direction of $r^{(1)}$, but conjugate to $s^{(0)}$: $(s^{(1)})^T A s^{(0)} = 0$. (Two vectors $u$ and $v$ are A-conjugate if $u^T A v = 0$.)
So, if we let $s^{(1)} = r^{(1)} + \beta s^{(0)}$, we can require
$0 = (s^{(1)})^T A s^{(0)} = ((r^{(1)})^T + \beta (s^{(0)})^T) A s^{(0)} = (r^{(1)})^T A s^{(0)} + \beta (s^{(0)})^T A s^{(0)}$
or
$\beta = -(r^{(1)})^T A s^{(0)} / (s^{(0)})^T A s^{(0)}.$
The same construction holds for $s^{(k+1)}$ in terms of $r^{(k+1)} + \beta^{(k+1)} s^{(k)}$.
Further simplification (which is not simple to carry out) yields a simple method that requires only one matrix-vector product per step:

Conjugate Gradients
x^(0) = initial guess
r^(0) = b - A x^(0)
s^(0) = r^(0)
for k = 0, 1, 2, ...
    alpha^(k) = (r^(k))^T r^(k) / (s^(k))^T A s^(k)
    x^(k+1) = x^(k) + alpha^(k) s^(k)
    r^(k+1) = r^(k) - alpha^(k) A s^(k)
    beta^(k+1) = (r^(k+1))^T r^(k+1) / (r^(k))^T r^(k)
    s^(k+1) = r^(k+1) + beta^(k+1) s^(k)
end

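A direct NumPy transcription of the algorithm above, with one matrix-vector product $As^{(k)}$ per step; the stopping test on the residual norm and the default iteration cap are my own additions.

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, k_max=None):
    """CG for symmetric positive definite A, following the slide's recurrences."""
    x = x0.astype(float).copy()
    r = b - A @ x
    s = r.copy()
    k_max = k_max if k_max is not None else len(b)
    for k in range(k_max):
        As = A @ s                              # the single mat-vec per step
        alpha = (r @ r) / (s @ As)
        x = x + alpha * s
        r_new = r - alpha * As
        if np.linalg.norm(r_new) <= tol:
            return x
        beta = (r_new @ r_new) / (r @ r)
        s = r_new + beta * s
        r = r_new
    return x
```
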