Vector and Matrix Norms I

How do we measure errors for scalars, vectors, and matrices?

Scalar: absolute error $|\hat\alpha - \alpha|$; relative error $|\hat\alpha - \alpha|/|\alpha|$.

Vectors: use a vector norm. A norm is a measure of distance.
Vector and Matrix Norms II

$l$-norm: $\|x\|_l = \left(|x_1|^l + \cdots + |x_n|^l\right)^{1/l}$

1-norm: $\|x\|_1 = \sum_{i=1}^n |x_i|$

$\infty$-norm: $\|x\|_\infty = \lim_{l\to\infty} \|x\|_l = \max_i |x_i|$, because

$$\left(\max_i |x_i|^l\right)^{1/l} \le \left(|x_1|^l + \cdots + |x_n|^l\right)^{1/l} \le \left(n \max_i |x_i|^l\right)^{1/l} = n^{1/l} \max_i |x_i| \to \max_i |x_i|.$$
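For concreteness, here is a minimal C sketch of these three norms (the function names are ours, not from the course code):

    #include <math.h>

    /* 1-norm: sum of absolute values */
    double norm1(const double *x, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += fabs(x[i]);
        return s;
    }

    /* 2-norm: sqrt of the sum of squares (no overflow/underflow
       guard; scaling would be needed for extreme components) */
    double norm2(const double *x, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += x[i] * x[i];
        return sqrt(s);
    }

    /* infinity-norm: maximum absolute value */
    double norm_inf(const double *x, int n) {
        double m = 0.0;
        for (int i = 0; i < n; i++)
            if (fabs(x[i]) > m)
                m = fabs(x[i]);
        return m;
    }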
Vector and Matrix Norms III

Example: with

$$x = \begin{bmatrix} 1 \\ 100 \\ 9 \end{bmatrix}, \qquad \hat{x} = \begin{bmatrix} 1.1 \\ 99 \\ 11 \end{bmatrix},$$

$$\|\hat{x} - x\|_\infty = 2, \quad \frac{\|\hat{x} - x\|_\infty}{\|x\|_\infty} = 0.02, \quad \frac{\|\hat{x} - x\|_\infty}{\|\hat{x}\|_\infty} = 0.0202,$$

$$\|\hat{x} - x\|_2 = 2.238, \quad \frac{\|\hat{x} - x\|_2}{\|x\|_2} = 0.0223, \quad \frac{\|\hat{x} - x\|_2}{\|\hat{x}\|_2} = 0.0225.$$

All norms are equivalent:
Vector and Matrix Norms IV

For any two norms $\|\cdot\|_{l_1}$ and $\|\cdot\|_{l_2}$, there exist $c_1$ and $c_2$ such that for all $x \in R^n$

$$c_1 \|x\|_{l_1} \le \|x\|_{l_2} \le c_2 \|x\|_{l_1}.$$

Example:

$$\|x\|_2 \le \|x\|_1 \le \sqrt{n}\,\|x\|_2 \qquad (1)$$
$$\|x\|_\infty \le \|x\|_2 \le \sqrt{n}\,\|x\|_\infty \qquad (2)$$
$$\|x\|_\infty \le \|x\|_1 \le n\,\|x\|_\infty \qquad (3)$$
Vector and Matrix Norms V

Therefore, you can choose whichever norm is convenient.

HW5-1: prove (1)-(3).

Matrix norm: how do we define the distance between two matrices? A norm should satisfy

$$\|A\| \ge 0, \qquad \|A + B\| \le \|A\| + \|B\| \quad (4), \qquad \|\alpha A\| = |\alpha|\,\|A\|,$$

where $\alpha$ is a scalar.
Vector and Matrix Norms VI

Definition (induced norm):

$$\|A\|_l \equiv \max_{x \ne 0} \frac{\|Ax\|_l}{\|x\|_l} = \max_{\|x\|_l = 1} \|Ax\|_l$$

Proof of (4):

$$\|A + B\| = \max_{\|x\|=1} \|(A+B)x\| \le \max_{\|x\|=1} \big(\|Ax\| + \|Bx\|\big) \le \max_{\|x\|=1} \|Ax\| + \max_{\|x\|=1} \|Bx\|$$
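For reference, a standard fact (not derived on these slides) that makes two induced norms computable without maximizing over all $x$:

$$\|A\|_1 = \max_j \sum_i |a_{ij}| \ \text{(max absolute column sum)}, \qquad \|A\|_\infty = \max_i \sum_j |a_{ij}| \ \text{(max absolute row sum)}.$$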
Relative error I

Calculating $|\hat\alpha - \alpha|/|\alpha|$ is usually not practical, because $\alpha$ is unknown.

$|\hat\alpha - \alpha|/|\hat\alpha|$ is a more reasonable estimate.
Relative error II

If

$$\frac{|\hat\alpha - \alpha|}{|\hat\alpha|} \le 0.1,$$

then

$$0.9\,|\hat\alpha| \le |\alpha| \le 1.1\,|\hat\alpha|.$$

Proof:

$$|\alpha| \ge |\hat\alpha| - |\hat\alpha - \alpha| \ge |\hat\alpha| - 0.1\,|\hat\alpha| = 0.9\,|\hat\alpha|, \qquad |\alpha| \le |\hat\alpha| + |\hat\alpha - \alpha| \le 1.1\,|\hat\alpha|.$$
Relative error III

Similarly,

$$|\hat\alpha - \alpha| \le 0.1\,|\hat\alpha| \;\Rightarrow\; |\alpha| \ge 0.9\,|\hat\alpha|.$$
Condition of a Linear System I

Solving a linear system:

$$\begin{bmatrix} 10 & 7 & 8 & 7 \\ 7 & 5 & 6 & 5 \\ 8 & 6 & 10 & 9 \\ 7 & 5 & 9 & 10 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 32 \\ 23 \\ 33 \\ 31 \end{bmatrix}, \quad \text{solution} = \begin{bmatrix} 1 \\ 1 \\ 1 \\ 1 \end{bmatrix}$$

Right-hand side slightly modified:

$$\begin{bmatrix} 10 & 7 & 8 & 7 \\ 7 & 5 & 6 & 5 \\ 8 & 6 & 10 & 9 \\ 7 & 5 & 9 & 10 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 32.1 \\ 22.9 \\ 33.1 \\ 30.9 \end{bmatrix}, \quad \text{solution} = \begin{bmatrix} 9.2 \\ -12.6 \\ 4.5 \\ -1.1 \end{bmatrix}$$
Condition of a Linear System II

A small modification causes a huge error.

Matrix slightly modified:

$$\begin{bmatrix} 10 & 7 & 8.1 & 7.2 \\ 7.08 & 5.04 & 6 & 5 \\ 8 & 5.98 & 9.89 & 9 \\ 6.99 & 4.99 & 9 & 9.98 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 32 \\ 23 \\ 33 \\ 31 \end{bmatrix}, \quad \text{solution} = \begin{bmatrix} -81 \\ 137 \\ -34 \\ 22 \end{bmatrix}$$
Condition of a Linear System III

Right-hand side modified:

$$Ax = b, \qquad A(x + \delta x) = b + \delta b \;\Rightarrow\; \delta x = A^{-1}\delta b$$
$$\|\delta x\| \le \|A^{-1}\|\,\|\delta b\|, \qquad \|b\| \le \|A\|\,\|x\|$$
$$\Rightarrow\; \frac{\|\delta x\|}{\|x\|} \le \|A\|\,\|A^{-1}\|\,\frac{\|\delta b\|}{\|b\|}$$
Condition of a Linear System IV

Matrix modified:

$$Ax = b, \qquad (A + \delta A)(x + \delta x) = b$$
$$Ax + A\delta x + \delta A(x + \delta x) = b \;\Rightarrow\; \delta x = -A^{-1}\delta A(x + \delta x)$$
$$\|\delta x\| \le \|A^{-1}\|\,\|\delta A\|\,\|x + \delta x\| \;\Rightarrow\; \frac{\|\delta x\|}{\|x + \delta x\|} \le \|A\|\,\|A^{-1}\|\,\frac{\|\delta A\|}{\|A\|}$$

$\|A\|\,\|A^{-1}\|$ is defined as the condition number of $A$. A smaller condition number is better.
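As a sanity check of this bound on the example above, using the $\infty$-norm (the inverse below is the well-known exact integer inverse of that $4 \times 4$ matrix; worth verifying before quoting):

$$A^{-1} = \begin{bmatrix} 25 & -41 & 10 & -6 \\ -41 & 68 & -17 & 10 \\ 10 & -17 & 5 & -3 \\ -6 & 10 & -3 & 2 \end{bmatrix}, \qquad \|A\|_\infty = 33, \quad \|A^{-1}\|_\infty = 136,$$

so $\kappa_\infty(A) = 33 \cdot 136 = 4488$. With $\|\delta b\|_\infty / \|b\|_\infty = 0.1/33$, the bound gives $\|\delta x\|_\infty / \|x\|_\infty \le 4488 \cdot (0.1/33) = 13.6$; the observed perturbation $\delta x = (8.2, -13.6, 3.5, -2.1)^T$ attains it exactly.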
Sparse matrices: Storage Schemes I

Most elements are zero. Very common in engineering applications.

By not storing zeros, we can handle very large matrices.

An example:

$$A = \begin{bmatrix} 1 & 0 & 0 & 2 \\ 3 & 4 & 0 & 5 \\ 6 & 0 & 7 & 8 \\ 0 & 0 & 10 & 11 \end{bmatrix}$$

Storage schemes: there are different ways to store sparse matrices.
Sparse matrices: Storage Schemes II

Coordinate format:

    a        = (1 3 6 4 7 10 2 5 8 11)
    arow_ind = (1 2 3 2 3 4 1 2 3 4)
    acol_ind = (1 1 1 2 3 3 4 4 4 4)

Indices may not be well ordered.

Is it easy to do operations such as $A + B$ and $Ax$?

$A + B$: if the $(i, j)$ pairs are not ordered, difficult.

$y = Ax$:

    for l = 1:nnz
        i = arow_ind(l)
        j = acol_ind(l)
        y(i) = y(i) + a(l)*x(j)
    end
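A runnable C version of this loop, using 0-based indices as the C code later in these slides does (the 0-based shift and the function name are our choices):

    /* y = y + A*x, with A in coordinate (COO) format, 0-based indices */
    void coo_spmv(int nnz, const double *a, const int *arow_ind,
                  const int *acol_ind, const double *x, double *y)
    {
        for (int l = 0; l < nnz; l++)
            y[arow_ind[l]] += a[l] * x[acol_ind[l]];
    }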
Sparse matrices: Storage Schemes III

nnz: commonly used to denote the number of nonzeros.

x: a vector in dense format. In general we store vectors directly, without a sparse format.

Accessing one column (column $i$, written into a dense vector x):

    for l = 1:nnz
        if acol_ind(l) == i
            x(arow_ind(l)) = a(l)
        end
    end

Cost: $O(\mathrm{nnz})$
Sparse matrices: Storage Schemes IV

When is column access used? Solving $Lx = b$ by forward substitution:

$$\begin{bmatrix} l_{11} & & & \\ l_{21} & l_{22} & & \\ \vdots & \vdots & \ddots & \\ l_{n1} & l_{n2} & \cdots & l_{nn} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ \vdots \\ x_n \end{bmatrix} = \begin{bmatrix} b_1 \\ b_2 \\ \vdots \\ b_n \end{bmatrix}$$

After $x_1 = b_1 / l_{11}$, update

$$\begin{bmatrix} b_2 \\ \vdots \\ b_n \end{bmatrix} \leftarrow \begin{bmatrix} b_2 \\ \vdots \\ b_n \end{bmatrix} - x_1 \begin{bmatrix} l_{21} \\ \vdots \\ l_{n1} \end{bmatrix},$$

so columns of $L$ are accessed one at a time. This suits the compressed column format:
Sparse matrices: Storage Schemes V

    a        = (1 3 6 4 7 10 2 5 8 11)
    arow_ind = (1 2 3 2 3 4 1 2 3 4)
    acol_ptr = (1 4 5 7 11)

jth column: from a(acol_ptr(j)) to a(acol_ptr(j+1)-1).

Example, 3rd column: acol_ptr(3) = 5 and acol_ptr(4) = 7, so the entries are a(5) = 7 and a(6) = 10.

nnz = acol_ptr(n+1) - 1; acol_ptr contains $n + 1$ elements.
Sparse matrices: Storage Schemes VI

C = A + B:

    for j = 1:n
        get A's jth column
        get B's jth column
        do a vector addition
    end

C is still in column format.

$y = Ax = A_{:,1} x_1 + \cdots + A_{:,n} x_n$:

    for j = 1:n
        for l = acol_ptr(j):acol_ptr(j+1)-1
            y(arow_ind(l)) = y(arow_ind(l)) + a(l)*x(j)
        end
    end
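The same product in compressed column format as a C function, again 0-based as in the C arrays shown below (a sketch, not the course's official code):

    /* y = y + A*x, with A (n columns) in compressed column (CSC)
       format, 0-based: column j is a[acol_ptr[j]] .. a[acol_ptr[j+1]-1] */
    void csc_spmv(int n, const double *a, const int *arow_ind,
                  const int *acol_ptr, const double *x, double *y)
    {
        for (int j = 0; j < n; j++)
            for (int l = acol_ptr[j]; l < acol_ptr[j + 1]; l++)
                y[arow_ind[l]] += a[l] * x[j];
    }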
Sparse matrices: Storage Schemes VII

Row indices within the same column may not be sorted:

    a        = (6 3 1 4 7 10 2 5 8 11)
    arow_ind = (3 2 1 2 3 4 1 2 3 4)
    acol_ptr = (1 4 5 7 11)

C = AB is similar.

Accessing one column is easy; accessing one row is very difficult.

Compressed row format, for the same matrix

$$A = \begin{bmatrix} 1 & 0 & 0 & 2 \\ 3 & 4 & 0 & 5 \\ 6 & 0 & 7 & 8 \\ 0 & 0 & 10 & 11 \end{bmatrix}:$$
Sparse matrices: Storage Schemes VIII

    a        = (1 2 3 4 5 6 7 8 10 11)
    acol_ind = (1 4 1 2 4 1 3 4 3 4)
    arow_ptr = (1 3 6 9 11)

In C (0-based indexing), the compressed column arrays become

    a        = (1 3 6 4 7 10 2 5 8 11)
    arow_ind = (0 1 2 1 2 3 0 1 2 3)
    acol_ptr = (0 3 4 6 10)

There are many variations of sparse structures, so it is very difficult to have standard sparse libraries. Different formats suit different matrices.

A C implementation using the row format:
Sparse matrices: Storage Schemes IX

    typedef struct row_elt {
        int col, nxt_row, nxt_idx;
        Real val;
    } row_elt;

    typedef struct SPROW {
        int len, maxlen, diag;
        row_elt *elt;
    } SPROW;

    typedef struct SPMAT {
        int m, n, max_m, max_n;
        char flag_col, flag_diag;
        SPROW *row;
        int *start_row;
        int *start_idx;
    } SPMAT;

Sparse matrices: Storage Schemes X

To scan row i:

    len = A->row[i].len;
    for (j_idx = 0; j_idx < len; j_idx++)
        printf("%d %d %g\n", i,
               A->row[i].elt[j_idx].col,
               A->row[i].elt[j_idx].val);

Object-oriented design is a big challenge for sparse matrices.

Homework 5-2:
Sparse matrices: Storage Schemes XI

Write your own sparse matrix-matrix product code and compare it with Intel MKL (which now supports sparse operations).

Generate the random matrices yourself; any reasonable size is OK.
Sparse Matrix and Factorization I

A more advanced topic.

Factorization generates fill-ins (fill-ins: new nonzero positions).

A Matlab program:

    A = sprandsym(200, 0.05, 0.01, 1);
    L = chol(A);
    spy(A); print -deps A
    spy(L); print -deps L

0.05: density; 0.01: 1/(condition number).
Sparse Matrix and Factorization II

1: type of matrix; kind 1 gives a matrix with 1/(condition number) exactly 0.01.

spy: draws the sparsity pattern.

[Figure: spy plots of the sparsity patterns, (a) A with nz = 1984 and (b) L with nz = 2919]
Sparse Matrix and Factorization III

Clearly L is denser.
Permutation and Reordering I

$$A = \begin{bmatrix} 3 & 2 & 1 & 2 \\ 2 & 4 & 0 & 0 \\ 1 & 0 & 5 & 0 \\ 2 & 0 & 0 & 6 \end{bmatrix},$$

$$\mathrm{chol}(A) = \begin{bmatrix} 1.7321 & 0 & 0 & 0 \\ 1.1547 & 1.6330 & 0 & 0 \\ 0.5774 & -0.4082 & 2.1213 & 0 \\ 1.1547 & -0.8165 & -0.4714 & 1.9437 \end{bmatrix}$$

(the lower-triangular Cholesky factor; the factor is completely dense).
Permutation and Reordering II

$$P = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix}, \qquad AP = \begin{bmatrix} 2 & 1 & 2 & 3 \\ 0 & 0 & 4 & 2 \\ 0 & 5 & 0 & 1 \\ 6 & 0 & 0 & 2 \end{bmatrix}$$
Permutation and Reordering III

$$PAP^T = \begin{bmatrix} 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \\ 0 & 1 & 0 & 0 \\ 1 & 0 & 0 & 0 \end{bmatrix} \begin{bmatrix} 2 & 1 & 2 & 3 \\ 0 & 0 & 4 & 2 \\ 0 & 5 & 0 & 1 \\ 6 & 0 & 0 & 2 \end{bmatrix} = \begin{bmatrix} 6 & 0 & 0 & 2 \\ 0 & 5 & 0 & 1 \\ 0 & 0 & 4 & 2 \\ 2 & 1 & 2 & 3 \end{bmatrix}$$

(for this particular $P$, $P^T = P$, so $AP = AP^T$).
Permutation and Reordering IV

$$\mathrm{chol}(PAP^T) = \begin{bmatrix} 2.4495 & 0 & 0 & 0 \\ 0 & 2.2361 & 0 & 0 \\ 0 & 0 & 2.0000 & 0 \\ 0.8165 & 0.4472 & 1.0000 & 1.0646 \end{bmatrix}$$

$\mathrm{chol}(PAP^T)$ is sparser than $\mathrm{chol}(A)$.

$$Ax = b \iff (PAP^T)(Px) = Pb,$$

so solve for $Px$ and then recover $x$.

There are different ways to permute:
Permutation and Reordering V

Reordering algorithms (from Matlab's help):

    colmmd   - Column minimum degree permutation.
    symmmd   - Symmetric minimum degree permutation.
    symrcm   - Symmetric reverse Cuthill-McKee permutation.
    colperm  - Column permutation.
    randperm - Random permutation.
    dmperm   - Dulmage-Mendelsohn permutation.

Finding the ordering whose factorization has the fewest entries is the minimum fill-in problem.

Minimum fill-in may not be the best goal: one must also consider numerical stability, implementation effort, etc.
Iterative Methods I

Chapter 10 of the book Matrix Computations (Golub and Van Loan).

An iterative process: $x_1, x_2, \ldots$; we hope $x_k \to x$ with $Ax = b$.

Gaussian elimination is $O(n^3)$.

If each step $x_k \to x_{k+1}$ takes $O(n^r)$ and $l$ iterations are needed, with $n^r l < n^3$, iterative methods can be faster.

Accuracy and sparsity are other considerations.
Jacobi and Gauss-Seidel Method I

A three-by-three system $Ax = b$:

$$x_1 = (b_1 - a_{12}x_2 - a_{13}x_3)/a_{11}$$
$$x_2 = (b_2 - a_{21}x_1 - a_{23}x_3)/a_{22}$$
$$x_3 = (b_3 - a_{31}x_1 - a_{32}x_2)/a_{33}$$

$x_k$: an approximation to $x = A^{-1}b$. Jacobi iteration:

$$(x_{k+1})_1 = \big(b_1 - a_{12}(x_k)_2 - a_{13}(x_k)_3\big)/a_{11}$$
$$(x_{k+1})_2 = \big(b_2 - a_{21}(x_k)_1 - a_{23}(x_k)_3\big)/a_{22}$$
$$(x_{k+1})_3 = \big(b_3 - a_{31}(x_k)_1 - a_{32}(x_k)_2\big)/a_{33}$$
Jacobi and Gauss-Seidel Method II

The general case:

$$(x_{k+1})_i = \Big(b_i - \sum_{j=1}^{i-1} a_{ij}(x_k)_j - \sum_{j=i+1}^{n} a_{ij}(x_k)_j\Big)\Big/ a_{ii}, \quad i = 1, \ldots, n$$

Gauss-Seidel iteration:

$$(x_{k+1})_1 = \big(b_1 - a_{12}(x_k)_2 - a_{13}(x_k)_3\big)/a_{11}$$
$$(x_{k+1})_2 = \big(b_2 - a_{21}(x_{k+1})_1 - a_{23}(x_k)_3\big)/a_{22}$$
$$(x_{k+1})_3 = \big(b_3 - a_{31}(x_{k+1})_1 - a_{32}(x_{k+1})_2\big)/a_{33}$$

The general case:
Jacobi and Gauss-Seidel Method III

$$(x_{k+1})_i = \Big(b_i - \sum_{j=1}^{i-1} a_{ij}(x_{k+1})_j - \sum_{j=i+1}^{n} a_{ij}(x_k)_j\Big)\Big/ a_{ii}, \quad i = 1, \ldots, n$$

The iterates may diverge:

$$A = \begin{bmatrix} 1 & 2 & 2 \\ 2 & 1 & 2 \\ 2 & 2 & 1 \end{bmatrix}, \qquad b = \begin{bmatrix} 5 \\ 5 \\ 5 \end{bmatrix}, \qquad \text{solution} = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}$$
Jacobi and Gauss-Seidel Method IV

$$\det(A) = 1 + 8 + 8 - 4 - 4 - 4 = 5 \ne 0$$

Jacobi iterates:

$$x_0 = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix}, \quad x_1 = \begin{bmatrix} 5 \\ 5 \\ 5 \end{bmatrix}, \quad x_2 = \begin{bmatrix} 5 - 10 - 10 \\ 5 - 10 - 10 \\ 5 - 10 - 10 \end{bmatrix} = \begin{bmatrix} -15 \\ -15 \\ -15 \end{bmatrix},$$

$$x_3 = \begin{bmatrix} 5 + 30 + 30 \\ 5 + 30 + 30 \\ 5 + 30 + 30 \end{bmatrix} = \begin{bmatrix} 65 \\ 65 \\ 65 \end{bmatrix}$$
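For concreteness, one sweep of each method in C on a dense, row-major matrix (a sketch under the assumption of dense storage; in practice $A$ would be sparse and stored as in the previous section):

    /* One Jacobi sweep: xnew_i = (b_i - sum_{j != i} a_ij xold_j) / a_ii */
    void jacobi_sweep(int n, const double *A, const double *b,
                      const double *xold, double *xnew)
    {
        for (int i = 0; i < n; i++) {
            double s = b[i];
            for (int j = 0; j < n; j++)
                if (j != i)
                    s -= A[i * n + j] * xold[j];
            xnew[i] = s / A[i * n + i];
        }
    }

    /* One Gauss-Seidel sweep: x is updated in place, so components
       x_0 .. x_{i-1} already hold their new values when x_i is set */
    void gauss_seidel_sweep(int n, const double *A, const double *b,
                            double *x)
    {
        for (int i = 0; i < n; i++) {
            double s = b[i];
            for (int j = 0; j < n; j++)
                if (j != i)
                    s -= A[i * n + j] * x[j];
            x[i] = s / A[i * n + i];
        }
    }

Running a few Jacobi sweeps on the 3x3 example above reproduces the diverging iterates $(5, -15, 65, \ldots)$.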
Jacobi and Gauss-Seidel Method V

Convergence: does the method eventually reach a solution?

The Jacobi iteration is like

$$a_{11}x_1 = b_1 - a_{12}x_2 - a_{13}x_3$$
$$a_{22}x_2 = b_2 - a_{21}x_1 - a_{23}x_3$$
$$a_{33}x_3 = b_3 - a_{31}x_1 - a_{32}x_2$$

i.e.

$$\begin{bmatrix} a_{11} & & \\ & a_{22} & \\ & & a_{33} \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} = -\begin{bmatrix} 0 & a_{12} & a_{13} \\ a_{21} & 0 & a_{23} \\ a_{31} & a_{32} & 0 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \end{bmatrix} + b$$
Jacobi and Gauss-Seidel Method VI

If

$$M = \begin{bmatrix} a_{11} & & \\ & a_{22} & \\ & & a_{33} \end{bmatrix} \quad \text{and} \quad N = -\begin{bmatrix} 0 & a_{12} & a_{13} \\ a_{21} & 0 & a_{23} \\ a_{31} & a_{32} & 0 \end{bmatrix},$$

then

$$A = M - N \quad \text{and} \quad Mx_{k+1} = Nx_k + b$$
Jacobi and Gauss-Seidel Method VII

Spectral radius:

$$\rho(A) = \max\{|\lambda| : \lambda \in \lambda(A)\},$$

where $\lambda(A)$ contains all eigenvalues of $A$.

Theorem 1: If $A = M - N$ with $A$, $M$ non-singular and $\rho(M^{-1}N) < 1$, then $Mx_{k+1} = Nx_k + b$ leads to the convergence of $\{x_k\}$ to $A^{-1}b$ for any starting vector $x_0$.
Jacobi and Gauss-Seidel Method VIII

Proof:

$$Ax = b \iff Mx = Nx + b$$
$$M(x_{k+1} - x) = N(x_k - x) \;\Rightarrow\; x_{k+1} - x = M^{-1}N(x_k - x)$$
$$\Rightarrow\; x_{k+1} - x = (M^{-1}N)^k (x_1 - x)$$
$$\rho(M^{-1}N) < 1 \;\Rightarrow\; (M^{-1}N)^k \to 0 \;\Rightarrow\; x_{k+1} - x \to 0$$
Jacobi and Gauss-Seidel Method IX

Why does $\rho(M^{-1}N) < 1$ imply $(M^{-1}N)^k \to 0$? The proof is quite involved, so we omit the derivation here.
Reasons of Gauss-Seidel Methods I

The optimization problem

$$\min_x \; \frac{1}{2} x^T A x - b^T x$$

is the same as solving $Ax - b = 0$ if $A$ is symmetric positive definite.
Reasons of Gauss-Seidel Methods II

If

$$f(x) = \frac{1}{2} x^T A x - b^T x,$$

then

$$\nabla f(x) = Ax - b, \quad \text{where} \quad \nabla f(x) = \begin{bmatrix} \partial f(x)/\partial x_1 \\ \vdots \\ \partial f(x)/\partial x_n \end{bmatrix}$$
Reasons of Gauss-Seidel Methods III

Remember

$$x^T A x = \sum_{i=1}^n x_i (Ax)_i = \sum_{i=1}^n x_i \sum_{j=1}^n A_{ij} x_j$$
$$= x_1 A_{11} x_1 + \cdots + x_1 A_{1n} x_n + x_2 A_{21} x_1 + \cdots + x_n A_{n1} x_1 + \cdots$$
Reasons of Gauss-Seidel Methods IV

Therefore

$$\frac{\partial (x^T A x)}{\partial x_1} = 2A_{11}x_1 + A_{12}x_2 + \cdots + A_{1n}x_n + x_2 A_{21} + \cdots + x_n A_{n1} = 2(A_{11}x_1 + \cdots + A_{1n}x_n),$$

where the last equality uses the symmetry $A_{ij} = A_{ji}$.
Reasons of Gauss-Seidel Methods V

Sequentially update one variable:

$$\min_{x_1} f(x_1, x_2^k, \ldots, x_n^k)$$
$$\min_{x_2} f(x_1^{k+1}, x_2, \ldots, x_n^k)$$
$$\min_{x_3} f(x_1^{k+1}, x_2^{k+1}, x_3, \ldots, x_n^k)$$
$$\vdots$$
Reasons of Gauss-Seidel Methods VI

Updating variable $i$ means solving

$$\min_d \; \frac{1}{2}(x + d e_i)^T A (x + d e_i) - b^T (x + d e_i) = \min_d \; \frac{1}{2} d^2 A_{ii} + d\big((Ax)_i - b_i\big) + \text{const}$$

Setting the derivative to zero:

$$A_{ii} d + (Ax)_i - b_i = 0 \;\Rightarrow\; d = \frac{b_i - (Ax)_i}{A_{ii}},$$

so

$$x_i + d = \frac{b_i - \sum_{j: j \ne i} A_{ij} x_j}{A_{ii}},$$

which is exactly the Gauss-Seidel update of $x_i$.
Reasons of Gauss-Seidel Methods VII

Note that

$$e_i = [\underbrace{0, \ldots, 0}_{i-1}, 1, 0, \ldots, 0]^T$$
Conjugate Gradient Method I

For symmetric positive definite matrices only.

One of the most frequently used iterative methods.

Before introducing CG, we discuss a related method, the steepest descent method.

We still consider solving

$$\min_x \; \frac{1}{2} x^T A x - b^T x$$
Steepest Descent Method I

Gradient direction. In one dimension,

$$f(x) = f(x_k) + f'(x_k)(x - x_k) + \frac{1}{2} f''(x_k)(x - x_k)^2 + \cdots$$

In general,

$$f(x) = f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{1}{2}(x - x_k)^T \nabla^2 f(x_k)(x - x_k) + \cdots$$

Omitting the terms from $\frac{1}{2}(x - x_k)^T \nabla^2 f(x_k)(x - x_k)$ on:

$$f(x) \approx f(x_k) + \nabla f(x_k)^T (x - x_k)$$
Steepest Descent Method II

Minimize $\nabla f(x_k)^T (x - x_k)$. If we solve

$$\min_{\|x - x_k\| = 1} \nabla f(x_k)^T (x - x_k),$$

then

$$x - x_k = \frac{-\nabla f(x_k)}{\|\nabla f(x_k)\|}$$

is the direction.
Steepest Descent Method III

Note $a^T b = \|a\|\,\|b\| \cos\theta$, and $\cos\pi = -1$ gives the minimum.

This picture works in 2D, but how do you prove it in general? By the Cauchy-Schwarz inequality, $a^T b \ge -\|a\|\,\|b\|$, with equality exactly when $a$ and $b$ point in opposite directions.

Now $\nabla f(x_k) = Ax_k - b$, and we move as

$$x \leftarrow x_k - \alpha \nabla f(x_k)$$
Steepest Descent Method IV

Let $r = -\nabla f(x_k) = b - Ax_k$ and $x = x_k$. Minimize along the direction:

$$\min_\alpha \; \frac{1}{2}(x + \alpha r)^T A (x + \alpha r) - (x + \alpha r)^T b$$
$$= \min_\alpha \; \frac{1}{2}\alpha^2 r^T A r + \alpha r^T A x - \alpha r^T b + \text{const}$$
$$= \min_\alpha \; \frac{1}{2}\alpha^2 r^T A r + \alpha r^T (Ax - b) + \text{const}$$
Steepest Descent Method V

A problem of one variable:

$$\alpha r^T A r + r^T(Ax - b) = 0 \;\Rightarrow\; \alpha = \frac{r^T(b - Ax)}{r^T A r}$$

Note $r^T A r > 0$ if $A$ is positive definite (and $r \ne 0$).

Now, with $r = b - Ax_k$:

$$\alpha = \frac{(b - Ax_k)^T (b - Ax_k)}{(b - Ax_k)^T A (b - Ax_k)}$$
Steepest Descent Method VI

The algorithm:

    k = 0; x_0 = 0; r_0 = b
    while r_k != 0
        k = k + 1
        alpha_k = r_{k-1}^T r_{k-1} / r_{k-1}^T A r_{k-1}
        x_k = x_{k-1} + alpha_k r_{k-1}
        r_k = b - A x_k
    end

It converges, but may be very slow.
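A C sketch of this algorithm for a dense SPD matrix (helper names are ours; note that r is updated as $r - \alpha Ar$, which equals $b - Ax$ in exact arithmetic and saves one product):

    #include <math.h>
    #include <string.h>

    /* w = A*v for a dense row-major n x n matrix */
    static void matvec(int n, const double *A, const double *v, double *w)
    {
        for (int i = 0; i < n; i++) {
            w[i] = 0.0;
            for (int j = 0; j < n; j++)
                w[i] += A[i * n + j] * v[j];
        }
    }

    static double dot(int n, const double *u, const double *v)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += u[i] * v[i];
        return s;
    }

    /* Steepest descent for SPD A; x starts at 0.  r and Ar are
       caller-provided work arrays of length n. */
    void steepest_descent(int n, const double *A, const double *b,
                          double *x, double *r, double *Ar,
                          double tol, int maxit)
    {
        memset(x, 0, n * sizeof(double));
        memcpy(r, b, n * sizeof(double));        /* r_0 = b - A*0 = b */
        for (int k = 0; k < maxit && sqrt(dot(n, r, r)) > tol; k++) {
            matvec(n, A, r, Ar);
            double alpha = dot(n, r, r) / dot(n, r, Ar);
            for (int i = 0; i < n; i++) {
                x[i] += alpha * r[i];            /* x_k = x_{k-1} + alpha r */
                r[i] -= alpha * Ar[i];           /* r_k = r_{k-1} - alpha A r */
            }
        }
    }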
General Search Directions I

Suppose we obtain $x_k$ from $x_{k-1}$ by minimizing along a direction $p_k$:

$$f(x_{k-1} + \alpha p_k) = \frac{1}{2}(x_{k-1} + \alpha p_k)^T A (x_{k-1} + \alpha p_k) - b^T (x_{k-1} + \alpha p_k)$$
$$= \text{const} + \alpha p_k^T (A x_{k-1} - b) + \frac{1}{2}\alpha^2 p_k^T A p_k$$

$$\Rightarrow\; \alpha = \frac{p_k^T r_{k-1}}{p_k^T A p_k}, \qquad r_{k-1} = b - Ax_{k-1}$$
General Search Directions II

A more general algorithm:

    k = 0; x_0 = 0; r_0 = b
    while r_k != 0
        k = k + 1
        Choose a direction p_k such that p_k^T r_{k-1} != 0
        alpha_k = p_k^T r_{k-1} / p_k^T A p_k
        x_k = x_{k-1} + alpha_k p_k
        r_k = b - A x_k
    end

By this setting,

$$x_k \in x_0 + \mathrm{span}\{p_1, \ldots, p_k\}$$
General Search Directions III

The question is then how to choose suitable directions?
Conjugate Gradient Method I

We hope that $p_1, \ldots, p_n$ are linearly independent and

$$x_k = \arg\min_{x \in x_0 + \mathrm{span}\{p_1, \ldots, p_k\}} f(x) \qquad (5)$$

With (5),

$$\mathrm{span}\{p_1, \ldots, p_n\} = R^n,$$

so

$$x_n = \arg\min_{x \in R^n} f(x), \quad \text{and then} \quad Ax_n = b,$$
Conjugate Gradient Method II

and the procedure stops in at most $n$ iterations.

But how do we maintain (5)? Let

$$x_k = x_0 + P_{k-1} y + \alpha p_k,$$

where

$$P_{k-1} = [p_1, \ldots, p_{k-1}], \quad y \in R^{k-1}, \quad \alpha \in R$$
Conjugate Gradient Method III

$$f(x_k) = \frac{1}{2}(x_0 + P_{k-1}y + \alpha p_k)^T A (x_0 + P_{k-1}y + \alpha p_k) - b^T (x_0 + P_{k-1}y + \alpha p_k)$$
$$= \frac{1}{2}(x_0 + P_{k-1}y)^T A (x_0 + P_{k-1}y) - b^T (x_0 + P_{k-1}y) + \alpha p_k^T A(x_0 + P_{k-1}y) - \alpha b^T p_k + \frac{\alpha^2}{2} p_k^T A p_k$$
$$= f(x_0 + P_{k-1}y) + \alpha p_k^T A P_{k-1} y - \alpha p_k^T r_0 + \frac{\alpha^2}{2} p_k^T A p_k,$$

where $r_0 = b - Ax_0$.
Conjugate Gradient Method IV

$$\min_{x \in x_0 + \mathrm{span}\{p_1, \ldots, p_k\}} f(x) = \min_{y, \alpha} f(x_0 + P_{k-1}y + \alpha p_k)$$

is difficult because the term $\alpha p_k^T A P_{k-1} y$ involves both $\alpha$ and $y$.
Conjugate Gradient Method V

If

$$p_k \perp \mathrm{span}\{Ap_1, \ldots, Ap_{k-1}\},$$

then $\alpha p_k^T A P_{k-1} y = 0$ and

$$\min_{x \in x_0 + \mathrm{span}\{p_1, \ldots, p_k\}} f(x) = \min_y f(x_0 + P_{k-1}y) + \min_\alpha \Big({-\alpha p_k^T r_0} + \frac{\alpha^2}{2} p_k^T A p_k\Big),$$

two independent optimization problems.
Conjugate Gradient Method VI

Therefore, we require $p_k$ to be A-conjugate to $p_1, \ldots, p_{k-1}$. That is,

$$p_i^T A p_k = 0, \quad i = 1, \ldots, k-1$$

By induction,

$$x_{k-1} = \arg\min_y f(x_0 + P_{k-1}y)$$

The solution of the second problem is

$$\alpha_k = \frac{p_k^T r_0}{p_k^T A p_k}$$
Conjugate Gradient Method VII

Because of A-conjugacy,

$$p_k^T r_{k-1} = p_k^T(b - Ax_{k-1}) = p_k^T\big(b - A(x_0 + P_{k-1}y_{k-1})\big) = p_k^T r_0$$

We have $x_k = x_{k-1} + \alpha_k p_k$.

New algorithm:
Conjugate Gradient Method VIII

    k = 0; x_0 = 0; r_0 = b
    while r_k != 0
        k = k + 1
        Choose any p_k ⊥ span{Ap_1, ..., Ap_{k-1}}
            such that p_k^T r_{k-1} != 0
        alpha_k = p_k^T r_{k-1} / p_k^T A p_k
        x_k = x_{k-1} + alpha_k p_k
        r_k = b - A x_k
    end

Next, how to choose $p_k$? One way is to minimize the distance to $r_{k-1}$:
Conjugate Gradient Method IX

Reason: $r_{k-1}$ is now the negative gradient direction.

The algorithm becomes:

    k = 0; x_0 = 0; r_0 = b
    while r_k != 0
        k = k + 1
        if k = 1
            p_1 = r_0
        else
            Let p_k minimize ||p - r_{k-1}||_2 over all vectors
                p ⊥ span{Ap_1, ..., Ap_{k-1}}
        end
Conjugate Gradient Method X

        alpha_k = p_k^T r_{k-1} / p_k^T A p_k
        x_k = x_{k-1} + alpha_k p_k
        r_k = b - A x_k
    end
Conjugate Gradient Method XI

Lemma 2: If $p_k$ minimizes $\|p - r_{k-1}\|_2$ over all vectors $p \perp \mathrm{span}\{Ap_1, \ldots, Ap_{k-1}\}$, then

$$p_k = r_{k-1} - AP_{k-1}z_{k-1},$$

where $z_{k-1}$ solves

$$\min_z \|r_{k-1} - AP_{k-1}z\|_2,$$

and $P_k = [p_1, \ldots, p_k]$ is an $n \times k$ matrix.
Conjugate Gradient Method XII

Proof:

$AP_{k-1}z$: an element of the space spanned by $Ap_1, \ldots, Ap_{k-1}$.

$r_{k-1}$ is not in that space.

$z$: coefficients of the linear combination of $Ap_1, \ldots, Ap_{k-1}$.

$$p_k \perp \mathrm{span}\{Ap_1, \ldots, Ap_{k-1}\} \;\Rightarrow\; p_k = r_{k-1} - AP_{k-1}z_{k-1}$$
Conjugate Gradient Method XIII

Theorem 3: After $j$ iterations, we have

$$r_j = r_{j-1} - \alpha_j A p_j$$
$$P_j^T r_j = 0 \qquad (6)$$
$$\mathrm{span}\{p_1, \ldots, p_j\} = \mathrm{span}\{r_0, \ldots, r_{j-1}\} = \mathrm{span}\{b, Ab, \ldots, A^{j-1}b\}$$
$$r_i^T r_j = 0 \ \text{for all} \ i \ne j$$
Conjugate Gradient Method XIV

Proof of the first statement:

$$r_j = b - Ax_j = b - Ax_{j-1} + A(x_{j-1} - x_j) = r_{j-1} - \alpha_j A p_j$$

$r_i$, $r_j$ are mutually orthogonal; proofs of the other statements are omitted.

Now we want to find $z_{k-1}$.
Conjugate Gradient Method XV

$z_{k-1}$ is a vector of length $k-1$. Split it as

$$z_{k-1} = \begin{bmatrix} w \\ \mu \end{bmatrix}, \quad w: (k-2) \times 1, \quad \mu: 1 \times 1$$

$$p_k = r_{k-1} - AP_{k-1}z_{k-1} \qquad (7)$$
$$= r_{k-1} - AP_{k-2}w - \mu A p_{k-1}$$
$$= \Big(1 + \frac{\mu}{\alpha_{k-1}}\Big) r_{k-1} + s_{k-1}, \qquad (8)$$

using

$$r_{k-1} = r_{k-2} - \alpha_{k-1} A p_{k-1},$$
Conjugate Gradient Method XVI

where

$$s_{k-1} \equiv -\frac{\mu}{\alpha_{k-1}} r_{k-2} - AP_{k-2}w \qquad (9)$$

We have $r_i^T r_j = 0$ for all $i \ne j$, and

$$AP_{k-2}w \in \mathrm{span}\{Ap_1, \ldots, Ap_{k-2}\} = \mathrm{span}\{Ab, \ldots, A^{k-2}b\} \subseteq \mathrm{span}\{r_0, \ldots, r_{k-2}\}$$
Conjugate Gradient Method XVII

Hence

$$r_{k-1}^T (AP_{k-2}w) = 0 \quad \text{and} \quad s_{k-1}^T r_{k-1} = 0 \qquad (10)$$

Recall from Lemma 2 that our job now is to find $z_{k-1}$ such that $\|r_{k-1} - AP_{k-1}z\|$ is minimized.
Conjugate Gradient Method XVIII

The reason for minimizing $\|r_{k-1} - AP_{k-1}z\|$ instead of

$$\|p - r_{k-1}\|_2, \quad p \perp \mathrm{span}\{Ap_1, \ldots, Ap_{k-1}\} \qquad (11)$$

is that (11) is a constrained problem.
Conjugate Gradient Method XIX

From (8) and (10), select $\mu$ and $w$ such that

$$\Big\|\Big(1 + \frac{\mu}{\alpha_{k-1}}\Big) r_{k-1} + s_{k-1}\Big\|^2 = \Big\|\Big(1 + \frac{\mu}{\alpha_{k-1}}\Big) r_{k-1}\Big\|^2 + \Big\|\frac{\mu}{\alpha_{k-1}} r_{k-2} + AP_{k-2}w\Big\|^2$$

is minimized.
Conjugate Gradient Method XX

If an optimal solution is $(\mu^*, w^*)$, then

$$\frac{\mu^*}{\alpha_{k-1}} r_{k-2} + AP_{k-2}w^* = \frac{\mu^*}{\alpha_{k-1}} \Big(r_{k-2} - AP_{k-2} \frac{w^*}{-\mu^*/\alpha_{k-1}}\Big),$$

and $\dfrac{w^*}{-\mu^*/\alpha_{k-1}}$ must be the solution of

$$\min_z \|r_{k-2} - AP_{k-2}z\|$$
Conjugate Gradient Method XXI

From Lemma 2, the solution of

$$\min_z \|r_{k-2} - AP_{k-2}z\|$$

gives

$$p_{k-1} = r_{k-2} - AP_{k-2}z_{k-2}.$$

Therefore, $s_{k-1}$ is a multiple of $p_{k-1}$.

From (8), $p_k \in \mathrm{span}\{r_{k-1}, p_{k-1}\}$.
Conjugate Gradient Method XXII

Assume

$$p_k = r_{k-1} + \beta_k p_{k-1} \qquad (12)$$

This assumption is fine, as we will adjust $\alpha$ later. That is, a direction parallel to the real solution of $\min \|p - r_{k-1}\|$ is enough.

We impose $p_{k-1}^T A p_k = 0$, and from (6),

$$p_{k-1}^T r_{k-1} = 0$$
Conjugate Gradient Method XXIII

With (12),

$$Ap_k = Ar_{k-1} + \beta_k A p_{k-1}$$
$$p_{k-1}^T A p_k = p_{k-1}^T A r_{k-1} + \beta_k\, p_{k-1}^T A p_{k-1} = 0 \;\Rightarrow\; \beta_k = -\frac{p_{k-1}^T A r_{k-1}}{p_{k-1}^T A p_{k-1}}$$

$$\alpha_k = \frac{(r_{k-1} + \beta_k p_{k-1})^T r_{k-1}}{p_k^T A p_k} = \frac{r_{k-1}^T r_{k-1}}{p_k^T A p_k}$$

The conjugate gradient method:
Conjugate Gradient Method XXIV

    k = 0; x_0 = 0; r_0 = b
    while r_k != 0
        k = k + 1
        if k = 1
            p_1 = r_0
        else
            beta_k = -p_{k-1}^T A r_{k-1} / p_{k-1}^T A p_{k-1}
            p_k = r_{k-1} + beta_k p_{k-1}
        end
        alpha_k = r_{k-1}^T r_{k-1} / p_k^T A p_k
        x_k = x_{k-1} + alpha_k p_k
        r_k = b - A x_k
Conjugate Gradient Method XXV

    end

The computational effort: three matrix-vector products per iteration ($Ar_{k-1}$, $Ap_{k-1}$, $Ax_k$).
Conjugate Gradient Method XXVI

Further simplification:

$$r_k = r_{k-1} - \alpha_k A p_k, \qquad r_{k-1} = r_{k-2} - \alpha_{k-1} A p_{k-1}$$

$$r_{k-1}^T r_{k-1} = r_{k-1}^T r_{k-2} - \alpha_{k-1} r_{k-1}^T A p_{k-1} = 0 - \alpha_{k-1} r_{k-1}^T A p_{k-1}$$

$$r_{k-2}^T r_{k-1} = r_{k-2}^T r_{k-2} - \alpha_{k-1} r_{k-2}^T A p_{k-1} = r_{k-2}^T r_{k-2} - \alpha_{k-1} (p_{k-1} - \beta_{k-1} p_{k-2})^T A p_{k-1}$$
$$= r_{k-2}^T r_{k-2} - \alpha_{k-1} p_{k-1}^T A p_{k-1} = 0$$

$$\Rightarrow\; r_{k-2}^T r_{k-2} = \alpha_{k-1} p_{k-1}^T A p_{k-1}$$
Conjugate Gradient Method XXVII

$$\beta_k = -\frac{p_{k-1}^T A r_{k-1}}{p_{k-1}^T A p_{k-1}} = \frac{r_{k-1}^T r_{k-1} / \alpha_{k-1}}{r_{k-2}^T r_{k-2} / \alpha_{k-1}} = \frac{r_{k-1}^T r_{k-1}}{r_{k-2}^T r_{k-2}}$$

A simplified version:

    k = 0; x_0 = 0; r_0 = b
    while r_k != 0
        k = k + 1
        if k = 1
            p_1 = r_0
        else
            beta_k = r_{k-1}^T r_{k-1} / r_{k-2}^T r_{k-2}
            p_k = r_{k-1} + beta_k p_{k-1}
        end
Conjugate Gradient Method XXVIII

        alpha_k = r_{k-1}^T r_{k-1} / p_k^T A p_k
        x_k = x_{k-1} + alpha_k p_k
        r_k = r_{k-1} - alpha_k A p_k
    end

One matrix-vector product per iteration ($Ap_k$).

$r_k \ne 0$ is not a practical termination criterion, and there are too many inner products.

The final version:
Conjugate Gradient Method XXIX

    k = 0; x = 0; r = b; rho_0 = ||r||_2^2
    while sqrt(rho_k) > eps*||b||_2 and k < k_max
        k = k + 1
        if k = 1
            p = r
        else
            beta = rho_{k-1} / rho_{k-2}
            p = r + beta*p
        end
        w = A*p
        alpha = rho_{k-1} / p^T w
        x = x + alpha*p
Conjugate Gradient Method XXX

        r = r - alpha*w
        rho_k = ||r||_2^2
    end

Numerical error may cause the number of iterations to exceed $n$.

Convergence can be slow.

Convergence properties:

Theorem 4: If $A = I + B$ is an $n \times n$ symmetric positive definite matrix and $\mathrm{rank}(B) = r$, then the conjugate gradient method converges in at most $r + 1$ steps.
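A C rendering of the final version above, with a dense matrix-vector product for simplicity (with a sparse A one would substitute the CSC/CSR product from earlier; all names and the dense storage are our choices, not the course's reference code):

    #include <math.h>
    #include <stdlib.h>
    #include <string.h>

    static void matvec(int n, const double *A, const double *v, double *w)
    {
        for (int i = 0; i < n; i++) {
            w[i] = 0.0;
            for (int j = 0; j < n; j++)
                w[i] += A[i * n + j] * v[j];
        }
    }

    static double dot(int n, const double *u, const double *v)
    {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += u[i] * v[i];
        return s;
    }

    /* Conjugate gradient for a dense SPD matrix; x is overwritten
       with the solution and the iteration count is returned. */
    int cg(int n, const double *A, const double *b, double *x,
           double eps, int kmax)
    {
        double *r = malloc(n * sizeof(double));
        double *p = malloc(n * sizeof(double));
        double *w = malloc(n * sizeof(double));
        memset(x, 0, n * sizeof(double));
        memcpy(r, b, n * sizeof(double));        /* r = b - A*0 */
        double rho = dot(n, r, r), rho_old = 0.0;
        double bnorm = sqrt(dot(n, b, b));
        int k = 0;
        while (sqrt(rho) > eps * bnorm && k < kmax) {
            k++;
            if (k == 1) {
                memcpy(p, r, n * sizeof(double));
            } else {
                double beta = rho / rho_old;     /* rho_{k-1}/rho_{k-2} */
                for (int i = 0; i < n; i++)
                    p[i] = r[i] + beta * p[i];
            }
            matvec(n, A, p, w);                  /* the one product */
            double alpha = rho / dot(n, p, w);
            for (int i = 0; i < n; i++) {
                x[i] += alpha * p[i];
                r[i] -= alpha * w[i];
            }
            rho_old = rho;
            rho = dot(n, r, r);
        }
        free(r); free(p); free(w);
        return k;
    }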
Conjugate Gradient Method XXXI

The case of $A = I$:

$$x_0 = 0, \quad r_0 = b, \quad p_1 = r_0 = b$$
$$\alpha_1 = r_0^T r_0 / p_1^T A p_1 = b^T b / b^T A b = 1$$
$$x_1 = \alpha_1 p_1 = b$$
$$r_1 = r_0 - \alpha_1 A p_1 = b - b = 0$$

The conjugate gradient method stops in one iteration.
Conjugate Gradient Method XXXII

An error bound in terms of the norm $\sqrt{x^T A x}$:

Theorem 5: If $Ax = b$, then

$$\|x - x_k\|_A \le 2\,\|x - x_0\|_A \left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^k,$$

where $\|x\|_A = \sqrt{x^T A x}$ and $\kappa$ is the condition number of $A$.

$\kappa \approx 1$ makes $\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^k$ smaller.
Conjugate Gradient Method XXXIII

In general, if the condition number of $A$ is better (smaller), fewer CG iterations are needed.

Where is positive definiteness used? For $\alpha$:

$$\frac{1}{2}\alpha^2 p^T A p + \cdots$$

We need $p^T A p > 0$ for the minimization. That is, we need positive definiteness to ensure that CG solves

$$\min_x \; \frac{1}{2} x^T A x - b^T x$$
Homework 6 I

Solve a linear system with the largest symmetric positive definite matrix in Matrix Market: http://math.nist.gov/matrixmarket

Implement three methods in C or C++: Jacobi, Gauss-Seidel, and CG.

Solve $Ax = e$, where $e$ is the vector of all ones.

You may need to set a maximal number of iterations in case a method takes too many.

If none converges, try diagonal scaling first: let $C = \mathrm{diag}(A)^{-1/2}$.
Homework 6 II

Solve $CACy = Ce$, where $x = Cy$.

After finding the solutions, analyze the error by $\|Ax - b\|$.
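A sketch of the scaling step in C, assuming the reconstruction $C = \mathrm{diag}(A)^{-1/2}$ above (dense storage for brevity; the solver choice and error analysis remain the homework):

    #include <math.h>

    /* Form As = C*A*C and bs = C*b with C = diag(A)^{-1/2}.
       After solving As*y = bs, recover x as x_i = c_i * y_i. */
    void diag_scale(int n, const double *A, const double *b,
                    double *As, double *bs, double *c)
    {
        for (int i = 0; i < n; i++)
            c[i] = 1.0 / sqrt(A[i * n + i]);  /* needs positive diagonal */
        for (int i = 0; i < n; i++) {
            bs[i] = c[i] * b[i];
            for (int j = 0; j < n; j++)
                As[i * n + j] = c[i] * A[i * n + j] * c[j];
        }
    }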