AMS526: Numerical Analysis I (Numerical Linear Algebra)
Lecture 21: Sensitivity of Eigenvalues and Eigenvectors; Conjugate Gradient Method
Xiangmin Jiao, Stony Brook University
Outline
1. Sensitivity of Eigenvalues
2. Conjugate Gradient Method
Residual of Eigenvalue Problems

We consider how the eigenvalues and eigenvectors of $A + \delta A$ differ from those of $A$.

A small residual $r = Ax - \lambda x$ guarantees a small perturbation to $A$ in backward error analysis.

Theorem. Let $A \in \mathbb{C}^{n \times n}$, let $x$ be an approximate eigenvector of $A$ with $\|x\|_2 = 1$, $\lambda$ an associated approximate eigenvalue, and $r = Ax - \lambda x$ the residual. Then $\lambda$ and $x$ are an exact eigenpair of some perturbed matrix $A + \delta A$, where $\|\delta A\|_2 = \|r\|_2$.

Proof. Let $\delta A = -r x^*$, which satisfies $\|\delta A\|_2 = \|r\|_2 \|x\|_2 = \|r\|_2$. Then $(A + \delta A)x = Ax - r x^* x = Ax - r = \lambda x$. □
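The rank-one construction in this proof is easy to check numerically. Below is a minimal sketch (not from the lecture; the random test matrix and the crude power-iteration eigenpair are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5
A = rng.standard_normal((n, n))

# Crude approximate eigenpair from a few power iterations.
x = rng.standard_normal(n)
for _ in range(3):
    x = A @ x
    x /= np.linalg.norm(x)
lam = x @ A @ x              # Rayleigh quotient (||x||_2 = 1)

r = A @ x - lam * x          # residual r = Ax - lambda x
dA = -np.outer(r, x)         # backward perturbation dA = -r x^T

# (A + dA) x = lambda x holds exactly (up to roundoff),
# and ||dA||_2 = ||r||_2 since dA has rank one.
print(np.linalg.norm((A + dA) @ x - lam * x))    # ~ 1e-16
print(np.linalg.norm(dA, 2), np.linalg.norm(r))  # equal
```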
Sensitivity of Eigenvalues

The condition number of the eigenvector matrix $X$ determines the sensitivity of the eigenvalues.

Theorem (Bauer-Fike). Let $A \in \mathbb{C}^{n \times n}$ be a nondefective (semisimple) matrix, and suppose $A = X \Lambda X^{-1}$, where $X$ is nonsingular and $\Lambda$ is diagonal. Let $\delta A \in \mathbb{C}^{n \times n}$ be some perturbation of $A$, and let $\mu$ be an eigenvalue of $A + \delta A$. Then $A$ has an eigenvalue $\lambda$ such that
$$|\mu - \lambda| \le \kappa_p(X) \|\delta A\|_p \quad \text{for } 1 \le p \le \infty.$$

- $\kappa_p(X)$ measures how far the eigenvectors are from being linearly dependent
- For normal matrices, $X$ can be taken unitary, so $\kappa_2(X) = 1$
- For defective matrices, the condition number $\kappa_p(X)$ is regarded as infinite
Sensitivity of Eigenvalues

Proof. Let $\delta \Lambda = X^{-1} (\delta A) X$, so that $\Lambda + \delta \Lambda = X^{-1}(A + \delta A)X$ has the same eigenvalues as $A + \delta A$. Then
$$\|\delta \Lambda\|_p \le \|X^{-1}\|_p \|\delta A\|_p \|X\|_p = \kappa_p(X) \|\delta A\|_p.$$
Let $y$ be an eigenvector of $\Lambda + \delta \Lambda$ associated with $\mu$. Suppose $\mu$ is not an eigenvalue of $A$ (otherwise the bound is trivial), so $\mu I - \Lambda$ is nonsingular, and
$$(\Lambda + \delta \Lambda) y = \mu y \;\Rightarrow\; (\mu I - \Lambda) y = (\delta \Lambda) y \;\Rightarrow\; y = (\mu I - \Lambda)^{-1} (\delta \Lambda) y,$$
thus $\|(\mu I - \Lambda)^{-1}\|_p^{-1} \le \|\delta \Lambda\|_p$. Since $\mu I - \Lambda$ is diagonal, $\|(\mu I - \Lambda)^{-1}\|_p = |\mu - \lambda|^{-1}$, where $\lambda$ is the eigenvalue of $A$ closest to $\mu$. Thus, $|\mu - \lambda| \le \|\delta \Lambda\|_p \le \kappa_p(X) \|\delta A\|_p$. □
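As an illustration of the theorem (a sketch with an arbitrary random matrix and perturbation size, not part of the slides), one can verify that every perturbed eigenvalue lies within $\kappa_2(X) \|\delta A\|_2$ of some eigenvalue of $A$:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n))

lam, X = np.linalg.eig(A)     # A = X diag(lam) X^{-1} (A assumed nondefective)
kappa2 = np.linalg.cond(X)    # kappa_2(X)

dA = rng.standard_normal((n, n))
dA *= 1e-6 / np.linalg.norm(dA, 2)      # scale so ||dA||_2 = 1e-6
mu = np.linalg.eigvals(A + dA)

# Every eigenvalue of A + dA lies within kappa_2(X) ||dA||_2
# of some eigenvalue of A.
dist = np.min(np.abs(mu[:, None] - lam[None, :]), axis=1)
print(np.all(dist <= kappa2 * np.linalg.norm(dA, 2)))   # True
```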
Left and Right Eigenvectors

We would like to analyze the sensitivity of individual eigenvalues.

First, define a left eigenvector $y$ associated with an eigenvalue $\lambda$ of a matrix $A$ to be a nonzero vector $y \in \mathbb{C}^n$ such that $y^* A = \lambda y^*$. Equivalently, left eigenvectors of $A$ are right eigenvectors of $A^*$.

Theorem. Let $A \in \mathbb{C}^{n \times n}$ have distinct eigenvalues $\lambda_1, \lambda_2, \ldots, \lambda_n$ with associated linearly independent right eigenvectors $x_1, \ldots, x_n$ and left eigenvectors $y_1, \ldots, y_n$. Then $y_j^* x_i \ne 0$ if $i = j$ and $y_j^* x_i = 0$ if $i \ne j$.

Proof. If $i \ne j$, then $y_j^* A x_i = \lambda_i y_j^* x_i$ and $y_j^* A x_i = \lambda_j y_j^* x_i$. Since $\lambda_i \ne \lambda_j$, $y_j^* x_i = 0$. If $i = j$, since $\{x_i\}$ form a basis for $\mathbb{C}^n$, $y_i^* x_i = 0$ together with $y_i^* x_j = 0$ for $j \ne i$ would imply $y_i = 0$, a contradiction. □
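The biorthogonality of left and right eigenvectors can also be verified numerically. Below is a sketch under the assumption that NumPy's `eig` is applied to both $A$ and $A^*$, with an ad hoc reordering to pair up eigenvalues:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 5
A = rng.standard_normal((n, n))

lam, X = np.linalg.eig(A)            # columns of X: right eigenvectors
mu, Y = np.linalg.eig(A.conj().T)    # columns of Y: eigenvectors of A^*

# y is a left eigenvector of A for lam_i iff it is an eigenvector of A^*
# for conj(lam_i); reorder Y's columns to pair y_i with lam_i.
order = [int(np.argmin(np.abs(mu - l.conjugate()))) for l in lam]
Y = Y[:, order]

B = Y.conj().T @ X                   # B[j, i] = y_j^* x_i
print(np.allclose(B - np.diag(np.diag(B)), 0))   # off-diagonal entries vanish
print(np.all(np.abs(np.diag(B)) > 1e-8))         # diagonal entries are nonzero
```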
Sensitivity of Individual Eigenvalues

We analyze the sensitivity of individual eigenvalues that are distinct.

Theorem. Let $A \in \mathbb{C}^{n \times n}$ have $n$ distinct eigenvalues. Let $\lambda$ be an eigenvalue with associated right and left eigenvectors $x$ and $y$, respectively, normalized so that $\|x\|_2 = \|y\|_2 = 1$. Let $\delta A$ be a small perturbation satisfying $\|\delta A\|_2 = \epsilon$, and let $\lambda + \delta \lambda$ be the eigenvalue of $A + \delta A$ that approximates $\lambda$. Then
$$|\delta \lambda| \le \frac{1}{|y^* x|} \epsilon + O(\epsilon^2).$$

- Let $s = |y^* x|$. Then $\kappa = 1/s = 1/|y^* x|$ is the condition number for the eigenvalue $\lambda$
- A simple eigenvalue is sensitive if its associated right and left eigenvectors are nearly orthogonal
Sensitivity of Individual Eigenvalues

Proof. We know that $|\delta \lambda| \le \kappa_p(X) \epsilon = O(\epsilon)$. In addition, $\|\delta x\| = O(\epsilon)$ when $\lambda$ is a simple eigenvalue. Because
$$(A + \delta A)(x + \delta x) = (\lambda + \delta \lambda)(x + \delta x),$$
we have
$$(\delta A) x + A (\delta x) + O(\epsilon^2) = (\delta \lambda) x + \lambda (\delta x) + O(\epsilon^2).$$
Left-multiplying by $y^*$ and using the equation $y^* A = \lambda y^*$, we obtain
$$y^* (\delta A) x + O(\epsilon^2) = (\delta \lambda)\, y^* x + O(\epsilon^2),$$
and hence
$$\delta \lambda = \frac{y^* (\delta A) x}{y^* x} + O(\epsilon^2).$$
Note that $|y^* (\delta A) x| \le \|y\|_2 \|\delta A\|_2 \|x\|_2 = \epsilon$, so $|\delta \lambda| \le \frac{1}{|y^* x|} \epsilon + O(\epsilon^2)$. □
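A sketch illustrating the condition number $1/s$ (the test matrix, the perturbation size, and the use of $X^{-1}$ to obtain left eigenvectors are choices made for this example, not prescribed by the slides):

```python
import numpy as np

rng = np.random.default_rng(3)
n = 6
A = rng.standard_normal((n, n))

lam, X = np.linalg.eig(A)
# Row i of X^{-1}, call it w_i^T, satisfies w_i^T A = lam_i w_i^T,
# so y_i = conj(w_i) is a left eigenvector: y_i^* A = lam_i y_i^*.
Y = np.linalg.inv(X).conj().T

k = 0                                # study the first eigenvalue
x = X[:, k] / np.linalg.norm(X[:, k])
y = Y[:, k] / np.linalg.norm(Y[:, k])
s = abs(y.conj() @ x)                # kappa = 1/s is the condition number

eps = 1e-8
dA = rng.standard_normal((n, n))
dA *= eps / np.linalg.norm(dA, 2)    # scale so ||dA||_2 = eps
mu = np.linalg.eigvals(A + dA)
dlam = np.min(np.abs(mu - lam[k]))   # movement of lam_k
print(dlam <= eps / s + 1e-12)       # first-order bound holds
```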
Sensitivity of Multiple Eigenvalues and Eigenvectors

- The sensitivity of multiple eigenvalues is very complex to analyze
- For a multiple eigenvalue, the associated left and right eigenvectors can be orthogonal, so the eigenvalue can be very ill-conditioned
- In general, multiple or close eigenvalues can be poorly conditioned, especially if the matrix is defective
- Condition numbers of eigenvectors are also difficult to analyze
- If a matrix has well-conditioned and well-separated eigenvalues, then its eigenvectors are well-conditioned
- If eigenvalues are ill-conditioned or closely clustered, then the eigenvectors may be poorly conditioned
Outline
1. Sensitivity of Eigenvalues
2. Conjugate Gradient Method
Direct vs. Iterative Methods

- Direct methods, or noniterative methods, compute the exact solution after a finite number of steps (in exact arithmetic)
  - Examples: Gaussian elimination, QR factorization
- Iterative methods produce a sequence of approximations $x^{(1)}, x^{(2)}, \ldots$ that hopefully converge to the true solution
  - Examples: Jacobi, Conjugate Gradient (CG), GMRES, BiCG, etc.
- Caution: the boundary between direct and iterative methods is sometimes blurry
- Why use iterative methods (instead of direct methods)? They
  - may be faster than direct methods
  - produce useful intermediate results
  - handle sparse matrices more easily (they need only matrix-vector products)
  - are often easier to implement on parallel computers
- Question: when should iterative methods not be used?
Two Classes of Iterative Methods

- Stationary iterative methods find a splitting $A = M - K$ and iterate
  $$x^{(k+1)} = M^{-1}(K x^{(k)} + b)$$
  (see the Jacobi sketch below)
  - Examples: Jacobi (for linear systems, not the Jacobi iteration for eigenvalues), Gauss-Seidel, Successive Over-Relaxation (SOR), etc.
- Krylov subspace methods find an optimal solution in the Krylov subspace $\operatorname{span}\{b, Ab, A^2 b, \ldots, A^k b\}$, building the subspace successively
  - Examples: Conjugate Gradient (CG), Generalized Minimum Residual (GMRES), BiCG, etc.
- We will focus on Krylov subspace methods
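Here is a minimal Jacobi sketch illustrating the stationary form $x^{(k+1)} = M^{-1}(K x^{(k)} + b)$ with $M = \operatorname{diag}(A)$; the diagonally dominant test system is an arbitrary choice for which Jacobi is known to converge:

```python
import numpy as np

def jacobi(A, b, x0=None, tol=1e-10, maxit=500):
    """Stationary iteration with the splitting A = M - K, M = diag(A)."""
    x = np.zeros(len(b)) if x0 is None else x0.copy()
    d = np.diag(A)                          # M = diag(A); K = M - A
    for _ in range(maxit):
        x_new = (b - (A @ x - d * x)) / d   # x <- M^{-1}(K x + b)
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Diagonally dominant test system (Jacobi converges for such matrices).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 4.0, 1.0],
              [0.0, 1.0, 4.0]])
b = np.array([1.0, 2.0, 3.0])
print(np.allclose(A @ jacobi(A, b), b))     # True
```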
Krylov Subspace Algorithms

- Krylov subspace methods for $Ax = b$ create a sequence of Krylov subspaces
  $$\mathcal{K}_n = \operatorname{span}\{b, Ab, \ldots, A^{n-1} b\}$$
  and find approximate (hopefully optimal) solutions $x_n \in \mathcal{K}_n$
- Only matrix-vector products are involved
- For SPD matrices, the most famous algorithm is the Conjugate Gradient (CG) method, discovered by Hestenes and Stiefel in 1952
  - Finds the best solution $x_n \in \mathcal{K}_n$ in the norm $\|x\|_A = \sqrt{x^T A x}$, i.e., minimizes the error $\|x_* - x_n\|_A$
  - Only requires storing 4 vectors (instead of $n$ vectors) thanks to a three-term recurrence
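To make the optimality statement concrete, the following sketch (not from the slides) builds the monomial Krylov basis explicitly and computes the $A$-norm-optimal solution in $\mathcal{K}_n$ by brute force; CG reaches this same minimizer without ever forming the basis:

```python
import numpy as np

rng = np.random.default_rng(4)
m = 8
B = rng.standard_normal((m, m))
A = B @ B.T + m * np.eye(m)          # SPD test matrix
b = rng.standard_normal(m)
x_star = np.linalg.solve(A, b)       # exact solution, for reference

n = 4
K = np.empty((m, n))                 # Krylov basis [b, Ab, ..., A^{n-1} b]
K[:, 0] = b
for j in range(1, n):
    K[:, j] = A @ K[:, j - 1]

# Minimize ||x_star - K c||_A over c; the normal equations are
# K^T A K c = K^T A x_star = K^T b.
c = np.linalg.solve(K.T @ A @ K, K.T @ b)
x_n = K @ c
err = x_star - x_n
print(np.sqrt(err @ A @ err))        # A-norm error of the optimal x_n in K_n
```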
Motivation of Conjugate Gradients

- If $A$ is an $m \times m$ SPD matrix, then the quadratic function
  $$\varphi(x) = \tfrac{1}{2} x^T A x - x^T b$$
  has a unique minimum
- The negative gradient of this function is the residual vector,
  $$-\nabla \varphi(x) = b - Ax = r,$$
  so the minimum is attained precisely when $Ax = b$
- Optimization methods have the form $x_{n+1} = x_n + \alpha p_n$, where $p_n$ is a search direction and $\alpha$ is a step length chosen to minimize $\varphi(x_n + \alpha p_n)$
- The line-search parameter can be determined analytically as $\alpha = r_n^T p_n / (p_n^T A p_n)$
- In CG, $p_n$ is chosen to be A-conjugate (or A-orthogonal) to the previous search directions, i.e., $p_n^T A p_j = 0$ for $j < n$ (a steepest-descent sketch using this exact line search follows below)
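A steepest-descent sketch using the exact line-search formula above with $p_n = r_n$ (the simplest choice of search direction, before A-conjugacy is imposed); the SPD test matrix is an arbitrary choice:

```python
import numpy as np

def steepest_descent(A, b, tol=1e-10, maxit=2000):
    """Minimize phi(x) = 0.5 x^T A x - x^T b for SPD A,
    using the residual as the search direction (p_n = r_n)."""
    x = np.zeros_like(b)
    for _ in range(maxit):
        r = b - A @ x                     # p = r: steepest-descent direction
        if np.linalg.norm(r) < tol:
            break
        alpha = (r @ r) / (r @ (A @ r))   # exact line search
        x = x + alpha * r
    return x

rng = np.random.default_rng(5)
m = 10
B = rng.standard_normal((m, m))
A = B @ B.T + m * np.eye(m)               # SPD test matrix
b = rng.standard_normal(m)
print(np.allclose(A @ steepest_descent(A, b), b))   # True
```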
Conjugate Gradient Method

Algorithm: Conjugate Gradient Method
    x_0 = 0, r_0 = b, p_0 = r_0
    for n = 1, 2, 3, ...
        α_n = (r_{n-1}^T r_{n-1}) / (p_{n-1}^T A p_{n-1})    [step length]
        x_n = x_{n-1} + α_n p_{n-1}                          [approximate solution]
        r_n = r_{n-1} - α_n A p_{n-1}                        [residual]
        β_n = (r_n^T r_n) / (r_{n-1}^T r_{n-1})              [improvement this step]
        p_n = r_n + β_n p_{n-1}                              [search direction]

- Only one matrix-vector product $A p_{n-1}$ is needed per iteration
- Apart from the matrix-vector product, the number of operations per iteration is $O(m)$
- If $A$ is sparse with a constant number of nonzeros per row, the total cost is $O(m)$ operations per iteration
- CG can be viewed as minimizing the quadratic function $\varphi(x) = \frac{1}{2} x^T A x - x^T b$ by modifying steepest descent (a runnable transcription follows below)
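A direct transcription of the algorithm above into NumPy; the SPD test matrix at the bottom is an arbitrary choice:

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-10, maxit=None):
    """Conjugate Gradient for SPD A, following the algorithm above."""
    x = np.zeros_like(b)
    r = b.copy()                 # r_0 = b - A x_0 = b
    p = r.copy()                 # p_0 = r_0
    rr = r @ r
    maxit = maxit or len(b)
    for _ in range(maxit):
        Ap = A @ p               # the only matrix-vector product
        alpha = rr / (p @ Ap)    # step length
        x += alpha * p           # approximate solution
        r -= alpha * Ap          # residual
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:
            break
        beta = rr_new / rr       # improvement this step
        p = r + beta * p         # next search direction
        rr = rr_new
    return x

rng = np.random.default_rng(6)
m = 50
B = rng.standard_normal((m, m))
A = B @ B.T + m * np.eye(m)      # SPD test matrix
b = rng.standard_normal(m)
print(np.allclose(A @ conjugate_gradient(A, b), b))   # True
```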
An Alternative Interpretation of CG

Algorithm: standard CG
    x_0 = 0, r_0 = b, p_0 = r_0
    for n = 1, 2, 3, ...
        α_n = (r_{n-1}^T r_{n-1}) / (p_{n-1}^T A p_{n-1})
        x_n = x_{n-1} + α_n p_{n-1}
        r_n = r_{n-1} - α_n A p_{n-1}
        β_n = (r_n^T r_n) / (r_{n-1}^T r_{n-1})
        p_n = r_n + β_n p_{n-1}

Algorithm: a non-standard CG
    x_0 = 0, r_0 = b, p_0 = r_0
    for n = 1, 2, 3, ...
        α_n = (r_{n-1}^T p_{n-1}) / (p_{n-1}^T A p_{n-1})
        x_n = x_{n-1} + α_n p_{n-1}
        r_n = b - A x_n
        β_n = -(r_n^T A p_{n-1}) / (p_{n-1}^T A p_{n-1})
        p_n = r_n + β_n p_{n-1}

- The non-standard variant is less efficient but easier to understand
- It is easy to see that $r_n = r_{n-1} - \alpha_n A p_{n-1} = b - A x_n$
- We need to show that
  - $\alpha_n$ minimizes $\varphi$ along the search direction $p_{n-1}$
  - $\alpha_n$ and $\beta_n$ are equivalent to those in standard CG
  - minimizing $\varphi$ along $p_{n-1}$ also minimizes $\varphi$ within the Krylov subspace
(a numerical comparison of the two variants follows below)
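A sketch running the two formulations side by side on an arbitrary SPD test problem to confirm numerically that they produce the same iterates:

```python
import numpy as np

def cg_variants(A, b, nsteps):
    """Run standard and non-standard CG in lockstep; return both iterates."""
    x1 = np.zeros_like(b); r1 = b.copy(); p1 = r1.copy()
    x2 = np.zeros_like(b); r2 = b.copy(); p2 = r2.copy()
    for _ in range(nsteps):
        # Standard CG
        Ap1 = A @ p1
        a1 = (r1 @ r1) / (p1 @ Ap1)
        x1 = x1 + a1 * p1
        r1_new = r1 - a1 * Ap1
        b1 = (r1_new @ r1_new) / (r1 @ r1)
        p1, r1 = r1_new + b1 * p1, r1_new
        # Non-standard CG
        Ap2 = A @ p2
        a2 = (r2 @ p2) / (p2 @ Ap2)
        x2 = x2 + a2 * p2
        r2 = b - A @ x2
        b2 = -(r2 @ Ap2) / (p2 @ Ap2)
        p2 = r2 + b2 * p2
    return x1, x2

rng = np.random.default_rng(7)
m = 20
B = rng.standard_normal((m, m))
A = B @ B.T + m * np.eye(m)       # SPD test matrix
b = rng.standard_normal(m)
x1, x2 = cg_variants(A, b, 10)
print(np.allclose(x1, x2))        # the two formulations coincide
```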
Optimality of Step Length

Select the step length $\alpha_n$ along the direction $p_{n-1}$ to minimize $\varphi(x) = \frac{1}{2} x^T A x - x^T b$. Let $x_n = x_{n-1} + \alpha_n p_{n-1}$. Then
$$\varphi(x_n) = \tfrac{1}{2} (x_{n-1} + \alpha_n p_{n-1})^T A (x_{n-1} + \alpha_n p_{n-1}) - (x_{n-1} + \alpha_n p_{n-1})^T b$$
$$= \tfrac{1}{2} \alpha_n^2 p_{n-1}^T A p_{n-1} + \alpha_n p_{n-1}^T A x_{n-1} - \alpha_n p_{n-1}^T b + \text{constant}$$
$$= \tfrac{1}{2} \alpha_n^2 p_{n-1}^T A p_{n-1} - \alpha_n p_{n-1}^T r_{n-1} + \text{constant}.$$
Therefore,
$$\frac{d\varphi}{d\alpha_n} = 0 \;\Leftrightarrow\; \alpha_n p_{n-1}^T A p_{n-1} - p_{n-1}^T r_{n-1} = 0 \;\Leftrightarrow\; \alpha_n = \frac{p_{n-1}^T r_{n-1}}{p_{n-1}^T A p_{n-1}}.$$
In addition, $p_{n-1}^T r_{n-1} = r_{n-1}^T r_{n-1}$, because $p_{n-1} = r_{n-1} + \beta_{n-1} p_{n-2}$ and $r_{n-1}^T p_{n-2} = 0$ due to the following theorem.
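A small numerical sanity check (with arbitrary SPD data, not part of the slides) that the analytic step length indeed minimizes $\varphi$ along the search direction:

```python
import numpy as np

rng = np.random.default_rng(8)
m = 6
B = rng.standard_normal((m, m))
A = B @ B.T + m * np.eye(m)        # SPD test matrix
b = rng.standard_normal(m)

phi = lambda x: 0.5 * x @ A @ x - x @ b

x = rng.standard_normal(m)         # current iterate x_{n-1}
p = rng.standard_normal(m)         # some search direction p_{n-1}
r = b - A @ x                      # residual r_{n-1}

alpha = (p @ r) / (p @ (A @ p))    # analytic step length

# phi(x + t p) sampled on a grid should be minimized near t = alpha.
ts = np.linspace(alpha - 1.0, alpha + 1.0, 2001)
vals = [phi(x + t * p) for t in ts]
print(abs(ts[int(np.argmin(vals))] - alpha) < 1e-3)   # True
```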