CS 622: Data-Sparse Matrix Computations, September 19, 2017

Lecture 9: Krylov Subspace Methods

Lecturer: Anil Damle. Scribes: David Eriksson, Marc Aurele Gilles, Ariah Klages-Mundt, Sophia Novitzky.

1 Introduction

In the last lecture, we discussed two methods for producing an orthogonal basis for the Krylov subspaces
$$K_k(A, b) = \operatorname{span}\{b, Ab, A^2 b, \ldots, A^{k-1} b\}$$
of a matrix $A$ and a vector $b$: the Lanczos and Arnoldi methods. In this lecture, we use the Lanczos method in an iterative algorithm for solving linear systems $Ax = b$ when $A$ is symmetric positive definite. This algorithm is called the Conjugate Gradient (CG) algorithm. We will show that its time complexity at each step $k$ does not depend on the number of steps preceding it, and we will discuss the convergence of the algorithm.

2 Derivation of the Conjugate Gradient Algorithm

Consider the following optimization problem over $K_k(A, b)$:
$$x_k = \arg\min_{x \in K_k(A,b)} \frac{1}{2}\|x - A^{-1}b\|_A^2.$$
Assuming $A$ is symmetric positive definite, recall that after $k$ steps the Lanczos method gives us a basis for the $k$th Krylov space $K_k(A, b)$ via
$$A V_k = V_{k+1} \tilde{T}_k, \qquad \tilde{T}_k = \begin{bmatrix} T_k \\ \beta_k e_k^T \end{bmatrix},$$
where $T_k$ is a square tridiagonal matrix and $V_k$ has orthonormal columns. We will consider Algorithm 1 for solving $Ax = b$. Naively, it looks like this algorithm would cost one matrix-vector multiplication $Aq$ (to perform an iteration of Lanczos) and a tridiagonal solve with a matrix of size $k \times k$ per step; thus it looks like the cost of each iteration increases with the number of steps $k$ taken. However, one can show that there is no $k$ dependence because of the three-term recurrence.
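As an aside, it is worth recording a standard identity for the objective $\frac{1}{2}\|x - A^{-1}b\|_A^2$ above (spelled out here for completeness; it is not derived in the notes). Since $A$ is SPD,
$$\frac{1}{2}\|x - A^{-1}b\|_A^2 = \frac{1}{2}x^T A x - b^T x + \frac{1}{2}b^T A^{-1} b,$$
so the minimization over $K_k(A,b)$ can be carried out without knowing $A^{-1}b$: the constant term does not affect the minimizer, and the gradient of the remaining quadratic is $Ax - b$.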

Algorithm 1: Conjugate gradient algorithm (explicit form)
    for step $k = 1, 2, \ldots$ do
        Run step $k$ of the Lanczos method
        Solve $T_k y^{(k)} = \|b\|_2 e_1$
        Let $x^{(k)} = V_k y^{(k)}$
    end

We will now show that the additional calculations at each step can also be performed in time that does not depend on $k$. Observe that
$$T_{k+1} = \begin{bmatrix} T_k & \beta_k e_k \\ \beta_k e_k^T & \alpha_{k+1} \end{bmatrix}.$$
Take $T_k = L_k D_k L_k^T$, an $LDL^T$ factorization. Then we can obtain $T_{k+1} = L_{k+1} D_{k+1} L_{k+1}^T$ in $O(1)$ work. Furthermore, we can go from $z^{(k)} = L_k^T y^{(k)}$ to $z^{(k+1)} = L_{k+1}^T y^{(k+1)}$ in $O(1)$ work, since $T_k y^{(k)} = L_k D_k L_k^T y^{(k)} = \|b\|_2 e_1$. We have
$$z^{(k+1)} = \begin{bmatrix} z^{(k)} \\ \zeta_{k+1} \end{bmatrix}$$
for some $\zeta_{k+1}$, so we can write
$$x^{(k+1)} = V_{k+1} y^{(k+1)} = V_{k+1} I y^{(k+1)} = V_{k+1} L_{k+1}^{-T} L_{k+1}^T y^{(k+1)} = V_{k+1} L_{k+1}^{-T} z^{(k+1)}.$$
Now let $C_k = V_k L_k^{-T} = [\, c_1 \;\cdots\; c_k \,]$, where $c_1, \ldots, c_k$ are the columns of $C_k$. Then
$$x^{(k+1)} = C_{k+1} z^{(k+1)} = C_k z^{(k)} + c_{k+1}\zeta_{k+1} = x^{(k)} + c_{k+1}\zeta_{k+1}.$$
We can compute $c_{k+1}$ and $\zeta_{k+1}$ in time that does not depend on $k$ (it is $O(n)$ in general, as it involves vector operations), and so we can update $x^{(k)}$ to $x^{(k+1)}$ in time independent of $k$. That is, the steps do not become more expensive the further into the algorithm we go.
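For concreteness, here is a minimal NumPy sketch of Algorithm 1 in its explicit, storage-heavy form (an illustration, not part of the original notes: the function name, breakdown tolerance, and dense tridiagonal solve are choices made here, and $A$ is assumed SPD):

```python
import numpy as np

def cg_lanczos_explicit(A, b, k):
    """Sketch of Algorithm 1: run k steps of Lanczos, solve the tridiagonal
    system T_k y = ||b||_2 e_1, and return x = V_k y.  The whole basis V_k is
    stored here, which is exactly the cost the LDL^T update above avoids."""
    n = b.shape[0]
    beta0 = np.linalg.norm(b)
    V = np.zeros((n, k + 1))
    alpha = np.zeros(k)
    beta = np.zeros(k)              # beta[j] holds beta_{j+1} in the notes' indexing
    V[:, 0] = b / beta0
    m = k
    for j in range(k):
        w = A @ V[:, j]
        if j > 0:
            w -= beta[j - 1] * V[:, j - 1]
        alpha[j] = V[:, j] @ w
        w -= alpha[j] * V[:, j]
        beta[j] = np.linalg.norm(w)
        if beta[j] < 1e-14:         # lucky breakdown: the Krylov space is invariant
            m = j + 1
            break
        V[:, j + 1] = w / beta[j]
    T = np.diag(alpha[:m]) + np.diag(beta[:m - 1], 1) + np.diag(beta[:m - 1], -1)
    rhs = np.zeros(m)
    rhs[0] = beta0                  # the right-hand side ||b||_2 e_1
    y = np.linalg.solve(T, rhs)
    return V[:, :m] @ y

# Usage sketch: for SPD A and k = n this reproduces the exact solution
# (in exact arithmetic), e.g. A = np.array([[4., 1.], [1., 3.]]), b = np.ones(2).
```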

After some algebra, CG can be rewritten as Algorithm 2.

Algorithm 2: Conjugate gradient algorithm
    $x_0 = 0$, $r_0 = b$, $p_0 = b$
    for $k = 0, 1, \ldots, K$ do
        $\alpha_k = \langle r_k, r_k\rangle / \langle p_k, A p_k\rangle$
        $x_{k+1} = x_k + \alpha_k p_k$
        $r_{k+1} = r_k - \alpha_k A p_k$
        $\beta_k = \langle r_{k+1}, r_{k+1}\rangle / \langle r_k, r_k\rangle$
        $p_{k+1} = r_{k+1} + \beta_k p_k$
    end

This form makes it clear that the computational cost of one CG iteration is one matrix-vector multiply, three vector inner products, three scalar-vector products, and three vector additions. It also requires storage for only three vectors at any iteration. A completely equivalent way to derive CG is through the minimization of the following quadratic:
$$x_k = \arg\min_{x \in K_k(A,b)} \frac{1}{2}\|x - A^{-1}b\|_A^2. \qquad (1)$$
The derivation from this point of view is included in most textbooks; see, e.g., [1].

3 Convergence and stopping criteria

3.1 Monitoring the residual

We now consider when to stop the conjugate gradient method. One stopping criterion is to monitor the residual $r_k := b - Ax^{(k)}$ and stop when $\|r_k\|_2$ is smaller than some threshold. Notice that we cannot use the error metric defined in (1), as it requires knowing the true solution. Observe that, via the Lanczos relation,
$$\|r_k\|_2 = \|b - AV_k y^{(k)}\|_2 = \left\|V_{k+1}\big(\|b\|_2 e_1\big) - V_{k+1}\tilde{T}_k y^{(k)}\right\|_2 = \left\|V_{k+1}\left(\|b\|_2 e_1 - \begin{bmatrix} T_k \\ \beta_k e_k^T \end{bmatrix} y^{(k)}\right)\right\|_2 = \left\|V_{k+1}\big(\beta_k (e_k^T y^{(k)})\, e_{k+1}\big)\right\|_2 \le \big|\beta_k (e_k^T y^{(k)})\big| \, \|V_{k+1}\|_2,$$
where the second-to-last equality uses the fact that we constructed $y^{(k)}$ in the Lanczos-based algorithm precisely so that it satisfies $T_k y^{(k)} = \|b\|_2 e_1$. This shows that the $k$th residual is proportional to the $(k+1)$st Lanczos basis vector (the last column of $V_{k+1}$). If $V_{k+1}$ has orthonormal columns, then it drops out of the 2-norm and we simply get the norm of that vector. However, in floating point arithmetic $V_{k+1}$ may not have orthogonal columns. Even in this case we have
$$\|r_k\|_2 \le \|V_{k+1}\|_2 \, \big|\beta_k (e_k^T y^{(k)})\big|,$$
and since the columns of $V_{k+1}$ have norm one, $\|V_{k+1}\|_2$ is typically close to one even in floating point arithmetic.
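To make the per-iteration cost and the stopping rule concrete, here is a minimal NumPy sketch of Algorithm 2 with the residual-based stopping criterion just described (the function name, tolerance handling, and iteration cap are choices made for this illustration; the zero initial guess matches Algorithm 2):

```python
import numpy as np

def conjugate_gradient(A, b, tol=1e-8, max_iter=None):
    """Algorithm 2 with the stopping rule of Section 3.1:
    iterate until ||r_k||_2 <= tol * ||b||_2 (or max_iter is reached)."""
    n = b.shape[0]
    if max_iter is None:
        max_iter = n
    x = np.zeros(n)                 # x_0 = 0
    r = b.copy()                    # r_0 = b - A x_0 = b
    p = r.copy()                    # p_0 = r_0
    rs_old = r @ r
    b_norm = np.linalg.norm(b)
    for _ in range(max_iter):
        Ap = A @ p                  # the single matrix-vector product per step
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * b_norm:
            break
        beta = rs_new / rs_old
        p = r + beta * p
        rs_old = rs_new
    return x
```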

3.2 Convergence

Recall that the $k$th Krylov subspace of the matrix $A$ and vector $b$ is
$$K_k(A, b) = \operatorname{span}\{b, Ab, A^2 b, \ldots, A^{k-1}b\}.$$
Therefore,
$$x \in K_k(A, b) \iff x = \sum_{j=0}^{k-1} \alpha_j A^j b = p(A)b \ \text{ for some coefficients } \{\alpha_j\}_{j=0}^{k-1},$$
where $p$ is a polynomial of degree $k-1$ with those coefficients. Let $x^* = A^{-1}b$, and recall that $x_k$, the $k$th iterate of CG, is chosen so that it minimizes the error in the $A$-norm over the $k$th Krylov subspace, i.e.,
$$x_k = \arg\min_{x \in K_k(A,b)} \|x - x^*\|_A.$$
Using the observation above, we can rewrite this error metric as a polynomial approximation problem:
$$\|x - x^*\|_A = \|p(A)b - A^{-1}b\|_A = \|p(A)AA^{-1}b - A^{-1}b\|_A = \|(p(A)A - I)A^{-1}b\|_A \le \|I - p(A)A\|_2 \, \|A^{-1}b\|_A$$
(the last inequality holds because $A^{1/2}$ commutes with $p(A)A - I$, so $\|(p(A)A - I)v\|_A \le \|I - p(A)A\|_2 \|v\|_A$ for any $v$). Note that if $p$ is a polynomial of degree $k-1$, then $q(x) = 1 - xp(x)$ is a polynomial of degree $k$ with $q(0) = 1$. Let $\mathcal{P}_{k,0}$ be the set of such polynomials $q$, that is,
$$\mathcal{P}_{k,0} = \{q : q \text{ is a polynomial of degree at most } k,\ q(0) = 1\}.$$
Furthermore, by the argument above, for every $x \in K_k(A,b)$ there exists a polynomial $q \in \mathcal{P}_{k,0}$ such that
$$\|x - x^*\|_A \le \|q(A)\|_2 \, \|A^{-1}b\|_A.$$
To summarize, we have shown that
$$\frac{\|x_k - x^*\|_A}{\|x^*\|_A} \le \min_{q \in \mathcal{P}_{k,0}} \|q(A)\|_2. \qquad (2)$$

Consider the eigenvalue decomposition of $A$, $A = V\Sigma V^{-1}$. Note that $A$ is SPD, therefore its eigenvalues are real and positive, and $V^{-1} = V^T$. Therefore
$$A^i = A A \cdots A = V\Sigma V^{-1}\, V\Sigma V^{-1} \cdots V\Sigma V^{-1} = V\Sigma^i V^T,$$
and for any polynomial $p$,
$$p(A) = \sum_{i=0}^{k}\alpha_i A^i = \sum_{i=0}^{k}\alpha_i V\Sigma^i V^T = V\left(\sum_{i=0}^{k}\alpha_i \Sigma^i\right)V^T = V p(\Sigma) V^T$$
(this is a special case of the spectral theorem). Let $\Lambda(A)$ be the set of eigenvalues of $A$; then, using this observation, we can rewrite (2) as
$$\frac{\|x_k - x^*\|_A}{\|x^*\|_A} \le \min_{q \in \mathcal{P}_{k,0}} \|q(A)\|_2 = \min_{q \in \mathcal{P}_{k,0}} \|V q(\Sigma) V^T\|_2 \le \min_{q \in \mathcal{P}_{k,0}} \|V\|_2 \|V^T\|_2 \|q(\Sigma)\|_2 = \min_{q \in \mathcal{P}_{k,0}} \|q(\Sigma)\|_2,$$
where we have used that $V$ is orthogonal, so $\|V\|_2 = \|V^T\|_2 = 1$. Recall the definition of the spectral norm: $\|A\|_2 = \max_i \sigma_i(A)$, where the $\sigma_i$ are the singular values of $A$, which coincide with its eigenvalues since $A$ is SPD. This allows us to rewrite the bound one more time as
$$\frac{\|x_k - x^*\|_A}{\|x^*\|_A} \le \min_{q \in \mathcal{P}_{k,0}} \max_{\lambda \in \Lambda(A)} |q(\lambda)|. \qquad (3)$$
This is a remarkable result: it lets us reason about the convergence of CG in terms of the minimization of a polynomial over a set of discrete points. Much of the theory and intuition we have about the convergence of CG comes from this inequality.

This bound implies that the convergence of CG on a given problem depends strongly on the eigenvalues of $A$. When the eigenvalues of $A$ are spread out, it is hard to find such a $q$, and the degree $k$ of the polynomial has to be larger, which means that the algorithm requires more iterations. Figure 1 illustrates the challenge: neither a first- nor a second-degree polynomial gives the desired result. When the eigenvalues do not form any groups, the polynomial degree may have to be as large as the number of eigenvalues of $A$.
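One standard consequence of (3), added here as a supplementary remark (it is not spelled out in the notes): if $A$ has only $m$ distinct eigenvalues $\mu_1, \ldots, \mu_m$, then the degree-$m$ polynomial
$$q(x) = \prod_{i=1}^{m}\left(1 - \frac{x}{\mu_i}\right)$$
satisfies $q(0) = 1$ and $q(\lambda) = 0$ for every $\lambda \in \Lambda(A)$, so the right-hand side of (3) vanishes for $k \ge m$, and CG converges in at most $m$ iterations in exact arithmetic.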

Figure 1: Polynomial fitting for $A$ with eigenvalues spread out.

Figure 2: Polynomial fitting for clustered eigenvalues. The fewer clusters are present, the smaller the required polynomial degree $k$.

The task is easier when the eigenvalues are clustered. The fewer and tighter the clusters, the smaller the polynomial degree $k$ required to attain a good upper bound on the error of the algorithm. Figure 2 shows two different cases of clustered eigenvalues. If there are three clusters, as in the left plot, a third-degree polynomial performs well. When all eigenvalues are in one cluster, as in the right plot, a first-degree polynomial is sufficient. This dependence on the eigenvalues of $A$ is the true reason why the algorithm works so well in practice.
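The effect of clustering is easy to observe numerically. The following sketch (an illustration added here, not from the notes: the diagonal test matrices, cluster widths, and tolerance are arbitrary choices) compares the number of CG iterations for a spectrum spread uniformly over $[1, 100]$ with one concentrated in three tight clusters:

```python
import numpy as np

def cg_iters(eigs, tol=1e-10):
    """Run Algorithm 2 on A = diag(eigs) with b = ones and return the number
    of iterations needed to reduce ||r_k||_2 below tol * ||b||_2."""
    A = np.diag(eigs)
    b = np.ones(len(eigs))
    x = np.zeros(len(eigs))
    r = b.copy()
    p = r.copy()
    rs = r @ r
    for k in range(1, len(eigs) + 1):
        Ap = A @ p
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) <= tol * np.linalg.norm(b):
            return k
        p = r + (rs_new / rs) * p
        rs = rs_new
    return len(eigs)

rng = np.random.default_rng(0)
spread = np.linspace(1.0, 100.0, 201)            # eigenvalues spread over [1, 100]
clustered = np.concatenate(                      # three tight clusters near 1, 10, 100
    [c + 1e-3 * rng.standard_normal(67) for c in (1.0, 10.0, 100.0)])
print(cg_iters(spread), cg_iters(clustered))     # the clustered spectrum needs far fewer steps
```

Both test spectra have roughly the same condition number, so the difference in iteration counts is driven entirely by the eigenvalue distribution, exactly as the bound (3) suggests.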

Equation (3) is a strong statement, but one that is difficult to apply. Indeed, we usually do not know the distribution of the eigenvalues of our matrix $A$, and even if we did, the polynomial minimization problem over a discrete set is very difficult to solve. We now give another bound which is much weaker but which we can usually estimate. Let $\lambda_1$ and $\lambda_n$ be the smallest and largest eigenvalues of $A$, respectively. Clearly $\Lambda(A) \subseteq [\lambda_1, \lambda_n]$. This implies that
$$\min_{q \in \mathcal{P}_{k,0}} \max_{\lambda \in \Lambda(A)} |q(\lambda)| \le \min_{q \in \mathcal{P}_{k,0}} \max_{\lambda \in [\lambda_1, \lambda_n]} |q(\lambda)|. \qquad (4)$$
The right-hand side is much easier to handle than the left-hand side because it involves a continuous interval. It is related to the Chebyshev polynomials $T_n(x)$, characterized by
$$T_n(x) = \cos(n \arccos(x)) \quad \text{for } |x| \le 1.$$

Lemma 1. The min-max problem
$$\min_{q \in \mathcal{P}_{k,0}} \max_{\lambda \in [\lambda_1, \lambda_n]} |q(\lambda)| \qquad (5)$$
is solved by the polynomial
$$P_k(x) = \frac{T_k\!\left(\frac{\lambda_n + \lambda_1 - 2x}{\lambda_n - \lambda_1}\right)}{T_k\!\left(\frac{\lambda_n + \lambda_1}{\lambda_n - \lambda_1}\right)}.$$
Furthermore,
$$\max_{x \in [\lambda_1, \lambda_n]} |P_k(x)| \le 2\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{k},$$
where $\kappa = \lambda_n / \lambda_1$ (called the condition number of $A$).

The proof of this lemma can be found in [2]. It directly implies the following theorem.

Theorem 2. Let $x_k$ be the $k$th iterate of CG, and let $\kappa$ be the condition number of the matrix $A$. Then
$$\frac{\|x_k - x^*\|_A}{\|x^*\|_A} \le 2\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^{k}. \qquad (6)$$

References

[1] Gene H. Golub and Charles F. Van Loan. Matrix Computations. Johns Hopkins University Press, 1996.

[2] Jonathan Richard Shewchuk. An Introduction to the Conjugate Gradient Method Without the Agonizing Pain. Carnegie Mellon University, 1994.