Conjugate Gradient Method

Prof. S. Boyd, EE364b, Stanford University

Topics:
- direct and indirect methods
- positive definite linear systems
- Krylov sequence
- spectral analysis of Krylov sequence
- preconditioning

Three classes of methods for linear equations

Methods to solve the linear system Ax = b, A ∈ R^(n×n):

- dense direct (factor-solve methods): runtime depends only on size; independent of data, structure, or sparsity; works well for n up to a few thousand
- sparse direct (factor-solve methods): runtime depends on size and sparsity pattern, and is (almost) independent of data; can work well for n up to 10^4 or 10^5 (or more); requires a good heuristic for ordering
- indirect (iterative) methods: runtime depends on data, size, sparsity, and required accuracy; requires tuning, preconditioning, ...; a good choice in many cases, and the only choice for n = 10^6 or larger

Symmetric positive definite linear systems

SPD system of equations: Ax = b, A ∈ R^(n×n), A = A^T ≻ 0

Examples:
- Newton/interior-point search direction: ∇²φ(x) Δx = −∇φ(x)
- least-squares normal equations: (A^T A)x = A^T b
- regularized least-squares: (A^T A + μI)x = A^T b
- minimization of the convex quadratic function (1/2)x^T Ax − b^T x
- solving a (discretized) elliptic PDE (e.g., Poisson equation)

- analysis of a resistor circuit: Gv = i, where v is the vector of node voltages, i is the (given) vector of source currents, and G is the circuit conductance matrix:
  G_ij = total conductance incident on node i,  if i = j
  G_ij = −(conductance between nodes i and j),  if i ≠ j

CG overview

- proposed by Hestenes and Stiefel in 1952 (as a direct method)
- solves the SPD system Ax = b
- in theory (i.e., exact arithmetic), terminates in n iterations
- each iteration requires a few inner products in R^n, and one matrix-vector multiply z → Az
- for dense A, the matrix-vector multiply z → Az costs n², so the total cost is n³, the same as direct methods
- CG gains an advantage over dense direct methods if the matrix-vector multiply is cheaper than n²
- with roundoff error, CG can work poorly (or not at all)
- but for some A (and b), we can get a good approximate solution in ≪ n iterations

Solution and error

- x* = A^(−1)b is the solution
- x* minimizes the (convex) function f(x) = (1/2)x^T Ax − b^T x
- ∇f(x) = Ax − b is the gradient of f
- with f* = f(x*), we have
  f(x) − f* = (1/2)x^T Ax − b^T x − ((1/2)x*^T Ax* − b^T x*)
            = (1/2)(x − x*)^T A(x − x*)
            = (1/2)‖x − x*‖_A²
  i.e., f(x) − f* is half the squared A-norm of the error x − x*
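The identity above is easy to check numerically. A minimal NumPy sketch, using a made-up random SPD matrix and test point (purely illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5
    B = rng.standard_normal((n, n))
    A = B @ B.T + n * np.eye(n)       # random SPD test matrix
    b = rng.standard_normal(n)

    f = lambda x: 0.5 * x @ A @ x - b @ x
    x_star = np.linalg.solve(A, b)    # exact solution
    x = rng.standard_normal(n)        # arbitrary test point

    lhs = f(x) - f(x_star)
    rhs = 0.5 * (x - x_star) @ A @ (x - x_star)   # (1/2)||x - x*||_A^2
    print(lhs, rhs)                   # the two values agree to roundoff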

- a relative measure (comparing x to 0):
  τ = (f(x) − f*)/(f(0) − f*) = ‖x − x*‖_A² / ‖x*‖_A²
  (the fraction of the maximum possible reduction in f, compared to x = 0)

Residual

- r = b − Ax is called the residual at x
- r = −∇f(x) = A(x* − x)
- in terms of r, we have
  f(x) − f* = (1/2)(x − x*)^T A(x − x*) = (1/2)r^T A^(−1) r = (1/2)‖r‖²_(A^(−1))
- a commonly used measure of relative accuracy: η = ‖r‖/‖b‖
- τ ≤ κ(A)η²  (η is easily computable from x; τ is not)
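A small self-contained sketch that evaluates both relative measures and checks the bound τ ≤ κ(A)η² (the SPD matrix and the perturbed point are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    n = 5
    B = rng.standard_normal((n, n))
    A = B @ B.T + n * np.eye(n)                 # random SPD test matrix
    b = rng.standard_normal(n)
    x_star = np.linalg.solve(A, b)
    x = x_star + 0.1 * rng.standard_normal(n)   # perturbed approximate solution

    r = b - A @ x
    eta = np.linalg.norm(r) / np.linalg.norm(b)
    tau = ((x - x_star) @ A @ (x - x_star)) / (x_star @ A @ x_star)
    print(tau <= np.linalg.cond(A) * eta**2)    # True: tau <= kappa(A) * eta^2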

Krylov subspace

- Krylov subspace (a.k.a. controllability subspace):
  K_k = span{b, Ab, ..., A^(k−1)b} = {p(A)b : p polynomial, deg p < k}
- we define the Krylov sequence x^(1), x^(2), ... as
  x^(k) = argmin_{x ∈ K_k} f(x) = argmin_{x ∈ K_k} ‖x − x*‖_A²
- the CG algorithm (among others) generates the Krylov sequence
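To make the definition concrete, here is a brute-force sketch that computes the Krylov sequence by minimizing f over an orthonormal basis of K_k. This is not the CG algorithm (which appears later); the random test matrix and the helper name krylov_iterate are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(1)
    n = 7
    B = rng.standard_normal((n, n))
    A = B @ B.T + n * np.eye(n)       # random SPD test matrix
    b = rng.standard_normal(n)
    x_star = np.linalg.solve(A, b)

    def krylov_iterate(A, b, k):
        # orthonormal basis Q for K_k = span{b, Ab, ..., A^(k-1) b}
        V = np.column_stack([np.linalg.matrix_power(A, j) @ b for j in range(k)])
        Q, _ = np.linalg.qr(V)
        # x = Qc minimizes f over K_k  <=>  (Q^T A Q) c = Q^T b
        c = np.linalg.solve(Q.T @ A @ Q, Q.T @ b)
        return Q @ c

    for k in range(1, n + 1):
        x_k = krylov_iterate(A, b, k)
        err = np.sqrt((x_k - x_star) @ A @ (x_k - x_star))
        print(k, err)                 # A-norm of the error decreases; ~0 at k = n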

Properties of Krylov sequence

- f(x^(k+1)) ≤ f(x^(k)) (but ‖r‖ can increase)
- x^(n) = x* (i.e., x* ∈ K_n even when K_n ≠ R^n)
- x^(k) = p_k(A)b, where p_k is a polynomial with deg p_k < k
- less obvious: there is a two-term recurrence
  x^(k+1) = x^(k) + α_k r^(k) + β_k (x^(k) − x^(k−1))
  for some α_k, β_k (this is the basis of the CG algorithm)

Cayley-Hamilton theorem

- characteristic polynomial of A:
  χ(s) = det(sI − A) = s^n + α_1 s^(n−1) + ··· + α_n
- by the Cayley-Hamilton theorem,
  χ(A) = A^n + α_1 A^(n−1) + ··· + α_n I = 0
  and so
  A^(−1) = −(1/α_n)A^(n−1) − (α_1/α_n)A^(n−2) − ··· − (α_(n−1)/α_n)I
- in particular, we see that x* = A^(−1)b ∈ K_n

Spectral analysis of Krylov sequence

- A = QΛQ^T, Q orthogonal, Λ = diag(λ_1, ..., λ_n)
- define ȳ = Q^T x, b̄ = Q^T b, ȳ* = Q^T x*
- in terms of ȳ, we have
  f(x) = f̄(ȳ) = (1/2)x^T QΛQ^T x − b^T QQ^T x
       = (1/2)ȳ^T Λȳ − b̄^T ȳ
       = Σ_{i=1}^n ((1/2)λ_i ȳ_i² − b̄_i ȳ_i)
- so ȳ*_i = b̄_i/λ_i and f* = −(1/2) Σ_{i=1}^n b̄_i²/λ_i
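A quick numerical check of the formula f* = −(1/2) Σ_i b̄_i²/λ_i, again on a made-up SPD matrix (illustrative assumption):

    import numpy as np

    rng = np.random.default_rng(2)
    n = 6
    B = rng.standard_normal((n, n))
    A = B @ B.T + n * np.eye(n)
    b = rng.standard_normal(n)

    lam, Q = np.linalg.eigh(A)              # A = Q diag(lam) Q^T
    bbar = Q.T @ b
    f_star_spectral = -0.5 * np.sum(bbar**2 / lam)

    x_star = np.linalg.solve(A, b)
    f_star_direct = 0.5 * x_star @ A @ x_star - b @ x_star
    print(f_star_spectral, f_star_direct)   # agree to roundoff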

Krylov sequence in terms of ȳ

- ȳ^(k) = argmin_{ȳ ∈ K̄_k} f̄(ȳ),   K̄_k = span{b̄, Λb̄, ..., Λ^(k−1)b̄}
- ȳ_i^(k) = p_k(λ_i) b̄_i, with deg p_k < k, where
  p_k = argmin_{deg p < k} Σ_{i=1}^n b̄_i² ((1/2)λ_i p(λ_i)² − p(λ_i))

- f(x^(k)) − f* = f̄(ȳ^(k)) − f*
    = min_{deg p < k} (1/2) Σ_{i=1}^n b̄_i² (λ_i p(λ_i) − 1)² / λ_i
    = min_{deg p < k} (1/2) Σ_{i=1}^n ȳ*_i² λ_i (λ_i p(λ_i) − 1)²
    = min_{deg q ≤ k, q(0)=1} (1/2) Σ_{i=1}^n ȳ*_i² λ_i q(λ_i)²
    = min_{deg q ≤ k, q(0)=1} (1/2) Σ_{i=1}^n b̄_i² q(λ_i)² / λ_i

- τ_k = (min_{deg q ≤ k, q(0)=1} Σ_{i=1}^n ȳ*_i² λ_i q(λ_i)²) / (Σ_{i=1}^n ȳ*_i² λ_i)
      ≤ min_{deg q ≤ k, q(0)=1} max_{i=1,...,n} q(λ_i)²
- if there is a polynomial q of degree k, with q(0) = 1, that is small on the spectrum of A, then f(x^(k)) − f* is small
- if the eigenvalues are clustered in k groups, then x^(k) is a good approximate solution
- if the solution x* is approximately a linear combination of k eigenvectors of A, then x^(k) is a good approximate solution

A bound on convergence rate

- taking q as the Chebyshev polynomial of degree k that is small on the interval [λ_min, λ_max], we get
  τ_k ≤ ((√κ − 1)/(√κ + 1))^(2k),   κ = λ_max/λ_min
- convergence can be much faster than this, if the spectrum of A is spread but clustered
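A tiny sketch that evaluates the Chebyshev-based bound above for a few (illustrative) condition numbers, to show how the predicted accuracy scales with κ and k:

    import numpy as np

    def tau_bound(kappa, k):
        # bound on tau_k from the Chebyshev polynomial argument above
        rho = (np.sqrt(kappa) - 1) / (np.sqrt(kappa) + 1)
        return rho ** (2 * k)

    for kappa in (10, 100, 1000):
        print(kappa, [tau_bound(kappa, k) for k in (10, 50, 100)])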

Small example

A ∈ R^(7×7), spectrum shown as filled circles; the polynomials 1 − λ p_k(λ) are shown for k = 1, 2, 3, 4, and 7.

[Figure: 1 − λ p_k(λ) versus λ, for λ between 0 and 10, with the eigenvalues of A marked on the λ-axis]

Convergence

[Figure: τ_k versus iteration k, k = 0, ..., 7]

Residual convergence

[Figure: η_k versus iteration k, k = 0, ..., 7]

Larger example

- solve Gv = i, a resistor network with 10^5 nodes
- average node degree 10; around 10^6 nonzeros in G
- random topology with one grounded node
- nonzero branch conductances uniform on [0, 1]
- external current i uniform on [0, 1]
- sparse Cholesky factorization of G requires too much memory

Residual convergence

[Figure: η_k (log scale) versus iteration k, k = 0, ..., 60]

CG algorithm (follows C. T. Kelley)

    x := 0, r := b, ρ_0 := ‖r‖²
    for k = 1, ..., N_max
        quit if √ρ_(k−1) ≤ ε‖b‖
        if k = 1 then p := r else p := r + (ρ_(k−1)/ρ_(k−2)) p
        w := Ap
        α := ρ_(k−1)/(p^T w)
        x := x + αp
        r := r − αw
        ρ_k := ‖r‖²
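The pseudocode translates almost line for line into NumPy. The following is a minimal sketch (the function name, default tolerance, and the dense random test problem are assumptions for illustration; a practical version would take a matrix-vector multiply routine rather than a dense matrix):

    import numpy as np

    def cg(A, b, tol=1e-8, max_iters=None):
        # conjugate gradient for SPD A, following the pseudocode above
        n = b.shape[0]
        if max_iters is None:
            max_iters = n
        x = np.zeros(n)
        r = b.copy()
        rho_prev, rho_prev2 = r @ r, None     # rho_{k-1}, rho_{k-2}
        p = None
        for k in range(1, max_iters + 1):
            if np.sqrt(rho_prev) <= tol * np.linalg.norm(b):
                break
            p = r.copy() if k == 1 else r + (rho_prev / rho_prev2) * p
            w = A @ p
            alpha = rho_prev / (p @ w)
            x = x + alpha * p
            r = r - alpha * w
            rho_prev2, rho_prev = rho_prev, r @ r
        return x

    # illustrative test on a small random SPD system
    rng = np.random.default_rng(3)
    n = 50
    B = rng.standard_normal((n, n))
    A = B @ B.T + n * np.eye(n)
    b = rng.standard_normal(n)
    x = cg(A, b)
    print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))   # small relative residual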

Efficient matrix-vector multiply

- sparse A
- structured (e.g., sparse) plus low rank
- products of easy-to-multiply matrices
- fast transforms (FFT, wavelet, ...)
- inverses of lower/upper triangular matrices (by forward/backward substitution)
- fast Gauss transform, for A_ij = exp(−‖v_i − v_j‖²/σ²) (via multipole)
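As an example of the second item, here is a sketch of a matrix-vector multiply for A = S + UU^T (sparse plus low rank) that never forms the dense matrix; the sizes, density, and use of scipy.sparse are illustrative assumptions. The cg sketch above only needs such a multiply, so it could be adapted to accept a matvec callable instead of a matrix.

    import numpy as np
    import scipy.sparse as sp

    n, k = 10000, 5
    rng = np.random.default_rng(4)
    S = sp.random(n, n, density=1e-3, random_state=4)
    S = S + S.T + 10 * sp.eye(n)       # sparse symmetric part (illustrative)
    U = rng.standard_normal((n, k))    # low-rank factor

    def matvec(z):
        # compute (S + U U^T) z in O(nnz(S) + nk) work
        return S @ z + U @ (U.T @ z)

    z = rng.standard_normal(n)
    print(matvec(z).shape)             # (10000,)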

Shifting

- suppose we have a guess x̂ of the solution x*
- we can solve Az = b − Ax̂ using CG, then get x* = x̂ + z
- in this case,
  x^(k) = x̂ + z^(k) = argmin_{x ∈ x̂+K_k} f(x)
  (x̂ + K_k is called a shifted Krylov subspace)
- same as initializing the CG algorithm with x := x̂, r := b − Ax̂
- good for warm start, i.e., solving Ax = b starting from a good initial guess (e.g., the solution of another system Ãx = b̃, with A ≈ Ã, b ≈ b̃)
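A warm-start usage sketch, continuing the cg example above (A, b, n, rng, and cg are reused from that sketch; the perturbed right-hand side is a made-up illustration):

    x_hat = cg(A, b)                            # solution of a "nearby" problem
    b_new = b + 0.01 * rng.standard_normal(n)   # slightly perturbed right-hand side
    z = cg(A, b_new - A @ x_hat)                # solve A z = b_new - A x_hat
    x_new = x_hat + z                           # warm-started solution of A x = b_new
    print(np.linalg.norm(b_new - A @ x_new) / np.linalg.norm(b_new))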

Preconditioned conjugate gradient algorithm

- idea: apply CG after a linear change of coordinates x = Ty, det T ≠ 0
- use CG to solve T^T A T y = T^T b; then set x* = Ty*
- T, or M = TT^T, is called the preconditioner
- in a naive implementation, each iteration requires multiplies by T and T^T (and A), and we must recover x* = Ty* at the end
- the computation can be re-arranged so that each iteration requires one multiply by M (and A), and no multiplies by T or T^T at all
- this is called the preconditioned conjugate gradient (PCG) algorithm
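The slides do not spell out the re-arranged iteration; the following is a common textbook form of PCG, given as a minimal sketch (the name pcg, the callable apply_M computing z → Mz, and the defaults are assumptions for illustration):

    import numpy as np

    def pcg(A, b, apply_M, tol=1e-8, max_iters=None):
        # preconditioned CG: one multiply by A and one application of M per iteration
        n = b.shape[0]
        if max_iters is None:
            max_iters = n
        x = np.zeros(n)
        r = b.copy()
        z = apply_M(r)
        p = z.copy()
        rho = r @ z
        for _ in range(max_iters):
            w = A @ p
            alpha = rho / (p @ w)
            x = x + alpha * p
            r = r - alpha * w
            if np.linalg.norm(r) <= tol * np.linalg.norm(b):
                break
            z = apply_M(r)
            rho_new = r @ z
            p = z + (rho_new / rho) * p
            rho = rho_new
        return x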

Choice of preconditioner

- if the spectrum of T^T A T (which is the same as the spectrum of MA) is clustered, PCG converges fast
- extreme case: M = A^(−1)
- trade-off between enhanced convergence and the extra cost of multiplication by M at each step
- goal: find an M that is cheap to multiply by and is an approximate inverse of A (or at least gives MA a more clustered spectrum than A)

Some generic preconditioners

- diagonal: M = diag(1/A_11, ..., 1/A_nn)
- incomplete/approximate Cholesky factorization: use M = Â^(−1), where Â = L̂L̂^T is an approximation of A with a cheap Cholesky factorization
  - compute the Cholesky factorization of Â, Â = L̂L̂^T
  - at each iteration, compute Mz = L̂^(−T) L̂^(−1) z via forward/backward substitution
- examples:
  - Â is the central k-wide band of A
  - L̂ obtained by sparse Cholesky factorization of A, ignoring small elements in A, or refusing to create excessive fill-in
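A usage sketch of the pcg function above with the diagonal preconditioner M = diag(1/A_11, ..., 1/A_nn); the badly scaled SPD test matrix is an illustrative assumption:

    import numpy as np

    rng = np.random.default_rng(5)
    n = 200
    B = rng.standard_normal((n, n))
    scale = np.diag(10.0 ** rng.uniform(-3, 3, n))   # widely varying scales
    A = scale @ (B @ B.T + n * np.eye(n)) @ scale    # SPD but badly scaled
    b = rng.standard_normal(n)

    d = 1.0 / np.diag(A)                             # diagonal preconditioner
    apply_M = lambda z: d * z                        # Mz = diag(1/A_ii) z
    x = pcg(A, b, apply_M)                           # pcg from the sketch above
    print(np.linalg.norm(b - A @ x) / np.linalg.norm(b))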

Larger example

Residual convergence with and without diagonal preconditioning.

[Figure: η_k (log scale) versus iteration k, k = 0, ..., 60, for CG and diagonally preconditioned CG]

CG summary

- in theory (with exact arithmetic), converges to the solution in n steps
- the bad news: due to numerical round-off errors, it can take more than n steps (or fail to converge at all)
- the good news: with luck (i.e., a good spectrum of A), we can get a good approximate solution in ≪ n steps
- each step requires one z → Az multiplication
  - can exploit a variety of structure in A
  - in many cases, we never form or store the matrix A
- compared to direct (factor-solve) methods, CG is less reliable and more data dependent; it often requires a good (problem-dependent) preconditioner
- but, when it works, it can solve extremely large systems