Mathematical Toolkit — Autumn 2016
Lecturer: Madhur Tulsiani
Lecture 10: October 27, 2016

1 The conjugate gradient method

In the last lecture we saw the steepest descent (gradient descent) method for finding a solution to the linear system $Ax = b$ for $A \succ 0$. The method guarantees $\|x_t - x^*\| \leq \varepsilon \|x_0 - x^*\|$ after $t = O(\kappa \cdot \log(1/\varepsilon))$ iterations, where $\kappa$ is the condition number of the matrix $A$. We will see that the conjugate gradient method can obtain a similar guarantee in $O(\sqrt{\kappa} \cdot \log(1/\varepsilon))$ iterations.

For the steepest descent method, if we start from $x_0 = 0$, we get $x_t - x^* = (I - \eta A)^t (x_0 - x^*)$, which gives $x_t = p(A)\,b$ for some polynomial $p$ of degree at most $t-1$. The conjugate gradient method takes this idea of finding an $x$ of the form $p(A)\,b$ and runs with it. The method finds $x_t = p_t(A)\,b$, where $p_t$ is the best polynomial of degree at most $t-1$, i.e., the polynomial which minimizes the function $f(x) = \frac{1}{2}\langle Ax, x\rangle - \langle b, x\rangle$. However, the method does not explicitly work with polynomials. Instead, we use the simple observation that any vector of the form $p_t(A)\,b$ lies in the subspace $\mathrm{Span}\{b, Ab, \ldots, A^{t-1}b\}$, and the method finds the best vector in this subspace at every time $t$.

Definition 1.1. Let $\varphi : V \to V$ be a linear operator on a vector space $V$ and let $v \in V$ be a vector. The Krylov subspace of order $t$ defined by $\varphi$ and $v$ is defined as
$$\mathcal{K}_t(\varphi, v) := \mathrm{Span}\left\{v, \varphi(v), \ldots, \varphi^{t-1}(v)\right\}.$$

Thus, at step $t$ of the conjugate gradient method, we find the best vector in the space $\mathcal{K}_t(A, b)$ (we will just write the subspace as $\mathcal{K}_t$, since $A$ and $b$ are fixed for the entire argument). The trick, of course, is to be able to do this in an iterative fashion, so that we can quickly update the minimizer in the space $\mathcal{K}_{t-1}$ to the minimizer in the space $\mathcal{K}_t$. This can be done by expressing the minimizer in $\mathcal{K}_{t-1}$ in terms of a convenient orthonormal basis $\{u_0, \ldots, u_{t-2}\}$ for $\mathcal{K}_{t-1}$. It turns out that if we work with a basis which is orthonormal with respect to the inner product $\langle \cdot, \cdot \rangle_A$, then at step $t$ we only need to update the component of the minimizer along the new vector $u_{t-1}$ that we add to obtain a basis for $\mathcal{K}_t$.
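As a concrete illustration of Definition 1.1, the following minimal numpy sketch builds a matrix whose columns span $\mathcal{K}_t(A, b)$ and checks that a vector of the form $p(A)\,b$ indeed lies in that span. The matrix `A` and vector `b` here are hypothetical example data, not from the lecture.

```python
import numpy as np

def krylov_basis(phi, v, t):
    """Columns span K_t(phi, v) = Span{v, phi(v), ..., phi^{t-1}(v)}."""
    cols = [v]
    for _ in range(t - 1):
        cols.append(phi @ cols[-1])
    return np.column_stack(cols)

# Small symmetric positive definite system (hypothetical example data).
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

K = krylov_basis(A, b, 3)
# Any p(A) b with deg(p) <= 2 is a combination of the columns of K;
# in particular A(Ab) lies in the column span, so its least-squares
# residual against K is zero.
w = A @ (A @ b)
resid = w - K @ np.linalg.lstsq(K, w, rcond=None)[0]
print(np.allclose(resid, 0.0))  # True
```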
1.1 The algorithm

Recall that we defined the inner product $\langle x, y \rangle_A := \langle Ax, y \rangle$, where $\langle \cdot, \cdot \rangle$ denotes the standard inner product on $\mathbb{R}^n$. As before, we consider the function $f(x) = \frac{1}{2}\langle Ax, x\rangle - \langle b, x\rangle$ and pick $x_t := \arg\min_{x \in \mathcal{K}_t} f(x)$. This can also be thought of as finding the closest point to $x^*$ in the space $\mathcal{K}_t$ under the distance $\|\cdot\|_A$, since
$$f(x) = \frac{1}{2}\langle Ax, x\rangle - \langle b, x\rangle = \frac{1}{2}\langle Ax, x\rangle - \langle Ax^*, x\rangle = \frac{1}{2}\langle x, x\rangle_A - \langle x^*, x\rangle_A = \frac{1}{2}\left(\|x - x^*\|_A^2 - \|x^*\|_A^2\right),$$
which gives $x_t = \arg\min_{x \in \mathcal{K}_t} f(x) = \arg\min_{x \in \mathcal{K}_t} \|x - x^*\|_A$.

We have already seen how to characterize the closest point in a subspace to a given point. Let $\{u_0, \ldots, u_{t-1}\}$ be an orthonormal basis for $\mathcal{K}_t$ under the inner product $\langle \cdot, \cdot \rangle_A$. Completing this to an orthonormal basis $\{u_0, \ldots, u_{n-1}\}$ for $\mathbb{R}^n$, let $x^*$ be expressible as
$$x^* = \sum_{i=0}^{n-1} c_i u_i = \sum_{i=0}^{n-1} \langle x^*, u_i \rangle_A\, u_i.$$
Then we know that the closest point $x_t$ in $\mathcal{K}_t$, under the distance $\|\cdot\|_A$, is given by
$$x_t = \sum_{i=0}^{t-1} \langle x^*, u_i \rangle_A\, u_i = \sum_{i=0}^{t-1} \langle Ax^*, u_i \rangle\, u_i = \sum_{i=0}^{t-1} \langle b, u_i \rangle\, u_i.$$
Note that even though we do not know $x^*$, we can find $x_t$ given an orthonormal basis $\{u_0, \ldots, u_{t-1}\}$, since we can compute $\langle b, u_i \rangle$ for all $u_i$. This gives the following algorithm:

- Start with $u_0 = b / \|b\|_A$ as an orthonormal basis for $\mathcal{K}_1$.
- Let $x_t = \sum_{i=0}^{t-1} \langle b, u_i \rangle\, u_i$ for a basis $\{u_0, \ldots, u_{t-1}\}$ orthonormal under the inner product $\langle \cdot, \cdot \rangle_A$.
- Extend $\{u_0, \ldots, u_{t-1}\}$ to a basis of $\mathcal{K}_{t+1}$ by defining
  $$v_t = A^t b - \sum_{i=0}^{t-1} \langle A^t b, u_i \rangle_A\, u_i \qquad \text{and} \qquad u_t = \frac{v_t}{\langle v_t, v_t \rangle_A^{1/2}}.$$
- Update $x_{t+1} = x_t + \langle b, u_t \rangle\, u_t$.

Notice that the basis extension step here seems to require $O(t)$ matrix-vector multiplications in the $t$-th iteration, and thus we would need $O(t^2)$ matrix-vector multiplications in total for $t$ iterations. This would negate the quadratic advantage we are trying to gain over steepest descent. However, in the homework you will see a way of extending the basis using only $O(1)$ matrix-vector multiplications in each step.

1.2 Bounding the number of iterations

Since $x_t$ lies in the subspace $\mathcal{K}_t$, we have $x_t = p(A)\,b$ for some polynomial $p$ of degree at most $t-1$. Thus,
$$x_t - x^* = p(A)\,b - x^* = p(A)A\,x^* - x^* = (I - p(A)A)(x_0 - x^*),$$
since $x_0 = 0$. We can think of $I - p(A)A$ as a polynomial $q(A)$, where $\deg(q) \leq t$ and $q(0) = 1$. Recall from the last lecture that the minimizer of $f(x)$ is the same as the minimizer of $\langle x - x^*, x - x^* \rangle_A = \|x - x^*\|_A^2$. Since $p(A)\,b$ is the minimizer of $f(x)$ in $\mathcal{K}_t$, we have
$$\|x_t - x^*\|_A^2 = \min_{q \in Q_t} \|q(A)(x_0 - x^*)\|_A^2,$$
where $Q_t$ is the set of polynomials defined as $Q_t := \{q \in \mathbb{R}[z] \mid \deg(q) \leq t,\ q(0) = 1\}$.

Use the fact that if $\lambda$ is an eigenvalue of a matrix $M$, then $\lambda^t$ is an eigenvalue of $M^t$ with the same eigenvector, to prove the following.

Exercise 1.2. Let $\lambda_1, \ldots, \lambda_n$ be the eigenvalues of $A$. Then for any polynomial $q$ and any $v \in \mathbb{R}^n$,
$$\|q(A)\,v\|_A \leq \max_i |q(\lambda_i)| \cdot \|v\|_A.$$

Using the above, we get that
$$\|x_t - x^*\|_A \leq \min_{q \in Q_t} \max_i |q(\lambda_i)| \cdot \|x_0 - x^*\|_A.$$
Thus, the problem of bounding the norm of $x_t - x^*$ is reduced to finding a polynomial $q$ of degree at most $t$ such that $q(0) = 1$ and $|q(\lambda_i)|$ is small for all $i$.

Exercise 1.3. Verify that using $q(z) = \left(1 - \frac{2z}{\lambda_1 + \lambda_n}\right)^t$ recovers the guarantee of the steepest descent method.
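Before continuing the analysis, here is a minimal numpy sketch of the algorithm described above, using the naive $O(t)$ Gram–Schmidt basis extension (not the $O(1)$ homework version). The matrix `A` and vector `b` are hypothetical example data; after $n$ steps the Krylov subspace is all of $\mathbb{R}^n$, so the iterate matches the exact solution.

```python
import numpy as np

def conjugate_gradient_naive(A, b, steps):
    """Build an <.,.>_A-orthonormal basis of the Krylov subspace by
    naive Gram-Schmidt and set x_t = sum_i <b, u_i> u_i."""
    aip = lambda x, y: x @ (A @ y)            # <x, y>_A = <Ax, y>
    us = [b / np.sqrt(aip(b, b))]             # u_0 = b / ||b||_A
    x = (b @ us[0]) * us[0]
    w = b.copy()
    for _ in range(1, steps):
        w = A @ w                             # next Krylov vector A^t b
        v = w - sum(aip(w, u) * u for u in us)
        u = v / np.sqrt(aip(v, v))            # normalize in ||.||_A
        us.append(u)
        x = x + (b @ u) * u                   # x_{t+1} = x_t + <b, u_t> u_t
    return x

# Hypothetical SPD example data.
A = np.array([[4.0, 1.0, 0.0],
              [1.0, 3.0, 1.0],
              [0.0, 1.0, 2.0]])
b = np.array([1.0, 2.0, 3.0])

x = conjugate_gradient_naive(A, b, 3)         # n = 3 steps: exact solution
print(np.allclose(A @ x, b))                  # True
```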
Note that the conjugate gradient method itself does not need to know anything about the optimal polynomials in the above bound; the polynomials are only used in the analysis. The following claim, which can be proved by using slightly modified Chebyshev polynomials, suffices to obtain the desired bound on the number of iterations.

Claim 1.4. For each $t \in \mathbb{N}$, there exists a polynomial $q_t \in Q_t$ such that
$$|q_t(z)| \leq 2\left(1 - \frac{2}{\sqrt{\kappa} + 1}\right)^t \qquad \forall z \in [\lambda_1, \lambda_n].$$

We will prove the claim later using Chebyshev polynomials. Using the claim, we have that
$$\|x_t - x^*\|_A \leq \min_{q \in Q_t} \max_i |q(\lambda_i)| \cdot \|x_0 - x^*\|_A \leq 2\left(1 - \frac{2}{\sqrt{\kappa} + 1}\right)^t \|x_0 - x^*\|_A.$$
Thus, $O(\sqrt{\kappa} \cdot \log(1/\varepsilon))$ iterations suffice to ensure that $\|x_t - x^*\|_A \leq \varepsilon \|x_0 - x^*\|_A$.

1.3 Chebyshev polynomials

The Chebyshev polynomial of degree $t$ is given by the expression
$$P_t(z) = \frac{1}{2}\left[\left(z + \sqrt{z^2 - 1}\right)^t + \left(z - \sqrt{z^2 - 1}\right)^t\right].$$
Note that this is a polynomial, since the odd powers of $\sqrt{z^2 - 1}$ cancel from the two expansions. For $z \in [-1, 1]$ it can also be written as $P_t(z) = \cos(t \cos^{-1}(z))$, which shows that $P_t(z) \in [-1, 1]$ for all $z \in [-1, 1]$. Using these polynomials, we can define the required polynomials $q_t$ as
$$q_t(z) = \frac{P_t\left(\frac{\lambda_1 + \lambda_n - 2z}{\lambda_n - \lambda_1}\right)}{P_t\left(\frac{\lambda_1 + \lambda_n}{\lambda_n - \lambda_1}\right)}.$$
The denominator is a constant which does not depend on $z$, and the numerator is a polynomial of degree $t$ in $z$; hence $\deg(q_t) = t$. Also, the denominator ensures that $q_t(0) = 1$. Finally, for $z \in [\lambda_1, \lambda_n]$, we have
$$-1 \leq \frac{\lambda_1 + \lambda_n - 2z}{\lambda_n - \lambda_1} \leq 1.$$
Hence, the numerator is in the range $[-1, 1]$ for all $z \in [\lambda_1, \lambda_n]$. This gives
$$|q_t(z)| \leq \frac{1}{P_t\left(\frac{\lambda_1 + \lambda_n}{\lambda_n - \lambda_1}\right)} \leq 2\left(\frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1}\right)^t = 2\left(1 - \frac{2}{\sqrt{\kappa} + 1}\right)^t.$$
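The claim can be sanity-checked numerically. The sketch below evaluates $q_t$ via both forms of $P_t$ (the cosine form on $[-1,1]$, the closed form outside) and checks the bound over $[\lambda_1, \lambda_n]$; the spectrum $\lambda_1 = 1$, $\lambda_n = 100$ (so $\kappa = 100$) is a hypothetical example, not from the lecture.

```python
import numpy as np

def cheb(t, z):
    """P_t(z): cosine form for |z| <= 1, closed form otherwise."""
    z = np.asarray(z, dtype=float)
    out = np.empty_like(z)
    inside = np.abs(z) <= 1.0
    out[inside] = np.cos(t * np.arccos(z[inside]))
    s = np.sqrt(z[~inside] ** 2 - 1.0)
    out[~inside] = ((z[~inside] + s) ** t + (z[~inside] - s) ** t) / 2.0
    return out

# Hypothetical spectrum: kappa = lamn / lam1 = 100.
lam1, lamn, t = 1.0, 100.0, 10
kappa = lamn / lam1
denom = cheb(t, np.array([(lam1 + lamn) / (lamn - lam1)]))[0]

def q_t(z):
    return cheb(t, (lam1 + lamn - 2.0 * z) / (lamn - lam1)) / denom

zs = np.linspace(lam1, lamn, 1000)
bound = 2.0 * (1.0 - 2.0 / (np.sqrt(kappa) + 1.0)) ** t
print(np.max(np.abs(q_t(zs))) <= bound)  # True: Claim 1.4 holds here
```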
The last bound above can be computed directly from the first definition of the Chebyshev polynomials.

2 Matrices associated with graphs

Let $G = (V, E)$ be an undirected graph with $V = [n]$. We consider the matrices $A$, $D$ and $L$ associated with the graph $G$, as defined below:
$$A_{ij} = \begin{cases} 1 & \text{if } \{i, j\} \in E \\ 0 & \text{otherwise} \end{cases} \qquad D_{ij} = \begin{cases} \deg(i) & \text{if } i = j \\ 0 & \text{otherwise} \end{cases} \qquad L = D - A.$$
The matrix $A$ is called the adjacency matrix of the graph, while $L$ is called the Laplacian matrix of the graph. We will later use these matrices to analyze properties of the underlying graph.

Note that all the above matrices are symmetric, since the graph $G$ is undirected. Thus, the eigenvectors of each of these matrices form an orthonormal basis, and the corresponding eigenvalues are real. Conventionally, the eigenvalues of the Laplacian matrix are written in ascending order; let $\lambda_1 \leq \cdots \leq \lambda_n$ be the eigenvalues of $L$. The following claim shows that $\lambda_1 \geq 0$, i.e., the Laplacian matrix is positive semidefinite.

Proposition 2.1. For all $x \in \mathbb{R}^n$,
$$\langle x, Lx \rangle = \sum_{\{i,j\} \in E} (x_i - x_j)^2.$$

Note that from the above, we also get that $\mathbf{1} \in \ker(L)$, where $\mathbf{1}$ denotes the vector $(1, \ldots, 1)$. In fact, the kernel of the Laplacian matrix corresponds directly to the connected components of $G$.

Proposition 2.2. Let $k$ be the number of connected components of $G$. Then $\dim(\ker(L)) = k$.

In particular, the above says that if $\lambda_2 = 0$, then $G$ can be divided into two parts with no edges crossing between them. While this seems an unnecessarily complicated way of talking about connected components of a graph, stating this in terms of eigenvalues has the advantage that one can talk about robust versions of the statement. One can then make statements of the form: if $\lambda_2$ is close to zero, then $G$ can be divided into two pieces with only a small number of edges crossing between them. One such statement is known as Cheeger's inequality, and we will see it later in the course. The proof of Cheeger's inequality in fact comes with an algorithm for finding a way of dividing $G$ into two pieces with a small number of crossing edges. This procedure is
often known as spectral partitioning.
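The definitions and Propositions 2.1 and 2.2 can be illustrated with a small numpy sketch. The graph below (a triangle plus one disjoint edge, so two connected components) is a hypothetical example chosen for illustration.

```python
import numpy as np

def laplacian(n, edges):
    """L = D - A for an undirected graph on vertices 0..n-1."""
    A = np.zeros((n, n))
    for i, j in edges:
        A[i, j] = A[j, i] = 1.0          # adjacency matrix
    D = np.diag(A.sum(axis=1))            # degree matrix
    return D - A

# Hypothetical graph: triangle {0,1,2} plus disjoint edge {3,4}
# -> two connected components.
L = laplacian(5, [(0, 1), (1, 2), (0, 2), (3, 4)])

eigvals = np.linalg.eigvalsh(L)           # real, ascending (L symmetric)
print(np.all(eigvals >= -1e-9))           # True: L is PSD (Prop. 2.1)
print(int(np.sum(np.abs(eigvals) < 1e-9)))  # 2 = dim ker(L) = #components
```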