7.2 Steepest Descent and Preconditioning


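APPM 4650, Chris Ketelsen, December 16, 2013, Section 7.2

Descent methods are a broad class of iterative methods for finding solutions of the linear system $Ax = b$ for a symmetric positive definite matrix $A \in \mathbb{R}^{n \times n}$. Consider the functional $J : \mathbb{R}^n \to \mathbb{R}$ given by

$$J(y) = \tfrac{1}{2}\, y^T A y - y^T b$$

Theorem: Let $A \in \mathbb{R}^{n \times n}$ be symmetric positive definite and let $J$ be defined as above. Then there is exactly one $x \in \mathbb{R}^n$ for which

$$J(x) = \min_{y} J(y)$$

and that $x$ is the solution to the linear system $Ax = b$.

Proof I: Since $A$ is symmetric positive definite it is nonsingular, so $Ax = b$ has a unique solution $x$. Substituting $b = Ax$, we can write the functional as

$$J(y) = \tfrac{1}{2}\, y^T A y - y^T A x$$

Then, by completing the square (and using $y^T A x = x^T A y$, which holds because $A$ is symmetric), we have

$$\begin{aligned}
J(y) &= \tfrac{1}{2}\, y^T A y - y^T A x \\
&= \tfrac{1}{2}\, y^T A y - y^T A x + \tfrac{1}{2}\, x^T A x - \tfrac{1}{2}\, x^T A x \\
&= \tfrac{1}{2}\, y^T A y - \tfrac{1}{2}\, x^T A y - \tfrac{1}{2}\, y^T A x + \tfrac{1}{2}\, x^T A x - \tfrac{1}{2}\, x^T A x \\
&= \tfrac{1}{2}\, (y - x)^T A y - \tfrac{1}{2}\, (y - x)^T A x - \tfrac{1}{2}\, x^T A x \\
&= \tfrac{1}{2}\, (y - x)^T A (y - x) - \tfrac{1}{2}\, x^T A x
\end{aligned}$$

We're minimizing over $y$, so $x$ is fixed and the second term is a constant. Since $A$ is positive definite, the first term in the rewritten functional is nonnegative, and it is zero only when $y - x = 0$. This means that the best we can do to minimize $J$ is to choose $y$ so that the first term is 0, which happens precisely when $y = x$. Thus, the solution to $Ax = b$ is the unique minimizer of the functional $J$.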
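The theorem can be checked numerically on a small random problem. The following is a minimal sketch, assuming NumPy is available; the test matrix, the seed, and the names `J`, `square_form_matches`, and `x_is_minimizer` are illustrative choices, not part of the notes.

```python
import numpy as np

def J(A, b, y):
    """Quadratic functional J(y) = 1/2 y^T A y - y^T b."""
    return 0.5 * y @ A @ y - y @ b

rng = np.random.default_rng(0)
n = 4
B = rng.standard_normal((n, n))
A = B @ B.T + n * np.eye(n)      # symmetric positive definite by construction
b = rng.standard_normal(n)
x = np.linalg.solve(A, b)        # solution of A x = b

# Completed-square form: J(y) = 1/2 (y - x)^T A (y - x) - 1/2 x^T A x
y = rng.standard_normal(n)
completed = 0.5 * (y - x) @ A @ (y - x) - 0.5 * x @ A @ x
square_form_matches = bool(np.isclose(J(A, b, y), completed))

# J at x should not exceed J at any randomly perturbed point
trials = [x + rng.standard_normal(n) for _ in range(100)]
x_is_minimizer = all(J(A, b, x) <= J(A, b, t) for t in trials)
```

Random perturbations do not prove minimality, of course; the proof above does that. The check only illustrates that the two forms of $J$ agree and that moving away from $x$ raises the functional.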

How does this help us come up with a new solver for $Ax = b$? It implies that instead of solving the system $Ax = b$ directly, we can solve the equivalent problem of minimizing $J$.

Proof II: Let $y = [y_1, y_2, \ldots, y_n]^T$ and recall that

$$\nabla J = \left[ \frac{\partial J}{\partial y_1}, \frac{\partial J}{\partial y_2}, \ldots, \frac{\partial J}{\partial y_n} \right]^T$$

Then, a routine calculation shows that

$$\nabla J = Ay - b$$

(The computation is left as an exercise.) Notice that $\nabla J$ evaluated at any vector $y$ is just the negative of the residual vector $r = b - Ay$ that arises from using $y$ as an approximate solution to $Ax = b$.

We know from calculus that the extreme values of scalar functions like $J$ occur where $\nabla J = 0$. Since $A$ is s.p.d. it is also nonsingular, which means the sole critical point of $J$ occurs when

$$Ay - b = 0 \quad \Longleftrightarrow \quad y = x, \ \text{where } Ax = b$$

Since the functional $J$ is quadratic and the matrix in the leading term is s.p.d., we can conclude that the only extreme value of $J$ is a minimum. Thus we have again shown that the unique minimizer of the functional $J$ is the solution to the linear system $Ax = b$.

Descent Methods: Descent methods are iterative methods that attempt to descend to the minimum of the functional $J$ (i.e. the solution of $Ax = b$). Descent methods start with an initial guess $x^{(0)}$ and generate a sequence of iterates $x^{(1)}, x^{(2)}, x^{(3)}, \ldots$ such that each pair of consecutive iterates satisfies

$$J\left(x^{(k+1)}\right) \le J\left(x^{(k)}\right) \quad \text{but hopefully} \quad J\left(x^{(k+1)}\right) < J\left(x^{(k)}\right)$$

That is to say, each new iterate should lower the value of the functional and thus get closer to the solution of $Ax = b$. If at some point we have $Ax^{(k)} = b$, or nearly so, we stop and consider $x^{(k)}$ our approximate solution. If $x^{(k)}$ is still not a good enough solution to $Ax = b$, we take another step in the iteration, which decreases the functional.

The question is: How do we go from $x^{(k)}$ to $x^{(k+1)}$? The process has two components:

1. Choose a direction to move in. (We'll call this the search direction.)
2. Choose how far to move in the decided upon direction.

This can be stated more mathematically:

1. Choose a search direction vector $p^{(k)}$ to move in on the $k$th iteration.
2. Choose a step parameter $\alpha_k \in \mathbb{R}$ determining how far in the direction of $p^{(k)}$ to move.

This produces the update step

$$x^{(k+1)} = x^{(k)} + \alpha_k\, p^{(k)}$$

Notice that this indicates that the new iterate $x^{(k+1)}$ lies along a line starting from $x^{(k)}$ and heading in the search direction $p^{(k)}$. The choice of $\alpha_k$ determines how far along this line we move. Intuitively, since we're solving a minimization problem, we'd like to choose $\alpha_k$ so that the move reduces $J$ as much as possible. When we move in such a way as to reduce the functional, the movement is called a line search.

Given a direction $p^{(k)}$ for a line search there are a lot of choices for how far we actually move. If we choose $\alpha_k$ exactly such that the move minimizes the functional $J$ along that line, the process is called an exact line search. If, for whatever reason, we do not pick $\alpha_k$ to explicitly minimize $J$, the process is called an inexact line search.

Suppose that given $x^{(k)}$ we have (somehow) decided on a direction $p^{(k)}$ to search in. For an arbitrary functional, it is usually very difficult to find an $\alpha_k$ so that the movement to $x^{(k+1)}$ exactly minimizes the functional, but since in our case $J$ is a quadratic it's actually pretty easy.

Exact Line Search: For a given iterate $x^{(k)}$ and search direction $p^{(k)}$, define

$$g(\alpha) = J\left(x^{(k)} + \alpha\, p^{(k)}\right)$$

Then the minimizer of $g(\alpha)$ is $\alpha_k$, the correct value for our exact line search. To find the minimizer we set $g'(\alpha) = 0$ and solve for $\alpha$. Notice that

$$\begin{aligned}
g(\alpha) &= J\left(x^{(k)} + \alpha\, p^{(k)}\right) \\
&= \tfrac{1}{2} \left(x^{(k)} + \alpha\, p^{(k)}\right)^T A \left(x^{(k)} + \alpha\, p^{(k)}\right) - \left(x^{(k)} + \alpha\, p^{(k)}\right)^T b \\
&= \left( \tfrac{1}{2}\, x^{(k)T} A x^{(k)} - x^{(k)T} b \right) + \alpha\, p^{(k)T} A x^{(k)} - \alpha\, p^{(k)T} b + \tfrac{1}{2}\, \alpha^2\, p^{(k)T} A p^{(k)} \\
&= J\left(x^{(k)}\right) - \alpha\, p^{(k)T} r^{(k)} + \tfrac{1}{2}\, \alpha^2\, p^{(k)T} A p^{(k)}
\end{aligned}$$

where $r^{(k)} = b - Ax^{(k)}$. Since $A$ is s.p.d. we have $p^{(k)T} A p^{(k)} > 0$ for $p^{(k)} \ne 0$, so $g$ is an upward-opening parabola and its critical point is its minimum. Taking the derivative of $g$ with respect to $\alpha$, setting it equal to 0, and solving for $\alpha$, we have

$$g'(\alpha) = -p^{(k)T} r^{(k)} + \alpha\, p^{(k)T} A p^{(k)} = 0 \quad \Longrightarrow \quad \alpha = \frac{r^{(k)T} p^{(k)}}{p^{(k)T} A p^{(k)}} =: \alpha_k$$
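The step-length formula is easy to sanity-check on a tiny example. The sketch below assumes NumPy; the $2 \times 2$ matrix, right-hand side, and search direction are arbitrary illustrative values, and `exact_step` and `g` are hypothetical helper names.

```python
import numpy as np

def exact_step(A, r, p):
    """Exact line-search step length alpha = (r^T p) / (p^T A p)."""
    return (r @ p) / (p @ A @ p)

A = np.array([[4.0, 1.0], [1.0, 3.0]])   # illustrative s.p.d. matrix
b = np.array([1.0, 2.0])
x = np.zeros(2)                          # current iterate
p = np.array([1.0, 1.0])                 # arbitrary search direction
r = b - A @ x                            # residual at the current iterate

alpha = exact_step(A, r, p)

def g(a):
    """J restricted to the search line, g(a) = J(x + a p)."""
    y = x + a * p
    return 0.5 * y @ A @ y - y @ b

# g is an upward-opening parabola, so nearby step lengths should not beat alpha
is_min = g(alpha) <= min(g(alpha - 1e-3), g(alpha + 1e-3))

# After an exact line search the new residual is orthogonal to p
r_new = b - A @ (x + alpha * p)
```

Here $r^T p = 3$ and $p^T A p = 9$, so the formula gives $\alpha = 1/3$. The orthogonality of the new residual to $p$ is just the condition $g'(\alpha) = 0$ written in terms of the updated iterate.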

So, given a search direction $p^{(k)}$, the choice of $\alpha_k$ above results in a step that minimizes $J$ along $p^{(k)}$. Now we just need to choose the search direction. Notice that the only way to get $\alpha_k = 0$ is to choose a search direction $p^{(k)}$ that is orthogonal to $r^{(k)}$. Since this would result in no movement at all, we would like to avoid this.

Choice of Search Direction: There are many ways to choose the search direction. One instructive (but naive) method is to choose the direction in which the functional $J$ decreases most rapidly. Recall from vector calculus that a scalar function decreases most rapidly in the direction of its negative gradient. Therefore we propose choosing

$$p^{(k)} = -\nabla J\left(x^{(k)}\right) = b - Ax^{(k)} = r^{(k)}$$

This indicates that choosing the residual as the search direction gives the steepest local decrease of the functional at $x^{(k)}$. This leads to the so-called Method of Steepest Descent.

Function SteepestDescent()
Input: Matrix $A$, right-hand side vector $b$, initial guess $x^{(0)}$

$$\begin{aligned}
& r \leftarrow b - Ax \\
& p \leftarrow r \\
& \textbf{while } \text{not yet converged } \textbf{do} \\
& \qquad \alpha \leftarrow p^T r \,/\, p^T A p \\
& \qquad x \leftarrow x + \alpha p \\
& \qquad r \leftarrow b - Ax \\
& \qquad p \leftarrow r
\end{aligned}$$

Notice that the most expensive operations in the algorithm are the two matrix-vector multiplies in each iteration. We can reduce their cost as follows. Notice that

$$r^{(k+1)} = b - Ax^{(k+1)} = b - A\left(x^{(k)} + \alpha\, p^{(k)}\right) = r^{(k)} - \alpha A p^{(k)}$$

With this residual update, the only two mat-vecs in the loop are both of the form $Ap$. Since mat-vecs are (relatively) expensive, we might as well do this operation only once per iteration and store the result for later use. The modified algorithm then becomes

Function SteepestDescent()
Input: Matrix $A$, right-hand side vector $b$, initial guess $x^{(0)}$

$$\begin{aligned}
& r \leftarrow b - Ax \\
& p \leftarrow r \\
& \textbf{while } \text{not yet converged } \textbf{do} \\
& \qquad q \leftarrow Ap \\
& \qquad \alpha \leftarrow p^T r \,/\, p^T q \\
& \qquad x \leftarrow x + \alpha p \\
& \qquad r \leftarrow r - \alpha q \\
& \qquad p \leftarrow r
\end{aligned}$$

Geometric Interpretation of Steepest Descent

Recall that in the Steepest Descent Method, we're attempting to minimize the functional

$$J(y) = \tfrac{1}{2}\, (y - x)^T A (y - x) - \tfrac{1}{2}\, x^T A x$$

by performing, at each iteration, an exact line search in the direction of the residual vector (the negative gradient). Geometrically, you can think of this as moving from your current iterate in a direction orthogonal to the contour through it, and continuing until the search line is tangent to another contour. The process continues until you are reasonably close to the minimum of the functional.

Notice that if the shape of the contours is close to circular, then the Steepest Descent Method reaches the minimum fairly quickly. If, on the other hand, the contours are fairly elongated, it can take a large number of iterations to reach the minimum, and Steepest Descent will be very inefficient.
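To make the contrast between round and elongated contours concrete, here is a minimal NumPy sketch of the one-mat-vec algorithm. The stopping rule (relative residual below `tol`), the default tolerance, the iteration cap, and the two diagonal test matrices are all assumptions made for illustration.

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-8, max_iter=10_000):
    """Steepest descent with one mat-vec per iteration.

    Stops when ||r|| / ||b|| < tol and returns (x, iterations used).
    """
    x = x0.astype(float).copy()
    r = b - A @ x
    p = r.copy()
    b_norm = np.linalg.norm(b)
    for k in range(max_iter):
        if np.linalg.norm(r) / b_norm < tol:
            return x, k
        q = A @ p                      # the only mat-vec in the loop
        alpha = (p @ r) / (p @ q)      # exact line search
        x += alpha * p
        r -= alpha * q                 # r <- r - alpha * A p
        p = r.copy()                   # next search direction is the residual
    return x, max_iter

b = np.array([1.0, 1.0])
x0 = np.zeros(2)

# Nearly circular contours (eigenvalues 1 and 1.1)
x_round, iters_round = steepest_descent(np.diag([1.0, 1.1]), b, x0)

# Strongly elongated contours (eigenvalues 1 and 100)
x_long, iters_long = steepest_descent(np.diag([1.0, 100.0]), b, x0)
```

Comparing `iters_round` with `iters_long` shows the effect described above: the second problem should need far more iterations, even though both systems are only $2 \times 2$.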

Let's analyze the case when convergence of Steepest Descent is very slow. We know that the functional $J$ that we're trying to minimize is a quadratic. To get a better idea of the contours of the functional, we transform $J$ into a new coordinate system. Since $A$ is symmetric it has the following eigen-decomposition:

$$A = U \Lambda U^T$$

where $\Lambda$ is a diagonal matrix made up of the eigenvalues of $A$ and $U$ is an orthogonal matrix whose columns are the eigenvectors of $A$. Substituting this into the functional (and dropping the constant term and scaling factor since they don't affect the minimization) we have

$$J(y) = (y - x)^T U \Lambda U^T (y - x) = \left[ U^T (y - x) \right]^T \Lambda \left[ U^T (y - x) \right]$$

We then define a new coordinate system by $z = U^T (y - x)$. Since $U$ is an orthogonal transformation, the contours of the functional are simply shifted and rotated. We now have our simplified functional

$$J(z) = z^T \Lambda z = \sum_{i=1}^{n} \lambda_i z_i^2$$

Let's think about this in two dimensions because we can actually visualize it. For $n = 2$ we have

$$J(z) = z^T \Lambda z = \lambda_1 z_1^2 + \lambda_2 z_2^2$$

The level curves of the functional look like

$$\lambda_1 z_1^2 + \lambda_2 z_2^2 = c$$

which form a family of concentric ellipses, with the solution of $Ax = b$ located at the center. We can rewrite the equation of the elliptical contours as

$$\frac{\lambda_1}{c}\, z_1^2 + \frac{\lambda_2}{c}\, z_2^2 = 1$$

which, if we assume that $\lambda_2 > \lambda_1 > 0$, indicates that the major and minor semi-axes of the ellipse have lengths $\sqrt{c/\lambda_1}$ and $\sqrt{c/\lambda_2}$, respectively. We can get an idea of how elongated the ellipses are by taking the ratio of the lengths of the two axes. This ratio is given by

$$\frac{\sqrt{c/\lambda_1}}{\sqrt{c/\lambda_2}} = \sqrt{\lambda_2 / \lambda_1}$$

Notice that since $A$ is symmetric we have

$$\|A\|_2 = \rho(A) = \max_{\lambda} |\lambda| \quad \text{and} \quad \|A^{-1}\|_2 = \rho\left(A^{-1}\right) = 1 / \min_{\lambda} |\lambda|$$

which for this $2 \times 2$ case gives

$$\lambda_2 / \lambda_1 = \lambda_2 \cdot \frac{1}{\lambda_1} = \|A\|_2\, \|A^{-1}\|_2 = \kappa_2(A)$$

so the ratio of the axes is $\sqrt{\kappa_2(A)}$. This indicates that the contours of the functional $J$ will be very elongated if $A$ is ill-conditioned (i.e. has a large condition number) and thus Steepest Descent will take a long time to converge.

Example: Consider solving the linear system $Ax = b$ where

$$A = \begin{bmatrix} 25 & 1 \\ 1 & 1 \end{bmatrix} \qquad b = \begin{bmatrix} 30 \\ 6 \end{bmatrix} \qquad x = \begin{bmatrix} 1 \\ 5 \end{bmatrix}$$

The eigenvalues of $A$ are approximately $\lambda_1 = 0.96$ and $\lambda_2 = 25.04$, which gives a condition number of approximately $\kappa_2(A) = 26.1$. The matrix does not seem particularly ill-conditioned, but the ratio of the axes of each contour ellipse is about $\sqrt{26.1} \approx 5.1$, so the contour ellipses are fairly elongated and we might expect Steepest Descent to converge slowly here. In fact, if we iterate until the relative residual $\|r\| / \|b\|$ falls below the stopping tolerance, it takes 43 iterations of Steepest Descent. 43 iterations for a $2 \times 2$ system! Can we do better than this?

Preconditioning

Strategy: Transform the linear system into an equivalent problem that is better conditioned, and then apply Steepest Descent to the new problem.

We transform the linear system by applying a symmetric approximate inverse of $A$ to the linear system. There are many choices for this approximate inverse, but one common choice is to use the $M$ from our matrix splitting methods (e.g. Jacobi, Symmetric Gauss-Seidel, etc.).

Suppose we choose a symmetric matrix $M$ that is a good approximation to $A$. We'd then have

$$Ax = b \quad \Longrightarrow \quad M^{-1} A x = M^{-1} b \quad \Longrightarrow \quad \hat{A} x = \hat{b}$$

The general idea is that we'd now apply our Steepest Descent method to the modified linear system $\hat{A} x = \hat{b}$. Unfortunately, the way we've done it, $\hat{A} = M^{-1} A$ is in general no longer symmetric, even if $M$ is, so while we could probably apply it as a preconditioner for Steepest Descent, we wouldn't be able to use it with CG.

Instead we consider the following: Choose a symmetric positive definite $M$ and compute its Cholesky decomposition $M = R^T R$. Note that we have $M^{-1} = R^{-1} R^{-T}$. Multiplying through by $R^{-T}$ we have

$$Ax = b \quad \Longrightarrow \quad R^{-T} A x = R^{-T} b \quad \Longrightarrow \quad R^{-T} A R^{-1} R x = R^{-T} b \quad \Longrightarrow \quad \hat{A} \hat{x} = \hat{b}$$

where

$$\hat{A} = R^{-T} A R^{-1} \qquad \hat{x} = R x \qquad \hat{b} = R^{-T} b \qquad \hat{r} = \hat{b} - \hat{A} \hat{x} = R^{-T} r$$

Notice that $\hat{A}$ is symmetric (and positive definite), so we can apply Steepest Descent (or CG) to the system $\hat{A} \hat{x} = \hat{b}$. Then, once we've reached a good enough solution for $\hat{x}$, we get the solution to our original problem by $x = R^{-1} \hat{x}$.

Example: Consider the same example problem as before. One very popular (but not always that effective) choice of preconditioner is the matrix with just the main diagonal of $A$. This is called the Diagonal Preconditioner in most literature. (We will show later that this is equivalent to using one step of the Jacobi iteration with a zero initial guess.) So for this problem we'd take

$$M = \begin{bmatrix} 25 & 0 \\ 0 & 1 \end{bmatrix} = R^T R = \begin{bmatrix} 5 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 5 & 0 \\ 0 & 1 \end{bmatrix}$$

which gives

$$\hat{A} = \begin{bmatrix} 1/5 & 0 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 25 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} 1/5 & 0 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 1/5 \\ 1/5 & 1 \end{bmatrix}$$

The eigenvalues of $\hat{A}$ are $0.8$ and $1.2$, so its condition number is $\kappa_2(\hat{A}) = 1.5$, which means that the ratio of the major and minor axes of the contour ellipses is $\sqrt{1.5} \approx 1.2$, which is much better than 5. In fact, if we apply Steepest Descent to the preconditioned problem, we converge to a relative residual under the same tolerance in just 15 iterations.
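The condition numbers in this example can be reproduced with a few lines of NumPy. This is a sketch under one stated assumption: the matrix $A$ used below is the symmetric matrix consistent with $b = (30, 6)^T$, $x = (1, 5)^T$, and the preconditioned matrix $\hat{A}$ of the example. All variable names are illustrative.

```python
import numpy as np

# Assumed example data: this A satisfies A x = b for x = (1, 5), b = (30, 6)
A = np.array([[25.0, 1.0], [1.0, 1.0]])
x_true = np.array([1.0, 5.0])
b = A @ x_true

kappa_A = np.linalg.cond(A, 2)            # 2-norm condition number of A

# Diagonal preconditioner M = diag(A) with Cholesky factor R, M = R^T R.
# numpy.linalg.cholesky returns the lower-triangular factor, so transpose it.
M = np.diag(np.diag(A))
R = np.linalg.cholesky(M).T
R_inv = np.linalg.inv(R)

A_hat = R_inv.T @ A @ R_inv               # A_hat = R^{-T} A R^{-1}
kappa_A_hat = np.linalg.cond(A_hat, 2)

# Elongation of the contour ellipses before and after preconditioning
axis_ratio = np.sqrt(kappa_A)
axis_ratio_hat = np.sqrt(kappa_A_hat)
```

Forming $R^{-1}$ explicitly is fine for a $2 \times 2$ illustration, but it is not something one would do for a large system.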

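Implementation

So far the preconditioning methodology that we've come up with requires us to do the Cholesky decomposition of $M$, transform the linear system, and then apply the Steepest Descent Method. This makes for a pretty large overhead. It would be nice if we could skip a lot of this, and it turns out we can.

Consider applying Steepest Descent to the modified linear system $\hat{A} \hat{x} = \hat{b}$. The general algorithm looks like

$$\begin{aligned}
& \hat{r} \leftarrow \hat{b} - \hat{A} \hat{x} \\
& \hat{p} \leftarrow \hat{r} \\
& \textbf{while } \text{not yet converged } \textbf{do} \\
& \qquad \hat{q} \leftarrow \hat{A} \hat{p} \\
& \qquad \alpha \leftarrow \hat{p}^T \hat{r} \,/\, \hat{p}^T \hat{q} \\
& \qquad \hat{x} \leftarrow \hat{x} + \alpha \hat{p} \\
& \qquad \hat{r} \leftarrow \hat{r} - \alpha \hat{q} \\
& \qquad \hat{p} \leftarrow \hat{r}
\end{aligned}$$

We know the explicit transformations for $x$, $b$, and $A$. We need to define them for the vectors $p$ and $q$ as well. Since $\hat{p}$ gets added to $\hat{x}$, it makes sense to define the transformation from $p$ to $\hat{p}$ in a way analogous to $x$. Then

$$\hat{x} = R x \quad \text{so we define} \quad \hat{p} = R p$$

Similarly, since $\hat{q}$ appears with $\hat{r}$:

$$\hat{r} = R^{-T} r \quad \text{so we define} \quad \hat{q} = R^{-T} q$$

Now we can write the steps in the Steepest Descent algorithm involving the hat vectors in terms of their transformations of the original vectors, and we see that a lot of the transformations cancel out. For instance

$$\hat{x} \leftarrow \hat{x} + \alpha \hat{p} \quad \Longrightarrow \quad R x \leftarrow R x + \alpha R p \quad \Longrightarrow \quad x \leftarrow x + \alpha p$$

Similarly, for the residual update we have

$$\hat{r} \leftarrow \hat{r} - \alpha \hat{q} \quad \Longrightarrow \quad R^{-T} r \leftarrow R^{-T} r - \alpha R^{-T} q \quad \Longrightarrow \quad r \leftarrow r - \alpha q$$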

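In fact, the transformation drops out for almost every step in the Steepest Descent algorithm. We have

$$\begin{aligned}
& r \leftarrow b - Ax \\
& p \leftarrow \ ? \\
& \textbf{while } \text{not yet converged } \textbf{do} \\
& \qquad q \leftarrow Ap \\
& \qquad \alpha \leftarrow p^T r \,/\, p^T q \\
& \qquad x \leftarrow x + \alpha p \\
& \qquad r \leftarrow r - \alpha q \\
& \qquad p \leftarrow \ ?
\end{aligned}$$

Notice, the only step in the algorithm that needs to be changed is the updating of the search direction. This happens because the transformations of $p$ and $r$ are not the same (i.e. one involves $R$ and the other $R^{-T}$). But it's still pretty easy to figure out:

$$\hat{p} \leftarrow \hat{r} \quad \Longrightarrow \quad R p \leftarrow R^{-T} r \quad \Longrightarrow \quad p \leftarrow R^{-1} R^{-T} r = M^{-1} r$$

So the final algorithm becomes

Function PreconditionedSteepestDescent()
Input: Matrix $A$, preconditioner $M$, right-hand side vector $b$, initial guess $x^{(0)}$

$$\begin{aligned}
& r \leftarrow b - Ax \\
& p \leftarrow M^{-1} r \\
& \textbf{while } \text{not yet converged } \textbf{do} \\
& \qquad q \leftarrow Ap \\
& \qquad \alpha \leftarrow p^T r \,/\, p^T q \\
& \qquad x \leftarrow x + \alpha p \\
& \qquad r \leftarrow r - \alpha q \\
& \qquad p \leftarrow M^{-1} r
\end{aligned}$$

So the only thing that changes in the preconditioned descent algorithm is that instead of choosing the residual as the search direction, we apply $M^{-1}$ to the residual and use that instead. Notice that this means that we don't actually have to compute the Cholesky factorization of $M$; we only need a way to apply $M^{-1}$ to a vector.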
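A minimal NumPy sketch of the preconditioned algorithm follows. The preconditioner is passed as a function `apply_Minv` that returns $M^{-1} r$; that interface, the stopping rule, the tolerance, and the starting point are assumptions for illustration. The test matrix is the symmetric matrix consistent with the example's $b = (30, 6)^T$ and $x = (1, 5)^T$, so the iteration counts here need not match the ones quoted in the notes.

```python
import numpy as np

def preconditioned_sd(A, b, x0, apply_Minv, tol=1e-8, max_iter=10_000):
    """Preconditioned steepest descent; apply_Minv(r) returns M^{-1} r.

    Stops when ||r|| / ||b|| < tol and returns (x, iterations used).
    """
    x = x0.astype(float).copy()
    r = b - A @ x
    p = apply_Minv(r)
    b_norm = np.linalg.norm(b)
    for k in range(max_iter):
        if np.linalg.norm(r) / b_norm < tol:
            return x, k
        q = A @ p
        alpha = (p @ r) / (p @ q)
        x += alpha * p
        r -= alpha * q
        p = apply_Minv(r)              # the only change from plain steepest descent
    return x, max_iter

A = np.array([[25.0, 1.0], [1.0, 1.0]])   # assumed example matrix
b = np.array([30.0, 6.0])
x0 = np.array([1.0, 0.0])                 # arbitrary starting point (an assumption)
d = np.diag(A)                            # diagonal of A, used as M = D

# M = I recovers plain steepest descent; M = D is the diagonal preconditioner
x_plain, it_plain = preconditioned_sd(A, b, x0, lambda r: r.copy())
x_diag, it_diag = preconditioned_sd(A, b, x0, lambda r: r / d)
```

Passing the preconditioner as a function reflects the point made above: the algorithm never needs $M$ or its factorization, only the action of $M^{-1}$ on a vector.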

Examples of Preconditioners

Jacobi and the Diagonal Preconditioner: The preconditioning step in the algorithm occurs when we set the new search direction equal to $M^{-1}$ times the residual. In other words,

$$p \leftarrow M^{-1} r$$

If we're using the Diagonal Preconditioner this just means we use $p \leftarrow D^{-1} r$. We can also show that this is equivalent to applying one iteration of Jacobi to the system $Ap = r$ using a zero initial guess. Recall that, with the splitting $A = D - L - U$, the general form of an iteration of Jacobi applied to the linear system $Ap = r$ is given by

$$p^{(1)} \leftarrow D^{-1} (L + U)\, p^{(0)} + D^{-1} r$$

If we choose the zero initial guess $p^{(0)} = 0$ then this reduces to

$$p \leftarrow D^{-1} r$$

which is exactly the same as the Diagonal Preconditioner.

Symmetric Gauss-Seidel: In order to use Gauss-Seidel and keep the preconditioned system symmetric, we have to use a modified form of Gauss-Seidel known as Symmetric Gauss-Seidel. In order to derive the method we have to think about the efficient implementation of the standard Gauss-Seidel iteration. Recall that in terms of a matrix splitting, one Gauss-Seidel iteration applied to the linear system $Ap = r$ is given by

$$p^{(1)} \leftarrow (D - L)^{-1} U p^{(0)} + (D - L)^{-1} r$$

Those of you who chose to implement Gauss-Seidel for the homework using loops instead of matrices will recognize one iteration of Gauss-Seidel as

$$\textbf{for } i = 1 \textbf{ to } n \textbf{ do} \qquad p_i^{(k+1)} \leftarrow \frac{1}{a_{ii}} \left( r_i - \sum_{j=1}^{i-1} a_{ij}\, p_j^{(k+1)} - \sum_{j=i+1}^{n} a_{ij}\, p_j^{(k)} \right)$$

Now, the trick to symmetrizing Gauss-Seidel is to do two iterations of Gauss-Seidel, but loop over the variables in different orders. You do the first iteration as written above, and then for the second iteration you start at the bottom of the vector and loop backwards up to the top. In other words you do

$$\textbf{for } i = n \textbf{ down to } 1 \textbf{ do} \qquad p_i^{(k+1)} \leftarrow \frac{1}{a_{ii}} \left( r_i - \sum_{j=1}^{i-1} a_{ij}\, p_j^{(k)} - \sum_{j=i+1}^{n} a_{ij}\, p_j^{(k+1)} \right)$$

Because the loop runs backwards, the entries below row $i$ have already been updated when row $i$ is processed, while the entries above it still hold their old values. It is easy to check that Gauss-Seidel with a backwards loop can be written as a matrix splitting iteration, with the form

$$p^{(1)} \leftarrow (D - U)^{-1} L p^{(0)} + (D - U)^{-1} r$$

Because of the ordering of the loops, the traditional Gauss-Seidel method we discussed in class is sometimes called Forward Gauss-Seidel (FGS) and the iteration with the reverse loop is called Backward Gauss-Seidel (BGS). One iteration of Forward Gauss-Seidel, followed by an iteration of Backward Gauss-Seidel, is considered one iteration of Symmetric Gauss-Seidel (SGS).

Note: If you're purely interested in being able to implement a Symmetric Gauss-Seidel preconditioner, then you can stop at this point. The gist of the method is that you apply the preconditioner by running one iteration of SGS on $Ap = r$ with a zero initial guess. If, however, you want to see how this thing could possibly be symmetric, then continue reading. Be warned that it gets a little algebra-heavy.

Now, let's think about applying one iteration of Symmetric Gauss-Seidel with a zero initial guess for doing the preconditioning in steepest descent. In other words, we will do one iteration of SGS applied to the linear system $Ap = r$ with $p^{(0)} = 0$.

One iteration of FGS:

$$p^{(1)} \leftarrow (D - L)^{-1} U p^{(0)} + (D - L)^{-1} r \quad \Longrightarrow \quad p^{(1)} \leftarrow (D - L)^{-1} r$$

One iteration of BGS:

$$\begin{aligned}
p^{(2)} &\leftarrow (D - U)^{-1} L p^{(1)} + (D - U)^{-1} r \\
p^{(2)} &\leftarrow (D - U)^{-1} L (D - L)^{-1} r + (D - U)^{-1} r \\
p^{(2)} &\leftarrow \left[ (D - U)^{-1} L (D - L)^{-1} + (D - U)^{-1} \right] r
\end{aligned}$$

So, we can think of doing one iteration of SGS on $Ap = r$ as

$$p \leftarrow \left[ (D - U)^{-1} L (D - L)^{-1} + (D - U)^{-1} \right] r = M^{-1} r$$

We can of course simplify this expression to make it a bit more manageable. Let's concentrate on just the matrix

$$\begin{aligned}
M^{-1} &= \left[ (D - U)^{-1} L (D - L)^{-1} + (D - U)^{-1} \right] \\
&= (D - U)^{-1} \left[ L (D - L)^{-1} + I \right] \\
&= (D - U)^{-1} \left[ L (D - L)^{-1} + (D - L)(D - L)^{-1} \right] \\
&= (D - U)^{-1} \left[ L + (D - L) \right] (D - L)^{-1} \\
&= (D - U)^{-1} D (D - L)^{-1}
\end{aligned}$$

So, one preconditioning step of SGS involves multiplying the residual by $M^{-1} = (D - U)^{-1} D (D - L)^{-1}$. We now want to convince ourselves that the matrix $M$ is symmetric. To do this we note that

$$M = (D - L)\, D^{-1} (D - U)$$

Now, note that by construction, if the matrix $A$ is symmetric, then $U$ and $L$ are transposes of each other. In other words, $U^T = L$ and $L^T = U$. Notice also that since $D$ is diagonal we have $D^T = D$ and $D^{-T} = D^{-1}$. We then have

$$\begin{aligned}
M^T &= \left[ (D - L)\, D^{-1} (D - U) \right]^T \\
&= (D - U)^T D^{-T} (D - L)^T \\
&= \left( D^T - U^T \right) D^{-T} \left( D^T - L^T \right) \\
&= (D - L)\, D^{-1} (D - U) = M
\end{aligned}$$

so $M$ is symmetric.
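The equivalence between the two sweeps and the matrix form can be checked numerically. The following is a sketch assuming NumPy and the splitting $A = D - L - U$; the function name `sgs_apply` and the small tridiagonal test matrix are illustrative choices.

```python
import numpy as np

def sgs_apply(A, r):
    """One symmetric Gauss-Seidel iteration on A p = r, starting from p = 0."""
    n = len(r)
    p = np.zeros(n)
    # Forward sweep: rows 1..n, using already-updated entries above row i
    for i in range(n):
        s = A[i, :i] @ p[:i] + A[i, i + 1:] @ p[i + 1:]
        p[i] = (r[i] - s) / A[i, i]
    # Backward sweep: rows n..1, using already-updated entries below row i
    for i in range(n - 1, -1, -1):
        s = A[i, :i] @ p[:i] + A[i, i + 1:] @ p[i + 1:]
        p[i] = (r[i] - s) / A[i, i]
    return p

A = np.array([[4.0, -1.0, 0.0],
              [-1.0, 4.0, -1.0],
              [0.0, -1.0, 4.0]])          # illustrative s.p.d. matrix
r = np.array([1.0, 2.0, 3.0])

# With A = D - L - U, L and U are the negated strict triangles of A
D = np.diag(np.diag(A))
L = -np.tril(A, k=-1)
U = -np.triu(A, k=1)
M = (D - L) @ np.linalg.inv(D) @ (D - U)

p_sweeps = sgs_apply(A, r)                # preconditioner applied via loops
p_matrix = np.linalg.solve(M, r)          # preconditioner applied via M^{-1} r
```

Because the sweeps update `p` in place, each row automatically sees the newest available values, which is what distinguishes Gauss-Seidel from Jacobi. In practice one would use the sweeps and never form $M$; the matrix is built here only to compare the two results and to confirm that $M$ equals its transpose.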
