Solutions and Notes to Selected Problems In: Numerical Optimization by Jorge Nocedal and Stephen J. Wright.


John L. Weatherwax
July 7,

Chapter 5 (Conjugate Gradient Methods)

Notes on the Text

Notes on the linear conjugate gradient method

The objective function we seek to minimize is given by
\[ \phi(x) = \frac{1}{2} x^T A x - b^T x. \tag{1} \]
This will be minimized by taking a step from the current best guess at the minimum, $x_k$, in the conjugate direction $p_k$, to get a new best guess
\[ x_{k+1} = x_k + \alpha p_k. \tag{2} \]
To find the value of $\alpha$ to use in the step given in Equation 2, we select the value of $\alpha$ that minimizes $\phi(\alpha) \equiv \phi(x_k + \alpha p_k)$. To find this value we take the derivative of this expression with respect to $\alpha$, set the resulting expression equal to zero, and solve for $\alpha$. With $r_k = A x_k - b$, the expression $\phi(x_k + \alpha p_k)$ can be written as
\[
\begin{aligned}
\phi(\alpha) &\equiv \frac{1}{2} (x_k + \alpha p_k)^T A (x_k + \alpha p_k) - b^T (x_k + \alpha p_k) \\
&= \frac{1}{2} x_k^T A x_k + \alpha x_k^T A p_k + \frac{1}{2} \alpha^2 p_k^T A p_k - b^T x_k - \alpha b^T p_k \\
&= \phi(x_k) + \frac{1}{2} \alpha^2 p_k^T A p_k + \alpha (p_k^T A x_k - p_k^T b) \\
&= \phi(x_k) + \frac{1}{2} \alpha^2 p_k^T A p_k + \alpha r_k^T p_k.
\end{aligned}
\]
Setting the derivative of this expression equal to zero gives
\[ p_k^T A p_k \, \alpha + r_k^T p_k = 0. \]
Solving for $\alpha$ we get the following for the conjugate direction stepsize:
\[ \alpha = -\frac{r_k^T p_k}{p_k^T A p_k}, \tag{3} \]
which is equation 5.6 in the book. Notationally we can add a subscript $k$ to the variable $\alpha$, as in $\alpha_k$, to denote that this is the step size taken in moving from $x_k$ to $x_{k+1}$.

The residual $r$ is defined as $r = Ax - b$, and to get the $(k+1)$st minimization estimate $x_{k+1}$ from $x_k$ we use Equation 2. If we premultiply Equation 2 by $A$ and subtract $b$ from both sides we get
\[ A x_{k+1} - b = A x_k - b + \alpha_k A p_k, \]
or in terms of the residuals we get the residual update equation
\[ r_{k+1} = r_k + \alpha_k A p_k, \tag{4} \]
which is the book's equation 5.9.
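As a quick numerical illustration of Equations 2, 3 and 4, the following is a minimal Python sketch of a single step along a given direction. The helper names (`matvec`, `dot`, `conjugate_direction_step`) and the small $2 \times 2$ matrix are assumptions made for this example only; plain Python lists stand in for vectors and matrices.

```python
def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(u, v):
    """Euclidean inner product of two vectors stored as lists."""
    return sum(ui * vi for ui, vi in zip(u, v))


def conjugate_direction_step(A, b, x, p):
    """Take one step x + alpha * p, with alpha chosen as in Equation 3.

    Returns (alpha, x_new, r_new), where r_new comes from the residual
    update of Equation 4 rather than from recomputing A x_new - b.
    """
    r = [ax - bi for ax, bi in zip(matvec(A, x), b)]       # r_k = A x_k - b
    Ap = matvec(A, p)
    alpha = -dot(r, p) / dot(p, Ap)                        # Equation 3
    x_new = [xi + alpha * pi for xi, pi in zip(x, p)]      # Equation 2
    r_new = [ri + alpha * api for ri, api in zip(r, Ap)]   # Equation 4
    return alpha, x_new, r_new


# Illustrative data (an assumption): a small symmetric positive definite A.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x0 = [0.0, 0.0]
p0 = [1.0, 2.0]          # equals -r_0, since r_0 = A x_0 - b = -b here

alpha0, x1, r1 = conjugate_direction_step(A, b, x0, p0)

# The updated residual should agree with A x_1 - b computed directly.
r1_direct = [ax - bi for ax, bi in zip(matvec(A, x1), b)]
```

Comparing `r1` with `r1_direct` is a cheap consistency check on Equation 4, and `dot(r1, p0)` should vanish up to rounding, since an exact line search leaves the new residual orthogonal to the direction just searched.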

The initial induction step in the expanding subspace minimization theorem

I found it a bit hard to verify the truth of the initial inductive statement used in the proof of the expanding subspace minimization theorem presented in the book. In that proof the initial step requires that one verify $r_1^T p_0 = 0$. This can be shown as follows. The first residual $r_1$ is given by
\[ r_1 = \nabla \phi(x_0 + \alpha_0 p_0) = A(x_0 + \alpha_0 p_0) - b. \]
Thus the inner product of $r_1$ with $p_0$ gives
\[ r_1^T p_0 = (A x_0 + \alpha_0 A p_0 - b)^T p_0 = x_0^T A p_0 + \alpha_0 p_0^T A p_0 - b^T p_0. \]
If we put in the value $\alpha_0 = -\frac{r_0^T p_0}{p_0^T A p_0}$ the above expression becomes
\[
\begin{aligned}
r_1^T p_0 &= x_0^T A p_0 - \frac{r_0^T p_0}{p_0^T A p_0} \, p_0^T A p_0 - b^T p_0 \\
&= (x_0^T A - b^T) p_0 - r_0^T p_0 \\
&= r_0^T p_0 - r_0^T p_0 = 0,
\end{aligned}
\]
as we were to show.

Notes on the basic properties of the conjugate gradient method

We want to pick a value for $\beta_k$ such that the new direction $p_k$ given by
\[ p_k = -r_k + \beta_k p_{k-1} \]
is $A$-conjugate to the old direction $p_{k-1}$. To enforce this conjugacy, multiply the expression above on the left by $p_{k-1}^T A$ to get
\[ p_{k-1}^T A p_k = -p_{k-1}^T A r_k + \beta_k p_{k-1}^T A p_{k-1}. \]
If we set the left-hand side of this expression equal to zero and solve for $\beta_k$ we get
\[ \beta_k = \frac{p_{k-1}^T A r_k}{p_{k-1}^T A p_{k-1}}. \tag{5} \]
In this case the new value of $p_k$ will be $A$-conjugate to the old value $p_{k-1}$.

Notes on the preliminary version of the conjugate gradient method

The stepsize in the conjugate direction is given by Equation 3. If we use the conjugate update equation
\[ p_k = -r_k + \beta_k p_{k-1}, \tag{6} \]

and the residual prior-conjugate orthogonality condition given by
\[ r_k^T p_j = 0 \quad \text{for } 0 \le j < k, \tag{7} \]
then multiplying Equation 6 by $r_k^T$ on the left gives
\[ r_k^T p_k = -r_k^T r_k + \beta_k r_k^T p_{k-1} = -r_k^T r_k. \]
Using this fact in Equation 3 we have an alternative expression for $\alpha_k$ given by
\[ \alpha_k = \frac{r_k^T r_k}{p_k^T A p_k}. \tag{8} \]
In the preliminary conjugate gradient algorithm the stepsize $\beta_{k+1}$ in the conjugate direction is given by Equation 5, which (using the symmetry of $A$) reads $\beta_{k+1} = r_{k+1}^T A p_k / (p_k^T A p_k)$. Using the residual update equation $r_{k+1} = r_k + \alpha_k A p_k$ to replace $A p_k$ in this expression we get
\[ \beta_{k+1} = \frac{\frac{1}{\alpha_k} r_{k+1}^T (r_{k+1} - r_k)}{\frac{1}{\alpha_k} p_k^T (r_{k+1} - r_k)} = \frac{r_{k+1}^T r_{k+1}}{p_k^T (r_{k+1} - r_k)}, \]
since $r_{k+1}^T r_k = 0$ by residual-residual orthogonality,
\[ r_k^T r_i = 0 \quad \text{for } i = 0, 1, \ldots, k-1. \tag{9} \]
Using the conjugate update equation $p_k = -r_k + \beta_k p_{k-1}$ in the denominator above, we get a new denominator given by
\[ (-r_k + \beta_k p_{k-1})^T (r_{k+1} - r_k). \]
Next, using residual prior-conjugate orthogonality (Equation 7), or $r_k^T p_i = 0$ for $i = 0, 1, \ldots, k-1$, both $r_{k+1}^T p_{k-1}$ and $r_k^T p_{k-1}$ vanish, and with $r_k^T r_{k+1} = 0$ the denominator reduces to $r_k^T r_k$. We therefore have
\[ \beta_{k+1} = \frac{r_{k+1}^T r_{k+1}}{r_k^T r_k}, \tag{10} \]
as we wanted to show.

Notes on the rate of convergence of the conjugate gradient method

To study the convergence of the conjugate gradient method we first argue that the minimization problem we originally posed, that of minimizing $\phi(x) = \frac{1}{2} x^T A x - b^T x$ over $x$, is equivalent to the problem of minimizing a norm squared expression, namely $\|x - x^*\|_A^2$. If we denote the minimizer of $\phi(x)$ by $x^*$, so that $x^*$ solves $A x^* = b$, then the minimum of $\phi(x)$ is given by
\[ \phi(x^*) = \frac{1}{2} x^{*T} A x^* - b^T x^* = \frac{1}{2} x^{*T} b - b^T x^* = -\frac{1}{2} b^T x^*. \]

To show that these two minimization problems are equivalent consider the expression $\frac{1}{2} \|x - x^*\|_A^2$. We have
\[ \frac{1}{2} \|x - x^*\|_A^2 = \frac{1}{2} (x - x^*)^T A (x - x^*) = \frac{1}{2} x^T A x - x^T A x^* + \frac{1}{2} x^{*T} A x^*. \]
Since $A x^* = b$ the above becomes
\[
\begin{aligned}
\frac{1}{2} \|x - x^*\|_A^2 &= \frac{1}{2} x^T A x - x^T b + \frac{1}{2} x^{*T} A x^* \\
&= \phi(x) + \frac{1}{2} x^{*T} A x^* \\
&= \phi(x) - \frac{1}{2} x^{*T} A x^* + x^{*T} A x^* \\
&= \phi(x) - \frac{1}{2} x^{*T} A x^* + x^{*T} b \\
&= \phi(x) - \phi(x^*),
\end{aligned}
\]
verifying the corresponding equation in the book.

To study convergence of the conjugate gradient method we decompose the difference between our initial guess at the solution, $x_0$, and the true solution, $x^*$, in terms of the orthonormal eigenvectors $v_i$ of $A$ as
\[ x_0 - x^* = \sum_{i=1}^n \xi_i v_i. \]
When we do this, the difference between the $(k+1)$st iterate and $x^*$ is given by
\[ x_{k+1} - x^* = \sum_{i=1}^n (1 + \lambda_i P_k^*(\lambda_i)) \xi_i v_i. \]
Then in terms of the $A$-norm this distance is given by
\[
\begin{aligned}
\|x_{k+1} - x^*\|_A^2 &= \left( \sum_{i=1}^n (1 + \lambda_i P_k^*(\lambda_i)) \xi_i v_i \right)^T A \left( \sum_{j=1}^n (1 + \lambda_j P_k^*(\lambda_j)) \xi_j v_j \right) \\
&= \sum_{i=1}^n \sum_{j=1}^n \xi_i (1 + \lambda_i P_k^*(\lambda_i)) \, \xi_j (1 + \lambda_j P_k^*(\lambda_j)) \, v_i^T A v_j.
\end{aligned}
\]
As the $v_j$ are orthonormal eigenvectors of $A$ we have $A v_j = \lambda_j v_j$, with $v_i^T v_j = 0$ for $i \ne j$ and $v_i^T v_i = 1$, so that the above becomes
\[ \sum_{i=1}^n \xi_i^2 (1 + \lambda_i P_k^*(\lambda_i))^2 \lambda_i, \]
as claimed by the book.

Notes on the Polak-Ribiere Method

In this subsection of these notes we derive an alternative expression for $\beta_{k+1}$ that is used in the conjugate-direction update Equation 6 and that is valid for nonlinear optimization

problems. Since our state update equation is given by $x_{k+1} = x_k + \alpha_k p_k$, the gradient of $f(x)$ at the new point $x_{k+1}$ can be written using Taylor's theorem as
\[ \nabla f_{k+1} = \nabla f_k + \alpha_k \bar{G}_k p_k, \]
where $\bar{G}_k$ is the average Hessian over the line segment $[x_k, x_{k+1}]$. Requiring that the new conjugate search direction $p_{k+1}$, derived from the standard conjugate direction update equation
\[ p_{k+1} = -\nabla f_{k+1} + \beta_{k+1} p_k, \]
be conjugate with respect to the average Hessian $\bar{G}_k$ means that
\[ p_{k+1}^T \bar{G}_k p_k = 0, \]
or, using the expression for $p_{k+1}$,
\[ -\nabla f_{k+1}^T \bar{G}_k p_k + \beta_{k+1} p_k^T \bar{G}_k p_k = 0. \]
So this latter expression requires that
\[ \beta_{k+1} = \frac{\nabla f_{k+1}^T \bar{G}_k p_k}{p_k^T \bar{G}_k p_k}. \]
Recognizing that $\bar{G}_k p_k = \frac{\nabla f_{k+1} - \nabla f_k}{\alpha_k}$, the expression for $\beta_{k+1}$ above becomes
\[ \beta_{k+1} = \frac{\nabla f_{k+1}^T (\nabla f_{k+1} - \nabla f_k)}{(\nabla f_{k+1} - \nabla f_k)^T p_k}, \tag{11} \]
which is the Hestenes-Stiefel formula given in the book.

Problem Solutions

Problem 5.1 (the conjugate gradient algorithm on Hilbert matrices)

See the MATLAB code prob 1.m where we call the MATLAB routine cgsolve.m to solve $Ax = b$ when $A$ is the $n \times n$ Hilbert matrix (generated in MATLAB using the built-in function hilb.m), $b$ is a vector of all ones, and the initial guess at the minimum is $x_0 = 0$. We do this for the values of $n$ suggested in the text and then plot the number of CG iterations needed to drive the residual below the convergence tolerance. The result of this calculation is presented in Figure 1. Convergence here is defined in terms of the relative residual: the iteration stops once $\|Ax - b\| / \|b\|$ falls below the tolerance. We see that the number of iterations grows relatively quickly with the dimension of the matrix $A$.
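The experiment described in Problem 5.1 can be sketched in Python roughly as follows. This is not the author's `cgsolve.m`; it is a minimal pure-Python stand-in built from Equations 2, 4, 6, 8 and 10 of these notes. The function names, the default tolerance `1e-6`, the iteration cap and the list of matrix sizes are all assumptions chosen for illustration, and the stopping test uses the recursively updated residual rather than recomputing $Ax - b$.

```python
from math import sqrt


def hilbert(n):
    """n x n Hilbert matrix, entries 1 / (i + j - 1) for 1-based i and j."""
    return [[1.0 / (i + j + 1) for j in range(n)] for i in range(n)]


def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def cg(A, b, tol=1e-6, max_iter=10000):
    """Linear CG started at x_0 = 0; returns (x, number of iterations).

    Stops when ||r_k|| / ||b|| < tol (tol is an assumed value) or when
    max_iter iterations have been taken.
    """
    x = [0.0] * len(b)
    r = [-bi for bi in b]              # r_0 = A x_0 - b with x_0 = 0
    p = [-ri for ri in r]              # p_0 = -r_0
    rr = dot(r, r)
    b_norm = sqrt(dot(b, b))
    k = 0
    while sqrt(rr) >= tol * b_norm and k < max_iter:
        Ap = matvec(A, p)
        alpha = rr / dot(p, Ap)                             # Equation 8
        x = [xi + alpha * pi for xi, pi in zip(x, p)]       # Equation 2
        r = [ri + alpha * api for ri, api in zip(r, Ap)]    # Equation 4
        rr_new = dot(r, r)
        beta = rr_new / rr                                  # Equation 10
        p = [-ri + beta * pi for ri, pi in zip(r, p)]       # Equation 6
        rr = rr_new
        k += 1
    return x, k


if __name__ == "__main__":
    for n in (5, 8, 12, 20):           # illustrative sizes (an assumption)
        _, iterations = cg(hilbert(n), [1.0] * n)
        print(n, iterations)
```

Because the Hilbert matrix is so ill-conditioned, the iteration counts printed by a sketch like this depend on the tolerance and on floating-point details, so they should be read as qualitative rather than as a reproduction of Figure 1.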

[Figure 1 plot. Vertical axis: number of iterations required to reduce the error below the tolerance. Horizontal axis: size of Hilbert matrix (n).]

Figure 1: The number of iterations needed for convergence of the CG algorithm when applied to the (classically ill-conditioned) Hilbert matrix.

Problem 5.2 (if $p_i$ are conjugate w.r.t. $A$ then $p_i$ are linearly independent)

The book's equation 5.4 is the statement that $p_i^T A p_j = 0$ for all $i \ne j$. To show that the (nonzero) vectors $p_i$ are linearly independent we begin by assuming that they are not and show that this leads to a contradiction. That the $p_i$ are not linearly independent means that we can find constants $\alpha_i$ (not all zero) such that
\[ \sum_i \alpha_i p_i = 0. \]
If we premultiply the above summation by the matrix $A$ we get
\[ \sum_i \alpha_i A p_i = 0. \]
Next premultiply the above by $p_j^T$ to get
\[ \alpha_j p_j^T A p_j = 0, \]
since $p_j^T A p_i = 0$ for all $i \ne j$. Since $A$ is positive definite, the term $p_j^T A p_j > 0$, so we can divide by it and conclude that $\alpha_j = 0$. Since the above is true for each value of $j$, every $\alpha_j$ is zero. This contradicts the assumption that the $\alpha_i$ are not all zero, so the $p_i$ must be linearly independent.

Problem 5.3 (verification of the conjugate direction stepsize $\alpha_k$)

See the discussion around Equation 3 in these notes where the requested expression is derived.

Problem 5.4 (strongly convex when we step along the conjugate directions $p_k$)

I think this problem is supposed to state that $f(x)$ is a strongly convex quadratic, which means that $f(x)$ has a functional form given by $f(x) = \frac{1}{2} x^T A x - b^T x$. In such a case when we take $x$ of the form
\[ x = x_0 + \sigma_0 p_0 + \sigma_1 p_1 + \cdots + \sigma_{k-1} p_{k-1}, \]
we can write $f$ as a strongly convex function in the variables $\sigma = (\sigma_0, \sigma_1, \ldots, \sigma_{k-1})^T$.

Problem 5.5 (the conjugate directions $p_i$ span the Krylov subspace)

We want to show that 5.16 and 5.17 hold for $k = 1$. The book's equation 5.16 is
\[ \operatorname{span}\{r_0, r_1\} = \operatorname{span}\{r_0, A r_0\}. \]
To show this equivalence we first need to show that $r_1 \in \operatorname{span}\{r_0, A r_0\}$. Now $r_1$ can be written as
\[
\begin{aligned}
r_1 &= A x_1 - b \\
&= A(x_0 + \alpha_0 p_0) - b && \text{using the expression for } x_1 \\
&= A x_0 - b + \alpha_0 A p_0 \\
&= r_0 + \alpha_0 A p_0 && \text{since } r_0 = A x_0 - b \\
&= r_0 - \alpha_0 A r_0 && \text{since } p_0 = -r_0,
\end{aligned}
\]
showing that $r_1 \in \operatorname{span}\{r_0, A r_0\}$, and thus $\operatorname{span}\{r_0, r_1\} \subseteq \operatorname{span}\{r_0, A r_0\}$. Showing the other direction, that is $\operatorname{span}\{r_0, A r_0\} \subseteq \operatorname{span}\{r_0, r_1\}$, is the same as noting that we can perform the manipulations above in the other direction: since $\alpha_0 \ne 0$ we can solve for $A r_0 = (r_0 - r_1)/\alpha_0$.

The book's equation 5.17 when $k = 1$ is
\[ \operatorname{span}\{p_0, p_1\} = \operatorname{span}\{r_0, A r_0\}. \]
Since $p_0 = -r_0$, to show $\operatorname{span}\{p_0, p_1\} \subseteq \operatorname{span}\{r_0, A r_0\}$ we need to show that $p_1 \in \operatorname{span}\{r_0, A r_0\}$. This follows from
\[
\begin{aligned}
p_1 &= -r_1 + \beta_1 p_0 = -(A x_1 - b) - \beta_1 r_0 \\
&= -(A(x_0 + \alpha_0 p_0) - b) - \beta_1 r_0 \\
&= -r_0 + \alpha_0 A r_0 - \beta_1 r_0.
\end{aligned}
\]
Again, showing the other direction is the same as noting that we can perform the manipulations above in the other direction.

Problem 5.6 (an alternative form for the conjugate direction stepsize $\beta_{k+1}$)

See the discussion around Equation 10 of these notes where the requested expression is derived.
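The equivalences used in Problems 5.3 and 5.6 can also be checked numerically. The sketch below runs two CG steps on a small symmetric positive definite system and compares Equation 3 against Equation 8 for $\alpha_k$, and Equation 5 against Equation 10 for $\beta_{k+1}$. The matrix, right-hand side and variable names are assumptions made only for this illustration.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


# Illustrative symmetric positive definite system (an assumption).
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]

x = [0.0, 0.0, 0.0]
r = [-bi for bi in b]        # r_0 = A x_0 - b with x_0 = 0
p = [-ri for ri in r]        # p_0 = -r_0

max_alpha_gap = 0.0          # largest |Equation 3 - Equation 8| seen
max_beta_gap = 0.0           # largest |Equation 5 - Equation 10| seen

for k in range(2):
    Ap = matvec(A, p)
    pAp = dot(p, Ap)

    alpha_eq3 = -dot(r, p) / pAp
    alpha_eq8 = dot(r, r) / pAp
    max_alpha_gap = max(max_alpha_gap, abs(alpha_eq3 - alpha_eq8))

    x = [xi + alpha_eq3 * pi for xi, pi in zip(x, p)]
    r_new = [ri + alpha_eq3 * api for ri, api in zip(r, Ap)]   # Equation 4

    # Equation 5 with k replaced by k + 1; A symmetric, so
    # p_k^T A r_{k+1} equals (A p_k)^T r_{k+1}.
    beta_eq5 = dot(Ap, r_new) / pAp
    beta_eq10 = dot(r_new, r_new) / dot(r, r)
    max_beta_gap = max(max_beta_gap, abs(beta_eq5 - beta_eq10))

    p = [-ri + beta_eq10 * pi for ri, pi in zip(r_new, p)]     # Equation 6
    r = r_new
```

Both gaps should be at the level of floating-point rounding. Only two steps are taken because on a $3 \times 3$ system the third step drives the residual to (numerically) zero, where ratios of tiny numbers are less informative.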

Problem 5.7 (the eigensystem of a polynomial expression of a matrix)

Given the eigenvalues $\lambda_i$ and eigenvectors $v_i$ of a matrix $A$, any polynomial expression of $A$, say $P(A)$, has the same eigenvectors with corresponding eigenvalues $P(\lambda_i)$, as can be seen by simply evaluating $P(A) v_i$. This problem is a special case of that result.

Problem 5.9 (deriving the preconditioned CG algorithm)

For this problem we are to derive the preconditioned CG algorithm from the normal CG algorithm. This means that we transform the original problem, that of seeking the minimizer of
\[ \phi(x) = \frac{1}{2} x^T A x - b^T x, \]
by transforming the original $x$ variable into a hat variable $\hat{x} = Cx$. In this new space the minimization problem above is equivalent to seeking the minimizer of
\[ \hat{\phi}(\hat{x}) = \frac{1}{2} (C^{-1} \hat{x})^T A (C^{-1} \hat{x}) - b^T (C^{-1} \hat{x}) = \frac{1}{2} \hat{x}^T (C^{-T} A C^{-1}) \hat{x} - (C^{-T} b)^T \hat{x}. \]
This latter problem we will solve with the CG method, where in the standard CG algorithm we take the matrix and the vector to be
\[ \hat{A} = C^{-T} A C^{-1}, \qquad \hat{b} = C^{-T} b. \]
Now given $\hat{x}_0$ as an initial guess at the minimum of the hat problem (note that specifying this is equivalent to specifying an initial guess $x_0$ for the minimum of $\phi(x)$) and following the standard CG algorithm but using the hatted variables, we start to derive the preconditioned CG by setting
\[
\begin{aligned}
\hat{r}_0 &= C^{-T} A C^{-1} \hat{x}_0 - C^{-T} b \\
\hat{p}_0 &= -\hat{r}_0 \\
k &= 0.
\end{aligned}
\]
With these initial variables set, we then loop while $\hat{r}_k \ne 0$ and perform the following steps (following algorithm 5.2):
\[
\begin{aligned}
\hat{\alpha}_k &= \frac{\hat{r}_k^T \hat{r}_k}{\hat{p}_k^T C^{-T} A C^{-1} \hat{p}_k} \\
\hat{x}_{k+1} &= \hat{x}_k + \hat{\alpha}_k \hat{p}_k \\
\hat{r}_{k+1} &= \hat{r}_k + \hat{\alpha}_k C^{-T} A C^{-1} \hat{p}_k \\
\hat{\beta}_{k+1} &= \frac{\hat{r}_{k+1}^T \hat{r}_{k+1}}{\hat{r}_k^T \hat{r}_k} \\
\hat{p}_{k+1} &= -\hat{r}_{k+1} + \hat{\beta}_{k+1} \hat{p}_k \\
k &= k + 1.
\end{aligned}
\]

After performing these iterations our output will be $\hat{x}^*$, the minimizer of the quadratic
\[ \hat{\phi}(\hat{x}) = \frac{1}{2} \hat{x}^T (C^{-T} A C^{-1}) \hat{x} - (C^{-T} b)^T \hat{x}, \]
but we really want to output $x^* = C^{-1} \hat{x}^*$. To derive an expression that works on $x$ let's write the above algorithm in terms of the unhatted variables $x$ and $r$. Given an initial guess $x_0$ at the minimum of $\phi(x)$, $\hat{x}_0 = C x_0$ is the initial guess at the minimum of $\hat{\phi}(\hat{x})$. Note that
\[ \hat{r}_0 = C^{-T} A C^{-1} \hat{x}_0 - C^{-T} b, \]
or
\[ C^T \hat{r}_0 = A x_0 - b = r_0, \]
which is the residual of the original problem. Thus it looks like the residuals transform between hatted and unhatted variables as
\[ \hat{r}_k = C^{-T} r_k. \tag{12} \]
Next note that $\hat{p}_0 = -\hat{r}_0 = -C^{-T} r_0 = C^{-T} p_0$, so it looks like the conjugate directions transform between hatted and unhatted variables in the same way, namely
\[ \hat{p}_k = C^{-T} p_k. \tag{13} \]
Using these two simplifications our preconditioned conjugate gradient algorithm becomes, in terms of the unhatted variables (recall the unknown variable transforms as $\hat{x}_k = C x_k$),
\[
\begin{aligned}
\hat{\alpha}_k &= \frac{r_k^T C^{-1} C^{-T} r_k}{p_k^T C^{-1} C^{-T} A C^{-1} C^{-T} p_k} \\
&= \frac{r_k^T (C^T C)^{-1} r_k}{p_k^T (C^T C)^{-1} A (C^T C)^{-1} p_k} && (14) \\
x_{k+1} &= x_k + \hat{\alpha}_k C^{-1} \hat{p}_k = x_k + \hat{\alpha}_k C^{-1} C^{-T} p_k \\
&= x_k + \hat{\alpha}_k (C^T C)^{-1} p_k && (15) \\
r_{k+1} &= r_k + \hat{\alpha}_k A C^{-1} C^{-T} p_k \\
&= r_k + \hat{\alpha}_k A (C^T C)^{-1} p_k && (16) \\
\hat{\beta}_{k+1} &= \frac{r_{k+1}^T C^{-1} C^{-T} r_{k+1}}{r_k^T C^{-1} C^{-T} r_k} = \frac{r_{k+1}^T (C^T C)^{-1} r_{k+1}}{r_k^T (C^T C)^{-1} r_k} && (17) \\
p_{k+1} &= -r_{k+1} + \hat{\beta}_{k+1} p_k && (18) \\
k &= k + 1.
\end{aligned}
\]
In the above expressions, we first made the hat to unhat substitution and then simplified the resulting expression. Next, to simplify these expressions further we introduce two new variables. The first variable, $y_k$, is defined by
\[ y_k = (C^T C)^{-1} r_k, \]

or the solution $y_k$ to the linear system $M y_k = r_k$, where $M = C^T C$. The second variable $z_k$ is defined similarly as
\[ z_k = (C^T C)^{-1} p_k, \]
or the solution to the linear system $M z_k = p_k$. To use these variables, as a first step, in Equation 18 above we multiply by $(C^T C)^{-1}$ on both sides and use the definitions of $z_k$ and $y_k$ to get
\[ z_{k+1} = -y_{k+1} + \hat{\beta}_{k+1} z_k. \]
Our algorithm then becomes the following. Given $x_0$, our initial guess at the minimum of $\phi(x)$, form $r_0 = A x_0 - b$ and solve $M y_0 = r_0$ for $y_0$. Next compute $z_0$ given by
\[ z_0 = M^{-1} p_0 = M^{-1}(-r_0) = -M^{-1}(M y_0) = -y_0. \tag{19} \]
Next we set $k = 0$ and iterate the following while $r_k \ne 0$:
\[
\begin{aligned}
\hat{\alpha}_k &= \frac{r_k^T y_k}{z_k^T A z_k} \\
x_{k+1} &= x_k + \hat{\alpha}_k z_k \\
r_{k+1} &= r_k + \hat{\alpha}_k A z_k \\
&\text{solve } M y_{k+1} = r_{k+1} \text{ for } y_{k+1} \\
\hat{\beta}_{k+1} &= \frac{r_{k+1}^T y_{k+1}}{r_k^T y_k} \\
z_{k+1} &= -y_{k+1} + \hat{\beta}_{k+1} z_k \\
k &= k + 1.
\end{aligned}
\]
Note that this is the same algorithm presented in the book, but the book denotes the variable $z_k$ by the notation $p_k$, and I think that there is an error in the book's initialization of this routine in that the book states $p_0 = -r_0$ while I think this expression should be $p_0 = -y_0$ (in their notation); see Equation 19.

Problem 5.10 (deriving the modified residual conjugacy condition)

In the transformed hat problem we minimize the quadratic $\hat{\phi}(\hat{x})$ given by (see Problem 5.9 above)
\[ \hat{\phi}(\hat{x}) = \frac{1}{2} \hat{x}^T (C^{-T} A C^{-1}) \hat{x} - (C^{-T} b)^T \hat{x}, \]
where we define
\[ \hat{A} \equiv C^{-T} A C^{-1}, \qquad \hat{b} \equiv C^{-T} b, \]
so that the transformed hat problem has a residual $\hat{r}$ given by
\[ \hat{r} = \hat{A} \hat{x} - \hat{b} = C^{-T} A C^{-1} C x - C^{-T} b = C^{-T} (Ax - b) = C^{-T} r. \]

Thus the orthogonality property of successive residuals, i.e. the book's equation 5.15 for the hat problem, which is given by
\[ \hat{r}_k^T \hat{r}_i = 0 \quad \text{for } i = 0, 1, 2, \ldots, k-1, \tag{20} \]
becomes in terms of the original variables of the problem
\[ r_k^T C^{-1} C^{-T} r_i = r_k^T M^{-1} r_i = 0, \]
since $M = C^T C$. This is the modified residual conjugacy condition and is what we were to show.

Problem 5.11 (the expressions for $\beta^{PR}$ and $\beta^{HS}$ reduce to $\beta^{FR}$)

Recall that three expressions suggested for $\beta$ (the conjugate direction stepsize) are
\[ \beta_{k+1}^{PR} = \frac{\nabla f_{k+1}^T (\nabla f_{k+1} - \nabla f_k)}{\nabla f_k^T \nabla f_k}, \tag{21} \]
for the Polak-Ribiere formula,
\[ \beta_{k+1}^{HS} = \frac{\nabla f_{k+1}^T (\nabla f_{k+1} - \nabla f_k)}{(\nabla f_{k+1} - \nabla f_k)^T p_k}, \tag{22} \]
for the Hestenes-Stiefel formula, and
\[ \beta_{k+1}^{FR} = \frac{\nabla f_{k+1}^T \nabla f_{k+1}}{\nabla f_k^T \nabla f_k}, \tag{23} \]
for the Fletcher-Reeves expression.

To show that the Polak-Ribiere CG stepsize $\beta^{PR}$ reduces to the Fletcher-Reeves CG stepsize $\beta^{FR}$ under the conditions given in this problem, from the above formulas it is sufficient to show that
\[ \nabla f_{k+1}^T \nabla f_k = 0, \]
since they agree on the other terms. Now when $f(x)$ is a quadratic function $f(x) = \frac{1}{2} x^T A x - b^T x + c$ for some matrix $A$, vector $b$, and scalar $c$, and $x_{k+1}$ is chosen to be the exact line search minimum, then $\nabla f_{k+1} = r_{k+1}$. Thus
\[ \nabla f_{k+1}^T \nabla f_k = r_{k+1}^T r_k = 0, \]
by residual-residual orthogonality, Equation 9 (the book's equation 5.15).

Now to show that the Hestenes-Stiefel CG stepsize $\beta^{HS}$ reduces to the Fletcher-Reeves CG stepsize $\beta^{FR}$ under the conditions given in this problem, from the above formulas and the result just shown it is sufficient to show that
\[ (\nabla f_{k+1} - \nabla f_k)^T p_k = \nabla f_k^T \nabla f_k. \]
Using the results above we have
\[
\begin{aligned}
(\nabla f_{k+1} - \nabla f_k)^T p_k &= (r_{k+1} - r_k)^T (-r_k + \beta_k p_{k-1}) \\
&= -r_{k+1}^T r_k + \beta_k r_{k+1}^T p_{k-1} + r_k^T r_k - \beta_k r_k^T p_{k-1} \\
&= r_k^T r_k = \nabla f_k^T \nabla f_k.
\end{aligned}
\]
In the above we have used residual-residual orthogonality (Equation 9) to show $r_{k+1}^T r_k = 0$ and residual prior-conjugate orthogonality (Equation 7) to show $r_{k+1}^T p_{k-1} = 0$ and $r_k^T p_{k-1} = 0$.
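The conclusion of Problem 5.11 can be illustrated numerically. The sketch below evaluates Equations 21, 22 and 23 during two steps of CG with an exact line search on a small quadratic, where the three values should agree up to rounding. The function `beta_variants`, the helper `grad`, and the example matrix are hypothetical names and data introduced only for this check.

```python
def matvec(A, x):
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def beta_variants(g_new, g_old, p):
    """Return (beta_PR, beta_HS, beta_FR) as in Equations 21, 22 and 23."""
    diff = [a - c for a, c in zip(g_new, g_old)]
    beta_pr = dot(g_new, diff) / dot(g_old, g_old)
    beta_hs = dot(g_new, diff) / dot(diff, p)
    beta_fr = dot(g_new, g_new) / dot(g_old, g_old)
    return beta_pr, beta_hs, beta_fr


# Illustrative strongly convex quadratic f(x) = x^T A x / 2 - b^T x (assumed data).
A = [[4.0, 1.0, 0.0],
     [1.0, 3.0, 1.0],
     [0.0, 1.0, 2.0]]
b = [1.0, 2.0, 3.0]


def grad(x):
    """Gradient of the quadratic, which equals the residual A x - b."""
    return [ax - bi for ax, bi in zip(matvec(A, x), b)]


x = [0.0, 0.0, 0.0]
g = grad(x)
p = [-gi for gi in g]
history = []                 # one (beta_PR, beta_HS, beta_FR) tuple per step

for k in range(2):
    # Exact line search along p for a quadratic (Equation 3).
    alpha = -dot(g, p) / dot(p, matvec(A, p))
    x = [xi + alpha * pi for xi, pi in zip(x, p)]
    g_new = grad(x)
    betas = beta_variants(g_new, g, p)
    history.append(betas)
    # Any of the three values could be used here; Fletcher-Reeves is picked.
    p = [-gi + betas[2] * pi for gi, pi in zip(g_new, p)]
    g = g_new
```

For a general nonlinear $f$, or with an inexact line search, the three formulas no longer coincide, which is why they lead to different nonlinear CG methods in practice.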

