TMA480 Solutions to recommended exercises in Chapter 3 of N&W

Exercise 3.1
The steepest descent and Newton's method with the backtracking algorithm are implemented in rosenbrock_newton.m. With initial point $x_0 = (1.2, 1.2)^T$, Newton's method converged in 8 iterations, while steepest descent needs 3997 (!) iterations. For $x_0 = (-1.2, 1)^T$, Newton's method needed noticeably more iterations than before (though still far fewer than steepest descent), while steepest descent needs 4076 (!) iterations. Clearly, the steepest descent method has a very slow convergence rate for this problem. Also notice that the convergence rate of Newton's method is highly dependent on the initial point.

Convergence of steepest descent
To show that the steepest descent method applied to the Rosenbrock function converges for all $x_0$, we use the theorem from the note on "Convergence of descent methods with backtracking (Armijo) linesearch". Thus, we need to show that the three assumptions of that theorem hold, i.e., that
1. $f(x)$ is continuously differentiable;
2. the set $S := \{x \in \mathbb{R}^2 : f(x) \le f(x_0)\}$ is bounded;
3. the matrices $B_k$ are uniformly positive definite and bounded.
Condition 1 was shown in an earlier exercise, while condition 3 is trivial since $B_k = I$ (the identity) for steepest descent. To show boundedness of $S$, we observe that $f(x)$ is the sum of two non-negative terms, so that
\[
f(x) = 100(x_2 - x_1^2)^2 + (1 - x_1)^2 \le f(x_0) =: C
\quad\Longrightarrow\quad
\begin{cases}
(1 - x_1)^2 \le C,\\
100(x_2 - x_1^2)^2 \le C.
\end{cases}
\]
The first condition is equivalent to $|1 - x_1| \le \sqrt{C}$, thus $x_1$ is bounded. The second condition is equivalent to $|x_2 - x_1^2| \le 0.1\sqrt{C}$, thus $x_2$ is also bounded.
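Before moving on, here is a minimal Python sketch of the comparison described in Exercise 3.1. It is not the rosenbrock_newton.m implementation referred to above; the tolerance, iteration limit and backtracking parameters below are arbitrary illustrative choices, so the exact iteration counts will differ from those reported.

```python
import numpy as np

def f(x):
    # Rosenbrock function f(x) = 100*(x2 - x1^2)^2 + (1 - x1)^2
    return 100.0 * (x[1] - x[0]**2)**2 + (1.0 - x[0])**2

def grad(x):
    return np.array([-400.0 * x[0] * (x[1] - x[0]**2) - 2.0 * (1.0 - x[0]),
                      200.0 * (x[1] - x[0]**2)])

def hess(x):
    return np.array([[1200.0 * x[0]**2 - 400.0 * x[1] + 2.0, -400.0 * x[0]],
                     [-400.0 * x[0], 200.0]])

def backtracking(x, p, g, alpha=1.0, rho=0.5, c=1e-4, max_halvings=60):
    # Shrink alpha until the Armijo sufficient-decrease condition holds.
    for _ in range(max_halvings):
        if f(x + alpha * p) <= f(x) + c * alpha * (g @ p):
            break
        alpha *= rho
    return alpha

def minimize(x0, method, tol=1e-6, maxit=20000):
    x = np.array(x0, dtype=float)
    for k in range(maxit):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x, k
        p = np.linalg.solve(hess(x), -g) if method == "newton" else -g
        x = x + backtracking(x, p, g) * p
    return x, maxit

for x0 in [(1.2, 1.2), (-1.2, 1.0)]:
    for method in ("newton", "steepest descent"):
        xmin, iters = minimize(x0, method)
        print(f"x0 = {x0}, {method}: {iters} iterations, x = {xmin}")
```

As in the discussion above, steepest descent requires several thousand iterations while Newton's method terminates after only a handful (the precise numbers depend on the chosen tolerance and line-search parameters).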
Exercise 3.2
We show this by a counter-example. Pick the objective function $f(x) = x^2 - x$, with $f'(x) = 2x - 1$ and minimizer $x^* = 0.5$. For this one-dimensional function the Wolfe conditions read
\[
f(x + \alpha) \le f(x) + c_1 \alpha f'(x), \qquad
f'(x + \alpha) \ge c_2 f'(x),
\]
where we have let $p = 1$ (and we allow $\alpha$ to be negative).

If we choose $x = 0$, so that $f(0) = 0$ and $f'(0) = -1$, the first condition reads
\[
\alpha^2 - \alpha \le -c_1\alpha
\quad\Longleftrightarrow\quad
\alpha^2 \le (1 - c_1)\alpha,
\]
while the other condition reads
\[
2\alpha - 1 \ge -c_2
\quad\Longleftrightarrow\quad
\alpha \ge \frac{1 - c_2}{2}.
\]
The second condition forces $\alpha > 0$, so the first condition is equivalent to $\alpha \le 1 - c_1$. If we now pick $c_1 = \tfrac{3}{4}$ and $c_2 = \tfrac{1}{4}$, we see that the two conditions reduce to $\alpha \le \tfrac{1}{4}$ and $\alpha \ge \tfrac{3}{8}$, which is a contradiction. Hence, we need $0 < c_1 < c_2 < 1$ to be sure that there exists an $\alpha$ satisfying the Wolfe conditions.

Exercise 3.3
Consider the strongly convex quadratic function $f(x) = \tfrac{1}{2}x^T Q x - b^T x$. We search for a minimizer along the ray $x_k + \alpha p_k$, that is, an $\alpha$ such that
\[
\frac{d}{d\alpha} f(x_k + \alpha p_k) = 0.
\]
We can write
\[
f(x_k + \alpha p_k) = \tfrac{1}{2}(x_k + \alpha p_k)^T Q (x_k + \alpha p_k) - b^T(x_k + \alpha p_k)
= \tfrac{1}{2}x_k^T Q x_k + \alpha\, x_k^T Q p_k + \tfrac{1}{2}\alpha^2 p_k^T Q p_k - b^T x_k - \alpha\, b^T p_k.
\]
Differentiation gives
\[
\frac{d}{d\alpha} f(x_k + \alpha p_k) = x_k^T Q p_k + \alpha\, p_k^T Q p_k - b^T p_k = \alpha\, p_k^T Q p_k + \nabla f_k^T p_k,
\]
where we have used that $\nabla f_k = \nabla f(x_k) = Q x_k - b$ (recall this from an earlier exercise). Thus $\frac{d}{d\alpha} f(x_k + \alpha p_k) = 0$ only if
\[
\alpha = -\frac{\nabla f_k^T p_k}{p_k^T Q p_k},
\]
which is what we wanted to show.
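Both claims above are easy to check numerically. The following Python sketch (an illustration, not part of the original solution; all test data are arbitrary) scans step lengths for the counter-example of Exercise 3.2 and verifies that the step length derived in Exercise 3.3 zeroes the directional derivative along $p$.

```python
import numpy as np

# (i) Exercise 3.2: with c1 = 3/4 > c2 = 1/4, no step length alpha satisfies
# both Wolfe conditions for f(x) = x^2 - x at x = 0 with p = 1.
c1, c2 = 0.75, 0.25
f1 = lambda x: x**2 - x
df1 = lambda x: 2.0 * x - 1.0
alphas = np.linspace(-2.0, 2.0, 4001)
sufficient_decrease = f1(alphas) <= f1(0.0) + c1 * alphas * df1(0.0)
curvature = df1(alphas) >= c2 * df1(0.0)
print("some alpha satisfies both Wolfe conditions:",
      bool(np.any(sufficient_decrease & curvature)))      # False

# (ii) Exercise 3.3: alpha = -grad(f)^T p / (p^T Q p) makes the directional
# derivative along p vanish for f(x) = 0.5 x^T Q x - b^T x.
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Q = A @ A.T + 5.0 * np.eye(5)       # random SPD matrix
b = rng.standard_normal(5)
x = rng.standard_normal(5)
p = rng.standard_normal(5)
g = Q @ x - b                       # gradient at x
alpha = -(g @ p) / (p @ Q @ p)
print("directional derivative at x + alpha*p:", (Q @ (x + alpha * p) - b) @ p)  # ~0
```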
Exercise 3.4
We consider the strongly convex quadratic function $f(x) = \tfrac{1}{2}x^T Q x - b^T x$, whose gradient is given as $\nabla f(x) = Qx - b$. The one-dimensional minimizer is given as
\[
\alpha_k = -\frac{\nabla f_k^T p_k}{p_k^T Q p_k} = -\frac{(x_k^T Q - b^T)p_k}{p_k^T Q p_k}. \qquad (1)
\]
The Goldstein conditions are given as
\[
f(x_k) + (1 - c)\,\alpha_k \nabla f_k^T p_k \;\le\; f(x_k + \alpha_k p_k) \;\le\; f(x_k) + c\,\alpha_k \nabla f_k^T p_k, \qquad c \in \bigl(0, \tfrac{1}{2}\bigr). \qquad (2)
\]
We start by looking at $f(x_k + \alpha_k p_k)$,
\[
f(x_k + \alpha_k p_k) = \tfrac{1}{2}x_k^T Q x_k + \alpha_k x_k^T Q p_k + \tfrac{1}{2}\alpha_k^2 p_k^T Q p_k - b^T x_k - \alpha_k b^T p_k
= f(x_k) + \alpha_k x_k^T Q p_k + \tfrac{1}{2}\alpha_k^2 p_k^T Q p_k - \alpha_k b^T p_k.
\]
Further, we see that
\[
(1 - c)\,\alpha_k \nabla f_k^T p_k = (1 - c)\,\alpha_k (x_k^T Q - b^T)p_k = \alpha_k\bigl(x_k^T Q p_k - c\,x_k^T Q p_k - b^T p_k + c\,b^T p_k\bigr),
\]
and that
\[
c\,\alpha_k \nabla f_k^T p_k = \alpha_k c\,(x_k^T Q - b^T)p_k.
\]
Hence, the Goldstein conditions (2) can be written as
\[
\alpha_k\bigl(x_k^T Q p_k - c\,x_k^T Q p_k - b^T p_k + c\,b^T p_k\bigr)
\;\le\; \alpha_k x_k^T Q p_k + \tfrac{1}{2}\alpha_k^2 p_k^T Q p_k - \alpha_k b^T p_k
\;\le\; \alpha_k c\,(x_k^T Q - b^T)p_k.
\]
We start by looking at the first condition. Cancelling the common terms and dividing by $\alpha_k$ (which is positive, since $p_k$ is a descent direction so that $\nabla f_k^T p_k < 0$) gives
\[
-c\,x_k^T Q p_k + c\,b^T p_k \le \tfrac{1}{2}\alpha_k\, p_k^T Q p_k
\quad\Longleftrightarrow\quad
c\,(b^T - x_k^T Q)p_k \le \tfrac{1}{2}(b^T - x_k^T Q)p_k,
\]
where we have used (1) on the right hand side. We see that this inequality is satisfied for all $c \in (0, \tfrac{1}{2})$, since $(b^T - x_k^T Q)p_k = -\nabla f_k^T p_k$ is non-negative.

Similarly, for the second condition we get that
\[
x_k^T Q p_k + \tfrac{1}{2}\alpha_k\, p_k^T Q p_k - b^T p_k \le c\,(x_k^T Q - b^T)p_k
\quad\Longleftrightarrow\quad
\tfrac{1}{2}\alpha_k\, p_k^T Q p_k \le (c - 1)(x_k^T Q - b^T)p_k
\quad\Longleftrightarrow\quad
\tfrac{1}{2}(b^T - x_k^T Q)p_k \le (1 - c)(b^T - x_k^T Q)p_k.
\]
This holds true if $\tfrac{1}{2} \le 1 - c$, since $(b^T - x_k^T Q)p_k$ is non-negative, or equivalently if $c \le \tfrac{1}{2}$.
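A quick numerical sanity check of Exercise 3.4 (again an illustration, not part of the solution; the quadratic and the tested values of $c$ are arbitrary): for a random strongly convex quadratic and the exact step length (1), the Goldstein conditions (2) hold for every tested $c \in (0, \tfrac{1}{2})$.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 6
A = rng.standard_normal((n, n))
Q = A @ A.T + n * np.eye(n)              # random SPD matrix
b = rng.standard_normal(n)
f = lambda x: 0.5 * x @ Q @ x - b @ x
grad = lambda x: Q @ x - b

x = rng.standard_normal(n)
g = grad(x)
p = -g                                    # a descent direction
alpha = -(g @ p) / (p @ Q @ p)            # exact one-dimensional minimizer, eq. (1)

for c in (0.05, 0.25, 0.49):
    lower = f(x) + (1.0 - c) * alpha * (g @ p)
    upper = f(x) + c * alpha * (g @ p)
    print(c, bool(lower <= f(x + alpha * p) <= upper))   # True for all c in (0, 1/2)
```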
Exercise 3.5
For a matrix norm induced from a vector norm, it is always true that $\|Ax\| \le \|A\|\,\|x\|$. Hence,
\[
\|x\| = \|B^{-1}Bx\| \le \|B^{-1}\|\,\|Bx\|
\quad\Longrightarrow\quad
\|Bx\| \ge \frac{\|x\|}{\|B^{-1}\|}.
\]
A property of symmetric positive definite matrices $B$ is that there exist matrices $B^{1/2}$ and $B^{-1/2}$ such that $B^{1/2}B^{1/2} = B$ and $B^{-1/2}B^{-1/2} = B^{-1}$. Thus, with $p_k = -B_k^{-1}\nabla f_k$, so that $\nabla f_k = -B_k p_k$, we have
\[
\cos\theta_k = \frac{-\nabla f_k^T p_k}{\|\nabla f_k\|\,\|p_k\|}
= \frac{p_k^T B_k p_k}{\|B_k p_k\|\,\|p_k\|}
= \frac{\|B_k^{1/2} p_k\|^2}{\|B_k p_k\|\,\|p_k\|}
\ge \frac{\|p_k\|^2 / \|B_k^{-1/2}\|^2}{\|B_k\|\,\|p_k\|^2}
= \frac{1}{\|B_k\|\,\|B_k^{-1}\|} \ge \frac{1}{M},
\]
where we have used the inequality above with $B_k^{1/2}$ in place of $B$, that $\|B_k p_k\| \le \|B_k\|\,\|p_k\|$, and that $\|B_k^{-1/2}\|^2 = \|B_k^{-1}\|$ in the Euclidean norm. Here $M$ is the assumed uniform bound on the condition number, $\|B_k\|\,\|B_k^{-1}\| \le M$.
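The bound $\cos\theta_k \ge 1/(\|B_k\|\,\|B_k^{-1}\|)$ can likewise be checked numerically. The sketch below (illustrative only) uses random SPD matrices as stand-ins for $B_k$ and a random vector as a stand-in for $\nabla f_k$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 8
for _ in range(5):
    A = rng.standard_normal((n, n))
    B = A @ A.T + 0.1 * np.eye(n)              # SPD model Hessian B_k
    g = rng.standard_normal(n)                 # stand-in for grad f_k
    p = -np.linalg.solve(B, g)                 # search direction p_k = -B^{-1} g
    cos_theta = -(g @ p) / (np.linalg.norm(g) * np.linalg.norm(p))
    bound = 1.0 / (np.linalg.norm(B, 2) * np.linalg.norm(np.linalg.inv(B), 2))
    print(bool(cos_theta >= bound))            # True
```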
Exercise 3.6
From Equation (3.28) in N&W we have that
\[
\|x_{k+1} - x^*\|_Q^2 = \left\{1 - \frac{(\nabla f_k^T \nabla f_k)^2}{(\nabla f_k^T Q\, \nabla f_k)(\nabla f_k^T Q^{-1} \nabla f_k)}\right\} \|x_k - x^*\|_Q^2. \qquad (3)
\]
We know that $x_0 - x^*$ is parallel to an eigenvector of $Q$. Let $e$ be this (normalized) eigenvector with corresponding eigenvalue $\lambda > 0$, such that $Qe = \lambda e$ and such that $x_0 - x^* = \beta e$ for some constant $\beta$. Further, recall that $1/\lambda$ is an eigenvalue of $Q^{-1}$ with corresponding eigenvector $e$. Now,
\[
\nabla f_0 = Q(x_0 - x^*) = Q\beta e = \beta\lambda e,
\]
and we can deduce that
\[
\frac{(\nabla f_0^T \nabla f_0)^2}{(\nabla f_0^T Q\, \nabla f_0)(\nabla f_0^T Q^{-1} \nabla f_0)}
= \frac{(\beta^2\lambda^2\, e^T e)^2}{(\beta^2\lambda^2\, e^T Q e)(\beta^2\lambda^2\, e^T Q^{-1} e)}
= \frac{(e^T e)^2}{(e^T \lambda e)\bigl(e^T \tfrac{1}{\lambda} e\bigr)} = 1.
\]
Insertion into (3) gives $\|x_1 - x^*\|_Q^2 = 0$. Hence we have convergence in one step.

Exercise 3.7
First we use the definition of $\|\cdot\|_Q$ to see that
\[
\|x_k - x^*\|_Q^2 = (x_k - x^*)^T Q (x_k - x^*) = x_k^T Q x_k - 2 x_k^T Q x^* + (x^*)^T Q x^*.
\]
By further using that $x_{k+1} = x_k - \alpha_k \nabla f_k$, we see that
\begin{align*}
\|x_k - x^*\|_Q^2 - \|x_{k+1} - x^*\|_Q^2
&= x_k^T Q x_k - 2 x_k^T Q x^* - x_{k+1}^T Q x_{k+1} + 2 x_{k+1}^T Q x^* \\
&= x_k^T Q x_k - 2 x_k^T Q x^*
 - \bigl(x_k^T Q x_k - 2\alpha_k \nabla f_k^T Q x_k + \alpha_k^2 \nabla f_k^T Q \nabla f_k\bigr)
 + \bigl(2 x_k^T Q x^* - 2\alpha_k \nabla f_k^T Q x^*\bigr) \\
&= 2\alpha_k \nabla f_k^T Q (x_k - x^*) - \alpha_k^2 \nabla f_k^T Q \nabla f_k.
\end{align*}
Now, if we insert the one-dimensional minimizer
\[
\alpha_k = \frac{\nabla f_k^T \nabla f_k}{\nabla f_k^T Q \nabla f_k}
\]
and $\nabla f_k = Q(x_k - x^*)$, we get that
\[
\|x_k - x^*\|_Q^2 - \|x_{k+1} - x^*\|_Q^2
= 2\,\frac{(\nabla f_k^T \nabla f_k)^2}{\nabla f_k^T Q \nabla f_k} - \frac{(\nabla f_k^T \nabla f_k)^2}{\nabla f_k^T Q \nabla f_k}
= \frac{(\nabla f_k^T \nabla f_k)^2}{\nabla f_k^T Q \nabla f_k}. \qquad (4)
\]
Further, we see that
\[
\|x_k - x^*\|_Q^2 = \|Q^{-1}\nabla f_k\|_Q^2 = \nabla f_k^T Q^{-1} \nabla f_k.
\]
Inserting this into (4) and reorganizing gives the desired result (3.28) in N&W.
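The following sketch illustrates Exercises 3.6 and 3.7 numerically (it is not part of the original solution; the matrix, the right-hand side and the chosen eigenvector are arbitrary test data): one exact steepest-descent step from a point with $x_0 - x^*$ parallel to an eigenvector of $Q$ lands on $x^*$, and the identity (3.28) holds for a generic starting point.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5
A = rng.standard_normal((n, n))
Q = A @ A.T + np.eye(n)                  # random SPD matrix
b = rng.standard_normal(n)
xstar = np.linalg.solve(Q, b)            # minimizer of 0.5 x^T Q x - b^T x

# Exercise 3.6: start parallel to an eigenvector of Q -> one exact step suffices.
e = np.linalg.eigh(Q)[1][:, 2]           # some (normalized) eigenvector of Q
x0 = xstar + 3.7 * e
g = Q @ x0 - b
alpha = (g @ g) / (g @ Q @ g)            # exact step length for p = -g
x1 = x0 - alpha * g
print("distance to x* after one step:", np.linalg.norm(x1 - xstar))   # ~1e-15

# Exercise 3.7: the identity (3.28) holds for a generic starting point.
x0 = rng.standard_normal(n)
g = Q @ x0 - b
alpha = (g @ g) / (g @ Q @ g)
x1 = x0 - alpha * g
lhs = (x1 - xstar) @ Q @ (x1 - xstar)
factor = 1.0 - (g @ g)**2 / ((g @ Q @ g) * (g @ np.linalg.solve(Q, g)))
rhs = factor * ((x0 - xstar) @ Q @ (x0 - xstar))
print("|lhs - rhs| for (3.28):", abs(lhs - rhs))                       # ~1e-14
```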
Exercise 3.8
Since $Q \in \mathbb{R}^{n\times n}$ is SPD, we can diagonalize it, i.e.,
\[
Q = R D R^T, \qquad Q^{-1} = R D^{-1} R^T,
\]
where $R$ is an orthogonal matrix and $D = \mathrm{diag}\{\lambda_1, \lambda_2, \ldots, \lambda_n\}$. Each column of $R$ is an eigenvector of $Q$, and $\lambda_i > 0$ are the corresponding eigenvalues, ordered such that $\lambda_1 \le \lambda_2 \le \ldots \le \lambda_n$. Since $R R^T = I$, we can write
\[
\beta := \frac{(x^T x)^2}{(x^T Q x)(x^T Q^{-1} x)}
= \frac{(x^T R R^T x)^2}{(x^T R D R^T x)(x^T R D^{-1} R^T x)}
= \frac{(d^T d)^2}{(d^T D d)(d^T D^{-1} d)},
\]
where $d = R^T x$. Let $\xi_i = d_i^2 / (d^T d)$. Then $\xi_i \ge 0$ and $\sum_i \xi_i = 1$. We now see that
\[
\frac{d^T d}{d^T D d} = \frac{\sum_i d_i^2}{\sum_i \lambda_i d_i^2} = \frac{1}{\sum_i \xi_i \lambda_i},
\]
and similarly that
\[
\frac{d^T d}{d^T D^{-1} d} = \frac{1}{\sum_i \xi_i / \lambda_i}.
\]
Hence,
\[
\beta = \frac{1}{\bigl(\sum_i \xi_i \lambda_i\bigr)\bigl(\sum_i \xi_i / \lambda_i\bigr)}.
\]
Further, let $\bar\lambda = \sum_i \xi_i \lambda_i$, and observe that $\lambda_1 \le \bar\lambda \le \lambda_n$. By the convexity of the function $\varphi(\lambda) = 1/\lambda$, we know that
\[
\frac{1}{\lambda_i} \le \frac{\lambda_n + \lambda_1 - \lambda_i}{\lambda_1 \lambda_n}
\qquad\text{for } \lambda_i \in [\lambda_1, \lambda_n],
\]
since the right hand side is the chord of $\varphi$ through $(\lambda_1, 1/\lambda_1)$ and $(\lambda_n, 1/\lambda_n)$. Hence,
\[
\sum_i \frac{\xi_i}{\lambda_i} \le \sum_i \xi_i\, \frac{\lambda_n + \lambda_1 - \lambda_i}{\lambda_1 \lambda_n} = \frac{\lambda_n + \lambda_1 - \bar\lambda}{\lambda_1 \lambda_n}.
\]
Finally, we deduce that
\[
\beta \ge \frac{\lambda_1 \lambda_n}{\bar\lambda(\lambda_n + \lambda_1 - \bar\lambda)}
\ge \frac{\lambda_1 \lambda_n}{\max_{\lambda \in [\lambda_1, \lambda_n]}\{\lambda(\lambda_n + \lambda_1 - \lambda)\}}
= \frac{4\lambda_1 \lambda_n}{(\lambda_1 + \lambda_n)^2},
\]
which is what we wanted to show. We have used that $\lambda(\lambda_n + \lambda_1 - \lambda)$ attains its maximum at $\lambda = \frac{\lambda_1 + \lambda_n}{2}$ (verify this).
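Finally, a numerical check of the inequality proved in Exercise 3.8 (illustrative only; $Q$ and $x$ below are random test data):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 7
for _ in range(5):
    A = rng.standard_normal((n, n))
    Q = A @ A.T + 0.5 * np.eye(n)                    # random SPD matrix
    x = rng.standard_normal(n)
    beta = (x @ x)**2 / ((x @ Q @ x) * (x @ np.linalg.solve(Q, x)))
    lam = np.linalg.eigvalsh(Q)                      # eigenvalues, ascending
    l1, ln = lam[0], lam[-1]
    print(bool(beta >= 4.0 * l1 * ln / (l1 + ln)**2))   # True
```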