5 Overview of algorithms for unconstrained optimization
IOE 519: NLP, Winter 2012 © Marina A. Epelman

5.1 General optimization algorithm

Recall: we are attempting to solve the problem

(P) min f(x) s.t. x ∈ X,

where f(x) is differentiable and X ⊆ R^n is an open set.

Solutions to optimization problems are almost always impossible to obtain directly (or "in closed form"), with a few exceptions. Hence, for the most part, we will solve these problems with iterative algorithms. These algorithms typically require the user to supply a starting point x^0 ∈ X. Beginning at x^0, an iterative algorithm generates a sequence of points {x^k}_{k=0}^∞ called iterates. In deciding how to generate the next iterate, x^{k+1}, the algorithms use information about the function f at the current iterate, x^k, and sometimes past iterates x^0, ..., x^{k−1}. In practice, rather than constructing an infinite sequence of iterates, algorithms stop when an appropriate termination criterion is satisfied, indicating either that the problem has been solved within a desired accuracy, or that no further progress can be made.

Most algorithms for unconstrained optimization we will discuss fall into the category of directional search algorithms:

General directional search optimization algorithm
Initialization Specify an initial guess of the solution x^0.
Iteration For k = 0, 1, ...,
  If x^k is optimal, stop.
  Otherwise,
    Determine d^k, a search direction;
    Determine α_k > 0, a step size;
    Determine x^{k+1} = x^k + α_k d^k, a new estimate of the solution.

5.1.1 Choosing the direction

Typically, we require that d^k is a descent direction of f at x^k, that is,

f(x^k + αd^k) < f(x^k) for all α ∈ (0, ε] for some ε > 0.

For the case when f is differentiable, we have shown in the previous section that any d^k such that ∇f(x^k)^T d^k < 0 is a descent direction whenever ∇f(x^k) ≠ 0. Often, the direction is chosen to be of the form

d^k = −D^k ∇f(x^k),

where D^k is a positive definite symmetric matrix. (Why is it important that D^k is positive definite?)
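The importance of positive definiteness of D^k can be checked numerically: with d = −D∇f(x), we get ∇f(x)^T d = −∇f(x)^T D ∇f(x), which is negative for every nonzero gradient exactly when D is positive definite. A small sketch (the gradient and matrices below are made-up values chosen for the demonstration):

```python
import numpy as np

# The search direction is d = -D @ grad, and we need it to be a descent
# direction: grad^T d = -grad^T D grad < 0 whenever grad != 0, which is
# exactly the definition of positive definiteness of D.
grad = np.array([1.0, -2.0])           # a nonzero gradient (illustrative)
D_pd = np.array([[2.0, 0.5],
                 [0.5, 1.0]])          # positive definite (both eigenvalues > 0)
D_indef = np.array([[1.0, 0.0],
                    [0.0, -1.0]])      # indefinite

d_good = -D_pd @ grad
d_bad = -D_indef @ grad

print(grad @ d_good)   # negative: a descent direction
print(grad @ d_bad)    # positive here: an ascent direction
```

With an indefinite D^k the "direction" can point uphill, so the basic descent guarantee of the algorithm is lost.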
The following are the two basic methods for choosing the matrix D^k at each iteration; they give rise to two classic algorithms for unconstrained optimization we are going to discuss in class:

Steepest descent: D^k = I, k = 0, 1, 2, ...
Newton's method: D^k = H(x^k)^{−1} (provided H(x^k) is positive definite).

5.1.2 Choosing the stepsize

After d^k is fixed, α_k ideally would solve the one-dimensional optimization problem

min_{α ≥ 0} f(x^k + αd^k).

This optimization problem is usually also impossible to solve exactly. Instead, α_k is computed (via an iterative procedure referred to as line search) either to approximately solve the above optimization problem, or to ensure a sufficient decrease in the value of f.

5.1.3 Testing for optimality

Based on the optimality conditions, x^k is locally optimal if ∇f(x^k) = 0 and H(x^k) is positive definite. However, such a point is unlikely to be found exactly. In fact, most of the analysis of the algorithms in the above form deals with their limiting behavior, i.e., analyzes the limit points of the infinite sequence of iterates generated by the algorithm. Thus, to implement the algorithm in practice, more realistic termination criteria need to be implemented. They often hinge, at least in part, on approximately satisfying, to a certain tolerance, the first order necessary condition for optimality discussed in the previous section.

5.2 Steepest descent algorithm for minimization

The steepest descent algorithm is a version of the general optimization algorithm that chooses d^k = −∇f(x^k) at the kth iteration. As a source of motivation, note that f(x) can be approximated by its linear expansion

f(x̄ + d) ≈ f(x̄) + ∇f(x̄)^T d.

It is not hard to see that so long as ∇f(x̄) ≠ 0, the direction

d̄ = −∇f(x̄)/||∇f(x̄)|| = −∇f(x̄)/√(∇f(x̄)^T ∇f(x̄))

minimizes the above approximation over all directions of unit length. Indeed, for any direction d with ||d|| = 1, the Schwartz inequality yields

∇f(x̄)^T d ≥ −||∇f(x̄)|| ||d|| = −||∇f(x̄)|| = ∇f(x̄)^T d̄.
Of course, if ∇f(x̄) = 0, then x̄ is a candidate for a local minimizer, i.e., x̄ satisfies the first order necessary optimality condition. The direction d̄ = −∇f(x̄) is called the direction of steepest descent at the point x̄. Note that d̄ = −∇f(x̄) is a descent direction as long as ∇f(x̄) ≠ 0. To see this, simply observe that d̄^T ∇f(x̄) = −∇f(x̄)^T ∇f(x̄) < 0 so long as ∇f(x̄) ≠ 0. A natural consequence of this is the following algorithm, called the steepest descent algorithm.
Steepest Descent Algorithm:
Step 0 Given x^0, set k := 0.
Step 1 d^k := −∇f(x^k). If d^k = 0, then stop.
Step 2 Choose stepsize ᾱ_k by performing an exact (or inexact) line search.
Step 3 Set x^{k+1} := x^k + ᾱ_k d^k, k := k + 1. Go to Step 1.

Note from Step 2 and the fact that d^k = −∇f(x^k) is a descent direction, it follows that f(x^{k+1}) < f(x^k). The following theorem establishes that under certain assumptions on f, the steepest descent algorithm converges regardless of the initial starting point x^0 (i.e., it exhibits global convergence).

Theorem 5.1 (Convergence Theorem; Steepest Descent with exact line search) Suppose that f : R^n → R is continuously differentiable on the set S = {x ∈ R^n : f(x) ≤ f(x^0)}, and that S is a closed and bounded set. Suppose further that the sequence {x^k} is generated by the steepest descent algorithm with stepsizes ᾱ_k chosen by an exact line search. Then every point x̄ that is a limit point of the sequence {x^k} satisfies ∇f(x̄) = 0.

Proof: The proof of this theorem is by contradiction. By the Weierstrass Theorem, at least one limit point of the sequence {x^k} must exist. Let x̄ be any such limit point. Without loss of generality, assume that lim_{k→∞} x^k = x̄, but that ∇f(x̄) ≠ 0. This being the case, there is a value ᾱ > 0 such that

δ := f(x̄) − f(x̄ + ᾱd̄) > 0, where d̄ = −∇f(x̄).

Then also (x̄ + ᾱd̄) ∈ int S, because f(x̄ + ᾱd̄) < f(x̄) ≤ f(x^0).

Let {d^k} be the sequence of directions generated by the algorithm, i.e., d^k = −∇f(x^k). Since f is continuously differentiable, lim_{k→∞} d^k = d̄. Then since (x̄ + ᾱd̄) ∈ int S, and (x^k + ᾱd^k) → (x̄ + ᾱd̄), for k sufficiently large we have x^k + ᾱd^k ∈ S and

f(x^k + ᾱd^k) ≤ f(x̄ + ᾱd̄) + δ/2 = f(x̄) − δ + δ/2 = f(x̄) − δ/2.

However,

f(x̄) ≤ f(x^k + ᾱ_k d^k) ≤ f(x^k + ᾱd^k) ≤ f(x̄) − δ/2,

which is, of course, a contradiction. Thus d̄ = −∇f(x̄) = 0.

An example Suppose f(x) is a simple quadratic function of the form

f(x) = (1/2) x^T Q x + q^T x,

where Q is a positive definite symmetric matrix.
The optimal solution of (P) is easily computed as x* = −Q^{−1}q (since Q is positive definite, it is non-singular), and direct substitution shows that the optimal objective function value is f(x*) = −(1/2) q^T Q^{−1} q.
For convenience, let x denote the current point in the steepest descent algorithm. We have f(x) = (1/2)x^T Qx + q^T x, and let d denote the current direction, which is the negative of the gradient, i.e.,

d = −∇f(x) = −Qx − q.

Now let us compute the next iterate of the steepest descent algorithm. If α is the generic stepsize, then

f(x + αd) = (1/2)(x + αd)^T Q(x + αd) + q^T(x + αd)
 = (1/2)x^T Qx + α d^T Qx + (1/2)α² d^T Qd + q^T x + α q^T d
 = f(x) − α d^T d + (1/2)α² d^T Qd.

Optimizing the value of α in this last expression yields

ᾱ = (d^T d)/(d^T Qd),

and the next iterate of the algorithm then is

x' = x + ᾱd = x + ((d^T d)/(d^T Qd)) d, where d = −Qx − q,

and

f(x') = f(x + ᾱd) = f(x) − ᾱ d^T d + (1/2)ᾱ² d^T Qd = f(x) − (1/2)(d^T d)²/(d^T Qd).

Suppose that

Q = [ 4 −2 ; −2 2 ] and q = (2, −2)^T.

Then

∇f(x) = (4x₁ − 2x₂ + 2, −2x₁ + 2x₂ − 2)^T,

and so

x* = (0, 1)^T and f(x*) = −1.

Suppose that x^0 = (0, 0)^T. Then we have: x^1 = (−0.4, 0.4), x^2 = (0, 0.8), etc., and the even-numbered iterates satisfy x^{2n} = (0, 1 − 0.2^n) and f(x^{2n}) = (0.2)^{2n} − 1,
and so

||x^{2n} − x*|| = 0.2^n, f(x^{2n}) − f(x*) = (0.2)^{2n}.

Therefore, starting from the point x^0 = (0, 0), the distance from the current iterate to the optimal solution goes down by a factor of 0.2 after every two iterations of the algorithm (a similar observation can be made about the progress of the objective function values). [Figure: plot of ||x^k − x*|| as a function of the iteration number; the y-axis is drawn on a logarithmic scale, which allows us to visualize the progress of the algorithm better as the values of ||x^k − x*|| approach zero.]

Although it is easy to find the optimal solution of the quadratic optimization problem in closed form, the above example is relevant in that it demonstrates a typical performance of the steepest descent algorithm. Additionally, most functions behave as near-quadratic functions in a neighborhood of the optimal solution, making the example even more relevant.

Termination criteria Ideally, the algorithm will terminate at a point x^k such that ∇f(x^k) = 0. However, the algorithm is not guaranteed to find such a point in a finite amount of time. Moreover, due to rounding errors in computer calculations, the computed value of the gradient will have some imprecision in it. Therefore, in practical algorithms the termination criterion is designed to test whether the above condition is satisfied approximately, so that the resulting output of the algorithm is an approximately optimal solution.

A natural termination criterion for steepest descent could be ||∇f(x^k)|| ≤ ε, where ε > 0 is a pre-specified tolerance. However, depending on the scaling of the function, this requirement can be either unnecessarily stringent, or too loose to ensure near-optimality (consider a problem concerned with minimizing distance, where the objective function can be expressed in inches, feet, or miles).
Another alternative that might alleviate the above consideration is to terminate when ||∇f(x^k)|| ≤ ε|f(x^k)|; this, however, may lead to problems when the objective function at the optimum is zero. A combined approach is then to terminate when

||∇f(x^k)|| ≤ ε(1 + |f(x^k)|).

The value of ε is typically taken to be at most the square root of the machine tolerance (e.g., ε = 10^{−8} if 16-digit computing is used), due to the error incurred in estimating derivatives.
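Putting the pieces of this section together, steepest descent with exact line search can be sketched in a few lines for the quadratic example above; for a quadratic, the exact stepsize has the closed form ᾱ = (d^T d)/(d^T Qd), and the combined termination criterion is used (the tolerance value is illustrative):

```python
import numpy as np

# Steepest descent with exact line search on f(x) = 0.5 x^T Q x + q^T x,
# for the Q and q of the worked example in this section.
Q = np.array([[4.0, -2.0],
              [-2.0, 2.0]])
q = np.array([2.0, -2.0])

def f(x):
    return 0.5 * x @ Q @ x + q @ x

def grad(x):
    return Q @ x + q

def steepest_descent(x0, eps=1e-8, max_iter=1000):
    x = x0.astype(float)
    for k in range(max_iter):
        d = -grad(x)
        # combined termination criterion: ||grad f|| <= eps * (1 + |f|)
        if np.linalg.norm(d) <= eps * (1.0 + abs(f(x))):
            break
        alpha = (d @ d) / (d @ Q @ d)   # exact line search for a quadratic
        x = x + alpha * d
    return x, k

x_star = np.linalg.solve(Q, -q)              # optimal solution -Q^{-1} q
x_final, iters = steepest_descent(np.zeros(2))
print(x_star, x_final, iters)
```

Run from x^0 = (0, 0), the first two iterates reproduce the values of the worked example, and the iteration count reflects the linear convergence observed there.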
5.3 Stepsize selection

In the analysis in the above subsection we assumed that the one-dimensional optimization problem invoked in the line search at each iteration of the steepest descent algorithm was performed exactly and with perfect precision, which is usually not possible. In this subsection we discuss one of the many practical ways of solving this problem approximately, to determine the stepsize at each iteration of the general directional search optimization algorithm (including steepest descent).

5.3.1 Stepsize selection basics

Suppose that f(x) is a continuously differentiable function, and that we seek to (approximately) solve

ᾱ = arg min_{α > 0} f(x̄ + αd),

where x̄ is our current iterate, and d is the current direction generated by an algorithm that seeks to minimize f(x). We assume that d is a descent direction, i.e., ∇f(x̄)^T d < 0. Let

F(α) = f(x̄ + αd),

whereby F(α) is a function of the scalar variable α, and our problem is to solve for ᾱ = arg min_{α > 0} F(α). Using the chain rule for differentiation, we can show that F′(α) = ∇f(x̄ + αd)^T d. Therefore, applying the necessary optimality conditions to the one-dimensional optimization problem above, we want to find a value ᾱ for which F′(ᾱ) = 0. Furthermore, since d is a descent direction, F′(0) < 0.

5.3.2 Armijo rule, or backtracking

Although there are iterative algorithms developed to solve the problem min F(α) (or F′(α) = 0) exactly, i.e., with a high degree of precision (such as, for instance, the bisection search algorithm), they are typically too expensive computationally. (Recall that we need to perform a line search at every iteration of our steepest descent optimization algorithm!) On the other hand, if we sacrifice accuracy of the line search, this can cause inferior performance of the overall algorithm. The Armijo rule, or the backtracking method, is one of several inexact line search methods which guarantee a sufficient degree of improvement in the objective function to ensure the algorithm's convergence.
The Armijo rule requires two parameters: 0 < µ < 0.5 and 0 < β < 1. Suppose we are minimizing a function F(α) such that F′(0) < 0 (which is indeed the case for the line search problems arising in descent algorithms). Then the first order approximation of F(α) at α = 0 is given by F(0) + αF′(0). Define

F̂(α) = F(0) + µαF′(0)

(see figure). A stepsize ᾱ is considered acceptable by the Armijo rule only if F(ᾱ) ≤ F̂(ᾱ), that is, if taking a step of size ᾱ guarantees a sufficient decrease of the function:

f(x̄ + ᾱd) − f(x̄) ≤ µᾱ∇f(x̄)^T d.
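In code, a backtracking line search built on this acceptance test can be sketched as follows (the parameter values µ = 0.2 and β = 1/2, and the example function, are illustrative choices):

```python
import numpy as np

# Backtracking (Armijo) line search: accept alpha once
#   F(alpha) <= F(0) + mu * alpha * F'(0),
# where F(alpha) = f(x + alpha*d) and F'(0) = grad f(x)^T d.
def backtracking(f, grad_f, x, d, mu=0.2, beta=0.5):
    F0 = f(x)
    slope = grad_f(x) @ d              # F'(0), negative for a descent direction
    assert slope < 0, "d must be a descent direction"
    alpha = 1.0
    while f(x + alpha * d) > F0 + mu * alpha * slope:
        alpha *= beta                  # backtrack: shrink the step
    return alpha

# Example: f(x) = x1^2 + 10 x2^2, steepest descent direction at (1, 1).
f = lambda x: x[0] ** 2 + 10 * x[1] ** 2
grad_f = lambda x: np.array([2 * x[0], 20 * x[1]])
x = np.array([1.0, 1.0])
d = -grad_f(x)
alpha = backtracking(f, grad_f, x, d)
print(alpha, f(x + alpha * d))
```

The accepted step both decreases f and satisfies the sufficient decrease inequality, while the full step α = 1 would have been rejected.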
[Figure: F(α) together with the linear approximation F(0) + αF′(0) and the relaxed line F̂(α) = F(0) + µαF′(0).]

Note that the sufficient decrease condition will hold for any small value of α. On the other hand, we would like to prevent the step size from being too small, for otherwise our overall optimization algorithm would not be making much progress. To combine these two considerations, we will implement the following iterative backtracking procedure (here we use β = 1/2):

Backtracking line search
Step 0 Set k = 0, α₀ = 1.
Step k If F(α_k) ≤ F̂(α_k), choose α_k as the step size; stop. If F(α_k) > F̂(α_k), let α_{k+1} := (1/2)α_k, k := k + 1.

Note that as a result of the above iterative scheme, the chosen stepsize is ᾱ = 1/2^t, where t is the smallest integer such that F(1/2^t) ≤ F̂(1/2^t) (or, for general β, ᾱ = β^t). Typically, µ is chosen in the range between 0.01 and 0.3, and β between 0.1 and 0.8.

Note that if x^k and x^{k+1} are consecutive iterates of the general optimization algorithm with d^k a descent direction, and the stepsizes are chosen by backtracking, then f(x^{k+1}) < f(x^k); that is, the algorithm is guaranteed to produce an improvement in the function value at every iteration. Under additional assumptions on f, it can also be shown that the steepest descent algorithm will demonstrate global convergence properties under the Armijo line search rule, as stated in the following theorem.

Theorem 5.2 (Convergence Theorem; Steepest Descent with backtracking line search) Suppose that the set S = {x ∈ R^n : f(x) ≤ f(x^0)} is closed and bounded, and suppose that the gradient of f is Lipschitz continuous on the set S, i.e., there exists a constant G > 0 such that

||∇f(x) − ∇f(y)|| ≤ G||x − y|| for all x, y ∈ S.

Suppose further that the sequence {x^k} is generated by the steepest descent algorithm with stepsizes ᾱ_k chosen by a backtracking line search. Then every point x̄ that is a limit point of the sequence {x^k} satisfies ∇f(x̄) = 0.

The additional assumption, basically, ensures that the gradient of f does not change too rapidly.
In the proof of the theorem, this assumption makes it possible to provide a lower bound on the stepsize in each iteration. (See any of the reference textbooks for details.)

Remark: Our discussion so far implicitly assumed that the domain of the optimization problem was the entire R^n. If our optimization problem is

(P) min f(x) s.t. x ∈ X,

where X is an open set, then the line-search problem is

min_α f(x̄ + αd) s.t. x̄ + αd ∈ X.

In this case, we must ensure that all iterate values of α in the backtracking algorithm satisfy x̄ + αd ∈ X. As an example, consider the following problem:

(P) min f(x) := −Σ_{i=1}^m ln(b_i − a_i^T x)
    s.t. b − Ax > 0.

Here the domain of f(x) is X = {x ∈ R^n : b − Ax > 0}. Given a point x̄ ∈ X and a direction d, the line-search problem is:

(LS) min h(α) := f(x̄ + αd) = −Σ_{i=1}^m ln(b_i − a_i^T(x̄ + αd))
     s.t. b − A(x̄ + αd) > 0.

Standard arithmetic manipulation can be used to establish that

b − A(x̄ + αd) > 0 if and only if α̌ < α < α̂,

where

α̌ := max_{i: a_i^T d < 0} (b_i − a_i^T x̄)/(a_i^T d) and α̂ := min_{i: a_i^T d > 0} (b_i − a_i^T x̄)/(a_i^T d),

and the line-search problem then is:

(LS′) minimize h(α) := −Σ_{i=1}^m ln(b_i − a_i^T(x̄ + αd)) s.t. α̌ < α < α̂.

The implementation of the backtracking rule for this problem would have to be modified: starting with α = 1, we will backtrack, if necessary, until α < α̂, and only then start checking the sufficient decrease condition.

5.4 Newton's method for minimization

Again, we want to solve

(P) min f(x), x ∈ R^n.

Newton's method can also be interpreted in the framework of the general optimization algorithm, but it truly stems from Newton's method for solving systems of nonlinear equations. Recall that if Φ : R^n → R^n, to solve the system of equations Φ(x) = 0, one can apply an iterative method. Starting at a point x̄, approximate the function by Φ(x̄ + d) ≈ Φ(x̄) + ∇Φ(x̄)^T d, where ∇Φ(x̄)^T ∈ R^{n×n} is the Jacobian of Φ at x̄, and, provided that ∇Φ(x̄) is nonsingular, solve the system of linear equations

∇Φ(x̄)^T d = −Φ(x̄)
to obtain d. Set the next iterate x' = x̄ + d, and continue. This method is well-studied, and is well-known for its good performance when the starting point x^0 is chosen appropriately. Newton's method for minimization is precisely an application of this equation-solving method to the (system of) first-order optimality conditions ∇f(x) = 0.

Here is another view of the motivation behind Newton's method for optimization. At x = x̄, f(x) can be approximated by

f(x) ≈ q(x) := f(x̄) + ∇f(x̄)^T(x − x̄) + (1/2)(x − x̄)^T H(x̄)(x − x̄),

which is the quadratic Taylor expansion of f(x) at x = x̄. q(x) is a quadratic function which is minimized by solving ∇q(x) = 0, i.e., ∇f(x̄) + H(x̄)(x − x̄) = 0, which yields

x − x̄ = −H(x̄)^{−1}∇f(x̄).

The direction −H(x̄)^{−1}∇f(x̄) is called the Newton direction, or the Newton step. This leads to the following algorithm for solving (P):

Newton's Method:
Step 0 Given x^0, set k := 0.
Step 1 d^k := −H(x^k)^{−1}∇f(x^k). If d^k = 0, then stop.
Step 2 Choose stepsize α_k = 1.
Step 3 Set x^{k+1} := x^k + α_k d^k, k := k + 1. Go to Step 1.

Proposition 5.3 If H(x) is p.d., then d = −H(x)^{−1}∇f(x) is a descent direction.

Proof: It is sufficient to show that ∇f(x)^T d = −∇f(x)^T H(x)^{−1}∇f(x) < 0. Since H(x) is positive definite, for any v ≠ 0,

0 < (H(x)^{−1}v)^T H(x)(H(x)^{−1}v) = v^T H(x)^{−1}v,

i.e., H(x)^{−1} is also positive definite, completing the proof.

Note that:
- Work per iteration: O(n³).
- The iterates of Newton's method are, in general, equally attracted to local minima and local maxima. Indeed, the method is just trying to solve the system of equations ∇f(x) = 0.
- The method assumes H(x^k) is nonsingular at each iteration. Moreover, unless H(x^k) is positive definite, d^k is not guaranteed to be a descent direction.
- There is no guarantee that f(x^{k+1}) ≤ f(x^k).
- Step 2 could be augmented by a line search of f(x^k + αd^k) over the value of α; then the previous consideration would not be an issue.
- What if H(x^k) becomes increasingly singular (or not positive definite)? Use H(x^k) + εI.
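The basic iteration can be sketched directly: solve H(x^k)d = −∇f(x^k) for the Newton direction (solving the linear system is the O(n³) work per iteration noted above) and take the full step. The quadratic test problem below reuses the Q and q of the earlier steepest descent example for illustration:

```python
import numpy as np

# Pure Newton's method: at each step solve H(x^k) d = -grad f(x^k) and
# set x^{k+1} = x^k + d (stepsize 1 throughout).
def newton(grad_f, hess_f, x0, eps=1e-10, max_iter=50):
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:
            break
        d = np.linalg.solve(hess_f(x), -g)   # Newton direction
        x = x + d
    return x

# Sanity check on a strictly convex quadratic, where the quadratic model
# is exact and Newton's method reaches the minimizer in a single step.
Q = np.array([[4.0, -2.0], [-2.0, 2.0]])
q = np.array([2.0, -2.0])
x = newton(lambda x: Q @ x + q, lambda x: Q, np.array([5.0, -3.0]))
print(x)   # the unique minimizer -Q^{-1} q = (0, 1)
```

Contrast with steepest descent on the same function, which needed dozens of iterations: here one Newton step suffices because q(x) coincides with f(x).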
In general, the points generated by Newton's method as it is described above may not converge. For example, H(x^k)^{−1} may not exist at some iteration. Even if H(x) is always non-singular, the method may not converge unless started close enough to the right point.
Example 1: Let f(x) = 7x − ln(x). Then ∇f(x) = f′(x) = 7 − 1/x and H(x) = f″(x) = 1/x². It is not hard to check that x* = 1/7 = 0.142857... is the unique global minimizer. The Newton direction at x is

d = −H(x)^{−1}∇f(x) = −f′(x)/f″(x) = −x²(7 − 1/x) = x − 7x²,

and is defined so long as x > 0. So, Newton's method will generate the sequence of iterates {x^k} with

x^{k+1} = x^k + (x^k − 7(x^k)²) = 2x^k − 7(x^k)².

[Table of iterate sequences generated by this method for different starting points; for one of the starting points, the first iterate is not in the domain of the objective function, so the algorithm has to terminate with an error. Below it, a plot of the progress of the algorithm as a function of iteration number for the two sequences that did converge.]

Example 2: f(x) = −ln(1 − x₁ − x₂) − ln x₁ − ln x₂.

∇f(x) = ( 1/(1 − x₁ − x₂) − 1/x₁ , 1/(1 − x₁ − x₂) − 1/x₂ )^T.
H(x) = [ (1 − x₁ − x₂)^{−2} + x₁^{−2} , (1 − x₁ − x₂)^{−2} ; (1 − x₁ − x₂)^{−2} , (1 − x₁ − x₂)^{−2} + x₂^{−2} ].

x* = (1/3, 1/3), f(x*) = 3 ln 3.

[Table of iterates x^k and errors ||x^k − x*|| omitted.]

Termination criteria Since Newton's method works with the Hessian as well as the gradient, it would be natural to augment the termination criterion we used in the steepest descent algorithm with the requirement that H(x^k) is positive semi-definite, or, taking into account the potential for computational errors, that H(x^k) + εI is positive semi-definite for some ε > 0 (this parameter may be different than the one used in the condition on the gradient).
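The behavior in Example 1 is easy to reproduce, since the Newton iteration collapses to the scalar recursion x_{k+1} = 2x_k − 7x_k². The two starting points below are illustrative choices, one inside and one outside the region from which the method converges:

```python
# Pure Newton's method on Example 1: f(x) = 7x - ln(x), with
# f'(x) = 7 - 1/x and f''(x) = 1/x^2, so the Newton iteration is
#   x_{k+1} = 2 x_k - 7 x_k^2,
# valid only while the iterate stays in the domain x > 0.
def newton_1d(x0, n_steps=10):
    x = x0
    for _ in range(n_steps):
        if x <= 0:
            return None        # iterate left the domain: terminate with an error
        x = 2 * x - 7 * x * x
    return x

x_star = 1 / 7
good = newton_1d(0.1)     # converges rapidly to x* = 1/7
bad = newton_1d(1.0)      # first step gives 2 - 7 = -5, outside the domain
print(good, bad)
```

This illustrates the local nature of the method: from x^0 = 0.1 it converges to 1/7 to machine precision within a handful of steps, while from x^0 = 1 it immediately leaves the domain of f.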
5.5 Comparing performance of the steepest descent and Newton algorithms

5.5.1 Rate of convergence

Suppose we have a converging sequence lim_{k→∞} s_k = s̄, and we would like to characterize the speed, or rate, at which the iterates s_k approach the limit s̄.

A converging sequence of numbers {s_k} exhibits linear convergence if for some 0 ≤ C < 1,

lim_{k→∞} |s_{k+1} − s̄| / |s_k − s̄| = C.

C in the above expression is referred to as the rate constant; if C = 0, the sequence exhibits superlinear convergence.

A converging sequence of numbers {s_k} exhibits quadratic convergence if

lim_{k→∞} |s_{k+1} − s̄| / |s_k − s̄|² = δ < ∞.

Examples:

Linear convergence: s_k = (0.1)^k: 0.1, 0.01, 0.001, etc. s̄ = 0, and

|s_{k+1} − s̄| / |s_k − s̄| = 0.1.

Superlinear convergence: s_k = 1/k!: 1, 1/2, 1/6, 1/24, 1/120, etc. s̄ = 0, and

|s_{k+1} − s̄| / |s_k − s̄| = k!/(k + 1)! = 1/(k + 1) → 0 as k → ∞.

Quadratic convergence: s_k = 10^{−(2^k)}: 0.1, 0.01, 0.0001, 0.00000001, etc. s̄ = 0, and

|s_{k+1} − s̄| / |s_k − s̄|² = (10^{−2^k})² / 10^{−2^{k+1}} = 1.

[Figure comparing the rates of convergence of the above sequences, with the y-axis displayed on a logarithmic scale.]
We will use the notion of rate of convergence to analyze one aspect of the performance of optimization algorithms. Indeed, since an algorithm for nonlinear optimization problems, in its abstract form, generates an infinite sequence of points {x^k} converging to a solution x̄ only in the limit, it makes sense to discuss the rate of convergence of the sequence ||e^k|| = ||x^k − x̄||, or of E_k = f(x^k) − f(x̄), both of which have limit 0.

5.5.2 Rate of convergence of the steepest descent algorithm for the case of a quadratic function

In this section we explore answers to the question of how fast the steepest descent algorithm converges. Recall that in the earlier example we observed linear convergence of both the sequence {E_k} and {||e^k||}. We will show now that the steepest descent algorithm with stepsizes selected by exact line search in general exhibits linear convergence, but that the rate constant depends very much on the ratio of the largest to the smallest eigenvalue of the Hessian matrix H(x) at the optimal solution x = x*. In order to see how this dependence arises, we will examine the case where the objective function f(x) is itself a simple quadratic function of the form

f(x) = (1/2)x^T Qx + q^T x,

where Q is a positive definite symmetric matrix. We will suppose that the eigenvalues of Q are

A = a₁ ≥ a₂ ≥ ... ≥ a_n = a > 0,

i.e., A and a are the largest and smallest eigenvalues of Q. We already derived that the optimal solution of (P) is x* = −Q^{−1}q, with the optimal objective function value f(x*) = −(1/2)q^T Q^{−1}q.
Moreover, if x is the current point in the steepest descent algorithm, then f(x) = (1/2)x^T Qx + q^T x, and the next iterate of the steepest descent algorithm with exact line search is

x' = x + ᾱd = x + ((d^T d)/(d^T Qd)) d, where d = −∇f(x),

and

f(x') = f(x) − (1/2)(d^T d)²/(d^T Qd).

Therefore,

(f(x') − f(x*)) / (f(x) − f(x*))
 = 1 − ((1/2)(d^T d)²/(d^T Qd)) / (f(x) − f(x*))
 = 1 − ((1/2)(d^T d)²/(d^T Qd)) / ((1/2)x^T Qx + q^T x + (1/2)q^T Q^{−1}q)
 = 1 − ((1/2)(d^T d)²/(d^T Qd)) / ((1/2)(Qx + q)^T Q^{−1}(Qx + q))
 = 1 − (d^T d)² / ((d^T Qd)(d^T Q^{−1}d))
 = 1 − 1/δ, where δ := (d^T Qd)(d^T Q^{−1}d)/(d^T d)².

In order for the convergence constant 1 − 1/δ to be good, which will translate to fast linear convergence, we would like the quantity δ to be small. The following result provides an upper bound on the value of δ.

Kantorovich Inequality: Let A and a be the largest and the smallest eigenvalues of Q, respectively. Then

δ ≤ (A + a)²/(4Aa).

We will skip the proof of this inequality. Let us apply this inequality to the above analysis. Continuing, we have

(f(x') − f(x*)) / (f(x) − f(x*)) = 1 − 1/δ ≤ 1 − 4Aa/(A + a)² = (A − a)²/(A + a)² = ((A/a − 1)/(A/a + 1))².

Note by definition that A/a is always at least 1. If A/a is small (not much bigger than 1), then the convergence constant will be much smaller than 1. However, if A/a is large, then the convergence constant will be only slightly smaller than 1. The following table shows some sample values:
A/a | Upper bound on the rate constant | Upper bound on the number of iterations to reduce the optimality gap by 0.1
1.1 | 0.0023 | 1
3 | 0.25 | 2
10 | 0.67 | 6
100 | 0.96 | 58
1000 | 0.996 | 576

(The upper bound on the rate constant is ((A/a − 1)/(A/a + 1))², and the iteration count is the smallest n for which this bound raised to the power n is at most 0.1.)

Note that the number of iterations needed to reduce the optimality gap by 0.1 grows linearly in the ratio A/a. [Two pictures of possible iterations of the steepest descent algorithm.]
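The key inequality behind the table, namely that the per-iteration reduction factor 1 − (d^T d)²/((d^T Qd)(d^T Q^{−1}d)) never exceeds ((A − a)/(A + a))², can be checked numerically by sampling directions d at random (the matrix is the Q of the earlier quadratic example; the sample size is arbitrary):

```python
import numpy as np

# Numerical check of the consequence of the Kantorovich inequality:
# for every d != 0,
#   1 - (d^T d)^2 / ((d^T Q d)(d^T Q^{-1} d)) <= ((A - a)/(A + a))^2,
# where A and a are the largest and smallest eigenvalues of Q.
Q = np.array([[4.0, -2.0], [-2.0, 2.0]])
Q_inv = np.linalg.inv(Q)
a, A = np.linalg.eigvalsh(Q)          # eigvalsh returns ascending order
bound = ((A - a) / (A + a)) ** 2

rng = np.random.default_rng(0)
worst = 0.0
for _ in range(1000):
    d = rng.standard_normal(2)
    ratio = 1 - (d @ d) ** 2 / ((d @ Q @ d) * (d @ Q_inv @ d))
    worst = max(worst, ratio)
print(worst, bound)
```

The observed worst case approaches the bound for unlucky directions, which is consistent with the remark below that the bound is often attained in practice.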
Some remarks:
- We analyzed the convergence of the function values; the convergence of the algorithm's iterates can be easily shown to be linear with the same rate constant.
- The bound on the rate of convergence is attained in practice quite often, which is unfortunate.
- The ratio of the largest to the smallest eigenvalue of a matrix is called the condition number of the matrix.
- What about non-quadratic functions? If the Hessian at the locally optimal solution is positive definite, the function behaves as a near-quadratic function in a neighborhood of that solution, and the convergence exhibited by the iterates of the steepest descent algorithm will also be linear. The analysis of the non-quadratic case gets very involved; fortunately, the key intuition is obtained by analyzing the quadratic case.
- What about backtracking line search? Also linear convergence! (The rate constant depends in part on the backtracking parameters.)

5.5.3 Rate of convergence of the pure Newton's method

We have seen from our examples that, even for convex functions, Newton's method in its pure form (i.e., with stepsize of 1 at every iteration) does not guarantee descent at each iteration, and may produce a diverging sequence of iterates. Moreover, each iteration of Newton's method is much more computationally intensive than that of steepest descent. However, under certain conditions, the method exhibits a quadratic rate of convergence, making it the ideal method for solving convex optimization problems.

Recall that a method exhibits quadratic convergence when ||e^k|| = ||x^k − x̄|| → 0 and

lim_{k→∞} ||e^{k+1}||/||e^k||² = C < ∞.

Roughly speaking, if the iterates converge quadratically, the accuracy (i.e., the number of correct digits) of the solution doubles in a fixed number of iterations. There are many ways to state and prove results regarding the convergence of Newton's method.
We present one that provides particular insight into the circumstances under which the pure Newton's method demonstrates quadratic convergence. Let ||v|| denote the usual Euclidean norm of a vector, namely ||v|| := √(v^T v). Recall that the operator norm of a matrix M is defined as follows:

||M|| := max_x {||Mx|| : ||x|| = 1}.

As a consequence of this definition, for any x, ||Mx|| ≤ ||M|| ||x||.

Theorem 5.4 (Quadratic convergence) Suppose f(x) is twice continuously differentiable and x* is a point for which ∇f(x*) = 0. Suppose H(x) satisfies the following conditions:
- there exists a scalar h > 0 for which ||[H(x*)]^{−1}|| ≤ 1/h;
- there exist scalars β > 0 and L > 0 for which ||H(x) − H(y)|| ≤ L||x − y|| for all x and y satisfying ||x − x*|| ≤ β and ||y − x*|| ≤ β.
Let x satisfy ||x − x*|| ≤ γΓ, where 0 < γ < 1 and Γ := min{β, 2h/(3L)}, and let x_N := x − H(x)^{−1}∇f(x). Then:

(i) ||x_N − x*|| ≤ ||x − x*||² · L/(2(h − L||x − x*||));
(ii) ||x_N − x*|| < ||x − x*||, and hence the iterates converge to x*;
(iii) ||x_N − x*|| ≤ ||x − x*||² · 3L/(2h).

The proof relies on the following two elementary facts.

Proposition 5.5 Suppose that M is a symmetric matrix. Then the following are equivalent:
1. h > 0 satisfies ||M^{−1}|| ≤ 1/h;
2. h > 0 satisfies ||Mv|| ≥ h||v|| for any vector v.

Proposition 5.6 Suppose that f(x) is twice differentiable. Then

∇f(z) − ∇f(x) = ∫₀¹ [H(x + t(z − x))](z − x) dt.

Proof: Let φ(t) := ∇f(x + t(z − x)). Then φ(0) = ∇f(x) and φ(1) = ∇f(z), and φ′(t) = [H(x + t(z − x))](z − x). From the fundamental theorem of calculus, we have:

∇f(z) − ∇f(x) = φ(1) − φ(0) = ∫₀¹ φ′(t) dt = ∫₀¹ [H(x + t(z − x))](z − x) dt.

Proof of Theorem 5.4 We have:

x_N − x* = x − H(x)^{−1}∇f(x) − x*
 = x − x* + H(x)^{−1}(∇f(x*) − ∇f(x))
 = x − x* + H(x)^{−1} ∫₀¹ [H(x + t(x* − x))](x* − x) dt  (from Proposition 5.6)
 = H(x)^{−1} ∫₀¹ [H(x + t(x* − x)) − H(x)](x* − x) dt.

Therefore

||x_N − x*|| ≤ ||H(x)^{−1}|| ∫₀¹ ||H(x + t(x* − x)) − H(x)|| ||x* − x|| dt
 ≤ ||x* − x|| ||H(x)^{−1}|| ∫₀¹ L t ||x* − x|| dt
 = ||x* − x||² ||H(x)^{−1}|| L ∫₀¹ t dt
 = ||x* − x||² ||H(x)^{−1}|| L/2.
We now bound ||H(x)^{−1}||. Let v be any vector. Then

||H(x)v|| = ||H(x*)v + (H(x) − H(x*))v||
 ≥ ||H(x*)v|| − ||(H(x) − H(x*))v||
 ≥ h||v|| − ||H(x) − H(x*)|| ||v||  (from Proposition 5.5)
 ≥ h||v|| − L||x* − x|| ||v||
 = (h − L||x* − x||)||v||.

Invoking Proposition 5.5 again, we see that this implies that

||H(x)^{−1}|| ≤ 1/(h − L||x* − x||).

Combining this with the above yields

||x_N − x*|| ≤ ||x* − x||² · L/(2(h − L||x* − x||)),

which is (i) of the theorem. Because L||x* − x|| ≤ 2hγ/3 < 2h/3, we have:

||x_N − x*|| ≤ ||x* − x|| · L||x* − x||/(2(h − L||x* − x||)) ≤ ||x* − x|| · (2hγ/3)/(2(h − 2h/3)) = γ||x* − x|| < ||x* − x||,

which establishes (ii) of the theorem. Finally, we have

||x_N − x*|| ≤ ||x* − x||² · L/(2(h − L||x* − x||)) ≤ ||x* − x||² · L/(2(h − 2h/3)) = ||x* − x||² · 3L/(2h),

which establishes (iii) of the theorem.

Notice that the results regarding the convergence and rate of convergence in the above theorem are local, i.e., they apply only if the algorithm is initialized at certain starting points (the ones sufficiently close to the desired limit). In practice, it is not known how to pick such starting points, or how to check whether a proposed starting point is adequate. (With the very important exception of self-concordant functions.)
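The quadratic convergence predicted by Theorem 5.4 is easy to observe numerically on Example 2 from the previous section; the starting point below is an illustrative choice inside the domain, not a value taken from the notes:

```python
import numpy as np

# Pure Newton's method on Example 2:
#   f(x) = -ln(1 - x1 - x2) - ln x1 - ln x2,  minimizer x* = (1/3, 1/3).
def grad(x):
    s = 1 - x[0] - x[1]
    return np.array([1 / s - 1 / x[0], 1 / s - 1 / x[1]])

def hess(x):
    s2 = (1 - x[0] - x[1]) ** -2
    return np.array([[s2 + x[0] ** -2, s2],
                     [s2, s2 + x[1] ** -2]])

x = np.array([0.1, 0.1])          # illustrative starting point in the domain
x_star = np.array([1 / 3, 1 / 3])
errors = []
for _ in range(8):
    errors.append(np.linalg.norm(x - x_star))
    x = x + np.linalg.solve(hess(x), -grad(x))
print(errors)
```

Printing the errors shows the hallmark of quadratic convergence: once the iterates are close, the number of correct digits roughly doubles at every step, until machine precision is reached.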
5.6 Further discussion and modifications of the Newton's method

5.6.1 Global convergence for strongly convex functions with a two-phase Newton's method

We have noted that, to ensure descent at each iteration, Newton's method can be augmented by a line search. This idea can be formalized, and the efficiency of the resulting algorithm can be analyzed (see, for example, Convex Optimization by Stephen Boyd and Lieven Vandenberghe, available at https://web.stanford.edu/~boyd/cvxbook/, for a fairly simple presentation of the analysis).

Suppose that f(x) is strongly convex on its domain, i.e., assume there exists µ > 0 such that the smallest eigenvalue of H(x) is greater than or equal to µ for all x, and that the Hessian is Lipschitz continuous everywhere on the domain of f. Suppose we apply Newton's method with the stepsize at each iteration determined by the backtracking procedure of Section 5.3.2. That is, at each iteration of the algorithm we first attempt to take a full Newton step, but reduce the stepsize if the decrease in the function value is not sufficient. Then there exist positive numbers η and δ such that if ||∇f(x^k)|| ≥ η, then f(x^{k+1}) − f(x^k) ≤ −δ, and if ||∇f(x^k)|| < η, then stepsize α_k = 1 will be selected, the next iterate will satisfy ||∇f(x^{k+1})|| < η, and so will all the further iterates. Moreover, quadratic convergence will be observed in this phase.

As hinted above, the algorithm will proceed in two phases: while the iterates are far from the minimizer, a dampening of the Newton step will be required, but there will be a guaranteed decrease in the objective function values. This phase (referred to as the "dampened Newton phase") cannot take more than (f(x^0) − f(x*))/δ iterations. Once the norm of the gradient becomes sufficiently small, no dampening of the Newton step will be required in the rest of the algorithm, and quadratic convergence will be observed, thus making it the "quadratically convergent phase." Note that it is not necessary to know the values of η and δ to apply this version of the algorithm!

The two-phase Newton's method is globally convergent; however, to ensure global convergence, the function being minimized needs to possess particularly nice global properties.

5.6.2 Other modifications of the Newton's method

We have seen that if Newton's method is initialized sufficiently close to the point x̄ such that ∇f(x̄) = 0 and H(x̄) is positive definite (i.e., x̄ is a local minimizer), then it will converge quadratically, using stepsizes of α = 1. There are three issues in the above statement that we should be concerned with:
- What if H(x̄) is singular, or nearly-singular?
- How do we know if we are close enough, and what do we do if we are not?
- Can we modify Newton's method to guarantee global convergence?
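A sketch of the damped (line-search) Newton method described in 5.6.1: take the Newton direction, but choose the stepsize by backtracking, attempting the full step α = 1 first. The parameter values and the strongly convex test function below are illustrative assumptions:

```python
import numpy as np

# Newton's method with Armijo backtracking ("damped Newton"): try the
# full step alpha = 1 first, and halve it until sufficient decrease holds.
def damped_newton(f, grad_f, hess_f, x0, mu=0.2, eps=1e-10, max_iter=100):
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad_f(x)
        if np.linalg.norm(g) <= eps:
            break
        d = np.linalg.solve(hess_f(x), -g)       # Newton direction
        alpha = 1.0
        while f(x + alpha * d) > f(x) + mu * alpha * (g @ d):
            alpha *= 0.5                         # dampen the Newton step
        x = x + alpha * d
    return x

# Illustrative strongly convex test function: f(x) = e^{x1+x2} + x1^2 + x2^2.
f = lambda x: np.exp(x[0] + x[1]) + x[0] ** 2 + x[1] ** 2
grad_f = lambda x: np.exp(x[0] + x[1]) + 2 * x
def hess_f(x):
    e = np.exp(x[0] + x[1])
    return np.array([[e + 2, e], [e, e + 2]])

x_min = damped_newton(f, grad_f, hess_f, np.array([1.0, 1.0]))
print(x_min, np.linalg.norm(grad_f(x_min)))
```

By symmetry the minimizer has equal components, each solving 2t = −e^{2t}, i.e., t ≈ −0.2835716; the damped phase gives guaranteed decrease far from it, and full steps are taken once the gradient is small.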
In the previous subsection we assumed away the first issue and, under an additional assumption, showed how to address the other two. What if the function $f$ is not strongly convex, and $H(x)$ may approach singularity? There are two popular approaches (which are actually closely related) to address these issues.

The first approach ensures that the method always uses a descent direction. For example, instead of the direction $-H(x_k)^{-1}\nabla f(x_k)$, use the direction $-(H(x_k) + \varepsilon_k I)^{-1}\nabla f(x_k)$, where $\varepsilon_k \ge 0$ is chosen so that the smallest eigenvalue of $H(x_k) + \varepsilon_k I$ is bounded below by a fixed number $\bar{\varepsilon} > 0$. It is important to choose the value of $\varepsilon_k$ appropriately: if it is chosen too small, the matrix employed in computing the direction can become ill-conditioned when $H(x_k)$ is nearly singular; if it is chosen too large, the direction becomes nearly that of the steepest descent algorithm, and hence only linear convergence can be guaranteed. Hence, the value of $\varepsilon_k$ is often chosen dynamically.

The second approach is the so-called trust region method. Note that the main idea behind Newton's method is to represent the function $f(x)$ around the current iterate by its quadratic approximation

\[
q_k(x) = f(x_k) + \nabla f(x_k)^T (x - x_k) + \tfrac{1}{2}(x - x_k)^T H(x_k)(x - x_k),
\]

and then minimize that approximation. While locally the approximation works quite well, this may no longer be the case when a large step is taken. Trust region methods hence find the next iterate by solving the constrained optimization problem

\[
\min\; q_k(x) \quad \text{s.t.}\quad \|x - x_k\| \le \Delta_k,
\]

i.e., they do not allow the next iterate to leave the neighborhood of $x_k$ in which the quadratic approximation is close to the original function $f(x)$. (As it turns out, this problem is not much harder to solve than the unconstrained minimization of $q_k(x)$.)

The value of $\Delta_k$ represents the size of the region in which we can trust $q_k(x)$ to provide a good approximation of $f(x)$. Smaller values of $\Delta_k$ ensure that we are working with an accurate representation of $f(x)$, but result in conservative steps. Larger values of $\Delta_k$ allow for larger steps, but may lead to inaccurate estimation of the objective function. To account for this, the value of $\Delta_k$ is updated dynamically throughout the algorithm: it is increased if $q_k(x)$ provided an exceptionally good approximation of $f(x)$ at the previous iteration, and decreased if the approximation was exceptionally bad.
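The first approach above (shifting the Hessian to guarantee a descent direction) can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the shift rule based on an explicit eigenvalue computation, and the constant `eps_bar = 1e-3` are choices made for the example; production codes typically obtain the shift more cheaply, e.g. from a modified Cholesky factorization.

```python
import numpy as np

def regularized_newton_direction(H, g, eps_bar=1e-3):
    """Return d = -(H + eps_k I)^{-1} g, with eps_k >= 0 chosen so that
    the smallest eigenvalue of H + eps_k I is at least eps_bar > 0.

    The shifted matrix is positive definite, so d is a descent
    direction whenever g != 0.
    """
    lam_min = np.linalg.eigvalsh(H)[0]    # eigvalsh returns eigenvalues in ascending order
    eps_k = max(0.0, eps_bar - lam_min)   # no shift needed if H is already safely PD
    return np.linalg.solve(H + eps_k * np.eye(len(g)), -g)

# An indefinite Hessian, as can occur away from a local minimizer:
H = np.array([[1.0, 0.0],
              [0.0, -2.0]])
g = np.array([1.0, 1.0])
d = regularized_newton_direction(H, g)    # guaranteed descent: g @ d < 0
```

When $H$ is already positive definite with smallest eigenvalue above $\bar{\varepsilon}$, the shift is zero and the ordinary Newton direction is returned unchanged.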