Introduction to Nonlinear Optimization

Paul J. Atzberger

Comments should be sent to: atzberg@math.ucsb.edu
Introduction

In these notes we give a brief introduction to nonlinear optimization concepts, which have wide applicability in finance and other applied fields. The basic problem is to solve

    min_x f(x)    (1)

where typically x ∈ R^n, possibly subject to constraints. To numerically approximate the solution (if it exists), we construct a sequence {x_k}_{k=1}^∞ such that x_k → x^*, where f(x^*) = min_x f(x). Many algorithms attempt to construct such a sequence, with different approaches appropriate depending on the problem. A good reference on nonlinear optimization methods is Numerical Optimization by Nocedal and Wright. In these notes we discuss primarily line search methods, and handle any constraints that arise using penalty terms.

In line search algorithms the sequence {x_k}_{k=1}^∞ is constructed iteratively, at each step choosing a search direction p_k and attempting to minimize the objective function along the line (ray) in this direction. This essentially reduces the problem to a sequence of one-dimensional problems, where x_k is given by the basic recurrence

    x_{k+1} = x_k + α_k p_k    (2)

where the step length α_k > 0 must be estimated. The key to obtaining a sequence x_k which converges rapidly is to construct an algorithm which makes effective use of the information about f(x) from the previous iterates to choose a good direction p_k and step length α_k.

To ensure that progress can be made each iteration, a natural condition to impose on p_k is that the function decrease, at least locally, in this direction. We call p_k a descent direction if ∇f_k^T p_k < 0. Typically the descent direction has the form p_k = −B_k^{−1} ∇f_k. When B_k is positive definite this ensures that p_k is a descent direction, since

    ∇f_k^T p_k = −∇f_k^T B_k^{−1} ∇f_k < 0.

Steepest descent sets B_k = I, the identity matrix, so that p_k = −∇f_k. Newton's method sets B_k = ∇²f_k, the Hessian, so that p_k = −(∇²f_k)^{−1} ∇f_k. It is important to note that p_k is only ensured to be a descent direction if the Hessian is positive definite.
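The roles of the matrix B_k can be illustrated concretely. The following sketch (the matrix Q and vector b are arbitrarily chosen hypothetical values, not from the text) computes the steepest descent and Newton directions for the quadratic f(x) = ½ x^T Q x − b^T x and checks that both satisfy the descent condition ∇f_k^T p_k < 0:

```python
import numpy as np

# Hypothetical quadratic test problem f(x) = 1/2 x^T Q x - b^T x.
Q = np.array([[3.0, 1.0],
              [1.0, 2.0]])        # symmetric positive definite
b = np.array([1.0, 1.0])

def grad_f(x):
    # gradient of the quadratic: grad f(x) = Q x - b
    return Q @ x - b

x = np.array([2.0, -1.0])          # arbitrary starting point
g = grad_f(x)

p_sd = -g                          # steepest descent: B_k = I
p_newton = -np.linalg.solve(Q, g)  # Newton: B_k = Hessian = Q

# Both satisfy the descent condition grad f . p < 0 since Q is positive definite.
print(g @ p_sd, g @ p_newton)

# For a quadratic, the full Newton step lands on the minimizer x* = Q^{-1} b.
x_star = np.linalg.solve(Q, b)
print(x + p_newton, x_star)
```

Note that for this quadratic a single Newton step reaches the minimizer exactly, while steepest descent in general needs many iterations; this foreshadows the convergence rates derived below.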
The Hessian, however, is often expensive to compute numerically. Quasi-Newton methods construct a B_k which approximates the Hessian by making use of the previous evaluations of f and ∇f; see Nocedal and Wright for further discussion. In designing a good numerical optimization method some care must be taken in choosing not only the direction p_k but also the step length α_k, and this will usually depend on the function f(x) being optimized. We shall assume that p_k is some chosen descent direction and now discuss how to choose the step length α_k.

Wolfe Conditions: Choosing a Step Length α

Let φ(α) = f(x_k + α p_k). One choice of α which might seem ideal would be to minimize φ(α) globally over all α. However, it is typically computationally expensive to find the global optimum even of this one-dimensional problem. Since the ray along p_k will not typically contain the minimizer of the full problem, multiple
iterations will almost always be required. This suggests that in designing an algorithm there should be a balance between making progress in minimizing the objective function f along each search direction p_k and iterating over a number of different search directions. To ensure our algorithm converges we must be more mathematically precise about what we mean by making progress along each search direction.

The simplest criterion we might consider, to determine whether progress has been made along a particular search direction, is that the function be reduced: f(x_k + α_k p_k) < f(x_k). However, this alone is not sufficient to ensure convergence to the minimizer. For example, consider f(x,y) = x² + y², and suppose the algorithm produces x_{k+1} = x_k + α_k p_k with α_k = 1 and p_k = [−1/2^k, 0]^T. Then f(x_{k+1}) < f(x_k), but

    x_{k+1} = x_0 + Σ_{j=0}^{k} p_j = x_0 − [Σ_{j=0}^{k} 2^{−j}, 0]^T.

Since Σ_{j=0}^{∞} 2^{−j} = 2, if the first component satisfies [x_0]_1 > 2 then the first component of x_k remains bounded away from 0, so x_k does not converge to the minimizer. In other words, if the progress made each iteration is not sufficiently large, the sequence may converge prematurely to a non-optimal value.

Figure 1: Sufficient decrease condition (the new point must lie below the line).

Figure 2: Curvature condition (the new slope must be greater than the slope of the line).

To ensure that sufficient progress is made each iteration in the line search we shall require the following two conditions to hold:
Wolfe Conditions:

(i) (sufficient decrease) f(x_k + α p_k) ≤ f(x_k) + c_1 α ∇f_k^T p_k, where c_1 ∈ (0,1).

(ii) (curvature condition) ∇f(x_k + α p_k)^T p_k ≥ c_2 ∇f_k^T p_k, where c_2 ∈ (c_1, 1).

(See Figures 1 and 2 for a geometric interpretation.) We now prove that step lengths satisfying the Wolfe Conditions exist.

Convergence of Line Search Optimization Methods

Lemma: Let f : R^n → R be continuously differentiable and let p_k be a descent direction at x_k. If f is bounded below along the ray {x_k + α p_k | α > 0}, then there exist intervals of step lengths [α^(1), α^(2)] on which the Wolfe Conditions are satisfied.

Proof: Since φ(α) = f(x_k + α p_k) is bounded below for all α > 0 and 0 < c_1 < 1, the line l(α) = f(x_k) + α c_1 ∇f_k^T p_k must intersect the graph of φ(α) at least once. Let α′ > 0 be the smallest such intersection, that is,

    f(x_k + α′ p_k) = f(x_k) + α′ c_1 ∇f_k^T p_k.

The sufficient decrease condition (i) then clearly holds for all α < α′. By differentiability of the function, the mean value theorem gives an α″ ∈ (0, α′) such that

    f(x_k + α′ p_k) − f(x_k) = α′ ∇f(x_k + α″ p_k)^T p_k.

Substituting the expression above (sufficient decrease at α′) for f(x_k + α′ p_k), we obtain

    ∇f(x_k + α″ p_k)^T p_k = c_1 ∇f_k^T p_k > c_2 ∇f_k^T p_k,

since c_1 < c_2 and ∇f_k^T p_k < 0. Thus α″ satisfies the curvature condition (ii). Since α″ < α′, both Wolfe Conditions (i) and (ii) hold simultaneously in a neighborhood of α″. ∎

We now discuss the convergence of line search algorithms when the Wolfe Conditions are satisfied.

Theorem (Zoutendijk's Condition): Assume {x_k} is generated by a line search algorithm satisfying the Wolfe Conditions. Suppose that f : R^n → R is bounded below and ∇f is Lipschitz continuous,

    ‖∇f(x) − ∇f(y)‖ ≤ L ‖x − y‖.

Then

    Σ_k cos²(θ_k) ‖∇f_k‖² < ∞,

where cos(θ_k) = −∇f_k^T p_k / (‖∇f_k‖ ‖p_k‖). We remark that θ_k is the angle between the steepest descent direction −∇f_k and the search direction p_k.

Proof: From the curvature condition (ii) we have

    (∇f_{k+1} − ∇f_k)^T p_k ≥ (c_2 − 1) ∇f_k^T p_k.

From Lipschitz continuity we have

    (∇f_{k+1} − ∇f_k)^T p_k ≤ α_k L ‖p_k‖².
By combining both relations we have

    α_k ≥ ((c_2 − 1)/L) · (∇f_k^T p_k / ‖p_k‖²).

Substituting this into (i) we have

    f_{k+1} ≤ f_k − (c_1 (1 − c_2)/L) · ((∇f_k^T p_k)² / ‖p_k‖²).

From the definition of cos(θ_k) we have

    f_{k+1} ≤ f_k − c cos²(θ_k) ‖∇f_k‖²,

where c = c_1(1 − c_2)/L. Summing over all indices up to k we have

    f_{k+1} ≤ f_0 − c Σ_{j=0}^{k} cos²(θ_j) ‖∇f_j‖².

Since f(x) is bounded below we must have

    Σ_{j=0}^{∞} cos²(θ_j) ‖∇f_j‖² < ∞.

Using the standard facts about convergence of series, we have

    cos²(θ_k) ‖∇f_k‖² → 0, as k → ∞. ∎

This result, while appearing somewhat technical, is actually quite useful. For if we construct a line search method such that cos(θ_k) > δ > 0, then the theorem above implies that lim_{k→∞} ‖∇f_k‖ = 0. Thus if the line search sequence {x_k} remains bounded, there will be a limit point x^* which is a critical point, ∇f(x^*) = 0. Given the sufficient decrease condition this point will likely be a minimizer, although one must be careful to check this if no special structure is assumed about f.

To further describe the convergence, we shall say that a method converges in f with rate p if

    f(x_{k+1}) − f(x^*) ≤ c (f(x_k) − f(x^*))^p.

We say that a method converges in x with rate p if

    ‖x_{k+1} − x^*‖ ≤ c ‖x_k − x^*‖^p.

We shall now discuss the convergence of two common optimization methods, namely steepest descent and Newton's method.

Convergence of Steepest Descent

Steepest Descent: In this case p_k = −∇f_k at each step. Let us consider a problem which we can readily analyze, with an objective function of the form

    f(x) = (1/2) x^T Q x − b^T x,

where Q = Q^T is positive definite. The minimizer of this problem solves Qx = b. In this example we can explicitly compute the step length α_k which minimizes φ(α) at each iteration. With g_k = ∇f_k, this is found from

    f(x_k − α g_k) = (1/2)(x_k − α g_k)^T Q (x_k − α g_k) − b^T (x_k − α g_k),
so that, using ∇f(x) = Qx − b, the minimizer over α > 0 satisfies

    −g_k^T (Q(x_k − α g_k) − b) = 0.

This gives

    α_k = (g_k^T g_k) / (g_k^T Q g_k),

or, since g_k = ∇f_k,

    α_k = (∇f_k^T ∇f_k) / (∇f_k^T Q ∇f_k).

Thus the next iterate is given by (when minimizing in the line search)

    x_{k+1} = x_k − ((∇f_k^T ∇f_k) / (∇f_k^T Q ∇f_k)) ∇f_k.

From the expression above it can be shown that

    f(x_{k+1}) − f(x^*) = [1 − (∇f_k^T ∇f_k)² / ((∇f_k^T Q ∇f_k)(∇f_k^T Q^{−1} ∇f_k))] (f(x_k) − f(x^*)).

This can be expressed in terms of the eigenvalues of Q as

    f(x_{k+1}) − f(x^*) ≤ ((λ_n − λ_1)/(λ_n + λ_1))² (f(x_k) − f(x^*)),

where 0 < λ_1 ≤ λ_2 ≤ ··· ≤ λ_n are the eigenvalues of Q. (The proof is given elsewhere; see Luenberger 1984.) We remark that for nonlinear objective functions we can approximate the function near a minimizer by truncating the Taylor expansion at the quadratic term. In this case the analysis above applies up to the truncation error, with Q set to the Hessian, Q = ∇²f. From this analysis we find that steepest descent has a rate of convergence in f of first order, even in this ideal case.

Convergence of Newton's Method

Newton's Method: Let us now consider the case where the search direction is p_k^(N) = −∇²f_k^{−1} ∇f_k. As mentioned above, we must be a little careful, since the Hessian ∇²f_k may not be positive definite, in which case the search direction may not be a descent direction. For the example above, with Q positive definite, the search direction is p_k^(N) = −Q^{−1} ∇f_k. Analysis similar to that done for steepest descent shows that the rate of convergence in f is quadratic. In particular, in the case of a quadratic objective function it can be shown that at each iteration the optimal step length is α_k = 1. This gives

    x_k + p_k^(N) − x^* = x_k − x^* − ∇²f_k^{−1} ∇f_k = ∇²f_k^{−1} [∇²f_k (x_k − x^*) − (∇f_k − ∇f^*)]
where ∇f^* = ∇f(x^*) = 0. Now using

    ∇f_k − ∇f^* = ∫₀¹ ∇²f(x_k + t(x^* − x_k)) (x_k − x^*) dt,

we have

    ‖∇²f(x_k)(x_k − x^*) − (∇f_k − ∇f^*)‖
      = ‖ ∫₀¹ [∇²f(x_k) − ∇²f(x_k + t(x^* − x_k))] (x_k − x^*) dt ‖
      ≤ ∫₀¹ ‖∇²f(x_k) − ∇²f(x_k + t(x^* − x_k))‖ ‖x_k − x^*‖ dt
      ≤ ‖x_k − x^*‖² ∫₀¹ L t dt = (L/2) ‖x_k − x^*‖²,

where L is the Lipschitz constant of ∇²f(x) for x near x^*. Therefore

    ‖x_k + p_k^(N) − x^*‖ ≤ (L/2) ‖∇²f(x_k)^{−1}‖ ‖x_k − x^*‖²,

and the method converges in x quadratically, p = 2.

A Line Search Algorithm

We now discuss an algorithm to find a step length α which satisfies the Wolfe conditions. The basic strategy is to use interpolation and a bisection-style search, determining at each stage which subinterval contains points satisfying the Wolfe conditions. We state a general algorithm here in pseudocode.

Algorithm: LineSearch (Input: x_k, p_k, α_max. Output: α.)
  α_0 ← 0; specify α_1 > 0 and α_max; i ← 1
  repeat:
    Evaluate φ(α_i);
    if φ(α_i) > φ(0) + c_1 α_i φ′(0), or [φ(α_i) ≥ φ(α_{i−1}) and i > 1]:
      α ← Zoom(α_{i−1}, α_i) and stop;
    Evaluate φ′(α_i);
    if |φ′(α_i)| ≤ −c_2 φ′(0): set α ← α_i and stop;
    if φ′(α_i) ≥ 0: set α ← Zoom(α_i, α_{i−1}) and stop;
    Choose α_{i+1} ∈ (α_i, α_max);
    i ← i + 1;
  end (repeat)

Algorithm: Zoom (Input: α_lo, α_hi. Output: α.)
  repeat:
    Interpolate (using quadratic, cubic, or bisection) to find a trial step length α_j between α_lo and α_hi;
    Evaluate φ(α_j);
    if φ(α_j) > φ(0) + c_1 α_j φ′(0), or φ(α_j) ≥ φ(α_lo):
      α_hi ← α_j;
    else:
      Evaluate φ′(α_j);
      if |φ′(α_j)| ≤ −c_2 φ′(0): set α ← α_j and stop;
      if φ′(α_j)(α_hi − α_lo) ≥ 0: α_hi ← α_lo;
      α_lo ← α_j;
  end (repeat)

For a more in-depth discussion and motivation for these algorithms see Nocedal and Wright, Numerical Optimization, pages 58-61.

Solving Nonlinear Unconstrained Optimization Problems

For unconstrained optimization problems of the form

    min_x f(x)

the basic line search algorithm is as follows:

Unconstrained Line Search Optimization Algorithm: (Input: x_0, α_max, ǫ. Output: x^*.)
  k ← 0; x_k ← x_0
  Evaluate ∇f_k ← ∇f(x_k);
  repeat:
    p_k ← −B_k^{−1} ∇f_k;
    α_k ← LineSearch(x_k, p_k, α_max);
    x_{k+1} ← x_k + α_k p_k;
    k ← k + 1;
    Evaluate ∇f_k ← ∇f(x_k);
    if ‖∇f_k‖ < ǫ: set x^* ← x_k and stop;
  end (repeat)

Solving Nonlinear Constrained Optimization Problems

For constrained optimization problems of the form

    min_x f(x)
    subject to c_i(x) = 0, i ∈ E,
               c_i(x) ≥ 0, i ∈ I,

a line search algorithm can be constructed. To handle the constraints we formulate an unconstrained optimization problem which incorporates the constraints through penalty terms. The strategy is then to solve each of the unconstrained problems with penalty parameter µ_j up to a tolerance ǫ_j. By repeatedly solving this problem, using the solution x_j^* of the previous iteration as the starting point, and successively reducing the values of µ_j and ǫ_j, a sequence of solutions x_j^* → x^* can be generated. One must take some care in how µ_j and ǫ_j are reduced each iteration to ensure convergence and efficient use of computational resources. These issues are further discussed in
Nocedal and Wright. The basic algorithm is:

A Constrained Line Search Optimization Algorithm: (Input: x_0, α_max, ǫ, δ. Output: x^*.)
  j ← 0; x_j ← x_0; µ_j ← µ_0
  repeat:
    Let F(x, µ_j) = f(x) + (1/(2µ_j)) Σ_{i∈E} c_i²(x) − µ_j Σ_{i∈I} log(c_i(x));
    x_j^* ← UnconstrainedOptimization(x_j, α_max, ǫ_j);
    Evaluate ∇F_j ← ∇_x F(x_j^*, µ_j);
    if ‖∇F_j‖ < ǫ and µ_j < δ: set x^* ← x_j^* and stop;
    x_{j+1} ← x_j^*; j ← j + 1;
    reduce the value of µ_j;
    reduce the value of ǫ_j;
  end (repeat)
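The penalty strategy above can be sketched end to end on a toy problem. The sketch below handles equality constraints only, and the inner solver uses simple Armijo backtracking (the sufficient decrease condition alone) in place of the full Wolfe line search; the test problem and all parameter values are hypothetical choices for illustration, not from the text.

```python
import numpy as np

# Toy equality-constrained problem (hypothetical, for illustration):
#   minimize f(x) = x1^2 + x2^2   subject to c(x) = x1 + x2 - 1 = 0.
# The exact solution is x* = (1/2, 1/2).

def f(x):
    return x @ x

def c(x):
    return x[0] + x[1] - 1.0

def F(x, mu):
    # quadratic-penalty objective F(x, mu) = f(x) + c(x)^2 / (2 mu)
    return f(x) + c(x) ** 2 / (2.0 * mu)

def grad_F(x, mu):
    return 2.0 * x + (c(x) / mu) * np.ones(2)

def inner_solve(x, mu, tol, c1=1e-4, max_iter=10_000):
    # Gradient descent with Armijo backtracking: a simplification of the
    # full Wolfe line search (sufficient decrease only, no curvature check).
    for _ in range(max_iter):
        g = grad_F(x, mu)
        if np.linalg.norm(g) < tol:
            break
        p = -g
        alpha = 1.0
        while F(x + alpha * p, mu) > F(x, mu) + c1 * alpha * (g @ p):
            alpha *= 0.5
        x = x + alpha * p
    return x

x = np.zeros(2)
mu, tol = 1.0, 1e-2
for _ in range(12):          # successively tighten penalty and tolerance
    x = inner_solve(x, mu, tol)   # warm start from the previous solution
    mu *= 0.5
    tol *= 0.5

print(x)   # approaches (0.5, 0.5) as mu -> 0
```

Note the warm starting: each inner solve begins from the previous solution, which matters because the penalized problem becomes increasingly ill-conditioned as µ → 0.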