IE 5531: Engineering Optimization I
Lecture 15: Nonlinear Optimization
Prof. John Gunnar Carlsson
November 1, 2010
Administrivia
- Midterms returned 11/01
- 11/01 office hours moved
- No class next week (INFORMS)
Recap
Algorithms for unconstrained minimization:
- Introduction
- Bisection search (root-finding)
- Golden section search (unimodal minimization)
- Line search (minimization)
- Wolfe and Goldstein conditions
Today
- Gradient method (steepest descent) example
- Newton's method
- Constrained problems and the ellipsoid method
Steepest (gradient) descent example
- Recall that in the method of steepest descent, we set d_k = -∇f(x_k)
- Consider the case where we want to minimize f(x) = c^T x + (1/2) x^T Q x, where Q is a symmetric positive definite matrix
- Clearly, the unique minimizer lies where ∇f(x*) = 0, which occurs precisely when Q x = -c
- The descent direction will be d = -∇f(x) = -(c + Q x)
Steepest descent example
- The iteration scheme x_{k+1} = x_k + α_k d_k is given by x_{k+1} = x_k - α_k (c + Q x_k)
- We need to choose a step size α_k, so we consider φ(α) = f(x_k - α (c + Q x_k))
Steepest descent example
- Note that we can find the optimal α analytically, which automatically satisfies the Wolfe conditions:
  φ(α) = f(x_k - α (c + Q x_k)) = c^T (x_k - α (c + Q x_k)) + (1/2) (x_k - α (c + Q x_k))^T Q (x_k - α (c + Q x_k))
- Since φ(α) is a strictly convex quadratic function in α, it is not hard to see that its minimizer occurs where c^T d_k + x_k^T Q d_k + α d_k^T Q d_k = 0
- Thus, with d_k = -(c + Q x_k), we set α_k = (d_k^T d_k) / (d_k^T Q d_k)
Steepest descent example
- The recursion for the steepest descent method is therefore
  x_{k+1} = x_k + (d_k^T d_k / d_k^T Q d_k) d_k,  with d_k = -(c + Q x_k)
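The following is a minimal sketch of this recursion in Python; the particular Q, c, starting point, and tolerance are illustrative choices, not taken from the lecture.

```python
# Steepest descent with the exact step size alpha_k = (d_k^T d_k)/(d_k^T Q d_k),
# applied to f(x) = c^T x + (1/2) x^T Q x. Q, c, and x0 below are illustrative.
import numpy as np

def steepest_descent_quadratic(Q, c, x0, tol=1e-8, max_iter=10000):
    x = x0.astype(float)
    for _ in range(max_iter):
        d = -(c + Q @ x)                # d_k = -grad f(x_k)
        if np.linalg.norm(d) < tol:     # stationary point reached
            break
        alpha = (d @ d) / (d @ Q @ d)   # exact minimizer of phi(alpha)
        x = x + alpha * d               # x_{k+1} = x_k + alpha_k d_k
    return x

Q = np.array([[4.0, 1.0], [1.0, 3.0]])  # symmetric positive definite
c = np.array([-1.0, -2.0])
x_star = steepest_descent_quadratic(Q, c, np.zeros(2))
print(x_star, np.linalg.solve(Q, -c))   # both satisfy Q x = -c
```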
Convergence of steepest descent
Theorem. Let f(x) be a given continuously differentiable function. Let x_0 ∈ R^n be a point for which the sub-level set X_0 = {x ∈ R^n : f(x) ≤ f(x_0)} is bounded. Let {x_k} be a sequence of points generated by the steepest descent method initiated at x_0, using either the Wolfe or Goldstein line search conditions. Then {x_k} converges to a stationary point of f(x).
- The above theorem gives what is called the global convergence property of the steepest descent method
- No matter how far away x_0 is, the steepest descent method must converge to a stationary point
- The steepest descent method may, however, be very slow to reach that point
Newton's method
- Minimizing a function f(x) can be thought of as finding a solution to the nonlinear system of equations ∇f(x) = 0
- Suppose we begin at a point x_0 that is thought to be close to a minimizer x*
- We may consider the problem of finding a solution to ∇f(x) = 0 that is close to x_0 (we're assuming that there aren't any maximizers that are closer to x_0)
- Newton's method is a general method for solving a system of equations g(x) = 0 (to minimize or maximize, set g(x) := ∇f(x))
Univariate Newton's method
Newton's method is an iterative method that follows the following scheme:
1. At a given iterate x_k, make a linear approximation L(x) to g(x) at x_k by differentiating g(x)
2. Set x_{k+1} to be the solution to the linear system of equations L(x) = 0
- It is not hard to show that, in the univariate case, the iteration is x_{k+1} = x_k - g(x_k)/g'(x_k), which is well-defined provided g'(x_k) exists and is nonzero at each step
- Note that the iteration terminates if g(x_k) = 0
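A minimal sketch of the univariate iteration; the example function (finding the cube root of 2), starting point, and tolerance are illustrative assumptions, not from the lecture.

```python
# Univariate Newton iteration x_{k+1} = x_k - g(x_k)/g'(x_k).
def newton_1d(g, g_prime, x0, tol=1e-12, max_iter=50):
    x = x0
    for _ in range(max_iter):
        gx = g(x)
        if abs(gx) < tol:           # terminate when g(x_k) is (numerically) zero
            break
        dgx = g_prime(x)
        if dgx == 0.0:              # the step is undefined if g'(x_k) = 0
            raise ZeroDivisionError("g'(x_k) = 0; Newton step undefined")
        x = x - gx / dgx
    return x

# Illustrative use: solve x^3 - 2 = 0.
print(newton_1d(lambda x: x**3 - 2.0, lambda x: 3.0 * x**2, x0=1.0))  # ~1.2599
```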
Graphical interpretation
[Figure: graphical interpretation of the Newton iteration]
Conditions for convergence
- Without further conditions imposed, Newton's method is not globally convergent: the function g(x) = x^{1/3} has a root at x = 0, but any non-zero starting point will diverge [figure]
Convergence conditions
Theorem. If g(x) is twice continuously differentiable and x* is a root of g(x) at which g'(x*) ≠ 0, then provided that |x_0 - x*| is sufficiently small, the sequence generated by the Newton iterations x_{k+1} = x_k - g(x_k)/g'(x_k) converges quadratically to x*, with rate constant C = |g''(x*)| / (2 |g'(x*)|)
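As a small numerical illustration of this rate (an assumption-laden example, not part of the lecture), take g(x) = x^2 - 2 with root x* = sqrt(2); the successive errors e_k = |x_k - x*| should roughly satisfy e_{k+1} ≈ C e_k^2 with C = |g''(x*)|/(2|g'(x*)|) = 1/(2 sqrt(2)).

```python
# Numerical check of quadratic convergence for g(x) = x^2 - 2 (illustrative choice).
import math

x, x_star = 3.0, math.sqrt(2.0)
C = 2.0 / (2.0 * 2.0 * x_star)           # g''(x*) = 2, g'(x*) = 2*sqrt(2)
for k in range(5):
    e = abs(x - x_star)
    x = x - (x**2 - 2.0) / (2.0 * x)     # Newton step
    print(k, abs(x - x_star), C * e**2)  # the two columns should track each other
```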
Multiple dimensions
- Consider the problem of solving g(x) = 0
- Define the Jacobian matrix J = ∇g by [J]_{ij} = ∂g_i(x)/∂x_j (the rows of J are just the gradient vectors ∇g_i(x))
- The necessary conditions for convergence are somewhat more detailed and involved, and we will not go into them
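A minimal multivariate sketch, assuming the standard Newton step (solve J(x_k) Δ = -g(x_k) and set x_{k+1} = x_k + Δ); the example system below is an illustrative choice, not from the lecture.

```python
# Newton's method for a system g(x) = 0 in R^n using the Jacobian J.
import numpy as np

def newton_system(g, jac, x0, tol=1e-10, max_iter=50):
    x = x0.astype(float)
    for _ in range(max_iter):
        gx = g(x)
        if np.linalg.norm(gx) < tol:
            break
        x = x + np.linalg.solve(jac(x), -gx)   # x_{k+1} = x_k - J(x_k)^{-1} g(x_k)
    return x

# Illustrative system: a point on the unit circle with equal coordinates.
g = lambda x: np.array([x[0]**2 + x[1]**2 - 1.0, x[0] - x[1]])
jac = lambda x: np.array([[2.0 * x[0], 2.0 * x[1]], [1.0, -1.0]])
print(newton_system(g, jac, np.array([1.0, 0.0])))   # ~ (0.7071, 0.7071)
```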
Computational issues
- As an optimization procedure, Newton's method requires all first and second derivatives of the objective function
- This can be very time-consuming if the objective function is expensive to compute
- For this reason, quasi-Newton approaches are often used, in which the second derivative is approximated
- Trade-off: steepest descent requires more iterations, but each iteration is fast; Newton's method requires fewer iterations, but each one takes longer
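One simple quasi-Newton idea, sketched here in one dimension purely as an illustration (this is a simplification, not the lecture's method): replace f''(x_k) in the Newton step with a secant approximation built from two successive first derivatives. The objective and starting points below are assumptions.

```python
# Secant-style quasi-Newton step in 1D: approximate f''(x_k) by
# (f'(x_k) - f'(x_{k-1})) / (x_k - x_{k-1}) and take a Newton-like step.
def secant_minimize(fprime, x0, x1, tol=1e-10, max_iter=100):
    for _ in range(max_iter):
        g0, g1 = fprime(x0), fprime(x1)
        if abs(g1) < tol:
            break
        h_approx = (g1 - g0) / (x1 - x0)   # approximate second derivative
        x0, x1 = x1, x1 - g1 / h_approx    # quasi-Newton step
    return x1

# Illustrative use: minimize f(x) = x^4 - 3x^2 + x, so f'(x) = 4x^3 - 6x + 1.
print(secant_minimize(lambda x: 4.0 * x**3 - 6.0 * x + 1.0, x0=1.0, x1=1.5))
```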
Constrained optimization
In a constrained optimization problem
  minimize  f(x)
  s.t.      x ∈ F
we have to worry about feasibility as well as optimality:
- A descent direction must also be feasible; for example:
  - Gradient projection: project the gradient vector onto F and move in that direction (more on this later)
  - The ellipsoid method: a completely different approach, developed in the 1960s and 1970s in the Soviet Union
- The idea is to enclose the region of interest in a sequence of ellipsoids of decreasing size
- This is similar to the bisection method in 2 dimensions
Ellipsoid method
- The ellipsoid method is best introduced by considering the problem of finding an element of a solution set X given by a system of linear inequalities: X = {x ∈ R^n : a_i^T x ≤ b_i, i = 1, ..., m}
- Instead of restricting ourselves to linear inequalities, we can allow convex inequalities, i.e. g_i(x) ≤ 0 with g_i(x) convex for all i, although this makes the exposition more challenging
- As we saw in an earlier problem set, solving a linear program is equivalent to finding a feasible solution to a set of linear inequalities
- We make two technical assumptions to start:
  1. X is contained in a ball centered at the origin with radius R > 0
  2. The volume of X is at least ɛ^n vol(B_0), where B_0 is the unit ball in R^n
Ellipsoid representation
- An ellipsoid is just a set of the form E_k = {x ∈ R^n : (x - x_k)^T B_k^{-1} (x - x_k) ≤ 1}, where
  - x_k is the center of the ellipsoid
  - B_k is a symmetric positive definite matrix of dimension n
- In two dimensions this is just
  E_k = {(x, y) ∈ R^2 : [x - x_0, y - y_0]^T [a, b/2; b/2, c] [x - x_0, y - y_0] ≤ 1}
  i.e. a (x - x_0)^2 + b (x - x_0)(y - y_0) + c (y - y_0)^2 ≤ 1
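A small membership check based on this representation; the matrix B_k and test points below are illustrative assumptions.

```python
# A point x lies in E_k exactly when (x - x_k)^T B_k^{-1} (x - x_k) <= 1.
import numpy as np

def in_ellipsoid(x, x_k, B_k):
    d = x - x_k
    return d @ np.linalg.solve(B_k, d) <= 1.0   # avoids forming B_k^{-1} explicitly

B_k = np.array([[4.0, 1.0], [1.0, 2.0]])        # symmetric positive definite
x_k = np.zeros(2)
print(in_ellipsoid(np.array([1.0, 0.5]), x_k, B_k))   # True  (quadratic form = 2/7)
print(in_ellipsoid(np.array([3.0, 0.0]), x_k, B_k))   # False (quadratic form = 18/7)
```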
Volume of ellipsoid and cutting plane
- It is not hard to show that vol(E_k) = sqrt(det B_k) · vol(B_0)
- At the kth iteration, we know that X ⊆ E_k; we're going to shrink these ellipsoids at each iteration by a constant factor until a feasible point is found
- We check whether the center point x_k is in X:
  - If x_k ∈ X, then we're done
  - If not, then at least one constraint is violated, say a_j^T x_k > b_j
  - In that case, X lies in the half-ellipsoid E_k^half := {x ∈ E_k : a_j^T x ≤ a_j^T x_k}
Illustration
[Figure: successive ellipsoids shrinking around the feasible region]
Constructing a new ellipsoid
- At a given iteration k, with x_k and E_k, we construct E_{k+1} as follows: define τ = 1/(n + 1); δ = n^2/(n^2 - 1); σ = 2τ
- We set
  x_{k+1} = x_k - τ B_k a_j / sqrt(a_j^T B_k a_j)
  B_{k+1} = δ (B_k - σ B_k a_j a_j^T B_k / (a_j^T B_k a_j))
- It is rather cumbersome, but it turns out that B_{k+1} defines the minimum-volume ellipsoid that contains E_k^half := {x ∈ E_k : a_j^T x ≤ a_j^T x_k}
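Putting the pieces together, here is a minimal sketch of the full iteration for the feasibility problem a_i^T x ≤ b_i. The starting radius R, the iteration cap, and the small test system are illustrative assumptions, not part of the lecture.

```python
# Ellipsoid method for finding a point with A x <= b, using the updates above:
# tau = 1/(n+1), delta = n^2/(n^2-1), sigma = 2*tau.
import numpy as np

def ellipsoid_feasibility(A, b, R=10.0, max_iter=10000):
    n = A.shape[1]
    x = np.zeros(n)                  # center of E_0
    B = (R ** 2) * np.eye(n)         # E_0 = ball of radius R about the origin
    tau, sigma = 1.0 / (n + 1), 2.0 / (n + 1)
    delta = n**2 / (n**2 - 1.0)
    for _ in range(max_iter):
        violated = np.nonzero(A @ x > b)[0]
        if violated.size == 0:
            return x                 # x_k is feasible
        a = A[violated[0]]           # a violated constraint: a_j^T x_k > b_j
        Ba = B @ a
        x = x - tau * Ba / np.sqrt(a @ Ba)                     # x_{k+1}
        B = delta * (B - sigma * np.outer(Ba, Ba) / (a @ Ba))  # B_{k+1}
    return None

# Illustrative system: x >= 1, y >= 1, x + y <= 3 (the origin is infeasible).
A = np.array([[-1.0, 0.0], [0.0, -1.0], [1.0, 1.0]])
b = np.array([-1.0, -1.0, 3.0])
print(ellipsoid_feasibility(A, b))
```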
Convergence
Theorem. The ellipsoid E_{k+1} defined on the preceding slide is the minimum-volume ellipsoid that contains E_k^half := {x ∈ E_k : a_j^T x ≤ a_j^T x_k}. Moreover,
  vol(E_{k+1}) / vol(E_k) = (n^2 / (n^2 - 1))^{(n-1)/2} · (n / (n + 1)) < exp(-1 / (2(n + 1))) < 1
- This establishes that the volume of the ellipsoid decreases by at least a constant factor at each iteration
- It can be shown that the ellipsoid method solves linear programs in O(n^2 log(R/ɛ)) iterations
Comments
- The ellipsoid method can solve any convex problem as well, as long as we can generate a halfspace, bounded by a hyperplane through the center of E_k, that must contain X
- In practice, the ellipsoid method is usually slower than the simplex method; it exists primarily as a pedagogical tool for proving the complexity of solving linear (or convex) programs