A Primer on Multidimensional Optimization


A Primer on Multidimensional Optimization
Prof. Dr. Florian Rupp
German University of Technology in Oman (GUtech)
Introduction to Numerical Methods for ENG & CS (Mathematics IV)
Spring Term 2016

Exercise Session

Reviewing the highlights from last time (1/2)

Page 123, exercise 1 (reformulated): Find where the graphs of $y = 3x$ and $y = e^x$ intersect by finding solutions of $e^x - 3x = 0$ correct to four decimal digits with the secant method.

Page 149, exercise 4: Application of the secant method to $f(x) = 2 - e^x$ with $x_0 = 0$ and $x_1 = 1$ leads to the following sequence of iterates:

$x_{n+1} = x_n + (2 - e^{x_n})(x_n - x_{n-1})(e^{x_n} - e^{x_{n-1}})^{-1}.$

What is $\lim_{n \to \infty} x_n$?

Reviewing the highlights from last time (2/2)

Page 151, computer exercise 12: Test numerically whether Olver's method, given by the update formula

$x_{n+1} = x_n - \frac{f(x_n)}{f'(x_n)} - \frac{1}{2}\,\frac{f''(x_n)}{f'(x_n)} \left( \frac{f(x_n)}{f'(x_n)} \right)^2,$

is cubically convergent to a root of f. Try to establish that it is.

Introduction & Today's Scope

Disclaimer: This will be a rather theoretical and not very interactive lecture. It discusses root finding for the gradient $\nabla J$ of functions $J : \mathbb{R}^n \to \mathbb{R}$, $n \ge 1$. The purpose of this lecture is to give you a high-level outlook on optimization methods that are based on linear and quadratic approximations. Thus, it is a review of what we discussed for root finding, in a slightly different context.

About optimization and calculus

An important application of calculus is the problem of finding the local minima and maxima of a function. Problems of maximization are covered by the theory of minimization, because the maxima of F are the minima of $-F$.

In calculus, the principal technique for minimization is to differentiate the function whose minimum is sought, set the derivative equal to zero, and locate the points that simultaneously satisfy the resulting equations, like

$\frac{\partial F(x_1, x_2, x_3)}{\partial x_1} = \frac{\partial F(x_1, x_2, x_3)}{\partial x_2} = \frac{\partial F(x_1, x_2, x_3)}{\partial x_3} = 0.$

This procedure cannot readily be accepted as a general-purpose numerical method, as it requires differentiation followed by the solution of one or many equations in one or many variables using the methods from last time. This task may be as difficult to carry out as a direct frontal attack on the original problem.

Unconstrained & constrained minimization problems (1/2)

The minimization problem has two forms: the unconstrained and the constrained.

In an unconstrained minimization problem, a function $F : \mathbb{R}^n \to \mathbb{R}$ is given and a point $z \in \mathbb{R}^n$ is sought with the property $F(z) \le F(x)$ for all $x \in \mathbb{R}^n$.

In a constrained minimization problem, a subset $K \subset \mathbb{R}^n$ is prescribed, and a point $z \in K$ is sought with the property $F(z) \le F(x)$ for all $x \in K$. Such problems are more difficult because of the need to keep the points within the set K, which can be defined in a complicated way (you may have seen such problems already; think of the Lagrange method!).

Unconstrained & constrained minimization problems (2/2)

Example: Consider the elliptic paraboloid

$F(x_1, x_2) = x_1^2 + x_2^2 - 2x_1 - 2x_2 + 4 = (x_1 - 1)^2 + (x_2 - 1)^2 + 2.$

The unconstrained minimum occurs at F(1,1) = 2, whereas if $K = \{(x_1, x_2) : x_1, x_2 \le 0\}$, the constrained minimum is at F(0,0) = 4.

[Figure: surface plot of the elliptic paraboloid.]

Today, we will focus on multidimensional unconstrained optimization

Today's topics:
- The gradient/steepest descent method and its step-size conditions
- Newton's method revisited and the Quasi-Newton method
- The Trust-Region method
- Penalty & barrier functions (in 1D)
- Simulated Annealing

Corresponding textbook chapters: 13.1 and 13.2

The Gradient/ Steepest Descent Method

The negative gradient vector points in the direction of the steepest descent

From calculus we know that the negative gradient vector $-\nabla J(x)$ of a function J points in the direction of the steepest descent.

A first gedanken experiment...

Let us try to find the minimum of the peaks function $J : \mathbb{R}^2 \to \mathbb{R}$ and assume therefore a small ball at the position $(x, J(x))$ on the graph of J that moves with speed $\|\nabla J(x)\|_2$ in direction $-\nabla J(x)$.

[Figure: a ball rolling downhill on the graph of J, from $x_k$ to $x_{k+1}$.]

... immediately leads to four core problems in optimization

We see that we have found a minimum as soon as our ball is in a kind of basin and does not move anymore ($\nabla J = 0$). Though, this intuition is not so easy to translate into mathematics:

1. What does it mean to be in a basin, and is $\nabla J = 0$ the right mathematical property (after all, saddle points have this property, too)?
2. Can it happen that the ball has so much energy that it simply runs through the basin without stopping, or so little energy that it approaches the basin too slowly?
3. Which direction must the ball take to eventually reach a basin?
4. How long does it take the ball to reach a minimum? (Speed of convergence)

Solving problem 1: strictly convex functions (1/2)

Problem 1 (What does it mean to be in a basin, and is $\nabla J = 0$ the right property?) can be solved in two ways:

First, by introducing the notion of curvature and establishing that the curvature is positive at a minimum of $J : \mathbb{R}^n \to \mathbb{R}$. This is the usual way in calculus, where you go for a second-order Taylor expansion around $x^*$,

$J(x^* + h) = J(x^*) + h^T \nabla J(x^*) + \tfrac{1}{2}\, h^T H_J(x^*)\, h + \mathcal{O}(\|h\|^3),$

and study whether the Hessian matrix $H_J(x) = (\partial_{x_i} \partial_{x_j} J(x_1, \dots, x_n))_{i,j = 1, \dots, n}$ is positive definite at $x = x^*$, i.e., whether all eigenvalues of $H_J(x^*)$ are strictly positive.

Second, by encoding such information into the function you want to minimize, e.g. by only allowing (strictly) convex objective functions.

Solving problem 1: strictly convex functions (2/2)

Definition (Strictly Convex Function). Let $F \subseteq \mathbb{R}^n$ be a convex set. A function $J : F \to \mathbb{R}$ is called convex if for all $x, y \in F$ with $x \ne y$ and $\lambda \in (0,1)$ it holds that

$J(\lambda x + (1 - \lambda) y) \le \lambda J(x) + (1 - \lambda) J(y).$

The function $J : F \to \mathbb{R}$ is called strictly convex if strict inequality holds, i.e.,

$J(\lambda x + (1 - \lambda) y) < \lambda J(x) + (1 - \lambda) J(y).$

Strictly convex functions have at most one minimum, which makes them perfect candidates for optimization purposes.

Remark: Flipping the inequality sign gives the definition of a (strictly) concave function.
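
Convexity along segments can also be probed numerically. The following is a minimal sketch, not a proof technique: it merely samples the defining inequality on random segments, and the test function, sampling box, and tolerance are illustrative assumptions.

```python
import numpy as np

def looks_convex(J, dim, n_trials=1000, seed=0):
    """Heuristically test J(lam*x + (1-lam)*y) <= lam*J(x) + (1-lam)*J(y)
    on random segments. A failure disproves convexity; passing all
    trials is only evidence, not a proof."""
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        x, y = rng.uniform(-5, 5, dim), rng.uniform(-5, 5, dim)
        lam = rng.uniform(0, 1)
        if J(lam * x + (1 - lam) * y) > lam * J(x) + (1 - lam) * J(y) + 1e-12:
            return False
    return True

# Example: the elliptic paraboloid from above is convex.
J = lambda x: (x[0] - 1)**2 + (x[1] - 1)**2 + 2
print(looks_convex(J, dim=2))  # True
```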

Examples of convex and concave functions

[Figure: illustrations of convex, concave, linear, and non-convex functions.]

For a convex function J over an interval [a,b], the graph lies below every chord: any point $c \in [a,b]$ satisfies $J(c) \le$ the value at c of the line segment joining $(a, J(a))$ and $(b, J(b))$.

The key idea of the gradient method (1/2)

Next to problem 3 (Which direction must the ball take to eventually reach a basin?): If the ball always continues to roll in that direction $s \in \mathbb{R}^n$, $\|s\|_2 = 1$, for which $J(x+s)$ is less than $J(x)$ at the current point x, i.e., at best

$J(x) > J(x+s)$ with $|J(x+s) - J(x)|$ maximal,

then it will inevitably reach a minimum (if one exists). Applying a first-order Taylor approximation $J(x+s) \approx J(x) + s^T \nabla J(x)$ leads, provided it exists, to a direction $s^* \in \mathbb{R}^n$ of steepest descent (of norm one) via the relation

$s^* = \arg\max_{s \in \mathbb{R}^n : \|s\|_2 = 1} |s^T \nabla J(x)| = \arg\max_{s \in \mathbb{R}^n : \|s\|_2 = 1} |J(x+s) - J(x)|.$

The key idea of the gradient method (2/2)

The scalar product $s^T \nabla J(x)$ attains its maximum modulus if s and $\nabla J(x)$ are collinear, which already gives us

$s^* := -\frac{\nabla J(x)}{\|\nabla J(x)\|_2},$

because we require $J(x+s) - J(x) \approx s^T \nabla J(x) < 0$, and, as we have already seen in the discussion of the gradient, a direction of the kind $+\nabla J(x)$ would lead us to an ascent. This motivates the following algorithm (attributed to Cauchy, 1847).

The algorithm of the gradient method

Algorithm (Gradient Method)
Starting point: Choose $x_0 \in \mathbb{R}^n$ and compute $s_0 := -\nabla J(x_0)$. Set k = 0.
Iteration: If $s_k = 0$: STOP, the optimal solution is $x_k$. Else, choose $\alpha_k \in (0, \infty)$ such that $J(x_k) > J(x_k + \alpha_k s_k)$. (The parameter $\alpha_k$ determines how far we want to descend in the direction $s_k$.)
Update data: Set $x_{k+1} := x_k + \alpha_k s_k$ and compute $s_{k+1} := -\nabla J(x_{k+1})$. Set k := k + 1 and continue with the iteration step.
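
A minimal Python sketch of the algorithm above. The tolerance-based stopping test (instead of the exact test $s_k = 0$) and the naive step-halving rule for choosing $\alpha_k$ are simplifying assumptions; proper step-size rules follow below.

```python
import numpy as np

def gradient_method(J, grad_J, x0, tol=1e-8, max_iter=10_000):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        s = -grad_J(x)                   # direction of steepest descent
        if np.linalg.norm(s) < tol:      # practical version of "s_k = 0"
            break
        alpha = 1.0
        while J(x + alpha * s) >= J(x):  # enforce J(x_k) > J(x_k + a_k s_k)
            alpha *= 0.5                 # naive halving; see the
            if alpha < 1e-16:            # Armijo-Goldstein rule below
                return x
        x = x + alpha * s
    return x

# Example: the elliptic paraboloid from above, minimum at (1, 1).
J = lambda x: (x[0] - 1)**2 + (x[1] - 1)**2 + 2
grad_J = lambda x: np.array([2 * (x[0] - 1), 2 * (x[1] - 1)])
print(gradient_method(J, grad_J, [-1.9, 0.5]))  # approx. [1. 1.]
```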

A geometric view on the gradient...

The gradient at $x_k$ is orthogonal to the tangent plane of the graph at $x_k$, and the projection of the gradient onto the argument space is orthogonal to the level set of the function at $x_k$. This leads to the gradient method's typical zig-zag pattern of the iterates in the argument space.

... explains the typical zig-zag pattern of the gradient method

Application of the gradient method to the quadratic function

$J(x) = x^T \begin{pmatrix} 21 & 7 \\ 7 & 280 \end{pmatrix} x$

with starting point $x_0 = (-1.9, 0.5)^T$:

[Figure: zig-zag path of the iterates across the level sets.]

The gradient method can be generalized to a descent method

In general a sufficiently good descent direction would be enough to eventually reach a minimum, e.g. a direction $s_k \in \mathbb{R}^n \setminus \{0\}$ such that

$J(x_k) > J(x_k + \alpha_k s_k), \quad k \in \mathbb{N}_0.$

This generalizes the gradient method to a descent or gradient-like method. For the direction of descent in such methods, the following uniform angle condition is demanded in order to ensure a sufficiently large descent:

$\exists\, \nu$ such that $0 < \nu \le \nu_k := \cos \angle(-\nabla J(x_k), s_k) = \frac{-s_k^T \nabla J(x_k)}{\|s_k\|_2 \, \|\nabla J(x_k)\|_2}.$

The global postulate $\nu \le \nu_k$ for all k ensures during the descent that the angle between $s_k$ and $-\nabla J(x_k)$ stays uniformly less than 90 degrees, and thus that at each step $s_k$ is a sufficiently good direction of descent.

Visualization of the uniform angle condition

[Figure: the direction of descent $s_k$ must lie in a cone around $-\nabla J(x_k)$ whose opening angle is determined by $\nu$.]

Step-Size Conditions for the Gradient/ Steepest Descent Method

How do we control the (step-)size of the direction of descent?

Problem 2 (Can it happen that the ball has so much energy that it simply runs through the basin without stopping, or so little energy that it approaches the basin too slowly?) leads to the issue of controlling the step-size, and thus the parameter $\alpha$ in the gradient or gradient-like methods.

One way is to obtain an optimal $\alpha := \alpha_{\mathrm{opt}}$ as the solution of the minimization problem

$J(x + \alpha_{\mathrm{opt}} s) = \min_{\alpha > 0} J(x + \alpha s).$

This is an effective way, but may not be efficient, as the minimization may not terminate in finitely many steps, except when we are dealing with quadratic functions.

The Armijo-Goldstein step-size condition

The Armijo-Goldstein Condition: For the whole gradient or gradient-like method, let a number $\sigma \in (0,1)$ and a strictly decreasing sequence $\{\beta_l\}_{l \in \mathbb{N}_0}$ be given such that $\beta_l \in (0,1)$, $l = 0, 1, 2, \dots$ ($\beta_l > \beta_{l+1}$). For $x \in \mathbb{R}^n$ and $s \in \mathbb{R}^n \setminus \{0\}$, $\|s\|_2 = 1$, the number $\alpha := \max\{\beta_l : l \in \mathbb{N}_0\}$ has to be determined such that

$\varphi(\alpha) := J(x + \alpha s) \le J(x) + \sigma \alpha \nabla J(x)^T s = \varphi(0) + \sigma \alpha \varphi'(0).$

Thus, to determine the Armijo-Goldstein step-size $\alpha$ we test this inequality successively for $\beta_0, \beta_1, \beta_2, \dots$, until we reach an index l where it holds for the first time.

Illustration of the Armijo-Goldstein step-size condition

[Figure: graph of $\varphi(\alpha)$ together with the Armijo-Goldstein line $\varphi(0) + \sigma \alpha \varphi'(0)$; the admissible set AG lies where the graph is below the line.]

We consider the argument set of $\varphi$, and there only those values (AG) for which the graph $\Gamma_\varphi$ lies under the Armijo-Goldstein line through $(0, \varphi(0))$ with slope $\sigma \varphi'(0)$. The Armijo-Goldstein step-size $\alpha_{AG}$ is the largest of our predefined numbers $\beta_l$, $l = 0, 1, 2, \dots$, therein. Thus, we do not step too far.
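
A sketch of the resulting backtracking search, assuming the common concrete choice $\beta_l = \beta^l$ for the strictly decreasing sequence; the values of $\sigma$ and $\beta$ are illustrative.

```python
def armijo_goldstein_step(J, grad_J, x, s, sigma=1e-4, beta=0.5, l_max=60):
    """Return the largest alpha = beta**l, l = 0, 1, 2, ..., with
    J(x + alpha*s) <= J(x) + sigma * alpha * grad_J(x)^T s."""
    phi0 = J(x)
    slope = grad_J(x) @ s          # phi'(0) = grad J(x)^T s, must be < 0
    alpha = 1.0                    # beta**0
    for _ in range(l_max):
        if J(x + alpha * s) <= phi0 + sigma * alpha * slope:
            return alpha
        alpha *= beta              # try the next, smaller beta_l
    return alpha                   # fallback: smallest tested step
```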

The Wolfe-Powell step-size condition...

Wolfe-Powell Condition: For the whole gradient or gradient-like method, let numbers $\sigma \in (0, \tfrac{1}{2})$ and $\rho \in (\sigma, 1)$ be given. For $x \in \mathbb{R}^n$ and $s \in \mathbb{R}^n \setminus \{0\}$, $\|s\|_2 = 1$, such that $s^T \nabla J(x) < 0$, determine a number $\alpha > 0$ such that both the Armijo-Goldstein inequality

$\varphi(\alpha) := J(x + \alpha s) \le J(x) + \sigma \alpha \nabla J(x)^T s = \varphi(0) + \sigma \alpha \varphi'(0)$

and the curvature condition

$\varphi'(\alpha) = \nabla J(x + \alpha s)^T s \ge \rho \nabla J(x)^T s = \rho \varphi'(0)$

hold. The restriction $\sigma < \tfrac{1}{2}$ is motivated by the wish to accept the exact minimum of a quadratic function $\varphi$ as a Wolfe-Powell step-size.

.. and its interpretation

We choose the Wolfe-Powell step-size from the domain $WP \subset \mathbb{R}$ for which, on the one hand, the graph of $\varphi$ lies under the Armijo-Goldstein line (this prohibits going too far in direction s), and, on the other hand, the graph of $\varphi$ on WP is not decreasing as steeply as it does in a neighborhood of $\alpha = 0$, due to the damping factor $\rho$ in the second Wolfe-Powell inequality (this ensures sufficiently large progress).

[Figure: graphs of $\varphi(\alpha)$ and $\varphi'(\alpha)$ with the Armijo-Goldstein line and the admissible set WP.]
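
For illustration, a small checker for both Wolfe-Powell inequalities; the parameter values for $\sigma$ and $\rho$ are typical textbook choices, not prescribed by the slides. In practice one searches (e.g. by interval bisection) for an $\alpha$ that passes this test.

```python
def satisfies_wolfe_powell(J, grad_J, x, s, alpha, sigma=1e-4, rho=0.9):
    """Check both Wolfe-Powell inequalities for a trial step-size alpha:
    sufficient decrease (the Armijo-Goldstein line) and the curvature
    condition phi'(alpha) >= rho * phi'(0)."""
    phi0, slope0 = J(x), grad_J(x) @ s       # phi(0) and phi'(0) < 0
    armijo = J(x + alpha * s) <= phi0 + sigma * alpha * slope0
    curvature = grad_J(x + alpha * s) @ s >= rho * slope0
    return armijo and curvature
```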

A word of caution

The gradient and gradient-like methods can take a long time to actually reach the minimum if the function is rather flat, as in the non-convex Rosenbrock function with its banana-shaped valley: the global minimum of the Rosenbrock function

$f(x, y) = (1 - x)^2 + 100 (y - x^2)^2$

is at the point (1,1).

Newton s Method Revisited and the Quasi-Newton Method

Newton's method revisited

As we have seen, the gradient method uses a linear approximation of the function J and reaches the vicinity of a stationary point $x^*$ with $\nabla J(x^*) = 0$. This method can be improved by considering the quadratic approximation of J at a point $x_0$,

$q(x) := J(x_0) + (x - x_0)^T \nabla J(x_0) + \tfrac{1}{2} (x - x_0)^T H_J(x_0) (x - x_0).$

Since a point is a minimum if and only if the gradient vanishes and the Hessian matrix of the function under discussion is positive definite there, we set

$0 = \nabla q(x) = \nabla J(x_0) + H_J(x_0)(x - x_0),$

and thus the Newton identity

$x = x_0 - (H_J(x_0))^{-1} \nabla J(x_0) = x_0 + s$

follows, where s is the solution of the linear system $H_J(x_0)\, s = -\nabla J(x_0)$.

The local Newton's method for unconstrained minimization

Algorithm (Local Newton's Method)
Starting point: Choose $x_0 \in \mathbb{R}^n$ and compute $s_0 := -\nabla J(x_0)$. Set k = 0.
Iteration: If $s_k = 0$: STOP, the optimal solution is $x_k$. Else, determine $s_{k+1} \in \mathbb{R}^n$ as the solution of the linear Newton system

$H_J(x_k)\, s_{k+1} = s_k.$

Update data: Set $x_{k+1} := x_k + s_{k+1}$ and compute $s_{k+1} := -\nabla J(x_{k+1})$. Set k := k + 1 and continue with the iteration step.
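
A minimal sketch of the local Newton's method, using hand-derived Rosenbrock derivatives as a test case (the derivative formulas are computed here, not taken from the slides); note how close to the minimum the starting point must be.

```python
import numpy as np

def local_newton(grad_J, hess_J, x0, tol=1e-10, max_iter=100):
    """Local Newton's method: solve H_J(x_k) s = -grad J(x_k) and step
    to x_{k+1} = x_k + s. Fast, but only from good starting points."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad_J(x)
        if np.linalg.norm(g) < tol:
            break
        s = np.linalg.solve(hess_J(x), -g)   # linear Newton system
        x = x + s
    return x

# Test case: Rosenbrock function, global minimum at (1, 1).
f_grad = lambda x: np.array([-2*(1 - x[0]) - 400*x[0]*(x[1] - x[0]**2),
                             200*(x[1] - x[0]**2)])
f_hess = lambda x: np.array([[2 - 400*(x[1] - 3*x[0]**2), -400*x[0]],
                             [-400*x[0], 200.0]])
print(local_newton(f_grad, f_hess, [1.2, 1.2]))  # approx. [1. 1.]
```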

Application of the local Newton's method to the Rosenbrock function

[Figure: Newton iterates on the Rosenbrock function.]

Some remarks on the local Newton's method

For a quadratic function, the local Newton's method terminates in one step.

The local Newton's method requires a positive definite Hessian matrix (which can only be ensured close to the actual minimum $x^*$) and, as we already know from last time, starting points very close to the actual minimum $x^*$. Combining the local Newton's method with the gradient method, for instance, leads to a globalization of the local Newton's method.

[Figure: quadratic best approximation of the exponential function at the origin; the local character of this approximation is clearly visible.]

The key ideas of the Newton-Gradient method

We combine the gradient method and the local Newton's method such that, in case the Newton iteration cannot be performed (i.e., if $H_J(x)$ is not positive definite), a sequence of descending points is generated via the gradient method until the Newton iteration can be performed again. Of course, the linear Newton system has a solution for all regular Hessian matrices; thus it has to be checked whether we really obtain a direction of descent.

In particular, the Newton method's direction (as well as the gradient method's direction) is taken as a new direction of descent, the step-size of which is determined by the Armijo-Goldstein condition.

The Newton-Gradient method (1/2)

Algorithm (Newton-Gradient Method)
Starting point: Choose $x_0 \in \mathbb{R}^n$, $\rho > 0$, $p > 2$, and compute $s_0 := -\nabla J(x_0)$. Set k = 0.
Iteration: If $s_k = 0$: STOP, the optimal solution is $x_k$. Else, determine $s_{k+1} \in \mathbb{R}^n$ as the solution of the linear Newton system

$H_J(x_k)\, s_{k+1} = s_k.$

If this system has no solution, or if the condition for a suitably good direction of descent

$s_{k+1}^T \nabla J(x_k) \le -\rho\, \|s_{k+1}\|^p$

is violated, then set $s_{k+1} := -\nabla J(x_k)$.

The Newton-Gradient method (2/2)

Algorithm (Newton-Gradient Method) [cont.]
New step-size: Determine the new step-size $\alpha_{k+1}$ with the Armijo-Goldstein method.
Update data: Set $x_{k+1} := x_k + \alpha_{k+1} s_{k+1}$ and compute $s_{k+1} := -\nabla J(x_{k+1})$. Set k := k + 1 and continue with the iteration step.

Remark: It would be sufficient to use difference quotients of the gradients to approximate the Hessian matrix.
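
The closing remark can be made concrete: a sketch of a Hessian approximation by forward difference quotients of the gradient, where the step width h is an illustrative choice.

```python
import numpy as np

def hessian_fd(grad_J, x, h=1e-6):
    """Approximate the Hessian column by column via forward difference
    quotients of the gradient, then symmetrize the result."""
    x = np.asarray(x, dtype=float)
    n = x.size
    g0 = grad_J(x)
    H = np.empty((n, n))
    for i in range(n):
        e = np.zeros(n)
        e[i] = h
        H[:, i] = (grad_J(x + e) - g0) / h   # i-th difference quotient
    return 0.5 * (H + H.T)                   # enforce symmetry
```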

The key idea of Quasi-Newton methods (1/3)

Instead of going for the exact inverse Hessian matrix, the so-called Quasi-Newton methods use an approximation of it, and thus avoid, in each step, the computationally expensive exact set-up of the Hessian matrix and the solution of a linear system to get the direction of descent.

Due to Taylor's theorem we have

$H_J(x_k)(x_{k+1} - x_k) \approx \nabla J(x_{k+1}) - \nabla J(x_k).$

Thus, any matrix $A_{k+1}$ that satisfies the Quasi-Newton condition

$A_{k+1}(x_{k+1} - x_k) = \nabla J(x_{k+1}) - \nabla J(x_k)$

can be considered an approximation of the Hessian matrix $H_J(x_k)$.

The key idea of Quasi-Newton methods (2/3)

With the Quasi-Newton condition $A_{k+1}(x_{k+1} - x_k) = \nabla J(x_{k+1}) - \nabla J(x_k)$ we have that any matrix $B_{k+1}$ that satisfies the inverse Quasi-Newton condition

$B_{k+1} \underbrace{(\nabla J(x_{k+1}) - \nabla J(x_k))}_{=:\, g_k} = \underbrace{x_{k+1} - x_k}_{=:\, y_k}$

can be considered an approximation of the inverse of the Hessian matrix.

In the Quasi-Newton algorithm, this approximation $B_{k+1}$ is updated in each step with the help of specific update formulas (no exact computation of the Hessian or the inverse Hessian is performed!), such that directions of descent are gained just by updating.

The key idea of Quasi-Newton methods (3/3)

Note that the direction of steepest descent depends on the norm: just consider a convex function with a unique minimum and then overlay it with the unit cells of different norms. In each norm, another steepest-descent path will guide you to the minimum.

Using Newton methods, it suggests itself to choose a norm that is related to the Hessian H or its inverse, for instance

$\|x\|_H := \|H^{1/2} x\|_2 = \sqrt{\langle H^{1/2} x,\, H^{1/2} x \rangle},$

where $\|\cdot\|_2$ denotes the Euclidean norm in $\mathbb{R}^n$ and $x \in \mathbb{R}^n$. This allows us to view the Newton direction as a direction of steepest descent with respect to the norm $\|\cdot\|_H$.

The BFGS method

One of the most used update schemes was found around 1970, nearly simultaneously, by Broyden, Fletcher, Goldfarb and Shanno. Their BFGS update formula reads

$B_{k+1} = B_k + \frac{(y_k - B_k g_k) y_k^T + y_k (y_k - B_k g_k)^T}{g_k^T y_k} - \frac{(y_k - B_k g_k)^T g_k}{(g_k^T y_k)^2}\, y_k y_k^T.$

In particular, let $y_k, g_k \in \mathbb{R}^n$ be such that $y_k^T g_k > 0$, and let $B_k \in \mathbb{R}^{n \times n}$ be symmetric and positive definite. Then one can show that the BFGS update matrix $B_{k+1} \in \mathbb{R}^{n \times n}$ is again symmetric and positive definite. The same holds throughout the following globalized BFGS minimization algorithm.
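
The update formula translates directly into code; a sketch, using the slide's notation $y_k = x_{k+1} - x_k$ and $g_k = \nabla J(x_{k+1}) - \nabla J(x_k)$.

```python
import numpy as np

def bfgs_update(B, y, g):
    """One BFGS update of the inverse-Hessian approximation B, following
    the formula above. Requires the curvature condition g^T y > 0."""
    gy = g @ y
    r = y - B @ g                        # r = y_k - B_k g_k
    return (B + (np.outer(r, y) + np.outer(y, r)) / gy
              - (r @ g) * np.outer(y, y) / gy**2)
```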

The globalized BFGS method (1/2)

Algorithm (Globalized BFGS Method)
Starting point: Choose $x_0 \in \mathbb{R}^n$ and $B_0 \in \mathbb{R}^{n \times n}$ symmetric and positive definite. Set k = 0.
Iteration (new Quasi-Newton direction): If $\nabla J(x_k) = 0$: STOP, the optimal solution is $x_k$. Else, compute $s_k = -B_k \nabla J(x_k)$.
New step-size: Determine the new step-size $\alpha_k$ with the Wolfe-Powell method.
Update data: Set $x_{k+1} := x_k + \alpha_k s_k$, and compute $y_k := x_{k+1} - x_k$ and $g_k := \nabla J(x_{k+1}) - \nabla J(x_k)$. Compute $B_{k+1}$ via the BFGS update formula. Set k := k + 1 and continue with the iteration step.
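
In practice one rarely hand-codes the full globalized method; for instance, SciPy ships a BFGS implementation with a Wolfe-type line search. A usage sketch (the printed result is what one would expect, not a guaranteed verbatim output):

```python
import numpy as np
from scipy.optimize import minimize

rosenbrock = lambda x: (1 - x[0])**2 + 100 * (x[1] - x[0]**2)**2
res = minimize(rosenbrock, x0=np.array([-1.9, 0.5]), method='BFGS')
print(res.x)  # approx. [1. 1.]
```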

Some remarks on Quasi-Newton methods and the BFGS method

In each step of a Quasi-Newton method, the approximation of the Hessian and its inverse changes. Thus, the Quasi-Newton direction is, in each step, the direction of steepest descent with respect to a changed norm. This is why Quasi-Newton methods are also known as variable metric methods.

As an important convergence result, let us finally note that the BFGS method converges towards a minimum $x^*$, and that its speed of convergence in a suitable neighborhood of $x^*$ is even super-linear.

The Trust-Region Method

The key idea of the trust-region method (1/2)

The idea of all optimization algorithms discussed so far was to determine the direction of descent s via the unconstrained optimization problem $\min q_k(s)$, where

$q_k(s) := J(x_k) + \nabla J(x_k)^T s + \tfrac{1}{2} s^T H_k s$

is the quadratic model of the function $J : \mathbb{R}^n \to \mathbb{R}$ at the point $x_k$ for the actual minimum of J, and $H_k$ is a suitably good approximation of the Hessian matrix at $x_k$.

As discussed several times, the quadratic approximation is not that well suited for global optimization: although we consider the quadratic approximation $q_k$ a good model only locally around $x_k$, we still take its global minimum as the new direction of descent for our original non-linear problem.

The key idea of the trust-region method (2/2)

This is where the trust-region method provides a better ansatz: instead of finding the global minimum of $q_k$, we determine a minimum in a certain region of trustworthiness (the trust region), which allows for the local character of the quadratic model. This leads to the constrained sub-problem

$\min q_k(s)$ subject to $\|s\|_2 \le \Delta_k,$

where $\Delta_k$ denotes the radius of the trust region. This sub-problem can be solved, for instance, with Newton's method. In particular, due to its local character, the step-size determination is omitted in the trust-region method.

Illustration of the trust-region sub-problem

[Figure: level sets of the quadratic model, the Newton step, the trust-region radius around $x_k$, and the direction of the negative gradient.]

Discussion of the trust-region radius

It is obvious that the choice of the trust-region radius $\Delta_k$ is the essential part of this method. $\Delta_k$ is predicted by comparing the decrease of the actual objective function J with that of its quadratic approximation $q_k$. This prediction is carried out with the help of the quotient

$r_k := \frac{J(x_k) - J(x_k + s)}{J(x_k) - q_k(s)}.$

Depending on $r_k$, the trust-region radius is enlarged or reduced (see the algorithm for details).

A trust-region Newton algorithm (1/2)

Algorithm (Trust-Region Newton Method)
Starting point: Choose $x_0 \in \mathbb{R}^n$, $\Delta_0 > 0$, $\Delta_{\min} > 0$, $0 < \rho_1 < \rho_2 < 1$ and $0 < \sigma_1 < 1 < \sigma_2$. Set k = 0.
Iteration (trust-region sub-problem): If $\nabla J(x_k) = 0$: STOP, the optimal solution is $x_k$. Else, determine $s_k \in \mathbb{R}^n$ as the solution of the trust-region sub-problem (e.g. as a penalized unconstrained optimization problem using Newton's method), and compute the trust-region quotient $r_k$.

A trust-region Newton algorithm (2/2)

Algorithm (Trust-Region Newton Method) [cont.]
New point of descent:
If $r_k \ge \rho_1$ (the k-th iteration was successful), set $x_{k+1} := x_k + s_k$.
Else, set $x_{k+1} := x_k$ (do nothing and adjust the trust-region radius).
Update data (new trust-region radius):
If $r_k < \rho_1$, set $\Delta_{k+1} := \sigma_1 \Delta_k$ (reduce the trust-region radius).
If $r_k \in [\rho_1, \rho_2)$, set $\Delta_{k+1} := \max\{\Delta_{\min}, \Delta_k\}$.
If $r_k \ge \rho_2$, set $\Delta_{k+1} := \max\{\Delta_{\min}, \sigma_2 \Delta_k\}$ (enlarge the trust-region radius).
Set k := k + 1 and continue with the iteration step.
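
The radius-update logic of the algorithm in isolation; the numerical parameter values are illustrative choices, not fixed by the slides.

```python
def update_radius(r_k, delta_k, delta_min=1e-3,
                  rho1=0.25, rho2=0.75, sigma1=0.5, sigma2=2.0):
    """Trust-region radius update following the algorithm above."""
    if r_k < rho1:                           # poor model fit: shrink
        return sigma1 * delta_k
    if r_k < rho2:                           # acceptable fit: keep
        return max(delta_min, delta_k)
    return max(delta_min, sigma2 * delta_k)  # very good fit: enlarge
```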

Solving constrained minimization problems

The trust-region method requires us to solve a constrained minimization problem subject to the trust region. There are two rather easy ways to transform a constrained minimization problem into an unconstrained one, and thus be able to apply the discussed methods:

1. Barrier functions prohibit leaving the admissible region (e.g. by introducing some kind of singularity at the boundaries), and
2. Penalty functions simply penalize leaving the admissible region.

(There are more efficient, but more elaborate and difficult, methods for approaching constrained optimization problems, but they are beyond what we'll cover in this course.)

Examples of Penalty & Barrier Functions in 1D

Constrained 1D minimization with penalty functions (1/4)

Example: Given the constrained 1D minimization problem

$\min J(x)$ subject to $x \ge 1$, for $J(x) = x^4.$

First, we define a twice continuously differentiable penalty function

$\varphi_k(x) := \begin{cases} 0 & \text{for } x < 0 \\ k x^3 & \text{for } x \ge 0 \end{cases}$

for some $k \gg 1$, e.g., k = 100.

[Figure: penalty function $\varphi_k(x)$ for k = 100, penalizing all values x > 0.]

Constrained 1D minimization with penalty functions (2/4)

Example [cont.]: As the constraint $x \ge 1$ is equivalent to $1 - x \le 0$, we define the modified penalized objective function

$J_k(x) := J(x) + \varphi_k(1 - x) = x^4 + \varphi_k(1 - x),$

which is identical to J for $x \ge 1$ but rises sharply for $x < 1$. The additional term $\varphi_k(1 - x)$ penalizes an optimization algorithm for choosing $x < 1$.

[Figure: graph of the original objective function (solid blue) and of the modified penalized objective function (dotted red).]

Constrained 1D minimization with penalty functions (3/4)

Example [cont.]: We can approximately minimize J(x) subject to $x \ge 1$ by running an unconstrained optimization algorithm on $J_k(x)$; the penalty term will strongly encourage the unconstrained algorithm to choose the best $x \ge 1$. The penalty function $\varphi_k$ is $C^2$, so it does not cause any trouble in an optimization algorithm which relies on first or second derivatives. The computed minimum of $J_{100}(x)$ turns out to be $x^* \approx 0.9012$.

[Figure: zoom into the graph of the original objective function (solid blue) and of the modified penalized objective function (dotted red).]

Constrained 1D minimization with penalty functions (4/4)

Example [cont.]: Increasing the penalty parameter k enforces the constraints more rigorously, while using the previous final iterate as an initial guess speeds up the convergence of the unconstrained optimization algorithm (as we expect the minimum for a larger value of k to be near the minimum for the previous value of k).

[Figure: zoom into the graph of the original objective function J(x) (solid blue), of the modified penalized objective function $J_{100}(x)$ (dotted red) and of $J_{10000}(x)$ (dotted black).]
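
The whole four-part example fits in a few lines; a sketch using SciPy's scalar minimizer, with the penalty schedule and the bracketing interval as illustrative assumptions.

```python
from scipy.optimize import minimize_scalar

# Penalty approach for: min x^4 subject to x >= 1 (i.e. 1 - x <= 0).
phi = lambda t, k: k * t**3 if t >= 0 else 0.0    # the C^2 penalty function
J_pen = lambda x, k: x**4 + phi(1 - x, k)         # penalized objective J_k

x = 0.0
for k in [1e2, 1e4, 1e6]:                 # tighten the penalty gradually,
    x = minimize_scalar(J_pen, bracket=(x, x + 1), args=(k,)).x
    print(k, x)                           # warm-starting at the previous x;
                                          # iterates approach x* = 1
```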

Pros & cons of penalty functions

The penalty function approach is easily generalized to higher dimensions. It is a hands-off method for converting constrained problems of any type into unconstrained problems. We don't have to worry about finding an initial feasible point (sometimes a problem in itself). Many constraints in the real world are soft, in the sense that they need not be satisfied precisely; the penalty function approach is well suited to this type of problem.

The drawback of penalty function methods is that the solution of the unconstrained penalized problem will not be an exact solution of the original problem (except in the limit as described above). In some cases penalty methods cannot be applied because the objective function is actually undefined outside the feasible set. Also, as we increase the penalty parameters to enforce the constraints more strictly, the unconstrained formulation becomes very ill-conditioned, with large gradients and abrupt function changes.

The key idea of the barrier function method (1/2)

Barrier function methods are closely related to penalty function methods, and in fact might as well be considered a type of penalty function method. These methods are generally applicable only to inequality-constrained optimization problems, and have the advantage that they always maintain feasible iterates, unlike the penalty methods above. The most common is the log barrier method.

Suppose we have an objective function J(x) on $\mathbb{R}^n$ with inequality constraints $g_i(x) \le 0$ for $i = 1, 2, \dots, m$. We transform this into the modified or penalized objective function

$J_b(x) = J(x) - \sum_{i=1}^m r_i \ln(-g_i(x)),$ with all $r_i > 0.$

$J_b(x)$ is undefined if any $g_i(x) \ge 0$, so we can only evaluate $J_b(x)$ in the interior of the feasible region. However, even inside the feasible region the penalty term is non-zero (and it becomes an anti-penalty if $g_i(x) \le -1$).

The key idea of the barrier function method (2/2)

In general a barrier method works in a similar way to the penalty methods. We start with some choice of the $r_i$ and with an initial feasible point $x_0$ (which may be hard to find), and minimize

$J_b(x) = J(x) - \sum_{i=1}^m r_i \ln(-g_i(x)),$ with all $r_i > 0,$

by applying an unconstrained optimization algorithm. The terminal point $x_k$ must be a feasible point, because the log terms in the definition of $J_b(x)$ form a barrier of infinite height which prevents the optimization routine from leaving the interior of the feasible region. Next, we decrease the values of the $r_i$ and re-optimize, using the final iterate $x_k$ as an initial guess for the new problem. We continue this until an acceptable minimum is found.

An example for the application of barrier functions

Example: Given the constrained 1D minimization problem

$\min J(x)$ subject to $x \ge 1$, for $J(x) = x^4,$

a modified objective function may look like

$J_b(x) = x^4 - 2 \ln(x - 1)$

for $r_1 = 2$.

[Figure: graph of the original objective function J(x) (solid blue) and of the modified objective function $J_b(x)$ (dotted red).]
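
A matching sketch for the log-barrier variant; the schedule for r and the bracketing interval are illustrative assumptions, and the iterate must stay strictly feasible (x > 1).

```python
import numpy as np
from scipy.optimize import minimize_scalar

# Log-barrier approach for: min x^4 subject to x >= 1, i.e. g(x) = 1 - x <= 0.
J_b = lambda x, r: x**4 - r * np.log(x - 1) if x > 1 else np.inf

x = 2.0                                   # must start strictly feasible
for r in [2.0, 0.2, 0.02, 0.002]:         # drive the barrier weight down,
    x = minimize_scalar(J_b, bracket=(x, x + 1), args=(r,)).x
    print(r, x)                           # re-optimizing from the previous x;
                                          # iterates approach x* = 1 from above
```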

Simulated Annealing

Finding the global minimum in the presence of many local ones

[Figure: a landscape with many local minima.]

The method of simulated annealing (1/2)

Simulated annealing exploits an analogy between the way in which a metal cools and freezes into a minimum-energy crystalline structure (the annealing process) and the search for a minimum in a more general system. It has been proposed and found effective for the minimization of difficult functions, especially those that may have many merely local minimum points. It involves no derivatives or line searches; indeed, it has found great success in minimizing discrete functions, such as arise in the traveling salesman problem.

Suppose we are given a real-valued function of n variables, $J : \mathbb{R}^n \to \mathbb{R}$, such that we are able to compute the value J(x) for any $x \in \mathbb{R}^n$. It is desired to locate a global minimum point $x^*$ of J, which is a point such that $J(x^*) \le J(x)$ for all $x \in \mathbb{R}^n$. In other words, $J(x^*)$ equals $\inf_{x \in \mathbb{R}^n} J(x)$.

The method of simulated annealing (2/2)

The simulated annealing algorithm is based upon that of Metropolis et al., which was originally proposed as a means of finding the equilibrium configuration of a collection of atoms at a given temperature. Simulated annealing's major advantage over other methods is its ability to avoid becoming trapped at local minima. The algorithm employs a random search which accepts not only changes that decrease the objective function, but also some changes that increase it. The latter are accepted with a certain probability depending on a control parameter, which, by analogy with the original application, is known as the system temperature, irrespective of the objective function involved.

Illustration of the method of simulated annealing

[Figure sequence (9 frames): a 1D landscape J(x) with several local minima, the global minimum separated from them by high energy barriers; the current iterate is repeatedly perturbed.]

Situation: we seem to be trapped in a local minimum. How can we escape?

Solution: increase the energy or temperature, so that the "particle" can deviate from the local minimum, and reduce it again once a better basin has been reached.

Mathematical formulation of the simulated annealing algorithm (1/2)

The simulated annealing algorithm generates a sequence of points $x_1, x_2, \dots$, and one hopes that the following convergence can be established:

$\min_{j \le k} J(x_j) \to J(x^*)$ for $k \to \infty,$

where $J(x^*)$ equals $\inf_{x \in \mathbb{R}^n} J(x)$. In describing the computation that leads to $x_{k+1}$, assuming that $x_k$ has been computed, we begin by generating a modest number of random points $u_1, u_2, \dots, u_m$ in a large neighborhood of $x_k$. For each of these points, its value $J(u_i)$ ($i = 1, 2, \dots, m$) must be computed, and the next point $x_{k+1}$ in our sequence is actually one of the points $u_1, u_2, \dots, u_m$.

Mathematical formulation of the simulated annealing algorithm (2/2)

The choice of the next point $x_{k+1}$ in our sequence is made as follows: Select an index j such that $J(u_j) = \min\{J(u_1), J(u_2), \dots, J(u_m)\}$. If $J(u_j) < J(x_k)$, set $x_{k+1} := u_j$. Else, for each $i = 1, 2, \dots, m$ assign a probability

$p_i := \frac{\exp(\alpha (J(x_k) - J(u_i)))}{\sum_{i=1}^m \exp(\alpha (J(x_k) - J(u_i)))} \in [0, 1]$

to each $u_i$, where $\alpha > 0$ is a parameter chosen upfront. Finally, a random choice is made among the points $u_1, u_2, \dots, u_m$ based on the probabilities $p_1, p_2, \dots, p_m$. The thus randomly chosen $u_i$ becomes $x_{k+1}$.
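
A sketch of this scheme; the candidate count m, the parameter alpha, the sampling radius, and the slow shrinking of that radius (the "cooling", anticipating the illustration that follows) are all illustrative assumptions.

```python
import numpy as np

def simulated_annealing(J, x0, m=20, alpha=1.0, radius=2.0,
                        n_iter=500, seed=0):
    """Sample m random candidates around x_k; move greedily if one
    improves, otherwise move randomly with probabilities p_i
    proportional to exp(alpha * (J(x_k) - J(u_i)))."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x0, dtype=float)
    for _ in range(n_iter):
        U = x + rng.uniform(-radius, radius, (m, x.size))  # candidates u_i
        vals = np.array([J(u) for u in U])
        j = vals.argmin()
        if vals[j] < J(x):
            x = U[j]                                       # greedy step
        else:
            w = np.exp(alpha * (J(x) - vals))              # unnormalized p_i
            x = U[rng.choice(m, p=w / w.sum())]            # random step
        radius *= 0.99                                     # slow "cooling"
    return x
```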

Another illustration of simulated annealing (1/4)

[Figure: landscape J(x) with the starting point(s).] Start with a given point or distribution of points.

Another illustration of simulated annealing (2/4)

[Figure: ranges around the starting points where the test points are randomly set.] Generate a modest number of random test points in an interval around the starting point(s). The length of this interval can be interpreted as the energy or temperature from our first analogy.

Another illustration of simulated annealing (3/4)

[Figure: function values at the test points.] Evaluate the function at these test points.

Another illustration of simulated annealing (4/4)

[Figure: unique (greedy) choice vs. random choice of the next point.] Finally, decide upon the new starting point(s). To avoid too much diffusion, one adjusts the energy/temperature in each step, and thus controls how far away from the starting point(s) the next test points lie.

Summary & Outlook

Major concepts covered: unconstrained optimization techniques

"Although this may seem a paradox, all exact science is dominated by the idea of approximation." — Bertrand Russell

- The gradient/steepest descent method and its step-size conditions (Armijo-Goldstein & Wolfe-Powell)
- Newton's method revisited, the Newton-Gradient method and the Quasi-Newton method incl. the BFGS method
- The Trust-Region method
- Penalty & barrier functions (in 1D)
- Simulated Annealing