Lecture notes
Mathematical Methods for Applications in Science and Engineering, part II: Numerical Optimization
Eran Treister
Computer Science Department, Ben-Gurion University of the Negev


Contents

1 Preliminary background: unconstrained optimization
  1.1 Optimality conditions for unconstrained optimization
  1.2 Convexity
2 Iterative methods for unconstrained optimization
  2.1 Steepest descent
  2.2 Newton, quasi-Newton
  2.3 Line-search methods
  2.4 Coordinate descent methods
  2.5 Non-linear Conjugate Gradient methods
3 Constrained Optimization
  3.1 Equality-constrained optimization: Lagrange multipliers
  3.2 The KKT optimality conditions for general constrained optimization
  3.3 Penalty and Barrier methods
  3.4 The projected Steepest Descent method
4 Robust statistics in least squares problems
5 Stochastic optimization
  5.1 Stochastic Gradient Descent
  5.2 Example: SGD for minimizing a convex function
  5.3 Upgrades to SGD
6 Minimization of Neural Networks for Classification
  6.1 Linear classifiers: Logistic regression
    6.1.1 Computing the gradient of logistic regression
    6.1.2 Adding the bias
  6.2 Linear classifiers: Multinomial logistic (softmax) regression
    6.2.1 Computing the gradient of softmax regression
  6.3 The general structure of a neural network
  6.4 Computing the gradient of NNs: back-propagation
  6.5 Gradient and Jacobian verification

1 Preliminary background: unconstrained optimization

An optimization problem is a problem in which a cost function f(x) : R^n → R needs to be minimized or maximized. We will mostly refer to minimization, as maximization can always be achieved by minimizing −f. The function f is called the objective. In most cases, we wish to find a point x* that minimizes f. We will denote this by

x* = arg min_{x ∈ R^n} f(x).   (1)

This optimization problem is not the most general problem we will deal with. It is called an unconstrained optimization problem, and we will deal with it first; constrained optimization is treated in a later section.

Optimization problems arise in a huge variety of applications in fields like the natural sciences, engineering, economics, statistics and more. Optimization arises from our nature - we usually wish to make things better in some sense. In recent years, large amounts of data have been collected, and consequently optimization methods arise from the need to characterize this data by computational models. In this course we deal with continuous optimization problems, as opposed to discrete optimization, which is a completely different field with different challenges.

Until now we have met some optimization problems, such as least squares and A-norm minimization problems. Those problems had quadratic objectives, and setting their gradient to zero resulted in a linear system. Understanding quadratic optimization problems is the key to understanding general optimization problems, and often the iterative methods used for general optimization problems are just generalizations of methods for solving linear systems. We have seen a bit of this concept in Steepest Descent and Gauss-Seidel. Another strong connection between quadratic and general optimization problems is that every objective function f is locally quadratic. We will now see that via the Taylor expansion.

Multivariate Taylor expansion

Assume a continuous one-variable function f(x). The one-dimensional Taylor expansion is given by

f(x + ε) = f(x) + f'(x)ε + (1/2)f''(x)ε² + (1/3!)f'''(c)ε³,   c ∈ [x, x + ε].

Assuming that f''' is continuous, we will usually refer to the last term just as O(ε³), and all of our derivations will focus on the first two or three terms.

If f has two variables, f = f(x_1, x_2), then the two-dimensional Taylor expansion is given by

f(x_1 + ε_1, x_2 + ε_2) = f(x_1, x_2) + (∂f/∂x_1)ε_1 + (∂f/∂x_2)ε_2 + (1/2)(∂²f/∂x_1²)ε_1² + (∂²f/∂x_1∂x_2)ε_1ε_2 + (1/2)(∂²f/∂x_2²)ε_2² + ...

Let us recall the definition of the gradient, which in two variables is given by

∇f = [∂f/∂x_1 ; ∂f/∂x_2].

This is a 2×1 vector, and if we denote ε = [ε_1, ε_2], then

⟨∇f, ε⟩ = (∂f/∂x_1)ε_1 + (∂f/∂x_2)ε_2.

Now we define the two-dimensional Hessian matrix - a two-dimensional second derivative:

∇²f = H = [∂²f/∂x_1² , ∂²f/∂x_1∂x_2 ; ∂²f/∂x_1∂x_2 , ∂²f/∂x_2²].

It can be seen that

(1/2)⟨ε, Hε⟩ = (1/2)(∂²f/∂x_1²)ε_1² + (∂²f/∂x_1∂x_2)ε_1ε_2 + (1/2)(∂²f/∂x_2²)ε_2²,

and therefore the Taylor expansion can be written as

f(x + ε) = f(x) + ⟨∇f, ε⟩ + (1/2)⟨ε, Hε⟩ + O(‖ε‖³).

This expansion is also suitable for n-dimensional functions, where the Hessian matrix H ∈ R^{n×n} is defined by H_ij = ∂²f/∂x_i∂x_j.
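As a small numerical illustration (a sketch in Julia, not part of the original notes; the test function, point and perturbation below are arbitrary choices), one can verify that the error of the second-order model decays like O(‖ε‖³):

using LinearAlgebra

# test function, its gradient and Hessian (hypothetical example)
f(x)    = x[1]^2 + 3*x[1]*x[2] + exp(x[2])
grad(x) = [2*x[1] + 3*x[2]; 3*x[1] + exp(x[2])]
hess(x) = [2.0 3.0; 3.0 exp(x[2])]

x = [1.0; 0.5]
for h in (1e-1, 1e-2, 1e-3)
    eps = h*[1.0; -2.0]
    taylor2 = f(x) + dot(grad(x), eps) + 0.5*dot(eps, hess(x)*eps)
    println("h = $h,  |f(x+eps) - model| = ", abs(f(x + eps) - taylor2))  # shrinks roughly like h^3
end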

Note that if we assume that f is smooth and twice differentiable, then H is a symmetric matrix (but not necessarily positive definite).

A Taylor expansion of a function vector

In some cases, derivatives may be complicated and we need to know how to calculate a derivative of a vector of functions, as opposed to the scalar functions that we dealt with so far. Consider f(x) : R^n → R^m; that is, x ∈ R^n and f(x) ∈ R^m. The first-order approximation of each of the functions f_i(x) is given by

f_i(x + ε) ≈ f_i(x) + ⟨∇f_i(x), ε⟩.

For the whole function vector f(x), we can write

δf = f(x + ε) − f(x) ≈ Jε,

where J ∈ R^{m×n} is the Jacobian matrix, comprised of the gradients ∇f_i(x)ᵀ as rows:

J = [∇f_1(x)ᵀ ; ∇f_2(x)ᵀ ; ... ; ∇f_m(x)ᵀ],   J_ij = ∂f_i/∂x_j.

For the sake of gradient calculation we will focus on the case where ε → 0 and assume equality in the definition of δf.

Example 1 (The Jacobian of Ax). Suppose that f = Ax, where A ∈ R^{m×n}. It is clear that

δf = f(x + ε) − f(x) = A(x + ε) − Ax = Aε,

hence J = A.

Example 2 (The Jacobian of φ(x), where φ is a scalar function). Suppose that f = φ(x), where φ : R → R is a scalar function applied element-wise, i.e., f_i(x) = φ(x_i). In this case the vector Taylor expansion is just a vector of one-dimensional expansions:

δf = φ(x + ε) − φ(x) ≈ diag(φ'(x))ε = diag(φ'(x))δx.

This means that J = diag(φ'(x)), which is a diagonal matrix such that J_ii = φ'(x_i).
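A quick way to check such derivations is to compare against finite differences. Below is a minimal sketch in Julia (not part of the original notes; the element-wise function φ is an arbitrary choice) verifying that J = diag(φ'(x)) for Example 2:

using LinearAlgebra

phi(t)  = tanh(t)
dphi(t) = 1 - tanh(t)^2            # derivative of phi

x = [0.3, -1.2, 2.0]
J = Diagonal(dphi.(x))             # analytic Jacobian: diag(phi'(x))

eps = 1e-6*randn(3)
delta_f = phi.(x + eps) - phi.(x)  # actual change in the function vector
println(norm(delta_f - J*eps))     # O(‖eps‖²), i.e. negligible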

1.1 Optimality conditions for unconstrained optimization

We now ask ourselves: what is a minimum point of f? We will consider two kinds of minimum points.

Definition 1 (Local minimum). A point x* is called a local minimum of a function f if there exists r > 0 such that f(x) ≥ f(x*) for all x with ‖x − x*‖ < r. If the inequality is strict, then we have a strict local minimum.

Figure 1: Global and local minima.

Definition 2 (Global minimum). A point x* is called a global minimum of a function f if f(x) ≥ f(x*) for all x ∈ R^n. Again, if the inequality is strict, we have a strict global minimum.

We now ask: how can we identify or compute local and global minima of a function? It turns out that a local minimum can be characterized, while a global one is harder. In this course we focus on iterative methods that find a local minimum, hoping that it is not too far from the global minimum. There are many ways to handle the issue of global vs. local minima for general functions - e.g., adding regularization, or investigating a good starting point - but we will not deal with that in this course.

It turns out that there is a quite large family of functions for which any local minimum is also a global minimum. These functions are called convex functions.

1.2 Convexity

A set Ω ⊆ R^n is convex if for every x_1, x_2 ∈ Ω and every α ∈ [0, 1] the point αx_1 + (1 − α)x_2 is also in Ω. A function f is convex over a convex region Ω if for all x_1, x_2 ∈ Ω and all α ∈ [0, 1]

f(αx_1 + (1 − α)x_2) ≤ αf(x_1) + (1 − α)f(x_2).

The convex set definition is illustrated in Figure 2, and the convex function definition can be illustrated in 1D - see Figure 3. A 2D example of convex and non-convex functions is given in Figure 4.

Figure 2: Convexity of sets.

Figure 3: Convexity of a function (the image uses t instead of α).

Figure 4: Convexity in 2D: the left image shows a non-convex function, while the right one shows a convex (and quadratic) function.

Alternative definitions of convexity:

1. Assume that f(x) is differentiable. f(x) is a convex function over a convex region Ω if and only if for all x_1, x_2 ∈ Ω

f(x_1) ≥ f(x_2) + ⟨∇f(x_2), x_1 − x_2⟩.

This definition means that any tangent plane of the function has to lie below the function. It is equivalent to the previous, traditional definition.

2. Assume that f(x) is twice differentiable. f(x) is a convex function over a convex region Ω if and only if the Hessian ∇²f(x) ⪰ 0 is positive semi-definite for all x ∈ Ω.

It is easy to show that the two definitions are equivalent through a Taylor series:

f(x_1) = f(x_2) + ⟨∇f(x_2), x_1 − x_2⟩ + (1/2)⟨x_1 − x_2, ∇²f(c)(x_1 − x_2)⟩,   c ∈ [x_1, x_2].

Example 3 (The definitions of convexity for a quadratic function).
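As a small numerical illustration of this example (a sketch, not part of the original notes): for a quadratic f(x) = (1/2)xᵀAx − bᵀx + c the Hessian is the constant matrix A, so by the second alternative definition f is convex exactly when A is positive semi-definite:

using LinearAlgebra

A1 = [2.0 0.0; 0.0 1.0]    # positive semi-definite -> convex quadratic
A2 = [1.0 0.0; 0.0 -1.0]   # indefinite -> non-convex quadratic (a saddle)
println(eigvals(A1))       # all eigenvalues >= 0
println(eigvals(A2))       # contains a negative eigenvalue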


When the objective function is convex, local and global minimizers are simple to characterize.

Theorem 1. When f is convex, any local minimizer x* is a global minimizer of f. If in addition f is differentiable, then any stationary point x* (a point for which ∇f(x*) = 0) is a global minimizer of f.

We will only consider the proof of the first part of the theorem here; the proof of the second part is obtained by contradiction.

These results provide the foundation of unconstrained optimization algorithms. If we have a convex function, then every local minimizer is also a global minimizer, and hence if we have methods that reach local minimizers, we can reach the global minimizer. This is why convex problems are very popular in science, although in some cases there is no choice but to solve non-convex problems. In all of the algorithms that follow we seek a point x* where ∇f(x*) = 0.

2 Iterative methods for unconstrained optimization

Just like linear systems, optimization can be carried out by iterative methods. Actually, in optimization it is usually not possible to directly find a minimum of a function, so iterative methods are much more essential in optimization than in numerical linear algebra. On the other hand, we have learned about minimizing quadratic functions, which is equivalent to solving linear systems. Since every function is locally approximated by a quadratic function (and specifically near its minimum), everything we learned about iterative methods for linear systems can be useful in the context of optimization.

2.1 Steepest descent

Earlier, we met the steepest descent method

x^(k+1) = x^(k) − α^(k) ∇f(x^(k)),

which we analysed for positive definite linear systems Ax = b. In fact, we saw that if we look at the equivalent quadratic minimization, then the matrix A has to be positive definite, which means that the quadratic function f has to be convex (this does not mean that a general f has to be convex for Steepest Descent to work). Just like in the numerical linear algebra case, α^(k) can be chosen sufficiently small (whether constant, α^(k) = α, or not) so that the method converges to a local minimum. However, a more common approach is to look at α^(k) as a step size and choose it in some wise way.
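A minimal sketch (not part of the original notes) of steepest descent on a quadratic f(x) = (1/2)xᵀAx − bᵀx with the exact step size along the residual direction; for a symmetric positive definite A this is exactly the steepest descent method for the linear system Ax = b:

using LinearAlgebra

function sd_quadratic(A, b; iters=50)
    x = zeros(length(b))
    for k = 1:iters
        r = b - A*x                      # residual = minus the gradient of 0.5*x'Ax - b'x
        alpha = dot(r, r)/dot(r, A*r)    # exact line search step for a quadratic
        x += alpha*r
    end
    return x
end

A = [4.0 1.0; 1.0 3.0]                   # symmetric positive definite
b = [1.0; 2.0]
x = sd_quadratic(A, b)
println(norm(A*x - b))                   # ≈ 0: the minimizer solves Ax = b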


Corollary 1. Any direction d = −M∇f such that M ≻ 0 is a descent direction.

We will look at more general methods next, but first observe that the steepest descent method (with a pre-defined step size α) can be viewed as a step that minimizes the quadratic function that approximates f(x) around x^(k):

d_SD^(k) = arg min_d { f(x^(k)) + ⟨∇f(x^(k)), d⟩ + (1/(2α))⟨d, d⟩ } = −α∇f(x^(k)).

Then:

x^(k+1) = x^(k) + d_SD^(k) = x^(k) − α∇f(x^(k)).

This derivation assumes that α is either constant or known a priori. We will now see some other methods that follow the same pattern, and later see a method for determining the step length α through linesearch.

2.2 Newton, quasi-Newton

One of the most important methods besides steepest descent is Newton's method, which is much more powerful than SD, but may also be more expensive. Recall again the Taylor expansion

f(x^(k) + ε) = f(x^(k)) + ⟨∇f(x^(k)), ε⟩ + (1/2)⟨ε, ∇²f(x^(k))ε⟩ + O(‖ε‖³).   (2)

Now we choose the next step x^(k+1) = x^(k) + ε by minimizing Eq. (2), without the O(‖ε‖³) term, with respect to ε. That is, the Newton direction d_N is chosen by the following quadratic minimization:

d_N^(k) = arg min_d { f(x^(k)) + ⟨∇f(x^(k)), d⟩ + (1/2)⟨d, ∇²f(x^(k))d⟩ } = −(∇²f(x^(k)))⁻¹ ∇f(x^(k)).

This minimization is similar to the one mentioned for SD, but now it has ∇²f(x^(k)) instead of the matrix (1/α)I. This shows us that in SD we essentially approximate the real Hessian with a scaled identity matrix.

The Newton procedure leads to the search direction

d_N = −(∇²f(x^(k)))⁻¹ ∇f(x^(k)),

which is called Newton's direction. This is the best search direction that is used in practice, and it is relatively hard to obtain: first, the function f should be twice differentiable. Second, one has to compute the Hessian matrix, or at least know how to solve a non-trivial linear system involving the Hessian matrix. This has to be done at each iteration, which may be costly. Solving the linear system can be done directly through an LU decomposition, or iteratively using an appropriate method. If we choose a constant step length, then one can show that the best step length for Newton's method is asymptotically 1.0. However, throughout the iterations it is better to perform a linesearch over the direction d_N to guarantee decreasing values of f(x^(k)) and convergence. It is important to note that Newton's method should be used with care: there is no guarantee that d_N is a descent direction, or that the matrix ∇²f(x^(k)) is invertible. If the problem is convex and the Hessian is positive definite, we have nothing to worry about.

Quasi-Newton methods

Applying Newton's method is expensive. In quasi-Newton methods we approximate the Hessian by some matrix M(x^(k)), and at each step minimize

d_QN^(k) = arg min_d { f(x^(k)) + ⟨∇f(x^(k)), d⟩ + (1/2)⟨d, M(x^(k))d⟩ } = −(M(x^(k)))⁻¹ ∇f(x^(k)).   (3)

M can be chosen as a diagonal matrix, e.g. M = cI or M = diag(∇²f), which is easily invertible, or in other ways. We will not go into this further in this course, but the limited-memory Broyden-Fletcher-Goldfarb-Shanno (LBFGS) method is one of the most popular quasi-Newton methods in the literature. It iteratively builds a low-rank approximation of the Hessian from previous search directions, which is somewhat similar to some Krylov methods that we have seen earlier in this course (e.g., GMRES). A small sketch of the diagonal choice M = diag(∇²f) is given below.
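A minimal sketch (not part of the original notes) of a quasi-Newton iteration with the diagonal approximation M = diag(∇²f), on a quadratic whose Hessian is the constant matrix A (so ∇f(x) = Ax − b); this Jacobi-like choice is cheap to invert but slower than the exact Newton step:

using LinearAlgebra

function diag_quasi_newton(A, b; iters=50)
    M = Diagonal(diag(A))          # diagonal approximation of the Hessian
    x = zeros(length(b))
    for k = 1:iters
        x += -(M \ (A*x - b))      # quasi-Newton step with unit step length
    end
    return x
end

A = [100.0 2.0; 2.0 1.0]           # ill-conditioned SPD Hessian
b = [1.0; 1.0]
x = diag_quasi_newton(A, b)
println(norm(A*x - b))             # ≈ 0; the exact Newton step -A\(A*x - b) would solve it in one step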

All the methods above are one-point iterations; they are summarized in Algorithm 1.

Algorithm: General one-point iterative method
# Input: objective f(x) to minimize.
for k = 1, ..., maxiter do
    Compute the gradient ∇f(x^(k)).
    Define the search direction d^(k), essentially by solving a quadratic minimization.
    Choose a step length α^(k) (possibly by linesearch).
    Apply a step: x^(k+1) = x^(k) + α^(k) d^(k).
    if ‖∇f(x^(k+1))‖ / ‖∇f(x^(1))‖ < ε, or alternatively ‖x^(k+1) − x^(k)‖ < ε, then
        Convergence is reached; stop the iterations.
    end
end
Return x^(k+1) as the solution.

Algorithm 1: General one-point iterative method for unconstrained optimization.

2.3 Line-search methods

Line-search methods are among the most common methods in optimization. Assume we are at iteration k, at the point x^(k), and that we have a descent direction d^(k) in which the function decreases. Line-search methods perform

x^(k+1) = x^(k) + α^(k) d^(k),   (4)

and choose α^(k) such that f(x^(k) + αd^(k)) is exactly or approximately minimized over α. In the quadratic SD case, we had a closed-form expression that minimizes this linesearch. When computing the step length α^(k), we are essentially minimizing a one-dimensional function

φ(α) = f(x^(k) + αd^(k)),   α^(k) = arg min_α φ(α),

with respect to α. All 1-D minimization methods can be used here, but we face a tradeoff. We would like to choose α^(k) to substantially reduce f, but at the same time we do not want to spend too much time making the choice. In general, it is too expensive to identify the exact minimizer, as it may require too many evaluations of the objective f. We assume here that the computation of f(x^(k) + αd^(k)) is just as expensive as computing f(w) for some arbitrary vector w. If this is not the case, we may be able to allow ourselves to invest more

iterations in minimizing φ(α). However, note that minimizing φ is only locally optimal (a greedy choice), and there is no guarantee that an exact linesearch minimization is indeed the right choice for the fastest convergence, not to mention the cost of doing so. More practical strategies perform an inexact line search to identify a step length that achieves adequate reductions in f at minimal cost. Typical line search algorithms try out a sequence of candidate values for α, stopping to accept one of these values when certain conditions are satisfied. Sophisticated line search algorithms can be quite complicated. We now show a practical termination condition for the line search algorithm. We will later show, using a simple quadratic example, that effective step lengths need not lie near the minimizers of φ(α).

Backtracking line search using the Armijo condition

We will now show one of the most common line search algorithms, called backtracking line search, also known as the Armijo rule. It involves starting with a relatively large estimate of the step size α, and iteratively shrinking the step size (i.e., "backtracking") until a sufficient decrease of the objective function is observed. The motivation is to choose α^(k) to be as large as possible while obtaining a sufficient decrease of the objective at the same time; we are not minimizing φ. This is done by choosing α_0 rather large and choosing a decrease factor 0 < β < 1. We examine the values of φ(α_j) for decreasing values α_j = β^j α_0 (for j = 0, 1, 2, ...) until the following condition is satisfied:

f(x^(k) + α_j d^(k)) ≤ f(x^(k)) + cα_j ⟨∇f, d^(k)⟩.   (5)

The constant 0 < c < 1 is usually chosen small (about 10⁻⁴), and β is usually chosen to be about 0.5. We are guaranteed that such a point exists, because we assume that d^(k) is a descent direction. That is, we know that ⟨∇f, d^(k)⟩ < 0, and using Taylor's theorem we have

f(x^(k) + αd^(k)) = f(x^(k)) + α⟨∇f, d^(k)⟩ + O(α²‖d^(k)‖²).

It is clear that for some α small enough we will have

f(x^(k)) − f(x^(k) + αd^(k)) = −α⟨∇f, d^(k)⟩ + O(α²‖d^(k)‖²) > 0.

Figure 5: Armijo rule.

In our stopping rule we usually set c to be small, so that we accept a relatively large α. Algorithm 2 summarizes the backtracking linesearch procedure. Common upgrades to this procedure are:

- Choose α_0 for iteration k as (1/β)α^(k−1) or (1/β²)α^(k−1).
- Instead of only reducing the α_j's, also try to enlarge them by α_{j+1} = (1/β)α_j as long as the Armijo condition is satisfied.

Algorithm: Armijo Linesearch
# Input: iterate x, objective f(x), gradient ∇f(x), descent direction d.
# Constants: α_0 > 0, 0 < β < 1, 0 < c < 1.
for j = 0, ..., maxiter do
    Compute the objective φ(α_j) = f(x + α_j d).
    if f(x + α_j d) ≤ f(x) + cα_j ⟨∇f, d⟩ then
        Return α = α_j as the chosen step length.
    else
        α_{j+1} = βα_j
    end
end
# If maxiter iterations were reached, α is too small and we are probably stuck.
# In this case terminate the minimization.

Algorithm 2: Armijo backtracking linesearch procedure.

Example 4 (Newton vs. Steepest Descent). We will now consider the minimization of the Rosenbrock "banana" function

f(x) = (a − x_1)² + b(x_2 − x_1²)²,

where a, b are parameters that determine the difficulty of the problem. Here we choose b = 5 and a = 1. The minimum of this function is at [a, a²], where f = 0 (in our case it is the point [1, 1]). The gradient and Hessian of this function are

∇f(x) = [−4b x_1 (x_2 − x_1²) − 2(a − x_1) ; 2b(x_2 − x_1²)],
∇²f(x) = [−4b(x_2 − x_1²) + 8b x_1² + 2 , −4b x_1 ; −4b x_1 , 2b].

In the code below we apply the SD and Newton algorithms starting from the point [−1.4, 2]. After 100 iterations, SD reached [0.957531, 0.915136] on its way to [1, 1]. Newton's method solved the problem in only 9 iterations.

Figure 6: The convergence path of steepest descent and Newton's method for minimizing the banana function.

using PyPlot; close("all");
xx = -1.5:0.01:1.6
yy = -0.5:0.01:2.5;
X = repmat(xx,1,length(yy))';
Y = repmat(yy,1,length(xx));
a = 1; b = 5;
F = (a-X).^2 + b*(Y-X.^2).^2
figure();
contour(X,Y,F,500); #hold on;
axis image; xlabel("x"); ylabel("y");
title("SD vs Newton for minimizing the Banana function.")
f = (x)->((a-x[1]).^2 + b*(x[2]-x[1].^2).^2)
g = (x)->[-4*b*x[1]*(x[2]-x[1]^2)-2*(a-x[1]); 2*b*(x[2]-x[1]^2)];
H = (x)->[-4*b*(x[2]-x[1]^2)+8*b*x[1]^2+2 -4*b*x[1] ; -4*b*x[1] 2*b];

## Armijo parameters:
alpha0 = 1.0; beta = 0.5; c = 1e-4;

function linesearch(f::Function,x,d,gk,alpha0,beta,c)
    alphaj = alpha0;
    for jj = 1:10
        x_temp = x + alphaj*d;
        if f(x_temp) <= f(x) + alphaj*c*dot(d,gk)
            break;
        else
            alphaj = alphaj*beta;
        end
    end
    return alphaj;
end

## Starting point
x_SD = [-1.4;2.0];
println("*********** SD *****************")
## SD Iterations
for k=1:100
    gk = g(x_SD);
    d_SD = -gk;
    x_prev = copy(x_SD);
    alpha_SD = linesearch(f,x_SD,d_SD,gk,0.25,beta,c);
    x_SD = x_SD + alpha_SD*d_SD;
    plot([x_SD[1];x_prev[1]],[x_SD[2];x_prev[2]],"g",linewidth=2.0);
    println(x_SD);
end;

## Newton Iterations
x_N = [-1.4;2.0];
println("*********** Newton *****************")
for k=1:10
    gk = g(x_N);
    Hk = H(x_N);
    d_N = -(Hk)\gk;
    x_prev = copy(x_N);
    alpha_N = linesearch(f,x_N,d_N,gk,alpha0,beta,c);
    x_N = x_N + alpha_N*d_N;
    plot([x_N[1];x_prev[1]],[x_N[2];x_prev[2]],"k",linewidth=2.0);
    println(x_N);
end;
legend(("SD","Newton"))

2.4 Coordinate descent methods

Coordinate descent methods are similar in principle to the Gauss-Seidel method for linear systems. When we minimize f(x), we iterate over all entries i of x, and change each scalar entry x_i so that f(x) is minimized with respect to x_i, given that the rest of the variables are fixed.

Algorithm: Coordinate Descent
# Input: objective f(x) to minimize.
for k = 1, ..., maxiter do
    # Sweep through all scalar variables x_i and (approximately) minimize f(x)
    # with respect to each x_i in turn, assuming the rest of the variables are fixed.
    for i = 1, ..., n do
        x_i^(k+1) ← arg min_{x_i} f(x_1^(k+1), x_2^(k+1), ..., x_{i−1}^(k+1), x_i, x_{i+1}^(k), ..., x_n^(k))
    end
    # Check if convergence is reached.
end
Return x^(k+1) as the solution.

Algorithm 3: Coordinate descent.

By doing this, we essentially zero the i-th entry of the gradient, i.e., we update x_i such that ∂f/∂x_i = 0. Similarly to the variational property of Gauss-Seidel, the value of f is monotonically non-increasing with each update. If f is bounded from below, the series {f(x^(k))} converges. This does not guarantee convergence to a minimizer, but it is a good property to have. There are many rules for the order in which to choose the variables x_i; for example, lexicographic or random orders are common. One of the requirements to guarantee convergence is that all the variables are visited periodically; we cannot guarantee convergence if a few of the variables are never visited. Coordinate descent methods are most common in non-smooth optimization, and are especially effective when the one-dimensional minimization is easy to apply, as in the small sketch for a quadratic objective given below.
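A minimal sketch (not part of the original notes): coordinate descent for a convex quadratic f(x) = (1/2)xᵀAx − bᵀx, where minimizing over x_i with the other variables fixed has a closed form and reproduces the Gauss-Seidel update:

using LinearAlgebra

function coordinate_descent(A, b; sweeps=100)
    n = length(b)
    x = zeros(n)
    for k = 1:sweeps, i = 1:n
        # zero the i-th gradient entry (Ax - b)_i with respect to x_i
        x[i] = (b[i] - dot(A[i, :], x) + A[i, i]*x[i]) / A[i, i]
    end
    return x
end

A = [4.0 1.0 0.0; 1.0 3.0 1.0; 0.0 1.0 2.0]   # symmetric positive definite
b = [1.0; 2.0; 3.0]
x = coordinate_descent(A, b)
println(norm(A*x - b))                         # ≈ 0, just like Gauss-Seidel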

Example 5 (Coordinate descent for the banana function). Assume again that we are minimizing f(x) = (a − x_1)² + b(x_2 − x_1²)², where a, b are parameters; here b = 10 and a = 1. We have seen that

∇f(x) = [−4b x_1 (x_2 − x_1²) − 2(a − x_1) ; 2b(x_2 − x_1²)],

and in coordinate descent we solve each of these equations in turn. We start with the second variable, because the update for the second variable is easy to define (assuming b ≠ 0):

∂f/∂x_2 = 2b(x_2 − x_1²) = 0  ⟹  x_2 = x_1².

For the first variable, we have

∂f/∂x_1 = −4b x_1 (x_2 − x_1²) − 2(a − x_1) = 0  ⟹  x_1 = ?  [no closed-form solution]

This time there is no closed-form solution to the problem, and in fact there may be more than one minimum point here. A common option in such cases is to apply a one-dimensional Newton step for x_1:

x_1^new = x_1 − α (∂f/∂x_1)/(∂²f/∂x_1²) = x_1 − α (−4b x_1 (x_2 − x_1²) − 2(a − x_1)) / (−4b(x_2 − x_1²) + 8b x_1² + 2),

where α is obtained by linesearch with respect to minimizing f(x_1 + αd_1, x_2), and d_1 is the Newton search direction for x_1. The code for the CD updates appears next; it works in addition to the code in the previous examples. After 100 iterations, the CD iterates reached [0.997156, 0.99432] on their way to [1, 1], which is significantly better than SD but worse than Newton's method.

Figure 7: The convergence path of the coordinate descent method for minimizing the banana function.

## CD Iterations (this code comes in addition to the first section of code in the previous example)
x_CD = [-1.4;2.0];
println("*********** CD *****************")
for k=1:100
    # first variable Newton update:
    gk = -4*b*x_CD[1]*(x_CD[2]-x_CD[1]^2)-2*(a-x_CD[1]); # This is g[1]
    Hk = -4*b*(x_CD[2]-x_CD[1]^2)+8*b*x_CD[1]^2+2;       # This is H[1,1]
    d_1 = -gk/Hk;  ## Hk is a scalar
    f1 = (x)->((a-x).^2 + b*(x_CD[2]-x.^2).^2)           # f1 accepts a scalar.
    x_prev = copy(x_CD);
    alpha_1 = linesearch(f1,x_CD[1],d_1,gk,alpha0,beta,c);
    x_CD[1] = x_CD[1]+alpha_1*d_1;
    # second variable update
    x_CD[2] = x_CD[1]^2;
    plot([x_CD[1];x_prev[1]],[x_CD[2];x_prev[2]],"k",linewidth=2.0);
    println(x_CD);
end;

2.5 Non-linear Conjugate Gradient methods

Earlier we have seen the Conjugate Gradient method for linear systems. This method has many nice properties and advantages. It turns out that the linear CG method can be extended to non-linear unconstrained optimization. This extension has many variants, and all of them reduce to the linear CG method when the function is quadratic. We will not study these methods in this course, but they are quite efficient and worthy of consideration; a small sketch of one variant appears below.
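A minimal sketch (not part of the original notes) of one such variant, the Fletcher-Reeves non-linear CG iteration, reusing the banana-function f, g and the Armijo linesearch defined in the examples above; the restart safeguard is an added assumption for robustness:

using LinearAlgebra

function nonlinear_cg(f, g, x0; iters=100)
    x  = copy(x0)
    gk = g(x)
    d  = -gk
    for k = 1:iters
        alpha = linesearch(f, x, d, gk, 1.0, 0.5, 1e-4)
        x = x + alpha*d
        gnew = g(x)
        beta_FR = dot(gnew, gnew)/dot(gk, gk)  # Fletcher-Reeves coefficient
        d = -gnew + beta_FR*d
        if dot(d, gnew) >= 0.0                 # restart if d is not a descent direction
            d = -gnew
        end
        gk = gnew
    end
    return x
end

println(nonlinear_cg(f, g, [-1.4; 2.0]))       # should move toward the minimizer [1, 1]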

3 Constrained Optimization

In this section we will see the theory of constrained optimization. The theory is somewhat deep, and we will see only a few of the main results in this course. A general formulation of constrained optimization problems has the form

min_{x ∈ R^n} f(x)   subject to   c_j^eq(x) = 0, j = 1, ..., m^eq;   c_l^ieq(x) ≤ 0, l = 1, ..., m^ieq,   (6)

where f, c_j^eq, and c_l^ieq are all smooth functions. f(x), as before, is the objective that we wish to minimize, c_j^eq(x) are equality constraints, and c_l^ieq(x) are inequality constraints. We define the feasible set to be the set of all points that satisfy the constraints:

Ω = { x : c_j^eq(x) = 0, j = 1, ..., m^eq;  c_l^ieq(x) ≤ 0, l = 1, ..., m^ieq }.

We can write (6) compactly as min_{x ∈ Ω} f(x). The focus of this section is to characterize the solutions of (6).

Recall that for unconstrained minimization we had the following sufficient optimality condition: any point x* at which ∇f(x*) = 0 and ∇²f(x*) is positive definite is a strong local minimizer of f. Our goal now is to define similar optimality conditions for constrained optimization problems.

We have seen already that global solutions are difficult to find even when there are no constraints. The situation may be improved when we add constraints, since the feasible set might exclude many of the local minima, and it may be easy to pick the global minimum from those that remain. However, constraints can also make things much more difficult. As an example, consider the problem

min_{x ∈ R^n} ‖x‖_2²   s.t.   ‖x‖_2² ≥ 1.

Without the constraint, this is a convex quadratic problem with the unique minimizer x* = 0. When the constraint is added, any vector x with ‖x‖_2 = 1 solves the problem. There are infinitely many such vectors (hence, infinitely many local minima) whenever n ≥ 2.

Definitions of the different types of local solutions are simple extensions of the corresponding definitions for the unconstrained case, except that now we restrict consideration to the feasible points in the neighborhood of x*. We have the following definition.

Figure 8: The single equality constraint example.

Definition 3 (Local solution of the constrained problem). A vector x* is a local solution of the problem (6) if x* ∈ Ω (x* satisfies the constraints) and there is a neighborhood N of x* such that f(x) ≥ f(x*) for all x ∈ N ∩ Ω.

3.1 Equality-constrained optimization: Lagrange multipliers

To introduce the basic principles behind the characterization of solutions of general constrained optimization problems, we will first consider only equality constraints. We start with a relatively simple case of constrained optimization with a single equality constraint. Our first example is a two-variable problem with a single equality constraint.

Example 6 (A single equality constraint). Consider the following problem:

min_{x ∈ R^n} x_1 + x_2   subject to   x_1² + x_2² − 2 = 0.   (7)

In the language of Equation (6) we have m^eq = 1 and m^ieq = 0, with c_1^eq(x) = x_1² + x_2² − 2. We can see by inspecting Figure 8 that the feasible set for this problem is the circle of radius √2 centered at the origin - just the boundary of this circle, not its interior. The solution x* is obviously [−1, −1]. From any other point on the circle, it is easy to find a way to move that stays feasible (that is, remains on the circle) while decreasing f.

We also see that at the optimal point x*, the objective gradient ∇f is parallel to the constraint gradient ∇c_1^eq(x*). In other words, there exists a scalar λ_1* such that

∇f + λ_1* ∇c_1^eq(x*) = 0,

and in this particular case, where ∇f = [1, 1] and ∇c_1^eq = [2x_1, 2x_2], we have λ_1* = 1/2 for x* = [−1, −1].

We can reach the same conclusion using the Taylor expansion. Assume a feasible point x. To retain feasibility, any small movement in the direction d from x has to satisfy the constraint

0 = c_1^eq(x + d) ≈ c_1^eq(x) + ⟨∇c_1^eq(x), d⟩ = ⟨∇c_1^eq(x), d⟩.

So any movement in the direction d from the point x has to satisfy

⟨∇c_1^eq(x), d⟩ = 0.   (8)

On the other hand, we have seen earlier that any descent direction has to satisfy

⟨∇f(x), d⟩ < 0,   (9)

because f(x + d) − f(x) ≈ ⟨∇f(x), d⟩. If there exists a direction d satisfying (8)-(9), then we can say that improvement in f is possible under the constraints. It follows that a necessary condition for optimality of our problem is that there is no direction d satisfying both (8) and (9). The only way that such a direction cannot exist is if ∇f(x) and ∇c_1^eq(x) are parallel; otherwise

d = −( I − ∇c_1^eq(x) ∇c_1^eq(x)ᵀ / ‖∇c_1^eq(x)‖² ) ∇f(x)

is a descent direction which satisfies the constraints, at least to first order.

It turns out that we can pack the objective and its equality constraints into one function called the Lagrangian function

L(x, λ_1) = f(x) + λ_1 c_1^eq(x),

for which

∇_x L = ∇_x f(x) + λ_1 ∇_x c_1^eq(x).

Hence, at the solution x* there exists a scalar λ_1* such that ∇_x L(x*, λ_1*) = 0. Note that the condition ∇_{λ_1} L = 0 recovers the constraint c_1^eq(x) = 0. This observation suggests that we can search for solutions of the equality-constrained problem by searching for stationary points of the Lagrangian function. The scalar quantity λ_1 is called a Lagrange multiplier for the constraint c_1^eq(x) = 0.

The condition above appears to be necessary for an optimal solution of the problem, but it is clearly not sufficient. For example, the point [1, 1] also satisfies it in our example, but it is the maximum of the function under the constraint.

Back to our general equality-constrained problem, which we write as

min_{x ∈ R^n} f(x)   subject to   c^eq(x) = 0,   (10)

where c^eq(x) : R^n → R^{m^eq} is the constraints vector function, that is

c^eq(x) = [c_1^eq(x) ; c_2^eq(x) ; ... ; c_{m^eq}^eq(x)] = 0.

To obtain the optimality conditions we first make an assumption on a point x* regarding the constraints.

Definition 4 (LICQ for equality-constrained optimization). Given the point x*, we say that the linear independence constraint qualification (LICQ) holds if the set of constraint gradients {∇c_i^eq(x*)} is linearly independent, or equivalently, if the Jacobian of the constraints vector, J^eq(x*), is a full-rank matrix.

Note that if this condition holds, none of the constraint gradients can be zero. This assumption comes to simplify our derivations and is not really necessary from a practical point of view. For example, it breaks if we write one of the constraints twice; this clearly should not have any practical influence on our ability to solve problems.

A generalization of our conclusion from the previous example is as follows. In order for x to be a stationary point, there should not be a direction d such that ⟨∇f(x), d⟩ < 0 and

⟨∇c_i^eq(x), d⟩ = 0 for i = 1, ..., m^eq. In other words, the gradient cannot have a component that is orthogonal to all the constraint gradients. This will happen only if ∇f(x) is a linear combination of the constraint gradients:

∇f(x) + Σ_{i=1}^{m^eq} λ_i ∇c_i^eq(x) = ∇f(x) + (J^eq)ᵀ λ = 0,

where

J^eq = [∇c_1^eq(x)ᵀ ; ∇c_2^eq(x)ᵀ ; ... ; ∇c_{m^eq}^eq(x)ᵀ].

Consequently, given an equality-constrained problem (10), we write its Lagrangian function as

L(x, λ) = f(x) + λᵀ c^eq(x),

where the vector λ ∈ R^{m^eq} is the Lagrange multipliers vector.

Theorem 2 (Lagrange multipliers for equality-constrained minimization). Let x* be a local solution of (10), in particular satisfying the equality constraints c^eq(x*) = 0 and the LICQ condition. Then there exists a unique vector λ* ∈ R^{m^eq}, called the Lagrange multipliers vector, which satisfies

∇_x L(x*, λ*) = ∇f(x*) + (J^eq)ᵀ λ* = 0.

We will not prove this theorem, but note that ∇_λ L(x*, λ*) = 0 ⟺ c^eq(x*) = 0, which allows us to formulate a method for solving equality-constrained optimization.

Back to our example of a single equality constraint:

min_{x ∈ R^n} x_1 + x_2   subject to   x_1² + x_2² − 2 = 0.   (11)

The Lagrangian of this problem is L(x, λ_1) = x_1 + x_2 + λ_1(x_1² + x_2² − 2), and the solution of

this problem is given by

∇L = 0  ⟹  { 1 + 2λ_1 x_1 = 0,  1 + 2λ_1 x_2 = 0,  x_1² + x_2² − 2 = 0 }
         ⟹  x_1 = −1/(2λ_1),  x_2 = −1/(2λ_1),  1/(4λ_1²) + 1/(4λ_1²) = 2  ⟹  λ_1 = ±1/2,  x_1 = x_2 = ∓1.

As discussed before, we get two stationary points of the Lagrangian. How do we decide which one of them is a local minimum? The following theorem says that the Hessian of L with respect to x has to be positive semi-definite with respect to the directions that satisfy the constraints (that is, the directions which are orthogonal to the constraint gradients).

Theorem 3 (2nd order necessary conditions for equality-constrained minimization). Let x* be a minimum solution and let λ* be a corresponding Lagrange multiplier vector satisfying ∇L(x*, λ*) = 0. Also assume that the LICQ condition is satisfied. Then the following must hold:

yᵀ ∇²_x L(x*, λ*) y ≥ 0   for all y ∈ R^n such that J^eq y = 0.

In our example, L(x, λ_1) = x_1 + x_2 + λ_1(x_1² + x_2² − 2), and

∇²_x L(x, λ) = [2λ_1 , 0 ; 0 , 2λ_1],   ∇²_x L(x, λ_1 = ±1/2) = [±1 , 0 ; 0 , ±1].

It follows that the minimum is obtained at λ_1* = 1/2, where ∇²_x L(x*, λ*) = I ≻ 0. This matrix is positive definite with respect to any vector, and in particular with respect to those vectors that are orthogonal to the constraint gradient. At the other point (λ_1 = −1/2), we encounter the case of a negative definite Hessian, and hence we have a local maximum.
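A minimal numerical check (a sketch, not part of the original notes) of the stationarity condition of Example 6 at the candidate solution x* = [−1, −1] with λ_1* = 1/2:

using LinearAlgebra

gradf(x) = [1.0; 1.0]                  # gradient of f(x) = x1 + x2
c(x)     = x[1]^2 + x[2]^2 - 2         # the single equality constraint
gradc(x) = [2*x[1]; 2*x[2]]

xstar, lambda1 = [-1.0; -1.0], 0.5
println(norm(gradf(xstar) + lambda1*gradc(xstar)))  # grad_x L = 0 at the solution
println(c(xstar))                                   # the constraint is satisfied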

3.2 The KKT optimality conditions for general constrained optimization

Suppose now that we are considering the general case of constrained optimization, allowing inequality constraints as well. Recall the following problem definition:

min_{x ∈ R^n} f(x)   subject to   c_j^eq(x) = 0, j = 1, ..., m^eq;   c_l^ieq(x) ≤ 0, l = 1, ..., m^ieq.   (12)

It turns out that for this problem we should make a distinction between active inequality constraints and inactive inequality constraints. Assume a local solution x*. The active constraints are those constraints for which c_l^ieq(x*) = 0, and the inactive constraints are those constraints for which c_l^ieq(x*) < 0.

Example 7. Consider the following problem, which is similar to the previous one, only now we have an inequality constraint:

min_{x ∈ R^n} x_1 + x_2   subject to   x_1² + x_2² − 2 ≤ 0.   (13)

Similarly to the previous problem, the solution of this problem lies at the point [−1, −1], where the constraint is active. The Lagrangian is again L(x, λ_1) = x_1 + x_2 + λ_1(x_1² + x_2² − 2), and the minimum is obtained at a point satisfying ∇_x L(x*, λ_1*) = 0. But now, it turns out that the sign of the Lagrange multiplier λ_1* = 1/2 has a significant role.

Similarly to before, we require that any feasible search direction d satisfies

c_1^ieq(x + d) ≈ c_1^ieq(x) + ⟨∇c_1^ieq(x), d⟩ ≤ 0.

If the constraint c_1^ieq is inactive at the point x, we can move in any direction, and in particular we can move in the descent direction −∇f(x) and decrease the objective. However, if the constraint is active, then c_1^ieq(x) = 0, and to satisfy the constraint we have the condition

⟨∇c_1^ieq(x), d⟩ ≤ 0.   (14)

In order for d to be a legal descent direction it needs to satisfy both ⟨∇f(x), d⟩ < 0 and (14). If we wish that d is not a descent direction, we need the inner products not to be both

negative. For this we must require

∇f + λ_1 ∇c_1^ieq(x*) = 0,   with λ_1 ≥ 0.

From the example above we learn that active inequality constraints act as equality constraints, but require the Lagrange multiplier to be non-negative at stationary points. If we have multiple inequality constraints {c_l^ieq(x) ≤ 0}_{l=1}^{m^ieq}, then we require that for a descent direction d, each constraint which is active satisfies ⟨∇c_l^ieq(x), d⟩ ≤ 0. But if

∇f + Σ_{l active} λ_l ∇c_l^ieq(x*) = 0,   with λ_l ≥ 0,

then

⟨∇f(x), d⟩ + Σ_{l active} λ_l ⟨∇c_l^ieq(x*), d⟩ = 0,

and ⟨∇f(x), d⟩ ≥ 0 must hold. This means that d cannot be a descent direction under these conditions. We will now state the necessary conditions for a solution of a general constrained minimization problem.

Definition 5 (Active set). The active set A(x) at any feasible x is the set of inequality constraints which are satisfied with exact equality, {l : c_l^ieq(x) = 0}.

We define the Lagrangian of the problem (12) by

L(x, λ^eq, λ^ieq) = f(x) + (λ^eq)ᵀ c^eq(x) + (λ^ieq)ᵀ c^ieq(x),

where λ^eq ∈ R^{m^eq} and λ^ieq ∈ R^{m^ieq} are the Lagrange multiplier vectors for the equality and inequality constraints respectively. The next theorem states the first-order necessary conditions, known as the Karush-Kuhn-Tucker (KKT) conditions.

Theorem 4 (First-Order Necessary Conditions (KKT)). Suppose that x* is a local solution of (12), and that the LICQ holds at x*. Then there are Lagrange multiplier vectors λ^eq ∈ R^{m^eq}

and λ^ieq ∈ R^{m^ieq}, such that the following hold:

∇_x L(x*, λ^eq, λ^ieq) = ∇f(x*) + (J^eq)ᵀ λ^eq + (J^ieq)ᵀ λ^ieq = 0   (15)
c^eq(x*) = 0   (16)
c^ieq(x*) ≤ 0   (17)
λ^ieq ≥ 0   (18)
λ_l^ieq c_l^ieq(x*) = 0  for l = 1, ..., m^ieq   (complementary slackness)   (19)

The conditions (17)-(19) may be replaced with λ_l^ieq > 0 for the active constraints l ∈ A(x*) and λ_l^ieq = 0 for the inactive ones.

To know whether a stationary point is a minimum or a maximum we have the following theorem.

Theorem 5 (2nd order necessary conditions for general constrained minimization). Let x* be a minimum solution satisfying the KKT conditions and let λ^eq, λ^ieq be the corresponding Lagrange multiplier vectors. Also assume that the LICQ condition is satisfied. Then the following must hold:

yᵀ ∇²_x L(x*, λ^eq, λ^ieq) y ≥ 0   for all y ∈ R^n such that J^eq y = 0 and ∇c_l^ieq(x*)ᵀ y = 0 for l ∈ A(x*).

Example 8. Consider the following problem:

min_{x ∈ R²} (x_1 − 3/2)² + (x_2 − 1/8)⁴   s.t.   x_1 + x_2 − 1 ≤ 0,  x_1 − x_2 − 1 ≤ 0,  −x_1 + x_2 − 1 ≤ 0,  −x_1 − x_2 − 1 ≤ 0.   (20)

This time, we have four inequality constraints. In principle, we should try every combination of active and inactive constraints and see if the resulting x and λ obtained from ∇_x L = 0 satisfy the KKT conditions. It turns out that the solution here is x* = [1, 0], and that the first and second constraints are active at this point. Denoting them by c_1 and c_2, we have

∇f(x*) = [−1 ; −1/128],   ∇c_1(x*) = [1 ; 1],   ∇c_2(x*) = [1 ; −1],

and the vector λ* = [129/256, 127/256, 0, 0].
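A minimal numerical check (a sketch, not part of the original notes) of the KKT conditions of Example 8 at x* = [1, 0] with λ* = [129/256, 127/256, 0, 0]:

using LinearAlgebra

gradf(x) = [2*(x[1] - 3/2); 4*(x[2] - 1/8)^3]
A = [1.0 1.0; 1.0 -1.0; -1.0 1.0; -1.0 -1.0]   # c_l(x) = (A*x)_l - 1 <= 0, so J_ieq = A
c(x) = A*x .- 1.0

xstar  = [1.0; 0.0]
lambda = [129/256; 127/256; 0.0; 0.0]

println(norm(gradf(xstar) + A'*lambda))   # stationarity (15): ≈ 0
println(maximum(c(xstar)))                # feasibility (17): <= 0
println(norm(lambda .* c(xstar)))         # complementary slackness (19): ≈ 0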

3.3 Penalty and Barrier methods

In the previous section we saw that, in order to solve a constrained optimization problem using the KKT conditions, we may need to solve several large linear systems, each time checking a different combination of active and inactive inequality constraints. This is an option for solving the problem, but it does not align with the optimization methods we have seen so far. In this section we will see two closely related approaches for solving constrained optimization - the penalty and barrier approaches. Both of these approaches convert the constrained problem into a somewhat equivalent unconstrained problem, which can be solved by SD, Newton, or any other method for unconstrained optimization.

Penalty methods. We again assume that we have the general constrained optimization problem

min_{x ∈ R^n} f(x)   subject to   c_j^eq(x) = 0, j = 1, ..., m^eq;   c_l^ieq(x) ≤ 0, l = 1, ..., m^ieq,   (21)

and now we wish to transform it into an equivalent unconstrained problem. In the penalty approach, we rewrite the problem as

min_{x ∈ R^n} f(x) + µ ( Σ_{j=1}^{m^eq} ρ_j(c_j^eq(x)) + Σ_{l=1}^{m^ieq} ρ_l(max{0, c_l^ieq(x)}) ),   (22)

where µ > 0 is a balancing penalty parameter, and {ρ_j(x)} and {ρ_l(x)} are scalar penalty functions which are bounded from below and attain their minimum at 0 (it is preferable that these are non-negative, but this is not mandatory). In addition, these functions should monotonically increase as we move away from zero. The common choice for a penalty is the quadratic function ρ(x) = x², which has a minimum at 0. Using this choice, if we let µ → ∞ then the solution of (22) will be equal to that of (21). We will focus on this choice in this course.

The problem (22) is an unconstrained optimization problem which can be solved using the methods we have learned so far. For ρ(x) = x², however, it turns out that the problem becomes more and more ill-conditioned as µ → ∞, which makes standard first-order approaches like SD and CD slow. Therefore, in practice we use a continuation strategy, where we solve the problem for iteratively increasing values of the penalty µ. That is, we define µ_0 < µ_1 < ... and iteratively solve the problem for each of those µ's, each time using the previous solution as an initial guess for the next problem. We stop when µ is large enough such that the constraints are reasonably fulfilled. That is, for our largest value of µ, it may be that, for example, one of the equality constraints satisfies c(x*) = ε. Whether ε is small enough or not will depend on the application that yielded the optimization problem.

Remark 1. Another popular choice of penalty is the exact penalty function ρ(x) = |x| (absolute value). It is popular because, unlike the quadratic function, the constraints will be exactly fulfilled for some moderate µ, and not only for µ → ∞. The downside is that this function is non-smooth and significantly complicates the optimization process. We will not consider this choice further in this chapter.

Example 9. Consider again the following problem:

min_{x ∈ R²} (x_1 − 3/2)² + (x_2 − 1/8)⁴   s.t.   x_1 + x_2 − 1 ≤ 0,  x_1 − x_2 − 1 ≤ 0,  −x_1 + x_2 − 1 ≤ 0,  −x_1 − x_2 − 1 ≤ 0.   (23)

Previously, we had four inequality constraints, and only the first two were active at the solution. Using the Lagrange multipliers approach we had to check several combinations of active/inactive constraints. We will now use the penalty version of this approach with ρ(x) = x², which in vector form becomes

min_{x ∈ R²} f_µ(x) = (x_1 − 3/2)² + (x_2 − 1/8)⁴ + µ ‖max{Ax − 1, 0}‖_2²,   (24)

where the matrix A is

A = [1 , 1 ; 1 , −1 ; −1 , 1 ; −1 , −1].

We will solve this problem using Steepest Descent (SD), and the gradient is given by

∇f_µ = [2(x_1 − 3/2) ; 4(x_2 − 1/8)³] + 2µ Aᵀ ( ((Ax − 1) > 0) ⊙ (Ax − 1) ),

where ⊙ is the Hadamard point-wise vector product (the same as the operator .* in Julia/Matlab), and ((Ax − 1) > 0) is the element-wise indicator of the violated constraints; note that ((Ax − 1) > 0) ⊙ (Ax − 1) = max{Ax − 1, 0}.

using PyPlot; close("all");
xx = -0.5:0.01:1.5
yy = -1.0:0.01:1.0;
m = length(xx);
X = repmat(xx,1,length(yy))';
Y = repmat(yy,1,length(xx));
ovec = ones(4);
A = [1.0 1 ; 1 -1 ; -1 1 ; -1 -1];
f = (x,mu)->(x[1]-3/2)^2 + (x[2]-1/8)^4 + mu*norm(max(A*x - ovec,0.0)).^2;
g = (x,mu)->[2*(x[1]-3/2) ; 4*(x[2]-1/8).^3] + mu*2*A'*(((A*x - ovec).>0.0).*(A*x - ovec));
F = (X-3/2).^2 + (Y-1/8).^4
C = (max(X + Y - 1,0).^2 + max(-X + Y - 1,0).^2 + max(-X - Y - 1,0).^2 + max(X - Y - 1,0).^2);
figure();
mu = 0.0;  Fc = F + mu*C; subplot(3,2,1); contour(X,Y,Fc,200);  #hold on;
axis image; xlabel("x"); ylabel("y"); title(string("Penalty for mu = ",mu))
mu = 0.01; Fc = F + mu*C; subplot(3,2,2); contour(X,Y,Fc,200);  #hold on;
axis image; xlabel("x"); ylabel("y"); title(string("Penalty for mu = ",mu))
mu = 0.1;  Fc = F + mu*C; subplot(3,2,3); contour(X,Y,Fc,200);  #hold on;
axis image; xlabel("x"); ylabel("y"); title(string("Penalty for mu = ",mu))
mu = 1;    Fc = F + mu*C; subplot(3,2,4); contour(X,Y,Fc,200);  #hold on;
axis image; xlabel("x"); ylabel("y"); title(string("Penalty for mu = ",mu))
mu = 10;   Fc = F + mu*C; subplot(3,2,5); contour(X,Y,Fc,500);  #hold on;
axis image; xlabel("x"); ylabel("y"); title(string("Penalty for mu = ",mu))
mu = 100;  Fc = F + mu*C; subplot(3,2,6); contour(X,Y,Fc,1000); #hold on;
axis image; xlabel("x"); ylabel("y"); title(string("Penalty for mu = ",mu))

Figure 9: The influence of the penalty parameter on the objective. As µ grows, the feasible set is more and more obvious in the function values.

## Armijo parameters:
alpha0 = 1.0; beta = 0.5; c = 1e-1;
## See the Armijo linesearch function in previous examples.

## Starting point
x_SD = [-0.5;0.8];
muVec = [0.1; 1.0 ; 10.0 ; 100.0]
fvals = []; cvals = [];
figure();
## SD Iterations
for jmu = 1:length(muVec)
    mu = muVec[jmu]
    fmu = (x)->f(x,mu);
    println("mu = ",mu)
    Fc = F + mu*C;
    subplot(2,2,jmu); contour(X,Y,Fc,1000); #hold on;
    axis image; xlabel("x"); ylabel("y");
    title(string("SD for mu = ",mu))
    for k=1:50
        gk = g(x_SD,mu);
        if norm(gk) < 1e-3
            break;
        end
        d_SD = -gk;
        x_prev = copy(x_SD);
        alpha_SD = linesearch(fmu,x_SD,d_SD,gk,0.5,beta,c);
        x_SD = x_SD + alpha_SD*d_SD;
        plot([x_SD[1];x_prev[1]],[x_SD[2];x_prev[2]],"g",linewidth=2.0);
        println(x_SD,",",norm(gk),",",alpha_SD,",",f(x_SD,mu));
        fvals = [fvals;f(x_SD,mu)];
        cvals = [cvals;norm(max(A*x_SD - ovec,0.0))]
    end;
end
figure(); subplot(1,2,1); plot(fvals); title("function values")
subplot(1,2,2); semilogy(cvals,"*"); title("constraints violation values max(Ax-1,0)")

Barrier methods

We will now see the sister of the penalty method, which offers a different penalty approach. This approach is relevant only for inequality constraints, and is used when we wish to require that the iterates remain strictly inside the feasible domain. The idea is to choose a barrier function that goes to ∞ as we move towards the boundaries of the feasible domain. Unlike the penalty approach, we do not even define the objective outside the feasible domain, and we do not allow our iterates to go there. There are two main barrier functions: the log-barrier function and the inverse-barrier function.

Figure 10: The convergence paths of the SD method for minimizing the objective for various penalty parameters. Each solve is initialized with the solution for the previous, smaller penalty parameter µ.

Figure 11: The function value and constraints violation for the SD iterations. At µ = 100 we get about three digits of accuracy for the constraints.

Suppose we wish to enforce a constraint φ(x) ≥ 0; then the two following functions can be used:

b_log(x) = −log(φ(x)),   b_inv(x) = 1/φ(x).

Figure 12 shows an example of the two barrier functions. Using these functions, the minimizer will only be inside the domain, and as we apply the iterates we have to make sure that the next step is always inside the domain. If it is not, we reduce the step-length parameter until the next iterate is inside the domain (say, in an iterative manner similar to the Armijo process). Since these functions go to ∞ at the boundary, the barrier method is not too sensitive to the choice of the penalty parameter, but note that: 1) these barrier functions are not 0 inside the domain, and hence influence the solution even if the constraint is not active; 2) these are highly non-quadratic functions with quite large high-order derivatives (the third derivative, for example), and an optimization process which works best on quadratic penalties usually struggles with these functions.

Figure 12: Two barrier functions for the constraint x − 2 ≥ 0. Both of the functions go to ∞ as x − 2 → 0. The log barrier is bounded from below only on a closed region, so it is not suitable for cases where only a one-sided barrier is used.
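A minimal sketch (not part of the original notes): minimizing f(x) = x subject to x − 2 ≥ 0 with the log-barrier b_log(x) = −log(x − 2) and a weight t that is gradually decreased. For each t the exact minimizer of x − t log(x − 2) is x = 2 + t, so the iterates approach the constrained solution x* = 2 as t → 0; the backtracking keeps every iterate strictly feasible:

fbar(x, t) = x - t*log(x - 2)              # objective plus weighted log-barrier
gbar(x, t) = 1 - t/(x - 2)                 # its derivative

function barrier_descent(x0)
    x = x0                                  # strictly feasible starting point
    for t in (1.0, 0.1, 0.01, 0.001)
        for k = 1:100
            d = -gbar(x, t)
            alpha = 1.0
            # shrink the step until the next iterate is strictly feasible and decreases fbar
            while x + alpha*d <= 2 || fbar(x + alpha*d, t) > fbar(x, t)
                alpha *= 0.5
            end
            x += alpha*d
        end
        println("t = $t,  x = $x")          # close to 2 + t
    end
    return x
end

barrier_descent(3.0)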

3.4 The projected Steepest Descent method

The projected Steepest Descent method is an attempt to avoid all the difficulties involved with the penalty and barrier methods. In this method we assume that the iterates x^(k) are inside or on the boundary of the feasible domain. At each iteration we take a Steepest Descent step to obtain a point y, followed by a projection of y onto the feasible domain; that is, we replace y with the closest point that fulfils the constraints. This is not always an easy task, and therefore the method is most useful when the constraints are simple (linear, for example). More explicitly, assume the problem

min_{x ∈ Ω} f(x),

where Ω is a subset of R^n. Given a point y, we define the projection operator Π_Ω to be the solution of the following problem:

Π_Ω(y) = arg min_{x ∈ Ω} ‖x − y‖,

for some vector norm (in most cases the squared l_2 norm is chosen). The projected SD method is then defined by

x^(k+1) = Π_Ω( x^(k) − α ∇f(x^(k)) ),

where α is obtained by linesearch.
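A minimal sketch (not part of the original notes) of projected steepest descent for a simple quadratic objective over the box Ω = [0, 1]², where the projection is just an element-wise clamp; a fixed step size is used here instead of a linesearch for simplicity:

fq(x)   = (x[1] - 2)^2 + (x[2] - 0.5)^2
gq(x)   = [2*(x[1] - 2); 2*(x[2] - 0.5)]
proj(y) = clamp.(y, 0.0, 1.0)              # projection onto the box (closest point in Omega)

function projected_sd(x0; alpha=0.1, iters=100)
    x = copy(x0)
    for k = 1:iters
        x = proj(x - alpha*gq(x))          # SD step followed by projection onto Omega
    end
    return x
end

println(projected_sd([0.0; 0.0]))          # expected: [1.0, 0.5] (the first bound is active)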