AM 205: lecture 19. Last time: Conditions for optimality, Newton s method for optimization Today: survey of optimization methods

Similar documents
AM 205: lecture 19. Last time: Conditions for optimality Today: Newton s method for optimization, survey of optimization methods

5 Handling Constraints

Constrained Optimization

Nonlinear Optimization: What s important?

Scientific Computing: An Introductory Survey

Scientific Computing: An Introductory Survey

Algorithms for Constrained Optimization

Optimization Problems with Constraints - introduction to theory, numerical Methods and applications

Outline. Scientific Computing: An Introductory Survey. Optimization. Optimization Problems. Examples: Optimization Problems

Quasi-Newton Methods

Newton s Method. Ryan Tibshirani Convex Optimization /36-725

Optimization Methods

Numerical Optimization Professor Horst Cerjak, Horst Bischof, Thomas Pock Mat Vis-Gra SS09

5 Quasi-Newton Methods

Scientific Computing: Optimization

Optimization and Root Finding. Kurt Hornik

Algorithms for constrained local optimization

Lecture V. Numerical Optimization

Constrained Optimization and Lagrangian Duality

Lecture 18: Optimization Programming

Scientific Computing: An Introductory Survey

1 Computing with constraints

Nonlinear Programming

Multidisciplinary System Design Optimization (MSDO)

Optimization. Escuela de Ingeniería Informática de Oviedo. (Dpto. de Matemáticas-UniOvi) Numerical Computation Optimization 1 / 30

Chapter 3 Numerical Methods

Lecture 3. Optimization Problems and Iterative Algorithms

ISM206 Lecture Optimization of Nonlinear Objective with Linear Constraints

CONSTRAINED NONLINEAR PROGRAMMING

Newton s Method. Javier Peña Convex Optimization /36-725

2.098/6.255/ Optimization Methods Practice True/False Questions

MS&E 318 (CME 338) Large-Scale Numerical Optimization

Lecture 14: October 17

What s New in Active-Set Methods for Nonlinear Optimization?

Optimisation in Higher Dimensions

Linear Programming: Simplex

4TE3/6TE3. Algorithms for. Continuous Optimization

Unconstrained optimization

Review for Exam 2 Ben Wang and Mark Styczynski

Quasi-Newton methods: Symmetric rank 1 (SR1) Broyden Fletcher Goldfarb Shanno February 6, / 25 (BFG. Limited memory BFGS (L-BFGS)

Primal-Dual Interior-Point Methods for Linear Programming based on Newton s Method

2.3 Linear Programming

Programming, numerics and optimization

Determination of Feasible Directions by Successive Quadratic Programming and Zoutendijk Algorithms: A Comparative Study

Interior-Point Methods for Linear Optimization

Lecture 15: SQP methods for equality constrained optimization

ICS-E4030 Kernel Methods in Machine Learning

Convex Optimization. Problem set 2. Due Monday April 26th

Chapter 4. Unconstrained optimization

Lecture 13: Constrained optimization

EAD 115. Numerical Solution of Engineering and Scientific Problems. David M. Rocke Department of Applied Science

Lecture 11 and 12: Penalty methods and augmented Lagrangian methods for nonlinear programming

Generalization to inequality constrained problem. Maximize

Computational Finance

1.2 Derivation. d p f = d p f(x(p)) = x fd p x (= f x x p ). (1) Second, g x x p + g p = 0. d p f = f x g 1. The expression f x gx

Chapter 2. Optimization. Gradients, convexity, and ALS

Lecture 7 Unconstrained nonlinear programming

Algorithms for nonlinear programming problems II

Constrained Nonlinear Optimization Algorithms

E5295/5B5749 Convex optimization with engineering applications. Lecture 8. Smooth convex unconstrained and equality-constrained minimization

Numerical Optimization

Constrained optimization: direct methods (cont.)

Support Vector Machines: Maximum Margin Classifiers


Lectures 9 and 10: Constrained optimization problems and their optimality conditions

SF2822 Applied Nonlinear Optimization. Preparatory question. Lecture 9: Sequential quadratic programming. Anders Forsgren

CS 542G: Robustifying Newton, Constraints, Nonlinear Least Squares

Nonlinear optimization

Numerical optimization

10 Numerical methods for constrained problems

CS-E4830 Kernel Methods in Machine Learning

5. Duality. Lagrangian

AM 205: lecture 18. Last time: optimization methods Today: conditions for optimality

Improving the Convergence of Back-Propogation Learning with Second Order Methods

LINEAR AND NONLINEAR PROGRAMMING

Part 4: Active-set methods for linearly constrained optimization. Nick Gould (RAL)

A Trust-region-based Sequential Quadratic Programming Algorithm

arxiv: v1 [math.oc] 10 Apr 2017

Solution Methods. Richard Lusby. Department of Management Engineering Technical University of Denmark

Numerical Optimization

Infeasibility Detection and an Inexact Active-Set Method for Large-Scale Nonlinear Optimization

On nonlinear optimization since M.J.D. Powell

Interior Point Methods. We ll discuss linear programming first, followed by three nonlinear problems. Algorithms for Linear Programming Problems

Constrained optimization. Unconstrained optimization. One-dimensional. Multi-dimensional. Newton with equality constraints. Active-set method.

Constrained optimization

6.252 NONLINEAR PROGRAMMING LECTURE 10 ALTERNATIVES TO GRADIENT PROJECTION LECTURE OUTLINE. Three Alternatives/Remedies for Gradient Projection

Applications of Linear Programming

Linear programming II

Numerical Optimization of Partial Differential Equations

2. Quasi-Newton methods

MATH 3795 Lecture 13. Numerical Solution of Nonlinear Equations in R N.

CS711008Z Algorithm Design and Analysis

An Introduction to Algebraic Multigrid (AMG) Algorithms Derrick Cerwinsky and Craig C. Douglas 1/84

OPER 627: Nonlinear Optimization Lecture 14: Mid-term Review

Some new facts about sequential quadratic programming methods employing second derivatives

Numerical optimization. Numerical optimization. Longest Shortest where Maximal Minimal. Fastest. Largest. Optimization problems

Algorithms for nonlinear programming problems II

Computational Optimization. Augmented Lagrangian NW 17.3

Convex Optimization CMU-10725

Lecture 6: Conic Optimization September 8

Transcription:

AM 205: lecture 19 Last time: Conditions for optimality, Newton s method for optimization Today: survey of optimization methods

Quasi-Newton Methods General form of quasi-newton methods: x k+1 = x k α k B 1 k f (x k) where α k is a line search parameter and B k is some approximation to the Hessian Quasi-Newton methods generally lose quadratic convergence of Newton s method, but often superlinear convergence is achieved We now consider some specific quasi-newton methods

BFGS The Broyden Fletcher Goldfarb Shanno (BFGS) method is one of the most popular quasi-newton methods: 1: choose initial guess x 0 2: choose B 0, initial Hessian guess, e.g. B 0 = I 3: for k = 0, 1, 2,... do 4: solve B k s k = f (x k ) 5: x k+1 = x k + s k 6: y k = f (x k+1 ) f (x k ) 7: B k+1 = B k + B k 8: end for where B k y ky T k y T k s k B ks k s T k B k s T k B ks k

BFGS See lecture: derivation of the Broyden root-finding algorithm See lecture: derivation of the BFGS algorithm Basic idea is that B k accumulates second derivative information on successive iterations, eventually approximates H f well

BFGS Actual implementation of BFGS: store and update inverse Hessian to avoid solving linear system: 1: choose initial guess x 0 2: choose H 0, initial inverse Hessian guess, e.g. H 0 = I 3: for k = 0, 1, 2,... do 4: calculate s k = H k f (x k ) 5: x k+1 = x k + s k 6: y k = f (x k+1 ) f (x k ) 7: H k+1 = H k 8: end for where H k (I s k ρ k y t k )H k(i ρ k y k s T k ) + ρ ks k s T k, ρ k = 1 y t k s k

BFGS BFGS is implemented as the fmin bfgs function in scipy.optimize Also, BFGS (+ trust region) is implemented in Matlab s fminunc function, e.g. x0 = [5;5]; options = optimset( GradObj, on ); [x,fval,exitflag,output] =... fminunc(@himmelblau_function,x0,options);

Conjugate Gradient Method The conjugate gradient (CG) method is another alternative to Newton s method that does not require the Hessian: 1: choose initial guess x 0 2: g 0 = f (x 0 ) 3: x 0 = g 0 4: for k = 0, 1, 2,... do 5: choose η k to minimize f (x k + η k s k ) 6: x k+1 = x k + η k s k 7: g k+1 = f (x k+1 ) 8: β k+1 = (gk+1 T g k+1)/(gk T g k) 9: s k+1 = g k+1 + β k+1 s k 10: end for

Constrained Optimization

Equality Constrained Optimization We now consider equality constrained minimization: min f (x) subject to g(x) = 0, x Rn where f : R n R and g : R n R m With the Lagrangian L(x, λ) = f (x) + λ T g(x), we recall from that necessary condition for optimality is [ ] f (x) + J T L(x, λ) = g (x)λ = 0 g(x) Once again, this is a nonlinear system of equations that can be solved via Newton s method

Sequential Quadratic Programming To derive the Jacobian of this system, we write [ f (x) + m L(x, λ) = k=1 λ k g k (x) g(x) ] R n+m Then we need to differentiate wrt to x R n and λ R m For i = 1,..., n, we have ( L(x, λ)) i = f (x) x i + m k=1 λ k g k (x) x i Differentiating wrt x j, for i, j = 1,..., n, gives x j ( L(x, λ)) i = 2 f (x) x i x j + m k=1 λ k 2 g k (x) x i x j

Sequential Quadratic Programming Hence the top-left n n block of the Jacobian of L(x, λ) is B(x, λ) H f (x) + m λ k H gk (x) R n n k=1 Differentiating ( L(x, λ)) i wrt λ j, for i = 1,..., n, j = 1,..., m, gives ( L(x, λ)) i = g j(x) λ j x i Hence the top-right n m block of the Jacobian of L(x, λ) is J g (x) T R n m

Sequential Quadratic Programming For i = n + 1,..., n + m, we have ( L(x, λ)) i = g i (x) Differentiating ( L(x, λ)) i wrt x j, for i = n + 1,..., n + m, j = 1,..., n, gives x j ( L(x, λ)) i = g i(x) x j Hence the bottom-left m n block of the Jacobian of L(x, λ) is J g (x) R m n... and the final m m bottom right block is just zero (differentiation of g i (x) w.r.t. λ j )

Sequential Quadratic Programming Hence, we have derived the following Jacobian matrix for L(x, λ): [ ] B(x, λ) J T g (x) R (m+n) (m+n) J g (x) 0 Note the 2 2 block structure of this matrix (matrices with this structure are often called KKT matrices 1 ) 1 Karush, Kuhn, Tucker: did seminal work on nonlinear optimization

Sequential Quadratic Programming Therefore, Newton s method for L(x, λ) = 0 is: [ B(xk, λ k ) Jg T ] [ ] [ (x k ) sk f (xk ) + Jg = T (x k )λ k J g (x k ) 0 g(x k ) δ k ] for k = 0, 1, 2,... Here (s k, δ k ) R n+m is the k th Newton step

Sequential Quadratic Programming Now, consider the constrained minimization problem, where (x k, λ k ) is our Newton iterate at step k: { } 1 min s 2 st B(x k, λ k )s + s T ( f (x k ) + Jg T (x k )λ k ) subject to J g (x k )s + g(x k ) = 0 The objective function is quadratic in s (here x k, λ k are constants) This minimization problem has Lagrangian L k (s, δ) 1 2 st B(x k, λ k )s + s T ( f (x k ) + J T g (x k )λ k ) + δ T (J g (x k )s + g(x k ))

Sequential Quadratic Programming Then solving L k (s, δ) = 0 (i.e. first-order necessary conditions) gives a linear system, which is the same as the kth Newton step Hence at each step of Newton s method, we exactly solve a minimization problem (quadratic objective fn., linear constraints) An optimization problem of this type is called a quadratic program This motivates the name for applying Newton s method to L(x, λ) = 0: Sequential Quadratic Programming (SQP)

Sequential Quadratic Programming SQP is an important method, and there are many issues to be considered to obtain an efficient and reliable implementation: Efficient solution of the linear systems at each Newton iteration matrix block structure can be exploited Quasi-Newton approximations to the Hessian (as in the unconstrained case) Trust region, line search etc to improve robustness Treatment of constraints (equality and inequality) during the iterative process Selection of good starting guess for λ

Penalty Methods Another computational strategy for constrained optimization is to employ penalty methods This converts a constrained problem into an unconstrained problem Key idea: Introduce a new objective function which is a weighted sum of objective function and constraint

Penalty Methods Given the minimization problem min x f (x) subject to g(x) = 0 we can consider the related unconstrained problem min x φ ρ (x) = f (x) + 1 2 ρg(x)t g(x) ( ) Let x and x ρ denote the solution of ( ) and ( ), respectively Under appropriate conditions, it can be shown that lim x ρ ρ = x

Penalty Methods In practice, we can solve the unconstrained problem for a large value of ρ to get a good approximation of x Another strategy is to solve for a sequence of penalty parameters, ρ k, where x ρ k serves as a starting guess for x ρ k+1 Note that the major drawback of penalty methods is that a large factor ρ will increase the condition number of the Hessian H φρ On the other hand, penalty methods can be convenient, primarily due to their simplicity

Linear Programming

Linear Programming As we mentioned earlier, the optimization problem min f (x) subject to g(x) = 0 and h(x) 0, ( ) x Rn with f, g, h affine, is called a linear program The feasible region is a convex polyhedron 2 Since the objective function maps out a hyperplane, its global minimum must occur at a vertex of the feasible region 2 Polyhedron: a solid with flat sides, straight edges

Linear Programming This can be seen most easily with a picture (in R 2 )

Linear Programming The standard approach for solving linear programs is conceptually simple: examine a sequence of the vertices to find the minimum This is called the simplex method Despite its conceptual simplicity, it is non-trivial to develop an efficient implementation of this algorithm We will not discuss the implementation details of the simplex method...

Linear Programming In the worst case, the computational work required for the simplex method grows exponentially with the size of the problem But this worst-case behavior is extremely rare; in practice simplex is very efficient (computational work typically grows linearly) Newer methods, called interior point methods, have been developed that are polynomial in the worst case Nevertheless, simplex is still the standard approach since it is more efficient than interior point for most problems

Linear Programming Python example: Using cvxopt, solve the linear program subject to and 0 x 1, 0 x 2, 0 x 3 min x f (x) = 5x 1 4x 2 6x 3 x 1 x 2 + x 3 20 3x 1 + 2x 2 + 4x 3 42 3x 1 + 2x 2 30 (LP solvers are efficient, main challenge is to formulate an optimization problem as a linear program in the first place!)

PDE Constrained Optimization

PDE Constrained Optimization We will now consider optimization based on a function that depends on the solution of a PDE Let us denote a parameter dependent PDE as PDE(u(p); p) = 0 p R n is a parameter vector; could encode, for example, the flow speed and direction in a convection diffusion problem u(p) is the PDE solution for a given p

PDE Constrained Optimization We then consider an output functional g, 3 which maps an arbitrary function v to R And we introduce a parameter dependent output, G(p) R, where G(p) g(u(p)) R, which we seek to minimize At the end of the day, this gives a standard optimization problem: min G(p) p R n 3 A functional is just a map from a vector space to R

PDE Constrained Optimization One could equivalently write this PDE-based optimization problem as min g(u) subject to PDE(u; p) = 0 p,u For this reason, this type of optimization problem is typically referred to as PDE constrained optimization objective function g depends on u u and p are related by the PDE constraint Based on this formulation, we could introduce Lagrange multipliers and proceed in the usual way for constrained optimization...

PDE Constrained Optimization Here we will focus on the form we introduced first: min G(p) p R n Optimization methods usually need some derivative information, such as using finite differences to approximate G(p)

PDE Constrained Optimization But using finite differences can be expensive, especially if we have many parameters: G(p) p i G(p + he i) G(p), h hence we need n + 1 evaluations of G to approximate G(p)! We saw from the Himmelblau example that supplying the gradient G(p) cuts down on the number of function evaluations required The extra function calls due to F.D. isn t a big deal for Himmelblau s function, each evaluation is very cheap But in PDE constrained optimization, each p G(p) requires a full PDE solve!

PDE Constrained Optimization Hence for PDE constrained optimization with many parameters, it is important to be able to compute the gradient more efficiently There are two main approaches: the direct method the adjoint method The direct method is simpler, but the adjoint method is much more efficient if we have many parameters

PDE Output Derivatives Consider the ODE BVP u (x; p) + r(p)u(x; p) = f (x), u(a) = u(b) = 0 which we will refer to as the primal equation Here p R n is the parameter vector, and r : R n R We define an output functional based on an integral g(v) b a σ(x)u(x)dx, for some function σ; then G(p) g(u(p)) R

The Direct Method We observe that G(p) b = σ(x) u dx p i a p i hence if we can compute u p i, i = 1, 2,..., n, then we can obtain the gradient Assuming sufficient smoothness, we can differentiate the ODE BVP wrt p i to obtain, for i = 1, 2,..., n u (x; p) + r(p) u (x; p) = r u(x; p) p i p i p i

The Direct Method Once we compute each u p i we can then evaluate G(p) by evaluating a sequence of n integrals However, this is not much better than using finite differences: We still need to solve n separate ODE BVPs (Though only the right-hand side changes, so could LU factorize the system matrix once and back/forward sub. for each i)

Adjoint-Based Method However, a more efficient approach when n is large is the adjoint method We introduce the adjoint equation: z (x; p) + r(p)z(x; p) = σ(x), z(a) = z(b) = 0

Adjoint-Based Method Now, G(p) p i = = = b a b a b a σ(x) u dx p i [ z (x; p) + r(p)z(x; p) ] u dx p i [ z(x; p) u (x; p) + r(p) u ] (x; p) dx, p i p i where the last line follows by integrating by parts twice (boundary terms vanish because u p i and z are zero at a and b) (The adjoint equation is defined based on this integration by parts relationship to the primal equation)

Adjoint-Based Method Also, recalling the derivative of the primal problem with respect to p i : u (x; p) + r(p) u (x; p) = r u(x; p), p i p i p i we get G(p) = r b z(x; p)u(x; p)dx p i p i a Therefore, we only need to solve two differential equations (primal and adjoint) to obtain G(p)! For more complicated PDEs the adjoint formulation is more complicated but the basic ideas stay the same