Chapter 4. Unconstrained optimization

Version: 28-10-2012
Material: for details see Chapter 11 in [FKS] (pp. 251-276). A reference such as L.11.2 refers to the corresponding Lemma in the book [FKS].
PDF file of the book [FKS]: Faigle/Kern/Still, Algorithmic Principles of Mathematical Programming, on: http://wwwhome.math.utwente.nl/ stillgj/priv/

4.1 Introduction

We consider the nonlinear minimization problem (with $f \in C^1$ or $f \in C^2$):

  (P)  $\min f(x)$,  $x \in \mathbb{R}^n$

Recall: usually in (nonconvex) unconstrained optimization we try to find a local minimizer. Global minimization is much more difficult.

Theoretical method (based on optimality conditions):
- Find a point $\bar{x}$ satisfying $\nabla f(\bar{x}) = 0$ (critical point).
- Check whether $\nabla^2 f(\bar{x}) \succeq 0$.

CONCEPTUAL ALGORITHM:
Choose $x_0 \in \mathbb{R}^n$. Iterate step k: given $x_k \in \mathbb{R}^n$, find a new point $x_{k+1}$ with $f(x_{k+1}) < f(x_k)$.
We hope that $x_k \to \bar{x}$ with $\bar{x}$ a local minimizer.

Def. Let $x_k \to \bar{x}$ for $k \to \infty$. The sequence $(x_k)$ has:
- linear convergence if, with a constant $0 \le C < 1$ and some $K \in \mathbb{N}$: $\|x_{k+1} - \bar{x}\| \le C \|x_k - \bar{x}\|$ for all $k \ge K$; C is the convergence factor.
- quadratic convergence if, with a constant $c \ge 0$: $\|x_{k+1} - \bar{x}\| \le c \|x_k - \bar{x}\|^2$ for all $k \in \mathbb{N}$.
- superlinear convergence if $\lim_{k \to \infty} \frac{\|x_{k+1} - \bar{x}\|}{\|x_k - \bar{x}\|} = 0$.

4.2 General descent method ($f \in C^1$)

Def. A vector $d_k \in \mathbb{R}^n$ is called a descent direction for f at $x_k$ if
  $\nabla f(x_k)^T d_k < 0$   (*)

Rem. If (*) holds, then $f(x_k + t d_k) < f(x_k)$ for $t > 0$ small.

Abbreviation: $g(x) := \nabla f(x)$, $g_k := g(x_k)$.

Conceptual DESCENT METHOD:
Choose a starting point $x_0 \in \mathbb{R}^n$ and $\varepsilon > 0$.
Iterate step k: given $x_k \in \mathbb{R}^n$, proceed as follows:
- If $\|g(x_k)\| < \varepsilon$, stop with $\bar{x} \approx x_k$.
- Choose a descent direction $d_k$ at $x_k$: $g_k^T d_k < 0$.
- Find a solution $t_k$ of the (one-dimensional) minimization problem
    $\min_{t \ge 0} f(x_k + t d_k)$   (**)
  and put $x_{k+1} = x_k + t_k d_k$.
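A minimal Python sketch of this conceptual descent method (my own illustration, not code from [FKS]); the callbacks `direction` and `line_search`, and their names, are assumptions standing in for a concrete direction choice and a line-minimization routine (Section 4.4):

```python
import numpy as np

def descent_method(f, grad, direction, line_search, x0, eps=1e-6, max_iter=1000):
    # Conceptual descent method: stop when ||g(x_k)|| < eps, otherwise take a
    # descent step x_{k+1} = x_k + t_k d_k with g_k^T d_k < 0.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < eps:
            break
        d = direction(x, g)          # any d_k with g_k^T d_k < 0
        t = line_search(f, x, d)     # approximate solution of min_{t>=0} f(x + t d)
        x = x + t * d
    return x
```

With `direction = lambda x, g: -g` this becomes exactly the steepest descent method discussed next.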

Remark. By this descent method, minimization in $\mathbb{R}^n$ is reduced to (line) minimization in $\mathbb{R}$ (in each step k).

Steepest descent method: use as descent direction in the descent method (see Ex.11.7)
  $d_k = -\nabla f(x_k)$.

Ex.11.7 Assuming $\nabla f(x_k) \ne 0$, show that $d_k = -\nabla f(x_k)/\|\nabla f(x_k)\|$ solves the problem
  $\min_{d \in \mathbb{R}^n} \nabla f(x_k)^T d$  s.t.  $\|d\| = 1$.

Convergence behavior:
L.11.3 In the line-minimization step (**) we have $\nabla f(x_{k+1})^T d_k = 0$.
For the steepest descent method this means $d_{k+1}^T d_k = 0$ (zigzagging).

Th.11.1 Let $f \in C^1$. Apply the steepest descent method. If the iterates $x_k$ converge, i.e., $x_k \to \bar{x}$, then $\nabla f(\bar{x}) = 0$.

Ex.11.8 Given the quadratic function on $\mathbb{R}^n$, $q(x) = x^T A x + b^T x$ with $A \succ 0$, show that the minimizer $t_k$ of $\min_{t \ge 0} q(x_k + t d_k)$ is given by
  $t_k = -\dfrac{g_k^T d_k}{2 d_k^T A d_k}$.

Speed of convergence: The next example shows that, in general (even for the minimization of quadratic functions), the steepest descent method cannot be expected to converge better than linearly.

Ex.11.9 Apply the steepest descent method to
  $q(x) = x^T \begin{pmatrix} 1 & 0 \\ 0 & r \end{pmatrix} x$,  $r \ge 1$.
Then with $x_0 = (r, 1)$ it follows that
  $x_k = \left(\dfrac{r-1}{r+1}\right)^k (r, (-1)^k)$.
(Linear convergence to $\bar{x} = 0$ with factor $C = (r-1)/(r+1)$.)
HINT: Make use of [FKS, L.11.8] and apply induction w.r.t. k.
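A quick numerical check of Ex.11.9 (my own illustration, not part of [FKS]): exact-line-search steepest descent on $q(x) = x^T \mathrm{diag}(1, r)\, x$ started at $(r, 1)$ should contract the iterate norm by the factor $C = (r-1)/(r+1)$ in every step.

```python
import numpy as np

r = 5.0
A = np.diag([1.0, r])
x = np.array([r, 1.0])
C = (r - 1) / (r + 1)                    # predicted convergence factor, here 2/3

for k in range(5):
    g = 2 * A @ x                        # gradient of q(x) = x^T A x
    t = (g @ g) / (2 * (g @ A @ g))      # exact step length for the direction d = -g
    x_new = x - t * g
    print(k, np.linalg.norm(x_new) / np.linalg.norm(x), C)   # ratio equals C
    x = x_new
```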

4.3 Method of conjugate directions

Aim: find an algorithm which (at least for quadratic functions) has better convergence than steepest descent.

4.3.1 Case: $f(x) = q(x) := \frac{1}{2} x^T A x + b^T x$, $A \succ 0$ (p.d.)

Idea. Try to generate $d_k$'s such that (not only $\nabla q(x_{k+1})^T d_k = 0$ but)
  $\nabla q(x_{k+1})^T d_j = 0$ for all $0 \le j \le k$.
Then, after n steps, we have $\nabla q(x_n)^T d_j = 0$ for $0 \le j \le n-1$, and (if the $d_j$'s are linearly independent) $\nabla q(x_n) = 0$. So $x_n = -A^{-1} b$ is the minimizer of q.

L.11.4 Apply the descent method to q(x). The following are equivalent:
(i) $g_{j+1}^T d_i = 0$ for all $0 \le i \le j \le k$;
(ii) $d_j^T A d_i = 0$ for all $0 \le i < j \le k$.

Definition. Vectors $d_0, \ldots, d_{n-1} \ne 0$ are called A-conjugate (or A-orthogonal) if $d_j^T A d_i = 0$ for all $i \ne j$.

Ex. A collection of A-conjugate vectors $d_0, \ldots, d_{n-1} \ne 0$ in $\mathbb{R}^n$ is linearly independent.

Construction of A-conjugate $d_k$'s. To construct vectors satisfying the conditions in L.11.4, simply try
  $d_k = -g_k + \alpha_k d_{k-1}$.
Then $d_k^T A d_{k-1} = 0$ implies $\alpha_k = \dfrac{g_k^T A d_{k-1}}{d_{k-1}^T A d_{k-1}}$.

Th.11.3 Apply the descent method to q(x) with
  $d_k = -g_k + \alpha_k d_{k-1}$,  $\alpha_k = \dfrac{g_k^T A d_{k-1}}{d_{k-1}^T A d_{k-1}}$.
Then the $d_k$'s are A-conjugate. In particular, the algorithm stops after (at most) n steps with the unique minimizer $\bar{x} = -A^{-1} b$ of q.

Conjugate Gradient Method (CG)

INIT: Choose $x_0 \in \mathbb{R}^n$, $\varepsilon > 0$; set $d_0 := -g_0$.
ITER: WHILE $\|g_k\| \ge \varepsilon$ DO
  Determine a solution $t_k$ of the problem $\min_{t \ge 0} f(x_k + t d_k)$   (*)
  Set $x_{k+1} = x_k + t_k d_k$.
  Set $d_{k+1} = -g_{k+1} + \alpha_{k+1} d_k$.
END

Ex.11.10 Under the assumptions of Th.11.3, show that the iteration point $x_{k+1}$ is the (global) minimizer of the quadratic function q on the affine subspace
  $S_k = \{x_0 + \gamma_0 d_0 + \ldots + \gamma_k d_k \mid \gamma_0, \ldots, \gamma_k \in \mathbb{R}\}$.
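The following sketch (my own, with hypothetical test data) runs CG on a random quadratic using the exact step length from Ex.11.8 (adapted to the $\frac{1}{2} x^T A x + b^T x$ form) and $\alpha_{k+1}$ from Th.11.3; it checks finite termination at $\bar{x} = -A^{-1} b$ and the A-conjugacy of the generated directions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
M = rng.standard_normal((n, n))
A = M @ M.T + n * np.eye(n)              # A > 0 (positive definite)
b = rng.standard_normal(n)

x = np.zeros(n)
g = A @ x + b                            # gradient of q(x) = 1/2 x^T A x + b^T x
d = -g
dirs = []
for k in range(n):
    t = -(g @ d) / (d @ A @ d)           # exact line minimization
    x = x + t * d
    g = A @ x + b
    dirs.append(d)
    if np.linalg.norm(g) < 1e-12:
        break
    alpha = (g @ A @ d) / (d @ A @ d)    # alpha_{k+1} as in Th.11.3
    d = -g + alpha * d

print(np.allclose(x, -np.linalg.solve(A, b)))        # True: x_n = -A^{-1} b
print(max((abs(dirs[i] @ A @ dirs[j])
           for j in range(len(dirs)) for i in range(j)), default=0.0))  # ~0: A-conjugacy
```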

4.3.2 Case: non-quadratic functions f(x)

Note that for a quadratic function f = q we have:
  $\alpha_{k+1} = \dfrac{g_{k+1}^T A d_k}{d_k^T A d_k} = \dfrac{g_{k+1}^T (g_{k+1} - g_k)}{d_k^T (g_{k+1} - g_k)} = \dfrac{g_{k+1}^T (g_{k+1} - g_k)}{\|g_k\|^2} = \dfrac{\|g_{k+1}\|^2}{\|g_k\|^2}$

So, for non-quadratic f(x), in the CG algorithm we can use $d_{k+1} = -g_{k+1} + \alpha_{k+1} d_k$ with:
- Hestenes-Stiefel (1952): $\alpha_{k+1} = \dfrac{g_{k+1}^T (g_{k+1} - g_k)}{d_k^T (g_{k+1} - g_k)}$
- Fletcher-Reeves (1964): $\alpha_{k+1} = \dfrac{\|g_{k+1}\|^2}{\|g_k\|^2}$
- Polak-Ribière (1969): $\alpha_{k+1} = \dfrac{g_{k+1}^T (g_{k+1} - g_k)}{\|g_k\|^2}$
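A minimal nonlinear CG sketch (my own illustration): the three $\alpha$ rules above are selectable, a simple Armijo backtracking search stands in for the exact or inexact line minimization of Section 4.4, and a steepest-descent restart is added as a safeguard.

```python
import numpy as np

def nonlinear_cg(f, grad, x0, rule="PR", eps=1e-6, max_iter=500):
    x = np.asarray(x0, dtype=float)
    g = grad(x)
    d = -g
    for _ in range(max_iter):
        if np.linalg.norm(g) < eps:
            break
        if g @ d >= 0:                                    # safeguard: restart with steepest descent
            d = -g
        t = 1.0
        while f(x + t * d) > f(x) + 1e-4 * t * (g @ d):   # Armijo backtracking on f(x + t d)
            t *= 0.5
        x = x + t * d
        g_new = grad(x)
        y = g_new - g                                     # g_{k+1} - g_k
        if rule == "FR":                                  # Fletcher-Reeves
            alpha = (g_new @ g_new) / (g @ g)
        elif rule == "PR":                                # Polak-Ribiere
            alpha = (g_new @ y) / (g @ g)
        else:                                             # Hestenes-Stiefel
            alpha = (g_new @ y) / (d @ y)
        d = -g_new + alpha * d
        g = g_new
    return x
```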

Application to sparse systems $Ax = b$, $A \succ 0$

Def. $A = (a_{ij})$ is sparse if less than $\alpha\%$ of the $a_{ij}$'s are $\ne 0$, with (say) $\alpha \le 5$.

CG method: apply the CG method to $\min \frac{1}{2} x^T A x - b^T x$, with solution $\bar{x} = A^{-1} b$.

CG Method for sparse linear systems $Ax = b$, $A \succ 0$

INIT: Choose $x_0 \in \mathbb{R}^n$ and $\varepsilon > 0$; set $d_0 = -g_0$.
ITER: WHILE $\|g_k\| \ge \varepsilon$ DO
  Set $x_{k+1} = x_k + t_k d_k$ and $g_{k+1} = g_k + t_k A d_k$ with $t_k = -\dfrac{g_k^T d_k}{d_k^T A d_k}$.
  Set $d_{k+1} = -g_{k+1} + \alpha_{k+1} d_k$ with $\alpha_{k+1} = \dfrac{g_{k+1}^T g_{k+1}}{g_k^T g_k}$.
END

Rem. Complexity: about $\frac{\alpha}{100} n^2$ flops (floating point operations) per iteration.
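A sketch of this CG iteration for a sparse symmetric positive definite system (my own code following the update formulas on this slide); using scipy's sparse matrix format is one way to realize the roughly $(\alpha/100)\, n^2$ cost of the product $A d_k$ per iteration, and the tridiagonal test system is a hypothetical example.

```python
import numpy as np
import scipy.sparse as sp

def cg_sparse(A, b, x0=None, eps=1e-8, max_iter=None):
    n = b.shape[0]
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    g = A @ x - b                    # g(x) = A x - b, gradient of 1/2 x^T A x - b^T x
    d = -g
    for _ in range(max_iter or n):
        if np.linalg.norm(g) < eps:
            break
        Ad = A @ d                   # the only matrix-vector product per iteration
        t = -(g @ d) / (d @ Ad)      # step length t_k
        x = x + t * d
        g_new = g + t * Ad           # recursive gradient update g_{k+1} = g_k + t_k A d_k
        d = -g_new + ((g_new @ g_new) / (g @ g)) * d
        g = g_new
    return x

# usage sketch: a sparse SPD tridiagonal system
n = 100
A = sp.diags([-1.0, 2.0, -1.0], [-1, 0, 1], shape=(n, n), format="csr")
b = np.ones(n)
x = cg_sparse(A, b, max_iter=1000)
print(np.linalg.norm(A @ x - b))     # small residual
```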

4.4 Line minimization

In the general descent method (see Ch.4.2) we have to repeatedly solve
  $\min_{t \ge 0} h(t)$ with $h(t) = f(x_k + t d_k)$, where $h'(0) < 0$.

This can be done by:
- exact line minimization with methods from numerical analysis, e.g. bisection, golden section, Newton or secant method (see Ch.4.3, Ch.11.4.1), or
- more efficiently, by an inexact line search: the Goldstein or Goldstein-Wolfe test (see Ch.11.4.2).
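Since golden section is one of the options named above, here is a minimal golden-section sketch for $\min_{t \ge 0} h(t)$ (my own illustration); the bracket $[0, t_{\max}]$ is an assumption that would normally come from an initial bracketing phase.

```python
import math

def golden_section(h, t_max=1.0, tol=1e-6):
    # shrink the bracket [a, b] = [0, t_max] around the minimizer,
    # keeping two interior points c < d placed by the golden ratio
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0       # 1/phi ~ 0.618
    a, b = 0.0, t_max
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while b - a > tol:
        if h(c) < h(d):
            b, d = d, c                          # minimizer lies in [a, d]
            c = b - inv_phi * (b - a)
        else:
            a, c = c, d                          # minimizer lies in [c, b]
            d = a + inv_phi * (b - a)
    return 0.5 * (a + b)

# usage sketch on h(t) = f(x_k + t d_k):
# t_k = golden_section(lambda t: f(x + t * d), t_max=2.0)
```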

4.5 Newton's method

General remark: Newton's method for solving systems of nonlinear equations is one of the most important tools of applied mathematics.

Newton's iteration for solving $F(x) = 0$, $F: \mathbb{R}^n \to \mathbb{R}^n$, $F \in C^1$, a system of n equations in n unknowns $x = (x_1, \ldots, x_n)$: start with some $x_0$ and iterate
  $x_{k+1} = x_k - [\nabla F(x_k)]^{-1} F(x_k)$,  $k = 0, 1, \ldots$

Th.11.4 (local convergence of Newton's method) Given $F: \mathbb{R}^n \to \mathbb{R}^n$, $F \in C^2$, such that $F(\bar{x}) = 0$ and $\nabla F(\bar{x})$ is non-singular. Then the Newton iterates $x_k$ converge quadratically to $\bar{x}$ for any $x_0$ sufficiently close to $\bar{x}$.
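A bare-bones sketch of the Newton iteration for $F(x) = 0$ (illustration only; a practical code would add damping and a singularity check). The test system and its root are my own example.

```python
import numpy as np

def newton_system(F, jacF, x0, eps=1e-10, max_iter=50):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        Fx = F(x)
        if np.linalg.norm(Fx) < eps:
            break
        x = x - np.linalg.solve(jacF(x), Fx)   # x_{k+1} = x_k - [Jacobian(x_k)]^{-1} F(x_k)
    return x

# usage sketch: F(x, y) = (x^2 + y^2 - 1, x - y), root at (1, 1)/sqrt(2)
F = lambda v: np.array([v[0]**2 + v[1]**2 - 1.0, v[0] - v[1]])
J = lambda v: np.array([[2 * v[0], 2 * v[1]], [1.0, -1.0]])
print(newton_system(F, J, [1.0, 0.5]))
```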

Newton for solving min f(x), i.e. $F(x) := \nabla f(x) = 0$:
  $x_{k+1} = x_k - [\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$
with (local) quadratic convergence to $\bar{x}$ if $f \in C^3$, $\nabla f(\bar{x}) = 0$ and $\nabla^2 f(\bar{x})$ is non-singular.

Problems with this Newton method:
- $x_k \to \bar{x}$ with $\bar{x}$ possibly a local maximizer;
- steps $x_k \to x_{k+1}$ with increasing f.

Newton descent method for min f(x):
The Newton direction $d_k = -[\nabla^2 f(x_k)]^{-1} \nabla f(x_k)$ is a descent direction at $x_k$ ($g_k^T d_k < 0$, assuming $\nabla f(x_k) \ne 0$) if $[\nabla^2 f(x_k)]^{-1}$, or equivalently $\nabla^2 f(x_k)$, is positive definite.

Algorithm (Levenberg-Marquardt variant), step k: given $x_k \in \mathbb{R}^n$ with $g_k \ne 0$:
1. Determine $\sigma_k > 0$ such that $\nabla^2 f(x_k) + \sigma_k I \succ 0$ and compute
     $d_k = -(\nabla^2 f(x_k) + \sigma_k I)^{-1} g_k$   (*)
2. Find a minimizer $t_k$ of $\min_{t \ge 0} f(x_k + t d_k)$ and put $x_{k+1} = x_k + t_k d_k$.

Ex.11.n1 [connection with the trust region method] Consider the quadratic Taylor approximation of f near $x_k$:
  $q(x) := f(x_k) + \nabla f(x_k)^T (x - x_k) + \frac{1}{2} (x - x_k)^T \nabla^2 f(x_k) (x - x_k)$.
Compute the descent step $d_k$ according to (*) (Levenberg-Marquardt) and put $x_{k+1} = x_k + d_k$, $\tau := \|d_k\|$. Show that $x_{k+1}$ is a local minimizer of the trust region problem
  $\min q(x)$  s.t.  $\|x - x_k\| \le \tau$.
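A sketch (my own) of step 1: increase $\sigma$ until $\nabla^2 f(x_k) + \sigma I$ is positive definite, tested here by attempting a Cholesky factorization, then solve for the modified Newton direction $d_k$. Trying $\sigma = 0$ first (a pure Newton step when the Hessian is already positive definite) and the factor-of-10 growth are choices of this sketch.

```python
import numpy as np

def lm_direction(hess, g, sigma0=1e-4):
    # step 1: find sigma such that hess + sigma*I > 0, then solve for d_k
    n = g.shape[0]
    sigma = 0.0
    while True:
        try:
            L = np.linalg.cholesky(hess + sigma * np.eye(n))   # succeeds iff the matrix is > 0
            break
        except np.linalg.LinAlgError:
            sigma = sigma0 if sigma == 0.0 else 10.0 * sigma
    y = np.linalg.solve(L, -g)          # forward solve  L y   = -g
    return np.linalg.solve(L.T, y)      # backward solve L^T d = y

# the returned d satisfies (hess + sigma*I) d = -g, a descent direction
# because the shifted matrix is positive definite
```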

Disadvantages of the Newton methods:
- $\nabla^2 f(x_k)$ is needed;
- work per step: a linear system $F_k x = b_k$ has to be solved, about $n^3$ flops.

4.6 Quasi-Newton method

Find a method which only makes use of first derivatives and only needs $O(n^2)$ flops per iteration.

Consider the descent method with $d_k = -H_k g_k$ and the following desired properties for $H_k$:
(i) $H_k \succ 0$;
(ii) $H_{k+1} = H_k + E_k$ (simple update rule);
(iii) for quadratic f: conjugate directions $d_j$;
(iv) the Quasi-Newton condition $(x_{k+1} - x_k) = H_{k+1} (g_{k+1} - g_k)$.

Notation: $\delta_k := x_{k+1} - x_k$, $\gamma_k := g_{k+1} - g_k$.

Quasi-Newton Method

INIT: Choose some $x_0 \in \mathbb{R}^n$, $H_0 \succ 0$, $\varepsilon > 0$.
ITER: WHILE $\|g_k\| \ge \varepsilon$ DO
  Set $d_k = -H_k g_k$.
  Determine a solution $t_k$ of the problem $\min_{t \ge 0} f(x_k + t d_k)$.
  Set $x_{k+1} = x_k + t_k d_k$ and update $H_{k+1} = H_k + E_k$.
END

For the update $H_k + E_k$ we try, with $\alpha, \beta, \mu \in \mathbb{R}$:
  $E_k = \alpha u u^T + \beta v v^T + \mu (u v^T + v u^T)$   (*)
where $u := \delta_k$, $v := H_k \gamma_k$.
Note that $E_k$ is symmetric with rank (at most) 2.
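A sketch of the quasi-Newton loop above (my own illustration): `update` stands for any rule $H_{k+1} = H_k + E_k$, e.g. the Broyden-family formula given below, and Armijo backtracking replaces the exact line minimization; starting from $H_0 = I$ is an assumption of this sketch.

```python
import numpy as np

def quasi_newton(f, grad, x0, update, eps=1e-6, max_iter=500):
    x = np.asarray(x0, dtype=float)
    H = np.eye(x.size)                   # some H_0 > 0 (the identity is one common choice)
    g = grad(x)
    for _ in range(max_iter):
        if np.linalg.norm(g) < eps:
            break
        d = -H @ g                       # d_k = -H_k g_k, a descent direction since H_k > 0
        t = 1.0
        while f(x + t * d) > f(x) + 1e-4 * t * (g @ d):   # Armijo backtracking
            t *= 0.5
        x_new = x + t * d
        g_new = grad(x_new)
        H = update(H, x_new - x, g_new - g)   # H_{k+1} = H_k + E_k(delta_k, gamma_k)
        x, g = x_new, g_new
    return x
```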

L.11.5 Apply the Quasi-Newton method to $q(x) = \frac{1}{2} x^T A x + b^T x$, $A \succ 0$, with $E_k$ of the form (*) and $H_{k+1}$ satisfying (iv): $\delta_k = H_{k+1} \gamma_k$. Then the directions $d_j$ are A-conjugate: $d_j^T A d_i = 0$ for $0 \le i < j \le k$.

Last step in the construction of $E_k$: find $\alpha, \beta, \mu$ in (*) such that (iv) holds. This leads to the following update formula.

Broyden family: with $\Phi \in \mathbb{R}$,
  $H_{k+1} = H_k + \dfrac{\delta_k \delta_k^T}{\delta_k^T \gamma_k} - \dfrac{H_k \gamma_k \gamma_k^T H_k}{\gamma_k^T H_k \gamma_k} + \Phi\, w w^T$   (***)
where
  $w := \left( \dfrac{\delta_k}{\delta_k^T \gamma_k} - \dfrac{H_k \gamma_k}{\gamma_k^T H_k \gamma_k} \right) (\gamma_k^T H_k \gamma_k)^{\frac{1}{2}}$.

As special cases we obtain:
- $\Phi = 0$: the DFP method (1963) (Davidon, Fletcher, Powell);
- $\Phi = 1$: the BFGS method (1970) (Broyden, Fletcher, Goldfarb, Shanno).

Finally we show that property (i), $H_k \succ 0$, is preserved.

L.11.6 In the Quasi-Newton method, if we use (***) with $\Phi \ge 0$, then $H_k \succ 0 \Rightarrow H_{k+1} \succ 0$.
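A direct transcription of the Broyden-family update (***) into Python (my own sketch): $\Phi = 0$ gives DFP, $\Phi = 1$ gives BFGS, and the function can be passed as the `update` callback of the quasi-Newton driver sketched earlier. The small test at the end verifies the quasi-Newton condition (iv), $\delta_k = H_{k+1} \gamma_k$, on hypothetical data.

```python
import numpy as np

def broyden_update(H, delta, gamma, Phi=1.0):
    # Broyden family (***): Phi = 0 -> DFP, Phi = 1 -> BFGS
    dTg = delta @ gamma                  # delta_k^T gamma_k (> 0 with a proper line search)
    Hg = H @ gamma
    gHg = gamma @ Hg                     # gamma_k^T H_k gamma_k
    w = np.sqrt(gHg) * (delta / dTg - Hg / gHg)
    return (H
            + np.outer(delta, delta) / dTg
            - np.outer(Hg, Hg) / gHg
            + Phi * np.outer(w, w))

# check of the quasi-Newton condition (iv) on hypothetical data: gamma = A delta
# for a small positive definite A, so that delta^T gamma > 0
A_test = np.array([[2.0, 0.3, 0.0],
                   [0.3, 1.5, 0.2],
                   [0.0, 0.2, 1.0]])
delta = np.array([0.4, -0.1, 0.7])
gamma = A_test @ delta
H1 = broyden_update(np.eye(3), delta, gamma, Phi=1.0)
print(np.allclose(H1 @ gamma, delta))    # True: delta_k = H_{k+1} gamma_k
```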