Mathematical optimization

Mathematical optimization
Determine the best solutions to certain mathematically defined problems that are subject to constraints:
- determine optimality criteria
- determine the convergence of the solution
The advent of computers has had a great impact on the development of optimization methods.

When do we need optimization?
Examples include optimal robotic control, inverse kinematics, and optimal motion trajectories.

Optimal robotic control
Move from position 0 to position d, arriving at d with velocity 0, trading off time against energy: minimize T = time + energy.

Inverse kinematics
From a set of 3D markers, recover a pose.

Optimal motion trajectories

Optimization taxonomy
- Unconstrained: Newton-like methods, descent methods, nonlinear equations
- Constrained: linear, quadratic, nonlinear
- Discontinuous: integer, stochastic, network

Newton's methods
Topics: root estimation, minimization (one variable, multiple variables), quasi-Newton methods.

Root estimation
Find the roots of a nonlinear function, i.e. solve $C(x) = 0$. We can linearize the function as
$$C(\bar{x}) \approx C(x) + C'(x)(\bar{x} - x) = 0, \quad \text{where } C'(x) = \frac{\partial C}{\partial x},$$
and estimate the root as
$$\bar{x} = x - \frac{C(x)}{C'(x)}.$$
Geometrically, each iterate follows the tangent line: solving $C(x^{(0)}) + C'(x^{(0)})(x^{(1)} - x^{(0)}) = 0$ gives $x^{(1)}$ from $x^{(0)}$, then $x^{(2)}$ from $x^{(1)}$, and so on.

Newton's convergence theorem
Consider $C(x) = 0$ and assume $x^*$ is such a root. If $C'(x^*)$ is not zero and $C''(x)$ is continuous on an interval containing $x^*$, then:
- Local convergence: if $x^{(0)}$ is suitably close to $x^*$, Newton's method converges to $x^*$.
- Quadratic convergence: the algorithm converges quadratically, that is, $|x^{(k+1)} - x^*| \le c\,|x^{(k)} - x^*|^2$.
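
A minimal sketch of Newton root finding for a scalar function, assuming the derivative $C'(x)$ is available in closed form; the function names and tolerances are illustrative.

```python
# Newton's method for a scalar root C(x) = 0, assuming C'(x) is known analytically.
def newton_root(C, C_prime, x0, tol=1e-10, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = C(x) / C_prime(x)          # x_new = x - C(x)/C'(x)
        x = x - step
        if abs(step) < tol:               # stop when the update is tiny
            break
    return x

# Example: root of C(x) = x^2 - 2 starting from x0 = 1 (converges to sqrt(2)).
root = newton_root(lambda x: x**2 - 2, lambda x: 2*x, 1.0)
```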

Root estimation: pros and cons
Pros: quadratic convergence. Cons: sensitive to the initial guess.

Minimization
Find $x^*$ such that the nonlinear function $F(x^*)$ is a minimum. What is the simplest model that has a minimum? A quadratic: a linear model cannot have zero slope at the solution. Locally,
$$F(x^{(k)} + \delta) \approx F(x^{(k)}) + F'(x^{(k)})\,\delta + \tfrac{1}{2}F''(x^{(k)})\,\delta^2.$$
Finding the minima of $F(x)$ amounts to setting $\frac{\partial}{\partial\delta} F(x^{(k)} + \delta) = 0$, i.e. finding the roots of $F'(x)$, which gives the Newton step
$$\delta = -\frac{F'(x^{(k)})}{F''(x^{(k)})}.$$

Conditions
What are the conditions for a minimum to exist?
- Necessary conditions for a local minimum at $x^*$: $F'(x^*) = 0$ and $F''(x^*) \ge 0$.
- Sufficient conditions for an isolated (strict) minimum at $x^*$: $F'(x^*) = 0$ and $F''(x^*) > 0$.
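
A minimal sketch of one-dimensional Newton minimization, i.e. Newton root finding applied to $F'(x) = 0$; it assumes $F'$ and $F''$ are available in closed form.

```python
# One-dimensional Newton minimization: iterate x <- x - F'(x)/F''(x).
def newton_minimize_1d(F_prime, F_second, x0, tol=1e-10, max_iter=50):
    x = x0
    for _ in range(max_iter):
        delta = -F_prime(x) / F_second(x)   # Newton step on F'(x) = 0
        x = x + delta
        if abs(delta) < tol:
            break
    return x

# Example: F(x) = (x - 3)^2 + 1, so F'(x) = 2(x - 3) and F''(x) = 2; minimum at x = 3.
x_min = newton_minimize_1d(lambda x: 2*(x - 3), lambda x: 2.0, 0.0)
```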

Example
Compare $F(x) = x^2$ and $F(x) = x^4$. Which function has a strict isolated minimum at $x = 0$? Both do, but only $x^2$ satisfies the sufficient condition $F''(0) > 0$; for $x^4$, $F''(0) = 0$.

Stationary points
Many methods only locate a point $x^*$ such that $F'(x^*) = 0$. Such an $x^*$ is a stationary point, which can be one of three types: minimum, maximum, or saddle.

Multiple variables
$$F(x^{(k)} + p) \approx F(x^{(k)}) + g^T(x^{(k)})\,p + \tfrac{1}{2}\,p^T H(x^{(k)})\,p,$$
where the gradient vector and Hessian matrix are
$$g(x) = \nabla_x F = \begin{bmatrix} \dfrac{\partial F}{\partial x_1} \\ \vdots \\ \dfrac{\partial F}{\partial x_n} \end{bmatrix},
\qquad
H(x) = \nabla_{xx} F = \begin{bmatrix} \dfrac{\partial^2 F}{\partial x_1^2} & \cdots & \dfrac{\partial^2 F}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 F}{\partial x_n \partial x_1} & \cdots & \dfrac{\partial^2 F}{\partial x_n^2} \end{bmatrix}.$$
Setting the derivative with respect to $p$ to zero,
$$0 = g(x^{(k)}) + H(x^{(k)})\,p \;\Rightarrow\; p = -H(x^{(k)})^{-1} g(x^{(k)}), \qquad x^{(k+1)} = x^{(k)} + p.$$
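
A minimal sketch of the multivariate Newton iteration above, assuming callables that return the gradient $g(x)$ and Hessian $H(x)$; the solver names and tolerances are illustrative.

```python
# Multivariate Newton minimization: solve H p = -g at each iterate instead of
# explicitly forming the inverse Hessian.
import numpy as np

def newton_minimize(grad, hess, x0, tol=1e-10, max_iter=50):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        H = hess(x)
        p = np.linalg.solve(H, -g)      # Newton step p = -H^{-1} g
        x = x + p
        if np.linalg.norm(p) < tol:
            break
    return x

# Example: F(x) = (x0 - 1)^2 + 2*(x1 + 2)^2, minimum at (1, -2).
grad = lambda x: np.array([2*(x[0] - 1), 4*(x[1] + 2)])
hess = lambda x: np.array([[2.0, 0.0], [0.0, 4.0]])
x_star = newton_minimize(grad, hess, [0.0, 0.0])
```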

Multiple variables: conditions
- Necessary conditions for a local minimum at $x^*$: $g(x^*) = 0$ and $p^T H^* p \ge 0$ for all $p$ ($H^*$ is positive semi-definite).
- Sufficient conditions: $g(x^*) = 0$ and $p^T H^* p > 0$ for all $p \ne 0$ ($H^*$ is positive definite).

Positive definite matrix
The function $F$ at an arbitrary nearby point $x^{(k+1)} = x^* + p$ can be approximated by
$$F(x^{(k+1)}) = F(x^*) + g^T(x^*)\,p + \tfrac{1}{2}\,p^T H^* p = F(x^*) + \tfrac{1}{2}\,p^T H^* p \quad (\text{by } g(x^*) = 0).$$
If $x^*$ is the minimizer of $F$, then $p^T H^* p > 0$.

Finite difference Newton method
The main drawback of Newton's method is that the user must supply a formula for the Hessian matrix. Finite difference methods estimate $H^{(k)}$ by computing differences of gradient vectors: evaluate the gradient with an increment $h_i$ in each coordinate direction $e_i$. Each column of $H^{(k)}$ is
$$\frac{g(x^{(k)} + h_i e_i) - g(x^{(k)})}{h_i}.$$
How many gradient evaluations are required to update the Hessian? Rectify the symmetry of $H^{(k)}$ by $H^{(k)} \leftarrow \tfrac{1}{2}\big(H^{(k)} + H^{(k)T}\big)$. Remaining issues: the estimated Hessian might no longer be positive definite, and a linear system must still be solved to apply the inverse of the Hessian. All these problems can be addressed by quasi-Newton methods.
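
A minimal sketch of the finite-difference Hessian estimate built from gradient differences, followed by symmetrization; the step size h and function names are assumptions.

```python
# Finite-difference Hessian: one baseline gradient plus one perturbed gradient per
# coordinate direction (n + 1 gradient evaluations in total), then symmetrize.
import numpy as np

def finite_difference_hessian(grad, x, h=1e-5):
    x = np.asarray(x, dtype=float)
    n = x.size
    g0 = grad(x)                          # baseline gradient evaluation
    H = np.empty((n, n))
    for i in range(n):                    # n more evaluations, one per coordinate e_i
        e_i = np.zeros(n)
        e_i[i] = 1.0
        H[:, i] = (grad(x + h * e_i) - g0) / h
    return 0.5 * (H + H.T)                # rectify symmetry

# Example: F(x) = x0^2 + 3*x0*x1 has grad = [2*x0 + 3*x1, 3*x0] and H ≈ [[2, 3], [3, 0]].
H_est = finite_difference_hessian(lambda x: np.array([2*x[0] + 3*x[1], 3*x[0]]), [1.0, 2.0])
```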

Quasi-Newton method
Quasi-Newton methods construct a new estimate of the Hessian (here, of its inverse) using information from previous iterates: approximate $H^{(k)\,-1}$ by a symmetric positive definite matrix $\hat{H}^{(k)}$, so that $p = -\hat{H}^{(k)} g^{(k)}$ mimics the Newton step. In each iteration:
1. $p = -\hat{H}^{(k)} g^{(k)}$
2. $x^{(k+1)} = x^{(k)} + p$
3. update $\hat{H}^{(k)}$, giving $\hat{H}^{(k+1)}$

The initial matrix can be any symmetric positive definite matrix, for example $\hat{H}^{(1)} = I$. By repeated updates, the quasi-Newton method turns this arbitrary matrix into a close approximation of $H^{(k)\,-1}$. In each iteration, $\hat{H}^{(k+1)}$ is computed by augmenting $\hat{H}^{(k)}$ with the second derivative information gained on the k-th iteration. The quasi-Newton condition is
$$\hat{H}^{(k+1)} \gamma^{(k)} = p^{(k)}, \quad \text{where } \gamma^{(k)} = g^{(k+1)} - g^{(k)}.$$

A symmetric rank-one update satisfies this condition:
$$\hat{H}^{(k+1)} = \hat{H}^{(k)} + E^{(k)} = \hat{H}^{(k)} + a\,u u^T,$$
so the condition becomes $\hat{H}^{(k)}\gamma^{(k)} + a\,u u^T \gamma^{(k)} = p^{(k)}$. Choosing $u = p^{(k)} - \hat{H}^{(k)}\gamma^{(k)}$ and $a\,u^T\gamma^{(k)} = 1$ gives
$$\hat{H}^{(k+1)} = \hat{H}^{(k)} + \frac{(p - \hat{H}\gamma)(p - \hat{H}\gamma)^T}{(p - \hat{H}\gamma)^T \gamma}.$$

Optimization taxonomy (revisited): we now turn from Newton-like methods to descent methods for unconstrained problems.
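
A minimal sketch of a quasi-Newton iteration with the symmetric rank-one update of the inverse-Hessian approximation derived above; the fixed unit step (no line search) and the tiny-denominator safeguard are simplifying assumptions.

```python
# Quasi-Newton minimization with a symmetric rank-one update of H_hat ≈ H^{-1}.
import numpy as np

def quasi_newton_sr1(grad, x0, tol=1e-8, max_iter=100):
    x = np.asarray(x0, dtype=float)
    H_hat = np.eye(x.size)                      # initial symmetric positive definite matrix
    g = grad(x)
    for _ in range(max_iter):
        p = -H_hat @ g                          # step 1: p = -H_hat g
        x_new = x + p                           # step 2: x_{k+1} = x_k + p
        g_new = grad(x_new)
        gamma = g_new - g
        v = p - H_hat @ gamma                   # u = p - H_hat gamma
        denom = v @ gamma
        if abs(denom) > 1e-12:                  # skip the update if the denominator is tiny
            H_hat = H_hat + np.outer(v, v) / denom   # step 3: rank-one update
        x, g = x_new, g_new
        if np.linalg.norm(g) < tol:
            break
    return x

# Example: minimize F(x) = x0^2 + 2*x1^2 with grad = [2*x0, 4*x1]; minimum at the origin.
x_star = quasi_newton_sr1(lambda x: np.array([2*x[0], 4*x[1]]), [3.0, -2.0])
```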

Descent methods
Greatest gradient descent, conjugate direction, conjugate gradient.

Solving a large linear system Ax = b
Here A is a known square, symmetric, positive-definite matrix, b is a known vector, and x is an unknown vector. If A is dense, solve with factorization and backsubstitution. If A is sparse, solve with iterative methods (conjugate gradient).

The quadratic form
$$F(x) = \tfrac{1}{2}\,x^T A x - b^T x + c.$$
The minimizer of $F$ is also the solution to $Ax = b$, since $F'(x) = Ax - b = 0$ at the minimizer.

Greatest gradient descent
Start at an arbitrary point $x^{(0)}$ and slide down to the bottom of the paraboloid: take a series of steps $x^{(1)}, x^{(2)}, \dots$ until we are satisfied that we are close enough to the solution $x^*$. Each step goes along the direction in which $F$ decreases most quickly, $-F'(x^{(k)}) = b - Ax^{(k)}$.
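
A small numerical check, as an illustration, that for the quadratic form with symmetric positive-definite A the gradient is $Ax - b$, so the minimizer of F solves $Ax = b$; the particular A and b are arbitrary.

```python
# Verify numerically that the solution of Ax = b minimizes F(x) = 1/2 x^T A x - b^T x.
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])      # symmetric positive definite
b = np.array([1.0, -1.0])
F = lambda x: 0.5 * x @ A @ x - b @ x
x_solve = np.linalg.solve(A, b)             # solution of Ax = b
# F evaluated slightly away from x_solve is always larger, consistent with x_solve
# being the minimizer of the quadratic form.
assert all(F(x_solve + d) >= F(x_solve) for d in [np.array([0.01, 0.0]), np.array([0.0, -0.02])])
```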

Greatest gradient descent: line search
Important definitions:
- error: $e_{(k)} = x^{(k)} - x^*$
- residual: $r_{(k)} = b - Ax^{(k)} = -F'(x^{(k)}) = -Ae_{(k)}$
Think of the residual as the direction of greatest descent. The first step is $x^{(1)} = x^{(0)} + \alpha\, r_{(0)}$. But how big a step should we take? A line search is a procedure that chooses $\alpha$ to minimize $F$ along a line.

[Figure: The method of Steepest Descent. (a) Starting at the initial point, take a step in the direction of steepest descent. (b) Find the point on the intersection of these two surfaces that minimizes F. (c) This parabola is the intersection of the surfaces; the bottommost point is our target. (d) The gradient at the bottommost point is orthogonal to the gradient of the previous step.]

Optimal step size
Setting the directional derivative to zero,
$$\frac{d}{d\alpha} F(x^{(1)}) = F'(x^{(1)})^T \frac{d}{d\alpha} x^{(1)} = F'(x^{(1)})^T r_{(0)} = 0,$$
so $F'(x^{(1)})$ is orthogonal to $r_{(0)}$, i.e. $r_{(0)}^T r_{(1)} = 0$.

Optimal step size
With $x^{(k+1)} = x^{(k)} + \alpha\, r_{(k)}$ and the orthogonality condition $r_{(k)}^T r_{(k+1)} = 0$, we can derive $\alpha$ (exercise):
$$\alpha = \frac{r_{(k)}^T r_{(k)}}{r_{(k)}^T A\, r_{(k)}}.$$

Recurrence of residual
Each iteration of greatest gradient descent is:
1. $\alpha = \dfrac{r_{(k)}^T r_{(k)}}{r_{(k)}^T A\, r_{(k)}}$
2. $x^{(k+1)} = x^{(k)} + \alpha\, r_{(k)}$
3. $r_{(k+1)} = b - Ax^{(k+1)}$
The algorithm requires two matrix-vector multiplications per iteration. One multiplication can be eliminated by replacing step 3 with the recurrence $r_{(k+1)} = r_{(k)} - \alpha A r_{(k)}$; a sketch of the resulting method appears after this slide group.

Poor convergence
What is the problem with greatest descent? It can repeatedly step along the same directions, zigzagging toward the solution. Wouldn't it be nice if we could avoid traversing the same direction twice?

Conjugate direction
Pick a set of orthogonal directions $d_{(0)}, d_{(1)}, \dots, d_{(n-1)}$ and take exactly one step along each direction; the solution is then found within n steps. Two problems:
1. How do we determine these directions?
2. How do we determine the step size along each direction?
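
Before addressing those two questions, here is a minimal sketch of greatest gradient (steepest) descent for the quadratic form, using the optimal step size and the residual recurrence from above; the tolerance and iteration cap are assumptions.

```python
# Steepest descent for Ax = b with optimal step alpha = r^T r / (r^T A r),
# updating the residual recursively so only one matrix-vector product is needed.
import numpy as np

def steepest_descent(A, b, x0=None, tol=1e-10, max_iter=1000):
    x = np.zeros(b.size) if x0 is None else np.asarray(x0, dtype=float)
    r = b - A @ x                       # residual = direction of greatest descent
    for _ in range(max_iter):
        Ar = A @ r
        alpha = (r @ r) / (r @ Ar)      # optimal step from the line search
        x = x + alpha * r
        r = r - alpha * Ar              # r_(k+1) = r_(k) - alpha A r_(k)
        if np.linalg.norm(r) < tol:
            break
    return x

# Example: the same small SPD system as before; convergence degrades as A becomes
# ill-conditioned, which motivates the conjugate directions discussed next.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x = steepest_descent(A, b)
```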

Conjugate direction
Let's deal with the second problem (the step size) first. With $x^{(k+1)} = x^{(k)} + \alpha_{(k)} d_{(k)}$, use the fact that $e_{(k+1)}$ should be orthogonal to $d_{(k)}$, so that we need never step in the direction of $d_{(k)}$ again:
$$d_{(k)}^T e_{(k+1)} = 0, \qquad d_{(k)}^T \big(e_{(k)} + \alpha_{(k)} d_{(k)}\big) = 0, \qquad \alpha_{(k)} = -\frac{d_{(k)}^T e_{(k)}}{d_{(k)}^T d_{(k)}}.$$
What seems to be the problem? To compute $\alpha_{(k)}$ we need to know $e_{(k)}$, and if we knew $e_{(k)}$, the problem would already be solved!

Conjugate directions
Instead of making the search directions orthogonal, we find a set of directions that are A-orthogonal to each other. Two vectors $d_{(i)}$ and $d_{(j)}$ are A-orthogonal, or conjugate, if $d_{(i)}^T A d_{(j)} = 0$.
[Figure: a pair of A-orthogonal vectors versus a pair of orthogonal vectors.]

A-orthogonality
If we take the optimal step size along each direction,
$$\frac{d}{d\alpha} F(x^{(k+1)}) = F'(x^{(k+1)})^T \frac{d}{d\alpha} x^{(k+1)} = 0 \;\Rightarrow\; r_{(k+1)}^T d_{(k)} = 0 \;\Rightarrow\; d_{(k)}^T A\, e_{(k+1)} = 0,$$
so $e_{(k+1)}$ must be A-orthogonal to $d_{(k)}$.

Optimal step size
$e_{(k+1)}$ must be A-orthogonal to $d_{(k)}$. Using this condition (and $r_{(k)} = -Ae_{(k)}$), we can derive the step size
$$\alpha_{(k)} = \frac{d_{(k)}^T r_{(k)}}{d_{(k)}^T A\, d_{(k)}}.$$

Algorithm
Suppose we can come up with a set of A-orthogonal directions $\{d_{(k)}\}$. In each iteration:
1. Compute $d_{(k)}$.
2. $\alpha_{(k)} = \dfrac{d_{(k)}^T r_{(k)}}{d_{(k)}^T A\, d_{(k)}}$
3. $x^{(k+1)} = x^{(k)} + \alpha_{(k)} d_{(k)}$
A sketch of this method is shown below.

Why does it work?
We need to prove that $x^*$ is found in n steps if we take the step $\alpha_{(k)} d_{(k)}$ at each step. Expand the initial error in the direction basis:
$$e_{(0)} = \sum_{i=0}^{n-1} \delta_i\, d_{(i)}, \qquad d_{(j)}^T A\, e_{(0)} = \sum_{i=0}^{n-1} \delta_i\, d_{(j)}^T A\, d_{(i)} = \delta_j\, d_{(j)}^T A\, d_{(j)},$$
$$\delta_j = \frac{d_{(j)}^T A\, e_{(0)}}{d_{(j)}^T A\, d_{(j)}} = \frac{d_{(j)}^T A \big(e_{(0)} + \sum_{k=0}^{j-1} \alpha_{(k)} d_{(k)}\big)}{d_{(j)}^T A\, d_{(j)}} = \frac{d_{(j)}^T A\, e_{(j)}}{d_{(j)}^T A\, d_{(j)}} = -\alpha_{(j)},$$
so each step exactly cancels one component of the error, and $e_{(n)} = 0$.

Search directions
We know how to determine the optimal step size along each direction (second problem solved). We still need to figure out what the search directions are. What do we know about $d_{(0)}, d_{(1)}, \dots, d_{(n-1)}$? They are A-orthogonal to each other, $d_{(i)}^T A d_{(j)} = 0$, and $d_{(i)}$ is A-orthogonal to $e_{(i+1)}$.
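
A minimal sketch of the conjugate direction algorithm above. Using the eigenvectors of a symmetric positive-definite A as a ready-made A-orthogonal set is my own illustrative choice; Gram-Schmidt conjugation (next) constructs such a set from any independent vectors.

```python
# Conjugate direction method: exactly one optimal step along each A-orthogonal direction.
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
_, directions = np.linalg.eigh(A)           # eigenvector columns are mutually A-orthogonal
x = np.zeros(2)
for k in range(A.shape[0]):                 # one step per direction
    d = directions[:, k]
    r = b - A @ x                           # residual r_(k) = b - A x_(k)
    alpha = (d @ r) / (d @ A @ d)           # optimal step along d_(k)
    x = x + alpha * d
# After n = 2 steps, x solves Ax = b.
assert np.allclose(A @ x, b)
```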

Gram-Schmidt Conjugation
Suppose we have a set of linearly independent vectors $u_0, u_1, \dots, u_{n-1}$. To construct $d_{(i)}$, take $u_i$ and subtract out any components that are not A-orthogonal to the previous d vectors: $d_{(0)} = u_0$ and
$$d_{(k)} = u_k + \sum_{i=0}^{k-1} \beta_{ki}\, d_{(i)}.$$
For $k > j$, A-orthogonality of the d vectors gets rid of the summation:
$$d_{(k)}^T A d_{(j)} = u_k^T A d_{(j)} + \beta_{kj}\, d_{(j)}^T A d_{(j)} = 0 \;\Rightarrow\; \beta_{kj} = -\frac{u_k^T A d_{(j)}}{d_{(j)}^T A d_{(j)}}.$$
[Figure: Gram-Schmidt conjugation of two vectors u0 and u1 into A-orthogonal directions d_(0) and d_(1).]
What are the drawbacks of Gram-Schmidt conjugation?

Conjugate gradients
If we pick the set of u's intelligently, we might be able to save both time and space. It turns out that the residuals are an excellent choice for the u's: the residual is orthogonal to the previous search directions (residuals worked for greatest descent). Take $r_{(k)}$ and subtract out any components that are not A-orthogonal to the previous d vectors:
$$d_{(k)} = r_{(k)} + \sum_{i=0}^{k-1} \beta_{ki}\, d_{(i)}, \qquad d_{(k)}^T A d_{(j)} = r_{(k)}^T A d_{(j)} + \sum_{i=0}^{k-1} \beta_{ki}\, d_{(i)}^T A d_{(j)}, \quad j < k.$$
By A-orthogonality of the d vectors,
$$0 = r_{(k)}^T A d_{(j)} + \beta_{kj}\, d_{(j)}^T A d_{(j)} \;\Rightarrow\; \beta_{kj} = -\frac{r_{(k)}^T A d_{(j)}}{d_{(j)}^T A d_{(j)}}.$$
Each $d_{(k)}$ still requires $O(n^2)$ operations! However...
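
A minimal sketch of Gram-Schmidt A-conjugation as described above; the function name and the choice of the standard basis as the u vectors are illustrative.

```python
# Gram-Schmidt conjugation: turn linearly independent columns of U into mutually
# A-orthogonal directions by subtracting components along the previous directions.
import numpy as np

def gram_schmidt_conjugate(U, A):
    n = U.shape[1]
    D = np.zeros_like(U, dtype=float)
    for k in range(n):
        d = U[:, k].astype(float)
        for j in range(k):
            beta_kj = -(U[:, k] @ A @ D[:, j]) / (D[:, j] @ A @ D[:, j])
            d = d + beta_kj * D[:, j]
        D[:, k] = d
    return D

# Example: conjugate the standard basis with respect to a small SPD matrix.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
D = gram_schmidt_conjugate(np.eye(2), A)
assert abs(D[:, 0] @ A @ D[:, 1]) < 1e-12   # the two directions are A-orthogonal
```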

Conjugate gradient
$r_{(k)}$ is A-orthogonal to all the previous search directions except for $d_{(k-1)}$:
$$\beta_{kj} = -\frac{r_{(k)}^T A d_{(j)}}{d_{(j)}^T A d_{(j)}} = 0 \quad \text{if } j < k-1, \qquad \beta_{kj} = \frac{r_{(k)}^T r_{(k)}}{r_{(k-1)}^T r_{(k-1)}} \quad \text{if } j = k-1.$$
Proof that $r_{(k)}^T A d_{(j)} = 0$ when $j < k-1$: from $r_{(k+1)} = -A e_{(k+1)} = -A\big(e_{(k)} + \alpha_{(k)} d_{(k)}\big) = r_{(k)} - \alpha_{(k)} A d_{(k)}$ we get $r_{(j)}^T r_{(k+1)} = r_{(j)}^T r_{(k)} - \alpha_{(k)} r_{(j)}^T A d_{(k)}$, and since the residuals are mutually orthogonal,
$$r_{(j)}^T A d_{(k)} = \begin{cases} \dfrac{1}{\alpha_{(j)}}\, r_{(j)}^T r_{(j)} & j = k, \\[6pt] -\dfrac{1}{\alpha_{(j-1)}}\, r_{(j)}^T r_{(j)} & j = k+1, \\[6pt] 0 & \text{otherwise.} \end{cases}$$

Put it all together
$$d_{(0)} = r_{(0)} = b - A x^{(0)}$$
$$\alpha_{(k)} = \frac{r_{(k)}^T r_{(k)}}{d_{(k)}^T A\, d_{(k)}}$$
$$x^{(k+1)} = x^{(k)} + \alpha_{(k)} d_{(k)}$$
$$r_{(k+1)} = r_{(k)} - \alpha_{(k)} A d_{(k)}$$
$$\beta_{(k+1)} = \frac{r_{(k+1)}^T r_{(k+1)}}{r_{(k)}^T r_{(k)}}$$
$$d_{(k+1)} = r_{(k+1)} + \beta_{(k+1)} d_{(k)}$$
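
A minimal sketch that assembles the recurrences above into the conjugate gradient method, with one matrix-vector product per iteration; the stopping rule and default iteration count are assumptions.

```python
# Conjugate gradient for Ax = b with symmetric positive-definite A.
import numpy as np

def conjugate_gradient(A, b, x0=None, tol=1e-10, max_iter=None):
    n = b.size
    x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
    r = b - A @ x                       # d_(0) = r_(0) = b - A x_(0)
    d = r.copy()
    rs_old = r @ r
    for _ in range(max_iter or n):
        Ad = A @ d
        alpha = rs_old / (d @ Ad)       # alpha_(k) = r_(k)^T r_(k) / (d_(k)^T A d_(k))
        x = x + alpha * d               # x_(k+1) = x_(k) + alpha d_(k)
        r = r - alpha * Ad              # r_(k+1) = r_(k) - alpha A d_(k)
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        beta = rs_new / rs_old          # beta_(k+1) = r_(k+1)^T r_(k+1) / (r_(k)^T r_(k))
        d = r + beta * d                # d_(k+1) = r_(k+1) + beta d_(k)
        rs_old = rs_new
    return x

# Example: solve the same small SPD system as before.
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x = conjugate_gradient(A, b)
assert np.allclose(A @ x, b)
```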