
Optimization

Unconstrained optimization: one-dimensional and multi-dimensional. Newton's method: basic Newton, Gauss-Newton, quasi-Newton. Descent methods: gradient descent, conjugate gradient. Constrained optimization: Newton with equality constraints, active-set method, simplex method, interior-point method.

Unconstrained optimization. Define an objective function over a domain, $f: \mathbb{R}^n \to \mathbb{R}$, with optimization variables $x^T = (x_1, x_2, \ldots, x_n)$: minimize $f(x_1, x_2, \ldots, x_n)$, i.e., minimize $f(x)$ for $x \in \mathbb{R}^n$.

Constraints. Equality constraints: $a_i(x) = 0$ for $x \in \mathbb{R}^n$, where $i = 1, \ldots, p$. Inequality constraints: $c_j(x) \ge 0$ for $x \in \mathbb{R}^n$, where $j = 1, \ldots, q$.

Constrained optimization. Minimize $f(x)$ for $x \in \mathbb{R}^n$, subject to $a_i(x) = 0$, $i = 1, \ldots, p$, and $c_j(x) \ge 0$, $j = 1, \ldots, q$. Solution: $x^*$ satisfies the constraints $a_i$ and $c_j$ while minimizing the objective function $f(x)$.

Formulating an optimization problem. The general optimization problem is very difficult to solve, but certain problem classes can be solved efficiently and reliably. Convex problems can be solved to global optimality efficiently and reliably; nonconvex problems carry no guarantee of a global solution.

Example: pattern matching. A pattern can be described by a set of points, $P = \{p_1, p_2, \ldots, p_n\}$. The same object viewed from a different distance or a different angle corresponds to a different $P$. Two patterns $P$ and $P'$ are similar if
$$p'_i = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} p_i + \begin{pmatrix} r_1 \\ r_2 \end{pmatrix}$$

Example: pattern matching. Let $Q = \{q_1, q_2, \ldots, q_n\}$ be the target pattern; find the most similar pattern among $P_1, P_2, \ldots, P_n$.
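A sketch (not from the slides) of one way to score similarity: fit the angle $\theta$ and translation $(r_1, r_2)$ for each candidate by least squares and compare the residual norms. The helper names and the use of scipy.optimize.least_squares are my own illustrative choices.

```python
# Score how well a candidate pattern P matches the target Q by fitting the
# rotation angle theta and translation (r1, r2) that map P onto Q.
import numpy as np
from scipy.optimize import least_squares

def residuals(params, P, Q):
    """Stacked residuals q_i - (R(theta) p_i + r) for all points."""
    theta, r1, r2 = params
    R = np.array([[np.cos(theta), -np.sin(theta)],
                  [np.sin(theta),  np.cos(theta)]])
    return (Q - (P @ R.T + np.array([r1, r2]))).ravel()

def match_score(P, Q):
    """Best-fit sum of squared residuals; smaller means more similar."""
    fit = least_squares(residuals, x0=np.zeros(3), args=(P, Q))
    return np.sum(fit.fun ** 2)

# Pick the candidate pattern most similar to the target Q.
rng = np.random.default_rng(0)
Q = rng.standard_normal((5, 2))
candidates = [Q @ np.array([[0.8, -0.6], [0.6, 0.8]]).T + 1.0,   # rotated + shifted Q
              rng.standard_normal((5, 2))]                        # unrelated pattern
best = min(range(len(candidates)), key=lambda i: match_score(candidates[i], Q))
print("most similar candidate:", best)
```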

Inverse kinematics: from a set of 3D marker positions, recover a pose described by joint angles.

Optimal motion trajectories

Quiz. Travel from 0 to d and arrive at d with velocity 0, with a maximal allowed force F. What control minimizes time? What control minimizes energy?

Unconstrained optimization: Newton method, Gauss-Newton method, gradient descent method, conjugate gradient method.

Newton method. Find the roots of a nonlinear function, $C(x) = 0$. We can linearize the function about $x$ as $C(\bar{x}) = C(x) + C'(x)(\bar{x} - x) = 0$, where $C'(x) = \frac{\partial C}{\partial x}$. Then we can estimate the root as $\bar{x} = x - \frac{C(x)}{C'(x)}$.

Root estimation (figure): the tangent line $C(x^{(0)}) + C'(x^{(0)})(x - x^{(0)})$ crosses zero at $x^{(1)}$; repeating the construction gives iterates $x^{(0)}, x^{(1)}, x^{(2)}, \ldots$ converging to the root of $C(x)$.

Root estimation. Pros: quadratic convergence. Cons: sensitive to the initial guess (example?); the slope cannot be zero at the solution (why?).
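A minimal sketch of the Newton root-finding iteration above, $x \leftarrow x - C(x)/C'(x)$. The test function $C(x) = x^2 - 2$, tolerances, and names are illustrative choices.

```python
def newton_root(C, Cprime, x0, tol=1e-10, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = C(x) / Cprime(x)       # fails if C'(x) is (near) zero
        x -= step
        if abs(step) < tol:
            return x
    raise RuntimeError("Newton did not converge; try a better initial guess")

# Example: root of C(x) = x^2 - 2, i.e. sqrt(2). Note the sensitivity to x0:
# starting at x0 = -1 converges to the other root, -sqrt(2).
print(newton_root(lambda x: x**2 - 2, lambda x: 2*x, x0=1.0))
```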

Minimization. Find $x^*$ such that the nonlinear function $F(x^*)$ is a minimum. What is the simplest function that has a minimum? Expand $F$ locally as a quadratic:
$$F(x^{(k)} + \delta) = F(x^{(k)}) + F'(x^{(k)})\,\delta + \tfrac{1}{2} F''(x^{(k)})\,\delta^2$$
Finding the minima of $F(x)$ amounts to finding the roots of $F'(x)$: setting $\frac{\partial F(x^{(k)} + \delta)}{\partial \delta} = 0$ gives $\delta = -\frac{F'(x^{(k)})}{F''(x^{(k)})}$.
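A minimal 1D sketch of this Newton minimization step, iterating $x \leftarrow x - F'(x)/F''(x)$. The test function $F(x) = (x-3)^2 + 1$ and the names are illustrative.

```python
def newton_minimize_1d(Fp, Fpp, x0, tol=1e-10, max_iter=50):
    x = x0
    for _ in range(max_iter):
        delta = -Fp(x) / Fpp(x)       # Newton step applied to the derivative
        x += delta
        if abs(delta) < tol:
            return x
    raise RuntimeError("did not converge")

# Example: F(x) = (x - 3)^2 + 1 has its minimum at x = 3.
print(newton_minimize_1d(lambda x: 2*(x - 3), lambda x: 2.0, x0=0.0))
```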

Conditions. What are the conditions for a minimum to exist? Necessary conditions for a local minimum at $x^*$: $F'(x^*) = 0$ and $F''(x^*) \ge 0$. Sufficient conditions for an isolated minimum at $x^*$: $F'(x^*) = 0$ and $F''(x^*) > 0$.

Minimization (figure): plot of $F(x)$ with its minimum at $x^*$, and of $F'(x)$ crossing zero at $x^*$ with $F''(x^*) > 0$.

Multidimensional optimization. Search methods need only function evaluations. First-order gradient-based methods depend on the gradient $g$. Second-order gradient-based methods depend on both the gradient $g$ and the Hessian $H$.

Multiple variables.
$$F(x^{(k)} + p) = F(x^{(k)}) + g^T(x^{(k)})\, p + \tfrac{1}{2} p^T H(x^{(k)})\, p$$
Gradient vector: $g(x) = \nabla_x F = \left( \frac{\partial F}{\partial x_1}, \ldots, \frac{\partial F}{\partial x_n} \right)^T$
Hessian matrix: $H(x) = \nabla^2_{xx} F = \begin{pmatrix} \frac{\partial^2 F}{\partial x_1^2} & \cdots & \frac{\partial^2 F}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial^2 F}{\partial x_n \partial x_1} & \cdots & \frac{\partial^2 F}{\partial x_n^2} \end{pmatrix}$

Multiple variables. Setting the gradient of the quadratic model to zero, $0 = g(x^{(k)}) + H(x^{(k)})\, p$, gives the Newton step $p = -H(x^{(k)})^{-1} g(x^{(k)})$ and the update $x^{(k+1)} = x^{(k)} + p$.
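A minimal sketch of this multidimensional Newton update, solving $H p = -g$ at each iteration. The Rosenbrock test function, its hand-coded derivatives, and the names are my own illustrative choices.

```python
import numpy as np

def newton(grad, hess, x0, tol=1e-8, max_iter=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:
            return x
        p = np.linalg.solve(hess(x), -g)   # Newton step without forming H^{-1}
        x = x + p
    return x

# Rosenbrock function f(x, y) = (1 - x)^2 + 100 (y - x^2)^2, minimum at (1, 1).
def grad(v):
    x, y = v
    return np.array([-2*(1 - x) - 400*x*(y - x**2), 200*(y - x**2)])

def hess(v):
    x, y = v
    return np.array([[2 - 400*(y - 3*x**2), -400*x],
                     [-400*x, 200.0]])

print(newton(grad, hess, x0=[-1.2, 1.0]))   # converges to approximately [1, 1]
```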

Multiple variables. Necessary conditions: $g(x^*) = 0$ and $p^T H p \ge 0$ for all $p$ ($H$ is positive semi-definite). Sufficient conditions: $g(x^*) = 0$ and $p^T H p > 0$ for all $p \ne 0$ ($H$ is positive definite).

Gauss-Newton method. What if the objective is built from a vector of functions, $f = [f_1(x)\; f_2(x)\; \cdots\; f_m(x)]^T$? The real-valued objective can be formed as $F = \sum_{p=1}^{m} f_p(x)^2 = f^T f$.

Jacobian. Each $f_p(x)$, $p = 1, \ldots, m$, depends on the variables $x_i$, $i = 1, \ldots, n$, so a gradient matrix (the Jacobian) can be formed, with entries $J_{pi} = \frac{\partial f_p}{\partial x_i}$. The Jacobian need not be a square matrix.

Gradient and Hessian. Gradient of the objective function:
$$\frac{\partial F}{\partial x_i} = 2 \sum_{p=1}^{m} f_p(x) \frac{\partial f_p}{\partial x_i}, \qquad g_F = 2 J^T f$$
Hessian of the objective function:
$$\frac{\partial^2 F}{\partial x_i \partial x_j} = 2 \sum_{p=1}^{m} \frac{\partial f_p}{\partial x_i} \frac{\partial f_p}{\partial x_j} + 2 \sum_{p=1}^{m} f_p(x) \frac{\partial^2 f_p}{\partial x_i \partial x_j}, \qquad H_F \approx 2 J^T J \text{ (dropping the second-order term)}$$

Gauss-Newton algorithm. In the $k$-th iteration: compute $f_p(x_k)$ and $J_k$ to obtain new $g_k$ and $H_k$; compute $p_k = -(2 J_k^T J_k)^{-1} (2 J_k^T f) = -(J_k^T J_k)^{-1} (J_k^T f)$; find $\alpha_k$ that minimizes $F(x_k + \alpha_k p_k)$; set $x_{k+1} = x_k + \alpha_k p_k$.
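A minimal sketch of the Gauss-Newton iteration above, solving $(J^T J)\, p = -J^T f$ for the step. For simplicity a fixed step $\alpha = 1$ is used where the slide calls for a line search; the exponential-fit residual, data, and names are illustrative.

```python
import numpy as np

def gauss_newton(residual, jacobian, x0, alpha=1.0, tol=1e-10, max_iter=100):
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        f, J = residual(x), jacobian(x)
        p = np.linalg.solve(J.T @ J, -J.T @ f)   # Gauss-Newton step
        x = x + alpha * p                        # a line search could choose alpha
        if np.linalg.norm(p) < tol:
            return x
    return x

# Fit y = a * exp(b * t) to data generated with a = 2, b = -1.5; x = (a, b).
t = np.linspace(0, 1, 20)
y = 2.0 * np.exp(-1.5 * t)
residual = lambda x: x[0] * np.exp(x[1] * t) - y
jacobian = lambda x: np.column_stack([np.exp(x[1] * t),
                                      x[0] * t * np.exp(x[1] * t)])
print(gauss_newton(residual, jacobian, x0=[1.0, 0.0]))   # approx [2.0, -1.5]
```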

First-order gradient methods: greatest gradient descent, conjugate gradient.

Solving a large linear system $Ax = b$: $A$ is a known square, symmetric, positive-definite matrix; $b$ is a known vector; $x$ is an unknown vector. If $A$ is dense, solve with factorization and back-substitution. If $A$ is sparse, solve with iterative methods (descent methods).

Quadratic form. $F(x) = \frac{1}{2} x^T A x - b^T x + c$. The gradient of $F(x)$ is $F'(x) = \frac{1}{2} A^T x + \frac{1}{2} A x - b$. If $A$ is symmetric, $F'(x) = Ax - b$, so $F'(x) = 0$ gives $Ax = b$: the critical point of $F$ is also the solution of $Ax = b$. If $A$ is not symmetric, what is the linear system solved by finding the critical points of $F$?
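A quick numerical check (my own illustration, not from the slides) that the gradient of the quadratic form matches the formula above; the random test matrix, vectors, and finite-difference scheme are illustrative.

```python
# Verify that grad F(x) = (1/2)(A^T + A) x - b, which reduces to Ax - b when A
# is symmetric. The constant c is omitted since it does not affect the gradient.
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))          # deliberately non-symmetric
b = rng.standard_normal(4)
x = rng.standard_normal(4)

F = lambda v: 0.5 * v @ A @ v - b @ v
analytic = 0.5 * (A.T + A) @ x - b

# Central-difference gradient of F at x.
eps = 1e-6
numeric = np.array([(F(x + eps * e) - F(x - eps * e)) / (2 * eps)
                    for e in np.eye(4)])
print(np.allclose(analytic, numeric, atol=1e-5))   # True
```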

Greatest gradient descent. Start at an arbitrary point $x_{(0)}$ and slide down to the bottom of the paraboloid. Take a series of steps $x_{(1)}, x_{(2)}, \ldots$ until we are satisfied that we are close enough to the solution $x^*$. Each step goes along the direction in which $F$ decreases most quickly: $-F'(x_{(k)}) = b - A x_{(k)}$.

Greatest gradient descent. Important definitions: error $e_{(k)} = x_{(k)} - x^*$; residual $r_{(k)} = b - A x_{(k)} = -F'(x_{(k)}) = -A e_{(k)}$. Think of the residual as the direction of greatest descent.

Line search. $x_{(1)} = x_{(0)} + \alpha r_{(0)}$. But how big a step should we take? A line search is a procedure that chooses $\alpha$ to minimize $F$ along a line.

Line search (figure, after Shewchuk, Figure 6: the method of steepest descent): (a) starting point, take a step in the direction of steepest descent; (b) find the point on the intersection of the two surfaces that minimizes $F$; (c) the parabola is the intersection of surfaces, and its bottommost point is our target; (d) the gradient at the bottommost point is orthogonal to the gradient of the previous step.

Optimal step size. $\frac{d}{d\alpha} F(x_{(1)}) = F'(x_{(1)})^T \frac{d}{d\alpha} x_{(1)} = F'(x_{(1)})^T r_{(0)} = 0$, so $F'(x_{(1)}) \perp r_{(0)}$, i.e. $r_{(0)}^T r_{(1)} = 0$.

Optimal step size. Exercise: derive $\alpha$ from $r_{(k)}^T r_{(k+1)} = 0$. Hint: replace the terms involving $(k+1)$ with those involving $(k)$ using $x_{(k+1)} = x_{(k)} + \alpha r_{(k)}$. Answer: $\alpha = \frac{r_{(k)}^T r_{(k)}}{r_{(k)}^T A r_{(k)}}$.

Recurrence of the residual. 1. $r_{(k)} = b - A x_{(k)}$; 2. $\alpha = \frac{r_{(k)}^T r_{(k)}}{r_{(k)}^T A r_{(k)}}$; 3. $x_{(k+1)} = x_{(k)} + \alpha r_{(k)}$. The algorithm requires two matrix-vector multiplications per iteration. One multiplication can be eliminated by replacing step 1 with the recurrence $r_{(k+1)} = r_{(k)} - \alpha A r_{(k)}$.
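A minimal sketch of greatest gradient descent with the optimal step size and residual recurrence above. The small symmetric positive-definite test system is illustrative (it is the sample problem from Shewchuk's notes).

```python
import numpy as np

def steepest_descent(A, b, x0, tol=1e-10, max_iter=1000):
    x = np.asarray(x0, dtype=float)
    r = b - A @ x                          # step 1, computed only once
    for _ in range(max_iter):
        Ar = A @ r                         # the single matrix-vector product
        alpha = (r @ r) / (r @ Ar)         # optimal step size
        x = x + alpha * r
        r = r - alpha * Ar                 # residual recurrence
        if np.linalg.norm(r) < tol:
            break
    return x

A = np.array([[3.0, 2.0], [2.0, 6.0]])     # symmetric positive definite
b = np.array([2.0, -8.0])
print(steepest_descent(A, b, x0=np.zeros(2)))   # approx [2, -2]
```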

Quiz. In our IK problem, we use the greatest gradient descent method to find an optimal pose, but we cannot compute $\alpha$ using the formula from the previous slides. Why?

Poor convergence. What is the problem with greatest gradient descent? Wouldn't it be nice if we could avoid traversing the same direction repeatedly?

Conjugate directions. Pick a set of directions $d_{(0)}, d_{(1)}, \ldots, d_{(n-1)}$ and take exactly one step along each direction; the solution is found within $n$ steps. Two problems: 1. How do we determine these directions? 2. How do we determine the step size along each direction?

A-orthogonality. If we take the optimal step size along each direction, $\frac{d}{d\alpha} F(x_{(k+1)}) = F'(x_{(k+1)})^T \frac{d}{d\alpha} x_{(k+1)} = 0$, which gives $r_{(k+1)}^T d_{(k)} = 0$, i.e. $d_{(k)}^T A e_{(k+1)} = 0$. Two different vectors $v$ and $u$ are A-orthogonal, or conjugate, if $v^T A u = 0$.

A-orthogonality (figure): one panel shows pairs of vectors that are A-orthogonal; the other shows pairs of vectors that are orthogonal.

Optimal step size. $e_{(k+1)}$ must be A-orthogonal to $d_{(k)}$. Using this condition, can you derive $\alpha_{(k)}$?

Algorithm. Suppose we can come up with a set of A-orthogonal directions $\{d_{(k)}\}$; this algorithm converges in $n$ steps: 1. take the direction $d_{(k)}$; 2. compute $\alpha_{(k)} = \frac{d_{(k)}^T r_{(k)}}{d_{(k)}^T A d_{(k)}}$; 3. set $x_{(k+1)} = x_{(k)} + \alpha_{(k)} d_{(k)}$.
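A minimal sketch of this conjugate-directions algorithm. As an illustrative source of A-orthogonal directions (my own choice, not from the slides) it uses the eigenvectors of the symmetric matrix $A$: for eigenvectors $v_i, v_j$ with $i \ne j$, $v_i^T A v_j = \lambda_j v_i^T v_j = 0$, so they are conjugate.

```python
import numpy as np

def conjugate_directions(A, b, directions, x0):
    x = np.asarray(x0, dtype=float)
    for d in directions:                      # exactly one step per direction
        r = b - A @ x
        alpha = (d @ r) / (d @ A @ d)
        x = x + alpha * d
    return x                                  # exact solution after n steps

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
eigvecs = np.linalg.eigh(A)[1].T              # rows are A-orthogonal directions
print(conjugate_directions(A, b, eigvecs, x0=np.zeros(2)))   # approx [2, -2]
```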

Why does it work? We need to prove that $x^*$ is found in $n$ steps if we take step size $\alpha_{(k)}$ along $d_{(k)}$ at each step. Expand the initial error in the directions, which are linearly independent because they are A-orthogonal:
$$e_{(0)} = \sum_{i=0}^{n-1} \delta_i d_{(i)}$$
$$d_{(j)}^T A e_{(0)} = \sum_{i=0}^{n-1} \delta_i\, d_{(j)}^T A d_{(i)} = \delta_j\, d_{(j)}^T A d_{(j)}$$
$$\delta_j = \frac{d_{(j)}^T A e_{(0)}}{d_{(j)}^T A d_{(j)}} = \frac{d_{(j)}^T A \big(e_{(0)} + \sum_{k=0}^{j-1} \alpha_{(k)} d_{(k)}\big)}{d_{(j)}^T A d_{(j)}} = \frac{d_{(j)}^T A e_{(j)}}{d_{(j)}^T A d_{(j)}} = -\alpha_{(j)}$$
(the added terms vanish by A-orthogonality). Each step therefore cancels exactly one component of the initial error, so $e_{(n)} = 0$.

Quiz. Given that the $d$'s are A-orthogonal, prove that the $d$'s are linearly independent.

Search directions. We know how to determine the optimal step size along each direction (second problem solved). We still need to figure out what the search directions are. What do we know about $d_{(0)}, d_{(1)}, \ldots, d_{(n-1)}$? They are A-orthogonal to each other, $d_{(i)}^T A d_{(j)} = 0$, and $d_{(i)}$ is A-orthogonal to $e_{(i+1)}$.

Gram-Schmidt conjugation. Suppose we have a set of linearly independent vectors $u_k$; the search directions can be represented as $d_{(k)} = u_k + \sum_{i=0}^{k-1} \beta_{ki} d_{(i)}$, with $d_{(0)} = u_0$. Use the same trick to get rid of the summation: for $k > j$, $d_{(k)}^T A d_{(j)} = u_k^T A d_{(j)} + \beta_{kj}\, d_{(j)}^T A d_{(j)}$, and requiring the left-hand side to be zero gives $\beta_{kj} = -\frac{u_k^T A d_{(j)}}{d_{(j)}^T A d_{(j)}}$. What are the drawbacks of Gram-Schmidt conjugation?
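A minimal sketch of Gram-Schmidt conjugation; the input vectors (the coordinate axes) and names are illustrative. Note that every previous direction must be stored and revisited, which is the drawback the slide asks about.

```python
import numpy as np

def gram_schmidt_conjugation(A, U):
    """U: rows are linearly independent vectors; returns rows d_k with d_i^T A d_j = 0."""
    D = []
    for u in U:
        d = u.copy()
        for dj in D:                              # subtract components along earlier d's
            beta = -(u @ A @ dj) / (dj @ A @ dj)
            d = d + beta * dj
        D.append(d)
    return np.array(D)

A = np.array([[3.0, 2.0], [2.0, 6.0]])
D = gram_schmidt_conjugation(A, np.eye(2))
print(D[0] @ A @ D[1])                            # approx 0: the directions are conjugate
```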

Conjugate gradients. If we pick the set of $u$'s intelligently, we might be able to save both time and space. It turns out that the residuals $r_{(k)}$ are an excellent choice for the $u$'s: the residuals are orthogonal to each other, and each residual is orthogonal to the previous search directions.

Proof: orthogonality. Prove that $r_{(k)}$ is orthogonal to all the previous search directions $d_{(0)}, d_{(1)}, \ldots, d_{(k-1)}$. Since $e_{(k)} = \sum_{j=k}^{n-1} \delta_j d_{(j)}$, we have $d_{(i)}^T A e_{(k)} = \sum_{j=k}^{n-1} \delta_j\, d_{(i)}^T A d_{(j)} = 0$ if $i < k$, and since $r_{(k)} = -A e_{(k)}$, also $d_{(i)}^T r_{(k)} = 0$ if $i < k$. From here we can prove $r_{(i)}^T r_{(j)} = 0$ for $i \ne j$ (identity 1) and $d_{(k)}^T r_{(k)} = r_{(k)}^T r_{(k)}$ (identity 2).

Conjugate gradients. $d_{(k)} = r_{(k)} + \sum_{i=0}^{k-1} \beta_{ki} d_{(i)}$. For $j < k$: $d_{(k)}^T A d_{(j)} = r_{(k)}^T A d_{(j)} + \sum_{i=0}^{k-1} \beta_{ki}\, d_{(i)}^T A d_{(j)}$, so $0 = r_{(k)}^T A d_{(j)} + \beta_{kj}\, d_{(j)}^T A d_{(j)}$ (by A-orthogonality of the $d$ vectors), giving $\beta_{kj} = -\frac{r_{(k)}^T A d_{(j)}}{d_{(j)}^T A d_{(j)}}$. Each $d_{(k)}$ requires $O(n^3)$ operations. However...

Conjugate gradients. $r_{(k)}$ is A-orthogonal to all the previous search directions except $d_{(k-1)}$:
$$\beta_{kj} = -\frac{r_{(k)}^T A d_{(j)}}{d_{(j)}^T A d_{(j)}} = 0 \;\text{ if } j < k-1, \qquad \beta_{kj} = \frac{r_{(k)}^T r_{(k)}}{r_{(k-1)}^T r_{(k-1)}} \;\text{ if } j = k-1$$
Proof: $r_{(k)}^T A d_{(j)} = 0$ when $j < k-1$.

Proof: A-orthogonality. Prove that $r_{(k)}$ is A-orthogonal to all the previous search directions except $d_{(k-1)}$.
$$r_{(j+1)} = -A e_{(j+1)} = -A(e_{(j)} + \alpha_{(j)} d_{(j)}) = r_{(j)} - \alpha_{(j)} A d_{(j)}$$
$$r_{(k)}^T r_{(j+1)} = r_{(k)}^T r_{(j)} - \alpha_{(j)}\, r_{(k)}^T A d_{(j)}$$
Using identity 1:
$$r_{(k)}^T A d_{(j)} = \begin{cases} \frac{1}{\alpha_{(k)}}\, r_{(k)}^T r_{(k)} & j = k \\ -\frac{1}{\alpha_{(k-1)}}\, r_{(k)}^T r_{(k)} & j = k-1 \\ 0 & \text{otherwise} \end{cases}$$

Conjugate gradients. Simplify $\beta_k$:
$$\beta_k = -\frac{r_{(k)}^T A d_{(k-1)}}{d_{(k-1)}^T A d_{(k-1)}} = \frac{r_{(k)}^T r_{(k)}}{\alpha_{(k-1)}\, d_{(k-1)}^T A d_{(k-1)}} = \frac{r_{(k)}^T r_{(k)}}{d_{(k-1)}^T r_{(k-1)}} = \frac{r_{(k)}^T r_{(k)}}{r_{(k-1)}^T r_{(k-1)}} \quad \text{(using identity 2)}$$

Conjugate gradients. Putting it all together:
$$d_{(0)} = r_{(0)} = b - A x_{(0)}$$
$$\alpha_{(k)} = \frac{r_{(k)}^T r_{(k)}}{d_{(k)}^T A d_{(k)}}$$
$$x_{(k+1)} = x_{(k)} + \alpha_{(k)} d_{(k)}$$
$$r_{(k+1)} = r_{(k)} - \alpha_{(k)} A d_{(k)}$$
$$\beta_{(k+1)} = \frac{r_{(k+1)}^T r_{(k+1)}}{r_{(k)}^T r_{(k)}}$$
$$d_{(k+1)} = r_{(k+1)} + \beta_{(k+1)} d_{(k)}$$
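A minimal sketch of the assembled conjugate gradient iteration for a symmetric positive-definite $A$; the small test system (the same illustrative example as before) and the names are my own choices.

```python
import numpy as np

def conjugate_gradient(A, b, x0, tol=1e-10, max_iter=None):
    x = np.asarray(x0, dtype=float)
    r = b - A @ x
    d = r.copy()
    rr = r @ r
    for _ in range(max_iter or len(b)):
        Ad = A @ d
        alpha = rr / (d @ Ad)
        x = x + alpha * d
        r = r - alpha * Ad
        rr_new = r @ r
        if np.sqrt(rr_new) < tol:
            break
        beta = rr_new / rr                 # beta_{k+1} = r_{k+1}^T r_{k+1} / r_k^T r_k
        d = r + beta * d
        rr = rr_new
    return x

A = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
print(conjugate_gradient(A, b, x0=np.zeros(2)))   # [2, -2] in at most 2 iterations
```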

References. J. Shewchuk, An Introduction to the Conjugate Gradient Method Without the Agonizing Pain. A. Antoniou and W.-S. Lu, Practical Optimization. R. Fletcher, Practical Methods of Optimization. J. Betts, Practical Methods for Optimal Control Using Nonlinear Programming.