Trajectory-based optimization


Emo Todorov
Applied Mathematics and Computer Science & Engineering, University of Washington
AMATH/CSE 579, Winter 2012, Lecture 6

Using the maximum principle

Recall that for deterministic dynamics $\dot{x} = f(x,u)$ and cost rate $\ell(x,u)$, the optimal state-control-costate trajectory $(x(\cdot), u(\cdot), \lambda(\cdot))$ satisfies

$$\dot{x} = f(x,u)$$
$$-\dot{\lambda} = \ell_x(x,u) + f_x(x,u)^T \lambda$$
$$u = \arg\min_{\tilde{u}} \left\{ \ell(x,\tilde{u}) + f(x,\tilde{u})^T \lambda \right\}$$

with $x(0)$ given and $\lambda(T) = \frac{\partial}{\partial x} q_T(x(T))$. Solving this boundary-value ODE problem numerically is a trajectory-based method.

We can also use the fact that, if $(x(\cdot), \lambda(\cdot))$ satisfies the ODE for some $u(\cdot)$ which is not a minimizer of the Hamiltonian $H(x,u,\lambda) = \ell(x,u) + f(x,u)^T \lambda$, then the gradient of the total cost

$$J(x(\cdot), u(\cdot)) = q_T(x(T)) + \int_0^T \ell(x(t), u(t))\, dt$$

is given by

$$\frac{\partial J}{\partial u(t)} = H_u(x,u,\lambda) = \ell_u(x,u) + f_u(x,u)^T \lambda$$

Thus we can perform gradient descent on $J$ with respect to $u(\cdot)$.
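Below is a minimal sketch of this gradient-descent scheme, assuming an illustrative scalar problem ($\dot{x} = u$, $\ell = \tfrac{1}{2}(x^2 + u^2)$, $q_T = 0$) and plain Euler integration; the example problem and all names are hypothetical, not from the slides.

```python
import numpy as np

# Gradient descent on J w.r.t. u(.) via the maximum principle.
# Hypothetical example: dynamics xdot = u, cost rate l = 0.5*(x^2 + u^2),
# terminal cost q_T = 0, so lambda(T) = 0.
T, N = 1.0, 100
dt = T / N
x0 = 1.0
u = np.zeros(N)                    # discretized control u(t_k)

def gradient(u):
    # forward pass: integrate xdot = f(x,u) = u from the given x(0)
    x = np.empty(N + 1); x[0] = x0
    for k in range(N):
        x[k+1] = x[k] + dt * u[k]
    # backward pass: integrate -lambdadot = l_x + f_x^T lambda = x from lambda(T) = 0
    lam = np.empty(N + 1); lam[-1] = 0.0
    for k in reversed(range(N)):
        lam[k] = lam[k+1] + dt * x[k]
    # dJ/du(t) = H_u = l_u + f_u^T lambda = u + lambda
    return u + lam[:-1]

for it in range(200):              # plain gradient descent with a fixed step
    u -= 0.5 * gradient(u)
```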

Compact representations

Given the current $u(\cdot)$, each step of the algorithm involves computing $x(\cdot)$ by integrating forward in time starting with the given $x(0)$, then computing $\lambda(\cdot)$ by integrating backward in time starting with $\lambda(T) = \frac{\partial}{\partial x} q_T(x(T))$.

One way to implement the above methods is to discretize the time axis and represent $(x, u, \lambda)$ independently at each time step. This may be inefficient because the values at nearby time steps are usually very similar, so it is wasteful to represent and optimize them independently.

Instead we can use splines, Legendre or Chebyshev polynomials, etc.:

$$u(t) = g(t, w)$$

Gradient:

$$\frac{\partial J}{\partial w} = \int_0^T g_w(t,w)^T \frac{\partial J}{\partial u(t)}\, dt$$
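A sketch of this chain-rule gradient for a linear-in-weights parameterization $u(t) = \sum_j w_j \phi_j(t)$; the Gaussian-bump basis and all sizes here are illustrative assumptions.

```python
import numpy as np

# u(t) = g(t,w) = sum_j w_j * phi_j(t) with Gaussian bumps phi_j (hypothetical
# basis). For a linear-in-w parameterization, g_w(t,w) is just the basis matrix,
# so dJ/dw = int g_w^T dJ/du dt reduces to a quadrature with B^T.
T, N, M = 1.0, 100, 8
t = np.linspace(0, T, N)
centers = np.linspace(0, T, M)
B = np.exp(-((t[:, None] - centers[None, :]) / 0.1) ** 2)  # B[k,j] = phi_j(t_k)

def u_of_w(w):
    return B @ w                       # u(t_k) = g(t_k, w)

def dJ_dw(dJ_du):
    # chain rule + rectangle quadrature: dJ/dw ~= B^T (dJ/du) * dt
    return B.T @ dJ_du * (T / N)
```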

Space-time constraints

We can also minimize the total cost $J$ as an explicit function of the (parameterized) state-control trajectory:

$$x(t) = h(t,v), \quad u(t) = g(t,w)$$

We have to make sure that the state-control trajectory is consistent with the dynamics $\dot{x} = f(x,u)$. This yields a constrained optimization problem:

$$\min_{v,w} \; q_T(h(T,v)) + \int_0^T \ell(h(t,v), g(t,w))\, dt$$
$$\text{s.t.} \;\; \frac{\partial h(t,v)}{\partial t} = f(h(t,v), g(t,w)), \quad \forall t \in [0,T]$$

In practice we cannot impose the constraint for all $t$, so instead we choose a finite set of points $\{t_k\}$ where the constraint is enforced. The same points can also be used to approximate $\int \ell$. There may be no feasible solution (depending on $h$, $g$), in which case we have to live with constraint violations. This approach requires no knowledge of optimal control (which may be why it is popular :)
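A minimal direct-collocation sketch under assumed simplifications (scalar dynamics $\dot{x} = u$, quadratic cost, polynomial $h$ and $g$, scipy's SLSQP as the constrained solver); every choice here is an illustrative assumption, not the slides' method.

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical problem: xdot = u, cost int 0.5*(x^2 + u^2) dt, x(0) = 1.
# Parameterize x(t) = h(t,v) as a degree-(d-1) polynomial, u(t) = g(t,w) as
# degree-(d-2), and enforce dh/dt = g at a finite set of points {t_k}.
T, d = 1.0, 5
tq = np.linspace(0, T, 50)                 # dense grid for the cost integral
tc = np.linspace(0, T, d - 1)              # collocation points {t_k}
Pq = np.vander(tq, d, increasing=True)     # columns t^0 .. t^{d-1}
Pc = np.vander(tc, d, increasing=True)

def h(P, v):    return P @ v
def g(P, w):    return P[:, :d-1] @ w
def hdot(P, v): return (P[:, :d-1] * np.arange(1, d)) @ v[1:]  # power rule

def cost(z):
    v, w = z[:d], z[d:]
    x, u = h(Pq, v), g(Pq, w)
    return np.sum(0.5 * (x**2 + u**2)) * (T / len(tq))  # crude quadrature

cons = [{"type": "eq", "fun": lambda z: hdot(Pc, z[:d]) - g(Pc, z[d:])},
        {"type": "eq", "fun": lambda z: z[0] - 1.0}]    # h(0,v) = x(0) = 1
sol = minimize(cost, np.zeros(2*d - 1), constraints=cons, method="SLSQP")
```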

Second-order methods

More efficient methods (DDP, iLQG) can be constructed by using the Bellman equations locally. Initialize with some open-loop control $u^{(0)}(\cdot)$, and repeat:

1. Compute the state trajectory $x^{(n)}(\cdot)$ corresponding to $u^{(n)}(\cdot)$.
2. Construct a time-varying linear (iLQG) or quadratic (DDP) approximation to the function $f$ around $x^{(n)}(\cdot), u^{(n)}(\cdot)$, which gives the local dynamics in terms of the state and control deviations $\delta x(\cdot), \delta u(\cdot)$. Also construct quadratic approximations to the costs $\ell$ and $q_T$.
3. Compute the locally-optimal cost-to-go $v^{(n)}(\delta x, t)$ as a quadratic in $\delta x$. In iLQG this is exact (because the local dynamics are linear and the cost is quadratic), while in DDP it is approximate.
4. Compute the locally-optimal linear feedback control law in the form $\pi^{(n)}(\delta x, t) = c(t) - L(t)\, \delta x$.
5. Apply $\pi^{(n)}$ to the local dynamics (i.e. integrate forward in time) to compute the state-control modification $\delta x^{(n)}(\cdot), \delta u^{(n)}(\cdot)$, and set $u^{(n+1)}(\cdot) = u^{(n)}(\cdot) + \delta u^{(n)}(\cdot)$. This requires linesearch to avoid jumping outside the region where the local approximation is valid.
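A minimal sketch of steps 3-5 for the special case of linear dynamics and quadratic cost, where the iLQG cost-to-go is exact and a single backward/forward pass suffices (so no linesearch is needed); the matrices A, B, Q, R, Qf are illustrative assumptions.

```python
import numpy as np

# Linear-quadratic special case: dynamics x' = A x + B u, cost
# sum(x'Qx + u'Ru) + x_N' Qf x_N. Here iLQG's local model is the true problem.
N = 50
A = np.array([[1.0, 0.1], [0.0, 1.0]])
B = np.array([[0.0], [0.1]])
Q = 0.01 * np.eye(2); R = 0.1 * np.eye(1); Qf = np.eye(2)
x0 = np.array([1.0, 0.0])

# backward pass: quadratic cost-to-go v(x,k) = 0.5 x^T V_k x, feedback u = -L_k x
V, L = Qf, []
for k in reversed(range(N)):
    Lk = np.linalg.solve(R + B.T @ V @ B, B.T @ V @ A)
    V = Q + A.T @ V @ (A - B @ Lk)
    L.append(Lk)
L.reverse()

# forward pass: apply the feedback law to the dynamics
x = x0
for k in range(N):
    u = -L[k] @ x
    x = A @ x + B @ u
```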

Numerical comparison

           Conj.   ODE    DDP
time (s)   5.9     1.25   0.61

[Figure: log10(cost) vs. iteration (1 to 200) for steepest and conjugate descent, with trajectory snapshots at iterations 1, 10, 100, 200.]

Gradient descent

The directional derivative of $f(x)$ at $x_0$ in direction $v$ is

$$D_v[f](x_0) = \left. \frac{d}{d\varepsilon} f(x_0 + \varepsilon v) \right|_{\varepsilon = 0}$$

Let $x(\varepsilon) = x_0 + \varepsilon v$. Then $f(x_0 + \varepsilon v) = f(x(\varepsilon))$ and the chain rule yields

$$D_v[f](x_0) = \left. \frac{\partial x(\varepsilon)}{\partial \varepsilon} \right|_{\varepsilon=0}^T \left. \frac{\partial f(x)}{\partial x} \right|_{x = x_0} = v^T g(x_0)$$

where $g$ denotes the gradient of $f$.

Theorem (steepest ascent direction). The maximum of $D_v[f](x_0)$ s.t. $\|v\| = 1$ is achieved when $v$ is parallel to $g(x_0)$.

Algorithm (gradient descent). Set $x_{k+1} = x_k - \beta_k g(x_k)$, where $\beta_k$ is the step size. The optimal step size is

$$\beta_k = \arg\min_{\beta} f(x_k - \beta\, g(x_k))$$
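A worked instance of the algorithm: on a quadratic $f(x) = \tfrac{1}{2} x^T D x$ the optimal step size has the closed form $\beta_k = (g^T g)/(g^T D g)$; the ill-conditioned $D$ is an illustrative assumption.

```python
import numpy as np

# Steepest descent with the exact (optimal) step size on a quadratic.
D = np.diag([1.0, 100.0])              # illustrative ill-conditioned Hessian
x = np.array([1.0, 1.0])
for k in range(100):
    g = D @ x                          # gradient of f(x) = 0.5 x^T D x
    beta = (g @ g) / (g @ D @ g)       # beta_k = argmin_beta f(x - beta g)
    x = x - beta * g                   # x -> 0, slowly along the shallow axis
```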

Line search

Most optimization methods involve an inner loop which seeks to minimize (or sufficiently reduce) the objective function constrained to a line: $f(x + \varepsilon v)$, where $v$ is such that a reduction in $f$ is always possible for sufficiently small $\varepsilon$, unless $f$ is already at a local minimum. In gradient descent $v = -g(x)$; other choices are possible (see below) as long as $v^T g(x) < 0$.

This is called linesearch, and can be done in different ways:

1. Backtracking: try some $\varepsilon$; if $f(x + \varepsilon v) > f(x)$, reduce $\varepsilon$ and try again.
2. Bisection: attempt to minimize $f(x + \varepsilon v)$ w.r.t. $\varepsilon$ using a bisection method.
3. Polysearch: attempt to minimize $f(x + \varepsilon v)$ by fitting quadratic or cubic polynomials in $\varepsilon$, finding the minimum analytically, and iterating.

Exact minimization w.r.t. $\varepsilon$ is often a waste of time, because for $\varepsilon \neq 0$ the current search direction may no longer be a descent direction. Sufficient reduction in $f$ is defined relative to the local model (linear or quadratic). This is known as the Armijo-Goldstein condition; the Wolfe condition (which also involves the gradient) is more complicated.
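A sketch of backtracking with the Armijo sufficient-decrease test; the constants c and rho are conventional illustrative defaults.

```python
import numpy as np

# Backtracking linesearch: accept eps once the Armijo(-Goldstein) condition
# f(x + eps*v) <= f(x) + c * eps * v^T g(x) holds (sufficient decrease
# relative to the linear model).
def backtrack(f, g, x, v, eps=1.0, c=1e-4, rho=0.5):
    fx, slope = f(x), v @ g(x)         # slope < 0 for a descent direction
    while f(x + eps * v) > fx + c * eps * slope:
        eps *= rho                     # reduce eps and try again
    return eps
```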

Chattering

If $x_{k+1}$ is a (local) minimum of $f$ in the search direction $v_k = -g(x_k)$, then

$$D_{v_k}[f](x_{k+1}) = 0 = v_k^T g(x_{k+1})$$

and so if we use $v_{k+1} = -g(x_{k+1})$ as the next search direction, we have $v_{k+1}$ orthogonal to $v_k$. Thus gradient descent with exact line search (i.e. steepest descent) makes a 90-degree turn at each iteration, which causes chattering when the function has a long oblique valley.

The key to developing more efficient methods is to anticipate how the gradient will rotate as we move along the current search direction.
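A quick numeric check of the 90-degree-turn claim, under the same illustrative quadratic setup as above:

```python
import numpy as np

# With exact linesearch on a quadratic, consecutive steepest-descent
# directions are orthogonal: v_{k+1}^T v_k = 0 up to round-off.
D = np.diag([1.0, 50.0])               # long narrow valley (illustrative)
x = np.array([1.0, 1.0])
v_prev = None
for k in range(6):
    v = -D @ x                         # v_k = -g(x_k)
    beta = (v @ v) / (v @ D @ v)       # exact minimizer along the line
    x = x + beta * v
    if v_prev is not None:
        print(k, v_prev @ v)           # prints ~0 each iteration
    v_prev = v
```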

Newton's method

Theorem. If all you have is a hammer, then everything looks like a nail.

Corollary. If all you can optimize is a quadratic, then every function looks like a quadratic.

Taylor-expand $f(x)$ around the current solution $x_k$ up to 2nd order:

$$f(x_k + \varepsilon) = f(x_k) + \varepsilon^T g(x_k) + \tfrac{1}{2} \varepsilon^T H(x_k)\, \varepsilon + O(\|\varepsilon\|^3)$$

where $g(x_k)$ and $H(x_k)$ are the gradient and Hessian of $f$ at $x_k$:

$$g(x_k) \triangleq \left. \frac{\partial f}{\partial x} \right|_{x = x_k}, \quad H(x_k) \triangleq \left. \frac{\partial^2 f}{\partial x\, \partial x^T} \right|_{x = x_k}$$

Assuming $H$ is (symmetric) positive definite, the next solution is

$$x_{k+1} = x_k + \arg\min_{\varepsilon} \left\{ \varepsilon^T g + \tfrac{1}{2} \varepsilon^T H \varepsilon \right\} = x_k - H^{-1} g$$
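A sketch of the iteration on an assumed smooth convex test function, with hand-coded gradient and Hessian:

```python
import numpy as np

# Newton's method on the illustrative convex function
# f(x) = exp(x1 + x2) + x1^2 + x2^2.
def f(x): return np.exp(x[0] + x[1]) + x[0]**2 + x[1]**2

def g(x):
    e = np.exp(x[0] + x[1])
    return np.array([e + 2*x[0], e + 2*x[1]])

def H(x):
    e = np.exp(x[0] + x[1])
    return np.array([[e + 2.0, e], [e, e + 2.0]])

x = np.array([1.0, -0.5])
for k in range(20):
    # x_{k+1} = x_k - H^{-1} g, via a linear solve (never invert explicitly)
    x = x - np.linalg.solve(H(x), g(x))
```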

Stabilizing Newton's method

For convex functions the Hessian $H$ is always s.p.d., so the above method converges (usually quickly) to the global minimum. In reality, however, most functions we want to optimize are non-convex, which causes two problems:

1. $H$ may be singular, which means that $x_{k+1} = x_k - H^{-1} g$ will take us all the way to infinity.
2. $H$ may have negative eigenvalues, which means that (even if $x_{k+1}$ is finite) we end up finding saddle points: minimum in some directions, maximum in others.

These problems can be avoided in two general ways:

1. Trust region: minimize $\varepsilon^T g + \tfrac{1}{2} \varepsilon^T H \varepsilon$ s.t. $\|\varepsilon\| \leq r$, where $r$ is adapted over iterations. The minimization is usually done approximately.
2. Convexification/linesearch: replace $H$ with $H + \lambda I$, and/or use backtracking linesearch starting at the Newton point. When $\lambda$ is large, $x_k - (H + \lambda I)^{-1} g \approx x_k - \lambda^{-1} g$, which is gradient descent with step $\lambda^{-1}$. The Levenberg-Marquardt method adapts $\lambda$ over iterations.
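A sketch of the convexification approach with a simple Levenberg-Marquardt-style adaptation of $\lambda$; the halving/tenfold update rule is an illustrative choice, not a tuned implementation.

```python
import numpy as np

# Damped Newton: replace H with H + lambda*I and adapt lambda based on
# whether the step actually reduced f.
def damped_newton(f, g, H, x, lam=1.0, iters=50):
    for k in range(iters):
        step = np.linalg.solve(H(x) + lam * np.eye(len(x)), g(x))
        if f(x - step) < f(x):
            x, lam = x - step, lam * 0.5   # success: trust the model more
        else:
            lam *= 10.0                    # failure: back off toward gradient descent
    return x

# e.g. damped_newton(f, g, H, np.array([1.0, -0.5])) with f, g, H from above
```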

Relation to linear solvers

The quadratic function

$$f(x_k + \varepsilon) = f(x_k) + \varepsilon^T g(x_k) + \tfrac{1}{2} \varepsilon^T H(x_k)\, \varepsilon$$

is minimized when the gradient w.r.t. $\varepsilon$ vanishes, i.e. when

$$H \varepsilon = -g$$

When $H$ is s.p.d., one can use the conjugate-gradient method for solving linear equations to do numerical optimization.

The set of vectors $\{v_k\}_{k = 1 \ldots n}$ are conjugate if they satisfy $v_i^T H v_j = 0$ for $i \neq j$. These are good search directions because they yield exact minimization of an $n$-dimensional quadratic in $n$ iterations (using exact linesearch). Such a set can be constructed using the Lanczos iteration:

$$s_{k+1} v_{k+1} = (H - \alpha_k I)\, v_k - s_k v_{k-1}$$

where $s_{k+1}$ is such that $\|v_{k+1}\| = 1$, and $\alpha_k = v_k^T H v_k$. Note that access to $H$ is not required; all we need to be able to compute is $Hv$.
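A sketch of conjugate gradients for the Newton system $H\varepsilon = -g$, written against a Hessian-vector product callback so that explicit access to $H$ is never needed:

```python
import numpy as np

# Conjugate gradients for H*eps = -g using only the matvec Hv (s.p.d. H);
# exact in at most n = len(g) iterations with exact arithmetic.
def cg(Hv, g, tol=1e-10):
    eps = np.zeros_like(g)
    r = -g - Hv(eps)                   # residual of H*eps = -g
    p = r.copy()
    for k in range(len(g)):
        Hp = Hv(p)
        a = (r @ r) / (p @ Hp)         # exact linesearch along p
        eps = eps + a * p
        r_new = r - a * Hp
        if np.linalg.norm(r_new) < tol:
            break
        p = r_new + ((r_new @ r_new) / (r @ r)) * p  # next conjugate direction
        r = r_new
    return eps

# e.g. a diagonal test: cg(lambda v: np.diag([1., 10., 100.]) @ v, np.ones(3))
```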

Non-linear least squares

Many optimization problems are of the form

$$f(x) = \tfrac{1}{2} \|r(x)\|^2$$

where $r(x)$ is a vector of "residuals". Define the Jacobian of the residuals:

$$J(x) = \frac{\partial r(x)}{\partial x}$$

Then the gradient and Hessian of $f$ are

$$g(x) = J(x)^T r(x)$$
$$H(x) = J(x)^T J(x) + \sum_i r_i(x)\, \frac{\partial^2 r_i(x)}{\partial x\, \partial x^T}$$

We can omit the last term and obtain the Gauss-Newton approximation:

$$H(x) \approx J(x)^T J(x)$$

Then Newton's method (with stabilization) becomes

$$x_{k+1} = x_k - \left( J_k^T J_k + \lambda_k I \right)^{-1} J_k^T r_k$$
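A Gauss-Newton/Levenberg-Marquardt sketch on an assumed exponential curve-fitting problem (synthetic data, fixed small damping $\lambda$; true LM would adapt it):

```python
import numpy as np

# Residuals r_i(x) = x0 * exp(x1 * t_i) - y_i, with synthetic data generated
# from (x0, x1) = (2, -1.5). All names and data here are illustrative.
t = np.linspace(0, 1, 20)
y = 2.0 * np.exp(-1.5 * t)

def r(x):                              # residual vector
    return x[0] * np.exp(x[1] * t) - y

def J(x):                              # Jacobian J_ij = dr_i / dx_j
    e = np.exp(x[1] * t)
    return np.stack([e, x[0] * t * e], axis=1)

x, lam = np.array([1.0, 0.0]), 1e-3
for k in range(50):
    Jk, rk = J(x), r(x)
    # x_{k+1} = x_k - (J^T J + lam*I)^{-1} J^T r
    x = x - np.linalg.solve(Jk.T @ Jk + lam * np.eye(2), Jk.T @ rk)
# x should approach (2, -1.5)
```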