EN530.603 Applied Optimal Control Lecture 8: Dynamic Programming October 10, 2018

EN530.603 Applied Optimal Control    Lecture 8: Dynamic Programming    October 10, 2018
Lecturer: Marin Kobilarov

Dynamic Programming (DP) is concerned with the computation of an optimal policy, i.e. an optimal control signal expressed as a function of the state and time, $u = u(x, t)$. This means that we seek solutions from any state $x(t)$ to a desired goal defined either as reaching a single state $x_f$ at time $t_f$, or reaching a surface defined by $\psi(x(t_f), t_f) = 0$. This global viewpoint is in contrast to the calculus of variations approach, where we typically look at local variations of trajectories in the vicinity of a given initial path $x(\cdot)$, often starting from a given state $x(0) = x_0$.

The DP principle is based on the idea that once an optimal path $x^*(t)$ for $t \in [t_0, t_f]$ is computed, then any segment of this path is optimal, i.e. $x^*(t)$ for $t \in [t_1, t_f]$ is the optimal path from $x^*(t_1)$ to $x^*(t_f)$.

[Figure: an optimal path $x^*(t)$ from $x_0$ to $x_f$, with an intermediate state $x_1 = x^*(t_1)$ on the path and a candidate alternative passing through a state $\bar{x}$ off the path.]

For instance, consider an optimal path $x^*(t)$ between $x_0$ and $x_f$, i.e. a path with cost $J^*(x_0, x_f)$ that is less than the cost of any other trajectory between these two points. According to the DP principle, for any given time $t_1 \in (t_0, t_f)$ the trajectory $x^*(t)$ from $x_1 = x^*(t_1)$ to $x(t_f)$ is also optimal. The principle can be easily proved by contradiction. If it were not optimal, then the optimal path from $x_1$ must pass through a state $\bar{x}$ which does not lie on $x^*(t)$, i.e.

$$J^*(x_1, \bar{x}, x_f) < J^*(x_1, x_f).$$

But then we have

$$J^*(x_0, x_1) + J^*(x_1, \bar{x}, x_f) < J^*(x_0, x_1) + J^*(x_1, x_f) = J^*(x_0, x_f),$$

which is a contradiction. This property is key in dynamic programming, since it allows the computation of optimal trajectory segments which can then be pieced together into one globally optimal trajectory, rather than, say, enumerating and comparing all possible trajectories ending at the goal.

DP applies to both continuous and discrete decision domains. The fully discrete setting occurs when the state $x$ takes values from a finite set, e.g. $x \in \{a, b, c, \dots\}$, and when the time variable $t$ can only take a finite number of values, e.g. $t \in \{t_0, t_1, t_2, \dots\}$. As we will see, there are efficient algorithms which can solve discrete problems and compute an optimal policy from any state, as long as the size of these discrete sets is not too large.

For instance, consider the problem (Kirk 3.4) of a motorist navigating through a maze-like environment consisting of one-way roads and intersections marked by letters.

[Figure: a directed road network with intersections labeled $a$ through $h$; each one-way road is labeled with its travel cost.]

The goal is to travel between $a$ and $h$ in an optimal way, where the total cost is the sum of the costs along each road, given by the numbers on the edges. One way to solve the problem is to enumerate all paths and pick the least-cost one. This is doable in this problem but does not scale if there were many more roads and intersections. Another way is to solve the problem recursively by computing optimal subpaths, starting backwards from the goal $h$. Denote the cost along an edge $ab$ by $J_{ab}$ and the optimal cost of going between two nodes $a$ and $b$ by $J^*_{ab}$. Then we have

$$J_{eh} = 8, \quad J_{gh} = 2, \quad J_{fg} = 3.$$

Clearly, we also have $J^*_{gh} = 2$ and $J^*_{fh} = 5$, but

$$J^*_{eh} = \min\{J_{eh},\ J_{ef} + J^*_{fh}\} = \min\{8,\ 2 + 5\} = 7.$$

In other words, the optimal cost at $e$ can be computed by looking at the already computed optimal cost at $f$ and the increment $J_{ef}$ to get from $e$ to $f$. Once we have $J^*_{eh}$, the procedure is repeated recursively from all nodes connected to $e$ and $f$, moving backwards until the start $a$ is reached and the cost $J^*_{ah}$ is computed. The costs of the form $J^*_{eh}$ are called optimal cost-to-go, since they denote the optimal cost to reach the goal.
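To make the backward recursion concrete, here is a minimal Python sketch. The road network below is hypothetical (the exact costs in the figure are not reproduced here); its edge costs are chosen only so that the values quoted above, $J^*_{fh} = 5$ and $J^*_{eh} = \min\{8, 2 + 5\} = 7$, come out of the recursion.

```python
from functools import lru_cache

# Hypothetical one-way road network: node -> {successor: edge cost}.
# Costs are illustrative, chosen to reproduce J*_fh = 5 and J*_eh = 7 as in the text.
edges = {
    "a": {"b": 2, "d": 8},
    "b": {"c": 9},
    "c": {"f": 3},
    "d": {"e": 3},
    "e": {"f": 2, "h": 8},
    "f": {"g": 3, "h": 5},
    "g": {"h": 2},
    "h": {},
}
GOAL = "h"

@lru_cache(maxsize=None)
def cost_to_go(v):
    """Optimal cost-to-go J*_{vh}: Bellman recursion computed backwards from the goal."""
    if v == GOAL:
        return 0
    return min(c + cost_to_go(w) for w, c in edges[v].items())

if __name__ == "__main__":
    for node in "abcdefg":
        print(f"J*_{node}h = {cost_to_go(node)}")   # e.g. J*_eh = min(8, 2 + 5) = 7
```

The recursion memoizes each node's cost-to-go, so every node and edge is examined only once rather than enumerating all paths.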

In the optimal control setting, we are typically dealing with continuous spaces, e.g. $x \in \mathbb{R}^n$ and $t \in [t_0, t_f]$. One way to apply the discrete DP approach is to discretize space and time using a grid of $N_x$ discrete values along each state dimension and $N_t$ discrete values along time. The possible number of discrete states (or nodes) will then be

$$\#\text{nodes} = N_t \cdot N_x^n,$$

or in other words their number is exponential in the state dimension. The case $n = 2$ is illustrated below, where each cell could be regarded as a node in a graph similar to the one in the maze navigation example.

[Figure: a two-dimensional state space gridded into $N_x \times N_x$ cells, and a time axis $t_0, t_1, t_2, \dots$ divided into $N_t$ segments.]

This illustrates one of the fundamental problems in control and DP, known as the curse of dimensionality. In practice, this means that applying the discrete DP procedure is only feasible for a few dimensions, e.g. $n \leq 5$, and coarse discretizations, e.g. $N_x, N_t < 100$. But there are exceptions. An additional complication when transforming a continuous problem into a discrete one is that when the system can only take on a finite number of states, its continuous dynamics can be grossly violated. This motivates the study of DP theory not only in discrete domains but also in

- continuous state space $X$ and discrete time $t \in \{t_0, t_1, t_2, \dots\}$,
- continuous state space $X$ and continuous time $t \in [t_0, t_f]$ (our standard case).

To proceed we will consider the restricted optimal control problem

$$J = \phi(x(t_f), t_f) + \int_{t_0}^{t_f} L(x(t), u(t), t)\, dt, \qquad (1)$$

subject to

$$\dot{x}(t) = f(x(t), u(t), t),$$

with given $x(t_0)$, $t_0$, $t_f$.

1 Discrete-time DP

1.1 Bellman equations

Assume that the time line is discretized into the discrete set $\{t_0, t_1, \dots, t_N\}$. When time is discrete, we work with discrete trajectories, i.e. sequences of states at these discrete times. A continuous trajectory $x(t)$ is approximated by a discrete trajectory $x_{0:N}$ given by

$$x_{0:N} = \{x_0, x_1, \dots, x_N\},$$

where $x_i \approx x(t_i)$, $i = 0, \dots, N$. The dynamics then must be expressed in a discrete form according to

$$x_{i+1} = f_i(x_i, u_i),$$

where $f_i$ is a numerical approximation of the dynamics. For instance, using the simplest Euler rule resulting from the finite difference

$$\dot{x}(t_i) \approx \frac{x(t_i + \Delta t) - x(t_i)}{\Delta t},$$

we have

$$f_i(x_i, u_i) = x_i + \Delta t_i\, f(t_i, x_i, u_i),$$

where $\Delta t_i$ is the time step defined by $\Delta t_i = t_{i+1} - t_i$. The cost function along each segment $[t_i, t_{i+1}]$ is approximated by

$$\int_{t_i}^{t_{i+1}} L(x, u, t)\, dt \approx L_i(x_i, u_i),$$

where $L_i$ is the discrete cost. For instance, in the simplest first-order Euler approximation we have

$$L_i(x, u) = \Delta t_i\, L(x(t_i), u(t_i), t_i).$$

To summarize, we have approximated our continuous problem (1) by the discrete problem

$$J \approx J_0 = \phi(x_N, t_N) + \sum_{i=0}^{N-1} L_i(x_i, u_i), \qquad (2)$$

subject to $x_{i+1} = f_i(x_i, u_i)$, with given $x(t_0)$, $t_0$, $t_N$.

In order to establish a link to DP, define the cost from discrete stage $i$ to $N$ by

$$J_i = \phi(x_N, t_N) + \sum_{k=i}^{N-1} L_k(x_k, u_k).$$

Similarly to the maze navigation problem, let us define the optimal cost-to-go function (or optimal value function) at stage $i$ as the optimal cost from state $x$ at stage $i$ to the end stage $N$:

$$V_i(x) \triangleq J_i^*(x) = \min_{u_{i:N-1}} J_i(x, u_{i:N-1}).$$

The optimal cost-to-go at $x$ is computed by looking at all states $x' = f_i(x, u)$ that can be reached using control inputs $u$, and selecting the $u$ which results in the minimum cost-to-go from $x'$ in addition to the cost to get to $x'$. More formally,

$$V_i(x) = \min_u \left[ L_i(x, u) + V_{i+1}(x') \right], \quad \text{where } x' = f_i(x, u).$$

This can be equivalently written as the Bellman equation

$$V_i(x) = \min_u \left[ L_i(x, u) + V_{i+1}(f_i(x, u)) \right],$$

with $V_N(x) = \phi(x, t_N)$.
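The following is a minimal sketch of this backward recursion on a gridded scalar problem. Everything about the model is an illustrative assumption (dynamics $\dot{x} = u$ discretized by Euler, quadratic stage and terminal costs, nearest-grid-point lookup of $V_{i+1}$); only the recursion itself, $V_i(x) = \min_u [L_i(x,u) + V_{i+1}(f_i(x,u))]$, is taken from the text.

```python
import numpy as np

# Illustrative discretized problem: xdot = u, f_i(x,u) = x + dt*u,
# L_i(x,u) = dt/2*(x^2 + u^2), phi(x) = 1/2 x^2, nearest-neighbor lookup of V on the grid.
dt, N = 0.1, 20                      # time step and horizon (arbitrary choices)
xs = np.linspace(-2.0, 2.0, 81)      # state grid
us = np.linspace(-2.0, 2.0, 41)      # control grid

def nearest(v):
    """Index of the grid point closest to each entry of v (clipped to the grid)."""
    return np.clip(np.round((v - xs[0]) / (xs[1] - xs[0])).astype(int), 0, len(xs) - 1)

V = 0.5 * xs**2                      # V_N(x) = phi(x)
policy = np.zeros((N, len(xs)))      # minimizing u at each stage and grid point

for i in reversed(range(N)):
    x_next = xs[:, None] + dt * us[None, :]                       # f_i(x, u) for all pairs
    cost = 0.5 * dt * (xs[:, None]**2 + us[None, :]**2) + V[nearest(x_next)]
    policy[i] = us[np.argmin(cost, axis=1)]
    V = cost.min(axis=1)             # V_i(x) = min_u [ L_i(x,u) + V_{i+1}(f_i(x,u)) ]

print("V_0 at x = 1:", V[nearest(np.array([1.0]))[0]])
print("u_0 at x = 1:", policy[0, nearest(np.array([1.0]))[0]])
```

Note how the cost of every node grows with the grid resolution and the state dimension, exactly the curse of dimensionality discussed above.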

1.2 Discrete-time linear-quadratic case

The Bellman equation has a closed-form solution when $f_i$ is linear, i.e. when

$$x_{i+1} = A_i x_i + B_i u_i,$$

and when the cost is quadratic, i.e.

$$\phi(x) = \frac{1}{2} x^T P_f x, \qquad L_i(x, u) = \frac{1}{2} x^T Q_i x + \frac{1}{2} u^T R_i u.$$

To see this, assume that the value function is of the form

$$V_i(x) = \frac{1}{2} x^T P_i x,$$

for $P_i > 0$ with boundary condition $P_N = P_f$. Bellman's principle requires that

$$\frac{1}{2} x^T P_i x = \min_u \left\{ \frac{1}{2} u^T R_i u + \frac{1}{2} x^T Q_i x + \frac{1}{2} (A_i x + B_i u)^T P_{i+1} (A_i x + B_i u) \right\}.$$

The minimization results in

$$u^* = -(R_i + B_i^T P_{i+1} B_i)^{-1} B_i^T P_{i+1} A_i x \triangleq K_i x,$$

where

$$K_i = -(R_i + B_i^T P_{i+1} B_i)^{-1} B_i^T P_{i+1} A_i.$$

Substituting $u^*$ back into the Bellman equation we obtain

$$\frac{1}{2} x^T P_i x = \frac{1}{2} x^T \left[ K_i^T R_i K_i + Q_i + (A_i + B_i K_i)^T P_{i+1} (A_i + B_i K_i) \right] x.$$

Note that setting

$$P_i = K_i^T R_i K_i + Q_i + (A_i + B_i K_i)^T P_{i+1} (A_i + B_i K_i),$$

with $P_N = P_f$, will satisfy the Bellman equation. This relationship can be cycled backwards from $i = N-1$ down to $i = 0$ to obtain $P_{N-1}, P_{N-2}, \dots, P_0$. The recurrence can also be expressed without the gains $K_i$ according to

$$P_i = Q_i + A_i^T \left[ P_{i+1} - P_{i+1} B_i (R_i + B_i^T P_{i+1} B_i)^{-1} B_i^T P_{i+1} \right] A_i.$$

Note that we have expressed the optimal control at stage $i$ according to $u_i = K_i x_i$, i.e. in linear feedback form, similarly to the continuous case. This means that the gains $K_i$ can be computed only once in the beginning and then used from any state $x_i$. This solution is equivalent to the discrete Linear-Quadratic Regulator (LQR).
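A minimal sketch of the backward Riccati recursion and the resulting feedback law follows; the double-integrator matrices $A$, $B$ and the weights $Q$, $R$, $P_f$ are illustrative choices, not taken from the notes.

```python
import numpy as np

# Backward Riccati recursion for the discrete-time LQR of this subsection.
# The double-integrator model and the weights are illustrative.
dt, N = 0.1, 50
A = np.array([[1.0, dt], [0.0, 1.0]])
B = np.array([[0.0], [dt]])
Q = 0.1 * np.eye(2)
R = np.array([[1.0]])
P = np.eye(2)                        # P_N = P_f

Ks = []
for i in reversed(range(N)):
    # K_i = -(R + B^T P_{i+1} B)^{-1} B^T P_{i+1} A   (so that u_i = K_i x_i)
    K = -np.linalg.solve(R + B.T @ P @ B, B.T @ P @ A)
    # P_i = Q + K_i^T R K_i + (A + B K_i)^T P_{i+1} (A + B K_i)
    P = Q + K.T @ R @ K + (A + B @ K).T @ P @ (A + B @ K)
    Ks.append(K)
Ks.reverse()                         # Ks[i] is the feedback gain at stage i

# Closed-loop rollout from an arbitrary initial state using u_i = K_i x_i
x = np.array([[1.0], [0.0]])
for i in range(N):
    x = A @ x + B @ (Ks[i] @ x)
print("final state:", x.ravel())
```

Since the gains $K_i$ depend only on the model and the weights, they are computed once offline and can then be applied from whatever state is actually encountered, as noted above.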

2 Continuous Dynamic Programming

We next consider the fully continuous case and derive an analog to Bellman's equation. The continuous analog is called the Hamilton-Jacobi-Bellman equation and is a key result in optimal control.

Consider the continuous value function $V(x, t)$ computed over the time interval $[t, t_f]$, defined by

$$V(x(t), t) = \min_{u(\tau),\, \tau \in [t, t_f]} \left[ \phi(x(t_f), t_f) + \int_t^{t_f} L(x(\tau), u(\tau), \tau)\, d\tau \right].$$

As noted earlier, Bellman's principle of optimality states that if a trajectory over $[t_0, t_f]$ is optimal, then it is also optimal on any subinterval $[t, t + \Delta t] \subset [t_0, t_f]$. This can be expressed more formally through the recursive relationship

$$V(x(t), t) = \min_{u(\tau),\, \tau \in [t, t + \Delta t]} \left[ \int_t^{t + \Delta t} L(x(\tau), u(\tau), \tau)\, d\tau + V(x(t + \Delta t), t + \Delta t) \right], \qquad (3)$$

where the optimization is over the continuous control signal $u(\tau)$ over the interval $[t, t + \Delta t]$.

We can now state the key principle of dynamic programming in continuous space: optimal trajectories must satisfy the Hamilton-Jacobi-Bellman (HJB) equation given by

$$-\partial_t V(x, t) = \min_{u(t)} \left\{ L(x, u, t) + \partial_x V(x, t)^T f(x, u, t) \right\}. \qquad (4)$$

To prove HJB, assume that $\Delta t$ is small and expand $V(x(t + \Delta t), t + \Delta t)$ according to

$$V(x(t + \Delta t), t + \Delta t) = V(x(t), t) + \left[ \partial_t V(x(t), t) + \partial_x V(x(t), t)^T \dot{x} \right] \Delta t + o(\Delta t).$$

Substituting the above in (3) and neglecting the $o(\Delta t)$ terms, we have

$$V(x(t), t) = \min_{u(t)} \left\{ L(x(t), u(t), t)\, \Delta t + V(x(t), t) + \left[ \partial_t V(x(t), t) + \partial_x V(x(t), t)^T \dot{x} \right] \Delta t \right\},$$

which is equivalent to

$$0 = \min_{u(t)} \left\{ L(x(t), u(t), t)\, \Delta t + \partial_t V(x(t), t)\, \Delta t + \partial_x V(x(t), t)^T f(x(t), u(t), t)\, \Delta t \right\}.$$

Since this must hold for any $\Delta t$, dividing by $\Delta t$ and letting $\Delta t \to 0$ we obtain the HJB equation.

Example 1. A simple nonlinear system. Consider the scalar system $\dot{x} = -x^3 + u$ with cost

$$J = \int_t^{t_f} \frac{1}{2} \left[ q(x) + u^2 \right] dt.$$

First we find

$$u^* = \arg\min_u \left\{ \frac{1}{2} \left[ q(x) + u^2 \right] + \partial_x V(x, t)\,(-x^3 + u) \right\}.$$

The solution is

$$u^* = -\partial_x V(x, t).$$

The HJB equation becomes

$$-\partial_t V(t, x) = \frac{1}{2} \left[ q(x) - (\partial_x V(t, x))^2 \right] - \partial_x V(t, x)\, x^3.$$

If we can solve this PDE globally then we are done, but this might be difficult.

Example 2. Linear-quadratic case. Consider the scalar system $\dot{x} = x + u$ with cost

$$J = \frac{1}{2} x(t_f)^2 + \int_0^{t_f} \frac{1}{2} u(t)^2\, dt.$$

We have

$$H(x, u, \partial_x V, t) = \frac{1}{2} u^2 + \partial_x V\, [x + u].$$

The optimal control is computed by satisfying $\partial_u H = 0$, i.e.

$$u^*(t) = -\partial_x V(x, t).$$

Furthermore, we have $\partial_u^2 H = 1 > 0$, so $u^*$ is globally optimal. Substituting the control, the HJB equation becomes

$$-\partial_t V(t, x) = -\frac{1}{2} (\partial_x V(t, x))^2 + \partial_x V(t, x)\, x.$$

Consider the candidate value function

$$V(t, x) = \frac{1}{2} P(t)\, x^2.$$

The HJB equation becomes

$$-\frac{1}{2} \dot{P} x^2 = -\frac{1}{2} P^2 x^2 + P x^2,$$

which is equivalent to

$$\dot{P} + 2P - P^2 = 0.$$

Using separation of variables and the final condition $P(t_f) = 1$, the solution is

$$P(t) = \frac{2}{e^{2(t - t_f)} + 1}.$$

The optimal (feedback) control law becomes

$$u = -P(t)\, x.$$
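As a quick sanity check of the closed-form solution, one can integrate the Riccati ODE $\dot{P} = P^2 - 2P$ backwards from $P(t_f) = 1$ and compare with $P(t) = 2/(e^{2(t - t_f)} + 1)$; the horizon $t_f = 1$ below is an arbitrary illustrative choice.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Integrate Pdot = P^2 - 2P backwards from P(t_f) = 1 and compare with
# the closed-form solution P(t) = 2 / (exp(2(t - t_f)) + 1).
t_f = 1.0
sol = solve_ivp(lambda t, P: P**2 - 2.0 * P, (t_f, 0.0), [1.0], dense_output=True)

ts = np.linspace(0.0, t_f, 5)
P_numeric = sol.sol(ts)[0]
P_closed = 2.0 / (np.exp(2.0 * (ts - t_f)) + 1.0)
print(np.max(np.abs(P_numeric - P_closed)))   # should be small; tighten rtol/atol for closer agreement

# The resulting feedback law is u(t) = -P(t) x(t).
```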

2.1 General Linear-Quadratic Regulator (LQR)

Consider the linear system $\dot{x} = Ax + Bu$ with cost function

$$J = \frac{1}{2} x^T(t_f) P_f x(t_f) + \int_0^{t_f} \frac{1}{2} \left( x^T Q x + u^T R u \right) dt.$$

The Hamiltonian is

$$H = \frac{1}{2} \left( x^T Q x + u^T R u \right) + \partial_x V^T (Ax + Bu).$$

It is positive definite in $u$ since $\partial_u^2 H = R > 0$. The optimal control is

$$u = -R^{-1} B^T \partial_x V,$$

which results in the HJB equation

$$-\partial_t V = \frac{1}{2} x^T Q x + \frac{1}{2} \partial_x V^T B R^{-1} B^T \partial_x V + \partial_x V^T \left( Ax - B R^{-1} B^T \partial_x V \right).$$

Consider the value function

$$V(x(t), t) = \frac{1}{2} x^T(t) P(t) x(t),$$

for a positive definite symmetric $P(t)$. The optimal control law becomes

$$u(t) = -R^{-1} B^T P(t)\, x(t).$$

We obtain

$$-\frac{1}{2} x^T \dot{P} x = \frac{1}{2} x^T \left( Q - P B R^{-1} B^T P + 2 P A \right) x = \frac{1}{2} x^T \left( Q - P B R^{-1} B^T P + P A + A^T P \right) x,$$

which will always hold true if

$$-\dot{P}(t) = A^T P(t) + P(t) A - P(t) B R^{-1} B^T P(t) + Q,$$

which is the Riccati ODE for $P(t)$. Note that we replaced $2 P A$ with $P A + A^T P$ above in order to ensure that $P(t)$ remains symmetric for all $t$. The matrix $P$ is known at the terminal time, i.e.

$$P(t_f) = P_f.$$

Therefore, $P(t)$ is computed by backward integration of the Riccati ODE. Note that $P$ is symmetric, so only $n(n+1)/2$ equations need to be integrated.
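A minimal sketch of the backward integration of the Riccati ODE follows; the system matrices and weights (a double integrator with identity weights) are illustrative assumptions, and for simplicity all $n^2$ entries of $P$ are integrated rather than only the $n(n+1)/2$ independent ones.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Backward integration of  -Pdot = A^T P + P A - P B R^{-1} B^T P + Q,  P(t_f) = P_f,
# for an illustrative double integrator.
A = np.array([[0.0, 1.0], [0.0, 0.0]])
B = np.array([[0.0], [1.0]])
Q = np.eye(2)
R = np.array([[1.0]])
P_f = np.eye(2)
t_f = 5.0

def riccati_rhs(t, p_flat):
    P = p_flat.reshape(2, 2)
    dP = -(A.T @ P + P @ A - P @ B @ np.linalg.solve(R, B.T) @ P + Q)
    return dP.ravel()

# Integrate from t_f down to 0 (solve_ivp accepts a decreasing time span).
sol = solve_ivp(riccati_rhs, (t_f, 0.0), P_f.ravel(), dense_output=True)

P0 = sol.sol(0.0).reshape(2, 2)
K0 = np.linalg.solve(R, B.T @ P0)      # time-varying gain at t = 0, so that u = -K0 x
print("P(0) =\n", P0)
print("K(0) =", K0)
```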

2.2 Link between MP and DP

We have the relationship

$$L(x, u, t) + \partial_x V(x, t)^T f(x, u, t) = H(x, u, \partial_x V(x, t), t),$$

i.e. $\partial_x V$ plays the role of the multiplier $\lambda$. The HJB equation can therefore be written as

$$-\partial_t V(x, t) = \min_u H(x, u, \partial_x V(x, t), t).$$

The optimal control becomes

$$u^* = \arg\min_u H(x, u, \partial_x V, t).$$

If the corresponding optimal trajectory is $x^*$, then

$$-\partial_t V(x^*, t) = H(x^*, u^*, \partial_x V(x^*, t), t).$$

Think of $\lambda$ as guiding the evolution along the gradient of $V$. Visually, if $V$ can be thought of as an expanding front emanating from the goal set, then the vector $\lambda$ is orthogonal to that front and points outward, i.e. it gives the direction of fastest increase of $V$.

DP in practice:

- If we knew $V(x, t)$, then DP gives a closed-loop control law $u^* = \arg\min_u H(x, u, \partial_x V, t)$.
- But computing $V(x, t)$ globally is extremely challenging.
- Various approximate solutions are possible.
- Simplifications (but not drastic ones) are obtained in the time-independent case with an infinite time horizon.