Linearly-Solvable Stochastic Optimal Control Problems


Emo Todorov
Applied Mathematics and Computer Science & Engineering, University of Washington
AMATH/CSE 579, Winter 2014

Problem formulation

In traditional MDPs the controller chooses actions $u$ which in turn specify the transition probabilities $p(x'|x,u)$. We can obtain a linearly-solvable MDP (LMDP) by allowing the controller to specify these probabilities directly:

controlled dynamics: $x' \sim u(\cdot|x)$
passive dynamics: $x' \sim p(\cdot|x)$
feasible control set: $\mathcal{U}(x) = \{\, u(\cdot|x) : p(x'|x) = 0 \Rightarrow u(x'|x) = 0 \,\}$

The immediate cost has the form
\[
\ell(x, u(\cdot|x)) = q(x) + \mathrm{KL}\big(u(\cdot|x)\,\|\,p(\cdot|x)\big), \qquad
\mathrm{KL}\big(u(\cdot|x)\,\|\,p(\cdot|x)\big) = \sum_{x'} u(x'|x)\,\log\frac{u(x'|x)}{p(x'|x)} = E_{x' \sim u(\cdot|x)}\!\left[\log\frac{u(x'|x)}{p(x'|x)}\right]
\]
Thus the controller can impose any dynamics it wishes, but it pays a price (the KL-divergence control cost) for pushing the system away from its passive dynamics.

Understanding the KL divergence cost

Figure: three panels illustrating (i) the cost $E_u[q] + \mathrm{KL}(u\,\|\,p)$ over the probability simplex $(u(1|x), u(2|x), u(3|x))$ with its minimizer $u^*$, (ii) how to bias a coin ($E_u[q] + \mathrm{KL}(u\,\|\,p)$ as a function of the probability of Heads), and (iii) the benefits of error tolerance.
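As a small numerical aside (not from the slides), the minimizer of $E_u[q] + \mathrm{KL}(u\,\|\,p)$ over the simplex has the closed form $u^*(x') \propto p(x')\exp(-q(x'))$; the sketch below, with made-up costs, biases a fair coin accordingly.

```python
import numpy as np

# Minimal sketch (not from the slides): minimizing E_u[q] + KL(u || p) over the
# probability simplex has the closed-form solution u*(x') ∝ p(x') exp(-q(x')).
p = np.array([0.5, 0.5])          # passive dynamics: a fair coin (hypothetical numbers)
q = np.array([1.0, 3.0])          # cost assigned to each outcome (hypothetical)

u_star = p * np.exp(-q)
u_star /= u_star.sum()            # normalize onto the simplex

kl = np.sum(u_star * np.log(u_star / p))
print(u_star, u_star @ q + kl)    # minimizer and its cost E_u[q] + KL(u || p)
```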

Simplifying the Bellman equation (first exit)

\[
\begin{aligned}
v(x) &= \min_u \left\{ \ell(x,u) + E_{x' \sim p(\cdot|x,u)}[v(x')] \right\} \\
&= \min_{u(\cdot|x)} \left\{ q(x) + E_{x' \sim u(\cdot|x)}\!\left[ \log\frac{u(x'|x)}{p(x'|x)} + \log\frac{1}{\exp(-v(x'))} \right] \right\} \\
&= \min_{u(\cdot|x)} \left\{ q(x) + E_{x' \sim u(\cdot|x)}\!\left[ \log\frac{u(x'|x)}{p(x'|x)\exp(-v(x'))} \right] \right\}
\end{aligned}
\]
The last term is an unnormalized KL divergence.

Definitions:
\[
\text{desirability function} \quad z(x) \triangleq \exp(-v(x)), \qquad
\text{next-state expectation} \quad P[z](x) \triangleq \sum_{x'} p(x'|x)\, z(x')
\]
\[
v(x) = \min_{u(\cdot|x)} \left\{ q(x) - \log P[z](x) + \mathrm{KL}\!\left( u(\cdot|x) \,\Big\|\, \frac{p(\cdot|x)\, z(\cdot)}{P[z](x)} \right) \right\}
\]

Linear Bellman equation and optimal control law

$\mathrm{KL}(p_1(\cdot)\,\|\,p_2(\cdot))$ achieves its global minimum of 0 iff $p_1 = p_2$, thus

Theorem (optimal control law)
\[
u^*(x'|x) = \frac{p(x'|x)\, z(x')}{P[z](x)}
\]

The Bellman equation becomes
\[
v(x) = q(x) - \log P[z](x), \qquad z(x) = \exp(-q(x))\, P[z](x)
\]
which can be written more explicitly as

Theorem (linear Bellman equation)
\[
z(x) = \begin{cases} \exp(-q(x)) \sum_{x'} p(x'|x)\, z(x') & x \text{ non-terminal} \\ \exp(-q_T(x)) & x \text{ terminal} \end{cases}
\]

Illustration

Figure: the optimal control law reweights the passive dynamics by the desirability of the next state, $u^*(x'|x) \propto p(x'|x)\, z(x')$, and next states are sampled from $u^*(\cdot|x)$.

Summary of results

Let $Q = \mathrm{diag}(\exp(-q))$ and $P_{xy} = p(y|x)$. Then we have
\[
\begin{array}{lll}
\text{first exit} & z = \exp(-q)\, P[z] & z = QPz \\
\text{finite horizon} & z_k = \exp(-q_k)\, P_k[z_{k+1}] & z_k = Q_k P_k\, z_{k+1} \\
\text{average cost} & z = \exp(c - q)\, P[z] & \lambda z = QPz \\
\text{discounted cost} & z = \exp(-q)\, P[z^\alpha] & z = QP\, z^\alpha
\end{array}
\]
In the first-exit problem we can also write
\[
z_{\mathcal{N}} = Q_{\mathcal{N}\mathcal{N}} P_{\mathcal{N}\mathcal{N}}\, z_{\mathcal{N}} + b
\;\;\Longrightarrow\;\;
z_{\mathcal{N}} = (I - Q_{\mathcal{N}\mathcal{N}} P_{\mathcal{N}\mathcal{N}})^{-1} b, \qquad
b \triangleq Q_{\mathcal{N}\mathcal{N}} P_{\mathcal{N}\mathcal{T}}\, \exp(-q_T)
\]
where $\mathcal{N}$, $\mathcal{T}$ are the sets of non-terminal and terminal states respectively. In the average-cost problem $\lambda = \exp(-c)$ is the principal eigenvalue, $c$ being the average cost per step.
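A minimal numerical sketch of the first-exit formula $z_{\mathcal{N}} = (I - Q_{\mathcal{N}\mathcal{N}} P_{\mathcal{N}\mathcal{N}})^{-1} b$; the random-walk chain, state costs and terminal state below are made-up illustration data, not from the slides.

```python
import numpy as np

# Minimal sketch of the first-exit solution z_N = (I - Q_NN P_NN)^{-1} b on a
# small chain; the chain, costs, and terminal set are made-up illustration data.
n = 6                                   # states 0..4 non-terminal, state 5 terminal
P = np.zeros((n, n))
for x in range(5):                      # passive dynamics: unbiased random walk
    P[x, max(x - 1, 0)] += 0.5
    P[x, x + 1] += 0.5
q = np.full(5, 0.2)                     # state cost at non-terminal states
q_T = np.array([0.0])                   # zero cost at the terminal state

N, T = np.arange(5), np.array([5])
Q_NN = np.diag(np.exp(-q))
b = Q_NN @ P[np.ix_(N, T)] @ np.exp(-q_T)
z_N = np.linalg.solve(np.eye(5) - Q_NN @ P[np.ix_(N, N)], b)

v = -np.log(z_N)                        # optimal cost-to-go at non-terminal states
print(v)
```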

Stationary distribution under the optimal control law

Let $\mu(x)$ denote the stationary distribution under the optimal control law $u^*(\cdot|x)$ in the average-cost problem. Then
\[
\mu(x') = \sum_x u^*(x'|x)\, \mu(x)
\]
Recall that
\[
u^*(x'|x) = \frac{p(x'|x)\, z(x')}{P[z](x)} = \frac{p(x'|x)\, z(x')}{\lambda \exp(q(x))\, z(x)}
\]
Defining $r(x) \triangleq \mu(x)/z(x)$, we have
\[
\mu(x') = \sum_x \frac{p(x'|x)\, z(x')}{\lambda \exp(q(x))\, z(x)}\, \mu(x)
\quad\Longrightarrow\quad
\lambda\, r(x') = \sum_x \exp(-q(x))\, p(x'|x)\, r(x)
\]
In vector notation this becomes
\[
\lambda r = (QP)^T r
\]
Thus $z$ and $r$ are the right and left principal eigenvectors of $QP$, and $\mu = z \cdot r$ (elementwise).
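A minimal sketch of this result, computing $z$ and $r$ as the right and left principal eigenvectors of $QP$ and forming $\mu = z \cdot r$; the random ergodic chain and costs below are made up.

```python
import numpy as np

# Minimal sketch of the average-cost result: z and r are the right and left
# principal eigenvectors of QP, and the stationary distribution is mu = z * r
# (elementwise). The chain and costs below are made-up illustration data.
rng = np.random.default_rng(0)
n = 5
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # ergodic passive dynamics
q = rng.random(n)                                            # state costs
QP = np.diag(np.exp(-q)) @ P

w_right, V_right = np.linalg.eig(QP)
w_left, V_left = np.linalg.eig(QP.T)
z = np.real(V_right[:, np.argmax(np.real(w_right))])
r = np.real(V_left[:, np.argmax(np.real(w_left))])
z, r = np.abs(z), np.abs(r)              # principal (Perron) eigenvectors can be taken positive

lam = np.max(np.real(w_right))           # principal eigenvalue; average cost c = -log(lam)
mu = z * r
mu /= mu.sum()                           # stationary distribution under u*
print(-np.log(lam), mu)
```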

Comparison to policy and value iteration

Figure: average cost versus CPU time (seconds) and versus number of updates for z iteration, policy iteration and value iteration, together with the optimal control (z iteration) and optimal cost-to-go (z iteration and policy iteration) over horizontal position and tangential velocity.

Application to deterministic shortest paths

Given a graph and a set $\mathcal{T}$ of goal states, define the first-exit LMDP:

$p(x'|x)$: random walk on the graph
$q(x) = \rho > 0$: constant cost at non-terminal states
$q_T(x) = 0$: zero cost at terminal states

For large $\rho$ the optimal cost-to-go $v^{(\rho)}$ is dominated by the state costs, since the KL-divergence control costs are bounded. Thus we have

Theorem
The length of the shortest path from state $x$ to a goal state is
\[
\lim_{\rho \to \infty} \frac{v^{(\rho)}(x)}{\rho}
\]
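A minimal sketch of this construction on a made-up five-node graph (not from the slides): solve the first-exit LMDP with constant cost $\rho$ and read off $-\log z(x)/\rho$ as an approximate shortest-path length.

```python
import numpy as np

# Minimal sketch of the shortest-path limit: solve the first-exit LMDP with
# constant state cost rho and compare v(x)/rho to graph distances. The small
# graph below is made-up illustration data.
edges = [(0, 1), (1, 2), (2, 3), (0, 4), (4, 3)]   # undirected; goal state = 3
n, goal = 5, 3
A = np.zeros((n, n))
for i, j in edges:
    A[i, j] = A[j, i] = 1.0
P = A / A.sum(axis=1, keepdims=True)               # random walk on the graph

rho = 50.0
N = [x for x in range(n) if x != goal]
Q_NN = np.exp(-rho) * np.eye(len(N))
b = Q_NN @ P[np.ix_(N, [goal])] @ np.array([1.0])  # exp(-q_T) = 1 at the goal
z_N = np.linalg.solve(np.eye(len(N)) - Q_NN @ P[np.ix_(N, N)], b)

print(-np.log(z_N) / rho)   # approx. shortest path lengths from states 0, 1, 2, 4 to 3
```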

Internet example

Performance on the graph of Internet routers as of 2003 (data from caida.org). There are 190914 nodes and 609066 undirected edges in the graph.

Figure: percentage of shortest-path lengths over-estimated by 1 (approximation error) as a function of shortest path length, and number of errors as a function of the parameter $\rho$; for large $\rho$ some values of $z(x)$ become numerically indistinguishable from 0.

Embedding of traditional MDPs

Given a traditional MDP with controls $\tilde u \in \tilde{\mathcal U}(x)$, transition probabilities $\tilde p(x'|x,\tilde u)$ and costs $\tilde\ell(x,\tilde u)$, we can construct an LMDP such that the controls corresponding to the MDP's transition probabilities have the same costs as in the MDP. This is done by constructing $p$ and $q$ such that for all $x$, $\tilde u \in \tilde{\mathcal U}(x)$:
\[
q(x) + \mathrm{KL}\big(\tilde p(\cdot|x,\tilde u)\,\|\,p(\cdot|x)\big) = \tilde\ell(x,\tilde u)
\quad\Longleftrightarrow\quad
q(x) - \sum_{x'} \tilde p(x'|x,\tilde u)\, \log p(x'|x) = \tilde\ell(x,\tilde u) + \tilde h(x,\tilde u)
\]
where $\tilde h$ is the entropy of $\tilde p(\cdot|x,\tilde u)$.

The construction is done separately for every $x$. Suppressing $x$, vectorizing over $\tilde u$ and defining $s = -\log p$,
\[
q\mathbf{1} + \tilde p\, s = \tilde b, \qquad \exp(-s)^T \mathbf{1} = 1
\]
$\tilde p$ and $\tilde b = \tilde\ell + \tilde h$ are known, $q$ and $s$ are unknown. Assuming $\tilde p$ is full rank,
\[
y = \tilde p^{-1} \tilde b, \qquad s = y - q\mathbf{1}, \qquad q = -\log\big(\exp(-y)^T \mathbf{1}\big)
\]
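A minimal sketch of the embedding at a single state, assuming $\tilde p$ is square and full rank; the three actions, transition probabilities and costs below are made up.

```python
import numpy as np

# Minimal sketch of the MDP-to-LMDP embedding at a single state x, following
# q*1 + Ptilde*s = b with exp(-s)^T 1 = 1. The MDP numbers below are made up.
Ptilde = np.array([[0.8, 0.1, 0.1],      # rows: transition probs of each action u~
                   [0.1, 0.8, 0.1],
                   [0.1, 0.1, 0.8]])
l_tilde = np.array([1.0, 2.0, 0.5])      # cost of each action at this state

h = -np.sum(Ptilde * np.log(Ptilde), axis=1)   # entropy of each row
b = l_tilde + h

y = np.linalg.solve(Ptilde, b)
q = -np.log(np.sum(np.exp(-y)))          # state cost of the embedding LMDP
s = y - q                                # s = -log p
p = np.exp(-s)                           # passive dynamics at this state

# check: each MDP action's cost is reproduced as q + KL(p~(.|x,u) || p(.|x))
kl = np.sum(Ptilde * np.log(Ptilde / p), axis=1)
print(np.allclose(q + kl, l_tilde))      # True
```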

Grid world example

Figure: MDP cost-to-go and LMDP (embedding) cost-to-go over a grid world, and a scatter plot of LMDP versus MDP cost-to-go with $R^2 = 0.986$.

Machine repair example

Figure: MDP and LMDP (embedding) cost-to-go over state and time step, a scatter plot of LMDP versus MDP cost-to-go with $R^2 = 0.993$, and performance on the MDP of the MDP policy (1.00), the LMDP policy (1.01) and a random control policy (1.64).

Continuous-time limit

Consider a continuous-state discrete-time LMDP where $p^{(h)}(x'|x)$ is the $h$-step transition probability of some continuous-time stochastic process, and $z^{(h)}(x)$ is the LMDP solution. The linear Bellman equation (first exit) is
\[
z^{(h)}(x) = \exp(-h\, q(x))\, E_{x' \sim p^{(h)}(\cdot|x)}\big[ z^{(h)}(x') \big]
\]
Let $z = \lim_{h \downarrow 0} z^{(h)}$. Taking the limit directly yields the trivial identity $z(x) = z(x)$, but we can rearrange as
\[
\lim_{h \downarrow 0} \frac{\exp(h\, q(x)) - 1}{h}\, z^{(h)}(x)
= \lim_{h \downarrow 0} \frac{E_{x' \sim p^{(h)}(\cdot|x)}\big[ z^{(h)}(x') \big] - z^{(h)}(x)}{h}
\]
Recalling the definition of the generator $L$, we now have
\[
q(x)\, z(x) = L[z](x)
\]
If the underlying process is an Ito diffusion, the generator is
\[
L[z](x) = a(x)^T z_x(x) + \tfrac{1}{2}\,\mathrm{trace}\big(\Sigma(x)\, z_{xx}(x)\big)
\]

Linearly-solvable controlled diffusions

Above $z$ was defined as the continuous-time limit of the LMDP solutions $z^{(h)}$. But is $z$ the solution to a continuous-time problem, and if so, what problem?
\[
dx = (a(x) + B(x)\, u)\, dt + C(x)\, d\omega, \qquad
\ell(x,u) = q(x) + \tfrac{1}{2}\, u^T R(x)\, u
\]
Recall that for such problems we have
\[
u^* = -R^{-1} B^T v_x
\quad\text{and}\quad
0 = q + a^T v_x - \tfrac{1}{2}\, v_x^T B R^{-1} B^T v_x + \tfrac{1}{2}\,\mathrm{tr}\big(C C^T v_{xx}\big)
\]
Define $z(x) = \exp(-v(x))$ and write the PDE in terms of $z$:
\[
v_x = -\frac{z_x}{z}, \qquad v_{xx} = -\frac{z_{xx}}{z} + \frac{z_x z_x^T}{z^2}
\]
\[
0 = q - \frac{1}{z}\left( a^T z_x + \tfrac{1}{2}\,\mathrm{tr}\big(C C^T z_{xx}\big) \right)
+ \frac{1}{2 z^2}\, z_x^T C C^T z_x - \frac{1}{2 z^2}\, z_x^T B R^{-1} B^T z_x
\]
Now if $C C^T = B R^{-1} B^T$, we obtain the linear HJB equation $q z = L[z]$.
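A minimal sketch (not from the slides) that solves the linear HJB $qz = L[z]$ for a one-dimensional first-exit problem by finite differences; the drift, noise level, state cost and boundary conditions are all made-up illustration choices.

```python
import numpy as np

# Minimal sketch: solve q z = a z' + 0.5 sigma^2 z'' on [0, 1] with first-exit
# boundary conditions z = exp(-q_T) = 1, then recover v = -log z and u* = -R^{-1} v_x.
n, sigma = 201, 0.3
x = np.linspace(0.0, 1.0, n)
h = x[1] - x[0]
a = -np.sin(2 * np.pi * x)          # drift of the passive dynamics (hypothetical)
q = 5.0 * (x - 0.5) ** 2            # state cost (hypothetical)

A = np.zeros((n, n))
rhs = np.zeros(n)
A[0, 0] = A[-1, -1] = 1.0
rhs[0] = rhs[-1] = 1.0              # boundary condition z = exp(-q_T) with q_T = 0
for i in range(1, n - 1):           # interior rows: a z' + 0.5 sigma^2 z'' - q z = 0
    A[i, i - 1] = -a[i] / (2 * h) + 0.5 * sigma**2 / h**2
    A[i, i]     = -sigma**2 / h**2 - q[i]
    A[i, i + 1] =  a[i] / (2 * h) + 0.5 * sigma**2 / h**2

z = np.linalg.solve(A, rhs)
v = -np.log(z)                               # optimal cost-to-go
u_star = sigma**2 * np.gradient(z, x) / z    # u* = -R^{-1} v_x with R = 1/sigma^2
print(v[::50], u_star[::50])
```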

Quadratic control cost and KL divergence

The KL divergence between two Gaussians with means $\mu_1, \mu_2$ and common full-rank covariance $\Sigma$ is $\tfrac{1}{2}(\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2)$. Using Euler discretization of the controlled diffusion, the passive and controlled dynamics have means $x + h a$ and $x + h a + h B u$, and common covariance $h C C^T$. Thus the KL-divergence control cost is
\[
\tfrac{1}{2}\, h\, u^T B^T \big(h\, C C^T\big)^{-1} h\, B u
= \tfrac{h}{2}\, u^T B^T \big(B R^{-1} B^T\big)^{-1} B u
= \tfrac{h}{2}\, u^T R\, u
\]
This is the quadratic control cost accumulated over time $h$. Here we used $C C^T = B R^{-1} B^T$ and assumed that $B$ is full rank. If $B$ is rank-deficient, the same result holds but the Gaussians are defined over the subspace spanned by the columns of $B$.

Summary of results

\[
\begin{array}{lll}
 & \text{discrete time} & \text{continuous time} \\
\text{first exit} & \exp(q)\, z = P[z] & q z = L[z] \\
\text{finite horizon} & \exp(q_k)\, z_k = P_k[z_{k+1}] & q z - z_t = L[z] \\
\text{average cost} & \exp(q - c)\, z = P[z] & (q - c)\, z = L[z] \\
\text{discounted cost} & \exp(q)\, z = P[z^\alpha] & q z - z \log(z^\alpha) = L[z]
\end{array}
\]

The relation between $P[z]$ and $L[z]$ is
\[
P[z](x) = E_{x' \sim p(\cdot|x)}[z(x')], \qquad
L[z](x) = \lim_{h \downarrow 0} \frac{E_{x' \sim p^{(h)}(\cdot|x)}[z(x')] - z(x)}{h}
= \lim_{h \downarrow 0} \frac{P^{(h)}[z](x) - z(x)}{h}
\]
\[
P^{(h)}[z](x) = z(x) + h\, L[z](x) + O(h^2)
\]

Path-integral representation

We can unfold the linear Bellman equation (first exit) as
\[
\begin{aligned}
z(x) &= \exp(-q(x))\, E_{x' \sim p(\cdot|x)}[z(x')] \\
&= \exp(-q(x))\, E_{x' \sim p(\cdot|x)}\big[ \exp(-q(x'))\, E_{x'' \sim p(\cdot|x')}[z(x'')] \big] = \cdots \\
&= E_{\substack{x_0 = x \\ x_{k+1} \sim p(\cdot|x_k)}} \left[ \exp\Big( -q_T(x_{t_{\mathrm{first}}}) - \sum_{k=0}^{t_{\mathrm{first}}-1} q(x_k) \Big) \right]
\end{aligned}
\]
This is a path-integral representation of $z$. Since $\mathrm{KL}(p\,\|\,p) = 0$ (trajectories sampled from the passive dynamics incur no control cost), we have
\[
\exp\big( -E_{\mathrm{optimal}}[\text{total cost}] \big) = z(x) = E_{\mathrm{passive}}\big[ \exp(-\text{total cost}) \big]
\]
In continuous problems, the Feynman-Kac theorem states that the unique positive solution $z$ to the parabolic PDE
\[
q z = a^T z_x + \tfrac{1}{2}\,\mathrm{tr}\big(C C^T z_{xx}\big)
\]
has the same path-integral representation:
\[
z(x) = E_{\substack{x(0) = x \\ dx = a(x)dt + C(x)d\omega}} \left[ \exp\Big( -q_T(x(t_{\mathrm{first}})) - \int_0^{t_{\mathrm{first}}} q(x(t))\, dt \Big) \right]
\]
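A minimal Monte Carlo sketch of this representation (not from the slides), estimating $z(x_0)$ for the same made-up one-dimensional diffusion used in the finite-difference sketch above by averaging $\exp(-\text{accumulated cost})$ over passive rollouts.

```python
import numpy as np

# Minimal sketch: Feynman-Kac / path-integral estimate of
# z(x0) = E[exp(-q_T(x(t_first)) - ∫ q(x) dt)] for a 1-D diffusion.
# The drift a(x), cost q(x), sigma and step size are made-up illustration choices.
rng = np.random.default_rng(1)
sigma, dt, x0 = 0.3, 1e-3, 0.5
a = lambda x: -np.sin(2 * np.pi * x)
q = lambda x: 5.0 * (x - 0.5) ** 2

def rollout():
    x, cost = x0, 0.0
    while 0.0 < x < 1.0:                       # first exit from (0, 1)
        cost += q(x) * dt                      # accumulate state cost
        x += a(x) * dt + sigma * np.sqrt(dt) * rng.standard_normal()
    return np.exp(-cost)                       # q_T = 0 at the boundaries

z_hat = np.mean([rollout() for _ in range(1000)])
print(-np.log(z_hat))                          # Monte Carlo estimate of v(x0)
```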

Model-free learning

The solution to the linear Bellman equation
\[
z(x) = \exp(-q(x))\, E_{x' \sim p(\cdot|x)}[z(x')]
\]
can be approximated in a model-free way given samples $(x_n, x'_n, q_n = q(x_n))$ obtained from the passive dynamics $x'_n \sim p(\cdot|x_n)$.

One possibility is a Monte Carlo method based on the path-integral representation, although convergence can be slow:
\[
\hat z(x) = \frac{1}{\#\text{trajectories starting at } x} \sum \exp(-\text{trajectory cost})
\]
Faster convergence is obtained using temporal-difference learning (Z-learning):
\[
\hat z(x_n) \leftarrow (1 - \beta)\, \hat z(x_n) + \beta\, \exp(-q_n)\, \hat z(x'_n)
\]
The learning rate $\beta$ should decrease over time.
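A minimal sketch of the Z-learning update on the made-up random-walk chain used earlier; the fixed learning rate and episode count are arbitrary illustration choices (as the slides note, $\beta$ should decrease over time in practice).

```python
import numpy as np

# Minimal sketch of Z-learning: z(x_n) <- (1-beta) z(x_n) + beta exp(-q_n) z(x'_n),
# using samples from the passive dynamics of a small first-exit chain (made-up data).
rng = np.random.default_rng(2)
n_states, terminal = 6, 5
q = np.full(n_states, 0.2); q[terminal] = 0.0

def passive_step(x):                     # unbiased random walk, clipped at the ends
    return min(max(x + rng.choice([-1, 1]), 0), n_states - 1)

z_hat = np.ones(n_states)                # z = exp(-q_T) = 1 is exact at the terminal state
beta = 0.1                               # a decreasing schedule would be used in practice
for episode in range(5000):
    x = rng.integers(0, terminal)        # start at a random non-terminal state
    while x != terminal:
        x_next = passive_step(x)
        z_hat[x] = (1 - beta) * z_hat[x] + beta * np.exp(-q[x]) * z_hat[x_next]
        x = x_next

print(-np.log(z_hat))                    # approximate optimal cost-to-go
```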

Importance sampling

The expectation of a function $f(x)$ under a distribution $p(x)$ can be approximated as
\[
E_{x \sim p(\cdot)}[f(x)] \approx \frac{1}{N} \sum_n f(x_n)
\]
where $\{x_n\}_{n=1}^N$ are i.i.d. samples from $p(\cdot)$. However, if $f(x)$ has interesting behavior in regions where $p(x)$ is small, convergence can be slow, i.e. we may need a very large $N$ to obtain an accurate approximation. In the case of Z-learning, the passive dynamics may rarely take the state to regions with low cost.

Importance sampling is a general (unbiased) method for speeding up convergence. Let $q(x)$ be some other distribution which is better "adapted" to $f(x)$, and let $\{x_n\}$ now be samples from $q(\cdot)$. Then
\[
E_{x \sim p(\cdot)}[f(x)] \approx \frac{1}{N} \sum_n \frac{p(x_n)}{q(x_n)}\, f(x_n)
\]
This is essential for particle filters.
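A minimal generic sketch (not specific to LMDPs): estimating a rare-event probability under a standard normal $p$ by sampling from a shifted proposal $q$ and reweighting by $p/q$; all numbers are made up.

```python
import numpy as np

# Minimal sketch of importance sampling: estimate E_{x~p}[f(x)] with samples
# from a proposal q, weighting each sample by p(x)/q(x).
rng = np.random.default_rng(3)
f = lambda x: (x > 3.0).astype(float)            # rare event under p = N(0, 1)
p_pdf = lambda x: np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)
q_pdf = lambda x: np.exp(-(x - 3.5)**2 / 2) / np.sqrt(2 * np.pi)   # proposal N(3.5, 1)

N = 10_000
x_p = rng.standard_normal(N)                     # naive Monte Carlo
x_q = 3.5 + rng.standard_normal(N)               # importance sampling
naive = f(x_p).mean()
weighted = (p_pdf(x_q) / q_pdf(x_q) * f(x_q)).mean()

print(naive, weighted)   # true value P(x > 3) ≈ 0.00135; the IS estimate has much lower variance
```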

Greedy Z-learning

Let $\hat u(x'|x)$ denote the greedy control law, i.e. the control law which would be optimal if the current approximation $\hat z(x)$ were the exact desirability function. Then we can sample from $\hat u$ rather than $p$ and use importance sampling:
\[
\hat z(x_n) \leftarrow (1 - \beta)\, \hat z(x_n) + \beta\, \frac{p(x'_n|x_n)}{\hat u(x'_n|x_n)}\, \exp(-q_n)\, \hat z(x'_n)
\]
We now need access to the model $p(x'|x)$ of the passive dynamics.

Figure: cost-to-go approximation error and log total cost over 50,000 state transitions, for Q-learning (passive and ε-greedy exploration) and Z-learning (passive and greedy sampling).
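A minimal sketch of greedy Z-learning with the importance-sampling correction $p/\hat u$, reusing the made-up chain from the Z-learning sketch; note that it requires the passive transition model $P$.

```python
import numpy as np

# Minimal sketch of greedy Z-learning: sample transitions from the greedy law
# u_hat ∝ p(x'|x) z_hat(x') and reweight the TD update by p(x'|x)/u_hat(x'|x).
rng = np.random.default_rng(4)
n_states, terminal = 6, 5
q = np.full(n_states, 0.2); q[terminal] = 0.0
P = np.zeros((n_states, n_states))
for x in range(terminal):                        # passive dynamics: random walk
    P[x, max(x - 1, 0)] += 0.5
    P[x, min(x + 1, n_states - 1)] += 0.5

z_hat = np.ones(n_states)
beta = 0.1
for episode in range(5000):
    x = rng.integers(0, terminal)
    while x != terminal:
        u_hat = P[x] * z_hat                     # greedy control law u ∝ p(x'|x) z(x')
        u_hat /= u_hat.sum()
        x_next = rng.choice(n_states, p=u_hat)   # sample from the greedy law
        w = P[x, x_next] / u_hat[x_next]         # importance weight p/u
        z_hat[x] = (1 - beta) * z_hat[x] + beta * w * np.exp(-q[x]) * z_hat[x_next]
        x = x_next

print(-np.log(z_hat[:terminal]))                 # approximate optimal cost-to-go
```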

Maximum principle for the most likely trajectory

Recall that for finite-horizon LMDPs we have
\[
u^*_k(x'|x) = \frac{\exp(-q(x))\, p(x'|x)\, z_{k+1}(x')}{z_k(x)}
\]
The probability that the optimally-controlled stochastic system initialized at state $x_0$ generates a given trajectory $x_1, x_2, \ldots, x_T$ is
\[
p(x_1, x_2, \ldots, x_T | x_0) = \prod_{k=0}^{T-1} u^*_k(x_{k+1}|x_k)
= \prod_{k=0}^{T-1} \frac{\exp(-q(x_k))\, p(x_{k+1}|x_k)\, z_{k+1}(x_{k+1})}{z_k(x_k)}
= \frac{\exp(-q_T(x_T))}{z_0(x_0)} \prod_{k=0}^{T-1} \exp(-q(x_k))\, p(x_{k+1}|x_k)
\]
(the intermediate $z$'s telescope, and $z_T(x_T) = \exp(-q_T(x_T))$).

Theorem (LMDP maximum principle)
The most likely trajectory under this distribution coincides with the optimal trajectory for a deterministic finite-horizon problem with final cost $q_T(x)$, dynamics $x' = f(x,u)$ where $f$ can be arbitrary, and immediate cost $\ell(x,u) = q(x) - \log p(f(x,u)\,|\,x)$.
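A minimal sketch of the deterministic problem stated in the theorem: a backward dynamic-programming (Viterbi-style) pass that minimizes $q_T(x_T) + \sum_k \big[ q(x_k) - \log p(x_{k+1}|x_k) \big]$ over trajectories; the chain, horizon and costs below are made up.

```python
import numpy as np

# Minimal sketch of the LMDP maximum principle: the most likely trajectory under
# the optimally-controlled dynamics minimizes q_T(x_T) + sum_k [q(x_k) - log p(x_{k+1}|x_k)],
# a deterministic DP over next-state choices.
rng = np.random.default_rng(5)
n, T = 5, 10
P = rng.random((n, n)); P /= P.sum(axis=1, keepdims=True)   # passive dynamics (made up)
q = rng.random(n)                                           # running state cost
q_T = rng.random(n)                                         # final cost

cost_to_go = q_T.copy()                  # backward pass of the deterministic problem
policy = np.zeros((T, n), dtype=int)
for k in range(T - 1, -1, -1):
    step = q[:, None] - np.log(P) + cost_to_go[None, :]     # cost of choosing x -> x'
    policy[k] = np.argmin(step, axis=1)
    cost_to_go = np.min(step, axis=1)

x = 0                                    # roll out the most likely trajectory from x_0 = 0
traj = [x]
for k in range(T):
    x = policy[k, x]
    traj.append(x)
print(traj)
```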

Trajectory probabilities in continuous time

There is no formula for the probability of a trajectory under the Ito diffusion $dx = a(x)\,dt + C\,d\omega$. However the relative probabilities of two trajectories $\varphi(t)$ and $\psi(t)$ can be defined using the Onsager-Machlup functional:
\[
\mathrm{OM}[\varphi(\cdot), \psi(\cdot)] \triangleq \lim_{\varepsilon \to 0}
\frac{p\big( \sup_t |x(t) - \varphi(t)| < \varepsilon \big)}{p\big( \sup_t |x(t) - \psi(t)| < \varepsilon \big)}
\]
It can be shown that
\[
\mathrm{OM}[\varphi(\cdot), \psi(\cdot)] = \exp\left( \int_0^T L\big(\psi(t), \dot\psi(t)\big)\, dt - \int_0^T L\big(\varphi(t), \dot\varphi(t)\big)\, dt \right)
\]
where
\[
L[x, v] \triangleq \tfrac{1}{2}\, (a(x) - v)^T \big(C C^T\big)^{-1} (a(x) - v) + \tfrac{1}{2}\, \mathrm{div}(a(x))
\]
We can then fix $\psi(t)$ and define the relative probability of a trajectory as
\[
p_{\mathrm{OM}}(\varphi(\cdot)) = \exp\left( -\int_0^T L\big(\varphi(t), \dot\varphi(t)\big)\, dt \right)
\]

Continuous-time maximum principle

It can be shown that the trajectory maximizing $p_{\mathrm{OM}}(\cdot)$ under the optimally-controlled stochastic dynamics for the problem
\[
dx = a(x)\,dt + B\,(u\,dt + \sigma\,d\omega), \qquad \ell(x,u) = q(x) + \frac{1}{2\sigma^2}\, \|u\|^2
\]
coincides with the optimal trajectory for the deterministic problem
\[
\dot x = a(x) + B u, \qquad \ell(x,u) = q(x) + \frac{1}{2\sigma^2}\, \|u\|^2 + \tfrac{1}{2}\, \mathrm{div}(a(x))
\]
Example: $dx = (a(x) + u)\,dt + \sigma\,d\omega$ with $\ell(x,u) = 1 + \frac{1}{2\sigma^2}\, u^2$. Figure: the drift $a(x)$ and the divergence-corrected cost $1 + \tfrac{1}{2}\,\mathrm{div}(a(x))$ plotted over position $x \in [-5, +5]$.

Example

Figure: $\mu(x,t)$, $r(x,t)$ and $z(x,t)$ plotted over position $x \in [-5, +5]$ and time $t \in [0, 5]$, for noise levels $\sigma = 0.6$ and $\sigma = 1.2$.