Linearly-Solvable Stochastic Optimal Control Problems

Emo Todorov
Applied Mathematics and Computer Science & Engineering, University of Washington
AMATH/CSE 579, Winter 2014
Problem formulation

In traditional MDPs the controller chooses actions $u$ which in turn specify the transition probabilities $p(x'|x,u)$. We can obtain a linearly-solvable MDP (LMDP) by allowing the controller to specify these probabilities directly:

  controlled dynamics: $u(\cdot|x)$
  passive dynamics: $p(\cdot|x)$
  feasible control set: $\mathcal{U}(x) = \{u(\cdot|x) : p(x'|x) = 0 \Rightarrow u(x'|x) = 0\}$

The immediate cost is of the form
$$\ell(x, u(\cdot|x)) = q(x) + \mathrm{KL}\left(u(\cdot|x) \,\|\, p(\cdot|x)\right)$$
$$\mathrm{KL}\left(u(\cdot|x) \,\|\, p(\cdot|x)\right) = \sum_{x'} u(x'|x) \log \frac{u(x'|x)}{p(x'|x)} = E_{x' \sim u(\cdot|x)}\left[\log \frac{u(x'|x)}{p(x'|x)}\right]$$

Thus the controller can impose any dynamics it wishes; however, it pays a price (the KL-divergence control cost) for pushing the system away from its passive dynamics.
Understanding the KL divergence cost

[Figure: three panels illustrating the cost $E(q) + \mathrm{KL}(u\|p)$: the KL cost over the probability simplex $(u(1|x), u(2|x), u(3|x))$ relative to the passive dynamics $p$; how to bias a coin, showing the optimum $u^*$ for the probability of Heads; and the benefits of error tolerance.]
Simplifying the Bellman equation (first exit)

$$v(x) = \min_{u(\cdot|x)} \left\{ \ell(x,u) + E_{x' \sim p(\cdot|x,u)}\left[v(x')\right] \right\}
= \min_{u(\cdot|x)} \left\{ q(x) + E_{x' \sim u(\cdot|x)}\left[\log \frac{u(x'|x)}{p(x'|x)} + \log \frac{1}{\exp(-v(x'))}\right] \right\}
= \min_{u(\cdot|x)} \left\{ q(x) + E_{x' \sim u(\cdot|x)}\left[\log \frac{u(x'|x)}{p(x'|x)\exp(-v(x'))}\right] \right\}$$

The last term is an unnormalized KL divergence...

Definitions
  desirability function: $z(x) \triangleq \exp(-v(x))$
  next-state expectation: $P[z](x) \triangleq \sum_{x'} p(x'|x)\, z(x')$

$$v(x) = \min_{u(\cdot|x)} \left\{ q(x) - \log P[z](x) + \mathrm{KL}\left(u(\cdot|x) \,\Big\|\, \frac{p(\cdot|x)\, z(\cdot)}{P[z](x)}\right) \right\}$$
Linear Bellman equation and optimal control law

$\mathrm{KL}(p_1(\cdot) \,\|\, p_2(\cdot))$ achieves its global minimum of 0 iff $p_1 = p_2$, thus

Theorem (optimal control law)
$$u^*(x'|x) = \frac{p(x'|x)\, z(x')}{P[z](x)}$$

The Bellman equation becomes
$$v(x) = q(x) - \log P[z](x) \quad\Longleftrightarrow\quad z(x) = \exp(-q(x))\, P[z](x)$$
which can be written more explicitly as

Theorem (linear Bellman equation)
$$z(x) = \begin{cases} \exp(-q(x)) \sum_{x'} p(x'|x)\, z(x') & x \text{ non-terminal} \\ \exp(-q_T(x)) & x \text{ terminal} \end{cases}$$
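To make the linear Bellman equation concrete, here is a minimal numerical sketch (not from the slides; the 4-state chain, passive dynamics, and state costs are invented for illustration). We iterate $z \leftarrow \exp(-q)\,P[z]$ with the terminal condition held fixed, then read off $v = -\log z$ and the optimal control law:

```python
import numpy as np

# Hypothetical first-exit problem: states 0-2 are non-terminal, state 3 is terminal.
# Passive dynamics: a lazy random walk on a 4-state chain (terminal state absorbing).
P = np.array([[0.50, 0.50, 0.00, 0.00],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.00, 0.00, 0.00, 1.00]])
q = np.array([0.2, 0.2, 0.2, 0.0])           # state costs; q_T = 0 at the terminal state
terminal = np.array([False, False, False, True])

# Fixed-point iteration on z(x) = exp(-q(x)) * sum_x' p(x'|x) z(x')
z = np.ones(4)
for _ in range(2000):
    z = np.exp(-q) * (P @ z)
    z[terminal] = np.exp(-q[terminal])       # boundary condition z = exp(-q_T)

v = -np.log(z)                               # optimal cost-to-go
U = P * z[None, :] / (P @ z)[:, None]        # optimal control law u*(x'|x)
```

The iteration contracts here because every non-terminal state pays a positive cost, so $z$ converges to the unique fixed point; states farther from the goal have smaller desirability and larger cost-to-go.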
Illustration

[Figure: the passive dynamics $p(x'|x)$ are reshaped by the desirability $z(x')$ into the optimal control law $u^*(x'|x) \propto p(x'|x)\, z(x')$; sample next states $x'$ are drawn from $u^*(\cdot|x)$.]
Summary of results

Let $Q = \mathrm{diag}(\exp(-q))$ and $P_{xy} = p(y|x)$. Then we have

  first exit:       $z = \exp(-q)\, P[z]$              $z = QPz$
  finite horizon:   $z_k = \exp(-q_k)\, P_k[z_{k+1}]$   $z_k = Q_k P_k z_{k+1}$
  average cost:     $z = \exp(c - q)\, P[z]$           $\lambda z = QPz$
  discounted cost:  $z = \exp(-q)\, P[z^\alpha]$        $z = QP z^\alpha$

In the first-exit problem we can also write
$$z_N = Q_{NN} P_{NN}\, z_N + b \quad\Longrightarrow\quad z_N = (I - Q_{NN} P_{NN})^{-1} b, \qquad b \triangleq Q_{NN} P_{NT} \exp(-q_T)$$
where $N$, $T$ are the sets of non-terminal and terminal states respectively. In the average-cost problem $\lambda = \exp(-c)$ is the principal eigenvalue of $QP$.
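The first-exit matrix formula can be checked directly; the chain and costs below are made up for the sketch, and the test is the fixed-point identity $z_N = Q_{NN}P_{NN}z_N + b$ itself:

```python
import numpy as np

# Hypothetical first-exit LMDP: N = {0,1,2} non-terminal, T = {3} terminal.
P = np.array([[0.50, 0.50, 0.00, 0.00],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.00, 0.00, 0.00, 1.00]])
q = np.array([0.2, 0.2, 0.2, 0.0])
N, T = [0, 1, 2], [3]

Q_NN = np.diag(np.exp(-q[N]))                # diagonal of exp(-q) over non-terminal states
P_NN = P[np.ix_(N, N)]
P_NT = P[np.ix_(N, T)]

b = Q_NN @ P_NT @ np.exp(-q[T])              # b = Q_NN P_NT exp(-q_T)
z_N = np.linalg.solve(np.eye(3) - Q_NN @ P_NN, b)   # z_N = (I - Q_NN P_NN)^{-1} b
```

A direct linear solve like this is exact in one step, whereas the fixed-point iteration trades that for lower per-step cost on large state spaces.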
Stationary distribution under the optimal control law

Let $\mu(x)$ denote the stationary distribution under the optimal control law $u^*(\cdot|x)$ in the average-cost problem. Then
$$\mu(x') = \sum_x u^*(x'|x)\, \mu(x)$$

Recall that
$$u^*(x'|x) = \frac{p(x'|x)\, z(x')}{P[z](x)} = \frac{p(x'|x)\, z(x')}{\lambda \exp(q(x))\, z(x)}$$

Defining $r(x) \triangleq \mu(x)/z(x)$, we have
$$\mu(x') = \sum_x \frac{p(x'|x)\, z(x')}{\lambda \exp(q(x))\, z(x)}\, \mu(x)
\quad\Longrightarrow\quad
\lambda\, r(x') = \sum_x \exp(-q(x))\, p(x'|x)\, r(x)$$

In vector notation this becomes
$$\lambda r = (QP)^T r$$
Thus $z$ and $r$ are the right and left principal eigenvectors of $QP$, and $\mu = z \cdot r$ (elementwise product).
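A numerical sketch of this eigenvector characterization (the 3-state ergodic chain and its costs are invented): compute the right and left principal eigenvectors of $QP$, form $\mu = z \cdot r$, and check that $\mu$ is indeed stationary under $u^*$.

```python
import numpy as np

# Hypothetical ergodic 3-state chain for the average-cost problem.
P = np.array([[0.1, 0.6, 0.3],
              [0.4, 0.2, 0.4],
              [0.5, 0.3, 0.2]])
q = np.array([0.0, 0.5, 1.0])
QP = np.diag(np.exp(-q)) @ P

# z: right principal eigenvector; r: left principal eigenvector; lambda = exp(-c)
w, V = np.linalg.eig(QP)
lam = w.real.max()                            # Perron root of the nonnegative matrix QP
z = np.abs(V[:, w.real.argmax()].real)        # Perron vector is positive up to sign
w2, V2 = np.linalg.eig(QP.T)
r = np.abs(V2[:, w2.real.argmax()].real)

mu = z * r                                    # stationary distribution, up to scale
mu /= mu.sum()
U = P * z[None, :] / (P @ z)[:, None]         # optimal control law u*(x'|x)
c = -np.log(lam)                              # average cost per step
```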
Comparison to policy and value iteration

[Figure: average cost of Z iteration vs policy iteration as a function of CPU time (seconds), and of Z, policy, and value iteration as a function of the number of updates ($10$ to $10^4$); also shown are the optimal control and optimal cost-to-go computed by Z iteration and policy iteration over horizontal position ($-3$ to $+3$) and tangential velocity ($-9$ to $+9$).]
Application to deterministic shortest paths

Given a graph and a set $T$ of goal states, define the first-exit LMDP:
  $p(x'|x)$: random walk on the graph
  $q(x) = \rho > 0$: constant cost at non-terminal states
  $q_T(x) = 0$: zero cost at terminal states

For large $\rho$ the optimal cost-to-go $v^{(\rho)}$ is dominated by the state costs, since the KL-divergence control costs are bounded. Thus we have

Theorem
The length of the shortest path from state $x$ to a goal state is
$$\lim_{\rho \to \infty} \frac{v^{(\rho)}(x)}{\rho}$$
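A small sketch of this reduction (the graph is invented for illustration): take a 5-node graph with goal node 4, solve the first-exit LMDP with a moderately large $\rho$, and round $v^{(\rho)}/\rho$ to recover the graph distances.

```python
import numpy as np

# Hypothetical graph: edges 0-1, 1-2, 2-3, 3-4, 0-3; goal (terminal) node is 4.
# True shortest-path lengths to the goal from nodes 0-3: [2, 3, 2, 1].
adj = np.zeros((5, 5))
for i, j in [(0, 1), (1, 2), (2, 3), (3, 4), (0, 3)]:
    adj[i, j] = adj[j, i] = 1.0
P = adj / adj.sum(axis=1, keepdims=True)     # random walk on the graph

rho = 30.0                                   # large constant state cost
N = [0, 1, 2, 3]                             # non-terminal nodes
P_NN, P_NT = P[np.ix_(N, N)], P[np.ix_(N, [4])]
b = np.exp(-rho) * (P_NT @ np.ones(1))       # q_T = 0 at the goal
z_N = np.linalg.solve(np.eye(4) - np.exp(-rho) * P_NN, b)

dist = np.round(-np.log(z_N) / rho)          # approximate shortest-path lengths
```

The correction to $v^{(\rho)}/\rho$ is at most the accumulated KL cost divided by $\rho$ (bounded by $\log(\text{degree})$ per step here), so rounding recovers the exact distances once $\rho$ is large enough.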
Internet example

Performance on the graph of Internet routers as of 2003 (data from caida.org). There are 190914 nodes and 609066 undirected edges in the graph.

[Figure: estimated vs true shortest-path lengths (1 to 15), with shortest-path lengths over-estimated by at most 1; the percentage of errors decreases from about 2% toward 0% as the parameter $\rho$ grows from 25 to 70, while for large $\rho$ some values of $z(x)$ become numerically indistinguishable from 0.]
Embedding of traditional MDPs

Given a traditional MDP with controls $\tilde u \in \tilde{\mathcal{U}}(x)$, transition probabilities $\tilde p(x'|x,\tilde u)$ and costs $\tilde\ell(x,\tilde u)$, we can construct an LMDP such that the controls corresponding to the MDP's transition probabilities have the same costs as in the MDP. This is done by constructing $p$ and $q$ such that for all $x$, $\tilde u \in \tilde{\mathcal{U}}(x)$:
$$q(x) + \mathrm{KL}\left(\tilde p(\cdot|x,\tilde u) \,\|\, p(\cdot|x)\right) = \tilde\ell(x,\tilde u)$$
$$q(x) - \sum_{x'} \tilde p(x'|x,\tilde u) \log p(x'|x) = \tilde\ell(x,\tilde u) + \tilde h(x,\tilde u)$$
where $\tilde h$ is the entropy of $\tilde p(\cdot|x,\tilde u)$.

The construction is done separately for every $x$. Suppressing $x$, vectorizing over $\tilde u$, and defining $s = -\log p$:
$$q\mathbf{1} + \tilde P s = \tilde b, \qquad \exp(-s)^T \mathbf{1} = 1$$
Here $\tilde P$ and $\tilde b = \tilde\ell + \tilde h$ are known; $q$ and $s$ are unknown. Assuming $\tilde P$ is full rank,
$$y = \tilde P^{-1} \tilde b, \qquad s = y - q\mathbf{1}, \qquad q = -\log\left(\exp(-y)^T \mathbf{1}\right)$$
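The per-state construction can be carried out numerically; the 3-action transition matrix $\tilde P$ and costs $\tilde\ell$ below are invented for the sketch, and the check is exactly the defining property $q + \mathrm{KL}(\tilde p(\cdot|\tilde u)\,\|\,p) = \tilde\ell(\tilde u)$:

```python
import numpy as np

# Hypothetical single state of a traditional MDP: 3 actions, 3 next states.
Pt = np.array([[0.7, 0.2, 0.1],     # rows: transition probabilities p~(.|x,u~)
               [0.1, 0.8, 0.1],
               [0.2, 0.3, 0.5]])
lt = np.array([1.0, 0.5, 2.0])      # costs l~(x,u~)

h = -np.sum(Pt * np.log(Pt), axis=1)          # entropies h~(x,u~)
b = lt + h                                     # b~ = l~ + h~

y = np.linalg.solve(Pt, b)                     # y = P~^{-1} b~
q = -np.log(np.exp(-y).sum())                  # q = -log(exp(-y)^T 1)
s = y - q                                      # s = y - q 1
p = np.exp(-s)                                 # embedded passive dynamics, sums to 1
```

Note that $q$ is pinned down by the normalization constraint alone: $q\mathbf{1} + \tilde P s = \tilde b$ holds for any shift between $q$ and $s$ because the rows of $\tilde P$ sum to 1.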
Grid world example X MDP cost to go 40 X X LMDP cost to go 30 20 10 R 2 = 0.986 LMDP cost to go 0 0 20 40 60 MDP cost to go X Emo Todorov (UW) AMATH/CSE 579, Winter 2014 Winter 2014 13 / 26
Machine repair example 100 60 MDP cost to go state LMDP cost to go 40 20 R 2 = 0.993 LMDP cost to go 1 1 time step 50 0 0 20 40 60 MDP cost to go performance on MDP 1.00 1.01 1.64 MDP LMDP random control policy Emo Todorov (UW) AMATH/CSE 579, Winter 2014 Winter 2014 14 / 26
Continuous-time limit

Consider a continuous-state discrete-time LMDP where $p^{(h)}(x'|x)$ is the $h$-step transition probability of some continuous-time stochastic process, and $z^{(h)}(x)$ is the LMDP solution. The linear Bellman equation (first exit) is
$$z^{(h)}(x) = \exp(-h\, q(x))\, E_{x' \sim p^{(h)}(\cdot|x)}\left[z^{(h)}(x')\right]$$

Let $z = \lim_{h \downarrow 0} z^{(h)}$. The limit yields $z(x) = z(x)$, but we can rearrange as
$$\lim_{h \downarrow 0} \frac{\exp(h\, q(x)) - 1}{h}\, z^{(h)}(x) = \lim_{h \downarrow 0} \frac{E_{x' \sim p^{(h)}(\cdot|x)}\left[z^{(h)}(x')\right] - z^{(h)}(x)}{h}$$

Recalling the definition of the generator $\mathcal{L}$, we now have
$$q(x)\, z(x) = \mathcal{L}[z](x)$$

If the underlying process is an Ito diffusion, the generator is
$$\mathcal{L}[z](x) = a(x)^T z_x(x) + \tfrac{1}{2}\,\mathrm{trace}\left(\Sigma(x)\, z_{xx}(x)\right)$$
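The limit defining the generator can be checked on a simple diffusion; the drift, noise level, and test function below are arbitrary choices. For $dx = a\,dt + \sigma\,d\omega$ in one dimension and $z(x) = x^2$, the $h$-step Gaussian expectation is available in closed form, so $(P^{(h)}[z] - z)/h$ can be compared against $a z_x + \tfrac{1}{2}\sigma^2 z_{xx}$ directly:

```python
import numpy as np

a, sigma, x = 0.7, 0.5, 1.3          # arbitrary drift, noise, evaluation point

def z(x):
    return x ** 2                    # test function with z_x = 2x, z_xx = 2

# One Euler step: x' ~ Normal(x + h a, h sigma^2), so E[z(x')] = (x + h a)^2 + h sigma^2
def P_h(x, h):
    return (x + h * a) ** 2 + h * sigma ** 2

generator = a * 2 * x + 0.5 * sigma ** 2 * 2     # a z_x + (1/2) sigma^2 z_xx

h = 1e-6
finite_diff = (P_h(x, h) - z(x)) / h             # converges to the generator as h -> 0
```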
Linearly-solvable controlled diffusions

Above, $z$ was defined as the continuous-time limit of LMDP solutions $z^{(h)}$. But is $z$ the solution to a continuous-time problem, and if so, what problem?
$$dx = (a(x) + B(x)\, u)\, dt + C(x)\, d\omega, \qquad \ell(x,u) = q(x) + \tfrac{1}{2} u^T R(x)\, u$$

Recall that for such problems we have $u^* = -R^{-1} B^T v_x$ and
$$0 = q + a^T v_x + \tfrac{1}{2}\,\mathrm{tr}\left(CC^T v_{xx}\right) - \tfrac{1}{2} v_x^T B R^{-1} B^T v_x$$

Define $z(x) = \exp(-v(x))$, so that $v_x = -\frac{z_x}{z}$, $v_{xx} = -\frac{z_{xx}}{z} + \frac{z_x z_x^T}{z^2}$, and write the PDE in terms of $z$:
$$0 = q - \frac{1}{z}\left(a^T z_x + \tfrac{1}{2}\,\mathrm{tr}\left(CC^T z_{xx}\right)\right) + \frac{1}{2z^2}\, z_x^T CC^T z_x - \frac{1}{2z^2}\, z_x^T B R^{-1} B^T z_x$$

Now if $CC^T = B R^{-1} B^T$, the last two terms cancel and we obtain the linear HJB equation $qz = \mathcal{L}[z]$.
Quadratic control cost and KL divergence

The KL divergence between two Gaussians with means $\mu_1$, $\mu_2$ and common full-rank covariance $\Sigma$ is $\tfrac{1}{2}(\mu_1 - \mu_2)^T \Sigma^{-1} (\mu_1 - \mu_2)$. Using Euler discretization of the controlled diffusion, the passive and controlled dynamics have means $x + ha$ and $x + ha + hBu$ respectively, and common covariance $hCC^T$. Thus the KL-divergence control cost is
$$\tfrac{1}{2}\, (hBu)^T \left(hCC^T\right)^{-1} (hBu) = \tfrac{h}{2}\, u^T B^T \left(B R^{-1} B^T\right)^{-1} B u = \tfrac{h}{2}\, u^T R\, u$$
This is the quadratic control cost accumulated over time $h$.

Here we used $CC^T = B R^{-1} B^T$ and assumed that $B$ is full rank. If $B$ is rank-deficient, the same result holds but the Gaussians are defined over the subspace spanned by the columns of $B$.
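This identity is easy to verify numerically; $B$, $R$, $u$, and $h$ below are arbitrary, with the noise covariance chosen to satisfy $CC^T = BR^{-1}B^T$:

```python
import numpy as np

# Arbitrary full-rank B, positive definite R, control u, and time step h.
B = np.array([[1.0, 0.3],
              [0.2, 1.5]])
R = np.array([[2.0, 0.4],
              [0.4, 1.0]])
u = np.array([0.7, -1.2])
h = 0.01

Sigma_noise = B @ np.linalg.inv(R) @ B.T      # CC^T = B R^{-1} B^T

# KL between Gaussians with common covariance h*CC^T and mean difference h*B*u
delta = h * B @ u
kl = 0.5 * delta @ np.linalg.solve(h * Sigma_noise, delta)

quadratic = 0.5 * h * u @ R @ u               # (h/2) u^T R u
```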
Summary of results

                    discrete time:                        continuous time:
  first exit:       $\exp(q)\, z = P[z]$                  $qz = \mathcal{L}[z]$
  finite horizon:   $\exp(q_k)\, z_k = P_k[z_{k+1}]$       $qz - z_t = \mathcal{L}[z]$
  average cost:     $\exp(q - c)\, z = P[z]$              $(q - c)\, z = \mathcal{L}[z]$
  discounted cost:  $\exp(q)\, z = P[z^\alpha]$            $qz + z \log(z^\alpha) = \mathcal{L}[z]$

(In the discounted continuous-time equation $\alpha$ denotes the discount rate, i.e. the discrete discount factor over a step of duration $h$ is $e^{-h\alpha}$.)

The relation between $P[z]$ and $\mathcal{L}[z]$ is
$$P[z](x) = E_{x' \sim p(\cdot|x)}\left[z(x')\right]$$
$$\mathcal{L}[z](x) = \lim_{h \downarrow 0} \frac{E_{x' \sim p^{(h)}(\cdot|x)}\left[z(x')\right] - z(x)}{h} = \lim_{h \downarrow 0} \frac{P^{(h)}[z](x) - z(x)}{h}$$
$$P^{(h)}[z](x) = z(x) + h\,\mathcal{L}[z](x) + O(h^2)$$
Path-integral representation

We can unfold the linear Bellman equation (first exit) as
$$z(x) = \exp(-q(x))\, E_{x' \sim p(\cdot|x)}\left[z(x')\right]
= \exp(-q(x))\, E_{x' \sim p(\cdot|x)}\left[\exp(-q(x'))\, E_{x'' \sim p(\cdot|x')}\left[z(x'')\right]\right] = \cdots$$
$$z(x) = E_{\substack{x_0 = x \\ x_{k+1} \sim p(\cdot|x_k)}}\left[\exp\left(-q_T(x_{t_{\mathrm{first}}}) - \sum_{k=0}^{t_{\mathrm{first}}-1} q(x_k)\right)\right]$$

This is a path-integral representation of $z$. Since $\mathrm{KL}(p \,\|\, p) = 0$, we have
$$\exp\left(-E^{\mathrm{optimal}}\left[\text{total cost}\right]\right) = z(x) = E^{\mathrm{passive}}\left[\exp\left(-\text{total cost}\right)\right]$$

In continuous problems, the Feynman-Kac theorem states that the unique positive solution $z$ to the parabolic PDE $qz = a^T z_x + \tfrac{1}{2}\mathrm{tr}(CC^T z_{xx})$ has the same path-integral representation:
$$z(x) = E_{\substack{x(0) = x \\ dx = a(x)dt + C(x)d\omega}}\left[\exp\left(-q_T(x(t_{\mathrm{first}})) - \int_0^{t_{\mathrm{first}}} q(x(t))\, dt\right)\right]$$
Model-free learning

The solution to the linear Bellman equation
$$z(x) = \exp(-q(x))\, E_{x' \sim p(\cdot|x)}\left[z(x')\right]$$
can be approximated in a model-free way given samples $(x_n, x'_n, q_n = q(x_n))$ obtained from the passive dynamics $x'_n \sim p(\cdot|x_n)$.

One possibility is a Monte Carlo method based on the path-integral representation, although convergence can be slow:
$$\hat z(x) = \frac{1}{\#\text{ trajectories starting at } x} \sum \exp\left(-\text{trajectory cost}\right)$$

Faster convergence is obtained using temporal-difference learning (Z learning):
$$\hat z(x_n) \leftarrow (1-\beta)\, \hat z(x_n) + \beta \exp(-q_n)\, \hat z(x'_n)$$
The learning rate $\beta$ should decrease over time.
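A sketch of Z learning on an invented 4-state chain (the passive dynamics, costs, and learning-rate schedule are all illustrative choices): transitions are sampled from the passive dynamics only, yet $\hat z$ converges toward the solution of the linear Bellman equation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical chain: states 0-2 non-terminal, state 3 terminal with q_T = 0.
P = np.array([[0.50, 0.50, 0.00, 0.00],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.00, 0.00, 0.00, 1.00]])
q = np.array([0.2, 0.2, 0.2, 0.0])

# Exact solution for reference (linear solve on the non-terminal states).
A = np.diag(np.exp(-q[:3])) @ P[:3, :3]
z_exact = np.linalg.solve(np.eye(3) - A, np.exp(-q[:3]) * P[:3, 3])

z_hat = np.ones(4)                        # z_hat[3] = exp(-q_T) = 1 stays fixed
for n in range(50000):
    x = rng.integers(0, 3)                # visit a random non-terminal state
    x_next = rng.choice(4, p=P[x])        # sample the passive dynamics
    beta = 100.0 / (100.0 + n)            # decreasing learning rate
    z_hat[x] = (1 - beta) * z_hat[x] + beta * np.exp(-q[x]) * z_hat[x_next]
```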
Importance sampling

The expectation of a function $f(x)$ under a distribution $p(x)$ can be approximated as
$$E_{x \sim p(\cdot)}\left[f(x)\right] \approx \frac{1}{N} \sum_n f(x_n)$$
where $\{x_n\}_{n=1}^N$ are i.i.d. samples from $p(\cdot)$. However, if $f(x)$ has interesting behavior in regions where $p(x)$ is small, convergence can be slow, i.e. we may need a very large $N$ to obtain an accurate approximation. In the case of Z learning, the passive dynamics may rarely take the state to regions with low cost.

Importance sampling is a general (unbiased) method for speeding up convergence. Let $q(x)$ be some other distribution which is better "adapted" to $f(x)$, and let $\{x_n\}$ now be samples from $q(\cdot)$. Then
$$E_{x \sim p(\cdot)}\left[f(x)\right] \approx \frac{1}{N} \sum_n \frac{p(x_n)}{q(x_n)}\, f(x_n)$$
This is essential for particle filters.
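A toy demonstration (the distributions and test function are invented): $f$ lives entirely on states that $p$ almost never visits, so the naive estimator sees few relevant samples, while a uniform proposal with $p/q$ reweighting covers them reliably.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 20000

# Target distribution p puts only 1% of its mass on the "interesting" states 8 and 9.
p = np.array([0.99, 0, 0, 0, 0, 0, 0, 0, 0.005, 0.005])
f = (np.arange(10) >= 8).astype(float)        # f(x) = 1 on the rare states
exact = p @ f                                 # E_p[f] = 0.01

# Naive estimate: sample from p; the rare states are almost never visited.
xs = rng.choice(10, size=N, p=p)
naive = f[xs].mean()

# Importance sampling: sample from a uniform proposal q, reweight by p/q.
q = np.full(10, 0.1)
xq = rng.choice(10, size=N, p=q)
is_est = np.mean(p[xq] / q[xq] * f[xq])
```

With these numbers the importance-sampled estimator has a standard error several times smaller than the naive one, because the proposal visits the rare states on 20% of its draws instead of 1%.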
Greedy Z learning

Let $\hat u(x'|x)$ denote the greedy control law, i.e. the control law which would be optimal if the current approximation $\hat z(x)$ were the exact desirability function. Then we can sample from $\hat u$ rather than $p$ and use importance sampling:
$$\hat z(x_n) \leftarrow (1-\beta)\, \hat z(x_n) + \beta\, \frac{p(x'_n|x_n)}{\hat u(x'_n|x_n)}\, \exp(-q_n)\, \hat z(x'_n)$$
We now need access to the model $p(x'|x)$ of the passive dynamics.

[Figure: cost-to-go approximation error and log total cost over 50,000 state transitions, comparing Q learning (passive and $\epsilon$-greedy sampling) with Z learning (passive and greedy sampling).]
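A sketch of greedy Z learning on the same style of invented chain: next states are drawn from the greedy law $\hat u$ computed from the current $\hat z$, and the importance weight $p/\hat u$ keeps the update unbiased. The chain and learning-rate schedule are illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical chain: states 0-2 non-terminal, state 3 terminal with q_T = 0.
P = np.array([[0.50, 0.50, 0.00, 0.00],
              [0.25, 0.50, 0.25, 0.00],
              [0.00, 0.25, 0.50, 0.25],
              [0.00, 0.00, 0.00, 1.00]])
q = np.array([0.2, 0.2, 0.2, 0.0])

# Exact solution for reference.
A = np.diag(np.exp(-q[:3])) @ P[:3, :3]
z_exact = np.linalg.solve(np.eye(3) - A, np.exp(-q[:3]) * P[:3, 3])

z_hat = np.ones(4)
for n in range(50000):
    x = rng.integers(0, 3)
    u_hat = P[x] * z_hat / (P[x] @ z_hat)     # greedy control law from current z_hat
    x_next = rng.choice(4, p=u_hat)           # sample greedily rather than passively
    beta = 100.0 / (100.0 + n)
    w = P[x, x_next] / u_hat[x_next]          # importance weight p / u_hat
    z_hat[x] = (1 - beta) * z_hat[x] + beta * w * np.exp(-q[x]) * z_hat[x_next]
```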
Maximum principle for the most likely trajectory

Recall that for finite-horizon LMDPs we have
$$u^*_k(x'|x) = \exp(-q(x))\, \frac{p(x'|x)\, z_{k+1}(x')}{z_k(x)}$$

The probability that the optimally-controlled stochastic system initialized at state $x_0$ generates a given trajectory $x_1, x_2, \ldots, x_T$ is
$$p^*(x_1, x_2, \ldots, x_T \,|\, x_0) = \prod_{k=0}^{T-1} u^*_k(x_{k+1}|x_k)
= \prod_{k=0}^{T-1} \exp(-q(x_k))\, p(x_{k+1}|x_k)\, \frac{z_{k+1}(x_{k+1})}{z_k(x_k)}
= \frac{\exp(-q_T(x_T))}{z_0(x_0)} \prod_{k=0}^{T-1} \exp(-q(x_k))\, p(x_{k+1}|x_k)$$

Theorem (LMDP maximum principle)
The most likely trajectory under $p^*$ coincides with the optimal trajectory for a deterministic finite-horizon problem with final cost $q_T(x)$, dynamics $x' = f(x,u)$ where $f$ can be arbitrary, and immediate cost $\ell(x,u) = q(x) - \log p(f(x,u)\,|\,x)$.
Trajectory probabilities in continuous time

There is no formula for the probability of a trajectory under the Ito diffusion $dx = a(x)\,dt + C\,d\omega$. However the relative probabilities of two trajectories $\varphi(t)$ and $\psi(t)$ can be defined using the Onsager-Machlup functional:
$$\mathrm{OM}\left[\varphi(\cdot), \psi(\cdot)\right] \triangleq \lim_{\varepsilon \to 0} \frac{p\left(\sup_t |x(t) - \varphi(t)| < \varepsilon\right)}{p\left(\sup_t |x(t) - \psi(t)| < \varepsilon\right)}$$

It can be shown that
$$\mathrm{OM}\left[\varphi(\cdot), \psi(\cdot)\right] = \exp\left(\int_0^T L\left(\psi(t), \dot\psi(t)\right) dt - \int_0^T L\left(\varphi(t), \dot\varphi(t)\right) dt\right)$$
where
$$L\left[x, v\right] \triangleq \tfrac{1}{2}\left(a(x) - v\right)^T \left(CC^T\right)^{-1} \left(a(x) - v\right) + \tfrac{1}{2}\,\mathrm{div}\left(a(x)\right)$$

We can then fix $\psi(t)$ and define the relative probability of a trajectory as
$$p_{\mathrm{OM}}\left(\varphi(\cdot)\right) = \exp\left(-\int_0^T L\left(\varphi(t), \dot\varphi(t)\right) dt\right)$$
Continuous-time maximum principle

It can be shown that the trajectory maximizing $p_{\mathrm{OM}}(\cdot)$ under the optimally-controlled stochastic dynamics for the problem
$$dx = a(x)\,dt + B\left(u\,dt + \sigma\,d\omega\right), \qquad \ell(x,u) = q(x) + \frac{1}{2\sigma^2}\|u\|^2$$
coincides with the optimal trajectory for the deterministic problem
$$\dot x = a(x) + Bu, \qquad \ell(x,u) = q(x) + \frac{1}{2\sigma^2}\|u\|^2 + \tfrac{1}{2}\,\mathrm{div}\left(a(x)\right)$$

Example:
$$dx = (a(x) + u)\,dt + \sigma\,d\omega, \qquad \ell(x,u) = 1 + \frac{1}{2\sigma^2} u^2$$
[Figure: the drift $a(x)$ and the divergence correction $\tfrac{1}{2}\,\mathrm{div}(a(x))$ plotted over position $x$ from $-5$ to $+5$.]
Example

[Figure: panels showing $\mu(x,t)$, $r(x,t)$, and $z(x,t)$ over position $x$ ($-5$ to $+5$) and time $t$ (0 to 5), for noise levels $\sigma = 0.6$ and $\sigma = 1.2$.]