Stochastic optimal control theory


1 Stochastic optimal control theory

Bert Kappen
SNN, Radboud University Nijmegen, the Netherlands

July 5, 2008

2 Introduction

Optimal control theory: optimize the sum of a path cost and an end cost. The result is an optimal control sequence and the corresponding optimal trajectory.
- Input: cost function.
- Output: optimal trajectory and controls.

Classical control theory: what control signal should I give to move a plant to a desired state?
- Input: desired state trajectory.
- Output: optimal control trajectory.

3 Types of optimal control problems: finite horizon (fixed horizon time)

[Figure: controlled trajectory x(t) through an environment, from t up to the fixed horizon time t_f.]

Dynamics and environment may depend explicitly on time. The optimal control depends explicitly on time.

4 Types of optimal control problems: finite horizon (moving horizon)

[Figure: horizon window from t to t_f.]

Dynamics and environment are static. The optimal control is time independent.

5 Types of optimal control problems: finite horizon (moving horizon)

[Figure: the horizon window (t, t_f) shifts forward with t.]

Dynamics and environment are static. The optimal control is time independent. This setting is similar to RL.

6 Types of optimal control problems

Other types of control problems:
- minimum time;
- infinite horizon, average reward;
- infinite horizon, absorbing states.

In addition one should distinguish:
- discrete vs. continuous state;
- discrete vs. continuous time;
- observable vs. partially observable.

7 Overview

Deterministic optimal control (Kappen, 30 min.)
- Introduction of the delayed reward problem in discrete time
- Dynamic programming solution and the deterministic Bellman equations
- Solution in continuous time and states
- Example: mass on a spring
- Pontryagin maximum principle; the notion of an optimal (particle) trajectory
- Again: mass on a spring

8 Overview

Stochastic optimal control, discrete case (Toussaint, 40 min.)
- Stochastic Bellman equation (discrete state and time) and dynamic programming
- Reinforcement learning (exact solution, value iteration, policy improvement); actor-critic networks
- Markov decision problems and probabilistic inference
- Example: robotic motion control and planning

9 Overview

Stochastic optimal control, continuous case (Kappen, 40 min.)
- Stochastic differential equations
- Hamilton-Jacobi-Bellman equation (continuous state and time)
- LQ control, Riccati equation
- Example of LQ control
- Learning; partial observability: inference and control
- Certainty equivalence
- Path integral control; the role of noise and symmetry breaking; efficient approximate computation (MC, MF, BP, ...)
- Examples: double slit, delayed choice, n-joint arm

10 Overview

Research issues (Toussaint, 30 min.)
- Learning
- Efficient methods to compute value functions / cost-to-go
- Control under partial observability (POMDPs)

11 Discrete time control

Consider the control of a discrete time deterministic dynamical system:

x_{t+1} = x_t + f(t, x_t, u_t), \qquad t = 0, 1, \ldots, T-1

x_t describes the state and u_t specifies the control or action at time t. Given x_{t=0} = x_0 and u_{0:T-1} = u_0, u_1, \ldots, u_{T-1}, we can compute x_{1:T}. Define a cost for each sequence of controls:

C(x_0, u_{0:T-1}) = \phi(x_T) + \sum_{t=0}^{T-1} R(t, x_t, u_t)

The problem of optimal control is to find the sequence u_{0:T-1} that minimizes C(x_0, u_{0:T-1}).
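The cost of a given control sequence can be evaluated by simply rolling the dynamics forward. A minimal sketch (the scalar dynamics f and the costs R, phi below are hypothetical stand-ins):

```python
import numpy as np

def rollout_cost(x0, u_seq, f, R, phi):
    """Simulate x_{t+1} = x_t + f(t, x_t, u_t) and accumulate
    C = phi(x_T) + sum_t R(t, x_t, u_t)."""
    x, c = x0, 0.0
    for t, u in enumerate(u_seq):
        c += R(t, x, u)
        x = x + f(t, x, u)
    return c + phi(x)

# Hypothetical example: scalar system, quadratic costs.
cost = rollout_cost(
    x0=1.0, u_seq=np.zeros(10),
    f=lambda t, x, u: -0.1 * x + u,
    R=lambda t, x, u: 0.5 * u**2,
    phi=lambda x: 0.5 * x**2)
print(cost)
```

The optimization problem is the search over all sequences u_{0:T-1} of this rollout cost, which dynamic programming solves without enumerating sequences.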

12 Dynamic programming

Find the minimal cost path from A to J.

C(J) = 0, \quad C(H) = 3, \quad C(I) = 4
C(F) = \min(6 + C(H), 3 + C(I)) = \min(9, 7) = 7
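The recursion can be written directly as memoized code. A small sketch using only the edge costs visible on the slide (the remaining edges of the A-J graph are omitted):

```python
# Backward recursion: edge costs taken from the slide
# (H->J: 3, I->J: 4, F->H: 6, F->I: 3).
edges = {'H': [('J', 3)], 'I': [('J', 4)], 'F': [('H', 6), ('I', 3)]}
cost_to_go = {'J': 0.0}

def C(node):
    """Minimal cost from node to J, memoized."""
    if node not in cost_to_go:
        cost_to_go[node] = min(w + C(nxt) for nxt, w in edges[node])
    return cost_to_go[node]

print(C('F'))   # min(6 + 3, 3 + 4) = 7
```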

13 Discrete time control

The optimal control problem can be solved by dynamic programming. Introduce the optimal cost-to-go:

J(t, x_t) = \min_{u_{t:T-1}} \left( \phi(x_T) + \sum_{s=t}^{T-1} R(s, x_s, u_s) \right)

which solves the optimal control problem from an intermediate time t until the fixed end time T, for all intermediate states x_t. Then,

J(T, x) = \phi(x)
J(0, x) = \min_{u_{0:T-1}} C(x, u_{0:T-1})

14 Discrete time control

One can recursively compute J(t, x) from J(t+1, x) for all x in the following way:

J(t, x_t) = \min_{u_{t:T-1}} \left( \phi(x_T) + \sum_{s=t}^{T-1} R(s, x_s, u_s) \right)
 = \min_{u_t} \left( R(t, x_t, u_t) + \min_{u_{t+1:T-1}} \left[ \phi(x_T) + \sum_{s=t+1}^{T-1} R(s, x_s, u_s) \right] \right)
 = \min_{u_t} \left( R(t, x_t, u_t) + J(t+1, x_{t+1}) \right)
 = \min_{u_t} \left( R(t, x_t, u_t) + J(t+1, x_t + f(t, x_t, u_t)) \right)

This is called the Bellman equation. It computes u as a function of x and t, for all intermediate t and all x.

15 Discrete time control

The algorithm to compute the optimal control u_{0:T-1}, the optimal trajectory x_{1:T} and the optimal cost is:

1. Initialization: J(T, x) = \phi(x)
2. Backwards: for t = T-1, \ldots, 0 and for all x compute
   u_t(x) = \arg\min_u \{ R(t, x, u) + J(t+1, x + f(t, x, u)) \}
   J(t, x) = R(t, x, u_t) + J(t+1, x + f(t, x, u_t))
3. Forwards: for t = 0, \ldots, T-1 compute
   x_{t+1} = x_t + f(t, x_t, u_t(x_t))

NB: the backward computation requires u_t(x) for all x.
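A minimal sketch of this backward/forward algorithm on a discretized scalar state space; the dynamics, costs, grids and horizon are hypothetical stand-ins, and J(t+1, .) is evaluated off-grid by linear interpolation:

```python
import numpy as np

T, xs, us = 20, np.linspace(-2, 2, 101), np.linspace(-1, 1, 21)
f   = lambda t, x, u: 0.1 * u                 # x_{t+1} = x_t + f(t, x, u)
R   = lambda t, x, u: 0.05 * u**2
phi = lambda x: (x - 1.0)**2

J, pol = [None] * (T + 1), [None] * T
J[T] = phi(xs)
for t in range(T - 1, -1, -1):                # backward pass, for all x
    xnext = xs[:, None] + f(t, xs[:, None], us[None, :])
    Q = R(t, xs[:, None], us[None, :]) + np.interp(xnext, xs, J[t + 1])
    pol[t] = us[np.argmin(Q, axis=1)]         # u_t(x) on the grid
    J[t] = Q.min(axis=1)

x = 0.0                                       # forward pass
for t in range(T):
    x = x + f(t, x, np.interp(x, xs, pol[t]))
print(x)                                      # end point under the policy
```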

16 Continuous limit

Replace t + 1 by t + dt with dt \to 0. Assume J(t, x) is smooth.

x_{t+dt} = x_t + f(x_t, u_t, t)dt
C(x_0, u(0 \to T)) = \phi(x_T) + \int_0^T d\tau\, R(\tau, x(\tau), u(\tau))

J(t, x) = \min_u \left( R(t, x, u)dt + J(t + dt, x + f(x, u, t)dt) \right)
 = \min_u \left( R(t, x, u)dt + J(t, x) + \partial_t J(t, x)dt + \partial_x J(t, x) f(x, u, t)dt \right)

-\partial_t J(t, x) = \min_u \left( R(t, x, u) + f(x, u, t) \partial_x J(x, t) \right)

with boundary condition J(x, T) = \phi(x).

17 Continuous limit

-\partial_t J(t, x) = \min_u \left( R(t, x, u) + f(x, u, t) \partial_x J(x, t) \right)

with boundary condition J(x, T) = \phi(x). This is called the Hamilton-Jacobi-Bellman equation. It computes the anticipated potential J(t, x) from the future potential \phi(x).

18 Example: Mass on a spring

The spring force is F_z = -z towards the rest position, and the control force is F_u = u. Newton's law with m = 1:

F = -z + u = m\ddot{z}

Control problem: given initial position and velocity z(0) = \dot{z}(0) = 0 at time t = 0, find the control path -1 \le u(0 \to T) \le 1 such that z(T) is maximal.

19 Example: Mass on a spring

Introduce x_1 = z, x_2 = \dot{z}; then

\dot{x}_1 = x_2
\dot{x}_2 = -x_1 + u

The end cost is \phi(x) = -x_1; the path cost is R(x, u, t) = 0. The HJB equation takes the form:

-\partial_t J = \min_u \left( x_2 \frac{\partial J}{\partial x_1} + (-x_1 + u) \frac{\partial J}{\partial x_2} \right)
 = x_2 \frac{\partial J}{\partial x_1} - x_1 \frac{\partial J}{\partial x_2} - \left| \frac{\partial J}{\partial x_2} \right|, \qquad u = -\mathrm{sign}\left( \frac{\partial J}{\partial x_2} \right)

20 Example: Mass on a spring

The solution is

J(t, x_1, x_2) = -\cos(t - T)\, x_1 + \sin(t - T)\, x_2 + \alpha(t)
u(t, x_1, x_2) = -\mathrm{sign}(\sin(t - T))

As an example consider T = 2\pi. Then the optimal control is

u = -1 \ \text{for}\ 0 < t < \pi, \qquad u = +1 \ \text{for}\ \pi < t < 2\pi

[Figure: optimal trajectories x_1(t) and x_2(t).]
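A quick numerical check of this solution (a sketch, using forward Euler and the control law as reconstructed above):

```python
import numpy as np

# Integrate x1' = x2, x2' = -x1 + u with u = -sign(sin(t - T)), T = 2*pi.
T, dt = 2 * np.pi, 1e-4
x1, x2 = 0.0, 0.0
for t in np.arange(0.0, T, dt):
    u = -np.sign(np.sin(t - T))
    x1, x2 = x1 + dt * x2, x2 + dt * (-x1 + u)
print(x1)   # ~4.0: the maximal reachable z(T) under |u| <= 1
```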

21 Pontryagin minimum principle

The HJB equation is a PDE with a boundary condition at the future time. The PDE is solved by discretizing space and time. The solution is the optimal cost-to-go for all x and t; from it we compute the optimal trajectory and the optimal control.

An alternative is a variational approach that directly finds the optimal trajectory and optimal control.

22 Pontryagin minimum principle

We can write the optimal control problem as a constrained optimization problem with independent variables u(0 \to T) and x(0 \to T):

\min_{u(0 \to T),\, x(0 \to T)} \phi(x(T)) + \int_0^T dt\, R(x(t), u(t), t)

subject to the constraint \dot{x} = f(x, u, t) and boundary condition x(0) = x_0. Introduce the Lagrange multiplier function \lambda(t):

C = \phi(x(T)) + \int_0^T dt \left[ R(t, x(t), u(t)) - \lambda(t)\left(\dot{x}(t) - f(t, x(t), u(t))\right) \right]
 = \phi(x(T)) + \int_0^T dt \left[ H(t, x(t), u(t), \lambda(t)) - \lambda(t)\dot{x}(t) \right]

with the Hamiltonian H(t, x, u, \lambda) = R(t, x, u) + \lambda f(t, x, u).

23 Derivation PMP

The solution is found by extremizing C, which gives a necessary but not sufficient condition for a solution. Varying the action with respect to the trajectory x, the control u and the Lagrange multiplier \lambda gives:

\delta C = \phi_x(x(T))\delta x(T) + \int_0^T dt \left[ H_x \delta x(t) + H_u \delta u(t) + (H_\lambda - \dot{x}(t))\delta\lambda(t) - \lambda(t)\delta\dot{x}(t) \right]
 = \left(\phi_x(x(T)) - \lambda(T)\right)\delta x(T) + \int_0^T dt \left[ (H_x + \dot{\lambda}(t))\delta x(t) + H_u \delta u(t) + (H_\lambda - \dot{x}(t))\delta\lambda(t) \right]

where the \lambda\delta\dot{x} term has been integrated by parts, using \delta x(0) = 0. Here H_x = \partial H(t, x(t), u(t), \lambda(t)) / \partial x(t), and similarly for H_u and H_\lambda.

24 We can solve H_u(t, x, u, \lambda) = 0 for u and denote the solution as u^*(t, x, \lambda). This assumes H is convex in u. The remaining equations are

\dot{x} = H_\lambda(t, x, u^*(t, x, \lambda), \lambda)
\dot{\lambda} = -H_x(t, x, u^*(t, x, \lambda), \lambda)

with boundary conditions

x(0) = x_0, \qquad \lambda(T) = \phi_x(x(T))

This is a mixed boundary value problem.

25 Again mass on a spring

Problem:

\dot{x}_1 = x_2, \quad \dot{x}_2 = -x_1 + u
R(x, u, t) = 0, \quad \phi(x) = -x_1

Hamiltonian:

H(t, x, u, \lambda) = R(t, x, u) + \lambda^T f(t, x, u) = \lambda_1 x_2 + \lambda_2(-x_1 + u)
u^* = -\mathrm{sign}(\lambda_2), \qquad H^*(t, x, \lambda) = \lambda_1 x_2 - \lambda_2 x_1 - |\lambda_2|

The Hamilton equations:

\dot{x} = H^*_\lambda: \quad \dot{x}_1 = x_2, \quad \dot{x}_2 = -x_1 - \mathrm{sign}(\lambda_2)
\dot{\lambda} = -H^*_x: \quad \dot{\lambda}_1 = \lambda_2, \quad \dot{\lambda}_2 = -\lambda_1

with x(t = 0) = x_0 and \lambda(t = T) = \phi_x(x(T)) = (-1, 0).
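The mixed boundary value problem can be solved by shooting: guess lambda(0), integrate the Hamilton equations forward, and adjust the guess until lambda(T) = phi_x(x(T)). A sketch for this example (for this problem the adjoint equations are linear and decoupled from x, so the root finding is easy):

```python
import numpy as np
from scipy.optimize import fsolve

T, N = 2 * np.pi, 5000
dt = T / N

def integrate(lam0):
    x, lam = np.zeros(2), np.array(lam0, dtype=float)
    for _ in range(N):
        u = -np.sign(lam[1])
        x = x + dt * np.array([x[1], -x[0] + u])
        lam = lam + dt * np.array([lam[1], -lam[0]])
    return x, lam

# Shooting condition: lambda(T) must equal phi_x(x(T)) = (-1, 0).
residual = lambda lam0: integrate(lam0)[1] - np.array([-1.0, 0.0])
lam0 = fsolve(residual, x0=[-1.0, 0.1])
xT, _ = integrate(lam0)
print(lam0, xT[0])   # lam0 ~ (-1, 0); x1(T) ~ 4, matching the HJB result
```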

26 Comments

The HJB method gives a sufficient (and often necessary) condition for optimality. The solution of the PDE is expensive.

The PMP method provides a necessary condition for optimal control; it yields candidate solutions for optimality. The PMP method is computationally less complicated than the HJB method because it does not require discretization of the state space.

Optimal control in continuous space and time involves many complications related to the existence, uniqueness and smoothness of the solution, in particular in the absence of noise. In the presence of noise many of these intricacies disappear.

HJB generalizes to the stochastic case; PMP does not (at least not easily).

27 Stochastic differential equations

Consider the random walk on the line:

x_{t+1} = x_t + \xi_t, \qquad \xi_t = \pm 1

with x_0 = 0. We can compute x_t = \sum_{i=1}^t \xi_i. Since x_t is a sum of independent random variables, x_t becomes Gaussian distributed with

\langle x_t \rangle = \sum_{i=1}^t \langle \xi_i \rangle = 0
\langle x_t^2 \rangle = \sum_{i,j=1}^t \langle \xi_i \xi_j \rangle = \sum_{i=1}^t \langle \xi_i^2 \rangle + \sum_{i \ne j} \langle \xi_i \rangle \langle \xi_j \rangle = t

Note that the fluctuations grow as \sqrt{t}.
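A Monte Carlo check of these statistics:

```python
import numpy as np

# Verify <x_t> = 0 and <x_t^2> = t for the +/-1 random walk.
rng = np.random.default_rng(0)
t, n_walkers = 1000, 100000
x_t = rng.choice([-1, 1], size=(n_walkers, t)).sum(axis=1)
print(x_t.mean(), (x_t**2).mean())   # ~0 and ~1000
```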

28 Stochastic differential equations

In the continuous time limit we define

dx_t = x_{t+dt} - x_t = d\xi

with d\xi an infinitesimal mean-zero Gaussian variable with \langle d\xi^2 \rangle = \nu dt. Then

\frac{d}{dt}\langle x \rangle = 0, \quad \langle x(t) \rangle = 0
\frac{d}{dt}\langle x^2 \rangle = \nu, \quad \langle x^2(t) \rangle = \nu t

\rho(x, t \mid x_0, 0) = \frac{1}{\sqrt{2\pi\nu t}} \exp\left( -\frac{(x - x_0)^2}{2\nu t} \right)

29 Stochastic optimal control

Consider a stochastic dynamical system

dx = f(t, x, u)dt + d\xi

with d\xi Gaussian noise, \langle d\xi_i d\xi_j \rangle = \nu_{ij}(t, x, u)\, dt. The cost becomes an expectation:

C(t, x, u(t \to T)) = \left\langle \phi(x(T)) + \int_t^T d\tau\, R(\tau, x(\tau), u(\tau)) \right\rangle

over all stochastic trajectories starting at x with control path u(t \to T). Note that only u(t), the first element of u(t \to T), is actually used at time t; we then move to x + dx and repeat the optimization.

30 Stochastic optimal control

We obtain the Bellman recursion

J(t, x_t) = \min_{u_t} \left\langle R(t, x_t, u_t) + J(t + dt, x_{t+dt}) \right\rangle

\langle J(t + dt, x_{t+dt}) \rangle = \int dx_{t+dt}\, N(x_{t+dt} \mid x_t, \nu dt)\, J(t + dt, x_{t+dt})
 = J(t, x_t) + dt\, \partial_t J(t, x_t) + \langle dx \rangle\, \partial_x J(t, x_t) + \tfrac{1}{2}\langle dx^2 \rangle\, \partial_x^2 J(t, x_t)

\langle dx \rangle = f(x, u, t)dt, \qquad \langle dx^2 \rangle = \nu(t, x, u)dt

Thus

-\partial_t J(t, x) = \min_u \left( R(t, x, u) + f(x, u, t)\, \partial_x J(x, t) + \tfrac{1}{2}\nu(t, x, u)\, \partial_x^2 J(x, t) \right)

with boundary condition J(x, T) = \phi(x).

31 Linear Quadratic control

The dynamics is linear:

dx = [A(t)x + B(t)u + b(t)]dt + \sum_{j=1}^m \left( C_j(t)x + D_j(t)u + \sigma_j(t) \right) d\xi_j, \qquad \langle d\xi_j d\xi_{j'} \rangle = \delta_{jj'} dt

The cost function is quadratic:

\phi(x) = \tfrac{1}{2}x^T G x
R(x, u, t) = \tfrac{1}{2}x^T Q(t)x + u^T S(t)x + \tfrac{1}{2}u^T R(t)u

In this case the optimal cost-to-go is quadratic in x:

J(t, x) = \tfrac{1}{2}x^T P(t)x + \alpha^T(t)x + \beta(t)
u(t) = -\Psi(t)x(t) - \psi(t)

32 Substitution in the HJB equation yields ODEs for P, \alpha, \beta:

-\dot{P} = PA + A^T P + \sum_{j=1}^m C_j^T P C_j + Q - \hat{S}^T \hat{R}^{-1} \hat{S}
-\dot{\alpha} = [A - B\hat{R}^{-1}\hat{S}]^T \alpha + \sum_{j=1}^m [C_j - D_j \hat{R}^{-1}\hat{S}]^T P \sigma_j + P b
-\dot{\beta} = -\tfrac{1}{2}\psi^T \hat{R} \psi + \alpha^T b + \tfrac{1}{2}\sum_{j=1}^m \sigma_j^T P \sigma_j

with

\hat{R} = R + \sum_{j=1}^m D_j^T P D_j, \qquad \hat{S} = B^T P + S + \sum_{j=1}^m D_j^T P C_j
\Psi = \hat{R}^{-1}\hat{S}, \qquad \psi = \hat{R}^{-1}\left( B^T \alpha + \sum_{j=1}^m D_j^T P \sigma_j \right)

and P(t_f) = G, \alpha(t_f) = \beta(t_f) = 0.

33 Example

Find the optimal control for the dynamics

dx = (x + u)dt + d\xi, \qquad \langle d\xi^2 \rangle = \nu dt

with end cost \phi(x) = 0 and path cost R(x, u) = \tfrac{1}{2}(x^2 + u^2). The Riccati equations reduce to

-\dot{P} = 2P + 1 - P^2
\alpha = 0
-\dot{\beta} = \tfrac{1}{2}\nu P

with P(T) = \alpha(T) = \beta(T) = 0, and u(x, t) = -P(t)x.

[Figure: P(t) as a function of t.]
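A sketch that integrates this Riccati equation backward in time and then simulates the controlled dynamics under u = -P(t)x (horizon, noise level and initial state are hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(1)
T, N, nu = 5.0, 5000, 0.1
dt = T / N
P = np.zeros(N + 1)                   # P(T) = 0
for k in range(N, 0, -1):             # backward: -dP/dt = 2P + 1 - P^2
    P[k - 1] = P[k] + dt * (2 * P[k] + 1 - P[k] ** 2)

x, cost = 1.0, 0.0
for k in range(N):                    # forward: dx = (x + u)dt + dxi
    u = -P[k] * x
    cost += 0.5 * (x**2 + u**2) * dt
    x += (x + u) * dt + np.sqrt(nu * dt) * rng.standard_normal()
print(P[0], cost)   # far from T, P approaches the fixed point 1 + sqrt(2)
```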

34 Comments

Note that in the last example the optimal control is independent of \nu, i.e. optimal stochastic control equals optimal deterministic control. This is true in general for non-multiplicative noise (C_j = D_j = 0).

35 Learning

What happens if (part of) the state is not observed? For instance:
- as a result of measurement error we do not know x_t, but only p(x_t \mid y_{0:t});
- we do not know the parameters of the dynamics;
- we do not know the costs/rewards (the RL case).

36 Learning in RL or receding horizon

Imagine eternity, and you are ordered to cook a very good meal for tomorrow. You decide to spend all of today learning the recipes.

37 Learning in RL or receding horizon

Imagine eternity, and you are ordered to cook a very good meal for tomorrow. You decide to spend all of today learning the recipes. When the next day arrives, you are ordered to cook a very good meal for tomorrow. You decide to spend all of today learning the recipes...

38 Learning in RL or receding horizon

Imagine eternity, and you are ordered to cook a very good meal for tomorrow. You decide to spend all of today learning the recipes. When the next day arrives, you are ordered to cook a very good meal for tomorrow. You decide to spend all of today learning the recipes...

[Figure: value V versus time t.]

- The learning phase takes forever.
- Mix of exploration and optimizing (RL: actor-critic, E^3, ...).
- Learning is not part of the control problem.

40 Finite horizon learning

Imagine instead life as we know it. It is finite and we have only one life. The aim is to maximize the accumulated reward. This requires planning your learning!

- At t = 0, action is useless.
- At t = T, learning is useless.

[Figure: learning effort and action effort as functions of t.]

This is a problem of inference and control.

41 Inference and control

As an example, consider the problem

dx = \alpha u\, dt + d\xi

with \alpha unobserved and x observed. Path cost R(x, u, t), end cost \phi(x) and noise variance \langle d\xi^2 \rangle = \nu dt are given. The problem is that the future information that we receive about \alpha depends on u. Each time step we observe dx and u, and thus learn about \alpha:

p_{t+dt}(\alpha \mid dx, u) \propto p(dx \mid \alpha, u)\, p_t(\alpha)

The solution is to augment the state space with parameters \theta_t (sufficient statistics) that describe p_t(\alpha) = p(\alpha \mid \theta_t), with \theta_0 known. Then, with \alpha = \pm 1 and p_t(\alpha = 1) = \sigma(\theta_t):

d\theta = \frac{u}{\nu} dx = \frac{u}{\nu}(\alpha u\, dt + d\xi)

NB: in the forward pass d\theta = F(dx), thus \theta is also observed.
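A small simulation of this belief dynamics under a fixed probing control (the constant u and all parameter values are hypothetical choices; the update d\theta = (u/\nu)dx is taken from the slide):

```python
import numpy as np

rng = np.random.default_rng(2)
nu, dt, u, alpha = 0.5, 0.01, 1.0, 1.0    # true alpha = +1, unobserved
theta = 0.0                                # p_t(alpha = 1) = sigmoid(theta)
for _ in range(1000):
    dx = alpha * u * dt + np.sqrt(nu * dt) * rng.standard_normal()
    theta += (u / nu) * dx                 # belief update
print(theta, 1.0 / (1.0 + np.exp(-theta)))  # p(alpha = 1) -> 1: identified
```

Note that the drift of \theta is proportional to u^2: without control there is no learning, which is exactly why learning and action must be traded off.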

42 With z_t = (x_t, \theta_t) we obtain a standard HJB equation:

-\partial_t J(t, z)dt = \min_u \left( R(t, z, u)dt + \langle dz \rangle_z\, \partial_z J(z, t) + \tfrac{1}{2}\langle dz^2 \rangle_z\, \partial_z^2 J(z, t) \right)

with boundary condition J(z, T) = \phi(x). The expectation values are conditioned on (x_t, \theta_t) and are averages over p(\alpha \mid \theta_t) and the Gaussian noise, cf.

\langle dx \rangle_{x,\theta} = \langle \alpha u\, dt + d\xi \rangle_{x,\theta} = \bar\alpha u\, dt, \qquad \bar\alpha = \int d\alpha\, p(\alpha \mid \theta)\, \alpha

[Figure: expansion of J(z, t + dt) around (t, z) on the interval (t, t_f), using \langle dz \rangle_z and \langle dz^2 \rangle_z.]

43 Certainty equivalence

An important special case of a partially observable control problem is the Kalman filter (y observed, x not observed):

dx = (x + u)dt + d\xi
y = x + \eta

The cost is quadratic in x and u, for instance

C(x_t, t, u_{t:T}) = \left\langle \sum_{\tau=t}^T \tfrac{1}{2}(x_\tau^2 + u_\tau^2) \right\rangle

When x is observed, the optimal control is u(x, t). When x_t is not observed, we can compute p(x_t \mid y_{0:t}) using Kalman filtering, and the optimal control minimizes

C_{KF}(y_{0:t}, t, u_{t:T}) = \int dx_t\, p(x_t \mid y_{0:t})\, C(x_t, t, u_{t:T})

44 Since p(x_t \mid y_{0:t}) = N(x_t \mid \mu_t, \sigma_t^2) is Gaussian,

C_{KF}(y_{0:t}, t, u_{t:T}) = \int dx_t\, C(x_t, t, u_{t:T})\, N(x_t \mid \mu_t, \sigma_t^2)
 = \sum_{\tau=t}^T \tfrac{1}{2}u_\tau^2 + \sum_{\tau=t}^T \tfrac{1}{2}\langle x_\tau^2 \rangle_{\mu_t, \sigma_t}
 = C(\mu_t, t, u_{t:T}) + \tfrac{1}{2}(T - t)\sigma_t^2

The first term is identical to the observed case with x_t \to \mu_t. The second term does not depend on u and thus does not affect the optimal control. Hence the optimal control for the Kalman filter is identical to the observed case with x_t replaced by \mu_t:

u_{KF}(y_{0:t}, t) = u(\mu_t, t)
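A certainty-equivalence sketch in discrete time: a Kalman filter tracks \mu_t from noisy observations, and the feedback law is applied to \mu_t instead of x_t. The gain reuses the stationary Riccati solution P = 1 + sqrt(2) of the earlier example (assumed valid far from the horizon); all other parameters are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(3)
dt, nu, r = 0.01, 0.1, 0.05           # process noise nu, observation noise r
k = 1 + np.sqrt(2)                    # stationary feedback gain, u = -k*mu
x, mu, sigma2 = 1.0, 0.0, 1.0         # true state and filter belief
for _ in range(500):
    u = -k * mu                       # control uses the estimate, not x
    x += (x + u) * dt + np.sqrt(nu * dt) * rng.standard_normal()
    mu += (mu + u) * dt               # Kalman filter: predict...
    sigma2 += (2 * sigma2 + nu) * dt
    y = x + np.sqrt(r) * rng.standard_normal()
    K = sigma2 / (sigma2 + r)         # ...and correct on observing y
    mu, sigma2 = mu + K * (y - mu), (1 - K) * sigma2
print(x, mu)                          # mu tracks x; both regulated to ~0
```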

45 Summary

For infinite time problems, learning is a meta problem with a time scale unrelated to the horizon time.

A finite time partially observable (or adaptive) control problem is in general equivalent to an observable non-adaptive control problem in an extended state space. The partially observable case is generally more complex than the observable case; for instance, an LQ problem with unknown parameters is not LQ.

For a Kalman filter with unobserved states and known parameters, the partial observability does not affect the optimal control law (certainty equivalence).

Learning can be done either in the maximum likelihood sense or in a full Bayesian way.

46 Path integral control

Solving the PDE is hard. We consider a special case that can be solved:

dx = (f(x, t) + u)dt + d\xi
R(x, u, t) = V(x, t) + \tfrac{1}{2}u^T R u

and \phi(x) arbitrary. The stochastic HJB equation becomes:

-\partial_t J = \min_u \left( \tfrac{1}{2}u^T R u + V + (f + u)^T \partial_x J + \tfrac{1}{2}\mathrm{Tr}(\nu \partial_x^2 J) \right)
 = -\tfrac{1}{2}(\partial_x J)^T R^{-1} (\partial_x J) + V + f^T \partial_x J + \tfrac{1}{2}\mathrm{Tr}(\nu \partial_x^2 J)

u = -R^{-1} \partial_x J(x, t)

This must be solved backward in time with boundary condition J(T, x) = \phi(x).

47 Closed form solution

If we further assume that \nu = \lambda R^{-1} for some scalar \lambda, then we can solve for J in closed form. The solution contains the following steps.

Substitute J = -\lambda \log \Psi in the HJB equation. Because of the relation \nu = \lambda R^{-1}, the terms quadratic in \Psi cancel and only linear terms remain:

\partial_t \Psi = H\Psi, \qquad H = \frac{V}{\lambda} - f^T \partial_x - \tfrac{1}{2}\mathrm{Tr}(\nu \partial_x^2)

This equation must be solved backward in time with boundary condition \Psi(x, T) = \exp(-\phi(x)/\lambda). The linearity allows us to reverse the direction of computation, replacing it by a diffusion process, in the following way.

48 Closed form solution

Let \rho(y, \tau \mid x, t) describe a diffusion process defined by the Fokker-Planck equation

\partial_\tau \rho = -\frac{V}{\lambda}\rho - \partial_y(f\rho) + \tfrac{1}{2}\nu\, \partial_y^2 \rho = -H^\dagger \rho \qquad (1)

with \rho(y, t \mid x, t) = \delta(y - x). Define A(x, t) = \int dy\, \rho(y, \tau \mid x, t)\Psi(y, \tau). Using the equations of motion for \Psi and \rho, it is easy to see that A(x, t) is independent of \tau. Evaluating A(x, t) at \tau = t yields A(x, t) = \Psi(x, t); evaluating it at \tau = T yields A(x, t) = \int dy\, \rho(y, T \mid x, t)\Psi(y, T). Thus

\Psi(x, t) = \int dy\, \rho(y, T \mid x, t) \exp(-\phi(y)/\lambda)

49 The diffusion equation

The diffusion process

\partial_\tau \rho = -\frac{V}{\lambda}\rho - \partial_y(f\rho) + \tfrac{1}{2}\nu\, \partial_y^2 \rho \qquad (2)

can be sampled as

dx = f(x, t)dt + d\xi

where each trajectory x_i survives the time step, x_i = x + dx, with probability 1 - V(x, t)dt/\lambda, and is annihilated with probability V(x, t)dt/\lambda.

50 Forward sampling of the diffusion process

We can estimate

\Psi(x, t) = \int dy\, \rho(y, T \mid x, t)\exp(-\phi(y)/\lambda) \approx \frac{1}{N}\sum_{i \in \text{alive}} \exp(-\phi(x_i(T))/\lambda)

by computing N trajectories x_i(t \to T), i = 1, \ldots, N. 'Alive' denotes the subset of trajectories that are not annihilated along the way.
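A sketch of this estimator (the drift f, potential V, end cost phi and all parameters are hypothetical stand-ins; the scalar case with R = 1 is assumed, so \lambda = \nu):

```python
import numpy as np

rng = np.random.default_rng(4)
nu = 0.5
lam = nu                                  # lambda = nu for R = 1
f   = lambda x, t: -x                     # uncontrolled drift
V   = lambda x, t: 0.5 * x**2             # path cost potential
phi = lambda x: 0.5 * x**2                # end cost

def estimate_psi(x0, t0=0.0, T=1.0, N=10000, dt=0.001):
    x = np.full(N, x0)
    alive = np.ones(N, dtype=bool)
    for t in np.arange(t0, T, dt):
        dxi = np.sqrt(nu * dt) * rng.standard_normal(N)
        x[alive] += f(x[alive], t) * dt + dxi[alive]
        # annihilate each live trajectory with prob. V(x, t) dt / lambda
        alive &= rng.random(N) >= V(x, t) * dt / lam
    return np.exp(-phi(x[alive]) / lam).sum() / N

psi = estimate_psi(1.0)
print(psi, -lam * np.log(psi))            # Psi and J = -lambda log Psi
```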

51 The diffusion process

The diffusion process can be written as a path integral (here and below we take R = 1, so \lambda = \nu):

\rho(y, T \mid x, t) = \int [dx]_x^y \exp\left( -\frac{1}{\nu} S_{\text{path}}(x(t \to T)) \right)

S_{\text{path}}(x(t \to T)) = \int_t^T d\tau \left[ \tfrac{1}{2}\left(\dot{x}(\tau) - f(x(\tau), \tau)\right)^2 + V(x(\tau), \tau) \right]

[Figure: sample paths from x at time t to y at time t_f.]

52 The path integral formulation

\Psi(x, t) = \int dy\, \rho(y, T \mid x, t) \exp(-\phi(y)/\nu)
 = \int [dx]_x \exp\left( -\frac{1}{\nu} S(x(t \to T)) \right)

S(x(t \to T)) = S_{\text{path}}(x(t \to T)) + \phi(x(T))

\Psi is a partition sum, and J = -\nu \log \Psi can therefore be interpreted as a free energy; S is the energy of a path and \nu the temperature. The corresponding probability distribution over paths is

p(x(t \to T) \mid x, t) = \frac{1}{\Psi(x, t)} \exp\left( -\frac{1}{\nu} S(x(t \to T)) \right)

53 An example: double slit

dx = u\, dt + d\xi
C = \left\langle \tfrac{1}{2}x(T)^2 + \int_0^T d\tau\, \tfrac{1}{2}u(\tau)^2 + V(x, \tau) \right\rangle

V(x, t = 1) implements a slit at the intermediate time t = 1: V is zero inside the slit openings and infinite elsewhere.

\Psi(x, t) = \int dy\, \rho(y, T \mid x, t)\Psi(y, T)

can be solved in closed form.

[Figure: J(x, t) for t = 0, 0.99, 1.01, 2.]

54 MC sampling on double slit

J(x, t) \approx -\nu \log\left( \frac{1}{N}\sum_{i=1}^N \exp(-\phi(x_i(T))/\nu) \right)

[Figure: (a) naive sample paths; (b) J(x, t = 0) by naive sampling, compared to the exact result; (c) J(x, t = 0) by Laplace approximation and importance sampling (n = 100).]

55 The delayed choice

If we go further back in time we encounter a new phenomenon: a delayed choice of when to decide. When the slit size goes to zero, J is given by

J(x, T) = -\nu \log \int dy\, \rho(y \mid x)\, e^{-\phi(y)/\nu} = \frac{1}{T}\left( \tfrac{1}{2}x^2 - \nu T \log 2\cosh\left(\frac{x}{\nu T}\right) \right)

where T is the time to reach the slits. The expression between brackets is a typical free energy with temperature \nu T.

[Figure: J(x, t) for T = 2, 1, 0.5.]

Symmetry breaking at \nu T = 1 separates two qualitatively different behaviors.
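A numeric illustration of the symmetry breaking (slits at +/-1, as the cosh formula above implies): locate the minima of J as \nu T crosses 1:

```python
import numpy as np

def J(x, nu, T):
    return (0.5 * x**2 - nu * T * np.log(2 * np.cosh(x / (nu * T)))) / T

x = np.linspace(-3, 3, 2001)
for nuT in [0.5, 1.0, 2.0]:
    j = J(x, nuT, 1.0)
    is_min = np.r_[False, np.diff(np.sign(np.diff(j))) > 0, False]
    print(nuT, x[is_min])
# nuT = 0.5: two minima near +/-0.96 (commit to one slit);
# nuT >= 1 : a single minimum at x = 0 (delay the choice).
```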

56 The delayed choice

[Figure: sample trajectories under stochastic (left) and deterministic (right) optimal control.]

The timing of the decision, that is, when the automaton decides to go left or right, is the consequence of spontaneous symmetry breaking.

57 N joint arm in a plane

d\theta_i = u_i\, dt + d\xi_i, \qquad i = 1, \ldots, n
C = \left\langle \phi(x_n(\vec\theta)) + \int d\tau\, \tfrac{1}{2}u(\tau)^2 \right\rangle

The path integral control is

u_i = \frac{\langle \theta_i \rangle - \theta_i}{T - t}, \qquad i = 1, \ldots, n

with \langle \theta \rangle computed under \rho(\vec\theta, T \mid \vec\theta_0, t) \propto \exp(-\phi(x_n(\vec\theta))) \times (\text{uncontrolled diffusion}), i.e. \vec\theta is drawn from the uncontrolled, end-cost-penalized diffusion. This expectation is computed with a variational approximation.

58 Other path integral control issues

- Coordination of multiple agents
- Relation to reinforcement learning and to learning
- Realistic robotics applications
- Non-differentiable cost functions (obstacles)

59 Summary

Deterministic control can be solved by:
- HJB, a PDE;
- PMP, two coupled ODEs with mixed boundary conditions.

Stochastic control can be solved by:
- HJB in general;
- the Riccati equations for the sufficient statistics in the LQ case;
- path integral control: LQ control cost, but with arbitrary state cost and dynamics;
- RL is a special case.

Learning (partially observed states or parameter values):
- decoupled from control in the RL case;
- joint inference and control is typically harder than control only;
- for Kalman filters the partial observability is irrelevant (certainty equivalence).

60 Summary

The path integral control problems provide a novel link to machine learning:
- statistical physics;
- symmetry breaking, phase transitions;
- efficient computation (MCMC, BP, MF, EP).

61 Further reading

Check out the ICML tutorial paper and the references therein, as well as other general references at the tutorial web page.
