An efficient approach to stochastic optimal control. Bert Kappen SNN Radboud University Nijmegen the Netherlands

1 An efficient approach to stochastic optimal control. Bert Kappen, SNN, Radboud University Nijmegen, the Netherlands

2 Examples of control tasks Motor control Bert Kappen Pascal workshop, May

3 Examples of control tasks Foraging

4 Examples of control tasks Collaborating agents

5 Stochastic optimal control theory
Control: how to act (now) to optimize future rewards
- optimal solution is noise dependent
- computation is intractable
- tractable approaches are unimodal (LQ, deterministic)

6 Outline
- Control theory
- Path integral control theory
- Spontaneous symmetry breaking, timing of decisions
- Agents
- Summary
If time permits: Learning and neural implementation

7 Discrete time control
Consider the control of a discrete time dynamical system:
$$x_{t+1} = f(t, x_t, u_t), \quad t = 0, 1, \ldots, T \qquad (1)$$
$x_t$ is an $n$-dimensional vector describing the state of the system and $u_t$ is an $m$-dimensional vector that specifies the control or action at time $t$. Note that Eq. 1 describes noiseless dynamics. If we specify $x$ at $t = 0$ as $x_0$ and we specify a sequence of controls $u_{0:T} = u_0, u_1, \ldots, u_T$, we can compute the future states of the system $x_1, \ldots, x_{T+1}$ recursively from Eq. 1.
Define a cost function that assigns a cost to each sequence of controls:
$$C(x_0, u_{0:T}) = \sum_{t=0}^{T} R(t, x_t, u_t) \qquad (2)$$
$R(t, x, u)$ is the cost associated with taking action $u$ at time $t$ in state $x$.

8 Discrete time control
The problem of optimal control is to find the sequence $u_{0:T}$ that minimizes $C(x_0, u_{0:T})$.
The problem has a standard solution, which is known as dynamic programming. Introduce the optimal cost-to-go:
$$J(t, x_t) = \min_{u_{t:T}} \sum_{s=t}^{T} R(s, x_s, u_s) \qquad (3)$$
which solves the optimal control problem from an intermediate time $t$ until the fixed end time $T$, starting at an arbitrary location $x_t$. The minimum of Eq. 2 is given by $J(0, x_0)$.

9 Discrete time control
One can recursively compute $J(t, x)$ from $J(t+1, x)$ for all $x$ in the following way:
$$J(T+1, x) = 0$$
$$J(t, x_t) = \min_{u_{t:T}} \sum_{s=t}^{T} R(s, x_s, u_s)$$
$$= \min_{u_t} \left( R(t, x_t, u_t) + \min_{u_{t+1:T}} \sum_{s=t+1}^{T} R(s, x_s, u_s) \right)$$
$$= \min_{u_t} \left( R(t, x_t, u_t) + J(t+1, x_{t+1}) \right)$$
The minimizers $u_{0:T}$ give the optimal control path.
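The backward recursion can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy system (integer states from -3 to 3, controls -1, 0, 1, cost $x^2 + 0.5|u|$) and the names `backward_dp`, `f`, `R` are hypothetical and not part of the talk.

```python
def backward_dp(f, R, states, controls, T):
    """Compute J(0, x) and the optimal policy by the recursion
    J(T+1, x) = 0,  J(t, x) = min_u [ R(t, x, u) + J(t+1, f(t, x, u)) ]."""
    J = {x: 0.0 for x in states}          # J(T+1, x) = 0
    policy = []                           # policy[t][x] = optimal u at time t
    for t in range(T, -1, -1):
        J_new, pi_t = {}, {}
        for x in states:
            # cost of each control: immediate cost plus cost-to-go of the next state
            best_u = min(controls, key=lambda u: R(t, x, u) + J[f(t, x, u)])
            J_new[x] = R(t, x, best_u) + J[f(t, x, best_u)]
            pi_t[x] = best_u
        J = J_new
        policy.insert(0, pi_t)
    return J, policy


# Hypothetical toy problem: drive an integer state towards 0 at small control cost.
states = range(-3, 4)
controls = (-1, 0, 1)
T = 3

def f(t, x, u):
    return max(-3, min(3, x + u))         # noiseless dynamics, clipped to the grid

def R(t, x, u):
    return x * x + 0.5 * abs(u)           # state cost plus control cost

J0, policy = backward_dp(f, R, states, controls, T)
```

The work grows as (number of time steps) x (number of states) x (number of controls), whereas enumerating all control sequences grows exponentially in $T$.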

10 Continuous limit
The discrete time recursion, with the cost accumulated over a time step $dt$, is:
$$J(t, x_t) = \min_{u_t} \left( R(t, x_t, u_t)\,dt + J(t+dt, x_{t+dt}) \right)$$
In the limit of continuous time we get
$$J(t+dt, x_{t+dt}) = J(t, x_t) + dt\,\partial_t J(t, x_t) + dx\,\partial_x J(t, x_t)$$
$$dx = f(x, u, t)\,dt$$
Thus,
$$-\partial_t J(t, x) = \min_u \left( R(t, x, u) + f(x, u, t)\,\partial_x J(x, t) \right)$$
with boundary condition $J(x, T) = R(T, x) = \phi(x)$.

11 Example: Bang-bang control
The spring force $F = -z$ points towards the rest position. Control force $u$.
Newton's law $F = m\ddot{z}$ with $m = 1$:
$$\ddot{z} = -z + u$$
Control problem: Given initial position and velocity $z_i = \dot{z}_i = 0$ at time $t = 0$, find the control path $-1 < u(0 \to T) < 1$ such that $z(T)$ is maximal.

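12 Example: Bang-bang control
Introduce $x_1 = z$, $x_2 = \dot{z}$, then
$$\dot{x}_1 = x_2, \qquad \dot{x}_2 = -x_1 + u$$
The end cost is $\phi(x) = -x_1$ and $R(x, u, t) = 0$.
The HJB takes the form:
$$-\partial_t J = \min_u \left( \frac{\partial J}{\partial x_1} x_2 - \frac{\partial J}{\partial x_2} x_1 + \frac{\partial J}{\partial x_2} u \right) = x_2 \frac{\partial J}{\partial x_1} - x_1 \frac{\partial J}{\partial x_2} - \left| \frac{\partial J}{\partial x_2} \right|$$
$$u = -\mathrm{sign}\left( \frac{\partial J}{\partial x_2} \right)$$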

13 Example: Bang-bang control
The solution is
$$J(t, x_1, x_2) = -\cos(t - T)\,x_1 + \sin(t - T)\,x_2 + \alpha(t)$$
$$u(t, x_1, x_2) = -\mathrm{sign}(\sin(t - T))$$
As an example consider $T = 2\pi$. Then, the optimal control is
$$u = -1, \quad 0 < t < \pi$$
$$u = 1, \quad \pi < t < 2\pi$$
[Figure: trajectory plot with labels $x_1$, $x$, $t$]
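The bang-bang solution for $T = 2\pi$ can be checked numerically. The sketch below integrates $\ddot{z} = -z + u$ from rest under the switching control and, for comparison, under a constant push $u = 1$. The function name `simulate_spring`, the step count, and the choice to evaluate $u$ at step midpoints are assumptions made for this illustration.

```python
import math

def simulate_spring(control, T, n_steps):
    """Integrate z'' = -z + u(t) from z = z' = 0, holding u constant over each step.
    For constant u the shifted state (z - u, z') rotates, so each step is exact."""
    h = T / n_steps
    c, s = math.cos(h), math.sin(h)
    z, v = 0.0, 0.0
    for i in range(n_steps):
        u = control((i + 0.5) * h)        # evaluate the control at the step midpoint
        w = z - u                         # displacement from the shifted rest position
        w, v = w * c + v * s, -w * s + v * c
        z = w + u
    return z, v

T = 2 * math.pi

def bang_bang(t):
    # u = -sign(sin(t - T)): -1 on (0, pi), +1 on (pi, 2 pi) when T = 2 pi
    return -math.copysign(1.0, math.sin(t - T))

z_bang, _ = simulate_spring(bang_bang, T, 2000)
z_const, _ = simulate_spring(lambda t: 1.0, T, 2000)
print(z_bang, z_const)
```

With the switching control the mass first swings to $z = -2$ and then to $z(T) = 4$, while the constant push returns it to $z(T) = 0$ after one full period.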

14 Stochastic optimal control
Consider a stochastic dynamical system
$$dx = f(t, x, u)\,dt + d\xi$$
with $d\xi$ Gaussian noise, $\langle d\xi^2 \rangle = \nu\,dt$.
The cost becomes an expectation:
$$C(t, x, u(t \to T)) = \left\langle \phi(x(T)) + \int_t^T d\tau\, R(\tau, x(\tau), u(\tau)) \right\rangle$$
over all stochastic trajectories starting at $x$ with control path $u(t \to T)$.

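15 Stochastic optimal control
We obtain a similar discrete time recursion, with the cost accumulated over a time step $dt$:
$$J(t, x_t) = \min_{u_t} \left( R(t, x_t, u_t)\,dt + \langle J(t+dt, x_{t+dt}) \rangle \right)$$
In the limit of continuous time we get
$$\langle J(t+dt, x_{t+dt}) \rangle = J(t, x_t) + dt\,\partial_t J(t, x_t) + \langle dx \rangle\,\partial_x J(t, x_t) + \frac{1}{2} \langle dx^2 \rangle\,\partial_x^2 J(t, x_t)$$
$$\langle dx \rangle = f(x, u, t)\,dt, \qquad \langle dx^2 \rangle = \nu\,dt$$
Thus,
$$-\partial_t J(t, x) = \min_u \left( R(t, x, u) + f(x, u, t)\,\partial_x J(x, t) + \frac{1}{2} \nu\,\partial_x^2 J(x, t) \right)$$
with boundary condition $J(x, T) = \phi(x)$.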

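16 Path integral control
Consider the special case:
$$f(t, x, u) = f(t, x) + u, \qquad R(t, x, u) = V(t, x) + \frac{1}{2} u^2$$
then
$$-\partial_t J = \min_u \left( \frac{1}{2} u^2 + V + (f + u)\,\partial_x J + \frac{1}{2} \nu\,\partial_x^2 J \right) = -\frac{1}{2} (\partial_x J)^2 + V + f\,\partial_x J + \frac{1}{2} \nu\,\partial_x^2 J$$
$$u = -\partial_x J(x, t)$$
where $f$ on the right-hand side denotes the uncontrolled drift $f(t, x)$.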

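17 Solution
1. Define $\Psi(x, t) = \exp(-J(x, t)/\nu)$, then
$$-\partial_t \Psi = -\frac{V}{\nu} \Psi + f\,\partial_x \Psi + \frac{\nu}{2} \partial_x^2 \Psi = H\Psi, \qquad \Psi(x, T) = \exp(-\phi(x)/\nu)$$
2. Define the conditional probability $\rho(y, \tau | x, t)$, $\tau \ge t$, through a diffusion equation:
$$\partial_\tau \rho = -\frac{V}{\nu} \rho - \partial_y (f \rho) + \frac{\nu}{2} \partial_y^2 \rho = H^\dagger \rho, \qquad \rho(y, t | x, t) = \delta(y - x)$$
3. By construction, $\int dy\, \rho(y, \tau | x, t)\, \Psi(y, \tau)$ is independent of $\tau$.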

18 Solution
4. Evaluate at $\tau = t$ and $\tau = T$:
$$\Psi(x, t) = \int dy\, \rho(y, T | x, t) \exp(-\phi(y)/\nu)$$
$\Psi$ gives $J$ gives $u$.
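The chain "$\Psi$ gives $J$ gives $u$" can be illustrated for the simplest case $f = 0$, $V = 0$, where $\rho(y, T | x, t)$ is a Gaussian with mean $x$ and variance $\nu (T - t)$. The sketch below evaluates the integral on a grid, takes $J = -\nu \log \Psi$, and differentiates numerically. The quadratic end cost $\phi(y) = y^2/2$, the parameter values, and all function names are assumptions chosen for this example; for that end cost the Gaussian integral gives $J = x^2 / (2(1 + s)) + (\nu/2) \log(1 + s)$ and $u = -x/(1 + s)$ with $s = T - t$.

```python
import numpy as np

def psi_quadrature(x, horizon, nu, phi, half_width=12.0, n_grid=20001):
    """Psi(x, t) = integral dy rho(y, T | x, t) exp(-phi(y)/nu) for f = 0, V = 0,
    where rho is Gaussian with mean x and variance nu * horizon (horizon = T - t)."""
    std = np.sqrt(nu * horizon)
    y = np.linspace(x - half_width * std, x + half_width * std, n_grid)
    dy = y[1] - y[0]
    rho = np.exp(-(y - x) ** 2 / (2 * nu * horizon)) / np.sqrt(2 * np.pi * nu * horizon)
    return np.sum(rho * np.exp(-phi(y) / nu)) * dy

def cost_to_go(x, horizon, nu, phi):
    return -nu * np.log(psi_quadrature(x, horizon, nu, phi))

def control(x, horizon, nu, phi, eps=1e-4):
    # u = -dJ/dx, approximated by a central difference
    return -(cost_to_go(x + eps, horizon, nu, phi)
             - cost_to_go(x - eps, horizon, nu, phi)) / (2 * eps)

phi = lambda y: 0.5 * y ** 2              # assumed quadratic end cost
x, horizon, nu = 1.0, 1.0, 0.5
J_num = cost_to_go(x, horizon, nu, phi)
u_num = control(x, horizon, nu, phi)
J_exact = x ** 2 / (2 * (1 + horizon)) + 0.5 * nu * np.log(1 + horizon)
u_exact = -x / (1 + horizon)
print(J_num, J_exact, u_num, u_exact)
```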

19 An example: double slit
$$dx = u\,dt + d\xi$$
$$C = \left\langle \frac{1}{2} x(T)^2 + \int_0^T d\tau \left( \frac{1}{2} u(\tau)^2 + V(x, \tau) \right) \right\rangle$$
$V(x, t = 1)$ implements a slit at an intermediate time $t = 1$.
$$\Psi(x, t) = \int dy\, \rho(y, T | x, t)\, \Psi(y, T)$$
can be solved in closed form.
[Figure: $J$ versus $x$ at several times, including $t = 0$, $t = 0.99$, $t = 1.01$]

20 The delayed choice
[Figure: $x$ versus $t$]
Obstacle avoidance requires a mechanism for deciding when to choose.
We take $V = 0$ and $f = 0$ and $\phi(x) = \infty$ for all $x$, except for two narrow slits of infinitesimal size $\epsilon$ at $x = \pm 1$.

21 The delayed choice
We can compute $J$ exactly. Up to a constant that does not depend on $x$, it is given by
$$J(x, T) = -\nu \log \int dy\, \rho(y | x)\, e^{-\phi(y)/\nu} = \frac{1}{T} \left( \frac{1}{2} x^2 - \nu T \log 2 \cosh \frac{x}{\nu T} \right)$$
where $T$ is the time to reach the slits. The expression between brackets is a typical free energy with temperature $\nu T$.
[Figure: $J(x, t)$ versus $x$ for $T = 2$, $T = 1$, $T = 0.5$]
Symmetry breaking at $\nu T = 1$ separates two qualitatively different behaviours.
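The symmetry breaking can be read off from the curvature of $J$ at $x = 0$, which equals $(1/T)(1 - 1/(\nu T))$: positive for $\nu T > 1$ (a single minimum between the slits) and negative for $\nu T < 1$ (two minima, one per slit). The sketch below evaluates the free energy above and its control $u = -\partial_x J = (\tanh(x/\nu T) - x)/T$; the function names and the value $\nu = 1$ are assumptions for illustration.

```python
import math

def delayed_choice_cost(x, T, nu):
    """J(x, T) = (1/T) * (x^2/2 - nu*T*log(2*cosh(x/(nu*T)))), up to a constant."""
    return (0.5 * x * x - nu * T * math.log(2 * math.cosh(x / (nu * T)))) / T

def delayed_choice_control(x, T, nu):
    """u = -dJ/dx = (tanh(x/(nu*T)) - x) / T."""
    return (math.tanh(x / (nu * T)) - x) / T

def curvature_at_origin(T, nu, eps=1e-3):
    """Second derivative of J at x = 0 by a central difference."""
    J = delayed_choice_cost
    return (J(eps, T, nu) - 2 * J(0.0, T, nu) + J(-eps, T, nu)) / eps ** 2

nu = 1.0
for T in (2.0, 1.0, 0.5):
    print(T, curvature_at_origin(T, nu), delayed_choice_control(0.1, T, nu))
```

Far from the slits in time ($\nu T > 1$) a small displacement is steered back towards the middle; close to the slits ($\nu T < 1$) the same displacement is amplified towards the nearest slit.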

22 The delayed choice
[Figure: two panels, "stochastic" and "deterministic"]
The timing of the decision, that is, when the automaton decides to go left or right, is the consequence of spontaneous symmetry breaking.

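23 The diffusion process
$\rho(y, \tau | x, t)$ satisfies the diffusion equation:
$$\partial_\tau \rho = -\frac{V}{\nu} \rho - \partial_y (f \rho) + \frac{\nu}{2} \partial_y^2 \rho, \qquad t \le \tau \le T$$
$$\rho(y, t | x, t) = \delta(y - x)$$
and can be sampled as
$$dy = f(y, t)\,dt + d\xi$$
$$y = y + dy, \quad \text{with probability } 1 - V(y, t)\,dt/\nu$$
$$y = \dagger, \quad \text{with probability } V(y, t)\,dt/\nu$$
where $\dagger$ denotes that the particle is removed from the simulation.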
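A minimal sketch of this sampling scheme, assuming vectorized callables `f(y, t)`, `V(y, t)` and `phi(y)` (hypothetical names), estimates $\Psi(x, t)$ as the average of $\exp(-\phi(y_T)/\nu)$ over the particles that survive until $T$, with removed particles contributing zero. The test case, a constant $V = c$ with $f = 0$ and $\phi = 0$, is chosen only because the answer is known: $\Psi = \exp(-c (T - t)/\nu)$.

```python
import numpy as np

def estimate_psi(x0, t0, T, nu, f, V, phi, n_steps=100, n_paths=100_000, seed=0):
    """Monte Carlo estimate of Psi(x0, t0) from a diffusion with killing rate V/nu."""
    rng = np.random.default_rng(seed)
    dt = (T - t0) / n_steps
    y = np.full(n_paths, float(x0))
    alive = np.ones(n_paths, dtype=bool)
    for k in range(n_steps):
        t = t0 + k * dt
        # remove each particle with probability V(y, t) dt / nu
        killed = rng.random(n_paths) < V(y, t) * dt / nu
        alive &= ~killed
        # drift plus Gaussian noise with variance nu * dt
        y = y + f(y, t) * dt + rng.normal(0.0, np.sqrt(nu * dt), n_paths)
    # removed particles contribute zero to the average
    return np.sum(np.exp(-phi(y[alive]) / nu)) / n_paths

# Assumed test case: constant potential V = 1, no drift, no end cost.
psi_hat = estimate_psi(
    x0=0.0, t0=0.0, T=1.0, nu=1.0,
    f=lambda y, t: np.zeros_like(y),
    V=lambda y, t: np.ones_like(y),
    phi=lambda y: np.zeros_like(y),
)
print(psi_hat, np.exp(-1.0))
```

Removed particles are still propagated here for simplicity; an implementation concerned with cost would drop them from the arrays.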

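24 The diffusion process
The diffusion process can be written as a path integral:
$$\rho(y, T | x, t) = \int [dx]_x^y \exp\left( -\frac{1}{\nu} S_{\mathrm{path}}(x(t \to T)) \right)$$
$$S_{\mathrm{path}}(x(t \to T)) = \int_t^T d\tau \left( \frac{1}{2} \left( \dot{x}(\tau) - f(x(\tau), \tau) \right)^2 + V(x(\tau), \tau) \right)$$
[Figure: a path from $x$ at time $t$ to $y$ at time $t_f$]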

25 The path integral formulation
$$\Psi(x, t) = \int dy\, \rho(y, T | x, t) \exp\left( -\frac{\phi(y)}{\nu} \right) = \int [dx]_x \exp\left( -\frac{1}{\nu} S(x(t \to T)) \right)$$
$$S(x(t \to T)) = S_{\mathrm{path}}(x(t \to T)) + \phi(x(T))$$
$\Psi$ is a partition sum and $J = -\nu \log \Psi$ can therefore be interpreted as a free energy. $S$ is the energy of a path and $\nu$ the temperature.
The corresponding probability distribution is
$$p(x(t \to T) | x, t) = \frac{1}{\Psi(x, t)} \exp\left( -\frac{1}{\nu} S(x(t \to T)) \right)$$

26 Gibbs sampling
Sample paths $x_{0:n}$ from
$$p(x_{0:n}) \propto \exp(-S(x_{0:n})/\nu)$$
End cost $\phi(x_n)$ centered on target. Path cost $V(x)$ for obstacles.
$$J(x, t) = -\nu \log \left( \frac{1}{N} \sum_i^N \exp(-S(x^i_{0:n})/\nu) \right)$$
$$u\,dt = \frac{\exp(J/\nu)}{N} \sum_i^N \exp(-S(x^i_{0:n})/\nu)\, d\xi_i$$
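The two estimators can be tried on a case with a known answer. The sketch below does not implement Gibbs sampling; as a swapped-in simplification it draws $N$ independent paths from the uncontrolled diffusion with $f = 0$ and weights them by $\exp(-S/\nu)$, with $S$ containing only the path cost $V$ and the end cost $\phi$. The quadratic end cost, the parameter values, and the name `sampled_cost_and_control` are assumptions; for $\phi(y) = y^2/2$ and $V = 0$ the exact values are $J = x^2/(2(1+s)) + (\nu/2)\log(1+s)$ and $u = -x/(1+s)$ with $s = T - t$.

```python
import numpy as np

def sampled_cost_and_control(x, horizon, nu, phi, V=None,
                             n_steps=10, n_paths=200_000, seed=0):
    """Estimate J(x, t) and u(x, t) from weighted samples of the f = 0 diffusion."""
    rng = np.random.default_rng(seed)
    dt = horizon / n_steps
    noise = rng.normal(0.0, np.sqrt(nu * dt), size=(n_paths, n_steps))   # d xi
    paths = x + np.cumsum(noise, axis=1)                                 # x_1 .. x_n
    S = phi(paths[:, -1])                                                # end cost
    if V is not None:
        S = S + dt * np.sum(V(paths), axis=1)                            # path cost
    log_w = -S / nu
    shift = log_w.max()                   # subtract the maximum for numerical stability
    w = np.exp(log_w - shift)
    J = -nu * (np.log(np.mean(w)) + shift)
    # u dt = sum_i w_i d xi_i / sum_i w_i, using the first noise increment of each path
    u = np.sum(w * noise[:, 0]) / np.sum(w) / dt
    return J, u

x, horizon, nu = 1.0, 1.0, 0.5
J_hat, u_hat = sampled_cost_and_control(x, horizon, nu, phi=lambda y: 0.5 * y ** 2)
J_exact = x ** 2 / (2 * (1 + horizon)) + 0.5 * nu * np.log(1 + horizon)
u_exact = -x / (1 + horizon)
print(J_hat, J_exact, u_hat, u_exact)
```

The control estimate is noisier than the cost estimate because each path contributes a single noise increment of size $\sqrt{\nu\,dt}$, which is then divided by $dt$.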

27 Coordination of agents
$n$ agents with independent dynamics
$$dx_\alpha = (f_\alpha(x_\alpha, t) + u_\alpha)\,dt + d\xi_\alpha, \qquad \alpha = 1, \ldots, n$$
should coordinate their actions to minimize a cost at a future time $t = T$:
$$\phi(y_1, \ldots, y_n), \qquad y_\alpha \in \{z_1, \ldots, z_k\}$$
and $\phi = \infty$ elsewhere.

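28 Coordination of agents
Then,
$$\Psi(x_1, \ldots, x_n, t) = \int dy_1 \cdots dy_n \prod_\alpha \rho(y_\alpha, T | x_\alpha, t) \exp(-\phi(y_1, \ldots, y_n)/\nu) = \sum_{\vec{y}} \exp(-E(\vec{y} | \vec{x}, t)/\nu)$$
$$p(\vec{y}) = \frac{1}{Z} \exp(-E(\vec{y} | \vec{x}, t)/\nu)$$
$$u_\alpha(\vec{x}, t) = -\partial_{x_\alpha} J = \nu \left\langle \frac{\partial \log \rho(y_\alpha, T | x_\alpha, t)}{\partial x_\alpha} \right\rangle$$
with $\vec{x} = (x_1, \ldots, x_n)$, $\vec{y} = (y_1, \ldots, y_n)$. The sum runs over the target assignments $y_\alpha \in \{z_1, \ldots, z_k\}$, so that $E(\vec{y} | \vec{x}, t) = \phi(\vec{y}) - \nu \sum_\alpha \log \rho(y_\alpha, T | x_\alpha, t)$, and the average is taken with respect to $p(\vec{y})$.
$E$ has a graphical model structure if $\phi$ has.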

29 Pseudo code
Loop:
1. Compute the cost and its log derivative for each agent to move to each target:
$$\rho(z_i, T | x_\alpha, t), \qquad i = 1, \ldots, k, \quad \alpha = 1, \ldots, n$$
This path integral can be estimated using MC sampling or variational approximation.
2. Compute $u_\alpha$ using graphical model inference in $p(\vec{y})$ (exact, BP, MF).

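30 A simple 1d example
Intrinsic dynamics $f_\alpha = 0$, $V(x_1, \ldots, x_n) = 0$:
$$\rho(y_\alpha, T | x_\alpha, t) \propto \exp\left( -\frac{(y_\alpha - x_\alpha)^2}{2 \nu (T - t)} \right)$$
End cost
$$\phi(y_1, \ldots, y_n) = \sum_{j=1}^{k} \left( n_j(\vec{y}) - n_j \right)^2$$
with $n_j(\vec{y})$ the number of agents that go to target $j$.
The optimal control for agent $\alpha$ is
$$u_\alpha = \frac{1}{T - t} \left( \langle y_\alpha \rangle - x_\alpha \right)$$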
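For a handful of agents the expectation $\langle y_\alpha \rangle$ can be computed by brute-force enumeration of all $k^n$ target assignments, which corresponds to the "exact" inference option. The sketch below assumes that $n_j$ is the desired number of agents at target $j$ and uses $E(\vec{y}) = \phi(\vec{y}) + \sum_\alpha (y_\alpha - x_\alpha)^2 / (2(T - t))$, which follows from the Gaussian $\rho$ above; the function name and the two-agent example are hypothetical.

```python
import itertools
import numpy as np

def exact_agent_controls(x, targets, desired, nu, horizon):
    """Return (u, <y>) for the 1d example by enumerating all target assignments.
    horizon = T - t; desired[j] is the assumed desired number of agents at target j."""
    x = np.asarray(x, dtype=float)
    targets = np.asarray(targets, dtype=float)
    desired = np.asarray(desired)
    n, k = len(x), len(targets)
    assignments = list(itertools.product(range(k), repeat=n))
    energies = []
    for a in assignments:
        counts = np.bincount(a, minlength=k)              # n_j(y)
        phi = np.sum((counts - desired) ** 2)             # end cost
        travel = np.sum((targets[list(a)] - x) ** 2) / (2 * horizon)
        energies.append(phi + travel)
    energies = np.array(energies)
    weights = np.exp(-(energies - energies.min()) / nu)   # shifted for stability
    p = weights / weights.sum()                           # p(y)
    y = targets[np.array(assignments)]                    # shape (k**n, n)
    mean_y = p @ y                                        # <y_alpha>
    return (mean_y - x) / horizon, mean_y

# Two agents, two targets at -1 and +1, one agent wanted at each target.
u_sym, y_sym = exact_agent_controls([0.0, 0.0], [-1.0, 1.0], [1, 1], nu=0.5, horizon=1.0)
u_off, y_off = exact_agent_controls([0.5, -0.5], [-1.0, 1.0], [1, 1], nu=0.5, horizon=1.0)
print(u_sym, u_off)
```

In the symmetric start both agents are undecided and the control vanishes; once the agents are displaced, each is steered towards the nearer target. The enumeration cost grows as $k^n$, which is why approximate inference (BP, MF) is needed for larger problems.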

31 A simple 1d example
[Figure: (a) Agent predicted target $\langle y_\alpha \rangle$ versus $t$; (b) Agent position $x$ versus $t$]

32 A simple 1d example
[Figure, control cost: cost difference versus noise for greedy control (red), MF control (blue), BP control (green)]
[Figure, CPU time: CPU time versus number of agents for exact control (black), MF control (blue), BP control (green), greedy control (red)]

33 Nonlinear Coordination
Agents $a = 1, \ldots, n$ in 2D:
$$dx_a(t) = v_a(t) \cos \varphi_a(t)\,dt$$
$$dy_a(t) = v_a(t) \sin \varphi_a(t)\,dt$$
$$dv_a(t) = u_a(t)\,dt + d\xi_a(t)$$
$$d\varphi_a(t) = \omega_a(t)\,dt + d\zeta_a(t)$$
Initial states (marked O): $v_a(0) = 0$, $\varphi_a(0) = 0$
Targets (marked X): $v_a(T) = 0$, $\varphi_a(T) = 0$
Sample paths are specified at $t_i = t + i\,dt$, $i = 0, \ldots, 6$, $dt = (T - t)/6$.
Example of 10 agents and 10 targets.
[Figure: sample paths]

34 Computation Time
Inference methods: Junction Tree (JT) and MF (100 sample paths per agent-target pair)
[Figure: CPU time (s) versus number of agents (# agents = # targets)]
- JT: exponential in number of agents (intractable for # agents > 10)
- MF: polynomial in number of agents

35 Summary
A restricted class of control problems can be reformulated in statistical physics language.
- path integrals
- symmetry breaking
- efficient computation (MCMC, BP, MF, EP)
- coordination of agents
Future:
- Robotics in dynamical environment
- Learning/exploration

36 Further reading
- H.J. Kappen, Physical Review Letters (2005)
- H.J. Kappen, Journal of Statistical Mechanics: Theory and Experiment, November 2005, P11011
- W. Wiegerinck, B. van den Broek, H.J. Kappen, Proceedings UAI (2006)
- H.J. Kappen, 9th Granada Seminar on Computational Physics: Computational and Mathematical Modeling of Cooperative Behavior in Neural Systems, American Institute of Physics (2007)

37 Learning
Model-based: first learn a model and then do optimal control.
Model-free: interleave learning and optimal control
- more natural biologically and for AI
- problem of exploration-exploitation: intermediate control is suboptimal, and control theory does not address exploration
RL/actor critic approach: exploration = exploitation + noise
PI control: exploration is forward diffusion.
$$\Psi(x_i, 0) = \exp\left( -\frac{dt}{\lambda} \sum_{j=i}^{i+n} V(x_j) \right)$$
with $T = n\,dt$ and $x_j$, $j = i, i+1, \ldots, i+n$ the states visited after state $x_i$.
Exploration can be optimized as in importance sampling.

38 Learning
[Figure, left: histogram of visited points versus $x$. Right: $V$, $T \cdot V$, $J_{mc}$ and $J_{lp}$ versus $x$ for $T = 3$ and $T = 10$]
Sampling of $J(x)$ with one trajectory of $N = 8000$ iterations starting at $x = 0$. Left: The diffusion process $dx = d\xi$ explores the area between $x = -7.5$ and $x = 6$. Shown is a histogram of the points visited (300 bins). In each bin $x$, an estimate of $\psi(x)$ is made by averaging all $\psi(x_i)$ with $x_i$ from bin $x$. Right: $V(x)$ and $J_T(x)/T$ versus $x$ for $T = 3$ and $T = 10$.

39 A neural implementation/thinking ahead
A topological map represents space $x$. Neuron $i$ is active when the animal is at $x$.
$$\frac{d\rho_i}{dt} = -\frac{V_i}{\lambda} \rho_i(t) + \frac{\nu}{2} \sum_j D_{ij} \rho_j(t)$$
with $D$ the diffusion matrix, $D_{ii} = -2$, $D_{i,i+1} = D_{i,i-1} = 1$, and all other entries of $D$ zero. $V_i$ is the immediate reward at location $i$. Some mechanism ensures $\sum_i \rho_i(t) = 1$.
[Figure: $\rho$ for several values of $T$, including $T = 0.1$ and $T = 5$]
Thinking ahead. When the animal is at $x_1$ it can start the diffusion dynamics to anticipate what will happen in the future.
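The rate equation can be integrated with a plain Euler scheme. In the sketch below the renormalization after every step stands in for the unspecified "mechanism" that keeps $\sum_i \rho_i = 1$, and $D$ is simply truncated at the ends of the map; both are assumptions, as are the function name, the step size, and the example potential.

```python
import numpy as np

def diffuse_on_map(V, start, nu, lam, total_time, dt=0.01):
    """Euler integration of d rho_i/dt = -(V_i/lam) rho_i + (nu/2) sum_j D_ij rho_j,
    starting from all activity at site `start` and renormalizing after each step."""
    V = np.asarray(V, dtype=float)
    n = len(V)
    # D_ii = -2, D_{i,i+1} = D_{i,i-1} = 1, all other entries zero
    D = -2 * np.eye(n) + np.eye(n, k=1) + np.eye(n, k=-1)
    rho = np.zeros(n)
    rho[start] = 1.0
    for _ in range(int(round(total_time / dt))):
        rho = rho + dt * (-V / lam * rho + 0.5 * nu * (D @ rho))
        rho /= rho.sum()                  # assumed normalization mechanism
    return rho

# Hypothetical map of 41 sites with V = 1 on the right half and V = 0 elsewhere.
V = np.where(np.arange(41) > 20, 1.0, 0.0)
rho = diffuse_on_map(V, start=20, nu=1.0, lam=1.0, total_time=5.0)
print(rho[:20].sum(), rho[21:].sum())
```

Activity that diffuses into the region with $V_i > 0$ decays, so after normalization most of the anticipated probability mass sits on the side with $V_i = 0$.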
