Gradient Methods for Markov Decision Processes


1 Gradient Methods for Markov Decision Processes
Department of Computer Science, University College London
May 11, 2012

2 Outline
1. Introduction: Markov Decision Processes; Dynamic Programming.
2. Gradient Methods: Notation; Steepest Gradient Ascent; Expectation Maximisation; Natural Gradient Ascent; Summary; Approximate Newton Method; Experiments.
3. Model-Based Inference: Forward-Backward Inference; Rauch-Tung-Striebel Inference.


4 Markov Decision Processes Markov Decision Processes consider the problem of optimal decision making in a dynamic environment.

5 Markov Decision Processes
Examples include:
Robotics.
Optimal Game Play.
Navigation.
[Figure: gridworld navigation example, from Start to Finish.]

8 Markov Decision Processes
More formally, Markov Decision Processes (MDPs) are given by the tuple $(A, S, H, p_1, R, p)$, where
$A$ - action space, either discrete or continuous.
$S$ - state space, either discrete or continuous.
$Z = S \times A$ - state-action space.
$H$ - planning horizon, either finite or infinite.

9 Markov Decision Processes
More formally, Markov Decision Processes (MDPs) are given by the tuple $(A, S, H, p_1, R, p)$, where
$p_1(s) : S \to [0, 1]$ - initial state distribution,
$\pi(a|s) : A \times S \to [0, 1]$ - policy,
$R(a, s) : A \times S \to \mathbb{R}^+$ - reward function,
$p(s'|s, a) : S^2 \times A \to [0, 1]$ - transition dynamics.

10 Markov Decision Processes
One of the main assumptions of the MDP model is Markovian dynamics:
$$p(a_{1:H}, s_{1:H}; \pi) = p(a_H|s_H; \pi) \left\{ \prod_{t=1}^{H-1} p(s_{t+1}|s_t, a_t)\, p(a_t|s_t; \pi) \right\} p_1(s_1).$$
[Figure: gridworld navigation example, from Start to Finish.]
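
The factorisation above suggests ancestral sampling of trajectories. Below is a minimal sketch in Python of drawing a trajectory from this distribution; the small tabular MDP (the `p1`, `P`, and `pi` tables) is invented purely for illustration and is not part of the talk.

```python
# Ancestral sampling from the Markov factorisation
# p(a_{1:H}, s_{1:H}; pi) = p_1(s_1) prod_t p(a_t|s_t; pi) p(s_{t+1}|s_t, a_t).
import numpy as np

rng = np.random.default_rng(0)

S, A, H = 3, 2, 5                            # small discrete spaces, short horizon (invented)
p1 = np.array([1.0, 0.0, 0.0])               # initial state distribution p_1(s)
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition table P[s, a] = p(. | s, a)
pi = rng.dirichlet(np.ones(A), size=S)       # policy table pi[s] = pi(. | s)

def sample_trajectory():
    """Draw (s_1, a_1, ..., s_H, a_H) by ancestral sampling."""
    s = rng.choice(S, p=p1)
    traj = []
    for t in range(H):
        a = rng.choice(A, p=pi[s])
        traj.append((s, a))
        s = rng.choice(S, p=P[s, a])         # Markovian dynamics: next state depends only on (s_t, a_t)
    return traj

print(sample_trajectory())
```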


23 Markov Decision Processes
Objective - optimise $\pi$ to maximise the total expected reward
$$U(\pi) = \sum_{t=1}^{H} \mathbb{E}_{p_t(a,s;\pi)}\big[ R(a,s) \big],$$
where $p_t(a,s;\pi)$ is the state-action marginal of the $t^{\text{th}}$ time-point.

24 Markov Decision Processes
The objective is unbounded in infinite horizons. Two standard alternatives:
Discounted rewards,
$$U(\pi) = \sum_{t=1}^{\infty} \mathbb{E}_{p_t(a,s;\pi)}\big[ \gamma^{t-1} R(a,s) \big].$$
Average rewards,
$$U(\pi) = \lim_{H\to\infty} \frac{1}{H} \sum_{t=1}^{H} \mathbb{E}_{p_t(a,s;\pi)}\big[ R(a,s) \big].$$
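
As a tiny numeric check of the two criteria (an invented toy case, not from the talk): with a constant per-step reward $r$, the discounted objective converges to $r/(1-\gamma)$ while the average-reward objective is simply $r$.

```python
# Toy comparison of the discounted and average-reward criteria for a constant reward.
import numpy as np

r, gamma, H = 1.0, 0.95, 10_000
t = np.arange(1, H + 1)

discounted = np.sum(gamma ** (t - 1) * r)   # finite truncation of sum_t gamma^(t-1) r
average = np.mean(np.full(H, r))            # (1/H) sum_t r

print(discounted, r / (1 - gamma))          # ~20.0 vs 20.0
print(average)                              # 1.0
```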


26 Dynamic programming
It is theoretically possible to solve an MDP through dynamic programming.
Finite horizon Bellman equation,
$$V_t(s) = \max_{a\in A}\Big\{ R(s,a) + \mathbb{E}_{p(s'|s,a)}\big[ V_{t+1}(s') \big] \Big\}.$$
Discounted infinite horizon Bellman equation,
$$V(s) = \max_{a\in A}\Big\{ R(s,a) + \gamma\, \mathbb{E}_{p(s'|s,a)}\big[ V(s') \big] \Big\}.$$
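
A minimal sketch of value iteration for the discounted Bellman equation on an invented tabular MDP (the `R` and `P` tables are made up for illustration):

```python
# Value iteration: V(s) = max_a { R(s, a) + gamma * E_{p(s'|s,a)}[V(s')] }.
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 4, 3, 0.9
R = rng.uniform(size=(S, A))                 # reward table R(s, a)
P = rng.dirichlet(np.ones(S), size=(S, A))   # transitions P[s, a] = p(. | s, a)

V = np.zeros(S)
for _ in range(1000):
    Q = R + gamma * P @ V                    # Q[s, a] = R(s,a) + gamma * sum_s' P[s,a,s'] V[s']
    V_new = Q.max(axis=1)                    # greedy maximisation over the action space
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

policy = Q.argmax(axis=1)                    # greedy policy w.r.t. the converged values
print(V, policy)
```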

27 Dynamic Programming - An Example
A graphical example of dynamic programming. [Figure: initial value function.]

28 Dynamic Programming - An Example
A graphical example of dynamic programming. [Figure: value function after 1 iteration.]

29 Dynamic Programming - An Example
A graphical example of dynamic programming. [Figure: value function after 2 iterations.]

30 Dynamic Programming - An Example
A graphical example of dynamic programming. [Figure: value function after 12 iterations.]

31 Dynamic programming
Dynamic programming has numerous issues, including:
Curse of dimensionality - complexity scales exponentially in the dimension of the state-action space.
Representation issues in non-linear continuous systems.
Global maximisation over the action space can be problematic.

32 Beyond Dynamic programming
Various solutions have been proposed, including:
Approximate dynamic programming - works in the space of value functions; often good initial performance; convergence issues, e.g. policy oscillation.
Policy search methods, which include gradient methods - work in policy space; very general convergence guarantees.


34 Notation
We consider gradient-based methods, so take a parametric policy $\pi(a|s; w)$, $w \in \mathcal{W}$.
Write the objective in terms of $w$, i.e. $U(w)$; similarly for the trajectory distribution, $p(z_{1:H}; w)$.
Also introduce the state-action value function
$$Q_\tau(z; w) = \sum_{t=\tau}^{H} \mathbb{E}_{p_t(a,s;w)}\big[ R(a,s) \,\big|\, z_\tau = z \big].$$

35 Reward Weighted Trajectory Distribution
Unnormalised reward weighted trajectory distribution,
$$p(z_{1:t}, t; w) = R(z_t)\, p(z_{1:t}; w).$$
Denote the normalised version by $\hat{p}(z_{1:t}, t; w)$.
Note - the normalisation constant equals $U(w)$, i.e.
$$\hat{p}(z_{1:t}, t; w) = \frac{p(z_{1:t}, t; w)}{U(w)}.$$
[Figure: mixture components for $t = 1, 2, \ldots, H$.]


37 Steepest Gradient Ascent
The gradient can be calculated through likelihood-ratios.
In terms of $p(z, \tau, t; w)$ the gradient takes the form
$$\nabla_w U(w) = \sum_{t=1}^{H} \sum_{\tau=1}^{t} \mathbb{E}_{p(z,\tau,t;w)}\big[ \nabla_w \log p(a|s; w) \big].$$
In terms of the state-action value function,
$$\nabla_w U(w) = \sum_{\tau=1}^{H} \mathbb{E}_{p_\tau(z;w) Q_\tau(z;w)}\big[ \nabla_w \log p(a|s; w) \big].$$

38 Steepest Gradient Ascent - Derivation
The gradient can be calculated through likelihood-ratios.
The likelihood ratio, or log-trick, gives the gradient
$$\nabla_w U(w) = \sum_{t=1}^{H} \mathbb{E}_{p(z_{1:t};w)}\big[ R(z_t)\, \nabla_w \log p(z_{1:t}; w) \big].$$

39 Steepest Gradient Ascent - Derivation
The gradient can be calculated through likelihood-ratios.
Equivalently, in terms of the reward weighted trajectory distribution,
$$\nabla_w U(w) = \sum_{t=1}^{H} \mathbb{E}_{p(z_{1:t},t;w)}\big[ \nabla_w \log p(z_{1:t}; w) \big].$$

40 Steepest Gradient Ascent - Derivation
The gradient can be calculated through likelihood-ratios.
The Markovian dynamics give
$$\nabla_w U(w) = \sum_{t=1}^{H} \sum_{\tau=1}^{t} \mathbb{E}_{p(z,\tau,t;w)}\big[ \nabla_w \log p(a|s; w) \big].$$

41 Steepest Gradient Ascent - Summary
Summary:
Possible to calculate the gradient through likelihood ratios.
Often poorly conditioned, making it difficult to select a step-size.
Linear rate of convergence.
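
A minimal Monte Carlo sketch of the likelihood-ratio gradient $\nabla_w U(w) = \sum_\tau \mathbb{E}\big[ Q_\tau(z)\, \nabla_w \log \pi(a|s;w) \big]$, using a tabular Gibbs (softmax) policy; every table in this example is invented for illustration and none of it is claimed to be the experimental setup of the talk.

```python
# Score-function (likelihood-ratio) policy gradient estimate on a toy tabular MDP.
import numpy as np

rng = np.random.default_rng(2)
S, A, H, N = 3, 2, 10, 500
p1 = np.full(S, 1.0 / S)
P = rng.dirichlet(np.ones(S), size=(S, A))       # transitions p(. | s, a)
R = rng.uniform(size=(S, A))                     # reward R(s, a)
w = np.zeros((S, A))                             # Gibbs policy parameters

def pi(s, w):
    e = np.exp(w[s] - w[s].max())
    return e / e.sum()                           # pi(a | s; w) proportional to exp(w[s, a])

grad = np.zeros_like(w)
for _ in range(N):                               # Monte Carlo over sampled trajectories
    s = rng.choice(S, p=p1)
    states, actions, rewards = [], [], []
    for t in range(H):
        a = rng.choice(A, p=pi(s, w))
        states.append(s); actions.append(a); rewards.append(R[s, a])
        s = rng.choice(S, p=P[s, a])
    returns = np.cumsum(rewards[::-1])[::-1]     # Q_tau = sum_{t >= tau} R(z_t)
    for tau, (s_t, a_t) in enumerate(zip(states, actions)):
        score = -pi(s_t, w)                      # grad_{w[s_t, .]} log pi(a_t | s_t; w)
        score[a_t] += 1.0
        grad[s_t] += returns[tau] * score

grad /= N
print(grad)                                      # Monte Carlo estimate of grad_w U(w)
```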


43 Expectation Maximisation
An alternative is Expectation Maximisation (EM).
Introduce a variational distribution $q(z_{1:t}, t)$.
The Kullback-Leibler divergence, $KL(q \,\|\, \hat{p})$, gives the bound
$$\log U(w) \ge H_{\text{entropy}}\big( q(z_{1:t}, t) \big) + \mathbb{E}_{q(z_{1:t},t)}\big[ \log p(z_{1:t}, t; w) \big].$$

44 Expectation Maximisation
Iteratively maximise the bound w.r.t. $q$ and $w$:
E-step - optimise the bound w.r.t. $q(z_{1:t}, t)$,
$$q(z_{1:t}, t) = \hat{p}(z_{1:t}, t; w_k).$$
M-step - optimise the bound w.r.t. $w$,
$$w_{k+1} = \operatorname*{argmax}_{w} Q(w, w_k), \quad \text{where} \quad Q(w, w_k) = \sum_{t=1}^{H} \sum_{\tau=1}^{t} \mathbb{E}_{p(z,\tau,t;w_k)}\big[ \log p(a|s; w) \big].$$
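
As a rough illustration of the M-step: for a linear-Gaussian policy of the form used later in the talk, maximising $Q(w, w_k)$ amounts to a reward-weighted maximum-likelihood fit of the controller. The sketch below assumes the (invented) `states`, `actions`, and non-negative `weights` stand in for samples and reward weights obtained under $w_k$; it is not the talk's actual implementation.

```python
# EM M-step as reward-weighted least squares for a linear-Gaussian policy a = K s + m + noise.
import numpy as np

rng = np.random.default_rng(3)
n_samples, ds, da = 200, 4, 2
states = rng.normal(size=(n_samples, ds))         # sampled states (invented)
actions = rng.normal(size=(n_samples, da))        # sampled actions (invented)
weights = rng.uniform(size=n_samples)             # non-negative reward weights (invented)

X = np.hstack([states, np.ones((n_samples, 1))])  # append a bias column for the offset m
W = np.diag(weights)

# argmin_theta sum_i weights_i * || a_i - theta^T x_i ||^2  (weighted least squares)
theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ actions)
K, m = theta[:-1].T, theta[-1]
print(K.shape, m.shape)                           # new gain matrix and offset
```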

45 Relation to Steepest Gradient Ascent
What is the relation between steepest gradient ascent and EM?
Steepest gradient ascent:
$$\nabla_w U(w)\big|_{w=w_k} = \nabla_w Q(w, w_k)\big|_{w=w_k}.$$
Expectation Maximisation:
$$w_{k+1} = \operatorname*{argmax}_{w} Q(w, w_k).$$

46 Expectation Maximisation Summary EM is a two-stage iterative process. There is no need to select step-sizes. Rate of convergence: anywhere between sub-linear and quadratic.


48 Natural Gradient Ascent
Steepest gradient ascent assumes a Euclidean metric on the parameter space,
$$\nabla_w U(w)\big|_{w=w_k} = \operatorname*{argmax}_{p\,:\,p^T p = \epsilon} U(w_k + p).$$
In many cases this is not true of the parameter space, which instead has a manifold structure.

49 Natural Gradient Ascent
This is the idea behind natural gradient ascent [1],
$$G^{-1}(w)\, \nabla_w U(w)\big|_{w=w_k} = \operatorname*{argmax}_{p\,:\,p^T G(w) p = \epsilon} U(w_k + p),$$
where $G(w)$ is a local metric on the parameter manifold. The Fisher information matrix is used, where
$$G(w) = \sum_{t=1}^{H} \mathbb{E}_{p(z,t;w)}\big[ \nabla_w \nabla_w^T \log p(a|s; w) \big].$$

50 Natural Gradient Ascent
The Fisher information is easy to calculate/estimate.
The update is covariant, i.e. independent of the policy parameterisation.
The rate of convergence is still linear, but typically faster than steepest gradient ascent in practice.
A very popular method in the MDP literature since its introduction [2].
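
A hedged sketch of a natural gradient step: the Fisher matrix is estimated from outer products of score vectors (equal, in expectation, to the negative expected Hessian of the log-policy) and used to precondition the vanilla gradient. The `scores` and `vanilla_grad` arrays are invented stand-ins for quantities a policy gradient estimator would produce.

```python
# Empirical Fisher preconditioning: natural_grad = G(w)^{-1} grad U(w).
import numpy as np

rng = np.random.default_rng(4)
N, d = 1000, 6
scores = rng.normal(size=(N, d))           # grad_w log pi at sampled state-actions (stand-in)
vanilla_grad = rng.normal(size=d)          # steepest ascent direction estimate (stand-in)

G = scores.T @ scores / N                  # empirical Fisher information matrix
G += 1e-6 * np.eye(d)                      # small ridge for numerical stability
natural_grad = np.linalg.solve(G, vanilla_grad)

alpha = 0.1                                # step size still has to be chosen by hand
# w_new = w + alpha * natural_grad
print(natural_grad)
```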


52 Summary
So we have three methods: steepest gradient ascent, Expectation Maximisation, and natural gradient ascent.
Which is best? It depends on which paper you read.



55 Approximate Newton Method
In Newton's method:
There is no guarantee of a valid ascent direction in non-concave problems.
Inference is more expensive.
Inversion of the Hessian can be expensive.
Is there an approximation to the Hessian that doesn't suffer from these problems?

56 Approximate Newton Method
Through a second application of the log-trick the Hessian takes the form
$$H(w) = \sum_{t=1}^{H}\sum_{\tau,\tau'=1}^{t} \mathbb{E}_{p(z,z',\tau,\tau',t;w)}\big[ \nabla_w \log p(a|s;w)\, \nabla_w^T \log p(a'|s';w) \big] + \sum_{t=1}^{H}\sum_{\tau=1}^{t} \mathbb{E}_{p(z,\tau,t;w)}\big[ \nabla_w \nabla_w^T \log p(a|s;w) \big],$$
where
$$p(z, z', \tau, \tau', t; w) \equiv p(z_\tau = z, z_{\tau'} = z', t; w).$$

57 Approximate Newton Method
Consider
$$H_1(w) = \sum_{t=1}^{H}\sum_{\tau,\tau'=1}^{t} \mathbb{E}_{p(z,z',\tau,\tau',t;w)}\big[ \nabla_w \log p(a|s;w)\, \nabla_w^T \log p(a'|s';w) \big].$$
A positive mixture of outer product matrices $\Rightarrow$ positive semidefinite.
The matrix requires additional inference.
The matrix is generally dense.
We disregard this part of the Hessian.

58 Approximate Newton Method
Consider
$$H_2(w) = \sum_{t=1}^{H}\sum_{\tau=1}^{t} \mathbb{E}_{p(z,\tau,t;w)}\big[ \nabla_w \nabla_w^T \log p(a|s;w) \big].$$
The policy is log-concave in $w$ $\Rightarrow$ negative semidefinite.
Little or no additional inference required.
The matrix has sparsity properties not present in the Hessian.
We use this as our approximate Hessian.

59 Approximate Newton Method - Two Examples
Two prominent examples of policies that are log-concave are:
The Gibbs policy in discrete systems,
$$\pi(a|s; w) = \frac{e^{w^T \phi(a,s)}}{\sum_{a' \in A} e^{w^T \phi(a',s)}}.$$
The linear-Gaussian policy in continuous systems,
$$a = K\phi(s) + m + \eta,$$
with Gaussian noise $\eta$.
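
For the Gibbs policy the Hessian of the log-policy is minus the covariance of the features under the policy, which is independent of the sampled action; $H_2(w)$ accumulates such terms with non-negative value weights, so it stays negative semidefinite. The sketch below illustrates this for one state with invented features and an invented weight.

```python
# Hessian of log pi(a|s; w) for a Gibbs policy pi(a|s; w) ~ exp(w^T phi(a, s)).
import numpy as np

rng = np.random.default_rng(5)
A, d = 4, 3
w = rng.normal(size=d)
phi = rng.normal(size=(A, d))              # features phi(a, s) for one fixed state s (invented)

def hess_log_gibbs(phi, w):
    """grad grad_w log pi(a|s; w) = -Cov_{a' ~ pi(.|s; w)}[phi(a', s)], same for every a."""
    logits = phi @ w
    p = np.exp(logits - logits.max())
    p /= p.sum()                           # pi(a | s; w)
    mean = p @ phi                         # E_pi[phi]
    cov = (phi * p[:, None]).T @ phi - np.outer(mean, mean)
    return -cov                            # negative semidefinite by construction

weight = 2.5                               # stand-in for a non-negative state-action value weight
H2_contribution = weight * hess_log_gibbs(phi, w)
print(np.linalg.eigvalsh(H2_contribution)) # eigenvalues are non-positive (up to numerical error)
```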

60 Relation to Expectation Maximisation
It is possible to show, under suitable conditions, that
$$w' - w = -H_2^{-1}(w)\, \nabla_w U(w) + \mathcal{O}\big( \|w' - w\|^2 \big),$$
where $w'$ is the EM-update given parameters $w$.
In other words, EM moves, up to first order, in the direction of the approximate Newton method with a fixed step-size of unity.

61 Relation to Natural Gradient Ascent
What is the relation between natural gradient ascent and the approximate Newton method?
Natural gradient ascent preconditions with
$$G(w) = \sum_{t=1}^{H} \mathbb{E}_{p(z,t;w)}\big[ \nabla_w \nabla_w^T \log p(a|s; w) \big].$$
The approximate Newton method preconditions with
$$H_2(w) = \sum_{t=1}^{H} \mathbb{E}_{p(z,t;w) Q_t(z;w)}\big[ \nabla_w \nabla_w^T \log p(a|s; w) \big].$$


63 Model-Based Experiments
Model-based experiments in linear-Gaussian systems.
Inference is exact, so there are no issues of approximate inference.
Feedback-linearisation is used to linearise non-linear systems.

64 Lotka-Volterra System
The Lotka-Volterra equations model the population dynamics of a group of interacting species,
$$\dot{s} = D(s)\big( As + c + f(a) \big) + \eta.$$
Task - equilibrate the populations of the species.

65 Lotka-Volterra System - Search Direction
[Figure: normalised total expected reward against training time, comparing Steepest Gradient Ascent, Expectation Maximisation, Approximate Newton Method, and Natural Gradient Ascent.]
$N = 6$. $S = \mathbb{R}^6$. $A = \mathbb{R}^6$. $w \in \mathbb{R}^{13}$. $H = 1$. 3s training.

66 N-Link Rigid Manipulator
A simple model of a robotic joint,
$$M(q)\ddot{q} + C(\dot{q}, q)\dot{q} + g(q) = \tau.$$
Task - position the end effector.

67 N-Link Rigid Manipulator - Search Direction
[Figure: normalised total expected reward against training time, comparing Steepest Gradient Ascent, Expectation Maximisation, Approximate Newton Method, and Natural Gradient Ascent.]
$N = 3$. $S = \mathbb{R}^6$. $A = \mathbb{R}^3$. $w \in \mathbb{R}^{22}$. $H = 1$. 3s training.

68 Model-Free Experiments
Model-free experiments in non-linear systems.
Forward sampling is used in inference.
Linear controller with non-linear features,
$$a = K\phi(s) + m + \eta.$$
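
A small sketch of this controller class: a linear-Gaussian controller acting on non-linear features of the state. The particular feature map `phi` and the parameter values below are invented for illustration, not taken from the experiments.

```python
# Linear controller on non-linear state features: a = K phi(s) + m + eta.
import numpy as np

rng = np.random.default_rng(6)

def phi(s):
    """Non-linear state features, e.g. for a pendulum state s = (theta, theta_dot)."""
    theta, theta_dot = s
    return np.array([np.sin(theta), np.cos(theta), theta_dot])

dk, da = 3, 1
K = rng.normal(scale=0.1, size=(da, dk))   # feedback gain on the features
m = np.zeros(da)                           # constant offset
sigma = 0.05                               # exploration noise scale

def controller(s):
    return K @ phi(s) + m + sigma * rng.normal(size=da)

print(controller(np.array([0.3, -0.1])))
```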

69 Pendulum
A simple pendulum model,
$$ml\ddot{\theta} = mg\sin\theta - kl\dot{\theta} + \tau.$$
[Figure: pendulum diagram with length $l$, angle $\theta$, and gravitational force $mg$.]
Task - balance the pendulum in the upright position.

70 Pendulum - Search Direction
[Figure: normalised total expected reward against training iterations, comparing Expectation Maximisation, Approximate Newton Method, and Natural Gradients.]
$S = \mathbb{R}^2$. $A = \mathbb{R}$. $w \in \mathbb{R}^2$. $H = 1$. 5s training iterations.

71 Cart-Pole
The cart-pole problem,
$$I\ddot{\theta} = mgl\sin\theta - ml^2\ddot{\theta} - ml\ddot{y}\cos\theta,$$
$$M\ddot{y} = u - m\big( \ddot{y} + L\ddot{\theta}\cos\theta - L\dot{\theta}^2\sin\theta \big) - k\dot{y}.$$
[Figure: cart-pole diagram with pole angle $\theta$, cart position $y$, control force $u$, and gravitational force $mg$.]
Task - balance the pole in the upright position.

72 Cart-Pole - Search Direction
[Figure: normalised total expected reward against training iterations, comparing Expectation Maximisation, Approximate Newton Method, and Natural Gradients.]
$S = \mathbb{R}^4$. $A = \mathbb{R}$. $w \in \mathbb{R}^2$. $H = 1$. 5s training iterations.


74 Forward-Backward Inference
Model-based inference is similar to time-series inference.
Model-based time-series inference splits into:
Forward-backward inference.
Rauch-Tung-Striebel (RTS) inference.
Yet model-based inference in gradient-based methods for MDPs is exclusively forward-backward.

75 Forward-Backward Inference
Observe the standard form of the gradient,
$$\nabla_w U(w) = \sum_{\tau=1}^{H} \mathbb{E}_{p_\tau(z;w) Q_\tau(z;w)}\big[ \nabla_w \log p(a|s; w) \big].$$
$\{ p_\tau(z;w) \}_{\tau=1}^{H}$ - forward messages.
$\{ Q_\tau(z;w) \}_{\tau=1}^{H}$ - backward messages.
We use the new notation $Q^{\text{fb}}_\tau(z;w)$ for the state-action value function.


77 RTS Inference - Finite Planning Horizon
We redefine the state-action value function as follows,
$$Q^{\text{rts}}_\tau(z; w) = \sum_{t=\tau}^{H} p(z, \tau, t; w) = p_\tau(z; w)\, Q^{\text{fb}}_\tau(z; w).$$
The terms necessary for the policy update can be written in the form
$$\sum_{t=1}^{H}\sum_{\tau=1}^{t} p(z, \tau, t; w) = \sum_{\tau=1}^{H} Q^{\text{rts}}_\tau(z; w).$$

78 RTS Inference - Finite Planning Horizon
We obtain a recursive equation for these new Q-functions,
$$Q^{\text{rts}}_\tau(z; w) = p_\tau(z; w) R(z) + \sum_{z'} p_\tau(z|z'; w)\, Q^{\text{rts}}_{\tau+1}(z'; w).$$
Note the alternate direction of the transition dynamics compared to the standard recursion,
$$Q^{\text{fb}}_\tau(z; w) = R(z) + \sum_{z'} p(z'|z; w)\, Q^{\text{fb}}_{\tau+1}(z'; w).$$
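
A small tabular sketch of the recursion just stated, on an invented state-action chain: the forward marginals are computed first, the time-reversed conditionals $p_\tau(z|z') = p(z'|z)\,p_\tau(z)/p_{\tau+1}(z')$ are formed from them, and the $Q^{\text{rts}}$ functions are then iterated backwards. Everything here is made up for illustration.

```python
# RTS-style recursion: Q_rts_tau(z) = p_tau(z) R(z) + sum_z' p_tau(z|z') Q_rts_{tau+1}(z').
import numpy as np

rng = np.random.default_rng(7)
Z, H = 6, 8                                      # number of state-action pairs, horizon (invented)
T = rng.dirichlet(np.ones(Z), size=Z)            # forward chain T[z, z'] = p(z' | z)
R = rng.uniform(size=Z)                          # reward R(z)
p1 = np.full(Z, 1.0 / Z)                         # initial state-action marginal

# Forward pass: marginals p_tau(z) for tau = 1..H.
marg = [p1]
for _ in range(H - 1):
    marg.append(marg[-1] @ T)

# Backward pass with the time-reversed dynamics.
Q = [None] * H
Q[H - 1] = marg[H - 1] * R                       # Q_rts_H(z) = p_H(z) R(z)
for tau in range(H - 2, -1, -1):
    # reversal[z, z'] = p(z_tau = z | z_{tau+1} = z') = p(z'|z) p_tau(z) / p_{tau+1}(z')
    reversal = T * marg[tau][:, None] / np.maximum(marg[tau + 1][None, :], 1e-12)
    Q[tau] = marg[tau] * R + reversal @ Q[tau + 1]

# The terms needed for the policy update are the sums sum_tau Q_rts_tau(z).
print(np.sum(Q, axis=0))
```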

79 RTS Inference - Recursion Derivation
$$\begin{aligned}
Q^{\text{rts}}_\tau(z; w) &= p_\tau(z; w) \sum_{t=\tau}^{H} \mathbb{E}_{p_t(z';w)}\big[ R(z') \,\big|\, z_\tau = z \big] \\
&= p_\tau(z; w) R(z) + \sum_{t=\tau+1}^{H} \int dz'\, p(z_\tau = z, z_t = z'; w)\, R(z') \\
&= p_\tau(z; w) R(z) + \sum_{t=\tau+1}^{H} \int dz'\, dz''\, p(z_\tau = z, z_{\tau+1} = z'', z_t = z'; w)\, R(z') \\
&= p_\tau(z; w) R(z) + \sum_{t=\tau+1}^{H} \int dz'\, dz''\, p(z_t = z' | z_{\tau+1} = z''; w)\, p(z_\tau = z, z_{\tau+1} = z''; w)\, R(z') \\
&= p_\tau(z; w) R(z) + \sum_{t=\tau+1}^{H} \int dz'\, dz''\, p(z_t = z' | z_{\tau+1} = z''; w)\, p(z_\tau = z | z_{\tau+1} = z''; w)\, p(z_{\tau+1} = z''; w)\, R(z') \\
&= p_\tau(z; w) R(z) + \int dz''\, p(z_\tau = z | z_{\tau+1} = z''; w)\, p(z_{\tau+1} = z''; w) \sum_{t=\tau+1}^{H} \int dz'\, p(z_t = z' | z_{\tau+1} = z''; w)\, R(z') \\
&= p_\tau(z; w) R(z) + \int dz''\, p(z_\tau = z | z_{\tau+1} = z''; w)\, Q^{\text{rts}}_{\tau+1}(z''; w).
\end{aligned}$$

80 RTS Inference - Infinite Planning Horizon
In the case of an infinite planning horizon, $H = \infty$, we need to calculate the infinite summation $\sum_{t=1}^{\infty} Q^{\text{rts}}_t(z; w)$.
We use convergence of the trajectory distribution to the stationary state-action distribution.

81 RTS Inference - Infinite Planning Horizon
Suppose convergence is reached by $\hat{\tau}$, then
$$\sum_{t=1}^{\infty} Q^{\text{rts}}_t(z; w) = \sum_{t=1}^{\hat{\tau}-1} Q^{\text{rts}}_t(z; w) + \sum_{t=\hat{\tau}}^{\infty} Q^{\text{rts}}_t(z; w).$$
The first term is easy provided we know $Q^{\text{rts}}_{\hat{\tau}}(z; w)$.
We use stationarity of the state-action distribution to calculate the second term.

82 RTS Inference - Infinite Planning Horizon
For any $\tau \ge \hat{\tau}$ it is easy to show that $Q^{\text{rts}}_{\tau+1}(z; w) = \gamma\, Q^{\text{rts}}_\tau(z; w)$:
$$\begin{aligned}
Q^{\text{rts}}_{\tau+1}(z; w) &= p_{\tau+1}(z; w) \sum_{t=\tau+1}^{\infty} \mathbb{E}_{p_t(z';w)}\big[ \gamma^{t-1} R(z') \,\big|\, z_{\tau+1} = z \big] \\
&= \gamma\, p_\tau(z; w) \sum_{t=\tau}^{\infty} \mathbb{E}_{p_t(z';w)}\big[ \gamma^{t-1} R(z') \,\big|\, z_\tau = z \big] \\
&= \gamma\, Q^{\text{rts}}_\tau(z; w).
\end{aligned}$$

83 RTS Inference - Infinite Planning Horizon
We can now simplify the second term,
$$\sum_{t=\hat{\tau}}^{\infty} Q^{\text{rts}}_t(z; w) = \sum_{t=\hat{\tau}}^{\infty} \gamma^{t-1}\, Q^{\text{rts}}_\infty(z; w) = \frac{\gamma^{\hat{\tau}-1}}{1-\gamma}\, Q^{\text{rts}}_\infty(z; w).$$
It remains to find $Q^{\text{rts}}_\infty(z; w)$.
Extending the finite horizon derivation gives the fixed-point equation
$$Q^{\text{rts}}_\infty(z; w) = p_\infty(z; w)\, R(z) + \gamma \sum_{z'} p_\infty(z|z'; w)\, Q^{\text{rts}}_\infty(z'; w).$$
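
A hedged sketch of solving a fixed-point equation of this form directly as a linear system, on an invented tabular chain: the stationary state-action distribution is taken as the leading left eigenvector, the stationary reversal is formed from it, and the resulting linear system is solved in one step.

```python
# Solve Q_inf(z) = p_inf(z) R(z) + gamma * sum_z' p_inf(z|z') Q_inf(z') as a linear system.
import numpy as np

rng = np.random.default_rng(8)
Z, gamma = 6, 0.95
T = rng.dirichlet(np.ones(Z), size=Z)            # forward chain T[z, z'] = p(z' | z)
R = rng.uniform(size=Z)

# Stationary distribution: left eigenvector of T with eigenvalue 1.
vals, vecs = np.linalg.eig(T.T)
p_inf = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
p_inf = np.abs(p_inf) / np.abs(p_inf).sum()

# Stationary reversal: p_inf(z | z') = p(z' | z) p_inf(z) / p_inf(z').
reversal = T * p_inf[:, None] / p_inf[None, :]

# (I - gamma * reversal) Q_inf = p_inf * R, solvable since gamma < 1.
Q_inf = np.linalg.solve(np.eye(Z) - gamma * reversal, p_inf * R)
print(Q_inf)
```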

84 Examples
Now consider some examples where the RTS approach is beneficial. In particular: linear-Gaussian systems, and high-dimensional discrete systems.

85 Continuous Systems
In continuous problems the Q-recursion becomes
$$Q^{\text{rts}}_\tau(z; w) = p_\tau(z; w) R(z) + \int dz'\, p_\tau(z|z'; w)\, Q^{\text{rts}}_{\tau+1}(z'; w).$$
It is no longer possible to maintain a closed form for the Q-functions.

86 Continuous Systems
However, we only require moments to perform the policy update.
For example, with a linear controller we only require the moments
$$\sum_{\tau=1}^{H} \mathbb{E}_{Q_\tau(z;w)}\big[ z \big], \qquad \sum_{\tau=1}^{H} \mathbb{E}_{Q_\tau(z;w)}\big[ zz^T \big].$$
These moments can be iterated exactly in linear time.

87 Linear Systems
We consider the example of a linear dynamical system with a linear controller. All functions have linear-Gaussian form:
$$p(s_1) = \mathcal{N}(s_1 \mid \mu_0, \Sigma_0),$$
$$p(s_{t+1}|s_t, a_t) = \mathcal{N}(s_{t+1} \mid A s_t + B a_t, \Sigma),$$
$$p(a_t|s_t; K, m, \pi_\sigma) = \mathcal{N}(a_t \mid K s_t + m, \pi_\sigma),$$
$$R(z) = \mathcal{N}(y_j \mid M z, L_j).$$
Policy parameters - $w = (K, m, \pi_\sigma)$.

88 Linear Systems - Reward Weighted Trajectory Distribution
$p(z_{1:t}, t; w)$ is an unnormalised mixture of Gaussians.
Each marginal is an unnormalised Gaussian.
We don't need each marginal, but the summation of the marginals.
[Figure: mixture components for $t = 1, 2, \ldots, H$.]

89 Linear Systems - Forward-Backward Inference
Linear dynamical system with a linear controller.
Forward-backward inference in this model was considered in [3].
The standard forward-backward equation has the form
$$\mathbb{E}_{p_\tau(z;w)}\big[ z\, Q^{\text{fb}}_\tau(z; w) \big] = \mathbb{E}_{p_\tau(z;w)}\big[ z R(z) \big] + \mathbb{E}_{p_\tau(z;w)}\Big[ z\, \mathbb{E}_{p(z'|z;w)}\big[ Q^{\text{fb}}_{\tau+1}(z'; w) \big] \Big].$$

90 Linear Systems - Forward-Backward Inference
Linear dynamical system with a linear controller.
An equivalent form of the forward-backward equation is
$$\mathbb{E}_{p_\tau(z;w)}\big[ z\, Q^{\text{fb}}_\tau(z; w) \big] = \mathbb{E}_{p_\tau(z;w)}\big[ z R(z) \big] + \mathbb{E}_{pQ_\tau(z;w)}\big[ z \big],$$
where
$$pQ_\tau(z; w) = p_\tau(z; w)\, \mathbb{E}_{p(z'|z;w)}\big[ Q^{\text{fb}}_{\tau+1}(z'; w) \big] = \sum_{t=\tau+1}^{H} \int dz'\, p(z_\tau = z, z_t = z'; w)\, R(z').$$

91 Linear Systems - Forward-Backward Inference
Linear dynamical system with a linear controller.
Linear-Gaussian system $\Rightarrow$ $pQ_\tau(z; w)$ is an unnormalised mixture of Gaussians.
The number of components equals $(H - t)$.
Calculating $\mathbb{E}_{pQ_\tau(z;w)}[z]$ has a cost of $\mathcal{O}(H - t)$.
The overall cost of forward-backward inference is $\mathcal{O}(H^2)$.
There is no clear extension to the infinite horizon.

92 Linear Systems - RTS Inference - Finite Horizon
Linear dynamical system with a linear controller.
We need to calculate
$$\sum_{\tau=1}^{H} \mathbb{E}_{Q^{\text{rts}}_\tau(z;w)}\big[ z \big], \qquad \sum_{\tau=1}^{H} \mathbb{E}_{Q^{\text{rts}}_\tau(z;w)}\big[ zz^T \big].$$
The Q-recursion has the form
$$\mathbb{E}_{Q^{\text{rts}}_\tau(z;w)}\big[ z \big] = \mathbb{E}_{p_\tau(z;w)}\big[ z R(z) \big] + \mathbb{E}_{Q^{\text{rts}}_{\tau+1}(z';w)}\Big[ \mathbb{E}_{p_\tau(z|z';w)}\big[ z \big] \Big],$$
$$\mathbb{E}_{Q^{\text{rts}}_\tau(z;w)}\big[ zz^T \big] = \mathbb{E}_{p_\tau(z;w)}\big[ zz^T R(z) \big] + \mathbb{E}_{Q^{\text{rts}}_{\tau+1}(z';w)}\Big[ \mathbb{E}_{p_\tau(z|z';w)}\big[ zz^T \big] \Big].$$

93 Linear Systems - RTS Inference - Finite Horizon
Linear dynamical system with a linear controller.
Denote the moments of the reward function
$$\mu^R_\tau = \mathbb{E}_{p_\tau(z;w)}\big[ R(z) z \big], \qquad \Sigma^R_\tau = \mathbb{E}_{p_\tau(z;w)}\big[ R(z) zz^T \big].$$
Linear system $\Rightarrow$ linear reversal dynamics,
$$z_\tau = G_\tau z_{\tau+1} + m_\tau + \eta_\tau.$$

94 Linear Systems - RTS Inference - Finite Horizon
Linear dynamical system with a linear controller.
The RTS recursion has the form
$$\mathbb{E}_{Q^{\text{rts}}_\tau(z;w)}\big[ z \big] = \mu^R_\tau + \mathbb{E}_{Q^{\text{rts}}_{\tau+1}(z';w)}\big[ G_\tau z' + m_\tau \big],$$
$$\mathbb{E}_{Q^{\text{rts}}_\tau(z;w)}\big[ zz^T \big] = \Sigma^R_\tau + \mathbb{E}_{Q^{\text{rts}}_{\tau+1}(z';w)}\big[ (G_\tau z' + m_\tau)(G_\tau z' + m_\tau)^T + \Sigma_\tau \big].$$

95 Linear Systems - RTS Inference - Finite Horizon
Linear dynamical system with a linear controller.
Denote the first two moments of $Q^{\text{rts}}_\tau(z; w)$ by $\mu^Q_\tau$ and $\Sigma^Q_\tau$. The recursion for $\mu^Q_\tau$ and $\Sigma^Q_\tau$ is immediate:
$$\mu^Q_\tau = \mu^R_\tau + Z_{\tau+1} m_\tau + G_\tau \mu^Q_{\tau+1},$$
$$\Sigma^Q_\tau = \Sigma^R_\tau + Z_{\tau+1}\big( \Sigma_\tau + m_\tau m_\tau^T \big) + G_\tau\big( \Sigma^Q_{\tau+1} + \mu^Q_{\tau+1} m_\tau^T + m_\tau (\mu^Q_{\tau+1})^T \big) G_\tau^T.$$

96 Linear Systems - RTS Inference - Infinite Horizon
Linear dynamical system with a linear controller.
In the case of an infinite horizon with discounted rewards we have the fixed-point equations
$$\mu^Q = \mu^R + \gamma\big( Z m + G \mu^Q \big),$$
$$\Sigma^Q = \Sigma^R + \gamma\Big( Z\big( \Sigma + m m^T \big) + G\big( \Sigma^Q + \mu^Q m^T + m (\mu^Q)^T \big) G^T \Big).$$

97 Lotka-Volterra System
The Lotka-Volterra equations model the population dynamics of a group of interacting species,
$$\dot{s} = D(s)\big( As + c + f(a) \big) + \eta.$$
Task - equilibrate the populations of the species.

98 Lotka-Volterra System - Finite Horizon
[Figure: normalised total expected reward against training time, comparing RTS (Q) inference and forward-backward inference.]
$N = 6$. $S = \mathbb{R}^6$. $A = \mathbb{R}^6$. $w \in \mathbb{R}^{13}$. $H = 1$. 3s training.

99 Lotka-Volterra System - Finite Horizon
Expectation Maximisation was used in training.
To obtain a similar level of performance: RTS inference - 35 s training time; Forward-Backward inference - 3 s training time.
Forward-Backward inference obtains 5% of the performance of RTS inference.

100 Lotka-Volterra System - Infinite Horizon
[Figure: normalised total expected reward against training time, comparing the infinite horizon method with the finite horizon heuristic.]
$N = 6$. $S = \mathbb{R}^6$. $A = \mathbb{R}^6$. $w \in \mathbb{R}^{13}$. $H = \infty$. 6s training.

101 Lotka-Volterra System - Infinite Horizon
Expectation Maximisation was used in training.
A heuristic horizon was used in forward-backward inference, H = 1.
Training iterations performed: RTS inference ± 221.8; Forward-Backward inference ± .7.

102 N-Link Rigid Manipulator
A simple model of a robotic joint,
$$M(q)\ddot{q} + C(\dot{q}, q)\dot{q} + g(q) = \tau.$$
Task - position the end effector.

103 N-Link Rigid Manipulator - Finite Horizon
[Figure: normalised total expected reward against training time, comparing RTS (Q) inference and forward-backward inference.]
$N = 3$. $S = \mathbb{R}^6$. $A = \mathbb{R}^3$. $w \in \mathbb{R}^{22}$. $H = 1$. 3s training.

104 N-Link Rigid Manipulator - Finite Horizon
Expectation Maximisation was used in training.
To obtain a similar level of performance: RTS inference - 35 s training time; Forward-Backward inference - 3 s training time.
Forward-Backward inference obtains 5% of the performance of RTS inference.

105 N-Link Rigid Manipulator - Infinite Horizon
We considered the infinite horizon problem.
The policy parameters often tended to the boundary of the unit circle.
The problem should properly be handled as a constrained optimisation problem: constraining to the unit circle is difficult, and is a point of future research.

106 Continuous Systems - Summary
More efficient than forward-backward inference, with runtime $\mathcal{O}(H)$ instead of $\mathcal{O}(H^2)$.
Extends to infinite horizon problems.
Higher order algorithms, e.g. the Newton method, run in $\mathcal{O}(H)$ instead of $\mathcal{O}(H^3)$.

107 Continuous Systems - Extensions
It is possible to consider more general systems, e.g.:
Model non-Gaussian rewards through a mixture of Gaussians.
Model certain non-linear systems through feedback-linearisation.
Approximate the trajectory distribution of a non-linear system with a Gaussian, through e.g. EP, and use the RTS recursion.
Possible extensions to (controlled) switching linear dynamical systems.

108 Bibliography I
[1] Amari, S. Natural Gradient Works Efficiently in Learning. Neural Computation, 10(2), 1998.
[2] Kakade, S. A Natural Policy Gradient. Advances in Neural Information Processing Systems (NIPS) 14, 2002.
[3] Hoffman, M., de Freitas, N., Doucet, A. and Peters, J. An Expectation Maximization Algorithm for Continuous Markov Decision Processes with Arbitrary Rewards. AISTATS, 2009.
