Gradient Methods for Markov Decision Processes


1 Gradient Methods for Markov Decision Processes
Department of Computer Science, University College London
May 11, 2012

2 Outline
1. Introduction: Markov Decision Processes; Dynamic Programming.
2. Gradient Methods: Notation; Steepest Gradient Ascent; Expectation Maximisation; Natural Gradient Ascent; Summary; Approximate Newton Method; Experiments.
3. Model-Based Inference: Forward-Backward Inference; Rauch-Tung-Striebel Inference.


4 Markov Decision Processes Markov Decision Processes consider the problem of optimal decision making in a dynamic environment.

5 Markov Decision Processes
Examples include:
Robotics.
Optimal Game Play.
Navigation.
[Figure: gridworld navigation example, from Start to Finish.]

8 Markov Decision Processes
More formally, Markov Decision Processes (MDPs) are given by the tuple $(A, S, H, p_1, R, p)$, where
$A$ - action space, either discrete or continuous.
$S$ - state space, either discrete or continuous.
$Z = S \times A$ - state-action space.
$H$ - planning horizon, either finite or infinite.

9 Markov Decision Processes
More formally, Markov Decision Processes (MDPs) are given by the tuple $(A, S, H, p_1, R, p)$, where
$p_1(s) : S \to [0, 1]$ - initial state distribution,
$\pi(a|s) : A \times S \to [0, 1]$ - policy,
$R(a, s) : A \times S \to \mathbb{R}^+$ - reward function,
$p(s'|s, a) : S^2 \times A \to [0, 1]$ - transition dynamics.

10 Markov Decision Processes
One of the main assumptions of the MDP model is Markovian dynamics:
$$p(a_{1:H}, s_{1:H}; \pi) = p(a_H|s_H; \pi) \left\{ \prod_{t=1}^{H-1} p(s_{t+1}|s_t, a_t)\, p(a_t|s_t; \pi) \right\} p_1(s_1).$$
[Figure: gridworld navigation example, from Start to Finish.]
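
The factorisation above suggests ancestral sampling of trajectories. Below is a minimal sketch in Python of drawing a trajectory from this distribution; the small tabular MDP (the `p1`, `P`, and `pi` tables) is invented purely for illustration and is not part of the talk.

```python
# Ancestral sampling from the Markov factorisation
# p(a_{1:H}, s_{1:H}; pi) = p_1(s_1) prod_t p(a_t|s_t; pi) p(s_{t+1}|s_t, a_t).
import numpy as np

rng = np.random.default_rng(0)

S, A, H = 3, 2, 5                            # small discrete spaces, short horizon (invented)
p1 = np.array([1.0, 0.0, 0.0])               # initial state distribution p_1(s)
P = rng.dirichlet(np.ones(S), size=(S, A))   # transition table P[s, a] = p(. | s, a)
pi = rng.dirichlet(np.ones(A), size=S)       # policy table pi[s] = pi(. | s)

def sample_trajectory():
    """Draw (s_1, a_1, ..., s_H, a_H) by ancestral sampling."""
    s = rng.choice(S, p=p1)
    traj = []
    for t in range(H):
        a = rng.choice(A, p=pi[s])
        traj.append((s, a))
        s = rng.choice(S, p=P[s, a])         # Markovian dynamics: next state depends only on (s_t, a_t)
    return traj

print(sample_trajectory())
```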


23 Markov Decision Processes
Objective - optimise $\pi$ to maximise the total expected reward
$$U(\pi) = \sum_{t=1}^{H} \mathbb{E}_{p_t(a,s;\pi)}\big[ R(a,s) \big],$$
where $p_t(a,s;\pi)$ is the state-action marginal of the $t^{\text{th}}$ time-point.

24 Markov Decision Processes
The objective is unbounded in infinite horizons. Two standard alternatives:
Discounted rewards,
$$U(\pi) = \sum_{t=1}^{\infty} \mathbb{E}_{p_t(a,s;\pi)}\big[ \gamma^{t-1} R(a,s) \big].$$
Average rewards,
$$U(\pi) = \lim_{H\to\infty} \frac{1}{H} \sum_{t=1}^{H} \mathbb{E}_{p_t(a,s;\pi)}\big[ R(a,s) \big].$$
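
As a tiny numeric check of the two criteria (an invented toy case, not from the talk): with a constant per-step reward $r$, the discounted objective converges to $r/(1-\gamma)$ while the average-reward objective is simply $r$.

```python
# Toy comparison of the discounted and average-reward criteria for a constant reward.
import numpy as np

r, gamma, H = 1.0, 0.95, 10_000
t = np.arange(1, H + 1)

discounted = np.sum(gamma ** (t - 1) * r)   # finite truncation of sum_t gamma^(t-1) r
average = np.mean(np.full(H, r))            # (1/H) sum_t r

print(discounted, r / (1 - gamma))          # ~20.0 vs 20.0
print(average)                              # 1.0
```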


26 Dynamic programming
It is theoretically possible to solve an MDP through dynamic programming.
Finite horizon Bellman equation,
$$V_t(s) = \max_{a\in A}\Big\{ R(s,a) + \mathbb{E}_{p(s'|s,a)}\big[ V_{t+1}(s') \big] \Big\}.$$
Discounted infinite horizon Bellman equation,
$$V(s) = \max_{a\in A}\Big\{ R(s,a) + \gamma\, \mathbb{E}_{p(s'|s,a)}\big[ V(s') \big] \Big\}.$$
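
A minimal sketch of value iteration for the discounted Bellman equation on an invented tabular MDP (the `R` and `P` tables are made up for illustration):

```python
# Value iteration: V(s) = max_a { R(s, a) + gamma * E_{p(s'|s,a)}[V(s')] }.
import numpy as np

rng = np.random.default_rng(1)
S, A, gamma = 4, 3, 0.9
R = rng.uniform(size=(S, A))                 # reward table R(s, a)
P = rng.dirichlet(np.ones(S), size=(S, A))   # transitions P[s, a] = p(. | s, a)

V = np.zeros(S)
for _ in range(1000):
    Q = R + gamma * P @ V                    # Q[s, a] = R(s,a) + gamma * sum_s' P[s,a,s'] V[s']
    V_new = Q.max(axis=1)                    # greedy maximisation over the action space
    if np.max(np.abs(V_new - V)) < 1e-10:
        break
    V = V_new

policy = Q.argmax(axis=1)                    # greedy policy w.r.t. the converged values
print(V, policy)
```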

27 Dynamic Programming - An Example
A graphical example of dynamic programming. [Figure: initial value function.]

28 Dynamic Programming - An Example
A graphical example of dynamic programming. [Figure: value function after 1 iteration.]

29 Dynamic Programming - An Example
A graphical example of dynamic programming. [Figure: value function after 2 iterations.]

30 Dynamic Programming - An Example
A graphical example of dynamic programming. [Figure: value function after 12 iterations.]

31 Dynamic programming
Dynamic programming has numerous issues, including:
Curse of dimensionality - complexity scales exponentially in the dimension of the state-action space.
Representation issues in non-linear continuous systems.
Global maximisation over the action space can be problematic.

32 Beyond Dynamic programming
Various solutions have been proposed, including:
Approximate dynamic programming - works in the space of value functions; often good initial performance; convergence issues, e.g. policy oscillation.
Policy search methods, which include gradient methods - work in policy space; very general convergence guarantees.


34 Notation
We consider gradient-based methods, so take a parametric policy $\pi(a|s; w)$, $w \in \mathcal{W}$.
Write the objective in terms of $w$, i.e. $U(w)$; similarly for the trajectory distribution, $p(z_{1:H}; w)$.
Also introduce the state-action value function
$$Q_\tau(z; w) = \sum_{t=\tau}^{H} \mathbb{E}_{p_t(a,s;w)}\big[ R(a,s) \,\big|\, z_\tau = z \big].$$

35 Reward Weighted Trajectory Distribution
Unnormalised reward weighted trajectory distribution,
$$p(z_{1:t}, t; w) = R(z_t)\, p(z_{1:t}; w).$$
Denote the normalised version by $\hat{p}(z_{1:t}, t; w)$.
Note - the normalisation constant equals $U(w)$, i.e.
$$\hat{p}(z_{1:t}, t; w) = \frac{p(z_{1:t}, t; w)}{U(w)}.$$
[Figure: mixture components for $t = 1, 2, \ldots, H$.]


37 Steepest Gradient Ascent
The gradient can be calculated through likelihood-ratios.
In terms of $p(z, \tau, t; w)$ the gradient takes the form
$$\nabla_w U(w) = \sum_{t=1}^{H} \sum_{\tau=1}^{t} \mathbb{E}_{p(z,\tau,t;w)}\big[ \nabla_w \log p(a|s; w) \big].$$
In terms of the state-action value function,
$$\nabla_w U(w) = \sum_{\tau=1}^{H} \mathbb{E}_{p_\tau(z;w) Q_\tau(z;w)}\big[ \nabla_w \log p(a|s; w) \big].$$

38 Steepest Gradient Ascent - Derivation
The gradient can be calculated through likelihood-ratios.
The likelihood ratio, or log-trick, gives the gradient
$$\nabla_w U(w) = \sum_{t=1}^{H} \mathbb{E}_{p(z_{1:t};w)}\big[ R(z_t)\, \nabla_w \log p(z_{1:t}; w) \big].$$

39 Steepest Gradient Ascent - Derivation
The gradient can be calculated through likelihood-ratios.
Equivalently, in terms of the reward weighted trajectory distribution,
$$\nabla_w U(w) = \sum_{t=1}^{H} \mathbb{E}_{p(z_{1:t},t;w)}\big[ \nabla_w \log p(z_{1:t}; w) \big].$$

40 Steepest Gradient Ascent - Derivation
The gradient can be calculated through likelihood-ratios.
The Markovian dynamics give
$$\nabla_w U(w) = \sum_{t=1}^{H} \sum_{\tau=1}^{t} \mathbb{E}_{p(z,\tau,t;w)}\big[ \nabla_w \log p(a|s; w) \big].$$

41 Steepest Gradient Ascent - Summary
Summary:
Possible to calculate the gradient through likelihood ratios.
Often poorly conditioned, making it difficult to select a step-size.
Linear rate of convergence.
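
A minimal Monte Carlo sketch of the likelihood-ratio gradient $\nabla_w U(w) = \sum_\tau \mathbb{E}\big[ Q_\tau(z)\, \nabla_w \log \pi(a|s;w) \big]$, using a tabular Gibbs (softmax) policy; every table in this example is invented for illustration and none of it is claimed to be the experimental setup of the talk.

```python
# Score-function (likelihood-ratio) policy gradient estimate on a toy tabular MDP.
import numpy as np

rng = np.random.default_rng(2)
S, A, H, N = 3, 2, 10, 500
p1 = np.full(S, 1.0 / S)
P = rng.dirichlet(np.ones(S), size=(S, A))       # transitions p(. | s, a)
R = rng.uniform(size=(S, A))                     # reward R(s, a)
w = np.zeros((S, A))                             # Gibbs policy parameters

def pi(s, w):
    e = np.exp(w[s] - w[s].max())
    return e / e.sum()                           # pi(a | s; w) proportional to exp(w[s, a])

grad = np.zeros_like(w)
for _ in range(N):                               # Monte Carlo over sampled trajectories
    s = rng.choice(S, p=p1)
    states, actions, rewards = [], [], []
    for t in range(H):
        a = rng.choice(A, p=pi(s, w))
        states.append(s); actions.append(a); rewards.append(R[s, a])
        s = rng.choice(S, p=P[s, a])
    returns = np.cumsum(rewards[::-1])[::-1]     # Q_tau = sum_{t >= tau} R(z_t)
    for tau, (s_t, a_t) in enumerate(zip(states, actions)):
        score = -pi(s_t, w)                      # grad_{w[s_t, .]} log pi(a_t | s_t; w)
        score[a_t] += 1.0
        grad[s_t] += returns[tau] * score

grad /= N
print(grad)                                      # Monte Carlo estimate of grad_w U(w)
```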


43 Expectation Maximisation
An alternative is Expectation Maximisation (EM).
Introduce a variational distribution $q(z_{1:t}, t)$.
The Kullback-Leibler divergence, $KL(q \,\|\, \hat{p})$, gives the bound
$$\log U(w) \ge H_{\text{entropy}}\big( q(z_{1:t}, t) \big) + \mathbb{E}_{q(z_{1:t},t)}\big[ \log p(z_{1:t}, t; w) \big].$$

44 Expectation Maximisation
Iteratively maximise the bound w.r.t. $q$ and $w$:
E-step - optimise the bound w.r.t. $q(z_{1:t}, t)$,
$$q(z_{1:t}, t) = \hat{p}(z_{1:t}, t; w_k).$$
M-step - optimise the bound w.r.t. $w$,
$$w_{k+1} = \operatorname*{argmax}_{w} Q(w, w_k), \quad \text{where} \quad Q(w, w_k) = \sum_{t=1}^{H} \sum_{\tau=1}^{t} \mathbb{E}_{p(z,\tau,t;w_k)}\big[ \log p(a|s; w) \big].$$
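
As a rough illustration of the M-step: for a linear-Gaussian policy of the form used later in the talk, maximising $Q(w, w_k)$ amounts to a reward-weighted maximum-likelihood fit of the controller. The sketch below assumes the (invented) `states`, `actions`, and non-negative `weights` stand in for samples and reward weights obtained under $w_k$; it is not the talk's actual implementation.

```python
# EM M-step as reward-weighted least squares for a linear-Gaussian policy a = K s + m + noise.
import numpy as np

rng = np.random.default_rng(3)
n_samples, ds, da = 200, 4, 2
states = rng.normal(size=(n_samples, ds))         # sampled states (invented)
actions = rng.normal(size=(n_samples, da))        # sampled actions (invented)
weights = rng.uniform(size=n_samples)             # non-negative reward weights (invented)

X = np.hstack([states, np.ones((n_samples, 1))])  # append a bias column for the offset m
W = np.diag(weights)

# argmin_theta sum_i weights_i * || a_i - theta^T x_i ||^2  (weighted least squares)
theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ actions)
K, m = theta[:-1].T, theta[-1]
print(K.shape, m.shape)                           # new gain matrix and offset
```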

45 Relation to Steepest Gradient Ascent
What is the relation between steepest gradient ascent and EM?
Steepest gradient ascent:
$$\nabla_w U(w)\big|_{w=w_k} = \nabla_w Q(w, w_k)\big|_{w=w_k}.$$
Expectation Maximisation:
$$w_{k+1} = \operatorname*{argmax}_{w} Q(w, w_k).$$

46 Expectation Maximisation Summary EM is a two-stage iterative process. There is no need to select step-sizes. Rate of convergence: anywhere between sub-linear and quadratic.


48 Natural Gradient Ascent
Steepest gradient ascent assumes a Euclidean metric on the parameter space,
$$\nabla_w U(w)\big|_{w=w_k} = \operatorname*{argmax}_{p\,:\,p^T p = \epsilon} U(w_k + p).$$
In many cases this is not true of the parameter space, which instead has a manifold structure.

49 Natural Gradient Ascent
This is the idea behind natural gradient ascent [1],
$$G^{-1}(w)\, \nabla_w U(w)\big|_{w=w_k} = \operatorname*{argmax}_{p\,:\,p^T G(w) p = \epsilon} U(w_k + p),$$
where $G(w)$ is a local metric on the parameter manifold. The Fisher information matrix is used, where
$$G(w) = \sum_{t=1}^{H} \mathbb{E}_{p(z,t;w)}\big[ \nabla_w \nabla_w^T \log p(a|s; w) \big].$$

50 Natural Gradient Ascent
The Fisher information is easy to calculate/estimate.
The update is covariant, i.e. independent of the policy parameterisation.
The rate of convergence is still linear, but typically faster than steepest gradient ascent in practice.
A very popular method in the MDP literature since its introduction [2].
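
A hedged sketch of a natural gradient step: the Fisher matrix is estimated from outer products of score vectors (equal, in expectation, to the negative expected Hessian of the log-policy) and used to precondition the vanilla gradient. The `scores` and `vanilla_grad` arrays are invented stand-ins for quantities a policy gradient estimator would produce.

```python
# Empirical Fisher preconditioning: natural_grad = G(w)^{-1} grad U(w).
import numpy as np

rng = np.random.default_rng(4)
N, d = 1000, 6
scores = rng.normal(size=(N, d))           # grad_w log pi at sampled state-actions (stand-in)
vanilla_grad = rng.normal(size=d)          # steepest ascent direction estimate (stand-in)

G = scores.T @ scores / N                  # empirical Fisher information matrix
G += 1e-6 * np.eye(d)                      # small ridge for numerical stability
natural_grad = np.linalg.solve(G, vanilla_grad)

alpha = 0.1                                # step size still has to be chosen by hand
# w_new = w + alpha * natural_grad
print(natural_grad)
```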


52 Summary
So we have three methods: steepest gradient ascent, Expectation Maximisation, and natural gradient ascent.
Which is best? It depends on which paper you read.



55 Approximate Newton Method
In Newton's method:
There is no guarantee of a valid ascent direction in non-concave problems.
Inference is more expensive.
Inversion of the Hessian can be expensive.
Is there an approximation to the Hessian that doesn't suffer from these problems?

56 Approximate Newton Method
Through a second application of the log-trick the Hessian takes the form
$$H(w) = \sum_{t=1}^{H}\sum_{\tau,\tau'=1}^{t} \mathbb{E}_{p(z,z',\tau,\tau',t;w)}\big[ \nabla_w \log p(a|s;w)\, \nabla_w^T \log p(a'|s';w) \big] + \sum_{t=1}^{H}\sum_{\tau=1}^{t} \mathbb{E}_{p(z,\tau,t;w)}\big[ \nabla_w \nabla_w^T \log p(a|s;w) \big],$$
where
$$p(z, z', \tau, \tau', t; w) \equiv p(z_\tau = z, z_{\tau'} = z', t; w).$$

57 Approximate Newton Method
Consider
$$H_1(w) = \sum_{t=1}^{H}\sum_{\tau,\tau'=1}^{t} \mathbb{E}_{p(z,z',\tau,\tau',t;w)}\big[ \nabla_w \log p(a|s;w)\, \nabla_w^T \log p(a'|s';w) \big].$$
A positive mixture of outer product matrices $\Rightarrow$ positive semidefinite.
The matrix requires additional inference.
The matrix is generally dense.
We disregard this part of the Hessian.

58 Approximate Newton Method
Consider
$$H_2(w) = \sum_{t=1}^{H}\sum_{\tau=1}^{t} \mathbb{E}_{p(z,\tau,t;w)}\big[ \nabla_w \nabla_w^T \log p(a|s;w) \big].$$
The policy is log-concave in $w$ $\Rightarrow$ negative semidefinite.
Little or no additional inference required.
The matrix has sparsity properties not present in the Hessian.
We use this as our approximate Hessian.

59 Approximate Newton Method - Two Examples
Two prominent examples of policies that are log-concave are:
The Gibbs policy in discrete systems,
$$\pi(a|s; w) = \frac{e^{w^T \phi(a,s)}}{\sum_{a' \in A} e^{w^T \phi(a',s)}}.$$
The linear-Gaussian policy in continuous systems,
$$a = K\phi(s) + m + \eta,$$
with Gaussian noise $\eta$.
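
For the Gibbs policy the Hessian of the log-policy is minus the covariance of the features under the policy, which is independent of the sampled action; $H_2(w)$ accumulates such terms with non-negative value weights, so it stays negative semidefinite. The sketch below illustrates this for one state with invented features and an invented weight.

```python
# Hessian of log pi(a|s; w) for a Gibbs policy pi(a|s; w) ~ exp(w^T phi(a, s)).
import numpy as np

rng = np.random.default_rng(5)
A, d = 4, 3
w = rng.normal(size=d)
phi = rng.normal(size=(A, d))              # features phi(a, s) for one fixed state s (invented)

def hess_log_gibbs(phi, w):
    """grad grad_w log pi(a|s; w) = -Cov_{a' ~ pi(.|s; w)}[phi(a', s)], same for every a."""
    logits = phi @ w
    p = np.exp(logits - logits.max())
    p /= p.sum()                           # pi(a | s; w)
    mean = p @ phi                         # E_pi[phi]
    cov = (phi * p[:, None]).T @ phi - np.outer(mean, mean)
    return -cov                            # negative semidefinite by construction

weight = 2.5                               # stand-in for a non-negative state-action value weight
H2_contribution = weight * hess_log_gibbs(phi, w)
print(np.linalg.eigvalsh(H2_contribution)) # eigenvalues are non-positive (up to numerical error)
```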

60 Relation to Expectation Maximisation
It is possible to show, under suitable conditions, that
$$w' - w = -H_2^{-1}(w)\, \nabla_w U(w) + \mathcal{O}\big( \|w' - w\|^2 \big),$$
where $w'$ is the EM-update given parameters $w$.
In other words, EM moves, up to first order, in the direction of the approximate Newton method with a fixed step-size of unity.

61 Relation to Natural Gradient Ascent
What is the relation between natural gradient ascent and the approximate Newton method?
Natural gradient ascent preconditions with
$$G(w) = \sum_{t=1}^{H} \mathbb{E}_{p(z,t;w)}\big[ \nabla_w \nabla_w^T \log p(a|s; w) \big].$$
The approximate Newton method preconditions with
$$H_2(w) = \sum_{t=1}^{H} \mathbb{E}_{p(z,t;w) Q_t(z;w)}\big[ \nabla_w \nabla_w^T \log p(a|s; w) \big].$$


63 Model-Based Experiments
Model-based experiments in linear-Gaussian systems.
Inference is exact, so there are no issues of approximate inference.
Feedback-linearisation is used to linearise non-linear systems.

64 Lotka-Volterra System
The Lotka-Volterra equations model the population dynamics of a group of interacting species,
$$\dot{s} = D(s)\big( As + c + f(a) \big) + \eta.$$
Task - equilibrate the populations of the species.

65 Lotka-Volterra System - Search Direction
[Figure: normalised total expected reward against training time, comparing Steepest Gradient Ascent, Expectation Maximisation, Approximate Newton Method, and Natural Gradient Ascent.]
$N = 6$. $S = \mathbb{R}^6$. $A = \mathbb{R}^6$. $w \in \mathbb{R}^{13}$. $H = 1$. 3s training.

66 N-Link Rigid Manipulator
A simple model of a robotic joint,
$$M(q)\ddot{q} + C(\dot{q}, q)\dot{q} + g(q) = \tau.$$
Task - position the end effector.

67 N-Link Rigid Manipulator - Search Direction
[Figure: normalised total expected reward against training time, comparing Steepest Gradient Ascent, Expectation Maximisation, Approximate Newton Method, and Natural Gradient Ascent.]
$N = 3$. $S = \mathbb{R}^6$. $A = \mathbb{R}^3$. $w \in \mathbb{R}^{22}$. $H = 1$. 3s training.

68 Model-Free Experiments
Model-free experiments in non-linear systems.
Forward sampling is used in inference.
Linear controller with non-linear features,
$$a = K\phi(s) + m + \eta.$$
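
A small sketch of this controller class: a linear-Gaussian controller acting on non-linear features of the state. The particular feature map `phi` and the parameter values below are invented for illustration, not taken from the experiments.

```python
# Linear controller on non-linear state features: a = K phi(s) + m + eta.
import numpy as np

rng = np.random.default_rng(6)

def phi(s):
    """Non-linear state features, e.g. for a pendulum state s = (theta, theta_dot)."""
    theta, theta_dot = s
    return np.array([np.sin(theta), np.cos(theta), theta_dot])

dk, da = 3, 1
K = rng.normal(scale=0.1, size=(da, dk))   # feedback gain on the features
m = np.zeros(da)                           # constant offset
sigma = 0.05                               # exploration noise scale

def controller(s):
    return K @ phi(s) + m + sigma * rng.normal(size=da)

print(controller(np.array([0.3, -0.1])))
```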

69 Pendulum
A simple pendulum model,
$$ml\ddot{\theta} = mg\sin\theta - kl\dot{\theta} + \tau.$$
[Figure: pendulum diagram with length $l$, angle $\theta$, and gravitational force $mg$.]
Task - balance the pendulum in the upright position.

70 Pendulum - Search Direction
[Figure: normalised total expected reward against training iterations, comparing Expectation Maximisation, Approximate Newton Method, and Natural Gradients.]
$S = \mathbb{R}^2$. $A = \mathbb{R}$. $w \in \mathbb{R}^2$. $H = 1$. 5s training iterations.

71 Cart-Pole
The cart-pole problem,
$$I\ddot{\theta} = mgl\sin\theta - ml^2\ddot{\theta} - ml\ddot{y}\cos\theta,$$
$$M\ddot{y} = u - m\big( \ddot{y} + L\ddot{\theta}\cos\theta - L\dot{\theta}^2\sin\theta \big) - k\dot{y}.$$
[Figure: cart-pole diagram with pole angle $\theta$, cart position $y$, control force $u$, and gravitational force $mg$.]
Task - balance the pole in the upright position.

72 Cart-Pole - Search Direction
[Figure: normalised total expected reward against training iterations, comparing Expectation Maximisation, Approximate Newton Method, and Natural Gradients.]
$S = \mathbb{R}^4$. $A = \mathbb{R}$. $w \in \mathbb{R}^2$. $H = 1$. 5s training iterations.


74 Forward-Backward Inference
Model-based inference is similar to time-series inference.
Model-based time-series inference splits into:
Forward-backward inference.
Rauch-Tung-Striebel (RTS) inference.
Yet model-based inference in gradient-based methods for MDPs is exclusively forward-backward.

75 Forward-Backward Inference
Observe the standard form of the gradient,
$$\nabla_w U(w) = \sum_{\tau=1}^{H} \mathbb{E}_{p_\tau(z;w) Q_\tau(z;w)}\big[ \nabla_w \log p(a|s; w) \big].$$
$\{ p_\tau(z;w) \}_{\tau=1}^{H}$ - forward messages.
$\{ Q_\tau(z;w) \}_{\tau=1}^{H}$ - backward messages.
We use the new notation $Q^{\text{fb}}_\tau(z;w)$ for the state-action value function.


77 RTS Inference - Finite Planning Horizon
We redefine the state-action value function as follows,
$$Q^{\text{rts}}_\tau(z; w) = \sum_{t=\tau}^{H} p(z, \tau, t; w) = p_\tau(z; w)\, Q^{\text{fb}}_\tau(z; w).$$
The terms necessary for the policy update can be written in the form
$$\sum_{t=1}^{H}\sum_{\tau=1}^{t} p(z, \tau, t; w) = \sum_{\tau=1}^{H} Q^{\text{rts}}_\tau(z; w).$$

78 RTS Inference - Finite Planning Horizon
We obtain a recursive equation for these new Q-functions,
$$Q^{\text{rts}}_\tau(z; w) = p_\tau(z; w) R(z) + \sum_{z'} p_\tau(z|z'; w)\, Q^{\text{rts}}_{\tau+1}(z'; w).$$
Note the alternate direction of the transition dynamics compared to the standard recursion,
$$Q^{\text{fb}}_\tau(z; w) = R(z) + \sum_{z'} p(z'|z; w)\, Q^{\text{fb}}_{\tau+1}(z'; w).$$
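
A small tabular sketch of the recursion just stated, on an invented state-action chain: the forward marginals are computed first, the time-reversed conditionals $p_\tau(z|z') = p(z'|z)\,p_\tau(z)/p_{\tau+1}(z')$ are formed from them, and the $Q^{\text{rts}}$ functions are then iterated backwards. Everything here is made up for illustration.

```python
# RTS-style recursion: Q_rts_tau(z) = p_tau(z) R(z) + sum_z' p_tau(z|z') Q_rts_{tau+1}(z').
import numpy as np

rng = np.random.default_rng(7)
Z, H = 6, 8                                      # number of state-action pairs, horizon (invented)
T = rng.dirichlet(np.ones(Z), size=Z)            # forward chain T[z, z'] = p(z' | z)
R = rng.uniform(size=Z)                          # reward R(z)
p1 = np.full(Z, 1.0 / Z)                         # initial state-action marginal

# Forward pass: marginals p_tau(z) for tau = 1..H.
marg = [p1]
for _ in range(H - 1):
    marg.append(marg[-1] @ T)

# Backward pass with the time-reversed dynamics.
Q = [None] * H
Q[H - 1] = marg[H - 1] * R                       # Q_rts_H(z) = p_H(z) R(z)
for tau in range(H - 2, -1, -1):
    # reversal[z, z'] = p(z_tau = z | z_{tau+1} = z') = p(z'|z) p_tau(z) / p_{tau+1}(z')
    reversal = T * marg[tau][:, None] / np.maximum(marg[tau + 1][None, :], 1e-12)
    Q[tau] = marg[tau] * R + reversal @ Q[tau + 1]

# The terms needed for the policy update are the sums sum_tau Q_rts_tau(z).
print(np.sum(Q, axis=0))
```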

79 RTS Inference - Recursion Derivation
$$\begin{aligned}
Q^{\text{rts}}_\tau(z; w) &= p_\tau(z; w) \sum_{t=\tau}^{H} \mathbb{E}_{p_t(z';w)}\big[ R(z') \,\big|\, z_\tau = z \big] \\
&= p_\tau(z; w) R(z) + \sum_{t=\tau+1}^{H} \int dz'\, p(z_\tau = z, z_t = z'; w)\, R(z') \\
&= p_\tau(z; w) R(z) + \sum_{t=\tau+1}^{H} \int dz'\, dz''\, p(z_\tau = z, z_{\tau+1} = z'', z_t = z'; w)\, R(z') \\
&= p_\tau(z; w) R(z) + \sum_{t=\tau+1}^{H} \int dz'\, dz''\, p(z_t = z' | z_{\tau+1} = z''; w)\, p(z_\tau = z, z_{\tau+1} = z''; w)\, R(z') \\
&= p_\tau(z; w) R(z) + \sum_{t=\tau+1}^{H} \int dz'\, dz''\, p(z_t = z' | z_{\tau+1} = z''; w)\, p(z_\tau = z | z_{\tau+1} = z''; w)\, p(z_{\tau+1} = z''; w)\, R(z') \\
&= p_\tau(z; w) R(z) + \int dz''\, p(z_\tau = z | z_{\tau+1} = z''; w)\, p(z_{\tau+1} = z''; w) \sum_{t=\tau+1}^{H} \int dz'\, p(z_t = z' | z_{\tau+1} = z''; w)\, R(z') \\
&= p_\tau(z; w) R(z) + \int dz''\, p(z_\tau = z | z_{\tau+1} = z''; w)\, Q^{\text{rts}}_{\tau+1}(z''; w).
\end{aligned}$$

80 RTS Inference - Infinite Planning Horizon
In the case of an infinite planning horizon, $H = \infty$, we need to calculate the infinite summation $\sum_{t=1}^{\infty} Q^{\text{rts}}_t(z; w)$.
We use convergence of the trajectory distribution to the stationary state-action distribution.

81 RTS Inference - Infinite Planning Horizon
Suppose convergence is reached by $\hat{\tau}$, then
$$\sum_{t=1}^{\infty} Q^{\text{rts}}_t(z; w) = \sum_{t=1}^{\hat{\tau}-1} Q^{\text{rts}}_t(z; w) + \sum_{t=\hat{\tau}}^{\infty} Q^{\text{rts}}_t(z; w).$$
The first term is easy provided we know $Q^{\text{rts}}_{\hat{\tau}}(z; w)$.
We use stationarity of the state-action distribution to calculate the second term.

82 RTS Inference - Infinite Planning Horizon
For any $\tau \ge \hat{\tau}$ it is easy to show that $Q^{\text{rts}}_{\tau+1}(z; w) = \gamma\, Q^{\text{rts}}_\tau(z; w)$:
$$\begin{aligned}
Q^{\text{rts}}_{\tau+1}(z; w) &= p_{\tau+1}(z; w) \sum_{t=\tau+1}^{\infty} \mathbb{E}_{p_t(z';w)}\big[ \gamma^{t-1} R(z') \,\big|\, z_{\tau+1} = z \big] \\
&= \gamma\, p_\tau(z; w) \sum_{t=\tau}^{\infty} \mathbb{E}_{p_t(z';w)}\big[ \gamma^{t-1} R(z') \,\big|\, z_\tau = z \big] \\
&= \gamma\, Q^{\text{rts}}_\tau(z; w).
\end{aligned}$$

83 RTS Inference - Infinite Planning Horizon
We can now simplify the second term,
$$\sum_{t=\hat{\tau}}^{\infty} Q^{\text{rts}}_t(z; w) = \sum_{t=\hat{\tau}}^{\infty} \gamma^{t-1}\, Q^{\text{rts}}_\infty(z; w) = \frac{\gamma^{\hat{\tau}-1}}{1-\gamma}\, Q^{\text{rts}}_\infty(z; w).$$
It remains to find $Q^{\text{rts}}_\infty(z; w)$.
Extending the finite horizon derivation gives the fixed-point equation
$$Q^{\text{rts}}_\infty(z; w) = p_\infty(z; w)\, R(z) + \gamma \sum_{z'} p_\infty(z|z'; w)\, Q^{\text{rts}}_\infty(z'; w).$$
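
A hedged sketch of solving a fixed-point equation of this form directly as a linear system, on an invented tabular chain: the stationary state-action distribution is taken as the leading left eigenvector, the stationary reversal is formed from it, and the resulting linear system is solved in one step.

```python
# Solve Q_inf(z) = p_inf(z) R(z) + gamma * sum_z' p_inf(z|z') Q_inf(z') as a linear system.
import numpy as np

rng = np.random.default_rng(8)
Z, gamma = 6, 0.95
T = rng.dirichlet(np.ones(Z), size=Z)            # forward chain T[z, z'] = p(z' | z)
R = rng.uniform(size=Z)

# Stationary distribution: left eigenvector of T with eigenvalue 1.
vals, vecs = np.linalg.eig(T.T)
p_inf = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
p_inf = np.abs(p_inf) / np.abs(p_inf).sum()

# Stationary reversal: p_inf(z | z') = p(z' | z) p_inf(z) / p_inf(z').
reversal = T * p_inf[:, None] / p_inf[None, :]

# (I - gamma * reversal) Q_inf = p_inf * R, solvable since gamma < 1.
Q_inf = np.linalg.solve(np.eye(Z) - gamma * reversal, p_inf * R)
print(Q_inf)
```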

84 Examples
Now consider some examples where the RTS approach is beneficial. In particular: linear-Gaussian systems, and high-dimensional discrete systems.

85 Continuous Systems
In continuous problems the Q-recursion becomes
$$Q^{\text{rts}}_\tau(z; w) = p_\tau(z; w) R(z) + \int dz'\, p_\tau(z|z'; w)\, Q^{\text{rts}}_{\tau+1}(z'; w).$$
It is no longer possible to maintain a closed form for the Q-functions.

86 Continuous Systems
However, we only require moments to perform the policy update.
For example, with a linear controller we only require the moments
$$\sum_{\tau=1}^{H} \mathbb{E}_{Q_\tau(z;w)}\big[ z \big], \qquad \sum_{\tau=1}^{H} \mathbb{E}_{Q_\tau(z;w)}\big[ zz^T \big].$$
These moments can be iterated exactly in linear time.

87 Linear Systems
We consider the example of a linear dynamical system with a linear controller. All functions have linear-Gaussian form:
$$p(s_1) = \mathcal{N}(s_1 \mid \mu_0, \Sigma_0),$$
$$p(s_{t+1}|s_t, a_t) = \mathcal{N}(s_{t+1} \mid A s_t + B a_t, \Sigma),$$
$$p(a_t|s_t; K, m, \pi_\sigma) = \mathcal{N}(a_t \mid K s_t + m, \pi_\sigma),$$
$$R(z) = \mathcal{N}(y_j \mid M z, L_j).$$
Policy parameters - $w = (K, m, \pi_\sigma)$.

88 Linear Systems - Reward Weighted Trajectory Distribution
$p(z_{1:t}, t; w)$ is an unnormalised mixture of Gaussians.
Each marginal is an unnormalised Gaussian.
We don't need each marginal, but the summation of the marginals.
[Figure: mixture components for $t = 1, 2, \ldots, H$.]

89 Linear Systems - Forward-Backward Inference
Linear dynamical system with a linear controller.
Forward-backward inference in this model was considered in [3].
The standard forward-backward equation has the form
$$\mathbb{E}_{p_\tau(z;w)}\big[ z\, Q^{\text{fb}}_\tau(z; w) \big] = \mathbb{E}_{p_\tau(z;w)}\big[ z R(z) \big] + \mathbb{E}_{p_\tau(z;w)}\Big[ z\, \mathbb{E}_{p(z'|z;w)}\big[ Q^{\text{fb}}_{\tau+1}(z'; w) \big] \Big].$$

90 Linear Systems - Forward-Backward Inference
Linear dynamical system with a linear controller.
An equivalent form of the forward-backward equation is
$$\mathbb{E}_{p_\tau(z;w)}\big[ z\, Q^{\text{fb}}_\tau(z; w) \big] = \mathbb{E}_{p_\tau(z;w)}\big[ z R(z) \big] + \mathbb{E}_{pQ_\tau(z;w)}\big[ z \big],$$
where
$$pQ_\tau(z; w) = p_\tau(z; w)\, \mathbb{E}_{p(z'|z;w)}\big[ Q^{\text{fb}}_{\tau+1}(z'; w) \big] = \sum_{t=\tau+1}^{H} \int dz'\, p(z_\tau = z, z_t = z'; w)\, R(z').$$

91 Linear Systems - Forward-Backward Inference
Linear dynamical system with a linear controller.
Linear-Gaussian system $\Rightarrow$ $pQ_\tau(z; w)$ is an unnormalised mixture of Gaussians.
The number of components equals $(H - t)$.
Calculating $\mathbb{E}_{pQ_\tau(z;w)}[z]$ has a cost of $\mathcal{O}(H - t)$.
The overall cost of forward-backward inference is $\mathcal{O}(H^2)$.
There is no clear extension to the infinite horizon.

92 Linear Systems - RTS Inference - Finite Horizon
Linear dynamical system with a linear controller.
We need to calculate
$$\sum_{\tau=1}^{H} \mathbb{E}_{Q^{\text{rts}}_\tau(z;w)}\big[ z \big], \qquad \sum_{\tau=1}^{H} \mathbb{E}_{Q^{\text{rts}}_\tau(z;w)}\big[ zz^T \big].$$
The Q-recursion has the form
$$\mathbb{E}_{Q^{\text{rts}}_\tau(z;w)}\big[ z \big] = \mathbb{E}_{p_\tau(z;w)}\big[ z R(z) \big] + \mathbb{E}_{Q^{\text{rts}}_{\tau+1}(z';w)}\Big[ \mathbb{E}_{p_\tau(z|z';w)}\big[ z \big] \Big],$$
$$\mathbb{E}_{Q^{\text{rts}}_\tau(z;w)}\big[ zz^T \big] = \mathbb{E}_{p_\tau(z;w)}\big[ zz^T R(z) \big] + \mathbb{E}_{Q^{\text{rts}}_{\tau+1}(z';w)}\Big[ \mathbb{E}_{p_\tau(z|z';w)}\big[ zz^T \big] \Big].$$

93 Linear Systems - RTS Inference - Finite Horizon
Linear dynamical system with a linear controller.
Denote the moments of the reward function
$$\mu^R_\tau = \mathbb{E}_{p_\tau(z;w)}\big[ R(z) z \big], \qquad \Sigma^R_\tau = \mathbb{E}_{p_\tau(z;w)}\big[ R(z) zz^T \big].$$
Linear system $\Rightarrow$ linear reversal dynamics,
$$z_\tau = G_\tau z_{\tau+1} + m_\tau + \eta_\tau.$$

94 Linear Systems - RTS Inference - Finite Horizon
Linear dynamical system with a linear controller.
The RTS recursion has the form
$$\mathbb{E}_{Q^{\text{rts}}_\tau(z;w)}\big[ z \big] = \mu^R_\tau + \mathbb{E}_{Q^{\text{rts}}_{\tau+1}(z';w)}\big[ G_\tau z' + m_\tau \big],$$
$$\mathbb{E}_{Q^{\text{rts}}_\tau(z;w)}\big[ zz^T \big] = \Sigma^R_\tau + \mathbb{E}_{Q^{\text{rts}}_{\tau+1}(z';w)}\big[ (G_\tau z' + m_\tau)(G_\tau z' + m_\tau)^T + \Sigma_\tau \big].$$

95 Linear Systems - RTS Inference - Finite Horizon
Linear dynamical system with a linear controller.
Denote the first two moments of $Q^{\text{rts}}_\tau(z; w)$ by $\mu^Q_\tau$ and $\Sigma^Q_\tau$. The recursion for $\mu^Q_\tau$ and $\Sigma^Q_\tau$ is immediate:
$$\mu^Q_\tau = \mu^R_\tau + Z_{\tau+1} m_\tau + G_\tau \mu^Q_{\tau+1},$$
$$\Sigma^Q_\tau = \Sigma^R_\tau + Z_{\tau+1}\big( \Sigma_\tau + m_\tau m_\tau^T \big) + G_\tau\big( \Sigma^Q_{\tau+1} + \mu^Q_{\tau+1} m_\tau^T + m_\tau (\mu^Q_{\tau+1})^T \big) G_\tau^T.$$

96 Linear Systems - RTS Inference - Infinite Horizon
Linear dynamical system with a linear controller.
In the case of an infinite horizon with discounted rewards we have the fixed-point equations
$$\mu^Q = \mu^R + \gamma\big( Z m + G \mu^Q \big),$$
$$\Sigma^Q = \Sigma^R + \gamma\Big( Z\big( \Sigma + m m^T \big) + G\big( \Sigma^Q + \mu^Q m^T + m (\mu^Q)^T \big) G^T \Big).$$

97 Lotka-Volterra System
The Lotka-Volterra equations model the population dynamics of a group of interacting species,
$$\dot{s} = D(s)\big( As + c + f(a) \big) + \eta.$$
Task - equilibrate the populations of the species.

98 Lotka-Volterra System - Finite Horizon
[Figure: normalised total expected reward against training time, comparing RTS (Q) inference and forward-backward inference.]
$N = 6$. $S = \mathbb{R}^6$. $A = \mathbb{R}^6$. $w \in \mathbb{R}^{13}$. $H = 1$. 3s training.

99 Lotka-Volterra System - Finite Horizon
Expectation Maximisation was used in training.
To obtain a similar level of performance: RTS inference - 35 s training time; Forward-Backward inference - 3 s training time.
Forward-Backward inference obtains 5% of the performance of RTS inference.

100 Lotka-Volterra System - Infinite Horizon
[Figure: normalised total expected reward against training time, comparing the infinite horizon method with the finite horizon heuristic.]
$N = 6$. $S = \mathbb{R}^6$. $A = \mathbb{R}^6$. $w \in \mathbb{R}^{13}$. $H = \infty$. 6s training.

101 Lotka-Volterra System - Infinite Horizon
Expectation Maximisation was used in training.
A heuristic horizon was used in forward-backward inference, H = 1.
Training iterations performed: RTS inference ± 221.8; Forward-Backward inference ± .7.

102 N-Link Rigid Manipulator
A simple model of a robotic joint,
$$M(q)\ddot{q} + C(\dot{q}, q)\dot{q} + g(q) = \tau.$$
Task - position the end effector.

103 N-Link Rigid Manipulator - Finite Horizon
[Figure: normalised total expected reward against training time, comparing RTS (Q) inference and forward-backward inference.]
$N = 3$. $S = \mathbb{R}^6$. $A = \mathbb{R}^3$. $w \in \mathbb{R}^{22}$. $H = 1$. 3s training.

104 N-Link Rigid Manipulator - Finite Horizon
Expectation Maximisation was used in training.
To obtain a similar level of performance: RTS inference - 35 s training time; Forward-Backward inference - 3 s training time.
Forward-Backward inference obtains 5% of the performance of RTS inference.

105 N-Link Rigid Manipulator - Infinite Horizon
We considered the infinite horizon problem.
The policy parameters often tended to the boundary of the unit circle.
The problem should properly be handled as a constrained optimisation problem: constraining to the unit circle is difficult, and is a point of future research.

106 Continuous Systems - Summary
More efficient than forward-backward inference, with runtime $\mathcal{O}(H)$ instead of $\mathcal{O}(H^2)$.
Extends to infinite horizon problems.
Higher order algorithms, e.g. the Newton method, run in $\mathcal{O}(H)$ instead of $\mathcal{O}(H^3)$.

107 Continuous Systems - Extensions
It is possible to consider more general systems, e.g.:
Model non-Gaussian rewards through a mixture of Gaussians.
Model certain non-linear systems through feedback-linearisation.
Approximate the trajectory distribution of a non-linear system with a Gaussian, through e.g. EP, and use the RTS recursion.
Possible extensions to (controlled) switching linear dynamical systems.

108 Bibliography I
[1] Amari, S. Natural Gradient Works Efficiently in Learning. Neural Computation, 10(2), 1998.
[2] Kakade, S. A Natural Policy Gradient. Advances in Neural Information Processing Systems (NIPS) 14, 2002.
[3] Hoffman, M., de Freitas, N., Doucet, A. and Peters, J. An Expectation Maximization Algorithm for Continuous Markov Decision Processes with Arbitrary Rewards. AISTATS, 2009.
