Gradient Methods for Markov Decision Processes
1 Gradient Methods for Markov Decision Processes Department of Computer Science University College London May 11, 2012
2 Outline 1 Introduction: Markov Decision Processes; Dynamic Programming. 2 Gradient Methods: Notation; Steepest Gradient Ascent; Expectation Maximisation; Natural Gradient Ascent; Summary; Approximate Newton Method; Experiments. 3 Model-Based Inference: Forward-Backward Inference; Rauch-Tung-Striebel Inference.
4 Markov Decision Processes Markov Decision Processes address the problem of optimal decision making in a dynamic environment.
7 Markov Decision Processes Examples include robotics, optimal game play, and navigation.
8 Markov Decision Processes More formally, Markov Decision Processes (MDPs) are given by the tuple $(\mathcal{A}, \mathcal{S}, H, p_1, R, p)$, where $\mathcal{A}$ - action space, either discrete or continuous; $\mathcal{S}$ - state space, either discrete or continuous; $\mathcal{Z} = \mathcal{S} \times \mathcal{A}$ - state-action space; $H$ - planning horizon, either finite or infinite.
9 Markov Decision Processes More formally, Markov Decision Processes (MDPs) are given by the tuple $(\mathcal{A}, \mathcal{S}, H, p_1, R, p)$, where $p_1(s) : \mathcal{S} \to [0,1]$ - initial state distribution; $\pi(a|s) : \mathcal{A} \times \mathcal{S} \to [0,1]$ - policy; $R(a,s) : \mathcal{A} \times \mathcal{S} \to \mathbb{R}^+$ - reward function; $p(s'|s,a) : \mathcal{S}^2 \times \mathcal{A} \to [0,1]$ - transition dynamics.
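To make the tuple concrete, here is a minimal sketch (a hypothetical toy problem, not from the talk) of a small discrete MDP represented with numpy arrays:

```python
# A hypothetical toy MDP (A, S, H, p1, R, p) with discrete states and actions.
import numpy as np

n_states, n_actions, horizon = 4, 2, 10
rng = np.random.default_rng(0)

p1 = np.full(n_states, 1.0 / n_states)                # initial distribution p_1(s)
R = rng.uniform(size=(n_actions, n_states))           # reward R(a, s) in R^+
P = rng.uniform(size=(n_states, n_states, n_actions))
P /= P.sum(axis=0, keepdims=True)                     # P[s', s, a] = p(s'|s, a)
```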
10 Markov Decision Processes One of the main assumptions of the MDP model is Markovian dynamics: $$p(a_{1:H}, s_{1:H}; \pi) = p(a_H|s_H; \pi) \left\{ \prod_{t=1}^{H-1} p(s_{t+1}|s_t, a_t)\, p(a_t|s_t; \pi) \right\} p_1(s_1).$$ [Figure: gridworld trajectory from Start to Finish.]
23 Markov Decision Processes Objective - optimise $\pi$ to maximise the total expected reward $$U(\pi) = \sum_{t=1}^{H} \mathbb{E}_{p_t(a,s;\pi)}\big[R(a,s)\big],$$ where $p_t(a,s;\pi)$ is the state-action marginal of the $t^{th}$ time-point.
24 Markov Decision Processes The objective is unbounded in infinite horizons. Discounted rewards: $$U(\pi) = \sum_{t=1}^{\infty} \mathbb{E}_{p_t(a,s;\pi)}\big[\gamma^{t-1} R(a,s)\big], \qquad \gamma \in (0,1).$$ Average rewards: $$U(\pi) = \lim_{H \to \infty} \frac{1}{H} \sum_{t=1}^{H} \mathbb{E}_{p_t(a,s;\pi)}\big[R(a,s)\big].$$
26 Dynamic Programming It is theoretically possible to solve an MDP through dynamic programming. Finite horizon Bellman equation: $$V_t(s) = \max_{a \in \mathcal{A}} \Big\{ R(s,a) + \mathbb{E}_{p(s'|s,a)}\big[V_{t+1}(s')\big] \Big\}.$$ Discounted infinite horizon Bellman equation: $$V(s) = \max_{a \in \mathcal{A}} \Big\{ R(s,a) + \gamma\, \mathbb{E}_{p(s'|s,a)}\big[V(s')\big] \Big\}.$$
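As an illustration, a minimal value-iteration sketch for the discounted Bellman equation, reusing the toy arrays from the sketch above (the discount and tolerance values are arbitrary choices):

```python
import numpy as np

def value_iteration(P, R, gamma=0.95, tol=1e-8):
    """Iterate V(s) = max_a { R(a,s) + gamma * E_{p(s'|s,a)}[V(s')] }."""
    V = np.zeros(P.shape[0])
    while True:
        # Q[a, s] = R(a, s) + gamma * sum_{s'} P[s', s, a] * V(s')
        Q = R + gamma * np.einsum("tsa,t->as", P, V)
        V_new = Q.max(axis=0)  # greedy maximisation over the action space
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
```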
27-30 Dynamic Programming - An Example A graphical example of dynamic programming. [Figures: the value function initially, after 1 iteration, after 2 iterations, and after 12 iterations.]
31 Dynamic Programming Dynamic programming has numerous issues, including: Curse of dimensionality - complexity scales exponentially in the dimension of the state-action space. Representation issues in non-linear continuous systems. Global maximisation over the action space can be problematic.
32 Beyond Dynamic Programming Various solutions have been proposed, including: Approximate dynamic programming - works in the space of value functions; often good initial performance; convergence issues, e.g. policy oscillation. Policy search methods, which include gradient methods - work in policy space; very general convergence guarantees.
34 Notation We consider gradient-based methods, so introduce a parametric policy $\pi(a|s; w)$, $w \in \mathcal{W}$. Write the objective in terms of $w$, i.e. $U(w)$, and similarly for the trajectory distribution, $p(z_{1:H}; w)$. Also introduce the state-action value function $$Q_\tau(z; w) = \sum_{t=\tau}^{H} \mathbb{E}_{p_t(a,s;w)}\big[R(a,s) \,\big|\, z_\tau = z\big].$$
35 Reward Weighted Trajectory Distribution The unnormalised reward weighted trajectory distribution is $$p(z_{1:t}, t; w) = R(z_t)\, p(z_{1:t}; w), \qquad t = 1, \dots, H.$$ Denote the normalised version by $\hat{p}(z_{1:t}, t; w)$. [Figure: mixture components $t = 1, 2, \dots, H$.] Note - the normalisation constant equals $U(w)$, i.e. $$\hat{p}(z_{1:t}, t; w) = \frac{p(z_{1:t}, t; w)}{U(w)}.$$
37 Steepest Gradient Ascent The gradient can be calculated through likelihood ratios. In terms of $p(z, \tau, t; w)$ the gradient takes the form $$\nabla_w U(w) = \sum_{t=1}^{H} \sum_{\tau=1}^{t} \mathbb{E}_{p(z,\tau,t;w)}\big[\nabla_w \log \pi(a|s; w)\big].$$ In terms of the state-action value function, $$\nabla_w U(w) = \sum_{\tau=1}^{H} \mathbb{E}_{p_\tau(z;w)\,Q_\tau(z;w)}\big[\nabla_w \log \pi(a|s; w)\big].$$
38 Steepest Gradient Ascent - Derivation The gradient can be calculated through likelihood ratios. The likelihood ratio, or log-trick, gives the gradient $$\nabla_w U(w) = \sum_{t=1}^{H} \mathbb{E}_{p(z_{1:t};w)}\big[R(z_t)\, \nabla_w \log p(z_{1:t}; w)\big].$$
39 Steepest Gradient Ascent - Derivation Equivalently, in terms of the reward weighted trajectory distribution, $$\nabla_w U(w) = \sum_{t=1}^{H} \mathbb{E}_{p(z_{1:t},t;w)}\big[\nabla_w \log p(z_{1:t}; w)\big].$$
40 Steepest Gradient Ascent - Derivation Markovian dynamics then gives $$\nabla_w U(w) = \sum_{t=1}^{H} \sum_{\tau=1}^{t} \mathbb{E}_{p(z,\tau,t;w)}\big[\nabla_w \log \pi(a|s; w)\big].$$
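A minimal Monte-Carlo sketch of this likelihood-ratio gradient for the toy MDP above, assuming a Gibbs policy $\pi(a|s;w) \propto e^{w^T\phi(a,s)}$ (the Gibbs policy appears later in the talk; the feature array and episode counts are illustrative assumptions):

```python
import numpy as np

def reinforce_gradient(w, phi, p1, P, R, H, n_episodes, rng):
    """Estimate grad U(w) = sum_t sum_{tau<=t} E[ R(z_t) grad log pi(a_tau|s_tau; w) ].

    phi: features of shape (n_actions, n_states, d); P[s', s, a] = p(s'|s, a).
    """
    grad = np.zeros_like(w)
    for _ in range(n_episodes):
        s = rng.choice(len(p1), p=p1)
        scores, rewards = [], []
        for _ in range(H):
            logits = phi[:, s, :] @ w
            pi = np.exp(logits - logits.max())
            pi /= pi.sum()
            a = rng.choice(len(pi), p=pi)
            scores.append(phi[a, s] - pi @ phi[:, s, :])   # grad log pi(a|s; w)
            rewards.append(R[a, s])
            s = rng.choice(P.shape[0], p=P[:, s, a])
        rewards_to_go = np.cumsum(rewards[::-1])[::-1]     # sum_{t >= tau} R(z_t)
        grad += sum(g * sc for g, sc in zip(rewards_to_go, scores))
    return grad / n_episodes
```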
41 Steepest Gradient Ascent - Summary Summary: It is possible to calculate the gradient through likelihood ratios. Often poorly conditioned $\Rightarrow$ difficult to select a step-size. Linear rate of convergence.
43 Expectation Maximisation An alternative is Expectation Maximisation (EM). Introduce a variational distribution $q(z_{1:t}, t)$. The Kullback-Leibler divergence $\mathrm{KL}(q\,\|\,\hat{p})$ gives the bound $$\log U(w) \geq \mathcal{H}\big(q(z_{1:t}, t)\big) + \mathbb{E}_{q(z_{1:t},t)}\big[\log p(z_{1:t}, t; w)\big],$$ where $\mathcal{H}$ denotes entropy.
44 Expectation Maximisation Iteratively maximise the bound w.r.t. $q$ and $w$. E-step - optimise the bound w.r.t. $q(z_{1:t}, t)$: $$q(z_{1:t}, t) = \hat{p}(z_{1:t}, t; w_k).$$ M-step - optimise the bound w.r.t. $w$: $$w_{k+1} = \operatorname*{argmax}_{w} Q(w, w_k), \quad \text{where} \quad Q(w, w_k) = \sum_{t=1}^{H} \sum_{\tau=1}^{t} \mathbb{E}_{p(z,\tau,t;w_k)}\big[\log \pi(a|s; w)\big].$$
45 Relation to Steepest Gradient Ascent What is the relation between steepest gradient ascent and EM? Steepest gradient ascent: $$\nabla_w U(w)\big|_{w=w_k} = \nabla_w Q(w, w_k)\big|_{w=w_k}.$$ Expectation Maximisation: $$w_{k+1} = \operatorname*{argmax}_{w} Q(w, w_k).$$
46 Expectation Maximisation Summary: EM is a two-stage iterative process. There is no need to select step-sizes. Rate of convergence: anywhere between sub-linear and quadratic.
48 Natural Gradient Ascent Steepest gradient ascent assumes a Euclidean metric on the parameter space: for small $\epsilon$, $$\nabla_w U(w)\big|_{w=w_k} \propto \operatorname*{argmax}_{p\,:\,p^T p = \epsilon} U(w_k + p).$$ In many cases this is not true of the parameter space, which instead has a manifold structure.
49 Natural Gradient Ascent This is the idea behind natural gradient ascent [1]: for small $\epsilon$, $$G^{-1}(w)\, \nabla_w U(w)\big|_{w=w_k} \propto \operatorname*{argmax}_{p\,:\,p^T G(w_k) p = \epsilon} U(w_k + p),$$ where $G(w)$ is a local metric on the parameter manifold. The Fisher information matrix is used: $$G(w) = -\sum_{t=1}^{H} \mathbb{E}_{p(z,t;w)}\big[\nabla_w \nabla_w^T \log \pi(a|s; w)\big].$$
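A minimal sketch of a natural-gradient step: estimate the Fisher matrix from sampled score vectors (using the identity that the Fisher equals the expected outer product of scores) and solve the resulting linear system. The damping term is an illustrative assumption for numerical stability, not part of the talk:

```python
import numpy as np

def natural_gradient_step(grad_U, scores, damping=1e-3):
    """scores: (N, d) array of sampled grad log pi(a|s; w) vectors."""
    G = scores.T @ scores / len(scores)      # Fisher estimate E[score score^T]
    d = len(grad_U)
    return np.linalg.solve(G + damping * np.eye(d), grad_U)  # G^{-1} grad U
```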
50 Natural Gradient Ascent The Fisher information is easy to calculate/estimate. The update is covariant (independent of the policy parameterisation). The rate of convergence is still linear, but typically faster than steepest gradient ascent in practice. A very popular method in the MDP literature since its introduction [2].
52 Summary So we have three methods: Steepest gradient ascent. Expectation Maximisation. Natural gradient ascent. Which is best? It depends on which paper you read.
55 Approximate Newton Method In Newton's method: There is no guarantee of a valid ascent direction in non-concave problems. Inference is more expensive. Inversion of the Hessian can be expensive. Is there an approximation to the Hessian that doesn't suffer from these problems?
56 Approximate Newton Method Through a second application of the log-trick, the Hessian takes the form $$\mathcal{H}(w) = \sum_{t=1}^{H} \sum_{\tau,\tau'=1}^{t} \mathbb{E}_{p(z,z',\tau,\tau',t;w)}\big[\nabla_w \log \pi(a|s; w)\, \nabla_w^T \log \pi(a'|s'; w)\big] + \sum_{t=1}^{H} \sum_{\tau=1}^{t} \mathbb{E}_{p(z,\tau,t;w)}\big[\nabla_w \nabla_w^T \log \pi(a|s; w)\big],$$ where $p(z, z', \tau, \tau', t; w) \equiv p(z_\tau = z, z_{\tau'} = z', t; w)$.
57 Approximate Newton Method Consider $$\mathcal{H}_1(w) = \sum_{t=1}^{H} \sum_{\tau,\tau'=1}^{t} \mathbb{E}_{p(z,z',\tau,\tau',t;w)}\big[\nabla_w \log \pi(a|s; w)\, \nabla_w^T \log \pi(a'|s'; w)\big].$$ A positive mixture of outer-product matrices $\Rightarrow$ positive semidefinite. The matrix requires additional inference. The matrix is generally dense. We disregard this part of the Hessian.
58 Approximate Newton Method Consider $$\mathcal{H}_2(w) = \sum_{t=1}^{H} \sum_{\tau=1}^{t} \mathbb{E}_{p(z,\tau,t;w)}\big[\nabla_w \nabla_w^T \log \pi(a|s; w)\big].$$ Policy log-concave in $w$ $\Rightarrow$ negative semidefinite. Little or no additional inference required. The matrix has sparsity properties not present in the full Hessian. We use this as our approximate Hessian.
59 Approximate Newton Method - Two Examples Two prominent examples of policies that are log-concave in $w$ are: The Gibbs policy in discrete systems, $$\pi(a|s; w) = \frac{e^{w^T \phi(a,s)}}{\sum_{a' \in \mathcal{A}} e^{w^T \phi(a',s)}}.$$ The linear Gaussian policy in continuous systems, $$a = K\phi(s) + m + \eta, \quad \eta \text{ Gaussian noise}.$$
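For the Gibbs policy, log-concavity can be checked directly: the Hessian of $\log \pi(a|s;w)$ is minus the feature covariance under the policy, hence negative semidefinite for every $w$. A small sketch (the random features are purely illustrative):

```python
import numpy as np

def gibbs_log_policy_hessian(w, Phi):
    """Hessian of log pi(a|s; w) for features Phi of shape (n_actions, d).

    log pi(a) = w.phi(a) - log sum_a' exp(w.phi(a')), so the Hessian is
    -Cov_pi[phi], independent of the chosen action a.
    """
    logits = Phi @ w
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()
    centred = Phi - pi @ Phi
    return -centred.T @ (pi[:, None] * centred)

rng = np.random.default_rng(1)
H2 = gibbs_log_policy_hessian(rng.normal(size=3), rng.normal(size=(5, 3)))
assert np.all(np.linalg.eigvalsh(H2) <= 1e-12)   # negative semidefinite
```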
60 Relation to Expectation Maximisation It is possible to show, under suitable conditions, that $$\tilde{w} - w = -\mathcal{H}_2^{-1}(w)\, \nabla_w U(w) + O\big((\tilde{w} - w)^2\big),$$ where $\tilde{w}$ is the EM update given parameters $w$. In other words, EM moves, up to first order, in the direction of the approximate Newton method with a fixed step-size of unity.
61 Relation to Natural Gradient Ascent What is the relation between natural gradient ascent and the approximate Newton method? Natural gradient ascent preconditions with $$G(w) = -\sum_{t=1}^{H} \mathbb{E}_{p(z,t;w)}\big[\nabla_w \nabla_w^T \log \pi(a|s; w)\big].$$ The approximate Newton method preconditions with $$\mathcal{H}_2(w) = \sum_{t=1}^{H} \mathbb{E}_{p(z,t;w)\,Q_t(z;w)}\big[\nabla_w \nabla_w^T \log \pi(a|s; w)\big],$$ i.e. the expectation is additionally weighted by the state-action value function.
63 Model-Based Experiments Model-based experiments in linear-Gaussian systems. Inference is exact - no issues of approximate inference. Feedback-linearisation is used to linearise non-linear systems.
64 Lotka-Volterra System The Lotka-Volterra equations model the population dynamics of a group of interacting animal species: $$\dot{s} = D(s)\big(As + c + f(a)\big) + \eta.$$ Task - equilibrate the populations of the species.
65 Lotka-Volterra System - Search Direction [Figure: normalised total expected reward against training time for steepest gradient ascent, Expectation Maximisation, the approximate Newton method, and natural gradient ascent.] N = 6. $\mathcal{S} = \mathbb{R}^6$. $\mathcal{A} = \mathbb{R}^6$. $w \in \mathbb{R}^{13}$. H = 1. 3s training.
66 N-Link Rigid Manipulator A simple model of a robotic joint: $$M(q)\ddot{q} + C(\dot{q}, q)\dot{q} + g(q) = \tau.$$ Task - position the end effector.
67 N-Link Rigid Manipulator - Search Direction [Figure: normalised total expected reward against training time for steepest gradient ascent, Expectation Maximisation, the approximate Newton method, and natural gradient ascent.] N = 3. $\mathcal{S} = \mathbb{R}^6$. $\mathcal{A} = \mathbb{R}^3$. $w \in \mathbb{R}^{22}$. H = 1. 3s training.
68 Model-Free Experiments Model-free experiments in non-linear systems. Forward sampling is used in inference. A linear controller with non-linear features, $$a = K\phi(s) + m + \eta.$$
69 Pendulum A simple pendulum model: $$ml\ddot{\theta} = -mg\sin\theta - kl\dot{\theta} + \tau.$$ [Figure: pendulum diagram with length $l$, angle $\theta$, and gravity $mg$.] Task - balance the pendulum in the upright position.
70 Pendulum - Search Direction [Figure: normalised total expected reward against training iterations for Expectation Maximisation, the approximate Newton method, and natural gradients.] $\mathcal{S} = \mathbb{R}^2$. $\mathcal{A} = \mathbb{R}$. $w \in \mathbb{R}^2$. H = 1. 5s training iterations.
71 Cart-Pole The cart-pole problem: $$I\ddot{\theta} = mgl\sin\theta - ml^2\ddot{\theta} - ml\ddot{y}\cos\theta,$$ $$M\ddot{y} = u - m\big(\ddot{y} + L\ddot{\theta}\cos\theta - L\dot{\theta}^2\sin\theta\big) - k\dot{y}.$$ [Figure: cart-pole diagram with angle $\theta$, cart position $y$, and control force $u$.] Task - balance the pole in the upright position.
72 Cart-Pole - Search Direction [Figure: normalised total expected reward against training iterations for Expectation Maximisation, the approximate Newton method, and natural gradients.] $\mathcal{S} = \mathbb{R}^4$. $\mathcal{A} = \mathbb{R}$. $w \in \mathbb{R}^2$. H = 1. 5s training iterations.
74 Forward-Backward Inference Model-based inference is similar to time-series inference. Model-based time-series inference splits into: forward-backward inference; Rauch-Tung-Striebel (RTS) inference. Yet model-based inference in gradient-based methods for MDPs is exclusively forward-backward.
75 Forward-Backward Inference Observe the standard form of the gradient, $$\nabla_w U(w) = \sum_{\tau=1}^{H} \mathbb{E}_{p_\tau(z;w)\,Q_\tau(z;w)}\big[\nabla_w \log \pi(a|s; w)\big].$$ $\{p_\tau(z; w)\}_{\tau=1}^{H}$ - forward messages. $\{Q_\tau(z; w)\}_{\tau=1}^{H}$ - backward messages. We use the new notation $Q^{fb}_\tau(z; w)$ for the state-action value function.
77 RTS Inference - Finite Planning Horizon We redefine the state-action value function as follows: $$Q^{rts}_\tau(z; w) = \sum_{t=\tau}^{H} p(z, \tau, t; w) = p_\tau(z; w)\, Q^{fb}_\tau(z; w).$$ The terms necessary for the policy update can be written in the form $$\sum_{t=1}^{H} \sum_{\tau=1}^{t} p(z, \tau, t; w) = \sum_{\tau=1}^{H} Q^{rts}_\tau(z; w).$$
78 RTS Inference - Finite Planning Horizon We obtain a recursive equation for these new Q-functions: $$Q^{rts}_\tau(z; w) = p_\tau(z; w) R(z) + \sum_{z'} p_\tau(z|z'; w)\, Q^{rts}_{\tau+1}(z'; w).$$ Note the reversed direction of the transition dynamics compared to the standard recursion $$Q^{fb}_\tau(z; w) = R(z) + \sum_{z'} p(z'|z; w)\, Q^{fb}_{\tau+1}(z'; w).$$
79 RTS Inference - Recursion Derivation $$\begin{aligned} Q^{rts}_\tau(z; w) &= p_\tau(z; w) \sum_{t=\tau}^{H} \mathbb{E}_{p_t(z';w)}\big[R(z') \,\big|\, z_\tau = z\big] \\ &= p_\tau(z; w) R(z) + \sum_{t=\tau+1}^{H} \int dz'\, p(z_\tau = z, z_t = z'; w)\, R(z') \\ &= p_\tau(z; w) R(z) + \sum_{t=\tau+1}^{H} \int dz'\, dz''\, p(z_\tau = z, z_{\tau+1} = z'', z_t = z'; w)\, R(z') \\ &= p_\tau(z; w) R(z) + \sum_{t=\tau+1}^{H} \int dz'\, dz''\, p(z_t = z' | z_{\tau+1} = z''; w)\, p(z_\tau = z, z_{\tau+1} = z''; w)\, R(z') \\ &= p_\tau(z; w) R(z) + \sum_{t=\tau+1}^{H} \int dz'\, dz''\, p(z_t = z' | z_{\tau+1} = z''; w)\, p(z_\tau = z | z_{\tau+1} = z''; w)\, p(z_{\tau+1} = z''; w)\, R(z') \\ &= p_\tau(z; w) R(z) + \int dz''\, p(z_\tau = z | z_{\tau+1} = z''; w)\, p(z_{\tau+1} = z''; w) \sum_{t=\tau+1}^{H} \int dz'\, p(z_t = z' | z_{\tau+1} = z''; w)\, R(z') \\ &= p_\tau(z; w) R(z) + \int dz''\, p(z_\tau = z | z_{\tau+1} = z''; w)\, Q^{rts}_{\tau+1}(z''; w). \end{aligned}$$
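For a discrete chain, both recursions are only a few lines; a minimal sketch (the marginals, forward and reversed transition matrices, and reward vector are hypothetical inputs that model-based inference would supply):

```python
import numpy as np

def q_fb(P_fwd, r, H):
    """Q^fb_tau(z) = r(z) + sum_{z'} p(z'|z) Q^fb_{tau+1}(z'); P_fwd[z', z] = p(z'|z)."""
    Q = [None] * H
    Q[H - 1] = r.copy()
    for tau in range(H - 2, -1, -1):
        Q[tau] = r + P_fwd.T @ Q[tau + 1]
    return Q

def q_rts(p_marg, P_rev, r, H):
    """Q^rts_tau(z) = p_tau(z) r(z) + sum_{z'} p_tau(z|z') Q^rts_{tau+1}(z').

    p_marg[tau]: marginal p_tau(z); P_rev[tau][z, z'] = p_tau(z|z').
    """
    Q = [None] * H
    Q[H - 1] = p_marg[H - 1] * r
    for tau in range(H - 2, -1, -1):
        Q[tau] = p_marg[tau] * r + P_rev[tau] @ Q[tau + 1]
    return Q
```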
80 RTS Inference - Infinite Planning Horizon In the case of an infinite planning horizon, $H = \infty$, we need to calculate the infinite summation $\sum_{t=1}^{\infty} Q^{rts}_t(z; w)$. We use convergence of the trajectory distribution to the stationary state-action distribution.
81 RTS Inference - Infinite Planning Horizon Suppose convergence is reached by $\hat{\tau}$; then $$\sum_{t=1}^{\infty} Q^{rts}_t(z; w) = \sum_{t=1}^{\hat{\tau}-1} Q^{rts}_t(z; w) + \sum_{t=\hat{\tau}}^{\infty} Q^{rts}_t(z; w).$$ The first term is easy provided we know $Q^{rts}_{\hat{\tau}}(z; w)$. We use stationarity of the state-action distribution to calculate the second term.
82 RTS Inference - Infinite Planning Horizon For any $\tau \geq \hat{\tau}$ we have, using stationarity, $$Q^{rts}_{\tau+1}(z; w) = p_{\tau+1}(z; w) \sum_{t=\tau+1}^{\infty} \mathbb{E}_{p_t(z';w)}\big[\gamma^{t-1} R(z') \,\big|\, z_{\tau+1} = z\big] = \gamma\, p_\tau(z; w) \sum_{t=\tau}^{\infty} \mathbb{E}_{p_t(z';w)}\big[\gamma^{t-1} R(z') \,\big|\, z_\tau = z\big].$$ It is thus easy to show that $Q^{rts}_{\tau+1}(z; w) = \gamma\, Q^{rts}_\tau(z; w)$.
83 RTS Inference - Infinite Planning Horizon We can now simplify the second term: $$\sum_{t=\hat{\tau}}^{\infty} Q^{rts}_t(z; w) = \sum_{t=\hat{\tau}}^{\infty} \gamma^{t-1} Q^{rts}_\infty(z; w) = \frac{\gamma^{\hat{\tau}-1}}{1 - \gamma}\, Q^{rts}_\infty(z; w).$$ It remains to find $Q^{rts}_\infty(z; w)$. Extending the finite horizon derivation gives the fixed-point equation $$Q^{rts}_\infty(z; w) = p_\infty(z; w) R(z) + \gamma \sum_{z'} p_\infty(z|z'; w)\, Q^{rts}_\infty(z'; w).$$
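In a discrete chain this stationary fixed-point equation is linear in $Q^{rts}_\infty$, so it can be solved directly; a minimal sketch (inputs hypothetical, as above):

```python
import numpy as np

def q_rts_stationary(p_inf, P_rev_inf, r, gamma):
    """Solve Q = p_inf * r + gamma * P_rev_inf @ Q for the stationary Q^rts."""
    n = len(p_inf)
    return np.linalg.solve(np.eye(n) - gamma * P_rev_inf, p_inf * r)
```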
84 Examples We now consider some examples where the RTS approach is beneficial. In particular: linear-Gaussian systems; high-dimensional discrete systems.
85 Continuous Systems In continuous problems the Q-recursion becomes $$Q^{rts}_\tau(z; w) = p_\tau(z; w) R(z) + \int dz'\, p_\tau(z|z'; w)\, Q^{rts}_{\tau+1}(z'; w).$$ It is no longer possible to maintain a closed form for the Q-functions.
86 Continuous Systems However, we only require moments to perform the policy update. For example, with a linear controller we only require the moments $$\sum_{\tau=1}^{H} \mathbb{E}_{Q^{rts}_\tau(z;w)}\big[z\big], \qquad \sum_{\tau=1}^{H} \mathbb{E}_{Q^{rts}_\tau(z;w)}\big[zz^T\big].$$ These moments can be iterated exactly in linear time.
87 Linear Systems We consider the example of a linear dynamical system with a linear controller. All functions have linear-Gaussian form: $$p(s_1) = \mathcal{N}(s_1 | \mu_0, \Sigma_0), \quad p(s_{t+1}|s_t, a_t) = \mathcal{N}(s_{t+1} | As_t + Ba_t, \Sigma), \quad \pi(a_t|s_t; K, m, \pi_\sigma) = \mathcal{N}(a_t | Ks_t + m, \pi_\sigma), \quad R(z) = \mathcal{N}(y_j | Mz, L_j).$$ Policy parameters - $w = (K, m, \pi_\sigma)$.
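As a concrete illustration, a hypothetical instantiation of such a system (all dimensions and numerical values are illustrative assumptions, not the talk's experimental settings):

```python
# Hypothetical linear dynamical system with a linear-Gaussian controller.
import numpy as np

ds, da = 2, 1
rng = np.random.default_rng(2)
A = np.eye(ds) + 0.01 * rng.normal(size=(ds, ds))   # state transition matrix
B = rng.normal(size=(ds, da))                       # control matrix
Sigma = 0.01 * np.eye(ds)                           # transition noise covariance
K = rng.normal(size=(da, ds))                       # policy gain
m = np.zeros(da)                                    # policy offset
pi_sigma = 0.1 * np.eye(da)                         # policy noise covariance

def policy_sample(s, rng):
    """Draw a ~ N(K s + m, pi_sigma)."""
    return rng.multivariate_normal(K @ s + m, pi_sigma)
```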
88 Linear Systems - Reward Weighted Trajectory Distribution $p(z_{1:t}, t; w)$ is an unnormalised mixture of Gaussians; each marginal is an unnormalised Gaussian. [Figure: mixture components $t = 1, 2, \dots, H$.] We don't need each individual marginal, only the summation of the marginals.
89 Linear Systems - Forward-Backward Inference Linear dynamical system with a linear controller. Forward-backward inference in this model was considered in [3]. The standard forward-backward equation has the form $$\mathbb{E}_{p_\tau(z;w)}\big[z\, Q^{fb}_\tau(z; w)\big] = \mathbb{E}_{p_\tau(z;w)}\big[z R(z)\big] + \mathbb{E}_{p_\tau(z;w)}\Big[z\, \mathbb{E}_{p(z'|z;w)}\big[Q^{fb}_{\tau+1}(z'; w)\big]\Big].$$
90 Linear Systems - Forward-Backward Inference An equivalent form of the forward-backward equation is $$\mathbb{E}_{p_\tau(z;w)}\big[z\, Q^{fb}_\tau(z; w)\big] = \mathbb{E}_{p_\tau(z;w)}\big[z R(z)\big] + \mathbb{E}_{pQ_\tau(z;w)}\big[z\big],$$ where $$pQ_\tau(z; w) = p_\tau(z; w)\, \mathbb{E}_{p(z'|z;w)}\big[Q^{fb}_{\tau+1}(z'; w)\big] = \sum_{t=\tau+1}^{H} \int dz'\, p(z_\tau = z, z_t = z'; w)\, R(z').$$
91 Linear Systems - Forward-Backward Inference In the linear-Gaussian system, $pQ_\tau(z; w)$ is an unnormalised mixture of Gaussians. The number of components equals $(H - \tau)$. The calculation of $\mathbb{E}_{pQ_\tau(z;w)}[z]$ has a cost of $O(H - \tau)$. The overall cost of forward-backward inference is $O(H^2)$. There is no clear extension to the infinite horizon.
92 Linear Systems - RTS Inference - Finite Horizon We need to calculate $$\sum_{\tau=1}^{H} \mathbb{E}_{Q^{rts}_\tau(z;w)}\big[z\big], \qquad \sum_{\tau=1}^{H} \mathbb{E}_{Q^{rts}_\tau(z;w)}\big[zz^T\big].$$ The Q-recursion has the form $$\mathbb{E}_{Q^{rts}_\tau(z;w)}\big[z\big] = \mathbb{E}_{p_\tau(z;w)}\big[z R(z)\big] + \mathbb{E}_{Q^{rts}_{\tau+1}(z';w)}\Big[\mathbb{E}_{p_\tau(z|z';w)}\big[z\big]\Big],$$ $$\mathbb{E}_{Q^{rts}_\tau(z;w)}\big[zz^T\big] = \mathbb{E}_{p_\tau(z;w)}\big[zz^T R(z)\big] + \mathbb{E}_{Q^{rts}_{\tau+1}(z';w)}\Big[\mathbb{E}_{p_\tau(z|z';w)}\big[zz^T\big]\Big].$$
93 Linear Systems - RTS Inference - Finite Horizon Denote the moments of the reward function by $$\mu^R_\tau = \mathbb{E}_{p_\tau(z;w)}\big[R(z)\, z\big], \qquad \Sigma^R_\tau = \mathbb{E}_{p_\tau(z;w)}\big[R(z)\, zz^T\big].$$ A linear system has linear system-reversal dynamics: $$z_\tau = G_\tau z_{\tau+1} + m_\tau + \eta_\tau.$$
94 Linear Systems - RTS Inference - Finite Horizon The RTS recursion has the form $$\mathbb{E}_{Q^{rts}_\tau(z;w)}\big[z\big] = \mu^R_\tau + \mathbb{E}_{Q^{rts}_{\tau+1}(z';w)}\big[G_\tau z' + m_\tau\big],$$ $$\mathbb{E}_{Q^{rts}_\tau(z;w)}\big[zz^T\big] = \Sigma^R_\tau + \mathbb{E}_{Q^{rts}_{\tau+1}(z';w)}\Big[\big(G_\tau z' + m_\tau\big)\big(G_\tau z' + m_\tau\big)^T + \Sigma_\tau\Big].$$
95 Linear Systems - RTS Inference - Finite Horizon Denote the first two moments of $Q^{rts}_\tau(z; w)$ by $\mu^Q_\tau$ and $\Sigma^Q_\tau$, and its normalisation by $Z_\tau$. The recursion for $\mu^Q_\tau$ and $\Sigma^Q_\tau$ is immediate: $$\mu^Q_\tau = \mu^R_\tau + Z_{\tau+1}\, m_\tau + G_\tau \mu^Q_{\tau+1},$$ $$\Sigma^Q_\tau = \Sigma^R_\tau + Z_{\tau+1}\big(\Sigma_\tau + m_\tau m_\tau^T\big) + G_\tau\big(\Sigma^Q_{\tau+1} + \mu^Q_{\tau+1} m_\tau^T + m_\tau (\mu^Q_{\tau+1})^T\big) G_\tau^T.$$
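A minimal sketch of the first-moment recursion, which runs in $O(H)$ as claimed above (the per-time inputs $\mu^R_\tau$, $Z_\tau$, and the reversal parameters are hypothetical quantities an RTS pass would supply):

```python
import numpy as np

def mu_q_recursion(mu_R, Z, m_rev, G_rev, H):
    """mu^Q_tau = mu^R_tau + Z_{tau+1} m_tau + G_tau mu^Q_{tau+1}; returns the sum over tau."""
    mu_Q = mu_R[H - 1].copy()       # terminal case: no tail term
    total = mu_Q.copy()
    for tau in range(H - 2, -1, -1):
        mu_Q = mu_R[tau] + Z[tau + 1] * m_rev[tau] + G_rev[tau] @ mu_Q
        total += mu_Q
    return total
```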
96 Linear Systems - RTS Inference - Infinite Horizon In the case of an infinite horizon with discounted rewards we have the fixed-point equations $$\mu^Q_\infty = \mu^R_\infty + \gamma\big(Z_\infty\, m_\infty + G_\infty \mu^Q_\infty\big),$$ $$\Sigma^Q_\infty = \Sigma^R_\infty + \gamma\Big(Z_\infty\big(\Sigma_\infty + m_\infty m_\infty^T\big) + G_\infty\big(\Sigma^Q_\infty + \mu^Q_\infty m_\infty^T + m_\infty (\mu^Q_\infty)^T\big) G_\infty^T\Big).$$
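The first-moment fixed point is linear in $\mu^Q_\infty$ and can be solved in closed form; a minimal sketch under the same assumptions (the second moment can be handled analogously by vectorising its fixed-point equation):

```python
import numpy as np

def mu_q_stationary(mu_R, Z, m, G, gamma):
    """Solve mu_Q = mu_R + gamma * (Z * m + G @ mu_Q)."""
    n = len(mu_R)
    return np.linalg.solve(np.eye(n) - gamma * G, mu_R + gamma * Z * m)
```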
97 Lotka-Volterra System The Lotka-Volterra equations model the population dynamics of a group of interacting animal species: $$\dot{s} = D(s)\big(As + c + f(a)\big) + \eta.$$ Task - equilibrate the populations of the species.
98 Lotka-Volterra System - Finite Horizon [Figure: normalised total expected reward against training time for Q (RTS) inference and forward-backward inference.] N = 6. $\mathcal{S} = \mathbb{R}^6$. $\mathcal{A} = \mathbb{R}^6$. $w \in \mathbb{R}^{13}$. H = 1. 3s training.
99 Lotka-Volterra System - Finite Horizon Expectation Maximisation was used in training. To obtain a similar level of performance: RTS inference - 35s training time; forward-backward inference - 3s training time. Forward-backward inference obtains 5% of the performance of RTS inference.
100 Lotka-Volterra System - Infinite Horizon [Figure: normalised total expected reward against training time for the infinite horizon method and the finite horizon heuristic.] N = 6. $\mathcal{S} = \mathbb{R}^6$. $\mathcal{A} = \mathbb{R}^6$. $w \in \mathbb{R}^{13}$. $H = \infty$. 6s training.
101 Lotka-Volterra System - Infinite Horizon Expectation Maximisation was used in training. A heuristic horizon, H = 1, was used in forward-backward inference. Training iterations performed: RTS inference ±221.8; forward-backward inference ±.7.
102 N-Link Rigid Manipulator A simple model of a robotic joint: $$M(q)\ddot{q} + C(\dot{q}, q)\dot{q} + g(q) = \tau.$$ Task - position the end effector.
103 N-Link Rigid Manipulator - Finite Horizon [Figure: normalised total expected reward against training time for Q (RTS) inference and forward-backward inference.] N = 3. $\mathcal{S} = \mathbb{R}^6$. $\mathcal{A} = \mathbb{R}^3$. $w \in \mathbb{R}^{22}$. H = 1. 3s training.
104 N-Link Rigid Manipulator - Finite Horizon Expectation Maximisation was used in training. To obtain a similar level of performance: RTS inference - 35s training time; forward-backward inference - 3s training time. Forward-backward inference obtains 5% of the performance of RTS inference.
105 N-Link Rigid Manipulator - Infinite Horizon We considered the infinite horizon problem. The policy parameters often tended to the boundary of the unit circle, so the problem should properly be handled as a constrained optimisation problem. Constraining to the unit circle is difficult - a point of future research.
106 Continuous Systems - Summary RTS inference is more efficient than forward-backward inference, with runtime $O(H)$ instead of $O(H^2)$. It extends to infinite horizon problems. It allows higher order algorithms, e.g. the Newton method, with runtime $O(H)$ instead of $O(H^3)$.
107 Continuous Systems - Extensions It is possible to consider more general systems, e.g.: Model non-Gaussian rewards through a mixture of Gaussians. Model certain non-linear systems through feedback-linearisation. Approximate the trajectory distribution of a non-linear system with a Gaussian, through e.g. Expectation Propagation (EP), and use the RTS recursion. Possible extensions to (controlled) switching linear dynamical systems.
108 Bibliography [1] Amari, S. Natural Gradient Works Efficiently in Learning. Neural Computation, 10(2):251-276, 1998. [2] Kakade, S. A Natural Policy Gradient. NIPS 14, 2002. [3] Hoffman, M., de Freitas, N., Doucet, A. and Peters, J. An Expectation Maximization Algorithm for Continuous Markov Decision Processes with Arbitrary Rewards. AISTATS, 2009.