Reinforcement Learning In Continuous Time and Space

1 Reinforcement Learning In Continuous Time and Space. A presentation of the paper by Kenji Doya. Leszek Rybicki (lrybicki@mat.umk.pl)

2 Reinforcement Learning: A Machine Learning Paradigm. Idea: Reinforcement Learning is a machine learning paradigm in which the agent learns by trial and error. The agent takes actions within an environment and receives a numerical reinforcement. The goal is to maximize the total reinforcement in a given task. The paper in question considers fully observable, deterministic decision processes over continuous, factored state and action spaces and continuous time.

5 Example: Pendulum swing-up with limited torque. The maximal output torque is smaller than the maximal load torque. Goal: swing the load up and keep it up for some time. (Figure: a pendulum of length l driven by torque T under gravity mg.)

6 Example: Cart-pole swing-up. The cart can move both ways, with limited speed and travel distance. Goal: swing the load up and keep it up for some time.

7 Basic terms
x(t) ∈ X - the state at time t
u(t) ∈ U - the action at time t
x(t+1) = f(x(t), u(t)) - the state transition function
r(t) = r(x(t), u(t)) - the reinforcement at time t
µ : X → U - the agent's policy
V^µ : X → R - the state value function
Goal: find a policy µ* : X → U with an optimal value function V*.

14 Temporal Difference Learning I: TD(0). The idea: given an estimate (prediction) f(x(t), u(t)) of the state transition function and an estimate V(x(t)) of the optimal state value function V*(x(t)), employ a greedy policy when selecting the next action. Prediction: lookup tables or function approximators can be used for f and V.

15 Temporal Difference Learning II: TD(0). TD error: given
V(x(t)) = ∑_{k=t}^∞ γ^{k−t} r(k),
a perfect prediction of V should satisfy
V(x(t)) = r(t+1) + γV(x(t+1)),
so the prediction error is
δ(t+1) = r(t+1) + γV(t+1) − V(t).

16 Temporal Difference Learning III: TD(0). Learning: at each step, observe the state change and the reinforcement, calculate the TD error, and update the prediction of V:
V(x) ← V(x) + ηδ(t+1)  if x = x(t),
V(x) ← V(x)  otherwise.
This is known as TD(0).
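To make the update concrete, here is a minimal tabular TD(0) sketch in Python; the environment interface (reset, sample_action, step) and all hyperparameter values are assumptions of this illustration, not part of the slides.

```python
import numpy as np

def td0(env, n_states, episodes=100, gamma=0.95, eta=0.1):
    """Tabular TD(0): move V(x(t)) toward r(t+1) + gamma * V(x(t+1))."""
    V = np.zeros(n_states)
    for _ in range(episodes):
        x = env.reset()                     # initial state index
        done = False
        while not done:
            u = env.sample_action()         # any behaviour policy
            x_next, r, done = env.step(u)   # observe state change and reward
            delta = r + gamma * V[x_next] - V[x]   # TD error
            V[x] += eta * delta             # update only the visited state
            x = x_next
    return V
```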

17 Eligibility traces: TD(λ). The value of state x(t−1) depends on the value of state x(t) and of all future states, so any update of the value estimate V(x(t)) should also be reflected in the estimate of V(x(t−1)) and of all past states, in a discounted manner:
V(x(t)) ← V(x(t)) + ηδ(t_0)λ^{t_0−t}  for t < t_0,
V(x(t)) ← V(x(t))  otherwise,
where λ is a discounting parameter. This is known as TD(λ).
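The same idea is usually implemented online with an explicit eligibility-trace vector; the sketch below shows that standard formulation, again with an assumed environment interface.

```python
import numpy as np

def td_lambda(env, n_states, episodes=100, gamma=0.95, lam=0.8, eta=0.1):
    """Tabular TD(lambda): every recently visited state shares in each TD error."""
    V = np.zeros(n_states)
    for _ in range(episodes):
        e = np.zeros(n_states)              # eligibility trace per state
        x = env.reset()
        done = False
        while not done:
            u = env.sample_action()
            x_next, r, done = env.step(u)
            delta = r + gamma * V[x_next] - V[x]
            e *= gamma * lam                # decay all traces
            e[x] += 1.0                     # bump the trace of the current state
            V += eta * delta * e            # discounted update of past states
            x = x_next
    return V
```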

18 Actor-Critic method: exploring the space. Sometimes the learning system is split into two parts: the critic, which maintains the state value estimate V, and the actor, which is responsible for choosing the appropriate action in each state. The actor's policy is adjusted using a signal from the critic. This is known as the Actor-Critic method.

19 Problems with the discrete approach:
1. Many real-world problems are continuous in time and space.
2. Coarse discretization leads to jumpy actions and can make the algorithm non-convergent.
3. Fine discretization leads to high dimensionality, delays the reward by many steps, and needs more trials.

25 Discrete and continuous TD: comparison.
Discrete: x(t) ∈ X, u(t) ∈ U,
V(x(t)) = ∑_{k=t}^∞ γ^{k−t} r(k),
δ(t) = r(t+1) + γV(t+1) − V(t).
Continuous: x(t) ∈ R^n, u(t) ∈ R^m,
V(x(t)) = ∫_t^∞ e^{−(s−t)/τ} r(x(s), u(s)) ds,
δ(t) = r(t) − (1/τ)V(t) + V̇(t),
where τ is the time constant for discounting future rewards; it plays the role of the discrete discount factor γ (for a unit time step, τ = 1/(1−γ)).
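In practice the continuous TD error is computed from sampled values, with V̇(t) approximated by a finite difference; a minimal sketch (the arguments are placeholders of this illustration):

```python
def continuous_td_error(r_t, v_t, v_prev, tau, dt):
    """delta(t) = r(t) - V(t)/tau + Vdot(t), with Vdot approximated
    by the backward difference (V(t) - V(t - dt)) / dt."""
    v_dot = (v_t - v_prev) / dt
    return r_t - v_t / tau + v_dot
```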

34 Continuous Value Function: time-symmetric method. Let V(x; w) be an approximator of the value function with parameters w, and let E(t) = (1/2)δ(t)² be the objective function to minimize. Using the chain rule V̇(t) = (∂V/∂x)ẋ(t), we have
∂E(t)/∂w_i = δ(t) ∂/∂w_i [ r(t) − (1/τ)V(t) + V̇(t) ]
 = δ(t) [ −(1/τ) ∂V(x; w)/∂w_i + ∂/∂w_i ( (∂V(x; w)/∂x) ẋ(t) ) ],
giving us a gradient descent algorithm for the parameters w_i.

35 Backward Euler Differentiation: residual gradient and TD(0). By substituting V̇(t) = (V(t) − V(t−Δt))/Δt we get
δ(t) = r(t) + (1/Δt) [ (1 − Δt/τ) V(t) − V(t−Δt) ],
giving the gradient with respect to parameter w_i:
∂E(t)/∂w_i = δ(t) (1/Δt) [ (1 − Δt/τ) ∂V(x(t); w)/∂w_i − ∂V(x(t−Δt); w)/∂w_i ].
Using both terms yields the residual-gradient update; using only the ∂V(x(t−Δt); w)/∂w_i term for the parameter update gives the TD(0)-style update.
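The two variants differ only in which gradients enter the update; the sketch below spells this out, assuming a differentiable value approximator exposing value(x, w) and grad_w(x, w) (names chosen here for illustration).

```python
import numpy as np

def backward_euler_update(w, x_t, x_prev, r_t, value, grad_w,
                          dt=0.02, tau=1.0, eta=0.1, residual=False):
    """One value-function update based on the backward-Euler TD error."""
    v_t, v_prev = value(x_t, w), value(x_prev, w)
    delta = r_t + ((1.0 - dt / tau) * v_t - v_prev) / dt
    if residual:
        # residual gradient: differentiate both V(t) and V(t - dt)
        g = ((1.0 - dt / tau) * grad_w(x_t, w) - grad_w(x_prev, w)) / dt
        return w - eta * delta * g
    # TD(0)-style: adjust only the value of the earlier state x(t - dt)
    return w + eta * delta * grad_w(x_prev, w)
```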

37 Exponential Eligibility Trace I: TD(λ). Assume that at time t = t_0 an impulse reward is given, making the temporal profile of the value function
V^µ(t) = e^{−(t_0−t)/τ} for t < t_0, and 0 for t > t_0,
thus
ẇ_i = ηδ(t_0) ∫_0^{t_0} e^{−(t_0−t)/τ} ∂V(x(t); w)/∂w_i dt.
(Figure: panels A-D showing the impulse reward r(t) at t_0 and the corresponding temporal profiles of V(t) and of its estimate.)

38 Exponential Eligibility Trace II: TD(λ). Eligibility traces: let
e_i(t) = ∫_0^t e^{−(t−s)/κ} ∂V(x(s); w)/∂w_i ds
be the eligibility trace for parameter w_i. We then have
ẇ_i = ηδ(t)e_i(t),  κė_i(t) = −e_i(t) + ∂V(x(t); w)/∂w_i,
where 0 < κ ≤ τ is the time constant of the eligibility trace.
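Discretizing the trace dynamics with an Euler step of size dt (a choice made here for illustration) gives a simple online rule:

```python
def trace_step(e, grad_v, kappa, dt):
    """Euler step of kappa * de/dt = -e + dV/dw for the eligibility trace."""
    return e + dt * (-e + grad_v) / kappa

def weight_step(w, e, delta, eta, dt):
    """Euler step of dw/dt = eta * delta * e for the value parameters."""
    return w + dt * eta * delta * e
```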

39 Improving the policy: exploring the possibilities. One way to improve the policy is the Actor-Critic method; another is to take a greedy, value-gradient based policy with respect to the current value function,
u(t) = µ(x(t)) = argmax_{u ∈ U} [ r(x(t), u) + (∂V(x)/∂x) f(x(t), u) ],
which requires knowledge of the reward r and of the system dynamics f.

40 Continuous Actor-Critic: stochastic exploration. Let A(x(t); w^A) ∈ R^m be a function approximator with parameters w^A. We can then have a stochastic actor that employs the policy
u(t) = s(A(x(t); w^A) + σn(t)),
where s is a monotonically increasing output function and n(t) is noise. The actor's parameters are updated by
ẇ^A_i = η^A δ(t) n(t) ∂A(x(t); w^A)/∂w^A_i.
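A minimal sketch of these update rules for a scalar action with linear approximators over features φ(x); the feature representation, the tanh output function, and the hyperparameters are assumptions of this illustration.

```python
import numpy as np

def select_action(w_a, feat, sigma, rng):
    """Stochastic policy u = s(A(x; w_a) + sigma * n) with s = tanh."""
    noise = rng.standard_normal()
    u = np.tanh(w_a @ feat + sigma * noise)
    return u, noise

def actor_critic_update(w_v, w_a, e, feat_t, feat_prev, r_t, noise,
                        dt=0.02, tau=1.0, kappa=0.5, eta_v=0.1, eta_a=0.1):
    """One Euler step of the critic (value) and actor updates."""
    v_t, v_prev = w_v @ feat_t, w_v @ feat_prev
    delta = r_t - v_t / tau + (v_t - v_prev) / dt       # continuous TD error
    e = e + dt * (-e + feat_prev) / kappa               # exponential eligibility trace
    w_v = w_v + dt * eta_v * delta * e                  # critic: grad of V(x; w_v) is phi(x)
    w_a = w_a + dt * eta_a * delta * noise * feat_prev  # actor: reinforce the applied noise
    return w_v, w_a, e, delta
```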

41 Value-Gradient Based Policy: a heuristic approach. In discrete problems we can simply search the finite set of actions for the optimal one; here the set of actions is infinite. Let's assume
r(x, u) = R(x) − S(u),  S(u) = ∑_{j=1}^m S_j(u_j),
where S(u) is the cost of the action and R(x) is the unknown part of the reinforcement.

42 Cost of the greedy policy, given a world model. For the greedy policy it holds that
S_j'(u_j) = (∂V(x)/∂x) (∂f(x, u)/∂u_j),
where ∂f(x, u)/∂u is the input gain matrix of the system dynamics. Assuming that this matrix does not depend on u and that S_j is convex, the equation has the solution
u_j = (S_j')^{−1}( (∂V(x)/∂x) (∂f(x, u)/∂u_j) ).

43 Feedback control with a sigmoid output function. A common constraint is that the amplitude of the action u_j is bounded by u_j^max. We incorporate the constraint into the action cost:
S_j(u_j) = c_j ∫_0^{u_j} s^{−1}(u / u_j^max) du,
where s is a sigmoid function. This results in the following policy:
u_j = u_j^max s( (1/c_j) (∂f(x, u)/∂u_j)^T (∂V(x)/∂x)^T ).
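As a sketch, the resulting bounded feedback controller with the arctangent sigmoid used later in the simulations could look as follows; the value gradient and input gain column are supplied by the caller (placeholders of this illustration).

```python
import numpy as np

def sigmoid(x):
    """s(x) = (2/pi) * arctan((pi/2) * x), a saturating output function."""
    return (2.0 / np.pi) * np.arctan((np.pi / 2.0) * x)

def bounded_greedy_action(dV_dx, df_du_j, u_max, c_j):
    """u_j = u_max * s( (1/c_j) * (df/du_j)^T (dV/dx)^T )."""
    return u_max * sigmoid(np.dot(df_du_j, dV_dx) / c_j)
```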

44 Advantage updating, when the model is not available. When the world model f is not available, we can simultaneously update the value function and an advantage function A(x, u):
A*(x, u) = r(x, u) − (1/τ)V*(x) + (∂V*(x)/∂x) f(x, u),
where max_u [A*(x, u)] = 0 for the optimal action u. The advantage estimate is updated so that A(x(t), u(t)) − max_u [A(x(t), u)] tracks the TD error δ(t). The policy then is to select the action u for which the advantage function is maximal.
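A rough tabular sketch of the update under that reading, with the actions discretized onto a grid (a simplification made here; the paper works with function approximators):

```python
import numpy as np

def advantage_update(A, x_idx, u_idx, delta, eta_a=0.1):
    """Move A(x, u) so that A(x, u) - max_u A(x, u) tracks the TD error delta.

    A is an (n_states, n_actions) table; x_idx and u_idx index the visited
    state and the executed (discretized) action.
    """
    target = np.max(A[x_idx]) + delta
    A[x_idx, u_idx] += eta_a * (target - A[x_idx, u_idx])
    return A

def advantage_policy(A, x_idx):
    """Select the action with the maximal advantage in state x."""
    return int(np.argmax(A[x_idx]))
```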

45 Simulations: compared methods. Four algorithms were compared: the discrete actor-critic, the continuous actor-critic, the value-gradient based policy with a world model, and the value-gradient based policy with learning of the input gain matrix. Value and policy functions were approximated by normalized Gaussian networks; the sigmoid function used was s(x) = (2/π) arctan((π/2) x).
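For reference, a normalized Gaussian network represents V(x; w) as a weighted sum of Gaussian activations normalized to sum to one over the basis centers; the sketch below is a generic version of that idea (the center layout and common width are choices made here, not taken from the paper).

```python
import numpy as np

def ngnet_features(x, centers, width):
    """Normalized Gaussian activations b_k(x) = a_k(x) / sum_l a_l(x)."""
    d2 = np.sum((centers - x) ** 2, axis=1)
    a = np.exp(-d2 / (2.0 * width ** 2))
    return a / np.sum(a)

def ngnet_value(x, centers, width, w):
    """V(x; w) = sum_k w_k * b_k(x)."""
    return w @ ngnet_features(x, centers, width)
```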

46 Pendulum swing-up I: trial setup. (Figure: pendulum of length l, driving torque T, gravity mg.) The reward was given by the height of the mass, R(x) = cos(θ); the state space was x = (θ, ω). A trial lasted 20 seconds, unless the pendulum was over-rotated, which resulted in r(t) = −1 for 1 second. A trial was counted as successful when the pendulum was up for more than 10 seconds.
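A schematic of the trial logic as described on this slide; the pendulum dynamics are not given here, so the sketch only covers the reward and termination rules, and the over-rotation threshold is a placeholder.

```python
import numpy as np

def reward(theta):
    """R(x) = cos(theta): +1 with the pendulum upright, -1 hanging down."""
    return np.cos(theta)

def trial_step(theta, t, over_rotation=4 * np.pi, t_max=20.0):
    """Return (r, done) for one step of the swing-up trial."""
    if abs(theta) > over_rotation:        # over-rotated: penalty ends the trial
        return -1.0, True                 # (the -1 is applied for 1 s in the setup)
    return reward(theta), t >= t_max      # reward by height, stop after 20 s
```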

47 Pendulum swing-up II: results. (Figure: learning performance in trials for the discrete actor-critic (DiscAC), the continuous actor-critic (ActCrit), the value-gradient policy (ValGrad), and the value-gradient policy with a physical model (PhysModel); note the different scales.) In the discrete actor-critic, the state space was evenly discretized into 30x30 boxes and the action was binary.

48 Methods of Value Function Update. (Figure: number of trials as a function of the time step Δt for the different update methods.) Various V-update methods were tested: the residual gradient, the single-step update, the exponential eligibility trace, and the time-symmetric update method. The time-symmetric update method was very unstable.

49 Action cost. Various cost coefficients were tested, as well as various reward functions and exploration noise levels. Bang-bang control (c = 0) tended to be less consistent than a small cost with sigmoid control. The binary reward function in {0, 1} made the task difficult to learn, while the negative reward function in {-1, 0} yielded better results. (Figure: trials versus the control cost coefficient and for the reward functions cos θ, {0, 1}, and {-1, 0}, with exploration noise σ_0 = 0.5 and σ_0 = 0.0.)

50 Cart-Pole Swing-Up I: a harder test. Experiment setup: physical parameters were the same as in Barto's cart-pole balancing task, but the pole had to be swung up and balanced. The task has higher dimensionality than the previous one, and a state-dependent input gain.

51 Cart-Pole Swing-Up II: results. The performances with the exact and with the learned input gains were comparable, since learning the physical model was relatively easy compared to learning the value function. (Figure: the learned value function V and input gain ∂f/∂u over the state space, and learning curves in trials for the actor-critic, value-gradient, and physical-model methods.)

52 Discussion
1. The swing-up task was accomplished by the continuous Actor-Critic 50 times faster than by the discrete Actor-Critic, despite the finer discretization of the state-action space in the latter.
2. Value-gradient based methods performed significantly better than the actor-critic.
3. Exponential eligibility traces were more efficient and stable than the Euler approximation.
4. Reward-related parameters greatly affect the speed of learning.
5. The value-gradient method worked well even when the input gain was state-dependent.

57 Bibliography
Kenji Doya, Reinforcement Learning in Continuous Time and Space, Neural Computation, vol. 12, no. 1, 2000, pp. 219-245.
Sutton, R.S. and Barto, A.G., Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
Florentin Woergoetter and Bernd Porr (2008), Reinforcement learning, Scholarpedia, 3(3):1448.
