Reinforcement Learning in Continuous Time and Space
Presentation of the paper by Kenji Doya
Leszek Rybicki, lrybicki@mat.umk.pl
18.07.2008
Reinforcement Learning: A Machine Learning Paradigm

Idea: Reinforcement Learning is a machine learning paradigm in which the agent learns by trial and error. The agent takes actions within an environment and receives a numerical reinforcement. The goal is to maximize the total reinforcement in a given task. The paper in question considers fully observable, deterministic decision processes over continuous, factored state and action spaces, in continuous time.
Example: Pendulum swing-up with limited torque

- the maximal output torque is smaller than the maximal load torque
- goal: swing the load up and keep it up for some time!

(Figure: pendulum with length l, applied torque T, gravity mg.)
Example: Cart-pole swing-up

- the cart can move both ways, with limited speed and distance
- goal: swing the load up and keep it up for some time
Basic terms

- x(t) ∈ X - state at time t
- u(t) ∈ U - action at time t
- x(t+1) = f(x(t), u(t)) - state transition function
- r(t) = r(x(t), u(t)) - reinforcement at time t
- µ : X → U - the agent's policy
- V^µ : X → R - state value function

Goal: find a policy µ* : X → U with an optimal value function V*.
Temporal Difference Learning I - TD(0)

The idea: given an estimate (prediction) of the state change function f̂(x(t), u(t)) and of the optimal state value function V̂(x(t)) ≈ V*(x(t)), employ a greedy policy of selecting the next action.

Prediction: lookup tables and function approximators can be used for f̂ and V̂.
Temporal Difference Learning II - TD(0)

TD error: given

V(x(t)) = Σ_{k=t}^∞ γ^{k-t} r(k)

a perfect prediction of V should satisfy:

V̂(x(t)) = r(t+1) + γ V̂(x(t+1))

so the prediction error is:

δ(t+1) = r(t+1) + γ V̂(t+1) - V̂(t)
Temporal Difference Learning III - TD(0)

Learning: at each step,

- observe the state change and the reinforcement
- calculate the TD error
- update the prediction of V:

V̂(x) ← V̂(x) + η δ(t+1)   if x = x(t)
V̂(x) ← V̂(x)              otherwise

This is known as TD(0).
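The update above can be sketched for a tabular value estimate (a minimal Python sketch; the states, rewards and learning rate are made up for illustration):

```python
def td0_update(V, x, x_next, r, gamma=0.9, eta=0.5):
    """One TD(0) step: delta = r + gamma*V(x_next) - V(x), then move V(x) by eta*delta."""
    delta = r + gamma * V[x_next] - V[x]
    V[x] += eta * delta
    return delta

# Hypothetical two-state example: transition "a" -> "b" with reward 1.
V = {"a": 0.0, "b": 1.0}
delta = td0_update(V, "a", "b", r=1.0)  # delta = 1 + 0.9*1 - 0 = 1.9
```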
Eligibility traces - TD(λ)

The value of state x(t-1) depends on the value of state x(t) and all future states, so any update of the value estimate V̂(x(t)) should also be reflected in the estimates of V̂(x(t-1)) and all past states, in a discounted manner:

V̂(x(t)) ← V̂(x(t)) + η δ(t_0) λ^{t_0 - t}   for t < t_0
V̂(x(t)) ← V̂(x(t))                          otherwise

where λ is a discounting parameter. This is known as TD(λ).
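A tabular TD(λ) step with accumulating traces might look as follows (my own sketch, not code from the paper; the trace dict implements the discounted backward updates):

```python
def td_lambda_step(V, e, x, x_next, r, gamma=0.9, lam=0.8, eta=0.5):
    """Update all previously visited states in proportion to their eligibility."""
    delta = r + gamma * V[x_next] - V[x]
    e[x] = e.get(x, 0.0) + 1.0          # mark the current state as eligible
    for s in e:
        V[s] += eta * delta * e[s]      # discounted backward update
        e[s] *= gamma * lam             # decay all traces
    return delta

V = {"a": 0.0, "b": 0.0}
e = {}
delta = td_lambda_step(V, e, "a", "b", r=1.0)  # delta = 1 + 0.9*0 - 0 = 1.0
```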
Actor-Critic method - Exploring the space

Sometimes the learning system is split into two parts: the critic, which maintains the state value estimate V̂, and the actor, which is responsible for choosing the appropriate action in each state. The actor's policy is modified using a signal from the critic. This is known as the Actor-Critic method.
Problems with the discrete approach

1. Many real-world problems are continuous.
2. Coarse discretization leads to jumpy actions...
3. ...and can make the algorithm non-convergent.
4. Fine discretization leads to high dimensionality...
5. ...delays the reward by many steps...
6. ...and requires more trials.
Discrete and continuous TD - Comparison

Discrete:
- x(t) ∈ X, u(t) ∈ U
- V(x(t)) = Σ_{k=t}^∞ γ^{k-t} r(k)
- δ(t+1) = r(t+1) + γ V̂(t+1) - V̂(t)

Continuous:
- x(t) ∈ R^n, u(t) ∈ R^m
- V(x(t)) = ∫_t^∞ e^{-(s-t)/τ} r(x(s), u(s)) ds
- δ(t) = r(t) - (1/τ) V̂(t) + dV̂(t)/dt

where τ is the time constant for discounting future rewards, the continuous counterpart of γ.
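The correspondence between the two formulations can be checked numerically: for a constant reward r ≡ 1 the continuous value is ∫_t^∞ e^{-(s-t)/τ} ds = τ, and the discrete sum with γ = e^{-Δt/τ} approaches it as Δt → 0 (a sanity-check sketch, assuming this exponential correspondence between γ and τ):

```python
import math

tau, dt = 1.0, 0.001
gamma = math.exp(-dt / tau)            # discrete discount matching time constant tau
discrete = dt * sum(gamma**k for k in range(20000))  # dt * sum_k gamma^k * r, r = 1
continuous = tau                       # integral of e^{-(s-t)/tau} * 1 over s >= t
```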
Continuous Value Function - Time-symmetric method

Let V̂(x; w) be an approximator of the value function with parameters w, and let E(t) = (1/2) δ(t)² be the objective function to minimize. Using the chain rule, dV̂(t)/dt = (∂V̂/∂x) ẋ(t), we have:

∂E(t)/∂w_i = δ(t) ∂δ(t)/∂w_i
           = δ(t) [ -(1/τ) ∂V̂(x; w)/∂w_i + ∂/∂w_i ( (∂V̂(x; w)/∂x) ẋ(t) ) ]

giving us a gradient descent algorithm for the parameters w_i.
Backward Euler Differentiation - Residual Gradient and TD(0)

Substituting dV̂(t)/dt ≈ (V̂(t) - V̂(t-Δt))/Δt, we have:

δ(t) = r(t) + (1/Δt) [ (1 - Δt/τ) V̂(t) - V̂(t-Δt) ]

giving the gradient with respect to parameter w_i:

∂E(t)/∂w_i = δ(t) (1/Δt) [ (1 - Δt/τ) ∂V̂(x(t); w)/∂w_i - ∂V̂(x(t-Δt); w)/∂w_i ]

Alternatively, we can use only the V̂(t-Δt) term for the parameter update.
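The backward-Euler TD error is a one-liner; a small sketch (parameter values are illustrative):

```python
def continuous_td_error(r_t, v_t, v_prev, dt=0.02, tau=1.0):
    """delta(t) = r(t) + ((1 - dt/tau) * V(t) - V(t - dt)) / dt"""
    return r_t + ((1.0 - dt / tau) * v_t - v_prev) / dt

# The error vanishes when the prediction is self-consistent:
# V(t - dt) = (1 - dt/tau) * V(t) + dt * r(t).
delta = continuous_td_error(r_t=1.0, v_t=2.0, v_prev=(1.0 - 0.02) * 2.0 + 0.02 * 1.0)
```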
Exponential Eligibility Trace I - TD(λ)

Assume that at time t = t_0 an impulse reward is given, making the temporal profile of the value function:

V^µ(t) = e^{-(t_0 - t)/τ}   for t < t_0
V^µ(t) = 0                  for t > t_0

thus:

ẇ_i = η δ(t_0) ∫_{-∞}^{t_0} e^{-(t_0 - t)/τ} (∂V̂(x(t); w)/∂w_i) dt

(Figure: panels A-D show the impulse reward r(t), the resulting value profile V(t), and the corresponding updates of V̂(t).)
Exponential Eligibility Trace II - TD(λ)

Eligibility traces: let

e_i(t) = ∫_{-∞}^{t} e^{-(t-s)/κ} (∂V̂(x(s); w)/∂w_i) ds

be the eligibility trace for parameter w_i. We have:

ẇ_i = η δ(t) e_i(t)
κ ė_i(t) = -e_i(t) + ∂V̂(x(t); w)/∂w_i

where 0 < κ ≤ τ is the time constant of the eligibility trace.
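Discretizing the trace dynamics with a forward Euler step gives a simple recurrence (my own sketch; the κ and Δt values are illustrative):

```python
def trace_step(e_i, grad_i, dt=0.02, kappa=0.5):
    """Euler step of kappa * de_i/dt = -e_i + dV/dw_i."""
    return e_i + (dt / kappa) * (-e_i + grad_i)

e = 0.0
e = trace_step(e, grad_i=1.0)        # the trace charges toward the gradient...
fixed = trace_step(1.0, grad_i=1.0)  # ...and e_i = dV/dw_i is a fixed point
```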
Improving the policy - Exploring the possibilities

One way to improve the policy is the Actor-Critic method; another is to take a greedy, value-gradient based policy with respect to the current value function:

u(t) = µ(x(t)) = argmax_{u ∈ U} [ r(x(t), u) + (∂V̂(x)/∂x) f(x(t), u) ]

using the knowledge of the reward r and the system dynamics f.
Continuous Actor-Critic - Stochastic exploration

Let A(x(t); w^A) ∈ R^m be a function approximator with parameters w^A. We can thus have a stochastic actor that employs the policy:

u(t) = s( A(x(t); w^A) + σ n(t) )

where s is a monotonically increasing function and n(t) is noise. The actor's parameters are updated by:

ẇ^A_i = η^A δ(t) n(t) ∂A(x(t); w^A)/∂w^A_i
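One discretized step of this actor could look like this (my own sketch; tanh stands in for the monotonic function s, and all numbers are illustrative):

```python
import math
import random

def actor_step(wA, grads, delta, noise, eta_A=0.1):
    """w^A_i += eta_A * delta * n(t) * dA/dw^A_i for every parameter."""
    return [w + eta_A * delta * noise * g for w, g in zip(wA, grads)]

def act(A_x, noise, sigma=0.1):
    """u(t) = s(A(x; w^A) + sigma * n(t)), with tanh as an example squashing s."""
    return math.tanh(A_x + sigma * noise)

wA = actor_step([0.0, 0.0], grads=[1.0, 2.0], delta=1.0, noise=0.5)
u = act(0.3, random.gauss(0.0, 1.0))
```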
Value-Gradient Based Policy - Heuristic approach

In discrete problems, we can simply search the finite set of actions for the optimal one. Here, the set of actions is infinite. Let us assume:

r(x, u) = R(x) - S(u)

- S(u) = Σ_{j=1}^m S_j(u_j) - the cost of the action
- R(x) - the state-dependent part of the reinforcement
Cost of the greedy policy - Given a world model

For the greedy policy it holds that:

S'_j(u_j) = (∂V(x)/∂x) (∂f(x, u)/∂u_j)

where ∂f(x, u)/∂u is the input gain matrix of the system dynamics. Assuming that the matrix does not depend on u and that S_j is convex, the equation has the solution:

u_j = (S'_j)^{-1} ( (∂V(x)/∂x) (∂f(x, u)/∂u_j) )
Feedback control - With a sigmoid output function

A common constraint is that the amplitude of the action u is bounded by u^max. We incorporate the constraint into the action cost:

S_j(u_j) = c_j ∫_0^{u_j} s^{-1}(u / u^max_j) du

where s is a sigmoid function. This results in the following policy:

u_j = u^max_j s( (1/c_j) (∂f(x, u)/∂u_j)^T (∂V(x)/∂x)^T )
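With the sigmoid from the simulations section, s(x) = (2/π) arctan((π/2) x), the resulting bounded controller is short (a sketch for a single action component; the gradient vectors here are placeholders):

```python
import math

def s(x):
    """Sigmoid used in the paper's simulations: s(x) = (2/pi) * atan((pi/2) * x)."""
    return (2.0 / math.pi) * math.atan((math.pi / 2.0) * x)

def greedy_action(dV_dx, df_du_j, u_max=1.0, c_j=0.1):
    """u_j = u_max * s((1/c_j) * (df/du_j)^T (dV/dx)^T); stays in (-u_max, u_max)."""
    z = sum(a * b for a, b in zip(df_du_j, dV_dx)) / c_j
    return u_max * s(z)

u = greedy_action(dV_dx=[0.5, -0.2], df_du_j=[1.0, 0.0])
```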
Advantage updating - When the model is not available

When the world model f is not available, we can simultaneously update the value function and an advantage function A(x, u):

A*(x, u) = r(x, u) - (1/τ) V*(x) + (∂V*(x)/∂x) f(x, u)

where max_u [A*(x, u)] = 0 for the optimal action u*. The update rule drives Â(x(t), u(t)) - max_u [Â(x(t), u)] toward the TD error δ(t). The policy then is to select the action u for which the advantage function is maximal.
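When actions are drawn from a finite candidate set, this model-free policy reduces to an argmax over the advantage estimate (a trivial sketch; the table and states are hypothetical):

```python
def advantage_policy(A_est, x, actions):
    """Pick the action with the largest estimated advantage A(x, u)."""
    return max(actions, key=lambda u: A_est[(x, u)])

# Hypothetical advantage table for one state; max_u A(x, u) = 0 for the best action.
A_est = {("x0", -1.0): -0.3, ("x0", 0.0): -0.1, ("x0", 1.0): 0.0}
u = advantage_policy(A_est, "x0", [-1.0, 0.0, 1.0])
```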
Simulations - Compared methods

Four algorithms were compared:
- discrete actor-critic
- continuous actor-critic
- value-gradient based policy with a world model
- value-gradient based policy with learning of the input gain matrix

Value and policy functions were approximated by normalized Gaussian networks; the sigmoid function used was s(x) = (2/π) arctan((π/2) x).
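A one-dimensional normalized Gaussian network takes only a few lines; its basis functions form a partition of unity, so equal weights reproduce a constant function (a sketch; the centers and widths are made up):

```python
import math

def ngnet(x, centers, widths, weights):
    """V(x) = sum_k w_k * b_k(x) with b_k = a_k / sum_j a_j, a_k Gaussian."""
    a = [math.exp(-(((x - c) / s) ** 2)) for c, s in zip(centers, widths)]
    total = sum(a)
    return sum(w * ak / total for w, ak in zip(weights, a))

# Equal weights: normalization makes the output exactly that constant.
v = ngnet(0.3, centers=[0.0, 1.0], widths=[1.0, 1.0], weights=[2.0, 2.0])
```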
Pendulum swing-up I - Trial setup

- the reward was given by the height of the mass: R(x) = cos θ
- state space: x = (θ, ω)
- a trial lasted 20 seconds, unless the pendulum was over-rotated, which ended the trial with r(t) = -1 for 1 second
- a trial was successful when the pendulum was up for more than 10 seconds

(Figure: pendulum with length l, torque T, gravity mg.)
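A minimal simulation of this setup (a generic pendulum model with θ measured from upright, so R(x) = cos θ is +1 up and -1 down; the physical constants are illustrative, not the paper's):

```python
import math

def pendulum_step(theta, omega, torque, dt=0.02, m=1.0, l=1.0, g=9.8, mu=0.01):
    """One Euler step of m*l^2 * theta'' = -mu*omega + m*g*l*sin(theta) + torque."""
    domega = (-mu * omega + m * g * l * math.sin(theta) + torque) / (m * l ** 2)
    theta_new = theta + dt * omega
    omega_new = omega + dt * domega
    return theta_new, omega_new, math.cos(theta_new)  # last element is R(x)

_, _, r_up = pendulum_step(0.0, 0.0, 0.0)        # resting upright
_, _, r_down = pendulum_step(math.pi, 0.0, 0.0)  # hanging down
```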
Pendulum swing-up II - Results

In the discrete actor-critic, the state space was evenly discretized into 30×30 boxes and the action was binary.

(Figure: number of trials needed for DiscAC, ActCrit, ValGrad and PhysModel; note the different scales.)
Methods of Value Function Update

Various V̂-update methods were tested:
- residual gradient
- single step
- eligibility trace
- exponential eligibility
- time-symmetric update method

The time-symmetric update method was very unstable.

(Figure: trials needed for each method over different values of Δt and κ.)
Action cost

Various cost coefficients were tested, as well as various reward functions and exploration noise levels. Bang-bang control (c = 0) tended to be less consistent than small costs with sigmoid control. The binary reward function made the task difficult to learn; a negative reward function yielded better results.

(Figure: trials needed for various control cost coefficients, and for reward functions cos θ, {0, 1}, {-1, 0} with σ_0 = 0.5 and σ_0 = 0.0.)
Cart-Pole Swing-Up I - A harder test

Experiment setup: the physical parameters were the same as in Barto's cart-pole balancing task, but the pole had to be swung up and balanced. The task has a higher dimensionality than the previous one, and the input gain is state-dependent.
Cart-Pole Swing-Up II - A harder test

Results: the performances with the exact and learned input gains were comparable, since learning the physical model was relatively easy compared to learning the value function.

(Figure: the learned value function V̂ and input gain ∂f/∂u over the (x, θ) space; trials needed for ActorCritic, ValueGrad and PhysModel.)
Discussion

1. The swing-up task was accomplished by the continuous Actor-Critic 50 times faster than by the discrete Actor-Critic, despite the finer discretization of the latter's state-action space.
2. Value-gradient based methods performed significantly better than the actor-critic.
3. Exponential eligibility was more efficient and stable than the Euler approximation.
4. Reward-related parameters greatly affect the speed of learning.
5. The value-gradient method worked well even when the input gain was state-dependent.
Bibliography

- Kenji Doya, Reinforcement Learning in Continuous Time and Space, Neural Computation, vol. 12, no. 1, 2000, pp. 219-245
- Sutton, R. S. and Barto, A. G., Reinforcement Learning: An Introduction, MIT Press, 1998, Cambridge, MA
- Florentin Woergoetter and Bernd Porr (2008), Reinforcement learning, Scholarpedia, 3(3):1448