Reinforcement Learning In Continuous Time and Space


Reinforcement Learning In Continuous Time and Space
Presentation of the paper by Kenji Doya
Leszek Rybicki, lrybicki@mat.umk.pl, 18.07.2008

Reinforcement Learning: A Machine Learning Paradigm
Idea: Reinforcement Learning is a machine learning paradigm in which the agent learns by trial and error. The agent takes actions within an environment and receives a numerical reinforcement. The goal is to maximize the total reinforcement obtained in a given task.
The paper in question considers fully observable, deterministic decision processes over continuous, factored state and action spaces, in continuous time.

Example: Pendulum swing-up with limited torque
The maximal output torque is smaller than the maximal load torque.
Goal: swing the load up and keep it up for some time.
(Figure: pendulum of length $l$ with torque $T$ applied at the pivot and gravity force $mg$ acting on the mass.)

Example: Cart-pole swing-up
The cart can move both ways, with limited speed and distance.
Goal: swing the load up and keep it up for some time.

Basic terms
$x(t) \in X$: state at time $t$
$u(t) \in U$: action at time $t$
$x(t+1) = f(x(t), u(t))$: state change function
$r(t) = r(x(t), u(t))$: reinforcement at time $t$
$\mu : X \to U$: the agent's policy
$V^{\mu} : X \to \mathbb{R}$: state value function
Goal: find a policy $\mu^* : X \to U$ with an optimal value function $V^*$.

Temporal Difference Learning I: TD(0)
The idea: given an estimate (prediction) of the state change function $f(x(t), u(t))$ and of the optimal state value function, $\hat{V}(x(t)) \approx V^*(x(t))$, employ a greedy policy for selecting the next action.
Prediction: lookup tables and function approximators can be used for $f$ and $V$.

Temporal Difference Learning II: TD(0)
TD error. Given
$V(x(t)) = \sum_{k=t}^{\infty} \gamma^{k-t} r(k)$,
a perfect prediction of $V$ should satisfy
$V(x(t)) = r(t+1) + \gamma V(x(t+1))$,
so the prediction error is
$\delta(t+1) = r(t+1) + \gamma V(x(t+1)) - V(x(t))$.

Temporal Difference Learning III: TD(0)
Learning. At each step: observe the state change and the reinforcement, calculate the TD error, and update the prediction of $V$:
$V(x) \leftarrow V(x) + \eta\, \delta(t+1)$ if $x = x(t)$, and $V(x)$ unchanged otherwise,
where $\eta$ is a learning rate. This is known as TD(0).
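For concreteness, a minimal tabular TD(0) sketch in Python follows. This is not the paper's code: the environment interface (reset, step, sample_action), the discount $\gamma$, and the learning rate $\eta$ are assumptions made for the example.

```python
def td0_episode(env, V, gamma=0.95, eta=0.1, policy=None, max_steps=1000):
    """One episode of tabular TD(0): V(x) <- V(x) + eta * delta,
    delta = r' + gamma * V(x') - V(x).

    Assumes (hypothetical interface): env.reset() -> state index,
    env.step(a) -> (next state index, reward, done), env.sample_action() -> action;
    V is a 1-D array (or list) of value estimates indexed by state."""
    x = env.reset()
    for _ in range(max_steps):
        a = policy(x) if policy is not None else env.sample_action()
        x_next, r, done = env.step(a)
        target = r if done else r + gamma * V[x_next]
        delta = target - V[x]      # TD error for the visited state
        V[x] += eta * delta        # update only the visited state x = x(t)
        x = x_next
        if done:
            break
    return V
```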

Eligibility traces: TD($\lambda$)
The value of state $x(t-1)$ depends on the value of state $x(t)$ and all future states, so any update of the value estimate $V(x(t))$ should also be reflected in the estimates of $V(x(t-1))$ and all past states, in a discounted manner:
$V(x(t)) \leftarrow V(x(t)) + \eta\, \delta(t_0)\, \lambda^{t_0 - t}$ for $t < t_0$, and $V(x(t))$ unchanged otherwise,
where $\lambda$ is a discounting parameter. This is known as TD($\lambda$).
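The same discounted propagation of the TD error to past states can be implemented online with an explicit per-state trace. A tabular sketch follows, again with an assumed environment interface and assumed parameter values (accumulating traces decaying by $\gamma\lambda$).

```python
import numpy as np

def td_lambda_episode(env, V, gamma=0.95, lam=0.8, eta=0.1, policy=None, max_steps=1000):
    """Tabular TD(lambda): every visited state keeps a trace e(x) that decays by
    gamma*lambda, so one TD error updates the current state and, discounted, all
    past states. V is a 1-D numpy array; env follows the same assumed interface."""
    e = np.zeros_like(V)               # eligibility trace per state
    x = env.reset()
    for _ in range(max_steps):
        a = policy(x) if policy is not None else env.sample_action()
        x_next, r, done = env.step(a)
        delta = (r if done else r + gamma * V[x_next]) - V[x]
        e[x] += 1.0                    # mark the current state as eligible
        V += eta * delta * e           # propagate the TD error to all eligible states
        e *= gamma * lam               # decay the traces of past states
        x = x_next
        if done:
            break
    return V
```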

Actor-Critic method: exploring the space
Sometimes the learning system is split into two parts: the critic, which maintains the state value estimate $V$, and the actor, which is responsible for choosing the appropriate action in each state. The actor's policy is modified using a signal from the critic. This is known as the Actor-Critic method.

Problems with the discrete approach
1. Many real-world problems are continuous.
2. Coarse discretization leads to jumpy actions.
3. This can make the algorithm non-convergent.
4. Fine discretization leads to high dimensionality.
5. The reward is delayed over many more steps.
6. More trials are needed.

Discrete and continuous TD: comparison
Discrete: $x(t) \in X$, $u(t) \in U$,
$V(x(t)) = \sum_{k=t}^{\infty} \gamma^{k-t} r(k)$,
$\delta(t) = r(t+1) + \gamma V(t+1) - V(t)$.
Continuous: $x(t) \in \mathbb{R}^n$, $u(t) \in \mathbb{R}^m$,
$V(x(t)) = \int_{t}^{\infty} e^{-\frac{s-t}{\tau}}\, r(x(s), u(s))\, ds$,
$\delta(t) = r(t) - \frac{1}{\tau} V(t) + \dot{V}(t)$.
$\tau$: time constant for discounting future rewards, the continuous-time counterpart of $\gamma$ (for time step $\Delta t$, $\gamma = 1 - \Delta t / \tau$).

Continuous Value Function: time-symmetric method
Let $V(x; w)$ be an approximator of the value function with parameters $w$, and let $E(t) = \frac{1}{2}\,\delta(t)^2$ be the objective function to minimize. Using the chain rule $\dot{V}(t) = \frac{\partial V}{\partial x}\, \dot{x}(t)$, we have
$\frac{\partial E(t)}{\partial w_i} = \delta(t)\, \frac{\partial \delta(t)}{\partial w_i} = \delta(t)\, \frac{\partial}{\partial w_i}\!\left[ r(t) - \frac{1}{\tau} V(t) + \dot{V}(t) \right]$
$\qquad = \delta(t)\left[ -\frac{1}{\tau}\, \frac{\partial V(x; w)}{\partial w_i} + \frac{\partial}{\partial w_i}\!\left( \frac{\partial V(x; w)}{\partial x}\, \dot{x}(t) \right) \right]$,
giving us a gradient descent algorithm for the parameters $w_i$.

Backward Euler Differentiation: Residual Gradient and TD(0)
By substituting $\dot{V}(t) = (V(t) - V(t - \Delta t))/\Delta t$ we have
$\delta(t) = r(t) + \frac{1}{\Delta t}\left[ \left(1 - \frac{\Delta t}{\tau}\right) V(t) - V(t - \Delta t) \right]$,
giving the gradient with respect to parameter $w_i$:
$\frac{\partial E(t)}{\partial w_i} = \delta(t)\, \frac{1}{\Delta t}\left[ \left(1 - \frac{\Delta t}{\tau}\right) \frac{\partial V(x(t); w)}{\partial w_i} - \frac{\partial V(x(t - \Delta t); w)}{\partial w_i} \right]$.
Using only the $V(t - \Delta t)$ term for the parameter update gives a TD(0)-style rule; using both terms gives the residual-gradient method.
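A sketch of this Euler-discretized update for a value approximator that is linear in some feature vector $\phi(x)$; the feature map, the time step, $\tau$, and $\eta$ are assumptions, and only the $V(t - \Delta t)$ gradient is used, i.e. the TD(0)-style variant discussed above.

```python
import numpy as np

def continuous_td0_step(w, phi_prev, phi_now, r_now, dt=0.02, tau=1.0, eta=0.1):
    """One update of V(x; w) = w . phi(x) with the backward-Euler TD error
    delta = r(t) + ((1 - dt/tau) * V(t) - V(t - dt)) / dt.
    Only the gradient through V(t - dt), i.e. phi_prev, is used here;
    the residual-gradient variant would also include -(1 - dt/tau) * phi_now."""
    v_prev = float(np.dot(w, phi_prev))
    v_now = float(np.dot(w, phi_now))
    delta = r_now + ((1.0 - dt / tau) * v_now - v_prev) / dt
    w = w + eta * delta * phi_prev    # learning rate eta absorbs the 1/dt factor
    return w, delta
```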

Exponential Eligibility Trace I: TD($\lambda$)
Impulse reward. Assume that at time $t = t_0$ an impulse reward is given, making the temporal profile of the value function
$V^{\mu}(t) = e^{-\frac{t_0 - t}{\tau}}$ for $t < t_0$, and $0$ for $t > t_0$,
thus
$\dot{w}_i = \eta\, \delta(t_0) \int_{-\infty}^{t_0} e^{-\frac{t_0 - t}{\tau}}\, \frac{\partial V(x(t); w)}{\partial w_i}\, dt$.
(Figure, panels A-D: the impulse reward $r(t)$, the resulting value profile $V(t)$, and the corresponding updates of $\hat{V}(t)$ around $t_0$.)

Exponential Eligibility Trace II: TD($\lambda$)
Eligibility traces. Let
$e_i = \int_{-\infty}^{t_0} e^{-\frac{t_0 - t}{\tau}}\, \frac{\partial V(x(t); w)}{\partial w_i}\, dt$
be the eligibility trace for parameter $w_i$. We have
$\dot{w}_i = \eta\, \delta(t)\, e_i(t)$, $\qquad \kappa\, \dot{e}_i(t) = -e_i(t) + \frac{\partial V(x(t); w)}{\partial w_i}$,
where $0 < \kappa \le \tau$ is the time constant of the eligibility trace.
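A sketch of these trace equations, Euler-discretized with time step $\Delta t$ for a linear approximator $V(x; w) = w \cdot \phi(x)$; the discretization, the feature map, and the parameter values are assumptions for illustration.

```python
import numpy as np

def eligibility_trace_step(w, e, phi_now, delta, dt=0.02, kappa=0.5, eta=0.1):
    """Euler step of  kappa * de_i/dt = -e_i + dV/dw_i  and  dw_i/dt = eta * delta * e_i,
    for a value approximator V(x; w) = w . phi(x), so dV/dw = phi(x)."""
    e = e + (dt / kappa) * (phi_now - e)   # exponentially decaying trace of past value gradients
    w = w + dt * eta * delta * e           # weight change driven by the TD error times the trace
    return w, e

# Usage idea: w = np.zeros(num_features); e = np.zeros(num_features);
# call once per simulation step with the current TD error delta and features phi(x(t)).
```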

Improving the policy: exploring the possibilities
One way to improve the policy is the Actor-Critic method; another is to take a greedy, value-gradient based policy with respect to the current value function,
$u(t) = \mu(x(t)) = \arg\max_{u \in U}\left[ r(x(t), u) + \frac{\partial V(x)}{\partial x}\, f(x(t), u) \right]$,
using knowledge of the reward function $r$ and the system dynamics $f$.

Continuous Actor-Critic: stochastic exploration
Let $A(x(t); w^A) \in \mathbb{R}^m$ be a function approximator with parameters $w^A$. We can then have a stochastic actor that employs the policy
$u(t) = s\big(A(x(t); w^A) + \sigma\, n(t)\big)$,
where $s$ is a monotonically increasing function and $n(t)$ is noise. The actor's parameters are updated by
$\dot{w}^A_i = \eta^A\, \delta(t)\, n(t)\, \frac{\partial A(x(t); w^A)}{\partial w^A_i}$.
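A sketch of this stochastic actor for a one-dimensional action with a linear actor $A(x; w^A) = w^A \cdot \phi(x)$. The plain Gaussian noise (without the low-pass filtering typically used in practice), the tanh output function, and the step sizes are assumptions made for the example.

```python
import numpy as np

def act(wA, phi_now, sigma=0.5, u_max=1.0, rng=np.random):
    """Stochastic actor: u = u_max * s(A(x; wA) + sigma * n), with A(x; wA) = wA . phi(x).
    A tanh is used for the output function s here; noise filtering is omitted."""
    noise = sigma * rng.randn()
    u = u_max * np.tanh(float(np.dot(wA, phi_now)) + noise)
    return u, noise

def actor_step(wA, phi_now, delta, noise, dt=0.02, eta_A=0.1):
    """dwA_i/dt = eta_A * delta * n(t) * dA/dwA_i, with dA/dwA = phi(x):
    exploration noise that led to a positive TD error is reinforced."""
    return wA + dt * eta_A * delta * noise * phi_now
```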

Value-Gradient Based Policy: heuristic approach
In discrete problems, we can simply search the finite set of actions for the optimal one. Here, the set of actions is infinite. Let us assume
$r(x, u) = R(x) - S(u)$,
where $S(u) = \sum_{j=1}^{m} S_j(u_j)$ is the cost of the action and $R(x)$ is the unknown part of the reinforcement.

Cost of greedy policy: given a world model
For the greedy policy it holds that
$S_j'(u_j) = \frac{\partial V(x)}{\partial x}\, \frac{\partial f(x, u)}{\partial u_j}$,
where $\frac{\partial f(x, u)}{\partial u}$ is the input gain matrix of the system dynamics. Assuming that the matrix does not depend on $u$ and that $S_j$ is convex, the equation has the solution
$u_j = S_j'^{-1}\!\left( \frac{\partial V(x)}{\partial x}\, \frac{\partial f(x, u)}{\partial u_j} \right)$.

Feedback control with a sigmoid output function
A common constraint is that the amplitude of the action $u$ is bounded by $u^{max}$. We incorporate the constraint into the action cost:
$S_j(u_j) = c_j \int_0^{u_j} s^{-1}\!\left(\frac{u}{u_j^{max}}\right) du$,
where $s$ is a sigmoid function. This results in the following policy:
$u_j = u_j^{max}\, s\!\left( \frac{1}{c_j}\, \frac{\partial f(x, u)}{\partial u_j}^{T} \frac{\partial V(x)}{\partial x}^{T} \right)$.
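A sketch of this bounded greedy control law for a scalar action, assuming the value gradient $\partial V/\partial x$ and the input gain $\partial f/\partial u$ are available as functions. The function names and the cost coefficient are placeholders, and the sigmoid is the arctan-based one quoted later on the simulations slide.

```python
import numpy as np

def greedy_bounded_action(x, dVdx, dfdu, c=0.1, u_max=1.0):
    """u = u_max * s((1/c) * (df/du)^T (dV/dx)^T) for a scalar action, with
    s(z) = (2/pi) * arctan((pi/2) * z).
    dVdx(x) and dfdu(x) are assumed to return arrays of shape (n,)."""
    s = lambda z: (2.0 / np.pi) * np.arctan((np.pi / 2.0) * z)
    return u_max * s(float(np.dot(dfdu(x), dVdx(x))) / c)
```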

Advantage updating: when the model is not available
When the world model $f$ is not available, we can simultaneously update the value function and an advantage function $A(x, u)$:
$A^*(x, u) = r(x, u) - \frac{1}{\tau} V^*(x) + \frac{\partial V^*(x)}{\partial x}\, f(x, u)$,
where $\max_u [A(x, u)] = 0$ for the optimal action $u$. The update rule adjusts $A$ so that $A(x, u) - \max_u [A(x, u)]$ matches the TD error $\delta(t)$ computed from $r(x, u)$ and the value estimate.
The policy then is to select the action $u$ for which the advantage function is maximal.

Simulations: compared methods
Four algorithms were compared:
discrete actor-critic,
continuous actor-critic,
value-gradient based policy with a world model,
value-gradient based policy with learning of the input gain matrix.
Value and policy functions were approximated by normalized Gaussian networks; the sigmoid function used was $s(x) = \frac{2}{\pi} \arctan\!\left(\frac{\pi}{2} x\right)$.
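As a rough illustration of such an approximator, a normalized Gaussian feature map with a linear readout is sketched below; the placement of centers, the widths, and the grid size are assumptions, not the paper's settings.

```python
import numpy as np

def ngnet_features(x, centers, widths):
    """Normalized Gaussian basis functions:
    b_k(x) = a_k(x) / sum_l a_l(x), with a_k(x) = exp(-||(x - c_k) / s_k||^2).
    centers: (K, n) array of basis centers; widths: (n,) or (K, n) per-dimension widths."""
    a = np.exp(-np.sum(((np.asarray(x) - centers) / widths) ** 2, axis=1))
    return a / np.sum(a)

def value(x, w, centers, widths):
    """Linear readout V(x; w) = sum_k w_k * b_k(x); the actor A(x; wA) can use the same features."""
    return float(np.dot(w, ngnet_features(x, centers, widths)))
```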

Pendulum swing-up I: trial setup
The reward was given by the height of the mass: $R(x) = \cos(\theta)$.
State space: $x = (\theta, \omega)$.
A trial lasted 20 seconds, unless the pendulum was over-rotated, which resulted in $r(t) = -1$ for 1 second.
A trial was successful when the pendulum was up for more than 10 seconds.
(Figure: pendulum of length $l$ with torque $T$ and gravity force $mg$.)
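A sketch of a pendulum simulator compatible with this setup; the physical constants (m, l, g, friction mu), the torque limit, and the integration step are assumptions and not necessarily the values used in the paper.

```python
import numpy as np

def pendulum_step(theta, omega, T, dt=0.02, m=1.0, l=1.0, g=9.81, mu=0.01, T_max=5.0):
    """One Euler step of  m*l^2 * domega/dt = -mu*omega + m*g*l*sin(theta) + T,
    with the torque clipped to [-T_max, T_max] (limited torque).
    theta is taken here as the angle from the upright position, so R(x) = cos(theta) is 1 when up."""
    T = float(np.clip(T, -T_max, T_max))
    domega = (-mu * omega + m * g * l * np.sin(theta) + T) / (m * l ** 2)
    omega = omega + dt * domega
    theta = theta + dt * omega
    theta = (theta + np.pi) % (2 * np.pi) - np.pi   # wrap the angle into (-pi, pi]
    return theta, omega, np.cos(theta)              # next state (theta, omega) and reward R(x)
```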

Pendulum swing-up II: results
In the discrete actor-critic, the state space was evenly discretized into 30x30 boxes and the action was binary.
(Figure: number of trials needed for DiscAC, ActCrit, ValGrad, and PhysModel; note different scales.)

Methods of Value Function Update
Various $\hat{V}$-update methods were tested: residual gradient, single step, exponential eligibility trace, and the time-symmetric update method. The time-symmetric update method was very unstable.
(Figure: trials to success for each update method as a function of the time step $\Delta t$ and of $\kappa$.)

Action cost
Various cost coefficients were tested, as well as various reward functions and exploration noise levels. Bang-bang control ($c = 0$) tended to be less consistent than small costs with sigmoid control. The binary reward function made the task difficult to learn; a negative reward function yielded better results.
(Figure: trials to success for control cost coefficients 0, 0.01, 0.1, 1.0 and for reward functions $\cos\theta$, $\{0, 1\}$, $\{-1, 0\}$, with exploration noise $\sigma_0 = 0.5$ and $\sigma_0 = 0.0$.)

Cart-Pole Swing-Up I: a harder test
Experiment setup: the physical parameters were the same as in Barto's cart-pole balancing task, but the pole had to be swung up and balanced. Higher dimensionality than the previous task, and a state-dependent input gain.

Cart-Pole Swing-Up II: a harder test
Results: the performances with the exact and the learned input gains were comparable, since learning the physical model was relatively easy compared to learning the value function.
(Figures: the learned value function $V$ and input gain $\partial f/\partial u$ over the state space, and the number of trials for ActorCritic, ValueGrad, and PhysModel.)

Discussion
1. The swing-up task was accomplished by the continuous Actor-Critic 50 times faster than by the discrete Actor-Critic, despite the finer discretization of the state-action space of the latter.
2. Value-gradient based methods performed significantly better than the actor-critic.
3. Exponential eligibility traces were more efficient and stable than the Euler approximation.
4. Reward-related parameters greatly affect the speed of learning.
5. The value-gradient method worked well even when the input gain was state-dependent.

Bibliography
Kenji Doya, Reinforcement Learning in Continuous Time and Space, Neural Computation, vol. 12, no. 1, 2000, pp. 219-245.
Sutton, R.S. and Barto, A.G., Reinforcement Learning: An Introduction, MIT Press, Cambridge, MA, 1998.
Florentin Woergoetter and Bernd Porr (2008), Reinforcement learning, Scholarpedia, 3(3):1448.