Replacing eligibility trace for action-value learning with function approximation

Kary FRÄMLING
Helsinki University of Technology, PL 5500, FI-02015 TKK, Finland

Abstract. The eligibility trace is one of the most used mechanisms for speeding up reinforcement learning. Earlier reported experiments indicate that replacing eligibility traces perform better than accumulating eligibility traces. However, replacing traces are currently not applicable with function approximation methods in which states are not represented uniquely by binary features. This paper proposes two modifications to replacing traces that overcome this limitation. Experimental results from the Mountain-Car task indicate that the new replacing trace outperforms both the accumulating and the ordinary replacing trace.

1 Introduction

Eligibility traces are an essential element of reinforcement learning because they can speed up learning significantly, especially in tasks with delayed reward. Two alternative implementations of eligibility traces exist: the accumulating eligibility trace and the replacing eligibility trace. As reported in [4], the replacing trace seems to perform better than the accumulating trace. Unfortunately, current replacing traces cannot be used with continuous-valued function approximation methods for action-value learning, because traces of states and actions other than the intended ones may be affected [5, p. 212].

In this paper, we propose two modifications to replacing eligibility traces that make it possible to use them with continuous-valued function approximation. The modified replacing eligibility trace is compared with accumulating and ordinary replacing traces. The experimental task is the same as in [4], i.e. the well-known Mountain-Car task. Experiments are performed using a grid discretisation (i.e. a lookup table), a CMAC discretisation and a Radial Basis Function (RBF) approximator.

After this introduction, section 2 describes the background and theory of eligibility traces both for discrete and for continuous state spaces, including the use of function approximators. Section 2 also explains the modifications to replacing eligibility traces proposed in this paper, followed by experimental results in section 3 and conclusions in section 4.

2 Eligibility traces for action-value learning

Most current RL methods are based on the notion of value functions. Value functions represent either state-values (i.e. the value of a state) or action-values (i.e. the value of taking an action in a given state, usually called the Q-value).

Action-value learning is typically needed in control tasks. The SARSA method [3] is currently perhaps the most popular method for learning Q-values. SARSA is a so-called on-policy method, which makes eligibility traces easier to handle than in off-policy methods such as Q-learning. In SARSA, Q-values are updated according to

    Q_{t+1}(s, a) = Q_t(s, a) + α δ_t e_t(s, a),   for all s, a,    (1)

where α is a learning rate, e_t(s, a) is the eligibility trace of state s and action a, and

    δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t),    (2)

where r_{t+1} is the reward received upon entering the new state, γ is the discount rate, and Q_t(s_t, a_t) and Q_t(s_{t+1}, a_{t+1}) are the action-value estimates for the current and the next state, respectively.

Eligibility traces are inspired by the behaviour of biological neurons, which reach maximum eligibility for learning a short time after their activation. Eligibility traces were mentioned in the context of machine learning at least as early as 1972 and used for action-value learning at least as early as 1983 [1]. The oldest version of the eligibility trace for action-value learning appears to be the accumulating eligibility trace:

    e_t(s, a) = γλ e_{t-1}(s, a) + 1   if s = s_t and a = a_t,
                γλ e_{t-1}(s, a)       otherwise,
    for all s, a,    (3)

where λ is a trace decay parameter. In [4] it was proposed to use a replacing eligibility trace instead of the accumulating eligibility trace:

    e_t(s, a) = 1                      if s = s_t and a = a_t,
                0                      if s = s_t and a ≠ a_t,
                γλ e_{t-1}(s, a)       if s ≠ s_t,
    for all s, a.    (4)

The replacing eligibility trace outperformed the accumulating eligibility trace at least in the well-known Mountain-Car task [4], which has also been selected for the experiments reported in section 3. In the Mountain-Car task the state space is continuous and two-dimensional. In order to use equations 1 to 4 with a continuous-valued state space, it becomes necessary to encode the real state vector s into a corresponding binary feature vector φ. A great deal of work on RL still uses binary features, as obtained by grid discretisation or CMAC tile coding [5].
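As a concrete illustration of equations 1 to 4, the following minimal tabular sketch implements SARSA(λ) with either trace variant. The environment interface (env.reset, env.step) and the ε-greedy action selection are assumptions made for the sketch only; the experiments in section 3 rely on optimistic initial values rather than ε-greedy exploration.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon):
    """Assumed exploration scheme for this sketch (not used in the paper)."""
    if np.random.rand() < epsilon:
        return np.random.randint(Q.shape[1])
    return int(np.argmax(Q[s]))

def sarsa_lambda_tabular(env, n_states, n_actions, episodes=100,
                         alpha=0.1, gamma=1.0, lam=0.9, epsilon=0.1,
                         replacing=True):
    """Tabular SARSA(lambda) with accumulating or replacing traces (eqs. 1-4)."""
    Q = np.zeros((n_states, n_actions))
    for _ in range(episodes):
        e = np.zeros_like(Q)                      # eligibility traces e_t(s, a)
        s = env.reset()
        a = epsilon_greedy(Q, s, epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)         # assumed environment interface
            a_next = epsilon_greedy(Q, s_next, epsilon)
            # TD error, eq. (2); a terminal state has zero continuation value
            delta = r + (0.0 if done else gamma * Q[s_next, a_next]) - Q[s, a]
            e *= gamma * lam                      # decay of all traces, eqs. (3)/(4)
            if replacing:
                e[s, :] = 0.0                     # reset traces of unused actions
                e[s, a] = 1.0                     # replacing trace, eq. (4)
            else:
                e[s, a] += 1.0                    # accumulating trace, eq. (3)
            Q += alpha * delta * e                # update all (s, a) pairs, eq. (1)
            s, a = s_next, a_next
    return Q
```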

However, it is also possible to perform action-value learning with continuous-valued features or other function approximation methods. In that case, equation 1 becomes [5, p. 211]

    θ_{t+1} = θ_t + α [v_t - Q_t(s_t, a_t)] ∇_{θ_t} Q_t(s_t, a_t),    (5)

where θ_t is the parameter vector of the function approximator. For SARSA, the target output is v_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}), which gives the update rule

    θ_{t+1} = θ_t + α δ_t e_t,   where    (6)

    δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t),   and    (7)

    e_t = γλ e_{t-1} + ∇_{θ_t} Q_t(s_t, a_t)    (8)

with e_0 = 0, which represents an accumulating eligibility trace. This method is called gradient-descent Sarsa(λ) [5, p. 211]. Equation 3 is a special case of equation 8 when using binary state representations, because ∇_{θ_t} Q_t(s_t, a_t) then equals one for the component corresponding to s = s_t and a = a_t and zero for all other components. When artificial neural networks (ANNs) are used for learning an action-value function, an eligibility trace value is usually associated with every weight of the ANN [1]. The eligibility trace value then reflects to what extent and how long ago the neurons connected by the weight were activated. No similar generalisation seems to exist for the replacing trace in equation 4. In this paper, we propose the following update rule for replacing traces:

    e_t = max(γλ e_{t-1}, ∇_{θ_t} Q_t(s_t, a_t)),    (9)

where the maximum is taken element-wise. Even though no theoretical or mathematical proof is given here for this update rule, it makes it possible to compare the performance of accumulating and replacing eligibility traces also for continuous function approximation. Except for the resetting of unused actions, equation 4 is a special case of equation 9 when using binary state representations.

In addition to the resetting of unused actions, the setting of the used action's trace to one in equation 4 is a major obstacle for using replacing traces with continuous-valued function approximators. In the case of binary features this is feasible because the traces to set or reset can be identified uniquely. For continuous-valued features, it would be necessary to define thresholds for deciding whether a feature is sufficiently present in the current state in order to set the corresponding trace values to one or to zero. Furthermore, there seems to be no strong theoretical foundation for resetting the traces of unused actions to zero. In [4] it is indicated that resetting the trace was considered preferable due to its similarity with first-visit Monte Carlo methods, while it is treated as optional in [5]. Both [4] and [5] also call for empirical evidence on the effects of resetting the traces of unused actions, but no such experimental results seem to have been reported yet. The experimental results in the next section attempt to give more insight into this question.
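For a linear approximator Q_t(s, a) = θ_t · φ(s, a), the gradient in equations 8 and 9 is simply the feature vector φ(s, a), so one update step of gradient-descent Sarsa(λ) with the proposed replacing trace can be sketched as below. The function signature and the linear-in-features assumption are illustrative choices, not part of the paper.

```python
import numpy as np

def grad_sarsa_lambda_step(theta, e, phi_sa, phi_next_sa, r,
                           alpha, gamma, lam, replacing=True, terminal=False):
    """One gradient-descent SARSA(lambda) step for a linear approximator,
    Q(s, a) = theta . phi(s, a), following eqs. (6)-(9).

    phi_sa and phi_next_sa are the feature vectors of the current and next
    state-action pairs, so grad_theta Q(s_t, a_t) = phi_sa."""
    q = theta @ phi_sa
    q_next = 0.0 if terminal else theta @ phi_next_sa
    delta = r + gamma * q_next - q                 # TD error, eq. (7)
    if replacing:
        e = np.maximum(gamma * lam * e, phi_sa)    # proposed replacing trace, eq. (9)
    else:
        e = gamma * lam * e + phi_sa               # accumulating trace, eq. (8)
    theta = theta + alpha * delta * e              # parameter update, eq. (6)
    return theta, e
```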

3 Mountain-Car experiments

In order to simplify comparison with earlier work on eligibility traces in control tasks, we use the same Mountain-Car task as in [4], where the dynamics of the task are described. The task has a continuous-valued state vector s_t = (position, velocity) of the car. The task consists of accelerating an under-powered car up a hill, which is only possible if the car first gains enough inertia by backing away from the hill. The three possible actions are full throttle forward (+1), full throttle backward (-1) and no throttle (0). The reward function is the cost-to-go function as in [4], i.e. a reward of -1 is given for every step except upon reaching the goal, where the reward is zero. The parameters θ were set to zero before every run, so the initial action-value estimates are zero. The initial action-value estimates are therefore optimistic compared with the actual action-value function, which encourages exploration of the state space. The discount rate γ was set to one, i.e. no discounting.

In addition to the CMAC experiments used in [4], experiments were performed using a grid discretisation ("lookup table") and a continuous-valued Radial Basis Function (RBF) approximator. The grid discretisation used 8 x 8 tiles, while CMAC used five 9 x 9 tilings as in [4]. CMAC tilings usually have a random offset in the range of one tile, but for reasons of repeatability a regular offset of one tile width divided by the number of tilings was used here. With the RBF network, feature values are calculated as

    φ_i(s̄_t) = exp( -||s̄_t - c_i||² / r² ),    (10)

where c_i is the centroid vector of RBF neuron i and r is the spread parameter. 8 x 8 RBF neurons with regularly spaced centroids were used. In the experiments, an affine transformation mapped the actual state values from the intervals [-1.2, 0.5] and [-0.07, 0.07] into the interval [0, 1]; the resulting normalised state vector is denoted s̄_t. This transformation makes it easier to adjust the spread parameter (r² = 0.01 was used in all experiments). The action-value estimate of action a is then

    Q_t(s̄_t, a) = Σ_{i=1}^{N} w_{ia} ( φ_i / Σ_{k=1}^{N} φ_k ),    (11)

where w_{ia} is the weight between action neuron a and RBF neuron i. The division by Σ_{k=1}^{N} φ_k, where N is the number of RBF neurons, implements a normalised RBF network. Equation 11 without the normalisation was also used for the binary features produced by grid discretisation and CMAC. Every parameter w_{ia} has its own corresponding eligibility trace value. For the accumulating trace, equation 8 was used in all experiments, while equation 9 was used for the replacing trace.

Results for the three methods (accumulating trace, replacing trace with reset of unused actions, and replacing trace without reset of unused actions) are compared in Figure 1. The results are given as the average number of steps per episode over 50 agents, each of which performed 200 episodes for grid discretisation and 100 episodes for CMAC and RBF. Figure 1 shows the results obtained with the best learning rate values found, which are listed in Table 1. Every episode started from a random position and velocity.
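The normalised RBF features and action values of equations 10 and 11 can be computed as in the following sketch. The 8 x 8 centroid grid, the state ranges and r² = 0.01 follow the text above; the function and variable names are illustrative assumptions.

```python
import numpy as np

POS_RANGE = (-1.2, 0.5)                     # position interval of the Mountain-Car task
VEL_RANGE = (-0.07, 0.07)                   # velocity interval
R2 = 0.01                                   # spread parameter r^2

# regularly spaced centroids on an 8 x 8 grid in the normalised state space [0, 1]^2
_grid = np.linspace(0.0, 1.0, 8)
CENTROIDS = np.array([(p, v) for p in _grid for v in _grid])   # shape (64, 2)

def normalise_state(position, velocity):
    """Affine mapping of (position, velocity) into [0, 1]^2, giving s_bar."""
    p = (position - POS_RANGE[0]) / (POS_RANGE[1] - POS_RANGE[0])
    v = (velocity - VEL_RANGE[0]) / (VEL_RANGE[1] - VEL_RANGE[0])
    return np.array([p, v])

def rbf_features(s_bar):
    """Normalised RBF features: eq. (10) followed by the normalisation of eq. (11)."""
    sq_dist = np.sum((CENTROIDS - s_bar) ** 2, axis=1)
    phi = np.exp(-sq_dist / R2)
    return phi / phi.sum()

def q_values(weights, s_bar):
    """Q(s_bar, a) = sum_i w_ia * phi_i / sum_k phi_k, eq. (11); weights has shape (64, 3)."""
    return rbf_features(s_bar) @ weights    # one value per action
```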

Fig. 1: Results with the best α value of each method, as a function of λ. Error bars indicate one standard error.

There is no significant difference between the three methods when using grid discretisation. With CMAC, the replacing trace without reset is the best, while the accumulating trace performs the worst. With RBF, the replacing trace with reset is not applicable for the reasons explained in section 2; the replacing trace without reset outperforms the accumulating trace. As indicated by Table 1, the accumulating trace is more sensitive to the value of the learning rate than the replacing traces. This is because the values of e_t in equation 6 become larger for the accumulating trace than for the replacing traces, so the learning rate needs to be decreased correspondingly. Better results might be obtained by fine-tuning the learning rate or by using a normalised least mean square (NLMS) version of equation 6 (see e.g. [2] on the use of NLMS in a reinforcement learning task).
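As a rough illustration of the NLMS idea, the step size of equation 6 can be divided by the squared norm of the trace vector, which bounds the update magnitude when accumulating traces grow large. The exact normalisation used in [2] may differ, so the sketch below is only indicative.

```python
import numpy as np

def nlms_sarsa_update(theta, e, delta, alpha, eps=1e-8):
    """NLMS-style variant of eq. (6): scale the step size by the squared norm
    of the eligibility trace vector. Indicative sketch only."""
    step = alpha / (eps + np.dot(e, e))
    return theta + step * delta * e
```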

Method            λ=0     λ=0.4   λ=0.7   λ=0.8   λ=0.9   λ=0.95  λ=0.99
Accumulating
  Grid            0.3     0.3     0.1     0.1     0.1     0.05    0.05
  CMAC            0.26    0.2     0.14    0.1     0.08    0.04    NA
  RBF             4.7     3.3     2.3     2.1     0.9     0.5     0.1
With reset
  Grid            0.3     0.3     0.3     0.3     0.3     0.3     0.3
  CMAC            0.28    0.24    0.28    0.3     0.24    0.22    0.12
  RBF             NA      NA      NA      NA      NA      NA      NA
Without reset
  Grid            0.3     0.3     0.3     0.3     0.3     0.3     0.3
  CMAC            0.26    0.26    0.28    0.26    0.2     0.22    0.14
  RBF             4.9     4.3     4.3     4.7     3.9     3.3     2.5

Table 1: Best α values for different values of λ. NA: Not Applicable.

4 Conclusions

The experimental results confirm that replacing traces give better results than accumulating traces with CMAC for the Mountain-Car task. Replacing traces are also more robust to changes in the learning parameters than accumulating traces. The replacing eligibility trace without reset of unused actions gives the best results with both CMAC and RBF. Therefore, the replacing eligibility trace proposed in this paper (equation 9) is the best choice at least for this task. Experiments with other tasks should still be performed to see whether the same holds for them.

It also seems that structural credit assignment (i.e. the weights of the function approximator in this case) is more powerful than temporal credit assignment (i.e. eligibility traces in this case). The improvement obtained by passing from grid discretisation to CMAC to RBF is greater than what can be obtained by modifying the eligibility trace or its parameters. For instance, the best result with RBF and no eligibility trace is 80 steps/episode, which is better than the best CMAC result with λ = 0.95 (86 steps/episode). The utility of eligibility traces therefore seems to decrease when appropriate structural credit assignment can be provided instead.

References

[1] A.G. Barto, R.S. Sutton and C.W. Anderson, Neuronlike Adaptive Elements That Can Solve Difficult Learning Control Problems, IEEE Transactions on Systems, Man, and Cybernetics, 13:835-846, 1983.

[2] K. Främling, Adaptive robot learning in a non-stationary environment. In M. Verleysen, editor, Proceedings of the 13th European Symposium on Artificial Neural Networks (ESANN 2005), pages 381-386, Bruges, Belgium, April 27-29, 2005.

[3] G.A. Rummery and M. Niranjan, On-Line Q-Learning Using Connectionist Systems. Technical Report CUED/F-INFENG/TR 166, Cambridge University Engineering Department, 1994.

[4] S.P. Singh and R.S. Sutton, Reinforcement Learning with Replacing Eligibility Traces, Machine Learning, 22:123-158, 1996.

[5] R.S. Sutton and A.G. Barto, Reinforcement Learning: An Introduction, MIT Press, Cambridge, Massachusetts, 1998.