On and Off-Policy Relational Reinforcement Learning


Christophe Rodrigues, Pierre Gérard, and Céline Rouveirol
LIPN, UMR CNRS 7030, Institut Galilée - Université Paris-Nord
first.last@lipn.univ-paris13.fr

Abstract. In this paper, we propose adaptations of Sarsa and regular Q-learning to the relational case, using the incremental relational function approximator RIB. In the experimental study, we highlight how changing the RL algorithm impacts generalization in relational regression.

1 Introduction

Most work on Reinforcement Learning (RL, [1]) uses propositional, feature-based representations to produce approximations of value functions. When states and actions are represented by scalar vectors, the classical numerical approach to learning value functions is to use regression algorithms. More recently, the field of Relational Reinforcement Learning (RRL) has emerged [2], aiming to extend reinforcement learning to more complex, first-order logic-based representations of states and actions. Moving to a more expressive language opens up possibilities beyond the reach of attribute-value learning systems, mainly thanks to the detection and exploitation of structural regularities in (state, action) pairs.

In this paper, we study how even slight modifications of the RL algorithm employed may significantly impact the performance of the relational regression system. In Section 2, we briefly present the RL problem in the relational framework and describe three very similar RRL algorithms: the original Q-RRL [2], and two regular algorithms upgraded to relational representations, Q-learning (off-policy) and Sarsa (on-policy). We combine all these RL techniques with the same relational function approximator, RIB [3]. In Section 3, we compare those algorithms experimentally and show a significant impact on RIB's performance: the size of the models learned by RIB decreases, resulting in an overall reduction of computation time.

2 Relational Temporal Difference

Relational Reinforcement Learning (RRL) addresses the development of RL algorithms operating on relational representations of states and actions. The relational reinforcement learning task can be defined as follows. Given:
- a set of possible states S, represented in a relational format,
- a set of possible actions A, also represented in a relational format,

- an unknown transition function T : S × A → S (this function can be nondeterministic),
- an unknown real-valued reward function R : S × A → R,

the goal is to learn a policy for selecting actions, π : S → A, that maximizes the discounted return R_t = Σ_{k≥0} γ^k r_{t+k+1} from any time step t. This return is the cumulative reward obtained in the future, starting in state s_t; future rewards are weakened by a discount factor γ ∈ [0, 1]. In value-based RL methods, the return is usually approximated by a value function V : S → R or a Q-value function Q : S × A → R such that Q(s, a) ≈ E{R_t | s_t = s, a_t = a}.

Algorithm 1 Off-policy TD RRL algorithm: Qlearning-RIB (RIB-Q)
Require: state and action spaces S, A; RIB regression system for Q_RIB
Ensure: approximation of the action-value function Q_RIB
  initialize Q_RIB
  loop
    choose a random start state s for the episode
    repeat
      a ← π_{Q_RIB}^τ(s)   (Boltzmann softmax)
      perform a; get r and s' in return
      if s' is NOT terminal then
        Q_RIB(s, a) ←_learn r + γ max_{a' ∈ A(s')} Q_RIB(s', a')
      else
        Q_RIB(s, a) ←_learn r
      end if
      s ← s'
    until s is terminal
  end loop

States are relational interpretations, as used in the learning from interpretations setting [4]. In this notation, each (state, action) pair is represented by a relational interpretation, i.e. a set of relational facts; the action is represented by an additional ground fact.

Among the incremental relational function approximators used in RRL [2, 7], the RIB system [3] offers a good performance/efficiency compromise. It adopts an instance-based learning paradigm to approximate the Q-value function: RIB stores a number of prototypes, each associated with a Q-value, and uses these prototypes to predict the Q-value of unseen examples with a k-nearest-neighbor algorithm. It takes advantage of a relational distance adapted to the problem at hand (see [8] for a distance for the blocks world problem). RIB handles incrementality by forgetting prototypes that are either not necessary to reach good prediction performance or that predict poorly.
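To make the interplay between the TD update of Algorithm 1 and an instance-based regressor concrete, here is a minimal Python sketch. It assumes a hypothetical InstanceBasedQ class standing in for RIB (a plain k-nearest-neighbor average, without RIB's prototype forgetting) and an abstract episodic environment exposing random_start_state, actions, step and is_terminal; none of these names come from the original system.

import math
import random

class InstanceBasedQ:
    """k-nearest-neighbor Q-value approximator over (state, action) prototypes."""

    def __init__(self, distance, k=3):
        self.distance = distance      # relational distance over (state, action) pairs
        self.k = k
        self.prototypes = []          # list of ((state, action), q_value) pairs

    def predict(self, state, action):
        if not self.prototypes:
            return 0.0
        # Average the Q-values of the k closest stored prototypes.
        ranked = sorted(self.prototypes,
                        key=lambda proto: self.distance(proto[0], (state, action)))
        nearest = ranked[:self.k]
        return sum(q for _, q in nearest) / len(nearest)

    def learn(self, state, action, target):
        # The real RIB also forgets redundant or badly predicting prototypes;
        # this sketch simply stores every presented example.
        self.prototypes.append(((state, action), target))


def boltzmann(q_approx, state, actions, tau):
    """Sample an action with probability proportional to exp(Q(s, a) / tau)."""
    weights = [math.exp(q_approx.predict(state, a) / tau) for a in actions]
    return random.choices(actions, weights=weights)[0]


def q_learning_episode(env, q_approx, gamma=0.9, tau=1.0):
    """One episode of the off-policy TD loop of Algorithm 1 (RIB-Q style)."""
    s = env.random_start_state()
    while not env.is_terminal(s):
        a = boltzmann(q_approx, s, env.actions(s), tau)
        r, s_next = env.step(s, a)                 # perform a; observe reward and next state
        if not env.is_terminal(s_next):
            target = r + gamma * max(q_approx.predict(s_next, a2)
                                     for a2 in env.actions(s_next))
        else:
            target = r
        q_approx.learn(s, a, target)               # present ((s, a), target) to the regressor
        s = s_next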

As opposed to regular Q-learning, in the RRL algorithm introduced in [2], learning occurs only at the end of an episode and not at each time step. It stores full trajectories s_0, a_0, r_1, s_1, ..., s_{T-1}, a_{T-1}, r_T, s_T; the back-propagation over all time steps then occurs at once, when a terminal state is reached, using the usual update rule:

  Q(s_t, a_t) ←_learn r_{t+1} + γ max_{a ∈ A(s_{t+1})} Q(s_{t+1}, a)

In order to learn at each time step, we propose (Algorithm 1) a regular adaptation of Q-learning to the relational framework, using RIB for the relational regression part.

Algorithm 2 On-policy TD RRL algorithm: Sarsa-RIB (RIB-S)
Require: state and action spaces S, A; RIB regression system for Q_RIB
Ensure: approximation of the action-value function Q_RIB
  initialize Q_RIB
  loop
    choose a random start state s for the episode
    a ← π_{Q_RIB}^τ(s)   (Boltzmann softmax)
    repeat
      perform a; get r and s' in return
      a' ← π_{Q_RIB}^τ(s')   (Boltzmann softmax)
      if s' is NOT terminal then
        Q_RIB(s, a) ←_learn r + γ Q_RIB(s', a')
      else
        Q_RIB(s, a) ←_learn r
      end if
      s ← s'; a ← a'
    until s is terminal
  end loop

In these algorithms, Q_RIB(s, a) stands for the RIB prediction for the (s, a) pair, and π_{Q_RIB}^τ means that the action is chosen according to a policy π derived from the action values Q_RIB: the action is selected according to a Boltzmann distribution with temperature τ, i.e. the probability of choosing action a in state s is exp(Q_RIB(s, a)/τ) / Σ_{b ∈ A(s)} exp(Q_RIB(s, b)/τ). Q-learning is said to be off-policy because it learns an optimal Q-value function even though it does not always choose optimal actions. With minor modifications, we propose (Algorithm 2) an upgraded version of Sarsa [1], an on-policy algorithm which learns the Q-value function corresponding to the policy it actually follows.

With these new algorithms, there is no need anymore to keep complete trajectories in memory. In addition, the value function is modified at each time step; as a consequence, action selection improves along an episode. Although we expect little performance gain from a strict RL perspective, from an ILP point of view these algorithms take full advantage of the incrementality of RIB.
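For contrast, a sketch of the on-policy loop of Algorithm 2 under the same assumptions (the hypothetical InstanceBasedQ, boltzmann and environment interface of the previous sketch): the only change is the bootstrap target, which uses the action actually selected in the next state instead of the maximum over next actions.

def sarsa_episode(env, q_approx, gamma=0.9, tau=1.0):
    """One episode of the on-policy TD loop of Algorithm 2 (RIB-S style)."""
    s = env.random_start_state()
    a = boltzmann(q_approx, s, env.actions(s), tau)
    while not env.is_terminal(s):
        r, s_next = env.step(s, a)
        if not env.is_terminal(s_next):
            a_next = boltzmann(q_approx, s_next, env.actions(s_next), tau)
            # On-policy target: bootstrap on the action the policy actually selects next.
            target = r + gamma * q_approx.predict(s_next, a_next)
        else:
            a_next = None
            target = r
        q_approx.learn(s, a, target)
        s, a = s_next, a_next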

This method changes both the presented samples and their order of presentation to the regression algorithm, resulting in a different generalization of the Q-value function.

3 Experimental study

The experiments are performed on the blocks world problem as described in [9]. Each algorithm (Q-RRL, RIB-Q and RIB-S) is tested for 2 trials, and results are averaged. The trials are divided into episodes, each starting in a random state and ending depending on the task to solve (stacking or on(a,b)). The Q-value function is periodically evaluated during a trial: for each evaluation, 1 episodes of greedy exploitation without learning are performed, each starting from a random state. In every experiment, the discount factor γ is set to 0.9. Each episode is interrupted after N time steps, depending on the number of blocks in the environment: N = (n_blocks − 1) × 3. All other parameters are set according to [9].
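As a rough illustration of this evaluation protocol, the sketch below (reusing the hypothetical interfaces of the earlier sketches) runs greedy exploitation episodes without learning and reports the average episode length; the number of evaluation episodes is an illustrative default, not the exact value used in the experiments.

def greedy_episode_length(env, q_approx, max_steps):
    """Run one greedy episode without learning; return the number of steps used."""
    s = env.random_start_state()
    for step in range(max_steps):
        if env.is_terminal(s):
            return step
        # Greedy exploitation: choose the action with the highest predicted Q-value.
        a = max(env.actions(s), key=lambda act: q_approx.predict(s, act))
        _, s = env.step(s, a)
    return max_steps


def evaluate(env, q_approx, n_blocks, n_eval_episodes=10):
    """Average greedy episode length over several evaluation episodes."""
    max_steps = (n_blocks - 1) * 3     # episode cutoff depending on the number of blocks
    lengths = [greedy_episode_length(env, q_approx, max_steps)
               for _ in range(n_eval_episodes)]
    return sum(lengths) / len(lengths)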

Figure 1 compares the different algorithms on the stacking problem with 5 and 7 blocks. Table 1 shows the average number of instances used by RIB after 1 episodes, indicating the complexity of the model obtained by each algorithm on the different problems.

Fig. 1. Evolution of the number of time steps needed to complete an episode (stacking task with 5 and 7 blocks).

Table 1. Number of prototypes used by RIB after 1 episodes (average and standard deviation for 5, 6 and 7 blocks).

Figure 2 compares the different algorithms on the on(a,b) problem with 5 and 7 blocks. Table 2 shows the average number of instances used by RIB after 1 episodes (with standard deviation), indicating the complexity of the model obtained by each algorithm on the different problems.

Fig. 2. Evolution of the number of time steps needed to complete an episode (on(a,b) task with 5 and 7 blocks).

Table 2. Number of prototypes used by RIB after 1 episodes (average and standard deviation for 5, 6 and 7 blocks).

The experimental results show that, as one might expect on such tasks, where exploratory actions do not lead to catastrophic outcomes, the off-policy TD algorithms (Q-RRL and RIB-Q) slightly outperform the on-policy one (RIB-S). Moreover, RIB-Q does not differ much from the original Q-RRL in terms of convergence speed with respect to the number of episodes. Most important is the level of performance (computation time) reached by all the presented RRL algorithms. Our adaptations of Q-learning and Sarsa, namely RIB-Q and RIB-S, do not provide more examples to the regression system than the original Q-RRL does; since RIB's learning is linear in the number of prototypes, having fewer prototypes saves computation time. RIB-S (on-policy) learns fewer prototypes on these relatively simple tasks: because of its policy, it explores a smaller portion of the state space than the off-policy algorithms and therefore needs fewer prototypes to reach a good predictive accuracy on those states. As a consequence, RIB-S outperforms Q-RRL and RIB-Q as far as computation time is concerned, demonstrating that, despite its slower convergence speed with respect to the number of episodes, Sarsa remains a good candidate for scaling up.

RIB is instance-based and thus strongly relies on its distance: generalization only takes place through the distance computation during the k-nearest-neighbor prediction. The distance used in RIB [9] is well suited to blocks world problems: it is similar to the Levenshtein edit distance between sets of block stacks, seen as strings. The distance between (state, action) pairs is equal to 0 for pairs differing only by a permutation of constants that do not occur in the action literal. This distance also takes into account bindings of variables occurring in the goal.
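To illustrate the kind of edit-distance computation mentioned here, the sketch below compares blocks-world states written as lists of stacks (each stack a string of block identifiers) using the classic Levenshtein distance and a simple greedy matching of stacks. It is only a simplified stand-in for intuition; the actual RIB distance [8, 9] additionally handles the action literal, permutations of constants and goal bindings.

def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def state_distance(stacks1, stacks2):
    """Greedily match the stacks of two states and sum their edit distances;
    unmatched stacks count for their full length."""
    remaining = list(stacks2)
    total = 0
    for s1 in stacks1:
        if not remaining:
            total += len(s1)
            continue
        best = min(remaining, key=lambda s2: levenshtein(s1, s2))
        total += levenshtein(s1, best)
        remaining.remove(best)
    total += sum(len(s) for s in remaining)
    return total


# Two 4-block states, each written as a list of stacks (top block first):
print(state_distance(["abc", "d"], ["abd", "c"]))   # -> 2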

Without this prior knowledge, the system cannot solve problems like on(a,b), where two specific blocks have to be stacked on each other.

4 Conclusion

We have observed that even small differences in the RL techniques significantly influence the behavior of a fixed relational regression algorithm, namely RIB. We have proposed two RRL algorithms, RIB-Q and RIB-S, and have tested them on usual RRL benchmarks, showing performance improvements. This work opens up several new research directions. We plan to adapt more sophisticated RL algorithms that will provide more useful information to the relational regression algorithm. We have already experimented with a relational TD(λ) with eligibility traces, without noticing a significant improvement, neither in the number of prototypes nor in the computation time. A possible explanation is that the distance is so well fitted to the problem that eligibility traces are useless in this case. It would be interesting to study how RL may balance the effects of a misleading distance or, going further, how RL may help in adapting the distance to the problem at hand.

Acknowledgements

The authors would like to thank Kurt Driessens and Jan Ramon for very kindly and helpfully providing the authors with the RIB system.

References

1. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press (1998)
2. Džeroski, S., De Raedt, L., Driessens, K.: Relational reinforcement learning. Machine Learning 43 (2001) 7-52
3. Driessens, K., Ramon, J.: Relational instance based regression for relational reinforcement learning. In: Proceedings of the Twentieth International Conference on Machine Learning (ICML 2003). (2003) 123-130
4. De Raedt, L., Džeroski, S.: First-order jk-clausal theories are PAC-learnable. Artificial Intelligence 70(1-2) (1994) 375-392
5. Driessens, K., Ramon, J., Blockeel, H.: Speeding up relational reinforcement learning through the use of an incremental first order decision tree algorithm. In: Proceedings of the European Conference on Machine Learning (ECML 2001), LNAI vol. 2167. (2001) 97-108
6. Driessens, K., Džeroski, S.: Combining model-based and instance-based learning for first order regression. In: Proceedings of the 22nd International Conference on Machine Learning (ICML 2005). (2005) 193-200
7. Gärtner, T., Driessens, K., Ramon, J.: Graph kernels and Gaussian processes for relational reinforcement learning. In: Proceedings of the 13th Inductive Logic Programming International Conference (ILP 2003), LNCS vol. 2835. (2003) 146-163
8. Ramon, J., Bruynooghe, M.: A polynomial time computable metric between point sets. Acta Informatica 37(10) (2001) 765-780
9. Driessens, K.: Relational reinforcement learning. PhD thesis, K.U. Leuven (2004)