Reinforcement Learning: the basics


Reinforcement Learning: the basics Olivier Sigaud Université Pierre et Marie Curie, PARIS 6 http://people.isir.upmc.fr/sigaud August 6, 2012 1 / 46

Introduction Action selection/planning Learning by trial-and-error (main model : Reinforcement Learning) 2 / 46

Reinforcement Learning: the basics Introduction Introductory books 1. [Sutton & Barto, 1998] : the ultimate introduction to the field, in the discrete case 2. [Buffet & Sigaud, 2008] : in French 3. [Sigaud & Buffet, 2010] : (improved) translation of 2 3 / 46

Introduction Different learning mechanisms Supervised learning The supervisor indicates to the agent the expected answer The agent corrects a model based on the answer Typical mechanisms : gradient backpropagation, RLS Applications : classification, regression, function approximation 4 / 46

Introduction Different learning mechanisms Self-supervised learning When an agent learns to predict, it proposes its prediction The environment provides the correct answer : the next state Supervised learning without a supervisor Difficult to distinguish from associative learning 5 / 46

Introduction Different learning mechanisms Cost-Sensitive Learning The environment provides the value of the action (reward, penalty) Application : behaviour optimization 6 / 46

Introduction Different learning mechanisms Reinforcement learning In RL, the value signal is given as a scalar How good is -10.45? Hence the necessity of exploration 7 / 46

Introduction Different learning mechanisms The exploration/exploitation trade-off Exploring can be (very) harmful Shall I exploit what I know or look for a better policy? Am I optimal? Shall I keep exploring or stop? Decrease the rate of exploration over time ε-greedy : take the best action most of the time, and a random action from time to time 8 / 46
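As a concrete companion to the ε-greedy rule above, here is a minimal sketch in Python; the Q-table layout (a dict keyed by (state, action) pairs) and the argument names are illustrative assumptions, not part of the slides.

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """Take the best known action most of the time, a random one with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(actions)                  # explore
    return max(actions, key=lambda a: Q[(state, a)])   # exploit the current Q-table
```

Decreasing epsilon over time, as suggested on the slide, gradually shifts the agent from exploration to exploitation.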

Introduction Different learning mechanisms Different mechanisms : a reminder Supervised learning : for a given input, the learner gets as feedback the output it should have given Reinforcement learning : for a given input, the learner gets as feedback a scalar representing the immediate value of its output Unsupervised learning : for a given input, the learner gets no feedback : it just extracts correlations Note : the self-supervised learning case is hard to distinguish from the unsupervised learning case 9 / 46

Introduction Different learning mechanisms Outline Goals of this class : present the basics of discrete RL and dynamic programming Content : Dynamic programming Model-free Reinforcement Learning Actor-critic approach Model-based Reinforcement Learning 10 / 46

Dynamic programming Markov Decision Processes S : state space A : action space T : S × A → Π(S) : transition function r : S × A → ℝ : reward function An MDP defines s_{t+1} and r_{t+1} as a function f(s_t, a_t) It describes a problem, not a solution Markov property : p(s_{t+1} | s_t, a_t) = p(s_{t+1} | s_t, a_t, s_{t-1}, a_{t-1}, ..., s_0, a_0) Reactive agents : a_{t+1} = f(s_t), without internal state nor memory In an MDP, a memory of the past does not provide any useful advantage 11 / 46
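To make the notation concrete, a small discrete MDP can be written down directly as lookup tables. The toy problem below (two states, two actions, and every number in it) is invented purely for illustration:

```python
# States and actions of a hypothetical two-state MDP.
S = ["s0", "s1"]
A = ["stay", "go"]

# T[(s, a)] is the distribution over next states p(.|s, a); r[(s, a)] is the immediate reward.
T = {
    ("s0", "stay"): {"s0": 1.0},
    ("s0", "go"):   {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s1": 1.0},
    ("s1", "go"):   {"s0": 1.0},
}
r = {("s0", "stay"): 0.0, ("s0", "go"): 0.0,
     ("s1", "stay"): 1.0, ("s1", "go"): 0.0}

gamma = 0.9  # discount factor, used by the aggregation criteria below
```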

Dynamic programming Markov property : limitations The Markov property is not verified if : the state does not contain all the information useful to take decisions, or if the next state depends on the decisions of several agents, or if the transitions depend on time 12 / 46

Dynamic programming Example : tic-tac-toe The state is not always a location The opponent is seen as part of the environment (might be stochastic) 13 / 46

Dynamic programming A stochastic problem A deterministic problem is a special case of a stochastic one T(s_t, a_t, s_{t+1}) = p(s' | s, a) 14 / 46

Dynamic programming A stochastic policy For any MDP, there exists a deterministic policy that is optimal 15 / 46

Dynamic programming Rewards over a Markov chain : on states or on actions? Reward over states Reward over actions in states Below, we assume the latter (we write r(s, a)) 16 / 46

Dynamic programming Policy and value functions Goal : find a policy π : S → A maximising the aggregation of reward over the long run The value function V^π : S → ℝ records the aggregation of reward over the long run for each state (following policy π). It is a vector with one entry per state The action value function Q^π : S × A → ℝ records the aggregation of reward over the long run for doing each action in each state (and then following policy π). It is a matrix with one entry per state and per action In the remainder, we focus on V ; everything transposes trivially to Q 17 / 46

Dynamic programming Aggregation criteria The computation of value functions assumes the choice of an aggregation criterion (discounted, average, etc.) Mere sum (finite horizon) : V^π(s_0) = r_0 + r_1 + r_2 + ... + r_N Equivalent : average over the horizon 18 / 46

Dynamic programming Aggregation criteria The computation of value functions assumes the choice of an aggregation criterion (discounted, average, etc.) Average criterion over a window : V^π(s_0) = (r_0 + r_1 + r_2)/3 ... 18 / 46

Dynamic programming Aggregation criteria The computation of value functions assumes the choice of an aggregation criterion (discounted, average, etc.) Discounted criterion : V^π(s_{t_0}) = Σ_{t=t_0}^{∞} γ^t r(s_t, π(s_t)) γ ∈ [0, 1] : discount factor if γ = 0, sensitive only to the immediate reward if γ = 1, future rewards are as important as immediate rewards The discounted case is the most used 18 / 46
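A minimal sketch of the discounted criterion on a finite reward sequence (the function name and the example numbers are mine):

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of gamma^t * r_t over a (finite) reward sequence."""
    return sum(gamma ** t * r_t for t, r_t in enumerate(rewards))

# A reward of 1 received two steps in the future is worth 0.9^2 = 0.81 today:
assert abs(discounted_return([0.0, 0.0, 1.0], gamma=0.9) - 0.81) < 1e-12
```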

Dynamic programming Bellman equation over a Markov chain : recursion Given the discounted reward aggregation criterion : V(s_0) = r_0 + γ V(s_1) 19 / 46

Dynamic programming Bellman equation : general case Generalisation of the recursion V(s_0) = r_0 + γ V(s_1) over all possible trajectories Deterministic π : V^π(s) = r(s, π(s)) + γ Σ_{s'} p(s' | s, π(s)) V^π(s') 20 / 46

Dynamic programming Bellman equation : general case Generalisation of the recursion V(s_0) = r_0 + γ V(s_1) over all possible trajectories Stochastic π : V^π(s) = Σ_a π(s, a) [r(s, a) + γ Σ_{s'} p(s' | s, a) V^π(s')] 20 / 46

Dynamic programming Bellman operator and dynamic programming We get V^π(s) = r(s, π(s)) + γ Σ_{s'} p(s' | s, π(s)) V^π(s') We call Bellman operator (noted T^π) the application V^π(s) ← r(s, π(s)) + γ Σ_{s'} p(s' | s, π(s)) V^π(s') We call Bellman optimality operator (noted T*) the application V(s) ← max_{a∈A} [r(s, a) + γ Σ_{s'} p(s' | s, a) V(s')] The optimal value function is a fixed point of the Bellman optimality operator T* : V* = T* V* Value Iteration : V_{i+1} ← T* V_i Policy Iteration : policy evaluation (with V^π_{i+1} ← T^π V^π_i) + policy improvement with ∀s ∈ S, π'(s) ← arg max_{a∈A} Σ_{s'} p(s' | s, a) [r(s, a) + γ V^π(s')] 21 / 46

Dynamic programming Value Iteration in practice [Gridworld figure : starting from the reward state R, successive sweeps propagate the values backwards through the grid (0.9, then 0.81, 0.73, 0.66, 0.59, 0.53, 0.48, 0.43, ...)] ∀s ∈ S, V_{i+1}(s) ← max_{a∈A} [r(s, a) + γ Σ_{s'} p(s' | s, a) V_i(s')] 22 / 46

Dynamic programming Value Iteration in practice Once the values have converged, the optimal policy is extracted : π*(s) = arg max_{a∈A} [r(s, a) + γ Σ_{s'} p(s' | s, a) V*(s')] 22 / 46
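A compact Value Iteration sketch over the toy MDP tables introduced earlier (the function and its defaults are illustrative assumptions; the fixed number of sweeps stands in for a proper convergence test):

```python
def value_iteration(S, A, T, r, gamma=0.9, n_sweeps=100):
    """V_{i+1}(s) = max_a [ r(s, a) + gamma * sum_s' p(s'|s, a) * V_i(s') ]."""
    V = {s: 0.0 for s in S}
    for _ in range(n_sweeps):
        V = {s: max(r[(s, a)] + gamma * sum(p * V[s2] for s2, p in T[(s, a)].items())
                    for a in A)
             for s in S}
    # Extract the greedy policy from the (near-)optimal values.
    pi = {s: max(A, key=lambda a: r[(s, a)] + gamma *
                 sum(p * V[s2] for s2, p in T[(s, a)].items()))
          for s in S}
    return V, pi
```

Called as `value_iteration(S, A, T, r, gamma)`, it returns both the value table and the greedy policy extracted from it.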

Dynamic programming Policy Iteration in practice [Gridworld figure : the two steps below are alternated until the policy no longer changes] Evaluation : ∀s ∈ S, V_i(s) ← evaluate(π_i(s)) Improvement : ∀s ∈ S, π_{i+1}(s) ← improve(π_i(s), V_i(s)) 23 / 46
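The same toy tables allow a Policy Iteration sketch; here `evaluate` is approximated by a fixed number of sweeps of the Bellman operator T^π, and all names are again illustrative assumptions:

```python
def policy_iteration(S, A, T, r, gamma=0.9, n_eval_sweeps=50):
    """Alternate approximate policy evaluation and greedy policy improvement."""
    pi = {s: A[0] for s in S}                              # arbitrary initial policy
    while True:
        V = {s: 0.0 for s in S}
        for _ in range(n_eval_sweeps):                     # evaluate(pi_i)
            V = {s: r[(s, pi[s])] + gamma *
                    sum(p * V[s2] for s2, p in T[(s, pi[s])].items())
                 for s in S}
        new_pi = {s: max(A, key=lambda a: r[(s, a)] + gamma *
                         sum(p * V[s2] for s2, p in T[(s, a)].items()))
                  for s in S}                              # improve(pi_i, V_i)
        if new_pi == pi:                                   # policy stable: done
            return V, pi
        pi = new_pi
```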

Dynamic programming Families of methods Critic : the (action) value function, an evaluation of the policy Actor : the policy itself Value Iteration is a pure critic method : it iterates on the value function up to convergence without storing the policy, then computes the optimal policy Policy Iteration is implemented as an actor-critic method, updating in parallel one structure for the actor and one for the critic In the continuous case, there are pure actor methods 24 / 46

Model-free Reinforcement learning Reinforcement learning In DP (planning), T and r are given Reinforcement learning goal : build π without knowing T and r Model-free approach : build π without estimating T or r Actor-critic approach : a special case of model-free Model-based approach : build a model of T and r and use it to improve the policy 25 / 46

Model-free Reinforcement learning Temporal difference methods Incremental estimation Estimating the average immediate (stochastic) reward in a state s : E_k(s) = (r_1 + r_2 + ... + r_k)/k E_{k+1}(s) = (r_1 + r_2 + ... + r_k + r_{k+1})/(k + 1) Thus E_{k+1}(s) = k/(k + 1) E_k(s) + r_{k+1}/(k + 1) Or E_{k+1}(s) = (k + 1)/(k + 1) E_k(s) − E_k(s)/(k + 1) + r_{k+1}/(k + 1) Or E_{k+1}(s) = E_k(s) + 1/(k + 1) [r_{k+1} − E_k(s)] This still requires storing k It can be approximated as E_{k+1}(s) = E_k(s) + α [r_{k+1} − E_k(s)] (1) which converges to the true average (slower or faster depending on α) without storing anything Equation (1) is everywhere in reinforcement learning 26 / 46
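Equation (1) is a one-liner; here is a hedged sketch of it (names are mine), the building block behind every update rule that follows:

```python
def running_estimate(E, sample, alpha=0.1):
    """Equation (1): E <- E + alpha * (sample - E), a running average without storing k."""
    return E + alpha * (sample - E)
```

Feeding it a stream of noisy rewards drives E towards their mean, faster or slower depending on α.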

Model-free Reinforcement learning Temporal difference methods Temporal Difference error The goal of TD methods is to estimate the value function V(s) If the estimates V(s_t) and V(s_{t+1}) were exact, we would get : V(s_t) = r_{t+1} + γ r_{t+2} + γ² r_{t+3} + γ³ r_{t+4} + ... V(s_{t+1}) = r_{t+2} + γ r_{t+3} + γ² r_{t+4} + ... Thus V(s_t) = r_{t+1} + γ V(s_{t+1}) δ_k = r_{k+1} + γ V(s_{k+1}) − V(s_k) measures the error between the current values of V and the values they should have 27 / 46

Model-free Reinforcement learning Temporal difference methods Monte Carlo methods Much used in games (Go, ...) to evaluate a state Generate a lot of trajectories s_0, s_1, ..., s_N with observed rewards r_0, r_1, ..., r_N Update the state values V(s_k), k = 0, ..., N−1 with : V(s_k) ← V(s_k) + α(s_k) (r_k + r_{k+1} + ... + r_N − V(s_k)) It uses the average estimation method (1) 28 / 46
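A minimal Monte Carlo evaluation sketch matching the slide's (undiscounted) update; the dict-based V table and the argument names are assumptions:

```python
def monte_carlo_update(V, states, rewards, alpha=0.1):
    """After a full episode, push each V(s_k) towards the return observed from s_k.
    states = [s_0, ..., s_N], rewards = [r_0, ..., r_N]."""
    for k in range(len(states) - 1):                 # k = 0, ..., N-1
        ret = sum(rewards[k:])                       # r_k + r_{k+1} + ... + r_N
        V[states[k]] += alpha * (ret - V[states[k]])
    return V
```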

Model-free Reinforcement learning Temporal difference methods Temporal Difference (TD) methods Temporal Difference (TD) methods combine the properties of DP methods and Monte Carlo methods : in Monte Carlo, T and r are unknown, but the value update is global : full trajectories are needed in DP, T and r are known, but the value update is local TD : as in DP, V(s_t) is updated locally given an estimate of V(s_{t+1}), and T and r are unknown Note : Monte Carlo can be reformulated incrementally using the temporal difference δ_k update 29 / 46

Model-free Reinforcement learning Temporal difference methods Policy evaluation : TD(0) Given a policy π, the agent performs a sequence s_0, a_0, r_1, ..., s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, ... V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)] Combines the TD update (propagation from V(s_{t+1}) to V(s_t)) from DP and the incremental estimation method from Monte Carlo Updates are local, from s_t, s_{t+1} and r_{t+1} Proof of convergence : [Dayan & Sejnowski, 1994] 30 / 46
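One TD(0) update step as a sketch (the table layout and the names are assumptions):

```python
def td0_update(V, s, r_next, s_next, alpha=0.1, gamma=0.9):
    """V(s_t) <- V(s_t) + alpha * [r_{t+1} + gamma * V(s_{t+1}) - V(s_t)]."""
    delta = r_next + gamma * V[s_next] - V[s]   # TD error
    V[s] += alpha * delta
    return V
```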

Model-free Reinforcement learning Temporal difference methods TD(0) : limitation TD(0) evaluates V(s) One cannot infer π(s) from V(s) without knowing T : one must know which a leads to the best V(s') Three solutions : Work with Q(s, a) rather than V(s) Learn a model of T : model-based (or indirect) reinforcement learning Actor-critic methods (simultaneously learn V and update π) 31 / 46

Model-free Reinforcement learning Action Value Function Approaches Value function and action value function The value function V^π : S → ℝ records the aggregation of reward over the long run for each state (following policy π). It is a vector with one entry per state The action value function Q^π : S × A → ℝ records the aggregation of reward over the long run for doing each action in each state (and then following policy π). It is a matrix with one entry per state and per action 32 / 46

Model-free Reinforcement learning Action Value Function Approaches Sarsa Reminder (TD) : V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)] Sarsa : for each observed (s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}) : Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ Q(s_{t+1}, a_{t+1}) − Q(s_t, a_t)] Policy : perform exploration (e.g. ε-greedy) One must know the action a_{t+1}, which constrains exploration On-policy method : more complex convergence proof [Singh et al., 2000] 33 / 46
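The Sarsa update as code, applied to one observed quintuple; the Q-table is the same assumed dict keyed by (state, action) pairs:

```python
def sarsa_update(Q, s, a, r_next, s_next, a_next, alpha=0.1, gamma=0.9):
    """On-policy: the bootstrap uses the action a_{t+1} actually taken by the policy."""
    target = r_next + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q
```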

Model-free Reinforcement learning Action Value Function Approaches Q-Learning For each observed (s_t, a_t, r_{t+1}, s_{t+1}) : Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_{a∈A} Q(s_{t+1}, a) − Q(s_t, a_t)] max_{a∈A} Q(s_{t+1}, a) instead of Q(s_{t+1}, a_{t+1}) Off-policy method : no more need to know a_{t+1} [Watkins, 1989] Policy : perform exploration (e.g. ε-greedy) Convergence proved provided infinite exploration [Dayan & Sejnowski, 1994] 34 / 46

Model-free Reinforcement learning Action Value Function Approaches Q-Learning in practice (video : Q-learning, the movie) Build a state-action table (the Q-Table, possibly built incrementally) Initialise it (randomly or with 0; 0 is not a good choice) Apply the update equation after each action Problem : it is (very) slow 35 / 46
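Putting the pieces together, a tabular Q-learning loop might look like the sketch below. The environment interface (`reset()` returning a state, `step(a)` returning `(next_state, reward, done)`) is a hypothetical convention, and the ε-greedy helper is the one sketched earlier:

```python
from collections import defaultdict

def q_learning(env, actions, episodes=500, alpha=0.1, gamma=0.9, epsilon=0.1):
    """Tabular Q-learning with epsilon-greedy exploration over an assumed env interface."""
    Q = defaultdict(float)                          # Q-table, implicitly initialised to 0
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy(Q, s, actions, epsilon)
            s_next, reward, done = env.step(a)
            best_next = max(Q[(s_next, b)] for b in actions)
            Q[(s, a)] += alpha * (reward + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```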

Model-free Reinforcement learning Actor-Critic approaches From Q(s, a) to Actor-Critic (1)

state | a_0  | a_1  | a_2  | a_3
e_0   | 0.66 | 0.88 | 0.81 | 0.73
e_1   | 0.73 | 0.63 | 0.9  | 0.43
e_2   | 0.73 | 0.9  | 0.95 | 0.73
e_3   | 0.81 | 0.9  | 1.0  | 0.81
e_4   | 0.81 | 1.0  | 0.81 | 0.9
e_5   | 0.9  | 1.0  | 0.9  |

In Q-learning, given a Q-Table, one must determine the max at each step This becomes expensive if there are numerous actions 36 / 46

Model-free Reinforcement learning Actor-Critic approaches From Q(s, a) to Actor-Critic (2)

state | a_0  | a_1   | a_2   | a_3
e_0   | 0.66 | 0.88* | 0.81  | 0.73
e_1   | 0.73 | 0.63  | 0.9*  | 0.43
e_2   | 0.73 | 0.9   | 0.95* | 0.73
e_3   | 0.81 | 0.9   | 1.0*  | 0.81
e_4   | 0.81 | 1.0*  | 0.81  | 0.9
e_5   | 0.9  | 1.0*  | 0.9   |

One can store the best value for each state (marked *) Then one can update the max by just comparing the changed value and the max No more maximum over actions (only in one case) 37 / 46

Model-free Reinforcement learning Actor-Critic approaches From Q(s, a) to Actor-Critic (3)

state | a_0  | a_1   | a_2   | a_3
e_0   | 0.66 | 0.88* | 0.81  | 0.73
e_1   | 0.73 | 0.63  | 0.9*  | 0.43
e_2   | 0.73 | 0.9   | 0.95* | 0.73
e_3   | 0.81 | 0.9   | 1.0*  | 0.81
e_4   | 0.81 | 1.0*  | 0.81  | 0.9
e_5   | 0.9  | 1.0*  | 0.9   |

state | chosen action
e_0   | a_1
e_1   | a_2
e_2   | a_2
e_3   | a_2
e_4   | a_1
e_5   | a_1

Storing the max is equivalent to storing the policy Update the policy as a function of the value updates This is the basic actor-critic scheme 38 / 46

Model-free Reinforcement learning Actor-Critic approaches Dynamic Programming and Actor-Critic (1) In both PI and AC, the architecture contains a representation of the value function (the critic) and of the policy (the actor) In PI, the MDP (T and r) is known PI alternates two stages : 1. Policy evaluation : update V(s) or Q(s, a) given the current policy 2. Policy improvement : follow the value gradient 39 / 46

Model-free Reinforcement learning Actor-Critic approaches Dynamic Programming and Actor-Critic (2) In AC, T and r are unknown and not represented (model-free) Information from the environment generates updates in the critic, then in the actor 40 / 46

Model-free Reinforcement learning Actor-Critic approaches Naive design Discrete states and actions, stochastic policy An update in the critic generates a local update in the actor Critic : compute δ and update V(s) with V_k(s) ← V_k(s) + α_k δ_k Actor : P^π(a | s) ← P^π(a | s) + α_k δ_k NB : no need for a max over actions NB2 : one must then know how to draw an action from a probabilistic policy (not obvious for continuous actions) 41 / 46
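A sketch of the naive actor-critic step described above (tables and names assumed; as the slide notes, the preference table P still has to be turned into a proper probability distribution, e.g. with a softmax, before drawing actions):

```python
def actor_critic_step(V, P, s, a, r_next, s_next, alpha=0.1, gamma=0.9):
    """Critic: compute delta and update V(s). Actor: nudge the preference of the taken action."""
    delta = r_next + gamma * V[s_next] - V[s]   # TD error delta_k
    V[s] += alpha * delta                       # critic: V_k(s) <- V_k(s) + alpha_k * delta_k
    P[(s, a)] += alpha * delta                  # actor: P(a|s) <- P(a|s) + alpha_k * delta_k
    return V, P
```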

Model-based reinforcement learning Eligibility traces To improve over Q-learning Naive approach : store all (s, a) pairs and back-propagate values Limited to finite-horizon trajectories Speed/memory trade-off TD(λ), Sarsa(λ) and Q(λ) : a more sophisticated approach to deal with infinite-horizon trajectories A variable e(s) is decayed with a factor λ after s was visited and reinitialised each time s is visited again TD(λ) : V(s) ← V(s) + α δ e(s) (similar for Sarsa(λ) and Q(λ)) If λ = 0, e(s) goes to 0 immediately, thus we get TD(0), Sarsa or Q-learning TD(1) = Monte Carlo ... 42 / 46
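A TD(λ) sketch with replacing traces, following the slide's description; decaying by γλ rather than λ alone is a common convention and an assumption here, as are the dict-based V and e tables:

```python
def td_lambda_step(V, e, s, r_next, s_next, alpha=0.1, gamma=0.9, lam=0.9):
    """Every recently visited state gets a share of the current TD error, weighted by its trace."""
    delta = r_next + gamma * V[s_next] - V[s]
    e[s] = 1.0                               # reinitialise the trace of the visited state
    for x in list(e):
        V[x] += alpha * delta * e[x]         # V(x) <- V(x) + alpha * delta * e(x)
        e[x] *= gamma * lam                  # decay all traces
    return V, e
```

With λ = 0 every trace vanishes immediately and the step reduces to TD(0); with λ = 1 it behaves like an incremental Monte Carlo update, as noted on the slide.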

Model-based reinforcement learning Model-based Reinforcement Learning General idea : planning with a learnt model of T and r amounts to performing back-ups in the agent's head ([Sutton, 1990a, Sutton, 1990b]) Learning T and r is an incremental self-supervised learning problem Several approaches : Draw random transitions from the model and apply TD back-ups Use Policy Iteration (Dyna-PI) or Q-learning (Dyna-Q) to get V or Q Dyna-AC also exists Better propagation : Prioritized Sweeping [Moore & Atkeson, 1993, Peng & Williams, 1992] 43 / 46
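A Dyna-Q sketch along these lines: real transitions both update Q and feed a (here deterministic) model, which is then replayed for a few planning back-ups per step. The environment interface and the ε-greedy helper are the same assumptions as in the Q-learning sketch:

```python
import random
from collections import defaultdict

def dyna_q(env, actions, episodes=200, planning_steps=10,
           alpha=0.1, gamma=0.9, epsilon=0.1):
    """Q-learning plus planning back-ups drawn from a learnt (s, a) -> (r, s') model."""
    Q, model = defaultdict(float), {}
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            a = epsilon_greedy(Q, s, actions, epsilon)
            s_next, reward, done = env.step(a)
            Q[(s, a)] += alpha * (reward + gamma * max(Q[(s_next, b)] for b in actions) - Q[(s, a)])
            model[(s, a)] = (reward, s_next)              # learn the model of T and r
            for _ in range(planning_steps):               # "back-ups in the agent's head"
                (ps, pa), (pr, ps2) = random.choice(list(model.items()))
                Q[(ps, pa)] += alpha * (pr + gamma * max(Q[(ps2, b)] for b in actions) - Q[(ps, pa)])
            s = s_next
    return Q
```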

Model-based reinforcement learning Dyna architecture and generalization (Dyna-like video (good model)) (Dyna-like video (bad model)) Thanks to the model of transitions, Dyna can propagate values more often Problem : in the stochastic case, the model of transitions is of size card(S) × card(S) × card(A) Hence the usefulness of compact models MACS [Gérard et al., 2005] : Dyna with generalisation (Learning Classifier Systems) SPITI [Degris et al., 2006] : Dyna with generalisation (Factored MDPs) 44 / 46

Model-based reinforcement learning Messages Dynamic programming and reinforcement learning methods can be split into pure actor, pure critic and actor-critic methods Dynamic programming, value iteration and policy iteration apply when the transition and reward functions are known Model-free RL is based on the TD error Actor-critic RL is a model-free, PI-like algorithm Model-based RL combines dynamic programming and model learning The continuous case is more complicated 45 / 46

Model-based reinforcement learning Any questions? 46 / 46

Model-based reinforcement learning Buffet, O. & Sigaud, O. (2008). Processus décisionnels de Markov en intelligence artificielle. Lavoisier. Dayan, P. & Sejnowski, T. (1994). TD(lambda) converges with probability 1. Machine Learning, 14(3). Degris, T., Sigaud, O., & Wuillemin, P.-H. (2006). Learning the Structure of Factored Markov Decision Processes in Reinforcement Learning Problems. In Proceedings of the 23rd International Conference on Machine Learning (ICML'2006), pages 257-264, CMU, Pennsylvania. Gérard, P., Meyer, J.-A., & Sigaud, O. (2005). Combining latent learning with dynamic programming in MACS. European Journal of Operational Research, 160:614-637. Moore, A. W. & Atkeson, C. (1993). Prioritized sweeping: Reinforcement learning with less data and less real time. Machine Learning, 13:103-130. Peng, J. & Williams, R. (1992). Efficient learning and planning within the DYNA framework. In Meyer, J.-A., Roitblat, H. L., & Wilson, S. W., editors, Proceedings of the Second International Conference on Simulation of Adaptive Behavior, pages 281-290, Cambridge, MA. MIT Press. Sigaud, O. & Buffet, O. (2010). Markov Decision Processes in Artificial Intelligence. ISTE - Wiley. Singh, S. P., Jaakkola, T., Littman, M. L., & Szepesvari, C. (2000). Convergence Results for Single-Step On-Policy Reinforcement Learning Algorithms. Machine Learning, 38(3):287-308. 46 / 46

Model-based reinforcement learning Sutton, R. S. (1990a). Integrating architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of the Seventh International Conference on Machine Learning (ICML'90), pages 216-224, San Mateo, CA. Morgan Kaufmann. Sutton, R. S. (1990b). Planning by incremental dynamic programming. In Proceedings of the Eighth International Conference on Machine Learning, pages 353-357, San Mateo, CA. Morgan Kaufmann. Sutton, R. S. & Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press. Watkins, C. J. C. H. (1989). Learning from Delayed Rewards. PhD thesis, Psychology Department, University of Cambridge, England. 46 / 46