Temporal difference learning

AI & Agents for IET
Lecturer: S Luz
http://www.scss.tcd.ie/~luzs/t/cs7032/
February 4, 2014

Recall background & assumptions

The environment is a finite MDP (i.e. A and S are finite). The MDP's dynamics are defined by transition probabilities

$P^a_{ss'} = P(s_{t+1} = s' \mid s_t = s, a_t = a)$

and expected immediate rewards

$R^a_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}$

Goal: to search for good policies π.

DP strategy: use value functions to structure the search:

$V^*(s) = \max_a \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V^*(s')]$  or  $Q^*(s, a) = \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma \max_{a'} Q^*(s', a')]$

MC strategy: expand each episode and keep averages of returns per state.

Temporal Difference (TD) Learning

TD is a combination of ideas from Monte Carlo methods and DP methods:
- Like MC, TD can learn directly from raw experience, without a model of the environment's dynamics.
- Like DP, TD methods update estimates based in part on other learned estimates, without waiting for a final outcome (i.e. they bootstrap).

Some widely used TD methods: TD(0), TD(λ), Sarsa, Q-learning, Actor-critic, R-learning.
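As a concrete illustration of the notation above, a small finite MDP can be stored as arrays indexed by (a, s, s'), and the DP strategy then amounts to repeatedly applying the Bellman optimality backup (value iteration). The sketch below is a minimal Python illustration; the two-state MDP, its numbers, and the function name are invented for the example, not taken from the lecture.

import numpy as np

# A made-up 2-state, 2-action MDP: P[a, s, s'] and R[a, s, s']
n_states, n_actions, gamma = 2, 2, 0.9
P = np.array([[[0.8, 0.2], [0.1, 0.9]],     # action 0
              [[0.5, 0.5], [0.3, 0.7]]])    # action 1
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 0.5], [1.0, 0.0]]])

def bellman_optimality_backup(V):
    """One full backup of V*(s) = max_a sum_s' P[a,s,s'] (R[a,s,s'] + gamma V*(s'))."""
    return np.max(np.sum(P * (R + gamma * V), axis=2), axis=0)

# Value iteration: apply the backup until V stops changing
V = np.zeros(n_states)
for _ in range(1000):
    V_new = bellman_optimality_backup(V)
    if np.max(np.abs(V_new - V)) < 1e-8:
        break
    V = V_new
print(V)

TD methods, in contrast, will estimate such value functions without access to P and R.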

Prediction & Control

Recall value function estimation (prediction) for DP and MC.

Update for DP:

$V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s\}$

Update rule for MC:

$V(s_t) \leftarrow V(s_t) + \alpha[R_t - V(s_t)]$

(Q: why are the value functions above merely estimates?)

Prediction in TD(0)

Prediction (update of estimates of V) for the TD(0) method is done as follows:

$V(s_t) \leftarrow V(s_t) + \alpha[r_{t+1} + \gamma V(s_{t+1}) - V(s_t)]$   (1)

where $r_{t+1} + \gamma V(s_{t+1})$ is a 1-step estimate of the return $R_t$.

- No need to wait until the end of the episode (as in MC) to update V.
- No need to sweep across the possible successor states (as in DP).

Tabular TD(0) value function estimation

Use sample backups (as in MC) rather than full backups (as in DP):

Algorithm 1: TD(0)
  Initialisation, for all s ∈ S:
    V(s) ← arbitrary value
    α ← learning constant
    γ ← discount factor
  For each episode E until stop criterion met:
    Initialise s
    For each step of E:
      a ← π(s)
      Take action a; observe reward r and next state s'
      V(s) ← V(s) + α[r + γV(s') − V(s)]
      s ← s'
    until s is a terminal state
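Algorithm 1 translates almost line for line into Python. The sketch below is illustrative only: it assumes a hypothetical environment object with reset() and step(a) methods returning (next state, reward, done), and a policy given as a function from states to actions; these interfaces are assumptions made for the sketch, not part of the lecture notes.

from collections import defaultdict

def tabular_td0(env, policy, n_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0) prediction of V^pi (Algorithm 1).
    Assumes env.reset() -> s and env.step(a) -> (s', r, done)."""
    V = defaultdict(float)               # arbitrary initial values (here 0)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:                  # for each step of the episode
            a = policy(s)
            s_next, r, done = env.step(a)
            # TD(0) update: V(s) <- V(s) + alpha [r + gamma V(s') - V(s)]
            target = r + (0.0 if done else gamma * V[s_next])
            V[s] += alpha * (target - V[s])
            s = s_next
    return V

Treating the value of a terminal successor state as 0 here mirrors the convention used for Q(s_{t+1}, a_{t+1}) in the Sarsa slide below.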

Example: random walk

Estimate V for a random walk in which $P^a_{ss'} = 0.5$ for all non-terminal states and $R^a_{ss'} = 0$, except for transitions into the terminal state on the right, for which the reward is 1.

[Figure: comparison between MC and TD(0) over 100 episodes; RMS error plotted against episodes.]

Focusing on action-value function estimates

The control (algorithm) part:
- Exploration vs. exploitation trade-off
- On-policy and off-policy methods
- Several variants, e.g.:
  - Q-learning: off-policy method (Watkins and Dayan, 1992)
  - Modified Q-learning (aka Sarsa): on-policy method

Sarsa

Transitions from non-terminal states update Q as follows:

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha[r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]$   (2)

[Backup diagram: the sequence $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, \ldots$ underlying the Sarsa update.]

(For terminal states $s_{t+1}$, assign $Q(s_{t+1}, a_{t+1}) \leftarrow 0$.)

Algorithm 2: Sarsa
  Initialise Q(s, a) arbitrarily
  For each episode:
    Initialise s
    Choose a from s using a policy π derived from Q (e.g. ε-greedy)
    Repeat (for each step of the episode):
      Perform a; observe r, s'
      Choose a' from s' using π derived from Q (e.g. ε-greedy)
      Q(s, a) ← Q(s, a) + α[r + γQ(s', a') − Q(s, a)]
      s ← s'; a ← a'
    until s is a terminal state
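A compact Python rendering of Algorithm 2, with the ε-greedy policy derived from Q made explicit, might look as follows. It uses the same assumed env interface as the TD(0) sketch above; the epsilon_greedy helper and the n_actions argument are inventions of this sketch.

import random
from collections import defaultdict

def epsilon_greedy(Q, s, n_actions, epsilon=0.1):
    """Pick a random action with probability epsilon, else a greedy one."""
    if random.random() < epsilon:
        return random.randrange(n_actions)
    return max(range(n_actions), key=lambda a: Q[(s, a)])

def sarsa(env, n_actions, n_episodes=500, alpha=0.1, gamma=1.0, epsilon=0.1):
    """On-policy TD control (Algorithm 2).
    Assumes env.reset() -> s and env.step(a) -> (s', r, done)."""
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, n_actions, epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = epsilon_greedy(Q, s_next, n_actions, epsilon)
            # terminal convention: Q(s_{t+1}, a_{t+1}) = 0
            target = r + (0.0 if done else gamma * Q[(s_next, a_next)])
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s, a = s_next, a_next
    return Q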

Q-learning

An off-policy method: directly approximate Q*, independently of the policy being followed:

$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha[r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t)]$   (3)

Algorithm 3: Q-Learning
  Initialise Q(s, a) arbitrarily
  For each episode:
    Initialise s
    Repeat (for each step of the episode):
      Choose a from s using a policy π derived from Q (e.g. ε-greedy)
      Perform a; observe r, s'
      Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
      s ← s'
    until s is a terminal state

A simple comparison

A comparison between Q-learning and Sarsa on the cliff walking problem (Sutton and Barto, 1998).

[Figure: the cliff-walking gridworld with the safe path and the optimal path marked, and a plot of average reward per episode for the two methods.]

Why does Q-learning find the optimal path? Why is its average reward per episode worse than Sarsa's?

Convergence theorems

Convergence, in the following sense: Q converges in the limit to the optimal Q function, provided the system can be modelled as a deterministic Markov process, r is bounded, and π visits every action-state pair infinitely often.

This has been proved for the TD methods described above: Q-learning (Watkins and Dayan, 1992), Sarsa and TD(0) (Sutton and Barto, 1998). General proofs based on stochastic approximation theory can be found in (Bertsekas and Tsitsiklis, 1996).
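Algorithm 3 differs from the Sarsa sketch only in its target, which bootstraps from the greedy action value max_a Q(s', a) rather than from the action actually chosen next. A self-contained Python sketch, under the same assumed env interface as above:

import random
from collections import defaultdict

def q_learning(env, n_actions, n_episodes=500, alpha=0.1, gamma=1.0, epsilon=0.1):
    """Off-policy TD control (Algorithm 3): behave epsilon-greedily,
    but bootstrap from the greedy action value."""
    Q = defaultdict(float)
    for _ in range(n_episodes):
        s = env.reset()
        done = False
        while not done:
            # epsilon-greedy behaviour policy derived from Q
            if random.random() < epsilon:
                a = random.randrange(n_actions)
            else:
                a = max(range(n_actions), key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            best_next = max(Q[(s_next, act)] for act in range(n_actions))
            target = r + (0.0 if done else gamma * best_next)
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            s = s_next
    return Q

The only change relative to the Sarsa sketch is the line computing the target.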

Summary of methods

[Figure: summary of the methods covered. Exhaustive search and dynamic programming use full backups; Monte Carlo and temporal-difference methods use sample backups. Temporal difference uses shallow backups (bootstrapping), Monte Carlo deep backups, with λ spanning the range of depths in between.]

TD(λ)

Basic idea: if with TD(0) we backed up our estimates based on one step ahead, why not generalise this to include n steps ahead?

[Figure: backup diagrams for the 1-step (TD(0)), 2-step, ..., n-step, ..., up to the Monte Carlo return.]

Averaging backups

Consider, for instance, averaging 2- and 4-step backups:

$R^{\mu}_t = 0.5 R^2_t + 0.5 R^4_t$

TD(λ) is a method for averaging all n-step backups, weighting each by $\lambda^{n-1}$ ($0 \le \lambda \le 1$):

$R^{\lambda}_t = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R^n_t$

Backup using the λ-return:

$\Delta V_t(s_t) = \alpha [R^{\lambda}_t - V_t(s_t)]$

(Sutton and Barto, 1998) call this the Forward View of TD(λ).

TD(λ) backup structure

[Figure: the TD(λ) backup diagram, combining the n-step backups for n = 1, 2, ... with weights (1−λ)λ^{n−1}.]

The weights sum to 1: $\lim_{n \to \infty} \sum_{i=1}^{n} (1-\lambda)\lambda^{i-1} = 1$.

Backup value (λ-return):

$R^{\lambda}_t = (1-\lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R^n_t$, or, for an episode terminating at time T,

$R^{\lambda}_t = (1-\lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R^n_t + \lambda^{T-t-1} R_t$

Implementing TD(λ)

We need a way to accumulate the effect of the trace-decay parameter λ. Eligibility traces:

$e_t(s) = \begin{cases} \gamma\lambda e_{t-1}(s) + 1 & \text{if } s = s_t \\ \gamma\lambda e_{t-1}(s) & \text{otherwise} \end{cases}$   (4)

The TD error for state-value prediction is:

$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$   (5)

Sutton and Barto call this the Backward View of TD(λ).
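To make the forward view concrete, the λ-return can be computed offline from a recorded episode by combining the truncated n-step returns as in the formula above. The function below is a small illustrative sketch whose name and argument conventions are invented here; it is conceptual only, since the forward view needs the whole episode before any state can be updated.

def lambda_return(rewards, values, t, lam, gamma):
    """Compute R^lambda_t for an episode of length T = len(rewards).

    rewards[k] is r_{k+1}, the reward received on leaving state s_k;
    values[k] is the current estimate V(s_k), with values[T] = 0 (terminal).
    """
    T = len(rewards)

    def n_step_return(n):
        # R^n_t = r_{t+1} + ... + gamma^{n-1} r_{t+n} + gamma^n V(s_{t+n})
        n = min(n, T - t)                 # truncate at the end of the episode
        ret = sum(gamma**k * rewards[t + k] for k in range(n))
        return ret + gamma**n * values[t + n]

    # R^lambda_t = (1-lam) sum_{n=1}^{T-t-1} lam^{n-1} R^n_t + lam^{T-t-1} R_t
    total = (1 - lam) * sum(lam**(n - 1) * n_step_return(n)
                            for n in range(1, T - t))
    return total + lam**(T - t - 1) * n_step_return(T - t)

Setting lam=0 recovers the one-step TD(0) target, and lam=1 recovers the full Monte Carlo return, as expected from the formula.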

Equivalence

It can be shown (Sutton and Barto, 1998) that the Forward View of TD(λ) and the Backward View of TD(λ) are equivalent.

Tabular TD(λ)

Algorithm 4: On-line TD(λ)
  Initialise V(s) arbitrarily
  e(s) ← 0 for all s ∈ S
  For each episode:
    Initialise s
    Repeat (for each step of the episode):
      Choose a according to π
      Perform a; observe r, s'
      δ ← r + γV(s') − V(s)
      e(s) ← e(s) + 1
      For all s:
        V(s) ← V(s) + αδe(s)
        e(s) ← γλe(s)
      s ← s'
    until s is a terminal state

Similar algorithms can be implemented for control (Sarsa, Q-learning), using eligibility traces.

Control algorithms: SARSA(λ)

Algorithm 5: SARSA(λ)
  Initialise Q(s, a) arbitrarily
  e(s, a) ← 0 for all s ∈ S and a ∈ A
  For each episode:
    Initialise s, a
    Repeat (for each step of the episode):
      Perform a; observe r, s'
      Choose a' from s' using a policy derived from Q (e.g. ε-greedy)
      δ ← r + γQ(s', a') − Q(s, a)
      e(s, a) ← e(s, a) + 1
      For all s, a:
        Q(s, a) ← Q(s, a) + αδe(s, a)
        e(s, a) ← γλe(s, a)
      s ← s'; a ← a'
    until s is a terminal state

Notes based on (Sutton and Barto, 1998, ch 6).

References

Bertsekas, D. P. and Tsitsiklis, J. N. (1996). Neuro-Dynamic Programming. Athena Scientific, Belmont.
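Algorithm 4 can be rendered in Python roughly as follows, reusing the assumed env/policy interfaces from the earlier sketches. It uses accumulating traces as in equation (4); resetting the traces at the start of every episode is a common choice made in this sketch.

from collections import defaultdict

def online_td_lambda(env, policy, n_episodes=1000,
                     alpha=0.1, gamma=1.0, lam=0.9):
    """Tabular on-line TD(lambda) prediction with accumulating traces
    (Algorithm 4). Assumes env.reset() -> s, env.step(a) -> (s', r, done)."""
    V = defaultdict(float)
    for _ in range(n_episodes):
        e = defaultdict(float)           # eligibility traces, reset each episode
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # eq. (5)
            e[s] += 1.0                                               # eq. (4), s = s_t
            for state in list(e):        # update every state with a non-zero trace
                V[state] += alpha * delta * e[state]
                e[state] *= gamma * lam  # eq. (4), decay
            s = s_next
    return V

SARSA(λ) (Algorithm 5) follows the same pattern with traces e(s, a) indexed by state-action pairs.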

Sutton, R. S. and Barto, A. G. (1998). Reinforcement Learning: An Introduction. MIT Press, Cambridge, MA.

Watkins, C. J. C. H. and Dayan, P. (1992). Technical note: Q-learning. Machine Learning, 8:279-292.