Chapter 7: Eligibility Traces. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction


Midterm: mean = 77.33, median = 82.

N-step TD Prediction. Idea: look farther into the future when you do the TD backup (1, 2, 3, ..., n steps).

Mathematics of N-step TD Prediction.
Monte Carlo: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$
TD (1-step return): $R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$, using $V$ to estimate the remaining return.
2-step return: $R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$
n-step return: $R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$
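To make the notation concrete, here is a minimal Python sketch of the n-step return, computed from a recorded episode. The function name n_step_return and the list-based episode representation are illustrative choices, not from the slides.

```python
def n_step_return(rewards, values, t, n, gamma):
    """n-step return R_t^(n): sum the first n discounted rewards after time t,
    then bootstrap with the current value estimate V_t(s_{t+n}).

    rewards[k] holds r_{k+1} (the reward for leaving state s_k);
    values[k] holds V_t(s_k). If the episode ends within n steps,
    this reduces to the full Monte Carlo return R_t.
    """
    T = len(rewards)                 # episode terminates at time T
    g = 0.0
    for k in range(t, min(t + n, T)):
        g += gamma ** (k - t) * rewards[k]
    if t + n < T:                    # bootstrap only if the sum was truncated
        g += gamma ** n * values[t + n]
    return g
```

With n = 1 this is the familiar TD target $r_{t+1} + \gamma V_t(s_{t+1})$; with n at least T - t it is the Monte Carlo return.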

Learning with N-step Backups.
Backup (on-line or off-line): $\Delta V_t(s_t) = \alpha \left[ R_t^{(n)} - V_t(s_t) \right]$
Error reduction property of n-step returns: $\max_s \left| \mathbb{E}_\pi\{ R_t^{(n)} \mid s_t = s \} - V^\pi(s) \right| \le \gamma^n \max_s \left| V(s) - V^\pi(s) \right|$
(the maximum error using the n-step return is at most $\gamma^n$ times the maximum error using $V$). Using this, you can show that n-step methods converge.
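The bound can be seen by expanding both sides n steps under $\pi$; a short derivation sketch in my notation, not on the slide:
$$\mathbb{E}_\pi\{ R_t^{(n)} \mid s_t = s \} = \mathbb{E}_\pi\{ r_{t+1} + \cdots + \gamma^{n-1} r_{t+n} \mid s_t = s \} + \gamma^n\, \mathbb{E}_\pi\{ V(s_{t+n}) \mid s_t = s \},$$
$$V^\pi(s) = \mathbb{E}_\pi\{ r_{t+1} + \cdots + \gamma^{n-1} r_{t+n} \mid s_t = s \} + \gamma^n\, \mathbb{E}_\pi\{ V^\pi(s_{t+n}) \mid s_t = s \}.$$
Subtracting, the reward terms cancel, so the difference is $\gamma^n\, \mathbb{E}_\pi\{ V(s_{t+n}) - V^\pi(s_{t+n}) \mid s_t = s \}$, whose magnitude is at most $\gamma^n \max_{s'} \left| V(s') - V^\pi(s') \right|$.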

Random Walk Examples. How does 2-step TD work here? How about 3-step TD?

A Larger Example. Task: 19-state random walk. Do you think there is an optimal n (for everything)?

Averaging N-step Returns. n-step methods were introduced to help with understanding TD(λ). Idea: back up an average of several returns, e.g. half of the 2-step return and half of the 4-step return:
$R_t^{\mathrm{avg}} = \tfrac{1}{2} R_t^{(2)} + \tfrac{1}{2} R_t^{(4)}$
This is still one backup, called a complex backup. To draw its backup diagram, draw each component and label it with the weight for that component.

Forward View of TD(λ). TD(λ) is a method for averaging all n-step backups, weighting the n-step return by $\lambda^{n-1}$.
λ-return: $R_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} R_t^{(n)}$
Backup using the λ-return: $\Delta V_t(s_t) = \alpha \left[ R_t^\lambda - V_t(s_t) \right]$

λ-return Weighting Function.

Relation to TD(0) and MC. For an episode that terminates at time $T$, the λ-return can be rewritten as
$R_t^\lambda = (1 - \lambda) \sum_{n=1}^{T-t-1} \lambda^{n-1} R_t^{(n)} + \lambda^{T-t-1} R_t$
(the sum covers the returns until termination; the final term covers everything after termination).
If $\lambda = 1$, you get MC: $R_t^\lambda = R_t$.
If $\lambda = 0$, you get TD(0): $R_t^\lambda = R_t^{(1)}$.
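A small Python sketch of this finite-episode λ-return, reusing the hypothetical n_step_return helper from the earlier sketch; with lam = 0 it returns the TD(0) target and with lam = 1 the Monte Carlo return, matching the two special cases above.

```python
def lambda_return(rewards, values, t, gamma, lam):
    """Finite-episode lambda-return:
    R_t^lam = (1-lam) * sum_{n=1}^{T-t-1} lam^(n-1) R_t^(n)  +  lam^(T-t-1) R_t.
    """
    T = len(rewards)
    g = 0.0
    for n in range(1, T - t):          # truncated n-step components
        g += (1 - lam) * lam ** (n - 1) * n_step_return(rewards, values, t, n, gamma)
    # the remaining weight goes to the full Monte Carlo return (n = T - t)
    g += lam ** (T - t - 1) * n_step_return(rewards, values, t, T - t, gamma)
    return g
```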

Forward View of TD(λ) II. Look forward from each state to determine the update from future states and rewards.

λ-return on the Random Walk. Same 19-state random walk as before. Why do you think intermediate values of λ are best?

Backward View of TD(λ). The forward view was for theory; the backward view is for mechanism. New variable called the eligibility trace, $e_t(s) \in \mathbb{R}^+$. On each step, decay all traces by $\gamma\lambda$ and increment the trace for the current state by 1.
Accumulating trace:
$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & \text{if } s \ne s_t \\ \gamma\lambda\, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$

On-line Tabular TD(λ):
Initialize V(s) arbitrarily and e(s) = 0, for all s ∈ S
Repeat (for each episode):
    Initialize s
    Repeat (for each step of episode):
        a ← action given by π for s
        Take action a, observe reward r and next state s'
        δ ← r + γV(s') − V(s)
        e(s) ← e(s) + 1
        For all s:
            V(s) ← V(s) + αδe(s)
            e(s) ← γλe(s)
        s ← s'
    until s is terminal
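A runnable Python sketch of the algorithm above with accumulating traces; the environment interface (env.reset(), env.step(a) returning (next_state, reward, done)) and the policy callable are illustrative assumptions, not part of the slides.

```python
from collections import defaultdict

def td_lambda(env, policy, num_episodes, alpha=0.1, gamma=1.0, lam=0.9):
    """On-line tabular TD(lambda) with accumulating traces."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        e = defaultdict(float)              # eligibility traces, reset each episode
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)
            # delta = r + gamma*V(s') - V(s), with V(terminal) taken to be 0
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
            e[s] += 1.0                     # accumulating trace for the current state
            for x in list(e):               # update every state with a nonzero trace
                V[x] += alpha * delta * e[x]
                e[x] *= gamma * lam         # decay all traces by gamma*lambda
            s = s_next
    return V
```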

Backward View.
$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$
Shout $\delta_t$ backwards over time; the strength of your voice decreases with temporal distance by $\gamma\lambda$.

Relation of the Backward View to MC and TD(0). Using the update rule $\Delta V_t(s) = \alpha\, \delta_t\, e_t(s)$: as before, if you set λ to 0 you get TD(0). If you set λ to 1 you get MC, but in a better way: TD(1) can be applied to continuing tasks, and it works incrementally and on-line (instead of waiting until the end of the episode).

Forward View = Backward View. The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating. The book shows that, for every state $s$,
$\sum_{t=0}^{T-1} \Delta V_t^{TD}(s) = \sum_{t=0}^{T-1} \alpha\, I_{s s_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t}\, \delta_k$ (backward updates)
$\sum_{t=0}^{T-1} \Delta V_t^{\lambda}(s_t)\, I_{s s_t} = \sum_{t=0}^{T-1} \alpha\, I_{s s_t} \sum_{k=t}^{T-1} (\gamma\lambda)^{k-t}\, \delta_k$ (forward updates)
so the two totals are equal; the algebra is shown in the book. On-line updating with small α is similar.

On-line versus Off-line on the Random Walk. Same 19-state random walk. On-line updating performs better over a broader range of parameters.

Control: Sarsa(λ). Save eligibility traces for state-action pairs instead of just states:
$e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 & \text{if } s = s_t \text{ and } a = a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$
$Q_{t+1}(s,a) = Q_t(s,a) + \alpha\, \delta_t\, e_t(s,a)$
$\delta_t = r_{t+1} + \gamma\, Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)$

Sarsa(λ) Algorithm:
Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s'
        Choose a' from s' using policy derived from Q (e.g. ε-greedy)
        δ ← r + γQ(s',a') − Q(s,a)
        e(s,a) ← e(s,a) + 1
        For all s, a:
            Q(s,a) ← Q(s,a) + αδe(s,a)
            e(s,a) ← γλe(s,a)
        s ← s'; a ← a'
    until s is terminal
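The same backward-view machinery applied to control, as a Python sketch; the ε-greedy action selection and the environment interface are the same illustrative assumptions as in the TD(λ) sketch.

```python
import random
from collections import defaultdict

def sarsa_lambda(env, actions, num_episodes, alpha=0.1, gamma=1.0,
                 lam=0.9, epsilon=0.1):
    """Tabular Sarsa(lambda) with accumulating traces over state-action pairs."""
    Q = defaultdict(float)

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        e = defaultdict(float)
        s = env.reset()
        a = eps_greedy(s)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            a_next = eps_greedy(s_next)
            delta = r + (0.0 if done else gamma * Q[(s_next, a_next)]) - Q[(s, a)]
            e[(s, a)] += 1.0                  # accumulating trace for (s, a)
            for sa in list(e):
                Q[sa] += alpha * delta * e[sa]
                e[sa] *= gamma * lam
            s, a = s_next, a_next
    return Q
```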

Sarsa(λ) Gridworld Example. After a single trial, the agent already has much more information about how to get to the goal (though not necessarily the best way). Traces can considerably accelerate learning.

Three Approaches to Q(λ). How can we extend this to Q-learning? If you mark every state-action pair as eligible, you back up over a non-greedy policy. Watkins: zero out the eligibility trace after a non-greedy action; take the max when backing up at the first non-greedy choice.
$e_t(s,a) = \begin{cases} 1 + \gamma\lambda\, e_{t-1}(s,a) & \text{if } s = s_t,\ a = a_t,\ Q_{t-1}(s_t,a_t) = \max_a Q_{t-1}(s_t,a) \\ 0 & \text{if } Q_{t-1}(s_t,a_t) \ne \max_a Q_{t-1}(s_t,a) \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$
$Q_{t+1}(s,a) = Q_t(s,a) + \alpha\, \delta_t\, e_t(s,a)$
$\delta_t = r_{t+1} + \gamma \max_{a'} Q_t(s_{t+1}, a') - Q_t(s_t, a_t)$

Watkins's Q(λ):
Initialize Q(s,a) arbitrarily and e(s,a) = 0, for all s, a
Repeat (for each episode):
    Initialize s, a
    Repeat (for each step of episode):
        Take action a, observe r, s'
        Choose a' from s' using policy derived from Q (e.g. ε-greedy)
        a* ← argmax_b Q(s',b)  (if a' ties for the max, then a* ← a')
        δ ← r + γQ(s',a*) − Q(s,a)
        e(s,a) ← e(s,a) + 1
        For all s, a:
            Q(s,a) ← Q(s,a) + αδe(s,a)
            If a' = a*, then e(s,a) ← γλe(s,a); else e(s,a) ← 0
        s ← s'; a ← a'
    until s is terminal
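A sketch of the inner step of Watkins's Q(λ) in Python, showing what changes relative to the Sarsa(λ) sketch above: the backup uses the greedy action, and all traces are cut when an exploratory action is taken. The helper names and environment interface are the same illustrative assumptions as before.

```python
def watkins_q_lambda_step(env, Q, e, s, a, actions, eps_greedy,
                          alpha=0.1, gamma=1.0, lam=0.9):
    """One step of Watkins's Q(lambda); returns (s', a', done)."""
    s_next, r, done = env.step(a)
    a_next = eps_greedy(s_next)
    a_star = max(actions, key=lambda b: Q[(s_next, b)])
    if Q[(s_next, a_next)] == Q[(s_next, a_star)]:
        a_star = a_next                       # a tie for the max counts as greedy
    delta = r + (0.0 if done else gamma * Q[(s_next, a_star)]) - Q[(s, a)]
    e[(s, a)] += 1.0
    for sa in list(e):
        Q[sa] += alpha * delta * e[sa]
        if a_next == a_star:
            e[sa] *= gamma * lam              # greedy step: decay traces as usual
        else:
            e[sa] = 0.0                       # exploratory step: cut all traces
    return s_next, a_next, done
```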

Peng's Q(λ). Disadvantage of Watkins's method: early in learning, the eligibility trace will be cut (zeroed out) frequently, giving little advantage from the traces. Peng: back up the max action except at the end, and never cut traces. Disadvantage: complicated to implement.

Naïve Q(λ). Idea: is it really a problem to back up exploratory actions? Never zero the traces; always back up the max at the current action (unlike Peng's or Watkins's). Is this truly naïve? It works well in preliminary empirical studies. What is the backup diagram?

Comparison Task. Compared Watkins's, Peng's, and Naïve (called McGovern's here) Q(λ) on several tasks. See McGovern and Sutton (1997), Towards a Better Q(λ), for other tasks and results (stochastic tasks, continuing tasks, etc.). Deterministic gridworld with obstacles: 10x10 gridworld, 25 randomly generated obstacles, 30 runs; α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05, accumulating traces. From McGovern and Sutton (1997), Towards a Better Q(λ).

Comparison Results. From McGovern and Sutton (1997), Towards a Better Q(λ).

Convergence of the Q(λ)s. None of the methods is proven to converge (much extra credit if you can prove any of them). Watkins's is thought to converge to Q*. Peng's is thought to converge to a mixture of Q^π and Q*. Naïve: Q*?

Eligibility Traces for Actor-Critic Methods. Critic: on-policy learning of $V^\pi$; use TD(λ) as described before. Actor: needs eligibility traces for each state-action pair. We change the update equation
$p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \alpha\, \delta_t & \text{if } a = a_t \text{ and } s = s_t \\ p_t(s,a) & \text{otherwise} \end{cases}$
to $p_{t+1}(s,a) = p_t(s,a) + \alpha\, \delta_t\, e_t(s,a)$.
We can change the other actor-critic update
$p_{t+1}(s,a) = \begin{cases} p_t(s,a) + \alpha\, \delta_t \left[ 1 - \pi_t(s_t,a_t) \right] & \text{if } a = a_t \text{ and } s = s_t \\ p_t(s,a) & \text{otherwise} \end{cases}$
to $p_{t+1}(s,a) = p_t(s,a) + \alpha\, \delta_t\, e_t(s,a)$, where
$e_t(s,a) = \begin{cases} \gamma\lambda\, e_{t-1}(s,a) + 1 - \pi_t(s_t,a_t) & \text{if } s = s_t \text{ and } a = a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{otherwise} \end{cases}$
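For concreteness, a minimal sketch of a tabular actor-critic with eligibility traces using the second variant above: a softmax actor over preferences p(s, a) whose trace adds $1 - \pi_t(s_t, a_t)$ for the chosen pair, and a TD(λ) critic. The softmax parameterization, step sizes, and environment interface are illustrative assumptions, not prescribed by the slide.

```python
import math
import random
from collections import defaultdict

def actor_critic_lambda(env, actions, num_episodes, alpha=0.1, beta=0.1,
                        gamma=1.0, lam=0.9):
    """Tabular actor-critic with eligibility traces for both critic and actor."""
    V = defaultdict(float)                      # critic: state values
    p = defaultdict(float)                      # actor: action preferences

    def action_probs(s):
        z = [math.exp(p[(s, a)]) for a in actions]
        total = sum(z)
        return [w / total for w in z]

    for _ in range(num_episodes):
        e_v = defaultdict(float)                # critic traces (per state)
        e_p = defaultdict(float)                # actor traces (per state-action pair)
        s = env.reset()
        done = False
        while not done:
            probs = action_probs(s)
            a = random.choices(actions, weights=probs)[0]
            s_next, r, done = env.step(a)
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
            e_v[s] += 1.0                       # accumulating trace for the critic
            e_p[(s, a)] += 1.0 - probs[actions.index(a)]
            for x in list(e_v):
                V[x] += alpha * delta * e_v[x]
                e_v[x] *= gamma * lam
            for sa in list(e_p):
                p[sa] += beta * delta * e_p[sa]
                e_p[sa] *= gamma * lam
            s = s_next
    return V, p
```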

Replacing Traces. With accumulating traces, frequently visited states can have eligibilities greater than 1; this can be a problem for convergence. Replacing traces: instead of adding 1 when you visit a state, set that trace to 1:
$e_t(s) = \begin{cases} \gamma\lambda\, e_{t-1}(s) & \text{if } s \ne s_t \\ 1 & \text{if } s = s_t \end{cases}$
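The difference from accumulating traces is a single line; a small helper sketch (hypothetical name) that performs the decay-then-mark step either way:

```python
def update_traces(e, s_t, gamma, lam, replacing=True):
    """Decay all traces in the dict e by gamma*lambda, then mark the visited
    state s_t: replacing traces reset it to 1, accumulating traces add 1
    (and can therefore exceed 1)."""
    for s in list(e):
        e[s] *= gamma * lam
    e[s_t] = 1.0 if replacing else e.get(s_t, 0.0) + 1.0
    return e
```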

Replacing Traces Example. Same 19-state random walk task as before. Replacing traces perform better than accumulating traces over more values of λ.

Why Replacing Traces? Replacing traces can significantly speed learning. They can make the system perform well for a broader set of parameters. Accumulating traces can do poorly on certain types of tasks. Why is this task particularly onerous for accumulating traces?

More Replacing Traces. Off-line replacing-trace TD(1) is identical to first-visit MC. Extension to action values: when you revisit a state, what should you do with the traces for the other actions? Singh and Sutton say to set them to zero:
$e_t(s,a) = \begin{cases} 1 & \text{if } s = s_t \text{ and } a = a_t \\ 0 & \text{if } s = s_t \text{ and } a \ne a_t \\ \gamma\lambda\, e_{t-1}(s,a) & \text{if } s \ne s_t \end{cases}$
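A corresponding sketch of this action-value variant: on visiting (s_t, a_t), the traces of the other actions in s_t are zeroed rather than merely decayed. The helper name is illustrative.

```python
def mark_action_trace(e, s_t, a_t, actions):
    """Assumes traces of pairs whose state differs from s_t have already been
    decayed by gamma*lambda elsewhere (e.g. in the main Sarsa(lambda) loop)."""
    for a in actions:
        e[(s_t, a)] = 0.0      # zero the other actions of the revisited state
    e[(s_t, a_t)] = 1.0        # replacing trace for the visited pair
    return e
```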

Implementation Issues. Eligibility traces could require much more computation, but most traces are very close to zero. If you implement it in Matlab, the backup is only one line of code and is very fast (Matlab is optimized for matrices).
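One common way to keep the per-step cost small, in line with the observation above, is to store only traces that are not numerically negligible; a sketch, where the cutoff value is an illustrative choice rather than anything from the slides:

```python
def sparse_backup(V, e, delta, alpha, gamma, lam, cutoff=1e-4):
    """One backward-view update over a dict of traces, pruning tiny entries
    so the loop only touches recently visited states."""
    for s in list(e):
        V[s] += alpha * delta * e[s]
        e[s] *= gamma * lam
        if e[s] < cutoff:
            del e[s]           # trace has decayed to (near) zero; drop it
```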

Variable λ. The traces can be generalized to a variable λ:
$e_t(s) = \begin{cases} \gamma\lambda_t\, e_{t-1}(s) & \text{if } s \ne s_t \\ \gamma\lambda_t\, e_{t-1}(s) + 1 & \text{if } s = s_t \end{cases}$
Here λ is a function of time; we could define $\lambda_t = \lambda(s_t)$, for example, or let λ vary with the time step.

Conclusions. Eligibility traces provide an efficient, incremental way to combine MC and TD. They include advantages of MC (they can deal with a lack of the Markov property) and advantages of TD (they use the TD error and bootstrap). They can significantly speed learning, but they do have a cost in computation.

Something Here is Not Like the Other.