CSE 190: Reinforcement Learning: An Introduction. Chapter 7: Eligibility Traces.


CSE 190: Reinforcement Learning: An Introduction. Chapter 7: Eligibility Traces. Acknowledgment: a good number of these slides are cribbed from Rich Sutton.

The Book: Where We Are and Where We're Going. Part I, The Problem: Introduction; Evaluative Feedback; The Reinforcement Learning Problem. Part II, Elementary Solution Methods: Dynamic Programming; Monte Carlo Methods; Temporal Difference Learning. Part III, A Unified View: Eligibility Traces; Generalization and Function Approximation; Planning and Learning; Dimensions of Reinforcement Learning; Case Studies.

Simple Monte Carlo. V(s_t) ← V(s_t) + α [R_t − V(s_t)], where R_t is the actual return following state s_t.
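To fix notation for the rest of these notes, here is a minimal sketch of that constant-α Monte Carlo update in Python; the tabular value function stored as a dict and the (state, reward) episode format are my own conventions, not something prescribed by the slides.

```python
# Constant-alpha Monte Carlo: V(s_t) <- V(s_t) + alpha * (R_t - V(s_t)),
# where R_t is the discounted return actually observed after visiting s_t.
from collections import defaultdict

def mc_update(V, episode, alpha=0.1, gamma=1.0):
    """episode is a list of (state, reward) pairs, with the reward received
    on the transition out of that state."""
    G = 0.0
    for state, reward in reversed(episode):   # walk backwards so G is the return from `state`
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])    # every-visit, constant-alpha MC update

V = defaultdict(float)
mc_update(V, [("A", 0.0), ("B", 0.0), ("C", 1.0)])
print(dict(V))
```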

Simplest TD Method. V(s_t) ← V(s_t) + α [r_{t+1} + γ V(s_{t+1}) − V(s_t)]. Is there something in between one-step TD and Monte Carlo?

N-step TD Prediction. Idea: look farther into the future when you do the TD backup (1, 2, 3, ..., n steps).

Mathematics of N-step TD Prediction.
Monte Carlo: R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... + γ^{T−t−1} r_T.
TD (one-step return): R_t^{(1)} = r_{t+1} + γ V_t(s_{t+1}), i.e., use V to estimate the remaining return.
n-step TD:
  2-step return: R_t^{(2)} = r_{t+1} + γ r_{t+2} + γ^2 V_t(s_{t+2})
  n-step return: R_t^{(n)} = r_{t+1} + γ r_{t+2} + ... + γ^{n−1} r_{t+n} + γ^n V_t(s_{t+n})
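As a concrete reading of the n-step return formula, the sketch below (my own illustration, not from the slides) computes R_t^{(n)} from a recorded episode, falling back to the complete return when fewer than n steps remain before termination.

```python
def n_step_return(rewards, values, t, n, gamma=1.0):
    """n-step return R_t^(n) for one episode.

    rewards[k] is r_{k+1}, the reward received on the step out of state s_k;
    values[k] is the current estimate V(s_k); the terminal value is treated as 0."""
    T = len(rewards)                      # episode ends after T steps
    steps = min(n, T - t)                 # fewer than n steps may remain
    G = sum(gamma**i * rewards[t + i] for i in range(steps))
    if t + n < T:                         # bootstrap only if s_{t+n} is non-terminal
        G += gamma**n * values[t + n]
    return G

# 2-step return from t = 0: r_1 + gamma*r_2 + gamma^2 * V(s_2)
print(n_step_return([0.0, 0.0, 1.0], [0.5, 0.6, 0.7], t=0, n=2, gamma=0.9))
```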

Note the Notation. In R_t^{(n)}, the first n rewards are data and the final γ^n V_t(s_{t+n}) term is an estimate. The (n) at the top of the R means the n-step return (easy to miss!); R_t with no superscript is essentially R_t^{(∞)}, the complete return.

Learning with N-step Backups. Backup (on-line or off-line): ΔV_t(s_t) = α [R_t^{(n)} − V_t(s_t)].

Error reduction property of n-step returns: max_s |E_π{R_t^{(n)} | s_t = s} − V^π(s)| ≤ γ^n max_s |V(s) − V^π(s)|, i.e., the maximum error using the n-step return is at most γ^n times the maximum error using V. Using this, you can show that n-step methods converge.
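To make the n-step backup concrete, here is a compact sketch of tabular n-step TD prediction on a 19-state random walk like the one in the next example. The environment details (equiprobable left/right moves, +1 reward off the right end, −1 off the left, γ = 1) follow the usual textbook setup; the step size and the simple post-episode sweep are my own choices for illustration.

```python
import random

def random_walk_episode(n_states=19):
    """19-state random walk: start in the middle, move left or right with equal
    probability; stepping off the right end gives reward +1, off the left end -1."""
    s = n_states // 2
    states, rewards = [], []
    while True:
        states.append(s)
        s2 = s + random.choice((-1, 1))
        if s2 < 0:
            rewards.append(-1.0)
            return states, rewards
        if s2 >= n_states:
            rewards.append(1.0)
            return states, rewards
        rewards.append(0.0)
        s = s2

def n_step_td(num_episodes=10, n=4, alpha=0.2, gamma=1.0, n_states=19):
    """Tabular n-step TD prediction: V(s_t) += alpha * (R_t^(n) - V(s_t)).
    Simple variant that sweeps each episode after it ends, updating V as it goes."""
    V = [0.0] * n_states
    for _ in range(num_episodes):
        states, rewards = random_walk_episode(n_states)
        T = len(rewards)
        for t in range(T):
            steps = min(n, T - t)
            G = sum(gamma**i * rewards[t + i] for i in range(steps))
            if t + n < T:
                G += gamma**n * V[states[t + n]]          # bootstrap on V(s_{t+n})
            V[states[t]] += alpha * (G - V[states[t]])    # n-step backup
    return V

print(n_step_td())
```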

Random Walk Examples. How does 2-step TD work here? How about 3-step TD?

A Larger Example. Task: 19-state random walk. Each curve corresponds to the error after 10 episodes; each curve uses a different n; the x-axis is the learning rate. Do you think there is an optimal n? For everything?

Averaging N-step Returns. n-step methods were introduced to help with understanding TD(λ) (next!). Idea: back up an average of several returns, e.g. back up half of the 2-step return and half of the 4-step return: R_t^{avg} = (1/2) R_t^{(2)} + (1/2) R_t^{(4)}. This is called a complex backup. To draw the backup diagram, draw each component and label it with the weight for that component.

Forward View of TD(λ). TD(λ) is a method for averaging all n-step backups, weighted by λ^{n−1} (time since visitation). λ-return: R_t^λ = (1 − λ) Σ_{n=1}^{∞} λ^{n−1} R_t^{(n)}. Backup using the λ-return: ΔV_t(s_t) = α [R_t^λ − V_t(s_t)].

Forward View of TD(λ), continued. Note: since the weights (1 − λ) λ^{n−1} add up to 1, r_{t+1} is not overemphasized.

λ-return Weighting Function. For an episode that terminates at time T, the λ-return can be rewritten as R_t^λ = (1 − λ) Σ_{n=1}^{T−t−1} λ^{n−1} R_t^{(n)} + λ^{T−t−1} R_t: the first term carries the weights given to the n-step returns until termination, and the remaining weight λ^{T−t−1} goes to the complete return after termination.

Relation to TD(0) and MC. If λ = 1, you get MC: R_t^λ = (1 − 1) Σ_{n=1}^{T−t−1} 1^{n−1} R_t^{(n)} + 1^{T−t−1} R_t = R_t. If λ = 0, you get TD(0): R_t^λ = (1 − 0) Σ_{n=1}^{T−t−1} 0^{n−1} R_t^{(n)} + 0^{T−t−1} R_t = R_t^{(1)}.

Forward View of TD(λ) II. Look forward from each state to determine its update from future states and rewards.
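The truncated form above translates directly into code. This sketch (my own, using the same episode conventions as the earlier snippets) computes R_t^λ for a recorded episode and checks the two limiting cases.

```python
def lambda_return(rewards, values, t, lam, gamma=1.0):
    """R_t^lambda for an episode: (1 - lam) * sum_{n=1}^{T-t-1} lam^(n-1) * R_t^(n)
    plus lam^(T-t-1) times the complete return R_t."""
    T = len(rewards)

    def n_step(n):                        # n-step return R_t^(n), bootstrapping on V
        steps = min(n, T - t)
        G = sum(gamma**i * rewards[t + i] for i in range(steps))
        if t + n < T:
            G += gamma**n * values[t + n]
        return G

    G_lam = (1 - lam) * sum(lam**(n - 1) * n_step(n) for n in range(1, T - t))
    G_lam += lam**(T - t - 1) * n_step(T - t)     # complete return gets the leftover weight
    return G_lam

rewards, values = [0.0, 0.0, 1.0], [0.5, 0.6, 0.7]
print(lambda_return(rewards, values, t=0, lam=0.0, gamma=0.9))  # = r_1 + 0.9 * V(s_1), the TD(0) target
print(lambda_return(rewards, values, t=0, lam=1.0, gamma=0.9))  # = full discounted return, the MC target
```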

λ-return on the Random Walk. Same 19-state random walk as before. Why do you think intermediate values of λ are best?

Backward View. δ_t = r_{t+1} + γ V_t(s_{t+1}) − V_t(s_t). Shout δ_t backwards over time; the strength of your voice decreases with temporal distance by γλ.

Backward View of TD(λ). The forward view was for theory; the backward view is for mechanism. A new variable called the eligibility trace, e_t(s) ∈ ℝ⁺: on each step, decay all traces by γλ and increment the trace for the current state by 1,
  e_t(s) = γλ e_{t−1}(s)        if s ≠ s_t
  e_t(s) = γλ e_{t−1}(s) + 1    if s = s_t.
This gives the equivalent weighting as the forward algorithm when used incrementally.

Policy Evaluation Using TD(λ):
  Initialize V(s) arbitrarily
  Repeat (for each episode):
    e(s) = 0, for all s ∈ S
    Initialize s
    Repeat (for each step of episode):
      a ← action given by π for s
      Take action a, observe reward r and next state s′
      δ ← r + γ V(s′) − V(s)          /* compute TD error */
      e(s) ← e(s) + 1                 /* increment trace for current state */
      For all s:                      /* update ALL s: really only the ones we have visited */
        V(s) ← V(s) + α δ e(s)        /* incrementally update V(s) according to TD(λ) */
        e(s) ← γλ e(s)                /* decay trace */
      s ← s′                          /* move to next state */
    Until s is terminal
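A direct transliteration of the pseudocode above into Python, as a sketch under assumed conventions: a tabular V in a defaultdict, an environment exposed as step(s, a) returning (next_state, reward, done), and a fixed policy pi(s).

```python
from collections import defaultdict

def td_lambda(pi, step, start_state, num_episodes=100,
              alpha=0.1, gamma=1.0, lam=0.9):
    """Tabular TD(lambda), backward view, accumulating traces."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        e = defaultdict(float)                 # eligibility traces, reset each episode
        s = start_state
        done = False
        while not done:
            a = pi(s)
            s2, r, done = step(s, a)
            v_next = 0.0 if done else V[s2]
            delta = r + gamma * v_next - V[s]  # TD error
            e[s] += 1.0                        # accumulating trace for current state
            for x in list(e):                  # only states with nonzero traces matter
                V[x] += alpha * delta * e[x]
                e[x] *= gamma * lam            # decay trace
            s = s2
    return V
```

Looping only over the keys of the trace dict already captures the pseudocode's comment that, in practice, only the visited states get updated.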

Relation of the Backward View to MC and TD(0). Using the update rule ΔV_t(s) = α δ_t e_t(s): as before, if you set λ to 0, you get TD(0); if you set λ to 1, you get MC, but in a better way. You can apply TD(1) to continuing tasks, and it works incrementally and on-line (instead of waiting until the end of the episode). Again, by the time you get to the terminal state, these increments add up to the one you would have made via the forward (theoretical) view.

Forward View = Backward View. The forward (theoretical) view of TD(λ) is equivalent to the backward (mechanistic) view for off-line updating. The book shows that the summed backward updates equal the summed forward updates:
  Σ_{t=0}^{T−1} ΔV_t^{TD}(s) I_{s s_t} = Σ_{t=0}^{T−1} α I_{s s_t} Σ_{k=t}^{T−1} (γλ)^{k−t} δ_k = Σ_{t=0}^{T−1} ΔV_t^λ(s_t) I_{s s_t},
where I_{s s_t} is 1 if s = s_t and 0 otherwise (the algebra is shown in the book). On-line updating with small α is similar.

On-line versus Off-line on the Random Walk. Same 19-state random walk. On-line performs better over a broader range of parameters.

Control: Sarsa(λ). Standard control idea: learn Q values, so keep an eligibility trace for each state-action pair instead of just for each state:
  e_t(s,a) = γλ e_{t−1}(s,a) + 1    if s = s_t and a = a_t
  e_t(s,a) = γλ e_{t−1}(s,a)        otherwise
  Q_{t+1}(s,a) = Q_t(s,a) + α δ_t e_t(s,a), where δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) − Q_t(s_t, a_t).
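The off-line equivalence can be checked numerically. The sketch below (my own illustration) accumulates, for one fixed episode and a fixed value table, the backward-view increments α δ_t e_t(s) and the forward-view increments α (R_t^λ − V(s_t)), writing the λ-return error as the discounted sum of TD errors from the middle expression above, and confirms that the per-state totals agree.

```python
from collections import defaultdict

def offline_increments(states, rewards, V, alpha=0.1, gamma=0.9, lam=0.8):
    """Total off-line increments per state for one episode, both views.
    states[t] is s_t, rewards[t] is r_{t+1}; V is held fixed (off-line updating),
    and the value of the terminal state is taken to be 0."""
    T = len(rewards)
    val = [V[s] for s in states] + [0.0]                  # val[T] = terminal value
    delta = [rewards[t] + gamma * val[t + 1] - val[t] for t in range(T)]

    backward = defaultdict(float)                         # sum_t alpha * delta_t * e_t(s)
    e = defaultdict(float)
    for t in range(T):
        e[states[t]] += 1.0                               # accumulating trace
        for s in e:
            backward[s] += alpha * delta[t] * e[s]
        for s in e:
            e[s] *= gamma * lam

    forward = defaultdict(float)                          # sum_t alpha * (R_t^lambda - V(s_t))
    for t in range(T):
        err = sum((gamma * lam) ** (k - t) * delta[k] for k in range(t, T))
        forward[states[t]] += alpha * err
    return backward, forward

b, f = offline_increments(["A", "B", "A", "C"], [0.0, 1.0, 0.0, 2.0],
                          {"A": 0.3, "B": -0.1, "C": 0.5})
print(all(abs(b[s] - f[s]) < 1e-12 for s in b))           # True: the two views match
```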

Sarsa(λ) Algorithm:
  Initialize Q(s,a) arbitrarily
  Repeat (for each episode):
    e(s,a) = 0, for all s, a
    Initialize s, a
    Repeat (for each step of episode):
      Take action a, observe r, s′
      Choose a′ from s′ using the policy derived from Q (e.g. ε-greedy)
      δ ← r + γ Q(s′,a′) − Q(s,a)
      e(s,a) ← e(s,a) + 1
      For all s, a:
        Q(s,a) ← Q(s,a) + α δ e(s,a)
        e(s,a) ← γλ e(s,a)
      s ← s′; a ← a′
    Until s is terminal

Sarsa(λ) Gridworld Example. With one trial, the agent has much more information about how to get to the goal (not necessarily the best way). Traces can considerably accelerate learning.

Three Approaches to Q(λ). How can we extend this to Q-learning? Recall that Q-learning is an off-policy method to learn Q*, and it uses the max of the Q values for a state in its backup. What happens if we make an exploratory move? That is NOT the right thing to back up over. What to do? If you mark every state-action pair as eligible, you back up over a non-greedy policy. Three answers: Watkins, Peng, and the naïve answer.

Watkins: zero out the eligibility trace after a non-greedy action; do the max when backing up at the first non-greedy choice:
  e_t(s,a) = 1 + γλ e_{t−1}(s,a)    if s = s_t, a = a_t, and Q_{t−1}(s_t,a_t) = max_a Q_{t−1}(s_t,a)
  e_t(s,a) = 0                      if Q_{t−1}(s_t,a_t) ≠ max_a Q_{t−1}(s_t,a)
  e_t(s,a) = γλ e_{t−1}(s,a)        otherwise
  Q_{t+1}(s,a) = Q_t(s,a) + α δ_t e_t(s,a), where δ_t = r_{t+1} + γ max_{a′} Q_t(s_{t+1}, a′) − Q_t(s_t, a_t).
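A runnable sketch of the Sarsa(λ) pseudocode, under assumed conventions (tabular Q keyed by (state, action), an environment step(s, a) returning (next_state, reward, done), and a fixed action list), with accumulating traces as on the slide.

```python
import random
from collections import defaultdict

def sarsa_lambda(step, start_state, actions, num_episodes=200,
                 alpha=0.1, gamma=0.9, lam=0.9, epsilon=0.05):
    Q = defaultdict(float)                       # Q[(s, a)], missing entries are 0

    def eps_greedy(s):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(s, a)])

    for _ in range(num_episodes):
        e = defaultdict(float)                   # traces e[(s, a)], reset each episode
        s = start_state
        a = eps_greedy(s)
        done = False
        while not done:
            s2, r, done = step(s, a)
            a2 = eps_greedy(s2)
            q_next = 0.0 if done else Q[(s2, a2)]
            delta = r + gamma * q_next - Q[(s, a)]
            e[(s, a)] += 1.0                     # accumulating trace
            for key in list(e):                  # update every pair with a nonzero trace
                Q[key] += alpha * delta * e[key]
                e[key] *= gamma * lam
            s, a = s2, a2
    return Q
```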

Watkins's Q(λ):
  Initialize Q(s,a) arbitrarily
  Repeat (for each episode):
    e(s,a) = 0, for all s, a
    Initialize s, a
    Repeat (for each step of episode):
      Take action a, observe r, s′
      Choose a′ from s′ using the policy derived from Q (e.g. ε-greedy)
      a* ← argmax_b Q(s′,b)  (if a′ ties for the max, then a* ← a′)
      δ ← r + γ Q(s′,a*) − Q(s,a)
      e(s,a) ← e(s,a) + 1
      For all s, a:
        Q(s,a) ← Q(s,a) + α δ e(s,a)
        If a′ = a*, then e(s,a) ← γλ e(s,a); else e(s,a) ← 0
      s ← s′; a ← a′
    Until s is terminal

Peng's Q(λ). Disadvantage of Watkins's method: early in learning, the eligibility trace will be cut (zeroed out) frequently, resulting in little advantage from traces. Peng: back up the max action except at the end; never cut traces. Disadvantage: complicated to implement.

Naïve Q(λ). Idea: is it really a problem to back up exploratory actions? Never zero the traces; always back up the max at the current action (unlike Peng or Watkins's). Is this truly naïve? Well, it works well in some empirical studies. What is the backup diagram?

Comparison Task. Compared Watkins's, Peng's, and Naïve (called McGovern's here) Q(λ) on several tasks; see McGovern and Sutton (1997), Towards a Better Q(λ), for other tasks and results (stochastic tasks, continuing tasks, etc.). Deterministic gridworld with obstacles: 10x10 gridworld, 25 randomly generated obstacles, 30 runs; α = 0.05, γ = 0.9, λ = 0.9, ε = 0.05, accumulating traces. (From McGovern and Sutton, 1997.)
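The distinguishing detail of Watkins's Q(λ) is what happens to the traces when the action taken is not greedy. Here is a sketch of a single step of that update, with the same assumed tabular conventions as above (an illustration, not the slides' code).

```python
def watkins_q_lambda_step(Q, e, s, a, r, s2, a2, actions, done,
                          alpha=0.1, gamma=0.9, lam=0.9):
    """One step of Watkins's Q(lambda). Q and e are defaultdict(float) keyed by
    (state, action); a is the action just taken in s, a2 the action chosen for s2."""
    a_star = max(actions, key=lambda b: Q[(s2, b)])      # greedy action in s'
    if Q[(s2, a2)] == Q[(s2, a_star)]:                   # if a' ties for the max,
        a_star = a2                                      # treat a' as the greedy choice
    q_next = 0.0 if done else Q[(s2, a_star)]
    delta = r + gamma * q_next - Q[(s, a)]
    e[(s, a)] += 1.0
    for key in list(e):
        Q[key] += alpha * delta * e[key]
        if a2 == a_star:
            e[key] *= gamma * lam                        # greedy step: decay as usual
        else:
            e[key] = 0.0                                 # exploratory step: cut all traces
    return delta
```

The final else branch is exactly the cut that Peng's method and the naïve variant avoid.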

Comparison Results. (From McGovern and Sutton, 1997, Towards a Better Q(λ).)

Convergence of the Q(λ)'s. None of the methods is proven to converge (much extra credit if you can prove any of them). Watkins's is thought to converge to Q*; Peng's is thought to converge to a mixture of Q^π and Q*; Naïve: Q*?

Eligibility Traces for Actor-Critic Methods. Critic: on-policy learning of V^π; use TD(λ) as described before. Actor: needs an eligibility trace for each state-action pair. We change the update equation from
  p_{t+1}(s,a) = p_t(s,a) + α δ_t    if a = a_t and s = s_t
  p_{t+1}(s,a) = p_t(s,a)            otherwise
to
  p_{t+1}(s,a) = p_t(s,a) + α δ_t e_t(s,a).
We can change the other actor-critic update from
  p_{t+1}(s,a) = p_t(s,a) + α δ_t [1 − π_t(s_t,a_t)]    if a = a_t and s = s_t
  p_{t+1}(s,a) = p_t(s,a)                               otherwise
to
  p_{t+1}(s,a) = p_t(s,a) + α δ_t e_t(s,a), where
  e_t(s,a) = γλ e_{t−1}(s,a) + 1 − π_t(s_t,a_t)    if s = s_t and a = a_t
  e_t(s,a) = γλ e_{t−1}(s,a)                       otherwise.
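A compact sketch of one actor-critic step with traces on both the critic (state traces) and the actor (state-action traces), implementing the first replacement update above with accumulating traces; all function and variable names here are my own.

```python
from collections import defaultdict

def actor_critic_step(V, p, e_v, e_p, s, a, r, s2, done,
                      alpha_v=0.1, alpha_p=0.1, gamma=0.9, lam=0.9):
    """One actor-critic update with accumulating eligibility traces.
    V, e_v are defaultdict(float) keyed by state; p, e_p by (state, action).
    p(s, a) are action preferences (e.g. turned into a policy by a softmax)."""
    delta = r + (0.0 if done else gamma * V[s2]) - V[s]   # critic's TD error
    e_v[s] += 1.0                                         # critic trace for s_t
    e_p[(s, a)] += 1.0                                    # actor trace for (s_t, a_t)
    for x in list(e_v):                                   # critic: TD(lambda) as before
        V[x] += alpha_v * delta * e_v[x]
        e_v[x] *= gamma * lam
    for key in list(e_p):                                 # actor: p(s,a) += alpha*delta*e(s,a)
        p[key] += alpha_p * delta * e_p[key]
        e_p[key] *= gamma * lam
    return delta

V, p = defaultdict(float), defaultdict(float)
e_v, e_p = defaultdict(float), defaultdict(float)
actor_critic_step(V, p, e_v, e_p, s="s0", a="left", r=1.0, s2="s1", done=False)
```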

Replacing Traces. Using accumulating traces, frequently visited states can have eligibilities greater than 1; this can be a problem for convergence. Replacing traces: instead of adding 1 when you visit a state, set that trace to 1,
  e_t(s) = γλ e_{t−1}(s)    if s ≠ s_t
  e_t(s) = 1                if s = s_t.

Replacing Traces Example. Same 19-state random walk task as before. Replacing traces perform better than accumulating traces over more values of λ.

Why Replacing Traces? Replacing traces can significantly speed learning; they can make the system perform well for a broader set of parameters. Accumulating traces can do poorly on certain types of tasks (why is this task particularly onerous for accumulating traces?). The off-line replacing-trace TD(1) is identical to first-visit MC.

More Replacing Traces. Extension to action values (Q values): when you revisit a state, what should you do with the traces for the actions you didn't take? Singh and Sutton say to set them to zero:
  e_t(s,a) = 1                   if s = s_t and a = a_t
  e_t(s,a) = 0                   if s = s_t and a ≠ a_t
  e_t(s,a) = γλ e_{t−1}(s,a)     if s ≠ s_t.
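The difference between the two trace styles is a single line in the inner loop. Below is a sketch of both, plus the Singh and Sutton action-value variant; the helpers use the decay-then-set form of the definitions (applied before the value update that uses the traces), and the dict layout is my own assumption.

```python
def update_traces_accumulating(e, s, gamma, lam):
    for x in list(e):
        e[x] *= gamma * lam
    e[s] = e.get(s, 0.0) + 1.0            # visited state: decay, then add 1

def update_traces_replacing(e, s, gamma, lam):
    for x in list(e):
        e[x] *= gamma * lam
    e[s] = 1.0                            # visited state: reset to exactly 1

def update_traces_replacing_q(e, s, a, actions, gamma, lam):
    for key in list(e):
        e[key] *= gamma * lam
    for b in actions:                     # Singh & Sutton: zero the traces of the
        e[(s, b)] = 0.0                   # actions not taken in the revisited state
    e[(s, a)] = 1.0
```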

Implementation Issues with Traces. Could require much more computation, but most eligibility traces are VERY close to zero. If you implement it in Matlab, the backup is only one line of code and is very fast (Matlab is optimized for matrices).

Variable λ. Can generalize to a variable λ, where λ is a function of time:
  e_t(s) = γλ_t e_{t−1}(s)        if s ≠ s_t
  e_t(s) = γλ_t e_{t−1}(s) + 1    if s = s_t,
with, for example, λ_t = λ (a constant) or λ_t = λ(s_t) (a function of the state).

Conclusions: the two views. Eligibility traces provide an efficient, incremental way to combine MC and TD. They include advantages of MC (can deal with lack of the Markov property) and advantages of TD (use the TD error, bootstrap). They can significantly speed learning. They do have a cost in computation.
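Since most traces decay to essentially zero, an implementation can keep only the significant ones. A small sketch, where the cutoff value is my own choice rather than anything from the slides:

```python
def decay_traces_sparse(e, gamma, lam, cutoff=1e-4):
    """Decay all traces and drop those that have become negligible,
    so the per-step update touches only a handful of states."""
    for s in list(e):
        e[s] *= gamma * lam
        if e[s] < cutoff:
            del e[s]                      # trace is effectively zero: stop updating s
```

With γ = λ = 0.9 (so γλ = 0.81), a trace falls below 10^-4 about 44 steps after the state's last visit, so the inner update loop stays short regardless of episode length.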

Unified View. END.