ME 537: Learning Based Control
Week 6, Lecture 1: Reinforcement Learning (Part 3)

Announcements:
- HW 3 due on 11/5 at noon
- Midterm exam on 11/8
- Project draft due 11/15

Suggested reading: Chapters 7-8 in Reinforcement Learning (Sutton and Barto)
(Lecture notes based on Reinforcement Learning, Sutton and Barto, 1998)

Eligibility Traces
Eligibility traces are the basic mechanism for assigning credit in reinforcement learning.
In TD(λ), the λ refers to the use of an eligibility trace.
Basic idea:
- Determine how many rewards will be used in evaluating a value function
- Determine which states are eligible for an update
- Connect TD learning and Monte Carlo learning
Eligibility Traces
Forward view:
- Theoretical
- Connects TD learning and Monte Carlo learning
  - In TD learning, only the previous state is eligible
  - In Monte Carlo, all visited states are eligible
- Intuition: How far ahead do you need to look to figure out the value of a state?

Backward view:
- Mechanistic
- Temporary record of events: visiting states, taking actions, receiving rewards
- Intuition: How many state-value pairs do I need to store to figure out the value of a state?
Forward View
Recall Monte Carlo methods for a T-step episode:
$R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{T-t-1} r_T$
- The update is based on all the returns.

In a TD backup:
$R_t^{(1)} = r_{t+1} + \gamma V_t(s_{t+1})$
- The update is based on a 1-step backup.

Why restrict ourselves to these two extremes?

Forward View: n-step TD
Why not look at a 2-step backup:
$R_t^{(2)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 V_t(s_{t+2})$
How about an n-step backup?
$R_t^{(n)} = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \cdots + \gamma^{n-1} r_{t+n} + \gamma^n V_t(s_{t+n})$
This is called the corrected n-step truncated return, or simply the n-step return.

Forward View (Sutton & Barto)
[Figure: backup diagrams for n-step returns, from Sutton & Barto]
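To make the n-step return concrete, here is a minimal sketch (Python; the names are hypothetical, and `rewards` is assumed to hold the stored rewards r_{t+1}, ..., r_{t+n} from a recorded trajectory):

```python
def n_step_return(rewards, v_end, gamma, n):
    """Corrected n-step truncated return:
    R_t^(n) = r_{t+1} + gamma*r_{t+2} + ... + gamma^(n-1)*r_{t+n} + gamma^n * V(s_{t+n}).
    `rewards` holds r_{t+1}, ..., r_{t+n}; `v_end` is the current estimate V_t(s_{t+n})."""
    assert len(rewards) == n
    discounted_rewards = sum(gamma**k * r for k, r in enumerate(rewards))
    return discounted_rewards + gamma**n * v_end

# Example: a 3-step return with gamma = 0.9 and made-up rewards
print(n_step_return([1.0, 0.0, 2.0], v_end=5.0, gamma=0.9, n=3))
```

With n = 1 this is exactly the TD target above; letting n run to the end of the episode (where the terminal value is 0) gives the Monte Carlo return.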
Backward View
A causal, incremental view that is implementable.
Introduce a new memory variable: the eligibility trace $e_t(s)$.
- On each step, the eligibility trace of every state decays by $\gamma\lambda$
- The eligibility trace of the state just visited is incremented by 1

$e_t(s) = \gamma \lambda e_{t-1}(s)$ if $s \neq s_t$
$e_t(s) = \gamma \lambda e_{t-1}(s) + 1$ if $s = s_t$

- $\lambda$ is the trace-decay parameter
- $\gamma$ is the discount rate used previously

Intuition
At any time, the eligibility traces keep track of which states have been recently visited.
- "Recently" is quantified by $\gamma\lambda$
The traces capture how eligible a state is for an update.

Recall basic reinforcement learning:
NewEstimate ← OldEstimate + StepSize (Target − OldEstimate)
Now, update all values based on the eligibility trace.
Implementation
NewEstimate ← OldEstimate + StepSize (Target − OldEstimate)
For TD, the error was:
$\delta_t = r_{t+1} + \gamma V_t(s_{t+1}) - V_t(s_t)$
Then the update becomes:
$V_{t+1}(s) = V_t(s) + \alpha \delta_t e_t(s)$ for all $s \in S$
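A minimal sketch of this backward-view update in the tabular case (the environment interface `env.reset()`/`env.step(a)` and the other names are hypothetical, not from the lecture; `V` is a dict mapping each state to its current value estimate):

```python
from collections import defaultdict

def td_lambda_episode(env, V, policy, alpha=0.1, gamma=0.99, lam=0.9):
    """One episode of tabular TD(lambda), backward view.
    Assumes env.reset() -> state and env.step(action) -> (next_state, reward, done)."""
    e = defaultdict(float)                   # eligibility traces, all start at 0
    s = env.reset()
    done = False
    while not done:
        s_next, r, done = env.step(policy(s))
        target = 0.0 if done else V[s_next]
        delta = r + gamma * target - V[s]    # TD error delta_t
        for state in list(e):
            e[state] *= gamma * lam          # decay every trace by gamma*lambda
        e[s] += 1.0                          # increment the trace of the visited state
        for state, trace in e.items():
            V[state] += alpha * delta * trace  # update ALL eligible states at once
        s = s_next
    return V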
Special Cases
Let's explore two special cases based on this update rule:
$V_{t+1}(s) = V_t(s) + \alpha \delta_t e_t(s)$ for all $s \in S$

For λ = 0:
- All traces are 0 at $t$ except for the state visited at $t$: $s_t$
- Reduces to the one-step TD update rule

For λ = 1:
- All states receive credit decaying by $\gamma$
- This gives the Monte Carlo updates
  - If $\gamma = 1$ as well, there is no decay in time and we have undiscounted Monte Carlo updates
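A toy check of the λ = 0 case (made-up numbers, not from the lecture): the $\gamma\lambda$ decay zeroes every old trace, so only $s_t$ is eligible and the TD(λ) update coincides with the plain TD update:

```python
# Toy check: lambda = 0 reduces TD(lambda) to the one-step TD update.
alpha, gamma, lam = 0.5, 0.9, 0.0
V = {"A": 1.0, "B": 2.0}
e = {"A": 0.0, "B": 0.0}

# Suppose we transition A -> B with reward r = 1.
delta = 1.0 + gamma * V["B"] - V["A"]
e = {s: gamma * lam * tr for s, tr in e.items()}  # every decayed trace -> 0
e["A"] += 1.0                                     # only s_t = A is eligible
V_td_lambda = {s: V[s] + alpha * delta * e[s] for s in V}

V_td0 = dict(V)
V_td0["A"] += alpha * delta                       # plain TD(0) update
assert V_td_lambda == V_td0
```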
Forward and Backward Views
[Figure: equivalence of the forward and backward views, from Sutton & Barto]

Sarsa(λ)
Recall Sarsa learning:
$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1},a_{t+1}) - Q(s_t,a_t) \right]$
Now, we have:
$Q_{t+1}(s,a) = Q_t(s,a) + \alpha \delta_t e_t(s,a)$ for all $s, a$
where
$\delta_t = r_{t+1} + \gamma Q_t(s_{t+1},a_{t+1}) - Q_t(s_t,a_t)$
$e_t(s,a) = \gamma \lambda e_{t-1}(s,a) + 1$ if $s = s_t$ and $a = a_t$
$e_t(s,a) = \gamma \lambda e_{t-1}(s,a)$ otherwise
for all $s, a$

Example: Sarsa(λ)
Recall Sarsa updates: only the last state-action pair receives an update; all other state-action pairs will update when visited/taken next.
[Figure: gridworld panels showing the path taken, the action values increased by one-step Sarsa, and the action values increased by Sarsa(λ) with λ = 0.9]
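A minimal sketch of tabular Sarsa(λ) with an ε-greedy policy (same hypothetical `env` interface and naming assumptions as the TD(λ) sketch above; `Q` is a defaultdict keyed by (state, action)):

```python
from collections import defaultdict
import random

def sarsa_lambda_episode(env, Q, actions, alpha=0.1, gamma=0.99,
                         lam=0.9, epsilon=0.1):
    """One episode of tabular Sarsa(lambda), backward view."""
    def eps_greedy(state):
        if random.random() < epsilon:
            return random.choice(actions)
        return max(actions, key=lambda a: Q[(state, a)])

    e = defaultdict(float)                 # traces over (state, action) pairs
    s = env.reset()
    a = eps_greedy(s)
    done = False
    while not done:
        s2, r, done = env.step(a)
        a2 = eps_greedy(s2)
        delta = r + gamma * (0.0 if done else Q[(s2, a2)]) - Q[(s, a)]
        for key in list(e):
            e[key] *= gamma * lam          # decay all traces by gamma*lambda
        e[(s, a)] += 1.0                   # bump the trace of the current pair
        for key, trace in e.items():
            Q[key] += alpha * delta * trace  # update every eligible pair
        s, a = s2, a2
    return Q
```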
Q(λ)
Recall Q-learning:
$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1},a) - Q(s_t,a_t) \right]$
Now, we have:
$Q_{t+1}(s,a) = Q_t(s,a) + \alpha \delta_t e_t(s,a)$ for all $s, a$
where
$\delta_t = r_{t+1} + \gamma \max_a Q_t(s_{t+1},a) - Q_t(s_t,a_t)$
$e_t(s,a) = I_{s s_t} \cdot I_{a a_t} + \gamma \lambda e_{t-1}(s,a)$ if $Q_{t-1}(s_t,a_t) = \max_a Q_{t-1}(s_t,a)$
$e_t(s,a) = I_{s s_t} \cdot I_{a a_t}$ otherwise
Q(λ)
In Q-learning with eligibility traces:
- As in Sarsa learning, except the eligibility trace is set to zero after an exploratory (non-greedy) move
- Intuition: a two-step process
  - Step 1: either each state-action pair's trace decays by $\gamma\lambda$, or all traces are set to 0 after an exploratory step
  - Step 2: the trace for the current state-action pair is incremented by 1
(This is Watkins's Q(λ); there are other versions as well.)
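The piece that distinguishes Watkins's Q(λ) is this trace bookkeeping. A minimal sketch of just that step (hypothetical names; `e` and `Q` are dicts keyed by (state, action)):

```python
def update_traces_watkins(e, s_t, a_t, Q, actions, gamma, lam):
    """Trace update for Watkins's Q(lambda).
    If a_t was greedy w.r.t. Q, decay all traces by gamma*lambda;
    otherwise zero them all (credit must not flow across exploration).
    Then increment the trace for the current (s_t, a_t)."""
    greedy_value = max(Q[(s_t, a)] for a in actions)
    was_greedy = Q[(s_t, a_t)] == greedy_value
    for key in list(e):
        e[key] = gamma * lam * e[key] if was_greedy else 0.0
    e[(s_t, a_t)] += 1.0
    return e
```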
Function Approximation
So far, value functions have been tables.
- Tables get large
- Tables are slow to access
- Tables are hard to fill
Need to use a subset of state-action pairs to learn the value function.

Function approximation:
- Neural network
- Linear
Value Prediction
Use the mean squared error for evaluating performance:
$\mathrm{MSE}(\theta_t) = \sum_{s \in S} p(s) \left[ V^\pi(s) - V_t(s) \right]^2$
- $V_t$ is the value function at time step $t$, parameterized by $\theta_t$
- $V^\pi$ is the optimal value function
- $p(s)$ is a probability distribution over states
  - Makes sure the MSE focuses on the states that matter

Gradient Descent
Basic parameter update (based on a uniform $p(s)$):
$\theta_{t+1} = \theta_t - \frac{1}{2} \alpha \nabla_{\theta_t} \left[ V^\pi(s_t) - V_t(s_t) \right]^2$
- Update the parameters based on the negative of the gradient

This becomes:
$\theta_{t+1} = \theta_t + \alpha \left[ V^\pi(s_t) - V_t(s_t) \right] \nabla_{\theta_t} V_t(s_t)$
where $\nabla_{\theta_t} V_t(s_t)$ is the vector of partial derivatives of $V_t(s_t)$ with respect to the parameters.
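For a linear approximator $V_t(s) = \theta_t^\top \phi(s)$, the gradient $\nabla_{\theta_t} V_t(s)$ is just the feature vector $\phi(s)$, so the update is one line. A minimal sketch (the feature function and numbers are made up), written with a generic target that the next slides replace with an estimate:

```python
import numpy as np

def gd_value_step(theta, phi_s, v_target, alpha):
    """One gradient-descent step on 0.5*(v_target - V(s))^2 for a
    linear value function V(s) = theta @ phi(s), where grad V = phi(s)."""
    v_hat = theta @ phi_s
    return theta + alpha * (v_target - v_hat) * phi_s

# Example with made-up numbers:
theta = np.zeros(3)
phi_s = np.array([1.0, 0.5, 0.0])   # hypothetical features phi(s)
theta = gd_value_step(theta, phi_s, v_target=2.0, alpha=0.1)
```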
Gradient Descent
We have one problem with the parameter update:
$\theta_{t+1} = \theta_t + \alpha \left[ V^\pi(s_t) - V_t(s_t) \right] \nabla_{\theta_t} V_t(s_t)$
We don't know the optimal value function!

Gradient Descent without the Value Function
Solution: use an estimate of the optimal value.
- Use the Monte Carlo estimate $R_t$ (an unbiased estimate):
$\theta_{t+1} = \theta_t + \alpha \left[ R_t - V_t(s_t) \right] \nabla_{\theta_t} V_t(s_t)$
- Use the TD estimate $R_t^\lambda$ (a biased estimate):
$\theta_{t+1} = \theta_t + \alpha \left[ R_t^\lambda - V_t(s_t) \right] \nabla_{\theta_t} V_t(s_t)$

Control with Function Approximation
Gradient descent for state-action values:
$\theta_{t+1} = \theta_t + \alpha \left[ Q^\pi(s_t,a_t) - Q_t(s_t,a_t) \right] \nabla_{\theta_t} Q_t(s_t,a_t)$
Gradient Descent Sarsa
Gradient descent for state-action values:
$\theta_{t+1} = \theta_t + \alpha \left[ Q^\pi(s_t,a_t) - Q_t(s_t,a_t) \right] \nabla_{\theta_t} Q_t(s_t,a_t)$
Substituting the one-step Sarsa target for the unknown $Q^\pi$:
$\theta_{t+1} = \theta_t + \alpha \left[ r_{t+1} + \gamma Q_t(s_{t+1},a_{t+1}) - Q_t(s_t,a_t) \right] \nabla_{\theta_t} Q_t(s_t,a_t)$
Gradient descent Sarsa:
- 1-step backup
- Extend to eligibility traces

Gradient Descent Sarsa(λ)
Gradient descent Sarsa(λ):
$\theta_{t+1} = \theta_t + \alpha \delta_t e_t$
where
$\delta_t = r_{t+1} + \gamma Q_t(s_{t+1},a_{t+1}) - Q_t(s_t,a_t)$
$e_t = \gamma \lambda e_{t-1} + \nabla_{\theta_t} Q_t(s_t,a_t)$
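Note that $e_t$ now has one component per parameter, not per state-action pair. A minimal sketch of the per-step update for a linear $Q(s,a) = \theta^\top \phi(s,a)$ (hypothetical feature vectors; `phi_sa` and `phi_sa_next` stand for $\phi(s_t,a_t)$ and $\phi(s_{t+1},a_{t+1})$):

```python
import numpy as np

def sarsa_lambda_fa_step(theta, e, phi_sa, phi_sa_next, r,
                         alpha, gamma, lam, terminal=False):
    """One step of gradient-descent Sarsa(lambda) with a linear
    Q(s,a) = theta @ phi(s,a); grad Q = phi(s,a), so the trace is
    e_t = gamma*lam*e_{t-1} + phi(s_t,a_t)."""
    q = theta @ phi_sa
    q_next = 0.0 if terminal else theta @ phi_sa_next
    delta = r + gamma * q_next - q        # TD error delta_t
    e = gamma * lam * e + phi_sa          # trace over parameters
    theta = theta + alpha * delta * e     # theta_{t+1} = theta_t + alpha*delta_t*e_t
    return theta, e
```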