Temporal Difference Learning. KENNETH TRAN, Principal Research Engineer, MSR AI


Temporal Difference Learning: Policy Evaluation

Outline: intro to model-free learning; Monte Carlo learning; Temporal Difference learning; TD(λ).

Notation. An episode is a sequence $S_0, A_0, R_1, S_1, A_1, R_2, \dots$ of states, actions, and rewards; $\gamma$ is the discount rate, e.g. 0.9. The return is
$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots = R_{t+1} + \gamma\left(R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \dots\right) = R_{t+1} + \gamma G_{t+1}$
The state-value function (the true value of state $s$ under policy $\pi$) is
$v_\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right] = \mathbb{E}_\pi\left[R_{t+1} + \gamma G_{t+1} \mid S_t = s\right] = \mathbb{E}_\pi\left[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s\right]$

Bellman equation: $Q(s, a) = \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma \max_{a'} Q(s', a')\right]$. DP: solve a known MDP.

MC methods learn from episodes of experience. MC is model-free: no knowledge of the MDP dynamics is required. MC learns from complete episodes. MC uses the simplest possible idea: use the empirical mean return to approximate the expected value. Caveat: MC can only be applied to episodic MDPs (all episodes must terminate).

Goal: learn $v_\pi$ from episodes of experience under $\pi$: $S_1, A_1, R_2, \dots, S_k \sim \pi$. Recall: the return is the total discounted reward $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-1} R_T$. Recall: the value function is the expected return $v_\pi(s) = \mathbb{E}[G_t \mid S_t = s]$. MC policy evaluation uses the empirical mean return instead of the expected return.

To evaluate $v_\pi(s)$: sample episodes of experience under $\pi$. Every time $t$ that state $s$ is visited in an episode, increment the counter $N(s) \leftarrow N(s) + 1$ and increment the total return $S(s) \leftarrow S(s) + G_t$. The value is estimated by the mean return, $V(s) = S(s)/N(s)$. By the law of large numbers, $V(s) \to v_\pi(s)$ as $N(s) \to \infty$.

Update $V(s)$ incrementally after each episode $S_1, A_1, R_2, \dots, S_T$. For each state $S_t$ with return $G_t$:
$N(S_t) \leftarrow N(S_t) + 1$
$V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\left(G_t - V(S_t)\right)$
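To make this concrete, here is a minimal Python sketch of the incremental every-visit MC update, assuming episodes have already been collected under π as lists of (state, reward) pairs; the data in the usage example is purely hypothetical:

```python
from collections import defaultdict

def mc_evaluate(episodes, gamma=0.9):
    """Incremental every-visit Monte Carlo policy evaluation.

    episodes: list of episodes, each a list of (state, reward) pairs,
              where reward is the reward received after leaving that state.
    """
    N = defaultdict(int)    # visit counts N(s)
    V = defaultdict(float)  # value estimates V(s)
    for episode in episodes:
        # Compute the return G_t for every time step by scanning backwards.
        G = 0.0
        returns = []
        for _, reward in reversed(episode):
            G = reward + gamma * G
            returns.append(G)
        returns.reverse()
        # Every-visit incremental update: V(S_t) += (1/N(S_t)) * (G_t - V(S_t)).
        for (state, _), G_t in zip(episode, returns):
            N[state] += 1
            V[state] += (G_t - V[state]) / N[state]
    return V

# Toy usage with hand-written episodes (hypothetical data).
episodes = [[("A", 0), ("B", 1)], [("B", 1)], [("B", 0)]]
print(dict(mc_evaluate(episodes, gamma=1.0)))
```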

Like MC, TD is model-free: no knowledge of the MDP transitions or rewards is required. Unlike MC, TD learns from incomplete episodes, by bootstrapping: TD updates a guess towards a guess.

Goal: policy evaluation. For a given policy $\pi$, compute the state-value function $v_\pi$. Recall the simple every-visit Monte Carlo method,
$V(S_t) \leftarrow V(S_t) + \alpha\left[G_t - V(S_t)\right]$
where $\alpha$ is a step-size parameter and the target $G_t$ is the actual return after time $t$. The simplest temporal-difference learning method, TD(0), is
$V(S_t) \leftarrow V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]$
where the target $R_{t+1} + \gamma V(S_{t+1})$ is an estimate of the return; $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is called the TD error.
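A matching sketch of the TD(0) update, under the assumption that transitions arrive one at a time as (S_t, R_{t+1}, S_{t+1}, done) tuples from some environment loop not shown in the slides:

```python
from collections import defaultdict

def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.9):
    """One TD(0) step: V(S_t) += alpha * (R_{t+1} + gamma * V(S_{t+1}) - V(S_t)).

    If the episode terminates at S_{t+1}, the value of the terminal state is 0.
    """
    target = r + (0.0 if done else gamma * V[s_next])
    td_error = target - V[s]          # delta_t, the TD error
    V[s] += alpha * td_error
    return td_error

# Toy usage on a hand-written stream of transitions (hypothetical data).
V = defaultdict(float)
transitions = [("A", 0, "B", False), ("B", 1, "terminal", True)]
for s, r, s_next, done in transitions:
    td0_update(V, s, r, s_next, done, alpha=0.5, gamma=1.0)
print(dict(V))
```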

State                Elapsed Time (min)   Predicted Time to Go   Predicted Total Time
leaving office       0                    30                     30
reach car, raining   5                    35                     40
exit highway         20                   15                     35
behind truck         30                   10                     40
home street          40                   3                      43
arrive home          43                   0                      43

Image Credit: Sutton and Barto, Reinforcement Learning, An Introduction 2017

TD can learn before knowing the final outcome: TD can learn online after every step, while MC must wait until the end of the episode before the return is known. TD can learn without the final outcome: TD can learn from incomplete sequences, MC can only learn from complete sequences; TD works in continuing (non-terminating) environments, MC only works for episodic (terminating) environments.

The return $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-1} R_T$ is an unbiased estimate of $v_\pi(S_t)$. The TD target $R_{t+1} + \gamma V(S_{t+1})$ is a biased estimate of $v_\pi(S_t)$.

MC has high variance and zero bias; it is not very sensitive to the initial value and very simple to understand and use. TD has low variance and some bias; it is usually more efficient than MC and more sensitive to the initial value.

Image Credit: Sutton and Barto, Reinforcement Learning, An Introduction 2017

Image Credit: Sutton and Barto, Reinforcement Learning, An Introduction 2017

MC and TD converge: $V(s) \to v_\pi(s)$ as experience $\to \infty$. But what about the batch solution for finite experience, e.g. $K$ episodes
$s_1^1, a_1^1, r_2^1, \dots, s_{T_1}^1$
$\vdots$
$s_1^K, a_1^K, r_2^K, \dots, s_{T_K}^K$
where we repeatedly sample an episode $k \in [1, K]$ and apply MC or TD(0) to episode $k$? Will they converge to the same value function?

AB example: two states A, B; no discounting; 8 episodes of experience: (A, 0, B, 0); (B, 1); (B, 1); (B, 1); (B, 1); (B, 1); (B, 1); (B, 0). What are V(A) and V(B)? Image Credit: Sutton and Barto, Reinforcement Learning, An Introduction 2017

MC converges to the solution with minimum mean-squared error, the best fit to the observed returns:
$\sum_{k=1}^{K} \sum_{t=1}^{T_k} \left(G_t^k - V(s_t^k)\right)^2$
In the AB example, V(A) = 0. TD(0) converges to the solution of the maximum-likelihood Markov model: the solution to the MDP $\langle S, A, \hat{P}, \hat{R}, \gamma \rangle$ that best fits the data. In the AB example, V(A) = 0.75.
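To make the contrast concrete, here is the worked arithmetic for the AB data above (taking γ = 1, as the undiscounted episodes suggest):

```latex
% MC: the only return observed from A is 0 (episode A,0,B,0), so
V_{\mathrm{MC}}(A) = 0, \qquad V_{\mathrm{MC}}(B) = \tfrac{6\cdot 1 + 2\cdot 0}{8} = 0.75.
% TD(0): the maximum-likelihood Markov model fitted to the same data has
\hat{P}(B \mid A) = 1, \quad \hat{r}(A \to B) = 0, \quad \hat{r}(B \to \mathrm{end}) = \tfrac{6}{8},
% and solving that model gives
V_{\mathrm{TD}}(B) = 0.75, \qquad V_{\mathrm{TD}}(A) = \hat{r}(A \to B) + V_{\mathrm{TD}}(B) = 0.75.
```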

TD exploits the Markov property, so it is usually more efficient in Markov environments; MC does not exploit the Markov property.

Image Credit: David Silver, Model-Free Prediction, UCL Course on RL

Image Credit: David Silver, Model-Free Prediction, UCL Course on RL

Image Credit: David Silver, Model-Free Prediction, UCL Course on RL

Bootstrapping: the update involves an estimate. MC does not bootstrap (it learns from complete episodes and must wait until the end of the episode before the return is known); DP bootstraps; TD bootstraps. Sampling: the update does not require computing an exact expectation. MC samples; DP does not sample; TD samples.

Image Credit: David Silver, Model-Free Prediction, UCL Course on RL

Let the TD target look n steps into the future. Image Credit: Sutton and Barto, Reinforcement Learning, An Introduction 2017

Consider the following $n$-step returns for $n = 1, 2, \dots$:
$n = 1$ (TD): $G_t^{(1)} = R_{t+1} + \gamma V(S_{t+1})$
$n = 2$: $G_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 V(S_{t+2})$
$n = \infty$ (MC): $G_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-1} R_T$
Define the $n$-step return
$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n V(S_{t+n})$
$n$-step temporal-difference learning:
$V(S_t) \leftarrow V(S_t) + \alpha\left[G_t^{(n)} - V(S_t)\right]$
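A small sketch of the n-step return and the corresponding n-step TD update, assuming a recorded list of rewards and a table of current value estimates (both hypothetical):

```python
def n_step_return(rewards, values, t, n, gamma=0.9):
    """n-step return G_t^(n) = R_{t+1} + ... + gamma^{n-1} R_{t+n} + gamma^n V(S_{t+n}).

    rewards[k] is R_{k+1}, values[k] is V(S_k); if t+n runs past the end of the
    episode, the remainder is just the (truncated) Monte Carlo return.
    """
    T = len(rewards)
    G = 0.0
    for k in range(min(n, T - t)):
        G += (gamma ** k) * rewards[t + k]
    if t + n < T:
        G += (gamma ** n) * values[t + n]   # bootstrap from V(S_{t+n})
    return G

def n_step_td_update(values, rewards, t, n, alpha=0.1, gamma=0.9):
    """V(S_t) <- V(S_t) + alpha * (G_t^(n) - V(S_t))."""
    G = n_step_return(rewards, values, t, n, gamma)
    values[t] += alpha * (G - values[t])

# Toy usage: a 4-step episode with all-zero initial values (hypothetical data).
rewards = [1.0, 0.0, 0.0, 1.0]
values = [0.0] * 5          # V(S_0) ... V(S_4); S_4 is terminal
n_step_td_update(values, rewards, t=0, n=2, alpha=1.0, gamma=1.0)
print(values[0])            # 2-step return from t=0: 1 + 0 + V(S_2) = 1.0
```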

Image Credit: Sutton and Barto, Reinforcement Learning, An Introduction 2017

We can average $n$-step returns over different $n$, e.g. average the 2-step and 4-step returns: $\tfrac{1}{2} G^{(2)} + \tfrac{1}{2} G^{(4)}$. This combines information from two different time-steps. Can we efficiently combine information from all time-steps? Image Credit: Sutton and Barto, Reinforcement Learning, An Introduction 2017

The λ-return $G_t^\lambda$ combines all $n$-step returns $G_t^{(n)}$, using weight $(1 - \lambda)\lambda^{n-1}$:
$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$
Forward-view TD(λ):
$V(S_t) \leftarrow V(S_t) + \alpha\left[G_t^\lambda - V(S_t)\right]$
Image Credit: Sutton and Barto, Reinforcement Learning, An Introduction 2017

$G_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} G_t^{(n)}$
Image Credit: Sutton and Barto, Reinforcement Learning, An Introduction 2017

The forward view looks into the future to compute $G_t^\lambda$. Like MC, it can only be computed from complete episodes. Image Credit: Sutton and Barto, Reinforcement Learning, An Introduction 2017
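A sketch of computing the forward-view λ-return for one time step of a completed episode; the reward and value arrays in the usage example are hypothetical:

```python
def lambda_return(rewards, values, t, lam=0.8, gamma=0.9):
    """Forward-view lambda-return for an episodic task.

    G_t^lambda = (1 - lam) * sum_{n=1}^{T-t-1} lam^(n-1) * G_t^(n)
                 + lam^(T-t-1) * G_t      (remaining weight goes to the MC return)

    rewards[k] is R_{k+1}, values[k] is V(S_k), and the episode has T steps.
    """
    T = len(rewards)

    def n_step(n):
        # G_t^(n): n rewards plus a bootstrap from V(S_{t+n}) if non-terminal.
        G = sum((gamma ** k) * rewards[t + k] for k in range(min(n, T - t)))
        if t + n < T:
            G += (gamma ** n) * values[t + n]
        return G

    G_lam = sum((1 - lam) * (lam ** (n - 1)) * n_step(n) for n in range(1, T - t))
    G_lam += (lam ** (T - t - 1)) * n_step(T - t)   # tail weight on the full return
    return G_lam

# Toy usage (hypothetical data): with lam=1 the lambda-return equals the MC return.
rewards = [1.0, 0.0, 2.0]
values = [0.5, 0.5, 0.5, 0.0]
print(lambda_return(rewards, values, t=0, lam=1.0, gamma=1.0))  # -> 3.0
```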

The forward view provides the theory; the backward view provides the mechanism: update online, every step, from incomplete sequences.

Image Credit: David Silver, Model-Free Prediction, UCL Course on RL
Credit assignment problem: did the bell or the light cause the shock? Frequency heuristic: assign credit to the most frequent states. Recency heuristic: assign credit to the most recent states. Eligibility traces combine both heuristics:
$E_0(s) = 0$
$E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbf{1}(S_t = s)$

Image Credit: Sutton and Barto, Reinforcement Learning, An Introduction 2017
Backward-view TD(λ): at each time step $t$ in the rollout,
Update the eligibility trace for every state $s$: $E_t(s) = \gamma\lambda E_{t-1}(s) + \mathbf{1}(S_t = s)$ (the indicator marks the current state)
Compute the TD error: $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$
Update the value function for every state $s$: $V(s) \leftarrow V(s) + \alpha\,\delta_t E_t(s)$
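A minimal sketch of this backward-view loop, assuming transitions stream in as (S_t, R_{t+1}, S_{t+1}, done) tuples; the episode in the usage example is hypothetical:

```python
from collections import defaultdict

def td_lambda_episode(transitions, V=None, alpha=0.1, gamma=0.9, lam=0.8):
    """Backward-view TD(lambda) over one episode of (s, r, s_next, done) tuples."""
    if V is None:
        V = defaultdict(float)
    E = defaultdict(float)                      # eligibility traces E(s)
    for s, r, s_next, done in transitions:
        # Decay all traces, then bump the trace of the current state.
        for state in list(E):
            E[state] *= gamma * lam
        E[s] += 1.0
        # TD error for this step (terminal states have value 0).
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
        # Update every state in proportion to its trace.
        for state, e in E.items():
            V[state] += alpha * delta * e
    return V

# Toy usage on a hand-written episode (hypothetical data).
episode = [("A", 0.0, "B", False), ("B", 1.0, "end", True)]
print(dict(td_lambda_episode(episode, alpha=0.5, gamma=1.0, lam=0.9)))
```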

When λ = 0, only the current state is updated:
$E_t(s) = \mathbf{1}(S_t = s)$
$V(s) \leftarrow V(s) + \alpha\,\delta_t E_t(s)$
This is exactly equivalent to the TD(0) update:
$V(S_t) \leftarrow V(S_t) + \alpha\,\delta_t$

TD(1) is roughly equivalent to MC: the error is accumulated online, step by step. If the value function is only updated offline at the end of the episode, then the total update is exactly the same as MC.

Batch update: updates are accumulated within the episode but applied in batch at the end of the episode; backward TD(λ) is equivalent to forward TD(λ). Online update: TD(λ) updates are applied at each step within the episode; forward- and backward-view TD(λ) are slightly different, and backward TD(λ) is typically more efficient. Analogy: SGD vs. batch GD.

Temporal Difference Learning: Policy Optimization

Outline: introduction; on-policy Monte Carlo control; on-policy Temporal Difference control; off-policy learning; summary.

Last lesson: model-free policy evaluation, i.e. estimate the value function of an unknown MDP. This lesson: model-free policy optimization, i.e. optimize the value function of an unknown MDP.

On-policy learning ("learn on the job"): learn about policy π from experience sampled from π. Off-policy learning ("look over someone's shoulder"): learn about policy π from experience sampled from µ.

Image Credit: Sutton and Barto, Reinforcement Learning, An Introduction 2017

Image Credit: Sutton and Barto, Reinforcement Learning, An Introduction 2017

Greedy policy improvement over V(s) requires a model of the MDP; greedy policy improvement over Q(s, a) is model-free.

Image Credit: Sutton and Barto, Reinforcement Learning, An Introduction 2017

There are two doors in front of you. You open the left door and get reward 0: V(left) = 0. You open the right door and get reward +1: V(right) = +1. You open the right door and get reward +3: V(right) = +3. You open the right door and get reward +2: V(right) = +2. ... Are you sure you've chosen the best door?

ε-greedy exploration: the simplest idea for ensuring continual exploration. All $m$ actions are tried with non-zero probability: with probability $1 - \varepsilon$ choose the greedy action; with probability $\varepsilon$ choose an action at random.
Theorem: For any ε-greedy policy $\pi$, the ε-greedy policy $\pi'$ with respect to $Q_\pi$ is an improvement, i.e. $v_{\pi'}(s) \geq v_\pi(s)$.
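A small sketch of ε-greedy action selection over a tabular Q; the Q-table and action names in the example are hypothetical:

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)                       # explore
    q_values = [Q.get((state, a), 0.0) for a in actions]
    best = max(q_values)
    # Break ties between equally good actions at random.
    return random.choice([a for a, q in zip(actions, q_values) if q == best])

# Toy usage (hypothetical Q-table and actions).
Q = {("s0", "left"): 0.0, ("s0", "right"): 1.0}
print(epsilon_greedy(Q, "s0", actions=["left", "right"], epsilon=0.1))
```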

Image Credit: Sutton and Barto, Reinforcement Learning, An Introduction 2017

Definition (Greedy in the Limit with Infinite Exploration, GLIE): all state-action pairs are explored infinitely many times, $\lim_{k \to \infty} N_k(s, a) = \infty$, and the policy converges on a greedy policy, $\lim_{k \to \infty} \pi_k(a \mid s) = \mathbf{1}\big(a = \arg\max_{a' \in A} Q_k(s, a')\big)$. Example: ε-greedy is GLIE if ε reduces to zero at the rate $\varepsilon_k = 1/k$.

GLIE Monte-Carlo control: sample the $k$-th episode using $\pi$: $\{S_1, A_1, R_2, \dots, S_T\} \sim \pi$. For each state $S_t$ and action $A_t$ in the episode,
$N(S_t, A_t) \leftarrow N(S_t, A_t) + 1$
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \frac{1}{N(S_t, A_t)}\big(G_t - Q(S_t, A_t)\big)$
Improve the policy based on the new action-value function:
$\varepsilon \leftarrow 1/k$, $\pi \leftarrow \varepsilon\text{-greedy}(Q)$
Theorem: GLIE Monte-Carlo control converges to the optimal action-value function, $Q(s, a) \to q_*(s, a)$.
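A sketch of GLIE Monte-Carlo control with ε_k = 1/k; the one-step "two doors" task at the bottom is a hypothetical stand-in for an environment, which the slides do not specify:

```python
import random
from collections import defaultdict

def run_episode(step_fn, policy, start_state):
    """Roll out one episode; step_fn(s, a) -> (reward, next_state or None)."""
    s, trajectory = start_state, []
    while s is not None:
        a = policy(s)
        r, s_next = step_fn(s, a)
        trajectory.append((s, a, r))
        s = s_next
    return trajectory

def glie_mc_control(step_fn, start_state, actions, episodes=2000, gamma=1.0):
    """GLIE Monte Carlo control with epsilon_k = 1/k and every-visit updates."""
    Q = defaultdict(float)
    N = defaultdict(int)
    for k in range(1, episodes + 1):
        eps = 1.0 / k                                   # GLIE: epsilon decays to 0

        def policy(s):
            if random.random() < eps:
                return random.choice(actions)
            return max(actions, key=lambda a: Q[(s, a)])

        trajectory = run_episode(step_fn, policy, start_state)
        # Backward pass to compute returns, then incremental every-visit update.
        G = 0.0
        for s, a, r in reversed(trajectory):
            G = r + gamma * G
            N[(s, a)] += 1
            Q[(s, a)] += (G - Q[(s, a)]) / N[(s, a)]
    return Q

# Hypothetical one-step task: two doors, the right door pays more on average.
def step_fn(s, a):
    return (random.gauss(2.0, 1.0) if a == "right" else 0.0), None  # episode ends

Q = glie_mc_control(step_fn, start_state="doors", actions=["left", "right"])
print(max(["left", "right"], key=lambda a: Q[("doors", a)]))  # usually "right"
```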

Temporal-difference (TD) learning has several advantages over Monte Carlo (MC): lower variance, online updates, and it works with incomplete sequences. Natural idea: use TD instead of MC in our control loop. Apply TD to Q(S, A), use ε-greedy policy improvement, and update at every time-step.

Sarsa update: $Q(S, A) \leftarrow Q(S, A) + \alpha\big[R + \gamma Q(S', A') - Q(S, A)\big]$
Image Credit: Sutton and Barto, Reinforcement Learning, An Introduction 2017

Image Credit: Sutton and Barto, Reinforcement Learning, An Introduction 2017
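Putting the Sarsa update above together with ε-greedy action selection gives the following on-policy control sketch; the two-state corridor environment is hypothetical:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, s, actions, eps):
    """Behaviour policy: random with probability eps, otherwise greedy w.r.t. Q."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(s, a)])

def sarsa(step_fn, start_state, actions, episodes=500, alpha=0.1, gamma=1.0, eps=0.1):
    """Tabular Sarsa: Q(S,A) <- Q(S,A) + alpha * (R + gamma*Q(S',A') - Q(S,A))."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = start_state
        a = epsilon_greedy(Q, s, actions, eps)
        while True:
            r, s_next = step_fn(s, a)
            if s_next is None:                      # terminal: no bootstrap term
                Q[(s, a)] += alpha * (r - Q[(s, a)])
                break
            a_next = epsilon_greedy(Q, s_next, actions, eps)   # on-policy next action
            Q[(s, a)] += alpha * (r + gamma * Q[(s_next, a_next)] - Q[(s, a)])
            s, a = s_next, a_next
    return Q

# Hypothetical 2-step corridor: "go" moves 0 -> 1 -> goal; "stay" wastes a step.
def step_fn(s, a):
    if a == "stay":
        return -1.0, s
    return (-1.0, 1) if s == 0 else (0.0, None)

Q = sarsa(step_fn, start_state=0, actions=["go", "stay"])
print(Q[(0, "go")], Q[(0, "stay")])   # "go" from state 0 should score higher
```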

Theorem: Sarsa converges to the optimal action-value function, $Q(s, a) \to q_*(s, a)$, under the following conditions: a GLIE sequence of policies $\pi_t(a \mid s)$, and a Robbins-Monro sequence of step sizes $\alpha_t$, i.e. $\sum_{t=1}^{\infty} \alpha_t = \infty$ and $\sum_{t=1}^{\infty} \alpha_t^2 < \infty$.

Reward = -1 per time step until reaching the goal; undiscounted. Image Credit: Sutton and Barto, Reinforcement Learning, An Introduction 2017

Image Credit: Sutton and Barto, Reinforcement Learning, An Introduction 2017

Consider the following $n$-step returns for $n = 1, 2, \dots, \infty$:
$n = 1$ (Sarsa): $q_t^{(1)} = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1})$
$n = 2$: $q_t^{(2)} = R_{t+1} + \gamma R_{t+2} + \gamma^2 Q(S_{t+2}, A_{t+2})$
$n = \infty$ (MC): $q_t^{(\infty)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-1} R_T$
Define the $n$-step Q-return
$q_t^{(n)} = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{n-1} R_{t+n} + \gamma^n Q(S_{t+n}, A_{t+n})$
$n$-step Sarsa updates $Q(s, a)$ towards the $n$-step Q-return:
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\big[q_t^{(n)} - Q(S_t, A_t)\big]$

The $q^\lambda$ return combines all $n$-step Q-returns $q_t^{(n)}$, using weight $(1 - \lambda)\lambda^{n-1}$:
$q_t^\lambda = (1 - \lambda) \sum_{n=1}^{\infty} \lambda^{n-1} q_t^{(n)}$
Forward-view Sarsa(λ):
$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\big[q_t^\lambda - Q(S_t, A_t)\big]$
Image Credit: Sutton and Barto, Reinforcement Learning, An Introduction 2017

Just like TD(λ), we use eligibility traces in an online algorithm, but Sarsa(λ) has one eligibility trace for each state-action pair:
$E_0(s, a) = 0$
$E_t(s, a) = \gamma\lambda E_{t-1}(s, a) + \mathbf{1}(S_t = s, A_t = a)$
$Q(s, a)$ is updated for every state $s$ and action $a$, in proportion to the TD error $\delta_t$ and the eligibility trace $E_t(s, a)$:
$\delta_t = R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)$
$Q(s, a) \leftarrow Q(s, a) + \alpha\,\delta_t E_t(s, a)$
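A sketch of the backward-view Sarsa(λ) update applied to a stream of (S, A, R, S', A', done) tuples; how those tuples are generated (the ε-greedy behaviour policy and the environment) is assumed and not shown:

```python
from collections import defaultdict

def sarsa_lambda_episode(steps, Q=None, alpha=0.1, gamma=0.9, lam=0.8):
    """Backward-view Sarsa(lambda) over one episode.

    steps: iterable of (s, a, r, s_next, a_next, done) tuples produced by some
    behaviour policy (e.g. epsilon-greedy), which is assumed here.
    """
    if Q is None:
        Q = defaultdict(float)
    E = defaultdict(float)                       # one trace per state-action pair
    for s, a, r, s_next, a_next, done in steps:
        # Decay all traces, then bump the trace of the current (s, a).
        for key in list(E):
            E[key] *= gamma * lam
        E[(s, a)] += 1.0
        # Sarsa TD error (a terminal successor contributes 0).
        delta = r + (0.0 if done else gamma * Q[(s_next, a_next)]) - Q[(s, a)]
        # Update every state-action pair in proportion to its trace.
        for key, e in E.items():
            Q[key] += alpha * delta * e
    return Q

# Toy usage on a hand-written two-step episode (hypothetical data).
episode = [("s0", "go", -1.0, "s1", "go", False),
           ("s1", "go", 0.0, "end", None, True)]
print(dict(sarsa_lambda_episode(episode, alpha=0.5, gamma=1.0, lam=0.9)))
```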

Image Credit: Sutton and Barto, Reinforcement Learning, An Introduction 2017

Off-policy learning: evaluate a target policy $\pi(a \mid s)$ to compute $v_\pi(s)$ or $q_\pi(s, a)$ while following a behaviour policy $\mu(a \mid s)$: $\{S_1, A_1, R_2, \dots, S_T\} \sim \mu$. Why is this important? Learn from observing humans or other agents; re-use experience generated from old policies $\pi_1, \pi_2, \dots, \pi_{t-1}$; learn about the optimal policy while following an exploratory policy; learn about multiple policies while following one policy.

The target policy $\pi$ is greedy w.r.t. $Q(s, a)$: $\pi(S_{t+1}) = \arg\max_{a'} Q(S_{t+1}, a')$. The behaviour policy $\mu$ is e.g. ε-greedy w.r.t. $Q(s, a)$. The Q-learning target becomes $R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')$.

Q-learning update: $Q(S, A) \leftarrow Q(S, A) + \alpha\big[R + \gamma \max_{a'} Q(S', a') - Q(S, A)\big]$
Theorem: Q-learning control converges to the optimal action-value function, $Q(s, a) \to q_*(s, a)$.
Image Credit: Sutton and Barto, Reinforcement Learning, An Introduction 2017
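A minimal sketch of tabular Q-learning with an ε-greedy behaviour policy and a greedy target policy; the corridor environment is the same hypothetical one as in the Sarsa sketch:

```python
import random
from collections import defaultdict

def q_learning(step_fn, start_state, actions, episodes=500,
               alpha=0.1, gamma=1.0, eps=0.1):
    """Tabular Q-learning: behave epsilon-greedily, bootstrap with max_a' Q(S', a')."""
    Q = defaultdict(float)
    for _ in range(episodes):
        s = start_state
        while s is not None:
            # Behaviour policy: epsilon-greedy w.r.t. the current Q.
            if random.random() < eps:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            r, s_next = step_fn(s, a)
            # Target policy is greedy: use max over next actions (0 at terminal).
            best_next = 0.0 if s_next is None else max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q

# Hypothetical 2-step corridor, as in the Sarsa sketch: "go" reaches the goal.
def step_fn(s, a):
    if a == "stay":
        return -1.0, s
    return (-1.0, 1) if s == 0 else (0.0, None)

Q = q_learning(step_fn, start_state=0, actions=["go", "stay"])
print(Q[(0, "go")], Q[(1, "go")])
```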