Introduction to Reinforcement Learning. Part 5: Temporal-Difference Learning


What everybody should know about temporal-difference (TD) learning
- Used to learn value functions without human input
- Learns a guess from a guess
- Applied by Samuel to play Checkers (1959) and by Tesauro to beat humans at Backgammon (1992-5) and Jeopardy! (2011)
- Explains (accurately models) the brain reward systems of primates, rats, bees, and many other animals (Schultz, Dayan & Montague 1997)
- Arguably solves Bellman's curse of dimensionality

Simple Monte Carlo
V(S_t) ← V(S_t) + α [ G_t − V(S_t) ]

Simplest TD Method
V(S_t) ← V(S_t) + α [ R_{t+1} + γ V(S_{t+1}) − V(S_t) ]

TD Prediction
Policy evaluation (the prediction problem): for a given policy π, compute the state-value function v_π.
Recall the simple every-visit Monte Carlo method:
V(S_t) ← V(S_t) + α [ G_t − V(S_t) ]    (target: the actual return after time t)
The simplest temporal-difference method, TD(0):
V(S_t) ← V(S_t) + α [ R_{t+1} + γ V(S_{t+1}) − V(S_t) ]    (target: an estimate of the return)
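To make the TD(0) update concrete, here is a minimal Python sketch of tabular TD(0) policy evaluation. The environment interface (env.reset(), env.step(action) returning (next_state, reward, done)) and the policy callable are assumptions made for this illustration, not something given in the slides.

```python
from collections import defaultdict

def td0_prediction(env, policy, num_episodes, alpha=0.1, gamma=1.0):
    """Tabular TD(0) policy evaluation for a fixed policy.

    Assumes a hypothetical env with reset() -> state and
    step(action) -> (next_state, reward, done), and a policy
    callable mapping state -> action.
    """
    V = defaultdict(float)  # value estimates, default 0 for unseen states
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            action = policy(state)
            next_state, reward, done = env.step(action)
            # TD(0): move V(S_t) toward the bootstrapped target R + gamma * V(S_{t+1})
            target = reward + (0.0 if done else gamma * V[next_state])
            V[state] += alpha * (target - V[state])
            state = next_state
    return V
```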

Example: Driving Home

State                Elapsed Time (minutes)   Predicted Time to Go   Predicted Total Time
leaving office       0                        30                     30
reach car, raining   5                        35                     40
exit highway         20                       15                     35
behind truck         30                       10                     40
home street          40                       3                      43
arrive home          43                       0                      43

Driving Home: changes recommended by Monte Carlo methods (α = 1) vs. changes recommended by TD methods (α = 1).

Advantages of TD Learning
- TD methods do not require a model of the environment, only experience.
- TD, but not MC, methods can be fully incremental: you can learn before knowing the final outcome (less memory, less peak computation), and you can learn without the final outcome, from incomplete sequences.
- Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?

Random Walk Example: values learned by TD(0) after various numbers of episodes.

TD and MC on the Random Walk: data averaged over 100 sequences of episodes.

Batch Updating in TD and MC Methods
Batch updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence. Compute updates according to TD(0), but only update estimates after each complete pass through the data.
For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α. Constant-α MC also converges under these conditions, but to a different answer!
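As an illustration of "only update estimates after each complete pass through the data", here is a rough sketch of batch TD(0). The episode representation (a list of (state, reward, next_state, done) transitions per episode) is an assumption made for this example.

```python
def batch_td0(episodes, alpha=0.01, gamma=1.0, tol=1e-6):
    """Batch TD(0): sweep over all stored transitions, accumulate the TD
    increments, and only then apply them; repeat until the changes vanish."""
    V = {}  # state -> value; missing states are treated as 0
    while True:
        delta = {}  # increments accumulated over one full pass through the data
        for episode in episodes:
            for state, reward, next_state, done in episode:
                target = reward + (0.0 if done else gamma * V.get(next_state, 0.0))
                delta[state] = delta.get(state, 0.0) + alpha * (target - V.get(state, 0.0))
        for state, inc in delta.items():
            V[state] = V.get(state, 0.0) + inc
        if max(abs(inc) for inc in delta.values()) < tol:
            return V
```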

Random Walk under Batch Updating: after each new episode, all previous episodes were treated as a batch, and the algorithm was trained until convergence. All of this was repeated 100 times.

You are the Predictor
Suppose you observe the following 8 episodes:
A, 0, B, 0
B, 1
B, 1
B, 1
B, 1
B, 1
B, 1
B, 0
What is V(B)? 0.75. What is V(A)? 0?

You are the Predictor
The prediction that best matches the training data is V(A) = 0. This minimizes the mean-squared error on the training set, and it is what a batch Monte Carlo method gets.
If we consider the sequentiality of the problem, then we would set V(A) = 0.75. This is correct for the maximum-likelihood estimate of a Markov model generating the data, i.e., if we fit a Markov model to the data, assume it is exactly correct, and then compute what it predicts (how?). This is called the certainty-equivalence estimate, and it is what TD(0) gets.
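A small sketch (not from the slides) that makes the two answers concrete: it computes the batch Monte Carlo averages for the eight episodes above and then the certainty-equivalence estimate, assuming γ = 1 and the episode encoding shown in the code.

```python
# Each episode is a list of (state, reward) pairs, reward received on leaving the state.
episodes = [
    [("A", 0), ("B", 0)],  # A -> B with reward 0, then B terminates with reward 0
    [("B", 1)], [("B", 1)], [("B", 1)],
    [("B", 1)], [("B", 1)], [("B", 1)],
    [("B", 0)],
]

# Batch Monte Carlo: average the observed (undiscounted) returns from each state.
returns = {"A": [], "B": []}
for ep in episodes:
    rewards = [r for _, r in ep]
    for i, (s, _) in enumerate(ep):
        returns[s].append(sum(rewards[i:]))  # return following the visit to s
mc = {s: sum(g) / len(g) for s, g in returns.items()}
print(mc)  # {'A': 0.0, 'B': 0.75}

# Certainty-equivalence: in the best-fit Markov model, A always goes to B with
# reward 0, so V(A) = 0 + V(B) = 0.75 -- the answer batch TD(0) converges to.
v_b = mc["B"]
v_a = 0 + v_b
print(v_a, v_b)
```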

Summary (so far)
- Introduced one-step tabular model-free TD methods.
- These methods bootstrap and sample, combining aspects of DP and MC methods.
- TD prediction: if the world is truly Markov, then TD methods will learn faster than MC methods; MC methods have lower error on past data, but higher error on future data.
- Extend prediction to control by employing some form of GPI: on-policy control (Sarsa), off-policy control (Q-learning).

Learning an Action-Value Function
Estimate q_π for the current policy π.
After every transition from a nonterminal state S_t, do:
Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t) ]
If S_{t+1} is terminal, then define Q(S_{t+1}, A_{t+1}) = 0.

Sarsa: On-Policy TD Control
Turn this into a control method by always updating the policy to be greedy with respect to the current estimate:

Initialize Q(s, a), for all s ∈ S, a ∈ A(s), arbitrarily, and Q(terminal-state, ·) = 0
Repeat (for each episode):
    Initialize S
    Choose A from S using policy derived from Q (e.g., ε-greedy)
    Repeat (for each step of episode):
        Take action A, observe R, S'
        Choose A' from S' using policy derived from Q (e.g., ε-greedy)
        Q(S, A) ← Q(S, A) + α [ R + γ Q(S', A') − Q(S, A) ]
        S ← S'; A ← A'
    until S is terminal
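Below is a rough Python sketch of tabular Sarsa following the pseudocode above. The environment interface (env.reset(), env.step(action) returning (next_state, reward, done), and an env.actions list) is assumed for illustration only.

```python
import random
from collections import defaultdict

def sarsa(env, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Tabular Sarsa sketch, assuming a hypothetical env with reset(),
    step(action) -> (next_state, reward, done), and a list env.actions."""
    Q = defaultdict(float)  # Q[(state, action)], defaults to 0

    def epsilon_greedy(state):
        if random.random() < epsilon:
            return random.choice(env.actions)
        return max(env.actions, key=lambda a: Q[(state, a)])

    for _ in range(num_episodes):
        state = env.reset()
        action = epsilon_greedy(state)
        done = False
        while not done:
            next_state, reward, done = env.step(action)
            next_action = epsilon_greedy(next_state)
            # On-policy target: uses the action actually selected in the next state
            target = reward + (0.0 if done else gamma * Q[(next_state, next_action)])
            Q[(state, action)] += alpha * (target - Q[(state, action)])
            state, action = next_state, next_action
    return Q
```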

Windy Gridworld (wind shifts the agent upward by a column-dependent amount): undiscounted, episodic, reward = −1 on every step until the goal is reached.

Results of Sarsa on the Windy Gridworld

Q-Learning: Off-Policy TD Control
One-step Q-learning:
Q(S_t, A_t) ← Q(S_t, A_t) + α [ R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t) ]

Initialize Q(s, a), for all s ∈ S, a ∈ A(s), arbitrarily, and Q(terminal-state, ·) = 0
Repeat (for each episode):
    Initialize S
    Repeat (for each step of episode):
        Choose A from S using policy derived from Q (e.g., ε-greedy)
        Take action A, observe R, S'
        Q(S, A) ← Q(S, A) + α [ R + γ max_a Q(S', a) − Q(S, A) ]
        S ← S'
    until S is terminal
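And a matching sketch of one-step Q-learning, under the same assumed environment interface as the Sarsa sketch above. The only change is the target: it uses the greedy max over next actions rather than the action the behaviour policy actually takes next, which is what makes the method off-policy.

```python
import random
from collections import defaultdict

def q_learning(env, num_episodes, alpha=0.5, gamma=1.0, epsilon=0.1):
    """Tabular one-step Q-learning sketch (same hypothetical env interface
    as the Sarsa sketch: reset(), step(action), env.actions)."""
    Q = defaultdict(float)

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            # Behaviour policy: epsilon-greedy with respect to the current Q
            if random.random() < epsilon:
                action = random.choice(env.actions)
            else:
                action = max(env.actions, key=lambda a: Q[(state, a)])
            next_state, reward, done = env.step(action)
            # Off-policy target: greedy (max) over actions in the next state
            best_next = 0.0 if done else max(Q[(next_state, a)] for a in env.actions)
            Q[(state, action)] += alpha * (reward + gamma * best_next - Q[(state, action)])
            state = next_state
    return Q
```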

Cliffwalking: ε-greedy behaviour, ε = 0.1.

Summary
- Introduced one-step tabular model-free TD methods.
- These methods bootstrap and sample, combining aspects of DP and MC methods.
- TD prediction: if the world is truly Markov, then TD methods will learn faster than MC methods; MC methods have lower error on past data, but higher error on future data.
- Extend prediction to control by employing some form of GPI: on-policy control (Sarsa), off-policy control (Q-learning).