CSE 190: Reinforcement Learning: An Introduction. Chapter 6: Temporal Difference Learning.


CSE 190: Reinforcement Learning: An Introduction
Chapter 6: Temporal Difference Learning

Monte Carlo is important in practice
When there are just a few possibilities to value, out of a large state space, Monte Carlo is a big win: Backgammon, Go, ...
But it is still somewhat inefficient, because it figures out state values in isolation from other states (no bootstrapping).
Acknowledgment: a good number of these slides are cribbed from Rich Sutton.

Chapter 6: Temporal Difference Learning
Objectives of this chapter:
Introduce Temporal Difference (TD) learning.
As usual, we focus first on policy evaluation, or prediction, methods.
Then extend to control methods.

TD Prediction
Policy evaluation (the prediction problem): for a given policy $\pi$, compute the state-value function $V^\pi$.
Simple every-visit Monte Carlo method:
$V(s_t) \leftarrow V(s_t) + \alpha \left[ R_t - V(s_t) \right]$
target: the actual return after time $t$.
The simplest TD method, TD(0):
$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$
target: an estimate of the return.

TD Prediction
Note that:
$V^\pi(s) = E_\pi\{ R_t \mid s_t = s \}$
$\quad = E_\pi\{ \sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s \}$
$\quad = E_\pi\{ r_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k r_{t+k+2} \mid s_t = s \}$
$\quad = E_\pi\{ r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s \}$
Hence the two targets are the same in expectation. In practice, the results may differ due to finite training sets.

Simple Monte Carlo
$V(s_t) \leftarrow V(s_t) + \alpha \left[ R_t - V(s_t) \right]$
where $R_t$ is the actual return following state $s_t$.

Simplest TD Method
$V(s_t) \leftarrow V(s_t) + \alpha \left[ r_{t+1} + \gamma V(s_{t+1}) - V(s_t) \right]$
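To make the first target concrete, here is a minimal sketch of constant-α every-visit Monte Carlo prediction. The `env` object (with `reset()` and `step(action)` returning `(next_state, reward, done)`) and the `policy` function are assumptions for illustration; they are not defined in the slides.

```python
from collections import defaultdict

def mc_prediction(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Constant-alpha every-visit MC: V(s) <- V(s) + alpha * [R_t - V(s)]."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        # Generate one complete episode before doing any update (no bootstrapping).
        s = env.reset()
        trajectory = []                    # (state, reward received on leaving it) pairs
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            trajectory.append((s, r))
            s = s_next
        # Work backwards to compute the actual return R_t following each visited state.
        g = 0.0
        for s, r in reversed(trajectory):
            g = r + gamma * g
            V[s] += alpha * (g - V[s])     # target: the actual return after time t
    return V
```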

cf. Dynamic Programming
$V(s_t) \leftarrow E_\pi\{ r_{t+1} + \gamma V(s_{t+1}) \}$

Comparing MC and Dynamic Programming
Bootstrapping: the update involves an estimate from a nearby state.
MC does not bootstrap: each state is estimated independently.
DP bootstraps: the estimate for each state depends on nearby states.
Sampling: the update involves an actual return.
MC samples: it learns from experience in the environment and gets an actual return R.
DP does not sample: it learns by spreading constraints until convergence through iterating the Bellman equations, without interaction with the environment!

TD methods bootstrap and sample
Bootstrapping: the update involves an estimate. MC does not bootstrap; DP bootstraps; TD bootstraps.
Sampling: the update involves an actual return. MC samples; DP does not sample; TD samples.

Policy Evaluation with TD(0)
Note the on-line nature: updates as you go.
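In contrast with the MC sketch above, TD(0) updates as you go, within the episode. The same hypothetical `env`/`policy` interface is assumed; the names are illustrative, not from the slides.

```python
from collections import defaultdict

def td0_policy_evaluation(env, policy, num_episodes=1000, alpha=0.1, gamma=1.0):
    """Tabular TD(0): V(s) <- V(s) + alpha * [r + gamma * V(s') - V(s)]."""
    V = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            s_next, r, done = env.step(policy(s))
            target = r + (0.0 if done else gamma * V[s_next])   # bootstraps on V(s')
            V[s] += alpha * (target - V[s])                     # on-line: update as you go
            s = s_next
    return V
```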

Example: Driving Home
Value of a state: the remaining time to get home, or "time to go".

State                 Elapsed Time (minutes)   Predicted Time to Go   Predicted Total Time
leaving office        0                        30                     30
reach car, raining    5                        35                     40
exit highway          20                       15                     35
behind truck          30                       10                     40
home street           40                       3                      43
arrive home           43                       0                      43

Driving Home
[Figure: changes recommended by Monte Carlo methods (α = 1) versus changes recommended by TD methods (α = 1).]

Advantages of TD Learning
TD methods do not require a model of the environment, only experience.
TD, but not MC, methods can be fully incremental:
You can learn before knowing the final outcome (less memory, less peak computation).
You can learn without the final outcome, from incomplete sequences.
Both MC and TD converge (under certain assumptions to be detailed later), but which is faster?

Random Walk Example
[Figure: the five-state random walk and the values learned by TD(0) after various numbers of episodes.]
What happened on the first episode?
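A self-contained sketch of the random walk example: five states A–E (indices 0–4), every episode starting in the middle state C, reward 0 everywhere except +1 for stepping off the right end, and all values initialized to 0.5 as in the figure. The function name and episode count are illustrative choices.

```python
import random

def td0_random_walk(num_episodes, alpha=0.1, gamma=1.0):
    """TD(0) on the five-state random walk; the true values are 1/6, 2/6, ..., 5/6."""
    V = [0.5] * 5                       # states A..E, all initialized to 0.5
    for _ in range(num_episodes):
        s = 2                           # every episode starts in C
        while True:
            s_next = s + random.choice([-1, 1])
            if s_next < 0:              # terminated off the left end: reward 0
                V[s] += alpha * (0.0 - V[s])
                break
            if s_next > 4:              # terminated off the right end: reward +1
                V[s] += alpha * (1.0 - V[s])
                break
            V[s] += alpha * (0.0 + gamma * V[s_next] - V[s])
            s = s_next
    return V

print(td0_random_walk(100))             # roughly [0.17, 0.33, 0.50, 0.67, 0.83]
```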

TD and MC on the Random Walk
[Figure: learning curves for TD(0) and constant-α MC on the random walk; data averaged over 100 sequences of episodes.]

Optimality of TD(0)
Batch updating: train completely on a finite amount of data, e.g., train repeatedly on 10 episodes until convergence.
Compute updates according to TD(0), but only update estimates after each complete pass through the data.
For any finite Markov prediction task, under batch updating, TD(0) converges for sufficiently small α.
Constant-α MC also converges under these conditions, but to a different answer!

Random Walk under Batch Updating
[Figure: after each new episode, all previous episodes were treated as a batch, and the algorithm was trained until convergence; all repeated 100 times.]

You are the Predictor
Suppose you observe the following 8 episodes (NOTE: these are not from the previous example!):
1. A, 0, B, 0, terminate
2. B, 1, terminate
3. B, 1, terminate
4. B, 1, terminate
5. B, 1, terminate
6. B, 1, terminate
7. B, 1, terminate
8. B, 0, terminate
V(A)? V(B)?
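One way to check the answer numerically before reading on: the sketch below runs batch TD(0) and batch constant-α MC to (approximate) convergence on these eight episodes. The `(state, reward, next_state)` encoding, the function names, and the choice γ = 1 are illustrative assumptions, not from the slides.

```python
# Each transition is (state, reward, next_state); next_state None marks termination.
episodes = [[("A", 0, "B"), ("B", 0, None)]] \
         + [[("B", 1, None)]] * 6 \
         + [[("B", 0, None)]]

def batch_td0(episodes, alpha=0.01, sweeps=20000, gamma=1.0):
    V = {"A": 0.0, "B": 0.0}
    for _ in range(sweeps):
        deltas = {s: 0.0 for s in V}             # accumulate increments over the whole batch
        for ep in episodes:
            for s, r, s_next in ep:
                target = r + (gamma * V[s_next] if s_next is not None else 0.0)
                deltas[s] += alpha * (target - V[s])
        for s in V:                              # apply only after a complete pass
            V[s] += deltas[s]
    return V

def batch_mc(episodes):
    # Batch constant-alpha MC converges to the average observed return from each state.
    returns = {"A": [], "B": []}
    for ep in episodes:
        g = 0.0
        for s, r, _ in reversed(ep):             # undiscounted return following each state
            g += r
            returns[s].append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}

print(batch_td0(episodes))   # approximately V(A) = 0.75, V(B) = 0.75
print(batch_mc(episodes))    # V(A) = 0.0,  V(B) = 0.75
```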

You are the Predictor
The prediction that best matches the training data is V(A) = 0.
This minimizes the mean-square error on the training set.
This is what a batch Monte Carlo method gets.
If we consider the sequentiality of the problem, then we would set V(A) = 0.75.
This is correct for the maximum-likelihood estimate of a Markov model generating the data; i.e., if we do a best-fit Markov model, assume it is exactly correct, and then compute what it predicts (how?).
This is called the certainty-equivalence estimate.
This is what TD(0) gets.

Learning an Action-Value Function
Estimate $Q^\pi$ for the current behavior policy $\pi$.

Sarsa: On-Policy TD Control
Turn this into a control method by always updating the policy to be greedy with respect to the current estimate.
After every transition from a nonterminal state $s_t$, do this:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$
If $s_{t+1}$ is terminal, then $Q(s_{t+1}, a_{t+1}) = 0$ (no action out of a terminal state!).
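A minimal sketch of tabular Sarsa with an ε-greedy behavior policy (one common way of being greedy with respect to the current estimate while still exploring). The `env` interface, the `actions` list, and the parameter values are assumptions for illustration.

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon pick a random action, otherwise a greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])

def sarsa(env, actions, num_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        a = epsilon_greedy(Q, s, actions, epsilon)
        done = False
        while not done:
            s_next, r, done = env.step(a)
            if done:
                target = r                       # no action out of a terminal state
            else:
                a_next = epsilon_greedy(Q, s_next, actions, epsilon)
                target = r + gamma * Q[(s_next, a_next)]
            Q[(s, a)] += alpha * (target - Q[(s, a)])
            if not done:
                s, a = s_next, a_next
    return Q
```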

Windy Gridworld
Undiscounted, episodic; reward = -1 on every step until the goal is reached.

Results of Sarsa on the Windy Gridworld
[Figure: learning curve of Sarsa on the windy gridworld.]

Q-Learning: Off-Policy TD Control
One-step Q-learning:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma \max_a Q(s_{t+1}, a) - Q(s_t, a_t) \right]$
Compare to Sarsa:
$Q(s_t, a_t) \leftarrow Q(s_t, a_t) + \alpha \left[ r_{t+1} + \gamma Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t) \right]$
This tiny difference in the update makes a BIG difference!
Q-learning learns $Q^*$, the optimal Q-values, with probability 1, no matter what policy is being followed (as long as it is exploratory and visits each state-action pair infinitely often, and the learning rates get smaller at the appropriate rate).
Hence this is off-policy: it finds $\pi^*$ while following another policy.
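A minimal sketch of one-step Q-learning, structured to highlight the only change from Sarsa: the bootstrap target uses the greedy next action rather than the action actually taken next. As before, the `env`/`actions` interface and the parameter values are illustrative assumptions.

```python
import random
from collections import defaultdict

def q_learning(env, actions, num_episodes=500, alpha=0.5, gamma=1.0, epsilon=0.1):
    Q = defaultdict(float)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            # Behavior policy: epsilon-greedy (any sufficiently exploratory policy works).
            if random.random() < epsilon:
                a = random.choice(actions)
            else:
                a = max(actions, key=lambda act: Q[(s, act)])
            s_next, r, done = env.step(a)
            # Target uses the max over next actions: greedy target policy, hence off-policy.
            best_next = 0.0 if done else max(Q[(s_next, act)] for act in actions)
            Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
            s = s_next
    return Q
```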

Cliffwalking
[Figure: the cliff-walking task; ε-greedy with ε = 0.1.]

Actor-Critic Methods
Explicit representation of the policy as well as the value function.
Minimal computation to select actions.
Can learn an explicit stochastic policy.
Can put constraints on policies.
Appealing as psychological and neural models.

Actor-Critic Details
Suppose we take action $a_t$ from state $s_t$, reaching state $s_{t+1}$.
The critic evaluates the action using the TD error:
$\delta_t = r_{t+1} + \gamma V(s_{t+1}) - V(s_t)$
If actions are determined by preferences $p(s,a)$ as follows:
$\pi_t(s,a) = \Pr\{ a_t = a \mid s_t = s \} = \frac{e^{p(s,a)}}{\sum_b e^{p(s,b)}}$ (a softmax policy),
then you can update the actor's preferences like this:
$p(s_t, a_t) \leftarrow p(s_t, a_t) + \beta \delta_t$

Dopamine Neurons and TD Error
[Figure: dopamine neuron recordings; W. Schultz et al., Université de Fribourg.]
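A minimal sketch of a tabular actor-critic following the updates above: the critic maintains V and computes δ, while the actor maintains preferences p(s, a) and samples actions from the softmax. The `env`/`actions` interface and the step sizes α (critic) and β (actor) are illustrative assumptions.

```python
import math
import random
from collections import defaultdict

def softmax_action(prefs, state, actions):
    """Sample an action with probability proportional to e^{p(s,a)}."""
    weights = [math.exp(prefs[(state, a)]) for a in actions]
    return random.choices(actions, weights=weights, k=1)[0]

def actor_critic(env, actions, num_episodes=500, alpha=0.1, beta=0.1, gamma=1.0):
    V = defaultdict(float)        # critic: state-value estimates
    prefs = defaultdict(float)    # actor: action preferences p(s, a)
    for _ in range(num_episodes):
        s = env.reset()
        done = False
        while not done:
            a = softmax_action(prefs, s, actions)
            s_next, r, done = env.step(a)
            delta = r + (0.0 if done else gamma * V[s_next]) - V[s]   # TD error
            V[s] += alpha * delta              # critic update
            prefs[(s, a)] += beta * delta      # actor update for the action actually taken
            s = s_next
    return prefs, V
```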

Can we apply these methods to continuing tasks without discounting?
Yes: Average Reward Per Time Step
Average expected reward per time step under policy $\pi$:
$\rho^\pi = \lim_{n \to \infty} \frac{1}{n} \sum_{t=1}^{n} E_\pi\{ r_t \}$ (the same for each state, if ergodic)
Value of a state relative to $\rho^\pi$:
$\tilde{V}^\pi(s) = \sum_{k=1}^{\infty} E_\pi\{ r_{t+k} - \rho^\pi \mid s_t = s \}$
Value of a state-action pair relative to $\rho^\pi$:
$\tilde{Q}^\pi(s,a) = \sum_{k=1}^{\infty} E_\pi\{ r_{t+k} - \rho^\pi \mid s_t = s, a_t = a \}$
These values code the transients away from the average...

R-Learning
[Figure: the R-learning algorithm; a sketch of its updates follows after the Afterstates slide below.]

Access-Control Queuing Task
n servers.
Customers have four different priorities, which pay a reward of 1, 2, 4, or 8 if served.
At each time step, the customer at the head of the queue is accepted (assigned to a server) or removed from the queue.
The proportion of randomly distributed high-priority customers in the queue is h.
A busy server becomes free with probability p on each time step.
Statistics of arrivals and departures are unknown.
Apply R-learning with n = 10, h = 0.5, p = 0.06.

Afterstates
Usually, a state-value function evaluates states in which the agent can take an action.
But sometimes it is useful to evaluate states after the agent has acted, as in tic-tac-toe.
Why is this useful? What is this in general?
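The R-learning algorithm box itself is not in this transcription, so the following is a hedged sketch based on the standard description: Q-values relative to an average-reward estimate ρ, updated off-policy, with ρ adjusted only on greedy steps. The continuing-task `env` interface (`step` returns only `(next_state, reward)`), the names, and the step sizes are assumptions.

```python
import random
from collections import defaultdict

def r_learning(env, actions, num_steps=100000, alpha=0.1, beta=0.01, epsilon=0.1):
    """Tabular R-learning: action values are relative to the average reward rho."""
    Q = defaultdict(float)
    rho = 0.0                                    # running estimate of average reward per step
    s = env.reset()
    for _ in range(num_steps):                   # continuing task: no episode boundaries
        if random.random() < epsilon:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: Q[(s, act)])
        best_here = max(Q[(s, act)] for act in actions)
        was_greedy = Q[(s, a)] == best_here      # was the chosen action a greedy one?
        s_next, r = env.step(a)
        best_next = max(Q[(s_next, act)] for act in actions)
        Q[(s, a)] += alpha * (r - rho + best_next - Q[(s, a)])
        if was_greedy:                           # adjust rho only on non-exploratory steps
            rho += beta * (r - rho + best_next - best_here)
        s = s_next
    return Q, rho
```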

Summary
TD prediction: introduced one-step tabular model-free TD methods.
Extend prediction to control by employing some form of GPI.
On-policy control: Sarsa.
Off-policy control: Q-learning and R-learning.
These methods bootstrap and sample, combining aspects of DP and MC methods.

Questions
What can I tell you about RL?
What is common to all three classes of methods: DP, MC, TD?
What are the principal strengths and weaknesses of each?
In what sense is our RL view complete? In what senses is it incomplete? What are the principal things missing?
The broad applicability of these ideas.
What does the term "bootstrapping" refer to?
What is the relationship between DP and learning?

The Book
Part I: The Problem
Introduction
Evaluative Feedback
The Reinforcement Learning Problem
Part II: Elementary Solution Methods
Dynamic Programming
Monte Carlo Methods
Temporal Difference Learning
Part III: A Unified View
Eligibility Traces
Generalization and Function Approximation
Planning and Learning
Dimensions of Reinforcement Learning
Case Studies

END

Unified View