
ME 537: Learning-Based Control
Week 6, Lecture 1: Reinforcement Learning (part 3)

Announcements:
- HW 3 due on 11/5 at noon
- Midterm exam on 11/8
- Project draft due 11/15

Suggested reading: Chapters 7-8 in Reinforcement Learning (Sutton and Barto). Lecture notes based on Reinforcement Learning, Sutton and Barto, 1998.

Eligibility Traces

Eligibility traces are the basic mechanism for assigning credit in reinforcement learning. In TD(λ), the λ refers to an eligibility trace.

Basic idea:
- Determine how many rewards will be used in evaluating a value function
- Determine which states are eligible for an update
- Connect TD learning and Monte Carlo learning

Eligibility Traces

Forward view:
- Theoretical
- Connects TD learning and Monte Carlo learning
  o In TD learning, only the previous state is eligible
  o In Monte Carlo, all visited states are eligible
- Intuition: how far ahead do you need to look to figure out the value of a state?

Backward view:
- Mechanistic
- A temporary record of events
  o Visiting states, taking actions, receiving rewards
- Intuition: how many state-value pairs do I need to store to figure out the value of a state?

Forward View

Recall Monte Carlo methods for a T-step episode:

    R_t = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... + γ^{T-t-1} r_T

- The update is based on all of the returns.

In the TD backup:

    R_t^(1) = r_{t+1} + γ V_t(s_{t+1})

- The update is based on a 1-step backup.

Why restrict ourselves to these two extremes?

Forward View: n-step TD

Why not look at a 2-step backup:

    R_t^(2) = r_{t+1} + γ r_{t+2} + γ^2 V_t(s_{t+2})

How about an n-step backup?

    R_t^(n) = r_{t+1} + γ r_{t+2} + γ^2 r_{t+3} + ... + γ^{n-1} r_{t+n} + γ^n V_t(s_{t+n})

This is called the corrected n-step truncated return, or simply the n-step return.

(Figure: the forward view, from Sutton & Barto.)
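To make the n-step target concrete, here is a minimal Python sketch (not from the lecture; the function and variable names are illustrative) that computes the corrected n-step truncated return from a recorded episode, assuming rewards[k] stores r_{k+1} and state_values[k] stores V(s_k).

```python
def n_step_return(rewards, state_values, t, n, gamma):
    """Corrected n-step truncated return:
    R_t^(n) = r_{t+1} + gamma*r_{t+2} + ... + gamma^(n-1)*r_{t+n} + gamma^n*V(s_{t+n})."""
    T = len(rewards)                 # episode length
    n = min(n, T - t)                # cannot look past the end of the episode
    ret = sum(gamma**k * rewards[t + k] for k in range(n))
    if t + n < T:                    # bootstrap only if s_{t+n} is non-terminal
        ret += gamma**n * state_values[t + n]
    return ret
```

Setting n = 1 recovers the one-step TD target, while n >= T - t recovers the full Monte Carlo return, which is exactly the spectrum the forward view interpolates over.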

Backward View

A causal, incremental view that is implementable. Introduce a new memory variable, the eligibility trace e_t(s):
- On each step, the eligibility trace of every state decays by γλ
- The eligibility trace of the state just visited is incremented by 1

    e_t(s) = γλ e_{t-1}(s)          if s ≠ s_t
    e_t(s) = γλ e_{t-1}(s) + 1      if s = s_t

- λ is the trace-decay parameter
- γ is the discount rate used previously

Intuition

At any time, the eligibility traces keep track of which states have been recently visited, where "recently" is quantified by γλ. The traces capture how eligible each state is for an update.

Recall basic reinforcement learning:

    NewEstimate ← OldEstimate + StepSize × (Target - OldEstimate)

Now, update all values based on the eligibility trace.

Implementation

For TD, the error was:

    δ_t = r_{t+1} + γ V_t(s_{t+1}) - V_t(s_t)

Then the update becomes:

    V_{t+1}(s) = V_t(s) + α δ_t e_t(s)    for all s ∈ S

Special Cases

Let's explore two special cases of this update rule:

    V_{t+1}(s) = V_t(s) + α δ_t e_t(s)    for all s ∈ S

For λ = 0:
- All traces are 0 at time t except for the state visited at t, s_t
- The rule reduces to the ordinary TD update

For λ = 1:
- All visited states receive credit, decaying by γ
- This gives the Monte Carlo updates
  o If γ = 1 as well, there is no decay in time and we have undiscounted Monte Carlo updates
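Putting the backward view together, a minimal sketch of one episode of tabular TD(λ) with accumulating traces might look like the following. The env.reset/env.step interface, the policy callable, and the variable names are assumptions made for illustration, not part of the lecture.

```python
import numpy as np

def td_lambda_episode(env, V, policy, alpha, gamma, lam):
    """One episode of tabular TD(lambda), backward view, accumulating traces."""
    e = np.zeros_like(V)                      # eligibility trace e_t(s), one entry per state
    s = env.reset()
    done = False
    while not done:
        s_next, r, done = env.step(policy(s))
        # TD error: delta_t = r_{t+1} + gamma*V(s_{t+1}) - V(s_t)
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
        e *= gamma * lam                      # decay every trace by gamma*lambda
        e[s] += 1.0                           # bump the trace of the visited state
        V += alpha * delta * e                # update all states by their eligibility
        s = s_next
    return V
```

With lam = 0 the trace vector is nonzero only at s_t and the loop reduces to the ordinary TD(0) update; with lam = 1 (and γ = 1) every visited state keeps full credit, matching the undiscounted Monte Carlo case above.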

Forward and Backward Views

(Figures from Sutton & Barto illustrating the forward view and the backward view.)

Sarsa(λ)

Recall Sarsa learning:

    Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)]

Now, we have:

    Q_{t+1}(s, a) = Q_t(s, a) + α δ_t e_t(s, a)    for all s, a

where

    δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)

    e_t(s, a) = γλ e_{t-1}(s, a) + 1    if s = s_t and a = a_t
    e_t(s, a) = γλ e_{t-1}(s, a)        otherwise

Example: Sarsa(λ)

Recall that with one-step Sarsa updates only the last state-action pair receives an update; all other state-action pairs are updated when they are next visited/taken.

(Figure: a gridworld showing the path taken, the action values increased by one-step Sarsa, and the action values increased by Sarsa(λ) with λ = 0.9.)
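A minimal sketch of one episode of tabular Sarsa(λ) with accumulating traces is given below, assuming Q is a |S| x |A| NumPy array and an ε-greedy behaviour policy; the environment interface and helper names are illustrative assumptions.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng):
    """Pick a random action with probability epsilon, else the greedy one."""
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))
    return int(np.argmax(Q[s]))

def sarsa_lambda_episode(env, Q, alpha, gamma, lam, epsilon, rng):
    """One episode of tabular Sarsa(lambda) with accumulating traces."""
    e = np.zeros_like(Q)                          # e_t(s, a) for every pair
    s = env.reset()
    a = epsilon_greedy(Q, s, epsilon, rng)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = epsilon_greedy(Q, s_next, epsilon, rng)
        # delta_t = r_{t+1} + gamma*Q(s_{t+1}, a_{t+1}) - Q(s_t, a_t)
        delta = r + (0.0 if done else gamma * Q[s_next, a_next]) - Q[s, a]
        e *= gamma * lam                          # decay all traces
        e[s, a] += 1.0                            # bump the pair just taken
        Q += alpha * delta * e                    # update every (s, a) pair
        s, a = s_next, a_next
    return Q
```

This is what produces the behaviour in the gridworld figure: a single reward at the end of the path updates every state-action pair along the way, in proportion to how recently it was taken.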

Q(λ)

Recall Q-learning:

    Q(s_t, a_t) ← Q(s_t, a_t) + α [r_{t+1} + γ max_a Q(s_{t+1}, a) - Q(s_t, a_t)]

Now, we have:

    Q_{t+1}(s, a) = Q_t(s, a) + α δ_t e_t(s, a)    for all s, a

where

    δ_t = r_{t+1} + γ max_a Q_t(s_{t+1}, a) - Q_t(s_t, a_t)

    e_t(s, a) = I_{s=s_t} I_{a=a_t} + γλ e_{t-1}(s, a)    if Q_{t-1}(s_t, a_t) = max_a Q_{t-1}(s_t, a)
    e_t(s, a) = I_{s=s_t} I_{a=a_t}                       otherwise

Q(λ)

In Q-learning with eligibility traces the update works like Sarsa(λ) learning, except that the eligibility trace is set to zero after an exploratory (non-greedy) move.

Intuition: a two-step process.
- Step 1:
  o Either every state-action pair's trace is decayed by γλ,
  o Or all traces are set to 0 after an exploratory step.
- Step 2:
  o The trace for the current state-action pair is incremented by 1.

(This is Watkins's Q(λ); there are other versions as well.)
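A minimal sketch of a single Watkins's Q(λ) backup, matching the two-step intuition above (traces are cut to zero after an exploratory move), is shown below. The argument names and the way the caller supplies the next action are assumptions for illustration, and the caller is assumed to reset e to zeros at the start of each episode.

```python
import numpy as np

def watkins_q_lambda_step(Q, e, s, a, r, s_next, a_next, done, alpha, gamma, lam):
    """One backup of Watkins's Q(lambda). Q and e are |S| x |A| arrays."""
    a_greedy = int(np.argmax(Q[s_next]))
    # delta_t = r_{t+1} + gamma * max_a Q(s_{t+1}, a) - Q(s_t, a_t)
    delta = r + (0.0 if done else gamma * Q[s_next, a_greedy]) - Q[s, a]
    e[s, a] += 1.0                    # step 2: mark the current state-action pair
    Q += alpha * delta * e            # update all pairs by their eligibility
    if a_next == a_greedy:            # step 1, prepared for the next backup:
        e *= gamma * lam              #   greedy move: decay the traces
    else:
        e[:] = 0.0                    #   exploratory move: cut the traces to zero
    return Q, e
```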

Function Approximation

So far, value functions were tables:
- Tables get large
- Tables are slow to access
- Tables are hard to fill

We need to use a subset of state-action pairs to learn the value function.

Function approximation:
- Neural network
- Linear
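For instance, a linear approximator replaces the table with a parameter vector θ and a feature vector φ(s); a minimal sketch (the class and method names are illustrative, not a prescribed design) is:

```python
import numpy as np

class LinearValueFunction:
    """V_t(s) = theta^T phi(s) for a user-supplied feature vector phi(s)."""

    def __init__(self, num_features):
        self.theta = np.zeros(num_features)      # learned parameters

    def value(self, phi_s):
        return float(np.dot(self.theta, phi_s))  # estimated value of the state

    def gradient(self, phi_s):
        # For a linear approximator, the gradient w.r.t. theta is just phi(s)
        return phi_s
```

A neural-network approximator would replace the dot product with a forward pass and the gradient with backpropagation; the gradient-descent updates that follow apply to either choice.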

Value Prediction

Use the mean squared error to evaluate performance:

    MSE(θ_t) = Σ_{s∈S} P(s) [V^π(s) - V_t(s)]^2

- V_t is the value function at time step t
- V^π is the true (target) value function we are trying to approximate
- P(s) is a probability distribution over states; it makes sure the MSE focuses on the states that matter

Gradient Descent

    MSE(θ_t) = Σ_{s∈S} P(s) [V^π(s) - V_t(s)]^2

Basic parameter update (based on a uniform P(s)):

    θ_{t+1} = θ_t - (1/2) α ∇_θ [V^π(s_t) - V_t(s_t)]^2

- Update the parameters based on the negative of the gradient

This becomes:

    θ_{t+1} = θ_t + α [V^π(s_t) - V_t(s_t)] ∇_θ V_t(s_t)

where ∇_θ V_t(s_t) is the vector of partial derivatives of V_t(s_t) with respect to the parameters θ.

Gradient Descent

We have one problem with the parameter update:

    θ_{t+1} = θ_t + α [V^π(s_t) - V_t(s_t)] ∇_θ V_t(s_t)

We don't know the true value function V^π!

Gradient Descent without the Value Function

Solution: use an estimate of the true value in place of V^π(s_t).

- Use the Monte Carlo estimate R_t (an unbiased estimate):

    θ_{t+1} = θ_t + α [R_t - V_t(s_t)] ∇_θ V_t(s_t)

- Use the TD estimate R_t^λ (a biased estimate):

    θ_{t+1} = θ_t + α [R_t^λ - V_t(s_t)] ∇_θ V_t(s_t)

Control with Function Approximation

Gradient descent for state-action values:

    θ_{t+1} = θ_t + α [Q^π(s_t, a_t) - Q_t(s_t, a_t)] ∇_θ Q_t(s_t, a_t)
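A minimal sketch of this prediction update with a linear approximator (names are illustrative; target stands for either the Monte Carlo return R_t or the TD(λ) return R_t^λ) could look like:

```python
import numpy as np

def gradient_value_update(theta, phi_s, target, alpha):
    """theta <- theta + alpha * [target - V_t(s)] * grad_theta V_t(s),
    for a linear value function V_t(s) = theta^T phi(s)."""
    v = float(np.dot(theta, phi_s))   # current estimate V_t(s)
    grad = phi_s                      # gradient of a linear V_t w.r.t. theta
    return theta + alpha * (target - v) * grad

# Monte Carlo prediction: target = R_t, computed after the episode ends (unbiased).
# TD(0) prediction:       target = r + gamma * np.dot(theta, phi_s_next) (biased).
```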

Gradient Descent Sarsa

Gradient descent for state-action values:

    θ_{t+1} = θ_t + α [Q^π(s_t, a_t) - Q_t(s_t, a_t)] ∇_θ Q_t(s_t, a_t)
            = θ_t + α [r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)] ∇_θ Q_t(s_t, a_t)

Gradient descent Sarsa:
- Uses a 1-step backup
- Can be extended to eligibility traces

Gradient Descent Sarsa(λ)

Gradient descent Sarsa(λ):

    θ_{t+1} = θ_t + α δ_t e_t

where

    δ_t = r_{t+1} + γ Q_t(s_{t+1}, a_{t+1}) - Q_t(s_t, a_t)
    e_t = γλ e_{t-1} + ∇_θ Q_t(s_t, a_t)
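A minimal sketch of one gradient-descent Sarsa(λ) step with a linear action-value function Q_t(s, a) = θ^T φ(s, a) follows; the feature vectors φ(s, a) and the argument names are assumptions made for illustration.

```python
import numpy as np

def gd_sarsa_lambda_step(theta, e, phi_sa, phi_sa_next, r, done, alpha, gamma, lam):
    """One step of gradient-descent Sarsa(lambda) with linear features.
    e should be initialized to zeros at the start of each episode."""
    q = float(np.dot(theta, phi_sa))                     # Q_t(s_t, a_t)
    q_next = 0.0 if done else float(np.dot(theta, phi_sa_next))
    delta = r + gamma * q_next - q                       # TD error delta_t
    e = gamma * lam * e + phi_sa                         # e_t = gamma*lam*e_{t-1} + grad_theta Q_t
    theta = theta + alpha * delta * e                    # parameter update
    return theta, e
```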