Reinforcement Learning Part 2

Reinforcement Learning Part 2 Dipendra Misra Cornell University dkm@cs.cornell.edu https://dipendramisra.wordpress.com/

From previous tutorial Reinforcement Learning Exploration No supervision Agent-Reward-Environment Policy MDP Consistency Equation Optimal Policy Optimality Condition Bellman Backup Operator Iterative Solution

Interaction with the environment: action → reward + new environment state. Scalar reward. Setup from Lenz et al. 2014

Rollout $\langle s_1, a_1, r_1, s_2, a_2, r_2, s_3, \ldots, a_n, r_n, s_n \rangle$. Setup from Lenz et al. 2014

Setup: action $a_t$ takes $s_t \to s_{t+1}$ with reward $r_t$ (e.g., \$1)

Policy $\pi(s, a) = 0.9$

From previous tutorial: An optimal policy $\pi^*$ exists such that $V^{\pi^*}(s) \geq V^{\pi}(s) \;\; \forall s \in S, \forall \pi$.
Bellman's self-consistency equation: $V^{\pi}(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{s,s'} \left[ R^a_{s,s'} + \gamma V^{\pi}(s') \right]$
Bellman's optimality condition: $V^{*}(s) = \max_a \sum_{s'} P^a_{s,s'} \left\{ R^a_{s,s'} + \gamma V^{*}(s') \right\}$
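To make the self-consistency equation concrete, here is a minimal policy-evaluation sketch. It assumes a small tabular MDP where the transition tensor P[s, a, s'] and reward tensor R[s, a, s'] are known arrays; the function name and the toy MDP are illustrative, not from the slides.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, tol=1e-8):
    """Iterate V(s) = sum_a pi(s,a) sum_s' P[s,a,s'] (R[s,a,s'] + gamma V(s'))."""
    V = np.zeros(P.shape[0])
    while True:
        # Expected reward plus discounted next-state value for each (s, a),
        # then averaged over the policy's action probabilities.
        Q = np.einsum('san,san->sa', P, R + gamma * V[None, None, :])
        V_new = np.sum(pi * Q, axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

# Toy 2-state, 2-action MDP with a uniform random policy.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[0, 1, 1] = P[1, 0, 0] = P[1, 1, 1] = 1.0
R = np.ones((2, 2, 2))        # reward of 1 on every transition
pi = np.full((2, 2), 0.5)     # pi(s, a) = 0.5
print(policy_evaluation(P, R, pi))  # approx. 1 / (1 - 0.9) = 10 for both states
```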

Solving MDP To solve an MDP (or RL problem) is to find an optimal policy

Dynamic Programming Solution: Initialize $V_0$ randomly; repeat $V_{t+1} = T V_t$ while $\|V_{t+1} - V_t\|_\infty > \epsilon$; return $V_{t+1}$. The Bellman backup operator $T : V \to V$ is $(TV)(s) = \max_a \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V(s') \}$
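A sketch of this value-iteration loop, again assuming known tabular P and R arrays (hypothetical inputs, not from the slides); it applies the backup operator T until the sup-norm change drops below eps.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, eps=1e-8):
    """Apply (TV)(s) = max_a sum_s' P[s,a,s'] (R[s,a,s'] + gamma V(s')) to convergence."""
    V = np.random.rand(P.shape[0])        # initialize V_0 randomly
    while True:
        Q = np.einsum('san,san->sa', P, R + gamma * V[None, None, :])
        V_next = Q.max(axis=1)            # V_{t+1} = T V_t
        if np.max(np.abs(V_next - V)) <= eps:
            return V_next
        V = V_next
```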

From previous tutorial Reinforcement Learning Exploration No supervision Agent-Reward-Environment Policy MDP Consistency Equation Optimal Policy Optimality Condition Bellman Backup Operator Iterative Solution

Dynamic Programming Solution: Initialize $V_0$ randomly; repeat $V_{t+1} = T V_t$ while $\|V_{t+1} - V_t\|_\infty > \epsilon$; return $V_{t+1}$, where $(TV)(s) = \max_a \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma V(s') \}$. Problem?

Learning from rollouts: Step 1: gather experience using a behaviour policy. Step 2: update the value functions of an estimation policy.

On-Policy and Off-Policy: On-policy methods: the behaviour and estimation policies are the same. Off-policy methods: the behaviour and estimation policies can be different. Advantage?

Behaviour Policy: Encourage exploration of the search space. Epsilon-greedy policy: $\pi(s, a) = 1 - \epsilon + \frac{\epsilon}{|A(s)|}$ if $a = \arg\max_{a'} Q(s, a')$, and $\frac{\epsilon}{|A(s)|}$ otherwise.
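A minimal epsilon-greedy sketch over a tabular Q array (shapes and names are assumptions). Sampling a uniformly random action with probability epsilon and the greedy action otherwise matches the probabilities above, since the greedy action also receives its epsilon/|A(s)| share from the uniform draw.

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng=None):
    """Pick an action for state s from tabular Q of shape [n_states, n_actions]."""
    if rng is None:
        rng = np.random.default_rng()
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore uniformly
    return int(np.argmax(Q[s]))               # exploit the current estimate
```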

Temporal Difference Method
$Q^{\pi}(s, a) = \mathbb{E}\left[ \sum_{t \geq 0} \gamma^t r_{t+1} \mid s_1 = s, a_1 = a \right] = \mathbb{E}\left[ r_1 + \gamma \left( \sum_{t \geq 0} \gamma^t r_{t+2} \right) \mid s_1 = s, a_1 = a \right] = \mathbb{E}\left[ r_1 + \gamma Q^{\pi}(s_2, a_2) \mid s_1 = s, a_1 = a \right]$
Update: $Q(s, a) \leftarrow (1 - \alpha) Q(s, a) + \alpha \left( r_1 + \gamma Q(s_2, a_2) \right)$. A combination of Monte Carlo and dynamic programming.

SARSA: Converges w.p. 1 to an optimal policy as long as all state-action pairs are visited infinitely many times and epsilon eventually decays to 0, i.e. the policy becomes greedy. On or off policy?
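A sketch of SARSA built on the TD update above, assuming a hypothetical environment with a Gym-style reset()/step() interface, hashable discrete states, and env.action_space.n actions. It is on-policy: the epsilon-greedy behaviour policy is also the policy whose Q-values are estimated.

```python
import numpy as np
from collections import defaultdict

def sarsa(env, n_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """On-policy TD control with an epsilon-greedy behaviour/estimation policy."""
    rng = np.random.default_rng()
    Q = defaultdict(lambda: np.zeros(env.action_space.n))

    def act(s):
        if rng.random() < epsilon:
            return int(rng.integers(env.action_space.n))
        return int(np.argmax(Q[s]))

    for _ in range(n_episodes):
        s = env.reset()
        a = act(s)
        done = False
        while not done:
            s2, r, done, _ = env.step(a)
            a2 = act(s2)
            # Q <- (1 - alpha) Q + alpha (r + gamma Q(s', a'))
            target = r + (0.0 if done else gamma * Q[s2][a2])
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s2, a2
    return Q
```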

Q-Learning: Resemblance to the Bellman optimality condition: $Q^{*}(s, a) = \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma \max_{a'} Q^{*}(s', a') \}$. For a proof of convergence see: http://users.isr.ist.utl.pt/~mtjspaan/readinggroup/proofqlearning.pdf On or off policy?
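For comparison, a Q-learning sketch under the same hypothetical Gym-style interface. The only change from SARSA is the bootstrap target: it uses max over next actions (the greedy estimation policy) while behaviour stays epsilon-greedy, which is what makes it off-policy.

```python
import numpy as np
from collections import defaultdict

def q_learning(env, n_episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Off-policy TD control: epsilon-greedy behaviour, greedy bootstrap target."""
    rng = np.random.default_rng()
    Q = defaultdict(lambda: np.zeros(env.action_space.n))
    for _ in range(n_episodes):
        s, done = env.reset(), False
        while not done:
            if rng.random() < epsilon:                 # behaviour policy
                a = int(rng.integers(env.action_space.n))
            else:
                a = int(np.argmax(Q[s]))
            s2, r, done, _ = env.step(a)
            # Target bootstraps from max_a' Q(s', a'), not the action actually taken next.
            target = r + (0.0 if done else gamma * np.max(Q[s2]))
            Q[s][a] += alpha * (target - Q[s][a])
            s = s2
    return Q
```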

Summary SARSA and Q-Learning On vs Off policy. Epsilon greedy policy.

What we learned Solving Reinforcement Learning Dynamic Programming Soln. Temporal Difference Learning Bellman Backup Operator SARSA Q-Learning Iterative Solution

Another Approach: So far the policy is implicitly defined using value functions. Can't we directly work with policies?

Policy Gradient Methods: Parameterized policy $\pi_\theta(s, a)$. Optimization: $\max_\theta J(\theta)$ where $J(\theta) = \mathbb{E}_{\pi_\theta(s,a)}\left[ \sum_t \gamma^t r_{t+1} \right]$. Gradient descent. Smoothly evolving policy. How to obtain a gradient estimator? On or off policy?

Finite Difference Method: $\frac{\partial J(\theta)}{\partial \theta_i} \approx \frac{J(\theta + \epsilon e_i) - J(\theta - \epsilon e_i)}{2\epsilon}$, with update $\theta_i^{t+1} \leftarrow \theta_i^{t} + \alpha \frac{\partial J(\theta^t)}{\partial \theta_i^t}$. Easy to implement and works for all policies. Problem?
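A sketch of the central-difference estimator, assuming J is any callable that returns (an estimate of) the expected return for parameter vector theta; the names are illustrative.

```python
import numpy as np

def finite_difference_gradient(J, theta, eps=1e-4):
    """Estimate dJ/dtheta_i as (J(theta + eps e_i) - J(theta - eps e_i)) / (2 eps)."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        grad[i] = (J(theta + e) - J(theta - e)) / (2 * eps)
    return grad
```

The problem hinted at on the slide: this needs 2 * dim(theta) rollouts per gradient step, and each evaluation of J(theta) is itself a noisy Monte Carlo estimate.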

Likelihood Ratio Trick: $J(\theta) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[R(\tau)] = \sum_\tau R(\tau) \, p_\theta(\tau)$, with objective $\max_\theta J(\theta)$. Then $\nabla_\theta J(\theta) = \sum_\tau R(\tau) \nabla_\theta p_\theta(\tau) = \sum_\tau R(\tau) \, p_\theta(\tau) \nabla_\theta \log p_\theta(\tau) = \mathbb{E}_{\tau \sim p_\theta(\tau)}[R(\tau) \nabla_\theta \log p_\theta(\tau)] = \mathbb{E}_{\tau \sim p_\theta(\tau)}[(R(\tau) - b) \nabla_\theta \log p_\theta(\tau)] \;\; \forall b$
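A tiny numerical illustration of the score-function estimator, using a one-step softmax "policy" over three actions (all numbers are made up). It demonstrates the last equality: subtracting a baseline b leaves the gradient estimate unbiased and only changes its variance.

```python
import numpy as np

rng = np.random.default_rng(0)
rewards = np.array([1.0, 2.0, 5.0])   # R(tau) for each one-step trajectory
theta = np.zeros(3)                   # softmax policy parameters

def grad_estimate(theta, n_samples=1000, baseline=0.0):
    """Monte Carlo estimate of E[(R(tau) - b) * grad log p_theta(tau)]."""
    p = np.exp(theta) / np.exp(theta).sum()
    grads = []
    for _ in range(n_samples):
        a = rng.choice(3, p=p)
        grad_log_p = -p.copy()
        grad_log_p[a] += 1.0           # gradient of log softmax at the sampled action
        grads.append((rewards[a] - baseline) * grad_log_p)
    return np.mean(grads, axis=0), np.var(grads, axis=0)

g_no_baseline, var_no_baseline = grad_estimate(theta, baseline=0.0)
g_baseline, var_baseline = grad_estimate(theta, baseline=rewards.mean())
# Both means target the same gradient; the baseline version has lower variance.
```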

REINFORCE (Multi-Step). Policy gradient theorem: $\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta(s,a)}\left[ \nabla_\theta \log \pi_\theta(s, a) \, Q^{\pi_\theta}(s, a) \right]$. Algorithm: initialize $\theta$; for each episode $\langle s_1, a_1, r_1, s_2, a_2, r_2, s_3, \ldots, a_n, r_n, s_n \rangle \sim \pi_\theta$: for $t \in \{1, \ldots, n\}$: $v_t \leftarrow Q^{\pi_\theta}(s_t, a_t)$, $\theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(s_t, a_t) \, v_t$; return $\theta$. Content from David Silver.
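A sketch of multi-step REINFORCE with a tabular softmax policy, assuming a hypothetical discrete Gym-style environment (observation_space.n states, action_space.n actions). Here v_t is the sampled return from step t, which serves as an unbiased estimate of Q^{pi_theta}(s_t, a_t).

```python
import numpy as np

def reinforce(env, n_episodes=2000, alpha=0.01, gamma=0.99):
    """Monte Carlo policy gradient: theta += alpha * grad log pi(s_t, a_t) * v_t."""
    rng = np.random.default_rng()
    n_s, n_a = env.observation_space.n, env.action_space.n
    theta = np.zeros((n_s, n_a))

    def policy(s):
        p = np.exp(theta[s] - theta[s].max())   # stable softmax over actions
        return p / p.sum()

    for _ in range(n_episodes):
        s, done, episode = env.reset(), False, []
        while not done:
            a = rng.choice(n_a, p=policy(s))
            s2, r, done, _ = env.step(a)
            episode.append((s, a, r))
            s = s2
        g = 0.0
        for s, a, r in reversed(episode):       # compute returns v_t backwards
            g = r + gamma * g
            grad_log_pi = -policy(s)
            grad_log_pi[a] += 1.0
            theta[s] += alpha * grad_log_pi * g
    return theta
```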

Summary SARSA and Q-Learning On vs Off policy. Epsilon greedy policy. Policy Gradient Methods

What we learned Solving Reinforcement Learning Dynamic Programming Soln. Temporal Difference Learning Bellman Backup Operator SARSA Q-Learning Iterative Solution Policy Gradient Methods Finite difference method Reinforce

What we did not cover: generalized policy iteration, the simple Monte Carlo solution, the TD($\lambda$) algorithm, convergence of Q-learning and SARSA, actor-critic methods.

Application

Playing Atari game with Deep RL State is given by raw images. Learn a good policy for a given game.

Playing Atari game with Deep RL: Approximate $Q^{*}(s, a)$ with a network $Q(s, a, \theta)$. $Q^{*}(s, a) = \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma \max_{a'} Q^{*}(s', a') \} = R^a_{s,s'} + \gamma \max_{a'} Q^{*}(s', a')$ (single next state $s'$ per $(s, a)$). Network: conv + relu, conv + relu, FC + relu, FC $\to Q(s, a, \theta)$.

Playing Atari game with Deep RL: $Q^{*}(s, a) = R^a_{s,s'} + \gamma \max_{a'} Q^{*}(s', a')$, so train $Q(s, a, \theta) \to R^a_{s,s'} + \gamma \max_{a'} Q(s', a', \theta)$ by minimizing $\left( Q(s, a, \theta_t) - R^a_{s,s'} - \gamma \max_{a'} Q(s', a', \theta_{t-1}) \right)^2$. Network: conv + relu, conv + relu, FC + relu, FC $\to Q(s, a, \theta)$. Nothing deep about their RL.
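A PyTorch-flavoured sketch of this objective, assuming 84x84 inputs with 4 stacked frames; the layer sizes are illustrative rather than the exact DQN architecture, and target_net holds the frozen previous parameters theta_{t-1}.

```python
import torch
import torch.nn as nn

def make_q_net(n_actions):
    # conv + relu -> conv + relu -> FC + relu -> FC, in the spirit of the slide
    # (illustrative sizes for 4 x 84 x 84 inputs, not the paper's exact network).
    return nn.Sequential(
        nn.Conv2d(4, 16, kernel_size=8, stride=4), nn.ReLU(),
        nn.Conv2d(16, 32, kernel_size=4, stride=2), nn.ReLU(),
        nn.Flatten(),
        nn.Linear(32 * 9 * 9, 256), nn.ReLU(),
        nn.Linear(256, n_actions),
    )

def dqn_loss(q_net, target_net, batch, gamma=0.99):
    """Squared TD error (Q(s,a;theta_t) - (r + gamma max_a' Q(s',a';theta_{t-1})))^2.
    batch = (s, a, r, s2, done) with float image tensors, long actions, float dones."""
    s, a, r, s2, done = batch
    q_sa = q_net(s).gather(1, a.unsqueeze(1)).squeeze(1)
    with torch.no_grad():                         # target uses frozen parameters
        target = r + gamma * (1 - done) * target_net(s2).max(dim=1).values
    return nn.functional.mse_loss(q_sa, target)
```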

Playing Atari game with Deep RL

Playing Atari game with Deep RL: Why replay memory? To break the correlation between consecutive data points.
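A minimal replay-memory sketch (names are illustrative): transitions are stored as they occur and later sampled uniformly at random, so consecutive, highly correlated frames from one episode rarely land in the same minibatch.

```python
import random
from collections import deque

class ReplayMemory:
    """Fixed-size buffer of past transitions sampled uniformly for training."""
    def __init__(self, capacity=100_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions are evicted first

    def push(self, s, a, r, s2, done):
        self.buffer.append((s, a, r, s2, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        return tuple(zip(*batch))   # (states, actions, rewards, next_states, dones)
```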

Playing Atari game with Deep RL

Why Deep RL is hard: $Q^{*}(s, a) = \sum_{s'} P^a_{s,s'} \{ R^a_{s,s'} + \gamma \max_{a'} Q^{*}(s', a') \}$. The recursive equation blows up when the difference between $s$ and $s'$ is small. Too many iterations are required for convergence: 10 million frames for an Atari game. It may take too long to see a high-reward action.

Learning to Search: It may take too long to see a high reward. Ease the learning using a reference policy: exploit a reference policy $\pi_{\text{ref}}(s, a)$ to search the space better, alongside the learned policy $\pi(s, a)$.

Summary SARSA and Q-Learning On vs Off policy. Epsilon greedy policy. Policy Gradient Methods Playing Atari game using deep reinforcement learning Why deep RL is hard. Learning to search.