Reinforcement Learning

Reinforcement Learning. Dipendra Misra, Cornell University. dkm@cs.cornell.edu, https://dipendramisra.wordpress.com/

Task: Grasp the green cup. Output: sequence of controller actions. Setup from Lenz et al. 2014.

Supervised Learning: Grasp the green cup. Expert demonstrations. Setup from Lenz et al. 2014.

Supervised Learning: Grasp the green cup. Problem? Expert demonstrations. Setup from Lenz et al. 2014.

Supervised Learning: Grasp the cup. Problem? Training data vs. test data; no exploration.

Exploring the environment

What is reinforcement learning? "Reinforcement learning is a computational approach that emphasizes learning by the individual from direct interaction with its environment, without relying on exemplary supervision or complete models of the environment." - R. Sutton and A. Barto

Interaction with the environment: the agent takes an action and receives a scalar reward plus a new environment state. Setup from Lenz et al. 2014.

Interaction with the environment: $a_1, r_1, a_2, r_2, \dots, a_n, r_n$. Episodic vs. non-episodic. Setup from Lenz et al. 2014.

Rollout: $\langle s_1, a_1, r_1, s_2, a_2, r_2, s_3, \dots, a_n, r_n, s_{n+1} \rangle$. Setup from Lenz et al. 2014.

Setup: taking action $a_t$ in state $s_t$ gives the transition $s_t \to s_{t+1}$ with reward $r_t$ (e.g., 1$).

Policy: $\pi(s, a) = 0.9$, the probability of taking action $a$ in state $s$.

Interaction with the environment: the agent takes an action and receives a reward plus a new environment state. Objective? Setup from Lenz et al. 2014.

Objective. Rollout $\langle s_1, a_1, r_1, s_2, a_2, r_2, s_3, \dots, a_n, r_n, s_{n+1} \rangle$: maximize the expected reward $E\left[\sum_{t=1}^{n} r_t\right]$. Problem? Setup from Lenz et al. 2014.

Discounted Reward. Maximizing the expected reward $E\left[\sum_{t=0}^{\infty} r_{t+1}\right]$ can be unbounded, so discount future rewards: $E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right]$ with $\gamma \in [0, 1)$. Problem?

Discounted Reward. Maximize the discounted expected reward $E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right]$. If $|r| \le M$ and $\gamma \in [0, 1)$, then $E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right] \le \sum_{t=0}^{\infty} \gamma^t M = \frac{M}{1-\gamma}$.

Need for discounting: it keeps the problem well formed, and there is evidence that humans discount future rewards.

Markov Decision Process. An MDP is a tuple $(S, A, P, R, \gamma)$ where: $S$ is a set of finite or infinite states; $A$ is a set of finite or infinite actions; for the transition $s \xrightarrow{a} s'$, $P^a_{s,s'} \in P$ is the transition probability and $R^a_{s,s'} \in R$ is the reward for the transition (Markov assumption); $\gamma \in [0, 1]$ is the discount factor.
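
A minimal sketch of how this tuple might be represented for a small finite MDP, assuming tabular arrays indexed as P[a, s, s'] and R[a, s, s'] (the class name, field names, and numbers are illustrative, not from the lecture):

from dataclasses import dataclass
import numpy as np

@dataclass
class TabularMDP:
    # P[a, s, s'] = probability of reaching s' when taking action a in state s
    P: np.ndarray          # shape (num_actions, num_states, num_states)
    # R[a, s, s'] = reward for the transition s --a--> s'
    R: np.ndarray          # shape (num_actions, num_states, num_states)
    gamma: float           # discount factor

# Example: a 2-state, 2-action MDP with arbitrary numbers
P = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
R = np.array([[[1.0, 0.0], [0.0, 2.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
mdp = TabularMDP(P=P, R=R, gamma=0.9)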

MDP Example (figure from Sutton and Barto 1998).

Summary. An MDP is a tuple $(S, A, P, R, \gamma)$. Maximize the discounted expected reward $E\left[\sum_{t=0}^{n-1} \gamma^t r_{t+1}\right]$. The agent controls the policy $\pi(s, a)$.

What we learned: Reinforcement Learning; Exploration; No supervision; Agent-Reward-Environment; Policy; MDP.

Value functions: the expected reward from following a policy $\pi$. State value function: $V^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_1 = s, \pi\right]$. State-action value function: $Q^\pi(s, a) = E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_1 = s, a_1 = a, \pi\right]$.
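
The value function is defined as an expectation over rollouts, so it can be approximated by sampling. A rough sketch under the tabular arrays used above (the function name, rollout count, and horizon are illustrative choices, not from the lecture):

import numpy as np

def mc_estimate_value(P, R, pi, gamma, s0, num_rollouts=1000, horizon=200):
    """Monte Carlo estimate of V^pi(s0) = E[sum_t gamma^t r_{t+1} | s_1 = s0, pi]."""
    rng = np.random.default_rng(0)
    num_actions, num_states, _ = P.shape
    returns = []
    for _ in range(num_rollouts):
        s, total, discount = s0, 0.0, 1.0
        for _ in range(horizon):                        # truncate the infinite sum
            a = rng.choice(num_actions, p=pi[s])        # a ~ pi(s, .)
            s_next = rng.choice(num_states, p=P[a, s])  # s' ~ P^a_{s, .}
            total += discount * R[a, s, s_next]
            discount *= gamma
            s = s_next
        returns.append(total)
    return np.mean(returns)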

State value function: $V^\pi(s) = E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1} \mid s_1 = s, \pi\right]$. Unrolling from $s_1$: action $a_1$ is chosen with probability $\pi(s_1, a_1)$, the transition to $s_2$ has probability $P^{a_1}_{s_1,s_2}$ and reward $R^{a_1}_{s_1,s_2}$; then $a_2$ is chosen with probability $\pi(s_2, a_2)$, the transition to $s_3$ has probability $P^{a_2}_{s_2,s_3}$ and reward $R^{a_2}_{s_2,s_3}$; and so on.

State value function:
$V^\pi(s_1) = E\left[\sum_{t=0}^{\infty} \gamma^t r_{t+1}\right] = \sum_{\tau} (r_1 + \gamma r_2 + \cdots)\, p(\tau)$,
where $\tau = \langle s_1, a_1, s_2, a_2, \dots \rangle = \langle s_1, a_1, s_2 \rangle : \tau'$, i.e., each trajectory decomposes into its first transition followed by the remaining trajectory $\tau'$.

State value function:
$V^\pi(s_1) = \sum_{\tau} (r_1 + \gamma r_2 + \cdots)\, p(\tau)$
$= \sum_{a_1, s_2} \sum_{\tau'} P(s_1, a_1, s_2)\, P(\tau' \mid s_1, a_1, s_2)\, \{R^{a_1}_{s_1,s_2} + \gamma (r_2 + \gamma r_3 + \cdots)\}$
$= \sum_{a_1, s_2} P(s_1, a_1, s_2) \{R^{a_1}_{s_1,s_2} + \gamma \sum_{\tau'} P(\tau' \mid s_1, a_1, s_2)(r_2 + \gamma r_3 + \cdots)\}$, where the inner sum is $V^\pi(s_2)$,
$= \sum_{a_1} \pi(s_1, a_1) \sum_{s_2} P^{a_1}_{s_1,s_2} \{R^{a_1}_{s_1,s_2} + \gamma V^\pi(s_2)\}$.

Bellman Self-Consistency Equation:
$V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^\pi(s')\}$
Similarly,
$Q^\pi(s, a) = \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^\pi(s')\}$
$Q^\pi(s, a) = \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma \sum_{a'} \pi(s', a') Q^\pi(s', a')\}$

Bellman Self-Consistency Equation: $V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^\pi(s')\}$. Given $N$ states, we have $N$ linear equations in $N$ variables. Solve this system. Does it have a unique solution? Yes, it does. Exercise: prove it.
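
Because the self-consistency equation is linear in $V^\pi$, exact policy evaluation reduces to solving a linear system. A minimal sketch (illustrative, not from the lecture), assuming tabular arrays P[a, s, s'], R[a, s, s'] and a stochastic policy pi[s, a]:

import numpy as np

def evaluate_policy(P, R, pi, gamma):
    """Exact policy evaluation: solve (I - gamma * P_pi) V = r_pi."""
    num_states = P.shape[1]
    # Expected transition matrix and expected one-step reward under pi
    P_pi = np.einsum('sa,ast->st', pi, P)        # P_pi[s, s'] = sum_a pi(s,a) P^a_{s,s'}
    r_pi = np.einsum('sa,ast,ast->s', pi, P, R)  # r_pi[s] = sum_a pi(s,a) sum_{s'} P^a_{s,s'} R^a_{s,s'}
    return np.linalg.solve(np.eye(num_states) - gamma * P_pi, r_pi)

For example, with the two-state arrays sketched earlier and pi = np.array([[0.5, 0.5], [0.5, 0.5]]), this returns the value of the uniformly random policy.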

Optimal Policy. $V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^\pi(s')\}$. Given a state $s$, policy $\pi_1$ is as good as $\pi_2$ (denoted $\pi_1 \ge \pi_2$) if $V^{\pi_1}(s) \ge V^{\pi_2}(s)$. How to define a globally optimal policy?

Optimal Policy. Policy $\pi_1$ is as good as $\pi_2$ (denoted $\pi_1 \ge \pi_2$) if $V^{\pi_1}(s) \ge V^{\pi_2}(s)$. How to define a globally optimal policy? $\pi^*$ is an optimal policy if $V^{\pi^*}(s) \ge V^{\pi}(s)$ for all $s \in S$ and all $\pi$. Does it always exist? Yes, it always does.

Existence of Optimal Policy. The leader policy for state $s$ is $\pi_s = \arg\max_{\pi} V^{\pi}(s)$. Define $\pi^*(s, a) = \pi_s(s, a)$ for all $s, a$. To show that $\pi^*$ is optimal, equivalently show $\Delta(s) = V^{\pi^*}(s) - V^{\pi_s}(s) \ge 0$:
$V^{\pi^*}(s) = \sum_a \pi_s(s, a) \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^{\pi^*}(s')\}$
$V^{\pi_s}(s) = \sum_a \pi_s(s, a) \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^{\pi_s}(s')\}$

Existence of Optimal Policy. With the leader policy $\pi_s = \arg\max_{\pi} V^{\pi}(s)$ and $\pi^*(s, a) = \pi_s(s, a)$ for all $s, a$:
$V^{\pi^*}(s) = \sum_a \pi_s(s, a) \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^{\pi^*}(s')\}$
$V^{\pi_s}(s) = \sum_a \pi_s(s, a) \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^{\pi_s}(s')\}$
Subtracting,
$\Delta(s) = V^{\pi^*}(s) - V^{\pi_s}(s) = \gamma \sum_a \pi_s(s, a) \sum_{s'} P^a_{s,s'} \{V^{\pi^*}(s') - V^{\pi_s}(s')\} \ge \gamma \sum_a \pi_s(s, a) \sum_{s'} P^a_{s,s'} \{\Delta(s')\} = \gamma \cdot \mathrm{conv}(\Delta(s'))$,
a $\gamma$-scaled convex combination of the $\Delta(s')$ values (using $V^{\pi_s}(s') \le V^{\pi_{s'}}(s')$).

Existence of Optimal Policy. The leader policy for state $s$ is $\pi_s = \arg\max_{\pi} V^{\pi}(s)$; define $\pi^*(s, a) = \pi_s(s, a)$ for all $s, a$. Then
$\Delta(s) = V^{\pi^*}(s) - V^{\pi_s}(s) = \gamma \sum_a \pi_s(s, a) \sum_{s'} P^a_{s,s'} \{V^{\pi^*}(s') - V^{\pi_s}(s')\} \ge \gamma \sum_a \pi_s(s, a) \sum_{s'} P^a_{s,s'} \{\Delta(s')\} = \gamma \cdot \mathrm{conv}(\Delta(s'))$
$\Rightarrow \Delta(s) \ge \gamma \min_{s'} \Delta(s') \Rightarrow \min_s \Delta(s) \ge \gamma \min_{s'} \Delta(s')$
Since $\gamma \in [0, 1)$, this implies $\min_s \Delta(s) \ge 0$. Hence proved.

Bellman's Optimality Condition. Define $V^*(s) = V^{\pi^*}(s)$ and $Q^*(s, a) = Q^{\pi^*}(s, a)$.
$V^*(s) = \sum_a \pi^*(s, a) Q^*(s, a) \le \max_a Q^*(s, a)$
Claim: $V^*(s) = \max_a Q^*(s, a)$.
Suppose $V^*(s) < \max_a Q^*(s, a)$. Define $\pi'(s) = \arg\max_a Q^*(s, a)$.

Bellman's Optimality Condition. With $\pi'(s) = \arg\max_a Q^*(s, a)$:
$V^{\pi^*}(s) = \sum_a \pi^*(s, a) Q^*(s, a)$ and $V^{\pi'}(s) = Q^{\pi'}(s, \pi'(s))$, so
$\Delta(s) = V^{\pi'}(s) - V^{\pi^*}(s) = Q^{\pi'}(s, \pi'(s)) - \sum_a \pi^*(s, a) Q^*(s, a)$
$\ge Q^{\pi'}(s, \pi'(s)) - Q^*(s, \pi'(s)) = \gamma \sum_{s'} P^{\pi'(s)}_{s,s'} \Delta(s')$
$\Rightarrow \Delta(s) \ge 0$ for all $s$, and at any state where the strict inequality $V^*(s) < \max_a Q^*(s, a)$ holds we get $\Delta(s) > 0$, i.e., $\exists s$ such that $\Delta(s) > 0 \Rightarrow \pi^*$ is not optimal, a contradiction.

Bellman's Optimality Condition.
$V^*(s) = \max_a Q^*(s, a)$
$V^*(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^*(s')\}$
Similarly, $Q^*(s, a) = \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma \max_{a'} Q^*(s', a')\}$

Optimal policy from the Q value. Given $Q^*(s, a)$, an optimal policy is given by $\pi^*(s) = \arg\max_a Q^*(s, a)$. Corollary: every MDP has a deterministic optimal policy.
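
A small sketch of this greedy extraction for a tabular $Q^*$ stored as a NumPy array (illustrative code, not from the lecture), assuming Q[s, a] indexing:

import numpy as np

def greedy_policy(Q):
    """Return the deterministic policy pi*(s) = argmax_a Q*(s, a)."""
    return np.argmax(Q, axis=1)   # one action index per state

# Example: for Q of shape (num_states, num_actions),
# greedy_policy(Q)[s] is an optimal action in state s.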

Summary. An optimal policy $\pi^*$ exists such that $V^{\pi^*}(s) \ge V^{\pi}(s)$ for all $s \in S$ and all $\pi$. Bellman's self-consistency equation: $V^\pi(s) = \sum_a \pi(s, a) \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^\pi(s')\}$. Bellman's optimality condition: $V^*(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^*(s')\}$.

What we learned: Reinforcement Learning; Exploration; No supervision; Agent-Reward-Environment; Policy; MDP; Consistency Equation; Optimal Policy; Optimality Condition.

Solving an MDP. To solve an MDP is to find an optimal policy.

Bellman's Optimality Condition. $V^*(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^*(s')\}$. Iteratively solve the above equation.

Bellman Backup Operator. $V^*(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^*(s')\}$. Define the operator $T : \mathcal{V} \to \mathcal{V}$ by $(TV)(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V(s')\}$.
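
A minimal sketch of the backup operator for a tabular MDP (illustrative, not from the lecture), assuming arrays P[a, s, s'] and R[a, s, s'] as before:

import numpy as np

def bellman_backup(V, P, R, gamma):
    """Apply (TV)(s) = max_a sum_{s'} P^a_{s,s'} (R^a_{s,s'} + gamma * V(s'))."""
    # backed_up[a, s] = sum_{s'} P[a, s, s'] * (R[a, s, s'] + gamma * V[s'])
    backed_up = np.einsum('ast,ast->as', P, R + gamma * V[None, None, :])
    return backed_up.max(axis=0)   # maximize over actions for each state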

Dynamic Programming Solution.
Initialize $V_0$ randomly
do
    $V_{t+1} = T V_t$
until $\|V_{t+1} - V_t\|_\infty \le \epsilon$
return $V_{t+1}$
where $V_{t+1}(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V_t(s')\}$.
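
Putting the pieces together, a runnable sketch of this dynamic-programming loop (value iteration) for the tabular setup; the tolerance eps and all names here are illustrative, not from the lecture:

import numpy as np

def value_iteration(P, R, gamma, eps=1e-8):
    """Iterate V_{t+1} = T V_t until the sup-norm change is at most eps."""
    num_states = P.shape[1]
    V = np.zeros(num_states)                     # any initialization works
    while True:
        backed_up = np.einsum('ast,ast->as', P, R + gamma * V[None, None, :])
        V_next = backed_up.max(axis=0)           # (T V)(s)
        if np.max(np.abs(V_next - V)) <= eps:    # ||V_{t+1} - V_t||_inf <= eps
            return V_next
        V = V_next

By the contraction argument on the following slides, these iterates converge to $V^*$ from any initialization.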

Convergence. $(TV)(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V(s')\}$.
Theorem: $\|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$, where $\|x\|_\infty = \max\{|x_1|, |x_2|, \dots, |x_k|\}$ for $x \in \mathbb{R}^k$.
Proof: $(TV_1)(s) - (TV_2)(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V_1(s')\} - \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V_2(s')\}$, using $\max_x f(x) - \max_x g(x) \le \max_x (f(x) - g(x))$.

Convergence. Theorem: $\|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$, where $\|x\|_\infty = \max\{|x_1|, |x_2|, \dots, |x_k|\}$ for $x \in \mathbb{R}^k$.
Proof:
$(TV_1)(s) - (TV_2)(s) \le \max_a \sum_{s'} P^a_{s,s'}\, \gamma\, (V_1(s') - V_2(s'))$
$\le \max_a \gamma \sum_{s'} P^a_{s,s'}\, |V_1(s') - V_2(s')|$
$\le \max_a \gamma \max_{s'} |V_1(s') - V_2(s')| \le \gamma \|V_1 - V_2\|_\infty$
$\Rightarrow \|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$
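
As a quick numerical sanity check of the contraction property (not part of the lecture), one can compare the two sides of the inequality on a random tabular MDP:

import numpy as np

rng = np.random.default_rng(0)
A, S, gamma = 3, 5, 0.9

# Random transition probabilities (rows over s' sum to 1) and rewards
P = rng.random((A, S, S))
P /= P.sum(axis=2, keepdims=True)
R = rng.random((A, S, S))

def T(V):
    return np.einsum('ast,ast->as', P, R + gamma * V[None, None, :]).max(axis=0)

V1, V2 = rng.random(S), rng.random(S)
lhs = np.max(np.abs(T(V1) - T(V2)))      # ||T V1 - T V2||_inf
rhs = gamma * np.max(np.abs(V1 - V2))    # gamma * ||V1 - V2||_inf
assert lhs <= rhs + 1e-12                # contraction holds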

Optimal is a fixed point. $V^*(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V^*(s')\} = (TV^*)(s)$, i.e., $V^* = TV^*$: $V^*$ is a fixed point of $T$.

Optimal is the fixed point. $V^* = TV^*$, so $V^*$ is a fixed point of $T$.
Theorem: $V^*$ is the only fixed point of $T$.
Proof: suppose $TV_1 = V_1$ and $TV_2 = V_2$. Then $\|V_1 - V_2\|_\infty = \|TV_1 - TV_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$. As $\gamma \in [0, 1)$, therefore $\|V_1 - V_2\|_\infty = 0 \Rightarrow V_1 = V_2$.

Dynamic Programming Solution.
Initialize $V_0$ randomly
do
    $V_{t+1} = T V_t$
until $\|V_{t+1} - V_t\|_\infty \le \epsilon$
return $V_{t+1}$
Problem?
Theorem: the algorithm converges for all $V_0$.
Proof: $\|V_{t+1} - V^*\|_\infty = \|TV_t - TV^*\|_\infty \le \gamma \|V_t - V^*\|_\infty$, so $\|V_t - V^*\|_\infty \le \gamma^t \|V_0 - V^*\|_\infty$ and $\lim_{t \to \infty} \|V_t - V^*\|_\infty \le \lim_{t \to \infty} \gamma^t \|V_0 - V^*\|_\infty = 0$.

Summary. Iteratively solving the optimality condition: $V_{t+1}(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V_t(s')\}$. Bellman backup operator: $(TV)(s) = \max_a \sum_{s'} P^a_{s,s'} \{R^a_{s,s'} + \gamma V(s')\}$. Convergence of the iterative solution.

What we learned: Reinforcement Learning; Exploration; No supervision; Agent-Reward-Environment; Policy; MDP; Consistency Equation; Optimal Policy; Optimality Condition; Bellman Backup Operator; Iterative Solution.

In the next tutorial: Value and Policy Iteration; Monte Carlo Solution; SARSA and Q-Learning; Policy Gradient Methods; Learning to Search OR the Atari game paper? https://dipendramisra.wordpress.com/