Introduction to Reinforcement Learning Part 1: Markov Decision Processes


Introduction to Reinforcement Learning, Part 1: Markov Decision Processes
Rowan McAllister
Reinforcement Learning Reading Group
8 April 2015

Note: I've created these slides whilst following the "Algorithms for Reinforcement Learning" lectures by Csaba Szepesvári, specifically sections 2.2-2.4. The lectures themselves are available on Professor Szepesvári's homepage: http://www.ualberta.ca/~szepesva/papers/rlalgsinmdps.pdf. If you spot any errors, please email me: rtm26 at cam dot ac dot uk

MDP Framework

M = { X, A, P, R, γ }
    X: states, e.g. the Euclidean plane, X = R^2
    A: actions, e.g. movement of 1 metre, A = {north, south, east, west}
    P: transition probability, P(X_{t+1} | X_t, A_t), e.g. P({0,1} | {0,0}, north) = 0.99
    R: reward function, R : X × A → R
    γ: discount factor, prefer rewards now vs later, γ ∈ [0, 1]
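To make the five-tuple concrete, here is a minimal Python sketch of a finite MDP container. The class name, field layout, and the tiny two-state example are illustrative assumptions, not part of the original slides.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class FiniteMDP:
    """A finite MDP M = (X, A, P, R, gamma) with tabular P and R.

    P[a, x, x'] = probability of moving to state x' from state x under action a.
    R[x, a]     = expected immediate reward for taking action a in state x.
    """
    n_states: int          # |X|
    n_actions: int         # |A|
    P: np.ndarray          # shape (n_actions, n_states, n_states), each row sums to 1
    R: np.ndarray          # shape (n_states, n_actions)
    gamma: float           # discount factor in [0, 1]

# Hypothetical two-state, two-action example (numbers are made up):
P = np.array([[[0.9, 0.1], [0.0, 1.0]],    # action 0
              [[0.2, 0.8], [0.5, 0.5]]])   # action 1
R = np.array([[1.0, 0.0],
              [0.0, 2.0]])
mdp = FiniteMDP(n_states=2, n_actions=2, P=P, R=R, gamma=0.5)
```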

MDP Framework

Markov: conditional independence: P(X_{t+1} | X_{1:t}, A_{1:t}) = P(X_{t+1} | X_t, A_t)
Decision: decide actions to optimise an objective
Process: sequential movements and decisions, time t = 0, 1, 2, ...

[Graphical model: a chain of states X_{t-1} → X_t → X_{t+1} linked by the transition model P, with actions A_{t-1}, A_t, A_{t+1} influencing each transition and rewards R_{t-1}, R_t, R_{t+1} emitted at each step.]

Goal

Maximise return, where:

    return = Σ_{t=0}^{∞} γ^t R_t

How? We can influence the return by deciding the action A_t at each time step.

Define policy function:    π : X → A
Define value of policy:    V^π : X → R,   V^π(x) = E_P[ return | X_0 = x; π ],  ∀x ∈ X
Optimise policy:           π* ∈ arg max_π V^π(x),  ∀x ∈ X
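As an illustration of these definitions, here is a rough Python sketch that estimates V^π(x) by Monte Carlo: roll out the policy, sum discounted rewards, and average. The `FiniteMDP` container and the function names are assumptions carried over from the earlier sketch, and truncating at a finite horizon is a practical approximation to the infinite sum.

```python
import numpy as np

def sample_return(mdp, policy, x0, horizon=50, rng=None):
    """One Monte Carlo rollout of the discounted return from state x0 under `policy`.

    `policy` is a deterministic map array: policy[x] = action.
    The infinite sum is truncated at `horizon`, which is accurate to
    roughly gamma**horizon of the reward scale.
    """
    rng = rng or np.random.default_rng()
    x, ret, discount = x0, 0.0, 1.0
    for _ in range(horizon):
        a = policy[x]
        ret += discount * mdp.R[x, a]
        x = rng.choice(mdp.n_states, p=mdp.P[a, x])   # sample next state from P(. | x, a)
        discount *= mdp.gamma
    return ret

def mc_value(mdp, policy, x0, n_rollouts=1000):
    """Estimate V^pi(x0) = E[return | X_0 = x0; pi] by averaging rollouts."""
    return np.mean([sample_return(mdp, policy, x0) for _ in range(n_rollouts)])
```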

Also useful:

Define action-value of policy:    Q^π : X × A → R,   Q^π(x, a) = E_P[ return | X_0 = x, A_0 = a; π ],  ∀x ∈ X, a ∈ A

Bellman Equations (evaluating a fixed policy)

    Q^π(x, a) = R(x, a) + γ Σ_{x' ∈ X} P(x' | x, a) V^π(x'),   ∀x ∈ X, a ∈ A

    V^π(x) = Q^π(x, π(x)) = R(x, π(x)) + γ Σ_{x' ∈ X} P(x' | x, π(x)) V^π(x'),   ∀x ∈ X
    (the right-hand side is the operator T^π applied to V^π)

i.e.:

    V^π = T^π V^π

i.e. a linear system of equations:

    v^π = r + γ P^π v^π   ⇒   v^π = (I − γ P^π)^{−1} r
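The last line is just a linear solve, so exact policy evaluation for a small finite MDP is a few lines of NumPy. A minimal sketch, assuming the `FiniteMDP` container and a deterministic policy array from the earlier sketches:

```python
import numpy as np

def evaluate_policy(mdp, policy):
    """Exact policy evaluation: solve v = r + gamma P_pi v, i.e. v = (I - gamma P_pi)^-1 r.

    `policy[x]` is the action taken in state x, so P_pi[x, x'] = P(x' | x, policy[x])
    and r[x] = R(x, policy[x]).
    """
    n = mdp.n_states
    P_pi = np.array([mdp.P[policy[x], x] for x in range(n)])   # (n, n), row-stochastic
    r = np.array([mdp.R[x, policy[x]] for x in range(n)])      # (n,)
    # Solve (I - gamma P_pi) v = r rather than forming the inverse explicitly.
    return np.linalg.solve(np.eye(n) - mdp.gamma * P_pi, r)
```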

Bellman Optimality Equations (evaluating the optimal policy)

    Q*(x, a) := Q^{π*}(x, a) = R(x, a) + γ Σ_{x' ∈ X} P(x' | x, a) V*(x'),   ∀x ∈ X, a ∈ A

    V*(x) := V^{π*}(x) = max_{a ∈ A} Q*(x, a),   ∀x ∈ X

    π*(x) = arg max_{a ∈ A} Q*(x, a),   ∀x ∈ X

i.e.:

    V*(x) = max_{a ∈ A} [ R(x, a) + γ Σ_{x' ∈ X} P(x' | x, a) V*(x') ],   ∀x ∈ X
    (the right-hand side is the operator T* applied to V*)

    V* = T* V*

Value Iteration

    V_{k+1} = T* V_k
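In code, value iteration just applies the Bellman optimality backup until the values stop changing. A minimal sketch under the same assumed `FiniteMDP` container (the tolerance and iteration cap are arbitrary choices):

```python
import numpy as np

def value_iteration(mdp, tol=1e-8, max_iter=1000):
    """Iterate V_{k+1} = T* V_k until the sup-norm change is below `tol`.

    Returns the value estimate V and the greedy policy with respect to it.
    """
    V = np.zeros(mdp.n_states)
    for _ in range(max_iter):
        # Q[x, a] = R(x, a) + gamma * sum_x' P(x' | x, a) V(x')
        Q = mdp.R + mdp.gamma * np.einsum('axy,y->xa', mdp.P, V)
        V_new = Q.max(axis=1)            # Bellman optimality backup T*
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    policy = Q.argmax(axis=1)            # greedy policy w.r.t. the final Q
    return V, policy
```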

Value Iteration example: a 3×3 gridworld with γ = 0.5, a terminal reward of 16 in the top-left cell and a terminal reward of 4 in the middle-right cell. Each slide shows the value estimates and the greedy policy (arrows, shown in the original figures) at iteration k:

    Value k=0:    Value k=1:    Value k=2:    Value k=3:
    16  0  0      16  8  2      16  8  4      16  8  4
     0  0  4       8  2  4       8  4  4       8  4  4
     0  0  0       0  0  2       4  1  2       4  2  2

Policy Iteration

    Initialise random policy π_0
    k ← 0
    WHILE π_k not converged:
        1. Compute associated action values Q^{π_k}        (policy evaluation)
        2. Update policy greedily w.r.t. Q^{π_k}:           (policy improvement)
               π_{k+1}(x) ← arg max_{a ∈ A} Q^{π_k}(x, a),  ∀x ∈ X
        3. k ← k + 1
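A corresponding Python sketch of the loop above, reusing the assumed `evaluate_policy` helper for the evaluation step; Q^{π_k} is built from the evaluated V^{π_k} before the greedy improvement:

```python
import numpy as np

def policy_iteration(mdp, max_iter=100, rng=None):
    """Alternate exact policy evaluation and greedy policy improvement."""
    rng = rng or np.random.default_rng()
    policy = rng.integers(mdp.n_actions, size=mdp.n_states)   # random initial policy pi_0
    for _ in range(max_iter):
        V = evaluate_policy(mdp, policy)                       # policy evaluation
        # Q^pi(x, a) = R(x, a) + gamma * sum_x' P(x' | x, a) V^pi(x')
        Q = mdp.R + mdp.gamma * np.einsum('axy,y->xa', mdp.P, V)
        new_policy = Q.argmax(axis=1)                          # policy improvement
        if np.array_equal(new_policy, policy):                 # converged
            break
        policy = new_policy
    return policy, V
```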

Policy Iteration example: the same 3×3 gridworld with γ = 0.5. The slides alternate the two steps: each policy π_k (arrows, shown in the original figures) is evaluated to give Value k, and the next policy π_{k+1} is the greedy policy with respect to those values. The evaluated values are:

    Value k=0:    Value k=1:    Value k=2:    Value k=3:
    16  1  2      16   8  2     16  8  4      16  8  4
     0  0  4       8   2  4      8  4  4       8  4  4
     0  0  0      0.5  1  2      4  1  2       4  2  2

Policy Iteration (policy evaluation):    v^π = (I − γ P^π)^{−1} r

For the final policy π_3, number the nine grid cells 1-9 down the columns (left column 1-3, middle column 4-6, right column 7-9) and add an absorbing state 10 that the two terminal cells (states 1 and 8, grey in the original figure) transition into. Rows index the FROM state, columns the TO state:

    P^π =
       to:  1  2  3  4  5  6  7  8  9  10
       1  [ 0  0  0  0  0  0  0  0  0  1 ]
       2  [ 1  0  0  0  0  0  0  0  0  0 ]
       3  [ 0  1  0  0  0  0  0  0  0  0 ]
       4  [ 1  0  0  0  0  0  0  0  0  0 ]
       5  [ 0  1  0  0  0  0  0  0  0  0 ]
       6  [ 0  0  1  0  0  0  0  0  0  0 ]
       7  [ 0  0  0  1  0  0  0  0  0  0 ]
       8  [ 0  0  0  0  0  0  0  0  0  1 ]
       9  [ 0  0  0  0  0  0  0  1  0  0 ]
      10  [ 0  0  0  0  0  0  0  0  0  1 ]

Rewards (16 at the top-left terminal cell, 4 at the middle-right terminal cell):

    r = [ 16  0  0  0  0  0  0  4  0  0 ]ᵀ

Solving v^π = (I − γ P^π)^{−1} r with γ = 0.5 gives the value grid:

    16  8  4
     8  4  4
     4  2  2
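For completeness, here is a short NumPy check of this evaluation step. The state numbering (down the columns, plus an absorbing state 10) follows the reconstruction above and is an assumption about the original figure; the solve reproduces the value grid on the final slide.

```python
import numpy as np

gamma = 0.5

# Deterministic transitions under the final policy pi_3:
# key = FROM state, value = TO state (1-indexed as in the slide; state 10 is absorbing).
goes_to = {1: 10, 2: 1, 3: 2, 4: 1, 5: 2, 6: 3, 7: 4, 8: 10, 9: 8, 10: 10}
P_pi = np.zeros((10, 10))
for s, s_next in goes_to.items():
    P_pi[s - 1, s_next - 1] = 1.0

# Rewards: 16 at the top-left terminal (state 1), 4 at the middle-right terminal (state 8).
r = np.zeros(10)
r[0], r[7] = 16.0, 4.0

v = np.linalg.solve(np.eye(10) - gamma * P_pi, r)   # v = (I - gamma P_pi)^-1 r

# Reshape the nine grid states (numbered top-to-bottom within each column) into the 3x3 grid.
print(v[:9].reshape(3, 3, order='F'))
# -> [[16.  8.  4.]
#     [ 8.  4.  4.]
#     [ 4.  2.  2.]]
```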