The Markov Decision Process (MDP) model


Decision Making in Robots and Autonomous Agents: The Markov Decision Process (MDP) model. Subramanian Ramamoorthy, School of Informatics. 25 January, 2013

In the MAB model we were in a single casino and the only decision was which of a set of n arms to pull; except perhaps in the very last slides, there was exactly one state! We asked the following: What if there is more than one state? In such a state space, what is the effect of the payout distribution changing based on how you pull arms? What happens if you only obtain a net reward corresponding to a long sequence of arm pulls (at the end)? 25/01/2013 2

Decision Making: Agent-Environment Interface. Agent and environment interact at discrete time steps $t = 0, 1, 2, \ldots$ The agent observes the state at step $t$: $s_t \in S$; produces an action at step $t$: $a_t \in A(s_t)$; and gets the resulting reward $r_{t+1}$ and resulting next state $s_{t+1}$. This yields the trajectory $s_t, a_t, r_{t+1}, s_{t+1}, a_{t+1}, r_{t+2}, s_{t+2}, a_{t+2}, r_{t+3}, s_{t+3}, a_{t+3}, \ldots$ 25/01/2013 3

Markov Decision Processes: a model of the agent-environment system. Markov property = history doesn't matter, only the current state. If the state and action sets are finite, it is a finite MDP. To define a finite MDP, you need to give: the state and action sets; one-step dynamics defined by transition probabilities $P^a_{ss'} = \Pr\{s_{t+1} = s' \mid s_t = s, a_t = a\}$ for all $s, s' \in S$, $a \in A(s)$; and expected rewards $R^a_{ss'} = E\{r_{t+1} \mid s_t = s, a_t = a, s_{t+1} = s'\}$ for all $s, s' \in S$, $a \in A(s)$. 25/01/2013 4
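
To make "state and action sets plus one-step dynamics" concrete, here is a minimal sketch of a finite MDP stored as tabular data; the two states, two actions, and all probabilities and rewards are made-up placeholders, not anything from the slides.

```python
# Minimal sketch of a finite MDP as tabular data (illustrative only).
# P[s][a] maps a next state s' to its transition probability P^a_{ss'};
# R[s][a][s'] is the expected one-step reward R^a_{ss'}.

states = ["s0", "s1"]
actions = {"s0": ["left", "right"], "s1": ["left", "right"]}

P = {
    "s0": {"left": {"s0": 0.9, "s1": 0.1}, "right": {"s0": 0.2, "s1": 0.8}},
    "s1": {"left": {"s0": 0.5, "s1": 0.5}, "right": {"s0": 0.0, "s1": 1.0}},
}
R = {
    "s0": {"left": {"s0": 0.0, "s1": 1.0}, "right": {"s0": 0.0, "s1": 2.0}},
    "s1": {"left": {"s0": -1.0, "s1": 0.0}, "right": {"s0": 0.0, "s1": 0.5}},
}

# Sanity check: each P[s][a] must be a probability distribution over next states.
for s in states:
    for a in actions[s]:
        assert abs(sum(P[s][a].values()) - 1.0) < 1e-9
```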

Recycling Robot: An Example Finite MDP. At each step, the robot has to decide whether it should (1) actively search for a can, (2) wait for someone to bring it a can, or (3) go to home base and recharge. Searching is better but runs down the battery; if the robot runs out of power while searching, it has to be rescued (which is bad). Decisions are made on the basis of the current energy level: high, low. Reward = number of cans collected. 25/01/2013 5

Recycling Robot MDP. $S = \{\text{high}, \text{low}\}$; $A(\text{high}) = \{\text{search}, \text{wait}\}$; $A(\text{low}) = \{\text{search}, \text{wait}, \text{recharge}\}$. $R^{\text{search}}$ = expected no. of cans while searching; $R^{\text{wait}}$ = expected no. of cans while waiting; $R^{\text{search}} > R^{\text{wait}}$. 25/01/2013 6

Enumerated In Tabular Form If you were given this much, what can you say about the behaviour (over time) of the system? 25/01/2013 7
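
As a sketch of what such a tabular enumeration might look like in code, here is the recycling-robot MDP written out as (state, action) -> outcome lists. The probabilities alpha, beta, the reward values, and the -3 "rescue" penalty are illustrative placeholders in the spirit of the standard Sutton & Barto example; the slide's actual table is not reproduced in this transcription.

```python
# Hypothetical tabular enumeration of the recycling-robot MDP.
# alpha / beta and all reward values are illustrative placeholders.
alpha, beta = 0.9, 0.6          # P(battery stays high / stays low while searching)
r_search, r_wait = 2.0, 1.0     # expected cans: searching pays more than waiting

# (state, action) -> list of (next_state, probability, expected_reward)
dynamics = {
    ("high", "search"):   [("high", alpha, r_search), ("low", 1 - alpha, r_search)],
    ("high", "wait"):     [("high", 1.0, r_wait)],
    ("low",  "search"):   [("low", beta, r_search), ("high", 1 - beta, -3.0)],  # -3: rescued
    ("low",  "wait"):     [("low", 1.0, r_wait)],
    ("low",  "recharge"): [("high", 1.0, 0.0)],
}

# Each (state, action) row must define a proper distribution over next states.
for (s, a), outcomes in dynamics.items():
    assert abs(sum(p for _, p, _ in outcomes) - 1.0) < 1e-9
```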

A Very Brief Primer on Markov Chains and Decisions A model, as originally developed in Operations Research/Stochastic Control theory 25/01/2013 8

Stochastic Processes. A stochastic process is an indexed collection of random variables, e.g., the collection of weekly demands for a product. One type: at a particular time t, labelled by integers, the system is found in exactly one of a finite number of mutually exclusive and exhaustive categories or states, also labelled by integers. The process could be embedded, in the sense that time points correspond to the occurrence of specific events (or time may be equi-spaced). The random variables may depend on one another. 25/01/2013 9

Markov Chains. The stochastic process is said to have the Markovian property if $\Pr\{X_{t+1} = j \mid X_0 = k_0, X_1 = k_1, \ldots, X_{t-1} = k_{t-1}, X_t = i\} = \Pr\{X_{t+1} = j \mid X_t = i\}$. The Markovian property means that the conditional probability of a future event, given any past events and the current state, is independent of the past states and depends only on the present. These conditional probabilities are the transition probabilities; if they are time invariant they are called stationary, written $p_{ij}$. 25/01/2013 10

Markov Chains. Looking forward in time, we have n-step transition probabilities $p_{ij}^{(n)}$. One can collect the one-step probabilities into a transition matrix $P$. A stochastic process is a finite-state Markov chain if it has: a finite number of states; the Markovian property; stationary transition probabilities; and a set of initial probabilities $P\{X_0 = i\}$ for all $i$. 25/01/2013 11

Markov Chains. n-step transition probabilities can be obtained from 1-step transition probabilities recursively (Chapman-Kolmogorov): $p_{ij}^{(n)} = \sum_{k} p_{ik}^{(m)} p_{kj}^{(n-m)}$. We can get this via the matrix too: the n-step transition matrix is the n-th matrix power, $P^{(n)} = P^n$. First Passage Time: the number of transitions to go from i to j for the first time. If i = j, this is the recurrence time. In general, this is itself a random variable. 25/01/2013 12
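
A small numerical sketch of the matrix version of this fact: the n-step transition matrix is the n-th power of the 1-step matrix. The 3-state chain below is made up for illustration.

```python
import numpy as np

# Illustrative 1-step transition matrix for a 3-state Markov chain (rows sum to 1).
P = np.array([
    [0.5, 0.4, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])

# Chapman-Kolmogorov in matrix form: the n-step transition matrix is P^n.
n = 8
P_n = np.linalg.matrix_power(P, n)
print(P_n)              # entry (i, j) is p_ij^(n)
print(P_n.sum(axis=1))  # each row still sums to 1
```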

Markov Chains. There is an n-step recursive relationship for the first passage time probabilities: $f_{ij}^{(n)} = p_{ij}^{(n)} - \sum_{k=1}^{n-1} f_{ij}^{(k)} p_{jj}^{(n-k)}$. For fixed i and j, these $f_{ij}^{(n)}$ are nonnegative numbers such that $\sum_{n=1}^{\infty} f_{ij}^{(n)} \le 1$. If $\sum_{n=1}^{\infty} f_{ii}^{(n)} = 1$, that state is a recurrent state; it is absorbing if it returns at n = 1 with certainty, i.e., $p_{ii} = 1$. 25/01/2013 13

Markov Chains: Long-Run Properties. Consider the 8-step transition matrix of the inventory example. Interesting property: the probability of being in state j after 8 weeks appears independent of the initial level of inventory. For an irreducible ergodic Markov chain, one has the limiting probability $\pi_j = \lim_{n \to \infty} p_{ij}^{(n)}$, independent of i. Its reciprocal, $1/\pi_j$, gives you the recurrence time $\mu_{jj}$. 25/01/2013 14
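
A sketch of how one might check this numerically on a made-up irreducible ergodic chain (not the inventory example): rows of a high matrix power approach the limiting distribution, which can also be obtained by solving $\pi P = \pi$ with $\sum_j \pi_j = 1$; the reciprocals $1/\pi_j$ are the mean recurrence times.

```python
import numpy as np

# Made-up irreducible, ergodic 3-state chain (not the inventory example).
P = np.array([
    [0.5, 0.4, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])

# All rows of a high matrix power approach the limiting distribution.
print(np.linalg.matrix_power(P, 50)[0])

# Equivalently, solve pi P = pi together with sum(pi) = 1 as a linear system.
n = P.shape[0]
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.append(np.zeros(n), 1.0)
pi, *_ = np.linalg.lstsq(A, b, rcond=None)
print(pi)          # limiting probabilities pi_j
print(1.0 / pi)    # mean recurrence times mu_jj
```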

Markov Decision Model. Consider the following application: machine maintenance. A factory has a machine that deteriorates rapidly in quality and output and is inspected periodically, e.g., daily. Inspection declares the machine to be in one of four possible states: 0: good as new; 1: operable, minor deterioration; 2: operable, major deterioration; 3: inoperable. Let $X_t$ denote this observed state; it evolves according to some law of motion, so it is a stochastic process. Furthermore, assume it is a finite-state Markov chain. 25/01/2013 15

Markov Decision Model. The transition matrix is based on the following: once the machine goes inoperable, it stays there until repaired; if there are no repairs, it eventually reaches this state, which is absorbing! Repair is an action, giving a very simple maintenance policy: e.g., take the machine from state 3 to state 0. 25/01/2013 16

Markov Decision Model. There are costs as the system evolves: State 0: cost 0; State 1: cost 1000; State 2: cost 3000. The replacement cost, taking state 3 to 0, is 4000 (plus lost production of 2000), so cost = 6000. The modified transition probabilities are: 25/01/2013 17

Markov Decision Model. Simple question: what is the average cost of this maintenance policy? Compute the steady-state probabilities $\pi_j$: how? The (long-run) expected average cost per day is then $\sum_j \pi_j C_j$, where $C_j$ is the cost incurred in state j. 25/01/2013 18
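
A sketch of that computation, using the state costs 0, 1000, 3000, 6000 given above and an assumed transition matrix for the "replace only when inoperable" policy (the matrix on the slide is not reproduced in this transcription, so these probabilities are placeholders).

```python
import numpy as np

# Placeholder transition matrix under "replace only when inoperable":
# state 3 jumps back to state 0 (replacement); other rows are illustrative.
P = np.array([
    [0.0, 0.875, 0.0625, 0.0625],
    [0.0, 0.75,  0.125,  0.125 ],
    [0.0, 0.0,   0.5,    0.5   ],
    [1.0, 0.0,   0.0,    0.0   ],
])
costs = np.array([0.0, 1000.0, 3000.0, 6000.0])  # per-day cost in each state

# Steady-state probabilities: solve pi P = pi with sum(pi) = 1.
n = P.shape[0]
A = np.vstack([P.T - np.eye(n), np.ones(n)])
b = np.append(np.zeros(n), 1.0)
pi, *_ = np.linalg.lstsq(A, b, rcond=None)

print(pi)          # steady-state distribution over states 0..3
print(pi @ costs)  # long-run expected average cost per day
```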

Markov Decision Model. Consider a slightly more elaborate policy: when the machine is inoperable or needs major repairs, replace it. The transition matrix now changes a little bit. Permit one more thing: overhaul, which takes the machine back to the minor-deterioration state (1) for the next time step; this is not possible if the machine is truly inoperable, but it can go from major to minor. Key point about the system behaviour: it evolves according to the laws of motion and the sequence of decisions made (actions from {1: none, 2: overhaul, 3: replace}). The stochastic process is now defined in terms of the states $\{X_t\}$ and the decisions made at each step. A policy, R, is a rule for making decisions; it could use all of history, although the popular choice is (current) state-based. 25/01/2013 19

Markov Decision Model. There is a space of potential policies. Each policy defines a transition matrix, e.g., for $R_b$. Which policy is best? We need costs. 25/01/2013 20

Markov Decision Model. $C_{ik}$ = expected cost incurred during the next transition if the system is in state i and decision k is made (costs in thousands; decisions 1 = none, 2 = overhaul, 3 = replace):

State | Decision 1 | Decision 2 | Decision 3
  0   |     0      |     4      |     6
  1   |     1      |     4      |     6
  2   |     3      |     4      |     6
  3   |     -      |     -      |     6

The long-run average expected cost for each policy may be computed using the steady-state probabilities of its transition matrix; $R_b$ is best. 25/01/2013 21

Markov Decision Processes Solution using Dynamic Programming (*some notation changes upcoming) 25/01/2013 22

The RL Problem. Main elements: states, s; actions, a; state transition dynamics (often stochastic and unknown); reward (r) process (possibly stochastic). Objective: a policy $\pi_t(s, a)$, a probability distribution over actions given the current state. Assumption: the environment defines a finite-state MDP. 25/01/2013 23

Back to Our Recycling Robot MDP. $S = \{\text{high}, \text{low}\}$; $A(\text{high}) = \{\text{search}, \text{wait}\}$; $A(\text{low}) = \{\text{search}, \text{wait}, \text{recharge}\}$. $R^{\text{search}}$ = expected no. of cans while searching; $R^{\text{wait}}$ = expected no. of cans while waiting; $R^{\text{search}} > R^{\text{wait}}$. 25/01/2013 24

Given an enumeration of transitions and corresponding costs/rewards, what is the best sequence of actions? We want to maximize the criterion $R_t = \sum_{k=0}^{\infty} \gamma^k r_{t+k+1}$. So, what must one do? 25/01/2013 25
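
As a tiny illustration of the criterion, the snippet below computes the discounted return $R_t$ for a short, made-up reward sequence with an assumed $\gamma = 0.9$.

```python
# Discounted return R_t = sum_k gamma^k r_{t+k+1} for a finite reward sequence.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, 5.0]   # r_{t+1}, r_{t+2}, ... (made-up numbers)

R_t = sum(gamma ** k * r for k, r in enumerate(rewards))
print(R_t)   # 1.0 + 0.9*0.0 + 0.81*2.0 + 0.729*5.0 = 6.265
```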

The Shortest Path Problem 25/01/2013 26

Finite-State Systems and Shortest Paths. The state space $s_k$ is a finite set for each $k$; an action $a_k$ can get you from $s_k$ to $f_k(s_k, a_k)$ at a cost $g_k(s_k, a_k)$. Length = cost = sum of the lengths of the arcs. Solve this first: $V_k(i) = \min_j [a^k_{ij} + V_{k+1}(j)]$. 25/01/2013 27
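
A sketch of that backward recursion on a small made-up layered graph (the stage structure and arc costs are placeholders, not those of the figure): work backwards from the goal, taking at each node the arc that minimises arc cost plus cost-to-go.

```python
# Backward induction for a shortest path on a layered graph.
# arc_cost[k][i][j]: cost of the arc from node i in stage k to node j in stage k+1.
arc_cost = [
    {0: {0: 2, 1: 5}},                    # stage 0: single start node
    {0: {0: 4, 1: 1}, 1: {0: 2, 1: 7}},   # stage 1
    {0: {0: 3}, 1: {0: 1}},               # stage 2: single goal node in stage 3
]
num_stages = len(arc_cost) + 1

# V[k][i]: cost-to-go from node i at stage k; the terminal stage costs 0.
V = [dict() for _ in range(num_stages)]
V[-1] = {0: 0.0}
choice = [dict() for _ in range(num_stages - 1)]

for k in range(num_stages - 2, -1, -1):
    for i, arcs in arc_cost[k].items():
        best_j = min(arcs, key=lambda j: arcs[j] + V[k + 1][j])
        V[k][i] = arcs[best_j] + V[k + 1][best_j]
        choice[k][i] = best_j

print(V[0][0])    # length of the shortest path from the start node
print(choice)     # optimal successor at each stage
```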

Value Functions. The value of a state is the expected return starting from that state; it depends on the agent's policy. State-value function for policy $\pi$: $V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s\}$. The value of taking an action in a state under policy $\pi$ is the expected return starting from that state, taking that action, and thereafter following $\pi$. Action-value function for policy $\pi$: $Q^\pi(s, a) = E_\pi\{R_t \mid s_t = s, a_t = a\} = E_\pi\{\sum_{k=0}^{\infty} \gamma^k r_{t+k+1} \mid s_t = s, a_t = a\}$. 25/01/2013 28

Recursive Equation for Value. The basic idea: $R_t = r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \gamma^3 r_{t+4} + \cdots = r_{t+1} + \gamma (r_{t+2} + \gamma r_{t+3} + \gamma^2 r_{t+4} + \cdots) = r_{t+1} + \gamma R_{t+1}$. So: $V^\pi(s) = E_\pi\{R_t \mid s_t = s\} = E_\pi\{r_{t+1} + \gamma V^\pi(s_{t+1}) \mid s_t = s\}$. 25/01/2013 29

Optimality in MDPs Bellman Equation 25/01/2013 30
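
The slide's equation is not reproduced in this transcription; the usual Bellman optimality equation, $V^*(s) = \max_a \sum_{s'} P^a_{ss'} [R^a_{ss'} + \gamma V^*(s')]$, can be turned directly into value iteration. The sketch below does exactly that on the same kind of made-up two-state tabular MDP as earlier; gamma and all numbers are assumptions.

```python
# Value iteration: repeatedly apply the Bellman optimality backup
#   V(s) <- max_a sum_{s'} P[s][a][s'] * (R[s][a][s'] + gamma * V(s'))
gamma = 0.9
states = ["s0", "s1"]
actions = {"s0": ["left", "right"], "s1": ["left", "right"]}
P = {"s0": {"left": {"s0": 0.9, "s1": 0.1}, "right": {"s0": 0.2, "s1": 0.8}},
     "s1": {"left": {"s0": 0.5, "s1": 0.5}, "right": {"s0": 0.0, "s1": 1.0}}}
R = {"s0": {"left": {"s0": 0.0, "s1": 1.0}, "right": {"s0": 0.0, "s1": 2.0}},
     "s1": {"left": {"s0": -1.0, "s1": 0.0}, "right": {"s0": 0.0, "s1": 0.5}}}

V = {s: 0.0 for s in states}
for _ in range(1000):
    V_new = {s: max(sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in states)
                    for a in actions[s])
             for s in states}
    if max(abs(V_new[s] - V[s]) for s in states) < 1e-8:
        V = V_new
        break
    V = V_new

# A greedy policy with respect to the converged V is optimal.
greedy = {s: max(actions[s],
                 key=lambda a: sum(P[s][a][s2] * (R[s][a][s2] + gamma * V[s2])
                                   for s2 in states))
          for s in states}
print(V, greedy)
```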

Policy Evaluation. How do we compute $V^\pi(s)$ for an arbitrary policy $\pi$? (The prediction problem.) For a given MDP, this yields a system of simultaneous equations with as many unknowns as states (a BIG, $|S|$-dimensional linear system!). Alternatively, solve iteratively, with a sequence of value functions $V_0, V_1, V_2, \ldots$ converging to $V^\pi$. 3/02/2012 31
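
A sketch of both options on a made-up 3-state chain: with the policy fixed, $V^\pi = r^\pi + \gamma P^\pi V^\pi$ can be solved directly as a linear system, or by repeated sweeps $V \leftarrow r^\pi + \gamma P^\pi V$ (iterative policy evaluation). The matrix $P^\pi$, reward vector $r^\pi$, and $\gamma$ below are placeholders.

```python
import numpy as np

# Policy evaluation for a fixed policy pi, in matrix form (made-up 3-state example).
# P_pi[s, s'] = sum_a pi(s, a) P^a_{ss'};  r_pi[s] = expected one-step reward under pi.
gamma = 0.9
P_pi = np.array([
    [0.5, 0.4, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])
r_pi = np.array([1.0, 0.0, -0.5])

# Direct solution of the |S| x |S| linear system (I - gamma P_pi) V = r_pi.
V_direct = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# Iterative solution: repeat V <- r_pi + gamma P_pi V until it stops changing.
V = np.zeros(3)
for _ in range(10_000):
    V_next = r_pi + gamma * P_pi @ V
    if np.max(np.abs(V_next - V)) < 1e-10:
        V = V_next
        break
    V = V_next

print(V_direct)
print(V)   # the iterative solution agrees with the direct one
```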

Policy Improvement. Does it make sense to deviate from $\pi(s)$ at any state (following the policy everywhere else)? Let us for now assume a deterministic $\pi(s)$ - Policy Improvement Theorem [Howard/Blackwell]. 3/02/2012 32

Computing Better Policies. Starting with an arbitrary policy, we'd like to approach truly optimal policies. So, we compute new policies greedily: $\pi'(s) = \arg\max_a Q^\pi(s, a)$. Are we restricted to deterministic policies? No. With stochastic policies, the same improvement argument carries through. 3/02/2012 33
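
A sketch of the improvement step, and of iterating it to convergence (policy iteration), on a made-up two-state, two-action MDP; all transition probabilities, rewards, and $\gamma$ are assumptions.

```python
import numpy as np

# One step of policy improvement on a made-up 2-state, 2-action MDP:
# given V for the current policy, act greedily w.r.t. the one-step lookahead.
gamma = 0.9
# P[a] is the transition matrix under action a; r[a] the expected rewards per state.
P = {0: np.array([[0.9, 0.1], [0.5, 0.5]]),
     1: np.array([[0.2, 0.8], [0.0, 1.0]])}
r = {0: np.array([0.1, -0.5]),
     1: np.array([1.6,  0.5])}

def evaluate(policy):
    """Exact V for a deterministic policy (state -> action) via the linear system."""
    P_pi = np.array([P[policy[s]][s] for s in range(2)])
    r_pi = np.array([r[policy[s]][s] for s in range(2)])
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

def improve(policy):
    """Greedy policy with respect to the current policy's value function."""
    V = evaluate(policy)
    return {s: max(P.keys(), key=lambda a: r[a][s] + gamma * P[a][s] @ V)
            for s in range(2)}

# Policy iteration: improve until the policy stops changing.
policy = {0: 0, 1: 0}
while True:
    new_policy = improve(policy)
    if new_policy == policy:
        break
    policy = new_policy
print(policy, evaluate(policy))
```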

Grid-World Example 25/01/2013 34

Iterative Policy Evaluation in Grid World Note: The value function can be searched greedily to find long-term optimal actions 25/01/2013 35
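
A compact sketch of iterative policy evaluation on a 4x4 grid world, assuming the usual setup (two terminal corner states, reward -1 per step, equiprobable random policy, undiscounted); whether this matches the slide's grid world exactly is an assumption. Acting greedily with respect to the resulting value function recovers shortest paths to the terminal states.

```python
import numpy as np

# Iterative policy evaluation of the equiprobable random policy on a 4x4 grid world.
N = 4
gamma = 1.0
terminal = {(0, 0), (N - 1, N - 1)}
moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]

V = np.zeros((N, N))
for _ in range(1000):
    V_new = np.zeros_like(V)
    for i in range(N):
        for j in range(N):
            if (i, j) in terminal:
                continue
            total = 0.0
            for di, dj in moves:
                ni, nj = i + di, j + dj
                if not (0 <= ni < N and 0 <= nj < N):
                    ni, nj = i, j          # bumping into a wall leaves you in place
                total += 0.25 * (-1.0 + gamma * V[ni, nj])
            V_new[i, j] = total
    if np.max(np.abs(V_new - V)) < 1e-6:
        V = V_new
        break
    V = V_new

print(np.round(V, 1))  # values of the random policy; greedy w.r.t. V gives shortest paths
```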