Some notes on Markov Decision Theory

Nikolaos Laoutaris, laoutaris@di.uoa.gr, January 2004

Markov Decision Theory [1, 2, 3, 4] provides a methodology for the analysis and optimization of probabilistic sequential decision processes over an infinite or finite planning horizon.

Queueing Theory + Markov Processes: model a probabilistic system in order to evaluate its performance.

Markov Decision Theory: goes one step further; design the operation of a probabilistic system so as to optimize its performance.

Probabilistic: there exists an environment which cannot be described in full detail, so it is taken to be random (stochastic), following some probability law that best describes its nature. It is in this environment that our agent operates.

Sequential: the theory aims at providing tools for optimizing behaviors, i.e., sequences of decisions, not single decisions.

Planning horizon: when deciding our behavior we must take into consideration the length of our intended activity. A poker player behaves differently when opening a new game than when finishing one (having secured a profit or suffered a loss). Behaviors depend on whether we will participate in an activity for a finite amount of time and then abandon it, or have decided to be involved in it permanently.

Queueing theory vs. decision theory: in a queueing model (say an M/M/1 queue) everything is fixed. The (stochastic) arrival process and the (stochastic) service process are parts of the environment in which we study some performance metric of interest (e.g., the expected queueing delay). An application of decision theory to queues (an M/X/1 queue, where X is an unknown service policy that we want to design and optimize) would consider the arrival process as the only element of the environment. Decision theory provides the tools to design and optimize the service process so as to achieve a desired goal (e.g., avoid underflows or overflows).

A Markov Decision Process (MDP) is a discrete Markov process $\{I_n\}_{n \ge 0}$ characterized by a tuple $\langle S, A, P, C \rangle$:

$S$ is the set of possible states. $\{I_n\}$ is in state $i$ at time $n$ iff $I_n = i$, $0 \le i \le M$.

$A$ is the set of possible actions (or decisions). Following an observation, an action $k$ is taken from the finite action space $A$, $k = 0, 1, \ldots, K$.

$P : S \times A \times S \to [0, 1]$ is the state transition function, specifying the probability $P\{j \mid i, k\} \equiv p_{ij}(k)$ of observing a transition to state $j \in S$ after taking action $k \in A$ in state $i \in S$.

Observation instances

The process can be either continuous or discrete time. We focus on discrete-time MDPs. Discrete-time MDPs come in two flavors:

Discrete MDPs that are time-less (we do not model the time between observation instances, e.g., $n \to n+1$, or take it to be constant). These are in a way generalizations of Markov chains (sometimes discrete MDPs of this kind are called controllable Markov chains).

Discrete MDPs that allow for time to pass between observation instances. These are much like generalized versions of semi-Markov processes. In such cases the Markov property holds only at observation instances (and not at arbitrary instances), and we may exploit it to optimize a decision system that acts upon observation instances.

Having a continuous-time MDP would require the Markov property to hold at all time instances. This is rather restrictive, as most processes of interest possess the Markov property only at selected times of interest and not in general.

The action space can be either homogeneous or non-homogeneous. Homogeneous: there is a common action space $A$ from which we choose actions according to the current state. Non-homogeneous: each state $i$ is associated with a potentially different set of actions, $A_i$, from which a decision must be made.

$C : S \times A \to \mathbb{R}$ is a function specifying the cost $c_i(k)$ of taking action $k \in A$ in state $i \in S$; $c_i(k)$ must depend only on the current state-action pair.

A policy $R = (d_0, d_1, \ldots, d_M)$ prescribes an action for each possible state. $d_i(R) = k$ means that under policy $R$, action $k$ is taken when the process is in state $i$.

$\pi(R) = (\pi_0, \pi_1, \ldots, \pi_M)$ is the limiting distribution of $\{I_n\}$ under policy $R$: $\pi(R) = \pi(R) P(R)$.

The objective is to find the optimal policy $R_{opt}$ that minimizes some cost criterion which accounts for both immediate costs and subsequent costs from the future evolution of $\{I_n\}$.
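To make the notation concrete, here is a minimal Python sketch of a hypothetical two-state, two-action MDP $\langle S, A, P, C \rangle$ together with a deterministic policy $R$; all numbers are invented for illustration and are not taken from these notes.

    import numpy as np

    # Hypothetical toy MDP: states S = {0, 1}, actions A = {0, 1}.
    # P[k][i][j] = p_ij(k): probability of a transition to state j
    # when action k is taken in state i.  C[i][k] = c_i(k): cost.
    P = np.array([
        [[0.9, 0.1],   # action k = 0, rows indexed by current state i
         [0.4, 0.6]],
        [[0.2, 0.8],   # action k = 1
         [0.5, 0.5]],
    ])
    C = np.array([[1.0, 4.0],   # c_0(0), c_0(1)
                  [3.0, 0.5]])  # c_1(0), c_1(1)

    # A deterministic stationary policy R = (d_0, d_1): one action per state.
    R = [0, 1]

    # Transition matrix P(R) induced by the policy: row i is p_i.(d_i(R)).
    P_R = np.array([P[R[i], i, :] for i in range(len(R))])
    assert np.allclose(P_R.sum(axis=1), 1.0)  # every row is a distribution

The same toy data is reused in the later sketches of the solution algorithms.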

Costs are used to drive the agent towards the desired behavior (defined by an objective function). Costs go with minimization objectives; alternatively, we may use rewards in conjunction with maximization objectives.

Be careful when defining costs: costs must depend only on the current state-action pair. This is a common source of errors, because in many cases the cost also depends on the next state. In such cases, average over all possible transitions to obtain a legitimate MDP cost (one that depends only on the current state-action pair), as in the expression below.
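If the raw cost also depends on the next state, say $\tilde{c}(i, k, j)$ (the notation $\tilde{c}$ is mine, not from the notes), the legitimate MDP cost is its expectation over the next state:

$$ c_i(k) = \sum_{j \in S} p_{ij}(k)\, \tilde{c}(i, k, j) $$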

Types of policies:

Stationary policies always follow the same rule for choosing an action in each possible state, independently of the current time $n$.

Non-stationary policies behave differently as time evolves.

Deterministic policies always take the same action when in the same state: $d_i(R) = k$ with probability 1.

Randomized policies map each state to a probability distribution over the possible actions: $d_i(R) = k$ with probability $\rho_i(k)$, where $\sum_{k \in A} \rho_i(k) = 1$.

Policy space: policies can be classified as stationary or non-stationary, and as randomized or deterministic. A deterministic policy maps each state to one action; a randomized policy maps each state to one probability distribution over the action space.

Stationarity and randomization are concepts that belong to different levels of characterizing a policy. They do not compare directly (e.g., stationary is NOT the opposite of randomized)!

Stationarity is about whether the rule for choosing decisions is affected by time. Randomization refers to the way decisions are made in particular states (under a stationary or non-stationary policy).

Stationary policies arise when optimizing over an infinite horizon. This makes sense intuitively: it is boundary (time) conditions that might prompt a change of behavior (given a fixed environment). A poker player who is already losing money might bet more aggressively towards the end of the game in a final attempt to recover. Similarly, a winning (rational!) player might avoid excessive risks towards the end of the game (to protect his winnings). Non-stationary policies arise from finite planning horizons.

Behavioral changes due to approaching time boundaries also appear in the domain of Game Theory (a game involves at least two interacting agents (players), whereas the decision theory discussed here involves only one agent, whose aim is to adapt to his environment rather than compete with another rational entity). An interesting discussion of such behavioral issues appears in the context of the Iterated Prisoner's Dilemma and other games.¹

¹ William Poundstone, Prisoner's Dilemma: John von Neumann, Game Theory and the Puzzle of the Bomb, Anchor Books, 1993. (Highly recommended!)

Cost criteria and planning horizon

Finite-horizon undiscounted-cost problems require the minimization of the total expected accumulated cost over a finite planning horizon of $W$ observations (transitions):

$$ E_i\{c\}(R) = E\left[ \sum_{n=1}^{W} c_{I_n}\big(d_{I_n}(R)\big) \;\Big|\; I_0 = i \right] $$

Infinite-horizon problems consider an infinite planning horizon. They are appropriate for systems that are expected to operate continuously, or for systems with an unknown stopping time.

To understand $E_i\{c\}(R)$, the expected cost when starting from state $i$ and operating for $W$ time units under policy $R$, remember the following definitions:

$I_0$ is the initial state of the process (at $n = 0$).

$I_n$ is the state of the process at time $n$.

$d_{I_n}(R)$ is the decision taken in state $I_n$ under policy $R$.

$c_{I_n}\big(d_{I_n}(R)\big)$ is the cost incurred by taking decision $d_{I_n}(R)$ in state $I_n$.

Discounted-cost problems (finite/infinite horizon)

Attach a discount factor $\alpha$, $0 < \alpha < 1$, to each immediate cost $c_i(k)$, thus affecting the relative importance of immediate costs over future costs:

$$ E_i\{c\}(R) = E\left[ \sum_{n=1}^{\infty} \alpha^n\, c_{I_n}\big(d_{I_n}(R)\big) \;\Big|\; I_0 = i \right] $$

When $\alpha \to 1$, future costs count almost as much as immediate costs. Otherwise, future costs are heavily discounted, so, for optimal performance, more attention must be given to the minimization of immediate costs.
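As a quick illustration (with numbers of my own choosing): for a constant per-transition cost $c$, the discounted criterion gives

$$ E_i\{c\}(R) = \sum_{n=1}^{\infty} \alpha^n c = \frac{\alpha}{1 - \alpha}\, c, $$

so $\alpha = 0.9$ values the entire future at $9c$, while $\alpha = 0.5$ values it at only $c$; the smaller $\alpha$ is, the more the optimal policy is driven by immediate costs.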

Average cost optimality (goes with infinite horizon)

Requires the minimization of the expected average cost per unit of time:

$$ E_i\{c\}(R) = E\left[ \lim_{n \to \infty} \frac{1}{n} \sum_{h=1}^{n} c_{I_h}\big(d_{I_h}(R)\big) \;\Big|\; I_0 = i \right] \qquad (1) $$

As $n \to \infty$, $P\{I_n = j \mid I_0 = i\}(R) \to \pi_j(R)$, independently of the initial state $I_0 = i$, thus:

$$ E\{c\}(R) = \sum_{j \in S} \pi_j(R)\, c_j\big(d_j(R)\big) \qquad (2) $$

Derivation of (1) $\Rightarrow$ (2): the limiting probability $\pi_j(R)$ can be written as follows:

$$ \pi_j(R) = \lim_{n \to \infty} \frac{1}{n} \sum_{h=1}^{n} P\{I_h = j \mid I_0 = i\}(R) \qquad (3) $$

Thus, starting from (1):

$$
\begin{aligned}
E_i\{c\}(R) &= E\left[ \lim_{n \to \infty} \frac{1}{n} \sum_{h=1}^{n} c_{I_h}\big(d_{I_h}(R)\big) \;\Big|\; I_0 = i \right] \\
&= \lim_{n \to \infty} \frac{1}{n} \sum_{h=1}^{n} E\left[ c_{I_h}\big(d_{I_h}(R)\big) \;\Big|\; I_0 = i \right] \\
&= \lim_{n \to \infty} \frac{1}{n} \sum_{h=1}^{n} \sum_{j \in S} c_j\big(d_j(R)\big)\, P\{I_h = j \mid I_0 = i\}(R) \\
&= \sum_{j \in S} c_j\big(d_j(R)\big) \lim_{n \to \infty} \frac{1}{n} \sum_{h=1}^{n} P\{I_h = j \mid I_0 = i\}(R) \\
&= \sum_{j \in S} c_j\big(d_j(R)\big)\, \pi_j(R) \qquad \text{(substituting from (3))}
\end{aligned}
$$

which is exactly (2).
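Continuing the hypothetical toy MDP introduced earlier, the following sketch evaluates (2) for a fixed stationary deterministic policy: it solves $\pi(R) = \pi(R) P(R)$ together with $\sum_j \pi_j(R) = 1$ and then takes the $\pi$-weighted sum of immediate costs. It assumes the chain induced by $R$ is irreducible.

    import numpy as np

    def average_cost(P, C, R):
        """Expected average cost per unit time of a stationary
        deterministic policy R (toy sketch, hypothetical data)."""
        n_states = len(R)
        P_R = np.array([P[R[i], i, :] for i in range(n_states)])
        c_R = np.array([C[i, R[i]] for i in range(n_states)])
        # Solve pi = pi P(R) together with sum(pi) = 1 (least squares).
        A = np.vstack([P_R.T - np.eye(n_states), np.ones(n_states)])
        b = np.append(np.zeros(n_states), 1.0)
        pi, *_ = np.linalg.lstsq(A, b, rcond=None)
        return float(pi @ c_R), pi

    P = np.array([[[0.9, 0.1], [0.4, 0.6]],
                  [[0.2, 0.8], [0.5, 0.5]]])
    C = np.array([[1.0, 4.0], [3.0, 0.5]])
    print(average_cost(P, C, R=[0, 1]))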

The optimal policy $R_{opt}$ is the one that incurs the smallest cost: $E\{c\}(R_{opt}) \le E\{c\}(R)$ over all policies $R$.

1. An optimal policy does not always exist under the average cost criterion.
2. If an optimal policy does exist, it is not guaranteed to be stationary.
3. If $S$ is finite and every stationary policy gives rise to an irreducible Markov chain, then a stationary optimal policy is guaranteed to exist (and it is non-randomized); see S. M. Ross [2] for more details.

Finding the optimal policy

1. Exhaustive enumeration: suitable only for tiny problems, due to its $O(|A|^{|S|})$ complexity.
2. Linear Programming (LP).
3. The policy improvement algorithm.
4. The value iteration algorithm (a brief sketch follows below).
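Value iteration is only listed here and is not developed further in these notes. Purely as an illustration, a minimal sketch for the discounted-cost criterion (reusing the hypothetical toy data, with a discount factor $\alpha$ of my choosing) could look like this:

    import numpy as np

    def value_iteration(P, C, alpha=0.9, tol=1e-8):
        """Discounted-cost value iteration (sketch): repeatedly apply
        V(i) <- min_k [ c_i(k) + alpha * sum_j p_ij(k) V(j) ]."""
        n_actions, n_states, _ = P.shape
        V = np.zeros(n_states)
        while True:
            Q = np.array([[C[i, k] + alpha * P[k, i, :] @ V
                           for k in range(n_actions)]
                          for i in range(n_states)])
            V_new = Q.min(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                return Q.argmin(axis=1), V_new   # greedy policy, values
            V = V_new

    P = np.array([[[0.9, 0.1], [0.4, 0.6]],
                  [[0.2, 0.8], [0.5, 0.5]]])
    C = np.array([[1.0, 4.0], [3.0, 0.5]])
    print(value_iteration(P, C))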

Solving the MDP via Linear Programming

The optimal policy can be identified efficiently by transforming the MDP formulation into a linear program.

Denote $D_{ik} = P\{\text{action} = k \mid \text{state} = i\}$; the $D_{ik}$'s completely define a policy. Also denote $y_{ik} = P\{\text{action} = k \text{ and state} = i\}$.

Clearly the two are related via:

$$ y_{ik} = \pi_i D_{ik} \qquad (4) $$

Also:

$$ \pi_i = \sum_{k=0}^{K} y_{ik} \qquad (5) $$

From (4) and (5):

$$ D_{ik} = \frac{y_{ik}}{\pi_i} = \frac{y_{ik}}{\sum_{k=0}^{K} y_{ik}} \qquad (6) $$

There are several constraints on the $y_{ik}$'s:

1. $\sum_{i=0}^{M} \pi_i = 1 \;\Rightarrow\; \sum_{i=0}^{M} \sum_{k=0}^{K} y_{ik} = 1$

2. $\pi_j = \sum_{i=0}^{M} \pi_i p_{ij} \;\Rightarrow\; \sum_{k=0}^{K} y_{jk} = \sum_{i=0}^{M} \sum_{k=0}^{K} y_{ik}\, p_{ij}(k) \quad \forall j$

3. $y_{ik} \ge 0 \quad \forall i, k$

The steady-state average cost per unit time is:

$$ E\{c\} = \sum_{i=0}^{M} \sum_{k=0}^{K} \pi_i\, c_i(k)\, D_{ik} = \sum_{i=0}^{M} \sum_{k=0}^{K} c_i(k)\, y_{ik} $$

The $y_{ik}$'s are obtained from the following LP:

Minimize

$$ z = E\{c\} = \sum_{i=0}^{M} \sum_{k=0}^{K} c_i(k)\, y_{ik} \qquad (7) $$

subject to

$$ \sum_{k=0}^{K} y_{jk} = \sum_{i=0}^{M} \sum_{k=0}^{K} y_{ik}\, p_{ij}(k) \quad \forall j \qquad (8) $$

$$ \sum_{i=0}^{M} \sum_{k=0}^{K} y_{ik} = 1 \qquad (9) $$

$$ y_{ik} \ge 0 \quad \forall i, k \qquad (10) $$

The $D_{ik}$'s are then readily available by using equation (6).

The $D_{ik}$'s of the optimal solution are either 0 or 1, i.e., the optimal policy is non-randomized. This is because the aforementioned LP has a totally unimodular constraint matrix and integer constants; these two properties guarantee that the Simplex method returns an integral optimal solution (in the current case one with 0's and 1's). See [5] for more on unimodularity.
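Here is a minimal sketch of the LP (7)-(10) using scipy.optimize.linprog, again on the hypothetical toy MDP (the variable ordering and the data are my own; the formulation follows the constraints above). It assumes every state is recurrent under the optimal policy, so that each row sum of $y$ is positive when recovering the $D_{ik}$'s via (6).

    import numpy as np
    from scipy.optimize import linprog

    def solve_mdp_lp(P, C):
        """Average-cost MDP as the LP (7)-(10); variables y_ik are
        flattened as y[i * n_actions + k] (toy sketch)."""
        n_actions, n_states, _ = P.shape
        nv = n_states * n_actions
        idx = lambda i, k: i * n_actions + k

        cost = np.array([C[i, k] for i in range(n_states)
                                 for k in range(n_actions)])
        A_eq, b_eq = [], []
        for j in range(n_states):                   # balance constraints (8)
            row = np.zeros(nv)
            for k in range(n_actions):
                row[idx(j, k)] += 1.0
            for i in range(n_states):
                for k in range(n_actions):
                    row[idx(i, k)] -= P[k, i, j]
            A_eq.append(row); b_eq.append(0.0)
        A_eq.append(np.ones(nv)); b_eq.append(1.0)  # normalization (9)

        res = linprog(cost, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                      bounds=[(0, None)] * nv, method="highs")
        y = res.x.reshape(n_states, n_actions)
        D = y / y.sum(axis=1, keepdims=True)        # recover D_ik via (6)
        return y, D, res.fun

    P = np.array([[[0.9, 0.1], [0.4, 0.6]],
                  [[0.2, 0.8], [0.5, 0.5]]])
    C = np.array([[1.0, 4.0], [3.0, 0.5]])
    y, D, avg_cost = solve_mdp_lp(P, C)
    print(D, avg_cost)

For this toy example the recovered $D_{ik}$'s come out as 0's and 1's, i.e., a deterministic policy, in line with the unimodularity remark above.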

A policy improvement algorithm

Very efficient for large problems.

Starts with an arbitrary policy, which is progressively improved at each iteration of the algorithm until the optimal policy is reached.

Convergence after a finite number of iterations is guaranteed for problems with finite state and action sets $S$, $A$.

The theory of policy improvement

$v_i^n(R)$: the total expected cost of starting from state $i$ and operating for $n$ periods under policy $R$:

$$ v_i^n(R) = c_i(k) + \sum_{j=0}^{M} p_{ij}(k)\, v_j^{n-1}(R) \qquad (11) $$

The expected average cost is independent of the initial state $i$:

$$ E\{c\}(R) = \sum_{j=0}^{M} \pi_j\, c_j(k) \qquad (12) $$

For large $n$ we have:

$$ v_i^n(R) \approx n\, E\{c\}(R) + v_i(R) \qquad (13) $$

$v_i(R)$ captures the effect of starting from state $i$ on the total expected cost $v_i^n(R)$, thus:

$$ v_i^n(R) - v_j^n(R) \approx v_i(R) - v_j(R) \qquad (14) $$

Substituting $v_i^n(R) \approx n\, E\{c\}(R) + v_i(R)$ and $v_j^{n-1}(R) \approx (n-1)\, E\{c\}(R) + v_j(R)$ into equation (11), we obtain:

$$ E\{c\}(R) + v_i(R) = c_i(k) + \sum_{j=0}^{M} p_{ij}(k)\, v_j(R), \quad \text{for } i = 0, 1, \ldots, M \qquad (15) $$

The system of equations (15) has $M + 2$ unknowns ($E\{c\}(R)$ and the $v_i(R)$'s) and $M + 1$ equations. By setting $v_M(R) = 0$ we can find the $v_i(R)$'s and the cost associated with a particular policy.

Theoretically, this recursive equation could be used for an exhaustive search for the optimal policy, but this is not computationally efficient.

The policy improvement algorithm

Initialization: select an arbitrary initial policy $R_0$.

Iteration $n$: perform the following steps.

Value determination: for policy $R_n$, solve the system of $M + 1$ equations

$$ E\{c\}(R_n) = c_i(k) + \sum_{j=0}^{M} p_{ij}(k)\, v_j(R_n) - v_i(R_n), \quad \text{for } 0 \le i \le M, \; k = d_i(R_n) $$

for the $M + 1$ unknown values $E\{c\}(R_n), v_0(R_n), v_1(R_n), \ldots, v_{M-1}(R_n)$ (with $v_M(R_n) = 0$).

Policy improvement: using the values $v_i(R_n)$ computed for policy $R_n$, find an improved policy $R_{n+1}$ such that for each state $i$, $d_i(R_{n+1}) = k$ is the decision that minimizes

$$ c_i(k) + \sum_{j=0}^{M} p_{ij}(k)\, v_j(R_n) - v_i(R_n), \quad \text{for } 0 \le i \le M \qquad (16) $$

i.e., for each state $i$ minimize (16) over $k$ and set $d_i(R_{n+1})$ equal to the minimizing value of $k$. This procedure defines a new policy $R_{n+1}$ with $E\{c\}(R_{n+1}) \le E\{c\}(R_n)$ (see Theorem 3.2 in [3]).

Optimality test: if the current policy $R_{n+1}$ is identical to the previous policy $R_n$, then it is optimal. Otherwise set $n = n + 1$ and perform another iteration.
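The algorithm above translates almost directly into code. Below is a minimal sketch on the hypothetical toy data: the value-determination step solves the linear system with $v_M(R_n) = 0$, and the improvement step minimizes (16) state by state.

    import numpy as np

    def policy_iteration(P, C, R0=None):
        """Average-cost policy improvement algorithm (sketch)."""
        n_actions, n_states, _ = P.shape
        R = list(R0) if R0 is not None else [0] * n_states
        while True:
            # Value determination: unknowns (E, v_0, ..., v_{M-1}), v_M = 0.
            # For each state i: E + v_i - sum_j p_ij(k) v_j = c_i(k), k = d_i(R).
            A = np.zeros((n_states, n_states))
            b = np.zeros(n_states)
            for i in range(n_states):
                k = R[i]
                A[i, 0] = 1.0                           # coefficient of E
                if i < n_states - 1:
                    A[i, i + 1] += 1.0                  # + v_i
                A[i, 1:] -= P[k, i, :n_states - 1]      # - sum_j p_ij(k) v_j
                b[i] = C[i, k]
            sol = np.linalg.solve(A, b)
            E, v = sol[0], np.append(sol[1:], 0.0)
            # Policy improvement: minimize c_i(k) + sum_j p_ij(k) v_j - v_i.
            R_new = [int(np.argmin([C[i, k] + P[k, i, :] @ v - v[i]
                                    for k in range(n_actions)]))
                     for i in range(n_states)]
            if R_new == R:                              # optimality test
                return R, E, v
            R = R_new

    P = np.array([[[0.9, 0.1], [0.4, 0.6]],
                  [[0.2, 0.8], [0.5, 0.5]]])
    C = np.array([[1.0, 4.0], [3.0, 0.5]])
    print(policy_iteration(P, C))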

References

[1] H. Mine and S. Osaki, Markovian Decision Processes, Elsevier, Amsterdam, 1970.
[2] Sheldon M. Ross, Applied Probability Models with Optimization Applications, Dover Publications, New York, 1992.
[3] Henk C. Tijms, Stochastic Modelling and Analysis: A Computational Approach, John Wiley & Sons, 1986.
[4] Frederick S. Hillier and Gerald J. Lieberman, Introduction to Operations Research, McGraw-Hill, 2000.
[5] Christos H. Papadimitriou and Kenneth Steiglitz, Combinatorial Optimization: Algorithms and Complexity, Dover Publications, New York, 1998.