MARKOV DECISION PROCESSES LODEWIJK KALLENBERG UNIVERSITY OF LEIDEN
Preface Branching out from operations research roots of the 1950s, Markov decision processes (MDPs) have gained recognition in such diverse fields as ecology, economics, and communication engineering. These applications have been accompanied by many theoretical advances. Markov decision processes, also referred to as stochastic dynamic programming or stochastic control problems, are models for sequential decision making when outcomes are uncertain. The Markov decision process model consists of decision epochs, states, actions, rewards, and transition probabilities. Choosing an action in a state generates a reward and determines the state at the next decision epoch through a transition probability function. Policies or strategies are prescriptions of which action to choose under any eventuality at every future decision epoch. Decision makers seek policies which are optimal in some sense. These lecture notes aim to present a unified treatment of the theoretical and algorithmic aspects of Markov decision process models. They can serve as a text for an advanced undergraduate or graduate level course in operations research, econometrics or control engineering. As a prerequisite, the reader should have some background in linear algebra, real analysis, probability, and linear programming. Throughout the text there are many examples; at the end of each chapter there is a section with bibliographic notes and a section with exercises (the notes include 8 exercises). A solution manual is available on request (by e-mail to kallenberg@math.leidenuniv.nl). Chapter 1 introduces the Markov decision process model as a sequential decision model with actions, rewards, transitions and policies. We illustrate these concepts with nine different applications: red-black gambling, how-to-serve in tennis, optimal stopping, replacement problems, maintenance and repair, production control, optimal control of queues, stochastic scheduling, and the multi-armed bandit problem.
Chapter 2 deals with the finite horizon model and the principle of dynamic programming: backward induction. We also study under which conditions optimal policies are monotone, i.e. nondecreasing or nonincreasing in the ordering of the state space. In Chapter 3 the discounted reward over an infinite horizon is studied. This results in the optimality equation and in solution methods for this equation: policy iteration, linear programming, value iteration and modified value iteration.
Chapter 4 discusses the total reward over an infinite horizon under the assumption that the transition matrices are substochastic. We derive equivalent statements for the property that the model is a contracting dynamic programming model. For contracting dynamic programming, similar results and algorithms can be formulated as for the discounted reward model. Special sections are devoted to positive, negative and convergent models, and to optimal stopping problems. Chapter 5 discusses the criterion of average reward over an infinite horizon, in the most general case. Firstly, polynomial algorithms are developed to classify MDPs as irreducible or communicating. Distinguishing between unichain and multichain MDPs turns out to be NP-complete, so there is no hope of a polynomial algorithm. Then the stationary, the fundamental and the deviation matrices are introduced, and the relations between them and their properties are derived. Next, an extension of a theorem by Blackwell and the Laurent series expansion are presented. These results are fundamental for analysing the relation between discounted, average and more sensitive optimality criteria. With these results, as in the discounted case but via a more complicated analysis, the optimality equation is derived, and solution methods for this equation are presented (policy iteration, linear programming and value iteration). In Chapter 6 special cases of the average reward criterion (irreducible, unichain and communicating MDPs) are considered. In all these cases the optimality equation and the methods of policy iteration, linear programming, value iteration and modified value iteration can be simplified. Chapter 7 introduces more sensitive optimality criteria: bias optimality, n-discount optimality and Blackwell optimality. We present a unifying framework, based on the Laurent series expansion, to derive sensitive discount optimality equations.
Using a lexicographic ordering of the Laurent series, we derive the policy iteration method for n-discount optimality. For bias optimality, an optimal policy can be found with a three-step linear programming approach. When in addition the model is a unichain MDP, the linear programs for bias optimality can be simplified. In the irreducible case, one can derive a sequence of nested linear programs to compute n-discount optimal policies for any n. Also for Blackwell optimality, even in the most general case, linear programming can be applied. However, then the elements are not real numbers, but lie in a much more general ordered field, namely an ordered field of rational functions. In Chapter 8 the applications introduced in Chapter 1 are analysed in much more detail. In most cases theoretical as well as computational (algorithmic) results are presented. It turns out that in many cases polynomial algorithms exist, e.g. of order O(N^3), where N is the number of states. Chapter 9 deals with some other topics: additional constraints and multiple objectives (both for discounted and for average MDPs), and mean-variance tradeoffs.
In the last chapter (Chapter 10) stochastic games are treated, particularly the two-person zero-sum stochastic game. There, both players may choose actions from their own action sets, resulting in transitions and rewards determined by both players. Zero-sum means that the reward for player 1 has to be paid by player 2. Hence, there is a conflicting situation: player 1 wants to maximize the rewards, while player 2 tries to minimize them. We discuss the value of the game and the concept of optimal policies for discounted as well as for average rewards. We also derive mathematical programming formulations and iterative methods. In some special cases we can present finite methods. For these notes I have used a lot of material, collected over the years, from various sources. It is impossible for me to mention them all here. However, there is one exception. I wish to express my gratitude to Arie Hordijk, who introduced me to the topic of MDPs, as my teacher and former colleague. Already 20 years ago we started on a first draft of some chapters for a book on MDPs. Our work was interrupted and, unfortunately, never continued. Lodewijk Kallenberg, Leiden, April 2009.
Contents

1 Introduction
    The MDP model
    Policies and optimality criteria (policies; optimality criteria)
    Examples (red-black gambling; gaming: how to serve in tennis; optimal stopping; replacement problems; maintenance and repair; production control; optimal control of queues; stochastic scheduling; multi-armed bandit problem)
    Bibliographic notes
    Exercises
2 Finite Horizon
    Introduction
    Backward induction
    An equivalent stationary infinite horizon model
    Monotone optimal policies
    Bibliographic notes
    Exercises
3 Discounted rewards
    Introduction
    Monotone contraction mappings
    The optimality equation
    Policy iteration
    Linear programming
    Value iteration
    Modified policy iteration
    Bibliographic notes
    Exercises
4 Total reward
    Introduction
    Equivalent statements for contracting
    The contracting model
    Positive MDPs
    Negative MDPs
    Convergent MDPs
    Special models (red-black gambling; optimal stopping)
    Bibliographic notes
    Exercises
5 Average reward - general case
    Introduction
    Classification of MDPs (definitions; classification of Markov chains; classification of Markov decision chains)
    Stationary, fundamental and deviation matrix (the stationary matrix; the fundamental matrix and the deviation matrix)
    Extension of Blackwell's theorem
    The Laurent series expansion
    The optimality equation
    Policy iteration
    Linear programming
    Value iteration
    Bibliographic notes
    Exercises
6 Average reward - special cases
    The irreducible case (optimality equation; policy iteration; linear programming; value iteration; modified policy iteration)
    The unichain case (optimality equation; policy iteration; linear programming; value iteration; modified policy iteration)
    The communicating case (optimality equation; policy iteration; linear programming; value iteration; modified value iteration)
    Bibliographic notes
    Exercises
7 More sensitive optimality criteria
    Introduction
    Equivalence between n-discount and n-average optimality
    Stationary optimal policies and optimality equations
    Lexicographic ordering of Laurent series
    Policy iteration for n-discount optimality
    Linear programming and n-discount optimality (irreducible case): average optimality; bias optimality; n-discount optimality
    Blackwell optimality and linear programming
    Bias optimality and linear programming (the general case; the unichain case)
    Overtaking and average overtaking optimality
    Bibliographic notes
    Exercises
8 Special models
    Replacement problems (a general replacement model; a replacement model with increasing deterioration; skip to the right model with failure; a separable replacement problem)
    Maintenance and repair problems (a surveillance-maintenance-replacement model; optimal repair allocation in a series system)
    Production and inventory control (no backlogging; backlogging; inventory control and single-critical-number policies; inventory control and (s, S)-policies)
    Optimal control of queues (the single-server queue; parallel queues)
    Stochastic scheduling (maximizing finite-time returns on a single processor; optimality of the µc-rule; optimality of threshold policies; optimality of join-the-shortest-queue policies; optimality of LEPT and SEPT policies; maximizing finite-time returns on two processors; tandem queues)
    Multi-armed bandit problems (introduction; a single project with a terminal reward; multi-armed bandits; methods for the computation of the Gittins indices)
    Separable problems (introduction; examples (part 1); discounted rewards; average rewards - unichain case; average rewards - general case; examples (part 2))
    Exercises
    Bibliographic notes
9 Other topics
    Additional constraints (introduction; infinite horizon and discounted rewards; infinite horizon and average rewards)
    Multiple objectives (discounted rewards; average rewards)
    Mean-variance tradeoffs (formulations of the problem; a unifying framework; determination of an optimal solution; determination of an optimal policy)
    Bibliographic notes
    Exercises
10 Stochastic Games
    Introduction (the model; optimality criteria; matrix games)
    Discounted rewards (value and optimal policies; mathematical programming; iterative methods; finite methods)
    Average rewards (value and optimal policies; the Big Match; mathematical programming; perfect information and irreducible games; finite methods)
    Bibliographic notes
    Exercises
Chapter 1 Introduction

In this first chapter we introduce the model of a Markov decision process (MDP for short). We present several optimality criteria and give some examples of problems that can be modelled as an MDP.

1.1 The MDP model

An MDP is a model for sequential decision making under uncertainty, taking into account both the short-term outcomes of current decisions and opportunities for making decisions in the future. While the notion of an MDP may appear quite simple, it encompasses a wide range of applications and has generated a rich mathematical theory. In an MDP model one can distinguish the following seven characteristics.
1. The state space. At any time point at which a decision has to be made, the state of the system is observed by the decision maker. The set of possible states is called the state space and will be denoted by S. The state space may be finite, denumerable, compact or even more general. For a finite state space, the number of states, i.e. |S|, will be denoted by N.
2. The action sets. When the decision maker observes that the system is in state i, he (we will refer to the decision maker as "he") chooses an action from a certain action set that may depend on the observed state: the action set in state i is denoted by A(i). As with the state space, the action sets may be finite, denumerable, compact or more general.
3. The decision time points. The time intervals between the decision points may be constant or random. In the first case the model is said to be a Markov decision process; when the times between consecutive decision points are random, the model is called a semi-Markov decision process.
4. The immediate rewards (or costs). Given the state of the system and the chosen action, an immediate reward (or cost) is earned (there is no essential difference between rewards and costs, because maximizing rewards is equivalent to minimizing costs). These rewards depend on the decision time point, the observed state and the chosen action, but not on the history of the process. The immediate reward at decision time point t for action a in state i will be denoted by r_i^t(a); if the reward is independent of the time t, we write r_i(a) instead of r_i^t(a).
5. The transition probabilities. Given the state of the system and the chosen action, the state at the next decision time point is determined by a transition law. These transitions depend only on the decision time point t, the observed state i and the chosen action a, and not on the history of the process. This property is called the Markov property. If the state at time t is i and action a is chosen, we denote the probability that at the next time point the system is in state j by p_{ij}^t(a). If the transitions really depend on the decision time point, the problem is said to be nonstationary; if they are independent of the time points, the problem is called stationary, and the transition probabilities are denoted by p_{ij}(a).
6. The planning horizon. The process has a planning horizon, i.e. the time points at which the system has to be controlled. This horizon may be finite, infinite or of random length.
7. The optimality criterion. The objective is to determine a policy, i.e. a decision rule for each decision time point and each history (including the present state) of the process, that optimizes the performance of the system. The performance is measured by a utility function. This function assigns to each policy, given the starting state of the process, a value. In the next section we will present several optimality criteria.
Example 1.1
Inventory model with backlogging. An inventory has to be managed over a planning horizon of T weeks. The optimization problem is: which inventory strategy minimizes the total expected costs? At the beginning of each week the manager observes the inventory on hand and decides how many units to order. We assume that orders are delivered instantaneously, and that there is a finite inventory capacity of B units. We also assume that the demands D_t in week t, 1 ≤ t ≤ T, are independent random variables with nonnegative integer values, and that the numbers p_j(t) := P{D_t = j} are known for all j ∈ N_0 and t = 1, 2, ..., T. If the demand during a week exceeds the inventory on hand, the shortage is backlogged in the next week. Let the inventory at the start of week t be i (shortages are modelled as negative inventory), let the number of ordered units be a, and let j be the inventory at the end of week t.
Then the following costs are involved, where we use the notation δ(x) = 1 if x ≥ 1 and δ(x) = 0 if x ≤ 0:
ordering costs: K_t δ(a) + k_t a;
inventory costs: h_t δ(j) · j;
backlogging costs: q_t δ(−j) · (−j).
The data K_t, k_t, h_t, q_t and p_j(t), j ∈ N_0, are known for all t ∈ {1, 2, ..., T}. If an order is made in week t, there is a fixed cost K_t and a cost k_t for each ordered unit. If at the end of the week there is a positive inventory, then there are inventory costs of h_t per unit; when there is a shortage, there are backlogging costs of q_t per unit. This inventory problem can be modelled as a nonstationary MDP over a finite planning horizon, with a denumerable state space and finite action sets:
S = {..., −1, 0, 1, ..., B};
A(i) = {a ≥ 0 | 0 ≤ i + a ≤ B};
p_{ij}^t(a) = p_{i+a−j}(t) if j ≤ i + a, and p_{ij}^t(a) = 0 if j > i + a;
r_i^t(a) = −{K_t δ(a) + k_t a + Σ_{j=0}^{i+a} p_j(t) h_t (i + a − j) + Σ_{j=i+a+1}^∞ p_j(t) q_t (j − i − a)},
i.e. the reward is minus the expected costs in week t.

1.2 Policies and optimality criteria

1.2.1 Policies

A policy R is a sequence of decision rules: R = (π^1, π^2, ..., π^t, ...), where π^t is the decision rule at time point t, t = 1, 2, .... The decision rule π^t at time point t may depend on all available information on the system until time t, i.e. on the states at the time points 1, 2, ..., t and the actions at the time points 1, 2, ..., t−1. The formal definition of a policy is as follows. Consider the Cartesian product S × A = {(i, a) | i ∈ S, a ∈ A(i)} and let H_t denote the set of the possible histories of the system up to time point t, i.e.
H_t := {h_t = (i_1, a_1, ..., i_{t−1}, a_{t−1}, i_t) | (i_k, a_k) ∈ S × A, 1 ≤ k ≤ t−1; i_t ∈ S}. (1.1)
A decision rule π^t at time point t is a function on H_t which prescribes the action to be taken at time t as a transition probability from H_t into A, i.e.
π^t_{h_t a_t} ≥ 0 for every a_t ∈ A(i_t) and Σ_{a_t} π^t_{h_t a_t} = 1 for every h_t ∈ H_t. (1.2)
Let C denote the set of all policies. A policy is said to be memoryless if the decision rule π^t is independent of (i_1, a_1, ..., i_{t−1}, a_{t−1}) for every t ∈ N. For a memoryless policy the decision rule
at time t only depends on the state i_t; therefore the notation π^t_{i_t a_t} is used. We call C(M) the set of the memoryless policies. Memoryless policies are also called Markov policies. If a policy is memoryless and the decision rules are independent of the time point t, i.e. π^1 = π^2 = ···, then the policy is called stationary. Hence, a stationary policy is determined by a nonnegative function π on S × A such that Σ_a π_{ia} = 1 for every i ∈ S. The stationary policy R = (π, π, ...) is denoted by π^∞. The set of stationary policies is denoted by C(S). If the decision rule π of a stationary policy is nonrandomized, i.e. for every i ∈ S we have π_{i a_i} = 1 for exactly one action a_i (and consequently π_{ia} = 0 for every a ≠ a_i), then the policy is called deterministic. A deterministic policy can be described by a function f on S, where f(i) is the chosen action a_i, i ∈ S. A deterministic policy is denoted by f^∞ (and sometimes by f), and the set of deterministic policies by C(D).
A matrix P = (p_{ij}) is a transition matrix if p_{ij} ≥ 0 for all (i, j) and Σ_j p_{ij} = 1 for all i. For a Markov policy R = (π^1, π^2, ...) the transition matrix P(π^t) and the reward vector r(π^t) are defined by
{P(π^t)}_{ij} = Σ_a p_{ij}^t(a) π^t_{ia} for every (i, j) ∈ S × S and t ∈ N, (1.3)
{r(π^t)}_i = Σ_a r_i^t(a) π^t_{ia} for every i ∈ S and t ∈ N. (1.4)
Consider an initial distribution β, i.e. β_i is the probability that the system starts in state i, and a policy R. Then, by a theorem of Ionescu Tulcea (cf. Bertsekas and Shreve [4] p. 40), there exists a unique probability measure P_{β,R} on H_∞, where
H_∞ = {h = (i_1, a_1, i_2, a_2, ...) | (i_k, a_k) ∈ S × A, k = 1, 2, ...}. (1.5)
Let the random variables X_t and Y_t denote the state and action at time t, t = 1, 2, ..., and let P_{β,R}{X_t = j, Y_t = a} be the probability that at time t the state is j and the action is a, given that policy R is used and the initial distribution is β. If β_i = 1 for some i ∈ S, then we write P_{i,R} instead of P_{β,R}.
The expectation operator with respect to the probability measure P_{β,R} or P_{i,R} is denoted by E_{β,R} or E_{i,R}, respectively.
Lemma 1.1
For any Markov policy R = (π^1, π^2, ...), any initial distribution β and t ∈ N, we have
(1) P_{β,R}{X_t = j, Y_t = a} = Σ_i β_i {P(π^1) P(π^2) ··· P(π^{t−1})}_{ij} π^t_{ja}, (j, a) ∈ S × A;
(2) E_{β,R}{r^t_{X_t}(Y_t)} = Σ_i β_i {P(π^1) P(π^2) ··· P(π^{t−1}) r(π^t)}_i,
where P(π^1) P(π^2) ··· P(π^{t−1}) = I (the identity matrix) for t = 1.
Proof
By induction on t. For t = 1,
P_{β,R}{X_1 = j, Y_1 = a} = β_j π^1_{ja}, which is statement (1) for t = 1 (the matrix product being I),
and
E_{β,R}{r^1_{X_1}(Y_1)} = Σ_{i,a} β_i π^1_{ia} r^1_i(a) = Σ_i β_i {r(π^1)}_i,
which is statement (2) for t = 1. Assume that the results are shown for t; we show that they also hold for t+1:
P_{β,R}{X_{t+1} = j, Y_{t+1} = a} = Σ_{k,b} P_{β,R}{X_t = k, Y_t = b} p^t_{kj}(b) π^{t+1}_{ja}
= Σ_{k,b,i} β_i {P(π^1) ··· P(π^{t−1})}_{ik} π^t_{kb} p^t_{kj}(b) π^{t+1}_{ja}
= Σ_i β_i Σ_k {P(π^1) ··· P(π^{t−1})}_{ik} Σ_b π^t_{kb} p^t_{kj}(b) π^{t+1}_{ja}
= Σ_i β_i Σ_k {P(π^1) ··· P(π^{t−1})}_{ik} {P(π^t)}_{kj} π^{t+1}_{ja}
= Σ_i β_i {P(π^1) ··· P(π^t)}_{ij} π^{t+1}_{ja}.
Furthermore, one has
E_{β,R}{r^{t+1}_{X_{t+1}}(Y_{t+1})} = Σ_{j,a} P_{β,R}{X_{t+1} = j, Y_{t+1} = a} r^{t+1}_j(a)
= Σ_{j,a,i} β_i {P(π^1) ··· P(π^t)}_{ij} π^{t+1}_{ja} r^{t+1}_j(a)
= Σ_i β_i Σ_j {P(π^1) ··· P(π^t)}_{ij} Σ_a π^{t+1}_{ja} r^{t+1}_j(a)
= Σ_i β_i Σ_j {P(π^1) ··· P(π^t)}_{ij} {r(π^{t+1})}_j
= Σ_i β_i {P(π^1) ··· P(π^t) r(π^{t+1})}_i.
The next theorem shows that for any initial distribution β, any sequence of policies R_1, R_2, ... and any convex combination of the marginal distributions of P_{β,R_k}, k ∈ N, there exists a Markov policy R_* with the same marginal distributions.
Theorem 1.1
For any initial distribution β, any sequence of policies R_1, R_2, ... and any sequence of nonnegative real numbers p_1, p_2, ... satisfying Σ_k p_k = 1, there exists a Markov policy R_* such that
P_{β,R_*}{X_t = j, Y_t = a} = Σ_k p_k P_{β,R_k}{X_t = j, Y_t = a}, (j, a) ∈ S × A, t ∈ N. (1.6)
Proof
Define the Markov policy R_* = (π^1, π^2, ...) by
π^t_{ja} = Σ_k p_k P_{β,R_k}{X_t = j, Y_t = a} / Σ_k p_k P_{β,R_k}{X_t = j}, t ∈ N, (j, a) ∈ S × A. (1.7)
In case the denominator is zero, take for π^t_{ja}, a ∈ A(j), arbitrary nonnegative numbers such that Σ_a π^t_{ja} = 1, j ∈ S. Take any (j, a) ∈ S × A. We show the theorem by induction on t. For t = 1, we obtain P_{β,R_*}{X_1 = j} = β_j and Σ_k p_k P_{β,R_k}{X_1 = j} = β_j. If β_j = 0, then
P_{β,R_*}{X_1 = j, Y_1 = a} = Σ_k p_k P_{β,R_k}{X_1 = j, Y_1 = a} = 0.
If β_j ≠ 0, then from (1.7) it follows that
Σ_k p_k P_{β,R_k}{X_1 = j, Y_1 = a} = (Σ_k p_k P_{β,R_k}{X_1 = j}) π^1_{ja} = β_j π^1_{ja} = P_{β,R_*}{X_1 = j, Y_1 = a}.
Assume that (1.6) holds for t. We have to prove that (1.6) also holds for t+1:
P_{β,R_*}{X_{t+1} = j} = Σ_{l,b} P_{β,R_*}{X_t = l, Y_t = b} p^t_{lj}(b) = Σ_{l,b} Σ_k p_k P_{β,R_k}{X_t = l, Y_t = b} p^t_{lj}(b) = Σ_k p_k Σ_{l,b} P_{β,R_k}{X_t = l, Y_t = b} p^t_{lj}(b) = Σ_k p_k P_{β,R_k}{X_{t+1} = j}.
If P_{β,R_*}{X_{t+1} = j} = 0, then Σ_k p_k P_{β,R_k}{X_{t+1} = j} = 0, and consequently
P_{β,R_*}{X_{t+1} = j, Y_{t+1} = a} = Σ_k p_k P_{β,R_k}{X_{t+1} = j, Y_{t+1} = a} = 0.
If P_{β,R_*}{X_{t+1} = j} ≠ 0, then
P_{β,R_*}{X_{t+1} = j, Y_{t+1} = a} = P_{β,R_*}{X_{t+1} = j} π^{t+1}_{ja} = (Σ_k p_k P_{β,R_k}{X_{t+1} = j}) · Σ_k p_k P_{β,R_k}{X_{t+1} = j, Y_{t+1} = a} / Σ_k p_k P_{β,R_k}{X_{t+1} = j} = Σ_k p_k P_{β,R_k}{X_{t+1} = j, Y_{t+1} = a}.
Corollary 1.1
For any starting state i and any policy R, there exists a Markov policy R_* such that P_{i,R_*}{X_t = j, Y_t = a} = P_{i,R}{X_t = j, Y_t = a}, t ∈ N, (j, a) ∈ S × A, and E_{i,R_*}{r^t_{X_t}(Y_t)} = E_{i,R}{r^t_{X_t}(Y_t)}, t ∈ N.

1.2.2 Optimality criteria

In this course we consider the following optimality criteria:
1. Total expected reward over a finite horizon.
2. Total expected discounted reward over an infinite horizon.
3. Total expected reward over an infinite horizon.
4. Average expected reward over an infinite horizon.
5. More sensitive optimality criteria over an infinite horizon.
Assumption 1.1
In infinite horizon models we assume that the immediate rewards and the transition probabilities are stationary, and we denote them by r_i(a) and p_{ij}(a), respectively, for all i, j and a.
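To fix ideas, the stationary model of Assumption 1.1 can be held in a small data structure. The following is a minimal sketch in Python (an illustration added to these notes, not part of the original text; the class and field names are invented), instantiated with the three-state MDP of Example 1.2:

```python
# Minimal container for a stationary finite MDP: states, action sets A(i),
# rewards r_i(a) and transition probabilities p_ij(a).  Names are illustrative.

class MDP:
    def __init__(self, states, actions, reward, trans):
        self.S = states          # list of states
        self.A = actions         # dict: state -> list of actions
        self.r = reward          # dict: (i, a) -> immediate reward r_i(a)
        self.p = trans           # dict: (i, a) -> {j: p_ij(a)}
        for i in states:
            for a in actions[i]:  # each row must be a probability vector
                assert abs(sum(trans[(i, a)].values()) - 1.0) < 1e-12

# The three-state MDP of Example 1.2 encoded in this structure
# (zero-probability transitions are simply omitted):
mdp = MDP(
    states=[1, 2, 3],
    actions={1: [1], 2: [1, 2], 3: [1, 2]},
    reward={(1, 1): 0, (2, 1): 1, (2, 2): 1, (3, 1): 0, (3, 2): 0},
    trans={(1, 1): {2: 0.5, 3: 0.5}, (2, 1): {2: 1.0}, (2, 2): {3: 1.0},
           (3, 1): {3: 1.0}, (3, 2): {2: 1.0}},
)
```

A deterministic policy f is then simply a dict mapping each state to one action in A(i); randomized policies would map each state to a probability vector over A(i).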
Total expected reward over a finite horizon
Consider an MDP with a finite planning horizon of T periods. For any policy R and initial state i ∈ S, the total expected reward over the planning horizon is defined by
v_i^T(R) = Σ_{t=1}^T E_{i,R}{r^t_{X_t}(Y_t)} = Σ_{t=1}^T Σ_{j,a} P_{i,R}{X_t = j, Y_t = a} r^t_j(a), i ∈ S. (1.8)
Interchanging the summation and the expectation in (1.8) is allowed, so v_i^T(R) may also be defined as the expected total reward, i.e.
v_i^T(R) = E_{i,R}{Σ_{t=1}^T r^t_{X_t}(Y_t)}, i ∈ S.
Let v_i^T = sup_{R∈C} v_i^T(R), i ∈ S, or in vector notation, v^T = sup_{R∈C} v^T(R). The vector v^T is called the value vector. From Corollary 1.1 and Lemma 1.1, it follows that
v^T = sup_{R∈C(M)} v^T(R), (1.9)
and
v^T(R) = Σ_{t=1}^T P(π^1) P(π^2) ··· P(π^{t−1}) r(π^t) for R = (π^1, π^2, ...) ∈ C(M). (1.10)
A policy R^* is called an optimal policy if v^T(R^*) = sup_{R∈C} v^T(R). It is nontrivial that there exists an optimal policy: the supremum has to be attained, and it has to be attained simultaneously for all starting states. It can be shown (see the next chapter) that an optimal Markov policy R^* = (f^1, f^2, ..., f^T) exists, where f^t is a deterministic decision rule for 1 ≤ t ≤ T.
Total expected discounted reward over an infinite horizon
Assume that an amount r earned at time point 1 is deposited in a bank with interest rate ρ. This amount becomes (1+ρ)r at time point 2, (1+ρ)^2 r at time point 3, etc. Hence, an amount r at time point 1 is comparable with (1+ρ)^{t−1} r at time point t, t = 1, 2, .... Let α = (1+ρ)^{−1}, called the discount factor; note that α ∈ (0, 1). Then, conversely, an amount r received at time point t can be considered as equivalent to an amount α^{t−1} r at time point 1, the so-called discounted value.
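Before turning to the discounted criterion: the optimal Markov policy (f^1, ..., f^T) for the finite horizon mentioned above is computed in Chapter 2 by backward induction. As a preview, a hedged sketch for a model with stationary rewards and transitions (function and variable names are our own, not from the text):

```python
# Backward induction sketch for the finite-horizon value vector v^T.
# u[i] holds the maximal expected reward over the remaining periods from state i.

def backward_induction(S, A, r, p, T):
    u_next = {i: 0.0 for i in S}        # value after the horizon is 0
    policy = {}
    for t in range(T, 0, -1):           # work backwards from period T to 1
        u, f = {}, {}
        for i in S:
            # choose the action maximizing immediate + expected future reward
            best = max(A[i], key=lambda a: r[(i, a)] +
                       sum(q * u_next[j] for j, q in p[(i, a)].items()))
            f[i] = best
            u[i] = r[(i, best)] + sum(q * u_next[j]
                                      for j, q in p[(i, best)].items())
        u_next, policy[t] = u, f
    return u_next, policy               # v^T and decision rules f^1, ..., f^T
```

The returned policy is Markov and deterministic, in line with the existence result quoted above.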
The reward r_{X_t}(Y_t) at time point t has at time point 1 the discounted value α^{t−1} r_{X_t}(Y_t). The total expected α-discounted reward, given initial state i and policy R, is denoted by v_i^α(R) and defined by
v_i^α(R) = Σ_{t=1}^∞ E_{i,R}{α^{t−1} r_{X_t}(Y_t)} = Σ_{t=1}^∞ α^{t−1} Σ_{j,a} P_{i,R}{X_t = j, Y_t = a} r_j(a). (1.11)
Another way to consider the discounted reward is by the expected total α-discounted reward, i.e.
E_{i,R}{Σ_{t=1}^∞ α^{t−1} r_{X_t}(Y_t)}.
Since
Σ_{t=1}^∞ |α^{t−1} r_{X_t}(Y_t)| ≤ Σ_{t=1}^∞ α^{t−1} M = (1−α)^{−1} M,
where M = max_{i,a} |r_i(a)|, the theorem of dominated convergence (e.g. Bauer [8] p. 7) implies
E_{i,R}{Σ_{t=1}^∞ α^{t−1} r_{X_t}(Y_t)} = Σ_{t=1}^∞ E_{i,R}{α^{t−1} r_{X_t}(Y_t)} = v_i^α(R), (1.12)
i.e. the expected total discounted reward and the total expected discounted reward criteria are equivalent. Let R = (π^1, π^2, ...) ∈ C(M); then
v^α(R) = Σ_{t=1}^∞ α^{t−1} P(π^1) P(π^2) ··· P(π^{t−1}) r(π^t). (1.13)
Hence, a stationary policy π^∞ satisfies
v^α(π^∞) = Σ_{t=1}^∞ α^{t−1} P(π)^{t−1} r(π). (1.14)
As before, the value vector v^α and the concept of optimality for a policy R^* are defined by
v^α = sup_R v^α(R) and v^α(R^*) = v^α. (1.15)
In Chapter 3 we show the existence of an optimal deterministic policy f^∞ for this criterion, and we show that the value vector v^α is the unique solution of the so-called optimality equation
x_i = max_{a∈A(i)} {r_i(a) + α Σ_j p_{ij}(a) x_j}, i ∈ S. (1.16)
Furthermore, we will see that f^∞ is an optimal policy if
r_i(f) + α Σ_j p_{ij}(f) v^α_j ≥ r_i(a) + α Σ_j p_{ij}(a) v^α_j for all a ∈ A(i), i ∈ S. (1.17)
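The optimality equation (1.16) is a fixed-point equation, which suggests computing v^α by successive approximation. A sketch of value iteration, treated properly in Chapter 3 (the stopping tolerance and all names are illustrative choices of ours, not from the text):

```python
# Value iteration sketch: iterate the maximization operator of (1.16),
# x <- max_a { r_i(a) + alpha * sum_j p_ij(a) * x_j }, until (near) convergence.

def value_iteration(S, A, r, p, alpha, eps=1e-10):
    x = {i: 0.0 for i in S}
    while True:
        y = {i: max(r[(i, a)] + alpha * sum(q * x[j]
                    for j, q in p[(i, a)].items()) for a in A[i])
             for i in S}
        # the operator is a contraction for alpha < 1, so this terminates
        if max(abs(y[i] - x[i]) for i in S) < eps:
            return y
        x = y
```

For a one-state MDP with reward 1 and α = 0.5, the fixed point of x = 1 + 0.5x is 2, which the iteration approaches geometrically.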
Total expected reward over an infinite horizon
The total expected reward, given initial state i and policy R, is denoted by v_i(R) and defined by
v_i(R) = Σ_{t=1}^∞ E_{i,R}{r_{X_t}(Y_t)} = Σ_{t=1}^∞ Σ_{j,a} P_{i,R}{X_t = j, Y_t = a} r_j(a). (1.18)
We consider this criterion under the following assumptions.
Assumption 1.2
(1) The model is substochastic, i.e. Σ_j p_{ij}(a) ≤ 1 for all (i, a) ∈ S × A.
(2) For any initial state i and any policy R, the total expected reward v_i(R) is well-defined (possibly ±∞).
The value vector and the concept of an optimal policy are defined in the usual way. Under the additional assumption that every policy R is transient, i.e. Σ_{t=1}^∞ P_{i,R}{X_t = j, Y_t = a} < ∞ for all i, j and a, it can be shown (cf. Kallenberg [08], chapter 3) that, with α = 1, most properties of the discounted MDP model remain valid for the total reward criterion.
Average expected reward over an infinite horizon
In the criterion of average reward, the limiting behaviour of (1/T) Σ_{t=1}^T r_{X_t}(Y_t) is considered as T → ∞. Since lim_{T→∞} (1/T) Σ_{t=1}^T r_{X_t}(Y_t) may not exist and interchanging limit and expectation is not allowed in general, there are four different evaluation measures which can be considered:
1. Lower limit of the average expected reward:
φ_i(R) = liminf_{T→∞} (1/T) Σ_{t=1}^T E_{i,R}{r_{X_t}(Y_t)}, i ∈ S, with value vector φ = sup_R φ(R).
2. Upper limit of the average expected reward:
φ̄_i(R) = limsup_{T→∞} (1/T) Σ_{t=1}^T E_{i,R}{r_{X_t}(Y_t)}, i ∈ S, with value vector φ̄ = sup_R φ̄(R).
3. Expectation of the lower limit of the average reward:
ψ_i(R) = E_{i,R}{liminf_{T→∞} (1/T) Σ_{t=1}^T r_{X_t}(Y_t)}, i ∈ S, with value vector ψ = sup_R ψ(R).
4. Expectation of the upper limit of the average reward:
ψ̄_i(R) = E_{i,R}{limsup_{T→∞} (1/T) Σ_{t=1}^T r_{X_t}(Y_t)}, i ∈ S, with value vector ψ̄ = sup_R ψ̄(R).
Lemma 1.2
ψ(R) ≤ φ(R) ≤ φ̄(R) ≤ ψ̄(R) for every policy R.
Proof
The second inequality is obvious. The first and the last inequality follow from Fatou's lemma (e.g.
Bauer [8], p. 26):
ψ_i(R) = E_{i,R}{liminf_{T→∞} (1/T) Σ_{t=1}^T r_{X_t}(Y_t)} ≤ liminf_{T→∞} (1/T) Σ_{t=1}^T E_{i,R}{r_{X_t}(Y_t)} = φ_i(R)
and
φ̄_i(R) = limsup_{T→∞} (1/T) Σ_{t=1}^T E_{i,R}{r_{X_t}(Y_t)} ≤ E_{i,R}{limsup_{T→∞} (1/T) Σ_{t=1}^T r_{X_t}(Y_t)} = ψ̄_i(R).
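For a concrete feeling for these measures, the finite-horizon average expected reward (1/T) Σ_{t=1}^T E{r_{X_t}(Y_t)} of a deterministic stationary policy can be computed exactly by propagating the state distribution. A small illustrative sketch (our own, not from the text):

```python
# Average expected reward over T periods for a deterministic stationary policy:
# pi[i] is the action chosen in state i; beta is the initial distribution.

def average_expected_reward(S, r, p, pi, beta, T):
    dist = dict(beta)            # distribution of X_t, starting with t = 1
    total = 0.0
    for _ in range(T):
        total += sum(dist[i] * r[(i, pi[i])] for i in S)
        nxt = {j: 0.0 for j in S}
        for i in S:              # one-step transition of the distribution
            for j, q in p[(i, pi[i])].items():
                nxt[j] += dist[i] * q
        dist = nxt
    return total / T
```

For a two-state chain that alternates deterministically between a reward-1 state and a reward-0 state, the average tends to 1/2, and the liminf and limsup measures coincide.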
Example 1.2
Consider the following MDP: S = {1, 2, 3}; A(1) = {1}, A(2) = A(3) = {1, 2};
p_{11}(1) = 0, p_{12}(1) = 0.5, p_{13}(1) = 0.5; r_1(1) = 0.
p_{21}(1) = 0, p_{22}(1) = 1, p_{23}(1) = 0; r_2(1) = 1.
p_{21}(2) = 0, p_{22}(2) = 0, p_{23}(2) = 1; r_2(2) = 1.
p_{31}(1) = 0, p_{32}(1) = 0, p_{33}(1) = 1; r_3(1) = 0.
p_{31}(2) = 0, p_{32}(2) = 1, p_{33}(2) = 0; r_3(2) = 0.
We use directed graphs to illustrate examples. The nodes of the graph represent the states. If the transition probability p_{ij}(a) is positive, there is an arc (i, j) from node i to node j; for a = 1 we use a simple arc, for a = 2 a double arc, etc.; next to the arc the probability p_{ij}(a) is given. The graph of this MDP model is pictured. If we start in state 1, we never return to that state, but we will remain in state 2 or state 3 forever. Because of the reward structure, (1/T) Σ_{t=1}^T r_{X_t}(Y_t) is the average number of visits to state 2 in the periods 1, 2, ..., T. Let the policy R = (π^1, π^2, ..., π^t, ...) be defined by π^1_{i1} = 1 for i = 1, 2, 3 and, for t ≥ 2,
π^t_{h_t a_t} = k_t/(k_t + 1) if a_t = 1; 1/(k_t + 1) if a_t = 2,
where, for h_t = (i_1, a_1, i_2, a_2, ..., i_{t−1}, a_{t−1}, i_t), k_t = max{k | i_t = i_{t−1} = ··· = i_{t−k+1}, i_{t−k} ≠ i_t}, so k_t is the maximum number of periods that we have consecutively been in state i_t at the time points t, t−1, .... This implies that each time we stay in the same state (2 or 3), we have a higher probability of staying there for one more period. The probability of staying in state 2 (or 3) forever, given that we enter state 2 (or 3), is
P_R{X_t = 2 for t = t_0+1, t_0+2, ... | X_{t_0} = 2 and X_{t_0−1} ≠ 2} = lim_{t→∞} Π_{k=1}^t k/(k+1) = lim_{t→∞} 1/(t+1) = 0.
Hence, with probability 1, a switch from state 2 to state 3 will occur at some time point; similarly, with probability 1, there is a switch from state 3 to state 2 at some time point. We can even compute the expected number of periods before such a switch occurs:
E_R{number of consecutive stays in state 2} = Σ_{k=1}^∞ k · P_R{X_t = 2 for t = t_0+1, t_0+2, ..., t_0+k−1; X_{t_0+k} ≠ 2 | X_{t_0} = 2 and X_{t_0−1} ≠ 2} = Σ_{k=1}^∞ k · (1/k) · 1/(k+1) = Σ_{k=1}^∞ 1/(k+1) = ∞.
So, as long as we stay in state 2, we obtain a reward of 1 in each period. The expected number of stays in state 2 is infinite. Therefore, with probability 1, there is an infinite number of time
points at which the average reward is arbitrarily close to 1. Similarly for state 3: with probability 1, there is an infinite number of time points at which the average reward is arbitrarily close to 0. This implies for policy R that
limsup_{T→∞} (1/T) Σ_{t=1}^T r_{X_t}(Y_t) = 1 with probability 1
and
liminf_{T→∞} (1/T) Σ_{t=1}^T r_{X_t}(Y_t) = 0 with probability 1.
From this we obtain ψ_1(R) = E_{1,R}{liminf_{T→∞} (1/T) Σ_{t=1}^T r_{X_t}(Y_t)} = 0 and ψ̄_1(R) = E_{1,R}{limsup_{T→∞} (1/T) Σ_{t=1}^T r_{X_t}(Y_t)} = 1.
If the process starts in state 1, then at any time point t ≥ 2, by symmetry, the probability to be in state 2 and earn 1 equals the probability to be in state 3 and earn 0. Hence, E_{1,R}{r_{X_t}(Y_t)} = 1/2 for all t ≥ 2, so
φ_1(R) = liminf_{T→∞} (1/T) Σ_{t=1}^T E_{1,R}{r_{X_t}(Y_t)} = φ̄_1(R) = limsup_{T→∞} (1/T) Σ_{t=1}^T E_{1,R}{r_{X_t}(Y_t)} = 1/2.
So this example shows that ψ_1(R) = 0 < φ_1(R) = 1/2 = φ̄_1(R) < ψ̄_1(R) = 1.
We also give an example in which φ_i(R) < φ̄_i(R) for some policy R and some state i. Consider the following very simple MDP: S = {1}; A(1) = {1, 2}; p_{11}(1) = p_{11}(2) = 1; r_1(1) = 1, r_1(2) = −1. Take the policy R that chooses action 1 at t = 1; action 2 at t = 2, 3; action 1 at t = 4, 5, 6, 7; action 2 at t = 8, 9, ..., 15. In general: action 1 at t = 2^{2k}, 2^{2k}+1, ..., 2^{2k+1}−1 and action 2 at t = 2^{2k+1}, 2^{2k+1}+1, ..., 2^{2k+2}−1, for k = 0, 1, .... This gives a deterministic stream of rewards: +1; −1, −1; +1, +1, +1, +1; −1, −1, −1, −1, −1, −1, −1, −1; ... with total rewards +1; 0, −1; 0, +1, +2, +3; +2, +1, 0, −1, −2, −3, −4, −5; −4, −3, −2, −1, 0, +1, +2, +3, +4, +5, +6, +7, +8, +9, +10, +11. To compute the limsup we take the time points 2^{2k−1} − 1 for k = 1, 2, ..., i.e. the time points 1, 7, 31, 127, ...; for the liminf we consider the time points 2^{2k} − 1 for k = 1, 2, ..., i.e. the time points 3, 15, 63, 255, .... Let T_k be the time points at which we change the chosen action, i.e. T_k = 2^k − 1 for k = 1, 2, ..., and let A_k denote the total reward at time point T_k: A_1 = +1, A_2 = −1, A_3 = +3, A_4 = −5, A_5 = +11.
It can easily be shown (this is left to the reader) that |A_k| + |A_{k+1}| = 2^k and |A_{k+1}| = 2|A_k| + (-1)^k. This implies that |A_k| = (1/3){2^k - (-1)^k}. Since A_k is positive iff k is odd, we obtain A_k = (1/3){(-1)^{k+1} 2^k + 1}, k = 1, 2, .... Hence,
φ_1(R) = liminf_{T→∞} (1/T) Σ_{t=1}^T E_{1,R}{r_{X_t}(Y_t)} = lim_{k→∞} A_{2k} / (2^{2k} - 1) = lim_{k→∞} (1/3){-2^{2k} + 1} / (2^{2k} - 1) = -1/3

and

φ̄_1(R) = limsup_{T→∞} (1/T) Σ_{t=1}^T E_{1,R}{r_{X_t}(Y_t)} = lim_{k→∞} A_{2k+1} / (2^{2k+1} - 1) = lim_{k→∞} (1/3){2^{2k+1} + 1} / (2^{2k+1} - 1) = +1/3.
For these four criteria the value vector and the concept of an optimal policy can be defined in the usual way. Bierth [2] has shown that ψ(π^∞) = φ(π^∞) = φ̄(π^∞) = ψ̄(π^∞) for every stationary policy π^∞, and that there exists a deterministic policy which is optimal for all these four criteria. Hence, the four criteria are equivalent in the sense that an optimal deterministic policy for one criterion is also optimal for the other criteria.

More sensitive optimality criteria over an infinite horizon

The average reward criterion has the disadvantage that it does not consider rewards earned in a finite number of periods. For example, the streams of rewards 0, 0, 0, 0, 0, ... and 100, 100, 0, 0, 0, ... have the same average value, although usually the second stream will be preferred. Hence, there is a need for criteria that select policies which are average optimal but also make the right early decisions. There are several ways to create more sensitive criteria. One way is to consider discounting for discount factors that tend to 1. Another way is to use more subtle kinds of averaging. We present some of these criteria.

1. Bias optimality
A policy R^* is called bias optimal if lim_{α↑1} {v^α(R^*) - v^α} = 0.

2. Blackwell optimality
A policy R^* is Blackwell optimal if v^α(R^*) = v^α for all α ∈ [α_0, 1) and some 0 ≤ α_0 < 1. From this definition it is clear that Blackwell optimality implies bias optimality. The next example shows policies f_1^∞, f_2^∞ and f_3^∞ such that f_1^∞ is average optimal but not bias optimal, f_2^∞ is bias optimal but not Blackwell optimal, and f_3^∞ is Blackwell optimal. Therefore, Blackwell optimality is more selective than bias optimality, which in its turn is more selective than average optimality.

Example 1.3
Consider the following MDP: S = {1, 2}; A(1) = {1, 2, 3}, A(2) = {1}.
p_11(1) = 1, p_12(1) = 0; p_11(2) = p_12(2) = 0.5; p_11(3) = 0, p_12(3) = 1; p_21(1) = 0, p_22(1) = 1; r_1(1) = 0, r_1(2) = 1, r_1(3) = 2, r_2(1) = 0.

If the system is in state 2, it stays in state 2 forever and no rewards are earned. In state 1 we have to choose between the actions 1, 2 and 3, which gives the policies f_1^∞, f_2^∞ and f_3^∞, respectively. All policies have the same average reward (0 for both starting states), and the discounted rewards for these policies differ only in state 1 (in state 2 the discounted reward is 0 for every discount factor α).
It is easy to see that for all α we have v_1^α(f_1^∞) = 0 and v_1^α(f_3^∞) = 2. For the second policy, we have an immediate reward 1, and from state 1 we stay in state 1 with probability 0.5 and go to state 2 with probability 0.5. Hence, we obtain v_1^α(f_2^∞) = 1 + 0.5α v_1^α(f_2^∞) + 0.5α v_2^α(f_2^∞), so v_1^α(f_2^∞) = 2/(2 - α).

[Figure: the values v_1^α(f_1^∞) = 0, v_1^α(f_2^∞) = 2/(2 - α) and v_1^α(f_3^∞) = 2 as functions of the discount factor α ∈ [0, 1].]

From this picture it is obvious that v_1^α = 2 (v^α = sup_{f^∞ ∈ C(D)} v^α(f^∞); this is shown in the next chapter) and that the policy f_3^∞ is the only Blackwell optimal policy. Furthermore, we have lim_{α↑1} {v_1^α(f_1^∞) - v_1^α} = -2, lim_{α↑1} {v_1^α(f_2^∞) - v_1^α} = lim_{α↑1} {2/(2 - α) - 2} = 0, and lim_{α↑1} {v_1^α(f_3^∞) - v_1^α} = lim_{α↑1} {2 - 2} = 0. Hence, both the policies f_2^∞ and f_3^∞ are bias optimal.

3. n-discount optimality
For n = -1, 0, 1, ... the policy R^* is called n-discount optimal if lim_{α↑1} (1 - α)^{-n} {v^α(R^*) - v^α} = 0. Obviously, 0-discount optimality is the same as bias optimality. It can be shown that (-1)-discount optimality is equivalent to average optimality, and that Blackwell optimality is equivalent to n-discount optimality for all n ≥ N = |S|.

4. n-average optimality
Let R be any policy. For t ∈ N and n = -1, 0, 1, ..., we define the vector v^{n,t}(R) by

v^{n,t}(R) = v^t(R) for n = -1;  v^{n,t}(R) = Σ_{s=1}^t v^{n-1,s}(R) for n = 0, 1, ....

For n = -1, 0, 1, ... a policy R^* is said to be n-average optimal if for all policies R: liminf_{T→∞} (1/T) {v^{n,T}(R^*) - v^{n,T}(R)} ≥ 0. It can be shown that n-average optimality is equivalent to n-discount optimality.

5. Overtaking optimality
A policy R^* is overtaking optimal if liminf_{T→∞} {v^T(R^*) - v^T(R)} ≥ 0 for all policies R. In contrast with the other criteria, an overtaking optimal policy does not exist in general.

6. Average overtaking optimality
A policy R^* is average overtaking optimal if liminf_{T→∞} (1/T) Σ_{t=1}^T {v^t(R^*) - v^t(R)} ≥ 0 for all policies R.
It can be shown that average overtaking optimality is equivalent to bias optimality and therefore also equivalent to 0-discount optimality.
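As a numerical companion to Example 1.3 above, the discounted values in state 1 and the bias limits can be evaluated directly. This is a sketch for verification only; the closed form 2/(2 - α) for the second policy follows from the fixed-point equation v = 1 + 0.5·α·v, since state 2 is absorbing with reward 0.

```python
# Discounted value in state 1 of Example 1.3 for the three policies
# (state 2 is absorbing and earns nothing, so it never contributes).

def v1(policy, alpha):
    if policy == 1:                          # action 1: stay in state 1, reward 0
        return 0.0
    if policy == 2:                          # action 2: v = 1 + 0.5*alpha*v
        return 1.0 / (1.0 - 0.5 * alpha)     # equals 2 / (2 - alpha)
    return 2.0                               # action 3: reward 2, then absorbed

alpha = 1 - 1e-9                             # bias limits: v1(f) - 2 as alpha -> 1
assert v1(1, alpha) - 2 == -2.0              # f_1: gap -2, not bias optimal
assert abs(v1(2, alpha) - 2) < 1e-6          # f_2: bias optimal
assert v1(3, alpha) - 2 == 0.0               # f_3: bias optimal (and Blackwell)

# f_2 is not Blackwell optimal: its value stays strictly below 2 for alpha < 1
assert all(v1(2, a) < 2 for a in (0.5, 0.9, 0.999))
print("checks passed")
```

This mirrors the picture in the example: f_3^∞ attains the value 2 for every discount factor, while f_2^∞ only catches up in the limit α → 1.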
1.3 Examples

In Example 1.1 we saw an inventory model with backlogging as an example of an MDP. In this section we introduce other examples of MDPs: gambling, gaming, optimal stopping, replacement, maintenance and repair, production, optimal control of queues, stochastic scheduling and the so-called multi-armed bandit problem. For some models the optimal policy has a special structure. In those cases we will indicate the optimal structure.

1.3.1 Red-black gambling

In red-black gambling a gambler with a fortune of i euro may bet any amount a ∈ {1, 2, ..., i}. He wins his stake with probability p and he loses it with probability 1 - p. The gambler's goal is to reach a certain fortune N. The gambler continues until either he has reached his goal or he has lost all his money. The problem is to determine a policy that maximizes the probability to reach this goal.

This problem can be modelled as an MDP with the total reward criterion. The fortune of the gambler is the state. Since the gambling problem is over when the gambler has reached his goal or has lost all his money, there are no transitions when the game is in either state N or state 0. Maximizing the probability to reach the amount N is equivalent to assigning a reward 1 to state N and reward 0 to the other states. The MDP model for the gambling problem is as follows.

S = {0, 1, ..., N}; A(0) = A(N) = {0}, A(i) = {1, 2, ..., min(i, N - i)}, 1 ≤ i ≤ N - 1.
For 1 ≤ i ≤ N - 1 and a ∈ A(i): p_ij(a) = p if j = i + a; p_ij(a) = 1 - p if j = i - a; p_ij(a) = 0 otherwise; and r_i(a) = 0.
p_0j(0) = p_Nj(0) = 0, j ∈ S; r_0(0) = 0, r_N(0) = 1.

Since under any policy state N or state 0 is reached with probability 1, it is easy to verify that Assumption 1.2 is satisfied. Notice also that

v_i(R) = Σ_{t=1}^∞ Σ_{j,a} P_{i,R}{X_t = j, Y_t = a} r_j(a) = Σ_{t=1}^∞ P_{i,R}{X_t = N},

i.e. the total reward is equal to the probability to reach state N. It can be shown that an optimal policy has the following structure:
if p > 1/2, then timid play, i.e. always bet the amount 1, is optimal;
if p = 1/2, then any policy is optimal;
if p < 1/2, then bold play, i.e. betting min(i, N - i) in state i, is optimal.
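The structure result can be probed numerically with successive approximation of the reach probability under a fixed betting policy. A sketch (not from the notes; the function and policy names are mine):

```python
# Successive approximation of v_i = P(reach fortune N before 0) under a
# fixed betting policy for red-black gambling (illustrative sketch).

def reach_prob(bet, N, p, sweeps=20000):
    """bet: map from fortune i to a stake in {1, ..., min(i, N - i)}."""
    v = [0.0] * (N + 1)
    v[N] = 1.0                       # reward 1 in state N, 0 elsewhere
    for _ in range(sweeps):
        for i in range(1, N):
            a = bet(i)
            v[i] = p * v[i + a] + (1 - p) * v[i - a]
    return v

N = 10

def timid(i):                        # always bet 1
    return 1

def bold(i):                         # bet min(i, N - i)
    return min(i, N - i)

# p < 1/2: bold play should be at least as good as timid play in every state
v_t = reach_prob(timid, N, p=0.4)
v_b = reach_prob(bold, N, p=0.4)
assert all(v_b[i] >= v_t[i] - 1e-9 for i in range(N + 1))
print(round(v_t[5], 4), round(v_b[5], 4))   # prints 0.1164 0.4

# p = 1/2: any policy gives reach probability i/N (checked here for timid play)
v_half = reach_prob(timid, N, p=0.5)
assert all(abs(v_half[i] - i / N) < 1e-6 for i in range(N + 1))
```

With p = 0.4 and a fortune of 5 out of N = 10, timid play reaches the goal with probability about 0.12, while bold play achieves 0.4, in line with the stated structure.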
1.3.2 Gaming: How to serve in tennis

The scoring in tennis is conventionally in steps from 0 to 15 to 30 to 40 to game. We simply use the numbers 0 through 4 for these scores. If the score reaches deuce, i.e. 40-40, the game is won by the player who is the first to score two points more than his opponent. Therefore, deuce is equivalent to 30-30, and similarly advantage server (receiver) is equivalent to 40-30 (30-40). Hence, the scores can be represented by (i, j), 0 ≤ i, j ≤ 3, excluding the pair (3,3), which is equivalent to (2,2), where i denotes the score of the server and j of the receiver. Furthermore, we use the states (4) and (5) for the case that the server or the receiver, respectively, wins the game. When the score is (i, j), the server may serve a first service (s = 1), or a second service (s = 2) in case the first service is a fault. This leads to the following 32 states:

(i, j, s), 0 ≤ i, j ≤ 3, (i, j) ≠ (3, 3), s = 1, 2: the states in which the game is going on;
(4): the target state for the server;
(5): the target state for the receiver.

For the sake of simplicity, suppose that the players can choose between two types of services: a fast service (a = 1) and a slow service (a = 2). The fast service is more likely to be a fault, but also more difficult to return; the slow service is more accurate and easier to return correctly. Let p_1 (p_2) be the probability that the fast (slow) service is good (lands in the given bounds of the court) and let q_1 (q_2) be the probability that the server wins the point when the fast (slow) service is good. We make the following obvious assumptions: p_1 ≤ p_2 and q_1 ≥ q_2.

Suppose the server chooses action a for his first service. Then,
the event that the server serves right and wins the point has probability p_a q_a;
the event that the server serves right and loses the point has probability p_a(1 - q_a);
the event that the server serves a fault and continues with his second service has probability 1 - p_a.
For the second service, the server either wins or loses the point, with probabilities p_a q_a and 1 - p_a q_a, respectively. In the states where there is no game point, i.e. i < 3 and j < 3, we have the following transition probabilities:

p_{(i,j,1),(i+1,j,1)}(a) = p_a q_a;  p_{(i,j,2),(i+1,j,1)}(a) = p_a q_a;
p_{(i,j,1),(i,j+1,1)}(a) = p_a(1 - q_a);  p_{(i,j,2),(i,j+1,1)}(a) = 1 - p_a q_a;
p_{(i,j,1),(i,j,2)}(a) = 1 - p_a.

If i = 3, we obtain (where (3, 3, 1) has to be replaced by (2, 2, 1))

p_{(3,j,1),(4)}(a) = p_a q_a;  p_{(3,j,2),(4)}(a) = p_a q_a;
p_{(3,j,1),(3,j+1,1)}(a) = p_a(1 - q_a);  p_{(3,j,2),(3,j+1,1)}(a) = 1 - p_a q_a;
p_{(3,j,1),(3,j,2)}(a) = 1 - p_a.

Similarly, we obtain for j = 3 (also in this case replace (3, 3, 1) by (2, 2, 1))

p_{(i,3,1),(i+1,3,1)}(a) = p_a q_a;  p_{(i,3,2),(i+1,3,1)}(a) = p_a q_a;
p_{(i,3,1),(5)}(a) = p_a(1 - q_a);  p_{(i,3,2),(5)}(a) = 1 - p_a q_a;
p_{(i,3,1),(i,3,2)}(a) = 1 - p_a.
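As a sanity check on these transition probabilities, the chance that the server wins a single point when he plays action a on the first service and action b on the second is p_a q_a + (1 - p_a) p_b q_b. The sketch below uses made-up values for p_1, q_1, p_2, q_2 (they are not from the notes):

```python
# Probability that the server wins one point, given the first-service
# action a and the second-service action b (illustrative numbers only).
p = {1: 0.6, 2: 0.9}   # P(service is good): fast (1) vs slow (2)
q = {1: 0.8, 2: 0.5}   # P(server wins the rally | service is good)

def win_point(a, b):
    # good first service and win, or a fault followed by a winning second service
    return p[a] * q[a] + (1 - p[a]) * p[b] * q[b]

best = max(((a, b) for a in (1, 2) for b in (1, 2)), key=lambda ab: win_point(*ab))
print(best, round(win_point(*best), 4))   # prints (1, 1) 0.672
```

With these numbers p_1 q_1 = 0.48 > p_2 q_2 = 0.45, and indeed the fast service is best for both the first and the second service.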
When the game is over, i.e. in the states (4) and (5), there are no transitions. The question is: what kind of service to choose, given the score, in order to win the game? To maximize the probability that the server wins the game, the following reward structure is suitable: all rewards are 0, except in the target state (4) of the server, where the reward is 1. As utility criterion the total expected reward is used.

Let x = p_1 q_1 / (p_2 q_2). Then, x is the ratio of serving right and winning the point for the two possible actions a = 1 and a = 2. It can be shown that the optimal policy has the following structure in each state:

if x ≥ 1: always use the fast service;
if 1 - (p_2 - p_1) ≤ x < 1: use the fast service as first service and the slow service as second;
if x < 1 - (p_2 - p_1): always use the slow service.

A similar problem is: is it better to use a fast or a slow service to maximize the probability of winning the next point? It turns out that this problem has the same optimal policy. Hence, the optimal policy for winning the game is a myopic policy.

1.3.3 Optimal stopping

In an optimal stopping problem there are two actions in every state. The first action is stopping and the second corresponds to continuing. If the stopping action 1 is chosen in state i, then a terminal reward r_i is earned and the process terminates. This termination is modelled by taking all transition probabilities equal to zero. If action 2 is chosen in state i, then a cost c_i is incurred and the probability of being in state j at the next time point is p_ij. Hence, the characteristics of the MDP model are:

S = {1, 2, ..., N}; A(i) = {1, 2}, i ∈ S; r_i(1) = r_i, r_i(2) = -c_i, i ∈ S; p_ij(1) = 0, p_ij(2) = p_ij, i, j ∈ S.

We are interested in finding an optimal stopping policy. A stopping policy R is a policy such that for any starting state i the process terminates in finite time with probability 1. As optimality criterion the total expected reward is considered.
Notice that for a stopping policy the total expected reward v(R) is well-defined. Let v be the value vector of this model, i.e. v_i = sup{v_i(R) | R is a stopping policy}, i ∈ S. A stopping policy R^* is an optimal policy if v(R^*) = v. Let

S_0 = {i ∈ S | r_i ≥ -c_i + Σ_j p_ij r_j},

i.e. S_0 is the set of states in which immediate stopping is at least as good as continuing for one period and then choosing the stopping action. A one-step look ahead policy is a policy which chooses the stopping action in the states of S_0 and continues outside S_0.
More informationCS 598 Statistical Reinforcement Learning. Nan Jiang
CS 598 Statistical Reinforcement Learning Nan Jiang Overview What s this course about? A grad-level seminar course on theory of RL 3 What s this course about? A grad-level seminar course on theory of RL
More informationAdvanced Linear Programming: The Exercises
Advanced Linear Programming: The Exercises The answers are sometimes not written out completely. 1.5 a) min c T x + d T y Ax + By b y = x (1) First reformulation, using z smallest number satisfying x z
More informationNote that in the example in Lecture 1, the state Home is recurrent (and even absorbing), but all other states are transient. f ii (n) f ii = n=1 < +
Random Walks: WEEK 2 Recurrence and transience Consider the event {X n = i for some n > 0} by which we mean {X = i}or{x 2 = i,x i}or{x 3 = i,x 2 i,x i},. Definition.. A state i S is recurrent if P(X n
More informationSimplex Algorithm for Countable-state Discounted Markov Decision Processes
Simplex Algorithm for Countable-state Discounted Markov Decision Processes Ilbin Lee Marina A. Epelman H. Edwin Romeijn Robert L. Smith November 16, 2014 Abstract We consider discounted Markov Decision
More informationThe Complexity of Ergodic Mean-payoff Games,
The Complexity of Ergodic Mean-payoff Games, Krishnendu Chatterjee Rasmus Ibsen-Jensen Abstract We study two-player (zero-sum) concurrent mean-payoff games played on a finite-state graph. We focus on the
More information1.3 Convergence of Regular Markov Chains
Markov Chains and Random Walks on Graphs 3 Applying the same argument to A T, which has the same λ 0 as A, yields the row sum bounds Corollary 0 Let P 0 be the transition matrix of a regular Markov chain
More information6.254 : Game Theory with Engineering Applications Lecture 13: Extensive Form Games
6.254 : Game Theory with Engineering Lecture 13: Extensive Form Games Asu Ozdaglar MIT March 18, 2010 1 Introduction Outline Extensive Form Games with Perfect Information One-stage Deviation Principle
More informationCOMPUTING OPTIMAL SEQUENTIAL ALLOCATION RULES IN CLINICAL TRIALS* Michael N. Katehakis. State University of New York at Stony Brook. and.
COMPUTING OPTIMAL SEQUENTIAL ALLOCATION RULES IN CLINICAL TRIALS* Michael N. Katehakis State University of New York at Stony Brook and Cyrus Derman Columbia University The problem of assigning one of several
More informationTHE BIG MATCH WITH A CLOCK AND A BIT OF MEMORY
THE BIG MATCH WITH A CLOCK AND A BIT OF MEMORY KRISTOFFER ARNSFELT HANSEN, RASMUS IBSEN-JENSEN, AND ABRAHAM NEYMAN Abstract. The Big Match is a multi-stage two-player game. In each stage Player hides one
More informationDistributed Optimization. Song Chong EE, KAIST
Distributed Optimization Song Chong EE, KAIST songchong@kaist.edu Dynamic Programming for Path Planning A path-planning problem consists of a weighted directed graph with a set of n nodes N, directed links
More informationScheduling Markovian PERT networks to maximize the net present value: new results
Scheduling Markovian PERT networks to maximize the net present value: new results Hermans B, Leus R. KBI_1709 Scheduling Markovian PERT networks to maximize the net present value: New results Ben Hermans,a
More informationStability of the two queue system
Stability of the two queue system Iain M. MacPhee and Lisa J. Müller University of Durham Department of Mathematical Science Durham, DH1 3LE, UK (e-mail: i.m.macphee@durham.ac.uk, l.j.muller@durham.ac.uk)
More informationINTRODUCTION TO MARKOV CHAINS AND MARKOV CHAIN MIXING
INTRODUCTION TO MARKOV CHAINS AND MARKOV CHAIN MIXING ERIC SHANG Abstract. This paper provides an introduction to Markov chains and their basic classifications and interesting properties. After establishing
More informationP i [B k ] = lim. n=1 p(n) ii <. n=1. V i :=
2.7. Recurrence and transience Consider a Markov chain {X n : n N 0 } on state space E with transition matrix P. Definition 2.7.1. A state i E is called recurrent if P i [X n = i for infinitely many n]
More informationLecture 20 : Markov Chains
CSCI 3560 Probability and Computing Instructor: Bogdan Chlebus Lecture 0 : Markov Chains We consider stochastic processes. A process represents a system that evolves through incremental changes called
More informationA Reinforcement Learning (Nash-R) Algorithm for Average Reward Irreducible Stochastic Games
Learning in Average Reward Stochastic Games A Reinforcement Learning (Nash-R) Algorithm for Average Reward Irreducible Stochastic Games Jun Li Jun.Li@warnerbros.com Kandethody Ramachandran ram@cas.usf.edu
More informationReinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina
Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it
More information1 Random Walks and Electrical Networks
CME 305: Discrete Mathematics and Algorithms Random Walks and Electrical Networks Random walks are widely used tools in algorithm design and probabilistic analysis and they have numerous applications.
More informationSome AI Planning Problems
Course Logistics CS533: Intelligent Agents and Decision Making M, W, F: 1:00 1:50 Instructor: Alan Fern (KEC2071) Office hours: by appointment (see me after class or send email) Emailing me: include CS533
More informationInventory Control with Convex Costs
Inventory Control with Convex Costs Jian Yang and Gang Yu Department of Industrial and Manufacturing Engineering New Jersey Institute of Technology Newark, NJ 07102 yang@adm.njit.edu Department of Management
More informationSelecting Efficient Correlated Equilibria Through Distributed Learning. Jason R. Marden
1 Selecting Efficient Correlated Equilibria Through Distributed Learning Jason R. Marden Abstract A learning rule is completely uncoupled if each player s behavior is conditioned only on his own realized
More information