MARKOV DECISION PROCESSES LODEWIJK KALLENBERG UNIVERSITY OF LEIDEN
Preface Branching out from operations research roots of the 1950s, Markov decision processes (MDPs) have gained recognition in such diverse fields as ecology, economics, and communication engineering. These applications have been accompanied by many theoretical advances. Markov decision processes, also referred to as stochastic dynamic programming or stochastic control problems, are models for sequential decision making when outcomes are uncertain. The Markov decision process model consists of decision epochs, states, actions, rewards, and transition probabilities. Choosing an action in a state generates a reward and determines the state at the next decision epoch through a transition probability function. Policies or strategies are prescriptions of which action to choose under any eventuality at every future decision epoch. Decision makers seek policies which are optimal in some sense. These lecture notes aim to present a unified treatment of the theoretical and algorithmic aspects of Markov decision process models. They can serve as a text for an advanced undergraduate or graduate level course in operations research, econometrics or control engineering. As a prerequisite, the reader should have some background in linear algebra, real analysis, probability, and linear programming. Throughout the text there are many examples; at the end of each chapter there is a section with bibliographic notes and a section with exercises (the notes include 8 exercises). A solution manual is available on request (by e-mail to kallenberg@math.leidenuniv.nl). Chapter 1 introduces the Markov decision process model as a sequential decision model with actions, rewards, transitions and policies. We illustrate these concepts with nine different applications: red-black gambling, how-to-serve in tennis, optimal stopping, replacement problems, maintenance and repair, production control, optimal control of queues, stochastic scheduling, and the multi-armed bandit problem.
Chapter 2 deals with the finite horizon model and the principle of dynamic programming: backward induction. We also study under which conditions optimal policies are monotone, i.e. nondecreasing or nonincreasing in the ordering of the state space. In Chapter 3 the discounted reward over an infinite horizon is studied. This results in the optimality equation and in solution methods for this equation: policy iteration, linear programming, value iteration and modified value iteration.
Chapter 4 discusses the total reward over an infinite horizon under the assumption that the transition matrices are substochastic. We derive equivalent statements for the property that the model is a contracting dynamic programming model. For contracting dynamic programming, similar results and algorithms can be formulated as for the discounted reward model. Special sections are devoted to positive, negative and convergent models, and to optimal stopping problems. Chapter 5 discusses the criterion of average reward over an infinite horizon, in the most general case. Firstly, polynomial algorithms are developed to classify MDPs as irreducible or communicating. Distinguishing between unichain and multichain MDPs turns out to be NP-complete, so there is no hope of a polynomial algorithm. Then the stationary, the fundamental and the deviation matrices are introduced, and the relations between them and their properties are derived. Next, an extension of a theorem by Blackwell and the Laurent series expansion are presented. These results are fundamental for analysing the relation between discounted, average and more sensitive optimality criteria. With these results, as in the discounted case but via a more complicated analysis, the optimality equation is derived, and solution methods for this equation are presented (policy iteration, linear programming and value iteration). In Chapter 6 special cases of the average reward criterion (irreducible, unichain and communicating MDPs) are considered. In all these cases the optimality equation and the methods of policy iteration, linear programming, value iteration and modified value iteration can be simplified. Chapter 7 introduces more sensitive optimality criteria: bias optimality, n-discount optimality and Blackwell optimality. We present a unifying framework, based on the Laurent series expansion, to derive sensitive discount optimality equations.
Using a lexicographic ordering of the Laurent series, we derive the policy iteration method for n-discount optimality. For bias optimality, an optimal policy can be found with a three-step linear programming approach. When in addition the model is a unichain MDP, the linear programs for bias optimality can be simplified. In the irreducible case, one can derive a sequence of nested linear programs to compute n-discount optimal policies for any n. Also for Blackwell optimality, even in the most general case, linear programming can be applied. However, then the elements are not real numbers, but lie in a much more general ordered field, namely an ordered field of rational functions. In Chapter 8 the applications introduced in Chapter 1 are analysed in much more detail. In most cases theoretical as well as computational (algorithmic) results are presented. It turns out that in many cases polynomial algorithms exist, e.g. of order O(N^3), where N is the number of states. Chapter 9 deals with some other topics: additional constraints and multiple objectives (both for discounted and for average MDPs), and mean-variance tradeoffs.
In the last chapter (Chapter 10) stochastic games are treated, particularly the two-person zero-sum stochastic game. There, both players may choose actions from their own action sets, resulting in transitions and rewards determined by both players. Zero-sum means that the reward for player 1 has to be paid by player 2. Hence, there is a conflicting situation: player 1 wants to maximize the rewards, while player 2 tries to minimize them. We discuss the value of the game and the concept of optimal policies for discounted as well as for average rewards. We also derive mathematical programming formulations and iterative methods. In some special cases we can present finite methods. For these notes I have used a lot of material, collected over the years, from various sources. It is impossible for me to mention them all here. However, there is one exception. I wish to express my gratitude to Arie Hordijk, who introduced me to the topic of MDPs, as my teacher and former colleague. Already 20 years ago we started on a first draft of some chapters for a book on MDPs. Our work was interrupted and, unfortunately, never continued. Lodewijk Kallenberg, Leiden, April 2009.
Contents

1 Introduction
    The MDP model
    Policies and optimality criteria (policies; optimality criteria)
    Examples (red-black gambling; gaming: how to serve in tennis; optimal stopping; replacement problems; maintenance and repair; production control; optimal control of queues; stochastic scheduling; multi-armed bandit problem)
    Bibliographic notes
    Exercises
2 Finite Horizon
    Introduction
    Backward induction
    An equivalent stationary infinite horizon model
    Monotone optimal policies
    Bibliographic notes
    Exercises
3 Discounted rewards
    Introduction
    Monotone contraction mappings
    The optimality equation
    Policy iteration
    Linear programming
    Value iteration
    Modified policy iteration
    Bibliographic notes
    Exercises
4 Total reward
    Introduction
    Equivalent statements for contracting
    The contracting model
    Positive MDPs
    Negative MDPs
    Convergent MDPs
    Special models (red-black gambling; optimal stopping)
    Bibliographic notes
    Exercises
5 Average reward - general case
    Introduction
    Classification of MDPs (definitions; classification of Markov chains; classification of Markov decision chains)
    Stationary, fundamental and deviation matrix (the stationary matrix; the fundamental matrix and the deviation matrix)
    Extension of Blackwell's theorem
    The Laurent series expansion
    The optimality equation
    Policy iteration
    Linear programming
    Value iteration
    Bibliographic notes
    Exercises
6 Average reward - special cases
    The irreducible case (optimality equation; policy iteration; linear programming; value iteration; modified policy iteration)
    The unichain case (optimality equation; policy iteration; linear programming; value iteration; modified policy iteration)
    The communicating case (optimality equation; policy iteration; linear programming; value iteration; modified value iteration)
    Bibliographic notes
    Exercises
7 More sensitive optimality criteria
    Introduction
    Equivalence between n-discount and n-average optimality
    Stationary optimal policies and optimality equations
    Lexicographic ordering of Laurent series
    Policy iteration for n-discount optimality
    Linear programming and n-discount optimality (irreducible case): average optimality; bias optimality; n-discount optimality
    Blackwell optimality and linear programming
    Bias optimality and linear programming (the general case; the unichain case)
    Overtaking and average overtaking optimality
    Bibliographic notes
    Exercises
8 Special models
    Replacement problems (a general replacement model; a replacement model with increasing deterioration; skip to the right model with failure; a separable replacement problem)
    Maintenance and repair problems (a surveillance-maintenance-replacement model; optimal repair allocation in a series system)
    Production and inventory control (no backlogging; backlogging; inventory control and single-critical-number policies; inventory control and (s, S)-policies)
    Optimal control of queues (the single-server queue; parallel queues)
    Stochastic scheduling (maximizing finite-time returns on a single processor; optimality of the µc-rule; optimality of threshold policies; optimality of join-the-shortest-queue policies; optimality of LEPT and SEPT policies; maximizing finite-time returns on two processors; tandem queues)
    Multi-armed bandit problems (introduction; a single project with a terminal reward; multi-armed bandits; methods for the computation of the Gittins indices)
    Separable problems (introduction; examples (part 1); discounted rewards; average rewards - unichain case; average rewards - general case; examples (part 2))
    Exercises
    Bibliographic notes
9 Other topics
    Additional constraints (introduction; infinite horizon and discounted rewards; infinite horizon and average rewards)
    Multiple objectives (discounted rewards; average rewards)
    Mean-variance tradeoffs (formulations of the problem; a unifying framework; determination of an optimal solution; determination of an optimal policy)
    Bibliographic notes
    Exercises
10 Stochastic Games
    Introduction (the model; optimality criteria; matrix games)
    Discounted rewards (value and optimal policies; mathematical programming; iterative methods; finite methods)
    Average rewards (value and optimal policies; the Big Match; mathematical programming; perfect information and irreducible games; finite methods)
    Bibliographic notes
    Exercises
Chapter 1 Introduction

In this first chapter we introduce the model of a Markov decision process (MDP for short). We present several optimality criteria and give some examples of problems that can be modelled as an MDP.

1.1 The MDP model

An MDP is a model for sequential decision making under uncertainty, taking into account both the short-term outcomes of current decisions and opportunities for making decisions in the future. While the notion of an MDP may appear quite simple, it encompasses a wide range of applications and has generated a rich mathematical theory. In an MDP model one can distinguish the following seven characteristics.
1. The state space. At any time point at which a decision has to be made, the state of the system is observed by the decision maker. The set of possible states is called the state space and will be denoted by S. The state space may be finite, denumerable, compact or even more general. For a finite state space, the number of states, i.e. |S|, will be denoted by N.
2. The action sets. When the decision maker observes that the system is in state i, he (we will refer to the decision maker as "he") chooses an action from a certain action set that may depend on the observed state: the action set in state i is denoted by A(i). As with the state space, the action sets may be finite, denumerable, compact or more general.
3. The decision time points. The time intervals between the decision points may be constant or random. In the first case the model is said to be a Markov decision process; when the times between consecutive decision points are random, the model is called a semi-Markov decision process.
4. The immediate rewards (or costs). Given the state of the system and the chosen action, an immediate reward (or cost) is earned (there is no essential difference between rewards and costs, because maximizing rewards is equivalent to minimizing costs). These rewards depend on the decision time point, the observed state and the chosen action, but not on the history of the process. The immediate reward at decision time point t for action a in state i will be denoted by r_i^t(a); if the reward is independent of the time t, we write r_i(a) instead of r_i^t(a).
5. The transition probabilities. Given the state of the system and the chosen action, the state at the next decision time point is determined by a transition law. These transitions depend only on the decision time point t, the observed state i and the chosen action a, and not on the history of the process. This property is called the Markov property. If the state at time t is i and action a is chosen, we denote the probability that at the next time point the system is in state j by p_{ij}^t(a). If the transitions really depend on the decision time point, the problem is said to be nonstationary; if they are independent of the time points, the problem is called stationary, and the transition probabilities are denoted by p_{ij}(a).
6. The planning horizon. The process has a planning horizon, i.e. the time points at which the system has to be controlled. This horizon may be finite, infinite or of random length.
7. The optimality criterion. The objective is to determine a policy, i.e. a decision rule for each decision time point and each history (including the present state) of the process, that optimizes the performance of the system. The performance is measured by a utility function. This function assigns to each policy, given the starting state of the process, a value. In the next section we will present several optimality criteria.
Example 1.1
Inventory model with backlogging. An inventory has to be managed over a planning horizon of T weeks. The optimization problem is: which inventory strategy minimizes the total expected costs? At the beginning of each week the manager observes the inventory on hand and decides how many units to order. We assume that orders are delivered instantaneously, and that there is a finite inventory capacity of B units. We also assume that the demands D_t in week t, 1 ≤ t ≤ T, are independent random variables with nonnegative integer values, and that the numbers p_j(t) := P{D_t = j} are known for all j ∈ N_0 and t = 1, 2, ..., T. If the demand during a week exceeds the inventory on hand, the shortage is backlogged in the next week. Let the inventory at the start of week t be i (shortages are modelled as negative inventory), let the number of ordered units be a, and let j be the inventory at the end of week t.
Then the following costs are involved, where we use the notation δ(x) = 1 if x ≥ 1 and δ(x) = 0 if x ≤ 0:
ordering costs: K_t δ(a) + k_t a;
inventory costs: h_t δ(j) · j;
backlogging costs: q_t δ(−j) · (−j).
The data K_t, k_t, h_t, q_t and p_j(t), j ∈ N_0, are known for all t ∈ {1, 2, ..., T}. If an order is made in week t, there is a fixed cost K_t and a cost k_t for each ordered unit. If at the end of the week there is a positive inventory, then there are inventory costs of h_t per unit; when there is a shortage, there are backlogging costs of q_t per unit. This inventory problem can be modelled as a nonstationary MDP over a finite planning horizon, with a denumerable state space and finite action sets:
S = {..., −1, 0, 1, ..., B};
A(i) = {a ≥ 0 | 0 ≤ i + a ≤ B};
p_{ij}^t(a) = p_{i+a−j}(t) if j ≤ i + a, and p_{ij}^t(a) = 0 if j > i + a;
r_i^t(a) = −{K_t δ(a) + k_t a + Σ_{j=0}^{i+a} p_j(t) h_t (i + a − j) + Σ_{j=i+a+1}^∞ p_j(t) q_t (j − i − a)},
i.e. the reward is minus the expected costs in week t.

1.2 Policies and optimality criteria

1.2.1 Policies

A policy R is a sequence of decision rules: R = (π^1, π^2, ..., π^t, ...), where π^t is the decision rule at time point t, t = 1, 2, .... The decision rule π^t at time point t may depend on all available information on the system until time t, i.e. on the states at the time points 1, 2, ..., t and the actions at the time points 1, 2, ..., t−1. The formal definition of a policy is as follows. Consider the Cartesian product S × A = {(i, a) | i ∈ S, a ∈ A(i)} and let H_t denote the set of the possible histories of the system up to time point t, i.e.
H_t := {h_t = (i_1, a_1, ..., i_{t−1}, a_{t−1}, i_t) | (i_k, a_k) ∈ S × A, 1 ≤ k ≤ t−1; i_t ∈ S}. (1.1)
A decision rule π^t at time point t is a function on H_t which prescribes the action to be taken at time t as a transition probability from H_t into A, i.e.
π^t_{h_t a_t} ≥ 0 for every a_t ∈ A(i_t) and Σ_{a_t} π^t_{h_t a_t} = 1 for every h_t ∈ H_t. (1.2)
Let C denote the set of all policies. A policy is said to be memoryless if the decision rule π^t is independent of (i_1, a_1, ..., i_{t−1}, a_{t−1}) for every t ∈ N. For a memoryless policy the decision rule
at time t only depends on the state i_t; therefore the notation π^t_{i_t a_t} is used. We call C(M) the set of the memoryless policies. Memoryless policies are also called Markov policies. If a policy is memoryless and the decision rules are independent of the time point t, i.e. π^1 = π^2 = ···, then the policy is called stationary. Hence, a stationary policy is determined by a nonnegative function π on S × A such that Σ_a π_{ia} = 1 for every i ∈ S. The stationary policy R = (π, π, ...) is denoted by π^∞. The set of stationary policies is denoted by C(S). If the decision rule π of a stationary policy is nonrandomized, i.e. for every i ∈ S we have π_{i a_i} = 1 for exactly one action a_i (and consequently π_{ia} = 0 for every a ≠ a_i), then the policy is called deterministic. A deterministic policy can be described by a function f on S, where f(i) is the chosen action a_i, i ∈ S. A deterministic policy is denoted by f^∞ (and sometimes by f), and the set of deterministic policies by C(D).
A matrix P = (p_{ij}) is a transition matrix if p_{ij} ≥ 0 for all (i, j) and Σ_j p_{ij} = 1 for all i. For a Markov policy R = (π^1, π^2, ...) the transition matrix P(π^t) and the reward vector r(π^t) are defined by
{P(π^t)}_{ij} = Σ_a p_{ij}^t(a) π^t_{ia} for every (i, j) ∈ S × S and t ∈ N, (1.3)
{r(π^t)}_i = Σ_a r_i^t(a) π^t_{ia} for every i ∈ S and t ∈ N. (1.4)
Consider an initial distribution β, i.e. β_i is the probability that the system starts in state i, and a policy R. Then, by a theorem of Ionescu Tulcea (cf. Bertsekas and Shreve [4] p. 40), there exists a unique probability measure P_{β,R} on H_∞, where
H_∞ = {h = (i_1, a_1, i_2, a_2, ...) | (i_k, a_k) ∈ S × A, k = 1, 2, ...}. (1.5)
Let the random variables X_t and Y_t denote the state and action at time t, t = 1, 2, ..., and let P_{β,R}{X_t = j, Y_t = a} be the probability that at time t the state is j and the action is a, given that policy R is used and the initial distribution is β. If β_i = 1 for some i ∈ S, then we write P_{i,R} instead of P_{β,R}.
The expectation operator with respect to the probability measure P_{β,R} or P_{i,R} is denoted by E_{β,R} or E_{i,R}, respectively.
Lemma 1.1
For any Markov policy R = (π^1, π^2, ...), any initial distribution β and t ∈ N, we have
(1) P_{β,R}{X_t = j, Y_t = a} = Σ_i β_i {P(π^1) P(π^2) ··· P(π^{t−1})}_{ij} π^t_{ja}, (j, a) ∈ S × A;
(2) E_{β,R}{r^t_{X_t}(Y_t)} = Σ_i β_i {P(π^1) P(π^2) ··· P(π^{t−1}) r(π^t)}_i,
where P(π^1) P(π^2) ··· P(π^{t−1}) = I (the identity matrix) for t = 1.
Proof
By induction on t. For t = 1,
P_{β,R}{X_1 = j, Y_1 = a} = β_j π^1_{ja}, which is statement (1) for t = 1 (the matrix product being I),
and
E_{β,R}{r^1_{X_1}(Y_1)} = Σ_{i,a} β_i π^1_{ia} r^1_i(a) = Σ_i β_i {r(π^1)}_i,
which is statement (2) for t = 1. Assume that the results are shown for t; we show that they also hold for t+1:
P_{β,R}{X_{t+1} = j, Y_{t+1} = a} = Σ_{k,b} P_{β,R}{X_t = k, Y_t = b} p^t_{kj}(b) π^{t+1}_{ja}
= Σ_{k,b,i} β_i {P(π^1) ··· P(π^{t−1})}_{ik} π^t_{kb} p^t_{kj}(b) π^{t+1}_{ja}
= Σ_i β_i Σ_k {P(π^1) ··· P(π^{t−1})}_{ik} Σ_b π^t_{kb} p^t_{kj}(b) π^{t+1}_{ja}
= Σ_i β_i Σ_k {P(π^1) ··· P(π^{t−1})}_{ik} {P(π^t)}_{kj} π^{t+1}_{ja}
= Σ_i β_i {P(π^1) ··· P(π^t)}_{ij} π^{t+1}_{ja}.
Furthermore, one has
E_{β,R}{r^{t+1}_{X_{t+1}}(Y_{t+1})} = Σ_{j,a} P_{β,R}{X_{t+1} = j, Y_{t+1} = a} r^{t+1}_j(a)
= Σ_{j,a,i} β_i {P(π^1) ··· P(π^t)}_{ij} π^{t+1}_{ja} r^{t+1}_j(a)
= Σ_i β_i Σ_j {P(π^1) ··· P(π^t)}_{ij} Σ_a π^{t+1}_{ja} r^{t+1}_j(a)
= Σ_i β_i Σ_j {P(π^1) ··· P(π^t)}_{ij} {r(π^{t+1})}_j
= Σ_i β_i {P(π^1) ··· P(π^t) r(π^{t+1})}_i.
The next theorem shows that for any initial distribution β, any sequence of policies R_1, R_2, ... and any convex combination of the marginal distributions of P_{β,R_k}, k ∈ N, there exists a Markov policy R_* with the same marginal distributions.
Theorem 1.1
For any initial distribution β, any sequence of policies R_1, R_2, ... and any sequence of nonnegative real numbers p_1, p_2, ... satisfying Σ_k p_k = 1, there exists a Markov policy R_* such that
P_{β,R_*}{X_t = j, Y_t = a} = Σ_k p_k P_{β,R_k}{X_t = j, Y_t = a}, (j, a) ∈ S × A, t ∈ N. (1.6)
Proof
Define the Markov policy R_* = (π^1, π^2, ...) by
π^t_{ja} = Σ_k p_k P_{β,R_k}{X_t = j, Y_t = a} / Σ_k p_k P_{β,R_k}{X_t = j}, t ∈ N, (j, a) ∈ S × A. (1.7)
In case the denominator is zero, take for π^t_{ja}, a ∈ A(j), arbitrary nonnegative numbers such that Σ_a π^t_{ja} = 1, j ∈ S. Take any (j, a) ∈ S × A. We show the theorem by induction on t. For t = 1, we obtain P_{β,R_*}{X_1 = j} = β_j and Σ_k p_k P_{β,R_k}{X_1 = j} = β_j. If β_j = 0, then
P_{β,R_*}{X_1 = j, Y_1 = a} = Σ_k p_k P_{β,R_k}{X_1 = j, Y_1 = a} = 0.
If β_j ≠ 0, then from (1.7) it follows that
Σ_k p_k P_{β,R_k}{X_1 = j, Y_1 = a} = (Σ_k p_k P_{β,R_k}{X_1 = j}) π^1_{ja} = β_j π^1_{ja} = P_{β,R_*}{X_1 = j, Y_1 = a}.
Assume that (1.6) holds for t. We have to prove that (1.6) also holds for t+1:
P_{β,R_*}{X_{t+1} = j} = Σ_{l,b} P_{β,R_*}{X_t = l, Y_t = b} p^t_{lj}(b) = Σ_{l,b} Σ_k p_k P_{β,R_k}{X_t = l, Y_t = b} p^t_{lj}(b) = Σ_k p_k Σ_{l,b} P_{β,R_k}{X_t = l, Y_t = b} p^t_{lj}(b) = Σ_k p_k P_{β,R_k}{X_{t+1} = j}.
If P_{β,R_*}{X_{t+1} = j} = 0, then Σ_k p_k P_{β,R_k}{X_{t+1} = j} = 0, and consequently
P_{β,R_*}{X_{t+1} = j, Y_{t+1} = a} = Σ_k p_k P_{β,R_k}{X_{t+1} = j, Y_{t+1} = a} = 0.
If P_{β,R_*}{X_{t+1} = j} ≠ 0, then
P_{β,R_*}{X_{t+1} = j, Y_{t+1} = a} = P_{β,R_*}{X_{t+1} = j} π^{t+1}_{ja} = (Σ_k p_k P_{β,R_k}{X_{t+1} = j}) · Σ_k p_k P_{β,R_k}{X_{t+1} = j, Y_{t+1} = a} / Σ_k p_k P_{β,R_k}{X_{t+1} = j} = Σ_k p_k P_{β,R_k}{X_{t+1} = j, Y_{t+1} = a}.
Corollary 1.1
For any starting state i and any policy R, there exists a Markov policy R_* such that P_{i,R_*}{X_t = j, Y_t = a} = P_{i,R}{X_t = j, Y_t = a}, t ∈ N, (j, a) ∈ S × A, and E_{i,R_*}{r^t_{X_t}(Y_t)} = E_{i,R}{r^t_{X_t}(Y_t)}, t ∈ N.

1.2.2 Optimality criteria

In this course we consider the following optimality criteria:
1. Total expected reward over a finite horizon.
2. Total expected discounted reward over an infinite horizon.
3. Total expected reward over an infinite horizon.
4. Average expected reward over an infinite horizon.
5. More sensitive optimality criteria over an infinite horizon.
Assumption 1.1
In infinite horizon models we assume that the immediate rewards and the transition probabilities are stationary, and we denote them by r_i(a) and p_{ij}(a), respectively, for all i, j and a.
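To fix ideas, the stationary model of Assumption 1.1 can be held in a small data structure. The following is a minimal sketch in Python (an illustration added to these notes, not part of the original text; the class and field names are invented), instantiated with the three-state MDP of Example 1.2:

```python
# Minimal container for a stationary finite MDP: states, action sets A(i),
# rewards r_i(a) and transition probabilities p_ij(a).  Names are illustrative.

class MDP:
    def __init__(self, states, actions, reward, trans):
        self.S = states          # list of states
        self.A = actions         # dict: state -> list of actions
        self.r = reward          # dict: (i, a) -> immediate reward r_i(a)
        self.p = trans           # dict: (i, a) -> {j: p_ij(a)}
        for i in states:
            for a in actions[i]:  # each row must be a probability vector
                assert abs(sum(trans[(i, a)].values()) - 1.0) < 1e-12

# The three-state MDP of Example 1.2 encoded in this structure
# (zero-probability transitions are simply omitted):
mdp = MDP(
    states=[1, 2, 3],
    actions={1: [1], 2: [1, 2], 3: [1, 2]},
    reward={(1, 1): 0, (2, 1): 1, (2, 2): 1, (3, 1): 0, (3, 2): 0},
    trans={(1, 1): {2: 0.5, 3: 0.5}, (2, 1): {2: 1.0}, (2, 2): {3: 1.0},
           (3, 1): {3: 1.0}, (3, 2): {2: 1.0}},
)
```

A deterministic policy f is then simply a dict mapping each state to one action in A(i); randomized policies would map each state to a probability vector over A(i).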
Total expected reward over a finite horizon
Consider an MDP with a finite planning horizon of T periods. For any policy R and initial state i ∈ S, the total expected reward over the planning horizon is defined by
v_i^T(R) = Σ_{t=1}^T E_{i,R}{r^t_{X_t}(Y_t)} = Σ_{t=1}^T Σ_{j,a} P_{i,R}{X_t = j, Y_t = a} r^t_j(a), i ∈ S. (1.8)
Interchanging the summation and the expectation in (1.8) is allowed, so v_i^T(R) may also be defined as the expected total reward, i.e.
v_i^T(R) = E_{i,R}{Σ_{t=1}^T r^t_{X_t}(Y_t)}, i ∈ S.
Let v_i^T = sup_{R∈C} v_i^T(R), i ∈ S, or in vector notation, v^T = sup_{R∈C} v^T(R). The vector v^T is called the value vector. From Corollary 1.1 and Lemma 1.1, it follows that
v^T = sup_{R∈C(M)} v^T(R), (1.9)
and
v^T(R) = Σ_{t=1}^T P(π^1) P(π^2) ··· P(π^{t−1}) r(π^t) for R = (π^1, π^2, ...) ∈ C(M). (1.10)
A policy R^* is called an optimal policy if v^T(R^*) = sup_{R∈C} v^T(R). It is nontrivial that there exists an optimal policy: the supremum has to be attained, and it has to be attained simultaneously for all starting states. It can be shown (see the next chapter) that an optimal Markov policy R^* = (f^1, f^2, ..., f^T) exists, where f^t is a deterministic decision rule for 1 ≤ t ≤ T.
Total expected discounted reward over an infinite horizon
Assume that an amount r earned at time point 1 is deposited in a bank with interest rate ρ. This amount becomes (1+ρ)r at time point 2, (1+ρ)^2 r at time point 3, etc. Hence, an amount r at time point 1 is comparable with (1+ρ)^{t−1} r at time point t, t = 1, 2, .... Let α = (1+ρ)^{−1}, called the discount factor; note that α ∈ (0, 1). Then, conversely, an amount r received at time point t can be considered as equivalent to an amount α^{t−1} r at time point 1, the so-called discounted value.
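Before turning to the discounted criterion: the optimal Markov policy (f^1, ..., f^T) for the finite horizon mentioned above is computed in Chapter 2 by backward induction. As a preview, a hedged sketch for a model with stationary rewards and transitions (function and variable names are our own, not from the text):

```python
# Backward induction sketch for the finite-horizon value vector v^T.
# u[i] holds the maximal expected reward over the remaining periods from state i.

def backward_induction(S, A, r, p, T):
    u_next = {i: 0.0 for i in S}        # value after the horizon is 0
    policy = {}
    for t in range(T, 0, -1):           # work backwards from period T to 1
        u, f = {}, {}
        for i in S:
            # choose the action maximizing immediate + expected future reward
            best = max(A[i], key=lambda a: r[(i, a)] +
                       sum(q * u_next[j] for j, q in p[(i, a)].items()))
            f[i] = best
            u[i] = r[(i, best)] + sum(q * u_next[j]
                                      for j, q in p[(i, best)].items())
        u_next, policy[t] = u, f
    return u_next, policy               # v^T and decision rules f^1, ..., f^T
```

The returned policy is Markov and deterministic, in line with the existence result quoted above.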
The reward r_{X_t}(Y_t) at time point t has at time point 1 the discounted value α^{t−1} r_{X_t}(Y_t). The total expected α-discounted reward, given initial state i and policy R, is denoted by v_i^α(R) and defined by
v_i^α(R) = Σ_{t=1}^∞ E_{i,R}{α^{t−1} r_{X_t}(Y_t)} = Σ_{t=1}^∞ α^{t−1} Σ_{j,a} P_{i,R}{X_t = j, Y_t = a} r_j(a). (1.11)
Another way to consider the discounted reward is by the expected total α-discounted reward, i.e.
E_{i,R}{Σ_{t=1}^∞ α^{t−1} r_{X_t}(Y_t)}.
Since
Σ_{t=1}^∞ |α^{t−1} r_{X_t}(Y_t)| ≤ Σ_{t=1}^∞ α^{t−1} M = (1−α)^{−1} M,
where M = max_{i,a} |r_i(a)|, the theorem of dominated convergence (e.g. Bauer [8] p. 7) implies
E_{i,R}{Σ_{t=1}^∞ α^{t−1} r_{X_t}(Y_t)} = Σ_{t=1}^∞ E_{i,R}{α^{t−1} r_{X_t}(Y_t)} = v_i^α(R), (1.12)
i.e. the expected total discounted reward and the total expected discounted reward criteria are equivalent. Let R = (π^1, π^2, ...) ∈ C(M); then
v^α(R) = Σ_{t=1}^∞ α^{t−1} P(π^1) P(π^2) ··· P(π^{t−1}) r(π^t). (1.13)
Hence, a stationary policy π^∞ satisfies
v^α(π^∞) = Σ_{t=1}^∞ α^{t−1} P(π)^{t−1} r(π). (1.14)
As before, the value vector v^α and the concept of optimality for a policy R^* are defined by
v^α = sup_R v^α(R) and v^α(R^*) = v^α. (1.15)
In Chapter 3 we show the existence of an optimal deterministic policy f^∞ for this criterion, and we show that the value vector v^α is the unique solution of the so-called optimality equation
x_i = max_{a∈A(i)} {r_i(a) + α Σ_j p_{ij}(a) x_j}, i ∈ S. (1.16)
Furthermore, we will see that f^∞ is an optimal policy if
r_i(f) + α Σ_j p_{ij}(f) v^α_j ≥ r_i(a) + α Σ_j p_{ij}(a) v^α_j for all a ∈ A(i), i ∈ S. (1.17)
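The optimality equation (1.16) is a fixed-point equation, which suggests computing v^α by successive approximation. A sketch of value iteration, treated properly in Chapter 3 (the stopping tolerance and all names are illustrative choices of ours, not from the text):

```python
# Value iteration sketch: iterate the maximization operator of (1.16),
# x <- max_a { r_i(a) + alpha * sum_j p_ij(a) * x_j }, until (near) convergence.

def value_iteration(S, A, r, p, alpha, eps=1e-10):
    x = {i: 0.0 for i in S}
    while True:
        y = {i: max(r[(i, a)] + alpha * sum(q * x[j]
                    for j, q in p[(i, a)].items()) for a in A[i])
             for i in S}
        # the operator is a contraction for alpha < 1, so this terminates
        if max(abs(y[i] - x[i]) for i in S) < eps:
            return y
        x = y
```

For a one-state MDP with reward 1 and α = 0.5, the fixed point of x = 1 + 0.5x is 2, which the iteration approaches geometrically.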
Total expected reward over an infinite horizon
The total expected reward, given initial state i and policy R, is denoted by v_i(R) and defined by
v_i(R) = Σ_{t=1}^∞ E_{i,R}{r_{X_t}(Y_t)} = Σ_{t=1}^∞ Σ_{j,a} P_{i,R}{X_t = j, Y_t = a} r_j(a). (1.18)
We consider this criterion under the following assumptions.
Assumption 1.2
(1) The model is substochastic, i.e. Σ_j p_{ij}(a) ≤ 1 for all (i, a) ∈ S × A.
(2) For any initial state i and any policy R, the total expected reward v_i(R) is well-defined (possibly ±∞).
The value vector and the concept of an optimal policy are defined in the usual way. Under the additional assumption that every policy R is transient, i.e. Σ_{t=1}^∞ P_{i,R}{X_t = j, Y_t = a} < ∞ for all i, j and a, it can be shown (cf. Kallenberg [08], chapter 3) that, with α = 1, most properties of the discounted MDP model remain valid for the total reward criterion.
Average expected reward over an infinite horizon
In the criterion of average reward, the limiting behaviour of (1/T) Σ_{t=1}^T r_{X_t}(Y_t) is considered as T → ∞. Since lim_{T→∞} (1/T) Σ_{t=1}^T r_{X_t}(Y_t) may not exist and interchanging limit and expectation is not allowed in general, there are four different evaluation measures which can be considered:
1. Lower limit of the average expected reward:
φ_i(R) = liminf_{T→∞} (1/T) Σ_{t=1}^T E_{i,R}{r_{X_t}(Y_t)}, i ∈ S, with value vector φ = sup_R φ(R).
2. Upper limit of the average expected reward:
φ̄_i(R) = limsup_{T→∞} (1/T) Σ_{t=1}^T E_{i,R}{r_{X_t}(Y_t)}, i ∈ S, with value vector φ̄ = sup_R φ̄(R).
3. Expectation of the lower limit of the average reward:
ψ_i(R) = E_{i,R}{liminf_{T→∞} (1/T) Σ_{t=1}^T r_{X_t}(Y_t)}, i ∈ S, with value vector ψ = sup_R ψ(R).
4. Expectation of the upper limit of the average reward:
ψ̄_i(R) = E_{i,R}{limsup_{T→∞} (1/T) Σ_{t=1}^T r_{X_t}(Y_t)}, i ∈ S, with value vector ψ̄ = sup_R ψ̄(R).
Lemma 1.2
ψ(R) ≤ φ(R) ≤ φ̄(R) ≤ ψ̄(R) for every policy R.
Proof
The second inequality is obvious. The first and the last inequality follow from Fatou's lemma (e.g.
Bauer [8], p. 26):
ψ_i(R) = E_{i,R}{liminf_{T→∞} (1/T) Σ_{t=1}^T r_{X_t}(Y_t)} ≤ liminf_{T→∞} (1/T) Σ_{t=1}^T E_{i,R}{r_{X_t}(Y_t)} = φ_i(R)
and
φ̄_i(R) = limsup_{T→∞} (1/T) Σ_{t=1}^T E_{i,R}{r_{X_t}(Y_t)} ≤ E_{i,R}{limsup_{T→∞} (1/T) Σ_{t=1}^T r_{X_t}(Y_t)} = ψ̄_i(R).
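For a concrete feeling for these measures, the finite-horizon average expected reward (1/T) Σ_{t=1}^T E{r_{X_t}(Y_t)} of a deterministic stationary policy can be computed exactly by propagating the state distribution. A small illustrative sketch (our own, not from the text):

```python
# Average expected reward over T periods for a deterministic stationary policy:
# pi[i] is the action chosen in state i; beta is the initial distribution.

def average_expected_reward(S, r, p, pi, beta, T):
    dist = dict(beta)            # distribution of X_t, starting with t = 1
    total = 0.0
    for _ in range(T):
        total += sum(dist[i] * r[(i, pi[i])] for i in S)
        nxt = {j: 0.0 for j in S}
        for i in S:              # one-step transition of the distribution
            for j, q in p[(i, pi[i])].items():
                nxt[j] += dist[i] * q
        dist = nxt
    return total / T
```

For a two-state chain that alternates deterministically between a reward-1 state and a reward-0 state, the average tends to 1/2, and the liminf and limsup measures coincide.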
Example 1.2
Consider the following MDP: S = {1, 2, 3}; A(1) = {1}, A(2) = A(3) = {1, 2};
p_{11}(1) = 0, p_{12}(1) = 0.5, p_{13}(1) = 0.5; r_1(1) = 0.
p_{21}(1) = 0, p_{22}(1) = 1, p_{23}(1) = 0; r_2(1) = 1.
p_{21}(2) = 0, p_{22}(2) = 0, p_{23}(2) = 1; r_2(2) = 1.
p_{31}(1) = 0, p_{32}(1) = 0, p_{33}(1) = 1; r_3(1) = 0.
p_{31}(2) = 0, p_{32}(2) = 1, p_{33}(2) = 0; r_3(2) = 0.
We use directed graphs to illustrate examples. The nodes of the graph represent the states. If the transition probability p_{ij}(a) is positive, there is an arc (i, j) from node i to node j; for a = 1 we use a simple arc, for a = 2 a double arc, etc.; next to the arc the probability p_{ij}(a) is given. The graph of this MDP model is pictured. If we start in state 1, we never return to that state, but we will remain in state 2 or state 3 forever. Because of the reward structure, (1/T) Σ_{t=1}^T r_{X_t}(Y_t) is the average number of visits to state 2 in the periods 1, 2, ..., T. Let the policy R = (π^1, π^2, ..., π^t, ...) be defined by π^1_{i1} = 1 for i = 1, 2, 3 and, for t ≥ 2,
π^t_{h_t a_t} = k_t/(k_t + 1) if a_t = 1; 1/(k_t + 1) if a_t = 2,
where, for h_t = (i_1, a_1, i_2, a_2, ..., i_{t−1}, a_{t−1}, i_t), k_t = max{k | i_t = i_{t−1} = ··· = i_{t−k+1}, i_{t−k} ≠ i_t}, so k_t is the maximum number of periods that we have consecutively been in state i_t at the time points t, t−1, .... This implies that each time we stay in the same state (2 or 3), we have a higher probability of staying there for one more period. The probability of staying in state 2 (or 3) forever, given that we enter state 2 (or 3), is
P_R{X_t = 2 for t = t_0+1, t_0+2, ... | X_{t_0} = 2 and X_{t_0−1} ≠ 2} = lim_{t→∞} Π_{k=1}^t k/(k+1) = lim_{t→∞} 1/(t+1) = 0.
Hence, with probability 1, a switch from state 2 to state 3 will occur at some time point; similarly, with probability 1, there is a switch from state 3 to state 2 at some time point. We can even compute the expected number of periods before such a switch occurs:
E_R{number of consecutive stays in state 2} = Σ_{k=1}^∞ k · P_R{X_t = 2 for t = t_0+1, t_0+2, ..., t_0+k−1; X_{t_0+k} ≠ 2 | X_{t_0} = 2 and X_{t_0−1} ≠ 2} = Σ_{k=1}^∞ k · (1/k) · 1/(k+1) = Σ_{k=1}^∞ 1/(k+1) = ∞.
So, as long as we stay in state 2, we obtain a reward of 1 in each period. The expected number of stays in state 2 is infinite. Therefore, with probability 1, there is an infinite number of time
points at which the average reward is arbitrarily close to 1. Similarly for state 3: with probability 1, there is an infinite number of time points at which the average reward is arbitrarily close to 0. This implies for policy R that
limsup_{T→∞} (1/T) Σ_{t=1}^T r_{X_t}(Y_t) = 1 with probability 1
and
liminf_{T→∞} (1/T) Σ_{t=1}^T r_{X_t}(Y_t) = 0 with probability 1.
From this we obtain ψ_1(R) = E_{1,R}{liminf_{T→∞} (1/T) Σ_{t=1}^T r_{X_t}(Y_t)} = 0 and ψ̄_1(R) = E_{1,R}{limsup_{T→∞} (1/T) Σ_{t=1}^T r_{X_t}(Y_t)} = 1.
If the process starts in state 1, then at any time point t ≥ 2, by symmetry, the probability to be in state 2 and earn 1 equals the probability to be in state 3 and earn 0. Hence, E_{1,R}{r_{X_t}(Y_t)} = 1/2 for all t ≥ 2, so
φ_1(R) = liminf_{T→∞} (1/T) Σ_{t=1}^T E_{1,R}{r_{X_t}(Y_t)} = φ̄_1(R) = limsup_{T→∞} (1/T) Σ_{t=1}^T E_{1,R}{r_{X_t}(Y_t)} = 1/2.
So this example shows that ψ_1(R) = 0 < φ_1(R) = 1/2 = φ̄_1(R) < ψ̄_1(R) = 1.
We also give an example in which φ_i(R) < φ̄_i(R) for some policy R and some state i. Consider the following very simple MDP: S = {1}; A(1) = {1, 2}; p_{11}(1) = p_{11}(2) = 1; r_1(1) = 1, r_1(2) = −1. Take the policy R that chooses action 1 at t = 1; action 2 at t = 2, 3; action 1 at t = 4, 5, 6, 7; action 2 at t = 8, 9, ..., 15. In general: action 1 at t = 2^{2k}, 2^{2k}+1, ..., 2^{2k+1}−1 and action 2 at t = 2^{2k+1}, 2^{2k+1}+1, ..., 2^{2k+2}−1, for k = 0, 1, .... This gives a deterministic stream of rewards: +1; −1, −1; +1, +1, +1, +1; −1, −1, −1, −1, −1, −1, −1, −1; ... with total rewards +1; 0, −1; 0, +1, +2, +3; +2, +1, 0, −1, −2, −3, −4, −5; −4, −3, −2, −1, 0, +1, +2, +3, +4, +5, +6, +7, +8, +9, +10, +11. To compute the limsup we take the time points 2^{2k−1} − 1 for k = 1, 2, ..., i.e. the time points 1, 7, 31, 127, ...; for the liminf we consider the time points 2^{2k} − 1 for k = 1, 2, ..., i.e. the time points 3, 15, 63, 255, .... Let T_k be the time points at which we change the chosen action, i.e. T_k = 2^k − 1 for k = 1, 2, ..., and let A_k denote the total reward at time point T_k: A_1 = +1, A_2 = −1, A_3 = +3, A_4 = −5, A_5 = +11.
It can easily be shown (this is left to the reader) that |A_k| + |A_{k+1}| = 2^k and |A_{k+1}| = 2|A_k| + (-1)^k. This implies that |A_k| = (1/3){2^k - (-1)^k}. Since A_k is positive iff k is odd, we obtain A_k = (1/3){(-1)^{k+1} 2^k + 1}, k = 1, 2, .... Hence,
φ_1(R) = liminf_{T→∞} (1/T) Σ_{t=1}^T E_{1,R}{r_{X_t}(Y_t)} = lim_{k→∞} A_{2k} / (2^{2k} - 1) = lim_{k→∞} (1/3){-2^{2k} + 1} / (2^{2k} - 1) = -1/3

and

φ̄_1(R) = limsup_{T→∞} (1/T) Σ_{t=1}^T E_{1,R}{r_{X_t}(Y_t)} = lim_{k→∞} A_{2k+1} / (2^{2k+1} - 1) = lim_{k→∞} (1/3){2^{2k+1} + 1} / (2^{2k+1} - 1) = +1/3.
For these four criteria the value vector and the concept of an optimal policy can be defined in the usual way. Bierth [2] has shown that ψ(π^∞) = φ(π^∞) = φ̄(π^∞) = ψ̄(π^∞) for every stationary policy π^∞, and that there exists a deterministic policy which is optimal for all these four criteria. Hence, the four criteria are equivalent in the sense that an optimal deterministic policy for one criterion is also optimal for the other criteria.

More sensitive optimality criteria over an infinite horizon

The average reward criterion has the disadvantage that it does not consider rewards earned in a finite number of periods. For example, the streams of rewards 0, 0, 0, 0, 0, ... and 100, 100, 0, 0, 0, ... have the same average value, although usually the second stream will be preferred. Hence, there is a need for criteria that select policies which are average optimal but also make the right early decisions. There are several ways to create more sensitive criteria. One way is to consider discounting for discount factors that tend to 1. Another way is to use more subtle kinds of averaging. We present some of these criteria.

1. Bias optimality
A policy R^* is called bias optimal if lim_{α↑1} {v^α(R^*) - v^α} = 0.

2. Blackwell optimality
A policy R^* is Blackwell optimal if v^α(R^*) = v^α for all α ∈ [α_0, 1) and some 0 ≤ α_0 < 1. From this definition it is clear that Blackwell optimality implies bias optimality. The next example shows policies f_1^∞, f_2^∞ and f_3^∞ such that f_1^∞ is average optimal but not bias optimal, f_2^∞ is bias optimal but not Blackwell optimal, and f_3^∞ is Blackwell optimal. Therefore, Blackwell optimality is more selective than bias optimality, which in its turn is more selective than average optimality.

Example 1.3
Consider the following MDP: S = {1, 2}; A(1) = {1, 2, 3}, A(2) = {1}.
p_11(1) = 1, p_12(1) = 0; p_11(2) = p_12(2) = 0.5; p_11(3) = 0, p_12(3) = 1; p_21(1) = 0, p_22(1) = 1; r_1(1) = 0, r_1(2) = 1, r_1(3) = 2, r_2(1) = 0.

If the system is in state 2, it stays in state 2 forever and no rewards are earned. In state 1 we have to choose between the actions 1, 2 and 3, which gives the policies f_1^∞, f_2^∞ and f_3^∞, respectively. All policies have the same average reward (0 for both starting states), and the discounted rewards for these policies differ only in state 1 (in state 2 the discounted reward is 0 for every discount factor α).
It is easy to see that for all α we have v_1^α(f_1^∞) = 0 and v_1^α(f_3^∞) = 2. For the second policy, we have an immediate reward 1, and from state 1 we stay in state 1 with probability 0.5 and go to state 2 with probability 0.5. Hence, we obtain v_1^α(f_2^∞) = 1 + 0.5α v_1^α(f_2^∞) + 0.5α v_2^α(f_2^∞), so v_1^α(f_2^∞) = 2/(2 - α).

[Figure: the values v_1^α(f_1^∞) = 0, v_1^α(f_2^∞) = 2/(2 - α) and v_1^α(f_3^∞) = 2 as functions of the discount factor α ∈ [0, 1].]

From this picture it is obvious that v_1^α = 2 (v^α = sup_{f^∞ ∈ C(D)} v^α(f^∞); this is shown in the next chapter) and that the policy f_3^∞ is the only Blackwell optimal policy. Furthermore, we have lim_{α↑1} {v_1^α(f_1^∞) - v_1^α} = -2, lim_{α↑1} {v_1^α(f_2^∞) - v_1^α} = lim_{α↑1} {2/(2 - α) - 2} = 0, and lim_{α↑1} {v_1^α(f_3^∞) - v_1^α} = lim_{α↑1} {2 - 2} = 0. Hence, both the policies f_2^∞ and f_3^∞ are bias optimal.

3. n-discount optimality
For n = -1, 0, 1, ... the policy R^* is called n-discount optimal if lim_{α↑1} (1 - α)^{-n} {v^α(R^*) - v^α} = 0. Obviously, 0-discount optimality is the same as bias optimality. It can be shown that (-1)-discount optimality is equivalent to average optimality, and that Blackwell optimality is equivalent to n-discount optimality for all n ≥ N = |S|.

4. n-average optimality
Let R be any policy. For t ∈ N and n = -1, 0, 1, ..., we define the vector v^{n,t}(R) by

v^{n,t}(R) = v^t(R) for n = -1;  v^{n,t}(R) = Σ_{s=1}^t v^{n-1,s}(R) for n = 0, 1, ....

For n = -1, 0, 1, ... a policy R^* is said to be n-average optimal if for all policies R: liminf_{T→∞} (1/T) {v^{n,T}(R^*) - v^{n,T}(R)} ≥ 0. It can be shown that n-average optimality is equivalent to n-discount optimality.

5. Overtaking optimality
A policy R^* is overtaking optimal if liminf_{T→∞} {v^T(R^*) - v^T(R)} ≥ 0 for all policies R. In contrast with the other criteria, an overtaking optimal policy does not exist in general.

6. Average overtaking optimality
A policy R^* is average overtaking optimal if liminf_{T→∞} (1/T) Σ_{t=1}^T {v^t(R^*) - v^t(R)} ≥ 0 for all policies R.
It can be shown that average overtaking optimality is equivalent to bias optimality and therefore also equivalent to 0-discount optimality.
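As a numerical companion to Example 1.3 above, the discounted values in state 1 and the bias limits can be evaluated directly. This is a sketch for verification only; the closed form 2/(2 - α) for the second policy follows from the fixed-point equation v = 1 + 0.5·α·v, since state 2 is absorbing with reward 0.

```python
# Discounted value in state 1 of Example 1.3 for the three policies
# (state 2 is absorbing and earns nothing, so it never contributes).

def v1(policy, alpha):
    if policy == 1:                          # action 1: stay in state 1, reward 0
        return 0.0
    if policy == 2:                          # action 2: v = 1 + 0.5*alpha*v
        return 1.0 / (1.0 - 0.5 * alpha)     # equals 2 / (2 - alpha)
    return 2.0                               # action 3: reward 2, then absorbed

alpha = 1 - 1e-9                             # bias limits: v1(f) - 2 as alpha -> 1
assert v1(1, alpha) - 2 == -2.0              # f_1: gap -2, not bias optimal
assert abs(v1(2, alpha) - 2) < 1e-6          # f_2: bias optimal
assert v1(3, alpha) - 2 == 0.0               # f_3: bias optimal (and Blackwell)

# f_2 is not Blackwell optimal: its value stays strictly below 2 for alpha < 1
assert all(v1(2, a) < 2 for a in (0.5, 0.9, 0.999))
print("checks passed")
```

This mirrors the picture in the example: f_3^∞ attains the value 2 for every discount factor, while f_2^∞ only catches up in the limit α → 1.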
1.3 Examples

In Example 1.1 we saw an inventory model with backlogging as an example of an MDP. In this section we introduce other examples of MDPs: gambling, gaming, optimal stopping, replacement, maintenance and repair, production, optimal control of queues, stochastic scheduling and the so-called multi-armed bandit problem. For some models the optimal policy has a special structure. In those cases we will indicate the optimal structure.

1.3.1 Red-black gambling

In red-black gambling a gambler with a fortune of i euro may bet any amount a ∈ {1, 2, ..., i}. He wins his stake with probability p and he loses it with probability 1 - p. The gambler's goal is to reach a certain fortune N. The gambler continues until either he has reached his goal or he has lost all his money. The problem is to determine a policy that maximizes the probability to reach this goal.

This problem can be modelled as an MDP with the total reward criterion. The fortune of the gambler is the state. Since the gambling problem is over when the gambler has reached his goal or has lost all his money, there are no transitions when the game is in either state N or state 0. Maximizing the probability to reach the amount N is equivalent to assigning a reward 1 to state N and reward 0 to the other states. The MDP model for the gambling problem is as follows.

S = {0, 1, ..., N}; A(0) = A(N) = {0}, A(i) = {1, 2, ..., min(i, N - i)}, 1 ≤ i ≤ N - 1.
For 1 ≤ i ≤ N - 1 and a ∈ A(i): p_ij(a) = p if j = i + a; p_ij(a) = 1 - p if j = i - a; p_ij(a) = 0 otherwise; and r_i(a) = 0.
p_0j(0) = p_Nj(0) = 0, j ∈ S; r_0(0) = 0, r_N(0) = 1.

Since under any policy state N or state 0 is reached with probability 1, it is easy to verify that Assumption 1.2 is satisfied. Notice also that

v_i(R) = Σ_{t=1}^∞ Σ_{j,a} P_{i,R}{X_t = j, Y_t = a} r_j(a) = Σ_{t=1}^∞ P_{i,R}{X_t = N},

i.e. the total reward is equal to the probability to reach state N. It can be shown that an optimal policy has the following structure:
if p > 1/2, then timid play, i.e. always bet the amount 1, is optimal;
if p = 1/2, then any policy is optimal;
if p < 1/2, then bold play, i.e. betting min(i, N - i) in state i, is optimal.
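The structure result can be probed numerically with successive approximation of the reach probability under a fixed betting policy. A sketch (not from the notes; the function and policy names are mine):

```python
# Successive approximation of v_i = P(reach fortune N before 0) under a
# fixed betting policy for red-black gambling (illustrative sketch).

def reach_prob(bet, N, p, sweeps=20000):
    """bet: map from fortune i to a stake in {1, ..., min(i, N - i)}."""
    v = [0.0] * (N + 1)
    v[N] = 1.0                       # reward 1 in state N, 0 elsewhere
    for _ in range(sweeps):
        for i in range(1, N):
            a = bet(i)
            v[i] = p * v[i + a] + (1 - p) * v[i - a]
    return v

N = 10

def timid(i):                        # always bet 1
    return 1

def bold(i):                         # bet min(i, N - i)
    return min(i, N - i)

# p < 1/2: bold play should be at least as good as timid play in every state
v_t = reach_prob(timid, N, p=0.4)
v_b = reach_prob(bold, N, p=0.4)
assert all(v_b[i] >= v_t[i] - 1e-9 for i in range(N + 1))
print(round(v_t[5], 4), round(v_b[5], 4))   # prints 0.1164 0.4

# p = 1/2: any policy gives reach probability i/N (checked here for timid play)
v_half = reach_prob(timid, N, p=0.5)
assert all(abs(v_half[i] - i / N) < 1e-6 for i in range(N + 1))
```

With p = 0.4 and a fortune of 5 out of N = 10, timid play reaches the goal with probability about 0.12, while bold play achieves 0.4, in line with the stated structure.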
1.3.2 Gaming: How to serve in tennis

The scoring in tennis is conventionally in steps from 0 to 15 to 30 to 40 to game. We simply use the numbers 0 through 4 for these scores. If the score reaches deuce, i.e. 40-40, the game is won by the player who is the first to score two points more than his opponent. Therefore, deuce is equivalent to 30-30, and similarly advantage server (receiver) is equivalent to 40-30 (30-40). Hence, the scores can be represented by (i, j), 0 ≤ i, j ≤ 3, excluding the pair (3,3), which is equivalent to (2,2), where i denotes the score of the server and j of the receiver. Furthermore, we use the states (4) and (5) for the case that the server or the receiver, respectively, wins the game. When the score is (i, j), the server may serve a first service (s = 1), or a second service (s = 2) in case the first service is a fault. This leads to the following 32 states:

(i, j, s), 0 ≤ i, j ≤ 3, (i, j) ≠ (3, 3), s = 1, 2: the states in which the game is going on;
(4): the target state for the server;
(5): the target state for the receiver.

For the sake of simplicity, suppose that the players can choose between two types of services: a fast service (a = 1) and a slow service (a = 2). The fast service is more likely to be a fault, but also more difficult to return; the slow service is more accurate and easier to return correctly. Let p_1 (p_2) be the probability that the fast (slow) service is good (lands in the given bounds of the court) and let q_1 (q_2) be the probability that the server wins the point when the fast (slow) service is good. We make the following obvious assumptions: p_1 ≤ p_2 and q_1 ≥ q_2.

Suppose the server chooses action a for his first service. Then,
the event that the server serves right and wins the point has probability p_a q_a;
the event that the server serves right and loses the point has probability p_a(1 - q_a);
the event that the server serves a fault and continues with his second service has probability 1 - p_a.
For the second service, the server either wins or loses the point, with probabilities p_a q_a and 1 - p_a q_a, respectively. In the states where there is no game point, i.e. i < 3 and j < 3, we have the following transition probabilities:

p_{(i,j,1),(i+1,j,1)}(a) = p_a q_a;  p_{(i,j,2),(i+1,j,1)}(a) = p_a q_a;
p_{(i,j,1),(i,j+1,1)}(a) = p_a(1 - q_a);  p_{(i,j,2),(i,j+1,1)}(a) = 1 - p_a q_a;
p_{(i,j,1),(i,j,2)}(a) = 1 - p_a.

If i = 3, we obtain (where (3, 3, 1) has to be replaced by (2, 2, 1))

p_{(3,j,1),(4)}(a) = p_a q_a;  p_{(3,j,2),(4)}(a) = p_a q_a;
p_{(3,j,1),(3,j+1,1)}(a) = p_a(1 - q_a);  p_{(3,j,2),(3,j+1,1)}(a) = 1 - p_a q_a;
p_{(3,j,1),(3,j,2)}(a) = 1 - p_a.

Similarly, we obtain for j = 3 (also in this case replace (3, 3, 1) by (2, 2, 1))

p_{(i,3,1),(i+1,3,1)}(a) = p_a q_a;  p_{(i,3,2),(i+1,3,1)}(a) = p_a q_a;
p_{(i,3,1),(5)}(a) = p_a(1 - q_a);  p_{(i,3,2),(5)}(a) = 1 - p_a q_a;
p_{(i,3,1),(i,3,2)}(a) = 1 - p_a.
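As a sanity check on these transition probabilities, the chance that the server wins a single point when he plays action a on the first service and action b on the second is p_a q_a + (1 - p_a) p_b q_b. The sketch below uses made-up values for p_1, q_1, p_2, q_2 (they are not from the notes):

```python
# Probability that the server wins one point, given the first-service
# action a and the second-service action b (illustrative numbers only).
p = {1: 0.6, 2: 0.9}   # P(service is good): fast (1) vs slow (2)
q = {1: 0.8, 2: 0.5}   # P(server wins the rally | service is good)

def win_point(a, b):
    # good first service and win, or a fault followed by a winning second service
    return p[a] * q[a] + (1 - p[a]) * p[b] * q[b]

best = max(((a, b) for a in (1, 2) for b in (1, 2)), key=lambda ab: win_point(*ab))
print(best, round(win_point(*best), 4))   # prints (1, 1) 0.672
```

With these numbers p_1 q_1 = 0.48 > p_2 q_2 = 0.45, and indeed the fast service is best for both the first and the second service.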
When the game is over, i.e. in the states (4) and (5), there are no transitions. The question is: what kind of service to choose, given the score, in order to win the game? To maximize the probability that the server wins the game, the following reward structure is suitable: all rewards are 0, except in the target state (4) of the server, where the reward is 1. As utility criterion the total expected reward is used.

Let x = p_1 q_1 / (p_2 q_2). Then, x is the ratio of serving right and winning the point for the two possible actions a = 1 and a = 2. It can be shown that the optimal policy has the following structure in each state:

if x ≥ 1: always use the fast service;
if 1 - (p_2 - p_1) ≤ x < 1: use the fast service as first service and the slow service as second;
if x < 1 - (p_2 - p_1): always use the slow service.

A similar problem is: is it better to use a fast or a slow service to maximize the probability of winning the next point? It turns out that this problem has the same optimal policy. Hence, the optimal policy for winning the game is a myopic policy.

1.3.3 Optimal stopping

In an optimal stopping problem there are two actions in every state. The first action is stopping and the second corresponds to continuing. If the stopping action 1 is chosen in state i, then a terminal reward r_i is earned and the process terminates. This termination is modelled by taking all transition probabilities equal to zero. If action 2 is chosen in state i, then a cost c_i is incurred and the probability of being in state j at the next time point is p_ij. Hence, the characteristics of the MDP model are:

S = {1, 2, ..., N}; A(i) = {1, 2}, i ∈ S; r_i(1) = r_i, r_i(2) = -c_i, i ∈ S; p_ij(1) = 0, p_ij(2) = p_ij, i, j ∈ S.

We are interested in finding an optimal stopping policy. A stopping policy R is a policy such that for any starting state i the process terminates in finite time with probability 1. As optimality criterion the total expected reward is considered.
Notice that for a stopping policy the total expected reward v(R) is well-defined. Let v be the value vector of this model, i.e. v_i = sup{v_i(R) | R is a stopping policy}, i ∈ S. A stopping policy R^* is an optimal policy if v(R^*) = v. Let

S_0 = {i ∈ S | r_i ≥ -c_i + Σ_j p_ij r_j},

i.e. S_0 is the set of states in which immediate stopping is at least as good as continuing for one period and then choosing the stopping action. A one-step look ahead policy is a policy which chooses the stopping action in the states of S_0 and continues outside S_0.
More informationCS 598 Statistical Reinforcement Learning. Nan Jiang
CS 598 Statistical Reinforcement Learning Nan Jiang Overview What s this course about? A grad-level seminar course on theory of RL 3 What s this course about? A grad-level seminar course on theory of RL
More informationAdvanced Linear Programming: The Exercises
Advanced Linear Programming: The Exercises The answers are sometimes not written out completely. 1.5 a) min c T x + d T y Ax + By b y = x (1) First reformulation, using z smallest number satisfying x z
More informationNote that in the example in Lecture 1, the state Home is recurrent (and even absorbing), but all other states are transient. f ii (n) f ii = n=1 < +
Random Walks: WEEK 2 Recurrence and transience Consider the event {X n = i for some n > 0} by which we mean {X = i}or{x 2 = i,x i}or{x 3 = i,x 2 i,x i},. Definition.. A state i S is recurrent if P(X n
More informationSimplex Algorithm for Countable-state Discounted Markov Decision Processes
Simplex Algorithm for Countable-state Discounted Markov Decision Processes Ilbin Lee Marina A. Epelman H. Edwin Romeijn Robert L. Smith November 16, 2014 Abstract We consider discounted Markov Decision
More informationThe Complexity of Ergodic Mean-payoff Games,
The Complexity of Ergodic Mean-payoff Games, Krishnendu Chatterjee Rasmus Ibsen-Jensen Abstract We study two-player (zero-sum) concurrent mean-payoff games played on a finite-state graph. We focus on the
More information1.3 Convergence of Regular Markov Chains
Markov Chains and Random Walks on Graphs 3 Applying the same argument to A T, which has the same λ 0 as A, yields the row sum bounds Corollary 0 Let P 0 be the transition matrix of a regular Markov chain
More information6.254 : Game Theory with Engineering Applications Lecture 13: Extensive Form Games
6.254 : Game Theory with Engineering Lecture 13: Extensive Form Games Asu Ozdaglar MIT March 18, 2010 1 Introduction Outline Extensive Form Games with Perfect Information One-stage Deviation Principle
More informationCOMPUTING OPTIMAL SEQUENTIAL ALLOCATION RULES IN CLINICAL TRIALS* Michael N. Katehakis. State University of New York at Stony Brook. and.
COMPUTING OPTIMAL SEQUENTIAL ALLOCATION RULES IN CLINICAL TRIALS* Michael N. Katehakis State University of New York at Stony Brook and Cyrus Derman Columbia University The problem of assigning one of several
More informationTHE BIG MATCH WITH A CLOCK AND A BIT OF MEMORY
THE BIG MATCH WITH A CLOCK AND A BIT OF MEMORY KRISTOFFER ARNSFELT HANSEN, RASMUS IBSEN-JENSEN, AND ABRAHAM NEYMAN Abstract. The Big Match is a multi-stage two-player game. In each stage Player hides one
More informationDistributed Optimization. Song Chong EE, KAIST
Distributed Optimization Song Chong EE, KAIST songchong@kaist.edu Dynamic Programming for Path Planning A path-planning problem consists of a weighted directed graph with a set of n nodes N, directed links
More informationScheduling Markovian PERT networks to maximize the net present value: new results
Scheduling Markovian PERT networks to maximize the net present value: new results Hermans B, Leus R. KBI_1709 Scheduling Markovian PERT networks to maximize the net present value: New results Ben Hermans,a
More informationStability of the two queue system
Stability of the two queue system Iain M. MacPhee and Lisa J. Müller University of Durham Department of Mathematical Science Durham, DH1 3LE, UK (e-mail: i.m.macphee@durham.ac.uk, l.j.muller@durham.ac.uk)
More informationINTRODUCTION TO MARKOV CHAINS AND MARKOV CHAIN MIXING
INTRODUCTION TO MARKOV CHAINS AND MARKOV CHAIN MIXING ERIC SHANG Abstract. This paper provides an introduction to Markov chains and their basic classifications and interesting properties. After establishing
More informationP i [B k ] = lim. n=1 p(n) ii <. n=1. V i :=
2.7. Recurrence and transience Consider a Markov chain {X n : n N 0 } on state space E with transition matrix P. Definition 2.7.1. A state i E is called recurrent if P i [X n = i for infinitely many n]
More informationLecture 20 : Markov Chains
CSCI 3560 Probability and Computing Instructor: Bogdan Chlebus Lecture 0 : Markov Chains We consider stochastic processes. A process represents a system that evolves through incremental changes called
More informationA Reinforcement Learning (Nash-R) Algorithm for Average Reward Irreducible Stochastic Games
Learning in Average Reward Stochastic Games A Reinforcement Learning (Nash-R) Algorithm for Average Reward Irreducible Stochastic Games Jun Li Jun.Li@warnerbros.com Kandethody Ramachandran ram@cas.usf.edu
More informationReinforcement Learning. Donglin Zeng, Department of Biostatistics, University of North Carolina
Reinforcement Learning Introduction Introduction Unsupervised learning has no outcome (no feedback). Supervised learning has outcome so we know what to predict. Reinforcement learning is in between it
More information1 Random Walks and Electrical Networks
CME 305: Discrete Mathematics and Algorithms Random Walks and Electrical Networks Random walks are widely used tools in algorithm design and probabilistic analysis and they have numerous applications.
More informationSome AI Planning Problems
Course Logistics CS533: Intelligent Agents and Decision Making M, W, F: 1:00 1:50 Instructor: Alan Fern (KEC2071) Office hours: by appointment (see me after class or send email) Emailing me: include CS533
More informationInventory Control with Convex Costs
Inventory Control with Convex Costs Jian Yang and Gang Yu Department of Industrial and Manufacturing Engineering New Jersey Institute of Technology Newark, NJ 07102 yang@adm.njit.edu Department of Management
More informationSelecting Efficient Correlated Equilibria Through Distributed Learning. Jason R. Marden
1 Selecting Efficient Correlated Equilibria Through Distributed Learning Jason R. Marden Abstract A learning rule is completely uncoupled if each player s behavior is conditioned only on his own realized
More information