
UNIFORMIZATION IN MARKOV DECISION PROCESSES

OGUZHAN ALAGOZ, MEHMET U.S. AYVACI
Department of Industrial and Systems Engineering, University of Wisconsin-Madison, Madison, Wisconsin

Wiley Encyclopedia of Operations Research and Management Science, edited by James J. Cochran. Copyright 2010 John Wiley & Sons, Inc.

Most Markov decision process (MDP) models consider problems with decisions occurring at discrete time points. On the other hand, there are several real-life applications, particularly in queueing systems, in which the decision maker chooses actions at random times over a continuous-time interval. Such problems can be modeled using continuous-time models. Semi-Markov decision processes (SMDPs) (see Semi-Markov Decision Processes), a class of continuous-time models, generalize discrete-time Markov decision processes (DTMDPs) by allowing state changes to occur randomly over continuous time and letting or requiring decisions to be taken whenever the system state changes [1,2]. In SMDPs, the stochastic process defined by the state transitions follows a discrete-time Markov chain, while the time between transitions is drawn from a general distribution, independent of the transitions [1,3]. Continuous-time Markov decision processes (CTMDPs) constitute a special type of SMDP in which the transition times between decisions are exponentially distributed and actions are taken at every transition [2]. Uniformization is a tool used to convert a CTMDP into an equivalent DTMDP. Although uniformization has long been used to analyze continuous-time Markov processes [4-7], Serfozo [8] formalized the use of uniformization in the context of countable-state CTMDPs.

In this article, we describe uniformization in CTMDPs. Although we consider CTMDPs with stationary transition probabilities and reward functions, bounded reward functions, and finite state and action spaces, the results can easily be extended to CTMDPs with countable state and action spaces as well as to more general spaces under appropriate measurability conditions [8]. While we focus on uniformization only for infinite-horizon CTMDPs with the total expected discounted reward criterion, uniformization can also be utilized in CTMDPs with the average reward criterion [2]. Assuming a unichain transition probability matrix for every stationary policy, the transformation and modeling scheme for solving CTMDPs with the average reward criterion (see Average Reward MDPs: Solution Techniques) is identical to the one described in this article except for the computation of the reward function. The results apply to multichain cases as well, with slight modifications. More information on uniformization in CTMDPs with the average reward criterion is available elsewhere [2].

The remainder of this article is organized as follows: we next summarize how uniformization is used to convert a continuous-time Markov chain (CTMC) into an equivalent discrete-time Markov chain. Then, we describe the use of uniformization in CTMDPs. Finally, we present two examples of uniformization.

UNIFORMIZATION IN CONTINUOUS-TIME MARKOV CHAINS

CTMCs are formally defined as follows [9] (see the section titled Continuous-Time Markov Chains (CTMCs) in this encyclopedia): a continuous-time stochastic process $\{X(t), t \ge 0\}$ is a CTMC if for all $s, t \ge 0$ and nonnegative integers $i$, $j$, $x(u)$, $0 \le u < t$,

$$P\{X(t+s) = j \mid X(t) = i,\ X(u) = x(u),\ 0 \le u < t\} = P\{X(t+s) = j \mid X(t) = i\}.$$

A CTMC is a stochastic process with the Markovian property; that is, the conditional distribution of the future state $X(t+s)$, given the present state $X(t)$ and the past states $X(u)$, $0 \le u < t$, is independent of the past states and depends only on the current state $X(t)$ (see Definition and Examples of CTMCs).

Consider a CTMC in which the time to make a transition from its current state to a different state is exponentially distributed with rate $\beta$ for all states. Let $P_{ij}(t)$ denote the probability of being in state $j$ at time $t$ starting from state $i$ at time $0$. Note that the number of transitions by time $t$, $\{N(t), t \ge 0\}$, is a Poisson process with rate $\beta$ [9]. Therefore, $P_{ij}(t)$ can be recast by conditioning on the number of transitions by time $t$ as follows:

$$P_{ij}(t) = P\{X(t) = j \mid X(0) = i\} = \sum_{n=0}^{\infty} P\{X(t) = j \mid X(0) = i, N(t) = n\}\, P\{N(t) = n \mid X(0) = i\} = \sum_{n=0}^{\infty} P^n_{ij}\, e^{-\beta t}\,\frac{(\beta t)^n}{n!}, \qquad (1)$$

where $P^n_{ij}$ represents the $n$-step stationary transition probability of an equivalent discrete-time Markov chain with transition probabilities $P_{ij}$. That is,

$$P\{X(t) = j \mid X(0) = i, N(t) = n\} = P^n_{ij}. \qquad (2)$$

Equation (1) follows from the assumption that the time spent in every state is exponentially distributed with the same rate $\beta$. More specifically, because the sojourn times are identically distributed, the number of transitions by time $t$ carries no information about which states were visited; hence, given $N(t) = n$, the probability of being in state $j$ at time $t$ equals the $n$-step transition probability of the embedded chain. Therefore, Equation (2) can be applied only if all states have identical sojourn-time distributions.

To convert a CTMC with different transition rates into a discrete-time Markov chain, we use uniformization. Suppose the mean sojourn time in state $i$ is $1/\beta_i$ and there exists a finite constant $\beta$ such that $\beta_i \le \beta$ for all $i$. Under the new scheme, we assign the same transition rate $\beta$ to all states $i$ and divide the transition process into two parts: fictitious transitions from a state to itself and transitions to other states. To match the actual process, we let the process remain in each state for an exponential amount of time with rate $\beta$ and define the new transition probabilities as

$$\tilde P_{ij} = \begin{cases} 1 - \dfrac{\beta_i}{\beta}, & j = i,\\[4pt] \dfrac{\beta_i}{\beta}\, P_{ij}, & j \ne i. \end{cases}$$

Applying the new transition probabilities to Equation (1), we obtain

$$P_{ij}(t) = \sum_{n=0}^{\infty} \big(\tilde P^n\big)_{ij}\, e^{-\beta t}\,\frac{(\beta t)^n}{n!}.$$

Figure 1 shows the schematic of a simple uniformization example.

[Figure 1. An illustrative example for the uniformization of a CTMC through the use of fictitious self-transitions: a two-state CTMC with out-of-state rates $\nu(0) = \beta_0$, $\nu(1) = \beta_1$ and jump probabilities $P_{01} = P_{10} = 1$ is converted to an equivalent chain with uniform rate $\beta = \beta_0 + \beta_1$ and transition probabilities $\tilde P_{00} = \tilde P_{10} = \beta_1/\beta$, $\tilde P_{01} = \tilde P_{11} = \beta_0/\beta$.]

In summary, uniformization enables us to convert a CTMC with state-dependent out-of-state transition rates into an analytically equivalent CTMC with uniform transition rates. This new system can be treated as a discrete-time Markov chain for the purposes of analysis [9].
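
To make Equations (1) and (2) and the fictitious-self-transition construction concrete, the following Python sketch uniformizes the two-state birth-death CTMC of Figure 1 and evaluates $P_{ij}(t)$ by truncating the Poisson series. The numerical rates, the truncation level, and the cross-check against the matrix exponential are our own illustrative assumptions, not part of the article.

```python
import numpy as np
from scipy.linalg import expm  # only used to cross-check the series

# Illustrative two-state CTMC (the birth-death chain of Figure 1):
# state 0 leaves at rate beta0, state 1 leaves at rate beta1.
beta_i = np.array([2.0, 5.0])                 # out-of-state rates beta_i
P = np.array([[0.0, 1.0],                     # embedded jump probabilities P_ij
              [1.0, 0.0]])

beta = beta_i.sum()                           # uniform rate, beta_i <= beta for all i

# Uniformized chain: P~_ii = 1 - beta_i/beta, P~_ij = (beta_i/beta) P_ij for j != i
P_tilde = (beta_i / beta)[:, None] * P
np.fill_diagonal(P_tilde, 1.0 - beta_i / beta)

def transient_probs(t, n_max=200):
    """P_ij(t) = sum_n (P~^n)_ij e^{-beta t} (beta t)^n / n!, truncated at n_max."""
    out = np.zeros_like(P_tilde)
    term = np.eye(len(beta_i))                # P~^0
    weight = np.exp(-beta * t)                # Poisson(beta*t) pmf at n = 0
    for n in range(n_max + 1):
        out += weight * term
        term = term @ P_tilde                 # P~^{n+1}
        weight *= beta * t / (n + 1)          # Poisson pmf at n + 1
    return out

t = 0.3
print(transient_probs(t))
# Cross-check against the matrix exponential of the generator Q.
Q = np.diag(-beta_i) + beta_i[:, None] * P
print(expm(Q * t))
```

The two printed matrices agree to within the series truncation error, which is the practical content of Equation (1).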

UNIFORMIZATION IN CONTINUOUS-TIME MARKOV DECISION PROCESSES

In this section, we describe the uniformization process in CTMDPs. We start with the simpler case, where the transition rates are uniform, and then extend the results to the more general setting, where the transition rates are state- and action-dependent.

Uniform Transition Rates

Consider an infinite-horizon discounted CTMDP with the reward (cost) function

$$\lim_{n\to\infty} E_s\!\left[\int_0^{t_n} e^{-\alpha t}\, g(s(t), a(t))\, dt\right],$$

where $t_n$ represents the time of the $n$th transition, $n = 1, 2, \ldots$; $\alpha > 0$ is the continuous-time discount rate; and $g(s(t), a(t))$ is the reward obtained when action $a(t)$ is selected in state $s(t)$. If we let $s_n$ and $a_n$ denote the state and the action selected at time $t_n$, respectively, then $s(t) = s_n$ and $a(t) = a_n$ hold for $t_n \le t < t_{n+1}$. Suppose $g(s(t), a(t))$ consists of two parts: $K(s(t), a(t))$, the lump reward obtained when a new state-action pair is observed, and $C(s(t), a(t))$, the continuous reward accrued while the state is $s(t)$ and action $a(t)$ was taken at the last decision epoch. The state of a CTMDP does not change between decision epochs; therefore, the value of a given policy $\pi$ for a CTMDP, $v^\pi_\alpha$, that is, the total expected discounted reward over the infinite horizon for $\pi$, is calculated as follows:

$$v^\pi_\alpha = E^\pi_s\!\left[\sum_{n=0}^{\infty} e^{-\alpha t_n}\!\left(K(s_n, a_n) + \int_{t_n}^{t_{n+1}} e^{-\alpha(t - t_n)}\, C(s_n, a_n)\, dt\right)\right]. \qquad (3)$$

Let $\tau_{n+1} = t_{n+1} - t_n$ (with $\tau_0 = t_0 = 0$) denote the time the process remains in $s_n$, which is exponentially distributed with parameter $\beta$ for all states. Then, we can rewrite Equation (3) as follows:

$$\begin{aligned}
v^\pi_\alpha &= E^\pi_s\!\left[\sum_{n=0}^{\infty} e^{-\alpha t_n} K(s_n, a_n)\right] + E^\pi_s\!\left[\sum_{n=0}^{\infty} e^{-\alpha t_n} C(s_n, a_n) \int_0^{\tau_{n+1}} e^{-\alpha t}\, dt\right]\\
&= E^\pi_s\!\left[\sum_{n=0}^{\infty} e^{-\alpha(\tau_1 + \cdots + \tau_n)} K(s_n, a_n)\right] + E^\pi_s\!\left[\sum_{n=0}^{\infty} e^{-\alpha(\tau_1 + \cdots + \tau_n)}\, \frac{1}{\alpha}\big(1 - e^{-\alpha\tau_{n+1}}\big)\, C(s_n, a_n)\right]\\
&= \sum_{n=0}^{\infty} \big(E^\pi_s\big[e^{-\alpha\tau_1}\big]\big)^n E^\pi_s\big[K(s_n, a_n)\big] + \sum_{n=0}^{\infty} \big(E^\pi_s\big[e^{-\alpha\tau_1}\big]\big)^n E^\pi_s\big[C(s_n, a_n)\big]\, \frac{1}{\alpha}\big(1 - E^\pi_s\big[e^{-\alpha\tau_1}\big]\big), \qquad (4)
\end{aligned}$$

where Equation (4) follows from the assumption that $\tau_1, \tau_2, \ldots, \tau_{n+1}$ are independent and identically distributed exponential random variables and that $s_n$ is independent of $\tau_1, \tau_2, \ldots, \tau_{n+1}$. Evaluating the expectation of the exponential,

$$E^\pi_s\big[e^{-\alpha\tau_1}\big] = \int_0^{\infty} e^{-\alpha t}\, \beta e^{-\beta t}\, dt = \frac{\beta}{\alpha + \beta} =: \lambda.$$
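
As a small numerical aside (not part of the article), the identity $E[e^{-\alpha\tau}] = \beta/(\alpha+\beta)$ for an exponential sojourn time $\tau$ is easy to confirm by simulation; the values of $\alpha$ and $\beta$ below are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
alpha, beta = 0.1, 4.0            # illustrative discount rate and transition rate

tau = rng.exponential(scale=1.0 / beta, size=1_000_000)   # sojourn times ~ Exp(beta)
mc_estimate = np.exp(-alpha * tau).mean()                  # Monte Carlo E[e^{-alpha*tau}]
closed_form = beta / (alpha + beta)                        # the discount factor lambda

print(f"Monte Carlo: {mc_estimate:.5f}   beta/(alpha+beta): {closed_form:.5f}")
```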

Substituting $\lambda$ into Equation (4), we obtain

$$v^\pi_\alpha = E^\pi_s\!\left[\sum_{n=0}^{\infty} \lambda^n \left(K(s_n, a_n) + \frac{C(s_n, a_n)}{\alpha + \beta}\right)\right] = E^\pi_s\!\left[\sum_{n=0}^{\infty} \lambda^n\, r(s_n, a_n)\right],$$

which has the same form as the value of an equivalent DTMDP if we redefine $r(s_n, a_n) = K(s_n, a_n) + C(s_n, a_n)/(\alpha + \beta)$ as the total expected discounted reward between two decision epochs for the pair $(s_n, a_n)$. Note that this was achieved because the sojourn times in each state are assumed to be independent and identically distributed exponential random variables.

To summarize, a CTMDP with the reward function

$$\lim_{n\to\infty} E_s\!\left[\int_0^{t_n} e^{-\alpha t}\, g(s(t), a(t))\, dt\right]$$

and a transition rate $\beta$ for all states and actions is equivalent to a DTMDP with discount factor $\lambda = \beta/(\alpha + \beta)$, in which the total expected discounted reward between two decision epochs is given by

$$r(s, a) = K(s, a) + \frac{C(s, a)}{\alpha + \beta}, \qquad (5)$$

where the reward functions $K$ and $C$ are defined as above. Let $P(j \mid s, a)$ represent the probability that the state at the next decision epoch will be $j$, given that the state is $s$ and action $a$ is taken at the current decision epoch. Then, the Bellman equations can be rewritten as

$$v(s) = \max_{a \in A_s}\left\{ r(s, a) + \lambda \sum_{j} P(j \mid s, a)\, v(j)\right\}$$

and solved as a DTMDP, where $A_s$ and $v(j)$ represent the set of available actions in state $s$ and the optimal total expected discounted reward that can be obtained when the process starts in state $j$, respectively.

Nonuniform Transition Rates

A major limiting assumption behind the above result is that the transition rates are identical across all states and actions. In this section, we show that by allowing fictitious transitions from a state to itself, as in the previous section, we can extend the results for CTMDPs with uniform transition rates to those with nonuniform transition rates. Let $\beta(s, a)$ denote the transition rate out of state $s$ when action $a$ is taken, and let $\beta$ be a uniform transition rate satisfying $\beta(s, a) \le \beta$ for all $s \in S$ and $a \in A_s$, where $A_s$ represents the action space for state $s$. We can then modify the transition probabilities as

$$\tilde P(j \mid s, a) = \begin{cases} 1 - \dfrac{\beta(s, a)}{\beta}, & j = s,\\[4pt] \dfrac{\beta(s, a)}{\beta}\, P(j \mid s, a), & j \ne s. \end{cases} \qquad (6)$$

By creating fictitious transitions, we create a stochastically equivalent process in which transitions occur more often. For example, when the process is in state $s$, it leaves $s$ at the faster rate $\beta$ but returns to the same state with probability $1 - \beta(s, a)/\beta$. Probabilistically, the new process moves to a different state at the same rate as the original one. As a result of uniformizing the nonidentical transition rates, we can use the results for CTMDPs with uniform transition rates. To summarize, we can analyze a CTMDP with exponential transition rates $\beta(s, a)$, transition probabilities $P(j \mid s, a)$, and reward function

$$\lim_{N\to\infty} E_s\!\left[\int_0^{t_N} e^{-\alpha t}\, g(s(t), a(t))\, dt\right]$$

by converting it into an equivalent DTMDP with discount factor $\lambda = \beta/(\alpha + \beta)$, where $\beta(s, a) \le \beta$ for all $s \in S$, $a \in A_s$, and the transition probabilities are given by Equation (6). The total expected discounted reward between two decision epochs is given by Equation (5).
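
The conversion summarized above is mechanical enough to state as a short routine. The sketch below is a minimal illustration of Equations (5) and (6) under our own assumptions about the data layout and parameter names (beta_sa, P, K, C, alpha); it is not code from the article.

```python
import numpy as np

def uniformize_ctmdp(beta_sa, P, K, C, alpha):
    """Convert a CTMDP into an equivalent DTMDP via uniformization.

    beta_sa : (S, A) array of transition rates beta(s, a)
    P       : (S, A, S) array, P[s, a, j] = P(j | s, a), with P[s, a, s] = 0
    K, C    : (S, A) arrays of lump and continuous rewards
    alpha   : continuous-time discount rate (> 0)
    """
    beta = beta_sa.max()                      # uniform rate, beta(s, a) <= beta
    lam = beta / (alpha + beta)               # discount factor of the equivalent DTMDP

    # Equation (6): fictitious self-transitions.
    P_tilde = (beta_sa / beta)[:, :, None] * P
    s_idx = np.arange(P.shape[0])
    P_tilde[s_idx, :, s_idx] = 1.0 - beta_sa / beta

    # Equation (5): total expected discounted reward between decision epochs.
    r = K + C / (alpha + beta)
    return lam, P_tilde, r
```

Any standard DTMDP algorithm can then be applied to the returned triple (lam, P_tilde, r); a value iteration sketch along these lines appears after the second example below.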

The optimality equation is then written as

$$v(s) = \max_{a \in A_s}\left\{ K(s, a) + \frac{C(s, a)}{\alpha + \beta} + \frac{\beta}{\alpha + \beta} \sum_{j} \tilde P(j \mid s, a)\, v(j)\right\} \qquad (7)$$

$$\;\;= \max_{a \in A_s}\left\{ r(s, a) + \lambda \sum_{j} \tilde P(j \mid s, a)\, v(j)\right\} \qquad (8)$$

and can be analyzed as a DTMDP. It can easily be shown [1] that, after a few simple algebraic manipulations, Equations (7) and (8) can also be written as

$$v(s) = \frac{1}{\alpha + \beta} \max_{a \in A_s}\left\{ (\alpha + \beta)\, K(s, a) + C(s, a) + \big(\beta - \beta(s, a)\big)\, v(s) + \beta(s, a) \sum_{j} P(j \mid s, a)\, v(j)\right\}.$$

The optimality equations given by Equation (8) provide a compact form that is very similar to the conventional optimality equations for DTMDPs and are, therefore, easier to comprehend.

EXAMPLES

In this section, we present two simple examples from queueing systems to illustrate the use of uniformization in continuous-time Markov models. More examples of the application of uniformization to CTMDPs are available elsewhere [2,10,11].

Meeting the Professor. Students come in randomly during Professor Smith's office hours and, on some occasions, find the professor busy with other students, in which case they leave and return later. The interarrival times of the students are independent and identically distributed exponential random variables with rate $\omega$, and it takes an exponential amount of time with rate $\mu$ for Professor Smith to finish with a student. A student arrives at the office and finds the professor busy with another student. We compute the probability that the professor will be available if the student comes back at time $t$. We model the process as a birth-and-death process, where states $0$ and $1$ represent the professor being available and busy with another student, respectively. We could solve a set of differential equations to calculate the probability in question; instead, we solve this problem using uniformization. Note that this problem is essentially an M/M/1/1 queue; the reader can refer to Ref. 9 for an analysis of the model that derives the probability in question. The process has the parameters $\beta_0 = \omega$, $\beta_1 = \mu$, and $P_{01} = P_{10} = 1$. By defining $\beta = \omega + \mu$, we can uniformize the CTMC to obtain

$$\tilde P_{00} = \frac{\mu}{\omega + \mu} = 1 - \tilde P_{01}, \qquad \tilde P_{10} = \frac{\mu}{\omega + \mu} = 1 - \tilde P_{11}.$$

This creates a new transition matrix with identical entries within each column:

$$\tilde P = \begin{pmatrix} \dfrac{\mu}{\omega + \mu} & \dfrac{\omega}{\omega + \mu}\\[6pt] \dfrac{\mu}{\omega + \mu} & \dfrac{\omega}{\omega + \mu} \end{pmatrix} = \tilde P^n, \qquad n = 1, 2, \ldots$$

Hence, using the uniformization result for CTMCs,

$$P_{11}(t) = \sum_{n=0}^{\infty} \tilde P^n_{11}\, e^{-(\omega + \mu)t}\, \frac{((\omega + \mu)t)^n}{n!} = e^{-(\omega + \mu)t} + \frac{\omega}{\omega + \mu} \sum_{n=1}^{\infty} e^{-(\omega + \mu)t}\, \frac{((\omega + \mu)t)^n}{n!} = e^{-(\omega + \mu)t} + \frac{\omega}{\omega + \mu}\big(1 - e^{-(\omega + \mu)t}\big) = \frac{\omega}{\omega + \mu} + \frac{\mu}{\omega + \mu}\, e^{-(\omega + \mu)t}.$$

The required probability is then

$$P_{10}(t) = 1 - P_{11}(t) = \frac{\mu}{\omega + \mu}\big(1 - e^{-(\omega + \mu)t}\big).$$
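
As a quick numerical sanity check of this example, one can truncate the uniformization series for $P_{10}(t)$ and compare it with the closed form just derived; the arrival rate, service rate, and time point below are illustrative choices of ours.

```python
import numpy as np
from math import exp, factorial

omega, mu, t = 3.0, 2.0, 0.5        # illustrative arrival rate, service rate, and time t
beta = omega + mu

# Uniformized transition matrix of the M/M/1/1 example (states: 0 = available, 1 = busy).
P_tilde = np.array([[mu / beta, omega / beta],
                    [mu / beta, omega / beta]])

# P_10(t) from the truncated series sum_n (P~^n)_10 e^{-beta t} (beta t)^n / n!
series = sum(np.linalg.matrix_power(P_tilde, n)[1, 0]
             * exp(-beta * t) * (beta * t) ** n / factorial(n)
             for n in range(60))

closed_form = (mu / beta) * (1.0 - exp(-beta * t))
print(f"series: {series:.6f}   closed form: {closed_form:.6f}")
```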

Professor's Dilemma. Consider a slightly modified version of the above example, in which we model the professor's decision about how fast he should answer a student's questions. Suppose the professor has only three chairs in the office, and students coming to office hours enter only if there is a vacant seat. Every time a student comes in, the professor sets his pace in answering questions so that he can be fair to all who are waiting. His pace, expressed through the time he expects to spend with a student, is exponentially distributed with a mean that lies in the interval $[1/\bar\mu, 1/\underline\mu]$; equivalently, he chooses a service rate $\mu \in [\underline\mu, \bar\mu]$. Each time a student comes in, the professor accrues a reward of $U(s, \mu)$, the immediate utility he obtains when the total number of students in the office is $s$ and he chooses pace $\mu$. The utility depends on the number of people waiting as well as on the pace, which reflects the quality of the time he expects to spend with the students. In addition, he accrues utility at rate $u(s, \mu)$ while he is answering the student's questions, continuously discounted at rate $\alpha$, reflecting the fact that the more time he spends with one student, the less he can spend with others.

We now write the optimality equations for the professor's pace decision problem. We define the state space $S = \{0, 1, 2, 3\}$, representing the number of students in the office. The transition rates are

$$\beta(s, \mu) = \begin{cases} \omega, & s = 0,\\ \omega + \mu, & s = 1, 2,\\ \mu, & s = 3. \end{cases}$$

The maximum possible transition rate is $\beta = \omega + \bar\mu$. The new transition probability matrix for a given $\mu$ is

$$\tilde P = \begin{pmatrix} 1 - \dfrac{\omega}{\beta} & \dfrac{\omega}{\beta} & 0 & 0\\[4pt] \dfrac{\mu}{\beta} & 1 - \dfrac{\omega + \mu}{\beta} & \dfrac{\omega}{\beta} & 0\\[4pt] 0 & \dfrac{\mu}{\beta} & 1 - \dfrac{\omega + \mu}{\beta} & \dfrac{\omega}{\beta}\\[4pt] 0 & 0 & \dfrac{\mu}{\beta} & 1 - \dfrac{\mu}{\beta} \end{pmatrix},$$

where the $(s, j)$ entry of $\tilde P$ represents $\tilde P(j \mid s, a = \mu)$. Uniformizing the decision process using $\beta$ and the above transition matrix, followed by an application of Equation (7), leads to the optimality equations

$$v(s) = \max_{\mu \in [\underline\mu, \bar\mu]}\left\{ U(s, \mu) + \frac{u(s, \mu)}{\alpha + \beta} + \frac{\beta}{\alpha + \beta} \sum_{j} \tilde P(j \mid s, \mu)\, v(j)\right\}.$$

As discussed in the previous section, these equations can be converted to the form of Equation (8) and therefore solved using conventional DTMDP solution techniques such as value iteration, policy iteration, or linear programming (see Total Expected Discounted Reward MDPs: Value Iteration Algorithm, Total Expected Discounted Reward MDPs: Policy Iteration Algorithm, and Linear Programming Formulations of MDPs).
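
To indicate how these optimality equations might be solved in practice, the sketch below runs value iteration on a discretized version of the pace decision. The rates, the forms of $U$ and $u$, the grid of candidate paces, and the convergence tolerance are all illustrative assumptions of ours; only the uniformized birth-death transition structure and the form of Equation (7) come from the article.

```python
import numpy as np

# Illustrative data: arrival rate, pace bounds, discount rate, discretized pace choices.
omega, mu_lo, mu_hi, alpha = 1.0, 0.5, 3.0, 0.1
paces = np.linspace(mu_lo, mu_hi, 11)          # candidate values of mu in [mu_lo, mu_hi]
states = range(4)                               # number of students in the office
beta = omega + mu_hi                            # uniform rate beta = omega + mu_hi
lam = beta / (alpha + beta)                     # DTMDP discount factor

# Illustrative utilities: a lump reward U(s, mu) and a continuous rate u(s, mu).
U = lambda s, mu: 2.0 * np.sqrt(s) / mu if s > 0 else 0.0
u = lambda s, mu: -0.5 * s * mu

def uniformized_row(s, mu):
    """Row of P~( . | s, mu) for the four-state birth-death chain, per Eq. (6)."""
    row = np.zeros(4)
    if s < 3:
        row[s + 1] += omega / beta              # arrival
    if s > 0:
        row[s - 1] += mu / beta                 # service completion
    row[s] += 1.0 - row.sum()                   # fictitious self-transition
    return row

v = np.zeros(4)
for _ in range(500):                            # value iteration on Eq. (7)
    v_new = np.array([
        max(U(s, mu) + u(s, mu) / (alpha + beta)
            + lam * uniformized_row(s, mu) @ v for mu in paces)
        for s in states
    ])
    if np.max(np.abs(v_new - v)) < 1e-9:
        break
    v = v_new

print(np.round(v, 4))
```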

Acknowledgments

This article was supported in part by National Science Foundation grant CMMI. The authors thank Jeffrey Kharoufeh and two anonymous referees for their suggestions and insights, which improved this manuscript.

REFERENCES

1. Heyman DP, Sobel MJ. Stochastic models. New York: Elsevier Science Publications.
2. Puterman ML. Markov decision processes: discrete stochastic dynamic programming. New York: John Wiley & Sons, Inc.
3. Cinlar E. Introduction to stochastic processes. Englewood Cliffs (NJ): Prentice Hall.
4. Howard R. Dynamic programming and Markov processes. Cambridge (MA): MIT Press.
5. Jensen A. Markov chains as an aid in the study of Markov processes. Skand Aktuarietidskr 1953;34(3).
6. Lippman SA. Applying a new device in the optimization of exponential queuing systems. Oper Res 1975;23(4).
7. Veinott AF. Discrete dynamic programming with sensitive discount optimality. Ann Math Stat 1969;40.
8. Serfozo R. An equivalence between discrete and continuous time Markov decision processes. Oper Res 1979;27.
9. Ross SM. Introduction to probability models. New York: Academic Press.
10. Bertsekas DP. Dynamic programming and stochastic control, Vols. 1 and 2. Belmont (MA): Athena Scientific.
11. Walrand J. An introduction to queueing networks. Englewood Cliffs (NJ): Prentice Hall; 1988.


Abstract: Continuous-time Markov decision processes (CTMDPs) may be viewed as a special case of semi-Markov decision processes (SMDPs) in which the intertransition times are exponentially distributed and the decision maker is allowed to choose actions whenever the system state changes. When the transition rates are identical for each state-action pair, one can convert a CTMDP into an equivalent discrete-time Markov decision process (DTMDP), which is easier to analyze and solve. In this article, we describe uniformization, which uses fictitious transitions from a state to itself and hence enables the conversion of a CTMDP with nonidentical transition rates into an equivalent DTMDP. We first demonstrate the use of uniformization in converting a continuous-time Markov chain into an equivalent discrete-time Markov chain, and then describe how it is used in the context of CTMDPs with the discounted reward criterion. We also present examples of the use of uniformization in continuous-time Markov models.

Keywords: MDP; DTMDP; CTMDP; discounted reward; uniformization
