MARKOV DECISION PROCESSES

In studying Markov processes we have up till now assumed that the system, its states and transition probabilities are given in advance. The problem has been to find the stationary probabilities of the system, or possibly the transient evolution of the probabilities starting from a given initial state distribution. From these one can deduce interesting quantities such as blocking or overflow probabilities.

Often, however, one can make choices in the operation of the system: the operation is not completely fixed in advance, and the behaviour of the system depends on the chosen operation policy. The task is then to find an optimal policy which maximizes a given objective function.

Routing problems, for instance, lead to this kind of setting. When the state of the network (calls in progress) is known, the task is to decide, upon the arrival of a call of a given class (defined by the origin and destination points, and possibly other attributes), whether the call is admitted, and if so, along which route it shall be carried. The objective may be to maximize, in the long run, e.g. the number of carried calls or the volume of the carried traffic (call minutes). Similar problems emerge in the context of e.g. buffer management: one has to decide in which order the packets shall be sent, which packets are discarded when the buffer is full, and so on.

Markov decision processes, MDPs

The theory of Markov decision processes studies decision problems of the type described above when the stochastic behaviour of the system can be described as a Markov process. It combines

- dynamic programming (Bellman, 1957)
- the theory of Markov processes (Howard, 1960)

In a Markov process the state of the system, X ∈ S, may jump from state i to state j with a given probability p_{i,j}. How state i has been reached does not influence the next and later transitions.

In a Markov decision process, after each transition, when the system is in a new state, one can make a decision or choose an action, which may incur some immediate revenue or cost and which, in addition, affects the probabilities of the next transition. For instance, when a call is terminated or a new call has just been admitted into the network, one can decide, as long as the system stays in this state, what will be done with a new call that may arrive (reject / accept / which route is chosen). This decision clearly affects which transitions are possible, or more generally, what the probabilities of the different transitions are; the next transition, however, happens stochastically, as the arrival and departure events occur stochastically.

Markov decision processes (continued)

The problem is to find an optimal policy such that the expected revenue is maximized (or the expected cost is minimized; it does not matter which way we formulate the problem).

Under the Markovian assumptions it is clear that the action to be chosen in each state depends only on the state itself. Generally, a policy, optimal or not, defines for each state the action to be chosen.

When an action is associated with each state, which in turn determines the transition probabilities of the next transition, these probabilities depend solely on the state, and the state of the system constitutes a Markov process. Each policy thus defines a different Markov process. Special attention will be given to finding a policy such that the associated Markov process has maximal average revenue.

As with Markov processes in general, Markov decision processes are divided into discrete time and continuous time decision processes.

Discrete time MDPs

The state of the system changes only at discrete points in time, indexed by t = 1, 2, ....

When the system has arrived at state i, one has to decide on an action a, which belongs to the set A_i of possible actions in state i, a ∈ A_i.

Action a incurs an immediate revenue r_i(a). The revenue may also be stochastic; then r_i(a) denotes its expectation.

At the next instant the system moves into a new state j with the transition probability p_{i,j}(a), which depends on the action chosen in state i. The transition probabilities, however, do not depend on how state i has been reached (Markov property). Moreover, we restrict ourselves to time-homogeneous systems, where r_i(a) and p_{i,j}(a) do not depend on the time (index) t.

A policy α defines which action a = a_i(α) is chosen in each state i among the set of possible actions. Then the revenue r_i(a_i(α)) accrued by the visit to state i, as well as the transition probabilities p_{i,j}(a_i(α)), are functions of the policy α and of the state i. For brevity, we denote them by r_i(α) and p_{i,j}(α).

The equilibrium distribution of a discrete time MDP

Given the policy α, the transition probabilities p_{i,j}(α) are fixed. Under general assumptions, the Markov chain defined by these transition probabilities has a stationary (equilibrium) distribution π_i(α).

The equilibrium distribution can be solved, as for any Markov chain, from the balance equations complemented by the normalization condition:

  π_i(α) = Σ_j π_j(α) p_{j,i}(α),    Σ_i π_i(α) = 1,

or, in vector form,

  π(α) = π(α) P(α),    π(α) e^T = 1,

where π(α) = (π_1(α), π_2(α), ...), e = (1, 1, ...), and P(α) is the transition probability matrix with elements p_{i,j}(α).

From these equations one can solve π(α). The solution can be written in the form

  π(α) = e ( P(α) - I + E )^{-1},

where I is the identity matrix and E is a matrix with all elements equal to 1.
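As a numerical illustration, the closed-form solution above can be evaluated directly. Below is a minimal sketch in Python/NumPy (the function name is illustrative, not from the notes); it assumes the chain is such that P - I + E is invertible.

    import numpy as np

    def stationary_distribution(P):
        # pi = e (P - I + E)^{-1}, where E is the all-ones matrix.
        n = P.shape[0]
        E = np.ones((n, n))
        e = np.ones(n)
        return e @ np.linalg.inv(P - np.eye(n) + E)

    # Example: a two-state chain.
    P = np.array([[0.9, 0.1],
                  [0.5, 0.5]])
    print(stationary_distribution(P))   # approximately [0.833, 0.167]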

Average revenue of a discrete time MDP

When the equilibrium distribution π(α) has been solved, one can immediately write the average revenue r(α), i.e. the expected revenue per step,

  r(α) = Σ_i π_i(α) r_i(α) = π(α) r^T(α),

where r(α) = (r_1(α), r_2(α), ...) is the vector of immediate revenues.

Now the task is to find the optimal policy α* which maximizes the average revenue:

  α* = argmax_α r(α),    i.e.    r(α*) ≥ r(α) for all α.

Since the definition of a policy is discrete, we are led to a discrete optimization problem. The solution of such a problem is not quite straightforward, even though, in principle, r(α) can be calculated for each possible policy. To find the optimum, one needs a systematic approach. The following approaches have been introduced in the literature:

1. Policy iteration
2. Value iteration
3. Linear programming

In the sequel we will mainly focus on policy iteration. To this end we are led to study a quantity called the relative value of state i.
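In principle one could evaluate r(α) for every policy by brute force. A minimal sketch, reusing stationary_distribution from above; the data layout (P_a[i][a] = transition row from state i under action a, r_a[i][a] = immediate revenue r_i(a)) is an illustrative convention, not something fixed by the notes.

    import numpy as np

    def average_revenue(policy, P_a, r_a):
        # Build P(alpha), r(alpha) for the given policy and return pi(alpha) . r(alpha).
        n = len(P_a)
        P = np.array([P_a[i][policy[i]] for i in range(n)])
        r = np.array([r_a[i][policy[i]] for i in range(n)])
        pi = stationary_distribution(P)
        return pi @ r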

Relative values of states

The quantity r(α) gives the average revenue per step under the policy α. Now we examine what can be said about the cumulative revenue if we have the additional information that initially the system is in state i. Denote

  V_n(i, α) = the expected cumulative revenue over n steps, when the system starts from state i.

In the first step (initial state) the expected revenue is r_i(α) = e_i r^T(α), where e_i = (0, ..., 0, 1, 0, ..., 0) with the 1 in component i.

After the first step the state probability vector is e_i P(α). Similarly, the expected revenue at the second step is e_i P(α) r^T(α). In general, we have

  V_n(i, α) = e_i ( I + P(α) + P^2(α) + ... + P^{n-1}(α) ) r^T(α).

We know that, irrespective of the initial state, the state probabilities approach the equilibrium distribution: e_i P^n(α) → π(α) as n → ∞.

Relative values of states (continued)

As n increases, the additional terms in V_n(i, α) tend to π(α) r^T(α) = r(α). After a large number of steps the information about the initial state is washed out, and the expected revenue per additional step equals the overall average revenue per step.

V_n(i, α) is the sum of the per-step expected revenues (in the original figure, a sequence of terms over steps 0, 1, 2, ... approaching the level r). Only the initial part of the cumulative revenue depends on the initial state. The total effect of the initial transient can be defined for each state. Define the relative value v_i(α) of state i as

  v_i(α) = lim_{n→∞} ( V_n(i, α) - n r(α) ).

The relative value of state i tells how much greater the expected cumulative revenue over an infinite time horizon is when the system starts from the initial state i (rather than from equilibrium), in comparison with the average cumulative revenue.

Howard's equation

The relative values of the states satisfy the Howard equations

  v_i(α) = r_i(α) - r(α) + Σ_j p_{i,j}(α) v_j(α),    for all i.

By defining v(α) = (v_1(α), v_2(α), ...) the equation can be written in vector form:

  v(α) = r(α) - r(α) e + v(α) P^T(α),

where the first r(α) denotes the revenue vector and the second the scalar average revenue.

The Howard equation (in component form) can be interpreted as follows. Starting from state i:

- The deviation of the revenue accrued at the first step from the average is r_i(α) - r(α); this is accounted for explicitly.
- From that step on, one uses the Markov property: conditioned on the system moving to state j, the deviation of the expected cumulative revenue from the average, from step 2 onwards, is v_j(α).
- Since p_{i,j}(α) is the probability that the system makes a transition to state j, the sum gives the unconditional deviation of the expected cumulative revenue from step 2 onwards.

Remarks on the Howard equation

Comparison of the balance equation and the Howard equation. In the balance equation

  π_j = Σ_i π_i p_{i,j}

the probability of state i is split and pushed forward. In the last term of Howard's equation,

  Σ_j p_{i,j} v_j,

the cumulative revenues of the future paths through the states j are collected backwards. (The original figure illustrates this: probability flows forward from state i along the transitions p_{i,j}, whereas the values v_j are gathered backwards into state i.)

Notice the difference: π P vs. v P^T.

Remarks on the Howard equation (continued)

Since Σ_j p_{i,j} = 1, the Howard equations can be written in the form

  r_i(α) - r(α) + Σ_j p_{i,j}(α) ( v_j(α) - v_i(α) ) = 0,    for all i.

Only the differences v_j(α) - v_i(α) appear in the equation. The relative values v_i(α) are therefore determined only up to an additive constant.

In the sequel the additive constant is unimportant; only the differences matter. We can arbitrarily set e.g. v_1(α) = 0. Then the number of unknown v_i(α) is one less than the number of equations. But r(α) is also unknown; all in all, there are as many unknowns as there are equations, and r(α) is solved along with the others.
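The Howard equations together with the normalization v_1(α) = 0 thus form a linear system in the unknowns v_2(α), v_3(α), ... and r(α). A minimal solver sketch (illustrative function name; index 0 plays the role of state 1, so v[0] is pinned to zero, and the chain is assumed irreducible so the system is nonsingular):

    import numpy as np

    def solve_howard(P, r):
        # Solve v_i = r_i - g + sum_j p_ij v_j with v[0] = 0;
        # unknowns: v[1], ..., v[n-1] and the average revenue g.
        n = len(r)
        A = np.zeros((n, n))
        M = np.eye(n) - P
        A[:, :n-1] = M[:, 1:]     # columns for v[1..n-1] (v[0] is fixed to 0)
        A[:, n-1] = 1.0           # column for g
        x = np.linalg.solve(A, np.asarray(r, dtype=float))
        v = np.concatenate(([0.0], x[:n-1]))
        g = x[n-1]
        return v, g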

Remarks on the Howard equation (continued)

The value r(α), obtained as a solution of the Howard equations (along with the values v_2(α), v_3(α), ...), is automatically equal to π(α) r^T(α). This can be seen by multiplying the vector-form Howard equation from the right (as a dot product) by π^T(α). Suppressing the dependence on α, the vector form reads v = r - r e + v P^T, where the first r is the revenue vector and the coefficient of e is the scalar average revenue. Multiplying from the right by π^T and using e π^T = 1 and P^T π^T = (π P)^T = π^T gives

  v π^T = r π^T - r + v π^T,

from which r = r π^T, i.e. the average revenue solved from the Howard equations equals the equilibrium average π(α) r^T(α).

Policy iteration

Howard's equation determines the relative values of the states, v_i(α), when the policy α is given. The policy can be improved by choosing the action a in each state i as follows:

  a_i' = argmax_a { r_i(a) - r(α) + Σ_j p_{i,j}(a) v_j(α) }.

The idea is that a single decision is made by maximizing the expected revenue, taking into account the immediate revenue of the action and its influence on the next transition, but assuming that from that point on all decisions are made according to the old policy α.

By choosing the action a_i' in each state as defined above, we arrive at a new policy α'. With the new policy α' one can (at least in principle) solve the average revenue r(α') and the relative values of the states v_i(α'). One can show that the resulting new policy α' is never worse than the policy α one started with, i.e. r(α') ≥ r(α).

In policy iteration this improvement step is repeated until nothing changes. In general, policy iteration converges quickly.
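A compact sketch of the whole loop, reusing solve_howard from above. The data layout (P_a[i][a] = transition row for action a in state i, r_a[i][a] = immediate revenue) and the function name are illustrative, not something fixed by the notes.

    import numpy as np

    def policy_iteration(P_a, r_a, max_iter=100):
        n = len(P_a)
        policy = [0] * n                              # arbitrary initial policy
        v, g = None, None
        for _ in range(max_iter):
            P = np.array([P_a[i][policy[i]] for i in range(n)])
            r = np.array([r_a[i][policy[i]] for i in range(n)])
            v, g = solve_howard(P, r)                 # relative values and average revenue
            # Improvement step: pick the maximizing action in each state
            # (the constant -g does not affect the argmax).
            new_policy = [
                max(range(len(P_a[i])),
                    key=lambda a, i=i: r_a[i][a] + float(np.dot(P_a[i][a], v)))
                for i in range(n)
            ]
            if new_policy == policy:                  # nothing changes: stop
                break
            policy = new_policy
        return policy, g, v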

Value iteration

In value iteration we consider the cumulative revenue starting from initial state i. Now it is convenient to index time so that the initial time has index n. The last time instant is denoted by index 0, and we set the terminal values V_0(i) = 0 for all i.

Determining the optimal policy (with terminal revenues V_0(i) = 0) is done as in dynamic programming, by proceeding from the last instant of time, 0, backwards to the initial time n. Which action should be selected at time n, when the system is in state i, and what is the corresponding expected cumulative revenue V_n(i) under the optimal policy? This is solved recursively, assuming that the problem has already been solved for time n - 1 and that the expected cumulative revenue from that point up to the end, V_{n-1}(i), is known for all states i. A recursion step is defined by the equation (the recursion starts from V_0(i) = 0 for all i)

  V_n(i) = max_a { r_i(a) + Σ_j p_{i,j}(a) V_{n-1}(j) }.

The expression in brackets is the expected revenue, given that at time n the system is in state i, that action a is chosen, and that from that point on the policy is optimal. At time n, in state i, the optimal action a is the one which maximizes the expression in the brackets. The value of the maximum is the expected cumulative revenue when at each step (n, n-1, n-2, ..., 1) an optimal action is taken.
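One backward recursion step is straightforward to write down. The sketch below uses the same illustrative P_a / r_a layout as before and returns both the new values and the maximizing actions.

    import numpy as np

    def value_iteration_step(V_prev, P_a, r_a):
        # V_n(i) = max_a { r_i(a) + sum_j p_ij(a) V_{n-1}(j) }
        n = len(P_a)
        V = np.zeros(n)
        actions = [0] * n
        for i in range(n):
            candidates = [r_a[i][a] + float(np.dot(P_a[i][a], V_prev))
                          for a in range(len(P_a[i]))]
            actions[i] = int(np.argmax(candidates))
            V[i] = candidates[actions[i]]
        return V, actions

    # Iterating from the terminal values V_0 = 0:
    # V = np.zeros(n_states)
    # for _ in range(num_steps):
    #     V, actions = value_iteration_step(V, P_a, r_a)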

Value iteration (continued)

In value iteration, the optimal policy and the relative values of the states are determined in parallel. As n grows, and the time is farther and farther from the terminal point:

- The action defined by the value iteration becomes independent of the time n and depends solely on the state i; the selection of that action corresponds to the optimal stationary policy α*.
- The expected revenues tend to V_n(i) → v_i(α*) + n r(α*) + c, where c is some constant.

When this form is inserted back into the value iteration equation, one obtains

  v_i(α*) = max_a { r_i(a) - r(α*) + Σ_j p_{i,j}(a) v_j(α*) },

which is the optimality condition jointly for the policy α*, the relative values v_i(α*) and the average revenue r(α*). The optimal policy α* is the policy which in each state i selects the action a realizing the maximum. The maximizing action depends on the relative values v_i, which in turn, as defined by the equation, depend on which action is the maximizing one.

Value iteration (continued)

This mutual dependence of the optimal policy and the associated relative values in the value iteration equation is not a problem in practice: the recursive calculation of the cumulative expected revenues V_n(i) with the value iteration equation is very easy.

In policy iteration, determining the policy and determining the relative values are separated: the action in each state is the one prescribed by the given policy α (no maximization), whence we are left with the Howard equations to determine the relative values. With these relative values, a new policy is then determined by the maximization.

Comparison of policy and value iteration. Though policy iteration may look more complicated, it is more efficient: the relative values associated with a given policy are computed once and for all by solving the linear set of Howard equations. In value iteration, even the solution of this linear system is effectively done iteratively, which is slow (and the solution is interleaved with the policy optimization). Value iteration therefore needs many more iterations.

Continuous time MDPs

The foregoing considerations can be adapted in a straightforward way to continuous time Markov decision processes.

In each state i a certain action a is chosen, depending solely on the current state. State i together with the chosen action a determines the revenue rate r_i(a) as well as the transition rates q_{i,j}(a) to the other states j.

A policy α defines the choice of the action a for each state i, a = a_i(α), whence the revenue rate and the transition rates are functions of the state and the policy. For brevity, we again denote these by r_i(α) and q_{i,j}(α).

The relative values of the states, v_i(α), for a given policy α are again determined by the Howard equations, which, analogously with the earlier form, read

  r_i(α) - r(α) + Σ_{j≠i} q_{i,j}(α) ( v_j(α) - v_i(α) ) = 0,    for all i,

where
  v_i(α) = the relative value of state i,
  r_i(α) = the revenue rate in state i,
  r(α) = the average revenue rate of policy α.

Howard equation for a continuous time MDP

The continuous time Howard equation can also be written in vector form. To this end, we first rewrite the sum in the previous equation:

  Σ_{j≠i} q_{i,j}(α) ( v_j(α) - v_i(α) ) = Σ_{j≠i} q_{i,j}(α) v_j(α) - ( Σ_{j≠i} q_{i,j}(α) ) v_i(α) = Σ_j q_{i,j}(α) v_j(α),

since q_{i,i}(α) = -Σ_{j≠i} q_{i,j}(α). We obtain

  r(α) - r(α) e + v(α) Q^T(α) = 0,

where the first r(α) is the vector of revenue rates and the second the scalar average revenue rate. Q is the transition rate matrix, Q = (q_{i,j}), formed by the transition rates q_{i,j}. The policy α explicitly determines the transition rate matrix Q(α); v(α) and r(α) are then determined by the Howard equation.

For comparison, recall that the equilibrium probabilities π(α) of policy α are determined by the balance condition π(α) Q(α) = 0 (note the difference: Q vs. Q^T).
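Together with a pinning condition such as v_1(α) = 0, this vector equation is again just a linear system. A sketch of a solver under that assumption (illustrative function name; index 0 plays the role of state 1, and the chain is assumed irreducible):

    import numpy as np

    def solve_howard_ct(Q, r):
        # Solve r_i - g + sum_j q_ij v_j = 0 with v[0] = 0;
        # unknowns: v[1], ..., v[n-1] and the average revenue rate g.
        n = len(r)
        A = np.zeros((n, n))
        A[:, :n-1] = Q[:, 1:]     # columns for v[1..n-1] (v[0] is fixed to 0)
        A[:, n-1] = -1.0          # column for g
        x = np.linalg.solve(A, -np.asarray(r, dtype=float))
        v = np.concatenate(([0.0], x[:n-1]))
        g = x[n-1]
        return v, g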

Remarks on the Howard equation

The v_i(α) are determined up to an additive constant: if c e is added to v(α), where c is a constant, the equation remains satisfied, since e Q^T = (Q e^T)^T = 0 (the row sums of Q are zero). One can set e.g. v_1(α) = 0, and then v_2(α), v_3(α), ... and r(α) can be solved from the equation.

The solution r(α) thus obtained is automatically the same as the average revenue rate r(α) π^T(α). This can be seen by multiplying the Howard equation from the right by π^T:

  r π^T - r (e π^T) + v (Q^T π^T) = 0.

Since e π^T = 1 and Q^T π^T = (π Q)^T = 0, this gives r = r π^T, i.e. the scalar r equals the equilibrium average of the revenue rate vector.

The relative values of states

The relative value v_i of state i again represents the difference between the expected cumulative revenue (over an infinite time horizon) accrued starting from the initial state i and the average cumulative revenue.

Starting from state i, the state probability vector is initially π(0) = e_i (the i-th component is one, the others are zero). From then on, the time-dependent state probability vector evolves according to

  d/dt π(t) = π(t) Q,    i.e.    π(t) = π(0) e^{Qt},

and the revenue rate at time t is then r π^T(t) = r e^{Q^T t} e_i^T.

Let V_i(t) be the cumulative revenue (the integral of the revenue rate over time) in the interval (0, t) starting from state i, and let V(t) be the vector formed by these values. It is easy to see that

  V(t) = r ∫_0^t e^{Q^T u} du,    and therefore    v = lim_{t→∞} ( V(t) - r t e ),

where the r in the subtracted term is the scalar average revenue rate; being a multiple of e, this term does not affect the Howard equation. Now we show that in the limit t → ∞ the term V(t) does indeed satisfy the Howard equation:

  V(t) Q^T = r ( ∫_0^t e^{Q^T u} du ) Q^T = r [ e^{Q^T u} ]_0^t = r ( e^{Q^T t} - I ) → (r π^T) e - r    as t → ∞,

since e^{Q^T t} → π^T e (each row of e^{Qt} tends to π). Hence

  r - (r π^T) e + V(t) Q^T → 0    when t → ∞,

which is the Howard equation with average revenue rate r π^T; thus v satisfies the Howard equation.

Policy iteration

Policy iteration starts with some policy α and solves from the Howard equation the related relative values of the states v_i(α) and the average revenue rate r(α). Then a new policy is determined by choosing in each state i the action a which realizes the maximum

  max_a { r_i(a) - r(α) + Σ_{j≠i} q_{i,j}(a) ( v_j(α) - v_i(α) ) }.

These choices define a new policy α'. Using the transition rate matrix of this new policy, Q(α'), one solves from the Howard equation new relative values v_i(α') and r(α'), and determines a new policy again by the above maximization. The iteration is continued until nothing changes.

Value iteration

In value iteration one considers the cumulative revenue V_i(t) in a finite interval (-t, 0), when initially, at time -t, the system is in state i. The index t measures the time backwards from the terminal time 0. At the terminal time t = 0 the cumulative revenues have the terminal values V_i(0) = 0 for all i.

The recursion progressing backwards in time takes, in continuous time, the form of a differential equation:

  d/dt V_i(t) = max_a { r_i(a) + Σ_{j≠i} q_{i,j}(a) ( V_j(t) - V_i(t) ) }.

This defines both the optimal choice of the action in each state i (at time -t) and the expected cumulative revenues V_i(t) associated with the optimal policy.
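The differential equation can be integrated numerically, for instance with crude Euler steps. A sketch under illustrative assumptions (Q_a[i][a] is a full rate row including the diagonal element, so that the dot product equals the sum over j ≠ i of q_{i,j}(a)(V_j - V_i); the function name and step size are illustrative):

    import numpy as np

    def value_iteration_ct(Q_a, r_a, t_max, dt=1e-3):
        # Euler integration of dV_i/dt = max_a { r_i(a) + sum_j q_ij(a) V_j },
        # starting from the terminal values V_i(0) = 0.
        n = len(Q_a)
        V = np.zeros(n)
        for _ in range(int(t_max / dt)):
            dV = np.array([max(r_a[i][a] + float(np.dot(Q_a[i][a], V))
                               for a in range(len(Q_a[i])))
                           for i in range(n)])
            V = V + dt * dV
        return V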

Value iteration (continued)

When the value iteration equation is integrated far enough backwards in time (up to a large t), one reaches a stationary regime, where the chosen action depends solely on the state i (not on the time t), and this choice corresponds to the optimal policy α*. Correspondingly, the cumulative revenue grows at a constant rate,

  V_i(t) = v_i(α*) + r(α*) t + c,

where c is some constant. When this is inserted into the value iteration equation, it takes the form

  max_a { r_i(a) - r(α*) + Σ_{j≠i} q_{i,j}(a) ( v_j(α*) - v_i(α*) ) } = 0,

which is an optimality condition jointly for the policy α*, the relative values v_i(α*) and the average revenue rate r(α*).

In practice, it is easiest either to work with the cumulative revenues and solve the differential equation far enough, or to separate the determination of the values and of the policy as is done in policy iteration.

Example 1. Relative values in the M/M/∞ system

As the policy we take free access: each arriving call is admitted (gets a server/trunk of its own). Suppose that revenue is accrued from each ongoing call at a constant rate 1 (per-minute charging, e.g. cents per minute). When the system is in state n (n calls in progress), the revenue rate is n.

The continuous time Howard equation for state n can be written directly:

  n - a + λ ( v_{n+1} - v_n ) + μ n ( v_{n-1} - v_n ) = 0.

Here we have exploited the knowledge that the average revenue rate equals the average number of calls in progress, which is a = λ/μ.

Denote u_n = μ v_n. Then the equation becomes

  n - a + a ( u_{n+1} - u_n ) + n ( u_{n-1} - u_n ) = 0,

i.e.

  a ( u_{n+1} - u_n - 1 ) = n ( u_n - u_{n-1} - 1 ).

The solution of this equation is u_{n+1} - u_n = 1, i.e. u_n = n + c, n = 0, 1, ...

Example 1. (continued)

The value of the constant c (unimportant) is fixed by the requirement Σ_n π_n u_n = 0:

  Σ_n π_n n + c Σ_n π_n = a + c = 0,    whence    c = -a.

The solution is

  u_n = n - a.

The result is easy to understand physically. The expected occupancy m(t) = E[N(t)] in the M/M/∞ system obeys (irrespective of the initial distribution) the equation

  d/dt m(t) = λ - μ m(t),    whence    m(t) = ( m(0) - a ) e^{-μt} + a.

(The original figure shows m(t) relaxing from m(0) towards the level a on the time scale 1/μ.) Then, starting from any state n (implying m(0) = n), the cumulative revenue, in comparison with the average revenue, is

  ∫_0^∞ ( m(t) - a ) dt = ( m(0) - a ) ∫_0^∞ e^{-μt} dt = (n - a) / μ.
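As a quick sanity check (with purely illustrative parameter values), one can verify numerically that u_n = n - a satisfies the recursion derived above:

    # Check that u_n = n - a satisfies
    # (n - a) + a (u_{n+1} - u_n) + n (u_{n-1} - u_n) = 0 for the M/M/inf example.
    lam, mu = 3.0, 1.5          # illustrative values
    a = lam / mu
    u = lambda n: n - a
    for n in range(0, 10):
        residual = (n - a) + a * (u(n + 1) - u(n)) + n * (u(n - 1) - u(n))
        assert abs(residual) < 1e-12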

Example 2. The relative values of states in an M/M/1 queue

In developing methods for minimizing delays in systems involving M/M/1 queues, it is convenient to consider costs (as opposed to revenues). As the total cost incurred per customer we take the total time the customer spends in the system. Then, in state n (customers in the system), the cost rate is n.

In a queueing system one wishes to minimize ∫ m(t) dt, where m(t) = E[N(t)]. In a loss system the corresponding integral represents the expected amount of carried traffic, and one wishes to maximize its value. Minimizing the delay and minimizing the blocking are thus opposite objectives.

It is easy to write the Howard equation (with the free access policy):

  n - ρ/(1 - ρ) + λ ( v_{n+1} - v_n ) + μ 1_{n>0} ( v_{n-1} - v_n ) = 0,

where we have exploited the fact that the average cost rate equals the average number in the system, which we know is ρ/(1 - ρ), ρ = λ/μ. (It is also possible to solve the equation without this knowledge.)

Example 2. (continued)

Denote u_n = μ v_n. Then the equation reads

  n - ρ/(1 - ρ) + ρ ( u_{n+1} - u_n ) + 1_{n>0} ( u_{n-1} - u_n ) = 0,

which can be rearranged as

  ρ ( u_{n+1} - u_n - (n + 1)/(1 - ρ) ) = 1_{n>0} ( u_n - u_{n-1} - n/(1 - ρ) ).

It follows that

  u_{n+1} - u_n = (n + 1)/(1 - ρ),    i.e.    v_{n+1} - v_n = (n + 1)/(μ - λ),

and hence

  v_n = n(n + 1) / ( 2(μ - λ) ),    when one sets v_0 = 0.

Physical interpretation: the expected occupancy m(t) = E[N(t)] in an M/M/1 system behaves approximately as sketched in the original figure. If the initial occupancy m(0) = n is large, m(t) first decreases roughly linearly as m(0) - (μ - λ)t, reaching the equilibrium level ρ/(1 - ρ) after a time of about m(0)/(μ - λ). The area of the resulting triangle, roughly (1/2) m(0) · m(0)/(μ - λ), is a quadratic function of the initial occupancy.
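Again a quick numerical check (illustrative parameter values) that u_n = n(n+1)/(2(1 - ρ)) satisfies the recursion above:

    # Check that u_n = n(n+1) / (2(1 - rho)) satisfies
    # n - rho/(1-rho) + rho (u_{n+1} - u_n) + 1_{n>0} (u_{n-1} - u_n) = 0.
    lam, mu = 1.0, 2.0          # illustrative values, rho = 0.5
    rho = lam / mu
    u = lambda n: n * (n + 1) / (2 * (1 - rho))
    for n in range(0, 10):
        residual = (n - rho / (1 - rho)
                    + rho * (u(n + 1) - u(n))
                    + (1 if n > 0 else 0) * (u(n - 1) - u(n)))
        assert abs(residual) < 1e-12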

Example 3. Optimal routing in the case of two parallel M/M/1 queues

Packets arrive according to a Poisson process at rate λ. An arriving packet can be routed to either of two queues, with service rates μ_1 and μ_2. The occupancies n_1 and n_2 are assumed to be known. The task is to find a routing policy such that, in the long run, the delay of the packets in the system is minimized.

Let us start with the following basic policy (policy 0): an arriving packet is directed to queue 1 with probability p and to queue 2 with probability 1 - p. The queues then receive Poisson streams with intensities λ_1 = pλ and λ_2 = (1 - p)λ. The queues are independent M/M/1 queues, and the average delay in the system is

  p / (μ_1 - pλ) + (1 - p) / (μ_2 - (1 - p)λ).

One can perform a static optimization with respect to the parameter p. Denoting x = √(μ_2/μ_1) (assuming μ_1 ≥ μ_2, so that x ≤ 1), the expression is minimized by p = p*, where

  p* = 1,    if λ ≤ (1 - x) μ_1,
  p* = 1/(1 + x) + ( x/(1 + x) ) (1 - x) μ_1 / λ,    if λ > (1 - x) μ_1.

Example 3. (continued)

Let us improve the basic policy by making one policy iteration. The consequences of each routing decision are estimated using the relative values of the states calculated under policy 0. Since under policy 0 the queues are independent M/M/1 queues, the relative values are as derived in Example 2.

The cost difference between routing alternatives 1 and 2 is

  Δ = (n_2 + 1)/(μ_2 - λ_2) - (n_1 + 1)/(μ_1 - λ_1),

where the first term is the increase in the cumulative costs if the packet is directed to queue 2, i.e. v_{n_2+1} - v_{n_2} calculated for queue 2, and the latter term is the corresponding quantity for queue 1. If Δ ≥ 0, it is advantageous to put the packet in queue 1, and if Δ < 0, it is advantageous to put the packet in queue 2.

The decision line, which separates the occupancy regions corresponding to the different routing choices, is

  n_2 = ( (μ_2 - λ_2)/(μ_1 - λ_1) ) (n_1 + 1) - 1;

above the line (region "1" in the original figure) the packet is sent to queue 1, below it (region "2") to queue 2. In the JSQ policy (join the shortest queue) the decision line is along the diagonal. The JSQ policy is not optimal. Neither does the decision line obtained here define the true optimal policy; it is just the result of the first policy iteration.
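The decision rule from this first policy iteration is simple enough to state as code (a sketch; the function name and argument order are illustrative):

    def route_first_iteration(n1, n2, lam1, lam2, mu1, mu2):
        # Compare the relative-cost increments v_{n+1} - v_n = (n + 1)/(mu - lam)
        # of the two queues under the static base policy (Example 2) and send
        # the packet to the queue where admitting it is cheaper.
        cost1 = (n1 + 1) / (mu1 - lam1)
        cost2 = (n2 + 1) / (mu2 - lam2)
        return 1 if cost1 <= cost2 else 2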

Example 3. (continued)

One can continue the iteration only by solving the Howard equations numerically. The figure in the original notes shows the decision regions obtained by truncating the state space to n_1 < 20 and n_2 < 20 (only part of the region shown) and solving the Howard equations numerically for the parameters λ = 1, μ_1 = 2, μ_2 = 1; the decision lines of the JSQ policy and of the first policy iteration are drawn for comparison. The iteration converged to a fixed point in four rounds. The final optimal policy is indicated by the colouring of the points: a green (light) point means the packet is directed to queue 2, a red (dark) point means it is directed to queue 1.

According to the optimal static policy (p = 0.828), the mean delay of a packet in the system is 0.914. With the JSQ policy the value is 0.853. The policy obtained by the first policy iteration (starting from the optimized static policy) brings the value down to 0.730. This is quite close to the value 0.724 of the optimal policy.

Example 4. Relative costs of states in an M/M/m/m system

In Example 1 we considered an infinite server system, M/M/∞, and calculated the relative values of the states, using the amount of carried traffic as the measure of revenue. Now we consider a finite capacity system M/M/m/m, i.e. the Erlang loss system with m servers (trunks). All calls are accepted as long as there is free capacity available.

We could again calculate the relative values of the states measured by the amount of carried traffic. However, here it is more convenient to consider the relative costs of the states, measured by the amount of blocked traffic. Technically, there is a slight difference in the calculations:

- in calculating the revenues, the revenue rate in state n is n, for all n;
- in calculating the costs of loss, the cost rate in the blocking state n = m is λ, and in the other states, n < m, the cost rate is 0; when the system is in the blocking state n = m, the expected rate of blocked calls is λ. (In fact, to make this comparable with the revenue consideration, λ should be multiplied by the expected revenue per call, which with per-minute charging equals the mean holding time times the charge per minute; the constant factor, however, is inconsequential and will be omitted in the sequel.)

In practice it does not matter which formulation (revenues / costs) is used; maximizing the carried traffic is equivalent to minimizing the lost traffic.

Example 4. (continued)

On the basis of what was just said, one can write the Howard equations:

  λ - r + μ m ( v_{m-1} - v_m ) = 0    (last state, no upward transitions),
  -r + λ ( v_{n+1} - v_n ) + μ n ( v_{n-1} - v_n ) = 0,    n = 0, ..., m - 1.

Here r is the average cost rate. In the equations one can set v_0 = 0, and then solve them for r and v_1, ..., v_m. In advance, though, we already know that r = λ E(m, a), where E(m, a) is the Erlang blocking probability. The solution of the equations is left as an exercise.

It is, however, instructive to note that the solution of this problem can be obtained by a direct deduction. In particular, we will deduce the difference Δ_n = v_{n+1} - v_n. According to the definition given before, we can write

  Δ_n = lim_{t→∞} ( V_{n+1}(t) - V_n(t) ),

where V_n(t) = E[ number of blocked calls in (0, t), when at time 0 the system is in state n ].

Example 4. (continued)

Consider the quantities V_n(t) and V_{n+1}(t) with reference to the sample paths sketched in the original figures (one path starting from state n, one from state n + 1). When the system starts from state n, it takes some time before it first reaches state n + 1. Denote this first passage time by t_n*. In the interval (0, t_n*) no blocking can occur. Once the system has reached state n + 1, it is in precisely the same situation as the system which started from state n + 1 (Markov property!). From then on, over periods of equal duration, the two systems are statistically indistinguishable, and the expected numbers of blocked calls in them are equal.

Thus we deduce that V_{n+1}(t) - V_n(t) equals the expected number of blocked calls in the interval (t - t_n*, t) in the system which started from state n + 1.

When t → ∞, the initial state no longer affects the behaviour of the system in the interval (t - t_n*, t); the system is in equilibrium, whence the blocking probability is E(m, a). Since the expected number of arrivals in this interval is λ E[t_n*], we get

  Δ_n = λ E[t_n*] E(m, a).

Example 4. (continued)

We still need the value of E[t_n*]. To this end, consider a system with capacity n. Immediately after a blocking event this system is in state n. The next blocking occurs at the instant when the system would have moved to state n + 1. The time between blockings in this limited capacity system is thus distributed as t_n* in the larger system (the first passage time from state n to state n + 1). Hence

  E[t_n*] = E[ time between blockings ] = 1 / ( λ E(n, a) ),

since λ E(n, a) is the blocking frequency in a system with capacity n. (The original figure shows a sample path of the capacity-n system with the arrival instants of blocked calls marked.)

Inserting this into the previous equation, we finally end up with the simple and beautiful result

  Δ_n = v_{n+1} - v_n = E(m, a) / E(n, a).

Since E(n, a) ≥ E(m, a) for n ≤ m, it follows that Δ_n ≤ 1. Recall that this quantity tells the expected increase in the number of blocked calls in an M/M/m/m system which starts from the initial state n + 1, in comparison with a system which starts from state n. If the system is in state n and a call is offered, then Δ_n is the expected future cost of accepting the call. This result is used in the following to find an optimal routing policy.
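The quantities in this result are easy to compute. A small sketch using the standard Erlang B recursion (function names are illustrative):

    def erlang_b(m, a):
        # Erlang blocking probability E(m, a) via the recursion
        # E(n, a) = a E(n-1, a) / (n + a E(n-1, a)), with E(0, a) = 1.
        E = 1.0
        for n in range(1, m + 1):
            E = a * E / (n + a * E)
        return E

    def delta(n, m, a):
        # Relative cost increment Delta_n = v_{n+1} - v_n = E(m, a) / E(n, a).
        return erlang_b(m, a) / erlang_b(n, a)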

Dynamic state dependent routing in a circuit switched network

Consider the basic routing problem in the triangular network with nodes A, B and C shown in the original figure: link 1 connects A and B directly, while links 2 and 3 (via node C) form the alternative route. The capacities of the links C_j, the offered traffic intensities a_j, as well as the instantaneous occupancies n_j are assumed to be known:

  C_j = capacity of link j,
  n_j = occupancy of link j,
  a_j = traffic intensity offered to link j.

The basic policy is that calls are carried only on the direct links, and are always accepted as far as the capacity allows. Then we have three separate loss systems, and for each of them we can calculate the relative costs of the states as was done in Example 4.

Now we wish to find, using the first policy iteration, what would be a good policy for using alternative routes. Consider the problem from the point of view of the calls offered to link 1. The problem is then the following: a call is offered to link 1 when the link occupancies are n_1, n_2 and n_3. Should the call be admitted or rejected? If admitted, on which route should the call be carried:

- the direct route AB (link 1), or
- the alternative route ACB (links 2 + 3)?

Dynamic state dependent routing (continued)

The objective of the optimization is to maximize the number of carried calls, that is, to minimize the number of blocked calls (over an infinite time horizon). Using the alternative route consumes more of the resources of the network (it occupies a trunk on two links) and can potentially increase the blocking of future arriving calls. In advance, it is not at all clear under what conditions it is advisable to use the alternative route.

In the first policy iteration, each individual decision is made by minimizing the total costs as they appear if, after the decision, all future decisions are made according to the basic policy. By the result of Example 4, if a new call is accepted on link j, this implies an expected increase of E(m_j, a_j)/E(n_j, a_j) in the number of future blocked calls (here m_j = C_j is the capacity of link j).

Policy: The cost of accepting the call on the direct link is always < 1 when there is capacity to admit the call, i.e. when n_1 < m_1, whereas the revenue from carrying the call is 1 (one blocked call avoided); it is therefore always beneficial to admit the call on the direct link whenever possible. If the direct link is fully occupied, we have to consider the sum of the costs incurred on the two links of the alternative route. The use of the alternative route is beneficial if the following condition is satisfied:

  E(m_2, a_2)/E(n_2, a_2) + E(m_3, a_3)/E(n_3, a_3) < 1.
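Using the helpers sketched after Example 4, this decision rule of the first policy iteration can be written directly (illustrative function name):

    def admit_on_alternative_route(n2, n3, m2, m3, a2, a3):
        # Use the two-link alternative route only if the expected increase in
        # future blocked calls on links 2 and 3 together stays below 1, the
        # cost of blocking the offered call outright.
        return delta(n2, m2, a2) + delta(n3, m3, a3) < 1.0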

Dynamic state dependent routing (continued)

If the links of the alternative route are highly occupied, the terms on the left-hand side of the condition

  E(m_2, a_2)/E(n_2, a_2) + E(m_3, a_3)/E(n_3, a_3) < 1

are close to 1 and their sum can exceed 1. The condition thus defines a kind of dynamic trunk reservation principle: one has to leave enough free capacity for fresh single-link traffic on the links of the alternative route.

The original figure shows the behaviour of the function Δ_n for a system with capacity C = 30 and offered traffic a = 10, 20 and 30. Assuming that both links of the alternative route are identical (m = 30), one sees that typically the links of the alternative route must have a few trunks of free capacity (so that Δ_n falls to about 0.5 or below) in order for the use of the route to be beneficial: about 2 trunks if a = 20, and about 6 trunks if a = 30.

Remarks on dynamic state dependent routing

The above policy is not the ultimate optimal policy, but the result of the first policy iteration. The true optimal policy would be obtained if, in making the decisions, the future impact of each decision were assessed using the optimal policy itself (which is not yet known).

Implementation of dynamic state dependent routing is technically demanding, since it requires full knowledge of the state of the system and also of the arrival intensities. In practice one can apply the more robust trunk reservation, where a fixed amount of capacity is reserved for the fresh direct traffic.