Some notes on Markov Decision Theory

Nikolaos Laoutaris, laoutaris@di.uoa.gr, January 2004

Markov Decision Theory [1, 2, 3, 4] provides a methodology for the analysis and optimization of probabilistic sequential decision processes over an infinite or finite planning horizon.

Queueing Theory + Markov Processes: model a probabilistic system in order to evaluate its performance.

Markov Decision Theory: goes one step further; design the operation of a probabilistic system so as to optimize its performance.

Probabilistic: there exists an environment which cannot be described in full detail, so it is taken to be random (stochastic), following some probability law that best describes its nature. It is in this environment that our agent operates.

Sequential: the theory aims at providing tools for optimizing behaviors, i.e., sequences of decisions, not single decisions.

Planning horizon: when deciding our behavior we must take into consideration the length of our intended activity. A poker player behaves differently when opening a new game than when finishing one (having secured a profit or suffered a loss). Behaviors depend on whether we will participate in an activity for a finite amount of time and then abandon it, or have decided to be involved in it permanently.

Queueing theory vs. decision theory: in a queueing model (say an M/M/1 queue) everything is fixed. The (stochastic) arrival process and the (stochastic) service process are parts of the environment in which we study some performance metric of interest (e.g., the expected queueing delay). An application of decision theory to queues (an M/X/1 queue, where X is an unknown service policy that we want to design and optimize) would consider the arrival process as the only element of the environment. Decision theory provides the tools to design and optimize the service process so as to achieve a desired goal (e.g., avoid underflows or overflows).

A Markov Decision Process (MDP) is a discrete Markov process $\{I_n\}_{n \ge 0}$ characterized by a tuple $\langle S, A, P, C \rangle$:

$S$ is the set of possible states. $\{I_n\}$ is in state $i$ at time $n$ iff $I_n = i$, $0 \le i \le M$.

$A$ is the set of possible actions (or decisions). Following an observation, an action $k$ is taken from the finite action space $A$, $k = 0, 1, \ldots, K$.

$P : S \times A \times S \to [0, 1]$ is the state transition function, specifying the probability $P\{j \mid i, k\} \equiv p_{ij}(k)$ of observing a transition to state $j \in S$ after taking action $k \in A$ in state $i \in S$.

Observation instances

The process can be either continuous or discrete time. We focus on discrete-time MDPs. Discrete-time MDPs come in two flavors:

Discrete MDPs that are time-less (we do not model the time between observation instances, e.g., $n \to n+1$, or take it to be constant). These are in a way generalizations of Markov chains (sometimes discrete MDPs of this kind are called controllable Markov chains).

Discrete MDPs that allow for time to pass between observation instances. These are much like generalized versions of semi-Markov processes. In such cases the Markov property holds only at observation instances (and not at arbitrary instances), and we may exploit it to optimize a decision system that acts upon observation instances.

Having a continuous-time MDP would require the Markov property to hold at all time instances. This is rather restrictive, as most processes of interest possess the Markov property only at selected times of interest and not in general.

The action space can be either homogeneous or non-homogeneous. Homogeneous: there is a common action space $A$ from which we choose actions according to the current state. Non-homogeneous: each state $i$ is associated with a potentially different set of actions, $A_i$, from which a decision must be made.

$C : S \times A \to \mathbb{R}$ is a function specifying the cost $c_i(k)$ of taking action $k \in A$ in state $i \in S$; $c_i(k)$ must depend only on the current state-action pair.

A policy $R = (d_0, d_1, \ldots, d_M)$ prescribes an action for each possible state. $d_i(R) = k$ means that under policy $R$, action $k$ is taken when the process is in state $i$.

$\pi(R) = (\pi_0, \pi_1, \ldots, \pi_M)$ is the limiting distribution of $\{I_n\}$ under policy $R$: $\pi(R) = \pi(R) P(R)$.

The objective is to find the optimal policy $R_{opt}$ that minimizes some cost criterion which accounts for both immediate costs and subsequent costs from the future evolution of $\{I_n\}$.
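To make the notation concrete, here is a minimal Python sketch of a hypothetical two-state, two-action MDP $\langle S, A, P, C \rangle$ together with a deterministic policy $R$; all numbers are invented for illustration and are not taken from these notes.

    import numpy as np

    # Hypothetical toy MDP: states S = {0, 1}, actions A = {0, 1}.
    # P[k][i][j] = p_ij(k): probability of a transition to state j
    # when action k is taken in state i.  C[i][k] = c_i(k): cost.
    P = np.array([
        [[0.9, 0.1],   # action k = 0, rows indexed by current state i
         [0.4, 0.6]],
        [[0.2, 0.8],   # action k = 1
         [0.5, 0.5]],
    ])
    C = np.array([[1.0, 4.0],   # c_0(0), c_0(1)
                  [3.0, 0.5]])  # c_1(0), c_1(1)

    # A deterministic stationary policy R = (d_0, d_1): one action per state.
    R = [0, 1]

    # Transition matrix P(R) induced by the policy: row i is p_i.(d_i(R)).
    P_R = np.array([P[R[i], i, :] for i in range(len(R))])
    assert np.allclose(P_R.sum(axis=1), 1.0)  # every row is a distribution

The same toy data is reused in the later sketches of the solution algorithms.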

Costs are used to drive the agent towards the desired behavior (defined by an objective function). Costs go with minimization objectives; alternatively, we may use rewards in conjunction with maximization objectives.

Be careful when defining costs: costs must depend only on the current state-action pair. This is a common source of errors, because in many cases the cost also depends on the next state. In such cases, average over all possible transitions to obtain a legitimate MDP cost (one that depends only on the current state-action pair), as in the expression below.
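If the raw cost also depends on the next state, say $\tilde{c}(i, k, j)$ (the notation $\tilde{c}$ is mine, not from the notes), the legitimate MDP cost is its expectation over the next state:

$$ c_i(k) = \sum_{j \in S} p_{ij}(k)\, \tilde{c}(i, k, j) $$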

Types of policies:

Stationary policies always follow the same rule for choosing an action in each possible state, independently of the current time $n$.

Non-stationary policies behave differently as time evolves.

Deterministic policies always take the same action when in the same state: $d_i(R) = k$ with probability 1.

Randomized policies map each state to a probability distribution over the possible actions: $d_i(R) = k$ with probability $\rho_i(k)$, where $\sum_{k \in A} \rho_i(k) = 1$.

Policy space: policies can be classified as stationary or non-stationary, and as randomized or deterministic. A deterministic policy maps each state to one action; a randomized policy maps each state to one probability distribution over the action space.

Stationarity and randomization are concepts that belong to different levels of characterizing a policy. They do not compare directly (e.g., stationary is NOT the opposite of randomized)!

Stationarity is about whether the rule for choosing decisions is affected by time. Randomization refers to the way decisions are made in particular states (under a stationary or non-stationary policy).

Stationary policies arise when optimizing over an infinite horizon. This makes sense intuitively: it is boundary (time) conditions that might prompt a change of behavior (given a fixed environment). A poker player who is already losing money might bet more aggressively towards the end of the game in a final attempt to recover. Similarly, a winning (rational!) player might avoid excessive risks towards the end of the game (to protect his winnings). Non-stationary policies arise from finite planning horizons.

Behavioral changes due to approaching time boundaries also appear in the domain of Game Theory (a game involves at least two interacting agents (players), whereas the decision theory discussed here involves only one agent, whose aim is to adapt to his environment rather than compete with another rational entity). An interesting discussion of such behavioral issues appears in the context of the Iterated Prisoner's Dilemma and other games.¹

¹ William Poundstone, Prisoner's Dilemma: John von Neumann, Game Theory and the Puzzle of the Bomb, Anchor Books, 1993. (Highly recommended!)

Cost criteria and planning horizon

Finite-horizon undiscounted-cost problems require the minimization of the total expected accumulated cost over a finite planning horizon of $W$ observations (transitions):

$$ E_i\{c\}(R) = E\left[ \sum_{n=1}^{W} c_{I_n}\big(d_{I_n}(R)\big) \;\Big|\; I_0 = i \right] $$

Infinite-horizon problems consider an infinite planning horizon. They are appropriate for systems that are expected to operate continuously, or for systems with an unknown stopping time.

To understand $E_i\{c\}(R)$, the expected cost when starting from state $i$ and operating for $W$ time units under policy $R$, remember the following definitions:

$I_0$ is the initial state of the process (at $n = 0$).

$I_n$ is the state of the process at time $n$.

$d_{I_n}(R)$ is the decision taken in state $I_n$ under policy $R$.

$c_{I_n}\big(d_{I_n}(R)\big)$ is the cost incurred by taking decision $d_{I_n}(R)$ in state $I_n$.

Discounted-cost problems (finite/infinite horizon)

Attach a discount factor $\alpha$, $0 < \alpha < 1$, to each immediate cost $c_i(k)$, thus affecting the relative importance of immediate costs over future costs:

$$ E_i\{c\}(R) = E\left[ \sum_{n=1}^{\infty} \alpha^n\, c_{I_n}\big(d_{I_n}(R)\big) \;\Big|\; I_0 = i \right] $$

When $\alpha \to 1$, future costs count almost as much as immediate costs. Otherwise, future costs are heavily discounted, so, for optimal performance, more attention must be given to the minimization of immediate costs.
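As a quick illustration (with numbers of my own choosing): for a constant per-transition cost $c$, the discounted criterion gives

$$ E_i\{c\}(R) = \sum_{n=1}^{\infty} \alpha^n c = \frac{\alpha}{1 - \alpha}\, c, $$

so $\alpha = 0.9$ values the entire future at $9c$, while $\alpha = 0.5$ values it at only $c$; the smaller $\alpha$ is, the more the optimal policy is driven by immediate costs.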

Average cost optimality (goes with infinite horizon)

Requires the minimization of the expected average cost per unit of time:

$$ E_i\{c\}(R) = E\left[ \lim_{n \to \infty} \frac{1}{n} \sum_{h=1}^{n} c_{I_h}\big(d_{I_h}(R)\big) \;\Big|\; I_0 = i \right] \qquad (1) $$

As $n \to \infty$, $P\{I_n = j \mid I_0 = i\}(R) \to \pi_j(R)$, independently of the initial state $I_0 = i$, thus:

$$ E\{c\}(R) = \sum_{j \in S} \pi_j(R)\, c_j\big(d_j(R)\big) \qquad (2) $$

Derivation of (1) $\Rightarrow$ (2): the limiting probability $\pi_j(R)$ can be written as follows:

$$ \pi_j(R) = \lim_{n \to \infty} \frac{1}{n} \sum_{h=1}^{n} P\{I_h = j \mid I_0 = i\}(R) \qquad (3) $$

Thus, starting from (1):

$$
\begin{aligned}
E_i\{c\}(R) &= E\left[ \lim_{n \to \infty} \frac{1}{n} \sum_{h=1}^{n} c_{I_h}\big(d_{I_h}(R)\big) \;\Big|\; I_0 = i \right] \\
&= \lim_{n \to \infty} \frac{1}{n} \sum_{h=1}^{n} E\left[ c_{I_h}\big(d_{I_h}(R)\big) \;\Big|\; I_0 = i \right] \\
&= \lim_{n \to \infty} \frac{1}{n} \sum_{h=1}^{n} \sum_{j \in S} c_j\big(d_j(R)\big)\, P\{I_h = j \mid I_0 = i\}(R) \\
&= \sum_{j \in S} c_j\big(d_j(R)\big) \lim_{n \to \infty} \frac{1}{n} \sum_{h=1}^{n} P\{I_h = j \mid I_0 = i\}(R) \\
&= \sum_{j \in S} c_j\big(d_j(R)\big)\, \pi_j(R) \qquad \text{(substituting from (3))}
\end{aligned}
$$

which is exactly (2).
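Continuing the hypothetical toy MDP introduced earlier, the following sketch evaluates (2) for a fixed stationary deterministic policy: it solves $\pi(R) = \pi(R) P(R)$ together with $\sum_j \pi_j(R) = 1$ and then takes the $\pi$-weighted sum of immediate costs. It assumes the chain induced by $R$ is irreducible.

    import numpy as np

    def average_cost(P, C, R):
        """Expected average cost per unit time of a stationary
        deterministic policy R (toy sketch, hypothetical data)."""
        n_states = len(R)
        P_R = np.array([P[R[i], i, :] for i in range(n_states)])
        c_R = np.array([C[i, R[i]] for i in range(n_states)])
        # Solve pi = pi P(R) together with sum(pi) = 1 (least squares).
        A = np.vstack([P_R.T - np.eye(n_states), np.ones(n_states)])
        b = np.append(np.zeros(n_states), 1.0)
        pi, *_ = np.linalg.lstsq(A, b, rcond=None)
        return float(pi @ c_R), pi

    P = np.array([[[0.9, 0.1], [0.4, 0.6]],
                  [[0.2, 0.8], [0.5, 0.5]]])
    C = np.array([[1.0, 4.0], [3.0, 0.5]])
    print(average_cost(P, C, R=[0, 1]))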

The optimal policy $R_{opt}$ is the one that incurs the smallest cost: $E\{c\}(R_{opt}) \le E\{c\}(R)$ over all policies $R$.

1. An optimal policy does not always exist under the average cost criterion.
2. If an optimal policy does exist, it is not guaranteed to be stationary.
3. If $S$ is finite and every stationary policy gives rise to an irreducible Markov chain, then a stationary optimal policy is guaranteed to exist (and it is non-randomized); see S. M. Ross [2] for more details.

Finding the optimal policy

1. Exhaustive enumeration: suitable only for tiny problems, due to its $O(|A|^{|S|})$ complexity.
2. Linear Programming (LP).
3. The policy improvement algorithm.
4. The value iteration algorithm (a brief sketch follows below).
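Value iteration is only listed here and is not developed further in these notes. Purely as an illustration, a minimal sketch for the discounted-cost criterion (reusing the hypothetical toy data, with a discount factor $\alpha$ of my choosing) could look like this:

    import numpy as np

    def value_iteration(P, C, alpha=0.9, tol=1e-8):
        """Discounted-cost value iteration (sketch): repeatedly apply
        V(i) <- min_k [ c_i(k) + alpha * sum_j p_ij(k) V(j) ]."""
        n_actions, n_states, _ = P.shape
        V = np.zeros(n_states)
        while True:
            Q = np.array([[C[i, k] + alpha * P[k, i, :] @ V
                           for k in range(n_actions)]
                          for i in range(n_states)])
            V_new = Q.min(axis=1)
            if np.max(np.abs(V_new - V)) < tol:
                return Q.argmin(axis=1), V_new   # greedy policy, values
            V = V_new

    P = np.array([[[0.9, 0.1], [0.4, 0.6]],
                  [[0.2, 0.8], [0.5, 0.5]]])
    C = np.array([[1.0, 4.0], [3.0, 0.5]])
    print(value_iteration(P, C))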

Solving the MDP via Linear Programming

The optimal policy can be identified efficiently by transforming the MDP formulation into a linear program.

Denote $D_{ik} = P\{\text{action} = k \mid \text{state} = i\}$; the $D_{ik}$'s completely define a policy. Also denote $y_{ik} = P\{\text{action} = k \text{ and state} = i\}$.

Clearly the two are related via:

$$ y_{ik} = \pi_i D_{ik} \qquad (4) $$

Also:

$$ \pi_i = \sum_{k=0}^{K} y_{ik} \qquad (5) $$

From (4) and (5):

$$ D_{ik} = \frac{y_{ik}}{\pi_i} = \frac{y_{ik}}{\sum_{k=0}^{K} y_{ik}} \qquad (6) $$

There are several constraints on the $y_{ik}$'s:

1. $\sum_{i=0}^{M} \pi_i = 1 \;\Rightarrow\; \sum_{i=0}^{M} \sum_{k=0}^{K} y_{ik} = 1$

2. $\pi_j = \sum_{i=0}^{M} \pi_i p_{ij} \;\Rightarrow\; \sum_{k=0}^{K} y_{jk} = \sum_{i=0}^{M} \sum_{k=0}^{K} y_{ik}\, p_{ij}(k) \quad \forall j$

3. $y_{ik} \ge 0 \quad \forall i, k$

The steady-state average cost per unit time is:

$$ E\{c\} = \sum_{i=0}^{M} \sum_{k=0}^{K} \pi_i\, c_i(k)\, D_{ik} = \sum_{i=0}^{M} \sum_{k=0}^{K} c_i(k)\, y_{ik} $$

The $y_{ik}$'s are obtained from the following LP:

Minimize

$$ z = E\{c\} = \sum_{i=0}^{M} \sum_{k=0}^{K} c_i(k)\, y_{ik} \qquad (7) $$

subject to

$$ \sum_{k=0}^{K} y_{jk} = \sum_{i=0}^{M} \sum_{k=0}^{K} y_{ik}\, p_{ij}(k) \quad \forall j \qquad (8) $$

$$ \sum_{i=0}^{M} \sum_{k=0}^{K} y_{ik} = 1 \qquad (9) $$

$$ y_{ik} \ge 0 \quad \forall i, k \qquad (10) $$

The $D_{ik}$'s are then readily available by using equation (6).

The $D_{ik}$'s of the optimal solution are either 0 or 1, i.e., the optimal policy is non-randomized. This is because the aforementioned LP has a totally unimodular constraint matrix and integer constants; these two properties guarantee that the Simplex method returns an integral optimal solution (in the current case one with 0's and 1's). See [5] for more on unimodularity.
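Here is a minimal sketch of the LP (7)-(10) using scipy.optimize.linprog, again on the hypothetical toy MDP (the variable ordering and the data are my own; the formulation follows the constraints above). It assumes every state is recurrent under the optimal policy, so that each row sum of $y$ is positive when recovering the $D_{ik}$'s via (6).

    import numpy as np
    from scipy.optimize import linprog

    def solve_mdp_lp(P, C):
        """Average-cost MDP as the LP (7)-(10); variables y_ik are
        flattened as y[i * n_actions + k] (toy sketch)."""
        n_actions, n_states, _ = P.shape
        nv = n_states * n_actions
        idx = lambda i, k: i * n_actions + k

        cost = np.array([C[i, k] for i in range(n_states)
                                 for k in range(n_actions)])
        A_eq, b_eq = [], []
        for j in range(n_states):                   # balance constraints (8)
            row = np.zeros(nv)
            for k in range(n_actions):
                row[idx(j, k)] += 1.0
            for i in range(n_states):
                for k in range(n_actions):
                    row[idx(i, k)] -= P[k, i, j]
            A_eq.append(row); b_eq.append(0.0)
        A_eq.append(np.ones(nv)); b_eq.append(1.0)  # normalization (9)

        res = linprog(cost, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                      bounds=[(0, None)] * nv, method="highs")
        y = res.x.reshape(n_states, n_actions)
        D = y / y.sum(axis=1, keepdims=True)        # recover D_ik via (6)
        return y, D, res.fun

    P = np.array([[[0.9, 0.1], [0.4, 0.6]],
                  [[0.2, 0.8], [0.5, 0.5]]])
    C = np.array([[1.0, 4.0], [3.0, 0.5]])
    y, D, avg_cost = solve_mdp_lp(P, C)
    print(D, avg_cost)

For this toy example the recovered $D_{ik}$'s come out as 0's and 1's, i.e., a deterministic policy, in line with the unimodularity remark above.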

A policy improvement algorithm

Very efficient for large problems.

Starts with an arbitrary policy, which is progressively improved at each iteration of the algorithm until the optimal policy is reached.

Convergence after a finite number of iterations is guaranteed for problems with finite state and action sets $S$, $A$.

The theory of policy improvement

$v_i^n(R)$: the total expected cost of starting from state $i$ and operating for $n$ periods under policy $R$:

$$ v_i^n(R) = c_i(k) + \sum_{j=0}^{M} p_{ij}(k)\, v_j^{n-1}(R) \qquad (11) $$

The expected average cost is independent of the initial state $i$:

$$ E\{c\}(R) = \sum_{j=0}^{M} \pi_j\, c_j(k) \qquad (12) $$

For large $n$ we have:

$$ v_i^n(R) \approx n\, E\{c\}(R) + v_i(R) \qquad (13) $$

$v_i(R)$ captures the effect of starting from state $i$ on the total expected cost $v_i^n(R)$, thus:

$$ v_i^n(R) - v_j^n(R) \approx v_i(R) - v_j(R) \qquad (14) $$

Substituting $v_i^n(R) \approx n\, E\{c\}(R) + v_i(R)$ and $v_j^{n-1}(R) \approx (n-1)\, E\{c\}(R) + v_j(R)$ into equation (11), we obtain:

$$ E\{c\}(R) + v_i(R) = c_i(k) + \sum_{j=0}^{M} p_{ij}(k)\, v_j(R), \quad \text{for } i = 0, 1, \ldots, M \qquad (15) $$

The system of equations (15) has $M + 2$ unknowns ($E\{c\}(R)$ and the $v_i(R)$'s) and $M + 1$ equations. By setting $v_M(R) = 0$ we can find the $v_i(R)$'s and the cost associated with a particular policy.

Theoretically, this recursive equation could be used for an exhaustive search for the optimal policy, but this is not computationally efficient.

The policy improvement algorithm

Initialization: select an arbitrary initial policy $R_0$.

Iteration $n$: perform the following steps.

Value determination: for policy $R_n$, solve the system of $M + 1$ equations

$$ E\{c\}(R_n) = c_i(k) + \sum_{j=0}^{M} p_{ij}(k)\, v_j(R_n) - v_i(R_n), \quad \text{for } 0 \le i \le M, \; k = d_i(R_n) $$

for the $M + 1$ unknown values $E\{c\}(R_n), v_0(R_n), v_1(R_n), \ldots, v_{M-1}(R_n)$ (with $v_M(R_n) = 0$).

Policy improvement: using the values $v_i(R_n)$ computed for policy $R_n$, find an improved policy $R_{n+1}$ such that for each state $i$, $d_i(R_{n+1}) = k$ is the decision that minimizes

$$ c_i(k) + \sum_{j=0}^{M} p_{ij}(k)\, v_j(R_n) - v_i(R_n), \quad \text{for } 0 \le i \le M \qquad (16) $$

i.e., for each state $i$ minimize (16) over $k$ and set $d_i(R_{n+1})$ equal to the minimizing value of $k$. This procedure defines a new policy $R_{n+1}$ with $E\{c\}(R_{n+1}) \le E\{c\}(R_n)$ (see Theorem 3.2 in [3]).

Optimality test: if the current policy $R_{n+1}$ is identical to the previous policy $R_n$, then it is optimal. Otherwise set $n = n + 1$ and perform another iteration.
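The algorithm above translates almost directly into code. Below is a minimal sketch on the hypothetical toy data: the value-determination step solves the linear system with $v_M(R_n) = 0$, and the improvement step minimizes (16) state by state.

    import numpy as np

    def policy_iteration(P, C, R0=None):
        """Average-cost policy improvement algorithm (sketch)."""
        n_actions, n_states, _ = P.shape
        R = list(R0) if R0 is not None else [0] * n_states
        while True:
            # Value determination: unknowns (E, v_0, ..., v_{M-1}), v_M = 0.
            # For each state i: E + v_i - sum_j p_ij(k) v_j = c_i(k), k = d_i(R).
            A = np.zeros((n_states, n_states))
            b = np.zeros(n_states)
            for i in range(n_states):
                k = R[i]
                A[i, 0] = 1.0                           # coefficient of E
                if i < n_states - 1:
                    A[i, i + 1] += 1.0                  # + v_i
                A[i, 1:] -= P[k, i, :n_states - 1]      # - sum_j p_ij(k) v_j
                b[i] = C[i, k]
            sol = np.linalg.solve(A, b)
            E, v = sol[0], np.append(sol[1:], 0.0)
            # Policy improvement: minimize c_i(k) + sum_j p_ij(k) v_j - v_i.
            R_new = [int(np.argmin([C[i, k] + P[k, i, :] @ v - v[i]
                                    for k in range(n_actions)]))
                     for i in range(n_states)]
            if R_new == R:                              # optimality test
                return R, E, v
            R = R_new

    P = np.array([[[0.9, 0.1], [0.4, 0.6]],
                  [[0.2, 0.8], [0.5, 0.5]]])
    C = np.array([[1.0, 4.0], [3.0, 0.5]])
    print(policy_iteration(P, C))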

References

[1] H. Mine and S. Osaki, Markovian Decision Processes, Elsevier, Amsterdam, 1970.
[2] Sheldon M. Ross, Applied Probability Models with Optimization Applications, Dover Publications, New York, 1992.
[3] Henk C. Tijms, Stochastic Modelling and Analysis: A Computational Approach, John Wiley & Sons, 1986.
[4] Frederick S. Hillier and Gerald J. Lieberman, Introduction to Operations Research, McGraw-Hill, 2000.
[5] Christos H. Papadimitriou and Kenneth Steiglitz, Combinatorial Optimization: Algorithms and Complexity, Dover Publications, New York, 1998.