The Simplex and Policy-Iteration Methods are Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate


Yinyu Ye
Department of Management Science and Engineering, Stanford University, Stanford, CA. email: yinyu-ye@stanford.edu

Initial submission date: May 15, 2010; first revision Nov. 30, 2010; second revision August 16,

We prove that the classic policy-iteration method (Howard 1960) and the original simplex method with the most-negative-reduced-cost pivoting rule (Dantzig 1947) are strongly polynomial-time algorithms for solving the Markov decision problem (MDP) with a fixed discount rate. Furthermore, the computational complexity of the policy-iteration and simplex methods is superior to that of the only known strongly polynomial-time interior-point algorithm (Ye 2005) for solving this problem. The result is surprising, since the simplex method with the same pivoting rule was shown to be exponential for solving a general linear programming (LP) problem (Klee and Minty 1972), the simplex method with the smallest-index pivoting rule was shown to be exponential for solving an MDP regardless of discount rates (Melekopoglou and Condon 1994), and the policy-iteration method was recently shown to be exponential for solving undiscounted MDPs under the average cost criterion. We also extend the result to solving MDPs with transient substochastic transition matrices whose spectral radii uniformly minorize one.

Key words: Simplex Method; Policy-Iteration Method; Markov Decision Problem; Linear Programming; Dynamic Programming; Strongly Polynomial Time

MSC2000 Subject Classification: Primary: 90C40, 90C05; Secondary: 68Q25, 90C39

OR/MS subject classification: Primary: Linear Programming, Algorithm; Secondary: Dynamic Programming, Markov

1. Introduction Markov decision problems (MDPs) provide a mathematical framework for modeling decision-making in situations where outcomes are partly random and partly under the control of a decision maker.
The MDP is one of the most fundamental models for studying a wide range of optimization problems solved via dynamic programming and reinforcement learning. Today, it is being used in a variety of areas, including management, economics, bioinformatics, electronic commerce, social networking, and supply chains.

More precisely, an MDP is a continuous or discrete time stochastic control process. At each time step, the process is in some state i, and the decision maker may choose an action, say j, which is available in state i. The process responds at the next time step by randomly moving into a state i' and incurring an immediate cost c_j(i, i'). Let m denote the total number of states. The probability that the process enters state i' after a transition is determined by the current state i and the chosen action j, but is conditionally independent of all previous states and actions; in other words, the state transition of an MDP possesses the Markov property. Specifically, it is given by a transition probability distribution p_j(i, i') ≥ 0, i' = 1, …, m, with Σ_{i'=1}^m p_j(i, i') = 1, for each i = 1, …, m. (For transient substochastic systems, the sum is allowed to be less than or equal to one.)

The key goal of MDPs is to identify an optimal (stationary) policy for the decision maker, which can be represented as a set function π = {π_1, π_2, …, π_m} that specifies the action π_i that the decision maker will choose when in state i, for i = 1, …, m. A (stationary) policy π is optimal if it minimizes some cumulative function of the random costs, typically the (expected) discounted sum over the infinite horizon

  Σ_{t=0}^∞ γ^t C(π, t),

where C(π, t) is the (expected) cost at time t when the decision maker follows policy π, and γ is the discount rate, assumed to be strictly less than one in this paper. This MDP problem is called the infinite-horizon discounted Markov decision problem (DMDP), which serves as a core model for MDPs. Presumably, the decision maker is allowed to take different policies at different time steps to minimize the above cost.
However, it is known that there exists an optimal stationary policy (or policy for short in the rest of this paper) for the DMDP that is independent of time t, so that it can be written as a set function of state i = 1, …, m only, as described above.
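The discounted objective above can be made concrete with a small numeric sketch: truncating the infinite sum at horizon T leaves an error of at most γ^{T+1} C_max/(1−γ). The helper below is our own illustration, not code from the paper.

```python
# Truncated evaluation of the discounted objective sum_{t>=0} gamma^t C(pi, t).
# step_cost(t) plays the role of C(pi, t); the name is ours, for illustration.

def discounted_cost(step_cost, gamma, T):
    """Sum gamma^t * step_cost(t) for t = 0..T (truncated discounted cost)."""
    return sum(gamma ** t * step_cost(t) for t in range(T + 1))
```

For a constant per-period cost c, the truncated sum approaches the closed form c/(1−γ) as T grows, which is the geometric-series identity underlying many bounds later in the paper.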

Yinyu Ye: The Simplex and Policy-Iteration Methods are Strongly Polynomial for the MDP with a Fixed Discount Rate
Mathematics of Operations Research xx(x), pp. xxx–xxx, © 200x INFORMS

Let k_i be the number of actions available in state i ∈ {1, …, m} and let

  A_1 = {1, 2, …, k_1}, A_2 = {k_1 + 1, k_1 + 2, …, k_1 + k_2}, …

or, in general, for i = 1, 2, …, m,

  A_i := Σ_{s=1}^{i−1} k_s + {1, 2, …, k_i}.

Clearly, |A_i| = k_i. The total number of actions is n = Σ_{i=1}^m k_i, and these, without loss of generality, can be ordered such that if j ∈ A_i, then action j is available in state i.

Suppose we know the state transition probabilities p_j(i, i') and the immediate cost coefficients c_j(i, i') (j = 1, …, n and i, i' = 1, …, m), and we wish to find a policy π* that minimizes the expected discounted cost. Then the policy π* would be associated with another array indexed by state, the value vector v* ∈ IR^m, which contains cost-to-go values for all states. Furthermore, an optimal pair (π*, v*) is then a fixed point of the following operator:

  π*_i := arg min_{j ∈ A_i} { Σ_{i'} p_j(i, i') ( c_j(i, i') + γ v*_{i'} ) },
  v*_i := Σ_{i'} p_{π*_i}(i, i') ( c_{π*_i}(i, i') + γ v*_{i'} ), i = 1, …, m,   (1)

where again c_j(i, i') represents the immediate cost incurred by an individual in state i who takes action j and enters state i'. Let P_π ∈ IR^{m×m} be the column stochastic matrix corresponding to a policy π, that is, the ith column of P_π is the probability distribution p_{π_i}(i, ·), i = 1, …, m. Then (1) can be represented in matrix form as follows:

  (I − γ P_{π*}^T) v* = c_{π*},  (I − γ P_π^T) v* ≤ c_π, ∀π,   (2)

where the ith entry of vector c_π ∈ IR^m equals the expected immediate cost Σ_{i'} p_{π_i}(i, i') c_{π_i}(i, i') for state i.
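The value-determination half of the fixed point above, (I − γP_π^T)v = c_π, is a plain linear system. A minimal sketch of solving it, assuming our own data layout (P_rows[i] is the transition distribution out of state i under the policy; all names are ours, not the paper's):

```python
# Policy evaluation for a DMDP: solve (I - gamma * P_pi^T) v = c_pi exactly.
# Self-contained sketch with a small Gaussian-elimination solver; no NumPy.

def solve(M, b):
    """Solve M v = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    A = [row[:] + [b[i]] for i, row in enumerate(M)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        for r in range(col + 1, n):
            f = A[r][col] / A[col][col]
            for k in range(col, n + 1):
                A[r][k] -= f * A[col][k]
    v = [0.0] * n
    for r in range(n - 1, -1, -1):
        v[r] = (A[r][n] - sum(A[r][k] * v[k] for k in range(r + 1, n))) / A[r][r]
    return v

def evaluate_policy(P_rows, c, gamma):
    """P_rows[i][i2] = probability of moving from state i to i2 under the policy
    (i.e., the i-th column of the paper's P_pi); c[i] = expected immediate cost.
    Returns v with v_i = c_i + gamma * sum_{i2} P_rows[i][i2] * v_{i2}."""
    m = len(c)
    M = [[(1.0 if i == k else 0.0) - gamma * P_rows[i][k] for k in range(m)]
         for i in range(m)]
    return solve(M, c)
```

The returned vector satisfies the Bellman equality in (2) for the given policy, which can be checked by substituting v back into the right-hand side.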
D'Epenoux [8] and de Ghellinck [7] (also see Manne [20], Kallenberg [14] and Altman [1]) showed that the infinite-horizon discounted MDP can be formulated as a primal linear programming (LP) problem in the standard form

  minimize c^T x subject to Ax = b, x ≥ 0,   (3)

with the dual

  maximize b^T y subject to s = c − A^T y ≥ 0,   (4)

where A ∈ IR^{m×n} is a given real matrix with rank m, c ∈ IR^n and b ∈ IR^m are given real vectors, 0 denotes the vector of all 0's, and x ∈ IR^n and (y ∈ IR^m, s ∈ IR^n) are unknown primal and dual variables, respectively. Vector s is often called the dual slack vector. In what follows, LP stands for any of the following: linear program, linear programs, or linear programming, depending on the context.

The DMDP can be formulated as LP problems (3) and (4) with the following assignments of (A, b, c): First, every entry of the column vector b ∈ IR^m is one. Secondly, the jth entry of vector c ∈ IR^n is the (expected) immediate cost of action j. In particular, if j ∈ A_i, then action j is available in state i and

  c_j = Σ_{i'} p_j(i, i') c_j(i, i').   (5)

Thirdly, the LP constraint matrix has the form

  A = E − γP ∈ IR^{m×n}.   (6)

Here, the jth column of P is the transition probability distribution when the jth action is chosen. Specifically, for each action j ∈ A_i,

  P_{i'j} = p_j(i, i'), i' = 1, …, m,   (7)

and

  E_{ij} = 1 if j ∈ A_i, and 0 otherwise, i = 1, …, m, j = 1, …, n.   (8)

Let e be the vector of all ones, where its dimension depends on the context. Then we have b = e, e^T P = e^T (that is, P is a column stochastic matrix), e^T E = e^T, and e^T A = (1 − γ)e^T.
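The assignments (5)–(8) can be assembled mechanically. The sketch below builds (A, b, c) from MDP primitives under an illustrative layout of our own (actions numbered 0..n−1 in state-major order) and is easy to sanity-check against the identity e^T A = (1−γ)e^T:

```python
# Building the DMDP linear-program data A = E - gamma*P, b = e, and c.
# Illustrative sketch; the function and argument names are ours.

def build_lp(k, p, cost, gamma):
    """k[i]: number of actions in state i (actions numbered in state-major order);
    p[j][i2]: transition probability to state i2 under action j;
    cost[j]: expected immediate cost of action j (the c_j of (5))."""
    m, n = len(k), sum(k)
    state_of = [i for i in range(m) for _ in range(k[i])]   # owner state of action j
    E = [[1.0 if state_of[j] == i else 0.0 for j in range(n)] for i in range(m)]
    P = [[p[j][i] for j in range(n)] for i in range(m)]     # column j is p_j(., .)
    A = [[E[i][j] - gamma * P[i][j] for j in range(n)] for i in range(m)]
    b = [1.0] * m
    return A, b, cost
```

Because each column of E and each column of P sums to one, every column of A sums to 1 − γ, which is exactly the relation e^T A = (1 − γ)e^T noted above.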

The interpretations of the quantities defining the DMDP primal (3) and the DMDP dual (4) are as follows: b = e means that initially, there is one individual in each state. For j ∈ A_i, the primal variable x_j is the action frequency of action j, or the expected present value of the number of times in which an individual is in state i and takes action j. Thus, solving the DMDP primal entails choosing action frequencies that minimize the expected present cost, c^T x, subject to the conservation law Ax = e. The conservation law ensures that for each state i, the expected present value of the number of individuals entering state i equals the expected present value of the number of individuals leaving i.

The DMDP dual variables y ∈ IR^m exactly represent the expected present cost-to-go values of the states. Solving the dual entails choosing dual variables y, one for each state i, together with a vector s ∈ IR^n of slack variables, one for each action j, that maximize e^T y subject to A^T y + s = c, s ≥ 0, or simply A^T y ≤ c. It is well known that there exists a unique optimal solution pair (y*, s*) where, for each state i, y*_i is the minimum expected present cost that an individual in state i and its progeny can incur.

A policy π of the original DMDP, containing exactly one action in A_i for each state i, actually corresponds to the basic variable indices of a basic feasible solution (BFS) of the DMDP primal LP formulation. Obviously, we have a total of Π_{i=1}^m k_i different policies. Let matrix A_π ∈ IR^{m×m} (resp., P_π, E_π) be the columns of A (resp., P, E) with indices in π. Then for a policy π, E_π = I (where I is the identity matrix), so that A_π has the Leontief substitution form A_π = I − γP_π. It is also well known that A_π is nonsingular, has a nonnegative inverse, and is a feasible basis for the DMDP primal.
Let x^π be the BFS for a policy π in the DMDP primal form, and let ν contain the remaining indices not in π. Let x_π and x_ν be the sub-vectors of x whose indices are respectively in π and ν. Then the nonbasic variables x^π_ν = 0, and the basic variables x^π_π form the unique solution to A_π x_π = e. The corresponding dual basic solution y^π is the unique solution to the equation A_π^T y = c_π. The dual basic solution y^π is feasible if A_ν^T y^π ≤ c_ν, or equivalently if s_ν ≥ 0. The basic solution pair (x^π, y^π) of the DMDP primal and dual is optimal if and only if both are feasible. In that case, π is an optimal policy π* and y^π = v* of operator (2). Note that the constraints A_π^T y^π = c_π and A^T y^π ≤ c describe the same conditions as in (2) for each policy π or for each action j.

2. Literature and Complexity There are several major events in developing methods for solving DMDPs. Bellman (1957) [2] developed a successive approximations method, called value-iteration, which computes the optimal total cost function assuming first a one-stage finite horizon, then a two-stage finite horizon, and so on. The total cost functions so computed are guaranteed to converge in the limit to the optimal total cost function. It should be noted that, even prior to Bellman, Shapley (1953) [26] used value-iteration to solve DMDPs in the context of zero-sum two-person stochastic games.

The other best-known method to solve DMDPs is due to Howard (1960) [13] and is known as policy-iteration, which generates an optimal policy in a finite number of iterations. Policy-iteration alternates between a value determination phase, in which the current policy is evaluated, and a policy improvement phase, in which an attempt is made to improve the current policy. In the policy improvement phase, the original and classic policy-iteration method updates the actions possibly in every state in one iteration. If the current policy is improved for at most one state in one iteration, then it is called simple policy-iteration.
We will come back to the policy-iteration and simple policy-iteration methods later in terms of the LP formulation. Since it was discovered by D'Epenoux [8] and de Ghellinck [7] that the DMDP has an LP formulation, the simplex method of Dantzig (1947) [6] can be used to solve DMDPs. It turns out that the simplex method, when applied to solving DMDPs, is the simple policy-iteration method. Other general LP methods, such as the Ellipsoid method and interior-point algorithms, are also capable of solving DMDPs.

As the notion of computational complexity emerged, there were tremendous efforts in analyzing the complexity of solving MDPs and its solution methods. On the positive side, since it (with or without discounting^1) can be formulated as a linear program, the MDP can be solved in polynomial time by either the Ellipsoid method (e.g., Khachiyan (1979) [16]) or the interior-point algorithm (e.g., Karmarkar (1984) [15]). Here, polynomial time means that the number of arithmetic operations needed to compute an optimal policy is bounded by a polynomial in the numbers of states, actions, and the bit-size of the input data, which are assumed to be rational numbers. Papadimitriou and Tsitsiklis [23] then showed in 1987 that an MDP with deterministic transitions (i.e., each entry of the state transition probability matrices

^1 When there is no discounting, the optimality criterion or objective function needs to be changed, where a popular one is to minimize the average cost lim_{N→∞} (1/N) Σ_{t=0}^{N} C(π, t).

is either 0 or 1) and the average cost optimality criterion can be solved in strongly polynomial time (i.e., the number of arithmetic operations is bounded by a polynomial in the numbers of states and actions only) as the well-known Minimum-Mean-Cost-Cycle problem. Erickson [11] in 1988 showed that finitely many iterations of successive approximations suffice to (1) produce an optimal stationary halting policy, or (2) show that no such policy exists, in strongly polynomial time, based on the work of Eaves and Veinott [10] and Rothblum [25]. Bertsekas [3] in 1987 also showed that the value-iteration method converges to the optimal policy in a finite number of iterations. Tseng [27] in 1990 showed that the value-iteration method generates an optimal policy in polynomial time for the DMDP when the discount rate is fixed. Puterman [24] in 1994 showed that the policy-iteration method converges no more slowly than the value-iteration method, so that it is also a polynomial-time algorithm for the DMDP with a fixed discount rate.^2 Mansour and Singh [21] in 1994 also gave an upper bound, k^m/m, on the number of iterations for the policy-iteration method in solving the DMDP when each state has k actions. (Note that the total number of possible policies is k^m, so that the result is not much better than that of complete enumeration.) In 2005, Ye [32] developed a strongly polynomial-time combinatorial interior-point algorithm (CIPA) for solving DMDPs with a fixed discount rate, that is, the number of arithmetic operations is bounded by a polynomial in only the numbers of states and actions.

The worst-case running times (within a constant factor), in terms of the number of arithmetic operations, of these methods are summarized in the following table, when there are exactly k actions in each of the m states; see Littman et al.
[18], Mansour and Singh [21], Ye [32], and references therein.

  Value-Iteration:   m²k · L(P, c, γ) · log(1/(1−γ))
  Policy-Iteration:  min{ (k^m/m) · m³, m³k · L(P, c, γ) · log(1/(1−γ)) }
  LP-Algorithms:     m³k² · L(P, c, γ)
  CIPA:              m⁴k⁴ · log(m/(1−γ))

Here, L(P, c, γ) is the total bit-size of the DMDP input data in the linear programming form, given that (P, c, γ) have only rational entries. As one can see from the table, both the value-iteration and policy-iteration methods are polynomial-time algorithms if the discount rate 0 ≤ γ < 1 is fixed. But they are not strongly polynomial, where the running time needs to be a polynomial only in m and k. The proof of polynomial-time complexity for the value and policy-iteration methods is essentially due to the argument that, when the gap between the objective value of the current policy (or BFS) and the optimal one is smaller than 2^{−L(P,c,γ)}, the current policy must be optimal; e.g., see [19]. However, the proof of a strongly polynomial-time algorithm cannot rely on this argument, since (P, c, γ) may have irrational entries, so that the bit-size of the data can be ∞.

In practice, the policy-iteration method has been remarkably successful and shown to be most effective and widely used, where the number of iterations is typically bounded by O(mk) [18]. It turns out that the policy-iteration method is actually the simplex method with block pivots in each iteration; and the simplex method also remains one of the very few extremely effective methods for solving general LPs; see Bixby [4]. In the past 50 years, many efforts have been made to resolve the worst-case complexity issues of the policy-iteration and simplex methods for solving MDPs and general LPs, and to answer the question: are the policy-iteration and the simplex methods strongly polynomial-time algorithms? Unfortunately, so far most results have been negative. Klee and Minty [17] showed in 1972 that the original simplex method with Dantzig's most-negative-reduced-cost pivoting rule necessarily takes an exponential number of iterations to solve a carefully designed LP problem.
Later, a similar negative result of Melekopoglou and Condon [22] showed that the simplex or simple policy-iteration method with the smallest-index pivoting rule (that is, only the action with the smallest index is pivoted or updated among the actions with negative reduced costs) needs an exponential number of iterations to compute an optimal policy for a specific DMDP problem regardless of discount rate (i.e., even when γ < 1 is fixed).^3

Is there a pivoting rule to make the simplex and policy-iteration methods strongly polynomial for solving DMDPs? In this paper, we prove that the original, and now classic, simplex method of Dantzig (1947), or the

^2 This fact was actually known to Veinott (and perhaps others) three decades earlier and used in dynamic programming courses he taught for a number of years before 1994 [31].
^3 Most recently, Fearnley (2010) [12] showed that the policy-iteration method also needs an exponential number of iterations for an undiscounted but finite-horizon MDP under the average cost optimality criterion.

simple policy-iteration method with the most-negative-reduced-cost pivoting rule, is indeed a strongly polynomial-time algorithm for solving DMDPs with a fixed discount rate 0 ≤ γ < 1. The number of its iterations is actually bounded by

  (m²(k−1)/(1−γ)) · log(m²/(1−γ)),

and each iteration uses at most O(m²k) arithmetic operations. The result seems surprising, given the earlier negative results mentioned above. Since the original policy-iteration method of Howard (1960) updates the actions possibly in every state in one iteration, the action updated in one iteration of the simplex method will be included in the actions updated in one iteration of the policy-iteration method. Consequently, we prove that it is also a strongly polynomial-time algorithm with the same iteration complexity bound. Note that the worst-case arithmetic operation complexity, O(m⁴k² log m), of the simplex method is actually superior to that, O(m⁴k⁴ log m), of the combinatorial interior-point algorithm [32] for solving DMDPs when the discount rate is a fixed constant.

If the number of actions varies among the states, our iteration complexity bound for the simplex method would be

  (n − m) · (m/(1−γ)) · log(m²/(1−γ)),

and each iteration uses at most O(mn) arithmetic operations, where n is again the total number of actions. One can see that the worst-case iteration complexity bound is linear in the total number of actions, as observed in practice [18], when the number of states is fixed. We remark that, if the discount rate is an input, it remains open whether or not the policy-iteration or simplex method is polynomial for the MDP, or whether or not there exists a strongly polynomial-time algorithm for MDP or LP in general.

3. Preliminaries and Formulations We first describe a few general LP and DMDP theorems and the simplex and policy-iteration methods.
We will use the LP formulations (3) and (4) for the DMDP and the terminology presented in the Introduction. Recall that, for the DMDP, b = e ∈ IR^m, A = E − γP ∈ IR^{m×n}, and c, P and E are defined in (5), (7) and (8), respectively.

3.1 DMDP Properties The optimality conditions for all optimal solutions of a general LP may be written as follows:

  Ax = b, x ≥ 0,
  A^T y + s = c, s ≥ 0,
  s_j x_j = 0, j = 1, …, n,

where the third condition is often referred to as the complementarity condition.

Let π be the index set of actions corresponding to a policy. Then, as we briefly mentioned earlier, x^π is a BFS of the DMDP primal and the basis A_π has the form A_π = (I − γP_π), where P_π is a column stochastic matrix, that is, P_π ≥ 0 and e^T P_π = e^T. In fact, the converse is also true, that is, the index set π of basic variables of every BFS of the DMDP primal is a policy for the original DMDP. In other words, π must have exactly one variable or action index in A_i, for each state i. Thus, we have the following lemma.

Lemma 3.1 The DMDP primal linear programming formulation has the following properties: (i) There is a one-to-one correspondence between a (stationary) policy of the original DMDP and a basic feasible solution of the DMDP primal. (ii) Let x^π be a basic feasible solution of the DMDP primal. Then any basic variable, say x^π_i, i = 1, …, m, has its value 1 ≤ x^π_i ≤ m/(1−γ). (iii) The feasible set of the DMDP primal is bounded. More precisely, e^T x = m/(1−γ), for every feasible x ≥ 0.

Proof. Let π be the basis set of any basic feasible solution for the DMDP primal. Then, the first statement can be seen as follows. Consider the coefficients of the ith row of A. From the structure of (6), (7) and (8), we must have a_{ij} ≤ 0 for all j ∉ A_i. Thus, if no basic variable is chosen from A_i, or π ∩ A_i = ∅, then

  1 = Σ_{j∈π} a_{ij} x^π_j = Σ_{j∈π, j∉A_i} a_{ij} x^π_j ≤ 0,

which is a contradiction. Thus, each state must have an action in π. On the other hand, since x^π is a BFS, |π| = m, so π must contain exactly one action index in A_i for each state i = 1, …, m; that is, π is a policy. The last two statements of the lemma were given in [32], whose proofs were based on Dantzig [5, 6] and Veinott [29].

From the first statement of Lemma 3.1, in what follows we simply call the basis index set π of any BFS of the DMDP primal a policy. For the basis A_π = (I − γP_π) of any policy π, the BFS x^π and the dual basic solution can be computed as

  x^π_π = (A_π)^{−1} e ≥ e, x^π_ν = 0,
  y^π = (A_π^T)^{−1} c_π, s^π_π = 0, s^π_ν = c_ν − A_ν^T (A_π^T)^{−1} c_π,

where ν contains the remaining action indices not in π. Since x^π and s^π are already complementary, if s^π_ν ≥ 0, then π would be an optimal policy. We now present the following strict complementarity result for the DMDP.

Lemma 3.2 If both linear programs (3) and (4) are feasible, then there is a unique partition O ⊂ {1, 2, …, n} and N ⊂ {1, 2, …, n}, O ∩ N = ∅ and O ∪ N = {1, 2, …, n}, such that for each optimal solution pair (x*, s*),

  x*_j = 0, ∀j ∈ N, and s*_j = 0, ∀j ∈ O,

and there is at least one optimal solution pair (x*, s*) that is strictly complementary,

  x*_j > 0, ∀j ∈ O, and s*_j > 0, ∀j ∈ N,

for the DMDP linear program. In particular, every optimal policy π* ⊂ O, so that |O| ≥ m and |N| ≤ n − m.

Proof. The strict complementarity result for general LP is well known, where we call O the optimal (super) basic variable set and N the optimal non-basic variable set.
The cardinality result follows from the fact that there is always an optimal basic feasible solution or optimal policy whose basic variables (optimal action frequencies) are all strictly positive from Lemma 3.1, so that their indices must all belong to O.

The interpretation of Lemma 3.2 is as follows: since there may exist multiple optimal policies π* for a DMDP, O contains those actions each of which appears in at least one optimal policy, and N contains the remaining actions, none of which appears in any optimal policy. Let us call each action in N a non-optimal action. Then, any DMDP should have no more than n − m non-optimal actions. Note that, although there may be multiple optimal policies for a DMDP, the optimal dual basic feasible solution pair (y*, s*) is unique and invariant among the multiple optimal policies. Thus, if j is a non-optimal action, then its optimal dual slack value, s*_j, must be positive, and the converse is true by the lemma.

3.2 The Simplex and Policy-Iteration Methods Let π be a policy and let ν contain the remaining indices of the non-basic variables. Then we can rewrite (3) as

  minimize c_π^T x_π + c_ν^T x_ν
  subject to A_π x_π + A_ν x_ν = e, x = (x_π; x_ν) ≥ 0,   (9)

with its dual

  maximize e^T y
  subject to A_π^T y + s_π = c_π, A_ν^T y + s_ν = c_ν, s = (s_π; s_ν) ≥ 0.   (10)
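For a given policy π, the primal and dual basic solutions of (9)–(10) can be computed without explicit matrix inversion: since γ < 1, x_π = (I − γP_π)^{−1}e and y^π = (I − γP_π^T)^{−1}c_π are the limits of the convergent Neumann-series iterations x ← e + γP_π x and y ← c_π + γP_π^T y. A minimal sketch under our own layout (P_rows[i] is the distribution out of state i under the policy, i.e., the paper's ith column of P_π):

```python
# Compute the primal BFS and dual basic solution of a policy by fixed-point
# iteration on the Neumann series of (I - gamma*P_pi).  Illustrative sketch.

def basic_solutions(P_rows, c_pi, gamma, iters=3000):
    """Return (x, y) with (I - gamma*P_pi) x = e and (I - gamma*P_pi^T) y = c_pi."""
    m = len(c_pi)
    x, y = [1.0] * m, c_pi[:]
    for _ in range(iters):
        # x_i <- 1 + gamma * sum_{i2} p(i2 -> i) * x_{i2}     (action of P_pi)
        x = [1.0 + gamma * sum(P_rows[i2][i] * x[i2] for i2 in range(m))
             for i in range(m)]
        # y_i <- c_i + gamma * sum_{i2} p(i -> i2) * y_{i2}   (action of P_pi^T)
        y = [c_pi[i] + gamma * sum(P_rows[i][i2] * y[i2] for i2 in range(m))
             for i in range(m)]
    return x, y
```

Because e^T P_π = e^T, the computed x satisfies e^T x = m/(1−γ) and every basic variable is at least 1, matching Lemma 3.1(ii)–(iii).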

The (primal) simplex method rewrites (9) into the equivalent problem

  minimize (c̄_ν)^T x_ν + c_π^T (A_π)^{−1} e
  subject to A_π x_π + A_ν x_ν = e, x = (x_π; x_ν) ≥ 0,   (11)

where c̄ is called the reduced cost vector:

  c̄_π = 0 and c̄_ν = c_ν − A_ν^T y^π, y^π = (A_π^T)^{−1} c_π.

Note that the fixed quantity c_π^T (A_π)^{−1} e = c^T x^π in the objective function of (11) is the objective value of the current policy π for (9). In fact, problem (9) and its equivalent form (11) share exactly the same objective value for every feasible solution x. We now describe the simplex (or simple policy-iteration) method of Dantzig and the policy-iteration method of Howard in terms of the LP formulation.

The simplex method If c̄ ≥ 0, the current policy is optimal. Otherwise, let 0 < Δ = −min{c̄_j : j = 1, 2, …, n} with j+ = arg min{c̄_j : j = 1, 2, …, n}, that is, c̄_{j+} = −Δ < 0. Then we must have j+ ∉ π, since c̄_j = 0 for all j ∈ π. Let j+ ∈ A_i, that is, let j+ be an action available in state i. Then, the original simplex method (Dantzig 1947) takes x_{j+} as the incoming basic variable to replace the old one x_{π_i}, and the method repeats with the new policy denoted by π+, where π_i ∈ A_i is replaced by j+ ∈ A_i. The method will break a tie arbitrarily, and it updates exactly one action in one iteration, that is, it only updates the state with the most negative reduced cost.

The policy-iteration method The original policy-iteration method (Howard 1960) updates the actions in every state that has a negative reduced cost. Specifically, for each state i, let Δ_i = −min{c̄_j : j ∈ A_i} with j+_i = arg min{c̄_j : j ∈ A_i}. Then for every state i such that Δ_i > 0, let j+_i ∈ A_i replace the action π_i ∈ A_i already in the current policy π. The method repeats with the new policy denoted by π+, where possibly multiple π_i ∈ A_i are replaced by j+_i ∈ A_i.
The method will also break a tie in each state arbitrarily. Thus, both methods generate a sequence of policies, denoted by π^0, π^1, …, π^t, …, starting from any initial policy π^0. Clearly, one can see that the policy-iteration method is actually the simplex method with block pivots or updates in each iteration; and the action j+ updated by the simplex method will always be included in the actions updated by the policy-iteration method in every iteration. In what follows, we analyze the complexity bounds of the simplex and policy-iteration methods with their action updating or pivoting rules exactly as described above.

4. Strong Polynomiality We first prove the strongly polynomial-time result for the simplex method.

Lemma 4.1 Suppose z* is the optimal objective value of (9), π is the current policy, π+ is the improvement of π, and Δ is the quantity defined in an iteration of the simplex method. Then, in any iteration of the simplex method from a current policy π to a new policy π+,

  c^T x^{π+} ≥ z* ≥ c^T x^π − (m/(1−γ)) Δ.

Moreover,

  c^T x^{π+} − z* ≤ (1 − (1−γ)/m) (c^T x^π − z*).

Therefore, the simplex method generates a sequence of policies π^0, π^1, …, π^t, … such that

  c^T x^{π^t} − z* ≤ (1 − (1−γ)/m)^t (c^T x^{π^0} − z*).

Proof. From problem (11), we see that the objective function value for any feasible x is

  c^T x = c^T x^π + c̄^T x ≥ c^T x^π − Δ e^T x = c^T x^π − Δ m/(1−γ),

where the first inequality follows from c̄ ≥ −Δe, by the most-negative-reduced-cost pivoting rule adopted in the method, and the last equality is based on the third statement of Lemma 3.1. In particular, the optimal objective value satisfies

  z* = c^T x* ≥ c^T x^π − Δ m/(1−γ),

which proves the first inequality of the lemma.

Since at the new policy π+ the value of the new basic variable x^{π+}_{j+} is greater than or equal to 1 from the second statement of Lemma 3.1, the objective value of the new policy for problem (11) is decreased by at least Δ. Thus, for problem (9),

  c^T x^π − c^T x^{π+} = Δ x^{π+}_{j+} ≥ Δ ≥ ((1−γ)/m) (c^T x^π − z*),   (12)

or

  c^T x^{π+} − z* ≤ (1 − (1−γ)/m) (c^T x^π − z*),

which proves the second inequality. Replacing π by π^t and using the above inequality, for all t = 0, 1, …, we have

  c^T x^{π^{t+1}} − z* ≤ (1 − (1−γ)/m) (c^T x^{π^t} − z*),

which leads to the third desired inequality by induction.

We now present the following key technical lemma.

Lemma 4.2 (i) If a policy π is not optimal, then there is an action j ∈ π ∩ N (i.e., a non-optimal action j in π) such that

  s*_j ≥ ((1−γ)/m²) (c^T x^π − z*),

where N, together with O, is the strict complementarity partition stated in Lemma 3.2, and s* is the optimal dual slack vector of (10). (ii) For any sequence of policies π^0, π^1, …, π^t, … generated by the simplex method starting with a non-optimal policy π^0, let j0 ∈ π^0 ∩ N be the action index identified above in the initial policy π^0. Then, if j0 ∈ π^t, we must have

  x^{π^t}_{j0} ≤ (m²/(1−γ)) · (c^T x^{π^t} − z*)/(c^T x^{π^0} − z*), ∀t ≥ 1.

Proof. Since all non-basic variables of x^π are zero,

  c^T x^π − z* = c^T x^π − e^T y* = (s*)^T x^π = Σ_{j∈π} s*_j x^π_j.

Since the number of non-negative terms in the sum is m, there must be an action j ∈ π such that

  s*_j x^π_j ≥ (1/m) (c^T x^π − z*).

Then, from Lemma 3.1, x^π_j ≤ m/(1−γ), so that
  s*_j ≥ ((1−γ)/m²) (c^T x^π − z*) > 0,

which also implies j ∈ N from Lemma 3.2.

Now, suppose the initial policy π^0 is not optimal and let j0 ∈ π^0 ∩ N be the index identified in policy π^0 such that the above inequality holds, that is,

  s*_{j0} ≥ ((1−γ)/m²) (c^T x^{π^0} − z*).

Then, for any policy π^t generated by the simplex method, if j0 ∈ π^t, we must have

  c^T x^{π^t} − z* = (s*)^T x^{π^t} ≥ s*_{j0} x^{π^t}_{j0},

so that

  x^{π^t}_{j0} ≤ (c^T x^{π^t} − z*)/s*_{j0} ≤ (m²/(1−γ)) · (c^T x^{π^t} − z*)/(c^T x^{π^0} − z*).

These lemmas lead to our key result:

Theorem 4.1 Let π^0 be any given non-optimal policy. Then there is an action j0 ∈ π^0 ∩ N, i.e., a non-optimal action j0 in policy π^0, that would never appear in any of the policies generated by the simplex method after

  T := ⌈ (m/(1−γ)) log(m²/(1−γ)) ⌉

iterations starting from π^0.

Proof. Let j0 ∈ π^0 be a non-optimal action identified in Lemma 4.2. From Lemma 4.1, after t iterations of the simplex method, we have

  c^T x^{π^t} − z* ≤ (1 − (1−γ)/m)^t (c^T x^{π^0} − z*).

Therefore, after t ≥ T + 1 iterations from the initial policy π^0, j0 ∈ π^t implies, by Lemma 4.2,

  x^{π^t}_{j0} ≤ (m²/(1−γ)) · (c^T x^{π^t} − z*)/(c^T x^{π^0} − z*) ≤ (m²/(1−γ)) (1 − (1−γ)/m)^t < 1.

The last inequality above comes from the fact log(1−x) ≤ −x for all x < 1, so that

  log(m²/(1−γ)) + t log(1 − (1−γ)/m) ≤ log(m²/(1−γ)) − t(1−γ)/m < 0

if t ≥ 1 + T ≥ 1 + (m/(1−γ)) log(m²/(1−γ)). But x^{π^t}_{j0} < 1 contradicts the second statement of Lemma 3.1. Thus, j0 ∉ π^t for all t ≥ T + 1.

The event described in Theorem 4.1 can be viewed as a crossover event of Vavasis and Ye [28, 32]: an action, although we don't know which one it is, was in the initial policy but will never stay in or return to the policies after T iterations of the iterative process of the simplex method. We now repeat the same proof for policy π^{T+1}, if it is not optimal yet, in the policy sequence generated by the simplex method. Since policy π^{T+1} is not optimal, there must be a non-optimal action, j1 ∈ π^{T+1} ∩ N with j1 ≠ j0 (because of Theorem 4.1), that would never stay in or return to the policies generated by the simplex method after 2T iterations starting from π^0. Again, we can repeat this process for policy π^{2T+1} if it is not optimal yet, and so on.
In each of these loops (each loop consists of T simplex method iterations), at least one new non-optimal action is eliminated from consideration in any of the future loops of the simplex method. However, we have at most |N| ≤ n − m such non-optimal actions to eliminate, by Lemma 3.2. Hence, the simplex method loops at most n − m times, which gives our main result:

Theorem 4.2 The simplex, or simple policy-iteration, method of Dantzig for solving the discounted Markov decision problem with a fixed discount rate γ is a strongly polynomial-time algorithm. Starting from any policy, the method terminates in at most

  (n − m) · ⌈ (m/(1−γ)) log(m²/(1−γ)) ⌉

iterations, where each iteration uses O(mn) arithmetic operations.

The arithmetic operations count is well known for the simplex method: it uses O(m²) arithmetic operations to update the inverse of the basis (A_{π^t})^{−1} of the current policy π^t and the dual basic solution y^{π^t}, as well as O(mn) arithmetic operations to calculate the reduced costs, and then chooses the incoming basic variable. We now turn our attention to the policy-iteration method, and we have the following corollary:

Corollary 4.1 The policy-iteration method of Howard for solving the discounted Markov decision problem with a fixed discount rate γ is a strongly polynomial-time algorithm. Starting from any policy, it terminates in at most

  (n − m) · ⌈ (m/(1−γ)) log(m²/(1−γ)) ⌉

iterations.
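As a concrete illustration of Theorem 4.2, the sketch below runs the simple policy-iteration (simplex) method with the most-negative-reduced-cost rule on a tiny DMDP and compares the iteration count against the stated bound. The data layout, example, and helper names are ours, for illustration only; this is not the paper's code.

```python
import math

def evaluate(policy, p, cost, gamma, iters=3000):
    """Fixed-point policy evaluation: y <- c_pi + gamma * P_pi^T y."""
    m = len(policy)
    y = [0.0] * m
    for _ in range(iters):
        y = [cost[policy[i]] + gamma * sum(p[policy[i]][i2] * y[i2]
                                           for i2 in range(m)) for i in range(m)]
    return y

def simplex_mdp(actions, p, cost, gamma):
    """Dantzig's most-negative-reduced-cost rule: one action swap per iteration.
    actions[i]: action indices available in state i; p[j]: transition
    distribution of action j; cost[j]: its expected immediate cost."""
    m = len(actions)
    policy = [acts[0] for acts in actions]
    iterations = 0
    while True:
        y = evaluate(policy, p, cost, gamma)
        best, pivot = -1e-9, None            # small tolerance absorbs round-off
        for i in range(m):
            for j in actions[i]:
                # reduced cost of action j in state i: c_j - y_i + gamma*p_j^T y
                rc = cost[j] - y[i] + gamma * sum(p[j][i2] * y[i2] for i2 in range(m))
                if rc < best:
                    best, pivot = rc, (i, j)
        if pivot is None:
            return policy, y, iterations     # cbar >= 0: current policy optimal
        policy[pivot[0]] = pivot[1]
        iterations += 1

def iteration_bound(n, m, gamma):
    """The bound of Theorem 4.2: (n - m) * ceil(m/(1-gamma) * log(m^2/(1-gamma)))."""
    return (n - m) * math.ceil(m / (1 - gamma) * math.log(m * m / (1 - gamma)))
```

On small examples the method typically stops far below the worst-case bound, consistent with the practical behavior noted in Section 2.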

Proof. First, Lemmas 3.1 and 3.2 hold since they are independent of which method is being used. Secondly, Lemma 4.1 still holds for the policy-iteration method, since at any policy π the incoming basic variable j^+ = arg min{ c̄_j : j = 1, 2, ..., n } (that is, c̄_{j^+} = −Δ = min{ c̄_j : j = 1, 2, ..., n }) chosen by the simplex method is always one of the incoming basic variables of the policy-iteration method. Thus, inequality (12) in the proof of Lemma 4.1 can actually be strengthened to

c^T x^π − c^T x^{π^+} = Σ_{i: Δ_i > 0} Δ_i x^{π^+}_{j_i^+} ≥ Δ x^{π^+}_{j^+} ≥ ((1−γ)/m) (c^T x^π − z*),

where Δ_i = −min{ c̄_j : j ∈ A_i } and j_i^+ = arg min{ c̄_j : j ∈ A_i }, which implies the final results of the lemma. Thirdly, the facts established by Lemma 4.2 are also independent of how the policy sequence is generated, as long as the action with the most negative reduced cost is included in the next policy, so they hold for the policy-iteration method as well. Thus, we can conclude that there is an action j_0 ∈ π^0 ∩ N, i.e., a non-optimal action j_0 in the initial non-optimal policy π^0, that would never stay in or return to the policies generated by the policy-iteration method after T iterations. Thus, Theorem 4.1 also holds for the policy-iteration method, which proves the corollary.

Note that, for the policy-iteration method, each iteration could use up to O(m²n) arithmetic operations, depending on how many actions are updated in one iteration.

5. Extensions and Remarks

Our result can be extended to other undiscounted MDPs where every basic feasible matrix of (9) exhibits the Leontief substitution form:

A_π = I − P, for some nonnegative square matrix P with P ≥ 0 and spectral radius ρ(P) ≤ γ for a fixed γ < 1.

This includes MDPs with transient substochastic matrices; see Veinott [30] and Eaves and Rothblum [9].
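For such a nonnegative P with ρ(P) ≤ γ, the resulting bounds on the basic variables can be spot-checked numerically. The sketch below is our own illustration (not from the paper): it uses a sample P whose row sums are at most γ, which is a sufficient condition for ρ(P) ≤ γ, and applies a truncated Neumann series for (I − P)^{−1}:

```python
def apply_inverse_neumann(P, v, terms=400):
    """Approximate (I - P)^{-1} v via the expansion v + P v + P^2 v + ...,
    which converges whenever the spectral radius of P is below 1."""
    m = len(v)
    total, power_v = v[:], v[:]
    for _ in range(terms):
        power_v = [sum(P[i][j] * power_v[j] for j in range(m)) for i in range(m)]
        total = [total[i] + power_v[i] for i in range(m)]
    return total

gamma = 0.8
P = [[0.3, 0.5, 0.0],     # nonnegative, every row sum <= gamma = 0.8,
     [0.0, 0.4, 0.4],     # hence rho(P) <= 0.8 < 1
     [0.2, 0.2, 0.3]]
m = 3
x = apply_inverse_neumann(P, [1.0] * m)
# each component of (I - P)^{-1} e is at least 1 (the k = 0 term) and
# e^T x <= m/(1-gamma), matching the extended form of Lemma 3.1
assert all(xi >= 1.0 for xi in x)
assert sum(x) <= m / (1 - gamma)
print(sum(x), m / (1 - gamma))
```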
Note that the inverse of (I − P) has the expansion form

(I − P)^{−1} = I + P + P² + ⋯,

and ‖(I − P)^{−1} e‖₂ ≤ ‖e‖₂ (1 + γ + γ² + ⋯) = √m/(1 − γ), so that ‖(I − P)^{−1} e‖₁ ≤ m/(1 − γ). Thus, each basic variable value is still between 1 and m/(1 − γ), so that Lemma 3.1 remains true with an inequality (actually stronger for our proof):

e^T x ≤ m/(1 − γ), for every feasible solution x.

Consequently, Lemmas 3.2, 4.1, and 4.2 all hold, which leads to the following corollary.

Corollary 5.1 Let every feasible basis of an MDP have the form I − P, where P ≥ 0 with a spectral radius less than or equal to a fixed γ < 1. Then the simplex and policy-iteration methods are strongly polynomial-time algorithms. Starting from any policy, each of them terminates in at most (n − m) ⌈(m/(1−γ)) log(m²/(1−γ))⌉ iterations.

One observation from our worst-case analyses is that there is no iteration-complexity difference between the simplex method and the policy-iteration method. However, each iteration of the simplex method is more efficient than that of the policy-iteration method. Thus, an open question is whether or not the iteration-complexity bound of the policy-iteration method can be reduced.

Finally, we remark that the pivoting rule seems to make the difference. As we mentioned earlier, for solving DMDPs with a fixed discount rate, the simplex method with the smallest-index pivoting rule (a rule popularly used against cycling in the presence of degeneracy) was shown to be exponential. This is in contrast to the method that uses the original most-negative-reduced-cost pivoting rule of Dantzig, which is proven to be strongly polynomial in the current paper. On the other hand, the most-negative-reduced-cost pivoting rule is exponential for solving some other general LP problems. Thus, searching for suitable pivoting rules for solving different LP problems seems essential, and one cannot rule out the simplex method simply because the behavior of one pivoting rule on one problem is shown to be exponential.

Further possible research directions may answer the questions: Are the classic simplex and policy-iteration methods strongly polynomial-time algorithms for solving MDPs under other optimality criteria? Can the simplex method or the policy-iteration method be strongly polynomial for solving the MDP regardless of discount rates? Or, is there any strongly polynomial-time algorithm for solving the MDP regardless of discount rates?

Acknowledgments. The author would like to thank Pete Veinott and five anonymous referees for many insightful discussions and suggestions on this subject, which have greatly improved the presentation of the paper. This research is supported in part by NSF Grant GOALI and AFOSR Grant FA.

References

[1] E. Altman. Constrained Markov Decision Processes. Chapman and Hall/CRC.
[2] R. Bellman. Dynamic Programming. Princeton University Press, Princeton, New Jersey.
[3] D. P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice-Hall, Englewood Cliffs, New Jersey.
[4] R. E. Bixby. Progress in linear programming. ORSA Journal on Computing, 6(1):15-22.
[5] G. B. Dantzig. Optimal solutions of a dynamic Leontief model with substitution. Econometrica, 23.
[6] G. B. Dantzig. Linear Programming and Extensions. Princeton University Press, Princeton, New Jersey.
[7] G. de Ghellinck. Les problèmes de décisions séquentielles. Cahiers du Centre d'Études de Recherche Opérationnelle, 2.
[8] F. d'Epenoux. A probabilistic production and inventory problem. Management Science, 10:98-108.
[9] B. C. Eaves and U. G. Rothblum. A theory for extending algorithms for parametric problems. Mathematics of Operations Research, 14(3).
[10] B. C. Eaves and A. F. Veinott. Maximum-stopping-value policies in finite Markov population decision chains. Manuscript, Stanford University.
[11] R. E. Erickson. Optimality of stationary halting policies and finite termination of successive approximations. Mathematics of Operations Research, 13(1):90-98.
[12] J. Fearnley. Exponential lower bounds for policy iteration. Technical report, arXiv (v1).
[13] R. A. Howard. Dynamic Programming and Markov Processes. MIT Press, Cambridge.
[14] L. C. M. Kallenberg. Linear Programming and Finite Markovian Control Problems. Mathematical Centre Tracts.
[15] N. Karmarkar. A new polynomial-time algorithm for linear programming. Combinatorica, 4.
[16] L. G. Khachiyan. A polynomial algorithm in linear programming. Doklady Akademii Nauk SSSR, 244.
[17] V. Klee and G. J. Minty. How good is the simplex method? In O. Shisha, editor, Inequalities III, Academic Press, New York.
[18] M. L. Littman, T. L. Dean, and L. P. Kaelbling. On the complexity of solving Markov decision problems. In Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence (UAI-95).
[19] M. Grötschel, L. Lovász, and A. Schrijver. Geometric Algorithms and Combinatorial Optimization. Springer, Berlin.
[20] A. S. Manne. Linear programming and sequential decisions. Management Science, 6.
[21] Y. Mansour and S. Singh. On the complexity of policy iteration. In Proceedings of the 15th International Conference on Uncertainty in AI, 1999.
[22] M. Melekopoglou and A. Condon. On the complexity of the policy improvement algorithm for Markov decision processes. INFORMS Journal on Computing, 6(2).
[23] C. H. Papadimitriou and J. N. Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3).
[24] M. L. Puterman. Markov Decision Processes. John Wiley & Sons, New York.
[25] U. G. Rothblum. Multiplicative Markov decision chains. Doctoral dissertation, Department of Operations Research, Stanford University, Stanford.
[26] L. S. Shapley. Stochastic games. Proceedings of the National Academy of Sciences, 39(10).
[27] P. Tseng. Solving H-horizon, stationary Markov decision problems in time proportional to log(H). Operations Research Letters, 9(5).
[28] S. Vavasis and Y. Ye. A primal-dual interior-point method whose running time depends only on the constraint matrix. Mathematical Programming, 74:79-120.
[29] A. F. Veinott. Extreme points of Leontief substitution systems. Linear Algebra and its Applications, 1.
[30] A. F. Veinott. Discrete dynamic programming with sensitive discount optimality criteria. The Annals of Mathematical Statistics, 40(5).
[31] A. F. Veinott. Lectures in dynamic programming and stochastic control. Lecture Notes of MS&E 351, Stanford University, Stanford.
[32] Y. Ye. A new complexity result on solving the Markov decision problem. Mathematics of Operations Research, 30(3), 2005.


More information

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs

On the Inapproximability of Vertex Cover on k-partite k-uniform Hypergraphs On the Inapproxiability of Vertex Cover on k-partite k-unifor Hypergraphs Venkatesan Guruswai and Rishi Saket Coputer Science Departent Carnegie Mellon University Pittsburgh, PA 1513. Abstract. Coputing

More information

Boosting with log-loss

Boosting with log-loss Boosting with log-loss Marco Cusuano-Towner Septeber 2, 202 The proble Suppose we have data exaples {x i, y i ) i =... } for a two-class proble with y i {, }. Let F x) be the predictor function with the

More information

Page 1 Lab 1 Elementary Matrix and Linear Algebra Spring 2011

Page 1 Lab 1 Elementary Matrix and Linear Algebra Spring 2011 Page Lab Eleentary Matri and Linear Algebra Spring 0 Nae Due /03/0 Score /5 Probles through 4 are each worth 4 points.. Go to the Linear Algebra oolkit site ransforing a atri to reduced row echelon for

More information

On Rough Interval Three Level Large Scale Quadratic Integer Programming Problem

On Rough Interval Three Level Large Scale Quadratic Integer Programming Problem J. Stat. Appl. Pro. 6, No. 2, 305-318 2017) 305 Journal of Statistics Applications & Probability An International Journal http://dx.doi.org/10.18576/jsap/060206 On Rough Interval Three evel arge Scale

More information

Lecture 9 November 23, 2015

Lecture 9 November 23, 2015 CSC244: Discrepancy Theory in Coputer Science Fall 25 Aleksandar Nikolov Lecture 9 Noveber 23, 25 Scribe: Nick Spooner Properties of γ 2 Recall that γ 2 (A) is defined for A R n as follows: γ 2 (A) = in{r(u)

More information

Chaotic Coupled Map Lattices

Chaotic Coupled Map Lattices Chaotic Coupled Map Lattices Author: Dustin Keys Advisors: Dr. Robert Indik, Dr. Kevin Lin 1 Introduction When a syste of chaotic aps is coupled in a way that allows the to share inforation about each

More information

Constrained Consensus and Optimization in Multi-Agent Networks arxiv: v2 [math.oc] 17 Dec 2008

Constrained Consensus and Optimization in Multi-Agent Networks arxiv: v2 [math.oc] 17 Dec 2008 LIDS Report 2779 1 Constrained Consensus and Optiization in Multi-Agent Networks arxiv:0802.3922v2 [ath.oc] 17 Dec 2008 Angelia Nedić, Asuan Ozdaglar, and Pablo A. Parrilo February 15, 2013 Abstract We

More information

. The univariate situation. It is well-known for a long tie that denoinators of Pade approxiants can be considered as orthogonal polynoials with respe

. The univariate situation. It is well-known for a long tie that denoinators of Pade approxiants can be considered as orthogonal polynoials with respe PROPERTIES OF MULTIVARIATE HOMOGENEOUS ORTHOGONAL POLYNOMIALS Brahi Benouahane y Annie Cuyt? Keywords Abstract It is well-known that the denoinators of Pade approxiants can be considered as orthogonal

More information

Introduction to Discrete Optimization

Introduction to Discrete Optimization Prof. Friedrich Eisenbrand Martin Nieeier Due Date: March 9 9 Discussions: March 9 Introduction to Discrete Optiization Spring 9 s Exercise Consider a school district with I neighborhoods J schools and

More information

lecture 36: Linear Multistep Mehods: Zero Stability

lecture 36: Linear Multistep Mehods: Zero Stability 95 lecture 36: Linear Multistep Mehods: Zero Stability 5.6 Linear ultistep ethods: zero stability Does consistency iply convergence for linear ultistep ethods? This is always the case for one-step ethods,

More information

Lower Bounds for Quantized Matrix Completion

Lower Bounds for Quantized Matrix Completion Lower Bounds for Quantized Matrix Copletion Mary Wootters and Yaniv Plan Departent of Matheatics University of Michigan Ann Arbor, MI Eail: wootters, yplan}@uich.edu Mark A. Davenport School of Elec. &

More information

Controlled Flooding Search with Delay Constraints

Controlled Flooding Search with Delay Constraints Controlled Flooding Search with Delay Constraints Nicholas B. Chang and Mingyan Liu Departent of Electrical Engineering and Coputer Science University of Michigan, Ann Arbor, MI 4809-222 Eail: {changn,ingyan}@eecs.uich.edu

More information

ON REGULARITY, TRANSITIVITY, AND ERGODIC PRINCIPLE FOR QUADRATIC STOCHASTIC VOLTERRA OPERATORS MANSOOR SABUROV

ON REGULARITY, TRANSITIVITY, AND ERGODIC PRINCIPLE FOR QUADRATIC STOCHASTIC VOLTERRA OPERATORS MANSOOR SABUROV ON REGULARITY TRANSITIVITY AND ERGODIC PRINCIPLE FOR QUADRATIC STOCHASTIC VOLTERRA OPERATORS MANSOOR SABUROV Departent of Coputational & Theoretical Sciences Faculty of Science International Islaic University

More information

4 = (0.02) 3 13, = 0.25 because = 25. Simi-

4 = (0.02) 3 13, = 0.25 because = 25. Simi- Theore. Let b and be integers greater than. If = (. a a 2 a i ) b,then for any t N, in base (b + t), the fraction has the digital representation = (. a a 2 a i ) b+t, where a i = a i + tk i with k i =

More information

Sharp Time Data Tradeoffs for Linear Inverse Problems

Sharp Time Data Tradeoffs for Linear Inverse Problems Sharp Tie Data Tradeoffs for Linear Inverse Probles Saet Oyak Benjain Recht Mahdi Soltanolkotabi January 016 Abstract In this paper we characterize sharp tie-data tradeoffs for optiization probles used

More information

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay

A Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay A Low-Coplexity Congestion Control and Scheduling Algorith for Multihop Wireless Networks with Order-Optial Per-Flow Delay Po-Kai Huang, Xiaojun Lin, and Chih-Chun Wang School of Electrical and Coputer

More information

ADVANCES ON THE BESSIS- MOUSSA-VILLANI TRACE CONJECTURE

ADVANCES ON THE BESSIS- MOUSSA-VILLANI TRACE CONJECTURE ADVANCES ON THE BESSIS- MOUSSA-VILLANI TRACE CONJECTURE CHRISTOPHER J. HILLAR Abstract. A long-standing conjecture asserts that the polynoial p(t = Tr(A + tb ] has nonnegative coefficients whenever is

More information

The Hilbert Schmidt version of the commutator theorem for zero trace matrices

The Hilbert Schmidt version of the commutator theorem for zero trace matrices The Hilbert Schidt version of the coutator theore for zero trace atrices Oer Angel Gideon Schechtan March 205 Abstract Let A be a coplex atrix with zero trace. Then there are atrices B and C such that

More information

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion

Supplementary Material for Fast and Provable Algorithms for Spectrally Sparse Signal Reconstruction via Low-Rank Hankel Matrix Completion Suppleentary Material for Fast and Provable Algoriths for Spectrally Sparse Signal Reconstruction via Low-Ran Hanel Matrix Copletion Jian-Feng Cai Tianing Wang Ke Wei March 1, 017 Abstract We establish

More information

Determining OWA Operator Weights by Mean Absolute Deviation Minimization

Determining OWA Operator Weights by Mean Absolute Deviation Minimization Deterining OWA Operator Weights by Mean Absolute Deviation Miniization Micha l Majdan 1,2 and W lodziierz Ogryczak 1 1 Institute of Control and Coputation Engineering, Warsaw University of Technology,

More information

LOWER BOUNDS FOR THE MAXIMUM NUMBER OF SOLUTIONS GENERATED BY THE SIMPLEX METHOD

LOWER BOUNDS FOR THE MAXIMUM NUMBER OF SOLUTIONS GENERATED BY THE SIMPLEX METHOD Journal of the Operations Research Society of Japan Vol 54, No 4, December 2011, pp 191 200 c The Operations Research Society of Japan LOWER BOUNDS FOR THE MAXIMUM NUMBER OF SOLUTIONS GENERATED BY THE

More information

FPTAS for optimizing polynomials over the mixed-integer points of polytopes in fixed dimension

FPTAS for optimizing polynomials over the mixed-integer points of polytopes in fixed dimension Matheatical Prograing anuscript No. (will be inserted by the editor) Jesús A. De Loera Rayond Heecke Matthias Köppe Robert Weisantel FPTAS for optiizing polynoials over the ixed-integer points of polytopes

More information

On Fuzzy Three Level Large Scale Linear Programming Problem

On Fuzzy Three Level Large Scale Linear Programming Problem J. Stat. Appl. Pro. 3, No. 3, 307-315 (2014) 307 Journal of Statistics Applications & Probability An International Journal http://dx.doi.org/10.12785/jsap/030302 On Fuzzy Three Level Large Scale Linear

More information

A Simplified Analytical Approach for Efficiency Evaluation of the Weaving Machines with Automatic Filling Repair

A Simplified Analytical Approach for Efficiency Evaluation of the Weaving Machines with Automatic Filling Repair Proceedings of the 6th SEAS International Conference on Siulation, Modelling and Optiization, Lisbon, Portugal, Septeber -4, 006 0 A Siplified Analytical Approach for Efficiency Evaluation of the eaving

More information

An Algorithm for Posynomial Geometric Programming, Based on Generalized Linear Programming

An Algorithm for Posynomial Geometric Programming, Based on Generalized Linear Programming An Algorith for Posynoial Geoetric Prograing, Based on Generalized Linear Prograing Jayant Rajgopal Departent of Industrial Engineering University of Pittsburgh, Pittsburgh, PA 526 Dennis L. Bricer Departent

More information

A New Model for Selfish Routing

A New Model for Selfish Routing A New Model for Selfish Routing Thoas Lücking Marios Mavronicolas Burkhard Monien Manuel Rode Abstract In this work, we introduce and study a new, potentially rich odel for selfish routing over non-cooperative

More information

Chapter 6 1-D Continuous Groups

Chapter 6 1-D Continuous Groups Chapter 6 1-D Continuous Groups Continuous groups consist of group eleents labelled by one or ore continuous variables, say a 1, a 2,, a r, where each variable has a well- defined range. This chapter explores:

More information

The Inferential Complexity of Bayesian and Credal Networks

The Inferential Complexity of Bayesian and Credal Networks The Inferential Coplexity of Bayesian and Credal Networks Cassio Polpo de Capos,2 Fabio Gagliardi Cozan Universidade de São Paulo - Escola Politécnica 2 Pontifícia Universidade Católica de São Paulo cassio@pucsp.br,

More information