The Simplex and Policy-Iteration Methods are Strongly Polynomial for the Markov Decision Problem with a Fixed Discount Rate
Yinyu Ye

April 20, 2010; revised November 30, 2010

(Department of Management Science and Engineering, Stanford University, Stanford, CA 94305; E-mail: yinyu-ye@stanford.edu. This researcher is supported in part by NSF Grant GOALI and AFOSR Grant FA.)

Abstract

We prove that the classic policy-iteration method (Howard 1960), including the Simplex method (Dantzig 1947) with the most-negative-reduced-cost pivoting rule, is a strongly polynomial-time algorithm for solving the Markov decision problem (MDP) with a fixed discount rate. Furthermore, the computational complexity of the policy-iteration method (including the Simplex method) is superior to that of the only known strongly polynomial-time interior-point algorithm ([28] 2005) for solving this problem. The result is surprising, since the Simplex method with the same pivoting rule was shown to be exponential for solving a general linear programming (LP) problem, the Simplex (or simple policy-iteration) method with the smallest-index pivoting rule was shown to be exponential for solving an MDP regardless of discount rates, and the policy-iteration method was recently shown to be exponential for solving an undiscounted MDP. We also extend the result to solving MDPs with sub-stochastic and transient state transition probability matrices.

1 Introduction of the Markov decision problem and its linear programming formulation

Markov decision problems (MDPs), named after Andrey Markov, provide a mathematical framework for modeling decision-making in situations where outcomes are partly random
and partly under the control of a decision maker. The MDP is one of the most fundamental models for studying a wide range of optimization problems solved via dynamic programming and reinforcement learning. Today, it has been used in a variety of areas, including management, economics, bioinformatics, electronic commerce, social networking, and supply chains.

More precisely, an MDP is a discrete-time stochastic control process. At each time step, the process is in some state $i$, and the decision maker may choose any action, say action $j$, that is available in state $i$. The process responds at the next time step by randomly moving into a new state $i'$ and giving the decision maker a corresponding immediate cost $c_j(i, i')$. Let $m$ denote the total number of states. The probability that the process enters $i'$ as its new state is influenced by the chosen state-action $j$. Specifically, it is given by a state transition probability distribution

$$p_j(i, i') \ge 0,\quad i' = 1, \dots, m,\qquad \text{and}\qquad \sum_{i'=1}^{m} p_j(i, i') = 1,\quad i = 1, \dots, m.$$

Thus, the next state $i'$ depends on the current state $i$ and the decision maker's chosen state-action $j$, but is conditionally independent of all previous states and actions; in other words, the state transitions of an MDP possess the Markov property.

The key decision of MDPs is to find a (stationary) policy for the decision maker: a set function $\pi = \{\pi_1, \pi_2, \dots, \pi_m\}$ that specifies the action $\pi_i$ that the decision maker will choose when in state $i$, for $i = 1, \dots, m$. The goal of the problem is to find a (stationary) policy $\pi$ that will minimize some cumulative function of the random costs, typically the expected discounted sum over an infinite horizon:

$$\sum_{t=0}^{\infty} \gamma^t\, c_{\pi_{i_t}}(i_t, i_{t+1}),$$

where $c_{\pi_{i_t}}(i_t, i_{t+1})$ represents the cost, at time $t$, incurred by an individual who is in state $i_t$ and takes action $\pi_{i_t}$. Here $\gamma$ is the discount rate, where $\gamma \ge 0$ and is assumed to be strictly less than 1 in this paper. This MDP problem is called the infinite-horizon discounted Markov decision problem (DMDP), which serves as the core model for MDPs.
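To make the discounted objective concrete, here is a small simulation sketch (a toy two-state instance of our own, not from the paper) that estimates the expected discounted sum above for a fixed policy by Monte Carlo:

```python
import random

# Hypothetical DMDP under a fixed policy: m = 2 states.
# P[i] is the transition distribution from state i under the policy's action;
# C[i][i2] is the immediate cost c_{pi_i}(i, i2); gamma < 1 is the discount rate.
P = [[0.9, 0.1], [0.2, 0.8]]
C = [[1.0, 4.0], [2.0, 0.5]]
gamma = 0.9

def discounted_cost(start, horizon=2000, seed=0):
    """One-run Monte Carlo estimate of sum_t gamma^t c(i_t, i_{t+1})."""
    rng = random.Random(seed)
    i, total, disc = start, 0.0, 1.0
    for _ in range(horizon):
        i2 = rng.choices([0, 1], weights=P[i])[0]  # next state ~ p(i, .)
        total += disc * C[i][i2]
        disc *= gamma
        i = i2
    return total

print(discounted_cost(0))
```

Since every one-step cost here lies in $[0.5, 4]$, any run's discounted total lies between roughly $0.5/(1-\gamma)$ and $4/(1-\gamma)$; averaging many runs approximates the cost-to-go value of the start state.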
Because of the Markov property, there is an optimal stationary policy, or policy for short, for the DMDP, so that it can indeed be written as a function of $i$ only; that is, $\pi$ is independent of time $t$ as described above.

Let $k_i$ be the number of state-actions available in state $i$, $i = 1, \dots, m$, and let

$$A_1 = \{1, 2, \dots, k_1\},\quad A_2 = \{k_1 + 1,\, k_1 + 2,\, \dots,\, k_1 + k_2\},\ \dots$$
or, for $i = 1, 2, \dots, m$ in general,

$$A_i := \left\{ \sum_{s=1}^{i-1} k_s + 1,\ \sum_{s=1}^{i-1} k_s + 2,\ \dots,\ \sum_{s=1}^{i} k_s \right\}.$$

Moreover, let $n = \sum_{i=1}^{m} k_i$, and let the total $n$ state-actions be ordered such that if $j \in A_i$, then state-action $j$ is controlled by state $i$. Note that the cardinality $|A_i| = k_i$.

Suppose we know the state transition probabilities $P$ and the cost function $c$, and we wish to calculate the policy that minimizes the expected discounted cost. Then a policy $\pi$ would be associated with another array indexed by state, the value vector $v \in \mathbb{R}^m$, which contains the cost-to-go values of all states. Furthermore, an optimal policy, $(v^*, \pi^*)$, is a fixed point of the following minimum-cost operator:

$$\pi^*_i := \arg\min_{j \in A_i} \Big\{ \sum_{i'} p_j(i, i')\big(c_j(i, i') + \gamma v^*_{i'}\big) \Big\};\qquad v^*_i := \sum_{i'} p_{\pi^*_i}(i, i')\big(c_{\pi^*_i}(i, i') + \gamma v^*_{i'}\big),\quad i = 1, \dots, m. \quad (1)$$

Let $P_\pi \in \mathbb{R}^{m \times m}$ be the column stochastic matrix corresponding to a policy $\pi$; that is, the $i$th column of $P_\pi$ is the probability distribution $p_{\pi_i}(i, i')$, $i' = 1, \dots, m$. Then the equilibrium condition of (1) can be represented in matrix form:

$$(I - \gamma P_{\pi^*}^T)\, v^* = c_{\pi^*},\qquad (I - \gamma P_{\pi}^T)\, v^* \le c_{\pi},\ \forall \pi, \quad (2)$$

where the $i$th entry of the column vector $c_\pi \in \mathbb{R}^m$ equals $\sum_{i'} p_{\pi_i}(i, i')\, c_{\pi_i}(i, i')$.

Due to d'Epenoux [9] (also see Manne [17] and de Ghellinck [8]), the infinite-horizon discounted MDP can be formulated as a primal linear programming (LP) problem in the standard form

$$\text{minimize } c^T x \quad \text{subject to } Ax = b,\ x \ge 0, \quad (3)$$

with the dual

$$\text{maximize } b^T y \quad \text{subject to } s = c - A^T y \ge 0, \quad (4)$$

where $A \in \mathbb{R}^{m \times n}$ is a given real matrix with rank $m$, $c \in \mathbb{R}^n$ and $b \in \mathbb{R}^m$ are given real vectors, $0$ denotes the vector of all zeros, and $x \in \mathbb{R}^n$ and $(y \in \mathbb{R}^m,\ s \in \mathbb{R}^n)$ are the unknown primal and dual variables, respectively. The vector $s$ is often called the dual slack vector. In what follows, LP stands for any of the following: linear program, linear programs, or linear programming, depending on the context. More precisely, the DMDP can be represented by the LP problems (3) and (4) with the following assignments of $(A, b, c)$.
First, the $i$th entry of the column vector $b \in \mathbb{R}^m$ is 1 for all $i$, representing an initial population of one individual in state $i$. Secondly, the $j$th entry of
the column vector $c \in \mathbb{R}^n$ is the (expected) one-time unit cost of action $j$ taken by a state. In particular, if $j \in A_i$, then action $j$ is controlled by state $i$ and

$$c_j = \sum_{i'} p_j(i, i')\, c_j(i, i'). \quad (5)$$

The LP constraint matrix has the form

$$A = E - \gamma P \in \mathbb{R}^{m \times n}. \quad (6)$$

The $j$th column of $P$ is the state transition probability distribution when the $j$th action is taken by a state. More precisely, for each state-action $j \in A_i$, that is, each action $j$ controlled by state $i$,

$$P_{i'j} = p_j(i, i'),\quad i' = 1, \dots, m. \quad (7)$$

Finally, the $i$th element of the $j$th column of $E$ is 1 if action $j$ is controlled by state $i$, and zero everywhere else:

$$E_{ij} = \begin{cases} 1, & \text{if } j \in A_i, \\ 0, & \text{otherwise}, \end{cases}\qquad i = 1, \dots, m,\ j = 1, \dots, n. \quad (8)$$

Let $e$ be the vector of all ones, whose dimension depends on the context. Then we have $b = e$, $e^T P = e^T$ (that is, $P$ is a column stochastic matrix), $e^T E = e^T$, and $e^T A = (1 - \gamma)\, e^T$.

The interpretations of the quantities defining the DMDP primal (3) and the DMDP dual (4) are as follows. $b = e$ means that there is one unit of initial individuals in each state $i$. The $j$th entry, for $j \in A_i$, of the primal variable $x \in \mathbb{R}^n$ is the state-action frequency of action $j$: the expected present value of the number of times an individual is in state $i$ and takes state-action $j$. Thus, solving the DMDP primal entails choosing state-action frequencies that minimize the expected present value, $c^T x$, of the total cost subject to the conservation law $Ax = e$. The conservation law ensures that, for each state $i$, the expected present value of the number of individuals entering state $i$ equals the expected present value of the number of individuals leaving $i$. The DMDP dual variables $y \in \mathbb{R}^m$ exactly represent the expected present cost-to-go values of the $m$ states. Solving the dual entails choosing dual variables $y$, one for each state $i$, together with slack variables $s \in \mathbb{R}^n$, one for each state-action $j$, that maximize $e^T y$ subject to $A^T y + s = c$, $s \ge 0$, or simply $A^T y \le c$.
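The assignments (5)-(8) can be assembled mechanically. The sketch below is our own illustration on a hypothetical instance with $m = 2$ states and $n = 3$ state-actions ($A_1 = \{0, 1\}$, $A_2 = \{2\}$ in zero-based indexing; the names `owner` and `P_cols` are ours, not the paper's); it builds $(A, b, c)$ and checks the identity $e^T A = (1 - \gamma)e^T$ stated above:

```python
# Build the DMDP LP data (A, b, c) of (3)-(8) for a hypothetical instance.
# owner[j] is the state controlling action j; column j of P is P_cols[j];
# cost[j][i2] is the immediate cost c_j(i, i2) per next state i2.
gamma = 0.9
m, n = 2, 3
owner = [0, 0, 1]
P_cols = [[0.5, 0.5], [1.0, 0.0], [0.1, 0.9]]
cost = [[2.0, 1.0], [0.0, 5.0], [1.0, 1.0]]

E = [[1.0 if owner[j] == i else 0.0 for j in range(n)] for i in range(m)]
A = [[E[i][j] - gamma * P_cols[j][i] for j in range(n)] for i in range(m)]
b = [1.0] * m                                             # b = e
c = [sum(P_cols[j][i2] * cost[j][i2] for i2 in range(m))  # expected cost (5)
     for j in range(n)]

# Sanity check from the text: e^T A = (1 - gamma) e^T, since e^T E = e^T P = e^T.
col_sums = [sum(A[i][j] for i in range(m)) for j in range(n)]
print(c, col_sums)
```

Every column of $A$ sums to $1 - \gamma$, which is what makes the feasible set bounded (Lemma 1 below).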
It is well known that there exist unique optimal $y^*$ and $s^*$ where, for each state $i$, $y^*_i$ is the minimum expected present cost that an individual in state $i$ and its progeny can incur. A policy $\pi$ of the original DMDP, containing exactly one action in $A_i$ for each state $i$, corresponds to the basic variable indexes of a basic feasible solution (BFS) of the DMDP primal LP formulation. Obviously, we have a total of $\prod_{i=1}^{m} k_i$ different policies. Let the matrix $A_\pi \in \mathbb{R}^{m \times m}$ (resp. $P_\pi$, $E_\pi$) consist of the columns of $A$ (resp. $P$, $E$) with indexes in $\pi$. Then for a policy $\pi$, $E_\pi = I$ (where $I$ is the identity matrix), so that $A_\pi$ has the
Leontief substitution form $A_\pi = I - \gamma P_\pi$. It is also well known that $A_\pi$ is nonsingular, has a nonnegative inverse, and is a feasible basis for the DMDP primal. Let $x^\pi$ be the BFS for a policy $\pi$ in the DMDP primal form, and let $\nu$ contain the rest of the indexes not in $\pi$. Let $x_\pi$ and $x_\nu$ be the sub-vectors of $x$ whose indexes are respectively in $\pi$ and $\nu$. Then the nonbasic variables $x^\pi_\nu = 0$, and the basic variables $x^\pi_\pi$ are the unique solution to $A_\pi x_\pi = e$. The corresponding basic solution of the dual is the vector $y^\pi$ that is the unique solution to $A_\pi^T y = c_\pi$. The basic solution $y^\pi$ of the dual is feasible if also $A_\nu^T y^\pi \le c_\nu$, or $s_\nu \ge 0$. The basic solution pair $x^\pi$ and $y^\pi$ of the DMDP primal and dual is optimal if and only if both are feasible. If policy $\pi$ produces optimal $x^\pi$ and $y^\pi$, then $\pi$ is an optimal policy $\pi^*$ and $y^\pi$ is exactly $v^*$. Note that the constraints $A_\pi^T y^\pi = c_\pi$ and $A^T y^\pi \le c$ describe the same condition as (2) for $v^*$, for each policy $\pi$ or for each state-action $j$.

2 The Markov decision problem methods and their complexities

There are several major events in the development of methods for solving DMDPs. Bellman (1957) [1] developed a successive approximation method, called value-iteration, which computes the optimal total cost function assuming first a one-stage finite horizon, then a two-stage finite horizon, and so on. The total cost functions so computed are guaranteed to converge in the limit to the optimal total cost function. It should be noted that, even prior to Bellman, Shapley (1953) [23] used value-iteration to solve DMDPs in the context of zero-sum two-person stochastic games.

The other best-known method is due to Howard (1960) [12] and is known as policy-iteration; it generates an optimal policy in a finite number of iterations. Policy-iteration alternates between a value-determination phase, in which the current policy is evaluated, and a policy-improvement phase, in which an attempt is made to improve the current policy.
In the policy-improvement phase, the policy-iteration method updates possibly improved actions for every state in one iteration. If the current policy is improved for at most one state in one iteration, then the method is called simple policy-iteration. We will come back to the policy-iteration and simple policy-iteration methods later in terms of the LP formulation.

Since it was discovered in 1960 that the DMDP has an LP formulation, the Simplex method of Dantzig (1947) [5] can be used to solve DMDPs. It turns out that the Simplex method, when applied to solving DMDPs, is the simple policy-iteration method. Other general LP methods, such as the Ellipsoid method and interior-point algorithms, are also capable of solving DMDPs.

As the notion of computational complexity emerged, there were tremendous efforts in analyzing the complexity of the MDP and its solution methods. On the positive side, since
it (with or without discount) can be formulated as a linear program, the MDP can be solved in polynomial time by either the Ellipsoid method (e.g., Khachiyan (1979) [14]) or an interior-point algorithm (e.g., Karmarkar (1984) [13]). Here, polynomial time means that the number of arithmetic operations needed to compute an optimal policy is bounded by a polynomial in the numbers of states and actions and the bit-size of the input data, which are assumed to be rational numbers. Papadimitriou and Tsitsiklis [20] then showed in 1987 that an MDP with deterministic transitions (i.e., each entry of the state transition probability matrices is either 0 or 1) can be solved in strongly polynomial time (i.e., the number of arithmetic operations is bounded by a polynomial in the numbers of states and actions only) as a Minimum-Mean-Cost-Cycle problem. Erickson [7] in 1988 showed that successive approximations suffice, in strongly polynomial time, to either (1) produce an optimal stationary halting policy, or (2) show that no such policy exists, based on the work of Eaves and Veinott [6] and Rothblum [22].

There were also great research interest and progress in the value-iteration and policy-iteration methods for solving the DMDP. Bertsekas [2] in 1987 showed that the value-iteration method converges to the optimal policy in a finite number of iterations. Tseng [24] in 1990 showed that the value-iteration method generates an optimal policy in polynomial time for the DMDP when the discount rate is fixed. Puterman [21] in 1994 showed that the policy-iteration method converges no more slowly than the value-iteration method, so that it is also a polynomial-time algorithm for the DMDP with a fixed discount rate.
This fact was actually known to Veinott (and perhaps others) three decades earlier and used in dynamic programming courses he taught for a number of years, well before Mansour and Singh [18] in 1994 also gave an upper bound, $k^m/m$, on the number of iterations of the policy-iteration method for solving the DMDP when each state has $k$ actions. (Note that the total number of possible policies is $k^m$, so that the result is not much better than that of complete enumeration.) In 2005, Ye [28] developed a strongly polynomial-time combinatorial interior-point algorithm (CIPA) for the DMDP with a fixed discount rate; that is, the number of arithmetic operations is bounded by a polynomial in only the numbers of states and actions.

In terms of the worst-case complexity bound on the number of arithmetic operations, the current best results (within a constant factor) are summarized in the following table, when there are exactly $k$ actions in each of the $m$ states; see Littman et al. [16], Mansour and Singh [18], Ye [28], and references therein.

Value-Iteration: $\dfrac{m^2 k\, L(P,c,\gamma)\, \log(1/(1-\gamma))}{1-\gamma}$

Policy-Iteration: $\min\left\{ \dfrac{m^3 k\, k^m}{m},\ \dfrac{m^3 k\, L(P,c,\gamma)\, \log(1/(1-\gamma))}{1-\gamma} \right\}$

LP-Algorithms: $m^3 k^2\, L(P,c,\gamma)$

CIPA: $m^4 k^4 \log\dfrac{m}{1-\gamma}$

Here, $L(P, c, \gamma)$ is the total bit-size of the DMDP input data in the linear programming form, given that $(P, c, \gamma)$ have only rational entries. As one can see from the table, both the
value-iteration and policy-iteration methods are polynomial-time algorithms if the discount rate $0 \le \gamma < 1$ is fixed. But they are not strongly polynomial, where the running time needs to be a polynomial in $m$ and $k$ only. The proof of polynomial time for the value- and policy-iteration methods is essentially due to the argument that, when the gap between the objective value of the current policy (or BFS) and the optimal one is smaller than $2^{-L(P,c,\gamma)}$, the current policy must be optimal; e.g., see [11]. However, the proof of a strongly polynomial-time algorithm cannot rely on this argument, since $(P, c, \gamma)$ may have irrational entries, so that the bit-size of the data can be infinite.

In practice, the policy-iteration method, including the simple policy-iteration or Simplex method, has been remarkably successful and has proven to be a most effective and widely used method. The number of iterations is typically bounded by $O(km)$. It turns out that the policy-iteration method is actually the Simplex method with block pivots at each iteration; and the Simplex method also remains one of the very few extremely effective methods for solving general LPs; see Bixby [3].

In the past 50 years, many efforts have been made to resolve the worst-case complexity issue of the policy-iteration method and the Simplex method, and to answer the question: are the policy-iteration and Simplex methods strongly polynomial-time algorithms? Unfortunately, so far most results have been negative. Klee and Minty [15] showed in 1972 that the classic Simplex method, with Dantzig's original most-negative-reduced-cost pivoting rule, necessarily takes an exponential number of iterations to solve a carefully designed LP problem. Later, a similar negative result of Melekopoglou and Condon [19] showed that one simple policy-iteration method, where only the action for the state with the smallest index is updated, needs an exponential number of iterations to compute an optimal policy for a specific DMDP problem regardless of discount rates (i.e., even when $\gamma < 1$ is fixed).
Most recently, Fearnley (2010) [10] showed that the policy-iteration method needs an exponential number of iterations for an undiscounted but finite-horizon MDP. Thus, it seems impossible for the policy-iteration method to be a strongly polynomial-time algorithm for solving general MDPs. What about the DMDP with a fixed discount rate? Is there a pivoting rule that makes the Simplex and policy-iteration methods strongly polynomial for the DMDP?

In this paper, we prove that the classic Simplex method, or the simple policy-iteration method, with the most-negative-reduced-cost pivoting rule, is indeed a strongly polynomial-time algorithm for the DMDP with a fixed discount rate $0 \le \gamma < 1$. The number of its iterations is bounded by

$$\frac{m^2 (k-1)}{1-\gamma}\, \log\!\left(\frac{m^2}{1-\gamma}\right),$$

and each iteration uses at most $O(m^2 k)$ arithmetic operations. The result seems surprising, given the earlier negative results mentioned above. Since the policy-iteration method with the all-negative-reduced-cost pivoting rule is at
least as good as the simple policy-iteration method, we prove that it is also a strongly polynomial-time algorithm with the same iteration complexity bound. Therefore, the worst-case operation complexity, $O(m^4 k^2 \log m)$, of the Simplex method is actually superior to that, $O(m^4 k^4 \log m)$, of the combinatorial interior-point algorithm [28] for solving DMDPs when the discount rate is a fixed constant. If the number of actions varies among the states, our worst-case iteration bound would be

$$(n - m)\cdot \frac{m}{1-\gamma}\cdot \log\!\left(\frac{m^2}{1-\gamma}\right),$$

and each iteration uses at most $O(mn)$ arithmetic operations, where $n$ is again the total number of actions. One can see that the worst-case iteration complexity bound is linear in the total number of actions, as is observed in practice.

We remark that, if the discount rate is part of the input, it remains open whether or not the policy-iteration method is polynomial for the MDP, and whether or not there exists a strongly polynomial-time algorithm for the MDP or LP in general.

3 DMDP Properties and the Simplex and policy-iteration methods

We first describe a few general LP and DMDP theorems and the classic Simplex and policy-iteration methods. We will use the LP formulations (3) and (4) for the DMDP and the terminology presented in the Introduction. Recall that, for the DMDP, $b = e \in \mathbb{R}^m$, $A = E - \gamma P \in \mathbb{R}^{m \times n}$, and $c$, $P$ and $E$ are defined in (5), (7) and (8), respectively.

3.1 DMDP Properties

The optimality conditions for all optimal solutions of a general LP may be written as follows:

$$Ax = b,\qquad A^T y + s = c,\qquad s_j x_j = 0,\ j = 1, \dots, n,\qquad x \ge 0,\ s \ge 0,$$

where the third condition is often referred to as the complementarity condition.

Let $\pi$ be the index set of state-actions corresponding to a policy. Then, as we briefly mentioned earlier, $x^\pi$ is a BFS of the DMDP primal, and the basis $A_\pi$ has the form $A_\pi = (I - \gamma P_\pi)$, where $P_\pi$ is a column stochastic matrix, that is, $P_\pi \ge 0$ and $e^T P_\pi = e^T$. In fact, the converse is also true: the index set $\pi$ of basic variables of every BFS of the DMDP primal is a policy for the original DMDP.
In other words, $\pi$ must have exactly one variable or action index in $A_i$ for each state $i$. Thus, we have the following lemma.
Lemma 1 The DMDP primal linear programming formulation has the following properties:

1. There is a one-to-one correspondence between a (stationary) policy of the original DMDP and a basic feasible solution of the DMDP primal.

2. Let $x^\pi$ be a basic feasible solution of the DMDP primal. Then any basic variable, say $x^\pi_i$, has its value bounded by

$$1 \le x^\pi_i \le \frac{m}{1-\gamma}.$$

3. The feasible set of the DMDP primal is bounded. More precisely,

$$e^T x = \frac{m}{1-\gamma}$$

for every feasible $x \ge 0$.

Proof. Let $\pi$ be the basis set of any basic feasible solution for the DMDP primal. Then the first statement can be seen as follows. Consider the coefficients of the $i$th row of $A$. From the structure of (6), (7) and (8), we must have $a_{ij} \le 0$ for all $j \notin A_i$. Thus, if no basic variable is chosen from $A_i$, that is, $\pi \cap A_i = \emptyset$, then

$$1 = \sum_{j \in \pi} a_{ij}\, x^\pi_j = \sum_{j \in \pi,\, j \notin A_i} a_{ij}\, x^\pi_j \le 0,$$

which is a contradiction. Thus, each state must have a state-action in $\pi$. On the other hand, $|\pi| = m$. Therefore, $\pi$ must contain exactly one action index in $A_i$ for each state $i = 1, \dots, m$; that is, $\pi$ is a policy.

The last two statements of the lemma were given in [28], whose proofs were based on Dantzig [4, 5] and Veinott [26].

By the first statement of Lemma 1, in what follows we simply call the basis index set $\pi$ of any BFS of the DMDP primal a policy. For the basis $A_\pi = (I - \gamma P_\pi)$ of any policy $\pi$, the BFS $x^\pi$ and the corresponding dual basic solution can be computed as

$$x^\pi_\pi = (A_\pi)^{-1} e \ge e,\quad x^\pi_\nu = 0,\qquad y^\pi = (A_\pi^T)^{-1} c_\pi,\quad s^\pi_\pi = 0,\quad s^\pi_\nu = c_\nu - A_\nu^T (A_\pi^T)^{-1} c_\pi,$$

where again $\nu$ contains the rest of the action indexes not in $\pi$. Since $x^\pi$ and $s^\pi$ are already complementary, if $s^\pi_\nu \ge 0$, then $\pi$ would be an optimal policy.

We now present the following strict complementarity result for the DMDP.

Lemma 2 Let both linear programs (3) and (4) be feasible. Then there is a unique partition $P^* \subset \{1, 2, \dots, n\}$ and $O^* \subset \{1, 2, \dots, n\}$, with $P^* \cap O^* = \emptyset$ and $P^* \cup O^* = \{1, 2, \dots, n\}$, such that for every optimal solution pair $(x^*, s^*)$,

$$x^*_j = 0\ \ \forall j \in O^*,\qquad \text{and}\qquad s^*_j = 0\ \ \forall j \in P^*,$$
and there is at least one optimal solution pair $(x^*, s^*)$ that is strictly complementary,

$$x^*_j > 0\ \ \forall j \in P^*,\qquad \text{and}\qquad s^*_j > 0\ \ \forall j \in O^*,$$

for the DMDP linear program. In particular, every optimal policy $\pi^* \subset P^*$, so that $|P^*| \ge m$ and $|O^*| \le n - m$.

Proof. The strict complementarity result for general LP is well known, where we call $P^*$ the optimal (super) basic variable set and $O^*$ the optimal non-basic variable set. The cardinality result follows from the fact that there is always an optimal basic feasible solution, i.e., an optimal policy, whose basic variables (optimal state-action frequencies) are all strictly positive by Lemma 1, so that their indexes must all belong to $P^*$.

The interpretation of Lemma 2 is as follows: since there may exist multiple optimal policies $\pi^*$ for a DMDP, $P^*$ contains those state-actions each of which appears in at least one optimal policy, and $O^*$ contains the remaining state-actions, none of which appears in any optimal policy. Let us call each state-action in $O^*$ a non-optimal state-action, or simply a non-optimal action. Then, any DMDP has no more than $n - m$ non-optimal actions. Note that, although there may be multiple optimal policies for a DMDP, the optimal dual basic feasible solution $(y^*, s^*)$ is unique and invariant among the multiple optimal policies. Thus, if $j$ is a non-optimal action, its optimal dual slack value $s^*_j$ must be strictly greater than 0, and the converse is also true by the lemma.

3.2 The Simplex and policy-iteration methods

Let $\pi$ be a policy and let $\nu$ contain the remaining indexes of the non-basic variables. Then we can rewrite (3) as

$$\text{minimize } c_\pi^T x_\pi + c_\nu^T x_\nu \quad \text{subject to } A_\pi x_\pi + A_\nu x_\nu = e,\quad x = (x_\pi;\, x_\nu) \ge 0, \quad (9)$$

with its dual

$$\text{maximize } e^T y \quad \text{subject to } A_\pi^T y + s_\pi = c_\pi,\quad A_\nu^T y + s_\nu = c_\nu,\quad s = (s_\pi;\, s_\nu) \ge 0. \quad (10)$$

The (primal) Simplex method rewrites (9) into the equivalent problem

$$\text{minimize } (\bar c_\nu)^T x_\nu + c_\pi^T (A_\pi)^{-1} e \quad \text{subject to } A_\pi x_\pi + A_\nu x_\nu = e,\quad x = (x_\pi;\, x_\nu) \ge 0, \quad (11)$$
where $\bar c$ is called the reduced cost vector:

$$\bar c_\pi = 0\qquad \text{and}\qquad \bar c_\nu = c_\nu - A_\nu^T y^\pi,\quad y^\pi = (A_\pi^T)^{-1} c_\pi.$$

Note that the fixed quantity $c_\pi^T (A_\pi)^{-1} e = c^T x^\pi$ in the objective function of (11) is the objective value of the current policy $\pi$ for (9). In fact, problem (9) and its equivalent form (11) share exactly the same objective value for every feasible solution $x$.

The Simplex method. If $\bar c \ge 0$, the current policy is optimal. Otherwise, let $0 < \Delta = -\min(\bar c)$ with $j^+ = \arg\min(\bar c)$, that is, $-\bar c_{j^+} = \Delta > 0$. Then we must have $j^+ \notin \pi$, since $\bar c_j = 0$ for all $j \in \pi$. Let $j^+ \in A_i$, that is, let $j^+$ be a state-action controlled by state $i$. Then, the classic Simplex method (Dantzig 1947) takes $x_{j^+}$ as the incoming basic variable to replace the old one, $x_{\pi_i}$, and the method repeats with the new policy denoted by $\pi^+$, where $\pi_i \in A_i$ is replaced by $j^+ \in A_i$. The method breaks ties arbitrarily, and it updates exactly one state-action in one iteration; that is, it only updates the state with the most negative reduced cost. This is the classic Simplex, or simple policy-iteration, method that uses the most-negative-reduced-cost updating or pivoting rule.

The policy-iteration method. The original policy-iteration method (Howard 1960 [12]) updates every state that has a negative reduced cost. For each state $i$, let $\Delta_i = -\min_{j \in A_i}(\bar c_j)$ with $j_i^+ = \arg\min_{j \in A_i}(\bar c_j)$. Then, for every state $i$ such that $\Delta_i > 0$, let $j_i^+ \in A_i$ replace $\pi_i \in A_i$ in the current policy $\pi$. The method repeats with the new policy denoted by $\pi^+$, where possibly multiple $\pi_i \in A_i$ are replaced by $j_i^+ \in A_i$. The method also breaks ties in each state arbitrarily.

Therefore, both methods generate a sequence of policies, denoted by $\pi^0, \pi^1, \dots, \pi^t, \dots$, starting from any initial policy $\pi^0$. We comment that the Simplex and policy-iteration methods with the greedy or most-negative-reduced-cost updating rule are special versions of generic policy improvement.
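Both methods above can be sketched compactly. The code below is our own illustration on a tiny hypothetical instance (two states; actions 0 and 1 belong to state 0, action 2 to state 1; the helper names `owner`, `P_cols`, `solve`, and `reduced_costs` are ours). Policy evaluation solves $A_\pi^T y = c_\pi$ directly, and the two loops differ only in how many improvable states they swap per iteration:

```python
gamma = 0.9
m, n = 2, 3
owner = [0, 0, 1]                                   # owner[j]: state of action j
P_cols = [[0.5, 0.5], [1.0, 0.0], [0.1, 0.9]]       # column j of P is p_j(i, .)
c = [1.5, 0.0, 1.0]                                 # expected one-time costs (5)

def a_col(j):
    """Column j of A = E - gamma*P."""
    return [(1.0 if owner[j] == i else 0.0) - gamma * P_cols[j][i]
            for i in range(m)]

def solve(M, rhs):
    """Solve M x = rhs by Gauss-Jordan elimination with partial pivoting."""
    a = [row[:] + [r] for row, r in zip(M, rhs)]
    k = len(rhs)
    for col in range(k):
        piv = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [x - f * y for x, y in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

def reduced_costs(pi):
    """Dual y^pi from A_pi^T y = c_pi, then cbar_j = c_j - a_j^T y."""
    y = solve([a_col(pi[i]) for i in range(m)], [c[pi[i]] for i in range(m)])
    cbar = [c[j] - sum(aij * yi for aij, yi in zip(a_col(j), y))
            for j in range(n)]
    return cbar, y

def simplex(pi):
    """Simple policy-iteration: swap only the most-negative-reduced-cost action."""
    while True:
        cbar, y = reduced_costs(pi)
        j_plus = min(range(n), key=lambda j: cbar[j])
        if cbar[j_plus] >= -1e-12:
            return pi, y
        pi[owner[j_plus]] = j_plus                  # one pivot per iteration

def policy_iteration(pi):
    """Howard's rule: swap in the best action of *every* improvable state."""
    while True:
        cbar, y = reduced_costs(pi)
        best = [min((j for j in range(n) if owner[j] == i),
                    key=lambda j: cbar[j]) for i in range(m)]
        if all(cbar[best[i]] >= -1e-12 for i in range(m)):
            return pi, y
        pi = best

print(simplex([0, 2]), policy_iteration([0, 2]))
```

On this instance both loops terminate at the same optimal policy; the Simplex variant pivots once per iteration, while Howard's rule may change several states at once.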
In what follows, the most-negative-reduced-cost pivoting rule is used as the default for the Simplex and policy-iteration methods, unless otherwise stated.

4 Proof of strong polynomiality

We first prove our strongly polynomial-time result for the Simplex method. For the improvement of the new policy $\pi^+$ over any policy $\pi$ of the Simplex method, we have
Lemma 3 Let $z^*$ be the optimal objective value of (9). Then, in any iteration of the Simplex method from the current policy $\pi$ to the new policy $\pi^+$,

$$c^T x^{\pi^+} - z^* \le \left(1 - \frac{1-\gamma}{m}\right)\left(c^T x^{\pi} - z^*\right).$$

Moreover,

$$z^* \ge c^T x^{\pi} - \frac{m\,\Delta}{1-\gamma}.$$

Therefore, the Simplex method generates a sequence of policies $\pi^0, \pi^1, \dots, \pi^t, \dots$ such that

$$c^T x^{\pi^t} - z^* \le \left(1 - \frac{1-\gamma}{m}\right)^{t}\left(c^T x^{\pi^0} - z^*\right).$$

Proof. From problem (11), we see that the objective function value of any feasible $x$ satisfies

$$c^T x = c^T x^{\pi} + \bar c^T x \ge c^T x^{\pi} - \Delta\, e^T x = c^T x^{\pi} - \frac{m\,\Delta}{1-\gamma},$$

where the inequality follows from $\bar c \ge -\Delta e$, by the most-negative-reduced-cost pivoting rule adopted in the method, and the last equality is based on the third statement of Lemma 1. In particular, the optimal objective value satisfies

$$z^* = c^T x^* \ge c^T x^{\pi} - \frac{m\,\Delta}{1-\gamma},$$

which proves the first inequality of the lemma.

Since at the new policy $\pi^+$ the value of the new basic variable $x^{\pi^+}_{j^+}$ is greater than or equal to 1, by the second statement of Lemma 1, the objective value of the new policy for problem (11) is decreased by at least $\Delta$. Thus, for problem (9),

$$c^T x^{\pi} - c^T x^{\pi^+} = \Delta\, x^{\pi^+}_{j^+} \ge \Delta \ge \frac{1-\gamma}{m}\left(c^T x^{\pi} - z^*\right),$$

or

$$c^T x^{\pi^+} - z^* \le \left(1 - \frac{1-\gamma}{m}\right)\left(c^T x^{\pi} - z^*\right),$$

which proves the second inequality. Replacing $\pi$ by $\pi^t$ and using the above inequality for all $t = 0, 1, \dots$, we have

$$c^T x^{\pi^{t+1}} - z^* \le \left(1 - \frac{1-\gamma}{m}\right)\left(c^T x^{\pi^t} - z^*\right),$$

which leads to the third desired inequality by induction.

We now present the following key technical lemma.
Lemma 4

1. If a policy $\pi$ is not optimal, then there is a state-action $j \in \pi \cap O^*$ (i.e., a non-optimal state-action $j$ in the current policy) such that

$$s^*_j \ge \frac{1-\gamma}{m^2}\left(c^T x^{\pi} - z^*\right),$$

where $O^*$, together with $P^*$, is the strict complementarity partition stated in Lemma 2, and $s^*$ is the optimal dual slack vector of (10).

2. For any sequence of policies $\pi^0, \pi^1, \dots, \pi^t, \dots$ generated by the Simplex method where $\pi^0$ is not optimal, let $j^0 \in \pi^0 \cap O^*$ be the state-action index identified above in the initial policy $\pi^0$. Then, if $j^0 \in \pi^t$, we must have

$$x^{\pi^t}_{j^0} \le \frac{m^2}{1-\gamma}\cdot \frac{c^T x^{\pi^t} - z^*}{c^T x^{\pi^0} - z^*},\qquad t \ge 1.$$

Proof. Since all non-basic variables of $x^{\pi}$ have zero values,

$$c^T x^{\pi} - z^* = c^T x^{\pi} - e^T y^* = (s^*)^T x^{\pi} = \sum_{j \in \pi} s^*_j\, x^{\pi}_j.$$

Since the number of terms in the sum is $m$, there must be a state-action $j \in \pi$ such that

$$s^*_j\, x^{\pi}_j \ge \frac{1}{m}\left(c^T x^{\pi} - z^*\right).$$

Then, from Lemma 1, $x^{\pi}_j \le \frac{m}{1-\gamma}$, so that

$$s^*_j \ge \frac{1-\gamma}{m^2}\left(c^T x^{\pi} - z^*\right) > 0,$$

which also implies $j \in O^*$ by Lemma 2.

Now, suppose the initial policy $\pi^0$ is not optimal, and let $j^0 \in \pi^0 \cap O^*$ be the index identified at policy $\pi^0$ such that the above inequality holds, that is,

$$s^*_{j^0} \ge \frac{1-\gamma}{m^2}\left(c^T x^{\pi^0} - z^*\right).$$

Then, for any policy $\pi^t$ generated by the Simplex method, if $j^0 \in \pi^t$, we must have

$$c^T x^{\pi^t} - z^* = (s^*)^T x^{\pi^t} \ge s^*_{j^0}\, x^{\pi^t}_{j^0},$$

so that

$$x^{\pi^t}_{j^0} \le \frac{c^T x^{\pi^t} - z^*}{s^*_{j^0}} \le \frac{m^2}{1-\gamma}\cdot \frac{c^T x^{\pi^t} - z^*}{c^T x^{\pi^0} - z^*}.$$
These lemmas lead to our key result:

Theorem 1 Let $\pi^0$ be any given non-optimal policy. Then there is a state-action $j^0 \in \pi^0 \cap O^*$, i.e., a non-optimal action $j^0$ in policy $\pi^0$, that never appears in any of the policies generated by the Simplex method after

$$T := \frac{m}{1-\gamma}\, \log\!\left(\frac{m^2}{1-\gamma}\right)$$

iterations starting from $\pi^0$.

Proof. From Lemma 3, after $t$ iterations of the Simplex method, we have

$$c^T x^{\pi^t} - z^* \le \left(1 - \frac{1-\gamma}{m}\right)^{t}\left(c^T x^{\pi^0} - z^*\right).$$

Therefore, after $t \ge T + 1$ iterations from the initial policy $\pi^0$, $j^0 \in \pi^t$ implies, by Lemma 4,

$$x^{\pi^t}_{j^0} \le \frac{m^2}{1-\gamma}\cdot \frac{c^T x^{\pi^t} - z^*}{c^T x^{\pi^0} - z^*} \le \frac{m^2}{1-\gamma}\left(1 - \frac{1-\gamma}{m}\right)^{t} < 1.$$

The last inequality above comes from the fact that $\log(1-x) \le -x$ for all $x < 1$, so that

$$\log\frac{m^2}{1-\gamma} + t\,\log\!\left(1 - \frac{1-\gamma}{m}\right) \le \log\frac{m^2}{1-\gamma} - t\,\frac{1-\gamma}{m} < 0$$

if $t \ge 1 + T \ge 1 + \frac{m}{1-\gamma}\log\frac{m^2}{1-\gamma}$. But $x^{\pi^t}_{j^0} < 1$ contradicts Lemma 1, which states that every basic variable value must be greater than or equal to 1. Thus, $j^0 \notin \pi^t$ for all $t \ge T + 1$.

The event described in Theorem 1 can be viewed as a crossover event of Vavasis and Ye [25, 28]: a state-action, although we do not know which one it is, was in the initial policy but never stays in or returns to the policies after a certain number of iterations of the Simplex or simple policy-iteration method.

We now repeat the same proof for policy $\pi^{T+1}$, if it is not optimal yet, in the policy sequence generated by the Simplex method. Since policy $\pi^{T+1}$ is not optimal, there must be a non-optimal state-action $j^1 \in \pi^{T+1} \cap O^*$ with $j^1 \ne j^0$ (because of Theorem 1) that never stays in or returns to the policies generated by the Simplex method after $2T$ iterations starting from $\pi^0$. Again, we can repeat this process for policy $\pi^{2T+1}$ if it is not optimal yet, and so on. In each of these cycles of $T$ Simplex iterations, at least one new non-optimal state-action is eliminated from appearing in any of the future policies generated by the Simplex method. However, we have at most $|O^*|$ such non-optimal state-actions to eliminate, where $|O^*| \le n - m$ by Lemma 2.
Hence, the Simplex method can repeat such a cycle at most $n - m$ times, and we reach our main conclusion:
Theorem 2 The Simplex, or simple policy-iteration, method with the most-negative-reduced-cost pivoting rule of Dantzig for solving the discounted Markov decision problem with a fixed discount rate is a strongly polynomial-time algorithm. Starting from any policy, the method terminates in at most

$$(n - m)\cdot \frac{m}{1-\gamma}\cdot \log\!\left(\frac{m^2}{1-\gamma}\right)$$

iterations, where each iteration uses $O(mn)$ arithmetic operations.

The arithmetic operations count is well known for the Simplex method: it uses $O(m^2)$ arithmetic operations to update the inverse $(A_{\pi^t})^{-1}$ of the basis of the current policy $\pi^t$ and the dual basic solution $y^{\pi^t}$, as well as $O(mn)$ arithmetic operations to calculate the reduced cost vector and then choose the incoming basic variable.

We now turn our attention to the policy-iteration method, and we have the following corollary:

Corollary 1 The original policy-iteration method of Howard for solving the discounted Markov decision problem with a fixed discount rate is a strongly polynomial-time algorithm. Starting from any policy, it terminates in at most

$$(n - m)\cdot \frac{m}{1-\gamma}\cdot \log\!\left(\frac{m^2}{1-\gamma}\right)$$

iterations.

Proof. First, Lemmas 1 and 2 hold since they are independent of which method is being used. Secondly, Lemma 3 still holds for the policy-iteration method, since at any policy $\pi$ the incoming basic variable $j^+ = \arg\min(\bar c)$ (that is, $-\bar c_{j^+} = \Delta = -\min(\bar c)$) of the Simplex method is always one of the incoming basic variables of the policy-iteration method. Thirdly, the facts established by Lemma 4 are also independent of how the policy sequence is generated, as long as the state-action with the most negative reduced cost is included in the next policy, so they hold for the policy-iteration method as well. Thus, we can conclude that there is a state-action $j^0 \in \pi^0 \cap O^*$, i.e., a non-optimal state-action $j^0$ in the initial non-optimal policy $\pi^0$, that never stays in or returns to the policies generated by the policy-iteration method after $T$ iterations. Thus, Theorem 1 also holds for the policy-iteration method, which proves the corollary.
Note that, for the policy-iteration method, each iteration could use up to O(m²n) arithmetic operations.

5 Extensions and Remarks

Our result can be extended to other undiscounted MDPs where every basic feasible matrix of (9) exhibits the Leontief substitution form:

A_π = I − P, for some nonnegative square matrix P with P ≥ 0 and its spectral radius ρ(P) ≤ γ for a fixed γ < 1.

This includes MDPs with sub-stochastic matrices and transient cases; see
Veinott [27]. Note that the inverse of (I − P) has the expansion form

(I − P)^{−1} = I + P + P² + ⋯,

and

‖(I − P)^{−1} e‖₂ ≤ ‖e‖₂ (1 + γ + γ² + ⋯) = ‖e‖₂ / (1 − γ).

Thus, each basic variable value is still between 1 and m/(1 − γ), so that Lemma 1 is true with an inequality (actually stronger for our proof): eᵀx ≤ m/(1 − γ), for every feasible solution x. Consequently, Lemmas 2, 3, and 4 all hold, which leads to the following corollary.

Corollary 2 Let every feasible basis of an MDP have the form I − P where P ≥ 0, with a spectral radius less than or equal to a fixed γ < 1. Then, the Simplex and policy-iteration methods are strongly polynomial-time algorithms. Starting from any policy, each of them terminates in at most

(n − m) · (m/(1 − γ)) · log( m²/(1 − γ) )

iterations.

One observation from our worst-case analyses is that there is no iteration-count difference between the Simplex method and the policy-iteration method that makes block pivots in each iteration, as long as the most-negative-reduced-cost pivoting rule is adopted. However, each iteration of the Simplex method is more efficient than that of the policy-iteration method.

Finally, we remark that the pivoting rule seems to make the difference. As we mentioned earlier, for the DMDP with a fixed discount rate, the simplex or simple policy-iteration method with the smallest-index pivoting rule (a rule popularly used against cycling in the presence of degeneracy) was shown to be exponential. This is in contrast to the method that uses the most-negative-reduced-cost pivoting rule, which is proven to be strongly polynomial in this paper. On the other hand, the most-negative-reduced-cost pivoting rule is exponential for solving some other LP problems. Thus, searching for suitable pivoting rules for solving different LP problems is essential, and one cannot rule out the Simplex method simply because the behavior of one pivoting rule on one problem is shown to be exponential.
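The Leontief-form bounds used in this extension are easy to sanity-check numerically. The sketch below (a hypothetical example of my own, not from the paper) builds a nonnegative P whose row sums all equal γ, so that ρ(P) ≤ γ < 1, and verifies both the Neumann expansion of (I − P)^{−1} and the bound eᵀx ≤ m/(1 − γ) for the basic solution x of (I − P)ᵀx = e.

```python
import numpy as np

rng = np.random.default_rng(0)
m, gamma = 6, 0.8

# Nonnegative P scaled so every row sums to gamma; then rho(P) <= gamma < 1.
P = rng.random((m, m))
P *= gamma / P.sum(axis=1, keepdims=True)

e = np.ones(m)
inv = np.linalg.inv(np.eye(m) - P)

# Neumann expansion: (I - P)^{-1} = I + P + P^2 + ... (truncated here;
# the tail is O(gamma^200), far below floating-point tolerance).
series = sum(np.linalg.matrix_power(P, k) for k in range(200))
assert np.allclose(inv, series)

# Basic feasible solution for this basis: (I - P)^T x = e.
x = np.linalg.solve((np.eye(m) - P).T, e)
assert np.all(x >= 1.0 - 1e-9)          # each basic variable is at least 1
assert e @ x <= m / (1 - gamma) + 1e-9  # e^T x <= m / (1 - gamma)
```

Because (I − P)^{−1} = I + P + P² + ⋯ is entrywise at least I, every basic variable is at least 1, and because each row sum of P^k is at most γ^k, the total eᵀx is at most m(1 + γ + γ² + ⋯) = m/(1 − γ), which is the inequality form of Lemma 1 invoked above.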
Further possible research directions may answer the questions: can the Simplex method or the policy-iteration method be strongly polynomial for solving the MDP regardless of discount rates? Or, is there any strongly polynomial-time algorithm for solving the MDP regardless of discount rates?
Acknowledgments. I thank Pete Veinott and four anonymous Referees for many insightful discussions and suggestions on this subject, which have greatly improved the presentation of the paper.

References

[1] R. Bellman. Dynamic Programming. Princeton University Press, Princeton, New Jersey, 1957.

[2] D. P. Bertsekas. Dynamic Programming, Deterministic and Stochastic Models. Prentice-Hall, Englewood Cliffs, New Jersey, 1987.

[3] R. E. Bixby, Progress in linear programming, ORSA Journal on Computing 6:1 (1994).

[4] G. B. Dantzig, Optimal solutions of a dynamic Leontief model with substitution, Econometrica 23 (1955).

[5] G. B. Dantzig. Linear Programming and Extensions. Princeton University Press, Princeton, New Jersey, 1963.

[6] B. C. Eaves and A. F. Veinott, Maximum-Stopping-Value Policies in Finite Markov Population Decision Chains, manuscript, Stanford University.

[7] R. E. Erickson, Optimality of Stationary Halting Policies and Finite Termination of Successive Approximations, Mathematics of Operations Research 13:1 (1988).

[8] G. de Ghellinck, Les Problèmes de Décisions Séquentielles, Cahiers du Centre d'Études de Recherche Opérationnelle 2 (1960).

[9] F. d'Epenoux, A Probabilistic Production and Inventory Problem, Management Science 10 (1963); translation of an article published in Revue Française de Recherche Opérationnelle 14 (1960).

[10] J. Fearnley, Exponential lower bounds for policy iteration, arXiv preprint, v1 (March 2010).

[11] M. Grötschel, L. Lovász and A. Schrijver, Geometric Algorithms and Combinatorial Optimization, Springer, Berlin, 1988.

[12] R. A. Howard, Dynamic Programming and Markov Processes. MIT Press, Cambridge, Massachusetts, 1960.

[13] N. Karmarkar, A new polynomial-time algorithm for linear programming, Combinatorica 4 (1984), 373–395.

[14] L. G. Khachiyan, A polynomial algorithm in linear programming, Dokl. Akad. Nauk SSSR 244 (1979); translated in Soviet Mathematics Doklady 20 (1979).

[15] V. Klee and G. J. Minty, How good is the Simplex method?, in O. Shisha, editor, Inequalities III, Academic Press, New York, NY, 1972.
[16] M. L. Littman, T. L. Dean and L. P. Kaelbling, On the complexity of solving Markov decision problems, Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence (UAI '95), 1995.

[17] A. S. Manne, Linear programming and sequential decisions, Management Science 6 (1960).

[18] Y. Mansour and S. Singh, On the complexity of policy iteration, Proceedings of the 15th International Conference on Uncertainty in AI, 1999.

[19] M. Melekopoglou and A. Condon, On the complexity of the policy improvement algorithm for Markov Decision Processes, INFORMS Journal on Computing 6:2 (1994).

[20] C. H. Papadimitriou and J. N. Tsitsiklis, The complexity of Markov decision processes, Mathematics of Operations Research 12:3 (1987).

[21] M. L. Puterman, Markov Decision Processes, John Wiley & Sons, New York, 1994.

[22] U. Rothblum, Multiplicative Markov Decision Chains, Doctoral Dissertation, Department of Operations Research, Stanford University, Stanford.

[23] L. S. Shapley, Stochastic Games, Proceedings of the National Academy of Sciences USA 39:10 (1953).

[24] P. Tseng, Solving H-horizon, stationary Markov decision problems in time proportional to log(H), Operations Research Letters 9:5 (1990).

[25] S. Vavasis and Y. Ye, A primal-dual interior-point method whose running time depends only on the constraint matrix, Mathematical Programming 74 (1996).

[26] A. Veinott, Extreme points of Leontief substitution systems, Linear Algebra and its Applications 1 (1968).

[27] A. Veinott, Discrete dynamic programming with sensitive discount optimality criteria, The Annals of Mathematical Statistics 40:5 (1969).

[28] Y. Ye, A new complexity result on solving the Markov decision problem, Mathematics of Operations Research 30:3 (2005).
More informationA Low-Complexity Congestion Control and Scheduling Algorithm for Multihop Wireless Networks with Order-Optimal Per-Flow Delay
A Low-Coplexity Congestion Control and Scheduling Algorith for Multihop Wireless Networks with Order-Optial Per-Flow Delay Po-Kai Huang, Xiaojun Lin, and Chih-Chun Wang School of Electrical and Coputer
More informationComplex Quadratic Optimization and Semidefinite Programming
Coplex Quadratic Optiization and Seidefinite Prograing Shuzhong Zhang Yongwei Huang August 4 Abstract In this paper we study the approxiation algoriths for a class of discrete quadratic optiization probles
More informationChapter 6 1-D Continuous Groups
Chapter 6 1-D Continuous Groups Continuous groups consist of group eleents labelled by one or ore continuous variables, say a 1, a 2,, a r, where each variable has a well- defined range. This chapter explores:
More information