The simplex method is strongly polynomial for deterministic Markov decision processes
Ian Post    Yinyu Ye

May 31, 2013

Abstract

We prove that the simplex method with the highest-gain/most-negative-reduced-cost pivoting rule converges in strongly polynomial time for deterministic Markov decision processes (MDPs) regardless of the discount factor. For a deterministic MDP with n states and m actions, we prove the simplex method runs in O(n^3 m^2 log^2 n) iterations if the discount factor is uniform and O(n^5 m^3 log^2 n) iterations if each action has a distinct discount factor. Previously the simplex method was known to run in polynomial time only for discounted MDPs where the discount was bounded away from 1 [Ye11]. Unlike in the discounted case, the algorithm does not greedily converge to the optimum, and we require a more complex measure of progress. We identify a set of layers in which the values of primal variables must lie and show that the simplex method always makes progress optimizing one layer, and when the upper layer is updated the algorithm makes a substantial amount of progress. In the case of nonuniform discounts, we define a polynomial number of milestone policies, and we prove that, while the objective function may not improve substantially overall, the value of at least one dual variable is always making progress toward some milestone, and the algorithm will reach the next milestone in a polynomial number of steps.

1 Introduction

Markov decision processes (MDPs) are a powerful tool for modeling repeated decision making in stochastic, dynamic environments. An MDP consists of a set of states and a set of actions that one may perform in each state. Based on the agent's actions it receives rewards and affects the future evolution of the process, and the agent attempts to maximize its reward over time (see Section 2 for a formal definition). MDPs are widely used in machine learning, robotics and control, operations research, economics, and related fields. See the books [Put94] and [Ber96] for a thorough overview. Solving MDPs is also an important problem theoretically.
Optimizing an MDP can be formulated as a linear program (LP), and although these LPs possess extra structure that can be exploited by algorithms like Howard's policy iteration method [How60], they lie just beyond the point at which our ability to solve LPs in strongly polynomial time ends (and are a natural target for extending this ability), and they have proven to be hard in general for algorithms previously thought to be quite powerful, such as randomized simplex pivoting rules [FHZ11].

(Author footnotes: Ian Post, Department of Combinatorics and Optimization, University of Waterloo; research done while at Stanford University; ian@ianpost.org; research supported by an NSF grant. We also acknowledge financial support from grant #FA from the U.S. Air Force Office of Scientific Research (AFOSR) and the Defense Advanced Research Projects Agency (DARPA). Yinyu Ye, Department of Management Science and Engineering, Stanford University; yinyu-ye@stanford.edu; research supported in part from grant #FA from the U.S. Air Force Office of Scientific Research (AFOSR).)

In practice [LDK95] MDPs are solved using policy iteration, which may be viewed as a parallel version of the simplex method with multiple simultaneous pivots, or value iteration [Bel57], an inexact approximation to policy iteration that is faster per iteration. If the discount factor γ, which determines the effective time horizon (see Section 2), is small, it has long been known that policy and value iteration will find an ɛ-approximation to the optimum [Bel57]. It is also well known that value iteration may be exponential, but policy iteration resisted worst-case analysis for many years. It was conjectured to be strongly polynomial, but except for highly restricted examples [Mad02] only exponential time bounds were known [MS99]. Building on results for parity games [Fri09], Fearnley recently gave an exponential lower bound [Fea10]. Friedmann, Hansen, and Zwick extended Fearnley's techniques to achieve sub-exponential lower bounds for randomized simplex pivoting rules [FHZ11] using MDPs, and Friedmann gave an exponential lower bound for MDPs using the least-entered pivoting rule [Fri11]. Melekopoglou and Condon proved several other simplex pivoting rules are exponential [MC94]. On the positive side, Ye designed a specialized interior-point method that is strongly polynomial in everything except the discount factor [Ye05]. Ye later proved that for discounted MDPs with n states and m actions, the simplex method with the most-negative-reduced-cost pivoting rule and, by extension, policy iteration, runs in time O(nm/(1 − γ) · log(n/(1 − γ))) on discounted MDPs, which is polynomial for fixed γ [Ye11].
Hansen, Miltersen, and Zwick improved the policy iteration bound to O(m/(1 − γ) · log(n/(1 − γ))) and extended it to both value iteration as well as the strategy iteration algorithm for two-player turn-based stochastic games [HMZ11]. But the performance of policy iteration and simplex-style basis-exchange algorithms on MDPs remains poorly understood. Policy iteration, for instance, is conjectured to run in O(m) iterations on deterministic MDPs, but the best upper bounds are exponential, although a lower bound of Ω(m) is known [HZ10]. Improving our understanding of these algorithms is an important step in designing better ones with polynomial or even strongly polynomial guarantees. Motivated by these questions, we analyze the simplex method with the most-negative-reduced-cost pivoting rule on deterministic MDPs. For a deterministic MDP with n states and m actions, we prove that the simplex method terminates in O(n^3 m^2 log^2 n) iterations regardless of the discount factor, and if each action has a distinct discount factor, then the algorithm runs in O(n^5 m^3 log^2 n) iterations. Our results do not extend to policy iteration, and we leave this as a challenging open question. Deterministic MDPs were previously known to be solvable in strongly polynomial time using specialized methods not applicable to general MDPs (minimum mean cycle algorithms [PT87]) or, in the case of nonuniform discounts, by exploiting the property that the dual LP has only two variables per inequality [HN94]. The fastest known algorithm for uniformly discounted deterministic MDPs runs in time O(mn) [MTZ10]. However, these problems were not known to be solvable in polynomial time with the more generic simplex method. More generally, we believe that our results help shed some light on how algorithms like simplex and policy iteration function on MDPs. Our proof techniques, particularly in the case of nonuniform discounts, may be of independent interest.
For uniformly discounted MDPs, we show that the values of the primal flux variables must lie within one of two intervals or layers of polynomial size, depending on whether an action is on a path or a cycle. Most iterations update variables in the smaller path layer, and we show these converge rapidly to a locally optimal policy for the paths, at which point the algorithm must update the larger cycle layer and make a large amount of progress toward the optimum. Progress takes the form of many small improvements interspersed with a few much larger ones rather than uniform convergence. The nonuniform case is harder, and our measure of progress is unusual and, to the best of our knowledge, novel. We again define a set of intervals in which the values of variables on cycles must fall, and these define a collection of intermediate milestone or checkpoint values for each dual variable (the value of a state in the MDP). Whenever a variable enters a cycle layer, we argue that a corresponding dual variable is making progress toward the layer's milestone and will pass this value after enough updates. When each of these checkpoints has been passed, the algorithm must have reached the optimum. We believe some of these ideas may prove useful in other problems as well. In Section 2 we formally define MDPs and describe a number of well-known properties that we require. In Section 3 we analyze the case of a uniform discount factor, and in Section 4 we extend these results to the nonuniform case.

2 Preliminaries

Many variations and extensions of MDPs have been defined, but we will study the following problem. A Markov decision process consists of a set of n states S and m actions A. Each action a is associated with a single state in which it can be performed, a reward r_a ∈ R for performing the action, and a probability distribution P_a over states to which the process will transition when using action a. We denote by P_{a,s'} the probability of transitioning to state s' when taking action a. There is at least one action usable in each state. Let r be the vector of rewards indexed by actions with entries r_a, A_s ⊆ A be the set of actions performable in state s, and P be the n-by-m matrix with columns P_a and entries P_{a,s'}. We will restrict the distributions P_a to be deterministic for all actions, in which case states may be thought of as nodes in a graph and actions as directed edges.
However, the results in this section apply to MDPs with stochastic transitions as well. At each time step, the MDP starts in some state s and performs an action a admissible in state s, at which point it receives the reward r_a and transitions to a new state according to the probability distribution P_a. We are given a discount factor γ < 1 as part of the input, and our goal is to choose actions to perform so as to maximize the expected discounted reward we accumulate over an infinite time horizon. The discount can be thought of as a stopping probability: at each time step the process ends with probability 1 − γ. Normally, the discount γ is uniform for the entire MDP, but in Section 4 we will allow each action to have a distinct discount γ_a. Due to the Markov property (transitions depend only on the current state and action) there is an optimal strategy that is memoryless and depends only on the current state. Let π be such a policy, a distribution of actions to perform for each state. This defines a Markov chain and a value for each state:

Definition 2.1. Let π be a policy, P^π be the n-by-n matrix where P^π_{s',s} is the probability of transitioning from s to s' using π, and r^π the vector of expected rewards for each state according to the distribution of actions in π. The value vector v^π is indexed by states, and v^π_s is equal to the expected total discounted reward of starting in state s and following policy π. It is defined as

    v^π = Σ_{i≥0} (γ(P^π)^T)^i r^π = (I − γ(P^π)^T)^{−1} r^π,

or equivalently by

    v^π = r^π + γ(P^π)^T v^π.    (1)
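The defining identity (1) can be checked numerically. The sketch below is our illustration, not part of the paper's analysis: it builds a small hypothetical deterministic policy (a 3-cycle with made-up rewards) and solves the linear system of Definition 2.1 for its value vector.

```python
import numpy as np

# Hypothetical 3-state deterministic policy, purely for illustration:
# pi moves state s to state (s + 1) % 3 and earns reward r_pi[s].
n = 3
gamma = 0.9
r_pi = np.array([0.0, 1.0, 2.0])

# Column s of P_pi is the transition distribution out of s (Definition 2.1),
# so P_pi[s_next, s] = 1 when pi moves s to s_next.
P_pi = np.zeros((n, n))
for s in range(n):
    P_pi[(s + 1) % n, s] = 1.0

# v_pi = (I - gamma (P_pi)^T)^{-1} r_pi: the expected discounted reward of
# starting in each state and following pi forever.
v_pi = np.linalg.solve(np.eye(n) - gamma * P_pi.T, r_pi)

# Sanity check of the fixed-point form (1): v = r + gamma P^T v.
assert np.allclose(v_pi, r_pi + gamma * P_pi.T @ v_pi)
```

Solving the n-by-n system directly, rather than summing the series Σ (γ(P^π)^T)^i, matches the closed form in Definition 2.1.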
If policy π is randomized and uses two or more actions in some state s, then the value v^π_s is an average of the values of performing each of the pure actions in s, and one of these is the largest. Therefore we can replace the distribution by a single action and only increase the value of the state. In the remainder of the paper we will restrict ourselves to pure policies in which a single action is taken in each state. In addition to the value vector, a policy π also has an associated flux vector x^π that will play a critical role in our analysis. It acts as a kind of discounted flow. Suppose we start with a single unit of mass on every state and then run the Markov chain. At each time step we remove a 1 − γ fraction of the mass on each state and redistribute the remaining mass according to the policy π. Summing over all time steps, the total amount of mass that passes through each action is its flux. More formally,

Definition 2.2. Let π be a policy and P^π the n-by-n transition matrix for π formed by the columns P_a for actions in π. The flux vector x^π is indexed by actions. If action a is not in π then x^π_a = 0, and if π uses a in state s, then x^π_a = z_s, where

    z = Σ_{i≥0} (γP^π)^i 1 = (I − γP^π)^{−1} 1,    (2)

and 1 is the all-ones vector of dimension n.

The flux is the total discounted number of times we use each action if we start the MDP in all n states and run the Markov chain P^π, discounting by γ each iteration. Note that if a ∈ π then x^π_a ≥ 1, since the initial flux placed on a state always passes through its action. Further note that each bit of flux can be traced back to one of the initial units of mass placed on each state, although the vector x^π sums flux from all states. This will be important in Section 4.
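Definition 2.2 likewise lends itself to a quick numerical check. The following sketch (ours; the 4-state instance, with a 3-cycle plus one path state, is hypothetical) computes the flux via (2) and confirms that every used action carries at least one unit of flux and that the total flux is n/(1 − γ).

```python
import numpy as np

# Flux vector of Definition 2.2 on a hypothetical 4-state policy:
# states 0 -> 1 -> 2 -> 0 form a cycle, and state 3 is a path into it.
n = 4
gamma = 0.9
P_pi = np.zeros((n, n))   # P_pi[s_next, s] = 1 when pi moves s to s_next
for s, s_next in [(0, 1), (1, 2), (2, 0), (3, 0)]:
    P_pi[s_next, s] = 1.0

# z = (I - gamma P_pi)^{-1} 1 from (2); the flux on the action that pi
# uses in state s is z[s].
z = np.linalg.solve(np.eye(n) - gamma * P_pi, np.ones(n))

# Each used action carries at least 1 unit of flux, the path action at
# state 3 carries exactly its one initial unit, and the total flux is
# n/(1 - gamma).
assert np.all(z >= 1.0 - 1e-12)
assert np.isclose(z[3], 1.0)
assert np.isclose(z.sum(), n / (1 - gamma))
```

The gap between the path flux (exactly 1 here) and the cycle fluxes (on the order of 1/(1 − γ)) is the two-layer structure exploited in Section 3.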
Solving the MDP can be formulated as the following primal/dual pair of LPs, in which the flux and value vectors correspond to primal and (possibly infeasible) dual solutions:

Primal:
    maximize   Σ_a r_a x_a
    subject to Σ_{a∈A_s} x_a = 1 + γ Σ_a P_{a,s} x_a   for all s ∈ S,    (3)
               x ≥ 0

Dual:
    minimize   Σ_s v_s
    subject to v_s ≥ r_a + γ Σ_{s'} P_{a,s'} v_{s'}   for all s ∈ S, a ∈ A_s.    (4)

The constraint matrix of (3) is equal to M − γP, where M_{s,a} = 1 if action a can be used in state s and 0 otherwise. The dual value LP (4) is often defined as the primal, as it is perhaps more intuitive, and (3) is rarely considered. However, our analysis centers on the flux variables, and algorithms that manipulate policies can more naturally be seen as moving through the polytope of (3), since vertices of the polytope represent policies:

Lemma 2.3. The LP (3) is non-degenerate, and there is a bijection between vertices of the polytope and policies of the MDP.

Proof. Policies have exactly n nonzero variables, and solving for the flux vector in (2) is identical to solving for a basis in the polytope, so policies map to bases. Write the constraints in (3) in the standard matrix form Ax = b. The vector b is 1, and A = M − γP. In a row of A the only positive entries are on the actions usable in state s, so if Ax = b, then x must have a nonzero entry for every state, i.e., a choice of action for every state. Bases of the LP have n variables, so they must include only one action per state. Finally, as shown above, x^π_a ≥ 1 for all a in a policy/basis, so the LP is not degenerate, and bases correspond to vertices.

By Lemma 2.3, the simplex method applied to (3) corresponds to a simple, single-switch version of policy iteration: we start with an arbitrary policy, and in each iteration we change a single action that improves the value of some state. Since the LP is not degenerate, the simplex method will find the optimal policy with no cycling. We will use Dantzig's most-negative-reduced-cost pivoting rule to choose the action switched. Since (3) is written as a maximization problem, we will refer to reduced costs as gains and always choose the highest-gain action to switch/pivot. For MDPs, the gains have a simple interpretation:

Definition 2.4. The gain (or reduced cost) of an action a for state s with respect to a policy π is denoted r̄^π_a and is the improvement in the value of s if s uses action a once and then follows π for all time. Formally,

    r̄^π_a = (r_a + γP_a^T v^π) − v^π_s,

or, in vector form,

    r̄^π = r − (M − γP)^T v^π.    (5)

We denote the optimal policy by π*, and the optimal flux, values, and gains by x*, v*, and r̄*. The following are basic properties of the simplex method, and we prove them for completeness.

Lemma 2.5. Let π and π' be any policies. The gains satisfy the following properties: (r̄^π)^T x^{π'} = r^T x^{π'} − r^T x^π = 1^T v^{π'} − 1^T v^π, r̄^π_a = 0 for all a ∈ π, and r̄*_a ≤ 0 for all a.

Proof. From the definition of the gains,

    (r̄^π)^T x^{π'} = (r − (M − γP)^T v^π)^T x^{π'} = r^T x^{π'} − (v^π)^T (M − γP) x^{π'} = r^T x^{π'} − (v^π)^T 1,

using that (M − γP) is the constraint matrix of (3). From the definitions of the value and flux vectors,

    r^T x^π = (r^π)^T (I − γP^π)^{−1} 1 = (v^π)^T 1,

where r^π is the reward vector restricted to the indices in π.
Combining these two gives the first result. For the second result, if a is in π and used in state s, then v^π_s = r_a + γP_a^T v^π, so r̄^π_a = 0. Finally, if r̄*_a > 0 for some a, then consider the policy π' that is identical to π* but uses a. Then (r̄*)^T x^{π'} > 0, and the first identity proves that π* is not optimal.

A key property of the simplex method on MDPs that we will employ repeatedly is that not only is the overall objective improving, but also the values of all states are monotone non-decreasing, and there exists a single policy, which we denote by π*, that maximizes the values of all states:

Lemma 2.6. Let π and π' be policies appearing in an execution of the simplex method with π' being used after π. Then v^{π'} ≥ v^π. Further, let π* be the policy when simplex terminates, and π be any other policy. Then v* ≥ v^π.
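As an aside, the single-switch scheme described after Lemma 2.3 can be sketched in code. Everything below (the helper names solve_values and simplex_highest_gain, and the tiny 2-state instance) is our hypothetical illustration of Dantzig's highest-gain rule on a deterministic MDP, not code from the paper.

```python
import numpy as np

def solve_values(policy, rewards, next_state, gamma, n):
    """Value vector of Definition 2.1 for a deterministic policy;
    policy[s] is the action used in state s."""
    P = np.zeros((n, n))
    r = np.zeros(n)
    for s in range(n):
        a = policy[s]
        P[next_state[a], s] = 1.0   # column s holds the transition out of s
        r[s] = rewards[a]
    return np.linalg.solve(np.eye(n) - gamma * P.T, r)

def simplex_highest_gain(m, rewards, next_state, src, gamma, n, policy):
    """Single-switch simplex with Dantzig's rule: repeatedly switch to
    the action of highest gain (Definition 2.4) until none is positive."""
    while True:
        v = solve_values(policy, rewards, next_state, gamma, n)
        gains = [rewards[a] + gamma * v[next_state[a]] - v[src[a]]
                 for a in range(m)]
        best = max(range(m), key=lambda a: gains[a])
        if gains[best] <= 1e-10:
            return policy, v
        policy[src[best]] = best    # pivot: change exactly one action

# Hypothetical 2-state instance: each state has a "stay" and a "switch" action.
n, gamma = 2, 0.9
src        = [0, 0, 1, 1]     # state in which each action is usable
next_state = [0, 1, 1, 0]     # deterministic successor of each action
rewards    = [0.0, 1.0, 2.0, 0.0]
policy, v = simplex_highest_gain(4, rewards, next_state, src, gamma, n, [0, 3])
```

On this instance the method pivots twice and stops with state 0 moving to state 1 and state 1 staying put; one can also observe along the way that both entries of v never decrease, which is exactly the content of Lemma 2.6.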
Proof. Suppose π and π' are subsequent policies. The gains of all actions in π' with respect to π are equal to r^{π'} − (I − γ(P^{π'})^T) v^π, all of which are nonnegative. Therefore

    0 ≤ (I − γ(P^{π'})^T)^{−1} (r^{π'} − (I − γ(P^{π'})^T) v^π) = v^{π'} − v^π,

using that (I − γ(P^{π'})^T)^{−1} = Σ_{i≥0} (γ(P^{π'})^T)^i ≥ 0. By induction, this holds if π and π' occur further apart. Performing a similar calculation using the gains r̄*, which are nonpositive, shows that v* − v^π ≥ 0 for any policy π.

3 Uniform discounts

As a warmup before delving into our analysis of deterministic MDPs, we briefly review the analysis of [Ye11] for stochastic MDPs with a fixed discount. Consider the flux vector in Definition 2.2. One unit of flux is added to each state, and every step it is discounted by a factor of γ, for a total of n(1 + γ + γ^2 + ⋯) = n/(1 − γ) flux overall. If π is the current policy and Δ is the highest gain, then, by Lemma 2.5, the farthest π* can be from π is if all n/(1 − γ) units of flux in π* are on the action with gain Δ, so r^T x* − r^T x^π ≤ nΔ/(1 − γ). If we pivot on this action, at least 1 unit of flux is placed on the new action, increasing the objective by at least Δ. Thus we have reduced the gap to π* by a 1 − (1 − γ)/n fraction, which is substantial if 1/(1 − γ) is polynomial. Now consider r^T x* − r^T x^π = −(r̄*)^T x^π. All the terms −r̄*_a x^π_a are nonnegative, and for some action a in π we have −r̄*_a x^π_a ≥ −(r̄*)^T x^π/n. The flux x^π_a is at most n/(1 − γ), so −r̄*_a ≥ −(r̄*)^T x^π/(n^2/(1 − γ)). But for any policy π' that includes a, −(r̄*)^T x^{π'} ≥ −r̄*_a x^{π'}_a ≥ −r̄*_a, so after r^T x* − r^T x^π has shrunk by a factor of n^2/(1 − γ), action a cannot appear in any future policy, and this occurs after

    log_{1/(1−(1−γ)/n)} (n^2/(1 − γ)) = O( n/(1 − γ) · log(n/(1 − γ)) )

steps. See [Ye11] for the details. The above result hinged on the fact that the sizes of all nonzero fluxes lay within the interval [1, n/(1 − γ)], which was assumed to be polynomial but gives a weak bound if γ is very close to 1. However, consider a policy for a deterministic MDP.
It can be seen as a graph with a node for each state and a single directed edge leaving each state representing the action, so the graph consists of one or more directed cycles and directed paths leading to these cycles. Starting on a path, the MDP uses each path action once before reaching a cycle, so the flux on paths must be small. Flux on the cycles may be substantially larger, but since the MDP revisits each action after at most n steps, the flux on cycle actions varies by at most a factor of n.

Lemma 3.1. Let π be a policy with flux vector x^π and a an action in π. If a is on a path in π then 1 ≤ x^π_a ≤ n, and if a is on a cycle then 1/(1 − γ) ≤ x^π_a ≤ n/(1 − γ). The total flux on paths is at most n^2, and the total flux on cycles is at most n/(1 − γ).

Proof. All actions have at least 1 flux. If a is on a path, then starting from any state we can only use a once and never return, contributing flux at most 1 per state, so x^π_a ≤ n. Summing over all path actions, the total flux is at most n^2. If a is on a cycle, each state on the cycle contributes a total of 1/(1 − γ) flux to the cycle. By symmetry this flux is distributed evenly among the actions on the cycle, so x^π_a ≥ 1/(1 − γ). The total flux in the MDP is n/(1 − γ), so x^π_a ≤ n/(1 − γ).

The overall range of fluxes is large, but all values must lie within one of two polynomial-size layers. We will prove that simplex can essentially optimize each layer separately. If a cycle is not updated, then not much progress is made toward the optimum, but we make a substantial amount of progress in optimizing the paths for the current cycles. When the paths are optimal the algorithm is forced to update a cycle, at which point we make a substantial amount of progress toward the optimum but reset all progress on the paths. First we analyze progress on the paths:

Lemma 3.2. Suppose the simplex method pivots from π to π', which does not create a new cycle. Let π'' be the final policy such that the cycles in π'' are a subset of those in π (i.e., the final policy before a new cycle is created). Then r^T(x^{π''} − x^{π'}) ≤ (1 − 1/n^2) r^T(x^{π''} − x^π).

Proof. Let Δ = max_a r̄^π_a be the highest gain. Consider (r̄^π)^T x^{π''}. Since the cycles in π'' are contained in π, r̄^π_a = 0 for any action a on a cycle in π'', and by Lemma 3.1, π'' has at most n^2 units of flux on paths, so

    (r̄^π)^T x^{π''} = r^T(x^{π''} − x^π) ≤ n^2 Δ.

Policy π' has at least 1 unit of flux on the action with gain Δ, so

    r^T(x^{π''} − x^{π'}) ≤ r^T(x^{π''} − x^π) − Δ ≤ (1 − 1/n^2) r^T(x^{π''} − x^π).

Due to the polynomial contraction in the lemma above, not too many iterations can pass before a new cycle is formed.

Lemma 3.3. Let π be a policy. After O(n^2 log n) iterations starting from π, either the algorithm finishes, a new cycle is created, a cycle is broken, or some action in π never appears in a policy again until a new cycle is created.

Proof. Let π be the policy in some iteration, π'' the last policy before a new cycle is created, and π' an arbitrary policy occurring between π and π'' in the algorithm. Policy π differs from π'' in actions on paths and possibly in cycles that exist in π but have been broken in π''. By Lemma 2.5,

    (r̄^π)^T x^{π''} = r^T(x^{π''} − x^π) = 1^T(v^{π''} − v^π) = −(r̄^{π''})^T x^π.

We divide the analysis into two cases. First suppose that there exists an action a used in state s on a path in π such that −r̄^{π''}_a x^π_a ≥ (r̄^π)^T x^{π''}/n (note that (r̄^π)^T x^{π''} ≥ 0). Since a is on a path, x^π_a ≤ n, which implies −r̄^{π''}_a ≥ (r̄^π)^T x^{π''}/n^2.
Now if policy π' uses action a, then

    (r̄^{π'})^T x^{π''} = 1^T(v^{π''} − v^{π'}) ≥ v^{π''}_s − v^{π'}_s = v^{π''}_s − (r_a + γP_a^T v^{π'}) ≥ v^{π''}_s − (r_a + γP_a^T v^{π''}) = −r̄^{π''}_a ≥ (r̄^π)^T x^{π''}/n^2,

using that the values of all states are monotone increasing. In the second case there is no action a on a path in π satisfying −r̄^{π''}_a x^π_a ≥ (r̄^π)^T x^{π''}/n. The remaining portion of −(r̄^{π''})^T x^π is due to cycles, so there must be some cycle C consisting of actions {a_1, ..., a_k} used in states {s_1, ..., s_k} such that Σ_{a∈C} −r̄^{π''}_a x^π_a ≥ (r̄^π)^T x^{π''}/n. All flux in C first enters C either from a path ending at C or from the initial unit of flux placed on some state in C. If y_s ≥ 1 units of flux first enter C at state s in policy π, then that flux earns y_s (v^π_s − v^{π''}_s) reward with respect to the gains r̄^{π''}, so

    Σ_{a∈C} −r̄^{π''}_a x^π_a = Σ_{s∈C} y_s (v^{π''}_s − v^π_s).

Moreover, each term v^{π''}_s − v^π_s is nonnegative, since the values of all states are nondecreasing. Now note that Σ_{s∈C} (v^{π''}_s − v^π_s) = Σ_{a∈C} −r̄^{π''}_a/(1 − γ), and at most n units of flux enter each state from outside. Therefore

    n Σ_{a∈C} −r̄^{π''}_a/(1 − γ) ≥ Σ_{a∈C} −r̄^{π''}_a x^π_a,

implying n^2 Σ_{a∈C} −r̄^{π''}_a/(1 − γ) ≥ (r̄^π)^T x^{π''}.
As long as cycle C is intact, each a ∈ C has at least 1/(1 − γ) flux from the states in C (Lemma 3.1), so if C is in policy π' then

    (r̄^{π'})^T x^{π''} = 1^T(v^{π''} − v^{π'}) ≥ Σ_{s∈C} (v^{π''}_s − v^{π'}_s) = Σ_{a∈C} −r̄^{π''}_a/(1 − γ) ≥ (r̄^π)^T x^{π''}/n^2.    (6)

Now if log_{n^2/(n^2−1)} n^2 iterations occur between π and π', Lemma 3.2 implies

    (r̄^{π'})^T x^{π''} < (1 − 1/n^2)^{log_{n^2/(n^2−1)} n^2} (r̄^π)^T x^{π''} = (r̄^π)^T x^{π''}/n^2.

In the first case action a cannot appear in π', and in the second case cycle C must be broken in π'. This takes log_{n^2/(n^2−1)} n^2 = O(n^2 log n) iterations if no new cycle interrupts the process.

Lemma 3.4. Either the algorithm finishes or a new cycle is created after O(n^2 m log n) iterations.

Proof. Let π_0 be a policy after a new cycle is created, and consider the policies π_1, π_2, ..., each separated by O(n^2 log n) iterations. If no new cycle is created, then by Lemma 3.3 each of these policies π_i has either broken another cycle in π_0 or contains an action that cannot appear in π_j for all j > i. There are at most n cycles in π_0 and at most m actions that can be eliminated, so after (m + n) · O(n^2 log n) = O(n^2 m log n) iterations, the algorithm must terminate or create a new cycle.

When a new cycle is formed, the algorithm makes a substantial amount of progress toward the optimum but also resets the path optimality above.

Lemma 3.5. Let π and π' be subsequent policies such that π' creates a new cycle. Then r^T(x* − x^{π'}) ≤ (1 − 1/n) r^T(x* − x^π).

Proof. Let Δ = max_a r̄^π_a and a = argmax_a r̄^π_a. There is a total of at most n/(1 − γ) flux in the MDP, so r^T x* − r^T x^π = (r̄^π)^T x* ≤ nΔ/(1 − γ). By Lemma 3.1, pivoting on a and creating a cycle will result in at least 1/(1 − γ) flux through a. Therefore r^T x^{π'} ≥ r^T x^π + Δ/(1 − γ), so

    r^T(x* − x^{π'}) ≤ r^T(x* − x^π) − Δ/(1 − γ) ≤ (1 − 1/n) r^T(x* − x^π).

Lemma 3.6. Let π be a policy. Starting from π, after O(n log n) iterations in which a new cycle is created, some action in π is either eliminated from cycles for the remainder of the algorithm or entirely eliminated from policies for the remainder of the algorithm.

Proof. Consider a policy π with respect to the optimal gains r̄*.
There is an action a such that −r̄*_a x^π_a ≥ −(r̄*)^T x^π/n. If a is on a path in π, then 1 ≤ x^π_a ≤ n, so −r̄*_a ≥ −(r̄*)^T x^π/n^2, and if a is on a cycle, then 1/(1 − γ) ≤ x^π_a ≤ n/(1 − γ), so −r̄*_a/(1 − γ) ≥ −(r̄*)^T x^π/n^2. Since r̄* are the gains for the optimal policy, r̄*_a ≤ 0 for all a. Therefore if π' is any policy containing a, then −(r̄*)^T x^{π'} ≥ −r̄*_a x^{π'}_a ≥ −r̄*_a, and if π' is any policy containing a on a cycle, then −(r̄*)^T x^{π'} ≥ −r̄*_a x^{π'}_a ≥ −r̄*_a/(1 − γ). Now by Lemma 3.5, if there are more than log_{n/(n−1)} n^2 = O(n log n) new cycles created between policies π and π', then

    −(r̄*)^T x^{π'} < (1 − 1/n)^{log_{n/(n−1)} n^2} (−(r̄*)^T x^π) = −(r̄*)^T x^π/n^2.
Therefore if π contained a on a path, then a cannot appear in any policy after π' for the remainder of the algorithm, and if π contained a on a cycle, then a cannot appear in a cycle after π' (but may appear on a path) for the remainder of the algorithm.

Theorem 3.7. The simplex method converges in at most O(n^3 m^2 log^2 n) iterations on deterministic MDPs with a uniform discount using the highest-gain pivoting rule.

Proof. Consider the policies π_0, π_1, π_2, ..., where O(n log n) new cycles have been created between π_i and π_{i+1}. By Lemma 3.6, each π_i contains an action that is either eliminated entirely in π_j for j > i or eliminated from cycles. Each action can be eliminated from cycles and from paths, so after 2m such rounds of O(n log n) new cycles the algorithm has converged. By Lemma 3.4 cycles are created every O(n^2 m log n) iterations, for a total of O(n^3 m^2 log^2 n) iterations.

4 Varying Discounts

In this section we allow each action a to have a distinct discount γ_a. This significantly complicates the proof of convergence since the total flux is no longer fixed. When updating a cycle we can no longer bound the distance to the optimum based solely on the maximum gain, since the optimal policy may employ actions with smaller gains than the current policy but substantially more flux. We are able to exhibit a set of layers in which the flux on cycles must lie based on the discounts of the actions, and we will show that when a cycle is created in a particular layer we make progress toward the optimum value for the updated state, assuming that it lies within that layer. These layers will define a set of bounds whose values we must surpass, which serve as milestones or checkpoints to the optimum. When we update a cycle we cannot claim that the overall objective increases substantially but only that the values of individual states make progress toward one of these milestone values. When the values of all states have surpassed each of these intermediate milestones the algorithm will terminate. We first define some notation.
Recall that to calculate flux we place one unit of mass in each state and then run the Markov chain, so all flux traces back to some state, but x^π aggregates all of it together. Because we will be concerned with analyzing the values of individual states in this section, it will be useful to separate out the flux originating in a particular state. Consider the following alternate LP:

    maximize   r^T x
    subject to Σ_{a∈A_s} x_a = 1 + Σ_a γ_a P_{a,s} x_a,
               Σ_{a∈A_{s'}} x_a = Σ_a γ_a P_{a,s'} x_a   for all s' ≠ s,    (7)
               x ≥ 0

The LP (7) is identical to (3), except that initial flux is only added to state s rather than all states, and the dual of (7) matches (4) if the objective in (4) is changed to minimize only v_s. Feasible solutions in (7) measure only flux originating in s and contributing to v_s. For a state s and policy π we use the notation x^{π,s} to denote the corresponding vertex in (7). Note that x^π = Σ_s x^{π,s}. The following lemma is analogous to Lemma 2.5 and has an identical proof:

Lemma 4.1. For a state s and for policies π and π',

    (r̄^π)^T x^{π',s} = r^T x^{π',s} − r^T x^{π,s} = v^{π'}_s − v^π_s.
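The per-state flux x^{π,s} and the identity x^π = Σ_s x^{π,s} can be illustrated concretely. The sketch below is ours (the 3-cycle instance and the per-action discounts are hypothetical); it solves the flux system of (7) for each source state under nonuniform discounts and also checks the first identity of the upcoming Lemma 4.3.

```python
import numpy as np

# Per-state flux of LP (7) on a hypothetical 3-state cycle with
# nonuniform, per-action discounts gamma_a (all values illustrative).
n = 3
gamma_a = np.array([0.9, 0.8, 0.7])   # discount of the action used in each state
P_pi = np.zeros((n, n))               # P_pi[s_next, s] = 1 when pi moves s to s_next
for s in range(n):
    P_pi[(s + 1) % n, s] = 1.0

# With per-action discounts, mass leaving state s is scaled by gamma_a[s].
D = P_pi @ np.diag(gamma_a)

# x^{pi,s}: flux when the single initial unit of mass is placed on s alone.
x_from = {s: np.linalg.solve(np.eye(n) - D, np.eye(n)[s]) for s in range(n)}

# The aggregate flux of Definition 2.2 is the sum of the per-state fluxes.
x_total = np.linalg.solve(np.eye(n) - D, np.ones(n))
assert np.allclose(sum(x_from.values()), x_total)

# First identity of Lemma 4.3: the flux from s through its own action is
# 1/(1 - gamma_C), where gamma_C is the product of discounts on the cycle.
gamma_C = np.prod(gamma_a)
assert np.allclose([x_from[s][s] for s in range(n)], 1.0 / (1.0 - gamma_C))
```

Separating the flux by source state in this way is exactly what makes the milestone argument of this section possible: each v_s can be tracked through its own flux vector.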
We now define the intervals in which the flux must lie. As in Section 3, flux on paths is in [1, n]. Let C be a cycle in some policy, and γ_C = Π_{a∈C} γ_a be the total discount of C. We will prove that the smallest discount in C determines the rough order of magnitude of the flux through C.

Definition 4.2. Let C be a cycle and a an action in C. Then the discount of a dominates the discount of C if γ_a ≤ γ_{a'} for all a' ∈ C.

Lemma 4.3. Let π be a policy containing the cycle C with discount dominated by γ_a and total discount γ_C. Let s be a state on C, a_s the action used in s, and a' an arbitrary action in C. Then x^{π,s}_{a_s} = 1/(1 − γ_C), γ_C/(1 − γ_C) ≤ x^{π,s}_{a'} ≤ 1/(1 − γ_C), and 1/(n(1 − γ_a)) ≤ 1/(1 − γ_C) ≤ 1/(1 − γ_a).

Proof. For the first equality, all flux originates at s, so the flux through a_s either just originated in s or came around the cycle from s, implying x^{π,s}_{a_s} = 1 + γ_C x^{π,s}_{a_s}. An analogous equation holds for all other actions a' on C, but now the initial flow from s may have been discounted by at most γ_C before reaching a', giving γ_C/(1 − γ_C) ≤ x^{π,s}_{a'} ≤ 1/(1 − γ_C). The upper bound in the final inequality, 1/(1 − γ_C) ≤ 1/(1 − γ_a), holds since γ_a is a factor of γ_C (γ_a dominates the discount of C), so γ_C ≤ γ_a. For the lower bound, let l = 1 − γ_a. Then

    γ_C ≥ γ_a^n = (1 − l)^n ≥ 1 − nl = 1 − n(1 − γ_a),

implying 1/(1 − γ_C) ≥ 1/(n(1 − γ_a)).

Flux on paths still falls in [1, n], so the algorithm behaves the same on paths as it did in the uniform case:

Lemma 4.4. Either the algorithm finishes or a new cycle is created after O(n^2 m log n) iterations.

Proof. This is identical to the proof of Lemma 3.4, which depends on Lemmas 3.2 and 3.3. Lemma 3.2 holds for nonuniform discounts, and Lemma 3.3 holds after adjusting Equation (6) as follows:

    (r̄^{π'})^T x^{π''} ≥ Σ_{s∈C} (v^{π''}_s − v^{π'}_s) ≥ Σ_{a∈C} −r̄^{π''}_a/(1 − γ_C) ≥ (r̄^π)^T x^{π''}/n^2,

using that Σ_{a∈C} −r̄^{π''}_a · n/(1 − γ_C) ≥ Σ_{a∈C} −r̄^{π''}_a x^π_a ≥ (r̄^π)^T x^{π''}/n and Lemma 4.3.

Now suppose the simplex method updates the action for state s in policy π and creates a cycle dominated by γ_a. Again, v_s may not improve much, since there may be a cycle with discount much larger than γ_a.
However, in any policy π' where s is on a cycle dominated by γ_a and s uses some action a', we have 1/(n(1 − γ_a)) ≤ x^{π',s}_{a'} ≤ 1/(1 − γ_a), which allows us to argue that v_s has made progress toward the highest value achievable when it is on a cycle dominated by γ_a, and after enough such progress has been made, v_s will beat this value and never again appear on any cycle dominated by γ_a. The optimal value achievable for each state on a cycle dominated by each γ_a serves as the above-mentioned milestone. Since all cycles are dominated by some γ_a, there are m milestones per state.

Lemma 4.5. Suppose the simplex method moves from π to π' by updating the action for state s, creating a new cycle C with discount dominated by γ_a for some a in π'. Let π'' be the final policy used by the simplex method in which s is in a cycle dominated by γ_a. Then v^{π''}_s − v^{π'}_s ≤ (1 − 1/n^2)(v^{π''}_s − v^π_s).
Proof. Let $\Delta = \max_a r^{\pi}_a$ be the value of the highest gain with respect to $\pi$. Any cycle contains at most $n$ actions, each of which has gain at most $\Delta$ in $r^{\pi}$, so if $s$ is on a cycle dominated by $\gamma_a$ in $\pi''$ then by Lemmas 4.3 and 4.1, $v^{\pi''}_s - v^{\pi}_s \le n\Delta/(1-\gamma_a)$, and since $\pi'$ creates a cycle dominated by $\gamma_a$, by the same lemmas $v^{\pi'}_s \ge v^{\pi}_s + \Delta/(n(1-\gamma_a))$. Combining the two,
$$v^{\pi''}_s - v^{\pi'}_s = (v^{\pi''}_s - v^{\pi}_s) - (v^{\pi'}_s - v^{\pi}_s) \le (v^{\pi''}_s - v^{\pi}_s) - \frac{\Delta}{n(1-\gamma_a)} \le \left(1 - \frac{1}{n^2}\right)(v^{\pi''}_s - v^{\pi}_s).$$

The following lemma is the crux of our analysis and allows us to eliminate actions when we get close to a milestone value. This occurs because the positive gains must shrink or else the algorithm would surpass the milestone, and as the positive gains shrink they can no longer balance larger negative gains, forcing such actions out of the cycle.

Lemma 4.6. Suppose policy $\pi$ contains a cycle $C$ with discount dominated by $\gamma_a$ and $s$ is a state in $C$. There is some action $a'$ in $C$ (depending on $s$) such that after $O(n^2 \log n)$ iterations that change the action for $s$ and create a cycle with discount dominated by $\gamma_a$, action $a'$ will never again appear in a cycle dominated by $\gamma_a$.

Proof. Let $\pi$ be a policy containing a cycle $C$ with discount dominated by $\gamma_a$ and $s$ a state in $C$. Let $\pi'$ be another policy where $s$ is on a cycle dominated by $\gamma_a$ after at least $1 + \log_{n^2/(n^2-1)} n^5 = O(n^2 \log n)$ iterations that create such a cycle by changing the action for $s$, and $\pi''$ the final policy used by the algorithm in which $s$ is on a cycle dominated by $\gamma_a$. Consider the policy $\hat{\pi}$ in the iteration immediately preceding $\pi'$. By Lemma 4.5 and the choice of $\pi'$,
$$v^{\pi''}_s - v^{\hat{\pi}}_s \le \left(1 - \frac{1}{n^2}\right)^{\log_{n^2/(n^2-1)} n^5} (v^{\pi''}_s - v^{\pi}_s) = \frac{1}{n^5}(v^{\pi''}_s - v^{\pi}_s),$$
or equivalently $v^{\pi''}_s - v^{\pi}_s \ge n^5 (v^{\pi''}_s - v^{\hat{\pi}}_s)$, implying
$$v^{\pi}_s - v^{\hat{\pi}}_s = -(v^{\pi''}_s - v^{\pi}_s) + (v^{\pi''}_s - v^{\hat{\pi}}_s) \le (-n^5 + 1)(v^{\pi''}_s - v^{\hat{\pi}}_s). \qquad (8)$$

Since the gap $v^{\pi}_s - v^{\hat{\pi}}_s$ is large and negative, there must be highly negative gains in $r^{\hat{\pi}}$. By Lemma 4.1, $v^{\pi}_s - v^{\hat{\pi}}_s = (r^{\hat{\pi}})^T x^{\pi,s}$. Let $r^{\hat{\pi}}_{a'} = \min_{a'' \in C} r^{\hat{\pi}}_{a''}$ and $s'$ be the state using $a'$. By Lemma 4.3, $x^{\pi,s}_{a''} \le 1/(1-\gamma_a)$ for each action $a''$, and $C$ has at most $n$ states, so applying Equation (8),
$$r^{\hat{\pi}}_{a'} \le \frac{1-\gamma_a}{n}\left(v^{\pi}_s - v^{\hat{\pi}}_s\right) \le \frac{1-\gamma_a}{n}(-n^5 + 1)(v^{\pi''}_s - v^{\hat{\pi}}_s). \qquad (9)$$

The positive entries in $r^{\hat{\pi}}$ must all be small, since there is only a small increase in the value of $s$. Let $\Delta = \max r^{\hat{\pi}}$. The algorithm pivots on the highest gain, and by assumption it updates the action for $s$ and creates a cycle dominated by $\gamma_a$. By Lemma 4.3, the new action is used at least $1/(n(1-\gamma_a))$ times by flux from $s$, since it is the first action in the cycle, so
$$\frac{\Delta}{n(1-\gamma_a)} \le v^{\pi'}_s - v^{\hat{\pi}}_s \le v^{\pi''}_s - v^{\hat{\pi}}_s. \qquad (10)$$

We prove that the highly negative $r^{\hat{\pi}}_{a'}$ cannot coexist with only small positive gains bounded by $\Delta$. Consider any policy in which $s'$ is on a cycle $C'$ containing $a'$ (but not necessarily containing $s$) with discount $\gamma_{C'}$ dominated by $\gamma_a$. By Lemma 4.3, there is at least $1/(1-\gamma_{C'}) \ge 1/(n(1-\gamma_a))$ flux from $s'$ going through $a'$, and in the rest of the cycle there are at most $n-1$ other actions with at most $1/(1-\gamma_{C'}) \le 1/(1-\gamma_a)$ flux. The highest gain with respect to $\hat{\pi}$ is $\Delta$, so the value of $v_{s'}$ relative to $r^{\hat{\pi}}$ is at most
$$\frac{r^{\hat{\pi}}_{a'}}{n(1-\gamma_a)} + \frac{n\Delta}{1-\gamma_a} \le \left(-n^3 + \frac{1}{n^2}\right)(v^{\pi''}_s - v^{\hat{\pi}}_s) + n^2 (v^{\pi''}_s - v^{\hat{\pi}}_s) \le \left(-n^3 + \frac{1}{n^2} + n^2\right)(v^{\pi''}_s - v^{\hat{\pi}}_s) < 0$$
using Equations (9) and (10). But $v^{\hat{\pi}}_{s'} = 0$ relative to $r^{\hat{\pi}}$, and it only increases in future iterations, so $a'$ cannot appear again in a cycle dominated by $\gamma_a$.

Lemma 4.7. For any action $a$, there are at most $O(n^3 m \log n)$ iterations that create a cycle with discount dominated by $\gamma_a$.

Proof. After $O(n^3 \log n)$ iterations that create a cycle dominated by $\gamma_a$, some state must have been updated in $O(n^2 \log n)$ of those iterations, so by Lemma 4.6 some action will never appear again in a cycle dominated by $\gamma_a$. After $m$ repetitions of this process all actions have been eliminated.

Theorem 4.8. Simplex terminates in at most $O(n^5 m^3 \log^2 n)$ iterations on deterministic MDPs with nonuniform discounts using the highest gain pivoting rule.

Proof. There are $O(m)$ possible discounts $\gamma_a$ that can dominate a cycle, and by Lemma 4.7 there are at most $O(n^3 m \log n)$ iterations creating a cycle dominated by any particular $\gamma_a$, for a total of $O(n^3 m^2 \log n)$ iterations that create a cycle. By Lemma 4.4 a new cycle is created every $O(n^2 m \log n)$ iterations, for a total of $O(n^5 m^3 \log^2 n)$ iterations overall.

5 Open problems

A difficult but natural next step would be to try to extend these techniques to handle policy iteration on deterministic MDPs. The main problem encountered is that the multiple simultaneous pivots used in policy iteration can interfere with each other in such a way that the algorithm effectively pivots on the smallest improving switch rather than the largest. See [HZ10] for such an example. Another challenging open question is to design a strongly polynomial algorithm for general MDPs.
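The single-switch dynamics discussed in this paper are easy to experiment with. Below is a minimal illustrative sketch of ours, not code from the paper, and all identifiers are made up: the simplex method with the highest-gain pivoting rule on a deterministic discounted MDP, where each action is a (state, next state, reward, discount) tuple, policy values are computed by fixed-point iteration, and each step switches the single action of highest positive reduced cost.

```python
# Illustrative sketch (not from the paper): the simplex method with the
# highest-gain pivoting rule on a deterministic discounted MDP.
# An action is a tuple (state, next_state, reward, discount); a policy
# chooses one action index per state.

def policy_values(policy, actions, n, tol=1e-12):
    """Solve v[s] = r + gamma * v[t] along the chosen actions by
    fixed-point iteration (converges since every discount is < 1)."""
    v = [0.0] * n
    while True:
        delta = 0.0
        for s in range(n):
            _, t, r, g = actions[policy[s]]
            new = r + g * v[t]
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < tol:
            return v

def simplex_highest_gain(actions, n):
    """Repeatedly switch the single action of highest positive gain
    (reduced cost) until no improving action remains."""
    # start from an arbitrary policy: the first action listed per state
    policy = {}
    for i, (s, _, _, _) in enumerate(actions):
        policy.setdefault(s, i)
    while True:
        v = policy_values(policy, actions, n)
        best, best_gain = None, 1e-9    # small threshold avoids numerical cycling
        for i, (s, t, r, g) in enumerate(actions):
            gain = r + g * v[t] - v[s]  # reduced cost of action i
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:                # no improving switch: optimal policy
            return policy, v
        policy[actions[best][0]] = best  # the single-switch pivot

# Toy 2-state instance: state 1 loops on itself with reward 2, so
# v(1) = 2/(1 - 0.9) = 20; state 0 is best off jumping to state 1.
ACTIONS = [
    (0, 0, 1.0, 0.9),  # stay at 0, reward 1  -> value 10 if kept
    (0, 1, 0.0, 0.9),  # jump to 1            -> value 0.9 * 20 = 18
    (1, 1, 2.0, 0.9),  # stay at 1, reward 2  -> value 20
    (1, 0, 0.0, 0.9),  # jump to 0
]
POLICY, V = simplex_highest_gain(ACTIONS, 2)
```

On this toy instance the method pivots exactly once: state 0 abandons its self-loop (value $1/(1-0.9) = 10$) for the action joining state 1's cycle, after which no action has positive gain.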
Finally, we believe the technique of dividing variable values into polynomially sized layers may be helpful for entirely different problems.

Acknowledgments. The authors would like to thank Kazuhisa Makino for pointing out an error in Lemma 3.3.

References

[Bel57] Richard E. Bellman. Dynamic Programming. Princeton University Press, 1957.

[Ber96] Dimitri P. Bertsekas. Dynamic Programming and Optimal Control. Athena Scientific, 1996.
[Fea10] John Fearnley. Exponential lower bounds for policy iteration. In Automata, Languages and Programming, volume 6199 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2010.

[FHZ11] Oliver Friedmann, Thomas Dueholm Hansen, and Uri Zwick. Subexponential lower bounds for randomized pivoting rules for the simplex algorithm. In Proc. 43rd Symposium on Theory of Computing, STOC '11. ACM, 2011.

[Fri09] Oliver Friedmann. An exponential lower bound for the parity game strategy improvement algorithm as we know it. In Proc. 24th Logic In Computer Science, LICS '09, 2009.

[Fri11] Oliver Friedmann. A subexponential lower bound for Zadeh's pivoting rule for solving linear programs and games. In Integer Programming and Combinatorial Optimization, volume 6655 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2011.

[HMZ11] Thomas Dueholm Hansen, Peter Bro Miltersen, and Uri Zwick. Strategy iteration is strongly polynomial for 2-player turn-based stochastic games with a constant discount factor. In ICS, 2011.

[HN94] Dorit S. Hochbaum and Joseph (Seffi) Naor. Simple and fast algorithms for linear and integer programs with two variables per inequality. SIAM Journal on Computing, 23:1179, 1994.

[How60] Ronald Howard. Dynamic Programming and Markov Decision Processes. MIT Press, Cambridge, 1960.

[HZ10] Thomas Hansen and Uri Zwick. Lower bounds for Howard's algorithm for finding minimum mean-cost cycles. In Otfried Cheong, Kyung-Yong Chwa, and Kunsoo Park, editors, Algorithms and Computation, volume 6506 of Lecture Notes in Computer Science. Springer Berlin / Heidelberg, 2010.

[LDK95] Michael L. Littman, Thomas L. Dean, and Leslie Pack Kaelbling. On the complexity of solving Markov decision problems. In Proc. 11th Uncertainty in Artificial Intelligence, UAI '95, 1995.

[Mad02] Omid Madani. On policy iteration as a Newton's method and polynomial policy iteration algorithms. In Proc. 18th National Conference on Artificial Intelligence, 2002.

[MC94] Mary Melekopoglou and Anne Condon. On the complexity of the policy improvement algorithm for Markov decision processes. ORSA Journal on Computing, 6(2), 1994.

[MS99] Yishay Mansour and Satinder Singh. On the complexity of policy iteration. In Proc. 15th Uncertainty in Artificial Intelligence, UAI '99, 1999.

[MTZ10] Omid Madani, Mikkel Thorup, and Uri Zwick. Discounted deterministic Markov decision processes and discounted all-pairs shortest paths. ACM Transactions on Algorithms (TALG), 6(2):33:1-33:25, 2010.

[PT87] Christos Papadimitriou and John N. Tsitsiklis. The complexity of Markov decision processes. Mathematics of Operations Research, 12(3), August 1987.

[Put94] Martin L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. Wiley, New York, NY, USA, 1994.

[Ye05] Yinyu Ye. A new complexity result on solving the Markov decision problem. Mathematics of Operations Research, 30(3), August 2005.

[Ye11] Yinyu Ye. The simplex and policy-iteration methods are strongly polynomial for the Markov decision problem with a fixed discount rate. Mathematics of Operations Research, 36(4), November 2011.
More informationPerformance Evaluation
Performance Evaluation 95 (206) 40 Content lit available at ScienceDirect Performance Evaluation journal homepage: www.elevier.com/locate/peva Optimal cheduling in call center with a callback option Benjamin
More informationMath 273 Solutions to Review Problems for Exam 1
Math 7 Solution to Review Problem for Exam True or Fale? Circle ONE anwer for each Hint: For effective tudy, explain why if true and give a counterexample if fale (a) T or F : If a b and b c, then a c
More informationThe Impact of Imperfect Scheduling on Cross-Layer Rate. Control in Multihop Wireless Networks
The mpact of mperfect Scheduling on Cro-Layer Rate Control in Multihop Wirele Network Xiaojun Lin and Ne B. Shroff Center for Wirele Sytem and Application (CWSA) School of Electrical and Computer Engineering,
More informationONLINE APPENDIX: TESTABLE IMPLICATIONS OF TRANSLATION INVARIANCE AND HOMOTHETICITY: VARIATIONAL, MAXMIN, CARA AND CRRA PREFERENCES
ONLINE APPENDIX: TESTABLE IMPLICATIONS OF TRANSLATION INVARIANCE AND HOMOTHETICITY: VARIATIONAL, MAXMIN, CARA AND CRRA PREFERENCES CHRISTOPHER P. CHAMBERS, FEDERICO ECHENIQUE, AND KOTA SAITO In thi online
More information