Robust Shortest Path Planning and Semicontractive Dynamic Programming

February 2014 (Revised August 2014 and June 2016)
Report LIDS-2915

Dimitri P. Bertsekas†

Abstract

In this paper we consider shortest path problems in a directed graph where the transitions between nodes are subject to uncertainty. We use a minimax formulation, where the objective is to guarantee that a special destination state is reached with a minimum cost path under the worst possible instance of the uncertainty. Problems of this type arise, among others, in planning and pursuit-evasion contexts, and in model predictive control. Our analysis makes use of the recently developed theory of abstract semicontractive dynamic programming models. We investigate questions of existence and uniqueness of solution of the optimality equation, existence of optimal paths, and the validity of various algorithms patterned after the classical methods of value and policy iteration, as well as a Dijkstra-like algorithm for problems with nonnegative arc lengths.

† Dimitri Bertsekas is with the Dept. of Electr. Engineering and Comp. Science, and the Laboratory for Information and Decision Systems, M.I.T., Cambridge, Mass., 02139.

1. INTRODUCTION

In this paper, we discuss shortest path problems that embody a worst-case view of uncertainty. These problems relate to several other types of problems arising in stochastic and minimax control, model predictive control, Markovian decision processes, planning, sequential games, robust and combinatorial optimization, and solution of discretized large-scale differential equations. Consequently, our analysis and algorithms relate to a large body of existing theory. However, in this paper we rely on a recently developed abstract dynamic programming theory of semicontractive problems, and capitalize on general results developed in the context of this theory [Ber]. We first discuss informally these connections and we survey the related literature.

Relations with Other Problems and Literature Review

The closest connection to our work is the classical shortest path problem where the objective is to reach a destination node with a minimum length path from every other node in a directed graph. This is a fundamental problem that has an enormous range of applications and has been studied extensively (see e.g., the surveys [Dre69], [GaP88], and many textbooks, including [Roc84], [AMO89], [Ber98], [Ber5]). The assumption is that at any node $x$, we may determine a successor node $y$ from a given set of possible successors, defined by the arcs $(x, y)$ of the graph that are outgoing from $x$.

In some problems, however, following the decision at a given node, there is inherent uncertainty about the successor node. In a stochastic formulation, the uncertainty is modeled by a probability distribution over the set of successors, and our decision is then to choose at each node one distribution out of a given set of distributions. The resulting problem, known as the stochastic shortest path problem (also known as the transient programming problem), is a total cost infinite horizon Markovian decision problem, with a substantial analytical and algorithmic methodology, which finds extensive applications in problems of motion planning, robotics, and other problems where the aim is to reach a goal state with probability 1 under stochastic uncertainty (see [Pal67], [Der7], [Pli78], [Whi8], [BeT89], [BeT9], [Pt94], [HCP99], [HiW5], [ac6], [LaV6], [Bon7], [Ber], [BeY6], [CCV4], [YBa]). Another important area of application is large-scale computation for discretized versions of differential equations (such as the Hamilton-Jacobi-Bellman equation, and the eikonal equation); see [GoR85], [al87], [KD9], [BGM95], [Tsi95], [PBT98], [Set99a], [Set99b], [Vla8], [AlM], [ChV], [AnV4], [Mir4].

In this paper, we introduce a sequential minimax formulation of the shortest path problem, whereby the uncertainty is modeled by set membership: at a given node, we may choose one subset out of a given collection of subsets of nodes, and the successor node on the path is chosen from this subset by an antagonistic opponent. Our principal method of analysis is dynamic programming (DP for short). Related problems have been studied for a long time, in the context of control of uncertain discrete-time dynamic systems with a set membership description of the uncertainty (starting with the theses [Wit66] and [Ber7], and followed up by many other works; see e.g., the monographs [BaB9], [KV97], [BlM8], the survey [Bla99], and the references given there). These problems are relevant for example in the context of model predictive control under uncertainty, a subject of great importance in the current practice of control theory (see e.g., the surveys [MoL99], [MRR], and the books [Mac], [CaB4], [RaM9]; model predictive control with set membership disturbances is discussed in the thesis [Ker] and the text [Ber5]).

Sequential minimax problems have also been studied in the context of sequential games (starting with the paper [Sha5], and followed up by many other works, including the books [BaB9], [iv96], and the references given there). Sequential games that involve shortest paths are particularly relevant; see the works [PaB99], [Gr8], [Y], [BaL5]. An important difference with some of the works on sequential games is that in our minimax formulation, we assume that the antagonistic opponent knows the decision and corresponding subset of successor nodes chosen at each node. Thus in our problem, it would make a difference if the decisions at each node were made with advance knowledge of the opponent's choice ("min-max" is typically not equal to "max-min" in our context). Generally shortest path games admit a simpler analysis when the arc lengths are assumed nonnegative (as is done for example in the recent works [Gr8], [BaL5]), when the problem inherits the structure of negative DP (see [Str66], or the texts [Pt94], [Ber]) or of abstract monotone increasing DP models (see [Ber77], [BeS78], [Ber]). However, our formulation and line of analysis is based on the recently introduced abstract semicontractive DP model of [Ber], and allows negative as well as nonnegative arc lengths. Problems with negative arc lengths arise in applications when we want to find the longest path in a network with nonnegative arc lengths, such as critical path analysis. Problems with both positive and negative arc lengths include searching a network for objects of value with positive search costs (cf. the examples of Section 4), and financial problems of maximization of total reward when there are transaction and other costs.

An important application of our shortest path problems is in pursuit-evasion (or search and rescue) contexts, whereby a team of "pursuers" are aiming to reach one or more "evaders" that move unpredictably. Problems of this kind have been studied extensively from different points of view (see e.g., [Par76], [MHG88], [BaS9], [HKR9], [BS94], [BB95], [GLL99], [VKS], [LaV6], [AHK7], [BBH8], [BaL5]). For our shortest path formulation to be applicable to such a problem, the pursuers and the evaders must have perfect information about each other's positions, and the Cartesian product of their positions (the state of the system) must be restricted to the finite set of nodes of a given graph, with known transition costs (i.e., a "terrain map" that is known a priori).
We may deal with pursuit-evasion problems with imperfect state information and set-membership uncertainty by means of a reduction to perfect state information, which is based on set membership estimators and the notion of a sufficiently informative function, introduced in the thesis [Ber7] and in the subsequent paper [BeR7]. In this context, the original imperfect state information problem is reformulated as a problem of perfect state information, where the states correspond to subsets of nodes of the original graph (the set of states that are consistent with the observation history of the system, in the terminology of set membership estimation [BeR7], [Ber7], [KV97]). Thus, since $X$ has a finite number of nodes, the reformulated problem still involves a finite (but much larger) number of states, and may be dealt with using the methodology of this paper. Note that the problem reformulation just described is also applicable to general minimax control problems with imperfect state information, not just to pursuit-evasion problems.

Our work is also related to the subject of robust optimization (see e.g., the book [BGN9] and the recent survey [BBC]), which includes minimax formulations of general optimization problems with set membership uncertainty. However, our emphasis here is placed on the presence of the destination node and the requirement for termination, which is the salient feature and the essential structure of shortest path problems. Moreover, a difference with other works on robust shortest path selection (see e.g., [YY98], [BeS], [MoG4]) is that in our work the uncertainty about transitions or arc cost data at a given node is decoupled from the corresponding uncertainty at other nodes. This allows a DP formulation of our problem.

Because our context differs in essential respects from the preceding works, the results of the present paper are new to a great extent. The line of analysis is also new, and is based on the connection with the theory of abstract semicontractive DP mentioned earlier. In addition to simpler proofs, a major benefit of this abstract line of treatment is deeper insight into the structure of our problem, and the nature of our analytical and computational results. Several related problems, involving for example an additional stochastic type of uncertainty, admit a similar treatment. Some of these problems are described in the last section, and their analysis and associated algorithms are subjects for further research.

Robust Shortest Path Problem Formulation

To formally describe our problem, we consider a graph with a finite set of nodes $X \cup \{t\}$ and a finite set of directed arcs $A \subset \{(x, y) \mid x, y \in X \cup \{t\}\}$, where $t$ is a special node called the destination. At each node $x \in X$ we may choose a control or action $u$ from a nonempty set $U(x)$, which is a subset of a finite set $U$. Then a successor node $y$ is selected by an antagonistic opponent from a nonempty set $Y(x, u) \subset X \cup \{t\}$, such that $(x, y) \in A$ for all $y \in Y(x, u)$, and a cost $g(x, u, y)$ is incurred. The destination node $t$ is absorbing and cost-free, in the sense that the only outgoing arc from $t$ is $(t, t)$ and we have $g(t, u, t) = 0$ for all $u \in U(t)$.

A policy is defined to be a function $\mu$ that assigns to each node $x \in X$ a control $\mu(x) \in U(x)$. We denote the finite set of all policies by $M$. A possible path under a policy $\mu$ starting at node $x_0 \in X$ is an arc sequence of the form

$$p = \{(x_0, x_1), (x_1, x_2), \ldots\},$$

such that $x_{k+1} \in Y(x_k, \mu(x_k))$ for all $k \ge 0$. The set of all possible paths under $\mu$ starting at $x_0$ is denoted by $P(x_0, \mu)$; it is the set of paths that the antagonistic opponent may generate starting from $x_0$, once policy $\mu$ has been chosen. The length of a path $p \in P(x_0, \mu)$ is defined by

$$L_\mu(p) = \sum_{k=0}^{\infty} g\big(x_k, \mu(x_k), x_{k+1}\big),$$

if the series above is convergent, and more generally by

$$L_\mu(p) = \limsup_{m \to \infty} \sum_{k=0}^{m} g\big(x_k, \mu(x_k), x_{k+1}\big),$$

if it is not. For completeness, we also define the length of a portion $\{(x_i, x_{i+1}), (x_{i+1}, x_{i+2}), \ldots, (x_m, x_{m+1})\}$ of a path $p \in P(x_0, \mu)$, consisting of a finite number of consecutive arcs, by

$$\sum_{k=i}^{m} g\big(x_k, \mu(x_k), x_{k+1}\big).$$

When confusion cannot arise we will also refer to such a finite-arc portion as a path. Of special interest are cycles, that is, paths of the form $\{(x_i, x_{i+1}), (x_{i+1}, x_{i+2}), \ldots, (x_{i+m}, x_i)\}$. Paths that do not contain any cycle other than the self-cycle $(t, t)$ are called simple.

For a given policy $\mu$ and $x_0 \ne t$, a path $p \in P(x_0, \mu)$ is said to be terminating if it has the form

$$p = \{(x_0, x_1), (x_1, x_2), \ldots, (x_m, t), (t, t), \ldots\}, \qquad (1.1)$$

where $m$ is a positive integer, and $x_0, \ldots, x_m$ are distinct nondestination nodes. Since $g(t, u, t) = 0$ for all $u \in U(t)$, the length of a terminating path $p$ of the form (1.1), corresponding to $\mu$, is given by

$$L_\mu(p) = g\big(x_m, \mu(x_m), t\big) + \sum_{k=0}^{m-1} g\big(x_k, \mu(x_k), x_{k+1}\big),$$

and is equal to the finite length of its initial portion that consists of the first $m + 1$ arcs.

[Figure 1.1. Panels: "Improper policy $\mu$" and "Proper policy $\bar\mu$". A robust shortest path problem with $X = \{1, 2\}$, two controls at node 1, and one control at node 2. There are two policies, $\mu$ and $\bar\mu$, corresponding to the two controls at node 1. The figure shows the subgraphs of arcs $A_\mu$ and $A_{\bar\mu}$. The policy $\mu$ is improper because $A_\mu$ contains the cycle $(1, 2, 1)$ and the (self-)cycle $(1, 1)$.]

An important characterization of a policy $\mu$ is provided by the subset of arcs

$$A_\mu = \cup_{x \in X} \big\{(x, y) \mid y \in Y\big(x, \mu(x)\big)\big\}.$$

Thus $A_\mu$, together with the self-arc $(t, t)$, consists of the set of paths $\cup_{x \in X} P(x, \mu)$, in the sense that it contains this set of paths and no other paths. We say that $A_\mu$ is destination-connected if for each $x \in X$ there exists a terminating path in $P(x, \mu)$. We say that $\mu$ is proper if the subgraph of arcs $A_\mu$ is acyclic (i.e., contains no cycles). Thus $\mu$ is proper if and only if all the paths in $\cup_{x \in X} P(x, \mu)$ are simple and hence terminating (equivalently, $\mu$ is proper if and only if $A_\mu$ is destination-connected and has no cycles). The term "proper" is consistent with a similar term in stochastic shortest path problems, where it indicates a policy under which the destination is reached with probability 1, see e.g., [Pal67], [BeT89], [BeT9]. If $\mu$ is not proper, it is called improper, in which case the subgraph of arcs $A_\mu$ must contain a cycle; see the examples of Fig. 1.1.

For a proper $\mu$, we associate with every $x \in X$ the worst-case path length over the finite set of possible paths starting from $x$, which is denoted by

$$J_\mu(x) = \max_{p \in P(x, \mu)} L_\mu(p), \qquad x \in X. \qquad (1.2)$$

Thus $J_\mu(x)$ is the length of the longest path from $x$ to $t$ in the acyclic subgraph of arcs $A_\mu$. Since there are finitely many paths in this acyclic graph, $J_\mu(x)$ may be found either by enumeration and comparison of these paths (in simple cases), or by solving the shortest path problem obtained when the signs of the arc lengths $g(x, \mu(x), y)$, $(x, y) \in A_\mu$, are reversed.

Our problem is to find an optimal proper policy, i.e., one that minimizes $J_\mu(x)$ over all proper $\mu$, simultaneously for all $x \in X$, under assumptions that parallel those for the classical shortest path problem. We refer to this as the problem of robust shortest path selection (RSP for short). Note that in our problem, reaching the destination starting from every node is a requirement, regardless of the choices of the hypothetical antagonistic opponent. In other words the minimization in RSP is over the proper policies only.
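As a small illustration of the definition of $J_\mu$ for a proper policy, the following sketch computes the worst-case (longest) path length to the destination in the acyclic subgraph $A_\mu$ by memoized recursion. The function name `worst_case_lengths`, the dictionaries `succ` and `cost`, and the sample arc lengths are hypothetical and chosen only for this example; they do not come from the paper.

```python
DEST = "t"  # destination node label (an assumption of this sketch)


def worst_case_lengths(succ, cost):
    """Return J_mu(x) for every node x of a proper policy.

    succ[x]      -- the successor set Y(x, mu(x)) under the fixed policy
    cost[(x, y)] -- the arc length g(x, mu(x), y)
    """
    memo = {DEST: 0.0}   # the destination is absorbing and cost-free
    on_stack = set()     # nodes on the current recursion path

    def longest(x):
        if x in memo:
            return memo[x]
        if x in on_stack:
            # Revisiting a node on the current path means A_mu has a cycle.
            raise ValueError("A_mu contains a cycle, so the policy is improper")
        on_stack.add(x)
        value = max(cost[(x, y)] + longest(y) for y in succ[x])
        on_stack.discard(x)
        memo[x] = value
        return value

    return {x: longest(x) for x in succ}


# Hypothetical two-node instance: from node 1 the opponent may move to 2 or to t.
succ = {1: {2, DEST}, 2: {DEST}}
cost = {(1, 2): 1.0, (1, DEST): 2.0, (2, DEST): 2.0}
J = worst_case_lengths(succ, cost)
```

Here the opponent's best reply from node 1 is to pass through node 2, so the worst-case length from node 1 is the length of the two-arc path. For an improper policy the recursion detects the cycle and raises an error instead of returning a value.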

Of course for the problem to have a feasible solution and thus be meaningful, there must exist at least one proper policy, and this may be restrictive for a given problem. One may deal with cases where feasibility is not known to hold by introducing for every $x$ an artificial "termination action" $\bar u$ into $U(x)$ [i.e., a $\bar u$ with $Y(x, \bar u) = \{t\}$], associated with very large length [i.e., $g(x, \bar u, t) = \bar g \gg 1$]. Then the policy $\bar\mu$ that selects the termination action at each $x$ is proper and has cost function $J_{\bar\mu}(x) \equiv \bar g$. In the problem thus reformulated the optimal cost over proper policies will be unaffected for all nodes $x$ for which there exists a proper policy $\mu$ with $J_\mu(x) < \bar g$. Since for a proper $\mu$, the cost $J_\mu(x)$ is bounded above by the number of nodes in $X$ times the largest arc length, a suitable value of $\bar g$ is readily available.

In Section 2 we will formulate RSP in a way that the semicontractive DP framework can be applied. In Section 3, we will describe briefly this framework and we will quote the results that will be useful to us. In Section 4, we will develop our main analytical results for RSP. In Section 5, we will discuss algorithms of the value and policy iteration type, by specializing corresponding algorithms of semicontractive DP, and by adapting available algorithms for stochastic shortest path problems. Among others, we will give a Dijkstra-like algorithm for problems with nonnegative arc lengths, which terminates in a number of iterations equal to the number of nodes in the graph, and has low order polynomial complexity. Related Dijkstra-like algorithms were proposed recently, in the context of dynamic games and with an abbreviated convergence analysis, by [Gr8] and [BaL5].

2. MINIMAX FORMULATION

In this section we will reformulate RSP into a minimax problem, whereby given a policy $\mu$, an antagonistic opponent selects a successor node $y \in Y(x, \mu(x))$ for each $x \in X$, with the aim of maximizing the lengths of the resulting paths. The essential difference between RSP and the associated minimax problem is that in RSP only the proper policies are admissible, while in the minimax problem all policies will be admissible. Our analysis will be based in part on assumptions under which improper policies cannot be optimal for the minimax problem, implying that optimal policies for the minimax problem will be optimal for the original RSP problem.
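One such assumption is the following.

Assumption 2.1:
(a) There exists at least one proper policy.
(b) For every improper policy $\mu$, all cycles in the subgraph of arcs $A_\mu$ have positive length.

The preceding assumption parallels and generalizes the typical assumptions in the classical deterministic shortest path problem, i.e., the case where $Y(x, u)$ consists of a single node. Then condition (a) is equivalent to assuming that each node is connected to the destination with a path, while condition (b) is equivalent to assuming that all directed cycles in the graph have positive length.† Later in Section 4, in addition to Assumption 2.1, we will consider another weaker assumption, whereby "positive length" is replaced with "nonnegative length" in condition (b) above. This assumption will hold in the case where all arc lengths $g(x, u, y)$ are nonnegative, but there may exist a zero length cycle.

† To verify the existence of a proper policy [condition (a)] one may apply a reachability algorithm, which constructs the sequence $\{N_k\}$ of sets

$$N_{k+1} = N_k \cup \big\{x \in X \cup \{t\} \mid \text{there exists } u \in U(x) \text{ with } Y(x, u) \subset N_k\big\},$$

starting with $N_0 = \{t\}$ (see [Ber7], [BeR7]). A proper policy exists if and only if this algorithm stops with a final set $\cup_k N_k$ equal to $X \cup \{t\}$. If there is no proper policy, this algorithm will stop with $\cup_k N_k$ equal to a strict subset of $X \cup \{t\}$ of nodes starting from which there exists a terminating path under some policy. The problem may then be reformulated over the reduced graph consisting of the node set $\cup_k N_k$, so there will exist a proper policy in this reduced problem.

As a first step, we extend the definition of the function $J_\mu$ to the case of an improper policy. Recall that for a proper policy $\mu$, $J_\mu(x)$ has been defined by Eq. (1.2), as the length of the longest path $p \in P(x, \mu)$,

$$J_\mu(x) = \max_{p \in P(x, \mu)} L_\mu(p), \qquad x \in X. \qquad (2.1)$$

We extend this definition to any policy $\mu$, proper or improper, by defining $J_\mu(x)$ as

$$J_\mu(x) = \limsup_{k \to \infty} \sup_{p \in P(x, \mu)} L_p^k(\mu), \qquad (2.2)$$

where $L_p^k(\mu)$ is the sum of lengths of the first $k$ arcs in the path $p$. When $\mu$ is proper, this definition coincides with the one given earlier [cf. Eq. (2.1)]. Thus for a proper $\mu$, $J_\mu$ is real-valued, and it is the unique solution of the optimality equation (or Bellman equation) for the longest path problem associated with the proper policy $\mu$ and the acyclic subgraph of arcs $A_\mu$:

$$J_\mu(x) = \max_{y \in Y(x, \mu(x))} \big[ g\big(x, \mu(x), y\big) + \tilde J_\mu(y) \big], \qquad x \in X, \qquad (2.3)$$

where we denote by $\tilde J_\mu$ the function given by

$$\tilde J_\mu(y) = \begin{cases} J_\mu(y) & \text{if } y \in X, \\ 0 & \text{if } y = t. \end{cases} \qquad (2.4)$$

Any shortest path algorithm may be used to solve this longest path problem for a proper $\mu$. However, when $\mu$ is improper, we may have $J_\mu(x) = \infty$, and the solution of the corresponding longest path problem may be problematic.

We will consider the problem of finding

$$J^*(x) = \min_{\mu \in M} J_\mu(x), \qquad x \in X, \qquad (2.5)$$

and a policy attaining the minimum above, simultaneously for all $x \in X$. Note that the minimization is over all policies, in contrast with the RSP problem, where the minimization is over just the proper policies.

Embedding Within an Abstract DP Model

We will now reformulate the minimax problem of Eq. (2.5) more abstractly, by expressing it in terms of the mapping that appears in Bellman's equation (2.3)-(2.4), thereby bringing to bear the theory of abstract DP. We denote by $E(X)$ the set of functions $J : X \mapsto [-\infty, \infty]$, and by $R(X)$ the set of functions $J : X \mapsto (-\infty, \infty)$. Note that since $X$ is finite, $R(X)$ can be viewed as a finite-dimensional Euclidean space. We introduce the mapping $H : X \times U \times E(X) \mapsto [-\infty, \infty]$ given by

$$H(x, u, J) = \max_{y \in Y(x, u)} \big[ g(x, u, y) + \tilde J(y) \big], \qquad (2.6)$$

where for any $J \in E(X)$ we denote by $\tilde J$ the function given by

$$\tilde J(y) = \begin{cases} J(y) & \text{if } y \in X, \\ 0 & \text{if } y = t. \end{cases} \qquad (2.7)$$

We consider for each policy $\mu$, the mapping $T_\mu : E(X) \mapsto E(X)$, defined by

$$(T_\mu J)(x) = H\big(x, \mu(x), J\big), \qquad x \in X, \qquad (2.8)$$

and we note that the fixed point equation $J_\mu = T_\mu J_\mu$ is identical to the Bellman equation (2.3). We also consider the mapping $T : E(X) \mapsto E(X)$ defined by

$$(TJ)(x) = \min_{u \in U(x)} H(x, u, J), \qquad x \in X, \qquad (2.9)$$

also equivalently written as

$$(TJ)(x) = \min_{\mu \in M} (T_\mu J)(x), \qquad x \in X. \qquad (2.10)$$

We denote by $T^k$ and $T_\mu^k$ the $k$-fold compositions of the mappings $T$ and $T_\mu$ with themselves, respectively.

Let us consider the zero function, which we denote by $\bar J$:

$$\bar J(x) \equiv 0, \qquad x \in X.$$

Using Eqs. (2.6)-(2.8), we see that for any $\mu \in M$ and $x \in X$, $(T_\mu^k \bar J)(x)$ is the result of the $k$-stage DP algorithm that computes $\sup_{p \in P(x, \mu)} L_p^k(\mu)$, the length of the longest path under $\mu$ that starts at $x$ and consists of $k$ arcs, so that

$$(T_\mu^k \bar J)(x) = \sup_{p \in P(x, \mu)} L_p^k(\mu), \qquad x \in X.$$

Thus the definition (2.2) of $J_\mu$ can be written in the alternative and equivalent form

$$J_\mu(x) = \limsup_{k \to \infty} (T_\mu^k \bar J)(x), \qquad x \in X. \qquad (2.11)$$

We are focusing on optimization over stationary policies because under the assumptions of this paper (both Assumption 2.1 and the alternative assumptions of Section 4) the optimal cost function would not be improved by allowing nonstationary policies, as shown in [Ber].†

† In the more general framework of [Ber], nonstationary Markov policies of the form $\pi = \{\mu_0, \mu_1, \ldots\}$, with $\mu_k \in M$, $k = 0, 1, \ldots$, are allowed, and their cost function is defined by

$$J_\pi(x) = \limsup_{k \to \infty} (T_{\mu_0} \cdots T_{\mu_{k-1}} \bar J)(x), \qquad x \in X,$$

where $T_{\mu_0} \cdots T_{\mu_{k-1}}$ is the composition of the mappings $T_{\mu_0}, \ldots, T_{\mu_{k-1}}$. Moreover, $J^*(x)$ is defined as the infimum of $J_\pi(x)$ over all such $\pi$. However, under the assumptions of the present paper, this infimum is attained by a stationary policy (in fact one that is proper). Hence, attention may be restricted to stationary policies without loss of optimality and without affecting the results from [Ber] that will be used.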
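The reachability test described in the footnote to Assumption 2.1 can be sketched as follows. This is a minimal illustration under assumed data structures: the function name `reachable_nodes`, the dictionaries `controls` and `successors`, and the sample graph are hypothetical. The sketch updates the set in place during each sweep rather than forming each $N_k$ separately, which yields the same final set.

```python
def reachable_nodes(nodes, controls, successors, dest="t"):
    """Return the final set of the sequence N_0 = {t}, N_1, N_2, ...

    nodes           -- iterable of the nondestination nodes X
    controls[x]     -- the control set U(x)
    successors[x,u] -- the successor set Y(x, u), given as a Python set
    """
    reached = {dest}
    changed = True
    while changed:
        changed = False
        for x in nodes:
            if x in reached:
                continue
            # x joins the set if some control forces all successors into it.
            if any(successors[(x, u)] <= reached for u in controls[x]):
                reached.add(x)
                changed = True
    return reached


# Hypothetical instance: at node 3 the opponent can always stay at node 3.
nodes = [1, 2, 3]
controls = {1: ["a", "b"], 2: ["a"], 3: ["a"]}
successors = {
    (1, "a"): {1, 2},
    (1, "b"): {2, "t"},
    (2, "a"): {"t"},
    (3, "a"): {3, "t"},
}
reached = reachable_nodes(nodes, controls, successors)
proper_policy_exists = reached == set(nodes) | {"t"}
```

In this instance the final set omits node 3, so no proper policy exists for the full graph, while one does exist on the reduced graph with nodes 1 and 2.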

The results that we will show under Assumption 2.1 generalize the main analytical results for the classical deterministic shortest path problem, and stated in abstract form, are the following:

(a) $J^*$ is the unique fixed point of $T$ within $R(X)$, and we have $T^k J \to J^*$ for all $J \in R(X)$.
(b) Only proper policies can be optimal, and there exists an optimal proper policy.†
(c) A policy $\mu$ is optimal if and only if it attains the minimum for all $x \in X$ in Eq. (2.10) when $J = J^*$.

† Since the set of policies is finite, there exists a policy minimizing $J_\mu(x)$ over the set of proper policies $\mu$, for each $x \in X$. However, the assertion here is stronger, namely that there exists a proper $\mu^*$ minimizing $J_\mu(x)$ over all $\mu \in M$ and simultaneously for all $x \in X$, i.e., a proper $\mu^*$ with $J_{\mu^*} = J^*$.

Proofs of these results from first principles are quite complex. However, fairly easy proofs can be obtained by embedding the problem of minimizing the function $J_\mu$ of Eq. (2.11) over $\mu \in M$, within the abstract semicontractive DP framework introduced in [Ber]. In particular, we will use general results for this framework, which we will summarize in the next section.

3. SEMICONTRACTIVE DP ANALYSIS

We will now view the problem of minimizing over $\mu \in M$ the cost function $J_\mu$, given in the abstract form (2.11), as a special case of a semicontractive DP model. We first provide a brief review of this model, with a notation that corresponds to the one used in the preceding section.

The starting point is a set of states $X$, a set of controls $U$, and a control constraint set $U(x) \subset U$ for each $x \in X$. For the general framework of this section, $X$ and $U$ are arbitrary sets; we continue to use some of the notation of the preceding section in order to indicate the relevant associations. A policy is a mapping $\mu : X \mapsto U$ with $\mu(x) \in U(x)$ for all $x \in X$, and the set of all policies is denoted by $M$. For each policy $\mu$, we are given a mapping $T_\mu : E(X) \mapsto E(X)$ that is monotone in the sense that for any two $J, J' \in E(X)$,

$$J \le J' \quad \Rightarrow \quad T_\mu J \le T_\mu J'.$$

We define the mapping $T : E(X) \mapsto E(X)$ by

$$(TJ)(x) = \inf_{\mu \in M} (T_\mu J)(x), \qquad x \in X.$$

The cost function of $\mu$ is defined as

$$J_\mu(x) = \limsup_{m \to \infty} (T_\mu^m \bar J)(x), \qquad x \in X,$$

where $\bar J$ is some given function in $E(X)$. The objective is to find

$$J^*(x) = \inf_{\mu \in M} J_\mu(x)$$

for each $x \in X$, and a policy $\mu$ such that $J_\mu = J^*$, if one exists. Based on the correspondences with Eqs. (2.6)-(2.11), it can be seen that the minimax problem of the preceding section is the special case of the problem of this section, where $X$ and $U$ are finite sets, $T_\mu$ is defined by Eq. (2.8), and $\bar J$ is the zero function.

In contractive models, the mappings $T_\mu$ are assumed to be contractions, with respect to a common weighted sup-norm and with a common contraction modulus, in the subspace of functions in $E(X)$ that are bounded with respect to the weighted sup-norm. These models have a strong analytical and algorithmic theory, which dates to [Den67]; see also [BeS78], and the recent extensive treatments given in [Ber]. In semicontractive models, only some policies have a contraction-like property. This property is captured by the notion of $S$-regularity of a policy introduced in [Ber] and defined as follows.

Definition 3.1: Given a set of functions $S \subset E(X)$, we say that a policy $\mu$ is $S$-regular if:
(a) $J_\mu \in S$ and $J_\mu = T_\mu J_\mu$.
(b) $\lim_{k \to \infty} T_\mu^k J = J_\mu$ for all $J \in S$.
A policy that is not $S$-regular is called $S$-irregular.

Roughly, $\mu$ is $S$-regular if $J_\mu$ is an asymptotically stable equilibrium point of $T_\mu$ within $S$. An important case of an $S$-regular $\mu$ is when $S$ is a complete subset of a metric space and $T_\mu$ maps $S$ to $S$ and, when restricted to $S$, is a contraction with respect to the metric of that space.

There are several different choices of $S$, which may be useful depending on the context, such as for example $R(X)$, $E(X)$, $\{J \in R(X) \mid J \ge \bar J\}$, $\{J \in E(X) \mid J \ge \bar J\}$, and others. There are also several sets of assumptions and corresponding results, which are given in [Ber] and will be used to prove our analytical results for the RSP problem. In this paper, we will use $S = R(X)$, but for ease of reference, we will quote results from [Ber] with $S$ being an arbitrary subset of $R(X)$.

We give below an assumption relating to semicontractive models, which is taken from [Ber]. A key part of this assumption is part (c), which implies that $S$-irregular policies have infinite cost for at least one state $x$, so they cannot be optimal. This part will provide a connection to Assumption 2.1(b).

Assumption 3.1: In the semicontractive model of this section with a set $S \subset R(X)$ the following hold:
(a) $S$ contains $\bar J$, and has the property that if $J_1, J_2$ are two functions in $S$, then $S$ contains all functions $J$ with $J_1 \le J \le J_2$.
(b) The function $\hat J$ given by
$$\hat J(x) = \inf_{\mu :\, S\text{-regular}} J_\mu(x), \qquad x \in X,$$
belongs to $S$.
(c) For each $S$-irregular policy $\mu$ and each $J \in S$, there is at least one state $x \in X$ such that
$$\limsup_{k \to \infty} (T_\mu^k J)(x) = \infty.$$
(d) The control set $U$ is a metric space, and the set $\{\mu(x) \mid (T_\mu J)(x) \le \lambda\}$ is compact for every $J \in S$, $x \in X$, and $\lambda \in \Re$.
(e) For each sequence $\{J_m\} \subset S$ with $J_m \uparrow J$ for some $J \in S$ we have
$$\lim_{m \to \infty} (T_\mu J_m)(x) = (T_\mu J)(x), \qquad \forall\ x \in X,\ \mu \in M.$$
(f) For each function $J \in S$, there exists a function $J' \in S$ such that $J' \le J$ and $J' \le T J'$.

The following two propositions are given in [Ber] (as a proposition and a lemma, respectively). Our analysis will be based on these two propositions.

Proposition 3.1: Let Assumption 3.1 hold. Then:
(a) The optimal cost function $J^*$ is the unique fixed point of $T$ within the set $S$.
(b) A policy $\mu^*$ is optimal if and only if $T_{\mu^*} J^* = T J^*$. Moreover, there exists an optimal $S$-regular policy.
(c) We have $T^k J \to J^*$ for all $J \in S$.
(d) For any $J \in S$, if $J \le TJ$ we have $J \le J^*$, and if $J \ge TJ$ we have $J \ge J^*$.

Proposition 3.2: Let Assumption 3.1(b),(c),(d) hold. Then:
(a) The function $\hat J$ of Assumption 3.1(b) is the unique fixed point of $T$ within $S$.
(b) Every policy $\mu$ satisfying $T_\mu \hat J = T \hat J$ is optimal within the set of $S$-regular policies, i.e., $\mu$ is $S$-regular and $J_\mu = \hat J$. Moreover, there exists at least one such policy.

The second proposition is useful for situations where only some of the conditions of Assumption 3.1 are satisfied, and will be useful in the proof of an important part of Prop. 4.3 in the next section.

As noted in the preceding section, a more general problem is defined in [Ber], whereby nonstationary Markov policies are allowed, and $J^*$ is defined as the infimum over these policies. However, under our assumptions, attention may be restricted to stationary policies without loss of optimality and without affecting the validity of the two propositions.

4. SEMICONTRACTIVE MODELS AND SHORTEST PATH PROBLEMS

We will now apply the preceding two propositions to the minimax formulation of the RSP problem: minimizing over all $\mu \in M$ the shortest path cost $J_\mu(x)$ as given by Eq. (2.2) for both proper and improper policies. We will first derive some preliminary results. The following proposition clarifies the properties of $J_\mu$ when $\mu$ is improper.

Proposition 4.1: Let $\mu$ be an improper policy and let $J_\mu$ be its cost function as given by Eq. (2.2).
(a) If all cycles in the subgraph of arcs $A_\mu$ have nonpositive length, $J_\mu(x) < \infty$ for all $x \in X$.
(b) If all cycles in the subgraph of arcs $A_\mu$ have nonnegative length, $J_\mu(x) > -\infty$ for all $x \in X$.
(c) If all cycles in the subgraph of arcs $A_\mu$ have zero length, $J_\mu$ is real-valued.
(d) If there is a positive length cycle in the subgraph of arcs $A_\mu$, we have $J_\mu(x) = \infty$ for at least one node $x \in X$. More generally, for each $J \in R(X)$, we have $\limsup_{k \to \infty} (T_\mu^k J)(x) = \infty$ for at least one $x \in X$.

Proof: Any path with a finite number of arcs can be decomposed into a simple path and a finite number of cycles (see e.g., the path decomposition theorem of [Ber98]). Since there is only a finite number of simple paths under $\mu$, their length is bounded above and below. Thus in part (a) the length of all paths with a finite number of arcs is bounded above, and in part (b) it is bounded below, implying that $J_\mu(x) < \infty$ for all $x \in X$ or $J_\mu(x) > -\infty$ for all $x \in X$, respectively. Part (c) follows by combining parts (a) and (b).

To show part (d), consider a path $p$, which consists of an infinite repetition of the positive length cycle that is assumed to exist. Let $C_\mu^k(p)$ be the length of the path that consists of the first $k$ cycles in $p$. Then $C_\mu^k(p) \to \infty$ and $C_\mu^k(p) \le J_\mu(x)$ for all $k$ [cf. Eq. (2.2)], where $x$ is the first node in the cycle, thus implying that $J_\mu(x) = \infty$. Moreover for every $J \in R(X)$ and all $k$, $(T_\mu^k J)(x)$ is the maximum over the lengths of the $k$-arc paths that start at $x$, plus a terminal cost that is equal to either $J(y)$ (if the terminal node of the $k$-arc path is $y \in X$), or 0 (if the terminal node of the $k$-arc path is the destination). Thus we have

$$(T_\mu^k \bar J)(x) + \min\Big\{0,\ \min_{x' \in X} J(x')\Big\} \le (T_\mu^k J)(x).$$

Since $\limsup_{k \to \infty} (T_\mu^k \bar J)(x) = J_\mu(x) = \infty$ as shown earlier, it follows that $\limsup_{k \to \infty} (T_\mu^k J)(x) = \infty$ for all $J \in R(X)$. Q.E.D.

Note that if there is a negative length cycle in the subgraph of arcs $A_\mu$, it is not necessarily true that for some $x \in X$ we have $J_\mu(x) = -\infty$. Even for $x$ on the negative length cycle, the value of $J_\mu(x)$ is determined by the longest path in $P(x, \mu)$, which may be simple, in which case $J_\mu(x)$ is a real number, or contain an infinite repetition of a positive length cycle, in which case $J_\mu(x) = \infty$.

A key fact in our analysis is the following characterization of the notion of $R(X)$-regularity and its connection to the notion of properness. It shows that proper policies are $R(X)$-regular, but the set of $R(X)$-regular policies may contain some improper policies, which are characterized in terms of the sign of the lengths of their associated cycles.

Proposition 4.2: Consider the minimax formulation of the RSP problem, viewed as a special case of the abstract semicontractive DP model of Section 3 with $T_\mu$ given by Eqs. (2.6)-(2.8), and $\bar J$ being the zero function. The following are equivalent for a policy $\mu$:
(i) $\mu$ is $R(X)$-regular.
(ii) The subgraph of arcs $A_\mu$ is destination-connected and all its cycles have negative length.
(iii) $\mu$ is either proper or else, if it is improper, all the cycles of the subgraph of arcs $A_\mu$ have negative length, and $J_\mu \in R(X)$.

Proof: To show that (i) implies (ii), let $\mu$ be $R(X)$-regular and to arrive at a contradiction, assume that $A_\mu$ contains a nonnegative length cycle. Let $x$ be a node on the cycle, consider the path $p$ that starts at $x$ and consists of an infinite repetition of this cycle, and let $L_p^k(\mu)$ be the length of the first $k$ arcs of that path. Let also $J$ be a nonzero constant function, $J(x) \equiv r$, where $r$ is a scalar. Then we have

$$L_p^k(\mu) + r \le (T_\mu^k J)(x),$$

since from the definition of $T_\mu$, we have that $(T_\mu^k J)(x)$ is the maximum over the lengths of all $k$-arc paths under $\mu$ starting at $x$, plus $r$, if the last node in the path is not the destination. Since $\mu$ is $R(X)$-regular, we have $\limsup_{k \to \infty} (T_\mu^k J)(x) = J_\mu(x) < \infty$, so that for all scalars $r$,

$$\limsup_{k \to \infty} \big( L_p^k(\mu) + r \big) \le J_\mu(x) < \infty.$$

Taking infimum over $r \in \Re$ of the resulting upper bound $J_\mu(x) - r$, it follows that $\limsup_{k \to \infty} L_p^k(\mu) = -\infty$, which contradicts the nonnegativity of the cycle of $p$. Thus all cycles of $A_\mu$ have negative length. To show that $A_\mu$ is destination-connected, assume the contrary. Then there exists some node $x \in X$ such that all paths in $P(x, \mu)$ contain an infinite number of cycles. Since the length of all cycles is negative, as just shown, it follows that $J_\mu(x) = -\infty$, which contradicts the $R(X)$-regularity of $\mu$.

To show that (ii) implies (iii), we assume that $\mu$ is improper and show that $J_\mu \in R(X)$. By (ii) $A_\mu$ is destination-connected, so the set $P(x, \mu)$ contains a simple path for all $x \in X$. Moreover, since by (ii) the cycles of $A_\mu$ have negative length, each path in $P(x, \mu)$ that is not simple has smaller length than some simple path in $P(x, \mu)$. This implies that $J_\mu(x)$ is equal to the largest path length among simple paths in $P(x, \mu)$, so $J_\mu(x)$ is a real number for all $x \in X$.

To show that (iii) implies (i), we note that if $\mu$ is proper, it is $R(X)$-regular, so we focus on the case where $\mu$ is improper. Then by (iii), $J_\mu \in R(X)$, so to show $R(X)$-regularity of $\mu$, we must show that $(T_\mu^k J)(x) \to J_\mu(x)$ for all $x \in X$ and $J \in R(X)$, and that $J_\mu = T_\mu J_\mu$.
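Indeed, from the definition of $T_\mu$, we have

$$(T_\mu^k J)(x) = \sup_{p \in P(x, \mu)} \big[ L_p^k(\mu) + J(x_p^k) \big], \qquad (4.1)$$

where $x_p^k$ is the node reached after $k$ arcs along the path $p$, and $J(t)$ is defined to be equal to 0. Thus as $k \to \infty$, for every path $p$ that contains an infinite number of cycles (each necessarily having negative length), the sequence $L_p^k(\mu) + J(x_p^k)$ approaches $-\infty$. It follows that for sufficiently large $k$, the supremum in Eq. (4.1) is attained by one of the simple paths in $P(x, \mu)$, so $x_p^k = t$ and $J(x_p^k) = 0$. Thus the limit of $(T_\mu^k J)(x)$ does not depend on $J$, and is equal to the limit of $(T_\mu^k \bar J)(x)$, i.e., $J_\mu(x)$. To show that $J_\mu = T_\mu J_\mu$, we note that by the preceding argument, $J_\mu(x)$ is the length of the longest path among paths that start at $x$ and terminate at $t$. Moreover, we have

$$(T_\mu J_\mu)(x) = \max_{y \in Y(x, \mu(x))} \big[ g\big(x, \mu(x), y\big) + J_\mu(y) \big],$$

where we denote $J_\mu(t) = 0$. Thus $(T_\mu J_\mu)(x)$ is also the length of the longest path among paths that start at $x$ and terminate at $t$, and hence it is equal to $J_\mu(x)$. Q.E.D.

[Figure 4.1. The subgraph of arcs $A_\mu$ corresponding to an improper policy $\mu$, for the case of a single node 1 and a destination node $t$. The arc lengths are shown in the figure.]

We illustrate the preceding proposition with a two-node example involving an improper policy with a cycle that may have positive, zero, or negative length.

Example 4.1: Let $X = \{1\}$, and consider the policy $\mu$ where at state 1, the antagonistic opponent may force either staying at 1 or terminating, i.e., $Y(1, \mu(1)) = \{1, t\}$. Then $\mu$ is improper since its subgraph of arcs $A_\mu$ contains the self-cycle $(1, 1)$; cf. Fig. 4.1. Let

$$g\big(1, \mu(1), 1\big) = a, \qquad g\big(1, \mu(1), t\big) = 0.$$

Then,

$$(T_\mu J)(1) = \max\big[0,\ a + J(1)\big], \qquad \forall\ J \in E(X),$$

and

$$J_\mu(1) = \begin{cases} \infty & \text{if } a > 0, \\ 0 & \text{if } a \le 0. \end{cases}$$

Consistently with Prop. 4.2, the following hold:
(a) For $a > 0$, the cycle $(1, 1)$ has positive length, and $\mu$ is $R(X)$-irregular because $J_\mu(1) = \infty$.
(b) For $a = 0$, the cycle $(1, 1)$ has zero length, and $\mu$ is $R(X)$-irregular because for a function $J \in R(X)$ with $J(1) > 0$,
$$\limsup_{k \to \infty} (T_\mu^k J)(1) = J(1) > 0 = J_\mu(1).$$
(c) For $a < 0$, the cycle $(1, 1)$ has negative length, and $\mu$ is $R(X)$-regular because $J_\mu(1) = 0$, and we have $J_\mu \in R(X)$, $J_\mu(1) = \max[0,\ a + J_\mu(1)] = (T_\mu J_\mu)(1)$, and for all $J \in R(X)$,
$$\limsup_{k \to \infty} (T_\mu^k J)(1) = 0 = J_\mu(1).$$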
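The three cases of Example 4.1 can be checked numerically by iterating the scalar mapping $(T_\mu J)(1) = \max[0, a + J(1)]$. The sketch below is only an illustration; the function name `iterate_T_mu` and the chosen starting values and iteration counts are hypothetical.

```python
def iterate_T_mu(a, j0, num_iters):
    """Apply J(1) <- max(0, a + J(1)) num_iters times, starting from J(1) = j0."""
    j = float(j0)
    for _ in range(num_iters):
        j = max(0.0, a + j)
    return j


# Case (c), a < 0: every starting value is driven to J_mu(1) = 0.
negative_cycle = iterate_T_mu(a=-1.0, j0=5.0, num_iters=50)

# Case (b), a = 0: a positive starting value is left unchanged, so the
# iterates do not converge to J_mu(1) = 0 and the policy is irregular.
zero_cycle = iterate_T_mu(a=0.0, j0=5.0, num_iters=50)

# Case (a), a > 0: the iterates grow without bound (here by 1 per step).
positive_cycle = iterate_T_mu(a=1.0, j0=0.0, num_iters=50)
```

With these inputs the first run settles at the value of $J_\mu(1)$, the second stays at its starting value, and the third equals the number of iterations performed, in line with cases (c), (b), and (a) respectively.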

We now show one of our main results.

Proposition 4.3: Let Assumption 2.1 hold. Then:
(a) The optimal cost function $J^*$ of RSP is the unique fixed point of $T$ within $R(X)$.
(b) A policy $\mu^*$ is optimal for RSP if and only if $T_{\mu^*} J^* = T J^*$. Moreover, there exists an optimal proper policy.
(c) We have $T^k J \to J^*$ for all $J \in R(X)$.
(d) For any $J \in R(X)$, if $J \le TJ$ we have $J \le J^*$, and if $J \ge TJ$ we have $J \ge J^*$.

Proof: We verify the parts (a)-(f) of Assumption 3.1 with $S = R(X)$. The result then will be proved using Prop. 3.1. To this end we argue as follows:

(1) Part (a) is satisfied since $S = R(X)$.
(2) Part (b) is satisfied since by Assumption 2.1(a), there exists at least one proper policy, which by Prop. 4.2 is $R(X)$-regular. Moreover, for each $R(X)$-regular policy $\mu$, we have $J_\mu \in R(X)$. Since the number of all policies is finite, it follows that $\hat J \in R(X)$.
(3) To show that part (c) is satisfied, note that since by Prop. 4.2 every $R(X)$-irregular policy $\mu$ must be improper, it follows from Assumption 2.1(b) that the subgraph of arcs $A_\mu$ contains a cycle of positive length. By Prop. 4.1(d), this implies that for each $J \in R(X)$, we have $\limsup_{k \to \infty} (T_\mu^k J)(x) = \infty$ for at least one $x \in X$.
(4) Part (d) is satisfied since $U(x)$ is a finite set.
(5) Part (e) is satisfied since $X$ is finite and $T_\mu$ is a continuous function mapping the finite-dimensional space $R(X)$ into itself.
(6) To show that part (f) is satisfied, we note that by applying Prop. 3.2 with $S = R(X)$, we have that $\hat J$ is the unique fixed point of $T$ within $R(X)$. It follows that for each $J \in R(X)$, there exists a sufficiently large scalar $r > 0$ such that the function $J'$ given by
$$J'(x) = \hat J(x) - r e(x), \qquad \forall\ x \in X, \qquad (4.2)$$
where $e$ is the unit function, $e(x) \equiv 1$, satisfies $J' \le J$ as well as
$$J' = \hat J - re = T \hat J - re \le T(\hat J - re) = T J', \qquad (4.3)$$
where the inequality holds in view of Eqs. (2.6) and (2.9), and the fact $r > 0$.

Thus all parts of Assumption 3.1 with $S = R(X)$ are satisfied, and Prop. 3.1 applies with $S = R(X)$. Since under Assumption 2.1, improper policies are $R(X)$-irregular [cf. Prop. 4.1(d)] and so cannot be optimal, the minimax formulation of Section 2 is equivalent to RSP, and the conclusions of Prop. 3.1 are precisely the results we want to prove. Q.E.D.
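Parts (b) and (c) of Prop. 4.3 suggest a simple computational check: iterate the mapping $T$ of Eq. (2.9) from any real-valued starting function, then read off a policy that attains the minimum in $T J^*$. The following sketch does this on a small instance. The function names, the dictionary layout (`U`, `Y`, `g`), and the two-node graph are hypothetical and were chosen so that Assumption 2.1 holds (the only improper policy has a self-cycle of positive length); this is an illustration, not the algorithm analyzed in the paper's Section 5.

```python
DEST = "t"  # destination node label (an assumption of this sketch)


def H(x, u, J, Y, g):
    """Worst-case one-stage cost, Eq. (2.6): max over the opponent's choices."""
    return max(g[(x, u, y)] + (0.0 if y == DEST else J[y]) for y in Y[(x, u)])


def T(J, U, Y, g):
    """The mapping of Eq. (2.9): minimize H over the controls at each node."""
    return {x: min(H(x, u, J, Y, g) for u in U[x]) for x in U}


def value_iteration(J, U, Y, g, max_iters=1000, tol=1e-12):
    """Iterate J <- TJ until successive iterates agree within tol."""
    for _ in range(max_iters):
        J_next = T(J, U, Y, g)
        if all(abs(J_next[x] - J[x]) <= tol for x in U):
            return J_next
        J = J_next
    return J


def greedy_policy(J, U, Y, g):
    """Pick at each node a control attaining the minimum in (TJ)(x)."""
    return {x: min(U[x], key=lambda u: H(x, u, J, Y, g)) for x in U}


# Hypothetical instance: "risky" lets the opponent loop at node 1 at cost 1.
U = {1: ["safe", "risky"], 2: ["go"]}
Y = {(1, "safe"): {2}, (1, "risky"): {1, DEST}, (2, "go"): {DEST}}
g = {(1, "safe", 2): 1.0, (1, "risky", 1): 1.0,
     (1, "risky", DEST): 0.0, (2, "go", DEST): 2.0}

J_star = value_iteration({1: 0.0, 2: 0.0}, U, Y, g)
mu_star = greedy_policy(J_star, U, Y, g)
```

Starting instead from a different real-valued function, for example `{1: -10.0, 2: 7.0}`, leads to the same limit on this instance, consistent with part (c). The selected policy avoids the control whose subgraph contains the positive length cycle.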

[Figure 4.: two panels, "Improper policy" and "Proper policy", each showing node 1 and the destination t, with arc length $a$ on the cycle at node 1.]

Figure 4.. A counterexample involving a single node 1 in addition to the destination t. There are two policies, $\mu$ and $\bar\mu$, with corresponding subgraphs of arcs $A_\mu$ and $A_{\bar\mu}$, and arc lengths shown in the figure. The improper policy $\mu$ is optimal when $a \le 0$. It is $R(X)$-irregular if $a = 0$, and it is $R(X)$-regular if $a < 0$.

The following variant of the two-node Example 4.1 illustrates what may happen in the absence of Assumption .(b), when there may exist improper policies that involve a nonpositive length cycle.

Example 4.2: Let $X = \{1\}$, and consider the improper policy $\mu$ with $Y(1,\mu(1)) = \{1, t\}$ and the proper policy $\bar\mu$ with $Y(1,\bar\mu(1)) = \{t\}$ (cf. Fig. 4.). Let $g(1,\mu(1),1) = a$, $g(1,\mu(1),t) = 0$, $g(1,\bar\mu(1),t) = 1$. Then it can be seen that under both policies, the longest path from 1 to t consists of the arc $(1,t)$. Thus, $J_\mu(1) = 0$, $J_{\bar\mu}(1) = 1$, so the improper policy $\mu$ is optimal for the minimax problem (.5), and strictly dominates the proper policy $\bar\mu$ (which is optimal for the RSP version of the problem). To explain what is happening here, we consider two different cases:

(1) $a = 0$: In this case, the optimal policy $\mu$ is both improper and $R(X)$-irregular, but with $J_\mu(1) < \infty$. Thus the conditions of both Props. . and 4.3 do not hold because Assumptions .(c) and Assumption .(b) are violated.

(2) $a < 0$: In this case, $\mu$ is improper but $R(X)$-regular, so there are no $R(X)$-irregular policies. Then all the conditions of Assumption . are satisfied, and Prop. . applies. Consistent with this proposition, there exists an optimal $R(X)$-regular policy (i.e., optimal over both proper and improper policies), which however is improper and hence not an optimal solution for RSP.

We will next discuss modifications of Prop. 4.3, which address the difficulties illustrated in the two cases of the preceding example.

The Case of Improper Policies with Negative Length Cycles

We note that by Prop. 4., the set of $R(X)$-regular policies includes not just proper policies, but also some improper ones (those $\mu$ for which $A_\mu$ is destination-connected and all its cycles have negative length). As a result we can weaken Assumption . as long as it still implies Assumption ., so that we can use Prop. . to obtain corresponding versions of our main result of Prop. 4.3. Here are two such weaker versions of Assumption ..

Assumption 4.1: Every policy $\mu$ is either proper or else it is improper and its subgraph of arcs $A_\mu$ is destination-connected with all cycles having negative length.

From Prop. 4., it follows that the preceding assumption is equivalent to all policies being $R(X)$-regular. The next assumption is weaker in that it allows policies $\mu$ that are $R(X)$-irregular, as long as some cycle of $A_\mu$ has positive length.

Assumption 4.2:

(a) There exists at least one $R(X)$-regular policy.

(b) For every $R(X)$-irregular policy $\mu$, some cycle in the subgraph of arcs $A_\mu$ has positive length.

Now by essentially repeating the proof of Prop. 4.3, we see that Assumption 4.2 implies Assumption ., so that Prop. . applies. Then we obtain the following variant of Prop. 4.3.

Proposition 4.4: Let either Assumption 4.1 or (more generally) Assumption 4.2 hold. Then:

(a) The function $J^*$ of Eq. (.) is the unique fixed point of $T$ within $R(X)$.

(b) A policy $\mu^*$ satisfies $J_{\mu^*} = J^*$, where $J^*$ is the minimum of $J_\mu$ over all $\mu \in M$ [cf. Eq. (.5)], if and only if $T_{\mu^*}J^* = TJ^*$. Moreover, there exists an optimal $R(X)$-regular policy.

(c) We have $T^kJ \to J^*$ for all $J \in R(X)$.

(d) For any $J \in R(X)$, if $J \le TJ$ we have $J \le J^*$, and if $J \ge TJ$ we have $J \ge J^*$.

It is important to note that the optimal $R(X)$-regular policy $\mu^*$ of part (b) above may not be proper, and hence needs to be checked to ensure that it solves the RSP problem (cf. Example 4.2 with $a < 0$). Thus one would have to additionally prove that at least one of the optimal $R(X)$-regular policies is proper in order for the proposition to fully apply to RSP.

The Case of Improper Policies with Zero Length Cycles

In some problems, it may be easier to guarantee nonnegativity rather than positivity of the lengths of cycles corresponding to improper policies, which is required by Assumption .(b). This is true for example in the important case where all arc lengths are nonnegative, i.e., $g(x,u,y) \ge 0$ for all $x \in X$, $u \in U(x)$, and $y \in Y(x,u)$, as in case (1) of Example 4.2. Let us consider the following relaxation of Assumption ..

Assumption 4.3:

(a) There exists at least one proper policy.

(b) For every improper policy $\mu$, all cycles in the subgraph of arcs $A_\mu$ have nonnegative length.

Note that similar to the case of Assumption ., we may guarantee that part (a) of the preceding assumption is satisfied by introducing a high cost termination action at each node. Then the policy that terminates at each state is proper.

For an analysis under the preceding assumption, we will use a perturbation approach that was introduced in Section .. of [Ber]. The idea is to consider a scalar $\delta > 0$ and a $\delta$-perturbed problem, where each arc length $g(x,u,y)$ with $x \in X$ is replaced by $g(x,u,y) + \delta$. As a result, a nonnegative cycle length corresponding to an improper policy as per Assumption 4.3(b) becomes strictly positive, so Assumption . is satisfied for the $\delta$-perturbed problem, and Prop. 4.3 applies. We thus see that $J^*_\delta$, the optimal cost function of the $\delta$-perturbed problem, is the unique fixed point of the mapping $T_\delta$ given by

$$(T_\delta J)(x) = \min_{u \in U(x)} H_\delta(x,u,J), \qquad x \in X,$$

where $H_\delta(x,u,J)$ is given by

$$H_\delta(x,u,J) = H(x,u,J) + \delta.$$

Moreover there exists an optimal proper policy $\mu_\delta$ for the $\delta$-perturbed problem, which by Prop. 4.3(b), satisfies the optimality equation

$$T_{\mu_\delta,\delta} J^*_\delta = T_\delta J^*_\delta,$$

where $T_{\mu,\delta}$ is the mapping that corresponds to a policy $\mu$ in the $\delta$-perturbed problem:

$$(T_{\mu,\delta}J)(x) = H_\delta\big(x,\mu(x),J\big), \qquad x \in X.$$

We have the following proposition.

Proposition 4.5: Let Assumption 4.3 hold, and let $\hat J$ be the optimal cost function over the proper policies only,

$$\hat J(x) = \min_{\mu:\ \text{proper}} J_\mu(x), \qquad x \in X.$$

Then:

(a) $\hat J = \lim_{\delta \downarrow 0} J^*_\delta$.

(b) $\hat J$ is the unique fixed point of $T$ within the set $\{J \in R(X) \mid J \ge \hat J\}$.

(c) We have $T^kJ \to \hat J$ for every $J \in R(X)$ with $J \ge \hat J$.

(d) Let $\mu$ be a proper policy. Then $\mu$ is optimal within the class of proper policies (i.e., $J_\mu = \hat J$) if and only if $T_\mu \hat J = T\hat J$.

(e) There exists $\bar\delta > 0$ such that for all $\delta \in (0, \bar\delta]$, an optimal policy for the $\delta$-perturbed problem is an optimal proper policy for the original RSP.

Proof: (a) For all $\delta > 0$, consider an optimal proper policy $\mu_\delta$ of the $\delta$-perturbed problem, i.e., one with cost $J_{\mu_\delta,\delta} = J^*_\delta$. We have

$$\hat J \le J_{\mu_\delta} \le J_{\mu_\delta,\delta} = J^*_\delta \le J_{\mu',\delta} \le J_{\mu'} + N\delta, \qquad \forall\ \mu':\ \text{proper},$$

where $N$ is the number of nodes of $X$ (since an extra $\delta$ cost is incurred in the $\delta$-perturbed problem every time a path goes through a node $x \ne t$, and any path under a proper $\mu'$ contains at most $N$ nodes $x \ne t$). By taking the limit as $\delta \downarrow 0$ and then the minimum over all $\mu'$ that are proper, it follows that

$$\hat J \le \lim_{\delta \downarrow 0} J^*_\delta \le \hat J,$$

so $\lim_{\delta \downarrow 0} J^*_\delta = \hat J$.

(b) For all proper $\mu$, we have $J_\mu = T_\mu J_\mu \ge T_\mu \hat J \ge T\hat J$. Taking the minimum over proper $\mu$, we obtain $\hat J \ge T\hat J$. Conversely, for all $\delta > 0$ and $\mu \in M$, we have

$$J^*_\delta = TJ^*_\delta + \delta e \le T_\mu J^*_\delta + \delta e.$$

Taking the limit as $\delta \downarrow 0$, and using part (a), we obtain $\hat J \le T_\mu \hat J$ for all $\mu \in M$. Taking the minimum over $\mu \in M$, it follows that $\hat J \le T\hat J$. Thus $\hat J$ is a fixed point of $T$. The uniqueness of $\hat J$ will follow once we prove part (c).

(c) For all $J \in R(X)$ with $J \ge \hat J$ and proper policies $\mu$, we have by using the relation $\hat J = T\hat J$ just shown in part (b),

$$\hat J = \lim_{k\to\infty} T^k \hat J \le \lim_{k\to\infty} T^k J \le \lim_{k\to\infty} T_\mu^k J = J_\mu.$$

Taking the minimum over all proper $\mu$, we obtain

$$\hat J \le \lim_{k\to\infty} T^k J \le \hat J, \qquad \forall\ J \ge \hat J.$$

(d) If $\mu$ is a proper policy with $J_\mu = \hat J$, we have $\hat J = J_\mu = T_\mu J_\mu = T_\mu \hat J$, so, using also the relation $\hat J = T\hat J$ [cf. part (b)], we obtain $T_\mu \hat J = T\hat J$. Conversely, if $\mu$ satisfies $T_\mu \hat J = T\hat J$, then from part (b), we have $T_\mu \hat J = \hat J$ and hence $\lim_{k\to\infty} T_\mu^k \hat J = \hat J$. Since $\mu$ is proper, we have $J_\mu = \lim_{k\to\infty} T_\mu^k \hat J$, so $J_\mu = \hat J$.

(e) For every proper policy $\mu$ we have $\lim_{\delta \downarrow 0} J_{\mu,\delta} = J_\mu$. Hence if a proper $\mu$ is not optimal for RSP, it is also nonoptimal for the $\delta$-perturbed problem for all $\delta \in [0, \delta_\mu]$, where $\delta_\mu$ is some positive scalar. Let $\bar\delta$ be the minimum $\delta_\mu$ over the nonoptimal proper policies $\mu$. Then for $\delta \in (0, \bar\delta]$, an optimal policy for the $\delta$-perturbed problem cannot be nonoptimal for RSP. Q.E.D.
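The perturbation argument can be illustrated numerically on a single-node problem with a zero length cycle. The sketch below is only an illustration under stated assumptions: the arc lengths (0 on the cycle, 0 and 1 on the two arcs into t), the control labels, and the function names are choices made here, not taken from the paper. For this instance the optimal cost over proper policies is 1, while the minimum over all policies is 0.

```python
T_NODE = "t"  # label of the destination node (illustrative choice)

# Assumed single-node instance with a zero length cycle:
# control "improper" may lead back to node 1 (length 0) or to t (length 0);
# control "proper" leads to t with length 1.
PROBLEM = {1: {"improper": {1: 0.0, T_NODE: 0.0}, "proper": {T_NODE: 1.0}}}


def apply_T(problem, J, delta=0.0):
    """(T_delta J)(x) = min over u of max over y of [g(x,u,y) + delta + J(y)], with J(t) = 0."""
    return {
        x: min(
            max(g + delta + (0.0 if y == T_NODE else J[y]) for y, g in succ.items())
            for succ in controls.values()
        )
        for x, controls in problem.items()
    }


def iterate_to_fixed_point(problem, J0, delta=0.0, max_iters=100_000):
    """Repeat J <- T_delta J until J no longer changes."""
    J = dict(J0)
    for _ in range(max_iters):
        J_next = apply_T(problem, J, delta)
        if J_next == J:
            return J
        J = J_next
    raise RuntimeError("no fixed point reached within max_iters")
```

With `delta > 0` the iteration started at 0 climbs in steps of `delta` until it settles at `1 + delta`, so the perturbed optimal cost tends to 1 as `delta` shrinks, in line with part (a). With `delta = 0` the same start stays at 0, and only a start at or above 1 reaches 1, in line with part (c).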

Note that we may have $J^*(x) < \hat J(x)$ for some $x$, but in RSP only proper policies are admissible, so by letting $\delta \downarrow 0$ we approach the optimal solution of interest. This happens for instance in Example 4.2 when $a = 0$. For the same example $\hat J$ (not $J^*$) can be obtained as the limit of $T^kJ$, starting from $J \ge \hat J$ [cf. part (c)]. The following example describes an interesting problem, where Prop. 4.5 applies.

Example 4.3: (Minimax Search Problems) Consider searching a graph with node set $X \cup \{t\}$, looking for an optimal node $x \in X$ at which to stop. At each $x \in X$ we have two options: (1) stopping at a cost $s(x)$, which will stop the search by moving to $t$, or (2) continuing the search by choosing a control $u \in U(x)$, in which case we will move to a node $y$, chosen from within a given set of nodes $Y(x,u)$ by an antagonistic opponent, at a cost $g(x,u,y) \ge 0$. Then Assumption 4.3 holds, since there exists a proper policy (the one that stops at every $x$).

An interesting case is when the stopping costs $s(x)$ are all nonnegative, while searching is cost-free [i.e., $g(x,u,y) \equiv 0$], but may lead in the future to nodes where a higher stopping cost will become inevitable. Then a policy that never stops is optimal but improper, but if we introduce a small perturbation $\delta > 0$ to the costs $g(x,u,y)$, we will make the lengths of all cycles positive, and Prop. 4.5 may be used to find an optimal policy within the class of proper policies. Note that this is an example where we are really interested in solving the RSP problem (where only the proper policies are admissible), and not its minimax version (where all policies are admissible).

5. COMPUTATIONAL METHODS

We will now discuss computational methods that are patterned after the classical DP algorithms of value iteration and policy iteration (VI and PI for short, respectively). In particular, the methods of this section are motivated by specialized stochastic shortest path algorithms.

5.1. Value Iteration Algorithms

We have already shown as part of Prop. 4.3 that under Assumption ., the VI algorithm, which sequentially generates $T^kJ$ for $k \ge 0$, converges to the optimal cost function $J^*$ for any starting function $J \in R(X)$. We have also shown as part of Prop. 4.5 that under Assumption 4.3, the VI sequence $T^kJ$ for $k \ge 0$ converges to $\hat J$, the optimal cost function over the proper policies only, for any starting function $J \ge \hat J$. We can extend these convergence properties to asynchronous versions of VI based on the monotonicity and fixed point properties of the mapping $T$. This has been known since the paper [Ber8] (see also [Ber8], [BeT89]), and we refer to the discussions in Sections .6., .. of [Ber], which apply in their entirety when specialized to the RSP problem of this paper.

It turns out that for our problem, under Assumption . or Assumption 4.3, the VI algorithm also terminates finitely when initialized with $J(x) = \infty$ for all $x \in X$ [it can be seen that in view of the form (.9) of the mapping $T$, the VI algorithm is well-defined with this initialization]. In fact the number of iterations for termination is no more than $N$, where $N$ is the number of nodes in $X$, leading to polynomial complexity. This is consistent with a similar result for stochastic shortest path problems ([Ber], Section .4.), which relies on the assumption of acyclicity of the graph of possible transitions under an optimal policy. Because this assumption is restrictive, finite termination of the VI algorithm is an exceptional property in stochastic shortest path problems. However, in the minimax case of this paper, an optimal policy $\mu^*$ exists and is proper [cf. Prop. 4.3(b) or Prop. 4.5(e)], so the graph of possible transitions under $\mu^*$ is acyclic, and it turns out that finite termination of VI is guaranteed to occur. Note that in deterministic shortest path problems the initialization $J(x) = \infty$ for all $x \in X$ leads to polynomial complexity, and generally works better in practice than other initializations (such as $J < J^*$, for which the complexity is only pseudopolynomial, cf. [BeT89], Section 4., Prop. .).

To show the finite termination property just described, let $\mu^*$ be an optimal proper policy, consider the sets of nodes $X_0, X_1, \ldots$, defined by

$$X_0 = \{t\}, \qquad X_{k+1} = \Big\{ x \notin \cup_{m=0}^k X_m \ \Big|\ y \in \cup_{m=0}^k X_m \text{ for all } y \in Y\big(x, \mu^*(x)\big) \Big\}, \qquad k = 0, 1, \ldots, \qquad (5.1)$$

and let $X_{\bar k}$ be the last of these sets that is nonempty. Then in view of the acyclicity of the subgraph of arcs $A_{\mu^*}$, we have $\cup_{m=0}^{\bar k} X_m = X \cup \{t\}$. We will now show by induction that starting from $J(x) \equiv \infty$ for all $x \in X$, the iterates $T^kJ$ of VI satisfy

$$(T^kJ)(x) = J^*(x), \qquad \forall\ x \in \cup_{m=1}^k X_m,\ k = 1, \ldots, \bar k. \qquad (5.2)$$

Indeed, it can be seen that this is so for $k = 1$. Assume that $(T^kJ)(x) = J^*(x)$ if $x \in \cup_{m=1}^k X_m$. Then, since $TJ \le J$ and $T$ is monotone, $(T^kJ)(x)$ is monotonically nonincreasing, so that

$$J^*(x) \le (T^{k+1}J)(x), \qquad \forall\ x \in X. \qquad (5.3)$$

Moreover, by the induction hypothesis, the definition of the sets $X_k$, and the optimality of $\mu^*$, we have

$$(T^{k+1}J)(x) \le H\big(x, \mu^*(x), T^kJ\big) = H\big(x, \mu^*(x), J^*\big) = J^*(x), \qquad \forall\ x \in \cup_{m=1}^{k+1} X_m, \qquad (5.4)$$

where the first equality follows from the form (.6) of $H$ and the fact that for all $x \in \cup_{m=1}^{k+1} X_m$, we have $y \in \cup_{m=0}^k X_m$ for all $y \in Y\big(x, \mu^*(x)\big)$ by the definition (5.1) of $X_{k+1}$. The two relations (5.3) and (5.4) complete the induction proof.

Thus under Assumption ., the VI method when started with $J(x) = \infty$ for all $x \in X$, will find the optimal costs of all the nodes in the set $\cup_{m=1}^k X_m$ after $k$ iterations; cf. Eq. (5.2). The same is true under Assumption 4.3, except that the method will find the corresponding optimal costs over the proper policies. In particular, all optimal costs will be found after $\bar k \le N$ iterations, where $N$ is the number of nodes in $X$. This indicates that the behavior of the VI algorithm, when initialized with $J(x) = \infty$ for all $x \in X$, is similar to the one of the Bellman-Ford algorithm for deterministic shortest path problems.

Still each iteration of the VI algorithm requires as many as $N$ applications of the mapping $T$ at every node. Thus it is likely that the performance of the VI algorithm can be improved with a suitable choice of the initial function $J$, and with an asynchronous implementation that uses a favorable order of selecting nodes for iteration, one-node-at-a-time similar to the Gauss-Seidel method.
This is consistent with the deterministic shortest path case, where there are VI-type algorithms, within the class of label-correcting methods, which are faster than the Bellman-Ford algorithm and even faster than efficient implementations of the Dijkstra algorithm for some types of problems; see e.g., [Ber98]. For the RSP problem, it can be seen that the best node selection order is based on the sets $X_k$ defined by Eq. (5.1), i.e., iterate on the nodes in the set $X_1$, then on the nodes in $X_2$, and so on. In this case, only one iteration per node will be needed. While the sets $X_k$ are not known, an algorithm that tries to approximate the optimal order could be much more efficient than the standard all-nodes-at-once VI method that computes the sequence $T^kJ$, for $k \ge 0$ (for an example of an algorithm of this type for stochastic shortest path problems, see [PBT98]). The development of such more efficient VI algorithms is an interesting subject for further research, which, however, is beyond the scope of the present paper.
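The difference between the all-nodes-at-once and the one-node-at-a-time versions of VI, both started from $J(x) = \infty$, can be sketched on a small instance. Everything below is a hypothetical illustration: the graph `EXAMPLE`, its policy `OPTIMAL_POLICY` (which is optimal and proper for this particular instance), and the function names are assumptions made here, and `layers` is one possible way to compute sets of the kind defined in Eq. (5.1).

```python
import math

T_NODE = "t"  # label of the destination node (illustrative choice)

# Hypothetical instance: problem[x][u] maps each possible successor y to g(x, u, y).
EXAMPLE = {
    1: {"a": {T_NODE: 3}, "b": {2: 1, T_NODE: 0}},
    2: {"a": {T_NODE: 1}, "b": {1: 1}},
}
OPTIMAL_POLICY = {1: "b", 2: "a"}  # optimal proper policy of this instance


def H(problem, x, u, J):
    """max over y of g(x,u,y) + J(y), with J(t) = 0."""
    return max(g + (0.0 if y == T_NODE else J[y]) for y, g in problem[x][u].items())


def layers(problem, policy):
    """Sets X_1, X_2, ...: nodes all of whose successors under `policy` lie in earlier sets."""
    done, remaining, result = {T_NODE}, set(problem), []
    while remaining:
        layer = {x for x in remaining if set(problem[x][policy[x]]) <= done}
        if not layer:
            raise ValueError("policy is not proper: its subgraph of arcs has a cycle")
        result.append(layer)
        done |= layer
        remaining -= layer
    return result


def all_nodes_vi(problem):
    """All-nodes-at-once VI from J = infinity; returns (fixed point, number of sweeps used)."""
    J, sweeps = {x: math.inf for x in problem}, 0
    while True:
        J_next = {x: min(H(problem, x, u, J) for u in problem[x]) for x in problem}
        if J_next == J:
            return J, sweeps
        J, sweeps = J_next, sweeps + 1


def one_node_at_a_time(problem, order):
    """A single Gauss-Seidel style pass from J = infinity, updating nodes in the given order."""
    J = {x: math.inf for x in problem}
    for x in order:
        J[x] = min(H(problem, x, u, J) for u in problem[x])
    return J
```

Here the all-nodes-at-once method needs two sweeps (four node updates), while a single pass in the order given by `layers` (node 2, then node 1) already produces the same costs with two node updates. A pass in the opposite order does not, which is why the selection order matters.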

[Figure 5.1: graph of the example RSP problem with nodes 1, 2, 3, 4 and destination t, marking its optimal proper policy, together with a table of VI iterates with columns "Iteration #", "All-Nodes-at-Once Method", and "One-Node-at-a-Time Method".]

Figure 5.1: An example RSP problem and its optimal policy. At each $x \in X = \{1, 2, 3, 4\}$ there are two controls: one (shown by a solid line) where $Y(x,u)$ consists of a single element, and another (shown by a broken line) where $Y(x,u)$ has two elements. Arc lengths are shown next to the arcs. Both the all-nodes-at-once and the one-node-at-a-time versions of VI terminate in four iterations, but the latter version requires four times less computation per iteration.

Example 5.1: Let us illustrate the VI method for the problem of Fig. 5.1. The optimal policy is shown in this figure, and it is proper; this is consistent with the fact that Assumption 4. is satisfied. The table gives the iteration sequence of two VI methods, starting with $J_0 = (\infty, \infty, \infty, \infty)$. The first method is the all-nodes-at-once method $J_k = T^kJ_0$, which finds $J^*$ in four iterations. In this example, we have $X_0 = \{t\}$ and $X_2 = \{4\}$, while $X_1$, $X_3$, and $X_4$ each consist of a single node from $\{1, 2, 3\}$, and the assertion of Eq. (5.2) may be verified. The second method is the asynchronous VI method, which iterates one-node-at-a-time in the (most favorable) order $X_1, X_2, X_3, X_4$, so that node 4 is iterated second. The second method also finds $J^*$ in four iterations and with four times less computation.

We finally note that in the absence of Assumption . or Assumption 4., it is possible that the VI sequence $\{T^kJ\}$ will not converge to $J^*$ starting from any $J$ with $J \ne J^*$. This can be seen with a simple deterministic shortest path problem involving a zero length cycle, a simpler version of Example 4.2. Here there is a single node 1, aside from the destination t, and two choices at 1: stay at 1 at cost 0, or move to t at cost 1. Then we have $J^* = 0$, while $T$ is given by

$$TJ = \min\{J, 1\}.$$

It can be seen that the set of fixed points of $T$ is $(-\infty, 1]$, and contains $J^*$ in its interior. Starting with $J \ge 1$, the VI sequence converges to 1 in a single step, while starting at $J < 1$ it stays at $J$. This is consistent with Prop. 4.5(c), since in this example Assumption 4.3 holds, and we have $\hat J = 1$. In the case of Example 4.2 with $a = 0$, the situation is somewhat different but qualitatively similar. There it can be verified that $J^* = 0$, $\hat J = 1$, the set of fixed points is $[0, 1]$, $\{T^kJ\}$ will converge to 1 starting from $J \ge 1$, will converge to 0 starting from $J \le 0$, and will stay at $J$ starting from $J \in [0, 1]$.

5.2. Policy Iteration Algorithms

The development of PI algorithms for the RSP problem is straightforward given the connection with semicontractive models. Briefly, under Assumption ., based on the analysis of Section .. of [Ber], there are two types of PI algorithms. The first is a natural form of PI that generates proper policies exclusively. Let $\mu^0$ be an initial proper policy (there exists one by assumption). At the typical iteration $k$, we have a proper policy $\mu^k$, and first compute $J_{\mu^k}$ by solving a longest path problem over the corresponding acyclic subgraph of arcs $A_{\mu^k}$. We then compute a policy $\mu^{k+1}$ such that $T_{\mu^{k+1}} J_{\mu^k} = TJ_{\mu^k}$, by minimizing over $u \in U(x)$ the expression $H(x, u, J_{\mu^k})$ of Eq. (.6), for all $x \in X$. We have

$$J_{\mu^k} = T_{\mu^k} J_{\mu^k} \ge TJ_{\mu^k} = T_{\mu^{k+1}} J_{\mu^k} \ge \lim_{m\to\infty} T_{\mu^{k+1}}^m J_{\mu^k} = J_{\mu^{k+1}}, \qquad (5.5)$$

where the second inequality follows from the monotonicity of $T_{\mu^{k+1}}$, and the last equality is justified because $\mu^k$ is proper and hence $J_{\mu^k} \in R(X)$, so the next policy $\mu^{k+1}$ cannot be improper [in view of Assumption .(b) and Prop. 4.(d)]. In conclusion $\mu^{k+1}$ must be proper and has improved cost over $\mu^k$.

Thus the sequence of policies $\{\mu^k\}$ is well-defined and proper, and the corresponding sequence $\{J_{\mu^k}\}$ is nonincreasing. It then follows that $J_{\mu^k}$ converges to $J^*$ in a finite number of iterations. The reason is that from Eq. (5.5), we have that at the $k$th iteration, either strict improvement

$$J_{\mu^k}(x) > (TJ_{\mu^k})(x) \ge J_{\mu^{k+1}}(x)$$

is obtained for at least one node $x \in X$, or else $J_{\mu^k} = TJ_{\mu^k}$, which implies that $J_{\mu^k} = J^*$ [since $J^*$ is the unique fixed point of $T$ within $R(X)$, by Prop. 4.3(a)] and $\mu^k$ is an optimal proper policy.
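The first PI algorithm described above alternates a longest path evaluation over an acyclic subgraph with a minimization of $H$ at each node. The following is a minimal sketch of that loop, not the paper's implementation: the instance `EXAMPLE`, the control labels, and the function names are hypothetical, and keeping the current control on ties is a design choice made here so that the loop stops exactly when no strict improvement is possible.

```python
T_NODE = "t"  # label of the destination node (illustrative choice)

# Hypothetical instance: problem[x][u] maps each possible successor y to g(x, u, y).
EXAMPLE = {
    1: {"a": {T_NODE: 3}, "b": {2: 1, T_NODE: 0}},
    2: {"a": {T_NODE: 1}, "b": {1: 1}},
}


def H(problem, x, u, J):
    """max over y of g(x,u,y) + J(y), with J(t) = 0."""
    return max(g + (0.0 if y == T_NODE else J[y]) for y, g in problem[x][u].items())


def evaluate(problem, policy):
    """Cost of a proper policy: longest path lengths in its acyclic subgraph of arcs."""
    J = {}

    def cost(x, on_path):
        if x == T_NODE:
            return 0.0
        if x in J:
            return J[x]
        if x in on_path:
            raise ValueError("policy is improper: its subgraph of arcs has a cycle")
        J[x] = max(g + cost(y, on_path | {x}) for y, g in problem[x][policy[x]].items())
        return J[x]

    for x in problem:
        cost(x, frozenset())
    return J


def improve(problem, J, policy):
    """Pick at each node a control minimizing H(x, u, J), keeping the current one on ties."""
    new_policy = {}
    for x in problem:
        best_u, best = policy[x], H(problem, x, policy[x], J)
        for u in problem[x]:
            value = H(problem, x, u, J)
            if value < best:
                best_u, best = u, value
        new_policy[x] = best_u
    return new_policy


def policy_iteration(problem, policy):
    """Alternate evaluation and improvement until the policy repeats."""
    while True:
        J = evaluate(problem, policy)
        new_policy = improve(problem, J, policy)
        if new_policy == policy:
            return policy, J
        policy = new_policy
```

Starting from the proper policy that uses control `"a"` at both nodes, one improvement step switches node 1 to `"b"`, and the next evaluation confirms that no further strict improvement exists.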
Unfortunately, when there are improper policies, the preceding PI algorithm is somewhat limited, because an initial proper policy may not be known, and also because when asynchronous versions of the algorithm are implemented, it is difficult to guarantee that all the generated policies are proper. There is another algorithm, combining value and policy iterations, which has been developed in [BeY], [BeY], [YBa], [YBb] for a variety of DP models, including discounted, stochastic shortest path, and abstract, and is described in Sections .6. and .. of [Ber]. This algorithm updates a cost function $J$ and a policy $\mu$, but it also maintains an additional function $V$, which acts as a threshold to keep $J$ bounded and the algorithm convergent. The algorithm not only can tolerate the presence of improper policies, but can also be operated in asynchronous mode, where the value iterations, policy evaluation operations, and policy improvement iterations are performed one-node-at-a-time without any regularity. The algorithm is
