PPCP: The Proofs


Maxim Likhachev
Computer and Information Science
University of Pennsylvania
Philadelphia, PA 19104

Anthony Stentz
The Robotics Institute
Carnegie Mellon University
Pittsburgh, PA

1 Notations and Assumptions

In this section we introduce some notation and formalize mathematically the class of problems our algorithm is suitable for. We assume that the environment is fully deterministic and can be modeled as a graph. That is, if we were to know the true value of each variable that represents the missing information about the environment, then there would be no uncertainty in the outcome of any action. There are certain elements of the environment, however, whose status we are uncertain about and which affect the outcomes (and/or possible costs) of one or more actions. In the following we re-phrase this mathematically.

Let X be a full state-vector (a belief state). We assume it can be split into two sets of variables, S(X) and H(X): X = [S(X); H(X)]. S(X) is the finite set of variables whose values are always observed, and the number of their possible values is also finite. H(X) is the set of (hidden) variables that initially represented the missing information about the environment. The variables in H(X) are never moved into S(X). X_start is used to denote the start state; all the values of the variables in H(X_start) are unknown. The goal of the planner is to construct a policy that reaches any state X such that S(X) = S_goal, where S_goal is given, while minimizing the expected cost of execution.

We assume perfect sensing. For the sake of easier notation let us introduce an additional value u for each variable h_i in H. The setting h_i(X) = u at state X represents the fact that the value of h_i is unknown at X. If h_i(X) ≠ u, then the true value of h_i is known at X, since sensing is perfect. We restrict all the variables that make up X to take only a finite number of distinct values.

We assume at most one hidden variable per action. Let A(S(X)) denote the finite set of actions available at any state Y whose S(Y) = S(X). Each action a ∈ A(S(X)) taken at state X may have one or more outcomes.
If the execution of the action does not depend on any of the variables h_i whose values are not yet known, then there is only one outcome of a. Otherwise, there can be more than one outcome. We assume that each such action cannot be controlled by more than one hidden variable. (The value of one hidden variable can affect more than one action, though.) We use h_{S(X),a} to represent the hidden variable that controls the outcomes and costs of action a taken at state X. By h_{S(X),a} = null we denote the case when there was never any uncertainty about the outcome of action a taken at state X. The set of possible outcomes of action a taken in S(X) is denoted by succ(S(X), a), whereas c(S(X), a, S(Y)) such that S(Y) ∈ succ(S(X), a) denotes the cost of the action and the outcome S(Y). The costs are assumed to be bounded from below by a (small) positive constant.

Sometimes we will need to refer to the set of successors in the belief state-space. In these cases we will use the notation succ(X, a) to denote the set of belief states Y such that S(Y) ∈ succ(S(X), a) and H(Y) is the same as H(X) except for h_{S(X),a}, which also remains the same if it was known at X and is different otherwise. The function P_{X,a}(succ(X, a)), the probability distribution of outcomes of a executed at X, follows the probability distribution of h_{S(X),a}, P(h_{S(X),a}). Once action a was executed at state X, the actual value of h_{S(X),a} can be deduced, since we assumed that sensing is perfect and the environment is deterministic.

We assume independence of the hidden variables. For the sake of efficient planning we assume that the variables in H can be considered independent of each other and therefore P(H) = Π_{i=1}^{|H|} P(h_i).

We assume clear preferences on the values of the hidden variables are available. We require that for each variable h_i ∈ H we are given its preferred value, denoted by b (i.e., best). This value must satisfy the following property. Given any state X and any action a such that h_{S(X),a} is not known (that is, h_{S(X),a}(X) = u), there exists a successor state X' such that h_{S(X),a}(X') = b and X' = argmin_{Y ∈ succ(X,a)} c(S(X), a, S(Y)) + v*(Y), where v*(Y) is the expected cost of executing an optimal policy at state Y (Def. 1). We will use the notation succ(X, a)^b (i.e., the best successor) to denote the state X' whose h_{S(X),a}(X') = b if h_{S(X),a}(X) = u and whose h_{S(X),a}(X') = h_{S(X),a}(X) otherwise.
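The belief-state conventions above (the unknown value u, the preferred value b, and the best successor succ(X, a)^b) can be illustrated with a small sketch. This is not part of the paper; the names U, B, best_successor and succ_fn are assumptions made for illustration only.

```python
# A minimal sketch (not from the paper) of the belief-state conventions of
# Section 1. Only the u/b semantics mirror the text above; all names here
# are illustrative assumptions.

U = "u"  # a hidden variable whose value is still unknown
B = "b"  # the preferred ("best") value of a hidden variable

def best_successor(S, H, a, hidden_var, succ_fn):
    """Return succ(X, a)^b for X = [S; H]: the successor in which the hidden
    variable controlling action a takes its preferred value b if it was
    unknown, and keeps its already-sensed value otherwise."""
    H2 = dict(H)  # perfect sensing: after executing a, the variable is known
    if hidden_var is not None and H2.get(hidden_var, U) == U:
        H2[hidden_var] = B  # optimistically assume the preferred value
    # once the controlling variable is fixed, the outcome is deterministic
    S2 = succ_fn(S, a, H2.get(hidden_var))
    return S2, H2
```

Under these conventions an action with h_{S(X),a} = null simply passes hidden_var = None and stays deterministic.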

A Appendix: The Proofs

The pseudocode below assumes the following:

1. Every state X̃ in the search state-space is initially assumed to have v(X̃) = g(X̃) = ∞ and besta(X̃) = null.

1 procedure ComputePath(X_pivot)
2   X̃_searchgoal = GetStateSearchGraph([S(X_pivot); H(X_pivot)]);
3   g(X̃_searchgoal) = v(X̃_searchgoal) = ∞;
4   OPEN = ∅;
5   for every H whose every element h_i satisfies: [(h_i = u OR h_i = b) AND h_i(X_pivot) = u] OR [h_i = h_i(X_pivot) AND h_i(X_pivot) ≠ u]
6     X̃ = GetStateSearchGraph([S_goal; H]);
7     v(X̃) = ∞, g(X̃) = 0, besta(X̃) = null;
8     insert X̃ into OPEN with priority g(X̃) + h(X̃);
9   while (g(X̃_searchgoal) > min_{X̃' ∈ OPEN} g(X̃') + h(X̃'))
10    remove X̃ with the smallest g(X̃) + h(X̃) from OPEN;
11    v(X̃) = g(X̃);
12    for each action a and X' = [S(X'); H(X'); H^u(X_pivot)] s.t. X̃ = [S(succ(X', a)^b); H(succ(X', a)^b)]
13      X̃' = GetStateSearchGraph([S(X'); H(X')]);
14      Q_a = Σ_{Y ∈ succ(X',a)} P(X', a, Y) · max(c(S(X'), a, S(Y)) + w(Y), c(S(X'), a, S(X̃)) + v(X̃));
15      if g(X̃') > Q_a
16        g(X̃') = Q_a;
17        besta(X̃') = a;
18        insert/update X̃' in OPEN with priority equal to g(X̃') + h(X̃');

Figure 1: ComputePath function

The pseudocode below assumes the following:

1. Every state X initially has 0 ≤ w(X) ≤ w^b(X) and besta(X) = null.

1 procedure UpdateMDP(X_pivot)
2   X = X_pivot; X̃ = GetStateSearchGraph([S(X_pivot); H(X_pivot)]);
3   while (S(X) ≠ S_goal)
4     w(X) = g(X̃); w([S(X); H(X); H^u(X_pivot)]) = g(X̃); besta(X) = besta(X̃);
5     if (besta(X) = null) break;
6     X = succ(X, besta(X))^b; X̃ = GetStateSearchGraph([S(X); H(X)]);

7 procedure Main()
8   X_pivot = X_start;
9   while (X_pivot ≠ null)
10    ComputePath(X_pivot);
11    UpdateMDP(X_pivot);
12    find a state X on the current policy that has w(X) < E_{X' ∈ succ(X,besta(X))}(c(S(X), besta(X), S(X')) + w(X'));
13    if found, set X_pivot to X;
14    otherwise, set X_pivot to null;

Figure 2: Main function
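For intuition, the backward A*-like loop of ComputePath (lines 9-18 of Figure 1) can be condensed into the following runnable sketch. It is an assumption-laden simplification, not the paper's implementation: states are plain hashable ids, preds(X, g, v) stands in for the predecessor update of lines 12-14 (returning candidate (X', Q_a) pairs), and h is a consistent heuristic.

```python
# A condensed, illustrative sketch (assumptions throughout) of the
# priority-queue loop of ComputePath, Figure 1, lines 9-18.
import heapq

INF = float("inf")

def compute_path(goal_states, search_goal, preds, h):
    g = {s: 0.0 for s in goal_states}             # lines 5-7
    v = {}                                        # v = infinity by default
    open_heap = [(h(s), s) for s in goal_states]  # line 8
    heapq.heapify(open_heap)
    while open_heap and g.get(search_goal, INF) > open_heap[0][0]:  # line 9
        _, X = heapq.heappop(open_heap)           # line 10
        if v.get(X, INF) <= g.get(X, INF):
            continue                              # stale queue entry, skip
        v[X] = g[X]                               # line 11
        for Xp, Qa in preds(X, g, v):             # lines 12-14
            if g.get(Xp, INF) > Qa:               # line 15
                g[Xp] = Qa                        # lines 16-17
                heapq.heappush(open_heap, (Qa + h(Xp), Xp))  # line 18
    return g, v
```

With a consistent h each state is expanded at most once (as Theorem 2 below proves for the full algorithm), so stale duplicate heap entries are simply skipped.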

Let us first define several variables that we will use during the proofs. Let H^b be defined as H with each h_i equal to u replaced by b. X^b is then defined as [S(X); H(X); H^b(X)]. Let H^u(X) be H(X) but with each h_i = b replaced by u. For every state X we then define the state X^u as follows: X^u = [S(X); H(X); H^u(X)].

We now introduce optimistic Q-values. Every state-action pair X and a ∈ A(S(X)) has a Q_{f,w}(X, a) > 0 associated with it that is calculated from the action costs c(S(X), a, S(Y)) for all states Y ∈ succ(X, a), the non-negative f-value of the state X' = succ(X, a)^b and the non-negative values w(Y) for all states Y ∈ succ(X, a). Q_{f,w}(X, a) is defined as follows:

Q_{f,w}(X, a) = Σ_{Y ∈ succ(X,a)} P(X, a, Y) · max(c(S(X), a, S(Y)) + w(Y), c(S(X), a, S(X')) + f(X'))   (1)

We now define an optimistic path from X_n to X_0 whose S(X_0) = S_goal as follows: π = [{X_n, a_n, X_{n-1}}, ..., {X_1, a_1, X_0}], where every time a_i is stochastic, the outcome is X_{i-1} = succ(X_i, a_i)^b. We define the optimistic cost of an optimistic path π = [{X_n, a_n, X_{n-1}}, ..., {X_1, a_1, X_0}] under a non-negative value function w recursively as follows:

φ_π(X_i, X_0) = { 0 if i = 0; Q_{f(X_{i-1}) = φ_π(X_{i-1}, X_0), w}(X_i, a_i) if i > 0   (2)

We define a path given by besta pointers from X_n to X_0 as follows: π_best = [{X_n, a_n, X_{n-1}}, ..., {X_1, a_1, X_0}], where a_i = besta(X_i) and X_{i-1} = succ(X_i, a_i)^b. We define a greedy path π_{greedy,f,w}(X_n, X_0) = [{X_n, a_n, X_{n-1}}, ..., {X_1, a_1, X_0}] with respect to functions f and w that map each state X onto non-negative real values. It is defined as a path π from X_n to X_0 where for every i ≥ 1, a_i = argmin_{a ∈ A(S(X_i))} Q_{f,w}(X_i, a) and the outcome is X_{i-1} = succ(X_i, a_i)^b.

We also define w^b-values of states as the costs of reaching a goal state under the assumption that the values of the missing variables are all set to b:

w^b(X) = { 0 if S(X) = S_goal; min_{a ∈ A(S(X))} (c(S(X), a, succ(X, a)^b) + w^b(succ(X, a)^b)) otherwise   (3)

A.1 ComputePath Function

In this section we prove theorems that mainly concern the ComputePath function. We consider a single execution of the ComputePath function.
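As a concrete reading of equation 1, the sketch below computes Q_{f,w}(X, a) for one state-action pair. The representation is an illustrative assumption (outcomes given as (probability, cost, outcome) triples, w and f as dictionaries); only the formula itself comes from the text.

```python
# A hedged sketch of the optimistic Q-value in equation 1; every name here
# (q_value, outcomes, best) is an assumption made for illustration.

def q_value(outcomes, best, w, f):
    """Q_{f,w}(X, a) = sum_{Y in succ(X,a)} P(X,a,Y) *
    max(c(S(X),a,S(Y)) + w(Y), c(S(X),a,S(X')) + f(X')),
    where X' = succ(X, a)^b is the preferred (best) outcome."""
    # cost of reaching the preferred outcome X'
    c_best = next(c for (p, c, y) in outcomes if y == best)
    return sum(p * max(c + w[y], c_best + f[best])
               for (p, c, y) in outcomes)
```

Note how the max keeps Q_{f,w}(X, a) no smaller than the optimistic continuation c(S(X), a, S(X')) + f(X') through succ(X, a)^b, which is the property the lemmas below repeatedly exploit.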
We adopt the following convention: the search state-space at any particular execution of ComputePath is denoted by S̃, and any state in S̃ is denoted by a letter with a tilde above it. The states in the original MDP do not use the tilde sign. Thus, if X is a full state, then X̃ = [S(X), H(X)]. We also reserve the notation X^{X̃+H^u} to denote the full state [S(X̃); H(X̃); H^u(X_pivot)].

Similarly to the definition of π in a full state-space, an optimistic path from X̃_n to X̃_0 is defined as π̃ = [{X^{X̃_n+H^u}, a_n, succ(X^{X̃_n+H^u}, a_n)^b}, {X^{X̃_{n-1}+H^u}, a_{n-1}, succ(X^{X̃_{n-1}+H^u}, a_{n-1})^b}, ..., {X^{X̃_1+H^u}, a_1, succ(X^{X̃_1+H^u}, a_1)^b}], where for every i ≥ 1, X̃_{i-1} = [S(succ(X^{X̃_i+H^u}, a_i)^b); H(succ(X^{X̃_i+H^u}, a_i)^b)].

Similarly to the definition of π_best, a path π̃_best from X̃_n to X̃_0 in a full state-space is defined as a path π̃ from X̃_n to X̃_0 where for every i ≥ 1, a_i = besta(X̃_i). In addition, we define a greedy path π̃_{greedy,f,w} with respect to functions f and w that map each state X̃ onto non-negative real values. It is defined as a path π̃ from X̃_n to X̃_0 where for every i ≥ 1, a_i = argmin_{a ∈ A(S(X̃_i))} Q_{f,w}(X^{X̃_i+H^u}, a).

We define goal distances, g*-values, under a function w recursively as follows:

g*(X̃) = { 0 if S(X̃) = S_goal; min_{a ∈ A(S(X̃))} Q_{f(Y = succ(X^{X̃+H^u}, a)^b) = g*(Ỹ), w}(X^{X̃+H^u}, a) otherwise   (4)

Finally, we require that the heuristics are consistent in the following sense: h(X̃_searchgoal) = 0, and for every other state X̃, a ∈ A(S(X̃)) and Ỹ s.t. Y = succ(X^{X̃+H^u}, a)^b, h(Ỹ) ≤ h(X̃) + c(S(X̃), a, S(Ỹ)).

A.1.1 Low-level Correctness

Lemma 1 Given a non-negative function w, for any state X̃_n,

g*(X̃_n) = φ_{π̃_{greedy,g*,w}}(X̃_n, X̃_0) = min_{π̃ from X̃_n to X̃_0} φ_π̃(X̃_n, X̃_0),

where X̃_0 is the only state on π̃_{greedy,g*,w}(X̃_n, X̃_0) that has S(X̃_0) = S_goal. In addition, it holds that H(X̃_0) satisfies the equation on line 5.

Proof: Let us first prove that g*(X̃_n) = φ_{π̃_{greedy,g*,w}}(X̃_n, X̃_0). Let us write out the formula for φ_{π̃_{greedy,g*,w}}(X̃_n, X̃_0). If n = 0, then φ_{π̃_{greedy,g*,w}}(X̃_n, X̃_0) = g*(X̃_n) = 0 since S(X̃_n) = S(X̃_0) = S_goal. Suppose now n ≠ 0. Then

φ_{π̃_{greedy,g*,w}}(X̃_n, X̃_0) = min_{a ∈ A(S(X̃_n))} Q_{f(X̃_{n-1}) = φ_{π̃_{greedy,g*,w}}(X̃_{n-1}, X̃_0), w}(X^{X̃_n+H^u}, a)

According to the definition of an optimistic path π̃, X̃_{n-1} = succ(X^{X̃_n+H^u}, a_n)^b. It is thus the exact same formula as for the g*-values (equation 4).

Let us now prove that φ_{π̃_{greedy,g*,w}}(X̃_n, X̃_0) = min_{π̃ from X̃_n to X̃_0} φ_π̃(X̃_n, X̃_0).
Let us denote argmin_{π̃ from X̃_n to X̃_0} φ_π̃(X̃_n, X̃_0) by π̃*(X̃_n, X̃_0) and min_{π̃ from X̃_n to X̃_0} φ_π̃(X̃_n, X̃_0) by φ_{π̃*}(X̃_n, X̃_0). Since π̃*(X̃_n, X̃_0) is an optimal optimistic path, φ_{π̃_{greedy,g*,w}}(X̃_n, X̃_0) ≥ φ_{π̃*}(X̃_n, X̃_0). We therefore need to show that φ_{π̃_{greedy,g*,w}}(X̃_n, X̃_0) ≤ φ_{π̃*}(X̃_n, X̃_0) also.

The proof is a simple proof by contradiction. Let us assume that φ_{π̃_{greedy,g*,w}}(X̃_n, X̃_0) > φ_{π̃*}(X̃_n, X̃_0). This implies that φ_{π̃*}(X̃_n, X̃_0) is finite, and therefore the path π̃*(X̃_n, X̃_0) is finite (since φ_{π̃*}(X̃_i, X̃_0) > φ_{π̃*}(X̃_{i-1}, X̃_0) for all i > 0 and φ_{π̃*}(X̃_0, X̃_0) = 0). Consider a pair of states X̃_i and X̃_{i-1} on the path π̃*(X̃_n, X̃_0) such that φ_{π̃_{greedy,g*,w}}(X̃_i, X̃_0) > φ_{π̃*}(X̃_i, X̃_0) but φ_{π̃_{greedy,g*,w}}(X̃_{i-1}, X̃_0) ≤ φ_{π̃*}(X̃_{i-1}, X̃_0). Such a pair must exist since at least for X̃_0, φ_{π̃_{greedy,g*,w}}(X̃_0, X̃_0) = φ_{π̃*}(X̃_0, X̃_0) = 0. Then we get the following contradiction:

φ_{π̃_{greedy,g*,w}}(X̃_i, X̃_0) ≤ Q_{f(X̃_{i-1}) = φ_{π̃_{greedy,g*,w}}(X̃_{i-1}, X̃_0), w}(X_i, a_i) ≤ Q_{f(X̃_{i-1}) = φ_{π̃*}(X̃_{i-1}, X̃_0), w}(X_i, a_i) = φ_{π̃*}(X̃_i, X̃_0)

We now show that H(X̃_0) satisfies the equation on line 5. Consider any h_i. Until the path π̃ involves executing an action whose outcomes depend on h_i, any state X̃_j on the path will have h_i(X̃_j) = h_i(X_pivot). Suppose now at state X̃_j an action a is executed whose outcomes depend on h_i. Then, if h_i(X_pivot) ≠ u, the action is deterministic and h_i(X̃_{j-1}) = h_i(X_pivot), which is consistent with the equation on line 5; h_i(X̃_{j-1}) remains such until the end of the path. On the other hand, if h_i(X_pivot) = u, the action a may have multiple outcomes, but an optimistic path always chooses the preferred outcome: X̃_{j-1} = succ(X^{X̃_j+H^u}, a)^b. Therefore, h_i(X̃_{j-1}) = b and remains such until the end of the path. This is again consistent with the equation on line 5. Finally, if the path π̃ does not involve executing an action whose outcomes depend on h_i, then h_i(X̃_0) = h_i(X_pivot), which is also consistent with the equation on line 5.

Lemma 2 Given a non-negative function w and a path π̃_{greedy,g*,w} from X̃_n to any state X̃_0 with S(X̃_0) = S_goal, it holds that

g*(X̃_n) ≥ Σ_{j=i+1}^{n} c(S(X̃_j), a_j, S(X̃_{j-1})) + g*(X̃_i) for any n ≥ i ≥ 0.

Proof: The following is the proof that the theorem holds for i = n − 1.
g*(X̃_n) = min_{a ∈ A(S(X̃_n))} Q_{f(Y = succ(X^{X̃_n+H^u}, a)^b) = g*(Ỹ), w}(X^{X̃_n+H^u}, a)
 = Q_{f(Y = succ(X^{X̃_n+H^u}, a_n)^b) = g*(Ỹ), w}(X^{X̃_n+H^u}, a_n)
 = Σ_{Y ∈ succ(X^{X̃_n+H^u}, a_n)} P(X^{X̃_n+H^u}, a_n, Y) · max(c(S(X̃_n), a_n, S(Ỹ)) + w(Y), c(S(X̃_n), a_n, S(succ(X^{X̃_n+H^u}, a_n)^b)) + g*(succ(X^{X̃_n+H^u}, a_n)^b))
 ≥ c(S(X̃_n), a_n, S(succ(X^{X̃_n+H^u}, a_n)^b)) + g*(succ(X^{X̃_n+H^u}, a_n)^b)
 = c(S(X̃_n), a_n, S(X̃_{n-1})) + g*(X̃_{n-1})

The proof for n − 1 > i ≥ 0 holds by induction on i.

Lemma 3 At any point in time, for any state X̃ it holds that v(X̃) ≥ g(X̃).

Proof: The theorem clearly holds before line 9 is executed for the first time, since each state X̃ has v(X̃) = ∞. Afterwards, the g-values can only decrease (lines 15-16). For any state X̃, on the other hand, v(X̃) only changes on line 11, where it is set to g(X̃). Thus, it is always true that v(X̃) ≥ g(X̃).

Lemma 4 Assuming the function w is non-negative, at line 9 the following holds:

- g(X̃) = 0 and besta(X̃) = null for every state X̃ whose S(X̃) = S_goal and whose H(X̃) satisfies the equation on line 5;
- g(X̃) = Q_{f(Y) = v(Ỹ), w}(X^{X̃+H^u}, besta(X̃)) and besta(X̃) = argmin_{a ∈ A(S(X̃))} Q_{f(Y) = v(Ỹ), w}(X^{X̃+H^u}, a) for every other state X̃;
- if g(X̃) = ∞, then besta(X̃) = null.

Proof: The theorem holds the first time line 9 is executed. This is so because every state X̃ ∈ S̃ has v(X̃) = ∞. As a result, the right-hand side of equation 1 evaluated under the function f = ∞ is equal to ∞, independently of the action a. This is correct, since after the initialization every state X̃ with S(X̃) ≠ S_goal, or whose H(X̃) does not satisfy the equation on line 5, has g(X̃) = ∞ and besta(X̃) = null, and every state X̃ with S(X̃) = S_goal and H(X̃) satisfying the equation on line 5 has g(X̃) = 0 and besta(X̃) = null.

The only places where g- and v-values are changed afterwards are lines 11 and 16. If v(X̃) is changed on line 11, then it is decreased according to Lemma 3. Thus, it may only decrease the g-values of its successors. The test on line 15 checks this and updates the g-values and besta pointers as necessary. Since all costs are positive and never change, the g-value of a state X̃ with S(X̃) = S_goal and H(X̃) satisfying the equation on line 5 can never be changed: it will never pass the test on line 15, and thus is always 0. Also, since g-values do not increase, it continues to hold that if g(X̃) = ∞, then besta(X̃) = null.

Lemma 5 At line 9, OPEN contains all and only states X̃ whose v(X̃) ≠ g(X̃).

Proof: The first time line 9 is executed the theorem holds, since after the initialization the only states in OPEN are the states X̃ with v(X̃) = ∞ ≠ 0 = g(X̃). The rest of the states have both values infinite.
During the following execution, whenever we decrease g(X̃) (line 16) and as a result make g(X̃) < v(X̃) (Lemma 3), we insert it into OPEN; whenever we remove X̃ from OPEN (line 10), we set v(X̃) = g(X̃) (line 11), making the state consistent. We never modify v(X̃) or g(X̃) elsewhere.

Lemma 6 Assuming the function w is non-negative, suppose X̃ is selected for expansion on line 10. Then the next time line 9 is executed, v(X̃) = g(X̃), where g(X̃) before and after the expansion of X̃ is the same.

Proof: Suppose X̃ is selected for expansion. Then on line 11, v(X̃) = g(X̃), and that is the only place where a v-value changes. We thus only need to show that g(X̃) does not change. It could only change if X̃' = X̃ and g(X̃') > Q_a at one of the executions of line 15. The former condition means that there exists an a such that X̃ = [S(succ(X^{X̃+H^u}, a)^b); H(succ(X^{X̃+H^u}, a)^b)]. The latter condition means that g(X̃) > Q_{f(Y) = v(Ỹ), w}(X^{X̃+H^u}, a). Since X̃ = [S(succ(X^{X̃+H^u}, a)^b); H(succ(X^{X̃+H^u}, a)^b)], f(succ(X^{X̃+H^u}, a)^b) = v(X̃) = g(X̃). Hence, g(X̃) > Q_{f(succ(X^{X̃+H^u}, a)^b) = g(X̃), w}(X^{X̃+H^u}, a). This means that g(X̃) > c(S(X̃), a, S(X̃)) + g(X̃), which is impossible since costs are positive.

Lemma 7 Assuming the function w is non-negative, at line 9, for any state X̃, the optimistic cost of the path defined by besta pointers, π̃_best, from X̃ to a state X̃_0 whose S(X̃_0) = S_goal is no larger than g(X̃), that is, φ_{π̃_best}(X̃, X̃_0) ≤ g(X̃). In addition, v(X̃) ≥ g(X̃) ≥ g*(X̃).

Proof: v(X̃) ≥ g(X̃) holds according to Lemma 3. We thus need to show that φ_{π̃_best}(X̃, X̃_0) ≤ g(X̃) and g(X̃) ≥ g*(X̃). The statement follows if g(X̃) = ∞. We thus can restrict our proof to a finite g-value. Consider a path π̃_best from X̃_n = X̃ to a state X̃_0: π̃_best = [{X^{X̃_n+H^u}, a_n, succ(X^{X̃_n+H^u}, a_n)^b}, {X^{X̃_{n-1}+H^u}, a_{n-1}, succ(X^{X̃_{n-1}+H^u}, a_{n-1})^b}, ..., {X^{X̃_1+H^u}, a_1, succ(X^{X̃_1+H^u}, a_1)^b}], where a_i = besta(X̃_i) and X̃_{i-1} = [S(succ(X^{X̃_i+H^u}, a_i)^b); H(succ(X^{X̃_i+H^u}, a_i)^b)].

We now show that φ_{π̃_best}(X̃, X̃_0) ≤ g(X̃) by contradiction. Suppose it does not hold. Let us then pick a state X̃_k on the path that is closest to X̃_0 and for which φ_{π̃_best}(X̃_k, X̃_0) > g(X̃_k). S(X̃_k) ≠ S_goal, because otherwise φ_{π̃_best}(X̃_k, X̃_0) = 0 from the definition of the φ-values. Consequently, φ_{π̃_best}(X̃_k, X̃_0) = Q_{f(succ(X^{X̃_k+H^u}, a_k)^b) = φ_{π̃_best}(X̃_{k-1}, X̃_0), w}(X^{X̃_k+H^u}, a_k). According to Lemma 4, g(X̃_k) = Q_{f(Y) = v(Ỹ), w}(X^{X̃_k+H^u}, a_k), where a_k = besta(X̃_k).
From Lemma 3 it then also follows that g(X̃_k) ≥ Q_{f(Y) = g(Ỹ), w}(X^{X̃_k+H^u}, a_k). Hence, g(X̃_k) ≥ Q_{f(succ(X^{X̃_k+H^u}, a_k)^b) = g(X̃_{k-1}), w}(X^{X̃_k+H^u}, a_k). Finally, because of the way we picked the state X̃_k, φ_{π̃_best}(X̃_{k-1}, X̃_0) ≤ g(X̃_{k-1}). As a result,

g(X̃_k) ≥ Q_{f(succ(X^{X̃_k+H^u}, a_k)^b) = g(X̃_{k-1}), w}(X^{X̃_k+H^u}, a_k) ≥ Q_{f(succ(X^{X̃_k+H^u}, a_k)^b) = φ_{π̃_best}(X̃_{k-1}, X̃_0), w}(X^{X̃_k+H^u}, a_k) = φ_{π̃_best}(X̃_k, X̃_0)

This is a contradiction to the assumption that φ_{π̃_best}(X̃_k, X̃_0) > g(X̃_k).

Since φ_{π̃_best}(X̃, X̃_0) ≤ g(X̃), the proof that g(X̃) ≥ g*(X̃) follows directly from Lemma 1.

A.1.2 Main Theorems

Theorem 1 Assuming the function w is non-negative, at line 9, for any state X̃ with (h(X̃) < ∞ AND g(X̃) + h(X̃) ≤ g(Ũ) + h(Ũ) ∀ Ũ ∈ OPEN), it holds that g(X̃) = g*(X̃).

Proof: We prove by contradiction. Suppose there exists X̃ such that h(X̃) < ∞ and g(X̃) + h(X̃) ≤ g(Ũ) + h(Ũ) ∀ Ũ ∈ OPEN, but g(X̃) ≠ g*(X̃). According to Lemma 7 it then follows that g(X̃) > g*(X̃). This also implies that g*(X̃) < ∞. We also assume that S(X̃) ≠ S_goal or that H(X̃) does not satisfy the equation on line 5, since otherwise g(X̃) = 0 = g*(X̃) from Lemma 4.

Consider a path π̃_{greedy,g*,w} from X̃_n = X̃ to a state X̃_0 whose S(X̃_0) = S_goal. According to Lemma 1, the cost of this path is g*(X̃) and H(X̃_0) satisfies the equation on line 5. Such a path must exist, since g*(X̃) < ∞ and from equation 4 it is clear that g*(X̃_i) > g*(X̃_{i-1}) for each i ≥ 1 on the path.

Our assumption that g(X̃) > g*(X̃) means that there exists at least one X̃_i on the path π̃_{greedy,g*,w}, with i ≤ n − 1, whose v(X̃_i) > g*(X̃_i). Otherwise,

g(X̃) = g(X̃_n) =(Lemma 4) min_{a ∈ A(S(X̃_n))} Q_{f(Y) = v(Ỹ), w}(X^{X̃_n+H^u}, a)
 ≤ Q_{f(Y) = v(Ỹ), w}(X^{X̃_n+H^u}, a_n) =(def. of π̃) Q_{f(Y) = v(X̃_{n-1}), w}(X^{X̃_n+H^u}, a_n)
 ≤ Q_{f(Y) = g*(X̃_{n-1}), w}(X^{X̃_n+H^u}, a_n) =(def. of g*) g*(X̃_n) = g*(X̃)

Let us now consider the X̃_i on the path with the smallest index i ≥ 0 (that is, closest to X̃_0) such that v(X̃_i) > g*(X̃_i). We will first show that g*(X̃_i) ≥ g(X̃_i). It is clearly so when i = 0, according to Lemma 4, which says that g(X̃_i) = 0 whenever S(X̃_i) = S_goal and H(X̃_i) satisfies the equation on line 5. For i > 0 we use the fact that v(X̃_{i-1}) ≤ g*(X̃_{i-1}) from the way X̃_i was chosen:

g(X̃_i) =(Lemma 4) min_{a ∈ A(S(X̃_i))} Q_{f(Y) = v(Ỹ), w}(X_i, a)
 ≤ Q_{f(Y) = v(Ỹ), w}(X_i, a_i) =(def. of π̃) Q_{f(Y) = v(X̃_{i-1}), w}(X_i, a_i)
 ≤ Q_{f(Y) = g*(X̃_{i-1}), w}(X_i, a_i) =(def. of g*) g*(X̃_i)

We thus have v(X̃_i) > g*(X̃_i) ≥ g(X̃_i), which implies that X̃_i ∈ OPEN according to Lemma 5. We will now show that g(X̃) + h(X̃) > g(X̃_i) + h(X̃_i), and finally arrive at a contradiction. According to our assumption, g(X̃) > g*(X̃) and h(X̃) < ∞; therefore

g(X̃) + h(X̃) = g(X̃_n) + h(X̃_n) > g*(X̃_n) + h(X̃_n)
 ≥(Lemma 2) Σ_{j=i+1}^{n} c(S(X̃_j), a_j, S(X̃_{j-1})) + g*(X̃_i) + h(X̃_n)
 ≥(property of h) Σ_{j=i+1}^{n-1} c(S(X̃_j), a_j, S(X̃_{j-1})) + g*(X̃_i) + h(X̃_{n-1})
 ≥ ... ≥ g*(X̃_i) + h(X̃_i) ≥ g(X̃_i) + h(X̃_i)

This inequality, however, implies that X̃_i ∉ OPEN, since according to the conditions of the theorem g(X̃) + h(X̃) ≤ g(Ũ) + h(Ũ) ∀ Ũ ∈ OPEN. But this contradicts what we have proven earlier.

A.1.3 Correctness

The corollaries in this section show how the theorems in the previous section lead quite trivially to the correctness of ComputePath. We also show that each state is expanded at most once, similar to the guarantee that A* makes for deterministic graphs whenever heuristics are consistent.

Corollary 1 When the ComputePath function exits, the following holds for any state X̃ with h(X̃) < ∞ and g(X̃) + h(X̃) ≤ min_{X̃' ∈ OPEN}(g(X̃') + h(X̃')): the optimistic cost of the path defined by besta pointers, π̃_best, from X̃ to a state X̃_0 whose S(X̃_0) = S_goal is equal to g*(X̃), that is, φ_{π̃_best}(X̃, X̃_0) = g*(X̃).

Proof: According to Theorem 1, the condition h(X̃) < ∞ and g(X̃) + h(X̃) ≤ min_{X̃' ∈ OPEN}(g(X̃') + h(X̃')) implies that g(X̃) = g*(X̃). From Lemma 7 it then follows that φ_{π̃_best}(X̃, X̃_0) ≤ g*(X̃). Since g*(X̃) is the optimistic cost of a least-cost optimistic path from X̃ to X̃_0 according to Lemma 1, φ_{π̃_best}(X̃, X̃_0) = g*(X̃).

Corollary 2 When the ComputePath function exits, the following holds: the optimistic cost of the path defined by besta pointers, π̃_best, from X̃_searchgoal to a state X̃_0 whose S(X̃_0) = S_goal is equal to g*(X̃_searchgoal), that is, φ_{π̃_best}(X̃_searchgoal, X̃_0) = g*(X̃_searchgoal). The length of this path is finite.

Proof: According to the termination condition of the ComputePath function, upon its exit g(X̃_searchgoal) ≤ min_{X̃' ∈ OPEN}(g(X̃') + h(X̃')). Since h(X̃_searchgoal) = 0, the proof that the cost of the path is equal to g*(X̃_searchgoal) then follows directly from Corollary 1.

To prove that the path defined by besta pointers is always finite, first consider the case of g(X̃_searchgoal) = ∞. According to Lemma 4, then, besta(X̃_searchgoal) = null and the path defined by besta pointers is therefore empty. Suppose now g(X̃_searchgoal) ≠ ∞. Since g(X̃_searchgoal) ≤ min_{X̃' ∈ OPEN}(g(X̃') + h(X̃')) and h(X̃_searchgoal) = 0, Theorem 1 applies and therefore ∞ > g(X̃_searchgoal) = g*(X̃_searchgoal). As a result, the optimistic cost of the path defined by besta pointers is also finite according to Lemma 7. Considering that the costs are bounded from below by a positive constant, this shows that the path is of finite length.

Corollary 3 When the ComputePath function exits, the following holds for each state X̃ on the path π̃_best(X̃_searchgoal, X̃_0): g(X̃) = g*(X̃).

Proof: At the time ComputePath terminates, g(X̃_searchgoal) ≤ min_{X̃' ∈ OPEN}(g(X̃') + h(X̃')) and h(X̃_searchgoal) = 0. Thus, according to Theorem 1, g(X̃_searchgoal) = g*(X̃_searchgoal). We now prove that the theorem holds for the rest of the states on the path defined by besta pointers. The case when g(X̃_searchgoal) = ∞ is trivially proven by noting that in this case besta(X̃_searchgoal) = null according to Lemma 4. We therefore consider the case when g(X̃_searchgoal) ≠ ∞. We prove the theorem for this case by induction. Suppose g(X̃_i) = g*(X̃_i), g(X̃_i) + h(X̃_i) ≤ min_{X̃' ∈ OPEN}(g(X̃') + h(X̃')) and h(X̃_i) < ∞. This is true at least for the first state on the path, namely X̃_searchgoal. We will show that g(X̃_{i-1}) = g*(X̃_{i-1}), g(X̃_{i-1}) + h(X̃_{i-1}) ≤ min_{X̃' ∈ OPEN}(g(X̃') + h(X̃')) and h(X̃_{i-1}) < ∞. This induction step will prove the statement of the theorem.

The property h(X̃_{i-1}) < ∞ follows from the consistency of the heuristics and the fact that h(X̃_i) < ∞. By consistency, h(X̃_{i-1}) ≤ h(X̃_i) + c(S(X̃_i), besta(X̃_i), S(X̃_{i-1})).
h(X̃_i) is finite according to our induction assumption, whereas the costs are finite because ∞ > g(X̃_searchgoal) = g*(X̃_searchgoal). Thus, h(X̃_{i-1}) < ∞.

To prove that g(X̃_{i-1}) + h(X̃_{i-1}) ≤ min_{X̃' ∈ OPEN}(g(X̃') + h(X̃')), we will show that g(X̃_{i-1}) + h(X̃_{i-1}) ≤ g(X̃_i) + h(X̃_i) as follows:

g(X̃_{i-1}) + h(X̃_{i-1})
 ≤(consistency of heuristics) g(X̃_{i-1}) + h(X̃_i) + c(S(X̃_i), besta(X̃_i), S(X̃_{i-1}))
 ≤(Lemma 3) v(X̃_{i-1}) + c(S(X̃_i), besta(X̃_i), S(X̃_{i-1})) + h(X̃_i)
 ≤ Σ_{Y ∈ succ(X^{X̃_i+H^u}, besta(X̃_i))} P(X^{X̃_i+H^u}, besta(X̃_i), Y) · max(c(S(X̃_i), besta(X̃_i), S(Ỹ)) + w(Y), c(S(X̃_i), besta(X̃_i), S(X̃_{i-1})) + v(X̃_{i-1})) + h(X̃_i)
 =(eq. 1) Q_{f(Y) = v(Ỹ), w}(X^{X̃_i+H^u}, besta(X̃_i)) + h(X̃_i)
 =(Lemma 4) g(X̃_i) + h(X̃_i)
 ≤(inductive assumption) min_{X̃' ∈ OPEN}(g(X̃') + h(X̃'))

Finally, the fact that g(X̃_{i-1}) = g*(X̃_{i-1}) now comes directly from Theorem 1.

Theorem 2 No state is expanded more than once during the execution of the ComputePath function.

Proof: Suppose a state X̃ is selected for expansion for the first time during the execution of the ComputePath function. Then, it is removed from the OPEN set on line 10. According to Theorem 1, its g-value at this point is equal to g*(X̃). On line 11 the state is made consistent by setting its v-value to its g-value. The only way X̃ can be chosen for expansion again is if it is inserted into OPEN, but this only happens if its g-value is decreased. This, however, is impossible, since g(X̃) is already equal to g*(X̃) = min_{π̃ from X̃ to X̃_0} φ_π̃(X̃, X̃_0), where X̃_0 has S(X̃_0) = S_goal (according to Lemma 1), and g(X̃) must always remain an upper bound on φ_{π̃_best}(X̃, X̃_0) (according to Lemma 7).

A.2 Main Function

In this section we present the theorems about the main function of the algorithm. All references to line numbers are to figure 2 unless explicitly specified otherwise. By w*(X) we denote the minimum expected cost of a policy for reaching a goal state from state X. We also introduce w^u-values, defined recursively as follows:

w^u(X) = { 0 if S(X) = S_goal; min_{a ∈ A(S(X))} Q_{w^u,w^u}(X^u, a) otherwise   (5)

We also define goal distances for full states, g*-values, under a function w recursively as follows:

g*(X) = { 0 if S(X) = S_goal; min_{a ∈ A(S(X))} Q_{f(Y = succ(X^u, a)^b) = g*(Y), w}(X^u, a) otherwise   (6)

Lemma 8 For each X, w^u(X) = w^u(X^u).

Proof: According to equation 5, if S(X) = S(X^u) = S_goal, then w^u(X) = w^u(X^u) = 0. Otherwise, w^u(X) = min_{a ∈ A(S(X))} Q_{w^u,w^u}(X^u, a) = min_{a ∈ A(S(X))} Q_{w^u,w^u}((X^u)^u, a) = w^u(X^u).

Lemma 9 For each X, g*(X) = g*(X^u).

Proof: According to the definition, X^u = [S(X); H(X); H^u(X)], and therefore S(X) = S(X^u). Suppose first S(X) = S_goal. Then, according to equation 6, g*(X) = 0 and g*(X^u) = 0.

Now suppose S(X) = S(X^u) ≠ S_goal. Then, according to equation 6, g*(X) = min_{a ∈ A(S(X))} Q_{f(Y = succ(X^u, a)^b) = g*(Y), w}(X^u, a) and g*(X^u) = min_{a ∈ A(S(X^u))} Q_{f(Y = succ((X^u)^u, a)^b) = g*(Y), w}((X^u)^u, a). (X^u)^u = X^u because H^u(X) does not contain any h_i elements equal to b, and therefore H^u(X^u) = H^u(X). Also, S(X) = S(X^u). Consequently, g*(X^u) = min_{a ∈ A(S(X))} Q_{f(Y = succ(X^u, a)^b) = g*(Y), w}(X^u, a) = g*(X).

Lemma 10 For each X and a ∈ A(S(X)), h_{S(X),a}(succ(X, a)^b) = h_{S(X),a}(succ(X^u, a)^b) and g*(succ(X^u, a)^b) = g*(succ(X, a)^b).

Proof: We consider all possible cases for h_{S(X),a}(X). Suppose first h_{S(X),a} = null. That is, action a is (and always was) deterministic. Then h_{S(X),a}(X^u) = null also, and therefore h_{S(X),a}(succ(X, a)^b) = h_{S(X),a}(succ(X^u, a)^b) = null. Also, succ(X^u, a)^b = (succ(X, a)^b)^u because the h-values are not affected by action a, and therefore g*(succ(X^u, a)^b) = g*(succ(X, a)^b) according to Lemma 9.

Suppose now h_{S(X),a}(X) ≠ b. Then again h_{S(X),a}(X^u) = h_{S(X),a}(X), and therefore h_{S(X),a}(succ(X, a)^b) = h_{S(X),a}(succ(X^u, a)^b). Also, succ(X^u, a)^b = (succ(X, a)^b)^u because the h-values are not affected by action a, and therefore g*(succ(X^u, a)^b) = g*(succ(X, a)^b) according to Lemma 9.

Now suppose h_{S(X),a}(X) = b. If h_{S(X),a} ∉ H, then h_{S(X),a}(X^u) = b, whereas if h_{S(X),a} ∈ H, then h_{S(X),a}(X^u) = u. In either case, however, h_{S(X),a}(succ(X, a)^b) = h_{S(X),a}(succ(X^u, a)^b) = b. Also, g*(succ(X, a)^b) = g*((succ(X, a)^b)^u) and g*(succ(X^u, a)^b) = g*((succ(X^u, a)^b)^u) according to Lemma 9. But (succ(X, a)^b)^u = (succ(X^u, a)^b)^u, and therefore g*(succ(X, a)^b) = g*(succ(X^u, a)^b) as stated in the theorem.

Theorem 3 Suppose that before line 10 is executed, for every state X it is true that 0 ≤ w(X) ≤ w(X^u). Then after line 11 is executed, for each state X on π_best from X_pivot to a goal state it holds that w(X) ≥ E_{X' ∈ succ(X,besta(X))}(c(S(X), besta(X), S(X')) + w(X')) if S(X) ≠ S_goal, and w(X) = 0 otherwise.
Proof: We first prove that after line 11 is executed, for each state X_i on π_best from X_pivot = X_n to a goal state X_0, it is true that X̃_i = [S(X_i); H(X_i)], where X̃_i is the i-th state on π̃_best from X̃_pivot = X̃_n to a goal state X̃_0. We prove this by induction. It certainly holds for i = n, since X̃_n = [S(X_pivot); H(X_pivot)] = [S(X_n); H(X_n)]. We now prove that it continues to hold for i − 1. On line 6 we pick X_{i-1} to be equal to succ(X_i, a_i)^b, where a_i = besta(X_i) = besta(X̃_i). We thus need to show that [S(succ(X_i, a_i)^b); H(succ(X_i, a_i)^b)] is the (i − 1)-th state on π̃_best. According to the definition of π̃_best, the (i − 1)-th state on it is defined as X̃_{i-1} = [S(succ(X^{X̃_i+H^u}, a_i)^b); H(succ(X^{X̃_i+H^u}, a_i)^b)]. We thus need to show that S(succ(X_i, a_i)^b) = S(succ(X^{X̃_i+H^u}, a_i)^b) and H(succ(X_i, a_i)^b) = H(succ(X^{X̃_i+H^u}, a_i)^b), where X_i = [S(X_i); H(X_i); H(X_i)] and X^{X̃_i+H^u} = [S(X_i); H(X_i); H^u(X_pivot)].

Since, according to the definition of π_best, X_i = succ(X_{i+1}, a_{i+1})^b = succ(succ(X_{i+2}, a_{i+2})^b, a_{i+1})^b and so on, h_{S(X_i),a_i}(X_i) can only be different from h_{S(X_i),a_i}(X^{X̃_i+H^u}) if h_{S(X_i),a_i}(X^{X̃_i+H^u}) = u and h_{S(X_i),a_i}(X_i) = b. Consequently, the same preferred outcome of action a_i exists for both X_i and X^{X̃_i+H^u}, namely the one that has h_{S(X_i),a_i} = b. In other words, [S(succ(X_i, a_i)^b); H(succ(X_i, a_i)^b)] = [S(succ(X^{X̃_i+H^u}, a_i)^b); H(succ(X^{X̃_i+H^u}, a_i)^b)]. That is, [S(X_{i-1}); H(X_{i-1})] = X̃_{i-1}.

We now prove the statement of the theorem itself. Consider an arbitrary X_i on π_best from X_pivot = X_n to a goal state X_0. Because of the statement we have just proven and the execution of line 4, w(X_i) = g(X̃_i), where X̃_i = [S(X_i); H(X_i)] is the i-th state on π̃_best from X̃_pivot to a goal state X̃_0. If i = 0, then w(X_i) = g(X̃_i) = 0 according to Lemma 4. Suppose now i > 0. According to Lemma 4, then, w(X_i) = g(X̃_i) = Q_{f(Y) = v(Ỹ), w_old}(X^{X̃_i+H^u}, a_i), where w_old is the w-function before the execution of the ComputePath function. In addition, the w-value of each X' ∈ succ(X_i, a_i) such that X' ≠ X_{i-1} remains the same as before the ComputePath function was called. This is so because UpdateMDP does not update the w-values of states with at least one h_j-value that is neither equal to h_j(X^{X̃+H^u}) nor equal to b. Moreover, from Lemma 3, v(X̃_{i-1}) ≥ g(X̃_{i-1}) = w(X_{i-1}). Hence w(X_i) ≥ Q_{f(Y) = w(X_{i-1}), w}(X^{X̃_i+H^u}, a_i). Thus,

w(X_i) ≥ Σ_{Y ∈ succ(X^{X̃_i+H^u}, a_i)} P(X^{X̃_i+H^u}, a_i, Y) · max(c(S(X^{X̃_i+H^u}), a_i, S(Y)) + w(Y), c(S(X^{X̃_i+H^u}), a_i, S(succ(X^{X̃_i+H^u}, a_i)^b)) + w(X_{i-1}))

We distinguish two cases. First, suppose h_{S(X_i),a_i}(X_i) is different from h_{S(X_i),a_i}(X^{X̃_i+H^u}). This is only possible if h_{S(X_i),a_i}(X^{X̃_i+H^u}) = u and h_{S(X_i),a_i}(X_i) = b. The latter implies that there is only one outcome in succ(X_i, a_i), namely, X_{i-1}.
Hence,

w(X_i) ≥ Σ_{Y ∈ succ(X^{X̃_i+H^u}, a_i)} P(X^{X̃_i+H^u}, a_i, Y) · (c(S(X^{X̃_i+H^u}), a_i, S(succ(X^{X̃_i+H^u}, a_i)^b)) + w(X_{i-1}))
 = c(S(X^{X̃_i+H^u}), a_i, S(succ(X^{X̃_i+H^u}, a_i)^b)) + w(X_{i-1})
 = c(S(X_i), a_i, S(X_{i-1})) + w(X_{i-1})
 = c(S(X_i), besta(X_i), S(X_{i-1})) + w(X_{i-1})
 = E_{X' ∈ succ(X_i, besta(X_i))}(c(S(X_i), besta(X_i), S(X')) + w(X'))

Now suppose h_{S(X_i),a_i}(X_i) = h_{S(X_i),a_i}(X^{X̃_i+H^u}). Then the probability distribution is the same for succ(X^{X̃_i+H^u}, a_i) and succ(X_i, a_i). Hence,

w(X_i) ≥ P(X^{X̃_i+H^u}, a_i, succ(X^{X̃_i+H^u}, a_i)^b) · (c(S(X^{X̃_i+H^u}), a_i, S(succ(X^{X̃_i+H^u}, a_i)^b)) + w(X_{i-1}))
 + Σ_{Y ∈ succ(X^{X̃_i+H^u}, a_i) s.t. Y ≠ succ(X^{X̃_i+H^u}, a_i)^b} P(X^{X̃_i+H^u}, a_i, Y) · (c(S(X^{X̃_i+H^u}), a_i, S(Y)) + w(Y))
 = P(X_i, a_i, X_{i-1}) · (c(S(X_i), a_i, S(X_{i-1})) + w(X_{i-1}))
 + Σ_{Y ∈ succ(X^{X̃_i+H^u}, a_i) s.t. Y ≠ succ(X^{X̃_i+H^u}, a_i)^b} (P(X^{X̃_i+H^u}, a_i, Y) · (c(S(X_i), a_i, S(Y)) + w(Y)))

Consider now Y ∈ succ(X^{X̃_i+H^u}, a_i) such that Y ≠ succ(X^{X̃_i+H^u}, a_i)^b. Consider also Z ∈ succ(X_i, a_i) such that h_{S(X_i),a_i}(Y) = h_{S(X_i),a_i}(Z) (that is, Y and Z are corresponding outcomes). Then Y = [S(Z); H(Z); H^u(Z)] = Z^u. Consequently, w(Y) = w_old(Y) ≥ w_old(Z) = w(Z) according to the assumptions of the theorem. As a result,

w(X_i) ≥ Σ_{Y ∈ succ(X_i, a_i)} (P(X_i, a_i, Y) · (c(S(X_i), a_i, S(Y)) + w(Y))) = E_{X' ∈ succ(X_i, besta(X_i))}(c(S(X_i), besta(X_i), S(X')) + w(X'))

Theorem 4 For each state X, it holds that w^b(X) ≤ g*(X).

Proof: The case of g*(X) = ∞ is trivial. We therefore assume that g*(X) is finite and prove by contradiction. Suppose there exists an X such that w^b(X) > g*(X). It could not have been a state whose S(X) = S_goal, since according to the definitions of w^b(X) and g*(X) they are both equal to 0. We therefore assume that S(X) ≠ S_goal. Then

w^b(X) = min_{a ∈ A(S(X))} (c(S(X), a, succ(X, a)^b) + w^b(succ(X, a)^b))

and

g*(X) = min_{a ∈ A(S(X))} Q_{f(Y = succ(X^u, a)^b) = g*(Y), w}(X^u, a)
 = min_{a ∈ A(S(X))} Σ_{Y ∈ succ(X^u, a)} P(X^u, a, Y) · max(c(S(X^u), a, S(Y)) + w(Y), c(S(X^u), a, S(succ(X^u, a)^b)) + g*(succ(X^u, a)^b))
 ≥ min_{a ∈ A(S(X))} c(S(X^u), a, S(succ(X^u, a)^b)) + g*(succ(X^u, a)^b)
 = min_{a ∈ A(S(X))} c(S(X), a, S(succ(X, a)^b)) + g*(succ(X, a)^b)

The last line is due to the fact that S(X) = S(X^u), S(succ(X^u, a)^b) = S(succ(X, a)^b) and g*(succ(X^u, a)^b) = g*(succ(X, a)^b) according to Lemma 10.

Let us consider a path π = [{X_n, a_n, X_{n-1}}, ..., {X_1, a_1, X_0}], where X_n = X, S(X_0) = S_goal, and for every tuple {X_i, a_i, X_{i-1}}, X_{i-1} = succ(X_i, a_i)^b and a_i = argmin_{a ∈ A(S(X_i))} c(S(X_i), a, succ(X_i, a)^b) + g*(succ(X_i, a)^b). Since g*(X) is finite and all costs are bounded from below by a positive constant, the path π is finite. Because w^b(X_n) > g*(X_n) and w^b(X_0) = g*(X_0) = 0, there must be a tuple {X_i, a_i, X_{i-1}} ∈ π such that w^b(X_i) > g*(X_i) whereas w^b(X_{i-1}) ≤ g*(X_{i-1}). But then we get the following contradiction:

g*(X_i) ≥ c(S(X_i), a_i, succ(X_i, a_i)^b) + g*(succ(X_i, a_i)^b)
 ≥ c(S(X_i), a_i, succ(X_i, a_i)^b) + w^b(succ(X_i, a_i)^b)
 ≥ min_{a ∈ A(S(X_i))} c(S(X_i), a, succ(X_i, a)^b) + w^b(succ(X_i, a)^b)
 = w^b(X_i)

Theorem 5 After each execution of the UpdateMDP function, for each state X it holds that w_old(X) ≤ w(X) ≤ g*_old(X) ≤ g*(X), where the w_old-values are the w-values before the execution of UpdateMDP, the g*_old(X) are the g*-values under the w_old-values, and the g*(X) are the g*-values under the w-values.

Proof: First, let us show that before the first execution of UpdateMDP, for every X it holds that w(X) ≤ g*(X). It holds because, according to the assumptions about state initialization before the main function is executed, w(X) ≤ w^b(X). On the other hand, according to Theorem 4, w^b(X) ≤ g*(X).

We now prove by induction. Suppose w_old(X) ≤ g*_old(X) before the call to UpdateMDP. We need to show that after the UpdateMDP function returns, for each state X we have w_old(X) ≤ w(X) ≤ g*(X). Let us first prove that w_old(X) ≤ w(X). We only need to consider the states updated by the UpdateMDP function, since the w-values of all other states remain unchanged. We first prove, by induction on the executions of line 4, that for each state X updated by UpdateMDP it holds that X^u = X^{X̃+H^u}. Consider the first time line 4 is executed. Then X = X_pivot. Therefore, X^u = [S(X); H(X); H^u(X_pivot)] = X^{X̃+H^u}. UpdateMDP also directly updates [S(X); H(X); H^u(X_pivot)] = X^{X̃+H^u}.
Now consider the $i$-th execution of line 4, whereas on all previous executions it held that $X^u = X^{X+H^u}$. At the $i$-th execution, state $X$ is a state which is equal to some $succ(Y, besta(Y))^b$, where $Y$ is a state that was updated during the $(i-1)$-th execution of line 4. Thus, $H^u(X) = H^u(Y)$ and therefore $X^u = [S(X); H(X); H^u(X_{pivot})] = X^{X+H^u}$. Once again, UpdateMDP also updates $[S(X); H(X); H^u(X_{pivot})] = X^{X+H^u}$. Thus, for each state updated by UpdateMDP, it holds that $X^u = [S(X); H(X); H^u(X_{pivot})]$. As a result, if $S(X) = S_{goal}$, then according to the definition of g-values, $g^*_{old}(\tilde{X}) = g^*_{old}(X) = 0$. On the other hand, if $S(X) \ne S_{goal}$, then

$g^*_{old}(\tilde{X}) = \min_{a \in A(S(\tilde{X}))} Q_{f(Y = succ(X^{X+H^u}, a)^b) = g^*(\tilde{Y}), w_{old}}(X^{X+H^u}, a) = \min_{a \in A(S(X))} Q_{f(Y = succ(X^u, a)^b) = g^*(\tilde{Y}), w_{old}}(X^u, a)$

Because $g^*(\tilde{Y}) = g^*(Y) = 0$ if $S(Y) = S_{goal}$, it then holds that

$g^*_{old}(\tilde{X}) = \min_{a \in A(S(X))} Q_{f(Y = succ(X^u, a)^b) = g^*(Y), w_{old}}(X^u, a) = g^*_{old}(X)$

Thus, for each state $X$ updated by UpdateMDP, it holds that $g^*_{old}(\tilde{X}) = g^*_{old}(X)$. Also, according to corollary 3, $g(\tilde{X}) = g^*_{old}(\tilde{X})$, and from the induction assumption $w_{old}(X) \le g^*_{old}(X)$. Thus, when UpdateMDP executes $w(X) = g(\tilde{X})$ on line 4, then

$w(X) = g(\tilde{X}) = g^*_{old}(\tilde{X}) = g^*_{old}(X) \ge w_{old}(X)$

Suppose now UpdateMDP executes $w([S(X); H(X); H^u(X_{pivot})]) = g(\tilde{X})$ on line 4. According to lemma 9, $g^*_{old}(X) = g^*_{old}(X^u) = g^*_{old}(X^{X+H^u})$, and since $w_{old}(X^{X+H^u}) \le g^*_{old}(X^{X+H^u})$, it follows that:

$w(X^{X+H^u}) = g(\tilde{X}) = g^*_{old}(\tilde{X}) = g^*_{old}(X) = g^*_{old}(X^{X+H^u}) \ge w_{old}(X^{X+H^u})$

We now prove that for every state $X$, $w(X) \le g^*_{old}(X) \le g^*(X)$. We first note that since, as we have just proved, none of the w-values decreased, it holds that for every state $X$, $g^*_{old}(X) \le g^*(X)$. Suppose $X$ was not updated by UpdateMDP, that is, $w(X) = w_{old}(X)$. Then $w(X) = w_{old}(X) \le g^*_{old}(X) \le g^*(X)$. Now suppose $w(X)$ was updated by UpdateMDP. Once again, suppose the update is $w(X) = g(\tilde{X})$ on line 4. Then

$w(X) = g(\tilde{X}) = g^*_{old}(\tilde{X}) = g^*_{old}(X) \le g^*(X)$

Now suppose the update is $w([S(X); H(X); H^u(X_{pivot})]) = g(\tilde{X})$ on line 4. Then

$w(X^{X+H^u}) = g(\tilde{X}) = g^*_{old}(\tilde{X}) = g^*_{old}(X) = g^*_{old}(X^{X+H^u}) \le g^*(X^{X+H^u})$

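The proofs above repeatedly evaluate Q-values of the form $Q_{v_1,v_2}(X,a)$: every outcome $Y$ of action $a$ contributes its probability times the larger of the cost to $Y$ plus $v_1(Y)$ and the cost to the preferred outcome $succ(X,a)^b$ plus $v_2(succ(X,a)^b)$. As a purely illustrative sketch of that notation (the outcome names and numbers below are our own, not taken from the paper):

```python
# Toy evaluation of Q_{v1,v2}(X, a): non-preferred outcomes are scored with
# v1, the preferred outcome succ(X, a)^b with v2.  Everything here is a
# hypothetical example that only illustrates the notation.
def q_value(outcomes, best, v1, v2):
    """outcomes: dict mapping outcome -> (probability, action cost);
    best: the preferred (clearly best) outcome succ(X, a)^b."""
    cost_b = outcomes[best][1]
    return sum(p * max(c + v1[y], cost_b + v2[best])
               for y, (p, c) in outcomes.items())

# One stochastic action with a preferred outcome "free" (hidden variable
# turned out favorable) and a non-preferred outcome "blocked".
outcomes = {"free": (0.7, 1.0), "blocked": (0.3, 1.0)}
v1 = {"free": 2.0, "blocked": 5.0}   # w-values of the outcome states
v2 = {"free": 2.0}                   # g-value of the preferred outcome
q = q_value(outcomes, "free", v1, v2)   # 0.7*max(3,3) + 0.3*max(6,3) = 3.9
```

Note that when the preferred-outcome term is no larger than each non-preferred term, as clear preferences guarantee, the max contributes nothing extra; this is the collapse used at the end of the Theorem 9 proof below.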
Theorem 6 For a non-negative function $w \le w^u$, the following holds: for each state $X$, $g^*(X)$ is bounded from above by $w^u(X)$.

Proof: This certainly holds for $X$ whose $S(X) = S_{goal}$. Now suppose $S(X) \ne S_{goal}$. We prove by contradiction and assume that there exist one or more states $X$ whose $g^*(X) > w^u(X)$, which implies that $w^u(X)$ is finite. Let us consider a path $\pi_{greedy, w^u, w^u}(X, X_0)$, where $X_n = X$ and $S(X_0) = S_{goal}$. According to its definition, for every pair of states $X_i, X_{i-1}$ on this path it holds that $X_{i-1} = succ(X_i, a_i)^b$, where $a_i = \arg\min_{a \in A(S(X_i))} Q_{w^u, w^u}(X_i^u, a)$. Since all costs are positive and $w^u(X_i)$ is finite, $w^u(X_i) > w^u(X_{i-1})$. Also, since the costs are bounded from below by a positive constant, $w^u(X_0) = 0$ and $w^u(X)$ is finite, it holds that the path $\pi_{greedy, w^u, w^u}(X, X_0)$ is finite. This means that there must exist such $X_i$ on the path that $g^*(X_i) > w^u(X_i)$, while $g^*(X_{i-1}) \le w^u(X_{i-1})$, where $X_{i-1} = succ(X_i, a_i)^b$. Then we arrive at the following contradiction:

$g^*(X_i) \le Q_{f(X_{i-1} = succ(X_i^u, a_i)^b) = g^*(X_{i-1}), w}(X_i^u, a_i)$
$\le Q_{f(X_{i-1} = succ(X_i^u, a_i)^b) = g^*(X_{i-1}), w^u}(X_i^u, a_i)$
$= \sum_{Y \in succ(X_i^u, a_i)} P(X_i^u, a_i, Y) \max(c(S(X_i), a_i, S(Y)) + w^u(Y), \; c(S(X_i), a_i, S(X_{i-1})) + g^*(X_{i-1}))$
$\le \sum_{Y \in succ(X_i^u, a_i)} P(X_i^u, a_i, Y) \max(c(S(X_i), a_i, S(Y)) + w^u(Y), \; c(S(X_i), a_i, S(X_{i-1})) + w^u(X_{i-1}))$
$= Q_{w^u, w^u}(X_i^u, a_i) = \min_{a \in A(S(X_i))} Q_{w^u, w^u}(X_i^u, a) = w^u(X_i)$

Theorem 7 For each state $X$, $w(X)$ is bounded from above by $w^u(X)$.

Proof: Before the first execution of UpdateMDP, according to the initialization assumptions, for every state $X$, $0 \le w(X) \le w^b(X)$. Also, according to theorem 4, for each state $X$, $w^b(X) \le g^*(X)$, and according to theorem 6, $g^*(X) \le w^u(X)$. Thus, $0 \le w(X) \le w^u(X)$.

We now prove the theorem by induction. Suppose before the $i$-th execution of UpdateMDP it holds that $0 \le w_{old}(X) \le w^u(X)$, where $w_{old}$-values are w-values right before the $i$-th execution of the UpdateMDP function. We need to show that after the $i$-th execution of the UpdateMDP function, the inequality $0 \le w(X) \le w^u(X)$ holds. According to theorem 6, for every state $X$, $g^*_{old}(X) \le w^u(X)$. At the same time, according to theorem 5, $w(X) \le g^*_{old}(X)$. Thus, $w(X) \le w^u(X)$.

The inequality $0 \le w(X)$ follows from the fact that $0 \le w_{old}(X)$ and theorem 5, according to which $w_{old}(X) \le w(X)$.

Theorem 8 PPCP terminates, and at that time $w(X_{start}) \le w^u(X_{start})$, and the expected cost of the policy of always taking action $besta(X)$ at any state $X$, starting at $X_{start}$ until a state $X_0$ whose $S(X_0) = S_{goal}$ is reached, is no more than $w(X_{start})$.

Proof: We will first show that the algorithm terminates. For this, let us first show that the set of all possible policies $\pi(X_{start})$ ever considered by PPCP is guaranteed to be finite. To prove this we need to show that any policy considered by PPCP is acyclic. Then the fact that the set of all such policies is finite will be due to the belief state-space itself being finite. Any policy PPCP has at any point of time is acyclic because after each stochastic action $a$ at any state $X$, the corresponding $h_{S(X),a}$ is set to a value not equal to $u$ in the outcome states and remains such in all of their descendants, whereas all the ancestors of $X$ and $X$ itself had $h_{S(X),a} = u$. The deterministic paths between any two stochastic actions on the policy, or between $X_{start}$ and the first stochastic action on the policy, on the other hand, are all segments of the paths returned by the ComputePath function, and these paths are finite according to corollary 2. Thus, the set of all possible policies considered by PPCP is finite.

The termination criterion for the algorithm is that all states on its current policy have w-values at least as large as the expectation over the action cost plus the w-values of the successors of the action defined by the besta pointer, except for the goal states, whose w-values are 0 because they are bounded by $w^u$-values according to theorem 7 and these are zeroes for goal states. In other words, for every $X$ on the current policy s.t. $S(X) \ne S_{goal}$ it holds that

$w(X) \ge E_{X' \in succ(X, besta(X))} \{ c(S(X), besta(X), S(X')) + w(X') \}$ (7)

At each iteration, PPCP fixes at least one state $X$ on the policy to satisfy this equation. While fixing the equation, PPCP may change the besta action for state $X$ and/or change the w-value of $X$.
There is a finite number of possible subtrees below $X$ that PPCP can consider, since the set of all possible policies considered by PPCP is finite. The change in the w-value of $X$ may potentially affect other states, but since the policy is acyclic it cannot affect the states that are descendants of $X$. The number of ancestors of $X$, on the other hand, is finite since the policy is acyclic and the belief state-space is finite. Therefore, PPCP is bound to arrive, in a finite number of iterations, at a policy for which all of the states that belong to it satisfy equation 7.

Each iteration is also guaranteed to be finite for the following reasons. First, the ComputePath function is guaranteed to return because each state is expanded no more than once per search according to theorem 2. Second, the UpdateMDP function is guaranteed to return because the path it processes is guaranteed to be of finite length according to corollary 2.

We now show that after PPCP terminates, $w(X_{start}) \le w^u(X_{start})$ and the expected cost of the policy of always taking action $besta(X)$ at any state $X$, starting at

$X_{start}$ until a state $Y$ whose $S(Y) = S_{goal}$ is reached, is no more than $w(X_{start})$. The first part comes directly from theorem 7. The second part can be proved as follows. Consider the following potential function that we maintain while executing the policy defined by besta actions starting with $X_{start}$: $F(t) = costsofar(t) + w(X_t)$, where $t$ is the current time-step. So, initially $F(t=0) = 0 + w(X_{start}) = w(X_{start})$. We execute the policy until we reach a state $Y$ such that $S(Y) = S_{goal}$. Suppose it happens at timestep $t = k$; that is, $Y = X_k$. Then $F(t=k) = costsofar(k) + 0 = costsofar(k)$. We need to show that the expected value of $F(t=k)$ is bounded above by $w(X_{start})$. Initially, $E\{F(t=0)\} = w(X_0)$, where $X_0 = X_{start}$. Now consider the expectation at the $i$-th step:

$E\{F(t=i)\} = E\{costsofar(i) + w(X_i)\} = E\{costsofar(i-1) + cost(i) + w(X_i)\} = E\{costsofar(i-1)\} + E\{cost(i) + w(X_i)\}$

Since all states on the policy (except for the goal states) satisfy equation 7, we have $w(X_{i-1}) \ge E\{cost(i) + w(X_i)\}$. After taking an additional expectation we have $E\{w(X_{i-1}) - cost(i)\} \ge E\{w(X_i)\}$. Hence,

$E\{F(t=i)\} = E\{costsofar(i-1)\} + E\{cost(i) + w(X_i)\}$
$\le E\{costsofar(i-1)\} + E\{cost(i) + (w(X_{i-1}) - cost(i))\}$
$= E\{costsofar(i-1) + w(X_{i-1})\} = E\{F(t=i-1)\}$

By induction, then, $E\{F(t=k)\} \le E\{F(t=0)\} = w(X_{start})$.

Theorem 9 Suppose there exists a minimum expected cost policy $\rho^*$ that satisfies the following condition: for every pair of states $X_1 \in \rho^*$ and $X_2 \in \rho^*$ such that $X_2$ can be reached with a non-zero probability from $X_1$ when following policy $\rho^*$, it holds that either $h_{S(X_1),\rho^*(X_1)} \ne h_{S(X_2),\rho^*(X_2)}$ or $h_{S(X_1),\rho^*(X_1)} = h_{S(X_2),\rho^*(X_2)} = null$. Then the policy defined by besta pointers at the time PPCP terminates is also a minimum expected cost policy.

Proof: Let us assume that there exists a minimum expected cost policy $\rho^*$ that satisfies the conditions of the theorem. That is, for every pair of states $X_1 \in \rho^*$ and $X_2 \in \rho^*$ such that $X_2$ can be reached with a non-zero probability from $X_1$ when following policy $\rho^*$, it holds that either $h_{S(X_1),\rho^*(X_1)} \ne h_{S(X_2),\rho^*(X_2)}$ or $h_{S(X_1),\rho^*(X_1)} = h_{S(X_2),\rho^*(X_2)} = null$.
Since $\rho^*$ is an optimal policy, its expected cost is $w^*(X_{start})$. We will show that $w^u(X_{start}) \le w^*(X_{start})$. This will prove the theorem, since the expected cost of the policy returned by PPCP is bounded from above by $w^u(X_{start})$. The expected cost of the policy will then be exactly equal to $w^*(X_{start})$, since $\rho^*$ is already an optimal policy.

Let us prove by contradiction and assume that $w^u(X_{start}) > w^*(X_{start})$. This also means that $w^*(X_{start})$ is finite, and therefore all branches on the policy $\rho^*$ end up at states $X$ whose $S(X) = S_{goal}$, since an optimal policy when sensing is perfect is acyclic. Let us now pick a state $X \in \rho^*$ such that $w^u(X) > w^*(X)$, but all the successor states $Y$ of action $\rho^*(X)$ executed at state $X$ have $w^u(Y) \le w^*(Y)$. Such a state $X$ must exist because at least for $X = X_{start}$ it holds that $w^u(X) > w^*(X)$, and all branches of the policy end up at states $Y$ whose $S(Y) = S_{goal}$, and for these states $w^u(Y) = w^*(Y) = 0$ according to the definition of $w^u$- and $w^*$-values. By definition,

$w^u(X) = \min_{a \in A(S(X))} Q_{w^u, w^u}(X^u, a) \le Q_{w^u, w^u}(X^u, \rho^*(X))$
$= \sum_{Z \in succ(X^u, \rho^*(X))} P(X^u, \rho^*(X), Z) \max(c(S(X), \rho^*(X), S(Z)) + w^u(Z), \; c(S(X), \rho^*(X), S(succ(X^u, \rho^*(X))^b)) + w^u(succ(X^u, \rho^*(X))^b))$

Let us now consider $h_{S(X),\rho^*(X)}(X)$. It must be the case that either $h_{S(X),\rho^*(X)}(X) = u$ or $h_{S(X),\rho^*(X)}(X) = null$ since, according to the assumptions of the theorem, no action whose outcome depends on $h_{S(X),\rho^*(X)}$ could have been executed before. Thus, $h_{S(X),\rho^*(X)}(X^u) = h_{S(X),\rho^*(X)}(X)$. This property has an important implication that we will use. For any pair of states $Y \in succ(X, \rho^*(X))$ and $Z \in succ(X^u, \rho^*(X))$ such that $h_{S(X),\rho^*(X)}(Z^u) = h_{S(X),\rho^*(X)}(Y^u)$ (in other words, $Y$ and $Z$ are corresponding outcomes of action $\rho^*(X)$ executed at $X$ and $X^u$ respectively), it holds that $P(X, \rho^*(X), Y) = P(X^u, \rho^*(X), Z)$ and $Y^u = Z^u$. Using this fact and lemma 8, we can derive the following:

$w^u(X) \le \sum_{Z \in succ(X^u, \rho^*(X))} P(X^u, \rho^*(X), Z) \max(c(S(X), \rho^*(X), S(Z)) + w^u(Z), \; c(S(X), \rho^*(X), S(succ(X^u, \rho^*(X))^b)) + w^u(succ(X^u, \rho^*(X))^b))$
$= \sum_{Z \in succ(X^u, \rho^*(X))} P(X^u, \rho^*(X), Z) \max(c(S(X), \rho^*(X), S(Z)) + w^u(Z^u), \; c(S(X), \rho^*(X), S(succ(X^u, \rho^*(X))^b)) + w^u((succ(X^u, \rho^*(X))^b)^u))$
$= \sum_{Y \in succ(X, \rho^*(X))} P(X, \rho^*(X), Y) \max(c(S(X), \rho^*(X), S(Y)) + w^u(Y^u), \; c(S(X), \rho^*(X), S(succ(X, \rho^*(X))^b)) + w^u((succ(X, \rho^*(X))^b)^u))$
$= \sum_{Y \in succ(X, \rho^*(X))} P(X, \rho^*(X), Y) \max(c(S(X), \rho^*(X), S(Y)) + w^u(Y),$

22 c(s(x), a, S(Y )) + w u (succ(x, ρ (X)) b )) Accordg to the way we pcked X, w u (Y ) w (Y ) for every Y succ(x, ρ (X)). Moreover, from the defto of clear prefereces t follows that c(s(x), a, S(Y )) + w u (succ(x, ρ (X)) b ) c(s(x), a, S(Y )) + w u (Y ) for all Y succ(x, ρ (X)). Hece, we obta the followg cotradcto w u (X) P (X, a, Y ) max(c(s(x), a, S(Y )) + w (Y ), Y succ(x,ρ (X)) c(s(x), a, S(Y )) + w (succ(x, ρ (X)) b )) = P (X, a, Y ) (c(s(x), a, S(Y )) + w (Y )) Y succ(x,ρ (X)) = w (X) 22

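The potential-function argument of Theorem 8 can be sanity-checked numerically. The sketch below is our own toy construction, not code from the paper: a single stochastic action whose two outcomes are both goal states, with made-up probabilities and costs. Once the non-goal state satisfies inequality (7), the simulated average execution cost cannot exceed $w(X_{start})$.

```python
import random

# Hypothetical one-step policy: besta(X_start) has two outcomes, both goals.
# Inequality (7) requires w(X) >= sum_Y P(Y) * (cost(Y) + w(Y)) at non-goals.
outcomes = [("Y1", 0.5, 1.0), ("Y2", 0.5, 2.0)]   # (state, prob, action cost)
w = {"X_start": 1.6, "Y1": 0.0, "Y2": 0.0}        # goal w-values are 0

expectation = sum(p * (c + w[y]) for y, p, c in outcomes)   # 1.5
assert w["X_start"] >= expectation   # inequality (7) holds at X_start

def execute_once():
    """Sample one execution of the policy and return its total cost."""
    r, acc = random.random(), 0.0
    for y, p, c in outcomes:
        acc += p
        if r < acc:
            return c
    return outcomes[-1][2]

random.seed(0)
runs = 100_000
avg = sum(execute_once() for _ in range(runs)) / runs
# Theorem 8: the expected execution cost (about 1.5 here) stays below
# w(X_start) = 1.6, mirroring E{F(t=k)} <= F(t=0) = w(X_start).
```

This mirrors the proof's supermartingale-style potential $F(t)$: each step trades accumulated cost against the drop in $w(X_t)$, so the expectation of the final cost never rises above the initial $w(X_{start})$.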

More information

Lecture Note to Rice Chapter 8

Lecture Note to Rice Chapter 8 ECON 430 HG revsed Nov 06 Lecture Note to Rce Chapter 8 Radom matrces Let Y, =,,, m, =,,, be radom varables (r.v. s). The matrx Y Y Y Y Y Y Y Y Y Y = m m m s called a radom matrx ( wth a ot m-dmesoal dstrbuto,

More information

Strong Convergence of Weighted Averaged Approximants of Asymptotically Nonexpansive Mappings in Banach Spaces without Uniform Convexity

Strong Convergence of Weighted Averaged Approximants of Asymptotically Nonexpansive Mappings in Banach Spaces without Uniform Convexity BULLETIN of the MALAYSIAN MATHEMATICAL SCIENCES SOCIETY Bull. Malays. Math. Sc. Soc. () 7 (004), 5 35 Strog Covergece of Weghted Averaged Appromats of Asymptotcally Noepasve Mappgs Baach Spaces wthout

More information

The Occupancy and Coupon Collector problems

The Occupancy and Coupon Collector problems Chapter 4 The Occupacy ad Coupo Collector problems By Sarel Har-Peled, Jauary 9, 08 4 Prelmares [ Defto 4 Varace ad Stadard Devato For a radom varable X, let V E [ X [ µ X deote the varace of X, where

More information

Class 13,14 June 17, 19, 2015

Class 13,14 June 17, 19, 2015 Class 3,4 Jue 7, 9, 05 Pla for Class3,4:. Samplg dstrbuto of sample mea. The Cetral Lmt Theorem (CLT). Cofdece terval for ukow mea.. Samplg Dstrbuto for Sample mea. Methods used are based o CLT ( Cetral

More information

1 Onto functions and bijections Applications to Counting

1 Onto functions and bijections Applications to Counting 1 Oto fuctos ad bectos Applcatos to Coutg Now we move o to a ew topc. Defto 1.1 (Surecto. A fucto f : A B s sad to be surectve or oto f for each b B there s some a A so that f(a B. What are examples of

More information

Department of Agricultural Economics. PhD Qualifier Examination. August 2011

Department of Agricultural Economics. PhD Qualifier Examination. August 2011 Departmet of Agrcultural Ecoomcs PhD Qualfer Examato August 0 Istructos: The exam cossts of sx questos You must aswer all questos If you eed a assumpto to complete a questo, state the assumpto clearly

More information

7.0 Equality Contraints: Lagrange Multipliers

7.0 Equality Contraints: Lagrange Multipliers Systes Optzato 7.0 Equalty Cotrats: Lagrage Multplers Cosder the zato of a o-lear fucto subject to equalty costrats: g f() R ( ) 0 ( ) (7.) where the g ( ) are possbly also olear fuctos, ad < otherwse

More information

means the first term, a2 means the term, etc. Infinite Sequences: follow the same pattern forever.

means the first term, a2 means the term, etc. Infinite Sequences: follow the same pattern forever. 9.4 Sequeces ad Seres Pre Calculus 9.4 SEQUENCES AND SERIES Learg Targets:. Wrte the terms of a explctly defed sequece.. Wrte the terms of a recursvely defed sequece. 3. Determe whether a sequece s arthmetc,

More information

Ordinary Least Squares Regression. Simple Regression. Algebra and Assumptions.

Ordinary Least Squares Regression. Simple Regression. Algebra and Assumptions. Ordary Least Squares egresso. Smple egresso. Algebra ad Assumptos. I ths part of the course we are gog to study a techque for aalysg the lear relatoshp betwee two varables Y ad X. We have pars of observatos

More information

. The set of these sums. be a partition of [ ab, ]. Consider the sum f( x) f( x 1)

. The set of these sums. be a partition of [ ab, ]. Consider the sum f( x) f( x 1) Chapter 7 Fuctos o Bouded Varato. Subject: Real Aalyss Level: M.Sc. Source: Syed Gul Shah (Charma, Departmet o Mathematcs, US Sargodha Collected & Composed by: Atq ur Rehma (atq@mathcty.org, http://www.mathcty.org

More information

MA 524 Homework 6 Solutions

MA 524 Homework 6 Solutions MA 524 Homework 6 Solutos. Sce S(, s the umber of ways to partto [] to k oempty blocks, ad c(, s the umber of ways to partto to k oempty blocks ad also the arrage each block to a cycle, we must have S(,

More information

TESTS BASED ON MAXIMUM LIKELIHOOD

TESTS BASED ON MAXIMUM LIKELIHOOD ESE 5 Toy E. Smth. The Basc Example. TESTS BASED ON MAXIMUM LIKELIHOOD To llustrate the propertes of maxmum lkelhood estmates ad tests, we cosder the smplest possble case of estmatg the mea of the ormal

More information

Assignment 5/MATH 247/Winter Due: Friday, February 19 in class (!) (answers will be posted right after class)

Assignment 5/MATH 247/Winter Due: Friday, February 19 in class (!) (answers will be posted right after class) Assgmet 5/MATH 7/Wter 00 Due: Frday, February 9 class (!) (aswers wll be posted rght after class) As usual, there are peces of text, before the questos [], [], themselves. Recall: For the quadratc form

More information

Introduction to Probability

Introduction to Probability Itroducto to Probablty Nader H Bshouty Departmet of Computer Scece Techo 32000 Israel e-mal: bshouty@cstechoacl 1 Combatorcs 11 Smple Rules I Combatorcs The rule of sum says that the umber of ways to choose

More information

CHAPTER 3 POSTERIOR DISTRIBUTIONS

CHAPTER 3 POSTERIOR DISTRIBUTIONS CHAPTER 3 POSTERIOR DISTRIBUTIONS If scece caot measure the degree of probablt volved, so much the worse for scece. The practcal ma wll stck to hs apprecatve methods utl t does, or wll accept the results

More information

Bayes (Naïve or not) Classifiers: Generative Approach

Bayes (Naïve or not) Classifiers: Generative Approach Logstc regresso Bayes (Naïve or ot) Classfers: Geeratve Approach What do we mea by Geeratve approach: Lear p(y), p(x y) ad the apply bayes rule to compute p(y x) for makg predctos Ths s essetally makg

More information

Algorithms Theory, Solution for Assignment 2

Algorithms Theory, Solution for Assignment 2 Juor-Prof. Dr. Robert Elsässer, Marco Muñz, Phllp Hedegger WS 2009/200 Algorthms Theory, Soluto for Assgmet 2 http://lak.formatk.u-freburg.de/lak_teachg/ws09_0/algo090.php Exercse 2. - Fast Fourer Trasform

More information

Q-analogue of a Linear Transformation Preserving Log-concavity

Q-analogue of a Linear Transformation Preserving Log-concavity Iteratoal Joural of Algebra, Vol. 1, 2007, o. 2, 87-94 Q-aalogue of a Lear Trasformato Preservg Log-cocavty Daozhog Luo Departmet of Mathematcs, Huaqao Uversty Quazhou, Fua 362021, P. R. Cha ldzblue@163.com

More information

arxiv:math/ v1 [math.gm] 8 Dec 2005

arxiv:math/ v1 [math.gm] 8 Dec 2005 arxv:math/05272v [math.gm] 8 Dec 2005 A GENERALIZATION OF AN INEQUALITY FROM IMO 2005 NIKOLAI NIKOLOV The preset paper was spred by the thrd problem from the IMO 2005. A specal award was gve to Yure Boreko

More information

ρ < 1 be five real numbers. The

ρ < 1 be five real numbers. The Lecture o BST 63: Statstcal Theory I Ku Zhag, /0/006 Revew for the prevous lecture Deftos: covarace, correlato Examples: How to calculate covarace ad correlato Theorems: propertes of correlato ad covarace

More information

MULTIDIMENSIONAL HETEROGENEOUS VARIABLE PREDICTION BASED ON EXPERTS STATEMENTS. Gennadiy Lbov, Maxim Gerasimov

MULTIDIMENSIONAL HETEROGENEOUS VARIABLE PREDICTION BASED ON EXPERTS STATEMENTS. Gennadiy Lbov, Maxim Gerasimov Iteratoal Boo Seres "Iformato Scece ad Computg" 97 MULTIIMNSIONAL HTROGNOUS VARIABL PRICTION BAS ON PRTS STATMNTS Geady Lbov Maxm Gerasmov Abstract: I the wors [ ] we proposed a approach of formg a cosesus

More information

Multiple Linear Regression Analysis

Multiple Linear Regression Analysis LINEA EGESSION ANALYSIS MODULE III Lecture - 4 Multple Lear egresso Aalyss Dr. Shalabh Departmet of Mathematcs ad Statstcs Ida Isttute of Techology Kapur Cofdece terval estmato The cofdece tervals multple

More information

1 Lyapunov Stability Theory

1 Lyapunov Stability Theory Lyapuov Stablty heory I ths secto we cosder proofs of stablty of equlbra of autoomous systems. hs s stadard theory for olear systems, ad oe of the most mportat tools the aalyss of olear systems. It may

More information

8.1 Hashing Algorithms

8.1 Hashing Algorithms CS787: Advaced Algorthms Scrbe: Mayak Maheshwar, Chrs Hrchs Lecturer: Shuch Chawla Topc: Hashg ad NP-Completeess Date: September 21 2007 Prevously we looked at applcatos of radomzed algorthms, ad bega

More information

NP!= P. By Liu Ran. Table of Contents. The P versus NP problem is a major unsolved problem in computer

NP!= P. By Liu Ran. Table of Contents. The P versus NP problem is a major unsolved problem in computer NP!= P By Lu Ra Table of Cotets. Itroduce 2. Prelmary theorem 3. Proof 4. Expla 5. Cocluso. Itroduce The P versus NP problem s a major usolved problem computer scece. Iformally, t asks whether a computer

More information

Qualifying Exam Statistical Theory Problem Solutions August 2005

Qualifying Exam Statistical Theory Problem Solutions August 2005 Qualfyg Exam Statstcal Theory Problem Solutos August 5. Let X, X,..., X be d uform U(,),

More information

Beam Warming Second-Order Upwind Method

Beam Warming Second-Order Upwind Method Beam Warmg Secod-Order Upwd Method Petr Valeta Jauary 6, 015 Ths documet s a part of the assessmet work for the subject 1DRP Dfferetal Equatos o Computer lectured o FNSPE CTU Prague. Abstract Ths documet

More information

X ε ) = 0, or equivalently, lim

X ε ) = 0, or equivalently, lim Revew for the prevous lecture Cocepts: order statstcs Theorems: Dstrbutos of order statstcs Examples: How to get the dstrbuto of order statstcs Chapter 5 Propertes of a Radom Sample Secto 55 Covergece

More information

STRONG CONSISTENCY OF LEAST SQUARES ESTIMATE IN MULTIPLE REGRESSION WHEN THE ERROR VARIANCE IS INFINITE

STRONG CONSISTENCY OF LEAST SQUARES ESTIMATE IN MULTIPLE REGRESSION WHEN THE ERROR VARIANCE IS INFINITE Statstca Sca 9(1999), 289-296 STRONG CONSISTENCY OF LEAST SQUARES ESTIMATE IN MULTIPLE REGRESSION WHEN THE ERROR VARIANCE IS INFINITE J Mgzhog ad Che Xru GuZhou Natoal College ad Graduate School, Chese

More information

Analysis of Lagrange Interpolation Formula

Analysis of Lagrange Interpolation Formula P IJISET - Iteratoal Joural of Iovatve Scece, Egeerg & Techology, Vol. Issue, December 4. www.jset.com ISS 348 7968 Aalyss of Lagrage Iterpolato Formula Vjay Dahya PDepartmet of MathematcsMaharaja Surajmal

More information

PROJECTION PROBLEM FOR REGULAR POLYGONS

PROJECTION PROBLEM FOR REGULAR POLYGONS Joural of Mathematcal Sceces: Advaces ad Applcatos Volume, Number, 008, Pages 95-50 PROJECTION PROBLEM FOR REGULAR POLYGONS College of Scece Bejg Forestry Uversty Bejg 0008 P. R. Cha e-mal: sl@bjfu.edu.c

More information

Johns Hopkins University Department of Biostatistics Math Review for Introductory Courses

Johns Hopkins University Department of Biostatistics Math Review for Introductory Courses Johs Hopks Uverst Departmet of Bostatstcs Math Revew for Itroductor Courses Ratoale Bostatstcs courses wll rel o some fudametal mathematcal relatoshps, fuctos ad otato. The purpose of ths Math Revew s

More information

NP!= P. By Liu Ran. Table of Contents. The P vs. NP problem is a major unsolved problem in computer

NP!= P. By Liu Ran. Table of Contents. The P vs. NP problem is a major unsolved problem in computer NP!= P By Lu Ra Table of Cotets. Itroduce 2. Strategy 3. Prelmary theorem 4. Proof 5. Expla 6. Cocluso. Itroduce The P vs. NP problem s a major usolved problem computer scece. Iformally, t asks whether

More information

ON THE LOGARITHMIC INTEGRAL

ON THE LOGARITHMIC INTEGRAL Hacettepe Joural of Mathematcs ad Statstcs Volume 39(3) (21), 393 41 ON THE LOGARITHMIC INTEGRAL Bra Fsher ad Bljaa Jolevska-Tueska Receved 29:9 :29 : Accepted 2 :3 :21 Abstract The logarthmc tegral l(x)

More information

Johns Hopkins University Department of Biostatistics Math Review for Introductory Courses

Johns Hopkins University Department of Biostatistics Math Review for Introductory Courses Johs Hopks Uverst Departmet of Bostatstcs Math Revew for Itroductor Courses Ratoale Bostatstcs courses wll rel o some fudametal mathematcal relatoshps, fuctos ad otato. The purpose of ths Math Revew s

More information

ANALYSIS ON THE NATURE OF THE BASIC EQUATIONS IN SYNERGETIC INTER-REPRESENTATION NETWORK

ANALYSIS ON THE NATURE OF THE BASIC EQUATIONS IN SYNERGETIC INTER-REPRESENTATION NETWORK Far East Joural of Appled Mathematcs Volume, Number, 2008, Pages Ths paper s avalable ole at http://www.pphm.com 2008 Pushpa Publshg House ANALYSIS ON THE NATURE OF THE ASI EQUATIONS IN SYNERGETI INTER-REPRESENTATION

More information