Point-Based POMDP Algorithms: Improved Analysis and Implementation


Trey Smith and Reid Simmons
Robotics Institute, Carnegie Mellon University
Pittsburgh, PA

Abstract

Existing complexity bounds for point-based POMDP value iteration algorithms focus either on the curse of dimensionality or the curse of history. We derive a new bound that relies on both and uses the concept of discounted reachability; our conclusions may help guide future algorithm design. We also discuss recent improvements to our (point-based) heuristic search value iteration algorithm. Our new implementation calculates tighter initial bounds, avoids solving linear programs, and makes more effective use of sparsity. Empirical results show speedups of more than two orders of magnitude.

1 INTRODUCTION

Partially observable Markov decision processes (POMDPs) constitute a powerful probabilistic model for planning problems that include hidden state and uncertainty in action effects. Recently, several POMDP solution algorithms have been developed that use approximate value iteration with point-based updates. These algorithms have proven to scale very effectively, relying on the fact that performing many fast approximate updates often results in a more useful value function than performing a few exact updates.

Point-based updates are applied over a set $B$ of beliefs drawn from the reachable part of the belief simplex. One can derive a bound on the approximation error that is proportional to the sample spacing of $B$ [Pineau et al., 2003]. The number of points required is driven by the curse of dimensionality: achieving a desired sample spacing requires a number of samples exponential in the dimensionality of the belief simplex.

However, in discounted problems, one can tolerate more approximation error at points that are only reachable after many time steps. This idea, which is not used in the sample spacing argument, is the basis of a second type of convergence result [Zhang and Zhang, 2001, Smith and Simmons, 2004], in which the error bound is derived from the fact that $B$ samples enough of the search tree to some depth.
The number of points required is driven by the curse of history: fully expanding the search tree to depth $t$ requires a number of points exponential in $t$.

This paper presents a new convergence argument that draws on both approaches. Our analysis applies to the case when the sample spacing varies according to what we call discounted reachability, which more accurately reflects the behavior of current algorithms (Fig. 1).

[Figure 1: Sampling strategies for $B$. (a) Uniform density. (b) Non-uniform, reflecting discounted reachability. Figure labels: reachable beliefs, $b_0$.]

The remainder of the paper discusses recent improvements in our heuristic search value iteration algorithm (HSVI). HSVI is a point-based algorithm that maintains both upper and lower bounds on the optimal value function, allowing it to use effective heuristics for action and observation selection, and to provide provably small regret from the policy it generates [Smith and Simmons, 2004]. The new implementation of HSVI calculates tighter initial bounds, avoids solving linear programs, and makes more effective use of sparsity. Empirical results show speedups of more than two orders of magnitude.

2 POMDP INTRODUCTION

A POMDP models a planning problem that includes hidden state and uncertainty in action effects; the agent is assumed to know the transition model. Formally, a POMDP is described by a finite set of states $S = \{s_1, \ldots, s_{|S|}\}$, a finite set of actions $A = \{a_1, \ldots, a_{|A|}\}$, a finite set of observations $Z = \{z_1, \ldots, z_{|Z|}\}$, transition probabilities $T^{a,z}(s_i, s_j) = \Pr(s_j \mid s_i, a, z)$, a real-valued reward function $R(s, a)$, a discount factor $\gamma < 1$, and an initial belief $b_0$.

The Markov property of the model ensures that the agent can use a probability distribution $b$ over current states as a sufficient statistic for the history of actions and observations. Geometrically, the space of beliefs is a simplex, denoted $\Delta$. At each stage of forward simulation the current belief can be updated based on the latest action $a$ and observation $z$ using the formula $b' = \tau(b, a, z)$, defined so that

$$b'(s') = \sum_{s} T^{a,z}(s, s')\, b(s) \tag{1}$$

Only a subset of $\Delta$ is reachable from $b_0$ through repeated applications of $\tau$; this subset is denoted $\bar{\Delta}$.

In general the object of the planning problem is to generate a policy $\pi$ that maximizes expected long-term reward:

$$J^{\pi}(b) = E\left[ \sum_{t=0}^{\infty} \gamma^{t} R(s_t, a_t) \,\middle|\, b, \pi \right] \tag{2}$$

A globally optimal policy $\pi^*$ is known to exist when $\gamma < 1$ [Howard, 1960]. We are particularly interested in the focused approximation setting, in which one attempts to generate a policy $\hat{\pi}$ that minimizes the regret $J^{\pi^*}(b_0) - J^{\hat{\pi}}(b_0)$ when executed starting from $b_0$.

A POMDP is often solved by approximating its optimal value function $V^* = J^{\pi^*}$. Any value function $V$ induces a policy $\pi_V$ in which actions are selected via one-step lookahead. The regret of the policy induced by an approximate value function $\hat{V}$ can be forced arbitrarily small by reducing the max-norm error $\lVert V^* - \hat{V} \rVert_{\infty}$.

Value iteration starts with an initial guess $V_0$ and approximates $V^*$ through repeated application of the Bellman update $V_t \leftarrow H V_{t-1}$, where $H$ is defined as

$$HV(b) = \max_{a} \left[ R(b, a) + \gamma \sum_{b'} \Pr(b' \mid b, a)\, V(b') \right] \tag{3}$$

$V^*$ satisfies Bellman's equation $V^* = HV^*$. When $\gamma < 1$, $H$ is a contraction and $V^*$ is the unique solution.

During value iteration, each $V_t$ is piecewise linear and convex [Sondik, 1971], so it can be represented as a set of vectors $\Gamma_t = \{\alpha_1, \ldots, \alpha_{|\Gamma_t|}\}$, such that $V_t(b) = \max_i (\alpha_i \cdot b)$. There are a number of value iteration algorithms that calculate $H$ exactly by projecting $\alpha$ vectors from $\Gamma_{t-1}$ to $\Gamma_t$ [Sondik, 1971, Cassandra et al., 1997].
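The belief update and the vector-set value function can be sketched in a few lines. This is a minimal illustration, not the authors' code: the array layout is hypothetical, and it assumes `T_az[s, s2]` stores the joint probability of landing in `s2` and observing `z` after taking `a` in `s`, so the result is normalized explicitly.

```python
import numpy as np


def belief_update(b, T_az):
    """tau(b, a, z) for one fixed (a, z) pair.

    Assumption: T_az[s, s2] = Pr(s2, z | s, a), so the unnormalized
    result sums to Pr(z | b, a) and must be rescaled.
    """
    unnormalized = b @ T_az
    prob_z = unnormalized.sum()
    if prob_z == 0.0:
        raise ValueError("observation has zero probability under this belief and action")
    return unnormalized / prob_z


def value(Gamma, b):
    """Piecewise linear convex value: V(b) = max_i alpha_i . b."""
    return max(float(alpha @ b) for alpha in Gamma)
```

For example, with a two-state problem in which the state never changes and the observation has probability 0.8 in the first state and 0.2 in the second, a uniform belief moves to (0.8, 0.2) after that observation.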
Unfortunately, in the worst case the size of the representation grows as $|\Gamma_t| = |A|\,|\Gamma_{t-1}|^{|Z|}$, which rapidly becomes intractable even for modest problem sizes. Despite clever strategies for pruning dominated $\alpha$ vectors, these algorithms have been unable to scale to larger problems. The intractability of exact value iteration has led to the development of a wide variety of approximation techniques, too many to mention here [Aberdeen, 2002].

3 POINT-BASED ALGORITHMS

Point-based value iteration algorithms rely on the fact that performing many fast approximate updates often results in a more useful value function than performing a few exact updates. Their fundamental operation is the point-based update $\mathrm{backup}(\Gamma, b)$, which generates a single $\alpha$ vector from $HV$ that is guaranteed to be maximal at $b$ (Alg. 1).

Algorithm 1. $\beta = \mathrm{backup}(\Gamma, b)$.
1. $\beta^{a,z} \leftarrow \arg\max_{\alpha \in \Gamma} (\alpha \cdot \tau(b, a, z))$
2. $\beta^{a}(s) \leftarrow R(s, a) + \gamma \sum_{s', z} \beta^{a,z}(s')\, T^{a,z}(s, s')$
3. $\beta \leftarrow \arg\max_{\beta^{a}} (\beta^{a} \cdot b)$

Our analysis focuses on a simple conceptual version of point-based value iteration. We assume there is a fixed finite set of beliefs $B$. At each step the algorithm generates an $\alpha$ vector for every point in $B$, and the set of vectors $\Gamma$ defines a value function through max-projection as described earlier. Denote the value function after $t$ updates as $V_t^B$. The value function is initialized with $V_0^B \leftarrow R_{\min}$, and the update rule is $V_t^B \leftarrow H_B V_{t-1}^B$, where the update operator $H_B$ applies the point-based update at every point of $B$:

$$H_B \Gamma = \{ \mathrm{backup}(\Gamma, b) \mid b \in B \} \tag{4}$$

In this case, the approximation error relative to exact value iteration after $t$ updates, $\lVert V_t - V_t^B \rVert_{\infty}$, is known to be bounded proportionally with the sample spacing $\delta(B)$, which is defined to be the maximum 1-norm distance from any point in $\bar{\Delta}$ to $B$ [Pineau et al., 2003]. $B$ thus needs to contain only enough points to cover $\bar{\Delta}$ with uniform sample spacing.

However, current point-based algorithms do not sample $\bar{\Delta}$ uniformly (although the PBVI algorithm makes some attempt to do so). Instead, they collect points for $B$ by a forward simulation process that biases $B$ to contain beliefs that are only a few simulation steps away from $b_0$. This bias arguably helps rather than hurts. It underlies a second type of convergence argument based on the depth of the search tree. If $B$ contains all the beliefs that result from expanding the search tree to depth $t$ and $t$ updates over $B$ are performed, then the approximation error at $b_0$ is bounded proportionally with $\gamma^t$.
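The point-based update of Alg. 1 can be sketched as follows. This is an illustrative sketch under stated assumptions rather than the HSVI implementation: the dense array shapes are hypothetical, and `T[a, z, s, s2]` is assumed to hold the joint probability of reaching `s2` and observing `z` from `s` under `a`. Because the argmax in step 1 is unchanged by positive rescaling of the belief, the sketch uses the unnormalized updated belief and avoids a division.

```python
import numpy as np


def backup(Gamma, b, R, T, gamma):
    """Point-based update: returns one alpha vector that is maximal at b.

    Gamma: list of alpha vectors, each of shape (S,)
    b:     belief, shape (S,)
    R:     rewards, shape (S, A)
    T:     assumed layout (A, Z, S, S) with T[a, z, s, s2] = Pr(s2, z | s, a)
    """
    num_actions, num_obs = T.shape[0], T.shape[1]
    best_vec, best_val = None, -np.inf
    for a in range(num_actions):
        beta_a = R[:, a].astype(float)  # astype returns a fresh copy
        for z in range(num_obs):
            # Unnormalized tau(b, a, z); same argmax as the normalized belief.
            weighted_next = b @ T[a, z]
            # Step 1: best existing vector at the successor belief.
            beta_az = max(Gamma, key=lambda alpha: float(alpha @ weighted_next))
            # Step 2: accumulate the discounted expected future value.
            beta_a = beta_a + gamma * (T[a, z] @ beta_az)
        # Step 3: keep the action vector with the highest value at b.
        val = float(beta_a @ b)
        if val > best_val:
            best_vec, best_val = beta_a, val
    return best_vec
```

Applying `backup` once per belief in $B$ and collecting the results gives one application of $H_B$ in the sense of Eq. 4.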

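3.1 NEW THEORETICAL RESULTS

This section presents a new convergence argument that draws on the two earlier approaches. Its use of weighted max-norm machinery in value iteration is closely related to [Munos, 2004]. Our argument reflects current point-based algorithms in that it allows $B$ to be a non-uniform sampling of $\bar{\Delta}$ whose spacing varies according to discounted reachability.

The discounted reachability $\rho : \bar{\Delta} \to \mathbb{R}$ is defined to be $\rho(b) = \gamma^{L}$, where $L$ is the length of the shortest sequence of belief-state transitions from $b_0$ to $b$. $\rho$ satisfies the property that $\rho(b') \ge \gamma \rho(b)$ whenever there is a single-step transition from $b$ to $b'$.

Based on $\rho$, we define a generalized sample spacing measure $\delta_p$ (with $0 \le p < 1$):

$$\delta_p(B) = \max_{b \in \bar{\Delta}} \min_{b' \in B} \frac{\lVert b - b' \rVert_1}{\rho^{p}(b)} \tag{5}$$

In order to achieve a small $\delta_p$ value, $B$ must have small 1-norm distance from all points in $\bar{\Delta}$, but its distance from $b$ can be proportionally larger if $\rho^{p}(b)$ is small.

When sample spacing is bounded in terms of $\delta_p$, $H_B$ does not have the error properties we want under the usual max-norm. We must define a new norm to reflect the fact that $H_B$ induces larger errors where $\rho$ is small. A weighted max-norm is a function $\lVert \cdot \rVert_{\xi}$ such that

$$\lVert V - V' \rVert_{\xi} = \max_{b} \frac{|V(b) - V'(b)|}{\xi(b)}, \tag{6}$$

where $\xi > 0$. Not surprisingly, $\lVert \cdot \rVert_{\rho^{p}}$ is the norm we need. Note that when $p = 0$, $\delta_p$ reduces to the uniform spacing measure $\delta$ and the $\rho^{p}$-norm reduces to the max-norm.

We begin by generalizing some well-known results about standard value iteration to the $\rho^{p}$-norm with $0 \le p < 1$.

Theorem 1. The exact Bellman update $H$ is a contraction under the $\rho^{p}$-norm with contraction factor $\gamma^{1-p}$.

Proof. Define

$$Q_a^{V}(b) = R(b, a) + \gamma \sum_{b'} \Pr(b' \mid b, a)\, V(b') \tag{7}$$

so that $HV = \max_a Q_a^{V}$. For any $a$, the mapping $V \mapsto Q_a^{V}$ has contraction factor $\gamma^{1-p}$:

$$\lVert Q_a^{V} - Q_a^{V'} \rVert_{\rho^{p}} = \max_{b} \frac{|Q_a^{V}(b) - Q_a^{V'}(b)|}{[\rho(b)]^{p}} \tag{8}$$

$$= \max_{b} \frac{\gamma \left| \sum_{b'} \Pr(b' \mid b, a) \left( V(b') - V'(b') \right) \right|}{[\rho(b)]^{p}} \tag{9}$$

$$\le \max_{b} \gamma \sum_{b'} \Pr(b' \mid b, a) \frac{|V(b') - V'(b')|}{[\gamma \rho(b')]^{p}} \tag{10}$$

$$\le \max_{b} \gamma^{1-p} \sum_{b'} \Pr(b' \mid b, a)\, \lVert V - V' \rVert_{\rho^{p}} \tag{11}$$

$$= \gamma^{1-p} \lVert V - V' \rVert_{\rho^{p}} \tag{12}$$

Now choose an arbitrary $b$. Assume without loss of generality that $HV(b) \ge HV'(b)$. Choose $a^*$ to maximize $Q_{a^*}^{V}(b)$, and $\bar{a}$ to maximize $Q_{\bar{a}}^{V'}(b)$. It follows that $Q_{a^*}^{V'}(b) \le Q_{\bar{a}}^{V'}(b) \le Q_{a^*}^{V}(b)$, and

$$|HV(b) - HV'(b)| = Q_{a^*}^{V}(b) - Q_{\bar{a}}^{V'}(b) \tag{13}$$

$$\le Q_{a^*}^{V}(b) - Q_{a^*}^{V'}(b) \tag{14}$$

$$\le \max_{a} |Q_a^{V}(b) - Q_a^{V'}(b)| \tag{15}$$

Dividing through by $\rho^{p}(b)$ and maximizing over $b$ yields

$$\lVert HV - HV' \rVert_{\rho^{p}} \le \max_{a} \lVert Q_a^{V} - Q_a^{V'} \rVert_{\rho^{p}} \le \gamma^{1-p} \lVert V - V' \rVert_{\rho^{p}} \tag{16}$$

Theorem 2. Let $\hat{\pi}$ be the one-step lookahead policy induced by an approximate value function $\hat{V}$. The regret from executing $\hat{\pi}$ rather than $\pi^*$, starting from $b_0$, is at most

$$\frac{2\gamma^{1-p}}{1 - \gamma^{1-p}} \lVert V^* - \hat{V} \rVert_{\rho^{p}} \tag{17}$$

Proof. Choose an arbitrary $b$. It is easy to check that for any policy $\pi$, $J^{\pi}(b) = Q_{\pi(b)}^{J^{\pi}}(b)$. Also, because $\hat{\pi}$ is the one-step lookahead policy induced by $\hat{V}$, $Q_{\hat{\pi}(b)}^{\hat{V}}(b) = H\hat{V}(b)$. The Bellman equation states that $V^* = HV^*$. Then:

$$|J^{\pi^*}(b) - J^{\hat{\pi}}(b)| = \left| V^*(b) - Q_{\hat{\pi}(b)}^{J^{\hat{\pi}}}(b) \right| \tag{18}$$

$$= \left| V^*(b) - Q_{\hat{\pi}(b)}^{\hat{V}}(b) + Q_{\hat{\pi}(b)}^{\hat{V}}(b) - Q_{\hat{\pi}(b)}^{J^{\hat{\pi}}}(b) \right| \tag{19}$$

$$\le \left| V^*(b) - Q_{\hat{\pi}(b)}^{\hat{V}}(b) \right| + \left| Q_{\hat{\pi}(b)}^{\hat{V}}(b) - Q_{\hat{\pi}(b)}^{J^{\hat{\pi}}}(b) \right| \tag{20}$$

$$\le |HV^*(b) - H\hat{V}(b)| + \gamma \sum_{b'} \Pr(b' \mid b, \hat{\pi}(b))\, |\hat{V}(b') - J^{\hat{\pi}}(b')| \tag{21}$$

$$\le |HV^*(b) - H\hat{V}(b)| + \gamma \sum_{b'} \Pr(b' \mid b, \hat{\pi}(b))\, \gamma^{-p} \rho^{p}(b)\, \lVert \hat{V} - J^{\hat{\pi}} \rVert_{\rho^{p}} \tag{22}$$

$$= |HV^*(b) - H\hat{V}(b)| + \gamma^{1-p} \rho^{p}(b)\, \lVert \hat{V} - J^{\hat{\pi}} \rVert_{\rho^{p}} \tag{23}$$

Dividing through by $\rho^{p}(b)$ and maximizing over $b$ gives

$$\lVert J^{\pi^*} - J^{\hat{\pi}} \rVert_{\rho^{p}} \tag{24}$$

$$\le \lVert HV^* - H\hat{V} \rVert_{\rho^{p}} + \gamma^{1-p} \lVert \hat{V} - J^{\hat{\pi}} \rVert_{\rho^{p}} \tag{25}$$

$$\le \gamma^{1-p} \left( \lVert V^* - \hat{V} \rVert_{\rho^{p}} + \lVert \hat{V} - J^{\hat{\pi}} \rVert_{\rho^{p}} \right) \tag{26}$$

$$\le \gamma^{1-p} \left( \lVert V^* - \hat{V} \rVert_{\rho^{p}} + \lVert \hat{V} - V^* \rVert_{\rho^{p}} + \lVert V^* - J^{\hat{\pi}} \rVert_{\rho^{p}} \right) \tag{27, 28}$$

$$= \gamma^{1-p} \left( 2 \lVert V^* - \hat{V} \rVert_{\rho^{p}} + \lVert V^* - J^{\hat{\pi}} \rVert_{\rho^{p}} \right) \tag{29}$$

$$= \gamma^{1-p} \left( 2 \lVert V^* - \hat{V} \rVert_{\rho^{p}} + \lVert J^{\pi^*} - J^{\hat{\pi}} \rVert_{\rho^{p}} \right) \tag{30}$$

$$= 2\gamma^{1-p} \lVert V^* - \hat{V} \rVert_{\rho^{p}} + \gamma^{1-p} \lVert J^{\pi^*} - J^{\hat{\pi}} \rVert_{\rho^{p}} \tag{31}$$

Solving the recursion,

$$\lVert J^{\pi^*} - J^{\hat{\pi}} \rVert_{\rho^{p}} \le \frac{2\gamma^{1-p}}{1 - \gamma^{1-p}} \lVert V^* - \hat{V} \rVert_{\rho^{p}} \tag{32}$$

And since $\rho(b_0) = 1$, we have the desired regret bound:

$$J^{\pi^*}(b_0) - J^{\hat{\pi}}(b_0) \le \frac{2\gamma^{1-p}}{1 - \gamma^{1-p}} \lVert V^* - \hat{V} \rVert_{\rho^{p}}$$

It is worth noting (although we lack space to prove it here) that a tighter bound applies when $\hat{V}$ is uniformly improvable [Zhang and Zhang, 2001]. A small modification to $H_B$ would make $V_t^B$ uniformly improvable at the cost of increasing $|\Gamma|$. In that case the regret would be at most $\gamma^{1-p} \lVert V^* - \hat{V} \rVert_{\rho^{p}}$.

Having discussed the $\rho^{p}$-norm behavior of $H$, we now move on to the $\rho^{p}$-norm behavior of $H_B$ with non-uniform sample spacing $\delta_p$.

Lemma 1. At any update step $t$, the error $\lVert HV_t^B - H_B V_t^B \rVert_{\rho^{p}}$ introduced by a single application of $H_B$ rather than $H$ is at most

$$\frac{(R_{\max} - R_{\min})\, \delta_p(B)}{1 - \gamma^{1-p}} \tag{33}$$

Proof. The argument is analogous to Lemma 1 of [Pineau et al., 2003]. Necessary changes: (1) restrict $b'$ to be drawn from $\bar{\Delta}$, (2) divide throughout by $\rho^{p}(b')$, and (3) substitute $\gamma^{1-p}$ for $\gamma$ in the denominator to reflect the changed contraction properties of $H$ under the new norm.

Theorem 3. At any update step $t$, the accumulated error $\lVert V_t - V_t^B \rVert_{\rho^{p}}$ is at most

$$\frac{(R_{\max} - R_{\min})\, \delta_p(B)}{(1 - \gamma^{1-p})^{2}} \tag{34}$$

Proof. The argument is analogous to Theorem 1 of [Pineau et al., 2003]. Necessary changes: (1) replace the max-norm with the $\rho^{p}$-norm, and (2) replace $\gamma$ with $\gamma^{1-p}$.

Taken together, these results show that the conceptual algorithm can be used to generate a policy with arbitrarily small regret related to $\delta_p(B)$, and they provide a finite bound on the number of updates required to achieve a given regret.

3.2 IMPLICATIONS FOR ALGORITHM DESIGN

The bias of our model toward beliefs with high discounted reachability describes current algorithms more accurately than uniform sampling, at least to the extent that the algorithms perform (typically shallow) forward exploration from the initial belief to generate $B$.

The parameter $p$ arose naturally during our analysis. $p = 0$ corresponds to uniform sampling and the usual max-norm. As $p$ increases, samples grow less dense in areas with low reachability and the norm becomes correspondingly more tolerant.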
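A small numeric sketch may make the role of $p$ concrete. It computes discounted reachability by breadth-first search over a finite, hypothetical belief graph (the reachable set is generally not finite, so this is purely illustrative) and evaluates the coefficients appearing in Theorem 1, Theorem 2, Lemma 1, and Theorem 3. The function names and the dictionary-based graph are assumptions made for this example.

```python
from collections import deque


def discounted_reachability(successors, root, gamma):
    """rho(b) = gamma**L, where L is the shortest number of transitions from root to b.

    successors: dict mapping a belief label to an iterable of its one-step
    successor labels (a hypothetical finite stand-in for the belief tree).
    """
    depth = {root: 0}
    queue = deque([root])
    while queue:
        b = queue.popleft()
        for nxt in successors.get(b, ()):
            if nxt not in depth:  # first visit in BFS gives the shortest path
                depth[nxt] = depth[b] + 1
                queue.append(nxt)
    return {b: gamma ** d for b, d in depth.items()}


def bound_terms(gamma, p, reward_range, delta_p):
    """Coefficients from the theoretical results for a given p and delta_p(B)."""
    g = gamma ** (1.0 - p)  # contraction factor under the rho^p-norm (Theorem 1)
    return {
        "contraction": g,
        "regret_coeff": 2.0 * g / (1.0 - g),                          # Theorem 2
        "single_update_error": reward_range * delta_p / (1.0 - g),    # Lemma 1
        "accumulated_error": reward_range * delta_p / (1.0 - g) ** 2, # Theorem 3
    }
```

With $\gamma = 0.9$, moving from $p = 0$ to $p = 0.5$ raises the contraction factor from 0.9 to about 0.949. Note that $\delta_p(B)$ itself depends on $p$, so comparing the bounds at a fixed $\delta_p$ value only isolates the effect of the contraction factor.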
But the results show that there is no free lunch: the higher effective discount factor $\gamma^{1-p}$ under the new norm means that more updates are required and the final error bounds are looser. The new theoretical framework provides a way to analyze this trade-off.

We initially found the concept of discounted reachability surprising. The intuition is that (1) beliefs that are deeper in the search tree are less relevant, and (2) beliefs that can only be reached by low-probability belief transitions are less relevant. But discounted reachability ignores (2) entirely, in that all transitions with non-zero probability are treated equally.

Actually, we started with a different concept of discounted occupancy, in which beliefs are tagged as proportionally less relevant if they can only be reached by low-probability belief transitions. The bias of current algorithms seems to be better described by discounted occupancy, and empirically, treating all transitions with non-zero probability equally hurts performance. But the convergence results we found do not go through when discounted occupancy is used instead of discounted reachability. We hope that more sophisticated future analysis will shed light on this issue.

In summary, these new results take us closer to understanding point-based algorithms. The analysis helps explain important trade-offs in algorithm design, although we have not yet had time to apply it to a working algorithm. The next section changes the topic to recent improvements in our (point-based) HSVI algorithm. Note that those improvements are not based on the theoretical results just presented.

4 IMPROVEMENTS IN HEURISTIC SEARCH VALUE ITERATION

This section discusses recent improvements in our heuristic search value iteration algorithm (HSVI). Relative to our original presentation of HSVI, the new implementation calculates tighter initial bounds, avoids solving linear programs, and makes more effective use of sparsity. Empirical results show speedups of up to three orders of magnitude.
4.1 HSVI OVERVIEW

HSVI is a point-based algorithm that maintains both upper and lower bounds on the optimal value function, allowing it to use effective heuristics for action and observation selection, and to provide provably small regret from the policy it generates. We provide a brief overview here; for more detail refer to [Smith and Simmons, 2004]. We refer to the original version and our current version as HSVI1 and HSVI2, respectively. The differences are covered comprehensively in Section 4.2.

HSVI is outlined in Algs. 2 and 3. We denote the lower and upper bound functions as $\underline{V}$ and $\overline{V}$, respectively. The interval function $\hat{V}$ refers to them collectively, such that $\hat{V}(b) = [\underline{V}(b), \overline{V}(b)]$ and $\mathrm{width}(\hat{V}(b)) = \overline{V}(b) - \underline{V}(b)$.

Algorithm 2. $\pi = \mathrm{HSVI}(\epsilon)$.
$\mathrm{HSVI}(\epsilon)$ returns a policy $\pi$ whose regret relative to $\pi^*$, starting from $b_0$, is at most $\epsilon$.
1. Initialize the bounds $\hat{V}$.
2. While $\mathrm{width}(\hat{V}(b_0)) > \epsilon$, repeatedly invoke $\mathrm{explore}(b_0, \epsilon, 0)$.
3. Having achieved the desired precision, return the direct-control policy $\pi$ corresponding to the lower bound.

Algorithm 3. $\mathrm{explore}(b, \epsilon, t)$.
explore recursively follows a single path down the search tree until satisfying a termination condition based on the width of the bounds interval. It then performs a series of updates on its way back up to the initial belief.
1. If $\mathrm{width}(\hat{V}(b)) \le \epsilon \gamma^{-t}$, return.
2. Select an action $a^*$ and observation $z^*$ according to the forward exploration heuristics.
3. Call $\mathrm{explore}(\tau(b, a^*, z^*), \epsilon, t + 1)$.
4. Perform a point-based update of $\hat{V}$ at belief $b$.

Value Function Representation

HSVI uses the usual $\Gamma$ vector set representation for its lower bound (see Section 2). Unfortunately, if the upper bound is represented with a vector set, updating by adding a vector does not have the desired effect of improving the bound in the neighborhood of the local update. To accommodate the need for updates, HSVI uses a point set representation for the upper bound. The value at a point $b$ is the projection of $b$ onto the convex hull formed by a finite set $\Upsilon$ of belief/value points $(b_i, v_i)$. Updates are performed by adding a new point to the set. In HSVI1, the projection onto the convex hull is calculated by solving a linear program using the commercial CPLEX software package.

Initialization

HSVI1 initializes the lower bound using a conservative estimate of the values of blind policies of the form "always execute action $a$". The smallest possible reward from executing action $a$ is $\min_s R(s, a)$, so a bound on the long-term reward for that policy can be found by evaluating the relevant summation. HSVI1 then maximizes over $a$:

$$\underline{R} = \max_{a} \sum_{t=0}^{\infty} \gamma^{t} \min_{s} R(s, a) = \max_{a} \frac{\min_{s} R(s, a)}{1 - \gamma} \tag{35}$$

The vector set for the initial lower bound $\underline{V}_0$ contains a single vector $\alpha$ such that every $\alpha(s) = \underline{R}$. HSVI1 initializes the upper bound by assuming full observability and solving the MDP version of the problem. This provides upper bound values at the corners of the belief simplex, which form the initial point set.

Local Updates

HSVI performs a local update $L_b$ of the lower bound by adding the result of a point-based update at $b$ to the vector set:

$$L_b \Gamma = \Gamma \cup \{ \mathrm{backup}(\Gamma, b) \} \tag{36}$$

It performs a local update $U_b$ of the upper bound by adding the result of a Bellman update at $b$ to the point set:

$$U_b \Upsilon = \Upsilon \cup \{ (b, H\overline{V}(b)) \} \tag{37}$$

Fig. 2 represents the structure of the bounds representations and the process of locally updating at $b$. In the left side of the figure, the points and dotted lines represent $\overline{V}$ (upper bound points and convex hull). Several solid lines represent the vectors of $\Gamma$. In the right side of the figure, we see the result of updating both bounds at $b$, which involves adding a new point to $\Upsilon$ and a new vector to $\Gamma$, bringing both bounds closer to $V^*$.

[Figure 2: Local update at $b$. Curves shown: $\overline{V}$, $V^*$, $\underline{V}$, before and after the update.]

HSVI periodically prunes dominated elements in both the lower bound vector set and the upper bound point set; we do not discuss the pruning here because it is unaffected by our recent changes.

Forward Exploration Heuristics

This section discusses the heuristics that are used to decide which child of the current node to visit as HSVI works its way forward from the initial belief. Starting from a parent node $b$, HSVI must choose an action $a^*$ and an observation $z^*$: the child node to visit is $\tau(b, a^*, z^*)$.
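The control flow of Algs. 2 and 3 can be sketched independently of how the bounds are stored. In this minimal sketch the bounds, the child-selection heuristics, and the local update are passed in as callables; `width`, `choose_child`, and `update` are hypothetical names introduced for illustration and are not part of any HSVI release.

```python
def explore(b, eps, t, gamma, width, choose_child, update):
    """Alg. 3: follow one path down the search tree, then update on the way back up."""
    # Step 1: stop once the bounds interval at b is narrow enough for depth t.
    if width(b) <= eps * gamma ** (-t):
        return
    # Step 2: pick the child tau(b, a*, z*) via the forward exploration heuristics.
    child = choose_child(b, t)
    # Step 3: recurse one level deeper.
    explore(child, eps, t + 1, gamma, width, choose_child, update)
    # Step 4: local update of both bounds at b.
    update(b)


def hsvi_loop(b0, eps, gamma, width, choose_child, update):
    """Alg. 2, step 2: repeat trials until the root interval is within eps."""
    trials = 0
    while width(b0) > eps:
        explore(b0, eps, 0, gamma, width, choose_child, update)
        trials += 1
    return trials
```

The termination threshold grows with depth as $\epsilon \gamma^{-t}$, so each trial ends at a finite depth whenever the interval widths are bounded.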

HSVI selects actions greedily based on the upper bound (the IE-MAX heuristic). At a belief $b$, for every action $a$, it can compute an upper bound on the long-term reward from taking that action. It chooses the action with the highest upper bound:

$$a^* = \arg\max_{a} Q_a^{\overline{V}}(b) \tag{38}$$

Because the bounds at a parent node are always wider than the bounds at the child with the highest upper bound value, choosing according to IE-MAX is a good way to ensure convergence. In the simpler context where updates do not affect neighboring nodes, it is provably optimal [Kaelbling, 1993].

HSVI uses the weighted excess uncertainty heuristic for observation selection. Excess uncertainty at a belief $b$ with depth $t$ in the search tree is defined to be

$$\mathrm{excess}(b, t) = \mathrm{width}(\hat{V}(b)) - \epsilon \gamma^{-t} \tag{39}$$

Excess uncertainty has the property that if all the children of a node $b$ have negative excess uncertainty, then after an update $b$ will also have negative excess uncertainty. Negative excess uncertainty at the root implies the desired convergence to within $\epsilon$. The weighted excess uncertainty heuristic is designed to focus attention on the child node with the greatest contribution to the excess uncertainty at the parent:

$$z^* = \arg\max_{z} \left[ \Pr(z \mid b, a^*)\, \mathrm{excess}(\tau(b, a^*, z), t + 1) \right] \tag{40}$$

Both the action and observation selection heuristics are designed so that applying them systematically guarantees HSVI convergence in finite time [Smith and Simmons, 2004].

4.2 CHANGES BETWEEN HSVI1 AND HSVI2

We report a series of changes made since our initial presentation of HSVI1. The changes are roughly ordered in terms of their impact on the overall performance. The relative speedup for individual changes is problem-dependent; the reported values were measured informally on the Tag problem. HSVI2 performance is presented in Section 4.3.

More Effective Use of Sparsity

HSVI1 represents beliefs and transition functions as vectors and matrices in BLAS compressed storage mode [Dongarra et al., 1988]. It uses an off-the-shelf sparse linear algebra package to compute belief transitions and take dot products. That package turned out to be using inappropriate algorithms, slowing down individual operations by as much as 100x.
We addressed the problem in HSVI2 by writing our own simple compressed storage operations, which speed up lower bound updates by about 50x.

HSVI1 represents $\alpha$ vectors in dense storage mode because they tend to have a large number of non-zeros, even when beliefs are sparse. Typically, when $\alpha \leftarrow \mathrm{backup}(\Gamma, b)$ is applied, all of the entries of $\alpha$ must be computed, even if $b$ is sparse and most of the entries have no effect on the value $\alpha \cdot b$. They are required because HSVI may later need to evaluate $\alpha \cdot b'$ where $b'$ has different non-zeros. But if $\alpha$ is optimized for $b$, why should we expect it to be relevant to $b'$, which has different non-zeros and perhaps no overlap with $b$ at all?

This leads to the idea of masked $\alpha$ vectors. In HSVI2, $\alpha \leftarrow \mathrm{backup}(\Gamma, b)$ computes only the entries of $\alpha$ that correspond to non-zeros of $b$. A mask records which entries were computed. If HSVI2 later evaluates $\max_i (\alpha_i \cdot b')$ and $b'$ has a non-zero in a position that was not computed in $\alpha_i$, the dot product $\alpha_i \cdot b'$ is rejected from consideration.

This change can be interpreted geometrically. Sparse beliefs lie in hyperplanes on the boundary of the belief simplex. When a masked $\alpha$ vector is computed using the new $\mathrm{backup}(\Gamma, b)$, it applies only to the lowest-dimensional boundary hyperplane containing $b$. Empirically, masked $\alpha$ vectors speed up lower bound updates by about 5x. Note that almost any POMDP value iteration algorithm could make use of this concept.

Avoid Solving Linear Programs

HSVI1 evaluates $\overline{V}(b)$ by computing the exact projection of $b$ onto the convex hull of the points in $\Upsilon$, which involves solving a linear program with the commercial CPLEX software package. Each upper bound update requires several such projections, and the time spent solving linear programs dominates the upper bound update time.

HSVI2 uses an approximate projection onto the convex hull suggested by [Hauskrecht, 2000]. Projection onto the convex hull of a set of points is particularly simple when the set contains only the corners of the belief simplex and one interior point: it can be computed in $O(|S|)$ time.
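The single-interior-point case can be sketched as follows. This is an illustrative reconstruction, not code from HSVI2: it assumes the standard interpolation in which $b$ is written as the largest feasible multiple $c$ of the interior belief plus a non-negative combination of corners, and the function and argument names are hypothetical.

```python
import numpy as np


def project_one_point(b, corner_values, b_i, v_i):
    """Upper-bound value at b from the simplex corners plus one point (b_i, v_i).

    corner_values[s] is the bound at the corner for state s. Runs in O(|S|).
    """
    corner_at_b = float(corner_values @ b)
    # How far the interior point lies below the corner interpolation at b_i;
    # a point above the corner plane cannot lower the hull, so clamp at 0.
    gap = min(0.0, v_i - float(corner_values @ b_i))
    # Largest c such that b - c * b_i stays non-negative in every state.
    support = b_i > 0
    c = float(np.min(b[support] / b_i[support]))
    return corner_at_b + c * gap
```

For instance, with both corner values equal to 10 and an interior point at belief (0.5, 0.5) with value 6, the belief (0.75, 0.25) lies halfway between a corner and the interior point and receives the value 8.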
To approximately project onto the overall convex hull, HSVI2 runs this operation for each interior point of $\Upsilon$ and takes the minimum value, requiring $O(|\Upsilon|\,|S|)$ time overall (or less with sparsity). This approximate convex hull has the key properties that (1) it is everywhere greater than or equal to the exact convex hull, and (2) the approximation at $b$ is exact if there is an undominated pair $(b, v) \in \Upsilon$. Empirically, the approximate projection speeds up upper bound updates by about 100x.

Tighter Initial Bounds

HSVI1 generates an initial lower bound based on a conservative estimate of the values of blind policies. HSVI2 uses a better blind policy value estimate suggested in [Hauskrecht, 1997]. The value $\alpha^{a}$ of each policy "always take action $a$" is updated in MDP fashion:

$$\alpha^{a}_{t+1}(s) = R(s, a) + \gamma \sum_{s'} \Pr(s' \mid s, a)\, \alpha^{a}_{t}(s') \tag{41}$$

Each update of $|A|$ vectors can be evaluated in $O(|S|^{2} |A|)$ time. HSVI2 initializes the vectors $\alpha^{a}_{0}$ using the HSVI1 lower bound, which guarantees that the bound is valid even if the iteration is not run to completion. When the iteration is stopped, the $\alpha^{a}_{t}$ vectors form the initial lower bound $\Gamma$.

HSVI1 generates an initial upper bound based on the value function of the fully observable MDP. HSVI2 uses the fast informed bound (FIB) approximation, which is guaranteed to give a tighter upper bound than the MDP approximation [Hauskrecht, 2000]. FIB iteration keeps one vector $\alpha^{a}$ for each action and uses the following update rule:

$$\alpha^{a}_{t+1}(s) = R(s, a) + \gamma \sum_{z} \max_{a'} \sum_{s'} \Pr(s', z \mid s, a)\, \alpha^{a'}_{t}(s')$$

Each FIB update can be evaluated in $O(|A|^{2} |S|^{2} |Z|)$ time. As with the lower bound, HSVI2 initializes the upper bound vectors $\alpha^{a}_{0}$ using the HSVI1 upper bound. When FIB iteration is stopped, each corner point corresponding to a state $s$ is initialized to $\max_{a} \alpha^{a}_{t}(s)$.

Empirically, HSVI2 can run both bound initialization routines to approximate convergence (residual $< 10^{-3}$) in at most a few seconds for all of the problems in our benchmark set. This results in better performance near the beginning of HSVI2 execution, although later in the run the effect is less significant. The change in the lower bound initialization is the more important of the two; the MDP upper bound was already fairly good for most problems.

4.3 HSVI2 PERFORMANCE

Fig. 3 shows HSVI2 reward vs. time for four problems from the scalable POMDP literature. The plotted reward is the average received over 100 or more simulations. We also plot HSVI2's bounds $\underline{V}(b_0)$ and $\overline{V}(b_0)$. HSVI2 was run only once on each problem since it is not stochastic. The platform used was a Pentium-4 running at 3.4 GHz, with 2 GB of RAM (HSVI2 used at most 250 MB of RAM).

The plots show a range of behaviors. RockSample[4,4] is especially easy; the HSVI2 bounds converge after 13 seconds, showing that the solution is optimal.
Hallway2 shows HSVI2 quickly arriving at an apparently near-optimal solution, but its bounds remain loose. Tag and RockSample[10,10] show typical behavior for large problems: the upper bound decrease is slow and steady while the lower bound (and the reward) improve in jumps, plateauing for long periods. RockSample[10,10], with $> 10^{5}$ states, would be too large for most POMDP algorithms to handle; HSVI2 gains by use of sparsity. It would run out of memory with a problem 5-10 times larger.

[Figure 3: HSVI2 reward vs. wallclock time. Panels: RockSample[4,4] (257s 9a 2o), Hallway2 (93s 5a 17o), Tag (870s 5a 30o), RockSample[10,10] (102,401s 19a 2o). Each panel plots simulation reward and bounds.]

Fig. 4 shows running times and solution quality for HSVI and several other algorithms. Note that different algorithms were run on different platforms, so running times are only roughly comparable. The table also shows, for each problem, the 95% confidence interval for reward measurements assuming the variance of HSVI2's best policy and averaging 100 rewards. An algorithm's reward is starred if it is within the confidence interval relative to the best reported value.

HSVI2 is within measurement error of the best reported reward for all problems, and its running time is considerably shorter than other algorithms in most cases. The greatest speedup from HSVI1 to HSVI2 was observed on the RockSample[7,8] problem. HSVI2 takes about 6 seconds to surpass the reward reached by HSVI1 after $> 10^{4}$ seconds. After correcting for running on a processor about 5x faster, this is a $> 300$x speedup.

Other state-of-the-art scalable POMDP algorithms could not be compared to HSVI2 because they were tested on different problems. Among these, two techniques appear especially promising. Exponential-family PCA transforms the POMDP, compresses to a low-dimensional representation in the transformed space, then solves it with a grid-based algorithm. It has demonstrated good results on large-scale robot navigation problems [Roy and Gordon, 2003]. Value-directed compression (VDC) is another compression technique.
It typically produces a less compact representation than E-PCA, but the compressed POMDP retains linear structure and value function convexity, so that it can be solved using almost any POMDP algorithm. The combinations VDC+BPI and VDC+PBVI have demonstrated scalability to huge problem sizes, up to 33 million states [Poupart and Boutilier, 2004].[1] VDC would likely boost HSVI scalability in a similar way.

[1] VDC+PBVI results courtesy of Poupart, personal communication.

| Problem (states/actions/observations) | Reward | Time (s) | $|\Gamma|$ |
|---|---|---|---|
| Tiger-Grid (36s 5a 17o) (±0.14) | | | |
| HSVI1 [Smith et al., 2004] | 2.35* | | |
| Perseus [Spaan et al., 2004] | 2.34* | | |
| HSVI2 | 2.30* | | |
| PBUA [Poon, 2001] | 2.30* | | |
| PBVI [Pineau et al., 2003] | 2.25* | | |
| BPI [Poupart et al., 2003] | 2.22* | | |
| QMDP | | | N/A |
| Hallway (61s 5a 21o) (±0.038) | | | |
| PBVI [Pineau et al., 2003] | 0.53* | | |
| PBUA [Poon, 2001] | 0.53* | | |
| HSVI2 | 0.52* | | |
| HSVI1 [Smith et al., 2004] | 0.52* | | |
| Perseus [Spaan et al., 2004] | 0.51* | | |
| BPI [Poupart et al., 2003] | 0.51* | | |
| QMDP | | | N/A |
| Hallway2 (93s 5a 17o) (±0.048) | | | |
| HSVI2 | 0.35* | | |
| Perseus [Spaan et al., 2004] | 0.35* | | |
| HSVI1 [Smith et al., 2004] | 0.35* | | |
| PBUA [Poon, 2001] | 0.35* | | |
| PBVI [Pineau et al., 2003] | 0.34* | | |
| BPI [Poupart et al., 2004] | 0.32* | | |
| QMDP | | | N/A |
| Tag (870s 5a 30o) (±1.2) | | | |
| Perseus [Spaan et al., 2004] | -6.17* | | |
| HSVI2 | -6.36* | | |
| HSVI1 [Smith et al., 2004] | -6.37* | | |
| BPI [Poupart et al., 2004] | -6.65* | | |
| PBVI [Pineau et al., 2003] | | | |
| QMDP | | | N/A |
| RockSample[4,4] (257s 9a 2o) (±1.2) | | | |
| HSVI2 | 18.0* | | |
| HSVI1 [Smith et al., 2004] | 18.0* | | |
| PBVI [Pineau, pers. communication] | 17.1* | 2000? | |
| QMDP | | | N/A |
| RockSample[7,8] (12,545s 13a 2o) (±1.2) | | | |
| HSVI2 | 20.6* | | |
| HSVI1 [Smith et al., 2004] | | | |
| QMDP | | | N/A |
| RockSample[10,10] (102,401s 19a 2o) (±1.3) | | | |
| HSVI2 | 20.4* | | |
| QMDP | 0 | 57 | N/A |

Figure 4: Multi-algorithm performance comparison.

5 CONCLUSION

We presented new theoretical results for point-based algorithms, which combine curse of dimensionality and curse of history arguments into an overall bound on the convergence of point-based value iteration with non-uniform sample spacing. In the future we will apply these results to point-based algorithm design.

We also demonstrated improved performance for our HSVI algorithm, with speedups of more than two orders of magnitude and successful scaling to a POMDP with $> 10^{5}$ states. In the future we would like to combine HSVI with a compact representation technique such as VDC to deal with still larger problems.

Acknowledgments

Thanks to Geoff Gordon and Pascal Poupart for helpful discussions. This work was funded in part by a NASA GSRP Fellowship with Ames Research Center.

References

[Aberdeen, 2002] Aberdeen, D. (2002). A survey of approximate methods for solving partially observable Markov decision processes. Technical report, Research School of Information Science and Engineering, Australian National University.

[Cassandra et al., 1997] Cassandra, A., Littman, M., and Zhang, N. (1997). Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In Proc. of UAI.

[Dongarra et al., 1988] Dongarra, J. J., Croz, J. D., Hammarling, S., and Hanson, R. J. (1988). An extended set of FORTRAN basic linear algebra subprograms. ACM Trans. Math. Soft., 14:1-17.

[Hauskrecht, 1997] Hauskrecht, M. (1997). Incremental methods for computing bounds in partially observable Markov decision processes. In Proc. of AAAI, Providence, RI.

[Hauskrecht, 2000] Hauskrecht, M. (2000). Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13.

[Howard, 1960] Howard, R. A. (1960). Dynamic Programming and Markov Processes. MIT.

[Kaelbling, 1993] Kaelbling, L. P. (1993). Learning in Embedded Systems. The MIT Press.

[Munos, 2004] Munos, R. (2004). Error bounds for approximate value iteration. Technical Report CMAP 527, École Polytechnique.

[Pineau et al., 2003] Pineau, J., Gordon, G., and Thrun, S. (2003). Point-based value iteration: An anytime algorithm for POMDPs. In Proc. of IJCAI.

[Poon, 2001] Poon, K.-M. (2001). A fast heuristic algorithm for decision-theoretic planning. Master's thesis, The Hong Kong University of Science and Technology.

[Poupart and Boutilier, 2003] Poupart, P. and Boutilier, C. (2003). Bounded finite state controllers. In Proc. of NIPS, Vancouver.

[Poupart and Boutilier, 2004] Poupart, P. and Boutilier, C. (2004). VDCBPI: an approximate scalable algorithm for large scale POMDPs. In Proc. of NIPS, Vancouver.

[Roy and Gordon, 2003] Roy, N. and Gordon, G. (2003). Exponential family PCA for belief compression in POMDPs. In NIPS.

[Smith and Simmons, 2004] Smith, T. and Simmons, R. (2004). Heuristic search value iteration for POMDPs. In Proc. of UAI.

[Sondik, 1971] Sondik, E. J. (1971). The optimal control of partially observable Markov processes. PhD thesis, Stanford University.

[Zhang and Zhang, 2001] Zhang, N. L. and Zhang, W. (2001). Speeding up the convergence of value iteration in partially observable Markov decision processes. Journal of AI Research, 14.
