Compact, Convex Upper Bound Iteration for Approximate POMDP Planning


Tao Wang, University of Alberta
Pascal Poupart, University of Waterloo
Michael Bowling and Dale Schuurmans, University of Alberta

Abstract

Partially observable Markov decision processes (POMDPs) are an intuitive and general way to model sequential decision making problems under uncertainty. Unfortunately, even approximate planning in POMDPs is known to be hard, and developing heuristic planners that can deliver reasonable results in practice has proved to be a significant challenge. In this paper, we present a new approach to approximate value-iteration for POMDP planning that is based on quadratic rather than piecewise linear function approximators. Specifically, we approximate the optimal value function by a convex upper bound composed of a fixed number of quadratics, and optimize it at each stage by semidefinite programming. We demonstrate that our approach can achieve competitive approximation quality to current techniques while still maintaining a bounded size representation of the function approximator. Moreover, an upper bound on the optimal value function can be preserved if required. Overall, the technique requires computation time and space that is only linear in the number of iterations (horizon time).

Introduction

Partially observable Markov decision processes (POMDPs) are a general model of an agent acting in an environment, where the effects of the agent's actions and the observations it can make about the current state of the environment are both subject to uncertainty. The agent's goals are specified by rewards it receives (as a function of the states it visits and actions it executes), and an optimal behavior strategy in this context chooses actions, based on the history of observations, that maximize the long term reward of the agent. POMDPs have become an important modeling formalism in robotics and autonomous agent design (Thrun, Burgard, & Fox 2005; Pineau et al. 2003). Much of the current work on robot navigation and mapping, for example, is now based on stochastic transition and observation models (Thrun, Burgard, & Fox 2005; Roy, Gordon, & Thrun 2005).
Moreover, POMDP representations have also been used to design autonomous agents for real world applications, including nursing (Pineau et al. 2003) and elderly assistance (Boger et al. 2005).

Copyright © 2006, American Association for Artificial Intelligence. All rights reserved.

Despite their convenience as a modeling framework, however, POMDPs pose difficult computational problems. It is well known that solving for optimal behavior strategies, or even just approximating optimal strategies, in a POMDP is intractable (Madani, Hanks, & Condon 2003; Mundhenk et al. 2000). Therefore, a lot of work has focused on developing heuristics for computing reasonable behavior strategies for POMDPs. These approaches have generally followed three broad strategies: value function approximation (Hauskrecht 2000; Spaan & Vlassis 2005; Pineau, Gordon, & Thrun 2003; Parr & Russell 1995), policy based optimization (Ng & Jordan 2000; Poupart & Boutilier 2003; 2004; Amato, Bernstein, & Zilberstein 2006), and stochastic sampling (Kearns, Mansour, & Ng 2002; Thrun 2000). In this paper, we focus on the value function approximation approach and contribute a new perspective to this strategy.

Most previous work on value function approximation for POMDPs has focused on representations that explicitly maintain a set of α-vectors or belief states. This is motivated by the fact that the optimal value function, considered as a function of the belief state, is determined by the maximum of a set of linear functions specified by α-vectors, where each α-vector is associated with a specific behavior policy. Since the optimal value function is given by the maximum of a (large) set of α-vectors, it is natural to consider approximating it by a subset of α-vectors, or at least a small set of linear functions. In fact, even an exact representation of the optimal value function need not keep every α-vector, but only those that are maximal for at least some witness belief state.
Motivated by this characterization, most value function approximation strategies attempt to maintain a smaller subset of α-vectors by focusing on a reduced set of belief states (Spaan & Vlassis 2005; Hauskrecht 2000; Pineau, Gordon, & Thrun 2003). Although much recent progress has been made on α-vector based approximations, a drawback of this approach is that the number of α-vectors stored generally has to grow with the number of value iterations to maintain an adequate approximation (Pineau, Gordon, & Thrun 2003).

In this paper, we consider an alternative approach that drops the notion of an α-vector entirely from the approximation strategy. Instead we exploit the other fundamental observation about the nature of the optimal value function: since it is determined by a belief-state-wise maximum over linear functions, the optimal value function must be a convex function of the belief state (Sondik 1978; Boyd & Vandenberghe 2004). Our strategy, then, is to compute a convex approximation to the optimal value function that is based on quadratic rather than linear functions of the belief state.

The advantage of using a quadratic basis for value function approximation is several-fold. First, the size of the representation does not have to grow merely to model an increasing number of facets in the optimal solution; thus we can keep a bounded size representation at each horizon. Second, a quadratic representation allows one to conveniently maintain a provable upper bound on the optimal values in an explicit compact representation, without requiring auxiliary linear programming to be used to retrieve the bound, as in current grid based approaches (Hauskrecht 2000; Smith & Simmons 2005). Third, the computational cost of updating the approximation does not change with iteration number (either in time or space), so the overall computation time is only linear in the horizon. Finally, as we demonstrate below, despite a significant reduction in representation size, convex quadratics are still able to achieve competitive approximation quality on benchmark POMDP problems.

Background

We begin with Markov decision processes (MDPs), since we will need to exploit some basic concepts from MDPs in our approach below. An MDP is defined by a set of states $S$, a set of actions $A$, a state transition model $p(s'|s,a)$, and a reward model $r(s,a)$. In this setting, a deterministic policy is specified by a function from states to actions, $\pi : S \to A$, and the value function for a policy is defined as the expected future discounted reward the policy obtains from each state

$$V^\pi(s) = E_\pi\Big[\sum_{t=0}^{\infty} \gamma^t\, r(s_t, \pi(s_t)) \,\Big|\, s_0 = s\Big]$$

Here the discount factor, $0 \le \gamma < 1$, expresses a tradeoff between short term and long term reward. It is known that there exists a deterministic optimal policy whose value function dominates all other policy values in every state (Bertsekas 1995).
This optimal value function also satisfies the Bellman equation

$$V^*(s) = \max_a \; r(s,a) + \gamma \sum_{s'} p(s'|s,a)\, V^*(s') \qquad (1)$$

Computing the optimal value function for a given MDP can be accomplished in several ways. The two ways we consider below are value iteration and linear programming. Value iteration is based on repeatedly applying the Bellman backup operator, $V_{n+1} = HV_n$, specified by

$$V_{n+1}(s) = \max_a \; r(s,a) + \gamma \sum_{s'} p(s'|s,a)\, V_n(s') \qquad (2)$$

It can be shown that $V_n \to V^*$ in the $L_\infty$ norm, and thus $V^*$ is a fixed point of (2) (Bertsekas 1995). $V^*$ is also the solution to the linear program

$$\min_V \sum_s V(s) \quad \text{s.t.} \quad V(s) \ge r(s,a) + \gamma \sum_{s'} p(s'|s,a)\, V(s') \qquad (3)$$

for all $s \in S$ and $a \in A$.

It turns out that for continuous state spaces, the Bellman equation (1) still characterizes the optimal value function, replacing the transition probabilities with conditional densities and the sums with Lebesgue integrals. However, computationally, the situation is not so simple for continuous state spaces, since the integrals must now somehow be solved in place of the sums, and (3) is no longer finitely defined. Nevertheless, continuous state spaces are unavoidable when one considers POMDP planning.

POMDPs extend MDPs by introducing an observation model $p(o|a,s')$ that governs how a noisy observation $o \in O$ is related to the underlying state $s'$ and the action $a$. Having access to only noisy observations of the state complicates the problem of choosing optimal actions significantly. The agent now never knows the exact state of the environment, but instead must infer a distribution over possible states, $b(s)$, from the history of observations and actions. Nevertheless, given an action $a$ and observation $o$, the agent's belief state can be easily updated by Bayes rule

$$b'_{(b,a,o)}(s') = p(o|a,s') \sum_s p(s'|s,a)\, b(s) \,/\, Z \qquad (4)$$

where $Z = p(o|b,a) = \sum_{s'} p(o|a,s') \sum_s p(s'|s,a)\, b(s)$. By the Markov assumption, the belief state is a sufficient representation upon which an optimal behavior strategy can be defined (Sondik 1978). Therefore, a policy is naturally specified in this setting by a function from belief states to actions, $\pi : B \to A$, where $B$ is the set of all possible distributions over the underlying state space $S$ (an $|S|-1$ dimensional simplex).
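The Bayes update in (4) can be sketched in a few lines of Python. This is only an illustration: the nested-list layout `T[a][s][s2]` for $p(s'|s,a)$ and `Z[a][s2][o]` for $p(o|a,s')$, the function name `belief_update`, and the two-state example model are assumptions made here, not part of the paper.

```python
def belief_update(b, a, o, T, Z):
    """Bayes rule (4): return (b', p(o | b, a)).

    Assumed layout (hypothetical, for illustration only):
      T[a][s][s2] = p(s2 | s, a)   transition model
      Z[a][s2][o] = p(o | a, s2)   observation model
    """
    n = len(b)
    # Unnormalized posterior: p(o | a, s') * sum_s p(s' | s, a) b(s)
    unnorm = [
        Z[a][s2][o] * sum(T[a][s][s2] * b[s] for s in range(n))
        for s2 in range(n)
    ]
    norm = sum(unnorm)  # this is Z = p(o | b, a) in (4)
    if norm == 0.0:
        raise ValueError("observation has zero probability under (b, a)")
    return [x / norm for x in unnorm], norm


# Hypothetical two-state example: one action that leaves the state unchanged
# and a sensor that reports the true state with probability 0.85.
T_example = [[[1.0, 0.0], [0.0, 1.0]]]
Z_example = [[[0.85, 0.15], [0.15, 0.85]]]
b_next, p_obs = belief_update([0.5, 0.5], 0, 0, T_example, Z_example)
```

The function also returns the normalizer because the same quantity $p(o|b,a)$ weights each successor belief in the belief-state Bellman equation.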
Obviously, for any environment with two or more states there are an infinite number of belief states, and not every policy can be finitely represented. Nevertheless, one can still define the value function of a policy as the expected future discounted reward obtained from each belief state

$$V^\pi(b) = E_\pi\Big[\sum_{t=0}^{\infty} \gamma^t\, r(b_t, \pi(b_t)) \,\Big|\, b_0 = b\Big]$$

where $r(b,a) = \sum_s r(s,a)\, b(s)$. Thus, a POMDP can be treated as an MDP over belief states; that is, a continuous state MDP. As before, an optimal policy obtains the maximum value for each belief state, and its value function satisfies the Bellman equation (Sondik 1978)

$$V^*(b) = \max_a \; r(b,a) + \gamma \sum_{b'} p(b'|b,a)\, V^*(b') = \max_a \; r(b,a) + \gamma \sum_o p(o|b,a)\, V^*(b'_{(b,a,o)}) \qquad (5)$$

Unfortunately, solving the functional equation (5) for $V^*$ is hard. Known techniques for computing the optimal value function are generally based on value iteration (Cassandra, Littman, & Zhang 1997; Zhang & Zhang 2001), although policy based approaches are also possible (Sondik 1978; Poupart & Boutilier 2003; 2004). As above, value iteration is based on repeatedly applying a Bellman backup operator, $V_{n+1} = HV_n$, to a current value function approximation. In this case, a current lower bound, $V_n$, is represented by a finite set of α-vectors, $\Gamma_n = \{\alpha_\pi : \pi \in \Pi_n\}$, where each α-vector is associated with an $n$-step behavior strategy $\pi$. Given $\Gamma_n$, the value function is represented by

$$V_n(b) = \max_{\alpha_\pi \in \Gamma_n} b^\top \alpha_\pi$$

At each stage of value iteration, the current lower bound is updated according to the Bellman backup, $V_{n+1} = HV_n$, such that

$$V_{n+1}(b) = \max_a \; r(b,a) + \gamma \sum_o p(o|b,a)\, V_n(b'_{(b,a,o)}) \qquad (6)$$
$$\phantom{V_{n+1}(b)} = \max_a \; b^\top r_a + \gamma \sum_o \max_{\pi' \in \Pi_n} b^\top g_{(\pi',a,o)}$$
$$\phantom{V_{n+1}(b)} = \max_{a,\{o \to \pi'\}} b^\top \alpha_{a,\{o \to \pi'\}} \qquad (7)$$

where we use the quantities

$$g_{(\pi',a,o)}(s) = \sum_{s'} p(o|a,s')\, p(s'|a,s)\, \alpha_{\pi'}(s'), \qquad \alpha_{a,\{o \to \pi'\}} = r_a + \gamma \sum_o g_{(\pi',a,o)}$$

Once again it is known that $V_n \to V^*$ in the $L_\infty$ norm, and thus $V^*$ is a fixed point of (6) (Sondik 1978). Although the size of the representation for $V_{n+1}$ remains finite, it can be exponentially larger than $V_n$ in the worst case, since enumerating every possibility for $a, \{o \to \pi'\}$ over $a \in A$, $o \in O$, $\pi' \in \Pi_n$ yields $|\Pi_{n+1}| \le |A|\,|\Pi_n|^{|O|}$ combinations. Many of these α-vectors are not maximal for any belief state, and can be pruned by running a linear program for each that verifies whether there is a witness belief state for which it is maximal (Cassandra, Littman, & Zhang 1997). Thus, the set of α-vectors, $\Gamma_n$, action strategies, $\Pi_n$, and witness belief states, $B_n$, are all associated 1 to 1. However, even with pruning, exact value iteration cannot be run for many steps, even on small problems.

Value function approximation strategies

Much research has focused on approximating the optimal value function, aimed for the most part at reducing the time and space complexity of the value iteration update. Work in this area has considered various strategies (Hauskrecht 2000), including direct MDP approximations and variants, and using function approximation to fit $V_{n+1}$ over sampled belief states (Parr & Russell 1995; Littman, Cassandra, & Kaelbling 1995). However, two approaches have recently become the most dominant: grid based and belief point approximations.

The grid based approach (Gordon 1995; Hauskrecht 2000; Zhou & Hansen 2001; Bonet 2002) maintains a finite collection of belief states along with associated value estimates $\{\langle b, V_n(b)\rangle : b \in B_{grid}\}$. These value estimates are updated by applying the Bellman update on $B_{grid}$.
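As a rough illustration of the α-vector backup in (6) and (7), the sketch below computes, for one belief state, the α-vector that attains the maximum in (7) without enumerating all $|A|\,|\Pi_n|^{|O|}$ combinations. The array layout (`R[a][s]`, `T[a][s][s2]`, `Z[a][s2][o]`) and the names `value` and `backup_at` are assumptions made for this example.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def value(b, alphas):
    """V_n(b) = max over alpha-vectors of b . alpha."""
    return max(dot(b, alpha) for alpha in alphas)


def backup_at(b, alphas, R, T, Z, gamma):
    """Return the alpha-vector of (7) that is maximal at belief b.

    Assumed layout (hypothetical):
      R[a][s] = r(s, a), T[a][s][s2] = p(s2 | s, a), Z[a][s2][o] = p(o | a, s2)
    """
    n_states = len(b)
    n_obs = len(Z[0][0])
    best = None
    for a in range(len(R)):
        alpha_a = list(R[a])
        for o in range(n_obs):
            # g_{(pi',a,o)}(s) = sum_{s'} p(o|a,s') p(s'|s,a) alpha_{pi'}(s')
            candidates = [
                [
                    sum(Z[a][s2][o] * T[a][s][s2] * alpha[s2]
                        for s2 in range(n_states))
                    for s in range(n_states)
                ]
                for alpha in alphas
            ]
            # For this observation, keep the back-projection that is best at b.
            g = max(candidates, key=lambda cand: dot(b, cand))
            alpha_a = [x + gamma * y for x, y in zip(alpha_a, g)]
        if best is None or dot(b, alpha_a) > dot(b, best):
            best = alpha_a
    return best
```

Because the maximization over $\pi'$ in (6) decomposes across observations for a fixed belief state, choosing the best back-projection per observation yields the same vector as a search over all assignments $\{o \to \pi'\}$.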
An important advantage of this approach is that it can maintain an upper bound on the optimal value function. Unfortunately, maintaining a tight bound entails significant computational expense (Hauskrecht 2000). First, $B_{grid}$ must contain all corners of the simplex so that its convex closure spans $B$. Second, each successor belief state $b'$ in (6) must have its interpolated value estimate minimized by a linear program (Zhou & Hansen 2001). Below we show that this large number of linear programs can be replaced with a single convex optimization.

Unlike the grid based approach, which takes a current belief state in $B_{grid}$ and projects it forward to belief states outside of $B_{grid}$, the belief point approach only considers belief states in a witness set $B_{wit}$ (Pineau, Gordon, & Thrun 2003; Smith & Simmons 2005). Specifically, the belief point approximation maintains a lower bound by keeping a subset of α-vectors associated with these witness belief states. To further explain this approach, let $\Gamma_n = \{\alpha_\pi : \pi \in \hat\Pi_n\}$, so that there is a 1 to 1 correspondence between α-vectors in $\Gamma_n$, action strategies in $\hat\Pi_n$ and belief states in $B_{wit}$. Then the set of α-vectors is updated by applying the Bellman backup, but restricting the choices in (7) to $\pi' \in \hat\Pi_n$, and only computing (7) for $b \in B_{wit}$. Thus, the number of α-vectors in each iteration remains bounded and associated with $B_{wit}$.

The quality of both these approaches is strongly determined by the sets of belief points, $B_{grid}$ and $B_{wit}$, they maintain. For the belief point approach, one generally has to grow the number of belief points at each iteration to maintain an adequate bound on the optimal value function. Pineau et al. (2003) suggested doubling the size at each iteration, but recently a more refined approach was suggested by Smith & Simmons (2005).

Convex quadratic upper bounds

The key observation behind our approach is that one does not need to be confined to piecewise linear approximations. Our intuition is that convex quadratic approximations are particularly well suited for value function approximation in POMDPs.
This is motivated by the fact that each value iteration step produces a maximum over a set of convex functions, yielding a result that is always convex. Thus, one can plausibly use a convex quadratic function to upper bound the maximum over α-vectors, and more generally to upper bound the maximum over any set of back-projected convex value approximations from iteration $n$. Our basic goal then is to retain a compact representation of the value approximation by exploiting the fact that quadratics can be more efficient at approximating a convex upper bound than a set of linear functions; see Figure 1.

[Figure 1: Illustration of a convex quadratic upper bound approximation to a maximum of linear functions $b^\top \alpha_\pi$. The figure shows three α-vectors (α1, α2, α3) and the convex quadratic upper bound.]

As with piecewise linear approximations, the quality of the approximation can be improved by taking a maximum over a set of convex quadratics, which would yield a convex piecewise quadratic rather than piecewise linear approximation. In this paper, however, we will focus on the most naive choice, and approximate the value function with a single quadratic in each step of value iteration. The subsequent extension to multiple quadratics is discussed below.

An important advantage the quadratic form has over other function approximation representations is that it permits a convex minimization of the upper bound, as we demonstrate below. Such a convenient formulation is not readily achievable for other function representations. Also, since we are not compelled to grow the size of the representation at each iteration, we obtain an approach that runs in linear time in the number of value iteration steps.

There are a few drawbacks in dropping the piecewise linear representation, however. One drawback is that we lose the 1 to 1 correspondence between α-vectors and behavior strategies $\pi$, which means that greedy action selection requires a one step look ahead calculation based on (5). The second drawback is that the convex optimization problem we have to solve at each value iteration is more complex than a simple linear program.

Convex upper bound iteration

The main technical challenge we face is to solve for a tight quadratic upper bound on the value function at each stage of value iteration. Interestingly, this can be done effectively with a convex optimization as follows. We represent the value function approximation over belief states by a quadratic form

$$\hat V_n(b) = b^\top W_n b + w_n^\top b + \omega_n \qquad (8)$$

where $W_n$ is a square matrix of weights, $w_n$ is a vector of weights, and $\omega_n$ is a scalar offset weight. Equation (8) defines a convex function of belief state $b$ if and only if the matrix $W_n$ is positive semidefinite (Boyd & Vandenberghe 2004). We denote the semidefinite constraint on $W_n$ by $W_n \succeq 0$.

As shown above, one step of value iteration involves expanding (and back-projecting) a value approximation from stage $n$, and defining the value function at stage $n+1$ by the maximum over the expanded, back-projected set. However, back-projection entails some additional complication in our case because we do not maintain a set of α-vectors, but rather maintain a quadratic function approximation at stage $n$. That is, our approximate value iteration step has to pull the quadratic form through the backup operator. Unfortunately, the result of a backup is no longer a quadratic, but a rational (quadratic over linear) function. Fortunately, however, the result of this backup is still convex, as we now show.

Let the action-value backup of $\hat V$ be denoted by

$$q_a(b) = r(b,a) + \gamma \sum_o p(o|b,a)\, \hat V(b'_{(b,a,o)}) \qquad (9)$$
To express this as a function of $b$, we need to expand the definitions of $b'_{(b,a,o)}$ and $\hat V_n$ respectively. First, note that $b'_{(b,a,o)}$ is a ratio of a vector linear function of $b$ over a scalar linear function of $b$ by (4), therefore we can represent it by

$$b'_{(b,a,o)} = \frac{M_{a,o}\, b}{p(o|b,a)} = \frac{M_{a,o}\, b}{e^\top M_{a,o}\, b} \qquad (10)$$

where $M_{a,o}$ is a matrix such that $M_{a,o}(s',s) = p(o|a,s')\, p(s'|s,a)$, and $e$ denotes the vector of all 1s. Substituting (8) and (10) into (9) yields

$$q_a(b) = r(b,a) + \gamma \sum_o \left[ \frac{b^\top M_{a,o}^\top W M_{a,o}\, b}{e^\top M_{a,o}\, b} + (w + \omega e)^\top M_{a,o}\, b \right]$$

Theorem 1. $q_a(b)$ is convex in $b$.

Proof. First note that $M_{a,o}^\top W M_{a,o} \succeq 0$ if $W \succeq 0$. Since the remaining terms of $q_a$ are linear in $b$, it suffices to show that the function $f(b) = (b^\top N b)/(v^\top b)$ is convex under the assumption $N \succeq 0$ and $v^\top b > 0$. Note that $N \succeq 0$ implies $N = QQ^\top$ for some $Q$, and therefore $f(b) = (b^\top Q Q^\top b)/(v^\top b) = (Q^\top b)^\top (v^\top b\, I)^{-1} (Q^\top b)$. Next, we use a few elementary facts about convexity (Boyd & Vandenberghe 2004). First, a function is convex iff its epigraph is convex, so it suffices to show that the set

$$\{(b,\, v^\top b\, I,\, \delta) : v^\top b\, I \succ 0,\; (Q^\top b)^\top (v^\top b\, I)^{-1} (Q^\top b) \le \delta\}$$

is convex. By the Schur complement lemma, we have that $\delta - (Q^\top b)^\top (v^\top b\, I)^{-1} (Q^\top b) \ge 0$ iff

$$\begin{bmatrix} v^\top b\, I & Q^\top b \\ (Q^\top b)^\top & \delta \end{bmatrix} \succeq 0$$

and therefore $f(b)$ is convex iff the set

$$\left\{(b,\, v^\top b\, I,\, \delta) : v^\top b\, I \succ 0,\; \begin{bmatrix} v^\top b\, I & Q^\top b \\ (Q^\top b)^\top & \delta \end{bmatrix} \succeq 0 \right\}$$

is convex. The result then follows because this set can be written as a linear matrix inequality.

Corollary 1. Given a convex quadratic representation for $\hat V_n$, $\max_a q_a(b)$, and hence $H\hat V_n$, is convex in $b$.

So back-projecting the convex quadratic representation still yields a convex result. Our goal is to optimize a tight quadratic upper bound on the maximum of these convex functions (which of course is still convex). In some approaches below we will use the back-projected action-value functions directly. However, in other cases, it will prove advantageous if we can work with linear upper bounds on the back-projections.

Proposition 1. The tightest linear upper bound on $q_a(b)$ is given by $q_a(b) \le u_a^\top b$ for a vector $u_a$ such that $u_a^\top 1_s = q_a(1_s)$ for each corner belief state $1_s$.

Algorithmic approach

We would like to solve for a quadratic $\hat V_{n+1}$ at stage $n+1$ that obtains as tight an upper bound on $H\hat V_n$ as possible. To do this, we appeal to the linear program characterization of the optimal value function (3), which is also expressed as minimizing an upper bound on the back-projected value function. Unfortunately, here, since we are no longer working with a finite space, we cannot formulate a linear program but rather have to pose a generalized semi-infinite program

$$\min_{W,w,\omega} \int_b \left(b^\top W b + w^\top b + \omega\right) \mu(b)\, db \quad \text{subject to} \quad b^\top W b + w^\top b + \omega \ge q_a(b) \;\; \forall a, b; \quad W \succeq 0 \qquad (11)$$

where $\mu(b)$ is a measure over the space of possible belief states. The semi-infinite program (11) specifies a linear objective subject to linear constraints (albeit infinitely many linear constraints), and hence is a convex optimization problem in $W$, $w$, $\omega$.

There are two main difficulties in solving this convex optimization problem. First, the objective involves an integral with respect to a measure $\mu(b)$ on belief states. This measure is arbitrary (except that it must have full support on the belief space $B$) and allows one to control the emphasis the minimization places on different regions of the belief space. For simplicity, we assume the measure is a Dirichlet distribution, specified by a vector of prior parameters $\theta(s)$, $s \in S$. The Dirichlet distribution is particularly convenient in this context since one can specify a uniform distribution over the belief simplex merely by setting $\theta(s) = 1$ for all $s$. Moreover, the required integrals for the Dirichlet have a closed form solution, which allows us to simply precompute the linear coefficients for the weight parameters, by

$$\int_b \left(b^\top W b + w^\top b + \omega\right) \mu(b)\, db = \langle W, E[bb^\top] \rangle + w^\top E[b] + \omega$$

where $E[b] = \theta / \|\theta\|_1$; $E[bb^\top] = \left(\mathrm{diag}(E[b]) + \|\theta\|_1 E[b] E[b]^\top\right) / (1 + \|\theta\|_1)$ (Gelman et al. 1995); and $\langle A, B \rangle = \sum_{ij} A_{ij} B_{ij}$. That is, one can specify $\theta$ and compute the linear coefficients ahead of time.

The second and more difficult problem with solving (11) is to find a way to cope with the infinite number of linear constraints. Here, we address the problem with a straightforward constraint generation approach.
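To make the precomputation of the objective coefficients concrete, the following sketch evaluates the two Dirichlet moments from the closed form above and then the linear objective $\langle W, E[bb^\top] \rangle + w^\top E[b] + \omega$. The function names `dirichlet_moments` and `objective` are assumptions made for this illustration.

```python
def dirichlet_moments(theta):
    """Return (E[b], E[b b^T]) for b ~ Dirichlet(theta)."""
    total = sum(theta)  # ||theta||_1
    mean = [t / total for t in theta]
    n = len(theta)
    # E[b b^T] = (diag(E[b]) + ||theta||_1 E[b] E[b]^T) / (1 + ||theta||_1)
    second = [
        [
            ((mean[i] if i == j else 0.0) + total * mean[i] * mean[j])
            / (1.0 + total)
            for j in range(n)
        ]
        for i in range(n)
    ]
    return mean, second


def objective(W, w, omega, mean, second):
    """Linear objective <W, E[b b^T]> + w^T E[b] + omega."""
    n = len(w)
    inner = sum(W[i][j] * second[i][j] for i in range(n) for j in range(n))
    return inner + sum(w[i] * mean[i] for i in range(n)) + omega


# Uniform measure over a three-state belief simplex: theta(s) = 1 for all s.
mean_u, second_u = dirichlet_moments([1.0, 1.0, 1.0])
```

Since the moments depend only on $\theta$, they can be computed once and reused at every value iteration step; only $W$, $w$ and $\omega$ change.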
The idea is to solve (11) iteratively, by keeping a finite set of constraints $C$, each corresponding to a belief state, and solving the finite semidefinite program

$$\min_{W,w,\omega} \langle W, E[bb^\top] \rangle + w^\top E[b] + \omega \quad \text{subject to} \quad b_i^\top W b_i + w^\top b_i + \omega \ge q_a(b_i) \;\; \forall a,\, b_i \in C; \quad W \succeq 0 \qquad (12)$$

Given a putative solution, $W$, $w$, $\omega$, a new constraint can be obtained by finding a belief state $b$ that solves

$$\min_b \; b^\top W b + w^\top b + \omega - q_a(b) \quad \text{subject to} \quad b \ge 0,\; \sum_s b(s) = 1 \qquad (13)$$

for each $a$. If the minimum value is nonnegative for all $a$ then there are no violated constraints and we have a solution to (11). Unfortunately, (13) cannot directly be used for constraint generation, since $q_a(b)$ is a convex function of $b$ (Theorem 1) and hence $-q_a(b)$ is concave, yielding a non-convex objective. Thus, to use (13) for constraint generation we need to follow an alternative approach. We have pursued three different approaches to this problem thus far.

Our first strategy maintains a provable upper bound on the optimal value function by strengthening the constraint threshold with the linear upper bound $u_a^\top b \ge q_a(b)$ from Proposition 1. Replacing $q_a(b)$ with $u_a^\top b$ in (11) and (13) ensures that an upper bound will be maintained, but also reduces (13) to a quadratic program that can be efficiently minimized to produce a belief state with maximum constraint violation.

Our second strategy relaxes the upper bound guarantee by only substituting $u_a^\top b$ for $q_a(b)$ in the constraint generation procedure, maintaining an efficient quadratic programming formulation there, but keeping $q_a(b)$ in the main optimization (12). This no longer guarantees an upper bound, but can still produce better approximations in practice because the bounds do not have to be artificially strengthened.

Our final strategy side-steps optimal constraint generation entirely, and instead chooses a fixed set of belief states for the constraint set $C$ in (12). In this way, the semidefinite program (12) needs to be solved only once per value iteration step. This strategy does not produce an upper bound either, but the resulting approximation is fast and effective in practice.
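The quantities that appear in these three strategies can be sketched as follows: the action-value backup $q_a(b)$ of a quadratic via (9) and (10), the linear upper bound $u_a$ of Proposition 1, and a check of the sampled constraints in (12). This is a sketch under stated assumptions, not the authors' implementation: the tuple layouts `V = (W, w, omega)` and `model = (R, T, Z)`, all function names, and the two-state example model are hypothetical, and the semidefinite program itself is not solved here.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def quad(V, b):
    """Quadratic form (8) for V = (W, w, omega)."""
    W, w, omega = V
    return dot(b, [dot(row, b) for row in W]) + dot(w, b) + omega


def q_value(b, a, V, model, gamma):
    """q_a(b) of (9), using M_{a,o}(s', s) = p(o|a,s') p(s'|s,a) from (10)."""
    W, w, omega = V
    R, T, Z = model  # R[a][s], T[a][s][s2], Z[a][s2][o] (assumed layout)
    n = len(b)
    total = dot(R[a], b)
    for o in range(len(Z[a][0])):
        Mb = [Z[a][s2][o] * sum(T[a][s][s2] * b[s] for s in range(n))
              for s2 in range(n)]
        p = sum(Mb)  # e^T M_{a,o} b = p(o | b, a)
        if p > 0.0:  # a zero-probability observation contributes nothing
            WMb = [dot(row, Mb) for row in W]
            total += gamma * (dot(Mb, WMb) / p + dot(w, Mb) + omega * p)
    return total


def linear_upper_bound(a, V, model, gamma):
    """u_a of Proposition 1: the values of q_a at the corner beliefs."""
    n = len(model[0][a])
    corners = [[1.0 if i == s else 0.0 for i in range(n)] for s in range(n)]
    return [q_value(c, a, V, model, gamma) for c in corners]


def violations(C, V_new, V_old, model, gamma, tol=1e-9):
    """Pairs (b, a) with b in C where V_new(b) < q_a(b) backed up from V_old."""
    n_actions = len(model[0])
    return [(b, a) for b in C for a in range(n_actions)
            if quad(V_new, b) < q_value(b, a, V_old, model, gamma) - tol]


# Hypothetical two-state, two-action, two-observation model.
example_model = (
    [[1.0, 0.0], [0.0, 1.0]],                                  # R
    [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.5, 0.5]]],      # T
    [[[0.8, 0.2], [0.3, 0.7]], [[0.8, 0.2], [0.3, 0.7]]],      # Z
)
V_stage_n = ([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0], 0.0)  # W is PSD
sample_C = [[0.3, 0.7], [0.5, 0.5], [0.9, 0.1]]
```

In the third strategy, `sample_C` would be a fixed random sample of belief states and each pair (b, a) contributes one linear constraint on $(W, w, \omega)$ in (12). In the first strategy, the threshold `q_value(b, a, ...)` would be replaced by `dot(linear_upper_bound(a, ...), b)`.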
Finally, to improve approximation quality, one could augment the approximate value function representation with a maximum over a set of quadratics, much as with α-vectors. One natural way to do this would be to maintain a separate quadratic for each action, $a$, in (11).

Experimental results

We implemented the proposed approach using SDPT3 (Toh, Todd, & Tutuncu 1999) as the semidefinite program solver for (12). Specifically, in our initial experiments, we have investigated the third (simplest) strategy mentioned above, CQUB, which only used a random sample of belief states to specify the constraints in $C$. We compared this method to two current value function approximation strategies in the literature: Perseus (Spaan & Vlassis 2005) and PBVI (Pineau, Gordon, & Thrun 2003). Here, both Perseus and PBVI were run with the number of belief states fixed at 1000, whereas the convex quadratic method, CQUB, was run with 100 random belief states.

In our initial experiments, we considered the benchmark problems Maze (Hauskrecht 1997), Tiger-grid, Hallway, Hallway2, and Aircraft, available from … Table 1 gives the problem characteristics. In each case, the number of value iteration steps was fixed as shown in Table 1, and each method was run 10 times to generate an estimate of value function approximation quality.

Table 2 shows the results obtained by the various value function approximation strategies on these domains, reporting the expected discounted reward obtained by the greedy policies defined with respect to the value function estimates, as well as the average time and the size of the value function approximation (see footnote 1).

Table 1: Problem characteristics.

Problems      |S|   |A|   |O|   value iters
Maze          …     …     …     …
Tiger-grid    …     …     …     …
Hallway       …     …     …     …
Hallway2      …     …     …     …
Aircraft      …     …     …     …

Table 2: Mean discounted reward obtained over 1000 trajectories using the greedy policy for each value function approximation, averaged over 10 runs of value iteration.

Problem      Measure        CQUB        Perseus     PBVI
Maze         Avg. reward    … ± …       … ± …       … ± 2.0
             Run time (s)   …           …           …
             Size           …           …           …
Tiger-grid   Avg. reward    2.16 ± …    … ± …       … ± 0.06
             Run time (s)   …           …           …
             Size           …           …           …
Hallway      Avg. reward    0.58 ± …    … ± …       … ± 0.03
             Run time (s)   …           …           …
             Size           …           …           …
Hallway2     Avg. reward    0.43 ± …    … ± …       … ± 0.03
             Run time (s)   …           …           …
             Size           …           …           …
Aircraft     Avg. reward    … ± …       … ± …       … ± 0.42
             Run time (s)   …           …           …
             Size           …           …           …

Footnote 1: For Perseus and PBVI, the size is $|S|$ times the number of α-vectors. For CQUB, the size is just $|S|(|S|+1)/2 + |S| + 1$, which corresponds to the number of variables in the quadratic approximator.

The convex quadratic strategy CQUB performed surprisingly well in these experiments, competing with state of the art value function approximations while only using 100 random belief states for constraint generation in (12). The result is slightly weaker in the Tiger-grid domain, but significantly stronger in the Hallway domains, supporting the thesis that convex quadratics capture value function structure more efficiently than linear approaches.

Conclusions

We have introduced a new approach to value function approximation for POMDPs that is based on a convex quadratic bound rather than a piecewise linear approximation. We have found that quadratic approximators can achieve highly competitive approximation quality without growing the size of the representation, even while explicitly focusing on only a tiny fraction of the belief states. We expect that this approach can lead to new avenues of research in value approximation for POMDPs. We are currently considering extensions to this approach based on belief state compression (Poupart & Boutilier 2002; 2004; Roy, Gordon, & Thrun 2005) and factored models (Boutilier & Poole 1996; Feng & Hansen 2001; Poupart 2005) to tackle POMDPs with large state spaces. We also plan to combine our quadratic value function approximation with policy based and sampling based approaches.
A further idea we are exploring is the interpretation of convex quadratics as second order Taylor approximations to the optimal value function, which offers further algorithmic approaches with the potential for tight theoretical guarantees on approximation quality.

Acknowledgments

Research supported by the Alberta Ingenuity Centre for Machine Learning, NSERC, MITACS, CFI, and the Canada Research Chairs program.

References

Amato, C.; Bernstein, D.; and Zilberstein, S. 2006. Solving POMDPs using quadratically constrained linear programs. In Proceedings of the Fifth International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS).

Bertsekas, D. 1995. Dynamic Programming and Optimal Control, volume 2. Athena Scientific.

Boger, J.; Poupart, P.; Hoey, J.; Boutilier, C.; Fernie, G.; and Mihailidis, A. 2005. A decision-theoretic approach to task assistance for persons with dementia. In Proceedings of the Nineteenth International Joint Conference on Artificial Intelligence (IJCAI).

Bonet, B. 2002. An ε-optimal grid-based algorithm for partially observable Markov decision processes. In Proceedings of the Nineteenth International Conference on Machine Learning (ICML).

Boutilier, C., and Poole, D. 1996. Computing optimal policies for partially observable decision processes using compact representations. In Proceedings of the Thirteenth National Conference on Artificial Intelligence (AAAI).

Boyd, S., and Vandenberghe, L. 2004. Convex Optimization. Cambridge Univ. Press.

Cassandra, A.; Littman, M.; and Zhang, N. 1997. Incremental pruning: A simple, fast, exact method for partially observable Markov decision processes. In Proceedings of the Thirteenth Conference on Uncertainty in Artificial Intelligence (UAI).

Feng, Z., and Hansen, E. A. 2001. Approximate planning for factored POMDPs. In Proceedings of the Sixth European Conference on Planning.

Gelman, A.; Carlin, J.; Stern, H.; and Rubin, D. 1995. Bayesian Data Analysis. Chapman & Hall.

Gordon, G. 1995. Stable function approximation in dynamic programming. In Proceedings of the Twelfth International Conference on Machine Learning (ICML).

Hauskrecht, M. 1997. Incremental methods for computing bounds in partially observable Markov decision processes. In Proceedings of the Fourteenth National Conference on Artificial Intelligence (AAAI).

Hauskrecht, M. 2000. Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research 13.

Kearns, M.; Mansour, Y.; and Ng, A. 2002. A sparse sampling algorithm for near-optimal planning in large Markov decision processes. Machine Learning 49(2-3).

Littman, M.; Cassandra, A.; and Kaelbling, L. 1995. Learning policies for partially observable environments: Scaling up. In Proceedings of the Twelfth International Conference on Machine Learning (ICML).

Madani, O.; Hanks, S.; and Condon, A. 2003. On the undecidability of probabilistic planning and related stochastic optimization problems. Artificial Intelligence 147:5–34.

Mundhenk, M.; Goldsmith, J.; Lusena, C.; and Allender, E. 2000. Complexity of finite-horizon Markov decision processes. Journal of the Association for Computing Machinery 47(4).

Ng, A., and Jordan, M. 2000. Pegasus: A policy search method for large MDPs and POMDPs. In Proceedings of the Sixteenth Conference on Uncertainty in Artificial Intelligence (UAI).

Parr, R., and Russell, S. 1995. Approximating optimal policies for partially observable stochastic domains. In Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence (IJCAI).

Pineau, J.; Montemerlo, M.; Pollack, M.; Roy, N.; and Thrun, S. 2003. Towards robotic assistants in nursing homes: Challenges and results. Robotics and Autonomous Systems 42.

Pineau, J.; Gordon, G.; and Thrun, S. 2003. Point-based value iteration: An anytime algorithm for POMDPs. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence (IJCAI).

Poupart, P., and Boutilier, C. 2002. Value-directed compression of POMDPs. In Advances in Neural Information Processing Systems (NIPS 15).

Poupart, P., and Boutilier, C. 2003. Bounded finite state controllers. In Advances in Neural Information Processing Systems (NIPS 16).

Poupart, P., and Boutilier, C. 2004. VDCBPI: An approximate scalable algorithm for large POMDPs. In Advances in Neural Information Processing Systems (NIPS 17).

Poupart, P. 2005. Exploiting Structure to Efficiently Solve Large Scale Partially Observable Markov Decision Processes. Ph.D. Dissertation, Department of Computer Science, University of Toronto.

Roy, N.; Gordon, G.; and Thrun, S. 2005. Finding approximate POMDP solutions through belief compression. Journal of Artificial Intelligence Research 23:1–40.

Smith, T., and Simmons, R. 2005. Point-based POMDP algorithms: Improved analysis and implementation. In Proceedings of the Twenty-first Conference on Uncertainty in Artificial Intelligence (UAI).

Sondik, E. 1978. The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Operations Research 26.

Spaan, M., and Vlassis, N. 2005. Perseus: Randomized point-based value iteration for POMDPs. Journal of Artificial Intelligence Research 24.

Thrun, S.; Burgard, W.; and Fox, D. 2005. Probabilistic Robotics. MIT Press.

Thrun, S. 2000. Monte Carlo POMDPs. In Advances in Neural Information Processing Systems (NIPS 12).

Toh, K.; Todd, M.; and Tutuncu, R. 1999. SDPT3, a Matlab software package for semidefinite programming. Optimization Methods and Software 11.

Zhang, N., and Zhang, W. 2001. Speeding up the convergence of value iteration in partially observable Markov decision processes. Journal of Artificial Intelligence Research 14.

Zhou, R., and Hansen, E. 2001. An improved grid-based approximation algorithm for POMDPs. In Proceedings of the Seventeenth International Joint Conference on Artificial Intelligence (IJCAI).
