Monte Carlo Value Iteration with Macro-Actions


In Advances in Neural Information Processing Systems (NIPS), 2011

Zhanwei Lim, David Hsu, Wee Sun Lee
Department of Computer Science, National University of Singapore, Singapore

Abstract

POMDP planning faces two major computational challenges: large state spaces and long planning horizons. The recently introduced Monte Carlo Value Iteration (MCVI) can tackle POMDPs with very large discrete state spaces or continuous state spaces, but its performance degrades when faced with long planning horizons. This paper presents Macro-MCVI, which extends MCVI by exploiting macro-actions for temporal abstraction. We provide sufficient conditions for Macro-MCVI to inherit the good theoretical properties of MCVI. Macro-MCVI does not require explicit construction of probabilistic models for macro-actions and is thus easy to apply in practice. Experiments show that Macro-MCVI substantially improves the performance of MCVI with suitable macro-actions.

1 Introduction

The partially observable Markov decision process (POMDP) provides a principled general framework for planning with imperfect state information. In POMDP planning, we represent an agent's possible states probabilistically as a belief and systematically reason over the space of all beliefs in order to derive a policy that is robust under uncertainty. POMDP planning, however, faces two major computational challenges. The first is the curse of dimensionality. A complex planning task involves a large number of states, which results in a high-dimensional belief space. The second obstacle is the curse of history. In applications such as robot motion planning, an agent often takes many actions before reaching the goal, resulting in a long planning horizon. The complexity of the planning task grows very fast with the horizon.

Point-based approximate algorithms [10, 14, 9] have brought dramatic progress to POMDP planning. Some of the fastest ones, such as HSVI [14] and SARSOP [9], can solve moderately complex POMDPs with hundreds of thousands of states in reasonable time.
The recently introduced Monte Carlo Value Iteration (MCVI) [2] takes one step further. It can tackle POMDPs with very large discrete state spaces or continuous state spaces. The main idea of MCVI is to sample both an agent's state space and the corresponding belief space simultaneously, thus avoiding the prohibitive computational cost of unnecessarily processing these spaces in their entirety. It uses Monte Carlo sampling in conjunction with dynamic programming to compute a policy represented as a finite state controller. Both theoretical analysis and experiments on several robotic motion planning tasks indicate that MCVI is a promising approach for planning under uncertainty with very large state spaces, and it has already been applied successfully to compute the threat resolution logic for aircraft collision avoidance systems in 3-D space [1].

However, the performance of MCVI degrades as the planning horizon increases. Temporal abstraction using macro-actions is effective in mitigating the negative effect and has achieved good results in earlier work on Markov decision processes (MDPs) and POMDPs (see Section 2). In this work, we show that macro-actions can be seamlessly integrated into MCVI, leading to the Macro-MCVI algorithm. Unfortunately, the theoretical properties of MCVI, such as the approximation error bounds [2], do not carry over to Macro-MCVI automatically if arbitrary mappings from belief to actions are allowed as macro-actions. We give sufficient conditions for the good theoretical properties to be retained, transforming POMDPs into a particular type of partially observable semi-Markov decision processes (POSMDPs) in which the lengths of macro-actions are not observable.

A major advantage of the new algorithm is its ability to abstract away the lengths of macro-actions in planning and reduce the effect of long planning horizons. Furthermore, it does not require explicit probabilistic models for macro-actions and treats them just like primitive actions in MCVI. This simplifies macro-action construction and is a major benefit in practice. Macro-MCVI can also be used to construct a hierarchy of macro-actions for planning in large spaces. Experiments show that the algorithm is effective with suitably designed macro-actions.

2 Related Works

Macro-actions have long been used to speed up planning and learning algorithms for MDPs (see, e.g., [6, 15, 3]). Similarly, they have been used in offline policy computation for POMDPs [16, 8]. Macro-actions can be composed hierarchically to further improve scalability [4, 11]. These earlier works rely on vector representations for beliefs and value functions, making it difficult to scale up to large state spaces. Macro-actions have also been used in online search algorithms for POMDPs [7].

Macro-MCVI is related to Hansen and Zhou's work [5]. The earlier work uses finite state controllers for policy representation and policy iteration for policy computation, but it has not yet been shown to work on large state spaces. Expectation-maximization (EM) can be used to train finite state controllers [17] and potentially handle large state spaces, but it often gets stuck in local optima.

3 Planning with Macro-actions

We would like to generalize POMDPs to handle macro-actions. Ideally, the generalization should retain properties of POMDPs such as piecewise linear and convex finite-horizon value functions. We would also like the approximation bounds for MCVI [2] to hold with macro-actions. We would like to allow our macro-actions to be as powerful as possible.
A very powerful representation for a macro-action would be to allow it to be an arbitrary mapping from belief to action that will run until some termination condition is met. Unfortunately, the value function of a process with such macro-actions need not even be continuous. Consider the following simple finite-horizon example, with horizon one. Assume that there are two primitive actions, both with constant rewards, regardless of state. Consider two macro-actions, one that selects the poorer primitive action all the time and another that selects the better primitive action for some beliefs. Clearly, the second macro-action dominates the first macro-action over the entire belief space. The reward for the second macro-action takes two possible values depending on which action is selected for the belief. The reward function also forms the optimal value function of the process and need not even be continuous, as the macro-action can be an arbitrary mapping from belief to action.

Next, we give sufficient conditions for the process to retain piecewise linearity and convexity of the value function. We do this by constructing a type of partially observable semi-Markov decision process (POSMDP) with the desired property. The POSMDP does not need to have the length of the macro-action observed, a property that can be practically very useful as it allows the branching factor for search to be significantly smaller. Furthermore, the process is a strict generalization of a POMDP, as it reduces to a POMDP when all the macro-actions have length one.

3.1 Partially Observable Semi-Markov Decision Process

Finite-horizon (undiscounted) POSMDPs were studied in [18]. Here, we focus on a type of infinite-horizon discounted POSMDPs whose transition intervals are not observable. Our POSMDP is formally defined as a tuple $(S, A, O, T, R, \gamma)$, where $S$ is a state space, $A$ is a macro-action space, $O$ is a macro-observation space, $T$ is a joint transition and observation function, $R$ is a reward function, and $\gamma \in (0, 1)$ is a discount factor.
If we apply a macro-action $a$ with start state $s_i$, $T = p(s_j, o, k \mid s_i, a)$ encodes the joint conditional probability of the end state $s_j$, macro-observation $o$, and the number of time steps $k$ that it takes for $a$ to reach $s_j$ from $s_i$. We could decompose $T$ into a state-transition function and an observation function, but avoid doing so here to remain general and simplify the notation. The reward function $R$ gives the discounted cumulative reward for a macro-action $a$ that starts at state $s$: $R(s, a) = \sum_{t=0}^{\infty} \gamma^t E(r_t \mid s, a)$, where $E(r_t \mid s, a)$ is the expected reward at step $t$. Here we assume that the reward is 0 once a macro-action terminates.

For convenience, we will work with reweighted beliefs, instead of beliefs. Assuming that the number of states is $n$, a reweighted belief (like a belief) is a vector of $n$ non-negative numbers that sums to one. By assuming that the POSMDP process will stop with probability $1 - \gamma$ at each time step, we can interpret the reweighted belief as the conditional probability of a state given that the process has not stopped. This gives an interpretation of the reweighted belief in terms of the discount factor. Given a reweighted belief, we compute the next reweighted belief given macro-action $a$ and observation $o$, $b' = \tau(b, a, o)$, as follows:

$$b'(s) = \frac{\sum_{k=1}^{\infty} \gamma^{k-1} \sum_{i=1}^{n} p(s, o, k \mid s_i, a)\, b(s_i)}{\sum_{k=1}^{\infty} \gamma^{k-1} \sum_{j=1}^{n} \sum_{i=1}^{n} p(s_j, o, k \mid s_i, a)\, b(s_i)}. \tag{1}$$

We will simply refer to the reweighted belief as a belief from here on. We denote the denominator $\sum_{k=1}^{\infty} \gamma^{k-1} \sum_{j=1}^{n} \sum_{i=1}^{n} p(s_j, o, k \mid s_i, a)\, b(s_i)$ by $p_\gamma(o \mid a, b)$. The value of $p_\gamma(o \mid a, b)$ can be interpreted as the probability that observation $o$ is received and the POSMDP has not stopped. Note that $\sum_o p_\gamma(o \mid a, b)$ may sum to less than 1 due to discounting.

A policy $\pi$ is a mapping from a belief to a macro-action. Let $R(b, a) = \sum_s b(s) R(s, a)$. The value of a policy $\pi$ can be defined recursively as

$$V_\pi(b) = R(b, \pi(b)) + \gamma \sum_{o} p_\gamma(o \mid \pi(b), b)\, V_\pi(\tau(b, \pi(b), o)).$$

Note that the policy operates on the belief and may not know the number of steps taken by the macro-actions. If knowledge of the number of steps is important, it can be added into the observation function in the modeling process.

We now define the backup operator $H$ that operates on a value function $V_m$ and returns $V_{m+1}$:

$$HV(b) = \max_{a} \Big\{ R(b, a) + \gamma \sum_{o \in O} p_\gamma(o \mid a, b)\, V(\tau(b, a, o)) \Big\}. \tag{2}$$

The backup operator is a contractive mapping¹.

Lemma 1 Given value functions $U$ and $V$, $\|HU - HV\|_\infty \le \gamma \|U - V\|_\infty$.

Let the value of an optimal policy, $\pi^*$, be $V^*$. The following theorem is a consequence of the Banach fixed point theorem and Lemma 1.

Theorem 1 $V^*$ is the unique fixed point of $H$ and satisfies the Bellman equation $V^* = HV^*$.

We call a policy an m-step policy if the number of times the macro-actions are applied is $m$. For m-step policies, $V^*$ can be approximated by a finite set of linear functions; the weight vectors of these linear functions are called the $\alpha$-vectors.
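The reweighted belief update of equation (1) can be illustrated on a small tabular model. The following is a minimal sketch, not the authors' implementation: the dictionary-based model `T`, the two-state example, and the function name `belief_update` are all assumptions made for illustration, and the macro-action lengths are taken from a finite list of outcomes rather than an infinite sum.

```python
from typing import Dict, List, Tuple

# Hypothetical tabular model (an assumption for this sketch):
# T[(i, a)] lists outcomes (j, o, k, prob), i.e. p(s_j, o, k | s_i, a).
Outcome = Tuple[int, str, int, float]
Model = Dict[Tuple[int, str], List[Outcome]]


def belief_update(b: List[float], a: str, o: str, T: Model, gamma: float):
    """Return (b', p_gamma(o | a, b)) following equation (1)."""
    unnormalized = [0.0] * len(b)
    for i, b_i in enumerate(b):
        for j, obs, k, prob in T.get((i, a), []):
            if obs == o:
                # Each outcome of length k is weighted by gamma^(k-1).
                unnormalized[j] += gamma ** (k - 1) * prob * b_i
    p_gamma = sum(unnormalized)  # the denominator of equation (1)
    if p_gamma == 0.0:
        raise ValueError("observation has zero probability under b and a")
    return [w / p_gamma for w in unnormalized], p_gamma


# Two states. From state 0 the macro-action "go" ends in state 1 after
# either 1 or 3 primitive steps; from state 1 it ends after 1 step.
T: Model = {
    (0, "go"): [(1, "done", 1, 0.5), (1, "done", 3, 0.5)],
    (1, "go"): [(1, "done", 1, 1.0)],
}
b_next, p = belief_update([0.5, 0.5], "go", "done", T, gamma=0.9)
print(b_next, p)
```

In this example the only observation is "done", yet `p` is below 1, which matches the remark that the sum of $p_\gamma(o \mid a, b)$ over observations may be less than 1 due to discounting.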
Theorem 2 The value function for an m-step policy is piecewise linear and convex and can be represented as

$$V_m(b) = \max_{\alpha \in \Gamma_m} \sum_{s \in S} \alpha(s)\, b(s), \tag{3}$$

where $\Gamma_m$ is a finite collection of $\alpha$-vectors.

As $V_m$ is convex and converges to $V^*$, $V^*$ is also convex.

¹ Proofs of the results in this section are included in the supplementary material.

3.2 Macro-action Construction

We would like to construct macro-actions from primitive actions of a POMDP in order to use temporal abstraction to help solve difficult POMDP problems. A partially observable Markov decision process (POMDP) is defined by a finite state space $S$, a finite action space $A$, a reward function $R(s, a)$, an observation space $O$, and a discount $\gamma \in (0, 1)$.

In our POSMDP, the probability function $p(s_j, o, k \mid s_i, a)$ for a macro-action must be independent of the history given the current state $s_i$; hence the selection of primitive actions and termination conditions within the macro-action cannot depend on the belief. We examine some allowable dependencies here. Due to partial observability, it is often not possible to allow the primitive action and the termination condition to be functions of the initial state. Dependence on the portion of history that occurs after the macro-action has started is, however, allowed. In some POMDPs, a subset of the state variables are always observed and can be used to decide the next action. In fact, we may sometimes explicitly construct observed variables to remember relevant parts of the history prior to the start of the macro-action (see Section 5); these can be considered as parameters that are passed on to the macro-action. Hence, one way to construct the next action in a macro-action is to make it a function of the history since the macro-action started, $x_k, a_k, o_{k+1}, \ldots, x_{t-1}, a_{t-1}, o_t, x_t$, where $x_i$ is the fully observable subset of state variables at time $i$, and $k$ is the starting time of the macro-action. Similarly, when the termination criterion and the observation function of the macro-action depend only on the history $x_k, a_k, o_{k+1}, \ldots, x_{t-1}, a_{t-1}, o_t, x_t$, the macro-action can retain a transition function that is independent of the history given the initial state. Note that the observation to be passed on to the POSMDP to create the POSMDP observation space, $O$, is part of the design tradeoff: usually it is desirable to reduce the number of observations in order to reduce complexity without degrading the value of the POSMDP too much. In particular, we may not wish to include the execution length of the macro-action if it does not contribute much towards obtaining a good policy.

4 Monte Carlo Value Iteration with Macro-Actions

We have shown that if the action space $A$ and the observation space $O$ of a POSMDP are discrete, then the optimal value function $V^*$ can be approximated arbitrarily closely by a piecewise-linear, convex function. Unfortunately, when $S$ is very high-dimensional (or continuous), a vector representation is no longer effective. In this section, we show how the Monte Carlo Value Iteration (MCVI) algorithm [2], which has been designed for POMDPs with very large or infinite state spaces, can be extended to POSMDPs.

Instead of $\alpha$-vectors, MCVI uses an alternative policy representation called a policy graph $G$. A policy graph is a directed graph with labeled nodes and edges.
Each node of $G$ is labeled with a macro-action $a$ and each edge of $G$ is labeled with an observation $o$. To execute a policy $\pi_G$, it is treated as a finite state controller whose states are the nodes of $G$. Given an initial belief $b$, a starting node $v$ of $G$ is selected and its associated macro-action $a_v$ is performed. The controller then transitions from $v$ to a new node $v'$ by following the edge $(v, v')$ labeled with the observation received, $o$. The process then repeats with the new controller node $v'$.

Let $\pi_{G,v}$ denote a policy represented by $G$, when the controller always starts in node $v$ of $G$. We define the value $\alpha_v(s)$ to be the expected total reward of executing $\pi_{G,v}$ with initial state $s$. Hence

$$V_G(b) = \max_{v \in G} \sum_{s \in S} \alpha_v(s)\, b(s). \tag{4}$$

$V_G$ is completely determined by the $\alpha$-functions associated with the nodes of $G$.

4.1 MC-Backup

One way to approximate the value function is to repeatedly run the backup operator $H$ starting from an arbitrary value function until it is close to convergence. This algorithm is called value iteration (VI). Value iteration can be carried out on policy graphs as well, as a policy graph provides an implicit representation of a value function. Let $V_G$ be the value function for a policy graph $G$. Substituting (4) into (2), we get

$$HV_G(b) = \max_{a \in A} \Big\{ \sum_{s \in S} R(s, a)\, b(s) + \gamma \sum_{o \in O} p_\gamma(o \mid a, b) \max_{v \in G} \sum_{s \in S} \alpha_v(s)\, b'(s) \Big\}, \tag{5}$$

where $b' = \tau(b, a, o)$.

It is possible to then evaluate the right-hand side of (5) via sampling and Monte Carlo simulation at a belief $b$. The outcome is a new policy graph $G'$ with value function $\hat{H}_b V_G$. This is called MC-backup of $G$ at $b$ (Algorithm 1) [2].

There are $|A||G|^{|O|}$ possible ways to generate a new policy graph $G'$ that has one new node compared to the old policy graph. Algorithm 1 computes an estimate of the best new policy graph at $b$ using only $N|A||G|$ samples. Furthermore, we can show that MC-backup approximates the standard VI backup (equation (5)) well at $b$, with error decreasing at the rate $O(1/\sqrt{N})$. Let $R_{\max}$ be the largest absolute value of the reward, $r_t$, at any time step.

Algorithm 1 MC-Backup of a policy graph $G$ at a belief $b \in \mathcal{B}$ with $N$ samples.

MC-BACKUP($G$, $b$, $N$)
 1: For each action $a \in A$, $R_a \leftarrow 0$.
 2: For each action $a \in A$, each observation $o \in O$, and each node $v \in G$, $V_{a,o,v} \leftarrow 0$.
 3: for each action $a \in A$ do
 4:   for $i = 1$ to $N$ do
 5:     Sample a state $s_i$ with probability $b(s_i)$.
 6:     Simulate taking macro-action $a$ in state $s_i$. Generate a new state $s'_i$, observation $o_i$, and discounted reward $R'(s_i, a)$ by sampling from $p(s_j, o, k \mid s_i, a)$.
 7:     $R_a \leftarrow R_a + R'(s_i, a)$.
 8:     for each node $v \in G$ do
 9:       Set $V'$ to be the expected total reward of simulating the policy represented by $G$, with initial controller state $v$ and initial state $s'_i$.
10:       $V_{a,o_i,v} \leftarrow V_{a,o_i,v} + V'$.
11:   for each observation $o \in O$ do
12:     $V_{a,o} \leftarrow \max_{v \in G} V_{a,o,v}$.
13:     $v_{a,o} \leftarrow \arg\max_{v \in G} V_{a,o,v}$.
14:   $V_a \leftarrow (R_a + \gamma \sum_{o \in O} V_{a,o})/N$.
15: $V^* \leftarrow \max_{a \in A} V_a$.
16: $a^* \leftarrow \arg\max_{a \in A} V_a$.
17: Create a new policy graph $G'$ by adding a new node $u$ to $G$. Label $u$ with $a^*$. For each $o \in O$, add the edge $(u, v_{a^*,o})$ and label it with $o$.
18: return $G'$.

Theorem 3 Given a policy graph $G$ and a point $b \in \mathcal{B}$, MC-BACKUP($G$, $b$, $N$) produces an improved policy graph such that

$$\big| \hat{H}_b V_G(b) - HV_G(b) \big| \le \frac{2 R_{\max}}{1 - \gamma} \sqrt{\frac{2\big(|O| \ln |G| + \ln(2|A|) + \ln(1/\tau)\big)}{N}},$$

with probability at least $1 - \tau$.

The proof uses the Hoeffding bound together with the union bound. Details can be found in [2].

MC-backup can be combined with point-based POMDP planning, which samples the belief space $\mathcal{B}$. Point-based POMDP algorithms use a set $B$ of points sampled from $\mathcal{B}$ as an approximate representation of $\mathcal{B}$. In contrast to the standard VI backup operator $H$, which performs a backup at every point in $\mathcal{B}$, the operator $\hat{H}_B$ applies MC-BACKUP($G_m$, $b$, $N$) on a policy graph $G_m$ at every point in $B$. This results in $|B|$ new policy graph nodes. $\hat{H}_B$ then produces a new policy graph $G_{m+1}$ by adding the new policy graph nodes to the previous policy graph $G_m$.

Let $\delta_B = \sup_{b \in \mathcal{B}} \min_{b' \in B} \|b - b'\|_1$ be the maximum $L_1$ distance from any point in $\mathcal{B}$ to the closest point in $B$. Let $V_0$ be the value function for some initial policy graph and $V_{m+1} = \hat{H}_B V_m$.
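The structure of Algorithm 1 can be sketched in a few lines on a toy problem. This is a hedged illustration rather than the authors' code: the simulator `simulate`, the two-state "left"/"goal" world, the list-of-tuples graph representation, and the truncated single rollout used for $V'$ are all assumptions. The sketch also discounts each continuation value by $\gamma^k$ for a macro-action that took $k$ primitive steps, which is one way to realize the $\gamma\, p_\gamma$ weighting of equation (5) when simulating.

```python
import random

GAMMA = 0.95


def simulate(state, action, rng):
    """Toy macro-action simulator (assumed). Returns (end_state, obs, reward, k),
    where reward is discounted from the start of the macro-action."""
    if action == "go" and state == "left":
        k = rng.randint(1, 3)  # the macro-action length varies
        return "goal", "at_goal", 10.0 * GAMMA ** (k - 1), k
    obs = "at_goal" if state == "goal" else "nothing"
    return state, obs, 0.0, 1


def rollout(graph, v, s, depth, rng):
    """Discounted return of running the policy graph from node v and state s,
    truncated after `depth` macro-actions."""
    total, discount = 0.0, 1.0
    for _ in range(depth):
        action, edges = graph[v]
        s, o, reward, k = simulate(s, action, rng)
        total += discount * reward
        discount *= GAMMA ** k
        v = edges[o]
    return total


def mc_backup(graph, sample_state, actions, observations, n_samples, depth, rng):
    """Return (new_graph, estimated_value); new_graph has one extra node."""
    best = None
    for a in actions:
        r_a = 0.0
        v_aov = {o: [0.0] * len(graph) for o in observations}
        for _ in range(n_samples):
            s = sample_state(rng)                      # sample s_i from b
            s_next, o, reward, k = simulate(s, a, rng)
            r_a += reward
            for v in range(len(graph)):
                v_aov[o][v] += GAMMA ** k * rollout(graph, v, s_next, depth, rng)
        edges, v_a = {}, r_a
        for o in observations:
            best_v = max(range(len(graph)), key=lambda v: v_aov[o][v])
            edges[o] = best_v                          # edge for observation o
            v_a += v_aov[o][best_v]
        v_a /= n_samples
        if best is None or v_a > best[0]:
            best = (v_a, a, edges)
    value, a_star, edges = best
    return graph + [(a_star, edges)], value


# Initial graph: a single node that always stays put.
G0 = [("stay", {"at_goal": 0, "nothing": 0})]
new_graph, value = mc_backup(G0, lambda rng: "left", ["go", "stay"],
                             ["at_goal", "nothing"], 50, 5, random.Random(0))
print(new_graph[-1], value)
```

With the belief concentrated on "left", the new node is labeled with the macro-action "go", since "stay" never collects any reward in this toy model.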
The theorem below bounds the approximation error between $V_m$ and the optimal value function $V^*$.

Theorem 4 For every $b \in \mathcal{B}$,

$$\big| V^*(b) - V_m(b) \big| \le \frac{2 R_{\max}}{(1 - \gamma)^2} \sqrt{\frac{2\big(|O| \ln(|B| m) + \ln(2|A|) + \ln(|B| m/\tau)\big)}{N}} + \frac{2 R_{\max}}{(1 - \gamma)^2}\, \delta_B + \frac{2 \gamma^m R_{\max}}{1 - \gamma},$$

with probability at least $1 - \tau$.

The proof requires the contraction property and a Lipschitz property that can be derived from the piecewise linearity of the value function. Having established those results in Section 3.1, the rest of the proof follows from the proof in [2]. The first term in the bound in Theorem 4 comes from Theorem 3, showing that the error from sampling decays at the rate $O(1/\sqrt{N})$ and can be reduced by taking a large enough sample size. The second term depends on how well the set $B$ covers $\mathcal{B}$ and can be reduced by sampling a larger number of beliefs. The last term depends on the number of MC-backup iterations and decays exponentially with $m$.
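To get a feel for how the three terms of Theorem 4 scale, the bound can be evaluated numerically. The sketch below simply transcribes the right-hand side of the theorem; the function name `theorem4_terms` and all parameter values are illustrative assumptions, not numbers from the experiments.

```python
import math


def theorem4_terms(r_max, gamma, n_obs, n_actions, n_beliefs, m, tau,
                   n_samples, delta_b):
    """Return the (sampling, coverage, horizon) terms of the Theorem 4 bound."""
    log_term = (n_obs * math.log(n_beliefs * m)
                + math.log(2 * n_actions)
                + math.log(n_beliefs * m / tau))
    sampling = 2 * r_max / (1 - gamma) ** 2 * math.sqrt(2 * log_term / n_samples)
    coverage = 2 * r_max / (1 - gamma) ** 2 * delta_b
    horizon = 2 * gamma ** m * r_max / (1 - gamma)
    return sampling, coverage, horizon


# Illustrative (assumed) problem sizes.
common = dict(r_max=10.0, gamma=0.95, n_obs=4, n_actions=6, n_beliefs=100,
              tau=0.05, delta_b=0.01)
s_1k, _, h_10 = theorem4_terms(m=10, n_samples=1000, **common)
s_4k, _, h_20 = theorem4_terms(m=10, n_samples=4000, **common)[0], None, \
    theorem4_terms(m=20, n_samples=1000, **common)[2]

print(s_1k / s_4k)   # quadrupling N halves the sampling term
print(h_10 / h_20)   # ten more iterations shrink the last term by gamma**10
```

The worst-case constants are loose, so the absolute numbers are of little practical use; the ratios show the $O(1/\sqrt{N})$ and $\gamma^m$ behaviour described above.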


Figure 1: (a) Underwater Navigation: A reduced map with an 11 × 12 grid is shown with "S" marking the possible initial positions, "D" marking the destinations, "R" marking the rocks and "O" marking the locations where the robot can localize completely. (b) Collaborative search and capture: Two robotic agents catching 12 escaped crocodiles in a grid. (c) Vehicular ad-hoc networking: A UAV maintains an ad-hoc network over four ground vehicles in a 10 × 10 grid with "B" marking the base and "D" the destinations.

4.2 Algorithm

Theorem 4 bounds the performance of the algorithm when given a set of beliefs. Macro-MCVI, like MCVI, samples beliefs incrementally in practice and performs backup at the sampled beliefs. Branch and bound is used to avoid sampling unimportant parts of the belief space. See [2] for details.

The other important component in a practical algorithm is the generation of the next belief; Macro-MCVI uses a particle filter for that. Given the macro-action construction as described in Section 3.2, a simple particle filter is easily implemented to approximate the next belief function in equation (1): sample a set of states from the current belief; from each sampled state, simulate the current macro-action until termination, keeping track of its path length, $t$; if the observation at termination matches the desired observation, keep the particle; the set of particles that are kept are weighted by $\gamma^t$ and then renormalized to form the next belief². Similarly, MC-backup is performed by simply running simulations of the macro-actions; there is no need to store additional transition and observation matrices, allowing the method to run for very large state spaces.

² A more sophisticated approximation of the belief can be constructed but may require more knowledge of the underlying POMDP and more computation.

5 Experiments

We now illustrate the use of macro-actions for temporal abstraction in three POMDPs of varying complexity. Their state spaces range from relatively small to very large. Correspondingly, the macro-actions range from relatively simple ones to much more complex ones forming a hierarchy.

Underwater Navigation: The underwater navigation task was introduced in [9]. In this task, an autonomous underwater vehicle (AUV) navigates in an environment modeled as a 51 × 52 grid map. The AUV needs to move from the left border to the right border while avoiding the rocks scattered near its destination. The AUV has six actions: move north, move south, move east, move north-east, move south-east or stay in the same location. Due to poor visibility, the AUV can only localize itself along the top or bottom borders where there are beacon signals.

This problem has several interesting characteristics. First, the relatively small state space size of 2653 means that solvers that use $\alpha$-vectors, such as SARSOP [9], can be used. Second, the dynamics of the robot is actually noiseless, hence the main difficulty is localization from the robot's initially unknown location.

We use 5 macro-actions that move in a direction (north, south, east, north-east, or south-east) until either a beacon signal or the destination is reached. We also define an additional macro-action that navigates to the nearest goal location if the AUV position is known, or simply stays in the same location if the AUV position is not known. To enable proper behaviour of the last macro-action, we augment the state space with a fully observable state variable that indicates the current AUV location. The variable is initialized to a value denoting "unknown" but takes the value of the current AUV location after the beacon signal is received. This gives a simple example where the original state space is augmented with a fully observable state variable to allow more sophisticated macro-action behaviour.
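The particle filter of Section 4.2 and a "move until a beacon is reached" macro-action can be sketched together on a toy grid. This is only an illustration under stated assumptions: the 4 × 5 grid, the two directions, the coarse "beacon" observation (which hides the position, unlike the task above, so that the $\gamma^t$ weighting is visible), and the names `run_macro_action` and `particle_filter_update` are all hypothetical.

```python
import random
from collections import defaultdict

WIDTH, HEIGHT = 4, 5  # assumed toy grid, not the 51 x 52 map of the task
MOVES = {"north": (0, -1), "south": (0, 1)}
BEACON_ROWS = (0, HEIGHT - 1)  # beacons along the top and bottom borders


def run_macro_action(state, direction):
    """Move in `direction` until a beacon row or the map edge is reached.
    Returns (end_state, observation, path_length). Dynamics are noiseless."""
    x, y = state
    dx, dy = MOVES[direction]
    steps = 0
    while 0 <= x + dx < WIDTH and 0 <= y + dy < HEIGHT:
        x, y = x + dx, y + dy
        steps += 1
        if y in BEACON_ROWS:
            break
    obs = "beacon" if y in BEACON_ROWS else None
    return (x, y), obs, max(steps, 1)  # a macro-action takes at least one step


def particle_filter_update(belief, direction, observation, gamma,
                           n_particles, rng):
    """Approximate the next belief: sample, simulate, filter, weight, normalize."""
    states = list(belief)
    weights = [belief[s] for s in states]
    new_weights = defaultdict(float)
    for s in rng.choices(states, weights=weights, k=n_particles):
        end, obs, t = run_macro_action(s, direction)
        if obs == observation:              # keep only matching particles
            new_weights[end] += gamma ** t  # weight by the path length
    total = sum(new_weights.values())
    if total == 0.0:
        raise ValueError("no particle produced the requested observation")
    return {s: w / total for s, w in new_weights.items()}


# The AUV is either one step or three steps below the top border.
belief = {(1, 1): 0.5, (3, 3): 0.5}
next_belief = particle_filter_update(belief, "north", "beacon", 0.9,
                                     5000, random.Random(0))
print(next_belief)
```

Because the particle that started at (3, 3) needs three steps to reach the beacon, its end state (3, 0) receives less weight than (1, 0). Weighting by $\gamma^t$ rather than $\gamma^{t-1}$ differs from equation (1) only by a constant factor that cancels in the normalization.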

Collaborative Search and Capture: In this problem, a group of crocodiles had escaped from its enclosure into the environment and two robotic agents have to collaborate to hunt down and capture the crocodiles (see Figure 1). Both agents are centrally controlled and each agent can make a one-step move in one of the four directions (north, south, east and west) or stay still at each time instance. There are twelve crocodiles in the environment. At every time instance, each crocodile moves to a location furthest from the agent that is nearest to it with probability $1 - p$ ($p = 0.05$ in the experiments). With probability $p$, the crocodile moves randomly. A crocodile is captured when it is at the same location as an agent. The agents do not know the exact location of the crocodiles, but each agent knows the number of crocodiles in the top left, top right, bottom left and bottom right quadrants around itself from the noise made by the crocodiles. Each captured crocodile gives a reward of 10, while movement is free.

We define twenty-five macro-actions in which each agent moves (north, south, east, west, or stay) along a passageway until one of them reaches an intersection. In addition, the macro-actions only return the observation made at the point when the macro-action terminates, reducing the complexity of the problem, possibly at a cost of some sub-optimality. In this problem, the macro-actions are simple, but the state space is extremely large (approximately ).

Vehicular Ad-hoc Network: In a post-disaster search and rescue scenario, a group of rescue vehicles are deployed for operation work in an area where communication infrastructure has been destroyed. The rescue units need a high-bandwidth network to relay images of ground situations. An Unmanned Aerial Vehicle (UAV) can be deployed to maintain WiFi network communication between the ground units. The UAV needs to visit each vehicle as often as possible to pick up and deliver data packets [13]. In this task, 4 rescue vehicles and 1 UAV navigate in a terrain modeled as a 10 × 10 grid map.
There are obstacles on the terrain that are impassable to ground vehicles but passable to the UAV. The UAV can move in one of the four directions (north, south, east, and west) or stay in the same location at every time step. The vehicles set off from the same base and move along some predefined path towards their pre-assigned destinations where they will start their operations, randomly stopping along the way. Upon reaching its destination, the vehicle may roam around the environment randomly while carrying out its mission. The UAV knows its own location on the map and can observe the location of a vehicle if they are in the same grid square. To elicit a policy with low network latency, there is a penalty of 0.1 × (number of time steps since the last visit of a vehicle) for each time step for each vehicle. There is a reward of 10 for each time a vehicle is visited by the UAV. The state space consists of the vehicles' locations, the UAV location in the grid map and the number of time steps since each vehicle was last seen (for computing the reward).

We abstract the movements of the UAV to search and visit a single vehicle as macro-actions. There are two kinds of search macro-actions for each vehicle: search for a vehicle along its predefined path and search for a vehicle that has started to roam randomly. To enable the macro-actions to work effectively, the state space is also augmented with the previously seen location of each vehicle. Each macro-action is in turn hierarchically constructed by solving the simplified POMDP task of searching for a single vehicle on the same map using basic actions and some simple macro-actions that move along the paths. This problem has both complex hierarchically constructed macro-actions and a very large state space.

5.1 Experimental setup

We applied Macro-MCVI to the above tasks and compared its performance with the original MCVI algorithm. We also compared with a state-of-the-art off-line POMDP solver, SARSOP [9], on the underwater navigation task. SARSOP could not run on the other two tasks, due to their large state space sizes. For each task, we ran Macro-MCVI until the average total reward stabilized.
We then ran the competing algorithms for at least the same amount of time. The exact running times are difficult to control because of our implementation limitations. To confirm the comparison results, we also ran the competing algorithms 100 times longer when possible. All experiments were conducted on a 16-core Intel Xeon 2.4 GHz computer server.

Neither MCVI nor SARSOP uses macro-actions. We are not aware of other efficient off-line macro-action POMDP solvers that have been demonstrated on very large state space problems. Some online search algorithms, such as PUMA [7], use macro-actions and have shown strong results. Online search algorithms do not generate a policy, making a fair comparison difficult. Despite that, they are useful as baseline references; we implemented a variant of PUMA as one such reference. In our experiments, we simply gave the online search algorithms as much or more time than Macro-MCVI and report the results here. PUMA uses open-loop macro-actions. As a baseline reference for online solvers with closed-loop macro-actions, we also created an online search variant of Macro-MCVI by removing the MC-backup component. We refer to this variant as Online-Macro. It is similar to other recent online POMDP algorithms [12], but uses the same closed-loop macro-actions as Macro-MCVI does.

5.2 Results

The performance of the different algorithms is shown in Figure 2 with 95% confidence intervals.

The underwater navigation task consists of two phases: the localization phase and the navigate-to-goal phase. Macro-MCVI's policy takes one macro-action, moving north-east until reaching the border, to localize and another macro-action, navigating to the goal, to reach the goal. In contrast, both MCVI and SARSOP fail to match the performance of Macro-MCVI even when they are run 100 times longer. Online-Macro does well, as the planning horizon is short with the use of macro-actions. PUMA, however, does not do as well, as it uses the less powerful open-loop macro-actions, which move in the same direction for a fixed number of time steps.

Figure 2: Performance comparison (columns: Reward, Time (s)).
Underwater Navigation: Macro-MCVI, MCVI, SARSOP, PUMA, Online-Macro.
Collaborative Search & Capture: Macro-MCVI, MCVI, PUMA (1.04 ±), Online-Macro.
Vehicular Ad-Hoc Network: Macro-MCVI, MCVI, Greedy.

For the collaborative search & capture task, MCVI fails to match the performance of Macro-MCVI even when it is run 100 times longer. PUMA and Online-Macro do badly as they fail to search deep enough and do not have the benefit of reusing sub-policies obtained from the backup operation. To confirm that it is the backup operation and not the shorter per-macro-action time that is responsible for the performance difference, we ran Online-Macro for a much longer time and found the result unchanged.
The vehicular ad-hoc network task was solved hierarchically in two stages. We first used Macro-MCVI to solve for the policy that finds a single vehicle. This stage took roughly 8 hours of computation time. We then used the single-vehicle policy as a macro-action and solved for the higher-level policy that plans over the macro-actions. Although it took substantial computation time, Macro-MCVI generated a reasonable policy in the end. In contrast, MCVI, without macro-actions, fails badly for this task. Due to the long running time involved, we did not run MCVI 100 times longer. To confirm that the policy computed by Macro-MCVI at the higher level of the hierarchy is also effective, we manually crafted a greedy policy over the single-vehicle macro-actions. This greedy policy always searches for the vehicle that has not been visited for the longest duration. The experimental results indicate that the higher-level policy computed by Macro-MCVI is more effective than the greedy policy. We did not apply online algorithms to this task, as we are not aware of any simple way to hierarchically construct macro-actions online.

6 Conclusions

We have successfully extended MCVI, an algorithm for solving very large state space POMDPs, to include macro-actions. This allows MCVI to use temporal abstraction to help solve difficult POMDP problems. The method inherits the good theoretical properties of MCVI and is easy to apply in practice. Experiments show that it can substantially improve the performance of MCVI when used with appropriately chosen macro-actions.

Acknowledgement

We thank Tomás Lozano-Pérez and Leslie Kaelbling from MIT for many insightful discussions. This work is supported in part by MoE grant MOE2010-T and MDA GAMBIT grant R

References

[1] H. Bai, D. Hsu, M. J. Kochenderfer, and W. S. Lee. Unmanned aircraft collision avoidance using continuous-state POMDPs. In Proc. Robotics: Science & Systems.
[2] H. Bai, D. Hsu, W. S. Lee, and V. Ngo. Monte Carlo Value Iteration for Continuous-State POMDPs. In Algorithmic Foundations of Robotics IX: Proc. Int. Workshop on the Algorithmic Foundations of Robotics (WAFR). Springer.
[3] A. G. Barto and S. Mahadevan. Recent advances in hierarchical reinforcement learning. Discrete Event Dynamic Systems, 13:2003.
[4] T. G. Dietterich. Hierarchical reinforcement learning with the MAXQ value function decomposition. J. Artificial Intelligence Research, 13.
[5] E. Hansen and R. Zhou. Synthesis of hierarchical finite-state controllers for POMDPs. In Proc. Int. Conf. on Automated Planning and Scheduling.
[6] M. Hauskrecht, N. Meuleau, L. P. Kaelbling, T. Dean, and C. Boutilier. Hierarchical solution of Markov decision processes using macro-actions. In Proc. Conf. on Uncertainty in Artificial Intelligence. Citeseer.
[7] R. He, E. Brunskill, and N. Roy. PUMA: Planning under uncertainty with macro-actions. In Proc. AAAI Conf. on Artificial Intelligence.
[8] H. Kurniawati, Y. Du, D. Hsu, and W. S. Lee. Motion planning under uncertainty for robotic tasks with long time horizons. Int. J. Robotics Research, 30(3).
[9] H. Kurniawati, D. Hsu, and W. S. Lee. SARSOP: Efficient point-based POMDP planning by approximating optimally reachable belief spaces. In Proc. Robotics: Science & Systems.
[10] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In Int. Jnt. Conf. on Artificial Intelligence, volume 18.
[11] J. Pineau, N. Roy, and S. Thrun. A hierarchical approach to POMDP planning and execution. In Workshop on Hierarchy & Memory in Reinforcement Learning (ICML), volume 156.
[12] S. Ross, J. Pineau, S. Paquet, and B. Chaib-Draa. Online planning algorithms for POMDPs. Journal of Artificial Intelligence Research, 32(1).
[13] A. Sivakumar and C. K. Y. Tan. UAV swarm coordination using cooperative control for establishing a wireless communications backbone. In Proc. Int. Conf. on Autonomous Agents & Multiagent Systems.
[14] T. Smith and R. Simmons. Heuristic search value iteration for POMDPs. In Proc. Conf. on Uncertainty in Artificial Intelligence. AUAI Press.
[15] R. S. Sutton, D. Precup, and S. Singh. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning. Artificial Intelligence, 112(1).
[16] G. Theocharous and L. P. Kaelbling. Approximate planning in POMDPs with macro-actions. Advances in Neural Information Processing Systems, 17.
[17] M. Toussaint, L. Charlin, and P. Poupart. Hierarchical POMDP controller optimization by likelihood maximization. In Proc. Conf. on Uncertainty in Artificial Intelligence.
[18] C. C. White. Procedures for the solution of a finite-horizon, partially observed, semi-Markov optimization problem. Operations Research, 24(2).

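Supplementary Material for Monte Carlo Value Iteration with Macro-Actions

Lemma 1 Given value functions $U$ and $V$, $\|HU - HV\|_\infty \le \gamma \|U - V\|_\infty$.

Proof. Let $b$ be an arbitrary belief and assume that $HV(b) \le HU(b)$ holds. Let $a^*$ be the optimal macro-action for $HU(b)$. Then

$$\begin{aligned}
0 \le HU(b) - HV(b) &\le R(b, a^*) + \gamma \sum_{o \in O} p_\gamma(o \mid a^*, b)\, U(\tau(b, a^*, o)) - R(b, a^*) - \gamma \sum_{o \in O} p_\gamma(o \mid a^*, b)\, V(\tau(b, a^*, o)) \\
&= \gamma \sum_{o \in O} p_\gamma(o \mid a^*, b) \big[ U(\tau(b, a^*, o)) - V(\tau(b, a^*, o)) \big] \\
&\le \gamma \sum_{o \in O} p_\gamma(o \mid a^*, b)\, \|U - V\|_\infty \\
&\le \gamma \|U - V\|_\infty.
\end{aligned}$$

The last step uses $\sum_{o} p_\gamma(o \mid a^*, b) \le 1$. Since $\|\cdot\|_\infty$ is symmetrical, the result is the same for the case of $HU(b) \le HV(b)$. By taking $\|\cdot\|_\infty$ over all weighted beliefs, we get $\|HU - HV\|_\infty \le \gamma \|U - V\|_\infty$. Thus, $H$ is a contractive mapping.

Theorem 2 The value function for an m-step policy is piecewise linear and convex and can be represented as

$$V_m(b) = \max_{\alpha \in \Gamma_m} \sum_{s \in S} \alpha(s)\, b(s), \tag{1}$$

where $\Gamma_m$ is a finite collection of $\alpha$-vectors.

Proof. We prove this property by induction. When $m = 1$, the initial value function $V_1$ is the best expected reward and can be written as $V_1(b) = \max_a R(b, a) = \max_a \sum_s R(s, a)\, b(s)$. This has the same form as $V_m(b) = \max_{\alpha_m \in \Gamma_m} \sum_s \alpha_m(s)\, b(s)$, where there is one linear $\alpha$-vector for each macro-action. $V_1(b)$ can therefore be represented as a finite collection of $\alpha$-vectors.

Assume that the optimal value function for any $b_{i-1}$ is represented using a finite set of $\alpha$-vectors $\Gamma_{i-1} = \{\alpha^0_{i-1}, \alpha^1_{i-1}, \ldots\}$ and

$$V_{i-1}(b_{i-1}) = \max_{\alpha_{i-1} \in \Gamma_{i-1}} \sum_{s} b_{i-1}(s)\, \alpha_{i-1}(s). \tag{2}$$

Substituting

$$b_{i-1}(s) = \frac{\sum_{j=1}^{\infty} \gamma^{j-1} \sum_{s'} p(s, o, j \mid s', a)\, b_i(s')}{p_\gamma(o \mid a, b_i)}$$

into (2), we get

$$V_{i-1}(b_{i-1}) = \max_{\alpha_{i-1} \in \Gamma_{i-1}} \sum_{s} \frac{\sum_{j=1}^{\infty} \gamma^{j-1} \sum_{s'} p(s, o, j \mid s', a)\, b_i(s')}{p_\gamma(o \mid a, b_i)}\, \alpha_{i-1}(s).$$

Substituting it into the backup equation (equation (2) of the main text) gives

$$\begin{aligned}
V_i(b_i) &= \max_{a} \Big\{ R(b_i, a) + \gamma \sum_{o \in O} p_\gamma(o \mid a, b_i) \max_{\alpha_{i-1} \in \Gamma_{i-1}} \sum_{s} \frac{\sum_{j=1}^{\infty} \gamma^{j-1} \sum_{s'} p(s, o, j \mid s', a)\, b_i(s')}{p_\gamma(o \mid a, b_i)}\, \alpha_{i-1}(s) \Big\} \\
&= \max_{a} \Big\{ R(b_i, a) + \sum_{o \in O} \max_{\alpha_{i-1} \in \Gamma_{i-1}} \sum_{s} \sum_{j=1}^{\infty} \gamma^{j} \sum_{s'} p(s, o, j \mid s', a)\, b_i(s')\, \alpha_{i-1}(s) \Big\} \\
&= \max_{a} \max_{\alpha^1_{i-1}, \ldots, \alpha^{|O|}_{i-1} \in \Gamma_{i-1}} \sum_{s' \in S} b_i(s') \Big[ R(s', a) + \sum_{o \in O} \sum_{j=1}^{\infty} \sum_{s} \gamma^{j}\, p(s, o, j \mid s', a)\, \alpha^{o}_{i-1}(s) \Big].
\end{aligned}$$

The expression in the square bracket can evaluate to $|A||\Gamma_{i-1}|^{|O|}$ different vectors. We can rewrite $V_i(b_i)$ as

$$V_i(b_i) = \max_{\alpha_i \in \Gamma_i} \sum_{s} \alpha_i(s)\, b_i(s).$$

Hence $V_i(b_i)$ can be represented by a finite set of $\alpha$-vectors.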
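The induction step in the proof of Theorem 2 can be checked on a tiny tabular POSMDP by enumerating the bracketed expression for every macro-action and every assignment of an $\alpha$-vector to each observation. The sketch below is an illustration only: the two-state model, the dictionaries `R` and `T`, and the function names are assumptions, and macro-action lengths come from a finite list of outcomes.

```python
from itertools import product


def exact_backup(alphas, actions, observations, n_states, R, T, gamma):
    """Enumerate the |A| * |Gamma|^|O| candidate alpha-vectors of one backup.

    T[(i, a)] lists outcomes (j, o, k, prob), i.e. p(s_j, o, k | s_i, a).
    """
    new_alphas = []
    for a in actions:
        for choice in product(range(len(alphas)), repeat=len(observations)):
            pick = dict(zip(observations, choice))  # alpha index per observation
            vec = []
            for i in range(n_states):
                value = R[(i, a)]
                for j, o, k, prob in T[(i, a)]:
                    value += gamma ** k * prob * alphas[pick[o]][j]
                vec.append(value)
            new_alphas.append(vec)
    return new_alphas


def value(alphas, b):
    """Piecewise-linear convex value: max over alpha-vectors of alpha . b."""
    return max(sum(x * y for x, y in zip(alpha, b)) for alpha in alphas)


# Assumed toy model with two states, two macro-actions, two observations.
actions, observations = ["go", "stay"], ["x", "y"]
R = {(0, "go"): 1.0, (1, "go"): 0.0, (0, "stay"): 0.0, (1, "stay"): 2.0}
T = {
    (0, "go"): [(1, "x", 2, 1.0)],
    (1, "go"): [(1, "x", 1, 1.0)],
    (0, "stay"): [(0, "y", 1, 1.0)],
    (1, "stay"): [(1, "x", 1, 1.0)],
}
gamma_1 = [[R[(s, a)] for s in range(2)] for a in actions]  # one vector per action
gamma_2 = exact_backup(gamma_1, actions, observations, 2, R, T, gamma=0.9)
print(len(gamma_2))
```

Here `len(gamma_2)` equals $|A||\Gamma_1|^{|O|} = 2 \cdot 2^2$. Many of the enumerated vectors are duplicates or dominated, and the exponential growth of this set is what motivates the sampled MC-backup of the main text.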
