Planning in POMDPs. Dominik Schoenberger Abstract


This document briefly explains what a Partially Observable Markov Decision Process is. Furthermore, it introduces the different approaches available to solve this class of problems. It also gives an overview of techniques for selecting belief points which make value backups more efficient.

1 The Partially Observable Markov Decision Process

While agents used in classical planning are concerned only with environments that are fully observable, the real world of robotic applications is generally a place where it is not possible to observe the whole environment and take deterministic actions. The observed environment might not even be static. For planning under such uncertainty, it is necessary to improve the robustness by explicitly reasoning about the type of uncertainty that can occur. The Partially Observable Markov Decision Process (POMDP) has become possibly the most general representation of this problem.

1.1 Benefits of POMDPs

This is because it combines the most essential features for planning under uncertainty. Whereas other frameworks handle either no uncertainty or only stochastic action effects, POMDPs handle uncertainty in both action effects and state observability. The latter is done by expressing partial state observability through information states instead of world states, since the world states are not directly observable. Here the measurements of noisy and imperfect sensors are used to calculate the information states, which form the beliefs a system has about its world state. These information states are represented by probability distributions over world states.

Many POMDP algorithms form plans by optimizing a value function. This allows a numerical trade-off between alternative ways to satisfy a goal, even multiple interacting goals, and the comparison of actions with different costs or rewards. POMDPs are unique in doing this over information states rather than world states. POMDPs produce a universal plan: a full policy for action selection that prescribes the choice of action for any possible information state and removes the need for replanning.
This makes the execution faster.

1.2 Disadvantages of POMDPs

Nonetheless, this is also the main drawback of POMDPs, because generating a universal plan has a high computational complexity. Most algorithms for exact planning in POMDPs optimize the value function over all possible beliefs, a problem known to be PSPACE-complete. This means that many POMDP domains with only a few states, actions and sensor observations are already computationally intractable. By comparison, propositional planning is only NP-complete. To speed up POMDP solving, a commonly used technique is to restrict value backups to a selected set of belief points.

Another problem, which has long been a key impediment to using POMDPs in practical applications, is that if the value function is expressed by a set of vectors, a vector in this set can be fully dominated by a set of other vectors (see Figure 1). Pruning such dominated vectors can be expensive.
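To make the pruning problem concrete, here is a minimal sketch of the cheapest possible filter, which removes only vectors that are dominated pointwise by one single other vector. The function names and the two-state numbers are illustrative assumptions, not taken from the source.

```python
def dominates(u, v):
    """True if u is at least as good as v in every state and better in one."""
    return all(x >= y for x, y in zip(u, v)) and any(x > y for x, y in zip(u, v))


def prune_pointwise(vectors):
    """Remove duplicates and vectors dominated by a single other vector."""
    unique = []
    for v in vectors:
        if v not in unique:
            unique.append(v)
    return [v for v in unique if not any(dominates(u, v) for u in unique)]


def value(belief, vectors):
    """V(b) = max over alpha of sum_s alpha(s) * b(s)."""
    return max(sum(a * p for a, p in zip(alpha, belief)) for alpha in vectors)


# Hypothetical alpha-vectors over two states (values are made up).
alphas = [[1.0, 0.0], [0.0, 1.0], [0.4, 0.4], [-0.5, -0.2]]
kept = prune_pointwise(alphas)
print(kept)
```

In this example the vector [-0.5, -0.2] is removed, but [0.4, 0.4] survives even though it never attains the maximum: for any belief (p, 1 - p) the first two vectors already give max(p, 1 - p) >= 0.5. This is the situation the text describes, a vector dominated only by a combination of others. Detecting it needs a stronger test over the whole belief simplex (commonly a linear program per vector), which is where the cost comes from.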

Figure 1: Value function vector $\alpha_2$ is dominated by a combination of $\alpha_1$ and $\alpha_3$. (Graphic from Pineau, Gordon & Thrun, Anytime Point-Based Approximations for POMDPs, Figure 1.)

1.3 Basic POMDP terminology

The POMDP is formally defined by six distinct quantities, denoted $\{S, A, T, Z, O, R\}$. These represent the following:

States S. $s$ denotes a state of the world, and the finite set of all states of a world is denoted $S = \{s_0, s_1, \ldots\}$, while the state at time $t$ is denoted $s_t$, with $t$ being a discrete time index. Since the state of a world is not directly observable in POMDPs, an agent can only estimate which state it is in by computing a belief over the state space $S$.

Actions A. The agent is given a set of actions, denoted $A = \{a_0, a_1, \ldots\}$, which it can use to act in the world. These actions affect the state of the world stochastically. Choosing the right action is a function of the history, and this choice is the core problem in POMDPs.

Observations Z. Since a belief over the world state $s$ is needed, the agent derives this belief from sensor measurements. A set of measurements taken at the same time is called an observation $z$. The set of all observations is denoted $Z = \{z_0, z_1, \ldots\}$, where the observation at time $t$ is denoted $z_t$. Any observation $z$ is usually only an incomplete projection of a world state $s$, because of sensor noise.

Reward function R. The function $R(s, a) : S \times A \rightarrow \mathbb{R}$ assigns the reward of performing action $a$ in state $s$. The agent tries to collect as much reward as possible over time, which means it tries to maximize $E[\sum_{t=t_0}^{T} \gamma^{t-t_0} r_t]$, where $E[\cdot]$ is the mathematical expectation, $0 \le \gamma < 1$ is a discount factor ensuring that the sum is finite, and $r_t$ is the reward at time $t$.

State transition probability distribution T. Given that the agent is in state $s$ and selects action $a$, the probability of transitioning to state $s'$ is

$$T(s, a, s') := \Pr(s_t = s' \mid s_{t-1} = s, a_{t-1} = a), \quad \forall (s, a, s').$$

$T$ is a conditional probability distribution, which means that $\sum_{s' \in S} T(s, a, s') = 1, \forall (s, a)$. $T$ is also time-invariant.

Observation probability distribution O. Upon executing action $a$ in state $s$, the probability that the agent will perceive observation $z$ is

$$O(s, a, z) := \Pr(z_t = z \mid s_{t-1} = s, a_{t-1} = a), \quad \forall (s, a, z).$$

$O$ is also a conditional probability distribution, with $\sum_{z \in Z} O(s, a, z) = 1, \forall (s, a)$, and it is also time-invariant.

1.4 Belief computation

Since POMDPs are instances of Markov processes, the current world state $s_t$ is sufficient to predict the future independently of the past $\{s_0, s_1, \ldots, s_{t-1}\}$. Unfortunately, the agent in a POMDP can only perceive observations $\{z_0, \ldots, z_t\}$, because the state is not directly observable. This is why the agent has to compute a belief over the world state instead, using a complete trace of all observations and all actions ever executed. That trace is called a history, here at time $t$:

$$h_t := \{z_0, a_0, z_1, \ldots, z_{t-1}, a_{t-1}, z_t\}$$

If an initial state probability distribution $b_0(s) := \Pr(s_0 = s)$ is available to the agent, the history can also be summarized via a belief distribution

$$b_t(s) := \Pr(s_t = s \mid z_t, a_{t-1}, z_{t-1}, \ldots, a_0, b_0)$$

instead of being represented explicitly. The belief $b_t$ can be calculated recursively using only the last belief $b_{t-1}$, the last action $a_{t-1}$ and the current observation $z_t$. The belief update equation $\tau()$ is defined as follows, equivalent to that of the Bayes filter:

$$\tau(b_{t-1}, a_{t-1}, z_t) = b_t(s') = \frac{\sum_{s} O(s', a_{t-1}, z_t)\, T(s, a_{t-1}, s')\, b_{t-1}(s)}{\Pr(z_t \mid b_{t-1}, a_{t-1})}$$

The denominator is a normalizing constant.

1.5 Optimal policy computation

Computing a policy for selecting actions is the central objective in a POMDP. The policy $\pi(b) \rightarrow a$ chooses an action $a$ for a belief distribution $b$. Since the agent wants to maximize the expected future discounted cumulative reward, the optimal policy is

$$\pi^*(b_{t_0}) = \arg\max_{\pi} E_{\pi}\Big[\sum_{t=t_0}^{T} \gamma^{t-t_0} r_t \,\Big|\, b_{t_0}\Big]$$

A straightforward approach to finding an optimal policy is to apply multiple iterations that compute increasingly accurate values for each belief state $b$. For this a value function $V$ is needed, which maps belief states to values.
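Since every value backup evaluates the belief update $\tau$ of Section 1.4, it may help to see it as code first. The following is a minimal sketch, assuming the model is stored as nested lists `T[s][a][s2]` and `O[s2][a][z]`, with `O` indexed by the state reached after the action, as in the update equation. The one-action, two-state model and its numbers are hypothetical.

```python
def belief_update(b, a, z, T, O):
    """tau(b, a, z): Bayes-filter belief update.

    T[s][a][s2] is the transition probability, O[s2][a][z] the observation
    probability in the successor state s2 (both layouts are assumptions).
    """
    n = len(b)
    unnorm = [
        O[s2][a][z] * sum(T[s][a][s2] * b[s] for s in range(n))
        for s2 in range(n)
    ]
    pr_z = sum(unnorm)  # normalizing constant Pr(z | b, a)
    if pr_z == 0.0:
        raise ValueError("observation has zero probability under (b, a)")
    return [x / pr_z for x in unnorm]


# Hypothetical model: one 'listen' action that leaves the state unchanged
# and reports the true state with probability 0.85.
T = [[[1.0, 0.0]], [[0.0, 1.0]]]
O = [[[0.85, 0.15]], [[0.15, 0.85]]]

b1 = belief_update([0.5, 0.5], 0, 0, T, O)  # roughly [0.85, 0.15]
b2 = belief_update(b1, 0, 0, T, O)          # a second matching observation
print(b1, b2)
```

Two identical observations in a row sharpen the belief from uniform to about 0.85 and then to about 0.97 for the first state, which shows how the belief summarizes the history without storing it.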
The initial value function is:

$$V_0(b) = \max_{a} \sum_{s \in S} R(s, a)\, b(s)$$

In each iteration $t$ the value function is computed recursively; $V_t(b)$ maximizes the sum of all future rewards for any belief state $b$ within $t$ time steps:

$$V_t(b) = \max_{a} \Big[ \sum_{s \in S} R(s, a)\, b(s) + \gamma \sum_{z \in Z} \Pr(z \mid a, b)\, V_{t-1}(\tau(b, a, z)) \Big]$$

This way it produces a policy that is optimal under the same planning horizon $t$:

$$\pi_t^*(b) = \arg\max_{a} \Big[ \sum_{s \in S} R(s, a)\, b(s) + \gamma \sum_{z \in Z} \Pr(z \mid a, b)\, V_{t-1}(\tau(b, a, z)) \Big]$$

Each of these value functions, at any planning horizon $t$, can be expressed by a set of vectors $\Gamma_t = \{\alpha_0, \alpha_1, \ldots, \alpha_m\}$, each vector representing an $|S|$-dimensional hyperplane and defining the value function over a bounded region of the belief space:

$$V_t(b) = \max_{\alpha \in \Gamma_t} \sum_{s \in S} \alpha(s)\, b(s)$$

Each of these α-vectors is associated with an action $a$ to create a policy that already assumes optimal behavior for the following steps:

$$V_t(b) = \max_{a \in A} \Big[ \sum_{s \in S} R(s, a)\, b(s) + \gamma \sum_{z \in Z} \max_{\alpha \in \Gamma_{t-1}} \sum_{s \in S} \sum_{s' \in S} T(s, a, s')\, O(s', a, z)\, \alpha(s')\, b(s) \Big]$$

$V_t(b)$ cannot be computed directly for each belief because there are infinitely many beliefs. However, the corresponding set $\Gamma_t$ can be generated by a sequence of operations on the previous set $\Gamma_{t-1}$. For each action $a$ the set $\Gamma_t^{a,*}$ is computed, along with the intermediate set $\Gamma_t^{a,z}$ for each observation $z$:

$$\Gamma_t^{a,*} \leftarrow \alpha^{a,*}(s) = R(s, a)$$

$$\Gamma_t^{a,z} \leftarrow \alpha_i^{a,z}(s) = \gamma \sum_{s' \in S} T(s, a, s')\, O(s', a, z)\, \alpha_i(s'), \quad \forall \alpha_i \in \Gamma_{t-1}$$

Next the cross-sum over observations, $\Gamma_t^a, \forall a \in A$, is created, including one $\alpha^{a,z}$ from each $\Gamma_t^{a,z}$:

$$\Gamma_t^a = \Gamma_t^{a,*} + \Gamma_t^{a,z_1} \oplus \Gamma_t^{a,z_2} \oplus \ldots$$

and the union is taken over all $\Gamma_t^a$ sets:

$$\Gamma_t = \bigcup_{a \in A} \Gamma_t^a$$

In this form the pieces of the solution for the value function at horizon $t$ can be backed up. To extract the value function from the set $\Gamma_t$, the α-vectors are applied to the equation for $V_t(b)$ from above:

$$V_t(b) = \max_{\alpha \in \Gamma_t} \sum_{s \in S} \alpha(s)\, b(s)$$

1.6 Point-based value backup

While there are many different approaches to selecting the belief points to be updated, the procedure for performing the update is standard for all of them and is implemented as a sequence of operations on a set of α-vectors. Since the update of the value function is only applied at a fixed set of belief points $B = \{b_0, b_1, \ldots, b_q\}$, there is a corresponding set of vectors $\{\alpha_0, \alpha_1, \ldots, \alpha_q\}$ containing at most one vector for each belief. It is assumed that the belief points in a region around $b$ have the same action choice and also lead to the same facets of $V_{t-1}$ as this point $b$. For this point only one of its α-vectors from a given solution set $\Gamma_{t-1}$ is used for the point-based backup.
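For comparison, the exact backup of Section 1.5 can be sketched in a few lines, which makes it visible why restricting the update to a fixed belief set matters. This is a sketch under assumptions: the nested-list layouts `R[s][a]`, `T[s][a][s2]`, `O[s2][a][z]` and the two-state toy model (a "listen" action and a "reset" action with made-up numbers) are not from the source, and no pruning is performed.

```python
from itertools import product


def project(alpha, a, z, T, O, gamma):
    """One vector of Gamma^{a,z}: gamma * sum_s2 T(s,a,s2) O(s2,a,z) alpha(s2)."""
    n = len(alpha)
    return [
        gamma * sum(T[s][a][s2] * O[s2][a][z] * alpha[s2] for s2 in range(n))
        for s in range(n)
    ]


def exact_backup(Gamma_prev, R, T, O, gamma):
    """Unpruned exact backup: cross-sum over observations, union over actions."""
    n_states, n_actions, n_obs = len(R), len(R[0]), len(O[0][0])
    Gamma = []
    for a in range(n_actions):
        per_obs = [
            [project(alpha, a, z, T, O, gamma) for alpha in Gamma_prev]
            for z in range(n_obs)
        ]
        # Cross-sum: pick one projected vector per observation, add R(., a).
        for choice in product(*per_obs):
            Gamma.append(
                [R[s][a] + sum(v[s] for v in choice) for s in range(n_states)]
            )
    return Gamma


def value(b, Gamma):
    """V(b) = max over alpha in Gamma of sum_s alpha(s) b(s)."""
    return max(sum(x * p for x, p in zip(alpha, b)) for alpha in Gamma)


# Hypothetical toy model: action 0 listens, action 1 resets the state.
R = [[-1.0, 10.0], [-1.0, -10.0]]
T = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0], [0.5, 0.5]]]
O = [[[0.85, 0.15], [0.5, 0.5]], [[0.15, 0.85], [0.5, 0.5]]]
gamma = 0.9

Gamma0 = [[R[s][a] for s in range(2)] for a in range(2)]
Gamma1 = exact_backup(Gamma0, R, T, O, gamma)
Gamma2 = exact_backup(Gamma1, R, T, O, gamma)
print(len(Gamma0), len(Gamma1), len(Gamma2))  # 2 8 128
```

Without pruning, each backup produces $|A| \cdot |\Gamma_{t-1}|^{|Z|}$ vectors, so even this tiny model goes from 2 to 8 to 128 vectors in two steps.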
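To obtain the next solution set $\Gamma_t$, the set $\Gamma_t^{a,*}$ is generated for all actions, and the same is done for $\Gamma_t^{a,z}$ for all actions and observations:

$$\Gamma_t^{a,*} \leftarrow \alpha^{a,*}(s) = R(s, a)$$

$$\Gamma_t^{a,z} \leftarrow \alpha_i^{a,z}(s) = \gamma \sum_{s' \in S} T(s, a, s')\, O(s', a, z)\, \alpha_i(s'), \quad \forall \alpha_i \in \Gamma_{t-1}$$

Next, instead of a cross-sum, a simple summation is calculated to obtain one vector $\alpha_b^a$ per action, $\forall a \in A$:

$$\alpha_b^a = \Gamma_t^{a,*} + \sum_{z \in Z} \arg\max_{\alpha \in \Gamma_t^{a,z}} \Big( \sum_{s \in S} \alpha(s)\, b(s) \Big), \quad \forall b \in B$$

Finally, the best action is needed for each belief point:

$$\alpha_b = \arg\max_{\alpha_b^a,\, a \in A} \Big( \sum_{s \in S} \alpha_b^a(s)\, b(s) \Big), \quad \forall b \in B$$

and the solution set is created from these:

$$\Gamma_t = \bigcup_{b \in B} \alpha_b$$

Although the operations above preserve only the best α-vector for each belief point $b \in B$, an estimate of the value function at any belief $b \notin B$ can be calculated from $\Gamma_t$ by again using:

$$V_t(b) = \max_{\alpha \in \Gamma_t} \sum_{s \in S} \alpha(s)\, b(s)$$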
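The point-based backup described above can be sketched as follows. The code is a minimal illustration, assuming the nested-list layouts `R[s][a]`, `T[s][a][s2]`, `O[s2][a][z]` and a hypothetical two-state toy model with made-up numbers. The function and variable names are illustrative.

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


def point_based_backup(B, Gamma_prev, R, T, O, gamma):
    """One backup that keeps at most one alpha-vector per belief in B."""
    n_states, n_actions, n_obs = len(R), len(R[0]), len(O[0][0])
    # Gamma^{a,z}: project every previous vector through T and O.
    proj = [
        [
            [
                [
                    gamma * sum(
                        T[s][a][s2] * O[s2][a][z] * alpha[s2]
                        for s2 in range(n_states)
                    )
                    for s in range(n_states)
                ]
                for alpha in Gamma_prev
            ]
            for z in range(n_obs)
        ]
        for a in range(n_actions)
    ]
    Gamma = []
    for b in B:
        best, best_val = None, float("-inf")
        for a in range(n_actions):
            alpha_ab = [R[s][a] for s in range(n_states)]  # Gamma^{a,*}
            for z in range(n_obs):
                # Simple summation: only the best projected vector at b.
                top = max(proj[a][z], key=lambda v: dot(v, b))
                alpha_ab = [x + y for x, y in zip(alpha_ab, top)]
            if dot(alpha_ab, b) > best_val:
                best, best_val = alpha_ab, dot(alpha_ab, b)
        if best not in Gamma:
            Gamma.append(best)
    return Gamma


# Hypothetical toy model: action 0 listens, action 1 resets the state.
R = [[-1.0, 10.0], [-1.0, -10.0]]
T = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0], [0.5, 0.5]]]
O = [[[0.85, 0.15], [0.5, 0.5]], [[0.15, 0.85], [0.5, 0.5]]]

B = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0]]
Gamma0 = [[R[s][a] for s in range(2)] for a in range(2)]
Gamma1 = point_based_backup(B, Gamma0, R, T, O, 0.9)
print(len(Gamma1), [max(dot(al, b) for al in Gamma1) for b in B])
```

The result never holds more than $|B|$ vectors, regardless of how many observations the model has, and the value at any other belief is still obtained by maximizing over the returned vectors.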

2 Point-based algorithms

2.1 Exact point-based algorithms

Methods of this type typically cannot scale beyond a handful of states, actions and observations. These earlier techniques use point-based backups to optimize the value function over limited parts of the belief tree, looking for beliefs where the value function is not optimal. All reachable beliefs therefore have to be considered, which makes this an expensive approach. Nonetheless, it is guaranteed to deliver the optimal solution.

2.2 Approximate point-based algorithms

2.2.1 Point-Based Value Iteration

Two main components are needed to achieve an anytime solution to large POMDP domains. These are the belief set selection and the point-based update procedure, which are combined here. The Point-Based Value Iteration algorithm (PBVI) starts with an initial set of belief points to which it applies a first backup. It then grows the belief tree and performs a new series of backup operations including old and new beliefs. This is repeated until a satisfactory solution is obtained. In this way PBVI gradually trades off computation time against solution quality. Even though it is not guaranteed that the value function improves with the addition of belief points, PBVI decreases, or at least maintains, the error bound with each step.

2.2.2 The Perseus algorithm

Perseus always uses randomly chosen points that are added to the belief tree. Value updates are not done all at once; instead, points are randomly sampled and their values are updated one at a time. Because one updated vector of the value function can also improve the value of nearby points, these points are then removed from the sampling set as well. The algorithm continues until the value of all points has been improved.

2.2.3 Heuristic Search Value Iteration

The Heuristic Search Value Iteration algorithm (HSVI) keeps a lower and an upper bound for the value function, which it uses to select belief points. To perform a value update, it only updates the direct predecessors of the selected belief.
The HSVI algorithm offers anytime performance.

2.2.4 Real Time Belief Space Search

The Real Time Belief Space Search approach (RTBSS) constructs a new belief reachability tree by using the current point as the top node and terminating the tree at a fixed depth. This way, the value of each node can be calculated recursively over the finite planning horizon. The algorithm also deletes subtrees that exceed a calculated bound when compared to other subtrees. At this point, another algorithm such as PBVI can be used to compute a lower bound, which improves the pruning of subtrees and in turn the quality of the RTBSS solution. This approach is able to compute results quickly, although the solution quality is not as good as that of algorithms like PBVI or Perseus.

2.3 Strategies for selecting belief points

There are different methods for selecting new belief points. It is useful to check first whether the beliefs considered for a backup are actually reachable. Therefore a subset of reachable beliefs is created, starting with a known initial belief (see Figure 2). This subset should be sufficiently small for computational tractability and large enough for a good value function approximation.
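A minimal sketch of how such a reachable subset could be enumerated: starting from the initial belief, every action and observation is applied up to a fixed depth, and duplicates are merged. The depth cutoff, the rounding used to merge near-identical beliefs, the nested-list layouts `T[s][a][s2]` and `O[s2][a][z]`, and the toy model are all assumptions made for illustration.

```python
def belief_update(b, a, z, T, O):
    """Return tau(b, a, z), or None if z cannot be observed after a in b."""
    n = len(b)
    unnorm = [
        O[s2][a][z] * sum(T[s][a][s2] * b[s] for s in range(n))
        for s2 in range(n)
    ]
    pr_z = sum(unnorm)
    return None if pr_z == 0.0 else [x / pr_z for x in unnorm]


def reachable_beliefs(b0, T, O, depth, digits=9):
    """All beliefs reachable from b0 within `depth` action/observation steps."""
    n_actions, n_obs = len(T[0]), len(O[0][0])
    seen = {tuple(round(p, digits) for p in b0)}
    frontier = [list(b0)]
    for _ in range(depth):
        next_frontier = []
        for b in frontier:
            for a in range(n_actions):
                for z in range(n_obs):
                    nb = belief_update(b, a, z, T, O)
                    if nb is None:
                        continue
                    key = tuple(round(p, digits) for p in nb)
                    if key not in seen:  # merge beliefs reached twice
                        seen.add(key)
                        next_frontier.append(nb)
        frontier = next_frontier
    return sorted(seen)


# Hypothetical toy model: action 0 listens, action 1 resets the state.
T = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0], [0.5, 0.5]]]
O = [[[0.85, 0.15], [0.5, 0.5]], [[0.15, 0.85], [0.5, 0.5]]]

print(len(reachable_beliefs([0.5, 0.5], T, O, depth=1)))
print(len(reachable_beliefs([0.5, 0.5], T, O, depth=2)))
```

Full enumeration grows quickly with the number of actions and observations, which is why the strategies in this section add only a few selected points per expansion.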

Figure 2: The shown belief tree includes reachable beliefs only. (Graphic from Pineau, Gordon & Thrun, Anytime Point-Based Approximations for POMDPs, Figure 2.)

2.3.1 Random Belief Selection

The simplest way to sample a new belief point is to choose it randomly from the entire belief simplex. The only thing to ensure here is uniform coverage. This strategy works well in small domains, but since it cannot provide good coverage of the belief simplex with a reasonable number of points, it exhibits poor performance in large domains.

2.3.2 Stochastic Simulation with Random Action

A better strategy is to add points along the belief tree. To generate these, an action is simulated, making a single-step forward trajectory from belief points already in the tree. Since this action is selected randomly, the belief tree will still be very large, especially when the branching factor is high.

2.3.3 Stochastic Simulation with Greedy Action

If the action is chosen such that the expected value gain at the new belief point is the largest of all value gains seen from the current belief, it is called a greedy action. Here the ε-greedy exploration strategy known from reinforcement learning is used to give the probability with which the greedy action is selected. Then the single-step forward simulation is done using the selected action.

2.3.4 Stochastic Simulation with Exploratory Action

Because POMDP algorithms perform best with a uniformly dense set of reachable beliefs, the new belief to be added to the belief tree should improve the worst-case density. To do this, the simulation with exploratory action does a single-step forward simulation with each action, but then keeps only the one point that is farthest away from all other belief points already in the belief tree.

2.3.5 Greedy Error Reduction

The most successful strategy for selecting new belief points tries to reduce the expected error. It first calculates the additional error introduced by a single belief point backup for each possible new point in the tree. Then the existing point with the largest error bound is needed, for which it is important to take the reachability probability of this point into account as well. Finally, among that point's descendants, the one is selected that would minimize the new error bound (see Figure 3).
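As an illustration of the simulation-based strategies, the expansion step with exploratory action can be sketched as follows. This is a sketch under assumptions rather than the procedure of any particular paper: the L1 distance to the nearest existing point is used as the "farthest away" criterion, the model is stored as nested lists `T[s][a][s2]` and `O[s2][a][z]`, and the toy model and all names are hypothetical.

```python
import random


def belief_update(b, a, z, T, O):
    """tau(b, a, z) with T[s][a][s2] and O[s2][a][z]."""
    n = len(b)
    unnorm = [
        O[s2][a][z] * sum(T[s][a][s2] * b[s] for s in range(n))
        for s2 in range(n)
    ]
    pr_z = sum(unnorm)
    return [x / pr_z for x in unnorm]


def sample(weights, rng):
    """Draw an index with probability proportional to its weight."""
    return rng.choices(range(len(weights)), weights=weights)[0]


def l1(u, v):
    return sum(abs(x - y) for x, y in zip(u, v))


def expand_exploratory(b, B, T, O, rng):
    """Simulate one step per action from b; keep the candidate farthest from B."""
    best, best_dist = None, -1.0
    for a in range(len(T[0])):
        s = sample(b, rng)             # hidden state drawn from the belief
        s2 = sample(T[s][a], rng)      # simulated successor state
        z = sample(O[s2][a], rng)      # simulated observation
        candidate = belief_update(b, a, z, T, O)
        dist = min(l1(candidate, other) for other in B)
        if dist > best_dist:
            best, best_dist = candidate, dist
    return best


# Hypothetical toy model: action 0 listens, action 1 resets the state.
T = [[[1.0, 0.0], [0.5, 0.5]], [[0.0, 1.0], [0.5, 0.5]]]
O = [[[0.85, 0.15], [0.5, 0.5]], [[0.15, 0.85], [0.5, 0.5]]]

B = [[0.5, 0.5]]
new_point = expand_exploratory(B[0], B, T, O, random.Random(0))
print(new_point)
```

In this model the reset action leads back to the uniform belief, which is already in the set, so the listen action's successor is always the one kept. Replacing the per-action loop with a single randomly chosen action would give the random-action strategy instead.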

Figure 3: The marked points are the candidates to be added next. (Graphic from Pineau, Gordon & Thrun, Anytime Point-Based Approximations for POMDPs, Figure 3.)

3 Grid-based algorithms

3.1 Grid-based approximation

Many approaches are known for approximating the value function using a finite set of belief points. As the name grid-based approximation suggests, here the points are distributed over the belief space according to a grid pattern. The value of points not on the grid is specified by an interpolation-extrapolation rule that matches them to neighboring grid points. The convexity of the value function of POMDPs is thereby ignored.

3.2 Strategies for selecting grid-points

To select the grid points needed here, an easy way is to lay a grid with fixed resolution over the belief simplex. Only neighboring grid points are then used to calculate the value interpolation. This is done quickly, but the number of points grows exponentially with the dimensionality of the belief space. Even simpler is the approach that selects random points over the whole belief tree, but that makes interpolation a lot harder. Neither method is ideal when beliefs are not uniformly distributed, which is characteristic of many real-life problems.

Further, there are approaches called non-regular grid approximations. One of them performs single-step stochastic simulations starting at the corner points of the belief simplex to generate additional belief points. Another approach also builds a grid, but starts at critical points of the belief simplex and then uses a heuristic to estimate the usefulness of the intermediate points it adds step by step. A third one interpolates over the values at critical points of the grid. Though these methods require fewer beliefs, they are more expensive, because interpolation over non-grid points requires searching over all grid points rather than just neighboring ones.

A better approach creates sub-samples of the fixed-resolution grid where needed and in this way obtains a variable resolution over the whole grid. It can thus sample some parts more densely while grid points are restricted to lie on the fixed-resolution grid. The disadvantage of this algorithm is that it requires a large number of grid points to perform well.

Another good algorithm can be applied to POMDPs with ε-optimality guarantees; it requires a thoroughly covered belief simplex and therefore exponentially many grid points. However, the algorithm is really fast because it interpolates only over the nearest neighbor of a one-step successor belief for each grid point.

References

[1] Joelle Pineau, Geoff Gordon and Sebastian Thrun (2006). Anytime point-based approximations for large POMDPs. Journal of Artificial Intelligence Research, Vol. 27.
