Efficient POMDP Forward Search by Predicting the Posterior Belief Distribution Ruijie He and Nicholas Roy


Computer Science and Artificial Intelligence Laboratory Technical Report MIT-CSAIL-TR, September 23, 2009. Efficient POMDP Forward Search by Predicting the Posterior Belief Distribution. Ruijie He and Nicholas Roy. Massachusetts Institute of Technology, Cambridge, MA, USA.

Ruijie He, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA
Nicholas Roy, Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA

Abstract

Online, forward-search techniques have demonstrated promising results for solving problems in partially observable environments. These techniques depend on the ability to efficiently search and evaluate the set of beliefs reachable from the current belief. However, enumerating or sampling action-observation sequences to compute the reachable beliefs is computationally demanding; coupled with the need to satisfy real-time constraints, existing online solvers can only search to a limited depth. In this paper, we propose that policies can be generated directly from the distribution of the agent's posterior belief. When the underlying state distribution is Gaussian, and the observation function is an exponential family distribution, we can calculate this distribution of beliefs without enumerating the possible observations. This property not only enables us to plan in problems with large observation spaces, but also allows us to search deeper by considering policies composed of multi-step action sequences. We present the Posterior Belief Distribution (PBD) algorithm, an efficient forward-search POMDP planner for continuous domains, demonstrating that better policies are generated when we can perform deeper forward search.

1 Introduction

The Partially Observable Markov Decision Process (POMDP) is a general framework for sequential decision making in partially observable environments, when the agent is unable to exactly observe the state of its environment. Traditionally, a POMDP solver generates a policy offline, computing an action for a set of possible beliefs before policy execution. However, for problems with large domains, offline methods can incur significant computation costs.
Recently, online forward-search methods have demonstrated promising results in problems with large domains (see [17] for a review), suggesting that POMDP planning can be performed efficiently by only considering the belief states that are reachable from the agent's current belief. If a POMDP solver is able to search deep enough, it will find the optimal policy for the current belief [6, 14]. Unfortunately, the number of belief states reachable within depth $D$ is $(|A||Z|)^D$, where $|A|$ and $|Z|$ are the sizes of the action and observation sets. Not only does the search quickly become intractable as $D$ increases, but online techniques generally have to meet real-time constraints, which limits the planning time available for each iteration.

Existing online, forward search algorithms seek to reduce the number of possible observations that have to be explored by using branch-and-bound [10], Monte Carlo sampling [9] and heuristic search [15, 21] techniques. Fundamentally, these algorithms still branch on the possible individual actions and observations to determine the set of reachable posterior beliefs. An alternative approach would be to consider shallow policies composed of multi-step action sequences, or macro-actions [20], branching only at the end of each action sequence. However, to plan with multi-step action sequences, an algorithm must have the ability to determine the set of posterior beliefs that could result after the action sequence, since the goal of a POMDP solver is to generate the policy that maximizes the agent's expected discounted reward. This set of beliefs is usually computed by enumerating or sampling from the set of observation sequences, which is itself a costly process and reduces the potential savings of macro-actions.

If it were possible to efficiently characterize the distribution of posterior beliefs after an action sequence without enumerating the possible observations, forward search POMDP planning could then be done much more efficiently. If the distribution over posterior beliefs can be computed efficiently and is of a low dimension, then sampling from this distribution requires substantially fewer samples and much less computation, allowing much faster search and efficient planning in POMDP problems with large observation spaces.

In this paper, we demonstrate that when the agent's belief and observation models can be represented in parametric form, the distribution of the agent's posterior beliefs can be directly computed for a multi-step action sequence. Parametric representations have previously been proposed [2, 3, 12, 16] as an alternative for compactly representing high-dimensional belief spaces, and are especially valuable for POMDP problems with continuous state spaces.
Specifically, we focus on problems where the agent's belief is reasonably represented as a Gaussian distribution over a continuous state space, and where the transition and observation models belong to any member of the exponential family of distributions, such as the linear-Gaussian or multinomial distributions that often characterize POMDP problems. By also constraining the agent's posterior belief to the Gaussian parametric representation, we can directly compute how the sufficient statistics of the agent's belief are expected to evolve over multi-step action sequences. Furthermore, for Gaussian distributions, we will see that the second moment of the belief distribution (i.e., the covariance) can be computed in amortized $O(1)$ for a given multi-step action sequence, increasing the efficiency of the search process. We present the Posterior Belief Distribution (PBD) algorithm, an efficient POMDP forward search algorithm that can perform much deeper forward searches.

2 POMDPs

Formally, a POMDP consists of a set of states $S$, a set of actions $A$, and a set of observations $Z$. It also includes a state-transition model $p(s' \mid s, a)$, an observation model $p(z \mid a, s')$, a reward model $r_S(s, a)$, as well as a discount factor $\gamma$ and initial belief $bel_0$. The goal of a POMDP solver is to compute a policy $\pi$ mapping beliefs to actions, $\pi : bel \rightarrow a$, that will maximize the agent's expected total reward over its lifetime.

Given a policy $\pi$ and current belief $bel$, the agent takes an action $a = \pi(bel)$ and obtains an observation $z$. It then updates its belief according to

$$bel'(s') = \tau(bel, a, z) = \eta \, p(z \mid a, s') \int_{s \in S} p(s' \mid s, a) \, bel(s) \, ds \tag{1}$$

where $\tau(bel, a, z)$ represents the belief update function and $\eta$ is a normalization constant. Each policy $\pi$ is also associated with a value function $V^{\pi} : bel \rightarrow \mathbb{R}$, specifying the expected total reward of executing policy $\pi$ starting from $bel$,

$$V^{\pi}(bel) = \max_{a \in A} \Big[ r_B(bel, a) + \gamma \sum_{z \in Z} p(z \mid bel, a) \, V^{\pi}(\tau(bel, a, z)) \Big] \tag{2}$$

where the function $r_B(bel, a) = \int_{s \in S} bel(s) \, r_S(s, a) \, ds$ specifies the immediate expected reward of executing action $a$ in belief $bel$.
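As a concrete reading of the belief update in Eqn. 1, the following is a minimal sketch for a discrete state space, where the integral becomes a sum. The two-state model, the table layout `trans[a][s][s']` and `obs[a][s'][z]`, and the function name `belief_update` are illustrative assumptions, not part of the paper.

```python
def belief_update(bel, a, z, trans, obs):
    """Discrete analogue of Eqn. 1:
    bel'(s') = eta * p(z | a, s') * sum_s p(s' | s, a) * bel(s)."""
    n = len(bel)
    # Prediction: push the belief through the transition model for action a.
    predicted = [sum(trans[a][s][s2] * bel[s] for s in range(n)) for s2 in range(n)]
    # Correction: weight each successor state by the observation likelihood.
    unnormalized = [obs[a][s2][z] * predicted[s2] for s2 in range(n)]
    eta = sum(unnormalized)  # 1 / eta is the normalization constant
    if eta == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return [u / eta for u in unnormalized]


# Hypothetical model: two states, one action (0), two observations.
trans = {0: [[0.9, 0.1], [0.2, 0.8]]}  # trans[a][s][s']
obs = {0: [[0.8, 0.2], [0.3, 0.7]]}    # obs[a][s'][z]
posterior = belief_update([0.5, 0.5], 0, 0, trans, obs)
```

Each call handles one action-observation pair; a forward search that enumerates observations has to repeat this update for every branch, which is the cost the rest of the paper tries to avoid.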
A POMDP solver seeks to find the optimal policy $\pi^{*}(bel_0)$ that maximizes $V^{\pi}(bel_0)$. Traditionally, policies have been computed offline, and one class of POMDP solvers that has achieved particular success is the point-based methods. For discrete state spaces, point-based approaches such as PBVI [11] and HSVI [18] leverage the piecewise-linear and convex (PWLC) aspects of the value function [19] to obtain lower bounds on the value function (Eqn. 2), performing value updates only at selected belief states. The value function has similarly been shown [12] to be PWLC for continuous state spaces.

3 Forward Search in Parametric Space

Fig. 1: A forward search tree. An action is chosen at each belief node (OR-node), while all observations must be considered at the action nodes (AND-node).

Rather than computing a policy for every possible belief state, forward search techniques avoid the computational complexity of full policy computation by directing computational effort only towards belief states that are reachable from the current belief under different actions. These techniques alternate between a planning and execution phase, planning online only for the belief at the current timestep. During the planning phase, a forward search algorithm creates an AND-OR tree (Fig. 1) of reachable belief states from the current belief state. The tree is expanded using action-observation pairs that are admissible from the current belief, and the beliefs at the leaf nodes are found using Eqn. 1. By using a value heuristic [17] that estimates the value at the fringe nodes, the expected value of executing a policy from the current belief can be propagated up the tree to the root node (Eqn. 2).

To obtain the set of reachable beliefs, existing forward search algorithms branch on the possible actions and observations at each successive depth. Unfortunately, the branching factor for reasonably large discrete or continuous action and observation sets severely limits the maximum search depth achievable in real-time. Even if we restrict our action space to a set of macro-actions, and compute the expected reward of each macro-action by sampling observation sequences of corresponding length [20], the size of the observation space and sampling complexity will grow exponentially with the length of the action sequence.

For a particular macro-action, the probability of the agent obtaining an observation sequence is equivalent to the probability of obtaining the posterior belief associated with that observation sequence.
Seen from another angle, every macro-action generates a distribution over beliefs, or a distribution of distributions. If we are able to calculate the distribution over posterior beliefs for every action sequence, and branch at the end of the action sequence by sampling posterior beliefs within this distribution, the sampling complexity is then independent of the macro-action length. Furthermore, the expected reward of an action sequence can then be computed by finding the expected rewards with respect to that distribution, rather than by sampling the possible observations.

In this paper, we focus on problems where the agent's belief $bel_t = \mathcal{N}(\mu_t, \Sigma_t)$ is normally distributed over the state space, and the observations are drawn from an exponential family distribution [1]. Without loss of generality, only the observations are modeled as an exponential family distribution here, though the same analysis could be applied to the state-transition model. These model assumptions imply that the posterior belief is not strictly Gaussian, since the Gaussian distribution is not a conjugate prior for generic exponential family observation models. We nevertheless assume that the agent's posterior belief remains Gaussian, and show in Section 4 that the distribution over posterior beliefs is itself a Gaussian over Gaussian beliefs (Fig. 2). We will show that all posterior beliefs in this distribution have the same covariance, and the posterior means are normally distributed over the continuous state space. Given an action sequence, the posterior distribution over beliefs is therefore a joint distribution over the posterior means and the corresponding distribution over states. We can then evaluate the expected reward of an action sequence by performing Monte Carlo integration over this distribution of distributions.

Fig. 2: Distribution of posterior beliefs. a) A Gaussian posterior belief results after incorporating an observation sequence. b) Over all possible observation sequences, the distribution of posterior means is a Gaussian (black line), and for each posterior mean, a Gaussian (blue curve) describes the agent's posterior belief.

4 Gaussian Posterior Prediction

Our state-transition and observation models can be represented as follows:

$$s_t = A_t s_{t-1} + B_t u_t + \varepsilon_t, \quad s_{t-1} \sim \mathcal{N}(\mu_{t-1}, \Sigma_{t-1}), \quad \varepsilon_t \sim \mathcal{N}(0, R_t) \tag{3}$$

$$p(z_t \mid \theta_t) = \exp\big(z_t^T \theta_t - b_t(\theta_t) + \kappa_t(z_t)\big) \tag{4}$$

We assume that our state-transition model is linear-Gaussian, and $A_t$ and $B_t$ are the linear transition matrices. $\theta_t$ and $b_t(\theta_t)$ are respectively the canonical parameter and normalization factor of the exponential family distribution that generates the observation. The exponential family encompasses a large set of parametric distributions, including the Gaussian distribution.

When the state-transition and observation models are normally distributed and linear functions of the state, the Kalman filter provides a closed-form solution for the posterior belief $(\mu_t, \Sigma_t)$, given a prior belief $(\mu_{t-1}, \Sigma_{t-1})$,

$$\bar{\mu}_t = A_t \mu_{t-1} + B_t u_t, \qquad \mu_t = \bar{\mu}_t + K_t (z_t - C_t \bar{\mu}_t) \tag{5}$$

$$\bar{\Sigma}_t = A_t \Sigma_{t-1} A_t^T + R_t, \qquad \Sigma_t = (C_t^T Q_t^{-1} C_t + \bar{\Sigma}_t^{-1})^{-1} \tag{6}$$

where $C_t$ is the observation matrix, $R_t$ and $Q_t$ are the covariances of the Gaussian process and measurement noise respectively, and $K_t$ is the Kalman gain. $\bar{\mu}_t$ and $\bar{\Sigma}_t$ are the mean and covariance after an action is taken but before incorporating the measurement.

Eqn. 5 and 6 show that for problems with linear-Gaussian state-transition and observation models, the covariance update is independent of the observation obtained. This is because the Fisher information associated with the observation model, $M_t = C_t^T Q_t^{-1} C_t$, is dependent only on the observation model parameters, rather than the observation obtained [4]. For linear-Gaussian models, $M_t$ is also constant across the entire state space.

Unfortunately, the linear-Gaussian assumption is highly restrictive, and most POMDP models have observation functions that are non-Gaussian. A more general form of the Kalman filter update exists, which allows for a closed-form solution of the posterior belief for problems with observation models that belong to a larger class of parametric distributions, the exponential family.
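The observation-independence of the covariance update in Eqn. 5 and 6 can be seen in a minimal scalar sketch. The default parameter values and the function name `kalman_step` are illustrative assumptions; the gain is computed with the standard identity $K_t = \Sigma_t C_t^T Q_t^{-1}$.

```python
def kalman_step(mu, sigma, u, z, A=1.0, B=1.0, C=1.0, R=0.1, Q=0.5):
    """One scalar Kalman filter step following Eqn. 5 and 6."""
    # Process update: belief after the action, before the measurement.
    mu_bar = A * mu + B * u
    sigma_bar = A * sigma * A + R
    # Measurement update in information form (Eqn. 6); z does not appear here.
    sigma_post = 1.0 / (C * (1.0 / Q) * C + 1.0 / sigma_bar)
    # Kalman gain, equal to sigma_bar * C / (C * sigma_bar * C + Q).
    K = sigma_post * C / Q
    # Only the mean depends on the observation actually received (Eqn. 5).
    mu_post = mu_bar + K * (z - C * mu_bar)
    return mu_post, sigma_post


# Two different observations from the same prior belief and action.
mu_a, sigma_a = kalman_step(mu=0.0, sigma=1.0, u=0.5, z=0.2)
mu_b, sigma_b = kalman_step(mu=0.0, sigma=1.0, u=0.5, z=1.7)
```

Both calls return the same posterior variance but different posterior means, which is the property that lets the covariance be predicted ahead of time for a whole action sequence.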
Building on statistical economics research for time-series analysis of non-Gaussian observations [5], a dynamic generalized linear model [22] has been shown to provide the exponential family equivalent of the Kalman filter (efKF). The key idea is to construct linear-Gaussian models which approximate the non-Gaussian exponential family model in the neighborhood of the conditional mode, $s_t \mid z_t$. The approximate linear-Gaussian observation model can then be used in a traditional Kalman filter. Since this idea was developed elsewhere, the derivation of the filter is presented as an Appendix, and we present the main equations here.

Constructing the approximate linear-Gaussian model requires computation of the first two moments of the distribution and linearizing about the mean estimate at every timestep. For an exponential family observation model, the first two moments of the distribution [22] are

$$E(z_t \mid \theta_t) = \dot{b}_t = \left.\frac{\partial b_t(\theta_t)}{\partial \theta_t}\right|_{\theta_t = W(\bar{\mu}_t)}, \qquad Var(z_t \mid \theta_t) = \ddot{b}_t = \left.\frac{\partial^2 b_t(\theta_t)}{\partial \theta_t \, \partial \theta_t^T}\right|_{\theta_t = W(\bar{\mu}_t)}, \qquad \theta_t = W(s_t), \tag{7}$$

where $\dot{b}_t$ and $\ddot{b}_t$ are the derivatives of the exponential family distribution's normalization factor, both linearized about $\theta_t = W(\bar{\mu}_t)$. $W(\cdot)$ is the canonical link function, which maps the states to canonical parameter values, and depends on the particular member of the exponential family.

Given an action-observation sequence, the posterior mean of the agent's belief in the efKF can then be updated according to

$$\bar{\mu}_t = A_t \mu_{t-1} + B_t u_t, \qquad \mu_t = \bar{\mu}_t + \tilde{K}_t (\tilde{z}_t - W(\bar{\mu}_t)), \tag{8}$$

$$\bar{\Sigma}_t = A_t \Sigma_{t-1} A_t^T + R_t, \qquad \Sigma_t = (\bar{\Sigma}_t^{-1} + Y_t \ddot{b}_t Y_t^T)^{-1}, \tag{9}$$

where $\tilde{K}_t = \bar{\Sigma}_t Y_t (Y_t \bar{\Sigma}_t Y_t + \ddot{b}_t^{-1})^{-1}$ is the efKF Kalman gain, and $\tilde{z}_t = \theta_t - \ddot{b}_t^{-1}(\dot{b}_t - z_t)$ is the projection of the observation onto the parameter space of the exponential family observation model. $Y_t = \left.\frac{\partial \theta_t}{\partial s_t}\right|_{s_t = \bar{\mu}_t}$ is the gradient of the exponential family distribution's canonical parameter, linearized about $\bar{\mu}_t$.

While our relaxation of the observation model to the exponential family necessarily implies that the Gaussian posterior belief is an approximation of the true posterior, the Gaussian assumption does allow the use of a Kalman filter variant, which in turn allows us to approximate our distribution of posterior beliefs efficiently and in closed form. In the following subsections, we take advantage of the efKF to compute the distribution of the posterior beliefs after a multi-step action sequence. Since our belief is assumed to be Gaussian, expressing a distribution over the posterior beliefs requires us to have a distribution over the posterior means and covariances.

4.1 Prediction of Posterior Mean Distribution

Eqn. 8 reveals that the posterior mean $\mu_t$ directly depends on the observation $z_t$. Nevertheless, we demonstrate in this section that given a current prior belief after an action, $\overline{bel}_t \sim \mathcal{N}(\bar{\mu}_t, \bar{\Sigma}_t)$, the expected distribution of the posterior means $p(\mu_t \mid \bar{\mu}_t)$ is normally distributed about $\bar{\mu}_t$.

Given the two moments of the exponential family observation model, we can represent the conditional distribution $\tilde{z}_t \mid s_t$ according to

$$\tilde{z}_t \mid s_t \sim \mathcal{N}(W(s_t), \ddot{b}_t^{-1}) \tag{10}$$

$$\approx \mathcal{N}(W(\bar{\mu}_t) + Y_t (s_t - \bar{\mu}_t), \ddot{b}_t^{-1}) \tag{11}$$

We can then marginalize out $s_t$ using $p(\tilde{z}_t \mid \bar{\mu}_t) = \int p(\tilde{z}_t \mid s_t, \bar{\mu}_t) \, p(s_t \mid \bar{\mu}_t) \, ds_t$ and, using linear transformations,

$$\tilde{z}_t \mid \bar{\mu}_t \sim \mathcal{N}(W(\bar{\mu}_t), \; Y_t \bar{\Sigma}_t Y_t^T + \ddot{b}_t^{-1}) \tag{12}$$

$$\bar{\mu}_t + \tilde{K}_t (\tilde{z}_t - W(\bar{\mu}_t)) \mid \bar{\mu}_t \sim \mathcal{N}(\bar{\mu}_t, \; \tilde{K}_t (Y_t \bar{\Sigma}_t Y_t^T + \ddot{b}_t^{-1}) \tilde{K}_t^T) \tag{13}$$

$$\mu_t \mid \bar{\mu}_t \sim \mathcal{N}(\bar{\mu}_t, \; \bar{\Sigma}_t Y_t \tilde{K}_t^T) \tag{14}$$

Eqn. 14 indicates that the posterior mean after a measurement update, $\mu_t$, is normally distributed about $\bar{\mu}_t$, with a covariance that depends on the prior covariance $\bar{\Sigma}_t$ and the observation model parameters $Y_t$ and $\ddot{b}_t$. The observation model parameters are linearized about the prior mean $\bar{\mu}_t$; hence, for an action-observation sequence of length 1, the parameters are independent of the observation that will be obtained.
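To make the efKF measurement update (Eqn. 8 and 9) and the predicted spread of posterior means (Eqn. 14) concrete, here is a minimal scalar sketch for a Bernoulli observation. It assumes an identity link $W(s) = s$ (so the state is the log-odds and $Y_t = 1$), uses the Bernoulli log-partition $b(\theta) = \log(1 + e^{\theta})$, and takes the projected observation to be $\tilde{z}_t = \theta_t - \ddot{b}_t^{-1}(\dot{b}_t - z_t)$. The link choice and the function name are assumptions made for illustration only.

```python
import math


def efkf_bernoulli_update(mu_bar, sigma_bar, z):
    """Scalar efKF measurement update for a Bernoulli observation z in {0, 1}."""
    theta = mu_bar                             # W(mu_bar), identity link (assumed)
    Y = 1.0                                    # gradient dW/ds at mu_bar
    b_dot = 1.0 / (1.0 + math.exp(-theta))     # E[z | theta], first moment
    b_ddot = b_dot * (1.0 - b_dot)             # Var[z | theta], second moment
    z_tilde = theta - (b_dot - z) / b_ddot     # observation projected onto theta
    K = sigma_bar * Y / (Y * sigma_bar * Y + 1.0 / b_ddot)   # efKF gain
    mu = mu_bar + K * (z_tilde - theta)        # Eqn. 8
    sigma = 1.0 / (1.0 / sigma_bar + Y * b_ddot * Y)         # Eqn. 9
    mean_spread = sigma_bar * Y * K            # variance of mu given mu_bar, Eqn. 14
    return mu, sigma, mean_spread


mu_hit, sigma_hit, spread = efkf_bernoulli_update(0.3, 2.0, z=1)
mu_miss, sigma_miss, _ = efkf_bernoulli_update(0.3, 2.0, z=0)
```

The posterior variance and the predicted spread are the same for both outcomes, since every quantity except `z_tilde` is linearized about the prior mean; only the posterior mean moves, upward for `z=1` and downward for `z=0`.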
To obtain the posterior mean distribution after a multi-step action sequence update, we first combine the process and measurement updates by marginalizing out $\bar{\mu}_t$ using $p(\mu_t \mid \mu_{t-1}) = \int p(\mu_t \mid \bar{\mu}_t) \, p(\bar{\mu}_t \mid \mu_{t-1}) \, d\bar{\mu}_t$, obtaining

$$\mu_t \mid \mu_{t-1} \sim \mathcal{N}(A_t \mu_{t-1} + B_t u_t, \; \bar{\Sigma}_t Y_t \tilde{K}_t^T) \tag{15}$$

We assumed above that for a one-step belief update, $\mu_{t-1}$ is a fixed value. For a multi-step update, the mean is a random variable, i.e. $\mu_{t-1} \sim \mathcal{N}(m_{t-1}, S_{t-1})$. We can then marginalize out $\mu_{t-1}$ to obtain

$$\mu_t \sim \mathcal{N}(A_t m_{t-1} + B_t u_t, \; S_{t-1} + \bar{\Sigma}_t Y_t \tilde{K}_t^T) \tag{16}$$

Equation 16 can now be used to perform a prediction of the posterior mean distribution after a multi-step action sequence. Assuming that the agent is currently at time $t$ and has a particular prior mean $\mu_{t-1} \sim \mathcal{N}(\mu_{t-1}, 0)$, the posterior mean after an action sequence of $T$ timesteps is therefore

$$\mu_{t+T} \sim \mathcal{N}\Big(f(\mu_{t-1}, A_{t:t+T}, B_{t:t+T}, u_{t:t+T}), \; \sum_{i=t}^{t+T} \bar{\Sigma}_i Y_i \tilde{K}_i^T\Big) \tag{17}$$

where $f(\mu_{t-1}, A_{t+1:t+T}, B_{t+1:t+T}, u_{t+1:t+T})$ is the deterministic transformation of the means according to $\mu_{t+k} = A_{t+k} \mu_{t+k-1} + B_{t+k} u_{t+k}$. Since an observation on its own does not shift the mean value $m_{t+k}$ of the distribution of posterior means, $m_{t+k}$ is dependent only on the state-transition model parameters and can be calculated via a recursive update along the action sequence.
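The recursion behind Eqn. 17 can be sketched for the scalar linear-Gaussian special case, where $Y_t = C$ and $\ddot{b}_t^{-1} = Q$. This sketch assumes random-walk dynamics $s_t = s_{t-1} + u_t + \varepsilon_t$ (that is, $A_t = B_t = 1$); the parameter values and the function name are hypothetical.

```python
def predict_posterior_mean(mu0, sigma0, controls, C=1.0, R=0.1, Q=0.5):
    """Predict (m, S, Sigma) after a macro-action without sampling observations.

    m     : mean of the distribution of posterior means (deterministic recursion)
    S     : variance of the distribution of posterior means (sum in Eqn. 17)
    Sigma : posterior variance shared by every posterior belief
    """
    m, S, sigma = mu0, 0.0, sigma0
    for u in controls:
        m = m + u                                    # process update of the mean
        sigma_bar = sigma + R                        # process update of the variance
        K = sigma_bar * C / (C * sigma_bar * C + Q)  # gain, independent of z
        S += sigma_bar * C * K                       # one term of the sum in Eqn. 17
        sigma = (1.0 - K * C) * sigma_bar            # measurement update of the variance
    return m, S, sigma


m, S, sigma = predict_posterior_mean(0.0, 1.0, [0.5, 0.5, -0.2])
```

In this special case the result can be checked against the law of total variance: the open-loop variance `sigma0 + T * R` splits into the shared posterior variance `sigma` plus the spread `S` of the posterior means.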

4.2 Single-step Prediction of Covariance

Eqn. 9 dictates how the posterior covariance of the agent's belief can be calculated, after an action is taken and an observation is obtained. Given that the Fisher information associated with the observation model, $M_t = Y_t \ddot{b}_t Y_t^T$, is independent of the observations, the posterior covariance can be computed in closed form, and is independent of the posterior mean.

For greater efficiency, it has been shown [13] that for a Kalman filter model, the process and measurement updates can be compiled into a single linear transfer function by using a factored representation of the covariance, $\Sigma_{t-1} = B_{t-1} C_{t-1}^{-1}$. Adapting this result to our model, the complete covariance update step can be written as:

$$\begin{bmatrix} B \\ C \end{bmatrix}_t = \Psi_t \begin{bmatrix} B \\ C \end{bmatrix}_{t-1} = \begin{bmatrix} 0 & I \\ I & Y_t \ddot{b}_t Y_t^T \end{bmatrix} \begin{bmatrix} 0 & A_t^{-T} \\ A_t & R_t A_t^{-T} \end{bmatrix} \begin{bmatrix} B \\ C \end{bmatrix}_{t-1}, \tag{18}$$

which enables us to collapse multi-step covariance updates into a single step $\Psi_{t+1:T} = \prod_{i=1}^{T} \Psi_i$, recovering the posterior covariance $\Sigma_{t+T}$ from $\Sigma_{t+T} = B_{t+T} C_{t+T}^{-1}$. Once $\Psi_{t+1:T}$ is constructed for a given action sequence, it can be reused for the same action sequence at future beliefs.

Calculating the Fisher information associated with the observation model $M_t$ again requires us to perform linearizations about the mean of the agent's prior belief $\mu_{t-1}$. This implies that the covariance update after a multi-step action sequence depends on the posterior mean after each timestep. Nevertheless, as shown in Section 4.1, the agent's expectation of the posterior mean after a measurement update $\mu_{t+1}$ is a normal distribution around $\bar{\mu}_t$. Hence, when considering the effect of taking an action sequence from the agent's current belief, we perform our linearizations about $\bar{\mu}_t$, the prior mean at each step along the action sequence.

5 Posterior Belief Distribution (PBD) Algorithm

Section 4 demonstrates that, given the agent's current belief $(\mu_t, \Sigma_t)$, the parameters describing the distribution of posterior beliefs can be computed without enumerating the observations. In particular, the posterior covariance $\Sigma_{t+T}$ can be predicted via a transfer function (Eqn. 18), while the posterior mean $\mu_{t+T}$ is normally distributed according to Eqn. 17.
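The covariance transfer function of Eqn. 18 can be checked numerically in the scalar case, where each $\Psi$ is a 2x2 matrix. The sketch below assumes scalar $A$, $R$ and Fisher information $M = Y \ddot{b} Y^T$ per step, and composes later steps on the left; the step values and helper names are hypothetical.

```python
def matmul2(X, Y):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(X[i][k] * Y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def psi(A, R, M):
    """One-step transfer matrix: measurement factor times process factor (Eqn. 18)."""
    measurement = [[0.0, 1.0], [1.0, M]]
    process = [[0.0, 1.0 / A], [A, R / A]]
    return matmul2(measurement, process)


def compose(steps):
    """Collapse a sequence of (A, R, M) steps into a single transfer matrix."""
    total = [[1.0, 0.0], [0.0, 1.0]]
    for A, R, M in steps:
        total = matmul2(psi(A, R, M), total)  # later steps multiply on the left
    return total


def apply_transfer(psi_mat, sigma):
    """Factor sigma = B / C with C = 1, apply the transfer matrix, recombine."""
    B, C = sigma, 1.0
    B_new = psi_mat[0][0] * B + psi_mat[0][1] * C
    C_new = psi_mat[1][0] * B + psi_mat[1][1] * C
    return B_new / C_new


def sequential(sigma, steps):
    """Reference: step-by-step process and measurement updates (Eqn. 9)."""
    for A, R, M in steps:
        sigma_bar = A * sigma * A + R
        sigma = 1.0 / (1.0 / sigma_bar + M)
    return sigma


steps = [(1.0, 0.1, 2.0), (0.9, 0.2, 0.5), (1.1, 0.05, 1.5)]
macro_psi = compose(steps)
```

Because `macro_psi` does not depend on the starting covariance, it can be built once per macro-action and reused from different beliefs, e.g. `apply_transfer(macro_psi, 1.0)` and `apply_transfer(macro_psi, 3.0)`.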
These results enable us to consider a subset of admissible action sequences and obtain the associated distribution of posterior beliefs without enumerating the corresponding observation sequences. We now present an algorithm, the Posterior Belief Distribution (PBD) algorithm, that takes advantage of these results to perform forward search to much greater depths.

The PBD algorithm, shown in Algs. 1 and 2, builds upon a generic online forward search for POMDPs [17], alternating between a planning and execution phase at every iteration. During the planning phase, the next best action sequence $\hat{a}_{seq}$ is chosen, and the first action $\hat{a}_t$ of this action sequence is executed. The agent then updates its current belief according to the observation obtained, and the cycle repeats.

During the planning phase, the EXPAND() subroutine is executed recursively in a depth-first search manner. The agent first samples a number of multi-step action sequences that it can execute from its current belief. The choice of these action sequences is domain-specific. For each of these action sequences, the algorithm calculates the parameters of the agent's posterior belief, i.e., the parameters of the posterior mean distribution $m_{t+T}$, $S_{t+T}$ according to Eqn. 17, as well as the posterior covariance $\Sigma_{t+T}$ according to Eqn. 18. These three sets of parameters are the sufficient statistics of the distribution of beliefs shown in Fig. 2.

As discussed in Section 3, we then sample from the posterior belief distribution to instantiate beliefs for deeper forward search. Given that the covariance can be assumed constant, we perform importance sampling only on the posterior mean distribution, generating samples of posterior beliefs by associating the posterior mean samples $\{n_i\}$ with the posterior covariance $\Sigma_{t+T}$. These beliefs are then used to perform an additional layer of depth-first search, and this process repeats for a pre-determined search depth $D$. At the leaf nodes, a value heuristic is used to provide an estimate of the value of being at the belief associated with the node. These values are then propagated up the tree (Alg. 2, line 12), and consist of both the instant rewards of executing the action sequence and a Monte Carlo integration of the value estimates from the sampled posterior beliefs. At the root, the algorithm chooses the action sequence with the highest expected discounted value.

Algorithm 1 Posterior Belief Distribution Algorithm
Require: $b_0$: Agent's initial belief, $D$: Maximum search depth, $V_h$: Value heuristic function
1: $b_c \leftarrow b_0$
2: while not EXECUTIONEND() do
3:   while not PLANNINGEND() do
4:     $\hat{a}_{seq}$ = EXPAND($b_c$, $D$, $V_h$)
5:   end while
6:   Execute first action $\hat{a}_t$ of $\hat{a}_{seq}$
7:   Obtain new observation $z_t$
8:   Update current belief $b_c \leftarrow \tau(b_c, \hat{a}_t, z_t)$
9: end while

Algorithm 2 EXPAND()
Require: $b$: Belief node to be expanded, $d$: Depth of expansion under $b$, $V_h$: Value heuristic function
1: if $d = 0$ then
2:   $V(b) \leftarrow V_h(b)$
3:   return $V(b)$
4: else
5:   $V(b) \leftarrow -\infty$
6:   for all $a_{seq,i} \in A_{seq}$ do
7:     $V(b, a_{seq,i}) \leftarrow R_B(b, a_{seq,i})$
8:     Compute $m_{t+T}$, $S_{t+T}$, $\Sigma_{t+T}$ (Eqn. 17, 18)
9:     Sample set of posterior means $\{n_i\}$ according to $\mathcal{N}(m_{t+T}, S_{t+T})$
10:    for all $n_i$ do
11:      $b' \leftarrow \mathcal{N}(n_i, \Sigma_{t+T})$
12:      $V(b, a_{seq,i}) \leftarrow V(b, a_{seq,i}) + \gamma \, p(n_i \mid m_{t+T}, S_{t+T})$ EXPAND($b'$, $d - 1$)
13:      if $V(b, a_{seq,i}) > V(b)$ then
14:        $V(b) \leftarrow V(b, a_{seq,i})$
15:        $a_{seq} \leftarrow a_{seq,i}$
16:      end if
17:    end for
18:  end for
19: end if
20: return $V(b)$, $a_{seq}$

Fig. 3: FVRS results on FVRS[8,5], reporting average rewards, online time (s) and offline time (s) for QMDP, HSVI(150s), RTBSS(d1) and PBD(d1,s100). Algorithm parameters: HSVI(# seconds), RTBSS(search depth), PBD(multi-step search depth, # samples).

5.1 Errors Induced by Linearizations

Sampling the distribution of posterior means induces an error on the expected rewards. However, error bounds can be computed using Hoeffding's inequality [7],

$$p\big(V_S(b, a_{seq,i}) - V(b, a_{seq,i}) \ge \gamma n \epsilon\big) \le \exp\left(-\frac{2 n^2 \epsilon^2}{n (V_{max} - V_{min})^2}\right), \tag{19}$$

where $V_S(\cdot)$ and $V(\cdot)$ are the sampled and exact value estimates respectively, $V_{max}$ and $V_{min}$ are the maximum and minimum rewards that could be obtained, and $n$ is the number of samples. For a desired accuracy $\epsilon$, we can recover the appropriate number of samples.

In addition, for a generic exponential family observation model, calculating the posterior mean and covariance after a single action (Eqn. 14 and 9) requires us to linearize about the prior mean to calculate the Jacobians $Y_t$ and $\ddot{b}_t$. When determining the posterior belief distribution after a multi-step action sequence (Eqn. 17 and 18), we make the assumption that we know the future prior means at each step along the action sequence, in order to perform linearization at every step along that sequence. This assumption is an approximation, since the future mean explicitly depends on the observation sequence. As a result, the Jacobians used to calculate the Kalman gain and covariance are approximate, and the error of the approximation in the Kalman gains and covariances increases with macro-action length. Bounds on approximation errors for the EKF are known to exist [8], and in future, we plan to provide analysis of how these bounds can be used to determine the length of the macro-actions.
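Returning to the sampling bound of Eqn. 19, its right-hand side simplifies to $\exp(-2 n \epsilon^2 / (V_{max} - V_{min})^2)$, which can be inverted to pick a sample count. The sketch below introduces a target failure probability `delta`, which is an assumption of this example and does not appear in the paper; the function names are also hypothetical.

```python
import math


def hoeffding_bound(n, eps, v_min, v_max):
    """Right-hand side of Eqn. 19 after cancelling one factor of n."""
    return math.exp(-2.0 * n * eps ** 2 / (v_max - v_min) ** 2)


def samples_needed(eps, delta, v_min, v_max):
    """Smallest n for which the bound in Eqn. 19 is at most delta."""
    return math.ceil((v_max - v_min) ** 2 * math.log(1.0 / delta) / (2.0 * eps ** 2))


# Values scaled to [0, 1], accuracy 0.1, failure probability at most 5%.
n = samples_needed(eps=0.1, delta=0.05, v_min=0.0, v_max=1.0)
```

The required sample count grows with the square of the value range and the inverse square of the accuracy, but it does not depend on the macro-action length, which matches the argument made in Section 3.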

6 Empirical Results

We provide initial results demonstrating the performance of the PBD algorithm. Unfortunately, most existing benchmark POMDP problems do not require POMDP solvers to search deeply to generate good policies. For example, the Rocksample problem, originally proposed in [18], and its extension, the FieldVisionRockSample (FVRS) problem [15], both allow simple approximation techniques to perform well. In the FVRS problem, an agent explores and samples rocks in a grid world. The agent is fully aware of the position of both itself and the rocks. At each timestep, it receives a binary, noisy observation of the value of each rock, and the observation accuracy increases as the agent moves closer to the rock.

The original Rocksample and FVRS problems appear easily solvable with existing POMDP techniques such as HSVI [18] and AEMS [15]. Moving to sample the rock involves the same actions as moving to acquire more information about the rock. For these problems, a greedy policy of finding the shortest path to all the rocks is a good approximation to the optimal policy. Fig. 3 shows that our algorithm is competitive with offline solvers such as HSVI, but suggests that even a naive QMDP solver will perform well. Our PBD algorithm performs slightly worse since it approximates the discrete state space as continuous, but the error induced, if any, is not statistically significant. For all the experiments reported in this paper, we adapted the HSVI implementation from the ZMDP software package (see footnote 1), while our RTBSS implementation uses the QMDP and HSVI alpha-vectors as upper and lower bounds on the value function. HSVI ran in C++, while we ran the QMDP, RTBSS, and PBD algorithms in Matlab.

To better test the algorithms, we propose the Information Search RockSample problem (ISRS) (Fig. 4), modifying the FVRS problem in two ways. First, we introduce a set of beacons (shown as yellow triangles) that each correspond to a rock (shown as grey circles). Rather than having the observation accuracy depend on the proximity to each rock, the agent at position $r_t$ must instead move to the beacon $RB_{i,t}$ to get better information.
Second, we modified the rock values $RV_{i,t}$ to each have a continuous value between 0 and 1, rather than being a binary-valued set. The agent continues to receive a binary, noisy observation of the value of each rock $z_{i,t}$, according to:

$$O : p(z_{i,t} \mid RV_{i,t} = m, r_t, RB_{i,t}) = \begin{cases} 0.5 + (m - 0.5) \, 2^{-4 \|r_t - RB_{i,t}\|_2 / D_0} & z_{i,t} = 1 \\ 0.5 - (m - 0.5) \, 2^{-4 \|r_t - RB_{i,t}\|_2 / D_0} & z_{i,t} = 0 \end{cases} \tag{20}$$

where $D_0$ is a tuning parameter that controls how quickly the accuracy of the observations decreases with greater distance between the agent and the beacon. The agent's sensor can therefore be modeled as a Bernoulli observation model, and the observations are assumed to be more accurate than an unbiased coin. The agent obtains a large reward if it samples a rock that has a high value, but incurs a large cost for wrongly doing so. Small costs are incurred for moving around the environment.

6.1 Comparison with other algorithms

The ISRS problem was used to compare the PBD algorithm against a fast upper bound (QMDP), a point-based offline value iteration technique (HSVI) and an online forward search algorithm (RTBSS). Macro-actions were generated for the PBD algorithm by sampling robot poses in the grid world and computing the sequence of actions necessary to reach the sampled pose.

Fig. 4 and 6 illustrate the fundamentally different policies generated by our PBD algorithm, relative to existing techniques. In Fig. 4, the RTBSS solver obtains an observation suggesting that rock 1 is valuable. Unfortunately, its inability to search beyond depth 1 causes it to fail to realize that good information about rock 1 can be obtained by making a short detour. Instead, it heads towards rock 1, but because subsequent observations are noisy, due to distance from the beacon, the solver is unable to commit to sampling the rock and is stuck at a local minimum. In contrast, the PBD solver (Fig. 6) is able to evaluate the value of making a detour when seeking information about rock 5. Subsequently, the agent moves to the region with multiple beacons, concludes that there is little value left to sample, and exits the problem.

Fig. 5 reports the performance of the different algorithms tested on the ISRS problem. The problem was initialized with different hidden values of the rocks, and 20 trials were conducted for each simulation. The different algorithms were also initialized with different offline processing time (HSVI), different search depth (RTBSS), and different posterior mean samples (PBD). The results demonstrate that the PBD algorithm is able to perform significantly better than other existing algorithms, thereby demonstrating the value of searching deeper.

Footnote 1: ZMDP Software for POMDP and MDP Planning. http:// trey/zmdp/

Fig. 4: ISRS world. Blue line indicates example RTBSS policy.

Fig. 5: Performance of POMDP solvers in the ISRS problem, ISRS[8,5] (2304s, 5a, 32o), reporting average rewards, online time (s) and offline time (s) for QMDP, HSVI(150s), HSVI(500s), HSVI(4000s), RTBSS(d1), RTBSS(d2), PBD(d1,s300) and PBD(d2,s30).

Fig. 6: Example PBD policy.

7 Conclusion and Discussion

We have demonstrated the ability to perform belief updates for multi-step action sequences in closed form for models that have specific parametric representations. When the models are linear-Gaussian, our expression for the posterior belief distribution after a multi-step action sequence is exact. For observation models that are members of the exponential family, the belief update can be approximated by using a linearized variant of the Kalman filter.

By being able to compute the agent's distribution of posterior beliefs after multi-step action sequences, we developed an algorithm that can perform deeper forward search when planning under uncertainty. While we may sacrifice an exhaustive search over all action sequences up to a particular length, our algorithm enables us to search to an equivalent length of multi-step action sequences. This effectively allows us to search deeper, allowing us to discover policies that would otherwise not have been found with a shallow search depth.

In contrast to most POMDP solvers, our algorithm assumes that the agent's belief and observation models are representable as particular classes of parametric distributions. This necessarily implies that our algorithm is not a generic POMDP solver. Nevertheless, we believe that not only do these classes of probability distributions represent a wide variety of belief distributions, but also that continuous, parametric representations are our only solution for avoiding the curse of dimensionality [2], especially as we seek to solve larger POMDP problems in future.
In addition, the notion of predicting the distribution of posterior beliefs should be extendable to a broader class of belief representations. Exponential family distributions that are conjugate priors to exponential family observation models are attractive candidates for representing the belief space, since the posterior belief after a belief update belongs to the same parametric class as the prior belief.

References

[1] O.E. Barndorff-Nielsen. Information and exponential families in statistical theory. Bull. Amer. Math. Soc. 1 (1979), 273(0979).
[2] A. Brooks, A. Makarenko, S. Williams, and H. Durrant-Whyte. Parametric POMDPs for planning in continuous state spaces. Robotics and Autonomous Systems, 54(11).
[3] E. Brunskill, L. Kaelbling, T. Lozano-Perez, and N. Roy. Continuous-state POMDPs with hybrid dynamics. In Symposium on Artificial Intelligence and Mathematics.
[4] T.M. Cover and J.A. Thomas. Elements of information theory. Wiley-Interscience.
[5] J. Durbin and S.J. Koopman. Time series analysis of non-Gaussian observations based on state space models from both classical and Bayesian perspectives. Journal of the Royal Statistical Society: Series B (Methodological), 62(1):3-56.

[6] M. Hauskrecht. Value-function approximations for partially observable Markov decision processes. Journal of Artificial Intelligence Research, 13(2000):33-94.
[7] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, pages 13-30.
[8] B.F. La Scala, R.R. Bitmead, and M.R. James. Conditions for stability of the extended Kalman filter and their application to the frequency tracking problem. Mathematics of Control, Signals, and Systems (MCSS), 8(1):1-26.
[9] D. McAllester and S. Singh. Approximate planning for factored POMDPs using belief state simplification. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence.
[10] S. Paquet, L. Tobin, and B. Chaib-draa. An online POMDP algorithm for complex multiagent environments. In Proceedings of the fourth international joint conference on Autonomous agents and multiagent systems. ACM New York, NY, USA.
[11] J. Pineau, G. Gordon, and S. Thrun. Point-based value iteration: An anytime algorithm for POMDPs. In International Joint Conference on Artificial Intelligence, volume 18.
[12] J.M. Porta, N. Vlassis, M.T.J. Spaan, and P. Poupart. Point-based value iteration for continuous POMDPs. The Journal of Machine Learning Research, 7.
[13] S. Prentice and N. Roy. The Belief Roadmap: Efficient Planning in Linear POMDPs by Factoring the Covariance. In Proceedings of the 13th International Symposium of Robotics Research (ISRR).
[14] M.L. Puterman. Markov decision processes: discrete stochastic dynamic programming. John Wiley & Sons, Inc. New York, NY, USA.
[15] S. Ross and B. Chaib-draa. AEMS: An Anytime Online Search Algorithm for Approximate Policy Refinement in Large POMDPs. In Proceedings of The 20th Joint Conference in Artificial Intelligence (IJCAI 2007), Hyderabad, India.
[16] S. Ross, B. Chaib-draa, and J. Pineau. Bayesian reinforcement learning in continuous POMDPs with application to robot navigation. In IEEE International Conference on Robotics and Automation, ICRA 2008.
[17] S. Ross, J. Pineau, S. Paquet, and B. Chaib-draa. Online planning algorithms for POMDPs. Journal of Artificial Intelligence Research (JAIR), 32.
[18] T. Smith and R. Simmons. Point-based POMDP algorithms: Improved analysis and implementation. In Proc. Uncertainty in Artificial Intelligence.
[19] E.J. Sondik. The optimal control of partially observable Markov processes over the infinite horizon: Discounted costs. Operations Research, 26(2).
[20] G. Theocharous and L.P. Kaelbling. Approximate planning in POMDPs with macro-actions. Advances in Neural Information Processing Systems, 17.
[21] R. Washington. BI-POMDP: Bounded, incremental partially-observable Markov-model planning. In Proceedings of the 4th European Conference on Planning (ECP). Springer.
[22] M. West, P.J. Harrison, and H.S. Migon. Dynamic generalized linear models and Bayesian forecasting. Journal of the American Statistical Association, pages 73-83.

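Appendix A: Exponential Family Kalman Filter

Building on statistical economics research for time-series analysis of non-Gaussian observations [5], we present the Kalman filter equivalent for systems with linear-Gaussian state transitions and observation models that belong to the exponential family of distributions. The state-transition and observation models can be represented as follows:

$$ s_t = A_t s_{t-1} + B_t u_t + \varepsilon_t, \quad s_{t-1} \sim \mathcal{N}(\mu_{t-1}, \Sigma_{t-1}), \quad \varepsilon_t \sim \mathcal{N}(0, R_t) \tag{21} $$

$$ p(z_t \mid \theta_t) = \exp\left(z_t^T \theta_t - b_t(\theta_t) + \kappa_t(z_t)\right), \quad \theta_t = W(s_t) \tag{22} $$

For the state-transition model, $s_t$ is the system's hidden state, $u_t$ is the control action, $A_t$ and $B_t$ are the linear transition matrices, and $\varepsilon_t$ is the state-transition Gaussian noise with covariance $R_t$. The observation model belongs to the exponential family of distributions. $\theta_t$ and $b_t(\theta_t)$ are the canonical parameter and normalization factor of the distribution, and $W(\cdot)$ maps the states to canonical parameter values. $W(\cdot)$ depends on the particular member of the exponential family. For ease of notation, we let

$$ \beta_t(z_t \mid \theta_t) = -\log p(z_t \mid \theta_t) = -z_t^T \theta_t + b_t(\theta_t) - \kappa_t(z_t) \tag{23} $$

Following the traditional Kalman filter, the process update can be written as

$$ \bar{\mu}_t = A_t \mu_{t-1} + B_t u_t, \quad \bar{\Sigma}_t = A_t \Sigma_{t-1} A_t^T + R_t \tag{24} $$

where $\bar{\mu}_t$ and $\bar{\Sigma}_t$ are the mean and covariance of the belief after the process update but before the measurement update. For the measurement update, we seek the conditional mode

\begin{align}
\mu_t &= \arg\max_{s_t} p(s_t \mid z_t) \tag{25} \\
&= \arg\max_{s_t} p(z_t \mid s_t)\, \overline{bel}(s_t) \quad \text{(Bayes rule)} \tag{26} \\
&= \arg\max_{s_t} p(z_t \mid \theta_t)\, \overline{bel}(s_t) \tag{27} \\
&= \arg\max_{s_t} \exp(-J_t), \quad \text{where } J_t = -\log p(z_t \mid \theta_t) + \tfrac{1}{2}(s_t - \bar{\mu}_t)^T \bar{\Sigma}_t^{-1} (s_t - \bar{\mu}_t) \tag{28}
\end{align}

At the mode, the gradient of $J_t$ vanishes:

$$ 0 = \left.\frac{\partial J_t}{\partial s_t}\right|_{s_t = \mu_t} = \frac{\partial \beta_t(z_t \mid \theta_t)}{\partial \theta_t} \frac{\partial \theta_t}{\partial s_t} + \bar{\Sigma}_t^{-1} (\mu_t - \bar{\mu}_t) \tag{29} $$

Taking the derivative of $\theta_t = W(s_t)$ about the prior mean $\bar{\mu}_t$, we let

$$ Y_t = \left.\frac{\partial W(s_t)}{\partial s_t}\right|_{s_t = \bar{\mu}_t} \tag{30} $$

Similarly, performing a Taylor expansion of $\partial \beta_t(z_t \mid \theta_t) / \partial \theta_t$ about $\bar{\theta}_t = W(\bar{\mu}_t)$,

\begin{align}
\frac{\partial \beta_t(z_t \mid \theta_t)}{\partial \theta_t} &\approx \left.\frac{\partial \beta_t(z_t \mid \theta_t)}{\partial \theta_t}\right|_{\theta_t = \bar{\theta}_t} + \left.\frac{\partial^2 \beta_t(z_t \mid \theta_t)}{\partial \theta_t\, \partial \theta_t^T}\right|_{\theta_t = \bar{\theta}_t} (\theta_t - \bar{\theta}_t) \tag{31} \\
&= \dot{\beta}_t + \ddot{\beta}_t (\theta_t - \bar{\theta}_t) \tag{32}
\end{align}

where

\begin{align}
\dot{\beta}_t &= \left.\frac{\partial \left(-z_t^T \theta_t + b_t(\theta_t) - \kappa_t(z_t)\right)}{\partial \theta_t}\right|_{\theta_t = \bar{\theta}_t} \quad \text{(Eqn. 23)} \tag{33} \\
&= \left.\frac{\partial b_t(\theta_t)}{\partial \theta_t}\right|_{\theta_t = \bar{\theta}_t} - z_t \tag{34} \\
&= \dot{b}_t - z_t \tag{35}
\end{align}

and

\begin{align}
\ddot{\beta}_t &= \left.\frac{\partial^2 \beta_t(z_t \mid \theta_t)}{\partial \theta_t\, \partial \theta_t^T}\right|_{\theta_t = \bar{\theta}_t} \tag{36} \\
&= \ddot{b}_t \tag{37}
\end{align}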
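As an illustration that is not part of the original derivation, consider a scalar Poisson observation with rate $\lambda_t$. Its canonical parameter is $\theta_t = \log \lambda_t$, and the quantities appearing in Equations 22, 35 and 37 take a simple closed form:

$$ p(z_t \mid \theta_t) = \exp\left(z_t \theta_t - e^{\theta_t} - \log z_t!\right), \quad b_t(\theta_t) = e^{\theta_t}, \quad \kappa_t(z_t) = -\log z_t! $$

$$ \dot{b}_t = e^{\bar{\theta}_t}, \quad \ddot{b}_t = e^{\bar{\theta}_t}, \quad \dot{\beta}_t = e^{\bar{\theta}_t} - z_t $$

Here $\dot{b}_t$ is the observation mean predicted at the prior, so $\dot{\beta}_t$ is the negated innovation, and $\ddot{b}_t$ is the predicted observation variance.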

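Plugging Equations 35 and 37 into Equation 32, and then into Equation 29,

\begin{align}
Y_t \left(\dot{b}_t - z_t + \ddot{b}_t (\theta_t - \bar{\theta}_t)\right) &= -\bar{\Sigma}_t^{-1} (\mu_t - \bar{\mu}_t) \tag{38} \\
Y_t \ddot{b}_t \left(\ddot{b}_t^{-1} (\dot{b}_t - z_t) - \bar{\theta}_t + \theta_t\right) &= -\bar{\Sigma}_t^{-1} (\mu_t - \bar{\mu}_t) \tag{39} \\
Y_t \ddot{b}_t \left(\left(\bar{\theta}_t - \ddot{b}_t^{-1} (\dot{b}_t - z_t)\right) - \theta_t\right) &= \bar{\Sigma}_t^{-1} (\mu_t - \bar{\mu}_t) \tag{40} \\
Y_t \ddot{b}_t \left(\tilde{z}_t - W(s_t)\right) &= \bar{\Sigma}_t^{-1} (\mu_t - \bar{\mu}_t) \tag{41}
\end{align}

where $\tilde{z}_t = \bar{\theta}_t - \ddot{b}_t^{-1} (\dot{b}_t - z_t)$ is the projection of the observation onto the parameter space of the exponential family distribution, and is independent of $s_t$. In Equation 41 we substituted $\theta_t$ using Equation 22.

Mean Update

Using Equation 41 and substituting $\mu_t$ for $s_t$,

\begin{align}
\bar{\Sigma}_t^{-1} (\mu_t - \bar{\mu}_t) &= Y_t \ddot{b}_t \left(\tilde{z}_t - W(\mu_t)\right) \tag{42} \\
&= Y_t \ddot{b}_t \left(\tilde{z}_t - W(\mu_t) + W(\bar{\mu}_t) - W(\bar{\mu}_t)\right) \tag{43} \\
&= Y_t \ddot{b}_t \left(\tilde{z}_t - W(\bar{\mu}_t)\right) - Y_t \ddot{b}_t \left(W(\mu_t) - W(\bar{\mu}_t)\right) \tag{44}
\end{align}

Linearizing $W(s_t)$ about $\bar{\mu}_t$ and evaluating at $s_t = \mu_t$,

\begin{align}
W(s_t) &\approx W(\bar{\mu}_t) + \left.W'(s_t)\right|_{s_t = \bar{\mu}_t} (s_t - \bar{\mu}_t) \tag{45} \\
W(\mu_t) &\approx W(\bar{\mu}_t) + Y_t (\mu_t - \bar{\mu}_t) \tag{46}
\end{align}

Substituting Equation 46 into Equation 44,

\begin{align}
\bar{\Sigma}_t^{-1} (\mu_t - \bar{\mu}_t) &= Y_t \ddot{b}_t \left(\tilde{z}_t - W(\bar{\mu}_t)\right) - Y_t \ddot{b}_t Y_t (\mu_t - \bar{\mu}_t) \tag{47} \\
Y_t \ddot{b}_t \left(\tilde{z}_t - W(\bar{\mu}_t)\right) &= \left(\bar{\Sigma}_t^{-1} + Y_t \ddot{b}_t Y_t\right) (\mu_t - \bar{\mu}_t) \tag{48} \\
&= \Sigma_t^{-1} (\mu_t - \bar{\mu}_t) \tag{49} \\
\mu_t - \bar{\mu}_t &= \Sigma_t Y_t \ddot{b}_t \left(\tilde{z}_t - W(\bar{\mu}_t)\right) \tag{50}
\end{align}

where $\Sigma_t Y_t \ddot{b}_t = \tilde{K}_t$ is the Kalman gain for non-Gaussian exponential family distributions. Via a standard transformation, the Kalman gain can be written in terms of covariances other than $\Sigma_t$,

$$ \tilde{K}_t = \bar{\Sigma}_t Y_t \left(Y_t \bar{\Sigma}_t Y_t + \ddot{b}_t^{-1}\right)^{-1} \tag{51} $$

and

$$ \mu_t = \bar{\mu}_t + \tilde{K}_t \left(\tilde{z}_t - W(\bar{\mu}_t)\right) \tag{52} $$

Covariance Update

Given a Gaussian posterior belief, $\partial^2 J_t / \partial s_t^2$ is the inverse of the covariance of the agent's belief:

\begin{align}
\Sigma_t^{-1} &= \frac{\partial^2 J_t}{\partial s_t^2} \tag{53} \\
&= \frac{\partial}{\partial s_t} \left(\bar{\Sigma}_t^{-1} (s_t - \bar{\mu}_t) - Y_t \ddot{b}_t \left(\tilde{z}_t - W(s_t)\right)\right) \tag{54} \\
&= \bar{\Sigma}_t^{-1} + Y_t \ddot{b}_t Y_t \tag{55} \\
\Sigma_t &= \left(\bar{\Sigma}_t^{-1} + Y_t \ddot{b}_t Y_t\right)^{-1} \tag{56}
\end{align}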
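The updates in Equations 24, 51, 52 and 56 can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the Poisson observation model with a linear map `C` from state to canonical parameter, and all numeric values are hypothetical. The appendix writes $Y_t$ without transposes; the sketch assumes $Y_t$ is the m-by-n Jacobian of $W$ and inserts transposes where the matrix dimensions require them.

```python
import numpy as np

def process_update(mu, Sigma, A, B, u, R):
    """Eq. 24: propagate the Gaussian belief through the linear dynamics."""
    mu_bar = A @ mu + B @ u
    Sigma_bar = A @ Sigma @ A.T + R
    return mu_bar, Sigma_bar

def measurement_update(mu_bar, Sigma_bar, z, W, W_jac, b_dot, b_ddot):
    """Eqs. 51, 52, 56 with Y taken as the (m, n) Jacobian of W (assumption)."""
    theta_bar = W(mu_bar)                 # canonical parameter at the prior mean
    Y = W_jac(mu_bar)                     # Eq. 30
    bd = b_dot(theta_bar)                 # first derivative of b, shape (m,)
    bdd = b_ddot(theta_bar)               # second derivative of b, shape (m, m)
    bdd_inv = np.linalg.inv(bdd)
    # Projection of the observation onto the parameter space (below Eq. 41).
    z_tilde = theta_bar - bdd_inv @ (bd - z)
    # Eq. 51: gain expressed through the prior covariance.
    K = Sigma_bar @ Y.T @ np.linalg.inv(Y @ Sigma_bar @ Y.T + bdd_inv)
    # Eq. 52: mean update.
    mu = mu_bar + K @ (z_tilde - theta_bar)
    # Eq. 56: covariance update.
    Sigma = np.linalg.inv(np.linalg.inv(Sigma_bar) + Y.T @ bdd @ Y)
    return mu, Sigma, K

# Hypothetical example: a 2-D state observed through a Poisson count whose
# log-rate is the first state component, so theta = C s and b(theta) = exp(theta).
C = np.array([[1.0, 0.0]])
W = lambda s: C @ s
W_jac = lambda s: C
b_dot = lambda theta: np.exp(theta)
b_ddot = lambda theta: np.diag(np.exp(theta))

A = np.array([[1.0, 1.0], [0.0, 1.0]])
B = np.eye(2)
R = 0.01 * np.eye(2)
mu_prev = np.array([0.5, 0.0])
Sigma_prev = 0.2 * np.eye(2)
u = np.zeros(2)
z = np.array([3.0])                       # observed count

mu_bar, Sigma_bar = process_update(mu_prev, Sigma_prev, A, B, u, R)
mu, Sigma, K = measurement_update(mu_bar, Sigma_bar, z, W, W_jac, b_dot, b_ddot)
print("posterior mean:", mu)
print("posterior covariance:\n", Sigma)
```

Note that the covariance update depends on the prior mean and the model derivatives but not on the observed value `z`. With `b_dot = lambda th: th` and `b_ddot = lambda th: np.eye(len(th))` (a unit-variance Gaussian observation), the same function reduces to the standard Kalman filter measurement update.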
