Partially Observable Systems. 1 Partially Observable Markov Decision Process (POMDP) Formalism

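CS294-40 Learning for Robotics and Control
Lecture 10 - 9/30/2008
Lecturer: Pieter Abbeel
Partially Observable Systems
Scribe: David Nachum

Lecture outline

- POMDP formalism
- Point-based value iteration
- Global methods: polytree, enumeration, filtering, witness

1 Partially Observable Markov Decision Process (POMDP) Formalism

A Partially Observable Markov Decision Process (POMDP) is a tuple $(S, A, T, R, \gamma, O, \Omega)$, where

1. $S$ is the set of possible states for the system
2. $A$ is the set of possible actions
3. $T$ represents the system dynamics, $P(s' \mid s, a)$
4. $R : S \times A \to \mathbb{R}$ is the reward function
5. $O$ is the set of possible observations we can make
6. $\Omega$ is the probability distribution $P(o \mid s)$

and $\gamma$ is the discount factor.

We can convert a POMDP into a belief state MDP. The belief state space $B$ is an $(|S| - 1)$-dimensional simplex whose elements are probability distributions over states:

$$b \in B : \quad b(s) = \mathrm{Prob}(s \mid \text{all information available at the current time})$$

The transition model now describes transitions between beliefs, rather than transitions between states. We define $b_a^o$ to be the belief state reached when starting in belief state $b$, taking action $a$, and observing $o$. Hence,

$$
\begin{aligned}
b_a^o(s') &= \frac{P(o_{t+1} = o \mid s_{t+1} = s') \sum_{s} P(s_{t+1} = s' \mid s_t = s, a_t = a)\, b(s)}{\sum_{s'} P(o_{t+1} = o \mid s_{t+1} = s') \sum_{s} P(s_{t+1} = s' \mid s_t = s, a_t = a)\, b(s)} \\
&= \frac{P(o \mid s') \sum_{s} P(s' \mid s, a)\, b(s)}{P(o \mid b, a)}
\end{aligned}
$$

As we iterate through time, the belief state moves from $b$ to $b_a^o$ with probability

$$P(b_a^o \mid b, a) = P(o \mid b, a) = \sum_{s'} P(o_{t+1} = o \mid s_{t+1} = s') \sum_{s} P(s' \mid s, a)\, b(s)$$

Our policies now map beliefs to actions:

$$\pi : B \to A.$$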
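The belief update above can be sketched in a few lines of Python. This is a minimal illustration and not part of the lecture: the function name `belief_update`, the nested-list layouts `T[a][s][s2]` and `Z[s2][o]`, and the toy numbers are all assumptions made for the example.

```python
def belief_update(b, a, o, T, Z):
    """Return (b_a^o, P(o | b, a)) for a discrete POMDP.

    b[s]        : current belief over states
    T[a][s][s2] : P(s2 | s, a)   (assumed nested-list layout)
    Z[s2][o]    : P(o | s2)      (assumed nested-list layout)
    """
    n = len(b)
    # Prediction step: sum_s P(s' | s, a) b(s) for every next state s'.
    predicted = [sum(T[a][s][s2] * b[s] for s in range(n)) for s2 in range(n)]
    # Correction step: weight each next state by the likelihood P(o | s').
    unnormalized = [Z[s2][o] * predicted[s2] for s2 in range(n)]
    # The normalizer is exactly P(o | b, a).
    p_o = sum(unnormalized)
    if p_o == 0.0:
        raise ValueError("observation has zero probability under (b, a)")
    return [u / p_o for u in unnormalized], p_o


# Toy model with two states, one action, two observations (made-up numbers).
T = [[[0.9, 0.1], [0.2, 0.8]]]
Z = [[0.8, 0.2], [0.3, 0.7]]
new_b, p_o = belief_update([0.5, 0.5], 0, 0, T, Z)
```

The second return value is the normalizer $P(o \mid b, a)$, which is the same quantity that weights $V(b_a^o)$ in a Bellman back-up over beliefs.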

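We compute the value of a policy $\pi$ as follows:

$$V^{\pi}(b) = E\Big[ \sum_{t=0}^{\infty} \gamma^t R(b_t, a_t) \,\Big|\, b_0 = b; \pi \Big], \qquad R(b, a) = \sum_{s \in S} R(s, a)\, b(s)$$

Bellman back-ups for a POMDP:

$$V(b) \leftarrow \max_{a \in A} \Big[ \sum_{s \in S} R(s, a)\, b(s) + \gamma \sum_{o} P(b_a^o \mid b, a)\, V(b_a^o) \Big]$$

2 Point-based value iteration

A practical issue that arises is that, even when our state space is discrete and finite, the belief state space (an $(|S| - 1)$-dimensional simplex) is continuous and hence has infinitely many states. One solution: use function approximation, e.g., grid out the belief state space and use nearest neighbor as the function approximator.

Here we study another solution: it turns out that for any finite horizon, the value function of a belief state MDP can be represented by the max over a set of linear functions. We define

$$r_a = \begin{bmatrix} R(1, a) \\ R(2, a) \\ \vdots \\ R(|S|, a) \end{bmatrix}$$

Concretely, in the first iteration of our value iteration we have:

$$V_1(b) = \max_{a \in A} \sum_{s \in S} R(s, a)\, b(s) = \max_{\alpha^{(1)} \in \{\alpha^{(1)}\}} \alpha^{(1)T} b$$

with one vector $\alpha^{(1)} = r_a$ per action. This remains true for arbitrary $n$:

$$V_n(b) = \max_{\alpha^{(n)} \in \{\alpha^{(n)}\}} \alpha^{(n)T} b$$

We iteratively update the value of the belief state:

$$
\begin{aligned}
V_{n+1}(b) &= \max_a \Big[ r_a^T b + \gamma \sum_o P(o \mid b, a)\, V_n(b_a^o) \Big] \\
&= \max_a \Big[ r_a^T b + \gamma \sum_o P(o \mid b, a) \max_{\alpha^{(n)} \in \{\alpha^{(n)}\}} \alpha^{(n)T} b_a^o \Big] \\
&= \max_a \Big[ r_a^T b + \gamma \sum_o P(o \mid b, a) \max_{\alpha^{(n)}} \sum_{s'} \alpha^{(n)}(s')\, b_a^o(s') \Big] \\
&= \max_a \Big[ r_a^T b + \gamma \sum_o P(o \mid b, a) \max_{\alpha^{(n)}} \sum_{s'} \alpha^{(n)}(s') \frac{P(o \mid s') \sum_s P(s' \mid s, a)\, b(s)}{P(o \mid b, a)} \Big] \\
&= \max_a \Big[ r_a^T b + \gamma \sum_o \max_{g_{a,o}^{(n)} \in \{g_{a,o}^{(n)}\}} g_{a,o}^{(n)T} b \Big]
\end{aligned}
$$

where

$$g_{a,o}^{(n)}(s) = \sum_{s'} P(o \mid s')\, P(s' \mid s, a)\, \alpha^{(n)}(s')$$

To do the backup for a belief point $b$:

1. For all $a, o$ and every $\alpha^{(n)}$, compute $g_{a,o}^{(n)}(s)$ according to the equation above.
2. For every $a$, let $g_a = r_a + \gamma \sum_o \arg\max_{g_{a,o}^{(n)}} g_{a,o}^{(n)T} b$.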
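A single backup at one belief point, including the final maximization over actions, might look like the following sketch. It is an illustration under assumptions, not the lecture's code: the names `dot` and `point_backup`, the layouts `R[a][s]`, `T[a][s][s2]`, `Z[s2][o]`, and the toy model are all hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(u, v))


def point_backup(b, alphas, R, T, Z, gamma):
    """One point-based Bellman backup at belief b.

    alphas      : list of alpha vectors representing V_n
    R[a][s]     : reward vector r_a            (assumed layout)
    T[a][s][s2] : P(s2 | s, a)                 (assumed layout)
    Z[s2][o]    : P(o | s2)                    (assumed layout)
    Returns the vector g_a that maximizes g_a^T b over actions.
    """
    n_states, n_obs = len(b), len(Z[0])
    best, best_val = None, float("-inf")
    for a in range(len(R)):
        g_a = list(R[a])  # start from r_a
        for o in range(n_obs):
            # g_{a,o}(s) = sum_{s'} P(o | s') P(s' | s, a) alpha(s'), one per alpha.
            candidates = [
                [sum(Z[s2][o] * T[a][s][s2] * alpha[s2] for s2 in range(n_states))
                 for s in range(n_states)]
                for alpha in alphas
            ]
            # Keep only the candidate that is best at this particular belief.
            g_ao = max(candidates, key=lambda g: dot(g, b))
            g_a = [x + gamma * y for x, y in zip(g_a, g_ao)]
        if dot(g_a, b) > best_val:
            best, best_val = g_a, dot(g_a, b)
    return best


# Toy model: two states, two actions, two observations (made-up numbers).
R = [[1.0, 0.0], [0.0, 1.0]]
T = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.5, 0.5]]]
Z = [[0.8, 0.2], [0.3, 0.7]]
b = [0.3, 0.7]
alpha_next = point_backup(b, [[0.0, 0.0]], R, T, Z, 0.95)
```

Starting from the all-zero value function, the backup simply returns the reward vector of the action that is best at $b$. Note that the division by $P(o \mid b, a)$ never has to be carried out, because it cancels in the derivation.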

The backup at $b$ yields $V_{n+1}(b) = \alpha^{(n+1)T} b$ with

$$\alpha^{(n+1)} = \arg\max_{g_a} g_a^T b$$

Point-based value iteration is efficient, but inexact due to the discrete choice of belief states. However, there are clever ways to pick the points:

1. Pineau, Gordon, Thrun: Point-Based Value Iteration. Begin at some initial belief. Then pick belief points by forward simulation and prune by distance.
2. Vlassis, Spaan: Perseus. In every iteration, only do the Bellman back-up for a point if the back-ups of other belief states have not yet increased that point's value function. [Assumes you initialize the value function with a lower bound.] This ensures the value function increases for every belief point in every iteration. It works very well in practice.

3 Global methods

Consider a policy with horizon $H$ in a POMDP. We can represent this as a policy tree (assuming the policy is deterministic). We can compute the value of being at state $s$ and following policy tree $p$:

$$V_p^H(s) = R(s, p(s)) + \gamma \sum_{s' \in S} P(s' \mid s, p(s)) \sum_{o} P(o \mid s')\, V_{\hat{p}}^{H-1}(s')$$

where $p(s)$ denotes the action taken at the root of $p$ and $\hat{p}$ is the subtree of $p$ that is followed after observing $o$.

If we do not know what state we are in, but do know what belief state we are in, we compute:

$$V_p^H(b) = \sum_{s} b(s)\, V_p^H(s)$$

For the optimal (within horizon $H$) tree, we define

$$V^H(b) = \max_p V_p^H(b)$$
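The recursion for the value of a policy tree can be sketched as follows. The tree encoding `(action, children)`, the function names `tree_alpha` and `tree_value`, the array layouts, and the toy numbers are assumptions chosen for this illustration.

```python
def tree_alpha(tree, R, T, Z, gamma):
    """Return [V_p(s) for each state s] for a policy tree p.

    tree = (action, children): children maps each observation index to a
    subtree, or is None for a one-step tree (assumed encoding).
    R[a][s], T[a][s][s2] = P(s2 | s, a), Z[s2][o] = P(o | s2) (assumed layouts).
    """
    action, children = tree
    n = len(R[action])
    if children is None:
        return list(R[action])
    # Value vectors of the subtrees followed after each observation.
    sub = {o: tree_alpha(child, R, T, Z, gamma) for o, child in children.items()}
    return [
        R[action][s]
        + gamma * sum(
            T[action][s][s2] * sum(Z[s2][o] * sub[o][s2] for o in sub)
            for s2 in range(n)
        )
        for s in range(n)
    ]


def tree_value(tree, b, R, T, Z, gamma):
    """V_p(b) = sum_s b(s) V_p(s)."""
    return sum(bs * vs for bs, vs in zip(b, tree_alpha(tree, R, T, Z, gamma)))


# Toy model: two states, two actions, two observations (made-up numbers).
R = [[1.0, 0.0], [0.0, 1.0]]
T = [[[0.9, 0.1], [0.2, 0.8]], [[0.5, 0.5], [0.5, 0.5]]]
Z = [[0.8, 0.2], [0.3, 0.7]]
# Take action 0 first, then action 0 after observation 0, action 1 after observation 1.
two_step = (0, {0: (0, None), 1: (1, None)})
alpha = tree_alpha(two_step, R, T, Z, 0.5)
```

The vector returned by `tree_alpha` is linear in the belief, which is why each policy tree corresponds to one $\alpha$ vector and $V^H(b)$ is a max over linear functions.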

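Figure 1: Policy tree illustration.

3.1 Global value iteration for belief state MDP (Algorithm)

$t = 1$
$V_1$ = set of 1-step policy trees
Loop:
  1. $t$++
  2. Compute $V_t^+$, the set of possibly useful $t$-step policy trees, from $V_{t-1}$
  3. Prune/filter $V_t^+$ to get $V_t$, the set of useful $t$-step policy trees
Until $\sup_b |V_t(b) - V_{t-1}(b)| < \epsilon$

3.2 Basic filtering

For all $j$, solve

$$
\begin{aligned}
\max_{t,\, b} \quad & t \\
\text{s.t.} \quad & \alpha_j^T b \ge \alpha_i^T b + t \quad \forall i \ne j \\
& b \ge 0 \\
& \mathbf{1}^T b = 1
\end{aligned}
$$

(Here $t$ is the scalar objective variable of the LP, not the iteration counter of Section 3.1.)

If for $j$ the solution has $t > 0$, then the policy tree $j$ is useful, in that there is a belief point for which policy tree $j$ is the optimal policy. Otherwise, we prune out policy tree $j$.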
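The filtering test can be illustrated without an LP solver in the special case $|S| = 2$. The sketch below does not call a general LP routine; it swaps in an exact one-dimensional search that is valid only for two states, where $b = (p, 1 - p)$ and every constraint is a line in $p$. The names `margin_two_states` and `basic_filter` and the tolerance `tol` are hypothetical; a general implementation would hand the same LP to a solver.

```python
def margin_two_states(j, alphas):
    """Optimal LP value t* for vector j when |S| = 2, with b = (p, 1 - p).

    Maximizes min_{i != j} (alpha_j - alpha_i)^T b over p in [0, 1].
    Each term is linear in p, so the maximum of their minimum is attained
    at p = 0, at p = 1, or where two of the lines cross.
    """
    lines = []  # (intercept, slope) of f_i(p) = (alpha_j - alpha_i)^T (p, 1 - p)
    for i, alpha in enumerate(alphas):
        if i != j:
            d0 = alphas[j][0] - alpha[0]
            d1 = alphas[j][1] - alpha[1]
            lines.append((d1, d0 - d1))
    if not lines:
        return float("inf")  # a lone vector is trivially useful
    candidates = [0.0, 1.0]
    for x in range(len(lines)):
        for y in range(x + 1, len(lines)):
            (c1, m1), (c2, m2) = lines[x], lines[y]
            if m1 != m2:
                p = (c2 - c1) / (m1 - m2)  # crossing point of the two lines
                if 0.0 <= p <= 1.0:
                    candidates.append(p)
    return max(min(c + m * p for c, m in lines) for p in candidates)


def basic_filter(alphas, tol=1e-9):
    """Keep the vectors whose optimal margin t* is strictly positive."""
    return [a for j, a in enumerate(alphas) if margin_two_states(j, alphas) > tol]


# Made-up alpha vectors: [0.4, 0.4] is never the best, [0.6, 0.6] wins near p = 0.5.
alphas = [[1.0, 0.0], [0.0, 1.0], [0.4, 0.4], [0.6, 0.6]]
useful = basic_filter(alphas)
```

One caveat of testing for a strictly positive margin: if two candidate vectors are identical, each has margin zero against the other, so duplicates should be removed before filtering.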

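3.3 Lark's filtering

Incrementally build up the set of useful $\alpha$. This way the size of the LP scales with $|V_t|$ rather than $|V_t^+|$.

For $j = 1, 2, \ldots, |V_t^+|$:

  Solve

$$
\begin{aligned}
\max_{t,\, b} \quad & t \\
\text{s.t.} \quad & \alpha_j^T b \ge \alpha_k^T b + t \quad \forall k \in V_t \\
& b \ge 0 \\
& \mathbf{1}^T b = 1
\end{aligned}
$$

  If $t > 0$, find $\max_{j \in \{1, 2, \ldots, |V_t^+|\}} b^T \alpha_j$ for the maximizing belief $b$, and add $\arg\max_{j \in \{1, 2, \ldots, |V_t^+|\}} b^T \alpha_j$ to $V_t$.

3.4 Witness algorithm

In the witness algorithm, we try to avoid constructing $V_t^+$. We do this by only adding a tree if it is optimal for some belief. It is sufficient to check this per subtree: is there a belief state we can reach after action $a$ and observation $o$ such that the policy tree $p \in V_{t-1}$ would be optimal? This problem can be solved by linear programming.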