Exact Dynamic Programming for Decentralized POMDPs with Lossless Policy Compression

Proceedings of the Eighteenth International Conference on Automated Planning and Scheduling (ICAPS 2008)

Abdeslam Boularias and Brahim Chaib-draa
Computer Science & Software Engineering Dept., Laval University, Quebec G1K 7P4, Canada
{boularias,chaib}@damas.ift.ulaval.ca

Abstract

High dimensionality of the belief space in DEC-POMDPs is one of the major causes that make the computation of an optimal joint policy intractable. The belief state for a given agent is a probability distribution over the system states and the policies of the other agents. Belief compression is an efficient POMDP approach that speeds up planning algorithms by projecting the belief state space to a low-dimensional one. In this paper, we introduce a new method for solving DEC-POMDP problems, based on the compression of the policy belief space. The reduced policy space contains sequences of actions and observations that are linearly independent. We tested our approach on two benchmark problems, and the preliminary results confirm that the Dynamic Programming algorithm scales up better when the policy belief is compressed.

Introduction

Decision making under state uncertainty is one of the greatest challenges in artificial intelligence. State uncertainty is a direct result of stochastic actions and noisy, or aliased, observations. Partially Observable Markov Decision Processes (POMDPs) provide a powerful Bayesian model to solve this problem (Smallwood & Sondik 1971). In this model, the state is represented by a probability distribution over all the possible states, which we call a belief state. The complexity of POMDP algorithms, which has been proved to be PSPACE-complete (Papadimitriou & Tsitsiklis 1987), depends heavily on the dimension of the belief state. During the last two decades, significant efforts have been devoted to developing fast algorithms for large POMDPs, and nowadays even problems with a thousand states can be solved within a few seconds (Virin et al. 2007). The rise of applications requiring cooperation between different agents, like robotic teams, distributed sensors and communication networks, has made the presence of many decision makers a key challenge for building autonomous agents.
For this purpose, a generalization of POMDPs to multi-agent domains, called DEC-POMDPs (Decentralized POMDPs), was introduced in (Bernstein, Immerman, & Zilberstein 2002), and since then this framework has been receiving a growing amount of attention. This research is basically motivated by the fact that many real-world problems need to be formalized as DEC-POMDPs, while planning with DEC-POMDPs is NEXP-complete (Bernstein, Immerman, & Zilberstein 2002), and even finding ε-optimal solutions is NEXP-hard (Rabinovich, Goldman, & Rosenschein 2003). Finding good solutions to DEC-POMDPs is so difficult because there is no optimality criterion for the policies of a single agent alone: whether a given policy is better or worse than another depends on the policies of the other agents. Consequently, the dimensionality of the policy space is a crucial factor in the scalability of DEC-POMDP algorithms.

In this paper, we propose a new method for scaling up Dynamic Programming for DEC-POMDPs, based on a lossless compression of the policy space. Our approach is based on the following observation: given a set of policies, only a few sequences are necessary to represent all the policies. This method is an adaptation to DEC-POMDPs of another approach that was originally proposed to reduce the state space dimensionality in POMDPs, and which is known as Predictive State Representations (PSRs) (Littman, Sutton, & Singh 2001). In PSRs, states are replaced by sequences of actions and observations that have linearly independent probabilities. Similarly, a pool of policies can be represented by a smaller set of sequences that have linearly independent probabilities of occurring in these policies.

Copyright © 2008, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Related Work

A brute force solution for DEC-POMDPs consists in performing an exhaustive search in the space of joint policies (Bernstein, Immerman, & Zilberstein 2002), but this approach is almost useless in practice, even for the smallest domains. In fact, the search should focus only on the policies that are likely to be dominant. There are two main approaches for finding these policies: top-down heuristics and bottom-up dynamic programming. MAA* was the first algorithm to use a top-down heuristic (Szer, Charpillet, & Zilberstein 2005). It is an adaptation of A* using a heuristic evaluation function to prune dominated nodes, where the nodes correspond to joint policies. A big disadvantage of such top-down search approaches is that the starting point must be known in advance.

On the other hand, the Dynamic Programming (DP) algorithm, proposed in (Hansen, Bernstein, & Zilberstein 2004), consists in constructing the optimal policies from leaves to root by using iterated policy elimination. The DP algorithm can solve problems that are infeasible with exhaustive search, but it keeps all the dominant policies for every point in the belief space, even those that will never be reached in practice. This problem has been efficiently addressed by the Point Based Dynamic Programming (PBDP) algorithm proposed in (Szer & Charpillet 2006). PBDP makes use of top-down heuristic search to determine which belief points will be reached during the execution, and constructs the best policy from leaves to root with DP, keeping only the policies that are dominant at the reachable points. In the same vein, Memory Bounded Dynamic Programming (MBDP), recently proposed in (Seuken & Zilberstein 2007) and close to PBDP, is based on bounding the maximum number of policies kept in memory after each iteration.

All these algorithms are based on reducing the number of policies to be evaluated or kept in memory. Alternatively, we can preserve the original policy space and use a more compact representation of the policies. In decision trees, policies are constructed by combining multiple sequences of actions and observations, and the same sequences can be replicated in different policies. The sequential representation takes advantage of this characteristic: all the possible sequences of a given length are represented explicitly, and each policy is represented by a binary vector that indicates which sequences are contained in this policy. This method has been applied efficiently to DEC-POMDPs in (Aras, Dutech, & Charpillet 2007); the proposed algorithm uses a mixed integer linear program to find the optimal policies, where each variable corresponds to a sequence.
However, there is no guarantee that the number of sequences will not exceed the number of policies; this can happen when we have a few policies with large horizons.

Decentralized POMDPs

DEC-POMDPs, introduced in (Bernstein, Immerman, & Zilberstein 2002), are a direct generalization of POMDPs to multi-agent systems. Formally, a DEC-POMDP with $n$ agents is a tuple $\langle I, S, \{A_i\}, P, \{\Omega_i\}, O, R, T, \gamma \rangle$, where:

$I$ is a finite set of agents, indexed $1 \ldots n$.
$S$ is a finite set of states.
$A_i$ is a finite set of individual actions for agent $i$. $\vec{A} = \times_{i \in I} A_i$ is the set of joint actions, and $\vec{a} = \langle a_1, \ldots, a_n \rangle$ denotes a joint action.
$P$ is a transition function: $P(s' | s, \vec{a})$ is the probability that the system changes from state $s$ to state $s'$ when the agents execute the joint action $\vec{a}$.
$\Omega_i$ is a finite set of individual observations for agent $i$. $\vec{\Omega} = \times_{i \in I} \Omega_i$ is the set of joint observations, and $\vec{o} = \langle o_1, \ldots, o_n \rangle$ denotes a joint observation.
$O$ is an observation function: $O(\vec{o} | s', \vec{a})$ is the probability that the agents observe $\vec{o}$ when the current state is $s'$ and the joint action that led to this state was $\vec{a}$.
$R$ is a reward function, where $R(s, \vec{a})$ denotes the reward (or cost) given for executing action $\vec{a}$ in state $s$.
$T$ is the planning horizon (the total number of steps).
$\gamma$ is a discount factor.

Planning algorithms for DEC-POMDPs aim to find the best joint policy of horizon $T$, which is a collection of several local policies, one for each agent. A local policy of horizon $t$ for agent $i$, denoted by $q_i^t$, is a mapping from local histories of observations $o_i^1 o_i^2 \ldots o_i^t$ to actions in $A_i$. Usually, we use decision trees to represent local policies: each node of the decision tree is labeled with an individual action $a_i$, and each arc is labeled with an individual observation $o_i$.

To ease notation, we consider in the remainder of this paper that we have only two agents, $i$ and $j$; all the results can be easily extended to the general case. A joint policy of horizon $t$ for agents $i$ and $j$ is denoted by $\vec{q}^{\,t} = \langle q_i^t, q_j^t \rangle$. We also use $Q_i^t$, $Q_j^t$ to indicate the sets of local policies for agents $i$ and $j$ respectively, and $\vec{Q}^t$ for the set of joint policies.

Since the states are partially observable, agents should choose their actions according to a belief on the current state. The belief state, denoted by the vector $b$, is a probability distribution over the different states.
In the single-agent context, the belief state is a sufficient statistic for making optimal decisions, but in multi-agent systems, the reward given to an individual action depends on all the actions taken by the other agents at the same time; the belief state is then no longer sufficient for making optimal decisions. In order to choose its action optimally, an agent should take into account which actions the other agents are going to execute. Ideally, every agent should know exactly the actions of the other agents, but unfortunately, this cannot be achieved without communication. In fact, the joint policy is provided to all the agents, and the first joint action can easily be predicted, since we just have to look at the first node of every local policy tree. But then, the next actions depend on the local observations perceived by each agent, and without communication, we can only consider a probability distribution over the actions (or the remaining subtrees) of the other agents. This distribution is what we call a multi-agent belief state.

The belief state $b_i$ for agent $i$ contains a probability distribution over the states $S$, and another probability distribution over the current policies $Q_j$ of agent $j$. Notice that $b_i(s, q_j)$ is the probability that the system is in state $s$ and the current policy of agent $j$ is $q_j$.
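The decision-tree representation of local policies described in this section can be sketched as a small data structure. This is only an illustration: the class name `PolicyTree`, its methods, and the action and observation labels are assumptions introduced here, not part of the original paper.

```python
from dataclasses import dataclass, field
from typing import Dict, Sequence


@dataclass
class PolicyTree:
    """Local policy: each node holds an individual action, each arc an observation."""
    action: str
    children: Dict[str, "PolicyTree"] = field(default_factory=dict)

    def horizon(self) -> int:
        """Number of decision steps covered by this tree."""
        if not self.children:
            return 1
        return 1 + max(child.horizon() for child in self.children.values())

    def action_after(self, observations: Sequence[str]) -> str:
        """Follow a local observation history from the root and return the action."""
        node = self
        for obs in observations:
            node = node.children[obs]
        return node.action


# Hypothetical horizon-2 policy for an agent with two observations.
q_i = PolicyTree("listen", {"hear-left": PolicyTree("open-right"),
                            "hear-right": PolicyTree("open-left")})
first_action = q_i.action_after([])               # root node: known to every agent
second_action = q_i.action_after(["hear-left"])   # depends on a private observation
```

The first action is read directly from the root, which is why the first joint action is predictable by all agents; every later action depends on a private observation history, which is what forces the other agents to hold a belief over the remaining subtrees.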

Dynamic Programming for DEC-POMDPs

Dynamic Programming is by far the most widely used technique for solving multistage decision problems, where the optimal policies of horizon $t$ are recursively constructed from the optimal sub-policies of horizon $t-1$. This method has been widely used for finding optimal finite-horizon policies for POMDPs since (Smallwood & Sondik 1971) presented the value iteration algorithm. Recently, (Hansen, Bernstein, & Zilberstein 2004) proposed an interesting extension of the value iteration algorithm to decentralized POMDPs, called the Dynamic Programming Operator for DEC-POMDPs. We briefly review here the principal steps of this algorithm.

The expected discounted reward of a joint policy $\vec{q}^{\,t}$, started from state $s$, is given recursively by the Bellman value function:

$$V_{\vec{q}^{\,t}}(s) = R(s, \vec{A}(\vec{q}^{\,t})) + \gamma \sum_{s' \in S} P(s' | s, \vec{A}(\vec{q}^{\,t})) \sum_{\vec{o} \in \vec{\Omega}} O(\vec{o} | s', \vec{A}(\vec{q}^{\,t})) \, V_{\vec{o}(\vec{q}^{\,t})}(s') \qquad (1)$$

where $\vec{A}(\vec{q}^{\,t})$ is the first joint action of the policy $\vec{q}^{\,t}$ (the root node), $\vec{o}$ is a joint observation, and $\vec{o}(\vec{q}^{\,t})$ is the sub-policy of $\vec{q}^{\,t}$ below the root node and the observation $\vec{o}$.

The value of an individual policy $q_i$, according to a belief state $b_i$, is given by the following function:

$$V_{q_i}(b_i) = \sum_{s \in S} \sum_{q_j \in Q_j} b_i(s, q_j) \, V_{\langle q_i, q_j \rangle}(s) \qquad (2)$$

where $\langle q_i, q_j \rangle$ denotes the joint policy made up of $q_i$ and $q_j$, and $V_{\langle q_i, q_j \rangle}(s)$ is given by equation 1.

The Dynamic Programming Operator (Algorithm 1) finds the optimal policies of horizon $t$, given the optimal policies of horizon $t-1$. $V^t$ is the set of value vectors $V_{\vec{q}^{\,t}}$ corresponding to the joint policies of horizon $t$. First, the sets $Q_i^t$, $Q_j^t$ are generated by extending the policies of $Q_i^{t-1}$, $Q_j^{t-1}$, and $V^t$ is calculated by using $V^{t-1}$ in equation 1; then the weakly dominated policies of each agent are iteratively pruned. The pruning process stops when no more policies can be removed from $Q_i^t$ or $Q_j^t$.

Algorithm 1: Dynamic Programming for DEC-POMDPs (Hansen, Bernstein, & Zilberstein 2004).
    Input: $Q_i^{t-1}$, $Q_j^{t-1}$ and $V^{t-1}$;
    $Q_i^t, Q_j^t \leftarrow$ fullBackup($Q_i^{t-1}$), fullBackup($Q_j^{t-1}$);
    Calculate the value vectors $V^t$ by using $V^{t-1}$ (equation 1);
    repeat
        remove the policies of $Q_i^t$ that are dominated (Table 1);
        remove the policies of $Q_j^t$ that are dominated (Table 1);
    until no more policies in $Q_i^t$ or $Q_j^t$ can be removed;
    Output: $Q_i^t$, $Q_j^t$ and $V^t$;
A policy $q_i$ is said to be weakly dominated if and only if:

$$\forall b_i \in \Delta(S \times Q_j), \ \exists q'_i \in Q_i \setminus \{q_i\} : V_{q'_i}(b_i) \geq V_{q_i}(b_i) \qquad (3)$$

Table 1: The linear program used to check if a policy $q_i$ is dominated or not (Hansen, Bernstein, & Zilberstein 2004).
    minimize: $\varepsilon$
    subject to:
        $\sum_{s \in S} \sum_{q_j \in Q_j} b_i(s, q_j) = 1$
        $\forall s \in S, \forall q_j \in Q_j : 0 \leq b_i(s, q_j) \leq 1$
        $\forall q'_i \in Q_i \setminus \{q_i\} : \sum_{s \in S} \sum_{q_j \in Q_j} b_i(s, q_j) [V_{\langle q_i, q_j \rangle}(s) - V_{\langle q'_i, q_j \rangle}(s)] + \varepsilon > 0$

Policy Space Compression

Motivation

The main problem with DEC-POMDPs, compared to POMDPs, is the dimensionality of the policy space. In fact, the number of policies grows doubly exponentially with respect to the planning horizon and the number of observations. If we get $|Q_i^{t-1}|$ policies for agent $i$ at step $t-1$, then $|A_i| |Q_i^{t-1}|^{|\Omega_i|}$ new policies will be created at step $t$. This curse of dimensionality has dramatic consequences on both the time and space complexity of DEC-POMDP algorithms.

From Algorithm 1, we can see that the dynamic programming operator spends most of its time determining the weakly dominated policies by checking the inequality (3) for every policy $q_i$. The usual approach for performing this test is to use the linear program of Table 1. The objective function to be minimized is defined by $\varepsilon$, which is the greatest difference between the value of $q_i$ and the value of any other policy $q'_i$ over the belief space. If $\varepsilon \geq 0$ then the policy $q_i$ is dominated and should be removed, and if $\varepsilon < 0$, then there is some region in $\Delta(S \times Q_j)$ where $q_i$ is dominant. The variables are $\varepsilon$ and the probabilities $b_i(\cdot, \cdot)$ of the multi-agent belief state, so there are $|S||Q_j| + 1$ variables. The time complexity of a linear program solver depends on the number of variables and constraints defined in the problem, so it depends directly on the number of policies and the way we represent the beliefs over these policies.

However, the main problem of Dynamic Programming (Algorithm 1) remains the size of the memory space required to represent the value vectors $V^t$ for each joint policy. The memory space required to represent these vectors at step $t$ is $|Q_i^t||Q_j^t||S|$ floats. Indeed, this algorithm runs out of memory several iterations before running out of time.
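The growth formula and the memory estimate given above can be made concrete with a few lines of arithmetic. The sketch below is illustrative only: the function names are introduced here, the problem sizes (3 individual actions, 2 individual observations, 2 states) are assumptions chosen so that the horizon-2 count matches the (27, 27) entry of Table 3, and no pruning is applied between steps.

```python
from typing import List


def backup_size(num_actions: int, num_prev_policies: int, num_observations: int) -> int:
    """|A_i| * |Q_i^{t-1}| ** |Omega_i|: policies created by one exhaustive backup."""
    return num_actions * num_prev_policies ** num_observations


def unpruned_sizes(num_actions: int, num_observations: int, horizon: int) -> List[int]:
    """Policy counts for horizons 1..horizon when nothing is ever pruned."""
    sizes = [num_actions]  # at horizon 1 each policy is a single action
    for _ in range(horizon - 1):
        sizes.append(backup_size(num_actions, sizes[-1], num_observations))
    return sizes


def value_table_floats(n_policies_i: int, n_policies_j: int, n_states: int) -> int:
    """|Q_i^t| * |Q_j^t| * |S|: floats needed to store all joint value vectors."""
    return n_policies_i * n_policies_j * n_states


sizes = unpruned_sizes(num_actions=3, num_observations=2, horizon=4)
floats_at_horizon_3 = value_table_floats(sizes[2], sizes[2], n_states=2)
```

Under these assumed sizes the unpruned counts are 3, 27, 2187 and 14348907 policies per agent, so the joint value table at horizon 3 alone already needs about 9.6 million floats; pruning slows this growth but does not change its doubly exponential form.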

[Figure 1: Reducing the policy space dimensionality. The figure shows the decision trees of the four horizon-3 policies in $Q_j = \{q_a, q_b, q_c, q_d\}$ (dimension = 4), the three basis sequences $\tilde{q}^1, \tilde{q}^2, \tilde{q}^3$ of Basis($Q_j$) (dimension = 3), and the factorization $U_j = F_j \tilde{U}_j$ of the outcome matrix. The rows of $U_j$ and $F_j$ correspond to $q_a, q_b, q_c, q_d$; the columns of $U_j$ correspond, in order, to the sequences $a^1 o^1 a^2 o^1 a^2$, $a^1 o^1 a^2 o^1 a^1$, $a^1 o^1 a^2 o^2 a^3$, $a^1 o^1 a^2 o^2 a^1$, $a^1 o^2 a^1 o^2 a^3$, $a^1 o^2 a^1 o^2 a^1$, $a^1 o^2 a^1 o^1 a^2$, $a^1 o^2 a^1 o^1 a^1$.]

$$U_j = \begin{pmatrix} 1&0&1&0&1&0&1&0 \\ 1&0&0&1&1&0&0&1 \\ 0&1&1&0&0&1&1&0 \\ 0&1&0&1&0&1&0&1 \end{pmatrix}, \quad F_j = \begin{pmatrix} 1&0&1 \\ 1&0&0 \\ 0&1&1 \\ 0&1&0 \end{pmatrix}, \quad \tilde{U}_j = \begin{pmatrix} 1&0&0&1&1&0&0&1 \\ 0&1&0&1&0&1&0&1 \\ 0&0&1&-1&0&0&1&-1 \end{pmatrix}$$

To alleviate these problems, we need to use a more compact technique to represent the belief of agent $i$ on the policies of agent $j$, instead of the naive probability distribution over all the policies $Q_j$. We can exploit the structure of the policies, and find some features $\tilde{Q}_j = \{\tilde{q}_j^1, \ldots, \tilde{q}_j^{|\tilde{Q}_j|}\}$ which constitute sufficient information: each point in $\Delta Q_j$ will correspond to a point in $\Delta \tilde{Q}_j$, and vice versa. We should also guarantee that $|\tilde{Q}_j| \leq |Q_j|$ and that most of the time $|\tilde{Q}_j| \ll |Q_j|$.

A policy is a collection of sequences of actions and observations. A sequence $\bar{q}_i$ of horizon $t$ for agent $i$ is an ordered list of $t$ individual actions and $t-1$ individual observations $(a_i^1, o_i^1, \ldots, o_i^{t-1}, a_i^t)$. A joint sequence $\vec{\bar{q}}$ of horizon $t$ is a couple $\langle \bar{q}_i, \bar{q}_j \rangle$ where $\bar{q}_i$ is an individual sequence for agent $i$ and $\bar{q}_j$ is an individual sequence for agent $j$; a joint sequence can also be seen as one ordered list of joint actions and observations.

The set $Q_j$ can be completely replaced, without any loss of information, by a matrix $U_j$ called the outcome matrix (Singh, James, & Rudary 2004). Each row of $U_j$ corresponds to a policy $q_j \in Q_j$ and each column corresponds to a sequence $\bar{q}_j$. $U_j(q_j, \bar{q}_j)$ is defined as the probability that agent $j$ will execute the actions of $\bar{q}_j$ if the observations of $\bar{q}_j$ occur, given that the actual policy of agent $j$ is $q_j$. Since the policies are deterministic, we have $U_j(q_j, \bar{q}_j) = 1$ if the sequence $\bar{q}_j$ appears in the policy $q_j$, and $U_j(q_j, \bar{q}_j) = 0$ otherwise. Figure 1 shows a set $Q_j$ containing 4 individual policies for agent $j$: $q_a$, $q_b$, $q_c$ and $q_d$. There are 8 different sequences in these policies, so the matrix $U_j$ has 4 rows and 8 columns. If $b_i(s, \cdot)$ is a multi-agent belief state, i.e. a probability distribution over the policies of $Q_j$ for some state $s \in S$, then the product $b_i(s, \cdot) U_j$ returns a vector containing the probability of every sequence of $\bar{Q}_j$.

To reduce the dimensionality of $b_i(s, \cdot)$ from $|Q_j|$ to $N$, we should find a transformation function $f$ defined by:

$$f : \Delta Q_j \rightarrow [0,1]^N, \quad b_i(s, \cdot) \mapsto \tilde{b}_i(s, \cdot)$$

$\tilde{b}_i(s, \cdot)$ is called the reduced multi-agent belief state; it corresponds to the belief about the sequences, whereas $b_i(s, \cdot)$ is the belief about the policies. In order for $f$ to be an accurate transformation function, we should be able to make predictions about any sequence $\bar{q}_j$ by using only the vector $\tilde{b}_i(s, \cdot)$.

Linear reduction of policy space dimensionality

The outcome matrix $U_j$ can be factorized as follows:

$$U_j = F_j \tilde{U}_j$$

In this case, our transformation function is simply the matrix $F_j$. The reduced belief state $\tilde{b}_i$ can be generated from $b_i$ by:

$$\tilde{b}_i(s, \cdot) = b_i(s, \cdot) F_j$$

and the probabilities of the sequences of $\bar{Q}_j$ can be found by:

$$b_i(s, \cdot) U_j = b_i(s, \cdot) (F_j \tilde{U}_j) = \tilde{b}_i(s, \cdot) \tilde{U}_j$$

The matrix $F_j$ is a basis for the matrix $U_j$; it is formed by linearly independent columns of $U_j$. We use Basis($Q_j$) to indicate the set of sequences corresponding to these linearly independent columns. In Figure 1, Basis($Q_j$) contains the sequences $\tilde{q}^1$, $\tilde{q}^2$ and $\tilde{q}^3$. Note that $|\text{Basis}(Q_j)| \leq |Q_j|$, because the rank of a matrix cannot be more than the number of rows in this matrix, and $F_j$ has exactly $|Q_j|$ rows.

Since $F_j$ is a basis for the outcome matrix $U_j$, we can write any column of $U_j$ as a linear combination of the columns of $F_j$:

$$\Pr(\bar{q}_j | q_j) = U_j(q_j, \bar{q}_j) = F_j(q_j, \cdot) \, w_{\bar{q}_j}$$

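where $w_{\bar{q}_j}$ is a weight vector¹ associated to the sequence $\bar{q}_j$ (we have one weight vector per sequence). Equivalently, $w_{\bar{q}_j}$ is the column corresponding to $\bar{q}_j$ in the matrix $\tilde{U}_j$. If $\tilde{b}_i(s, \cdot) = b_i(s, \cdot) F_j$ is a reduced belief state, then:

$$\Pr(s, \bar{q}_j | b_i) = \sum_{q_j \in Q_j} b_i(s, q_j) \Pr(\bar{q}_j | q_j) = \sum_{q_j \in Q_j} b_i(s, q_j) \left( F_j(q_j, \cdot) \, w_{\bar{q}_j} \right) = \tilde{b}_i(s, \cdot) \, w_{\bar{q}_j}$$

This means that given a reduced belief state, the probability of any sequence is a linear combination of the probabilities contained in this reduced belief state.

Finding the basis sequences

The main problem now is how to find the basis sequences of $Q_j$ and their matrix $F_j$ without factorizing the entire matrix $U_j$ at each step. The theorem below states that if Basis($Q^{t-1}$) is the set of basis sequences for horizon $t-1$, then the sequences of Basis($Q^t$) for horizon $t$ will be among the one-step extensions of the Basis($Q^{t-1}$) sequences. This means that we need to extend only the sequences of Basis($Q^{t-1}$) at each step, and construct the matrix $F^t$ whose columns correspond to these extended sequences. $F^t$ may contain some linearly dependent columns, along with the basis columns. Thus, we use a Gauss-Jordan elimination on $F^t$, or any other decomposition technique, to extract Basis($Q^t$) in polynomial time.

Theorem 1. Let $Q^{t-1}$ be a set of horizon $t-1$ policies, $\bar{Q}^{t-1}$ the set of horizon $t-1$ sequences corresponding to $Q^{t-1}$, $Q^t$ the set of horizon $t$ policies created from $Q^{t-1}$ by an exhaustive backup, and $\bar{Q}^t$ the set of horizon $t$ sequences corresponding to $Q^t$. Then:

$$\text{Basis}(Q^t) \subseteq \{ a o \tilde{q}^{t-1} : a \in A, o \in \Omega, \tilde{q}^{t-1} \in \text{Basis}(Q^{t-1}) \}$$

Proof. This theorem claims that the sequences of Basis($Q^t$) are contained in the one-step extensions of the Basis($Q^{t-1}$) sequences. In other words, $\forall \bar{q}^{t-1} \in \bar{Q}^{t-1}, \forall a \in A, \forall o \in \Omega$, we should be able to write the column $U^t(\cdot, a o \bar{q}^{t-1})$ as a linear combination of the columns of $F^t$, which is the sub-matrix of $U^t$ corresponding to the sequences $\{ a o \tilde{q}^{t-1} : a \in A, o \in \Omega, \tilde{q}^{t-1} \in \text{Basis}(Q^{t-1}) \}$. $U^t$ is the outcome matrix of $Q^t$, $U^{t-1}$ is the outcome matrix of $Q^{t-1}$, and $F^{t-1}$ is the sub-matrix of $U^{t-1}$ corresponding to the sequences of Basis($Q^{t-1}$).

Let $q$ be a policy of $Q^t$, $\bar{q}^{t-1}$ a sequence of $\bar{Q}^{t-1}$, $a$ an action of $A$ and $o$ an observation of $\Omega$. We are then in one of the following two situations:

Case 1: the policy tree $q$ starts with an action different from $a$. Then the sequence $a o \bar{q}^{t-1}$ does not appear in the policy $q$, so $U^t(q, a o \bar{q}^{t-1}) = 0$.

Case 2: the policy tree $q$ starts with the action $a$. Then the sequence $a o \bar{q}^{t-1}$ appears in the policy $q$ iff $\bar{q}^{t-1}$ appears in $o(q)$, the sub-tree of $q$ below the first action $a$ and the observation $o$. We then have $U^t(q, a o \bar{q}^{t-1}) = U^{t-1}(o(q), \bar{q}^{t-1})$ and $\forall \tilde{q} \in \text{Basis}(Q^{t-1}) : F^t(q, a o \tilde{q}) = F^{t-1}(o(q), \tilde{q})$. So²:

$$U^t(q, a o \bar{q}^{t-1}) = U^{t-1}(o(q), \bar{q}^{t-1}) = \sum_{\tilde{q} \in \text{Basis}(Q^{t-1})} F^{t-1}(o(q), \tilde{q}) \, w_{\bar{q}^{t-1}}(\tilde{q}) = \sum_{\tilde{q} \in \text{Basis}(Q^{t-1})} F^t(q, a o \tilde{q}) \, w_{a o \bar{q}^{t-1}}(a o \tilde{q}) = F^t(q, \cdot) \, w_{a o \bar{q}^{t-1}}$$

where we define $w_{a o \bar{q}^{t-1}}$ as follows:

$$w_{a o \bar{q}^{t-1}}(a' o' \tilde{q}) = \begin{cases} w_{\bar{q}^{t-1}}(\tilde{q}) & \text{if } a' = a \text{ and } o' = o, \\ 0 & \text{otherwise.} \end{cases}$$

Cases 1 and 2 can be grouped together (in Case 1 every entry $F^t(q, a o \tilde{q})$ is zero), so in both cases:

$$U^t(q, a o \bar{q}^{t-1}) = F^t(q, \cdot) \, w_{a o \bar{q}^{t-1}}$$

Notice that the vector $w_{a o \bar{q}^{t-1}}$ is defined independently of the policy $q$. So the sequences $\{ a o \tilde{q} : a \in A, o \in \Omega, \tilde{q} \in \text{Basis}(Q^{t-1}) \}$ are indeed a basis for the matrix $U^t$.

¹ To simplify the notation, we consider that the belief vectors are rows and the weight vectors are columns.
² We use $\tilde{q}$ to indicate a basis sequence; the horizon of this sequence can be inferred from the context.

Reduced value vectors

Since we want to use this representation for planning, we have to redefine the value function (equations 1 and 2), given that the belief state is now about sequences instead of policies. First, we define the expected value $V_{\vec{\bar{q}}^t}(s)$ of a joint sequence $\vec{\bar{q}}^t = \vec{a}^1 \vec{o}^1 \vec{a}^2 \vec{o}^2 \ldots \vec{a}^t$ in state $s$ as follows:

$$V_{\vec{\bar{q}}^t}(s) = \Pr(\vec{\bar{q}}^t | s) \, R_{\vec{\bar{q}}^t}(s) = \Pr(\vec{o}^1 \vec{o}^2 \ldots \vec{o}^{t-1} | s^1 = s, \vec{a}^1 \vec{a}^2 \ldots \vec{a}^{t-1}) \, R_{\vec{\bar{q}}^t}(s)$$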

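$\Pr(\vec{o}^1 \vec{o}^2 \ldots \vec{o}^{t-1} | s, \vec{a}^1 \vec{a}^2 \ldots \vec{a}^{t-1})$ is the probability that the observations of $\vec{\bar{q}}^t$ will occur if we start executing the actions of $\vec{\bar{q}}^t$ at $s$, and $R_{\vec{\bar{q}}^t}(s)$ is the reward expected from the actions of $\vec{\bar{q}}^t$ given that the observations of $\vec{\bar{q}}^t$ occur. $R_{\vec{\bar{q}}^t}(s)$ is given by:

$$R_{\vec{\bar{q}}^t}(s) = \sum_{k=1}^{t} \gamma^{k-1} \sum_{s' \in S} \Pr(s^k = s' | s^1 = s, \vec{a}^1 \vec{o}^1 \ldots \vec{a}^t) \, R(s', \vec{a}^k)$$

$$= \frac{\sum_{k=1}^{t} \gamma^{k-1} \sum_{s' \in S} \Pr(s^k = s', \vec{o}^1 \ldots \vec{o}^{t-1} | s^1 = s, \vec{a}^1 \ldots \vec{a}^{t-1}) \, R(s', \vec{a}^k)}{\Pr(\vec{o}^1 \vec{o}^2 \ldots \vec{o}^{t-1} | s^1 = s, \vec{a}^1 \vec{a}^2 \ldots \vec{a}^{t-1})}$$

$$= \frac{\sum_{k=1}^{t} \gamma^{k-1} \sum_{s' \in S} \alpha_k(s, s', \vec{\bar{q}}^t) \, \beta_k(s', \vec{\bar{q}}^t) \, R(s', \vec{a}^k)}{\Pr(\vec{o}^1 \vec{o}^2 \ldots \vec{o}^{t-1} | s^1 = s, \vec{a}^1 \vec{a}^2 \ldots \vec{a}^{t-1})}$$

where:

$$\alpha_k(s, s', \vec{\bar{q}}^t) = \Pr(\vec{o}^1 \ldots \vec{o}^{k-1}, s^k = s' | s^1 = s, \vec{a}^1 \ldots \vec{a}^{k-1})$$
$$\beta_k(s', \vec{\bar{q}}^t) = \Pr(\vec{o}^k \ldots \vec{o}^{t-1} | s^k = s', \vec{a}^k \ldots \vec{a}^{t-1})$$

We applied Bayes' rule, and then decomposed $\Pr(s^k = s', \vec{o}^1 \ldots \vec{o}^{t-1} | s^1 = s, \vec{a}^1 \ldots \vec{a}^{t-1})$ into $\alpha_k(s, s', \vec{\bar{q}}^t)$ and $\beta_k(s', \vec{\bar{q}}^t)$, taking advantage of the Markov property. So the expected value of a joint sequence $\vec{\bar{q}}^t$ is given by:

$$V_{\vec{\bar{q}}^t}(s) = \sum_{k=1}^{t} \gamma^{k-1} \sum_{s' \in S} \alpha_k(s, s', \vec{\bar{q}}^t) \, \beta_k(s', \vec{\bar{q}}^t) \, R(s', \vec{a}^k)$$

This value is calculated recursively as follows, where $\vec{a} = \vec{a}^1$ and $\vec{o} = \vec{o}^1$ are the first joint action and observation of the sequence $\vec{a} \vec{o} \vec{\bar{q}}^{t-1}$:

$$V_{\vec{a} \vec{o} \vec{\bar{q}}^{t-1}}(s) = \sum_{k=1}^{t} \gamma^{k-1} \sum_{s' \in S} \alpha_k(s, s', \vec{a} \vec{o} \vec{\bar{q}}^{t-1}) \, \beta_k(s', \vec{a} \vec{o} \vec{\bar{q}}^{t-1}) \, R(s', \vec{a}^k)$$

$$= \beta_1(s, \vec{a} \vec{o} \vec{\bar{q}}^{t-1}) R(s, \vec{a}) + \sum_{k=2}^{t} \gamma^{k-1} \sum_{s' \in S} \alpha_k(s, s', \vec{a} \vec{o} \vec{\bar{q}}^{t-1}) \, \beta_k(s', \vec{a} \vec{o} \vec{\bar{q}}^{t-1}) \, R(s', \vec{a}^k)$$

$$= \beta_1(s, \vec{a} \vec{o} \vec{\bar{q}}^{t-1}) R(s, \vec{a}) + \gamma \sum_{k=1}^{t-1} \gamma^{k-1} \sum_{s' \in S} \sum_{s'' \in S} \left[ P(s'' | s, \vec{a}) \, O(\vec{o} | s'', \vec{a}) \, \alpha_k(s'', s', \vec{\bar{q}}^{t-1}) \, \beta_k(s', \vec{\bar{q}}^{t-1}) \, R(s', \vec{a}^{k+1}) \right]$$

$$= \beta_1(s, \vec{a} \vec{o} \vec{\bar{q}}^{t-1}) R(s, \vec{a}) + \gamma \sum_{s'' \in S} P(s'' | s, \vec{a}) \, O(\vec{o} | s'', \vec{a}) \, V_{\vec{\bar{q}}^{t-1}}(s'')$$

We used the following properties: $\alpha_1(s, s, \vec{a} \vec{o} \vec{\bar{q}}^{t-1}) = 1$, $\alpha_1(s, s', \vec{a} \vec{o} \vec{\bar{q}}^{t-1}) = 0$ for $s' \neq s$, $\beta_k(s', \vec{a} \vec{o} \vec{\bar{q}}^{t-1}) = \beta_{k-1}(s', \vec{\bar{q}}^{t-1})$ for $k \geq 2$, and $\alpha_k(s, s', \vec{a} \vec{o} \vec{\bar{q}}^{t-1}) = \sum_{s'' \in S} P(s'' | s, \vec{a}) \, O(\vec{o} | s'', \vec{a}) \, \alpha_{k-1}(s'', s', \vec{\bar{q}}^{t-1})$ for $k \geq 2$.

The value function of an individual policy $q_i$ in a reduced belief state $\tilde{b}_i$ is given by:

$$V_{q_i}(\tilde{b}_i) = \sum_{s \in S} \sum_{\bar{q}_i \in \bar{Q}_i} \sum_{\bar{q}_j \in \bar{Q}_j} \Pr(s, \bar{q}_j | \tilde{b}_i) \Pr(\bar{q}_i | q_i) \, V_{\langle \bar{q}_i, \bar{q}_j \rangle}(s) = \sum_{s \in S} \sum_{\substack{\tilde{q}_i \in \text{Basis}(Q_i) \\ F_i(q_i, \tilde{q}_i) = 1}} \sum_{\tilde{q}_j \in \text{Basis}(Q_j)} \tilde{b}_i(s, \tilde{q}_j) \, \tilde{V}_{\langle \tilde{q}_i, \tilde{q}_j \rangle}(s)$$

where:

$$\tilde{V}_{\langle \tilde{q}_i, \tilde{q}_j \rangle}(s) \stackrel{def}{=} \sum_{\bar{q}_i \in \bar{Q}_i} \sum_{\bar{q}_j \in \bar{Q}_j} w_{\bar{q}_i}(\tilde{q}_i) \, w_{\bar{q}_j}(\tilde{q}_j) \, V_{\langle \bar{q}_i, \bar{q}_j \rangle}(s)$$

We substituted the probability of each sequence $\bar{q}_j$ with a linear combination of the probabilities $\tilde{b}_i(s, \tilde{q}_j)$ of the basis sequences $\tilde{q}_j$, and the probability of each sequence $\bar{q}_i$ with a linear combination of the probabilities $F_i(q_i, \tilde{q}_i) \in \{0, 1\}$ of the basis sequences $\tilde{q}_i$.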
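The recursion for the value of a joint sequence can be sketched on a toy model. Everything in this block is an assumption made for illustration: the two-state model, its numbers, and the function names are not from the paper, and joint actions and observations are encoded as plain integers. As a sanity check that follows from the definitions, the values of the sequences contained in a policy tree should sum to the Bellman value of that tree (equation 1).

```python
from typing import List, Sequence, Tuple

# Hypothetical toy model; a and o stand for JOINT actions and observations.
GAMMA = 0.9
P = [  # P[s][a][s2]: probability of moving from s to s2 under action a
    [[0.7, 0.3], [0.2, 0.8]],
    [[0.4, 0.6], [0.9, 0.1]],
]
O = [  # O[s2][a][o]: probability of observing o in arrival state s2 after a
    [[0.8, 0.2], [0.5, 0.5]],
    [[0.3, 0.7], [0.5, 0.5]],
]
R = [[1.0, 0.0], [-1.0, 2.0]]  # R[s][a]
STATES = range(2)


def sequence_value(seq: Sequence[int], s: int) -> Tuple[float, float]:
    """Return (Pr(observations of seq | s, actions of seq), V_seq(s)).

    seq alternates actions and observations: [a1, o1, a2, o2, ..., at].
    """
    a = seq[0]
    if len(seq) == 1:
        return 1.0, R[s][a]  # horizon 1: no observation, value is the reward
    o, rest = seq[1], seq[2:]
    prob, future = 0.0, 0.0
    for s2 in STATES:
        step = P[s][a][s2] * O[s2][a][o]
        p_rest, v_rest = sequence_value(rest, s2)
        prob += step * p_rest      # beta_1: probability of all the observations
        future += step * v_rest    # sum over s'' of P * O * V_rest(s'')
    return prob, prob * R[s][a] + GAMMA * future


def policy_value(root: int, after: List[int], s: int) -> float:
    """Bellman value of the horizon-2 tree: root action, then after[o]."""
    total = R[s][root]
    for s2 in STATES:
        for o in range(2):
            total += GAMMA * P[s][root][s2] * O[s2][root][o] * R[s2][after[o]]
    return total


root, after = 0, [1, 0]
summed = sum(sequence_value([root, o, after[o]], 0)[1] for o in range(2))
direct = policy_value(root, after, 0)
```

The first component returned by `sequence_value` follows the same one-step recursion as the probability vectors in equation 5, so a single pass computes both quantities needed for a backup.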
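The reduced vector $\tilde{V}_{\langle \tilde{q}_i, \tilde{q}_j \rangle}$ defines the contribution of $\langle \tilde{q}_i, \tilde{q}_j \rangle$ to the value of a joint policy, by including a proportion of the values of all the sequences which depend on $\langle \tilde{q}_i, \tilde{q}_j \rangle$. If we want to calculate the value of a policy $q_i$ for a given multi-agent belief state $\tilde{b}_i$, all we need are the vectors $\tilde{V}_{\langle \tilde{q}_i, \tilde{q}_j \rangle}$ and the matrix $F_i$.

We will now see how to calculate the value vectors $\tilde{V}^t$ given the value vectors $\tilde{V}^{t-1}$. At horizon 1, each policy is a single action. We then have Basis($Q_i^1$) $= A_i$, and $\tilde{V}_{\langle a_i, a_j \rangle}(s) = V_{\langle a_i, a_j \rangle}(s) = R(s, a_i, a_j)$. At horizon $t > 1$, we know from Theorem 1 that the basis sequences are of the form $a_i o_i \tilde{q}_i$ and $a_j o_j \tilde{q}_j$, where $\tilde{q}_i \in \text{Basis}(Q_i^{t-1})$ and $\tilde{q}_j \in \text{Basis}(Q_j^{t-1})$:

$$\tilde{V}_{\langle a_i o_i \tilde{q}_i, a_j o_j \tilde{q}_j \rangle}(s) = \sum_{\bar{q}_i^{t-1} \in \bar{Q}_i^{t-1}} \sum_{\bar{q}_j^{t-1} \in \bar{Q}_j^{t-1}} w_{a_i o_i \bar{q}_i^{t-1}}(a_i o_i \tilde{q}_i) \, w_{a_j o_j \bar{q}_j^{t-1}}(a_j o_j \tilde{q}_j) \, V_{\langle a_i o_i \bar{q}_i^{t-1}, a_j o_j \bar{q}_j^{t-1} \rangle}(s)$$

$$= R(s, a_i, a_j) \, C_{\langle a_i o_i \tilde{q}_i, a_j o_j \tilde{q}_j \rangle}(s) + \gamma \sum_{s' \in S} \Pr(o_i, o_j, s' | s, a_i, a_j) \, \tilde{V}_{\langle \tilde{q}_i, \tilde{q}_j \rangle}(s') \qquad (4)$$

where:

$$C_{\langle a_i o_i \tilde{q}_i, a_j o_j \tilde{q}_j \rangle}(s) \stackrel{def}{=} \sum_{\bar{q}_i^{t-1} \in \bar{Q}_i^{t-1}} \sum_{\bar{q}_j^{t-1} \in \bar{Q}_j^{t-1}} w_{a_i o_i \bar{q}_i^{t-1}}(a_i o_i \tilde{q}_i) \, w_{a_j o_j \bar{q}_j^{t-1}}(a_j o_j \tilde{q}_j) \Pr(\langle a_i o_i \bar{q}_i^{t-1}, a_j o_j \bar{q}_j^{t-1} \rangle | s)$$

$$= \sum_{s' \in S} P(s' | s, a_i, a_j) \, O(o_i, o_j | s', a_i, a_j) \, C_{\langle \tilde{q}_i, \tilde{q}_j \rangle}(s') \qquad (5)$$

In order to calculate the value vector $\tilde{V}_{\langle a_i o_i \tilde{q}_i, a_j o_j \tilde{q}_j \rangle}$, we only need to know the vectors $C_{\langle \tilde{q}_i, \tilde{q}_j \rangle}$ and $\tilde{V}_{\langle \tilde{q}_i, \tilde{q}_j \rangle}$, provided by the last iteration of the dynamic programming algorithm.

Algorithm

Algorithm 2 describes the main steps of the dynamic programming algorithm where the policies are evaluated in a reduced dimensional space. We keep the same structures $Q_i$ and $Q_j$ used in Algorithm 1. The value vectors $V$ are replaced by lower dimensional vectors $\tilde{V}$. We also need to maintain the lists Basis($Q_i$), Basis($Q_j$), and the probability vectors $C$. The matrix $F_i$ (resp. $F_j$) is implicitly represented by specifying for each basis sequence the list of policies where this sequence occurs.

Algorithm 2: Dynamic Programming for DEC-POMDPs with Lossless Policy Belief Compression.
    Input: $Q_i^{t-1}$, $Q_j^{t-1}$, Basis($Q_i^{t-1}$), Basis($Q_j^{t-1}$), $\tilde{V}_i^{t-1}$, $\tilde{V}_j^{t-1}$, $C_i^{t-1}$, $C_j^{t-1}$;
    $Q_i^t, Q_j^t \leftarrow$ fullBackup($Q_i^{t-1}$), fullBackup($Q_j^{t-1}$);
    Basis($Q_i^t$) $\leftarrow A_i \times \Omega_i \times$ Basis($Q_i^{t-1}$);
    Basis($Q_j^t$) $\leftarrow A_j \times \Omega_j \times$ Basis($Q_j^{t-1}$);
    Calculate the vectors $C^t$ by using $C^{t-1}$ (Equation 5);
    Calculate the vectors $\tilde{V}^t$ by using $\tilde{V}^{t-1}$ (Equation 4);
    repeat
        remove the policies of $Q_i^t$ that are dominated (Table 2);
        remove the policies of $Q_j^t$ that are dominated (Table 2);
    until no more policies in $Q_i^t$ or $Q_j^t$ can be removed;
    removeDependence($Q_i^t$, Basis($Q_i^t$), Basis($Q_j^t$), $C^t$, $\tilde{V}^t$);
    removeDependence($Q_j^t$, Basis($Q_j^t$), Basis($Q_i^t$), $C^t$, $\tilde{V}^t$);
    Output: $Q_i^t$, $Q_j^t$, Basis($Q_i^t$), Basis($Q_j^t$), $\tilde{V}_i^t$, $\tilde{V}_j^t$, $C^t$;

Algorithm 3: Removing the dependent sequences from Basis($Q_i^t$) and updating the vectors $C^t$, $\tilde{V}^t$.
    Input: $Q_i^t$, Basis($Q_i^t$), Basis($Q_j^t$), $C^t$, $\tilde{V}^t$;
    Use a decomposition method to find the linearly dependent sequences in Basis($Q_i^t$), and remove them from Basis($Q_i^t$);
    foreach removed sequence $\bar{q}_i$ from Basis($Q_i^t$) do
        foreach basis sequence $\tilde{q}_i$ from Basis($Q_i^t$) do
            foreach basis sequence $\tilde{q}_j$ from Basis($Q_j^t$) do
                $C_{\langle \tilde{q}_i, \tilde{q}_j \rangle} \leftarrow C_{\langle \tilde{q}_i, \tilde{q}_j \rangle} + w_{\bar{q}_i}(\tilde{q}_i) \, C_{\langle \bar{q}_i, \tilde{q}_j \rangle}$;
                $\tilde{V}_{\langle \tilde{q}_i, \tilde{q}_j \rangle} \leftarrow \tilde{V}_{\langle \tilde{q}_i, \tilde{q}_j \rangle} + w_{\bar{q}_i}(\tilde{q}_i) \, \tilde{V}_{\langle \bar{q}_i, \tilde{q}_j \rangle}$;
            end
        end
    end
    Output: Updated Basis($Q_i^t$), $C^t$, $\tilde{V}^t$;

At step $t = 1$, we have $Q_i^1 = \text{Basis}(Q_i^1) = A_i$, $Q_j^1 = \text{Basis}(Q_j^1) = A_j$ and $\forall a_i \in A_i, \forall a_j \in A_j : \tilde{V}_{\langle a_i, a_j \rangle} = R(\cdot, a_i, a_j)$, $C_{\langle a_i, a_j \rangle} = 1$. At step $t > 1$, $Q_i^t$ and $Q_j^t$ are the sets of all possible policies of horizon $t$ whose sub-policies of horizon $t-1$ are in $Q_i^{t-1}$ and $Q_j^{t-1}$ respectively. Basis($Q_i^t$) and Basis($Q_j^t$) are formed by one-step extensions of Basis($Q_i^{t-1}$) and Basis($Q_j^{t-1}$) respectively. We then calculate the probability vectors $C^t$ by using the vectors $C^{t-1}$ in Equation (5), and the value vectors $\tilde{V}^t$ by using the vectors $C^t$ and $\tilde{V}^{t-1}$ in Equation (4).

The vectors $\tilde{V}^t$ are used to determine which policies of $Q_i^t$ and $Q_j^t$ are dominated and should be removed. We use the linear program of Table 2 to solve this problem for each policy $q_i$. The variables of this linear program are $\varepsilon$ and the probabilities $\tilde{b}_i(\cdot, \cdot)$ of the reduced multi-agent belief state, so we have $|S||\text{Basis}(Q_j)| + 1$ variables.

Table 2: The linear program used to determine if a policy $q_i$ is dominated or not, with a reduced belief space.
    minimize: $\varepsilon$
    subject to:
        $\forall s \in S, \forall \tilde{q}_j \in \text{Basis}(Q_j) : 0 \leq \tilde{b}_i(s, \tilde{q}_j) \leq 1$
        $\forall q'_i \in Q_i \setminus \{q_i\} : V_{q_i}(\tilde{b}_i) - V_{q'_i}(\tilde{b}_i) + \varepsilon > 0$

Notice that, contrary to the original belief state $b_i(s, \cdot)$, which is a probability distribution over the policies of agent $j$, the reduced belief state $\tilde{b}_i(s, \cdot)$ is not necessarily a probability distribution, because the sequences are not mutually exclusive. In Figure 1 for example, if $b(s, q_a) = 1$ then $\tilde{b}(s, \tilde{q}^1) = 1$ and $\tilde{b}(s, \tilde{q}^3) = 1$. The relation between the different basis sequences is more complex, and more constraints should be added to make sure that any reduced belief considered in the linear program really corresponds to some belief in the original space. In fact, if we take any belief $b_i(s, \cdot)$, we can always find a reduced belief $\tilde{b}_i(s, \cdot) = b_i(s, \cdot) F_j$ where all the policies keep the same values, but given a reduced belief $\tilde{b}_i(s, \cdot)$, we are not sure of finding a belief $b_i(s, \cdot)$ in the original space such that $\tilde{b}_i(s, \cdot) = b_i(s, \cdot) F_j$. This is guaranteed if and only if the transformation function, represented by the matrix $F_j$, is a bijection. However, if a policy $q_i$ is not dominated, then there is a belief $b_i(\cdot, \cdot)$ where $V_{q_i}(b_i) > V_{q'_i}(b_i), \forall q'_i \in Q_i \setminus \{q_i\}$, and in the corresponding reduced belief $\tilde{b}_i(\cdot, \cdot) = b_i(\cdot, \cdot) F_j$, we will also have $V_{q_i}(\tilde{b}_i) > V_{q'_i}(\tilde{b}_i), \forall q'_i \in Q_i \setminus \{q_i\}$. Therefore, the linear program of Table 2 keeps at least all the dominant policies, but can keep some dominated policies (Table 3).

After pruning the dominated policies, some basis sequences become linearly dependent. Algorithm 3 is then used to remove the newly dependent sequences, and to update the parameters $C$ and $\tilde{V}$ of the remaining basis sequences. These two update operations are derived from the definitions of $C$ and $\tilde{V}$. Notice that when we eliminate policies (i.e. eliminate rows from the matrix $U$), the linearly dependent sequences remain dependent, and keep the same weight vectors.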
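The decomposition step used to extract the basis sequences can be sketched with a small exact Gauss-Jordan elimination. This is a minimal illustration, not the authors' implementation: the function names are introduced here, exact rational arithmetic is an implementation choice, and the matrix is the 4 x 8 outcome matrix whose 0/1 entries are printed in Figure 1 (rows $q_a, q_b, q_c, q_d$).

```python
from fractions import Fraction
from typing import List, Tuple

Matrix = List[List[Fraction]]


def gauss_jordan_basis(matrix: List[List[int]]) -> Tuple[List[int], Matrix]:
    """Return (indices of linearly independent columns, nonzero rows of the RREF).

    The pivot columns form F; the RREF rows hold the weight vectors, so that
    matrix == F * RREF.
    """
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_rows, n_cols = len(rows), len(rows[0])
    pivots: List[int] = []
    r = 0
    for c in range(n_cols):
        pivot = next((i for i in range(r, n_rows) if rows[i][c] != 0), None)
        if pivot is None:
            continue  # column c depends on the pivot columns found so far
        rows[r], rows[pivot] = rows[pivot], rows[r]
        lead = rows[r][c]
        rows[r] = [x / lead for x in rows[r]]
        for i in range(n_rows):
            if i != r and rows[i][c] != 0:
                factor = rows[i][c]
                rows[i] = [x - factor * y for x, y in zip(rows[i], rows[r])]
        pivots.append(c)
        r += 1
        if r == n_rows:
            break
    return pivots, rows[:r]


def matmul(a, b):
    """Plain matrix product for lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


U = [
    [1, 0, 1, 0, 1, 0, 1, 0],  # q_a
    [1, 0, 0, 1, 1, 0, 0, 1],  # q_b
    [0, 1, 1, 0, 0, 1, 1, 0],  # q_c
    [0, 1, 0, 1, 0, 1, 0, 1],  # q_d
]
basis_cols, U_tilde = gauss_jordan_basis(U)
F = [[row[c] for c in basis_cols] for row in U]

b = [1, 0, 0, 0]                    # belief concentrated on q_a
b_reduced = matmul([b], F)[0]       # reduced belief over the basis sequences
predicted = matmul([b_reduced], U_tilde)[0]
```

On this matrix the first three columns are selected, and the reduced belief for a point mass on $q_a$ is (1, 0, 1): its entries sum to 2, which illustrates why the reduced belief is not a probability distribution. The negative weights that appear in the reduced row echelon form are expected, since a dependent column can be a difference of basis columns.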
Experiments

We implemented both Algorithm 1 (DP) and Algorithm 2 (DP with Policy Compression) using the ILOG Cplex 10 solver on an AMD Athlon machine with a 1.80 GHz processor and 1.5 GB of RAM. We used the Gauss-Jordan elimination method to find the basis sequences. For the last step of planning, we omit pruning the dominated policies since we will not use them to generate further policies; thus, we only generate the value vectors of each joint policy and each joint sequence.

Table 3: The runtime (in seconds) and the number of policies and sequences of DP algorithms, with and without compression.

                   Dynamic Programming      Dynamic Programming with Lossless Policy Compression
    Problem   T    runtime   policies       runtime   policies          basis sequences   compression ratio
    MA-Tiger  2    0.20      (27,27)        0.17      (27,27)           (18,18)           1.5
              3    2.29      (675,675)      1.79      (675,675)         (90,90)           7.5
              4    -         -              534.90    (195075,195075)   (540,540)         361.25
    MABC      2    0.12      (8,8)          0.14      (8,8)             (8,8)             1
              3    0.46      (72,72)        0.36      (72,72)           (24,24)           3
              4    17.59     (1800,1458)    4.59      (3528,3528)       (80,80)           44.1

We compared the performance of these two algorithms on two benchmark problems, MA-Tiger and MABC (Hansen, Bernstein, & Zilberstein 2004). Both algorithms find the same optimal values. However, the memory space used to represent value vectors is significantly smaller when we use the compression approach. In fact, the value vectors in the original DP algorithm are defined on states and policies, whereas the reduced value vectors are defined on states and basis sequences. Notice also that the compression ratio (number of policies / number of basis sequences) grows exponentially w.r.t. the planning horizon.

Also, the runtime of DP is improved when the compression algorithm is used. Indeed, the backup of reduced value vectors (equations 4 and 5) takes less time than the backup of original value vectors (equation 1). However, given that the linear program of Table 2 is under-constrained, our algorithm can keep some additional policies that are dominated. In MABC for example, our algorithm generates 3528 × 3528 joint policies at the beginning of horizon 4, while only 1800 × 1458 joint policies are not dominated. This problem can be solved by adding a larger system of constraints on the probabilities of the sequences, but the computational efficiency may be affected. As with all compression techniques, there is a tradeoff between space performance and time performance.
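The compression ratios of Table 3 follow directly from the policy and basis-sequence counts, and the same counts give the size of the value tables per state. The sketch below only re-derives that arithmetic; the dictionary layout and function names are assumptions introduced here, and the per-state entry counts assume one value per pair of policies (original DP) or per pair of basis sequences (compressed DP), as described in the text.

```python
from typing import Dict, Tuple

# (problem, horizon) -> (policies per agent, basis sequences per agent), from Table 3.
TABLE_3: Dict[Tuple[str, int], Tuple[int, int]] = {
    ("MA-Tiger", 2): (27, 18),
    ("MA-Tiger", 3): (675, 90),
    ("MA-Tiger", 4): (195075, 540),
    ("MABC", 2): (8, 8),
    ("MABC", 3): (72, 24),
    ("MABC", 4): (3528, 80),
}


def compression_ratio(policies: int, basis_sequences: int) -> float:
    """Number of policies divided by number of basis sequences."""
    return policies / basis_sequences


def entries_per_state(count_i: int, count_j: int) -> int:
    """Value entries stored for each state: one per pair of rows."""
    return count_i * count_j


ratios = {key: compression_ratio(p, b) for key, (p, b) in TABLE_3.items()}

policies, basis = TABLE_3[("MA-Tiger", 4)]
original_entries = entries_per_state(policies, policies)   # joint policies
reduced_entries = entries_per_state(basis, basis)          # joint basis sequences
```

For MA-Tiger at horizon 4 this gives roughly 3.8 × 10^10 entries per state for joint policies against 291600 for joint basis sequences, which is consistent with the original DP producing no result on that row of the table.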
Conclusion

The dimensionality of the policy space is a crucial factor in the complexity of Dynamic Programming algorithms for DEC-POMDPs. In this paper, we introduced a new approach for dealing with this problem, based on projecting the policy beliefs from the high dimensional space of trees to the low dimensional space of sequences, and using matrix factorization methods to reduce the number of sequences even further. Consequently, the memory space used by this algorithm is significantly smaller, and the runtime is lower, compared to the original DP algorithm. This method can be used in approximate DP algorithms, mainly for domains with large observation spaces. We also plan to investigate quick and lossy factorization techniques, and more specifically, binary matrix factorization algorithms.

References

Aras, R.; Dutech, A.; and Charpillet, F. 2007. Mixed Integer Linear Programming for Exact Finite-Horizon Planning in Decentralized POMDPs. In Proceedings of the International Conference on Automated Planning and Scheduling (ICAPS 07), 18-25.

Bernstein, D.; Immerman, N.; and Zilberstein, S. 2002. The Complexity of Decentralized Control of Markov Decision Processes. Mathematics of Operations Research 27(4):819-840.

Hansen, E.; Bernstein, D.; and Zilberstein, S. 2004. Dynamic Programming for Partially Observable Stochastic Games. In Proceedings of the 19th National Conference on Artificial Intelligence (AAAI 04), 709-715.

Littman, M.; Sutton, R.; and Singh, S. 2001. Predictive Representations of State. In Advances in Neural Information Processing Systems 14 (NIPS 01), 1555-1561.

Papadimitriou, C., and Tsitsiklis, J. 1987. The Complexity of Markov Decision Processes. Mathematics of Operations Research 12(3):441-450.

Rabinovich, Z.; Goldman, C.; and Rosenschein, J. 2003. The Complexity of Multiagent Systems: the Price of Silence. In Proceedings of the Second International Joint Conference on Autonomous Agents and Multiagent Systems (AAMAS 03), 1102-1103.

Seuken, S., and Zilberstein, S. 2007. Improved Memory-Bounded Dynamic Programming for Decentralized POMDPs. In Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI 07).

Singh, S.; James, M.; and Rudary, M. 2004. Predictive State Representations: A New Theory for Modeling Dynamical Systems. In Uncertainty in Artificial Intelligence: Proceedings of the Twentieth Conference (UAI 04).

Smallwood, R. D., and Sondik, E. J. 1971. The Optimal Control of Partially Observable Markov Decision Processes over a Finite Horizon. Operations Research 21(5):1557-1566.

Szer, D., and Charpillet, F. 2006. Point-Based Dynamic Programming for DEC-POMDPs. In Proceedings of the 21st National Conference on Artificial Intelligence (AAAI 06), 304-311.

Szer, D.; Charpillet, F.; and Zilberstein, S. 2005. MAA*: A Heuristic Search Algorithm for Solving Decentralized POMDPs. In Proceedings of the 21st Conference on Uncertainty in Artificial Intelligence (UAI 05).

Virin, Y.; Shani, G.; Shimony, S.; and Brafman, R. 2007. Scaling Up: Solving POMDPs through Value Based Clustering. In Proceedings of the Twenty-Second AAAI Conference on Artificial Intelligence (AAAI 07), 1290-1295.