Optimal Resource Allocation and Policy Formulation in Loosely-Coupled Markov Decision Processes

Dmitri A. Dolgov and Edmund H. Durfee
Department of Electrical Engineering and Computer Science
University of Michigan
Ann Arbor, MI 48109
{ddolgov, durfee}@umich.edu

Abstract

The problem of optimal policy formulation for teams of resource-limited agents in stochastic environments is composed of two strongly-coupled subproblems: a resource allocation problem and a policy optimization problem. We show how to combine the two problems into a single constrained optimization problem that yields optimal resource allocations and policies that are optimal under these allocations. We model the system as a multiagent Markov decision process (MDP), with social welfare of the group as the optimization criterion. The straightforward approach of modeling both the resource allocation and the actual operation of the agents as a multiagent MDP on the joint state and action spaces of all agents is not feasible, because of the exponential increase in the size of the state space. As an alternative, we describe a technique that exploits problem structure by recognizing that agents are only loosely-coupled via the shared resource constraints. This allows us to formulate a constrained policy optimization problem that yields optimal policies among the class of realizable ones given the shared resource limitations. Although our complexity analysis shows the constrained optimization problem to be NP-complete, our results demonstrate that, by exploiting problem structure and via a reduction to a mixed integer program, we are able to solve problems orders of magnitude larger than what is possible using a traditional multiagent MDP formulation.

Introduction

We address the problem of finding optimal policies for teams of resource-limited autonomous agents that operate in stochastic environments. While various aspects of this problem have received significant amounts of attention, there has been limited focus on addressing the combined problem of deciding how the limited shared resources should be distributed between the agents and what policies they should adopt, such that the social welfare of the team is maximized.
Notice that in this problem formulation, figuring out the value of a particular allocation requires one to solve a stochastic policy optimization problem. Hence, the resource allocation and the policy optimization problems are very closely coupled. A straightforward approach to solving this problem is to formulate both the resource allocation process and the actual operation of the agents as a large multiagent MDP (Boutilier 1999) on the joint state and action spaces of all agents. However, this method suffers from an exponential increase in the size of the state space, and thus very quickly becomes infeasible for teams of reasonable size.

A common way of addressing this problem of large state spaces in MDPs is based on problem decomposition (Boutilier, Brafman, & Geib 1997; Dean & Lin 1995; Meuleau et al. 1998; Singh & Cohn 1998), where a global MDP is decomposed into several independent or loosely-coupled sub-MDPs. These sub-MDPs are usually solved independently and the resulting policies are then combined to yield a (perhaps suboptimal) solution to the original global MDP.

In this work, we focus on domains where the agents operate mostly independently, but their policy optimization problems are coupled via the resources that they share. Such loose coupling of the agents makes these problems very well suited for the decomposition solution methods mentioned above. However, the existing methods either do not allow one to completely avoid the explicit enumeration of the joint states and actions (Singh & Cohn 1998) or provide only approximate solutions (Meuleau et al. 1998) to the global policy optimization problems. We present a method that does not sacrifice optimality and, by fully exploiting the structure of the problem, makes it possible to solve problems orders of magnitude larger than what is possible using traditional multiagent MDP techniques. Unlike the standard decomposition techniques, we do not divide the problem into subproblems and then recombine the solutions. Instead, we formulate one policy optimization problem with constraints that ensure that the policies do not over-consume the shared resources.

Copyright © 2004, American Association for Artificial Intelligence (www.aaai.org). All rights reserved.
The main contributions of this work are that we formally analyze the complexity of this constrained optimization problem and give a reduction of the problem to a mixed integer linear program (MILP), which allows us to capitalize on efficient methods of solving MILPs. We begin by giving a very broad, high-level description of the problem and presenting a simple example of a domain where such problems arise. We then present a formal description of our model and the problem formulation as well as an analysis of the complexity of this optimization problem and the structure of the solutions. The last sections of the paper describe our method for solving this problem and present some empirical results.

General Problem Description

Figure 1: A resource allocation and stochastic planning problem. Once the shared operationalization resources are distributed among the agents (a), they proceed to execute the best realizable policies (b).

Motivating Example

Imagine a group of agents that are operating autonomously, for example, a group of rovers performing a scientific mission on a remote planet. There is a clear need for coordination and task allocation among the agents in order for them to perform their mission efficiently. For instance, if the mission involves taking measurements of the soil in one location and doing a video survey of another area, it might be beneficial to assign one rover to do soil sampling, and another to do the video survey. However, the agents typically need different equipment to carry out different tasks, and while it is sometimes feasible to design and build an agent for a certain task, often it is more effective to create a general-purpose agent that can be outfitted with different equipment depending on the task at hand. In particular, in our rover example, there might be a base station that is used as a centralized location to store consumable execution resources (fuel, energy, etc.), as well as equipment (video cameras, extra batteries, etc.) that can be used to outfit the rovers for various tasks. It is certainly natural to assume that these resources are limited. Thus, a problem of efficient allocation of the shared resources arises.

However, the resource allocation problem is complicated by the fact that it is often hard to calculate the exact utility of a particular assignment of the resources. Indeed, agents operating in complex environments are not able to perfectly and deterministically reason about the effects of their actions, and therefore it is necessary to adopt a model that allows one to express the uncertainty of the agents' interactions with the environment. In fact, an agent that adopts a deterministic model of the environment and does not have contingency plans might find itself acting rather poorly.
For instance, it has been estimated that the 1997 Mars Pathfinder, which did not take the uncertainty of the environment into account, spent between 40% and 75% of its time doing nothing due to plan failures (Bresina et al. 2002). Therefore, in order to determine the value of a particular resource allocation for agents operating under uncertainty, it is necessary to solve a stochastic optimization problem.

More generally, it is often the case that an agent has many capabilities that are all in principle available to it, but not all combinations are realizable within the architectural limitations, because choosing to enable some of the capabilities might usurp resources needed to enable others. In other words, a particular policy might not be operational because the agent's architecture does not support the combination of capabilities required for that policy. If this is the case, we say that the agent exhibits operationalization constraints.

At a high level, we model the situation described above as follows (illustrated in Figure 1). The agents have a set of actions that are potentially executable, but each action requires a certain combination of resources. The amount of these shared resources is limited. Furthermore, each agent has constraints as to what resources it can make use of (for example, what equipment it can be outfitted with). In our model, execution begins with a distribution of the shared resources among the agents (Figure 1a). Any resulting resource allocation must obey the constraint that no shared resource is over-utilized, i.e., the amount of all resources that are assigned to the agents does not exceed the total available amount. Furthermore, the assignment must satisfy the local constraints of the agents as to the resources that they can use. For example, it is useless (and thus essentially invalid) to assign to an agent more equipment than it can carry. Once the shared resources are distributed among the agents, they should use these resources to carry out their policies (Figure 1b) in such a way that the social welfare of the group (sum of individual rewards) is maximized.
The Model

Before we describe our world model in more detail, let us note that although it is very easy to adapt our model and solution algorithms to a scenario that involves consumable execution resources (e.g., fuel, time) in addition to discrete operationalization resources (e.g., equipment to outfit the agents), we do not model the former in this paper for the ease of exposition. It turns out that including such continuous consumable resources in the model does not add to the complexity of the problem, but introduces some subtleties to the optimization and has an effect on the structure of the optimal policies (Dolgov & Durfee 2003). We briefly describe how such constraints can be incorporated into our model at the end of the subsequent section that presents our solution method.

The stochastic properties of the environments in the problems that we address in this work lead us to adopt the Markov model as the underlying formalism. In particular, we use the stationary, discrete-time Markov model with finite state and action spaces (Puterman 1994). The choice is due to the fact that MDPs provide a well-studied and simple, yet very expressive, model of the world. This section briefly describes standard unconstrained Markov decision processes and discusses the assumptions that are specific to the problems that we focus on in this work.

Markov Decision Processes

A classical unconstrained single-agent MDP can be defined as a tuple $\langle S, A, P, R \rangle$, where:

$S = \{i\}$ is a finite set of states.
$A = \{a\}$ is a finite set of actions.
$P = [p_{iaj}] : S \times A \times S \to [0, 1]$ defines the transition function. The probability that the agent goes to state $j$ if it executes action $a$ in state $i$ is $p_{iaj}$.
$R = [r_{ia}] : S \times A \to \mathbb{R}$ defines the rewards. The agent gets a reward of $r_{ia}$ for executing action $a$ in state $i$.

A policy is defined as a procedure for selecting an action in each state. A policy is said to be stationary if it does not depend on time, but only on the current state, i.e., the same procedure for selecting an action is performed every time the agent encounters a particular state. A deterministic policy always chooses the same action for a state, as opposed to a randomized policy, which chooses actions according to some probability distribution over the set of actions. The term pure is used to refer to stationary deterministic policies.

A randomized Markov policy $\pi$ can be described as a mapping of states to probability distributions over actions, or equivalently, as a mapping of state-action pairs to probability values: $\pi = [\pi_{ia}] : S \times A \to [0, 1]$; $\pi_{ia}$ defines the probability of executing action $a$, given that the agent is in state $i$. We assume that an agent must execute an action (a noop is considered a trivial action) in every state, thus $\sum_a \pi_{ia} = 1$. A pure policy can be viewed as a degenerate case of a randomized policy, for which there is only one action for each state that has a nonzero probability of being executed (that probability is obviously 1).

Clearly, the total probability of transitioning out of a state, given a particular action, cannot be greater than 1, i.e., $\sum_j p_{iaj} \le 1$. As discussed below, often we are actually interested in domains where there exist states for which $\sum_j p_{iaj} < 1$.
If, at time 0, the agent has an initial probability distribution $\alpha = [\alpha_i]$ over the state space, and the system obeys the Markov assumption (namely that the transition probabilities depend only on the current state and the chosen action), the system's trajectory (defined as a sequence of probability distributions on states) will be as follows:

\[ \rho_{t+1} = P_\pi \rho_t, \qquad \rho_0 = \alpha, \tag{1} \]

where $\rho_t = [\rho_{ti}]$ is the probability distribution of the system at time $t$ ($\rho_{ti}$ is the probability of being in state $i$ at time $t$), and $P_\pi = [p^\pi_{ij}]$ is the transition probability matrix implied by the policy ($p^\pi_{ij} = \sum_a p_{iaj} \pi_{ia}$).

Assumptions

Typically, Markov decision problems are divided into two categories: finite-horizon problems, where the total number of steps that the agent spends in the system is finite and is known a priori, and infinite-horizon problems, where the agent is assumed to stay in the system forever (see, for example, (Puterman 1994)). In this work we focus on dynamic real-time domains, where agents have tasks to accomplish. This leads us to make a slightly different (although, certainly, not novel) assumption about how much time the agent spends executing its policy. We assume that there is no predefined number of steps that the agent spends in the system, but that optimal policies always yield transient Markov processes (Kallenberg 1983). A policy is said to yield a transient Markov process if the agent executing that policy will eventually leave the corresponding Markov chain, after spending a finite number of time steps in it. Given a finite state space, this assumption implies that the Markov chain corresponding to the optimal policy has no recurrent states (states that have a nonzero probability of being visited infinitely many times) or, in other words: $\lim_{t \to \infty} \rho_{ti} = 0$. This means that there has to be some leakage of probability out of the system, i.e., there have to exist some states $\{i\}$ for which $\sum_j p_{iaj} < 1$.

We also assume that the rewards that an agent receives while executing a policy are bounded. Given these assumptions about bounded rewards and the transient nature of our problems, the most natural policy evaluation function to adopt is the expected total reward:

\[ V(\pi, \alpha) = \sum_{t=0}^{T} \sum_i \rho_{ti} \sum_a \pi_{ia} r_{ia}, \tag{2} \]

where $T$ is the number of steps during which the agent accumulates utility.
For a transient system with bounded rewards, the above sum converges for any $T$. If the system obeys the Markov assumption and, as a result, follows the trajectory in (eq. 1), the value of a policy can be expressed in terms of the initial probability distribution, transition probabilities, and rewards as follows:

\[ V(\pi, \alpha) = \sum_t R_\pi \rho_t = R_\pi \sum_t P_\pi^t \alpha, \tag{3} \]

where $P_\pi$ is the policy's transition matrix from (eq. 1) and $R_\pi = [\sum_a \pi_{ia} r_{ia}]$ is the vector of expected one-step rewards under $\pi$. Under our assumptions, this becomes:

\[ V(\pi, \alpha) = R_\pi (I - P_\pi)^{-1} \alpha \tag{4} \]

It is clear that the value of a policy depends on the initial state probability distribution $\alpha$. Moreover, in general, the relative order of two policies can change depending on the initial probability distributions, i.e., $\exists\, \pi, \pi', \alpha, \alpha' : V(\pi, \alpha) > V(\pi', \alpha)$ and $V(\pi, \alpha') < V(\pi', \alpha')$. However, often, there exist policies that are optimal for any initial probability distribution, i.e., $V(\pi^*, \alpha) \ge V(\pi, \alpha)\ \forall \pi, \alpha$. These policies are the ones that are commonly called optimal in the unconstrained MDP literature and are typically computed via dynamic programming, based on Bellman optimality equations (Bellman 1957). We refer to these policies as uniformly optimal (using the terminology from (Altman 1999)). These uniformly optimal policies $\pi^*$ always produce a history of states that is at least as good as a history produced by any other policy, regardless of the initial conditions. Therefore, if uniformly optimal policies exist, it is sufficient to compute a single uniformly optimal policy and use it for all instances of the problem with arbitrary initial probability distributions.
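The series form of the policy value (eq. 3) can be checked numerically on a small chain. The sketch below is only an illustration under stated assumptions: the function name `policy_value`, the two-state chain, and the stopping tolerance are hypothetical and not taken from the paper, and the matrix is stored in row convention (`P[i][j]` is the probability of moving from state `i` to state `j` under the fixed policy), so the update multiplies the distribution from the left.

```python
def policy_value(P, r, alpha, tol=1e-12, max_steps=100_000):
    """Expected total reward of a fixed policy on a transient chain.

    Accumulates sum_t sum_i rho_t[i] * r[i], i.e. the series in (eq. 3),
    and stops once (almost) all probability mass has leaked out.

    P[i][j]: probability of moving from state i to state j under the policy
             (a row may sum to less than 1; the deficit is the leakage).
    r[i]:    expected one-step reward in state i under the policy.
    alpha:   initial probability distribution over states.
    """
    n = len(alpha)
    rho = list(alpha)
    value = 0.0
    for _ in range(max_steps):
        if sum(rho) < tol:
            return value
        value += sum(rho[i] * r[i] for i in range(n))
        # One step of the trajectory (eq. 1), in row convention.
        rho = [sum(rho[i] * P[i][j] for i in range(n)) for j in range(n)]
    raise ValueError("probability mass did not leak out; is the chain transient?")


# Hypothetical chain: a "good" state (reward +1) that is kept with
# probability 0.8, followed by a "bad" state (reward -1) that is kept with
# probability 0.8 and otherwise left for good.
P = [[0.8, 0.2],
     [0.0, 0.8]]
r = [1.0, -1.0]
alpha = [1.0, 0.0]
value = policy_value(P, r, alpha)
print(value)
```

In this chain each state is visited five times in expectation, so the positive and negative rewards cancel. For larger problems one would solve the linear system behind (eq. 4) directly instead of summing the series.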

However, as it turns out, uniformly optimal policies do not always exist for constrained problems that involve limited resources (we will prove this statement below). We are, therefore, interested in finding optimal policies for a given initial probability distribution $\alpha$. Let us note that, although in this work we focus on transient systems with the total expected reward optimization criterion, our model, the complexity results, and the solution algorithm can be easily adapted to other commonly-used Markov models (finite horizon, infinite horizon with discounted or per-unit rewards).

Problem Description

Multi-Agent Markov Decision Processes

Let us now consider a multiagent environment with a set of $n$ agents $M = \{m\}$ ($|M| = n$), each of whom has its own set of states $S_m = \{i_m\}$ and actions $A_m = \{a_m\}$. Without any loss of generality, we can assume that the state and action spaces are equal ($S_m = S_{m'}$, $A_m = A_{m'}$ $\forall m, m' \in M$). In general, for a multiagent MDP, we have to define a new state space that is the cross-product of the state spaces of all agents: $S(M) = S^n$, and a new action space that is the cross-product of the action spaces of all agents: $A(M) = A^n$. The transition and reward functions are defined on the new state and action space, i.e., $P(M) : S^n \times A^n \times S^n \to [0, 1]$, and $R(M) : S^n \times A^n \to \mathbb{R}$.

However, there is a large subclass of multiagent domains where the agents' rewards and transition functions are independent of each other, i.e., such problems are completely separable if there are no shared resources involved. In this paper we assume that once the shared resources are distributed, the agents operate completely independently of each other. In other words, each agent has its own independent reward and transition functions defined on $S$ and $A$. Under the above independence assumptions, a joint policy of the group is simply the set of single-agent policies of all agents, i.e., $\pi(M) = [\pi^m] = [\pi^m_{ia}]$.

Problem Formulation

We can now define the problem of multiagent policy optimization under limited shared resources. Let us say that there are several shared resources, and that every action of each agent requires some subset of these resources.
Furthermore, all resources have costs associated with them and agents have upper bounds on the costs of resources that can be allocated to them. For example, a problem might involve shared equipment (e.g., tools) that enables agents to execute various actions, but each unit of equipment has some costs associated with it (e.g., weight), and the agents have upper bounds on how much weight they can carry. Under these conditions, we can formulate the multiagent optimization problem as a tuple $\langle S, A, P, R, C, \hat{C}, Q, \hat{Q}, \alpha \rangle$, where:

$S = \{i\}$ is a finite set of states.
$A = \{a\}$ is a finite set of actions.
$P = [P^m] = [p^m_{iaj}] : S \times A \times S \to [0, 1]$ defines the transition function for agent $m$. The probability that agent $m$ goes to state $j$ if it executes action $a$ in state $i$ is $p^m_{iaj}$.
$R = [R^m] = [r^m_{ia}] : S \times A \to \mathbb{R}$ defines the rewards that agent $m$ receives. Agent $m$ gets a reward of $r^m_{ia}$ for executing action $a$ in state $i$.
$C = [C^m] = [c^m_{ak}]$, where $c^m_{ak} \in \{0, 1\}$, defines action resource requirements. If agent $m$ needs resource $k$ to be able to execute action $a$, $c^m_{ak} = 1$; otherwise $c^m_{ak} = 0$.
$\hat{C} = [\hat{c}_k]$ defines the total amounts of shared resources that are available to the group, i.e., there are $\hat{c}_k$ units of resource $k$ available to the agents.
$Q = [q_{kl}]$ defines the costs (weight, money, etc.) of each resource. The cost of type $l$ of a unit of resource $k$ is given by $q_{kl}$.
$\hat{Q} = [\hat{q}^m_l]$ defines the upper bounds on how much of the costs the agent can incur (e.g., how much weight the agent can hold or how much money it can spend). Agent $m$ cannot exceed $\hat{q}^m_l$ units of cost of type $l$.
$\alpha = [\alpha^m] = [\alpha^m_i]$ is the initial probability distribution. The probability that agent $m$ starts in state $i$ is $\alpha^m_i$.

Without any loss of generality, we assume that the action space $A$ and the state space $S$ are the same for all agents. The goal of the optimization problem is to find a joint policy $\pi(M)$ that yields the highest expected reward, under the conditions that the shared resources are not over-utilized, and that no agent is assigned more resources than it can hold.
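In other words, we have to solve the following (abstract) math program:

\[
\max V(\pi, \alpha) \;\left|\;
\begin{aligned}
&\sum_m \theta\Big(\sum_a c^m_{ak} \sum_i \pi^m_{ia}\Big) \le \hat{c}_k, \\
&\sum_k q_{kl}\, \theta\Big(\sum_a c^m_{ak} \sum_i \pi^m_{ia}\Big) \le \hat{q}^m_l,
\end{aligned}
\right.
\tag{5}
\]

where $\theta$ is a step function of a non-negative argument, defined as:

\[
\theta(z) = \begin{cases} 0 & z = 0 \\ 1 & z > 0 \end{cases}
\]

The first constraint in (eq. 5) means that the total amounts of resources that are needed by all agents do not exceed the total amounts that are available. Indeed, $\sum_i \pi^m_{ia}$ is greater than zero only if agent $m$ plans to use action $a$ with nonzero probability. Thus, $\sum_a c^m_{ak} \sum_i \pi^m_{ia}$ is greater than zero when the agent plans to use actions that require resource $k$, in which case $\theta\big(\sum_a c^m_{ak} \sum_i \pi^m_{ia}\big) = 1$, and the first summation over all agents $m$ gives the total requirements for resource $k$, which should not exceed $\hat{c}_k$. The second constraint is analogous to the first one and has the meaning that the cost of type $l$ of the resources assigned to agent $m$ does not exceed its cost bound $\hat{q}^m_l$.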
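To make the two constraints of (eq. 5) concrete, the following sketch checks them for a given joint policy. It is an illustration rather than part of the solution method: the function names, the nested-list data layout, and the small two-agent instance (two actions that each need one of two resources, one unit of each resource, a single non-binding cost type) are assumptions chosen for the example.

```python
def theta(z):
    """Step function from (eq. 5): 0 at zero, 1 for any positive argument."""
    return 1 if z > 0 else 0


def resources_used(pi, C):
    """0/1 vector over resources k that one agent's policy requires.

    pi[i][a]: probability of executing action a in state i.
    C[a][k]:  1 if action a requires resource k, else 0.
    """
    n_actions, n_resources = len(C), len(C[0])
    action_mass = [sum(row[a] for row in pi) for a in range(n_actions)]
    return [
        theta(sum(C[a][k] * action_mass[a] for a in range(n_actions)))
        for k in range(n_resources)
    ]


def satisfies_constraints(policies, C, c_hat, Q, q_hat):
    """Check both constraints of (eq. 5) for a joint policy.

    policies[m], C[m]: policy and requirement matrix of agent m.
    c_hat[k]:          available units of shared resource k.
    Q[k][l]:           cost of type l of one unit of resource k.
    q_hat[m][l]:       agent m's upper bound on cost type l.
    """
    used = [resources_used(pi, C[m]) for m, pi in enumerate(policies)]
    n_resources, n_cost_types = len(c_hat), len(Q[0])
    shared_ok = all(
        sum(u[k] for u in used) <= c_hat[k] for k in range(n_resources)
    )
    local_ok = all(
        sum(Q[k][l] * used[m][k] for k in range(n_resources)) <= q_hat[m][l]
        for m in range(len(policies))
        for l in range(n_cost_types)
    )
    return shared_ok and local_ok


# Hypothetical instance: action 0 needs resource 0, action 1 needs resource 1.
C_one = [[1, 0], [0, 1]]
C = [C_one, C_one]
c_hat = [1, 1]            # one unit of each shared resource
Q = [[1], [1]]            # a single cost type, one unit of cost per resource
q_hat = [[2], [2]]        # loose local bounds, so only sharing can bind

split = [
    [[1, 0], [1, 0], [1, 0]],   # agent 0 always uses action 0
    [[0, 1], [0, 1], [0, 1]],   # agent 1 always uses action 1
]
greedy = [
    [[1, 0], [0, 1], [1, 0]],   # both agents want both actions
    [[1, 0], [0, 1], [1, 0]],
]
print(satisfies_constraints(split, C, c_hat, Q, q_hat))
print(satisfies_constraints(greedy, C, c_hat, Q, q_hat))
```

Note that the check depends only on which actions appear in a policy with nonzero probability, not on how often they are executed, which is what distinguishes operationalization constraints from constraints on expected consumption.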

Figure 2: Uniformly-optimal policies do not always exist for constrained problems. Transition probabilities, rewards, and action costs are shown on the diagram. (States: $s_1$, $s_2$, $s_3$. Edge labels in the diagram: $a_1$: p=0.75, r=1, c=[1,0]; $a_2$: p=0.8, r=-1, c=[0,1]; $a_2$: p=0.75, r=1, c=[0,1]; $a_1$: p=0.25, r=1, c=[1,0]; $a_2$: p=0.25, r=1, c=[0,1]; $a_2$: p=1, r=0, c=[0,1]; $a_1$: p=0.8, r=-1, c=[1,0]; $a_1$: p=1, r=0, c=[1,0].)

Problem Properties

Policy Structure

In this section we show some properties of the optimal policies. We begin by demonstrating that uniformly-optimal policies (optimal for any initial distribution) do not always exist for constrained problems. This result is well known for classical constrained MDPs (Altman 1999; Puterman 1994) where constraints are imposed on the total expected costs that are proportional to the expected number of times the corresponding actions are executed. We now establish this result for problems with operationalization constraints (eq. 5) for which the costs are incurred by the agents when they include an action in their policy, regardless of how many times the action is actually executed (the costs are interpreted as the amounts of the shared resources that are required to enable the action).

Proposition 1 There do not always exist uniformly-optimal solutions to problem (eq. 5).

Proof: We show the correctness of the above statement by presenting an example (Figure 2) for which no uniformly-optimal policy exists. Let us consider a problem with two identical agents ($m = \{1, 2\}$), three states $\{s_1, s_2, s_3\}$, two actions $\{a_1, a_2\}$, and two shared resource types ($k = \{1, 2\}$). The rewards ($r$), transition probabilities ($p$), and action costs ($c$) for all actions are shown in Figure 2. The resource costs (requirements) for action $a_1$ are $[1, 0]$, i.e., it needs the first resource and does not use the second one. Action $a_2$ is exactly the opposite. States $s_1$ and $s_2$ are good, because the agents can receive positive rewards ($r = 1$) there, and state $s_3$ is bad, since the reward there is negative ($r = -1$). The obvious unconstrained optimal policy for both agents is to execute action $a_1$ in state $s_1$, action $a_2$ in state $s_2$, and either action (or a probabilistic mixture of the two) in state $s_3$.
However, in order for both agents to be able to execute this policy, they each need to have one unit of each of the resource types. Let us now assume that there is only one unit of each resource available ($\hat{c} = [1, 1]$), and let us show that there does not exist a policy that is optimal for all initial probability distributions $\alpha$.

Consider the situation where the first agent starts in state $s_1$ and the second agent starts in state $s_2$, i.e., $\alpha^1 = [1, 0]$, $\alpha^2 = [0, 1]$. Then, the obvious unique optimal joint policy that satisfies the resource constraints is $\pi^1 = [(1, 0), (1, 0), (1, 0)]$ and $\pi^2 = [(0, 1), (0, 1), (0, 1)]$.¹ The resource cost of this joint policy is $[1, 1]$, which satisfies the constraints on the total use of the shared resources. The value (total expected reward) of that policy is zero ($V(\pi, \alpha) = 0$), since each agent, on average, receives the positive reward (+1) five times before it transitions to state $s_3$, where its expected payoff is $-5$, yielding a total expected value of zero. This policy is clearly the unique optimum for the given initial conditions, because any other policy would either not satisfy the constraints or yield a lower (negative) payoff.

If we consider a problem with the reversed initial conditions $\alpha'$, where the first agent starts in state $s_2$ and the second agent starts in state $s_1$, the policy described above would immediately take both agents to state $s_3$ and would yield a total expected reward of $-10$. However, clearly, there also exists a policy that yields zero expected reward for this initial distribution. Thus, our policy $\pi$ is the unique optimal solution to the problem with the initial distribution $\alpha$ and is a suboptimal solution to the problem with the initial distribution $\alpha'$. We have therefore constructed an example for which no uniformly-optimal policy exists.

Complexity

In this section we study the computational complexity of the multiagent optimization problem introduced earlier (eq. 5).
We begin by defining the corresponding decision problem, which we label M-OPER-CMDP (multiagent constrained MDP with operationalization constraints): Given an instance of a multiagent MDP with shared operationalization resources $\langle S, A, P, R, C, \hat{C}, Q, \hat{Q}, \alpha \rangle$ and a rational number $V$, does there exist a multiagent policy $\pi$ whose expected total reward, given $\alpha$, equals or exceeds $V$?

The following result characterizes the complexity of this decision problem. It assumes that pure (stationary deterministic history-independent) policies are optimal for this class of problems. This can be shown via the same arguments as in the case of standard unconstrained MDPs, but we omit the proof here in the interest of space. Intuitively, there is no need to use randomized policies, because including an action in a policy incurs the same resource costs, regardless of the probability of executing that action (or the expected number of times the action will be executed).

Theorem 1 M-OPER-CMDP is NP-complete.

Proof: The presence of M-OPER-CMDP in NP is obvious. Clearly, one can always guess a pure joint policy, verify that it satisfies the shared resource constraints, and calculate its expected total reward in polynomial time (the latter can be done by solving the standard system of linear Markov equations on the values of all states (Puterman 1994)).

¹ Here and below we use round parentheses as a notational convenience to group the values that refer to one state, i.e., $\pi = [(\pi_{11}, \pi_{12}), (\pi_{21}, \pi_{22}), (\pi_{31}, \pi_{32})]$.

Figure 3: Reduction of KNAPSACK to M-OPER-CMDP. All transitions are deterministic. (Diagram: a chain of states $s_1, s_2, \ldots, s_m, s_{m+1}$; in each state $s_i$, action $a_i$ is labeled $r = v(u_i)$, $c_{ii} = 1$, $q_{i1} = s(u_i)$, and action $a_0$ is labeled $r = 0$, $c = 0$.)

In order to show NP-completeness of M-OPER-CMDP, we will reduce KNAPSACK to a single-agent instance of M-OPER-CMDP. Recall that KNAPSACK asks whether, for a given set of items $u \in U$, each of which has a cost $t(u)$ and a value $v(u)$, there exists a subset $U' \subseteq U$ such that the total value of all items in $U'$ is no less than some constant $W$, and the total cost of the items is no greater than another constant $B$, i.e., $\sum_{u \in U'} t(u) \le B$ and $\sum_{u \in U'} v(u) \ge W$. KNAPSACK is known to be NP-complete (Garey & Johnson 1979). Therefore, if we show that any instance of KNAPSACK can be reduced to M-OPER-CMDP, we would show that M-OPER-CMDP is also NP-complete.

The reduction is illustrated in Figure 3 and proceeds as follows. Given an instance of KNAPSACK with $|U| = m$, let us number all items as $u_i$, $i \in [1, m]$, as a notational convenience. For such an instance of KNAPSACK, we create an MDP with $m + 1$ states $\{s_1, s_2, \ldots, s_{m+1}\}$, $m$ actions $\{a_1, \ldots, a_m\}$, $m$ resource types $[c_k] = [c_1, \ldots, c_m]$, and a single cost type $q_k$. For every item $u_i$ in KNAPSACK, we define an action $a_i$ with reward $v(u_i)$ and the following resource requirements. Action $a_i$ only needs resource $i$, i.e., $c_{ik} = 1 \iff k = i$. We set the cost of resource $i$ to be the cost $t(u_i)$ of item $i$ in the KNAPSACK problem. We also define a null action $a_0$ with zero resource requirements and zero reward. Furthermore, we define a deterministic transition function on these states as follows. Every state $s_i$, $i \in [1, m]$, has two transitions from it corresponding to actions $a_i$ and $a_0$. Both actions lead to state $s_{i+1}$ with certainty, but $a_i$ gives the agent a reward of $v(u_i)$, while action $a_0$ gives a reward of zero. State $s_{m+1}$ is absorbing and has no transitions leading from it.

In order to complete the construction of M-OPER-CMDP, we set the initial distribution $\alpha = [1, 0, \ldots]$, so that the agent starts in state $s_1$ with probability 1. We also define the decision parameter $V = W$ and the total amount of the single cost $\hat{q} = B$. We make the constraints on the total amounts of various resources non-binding by setting $\hat{c}_k = \infty$.

The above construction basically allows the agent to choose action $a_i$ or $a_0$ at every state $s_i$. Choosing action $a_i$ is equivalent to putting item $u_i$ into the knapsack, while action $a_0$ corresponds to the choice of not including $u_i$ in the knapsack. Therefore, it is clear that there exists a policy that has the expected payoff no less than $V = W$ and uses no more than $\hat{q} = B$ of the shared resource if and only if there exists a solution to the original instance of KNAPSACK.

Note that we have formulated and proven Theorem 1 for transient processes, because we focus on such processes in this work. However, the result also holds for MDPs with a finite horizon, or an infinite horizon with total discounted reward. Indeed, the complexity proofs for all of these flavors of MDPs are almost identical and can be done via minor variations of the above reduction.

Solution Method

In this section we present a method for solving the optimization program (eq. 5), which is based on a reduction of the problem to a mixed integer program. However, before we describe our algorithm, let us briefly review the standard linear programming approach (D'Epenoux 1963; Kallenberg 1983) to solving Markov decision processes, which serves as the basis for our method.

A common method for finding a solution to a transient single-agent unconstrained MDP is by solving the following linear program:

\[ \max \sum_i \sum_a x_{ia} r_{ia} \]

subject to the constraints:

\[ \sum_a x_{ja} - \sum_i \sum_a x_{ia} p_{iaj} = \alpha_j, \qquad x_{ia} \ge 0, \]

or, equivalently:²

\[ \max \sum_i \sum_a x_{ia} r_{ia} \;\Big|\; \sum_i \sum_a (\delta_{ij} - p_{iaj})\, x_{ia} = \alpha_j, \tag{6} \]

where $\delta_{ij}$ is the Kronecker delta, defined as $\delta_{ij} = 1 \iff i = j$. The optimization variables $x = [x_{ia}]$ are often referred to as the occupancy measure of a policy, and $x_{ia}$ can be interpreted as the expected number of times action $a$ is executed in state $i$. The constraints in (eq. 6) just represent the conservation of probability and have nothing to do with external constraints imposed on the problem. That is, the expected number of times that state $j$ is visited less the expected number of times that $j$ is entered across all state-action pairs should equal the expected number of times of starting in state $j$ (i.e., the initial probability of being in $j$).
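The interpretation of the occupancy measure and of the conservation constraints in (eq. 6) can be verified on a toy problem. The sketch below does not solve the LP; it assumes a fixed policy, accumulates its occupancy measure by simulating the distribution forward in time, and then evaluates the left-hand side of (eq. 6) minus $\alpha_j$. The function names, the array layout `P[i][a][j]`, and the two-state instance are hypothetical choices made for this illustration.

```python
def occupancy_measure(P, pi, alpha, tol=1e-13, max_steps=100_000):
    """Expected number of times each (state, action) pair is executed.

    P[i][a][j]: probability of reaching state j after action a in state i.
    pi[i][a]:   probability that the policy picks action a in state i.
    alpha[i]:   initial probability of state i.
    """
    n_states, n_actions = len(pi), len(pi[0])
    rho = list(alpha)
    x = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(max_steps):
        if sum(rho) < tol:
            return x
        for i in range(n_states):
            for a in range(n_actions):
                x[i][a] += rho[i] * pi[i][a]
        rho = [
            sum(rho[i] * pi[i][a] * P[i][a][j]
                for i in range(n_states) for a in range(n_actions))
            for j in range(n_states)
        ]
    raise ValueError("probability mass did not leak out; is the chain transient?")


def conservation_residuals(P, x, alpha):
    """For each j: sum_a x[j][a] - sum_{i,a} x[i][a] * P[i][a][j] - alpha[j]."""
    n_states, n_actions = len(x), len(x[0])
    return [
        sum(x[j])
        - sum(x[i][a] * P[i][a][j]
              for i in range(n_states) for a in range(n_actions))
        - alpha[j]
        for j in range(n_states)
    ]


# Hypothetical two-state, two-action instance with leakage out of state 1.
P = [
    [[0.8, 0.2], [0.0, 0.5]],   # state 0: action 0 mostly stays, action 1 leaks
    [[0.0, 0.8], [0.0, 0.8]],   # state 1: both actions stay with probability 0.8
]
pi = [[1.0, 0.0], [0.0, 1.0]]   # pure policy: action 0 in state 0, action 1 in state 1
alpha = [1.0, 0.0]

x = occupancy_measure(P, pi, alpha)
print(x)
print(conservation_residuals(P, x, alpha))
```

The residuals are zero up to the truncation tolerance, which is the statement that visits to a state, minus entries into it, equal the probability of starting there.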
A policy $\pi$ can be computed from $x$ simply as:

\[ \pi_{ia} = \frac{x_{ia}}{x_i} = \frac{x_{ia}}{\sum_a x_{ia}} \tag{7} \]

Under our independence assumptions, we can analogously construct a linear program for an unconstrained multiagent MDP with the total expected reward as the optimization criterion:

\[ \max \sum_m \sum_i \sum_a x^m_{ia} r^m_{ia} \;\Big|\; \sum_i \sum_a (\delta_{ij} - p^m_{iaj})\, x^m_{ia} = \alpha^m_j, \tag{8} \]

where $x^m_{ia}$ is the expected number of times agent $m$ executes action $a$ in state $i$. A solution to this LP yields optimal policies for the unconstrained multiagent problem. It is easy to see that the above LP is completely separable and could have been written as $|M|$ single-agent LPs. We, however, are interested in solving the constrained optimization problem that we have earlier written in an abstract form as (eq. 5).

² From now on we will omit the $x \ge 0$ constraint for brevity.

Let us now rewrite the program (eq. 5) in the occupancy measure coordinates $x$. Adding the constraints from (eq. 5) to (eq. 8), and noticing that $\theta\big(\sum_a c^m_{ak} \sum_i \pi^m_{ia}\big) = \theta\big(\sum_a c^m_{ak} \sum_i x^m_{ia}\big)$, we get the following optimization problem in $x$:

\[
\max \sum_m \sum_i \sum_a x^m_{ia} r^m_{ia} \;\left|\;
\begin{aligned}
&\sum_i \sum_a (\delta_{ij} - p^m_{iaj})\, x^m_{ia} = \alpha^m_j, \\
&\sum_m \theta\Big(\sum_a c^m_{ak} \sum_i x^m_{ia}\Big) \le \hat{c}_k, \\
&\sum_k q_{kl}\, \theta\Big(\sum_a c^m_{ak} \sum_i x^m_{ia}\Big) \le \hat{q}^m_l,
\end{aligned}
\right.
\tag{9}
\]

If we could solve this math program, we would be done. The big problem with this program is, of course, that it involves the step function $\theta$, which makes the constraints nonlinear (and actually discontinuous at zero). Consequently, our goal is to rewrite the program (eq. 9) in a more manageable form. However, we can abandon the idea of trying to write it as a linear program, because LPs are solvable in polynomial time, and we have shown that our problem is NP-complete. Instead, we are going to reformulate the problem as a mixed integer linear program (also NP-complete). The ability to formulate the constrained policy optimization problem as a MILP allows us to make use of a wide variety of highly optimized algorithms and tools for solving integer programs.

First, to simplify the following discussion, let us define $s^m_k = \sum_a c^m_{ak} \sum_i x^m_{ia}$, which has the interpretation of the total expected number of times agent $m$ plans to use actions that need a resource of type $k$ in its policy. Let us augment the original optimization variables $x$ with a set of binary variables $\Delta = [\Delta^m_k]$, where $\Delta^m_k = \theta(s^m_k)$. In other words, $\Delta^m_k$ is an indicator variable that shows whether agent $m$ needs resource $k$ for its policy. Using $\Delta$, we can rewrite the resource constraints in (eq. 9) as

\[ \sum_m \Delta^m_k \le \hat{c}_k, \qquad \sum_k q_{kl}\, \Delta^m_k \le \hat{q}^m_l, \tag{10} \]

which are linear in $\Delta$. This, in itself, does not buy us anything, because we have simply renamed the nonlinear parts of the constraints. However, if we could synchronize $x$ and $\Delta$ to preserve their intended interpretation via a linear function, we would have a linear mixed integer program, which is the goal that we have set forth in this section. The problem is, of course, that the relationship between $\Delta^m_k$ and $x^m$ is nonlinear:

\[
\Delta^m_k = \begin{cases}
0, & \text{if } s^m_k = \sum_a c^m_{ak} \sum_i x^m_{ia} = 0 \\
1, & \text{if } s^m_k = \sum_a c^m_{ak} \sum_i x^m_{ia} > 0
\end{cases}
\tag{11}
\]

Note that this is exactly the step function that we wanted to get rid of in the first place.
However, we can capture the essence of the relationship between $x$ and $\Delta$ with a linear function as follows. First, we need to normalize our occupancy measure ($x$) such that $s^m_k = \sum_a c^m_{ak} \sum_i x^m_{ia} \in [0, 1]$. Let us define a new normalized occupancy measure $y$ as:

\[ y^m_{ia} = \frac{x^m_{ia}}{X}, \tag{12} \]

where $X \ge \sup s^m_k = \sup \sum_a c^m_{ak} \sum_i x^m_{ia}$ is some constant finite upper bound on $s^m_k$, which exists for any transient MDP and can be computed in polynomial time. Indeed, one simple way to do this is to replace the expected reward in the objective function in the standard unconstrained LP (eq. 6) with $\sum_m \sum_i \sum_a x^m_{ia}$ (since $c^m_{ak} \le 1$), solve this LP in polynomial time, and let $X$ equal the resulting value of this objective function.

Given the normalized occupancy measure, we can then capture the essence of the relationship between $x$ and $\Delta$ via a linear constraint:

\[ \sum_a c^m_{ak} \sum_i y^m_{ia} \le \Delta^m_k \tag{13} \]

Clearly, if $\sum_a c^m_{ak} \sum_i y^m_{ia} > 0$, the above constraint forces the corresponding $\Delta^m_k$ to be 1, which is in accordance with (eq. 11). On the other hand, if $\sum_a c^m_{ak} \sum_i y^m_{ia} = 0$, the above constraint will hold for both $\Delta^m_k = 0$ and $\Delta^m_k = 1$, which does not quite satisfy (eq. 11). However, it turns out that this is not a problem for the following reason. If some $\Delta^m_k = 1$, even if the corresponding $\sum_a c^m_{ak} \sum_i y^m_{ia} = 0$ (which is allowed by constraint (eq. 13)), the worst that can happen is that the constraint (eq. 10) on the shared resources becomes unnecessarily broken. This might seem problematic but, in fact, it is not, since the important thing is that another (feasible) solution with the offending deltas corrected always exists and has the same value of the objective function. Basically, the above condition has no false positives, but can have false negatives, which are not lethal. This means that any complete algorithm for solving MILPs will always find the optimal deterministic policy that satisfies the constraints (if one exists).
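The behavior of the linking constraint (eq. 13) can be seen on a single agent with made-up numbers. The sketch below is an illustration under assumptions: the helper names, the occupancy measure `x`, and the choice of `X` as the sum of all entries of `x` (a valid but not necessarily tight bound, since the requirement entries are at most 1) are hypothetical.

```python
def resource_loads(y, C):
    """Left-hand side of (eq. 13) for one agent: sum_a C[a][k] * sum_i y[i][a]."""
    n_actions, n_resources = len(C), len(C[0])
    action_mass = [sum(row[a] for row in y) for a in range(n_actions)]
    return [
        sum(C[a][k] * action_mass[a] for a in range(n_actions))
        for k in range(n_resources)
    ]


def satisfies_eq13(y, C, delta, eps=1e-12):
    """True if every normalized load is bounded by its binary indicator."""
    return all(load <= d + eps for load, d in zip(resource_loads(y, C), delta))


def tightest_deltas(y, C):
    """Smallest binary indicators allowed by (eq. 13); they match (eq. 11)."""
    return [1 if load > 0 else 0 for load in resource_loads(y, C)]


# Hypothetical single-agent data: the policy only ever executes action 0.
x = [[5.0, 0.0],
     [5.0, 0.0]]
C = [[1, 0],      # action 0 needs resource 0
     [0, 1]]      # action 1 needs resource 1
X = sum(sum(row) for row in x)            # assumed upper bound on every s_k
y = [[v / X for v in row] for row in x]   # normalization from (eq. 12)

print(resource_loads(y, C))
print(tightest_deltas(y, C))
print(satisfies_eq13(y, C, [1, 0]), satisfies_eq13(y, C, [1, 1]),
      satisfies_eq13(y, C, [0, 0]))
```

The indicator for the unused resource may be either 0 or 1 under (eq. 13), which is the slack discussed above, whereas setting the indicator of a used resource to 0 is rejected.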
To summarize, the problem of finding optimal policies under operationalization constraints can be formulated as the following MILP:

\[
\begin{aligned}
\max_{y,\,\Delta} \quad & \sum_m \sum_i \sum_a y^m_{ia}\, r^m_{ia} \\
\text{s.t.} \quad & \sum_i \sum_a \big(\delta_{ij} - p^m_{iaj}\big)\, y^m_{ia} = \frac{\alpha^m_j}{X}, \\
& \sum_m \Delta^m_k \le \hat{c}_k, \\
& \sum_k q_{kl}\, \Delta^m_k \le \hat{q}^m_l, \\
& \sum_a c^m_{ak} \sum_i y^m_{ia} \le \Delta^m_k, \\
& y^m_{ia} \ge 0, \quad \Delta^m_k \in \{0, 1\}.
\end{aligned}
\tag{14}
\]

As mentioned earlier, even though solving such programs is, in general, an NP-complete problem, there is a wide variety of very efficient algorithms and tools for doing so (see, for example, (Wolsey 1998) and references therein). Therefore, one of the benefits of reducing the optimization problem to a MILP is that it allows us to make use of the existing highly efficient tools.

[Figure 4: Scalable test problem with n segments.]

When introducing our model, we indicated that it allows for an easy addition of constraints on limited consumable execution resources such as fuel, time, or energy. Such resources are different from the operationalization resources that are the main focus of this paper in that the consumption of the execution resources depends on the frequency of performing actions that utilize the resources (e.g., the more often a rover executes the move action, the more fuel it consumes). We can model such execution resources as follows. Let us say that whenever agent $m$ performs action $a$ in state $i$, it consumes $h^m_{ia}$ units of an execution resource (here, for simplicity, we assume that there is only one resource, but it is trivial to extend this formulation to the case of several consumable resources). Clearly, the total expected consumption of the resource is a linear function of the occupancy measure $x$. Therefore, if we would like to bound the total expected use of the resource, all we have to do is add the following linear constraint to (eq. 14):

\[
\sum_m \sum_i \sum_a h^m_{ia}\, x^m_{ia} \le \hat{h},
\]

where $\hat{h}$ is the upper bound on the expected resource consumption (in the normalized coordinates of (eq. 14), $x^m_{ia} = X y^m_{ia}$). The ability to model linear constraints of this type is a well-known advantage of the LP formulation of MDPs (Altman 1999; Kallenberg 1983; Puterman 1994). Such linear constraints do not add to the complexity of the policy optimization problem, but they do affect the properties of optimal policies. In particular, unlike for the standard unconstrained MDPs, for the problems with such constraints, deterministic policies are no longer guaranteed to be optimal, and uniformly-optimal policies do not always exist. For some domains that involve resources whose overutilization can have dire consequences, it might not be sufficient to bound the expected consumption of a resource, and more expressive risk-sensitive constraints might be required (Ross & Chen 1988; Sobel 1985).
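The effect of a bound on expected consumption can be seen in a toy single-agent example, sketched below in Python. Everything in it is hypothetical (the two-state MDP, the names `occupancy`, `linear`, and `policy_from_occupancy`, and the bound `h_hat`), and it does not solve an LP. It only mixes the occupancy measures of two deterministic policies to show that expected consumption is linear in $x$ and that a randomized policy can meet a bound that the more rewarding deterministic policy violates.

```python
def occupancy(P, alpha, pi, iters=500):
    """x[i][a]: expected number of times action a is executed in state i.

    P[i][a][j] are transition probabilities of a transient MDP, alpha[i] the
    initial distribution, pi[i][a] a (possibly randomized) stationary policy.
    Expected visits are found by fixed-point iteration, which is assumed to
    have converged after `iters` steps (true for this toy problem).
    """
    n = len(alpha)
    visits = list(alpha)
    for _ in range(iters):
        visits = [alpha[j] + sum(visits[i] * pi[i][a] * P[i][a][j]
                                 for i in range(n) for a in range(len(pi[i])))
                  for j in range(n)]
    return [[visits[i] * pi[i][a] for a in range(len(pi[i]))] for i in range(n)]


def linear(coef, x):
    """sum_i sum_a coef[i][a] * x[i][a]: expected reward or consumption."""
    return sum(coef[i][a] * x[i][a]
               for i in range(len(x)) for a in range(len(x[i])))


def policy_from_occupancy(x):
    """Recover pi[i][a] = x[i][a] / sum_a x[i][a] (uniform in unvisited states)."""
    pi = []
    for row in x:
        total = sum(row)
        pi.append([v / total for v in row] if total > 0
                  else [1.0 / len(row)] * len(row))
    return pi


# State 0: action 0 ("work") stays with probability 0.5, earns reward 1 and
# consumes 1 unit; action 1 (noop) moves to state 1. State 1 leaves the system.
P = [[[0.5, 0.0], [0.0, 1.0]],
     [[0.0, 0.0], [0.0, 0.0]]]
alpha = [1.0, 0.0]
r = [[1.0, 0.0], [0.0, 0.0]]
h = [[1.0, 0.0], [0.0, 0.0]]
h_hat = 1.0                                    # bound on expected consumption

x_work = occupancy(P, alpha, [[1.0, 0.0], [0.0, 1.0]])
x_idle = occupancy(P, alpha, [[0.0, 1.0], [0.0, 1.0]])

lam = h_hat / linear(h, x_work)                # weight that exactly meets the bound
x_mix = [[lam * x_work[i][a] + (1 - lam) * x_idle[i][a] for a in range(2)]
         for i in range(2)]
pi_mix = policy_from_occupancy(x_mix)
x_check = occupancy(P, alpha, pi_mix)          # occupancy of the randomized policy

print(linear(h, x_work), linear(r, x_work))    # consumption exceeds h_hat
print(linear(h, x_mix), linear(r, x_mix))      # bound met, reward above the noop policy
print(pi_mix[0])                               # randomizes in state 0
```

In this instance the only deterministic policy that respects the bound is the noop policy, which earns nothing, while the randomized policy recovered from `x_mix` stays within the bound and earns a positive expected reward.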
For such resources, it might be desirable to bound the probability that the resource consumption exceeds a given upper bound (Dolgov & Durfee 2003; 2004).

Empirical Evaluation and Discussion

We have implemented the MILP reduction from the previous section and have run it on a series of test problems to see how it behaves. Our main goals, besides performing an empirical validation of the method, have been to see how well the algorithm scales, and also how it behaves as resource constraints are tightened or relaxed. We have also been interested in performing a preliminary investigation of search-based methods as an alternative to the MILP approach.

[Figure 5: (a) Value of optimal policies for the test problem; (b) Value of policies produced by the MILP approach (solid lines) and the greedy heuristic (dashed lines).]

We have run two sets of experiments on two different sets of problems. One involves a scalable single-agent problem, for which it is possible to analytically compute the optimal policies. The experiments that we have performed on this set were meant to serve as a sanity check for our method, and a rough indication of how the algorithm performs under various constraint levels. We also used the single-agent problem for our investigation of search-based techniques, which we briefly report on in the next section. The second set of experiments was done on a multiagent problem, and the main goal there was to see how the method scales with the number of agents.

Validation

As a test problem for our single-agent experiments, we chose a simple problem that was easily scalable and had optimal policies whose meaning was intuitively clear. The problem is shown in Figure 4. It is composed of $n$ two-state segments and a sink state. Thus, the problem consists of $2n + 1$ states, which are numbered as shown in the figure. There are $n + 1$ actions, numbered from 0 to $n$. Action 0 is a noop that does not require any resources and can be interpreted as doing nothing. In each state $s_i$, $i \in [1, n]$ (upper row), the agent can choose to execute the noop $a_0$ and go to the next state $s_{i+1}$ without getting any reward.
Alternatively, the agent can execute action $a_i$ that matches the current state, in which case it has an equal probability of either going to state $s_{n+i}$ (lower row) or staying in $s_i$, receiving a reward of $i$ in both cases. However, if the agent executes any other non-matching action in state $s_i$, $i \in [1, n]$, it goes to the sink state $s_0$ with certainty and incurs a large penalty of 100. The only available action in states $s_i$, $i \in [n + 1, 2n - 2]$ (lower row) is the noop $a_0$, which yields a reward of zero and takes the agent to $s_{i-n+1}$. There is one resource cost, and each action $a_i$ requires $i$ units of it. For this problem the optimal policy, its value, and its resource requirements are intuitively clear and are computable analytically. In fact, this problem is equivalent to a knapsack problem where the value-to-cost ratio of all items is the same.

In our experiments we varied the size of the problem ($n$) and the amount of the resource that was available to the agent ($\gamma$). The latter was measured as the fraction of the resource amount required by the optimal unconstrained policy. Figure 5a shows the value of the optimal policy as a function of $\gamma$ and $n$. As expected, the MILP method produced the optimal policies; their values are shown in solid lines on Figure 5b. We also timed the algorithm for various constraint levels (different values of $\gamma$) and observed that the running time for highly-constrained as well as for weakly-constrained problems was significantly lower than for constraint levels in the middle range. Noting this, we constructed our multiagent test cases to have moderate constraint levels, i.e., to be the most difficult for our MILP method.

[Figure 6: Running time of the MILP method (a), and comparison to an optimistic estimate of the running time of the traditional flat multiagent MDP defined on the joint state and action spaces of all agents (b). Axes: time (sec) versus number of agents (n); panel (b) uses a logarithmic time axis and compares CMDP with the joint MDP.]

Scalability

For our multiagent experiments, we created the following simple model of the rover domain, described in this paper's introduction. A team of $n$ rovers operates in an $N$-by-$N$ grid world, and their task is to conduct experiments to maximize the expected scientific gain. The experiments can only be carried out at certain locations randomly placed throughout the grid. Each successfully executed experiment produces a reward, but requires a certain set of tools (determined by $c^m_{ak}$). In our experiments, we limited the total amounts of tools $\hat{c}_k$ available to the team to half of what would be needed for the optimal unconstrained policy, to represent a difficult constraint level. There is only one resource cost in this problem: each tool has a weight (specified by $q_k$), which is correlated with the payoff of the experiments that this tool is needed to perform; more valuable experiments require heavier tools. To avoid symmetry between the agents in our experiments, each agent was given a different load capacity ($\hat{q}^m$) that determined how many tools the agent can carry. The big agents with high load capacities are more expensive to operate, i.e., they have higher per-move penalties than the light agents with low load capacities.
The agents' movement through the grid world has a stochastic component to it, and the agents also have a small probability of breaking down at each step.

We conducted the majority of our experiments on a 10-by-10 grid for various numbers of agents. Our main concern was how the solution algorithm would scale as we increased the number of agents. Figure 6a shows the running time of the MILP method (using CPLEX 8.1 on a Pentium 4 PC) for various team sizes. The plot shows that in under 30 seconds, we could compute optimal policies for teams of 15 agents.

It is interesting to contrast this result to what could have been obtained by using a traditional multiagent MDP formulated on the joint state and action spaces of all agents. It is easy to see that for this problem with 100 states, 9 actions, and 15 agents, the joint transition matrix defined on the cross-products of the state and action spaces of all agents would require on the order of $10^{74}$ values ($100^{15}$ joint states times $9^{15}$ joint actions times $100^{15}$ successor states, or about $2 \times 10^{74}$). Thus, it is not even possible to write down a problem of that size as a traditional multiagent MDP, let alone solve it. In fact, if we assume that for a problem with only one rover the traditional approach works a million times faster than our MILP method, and that the traditional approach scales linearly with the problem size, we can plot the running time for the two methods. Figure 6b presents such a comparison graph on a logarithmic time scale, and serves as an indication of the benefits of exploiting problem structure in multiagent MDPs.

Conclusions and Future Work

We have demonstrated that it can be very beneficial to factor the shared resources out of the problem description and treat the resource limitations as constraints imposed on the policy optimization problem. As our analysis shows, the savings for loosely-coupled agents can be tremendous. Of course, there are many other ways of exploiting problem structure, such as abstraction (Dearden & Boutilier 1997; Boutilier, Dearden, & Goldszmidt 1995) and factorization (Boutilier, Dean, & Hanks 1999). It appears that it could be very beneficial to combine such methods with our constrained optimization approach. However, this would involve overcoming several challenges, the most important of which is probably the following.
Just like the majority of methods for solving unconstrained MDPs, the existing methods that work with compact problem representations rely on Bellman's principle of optimality, which states that the optimal action for each state is independent of the optimal actions chosen for other states. However, this principle no longer holds when global constraints are imposed on agents' policies. Indeed, enabling an optimal action for one state might consume limited resources, making the optimal action for another state infeasible. Overcoming such difficulties in an attempt to combine compact MDP representations and our constrained optimization ideas is one of the directions of our future work.

As mentioned earlier, we were also interested in exploring the possibility of using search-based methods as an alternative to the MILP approach. To this end, we compared our MILP method to a very simple heuristic search method, which worked as follows. It first solved the unconstrained problem and then sequentially replaced some actions with the noop to reduce the cost of the policy. The actions to be replaced were chosen randomly. We ran the two methods on our single-agent test problem (Figure 4). As can be seen from Figure 5b, which shows the values of the policies produced by the two methods, the greedy heuristic works very well for this test problem. This should not be surprising, given the analogy of this problem to knapsack, and the fact that all actions have the same reward-to-cost ratio. The fact that this elementary heuristic approach works well on some problems at a fraction of the time required for the MILP method suggests that exploring search-based approaches for this optimization problem might be worthwhile.

Of course, this particular heuristic method turns out to have been well matched to this problem only by good fortune. It can do very poorly for a variation of the problem where the role of the noop and the other actions in the upper-row states $s_i$, $i \in [1, n]$ is reversed. There, the noop leads to the sink state $s_0$, incurring a penalty of 100, and the other non-matching actions lead to the next state $s_{i+1}$ with no reward. For this problem, the heuristic almost always produces the worst possible policy (as depicted in Figure 7). A systematic investigation of heuristic search-based methods is required before any claims can be made about the trade-offs and benefits of using such methods. This is another direction of our future work.

[Figure 7: Values of policies produced by the MILP approach (solid lines) and the greedy heuristic (dashed lines) for a slightly modified version of the problem in Figure 4. Axes: policy value V(π) versus γ and n.]

Acknowledgments

This work was in part supported by DARPA/ITO and the Air Force Research Laboratory under contract F30602-00-C-0017 as a subcontractor through Honeywell Laboratories and also through an Industrial Partners grant from Honeywell. The authors thank Kang Shin, Haksun Li, and David Musliner.

References

Altman, E. 1999. Constrained Markov Decision Processes. Chapman and HALL/CRC.

Bellman, R. 1957. Dynamic Programming. Princeton University Press.

Boutilier, C.; Brafman, R.; and Geib, C. 1997. Prioritized goal decomposition of Markov decision processes: Towards a synthesis of classical and decision theoretic planning. In Pollack, M., ed., Proceedings of the Fifteenth International Joint Conference on Artificial Intelligence, 1156-1163. San Francisco: Morgan Kaufmann.

Boutilier, C.; Dean, T.; and Hanks, S. 1999. Decision-theoretic planning: Structural assumptions and computational leverage. Journal of Artificial Intelligence Research 11:1-94.

Boutilier, C.; Dearden, R.; and Goldszmidt, M. 1995. Exploiting structure in policy construction. In Mellish, C., ed., Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, 1104-1111. San Francisco: Morgan Kaufmann.

Boutilier, C. 1999. Sequential optimality and coordination in multiagent systems. In Proceedings of the 1999 International Joint Conference on Artificial Intelligence, 478-485.

Bresina, J.; Dearden, R.; Meuleau, N.; Ramakrishnan, S.; Smith, D.; and Washington, R. 2002. Planning under continuous time and resource uncertainty: A challenge for AI. In Uncertainty in Artificial Intelligence: Proceedings of the Eighteenth Conference (UAI-2002), 77-84. San Francisco, CA: Morgan Kaufmann Publishers.

Dean, T., and Lin, S.-H. 1995. Decomposition techniques for planning in stochastic domains. In Proceedings of the 1995 International Joint Conference on Artificial Intelligence.

Dearden, R., and Boutilier, C. 1997. Abstraction and approximate decision-theoretic planning. Artificial Intelligence 89(1-2):219-283.

D'Epenoux. 1963. A probabilistic production and inventory problem. Management Science 10:98-108.

Dolgov, D. A., and Durfee, E. H. 2003. Approximating optimal policies for agents with limited execution resources. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, 1107-1112.

Dolgov, D. A., and Durfee, E. H. 2004. Approximate probabilistic constraints and risk-sensitive optimization criteria in Markov decision processes. In Proceedings of the Eighth International Symposiums on Artificial Intelligence and Mathematics (AI&M 7-2004).

Garey, M. R., and Johnson, D. S. 1979. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman & Co.

Kallenberg, L. 1983. Linear Programming and Finite Markovian Control Problems. Math. Centrum, Amsterdam.

Meuleau, N.; Hauskrecht, M.; Kim, K.-E.; Peshkin, L.; Kaelbling, L.; Dean, T.; and Boutilier, C. 1998. Solving very large weakly coupled Markov decision processes. In AAAI/IAAI, 165-172.

Puterman, M. L. 1994. Markov Decision Processes. New York: John Wiley & Sons.

Ross, K., and Chen, B. 1988. Optimal scheduling of interactive and non-interactive traffic in telecommunication systems. IEEE Transactions on Auto Control 33:261-267.

Singh, S., and Cohn, D. 1998. How to dynamically merge Markov decision processes. In Jordan, M. I.; Kearns, M. J.; and Solla, S. A., eds., Advances in Neural Information Processing Systems, volume 10. The MIT Press.

Sobel, M. 1985. Maximal mean/standard deviation ratio in undiscounted MDP. OR Letters 4:157-188.

Wolsey, L. 1998. Integer Programming. John Wiley & Sons.