Efficient Planning in R-max

Marek Grześ and Jesse Hoey
David R. Cheriton School of Computer Science, University of Waterloo
200 University Avenue West, Waterloo, ON, N2L 3G1, Canada
{mgrzes, ...}

ABSTRACT

PAC-MDP algorithms are particularly efficient in terms of the number of samples obtained from the environment which are needed by the learning agent in order to achieve near optimal performance. These algorithms, however, execute a time consuming planning step after each new state-action pair becomes known to the agent, that is, after the pair has been sampled sufficiently many times to be considered known by the algorithm. This fact is a serious limitation on broader application of these kinds of algorithms. This paper examines the planning problem in PAC-MDP learning. Value iteration, prioritized sweeping, and backward value iteration are investigated. Through the exploitation of the specific nature of the planning problem in the considered reinforcement learning algorithms, we show how these planning algorithms can be improved. Our extensions yield significant improvements in all evaluated algorithms, and in standard value iteration in particular. Theoretical justification for all contributions is provided, and all approaches are further evaluated empirically. With our extensions, we managed to solve problems of sizes which have never been approached by PAC-MDP learning in the existing literature.

Categories and Subject Descriptors

I.2.6 [Artificial Intelligence]: Learning; I.2.8 [Artificial Intelligence]: Problem Solving, Control Methods, and Search

General Terms

Algorithms, Experimentation, Theory

Keywords

Reinforcement learning, Planning, MDP, Value Iteration

Cite as: Efficient Planning in R-max, Marek Grześ and Jesse Hoey, Proc. of 10th Int. Conf. on Autonomous Agents and Multiagent Systems (AAMAS 2011), Tumer, Yolum, Sonenberg and Stone (eds.), May 2-6, 2011, Taipei, Taiwan, pp. XXX-XXX. Copyright (c) 2011, International Foundation for Autonomous Agents and Multiagent Systems. All rights reserved.

1 Introduction

The key research challenge in the area of reinforcement learning (RL) is how to balance the exploration-exploitation trade-off. One of the best approaches to exploration in RL, which has good theoretical properties, is so-called PAC-MDP learning (PAC means Probably Approximately Correct). State-of-the-art examples of this idea are E^3 [9] and R-max [3]. PAC-MDP learning defines an exploration strategy which guarantees that with high probability the algorithm performs near optimally for all but a polynomial number of time steps (i.e., polynomial in the relevant parameters of the underlying process). This means that PAC-MDP algorithms are considerably efficient in terms of the number of samples which are needed during learning in order to achieve near optimal performance. These algorithms, however, execute a time consuming planning step after each new state-action pair becomes known to the agent, i.e., after the pair was sampled sufficiently many times to be considered known by the algorithm, and this is a serious limitation against broader application of these kinds of algorithms [21]. This paper examines the planning problem in PAC-MDP learning. A number of algorithms are investigated with regard to planning in PAC-MDP RL (this includes value iteration, prioritized sweeping, and backward value iteration), and the contributions of this paper can be summarized as follows: First, we show how the standard R-max algorithm can reduce the worst case number of planning steps from |S||A| to |S|. Second, exploiting the special nature of the planning problem in the considered RL algorithms, a new update operator is proposed which updates only the best action of each state until convergence within the given state. This approach yields a significant improvement in all evaluated algorithms, and in standard value iteration in particular.
Next, an extension is proposed to the prioritized sweeping algorithm which again exploits properties of the planning problem in PAC-MDP learning. Specifically, only policy predecessors of each state are added to the priority queue, in contrast to adding all predecessors as in the standard prioritized sweeping algorithm. Finally, we apply backward value iteration (BVI) to planning in R-max, and we show that the original algorithm from the literature [4] can fail on a broad class of MDPs. We show the problem, and after that our correction to the BVI algorithm is proposed for the general case. Then, our extensions to the corrected version of BVI, which are again specific to planning in PAC-MDP learning, are proposed. Theoretical justification for all contributions is provided, and all approaches are further evaluated empirically on two domains. Regardless of which particular PAC-MDP algorithm is considered, the time consuming planning step is required after a new state-action pair becomes known. This problem applies also to other model-based RL algorithms which are not PAC-MDP, such as the Bayesian Exploration Bonus algorithm [10]. Our work is to improve the planning step of these kinds of algorithms, and it applies to all existing flavours of PAC-MDP learning [16, 19]. In this paper, we focus on R-max, a popular example of PAC-MDP learning, and our work is equally applicable to other related model-based RL algorithms (including those which heuristically modify rewards [1]).

2 Background

The underlying mathematical model of the RL methodology is the Markov Decision Process (MDP). An MDP is defined as a tuple (S, A, T, R, γ), where S is the state space, A is the action space, T : S × A × S → [0, 1] is the transition function, R : S × A × S → ℝ is the reward function (which is assumed here to be bounded above by the value R_max), and 0 ≤ γ ≤ 1 is the discount factor which determines how the long-term reward is calculated from immediate rewards [15]. The problem of solving an MDP is to find a policy (i.e., a mapping from states to actions) which maximizes the accumulated reward. The Bellman equation defines optimality conditions when the environment dynamics (i.e., transition probabilities and the reward function) are known [2]. In such a case, the problem of finding the policy becomes a planning problem which can be solved using iterative approaches like policy and value iteration [2]. These algorithms take (S, A, T, R, γ) as input and return a policy which determines which action should be taken in each state so that the long-term reward is maximized. In algorithms which represent the policy via the value function, Q(s, a) reflects the expected long-term reward when action a is executed in state s, and V(s) = max_a Q(s, a). The policy and value iteration methods require access to an explicit, mathematical model of the environment, that is, the transition probabilities, T, and the reward function, R, of the controlled process. When such a model is not available, there is a need for algorithms which can learn from experience. Algorithms which learn the policy from simulation in the absence of the MDP model are known as reinforcement learning [18, 2]. The first major approach to RL is to estimate the missing model of the environment using, e.g., statistical techniques. Repeated simulation is used to approximate or average the model. Once such an estimation of the model is available, standard techniques for solving MDPs are again applicable. This approach is known as model-based RL [17]. This paper investigates a special type of model-based RL which is known as PAC-MDP learning. An alternative class of approaches to RL, which is not considered in this paper, does not attempt to estimate the model of the environment, and because of that is called model-free RL. Algorithms of this type directly estimate the value function or policy [13] from repeated simulation. Standard examples of this approach are the Q-learning and SARSA algorithms [18]. PAC-MDP learning is a particular approach to exploration in RL and is based on optimism in the face of uncertainty [9, 3]. As in standard model-based learning, in PAC-MDP model-based algorithms, the dynamics of the underlying MDP are estimated from data. If a certain state-action pair has been experienced enough times (parameter m controls this in R-max), then the estimated dynamics are close to the true values. The optimism under uncertainty plays a crucial role when dealing with state-action pairs which have not been experienced m times. For such pairs, the algorithm assumes the highest possible value of their Q-values. State-action pairs for which n(s, a) < m are named unknown, and known when n(s, a) ≥ m, where n(s, a) is the number of times the state-action pair was experienced. When a new state-action pair becomes known, the existing approximation, M̂, of the true model, M, is used to compute the corresponding optimal policy for M̂ which, when executed, will encourage the algorithm to try unknown actions and learn their dynamics. Such an exploration strategy guarantees that with high probability the algorithm performs near optimally for all but a polynomial number of steps (i.e., polynomial in the relevant parameters of the underlying MDP). The prototypical R-max algorithm uses the standard Bellman backup (see Algorithm 1) and value iteration to compute the policy, π̂, for the model M̂, where the policy π̂(s) is defined in Equation 1:

π̂(s) = arg max_a Q̂(s, a)   (1)

Summarizing, the R-max algorithm works as follows: it acts greedily according to the current V̂.

Algorithm 1 Backup(s): Bellman backup for state s
  old_val ← V̂(s)
  V̂(s) = max_a Q̂(s, a) = max_a [ R̂(s, a) + γ Σ_{s'} T̂(s, a, s') V̂(s') ]
  return |old_val − V̂(s)|
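For concreteness, the following is a minimal Python sketch of Algorithm 1 and of the value-iteration loop built on it. The dictionary-based model representation (T[s][a] as a list of (next state, probability) pairs, and R and Q keyed by (s, a)) is an assumption made here for illustration, not the paper's C++ implementation.

# Sketch of Algorithm 1 (Bellman backup) and a value-iteration loop.
# Assumed representation: T[s][a] is a list of (s2, prob) pairs, R[(s, a)]
# is the expected reward, and Q[(s, a)] holds the current Q-value.

def value_of(Q, actions, s):
    """V(s) = max_a Q(s, a)."""
    return max(Q[(s, a)] for a in actions[s])

def backup(Q, R, T, actions, gamma, s):
    """Bellman backup of state s; returns the residual |old V(s) - new V(s)|."""
    old_val = value_of(Q, actions, s)
    for a in actions[s]:
        Q[(s, a)] = R[(s, a)] + gamma * sum(
            p * value_of(Q, actions, s2) for s2, p in T[s][a])
    return abs(old_val - value_of(Q, actions, s))

def value_iteration(Q, R, T, actions, states, gamma, eps=1e-4):
    """Sweep all states until the largest residual drops below eps. With
    incremental planning, Q starts from the previous planning step's values
    instead of a fresh initialization."""
    while max(backup(Q, R, T, actions, gamma, s) for s in states) >= eps:
        pass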
Once a new state-action pair becomes known, it performs planning with the updated model (i.e., the model with the new known state-action pair), and again acts greedily according to the updated V̂. A natural and the most efficient approach to planning in this scenario is to use the outcome of the previous planning process as the initial value function for the new planning, which we refer to in the paper as incremental planning. This is assumed for all algorithms and experiments of this paper. The proofs and theoretical analysis of PAC-MDP algorithms can be found in the relevant literature [8, 16]. In our analysis one specific property of such algorithms is advocated: the optimism under uncertainty which guarantees that the inequality V̂(s) ≥ V*(s) is always satisfied during learning, where V*(s) is the optimal value function which corresponds to the true MDP model M.

3 Known States in R-max

The focus of this paper is how to perform the planning step in R-max efficiently. In original R-max, the planning step is executed every time a new state-action pair becomes known [3] (this is also the case in known implementations [1]). While investigating the range of planning algorithms which are discussed below, we found that the efficiency of planning in R-max can be improved by taking into account the fact that the value of a given state does not change until all its actions become known. This is because if all unknown state-action pairs are initialized with V_max (as is the case in R-max), where V_max = R_max/(1 − γ) when γ < 1 and V_max = R_max if γ = 1, then V(s) = V_max as long as at least one action remains unknown in state s. If the R-max algorithm executes the planning algorithm after the pair (s, a) becomes known, while there still exists at least one action which is unknown in s, then only one Q-value will change, i.e., the value of the pair (s, a). If, after the update, Q(s, a) < V(s) = V_max, the value of s will not change. Action a will not be executed next time in state s, and another action will be used. In this way, unknown actions are correctly explored by the policy π̂ from Equation 1, but we observe here that the update is useless. Our novel improvement, which comes from the above observation, is to extend the notion of a known state-action pair by a notion of a known state, where known(s) = true iff known(s, a) = true for all a. With this extension, our approach is to execute the planning step in R-max only when a new state, s, becomes known (i.e., known(s) becomes true). The only issue now is that the action selection according to Equation 1 has to be changed in order to deal properly with states for which known(s) = false. This can be addressed by selecting actions using Algorithm 2 instead of Equation 1.

Algorithm 2 GetAction(s): modified action selection method
  if known(s) then
    return π̂(s) {see Equation 1}
  else
    return any action a for which known(s, a) = false
  end if

As explained above, this procedure will not change the exploration of the R-max algorithm when ties are broken randomly.
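A sketch of the known-state bookkeeping and of Algorithm 2 in the same assumed representation; the visit counter n[(s, a)] and the helper names are illustrative choices.

# known(s, a) holds when the pair was visited at least m times; a state is
# known only when all of its actions are known (the extension of Section 3).

def known_sa(n, m, s, a):
    return n[(s, a)] >= m

def known_state(n, m, actions, s):
    return all(known_sa(n, m, s, a) for a in actions[s])

def get_action(Q, n, m, actions, s):
    """Algorithm 2: act greedily in known states; otherwise return an unknown
    action, which Equation 1 would also select (its Q-value is still Vmax)."""
    if known_state(n, m, actions, s):
        return max(actions[s], key=lambda a: Q[(s, a)])  # Equation 1
    return next(a for a in actions[s] if not known_sa(n, m, s, a))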

Normally, when the planning step is executed after learning each new state-action pair, its Q-value satisfies Q(s, a) ≤ V(s) = V_max when there exists at least one unknown action in s. When ties are broken randomly (this is the case when Q(s, a) = V_max for the updated known action a), this is equivalent to postponing planning and executing another action which is still unknown while known(s) = false. This improvement is particularly useful for planning algorithms which do systematic updates of the entire Q-table, as value iteration does, because when known(s) = false the entire planning process changes the Q-values only of those actions which have just become known and there are no changes in the Q-values of any other states, whereas value iteration would iterate and perform (useless) Bellman updates for all states. Experimental validation of our extension is in the experimental section of the paper. Since this improvement yielded a considerable speed-up and represents a more efficient implementation of R-max, if not stated otherwise, we use this extension in all experiments presented in the paper. The main goal of this paper is to speed up the R-max algorithm with regard to planning, and our approach presented here reduces the number of executions of the planner (regardless of which planner is used) from O(|S||A|) to O(|S|).

4 Best-action Only Updates

From this point, we look at ways of improving the planning algorithms themselves. The first extension which is introduced in this section is applicable to all algorithms investigated in the paper. However, in order to make the presentation easier to understand and to explain the intuition behind this extension, we first show how it applies to value iteration. Its application to other planning approaches is discussed in detail in subsequent sections. Let us assume the standard scenario of R-max learning when value iteration is used as the planning method, together with the incremental approach indicated at the end of Section 2. This means that the initial value function at the beginning of planning is always optimistic with regard to the value which is the result of planning. Additionally, under conditions specified below, the value function after each Bellman backup is also optimistic with regard to the value function after the previous Bellman backup (in R-max, values are successively decreased to reflect the changes in the model which made the model less optimistic when a new state became known). The intuition which motivates Algorithm 3 is that the change of V(s) in a given iteration can be triggered only by a change of the Q-value of the best action of s, because all Q(s, a) are always optimistic with regard to the optimal value function and to the values after succeeding Bellman backups, and we argue here that in each state the action which has the highest Q(s, a) should be updated first. This can be explained as follows. If the value of the best action does not change after its update, which means that V(s) will not change in the current iteration, then all other remaining actions can be skipped in this iteration because they have lower values and they will not influence V(s) (this explains why the for loop in Algorithm 3 can back up only the best actions). If the value of the best action changes after the update, on the other hand, then another action may become the best and it is reasonable to update the currently best action of the same state again (this explains why the external loop of Algorithm 3 makes sense). We recall here that in the standard Bellman backup (see Algorithm 1) all actions are updated. Our idea here is that it is profitable to focus Bellman backups only on the best action of each state instead of performing updates of all actions, when the optimistic initialization satisfies the conditions defined below. This concept is named best-action only updates (BAO) and is captured by Algorithm 3. The two formal arguments below prove that Algorithm 3 is valid.
Algorithm 3 BAO(s): best-action only backup of state s
  old_val ← V(s)
  repeat
    best_actions = all a in s s.t. max_i Q(s, i) − Q(s, a) < ɛ
    δ = 0
    for each a in best_actions do
      old_Q = Q(s, a)
      Q(s, a) = R(s, a) + γ Σ_{s'} T(s, a, s') max_{a'} Q(s', a')
      if |old_Q − Q(s, a)| > δ then
        δ = |old_Q − Q(s, a)|
      end if
    end for
  until δ < ɛ
  return |old_val − V(s)|

DEFINITION 1. Optimistic initialization with one step monotonicity (OOSM) is the special case of optimistic initialization of the Q-table which satisfies the following property: Q(s, a) ≥ R(s, a) + γ Σ_{s'} T(s, a, s') V(s').

The OOSM property is satisfied, e.g., in any MDP as long as all Q-values are initialized with V_max. It will be shown in what follows that planning in R-max satisfies the OOSM requirement as well. In order to prove Algorithm 3, we first prove the following lemma:

LEMMA 1. If all Q(s, a) are initialized according to optimism with one step monotonicity (OOSM), then after each individual (t+1)-st Bellman backup of the Q-table, the following inequality is satisfied: ∀s,a Q_t(s, a) ≥ Q_{t+1}(s, a), where Q_t is the value function after the previous, t-th, Bellman backup.

PROOF. We prove this lemma by induction on the number of performed Bellman backups of Q-values. To prove the base case, we show that the lemma is satisfied after the first Bellman backup. This follows directly from the definition of optimism with one step monotonicity (see Definition 1). After proving the base case, we assume that the statement holds after t Bellman backups, and we show that it holds after t + 1 backups using the following argument:

Q_t(s, a) = R(s, a) + γ Σ_{s'} T(s, a, s') V_{t−1}(s')
          ≥ R(s, a) + γ Σ_{s'} T(s, a, s') V_t(s')
          = Q_{t+1}(s, a).

The first equation shows that the update of Q_t(s, a) in backup t is based on the values of all next states, s', after t − 1 backups, and the third equation is analogous for backup t + 1. The second step follows from the induction hypothesis, which implies that V_{t−1}(s') ≥ V_t(s'). The following corollary results from Lemma 1:

COROLLARY 1. Q-values converge monotonically to Q*(s, a) when all Q(s, a) entries are OOSM initialized in value iteration.
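As a concrete reading of Definition 1, the following small helper (in the same assumed representation as the earlier sketches; the function name and tolerance are our own choices) tests whether a Q-table satisfies the OOSM property:

def satisfies_oosm(Q, R, T, actions, states, gamma, tol=1e-9):
    """Definition 1: Q(s, a) >= R(s, a) + gamma * sum_s2 T(s, a, s2) V(s2)
    for every pair, where V(s2) = max_a2 Q(s2, a2)."""
    for s in states:
        for a in actions[s]:
            backed_up = R[(s, a)] + gamma * sum(
                p * max(Q[(s2, a2)] for a2 in actions[s2])
                for s2, p in T[s][a])
            if Q[(s, a)] < backed_up - tol:
                return False
    return True

For γ < 1, initializing every entry with V_max = R_max/(1 − γ) makes this check pass, since no backed-up value can then exceed R_max + γ V_max = V_max.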

THEOREM 1. Value iteration with the best-action only updates of Algorithm 3 converges to the same values as standard value iteration with the Bellman backup of Algorithm 1 when the value function is OOSM initialized, i.e., when the optimistic initialization satisfies Definition 1.

PROOF. In order to prove this theorem, it is sufficient to show that non-best actions do not have to be updated. Let us assume that a is a non-best action of a particular state s, i.e., an action s.t. Q(s, a) < max_i Q(s, i). Because all Q-values are initially OOSM optimistic, we know from Lemma 1 that Q(s, a) cannot be made higher than its current value in any of the future iterations of value iteration. This means that Q(s, a) cannot be made higher than max_i Q(s, i) by updating Q(s, a), and the only way to make a the best action in s is to reduce the value of max_i Q(s, i), which may happen only by updating the action i which attains max_i Q(s, i). This shows that if the value function is initialized with OOSM optimism, it is sufficient to update the best action only. Additionally, if the change of max_i Q(s, i) is smaller than ɛ, V(s) cannot change in the current iteration of value iteration (within the given precision ɛ) and the algorithm can move to updating other states in this iteration.

This proof makes BAO updates applicable to general value iteration planning with OOSM optimistic initialization. As mentioned before, OOSM is naturally satisfied in any MDP as long as all values are initialized with V_max. This requirement is rather weak and easy to satisfy, and in this way the applicability of BAO is substantial. A short explanation is required on why OOSM is satisfied in R-max. In our approach, each new planning step starts with the value function of the previous planning step (incremental planning). The new MDP model is different from the previous one just in having one more known state. Thus, all states which were known in the previous model satisfy OOSM with equality, and the state which has just become known still has its V(s) = V_max, which cannot be made higher, and so satisfies OOSM as well. Due to the nature of the BAO update, this method is expected to yield a particularly significant improvement in domains with a larger number of actions in each state. It also has great potential to improve planning in domains with continuous actions, because only a limited number of continuous actions should be updated.
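Continuing the same sketch representation, a Python version of the BAO backup of Algorithm 3; as in the pseudocode, every action whose Q-value lies within ɛ of the current maximum is treated as a best action.

# Sketch of Algorithm 3 (best-action only backup). Only near-best actions are
# backed up; the outer loop repeats because lowering the best action's value
# may expose a different best action.

def bao(Q, R, T, actions, gamma, s, eps=1e-4):
    old_val = max(Q[(s, a)] for a in actions[s])
    while True:
        top = max(Q[(s, a)] for a in actions[s])
        best_actions = [a for a in actions[s] if top - Q[(s, a)] < eps]
        delta = 0.0
        for a in best_actions:
            old_q = Q[(s, a)]
            Q[(s, a)] = R[(s, a)] + gamma * sum(
                p * max(Q[(s2, a2)] for a2 in actions[s2])
                for s2, p in T[s][a])
            delta = max(delta, abs(old_q - Q[(s, a)]))
        if delta < eps:
            break
    return abs(old_val - max(Q[(s, a)] for a in actions[s]))

Used in place of backup() inside the value_iteration() sketch above, this gives the VI-BAO configuration evaluated in Section 7.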
5 Prioritized Sweeping for R-max

Prioritized sweeping (PS) has been popular for its improved empirical convergence rates, but its theoretical convergence was only expected by [12] to be provable based on the convergence results for asynchronous dynamic programming (ADP), by observing that PS is an ADP algorithm. The first formal proof for general PS was recently presented by [11], and the PS algorithm of [12] was also proved as a special case under the rather restrictive condition that initially all states have to be assigned a non-zero priority. This is a restrictive assumption with regard to the incremental planning found in R-max, because in R-max usually not all states require being updated even once. In what follows, we prove that PS converges when used for planning in R-max without those restrictive assumptions. This holds also for our extension to basic PS (shown in Algorithm 4), which is based on the idea that it is sufficient to add to the priority queue only the policy predecessors of a state s, defined as

PolicyPred(s) = { s̄ : T(s̄, π(s̄), s) > 0 },   (2)

(see Line 6 in Algorithm 4) instead of all predecessors, defined as

Pred(s) = { s̄ : ∃a T(s̄, a, s) > 0 },   (3)

as is the case in standard PS [12].

Algorithm 4 PS-PP(s_k): prioritized sweeping with policy predecessors for incremental planning in R-max after state s_k becomes known
1: PQ ← s_k
2: while PQ ≠ ∅ do
3:   remove the first element s from PQ
4:   residual(s) ← Backup(s)
5:   if residual(s) > ɛ then
6:     for all s̄ ∈ PolicyPred(s) do
7:       priority ← T(s̄, π(s̄), s) · residual(s)
8:       if s̄ ∉ PQ then
9:         insert s̄ into PQ according to priority
10:      else
11:        update s̄ in PQ if the new priority is higher
12:      end if
13:    end for
14:  end if
15: end while

LEMMA 2. The prioritized sweeping algorithm specified in Algorithm 4 drives the Bellman error to 0 (with the required precision ɛ) when executed for a newly learned state, s_k, in R-max, initializing the value function using the value function of the previous planning step in which s_k was not known.

PROOF. Let F ⊆ S be the set of states which do not have s_k in their policy graph. Since the value of s_k can only decrease in the current planning process (because in the previous planning process it was unknown with V(s_k) = V_max, and now it has become known with V(s_k) ≤ V_max), state s_k will not appear in the optimal policy graph of any state in F; therefore the current values of all states in F are correct, do not require updates, and their Bellman error is already 0. This argument proves that states in F do not have to be updated, and only states in S \ F should be updated, that is, policy predecessors of s_k. This proves that the backward expansion over policy predecessors in Line 6 is correct, and constitutes our extension to the standard PS algorithm [12] for planning in R-max. Let S_k be S \ F. Since s_k is the only state in S_k which changed its dynamics, s_k is the only state from which the modified value function should be back-propagated. The argument of the previous paragraph showed that this back-propagation can keep updating only policy predecessors of state s_k; therefore the last condition to prove is that the predecessors of a state s should be visited only when residual(s) > ɛ. We do this by showing that if residual(s) ≤ ɛ for all states s which can be reached when any action is executed in s', then residual(s') ≤ ɛ. This means that if all successors of s' changed by less than ɛ, then s' does not have to be backed up given precision ɛ. Writing ΔV(s) for the change of V(s), and noting that residual(s) = |ΔV(s)|, this can be derived as follows:

residual(s') = | max_a { R(s', a) + γ Σ_s T(s', a, s) [V(s) + ΔV(s)] } − max_a { R(s', a) + γ Σ_s T(s', a, s) V(s) } |
             ≤ max_a γ Σ_s T(s', a, s) |ΔV(s)|
             = max_a γ Σ_s T(s', a, s) residual(s)
             ≤ max_a γ Σ_s T(s', a, s) ɛ = γɛ ≤ ɛ.

The first equation is the definition of residual(s'), where the current V(s') was computed from the values V(s), and the new V(s') is computed from V(s) + ΔV(s) for each successor s of s'. The next steps are simple algebraic operations, and the inequalities follow from |a + b| ≤ |a| + |b|, residual(s) ≤ ɛ, and γ ≤ 1. The backward search from s_k in Algorithm 4 will not expand a state s' only when all successors s of s' for the given policy action have residual(s) ≤ ɛ (s' will be visited if residual(s) > ɛ for at least one s). This ends the proof that V(s') is within the required precision ɛ when the algorithm terminates.

Algorithm 4 would normally use the Backup(s) method of Algorithm 1 in Line 4. The proof of Theorem 1 extends to Algorithm 4 with OOSM initialization as well, and the BAO procedure presented in Algorithm 3 can also be used in Algorithm 4 by replacing, in Line 4, Backup(s) with BAO(s).
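A compact Python sketch of Algorithm 4. The paper's implementation uses a trinomial heap, which supports increasing an element's priority in constant time [20]; this sketch instead assumes Python's heapq and emulates priority increases with lazy (stale-entry) deletion, a standard substitute.

import heapq
import itertools

# Sketch of Algorithm 4 (PS-PP). policy_pred(s) is assumed to yield the pairs
# (s_bar, T(s_bar, pi(s_bar), s)) with positive probability, i.e., only the
# predecessors that reach s under their current policy action; backup(s) is
# Backup(s) of Algorithm 1 or BAO(s) of Algorithm 3 and returns the residual.

def ps_pp(s_k, backup, policy_pred, eps=1e-4):
    tie = itertools.count()          # avoids comparing states on priority ties
    best = {s_k: 0.0}                # best queued priority per state
    heap = [(0.0, next(tie), s_k)]   # min-heap over negated priorities
    while heap:
        neg_p, _, s = heapq.heappop(heap)
        if best.get(s) != -neg_p:
            continue                 # stale entry, superseded by a higher one
        del best[s]
        residual = backup(s)
        if residual > eps:
            for s_bar, prob in policy_pred(s):
                priority = prob * residual
                if priority > best.get(s_bar, -1.0):
                    best[s_bar] = priority  # "update if the new priority is higher"
                    heapq.heappush(heap, (-priority, next(tie), s_bar))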

6 Backward Value Iteration with Loops

Backward value iteration (BVI) is an algorithm for planning in general MDPs with a set of terminal states [4]. This algorithm traverses the transpose of the policy graph using breadth- or depth-first search which starts from the goal state, and it checks for duplicates so that each state is updated only once in the same iteration. States are backed up in the order in which they are encountered during the search. Before applying this algorithm to planning in R-max and proposing our extensions, we show that the original version of the algorithm can fail to compute the correct value function. Let us assume the original version of the BVI algorithm from [4], as summarized above, and the use of this algorithm for planning in the domain whose four states are shown in Figure 1.

Figure 1: An example when the original backward value iteration fails on a loop (panels a, b, c).

First, in Figure 1a, the current policy actions are shown before any updates of the current iteration of BVI. Figure 1b shows the policy actions after performing backups on states b and d, after which the policy action of state d changed (the new action is highlighted using a thick style). Figure 1c shows updates of states a and c, after which the best action of state c changed (again the thick style shows the new action). After these updates, there is a loop which involves states c and d, and the BVI algorithm will not update these states again in the current iteration because each state is updated only once; moreover, the algorithm will also never update these two states again in any of the future iterations, because the policy actions of all states in the loop do not lead to any states outside of the loop (so neither c nor d will be the previous state - according to policy actions - of any state outside of the loop). This situation can happen in the broad class of MDPs in which states are revisited, as in our testing domains, and it applies also to stochastic actions when all actions of all states in the loop lead to states in the loop only. It is worth noting that in [4], where the BVI algorithm was introduced, all domains require many steps to revisit a state (actions are not easily reversible due to velocity in the state space). Our example shows that the standard version of the BVI algorithm can fail by encountering loops in a broad class of MDPs. This problem of the standard BVI algorithm was found empirically during our experimentation in this research, in which the R-max agent was getting stuck in such loops. It is worth recalling here that the PS-PP algorithm of the previous section also expands only policy predecessors; however, it does not suffer from the same problem because PS-PP guarantees that s' will be visited if residual(s) > ɛ for at least one successor s; thus the states which constitute the loop will be updated as well and they will converge to proper values. The BVI algorithm, with policy predecessors and updating each state once in each iteration, will fail in the case indicated in Figure 1. A brief analysis of Figure 1c indicates one simple solution to the presented problem of the standard BVI algorithm. Since the states which are in the loop have other, non-policy actions which lead to states outside of the loop (e.g., state d has a non-policy action which leads to state b), the straightforward solution to the loop problem is to perform the backward search on all predecessors of a given state, as opposed to policy predecessors as is the case in the original BVI algorithm. This is the first extension to BVI which is proposed in this paper, and the BVI algorithm modified in this way is named LBVI, which stands for BVI with loops. The LBVI algorithm with this modification is applicable to general MDP planning.
Our additional extensions to the LBVI algorithm are specific to the incremental planning in R-max which is studied in this paper. The complete algorithm is presented in Algorithm 5. This is the standard version of the BVI algorithm with the following extensions: (1) all predecessors are used in the state expansion in Line 13 (to deal with the problem of Figure 1), (2) the residual is checked in Line 12 (to prune the state expansion when possible), and (3) the BAO backup can be applied in Line 8.

Algorithm 5 LBVI(s_k): backward value iteration for incremental planning in R-max after state s_k becomes known
1: repeat
2:   appended(s) ← false for all s
3:   LargestResidual ← 0
4:   FIFOQ ← s_k
5:   appended(s_k) ← true
6:   while FIFOQ ≠ ∅ do
7:     remove the first element s from FIFOQ
8:     residual(s) ← Backup(s)
9:     if residual(s) > LargestResidual then
10:      LargestResidual ← residual(s)
11:    end if
12:    if residual(s) > ɛ then
13:      for all s̄ ∈ Pred(s) do
14:        if appended(s̄) = false then
15:          append s̄ to FIFOQ
16:          appended(s̄) ← true
17:        end if
18:      end for
19:    end if
20:  end while
21: until LargestResidual < ɛ

LEMMA 3. The backward value iteration algorithm specified in Algorithm 5 drives the Bellman error to 0 (with the required precision ɛ) when executed for a newly learned state, s_k, in R-max, initializing the value function using the value function of the previous planning step in which s_k was not known.

PROOF. Let E ⊆ S be the set of states from which state s_k cannot be reached using any policy or non-policy actions. Since state s_k is not reachable from any state in E and s_k is the only state whose dynamics changed, none of the states in E requires being updated; hence the Bellman error of all states in E is already 0. Let S_k be S \ E. Since s_k is the only state in S_k which changed its dynamics, s_k is the only state from which the modified value function should be back-propagated. Since the backward search process expands all predecessors of each state and starts from s_k, all states which reach state s_k (using both policy and non-policy actions) will be updated. Therefore the last condition to prove is that the predecessors of a state s should be visited only when residual(s) > ɛ. In the proof of Lemma 2, it has already been shown that if residual(s) ≤ ɛ for all s which can be reached from s', then residual(s') ≤ ɛ. The backward search from s_k in Algorithm 5 will not expand a state s' only when all successors s of s' have residual(s) ≤ ɛ (s' will be visited if residual(s) > ɛ for at least one s). This ends the proof that when the algorithm terminates, V(s) is within the required precision ɛ.

Algorithm 5 would normally back up states in Line 8 using the Bellman backup shown in Algorithm 1. The proof of Theorem 1 extends to Algorithm 5 as well, and the BAO procedure presented in Algorithm 3 for backing up states can also be used in Algorithm 5 by replacing, in Line 8, Backup(s) with BAO(s).
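A Python sketch of Algorithm 5 in the same style. Here pred(s) is assumed to return all predecessors of s (policy and non-policy), which is the loop correction of this section, and backup can be either Backup(s) of Algorithm 1 or BAO(s) of Algorithm 3.

from collections import deque

# Sketch of Algorithm 5 (LBVI). Each state is appended at most once per outer
# iteration; expanding ALL predecessors (not just policy predecessors) avoids
# the loop failure of Figure 1, and the residual check prunes the expansion.

def lbvi(s_k, backup, pred, eps=1e-4):
    while True:
        largest_residual = 0.0
        appended = {s_k}
        fifo = deque([s_k])
        while fifo:
            s = fifo.popleft()
            residual = backup(s)
            largest_residual = max(largest_residual, residual)
            if residual > eps:               # residual check (Line 12)
                for s_bar in pred(s):        # all predecessors (Line 13)
                    if s_bar not in appended:
                        appended.add(s_bar)
                        fifo.append(s_bar)
        if largest_residual < eps:
            break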

7 Empirical Evaluation

This section presents an empirical evaluation of the proposed approaches to incremental planning in R-max. Planning time is the measure that one wishes to minimize in R-max.

7.1 Algorithms

The first experiment evaluates the extension to the R-max algorithm introduced in Section 3. Specifically, the standard R-max with value iteration and action selection according to Equation 1 is compared against the modified R-max with our predicate known(s) and the action selection rule specified by Algorithm 2 instead of Equation 1. The goal of the main empirical evaluation is to check how different extensions to the standard planning algorithms improve the planning time, and for this reason all proposed extensions are also evaluated separately to see their individual influence. Therefore, the following configurations are evaluated in the empirical study of the paper:

VI: standard value iteration
VI-BAO: value iteration with BAO updates
PS: standard prioritized sweeping [12]
PS-PP: standard prioritized sweeping with policy predecessors
PS-BAO: standard prioritized sweeping with BAO updates
PS-PP-BAO: prioritized sweeping with policy predecessors and BAO updates
LBVI: backward value iteration which copes with loops (backward search over all predecessors)
LBVI-RES: LBVI with the residual check (Line 12 in Algorithm 5)
LBVI-BAO: LBVI with BAO updates
LBVI-RES-BAO: LBVI with the residual check and BAO updates

All algorithms were implemented in C++, and the goal was to provide the same amount of optimization to each algorithm. With this in mind, the crucial element of the prioritized sweeping algorithms was the priority queue. Since the operation of increasing the priority of an element in the priority queue is required (Line 11 in Algorithm 4), the trinomial heap was used because it supports this operation in constant time [20] (the heapq-based sketch after Algorithm 4 above emulates this operation with lazy updates instead). In the implementation of the queue used in LBVI, memory buffers were reused in order to have fast operations on the FIFO queue. As mentioned before, if not stated otherwise, all algorithms use the modified treatment of unknown states specified in Algorithm 2 in Section 3, which significantly reduces the number of times the planners are executed. In all experiments, the R-max parameter m was set to 5, and a fixed planning precision ɛ was used. Experiments on the maze domain present average values of 30 runs, and on the hand washing domain of 10 runs. The standard error of the mean (SEM) is shown both in the graphs and in the table.
7.2 Domains

The first domain is a version of the navigation maze task which can be found in the literature. In our implementation a scaled up version of such a maze from [1] is used, containing a substantially larger number of grid positions. The second domain is a simplified model of a situated prompting system that assists multiple persons with dementia in completing activities of daily living (ADL) more independently by giving appropriate prompts when needed. Such situations arise in shared spaces, e.g. a smart long-term care facility, or a smart home with multiple residents in need of assistance. Prompting for each ADL-resident combination can be done using a (PO)MDP [6], but the situation is more complex when multiple residents are present, as prompts can interfere across ADLs and between residents. The optimal solution (pursued here) is to model the complete joint space of all residents and ADLs, although approximate distributed solutions are also possible [5]. Our specific implementation follows the description in [14]. In our case, each MDP has 9 states and there are 3 prompts (do nothing or issue one of the two prompts specific to the current plan step) for each state. When prompting many clients at the same time, prompts for one client can influence other clients, whereas some prompts cannot be executed for more than one client at a time, e.g., audio prompts. For example, the domain with 4 clients already has a very large number of Q(s, a) entries in its Q-table. Other sizes can be calculated analogously.
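The joint model sizes implied by these numbers can be worked out directly. A small calculation, assuming the joint model is the full cross-product of per-client states and prompts (an assumption on our part; as noted above, some prompts cannot be issued to more than one client at a time, so the actual joint action set may be somewhat smaller):

# Joint hand washing model size for k clients, assuming a full cross-product:
# each client contributes 9 states and 3 prompts, giving 9**k joint states,
# 3**k joint actions, and 9**k * 3**k Q(s, a) entries.
for k in range(1, 6):
    states, acts = 9 ** k, 3 ** k
    print(f"{k} clients: {states} states, {acts} actions, {states * acts} Q-entries")

Under this assumption the 4-client instance has 531,441 Q(s, a) entries and the 5-client instance over 14 million, which is consistent with the very large Q-tables discussed in Section 7.3.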

7.3 Results

The first test was to evaluate the improvement from our modified notion of a state being known to the R-max algorithm, introduced in Section 3. As specified in the first paragraph of Section 7.1, two versions of the R-max algorithm were evaluated on the maze domain. These two versions of R-max were executed 30 times and the user time was compared. The version of the algorithm with our approach to distinguishing known and unknown states (from Section 3) was 2.3 times faster than the original version. The applicability of this extension does not depend on the planning algorithm, and all succeeding experiments use this modification to standard R-max. The next experiments evaluate the major contributions of this paper. Figures 2 and 3 show the evaluation of all 10 algorithms specified in Section 7.1 on the maze domain.

Figure 2: Planning time [ms] in the maze experiment.

Figure 3: Number of Q(s, a) backups in the maze experiment.

These algorithms determine how planning is done, and in principle the R-max algorithm should be able to explore in exactly the same way regardless of which planning algorithm is used. In order to verify this, the obtained results are compared with regard to the asymptotic convergence of the R-max algorithm, and the average cumulative reward as a function of the episode number is presented in Figure 4. This figure shows that the exploration was the same, and this can be seen as an empirical proof that all planning algorithms were returning the same exploration policy at their output.

Figure 4: The cumulative reward of the learning agent (average cumulative reward vs. number of episodes).

The BAO approach to updating states shows a substantial improvement in all three algorithms. In particular, value iteration, which is traditionally slower than, for example, prioritized sweeping, significantly reduced its planning time and its number of backups. This result is particularly significant not only for planning in R-max, but also for general value-based planning in MDPs whenever the initialization satisfies the requirement of Definition 1, which uniform optimistic initialization with V_max does. With our BAO approach, value iteration can be made much faster in a straightforward way. A closer analysis of the PS performance indicates that both policy predecessors and BAO updates yield improvements when applied individually, and a further improvement is gained when both techniques are used together. Overall, with our extensions, PS when used for incremental planning in R-max narrows its gap to BVI, which was shown in [4] to outperform PS in the standard case due to the overhead of maintaining the priority queue. The LBVI algorithm was evaluated with residual checking and with BAO updates. Here, these extensions yield improvements when applied individually, and additional gains are obtained when they are used together. The fastest planning algorithm in this experiment was LBVI with both the residual check and BAO updates. In our implementation, BVI is used with our modification which updates all predecessors instead of policy predecessors, since this was shown to be a straightforward solution to the loop problem of the standard BVI algorithm discussed in Section 6. This leads, however, to an increase in the number of state expansions, but our extensions proved to be sufficient to guarantee fast planning of the modified BVI algorithm. We acknowledge that there is another direction for improving the performance of BVI by still using policy predecessors; however, a solution has to be found for avoiding the loops which are reported in Section 6. This loop problem is detrimental for the R-max agent because the agent gets stuck in such loops during exploration. Results on the hand washing domain are in Table 1.

Table 1: Planning time [ms] for different sizes of the hand washing domain (1 to 5 clients), mean ± SEM, for VI, VI-BAO, PS, PS-PP, PS-BAO, PS-PP-BAO, LBVI, LBVI-RES, LBVI-BAO, and LBVI-RES-BAO; the numeric entries are not recoverable in this copy.

The ranking of each algorithm is the same as in the maze domain above. The significance of our improvements, BAO in particular, becomes more evident when the state and action spaces are bigger. It is worth noting that in the last two instances (4 and 5 clients), we were able to do off-line planning in R-max despite the very large numbers of state-action pairs in the Q-table! Experiments for which it was infeasible to wait for completion are indicated with '-'.

8 Related Work

The fact that planning is a bottleneck of PAC-MDP learning has recently been emphasized also in [21], where Monte Carlo on-line planning algorithms for PAC-MDP learning were proposed. These algorithms are interesting because their complexity does not depend on the number of states. This is achieved by sampling C times from each state (which limits the branching factor), and the horizon is additionally limited by the discount factor. In this way, it is sufficient to do Monte Carlo sampling only in a limited neighbourhood of a given state. The disadvantage of these algorithms is that they require the entire process to be repeated for each action selection. The algorithms which are proposed in this paper also make use of the fact that when a new state becomes known, mostly only its neighbourhood needs to be updated, which is reflected very well in our results. Our conjecture here is that the algorithms which we propose in this paper could be proven to have complexity dependent only on the close neighbourhood of the state which triggers the planning process. The rationale for this theoretical future work is indicated by our results in this paper. In [21], the authors report results with Monte Carlo planning on a flag domain with a 5×5 grid and 6 flags possibly appearing, where VI did not succeed. In the experiments of this paper, we report results on large domains where, even though VI was very inefficient or did not work at all, our extensions to VI-based planning proved to be successful. Such off-line algorithms require planning only once for each known state, and once planning is done, the policy can be used very fast, whereas Monte Carlo methods plan for every step. Our methods could further scale the off-line methods up when used with factored planners for MDPs [7].
We are additionally not aware of any PAC-MDP results with off-line planning on domains as large as those solved in this paper.

9 Conclusion

PAC-MDP algorithms are particularly efficient in terms of the number of samples which are needed by the learning agent in order to achieve near optimal performance. These algorithms, however, execute a time consuming planning step after each new state-action pair (or new state, according to our extension) becomes known to the agent. This fact is a serious limitation on broader application of these kinds of algorithms. This paper examined the planning problem in PAC-MDP learning, and sought ways of shortening the duration of the planning step. The contributions of this paper can be summarized as follows:

- The number of executions of the planner can be reduced when planning is triggered by a new state becoming known, as introduced in Section 3.
- A new update operator, BAO, was proposed which, instead of updating all actions of a given state once, updates only the best action of each state but continues this updating until convergence within the given state. This approach yields significant improvements in all evaluated algorithms, and in standard value iteration in particular. This approach is also applicable beyond planning in R-max, since optimistic initialization with V_max can be easily applied in general value-based MDP planning, and this contribution has the potential to bear an impact on the field.
- An extension to the prioritized sweeping algorithm was proposed which exploits properties of the planning problem in PAC-MDP learning. Specifically, only policy predecessors of each state are added to the priority queue, in contrast to adding all predecessors in the standard prioritized sweeping algorithm.
- It was shown that the original backward value iteration algorithm from the literature - which updates each state exactly once in each iteration - can fail on a broad class of MDP domains. The problem and one straightforward correction were shown. Then, our extensions to the corrected version of BVI, which are specific to planning in PAC-MDP learning, were proposed. Specifically, it was shown that a predecessor state does not have to be expanded in a given iteration when all its successors have their residuals smaller than the precision ɛ.
- Instances of the hand washing domain with large state spaces were solved, which extends the applicability of the PAC-MDP paradigm considerably beyond the existing PAC-MDP evaluations which can be found in the literature.
- All algorithms presented in the paper are equally applicable to goal-based as well as infinite horizon RL problems, because both in prioritized sweeping and in backward value iteration, planning starts from a specific state, and it does not matter whether the domain has goal states or not.

Theoretical justification for all contributions was provided, and all approaches were further evaluated empirically. Regardless of the more specific details of the empirical evaluation, a particularly substantial contribution of this work is that the standard value iteration algorithm can be made considerably faster by the straightforward application of the BAO update rule proposed in this paper.

10 Acknowledgements

This research was sponsored by the American Alzheimer's Association under an ETAC grant.

References
[1] J. Asmuth, M. L. Littman, and R. Zinkov. Potential-based shaping in model-based reinforcement learning. In Proceedings of AAAI, 2008.
[2] D. P. Bertsekas and J. N. Tsitsiklis. Neuro-Dynamic Programming. Athena Scientific, 1996.
[3] R. I. Brafman and M. Tennenholtz. R-max - a general polynomial time algorithm for near-optimal reinforcement learning. JMLR, 3:213-231, 2002.
[4] P. Dai and E. A. Hansen. Prioritizing Bellman backups without a priority queue. In Proceedings of ICAPS, 2007.
[5] J. Hoey and M. Grześ. Distributed control of situated assistance in large domains with many tasks. In Proceedings of ICAPS, 2011.
[6] J. Hoey, P. Poupart, A. von Bertoldi, T. Craig, C. Boutilier, and A. Mihailidis. Automated handwashing assistance for persons with dementia using video and a partially observable Markov decision process. Computer Vision and Image Understanding, 114(5), May 2010.
[7] J. Hoey, R. St-Aubin, A. Hu, and C. Boutilier. SPUDD: Stochastic planning using decision diagrams. In Proceedings of UAI, 1999.
[8] S. M. Kakade. On the Sample Complexity of Reinforcement Learning. PhD thesis, Gatsby Computational Neuroscience Unit, University College London, 2003.
[9] M. Kearns and S. Singh. Near-optimal reinforcement learning in polynomial time. Machine Learning, 49:209-232, 2002.
[10] J. Z. Kolter and A. Y. Ng. Near-Bayesian exploration in polynomial time. In Proceedings of ICML, 2009.
[11] L. Li and M. L. Littman. Prioritized sweeping converges to the optimal value function. Technical report, Rutgers University, 2008.
[12] A. W. Moore and C. G. Atkeson. Prioritized sweeping: Reinforcement learning with less data and less time. Machine Learning, 13:103-130, 1993.
[13] A. Y. Ng and M. Jordan. PEGASUS: A policy search method for large MDPs and POMDPs. In Proceedings of Uncertainty in Artificial Intelligence, 2000.
[14] P. Poupart, N. Vlassis, J. Hoey, and K. Regan. An analytic solution to discrete Bayesian reinforcement learning. In Proceedings of ICML, 2006.
[15] M. L. Puterman. Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons, Inc., New York, NY, USA, 1994.
[16] A. L. Strehl and M. L. Littman. An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74:1309-1331, 2008.
[17] R. S. Sutton. Integrated architectures for learning, planning, and reacting based on approximating dynamic programming. In Proceedings of ICML, 1990.
[18] R. S. Sutton and A. G. Barto. Reinforcement Learning: An Introduction. MIT Press, 1998.
[19] I. Szita and C. Szepesvári. Model-based reinforcement learning with nearly tight exploration complexity bounds. In Proceedings of ICML, 2010.
[20] T. Takaoka. Theory of trinomial heaps. In Proceedings of the International Conference on Computing and Combinatorics (COCOON), LNCS, 2000.
[21] T. J. Walsh, S. Goschin, and M. L. Littman. Integrating sample-based planning and model-based reinforcement learning. In Proceedings of AAAI, 2010.


Properties of Integrals, Indefinite Integrals. Goals: Definition of the Definite Integral Integral Calculations using Antiderivatives

Properties of Integrals, Indefinite Integrals. Goals: Definition of the Definite Integral Integral Calculations using Antiderivatives Block #6: Properties of Integrls, Indefinite Integrls Gols: Definition of the Definite Integrl Integrl Clcultions using Antiderivtives Properties of Integrls The Indefinite Integrl 1 Riemnn Sums - 1 Riemnn

More information

Genetic Programming. Outline. Evolutionary Strategies. Evolutionary strategies Genetic programming Summary

Genetic Programming. Outline. Evolutionary Strategies. Evolutionary strategies Genetic programming Summary Outline Genetic Progrmming Evolutionry strtegies Genetic progrmming Summry Bsed on the mteril provided y Professor Michel Negnevitsky Evolutionry Strtegies An pproch simulting nturl evolution ws proposed

More information

Efficient Planning. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction

Efficient Planning. R. S. Sutton and A. G. Barto: Reinforcement Learning: An Introduction Efficient Plnning 1 Tuesdy clss summry: Plnning: ny computtionl process tht uses model to crete or improve policy Dyn frmework: 2 Questions during clss Why use simulted experience? Cn t you directly compute

More information

New data structures to reduce data size and search time

New data structures to reduce data size and search time New dt structures to reduce dt size nd serch time Tsuneo Kuwbr Deprtment of Informtion Sciences, Fculty of Science, Kngw University, Hirtsuk-shi, Jpn FIT2018 1D-1, No2, pp1-4 Copyright (c)2018 by The Institute

More information

The Regulated and Riemann Integrals

The Regulated and Riemann Integrals Chpter 1 The Regulted nd Riemnn Integrls 1.1 Introduction We will consider severl different pproches to defining the definite integrl f(x) dx of function f(x). These definitions will ll ssign the sme vlue

More information

Vyacheslav Telnin. Search for New Numbers.

Vyacheslav Telnin. Search for New Numbers. Vycheslv Telnin Serch for New Numbers. 1 CHAPTER I 2 I.1 Introduction. In 1984, in the first issue for tht yer of the Science nd Life mgzine, I red the rticle "Non-Stndrd Anlysis" by V. Uspensky, in which

More information

Bernoulli Numbers Jeff Morton

Bernoulli Numbers Jeff Morton Bernoulli Numbers Jeff Morton. We re interested in the opertor e t k d k t k, which is to sy k tk. Applying this to some function f E to get e t f d k k tk d k f f + d k k tk dk f, we note tht since f

More information

THE EXISTENCE-UNIQUENESS THEOREM FOR FIRST-ORDER DIFFERENTIAL EQUATIONS.

THE EXISTENCE-UNIQUENESS THEOREM FOR FIRST-ORDER DIFFERENTIAL EQUATIONS. THE EXISTENCE-UNIQUENESS THEOREM FOR FIRST-ORDER DIFFERENTIAL EQUATIONS RADON ROSBOROUGH https://intuitiveexplntionscom/picrd-lindelof-theorem/ This document is proof of the existence-uniqueness theorem

More information

Cf. Linn Sennott, Stochastic Dynamic Programming and the Control of Queueing Systems, Wiley Series in Probability & Statistics, 1999.

Cf. Linn Sennott, Stochastic Dynamic Programming and the Control of Queueing Systems, Wiley Series in Probability & Statistics, 1999. Cf. Linn Sennott, Stochstic Dynmic Progrmming nd the Control of Queueing Systems, Wiley Series in Probbility & Sttistics, 1999. D.L.Bricker, 2001 Dept of Industril Engineering The University of Iow MDP

More information

Numerical Analysis: Trapezoidal and Simpson s Rule

Numerical Analysis: Trapezoidal and Simpson s Rule nd Simpson s Mthemticl question we re interested in numericlly nswering How to we evlute I = f (x) dx? Clculus tells us tht if F(x) is the ntiderivtive of function f (x) on the intervl [, b], then I =

More information

Duality # Second iteration for HW problem. Recall our LP example problem we have been working on, in equality form, is given below.

Duality # Second iteration for HW problem. Recall our LP example problem we have been working on, in equality form, is given below. Dulity #. Second itertion for HW problem Recll our LP emple problem we hve been working on, in equlity form, is given below.,,,, 8 m F which, when written in slightly different form, is 8 F Recll tht we

More information

Section 6.1 INTRO to LAPLACE TRANSFORMS

Section 6.1 INTRO to LAPLACE TRANSFORMS Section 6. INTRO to LAPLACE TRANSFORMS Key terms: Improper Integrl; diverge, converge A A f(t)dt lim f(t)dt Piecewise Continuous Function; jump discontinuity Function of Exponentil Order Lplce Trnsform

More information

DIRECT CURRENT CIRCUITS

DIRECT CURRENT CIRCUITS DRECT CURRENT CUTS ELECTRC POWER Consider the circuit shown in the Figure where bttery is connected to resistor R. A positive chrge dq will gin potentil energy s it moves from point to point b through

More information

Monte Carlo method in solving numerical integration and differential equation

Monte Carlo method in solving numerical integration and differential equation Monte Crlo method in solving numericl integrtion nd differentil eqution Ye Jin Chemistry Deprtment Duke University yj66@duke.edu Abstrct: Monte Crlo method is commonly used in rel physics problem. The

More information

On the Adders with Minimum Tests

On the Adders with Minimum Tests Proceeding of the 5th Ain Tet Sympoium (ATS '97) On the Adder with Minimum Tet Seiji Kjihr nd Tutomu So Dept. of Computer Science nd Electronic, Kyuhu Intitute of Technology Atrct Thi pper conider two

More information

State space systems analysis (continued) Stability. A. Definitions A system is said to be Asymptotically Stable (AS) when it satisfies

State space systems analysis (continued) Stability. A. Definitions A system is said to be Asymptotically Stable (AS) when it satisfies Stte spce systems nlysis (continued) Stbility A. Definitions A system is sid to be Asymptoticlly Stble (AS) when it stisfies ut () = 0, t > 0 lim xt () 0. t A system is AS if nd only if the impulse response

More information

Chapters 4 & 5 Integrals & Applications

Chapters 4 & 5 Integrals & Applications Contents Chpters 4 & 5 Integrls & Applictions Motivtion to Chpters 4 & 5 2 Chpter 4 3 Ares nd Distnces 3. VIDEO - Ares Under Functions............................................ 3.2 VIDEO - Applictions

More information

Riemann is the Mann! (But Lebesgue may besgue to differ.)

Riemann is the Mann! (But Lebesgue may besgue to differ.) Riemnn is the Mnn! (But Lebesgue my besgue to differ.) Leo Livshits My 2, 2008 1 For finite intervls in R We hve seen in clss tht every continuous function f : [, b] R hs the property tht for every ɛ >

More information

Taylor Polynomial Inequalities

Taylor Polynomial Inequalities Tylor Polynomil Inequlities Ben Glin September 17, 24 Abstrct There re instnces where we my wish to pproximte the vlue of complicted function round given point by constructing simpler function such s polynomil

More information

5.7 Improper Integrals

5.7 Improper Integrals 458 pplictions of definite integrls 5.7 Improper Integrls In Section 5.4, we computed the work required to lift pylod of mss m from the surfce of moon of mss nd rdius R to height H bove the surfce of the

More information

CS 275 Automata and Formal Language Theory

CS 275 Automata and Formal Language Theory CS 275 Automt nd Forml Lnguge Theory Course Notes Prt II: The Recognition Problem (II) Chpter II.6.: Push Down Automt Remrk: This mteril is no longer tught nd not directly exm relevnt Anton Setzer (Bsed

More information

20 MATHEMATICS POLYNOMIALS

20 MATHEMATICS POLYNOMIALS 0 MATHEMATICS POLYNOMIALS.1 Introduction In Clss IX, you hve studied polynomils in one vrible nd their degrees. Recll tht if p(x) is polynomil in x, the highest power of x in p(x) is clled the degree of

More information

ARITHMETIC OPERATIONS. The real numbers have the following properties: a b c ab ac

ARITHMETIC OPERATIONS. The real numbers have the following properties: a b c ab ac REVIEW OF ALGEBRA Here we review the bsic rules nd procedures of lgebr tht you need to know in order to be successful in clculus. ARITHMETIC OPERATIONS The rel numbers hve the following properties: b b

More information

Jim Lambers MAT 169 Fall Semester Lecture 4 Notes

Jim Lambers MAT 169 Fall Semester Lecture 4 Notes Jim Lmbers MAT 169 Fll Semester 2009-10 Lecture 4 Notes These notes correspond to Section 8.2 in the text. Series Wht is Series? An infinte series, usully referred to simply s series, is n sum of ll of

More information

Working with Powers and Exponents

Working with Powers and Exponents Working ith Poer nd Eponent Nme: September. 00 Repeted Multipliction Remember multipliction i y to rite repeted ddition. To y +++ e rite. Sometime multipliction i done over nd over nd over. To rite e rite.

More information

A REVIEW OF CALCULUS CONCEPTS FOR JDEP 384H. Thomas Shores Department of Mathematics University of Nebraska Spring 2007

A REVIEW OF CALCULUS CONCEPTS FOR JDEP 384H. Thomas Shores Department of Mathematics University of Nebraska Spring 2007 A REVIEW OF CALCULUS CONCEPTS FOR JDEP 384H Thoms Shores Deprtment of Mthemtics University of Nebrsk Spring 2007 Contents Rtes of Chnge nd Derivtives 1 Dierentils 4 Are nd Integrls 5 Multivrite Clculus

More information

different methods (left endpoint, right endpoint, midpoint, trapezoid, Simpson s).

different methods (left endpoint, right endpoint, midpoint, trapezoid, Simpson s). Mth 1A with Professor Stnkov Worksheet, Discussion #41; Wednesdy, 12/6/217 GSI nme: Roy Zho Problems 1. Write the integrl 3 dx s limit of Riemnn sums. Write it using 2 intervls using the 1 x different

More information

How can we approximate the area of a region in the plane? What is an interpretation of the area under the graph of a velocity function?

How can we approximate the area of a region in the plane? What is an interpretation of the area under the graph of a velocity function? Mth 125 Summry Here re some thoughts I ws hving while considering wht to put on the first midterm. The core of your studying should be the ssigned homework problems: mke sure you relly understnd those

More information

For the percentage of full time students at RCC the symbols would be:

For the percentage of full time students at RCC the symbols would be: Mth 17/171 Chpter 7- ypothesis Testing with One Smple This chpter is s simple s the previous one, except it is more interesting In this chpter we will test clims concerning the sme prmeters tht we worked

More information

Review of Calculus, cont d

Review of Calculus, cont d Jim Lmbers MAT 460 Fll Semester 2009-10 Lecture 3 Notes These notes correspond to Section 1.1 in the text. Review of Clculus, cont d Riemnn Sums nd the Definite Integrl There re mny cses in which some

More information

Math& 152 Section Integration by Parts

Math& 152 Section Integration by Parts Mth& 5 Section 7. - Integrtion by Prts Integrtion by prts is rule tht trnsforms the integrl of the product of two functions into other (idelly simpler) integrls. Recll from Clculus I tht given two differentible

More information

Chapter 0. What is the Lebesgue integral about?

Chapter 0. What is the Lebesgue integral about? Chpter 0. Wht is the Lebesgue integrl bout? The pln is to hve tutoril sheet ech week, most often on Fridy, (to be done during the clss) where you will try to get used to the ides introduced in the previous

More information

UNIFORM CONVERGENCE. Contents 1. Uniform Convergence 1 2. Properties of uniform convergence 3

UNIFORM CONVERGENCE. Contents 1. Uniform Convergence 1 2. Properties of uniform convergence 3 UNIFORM CONVERGENCE Contents 1. Uniform Convergence 1 2. Properties of uniform convergence 3 Suppose f n : Ω R or f n : Ω C is sequence of rel or complex functions, nd f n f s n in some sense. Furthermore,

More information

CMDA 4604: Intermediate Topics in Mathematical Modeling Lecture 19: Interpolation and Quadrature

CMDA 4604: Intermediate Topics in Mathematical Modeling Lecture 19: Interpolation and Quadrature CMDA 4604: Intermedite Topics in Mthemticl Modeling Lecture 19: Interpoltion nd Qudrture In this lecture we mke brief diversion into the res of interpoltion nd qudrture. Given function f C[, b], we sy

More information

SPACE VECTOR PULSE- WIDTH-MODULATED (SV-PWM) INVERTERS

SPACE VECTOR PULSE- WIDTH-MODULATED (SV-PWM) INVERTERS CHAPTER 7 SPACE VECTOR PULSE- WIDTH-MODULATED (SV-PWM) INVERTERS 7-1 INTRODUCTION In Chpter 5, we briefly icue current-regulte PWM inverter uing current-hyterei control, in which the witching frequency

More information

Convergence of Fourier Series and Fejer s Theorem. Lee Ricketson

Convergence of Fourier Series and Fejer s Theorem. Lee Ricketson Convergence of Fourier Series nd Fejer s Theorem Lee Ricketson My, 006 Abstrct This pper will ddress the Fourier Series of functions with rbitrry period. We will derive forms of the Dirichlet nd Fejer

More information

Acceptance Sampling by Attributes

Acceptance Sampling by Attributes Introduction Acceptnce Smpling by Attributes Acceptnce smpling is concerned with inspection nd decision mking regrding products. Three spects of smpling re importnt: o Involves rndom smpling of n entire

More information

M. A. Pathan, O. A. Daman LAPLACE TRANSFORMS OF THE LOGARITHMIC FUNCTIONS AND THEIR APPLICATIONS

M. A. Pathan, O. A. Daman LAPLACE TRANSFORMS OF THE LOGARITHMIC FUNCTIONS AND THEIR APPLICATIONS DEMONSTRATIO MATHEMATICA Vol. XLVI No 3 3 M. A. Pthn, O. A. Dmn LAPLACE TRANSFORMS OF THE LOGARITHMIC FUNCTIONS AND THEIR APPLICATIONS Abtrct. Thi pper del with theorem nd formul uing the technique of

More information

Riemann Sums and Riemann Integrals

Riemann Sums and Riemann Integrals Riemnn Sums nd Riemnn Integrls Jmes K. Peterson Deprtment of Biologicl Sciences nd Deprtment of Mthemticl Sciences Clemson University August 26, 203 Outline Riemnn Sums Riemnn Integrls Properties Abstrct

More information

Review of basic calculus

Review of basic calculus Review of bsic clculus This brief review reclls some of the most importnt concepts, definitions, nd theorems from bsic clculus. It is not intended to tech bsic clculus from scrtch. If ny of the items below

More information

EE Control Systems LECTURE 8

EE Control Systems LECTURE 8 Coyright F.L. Lewi 999 All right reerved Udted: Sundy, Ferury, 999 EE 44 - Control Sytem LECTURE 8 REALIZATION AND CANONICAL FORMS A liner time-invrint (LTI) ytem cn e rereented in mny wy, including: differentil

More information

DATA Search I 魏忠钰. 复旦大学大数据学院 School of Data Science, Fudan University. March 7 th, 2018

DATA Search I 魏忠钰. 复旦大学大数据学院 School of Data Science, Fudan University. March 7 th, 2018 DATA620006 魏忠钰 Serch I Mrch 7 th, 2018 Outline Serch Problems Uninformed Serch Depth-First Serch Bredth-First Serch Uniform-Cost Serch Rel world tsk - Pc-mn Serch problems A serch problem consists of:

More information

Convert the NFA into DFA

Convert the NFA into DFA Convert the NF into F For ech NF we cn find F ccepting the sme lnguge. The numer of sttes of the F could e exponentil in the numer of sttes of the NF, ut in prctice this worst cse occurs rrely. lgorithm:

More information

ELECTRICAL CIRCUITS 10. PART II BAND PASS BUTTERWORTH AND CHEBYSHEV

ELECTRICAL CIRCUITS 10. PART II BAND PASS BUTTERWORTH AND CHEBYSHEV 45 ELECTRICAL CIRCUITS 0. PART II BAND PASS BUTTERWRTH AND CHEBYSHEV Introduction Bnd p ctive filter re different enough from the low p nd high p ctive filter tht the ubject will be treted eprte prt. Thi

More information

Riemann Sums and Riemann Integrals

Riemann Sums and Riemann Integrals Riemnn Sums nd Riemnn Integrls Jmes K. Peterson Deprtment of Biologicl Sciences nd Deprtment of Mthemticl Sciences Clemson University August 26, 2013 Outline 1 Riemnn Sums 2 Riemnn Integrls 3 Properties

More information

Lecture 14: Quadrature

Lecture 14: Quadrature Lecture 14: Qudrture This lecture is concerned with the evlution of integrls fx)dx 1) over finite intervl [, b] The integrnd fx) is ssumed to be rel-vlues nd smooth The pproximtion of n integrl by numericl

More information

Before we can begin Ch. 3 on Radicals, we need to be familiar with perfect squares, cubes, etc. Try and do as many as you can without a calculator!!!

Before we can begin Ch. 3 on Radicals, we need to be familiar with perfect squares, cubes, etc. Try and do as many as you can without a calculator!!! Nme: Algebr II Honors Pre-Chpter Homework Before we cn begin Ch on Rdicls, we need to be fmilir with perfect squres, cubes, etc Try nd do s mny s you cn without clcultor!!! n The nth root of n n Be ble

More information

Lesson 1: Quadratic Equations

Lesson 1: Quadratic Equations Lesson 1: Qudrtic Equtions Qudrtic Eqution: The qudrtic eqution in form is. In this section, we will review 4 methods of qudrtic equtions, nd when it is most to use ech method. 1. 3.. 4. Method 1: Fctoring

More information

SUMMER KNOWHOW STUDY AND LEARNING CENTRE

SUMMER KNOWHOW STUDY AND LEARNING CENTRE SUMMER KNOWHOW STUDY AND LEARNING CENTRE Indices & Logrithms 2 Contents Indices.2 Frctionl Indices.4 Logrithms 6 Exponentil equtions. Simplifying Surds 13 Opertions on Surds..16 Scientific Nottion..18

More information

1 Probability Density Functions

1 Probability Density Functions Lis Yn CS 9 Continuous Distributions Lecture Notes #9 July 6, 28 Bsed on chpter by Chris Piech So fr, ll rndom vribles we hve seen hve been discrete. In ll the cses we hve seen in CS 9, this ment tht our

More information

1 The Riemann Integral

1 The Riemann Integral The Riemnn Integrl. An exmple leding to the notion of integrl (res) We know how to find (i.e. define) the re of rectngle (bse height), tringle ( (sum of res of tringles). But how do we find/define n re

More information