Reward Shaping for Model-Based Bayesian Reinforcement Learning

Hyeoneun Kim, Woosang Lim, Kanghoon Lee, Yung-Kyun Noh and Kee-Eung Kim
Department of Computer Science
Korea Advanced Institute of Science and Technology
Daejeon 305-701, Korea

Abstract

Bayesian reinforcement learning (BRL) provides a formal framework for optimal exploration-exploitation tradeoff in reinforcement learning. Unfortunately, it is generally intractable to find the Bayes-optimal behavior except for restricted cases. As a consequence, many BRL algorithms, model-based approaches in particular, rely on approximated models or real-time search methods. In this paper, we present potential-based shaping for improving the learning performance in model-based BRL. We propose a number of potential functions that are particularly well suited for BRL, and are domain-independent in the sense that they do not require any prior knowledge about the actual environment. By incorporating the potential function into real-time heuristic search, we show that we can significantly improve the learning performance in standard benchmark domains.

Introduction

A reinforcement learning (RL) agent interacts with an unknown environment to maximize the total reward. One of the unique challenges for the agent is the well-known exploration-exploitation tradeoff: without complete knowledge about the environment, the agent has to explore untried actions that may lead to better long-term rewards, but at the same time, it also needs to execute actions that are known to yield the largest rewards given the current knowledge about the environment. Bayesian reinforcement learning (BRL) provides a principled mathematical framework for computing Bayes-optimal actions that achieve an ideal balance between exploration and exploitation. Although the Bayes-optimal action has a succinct formulation in model-based BRL, where the agent attempts to build an explicit model of the environment for learning, it is computationally intractable except for restricted cases.
As a consequence, most model-based BRL algorithms rely on constructing approximated models that are tractable to solve, or real-time heuristic search methods that build search trees on-the-fly.

The main focus of this paper is on shaping rewards for improving the learning performance of BRL algorithms. In shaping, we can transform rewards by a shaping function in order to mitigate the sparsity and delay in rewards. A popular approach to shaping is potential-based shaping, where the domain knowledge is leveraged to encode the desirability of states into the potential function while the optimal policy remains unchanged.

We address two main challenges in using shaping for model-based BRL. First, instead of using a static or fixed shaping function defined from a-priori domain knowledge, can we make the shaping function adapt to experiences (i.e. observed transitions)? Second, when shaping is used with heuristic search, can we preserve the soundness and completeness of the search heuristic? We present a set of domain-independent shaping functions that (1) do not use a-priori knowledge of the true underlying environment, (2) adapt to experiences, and (3) preserve the soundness and completeness of the heuristic used for real-time search. We experimentally show that these shaping functions significantly improve learning performance in standard benchmark domains.

Copyright © 2015, Association for the Advancement of Artificial Intelligence. All rights reserved.

Background

BAMDP: A model-based BRL framework

We start with the model of the underlying environment, which is assumed to be a discrete-state Markov decision process (MDP) defined as a 5-tuple $M = \langle S, A, T, R, \gamma \rangle$, where $S$ is the set of environment states, $A$ is the set of agent actions, $T(s, a, s')$ is the transition probability $\Pr(s' \mid s, a)$ of making a transition to state $s'$ from state $s$ by executing action $a$, $R(s, a, s') \in [R_{\min}, R_{\max}]$ is the nonnegative reward (i.e., $R_{\min} \ge 0$) earned from the transition $\langle s, a, s' \rangle$, and $\gamma \in [0, 1)$ is the discount factor.
The agent that interacts with the environment executes actions by following a policy $\pi : S \to A$, which prescribes the action to be executed in each state. The performance criterion for $\pi$ we use in this paper is the expected discounted return $V^\pi$ for state $s_t$ at timestep $t$,

$$V^\pi(s_t) = E\Big[\sum_{\tau=t}^{\infty} \gamma^{\tau - t} R(s_\tau, \pi(s_\tau), s_{\tau+1})\Big],$$

defined as the state value function. The Bellman optimality equation states that the state value function for the optimal policy $\pi^*$ satisfies

$$V^*(s) = \max_a \sum_{s'} T(s, a, s')\,[R(s, a, s') + \gamma V^*(s')] \tag{1}$$

which can be computed by classical dynamic programming algorithms such as value iteration or policy iteration when the MDP model $M$ is completely known. The optimal policy can also be recovered from the optimal action value function

$$Q^*(s, a) = \sum_{s'} T(s, a, s')\,[R(s, a, s') + \gamma V^*(s')] \tag{2}$$

since $\pi^*(s) = \arg\max_a Q^*(s, a)$.

On the other hand, when the model $M$ is not known, computing an optimal policy becomes an RL problem. The uncertainty in the model can be represented using a probability distribution $b$, also known as the belief over the models. Throughout the paper, we shall assume that the reward function is known to the agent, while the transition probability is unknown. Since the transition for each state-action pair is essentially a multinomial distribution, a straightforward way to represent the belief $b$ is to define the parameter $\theta_a^{s,s'}$ for each unknown transition probability $T(s, a, s')$ and use the product of independent Dirichlet distributions as the conjugate prior. When the transition $\langle s, a, s' \rangle$ is observed, the belief $b$ is updated by the Bayes rule

$$b_a^{s,s'}(\theta) \propto \theta_a^{s,s'}\, b(\theta) = \theta_a^{s,s'} \prod_{\hat{s},\hat{a}} \mathrm{Dir}\big(\theta_{\hat{a}}^{\hat{s}}(\cdot);\, n_{\hat{a}}^{\hat{s}}(\cdot)\big) \propto \prod_{\hat{s},\hat{a}} \mathrm{Dir}\big(\theta_{\hat{a}}^{\hat{s}}(\cdot);\, n_{\hat{a}}^{\hat{s}}(\cdot) + \delta_{\hat{s},\hat{a},\hat{s}'}(s, a, s')\big),$$

which is equivalent to incrementing the single parameter that corresponds to the observed transition, i.e., $n_a^s(s') \leftarrow n_a^s(s') + 1$.

Bayes-Adaptive MDP (BAMDP) provides a succinct planning formulation of the Bayes-optimal action under uncertainty in the model (Duff 2002). By augmenting the state with the belief $b$, the Bayes-optimal policy of the BAMDP should satisfy the optimality equation

$$V^*(s, b) = \max_a \sum_{s'} T_b(s, a, s')\,\big(R(s, a, s') + \gamma V^*(s', b_a^{s,s'})\big),$$

where $T_b(s, a, s') = E[\Pr(s' \mid s, a, b)] = n_a^s(s') / \sum_{s''} n_a^s(s'')$.

Unfortunately, computing an optimal policy of the BAMDP is known to be computationally intractable in general cases. One of the popular approaches to mitigate the intractability result is the real-time search, such as Bayes-Adaptive Monte-Carlo Planning (BAMCP) (Guez, Silver, and Dayan 2012), Bayesian Optimistic Planning (BOP) (Fonteneau, Busoniu, and Munos 2013), and Bayesian Forward Search Sparse Sampling (BFS3) (Asmuth and Littman 2011).
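The Dirichlet count update and the mean transition model $T_b$ described above can be illustrated with a minimal sketch. The class name `DirichletBelief`, its methods, and the symmetric prior count `alpha` are hypothetical choices made for this illustration, not the authors' implementation.

```python
class DirichletBelief:
    """Product of independent Dirichlet distributions, one per (s, a) pair.

    Illustrative sketch only: states and actions are integer indices, and
    `alpha` is an assumed symmetric prior count for every transition.
    """

    def __init__(self, n_states, n_actions, alpha):
        self.counts = {
            (s, a): [float(alpha)] * n_states
            for s in range(n_states)
            for a in range(n_actions)
        }

    def update(self, s, a, s_next):
        # Bayes rule for the conjugate prior: n_a^s(s') <- n_a^s(s') + 1
        self.counts[(s, a)][s_next] += 1.0

    def mean_transition(self, s, a, s_next):
        # T_b(s, a, s') = n_a^s(s') / sum over s'' of n_a^s(s'')
        row = self.counts[(s, a)]
        return row[s_next] / sum(row)
```

Only the row of counts for the observed state-action pair changes after an update; every other row of the belief is left untouched, which is what makes the belief cheap to carry along a search path.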
Real-time search approaches leverage the computation time allowed between consecutive action executions to construct a lookahead search tree to find the best action. On the other hand, there are other approaches that do not involve lookahead search. They often use an approximation of BAMDP that is tractable to solve, while guaranteeing a sufficient approximation to the Bayes-optimal policy with high probability. PAC-BAMDP algorithms such as Best of Sampled Set (BOSS) (Asmuth et al. 2009), Smart BOSS (Castro and Precup 2010), Bayesian Exploration Bonus (BEB) (Kolter and Ng 2009), and Bayesian Optimistic Local Transitions (BOLT) (Araya-López, Thomas, and Buffet 2012) belong to these approaches.

Finally, we should remark that BAMDP is essentially a hybrid-state Partially Observable MDP (POMDP) where the (hidden) state is the pair $\langle s, \theta \rangle$, the observation is the environment state, while the action and the reward remain the same as in the environment MDP (Poupart et al. 2006). In this formulation, the transition probability is defined as $T(\langle s, \theta \rangle, a, \langle s', \theta' \rangle) = \theta_a^{s,s'}\, \delta_\theta(\theta')$ and the observation probability is defined as $Z(\langle s', \theta' \rangle, a, o) = \Pr(o \mid s', a) = \delta_{s'}(o)$. In fact, the real-time search algorithms above can be viewed as extensions of POMDP planning algorithms to the hybrid-state POMDP.

Real-Time Heuristic Search

Anytime Error Minimizing Search (AEMS2) (Ross et al. 2008) is one of the successful online POMDP solvers. The algorithm performs the best-first search using the heuristic based on the error bound of the value function for a given amount of time. Upon timeout, the algorithm selects and executes the best action at the root node of the search tree. The search process is essentially a real-time version of the AO* search algorithm (Nilsson 1982). Specifically, the search tree is represented as an AND-OR tree, in which AND-nodes correspond to belief transitions and OR-nodes correspond to action selections. AEMS2 maintains the upper and lower bounds on the optimal value at each node and uses them for the search heuristic.
In each iteration, a fringe node of the tree is chosen by navigating down the search tree using the following heuristic: the action with the maximum upper bound value at the OR-node and the belief with the maximum error contribution to the root node at the AND-node. One of the strengths of AEMS2 is its completeness and ε-optimality of the search (Ross, Pineau, and Chaib-draa 2007): the gap between the upper and lower bounds asymptotically converges to 0 as the size of the search tree grows larger.

Given that BRL can be seen as a BAMDP (i.e. hybrid-state POMDP) planning problem, AEMS2 can be readily extended to BRL. In fact, various online search-based POMDP solvers (Ross et al. 2008) can be readily extended to BRL. As an example, it is interesting to note that BOP (Fonteneau, Busoniu, and Munos 2013) is coincidentally a special case of real-time AO* on the hybrid-state POMDP with naive initial bounds, i.e. constant upper and lower bounds $R_{\max}/(1-\gamma)$ and $R_{\min}/(1-\gamma)$.

Potential-based Shaping

One of the most challenging aspects in RL is the sparsity and delay in reward signals. Suppose that the agent has to navigate in a large environment to reach the goal location. If the reward is zero everywhere except at the goal, the initial exploration of the agent would be nothing better than a random walk with a very small chance of reaching the goal. On the other hand, if we change the reward function so that we give a small positive reward whenever the agent makes progress towards the goal, the initial exploration can be made very efficient.

Potential-based shaping (Ng, Harada, and Russell 1999) introduces an additive bonus (or penalty) to rewards to make the learning potentially more efficient, while guaranteeing invariance in the optimal behavior. This is achieved by defining a potential function on states $\Phi(s)$ and transforming rewards to $R_\Phi(s, a, s') = R(s, a, s') + F_\Phi(s, s')$, where $F_\Phi(s, s') = \gamma\Phi(s') - \Phi(s)$ is the shaping function. If the rewards are set to $R_\Phi$ instead of $R$, the optimal value functions satisfy

$$V^*_\Phi(s) = V^*(s) - \Phi(s)$$
$$Q^*_\Phi(s, a) = Q^*(s, a) - \Phi(s)$$

which can be shown using Eq. (1) and Eq. (2). The second equation above gives invariance in the optimal policy, because $\Phi(s)$ does not depend on the action. In the ideal case where we set $\Phi(s) = V^*(s)$, the agent that myopically acts solely based on the immediate shaped reward follows an optimal policy, since

$$\arg\max_a \sum_{s'} T(s, a, s')\, R_\Phi(s, a, s') = \arg\max_a \sum_{s'} T(s, a, s')\,[R(s, a, s') + \gamma V^*(s') - V^*(s)]$$
$$= \arg\max_a \sum_{s'} T(s, a, s')\,[R(s, a, s') + \gamma V^*(s')] = \arg\max_a Q^*(s, a) = \pi^*(s).$$

In recent work by Eck et al. (2013), shaping was shown to produce more efficient solutions for POMDP planning using potential functions that encode domain-specific knowledge. It was also used with classical RL algorithms such as SARSA and RMAX, showing promising results (Asmuth, Littman, and Zinkov 2008; Grześ and Kudenko 2010).

Heuristic Search with Shaping for BRL

Shaping for BAMDP

Although AEMS2 and other online POMDP solvers naturally extend to BAMDP, one of the main factors that critically impact the performance is the sparsity and delay in rewards. Combined with the uncertainty in the transition model, we end up with a very loose initial bound of the value function. Ideally, we would like to have a fast yet tight bound of the Bayes-optimal value function, but such bounds are hard to obtain, just as in POMDPs. We address this issue by shaping rewards to focus the search effort on more promising outcomes using inferred knowledge about the true underlying environment.

We can naturally extend shaping in MDPs to BAMDPs by noticing that BAMDPs are essentially MDPs with beliefs as states. Specifically, define a nonnegative potential function for reward shaping to be of the form $\Phi : S \times B \to [0, +\infty)$ where $S$ is the set of environment states and $B$ is the set of all possible beliefs.
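The invariance property of potential-based shaping in ordinary MDPs can be checked numerically. The sketch below is an illustration under stated assumptions: the three-state MDP built by `chain_mdp`, the helper names, and the iteration count are all hypothetical and not taken from the paper. It runs value iteration with the original and the shaped rewards so that the two value functions and greedy policies can be compared.

```python
def q_values(T, R, gamma, V):
    """Q(s, a) = sum over s' of T(s, a, s') * [R(s, a, s') + gamma * V(s')]."""
    n_s, n_a = len(T), len(T[0])
    return [[sum(T[s][a][t] * (R[s][a][t] + gamma * V[t]) for t in range(n_s))
             for a in range(n_a)] for s in range(n_s)]


def value_iteration(T, R, gamma, iters=1000):
    """Plain synchronous value iteration, started from V = 0."""
    V = [0.0] * len(T)
    for _ in range(iters):
        V = [max(row) for row in q_values(T, R, gamma, V)]
    return V


def shape(R, phi, gamma):
    """R_phi(s, a, s') = R(s, a, s') + gamma * phi(s') - phi(s)."""
    n_s, n_a = len(R), len(R[0])
    return [[[R[s][a][t] + gamma * phi[t] - phi[s] for t in range(n_s)]
             for a in range(n_a)] for s in range(n_s)]


def greedy(Q):
    """Index of the maximizing action in every state."""
    return [max(range(len(row)), key=row.__getitem__) for row in Q]


def chain_mdp():
    """Hypothetical example: action 0 advances 0 -> 1 -> 2 -> 0 and earns
    reward 1 on the step 2 -> 0; action 1 stays in place with reward 0."""
    n_s, n_a = 3, 2
    T = [[[0.0] * n_s for _ in range(n_a)] for _ in range(n_s)]
    R = [[[0.0] * n_s for _ in range(n_a)] for _ in range(n_s)]
    for s in range(n_s):
        T[s][0][(s + 1) % n_s] = 1.0
        T[s][1][s] = 1.0
    R[2][0][0] = 1.0
    return T, R
```

With any potential `phi`, for example `[0.0, 1.0, 2.0]`, the shaped optimal values should differ from the original ones by exactly `phi[s]` in each state, and `greedy` should return the same action in every state for both reward functions.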
With this potential, rewards are shaped by

$$R_\Phi(\langle s, b \rangle, a, \langle s', b_a^{s,s'} \rangle) = R(\langle s, b \rangle, a, \langle s', b_a^{s,s'} \rangle) + F_\Phi(\langle s, b \rangle, \langle s', b_a^{s,s'} \rangle) = R(s, a, s') + F_\Phi(\langle s, b \rangle, \langle s', b_a^{s,s'} \rangle), \tag{3}$$

where $F_\Phi(\langle s, b \rangle, \langle s', b_a^{s,s'} \rangle) = \gamma\Phi(s', b_a^{s,s'}) - \Phi(s, b)$, and we shall use $R_\Phi$ as the reward function.

The resulting real-time heuristic search algorithm mostly follows the structure of an online POMDP solver. For the sake of comprehensiveness, we provide the pseudo-code of AEMS2 extended to BAMDPs.

Algorithm 1 The real-time heuristic search BRL algorithm
Input: ⟨s_0, b_0⟩ : initial state s_0 and model prior b_0
Static: ⟨s, b⟩ : the current state of the agent
        T : the current AND-OR search tree
  ⟨s, b⟩ ← ⟨s_0, b_0⟩
  Initialize T to single root node ⟨s, b⟩
  while not ExecutionTerminated() do
    while not SearchTimedOut() do
      ⟨s*, b*⟩ ← ChooseNextNodeToExpand()
      Expand(⟨s*, b*⟩)
      UpdateAncestor(⟨s*, b*⟩)
    end while
    Execute best action a* for ⟨s, b⟩
    Observe new state s'
    Update tree T so that ⟨s', b_{a*}^{s,s'}⟩ is the new root
  end while

Algorithm 1 is the main loop that controls the search in each timestep. In each iteration of the search, one of the fringe nodes is chosen and expanded in the best-first manner. The objective of the expansion is to close the gap in the bounds, and hence we select the fringe node that is likely to maximally reduce the gap at the root node if expanded. Specifically, we consider fringe nodes that are reachable by actions with the maximum upper bound value at each intermediate node

$$\Pr(a \mid s, b) = \begin{cases} 1 & \text{if } a = \arg\max_{a' \in A} U_T(\langle s, b \rangle, a') \\ 0 & \text{otherwise} \end{cases}$$

and compute the probability of reaching a fringe node $\langle s_d, b_d \rangle$ at depth $d$ by taking the path $h$ from the root node $\langle s_0, b_0 \rangle$ to the fringe node, i.e. $h(\langle s_d, b_d \rangle) = \langle s_0, b_0, a_0, s_1, b_1, a_1, \ldots, a_{d-1}, s_d, b_d \rangle$,

$$\Pr(h(\langle s_d, b_d \rangle)) = \prod_{i=0}^{d-1} \Pr(s_{i+1} \mid s_i, b_i, a_i)\, \Pr(a_i \mid s_i, b_i) = \prod_{i=0}^{d-1} T_{b_i}(s_i, a_i, s_{i+1})\, \Pr(a_i \mid s_i, b_i).$$

Using the probability of path $h$, we compute the error contribution of each fringe node $\langle s_d, b_d \rangle$ on the root node

$$e(\langle s_d, b_d \rangle) = \gamma^d\, \Pr(h(\langle s_d, b_d \rangle))\,[U_T(s_d, b_d) - L_T(s_d, b_d)]$$

and choose the fringe node with the maximum error contribution.
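The fringe-selection rule above can be sketched as follows. This is a simplified illustration, not the authors' code: the `Fringe` record, its fields, and the exhaustive `max` over a list of candidates are assumptions made for clarity, whereas a practical search would track the best fringe node incrementally inside the tree.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Fringe:
    """Hypothetical record for one fringe node of the search tree."""
    name: str
    # One entry per step on the path from the root:
    # (T_b(s_i, a_i, s_{i+1}), True if a_i has the maximum upper bound at its node)
    steps: List[Tuple[float, bool]]
    upper: float  # U_T at the fringe node
    lower: float  # L_T at the fringe node


def path_probability(steps):
    """Pr(h): product of transition probabilities times the 0/1 action indicator."""
    p = 1.0
    for trans_prob, is_max_upper_action in steps:
        p *= trans_prob * (1.0 if is_max_upper_action else 0.0)
    return p


def error_contribution(node, gamma):
    """e = gamma^d * Pr(h) * (U - L), where d is the depth of the fringe node."""
    depth = len(node.steps)
    return gamma ** depth * path_probability(node.steps) * (node.upper - node.lower)


def choose_fringe(nodes, gamma):
    """Return the fringe node with the largest error contribution to the root."""
    return max(nodes, key=lambda n: error_contribution(n, gamma))
```

A node reached through any action that does not have the maximum upper bound gets path probability 0, so it is never selected regardless of how wide its own bound gap is.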
The best fringe node can be identified efficiently without exhaustive enumeration in each iteration. We then expand the chosen fringe node and update its lower and upper bounds using Algorithm 2. These updated bounds are then propagated up to the root node using Algorithm 3. Note that we use the shaped reward $R_\Phi$ throughout the algorithm, instead of the actual reward $R$.

Algorithm 2 Expand(⟨s, b⟩)
Input: ⟨s, b⟩ : an OR-node chosen to expand
Static: U : an upper bound on V*
        L : a lower bound on V*
        T : the current AND-OR search tree
  for a ∈ A do
    for s' ∈ S do
      Create child node ⟨s', b_a^{s,s'}⟩
      U_T(s', b_a^{s,s'}) ← U⁰_Φ(s', b_a^{s,s'})
      L_T(s', b_a^{s,s'}) ← L⁰_Φ(s', b_a^{s,s'})
    end for
    U_T(⟨s, b⟩, a) ← Σ_{s' ∈ S} T_b(s, a, s') [R_Φ(⟨s, b⟩, a, ⟨s', b_a^{s,s'}⟩) + γ U_T(s', b_a^{s,s'})]
    L_T(⟨s, b⟩, a) ← Σ_{s' ∈ S} T_b(s, a, s') [R_Φ(⟨s, b⟩, a, ⟨s', b_a^{s,s'}⟩) + γ L_T(s', b_a^{s,s'})]
  end for
  U_T(s, b) ← min(U_T(s, b), max_a U_T(⟨s, b⟩, a))
  L_T(s, b) ← max(L_T(s, b), max_a L_T(⟨s, b⟩, a))

Algorithm 3 UpdateAncestor(⟨s', b'⟩)
Input: ⟨s', b'⟩ : an OR-node chosen to update its ancestors
Static: U : an upper bound
        L : a lower bound
        T : the current AND-OR search tree
  while ⟨s', b'⟩ is not root of T do
    Set ⟨s, b⟩ to be the parent of ⟨s', b'⟩ and a to be the corresponding action
    U_T(⟨s, b⟩, a) ← Σ_{s' ∈ S} T_b(s, a, s') [R_Φ(⟨s, b⟩, a, ⟨s', b_a^{s,s'}⟩) + γ U_T(s', b_a^{s,s'})]
    L_T(⟨s, b⟩, a) ← Σ_{s' ∈ S} T_b(s, a, s') [R_Φ(⟨s, b⟩, a, ⟨s', b_a^{s,s'}⟩) + γ L_T(s', b_a^{s,s'})]
    U_T(s, b) ← min(U_T(s, b), max_a U_T(⟨s, b⟩, a))
    L_T(s, b) ← max(L_T(s, b), max_a L_T(⟨s, b⟩, a))
    ⟨s', b'⟩ ← ⟨s, b⟩
  end while

We remark that the bound initialization is a subtle but crucial step for search. Given the initial upper and lower bounds $U^0(s, b)$ and $L^0(s, b)$ using the original reward function, it may seem natural to use initial bounds $U^0_\Phi(s, b) = U^0(s, b) - \Phi(s, b)$ and $L^0_\Phi(s, b) = L^0(s, b) - \Phi(s, b)$, based on the relationship in Eq. (3). However, we can show that this is not a good idea:

Theorem 1. If the bounds using the shaped reward are initialized as $U^0_\Phi(s, b) = U^0(s, b) - \Phi(s, b)$ and $L^0_\Phi(s, b) = L^0(s, b) - \Phi(s, b)$, the algorithm will expand the same fringe node as using the original reward.

Proof. Given the path $h$ from the root node to the fringe node $\langle s, b \rangle$ at depth $d$, the error contribution of the fringe node under the shaped rewards is

$$e_\Phi(\langle s, b \rangle) = \gamma^d\, \Pr(h(\langle s, b \rangle))\,[U^0_\Phi(s, b) - L^0_\Phi(s, b)] = \gamma^d\, \Pr(h(\langle s, b \rangle))\,[U^0(s, b) - L^0(s, b)],$$

since $\Phi(s, b)$ cancels out.
In addition, the upper bound value at its parent node $\langle s', b' \rangle$ under the shaped rewards is updated by

$$U_T(\langle s', b' \rangle, a') = \sum_{s} T_{b'}(s', a', s)\, U_\Phi(s', b', a', s),$$

where

$$U_\Phi(s', b', a', s) = \big[R_\Phi(\langle s', b' \rangle, a', \langle s, (b')_{a'}^{s',s} \rangle) + \gamma U_T(s, (b')_{a'}^{s',s})\big] = \big[R(s', a', s) + \gamma U^0(s, (b')_{a'}^{s',s}) - \Phi(s', b')\big],$$

which makes the maximum upper bound action $\arg\max_{a'} U_T(\langle s', b' \rangle, a')$ at the parent node unchanged compared to the one using the original rewards, since $\Phi(s', b')$ is invariant over actions. By induction, the maximum upper bound actions at all intermediate nodes on the path $h$ do not change, which makes $\Pr(h(\langle s, b \rangle))$ unchanged. Thus, we have $e_\Phi(\langle s, b \rangle) = e(\langle s, b \rangle)$, which implies that the algorithm will expand the same fringe node as with the original rewards.

This theorem states that translating both bounds by $\Phi$ essentially nullifies the effect of shaping, making the algorithm build the exact same search tree as with the original reward. In our implementation, we used

$$U^0_\Phi(s, b) = U^0(s, b) - \Phi_{\min}$$
$$L^0_\Phi(s, b) = L^0(s, b) - \Phi(s, b),$$

where $\Phi_{\min} = \min_{s,b} \Phi(s, b)$. This is to make the reward shaping affect the construction of the search tree (since the AEMS2 heuristic chooses the action with the maximum upper bound value), while not affecting the final choice of the action for execution (since the algorithm chooses the action with the maximum lower bound value). We can show the latter by the following corollary:

Corollary 1. Given a search tree $T$, the best lower bound action computed from $T$ using the shaped rewards is identical to the one using the original rewards.

Proof. In the proof of Theorem 1, simply replace the upper bound $U$ by the lower bound $L$.

As a final remark, since we have not modified the search heuristic, the completeness and ε-optimality guarantee of the search in (Ross, Pineau, and Chaib-draa 2007) is preserved.

Potential Functions

It would seem that we need to define the potential function a-priori. However, due to the nature of real-time search algorithms, the number of belief states at which the potential function is evaluated is bounded by the number of nodes expanded in the search tree.
Hence, we can perform evaluations on-the-fly, even incorporating experiences observed from the environment. One thing that we should keep in mind is that if we decide to use some value for the potential function at $\langle s, b \rangle$, we should use the same value when the search later creates a node with the same $\langle s, b \rangle$. This is to have a consistent specification of the potential function.
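One simple way to meet this consistency requirement is to memoize the potential on state-belief pairs. The sketch below is an assumption-laden illustration: the class `CachedPotential`, its method names, and the use of a hashable `belief_key` (for instance a tuple of Dirichlet counts) are hypothetical.

```python
class CachedPotential:
    """Memoizes a potential so each (state, belief) pair keeps its first value.

    Illustrative sketch: `potential_fn(state, belief_key)` is any callable
    returning a nonnegative float, and `belief_key` must be hashable.
    """

    def __init__(self, potential_fn):
        self._fn = potential_fn
        self._cache = {}

    def set_potential(self, potential_fn):
        # A recomputed potential applies only to pairs not seen before.
        self._fn = potential_fn

    def __call__(self, state, belief_key):
        key = (state, belief_key)
        if key not in self._cache:
            self._cache[key] = self._fn(state, belief_key)
        return self._cache[key]
```

After `set_potential` swaps in a newer estimate, a pair that was already evaluated still returns its original value, while novel pairs are evaluated with the new function.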

Value Functions of MDP Samples

An ideal potential function would be the optimal value function of the actual underlying environment model. However, since this is not available, we could use a set of models sampled from the current belief. In fact, a number of BRL algorithms took the sampling approach albeit for different purposes: MC-BRL (Wang et al. 2012) and BA-POMDP (Ross, Chaib-draa, and Pineau 2007) used the set of sampled models for approximate belief tracking (i.e. particle filtering), and BOSS (Asmuth et al. 2009) used the samples for computing a probabilistic upper bound of the Bayes-optimal value.

In our approach, we periodically sample a set of $K$ MDPs (i.e. transition probabilities) $\{\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_K\}$ from the current belief $b$ and solve each MDP to obtain optimal value functions $\{V^*_{\hat\theta_1}, V^*_{\hat\theta_2}, \ldots, V^*_{\hat\theta_K}\}$. The potential function is defined as

$$\Phi_{KMDP}(s, b) = \sum_{k=1}^{K} w_k(b)\, V^*_{\hat\theta_k}(s),$$

where the weight $w_k$ is set to the posterior probability $\Pr(k \mid b)$ of the $k$-th MDP. The weights are initially set to $1/K$ and updated by the Bayes rule upon transition $\langle s, a, s' \rangle$

$$w_k \leftarrow \eta\, w_k\, (\hat\theta_k)_a^{s,s'},$$

where $\eta$ is the normalization constant.

The MDP resampling period controls the tradeoff between reducing the overall running time and obtaining a more accurate value function estimate. In addition, since solving MDPs can take a considerable amount of time, we perform resampling only at the root node and reuse the samples in node expansions, as in sample-based tree search algorithms (Silver and Veness 2010).

Note that the MDP resampling and the weight update lead to a change in the potential function. In order to make the optimal policy invariant, we need the potential function to be consistent by making the same $\langle s, b \rangle$ yield the same potential value. This is achieved by a hashtable so that the updated potential function is only evaluated for novel state-belief pairs that are encountered during search.

Value Function of Optimistic MDP

Inspired by PAC-BAMDP algorithms, we can periodically build an optimistic MDP approximation of the BAMDP based on the current belief, and use its optimal value function as the potential function.
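For the sampled-MDP potential $\Phi_{KMDP}$ defined earlier in this section, the weight update and the weighted sum can be sketched as follows. The function names, the nested-list layout `theta[s][a][s_next]`, and the fallback used when no sampled model assigns positive probability to the observed transition are assumptions of this sketch, not details given in the paper.

```python
def update_weights(weights, thetas, s, a, s_next):
    """Bayes update w_k <- eta * w_k * theta_k[s][a][s_next] over K sampled MDPs."""
    unnormalized = [w * theta[s][a][s_next] for w, theta in zip(weights, thetas)]
    total = sum(unnormalized)
    if total == 0.0:
        # Assumption of this sketch: keep the old weights when no sampled
        # model explains the observed transition.
        return list(weights)
    return [u / total for u in unnormalized]


def phi_kmdp(weights, values, s):
    """Phi_KMDP(s, b) = sum over k of w_k(b) * V_k(s), where values[k] is the
    optimal value function of the k-th sampled MDP."""
    return sum(w * V[s] for w, V in zip(weights, values))
```

The optimal value functions in `values` are assumed to have been computed beforehand, for example by value iteration on each sampled transition model, so that evaluating the potential during search is a cheap weighted sum.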
As an example of such an optimistic MDP, we can use the one constructed in BEB (Kolter and Ng 2009), of which the optimal value function is defined by

$$V^*_{BEB}(s, b) = \max_a \sum_{s'} T_b(s, a, s') \Big[ R(s, a, s') + \frac{\beta}{1 + n_b(s, a)} + \gamma V^*_{BEB}(s', b) \Big],$$

where $\beta$ is the exploration bonus parameter, and $n_b(s, a)$ is the number of visits to the state-action pair $\langle s, a \rangle$ in the current belief $b$. The value function $V^*_{BEB}$ can be efficiently computed by the standard value iteration. We denote this potential function as $\Phi_{BEB}$. Our method can be seen as the multi-step lookahead search extension to BEB.

As in $\Phi_{KMDP}$, we recompute $\Phi_{BEB}$ only at the root node at regular intervals, not necessarily at every timestep. We also use a hashtable so that the updated potential function is only evaluated for novel state-belief pairs.

[Figure 1: (a) CHAIN (left) and (b) DOUBLE-LOOP (right)]

[Figure 2: (a) GRID5 (left) and (b) MAZE (right)]

Experiments

We conducted experiments on the following five benchmark BRL domains:

CHAIN (Strens 2000) consists of a linear chain of 5 states and 2 actions {a, b}, as shown in Figure 1 (a). The rewards are shown as edge labels for each transition. The transitions are stochastic: the agent slips and performs the other action with probability 0.2.

DOUBLE-LOOP (Dearden, Friedman, and Russell 1998) consists of 9 states and 2 actions, as shown in Figure 1 (b). The transitions are deterministic. The optimal behavior is to complete the traversal of the left loop with a reward of 2 by executing action b all the time, while the right loop is easier to complete, yielding a reward of 1.

GRID5 (Guez, Silver, and Dayan 2012) consists of 5 × 5 states with no reward anywhere except at the goal location (G), which is at the opposite corner to the reset location (R) (Figure 2 (a)). Once the agent reaches G, it is sent back to R with a reward of 1. There are 4 actions for moving in each cardinal direction, of which the transitions are stochastic: the agent moves in random directions with probability 0.2.

GRID10 (Guez, Silver, and Dayan 2012) is a larger version of GRID5 with 10 × 10 states.
MAZE (Dearden, Friedman, and Russell 1998) consists of 264 states and 4 actions, where the agent has to collect flags at certain locations (F) and arrive at the goal location (G), as shown in Figure 2 (b). Once the agent reaches G, it is sent back to the reset location (R) with the reward equal to the number of flags (F) collected. The stochasticity in transition is the same as in GRID5.

In Table 1, we compare the total undiscounted rewards gathered from the following 5 algorithms: BAMCP (Guez, Silver, and Dayan 2012) is one of the most efficient algorithms that uses Monte-Carlo Tree Search; BOP (Fonteneau, Busoniu, and Munos 2013) is a real-time heuristic search algorithm that uses the naive upper and lower bounds $R_{\max}/(1-\gamma)$ and $R_{\min}/(1-\gamma)$; NS is the real-time heuristic search algorithm presented in the previous section without shaping (No Shaping), but using more sophisticated bounds calculated by optimistic and pessimistic value iteration (Givan, Leach, and Dean 2000); $\Phi_{KMDP}$ and $\Phi_{BEB}$ are the same search algorithms using the corresponding potential functions for shaping. In addition, we experimented with two different priors: flat Dirichlet multinomial (FDM) with $\alpha = 1/|S|$ (Guez, Silver, and Dayan 2012) and sparse factored Dirichlet multinomial (SFDM) (Friedman and Singer 1999). Each algorithm was given the CPU time of 0.1s per timestep by adjusting the number of node expansions.

[Figure 3: Total rewards vs. search CPU times in larger domains (GRID5, GRID10, and MAZE). Panels: GRID5, GRID10, and MAZE under the FDM and SFDM priors; curves: Φ_BEB, Φ_KMDP, NS, BOP, and BAMCP.]

Prior  Method   CHAIN        DOUBLE-LOOP  GRID5         GRID10        MAZE
FDM    Φ_BEB    … (±75.66)   … (±8.52)    … (±.96)      … (±.51)      … (±1.53)
FDM    Φ_KMDP   … (±73.62)   … (±8.34)    … (±.97)      … (±.59)      … (±2.54)
FDM    NS       … (±75.4)    … (±8.58)    … (±1.2)      21.4 (±.5)    … (±2.61)
FDM    BOP      … (±76.56)   … (±8.54)    … (±.82)      9.4 (±.28)    … (±.87)
FDM    BAMCP    … (±25.1)    … (±1.17)    … (±.5)       5.14 (±.41)   … (±.84)
SFDM   Φ_BEB    … (±74.53)   … (±7.53)    73.5 (±1.8)   … (±.6)       … (±4.2)
SFDM   Φ_KMDP   … (±72.11)   … (±7.58)    … (±1.2)      … (±.59)      … (±4.78)
SFDM   NS       … (±72.23)   … (±7.45)    … (±1.17)     … (±.55)      … (±4.12)
SFDM   BOP      … (±72.45)   … (±7.52)    … (±1.13)     1.36 (±.25)   … (±.86)
SFDM   BAMCP    … (±27.14)   … (±8.1)     7.97 (±.79)   … (±.42)      … (±3.44)

Table 1: The averages of total undiscounted rewards and their 95% confidence intervals. For all domains except GRID10 and MAZE, the results are from 5 runs of 1 timesteps. For GRID10 and MAZE, we used 2 timesteps and 2 timesteps, respectively. We set γ = 0.95 for all domains.
This time limit was sufficient for all the algorithms to reach their highest levels of performance, except in larger domains GRID10 and MAZE. The parameter settings for each algorithm were as follows: for BAMCP, we followed the exact settings in (Guez, Silver, and Dayan 2012), which were c = 3 and ε = 0.5 for the exploration constants in the tree search and the rollout simulation, and the maximum depth of the search tree was set to 15 in all domains except GRID10 and MAZE, in which the depth was increased to 5; for $\Phi_{KMDP}$, we set the number of MDP samples K = 1; for $\Phi_{BEB}$, β was chosen from {.5, 1, 1, 2, 3, 5} as the value that performed the best; the recomputation of the potential function was set to happen 1 times during a run. In addition, upon noticing that the algorithms with shaping performed well even without the hashtable, we decided to remove the bookkeeping in the experiments for further speedup. In fact, the hit rate of the cache was less than 1% in all experiments.

Real-time heuristic search with reward shaping yielded the best results in all domains except DOUBLE-LOOP, and showed significant improvement in learning performance on larger domains such as GRID5, GRID10, and MAZE. In CHAIN, which was the smallest domain, shaping had almost no effect on search since good actions could be readily found with small search trees. Finally, Figure 3 shows the improvement in the total reward as we increase the search time for the three larger domains. It clearly shows the effectiveness of shaping for real-time heuristic search.

It is interesting to note the singularity in the DOUBLE-LOOP results. BAMCP performed far better than other algorithms in this particular domain. In order to further analyze the results, we conducted another set of experiments on a bandit problem where we can obtain Bayes-optimal policies with high accuracy via computing Gittins indices (Gittins 1979). In this experiment, we consider a two-armed Bernoulli bandit, where the two arms have 0.1 and 0.9 success probabilities. In Figure 4, we compare the total rewards obtained by the Bayes-optimal policy, our real-time heuristic search algorithm with reward shaping, and BAMCP. The averages were obtained from 1 runs of 3 time steps. Again, as for BAMCP, the exploration constant c was chosen from {0.5, 1, 1.5, 2, 2.5, 3} as the value that performed the best.

[Figure 4: Performance comparison of real-time heuristic search with reward shaping and BAMCP against the Bayes-optimal policy on a two-armed Bernoulli bandit with µ_1 = 0.1 and µ_2 = 0.9 and γ = 0.99. Both panels plot total rewards against the initial prior.]

The top graph in Figure 4 shows the results with three different initial priors, (α_1, β_1, α_2, β_2) = (5, 5, 5, 5), (25, 25, 25, 25), and (5, 5, 5, 5). Note that while our search algorithm performed very close to the Bayes-optimal policy, BAMCP was quite susceptible to the prior and was not able to overcome the strong incorrect prior. The bottom graph shows the same comparison with a different set of incorrect initial priors: (9, 1, 1, 9), (45, 5, 5, 45), and (9, 1, 1, 9). In this case, the priors had less effect on the BAMCP performance, and in fact BAMCP was performing better than the Bayes-optimal policy. Our search algorithm, on the other hand, matched the performance of the Bayes-optimal policy in two out of three cases. We obtained these results because the behavior of BAMCP arises from the combination of two parameters: the prior and the exploration constant.
Hence, by changing the exploration constant, the prior can be ignored and the algorithm can be tuned to the specific problem at hand. We believe that this is what is happening behind the DOUBLE-LOOP experiments.

Conclusion and Future Work

In this paper, we presented shaping for significantly improving the learning performance of a model-based BRL method. Our main insight comes from the BAMDP formulation of BRL, which is a hybrid-state POMDP. We showed how shaping can be used for real-time AO* search as an efficient BRL method. Shaping mitigates the sparsity and delay of rewards, helping the search algorithm to find good actions without the necessity to build large search trees for long-horizon planning.

We proposed two approaches to defining the potential function for shaping, which do not depend on a-priori knowledge about the true underlying environment; they only leverage the structural regularity in the POMDP that arises from BAMDP. They are also adaptive in the sense that they use past experiences from the underlying model to estimate the Bayes-optimal value.

Extending our approach to larger or continuous state spaces, and integrating shaping with Monte-Carlo Tree Search algorithms are promising directions for future work.

Acknowledgments

This work was partly supported by the ICT R&D program of MSIP/IITP [ , Basic Software Research in Human-level Lifelong Machine Learning (Machine Learning Center)], National Research Foundation of Korea (Grant# ), and Defense Acquisition Program Administration and Agency for Defense Development under the contract UD1422PD, Korea.

References

Araya-López, M.; Thomas, V.; and Buffet, O. 2012. Near-optimal BRL using optimistic local transition. In Proceedings of the 29th International Conference on Machine Learning.
Asmuth, J., and Littman, M. 2011. Learning is planning: Near Bayes-optimal reinforcement learning via Monte-Carlo tree search. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence.
Asmuth, J.; Li, L.; Littman, M. L.; Nouri, A.; and Wingate, D. 2009. A Bayesian sampling approach to exploration in reinforcement learning. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence.
Asmuth, J.; Littman, M. L.; and Zinkov, R. 2008. Potential-based shaping in model-based reinforcement learning. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence.
Castro, P. S., and Precup, D. 2010. Smarter sampling in model-based Bayesian reinforcement learning. In Machine Learning and Knowledge Discovery in Databases. Springer.
Dearden, R.; Friedman, N.; and Russell, S. 1998. Bayesian Q-learning. In Proceedings of the National Conference on Artificial Intelligence.
Duff, M. O. 2002. Optimal Learning: Computational Procedures for Bayes-Adaptive Markov Decision Processes. Ph.D. Dissertation, University of Massachusetts Amherst.
Eck, A.; Soh, L.-K.; Devlin, S.; and Kudenko, D. 2013. Potential-based reward shaping for POMDPs. In Proceedings of the 12th International Conference on Autonomous Agents and Multiagent Systems.
Fonteneau, R.; Busoniu, L.; and Munos, R. 2013. Optimistic planning for belief-augmented Markov decision processes. In IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning.
Friedman, N., and Singer, Y. 1999. Efficient Bayesian parameter estimation in large discrete domains. In Advances in Neural Information Processing Systems.
Gittins, J. C. 1979. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society. Series B (Methodological) 41(2).
Givan, R.; Leach, S.; and Dean, T. 2000. Bounded-parameter Markov decision processes. Artificial Intelligence 122.
Grześ, M., and Kudenko, D. 2010. Online learning of shaping rewards in reinforcement learning. Neural Networks.
Guez, A.; Silver, D.; and Dayan, P. 2012. Efficient Bayes-adaptive reinforcement learning using sample-based search. In Advances in Neural Information Processing Systems.
Kolter, J. Z., and Ng, A. Y. 2009. Near-Bayesian exploration in polynomial time. In Proceedings of the 26th International Conference on Machine Learning.
Ng, A. Y.; Harada, D.; and Russell, S. 1999. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th International Conference on Machine Learning.
Nilsson, N. J. 1982. Principles of Artificial Intelligence. Symbolic Computation / Artificial Intelligence. Springer.
Poupart, P.; Vlassis, N.; Hoey, J.; and Regan, K. 2006. An analytic solution to discrete Bayesian reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning.
Ross, S.; Pineau, J.; Paquet, S.; and Chaib-draa, B. 2008. Online planning algorithms for POMDPs. Journal of Artificial Intelligence Research 32.
Ross, S.; Chaib-draa, B.; and Pineau, J. 2007. Bayes-adaptive POMDPs. In Advances in Neural Information Processing Systems.
Ross, S.; Pineau, J.; and Chaib-draa, B. 2007. Theoretical analysis of heuristic search methods for online POMDPs. In Advances in Neural Information Processing Systems.
Silver, D., and Veness, J. 2010. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems.
Strens, M. 2000. A Bayesian framework for reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning.
Wang, Y.; Won, K. S.; Hsu, D.; and Lee, W. S. 2012. Monte Carlo Bayesian reinforcement learning. In Proceedings of the 29th International Conference on Machine Learning.
