Reward Shaping for Model-Based Bayesian Reinforcement Learning


Hyeoneun Kim, Woosang Lim, Kanghoon Lee, Yung-Kyun Noh and Kee-Eung Kim
Department of Computer Science, Korea Advanced Institute of Science and Technology, Daejeon 305-701, Korea

Abstract

Bayesian reinforcement learning (BRL) provides a formal framework for the optimal exploration-exploitation tradeoff in reinforcement learning. Unfortunately, it is generally intractable to find the Bayes-optimal behavior except for restricted cases. As a consequence, many BRL algorithms, model-based approaches in particular, rely on approximated models or real-time search methods. In this paper, we present potential-based shaping for improving the learning performance in model-based BRL. We propose a number of potential functions that are particularly well suited for BRL, and are domain-independent in the sense that they do not require any prior knowledge about the actual environment. By incorporating the potential function into real-time heuristic search, we show that we can significantly improve the learning performance in standard benchmark domains.

Introduction

A reinforcement learning (RL) agent interacts with an unknown environment to maximize the total reward. One of the unique challenges for the agent is the well-known exploration-exploitation tradeoff: without complete knowledge about the environment, the agent has to explore untried actions that may lead to better long-term rewards, but at the same time, it also needs to execute actions that are known to yield the largest rewards given the current knowledge about the environment. Bayesian reinforcement learning (BRL) provides a principled mathematical framework for computing Bayes-optimal actions that achieve an ideal balance between exploration and exploitation. Although the Bayes-optimal action has a succinct formulation in model-based BRL, where the agent attempts to build an explicit model of the environment for learning, it is computationally intractable except for restricted cases.
As a consequence, most model-based BRL algorithms rely on constructing approximated models that are tractable to solve, or on real-time heuristic search methods that build search trees on-the-fly.

The main focus of this paper is on shaping rewards for improving the learning performance of BRL algorithms. In shaping, we can transform rewards by a shaping function in order to mitigate the sparsity and delay in rewards. A popular approach to shaping is potential-based shaping, where the domain knowledge is leveraged to encode the desirability of states into the potential function while the optimal policy remains unchanged.

We address two main challenges in using shaping for model-based BRL. First, instead of using a static or fixed shaping function defined from a-priori domain knowledge, can we make the shaping function adapt to experiences (i.e. observed transitions)? Second, when shaping is used with heuristic search, can we preserve the soundness and completeness of the search heuristic?

We present a set of domain-independent shaping functions that (1) do not use a-priori knowledge of the true underlying environment, (2) adapt to experiences, and (3) preserve the soundness and completeness of the heuristic used for real-time search. We experimentally show that these shaping functions significantly improve learning performance in standard benchmark domains.

Copyright © 2015, Association for the Advancement of Artificial Intelligence. All rights reserved.

Background

BAMDP: A model-based BRL framework

We start with the model of the underlying environment, which is assumed to be a discrete-state Markov decision process (MDP) defined as a 5-tuple $M = \langle S, A, T, R, \gamma \rangle$, where $S$ is the set of environment states, $A$ is the set of agent actions, $T(s, a, s')$ is the transition probability $\Pr(s' \mid s, a)$ of making a transition to state $s'$ from state $s$ by executing action $a$, $R(s, a, s') \in [R_{\min}, R_{\max}]$ is the nonnegative reward (i.e., $R_{\min} \ge 0$) earned from the transition $\langle s, a, s' \rangle$, and $\gamma \in [0, 1)$ is the discount factor.
The agent that interacts with the environment executes actions by following a policy $\pi : S \to A$, which prescribes the action to be executed in each state. The performance criterion for $\pi$ we use in this paper is the expected discounted return $V^\pi$ for state $s_t$ at timestep $t$,

$$V^\pi(s_t) = E\Big[\sum_{\tau=t}^{\infty} \gamma^{\tau - t} R(s_\tau, \pi(s_\tau), s_{\tau+1})\Big],$$

defined as the state value function. The Bellman optimality equation states that the state value function for the optimal policy $\pi^*$ satisfies

$$V^*(s) = \max_a \sum_{s'} T(s, a, s')\,[R(s, a, s') + \gamma V^*(s')] \qquad (1)$$

which can be computed by classical dynamic programming algorithms such as value iteration or policy iteration when the MDP model $M$ is completely known. The optimal policy can also be recovered from the optimal action value function

$$Q^*(s, a) = \sum_{s'} T(s, a, s')\,[R(s, a, s') + \gamma V^*(s')] \qquad (2)$$

since $\pi^*(s) = \arg\max_a Q^*(s, a)$.

On the other hand, when the model $M$ is not known, computing an optimal policy becomes an RL problem. The uncertainty in the model can be represented using a probability distribution $b$, also known as the belief over the models. Throughout the paper, we shall assume that the reward function is known to the agent, while the transition probability is unknown. Since the transition for each state-action pair is essentially a multinomial distribution, a straightforward way to represent the belief $b$ is to define the parameter $\theta_{s,a}^{s'}$ for each unknown transition probability $T(s, a, s')$ and use the product of independent Dirichlet distributions as the conjugate prior. When the transition $\langle s, a, s' \rangle$ is observed, the belief $b$ is updated by the Bayes rule

$$b_{s,a}^{s'}(\theta) \propto \theta_{s,a}^{s'}\, b(\theta) = \theta_{s,a}^{s'} \prod_{\hat{s},\hat{a}} \mathrm{Dir}\big(\theta_{\hat{s}\hat{a}}(\cdot);\, n_{\hat{s}\hat{a}}(\cdot)\big) \propto \prod_{\hat{s},\hat{a}} \mathrm{Dir}\big(\theta_{\hat{s}\hat{a}}(\cdot);\, n_{\hat{s}\hat{a}}(\cdot) + \delta_{\hat{s},\hat{a},\hat{s}'}(s, a, s')\big),$$

which is equivalent to incrementing the single parameter that corresponds to the observed transition, i.e., $n_{s,a}(s') \leftarrow n_{s,a}(s') + 1$.

The Bayes-Adaptive MDP (BAMDP) provides a succinct planning formulation of the Bayes-optimal action under uncertainty in the model (Duff 2002). By augmenting the state with the belief $b$, the Bayes-optimal policy of the BAMDP should satisfy the optimality equation

$$V^*(s, b) = \max_a \sum_{s'} T_b(s, a, s')\,\big(R(s, a, s') + \gamma V^*(s', b_{s,a}^{s'})\big),$$

where $T_b(s, a, s') = E[\Pr(s' \mid s, a, b)] = n_{s,a}(s') / \sum_{s''} n_{s,a}(s'')$.

Unfortunately, computing an optimal policy of the BAMDP is known to be computationally intractable in general. One of the popular approaches to mitigating this intractability is real-time search, such as Bayes-Adaptive Monte-Carlo Planning (BAMCP) (Guez, Silver, and Dayan 2012), Bayesian Optimistic Planning (BOP) (Fonteneau, Busoniu, and Munos 2013), and Bayesian Forward Search Sparse Sampling (BFS3) (Asmuth and Littman 2011).
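As a minimal sketch of the Dirichlet belief described above, the following Python class keeps one count $n_{s,a}(s')$ per transition, applies the Bayes update by incrementing a single count, and returns the posterior-mean transition $T_b(s, a, s')$. The class name `DirichletBelief`, the symmetric `prior_count` parameter, and the toy state and action labels are assumptions made for illustration; they do not come from the paper.

```python
class DirichletBelief:
    """Belief over transition models stored as independent Dirichlet counts n[s,a](s')."""

    def __init__(self, states, actions, prior_count=1.0):
        self.states = list(states)
        # prior_count is an assumed symmetric Dirichlet prior, used only for illustration.
        self.counts = {(s, a): {s2: prior_count for s2 in self.states}
                       for s in self.states for a in actions}

    def update(self, s, a, s_next):
        """Bayes update for an observed transition <s, a, s'>: n[s,a](s') += 1."""
        self.counts[(s, a)][s_next] += 1.0

    def expected_transition(self, s, a, s_next):
        """Posterior mean T_b(s,a,s') = n[s,a](s') / sum over s'' of n[s,a](s'')."""
        row = self.counts[(s, a)]
        return row[s_next] / sum(row.values())


belief = DirichletBelief(states=[0, 1, 2], actions=["a", "b"])
before = belief.expected_transition(0, "a", 1)  # 1/3 under the flat prior
belief.update(0, "a", 1)                        # observe <0, a, 1> once
after = belief.expected_transition(0, "a", 1)   # (1 + 1) / (3 + 1) = 0.5
```

Because only one count changes per observation, the successor belief $b_{s,a}^{s'}$ of a search node can be represented as the parent's counts plus a single increment.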
These approaches leverage the computation time allowed between consecutive action executions to construct a lookahead search tree to find the best action. On the other hand, there are other approaches that do not involve lookahead search. They often use an approximation of the BAMDP that is tractable to solve, while guaranteeing a sufficient approximation to the Bayes-optimal policy with high probability. PAC-BAMDP algorithms such as Best of Sampled Set (BOSS) (Asmuth et al. 2009), Smart BOSS (Castro and Precup 2010), Bayesian Exploration Bonus (BEB) (Kolter and Ng 2009), and Bayesian Optimistic Local Transitions (BOLT) (Araya-López, Thomas, and Buffet 2012) belong to these approaches.

Finally, we should remark that the BAMDP is essentially a hybrid-state Partially Observable MDP (POMDP) where the (hidden) state is the pair $\langle s, \theta \rangle$, the observation is the environment state, while the action and the reward remain the same as in the environment MDP (Poupart et al. 2006). In this formulation, the transition probability is defined as $T(\langle s, \theta \rangle, a, \langle s', \theta' \rangle) = \theta_{s,a}^{s'}\, \delta_\theta(\theta')$ and the observation probability is defined as $Z(\langle s', \theta' \rangle, a, o) = \Pr(o \mid s', a) = \delta_{s'}(o)$. In fact, the real-time search algorithms above can be viewed as extensions of POMDP planning algorithms to the hybrid-state POMDP.

Real-Time Heuristic Search

Anytime Error Minimizing Search (AEMS2) (Ross et al. 2008) is one of the successful online POMDP solvers. The algorithm performs best-first search for a given amount of time, using a heuristic based on the error bound of the value function. Upon timeout, the algorithm selects and executes the best action at the root node of the search tree. The search process is essentially a real-time version of the AO* search algorithm (Nilsson 1982). Specifically, the search tree is represented as an AND-OR tree, in which AND-nodes correspond to belief transitions and OR-nodes correspond to action selections. AEMS2 maintains the upper and lower bounds on the optimal value at each node and uses them for the search heuristic.
In each iteration, a fringe node of the tree is chosen by navigating down the search tree using the following heuristic: the action with the maximum upper bound value at the OR-node and the belief with the maximum error contribution to the root node at the AND-node. One of the strengths of AEMS2 is the completeness and ε-optimality of the search (Ross, Pineau, and Chaib-draa 2007): the gap between the upper and lower bounds asymptotically converges to 0 as the size of the search tree grows larger.

Given that BRL can be seen as a BAMDP (i.e. hybrid-state POMDP) planning problem, AEMS2 can be readily extended to BRL. In fact, various online search-based POMDP solvers (Ross et al. 2008) can be readily extended to BRL. As an example, it is interesting to note that BOP (Fonteneau, Busoniu, and Munos 2013) is coincidentally a special case of real-time AO* on the hybrid-state POMDP with naive initial bounds, i.e. constant upper and lower bounds $R_{\max}/(1-\gamma)$ and $R_{\min}/(1-\gamma)$.

Potential-based Shaping

One of the most challenging aspects of RL is the sparsity and delay in reward signals. Suppose that the agent has to navigate in a large environment to reach the goal location. If the reward is zero everywhere except at the goal, the initial exploration of the agent would be nothing better than a random walk with a very small chance of reaching the goal. On the other hand, if we change the reward function so that we give a small positive reward whenever the agent makes progress towards the goal, the initial exploration can be made very efficient. Potential-based shaping (Ng, Harada, and Russell 1999) introduces an additive bonus (or penalty) to rewards to make

the learning potentially more efficient, while guaranteeing invariance in the optimal behavior. This is achieved by defining a potential function on states $\Phi(s)$ and transforming rewards to

$$R_\Phi(s, a, s') = R(s, a, s') + F_\Phi(s, s'),$$

where $F_\Phi(s, s') = \gamma \Phi(s') - \Phi(s)$ is the shaping function. If the rewards are set to $R_\Phi$ instead of $R$, the optimal value functions satisfy

$$V^*_\Phi(s) = V^*(s) - \Phi(s)$$
$$Q^*_\Phi(s, a) = Q^*(s, a) - \Phi(s)$$

which can be shown using Eq. (1) and Eq. (2). The second equation above gives invariance in the optimal policy, since $\Phi(s)$ does not depend on the action.

In the ideal case where we set $\Phi(s) = V^*(s)$, the agent that myopically acts solely based on the immediate shaped reward follows an optimal policy, since

$$\arg\max_a \sum_{s'} T(s, a, s') R_\Phi(s, a, s') = \arg\max_a \sum_{s'} T(s, a, s')\,[R(s, a, s') + \gamma V^*(s') - V^*(s)]$$
$$= \arg\max_a \sum_{s'} T(s, a, s')\,[R(s, a, s') + \gamma V^*(s')] = \arg\max_a Q^*(s, a) = \pi^*(s).$$

In recent work (Eck et al. 2013), shaping was shown to produce more efficient solutions for POMDP planning using potential functions that encode domain-specific knowledge. It was also used with classical RL algorithms such as SARSA and RMAX, showing promising results (Asmuth, Littman, and Zinkov 2008; Grześ and Kudenko 2010).

Heuristic Search with Shaping for BRL

Shaping for BAMDP

Although AEMS2 and other online POMDP solvers naturally extend to BAMDPs, one of the main factors that critically impact the performance is the sparsity and delay in rewards. Combined with the uncertainty in the transition model, we end up with a very loose initial bound on the value function. Ideally, we would like to have fast yet tight bounds on the Bayes-optimal value function, but they are hard to obtain, just as in POMDPs. We address this issue by shaping rewards to focus the search effort on more promising outcomes using inferred knowledge about the true underlying environment.

We can naturally extend shaping in MDPs to BAMDPs by noticing that BAMDPs are essentially MDPs with beliefs as states. Specifically, define a nonnegative potential function for reward shaping to be of the form $\Phi : S \times B \to [0, +\infty)$ where $S$ is the set of environment states and $B$ is the set of all possible beliefs.
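The invariance property $V^*_\Phi(s) = V^*(s) - \Phi(s)$ can be checked numerically on a small known MDP. The sketch below is only an illustration: the 3-state chain, its rewards, the potential values in `phi`, and the helper names `value_iteration`, `shape`, and `greedy` are all made up for this example and are not taken from the paper.

```python
def value_iteration(T, R, gamma, iters=1000):
    """Return (V, Q) for transition probabilities T[s][a][s2] and rewards R[s][a][s2]."""
    n_s, n_a = len(T), len(T[0])
    V = [0.0] * n_s
    Q = [[0.0] * n_a for _ in range(n_s)]
    for _ in range(iters):
        Q = [[sum(T[s][a][s2] * (R[s][a][s2] + gamma * V[s2]) for s2 in range(n_s))
              for a in range(n_a)] for s in range(n_s)]
        V = [max(row) for row in Q]
    return V, Q


def shape(R, phi, gamma):
    """Shaped reward R_phi(s,a,s2) = R(s,a,s2) + gamma * phi(s2) - phi(s)."""
    n_s, n_a = len(R), len(R[0])
    return [[[R[s][a][s2] + gamma * phi[s2] - phi[s] for s2 in range(n_s)]
             for a in range(n_a)] for s in range(n_s)]


def greedy(Q):
    """Greedy action index in every state."""
    return [max(range(len(row)), key=row.__getitem__) for row in Q]


# Hypothetical 3-state chain: action 0 tries to move right, action 1 stays put.
gamma = 0.9
T = [[[0.2, 0.8, 0.0], [1.0, 0.0, 0.0]],
     [[0.0, 0.2, 0.8], [0.0, 1.0, 0.0]],
     [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]]
R = [[[1.0 if (a == 0 and s2 == 2) else (0.1 if a == 1 else 0.0) for s2 in range(3)]
      for a in range(2)] for s in range(3)]
phi = [0.0, 2.0, 5.0]  # arbitrary potential, chosen only for the demonstration

V, Q = value_iteration(T, R, gamma)
V_phi, Q_phi = value_iteration(T, shape(R, phi, gamma), gamma)
```

After both runs, `V_phi[s]` agrees with `V[s] - phi[s]` up to numerical error and `greedy(Q)` equals `greedy(Q_phi)`, which is the policy invariance stated above.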
Hence, rewards are shaped by

$$R_\Phi(\langle s, b \rangle, a, \langle s', b_{s,a}^{s'} \rangle) = R(\langle s, b \rangle, a, \langle s', b_{s,a}^{s'} \rangle) + F_\Phi(\langle s, b \rangle, \langle s', b_{s,a}^{s'} \rangle) = R(s, a, s') + F_\Phi(\langle s, b \rangle, \langle s', b_{s,a}^{s'} \rangle), \qquad (3)$$

where $F_\Phi(\langle s, b \rangle, \langle s', b_{s,a}^{s'} \rangle) = \gamma \Phi(s', b_{s,a}^{s'}) - \Phi(s, b)$, and we shall use $R_\Phi$ as the reward function.

The resulting real-time heuristic search algorithm mostly follows the structure of an online POMDP solver. For the sake of comprehensiveness, we provide the pseudo-code of AEMS2 extended to BAMDPs.

Algorithm 1 The real-time heuristic search BRL algorithm
Input: $\langle s_0, b_0 \rangle$: initial state $s_0$ and model prior $b_0$
Static: $\langle s, b \rangle$: the current state of the agent
        $T$: the current AND-OR search tree
  $\langle s, b \rangle \leftarrow \langle s_0, b_0 \rangle$
  Initialize $T$ to a single root node $\langle s, b \rangle$
  while not ExecutionTerminated() do
    while not SearchTimedOut() do
      $\langle s^*, b^* \rangle \leftarrow$ ChooseNextNodeToExpand()
      Expand($\langle s^*, b^* \rangle$)
      UpdateAncestor($\langle s^*, b^* \rangle$)
    end while
    Execute best action $a^*$ for $\langle s, b \rangle$
    Observe new state $s'$
    Update tree $T$ so that $\langle s', b_{s,a^*}^{s'} \rangle$ is the new root
  end while

Algorithm 1 is the main loop that controls the search in each timestep. In each iteration of the search, one of the fringe nodes is chosen and expanded in the best-first manner. The objective of the expansion is to close the gap in the bounds, and hence we select the fringe node that is likely to maximally reduce the gap at the root node if expanded. Specifically, we consider fringe nodes that are reachable by actions with the maximum upper bound value at each intermediate node,

$$\Pr(a \mid \langle s, b \rangle) = \begin{cases} 1 & \text{if } a = \arg\max_{a' \in A} U_T(\langle s, b \rangle, a') \\ 0 & \text{otherwise} \end{cases}$$

and compute the probability of reaching a fringe node $\langle s_d, b_d \rangle$ at depth $d$ by taking the path $h$ from the root node $\langle s_0, b_0 \rangle$ to the fringe node, i.e.

$$h(\langle s_d, b_d \rangle) = [\langle s_0, b_0 \rangle, a_0, \langle s_1, b_1 \rangle, a_1, \ldots, a_{d-1}, \langle s_d, b_d \rangle],$$

$$\Pr(h(\langle s_d, b_d \rangle)) = \prod_{i=0}^{d-1} \Pr(s_{i+1} \mid s_i, b_i, a_i) \Pr(a_i \mid \langle s_i, b_i \rangle) = \prod_{i=0}^{d-1} T_{b_i}(s_i, a_i, s_{i+1}) \Pr(a_i \mid \langle s_i, b_i \rangle).$$

Using the probability of path $h$, we compute the error contribution of each fringe node $\langle s_d, b_d \rangle$ on the root node

$$e(\langle s_d, b_d \rangle) = \gamma^d \Pr(h(\langle s_d, b_d \rangle))\,[U_T(s_d, b_d) - L_T(s_d, b_d)]$$

and choose the fringe node with the maximum error contribution.
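The fringe-node score $e(\langle s_d, b_d \rangle)$ can be illustrated with a few lines of Python. This is a sketch under simplifying assumptions: each fringe node is described by a hand-written list of `(transition_probability, is_max_upper_bound_action)` pairs along its path, the node names and bound values are invented, and the best node is found by plain enumeration rather than by the incremental navigation that AEMS2 uses.

```python
def path_probability(steps):
    """Product over the path of T_b(s_i, a_i, s_{i+1}) * Pr(a_i | <s_i, b_i>).

    Pr(a_i | <s_i, b_i>) is 1 for the maximum-upper-bound action and 0 otherwise.
    """
    prob = 1.0
    for transition_prob, is_max_upper_action in steps:
        prob *= transition_prob * (1.0 if is_max_upper_action else 0.0)
    return prob


def error_contribution(steps, upper, lower, gamma):
    """e = gamma^d * Pr(h) * (U - L), with depth d equal to the path length."""
    depth = len(steps)
    return (gamma ** depth) * path_probability(steps) * (upper - lower)


# Hypothetical fringe nodes: path description plus current bounds (illustrative values).
fringe = {
    "n1": {"steps": [(0.5, True), (0.4, True)], "upper": 10.0, "lower": 2.0},
    "n2": {"steps": [(0.5, True)], "upper": 6.0, "lower": 5.0},
    "n3": {"steps": [(0.9, False)], "upper": 20.0, "lower": 0.0},
}
gamma = 0.95
scores = {name: error_contribution(node["steps"], node["upper"], node["lower"], gamma)
          for name, node in fringe.items()}
best = max(scores, key=scores.get)
```

Node `n3` has the widest gap but is reached through an action that is not the maximum-upper-bound action, so its path probability and hence its score are zero; `n1` is selected because its large gap outweighs its deeper, less likely path.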
The best fringe node can be identified efficiently without exhaustive enumeration in each iteration. We then expand the chosen fringe node and update its lower and upper bounds using Algorithm 2. These updated bounds are then propagated up to the root node using Algorithm 3. Note that we use the shaped reward $R_\Phi$ throughout the algorithm, instead of the actual reward $R$.

We remark that the bound initialization is a subtle but crucial step for search. Given the initial upper and lower

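bounds $U^0(s, b)$ and $L^0(s, b)$ using the original reward function, it may seem natural to use the initial bounds $U^0_\Phi(s, b) = U^0(s, b) - \Phi(s, b)$ and $L^0_\Phi(s, b) = L^0(s, b) - \Phi(s, b)$, based on the relationship in Eq. (3). However, the following theorem shows that this is not a good idea.

Algorithm 2 Expand($\langle s, b \rangle$)
Input: $\langle s, b \rangle$: an OR-node chosen to expand
Static: $U$: an upper bound on $V^*$
        $L$: a lower bound on $V^*$
        $T$: the current AND-OR search tree
  for $a \in A$ do
    for $s' \in S$ do
      Create child node $\langle s', b_{s,a}^{s'} \rangle$
      $U_T(s', b_{s,a}^{s'}) \leftarrow U^0_\Phi(s', b_{s,a}^{s'})$
      $L_T(s', b_{s,a}^{s'}) \leftarrow L^0_\Phi(s', b_{s,a}^{s'})$
    end for
    $U_T(\langle s, b \rangle, a) \leftarrow \sum_{s' \in S} T_b(s, a, s')\,[R_\Phi(\langle s, b \rangle, a, \langle s', b_{s,a}^{s'} \rangle) + \gamma U_T(s', b_{s,a}^{s'})]$
    $L_T(\langle s, b \rangle, a) \leftarrow \sum_{s' \in S} T_b(s, a, s')\,[R_\Phi(\langle s, b \rangle, a, \langle s', b_{s,a}^{s'} \rangle) + \gamma L_T(s', b_{s,a}^{s'})]$
  end for
  $U_T(s, b) \leftarrow \min\big(U_T(s, b), \max_a U_T(\langle s, b \rangle, a)\big)$
  $L_T(s, b) \leftarrow \max\big(L_T(s, b), \max_a L_T(\langle s, b \rangle, a)\big)$

Algorithm 3 UpdateAncestor($\langle s, b \rangle$)
Input: $\langle s, b \rangle$: an OR-node chosen to update its ancestors
Static: $U$: an upper bound
        $L$: a lower bound
        $T$: the current AND-OR search tree
  while $\langle s, b \rangle$ is not the root of $T$ do
    Set $\langle s', b' \rangle$ to be the parent of $\langle s, b \rangle$ and $a'$ to be the corresponding action
    $U_T(\langle s', b' \rangle, a') \leftarrow \sum_{s'' \in S} T_{b'}(s', a', s'')\,[R_\Phi(\langle s', b' \rangle, a', \langle s'', (b')_{s',a'}^{s''} \rangle) + \gamma U_T(s'', (b')_{s',a'}^{s''})]$
    $L_T(\langle s', b' \rangle, a') \leftarrow \sum_{s'' \in S} T_{b'}(s', a', s'')\,[R_\Phi(\langle s', b' \rangle, a', \langle s'', (b')_{s',a'}^{s''} \rangle) + \gamma L_T(s'', (b')_{s',a'}^{s''})]$
    $U_T(s', b') \leftarrow \min\big(U_T(s', b'), \max_a U_T(\langle s', b' \rangle, a)\big)$
    $L_T(s', b') \leftarrow \max\big(L_T(s', b'), \max_a L_T(\langle s', b' \rangle, a)\big)$
    $\langle s, b \rangle \leftarrow \langle s', b' \rangle$
  end while

Theorem 1. If the bounds using the shaped reward are initialized to $U^0_\Phi(s, b) = U^0(s, b) - \Phi(s, b)$ and $L^0_\Phi(s, b) = L^0(s, b) - \Phi(s, b)$, the algorithm will expand the same fringe node as when using the original reward.

Proof. Given the path $h$ from the root node to the fringe node $\langle s, b \rangle$ at depth $d$, the error contribution of the fringe node under the shaped rewards is

$$e_\Phi(\langle s, b \rangle) = \gamma^d \Pr(h(\langle s, b \rangle))\,[U^0_\Phi(s, b) - L^0_\Phi(s, b)] = \gamma^d \Pr(h(\langle s, b \rangle))\,[U^0(s, b) - L^0(s, b)],$$

since $\Phi(s, b)$ cancels out.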
In addition, the upper bound value at its parent node $\langle s', b' \rangle$ under the shaped rewards is updated by

$$U_T(\langle s', b' \rangle, a') = \sum_{s} T_{b'}(s', a', s)\, U_\Phi(s', b', a', s),$$

where

$$U_\Phi(s', b', a', s) = [R_\Phi(\langle s', b' \rangle, a', \langle s, (b')_{s',a'}^{s} \rangle) + \gamma U_T(s, (b')_{s',a'}^{s})] = [R(s', a', s) + \gamma U^0(s, (b')_{s',a'}^{s}) - \Phi(s', b')],$$

which makes the maximum upper bound action $\arg\max_{a'} U_T(\langle s', b' \rangle, a')$ at the parent node unchanged compared to the one using the original rewards, since $\Phi(s', b')$ is invariant over actions. By induction, the maximum upper bound actions at all intermediate nodes on the path $h$ do not change, which makes $\Pr(h(\langle s, b \rangle))$ unchanged. Thus, we have $e_\Phi(\langle s, b \rangle) = e(\langle s, b \rangle)$, which implies that the algorithm will expand the same fringe node as with the original rewards.

This theorem states that translating both bounds by $\Phi$ essentially nullifies the effect of shaping, making the algorithm build exactly the same search tree as with the original reward. In our implementation, we used

$$U^0_\Phi(s, b) = U^0(s, b) - \Phi_{\min}$$
$$L^0_\Phi(s, b) = L^0(s, b) - \Phi(s, b),$$

where $\Phi_{\min} = \min_{s,b} \Phi(s, b)$. This is to make the reward shaping affect the construction of the search tree (since the AEMS2 heuristic chooses the action with the maximum upper bound value), while not affecting the final choice of the action for execution (since the algorithm chooses the action with the maximum lower bound value). We can show the latter by the following corollary:

Corollary 1. Given a search tree $T$, the best lower bound action computed from $T$ using the shaped rewards is identical to the one using the original rewards.

Proof. In the proof of Theorem 1, simply replace the upper bound $U$ by the lower bound $L$.

As a final remark, since we have not modified the search heuristic, the completeness and ε-optimality guarantee of the search in (Ross, Pineau, and Chaib-draa 2007) is preserved.

Potential Functions

It would seem that we need to define the potential function a priori. However, due to the nature of real-time search algorithms, the number of belief states at which the potential function is evaluated is bounded by the number of nodes expanded in the search tree.
Hence, we can perform evaluations on-the-fly, even incorporating experiences observed from the environment. One thing that we should keep in mind is that if we decide to use some value for the potential function at $\langle s, b \rangle$, we should use the same value when the search later creates a node with the same $\langle s, b \rangle$. This is to have a consistent specification of the potential function.
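One simple way to obtain this consistency is to memoize the potential on the pair $\langle s, b \rangle$. The sketch below assumes the belief is represented by a dictionary of Dirichlet counts keyed by `(s, a, s_next)`; the class name `MemoizedPotential` and the toy evaluator are hypothetical and serve only to show the caching behavior.

```python
class MemoizedPotential:
    """Wraps a potential evaluator so that a given <s, b> always maps to one value."""

    def __init__(self, evaluate):
        self._evaluate = evaluate  # callable (state, counts) -> float, may change over time
        self._cache = {}

    def __call__(self, state, counts):
        # The belief is assumed to be a dict of Dirichlet counts; sorting its items
        # gives a hashable key that does not depend on insertion order.
        key = (state, tuple(sorted(counts.items())))
        if key not in self._cache:
            self._cache[key] = self._evaluate(state, counts)
        return self._cache[key]


# Toy evaluator whose output changes when `scale` changes, standing in for a
# potential that is recomputed from time to time.
scale = {"value": 1.0}
phi = MemoizedPotential(lambda s, counts: scale["value"] * (s + sum(counts.values())))

b = {(0, "a", 1): 2.0}
first = phi(0, b)       # evaluated with the initial potential: 1.0 * (0 + 2.0)
scale["value"] = 10.0   # the underlying potential is recomputed
again = phi(0, b)       # same <s, b>: the cached value is returned
novel = phi(1, b)       # novel <s, b>: the updated potential is used
```

Only novel state-belief pairs see the recomputed potential, so a node that is recreated later in the search receives the value it had before.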

Value Functions of MDP Samples

An ideal potential function would be the optimal value function of the actual underlying environment model. However, since this is not available, we could use a set of models sampled from the current belief. In fact, a number of BRL algorithms took the sampling approach, albeit for different purposes: MC-BRL (Wang et al. 2012) and BA-POMDP (Ross, Chaib-draa, and Pineau 2007) used the set of sampled models for approximate belief tracking (i.e. particle filtering), and BOSS (Asmuth et al. 2009) used the samples for computing a probabilistic upper bound of the Bayes-optimal value.

In our approach, we periodically sample a set of $K$ MDPs (i.e. transition probabilities) $\{\hat\theta_1, \hat\theta_2, \ldots, \hat\theta_K\}$ from the current belief $b$ and solve each MDP to obtain the optimal value functions $\{V^*_{\hat\theta_1}, V^*_{\hat\theta_2}, \ldots, V^*_{\hat\theta_K}\}$. The potential function is defined as

$$\Phi_{KMDP}(s, b) = \sum_{k=1}^{K} w_k(b)\, V^*_{\hat\theta_k}(s),$$

where the weight $w_k$ is set to the posterior probability $\Pr(k \mid b)$ of the $k$-th MDP. The weights are initially set to $1/K$ and updated by the Bayes rule upon the transition $\langle s, a, s' \rangle$,

$$w_k \leftarrow \eta\, w_k\, (\hat\theta_k)_{s,a}^{s'},$$

where $\eta$ is the normalization constant. The MDP resampling period controls the tradeoff between reducing the overall running time and obtaining a more accurate value function estimate. In addition, since solving MDPs can take a considerable amount of time, we perform resampling only at the root node and reuse the samples in node expansions, as in sample-based tree search algorithms (Silver and Veness 2010).

Note that the MDP resampling and the weight update lead to a change in the potential function. In order to make the optimal policy invariant, we need the potential function to be consistent by making the same $\langle s, b \rangle$ yield the same potential value. This is achieved by a hashtable so that the updated potential function is only evaluated for novel state-belief pairs that are encountered during search.

Value Function of Optimistic MDP

Inspired by PAC-BAMDP algorithms, we can periodically build an optimistic MDP approximation of the BAMDP based on the current belief, and use its optimal value function as the potential function.
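To make the sampled-MDP potential $\Phi_{KMDP}$ defined earlier in this section concrete, here is a minimal sketch that samples $K$ transition models from Dirichlet counts, solves each by value iteration, and maintains the weights with the Bayes update $w_k \leftarrow \eta\, w_k\, (\hat\theta_k)_{s,a}^{s'}$. The names `sample_mdp`, `solve`, and `KMDPPotential`, the fixed seed, the iteration count, and the 2-state toy problem are assumptions for illustration; resampling schedules and caching are left out.

```python
import random


def sample_mdp(counts, rng):
    """Draw one transition model; counts[s][a] holds Dirichlet parameters over next states."""
    model = []
    for state_counts in counts:
        rows = []
        for alphas in state_counts:
            # A Dirichlet draw is a vector of independent Gamma draws, normalized.
            gammas = [rng.gammavariate(alpha, 1.0) for alpha in alphas]
            total = sum(gammas)
            rows.append([g / total for g in gammas])
        model.append(rows)
    return model


def solve(T, R, gamma, iters=500):
    """Value iteration for a fully specified MDP; returns the state values."""
    V = [0.0] * len(T)
    for _ in range(iters):
        V = [max(sum(p * (R[s][a][s2] + gamma * V[s2]) for s2, p in enumerate(T[s][a]))
                 for a in range(len(T[s]))) for s in range(len(T))]
    return V


class KMDPPotential:
    def __init__(self, counts, R, gamma, K, seed=0):
        rng = random.Random(seed)
        self.models = [sample_mdp(counts, rng) for _ in range(K)]
        self.values = [solve(T, R, gamma) for T in self.models]
        self.weights = [1.0 / K] * K  # initial weights 1/K

    def update(self, s, a, s_next):
        """Bayes rule: reweight each sampled model by its likelihood of <s, a, s'>."""
        unnormalized = [w * T[s][a][s_next] for w, T in zip(self.weights, self.models)]
        eta = sum(unnormalized)
        self.weights = [u / eta for u in unnormalized]

    def __call__(self, s):
        """Weighted sum of the sampled models' optimal values at state s."""
        return sum(w * V[s] for w, V in zip(self.weights, self.values))


# Toy problem: 2 states, 2 actions, flat counts, reward 1 for landing in state 1.
counts = [[[1.0, 1.0], [1.0, 1.0]], [[1.0, 1.0], [1.0, 1.0]]]
R = [[[1.0 if s2 == 1 else 0.0 for s2 in range(2)] for a in range(2)] for s in range(2)]
potential = KMDPPotential(counts, R, gamma=0.9, K=5)
potential.update(0, 1, 1)  # observed transition <0, 1, 1>
value_at_0 = potential(0)
```

The belief enters only through the weights here, so the potential changes as transitions are observed even though the sampled models stay fixed between resampling steps.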
As an example, we can use the optimistic MDP constructed in BEB (Kolter and Ng 2009), of which the optimal value function is defined by

$$V^*_{BEB}(s, b) = \max_a \sum_{s'} T_b(s, a, s') \Big[ R(s, a, s') + \frac{\beta}{1 + n_b(s, a)} + \gamma V^*_{BEB}(s', b) \Big],$$

where $\beta$ is the exploration bonus parameter, and $n_b(s, a)$ is the number of visits to the state-action pair $\langle s, a \rangle$ in the current belief $b$. The value function $V^*_{BEB}$ can be efficiently computed by standard value iteration. We denote this potential function as $\Phi_{BEB}$. Our method can be seen as the multi-step lookahead search extension to BEB.

As in $\Phi_{KMDP}$, we recompute $\Phi_{BEB}$ only at the root node at regular intervals, not necessarily at every timestep. We also use a hashtable so that the updated potential function is only evaluated for novel state-belief pairs.

Figure 1: (a) CHAIN (left) and (b) DOUBLE-LOOP (right)

Figure 2: (a) GRID5 (left) and (b) MAZE (right)

Experiments

We conducted experiments on the following five benchmark BRL domains:

CHAIN (Strens 2000) consists of a linear chain of 5 states and 2 actions {a, b}, as shown in Figure 1 (a). The rewards are shown as edge labels for each transition. The transitions are stochastic: the agent slips and performs the other action with probability 0.2.

DOUBLE-LOOP (Dearden, Friedman, and Russell 1998) consists of 9 states and 2 actions, as shown in Figure 1 (b). The transitions are deterministic. The optimal behavior is to complete the traversal of the left loop with a reward of 2 by executing action b all the time, while the right loop is easier to complete, yielding a reward of 1.

GRID5 (Guez, Silver, and Dayan 2012) consists of 5 × 5 states with no reward anywhere except at the goal location (G), which is opposite to the reset location (R) (Figure 2 (a)). Once the agent reaches G, it is sent back to R with a reward of 1. There are 4 actions for moving in each cardinal direction, of which the transitions are stochastic: the agent moves in random directions with probability 0.2.

GRID10 (Guez, Silver, and Dayan 2012) is a larger version of GRID5 with 10 × 10 states.
MAZE (Dearden, Friedman, and Russell 1998) consists of 264 states and 4 actions, where the agent has to collect flags at certain locations (F) and arrive at the goal location (G), as shown in Figure 2 (b). Once the agent reaches G, it is sent back to the reset location (R) with the reward equal to the number of flags (F) collected. The stochasticity in transitions is the same as in GRID5.

In Table 1, we compare the total undiscounted rewards gathered from the following 5 algorithms: BAMCP (Guez, Silver, and Dayan 2012) is one of the most efficient algorithms that uses Monte-Carlo Tree Search; BOP (Fonteneau, Busoniu, and Munos 2013) is a real-time heuristic search algorithm that uses the naive upper and lower bounds $R_{\max}/(1-\gamma)$ and $R_{\min}/(1-\gamma)$; NS is the real-time heuristic search algorithm presented in the previous section without shaping (No Shaping), but using more sophisticated bounds calculated by optimistic and pessimistic value iteration (Givan, Leach, and Dean 2000); $\Phi_{KMDP}$ and $\Phi_{BEB}$ are the same search algorithms using the corresponding potential functions for shaping.

Figure 3: Total rewards vs. search CPU times in larger domains (GRID5, GRID10, and MAZE). Panels: GRID5, GRID10, and MAZE under the FDM and SFDM priors. Curves: Φ_BEB, Φ_KMDP, NS, BOP, BAMCP.

Table 1: The averages of total undiscounted rewards and their 95% confidence intervals. For all domains except GRID10 and MAZE, the results are from 5 runs of 1 timesteps. For GRID10 and MAZE, we used 2 timesteps and 2 timesteps, respectively. We set γ = 0.95 for all domains. Top performance results are highlighted in bold face.

Prior  Method   CHAIN      DOUBLE-LOOP  GRID5          GRID10         MAZE
FDM    Φ_BEB    (±75.66)   (±8.52)      (±.96)         (±.51)         (±1.53)
FDM    Φ_KMDP   (±73.62)   (±8.34)      (±.97)         (±.59)         (±2.54)
FDM    NS       (±75.4)    (±8.58)      (±1.2)         21.4 (±.5)     (±2.61)
FDM    BOP      (±76.56)   (±8.54)      (±.82)         9.4 (±.28)     (±.87)
FDM    BAMCP    (±25.1)    (±1.17)      (±.5)          5.14 (±.41)    (±.84)
SFDM   Φ_BEB    (±74.53)   (±7.53)      73.5 (±1.8)    (±.6)          (±4.2)
SFDM   Φ_KMDP   (±72.11)   (±7.58)      (±1.2)         (±.59)         (±4.78)
SFDM   NS       (±72.23)   (±7.45)      (±1.17)        (±.55)         (±4.12)
SFDM   BOP      (±72.45)   (±7.52)      (±1.13)        1.36 (±.25)    (±.86)
SFDM   BAMCP    (±27.14)   (±8.1)       7.97 (±.79)    (±.42)         (±3.44)

In addition, we experimented with two different priors: the flat Dirichlet multinomial (FDM) with $\alpha = 1/|S|$ (Guez, Silver, and Dayan 2012) and the sparse factored Dirichlet multinomial (SFDM) (Friedman and Singer 1999). Each algorithm was given the CPU time of .1s per timestep by adjusting the number of node expansions.
This time limit was sufficient for all the algorithms to reach their highest levels of performance, except in the larger domains GRID10 and MAZE.

The parameter settings for each algorithm were as follows: for BAMCP, we followed the exact settings in (Guez, Silver, and Dayan 2012), which were c = 3 and ε = 0.5 for the exploration constants in the tree search and the rollout simulation, and the maximum depth of the search tree was set to 15 in all domains except GRID10 and MAZE, in which the depth was increased to 5; for $\Phi_{KMDP}$, we set the number of MDP samples K = 1; for $\Phi_{BEB}$, β was chosen from {.5, 1, 1, 2, 3, 5} as the value that performed the best; the recomputation of the potential function was set to happen 1 times during a run. In addition, upon noticing that the algorithms with shaping performed well even without the hashtable, we decided to remove the bookkeeping in the experiments for further speedup. In fact, the hit rate of the cache was less than 1% in all experiments.

Real-time heuristic search with reward shaping yielded the best results in all domains except DOUBLE-LOOP, and showed significant improvement in learning performance on larger domains such as GRID5, GRID10, and MAZE. In CHAIN, which was the smallest domain, shaping had almost no effect on search since good actions could be readily found with small search trees. Finally, Figure 3 shows the improvement in the total reward as we increase the search time for the three larger domains. It clearly shows the effectiveness of shaping for real-time heuristic search.

It is interesting to note the singularity in the DOUBLE-LOOP results. BAMCP performed far better than the other algorithms in this particular domain. In order to further analyze the results, we conducted another set of experiments on a bandit problem where we can obtain Bayes-optimal policies with high accuracy via computing Gittins indices (Gittins 1979). In this experiment, we consider a two-armed Bernoulli bandit, where the two arms have 0.1 and 0.9 success probabilities. In Figure 4, we compare the total rewards obtained by the Bayes-optimal policy, our real-time heuristic search algorithm with reward shaping, and BAMCP. The averages were obtained from 1 runs of 3 time steps. Again, as for BAMCP, the exploration constant c was chosen from {0.5, 1, 1.5, 2, 2.5, 3} as the value that performed the best.

Figure 4: Performance comparison of real-time heuristic search with reward shaping and BAMCP against the Bayes-optimal policy on a two-armed Bernoulli bandit with µ1 = 0.1 and µ2 = 0.9 and γ = 0.99. Both panels plot total rewards against the initial prior for the Bayes-optimal policy, real-time heuristic search with reward shaping, and BAMCP.

The top graph in Figure 4 shows the results with three different initial priors, $(\alpha_1, \beta_1, \alpha_2, \beta_2)$ = (5, 5, 5, 5), (25, 25, 25, 25), and (50, 50, 50, 50). Note that while our search algorithm performed very close to the Bayes-optimal policy, BAMCP was quite susceptible to the prior and was not able to overcome the strong incorrect prior. The bottom graph shows the same comparison with a different set of incorrect initial priors: (9, 1, 1, 9), (45, 5, 5, 45), and (90, 10, 10, 90). In this case, the priors had less effect on the BAMCP performance, and in fact BAMCP was performing better than the Bayes-optimal policy. Our search algorithm, on the other hand, matched the performance of the Bayes-optimal policy in two out of three cases. We obtained these results because the behavior of BAMCP arises from the combination of two parameters: the prior and the exploration constant.
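To see why a stronger incorrect prior is harder to overcome, it helps to look at the Beta-Bernoulli posterior mean of a single arm. The snippet below is a small illustration, not part of the reported experiments: the function name `posterior_mean`, the pull counts, and the two example priors are chosen only to show the effect.

```python
def posterior_mean(alpha, beta, successes, failures):
    """Mean of the Beta(alpha + successes, beta + failures) posterior of a Bernoulli arm."""
    return (alpha + successes) / (alpha + beta + successes + failures)


# Illustrative scenario: an arm with true success probability 0.1 is pulled 100 times
# and yields 10 successes, while the prior wrongly suggests a mean of 0.9.
weak_prior = posterior_mean(9, 1, successes=10, failures=90)      # 19 / 110
strong_prior = posterior_mean(90, 10, successes=10, failures=90)  # 100 / 200
```

With the weak prior the estimate has already dropped to about 0.17, whereas with the strong prior it is still 0.5 after the same 100 pulls, so a planner that trusts its belief keeps overestimating that arm for much longer.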
Hence, by changing the exploration constant, the algorithm can effectively ignore the prior and be tuned to the specific problem at hand. We believe that this is what is happening behind the DOUBLE-LOOP experiments.

Conclusion and Future Work

In this paper, we presented shaping for significantly improving the learning performance of a model-based BRL method. Our main insight comes from the BAMDP formulation of BRL, which is a hybrid-state POMDP. We showed how shaping can be used for real-time AO* search as an efficient BRL method. Shaping mitigates the sparsity and delay of rewards, helping the search algorithm to find good actions without the necessity to build large search trees for long-horizon planning.

We proposed two approaches to defining the potential function for shaping, which do not depend on a-priori knowledge about the true underlying environment: they only leverage the structural regularity in the POMDP that arises from the BAMDP. They are also adaptive in the sense that they use past experiences from the underlying model to estimate the Bayes-optimal value.

Extending our approach to larger or continuous state spaces, and integrating shaping with Monte-Carlo Tree Search algorithms, are promising directions for future work.

Acknowledgments

This work was partly supported by the ICT R&D program of MSIP/IITP [ , Basic Software Research in Human-level Lifelong Machine Learning (Machine Learning Center)], National Research Foundation of Korea (Grant# ), and Defense Acquisition Program Administration and Agency for Defense Development under the contract UD1422PD, Korea.

References

Araya-López, M.; Thomas, V.; and Buffet, O. 2012. Near-optimal BRL using optimistic local transition. In Proceedings of the 29th International Conference on Machine Learning.

Asmuth, J., and Littman, M. 2011. Learning is planning: Near Bayes-optimal reinforcement learning via Monte-Carlo tree search. In Proceedings of the 27th Conference on Uncertainty in Artificial Intelligence.

Asmuth, J.; Li, L.; Littman, M. L.; Nouri, A.; and Wingate, D. 2009. A Bayesian sampling approach to exploration in reinforcement learning. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence.

Asmuth, J.; Littman, M. L.; and Zinkov, R. 2008. Potential-based shaping in model-based reinforcement learning. In Proceedings of the 23rd AAAI Conference on Artificial Intelligence.

Castro, P. S., and Precup, D. 2010. Smarter sampling in model-based Bayesian reinforcement learning. In Machine Learning and Knowledge Discovery in Databases. Springer.

Dearden, R.; Friedman, N.; and Russell, S. 1998. Bayesian Q-learning. In Proceedings of the National Conference on Artificial Intelligence.

Duff, M. O. 2002. Optimal Learning: Computational Procedures for Bayes-Adaptive Markov Decision Processes. Ph.D. Dissertation, University of Massachusetts Amherst.

Eck, A.; Soh, L.-K.; Devlin, S.; and Kudenko, D. 2013. Potential-based reward shaping for POMDPs. In Proceedings of the 12th International Conference on Autonomous Agents and Multiagent Systems.

Fonteneau, R.; Busoniu, L.; and Munos, R. 2013. Optimistic planning for belief-augmented Markov decision processes. In IEEE Symposium on Approximate Dynamic Programming and Reinforcement Learning.

Friedman, N., and Singer, Y. 1999. Efficient Bayesian parameter estimation in large discrete domains. In Advances in Neural Information Processing Systems.

Gittins, J. C. 1979. Bandit processes and dynamic allocation indices. Journal of the Royal Statistical Society. Series B (Methodological) 41(2).

Givan, R.; Leach, S.; and Dean, T. 2000. Bounded-parameter Markov decision processes. Artificial Intelligence 122.

Grześ, M., and Kudenko, D. 2010. Online learning of shaping rewards in reinforcement learning. Neural Networks.

Guez, A.; Silver, D.; and Dayan, P. 2012. Efficient Bayes-adaptive reinforcement learning using sample-based search. In Advances in Neural Information Processing Systems.

Kolter, J. Z., and Ng, A. Y. 2009. Near-Bayesian exploration in polynomial time. In Proceedings of the 26th International Conference on Machine Learning.

Ng, A. Y.; Harada, D.; and Russell, S. 1999. Policy invariance under reward transformations: Theory and application to reward shaping. In Proceedings of the 16th International Conference on Machine Learning.

Nilsson, N. J. 1982. Principles of Artificial Intelligence. Symbolic Computation / Artificial Intelligence. Springer.

Poupart, P.; Vlassis, N.; Hoey, J.; and Regan, K. 2006. An analytic solution to discrete Bayesian reinforcement learning. In Proceedings of the 23rd International Conference on Machine Learning.

Ross, S.; Pineau, J.; Paquet, S.; and Chaib-draa, B. 2008. Online planning algorithms for POMDPs. Journal of Artificial Intelligence Research 32.

Ross, S.; Chaib-draa, B.; and Pineau, J. 2007. Bayes-adaptive POMDPs. In Advances in Neural Information Processing Systems.

Ross, S.; Pineau, J.; and Chaib-draa, B. 2007. Theoretical analysis of heuristic search methods for online POMDPs. In Advances in Neural Information Processing Systems.

Silver, D., and Veness, J. 2010. Monte-Carlo planning in large POMDPs. In Advances in Neural Information Processing Systems.

Strens, M. 2000. A Bayesian framework for reinforcement learning. In Proceedings of the 17th International Conference on Machine Learning.

Wang, Y.; Won, K. S.; Hsu, D.; and Lee, W. S. 2012. Monte Carlo Bayesian reinforcement learning. In Proceedings of the 29th International Conference on Machine Learning.


More information

Application of Exact Discretization for Logistic Differential Equations to the Design of a Discrete-Time State-Observer

Application of Exact Discretization for Logistic Differential Equations to the Design of a Discrete-Time State-Observer 5 Proceedings of the Interntionl Conference on Informtion nd Automtion, December 58, 5, Colombo, Sri Ln. Appliction of Exct Discretiztion for Logistic Differentil Equtions to the Design of Discrete-ime

More information

8 Laplace s Method and Local Limit Theorems

8 Laplace s Method and Local Limit Theorems 8 Lplce s Method nd Locl Limit Theorems 8. Fourier Anlysis in Higher DImensions Most of the theorems of Fourier nlysis tht we hve proved hve nturl generliztions to higher dimensions, nd these cn be proved

More information