Near-Bayesian Exploration in Polynomial Time

J. Zico Kolter    Andrew Y. Ng
Computer Science Department, Stanford University, CA

Appearing in Proceedings of the 26th International Conference on Machine Learning, Montreal, Canada, 2009. Copyright 2009 by the author(s)/owner(s).

Abstract

We consider the exploration/exploitation problem in reinforcement learning (RL). The Bayesian approach to model-based RL offers an elegant solution to this problem, by considering a distribution over possible models and acting to maximize expected reward; unfortunately, the Bayesian solution is intractable for all but very restricted cases. In this paper we present a simple algorithm, and prove that with high probability it is able to perform ε-close to the true (intractable) optimal Bayesian policy after some small (polynomial in quantities describing the system) number of time steps. The algorithm and analysis are motivated by the so-called PAC-MDP approach, and extend such results into the setting of Bayesian RL. In this setting, we show that we can achieve lower sample complexity bounds than existing algorithms, while using an exploration strategy that is much greedier than the (extremely cautious) exploration of PAC-MDP algorithms.

1. Introduction

An agent acting in an unknown environment must consider the well-known exploration/exploitation trade-off: the trade-off between maximizing rewards based on the current knowledge of the system (exploiting) and acting so as to gain additional information about the system (exploring). The Bayesian approach to model-based reinforcement learning (RL) offers a very elegant solution to this dilemma. Under the Bayesian approach, one maintains a distribution over possible models and simply acts to maximize the expected future reward; this objective trades off very naturally between exploration and exploitation. Unfortunately, except for very restricted cases, computing the full Bayesian policy is intractable. This has led to numerous approximation methods, but, to the best of our knowledge, there has been little work on providing any formal guarantees for such algorithms.

In this paper we present a simple, greedy approximation algorithm, and show that it is able to perform nearly as well as the intractable optimal Bayesian policy after executing a small (polynomial in quantities describing the system) number of steps. The algorithm and analysis are motivated by the so-called PAC-MDP approach, typified by algorithms such as E^3 and R-max, but extend this paradigm to the setting of Bayesian RL. We show that by considering optimality with respect to the optimal Bayesian policy, we can both achieve lower sample complexity than existing algorithms, and use an exploration approach that is far greedier than the extremely cautious exploration required by any PAC-MDP algorithm. Indeed, our analysis also shows that both our greedy algorithm and the true Bayesian policy are not PAC-MDP.

The remainder of the paper is organized as follows. In Section 2 we describe our setting formally and review the Bayesian and PAC-MDP approaches to exploration. In Section 3 we present our greedy approximation algorithm, and then state and discuss the theoretical guarantees for this method. In Section 4 we prove these results, present brief simulation results in Section 5, and conclude in Section 6.

2. Preliminaries

A Markov Decision Process (MDP) is a tuple (S, A, P, R, H), where S is a set of states, A is a set of actions, P : S x A x S -> R+ is a state transition probability function, R : S x A -> [0,1] is a bounded reward function, and H is a time horizon. (We use the finite-horizon setting for simplicity, but all results presented here easily extend to the case of infinite-horizon discounted rewards.)

We consider an agent interacting with an MDP via a single continuous thread of experience, and we assume that the transition probabilities P are unknown to the agent. For simplicity of notation, we will assume that the reward is known, but this does not sacrifice generality, since an MDP with unknown bounded rewards and unknown transitions can be represented as an MDP with known rewards and unknown transitions by adding additional states to the system. In this work we will focus on the case of discrete states and actions.

A policy π is a mapping from states to actions. The value of a policy for a given state is defined as the sum of rewards over the next H time steps,

    V^π_H(s) = E[ Σ_{t=1}^H R(s_t, π(s_t)) | s_1 = s, π ].

The value function can also be written in terms of Bellman's equation,

    V^π_H(s) = R(s, π(s)) + Σ_{s'} P(s' | s, π(s)) V^π_{H-1}(s').

Finally, when the transitions of the MDP are known, we can find the optimal policy π* and optimal value function V* by solving Bellman's optimality equation

    V*_H(s) = max_a [ R(s, a) + Σ_{s'} P(s' | s, a) V*_{H-1}(s') ],

where π*(s) is simply the action that maximizes the right-hand side; we can apply classical algorithms such as Value Iteration or Policy Iteration to find a solution to this equation (Puterman, 1994).

2.1. Bayesian Reinforcement Learning

Of course, in the setting we consider, the transitions of the MDP are not known, so we must adopt different methods. The Bayesian approach to model-based RL, which has its roots in the topic of Dual Control (Fel'dbaum, 1961; Filatov & Unbehauen, 2004), explicitly represents the uncertainty over MDPs by maintaining a belief state b. Since we are concerned with the setting of discrete states and actions, a natural means of representing the belief is to let b consist of a set of Dirichlet distributions that describe our uncertainty over the state transitions:

    b = { α(s, a, s') },    P(s' | b, s, a) = α(s, a, s') / α_0(s, a),

where α_0(s, a) = Σ_{s'} α(s, a, s').

In this setting, the state now consists of both the system state and the belief state; policies and value functions now depend on both these elements. The value of a policy π for a belief and state is again given by Bellman's equation

    V^π_H(b, s) = R(s, a) + Σ_{b', s'} P(b', s' | b, s, a) V^π_{H-1}(b', s')
                = R(s, a) + Σ_{s'} P(s' | b, s, a) V^π_{H-1}(b', s'),

where a = π(s, b), and where on the second line the belief b' is equal to b except that we increment α(s, a, s'). The simplified form on this second line results from the fact that P(b' | s, b, s', a) is deterministic: if we are in state s, take action a, and end up in state s', then we know how to update b (by incrementing α(s, a, s'), as described above). This allows us to remove the integral over b', and since the states are discrete we only need to sum over s'. We can also use the same logic to derive a Bayesian version of Bellman's optimality equation, which gives us the optimal Bayesian value function and policy,

    V*_H(b, s) = max_a [ R(s, a) + Σ_{s'} P(s' | b, s, a) V*_{H-1}(b', s') ],

where the optimal Bayesian policy π*(b, s) is just the action that maximizes the right-hand side.

Since this Bayesian policy plays a crucial role in the remainder of this paper, it is worth understanding the intuition behind this equation. The optimal Bayesian policy chooses actions based not only on how they will affect the next state of the system, but also based on how they will affect the next belief state; and, since better knowledge of the MDP will typically lead to greater future reward, the Bayesian policy will very naturally trade off between exploring the system to gain more knowledge, and exploiting its current knowledge of the system.
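To make the representation above concrete, the following sketch (not from the paper; the array layout, the uniform prior of one count per outcome, and the helper names are illustrative assumptions) shows a Dirichlet-count belief state, its deterministic update, and finite-horizon value iteration on the mean MDP of a belief.

```python
import numpy as np

S, A, H = 5, 2, 6                     # toy sizes, chosen arbitrarily

# alpha[s, a, s'] holds the Dirichlet counts; start from a uniform prior.
alpha = np.ones((S, A, S))

def mean_transitions(alpha):
    """Mean of the belief: P(s' | b, s, a) = alpha(s,a,s') / alpha_0(s,a)."""
    return alpha / alpha.sum(axis=2, keepdims=True)

def update_belief(alpha, s, a, s_next):
    """Deterministic belief update: increment the count of the observed transition."""
    alpha = alpha.copy()
    alpha[s, a, s_next] += 1
    return alpha

def value_iteration(P, R, H):
    """Finite-horizon value iteration for a known (here: mean) MDP.
    P has shape (S, A, S); R has shape (S, A)."""
    V = np.zeros(P.shape[0])
    pi = np.zeros(P.shape[0], dtype=int)
    for _ in range(H):
        Q = R + P @ V                 # Q[s, a] = R(s,a) + sum_s' P(s'|s,a) V(s')
        V, pi = Q.max(axis=1), Q.argmax(axis=1)
    return V, pi
```

Solving the mean MDP of the current belief in exactly this way, with an exploration bonus added to the reward, is the computation the BEB algorithm of Section 3 performs at each time step.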
Unfortunately, while the Bayesian approach provides a very elegant solution to the exploration/exploitation problem, it is typically not possible to compute the Bayesian policy exactly. Since the dimension of the belief state grows polynomially in the number of states and actions, computing the Bayesian value function using value iteration or other methods is typically not tractable. (One exception, where the Bayesian approach is tractable, is the domain of k-armed bandits, i.e., an MDP with one state and k actions, where the rewards are unknown. In this case, the Bayesian approach leads to the well-known Gittins indices (Gittins, 1989). However, the approach does not scale analytically to multi-state MDPs.) This has led to numerous methods that approximate the Bayesian exploration policy (Dearden et al., 1999; Strens, 2000; Wang et al., 2005; Poupart et al., 2006), typically by computing an approximation to the optimal value function, either by sampling or other methods. However, little is known about these algorithms from a theoretical perspective, and it is unclear what (if any) formal guarantees can be made for such approaches.

2.2. PAC-MDP Reinforcement Learning

An alternative approach to exploration in RL is the so-called PAC-MDP approach, exemplified by algorithms such as E^3, R-max, and others (Kearns & Singh, 2002; Brafman & Tennenholtz, 2002; Kakade, 2003; Strehl & Littman, 2008). Such algorithms also address the exploration/exploitation problem, but do so in a different manner. The algorithms are based on the following intuition: if an agent has observed a certain state-action pair enough times, then we can use large deviation inequalities, such as the Hoeffding bound, to ensure that the true dynamics are close to the empirical estimates. On the other hand, if we have not observed a state-action pair enough times, then we assume it has very high value; this will drive the agent to try out state-action pairs that we haven't observed enough times, until eventually we have a suitably accurate model of the system (this general technique is known as "optimism in the face of uncertainty").

Although the precise formulation of the learning guarantees varies from algorithm to algorithm, using these strategies one can prove theoretical guarantees of the following form, or similar: with high probability, the algorithm performs near-optimally for all but a small number of time steps, where "small" here means polynomial in various quantities describing the MDP. Slightly more formally, if A_t denotes the policy followed by the algorithm at time t, then with probability greater than 1 - δ,

    V^{A_t}(s_t) ≥ V*(s_t) - ε

for all but m = O(poly(S, A, H, 1/ε, 1/δ)) time steps. This statement does not indicate when these suboptimal steps will occur (the algorithm could act near-optimally for a long period of time before returning to sub-optimal behavior for some number of steps), which allows us to avoid issues of mixing times or ergodicity of the MDP; this precise formulation is due to Kakade (2003).

Many variations and extensions of these results exist: to the case of metric MDPs (Kakade et al., 2003), factored MDPs (Kearns & Koller, 1999), continuous (linear in state features) domains (Strehl & Littman, 2008b), a certain class of switching linear systems (Brunskill et al., 2008), and others. However, the overall intuition behind these approaches is similar: in order to perform well, we want to explore enough that we learn an accurate model of the system. (A slightly different approach is taken by Strehl et al. (2006), where they do not build an explicit model of the system. However, the overall idea is the same, only here they want to explore enough until they obtain an accurate estimate of the state-action value function.) While this results in very powerful guarantees, the algorithms typically require a very large amount of exploration in practice. This contrasts with the Bayesian approach, where we just want to obtain high expected reward over some finite horizon (or alternatively, an infinite discounted horizon). Intuitively, we might then expect that the Bayesian approach could act in a greedier fashion than the PAC-MDP approaches, and we will confirm and formalize this intuition in the next section. Furthermore, many issues that present challenges in the PAC-MDP framework, such as incorporating prior knowledge or dealing with correlated transitions, seemingly can be handled very naturally in the Bayesian framework.

3. A Greedy Approximation Algorithm and Theoretical Results

From the discussion above, it should be apparent that both the Bayesian and PAC-MDP approaches have advantages and drawbacks, and in this section we present an algorithm and analysis that combines elements from both frameworks. In particular, we present a simple greedy algorithm that we show to perform nearly as well as the full Bayesian policy, in a sense that we will formalize shortly; this is a PAC-MDP-type result, but we consider optimality with respect to the Bayesian policy for a given belief state, rather than the optimal policy for some fixed MDP.
As we will show, this alternative definition of optimality allows us to both achieve lower sample complexity than existing PAC-MDP algorithms and use a greedier exploration method.

The algorithm we propose is itself very straightforward and similar to many previously proposed exploration heuristics. We call the algorithm Bayesian Exploration Bonus (BEB), since it chooses actions according to the current mean estimate of the state transitions plus an additional reward bonus for state-action pairs that have been observed relatively little. Specifically, the BEB algorithm, at each time step, chooses actions greedily with respect to the value function

    Ṽ_H(b, s) = max_a [ R(s, a) + β / (1 + α_0(s, a)) + Σ_{s'} P(s' | b, s, a) Ṽ_{H-1}(b, s') ]        (1)

where β is a parameter of the algorithm that we will discuss shortly. In other words, the algorithm acts by solving an MDP using the mean of the current belief state for the transition probabilities, and an additional exploration bonus of β / (1 + α_0(s, a)) at each state. Note that the belief b is not updated in this equation, meaning we can solve the equation using the standard Value Iteration or Policy Iteration algorithms. To simplify the analysis that follows, we also take a common approach and cease updating the belief states after a certain number of observations, which we will describe more fully below.
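The following is a minimal sketch of this computation (illustrative only; the array shapes and helper names are assumptions, not code from the paper). It solves the mean MDP of the current belief with the bonus β / (1 + α_0(s, a)) added to the reward; replacing that bonus with β / √(n(s, a)) would give an MBIE-EB-style bonus, which is discussed below.

```python
import numpy as np

def beb_action(alpha, R, H, s, beta=None):
    """Choose an action greedily with respect to the BEB value function of Eq. (1).

    alpha: Dirichlet counts with shape (S, A, S); R: rewards with shape (S, A).
    A sketch only; beta defaults to the 2*H^2 used in Theorem 1 below.
    """
    if beta is None:
        beta = 2.0 * H ** 2
    alpha0 = alpha.sum(axis=2)              # alpha_0(s, a)
    P_mean = alpha / alpha0[:, :, None]     # mean transition probabilities of the belief
    R_bonus = R + beta / (1.0 + alpha0)     # reward plus the BEB exploration bonus
    # (An MBIE-EB-style bonus would instead use beta / np.sqrt(n) for observation counts n.)

    V = np.zeros(alpha.shape[0])
    for _ in range(H):                      # finite-horizon value iteration on the mean MDP
        Q = R_bonus + P_mean @ V
        V = Q.max(axis=1)
    return int(Q[s].argmax())
```

In an interaction loop one would call this at every step, execute the chosen action, observe s', and increment alpha[s, a, s'], except for state-action pairs whose counts already exceed the cap introduced in Theorem 1.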

The following theorem gives a performance guarantee for the BEB algorithm for a suitably large value of β.

Theorem 1. Let A_t denote the policy followed by the BEB algorithm (with β = 2H^2) at time t, and let s_t and b_t be the corresponding state and belief. Also suppose we stop updating the belief for a state-action pair when α_0(s, a) > 4H^3/ε. Then with probability at least 1 - δ,

    V^{A_t}_H(b_t, s_t) ≥ V*_H(b_t, s_t) - ε,

i.e., the algorithm is ε-close to the optimal Bayesian policy, for all but

    m = O( (S A H^6 / ε^2) log (S A / δ) )

time steps.

In other words, BEB acts sub-optimally, where optimality is defined in the Bayesian sense, for only a polynomial number of time steps. Like the PAC-MDP results mentioned above, the theorem makes no claims about when these sub-optimal steps occur, and thus avoids issues of mixing times, etc.

In terms of the polynomial degree on the various quantities, this bound is tighter than the standard PAC-MDP bounds, which to the best of our knowledge have sample complexity of

    m = Õ( S^2 A H^6 / ε^3 )

time steps. (The Õ notation suppresses logarithmic factors. In addition, the model-free algorithm of Strehl et al. (2006) obtains a bound of Õ(S A H^8 / ε^4), also larger than our bound.) Intuitively, this smaller bound results from the fact that in order to approximate the Bayesian policy, we don't need to learn the true dynamics of an MDP; we just need to ensure that the posterior beliefs are sufficiently peaked so that further updates cannot lead to very much additional reward. However, as mentioned above, the two theorems are not directly comparable, since we define optimality with respect to the Bayesian policy for a belief state, whereas the standard PAC-MDP framework defines optimality with respect to the optimal policy for some given MDP. Indeed, one of the chief insights of this work is that by considering the Bayesian definition of optimality, we can achieve these smaller bounds.

To gain further insight into the nature of Bayesian exploration, BEB, and the PAC-MDP approach, we compare our method to a very similar PAC-MDP algorithm known as Model Based Interval Estimation with Exploration Bonus (MBIE-EB) (Strehl & Littman, 2008). Like BEB, this algorithm at each time step solves an MDP according to the mean estimate of the transitions, plus an exploration bonus. However, MBIE-EB uses an exploration bonus of the form

    β / √(n(s, a)),

where n(s, a) denotes the number of times that the state-action pair (s, a) has been observed; this contrasts with the BEB algorithm, which has an exploration bonus of

    β / (1 + n(s, a)),

where here n(s, a) also includes the counts implied by the prior. Since 1/√n decays much more slowly than 1/n, MBIE-EB consequently explores a great deal more than the BEB algorithm. Furthermore, this is not an artifact of the MBIE-EB algorithm alone: as we formalize in the next theorem, any algorithm with an exploration bonus that decays faster than 1/√n cannot be PAC-MDP.

Theorem 2. Let A_t denote the policy followed by an algorithm using any (arbitrarily complex) exploration bonus that is upper bounded by

    β / n(s, a)^p

for some constant β and p > 1/2. Then there exists some MDP M and ε_0(β, p) such that, with probability greater than δ_0 = 0.15,

    V^{A_t}_H(s_t) < V*_H(s_t) - ε_0

will hold for an unbounded number of time steps.
In other words, the BEB algorithm, and Bayesian exploration itself, are not PAC-MDP, and may in fact never find the optimal policy for some given MDP. This result is fairly intuitive: since the Bayesian algorithms are trying to maximize the reward over some finite horizon, there would be no benefit to excessive exploration if it cannot help over this horizon time.

To summarize, by considering optimality with respect to the Bayesian policy, we can obtain algorithms with lower sample complexity and greedier exploration policies than PAC-MDP approaches. Although the resulting algorithms may fail to find the optimal policy for certain MDPs, they are still close to optimal in the Bayesian sense.

4. Proofs of the Main Results

Before presenting the proofs in this section, we want to briefly describe their intuition. Due to space constraints, the proofs of the technical lemmas are deferred to the appendix, available in the full version of the paper (Kolter & Ng, 2009).

The key condition that allows us to prove that BEB quickly performs ε-close to the Bayesian policy is that at every time step, BEB is optimistic with respect to the Bayesian policy, and this optimism decays to zero given enough samples; that is, BEB acts according to an internal model that always overestimates the values of state-action pairs, but which approaches the true Bayesian value estimate at a rate of O(1/n(s, a)). The O(1/n) term itself arises from the L_1 divergence between particular Dirichlet distributions. Given these results, Theorem 1 follows by adapting standard arguments from previous PAC-MDP results. In particular, we define the "known" state-action pairs to be all those state-action pairs that have been observed more than some number of times, and use the above insights to show, similar to the PAC-MDP proofs, that V^{A_t}_H(b, s) is close to the value of acting according to the optimal Bayesian policy, assuming the probability of leaving the known state-action set is small. Finally, we use the Hoeffding bound to show that this escape probability can be large only for a polynomial number of steps.

To prove Theorem 2, we use Slud's inequality (Slud, 1977) to show that any algorithm with exploration rate O(1/n^p) for p > 1/2 may not be optimistic with respect to the optimal policy for a given MDP. The domain to consider here is a simple two-armed bandit, where one of the arms results in a random Bernoulli payoff, and the other results in a fixed known payoff with slightly lower mean value; we can show that with significant probability, any such exploration bonus algorithm may prefer the suboptimal arm at some point, resulting in a policy that is never near-optimal.

4.1. Proof of Theorem 1

We begin with a series of lemmas used in the proof. The first lemma states that if one has a sufficient number of counts for a Dirichlet distribution, then incrementing one of the counts won't change the probabilities very much. The proof just involves algebraic manipulation.

Lemma 3. Consider two Dirichlet distributions with parameters α, α' ∈ R^k. Further, suppose α_i = α'_i for all i, except α'_j = α_j + 1. Then

    Σ_i | P(x_i | α) - P(x_i | α') | ≤ 2 / α_0.
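As a quick numerical illustration of the Lemma 3 bound (not from the paper; the toy counts below are arbitrary), one can compare the mean of a Dirichlet distribution before and after incrementing a single count:

```python
import numpy as np

# Arbitrary toy counts; increment one of them and compare the Dirichlet means.
alpha = np.array([3.0, 1.0, 2.0])
alpha_prime = alpha.copy()
alpha_prime[1] += 1.0

p = alpha / alpha.sum()
p_prime = alpha_prime / alpha_prime.sum()
l1_change = np.abs(p - p_prime).sum()
print(l1_change, "<=", 2.0 / alpha.sum())   # 0.238... <= 0.333...
```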
Next we use this lemma to show that if we solve the MDP using the current mean of the belief state, with an additional exploration bonus of 2H^2 / (1 + α_0(s, a)), this will lead to a value function that is optimistic with respect to the Bayesian policy. The proof involves showing that the potential benefit of the true Bayesian policy (i.e., how much extra reward we could obtain by updating the beliefs) is upper bounded by the exploration bonus of BEB. The proof is deferred due to space constraints, but since this result is the key to proving Theorem 1, this lemma is one of the key technical results of the paper.

Lemma 4. Let Ṽ_H(b, s) be the value function used by BEB, defined as in (1), with β = 2H^2; that is, it is the optimal value function for the mean MDP of belief b, plus the additional reward term. Then for all s,

    Ṽ_H(b, s) ≥ V*_H(b, s).

Our final lemma is a trivial modification of the "induced inequality" used by previous PAC-MDP bounds, which extends this inequality to the Bayesian setting. The lemma states that if we evaluate a policy using two different rewards and belief states, (R, b) and (R', b'), where b = b' and R = R' on a known set of state-action pairs K, then following the policy π will obtain similar rewards for both belief states, provided the probability of escaping from K is small. The proof mirrors that in Strehl & Littman (2008).

Lemma 5. Let (b, R) and (b', R') be two belief states over transition probabilities and reward functions that are identical on some set of state-action pairs K (i.e., α_b(s, a, s') = α_{b'}(s, a, s') and R(s, a) = R'(s, a) for all (s, a) ∈ K). Let A_K be the event that a state-action pair not in K is generated when starting from state s and following policy π for H steps. Assuming the rewards R' are bounded in [0, R_max], then

    | V^π_H(R, b, s) - V^π_H(R', b', s) | ≤ H R_max P(A_K),

where we now make explicit the dependence of the value function on the reward.

We are now ready to prove Theorem 1.

Proof (of Theorem 1). Define R' as the reward function used by the BEB algorithm, i.e., the reward plus the exploration bonus. Let K be the set of all state-action pairs that have posterior counts α_0(s, a) ≥ m ≡ 4H^3/ε. Let R'' be a reward function equal to R on K and equal to R' elsewhere. Furthermore, let π be the policy followed by BEB at time t, i.e., the greedy policy with respect to the current belief b_t and the reward R'. Letting A_K be the event that π escapes from K when starting in s_t and acting for H steps, we have

    V^{A_t}_H(R, b_t, s_t) ≥ V^π_H(R'', b_t, s_t) - H^2 P(A_K)        (2)

by Lemma 5, where we note that we can limit the exploration bonus to H (i.e., use a bonus of min{ 2H^2 / (1 + α_0(s, a)), H }) and still maintain optimism, and by noticing that A_t equals π unless A_K occurs. In addition, note that since R'' and R' differ by at most 2H^2/m = ε/(2H) at each state,

    V^π_H(R'', b_t, s_t) ≥ V^π_H(R', b_t, s_t) - ε/2.        (3)

Finally, we consider two cases. First, suppose that P(A_K) > ε/(2H^2). By the Hoeffding inequality, with probability 1 - δ this will occur no more than

    O( m S A H^3 / ε ) = O( S A H^6 / ε^2 )

times before all the state-action pairs become known. Now suppose P(A_K) ≤ ε/(2H^2). Then

    V^{A_t}_H(R, b_t, s_t) ≥ V^π_H(R'', b_t, s_t) - H^2 P(A_K)
                           ≥ V^π_H(R'', b_t, s_t) - ε/2
                           ≥ V^π_H(R', b_t, s_t) - ε
                           = Ṽ_H(b_t, s_t) - ε
                           ≥ V*_H(b_t, s_t) - ε,

i.e., the policy is ε-optimal. In this derivation the first line follows from (2), the second line follows from our assumption that P(A_K) ≤ ε/(2H^2), the third line follows from (3), the fourth line follows from the fact that π is precisely the optimal policy for (R', b_t), and the last line follows from Lemma 4.

4.2. Proof of Theorem 2

We make use of the following inequality, due to Slud (1977), which gives a lower bound on the probability of large deviations from the mean in a binomial distribution.

Lemma 6 (Slud's inequality). Let X_1, ..., X_n be i.i.d. Bernoulli random variables with mean μ ≤ 3/4. Then

    P( μ - (1/n) Σ_{i=1}^n X_i > ε ) ≥ 1 - Φ( ε √( n / (μ(1-μ)) ) ),

where Φ(x) is the cumulative distribution function of a standard Gaussian random variable, Φ(x) = ∫_{-∞}^x (1/√(2π)) e^{-t^2/2} dt.

Using this lemma, we now prove the theorem.

Proof (of Theorem 2). As mentioned in the preceding text, the scenario we consider for this proof is a two-armed bandit: action 1 gives a Bernoulli random reward, with true payoff probability (unknown to the algorithm) of 3/4; action 2 gives a fixed known payoff of 3/4 - ε_0 (we will define ε_0 shortly). Therefore, the optimal policy is to always pick action 1. Since this is a setting with only one state (and therefore known transition dynamics) but unknown rewards, later in the proof we will transform this domain into one with unknown transitions but known rewards; however, the bandit formulation is more intuitive for the time being.

Since the reward for action 2 is known, the only exploratory action in this case is action 1, and we let n denote the number of times that the agent has chosen action 1, where at each trial it receives a reward r_i ∈ {0, 1}. Let f(n) be the exploration bonus for some algorithm attempting to learn this domain, and suppose it is upper bounded by f(n) ≤ β/n^p for some p > 1/2. Then we can lower bound the probability that the algorithm's estimate of the reward, plus its exploration bonus, is pessimistic by more than β/n^p:

    P( 3/4 - (1/n) Σ_{i=1}^n r_i - f(n) ≥ β/n^p )
      ≥ P( 3/4 - (1/n) Σ_{i=1}^n r_i ≥ 2β/n^p )
      ≥ 1 - Φ( 8β / (√3 n^{p - 1/2}) ),

where the last line follows by applying Slud's inequality. We can easily verify numerically that 1 - Φ(1) > 0.15, so for

    n ≥ ( 8β/√3 )^{2/(2p-1)}

we have that with probability greater than δ_0 = 0.15, the algorithm is pessimistic by more than β/n^p. Therefore, after this many steps, if we let

    ε_0(β, p) = β / ( (8β/√3)^{2/(2p-1)} )^p,

then with probability at least δ_0 = 0.15, action 2 will be preferred by the algorithm over action 1. Once this occurs, the algorithm will never opt to select action 1 (since action 2 is known, and already has no exploration bonus), so for any ε ≤ ε_0, the algorithm will be more than ε sub-optimal for an infinite number of steps.

Finally, we also note that we can easily transform this domain into an MDP with known rewards but unknown transitions by considering a three-state MDP with transition probabilities and rewards

    P(s_2 | s_1, a_1) = 3/4,   P(s_3 | s_1, a_1) = 1/4,   P(s_1 | s_1, a_2) = 1,   P(s_1 | s_{2:3}, a_{1:2}) = 1,
    R(s_2, a_{1:2}) = 1,   R(s_3, a_{1:2}) = 0,   R(s_1, a_2) = 3/4 - ε_0.

5. Simulated Domain

In this section we present empirical results for BEB and other algorithms on a simple chain domain from the Bayesian exploration literature (Strens, 2000; Poupart et al., 2006), shown in Figure 1. We stress that the results here are not intended as a rigorous evaluation of the different methods, since the domain is extremely small-scale. Nonetheless, the results illustrate that the characteristics suggested by the theory do manifest themselves in practice, at least in this small-scale setting.

Figure 2 shows the average total reward versus time step for several different algorithms. These results illustrate several points. First, the results show, as suggested by the theory, that BEB can outperform PAC-MDP algorithms (in this case, MBIE-EB), due to its greedier exploration strategy. Second, the value of β required by Theorem 1 is typically much larger than what is best in practice. This is a common trend for such algorithms, so for both BEB and MBIE-EB we evaluated a wide range of values for β and chose the best for each (the same evaluation strategy was used by the authors of MBIE-EB (Strehl & Littman, 2008)). Thus, while the constant factors in the theoretical results for both BEB and MBIE-EB are less important from a practical standpoint, the rates implied by these results (i.e., the 1/n vs. 1/√n exploration rates) do result in empirical differences. Finally, for this domain, the possibility that BEB converges to a sub-optimal policy is not a large concern. This is to be expected, as Theorem 2 analyzes a fairly extreme setting, and indeed implies only relatively little suboptimality, even in the worst case.

Figure 1. Simple chain domain, consisting of five states and two actions. Arrows indicate transitions and rewards for the actions, but at each time step the agent performs the opposite action as intended with probability 0.2. The agent always starts in state 1, and the horizon is H = 6.

Figure 2. Performance versus time for different algorithms on the chain domain, averaged over 500 trials and shown with 95% confidence intervals. (Curves shown: optimal policy, BEB with β = 2.5, MBIE-EB with its best β, BEB with β = 2H^2, and an exploit-only strategy; axes: average total reward versus time step.)

Figure 3. Performance versus time for BEB with different priors on the chain domain, averaged over 500 trials and shown with 95% confidence intervals. (Curves shown: a small prior, uniform priors with α_0 = |S| and α_0 = 5|S|, and informative priors with α_0 = |S| and α_0 = 5|S|; axes: average total reward versus time step.)

We also evaluated the significance of the prior distribution on BEB. In Figure 3 we show performance for BEB with a very small prior, for uniform priors of varying strength, and for informative priors consisting of the true transition probabilities. As can be seen, BEB is fairly insensitive to small priors, but can be negatively impacted by a large misspecified prior. These results are quite intuitive, as such a prior will greatly decrease the exploration bonus, while providing a poor model of the environment.
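For reference, a sketch of this chain domain as a tabular MDP follows (not code from the paper). The slip probability and horizon are as stated in the caption of Figure 1; the reward values (0 for advancing, 2 for returning to state 1, 10 for remaining in the last state) are the values commonly used for this benchmark (Strens, 2000) and should be treated as assumptions here, since the paper specifies them only graphically.

```python
import numpy as np

S, A, H, slip = 5, 2, 6, 0.2

# Reward for the action that is actually executed: advancing pays 0 except in
# the last state (which pays 10); returning to state 1 pays 2 (assumed values).
R_exec = np.zeros((S, A))
R_exec[:, 1] = 2.0
R_exec[S - 1, 0] = 10.0

P = np.zeros((S, A, S))
R = np.zeros((S, A))
for s in range(S):
    nxt = {0: min(s + 1, S - 1), 1: 0}      # deterministic effect of each executed action
    for a in range(A):
        other = 1 - a                        # with prob. 0.2 the opposite action is executed
        P[s, a, nxt[a]] += 1 - slip
        P[s, a, nxt[other]] += slip
        R[s, a] = (1 - slip) * R_exec[s, a] + slip * R_exec[s, other]
```

Running finite-horizon value iteration on (P, R) with H = 6 gives an "optimal policy" baseline of the kind plotted in Figure 2, while running BEB with a Dirichlet belief over these (hidden) transitions gives the corresponding learning curves.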

6. Conclusion

In this paper we presented a novel algorithm and mode of analysis that allows an agent acting in an MDP to perform ε-close to the (intractable) optimal Bayesian policy after a polynomial number of time steps. We bring PAC-MDP-type results to the setting of Bayesian RL, and we show that by doing so, we can both obtain lower sample complexity bounds and use exploration techniques that are greedier than those required by any PAC-MDP algorithm. Looking forward, the same mode of analysis that we use to derive the bounds in this paper, which involves bounding divergences between updates of the belief distributions, can also be applied to more structured domains, such as finite MDPs with correlated transitions or continuous-state MDPs with smooth dynamics; it will be interesting to see how the resulting algorithms perform in such domains. An alternative means of analyzing the efficiency of reinforcement learning algorithms is the notion of regret in infinite-horizon settings (Auer & Ortner, 2007), and it remains an open question whether the ideas we present here can be extended to this infinite-horizon case. Finally, very recently Asmuth et al. (2009) have independently developed an algorithm that also combines Bayesian and PAC-MDP approaches. The actual approach is quite different (they use Bayesian sampling to achieve a PAC-MDP algorithm), but it would be very interesting to compare the algorithms.

Acknowledgments

This work was supported by the DARPA Learning Locomotion program under contract number FA C. We thank the anonymous reviewers and Lihong Li for helpful comments. Zico Kolter is supported by an NSF Graduate Research Fellowship.

References

Asmuth, J., Li, L., Littman, M. L., Nouri, A., & Wingate, D. (2009). A Bayesian sampling approach to exploration in reinforcement learning. Preprint.

Auer, P., & Ortner, R. (2007). Logarithmic online regret bounds for undiscounted reinforcement learning. Neural Information Processing Systems.

Brafman, R. I., & Tennenholtz, M. (2002). R-MAX: A general polynomial time algorithm for near-optimal reinforcement learning. Journal of Machine Learning Research, 3.

Brunskill, E., Leffler, B. R., Li, L., Littman, M. L., & Roy, N. (2008). CORL: A continuous-state offset-dynamics reinforcement learner. Proceedings of the International Conference on Uncertainty in Artificial Intelligence.

Dearden, R., Friedman, N., & Andre, D. (1999). Model based Bayesian exploration. Proceedings of the International Conference on Uncertainty in Artificial Intelligence.

Fel'dbaum, A. A. (1961). Dual control theory, parts I-IV. Automation and Remote Control.

Filatov, N., & Unbehauen, H. (2004). Adaptive dual control: Theory and applications. Springer.

Gittins, J. C. (1989). Multi-armed bandit allocation indices. Wiley.

Kakade, S., Kearns, M., & Langford, J. (2003). Exploration in metric state spaces. Proceedings of the International Conference on Machine Learning.

Kakade, S. M. (2003). On the sample complexity of reinforcement learning. Doctoral dissertation, Gatsby Computational Neuroscience Unit, University College London.

Kearns, M., & Koller, D. (1999). Efficient reinforcement learning in factored MDPs. Proceedings of the International Joint Conference on Artificial Intelligence.

Kearns, M., & Singh, S. (2002). Near-optimal reinforcement learning in polynomial time. Machine Learning, 49.

Kolter, J. Z., & Ng, A. Y. (2009). Near-Bayesian exploration in polynomial time (full version). Available at kolter.

Poupart, P., Vlassis, N., Hoey, J., & Regan, K. (2006). An analytic solution to discrete Bayesian reinforcement learning. Proceedings of the International Conference on Machine Learning.

Puterman, M. L. (1994). Markov decision processes: Discrete stochastic dynamic programming. Wiley.

Slud, E. V. (1977). Distribution inequalities for the binomial law. The Annals of Probability, 5.

Strehl, A. L., Li, L., Wiewiora, E., Langford, J., & Littman, M. L. (2006). PAC model-free reinforcement learning. Proceedings of the International Conference on Machine Learning.

Strehl, A. L., & Littman, M. L. (2008). An analysis of model-based interval estimation for Markov decision processes. Journal of Computer and System Sciences, 74.

Strehl, A. L., & Littman, M. L. (2008b). Online linear regression and its application to model-based reinforcement learning. Neural Information Processing Systems.

Strens, M. J. (2000). A Bayesian framework for reinforcement learning. Proceedings of the International Conference on Machine Learning.

Wang, T., Lizotte, D., Bowling, M., & Schuurmans, D. (2005). Bayesian sparse sampling for on-line reward optimization. Proceedings of the International Conference on Machine Learning.

A. Technical Proofs

A.1. Proof of Lemma 3

Proof. Using the definition of the Dirichlet distribution,

    P(x_i | α) = ∫ P(x_i | π) p(π | α) dπ = α_i / α_0,

where α_0 ≡ Σ_i α_i. By the assumptions on α and α',

    Σ_i | P(x_i | α) - P(x_i | α') |
      = Σ_{i ≠ j} | α_i/α_0 - α_i/(α_0 + 1) | + | α_j/α_0 - (α_j + 1)/(α_0 + 1) |
      = Σ_{i ≠ j} α_i / (α_0(α_0 + 1)) + (α_0 - α_j) / (α_0(α_0 + 1))
      = 2(α_0 - α_j) / (α_0(α_0 + 1))
      ≤ 2/α_0.

A.2. Proof of Lemma 4

Proof. Consider some belief and state b and s, and let b_t be the new belief formed by updating b after taking t ≤ H steps. Then

    Ṽ_H(b, s) - V*_H(b_t, s)
      = max_a [ R(s, a) + 2H^2/(1 + α_0(s, a)) + Σ_{s'} P(s' | b, s, a) Ṽ_{H-1}(b, s') ]
        - max_a [ R(s, a) + Σ_{s'} P(s' | b_t, s, a) V*_{H-1}(b_{t+1}, s') ]
      ≥ min_a [ 2H^2/(1 + α_0(s, a)) + Σ_{s'} ( P(s' | b, s, a) Ṽ_{H-1}(b, s') - P(s' | b_t, s, a) V*_{H-1}(b_{t+1}, s') ) ]
      ≥ min_a [ 2H^2/(1 + α_0(s, a)) - (H-1) Σ_{s'} | P(s' | b, s, a) - P(s' | b_t, s, a) |
                + Σ_{s'} P(s' | b, s, a) ( Ṽ_{H-1}(b, s') - V*_{H-1}(b_{t+1}, s') ) ]
      ≥ min_a Σ_{s'} P(s' | b, s, a) ( Ṽ_{H-1}(b, s') - V*_{H-1}(b_{t+1}, s') )
      ≥ min_{s'} ( Ṽ_{H-1}(b, s') - V*_{H-1}(b_{t+1}, s') ).

The first line just substitutes the definitions of Ṽ_H and V*_H. In the second line we use the fact that max_x f(x) - max_x g(x) ≥ min_x ( f(x) - g(x) ). In the third line we use the fact that Σ_x p(x) f(x) - Σ_x q(x) g(x) ≥ Σ_x p(x) ( f(x) - g(x) ) - Σ_x | p(x) - q(x) | g(x), and note that V*_{H-1}(b, s) ≤ H - 1 for any b and s. In the fourth line we apply Lemma 3 to show that

    2H^2/(1 + α_0(s, a)) ≥ (H-1) Σ_{s'} | P(s' | b, s, a) - P(s' | b_t, s, a) |,

which lets us remove these terms. In greater detail, using the triangle inequality, Lemma 3, and the fact that t ≤ H,

    Σ_{s'} | P(s' | b_t, s, a) - P(s' | b, s, a) |
      ≤ Σ_{i=1}^t Σ_{s'} | P(s' | b_i, s, a) - P(s' | b_{i-1}, s, a) |
      ≤ Σ_{i=1}^t 2 / (α_0(s, a) + i)
      ≤ 2H / (1 + α_0(s, a)).

Since s is arbitrary in the above derivation, we have that for any t ≤ H,

    min_s ( Ṽ_H(b, s) - V*_H(b_t, s) ) ≥ min_s ( Ṽ_{H-1}(b, s) - V*_{H-1}(b_{t+1}, s) ).

Applying this equation repeatedly proves the desired lemma.

A.3. Proof of Lemma 5

Proof. Consider a sequence of beliefs, states, actions, and rewards of length t, p_t = (s_1, a_1, r_1, ..., s_t, a_t, r_t). Let P(p_t) be the probability of this sequence under belief b with reward function R when starting in state s, and let P'(p_t) be the probability of the sequence under belief b' with reward function R'. Let K_t be the set of sequences where all of (s_1, a_1), ..., (s_t, a_t) are in K. Then

    V^π_H(R', b', s) - V^π_H(R, b, s)
      = Σ_{t=1}^H Σ_{p_t} ( P'(p_t) r_t(p_t) - P(p_t) r_t(p_t) )
      = Σ_{t=1}^H [ Σ_{p_t ∈ K_t} ( P'(p_t) r_t(p_t) - P(p_t) r_t(p_t) ) + Σ_{p_t ∉ K_t} ( P'(p_t) r_t(p_t) - P(p_t) r_t(p_t) ) ]
      = Σ_{t=1}^H Σ_{p_t ∉ K_t} ( P'(p_t) r_t(p_t) - P(p_t) r_t(p_t) )
      ≤ Σ_{t=1}^H Σ_{p_t ∉ K_t} P'(p_t) r_t(p_t)
      ≤ H R_max P(A_K),

where we can eliminate the terms for p_t ∈ K_t because (R, b) and (R', b') are identical on this set, and where the last line follows since the rewards are bounded in [0, R_max].


More information

For the percentage of full time students at RCC the symbols would be:

For the percentage of full time students at RCC the symbols would be: Mth 17/171 Chpter 7- ypothesis Testing with One Smple This chpter is s simple s the previous one, except it is more interesting In this chpter we will test clims concerning the sme prmeters tht we worked

More information

Generation of Lyapunov Functions by Neural Networks

Generation of Lyapunov Functions by Neural Networks WCE 28, July 2-4, 28, London, U.K. Genertion of Lypunov Functions by Neurl Networks Nvid Noroozi, Pknoosh Krimghee, Ftemeh Sfei, nd Hmed Jvdi Abstrct Lypunov function is generlly obtined bsed on tril nd

More information

Lecture 1. Functional series. Pointwise and uniform convergence.

Lecture 1. Functional series. Pointwise and uniform convergence. 1 Introduction. Lecture 1. Functionl series. Pointwise nd uniform convergence. In this course we study mongst other things Fourier series. The Fourier series for periodic function f(x) with period 2π is

More information

Polynomial Approximations for the Natural Logarithm and Arctangent Functions. Math 230

Polynomial Approximations for the Natural Logarithm and Arctangent Functions. Math 230 Polynomil Approimtions for the Nturl Logrithm nd Arctngent Functions Mth 23 You recll from first semester clculus how one cn use the derivtive to find n eqution for the tngent line to function t given

More information

ARITHMETIC OPERATIONS. The real numbers have the following properties: a b c ab ac

ARITHMETIC OPERATIONS. The real numbers have the following properties: a b c ab ac REVIEW OF ALGEBRA Here we review the bsic rules nd procedures of lgebr tht you need to know in order to be successful in clculus. ARITHMETIC OPERATIONS The rel numbers hve the following properties: b b

More information

Chapters 4 & 5 Integrals & Applications

Chapters 4 & 5 Integrals & Applications Contents Chpters 4 & 5 Integrls & Applictions Motivtion to Chpters 4 & 5 2 Chpter 4 3 Ares nd Distnces 3. VIDEO - Ares Under Functions............................................ 3.2 VIDEO - Applictions

More information

1B40 Practical Skills

1B40 Practical Skills B40 Prcticl Skills Comining uncertinties from severl quntities error propgtion We usully encounter situtions where the result of n experiment is given in terms of two (or more) quntities. We then need

More information

different methods (left endpoint, right endpoint, midpoint, trapezoid, Simpson s).

different methods (left endpoint, right endpoint, midpoint, trapezoid, Simpson s). Mth 1A with Professor Stnkov Worksheet, Discussion #41; Wednesdy, 12/6/217 GSI nme: Roy Zho Problems 1. Write the integrl 3 dx s limit of Riemnn sums. Write it using 2 intervls using the 1 x different

More information

Discrete Mathematics and Probability Theory Spring 2013 Anant Sahai Lecture 17

Discrete Mathematics and Probability Theory Spring 2013 Anant Sahai Lecture 17 EECS 70 Discrete Mthemtics nd Proility Theory Spring 2013 Annt Shi Lecture 17 I.I.D. Rndom Vriles Estimting the is of coin Question: We wnt to estimte the proportion p of Democrts in the US popultion,

More information

Lecture 20: Numerical Integration III

Lecture 20: Numerical Integration III cs4: introduction to numericl nlysis /8/0 Lecture 0: Numericl Integrtion III Instructor: Professor Amos Ron Scribes: Mrk Cowlishw, Yunpeng Li, Nthnel Fillmore For the lst few lectures we hve discussed

More information

The practical version

The practical version Roerto s Notes on Integrl Clculus Chpter 4: Definite integrls nd the FTC Section 7 The Fundmentl Theorem of Clculus: The prcticl version Wht you need to know lredy: The theoreticl version of the FTC. Wht

More information

Riemann Integrals and the Fundamental Theorem of Calculus

Riemann Integrals and the Fundamental Theorem of Calculus Riemnn Integrls nd the Fundmentl Theorem of Clculus Jmes K. Peterson Deprtment of Biologicl Sciences nd Deprtment of Mthemticl Sciences Clemson University September 16, 2013 Outline Grphing Riemnn Sums

More information

The First Fundamental Theorem of Calculus. If f(x) is continuous on [a, b] and F (x) is any antiderivative. f(x) dx = F (b) F (a).

The First Fundamental Theorem of Calculus. If f(x) is continuous on [a, b] and F (x) is any antiderivative. f(x) dx = F (b) F (a). The Fundmentl Theorems of Clculus Mth 4, Section 0, Spring 009 We now know enough bout definite integrls to give precise formultions of the Fundmentl Theorems of Clculus. We will lso look t some bsic emples

More information

3.4 Numerical integration

3.4 Numerical integration 3.4. Numericl integrtion 63 3.4 Numericl integrtion In mny economic pplictions it is necessry to compute the definite integrl of relvlued function f with respect to "weight" function w over n intervl [,

More information

Reinforcement Learning and Policy Reuse

Reinforcement Learning and Policy Reuse Reinforcement Lerning nd Policy Reue Mnuel M. Veloo PEL Fll 206 Reding: Reinforcement Lerning: An Introduction R. Sutton nd A. Brto Probbilitic policy reue in reinforcement lerning gent Fernndo Fernndez

More information

1 Linear Least Squares

1 Linear Least Squares Lest Squres Pge 1 1 Liner Lest Squres I will try to be consistent in nottion, with n being the number of dt points, nd m < n being the number of prmeters in model function. We re interested in solving

More information

Math 270A: Numerical Linear Algebra

Math 270A: Numerical Linear Algebra Mth 70A: Numericl Liner Algebr Instructor: Michel Holst Fll Qurter 014 Homework Assignment #3 Due Give to TA t lest few dys before finl if you wnt feedbck. Exercise 3.1. (The Bsic Liner Method for Liner

More information

1.9 C 2 inner variations

1.9 C 2 inner variations 46 CHAPTER 1. INDIRECT METHODS 1.9 C 2 inner vritions So fr, we hve restricted ttention to liner vritions. These re vritions of the form vx; ǫ = ux + ǫφx where φ is in some liner perturbtion clss P, for

More information

approaches as n becomes larger and larger. Since e > 1, the graph of the natural exponential function is as below

approaches as n becomes larger and larger. Since e > 1, the graph of the natural exponential function is as below . Eponentil nd rithmic functions.1 Eponentil Functions A function of the form f() =, > 0, 1 is clled n eponentil function. Its domin is the set of ll rel f ( 1) numbers. For n eponentil function f we hve.

More information

Quantum Physics II (8.05) Fall 2013 Assignment 2

Quantum Physics II (8.05) Fall 2013 Assignment 2 Quntum Physics II (8.05) Fll 2013 Assignment 2 Msschusetts Institute of Technology Physics Deprtment Due Fridy September 20, 2013 September 13, 2013 3:00 pm Suggested Reding Continued from lst week: 1.

More information

Non-Linear & Logistic Regression

Non-Linear & Logistic Regression Non-Liner & Logistic Regression If the sttistics re boring, then you've got the wrong numbers. Edwrd R. Tufte (Sttistics Professor, Yle University) Regression Anlyses When do we use these? PART 1: find

More information