arXiv: v1 [stat.ML] 9 Aug 2016

On Lower Bounds for Regret in Reinforcement Learning

Ian Osband (Stanford University, Google DeepMind)
Benjamin Van Roy (Stanford University)

August 10, 2016

1 Introduction

This is a brief technical note to clarify the state of lower bounds on regret for reinforcement learning. In particular, this paper:

- Reproduces a lower bound on regret for reinforcement learning, similar to the result of Theorem 5 in the journal UCRL2 paper (Jaksch et al. 2010).
- Clarifies that the proposed proof of Theorem 6 in the REGAL paper (Bartlett and Tewari 2009) does not hold using the standard techniques without further work. We suggest that this result should instead be considered a conjecture as it has no rigorous proof.
- Suggests that the conjectured lower bound given by Bartlett and Tewari (2009) is incorrect and, in fact, it is possible to improve the scaling of the upper bound to match the weaker lower bounds presented in this paper.

2 Problem formulation

We consider the problem of learning to optimize an unknown MDP $M^* = (\mathcal{S}, \mathcal{A}, R^*, P^*)$. $\mathcal{S} = \{1, .., S\}$ is the state space and $\mathcal{A} = \{1, .., A\}$ is the action space. In each timestep $t = 1, 2, ..$ the agent observes a state $s_t \in \mathcal{S}$, selects an action $a_t \in \mathcal{A}$, receives a reward $r_t \sim R^*(s_t, a_t) \in [0, 1]$ and transitions to a new state $s_{t+1} \sim P^*(s_t, a_t)$. We define all random variables with respect to a probability space $(\Omega, \mathcal{F}, \mathbb{P})$.

A policy $\mu$ is a mapping from state $s \in \mathcal{S}$ to action $a \in \mathcal{A}$. For MDP $M$ and any policy $\mu$ we define the long run average reward starting from state $s$:

$$\lambda^M_\mu(s) := \lim_{T \to \infty} \mathbb{E}_{M,\mu}\left[ \frac{1}{T} \sum_{t=1}^{T} \bar{r}^M(s_t, a_t) \,\Big|\, s_1 = s \right], \tag{1}$$

where $\bar{r}^M(s, a) := \mathbb{E}[r \mid r \sim R^M(s, a)]$. The subscripts $M, \mu$ indicate that the MDP evolves under $M$ with policy $\mu$. A policy $\mu^M$ is optimal for the MDP $M$ if $\mu^M \in \arg\max_\mu \lambda^M_\mu(s)$ for all $s \in \mathcal{S}$. For the unknown MDP $M^*$ we will often abbreviate sub/superscripts to simply $*$, for example $\lambda^*$ for $\lambda^{M^*}_{\mu^{M^*}}$.

Let $H_t = (s_1, a_1, r_1, .., s_{t-1}, a_{t-1}, r_{t-1})$ denote the history of observations made prior to time $t$. A reinforcement learning algorithm is a deterministic sequence $\{\pi_t \mid t = 1, 2, ..\}$ of functions, each mapping $H_t$ to a probability distribution $\pi_t(H_t)$ over policies, from which the agent samples a policy $\mu_t$ at timestep $t$. We define the regret of a reinforcement learning algorithm $\pi$ up to time $T$

$$\mathrm{Regret}(T, \pi, M^*, s) := \sum_{t=1}^{T} \{\lambda^*(s) - r_t\} \,\Big|\, s_1 = s. \tag{2}$$

The regret of a learning algorithm measures how much worse the policy performs than the optimal policy in terms of cumulative rewards. Any algorithm with $o(T)$ regret will eventually learn the optimal policy. Note that the regret is random since it depends on the unknown MDP $M^*$, the random sampling of policies and, through the history $H_t$, on the previous transitions and rewards. We will assess and compare algorithm performance in terms of the regret.

2.1 Finite horizon MDPs

We now spend a little time relating the formulation above to so-called finite horizon MDPs (Osband et al. 2013; Dann and Brunskill 2015). In this setting, an agent interacts repeatedly with an environment over $H \in \mathbb{N}$ timesteps, which we call an episode. A finite horizon MDP $M^* = (\mathcal{S}, \mathcal{A}, R^*, P^*, H, \rho)$ is defined as above, but every $H$ timesteps the state resets according to some initial distribution $\rho$. We call $H$ the horizon of the MDP.

In a finite horizon MDP a typical policy may depend on both the state $s \in \mathcal{S}$ and the timestep $h$ within the episode. To be explicit, we define a policy $\mu$ as a mapping from state $s \in \mathcal{S}$ and period $h = 1, .., H$ to action $a \in \mathcal{A}$. For each MDP $M = (\mathcal{S}, \mathcal{A}, R^M, P^M, H, \rho)$ and policy $\mu$ we define the state-action value function for each period $h$:

$$Q^M_{\mu,h}(s, a) := \mathbb{E}_{M,\mu}\left[ \sum_{j=h}^{H} \bar{r}^M(s_j, a_j) \,\Big|\, s_h = s, a_h = a \right], \tag{3}$$

and $V^M_{\mu,h}(s) := Q^M_{\mu,h}(s, \mu(s, h))$. Once again, we say a policy $\mu^M$ is optimal for the MDP $M$ if $\mu^M \in \arg\max_\mu V^M_{\mu,h}(s)$ for all $s \in \mathcal{S}$ and $h = 1, ..., H$.

At first glance this might seem at odds with the formulation in Section 2. However, finite horizon MDPs can be thought of as a special case of Section 2 in the expanded state space $\tilde{\mathcal{S}} := \mathcal{S} \times \{1, .., H\}$. In this case it is typical to assume that the agent knows the deterministic evolution of the time $h$ a priori. To highlight this time evolution within episodes, with some abuse of notation, we let $s_{kh} = s_t$ for $t = (k-1)H + h$, so that $s_{kh}$ is the state in period $h$ of episode $k$. We define $H_{kh}$ analogously.

3 Multi-armed bandit

We call the degenerate MDP with only one state, $S = 1$, a multi-armed bandit with independent arms (Lai and Robbins). In this setting the actions $a_t \in \mathcal{A}$ are often called arms and the optimal average reward is simply the mean reward of the best arm, $\lambda^* = \max_a \bar{r}(a)$.
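As an illustration of how the regret definition (2) specialises to a bandit, the following is a minimal Python sketch. The function names `expected_regret` and `simulate_uniform_agent`, and the choice of a uniformly random agent, are assumptions made for this example and are not part of the note.

```python
import random


def expected_regret(mean_rewards, pull_counts):
    """Expected regret when arm a is pulled pull_counts[a] times.

    With a single state the optimal average reward is max(mean_rewards),
    so each pull of arm a costs max(mean_rewards) - mean_rewards[a]
    in expectation.
    """
    best = max(mean_rewards)
    return sum(n * (best - r) for r, n in zip(mean_rewards, pull_counts))


def simulate_uniform_agent(mean_rewards, horizon, seed=0):
    """Realised regret of a hypothetical agent that picks arms uniformly
    at random and receives Bernoulli rewards with the given means."""
    rng = random.Random(seed)
    best = max(mean_rewards)
    total_reward = 0
    for _ in range(horizon):
        arm = rng.randrange(len(mean_rewards))
        total_reward += 1 if rng.random() < mean_rewards[arm] else 0
    # Regret as in (2): sum over t of (lambda* - r_t).
    return best * horizon - total_reward


if __name__ == "__main__":
    means = [0.5, 0.5, 0.6]
    print(expected_regret(means, [10, 10, 10]))
    print(simulate_uniform_agent(means, horizon=1000))
```

The realised regret from the simulation is random, as noted after (2), while `expected_regret` depends only on how often each arm is pulled.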

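We now reproduce a lower bound on regret for any learning algorithm in a multi-armed bandit (Bubeck and Cesa-Bianchi 2012).

Theorem 1 (Lower bound on regret in bandits). Let $\sup$ be the supremum over all distributions of rewards such that, for each $a = 1, .., A$, the rewards $r^a_1, .., r^a_T \in \{0, 1\}$ are i.i.d., and let $\inf$ be the infimum over all reinforcement learning algorithms. Then, for a universal constant $c > 0$,

$$\inf \sup \left\{ \max_a \bar{r}(a)\, T - \mathbb{E}\left[ \sum_{t=1}^{T} r_t \right] \right\} \ge c \sqrt{AT}.$$

At a high level Theorem 1 says that no matter what learning algorithm you choose, there will always be some environment which gives your algorithm $\Omega(\sqrt{AT})$ regret. This is a powerful result, since it means that if we can design an algorithm with upper bounds on regret $O(\sqrt{AT})$ then this algorithm is in some sense near-optimal (Bubeck and Cesa-Bianchi 2012).

The intuition for the proof is relatively simple and is presented in Bubeck and Cesa-Bianchi (2012). After any $T$ timesteps there must be some arm which has been pulled at most $T/A$ times. Standard concentration results state that the estimate of a random variable can only be accurate up to $O(1/\sqrt{n})$, where $n$ is the number of observations. Therefore, for the arm with $n \le T/A$ it is difficult to distinguish between $\mathrm{Ber}(1/2)$ and $\mathrm{Ber}(1/2 + \sqrt{A/T})$. This means that, if every arm is $\mathrm{Ber}(1/2)$ but one is $\mathrm{Ber}(1/2 + \sqrt{A/T})$, any algorithm would incur regret of order $T\sqrt{A/T} = \sqrt{AT}$. In the next section we will see how to make this argument more rigorous.

3.1 Proof of Theorem 1

We consider the problem where all arms are i.i.d. Bernoulli with parameter $\delta$, but one arm $a^*$ has parameter $\delta + \epsilon$ for some $\delta, \epsilon > 0$. We define an auxiliary $\tilde{r}^a_t = r^a_t$ for all $a \ne a^*$, but with the rewards of the action $a = a^*$ replaced by the draw $\tilde{r}^{a^*}_t \sim \mathrm{Ber}(\delta)$. We consider an auxiliary sequence of actions $\tilde{a}_t \sim \pi_t(\tilde{H}_t)$ for $\tilde{H}_t = (\tilde{a}_1, \tilde{r}_1, .., \tilde{a}_{t-1}, \tilde{r}_{t-1})$, the history generated by an agent with no feedback informing them about $a^*$. We introduce the notation $n_a := |\{a_t = a \mid t = 1, .., T\}|$ and $\tilde{n}_a := |\{\tilde{a}_t = a \mid t = 1, .., T\}|$ to denote the number of times arm $a$ has been selected by time $T$ under $a_t$ and $\tilde{a}_t$ respectively. The following lemma establishes a lower bound on the regret realized by the actions $\tilde{a}_t$.

Lemma 1 (Regret of an uninformed agent). For all $\delta, \epsilon > 0$ and all learning algorithms $\pi$,

$$\max_a \bar{r}(a)\, T - \mathbb{E}\left[ \sum_{t=1}^{T} r^{\tilde{a}_t}_t \right] \ge \frac{A-1}{A}\, \epsilon T.$$

Proof. We have

$$\max_a \bar{r}(a)\, T - \mathbb{E}\left[ \sum_{t=1}^{T} r^{\tilde{a}_t}_t \right] = \mathbb{E}\left[ T - \tilde{n}_{a^*} \right] \epsilon = \epsilon \left( T - \mathbb{E}[\tilde{n}_{a^*}] \right) = \epsilon T \left( 1 - \frac{1}{A} \right), \tag{5}$$

where the last step follows from a symmetry argument, since $a^*$ is independent of $\tilde{n}_a$ for all actions $a$.

We now establish that, if $\epsilon$ is sufficiently small, then over a limited time horizon the distribution of $r^{a_t}_t$ cannot be significantly different from that of the outcomes $\tilde{r}^{\tilde{a}_t}_t$. We compare the conditional distribution $P$ over the choice of actions with the distribution $\tilde{P}$ over the choice of actions which would have arisen under the uninformative data $\tilde{H}_t$. To be more precise, we define $P(z_t \mid H_t) := \mathbb{P}(r_t = z_t \mid H_t)$ and $\tilde{P}(z_t \mid \tilde{H}_t) := \mathbb{P}(\tilde{r}_t = z_t \mid \tilde{H}_t)$. We write $r^T_t := (r^{a_t}_t, .., r^{a_T}_T)$ for the sequence of rewards from time $t$ to $T$, and similarly for $\tilde{r}^T_t$. To quantify the difference between two distributions we employ the following notion of KL divergence:

$$d_{KL}\left( \tilde{P}(z_t \mid \tilde{H}_t),\, P(z_t \mid H_t) \right) = \mathbb{E}_{\tilde{P}}\left[ \log \frac{\tilde{P}(z_t \mid \tilde{H}_t)}{P(z_t \mid H_t)} \right]. \tag{6}$$

Lemma 2 (KL divergence of uninformed distribution). For all $\delta, \epsilon > 0$ and all learning algorithms $\pi$,

$$d_{KL}\left( \tilde{P}(z^T_1),\, P(z^T_1) \right) = \frac{T}{A} \left( \delta \log \frac{\delta}{\delta + \epsilon} + (1 - \delta) \log \frac{1 - \delta}{1 - \delta - \epsilon} \right).$$

Proof. We can apply the chain rule of KL divergence (Bubeck and Cesa-Bianchi 2012) to obtain

$$d_{KL}\left( \tilde{P}(z^T_1),\, P(z^T_1) \right) = \sum_{t=1}^{T} d_{KL}\left( \tilde{P}(z_t \mid \tilde{H}_t),\, P(z_t \mid H_t) \right).$$

The two conditional distributions differ only when the distinguished arm is selected, so it follows that

$$d_{KL}\left( \tilde{P}(z^T_1),\, P(z^T_1) \right) = \sum_{t=1}^{T} \mathbb{P}(\tilde{a}_t = a^*) \left( \delta \log \frac{\delta}{\delta + \epsilon} + (1 - \delta) \log \frac{1 - \delta}{1 - \delta - \epsilon} \right).$$

We conclude the proof by noting that the actions $\tilde{a}_t$ are selected independently of $a^*$, together with a symmetry argument.

We now use Pinsker's inequality to show that, if the distribution of actions $P$ is close to the distribution of actions $\tilde{P}$ under uninformative data, then the resulting regret is close to the regret of the uninformative policy.

Lemma 3 (Regret bound in terms of KL divergence). For all $\delta, \epsilon > 0$ and all learning algorithms $\pi$,

$$\max_a \bar{r}(a)\, T - \mathbb{E}\left[ \sum_{t=1}^{T} r^{a_t}_t \right] \ge \epsilon T \left( 1 - \frac{1}{A} - \sqrt{\tfrac{1}{2}\, d_{KL}\left( \tilde{P}(z^T_1),\, P(z^T_1) \right)} \right).$$

Proof. Pinsker's inequality gives us

$$\mathbb{E}[n_{a^*}] - \mathbb{E}[\tilde{n}_{a^*}] \le T \sqrt{\tfrac{1}{2}\, d_{KL}\left( \tilde{P}(z^T_1),\, P(z^T_1) \right)}.$$

Since $\mathbb{E}[\tilde{n}_{a^*}] = T/A$, it follows that

$$\mathbb{E}[n_{a^*}] \le \frac{T}{A} + T \sqrt{\tfrac{1}{2}\, d_{KL}\left( \tilde{P}(z^T_1),\, P(z^T_1) \right)}.$$

We complete the proof through simple substitution into the regret decomposition $\epsilon \left( T - \mathbb{E}[n_{a^*}] \right)$, as in the proof of Lemma 1.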
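The quantities in Lemma 2 and Lemma 3 can be evaluated numerically. The sketch below is illustrative only: the function names are hypothetical, logarithms are taken to be natural (as Pinsker's inequality in the form used above requires), and the per-step divergence is taken to be that of Ber(δ) against Ber(δ + ε).

```python
import math


def bernoulli_kl(p, q):
    """KL divergence of Ber(p) from Ber(q) in nats, for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def lemma3_lower_bound(delta, eps, num_arms, horizon):
    """Evaluate eps * T * (1 - 1/A - sqrt(d_KL / 2)), with the total
    divergence d_KL = (T / A) * KL(Ber(delta) || Ber(delta + eps)).

    The value is only informative when the bracket is positive,
    i.e. when eps is small relative to sqrt(A / T).
    """
    total_kl = (horizon / num_arms) * bernoulli_kl(delta, delta + eps)
    slack = 1 - 1 / num_arms - math.sqrt(0.5 * total_kl)
    return eps * horizon * slack


if __name__ == "__main__":
    delta, num_arms, horizon = 0.25, 10, 1000
    # Illustrative choice of eps on the scale sqrt(A / T).
    eps = math.sqrt(delta * num_arms / (8 * horizon))
    print(lemma3_lower_bound(delta, eps, num_arms, horizon))
```

With ε on the scale of $\sqrt{A/T}$ the bracket stays bounded away from zero, which is what turns the bound into one of order $\sqrt{AT}$.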

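To complete the proof of Theorem 1 we can use Lemma 20 from Jaksch et al. (2010).

Proposition 1 (Bound on the KL divergence). For any $0 \le \delta \le \frac{1}{2}$ and $\epsilon \le 1 - 2\delta$ we have

$$\delta \log_2 \frac{\delta}{\delta + \epsilon} + (1 - \delta) \log_2 \frac{1 - \delta}{1 - \delta - \epsilon} \le \frac{\epsilon^2}{\delta \log 2}.$$

Since $\log_2 x = \log x / \log 2$, the same expression with natural logarithms is at most $\epsilon^2 / \delta$. We combine Proposition 1 with Lemma 2 and Lemma 3 to say, for all $\epsilon \le 1 - 2\delta$,

$$\max_a \bar{r}(a)\, T - \mathbb{E}\left[ \sum_{t=1}^{T} r^{a_t}_t \right] \ge \epsilon T \left( 1 - \frac{1}{A} - \epsilon \sqrt{\frac{T}{2 \delta A}} \right) = \sqrt{\frac{\delta A T}{8}} \left( \frac{3}{4} - \frac{1}{A} \right),$$

where the final equality follows by setting $\epsilon^2 = \frac{\delta A}{8T}$. We can choose $\delta = 0.25$ to complete the proof of Theorem 1. We note that better constants are available through a more careful analysis, but this is not our focus in this work.

4 Reinforcement learning

In this section we work to extend the lower bound arguments from bandits to reinforcement learning with $S \ge 2$. As is common in the literature, we begin with a simple two-state MDP with known rewards and unknown transitions (Jaksch et al. 2010; Bartlett and Tewari 2009; Dann and Brunskill 2015). It is relatively straightforward to extend this flavour of result to MDPs with $S > 2$ simply by concatenating $S/2$ copies of these smaller systems.

State 0 gives a reward of 0 and state 1 gives a reward of 1. All actions from state 0 follow the same law $P^*(0, a) = (1 - \delta_0, \delta_0)$. In state 1, $P^*(1, a) = (\delta_1, 1 - \delta_1)$ for all actions apart from the distinguished action $a^*$, for which $P^*(1, a^*) = (\delta_1 - \epsilon, 1 - \delta_1 + \epsilon)$. For this simple MDP we distinguish policies in terms of their action in state $s = 1$, since this is the only action which can influence the evolution of the MDP.

Figure 1: A two state MDP which is hard to learn. Dotted lines distinguish the unique optimal policy.

We define $\theta_1 := \frac{\delta_0}{\delta_0 + \delta_1}$ to be the average expected reward under a policy that selects a suboptimal action $a \ne a^*$ in state 1. For convenience we write $\delta^*_1 := \delta_1 - \epsilon$ for the distinguished optimal action and correspondingly $\theta^*_1 := \frac{\delta_0}{\delta_0 + \delta^*_1}$ for the average expected reward under the optimal policy.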
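The closed form for the average reward of the two-state MDP can be checked against the chain itself. The following sketch iterates the state distribution and compares it with $\delta_0 / (\delta_0 + \delta_1)$; the function names and the parameter values are assumptions chosen for illustration.

```python
def stationary_reward(delta0, delta1):
    """Closed-form long-run fraction of time spent in state 1 (reward 1)."""
    return delta0 / (delta0 + delta1)


def stationary_reward_by_iteration(delta0, delta1, steps=10_000):
    """Iterate the probability of being in state 1, starting from state 0.

    From state 0 the chain moves to state 1 with probability delta0;
    from state 1 it stays in state 1 with probability 1 - delta1.
    """
    p1 = 0.0
    for _ in range(steps):
        p1 = (1 - p1) * delta0 + p1 * (1 - delta1)
    return p1


if __name__ == "__main__":
    delta0, delta1, eps = 0.1, 0.2, 0.05
    theta = stationary_reward(delta0, delta1)             # suboptimal action
    theta_star = stationary_reward(delta0, delta1 - eps)  # distinguished action
    print(theta, stationary_reward_by_iteration(delta0, delta1))
    print(theta_star - theta)
```

Because state 1 is the only rewarding state, the stationary probability of state 1 is exactly the average reward, and lowering the exit probability from $\delta_1$ to $\delta_1 - \epsilon$ raises it.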

4.1 Sketch at REGAL-style lower bounds

In this section we present a quick overview of the style of argument that attempts to solidify the lower bound of Theorem 6 in Bartlett and Tewari (2009). We assume that $\delta_0 \ge \delta_1$ to bound the difference in optimal value,

$$\theta^*_1 - \theta_1 = \frac{\delta_0}{\delta_0 + \delta_1 - \epsilon} - \frac{\delta_0}{\delta_0 + \delta_1} = \frac{\delta_0 \epsilon}{(\delta_0 + \delta_1)(\delta_0 + \delta_1 - \epsilon)} > \frac{\delta_0 \epsilon}{(\delta_0 + \delta_1)^2} \ge \frac{\delta_0 \epsilon}{(2\delta_0)^2} = \frac{\epsilon}{4\delta_0}. \tag{7}$$

Broadly speaking, this indicates that the agent should obtain expected regret $\Omega(\epsilon / \delta_0)$ every timestep it selects an action $a_t \ne a^*$ whilst in state $s = 1$. All other actions in any other state produce zero regret. We now note that the problem described by Figure 1 is quite similar to the bandit example from Section 3. The difference here is that actions of the suboptimal arm give expected regret $O(\epsilon / \delta_0)$, rather than $\epsilon$.

The arguments we present in this section can be thought of as an attempt to make the sketch proof for Theorem 6 of Bartlett and Tewari (2009) more explicit, if not entirely rigorous. Our arguments follow the same structure as Section 3: we consider an auxiliary MDP where the optimal action $a^*$ has been replaced by another action with the same transition dynamics as the suboptimal actions. We write $\tilde{a}_t$ for the actions which are taken by this uninformed policy and $\tilde{H}_t$ for the uninformative history that it generates. We begin with a result of similar flavour to Lemma 1.

Lemma 4 (Regret of an uninformed agent). In the environment of Figure 1, for all $\delta, \epsilon > 0$ and all learning algorithms $\pi$,

$$\theta^*_1 T - \mathbb{E}\left[ \sum_{t=1}^{T} r^{\tilde{a}_t}_t \right] \ge \theta_1 \frac{\epsilon}{4\delta_0} \cdot \frac{A-1}{A}\, T.$$

Proof. We note that the uninformed agent can only incur regret when it makes a sub-optimal decision, which is only possible in state $s = 1$. The proportion of the time the agent spends in state $s = 1$ is lower bounded by $\theta_1$. The regret for any sub-optimal decision while in state $s = 1$ is at least $\frac{\epsilon}{4\delta_0}$ by (7). We follow the arguments from Lemma 1 to obtain our desired result.

We now note that the problem of learning a 2-state transition function is equivalent to estimating a Bernoulli reward. Therefore, we can use Lemma 4 in place of Lemma 1 and repeat a similar argument to the proof of Theorem 1 for multi-armed bandits.
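The value-gap bound (7) is easy to sanity-check numerically. The sketch below compares the exact gap $\theta^*_1 - \theta_1$ with $\epsilon / (4\delta_0)$ on a small grid; it assumes $\delta_1 \le \delta_0$ and $0 < \epsilon < \delta_1$, and the function names and grid values are chosen purely for illustration.

```python
def value_gap(delta0, delta1, eps):
    """Exact gap theta*_1 - theta_1 between the optimal and suboptimal
    average rewards in the two-state MDP."""
    theta = delta0 / (delta0 + delta1)
    theta_star = delta0 / (delta0 + delta1 - eps)
    return theta_star - theta


def gap_lower_bound(delta0, eps):
    """Right-hand side of (7): eps / (4 * delta0)."""
    return eps / (4 * delta0)


if __name__ == "__main__":
    for delta0 in (0.05, 0.1, 0.3):
        for ratio in (0.25, 0.5, 1.0):   # delta1 = ratio * delta0 <= delta0
            delta1 = ratio * delta0
            eps = delta1 / 2             # keeps delta1 - eps positive
            gap = value_gap(delta0, delta1, eps)
            bound = gap_lower_bound(delta0, eps)
            print(delta0, delta1, eps, gap, bound, gap > bound)
```

The gap grows like $\epsilon / \delta_0$ as $\delta_0$ shrinks, which is the source of the $D_{ow} = 1/\delta_0$ factor that the REGAL-style argument tries to exploit.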
At a high level we can bound the regret of any agent in terms of the deviation in KL divergence from the distribution of the uninformed agent. For $\epsilon$ small, and over a short enough time window, the distribution of actions chosen by the learning algorithm cannot differ significantly from the actions chosen from the uninformative system. As such, using Pinsker's inequality, the resulting regret from any learning algorithm cannot differ significantly from that of the uninformed algorithm. To make this argument explicit, we use Lemma 2 and Lemma 3 together with Proposition 1 and optimize the resulting bound over $\epsilon$. That is to say, for any learning algorithm $\pi$ and all $\epsilon \le 1 - 2\delta_1$,

$$\theta^*_1 T - \mathbb{E}\left[ \sum_{t=1}^{T} r_t \right] \ge \frac{\epsilon \theta_1}{4\delta_0}\, T \left( 1 - \frac{1}{A} - \sqrt{\tfrac{1}{2}\, d_{KL}\left( \tilde{P}(z^T_1),\, P(z^T_1) \right)} \right) \ge \frac{\epsilon \theta_1}{4\delta_0}\, T \left( 1 - \frac{1}{A} - \epsilon \sqrt{\frac{\theta_1 T}{2 \delta_1 A}} \right) = \frac{1}{4} \left( \frac{3}{4} - \frac{1}{A} \right) \sqrt{\frac{\delta_1 \theta_1 A T}{8 \delta_0^2}}, \tag{8}$$

where the final equality follows by setting $\epsilon^2 = \frac{\delta_1 A}{8 \theta_1 T}$.

Now, we are left with a problem to complete the argument for Theorem 6 from REGAL. We introduce the notation $T^M_\mu(s, s')$ for the expected number of timesteps to get from state $s$ to $s'$ in MDP $M$ under policy $\mu$. The one-way diameter of an MDP is defined as

$$D_{ow}(M) := \max_s \min_\mu T^M_\mu(s, \bar{s}), \quad \text{where } \bar{s} \text{ is any state with optimal value bias.} \tag{9}$$

The claim in Theorem 6 of REGAL is that, for any learning algorithm $\pi$, there exists an MDP $M$ such that $\mathrm{Regret}(T, \pi, M) \ge c_0 D_{ow} \sqrt{SAT}$ for some $c_0 > 0$. From the construction of the MDP in Figure 1 it is clear that $D_{ow} = \frac{1}{\delta_0}$, since the only state with optimal value bias is $s = 1$ and the expected time from $s = 0$ to $s = 1$ is $\frac{1}{\delta_0}$. We now examine the behaviour of the remaining free parameters in (8) using the definition $\theta_1 = \delta_0 / (\delta_0 + \delta_1)$:

$$\sqrt{\frac{\delta_1 \theta_1}{\delta_0^2}} = D_{ow} \sqrt{\delta_1 \theta_1} = D_{ow} \sqrt{\frac{\delta_1 / D_{ow}}{\delta_1 + 1/D_{ow}}} = \sqrt{D_{ow}} \sqrt{\frac{\delta_1 D_{ow}}{\delta_1 D_{ow} + 1}} = O\left( \sqrt{D_{ow}} \right) \quad \text{for any choice of } \delta_1 > 0.$$

This completes the demonstration that the standard proof techniques for lower bounds do not address the problems in the proof of REGAL Theorem 6. In fact, we are only able to establish a lower bound $\Omega(\sqrt{D_{ow} SAT})$ and not $\Omega(D_{ow} \sqrt{SAT})$ as Bartlett and Tewari (2009) had claimed. Further, these bounds are actually weaker than the established results in Jaksch et al. (2010), $\Omega(\sqrt{DSAT})$, where $D(M) := \max_{s, s'} \min_\mu T^M_\mu(s, s') \ge D_{ow}$ is the diameter of the MDP.

4.2 Where do the lower bounds lie?

The arguments in Section 4.1 show that existing machinery is not sufficient to establish a proof of Theorem 6 in Bartlett and Tewari (2009). In light of this we suggest that this published result be considered a conjecture, rather than an established theorem. In this note we present an alternative conjecture, that the results of Theorem 6 in Bartlett and Tewari (2009) are not correct. The spirit of this conjecture is similar to Conjecture 1 of Osband and Van Roy (2016), given for finite horizon MDPs.

Conjecture 1 (Tight lower bounds for regret). The lower bounds of Jaksch et al. (2010), $\Omega(\sqrt{DSAT})$, are unimprovable in the sense that there exists some learning algorithm $\pi$ such that, for any MDP $M$ and any $\delta > 0$,

$$\mathrm{Regret}(T, \pi, M) = \tilde{O}\left( \sqrt{DSAT} \right), \tag{10}$$

with probability at least $1 - \delta$.

4.2.1 What is wrong with the REGAL lower bound?

In order for Conjecture 1 to be true, the sketched proof in Bartlett and Tewari (2009) must be false. Although the arguments of Section 4.1 show that this proof is not yet rigorous, they do not pinpoint any step of the appealing sketched argument which is incorrect. However, we now present an intuitive argument for what may be going wrong in the sketched proof:

- For every timestep $t$ in state $s = 1$ the worst possible decision the agent could make contributes regret $O(D_{ow})$ in terms of the value. The proposed sketch proof argues that the agent effectively incurs this regret every timestep until it learns the optimal arm.
- If we measure regret in terms of the actual shortfall, the instantaneous regret $\lambda^* - r_t$ must be bounded by $O(1)$ per timestep. The bad decisions in state $s = 1$ are only worth $O(D_{ow})$ in value because they might lead to $O(D_{ow})$ of these $O(1)$ instantaneous regret steps occurring in a row.
- Alternatively, we might think of regret in terms of the future value $O(D_{ow})$ which a bad decision at $s = 1$ may be worth; this is the argument that REGAL uses (Bartlett and Tewari 2009). However, if we do this then the bad decision must be followed by $O(D_{ow})$ timesteps in which we count no additional regret.
- At the moment, the argument for Theorem 6 in Bartlett and Tewari (2009) is doing a type of double-counting for regret. It assigns the maximum $O(D_{ow}(M))$ regret in terms of value at each timestep.
However, this analysis ignores that for every one of these bad actions there will be $O(D_{ow}(M))$ periods of time within $s = 0$ where, in terms of the value shortfall, these actions incur no further regret beyond what has been counted already.

4.2.2 Comparison to existing tight PAC bounds

Another piece of tangentially supporting evidence for Conjecture 1 comes from the recent PAC analysis for finite horizon MDPs (Dann and Brunskill 2015). The problem formulation given by this paper differs from Bartlett and Tewari (2009) in several ways, but they produce an algorithm LUCFH which matches upper and lower bounds for the horizon $H$ in finite horizon MDPs. In finite horizon MDPs, the horizon $H$ is an upper bound on $D_{ow}$. A similar flavour of result is available in discounted MDPs (Lattimore and Hutter 2012), where the horizon $H$ is replaced with an equivalent timeframe $H = \tilde{O}\left( \frac{1}{1 - \gamma} \right)$.

The analysis for LUCFH in finite horizon MDPs implies that the number of episodes required for $\epsilon$-optimal episodes is $\Theta\left( \frac{H^2}{\epsilon^2} \right)$, where we view all variables other than $H$ and $\epsilon$ as fixed. According to their definition, this would imply $\Theta\left( \frac{H^3}{\epsilon^2} \right)$ timesteps until $\epsilon$-optimal episodes, which is roughly equivalent to $\Theta\left( \frac{H}{\epsilon^2} \right)$ timesteps until $\epsilon$-optimal timesteps, since an $\epsilon$-optimal timestep corresponds to an $(\epsilon H)$-optimal episode.

At a high level the algorithm and analysis from Dann and Brunskill (2015) leverage the sort of phenomenon we describe in Section 4.2.1. This essential argument is refined and made more rigorous through the Bellman equation for local variance, first used in Lattimore and Hutter (2012). It is not generally possible to go from PAC bounds to regret guarantees. However, the spirit of previous analyses and comparable results suggests that the tight bounds of $\Theta\left( \frac{H}{\epsilon^2} \right)$ timesteps until $\epsilon$-optimal timesteps are indicative of a tight regret scaling $\Theta(\sqrt{HT})$.

5 Conclusion

This technical note aims to clarify the current state of lower bounds for regret in reinforcement learning. We reproduce a clear step-by-step argument for the lower bound on regret given in Bartlett and Tewari (2009). We show that, using standard machinery, this leads to a provable lower bound $\Omega(\sqrt{D_{ow} SAT})$, and that there is currently no proof available for the bound $\Omega(D_{ow} \sqrt{SAT})$ conjectured in that earlier work. To stimulate thinking on this topic, we present Conjecture 1, that the lower bound $\Omega(\sqrt{D_{ow} SAT})$ is in fact unimprovable. Definitively proving these results one way or another is an exciting area for future research.

Acknowledgements

We would like to thank the authors of Bartlett and Tewari (2009) for their help and dialogue in the discussion of these delicate technical issues. We would also like to thank Daniel Russo for the many hours of discussion and analysis spent in the office on issues like these.

References

Peter L. Bartlett and Ambuj Tewari. REGAL: A regularization based algorithm for reinforcement learning in weakly communicating MDPs. In Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence (UAI2009), pages 35-42, June 2009.

Sébastien Bubeck and Nicolò Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. CoRR, abs/, 2012.

Christoph Dann and Emma Brunskill. Sample complexity of episodic fixed-horizon reinforcement learning. In Advances in Neural Information Processing Systems, page TBA, 2015.

Thomas Jaksch, Ronald Ortner, and Peter Auer. Near-optimal regret bounds for reinforcement learning. Journal of Machine Learning Research, 11, 2010.

Tze Leung Lai and Herbert Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22.

Tor Lattimore and Marcus Hutter. PAC bounds for discounted MDPs. In Algorithmic Learning Theory. Springer, 2012.

Ian Osband and Benjamin Van Roy. Why is posterior sampling better than optimism for reinforcement learning. arXiv preprint, 2016.

Ian Osband, Daniel Russo, and Benjamin Van Roy. (More) efficient reinforcement learning via posterior sampling. In NIPS. Curran Associates, Inc., 2013.
