Online Markov Decision Processes under Bandit Feedback


Gergely Neu, András György, Csaba Szepesvári, András Antos

Abstract: We consider online learning in finite stochastic Markovian environments where in each time step a new reward function is chosen by an oblivious adversary. The goal of the learning agent is to compete with the best stationary policy in hindsight in terms of the total reward received. Specifically, in each time step the agent observes the current state and the reward associated with the last transition; however, the agent does not observe the rewards associated with other state-action pairs. The agent is assumed to know the transition probabilities. The state-of-the-art result for this setting is an algorithm with an expected regret of O(T^{2/3} ln T). In this paper, assuming that stationary policies mix uniformly fast, we show that after T time steps, the expected regret of this algorithm (more precisely, a slightly modified version thereof) is O(T^{1/2} ln T), giving the first rigorously proven, essentially tight regret bound for the problem.

I. INTRODUCTION

In this paper we consider online learning in finite stochastic Markovian environments where in each time step a new reward function may be chosen by an oblivious adversary. The interaction between the learner and the environment is shown in Figure 1. The environment is split into two parts: one part has controlled Markovian dynamics, while the other one has unrestricted, uncontrolled autonomous dynamics. In each discrete time step t, the learning agent receives the state of the Markovian environment x_t ∈ X and some information y_t ∈ Y about the previous state of the autonomous dynamics. The learner then makes a decision about the next action a_t ∈ A, which is sent to the environment. In response, the environment makes a transition: the next state x_{t+1} of the Markovian part is drawn from the transition probability kernel P(·|x_t, a_t), while the other part makes a transition in an autonomous fashion. In the meanwhile, the agent incurs the reward r_t = r(x_t, a_t, y
_t) ∈ [0, 1], which depends on the complete state of the environment and the chosen action; then the process continues with the next step. The goal of the learner is to collect as much reward as possible. The agent knows the transition probability kernel P and the reward function r; however, it does not know the sequence y_t in advance. We call this problem online learning in Markov Decision Processes (MDPs). We take the viewpoint that the uncontrolled dynamics might be very complex, and thus modeling it based on the available limited information might be hopeless.

This research was supported in part by the National Development Agency of Hungary from the Research and Technological Innovation Fund (KTIA-OTKA CNK 77782), the Alberta Innovates Technology Futures, and the Natural Sciences and Engineering Research Council (NSERC) of Canada. Parts of this work have been published at the Twenty-Fourth Annual Conference on Neural Information Processing Systems (NIPS 2010) [16].
G. Neu is with the Department of Computer Science and Information Theory, Budapest University of Technology and Economics, Budapest, Hungary, and with the Computer and Automation Research Institute of the Hungarian Academy of Sciences, Budapest, Hungary (email: neu.gergely@gmail.com).
A. György is with the Department of Computing Science, University of Alberta, Edmonton, Canada (email: gy@cs.bme.hu). During parts of this work he was with the Machine Learning Research Group of the Computer and Automation Research Institute of the Hungarian Academy of Sciences.
Cs. Szepesvári is with the Department of Computing Science, University of Alberta, Edmonton, Canada (email: szepesva@ualberta.ca).
A. Antos is with the Budapest University of Technology and Economics, Budapest, Hungary (email: antos@cs.bme.hu). During parts of this work he was with the Machine Learning Research Group of the Computer and Automation Research Institute of the Hungarian Academy of Sciences, Budapest, Hungary.

Fig. 1. The interaction between the learning agent and the environment. Here q^{-1} denotes a unit delay, that is, any information sent through such a box is received at the beginning of the next time step.
Equivalently, we assume that whatever can be modeled about the environment is modeled in the Markovian, controlled part. As a result, when evaluating the performance of the learner, the total reward of the learner will be compared to that of the best stochastic stationary policy in hindsight that assigns actions to the states of the Markovian part in a random manner. This stationary policy is thus selected as the policy that maximizes the total reward given the sequence of reward functions r_t(·, ·) = r(·, ·, y_t), t = 1, 2, .... Given a horizon T > 0, any policy π and initial distribution uniquely determines a distribution over the sequence space (X × A)^T. Noting that the expected total reward of π is then a linear function of the distribution of π, and that the space of distributions is a convex polytope with vertices corresponding to distributions of deterministic policies, we see that there will always exist a deterministic policy that maximizes the total expected reward in T time steps. Hence, it is enough to consider deterministic policies only as reference. To make the objective more precise, for a given stationary deterministic policy π : X → A let (x_t^π, a_t^π) denote the state-action pair that would have been visited in time step t had one used policy π from the beginning of time (the initial state being fixed). Then, the goal can be expressed as keeping the expected regret,
$$\hat L_T = \max_\pi E\left[\sum_{t=1}^T r_t(x_t^\pi, a_t^\pi)\right] - E\left[\sum_{t=1}^T r_t(x_t, a_t)\right],$$
small, regardless of the sequence of reward functions {r_t}_{t=1}^T. In particular, a sublinear regret growth, \hat L_T = o(T), means that the average reward collected by the learning agent approaches that of the best policy in hindsight. Naturally, a smaller growth rate is more desirable.

The motivation to study this problem is manifold. One viewpoint

1 It is worth noting that the problem can be defined without referring to the uncontrolled, unmodelled dynamics, by starting with an arbitrary sequence of reward functions {r_t}. That the two problems are equivalent follows because there is no restriction on the range of {y_t} or its dynamics.
2 Following previous works in the area, in this paper we only consider the regret relative
to a fixed stationary policy. However, as usual in online learning, our results and algorithms can also be extended to less restricted sets of reference policies, such as the class of sequences of stationary policies with a restricted number of switches. We discuss such extensions in Section IV-D.

is that a learning agent achieving sublinear regret growth shows robustness in the face of arbitrarily assigned rewards; thus, the model provides a useful generalization of learning and acting in Markov Decision Processes. Some examples where the need for such robustness arises naturally are discussed below. Another viewpoint is that this problem is a useful generalization of online learning problems studied in the machine learning literature (e.g., [5]). In particular, in this literature, the problems studied are so-called prediction problems that involve an oblivious environment that chooses a sequence of loss functions. The learner's predictions are elements in the common domain of these loss functions, and the goal is to keep the regret small as compared with the best fixed prediction in hindsight. Identifying losses with negative rewards, we may notice that this problem coincides exactly with our model with |X| = 1; that is, our problem is indeed a generalization of this problem, where the reward functions have a memory represented by multiple states subject to the Markovian control.

Let us now consider some examples that fit the above model. Generally, since our approach assumes that the hard-to-model, uncontrolled part influences the rewards only, the examples concern cases where the reward is difficult to model. This is the case, for example, in various production- and resource-allocation problems, where the major source of difficulty is to model the prices that influence the rewards. Indeed, the prices in these problems tend to depend on external, generally unobserved factors, and thus the dynamics of the prices might be hard to model. Other examples include problems coming from computer science, such as the k-server problem, paging problems, or web-optimization (e.g., ad-allocation problems with delayed information) [see, e.g., 7, 22].

Previous results that concern online learning in MDPs with known transition probability kernels are summarized in Table I.
  paper                  algorithm   feedback          loops  regret bound
  Even-Dar et al [6, 7]  MDP-E       full information  yes    Õ(T^{1/2})
  Yu et al [22]          Lazy-FPL¹   full information  yes    Õ(T^{3/4+ε}), ε > 0
  Yu et al [22]          Q-FPL²      bandit            yes    o(T)
  Neu et al [13]         —           bandit            no     O(T^{1/2})
  Neu et al [16]         MDP-EXP3    bandit            yes    Õ(T^{2/3})
  this paper             MDP-EXP3    bandit            yes    Õ(T^{1/2})

TABLE I. Summary of previous results. Previous works concerned problems with either full-information or bandit feedback, and problems where the MDP dynamics may or may not have loops. (To be more precise, in Neu et al [13] we considered episodic MDPs with restarts.) For each paper, the order of the obtained regret bound in terms of the time horizon T is given.
1 The Lazy-FPL algorithm has a smaller computational complexity than MDP-E.
2 The stochastic regret of Q-FPL was shown to be sublinear almost surely, not only in expectation.

In the current paper we study the problem with recurrent Markovian dynamics, while assuming that the only information received about the uncontrolled part is in the form of the actual reward r_t. In particular, in our model the agent does not receive y_t, while in most previous works it was assumed that y_t is observed [6, 7, 22]. Following the terminology used in the online learning literature [2], when y_t is available (equivalently, the agent receives the reward function r_t : X × A → R in every time step), we say that learning happens under full information, while in our case we say that learning happens under bandit feedback (note that Even-Dar et al [7] suggested as an open problem to address the bandit situation studied here). In an earlier version of this paper [16], we provided an algorithm, MDP-EXP3, for learning in MDPs with recurrent dynamics under bandit feedback, and showed that it achieves a regret of order Õ(T^{2/3}).³ In this paper we improve upon the analysis of [16] and prove an Õ(T^{1/2})-regret bound for the same algorithm. As it follows from a lower bound proven by Auer et al [2] for bandit problems, apart from logarithmic and constant terms the rate obtained is unimprovable. The improvement compared to [16] is achieved by a more elaborate proof technique that builds on the (perhaps novel) observation that the so-called exponential weights technique that our algorithm builds upon changes its weights slowly. As in previous works where loopy Markovian
dynamics were considered, our main assumption on the MDP transition probability kernel will be that stationary policies mix uniformly fast. In addition, we shall assume that the stationary distributions of these policies are bounded away from zero. These assumptions will be discussed later. We also mention here that Yu and Mannor [20, 21] considered the related problem of online learning in MDPs where the transition probabilities may also change arbitrarily after each transition. This problem is significantly more difficult than the case where only the reward function is allowed to change. Accordingly, the algorithms proposed in these papers do not achieve sublinear regret. Unfortunately, these papers also have gaps in the proofs, as discussed in detail in [13]. Finally, we note in passing that the contextual bandit problem considered by Lazaric and Munos [12] can also be regarded as a simplified version of our online learning problem where the states are generated in an i.i.d. fashion (though we do not consider the problem of competing with the best policy in a restricted subset of stationary policies). For regret bounds concerning learning in purely stochastic, unknown MDPs, see the work of Jaksch et al [10] and the references therein. Learning in adversarial MDPs without loops was also considered by György et al [8] for deterministic transitions under bandit feedback, and under full information but with unknown transition probability kernels in our recent paper [14].

The rest of the paper is organized as follows: The problem is laid out in Section II, which is followed by a section that makes our assumptions precise (Section III). The algorithm and the main result are given and discussed in Section IV, with the proofs presented in Section V.

II. NOTATION AND PROBLEM DEFINITION

The purpose of this section is to provide the formal definition of our problem and to set the goals. We start with some preliminaries, in particular by reviewing the language we use in connection to Markov Decision Processes (MDPs). This will be followed by the definition of the online learning problem. We assume that the reader is familiar with the
concepts necessary to study MDPs; our purpose here is to introduce the notation only. For more background about MDPs, consult Puterman [17]. We define a finite Markov Decision Process (MDP) M by a finite state space X, a finite action set A, a transition probability kernel P : X × A × X → [0, 1], and a reward function r : X × A → [0, 1]. At time t ∈ {1, 2, ...}, based on the sequence of past states, observed rewards, and actions,
$$(x_1, a_1, r(x_1, a_1), x_2, a_2, \ldots, x_{t-1}, a_{t-1}, r(x_{t-1}, a_{t-1}), x_t) \in (X \times A \times \mathbb{R})^{t-1} \times X,$$
an agent acting in the MDP M chooses an action a_t ∈ A to be executed.⁴ As a result, the process moves to state x_{t+1} ∈ X with probability

3 Here, Õ(g(s)) denotes the class of functions f : N → R⁺ satisfying sup_{s∈N} f(s)/(g(s) ln^α s) < ∞ for some α ≥ 0.
4 Throughout the paper we will use boldface letters to denote random variables.
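The finite MDP just defined is straightforward to represent and simulate numerically. The sketch below is purely illustrative (the function name, the toy kernel, and the reward tables are our own made-up inputs, not anything from the paper): P is stored as an |X| × |A| × |X| array whose slice P[x, a] is the next-state distribution, and one trajectory of a fixed stochastic policy is rolled out step by step, exactly following the interaction protocol.

```python
import numpy as np

def rollout(P, reward_fns, pi, x0, rng):
    """Simulate the interaction protocol: at each step t the agent draws
    a_t ~ pi(.|x_t), receives r_t(x_t, a_t), and the state moves to
    x_{t+1} ~ P(.|x_t, a_t).  Returns the realised total reward.

    P          : transition kernel, shape (X, A, X)
    reward_fns : one reward table of shape (X, A) per time step
    pi         : stochastic stationary policy, shape (X, A)
    """
    x, total = x0, 0.0
    for r_t in reward_fns:
        a = rng.choice(P.shape[1], p=pi[x])    # a_t ~ pi(.|x_t)
        total += r_t[x, a]                     # reward in [0, 1]
        x = rng.choice(P.shape[2], p=P[x, a])  # x_{t+1} ~ P(.|x_t, a_t)
    return total
```

With deterministic (one-hot) rows in P and pi, the trajectory and total reward are fully determined, which makes the helper easy to sanity-check before using random inputs.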

P(x_{t+1}|x_t, a_t), and the agent incurs the reward r(x_t, a_t). We note in passing that, at the price of an increased notational load but with essentially no change to the contents, we could consider the case where the set of actions available at time step t is restricted to a nonempty subset A(x_t) of all actions, where the set-system (A(x))_{x∈X} is known to the agent. However, for simplicity, in the rest of the paper we stick to the case A(x) = A. In an MDP the goal of the agent is to maximize the long-term reward. In particular, in the so-called average-reward problem, the goal of the agent is to maximize the long-run average reward.

In what follows, the symbols x, x′ will be reserved to denote a state in X, while a, a′, b will be reserved to denote an action in A. In expressions involving sums over X, the domain of x, x′ will be suppressed to avoid clutter. The same holds for sums involving actions. Before defining the learning problem, let us introduce some more notation. We use ‖v‖_p to denote the L_p-norm of a function or vector. In particular, for p = ∞ the supremum norm of a function v : S → R is defined as ‖v‖_∞ = sup_{s∈S} |v(s)|, and for 1 ≤ p < ∞ and for any vector u = (u_1, ..., u_d) ∈ R^d, ‖u‖_p = (∑_{i=1}^d |u_i|^p)^{1/p}. We use e_1, ..., e_d to denote the row vectors of the canonical basis of the Euclidean space R^d. Since we will identify X with the integers {1, ..., |X|}, we will also use the notation e_x for x ∈ X. We will use ln to denote the natural logarithm function.

A. Online learning in MDPs

In this paper we consider a so-called online learning problem where the reward function is allowed to change arbitrarily in every time step. That is, instead of a single reward function r, a sequence of reward functions {r_t} is given. This sequence is assumed to be fixed ahead of time, and, for simplicity, we assume that r_t(x, a) ∈ [0, 1] for all (x, a) ∈ X × A and t ∈ {1, 2, ...}. No other assumptions are made about this sequence. The learning agent is assumed to know the transition probabilities P, but is not given the sequence {r_t}. The protocol of interaction with the environment is unchanged: At time step t the agent selects an action a_t based on the information available to it, which
is sent to the environment. In response, the reward r_t(x_t, a_t) and the next state x_{t+1} are communicated to the agent. The initial state x_1 is generated from a fixed distribution P_1, which may or may not be known. Let the expected total reward collected by the agent up to time T be denoted by
$$R_T = E\left[\sum_{t=1}^T r_t(x_t, a_t)\right].$$
As before, the goal of the agent is to make this sum as large as possible. In classical approaches to learning one would assume some kind of regularity of r_t and then derive bounds on how much reward the learning agent loses as compared to the agent that knew about the regularity of the rewards and acted optimally from the beginning of time. The loss, or regret, measured in terms of the difference of the total expected rewards of the two agents, quantifies the learner's efficiency. In this paper, following the recent trend in the machine learning literature [5], while keeping the regret criterion, we will avoid making any assumption on how the reward sequence is generated, and take a worst-case viewpoint. The potential benefit is that the results will be more generally applicable and the algorithms will enjoy added robustness, while, generalizing from results available for supervised learning [4, 8], the algorithms can also be shown to avoid being too pessimistic.

The concept of regret in our case is defined as follows: We shall consider algorithms which are competitive with stochastic stationary policies. Fix a stochastic stationary policy π : X × A → [0, 1] and let {(x_t, a_t)} be the trajectory that results from following policy π from x_1 ∼ P_1 (in particular, a_t ∼ π(x_t, ·), where π(x_t, a) denotes the probability of action a in state x_t). The expected total reward of π over the first T time steps is defined as
$$R_T^\pi = E\left[\sum_{t=1}^T r_t(x_t, a_t)\right].$$
Now, the expected regret (or expected relative loss) of the learning agent relative to the class of stationary policies is defined as
$$\hat L_T = \sup_\pi R_T^\pi - R_T,$$
where the supremum is taken over all stochastic stationary policies in M. Note that the policy maximizing the total expected reward is chosen in hindsight, that is, based on the knowledge of the reward functions r_1, ..., r_T. Thus, the regret measures how well the learning agent is able to
generalize from its moment-to-moment knowledge of the rewards to the sequence r_1, ..., r_T. If the regret of an agent grows sublinearly with T, then it can be said to act as well as the best stochastic stationary policy in the long run (i.e., the average expected reward of the agent in the limit is equal to that of the best policy). In this paper our main result will show that there exists an algorithm such that, if that algorithm is followed by the learning agent, then the learning agent's regret will be bounded by C √T ln T, where C > 0 is a constant that depends on the transition probability kernel, but is independent of the sequence of rewards {r_t}.

III. ASSUMPTIONS ON THE TRANSITION PROBABILITY KERNEL

Before describing our assumptions, a few more definitions are needed: First of all, for brevity, in what follows we will call stochastic stationary policies just policies. Further, without loss of generality, we shall identify the states with the first |X| integers and assume that X = {1, 2, ..., |X|}. Now, take a policy π and define the Markov kernel
$$P^\pi(x'|x) = \sum_a \pi(a|x)\, P(x'|x, a).$$
The identification of X with the first |X| integers makes it possible to view P^π as a matrix: (P^π)_{x,x'} = P^π(x'|x). In what follows, we will also take this view when convenient. In general, distributions will also be treated as row vectors. Hence, for a distribution µ over X, µP^π is the distribution over X that results from using policy π for one step after a state is sampled from µ (i.e., the next-state distribution under π). Finally, a stationary distribution of a policy π is a distribution µ_st that satisfies µ_st P^π = µ_st. In what follows we assume that every stochastic stationary policy π has a well-defined unique stationary distribution µ_st^π. This ensures that the average reward underlying any stationary policy is a well-defined single real number. It is well known that in this case the convergence to the stationary distribution is exponentially fast. Following Even-Dar et al [7], we consider the following stronger, uniform mixing condition, which implies the existence of the unique stationary distributions:

Assumption A1: There exists a number τ ≥ 0
such that for any policy π and any pair of distributions µ and µ′ over X,
$$\|(\mu - \mu') P^\pi\|_1 \le e^{-1/\tau}\, \|\mu - \mu'\|_1.$$
As Even-Dar et al [7], we call the smallest τ satisfying this assumption the mixing time of the transition probability kernel P. Together with the existence and uniqueness of the stationary distributions, the next assumption ensures that every state is visited eventually, no matter what policy is chosen:

Assumption A2: The stationary distributions are uniformly bounded away from zero: inf_{π,x} µ_st^π(x) ≥ α for some α > 0.
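For a concrete kernel, both assumptions can be probed numerically. The sketch below is an illustration under our own assumptions (the helper names and the toy kernel are ours, not the paper's): it builds the Markov kernel P^π induced by a policy, solves µ_st P^π = µ_st for the stationary distribution (Assumption A2 asks that its smallest entry stay bounded away from zero uniformly over policies), and evaluates the Markov-Dobrushin ergodicity coefficient, which is below one exactly when P^π is scrambling.

```python
import numpy as np

def policy_kernel(P, pi):
    """P^pi(x'|x) = sum_a pi(a|x) P(x'|x, a) as an |X|-by-|X| row-stochastic matrix.

    P  : transition kernel, shape (X, A, X)
    pi : stochastic stationary policy, shape (X, A), rows summing to 1
    """
    return np.einsum('xa,xay->xy', pi, P)

def stationary_distribution(P_pi):
    """Solve mu P^pi = mu with sum(mu) = 1 as an overdetermined linear system."""
    X = P_pi.shape[0]
    A = np.vstack([P_pi.T - np.eye(X), np.ones(X)])  # fixed-point + normalisation
    b = np.zeros(X + 1)
    b[-1] = 1.0
    mu, *_ = np.linalg.lstsq(A, b, rcond=None)
    return mu

def dobrushin_coefficient(P_pi):
    """m(P^pi) = 1 - min_{x,x'} sum_y min(P^pi(y|x), P^pi(y|x')).

    m(P^pi) < 1 iff P^pi is scrambling (any two rows overlap in some column)."""
    X = P_pi.shape[0]
    overlap = min(np.minimum(P_pi[x], P_pi[y]).sum()
                  for x in range(X) for y in range(X))
    return 1.0 - overlap
```

Since mixing of all deterministic policies implies mixing of all stochastic ones, in practice it suffices to run such a check over the finitely many deterministic policies.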

Note that e^{-1/τ} is the supremum over all policies π of the Markov-Dobrushin coefficient of ergodicity, defined as
$$m(P^\pi) = \sup_{\mu \ne \mu'} \frac{\|(\mu - \mu')P^\pi\|_1}{\|\mu - \mu'\|_1}$$
for the transition probability kernel P^π; see, e.g., [9]. It is also known that m(P^π) = 1 − min_{x,x'∈X} ∑_{y∈X} min{P^π(y|x), P^π(y|x′)} [9]. Since m(P^π) is a continuous function of π and the set of policies is compact, there is a policy π* with m(P^{π*}) = sup_π m(P^π). These facts imply that Assumption A1 is satisfied, that is, sup_π m(P^π) < 1, if and only if for every π, m(P^π) < 1, that is, P^π is a scrambling matrix (P^π is a scrambling matrix if any two rows of P^π share some column in which they both have a positive element). Furthermore, if P^π is a scrambling matrix for every deterministic policy π, then it is also a scrambling matrix for any stochastic policy. Thus, to guarantee Assumption A1 it is enough to verify mixing for deterministic policies only. The assumptions will be further discussed in Section IV-D.

IV. LEARNING IN ONLINE MDPS UNDER BANDIT FEEDBACK

In this section we shall first introduce some additional, standard MDP concepts that we will need. (That these concepts are well defined follows from our assumptions on P and from standard results to be found, for example, in the book by Puterman [17].) After the definitions, we specify our algorithm. The section is finished by the statement of our main result concerning the performance of the proposed algorithm.

A. Preliminaries

Fix an arbitrary policy π and t ≥ 1. Let {(x_s, a_s)} be a random trajectory generated by π and the transition probability kernel P (and an arbitrary everywhere positive initial distribution over the states). We will use q_t^π to denote the action-value function underlying π and the immediate reward r_t, while we will use v_t^π to denote the corresponding state value function.⁵ That is, for (x, a) ∈ X × A,
$$q_t^\pi(x, a) = E\left[\sum_{s=1}^\infty \big(r_t(x_s, a_s) - \rho_t^\pi\big) \,\Big|\, x_1 = x,\, a_1 = a\right],$$
$$v_t^\pi(x) = E\left[\sum_{s=1}^\infty \big(r_t(x_s, a_s) - \rho_t^\pi\big) \,\Big|\, x_1 = x\right],$$
where ρ_t^π is the average reward per stage corresponding to π:
$$\rho_t^\pi = \lim_{S \to \infty} \frac{1}{S} \sum_{s=1}^S E[r_t(x_s, a_s)].$$
The average reward per stage can be expressed as ρ_t^π = ∑_x µ_st^π(x) ∑_a π(a|x) r_t(x, a), where µ_st^π is the stationary
distribution underlying policy π. Under our assumptions stated in the previous section, up to a shift by a constant function, the value functions q_t^π, v_t^π are the unique solutions to the Bellman equations
$$q_t^\pi(x, a) = r_t(x, a) - \rho_t^\pi + \sum_{x'} P(x'|x, a)\, v_t^\pi(x'), \quad (1)$$
$$v_t^\pi(x) = \sum_a \pi(a|x)\, q_t^\pi(x, a), \quad (2)$$
which hold simultaneously for all (x, a) ∈ X × A (Corollary 8.2.7 of [17]). We will use q_t^* to denote the optimal action-value function,

5 Most sources would call these functions differential action- and state-value functions. We omit this adjective for brevity.

that is, the action-value function underlying a policy that maximizes the average reward in the MDP specified by (P, r_t). We will also need these concepts for an arbitrary reward function r : X × A → R. In such a case, we will use v^π, q^π, and ρ^π to denote the respective value function, action-value function, and average reward of policy π. Now, consider the trajectory {(x_t, a_t)} followed by the learning agent, with x_1 ∼ P_1. For any t ≥ 1, define
$$u_t = (x_1, a_1, r_1(x_1, a_1), \ldots, x_t, a_t, r_t(x_t, a_t)) \quad (3)$$
and introduce the policy followed in time step t, π_t(a|x) = P[a_t = a | u_{t-1}, x_t = x], where u_0 (and, more generally, u_s for all s ≤ 0) is defined to be the empty sequence. Note that π_t is computed based on past information and is therefore random. We introduce the following notation: q_t = q_t^{π_t}, v_t = v_t^{π_t}, ρ_t = ρ_t^{π_t}. With this, we see that the following equations hold simultaneously for all (x, a) ∈ X × A:
$$q_t(x, a) = r_t(x, a) - \rho_t + \sum_{x'} P(x'|x, a)\, v_t(x'), \qquad v_t(x) = \sum_a \pi_t(a|x)\, q_t(x, a). \quad (4)$$

B. The algorithm

Our algorithm, MDP-EXP3, shown as Algorithm 1, is inspired by that of Even-Dar et al [7], while also borrowing ideas from the EXP3 algorithm (the exponential weights algorithm for exploration and exploitation) of Auer et al [2]. The main idea of the algorithm is to

Algorithm 1 MDP-EXP3: an algorithm for online learning in MDPs
  Set N ≥ 1, w_1(x, a) = w_2(x, a) = ... = w_{2N}(x, a) = 1, γ ∈ (0, 1), η ∈ (0, γ).
  For t = 1, 2, ... repeat:
  1. Set
       π_t(a|x) = (1 − γ) w_t(x, a) / ∑_b w_t(x, b) + γ/|A|
     for all (x, a) ∈ X × A.
  2. Draw an action a_t ∼ π_t(·|x_t).
  3. Receive reward r_t(x_t, a_t) and observe x_{t+1}.
  4. If t ≥ N:
     a. Compute µ_t^N(x) for all x ∈ X using (8).
     b. Construct estimates ˆr_t using (6) and compute ˆq_t using (5).
     c. Set w_{t+N}(x, a) = w_{t+N−1}(x, a) e^{η ˆq_t(x, a)} for all
(x, a) ∈ X × A.

construct estimates {ˆq_t} of the action-value functions {q_t}, which are then used to determine the action-selection probabilities π_t(·|x) in each state x in each time step t. In particular, the probability of selecting action a in state x at time step t is computed as a mixture of the uniform distribution (which encourages exploring actions irrespective of what the algorithm has learned about the action-values) and a Gibbs distribution, the mixture parameter being γ > 0. Given state x, the Gibbs distribution defines the probability of choosing action a at time step t to be proportional to exp(η ∑_{s=N}^{t−N} ˆq_s(x, a)).⁶

6 In the algorithm the Gibbs action-selection probabilities are computed in an incremental fashion with the help of the weights w_t(x, a). Note that a numerically stable implementation would calculate the action-selection probabilities based on the relative value differences, η ∑_{s=N}^{t−N} ˆq_s(x, a) − max_{a'∈A} η ∑_{s=N}^{t−N} ˆq_s(x, a'). These relative value differences can also be updated incrementally. The form shown in Algorithm 1 is preferred for mathematical clarity.
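The numerical-stability remark of footnote 6 can be sketched as follows (an illustrative toy under our own naming, not the paper's implementation): the mixture probabilities are computed from the relative value differences, so the exponentials never overflow even when the cumulative action-value estimates grow linearly with t.

```python
import numpy as np

def gibbs_policy(q_cumsum, eta, gamma):
    """Action probabilities of a Gibbs/uniform mixture of the MDP-EXP3 kind,
    computed stably by subtracting the per-state maximum before exponentiating.

    q_cumsum : shape (X, A), cumulative sums of past action-value estimates
    eta      : learning-rate parameter (> 0)
    gamma    : exploration parameter in (0, 1)
    """
    z = eta * q_cumsum
    z -= z.max(axis=1, keepdims=True)            # relative value differences
    w = np.exp(z)                                # all entries in (0, 1]
    gibbs = w / w.sum(axis=1, keepdims=True)
    A = q_cumsum.shape[1]
    return (1.0 - gamma) * gibbs + gamma / A     # uniform-exploration mixture
```

By construction every probability is at least γ/|A|, which is exactly the property used later to keep the importance-sampling estimator well defined.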

Here, η > 0 and N > 0 are further parameters of the algorithm. Note that for the single-state setting with N = 1, MDP-EXP3 is equivalent to the EXP3 algorithm of Auer et al [2].

It is interesting to discuss how the Gibbs policy (i.e., w_t(x, a)/∑_b w_t(x, b)) is related to what is known as the Boltzmann-exploration policy in the reinforcement learning literature [e.g., 19]. Remember that, given state x, the Boltzmann-exploration policy would select action a at time step t with probability proportional to exp(η ˆq*_t(x, a)) for some estimate ˆq*_t of the optimal action-value function in the MDP (P, ˆr_t), where {ˆr_t} is the sequence of estimated reward functions. Thus, we can see a couple of differences between the Boltzmann exploration and our Gibbs policy. The first difference is that the Gibbs policy in our algorithm uses the accumulated sum of the estimates of action-values, while the Boltzmann policy uses only the last estimate. By depending on the sum, the Gibbs policy will rely less on the last estimate. This reduces how fast the policies can change, making the learning smoother. Another difference is that in our Gibbs policy the sum of previous action-values runs only up to step t − N, instead of using the sum that runs up to the last step t − 1. The reasons for doing this will be explained below. Finally, the Gibbs policy uses the action-value function estimates in the MDPs {(P, ˆr_s)} of the policies {π_s} selected by the algorithm, as opposed to using an estimate of the optimal action-value function. This makes our algorithm closer in spirit to modified policy iteration than to value iteration and is again expected to reduce the variance of the learning process.

The reason the Gibbs policy does not use the last N estimates is to allow the construction of a reasonable estimate ˆq_t of the action-value function q_t. If r_t was available, one could compute q_t based on r_t (cf. (4)) and the sum could then run up to t − 1, resulting in the algorithm of Even-Dar et al [7]. Since in our problem r_t is not available, we estimate it using an importance-sampling estimator ˆr_t (below; from now on, t ≥ N). Given this ˆr_t, the estimate ˆq_t of the
action-value function q_t is defined as the action-value function underlying policy π_t in the average-reward MDP given by the transition probability kernel P and reward function ˆr_t. Thus, ˆq_t, up to a shift by a constant function, can be computed as the solution to the Bellman equations corresponding to (P, ˆr_t) (cf. (4)):
$$\hat q_t(x, a) = \hat r_t(x, a) - \hat\rho_t + \sum_{x'} P(x'|x, a)\, \hat v_t(x'), \qquad \hat v_t(x) = \sum_a \pi_t(a|x)\, \hat q_t(x, a), \qquad \hat\rho_t = \sum_{x, a} \mu_{st}^{\pi_t}(x)\, \pi_t(a|x)\, \hat r_t(x, a), \quad (5)$$
which hold simultaneously for all (x, a) ∈ X × A. Since π_t is invariant to constant shifts of ˆq_t, any of the solutions of these equations leads to the same sequence of policies. Hence, in what follows, without loss of generality we assume that the algorithm uses ˆq_t, i.e., the value function of π_t in the average-reward MDP defined by (P, ˆr_t). To define the estimator ˆr_t, define µ_t^N(x) as the probability of visiting state x at time step t, conditioned on the history u_{t−N} up to time step t − N, including x_{t−N} and a_{t−N} (cf. (3) for the definition of {u_t}):
$$\mu_t^N(x) \stackrel{\text{def}}{=} P[x_t = x \mid u_{t-N}], \quad x \in X.$$
Then, the estimate of r_t is constructed using
$$\hat r_t(x, a) = \begin{cases} \dfrac{r_t(x, a)}{\pi_t(a|x)\, \mu_t^N(x)}, & \text{if } (x, a) = (x_t, a_t); \\ 0, & \text{otherwise.} \end{cases} \quad (6)$$
The importance-sampling estimator (6) is well defined only if, for x = x_t,
$$\mu_t^N(x) > 0 \quad (7)$$
holds almost surely (by construction, π_t(a_t|x_t) ≥ γ/|A| > 0). To see the intuitive reason of why (7) holds, it is instructive to look into how the distribution µ_t^N can be computed. When t = N, it should be clear from the definition of µ_t^N that, viewing µ_t^N as a row vector, µ_N^N = P_1 P^{π_1} ⋯ P^{π_{N−1}}. Now let t > N. Denote by P^a the transition probability matrix of the policy that selects action a in every state, and recall that e_x denotes the x-th unit row vector of the canonical basis of the |X|-dimensional Euclidean space. We may write
$$\mu_t^N = e_{x_{t-N}}\, P^{a_{t-N}}\, P^{\pi_{t-N+1}} \cdots P^{\pi_{t-1}}, \quad t > N. \quad (8)$$
This holds because for any t ≥ N, π_t is entirely determined by the history u_{t−N}, while for t > N the history u_{t−N} also includes (and thus determines) x_{t−N}, a_{t−N}. Using the notation z ∈ σ(u_{t−N}) to denote that the random variable z is measurable with respect to the sigma-algebra generated by the history u_{t−N}, the above fact can be
stated as
$$x_{t-N}, a_{t-N} \in \sigma(u_{t-N}) \ \text{ for } t > N, \qquad \pi_t \in \sigma(u_{t-N}) \ \text{ for } t \ge N. \quad (9)$$
Consequently, we also have that π_{t−1}, ..., π_{t−N+1} ∈ σ(u_{t−N}), and therefore (8) follows from the law of total probability. Note also that
$$P[a_t = a \mid x_t = x, u_{t-N}] = P[a_t = a \mid x_t = x, u_{t-1}] = \pi_t(a|x), \quad (10)$$
where the last equality follows from the definition of π_t and a_t.

The algorithm as presented needs to know P_1 to compute µ_t^N at step t = N. When P_1 is unknown, instead of starting the computation of the weights at time step t = N, we can start the computation at time step t = N + 1 (i.e., change t ≥ N of step 4 to t ≥ N + 1). Clearly, in the worst case, the regret can only increase by a constant amount (the magnitude of the largest reward) as a result of this change.

An essential step of the proof of our main result is to show that inequality (7) indeed holds, that is, µ_t^N(x) is bounded away from zero. In fact, we will show that this inequality holds almost surely⁷ for all x ∈ X provided that N is large enough, which explains why the sum in the definition of the Gibbs policy runs from time N. This will be done by first showing that the policies π_t (especially during the last N steps) change sufficiently slowly (this is where it becomes useful that the Gibbs policy is defined using a sum of previous action values). Consequently, π_{t−N+1}, ..., π_t will all be quite close to the policy of the last time step. Then, the expression on the right-hand side of (8) can be seen to be close to the N-step state distribution of π_t when starting from (x_{t−N}, a_{t−N}), which, if N is large enough, will be shown to be close to the stationary distribution of π_t, thanks to Assumption A1. Since, by Assumption A2, min_{x∈X} µ_st^{π_t}(x) ≥ α > 0, then, by choosing the algorithm's parameters appropriately, we can show that µ_t^N(x) ≥ α/2 > 0 holds for all x ∈ X; that is, inequality (7) follows. This is shown in Lemma 3.

It remains to be seen that the estimate ˆr_t is meaningful. In this regard, we claim that
$$E[\hat r_t(x, a) \mid u_{t-N}] = r_t(x, a) \quad (11)$$

7 In what follows, for the sake of brevity, unless otherwise stated, we will omit the modifier "almost surely" from probabilistic statements. It is worth mentioning that the finiteness
of X and A allows several statements concerning conditional expectations to hold always, instead of almost surely.
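The two computations of step 4b can be sketched in a few lines. The code below is illustrative only (the function names, the normalisation µ_st·v = 0, and the toy inputs are our own choices, not the paper's): `reward_estimate` builds the one-point importance-sampling table of the kind in (6), and `evaluate` solves average-reward Bellman equations of the kind in (5) for a fixed policy via a direct linear solve.

```python
import numpy as np

def reward_estimate(r_obs, x_t, a_t, pi_t, mu_t):
    """One-point importance-sampling estimate of the full reward table:
    nonzero only at the visited pair (x_t, a_t), rescaled by the
    probability pi_t(a_t|x_t) * mu_t(x_t) of visiting that pair."""
    r_hat = np.zeros_like(pi_t)
    r_hat[x_t, a_t] = r_obs / (pi_t[x_t, a_t] * mu_t[x_t])
    return r_hat

def evaluate(P, r, pi):
    """Differential action values of policy pi in the average-reward MDP (P, r):
    a solution (q, v, rho) of the Bellman equations, normalised so mu_st @ v = 0.

    P : kernel, shape (X, A, X);  r : rewards, shape (X, A);  pi : shape (X, A)
    """
    X = P.shape[0]
    P_pi = np.einsum('xa,xay->xy', pi, P)            # P^pi
    M = np.vstack([P_pi.T - np.eye(X), np.ones(X)])  # stationary distribution
    b = np.zeros(X + 1)
    b[-1] = 1.0
    mu, *_ = np.linalg.lstsq(M, b, rcond=None)
    r_pi = (pi * r).sum(axis=1)                      # expected reward per state
    rho = mu @ r_pi                                  # average reward
    # (I - P^pi + 1 mu) v = r_pi - rho has a unique solution, with mu @ v = 0
    v = np.linalg.solve(np.eye(X) - P_pi + np.outer(np.ones(X), mu),
                        r_pi - rho)
    q = r - rho + P @ v                              # q = r - rho + sum_x' P v
    return q, v, rho
```

As a sanity check, the claimed unbiasedness of the estimator can be verified exactly by summing the one-point tables over the joint law of the visited pair, and the returned q can be checked against both Bellman equations.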

holds for all (x, a) ∈ X × A. First note that
$$E[\hat r_t(x, a) \mid u_{t-N}] = \frac{r_t(x, a)}{\pi_t(a|x)\, \mu_t^N(x)}\, E\big[\mathbb{I}_{\{(x,a)=(x_t,a_t)\}} \mid u_{t-N}\big],$$
where we have exploited that π_t, µ_t^N ∈ σ(u_{t−N}). Now,
$$E\big[\mathbb{I}_{\{(x,a)=(x_t,a_t)\}} \mid u_{t-N}\big] = P[a_t = a \mid x_t = x, u_{t-N}]\; P[x_t = x \mid u_{t-N}].$$
By definition, P[x_t = x | u_{t−N}] = µ_t^N(x) and, by (10), P[a_t = a | x_t = x, u_{t−N}] = π_t(a|x). Putting together the equalities obtained, we get (11). By linearity of expectation, and since π_t, µ_st^{π_t} ∈ σ(u_{t−N}), it then follows from (5) and (11) that E[ˆρ_t | u_{t−N}] = ρ_t, and hence, by the linearity of the Bellman equations and by our assumption that ˆq_t is the value function underlying the MDP (P, ˆr_t) and policy π_t, we have, for all (x, a) ∈ X × A,
$$E[\hat q_t(x, a) \mid u_{t-N}] = q_t(x, a), \qquad E[\hat v_t(x) \mid u_{t-N}] = v_t(x). \quad (12)$$
As a consequence, we also have, for all (x, a) ∈ X × A, t ≥ N,
$$E[\hat\rho_t] = E[\rho_t], \qquad E[\hat q_t(x, a)] = E[q_t(x, a)], \qquad E[\hat v_t(x)] = E[v_t(x)]. \quad (13)$$

Let us finally comment on the computational complexity of our algorithm. Due to the delay in updating the policies based on the weights, the algorithm needs to store N policies (or, equivalently, N sets of weights). Thus, the memory requirement of MDP-EXP3 scales with N|X||A| in the real-number model. The computational complexity of the algorithm is dominated by the cost of computing ˆr_t and, in particular, by the cost of computing µ_t^N, plus the cost of solving the Bellman equations (5). The cost of this is O(|X|²(N + |A|)) in the worst case, for each time step; however, it can be much smaller for specific practical cases, such as when the number of possible next-states is limited.

C. Main result

Our main result is the following bound concerning the performance of MDP-EXP3.

Theorem 1 (Regret under bandit feedback): Let the transition probability kernel P satisfy Assumptions A1 and A2. Let T > 0, let N = ⌈1 + τ ln T⌉, and let h(y) = 2y ln y for y > 0. Then, for an appropriate choice of the parameters η and γ (which depend on |A|, T, α, and τ), for any sequence of reward functions {r_t} taking values in [0, 1], provided that T is larger than a threshold depending polynomially on τ, 1/α, and ln|A| (through universal constants c_1, c_2 and the function h) and that τ ≥ 1/8, the regret of the algorithm MDP-EXP3 can be bounded as
$$\hat L_T \le C\, \frac{\tau^3}{\alpha}\, \sqrt{T |A| \ln|A|}\; \ln T \;+\; C'\, \frac{\tau^2 |A|}{\alpha}\, \ln T$$
for some universal constants c_1, c_2, C, C′ > 0. Note that with the
specific choice of prmeters the totl cost of the lgorithm for time horizon of T is O T X 2 τ lnt + X + 8 The choice of the lower bound on τ is rbitrry, but the constnts in the theorem depend on it Furthermore, with some extr work, our proof lso gives rise to bound for the cse when τ 0, but for simplicity we decided to leve out this nlysis The proof is presented in the next section For comprison, we give now the nlogue result for the lgorithm of Even-Dr et l [7 tht ws developed for the full-informtion cse when the lgorithm is given r t in ech time step As hinted on before, our lgorithm reduces to this lgorithm if we set N =, ˆr t = r t nd γ = 0 We cll this lgorithm MDP-E fter Even-Dr et l [7 The following regret bound holds for this lgorithm: Theorem 2 Regret under full-informtion feedbck: Fix T > 0 Let the trnsition probbility kernel P stisfy Assumption A Then, for n pproprite choice of the prmeter which depends on, T, τ, for ny sequence of rewrd functions {r t} tking vlues in [0,, the regret of the lgorithm MDP-E cn be bounded s ˆL T 4τ + + 2T 2τ + 32τ 2 + 6τ + 5 ln 4 For pedgogicl resons, we shll present the proof in the next section, too Note tht the constnts in this bound re different from those presented in Theorem 5 of Even-Dr et l [7 In prticulr, the leding term here is 2τ 3/2 2T ln, while their leding term is 4τ 2 T ln The bove bound both corrects some smll mistkes in their clcultions nd improves the result t the sme time 9 As Even-Dr et l [7 note, the regret bound 4 does not depend directly on the number of sttes, X, but the dependence ppers implicitly through τ only Even-Dr et l [7 lso note tht tighter bound, where only the mixing times of the ctul policies chosen pper, cn be derived However, it is uncler whether in the worstcse this could be used to improve the bound Similrly to 4, our bound depends on X through other constnts In the bndit cse, these re nd τ Compring the theorems it seems tht the min price of not seeing the rewrds is the ppernce of 
insted of ln typicl difference between the bndit nd full observtion cses nd the ppernce of / term in the bound D Discussion nd future work In this pper, we hve presented n online lerning lgorithm, MDP-EXP3 for dversril MDPs, tht is, finite stochstic Mrkovin decision environments where the rewrd function my chnge fter ech trnsition This is the first lgorithm for this setting tht hs rigorously proved O T ln T bound on its regret We discuss the fetures of the lgorithm, long with future reserch directions below Extensions: We considered the expected regret reltive to the best fixed policy selected in hindsight A typicl extension is to prove high probbility bound on the regret, which we think cn be done in stndrd wy using concentrtion inequlities Note, however, tht the extension is more complicted thn for the bndit problems becuse the mixing property hs to be used together with the mrtingle resoning Another potentil extension is to compete with lrger policy clsses, such s with sequences of policies with bounded number of policy-switches Similrly to Neu et l [3, 5, the MDP-EXP3 lgorithm should then be modified by replcing EXP3 with the EXP3S lgorithm of Auer et l [2, specificlly designed to compete with switching experts in plce of EXP3 Note tht, gin, the nlysis will be more complicted thn in the bndit cse, nd requires to bound the mximum regret of EXP3S reltive to ny fixed policy over ny time window When compred to policy with C switches, the resulting regret bound is expected to be C times 9 One of the mistkes is in the proof of Theorem 4 of Even-Dr et l [7 where they filed to notice tht q π t t cn tke on negtive vlues Thus, their Assumption 3 is not met by {q π t t } one needs to extend the upper bound given in their Lemm 22 with lower bound nd chnge Assumption 3 As result, Assumption 3 cnnot be used to show tht the inequlity in the proof of Theorem 4 holds This mistke, s well s the others, cn esily be corrected, s we show it here

larger than that of Theorem 1, while the algorithm would not need to know the number of switches C.

(b) Tuning and complexity: Setting up and running the algorithm MDP-EXP3 may actually be computationally demanding. Setting the parameters of the algorithm (η and γ) requires a known lower bound α on the visitation probabilities, such that inf_{π,x} µ^π_st(x) ≥ α > 0, and also the knowledge of an upper bound τ on the mixing time. While these quantities can, in principle, be determined from the transition probability kernel P, it is not clear how to compute the minimum over all policies efficiently. Computational issues also arise while running the algorithm: as discussed in Section IV-B, each step of the MDP-EXP3 algorithm requires O( |X|²τ ln T + |X||A| ) computations, which may be too demanding if, e.g., the size of the state space is large. It is an interesting problem to design a more efficient method that achieves similar performance guarantees.

(c) Assumptions on the Markovian dynamics: We believe that it should be possible to extend our main result beyond Assumption A1, requiring only the existence of a unique stationary distribution for any policy π (we will refer to this latter assumption as the unichain assumption). Using that the distribution of any unichain Markov chain converges exponentially fast to its stationary distribution, and that it is enough to verify Assumption A1 for deterministic policies only, one can easily show that if P satisfies the unichain assumption, then there exists an integer K > 0 such that (P^π)^K is a scrambling matrix for any policy π. Then, we conjecture that the MDP-EXP3 algorithm will work as it is, except that the regret will be increased. The key to proving this result is to generalize Lemmas 4 and 5 to this case. Finally, one may also consider the case when the Markov chains corresponding to P^π are periodic. We speculate that this case may be dealt with by using occupancy probabilities and Cesàro averages instead of the stationary and state distributions, respectively.

V. PROOFS

In this section we present the proofs of Theorem 1 and Theorem 2. We start with the proof of Theorem 2, as it is the simpler result. The proof of this result is presented partly for the sake of completeness and partly so that we can be more specific about the corrections required to fix the main result (Theorem 5.2) of Even-Dar et al. [7]. Further, the proof will also serve as a starting point for the proof of our main result, Theorem 1. Nevertheless, the impatient reader may skip the next section and jump immediately to the proof of Theorem 1, which, apart from referring to some general lemmas developed in the next subsection, is entirely self-contained.

A. Proof of Theorem 2

Throughout this section we consider the MDP-E algorithm given by Algorithm 1 with N = 1, r̂_t = r_t and γ = 0, and we suppose that P satisfies Assumption A1. Let π_t denote the policy used in step t of the algorithm. Note that π_t is not random, since by assumption the reward function is available at all states (not just the visited ones). Hence, the sequence of policies chosen does not depend on the states visited by the algorithm, but is deterministic. Remember that ρ_t = ρ^{π_t}_t denotes the average reward of policy π_t measured with respect to the reward function r_t. Following Even-Dar et al. [7], fix some policy π and consider the decomposition of the regret relative to π:

R^π_T − R_T = ( R^π_T − Σ_{t=1}^T ρ^π_t ) + Σ_{t=1}^T ( ρ^π_t − ρ_t ) + ( Σ_{t=1}^T ρ_t − R_T ).  (15)

The first and the last terms measure the difference between the sum of asymptotic average rewards and the actual expected reward. The mixing assumption (Assumption A1) ensures that these differences are not large. In particular, in the case of a fixed policy, this difference is bounded by a constant of order τ:

Lemma 1: For any T and any policy π, it holds that

| R^π_T − Σ_{t=1}^T ρ^π_t | ≤ 2(τ+1).  (16)

This lemma is also stated in [7]. We give the proof for completeness, and also to correct slight inaccuracies of the proof given in [7].

Proof: Let {(x_t, a_t)} be the trajectory obtained when π is followed. Note that the difference between R^π_T and Σ_{t=1}^T ρ^π_t is caused by the difference between the initial distribution of x_1 and the stationary distribution of π. To quantify the difference, write

R^π_T − Σ_{t=1}^T ρ^π_t = Σ_{t=1}^T Σ_{x,a} ( ν^π_t(x) − µ^π_st(x) ) π(a|x) r_t(x,a),

where ν^π_t(x) = P[x_t = x] is the state distribution at time step t. Viewing ν^π_t as a row vector, we have ν^π_t = ν^π_{t−1} P^π. Consider the t-th term of the above difference. Using r_t(x,a) ∈ [0,1] and Assumption A1, we get

| Σ_{x,a} ( ν^π_t(x) − µ^π_st(x) ) π(a|x) r_t(x,a) | ≤ ‖ ν^π_t − µ^π_st ‖₁ = ‖ ν^π_{t−1} P^π − µ^π_st P^π ‖₁ ≤ e^{−1/τ} ‖ ν^π_{t−1} − µ^π_st ‖₁ ≤ ⋯ ≤ e^{−(t−1)/τ} ‖ ν^π_1 − µ^π_st ‖₁ ≤ 2 e^{−(t−1)/τ}.¹⁰

This, together with the elementary inequality Σ_{t=1}^T e^{−(t−1)/τ} ≤ 1 + ∫₀^∞ e^{−t/τ} dt = 1 + τ, gives the desired bound. ∎

Consider now the second term of (15) and, in particular, its t-th term, ρ^π_t − ρ_t = ρ^π_t − ρ^{π_t}_t. This term is the difference of the average rewards obtained by π and π_t. The following lemma shows that this difference can be rewritten in terms of the state-wise action-disadvantages underlying π_t:

Lemma 2 (Performance difference lemma): Consider an MDP specified by the transition probability kernel P and a reward function r. Let π, π̂ be two stochastic stationary policies in the MDP. Assume that µ^π_st, ρ^π̂ and q^π̂ are well-defined.¹¹ Then,

ρ^π − ρ^π̂ = Σ_{x,a} µ^π_st(x) π(a|x) [ q^π̂(x,a) − v^π̂(x) ].

This lemma appeared as Lemma 4 in [7], but similar statements have been known for a while; for example, the book of Cao [3] also puts performance difference statements at the center of the theory of MDPs. For the sake of completeness, we include the easy proof. Note that the statement of the lemma continues to hold even when q^π̂ and v^π̂ are shifted by the same constant function.

Proof: We have

Σ_{x,a} µ^π_st(x) π(a|x) q^π̂(x,a) = Σ_{x,a} µ^π_st(x) π(a|x) [ r(x,a) − ρ^π̂ + Σ_{x′} P(x′|x,a) v^π̂(x′) ] = ρ^π − ρ^π̂ + Σ_{x′} µ^π_st(x′) v^π̂(x′),

where the second equality holds since Σ_{x,a} µ^π_st(x) π(a|x) P(x′|x,a) = µ^π_st(x′). Reordering the terms gives the desired result. ∎

¹⁰ Even-Dar et al. [7] mistakenly use ‖ν^π_t − µ^π_st‖₁ ≤ e^{−t/τ} ‖ν^π_1 − µ^π_st‖₁ in their paper; t = 1 immediately shows that this can be false. See, e.g., the proofs of their Lemmas 2.2 and 5.2.

¹¹ This lemma does not need Assumption A1 and, in fact, the assumptions we make could be further relaxed with a slight change to the claim.
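Lemma 2 is also easy to check numerically. The sketch below is our own illustration, not a construction from the paper: the random MDP, the state and action counts, and the iterative solvers (power iteration for the stationary distribution, relative value iteration for the differential values) are all assumptions made for the demo.

```python
import random

# Numerical spot-check of the performance difference lemma (Lemma 2):
#   rho(pi) - rho(pihat) = sum_{x,a} mu_pi(x) pi(a|x) [ q_pihat(x,a) - v_pihat(x) ]
# on a small random MDP. Everything below (sizes, solvers, random seed) is an
# illustrative choice of ours, not the paper's construction.

nX, nA = 4, 3
rng = random.Random(1)

def simplex(n):
    """A random probability vector of length n with strictly positive entries."""
    w = [rng.random() + 0.1 for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

# Random transition kernel P[x][a] (strictly positive, hence fast mixing)
# and reward function r[x][a].
P = [[simplex(nX) for _ in range(nA)] for _ in range(nX)]
r = [[rng.random() for _ in range(nA)] for _ in range(nX)]

def P_pi(pol):
    """State-transition matrix induced by a stochastic policy pol[x][a]."""
    return [[sum(pol[x][a] * P[x][a][y] for a in range(nA)) for y in range(nX)]
            for x in range(nX)]

def stationary(pol):
    """Stationary distribution of P^pol via power iteration."""
    mu, Pp = [1.0 / nX] * nX, P_pi(pol)
    for _ in range(3000):
        mu = [sum(mu[x] * Pp[x][y] for x in range(nX)) for y in range(nX)]
    return mu

def avg_reward(pol, mu):
    return sum(mu[x] * pol[x][a] * r[x][a] for x in range(nX) for a in range(nA))

def diff_values(pol):
    """Average reward rho and differential values (v, q) of pol,
    obtained by relative value iteration (v is centered to zero mean)."""
    mu = stationary(pol)
    rho = avg_reward(pol, mu)
    Pp = P_pi(pol)
    rp = [sum(pol[x][a] * r[x][a] for a in range(nA)) for x in range(nX)]
    v = [0.0] * nX
    for _ in range(5000):
        v = [rp[x] - rho + sum(Pp[x][y] * v[y] for y in range(nX))
             for x in range(nX)]
        m = sum(v) / nX
        v = [vx - m for vx in v]  # v is only defined up to an additive constant
    q = [[r[x][a] - rho + sum(P[x][a][y] * v[y] for y in range(nX))
          for a in range(nA)] for x in range(nX)]
    return rho, v, q

pi = [simplex(nA) for _ in range(nX)]
pihat = [simplex(nA) for _ in range(nX)]
mu_pi = stationary(pi)
rho_hat, v_hat, q_hat = diff_values(pihat)
lhs = avg_reward(pi, mu_pi) - rho_hat
rhs = sum(mu_pi[x] * pi[x][a] * (q_hat[x][a] - v_hat[x])
          for x in range(nX) for a in range(nA))
gap = abs(lhs - rhs)
```

Since the identity is exact, the two sides agree up to numerical precision. Centering v in every sweep is harmless here: as remarked after Lemma 2, the identity is insensitive to shifting q and v by a common constant.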

Because of this lemma,

ρ^π_t − ρ_t = Σ_{x,a} µ^π_st(x) π(a|x) [ q^{π_t}_t(x,a) − v^{π_t}_t(x) ].

Thus, by exchanging the sum that runs over time with the one that runs over the state-action pairs, we get

Σ_{t=1}^T ( ρ^π_t − ρ_t ) = Σ_{x,a} µ^π_st(x) π(a|x) Σ_{t=1}^T [ q^{π_t}_t(x,a) − v^{π_t}_t(x) ].

Thus, it suffices to bound, for a fixed state-action pair (x,a), the sum Σ_{t=1}^T [ q^{π_t}_t(x,a) − v^{π_t}_t(x) ]. By construction, π_t(a|x) ∝ exp( η Σ_{s=1}^{t−1} q^{π_s}_s(x,a) ) (recall that γ = 0 in this version of the algorithm), which means that this sum is the regret of the so-called exponential weights algorithm (EWA) against action a when the algorithm is used on the sequence { q^{π_t}_t(x, ·) }. Assume for a moment that K > 0 is such that ‖q^{π_t}_t‖_∞ ≤ K holds for 1 ≤ t ≤ T. Then, since q^{π_t}_t takes its values in an interval of length 2K, Theorem 2.2 in [5] implies that the regret of EWA can be bounded by

ln|A| / η + η K² T / 2.  (17)

Notice that { q^{π_t}_t } is a sequence that is sequentially generated from { r_t }. It is Lemma 4 of [5] that shows that the bound of Theorem 2.2 of [5] continues to hold for such sequentially generated functions. Putting the inequalities together, we obtain

Σ_{t=1}^T ( ρ^π_t − ρ_t ) ≤ ln|A| / η + η K² T / 2.  (18)

According to the next lemma, an appropriate value for K is 2τ+3. The lemma is stated in greater generality than what is needed here, because the more general form will be used later.

Lemma 3: Pick any policy π in an MDP (P, r). Assume that the mixing time of π is τ in the sense of (1). If | Σ_a π(a|x) r(x,a) | ≤ R holds for any x ∈ X, then |v^π(x)| ≤ 2R(τ+1) holds for all x ∈ X. Furthermore, for any (x,a) ∈ X × A, q^π(x,a) ≤ R(2τ+3) + r(x,a) and, if, in addition, r(x,a) ≥ 0 for any (x,a) ∈ X × A, then q^π(x,a) ≥ −(2τ+3) ‖r‖_∞.

Proof: As is well known and easy to see from the definitions, the differential value of policy π at state x can be written as

v^π(x) = Σ_{s=1}^∞ Σ_{x′,a} ( ν^π_{s,x}(x′) − µ^π_st(x′) ) π(a|x′) r(x′,a),

where ν^π_{s,x} = e_x (P^π)^{s−1} is the state distribution when following π for s steps starting from state x. The triangle inequality and then the bound on | Σ_a π(a|x′) r(x′,a) | give

| v^π(x) | ≤ R Σ_{s=1}^∞ ‖ ν^π_{s,x} − µ^π_st ‖₁ ≤ 2R(τ+1),

where in the second inequality we used ‖ν^π_{s,x} − µ^π_st‖₁ ≤ 2e^{−(s−1)/τ} and Σ_{s=1}^∞ e^{−(s−1)/τ} ≤ τ+1 (cf. the proof of Lemma 1). This proves the first inequality. The inequalities on q^π(x,a) follow from the first part and the Bellman equation:

q^π(x,a) = r(x,a) − ρ^π + Σ_{x′} P(x′|x,a) v^π(x′) ≤ r(x,a) + |ρ^π| + 2R(τ+1) ≤ R(2τ+3) + r(x,a),
q^π(x,a) = r(x,a) − ρ^π + Σ_{x′} P(x′|x,a) v^π(x′) ≥ −(2τ+3) ‖r‖_∞.

Here, in the first chain of inequalities, we used that |ρ^π| = | Σ_{x,a} µ^π_st(x) π(a|x) r(x,a) | ≤ R, while the second holds since r(x,a) ≥ 0, ρ^π ∈ [0, ‖r‖_∞] and R ≤ ‖r‖_∞. ∎

Let us now consider the third term of (15), Σ_{t=1}^T ρ_t − R_T. The t-th term of this difference is the difference between the average reward of π_t and the expected reward obtained in step t. If ν_t(x) is the distribution of the states in time step t, then

Σ_{t=1}^T ρ_t − R_T = Σ_{t=1}^T Σ_{x,a} ( µ^{π_t}_st(x) − ν_t(x) ) π_t(a|x) r_t(x,a) ≤ Σ_{t=1}^T ‖ µ^{π_t}_st − ν_t ‖₁,  (19)

and so it remains to bound the l₁-distances between the distributions µ^{π_t}_st and ν_t. For this, we will use two general lemmas that will again come useful later. For f : X × A → R, introduce the mixed norm ‖f‖_{∞,1} = max_x ‖f_x‖₁, where f_x is identified with f(x, ·). Clearly,

‖ ν P^π − ν P^π̂ ‖₁ ≤ ‖ π − π̂ ‖_{∞,1}

holds for any two policies π, π̂ and any distribution ν (cf. Lemma 5.1 in [7]). The first lemma shows that the map π ↦ µ^π_st, as a map from the space of stationary policies equipped with the mixed norm ‖·‖_{∞,1} to the space of distributions equipped with the l₁-norm, is (τ+1)-Lipschitz:

Lemma 4: Let P be a transition probability kernel over X × A such that the mixing time of P is τ < ∞. For any two policies π, π̂, it holds that

‖ µ^π_st − µ^π̂_st ‖₁ ≤ (τ+1) ‖ π − π̂ ‖_{∞,1}.

Proof: The statement follows from solving

‖ µ^π_st − µ^π̂_st ‖₁ ≤ ‖ µ^π_st P^π − µ^π̂_st P^π ‖₁ + ‖ µ^π̂_st P^π − µ^π̂_st P^π̂ ‖₁ ≤ e^{−1/τ} ‖ µ^π_st − µ^π̂_st ‖₁ + ‖ π − π̂ ‖_{∞,1}

for ‖µ^π_st − µ^π̂_st‖₁ and using

1 / ( 1 − e^{−1/τ} ) ≤ τ + 1.  (20)  ∎

The next lemma allows us to compare an n-step state distribution under a policy sequence with the stationary distribution of the sequence's last policy:

Lemma 5: Let P be a transition probability kernel over X × A such that the mixing time of P is τ < ∞. Take any probability distribution ν over X, an integer n ≥ 1 and policies π_1, …, π_n, and consider the distribution ν_n = ν P^{π_1} ⋯ P^{π_n}. Then, it holds that

‖ ν_n − µ^{π_n}_st ‖₁ ≤ 2 e^{−(n−1)/τ} + (τ+1)² max_{1≤t≤n} ‖ π_t − π_{t−1} ‖_{∞,1},

where, for convenience, we have introduced π_0 = π_1.

Proof: If n = 1, the result is obtained from ‖ν_1 − µ^{π_1}_st‖₁ ≤ 2. Thus, in what follows we assume n ≥ 2. Let c = max_{1≤t≤n} ‖π_t − π_{t−1}‖_{∞,1}. By the triangle inequality,

‖ ν_n − µ^{π_n}_st ‖₁ = ‖ ν_{n−1} P^{π_n} − µ^{π_n}_st P^{π_n} ‖₁ ≤ e^{−1/τ} ‖ ν_{n−1} − µ^{π_n}_st ‖₁ ≤ e^{−1/τ} ( ‖ ν_{n−1} − µ^{π_{n−1}}_st ‖₁ + ‖ µ^{π_{n−1}}_st − µ^{π_n}_st ‖₁ ) ≤ e^{−1/τ} ‖ ν_{n−1} − µ^{π_{n−1}}_st ‖₁ + (τ+1) c,

where we used that, by the previous lemma, ‖µ^{π_{n−1}}_st − µ^{π_n}_st‖₁ ≤ (τ+1) ‖π_n − π_{n−1}‖_{∞,1} ≤ (τ+1)c. Continuing recursively, we get

‖ ν_n − µ^{π_n}_st ‖₁ ≤ e^{−1/τ} ( e^{−1/τ} ‖ ν_{n−2} − µ^{π_{n−2}}_st ‖₁ + (τ+1) c ) + (τ+1) c ≤ ⋯ ≤ e^{−(n−1)/τ} ‖ ν_1 − µ^{π_1}_st ‖₁ + (τ+1) c ( 1 + e^{−1/τ} + ⋯ + e^{−(n−2)/τ} ) ≤ 2 e^{−(n−1)/τ} + (τ+1)² c,

where we bounded the geometric series by 1/(1 − e^{−1/τ}) and used (20). ∎
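Two elementary ingredients of the last two proofs can be spot-checked numerically: the inequality ‖νP^π − νP^π̂‖₁ ≤ ‖π − π̂‖_{∞,1}, and the geometric-series bound (20). The sketch below is our own illustration; the random instances and the problem sizes are assumptions made for the demo.

```python
import math
import random

# Numerical spot-check of two ingredients behind Lemmas 4 and 5:
#  (i)  || nu P^pi - nu P^pihat ||_1  <=  || pi - pihat ||_{inf,1}
#  (ii) 1 / (1 - exp(-1/tau))  <=  tau + 1      (bound (20))
# The random instances below are illustrative choices of ours.

rng = random.Random(2)
nX, nA = 5, 4

def simplex(n):
    """A random probability vector of length n with strictly positive entries."""
    w = [rng.random() + 1e-3 for _ in range(n)]
    s = sum(w)
    return [x / s for x in w]

violations = 0
for _ in range(200):
    P = [[simplex(nX) for _ in range(nA)] for _ in range(nX)]
    nu = simplex(nX)
    pi = [simplex(nA) for _ in range(nX)]
    pihat = [simplex(nA) for _ in range(nX)]

    def push(pol):
        """Row vector nu multiplied by the transition matrix P^pol."""
        return [sum(nu[x] * pol[x][a] * P[x][a][y]
                    for x in range(nX) for a in range(nA)) for y in range(nX)]

    l1 = sum(abs(u - w) for u, w in zip(push(pi), push(pihat)))
    mixed = max(sum(abs(pi[x][a] - pihat[x][a]) for a in range(nA))
                for x in range(nX))
    if l1 > mixed + 1e-12:  # inequality (i) should never fail
        violations += 1

for tau in (0.1, 0.5, 1.0, 3.0, 10.0, 100.0):
    if 1.0 / (1.0 - math.exp(-1.0 / tau)) > tau + 1.0:  # bound (ii) should never fail
        violations += 1
```

Both facts are theorems (the second follows from e^x ≥ 1 + x), so the check should record no violations on any instance.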

Applying this lemma to ν_t, we get

‖ ν_t − µ^{π_t}_st ‖₁ ≤ 2 e^{−(t−1)/τ} + (τ+1)² K′,

where K′ is a bound on max_{2≤t≤T} ‖π_t − π_{t−1}‖_{∞,1}.¹² Therefore, by (19), we have

Σ_{t=1}^T ρ_t − R_T ≤ Σ_{t=1}^T ( 2 e^{−(t−1)/τ} + (τ+1)² K′ ) ≤ 2(τ+1) + (τ+1)² K′ T.

Thus, it remains to find an appropriate value for K′. It is a well-known property of EWA that consecutive distributions are close to each other. Indeed, applying Pinsker's inequality and Hoeffding's lemma (see Section A.2 and Lemma A.6 in Cesa-Bianchi and Lugosi [5]), we get, for any x ∈ X,

‖ π_{t+1}(·|x) − π_t(·|x) ‖₁ ≤ √( 2 D( π_t(·|x) ‖ π_{t+1}(·|x) ) ) = √( 2 [ ln Σ_b π_t(b|x) e^{η q^{π_t}_t(x,b)} − η Σ_b π_t(b|x) q^{π_t}_t(x,b) ] ) ≤ η ‖ q^{π_t}_t ‖_∞,

where, for two distributions v, v′, D(v‖v′) = Σ_i v(i) ln( v(i)/v′(i) ) denotes the Kullback–Leibler divergence of the distributions v and v′, and the last step follows from Hoeffding's lemma. Thus, ‖π_{t+1} − π_t‖_{∞,1} ≤ η ‖q^{π_t}_t‖_∞. Now, by Lemma 3, ‖q^{π_t}_t‖_∞ ≤ 2τ+3, showing that K′ = η(2τ+3) is suitable. Putting together the inequalities obtained, we get

Σ_{t=1}^T ρ_t − R_T ≤ 2(τ+1) + η (2τ+3)(τ+1)² T.

Combining (16), (18) and this last bound, we obtain

R^π_T − R_T ≤ 4(τ+1) + ln|A|/η + η T (2τ+3)(2τ²+6τ+5)/2.

Setting

η = √( 2 ln|A| / ( T (2τ+3)(2τ²+6τ+5) ) ),

we get the bound stated in Theorem 2.

B. Proof of Theorem 1

Throughout this section we consider the MDP-EXP3 algorithm and suppose that both Assumptions A1 and A2 hold for P. We start from the decomposition (15), which is repeated here to emphasize that some of the terms are now random:

R^π_T − R̂_T = ( R^π_T − Σ_{t=1}^T ρ^π_t ) + Σ_{t=1}^T ( ρ^π_t − ρ_t ) + ( Σ_{t=1}^T ρ_t − R̂_T ).  (21)

As before, Lemma 1 shows that the first term is bounded by 2(τ+1). Thus, it remains to bound the expectation of the other two terms. This is done in the following two propositions, whose proofs are deferred to the next subsections:

¹² Lemma 5.2 of Even-Dar et al. [7] gives a bound on ‖ν_t − µ^{π_t}_st‖₁ with a slightly different technique. However, there are multiple mistakes in the proof; once the mistakes are removed, their bounding technique gives the same result as ours. One of the mistakes is that their Assumption 3 states that K′ = √(ln|A|/T), whereas, since the range of the action-value functions scales with τ, K′ should also scale with τ. Unfortunately, in [6] we committed the same mistake, which we correct here. We chose to present an alternate proof, as we find it somewhat cleaner and it also gave us the opportunity to
present Lemma 4.

Proposition 1: Let

L = 2(2τ+3), V_q̂ = 2( |A|/(αγ) + 2τ+2 ), U_v̂ = 4(τ+1), U^π_q̂ = 4(τ+1)², U_q = 2τ+3, U′_q = 2τ+4, e₁ = e − 1, e₂ = e − 2,

c = ( e₁ U_v̂ + L + γ V_q̂ ) / ( 2 e₂ N V_q̂ ),  c′ = ( e₁ ( L + γ V_q̂ ) + e₂ ( U^π_q̂ + U_v̂ U′_q ) ) / ( 2 e₂ (N+1) V_q̂ ),

and assume that γ ∈ (0,1), c(τ+2)² < 1/2, N ≥ 1 + τ ln(4/γ) and 0 < η < γ / ( 2e(N+1) + γ(2τ+2) ). Then, for any policy π, we have

Σ_{t=1}^T E[ ρ^π_t − ρ_t ] ≤ ln|A|/η + η N ( U_q + U′_q ) + η T ( 2(N+2) c + e₁ V_q̂ + γ U′_q + e₂ U^π_q̂ U′_q ).

Proposition 2: Assume that the conditions of Proposition 1 hold. Then,

Σ_{t=1}^T E[ ρ_t ] − R̂_T ≤ 2(N+1) + c′ η (τ+1)² (N+1) T + 2 T e^{−(N−1)/τ}.  (22)

Note that setting N = ⌈1 + τ ln T⌉, as suggested in Theorem 1, the last term on the right-hand side of (22) becomes O(1), while for T sufficiently large all the conditions of the last two propositions will be satisfied. This leads to the proof of Theorem 1:

Proof of Theorem 1: If |A| = 1 then, due to L̂_T = 0, the statement is trivial, so we assume |A| ≥ 2 from now on. Define

α₁ = 2( e₁ U_v̂ + L + γ V_q̂ ),  α₂ = 2{ e₁( L + γ V_q̂ ) + e₂( U^π_q̂ + U_v̂ U′_q ) },

so that c = α₁ / ( 4 e₂ N V̄_q ) and c′ = α₂ / ( 4 e₂ (N+1) V̄_q ), where V̄_q = V_q̂ / 2 = |A|/(αγ) + 2τ + 2. In the following we will use the notation f ≍ g, for two positive-valued functions f, g : D → R₊ defined on the same domain D, to denote that they are equivalent up to a constant factor, that is, sup_{x∈D} max{ f(x)/g(x), g(x)/f(x) } < ∞. With this notation, on |A| ≥ 2, τ ≥ 1 and as long as γ ≤ 1, we have

α₁ ≍ 1 + τ and α₂ ≍ τ²,  (23)

independently of the value of α and of the choice of η, γ and N. In what follows, all the equivalences will be stated for the domain |A| ≥ 2, τ ≥ 1. We now show how to choose η, γ and N so as to achieve a small regret bound. In order to do so, we will choose these constants so that the conditions of Propositions 1 and 2 are satisfied. For simplicity, we add the constraint γ ≤ 1/2, which we will also show to hold. Under this additional constraint, the inequality η < γ / ( 2e(N+1) + γ(2τ+2) ) will be satisfied if we choose γ = ( 8e(N+1) + 4(τ+1) ) η. Indeed, the said inequality holds since it is equivalent to D := γ − η( 2e(N+1) + γ(2τ+2) ) > 0, and

D ≥ γ − 2e(N+1)η − (τ+1)η ≥ γ − γ/4 − γ/4 = γ/2 > 0,

where the first inequality holds because γ ≤ 1/2 and the second holds by the definition of γ. Since c ≤ 2α₁/D and c′ =


More information

Best Approximation. Chapter The General Case

Best Approximation. Chapter The General Case Chpter 4 Best Approximtion 4.1 The Generl Cse In the previous chpter, we hve seen how n interpolting polynomil cn be used s n pproximtion to given function. We now wnt to find the best pproximtion to given

More information

Riemann Sums and Riemann Integrals

Riemann Sums and Riemann Integrals Riemnn Sums nd Riemnn Integrls Jmes K. Peterson Deprtment of Biologicl Sciences nd Deprtment of Mthemticl Sciences Clemson University August 26, 2013 Outline 1 Riemnn Sums 2 Riemnn Integrls 3 Properties

More information

Abstract inner product spaces

Abstract inner product spaces WEEK 4 Abstrct inner product spces Definition An inner product spce is vector spce V over the rel field R equipped with rule for multiplying vectors, such tht the product of two vectors is sclr, nd the

More information

Riemann Sums and Riemann Integrals

Riemann Sums and Riemann Integrals Riemnn Sums nd Riemnn Integrls Jmes K. Peterson Deprtment of Biologicl Sciences nd Deprtment of Mthemticl Sciences Clemson University August 26, 203 Outline Riemnn Sums Riemnn Integrls Properties Abstrct

More information

Lecture 1: Introduction to integration theory and bounded variation

Lecture 1: Introduction to integration theory and bounded variation Lecture 1: Introduction to integrtion theory nd bounded vrition Wht is this course bout? Integrtion theory. The first question you might hve is why there is nything you need to lern bout integrtion. You

More information

arxiv:math/ v2 [math.ho] 16 Dec 2003

arxiv:math/ v2 [math.ho] 16 Dec 2003 rxiv:mth/0312293v2 [mth.ho] 16 Dec 2003 Clssicl Lebesgue Integrtion Theorems for the Riemnn Integrl Josh Isrlowitz 244 Ridge Rd. Rutherford, NJ 07070 jbi2@njit.edu Februry 1, 2008 Abstrct In this pper,

More information

Bases for Vector Spaces

Bases for Vector Spaces Bses for Vector Spces 2-26-25 A set is independent if, roughly speking, there is no redundncy in the set: You cn t uild ny vector in the set s liner comintion of the others A set spns if you cn uild everything

More information

State space systems analysis (continued) Stability. A. Definitions A system is said to be Asymptotically Stable (AS) when it satisfies

State space systems analysis (continued) Stability. A. Definitions A system is said to be Asymptotically Stable (AS) when it satisfies Stte spce systems nlysis (continued) Stbility A. Definitions A system is sid to be Asymptoticlly Stble (AS) when it stisfies ut () = 0, t > 0 lim xt () 0. t A system is AS if nd only if the impulse response

More information

ODE: Existence and Uniqueness of a Solution

ODE: Existence and Uniqueness of a Solution Mth 22 Fll 213 Jerry Kzdn ODE: Existence nd Uniqueness of Solution The Fundmentl Theorem of Clculus tells us how to solve the ordinry differentil eqution (ODE) du = f(t) dt with initil condition u() =

More information

A recursive construction of efficiently decodable list-disjunct matrices

A recursive construction of efficiently decodable list-disjunct matrices CSE 709: Compressed Sensing nd Group Testing. Prt I Lecturers: Hung Q. Ngo nd Atri Rudr SUNY t Bufflo, Fll 2011 Lst updte: October 13, 2011 A recursive construction of efficiently decodble list-disjunct

More information

NUMERICAL INTEGRATION. The inverse process to differentiation in calculus is integration. Mathematically, integration is represented by.

NUMERICAL INTEGRATION. The inverse process to differentiation in calculus is integration. Mathematically, integration is represented by. NUMERICAL INTEGRATION 1 Introduction The inverse process to differentition in clculus is integrtion. Mthemticlly, integrtion is represented by f(x) dx which stnds for the integrl of the function f(x) with

More information

Reversals of Signal-Posterior Monotonicity for Any Bounded Prior

Reversals of Signal-Posterior Monotonicity for Any Bounded Prior Reversls of Signl-Posterior Monotonicity for Any Bounded Prior Christopher P. Chmbers Pul J. Hely Abstrct Pul Milgrom (The Bell Journl of Economics, 12(2): 380 391) showed tht if the strict monotone likelihood

More information

Lecture 3. Limits of Functions and Continuity

Lecture 3. Limits of Functions and Continuity Lecture 3 Limits of Functions nd Continuity Audrey Terrs April 26, 21 1 Limits of Functions Notes I m skipping the lst section of Chpter 6 of Lng; the section bout open nd closed sets We cn probbly live

More information

20 MATHEMATICS POLYNOMIALS

20 MATHEMATICS POLYNOMIALS 0 MATHEMATICS POLYNOMIALS.1 Introduction In Clss IX, you hve studied polynomils in one vrible nd their degrees. Recll tht if p(x) is polynomil in x, the highest power of x in p(x) is clled the degree of

More information

Heat flux and total heat

Heat flux and total heat Het flux nd totl het John McCun Mrch 14, 2017 1 Introduction Yesterdy (if I remember correctly) Ms. Prsd sked me question bout the condition of insulted boundry for the 1D het eqution, nd (bsed on glnce

More information

Online Supplements to Performance-Based Contracts for Outpatient Medical Services

Online Supplements to Performance-Based Contracts for Outpatient Medical Services Jing, Png nd Svin: Performnce-bsed Contrcts Article submitted to Mnufcturing & Service Opertions Mngement; mnuscript no. MSOM-11-270.R2 1 Online Supplements to Performnce-Bsed Contrcts for Outptient Medicl

More information

Week 10: Line Integrals

Week 10: Line Integrals Week 10: Line Integrls Introduction In this finl week we return to prmetrised curves nd consider integrtion long such curves. We lredy sw this in Week 2 when we integrted long curve to find its length.

More information

221B Lecture Notes WKB Method

221B Lecture Notes WKB Method Clssicl Limit B Lecture Notes WKB Method Hmilton Jcobi Eqution We strt from the Schrödinger eqution for single prticle in potentil i h t ψ x, t = [ ] h m + V x ψ x, t. We cn rewrite this eqution by using

More information

Chapter 0. What is the Lebesgue integral about?

Chapter 0. What is the Lebesgue integral about? Chpter 0. Wht is the Lebesgue integrl bout? The pln is to hve tutoril sheet ech week, most often on Fridy, (to be done during the clss) where you will try to get used to the ides introduced in the previous

More information

Notes on length and conformal metrics

Notes on length and conformal metrics Notes on length nd conforml metrics We recll how to mesure the Eucliden distnce of n rc in the plne. Let α : [, b] R 2 be smooth (C ) rc. Tht is α(t) (x(t), y(t)) where x(t) nd y(t) re smooth rel vlued

More information

Improper Integrals, and Differential Equations

Improper Integrals, and Differential Equations Improper Integrls, nd Differentil Equtions October 22, 204 5.3 Improper Integrls Previously, we discussed how integrls correspond to res. More specificlly, we sid tht for function f(x), the region creted

More information

New data structures to reduce data size and search time

New data structures to reduce data size and search time New dt structures to reduce dt size nd serch time Tsuneo Kuwbr Deprtment of Informtion Sciences, Fculty of Science, Kngw University, Hirtsuk-shi, Jpn FIT2018 1D-1, No2, pp1-4 Copyright (c)2018 by The Institute

More information

Math Lecture 23

Math Lecture 23 Mth 8 - Lecture 3 Dyln Zwick Fll 3 In our lst lecture we delt with solutions to the system: x = Ax where A is n n n mtrix with n distinct eigenvlues. As promised, tody we will del with the question of

More information

( dg. ) 2 dt. + dt. dt j + dh. + dt. r(t) dt. Comparing this equation with the one listed above for the length of see that

( dg. ) 2 dt. + dt. dt j + dh. + dt. r(t) dt. Comparing this equation with the one listed above for the length of see that Arc Length of Curves in Three Dimensionl Spce If the vector function r(t) f(t) i + g(t) j + h(t) k trces out the curve C s t vries, we cn mesure distnces long C using formul nerly identicl to one tht we

More information

CMDA 4604: Intermediate Topics in Mathematical Modeling Lecture 19: Interpolation and Quadrature

CMDA 4604: Intermediate Topics in Mathematical Modeling Lecture 19: Interpolation and Quadrature CMDA 4604: Intermedite Topics in Mthemticl Modeling Lecture 19: Interpoltion nd Qudrture In this lecture we mke brief diversion into the res of interpoltion nd qudrture. Given function f C[, b], we sy

More information

Frobenius numbers of generalized Fibonacci semigroups

Frobenius numbers of generalized Fibonacci semigroups Frobenius numbers of generlized Fiboncci semigroups Gretchen L. Mtthews 1 Deprtment of Mthemticl Sciences, Clemson University, Clemson, SC 29634-0975, USA gmtthe@clemson.edu Received:, Accepted:, Published:

More information

Exam 2, Mathematics 4701, Section ETY6 6:05 pm 7:40 pm, March 31, 2016, IH-1105 Instructor: Attila Máté 1

Exam 2, Mathematics 4701, Section ETY6 6:05 pm 7:40 pm, March 31, 2016, IH-1105 Instructor: Attila Máté 1 Exm, Mthemtics 471, Section ETY6 6:5 pm 7:4 pm, Mrch 1, 16, IH-115 Instructor: Attil Máté 1 17 copies 1. ) Stte the usul sufficient condition for the fixed-point itertion to converge when solving the eqution

More information

Lecture Note 9: Orthogonal Reduction

Lecture Note 9: Orthogonal Reduction MATH : Computtionl Methods of Liner Algebr 1 The Row Echelon Form Lecture Note 9: Orthogonl Reduction Our trget is to solve the norml eution: Xinyi Zeng Deprtment of Mthemticl Sciences, UTEP A t Ax = A

More information

Acceptance Sampling by Attributes

Acceptance Sampling by Attributes Introduction Acceptnce Smpling by Attributes Acceptnce smpling is concerned with inspection nd decision mking regrding products. Three spects of smpling re importnt: o Involves rndom smpling of n entire

More information

Lecture 3 ( ) (translated and slightly adapted from lecture notes by Martin Klazar)

Lecture 3 ( ) (translated and slightly adapted from lecture notes by Martin Klazar) Lecture 3 (5.3.2018) (trnslted nd slightly dpted from lecture notes by Mrtin Klzr) Riemnn integrl Now we define precisely the concept of the re, in prticulr, the re of figure U(, b, f) under the grph of

More information

19 Optimal behavior: Game theory

19 Optimal behavior: Game theory Intro. to Artificil Intelligence: Dle Schuurmns, Relu Ptrscu 1 19 Optiml behvior: Gme theory Adversril stte dynmics hve to ccount for worst cse Compute policy π : S A tht mximizes minimum rewrd Let S (,

More information

Stuff You Need to Know From Calculus

Stuff You Need to Know From Calculus Stuff You Need to Know From Clculus For the first time in the semester, the stuff we re doing is finlly going to look like clculus (with vector slnt, of course). This mens tht in order to succeed, you

More information

The final exam will take place on Friday May 11th from 8am 11am in Evans room 60.

The final exam will take place on Friday May 11th from 8am 11am in Evans room 60. Mth 104: finl informtion The finl exm will tke plce on Fridy My 11th from 8m 11m in Evns room 60. The exm will cover ll prts of the course with equl weighting. It will cover Chpters 1 5, 7 15, 17 21, 23

More information

Advanced Calculus: MATH 410 Uniform Convergence of Functions Professor David Levermore 11 December 2015

Advanced Calculus: MATH 410 Uniform Convergence of Functions Professor David Levermore 11 December 2015 Advnced Clculus: MATH 410 Uniform Convergence of Functions Professor Dvid Levermore 11 December 2015 12. Sequences of Functions We now explore two notions of wht it mens for sequence of functions {f n

More information

Credibility Hypothesis Testing of Fuzzy Triangular Distributions

Credibility Hypothesis Testing of Fuzzy Triangular Distributions 666663 Journl of Uncertin Systems Vol.9, No., pp.6-74, 5 Online t: www.jus.org.uk Credibility Hypothesis Testing of Fuzzy Tringulr Distributions S. Smpth, B. Rmy Received April 3; Revised 4 April 4 Abstrct

More information

FUNDAMENTALS OF REAL ANALYSIS by. III.1. Measurable functions. f 1 (

FUNDAMENTALS OF REAL ANALYSIS by. III.1. Measurable functions. f 1 ( FUNDAMNTALS OF RAL ANALYSIS by Doğn Çömez III. MASURABL FUNCTIONS AND LBSGU INTGRAL III.. Mesurble functions Hving the Lebesgue mesure define, in this chpter, we will identify the collection of functions

More information

Discrete Mathematics and Probability Theory Spring 2013 Anant Sahai Lecture 17

Discrete Mathematics and Probability Theory Spring 2013 Anant Sahai Lecture 17 EECS 70 Discrete Mthemtics nd Proility Theory Spring 2013 Annt Shi Lecture 17 I.I.D. Rndom Vriles Estimting the is of coin Question: We wnt to estimte the proportion p of Democrts in the US popultion,

More information

Czechoslovak Mathematical Journal, 55 (130) (2005), , Abbotsford. 1. Introduction

Czechoslovak Mathematical Journal, 55 (130) (2005), , Abbotsford. 1. Introduction Czechoslovk Mthemticl Journl, 55 (130) (2005), 933 940 ESTIMATES OF THE REMAINDER IN TAYLOR S THEOREM USING THE HENSTOCK-KURZWEIL INTEGRAL, Abbotsford (Received Jnury 22, 2003) Abstrct. When rel-vlued

More information

1.9 C 2 inner variations

1.9 C 2 inner variations 46 CHAPTER 1. INDIRECT METHODS 1.9 C 2 inner vritions So fr, we hve restricted ttention to liner vritions. These re vritions of the form vx; ǫ = ux + ǫφx where φ is in some liner perturbtion clss P, for

More information

STEP FUNCTIONS, DELTA FUNCTIONS, AND THE VARIATION OF PARAMETERS FORMULA. 0 if t < 0, 1 if t > 0.

STEP FUNCTIONS, DELTA FUNCTIONS, AND THE VARIATION OF PARAMETERS FORMULA. 0 if t < 0, 1 if t > 0. STEP FUNCTIONS, DELTA FUNCTIONS, AND THE VARIATION OF PARAMETERS FORMULA STEPHEN SCHECTER. The unit step function nd piecewise continuous functions The Heviside unit step function u(t) is given by if t

More information

Lecture notes. Fundamental inequalities: techniques and applications

Lecture notes. Fundamental inequalities: techniques and applications Lecture notes Fundmentl inequlities: techniques nd pplictions Mnh Hong Duong Mthemtics Institute, University of Wrwick Emil: m.h.duong@wrwick.c.uk Februry 8, 207 2 Abstrct Inequlities re ubiquitous in

More information

Quantum Physics II (8.05) Fall 2013 Assignment 2

Quantum Physics II (8.05) Fall 2013 Assignment 2 Quntum Physics II (8.05) Fll 2013 Assignment 2 Msschusetts Institute of Technology Physics Deprtment Due Fridy September 20, 2013 September 13, 2013 3:00 pm Suggested Reding Continued from lst week: 1.

More information

Theoretical foundations of Gaussian quadrature

Theoretical foundations of Gaussian quadrature Theoreticl foundtions of Gussin qudrture 1 Inner product vector spce Definition 1. A vector spce (or liner spce) is set V = {u, v, w,...} in which the following two opertions re defined: (A) Addition of

More information

Chapters 4 & 5 Integrals & Applications

Chapters 4 & 5 Integrals & Applications Contents Chpters 4 & 5 Integrls & Applictions Motivtion to Chpters 4 & 5 2 Chpter 4 3 Ares nd Distnces 3. VIDEO - Ares Under Functions............................................ 3.2 VIDEO - Applictions

More information

Vyacheslav Telnin. Search for New Numbers.

Vyacheslav Telnin. Search for New Numbers. Vycheslv Telnin Serch for New Numbers. 1 CHAPTER I 2 I.1 Introduction. In 1984, in the first issue for tht yer of the Science nd Life mgzine, I red the rticle "Non-Stndrd Anlysis" by V. Uspensky, in which

More information

2008 Mathematical Methods (CAS) GA 3: Examination 2

2008 Mathematical Methods (CAS) GA 3: Examination 2 Mthemticl Methods (CAS) GA : Exmintion GENERAL COMMENTS There were 406 students who st the Mthemticl Methods (CAS) exmintion in. Mrks rnged from to 79 out of possible score of 80. Student responses showed

More information

Bernoulli Numbers Jeff Morton

Bernoulli Numbers Jeff Morton Bernoulli Numbers Jeff Morton. We re interested in the opertor e t k d k t k, which is to sy k tk. Applying this to some function f E to get e t f d k k tk d k f f + d k k tk dk f, we note tht since f

More information

Properties of the Riemann Integral

Properties of the Riemann Integral Properties of the Riemnn Integrl Jmes K. Peterson Deprtment of Biologicl Sciences nd Deprtment of Mthemticl Sciences Clemson University Februry 15, 2018 Outline 1 Some Infimum nd Supremum Properties 2

More information

Section 6.1 INTRO to LAPLACE TRANSFORMS

Section 6.1 INTRO to LAPLACE TRANSFORMS Section 6. INTRO to LAPLACE TRANSFORMS Key terms: Improper Integrl; diverge, converge A A f(t)dt lim f(t)dt Piecewise Continuous Function; jump discontinuity Function of Exponentil Order Lplce Trnsform

More information

UNIT 1 FUNCTIONS AND THEIR INVERSES Lesson 1.4: Logarithmic Functions as Inverses Instruction

UNIT 1 FUNCTIONS AND THEIR INVERSES Lesson 1.4: Logarithmic Functions as Inverses Instruction Lesson : Logrithmic Functions s Inverses Prerequisite Skills This lesson requires the use of the following skills: determining the dependent nd independent vribles in n exponentil function bsed on dt from

More information

Euler, Ioachimescu and the trapezium rule. G.J.O. Jameson (Math. Gazette 96 (2012), )

Euler, Ioachimescu and the trapezium rule. G.J.O. Jameson (Math. Gazette 96 (2012), ) Euler, Iochimescu nd the trpezium rule G.J.O. Jmeson (Mth. Gzette 96 (0), 36 4) The following results were estblished in recent Gzette rticle [, Theorems, 3, 4]. Given > 0 nd 0 < s

More information

The Henstock-Kurzweil integral

The Henstock-Kurzweil integral fculteit Wiskunde en Ntuurwetenschppen The Henstock-Kurzweil integrl Bchelorthesis Mthemtics June 2014 Student: E. vn Dijk First supervisor: Dr. A.E. Sterk Second supervisor: Prof. dr. A. vn der Schft

More information

Numerical Analysis: Trapezoidal and Simpson s Rule

Numerical Analysis: Trapezoidal and Simpson s Rule nd Simpson s Mthemticl question we re interested in numericlly nswering How to we evlute I = f (x) dx? Clculus tells us tht if F(x) is the ntiderivtive of function f (x) on the intervl [, b], then I =

More information

Chapter 14. Matrix Representations of Linear Transformations

Chapter 14. Matrix Representations of Linear Transformations Chpter 4 Mtrix Representtions of Liner Trnsformtions When considering the Het Stte Evolution, we found tht we could describe this process using multipliction by mtrix. This ws nice becuse computers cn

More information