Online Learning of Power Allocation Policies in Energy Harvesting Communications


Pranav Sakulkar and Bhaskar Krishnamachari
Ming Hsieh Department of Electrical Engineering, Viterbi School of Engineering
University of Southern California, Los Angeles, CA, USA

Abstract—We consider the problem of power allocation over a time-varying channel with an unknown distribution in energy harvesting communication systems. In this problem, the transmitter needs to choose its transmit power based on the amount of stored energy in its battery, with the goal of maximizing the average rate obtained over time. We model this problem as a Markov decision process (MDP) with the transmitter as the agent, the battery status as the state, the transmit power as the action and the rate obtained as the reward. The average reward maximization problem over the MDP can be solved by a linear program (LP) that uses the transition probabilities of the state-action pairs and their mean rewards to choose a power allocation policy. Since the rewards associated with the state-action pairs are unknown, we propose an online learning algorithm called UCLP that learns these rewards and adapts its policy over time. The UCLP algorithm solves the LP at each time-step to choose its policy using the upper confidence bounds on the rewards. We prove that the reward loss or regret incurred by UCLP is upper bounded by a constant.

Index Terms—Energy harvesting communications, Markov decision process (MDP), online learning, contextual bandits.

I. INTRODUCTION

Communication systems whose transmissions are powered by harvested energy have rapidly emerged as a viable option for next-generation wireless networks with prolonged lifetime [1]. The performance of such systems depends on the efficient utilization of the energy that is currently stored in the battery, as well as the energy that is to be harvested over time. In [2], power allocation policies over a finite time horizon with known channel gain and harvested energy distributions are studied. In [3], a similar problem is analyzed, but the energy arrivals are assumed to be deterministic and known in advance. The algorithms presented in [4] assume knowledge of the energy arrivals and try to minimize the overall scheduling time for the data packets. In our problem, however, the channel gain distribution is unknown and the harvested energy is assumed to be stochastically varying with a known distribution. The transmitter has to decide the transmit power level based on the current battery status with the goal of maximizing the average expected transmission rate obtained over time. We model the system as an MDP with the battery status as the state, the transmit power as the action, and the rate as the reward. The power allocation problem, therefore, reduces to the average reward maximization problem for an MDP.

Our problem can also be seen through the lens of contextual bandits. In the standard contextual bandit problems [5], [6], [7], the contexts are assumed to be drawn from an unknown distribution independently over time. In this paper, we model the context transitions by MDPs. The action the agent takes at time t, therefore, affects not only the instantaneous reward but also the context in slot t + 1. Thus the agent needs to decide its actions with the global objective in mind, i.e. maximizing the average reward over time. It must be noted that the MDP formulation generalizes the standard contextual bandits [8] for the case where the mapping from the context and the random instance to the reward is a known monotonic function, since the i.i.d. context case can be viewed as a single-state MDP. Our problem is also closely related to the reinforcement learning problem over MDPs from [9], [10], [11]. The objective in these problems is to maximize the average undiscounted reward over time.
In [9], [10], the agent is unaware of both the transition probabilities and the rewards corresponding to the state-action pairs. In [11], the agent knows the rewards, but the transition probabilities are still unknown. In our problem, however, the transition probabilities of the MDP can be inferred from the knowledge of the arrival distribution and the action taken from each state. The goal of our problem is to maximize the average reward by learning the rewards of the state-action pairs over time. One additional feature of our problem is that the function mapping the state-action pair and the channel gain to the rate is known to the agent. The reward information revealed after every action can, therefore, be used to infer the rewards of the other state-action pairs as well. Our contributions in this paper are as follows:
- We formulate the power allocation problem as an online learning problem over an MDP with the goal of maximizing the average reward over time.
- We prove that the MDP is ergodic and therefore use its corresponding LP formulation to characterize the optimal policy.
- We propose an online learning algorithm UCLP that learns the rewards of the state-action pairs and adapts its policy over time.
- We characterize the regret contributions from following a non-optimal policy and from not being at stationarity while following the optimal policy, and prove a constant regret upper bound for UCLP.

Fig. 1. Power allocation over a wireless channel in energy harvesting communications: the harvested power p_t charges the battery (state Q_t), and the transmit power q_t is used for transmission from Tx to Rx over the wireless channel in each time slot.

This paper is organized as follows. First, we describe the model for the energy harvesting communication system studied in this paper, formulate this problem as an MDP and discuss the structure of the optimal policy in section II. We then propose our online learning algorithm UCLP and prove its regret bounds in section III. Section IV presents the results of numerical simulations for this problem and section V concludes the paper. We also include appendices B and C to discuss and prove some of the technical lemmas at the end of the paper.

II. SYSTEM MODEL

Consider a time-slotted energy harvesting communication system where the transmitter uses the harvested power for transmission over a channel with stochastically varying channel gains with an unknown distribution, as shown in figure 1. Let p_t denote the harvested power in the t-th slot, which is assumed to be i.i.d. over time. Let Q_t denote the stored energy in the transmitter's battery, which has a capacity of Q_max. Assume that the transmitter decides to use q_t (≤ Q_t) amount of power for transmission in the t-th slot. We assume a discrete and finite number of power levels for the harvested and transmit powers. The rate obtained during the t-th slot is assumed to follow the relationship

r_t = B log_2(1 + q_t X_t),    (1)

where X_t denotes the instantaneous channel gain-to-noise ratio of the channel, which is assumed to be i.i.d. over time, and B is the channel bandwidth. The battery state gets updated in the next slot as

Q_{t+1} = min{Q_t − q_t + p_t, Q_max}.    (2)

The goal is to utilize the harvested power and choose a transmit power q_t in each slot sequentially to maximize the expected average rate lim_{T→∞} E[(1/T) Σ_{t=1}^{T} r_t] obtained over time.

A. Problem Formulation

Consider an MDP M with a finite state space S and a finite action space A. Let A_s ⊆ A denote the set of allowed actions from state s. When the agent chooses an action a_t ∈ A_{s_t} in state s_t ∈ S, it receives a random reward r_t(s_t, a_t). Based on the agent's decision, the system undergoes a random transition to a state s_{t+1} according to the transition probability P(s_{t+1} | s_t, a_t). In the energy harvesting problem, the battery status Q_t represents the system state s_t and the transmit power q_t represents the action taken at any slot t. In this paper, we consider systems where the random rewards of the various state-action pairs can be modelled as

r_t(s_t, a_t) = f(s_t, a_t, X_t),    (3)

where f is a reward function known to the agent and X_t is a random variable internal to the system that is i.i.d. over time. Note that in the energy harvesting communications problem, the reward is the rate obtained in each slot and the reward function is defined in equation (1). In this problem, the channel gain-to-noise ratio X_t corresponds to the system's internal random variable. We assume that the distribution of the harvested energy p_t is known to the agent. This implies that the state transition probabilities P(s_{t+1} | s_t, a_t) can be inferred by the agent based on the update equation (2).

A policy is defined as any rule for choosing the actions in successive time slots. The action chosen at time t may, therefore, depend on the history of previous states, actions and rewards. It may even be randomized such that the action a_t ∈ A_{s_t} is chosen from some distribution over the actions. A policy is said to be stationary if the action chosen at time t is only a function of the system state at t. This means that a deterministic stationary policy β is a mapping from the state s ∈ S to its corresponding action a ∈ A_s. When a stationary policy is played, the sequence of states {s_t : t = 1, 2, ...} follows a Markov chain. An MDP is said to be ergodic if every deterministic stationary policy leads to an irreducible and aperiodic Markov chain.
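To make the dynamics of equations (1) and (2) concrete, the following minimal Python sketch simulates one trajectory of the system under a simple example policy that spends the whole battery in every slot. The bandwidth, the harvested-power distribution and the policy are illustrative assumptions rather than parameters from the paper; the scaled Bernoulli channel mirrors the one used in the simulations of section IV.

import numpy as np

rng = np.random.default_rng(0)

B = 1.0      # channel bandwidth (illustrative)
Q_max = 4    # battery capacity (illustrative)
T = 10       # number of slots to simulate

Q = 0        # initial battery state Q_0
for t in range(T):
    p_t = rng.integers(0, Q_max + 1)                 # harvested power, i.i.d. over slots
    X_t = rng.choice([0.0, 10.0], p=[0.8, 0.2])      # channel gain-to-noise ratio, i.i.d.
    q_t = Q                                          # example policy: spend the whole battery
    r_t = B * np.log2(1.0 + q_t * X_t)               # rate from equation (1)
    Q = min(Q - q_t + p_t, Q_max)                    # battery update from equation (2)
    print(f"slot {t}: q={q_t}, X={X_t}, rate={r_t:.3f}, next battery={Q}")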
According to section V.3 of [12], the average reward of an ergodic MDP with a finite state space can be maximized by an appropriate deterministic stationary policy β. In order to arrive at an ergodic MDP for the energy harvesting communications problem, we make the following assumptions: 1) when the battery state Q_t > 0, the transmit power q_t > 0; 2) the distribution of the harvested energy is such that Pr{p_t = p} > 0 for all 0 ≤ p ≤ Q_max. Under these assumptions, we prove the ergodicity of the MDP as follows.

Proposition 1. The MDP corresponding to the power allocation application in energy harvesting communications is ergodic.

Proof: Consider any policy β and let P^(n)(s, s') be the n-step transition probabilities associated with the Markov chain resulting from the policy. First, we prove that P^(1)(s, s') > 0 for any s' ≥ s as follows. According to the state update equations,

s_{t+1} = s_t − β(s_t) + p_t.    (4)

The transition probabilities can, therefore, be expressed as

P^(1)(s, s') = Pr{p_t = s' − s + β(s)} > 0,    (5)

since s' ≥ s and β(s) ≥ 0 for all states.

This implies that any state s' ∈ S is accessible from any other state s in the resultant Markov chain if s' ≥ s. Now, we prove that P^(1)(s, s − 1) > 0 for all s ≥ 1 as follows. From equation (5), we observe that

P^(1)(s, s − 1) = Pr{p_t = β(s) − 1} > 0,    (6)

since β(s) ≥ 1 for all s ≥ 1. This implies that every state s ∈ S is accessible from the state s + 1 in the resultant Markov chain. Equations (5) and (6) imply that all the state pairs (s, s + 1) communicate with each other. Since communication is an equivalence relationship, all the states communicate with each other and the resultant Markov chain is irreducible. Also, equation (5) implies that P^(1)(s, s) > 0 for all the states, and the Markov chain is, therefore, aperiodic.

Since the MDP under consideration is ergodic, we restrict ourselves to the set of deterministic stationary policies, which we interchangeably refer to as policies henceforth. Let µ(s, a) denote the expected reward associated with the state-action pair (s, a), which can be expressed as

µ(s, a) = E[r(s, a)] = E_X[f(s, a, X)].    (7)

For ergodic MDPs, the optimal mean reward ρ* is independent of the initial state (see [13], section 8.3.3). It is specified as

ρ* = max_{β ∈ B} ρ(β, M),    (8)

where B is the set of all policies, M is the matrix whose (s, a)-th entry is µ(s, a), and ρ(β, M) is the average expected reward per slot using policy β. We use the optimal mean reward as the benchmark and define the cumulative regret of a learning algorithm after T time-slots as

R(T) := T ρ* − E[Σ_{t=1}^{T} r_t].    (9)

B. Optimal Stationary Policy

When the expected rewards µ(s, a) of all state-action pairs and the transition probabilities P(s' | s, a) are known, the problem of determining the optimal policy that maximizes the average expected reward over time can be formulated as the linear program (LP) shown below (see e.g. [12], section V.3):

maximize   Σ_{s ∈ S} Σ_{a ∈ A_s} π(s, a) µ(s, a)
subject to π(s, a) ≥ 0 for all s ∈ S, a ∈ A_s,
           Σ_{s ∈ S} Σ_{a ∈ A_s} π(s, a) = 1,
           Σ_{a ∈ A_{s'}} π(s', a) = Σ_{s ∈ S} Σ_{a ∈ A_s} π(s, a) P(s' | s, a) for all s' ∈ S,    (10)

where π(s, a) denotes the stationary distribution of the MDP. The objective function of the LP from equation (10) gives the average rate corresponding to the stationary distribution π(s, a), while the constraints make sure that this stationary distribution corresponds to a valid policy on the MDP. Such LPs can be solved by using standard solvers like CVXPY [14]. If π*(s, a) is the solution to the LP from (10), then for every s ∈ S, π*(s, a) > 0 for only one action a ∈ A_s. This is due to the fact that the optimal policy β* is deterministic for ergodic MDPs in average reward maximization problems (see [13], section 8.3.3). Thus for this problem, β*(s) = arg max_{a ∈ A_s} π*(s, a). Note that we henceforth drop the action index from the stationary distribution, since the policies under consideration are deterministic and the corresponding action is, therefore, deterministically known. In general, we use π_β(s) to denote the stationary distribution corresponding to the policy β. It must be noted that the stationary distribution of any policy is independent of the reward values and only depends on the transition probabilities of the state-action pairs. The expected average reward depends on the stationary distribution as

ρ(β, M) = Σ_{s ∈ S} π_β(s) µ(s, β(s)).    (11)

In terms of this notation, the LP from (10) is equivalent to max_{β ∈ B} ρ(β, M). Since the matrix M is unknown, we develop an online learning framework to learn the optimal policy in the next section.

III. ONLINE LEARNING ALGORITHMS

For the power allocation problem under consideration, although the agent knows the state transition probabilities, the mean rewards µ(s, a) of the state-action pairs are still unknown. Hence, the agent cannot solve the LP from (10) to figure out the optimal policy.
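For reference, when the mean rewards are available, the LP (10) can be posed directly in CVXPY [14]. The sketch below is illustrative rather than the authors' implementation; the function and array names (solve_average_reward_lp, rewards, P, valid) are assumptions introduced here. The same routine can later be called with the UCB matrix in place of the mean rewards, which is exactly how UCLP uses it.

import numpy as np
import cvxpy as cp

def solve_average_reward_lp(rewards, P, valid):
    """Sketch of the LP (10): maximize the average reward over stationary
    state-action distributions pi(s, a).

    rewards[s, a] : reward value used in the objective (mu or, later, a UCB)
    P[s, a, s2]   : known transition probabilities P(s2 | s, a)
    valid[s]      : list of allowed actions A_s for state s
    """
    S, A = rewards.shape
    pi = cp.Variable((S, A), nonneg=True)          # pi(s, a) >= 0

    constraints = [cp.sum(pi) == 1]                # total probability is 1
    for s in range(S):                             # no mass on disallowed actions
        for a in range(A):
            if a not in valid[s]:
                constraints.append(pi[s, a] == 0)
    for s2 in range(S):                            # stationarity (balance) constraints
        inflow = cp.sum(cp.multiply(pi, P[:, :, s2]))
        constraints.append(cp.sum(pi[s2, :]) == inflow)

    cp.Problem(cp.Maximize(cp.sum(cp.multiply(pi, rewards))), constraints).solve()

    pi_star = pi.value
    policy = {s: int(np.argmax(pi_star[s])) for s in range(S)}   # beta(s) = argmax_a pi(s, a)
    return pi_star, policy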
Any online learning algorithm needs to learn the reward values over time and update its policy adaptively. One interesting aspect of the problem, however, is that the reward function from equation (3) is known to the agent. Since the reward function under consideration (1) is bijective, once the reward is revealed to the agent, it can invert the function to infer the instantaneous realization of the random variable X_t. This inference can be used to predict the rewards that would have been obtained for the other state-action pairs using the knowledge of the function. In our online learning framework, we store the average values of these inferred rewards θ_t(s, a) for all state-action pairs. Also, we define confidence bounds at time t:

u_{t,λ}(s, a) = θ_t(s, a) + B(s, a) √(λ ln t / t),    (12)
l_{t,λ}(s, a) = θ_t(s, a) − B(s, a) √(λ ln t / t),    (13)

which are referred to as the UCB and LCB, respectively, where B(s, a) ≥ max_x f(s, a, x) − min_x f(s, a, x) denotes any upper bound on the maximum possible range of the reward of the state-action pair (s, a). The idea behind our algorithms is to use the UCB values in the objective function of the LP from (10) instead of the unknown µ(s, a) values. Since the θ_t(s, a) values get updated after each reward revelation, the agent needs to solve the LP again and again.

We propose our online learning algorithm UCLP, where the agent solves the LP at each slot. Although the agent is unaware of the actual µ(s, a) values, it learns the statistics θ_t(s, a) over time and eventually figures out the optimal policy. We use the following notation in the analysis of our algorithms: B_0 := max_{(s,a)} B(s, a), Δ_min := ρ* − max_{β ≠ β*} ρ(β, M). The total numbers of states and actions are specified as S := |S| and A := |A|, respectively. Also, U_{t,λ} and L_{t,λ} denote the matrices containing the entries u_{t,λ}(s, a) and l_{t,λ}(s, a) at time t, respectively.

The UCLP algorithm presented in Algorithm 1 solves the LP at each time-step and updates its policy based on the solution obtained. It stores only one θ value per state-action pair; its required storage is, therefore, O(SA). In Theorem 1, we derive an upper bound on the expected number of slots where the LP fails to find the optimal solution using UCLP. We use this result to bound the total expected regret of UCLP in Theorem 2. These results guarantee that the regret is always upper bounded by a constant. Note that, for ease of exposition, we assume that the time starts at t = 0. This simplifies the analysis, but has no significant impact on the regret bounds.

Algorithm 1 UCLP
1: Parameters: λ > 1/2.
2: Initialization: for all (s, a) pairs, θ(s, a) = 0.
3: for n = 0 do
4:   Given the state s_0, choose any valid action;
5:   Update all (s, a) pairs: θ(s, a) = f(s, a, x_0);
6: end for
7: // MAIN LOOP
8: while 1 do
9:   n = n + 1;
10:  Confidence bounds: u_{n,λ}(s, a) = θ(s, a) + B(s, a) √(λ ln n / n);
11:  Solve the LP from (10) with u_{n,λ}(s, a) instead of the unknown µ(s, a);
12:  In terms of the LP solution π^(n), define β_n(s) = arg max_{a ∈ A_s} π^(n)(s, a) for all s ∈ S;
13:  Given the state s_n, select the action β_n(s_n);
14:  Update for all (s, a) pairs: θ(s, a) ← (n θ(s, a) + f(s, a, x_n)) / (n + 1);
15: end while

Theorem 1. The expected number of slots where non-optimal policies are played by UCLP is upper bounded by

n_0 + (1 + A) S σ_λ,    (14)

where σ_λ = Σ_{t=1}^{∞} t^{−2λ} and n_0 denotes the minimum value of n ∈ N for which Δ_min ≥ 2 B_0 √(λ ln n / n).

The proof of Theorem 1 is provided in appendix A. UCLP requires λ > 1/2 in order to have a constant regret upper bound from equation (14), since σ_λ < ∞ for λ > 1/2. It is important to note that even if the optimal policy is found by the LP and played during certain slots, it does not mean that the regret contribution of those slots is zero. According to the definition of regret from equation (9), the regret contribution of a certain slot is zero if and only if the optimal policy is played and the corresponding Markov chain is at its stationary distribution. In appendix C, we introduce tools to analyze the mixing of Markov chains and characterize this regret contribution in Theorem 3. These results are used to upper bound the UCLP regret in the next theorem.

Theorem 2. The total expected regret of UCLP is upper bounded by

(n_0 + (1 + A) S σ_λ) Δ_max + (1 + (1 + A) S σ_λ) µ_max / (1 − γ*),    (15)

where γ* = max_{s, s' ∈ S} ||P*(s, ·) − P*(s', ·)||_TV, P* denotes the transition probability matrix corresponding to the optimal policy, µ_max = max_{s ∈ S, a ∈ A_s} µ(s, a) and Δ_max = ρ* − min_{s ∈ S, a ∈ A_s} µ(s, a).

Proof: The regret of UCLP arises when either non-optimal actions are taken, or optimal actions are taken but the corresponding Markov chain is not at stationarity. For the first source of regret, it is sufficient to analyze the number of instances where the LP fails to find the optimal policy. For the second source, however, we need to analyze the total number of phases where the optimal policy is found in succession. Since only the optimal policy is played in the consecutive slots of such a phase, it corresponds to transitions on the Markov chain associated with the optimal policy and the tools from appendix C can be applied.
According to Theorem 3, the regret contribution of any such phase is bounded from above by µ_max / (1 − γ*). As proved in Theorem 1, for t ≥ n_0, the expected number of instances of non-optimal policies is upper bounded by (1 + A) S σ_λ. Even if none of these instances appear in successive slots, the expected number of optimal phases is upper bounded by 1 + (1 + A) S σ_λ. Hence, for t ≥ n_0, the expected regret contribution from the slots following the optimal policy is upper bounded by

(1 + (1 + A) S σ_λ) µ_max / (1 − γ*).    (16)

Note that the maximum regret possible during one slot is Δ_max. Hence, for the first n_0 slots, the regret is bounded by n_0 Δ_max. Since there are at most (1 + A) S σ_λ slots with a non-optimal policy for t ≥ n_0 in expectation, their expected regret is upper bounded by (1 + A) S σ_λ Δ_max. The overall expected regret of the UCLP algorithm is, therefore, bounded from above by equation (15).

Remark 1. It must be noted that we call two policies the same if and only if they recommend identical actions for every state. It is, therefore, possible for a non-optimal policy to recommend optimal actions for some of the states. In the analysis of UCLP, however, we assumed that any occurrence of a non-optimal policy contributes to the regret. Although this is not necessary, it leads us to a valid upper bound in the proof.
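Putting Algorithm 1 and the confidence bounds (12)-(13) together, a minimal sketch of the UCLP loop is given below. The environment object env, the reward-inversion helper invert_reward and the solve_average_reward_lp routine from the earlier LP sketch are illustrative assumptions, not an implementation supplied with the paper; for the rate function (1), inverting the reward amounts to solving r = B log_2(1 + q X) for X whenever q > 0.

import numpy as np

def uclp(env, f, invert_reward, B_range, P, valid, lam=2.0, horizon=1000):
    """Sketch of UCLP (Algorithm 1).

    env           : environment with reset() -> s_0 and step(a) -> (reward, next state)
    f             : known reward function f(s, a, x) from equation (3)
    invert_reward : assumed helper returning the realization x with f(s, a, x) = r
    B_range       : B_range[s, a] bounds the reward range of the pair (s, a)
    P, valid      : known transition probabilities P[s, a, s'] and allowed actions A_s
    """
    S, A = B_range.shape
    theta = np.zeros((S, A))                       # running averages of inferred rewards

    # n = 0: take any valid action and initialize every theta(s, a).
    s = env.reset()
    a = valid[s][0]
    r, s_next = env.step(a)
    x = invert_reward(s, a, r)
    theta = np.array([[f(si, ai, x) for ai in range(A)] for si in range(S)])
    s = s_next

    for n in range(1, horizon):
        radius = B_range * np.sqrt(lam * np.log(n) / n)
        ucb = theta + radius                                   # u_{n,lambda}(s, a), eq. (12)
        _, policy = solve_average_reward_lp(ucb, P, valid)     # LP (10) with UCBs
        a = policy[s]
        r, s_next = env.step(a)
        x = invert_reward(s, a, r)                             # infer X_n from the revealed reward
        for si in range(S):                                    # line 14: update all pairs
            for ai in range(A):
                theta[si, ai] = (n * theta[si, ai] + f(si, ai, x)) / (n + 1)
        s = s_next
    return policy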

IV. NUMERICAL SIMULATIONS

We perform simulations for the power allocation problem with S = {0, 1, 2, 3, 4} and A = {0, 1, 2, 3, 4}. Note that each state s corresponds to Q_t from equation (2) with Q_max = 4 and each action a corresponds to the transmit power q_t from equation (1). The valid actions for each state are shown in table I. The reward function is the rate function from equation (1) and the channel gain is a scaled Bernoulli random variable with Pr{X_t = 10} = 0.2 and Pr{X_t = 0} = 0.8. We use CVXPY [14] for solving the LPs in our algorithm. For the simulations in figure 2, we use λ = 2 and plot the average regret performance over 10^3 independent runs of the different algorithms. Here, the naive policy never uses the battery, i.e. it uses all the arriving power for the current transmission. Note that the optimal policy also incurs a regret because the corresponding Markov chain is not at stationarity. We observe that UCLP follows the performance of the optimal policy, with the difference in regret stemming from the first few time-slots when the channel statistics are not yet properly learnt and UCLP thus fails to find the optimal policy. As time progresses, UCLP finds the optimal policy and its regret follows the regret pattern of the optimal policy.

Fig. 2. Regret performance of the different algorithms: regret versus the number of trials for UCLP, the optimal policy and the naive policy.

TABLE I
VALID ACTIONS PER STATE

State (s)    Actions (A_s)
0            {0}
1            {1}
2            {1, 2}
3            {1, 2, 3}
4            {1, 2, 3, 4}

V. CONCLUSION

We have considered the problem of power allocation over a stochastically varying channel with an unknown distribution in an energy harvesting communication system. We have cast this problem as an online learning problem over an MDP. If the transition probabilities and the mean rewards associated with the MDP are known, the optimal policy maximizing the average expected reward over time can be found by solving an LP specified in the paper. Since the agent is only assumed to know the distribution of the harvested energy, she needs to learn the rewards of the state-action pairs over time and make her decisions based on the learnt behaviour. For this problem, we have proposed our online learning algorithm UCLP, which solves the LP at each time-slot using the updated upper confidence bounds of the rewards instead of the unknown mean rewards. We have shown that the regret incurred by UCLP is bounded from above by a constant. Through the numerical simulations, we have shown that the regret of UCLP is very close to that of the optimal policy. The UCLP algorithm presented in this paper solves an LP and therefore requires a lot of computation at each time-step. Reducing the computational needs of UCLP is a potential direction for future work in this area.

REFERENCES

[1] S. Ulukus, A. Yener, E. Erkip, O. Simeone, M. Zorzi, P. Grover, and K. Huang, "Energy harvesting wireless communications: A review of recent advances," IEEE Journal on Selected Areas in Communications, vol. 33, no. 3.
[2] C. K. Ho and R. Zhang, "Optimal energy allocation for wireless communications with energy harvesting constraints," IEEE Transactions on Signal Processing, vol. 60, no. 9.
[3] K. Tutuncuoglu and A. Yener, "Optimum transmission policies for battery limited energy harvesting nodes," IEEE Transactions on Wireless Communications, vol. 11, no. 3.
[4] J. Yang and S. Ulukus, "Optimal packet scheduling in an energy harvesting communication system," IEEE Transactions on Communications, vol. 60, no. 1.
[5] J. Langford and T. Zhang, "The epoch-greedy algorithm for multi-armed bandits with side information," in Advances in Neural Information Processing Systems.
[6] M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang, "Efficient optimal learning for contextual bandits," in Conference on Uncertainty in Artificial Intelligence.
[7] A. Agarwal, D. Hsu, S. Kale, J. Langford, L. Li, and R. E. Schapire, "Taming the monster: A fast and simple algorithm for contextual bandits," in International Conference on Machine Learning.
[8] P. Sakulkar and B. Krishnamachari, "Stochastic contextual bandits with known reward functions," USC ANRG Technical Report, http://anrg.usc.edu/www/papers/dcb_ANRG_TechReport.pdf.
[9] P. Ortner and R. Auer, "Logarithmic online regret bounds for undiscounted reinforcement learning," in Proceedings of the 2006 Conference on Advances in Neural Information Processing Systems, vol. 19, p. 49.
[10] P. Auer, T. Jaksch, and R. Ortner, "Near-optimal regret bounds for reinforcement learning," in Advances in Neural Information Processing Systems.
[11] A. Tewari and P. L. Bartlett, "Optimistic linear programming gives logarithmic regret for irreducible MDPs," in Advances in Neural Information Processing Systems.
[12] S. Ross, Introduction to Stochastic Dynamic Programming. Academic Press.
[13] M. L. Puterman, Markov Decision Processes: Discrete Stochastic Dynamic Programming. John Wiley & Sons.
[14] S. Diamond and S. Boyd, "CVXPY: A Python-embedded modeling language for convex optimization," Journal of Machine Learning Research, to appear.
[15] W. Hoeffding, "Probability inequalities for sums of bounded random variables," Journal of the American Statistical Association, vol. 58, no. 301, 1963.

[16] D. A. Levin, Y. Peres, and E. L. Wilmer, Markov Chains and Mixing Times. American Mathematical Society.

APPENDIX A
PROOF OF THEOREM 1

Let β_t denote the policy obtained by UCLP at time t and let I(z) be the indicator function, defined to be 1 when the predicate z is true and 0 otherwise. The number of slots where non-optimal policies are played can be expressed as

N_1 = 1 + Σ_{t=1}^{∞} I{β_t ≠ β*}
    ≤ n_0 + Σ_{t=n_0}^{∞} I{β_t ≠ β*}
    ≤ n_0 + Σ_{t=n_0}^{∞} I{ρ(β_t, U_{t,λ}) ≥ ρ(β*, U_{t,λ})}.    (17)

We observe that ρ(β_t, U_{t,λ}) ≥ ρ(β*, U_{t,λ}) implies that at least one of the following inequalities must be true:

ρ(β*, U_{t,λ}) ≤ ρ(β*, M),    (18)
ρ(β_t, L_{t,λ}) ≥ ρ(β_t, M),    (19)
ρ(β*, M) < ρ(β_t, M) + ρ(β_t, U_{t,λ}) − ρ(β_t, L_{t,λ}).    (20)

Hence, we upper bound the probability of each of these events. For the first event, from condition (18) we get

Pr{ρ(β*, U_{t,λ}) ≤ ρ(β*, M)}
  = Pr{Σ_{s ∈ S} π*(s, β*(s)) u_{t,λ}(s, β*(s)) ≤ Σ_{s ∈ S} π*(s, β*(s)) µ(s, β*(s))}
  ≤ Pr{for at least one state s ∈ S: π*(s, β*(s)) u_{t,λ}(s, β*(s)) ≤ π*(s, β*(s)) µ(s, β*(s))}
  ≤ Σ_{s ∈ S} Pr{π*(s, β*(s)) u_{t,λ}(s, β*(s)) ≤ π*(s, β*(s)) µ(s, β*(s))}
  = Σ_{s ∈ S} Pr{u_{t,λ}(s, β*(s)) ≤ µ(s, β*(s))}
  ≤(a) Σ_{s ∈ S} t^{−2λ} = S t^{−2λ},    (21)

where (a) holds due to the concentration of the confidence bounds from Lemma 2 (see appendix B). Similarly, for the second event, from condition (19) we get

Pr{ρ(β_t, L_{t,λ}) ≥ ρ(β_t, M)}
  = Pr{Σ_{s ∈ S} Σ_{a ∈ A_s} π_{β_t}(s, a) l_{t,λ}(s, a) ≥ Σ_{s ∈ S} Σ_{a ∈ A_s} π_{β_t}(s, a) µ(s, a)}
  ≤ Pr{for at least one state-action pair (s, a): π_{β_t}(s, a) l_{t,λ}(s, a) ≥ π_{β_t}(s, a) µ(s, a)}
  ≤ Σ_{s ∈ S} Σ_{a ∈ A_s} Pr{π_{β_t}(s, a) l_{t,λ}(s, a) ≥ π_{β_t}(s, a) µ(s, a)}
  = Σ_{s ∈ S} Σ_{a ∈ A_s} Pr{l_{t,λ}(s, a) ≥ µ(s, a)}
  ≤(b) Σ_{s ∈ S} Σ_{a ∈ A_s} t^{−2λ} ≤ S A t^{−2λ},    (22)

where (b) holds due to the concentration bounds from Lemma 2 (see appendix B). Now let us analyze the third event from condition (20):

ρ(β_t, U_{t,λ}) − ρ(β_t, L_{t,λ})
  = Σ_{s ∈ S} Σ_{a ∈ A_s} π_{β_t}(s, a) u_{t,λ}(s, a) − Σ_{s ∈ S} Σ_{a ∈ A_s} π_{β_t}(s, a) l_{t,λ}(s, a)
  = Σ_{s ∈ S} Σ_{a ∈ A_s} π_{β_t}(s, a) (u_{t,λ}(s, a) − l_{t,λ}(s, a))
  = Σ_{s ∈ S} Σ_{a ∈ A_s} π_{β_t}(s, a) · 2 B(s, a) √(λ ln t / t)
  ≤ 2 √(λ ln t / t) Σ_{s ∈ S} Σ_{a ∈ A_s} π_{β_t}(s, a) B_0
  ≤ 2 B_0 √(λ ln t / t).    (23)

Since Δ_min ≤ ρ(β*, M) − ρ(β_t, M) for any non-optimal β_t, for all t ≥ n_0 we get

ρ(β*, M) − ρ(β_t, M) − (ρ(β_t, U_{t,λ}) − ρ(β_t, L_{t,λ}))
  ≥ Δ_min − 2 B_0 √(λ ln t / t)
  ≥ Δ_min − 2 B_0 √(λ ln n_0 / n_0)
  ≥ 0.    (24)

This implies that condition (20) is always false for t ≥ n_0. The expected number of incorrect policies from equation (17) can, therefore, be expressed as

E[N_1] ≤ n_0 + Σ_{t=n_0}^{∞} Pr{ρ(β_t, U_{t,λ}) ≥ ρ(β*, U_{t,λ})}
  ≤ n_0 + Σ_{t=n_0}^{∞} (Pr{ρ(β*, U_{t,λ}) ≤ ρ(β*, M)} + Pr{ρ(β_t, L_{t,λ}) ≥ ρ(β_t, M)})
  ≤ n_0 + Σ_{t=n_0}^{∞} (S t^{−2λ} + S A t^{−2λ})
  = n_0 + (1 + A) S Σ_{t=n_0}^{∞} t^{−2λ}
  ≤ n_0 + (1 + A) S σ_λ,    (25)

where σ_λ < ∞ since λ > 1/2.

APPENDIX B
TECHNICAL LEMMAS AND PROOFS

Lemma 1 (Hoeffding's concentration inequality from [15]). Let Y_1, ..., Y_n be i.i.d. random variables with mean µ and range [0, 1]. Let S_n = Σ_{t=1}^{n} Y_t. Then for all α ≥ 0,

Pr{S_n ≥ nµ + α} ≤ e^{−2α²/n},
Pr{S_n ≤ nµ − α} ≤ e^{−2α²/n}.

Lemma 2 (Concentration of confidence bounds). At any time t, for any valid state-action pair (s, a), the following inequalities hold:
1) Pr{u_{t,λ}(s, a) ≤ µ(s, a)} ≤ t^{−2λ},
2) Pr{l_{t,λ}(s, a) ≥ µ(s, a)} ≤ t^{−2λ}.

Proof: For the first inequality,

Pr{u_{t,λ}(s, a) ≤ µ(s, a)}
  = Pr{θ_t(s, a) + B(s, a) √(λ ln t / t) ≤ µ(s, a)}
  = Pr{(θ_t(s, a) − µ(s, a)) / B(s, a) ≤ −√(λ ln t / t)}
  ≤(a) e^{−2(λ t ln t)/t} = t^{−2λ},    (26)

where (a) is obtained by applying the left-sided Hoeffding inequality (see Lemma 1) to the t normalized reward samples with α = √(λ t ln t). Similarly, using the right-sided version of the concentration inequality, we get the second inequality.

APPENDIX C
ANALYSIS OF MARKOV CHAIN MIXING

We briefly introduce the tools required for the analysis of Markov chain mixing (see [16], chapter 4 for a detailed discussion). The total variation (TV) distance between two probability distributions φ and ψ on a sample space Ω is defined by

||φ − ψ||_TV = max_{E ⊆ Ω} |φ(E) − ψ(E)|.    (27)

Intuitively, the TV distance between φ and ψ is the maximum difference between the probabilities that the two distributions assign to a single event. The TV distance is related to the L1 distance as follows:

||φ − ψ||_TV = (1/2) Σ_{ω ∈ Ω} |φ(ω) − ψ(ω)|.    (28)

We wish to bound the maximal distance between the stationary distribution π and the distribution over the states after t steps of a Markov chain. Let P^(t) be the t-step transition matrix, with P^(t)(s, s') being the transition probability from state s to state s' in t steps, and let P denote the collection of all probability distributions on Ω. Also, let P^(t)(s, ·) be the row, or distribution, corresponding to the initial state s. Based on these notations, we define two useful t-step distances as follows:

d(t) := max_{s ∈ S} ||π − P^(t)(s, ·)||_TV = sup_{φ ∈ P} ||π − φ P^(t)||_TV,    (29)
d̂(t) := max_{s, s' ∈ S} ||P^(t)(s, ·) − P^(t)(s', ·)||_TV = sup_{ψ, φ ∈ P} ||ψ P^(t) − φ P^(t)||_TV.    (30)

For irreducible and aperiodic Markov chains, the distances d(t) and d̂(t) have the following special properties.

Lemma 3 ([16], lemma 4.11). For all t > 0, d(t) ≤ d̂(t) ≤ 2 d(t).

Lemma 4 ([16], lemma 4.12). The function d̂ is submultiplicative: d̂(t_1 + t_2) ≤ d̂(t_1) d̂(t_2).

These lemmas lead to the following useful corollary.

Corollary 1. For all t ≥ 0, d(t) ≤ d̂(1)^t.

Consider an MDP with optimal stationary policy β*. Since the MDP might not start at the stationary distribution π* corresponding to the optimal policy, even the optimal policy incurs some regret as defined in equation (9). We characterize this regret in the following theorem.

Theorem 3 (Regret of the optimal policy). For an ergodic MDP, the total expected regret of the optimal stationary policy with transition probability matrix P* is upper bounded by µ_max / (1 − γ*), where γ* = max_{s, s' ∈ S} ||P*(s, ·) − P*(s', ·)||_TV and µ_max = max_{s ∈ S, a ∈ A_s} µ(s, a).

Proof: Let φ_0 be the initial distribution over the states and let φ_t = φ_0 P*^(t) be the corresponding distribution at time t, both represented as row vectors. Also, let µ* be the row vector whose entry corresponding to state s is µ(s, β*(s)). We use d*(t) and d̂*(t) to denote the t-step distances from equations (29) and (30) for the optimal policy. Ergodicity of the MDP ensures that the Markov chain corresponding to the optimal policy is irreducible and aperiodic, and thus Lemmas 3 and 4 hold. The regret of the optimal policy, therefore, simplifies as:

R*(φ_0, T) = T ρ* − Σ_{t=1}^{T} φ_t µ*ᵀ
  = Σ_{t=1}^{T} (π* µ*ᵀ − φ_t µ*ᵀ)
  = Σ_{t=1}^{T} (π* − φ_t) µ*ᵀ
  ≤ Σ_{t=1}^{T} (π* − φ_t)_+ µ*ᵀ    (negative entries ignored)
  = Σ_{t=1}^{T} Σ_{s ∈ S} (π*(s) − φ_t(s))_+ µ*(s)
  ≤ µ_max Σ_{t=1}^{T} Σ_{s ∈ S} (π*(s) − φ_t(s))_+
  = µ_max Σ_{t=1}^{T} ||π* − φ_0 P*^(t)||_TV
  ≤ µ_max Σ_{t=1}^{T} d*(t)
  ≤ µ_max Σ_{t=1}^{T} (d̂*(1))^t    (from Corollary 1)
  = µ_max Σ_{t=1}^{T} (γ*)^t
  ≤ µ_max γ* / (1 − γ*) ≤ µ_max / (1 − γ*).

Note that this regret bound is independent of the initial distribution over the states.
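For intuition, the quantities in (27)-(30) are easy to evaluate numerically for a small chain. The sketch below computes the TV distance, the coefficient γ* = d̂(1) and the geometric bound of Corollary 1 for an illustrative two-state transition matrix; the matrix itself is an assumption and not taken from the paper.

import numpy as np

def tv_distance(phi, psi):
    """Total variation distance, equation (28): half the L1 distance."""
    return 0.5 * np.sum(np.abs(phi - psi))

def d_hat(P):
    """d_hat(1) = max_{s, s'} || P(s, .) - P(s', .) ||_TV, i.e. gamma*."""
    S = P.shape[0]
    return max(tv_distance(P[s], P[s2]) for s in range(S) for s2 in range(S))

# Illustrative two-state chain (not from the paper).
P = np.array([[0.9, 0.1],
              [0.4, 0.6]])

# Stationary distribution: left eigenvector of P with eigenvalue 1.
evals, evecs = np.linalg.eig(P.T)
pi = np.real(evecs[:, np.argmin(np.abs(evals - 1.0))])
pi = pi / pi.sum()

gamma = d_hat(P)
phi = np.array([1.0, 0.0])          # start deterministically in state 0
for t in range(1, 6):
    phi = phi @ P                   # distribution over states after t steps
    print(t, tv_distance(pi, phi), gamma ** t)   # d(t) <= d_hat(1)^t, as in Corollary 1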


More information

Martingales Stopping Time Processes

Martingales Stopping Time Processes IOSR Journal of Mahemaics (IOSR-JM) e-issn: 2278-5728, p-issn: 2319-765. Volume 11, Issue 1 Ver. II (Jan - Feb. 2015), PP 59-64 www.iosrjournals.org Maringales Sopping Time Processes I. Fulaan Deparmen

More information

A Shooting Method for A Node Generation Algorithm

A Shooting Method for A Node Generation Algorithm A Shooing Mehod for A Node Generaion Algorihm Hiroaki Nishikawa W.M.Keck Foundaion Laboraory for Compuaional Fluid Dynamics Deparmen of Aerospace Engineering, Universiy of Michigan, Ann Arbor, Michigan

More information

Sliding Mode Extremum Seeking Control for Linear Quadratic Dynamic Game

Sliding Mode Extremum Seeking Control for Linear Quadratic Dynamic Game Sliding Mode Exremum Seeking Conrol for Linear Quadraic Dynamic Game Yaodong Pan and Ümi Özgüner ITS Research Group, AIST Tsukuba Eas Namiki --, Tsukuba-shi,Ibaraki-ken 5-856, Japan e-mail: pan.yaodong@ais.go.jp

More information

School and Workshop on Market Microstructure: Design, Efficiency and Statistical Regularities March 2011

School and Workshop on Market Microstructure: Design, Efficiency and Statistical Regularities March 2011 2229-12 School and Workshop on Marke Microsrucure: Design, Efficiency and Saisical Regulariies 21-25 March 2011 Some mahemaical properies of order book models Frederic ABERGEL Ecole Cenrale Paris Grande

More information

Stable Scheduling Policies for Maximizing Throughput in Generalized Constrained Queueing Systems

Stable Scheduling Policies for Maximizing Throughput in Generalized Constrained Queueing Systems 1 Sable Scheduling Policies for Maximizing Throughpu in Generalized Consrained Queueing Sysems Prasanna Chaporar, Suden Member, IEEE, Saswai Sarar, Member, IEEE Absrac We consider a class of queueing newors

More information

Unit Root Time Series. Univariate random walk

Unit Root Time Series. Univariate random walk Uni Roo ime Series Univariae random walk Consider he regression y y where ~ iid N 0, he leas squares esimae of is: ˆ yy y y yy Now wha if = If y y hen le y 0 =0 so ha y j j If ~ iid N 0, hen y ~ N 0, he

More information

Georey E. Hinton. University oftoronto. Technical Report CRG-TR February 22, Abstract

Georey E. Hinton. University oftoronto.   Technical Report CRG-TR February 22, Abstract Parameer Esimaion for Linear Dynamical Sysems Zoubin Ghahramani Georey E. Hinon Deparmen of Compuer Science Universiy oftorono 6 King's College Road Torono, Canada M5S A4 Email: zoubin@cs.orono.edu Technical

More information

non -negative cone Population dynamics motivates the study of linear models whose coefficient matrices are non-negative or positive.

non -negative cone Population dynamics motivates the study of linear models whose coefficient matrices are non-negative or positive. LECTURE 3 Linear/Nonnegaive Marix Models x ( = Px ( A= m m marix, x= m vecor Linear sysems of difference equaions arise in several difference conexs: Linear approximaions (linearizaion Perurbaion analysis

More information

SUPPLEMENTARY INFORMATION

SUPPLEMENTARY INFORMATION SUPPLEMENTARY INFORMATION DOI: 0.038/NCLIMATE893 Temporal resoluion and DICE * Supplemenal Informaion Alex L. Maren and Sephen C. Newbold Naional Cener for Environmenal Economics, US Environmenal Proecion

More information

Book Corrections for Optimal Estimation of Dynamic Systems, 2 nd Edition

Book Corrections for Optimal Estimation of Dynamic Systems, 2 nd Edition Boo Correcions for Opimal Esimaion of Dynamic Sysems, nd Ediion John L. Crassidis and John L. Junins November 17, 017 Chaper 1 This documen provides correcions for he boo: Crassidis, J.L., and Junins,

More information

8. Basic RL and RC Circuits

8. Basic RL and RC Circuits 8. Basic L and C Circuis This chaper deals wih he soluions of he responses of L and C circuis The analysis of C and L circuis leads o a linear differenial equaion This chaper covers he following opics

More information