arXiv: v1 [math.OC] 11 Sep 2017


Online Learning in Weakly Coupled Markov Decision Processes: A Convergence Time Study

Xiaohan Wei, Hao Yu and Michael J. Neely
Department of Electrical Engineering, University of Southern California

Abstract: We consider multiple parallel Markov decision processes (MDPs) coupled by global constraints, where the time varying objective and constraint functions can only be observed after the decision is made. Special attention is given to how well the decision maker can perform in $T$ slots, starting from any state, compared to the best feasible randomized stationary policy in hindsight. We develop a new distributed online algorithm where each MDP makes its own decision each slot after observing a multiplier computed from past information. While the scenario is significantly more challenging than the classical online learning context, the algorithm is shown to have a tight $O(\sqrt{T})$ regret and constraint violations simultaneously. To obtain such a bound, we combine several new ingredients including ergodicity and mixing time bound in weakly coupled MDPs, a new regret analysis for online constrained optimization, a drift analysis for queue processes, and a perturbation analysis based on Farkas' Lemma.

Keywords and phrases: Stochastic programming, Constrained programming, Markov decision processes.

1. Introduction

This paper considers online constrained Markov decision processes (OCMDP) where both the objective and constraint functions can vary each time slot after the decision is made. We assume a slotted time scenario with time slots $t \in \{0, 1, 2, \dots\}$. The OCMDP consists of $K$ parallel Markov decision processes with indices $k \in \{1, 2, \dots, K\}$. The $k$-th MDP has state space $\mathcal{S}^{(k)}$, action space $\mathcal{A}^{(k)}$, and transition probability matrix $P_a^{(k)}$ which depends on the chosen action $a \in \mathcal{A}^{(k)}$. Specifically, $P_a^{(k)} = \big(P_a^{(k)}(s, s')\big)$ where

\[ P_a^{(k)}(s, s') = \Pr\big( s_{t+1}^{(k)} = s' \,\big|\, s_t^{(k)} = s,\ a_t^{(k)} = a \big), \]

where $s_t^{(k)}$ and $a_t^{(k)}$ are the state and action for system $k$ on slot $t$. We assume that both the state space and the action space are finite for all $k \in \{1, 2, \dots, K\}$.

After each MDP $k \in \{1, \dots, K\}$ makes the decision at time $t$ (and assuming the current state is $s_t^{(k)} = s$ and the action is $a_t^{(k)} = a$), the following information is revealed:

1. The next state $s_{t+1}^{(k)}$.
2. A penalty function $f_t^{(k)}(s, a)$ that depends on the current state $s$ and the current action $a$.
3. A collection of $m$ constraint functions $g_{1,t}^{(k)}(s, a), \dots, g_{m,t}^{(k)}(s, a)$ that depend on $s$ and $a$.

The functions $f_t^{(k)}$ and $g_{i,t}^{(k)}$ are all bounded mappings from $\mathcal{S}^{(k)} \times \mathcal{A}^{(k)}$ to $\mathbb{R}$ and represent different types of costs incurred by system $k$ on slot $t$ (depending on the current state and action). For example, in a multi-server data center, the different systems $k \in \{1, \dots, K\}$ can represent different servers, the cost function for a particular server $k$ might represent energy or monetary expenditure for that server, and the constraint costs for server $k$ can represent
negative rewards such as service rates or qualities. Coupling between the server systems comes from using all of them to collectively support a common stream of arriving jobs.

A key aspect of this general problem is that the functions $f_t^{(k)}$ and $g_{i,t}^{(k)}$ are unknown until after the slot $t$ decision is made. Thus, the precise costs incurred by each system are only known at the end of the slot. For a fixed time horizon of $T$ slots, the overall penalty and constraint accumulation resulting from a policy $\mathcal{P}$ is:

\[ F_T(d_0, \mathcal{P}) := \mathbb{E}\Big[ \sum_{t=1}^{T} \sum_{k=1}^{K} f_t^{(k)}\big(a_t^{(k)}, s_t^{(k)}\big) \,\Big|\, d_0, \mathcal{P} \Big], \]

and

\[ G_{i,T}(d_0, \mathcal{P}) := \mathbb{E}\Big[ \sum_{t=1}^{T} \sum_{k=1}^{K} g_{i,t}^{(k)}\big(a_t^{(k)}, s_t^{(k)}\big) \,\Big|\, d_0, \mathcal{P} \Big], \]

where $d_0$ represents a given distribution on the initial joint state vector $\big(s_0^{(1)}, \dots, s_0^{(K)}\big)$. Note that $\big(a_t^{(k)}, s_t^{(k)}\big)$ denotes the state-action pair of the $k$-th MDP, which is a pair of random variables determined by $d_0$ and $\mathcal{P}$. Define a constraint set

\[ \mathcal{G} := \{ (\mathcal{P}, d_0) : G_{i,T}(d_0, \mathcal{P}) \le 0,\ i = 1, 2, \dots, m \}. \]

Define the regret of a policy $\mathcal{P}$ with respect to a particular joint randomized stationary policy $\Pi$ along with an arbitrary starting state distribution $d_0$ as:

\[ F_T(d_0, \mathcal{P}) - F_T(d_0, \Pi). \]

The goal of OCMDP is to choose a policy $\mathcal{P}$ so that both the regret and constraint violations grow sublinearly with respect to $T$, where regret is measured against all feasible joint randomized stationary policies $\Pi$.

1.1. A motivating example

As an example, consider a data center with a central controller and $K$ servers (see Fig. 1). Jobs arrive randomly and are stored in a queue to await service. The system operates in slotted time $t \in \{0, 1, 2, \dots\}$ and each server $k \in \{1, \dots, K\}$ is modeled as a 3-state MDP with states active, idle, and setup:

• Active: In this state the server is available to serve jobs. Server $k$ incurs a time varying electricity cost on every active slot, regardless of whether or not there are jobs to serve. It has a control option to stay active or transition to the idle state.
• Idle: In this state no jobs can be served. This state has multiple sleep modes as control options, each with different per-slot costs and setup times required for transitioning from idle to active.
• Setup: This is a transition state between idle and active. No jobs can be served and there are no control options. The setup costs and durations are (possibly constant) random variables depending on the preceding chosen sleep mode.

The goal is to minimize the overall electricity cost subject to stabilizing the job queue. In a typical data center scenario, the performance of each server on a given slot is governed by the current electricity price and the service rate under each decision, both of which can be time varying and unknown to the server beforehand. This problem is challenging because:

• If one server is currently in a setup state, it has zero service rate and cannot make another decision until it reaches the active state (which typically takes more than one slot), whereas other active servers can make decisions during this time. Thus, servers are acting asynchronously.
• The electricity price exhibits variation across time, location, and utility providers. Its behavior is irregular and can be difficult to predict. As an example, Fig. 2 plots the average per 5 minute spot market price over a period in May 2017 at New York zone CENTRL [1]. Servers in different locations can have different price offerings, and this piles up the uncertainty across the whole system.

Despite these difficulties, this problem fits into the formulation of this paper: the electricity price acts as the global penalty function, and stability of the queue can be treated as a global constraint that the expected total number of arrivals is less than the expected service rate.

Fig. 1. Illustration of a data center server scheduling model.

Fig. 2. A typical trace of electricity market price (vertical axis: price in dollar/MWh; horizontal axis: number of slots, each 5 min).

A review on data server provision can be found in [2] and references therein. Prior data center analysis often assumes the system has up-to-date information on service rates and electricity costs (see, for example, [3], [4]). On the other hand, work that treats outdated information (such as [5], [6]) generally does not consider the potential Markov structure of the problem. The current paper treats the Markov structure of the problem and allows rate and price information to be unknown and outdated.

1.2. Related work

• Online convex optimization (OCO): This concerns multi-round cost minimization with arbitrarily-varying convex loss functions. Specifically, on each slot $t$ the decision maker chooses decisions $x_t$ within a convex set $\mathcal{X}$ (before observing the loss function $f_t(x)$) in order to minimize the total regret compared to the best fixed decision in hindsight, expressed as:

\[ \text{regret} = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x). \]

See [7] for an introduction to OCO. Zinkevich introduced OCO in [8] and showed that an online projection gradient descent (OGD) algorithm achieves $O(\sqrt{T})$ regret. This $O(\sqrt{T})$ regret is proven to be the best in [9], although improved performance is possible if all convex loss functions are strongly convex. The OGD decision requires computing a projection of a vector onto a set $\mathcal{X}$. For complicated sets $\mathcal{X}$ with functional inequality constraints, e.g., $\mathcal{X} = \{x \in \mathcal{X}_0 : g_k(x) \le 0,\ k \in \{1, 2, \dots, m\}\}$, the projection can have high complexity. To circumvent the projection, work in [10, 11, 12, 13] proposes alternative algorithms with simpler per-slot complexity that satisfy the inequality constraints in the long term (rather than on every slot). Recently, new primal-dual type algorithms with low complexity are proposed in [14, 15] to solve more challenging OCO with time-varying functional inequality constraints.

• Online Markov decision processes: This extends OCO to allow systems with a more complex Markov structure. This is similar to the setup of the current paper of minimizing the regret expression, but does not have the constraint set. Unlike traditional OCO, the current penalty depends not only on the current action and the current (unknown) penalty function, but on the current system state (which depends on the history of previous actions). Further, the number of policies can grow exponentially with the sizes of the state and action spaces, so that solutions can be computationally intensive. The work [16] develops an algorithm in this context with $O(\sqrt{T})$ regret. Extended algorithms and regularization methods are developed in [17][18][19] to reduce complexity and improve dependencies on the number of states and actions. Online MDP under bandit feedback (where the decision maker can only observe the penalty corresponding to the chosen action) is considered in [17][20].

• Constrained MDPs: This aims to solve classical MDP problems with known cost functions but subject to additional constraints on the budget or resources. Linear programming methods for MDPs are found, for example, in [21], and algorithms beyond LP are found in [22][23]. Formulations closest to our setup appear in recent work on weakly coupled MDPs in [24][25] that have known cost and resource functions.

• Reinforcement Learning (RL): This concerns MDPs with some unknown parameters (such as unknown functions and transition probabilities). Typically, RL makes stronger assumptions than the online setting, such as an environment that is unknown but fixed, whereas the unknown environment in the online context can change over time. Methods for RL are developed in [26][27][28][29].

1.3. Our contributions

The current paper proposes a new framework for online MDPs with time varying constraints. Further, it considers multiple MDP systems that are weakly coupled. While the scenario is
significantly more challenging than the original Zinkevich OGD context as well as other classical online learning scenarios, the algorithm is shown to achieve tight $O(\sqrt{T})$ regret in both the objective function and the constraints, which ties the optimal $O(\sqrt{T})$ regret for those simpler unconstrained OCO problems. Along the way, we show the bound grows polynomially with the number of MDPs and linearly with respect to the number of states and actions in each MDP (see the theorem in Section 5).

2. Preliminaries

2.1. Basic Definitions

Throughout this paper, given an MDP with state space $\mathcal{S}$ and action space $\mathcal{A}$, a policy $\mathcal{P}$ defines a (possibly probabilistic) method of choosing actions $a \in \mathcal{A}$ at state $s \in \mathcal{S}$ based on the past information. We start with some basic definitions of important classes of policies:

Definition 2.1. For an MDP, a randomized stationary policy $\pi$ defines an algorithm which, whenever the system is in state $s \in \mathcal{S}$, chooses an action $a \in \mathcal{A}$ according to a fixed conditional probability function $\pi(a|s)$, defined for all $a \in \mathcal{A}$ and $s \in \mathcal{S}$.

Definition 2.2. For an MDP, a pure policy $\pi$ is a randomized stationary policy with all probabilities equal to either 0 or 1. That is, a pure policy is defined by a deterministic mapping between states $s \in \mathcal{S}$ and actions $a \in \mathcal{A}$. Whenever the system is in a state $s \in \mathcal{S}$, it always chooses a particular action $a_s \in \mathcal{A}$ (with probability 1).

Note that if an MDP has a finite state and action space, the set of all pure policies is also finite. Consider the MDP associated with a particular system $k \in \{1, \dots, K\}$. For any randomized stationary policy $\pi$, it holds that $\sum_{a \in \mathcal{A}^{(k)}} \pi(a|s) = 1$ for all $s \in \mathcal{S}^{(k)}$. Define the transition probability matrix $P_\pi^{(k)}$ under policy $\pi$ to have components as follows:

\[ P_\pi^{(k)}(s, s') = \sum_{a \in \mathcal{A}^{(k)}} \pi(a|s)\, P_a^{(k)}(s, s'), \quad s, s' \in \mathcal{S}^{(k)}. \tag{3} \]

It is easy to verify that $P_\pi^{(k)}$ is indeed a stochastic matrix, that is, it has rows with nonnegative components that sum to 1. Let $d_0^{(k)} \in [0, 1]^{|\mathcal{S}^{(k)}|}$ be an arbitrary initial distribution for the $k$-th MDP (for any set $\mathcal{S}$, we use $|\mathcal{S}|$ to denote the cardinality of the set). Define the state distribution at time $t$ under $\pi$ as $d_{\pi,t}^{(k)}$. By the Markov property of the system, we have $d_{\pi,t}^{(k)} = d_0^{(k)} \big(P_\pi^{(k)}\big)^t$. A transition probability matrix $P_\pi^{(k)}$ is ergodic if it gives rise to a Markov chain that is irreducible and aperiodic. Since the state space is finite, an ergodic matrix $P_\pi^{(k)}$ has a unique stationary distribution denoted $d_\pi^{(k)}$, so that $d_\pi^{(k)}$ is the unique probability vector solving $d = d P_\pi^{(k)}$.

Assumption 2.1 (Unichain model). There exists a universal integer $\hat{r}$ such that for any integer $r \ge \hat{r}$ and every $k \in \{1, \dots, K\}$, we have the product $P_{\pi_1}^{(k)} P_{\pi_2}^{(k)} \cdots P_{\pi_r}^{(k)}$ is a transition matrix with strictly positive entries for any sequence of pure policies $\pi_1, \pi_2, \dots, \pi_r$ associated with the $k$-th MDP.

Remark 2.1. Assumption 2.1 implies that each MDP $k \in \{1, \dots, K\}$ is ergodic under any pure policy. This follows by taking $\pi_1, \pi_2, \dots, \pi_r$ all the same in Assumption 2.1. Since the transition matrix of any randomized stationary policy can be formed as a convex combination of those of
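pure policies, any randomized stationary policy results in an ergodic MDP for which there is a unique stationary distribution. Assumption 2.1 is easy to check via the following simple sufficient condition.

Proposition 2.1. Assumption 2.1 holds if, for every $k \in \{1, \dots, K\}$, there is a fixed ergodic matrix $P^{(k)}$ (i.e., a transition probability matrix that defines an irreducible and aperiodic Markov chain) such that for any pure policy $\pi$ on MDP $k$ we have the decomposition

\[ P_\pi^{(k)} = \delta_\pi P^{(k)} + (1 - \delta_\pi)\, Q_\pi^{(k)}, \]

where $\delta_\pi \in (0, 1]$ depends on the pure policy $\pi$ and $Q_\pi^{(k)}$ is a stochastic matrix depending on $\pi$.

Proof. Fix $k \in \{1, \dots, K\}$ and assume every pure policy on MDP $k$ has the above decomposition. Since there are only finitely many pure policies, there exists a lower bound $\delta_{\min} > 0$ such that $\delta_\pi \ge \delta_{\min}$ for every pure policy $\pi$. Since $P^{(k)}$ is an ergodic matrix, there exists an integer $r^{(k)} > 0$ large enough such that $\big(P^{(k)}\big)^r$ has strictly positive components for all $r \ge r^{(k)}$. Fix $r \ge r^{(k)}$ and let $\pi_1, \dots, \pi_r$ be any sequence of $r$ pure policies on MDP $k$. Then

\[ P_{\pi_1}^{(k)} \cdots P_{\pi_r}^{(k)} \ge \delta_{\min}^{r} \big(P^{(k)}\big)^r > 0, \]

where the inequalities hold entrywise. The universal integer $\hat{r}$ can be taken as the maximum integer $r^{(k)}$ over all $k \in \{1, \dots, K\}$.

Definition 2.3. A joint randomized stationary policy $\Pi$ on $K$ parallel MDPs defines an algorithm which chooses a joint action $\mathbf{a} := \big(a^{(1)}, a^{(2)}, \dots, a^{(K)}\big) \in \mathcal{A}^{(1)} \times \mathcal{A}^{(2)} \times \cdots \times \mathcal{A}^{(K)}$ given the joint state $\mathbf{s} := \big(s^{(1)}, s^{(2)}, \dots, s^{(K)}\big) \in \mathcal{S}^{(1)} \times \mathcal{S}^{(2)} \times \cdots \times \mathcal{S}^{(K)}$ according to a fixed conditional probability $\Pi(\mathbf{a}|\mathbf{s})$.

The following special class of separable policies can be implemented separately over each of the $K$ MDPs and plays a role in both algorithm design and performance analysis.

Definition 2.4. A joint randomized stationary policy $\pi$ is separable if the conditional probabilities $\pi := \big(\pi^{(1)}, \pi^{(2)}, \dots, \pi^{(K)}\big)$ decompose as a product

\[ \pi(\mathbf{a}|\mathbf{s}) = \prod_{k=1}^{K} \pi^{(k)}\big(a^{(k)} \,\big|\, s^{(k)}\big) \]

for all $\mathbf{a} \in \mathcal{A}^{(1)} \times \cdots \times \mathcal{A}^{(K)}$, $\mathbf{s} \in \mathcal{S}^{(1)} \times \cdots \times \mathcal{S}^{(K)}$.

2.2. Technical assumptions

The functions $f_t^{(k)}$ and $g_{i,t}^{(k)}$ are determined by random processes defined over $t = 0, 1, 2, \dots$. Specifically, let $\Omega$ be a finite dimensional vector space. Let $\{\omega_t\}_{t=0}^{\infty}$ and $\{\mu_t\}_{t=0}^{\infty}$ be two sequences of random vectors in $\Omega$. Then for all $a \in \mathcal{A}^{(k)}$, $s \in \mathcal{S}^{(k)}$, $i \in \{1, 2, \dots, m\}$ we have

\[ g_{i,t}^{(k)}(a, s) = \hat{g}_i^{(k)}(a, s, \omega_t), \qquad f_t^{(k)}(a, s) = \hat{f}^{(k)}(a, s, \mu_t), \]

where $\hat{g}_i^{(k)}$ and $\hat{f}^{(k)}$ formally define the time-varying functions in terms of the random processes $\omega_t$ and $\mu_t$. It is assumed that the processes $\{\omega_t\}_{t=0}^{\infty}$ and $\{\mu_t\}_{t=0}^{\infty}$ are generated at the start of slot 0 (before any control actions are taken), and revealed gradually over time, so that functions $g_{i,t}^{(k)}$ and $f_t^{(k)}$ are only revealed at the end of slot $t$.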
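On small instances, Assumption 2.1 from Section 2.1 can be checked by brute force. The following is a minimal sketch, assuming a single MDP stored as nested lists `P[a][s][s2]`; the function names and the two-state example matrices are hypothetical and not taken from the paper. It enumerates all pure policies and tests whether every product of `r` pure-policy transition matrices is entrywise positive.

```python
from itertools import product


def policy_matrix(P, policy):
    """Transition matrix of a pure policy: row s is copied from P[policy[s]][s]."""
    return [P[policy[s]][s] for s in range(len(policy))]


def matmul(A, B):
    """Plain matrix product of two square matrices given as nested lists."""
    n = len(A)
    return [[sum(A[i][l] * B[l][j] for l in range(n)) for j in range(n)]
            for i in range(n)]


def satisfies_unichain(P, r):
    """True if every product of r pure-policy matrices is entrywise positive.

    Positivity for one r implies it for all longer products, because a
    stochastic matrix times an entrywise positive matrix stays positive.
    """
    n_actions, n_states = len(P), len(P[0])
    pure = [policy_matrix(P, pol)
            for pol in product(range(n_actions), repeat=n_states)]
    for seq in product(pure, repeat=r):
        M = seq[0]
        for nxt in seq[1:]:
            M = matmul(M, nxt)
        if any(x <= 0.0 for row in M for x in row):
            return False
    return True


# Hypothetical 2-state, 2-action MDP: P[a][s][s2].
P = [
    [[0.9, 0.1], [0.2, 0.8]],  # action 0
    [[0.5, 0.5], [0.6, 0.4]],  # action 1
]
# Same MDP, except action 0 makes state 0 absorbing.
P_bad = [
    [[1.0, 0.0], [0.2, 0.8]],
    [[0.5, 0.5], [0.6, 0.4]],
]
```

The enumeration grows exponentially in the number of states and in `r`, so this is only meant for toy examples; for larger models a structural argument such as Proposition 2.1 is the practical route.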

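Remark 2.2. The functions generated in this way are also called oblivious functions. Such an assumption is commonly adopted in previous unconstrained online MDP works (e.g. [16], [17] and [19]). Further, it is also shown in [17] that without this assumption, one can choose a sequence of objective functions against the decision maker in a specifically designed MDP scenario so that one never achieves sublinear regret.

The functions are also assumed to be bounded by a universal constant $\Psi$, so that:

\[ \big|\hat{g}_i^{(k)}(a, s, \omega)\big| \le \Psi, \quad \big|\hat{f}^{(k)}(a, s, \mu)\big| \le \Psi, \quad \forall k \in \{1, \dots, K\},\ \forall a \in \mathcal{A}^{(k)},\ \forall s \in \mathcal{S}^{(k)},\ \forall \omega, \mu \in \Omega. \tag{4} \]

It is assumed that $\{\omega_t\}_{t=0}^{\infty}$ is independent, identically distributed (i.i.d.) and independent of $\{\mu_t\}_{t=0}^{\infty}$. Hence, the constraint functions can be arbitrarily correlated on the same slot, but appear i.i.d. over different slots. On the other hand, no specific model is imposed on $\{\mu_t\}_{t=0}^{\infty}$. Thus, the functions $f_t^{(k)}$ can be arbitrarily time varying. Let $\mathcal{H}_t$ be the system information up to time $t$; then, for any $t \in \{0, 1, 2, \dots\}$, $\mathcal{H}_t$ contains state and action information up to time $t$, i.e. $\mathbf{s}_0, \dots, \mathbf{s}_t$, $\mathbf{a}_0, \dots, \mathbf{a}_t$, and $\{\omega_t\}_{t=0}^{\infty}$ and $\{\mu_t\}_{t=0}^{\infty}$. Throughout this paper, we make the following assumptions.

Assumption 2.2 (Independent transition). For each MDP, given the state $s_t^{(k)} \in \mathcal{S}^{(k)}$ and action $a_t^{(k)} \in \mathcal{A}^{(k)}$, the next state $s_{t+1}^{(k)}$ is independent of all other past information up to time $t$ as well as the state transitions $s_{t+1}^{(j)}$, $\forall j \ne k$, i.e., for all $s \in \mathcal{S}^{(k)}$ it holds that

\[ \Pr\big( s_{t+1}^{(k)} = s \,\big|\, \mathcal{H}_t,\ s_{t+1}^{(j)},\ \forall j \ne k \big) = \Pr\big( s_{t+1}^{(k)} = s \,\big|\, s_t^{(k)}, a_t^{(k)} \big), \]

where $\mathcal{H}_t$ contains all past information up to time $t$.

Intuitively, this assumption means that all MDPs are running independently in the joint probability space and thus the only coupling among them comes from the constraints, which reflects the notion of weakly coupled MDPs in our title. Furthermore, by definition of $\mathcal{H}_t$, given $s_t^{(k)}, a_t^{(k)}$, the next transition $s_{t+1}^{(k)}$ is also independent of function paths $\{\omega_t\}_{t=0}^{\infty}$ and $\{\mu_t\}_{t=0}^{\infty}$.

The following assumption states the constraint set is strictly feasible.

Assumption 2.3 (Slater's condition). There exists a real value $\eta > 0$ and a fixed separable randomized stationary policy $\tilde{\pi}$ such that

\[ \mathbb{E}\Big[ \sum_{k=1}^{K} g_{i,t}^{(k)}\big(a_t^{(k)}, s_t^{(k)}\big) \,\Big|\, d_{\tilde{\pi}}, \tilde{\pi} \Big] \le -\eta, \quad \forall i \in \{1, 2, \dots, m\}, \]

where the initial state is $d_{\tilde{\pi}}$ and is the unique stationary distribution of policy $\tilde{\pi}$, and the expectation is taken with respect to the random initial state and the stochastic function $g_{i,t}^{(k)}(a, s)$ (i.e., $\omega_t$).

Slater's condition is a common assumption in convergence time analysis of constrained convex optimization (e.g. [30], [31]). Note that this assumption readily implies the constraint set $\mathcal{G}$ can be achieved by the above randomized stationary policy. Specifically, take $d_0^{(k)} = d_{\tilde{\pi}^{(k)}}$ and $\mathcal{P} = \tilde{\pi}$; then we have

\[ G_{i,T}(d_0, \tilde{\pi}) = \sum_{t=0}^{T-1} \mathbb{E}\Big[ \sum_{k=1}^{K} g_{i,t}^{(k)}\big(a_t^{(k)}, s_t^{(k)}\big) \,\Big|\, d_{\tilde{\pi}}, \tilde{\pi} \Big] \le -\eta T < 0. \]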

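2.3. The state-action polyhedron

In this section, we recall the well-known linear program formulation of an MDP (see, for example, [21] and [32]). Consider an MDP with a state space $\mathcal{S}$ and an action space $\mathcal{A}$. Let $\Delta \subseteq \mathbb{R}^{|\mathcal{S}||\mathcal{A}|}$ be a probability simplex, i.e.

\[ \Delta = \Big\{ \theta \in \mathbb{R}^{|\mathcal{S}||\mathcal{A}|} : \sum_{(s,a) \in \mathcal{S} \times \mathcal{A}} \theta(s, a) = 1,\ \theta(s, a) \ge 0 \Big\}. \]

Given a randomized stationary policy $\pi$ with stationary state distribution $d_\pi$, the MDP is a Markov chain with transition matrix $P_\pi$ given by (3). Thus, it must satisfy the following balance equation:

\[ \sum_{s \in \mathcal{S}} d_\pi(s)\, P_\pi(s, s') = d_\pi(s'), \quad \forall s' \in \mathcal{S}. \]

Defining $\theta(a, s) = \pi(a|s)\, d_\pi(s)$ and substituting the definition of transition probability (3) into the above equation gives

\[ \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \theta(s, a)\, P_a(s, s') = \sum_{a \in \mathcal{A}} \theta(s', a), \quad \forall s' \in \mathcal{S}. \]

The variable $\theta(a, s)$ is often interpreted as a stationary probability of being at state $s \in \mathcal{S}$ and taking action $a \in \mathcal{A}$ under some randomized stationary policy. The state-action polyhedron $\Theta$ is then defined as

\[ \Theta := \Big\{ \theta \in \Delta : \sum_{s \in \mathcal{S}} \sum_{a \in \mathcal{A}} \theta(s, a)\, P_a(s, s') = \sum_{a \in \mathcal{A}} \theta(s', a),\ \forall s' \in \mathcal{S} \Big\}. \]

Given any $\theta \in \Theta$, one can recover a randomized stationary policy $\pi$ at any state $s \in \mathcal{S}$ as

\[ \pi(a|s) = \begin{cases} \dfrac{\theta(a, s)}{\sum_{a \in \mathcal{A}} \theta(a, s)}, & \text{if } \sum_{a \in \mathcal{A}} \theta(a, s) \ne 0, \\ 0, & \text{otherwise.} \end{cases} \tag{5} \]

Given any fixed penalty function $f(a, s)$, the best policy minimizing the penalty (without constraint) is a randomized stationary policy given by the solution to the following linear program (LP):

\[ \min\ \langle f, \theta \rangle, \quad \text{s.t. } \theta \in \Theta, \tag{6} \]

where $f := [f(a, s)]_{a \in \mathcal{A},\, s \in \mathcal{S}}$. Note that for any policy $\pi$ given by the state-action pair $\theta$ according to (5),

\[ \langle f, \theta \rangle = \mathbb{E}_{s \sim d_\pi,\ a \sim \pi(\cdot|s)} [f(a, s)]. \]

Thus, $\langle f, \theta \rangle$ is often referred to as the stationary state penalty of policy $\pi$.

It can also be shown that any state-action pair in the set $\Theta$ can be achieved by a convex combination of state-action vectors of pure policies, and thus all corner points of the polyhedron $\Theta$ are from pure policies. As a consequence, the best randomized stationary policy solving (6) is always a pure policy.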
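The correspondence between a randomized stationary policy and its state-action vector can be made concrete on a toy example. The sketch below is illustrative only: the nested-list layout (`P[a][s][s2]`, `pi[s][a]`, `theta[s][a]`), the function names, the use of power iteration for the stationary distribution, and the numerical values are all assumptions of this example, not part of the paper. It builds $\theta$ from $\pi$, checks the balance equations that define $\Theta$, and recovers $\pi$ through (5).

```python
def policy_transition(P, pi):
    """P_pi(s, s2) = sum_a pi(a|s) * P_a(s, s2), as in (3)."""
    n_s, n_a = len(pi), len(P)
    return [[sum(pi[s][a] * P[a][s][s2] for a in range(n_a))
             for s2 in range(n_s)] for s in range(n_s)]


def stationary_distribution(M, iters=1000):
    """Power iteration d <- d M; converges for an ergodic chain."""
    n = len(M)
    d = [1.0 / n] * n
    for _ in range(iters):
        d = [sum(d[s] * M[s][s2] for s in range(n)) for s2 in range(n)]
    return d


def state_action_vector(P, pi):
    """theta[s][a] = pi(a|s) * d_pi(s)."""
    d = stationary_distribution(policy_transition(P, pi))
    return [[pi[s][a] * d[s] for a in range(len(P))] for s in range(len(pi))]


def in_polyhedron(P, theta, tol=1e-9):
    """Check the simplex constraints and the balance equations defining Theta."""
    n_s, n_a = len(theta), len(P)
    if abs(sum(sum(row) for row in theta) - 1.0) > tol:
        return False
    if any(x < -tol for row in theta for x in row):
        return False
    for s2 in range(n_s):
        inflow = sum(theta[s][a] * P[a][s][s2]
                     for s in range(n_s) for a in range(n_a))
        if abs(inflow - sum(theta[s2])) > tol:
            return False
    return True


def recover_policy(theta):
    """Equation (5): normalise each state's row, or return zeros if it has no mass."""
    pi = []
    for row in theta:
        mass = sum(row)
        pi.append([x / mass for x in row] if mass != 0 else [0.0] * len(row))
    return pi


# Hypothetical 2-state, 2-action MDP and policy.
P = [
    [[0.9, 0.1], [0.2, 0.8]],  # action 0
    [[0.5, 0.5], [0.6, 0.4]],  # action 1
]
pi = [[0.3, 0.7], [0.6, 0.4]]
theta = state_action_vector(P, pi)
pi_back = recover_policy(theta)
```

With this representation the stationary penalty $\langle f, \theta \rangle$ is a plain dot product, which is what makes the linear program (6) and the later projection step linear in the decision variable.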

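2.4. Preliminary results on MDPs

In this section, we give preliminary results regarding the properties of our weakly coupled MDPs under randomized stationary policies. The proofs can be found in the appendix. We start with a lemma on the uniform mixing of MDPs.

Lemma 2.1. Suppose Assumption 2.1 and 2.2 hold. There exists a positive integer $\hat{r}$ and a constant $\tau$ such that for any two state distributions $d_1$ and $d_2$,

\[ \sup_{\pi_1^{(k)}, \dots, \pi_{\hat{r}}^{(k)}} \Big\| (d_1 - d_2)\, P_{\pi_1^{(k)}}^{(k)} P_{\pi_2^{(k)}}^{(k)} \cdots P_{\pi_{\hat{r}}^{(k)}}^{(k)} \Big\|_1 \le e^{-1/\tau}\, \| d_1 - d_2 \|_1, \quad \forall k \in \{1, 2, \dots, K\}, \]

where the supremum is taken with respect to any sequence of $\hat{r}$ randomized stationary policies $\big\{\pi_1^{(k)}, \dots, \pi_{\hat{r}}^{(k)}\big\}$.

Lemma 2.2. Suppose Assumption 2.1 and 2.2 hold. Consider the product MDP with product state space $\mathcal{S}^{(1)} \times \cdots \times \mathcal{S}^{(K)}$ and action space $\mathcal{A}^{(1)} \times \cdots \times \mathcal{A}^{(K)}$. Then, the following hold:

1. The product MDP is irreducible and aperiodic under any joint randomized stationary policy.
2. The stationary state-action probability $\{\theta^{(k)}\}_{k=1}^{K}$ of any joint randomized stationary policy satisfies $\theta^{(k)} \in \Theta^{(k)}$, $\forall k \in \{1, 2, \dots, K\}$.

An immediate conclusion we can draw from this lemma is that given any penalty and constraint functions $f^{(k)}$ and $g_i^{(k)}$, $k = 1, 2, \dots, K$, the stationary penalty and constraint value of any joint randomized stationary policy can be expressed as

\[ \sum_{k=1}^{K} \big\langle f^{(k)}, \theta^{(k)} \big\rangle, \qquad \sum_{k=1}^{K} \big\langle g_i^{(k)}, \theta^{(k)} \big\rangle, \quad i = 1, 2, \dots, m, \]

with $\theta^{(k)} \in \Theta^{(k)}$. This in turn implies such stationary state-action probabilities $\{\theta^{(k)}\}_{k=1}^{K}$ can also be realized via a separable randomized stationary policy $\pi$ with

\[ \pi^{(k)}(a|s) = \frac{\theta^{(k)}(a, s)}{\sum_{a \in \mathcal{A}^{(k)}} \theta^{(k)}(a, s)}, \quad a \in \mathcal{A}^{(k)},\ s \in \mathcal{S}^{(k)}, \tag{7} \]

and the corresponding stationary penalty and constraint value can also be achieved via this policy. This fact suggests that when considering the stationary state performance only, the class of separable randomized stationary policies is large enough to cover all possible stationary penalty and constraint values.

In particular, let $\tilde{\pi} = \big(\tilde{\pi}^{(1)}, \dots, \tilde{\pi}^{(K)}\big)$ be the separable randomized stationary policy associated with the Slater condition (Assumption 2.3). Using the fact that the constraint functions $g_{i,t}^{(k)}$, $k = 1, 2, \dots, K$ (i.e. $\omega_t$) are i.i.d. and Assumption 2.2 on independence of probability transitions, we have that the constraint functions $g_{i,t}^{(k)}$ and the state-action pairs at any time are mutually independent. Thus,

\[ \mathbb{E}\Big[ \sum_{k=1}^{K} g_{i,t}^{(k)}\big(a_t^{(k)}, s_t^{(k)}\big) \,\Big|\, d_{\tilde{\pi}}, \tilde{\pi} \Big] = \sum_{k=1}^{K} \Big\langle \mathbb{E}\big[g_{i,t}^{(k)}\big], \tilde{\theta}^{(k)} \Big\rangle, \]

where $\tilde{\theta}^{(k)}$ corresponds to $\tilde{\pi}$ according to (7).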

Then, Slater's condition can be translated to the following: There exists a sequence of state-action probabilities $\{\tilde{\theta}^{(k)}\}_{k=1}^{K}$ from a separable randomized stationary policy such that $\tilde{\theta}^{(k)} \in \Theta^{(k)}$, $\forall k$, and

\[ \sum_{k=1}^{K} \Big\langle \mathbb{E}\big[g_{i,t}^{(k)}\big], \tilde{\theta}^{(k)} \Big\rangle \le -\eta, \quad i = 1, 2, \dots, m. \tag{8} \]

The assumption on separability does not lose generality in the sense that if there is no separable randomized stationary policy that satisfies (8), then there is no joint randomized stationary policy that satisfies (8) either.

2.5. The blessing of slow-update property in online MDPs

The current state of an MDP depends on previous states and actions. As a consequence, the slot $t$ penalty not only depends on the current penalty function and current action, but also on the system history. This complication does not arise in classical online convex optimization ([7], [8]) as there is no notion of state and the slot $t$ penalty depends only on the slot $t$ penalty function and action.

Now imagine a virtual system where, on each slot $t$, a policy $\pi_t$ is chosen (rather than an action). Further imagine the MDP immediately reaching its corresponding stationary distribution $d_{\pi_t}$. Then the states and actions on previous slots do not matter and the slot $t$ performance depends only on the chosen policy $\pi_t$ and on the current penalty and constraint functions. This imaginary system now has a structure similar to classical online convex optimization as in the Zinkevich scenario [8].

A key feature of online convex optimization algorithms as in [8] is that they update their decision variables slowly. For a fixed time scale $T$ over which $O(\sqrt{T})$ regret is desired, the decision variables are typically changed no more than a distance $O(1/\sqrt{T})$ from one slot to the next. An important insight in prior (unconstrained) MDP works (e.g. [16], [17] and [19]) is that such slow updates also guarantee the approximate convergence of an MDP to its stationary distribution. As a consequence, one can design the decision policies under the imaginary assumption that the system instantly reaches its stationary distribution, and later bound the error between the true system and the imaginary system. If the error is on the same order as the desired $O(\sqrt{T})$ regret, then this approach works. This idea serves as a cornerstone of our algorithm design of the next section, which treats the case of multiple weakly coupled systems with both objective functions and constraint functions.

3. OCMDP algorithm

Our proposed algorithm is distributed in the sense that each time slot, each MDP solves its own subproblem and the constraint violations are controlled by a simple update of global multipliers called virtual queues at the end of each slot. Let $\Theta^{(1)}, \Theta^{(2)}, \dots, \Theta^{(K)}$ be the state-action polyhedra of the $K$ MDPs, respectively. Let $\theta_t^{(k)} \in \Theta^{(k)}$ be a state-action vector at time slot $t$. At $t = 0$, each MDP chooses its initial state-action vector $\theta_0^{(k)}$ resulting from any separable randomized stationary policy $\pi_0^{(k)}$. For example, one could choose a uniform policy $\pi_0^{(k)}(a|s) = 1/|\mathcal{A}^{(k)}|$, $\forall s \in \mathcal{S}^{(k)}$, solve the equation $d_{\pi_0^{(k)}} = d_{\pi_0^{(k)}} P_{\pi_0^{(k)}}^{(k)}$ to get a probability vector $d_{\pi_0^{(k)}}$, and obtain $\theta_0^{(k)}(a, s) = d_{\pi_0^{(k)}}(s) / |\mathcal{A}^{(k)}|$. For each constraint $i \in \{1, 2, \dots, m\}$, let $Q_i(t)$ be a virtual queue defined over slots $t = 0, 1, 2, \dots$ with the initial condition $Q_i(0) = Q_i(1) = 0$, and update
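equation:

\[ Q_i(t+1) = \max\Big\{ Q_i(t) + \sum_{k=1}^{K} \big\langle g_{i,t-1}^{(k)}, \theta_t^{(k)} \big\rangle,\ 0 \Big\}, \quad \forall t \in \{1, 2, 3, \dots\}. \tag{9} \]

Our algorithm uses two parameters $V > 0$ and $\alpha > 0$ and makes decisions as follows:

• At the start of each slot $t \in \{1, 2, 3, \dots\}$, the $k$-th MDP observes $Q_i(t)$, $i = 1, 2, \dots, m$ and chooses $\theta_t^{(k)}$ to solve the following subproblem:

\[ \theta_t^{(k)} = \operatorname*{argmin}_{\theta \in \Theta^{(k)}} \Big\langle V f_{t-1}^{(k)} + \sum_{i=1}^{m} Q_i(t)\, g_{i,t-1}^{(k)},\ \theta \Big\rangle + \alpha \big\| \theta - \theta_{t-1}^{(k)} \big\|_2^2. \tag{10} \]

• Construct the randomized stationary policy $\pi_t^{(k)}$ according to (5) with $\theta = \theta_t^{(k)}$, and choose the action $a_t^{(k)}$ at the $k$-th MDP according to the conditional distribution $\pi_t^{(k)}\big(\cdot \,\big|\, s_t^{(k)}\big)$.
• Update the virtual queue $Q_i(t)$ according to (9) for all $i = 1, 2, \dots, m$.

Note that for any slot $t \ge 1$, this algorithm gives a separable randomized stationary policy, so that each MDP chooses its own policy based on its own functions $f_{t-1}^{(k)}$, $g_{i,t-1}^{(k)}$, $i \in \{1, 2, \dots, m\}$, and a common multiplier $Q(t) := (Q_1(t), \dots, Q_m(t))$.

The next lemma shows that solving (10) is in fact a projection onto the state-action polyhedron. For any set $\mathcal{X} \subseteq \mathbb{R}^n$ and a vector $y \in \mathbb{R}^n$, define the projection operator $\mathcal{P}_{\mathcal{X}}(y)$ as

\[ \mathcal{P}_{\mathcal{X}}(y) = \operatorname*{arginf}_{x \in \mathcal{X}} \| x - y \|_2. \]

Lemma 3.1. Fix an $\alpha > 0$ and $t \in \{1, 2, 3, \dots\}$. The $\theta_t^{(k)}$ that solves (10) is

\[ \theta_t^{(k)} = \mathcal{P}_{\Theta^{(k)}}\Big( \theta_{t-1}^{(k)} - \frac{w_t^{(k)}}{2\alpha} \Big), \]

where $w_t^{(k)} = V f_{t-1}^{(k)} + \sum_{i=1}^{m} Q_i(t)\, g_{i,t-1}^{(k)} \in \mathbb{R}^{|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|}$.

Proof. By definition, we have

\begin{align*}
\theta_t^{(k)} &= \operatorname*{argmin}_{\theta \in \Theta^{(k)}} \big\langle w_t^{(k)}, \theta \big\rangle + \alpha \big\| \theta - \theta_{t-1}^{(k)} \big\|_2^2 \\
&= \operatorname*{argmin}_{\theta \in \Theta^{(k)}} \big\langle w_t^{(k)}, \theta - \theta_{t-1}^{(k)} \big\rangle + \alpha \big\| \theta - \theta_{t-1}^{(k)} \big\|_2^2 + \big\langle w_t^{(k)}, \theta_{t-1}^{(k)} \big\rangle \\
&= \operatorname*{argmin}_{\theta \in \Theta^{(k)}} \alpha \Big( \Big\langle \frac{w_t^{(k)}}{\alpha}, \theta - \theta_{t-1}^{(k)} \Big\rangle + \big\| \theta - \theta_{t-1}^{(k)} \big\|_2^2 + \Big\| \frac{w_t^{(k)}}{2\alpha} \Big\|_2^2 \Big) \\
&= \operatorname*{argmin}_{\theta \in \Theta^{(k)}} \alpha \Big\| \theta - \theta_{t-1}^{(k)} + \frac{w_t^{(k)}}{2\alpha} \Big\|_2^2 = \mathcal{P}_{\Theta^{(k)}}\Big( \theta_{t-1}^{(k)} - \frac{w_t^{(k)}}{2\alpha} \Big),
\end{align*}

finishing the proof.

3.1. Intuition of the algorithm and roadmap of analysis

The intuition of this algorithm follows from the discussion in Section 2.5. Instead of the Markovian regret and constraint set, we work on the imaginary system in which, after the decision maker chooses any joint policy $\Pi_t$ and the penalty/constraint functions are revealed, the $K$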
parallel Markov chains reach stationary state distribution right away, with state-action probability vectors $\{\theta_t^{(k)}\}_{k=1}^{K}$ for $K$ parallel MDPs. Thus there is no Markov state in such a system anymore and the corresponding stationary penalty and constraint function value at time $t$ can be expressed as $\sum_{k=1}^{K} \big\langle f_t^{(k)}, \theta_t^{(k)} \big\rangle$ and $\sum_{k=1}^{K} \big\langle g_{i,t}^{(k)}, \theta_t^{(k)} \big\rangle$, $i = 1, 2, \dots, m$, respectively. As a consequence, we are now facing a relatively easier task of minimizing the following regret:

\[ \sum_{t=0}^{T-1} \mathbb{E}\Big[ \sum_{k=1}^{K} \big\langle f_t^{(k)}, \theta_t^{(k)} \big\rangle \Big] - \sum_{t=0}^{T-1} \mathbb{E}\Big[ \sum_{k=1}^{K} \big\langle f_t^{(k)}, \theta_*^{(k)} \big\rangle \Big], \]

where $\{\theta_*^{(k)}\}_{k=1}^{K}$ are the state-action probabilities corresponding to the best fixed joint randomized stationary policy within the following stationary constraint set

\[ \overline{\mathcal{G}} := \Big\{ \theta^{(k)} \in \Theta^{(k)},\ k \in \{1, 2, \dots, K\} : \sum_{k=1}^{K} \Big\langle \mathbb{E}\big[g_{i,t}^{(k)}\big], \theta^{(k)} \Big\rangle \le 0,\ i = 1, 2, \dots, m \Big\}, \]

with the assumption that Slater's condition (8) holds.

To analyze the proposed algorithm, we need to tackle the following two major challenges:

• Whether or not the policy decision of the proposed algorithm would yield $O(\sqrt{T})$ regret and constraint violation on the imaginary system that reaches steady state instantaneously on each slot.
• Whether the error between the imaginary and true systems can be bounded by $O(\sqrt{T})$.

In the next section, we answer these questions via a multi-stage analysis piecing together the results of MDPs from Section 2.4 with multiple ingredients from convex analysis and stochastic queue analysis. We first show the $O(\sqrt{T})$ regret and constraint violation in the imaginary online linear program, incorporating a new regret analysis procedure with a stochastic drift analysis for queue processes. Then, we show that if the benchmark randomized stationary algorithm always starts from its stationary state, the discrepancy of regrets between the imaginary and true systems can be controlled via the slow-update property of the proposed algorithm together with the properties of MDPs developed in Section 2.4. Finally, for the problem with arbitrary non-stationary starting state, we reformulate it as a perturbation on the aforementioned stationary state problem and analyze the perturbation via Farkas' Lemma.

4. Convergence time analysis

4.1. Stationary state performance: An online linear program

Let $Q(t) := [Q_1(t), Q_2(t), \dots, Q_m(t)]$ be the virtual queue vector and $L(t) = \frac{1}{2} \| Q(t) \|_2^2$. Define the drift $\Delta(t) := L(t+1) - L(t)$.

4.1.1. Sample-path analysis

This section develops a couple of bounds given a sequence of penalty functions $f_0^{(k)}, f_1^{(k)}, \dots, f_{T-1}^{(k)}$ and constraint functions $g_{i,0}^{(k)}, g_{i,1}^{(k)}, \dots, g_{i,T-1}^{(k)}$. The following lemma provides bounds for virtual queue processes:

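Lemma 4.1. For any $i \in \{1, 2, \dots, m\}$ and any $T \in \{1, 2, \dots\}$, the following holds under the virtual queue update (9):

\[ \sum_{t=1}^{T} \sum_{k=1}^{K} \big\langle g_{i,t-1}^{(k)}, \theta_{t-1}^{(k)} \big\rangle \le Q_i(T+1) - Q_i(1) + \Psi \sum_{t=1}^{T} \sum_{k=1}^{K} \sqrt{|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|}\ \big\| \theta_t^{(k)} - \theta_{t-1}^{(k)} \big\|_2, \]

where $\Psi > 0$ is the constant defined in (4).

Proof. By the queue updating rule (9), for any $t \in \mathbb{N}$,

\begin{align*}
Q_i(t+1) &= \max\Big\{ Q_i(t) + \sum_{k=1}^{K} \big\langle g_{i,t-1}^{(k)}, \theta_t^{(k)} \big\rangle,\ 0 \Big\} \ge Q_i(t) + \sum_{k=1}^{K} \big\langle g_{i,t-1}^{(k)}, \theta_t^{(k)} \big\rangle \\
&= Q_i(t) + \sum_{k=1}^{K} \big\langle g_{i,t-1}^{(k)}, \theta_{t-1}^{(k)} \big\rangle + \sum_{k=1}^{K} \big\langle g_{i,t-1}^{(k)}, \theta_t^{(k)} - \theta_{t-1}^{(k)} \big\rangle \\
&\ge Q_i(t) + \sum_{k=1}^{K} \big\langle g_{i,t-1}^{(k)}, \theta_{t-1}^{(k)} \big\rangle - \sum_{k=1}^{K} \big\| g_{i,t-1}^{(k)} \big\|_2 \big\| \theta_t^{(k)} - \theta_{t-1}^{(k)} \big\|_2.
\end{align*}

Note that the constraint functions are deterministically bounded,

\[ \big\| g_{i,t-1}^{(k)} \big\|_2 \le \sqrt{|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|}\ \Psi. \]

Substituting this bound into the above queue bound, summing over $t \in \{1, \dots, T\}$ and rearranging the terms finishes the proof.

The next lemma provides a bound for the drift $\Delta(t)$.

Lemma 4.2. For any slot $t \ge 1$, we have

\[ \Delta(t) \le \frac{1}{2} m K^2 \Psi^2 + \sum_{i=1}^{m} Q_i(t) \sum_{k=1}^{K} \big\langle g_{i,t-1}^{(k)}, \theta_t^{(k)} \big\rangle. \]

Proof. By definition, we have

\begin{align*}
\Delta(t) &= \frac{1}{2} \| Q(t+1) \|_2^2 - \frac{1}{2} \| Q(t) \|_2^2 \le \frac{1}{2} \sum_{i=1}^{m} \Big( \Big( Q_i(t) + \sum_{k=1}^{K} \big\langle g_{i,t-1}^{(k)}, \theta_t^{(k)} \big\rangle \Big)^2 - Q_i(t)^2 \Big) \\
&= \sum_{i=1}^{m} Q_i(t) \sum_{k=1}^{K} \big\langle g_{i,t-1}^{(k)}, \theta_t^{(k)} \big\rangle + \frac{1}{2} \sum_{i=1}^{m} \Big( \sum_{k=1}^{K} \big\langle g_{i,t-1}^{(k)}, \theta_t^{(k)} \big\rangle \Big)^2.
\end{align*}

Note that by the boundedness (4), we have

\[ \Big| \sum_{k=1}^{K} \big\langle g_{i,t-1}^{(k)}, \theta_t^{(k)} \big\rangle \Big| \le \sum_{k=1}^{K} \big\| g_{i,t-1}^{(k)} \big\|_\infty \big\| \theta_t^{(k)} \big\|_1 \le K \Psi. \]

Substituting this bound into the drift bound finishes the proof.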
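The virtual queue recursion (9) and the drift bound of Lemma 4.2 can be checked numerically. The sketch below is only an illustration under simplifying assumptions that are not in the paper: the constraint vectors are drawn uniformly from $[-\Psi, \Psi]$, and the $\theta$ vectors are arbitrary probability vectors rather than outputs of subproblem (10), which is enough here because the proof of Lemma 4.2 only uses $\|\theta\|_1 = 1$. All function names and parameter values are hypothetical.

```python
import random


def inner(g, theta):
    """Inner product <g, theta> of two equal-length vectors."""
    return sum(gi * ti for gi, ti in zip(g, theta))


def queue_update(Q, g_prev, theta):
    """One step of (9): Q_i <- max(Q_i + sum_k <g_{i,t-1}^{(k)}, theta_t^{(k)}>, 0).

    g_prev[i][k] is the constraint vector of constraint i for MDP k,
    theta[k] is the state-action vector of MDP k.
    """
    K = len(theta)
    return [max(Q[i] + sum(inner(g_prev[i][k], theta[k]) for k in range(K)), 0.0)
            for i in range(len(Q))]


def drift(Q, Q_next):
    """Delta(t) = L(t+1) - L(t) with L = 0.5 * ||Q||^2."""
    return 0.5 * sum(q * q for q in Q_next) - 0.5 * sum(q * q for q in Q)


def simulate(m=2, K=3, n=4, psi=1.0, steps=200, seed=0):
    """Run the queue recursion and return (largest drift-minus-bound, final Q)."""
    rng = random.Random(seed)
    Q = [0.0] * m
    worst = float("-inf")
    for _ in range(steps):
        # Random bounded constraint vectors and random probability vectors.
        g = [[[rng.uniform(-psi, psi) for _ in range(n)] for _ in range(K)]
             for _ in range(m)]
        theta = []
        for _ in range(K):
            w = [rng.random() + 1e-12 for _ in range(n)]
            total = sum(w)
            theta.append([x / total for x in w])
        Q_next = queue_update(Q, g, theta)
        bound = 0.5 * m * K ** 2 * psi ** 2 + sum(
            Q[i] * sum(inner(g[i][k], theta[k]) for k in range(K))
            for i in range(m))
        worst = max(worst, drift(Q, Q_next) - bound)
        Q = Q_next
    return worst, Q


worst_gap, final_Q = simulate()
```

A non-positive `worst_gap` (up to floating-point rounding) means the drift never exceeded the bound of Lemma 4.2 along the simulated path, and the `max` in the update keeps every queue non-negative.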

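Consider a convex set $\mathcal{X} \subseteq \mathbb{R}^n$. Recall that for a fixed real number $c > 0$, a function $h : \mathcal{X} \to \mathbb{R}$ is said to be $c$-strongly convex if $h(x) - \frac{c}{2} \| x \|_2^2$ is convex over $x \in \mathcal{X}$. It is easy to see that if $q : \mathcal{X} \to \mathbb{R}$ is convex, $c > 0$ and $b \in \mathbb{R}^n$, the function $q(x) + \frac{c}{2} \| x - b \|_2^2$ is $c$-strongly convex. Furthermore, if the function $h$ is $c$-strongly convex and is minimized at a point $x_{\min} \in \mathcal{X}$, then (see, e.g., the corresponding corollary in [33]):

\[ h(x_{\min}) \le h(y) - \frac{c}{2} \| y - x_{\min} \|_2^2, \quad \forall y \in \mathcal{X}. \tag{13} \]

The following lemma is a direct consequence of the above strongly convex result. It also demonstrates the key property of our minimization subproblem (10).

Lemma 4.3. The following bound holds for any $k \in \{1, 2, \dots, K\}$ and any fixed $\theta_*^{(k)} \in \Theta^{(k)}$:

\begin{align*}
& V \big\langle f_{t-1}^{(k)}, \theta_t^{(k)} - \theta_{t-1}^{(k)} \big\rangle + \sum_{i=1}^{m} Q_i(t) \big\langle g_{i,t-1}^{(k)}, \theta_t^{(k)} \big\rangle + \alpha \big\| \theta_t^{(k)} - \theta_{t-1}^{(k)} \big\|_2^2 \\
&\quad \le V \big\langle f_{t-1}^{(k)}, \theta_*^{(k)} - \theta_{t-1}^{(k)} \big\rangle + \sum_{i=1}^{m} Q_i(t) \big\langle g_{i,t-1}^{(k)}, \theta_*^{(k)} \big\rangle + \alpha \big\| \theta_*^{(k)} - \theta_{t-1}^{(k)} \big\|_2^2 - \alpha \big\| \theta_*^{(k)} - \theta_t^{(k)} \big\|_2^2. \tag{14}
\end{align*}

This lemma follows easily from the fact that the proposed algorithm (10) gives $\theta_t^{(k)} \in \Theta^{(k)}$ minimizing the left hand side, which is a $2\alpha$-strongly convex function, and then applying (13) with

\[ h\big(\theta_*^{(k)}\big) = V \big\langle f_{t-1}^{(k)}, \theta_*^{(k)} - \theta_{t-1}^{(k)} \big\rangle + \sum_{i=1}^{m} Q_i(t) \big\langle g_{i,t-1}^{(k)}, \theta_*^{(k)} \big\rangle + \alpha \big\| \theta_*^{(k)} - \theta_{t-1}^{(k)} \big\|_2^2. \]

Combining the previous two lemmas gives the following drift-plus-penalty bound.

Lemma 4.4. For any fixed $\{\theta_*^{(k)}\}_{k=1}^{K}$ such that $\theta_*^{(k)} \in \Theta^{(k)}$ and $t \in \mathbb{N}$, we have the following bound,

\begin{align*}
& \Delta(t) + V \sum_{k=1}^{K} \big\langle f_{t-1}^{(k)}, \theta_t^{(k)} - \theta_{t-1}^{(k)} \big\rangle + \alpha \sum_{k=1}^{K} \big\| \theta_t^{(k)} - \theta_{t-1}^{(k)} \big\|_2^2 \\
&\quad \le \frac{3}{2} m K^2 \Psi^2 + V \sum_{k=1}^{K} \big\langle f_{t-1}^{(k)}, \theta_*^{(k)} - \theta_{t-1}^{(k)} \big\rangle + \sum_{i=1}^{m} Q_i(t-1) \sum_{k=1}^{K} \big\langle g_{i,t-1}^{(k)}, \theta_*^{(k)} \big\rangle \\
&\qquad + \alpha \sum_{k=1}^{K} \big\| \theta_*^{(k)} - \theta_{t-1}^{(k)} \big\|_2^2 - \alpha \sum_{k=1}^{K} \big\| \theta_*^{(k)} - \theta_t^{(k)} \big\|_2^2. \tag{15}
\end{align*}

Proof. Using Lemma 4.2 and then Lemma 4.3, we obtain

\begin{align*}
& \Delta(t) + V \sum_{k=1}^{K} \big\langle f_{t-1}^{(k)}, \theta_t^{(k)} - \theta_{t-1}^{(k)} \big\rangle + \alpha \sum_{k=1}^{K} \big\| \theta_t^{(k)} - \theta_{t-1}^{(k)} \big\|_2^2 \\
&\quad \le \frac{1}{2} m K^2 \Psi^2 + \sum_{i=1}^{m} Q_i(t) \sum_{k=1}^{K} \big\langle g_{i,t-1}^{(k)}, \theta_t^{(k)} \big\rangle + V \sum_{k=1}^{K} \big\langle f_{t-1}^{(k)}, \theta_t^{(k)} - \theta_{t-1}^{(k)} \big\rangle + \alpha \sum_{k=1}^{K} \big\| \theta_t^{(k)} - \theta_{t-1}^{(k)} \big\|_2^2 \\
&\quad \le \frac{1}{2} m K^2 \Psi^2 + V \sum_{k=1}^{K} \big\langle f_{t-1}^{(k)}, \theta_*^{(k)} - \theta_{t-1}^{(k)} \big\rangle + \sum_{i=1}^{m} Q_i(t) \sum_{k=1}^{K} \big\langle g_{i,t-1}^{(k)}, \theta_*^{(k)} \big\rangle \\
&\qquad + \alpha \sum_{k=1}^{K} \big\| \theta_*^{(k)} - \theta_{t-1}^{(k)} \big\|_2^2 - \alpha \sum_{k=1}^{K} \big\| \theta_*^{(k)} - \theta_t^{(k)} \big\|_2^2. \tag{16}
\end{align*}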

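Note that by the queue updating rule (9), we have for any $t \ge 2$,

\[ | Q_i(t) - Q_i(t-1) | \le \Big| \sum_{k=1}^{K} \big\langle g_{i,t-2}^{(k)}, \theta_{t-1}^{(k)} \big\rangle \Big| \le \sum_{k=1}^{K} \big\| g_{i,t-2}^{(k)} \big\|_\infty \big\| \theta_{t-1}^{(k)} \big\|_1 \le K \Psi, \]

and for $t = 1$, $Q_i(t) - Q_i(t-1) = 0$ by the initial condition of the algorithm. Also, we have for any $\theta_*^{(k)} \in \Theta^{(k)}$,

\[ \Big| \sum_{k=1}^{K} \big\langle g_{i,t-1}^{(k)}, \theta_*^{(k)} \big\rangle \Big| \le \sum_{k=1}^{K} \big\| g_{i,t-1}^{(k)} \big\|_\infty \big\| \theta_*^{(k)} \big\|_1 \le K \Psi. \]

Thus, we have

\[ \sum_{i=1}^{m} Q_i(t) \sum_{k=1}^{K} \big\langle g_{i,t-1}^{(k)}, \theta_*^{(k)} \big\rangle \le \sum_{i=1}^{m} Q_i(t-1) \sum_{k=1}^{K} \big\langle g_{i,t-1}^{(k)}, \theta_*^{(k)} \big\rangle + m K^2 \Psi^2. \]

Substituting this bound into (16) finishes the proof.

4.1.2. Objective bound

Theorem 4.1. For any $\{\theta_*^{(k)}\}_{k=1}^{K}$ in the stationary constraint set $\overline{\mathcal{G}}$ and any $T \in \{1, 2, 3, \dots\}$, the proposed algorithm has the following stationary state performance bound:

\[ \sum_{t=0}^{T-1} \mathbb{E}\Big[ \sum_{k=1}^{K} \big\langle f_t^{(k)}, \theta_t^{(k)} \big\rangle \Big] \le \sum_{t=0}^{T-1} \mathbb{E}\Big[ \sum_{k=1}^{K} \big\langle f_t^{(k)}, \theta_*^{(k)} \big\rangle \Big] + \frac{2 \alpha K}{V} + T \Big( \frac{V \Psi^2}{4 \alpha} \sum_{k=1}^{K} |\mathcal{S}^{(k)}||\mathcal{A}^{(k)}| + \frac{3 m K^2 \Psi^2}{2 V} \Big). \]

In particular, choosing $\alpha = T$ and $V = \sqrt{T}$ gives the $O(\sqrt{T})$ regret

\[ \sum_{t=0}^{T-1} \mathbb{E}\Big[ \sum_{k=1}^{K} \big\langle f_t^{(k)}, \theta_t^{(k)} \big\rangle \Big] \le \sum_{t=0}^{T-1} \mathbb{E}\Big[ \sum_{k=1}^{K} \big\langle f_t^{(k)}, \theta_*^{(k)} \big\rangle \Big] + \Big( 2K + \frac{\Psi^2}{4} \sum_{k=1}^{K} |\mathcal{S}^{(k)}||\mathcal{A}^{(k)}| + \frac{3}{2} m K^2 \Psi^2 \Big) \sqrt{T}. \]

Proof. First of all, note that $\{g_{i,t-1}^{(k)}\}_{k=1}^{K}$ is i.i.d. and independent of all system history up to $t-1$, and thus independent of $Q_i(t-1)$, $i = 1, 2, \dots, m$. We have

\[ \mathbb{E}\Big[ Q_i(t-1) \sum_{k=1}^{K} \big\langle g_{i,t-1}^{(k)}, \theta_*^{(k)} \big\rangle \Big] = \mathbb{E}[Q_i(t-1)]\ \mathbb{E}\Big[ \sum_{k=1}^{K} \big\langle g_{i,t-1}^{(k)}, \theta_*^{(k)} \big\rangle \Big] \le 0, \tag{17} \]

where the last inequality follows from the assumption that $\{\theta_*^{(k)}\}_{k=1}^{K}$ is in the constraint set $\overline{\mathcal{G}}$.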

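Substituting $\theta_*^{(k)}$ into (15) and taking expectation with respect to both sides give

\begin{align*}
& \mathbb{E}[\Delta(t)] + V \sum_{k=1}^{K} \mathbb{E} \big\langle f_{t-1}^{(k)}, \theta_t^{(k)} - \theta_{t-1}^{(k)} \big\rangle + \alpha \sum_{k=1}^{K} \mathbb{E} \big\| \theta_t^{(k)} - \theta_{t-1}^{(k)} \big\|_2^2 \\
&\quad \le \frac{3}{2} m K^2 \Psi^2 + V \sum_{k=1}^{K} \mathbb{E} \big\langle f_{t-1}^{(k)}, \theta_*^{(k)} - \theta_{t-1}^{(k)} \big\rangle + \mathbb{E}\Big[ \sum_{i=1}^{m} Q_i(t-1) \sum_{k=1}^{K} \big\langle g_{i,t-1}^{(k)}, \theta_*^{(k)} \big\rangle \Big] \\
&\qquad + \alpha \sum_{k=1}^{K} \mathbb{E} \big\| \theta_*^{(k)} - \theta_{t-1}^{(k)} \big\|_2^2 - \alpha \sum_{k=1}^{K} \mathbb{E} \big\| \theta_*^{(k)} - \theta_t^{(k)} \big\|_2^2 \\
&\quad \le \frac{3}{2} m K^2 \Psi^2 + V \sum_{k=1}^{K} \mathbb{E} \big\langle f_{t-1}^{(k)}, \theta_*^{(k)} - \theta_{t-1}^{(k)} \big\rangle + \alpha \sum_{k=1}^{K} \mathbb{E} \big\| \theta_*^{(k)} - \theta_{t-1}^{(k)} \big\|_2^2 - \alpha \sum_{k=1}^{K} \mathbb{E} \big\| \theta_*^{(k)} - \theta_t^{(k)} \big\|_2^2,
\end{align*}

where the second inequality follows from (17). Note that for any $k$, completing the squares gives

\[ V \big\langle f_{t-1}^{(k)}, \theta_t^{(k)} - \theta_{t-1}^{(k)} \big\rangle + \alpha \big\| \theta_t^{(k)} - \theta_{t-1}^{(k)} \big\|_2^2 = \alpha \Big\| \theta_t^{(k)} - \theta_{t-1}^{(k)} + \frac{V}{2 \alpha} f_{t-1}^{(k)} \Big\|_2^2 - \frac{V^2}{4 \alpha} \big\| f_{t-1}^{(k)} \big\|_2^2 \ge - \frac{V^2 \Psi^2 |\mathcal{S}^{(k)}||\mathcal{A}^{(k)}|}{4 \alpha}. \]

Substituting this inequality into the previous bound and rearranging the terms give

\begin{align*}
V \sum_{k=1}^{K} \mathbb{E} \big\langle f_{t-1}^{(k)}, \theta_{t-1}^{(k)} \big\rangle &\le V \sum_{k=1}^{K} \mathbb{E} \big\langle f_{t-1}^{(k)}, \theta_*^{(k)} \big\rangle - \mathbb{E}[\Delta(t)] + \frac{V^2 \Psi^2}{4 \alpha} \sum_{k=1}^{K} |\mathcal{S}^{(k)}||\mathcal{A}^{(k)}| + \frac{3}{2} m K^2 \Psi^2 \\
&\quad + \alpha \sum_{k=1}^{K} \mathbb{E} \big\| \theta_*^{(k)} - \theta_{t-1}^{(k)} \big\|_2^2 - \alpha \sum_{k=1}^{K} \mathbb{E} \big\| \theta_*^{(k)} - \theta_t^{(k)} \big\|_2^2.
\end{align*}

Taking telescoping sums from 1 to $T$ and dividing both sides by $V$ gives

\begin{align*}
\sum_{t=1}^{T} \sum_{k=1}^{K} \mathbb{E} \big\langle f_{t-1}^{(k)}, \theta_{t-1}^{(k)} \big\rangle &\le \sum_{t=1}^{T} \sum_{k=1}^{K} \mathbb{E} \big\langle f_{t-1}^{(k)}, \theta_*^{(k)} \big\rangle + \frac{L(1) - \mathbb{E}[L(T+1)]}{V} + T \Big( \frac{V \Psi^2}{4 \alpha} \sum_{k=1}^{K} |\mathcal{S}^{(k)}||\mathcal{A}^{(k)}| + \frac{3 m K^2 \Psi^2}{2 V} \Big) \\
&\quad + \frac{\alpha}{V} \sum_{k=1}^{K} \mathbb{E} \big\| \theta_*^{(k)} - \theta_0^{(k)} \big\|_2^2 - \frac{\alpha}{V} \sum_{k=1}^{K} \mathbb{E} \big\| \theta_*^{(k)} - \theta_T^{(k)} \big\|_2^2 \\
&\le \sum_{t=1}^{T} \sum_{k=1}^{K} \mathbb{E} \big\langle f_{t-1}^{(k)}, \theta_*^{(k)} \big\rangle + T \Big( \frac{V \Psi^2}{4 \alpha} \sum_{k=1}^{K} |\mathcal{S}^{(k)}||\mathcal{A}^{(k)}| + \frac{3 m K^2 \Psi^2}{2 V} \Big) + \frac{2 \alpha K}{V},
\end{align*}

where we use the fact that $L(0) = L(1) = 0$ and $\big\| \theta_*^{(k)} - \theta_0^{(k)} \big\|_2^2 \le 2$.

4.1.3. A drift lemma and its implications

From Lemma 4.1, we know that in order to get the constraint violation bound, we need to look at the size of the virtual queue $Q_i(T+1)$, $i = 1, 2, \dots, m$. The following drift lemma serves as a cornerstone for our goal.

Lemma 4.5 (Lemma 5 of [15]). Let $\{Z(t), t \ge 1\}$ be a discrete time stochastic process adapted to a filtration $\{\mathcal{F}(t), t \ge 1\}$. Suppose there exist an integer $t_0 > 0$ and real constants $\lambda \in \mathbb{R}$, $\delta_{\max} > 0$,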
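and $0 < \zeta \le \delta_{\max}$ such that
\[
|Z(t+1) - Z(t)| \le \delta_{\max}, \tag{18}
\]
\[
\mathbb{E}\big[Z(t+t_0) - Z(t) \,\big|\, \mathcal{F}(t)\big] \le
\begin{cases}
t_0\delta_{\max}, & \text{if } Z(t) < \lambda, \\
-t_0\zeta, & \text{if } Z(t) \ge \lambda,
\end{cases} \tag{19}
\]
hold for all $t \in \{1, 2, \dots\}$. Then, the following holds:

1. $\mathbb{E}[Z(t)] \le \lambda + t_0\dfrac{4\delta_{\max}^2}{\zeta}\log\Big[1 + \dfrac{8\delta_{\max}^2}{\zeta^2} e^{\zeta/(4\delta_{\max})}\Big]$, $\forall t \in \{1, 2, \dots\}$.

2. For any constant $0 < \mu < 1$, we have $\Pr(Z(t) \ge z) \le \mu$, $\forall t \in \{1, 2, \dots\}$, where
\[
z = \lambda + t_0\frac{4\delta_{\max}^2}{\zeta}\log\Big[1 + \frac{8\delta_{\max}^2}{\zeta^2} e^{\zeta/(4\delta_{\max})}\Big] + t_0\frac{4\delta_{\max}^2}{\zeta}\log\frac{1}{\mu}.
\]

Note that a special case of the above drift lemma for $t_0 = 1$ dates back to the seminal paper of Hajek [34], which bounds the size of a random process with strongly negative drift. Since then, its power has been demonstrated in various scenarios ranging from steady state queue bounds [35] to feasibility analysis of stochastic optimization [36]. The current generalization to a multi-step drift is first considered in [5].

This lemma is useful in the current context due to the following lemma, whose proof can be found in the appendix.

Lemma 4.6. Let $\mathcal{F}(t)$, $t \ge 1$, be the system history functions up to time $t$, including $f_0^{(k)}, \dots, f_{t-1}^{(k)}$, $g_{0,i}^{(k)}, \dots, g_{t-1,i}^{(k)}$, $i = 1, 2, \dots, m$, $k = 1, 2, \dots, K$, and let $\mathcal{F}(0)$ be a null set. Let $t_0$ be an arbitrary positive integer. Then, we have
\[
\big|\|Q(t+1)\| - \|Q(t)\|\big| \le \sqrt{m}K\Psi,
\]
\[
\mathbb{E}\big[\|Q(t+t_0)\| - \|Q(t)\| \,\big|\, \mathcal{F}(t)\big] \le
\begin{cases}
t_0\sqrt{m}K\Psi, & \text{if } \|Q(t)\| < \lambda, \\
-\frac{\eta}{2}t_0, & \text{if } \|Q(t)\| \ge \lambda,
\end{cases}
\]
where
\[
\lambda = \frac{8VK\Psi + 3mK^2\Psi^2 + 4K\alpha + t_0(t_0-1)m\Psi + 2mK\Psi\eta t_0 + \eta^2 t_0^2}{\eta t_0}.
\]

Combining the previous two lemmas gives the virtual queue bound as
\[
\mathbb{E}\|Q(t)\| \le \frac{8VK\Psi + 3mK^2\Psi^2 + 4K\alpha + t_0(t_0-1)m\Psi + 2mK\Psi\eta t_0 + \eta^2 t_0^2}{\eta t_0} + 4t_0\sqrt{m}K\Psi\log\big[1 + 8e^{1/4}\big].
\]
We then choose $t_0 = \sqrt{T}$, $V = \sqrt{T}$ and $\alpha = T$, which implies that
\[
\mathbb{E}\|Q(t)\| \le C(m, K, \Psi, \eta)\sqrt{T}, \tag{20}
\]
where
\[
C(m, K, \Psi, \eta) = \frac{8K\Psi}{\eta} + \frac{3mK^2\Psi^2}{\eta} + \frac{4K + m\Psi}{\eta} + 2mK\Psi + \eta + 4\sqrt{m}K\Psi\log\big(1 + 8e^{1/4}\big).
\]

The slow-update condition and constraint violation

In this section, we prove the slow-update property of the proposed algorithm, which not only implies the $O(\sqrt{T})$ constraint violation bound, but also plays a key role in the Markov analysis.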

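Lemma 4.7. The sequence of state-action vectors $\theta_t^{(k)}$, $t \in \{1, 2, \dots, T\}$, satisfies
\[
\mathbb{E}\|\theta_t^{(k)} - \theta_{t-1}^{(k)}\| \le \frac{\sqrt{m|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|}\,\Psi\,\mathbb{E}\|Q(t)\|}{2\alpha} + \frac{\sqrt{|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|}\,\Psi V}{2\alpha}.
\]
In particular, choosing $V = \sqrt{T}$ and $\alpha = T$ gives a slow-update condition
\[
\mathbb{E}\|\theta_t^{(k)} - \theta_{t-1}^{(k)}\| \le \frac{\sqrt{|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|}\,\Psi + C\sqrt{m|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|}\,\Psi}{2\sqrt{T}}, \tag{21}
\]
where $C = C(m, K, \Psi, \eta)$ is defined in (20).

Proof. First, choosing $\theta = \theta_{t-1}^{(k)}$ in (14) gives
\[
V\big\langle f_{t-1}^{(k)}, \theta_t^{(k)} - \theta_{t-1}^{(k)}\big\rangle + \sum_{i=1}^m Q_i(t)\big\langle g_{i,t-1}^{(k)}, \theta_t^{(k)}\big\rangle + \alpha\|\theta_t^{(k)} - \theta_{t-1}^{(k)}\|^2 \le \sum_{i=1}^m Q_i(t)\big\langle g_{i,t-1}^{(k)}, \theta_{t-1}^{(k)}\big\rangle - \alpha\|\theta_{t-1}^{(k)} - \theta_t^{(k)}\|^2.
\]
Rearranging the terms gives
\[
\begin{aligned}
2\alpha\|\theta_t^{(k)} - \theta_{t-1}^{(k)}\|^2 &\le -V\big\langle f_{t-1}^{(k)}, \theta_t^{(k)} - \theta_{t-1}^{(k)}\big\rangle - \sum_{i=1}^m Q_i(t)\big\langle g_{i,t-1}^{(k)}, \theta_t^{(k)} - \theta_{t-1}^{(k)}\big\rangle \\
&\le V\|f_{t-1}^{(k)}\|\,\|\theta_t^{(k)} - \theta_{t-1}^{(k)}\| + \sum_{i=1}^m Q_i(t)\|g_{i,t-1}^{(k)}\|\,\|\theta_t^{(k)} - \theta_{t-1}^{(k)}\| \\
&\le V\|f_{t-1}^{(k)}\|\,\|\theta_t^{(k)} - \theta_{t-1}^{(k)}\| + \|Q(t)\|\sqrt{\sum_{i=1}^m \|g_{i,t-1}^{(k)}\|^2}\;\|\theta_t^{(k)} - \theta_{t-1}^{(k)}\|,
\end{aligned}
\]
where the second and third inequalities follow from the Cauchy-Schwarz inequality. Thus, it follows that
\[
\|\theta_t^{(k)} - \theta_{t-1}^{(k)}\| \le \frac{V\|f_{t-1}^{(k)}\| + \|Q(t)\|\sqrt{\sum_{i=1}^m \|g_{i,t-1}^{(k)}\|^2}}{2\alpha}.
\]
Applying the fact that $\|f_{t-1}^{(k)}\| \le \sqrt{|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|}\,\Psi$ and $\|g_{i,t-1}^{(k)}\| \le \sqrt{|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|}\,\Psi$, and taking expectation from both sides, give the first bound in the lemma. The second bound follows directly from the first bound by further substituting (20).

Theorem 4.2. The proposed algorithm has the following stationary state constraint violation bound:
\[
\sum_{t=0}^{T-1}\mathbb{E}\Big(\sum_{k=1}^K \big\langle g_{i,t}^{(k)}, \theta_t^{(k)}\big\rangle\Big) \le \Big(C + \frac{1}{2}\sum_{k=1}^K \big(\sqrt{m}|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi^2 C + |\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi^2\big)\Big)\sqrt{T},
\]
where $C = C(m, K, \Psi, \eta)$ is defined in (20).

Proof. Taking expectation from both sides of the earlier lemma that bounds the constraint violation by the virtual queue gives
\[
\sum_{t=0}^{T-1}\mathbb{E}\Big(\sum_{k=1}^K \big\langle g_{i,t}^{(k)}, \theta_t^{(k)}\big\rangle\Big) \le \mathbb{E}\big(Q_i(T+1)\big) + \Psi\sum_{t=1}^{T}\sum_{k=1}^K \sqrt{|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|}\;\mathbb{E}\|\theta_t^{(k)} - \theta_{t-1}^{(k)}\|.
\]
Substituting the bounds (20) and (21) into the above inequality gives the desired result.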
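The slow-update argument can be checked numerically on a toy instance. The sketch below is only an illustration under stated assumptions: it replaces the state-action polytope $\Theta^{(k)}$ by the plain probability simplex, and the helper names (`project_to_simplex`, `proximal_step`) and all sizes and values are hypothetical. It uses the fact that minimizing $\langle c, \theta\rangle + \alpha\|\theta - \theta_{\text{prev}}\|^2$ over a convex set is a Euclidean projection of $\theta_{\text{prev}} - c/(2\alpha)$, so the step length is at most $\|c\|/(2\alpha)$, which mirrors the $1/(2\alpha)$ factor in the lemma.

```python
import math
import random


def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex (sort-based method)."""
    u = sorted(v, reverse=True)
    cumulative = 0.0
    tau = 0.0
    for j, u_j in enumerate(u, start=1):
        cumulative += u_j
        candidate = (cumulative - 1.0) / j
        if u_j - candidate > 0:
            tau = candidate  # keep the threshold of the largest valid index j
    return [max(x - tau, 0.0) for x in v]


def proximal_step(theta_prev, cost, alpha):
    """argmin over the simplex of <cost, theta> + alpha * ||theta - theta_prev||^2."""
    shifted = [p - c / (2.0 * alpha) for p, c in zip(theta_prev, cost)]
    return project_to_simplex(shifted)


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


random.seed(0)
n, m = 6, 3              # hypothetical sizes: 6 state-action pairs, 3 constraints
V, alpha = 10.0, 100.0   # V = sqrt(T), alpha = T for T = 100
theta_prev = project_to_simplex([random.random() for _ in range(n)])
f = [random.uniform(-1.0, 1.0) for _ in range(n)]
g = [[random.uniform(-1.0, 1.0) for _ in range(n)] for _ in range(m)]
Q = [random.uniform(0.0, 5.0) for _ in range(m)]

# Linear cost V * f + sum_i Q_i * g_i, as in the per-slot objective.
cost = [V * f[j] + sum(Q[i] * g[i][j] for i in range(m)) for j in range(n)]
theta_next = proximal_step(theta_prev, cost, alpha)

step = l2_norm([a - b for a, b in zip(theta_next, theta_prev)])
bound = l2_norm(cost) / (2.0 * alpha)
print(step, bound)
```

Because projection onto a convex set is nonexpansive, `step <= bound` holds for any choice of the random inputs; larger `alpha` forces smaller steps, which is the mechanism behind the $1/\sqrt{T}$ rate when $\alpha = T$.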

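4.2. Markov analysis

So far, we have shown that our algorithm achieves an $O(\sqrt{T})$ regret and constraint violation simultaneously regarding the stationary online linear program over the constraint set of the imaginary system. In this section, we show how these stationary state results lead to a tight performance bound on the original true online MDP problem, compared with any joint randomized stationary algorithm starting from its stationary state.

4.2.1. Approximate mixing of MDPs

Let $\mathcal{F}(t)$, $t \ge 1$, be the set of system history functions up to time $t$, including $f_0^{(k)}, \dots, f_{t-1}^{(k)}$, $g_{0,i}^{(k)}, \dots, g_{t-1,i}^{(k)}$, $i = 1, 2, \dots, m$, $k = 1, 2, \dots, K$, and let $\mathcal{F}(0)$ be a null set. Let $d_{\pi_t^{(k)}}$ be the stationary state distribution at the $k$-th MDP under the randomized stationary policy $\pi_t^{(k)}$ in the proposed algorithm. Let $v_t^{(k)}$ be the true state distribution at time slot $t$ under the proposed algorithm given the function path $\mathcal{F}(T)$ and starting state $d_0^{(k)}$, i.e. for any $s \in \mathcal{S}^{(k)}$, $v_t^{(k)}(s) := \Pr\big(s_t^{(k)} = s \mid \mathcal{F}(T)\big)$ and $v_0^{(k)} = d_0^{(k)}$.

The following lemma provides a key estimate on the distance between the stationary distribution and the true distribution at each time slot. It builds upon the slow-update condition (Lemma 4.7) of the proposed algorithm and the uniform mixing bound of general MDPs stated earlier.

Lemma 4.8. Consider the proposed algorithm with $V = \sqrt{T}$ and $\alpha = T$. For any initial state distribution $\{d_0^{(k)}\}_{k=1}^K$ and any $t \in \{0, 1, 2, \dots, T-1\}$, we have
\[
\mathbb{E}\big\|d_{\pi_t^{(k)}} - v_t^{(k)}\big\|_1 \le \frac{\tau r\big(|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi + C\sqrt{m}|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi\big)}{2\sqrt{T}} + 2e^{-\frac{t}{\tau r} + 1},
\]
where $\tau$ and $r$ are the mixing parameters defined in the uniform mixing lemma and $C$ is an absolute constant defined in (20).

Proof. By Lemma 4.7 we know that for any $t \in \{1, 2, \dots, T\}$,
\[
\mathbb{E}\|\theta_t^{(k)} - \theta_{t-1}^{(k)}\|_2 \le \frac{\sqrt{|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|}\,\Psi + C\sqrt{m|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|}\,\Psi}{2\sqrt{T}},
\]
thus,
\[
\mathbb{E}\|\theta_t^{(k)} - \theta_{t-1}^{(k)}\|_1 \le \sqrt{|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|}\;\mathbb{E}\|\theta_t^{(k)} - \theta_{t-1}^{(k)}\|_2 \le \frac{|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi + C\sqrt{m}|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi}{2\sqrt{T}}.
\]
Since for any $s \in \mathcal{S}^{(k)}$,
\[
\big|d_{\pi_t^{(k)}}(s) - d_{\pi_{t-1}^{(k)}}(s)\big| = \Big|\sum_{a \in \mathcal{A}^{(k)}} \theta_t^{(k)}(a, s) - \sum_{a \in \mathcal{A}^{(k)}} \theta_{t-1}^{(k)}(a, s)\Big| \le \sum_{a \in \mathcal{A}^{(k)}} \big|\theta_t^{(k)}(a, s) - \theta_{t-1}^{(k)}(a, s)\big|,
\]
it then follows that
\[
\mathbb{E}\big\|d_{\pi_t^{(k)}} - d_{\pi_{t-1}^{(k)}}\big\|_1 \le \mathbb{E}\|\theta_t^{(k)} - \theta_{t-1}^{(k)}\|_1 \le \frac{|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi + C\sqrt{m}|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi}{2\sqrt{T}}.
\]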

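Now, we use the above relation to bound $\big\|d_{\pi_t^{(k)}} - v_t^{(k)}\big\|_1$ for any $t \ge r$:
\[
\begin{aligned}
\mathbb{E}\big\|d_{\pi_t^{(k)}} - v_t^{(k)}\big\|_1 &\le \mathbb{E}\big\|d_{\pi_t^{(k)}} - d_{\pi_{t-1}^{(k)}}\big\|_1 + \mathbb{E}\big\|d_{\pi_{t-1}^{(k)}} - v_t^{(k)}\big\|_1 \\
&\le \frac{|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi + C\sqrt{m}|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi}{2\sqrt{T}} + \mathbb{E}\big\|d_{\pi_{t-1}^{(k)}} - v_t^{(k)}\big\|_1 \\
&= \frac{|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi + C\sqrt{m}|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi}{2\sqrt{T}} + \mathbb{E}\Big\|\big(d_{\pi_{t-1}^{(k)}} - v_{t-1}^{(k)}\big)P^{(k)}_{\pi_{t-1}^{(k)}}\Big\|_1, 
\end{aligned} \tag{23}
\]
where the second inequality follows from the slow-update condition and the final equality follows from the fact that, given the function path $\mathcal{F}(T)$, the following holds:
\[
d_{\pi_{t-1}^{(k)}} - v_t^{(k)} = \big(d_{\pi_{t-1}^{(k)}} - v_{t-1}^{(k)}\big)P^{(k)}_{\pi_{t-1}^{(k)}}. \tag{24}
\]
To see this, note that from the proposed algorithm, the policy $\pi_t^{(k)}$ is determined by $\mathcal{F}(T)$. Thus, by definition of the stationary distribution, given $\mathcal{F}(T)$, we know that $d_{\pi_{t-1}^{(k)}} = d_{\pi_{t-1}^{(k)}}P^{(k)}_{\pi_{t-1}^{(k)}}$, and it is enough to show that, given $\mathcal{F}(T)$,
\[
v_t^{(k)} = v_{t-1}^{(k)}P^{(k)}_{\pi_{t-1}^{(k)}}.
\]
First of all, the state distribution $v_t^{(k)}$ is determined by $v_{t-1}^{(k)}$, $\pi_{t-1}^{(k)}$ and the probability transition from $s_{t-1}$ to $s_t$, which are in turn determined by $\mathcal{F}(T)$. Thus, given $\mathcal{F}(T)$, for any $s \in \mathcal{S}^{(k)}$,
\[
v_t^{(k)}(s) = \sum_{s' \in \mathcal{S}^{(k)}} \Pr\big(s_t = s \mid s_{t-1} = s', \mathcal{F}(T)\big)\, v_{t-1}^{(k)}(s'),
\]
and
\[
\begin{aligned}
\Pr\big(s_t = s \mid s_{t-1} = s', \mathcal{F}(T)\big) &= \sum_{a \in \mathcal{A}^{(k)}} \Pr\big(s_t = s \mid a_{t-1} = a, s_{t-1} = s', \mathcal{F}(T)\big)\Pr\big(a_{t-1} = a \mid s_{t-1} = s', \mathcal{F}(T)\big) \\
&= \sum_{a \in \mathcal{A}^{(k)}} P_a(s', s)\Pr\big(a_{t-1} = a \mid s_{t-1} = s', \mathcal{F}(T)\big) \\
&= \sum_{a \in \mathcal{A}^{(k)}} P_a(s', s)\,\pi_{t-1}^{(k)}(a \mid s') = P_{\pi_{t-1}^{(k)}}(s', s),
\end{aligned}
\]
where the second equality follows from the assumption that the probability transition is independent of the function path given the state and action, the third equality follows from the fact that $\pi_{t-1}^{(k)}$ is determined by $\mathcal{F}(T)$, so that for any $t$, $\pi_t^{(k)}(a \mid s') = \Pr\big(a_t = a \mid s_t = s', \mathcal{F}(T)\big)$, $\forall a \in \mathcal{A}^{(k)}, s' \in \mathcal{S}^{(k)}$, and the last equality follows from the definition of transition probability (3). This gives
\[
v_t^{(k)}(s) = \sum_{s' \in \mathcal{S}^{(k)}} P_{\pi_{t-1}^{(k)}}(s', s)\, v_{t-1}^{(k)}(s'),
\]
and thus (24) holds.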
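The one-step recursion used above can be illustrated on a small example. The following sketch builds the policy-induced transition matrix $P_\pi(s', s) = \sum_a P_a(s', s)\pi(a \mid s')$ and propagates a state distribution as a row vector. The two-state, two-action MDP, the policy values, and the function names are all hypothetical and chosen only to make the arithmetic easy to follow.

```python
def induced_transition(policy, transitions):
    """P_pi[s][s2] = sum_a policy[s][a] * transitions[a][s][s2]."""
    n_states = len(policy)
    n_actions = len(transitions)
    return [
        [sum(policy[s][a] * transitions[a][s][s2] for a in range(n_actions))
         for s2 in range(n_states)]
        for s in range(n_states)
    ]


def step_distribution(v, P):
    """One step of the row-vector recursion v_t = v_{t-1} P."""
    n = len(v)
    return [sum(v[s] * P[s][s2] for s in range(n)) for s2 in range(n)]


# Hypothetical two-state, two-action MDP: transitions[a][s][s2].
transitions = [
    [[0.9, 0.1], [0.2, 0.8]],  # action 0
    [[0.5, 0.5], [0.6, 0.4]],  # action 1
]
policy = [[0.7, 0.3], [0.4, 0.6]]  # policy[s][a] = pi(a | s)

P_pi = induced_transition(policy, transitions)  # rows: [0.78, 0.22], [0.44, 0.56]

d = [2.0 / 3.0, 1.0 / 3.0]  # stationary distribution of P_pi in this example
v = [1.0, 0.0]              # hypothetical starting distribution
gaps = []
for _ in range(5):
    gaps.append(sum(abs(a - b) for a, b in zip(d, v)))  # l1 gap ||d - v_t||_1
    v = step_distribution(v, P_pi)
print(gaps)
```

With a fixed policy, `d` is unchanged by `step_distribution` and the recorded gaps never increase, which is the behavior that the identity (24) exposes one step at a time.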

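We can iteratively apply the procedure (23) $r$ times as follows:
\[
\begin{aligned}
\mathbb{E}\big\|d_{\pi_t^{(k)}} - v_t^{(k)}\big\|_1 \le{}& \frac{|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi + C\sqrt{m}|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi}{2\sqrt{T}} + \mathbb{E}\Big\|\big(d_{\pi_{t-1}^{(k)}} - d_{\pi_{t-2}^{(k)}}\big)P^{(k)}_{\pi_{t-1}^{(k)}}\Big\|_1 + \mathbb{E}\Big\|\big(d_{\pi_{t-2}^{(k)}} - v_{t-1}^{(k)}\big)P^{(k)}_{\pi_{t-1}^{(k)}}\Big\|_1 \\
\le{}& 2\cdot\frac{|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi + C\sqrt{m}|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi}{2\sqrt{T}} + \mathbb{E}\Big\|\big(d_{\pi_{t-2}^{(k)}} - v_{t-2}^{(k)}\big)P^{(k)}_{\pi_{t-2}^{(k)}}P^{(k)}_{\pi_{t-1}^{(k)}}\Big\|_1 \\
\le{}& \cdots \le r\cdot\frac{|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi + C\sqrt{m}|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi}{2\sqrt{T}} + \mathbb{E}\Big\|\big(d_{\pi_{t-r}^{(k)}} - v_{t-r}^{(k)}\big)P^{(k)}_{\pi_{t-r}^{(k)}}P^{(k)}_{\pi_{t-r+1}^{(k)}}\cdots P^{(k)}_{\pi_{t-1}^{(k)}}\Big\|_1,
\end{aligned}
\]
where the second inequality follows from the nonexpansive property in the $\ell_1$ norm of the stochastic matrix $P^{(k)}_{\pi_{t-1}^{(k)}}$ (for a one-line proof, see (39) in the appendix), namely
\[
\Big\|\big(d_{\pi_{t-1}^{(k)}} - d_{\pi_{t-2}^{(k)}}\big)P^{(k)}_{\pi_{t-1}^{(k)}}\Big\|_1 \le \big\|d_{\pi_{t-1}^{(k)}} - d_{\pi_{t-2}^{(k)}}\big\|_1,
\]
and then using the slow-update condition again. By the uniform mixing lemma, we have
\[
\mathbb{E}\big\|d_{\pi_t^{(k)}} - v_t^{(k)}\big\|_1 \le r\cdot\frac{|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi + C\sqrt{m}|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi}{2\sqrt{T}} + e^{-1/\tau}\,\mathbb{E}\big\|d_{\pi_{t-r}^{(k)}} - v_{t-r}^{(k)}\big\|_1.
\]
Iterating this inequality down to $t = 0$ gives
\[
\begin{aligned}
\mathbb{E}\big\|d_{\pi_t^{(k)}} - v_t^{(k)}\big\|_1 &\le \sum_{j=0}^{\lfloor t/r\rfloor} e^{-j/\tau}\, r\cdot\frac{|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi + C\sqrt{m}|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi}{2\sqrt{T}} + e^{-\lfloor t/r\rfloor/\tau}\,\big\|d_{\pi_0^{(k)}} - v_0^{(k)}\big\|_1 \\
&\le r\cdot\frac{|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi + C\sqrt{m}|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi}{2\sqrt{T}}\int_{x=0}^{\infty} e^{-x/\tau}\,dx + 2e^{-\left(\frac{t}{r} - 1\right)/\tau} \\
&\le \frac{\tau r\big(|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi + C\sqrt{m}|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi\big)}{2\sqrt{T}} + 2e^{-\frac{t}{r\tau} + 1},
\end{aligned}
\]
finishing the proof.

4.2.2. Benchmarking against policies starting from stationary state

Combining the results derived so far, we have the following regret bound regarding any randomized stationary policy $\Pi$ starting from its stationary state distribution $d_\Pi$ such that $(d_\Pi, \Pi)$ is in the constraint set $\mathcal{G}$.

Theorem 4.3. Let $\mathcal{P}$ be the sequence of randomized stationary policies resulting from the proposed algorithm with $V = \sqrt{T}$ and $\alpha = T$. Let $d_0$ be the starting state of the proposed algorithm.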

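For any randomized stationary policy $\Pi$ starting from its stationary state distribution $d_\Pi$ such that $(d_\Pi, \Pi) \in \mathcal{G}$, we have
\[
F_T(d_0, \mathcal{P}) - F_T(d_\Pi, \Pi) \le O\Big(m^{3/2}K^2\sum_{k=1}^K |\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\cdot\sqrt{T}\Big),
\]
\[
G_{i,T}(d_0, \mathcal{P}) \le O\Big(m^{3/2}K^2\sum_{k=1}^K |\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\cdot\sqrt{T}\Big), \quad i = 1, 2, \dots, m.
\]

Proof. First of all, by the earlier lemma on stationary state-action probabilities, for any randomized stationary policy $\Pi$, there exist some stationary state-action probability vectors $\{\theta_*^{(k)}\}_{k=1}^K$ such that $\theta_*^{(k)} \in \Theta^{(k)}$,
\[
F_T(d_\Pi, \Pi) = \sum_{t=0}^{T-1}\sum_{k=1}^K \mathbb{E}\big\langle f_t^{(k)}, \theta_*^{(k)}\big\rangle, \quad\text{and}\quad G_{i,T}(d_\Pi, \Pi) = \sum_{t=0}^{T-1}\sum_{k=1}^K \mathbb{E}\big\langle g_{i,t}^{(k)}, \theta_*^{(k)}\big\rangle.
\]
As a consequence, $(d_\Pi, \Pi) \in \mathcal{G}$ implies
\[
G_{i,T}(d_\Pi, \Pi) = \sum_{t=0}^{T-1}\sum_{k=1}^K \mathbb{E}\big\langle g_{i,t}^{(k)}, \theta_*^{(k)}\big\rangle \le 0, \quad \forall i \in \{1, 2, \dots, m\},
\]
and it follows that $\{\theta_*^{(k)}\}_{k=1}^K$ is in the imaginary constraint set $\overline{\mathcal{G}}$. Thus, we are in good shape to apply Theorem 4.1 from the imaginary system.

We then split $F_T(d_0, \mathcal{P}) - F_T(d_\Pi, \Pi)$ into two terms:
\[
F_T(d_0, \mathcal{P}) - F_T(d_\Pi, \Pi) = \underbrace{\sum_{t=0}^{T-1}\sum_{k=1}^K \Big(\mathbb{E}\big(f_t^{(k)}(a_t^{(k)}, s_t^{(k)}) \,\big|\, d_0, \mathcal{P}\big) - \mathbb{E}\big\langle f_t^{(k)}, \theta_t^{(k)}\big\rangle\Big)}_{\text{(I)}} + \underbrace{\sum_{t=0}^{T-1}\sum_{k=1}^K \Big(\mathbb{E}\big\langle f_t^{(k)}, \theta_t^{(k)}\big\rangle - \mathbb{E}\big\langle f_t^{(k)}, \theta_*^{(k)}\big\rangle\Big)}_{\text{(II)}}.
\]
By Theorem 4.1, we get
\[
\text{(II)} \le \Big(2K + \frac{\Psi^2}{2}\sum_{k=1}^K |\mathcal{S}^{(k)}||\mathcal{A}^{(k)}| + \frac{5}{2}mK^2\Psi^2\Big)\sqrt{T}. \tag{25}
\]
We then bound (I). Consider each time slot $t \in \{0, 1, \dots, T-1\}$. We have
\[
\mathbb{E}\big\langle f_t^{(k)}, \theta_t^{(k)}\big\rangle = \mathbb{E}\sum_{s \in \mathcal{S}^{(k)}}\sum_{a \in \mathcal{A}^{(k)}} d_{\pi_t^{(k)}}(s)\,\pi_t^{(k)}(a \mid s)\,f_t^{(k)}(a, s),
\]
\[
\mathbb{E}\big(f_t^{(k)}(a_t^{(k)}, s_t^{(k)}) \,\big|\, d_0, \mathcal{P}\big) = \mathbb{E}\sum_{s \in \mathcal{S}^{(k)}}\sum_{a \in \mathcal{A}^{(k)}} v_t^{(k)}(s)\,\pi_t^{(k)}(a \mid s)\,f_t^{(k)}(a, s),
\]
where the first equality follows from the definition of $\theta_t^{(k)}$ and the second equality follows from the following: given a specific function path $\mathcal{F}(T)$, the policy $\pi_t^{(k)}$ and the true state distribution $v_t^{(k)}$ are fixed. Thus, we have
\[
\mathbb{E}\big(f_t^{(k)}(a_t^{(k)}, s_t^{(k)}) \,\big|\, d_0, \mathcal{P}, \mathcal{F}(T)\big) = \sum_{s \in \mathcal{S}^{(k)}}\sum_{a \in \mathcal{A}^{(k)}} v_t^{(k)}(s)\,\pi_t^{(k)}(a \mid s)\,f_t^{(k)}(a, s).
\]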

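Taking the full expectation regarding the function path gives the result. Thus,
\[
\begin{aligned}
\Big|\mathbb{E}\big(f_t^{(k)}(a_t^{(k)}, s_t^{(k)}) \,\big|\, d_0, \mathcal{P}\big) - \mathbb{E}\big\langle f_t^{(k)}, \theta_t^{(k)}\big\rangle\Big| &\le \mathbb{E}\sum_{s \in \mathcal{S}^{(k)}}\sum_{a \in \mathcal{A}^{(k)}} \big|v_t^{(k)}(s) - d_{\pi_t^{(k)}}(s)\big|\,\pi_t^{(k)}(a \mid s)\,\Psi \\
&\le \Psi\,\mathbb{E}\big\|v_t^{(k)} - d_{\pi_t^{(k)}}\big\|_1 \\
&\le \Psi\Big(\frac{\tau r(1 + C\sqrt{m})|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi}{2\sqrt{T}} + 2e^{-\frac{t}{\tau r} + 1}\Big),
\end{aligned}
\]
where the last inequality follows from Lemma 4.8. Thus, it follows that
\[
\begin{aligned}
\text{(I)} &\le \sum_{t=0}^{T-1}\sum_{k=1}^K \Psi\Big(\frac{\tau r(1 + C\sqrt{m})|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi}{2\sqrt{T}} + 2e^{-\frac{t}{\tau r} + 1}\Big) \\
&\le \frac{\tau r\Psi^2(1 + C\sqrt{m})}{2}\sum_{k=1}^K |\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\cdot\sqrt{T} + 2\Psi K\int_{x=0}^{\infty} e^{-\frac{x}{\tau r} + 1}\,dx \\
&= \frac{\tau r\Psi^2(1 + C\sqrt{m})}{2}\sum_{k=1}^K |\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\cdot\sqrt{T} + 2e\Psi K\tau r.
\end{aligned} \tag{26}
\]
Overall, combining (25) and (26) and substituting the constant $C = C(m, K, \Psi, \eta)$ defined in (20) gives the objective regret bound.

For the constraint violation, we have
\[
G_{i,T}(d_0, \mathcal{P}) = \underbrace{\sum_{t=0}^{T-1}\sum_{k=1}^K \Big(\mathbb{E}\big(g_{i,t}^{(k)}(a_t^{(k)}, s_t^{(k)}) \,\big|\, d_0, \mathcal{P}\big) - \mathbb{E}\big\langle g_{i,t}^{(k)}, \theta_t^{(k)}\big\rangle\Big)}_{\text{(IV)}} + \underbrace{\sum_{t=0}^{T-1}\sum_{k=1}^K \mathbb{E}\big\langle g_{i,t}^{(k)}, \theta_t^{(k)}\big\rangle}_{\text{(V)}}.
\]
The term (V) can be readily bounded using Theorem 4.2 as
\[
\sum_{t=0}^{T-1}\mathbb{E}\Big(\sum_{k=1}^K \big\langle g_{i,t}^{(k)}, \theta_t^{(k)}\big\rangle\Big) \le \Big(C + \frac{1}{2}\sum_{k=1}^K \big(\sqrt{m}|\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi^2 C + |\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\Psi^2\big)\Big)\sqrt{T}.
\]
For the term (IV), we have
\[
\mathbb{E}\big\langle g_{i,t}^{(k)}, \theta_t^{(k)}\big\rangle = \mathbb{E}\sum_{s \in \mathcal{S}^{(k)}}\sum_{a \in \mathcal{A}^{(k)}} d_{\pi_t^{(k)}}(s)\,\pi_t^{(k)}(a \mid s)\,g_{i,t}^{(k)}(a, s),
\]
\[
\mathbb{E}\big(g_{i,t}^{(k)}(a_t^{(k)}, s_t^{(k)}) \,\big|\, d_0, \mathcal{P}\big) = \mathbb{E}\sum_{s \in \mathcal{S}^{(k)}}\sum_{a \in \mathcal{A}^{(k)}} v_t^{(k)}(s)\,\pi_t^{(k)}(a \mid s)\,g_{i,t}^{(k)}(a, s),
\]
where the first equality follows from the definition of $\theta_t^{(k)}$ and the second equality follows from the following: given a specific function path $\mathcal{F}(T)$, the policy $\pi_t^{(k)}$ and the true state distribution $v_t^{(k)}$ are fixed. Thus, we have
\[
\mathbb{E}\big(g_{i,t}^{(k)}(a_t^{(k)}, s_t^{(k)}) \,\big|\, d_0, \mathcal{P}, \mathcal{F}(T)\big) = \sum_{s \in \mathcal{S}^{(k)}}\sum_{a \in \mathcal{A}^{(k)}} v_t^{(k)}(s)\,\pi_t^{(k)}(a \mid s)\,g_{i,t}^{(k)}(a, s).
\]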

Taking the full expectation regarding the function path gives the result. Then, repeating the same proof as that of (26) gives
\[
\text{(IV)} \le \frac{\tau r\Psi^2(1 + C\sqrt{m})}{2}\sum_{k=1}^K |\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\cdot\sqrt{T} + 2e\Psi K\tau r.
\]
This finishes the proof of the constraint violation bound.

5. A more general regret bound against policies with arbitrary starting state

Recall that Theorem 4.3 compares the proposed algorithm with any randomized stationary policy $\Pi$ starting from its stationary state distribution $d_\Pi$, so that $(d_\Pi, \Pi) \in \mathcal{G}$. In this section, we generalize Theorem 4.3 and obtain a bound on the regret against all $(d_0, \Pi) \in \mathcal{G}$, where $d_0$ is an arbitrary starting state distribution, not necessarily the stationary state distribution.

The main technical difficulty in such a generalization is as follows. For any randomized stationary policy $\Pi$ such that $(d_0, \Pi) \in \mathcal{G}$, let $\{\theta_*^{(k)}\}_{k=1}^K$ be the stationary state-action probabilities such that $\theta_*^{(k)} \in \Theta^{(k)}$ and $G_{i,T}(d_\Pi, \Pi) = \sum_{t=0}^{T-1}\sum_{k=1}^K \mathbb{E}\langle g_{i,t}^{(k)}, \theta_*^{(k)}\rangle$. For some finite horizon $T$, there might exist some low-cost starting state distribution $d_0$ such that $G_{i,T}(d_0, \Pi) < G_{i,T}(d_\Pi, \Pi)$ for some $i \in \{1, 2, \dots, m\}$. As a consequence, one could have
\[
G_{i,T}(d_0, \Pi) \le 0, \quad\text{and}\quad \sum_{t=0}^{T-1}\sum_{k=1}^K \mathbb{E}\big\langle g_{i,t}^{(k)}, \theta_*^{(k)}\big\rangle > 0.
\]
This implies that although $(d_0, \Pi)$ is feasible for our true system, its stationary state-action probabilities $\{\theta_*^{(k)}\}_{k=1}^K$ can be infeasible with respect to the imaginary constraint set, and all our analysis so far fails to cover such randomized stationary policies.

To resolve this issue, we have to enlarge the imaginary constraint set so as to cover all state-action probabilities $\{\theta_*^{(k)}\}_{k=1}^K$ arising from any randomized stationary policy $\Pi$ such that $(d_0, \Pi) \in \mathcal{G}$. But a perturbation of the constraint set would result in a perturbation of the objective in the imaginary system as well. Our main goal in this section is to bound such a perturbation and show that the perturbation bound leads to the final $O(\sqrt{T})$ regret bound.

A relaxed constraint set

We begin with a supporting lemma on the uniform mixing time bound over all joint randomized stationary policies. The proof is given in the appendix.

Lemma 5.1. Consider any joint randomized stationary policy $\Pi$ with arbitrary starting state distribution $d_0$ on the product state space $\mathcal{S}^{(1)} \times \cdots \times \mathcal{S}^{(K)}$. Let $P_\Pi$ be the corresponding transition matrix on the product state space. Then, the following holds:
\[
\big\|(d_0 - d_\Pi)P_\Pi^t\big\|_1 \le 2e^{(r_1 - t)/r_1}, \quad \forall t \in \{0, 1, 2, \dots\}, \tag{27}
\]
where $r_1$ is a fixed positive constant independent of $\Pi$.

The following lemma shows that a relaxation of $O(1/T)$ on the imaginary constraint set is enough to cover all the $\{\theta_*^{(k)}\}_{k=1}^K$ discussed at the beginning of this section.
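The practical content of a geometric mixing bound such as (27) is that the cumulative deviation $\sum_{t=0}^{T-1}\|(d_0 - d_\Pi)P_\Pi^t\|_1$ stays bounded as $T$ grows. The sketch below illustrates this on a hypothetical product chain. As a simplifying assumption, the joint transition matrix is the Kronecker product of two independent two-state chains (a joint policy in general need not factor this way), and the stationary distribution is approximated by power iteration; all names and numbers are illustrative.

```python
def kron(P1, P2):
    """Transition matrix of two independent chains on the product state space."""
    return [[a * b for a in row1 for b in row2] for row1 in P1 for row2 in P2]


def step(v, P):
    """Row-vector update v P."""
    n = len(v)
    return [sum(v[s] * P[s][s2] for s in range(n)) for s2 in range(n)]


def cumulative_deviation(d0, d_stat, P, horizon):
    """sum_{t=0}^{horizon-1} ||d0 P^t - d_stat||_1, using d_stat P = d_stat."""
    v, total = list(d0), 0.0
    for _ in range(horizon):
        total += sum(abs(a - b) for a, b in zip(v, d_stat))
        v = step(v, P)
    return total


# Hypothetical per-MDP transition matrices under some fixed policies.
P1 = [[0.78, 0.22], [0.44, 0.56]]
P2 = [[0.5, 0.5], [0.1, 0.9]]
P = kron(P1, P2)

# Approximate the stationary distribution by power iteration.
d_stat = [0.25] * 4
for _ in range(200):
    d_stat = step(d_stat, P)

d0 = [1.0, 0.0, 0.0, 0.0]  # arbitrary (non-stationary) starting distribution
totals = {T: cumulative_deviation(d0, d_stat, P, T) for T in (10, 100, 1000)}
print(totals)
```

The totals for horizons 100 and 1000 are essentially identical, so the gap between starting from `d0` and starting from the stationary distribution contributes only a horizon-independent constant.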

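Lemma 5.2. For any $T \in \{1, 2, \dots\}$ and any randomized stationary policy $\Pi$ with arbitrary starting state distribution $d_0$ on $\mathcal{S}^{(1)} \times \cdots \times \mathcal{S}^{(K)}$ and stationary state-action probabilities $\{\theta_*^{(k)}\}_{k=1}^K$,
\[
\Big|\sum_{t=0}^{T-1}\mathbb{E}\Big(\sum_{k=1}^K f_t^{(k)}(a_t^{(k)}, s_t^{(k)}) \,\Big|\, d_0, \Pi\Big) - \sum_{t=0}^{T-1}\mathbb{E}\sum_{k=1}^K \big\langle f_t^{(k)}, \theta_*^{(k)}\big\rangle\Big| \le C_1K\Psi, \tag{28}
\]
\[
\Big|\sum_{t=0}^{T-1}\mathbb{E}\Big(\sum_{k=1}^K g_{i,t}^{(k)}(a_t^{(k)}, s_t^{(k)}) \,\Big|\, d_0, \Pi\Big) - \sum_{t=0}^{T-1}\mathbb{E}\sum_{k=1}^K \big\langle g_{i,t}^{(k)}, \theta_*^{(k)}\big\rangle\Big| \le C_1K\Psi, \tag{29}
\]
where $C_1$ is an absolute constant. In particular, if $(d_0, \Pi) \in \mathcal{G}$, then $\{\theta_*^{(k)}\}_{k=1}^K$ is contained in the following relaxed constraint set:
\[
\overline{\mathcal{G}}_1 := \Big\{\theta^{(k)} \in \Theta^{(k)},\ k = 1, 2, \dots, K :\ \sum_{t=0}^{T-1}\sum_{k=1}^K \mathbb{E}\big\langle g_{i,t}^{(k)}, \theta^{(k)}\big\rangle \le C_1K\Psi,\ i = 1, 2, \dots, m\Big\},
\]
for some universal positive constant $C_1 > 0$.

Proof. Let $v_t$ be the joint state distribution at time $t$ under policy $\Pi$. Using the fact that $\Pi$ is a fixed policy independent of $g_{i,t}^{(k)}$, and the assumption that the probability transition is also independent of the function path given any state and action, the function $g_{i,t}^{(k)}$ and the state-action pair $(a_t^{(k)}, s_t^{(k)})$ are mutually independent. Thus, for any $t \in \{0, 1, 2, \dots, T-1\}$,
\[
\mathbb{E}\Big(\sum_{k=1}^K g_{i,t}^{(k)}(a_t^{(k)}, s_t^{(k)}) \,\Big|\, d_0, \Pi\Big) = \sum_{s \in \mathcal{S}^{(1)} \times \cdots \times \mathcal{S}^{(K)}}\ \sum_{a \in \mathcal{A}^{(1)} \times \cdots \times \mathcal{A}^{(K)}} v_t(s)\,\Pi(a \mid s)\,\mathbb{E}\Big(\sum_{k=1}^K g_{i,t}^{(k)}(a^{(k)}, s^{(k)})\Big),
\]
where $s = [s^{(1)}, \dots, s^{(K)}]$ and $a = [a^{(1)}, \dots, a^{(K)}]$, and the latter expectation is taken with respect to $g_{i,t}^{(k)}$ (i.e. the random variable $w_t$). On the other hand, by the earlier lemma on stationary state-action probabilities, we know that for any randomized stationary policy $\Pi$, the corresponding stationary state-action probability can be expressed as $\{\theta_*^{(k)}\}_{k=1}^K$ with $\theta_*^{(k)} \in \Theta^{(k)}$. Thus,
\[
\mathbb{E}\sum_{k=1}^K \big\langle g_{i,t}^{(k)}, \theta_*^{(k)}\big\rangle = \sum_{s \in \mathcal{S}^{(1)} \times \cdots \times \mathcal{S}^{(K)}}\ \sum_{a \in \mathcal{A}^{(1)} \times \cdots \times \mathcal{A}^{(K)}} d_\Pi(s)\,\Pi(a \mid s)\,\mathbb{E}\Big(\sum_{k=1}^K g_{i,t}^{(k)}(a^{(k)}, s^{(k)})\Big).
\]
Hence, we can control the difference:
\[
\begin{aligned}
\Big|\sum_{t=0}^{T-1}\mathbb{E}\Big(\sum_{k=1}^K g_{i,t}^{(k)}(a_t^{(k)}, s_t^{(k)}) \,\Big|\, d_0, \Pi\Big) - \sum_{t=0}^{T-1}\mathbb{E}\sum_{k=1}^K \big\langle g_{i,t}^{(k)}, \theta_*^{(k)}\big\rangle\Big| &\le \sum_{t=0}^{T-1}\sum_{s}\sum_{a} \big|v_t(s) - d_\Pi(s)\big|\,\Pi(a \mid s)\,K\Psi \\
&\le K\Psi\sum_{t=0}^{T-1}\|v_t - d_\Pi\|_1 \\
&\le K\Psi\sum_{t=0}^{T-1} 2e^{(r_1 - t)/r_1} \\
&\le 2eK\Psi\int_0^{\infty} e^{-t/r_1}\,dt = 2er_1K\Psi,
\end{aligned}
\]
where the third inequality follows from Lemma 5.1. Taking $C_1 = 2er_1$ finishes the proof of (29), and (28) can be proved in a similar way.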

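In particular, for any randomized stationary policy $\Pi$ that satisfies the constraint $(d_0, \Pi) \in \mathcal{G}$, we have
\[
\begin{aligned}
\sum_{t=0}^{T-1}\sum_{k=1}^K \mathbb{E}\big\langle g_{i,t}^{(k)}, \theta_*^{(k)}\big\rangle &\le \Big|\sum_{t=0}^{T-1}\sum_{k=1}^K \mathbb{E}\big\langle g_{i,t}^{(k)}, \theta_*^{(k)}\big\rangle - \sum_{t=0}^{T-1}\mathbb{E}\Big(\sum_{k=1}^K g_{i,t}^{(k)}(a_t^{(k)}, s_t^{(k)}) \,\Big|\, d_0, \Pi\Big)\Big| + \sum_{t=0}^{T-1}\mathbb{E}\Big(\sum_{k=1}^K g_{i,t}^{(k)}(a_t^{(k)}, s_t^{(k)}) \,\Big|\, d_0, \Pi\Big) \\
&\le 2er_1K\Psi + 0 = 2er_1K\Psi,
\end{aligned}
\]
finishing the proof.

Best stationary performance over the relaxed constraint set

Recall that the best stationary performance in hindsight over all randomized stationary policies in the constraint set $\overline{\mathcal{G}}$ can be obtained as the minimum achieved by the following linear program:
\[
\min\ \sum_{t=0}^{T-1}\sum_{k=1}^K \mathbb{E}\big\langle f_t^{(k)}, \theta^{(k)}\big\rangle \tag{30}
\]
\[
\text{s.t.}\ \sum_{t=0}^{T-1}\sum_{k=1}^K \mathbb{E}\big\langle g_{i,t}^{(k)}, \theta^{(k)}\big\rangle \le 0, \quad i = 1, 2, \dots, m. \tag{31}
\]
On the other hand, if we consider all the randomized stationary policies contained in the original constraint set $\mathcal{G}$, then, by Lemma 5.2, the relaxed constraint set $\overline{\mathcal{G}}_1$ contains all such policies, and the best stationary performance over this relaxed set comes from the minimum achieved by the following perturbed linear program:
\[
\min\ \sum_{t=0}^{T-1}\sum_{k=1}^K \mathbb{E}\big\langle f_t^{(k)}, \theta^{(k)}\big\rangle \tag{32}
\]
\[
\text{s.t.}\ \sum_{t=0}^{T-1}\sum_{k=1}^K \mathbb{E}\big\langle g_{i,t}^{(k)}, \theta^{(k)}\big\rangle \le C_1K\Psi, \quad i = 1, 2, \dots, m. \tag{33}
\]
We aim to show that the minimum achieved by (32)-(33) is not far away from that of (30)-(31). In general, such a conclusion is not true due to the unboundedness of Lagrange multipliers in constrained optimization. However, since Slater's condition holds in our case, the perturbation can be bounded via the following well-known Farkas lemma [3]:

Lemma 5.3 (Farkas' Lemma). Consider a convex program with objective $f(x)$ and constraint functions $g_i(x)$, $i = 1, 2, \dots, m$:
\[
\min\ f(x), \tag{34}
\]
\[
\text{s.t.}\ g_i(x) \le b_i, \quad i = 1, 2, \dots, m, \tag{35}
\]
\[
x \in \mathcal{X}, \tag{36}
\]
for some convex set $\mathcal{X} \subseteq \mathbb{R}^n$. Let $x^*$ be one of the solutions to the above convex program. Suppose there exists $\tilde{x} \in \mathcal{X}$ such that $g_i(\tilde{x}) < b_i$, $\forall i \in \{1, 2, \dots, m\}$. Then, there exists a separation hyperplane parametrized by $(1, \mu_1, \mu_2, \dots, \mu_m)$ such that $\mu_i \ge 0$ and
\[
f(x) + \sum_{i=1}^m \mu_i g_i(x) \ge f(x^*) + \sum_{i=1}^m \mu_i b_i, \quad \forall x \in \mathcal{X}.
\]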

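The parameter $\mu = (\mu_1, \mu_2, \dots, \mu_m)$ is usually referred to as a Lagrange multiplier. From the geometric perspective, Farkas' Lemma states that if Slater's condition holds, then there exists a non-vertical separation hyperplane supported at $(f(x^*), b_1, \dots, b_m)$ that contains the set $\{(f(x), g_1(x), \dots, g_m(x)),\ x \in \mathcal{X}\}$ on one side. Thus, in order to bound the perturbation of the objective with respect to the perturbation of the constraint level, we need to bound the slope of the supporting hyperplane from above, which boils down to controlling the magnitude of the Lagrange multiplier. This is summarized in the following lemma:

Lemma 5.4 (Lemma 1 of [30]). Consider the convex program (34)-(36), and define the Lagrange dual function $q(\mu) = \inf_{x \in \mathcal{X}}\big\{f(x) + \sum_{i=1}^m \mu_i(g_i(x) - b_i)\big\}$. Suppose there exists $\tilde{x} \in \mathcal{X}$ such that $g_i(\tilde{x}) - b_i \le -\eta$, $\forall i \in \{1, 2, \dots, m\}$, for some positive constant $\eta > 0$. Then, the level set $\mathcal{V}_{\bar{\mu}} = \{\mu_1, \mu_2, \dots, \mu_m \ge 0,\ q(\mu) \ge q(\bar{\mu})\}$ is bounded for any nonnegative $\bar{\mu}$. Furthermore, we have
\[
\max_{\mu \in \mathcal{V}_{\bar{\mu}}}\|\mu\|_2 \le \frac{1}{\min_{1 \le i \le m}\{-g_i(\tilde{x}) + b_i\}}\big(f(\tilde{x}) - q(\bar{\mu})\big).
\]

The technical importance of these two lemmas in the current context is contained in the following corollary.

Corollary 5.1. Let $\{\theta_*^{(k)}\}_{k=1}^K$ and $\{\bar{\theta}_*^{(k)}\}_{k=1}^K$ be solutions to (30)-(31) and (32)-(33), respectively. Then, the following holds:
\[
\sum_{t=0}^{T-1}\sum_{k=1}^K \mathbb{E}\big\langle f_t^{(k)}, \bar{\theta}_*^{(k)}\big\rangle \ge \sum_{t=0}^{T-1}\sum_{k=1}^K \mathbb{E}\big\langle f_t^{(k)}, \theta_*^{(k)}\big\rangle - \frac{2C_1K^2\sqrt{m}\,\Psi^2}{\eta}.
\]

Proof. Take
\[
f\big(\theta^{(1)}, \dots, \theta^{(K)}\big) = \sum_{t=0}^{T-1}\sum_{k=1}^K \mathbb{E}\big\langle f_t^{(k)}, \theta^{(k)}\big\rangle, \qquad g_i\big(\theta^{(1)}, \dots, \theta^{(K)}\big) = \sum_{t=0}^{T-1}\sum_{k=1}^K \mathbb{E}\big\langle g_{i,t}^{(k)}, \theta^{(k)}\big\rangle,
\]
$\mathcal{X} = \Theta^{(1)} \times \Theta^{(2)} \times \cdots \times \Theta^{(K)}$, and $b_i = 0$ in Farkas' Lemma. We then have the following display:
\[
\sum_{t=0}^{T-1}\sum_{k=1}^K \mathbb{E}\big\langle f_t^{(k)}, \theta^{(k)}\big\rangle + \sum_{i=1}^m \mu_i\sum_{t=0}^{T-1}\sum_{k=1}^K \mathbb{E}\big\langle g_{i,t}^{(k)}, \theta^{(k)}\big\rangle \ge \sum_{t=0}^{T-1}\sum_{k=1}^K \mathbb{E}\big\langle f_t^{(k)}, \theta_*^{(k)}\big\rangle,
\]
for any $(\theta^{(1)}, \dots, \theta^{(K)}) \in \mathcal{X}$ and some $\mu_1, \mu_2, \dots, \mu_m \ge 0$. In particular, substituting $(\bar{\theta}_*^{(1)}, \dots, \bar{\theta}_*^{(K)})$ into the above display gives
\[
\begin{aligned}
\sum_{t=0}^{T-1}\sum_{k=1}^K \mathbb{E}\big\langle f_t^{(k)}, \bar{\theta}_*^{(k)}\big\rangle &\ge \sum_{t=0}^{T-1}\sum_{k=1}^K \mathbb{E}\big\langle f_t^{(k)}, \theta_*^{(k)}\big\rangle - \sum_{i=1}^m \mu_i\sum_{t=0}^{T-1}\sum_{k=1}^K \mathbb{E}\big\langle g_{i,t}^{(k)}, \bar{\theta}_*^{(k)}\big\rangle \\
&\ge \sum_{t=0}^{T-1}\sum_{k=1}^K \mathbb{E}\big\langle f_t^{(k)}, \theta_*^{(k)}\big\rangle - C_1K\Psi\sum_{i=1}^m \mu_i,
\end{aligned} \tag{37}
\]
where the final inequality follows from the fact that $(\bar{\theta}_*^{(1)}, \dots, \bar{\theta}_*^{(K)})$ satisfies the relaxed constraint $\sum_{t=0}^{T-1}\sum_{k=1}^K \mathbb{E}\langle g_{i,t}^{(k)}, \bar{\theta}_*^{(k)}\rangle \le C_1K\Psi$ and $\mu_i \ge 0$, $\forall i \in \{1, 2, \dots, m\}$. Now we need to bound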

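the magnitude of the Lagrange multiplier $(\mu_1, \dots, \mu_m)$. Note that in our scenario,
\[
\big|f\big(\theta^{(1)}, \dots, \theta^{(K)}\big)\big| = \Big|\sum_{t=0}^{T-1}\sum_{k=1}^K \mathbb{E}\big\langle f_t^{(k)}, \theta^{(k)}\big\rangle\Big| \le TK\Psi,
\]
and the Lagrange multiplier $\mu$ is the solution to the maximization problem
\[
\max_{\mu_i \ge 0,\ i \in \{1, 2, \dots, m\}} q(\mu),
\]
where $q(\mu)$ is the dual function defined in Lemma 5.4. Thus, it must be in any super level set $\mathcal{V}_{\bar{\mu}} = \{\mu_1, \mu_2, \dots, \mu_m \ge 0,\ q(\mu) \ge q(\bar{\mu})\}$. In particular, taking $\bar{\mu} = 0$ in Lemma 5.4 and using Slater's condition (8), we have that there exists $(\tilde{\theta}^{(1)}, \dots, \tilde{\theta}^{(K)})$ such that
\[
\sum_{i=1}^m \mu_i \le \sqrt{m}\,\|\mu\|_2 \le \frac{\sqrt{m}}{\eta T}\Big(f\big(\tilde{\theta}^{(1)}, \dots, \tilde{\theta}^{(K)}\big) - \inf_{(\theta^{(1)}, \dots, \theta^{(K)}) \in \mathcal{X}} f\big(\theta^{(1)}, \dots, \theta^{(K)}\big)\Big) \le \frac{2\sqrt{m}\,\Psi K}{\eta},
\]
where the final inequality follows from the deterministic bound of $|f(\theta^{(1)}, \dots, \theta^{(K)})|$ by $T\Psi K$. Substituting this bound into (37) gives the desired result.

As a simple consequence of the above corollary, we have our final bound on the regret and constraint violation regarding any $(d_0, \Pi) \in \mathcal{G}$.

Theorem 5.1. Let $\mathcal{P}$ be the sequence of randomized stationary policies resulting from the proposed algorithm with $V = \sqrt{T}$ and $\alpha = T$. Let $d_0$ be the starting state of the proposed algorithm. For any randomized stationary policy $\Pi$ starting from the state $d_0$ such that $(d_0, \Pi) \in \mathcal{G}$, we have
\[
F_T(d_0, \mathcal{P}) - F_T(d_0, \Pi) \le O\Big(m^{3/2}K^2\sum_{k=1}^K |\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\cdot\sqrt{T}\Big),
\]
\[
G_{i,T}(d_0, \mathcal{P}) \le O\Big(m^{3/2}K^2\sum_{k=1}^K |\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\cdot\sqrt{T}\Big), \quad i = 1, 2, \dots, m.
\]

Proof. Let $\Pi_*$ be the randomized stationary policy corresponding to the solution $\{\theta_*^{(k)}\}_{k=1}^K$ to (30)-(31) and let $\Pi$ be any randomized stationary policy such that $(d_0, \Pi) \in \mathcal{G}$. Since $G_{i,T}(d_{\Pi_*}, \Pi_*) = \sum_{t=0}^{T-1}\sum_{k=1}^K \mathbb{E}\langle g_{i,t}^{(k)}, \theta_*^{(k)}\rangle \le 0$, it follows that $(d_{\Pi_*}, \Pi_*) \in \mathcal{G}$. By Theorem 4.3, we know that
\[
F_T(d_0, \mathcal{P}) - F_T(d_{\Pi_*}, \Pi_*) \le O\Big(m^{3/2}K^2\sum_{k=1}^K |\mathcal{A}^{(k)}||\mathcal{S}^{(k)}|\cdot\sqrt{T}\Big),
\]
and $G_{i,T}(d_0, \mathcal{P})$ satisfies the bound in the statement. It is then enough to bound $F_T(d_{\Pi_*}, \Pi_*) - F_T(d_0, \Pi)$. We split it into two terms:
\[
F_T(d_{\Pi_*}, \Pi_*) - F_T(d_0, \Pi) = \underbrace{F_T(d_{\Pi_*}, \Pi_*) - F_T(d_\Pi, \Pi)}_{\text{(I)}} + \underbrace{F_T(d_\Pi, \Pi) - F_T(d_0, \Pi)}_{\text{(II)}}.
\]
By (28) in Lemma 5.2, the term (II) is bounded by $C_1K\Psi$. It remains to bound the first term. Since $(d_0, \Pi) \in \mathcal{G}$, by Lemma 5.2, the corresponding state-action probabilities $\{\theta^{(k)}\}_{k=1}^K$ of $\Pi$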
