IEOR 8100-001: Learning and Optimization for Sequential Decision Making 03/07/16

Lecture 14: Bandits with Budget Constraints

Instructor: Shipra Agrawal                    Scribed by: Zhipeng Liu

1 Problem definition

In the regular Multi-armed Bandit problem, we pull an arm $I_t$ at each round $t$ and collect a reward which depends on the arm $I_t$. Now suppose that in each round, after pulling the arm, a cost is also incurred. When the total cost up to time $t$ surpasses a given budget, the algorithm stops. This setting is called Bandits with Budget Constraints. The formal description is as follows. At time $t$, pull an arm $I_t = i$ and observe reward $r_t \in [0, 1]$ and cost vector $c_t \in [0, 1]^d$. Given $I_t = i$, $(r_t, c_t) \sim D_i$, where $D_i$ is a joint distribution depending on arm $i$; denote $E[r_t \mid I_t = i] = \mu_i$ and $E[c_{tj} \mid I_t = i] = C_{ij}$. Stop when any budget constraint is violated. The goal is to
\[
\text{maximize } \sum_t r_t \quad \text{subject to } \sum_t c_{tj} \le B_j, \; j = 1, \ldots, d,
\]
where $B_j \ge 0$.

Remark. The above setting is also called Bandits with Knapsacks (BwK). Usually we replace the constraints with a simplified version by letting $B = \min_j B_j$ and assuming $B_j = B$ for all $j = 1, \ldots, d$.

Example 1 (Dynamic pricing with limited supply). Suppose that at price $q$ the product has probability $S(q)$ of being sold, in which case we observe revenue $q$ and the inventory decreases by 1. So the distribution of reward and cost for the bandit problem is: pulling arm $q$,
\[
(r_t, c_t) = \begin{cases} (q, 1) & \text{w.p. } S(q) \\ (0, 0) & \text{w.p. } 1 - S(q). \end{cases}
\]

2 Optimal static policy

A policy is a mapping from histories to actions. It can be pure static, in which case we keep pulling the single arm with the highest expected reward. It can be mixed, in which case we pull arms randomly according to some fixed distribution. Or it can be dynamic, where in each round we make a decision based on the history and the remaining resources. In this section we will show that there is an optimal static (mixed) policy which has expected reward at least that of the optimal dynamic policy, and it satisfies the constraints in expectation.
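The arm model of Example 1 above can be sketched in code. This is a minimal simulation, not part of the lecture: the linear demand curve $S(q) = 1 - q$, the price $q = 0.5$, and the budget $B = 10$ are all made-up illustrative choices, with a single inventory resource ($d = 1$).

```python
import random

def pull_price_arm(q, sale_prob, rng):
    """Pull arm 'post price q': with probability S(q) the item sells
    (reward q, cost 1); otherwise reward 0 and cost 0, as in Example 1."""
    if rng.random() < sale_prob(q):
        return q, 1      # revenue q, inventory decreases by 1
    return 0.0, 0

def run_until_budget(q, sale_prob, B, T, rng):
    """Post price q until the inventory budget B is exhausted or the
    horizon T is reached; return (total revenue, units sold)."""
    revenue, sold = 0.0, 0
    for _ in range(T):
        r, c = pull_price_arm(q, sale_prob, rng)
        revenue += r
        sold += c
        if sold >= B:    # budget constraint reached -> stop
            break
    return revenue, sold

# Hypothetical linear demand curve S(q) = 1 - q; price 0.5, budget B = 10.
S = lambda q: max(0.0, 1.0 - q)
rev, sold = run_until_budget(q=0.5, sale_prob=S, B=10, T=1000,
                             rng=random.Random(0))
# Every sale earns q and consumes one unit, so rev == q * sold here.
```

Each run stops at the random time $\tau$ when the inventory runs out, which is exactly the stopping rule in the problem definition.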
Suppose in each round we pull arm $i$ with probability $p_i$, and the constraints are all satisfied in expectation. Then the optimal total reward OPT is
\[
\text{OPT} = \max_p \; T \sum_i p_i \mu_i \quad \text{subject to } T \sum_i p_i C_{ij} \le B_j, \; j = 1, \ldots, d, \qquad \sum_i p_i \le 1, \; p \ge 0.
\]
Notice that the constraint $\sum_i p_i \le 1$ implies that we allow not pulling any arm during a round.

Next we prove that OPT is better than what an optimal dynamic policy could achieve. Define $X_i$ as the total number of times an optimal dynamic policy picks arm $i$, and let $p_i = E[X_i]/T$. Any feasible policy must satisfy $\sum_{s=1}^{t} c_{sj} \le B_j$ for all $t = 1, \ldots, T$. Taking expectations on both sides at $t = T$, we have
\[
E\Big[\sum_{i=1}^{N} \sum_{t: I_t = i} c_{tj}\Big] = \sum_{i} E[X_i] \, C_{ij} = T \sum_i p_i C_{ij} \le B_j.
\]
Therefore $\{p_i\}_{i=1}^{N}$ is feasible for OPT, and the total expected reward of this optimal dynamic policy is
\[
E\Big[\sum_i X_i \mu_i\Big] = \sum_i E[X_i] \, \mu_i = T \sum_i p_i \mu_i \le \text{OPT}.
\]

3 Bandit algorithm: reducing to an unconstrained $f$ problem

Suppose the algorithm stops at time $\tau$, when a budget constraint is violated. Define the regret of such a policy as $R(T) = \text{OPT} - E[\sum_{t=1}^{\tau} r_t]$. Since the total reward of the optimal dynamic policy is bounded by OPT, the gap between the optimal dynamic policy and the algorithm is bounded by $R(T)$. Now we reduce the constrained problem to an unconstrained problem with a nonlinear objective function by applying a Lagrangian multiplier to the constrained problem. The new unconstrained problem will maximize $f$,
which is a concave and Lipschitz continuous function:
\[
f = \sum_{t=1}^{T} r_t - z \max\Big\{ \max_{j=1,\ldots,d} \Big( \sum_{t=1}^{T} c_{tj} - B \Big), \; 0 \Big\}.
\]
In the above definition of $f$, the first term is the total reward and the second term is the penalty from the maximum violation of a budget constraint. We define the penalty coefficient $z$ as $z = 2\,\text{OPT}/B$, which we will explain momentarily.

Suppose we relax the budget $B$ to $B + \delta$ and define $\epsilon = \delta/B$. Denote the OPT with budget $B(1+\epsilon)$ as $\text{OPT}_{1+\epsilon}$. Any $\text{OPT}_{1+\epsilon}$-feasible $p$ satisfies
\[
T \sum_i p_i C_{ij} \le (1+\epsilon) B, \; j = 1, \ldots, d,
\]
therefore $\frac{p}{1+\epsilon}$ is OPT-feasible. Thus we have
\[
T \sum_{i=1}^{N} p_i \mu_i = (1+\epsilon) \, T \sum_{i=1}^{N} \frac{p_i}{1+\epsilon} \mu_i \le (1+\epsilon) \, \text{OPT},
\]
so if we relax the budget constraint by $\delta$, the optimal value of OPT will increase by no more than $\epsilon\,\text{OPT} = z\delta/2$. So we can set $z = 2\,\text{OPT}/B$, which guarantees that violating the budget won't give a benefit in terms of increasing the value of $f$: each unit of budget violation gains at most $z/2$ in reward but costs $z$ in penalty. Theorem 2 below will provide the exact relation between optimizing $f$ and the Bandits with Knapsacks problem; the significance of this value of $z$ will be illustrated more precisely in the proof of that theorem.

Now we define the regret of the algorithm with objective function $f$ as
\[
R_f(T) = \text{OPT}_f - E[f],
\]
where
\[
\text{OPT}_f = \max_{p: \sum_i p_i \le 1, \, p \ge 0} \Big[ T \sum_i p_i \mu_i - z \max\Big\{ \max_{j=1,\ldots,d} \Big( T \sum_i p_i C_{ij} - B \Big), \; 0 \Big\} \Big].
\]
And we define $R(T)$ as the regret of the constrained problem with the larger budget $B' = B + \frac{2 R_f(T)}{z} + \tilde{O}(\sqrt{T})$:
\[
R(T) = \text{OPT} - E\Big[\sum_{t=1}^{\tau} r_t\Big],
\]
where $\tau$ is the first time the budget $B'$ is violated. We conclude this lecture with the following theorem.

Theorem 2. If an algorithm achieves $R_f(T)$ regret for the unconstrained $f$ problem, then $R(T) \le 3 R_f(T) + z \tilde{O}(\sqrt{T})$, and the algorithm will not violate the budget $B'$ at any time step $t \le T$ with high probability.
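As a numerical sanity check on the choice $z = 2\,\text{OPT}/B$, the sketch below computes OPT and $\text{OPT}_f$ on a made-up two-arm, one-resource instance by brute-force grid search over the simplex (a grid stands in for an LP solver; every number here is hypothetical, not from the lecture). It illustrates both points made so far: the optimal mixed static policy beats any single arm, and with this $z$ the penalized optimum $\text{OPT}_f$ coincides with OPT, so the maximizer of $f$ gains nothing by violating the budget.

```python
# Toy two-arm instance (all numbers made up): N = 2 arms, one resource (d = 1).
T, B = 100, 50
mu = [0.9, 0.5]          # expected rewards mu_i
C  = [0.9, 0.2]          # expected per-round costs C_i

def value(p1, p2, z=None):
    """Value of mixture (p1, p2): the LP objective if z is None (infeasible
    mixtures scored -inf), else the penalized objective
    T*sum_i p_i*mu_i - z * max(T*sum_i p_i*C_i - B, 0)."""
    reward = T * (p1 * mu[0] + p2 * mu[1])
    cost = T * (p1 * C[0] + p2 * C[1])
    if z is None:
        return reward if cost <= B else float("-inf")
    return reward - z * max(cost - B, 0.0)

def grid_max(z=None, steps=500):
    # Brute-force search over {p1 + p2 <= 1, p >= 0} in place of an LP solver.
    return max(value(a / steps, b / steps, z)
               for a in range(steps + 1)
               for b in range(steps + 1 - a))

OPT = grid_max()                 # ~ 470/7 ~ 67.14, attained near p = (3/7, 4/7)
best_single = max(max(value(a / 500, 0.0) for a in range(501)),
                  max(value(0.0, b / 500) for b in range(501)))   # single arm: 50
z = 2 * OPT / B                  # penalty coefficient from the lecture
OPT_f = grid_max(z)              # penalized optimum: matches OPT, so the
                                 # maximizer of f does not violate the budget
```

The mixture earns about 67.1 while the best single arm earns only 50, and $\text{OPT}_f = \text{OPT}$ on this instance: any grid point violating the budget by $\delta$ gains at most $z\delta/2$ in reward but pays $z\delta$ in penalty.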
Proof. Proof outline: the proof follows from the following two claims.

1. $\text{OPT}_f \ge \text{OPT} - z \cdot 0 = \text{OPT}$.

2. Let $c_t$ be the cost of the decision at time $t$ for the unconstrained $f$ algorithm; then with high probability, $\sum_t c_{tj} \le B'$ for all $j$.

If the above two claims are true, then $\tau = T$ with high probability, and
\[
R(T) = \text{OPT} - E\Big[\sum_{t=1}^{\tau} r_t\Big] \le \text{OPT}_f - E[f] + z\, E\Big[\max\Big\{ \max_{j=1,\ldots,d} \Big( \sum_t c_{tj} - B \Big), \; 0 \Big\}\Big] \le R_f(T) + z (B' - B).
\]
Using $B' - B = \frac{2 R_f(T)}{z} + \tilde{O}(\sqrt{T})$, we get the theorem statement.

Next we prove the above two claims. The first claim holds because the optimal solution (in fact, any solution $p$ with $\sum_i p_i \le 1$, $p \ge 0$) for OPT forms a feasible solution for $\text{OPT}_f$, and the optimal one incurs no penalty, giving value $\text{OPT} - z \cdot 0 = \text{OPT}$.

For the second claim, let $M$ be the maximum budget violation above $B$ by the algorithm for the unconstrained $f$ problem, i.e.,
\[
M := \max\Big\{ \max_{j=1,\ldots,d} \Big( \sum_t c_{tj} - B \Big), \; 0 \Big\}, \quad \text{so that} \quad f = \sum_t r_t - zM.
\]
Then
\[
E[f] = E\Big[\sum_t r_t\Big] - z E[M] \le \text{OPT} + \frac{z E[M]}{2} - z E[M] = \text{OPT} - \frac{z E[M]}{2} \le \text{OPT}_f - \frac{z E[M]}{2}.
\]
The second-to-last inequality follows from our earlier observation that $\text{OPT}_{1+\epsilon} \le (1+\epsilon)\,\text{OPT}$ and $z = 2\,\text{OPT}/B$: a policy that violates the budget by at most $M$ collects total reward at most $\text{OPT}(1 + M/B) = \text{OPT} + zM/2$. The last inequality follows because $\text{OPT}_f \ge \text{OPT}$ (the first claim). Then, rearranging,
\[
E[M] \le \frac{2}{z} \big( \text{OPT}_f - E[f] \big) = \frac{2 R_f(T)}{z}.
\]
Therefore, by the definition of $M$,
\[
E\Big[\sum_t c_{tj}\Big] \le B + \frac{2 R_f(T)}{z} \quad \text{for all } j,
\]
so that, using Azuma–Hoeffding, with high probability
\[
\sum_t c_{tj} \le B + \frac{2 R_f(T)}{z} + \tilde{O}(\sqrt{T}) = B',
\]
proving the second claim.

Why is the above theorem useful? Conceptually, it shows that to bound the regret of Bandits with Knapsacks, we can instead use an algorithm that has a small regret bound for the unconstrained $f$ bandit problem. In the next lecture we will prove regret bounds for the unconstrained $f$ problem. In particular, for the $f$ discussed above, a uniform bound on regret of $R_f(T) \le \tilde{O}(z \sqrt{NT})$ can be proven, though we will not prove this special case in class. Then, given the above theorem,
we could solve the unconstrained $f$ bandit problem with a slightly smaller budget
\[
B'' = B - \frac{2}{z} \tilde{O}(z \sqrt{NT}) - \tilde{O}(\sqrt{T}) = B - \tilde{O}(\sqrt{NT}),
\]
so that the corresponding enlarged budget $B'$ is at most the true budget $B$, and the above theorem will give
\[
R(T) \le 3 R_f(T) + z \tilde{O}(\sqrt{T}) \le \tilde{O}(z \sqrt{NT}) + z \tilde{O}(\sqrt{T}) \le \tilde{O}(z \sqrt{NT}).
\]
Using $z = 2\,\text{OPT}/B$, we get a bound of $\tilde{O}(\text{OPT} \sqrt{NT}/B)$ on the regret for Bandits with Knapsacks, i.e., a multiplicative guarantee that the reward achieved by the bandit algorithm is at least $\text{OPT}\big(1 - \tilde{O}(\sqrt{NT}/B)\big)$.

Note that the above algorithm assumed that the value of OPT is known (OPT was used to set the value of $z$). If OPT is not known, then it needs to be estimated using pure exploration in the beginning. This initial exploration might cause additional regret. This is a limitation of the above approach of reducing the problem to the unconstrained $f$ problem. For a direct approach that does not require knowing OPT, refer to Section 4.2 of [1].

References

[1] Shipra Agrawal and Nikhil R. Devanur. Bandits with concave rewards and convex knapsacks. In Proceedings of the Fifteenth ACM Conference on Economics and Computation (EC '14), pages 989-1006. ACM, 2014.