Lecture 14: Bandits with Budget Constraints

IEOR: Learning and Optimization for Sequential Decision Making, 03/07/16

Instructor: Shipra Agrawal. Scribed by: Zhipeng Liu

1 Problem definition

In the regular multi-armed bandit problem, we pull an arm $I_t$ at each round $t$ and collect a reward $r_t$ that depends on the arm $I_t$. Now suppose that in each round, pulling the arm also incurs a cost. When the total cost up to time $t$ surpasses a given level, the algorithm stops. This setting is called bandits with constraints. The formal description is as follows.

At each time $t = 1, \ldots, T$, pull an arm $I_t = i$ and observe reward $r_t \in [0,1]$ and cost $c_t \in [0,1]^d$. Given $I_t = i$, $(r_t, c_t) \sim D_i$, where $D_i$ is a joint distribution that depends on arm $i$. Denote $E[r_t \mid I_t = i] = \mu_i$ and $E[c_{t,j} \mid I_t = i] = C_{i,j}$. Stop when any budget constraint is violated. The goal is to

$$\text{maximize } \sum_t r_t \quad \text{subject to } \sum_t c_{t,j} \le B_j, \quad j = 1, \ldots, d,$$

where $B_j \ge 0$.

Remark. The above setting is also called Bandits with Knapsacks (BwK). Usually we replace the constraints with a simplified version by letting $B = \min_j B_j$ and requiring $\sum_t c_{t,j} \le B$ for $j = 1, \ldots, d$.

Example 1. (Dynamic pricing with limited supply) Suppose that at price $q$ the product is sold with probability $S(q)$, yielding revenue $q$ and decreasing the inventory by 1. So the distribution of reward and cost for the bandit problem is: pulling arm $q$ gives

$$(r_t, c_t) = \begin{cases} (q, 1) & \text{w.p. } S(q), \\ (0, 0) & \text{w.p. } 1 - S(q). \end{cases}$$

2 Optimal static policy

A policy is a mapping from history to action. It can be pure static, in which case we keep pulling the single arm with the highest expected reward. It can be mixed static, in which case we pull arms randomly according to some fixed distribution. Or it can be dynamic, where in each round we make a decision based on the history and the remaining resources. In this section we show that there is an optimal static policy that satisfies the constraints in expectation and whose expected reward is at least that of the optimal dynamic policy.
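As an illustration of the protocol from Section 1 and of a mixed static policy, the following minimal sketch simulates the dynamic pricing example. The function name `run_static_policy`, the price grid, and the sale probabilities are hypothetical choices for illustration, not part of the lecture. The sketch also assumes a single resource (inventory) and stops as soon as the inventory is exhausted, so the budget is never exceeded.

```python
import random


def run_static_policy(prices, sale_prob, p, budget, horizon, seed=0):
    """Play a fixed mixed policy on the dynamic pricing example.

    In each round, price prices[i] is posted with probability p[i]; with the
    leftover probability 1 - sum(p) no price is posted (the null action).
    A sale yields reward = price and cost = 1 unit of inventory.
    Returns (total reward, total cost, number of rounds played).
    """
    rng = random.Random(seed)
    total_reward, total_cost = 0.0, 0
    for t in range(1, horizon + 1):
        if total_cost >= budget:          # inventory exhausted: stop
            return total_reward, total_cost, t - 1
        # Sample an arm from p; arm stays None for the null action.
        u, acc, arm = rng.random(), 0.0, None
        for i, p_i in enumerate(p):
            acc += p_i
            if u < acc:
                arm = i
                break
        if arm is None:
            continue
        if rng.random() < sale_prob[arm]:  # the product is sold
            total_reward += prices[arm]
            total_cost += 1
    return total_reward, total_cost, horizon


# Hypothetical instance: two prices, 10 units of inventory, 1000 rounds.
reward, cost, rounds = run_static_policy(
    prices=[0.5, 1.0], sale_prob=[0.8, 0.3], p=[0.5, 0.5],
    budget=10, horizon=1000)
print(reward, cost, rounds)
```

Because every sale consumes exactly one unit, the total cost returned can never exceed the budget, and the reward is bounded by the budget times the highest price.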

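Suppose that in each round we pull arm $i$ with probability $p_i$, and that the constraints are all satisfied in expectation. Then, with $T$ rounds and $N$ arms, the optimal total reward OPT is

$$\mathrm{OPT} = \max_{p \ge 0} \; T \sum_{i=1}^{N} p_i \mu_i \quad \text{subject to} \quad T \sum_{i=1}^{N} p_i C_{i,j} \le B, \;\; j = 1, \ldots, d, \qquad \sum_{i=1}^{N} p_i \le 1.$$

Notice that the second constraint ($\sum_i p_i \le 1$ rather than $= 1$) means that we allow not pulling any arm during a round.

Next we prove that OPT is at least what an optimal dynamic policy can achieve. Define $X_i$ as the total number of times an optimal policy picks arm $i$, and let $p_i^* = E[X_i]/T$. Any feasible policy must satisfy $\sum_{s=1}^{t} c_{s,j} \le B$ for all $t$. Taking expectations on both sides, we have

$$B \ge E\Big[\sum_{i=1}^{N} \sum_{t:\, I_t = i} c_{t,j}\Big] = \sum_{i=1}^{N} E[X_i]\, C_{i,j} = T \sum_{i=1}^{N} p_i^* C_{i,j}.$$

Therefore $\{p_i^*\}_{i=1}^{N}$ is feasible for OPT, and the total expected reward of this optimal policy is

$$E\Big[\sum_{i=1}^{N} X_i \mu_i\Big] = \sum_{i=1}^{N} E[X_i]\, \mu_i = T \sum_{i=1}^{N} p_i^* \mu_i \le \mathrm{OPT}.$$

3 Bandit algorithm: reducing to unconstrained f problem

Suppose the algorithm stops at time $\tau$, when a budget constraint is violated. Define the regret of such a policy as

$$R(T) = \mathrm{OPT} - E\Big[\sum_{t=1}^{\tau} r_t\Big].$$

Since the total reward of the optimal dynamic policy is bounded by OPT, the gap between the optimal dynamic policy and the algorithm is bounded by $R(T)$.

Now we reduce the constrained problem to an unconstrained problem with a nonlinear objective function, by applying a Lagrangian multiplier to the constrained problem. The new unconstrained problem maximizes a function $f$ of the average reward and the average cost vector.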
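The regret above is measured against OPT, the value of the linear program from Section 2. For a single resource ($d = 1$) this program has only two constraints besides $p \ge 0$, so some optimal solution puts weight on at most two arms, and OPT can be found by checking every single arm and every pair of arms. The sketch below does this in plain Python. The function name `opt_single_resource` and the numbers in the usage example are illustrative assumptions, not from the lecture.

```python
from itertools import combinations


def opt_single_resource(mu, cost, budget, horizon):
    """Solve max T*sum(p*mu) s.t. T*sum(p*cost) <= B, sum(p) <= 1, p >= 0.

    Assumes a single resource (d = 1). Enumerates the candidate vertices of
    the feasible region: one arm alone, or two arms with both constraints tight.
    Returns (OPT, optimal p).
    """
    b = budget / horizon                  # per-round budget B / T
    n = len(mu)
    best_value, best_p = 0.0, [0.0] * n   # p = 0 is always feasible

    def consider(p):
        nonlocal best_value, best_p
        value = sum(p_i * m_i for p_i, m_i in zip(p, mu))
        if value > best_value:
            best_value, best_p = value, p

    # One arm alone: pull it as often as the budget allows.
    for i in range(n):
        p = [0.0] * n
        p[i] = 1.0 if cost[i] <= b else b / cost[i]
        consider(p)

    # Two arms with sum(p) = 1 and expected per-round cost exactly b.
    for i, k in combinations(range(n), 2):
        if cost[i] == cost[k]:
            continue
        p_i = (b - cost[k]) / (cost[i] - cost[k])
        if 0.0 <= p_i <= 1.0:
            p = [0.0] * n
            p[i], p[k] = p_i, 1.0 - p_i
            consider(p)

    return horizon * best_value, best_p


# Hypothetical pricing instance: prices 0.5 and 1.0 sell w.p. 0.8 and 0.3.
prices, sale_prob = [0.5, 1.0], [0.8, 0.3]
mu = [q * s for q, s in zip(prices, sale_prob)]   # expected revenue per round
opt, p_star = opt_single_resource(mu, sale_prob, budget=50, horizon=100)
print(opt, p_star)
```

In this instance the best single price earns 30 in expectation, while mixing the two prices with probabilities 0.4 and 0.6 uses the budget exactly and earns about 34, which shows why the static benchmark is a mixed policy.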

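This $f$ is a concave and Lipschitz continuous function, defined as

$$f\Big(\frac{1}{T}\sum_{t=1}^{T} r_t,\; \frac{1}{T}\sum_{t=1}^{T} c_t\Big) = \frac{1}{T}\sum_{t=1}^{T} r_t - z \max_{j=1,\ldots,d} \max\Big\{\frac{1}{T}\sum_{t=1}^{T} c_{t,j} - \frac{B}{T},\; 0\Big\}.$$

In the above definition of $f$, the first term is the average reward and the second term is the penalty for the maximum violation of the budget constraint. We define the penalty coefficient as $z = 2\,\mathrm{OPT}/B$, which we explain momentarily.

Suppose we relax the budget to $B + \Delta$ and define $\epsilon = \Delta/B$. Denote the value of OPT with budget $(1+\epsilon)B$ by $\mathrm{OPT}_{1+\epsilon}$. Any $p$ that is feasible for $\mathrm{OPT}_{1+\epsilon}$ satisfies

$$T \sum_{i=1}^{N} \frac{p_i}{1+\epsilon}\, C_{i,j} \le \frac{(1+\epsilon)B}{1+\epsilon} = B, \quad j = 1, \ldots, d,$$

therefore $p/(1+\epsilon)$ is feasible for OPT. Thus we have

$$T \sum_{i=1}^{N} p_i \mu_i = (1+\epsilon)\, T \sum_{i=1}^{N} \frac{p_i}{1+\epsilon}\, \mu_i \le (1+\epsilon)\,\mathrm{OPT},$$

so if we relax the budget constraint by $\Delta$, the optimal value of OPT increases by no more than $\frac{\Delta}{B}\,\mathrm{OPT}$. So we can set $z = 2\,\mathrm{OPT}/B$, which guarantees that violating the budget gives no benefit in terms of increasing the value of $f$. Theorem 2 below provides the exact relation between optimizing $f$ and the Bandits with Knapsacks problem. The significance of this value of $z$ will be illustrated more precisely in the proof of that theorem.

Now we define the regret of the algorithm with objective function $f$, measured on the scale of total reward:

$$R_f(T) = \mathrm{OPT}_f - E\Big[T f\Big(\frac{1}{T}\sum_{t=1}^{T} r_t,\; \frac{1}{T}\sum_{t=1}^{T} c_t\Big)\Big],$$

where

$$\mathrm{OPT}_f = \max_{p \ge 0:\ \sum_i p_i \le 1} T f\Big(\sum_i p_i \mu_i,\; \sum_i p_i C_i\Big) = \max_{p \ge 0:\ \sum_i p_i \le 1} \Big[\, T \sum_i p_i \mu_i - z \max_{j} \max\Big\{T \sum_i p_i C_{i,j} - B,\; 0\Big\} \Big].$$

And we define $R'(T)$ as the regret of the constrained problem with larger budget $B' = B + \Delta$, where $\Delta = \frac{2}{z} R_f(T) + \tilde{O}(\sqrt{T})$:

$$R'(T) = \mathrm{OPT}' - E\Big[\sum_{t=1}^{\tau} r_t\Big],$$

where $\mathrm{OPT}'$ is the value of OPT with budget $B'$ and $\tau$ is the first time a budget is violated. We conclude this lecture with the following theorem.

Theorem 2. If an algorithm achieves regret $R_f(T)$ for the unconstrained $f$ problem, then $R'(T) \le 3R_f(T) + z\,\tilde{O}(\sqrt{T})$, and with high probability the algorithm does not violate $B'$ at any time step $t$.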
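To make the objective $f$ and the choice $z = 2\,\mathrm{OPT}/B$ concrete, here is a minimal sketch that evaluates $T f$ (total reward minus $z$ times the largest budget overshoot) at the expected totals of two static policies. The function name `penalized_total` and all numbers are hypothetical, and evaluating $f$ at expected totals rather than at random realizations is a simplifying assumption.

```python
def penalized_total(total_reward, total_costs, budget, z):
    """Return T * f: total reward minus z times the largest budget overshoot."""
    overshoot = max(max(c - budget for c in total_costs), 0.0)
    return total_reward - z * overshoot


# Hypothetical single-resource instance with OPT = 34 and B = 50.
opt, budget = 34.0, 50.0
z = 2.0 * opt / budget          # penalty coefficient z = 2 OPT / B

# Expected totals of two static policies over T = 100 rounds (illustrative).
within_budget = penalized_total(34.0, [50.0], budget, z)   # uses exactly B
over_budget = penalized_total(40.0, [80.0], budget, z)     # overshoots B by 30
print(within_budget, over_budget)
```

The second policy collects more raw reward (40 versus 34), but its overshoot of 30 is charged at rate $z = 1.36$, so its penalized value falls below that of the policy that respects the budget. This is the effect the factor 2 in $z$ is meant to guarantee.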

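Proof. Proof outline: the proof follows from the following two claims:

1. $\mathrm{OPT}_f \ge \mathrm{OPT}' - z\Delta$.
2. Let $c_t$ be the cost of the decision at time $t$ for the unconstrained $f$ algorithm. Then with high probability $\sum_t c_{t,j} \le B'$ for all $j$.

If the above two claims are true, then $\tau = T$ with high probability and

$$R'(T) = \mathrm{OPT}' - E\Big[\sum_{t=1}^{\tau} r_t\Big] \le \mathrm{OPT}_f + z\Delta - E\Big[\sum_{t=1}^{T} r_t\Big] = R_f(T) + z\Delta - z\, E\Big[\max_{j=1,\ldots,d} \max\Big\{\sum_t c_{t,j} - B,\; 0\Big\}\Big] \le R_f(T) + z\Delta.$$

Using $\Delta = \frac{2 R_f(T)}{z} + \tilde{O}(\sqrt{T})$, we get the theorem statement.

Next we prove the two claims. The first claim holds because the optimal solution for $\mathrm{OPT}'$ (in fact, any solution $p$ with $\sum_i p_i \le 1$, $p \ge 0$) forms a feasible solution for $\mathrm{OPT}_f$, with value at least $\mathrm{OPT}' - z\Delta$.

For the second claim, let $M$ be the maximum budget violation above $B$ by the algorithm for the unconstrained $f$ problem, i.e.

$$M := E\Big[\max_{j=1,\ldots,d} \max\Big\{\sum_t c_{t,j} - B,\; 0\Big\}\Big].$$

Then

$$E\Big[T f\Big(\frac{1}{T}\sum_t r_t,\; \frac{1}{T}\sum_t c_t\Big)\Big] = E\Big[\sum_t r_t\Big] - zM \le \mathrm{OPT} + \frac{zM}{2} - zM = \mathrm{OPT} - \frac{zM}{2} \le \mathrm{OPT}_f - \frac{zM}{2}.$$

The first inequality follows from our earlier observation that $\mathrm{OPT}_{1+\epsilon} \le (1+\epsilon)\,\mathrm{OPT}$ together with $z = 2\,\mathrm{OPT}/B$: the expected consumption of the algorithm is at most $B + M$, so its expected reward is at most $\mathrm{OPT}\,(1 + M/B) = \mathrm{OPT} + zM/2$. The last inequality follows because $\mathrm{OPT}_f \ge \mathrm{OPT}$: the optimal solution $p$ for OPT forms a feasible solution for $\mathrm{OPT}_f$ with value OPT. Rearranging,

$$M \le \frac{2}{z}\Big(\mathrm{OPT}_f - E\Big[T f\Big(\frac{1}{T}\sum_t r_t,\; \frac{1}{T}\sum_t c_t\Big)\Big]\Big) = \frac{2}{z} R_f(T).$$

Therefore, by the definition of $M$, $E\big[\sum_t c_{t,j}\big] \le B + \frac{2 R_f(T)}{z}$, so that using Azuma-Hoeffding, with high probability

$$\sum_t c_{t,j} \le B + \frac{2}{z} R_f(T) + \tilde{O}(\sqrt{T}) = B',$$

proving the second claim.

Why is the above theorem useful? Conceptually, it shows that to bound the regret of Bandits with Knapsacks, we can instead use an algorithm that has a small regret bound for the unconstrained $f$ bandit problem. In the next lecture we will prove regret bounds for the unconstrained $f$ problem. In particular, for the $f$ discussed above, a uniform bound on regret of $R_f(T) \le \tilde{O}(z\sqrt{NT})$ can be proven, though we will not prove this special case in class.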
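As a rough numerical reading of Theorem 2, the sketch below plugs numbers into $\Delta = \frac{2}{z} R_f(T) + \tilde{O}(\sqrt{T})$ and $R'(T) \le R_f(T) + z\Delta$. The $\tilde{O}(\sqrt{T})$ term hides constants and logarithmic factors that the lecture does not specify, so it is replaced here by the stand-in $c\sqrt{T \log T}$ with a hypothetical constant $c$. The function name and the input values are also illustrative assumptions.

```python
import math


def theorem2_quantities(regret_f, z, horizon, c=1.0):
    """Return (Delta, bound on R') for given R_f, z and T.

    The concentration slack c * sqrt(T log T) is a hypothetical stand-in
    for the O~(sqrt(T)) term; the true constant is not given in the notes.
    """
    slack = c * math.sqrt(horizon * math.log(horizon))
    delta = 2.0 * regret_f / z + slack      # budget relaxation Delta
    bound = regret_f + z * delta            # equals 3 * R_f + z * slack
    return delta, bound


# Illustrative numbers only: R_f = 100, z = 1.36, T = 10,000.
delta, bound = theorem2_quantities(regret_f=100.0, z=1.36, horizon=10_000)
print(delta, bound)
```

The returned bound always equals $3R_f(T)$ plus $z$ times the slack, which is the form stated in the theorem.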

Given the above theorem, we could solve the unconstrained $f$ bandit problem with a slightly smaller budget, $B - \frac{2}{z}\tilde{O}(z\sqrt{NT}) - \tilde{O}(\sqrt{T})$, to get regret $R_f(T) \le \tilde{O}(z\sqrt{NT})$, so that the above theorem gives

$$R(T) \le 3R_f(T) + z\,\tilde{O}(\sqrt{T}) \le \tilde{O}(z\sqrt{NT}) + z\,\tilde{O}(\sqrt{T}).$$

Using $z = 2\,\mathrm{OPT}/B$, we get a bound of $\tilde{O}\big(\frac{\mathrm{OPT}}{B}\sqrt{NT}\big)$ on the regret for Bandits with Knapsacks, i.e. a multiplicative guarantee that the reward achieved by the bandit algorithm is at least $\mathrm{OPT}\big(1 - \tilde{O}\big(\frac{\sqrt{NT}}{B}\big)\big)$.

Note that the above algorithm assumes that the value of OPT is known: OPT was used to set the value of $z$. If OPT is not known, then it needs to be estimated using pure exploration in the beginning. This initial exploration might cause additional regret. This is a limitation of the above approach of reducing the problem to the unconstrained $f$ problem. For a direct approach that does not require knowing OPT, refer to Section 4.2 of [1].

References

[1] Shipra Agrawal and Nikhil R. Devanur. Bandits with concave rewards and convex knapsacks. In Proceedings of the Fifteenth ACM Conference on Economics and Computation. ACM.
