arXiv: v1 [math.OC] 25 Jun 2008


Irrevocable Multi-Armed Bandit Policies

Vivek F. Farias, Ritesh Madan

February 13, 2018

Abstract

This paper considers the multi-armed bandit problem with multiple simultaneous arm pulls. We develop a new 'irrevocable' heuristic for this problem. In particular, we do not allow recourse to arms that were pulled at some point in the past but then discarded. This irrevocable property is highly desirable from a practical perspective. As a consequence of this property, our heuristic entails a minimum amount of 'exploration'. At the same time, we find that the price of irrevocability is limited for a broad useful class of bandits we characterize precisely. This class includes one of the most common applications of the bandit model, namely, bandits whose arms are 'coins' of unknown biases. Computational experiments with a generative family of large scale problems within this class indicate losses of up to 5-10% relative to an upper bound on the performance of an optimal policy with no restrictions on exploration. We also provide a worst-case theoretical analysis that shows that for this class of bandit problems, the price of irrevocability is uniformly bounded: our heuristic earns expected rewards that are always within a factor of 1/8 of an optimal policy with no restrictions on exploration. In addition to being an indicator of robustness across all parameter regimes, this analysis sheds light on the structural properties that afford a low price of irrevocability.

Vivek F. Farias: Sloan School of Management and Operations Research Center, Massachusetts Institute of Technology, email: vivekf@mit.edu
Ritesh Madan: Qualcomm-Flarion Technologies, email: rkmadan@stanfordalumni.org

1 Introduction

Consider the operations of a 'fast-fashion' retailer such as Zara or H&M. Such retailers have developed and invested in merchandise procurement strategies that permit lead times for new fashions as short as two weeks. As a consequence of this flexibility, such retailers are able to adjust the assortment of products offered on sale at their stores to quickly adapt to popular fashion trends. In particular, such retailers use weekly sales data to refine their estimates of an item's popularity, and based on such revised estimates weed out unpopular items, or else re-stock demonstrably popular ones on a week-by-week basis. In sharp contrast, traditional retailers such as J.C. Penney or Marks and Spencer face lead times on the order of several months. As such these retailers need to predict popular fashions months in advance and are allowed virtually no changes to their product assortments over the course of a sales season which is typically several months in length. Understandably, this approach is not nearly as successful at identifying high selling fashions and also results in substantial unsold inventories at the end of a sales season. In view of the great deal of a-priori uncertainty in the popularity of a new fashion and the speed at which fashion trends evolve, the fast-fashion operations model is highly desirable and emerging as the de-facto operations model for large fashion retailers.

Among other things, the fast-fashion model relies crucially on an effective technology to learn from purchase data, and adjust product assortments based on such data. Such a technology must strike a balance between 'exploring' potentially successful products and 'exploiting' products that are demonstrably popular. A convenient mathematical model within which to design algorithms capable of accomplishing such a task is that of the multi-armed bandit. While we defer a precise mathematical discussion to a later section, a multi-armed bandit consists of multiple (say $n$) arms, each corresponding to a Markov Decision Process.
As a special case, one may think of each arm as an independent binomial coin with an uncertain bias specified via some prior distribution. At each point in time, one may 'pull' up to a certain number of arms (say $k < n$) simultaneously, or equivalently, toss up to a certain number of coins. For each tossed coin, we earn a reward proportional to its realization and are able to refine our estimate of its bias based on this realization. We neither learn about, nor earn rewards from, coins that are not tossed. The multi-armed bandit problem requires finding a policy that adaptively selects $k$ arms to pull at every point in time with a view to maximizing total expected reward earned over some finite time horizon or, alternatively, discounted rewards earned over an infinite horizon or perhaps even long term average rewards.

With multiple simultaneous pulls allowed, the multi-armed bandit problem we have described is computationally hard. A popular and empirically successful heuristic for this problem was proposed several decades ago by Whittle. Whittle's heuristic produces an index for every arm based on the state of that arm and simply calls for pulling the $k$ arms with the highest index at every point in time. While it has been empirically and computationally observed that Whittle's heuristic provides excellent performance, the heuristic typically calls for frequent changes to the set of arms pulled that might, in hindsight, have been unnecessary. For instance, in the retail context, such a heuristic may choose to discard from the assortment a product presently being offered for sale in favor of a new product whose popularity is not known precisely. Later, the heuristic may well choose to re-introduce the discarded product. While such exploration may appear necessary if one is to discover profitable bandit arms (or popular products), enabling such a heuristic in practice will typically call for a great number of adjustments to the product assortment, a requirement that is both expensive and undesirable. This raises the following question: Is it possible to design a heuristic for the multi-armed bandit problem that comes close to being optimal with a minimal number of adjustments to the set of arms pulled over time?

This paper introduces a new 'irrevocable' heuristic for the multi-armed bandit problem we call the packing heuristic. The packing heuristic establishes a static ranking of bandit arms based on a measure of their potential value relative to the time required to realize that value, and pulls arms in the order prescribed by this ranking. For an arm currently being pulled, the heuristic may either choose to continue pulling that arm in the next time step or else discard the arm in favor of the next highest ranked arm not currently being pulled. Once discarded, an arm will never be chosen again; hence the term irrevocable. Irrevocability is an attractive structural constraint to impose on arm selection policies in a number of practical applications of the bandit model such as the dynamic assortment problem we have discussed or sequential drug trials where recourse to drugs whose testing was discontinued in the past is socially unacceptable. It is clear that an irrevocable heuristic makes a minimal number of changes to the set of arms pulled. What is perhaps surprising is that the restriction to an irrevocable policy is typically far less expensive than one might expect. In particular, we demonstrate via a theoretical analysis and computational experiments that the use of the packing heuristic incurs a small performance loss relative to an optimal bandit policy with no restriction on exploration, i.e. an optimal strategy that is allowed recourse to arms that were pulled but discarded in the past.

More specifically, the present work makes the following contributions:

We introduce a new irrevocable heuristic, the packing heuristic, for the multi-armed bandit problem with multiple simultaneous arm-pulls. The packing heuristic is irrevocable in that if an arm being pulled is at some point discarded from the set of arms being pulled, it is never pulled again.
At the same time, the performance loss incurred relative to an optimal, potentially non-irrevocable, control policy is limited. In particular, computational experiments with the packing heuristic for a generative family of large scale bandit problems indicate performance losses of up to a few percent relative to an upper bound on the performance of an optimal policy with no restrictions on exploration. This level of performance suggests that the packing heuristic is likely to serve as a viable heuristic for the multi-armed bandit with multiple plays even when irrevocability is not a concern.

In addition to our computational study, we are able to demonstrate a uniform bound on the price of irrevocability for a broad, interesting class of bandits. This class includes most commonly used applications of the bandit model such as bandits whose arms are 'coins' of unknown biases. We demonstrate that the packing heuristic earns expected rewards that are always within a factor of 1/8 of an optimal, potentially non-irrevocable policy. Such a uniform bound guarantees robust performance across all parameter regimes; in particular, the packing heuristic will 'track' the performance of an optimal, potentially non-irrevocable policy across all parameter regimes. In addition, our analysis sheds light on the structural properties that afford the surprising efficacy of the irrevocable policies considered here.

In the interest of practical applicability, we develop a fast combinatorial implementation of the packing heuristic. Assuming that an individual arm has $O(\Sigma)$ states, and given a time horizon of $T$ steps, optimal solution to the multi-armed bandit problem under consideration requires $O(\Sigma^n T^n)$ computations. The main computational step in the packing heuristic calls for the one time solution of a linear program with $O(n\Sigma T)$ variables, whose solution via a generic LP solver requires $O(n^3 \Sigma^3 T^3)$ computations. We develop a novel combinatorial algorithm that solves this linear program in $O(n\Sigma^2 T \log T)$ steps by solving a sequence of dynamic programs for each bandit arm. The technique we develop here is potentially of independent interest for the solution of 'weakly coupled' optimal control problems with coupling constraints that must be met in expectation. Employing this solution technique, our heuristic requires a total of $O(n\Sigma^2 \log T)$ computations per time step, amortized over the time horizon. In comparison, the simplest theoretically sound heuristics in existence for this multi-armed bandit problem (such as Whittle's heuristic) require $O(n\Sigma^2 T)$ computations per time step. As such, we establish that the packing heuristic is computationally attractive.

1.1 Relevant Literature

The multi-armed bandit problem has a rich history, and a number of excellent references (such as Gittins (1989)) provide a thorough treatment of the subject. We review here literature especially relevant to the present work. In the case where $k = 1$, that is, allowing for a single arm to be pulled in a given time step, Gittins and Jones (1974) developed an elegant index based policy that was shown to be optimal for the problem of maximizing discounted rewards over an infinite horizon. Their index policy is known to be suboptimal if one is allowed to pull more than a single arm in a given time step. Whittle (1988) developed a simple index based heuristic for a more general bandit problem (the 'restless' bandit problem) allowing for multiple arms to be pulled in a given time step. While his original paper was concerned with maximizing long-term average rewards, his heuristic is easily adapted to other objectives such as discounted infinite horizon rewards or expected rewards over a finite horizon (see for instance Caro and Gallien (2007), Bertsimas and Niño-Mora (2000)).
Weiss (1992) subsequently established that under suitable conditions, Whittle's heuristic was asymptotically optimal (in a regime where $n$ and $k$ go to infinity keeping $n/k$ constant). Whittle's heuristic may be viewed as a modification to the optimal control policy one obtains upon relaxing the requirement that at most $k$ arms be pulled in a given time step to requiring that at most $k$ arms be pulled in expectation in any given time step. The packing heuristic we introduce is motivated by a similar relaxation. In particular, we restrict attention to policies that entail a total of at most $kT$ arm pulls over the entire horizon in expectation while allowing for no more than $T$ pulls of any given arm. Where we differ substantially from Whittle's heuristic is the manner in which we construct a feasible policy (one where at most $k$ arms are pulled in a given time step) from the relaxed policy. In fact there are potentially many reasonable ways of transforming an optimal policy for the relaxed problem to a feasible policy for the multi-armed bandit; for instance Bertsimas and Niño-Mora (2000) use a scheme distinct from both Whittle's and ours, that employs optimal primal and dual solutions to a linear programming formulation of Whittle's relaxation to construct an index heuristic for arm selection. Nonetheless, none of these schemes is irrevocable, nor do they offer performance guarantees, if any, that are non-asymptotic.

The packing heuristic policy builds upon recent insights on the 'adaptivity' gap for stochastic packing problems. In particular, Dean et al. (2004) recently established that a simple static rule (Smith's rule) for packing a knapsack with items of fixed reward (known a-priori), but whose sizes were stochastic and unknown a-priori, was within a constant factor of the optimal adaptive packing policy. Guha and Munagala (2007) used this insight to establish a similar static rule for 'budgeted learning problems'. In such a problem one is interested in finding a coin with highest bias from a set of coins of uncertain bias, assuming one is allowed to toss a single coin in a given time step and that one has a finite budget on the number of such experimental tosses allowed. Our work parallels that work in that we draw on the insights of the stochastic packing results of Dean et al. (2004). In addition, we must address two significant hurdles: first, correlations between the total reward earned from pulls of a given arm and the total number of pulls of that arm (these turn out not to matter in the budgeted learning setting, but are crucial to our setting), and second, the fact that multiple arms may be pulled simultaneously (only a single arm may be pulled at any time in the budgeted learning setting). Finally, a working paper (Bhattacharjee et al. (2007)), brought to our attention by the authors of that work, considers a variant of the budgeted learning problem of Guha and Munagala (2007) wherein one is allowed to toss multiple coins simultaneously. While it is conceivable that their heuristic may be modified to apply to the multi-armed bandit problem we address, the heuristic they develop is also not irrevocable.

Restricted to coins, our work takes an inherently Bayesian view of the multi-armed bandit problem. It is worth mentioning that there are a number of non-parametric formulations of such problems with a vast associated literature. Most relevant to the present model are the papers by Anantharam et al. (1987a,b) that develop simple 'regret-optimal' strategies for multi-armed bandit problems with multiple simultaneous plays.

Our development of an irrevocable policy for the multi-armed bandit problem was originally motivated by applications of this framework to 'dynamic assortment' problems of the type mentioned in the introduction.
In particular, Caro and Gallien (2007) computationally explore the use of a number of simple index-type heuristics (similar to Whittle's heuristic) for such problems, none of which are irrevocable; nonetheless, they stress the importance of a minimal number of changes to the assortment if any such heuristic is to be practical.

The remainder of this paper is organized as follows. Section 2 presents the multi-armed bandit model we consider and develops an (intractable) LP whose solution yields an optimal control policy for this bandit problem. Section 3 develops the packing heuristic by considering a suitable relaxation of the multi-armed bandit problem. Section 4 introduces a structural property for bandit arms we call the 'decreasing returns' property. It is shown that a useful class of bandits, namely the 'coin' bandits relevant to the applications that motivate us, possess this property. That section then establishes that the price of irrevocability for bandits possessing the decreasing returns property is uniformly bounded. Section 5 presents very encouraging computational experiments for large scale bandit problems drawn from a generative family of coin type bandits. In the interest of implementability, Section 6 develops a combinatorial algorithm for the fast computation of packing heuristic policies for multi-armed bandits. Section 7 concludes with a perspective on interesting directions for future work.

2 Model

We consider a multi-armed bandit problem with multiple simultaneous pulls permitted at every time step. A single bandit arm (indexed by $i$) is a Markov Decision Process (MDP) specified by a state space $S_i$, an action space $A_i$, a reward function $r_i : S_i \times A_i \to \mathbb{R}_+$, and a transition kernel $P_i : S_i \times A_i \to \Delta^{S_i}$ (where $\Delta^{S_i}$ is the $|S_i|$-dimensional unit simplex), yielding a probability distribution over next states should one choose some action $a_i \in A_i$ in state $s_i \in S_i$.

Every bandit arm is endowed with a distinguished 'idle' action $\phi_i$. Should a bandit be idled in some time period, it yields no rewards in that period and transitions to the same state with probability 1 in the next period. More precisely,

$$r_i(s_i, \phi_i) = 0, \quad \forall s_i \in S_i,$$
$$P_i(s_i, \phi_i, s_i) = 1, \quad \forall s_i \in S_i.$$

We consider a bandit problem with $n$ arms. In each time step one must select a subset of up to $k\ (\le n)$ arms for which one may pick any action available at those respective arms. Should an action other than the idle action be selected at any of these $k$ arms, we refer to such a selection as a 'pull' of that arm. That is, any action $a_i \in A_i \setminus \{\phi_i\}$ would be considered a pull of the $i$th arm. One is forced to pick the idle action for the remaining $n - k$ arms. We wish to find an action selection (or control) policy that maximizes expected rewards earned over $T$ time periods.

Our problem may be cast as an optimal control problem. In particular, we define as our state space the set $S = \prod_i S_i$ and as our action space the set $A = \prod_i A_i$. We let $\mathcal{T} = \{0, 1, \dots, T-1\}$. We understand by $s_i$ the $i$th component of $s \in S$ and similarly let $a_i$ denote the $i$th component of $a \in A$. A feasible action is one which calls for simultaneously pulling at most $k$ arms. In particular, we let

$$A^{\mathrm{feas}} = \Big\{ a \in A : \sum_i \mathbf{1}_{a_i \neq \phi_i} \le k \Big\}$$

denote the set of all feasible actions. We define a reward function $r : S \times A \to \mathbb{R}_+$, given by $r(s,a) = \sum_i r_i(s_i, a_i)$, and a system transition kernel $P : S \times A \to \Delta^{S}$, given by $P(s,a,s') = \prod_i P_i(s_i, a_i, s'_i)$.

We now formally develop what we mean by a control policy. The arm selection policy we will eventually develop will use auxiliary information aside from the current state of the system, and so we require a general definition.
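As a brief aside before the formal definitions, the model described so far can be made concrete with a small sketch. The following Python fragment is an illustration only and is not taken from the paper: it assumes a coin arm whose unknown bias has a Beta prior, uses the posterior mean as the expected reward of a pull, and the names `CoinArm`, `PULL`, `IDLE` and `is_feasible` are hypothetical.

```python
from fractions import Fraction

IDLE = "idle"  # the distinguished idle action (phi_i in the text)
PULL = "pull"  # hypothetical name for the single non-idle action


class CoinArm:
    """One bandit arm: a coin whose unknown bias has a Beta(alpha, beta) prior.

    Assumption for illustration: the arm's MDP state is the pair (alpha, beta).
    """

    actions = (PULL, IDLE)

    def reward(self, state, action):
        # Idling earns nothing; a pull earns the posterior mean of the bias.
        if action == IDLE:
            return Fraction(0)
        alpha, beta = state
        return Fraction(alpha, alpha + beta)

    def transitions(self, state, action):
        # Returns {next_state: probability}. Idling leaves the state unchanged.
        if action == IDLE:
            return {state: Fraction(1)}
        alpha, beta = state
        p_heads = Fraction(alpha, alpha + beta)
        return {(alpha + 1, beta): p_heads, (alpha, beta + 1): 1 - p_heads}


def is_feasible(joint_action, k):
    """A joint action is feasible if it pulls at most k arms."""
    return sum(1 for a in joint_action if a != IDLE) <= k
```

For example, `is_feasible((PULL, IDLE, PULL), 2)` holds while `is_feasible((PULL, PULL, PULL), 2)` does not, mirroring the definition of the feasible action set.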
Let $X_0$ be a random variable that encapsulates any endogenous randomization in selecting an action, and define the filtration generated by $X_0$ and the history of visited states and actions by

$$\mathcal{F}_t = \sigma\big(X_0, (s^0), (s^1, a^0), \dots, (s^t, a^{t-1})\big),$$

where $s^t$ and $a^t$ denote the state and action at time $t$, respectively. We assume that

$$\mathbb{P}\big(s^{t+1} = s' \mid s^t = s, a^t = a, H^t = h^t\big) = P(s, a, s')$$

for all $s, s' \in S$, $a \in A$, $t \in \mathcal{T}$ and any $\mathcal{F}_t$-measurable random variable $H^t$. A feasible policy simply specifies a sequence of $A^{\mathrm{feas}}$-valued actions $\{a^t\}$ adapted to $\mathcal{F}_t$. In particular, such a policy may be specified by a collection of $\sigma(X_0)$-measurable, $A^{\mathrm{feas}}$-valued random variables $\{\mu(s^0, \dots, s^t, a^0, \dots, a^{t-1}, t)\}$, one for each possible state-action history of the system. We let $M$ denote the set of all such policies $\mu$, and denote by $J^{\mu}(s, 0)$ the expected value of using policy $\mu$ starting in state $s$ at time 0; in particular

$$J^{\mu}(s, 0) = \mathbb{E}\left[ \sum_{t=0}^{T-1} r(s^t, a^t) \,\Big|\, s^0 = s \right],$$

where $a^t = \mu(s^0, \dots, s^t, a^0, \dots, a^{t-1}, t)$. Our goal is to compute an optimal admissible policy.

Markovian policies, i.e. policies under which $a^t$ is measurable with respect to $\sigma(X_0, s^t)$, are particularly useful. A Markovian policy is specified as a collection of independent $A^{\mathrm{feas}}$-valued random variables $\{\mu(s,t)\}$, each measurable with respect to $\sigma(X_0)$. In particular, assuming the system is in state $s$ at time $t$, such a policy selects an action $a^t$ as the random variable $\mu(s,t)$, independent of past states and actions. We let $M^m$ denote the set of all such admissible Markovian policies. Every $\mu \in M^m$ is associated with a value function, $J^{\mu} : S \times \mathcal{T} \to \mathbb{R}_+$, which, for every $(s,t) \in S \times \mathcal{T}$, gives the expected value of using control policy $\mu$ starting at that state:

$$J^{\mu}(s, t) = \mathbb{E}\left[ \sum_{t'=t}^{T-1} r\big(s^{t'}, \mu(s^{t'}, t')\big) \,\Big|\, s^t = s \right].$$

We denote by $J^*$ the optimal value function. In particular, $J^*(s,t) = \sup_{\mu \in M^m} J^{\mu}(s,t)$. The preceding supremum is always achieved and we denote by $\mu^*$ a corresponding optimal Markovian control policy. That is, $\mu^* \in \operatorname{arg\,sup}_{\mu \in M^m} J^{\mu}(s,t)$ for all $(s,t) \in S \times \mathcal{T}$. Our restriction to Markovian policies is without loss; $M^m$ always contains an optimal policy among the broader class of admissible policies, so that

$$\sup_{\mu \in M^m} J^{\mu}(s, 0) = \sup_{\mu \in M} J^{\mu}(s, 0)$$

for all states $s$. We next formulate a mathematical program to compute such an optimal policy.

2.1 Computing an Optimal Policy

An optimal policy $\mu^*$ may be found via the solution of the following linear program, $LP(\tilde{\pi}_0)$, specified by a parameter $\tilde{\pi}_0 \in \Delta^{S}$ that specifies the distribution of arm states at time $t = 0$:

$$
\begin{array}{lll}
\max. & \sum_t \sum_{s,a} \pi(s,a,t)\, r(s,a) & \\
\text{s.t.} & \sum_a \pi(s,a,t) = \sum_{s',a'} P(s',a',s)\, \pi(s',a',t-1), & \forall t > 0,\ s \in S, \\
& \pi(s,a,t) = 0, & \forall s,\ t,\ a \notin A^{\mathrm{feas}}, \\
& \sum_a \pi(s,a,0) = \tilde{\pi}_0(s), & \forall s \in S, \\
& \pi \ge 0, &
\end{array}
$$

where the variables are the state-action frequencies $\pi(s,a,t)$, which give the probability of being in state $s$ at time $t$ and choosing action $a$. The first set of constraints in the above program simply enforces the dynamics of the system, while the second set of constraints enforces the requirement that at most $k$ arms are simultaneously pulled at any point in time. An optimal solution to the program above may be used to construct a policy $\mu^*$ that attains expected value $J^*(s, 0)$ starting at any state $s$ for which $\tilde{\pi}_0(s) > 0$.
In particular, given an optimal solution $\pi^{\mathrm{opt}}$ to $LP(\tilde{\pi}_0)$, one obtains such a policy by defining $\mu^*(s,t)$ as a random variable that takes value $a \in A$ with probability $\pi^{\mathrm{opt}}(s,a,t) / \sum_a \pi^{\mathrm{opt}}(s,a,t)$. By construction, we have $\mathbb{E}[J^*(s,0) \mid s \sim \tilde{\pi}_0] = OPT(LP(\tilde{\pi}_0))$. Of course, efficient solution of the above program is not a tractable task, which forces us to seek approximations to an optimal policy. The next section will present one such policy with an appealing structural property we term irrevocability.

3 An Irrevocable Approximation to the Optimal Policy

This section develops an approximation to the optimal multi-armed bandit control policy that we will subsequently establish performs adequately relative to the optimal policy. This approximation will possess a desirable property we term irrevocability. In particular, the policy we develop will, at any time, be permitted to pull an arm only if that arm was pulled in the prior time step, or else never pulled in the past.

We first develop a control policy for a related bandit problem, where the requirement that at most $k$ arms be pulled in any time step is relaxed. As we will see, this is essentially Whittle's relaxation, and the value of the policy developed for this relaxation is an upper bound on the value of the optimal policy. We will then use the control policy developed for this relaxed control problem to design a policy for the multi-armed bandit problem that is irrevocable and also offers good performance relative to the optimal policy for a broad class of bandits.

Consider the following relaxation of the program $LP(\tilde{\pi}_0)$, $RLP(\tilde{\pi}_0)$. $RLP(\tilde{\pi}_0)$ may be viewed as a primal formulation of Whittle's relaxation:

$$
\begin{array}{lll}
\max. & \sum_i \sum_t \sum_{s_i,a_i} \pi_i(s_i,a_i,t)\, r_i(s_i,a_i) & \\
\text{s.t.} & \sum_{a_i} \pi_i(s_i,a_i,t) = \sum_{s'_i,a'_i} P_i(s'_i,a'_i,s_i)\, \pi_i(s'_i,a'_i,t-1), & \forall t > 0,\ s_i \in S_i,\ i, \\
& \sum_i \Big[ T - \sum_{s_i} \sum_t \pi_i(s_i,\phi_i,t) \Big] \le kT, & \\
& \sum_{a_i} \pi_i(s_i,a_i,0) = \sum_{\bar{s} :\, \bar{s}_i = s_i} \tilde{\pi}_0(\bar{s}), & \forall i,\ s_i \in S_i, \\
& \pi \ge 0, &
\end{array}
$$

where $\pi_i(s_i,a_i,t)$ is the probability of the $i$th bandit being in state $s_i$ at time $t$ and choosing action $a_i$. The program above relaxes the requirement that at most $k$ arms be pulled in a given time step; instead we now require that over the entire horizon at most $kT$ arms are pulled in expectation, where the expectation is over policy randomization and state evolution. The first set of equality constraints enforces individual arm dynamics, whereas the first inequality constraint enforces the requirement that at most $kT$ arms be pulled in expectation over the entire time horizon. The following lemma makes the notion of a relaxation to $LP(\tilde{\pi}_0)$ precise; the proof may be found in the appendix.

Lemma 1. $OPT(RLP(\tilde{\pi}_0)) \ge OPT(LP(\tilde{\pi}_0))$.

Given an optimal solution $\bar{\pi}$ to $RLP(\tilde{\pi}_0)$, one may consider the policy $\mu^R$ that, assuming we are in state $s$ at time $t$, selects a random action $\mu^R(s,t)$, where $\mu^R(s,t) = a$ with probability $\prod_i \big( \bar{\pi}_i(s_i,a_i,t) / \sum_{a_i} \bar{\pi}_i(s_i,a_i,t) \big)$, independent of the past.
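The normalization used above to turn state-action frequencies into a randomized policy can be sketched for a single arm as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the frequencies are supplied as a plain dictionary keyed by (state, action, time), and the function names `policy_from_frequencies` and `sample_action` are hypothetical.

```python
import random
from collections import defaultdict


def policy_from_frequencies(pi):
    """Turn state-action frequencies into a randomized Markovian policy.

    `pi` maps (state, action, t) -> probability mass. The result maps
    (state, t) -> {action: conditional probability of that action}.
    """
    totals = defaultdict(float)
    for (s, a, t), mass in pi.items():
        totals[(s, t)] += mass

    policy = defaultdict(dict)
    for (s, a, t), mass in pi.items():
        if totals[(s, t)] > 0:  # pairs (s, t) with zero mass get no entry
            policy[(s, t)][a] = mass / totals[(s, t)]
    return dict(policy)


def sample_action(policy, s, t, rng=random):
    """Draw an action for state s at time t, independently of the past."""
    actions, probs = zip(*policy[(s, t)].items())
    return rng.choices(actions, weights=probs, k=1)[0]
```

Applying this arm by arm and sampling each arm's action independently gives a joint action distributed according to the product form stated above.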
Noting that the action for each arm $i$ is chosen independently of all other arms, we use $\mu^R_i(s_i,t)$ to denote the induced policy for arm $i$. By construction, $\mathbb{E}[J^{\mu^R}(s,0) \mid s \sim \tilde{\pi}_0] = OPT(RLP(\tilde{\pi}_0))$. Moreover, we have that $\mu^R$ satisfies the constraint

$$\mathbb{E}\left[ \sum_{t=0}^{T-1} \sum_i \mathbf{1}_{\mu^R_i(s^t_i, t) \neq \phi_i} \,\Big|\, s^0 = s \right] \le kT,$$

where the expectation is over random state transitions and endogenous policy randomization.

Of course, $\mu^R$ is not necessarily feasible; we ultimately require a policy that entails at most $k$ arm pulls in any time step. We will use $\mu^R$ to construct such a feasible policy. In addition, we will see that if an arm is pulled and then idled in some subsequent time step, it will never again be pulled, so that the policy we construct will be irrevocable. In what follows we will assume for convenience that $\tilde{\pi}_0$ is degenerate and puts mass 1 on a single starting state. That is, $\tilde{\pi}_0(s_i) = 1$ for some $s_i \in S_i$ for all $i$. We first introduce some relevant notation. Given an optimal solution $\bar{\pi}$ to $RLP(\tilde{\pi}_0)$, define the value generated by arm $i$ as the random variable

$$R_i = \sum_{t=0}^{T-1} r_i\big(s^t_i, \mu^R_i(s^t_i, t)\big),$$

and the active time of arm $i$, $T_i$, as the total number of pulls of arm $i$ entailed under that policy:

$$T_i = \sum_{t=0}^{T-1} \mathbf{1}_{\mu^R_i(s^t_i, t) \neq \phi_i}.$$

The expected value of arm $i$ is $\mathbb{E}[R_i] = \sum_{s_i,a_i,t} \bar{\pi}_i(s_i,a_i,t)\, r_i(s_i,a_i)$, and the expected active time is $\mathbb{E}[T_i] = \sum_{s_i,a_i,t :\, a_i \neq \phi_i} \bar{\pi}_i(s_i,a_i,t)$. We will assume in what follows that $\mathbb{E}[T_i] > 0$ for all $i$; otherwise, we simply consider eliminating those $i$ for which $\mathbb{E}[T_i] = 0$. We will also assume for analytical convenience that $\sum_i \mathbb{E}[T_i] = kT$. Neither assumption results in a loss of generality.

To motivate our policy we begin with the following analogy with a packing problem: Imagine packing $n$ objects into a knapsack of size $B$. Each object $i$ has size $\bar{T}_i$ and value $\bar{R}_i$. Moreover, we assume that we are allowed to pack fractional quantities of an object into the knapsack and that packing a fraction $\alpha$ of the $i$th object requires space $\alpha \bar{T}_i$ and generates value $\alpha \bar{R}_i$. An optimal policy is then given by the following greedy procedure: select objects in decreasing order of the ratio $\bar{R}_i / \bar{T}_i$ and place them into the knapsack to the extent that there is room available. If one had more than a single knapsack and the additional constraint that an item could not be placed in more than a single knapsack, then the situation is more complicated. One may consider a greedy procedure that, as before, considers items in decreasing order of the ratio $\bar{R}_i / \bar{T}_i$ and places them (possibly fractionally) in sequence, into the least loaded of the bins at that point. This generalization of the greedy procedure for the simple knapsack is suboptimal, but still a reasonable heuristic.

Thus motivated, we begin with a loose high level description of our control policy, which we call the 'packing' heuristic. We think of each bandit arm as an item of value $\mathbb{E}[R_i]$ with size $\mathbb{E}[T_i]$. For the purposes of this explanation alone, we will assume for convenience that should policy $\mu^R$ call for an arm that was pulled in the past to be idled, it will never again call for that arm to be pulled; we will momentarily remove that assumption.
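The single-knapsack greedy procedure from the analogy above can be sketched in a few lines. This is an illustrative fragment only; the function name `greedy_fractional_knapsack` and the representation of items as (value, size) pairs are assumptions made for the example.

```python
def greedy_fractional_knapsack(items, capacity):
    """Pack items fractionally in decreasing order of value/size.

    `items` is a list of (value, size) pairs with size > 0. Returns the
    total value packed and the fraction taken of each item.
    """
    order = sorted(range(len(items)),
                   key=lambda i: items[i][0] / items[i][1],
                   reverse=True)
    fractions = [0.0] * len(items)
    total_value, room = 0.0, float(capacity)
    for i in order:
        if room <= 0:
            break  # the knapsack is full
        value, size = items[i]
        take = min(1.0, room / size)  # take as much of item i as fits
        fractions[i] = take
        total_value += take * value
        room -= take * size
    return total_value, fractions
```

With items `[(6.0, 3.0), (5.0, 5.0), (4.0, 1.0)]` and capacity 6, the procedure takes the third and first items whole and then two fifths of the second. The packing heuristic ranks arms by the same kind of ratio, with expected value and expected active time playing the roles of value and size.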
Our control policy will operate as follows: we will order arms in decreasing order of the ratio $\mathbb{E}[R_i]/\mathbb{E}[T_i]$. We begin with the top $k$ arms according to this ordering. For each such arm we will select an action according to the policy specified for that arm by $\mu^R$; should this policy call for the arm to be idled, we discard that arm and will never again consider pulling it. We replace the discarded arm with the next available arm (in order of initial arm rankings) and select an action for the arm according to $\mu^R$. We repeat this procedure until we have selected non-idle actions for up to $k$ arms (or no arms are available). We then let time advance, earn rewards, and repeat the procedure described above until the end of the time horizon. Algorithm 1 describes the packing heuristic policy precisely, addressing the fact that $\mu^R$ may call for an arm to be idled but then pulled in some subsequent time step.

Algorithm 1 The Packing Heuristic

Renumber bandits so that $\mathbb{E}[R_1]/\mathbb{E}[T_1] \ge \mathbb{E}[R_2]/\mathbb{E}[T_2] \ge \dots \ge \mathbb{E}[R_n]/\mathbb{E}[T_n]$. Index bandits by variable $i$.
$l_i \leftarrow 0$, $a_i \leftarrow \phi_i$ for all $i$, $s \sim \tilde{\pi}_0(\cdot)$ {The 'local time' of every arm is set to 0 and its designated action to the idle action. An initial state is drawn according to the initial state distribution $\tilde{\pi}_0$.}
$J \leftarrow 0$ {Total reward earned is initialized to 0.}
$X \leftarrow \{1, 2, \dots, k\}$, $A \leftarrow \{k+1, \dots, n\}$, $D \leftarrow \emptyset$ {Initialize the set of active ($X$), available ($A$), and discarded ($D$) arms.}
for $t = 0$ to $T - 1$ do
    while there exists an arm $i \in X$ with $a_i = \phi_i$ do {Select up to $k$ arms to pull.}
        Select an $i \in X$ with $a_i = \phi_i$ {In what follows, either select an action for arm $i$ or else discard it.}
        while $a_i = \phi_i$ and $l_i < T$ do {Attempt to select a pull action for arm $i$.}
            Select $a_i \propto \bar{\pi}_i(s_i, \cdot, l_i)$ {Select an action according to the solution to $RLP(\tilde{\pi}_0)$.}
            $l_i \leftarrow l_i + 1$ {Increment arm $i$'s local time.}
        end while
        if $l_i = T$ and $a_i = \phi_i$ then {Discard arm $i$ and activate the next highest ranked arm available.}
            $X \leftarrow X \setminus \{i\}$, $D \leftarrow D \cup \{i\}$ {Discard arm $i$.}
            if $A \neq \emptyset$ then {There are available arms.}
                $j \leftarrow \min A$ {Select highest ranked available arm.}
                $X \leftarrow X \cup \{j\}$, $A \leftarrow A \setminus \{j\}$ {Add arm $j$ to active set.}
            end if
        end if
    end while
    for every $i \in X$ do {Pull selected arms.}
        $s_i \sim P_i(s_i, a_i, \cdot)$ {Pull arm $i$; select next arm $i$ state according to its transition kernel assuming the use of action $a_i$.}
        $J \leftarrow J + r_i(s_i, a_i)$ {Earn rewards.}
        $a_i \leftarrow \phi_i$
    end for
end for

In the event that we placed no restriction on the time horizon (i.e. we set $T = \infty$ in the algorithm above), we have by construction that the expected total reward earned under the above policy is precisely $OPT(RLP(\tilde{\pi}_0))$. In essence, $RLP(\tilde{\pi}_0)$ prescribes a policy wherein each arm generates a total reward with mean $\mathbb{E}[R_i]$ using an expected total number of pulls $\mathbb{E}[T_i]$, independent of other arms. The above scheme may be visualized as one which 'packs' as many of the pulls of various arms as possible in a manner so as to meet feasibility constraints.

It is clear that the heuristic we have constructed entails a minimal amount of arm exploration. In particular, we are guaranteed at most $n - k$ changes to the set of pulled arms. One may naturally ask what the limited exploration permitted under this policy costs us in terms of performance. In addition, is this scheme computationally practical? In particular, the linear programming relaxation we must solve is still a fairly large program. In subsequent sections we address these issues. First, we present a theoretical analysis that demonstrates that the price of irrevocability is uniformly bounded for an important general class of bandits. Our analysis sheds light on the structural properties that are likely to afford a low price of irrevocability in practice. We then present results of computational experiments with a generative family of large-scale problems demonstrating performance losses of up to 5-10% relative to an upper bound on the performance of the optimal policy (which is potentially non-irrevocable and has no restrictions on exploration). Finally, we address computational issues relevant to the packing heuristic and develop a computational scheme that is substantially quicker than heuristics such as Whittle's heuristic.

4 The Price of Irrevocability

This section establishes a uniform bound on the performance loss incurred in using the irrevocable packing heuristic relative to an optimal, potentially non-irrevocable scheme for a useful family of bandits whose arms exhibit a certain decreasing returns property. This class includes bandits whose arms are 'coins' of unknown biases, a family particularly relevant to a number of applications including those discussed in the introduction. We establish that the packing heuristic always earns expected rewards that are within a factor of 1/8 of an optimal scheme. Our analysis sheds light on those structural properties that likely afford a low price of irrevocability.
In addton to beng an ndcator of robustness across all parameter regmes, ths bound on the prce of rrevocablty s remarkable for two reasons. Frst, t does not rely on an asymptotc scalng of the system; the performance of the packng heurstc wll track that of an optmal, potentally non-rrevocable heurstc across all regmes. Second, the bound represents a comparson wth a system where one s allowed recourse to arms that were pulled n the past and dscarded. In partcular, the bound thus hghlghts the fact that for a useful class of bandts, one may acheve reasonable performance wth very lmted exploraton. The typcal performance we expect from the heurstc s lkely to be far superor (as t generally s n the case of problems for whch such worst case guarantees can be establshed); n a subsequent secton we wll present computatonal experments ndcatng a performance loss of 5 10% relatve to an optmal polcy wth no restrctons on exploraton. In what follows we frst specfy the decreasng returns property and explctly dentfy a class of bandts that possess ths property. We then present our performance analyss whch wll proceed as follows: we frst consder pullng bandt arms serally,.e. at most one arm at a tme, n order of ther rank and show that the total reward earned from bandts that were frst pulled wthn the frst kt/2 pulls s at least wthn a factor of 1/8 of an optmal polcy. Ths result reles on the statc rankng of bandt arms used, and a symmetrzaton dea exploted by Dean et al. (2004) n ther result on stochastc packng where rewards are statstcally ndependent of tem sze. In contrast to that work, we must address the fact that the rewards earned from a bandt are statstcally 10

dependent on the number of pulls of that bandit, and to this end we exploit the decreasing returns property that establishes the nature of this correlation. We then show via a combinatorial sample path argument that the expected reward earned from bandits pulled within the first $T/2$ time steps of the packing heuristic, i.e., with arms being pulled in parallel, is at least as much as that earned in the setting above where arms are pulled serially, thereby establishing our performance guarantee.

4.1 The Decreasing Returns Property

Define for every $i$ and $l < T$ the random variable

$$L_i(l) = \sum_{t=0}^{l} \mathbf{1}_{\{\mu^R_i(s_t,t) \neq \phi_i\}}.$$

$L_i(l)$ tracks the number of times arm $i$ has been pulled under policy $\mu^R$ among the first $l+1$ steps of selecting an action for that arm. Further, define

$$R_i^m = \sum_{l=0}^{T-1} \mathbf{1}_{\{L_i(l) \le m\}}\, r_i\big(s_l, \mu^R_i(s_l,l)\big).$$

$R_i^m$ is the random reward earned within the first $m$ pulls of arm $i$ under the policy $\mu^R$. The decreasing returns property roughly states that the expected incremental returns from allowing an additional pull of a bandit arm are, on average, decreasing. More precisely, we have:

Property 1. (Decreasing Returns) $E[R_i^{m+1}] - E[R_i^m] \le E[R_i^m] - E[R_i^{m-1}]$ for all $0 < m < T$.

One useful class of bandits from a modeling perspective that satisfy this property are bandits whose arms are coins of unknown bias. The following discussion makes this notion more precise.

An example of a bandit with decreasing returns: Coins

We define a coin to be any multi-armed bandit for which every arm $i$ has action space $A_i = \{p_i, \phi_i\}$, with $r_i(s_i,p_i) > 0$ for all $s_i \in S_i$, and satisfies the following property:

$$r_i(s_i,p_i) \ge \sum_{s_i' \in S_i} P_i(s_i,p_i,s_i')\, r_i(s_i',p_i), \quad \forall\, i,\ s_i \in S_i.$$

The above supermartingale characterization of rewards intuitively suggests the decreasing returns property. In particular, it states that the returns from a pull in the current state are at least as large as the expected returns to a pull in a state reached subsequent to the current pull. The decreasing returns property for coins is established in the following Lemma, whose proof may be found in the appendix:

Lemma 2. Coins satisfy the decreasing returns property.
That is, if $A_i = \{p_i, \phi_i\}$ and $r_i(s_i,p_i) \ge \sum_{s_i' \in S_i} P_i(s_i,p_i,s_i')\, r_i(s_i',p_i)$ for all $i$ and $s_i \in S_i$,

then

$$E[R_i^{m+1}] - E[R_i^m] \le E[R_i^m] - E[R_i^{m-1}]$$

for all $0 < m < T$.

Returning to our motivating example of dynamic product assortment selection, we note that in estimating the bias of a binomial coin of unknown bias given some initial prior on coin bias, Bayes' rule implies that the estimated bias after $n$ observations (which generate the filtration $F_n$), $\mu_{n+1}$, satisfies $E[\mu_{n+1} \mid F_n] = \mu_n$. Thus, bandits with such arms, wherein the reward from an arm is some non-negative scalar times the bias, automatically possess the decreasing returns property.

4.2 A Uniform Bound on the Price of Irrevocability for Bandits with Decreasing Returns

For convenience of exposition we assume that $T$ is even; addressing the odd case requires essentially identical proofs but cumbersome notation. We re-order the bandits in decreasing order of $E[R_i]/E[T_i]$ as in the packing heuristic. Let us define

$$H = \min\Big\{ j : \sum_{i=1}^{j} E[T_i] \ge kT/2 \Big\}.$$

Thus, bandits $1, \dots, H$ are those that take up approximately half the budget on total expected pulls. Next, let us define for all $i \le H$ random variables $\tilde R_i$ and $\tilde T_i$ according to $\tilde R_i = R_i$, $\tilde T_i = T_i$ for all $i < H$. We define $\tilde R_H = \alpha R_H$ and $\tilde T_H = \alpha T_H$, where

$$\alpha = \frac{kT/2 - \sum_{i=1}^{H-1} E[T_i]}{E[T_H]}.$$

We begin with a preliminary lemma:

Lemma 3.

$$\sum_{i=1}^{H} E[\tilde R_i] \ge \frac{1}{2}\, OPT(RLP(\tilde\pi_0)).$$

Proof. Define a function

$$f(t) = \sum_{i=1}^{n} \frac{E[R_i]}{E[T_i]} \left( \Big( t - \sum_{j=1}^{i-1} E[T_j] \Big)^{+} \wedge E[T_i] \right),$$

where $(a \wedge b) = \min(a,b)$. By construction (i.e. since $E[R_i]/E[T_i]$ is non-increasing in $i$), we have that $f$ is a concave function on $[0, kT]$. Now observe that

$$\sum_{i=1}^{H} E[\tilde R_i] = \sum_{i=1}^{H-1} \frac{E[R_i]}{E[T_i]}\, E[T_i] + \frac{E[R_H]}{E[T_H]} \Big( kT/2 - \sum_{i=1}^{H-1} E[T_i] \Big) = f(kT/2).$$

Next, observe that

$$OPT(RLP(\tilde\pi_0)) = \sum_{i=1}^{n} \frac{E[R_i]}{E[T_i]}\, E[T_i] = f(kT).$$

By the concavity of $f$ and since $f(0) = 0$, we have that $f(kT/2) \ge \frac{1}{2} f(kT)$, which yields the result.
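The concavity argument in the proof of Lemma 3 can be checked numerically on a small instance. The following is a minimal sketch, not part of the original analysis: the function name `f` mirrors the proof, while the values chosen for $E[R_i]$, $E[T_i]$ and $kT$ are hypothetical and picked only so that the reward-to-pull ratios are non-increasing and $\sum_i E[T_i] \le kT$.

```python
def f(t, exp_rewards, exp_pulls):
    """Piecewise-linear f(t) from the proof of Lemma 3.

    Arms must already be sorted by decreasing exp_rewards[i] / exp_pulls[i].
    """
    total = 0.0
    used = 0.0  # sum of E[T_j] over arms ranked ahead of the current arm
    for reward, pulls in zip(exp_rewards, exp_pulls):
        # ((t - used)^+ ^ E[T_i]): pull budget that arm i receives
        budget = min(max(t - used, 0.0), pulls)
        total += (reward / pulls) * budget
        used += pulls
    return total


# Hypothetical instance (illustration only): n = 4 arms, k * T = 12.
exp_rewards = [6.0, 4.0, 3.0, 1.0]   # E[R_i]
exp_pulls = [2.0, 2.0, 3.0, 4.0]     # E[T_i]; ratios 3, 2, 1, 0.25
kT = 12.0

full = f(kT, exp_rewards, exp_pulls)      # f(kT): every arm gets its full E[T_i]
half = f(kT / 2, exp_rewards, exp_pulls)  # f(kT/2): only top-ranked arms served
print(full, half, half >= 0.5 * full)
```

On this instance half of the pull budget already captures most of the value, which is the effect that concavity of $f$ guarantees in general: the highest-ratio arms are served first.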

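We next compare the expected reward earned by a certain subset of bandits with indices no larger than $H$. The significance of the subset of bandits we define will be seen later in the proof of Lemma 6; we will see there that all bandits in this subset begin operation prior to time $T/2$ in a run of the packing heuristic. In particular, define

$$R_{1/2} = \sum_{i=1}^{H} \mathbf{1}_{\{\sum_{j=1}^{i-1} T_j < kT/2\}}\, R_i.$$

Lemma 4.

$$E[R_{1/2}] \ge \frac{1}{4}\, OPT(RLP(\tilde\pi_0)).$$

Proof. We have:

$$\begin{aligned}
E[R_{1/2}] &\overset{(a)}{=} \sum_{i=1}^{H} \Pr\Big( \sum_{j=1}^{i-1} T_j < kT/2 \Big) E[R_i] \\
&\overset{(b)}{\ge} \sum_{i=1}^{H} \Pr\Big( \sum_{j=1}^{i-1} T_j < kT/2 \Big) E[\tilde R_i] \\
&\overset{(c)}{=} \sum_{i=1}^{H} \Pr\Big( \sum_{j=1}^{i-1} \tilde T_j < kT/2 \Big) E[\tilde R_i] \\
&\overset{(d)}{\ge} \sum_{i=1}^{H} \Big( 1 - \frac{\sum_{j=1}^{i-1} E[\tilde T_j]}{kT/2} \Big) E[\tilde R_i] \\
&= \sum_{i=1}^{H} E[\tilde R_i] - \sum_{i=1}^{H} \sum_{j=1}^{i-1} \frac{E[\tilde T_j]\, E[\tilde R_i]}{kT/2} \\
&\overset{(e)}{\ge} \sum_{i=1}^{H} E[\tilde R_i] - \frac{1}{2} \sum_{i=1}^{H} \sum_{j=1, j \neq i}^{H} \frac{E[\tilde T_j]\, E[\tilde R_i]}{kT/2} \\
&\overset{(f)}{\ge} \frac{1}{2} \sum_{i=1}^{H} E[\tilde R_i] \\
&\overset{(g)}{\ge} \frac{1}{4}\, OPT(RLP(\tilde\pi_0)).
\end{aligned}$$

Equality (a) follows from the fact that under policy $\mu^R$, $R_i$ is independent of $T_j$ for $j < i$. Inequality (b) follows from our definition of $\tilde R_i$: $\tilde R_i \le R_i$. Equality (c) follows from the fact that by definition $\tilde T_i = T_i$ for all $i < H$. Inequality (d) invokes Markov's inequality. Inequality (e) is the critical step in establishing the result and uses the simple symmetrization idea exploited by Dean et al. (2004). In particular, we observe that since $\frac{E[R_i]}{E[T_i]} \le \frac{E[R_j]}{E[T_j]}$ for $i > j$, it follows that $E[R_i]E[T_j] \le \frac{1}{2}\big(E[R_i]E[T_j] + E[R_j]E[T_i]\big)$ for $i > j$; the same holds with $\tilde R$ and $\tilde T$ in place of $R$ and $T$, since scaling by $\alpha$ leaves the ratio unchanged. Replacing every term of the form $E[\tilde R_i]E[\tilde T_j]$ (with $i > j$) in the expression preceding inequality (e) with the upper bound $\frac{1}{2}\big(E[\tilde R_i]E[\tilde T_j] + E[\tilde R_j]E[\tilde T_i]\big)$ yields inequality (e). Inequality (f) follows from the fact that $\sum_{i=1}^{H} E[\tilde T_i] = kT/2$. Inequality (g) follows from Lemma 3.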

Before moving on to our main Lemma that translates the above guarantees to a guarantee on the performance of the packing heuristic, we need to establish one additional technical fact. Recall that $R_i^m$ is the reward earned by bandit $i$ in the first $m$ pulls of this bandit. Exploiting the assumed decreasing returns property, we have the following Lemma, whose proof may be found in the appendix:

Lemma 5. For bandits satisfying the decreasing returns property (Property 1),

$$E\Big[ \sum_{i=1}^{H} \mathbf{1}_{\{\sum_{j=1}^{i-1} T_j < kT/2\}}\, R_i^{T/2} \Big] \ge \frac{1}{2}\, E[R_{1/2}].$$

We have thus far established estimates for total expected rewards earned assuming implicitly that bandits are pulled in a serial fashion in order of their rank. The following Lemma connects these estimates to the expected reward earned under the $\mu^{\text{packing}}$ policy (given by the packing heuristic) using a simple sample path argument. In particular, the following Lemma shows that the expected rewards under the $\mu^{\text{packing}}$ policy are at least as large as $E\big[ \sum_{i=1}^{H} \mathbf{1}_{\{\sum_{j=1}^{i-1} T_j < kT/2\}}\, R_i^{T/2} \big]$.

Lemma 6.

$$E[J^{\mu^{\text{packing}}}(s,0) \mid s \sim \tilde\pi_0] \ge E\Big[ \sum_{i=1}^{H} \mathbf{1}_{\{\sum_{j=1}^{i-1} T_j < kT/2\}}\, R_i^{T/2} \Big].$$

Proof. For a given sample path of the system define

$$h = H \wedge \min\Big\{ i : \sum_{j=1}^{i} T_j \ge kT/2 \Big\}.$$

On this sample path, it must be that:

$$\sum_{i=1}^{H} \mathbf{1}_{\{\sum_{j=1}^{i-1} T_j < kT/2\}}\, R_i^{T/2} = \sum_{i=1}^{h} R_i^{T/2}. \qquad (4.1)$$

We claim that arms $1, 2, \dots, h$ are all first pulled at times $t < T/2$ under $\mu^{\text{packing}}$. Assume to the contrary that this were not the case and recall that arms are considered in order of index under $\mu^{\text{packing}}$, so that an arm with index $i$ is pulled for the first time no later than the first time arm $l$ is pulled for $l > i$. Let $h'$ be the highest arm index among the arms pulled at time $t = T/2 - 1$, so that $h' < h$. It must be that $\sum_{j=1}^{h'} T_j \ge kT/2$. But then,

$$H \wedge \min\Big\{ i : \sum_{j=1}^{i} T_j \ge kT/2 \Big\} \le h',$$

which is a contradiction since $h' < h$. Thus, since every one of the arms $1, 2, \dots, h$ is first pulled at times $t < T/2$, each such arm may be pulled for at least $T/2$ time steps prior to time $T$ (the horizon). Consequently, we have that the total rewards earned on this sample path under policy $\mu^{\text{packing}}$ are at least $\sum_{i=1}^{h} R_i^{T/2}$.

Using identity (4.1) and taking an expectation over sample paths yields the result.

We are ready to establish our main Theorem that provides a uniform bound on the performance loss incurred in using the packing heuristic policy relative to an optimal policy with no restrictions on exploration. In particular, we have that the price of irrevocability is uniformly bounded for bandits satisfying the decreasing returns property.

Theorem 1. For multi-armed bandits satisfying the decreasing returns property (Property 1),

$$E[J^{\mu^{\text{packing}}}(s,0) \mid s \sim \tilde\pi_0] \ge \frac{1}{8}\, E[J^{*}(s,0) \mid s \sim \tilde\pi_0]$$

for all initial state distributions $\tilde\pi_0$.

Proof. We have from Lemmas 1, 4, 5 and 6 that

$$E[J^{\mu^{\text{packing}}}(s,0) \mid s \sim \tilde\pi_0] \ge \frac{1}{8}\, OPT(RLP(\tilde\pi_0)).$$

We know from Lemma 1 that $OPT(RLP(\tilde\pi_0)) \ge OPT(LP(\tilde\pi_0)) = E[J^{*}(s,0) \mid s \sim \tilde\pi_0]$, from which the result follows.

Our analysis highlighted a structural property, decreasing returns, that is likely to afford a low price of irrevocability. The next section demonstrates computational results that suggest that in practice we may expect this price to be quite small (on the order of 5-10%) for bandits possessing this property.

5 Computational Experiments

This section presents computational experiments with the packing heuristic. We consider a number of large scale bandit problems drawn from a generative family of problems to be discussed shortly and demonstrate that the packing heuristic consistently demonstrates performance within about 5-10% of an upper bound on the performance of an unrestricted (i.e. potentially non-irrevocable) optimal solution to the multi-armed bandit problem. In particular, this suggests that the price of irrevocability is likely to be small in practice, at least for models of the type we consider here. Since the bandits considered in our experiments, Binomial coins of uncertain bias, are among the most widely used applications of the multi-armed bandit model, we view this to be a positive result.

The Generative Model: We consider multi-armed bandit problems with $n$ arms, up to $k$ of which may be pulled simultaneously at any time.
The $i$th arm corresponds to a Binomial($m$, $P_i$) coin where $m$ is fixed and known, and $P_i$ is unknown but drawn from a Dirichlet($\alpha_i$, $\beta_i$) prior distribution. Assuming we choose to pull arm $i$ at some point, we realize a random outcome $M_i \in \{0, 1, \dots, m\}$. $M_i$ is a Binomial($m$, $P_i$) random variable where $P_i$ is itself a Dirichlet($\alpha_i$, $\beta_i$) random variable. We receive a reward of $r_i M_i$ and update the prior distribution parameters according to $\alpha_i \leftarrow \alpha_i + M_i$, $\beta_i \leftarrow \beta_i + m - M_i$. By selecting the initial values of $\alpha_i$ and $\beta_i$ for each arm appropriately we can control for the initial uncertainty in the value of $P_i$. This model is, for instance, applicable to the dynamic assortment selection problem discussed earlier (see Caro and Gallien (2007)), with each coin representing a product of uncertain popularity and $M_i$ representing the uncertain number of product sales over a single period in which that product is offered for sale. We recall from our previous discussion that this family of bandits satisfies the decreasing returns property, and from our performance analysis we expect a reasonable price of irrevocability.
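A single pull in this generative model, together with the conjugate update of $(\alpha_i, \beta_i)$, can be sketched as follows. This is an illustrative sketch only: the function names `pull_arm` and `posterior_mean` and all parameter values are hypothetical, and the two-parameter Dirichlet prior is sampled as a Beta distribution, which is the same distribution.

```python
import random


def pull_arm(alpha, beta, m, r, p, rng):
    """Pull one arm once: draw M ~ Binomial(m, p), then update the prior.

    Returns (reward, new_alpha, new_beta) with reward = r * M,
    new_alpha = alpha + M and new_beta = beta + m - M.
    """
    successes = sum(1 for _ in range(m) if rng.random() < p)
    return r * successes, alpha + successes, beta + m - successes


def posterior_mean(alpha, beta):
    """Current estimate of the coin bias under a Beta(alpha, beta) belief."""
    return alpha / (alpha + beta)


# Hypothetical arm (values chosen for illustration only).
rng = random.Random(0)
alpha, beta, m, r = 0.2, 0.8, 2, 1.5
true_bias = rng.betavariate(alpha, beta)  # P_i drawn from the prior

total_reward = 0.0
for _ in range(5):  # pull the same arm five times
    reward, alpha, beta = pull_arm(alpha, beta, m, r, true_bias, rng)
    total_reward += reward

print(total_reward, posterior_mean(alpha, beta))
```

Each pull adds exactly $m$ to $\alpha_i + \beta_i$, so the belief about $P_i$ tightens with every pull regardless of the outcome.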

Summary of Computational Experiments
Columns: Coeff. of Variation (cv) | Arms ($n$) | Simultaneous Pulls ($k$) | Horizon ($T$) | Performance ($J^{\mu^{\text{packing}}}/J^{*}$)
Row groups: Moderate (cv = 1) | High (cv = 2.5)

Table 1: Computational Summary. Each row represents summary statistics for 100 distinct random bandit problems with the specified $n$, $k$, $T$ and cv parameters. Performance for each instance was computed from 3000 simulations of that instance. Performance figures thus represent an average over the generative family with the specified $n$, $k$, $T$ and cv parameters, as well as over system randomness.

We consider the following random instances of the above problem. We consider bandits with $(n, k) \in \{(500, 50), (500, 100), (100, 10), (100, 20)\}$. These dimensions are representative of large scale applications, of which the dynamic assortment problem is an example. For each value of $(n, k)$ we consider time horizons $T = 25$ and $T = 40$. For every bandit problem we consider, we subdivide the arms of the bandit into 10 groups. All arms within a group have identical statistical structure, that is, identical $r_i$ values and identical initial values of $\alpha_i$ and $\beta_i$. For each value of $(n, k, T)$, we generate a number of problem instances by randomly drawing prior parameters for bandit arms. In particular, for all arms in a given group we select $\alpha_i$ uniformly in the interval $[0.05, 0.35]$ and then select that value of $\beta_i$ which results in a prior coefficient of variation cv $\in \{1, 2.5\}$. These coefficients of variation represent, respectively, a moderate and high degree of a-priori uncertainty in coin bias (or, in the context of the dynamic assortment application, product popularity). In addition, $r_i$ is drawn uniformly on $[0, 2]$ and we take $m = 2$. We generate 100 random problem instances for each coefficient of variation. Control policies for a given bandit problem instance are evaluated over 3000 random state trajectories (which resulted in 98% confidence intervals that were at least within ±1% of the sample average).
Evaluating Performance: A striking feature of our performance results is that the price of irrevocability is quite small, a trend that appears to hold over varying parameter regimes. In particular, we make the following observations:

Consider problems with a small number of arms (100) with a large number of simultaneous pulls (20) allowed. Intuitively, an optimal policy could reasonably explore all arms in this setting before settling on the best arms. We thus expect the price of irrevocability to be high here. Even in this regime we find that the price of irrevocability is only about % of optimal performance.

Consider problems with a high degree of a-priori uncertainty in coin bias. Mistakes (that is, discarding an arm that is performing reasonably in favor of an unexplored arm that turns out to perform poorly) are particularly expensive in such problems. With a high coefficient of variation in the prior on initial arm bias, the price of irrevocability is indeed somewhat higher but continues to remain within % of optimal performance.

For each of our experiments, we observe that keeping all other parameters fixed, relative performance improves with a longer time horizon. This is intuitive; with longer horizons, one may delay discarding an arm until one is sure that the arm performs poorly relative to the expected value of the available alternatives.

Finally, we note that the performance figures we report are relative to an upper bound on optimal policy performance. Computing the optimal policy is itself an intractable task. The performance observed here suggests that, at least for bandit problems with decreasing returns, the packing heuristic is a viable approximation scheme even when irrevocability is not necessarily a concern.

We can thus conclude that the price of irrevocability is small for a useful class of multi-armed bandit problems and that the packing heuristic performs well for this class of problems. A final concern is computational effort. In particular, for the largest problem instance we considered ($n = 500$), the linear program we need to solve has 3.2 million variables and about the same number of constraints. Even a commercial linear programming solver (such as CPLEX) equipped with the ability to exploit structure in this program will require several hours on a powerful computer to solve this program.
This is in stark contrast with an index based heuristic (such as Whittle's heuristic) that solves a simple dynamic program for each arm at every time step. In the next section we develop an efficient computational algorithm for the solution of $RLP(\tilde\pi_0)$ that requires substantially less effort than even Whittle's heuristic and takes a few minutes to solve the aforementioned program on a laptop computer.

6 Fast Computation

This section considers the computational effort required to implement the packing heuristic. We develop a computational scheme that makes the packing heuristic substantially easier to implement than popular index heuristics such as Whittle's heuristic and thus establish that the heuristic is viable from a computational perspective.

The key computational step in implementing the packing heuristic is the solution of the linear program $RLP(\tilde\pi_0)$. Assuming that $|S_i| = O(S)$ and $|A_i| = O(A)$ for all $i$, this linear program has $O(nTAS)$ variables and each Newton iteration of a general purpose interior point method will require $O\big((nTAS)^3\big)$ steps. An interior point method that exploits the fact that bandit arms are

coupled via a single constraint will require $O(n(TAS)^3)$ computational steps at each iteration. We develop a combinatorial scheme to solve this linear program that is in spirit similar to the classical Dantzig-Wolfe dual decomposition algorithm. In contrast with Dantzig-Wolfe decomposition, our scheme is efficient. In particular, the scheme requires $O(nTAS^2 \log(kT))$ computational steps to solve $RLP(\tilde\pi_0)$, making it a significantly faster solution alternative to the schemes alluded to above. Equipped with this fast scheme, it is notable that using the packing heuristic requires $O(nAS^2 \log(kT))$ computations per time step amortized over the time horizon, which will typically be substantially less than the $\Theta(nAS^2 T)$ computations required per time step for index policy heuristics such as Whittle's heuristic.

Our scheme employs a dual decomposition of $RLP(\tilde\pi_0)$. The key technical difficulty we must overcome in developing our computational scheme for the solution of $RLP(\tilde\pi_0)$ is the non-differentiability of the dual function corresponding to $RLP(\tilde\pi_0)$ at an optimal dual solution, which prevents us from recovering an optimal or near optimal policy by direct minimization of the dual function.

6.1 An Overview of the Scheme

For each bandit arm $i$, define the polytope $D_i(\tilde\pi_0) \subset \mathbb{R}^{|S_i||A_i|T}$ of permissible state-action frequencies for that bandit arm, specified via the constraints of $RLP(\tilde\pi_0)$ relevant to that arm. A point within this polytope, $\pi_i$, corresponds to a set of valid state-action frequencies for the $i$th bandit arm. With some abuse of notation, we denote the expected reward from this arm under $\pi_i$ by the value function:

$$R_i(\pi_i) = \sum_{t=0}^{T-1} \sum_{s_i, a_i} \pi_i(s_i,a_i,t)\, r_i(s_i,a_i).$$

In addition, denote the expected number of pulls of bandit arm $i$ under $\pi_i$ by

$$T_i(\pi_i) = T - \sum_{s_i} \sum_{t} \pi_i(s_i,\phi_i,t).$$

We understand that both $R_i(\cdot)$ and $T_i(\cdot)$ are defined over the domain $D_i(\tilde\pi_0)$. We may thus rewrite $RLP(\tilde\pi_0)$ in the following form:

$$\begin{aligned}
\text{max.} \quad & \sum_i R_i(\pi_i), \\
\text{s.t.} \quad & \sum_i T_i(\pi_i) \le kT.
\end{aligned} \qquad (6.1)$$

The Lagrangian dual of this program is $DRLP(\tilde\pi_0)$:

$$\begin{aligned}
\text{min.} \quad & \lambda kT + \sum_i \max_{\pi_i} \big( R_i(\pi_i) - \lambda T_i(\pi_i) \big), \\
\text{s.t.} \quad & \lambda \ge 0.
\end{aligned}$$

The above program is convex. In particular, the objective is a convex function of $\lambda$.
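For a fixed $\lambda$ the dual objective separates across arms: each inner maximization charges a price $\lambda$ per pull of one arm. The sketch below evaluates the dual objective on a toy instance by backward induction for each arm. It is an illustration under stated assumptions, not the paper's algorithm: the arm encoding (a tuple of pull rewards, transition lists and an initial distribution), the function names, and the assumption that idling earns zero reward and leaves the state unchanged are all hypothetical.

```python
def arm_value(pull_reward, transitions, init_dist, horizon, lam):
    """max over single-arm policies of E[reward] - lam * E[number of pulls].

    pull_reward[s]  : reward for pulling in state s
    transitions[s]  : list of (next_state, probability) pairs after a pull
    Assumption: idling earns 0 and leaves the arm's state unchanged.
    """
    n_states = len(pull_reward)
    value = [0.0] * n_states  # value-to-go at the end of the horizon
    for _ in range(horizon):  # backward induction, one step per period
        new_value = []
        for s in range(n_states):
            idle = value[s]
            pull = pull_reward[s] - lam + sum(p * value[s2] for s2, p in transitions[s])
            new_value.append(max(idle, pull))
        value = new_value
    return sum(q * v for q, v in zip(init_dist, value))


def dual_objective(arms, k, horizon, lam):
    """lam * k * T plus the sum of per-arm maximized Lagrangian terms."""
    return lam * k * horizon + sum(arm_value(*arm, horizon, lam) for arm in arms)


# Toy instance (hypothetical): two identical one-state arms paying 1 per pull,
# with k = 1 simultaneous pull allowed and horizon T = 4.
coin = ([1.0], [[(0, 1.0)]], [1.0])
arms = [coin, coin]
values = [dual_objective(arms, k=1, horizon=4, lam=lam) for lam in (0.0, 1.0, 2.0)]
print(values)
```

On this toy instance the dual objective is piecewise linear in $\lambda$ with a kink at its minimizer $\lambda = 1$, where its value matches the best achievable reward of one unit per period; the kink is a small-scale picture of the non-differentiability noted earlier in this section.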
We will show that strong duality applies to the dual pair of programs above, so that the optimal solutions to the two programs have identical value. Next, we will observe that for a given value of $\lambda$, it is simple to compute $\max_{\pi_i} \big( R_i(\pi_i) - \lambda T_i(\pi_i) \big)$ via the solution of a dynamic program over the state space of arm $i$ (a fast procedure). Finally it is simple to derive useful a-priori lower and upper
