Satisficing in Gaussian bandit problems


Paul Reverdy and Naomi E. Leonard

This research has been supported in part by ONR grants N and N and ARO grant W911NG. P. Reverdy is supported through an NDSEG Fellowship. The authors are with the Department of Mechanical and Aerospace Engineering, Princeton University, Princeton, NJ 08544, USA. {preverdy,naomi}@princeton.edu

Abstract: We propose a satisficing objective for the multi-armed bandit problem, i.e., where the objective is to achieve performance above a given threshold. We show that this new problem is equivalent to a standard multi-armed bandit problem with a maximizing objective and use this equivalence to find bounds on performance in terms of the satisficing objective. For the special case of Gaussian rewards we show that the satisficing problem is equivalent to a related standard multi-armed bandit problem again with Gaussian rewards. We apply the Upper Credible Limit (UCL) algorithm to this standard problem and show how it achieves optimal performance in terms of the satisficing objective.

I. INTRODUCTION

Engineering solutions to decision-making problems are often designed to maximize an objective function. However, in many contexts maximization of an objective function is an unreasonable goal, either because the objective itself is poorly defined or because solving the resulting optimization problem is intractable or costly. In these contexts, it is valuable to consider alternative decision-making frameworks.

Herbert Simon considered [16] alternative models of rational decision making with the goal of making them "compatible with the access to information and the computational capacities that are actually possessed by organisms, including man, in the kinds of environments in which such organisms exist." A major feature of the models he considered is what he called satisficing. In [16], Simon discussed in very broad terms a variety of simplifications to the classical economic concept of rationality, most importantly the idea that payoffs should be simple, defined by doing well relative to some threshold value. In [17], he introduced the word satisficing to refer to this thresholding concept and considered an ecological example of food foraging behavior in detail using mathematical terms. He also briefly discussed how satisficing relates to problems in inventory control and more complicated decision processes like playing chess. Since Simon's pioneering work, satisficing has been studied in many fields such as psychology [15], economics [3], management science [10], [21], and ecology [20], [4].

In engineering, satisficing is of interest for the same reasons that motivated its introduction in the social science literature, specifically that it can simplify decision-making problems. Furthermore, many engineering problems are naturally posed using a satisficing objective, for example design problems that have to meet given specifications. A design that meets all the required specifications is acceptable, and the designers may be indifferent between any such design. In this context, optimization may be poorly defined, for example if there are several competing performance measures that trade off in complicated ways. Satisficing can be a simpler decision paradigm than maximizing, which requires additional information about preferences among possible tradeoffs.

Satisficing has been studied in the engineering literature in several contexts. In [11], Nakayama studied design optimization using a satisficing objective and found that it is effective in many practical fields. In [6], the authors studied control theory using a satisficing objective function, and in [22], the authors used satisficing to study optimal software design.

Satisficing can be implemented in a variety of ways. In this paper, we consider the stochastic multi-armed bandit problem [14], where a decision maker sequentially chooses one of a set of alternative options, or arms, and earns a reward drawn from a stationary probability distribution associated with that arm. The standard multi-armed bandit problem uses a maximizing objective, for which there is a known performance bound. We propose a satisficing objective for the multi-armed bandit problem based on the number of times the decision maker receives a reward that is above a threshold value and show that the multi-armed bandit problem with this objective is equivalent to a related standard multi-armed bandit problem. We use the equivalent problem to derive a performance bound for the new satisficing problem.

For Gaussian bandit problems, i.e., where the reward distributions are Gaussian with unknown mean and known variance, we show that solving the problem with the satisficing objective is equivalent to solving a standard Gaussian multi-armed bandit problem. We then apply the UCL algorithm we developed in previous work [13] to the standard problem, and show how this algorithm achieves optimal performance in terms of the original satisficing objective.

The remainder of the paper is structured as follows. In Section II we review the standard stochastic multi-armed bandit problem and the associated performance bounds. In Section III we propose the satisficing objective and bound performance in terms of this objective by defining a notion of satisficing regret. In Section IV we specialize to the case of Gaussian rewards and show that solving the satisficing problem is equivalent to solving a standard problem with Gaussian rewards. In Section V we review the UCL algorithm and show how applying it to the problem with Gaussian rewards achieves optimal performance in terms of the satisficing objective. Section VI shows the results of numerical simulations and Section VII concludes.

II. THE STOCHASTIC MULTI-ARMED BANDIT PROBLEM

The stochastic multi-armed bandit problem is a decision-making problem in which the decision maker sequentially chooses one among a set of $N$ options, called arms in analogy with the lever of a slot machine. A single-levered slot machine is sometimes called a one-armed bandit, so the case of $N$ options is often called an $N$-armed bandit. The decision-making agent collects reward $r_t \in \mathbb{R}$ by choosing arm $i_t$ at each time $t \in \{1, \ldots, T\}$, where $T \in \mathbb{N}$ is the horizon length for the sequential decision process. The reward from option $i \in \{1, \ldots, N\}$ is sampled from a stationary probability distribution $p_i$ and has an unknown mean $m_i \in \mathbb{R}$. The decision maker's objective is to maximize some function of the sequence of rewards $\{r_t\}$.

A. Maximization objective

In the standard multi-armed bandit problem, the agent's objective is to maximize the expected cumulative reward

$$J = E\left[\sum_{t=1}^{T} r_t\right] = \sum_{t=1}^{T} m_{i_t}. \tag{1}$$

Equivalently, by defining $m_{i^*} = \max_i m_i$ and $R_t = m_{i^*} - m_{i_t}$ as the expected regret at time $t$, the objective (1) can be formulated as minimizing the cumulative expected regret defined by

$$\sum_{t=1}^{T} R_t = T m_{i^*} - \sum_{i=1}^{N} m_i E\left[n_i^T\right] = \sum_{i=1}^{N} \Delta_i E\left[n_i^T\right], \tag{2}$$

where $n_i^T$ is the number of times arm $i$ has been chosen up to time $T$, $\Delta_i = m_{i^*} - m_i$ is the expected regret due to picking arm $i$ instead of arm $i^*$, and the expectation is over the possible rewards and decisions made by the agent.

The interpretation of (2) is that suboptimal arms should be chosen as rarely as possible. This is a non-trivial task since the mean rewards $m_i$ are initially unknown to the decision maker, who must try all arms to learn about their rewards while preferentially picking arms that appear more rewarding. The tension between these requirements is known as the explore-exploit tradeoff and is common to many problems in machine learning and adaptive control.

B. Bound on optimal performance

Optimal performance in a bandit problem with the maximization objective (1) corresponds to picking suboptimal arms as rarely as possible, as shown by the last equality in (2). Lai and Robbins [9] studied the standard stochastic multi-armed bandit problem and showed that any policy solving the problem must pick each suboptimal arm $i$ a number of times that is at least logarithmic in the time horizon $T$, i.e.,

$$E\left[n_i^T\right] \ge \left(\frac{1}{D(p_i \| p_{i^*})} + o(1)\right) \log T, \tag{3}$$

where $o(1) \to 0$ as $T \to +\infty$.
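As a small illustration of the regret decomposition (2), the following Python sketch computes cumulative expected regret from per-arm pull counts in two ways and checks that they agree. The function name `cumulative_expected_regret` and the three-armed example values are hypothetical, chosen only for illustration, and the counts are treated as if they were the expected counts $E[n_i^T]$.

```python
def cumulative_expected_regret(means, pull_counts):
    """Right-hand side of (2): sum_i Delta_i * n_i, with Delta_i = m_{i*} - m_i."""
    best = max(means)  # m_{i*}
    return sum((best - m) * n for m, n in zip(means, pull_counts))


# Hypothetical 3-armed example (values are assumptions, not from the paper).
means = [0.2, 0.5, 1.0]      # the last arm is optimal
pull_counts = [10, 20, 70]   # n_i^T for a horizon T = 100

regret = cumulative_expected_regret(means, pull_counts)

# Middle expression of (2): T * m_{i*} - sum_i m_i * n_i
horizon = sum(pull_counts)
regret_alt = horizon * max(means) - sum(m * n for m, n in zip(means, pull_counts))
```

Only pulls of suboptimal arms contribute to the sum, which is why the bounds in this section are stated in terms of $E[n_i^T]$ for suboptimal arms.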
In the bound (3), the quantity $D(p_i \| p_{i^*}) := \int p_i(r) \log \frac{p_i(r)}{p_{i^*}(r)} \, dr$ is the Kullback-Leibler divergence between the reward density $p_i$ of a suboptimal arm and the reward density $p_{i^*}$ of the optimal arm. The bound on $E[n_i^T]$ implies that the cumulative expected regret must grow at least logarithmically in time.

The bound (3) is asymptotic in time, but a number of researchers (e.g., [2], [5], [13]) have constructed algorithms that achieve cumulative expected regret that is bounded by a logarithmic term uniformly in time, sometimes with the same constant as in (3). Cumulative expected regret that is uniformly bounded in time by a logarithmic term is often called logarithmic regret for short. In the literature, algorithms that achieve logarithmic regret with a leading term that is within a constant factor of that in (3) are considered to have optimal performance.

C. Gaussian rewards

In this paper we focus on the case of Gaussian reward distributions, that is, the distribution $p_i$ of rewards associated with arm $i$ is Gaussian with unknown mean $m_i$ and known variance $\sigma_{s,i}^2$. In this case, the Kullback-Leibler divergence in (3) takes the value

$$D(p_i \| p_{i^*}) = \frac{1}{2}\left(\frac{\Delta_i^2}{\sigma_{s,i^*}^2} + \frac{\sigma_{s,i}^2}{\sigma_{s,i^*}^2} - 1 - \log\frac{\sigma_{s,i}^2}{\sigma_{s,i^*}^2}\right). \tag{4}$$

This equation is more easily interpreted when the reward variances are uniform, i.e., $\sigma_{s,i} = \sigma_s$ for each $i$. In this case, the divergence becomes $D(p_i \| p_{i^*}) = \Delta_i^2 / (2\sigma_s^2)$, so the bound (3) is

$$E\left[n_i^T\right] \ge \left(\frac{2\sigma_s^2}{\Delta_i^2} + o(1)\right) \log T. \tag{5}$$

This result can be interpreted as follows. For a given value of $\Delta_i$, a larger variance $\sigma_s^2$ makes the rewards more variable and therefore it is more difficult to distinguish between the arms. For a given value of $\sigma_s$, a larger value of $\Delta_i$ makes it easier to distinguish the optimal arm.

III. THE MULTI-ARMED BANDIT PROBLEM WITH SATISFICING OBJECTIVE

The standard multi-armed bandit problem is defined with the maximizing objective (1). We now propose a new satisficing objective for the multi-armed bandit problem and find bounds on optimal performance in terms of this new objective.

Consider an $N$-armed bandit problem. As before, the reward associated with each arm $i$ is drawn from a stationary probability distribution $p_i$, whose mean $m_i$ is unknown to the decision maker. At time $t \in \{1, \ldots, T\}$, the decision maker selects arm $i_t$ and receives a stochastic reward $r_t \in \mathbb{R}$.
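The decision maker has a certain satisfaction level $M \in \mathbb{R}$, and is satisfied at time $t$ only if the reward $r_t$ is at least $M$. Let $s_t$ be the random variable denoting the decision maker's satisfaction at time $t$:

$$s_t = \begin{cases} 0, & r_t < M \\ 1, & r_t \ge M. \end{cases}$$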

Then $s_t$ is a Bernoulli random variable with success probability $\pi_{i_t}$, where

$$\pi_i = \Pr\left[s_t = 1 \mid i_t = i\right] = \Pr\left[r_t \ge M \mid i_t = i\right] \tag{6}$$

is the probability of satisfaction upon picking arm $i$. We propose a satisficing objective in terms of the number of times the satisfaction level is met.

Definition 1 (Satisficing objective). The satisficing objective is to maximize the function

$$E\left[\sum_{t=1}^{T} s_t\right] = \sum_{t=1}^{T} \pi_{i_t}. \tag{7}$$

The satisficing objective differs from the maximization objective (1) in several important ways. First, it exhibits thresholding, that is, it is indifferent among rewards $r_t$ above the threshold value $M$. Second, it exhibits risk aversion, that is, it prefers smaller, consistent rewards (that will often be above the threshold) to larger, more variable ones (that may often be below). Risk aversion is a characteristic often studied in economics and psychology [12], and is often incorporated in models of human decision making.

Since the satisficing objective consists of maximizing the number of times the agent is satisfied, it can be rewritten as follows. Let $\pi_{i^*} = \max_i \pi_i$ and define $\Delta_i^S = \pi_{i^*} - \pi_i$ as the expected satisficing regret of selecting arm $i$. We can rewrite (7) in terms of expected satisficing regret as

$$J_S = E\left[\sum_{t=1}^{T} \Delta_{i_t}^S\right] = \sum_{i=1}^{N} \Delta_i^S E\left[n_i^T\right], \tag{8}$$

where $n_i^T$ is the number of times arm $i$ has been chosen up to time $T$. This is a standard multi-armed bandit problem with Bernoulli rewards. Therefore the Lai-Robbins bound (3) holds, yielding a logarithmic lower bound on $E[n_i^T]$ and cumulative expected satisficing regret:

Corollary 1 (Satisficing regret bound). Any policy solving the multi-armed bandit problem with the satisficing objective (8) obeys

$$E\left[n_i^T\right] \ge \left(\frac{1}{D(\pi_i \| \pi_{i^*})} + o(1)\right) \log T, \tag{9}$$

for suboptimal arms $i$, where $D(\pi_i \| \pi_{i^*}) = \pi_i \log\frac{\pi_i}{\pi_{i^*}} + (1 - \pi_i) \log\frac{1 - \pi_i}{1 - \pi_{i^*}}$ is the Kullback-Leibler divergence between the two Bernoulli distributions with success probabilities $\pi_i$ and $\pi_{i^*}$.

Proof: Apply the Lai-Robbins bound (3) to the standard multi-armed bandit problem with Bernoulli rewards.

We refer to cumulative expected satisficing regret that is uniformly bounded above in time by a logarithmic term as logarithmic satisficing regret. An algorithm that achieves logarithmic satisficing regret achieves optimal satisficing performance, i.e., optimal performance in terms of the satisficing objective (7).
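To make the bound (9) concrete, the following Python sketch evaluates the Bernoulli Kullback-Leibler divergence and the leading term of the bound, ignoring the $o(1)$ term. The function names and the example probabilities are hypothetical and serve only as an illustration.

```python
import math


def bernoulli_kl(p, q):
    """KL divergence D(p || q) between Bernoulli(p) and Bernoulli(q), 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def satisficing_lower_bound(pi_i, pi_star, horizon):
    """Leading term of (9): log(T) / D(pi_i || pi_star), with the o(1) term dropped."""
    return math.log(horizon) / bernoulli_kl(pi_i, pi_star)


# Hypothetical satisfaction probabilities (assumptions, not from the paper).
pi_i, pi_star = 0.5, 0.8
pulls = satisficing_lower_bound(pi_i, pi_star, horizon=1000)
```

The closer $\pi_i$ is to $\pi_{i^*}$, the smaller the divergence and the more often any policy must sample arm $i$ before it can be ruled out.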
The implication of writing the satisficing objective as the minimizing of cumulative regret is that if one can use the rewards $r_t$ to estimate the satisfaction probability $\pi_i$, one can use algorithms designed to solve the multi-armed bandit problem with a maximizing objective to solve the satisficing problem. In the next sections we study the Gaussian multi-armed bandit problem with a satisficing objective and show how to link rewards and probabilities in this case.

IV. SATISFICING WITH GAUSSIAN REWARDS

In this section we study a Gaussian multi-armed bandit problem with the satisficing objective (8). By Gaussian multi-armed bandit problem, we mean that the reward $r_t$ due to selecting arm $i_t$ is $r_t \sim \mathcal{N}(m_{i_t}, \sigma_{s,i_t}^2)$, where $\sigma_{s,i}^2$ is the known variance of arm $i$. Define the quantity

$$x_i = \frac{m_i - M}{\sigma_{s,i}} \tag{10}$$

for each arm $i$. The following lemma states that the Gaussian multi-armed bandit problem with a satisficing objective is equivalent to a standard Gaussian multi-armed bandit problem with transformed reward distributions.

Lemma 2 (Equivalence for Gaussian rewards). The Gaussian multi-armed bandit problem with satisficing objective is equivalent to a standard Gaussian multi-armed bandit problem with rewards $\tilde{r}_t \sim \mathcal{N}(x_{i_t}, 1)$ in the sense that the ordering of the arms in terms of $x_i$ is identical to the ordering in terms of $\pi_i$. In particular, the arm with maximal $x_i$ is the arm with maximal $\pi_i$.

Proof: With Gaussian rewards, the probability (6) of satisfaction from choosing arm $i$ is

$$\pi_i = \Pr\left[m_i + \sigma_{s,i} z \ge M\right] = \Phi\left(\frac{m_i - M}{\sigma_{s,i}}\right) = \Phi(x_i),$$

where $z \sim \mathcal{N}(0, 1)$ is a standard normal random variable and $\Phi(z)$ is its cumulative distribution function. Let $i^* = \arg\max_i \pi_i$. The key insight is that $\Phi(\cdot)$ is a monotonically increasing function, which implies that the ordering of arms in terms of $\pi_i$ is identical to the ordering in terms of $x_i$. In particular, arm $i^*$ is the arm with maximal $x_i$.

Therefore, the goal of an agent playing the satisficing bandit problem is to find the arm that maximizes $x_i$. This is again a Gaussian bandit problem: consider the transformed reward $\tilde{r}_t = (r_t - M)/\sigma_{s,i_t}$, which is a Gaussian random variable $\tilde{r}_t \sim \mathcal{N}(x_{i_t}, 1)$. The quantity $x_i$ plays the role of the mean reward $m_i$ from the original maximizing problem and the transformed rewards have uniform variance $\sigma_s^2 = 1$. Solving this problem with a maximizing objective is equivalent to solving the original problem with the satisficing objective.
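The equivalence in Lemma 2 can be checked numerically. The sketch below computes $x_i$ from (10) and $\pi_i = \Phi(x_i)$ for a two-armed problem and confirms that both quantities rank the arms identically, while the ranking by mean reward differs. The function names and the parameter values are hypothetical, chosen for illustration only.

```python
from statistics import NormalDist


def satisficing_scores(means, std_devs, threshold):
    """x_i = (m_i - M) / sigma_{s,i}, as in (10)."""
    return [(m - threshold) / s for m, s in zip(means, std_devs)]


def satisfaction_probabilities(means, std_devs, threshold):
    """pi_i = Phi(x_i): probability that a reward from arm i is at least M."""
    phi = NormalDist().cdf  # standard normal cumulative distribution function
    return [phi(x) for x in satisficing_scores(means, std_devs, threshold)]


# Hypothetical two-armed problem (values are assumptions, not from the paper):
# arm 0 has the higher mean but is much noisier than arm 1.
means = [3.0, 2.0]
std_devs = [10.0, 1.0]
M = 1.0

x = satisficing_scores(means, std_devs, M)
pi = satisfaction_probabilities(means, std_devs, M)

best_by_mean = max(range(len(means)), key=lambda i: means[i])
best_by_x = max(range(len(x)), key=lambda i: x[i])
best_by_pi = max(range(len(pi)), key=lambda i: pi[i])
```

In this example the maximizing objective prefers arm 0, whereas the satisficing objective prefers the more consistent arm 1, which illustrates the risk aversion discussed in Section III.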
Remark 3 (Location-scale families). The above analysis is easily generalized to reward distributions belonging to location-scale families. A location-scale family is a set of probability distributions closed under affine transformations, i.e., if the random variable $X$ is in the family, so is the variable $Y = a + bX$, where $a, b \in \mathbb{R}$. Any random variable $X$ in such a family with mean $\mu$ and standard deviation $\sigma$ can be written as $X = \mu + \sigma Z$, where $Z$ is a zero-mean, unit-variance member of the family. Examples include the uniform or Student's t-distribution.

V. THE UCL ALGORITHM FOR GAUSSIAN BANDIT PROBLEMS

In this section we review the UCL algorithm, a Bayesian algorithm that we developed and analyzed in [13] to solve the standard Gaussian bandit problem. We then show that the UCL algorithm can be applied to the Gaussian satisficing problem of Section IV, achieving optimal performance.

The algorithm maintains a belief about the mean rewards $m$ by starting with a prior and updating it using Bayesian inference as new rewards are received. At each time $t$ the algorithm chooses arm $i_t$ using a heuristic which is a simple function of the current belief state. For uninformative priors, the UCL algorithm achieves logarithmic regret, i.e., optimal performance. Uninformative priors correspond to having no information about the mean rewards. A major aspect of the UCL algorithm is its ability to incorporate information about the mean rewards through the use of a so-called informative prior. In [13], we show that an appropriately chosen prior can significantly increase the performance of the UCL algorithm. Several different UCL algorithms are developed in [13]; here we cover only the deterministic UCL algorithm, which we refer to as the UCL algorithm for brevity.

A. Prior

The prior distribution captures the agent's knowledge about the vector of mean rewards $m$ before beginning the task. We assume that the prior distribution is multivariate Gaussian with mean $\mu_0 \in \mathbb{R}^N$ and covariance $\Sigma_0 \in \mathbb{R}^{N \times N}$:

$$m \sim \mathcal{N}(\mu_0, \Sigma_0). \tag{11}$$

The $i$-th element of $\mu_0$, denoted by $\mu_i^0$, represents the agent's mean belief of the reward $m_i$ associated with arm $i$. The $(i, i)$ element of $\Sigma_0$, denoted by $(\sigma_i^0)^2$, represents the agent's uncertainty associated with that belief.
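Off-diagonal elements of $\Sigma_0$, e.g., $\sigma_{ij}^0$, represent the agent's perceived relationship between $m_i$ and $m_j$: if $\sigma_{ij}^0$ is positive, high values of $m_i$ are correlated with high values of $m_j$, while if it is negative, high values of $m_i$ correlate with low values of $m_j$. Any positive-definite matrix can be used as $\Sigma_0$, but several specific ones are of interest. An uninformative prior corresponds to a complete lack of certainty, i.e., $(\sigma_i^0)^2 \to +\infty$, so one sets each element $\sigma_{ij}^0$ equal to $+\infty$.

B. Inference update

At each time $t$ the agent picks an arm $i_t$ and receives a reward $r_t$ that is Gaussian distributed: $r_t \sim \mathcal{N}(m_{i_t}, \sigma_{s,i_t}^2)$. Bayesian inference provides an optimal solution to the problem of updating the belief state $(\mu_t, \Sigma_t)$ to incorporate this new information. Given the Gaussian prior (11), the Bayesian update equations are linear [8]:

$$\begin{aligned} q &= \frac{r_t \phi_t}{\sigma_{s,i_t}^2} + \Lambda_{t-1} \mu_{t-1}, \\ \Lambda_t &= \frac{\phi_t \phi_t^T}{\sigma_{s,i_t}^2} + \Lambda_{t-1}, \quad \Sigma_t = \Lambda_t^{-1}, \\ \mu_t &= \Sigma_t q, \end{aligned} \tag{12}$$

where $\phi_t \in \mathbb{R}^N$ is the indicator vector of the chosen arm (entry $i_t$ equals 1 and all other entries equal 0) and $\Lambda_t = \Sigma_t^{-1}$ is the precision matrix of the belief.

C. Decision heuristic

At each time $t$ the UCL algorithm computes a value $Q_i^t$ for each arm $i$. The UCL algorithm picks the arm that maximizes $Q_i^t$. That is, it picks

$$i_t = \arg\max_i Q_i^t. \tag{13}$$

The heuristic value $Q_i^t$ is

$$Q_i^t = \mu_i^t + \sigma_i^t \Phi^{-1}(1 - \alpha_t), \tag{14}$$

where $\mu_i^t = (\mu_t)_i$, $(\sigma_i^t)^2 = (\Sigma_t)_{ii}$, $\alpha_t = 1/(Kt)$, and $K > 0$ is a tunable parameter. The heuristic $Q_i^t$ is a Bayesian upper limit for the value of $m_i$ based on the information available at time $t$. It represents an optimistic assessment of the value of $m_i$. The decision made can be thought of as the most optimistic one consistent with the current information.

D. Performance

In [13], we study the case of homogeneous sampling noise (i.e., $\sigma_{s,i} = \sigma_s$ for each $i$) and show that the UCL algorithm achieves logarithmic cumulative expected regret uniformly in time. In particular, we prove that the following theorem holds for any $\beta \ge 1.02$.

Theorem 4 (Regret of the deterministic UCL algorithm [13]). The following statements hold for the Gaussian multi-armed bandit problem and the deterministic UCL algorithm with uncorrelated uninformative prior and $K = \sqrt{2\pi e}$:

1) the expected number of times a suboptimal arm $i$ is chosen until time $T$ satisfies

$$E\left[n_i^T\right] \le \left(\frac{8\beta^2\sigma_s^2}{\Delta_i^2} + \frac{2}{\sqrt{2\pi e}}\right) \log T + \frac{4\beta^2\sigma_s^2}{\Delta_i^2}\left(1 - \log 2 - \log\log T\right) + 1 + \frac{2}{\sqrt{2\pi e}};$$

2) the cumulative expected regret until time $T$ satisfies

$$\sum_{t=1}^{T} R_t \le \sum_{i=1}^{N} \Delta_i \left(\left(\frac{8\beta^2\sigma_s^2}{\Delta_i^2} + \frac{2}{\sqrt{2\pi e}}\right) \log T + \frac{4\beta^2\sigma_s^2}{\Delta_i^2}\left(1 - \log 2 - \log\log T\right) + 1 + \frac{2}{\sqrt{2\pi e}}\right).$$

The implication of this theorem can be seen by comparing statement 1 with the Lai-Robbins bound (5): it shows that the UCL algorithm achieves logarithmic regret uniformly in time with a constant that differs from the optimal asymptotic one by a constant factor of $4\beta^2$, and therefore is considered to have optimal performance.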
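The decision rule (13)-(14) can be sketched in a few lines of Python for the special case covered by Theorem 4: an uncorrelated uninformative prior and homogeneous sampling noise. In that case this sketch assumes the belief reduces to the sample mean of each arm with variance $\sigma_s^2 / n_i$, where $n_i$ is the number of pulls so far. The function names, the default $K = \sqrt{2\pi e}$, and the simulated arm means are assumptions of this illustration, not the authors' implementation.

```python
import math
import random
from statistics import NormalDist


def ucl_choose(sums, counts, sigma_s, t, K=math.sqrt(2 * math.pi * math.e)):
    """Return the arm maximizing Q_i^t = mu_i^t + sigma_i^t * Phi^{-1}(1 - 1/(K t))."""
    # With an uninformative prior an unsampled arm has infinite posterior
    # variance, hence an infinite Q value, so it is tried first.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    z = NormalDist().inv_cdf(1.0 - 1.0 / (K * t))
    q_values = [s / n + (sigma_s / math.sqrt(n)) * z for s, n in zip(sums, counts)]
    return max(range(len(q_values)), key=q_values.__getitem__)


def run_ucl(means, sigma_s, horizon, seed=0):
    """Simulate the sketch on Gaussian arms and return the per-arm pull counts."""
    rng = random.Random(seed)
    sums = [0.0] * len(means)    # running sum of rewards per arm
    counts = [0] * len(means)    # n_i: number of pulls per arm
    for t in range(1, horizon + 1):
        arm = ucl_choose(sums, counts, sigma_s, t)
        reward = rng.gauss(means[arm], sigma_s)
        sums[arm] += reward
        counts[arm] += 1
    return counts


# Hypothetical 3-armed problem (values are assumptions, not from the paper).
counts = run_ucl(means=[0.0, 1.0, 3.0], sigma_s=1.0, horizon=500)
```

A correlated or informative prior would require the full matrix update (12) instead of the per-arm sample means used here.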

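E. Application to satisficing objective

In Section IV, we showed that solving the Gaussian multi-armed bandit problem with a satisficing objective is equivalent to a transformed standard Gaussian multi-armed bandit problem with maximizing objective. Therefore, we can apply the UCL algorithm to the satisficing problem. A prior belief $m \sim \mathcal{N}(\mu_0, \Sigma_0)$ is transformed into prior beliefs on $x$ by $x \sim \mathcal{N}(\tilde{\mu}_0, \tilde{\Sigma}_0)$, where $(\tilde{\mu}_0)_i = ((\mu_0)_i - M)/\sigma_{s,i}$ and $(\tilde{\Sigma}_0)_{ij} = (\Sigma_0)_{ij}/(\sigma_{s,i}\sigma_{s,j})$. Define $x_{i^*} = \max_i x_i$ and $\Delta_i^x = x_{i^*} - x_i$. We refer to the UCL algorithm using the transformed reward $\tilde{r}_t$ and prior as the satisficing UCL algorithm. The satisficing UCL algorithm achieves logarithmic satisficing regret, as formalized in the following theorem.

Theorem 5 (Regret of the satisficing UCL algorithm). The following statements hold for the Gaussian multi-armed bandit problem with a satisficing objective and the satisficing UCL algorithm with uncorrelated uninformative prior and $K = \sqrt{2\pi e}$:

1) the expected number of times a suboptimal arm $i$ is chosen until time $T$ satisfies

$$E\left[n_i^T\right] \le \left(\frac{8\beta^2}{(\Delta_i^x)^2} + \frac{2}{\sqrt{2\pi e}}\right) \log T + \frac{4\beta^2}{(\Delta_i^x)^2}\left(1 - \log 2 - \log\log T\right) + 1 + \frac{2}{\sqrt{2\pi e}};$$

2) the cumulative expected satisficing regret until time $T$ satisfies

$$\sum_{t=1}^{T} R_t^S \le \sum_{i=1}^{N} \Delta_i^S \left(\left(\frac{8\beta^2}{(\Delta_i^x)^2} + \frac{2}{\sqrt{2\pi e}}\right) \log T + \frac{4\beta^2}{(\Delta_i^x)^2}\left(1 - \log 2 - \log\log T\right) + 1 + \frac{2}{\sqrt{2\pi e}}\right), \tag{15}$$

where $R_t^S = \pi_{i^*} - \pi_{i_t}$ is the expected satisficing regret at time $t$ and $\Delta_i^S = \pi_{i^*} - \pi_i$.

Proof: Apply Theorem 4 to the Gaussian multi-armed bandit problem with mean rewards $x_i$ and reward distributions $\tilde{r}_t \sim \mathcal{N}(x_{i_t}, 1)$ defined in Lemma 2.

The satisficing regret is upper bounded by a logarithmic function of $T$. Therefore, the satisficing UCL algorithm achieves optimal satisficing regret up to a constant factor.

VI. NUMERICAL EXAMPLE

In this section, we present the results of two numerical simulations of the satisficing UCL algorithm solving a multi-armed bandit problem with Gaussian rewards and the satisficing objective. The first simulation demonstrates the performance guarantees and allows us to compare the optimal regret bound (9) and the bound (15) obeyed by the satisficing UCL algorithm. The second simulation demonstrates the risk-averse nature of the satisficing objective.

For the simulations presented in Figure 1, we set $N = 4$. The satisfaction level $M$ was set equal to 2, the mean rewards $m_i$ were equal to [1 2 3 4] and the standard deviations equal to [ ], so $x$ = [ ] and $i^* = 3$ was the optimal arm. The algorithm used an uninformative prior.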
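The transformation in Section V-E that turns the UCL algorithm into the satisficing UCL algorithm can be sketched as follows: rewards are shifted by the satisfaction level and scaled by the arm's standard deviation, and the prior mean and covariance are rescaled in the same way. The function names and the numerical values below are hypothetical and are not the parameters of the simulations reported here.

```python
def transform_reward(reward, arm, std_devs, threshold):
    """Transformed reward (r_t - M) / sigma_{s,i_t} for the arm that produced it."""
    return (reward - threshold) / std_devs[arm]


def transform_prior(mu0, sigma0, std_devs, threshold):
    """Map a prior N(mu0, sigma0) on the means m to the induced prior on x."""
    n = len(mu0)
    mu_tilde = [(mu0[i] - threshold) / std_devs[i] for i in range(n)]
    sigma_tilde = [
        [sigma0[i][j] / (std_devs[i] * std_devs[j]) for j in range(n)]
        for i in range(n)
    ]
    return mu_tilde, sigma_tilde


# Hypothetical two-armed prior (values are assumptions for illustration).
mu0 = [1.0, 3.0]
sigma0 = [[4.0, 1.0],
          [1.0, 9.0]]
std_devs = [2.0, 0.5]
M = 2.0

mu_tilde, sigma_tilde = transform_prior(mu0, sigma0, std_devs, M)
r_tilde = transform_reward(3.0, arm=1, std_devs=std_devs, threshold=M)
```

Feeding `r_tilde` and the transformed prior to a standard Gaussian UCL implementation with unit sampling variance is the setting to which Theorem 5 applies.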
These values were chosen such that the arm with maximal mean reward was not the optimal arm, so satisficing induces different behavior than maximizing. Figure 1 plots the mean cumulative satisficing regret incurred by the satisficing UCL algorithm over 100 simulations along with the two regret bounds (9) and (15). The mean regret obeys the performance bound (15) from Theorem 5 and is actually below the asymptotic lower bound (9) at initial times. This apparent violation of the bound is due to the fact that at initial times the system is not yet in the asymptotic regime where the bound applies.

For the simulations presented in Figure 2, we set $N = 2$. The mean rewards $m_i$ were equal to [ ] and the standard deviations equal to [10 1], so $x$ = [ ]. This meant $i = 1$ was the optimal arm for the maximizing objective while $i = 2$ was the optimal arm for the satisficing objective. The algorithm used an uninformative prior. The problem was simulated 100 times with each objective.

Figure 2 demonstrates the risk aversion inherent in the satisficing objective by comparing the results of the same problem solved with the satisficing and the maximizing objectives. The satisfaction level $M$ was set equal to 1. We considered cumulative surplus (rewards in excess of the satisfaction level) for both objectives. Negative values of the surplus represent deficits, which are to be avoided. Results from the maximizing objective are presented in black. The solid line shows mean cumulative surplus and the shaded region shows the 95% confidence interval around that mean. Results from the satisficing objective are presented in blue. The solid line shows the mean cumulative surplus, and the dashed lines show the 95% confidence interval. The lower limit of the confidence intervals measures worst-case performance. The measure for the satisficing objective is consistently above the one for the maximizing objective, so satisficing results in better worst-case performance.

VII. CONCLUSION

Satisficing, the concept of doing well relative to a reference value, is a useful alternative to maximizing that can be applied to a variety of decision-making scenarios. Considering satisficing objectives instead of maximizing ones can simplify decision-making problems and can result in policies that are more robust in the sense that they are risk-averse.
In this paper, we considered the multi-armed bandit problem using a satisficing objective by proposing a new notion of satisficing regret. We showed that there is an equivalence between minimizing satisficing regret and minimizing the standard notion of regret. Using this equivalence, we derived a logarithmic lower bound on satisficing regret and, in the case of Gaussian rewards, adapted the UCL algorithm [13] to achieve optimal satisficing performance.

This work opens the door to many future extensions. The satisficing objective with Gaussian rewards bears a strong resemblance to the CreditMetrics two-state credit risk model used in quantitative finance [7]. This could allow the credit investment portfolio problem studied in finance to be posed as a multi-armed bandit problem with satisficing objective.

The risk-averse nature of satisficing objectives such as the one proposed in this paper will result in more robust policies for solving the multi-armed bandit problem in cases where the reward variance $\sigma_s^2$ is heterogeneous across arms. Risk aversion and robustness are important for engineering applications (where standard bandit algorithms are known to have poor risk-aversion characteristics [1]) but also in the field of optimal foraging theory [4]. The multi-armed bandit framework has been used to study foraging [18] using a maximizing objective, but a satisficing objective is more ecologically plausible.

We developed a policy for the satisficing problem with Gaussian rewards, but development of optimal policies for the satisficing problem with other reward distributions remains an open problem. For all satisficing problems, picking the appropriate satisfaction level is a non-trivial problem in its own right, analogous to picking the error rates in the Sequential Probability Ratio Test [19].

Fig. 1. Regret incurred by the satisficing UCL algorithm while solving a satisficing Gaussian multi-armed bandit problem, along with two theoretical bounds, plotted against time on a logarithmic scale. The solid black line shows mean cumulative expected regret from 100 simulations. The dashed line shows the asymptotic bound on regret (9), which appears as a straight line due to the scaling of the axes. The dash-dotted line shows the regret bound (15), which provides guarantees on the algorithm's performance.

Fig. 2. Cumulative surplus earned by the UCL algorithm while solving a Gaussian multi-armed bandit problem, once with a satisficing objective (blue curves) and again with a maximizing objective (black curve and shaded region). Both objectives achieve similar mean performance (solid curves) but using the satisficing objective results in better worst-case performance. The shaded region (maximizing) and the blue dashed lines (satisficing) show the 95% confidence interval around the mean cumulative surplus. The lower limit of the confidence intervals measures worst-case performance. The lower limit for the satisficing objective is consistently above the one for the maximizing objective, so satisficing results in better worst-case performance.

ACKNOWLEDGEMENT

We thank Vaibhav Srivastava and Simon A. Levin for helpful discussions.

REFERENCES

[1] J.-Y. Audibert, R. Munos, and C. Szepesvári. Exploration-exploitation tradeoff using variance estimates in multi-armed bandits. Theoretical Comput. Sci., 410(19), 2009.
[2] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Mach. Learning, 47(2):235-256, 2002.
[3] R. Bordley and M. LiCalzi. Decision analysis using targets instead of utility functions. Decisions in Economics and Finance, 23(1):53-74, 2000.
[4] Y. Carmel and Y. Ben-Haim. Info-gap robust-satisficing model of foraging behavior: Do foragers optimize or satisfice? The Amer. Naturalist, 166(5), 2005.
[5] A. Garivier and O. Cappé. The KL-UCB algorithm for bounded stochastic bandits and beyond. In JMLR: Workshop and Conference Proceedings, volume 19: COLT 2011, 2011.
[6] M. Goodrich, W. Stirling, and R. Frost. A theory of satisficing decisions and control. IEEE Trans. Syst., Man and Cybern. A: Syst. Humans, 28(6), Nov. 1998.
[7] M. B. Gordy. A comparative anatomy of credit risk models. J. of Banking & Finance, 24(1), 2000.
[8] S. M. Kay. Fundamentals of Statistical Signal Processing, Volume I: Estimation Theory. Prentice Hall, 1993.
[9] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Appl. Math., 6(1):4-22, 1985.
[10] T. M. Moe. The new economics of organization. Amer. J. of Political Sci., 28(4), 1984.
[11] H. Nakayama and Y. Sawaragi. Satisficing trade-off method for multiobjective programming. In Interactive Decision Analysis. Springer, 1984.
[12] J. W. Pratt. Risk aversion in the small and in the large. Econometrica: J. of the Econometric Soc., pages 122-136, 1964.
[13] P. Reverdy, V. Srivastava, and N. E. Leonard. Modeling human decision-making in generalized Gaussian multi-armed bandits. Proc. IEEE, 102(4), 2014.
[14] H. Robbins. Some aspects of the sequential design of experiments. Bulletin of the Amer. Math. Soc., 58:527-535, 1952.
[15] B. Schwartz, A. Ward, J. Monterosso, S. Lyubomirsky, K. White, and D. R. Lehman. Maximizing versus satisficing: happiness is a matter of choice. J. Personality and Social Psychology, 83(5):1178, 2002.
[16] H. A. Simon. A behavioral model of rational choice. The Quarterly J. of Econ., 69(1):99-118, 1955.
[17] H. A. Simon. Rational choice and the structure of the environment. Psychological Review, 63(2):129, 1956.
[18] V. Srivastava, P. Reverdy, and N. E. Leonard. On optimal foraging and multi-armed bandits. In Proc. of the 51st Annu. Allerton Conf. on Commun., Control, and Computing, 2013.
[19] A. Wald. Sequential tests of statistical hypotheses. Annals of Mathematical Statistics, 16(2), 1945.
[20] D. Ward. The role of satisficing in foraging theory. Oikos, 1992.
[21] S. G. Winter. The satisficing principle in capability learning. Strategic Management Journal, 21(10-11), 2000.
[22] B. Yin et al. Finding optimal solution for satisficing non-functional requirements via 0-1 programming. In Proc. 2013 IEEE 37th Annu. Computer Software and Applications Conf., 2013.
