3788 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 62, NO. 8, AUGUST 2017

Satisficing in Multi-Armed Bandit Problems

Paul Reverdy, Member, IEEE, Vaibhav Srivastava, and Naomi Ehrich Leonard, Fellow, IEEE

Abstract: Satisficing is a relaxation of maximizing and allows for less risky decision making in the face of uncertainty. We propose two sets of satisficing objectives for the multi-armed bandit problem, where the objective is to achieve reward-based decision-making performance above a given threshold. We show that these new problems are equivalent to various standard multi-armed bandit problems with maximizing objectives and use the equivalence to find bounds on performance. The different objectives can result in qualitatively different behavior; for example, agents explore their options continually in one case and only a finite number of times in another. For the case of Gaussian rewards we show an additional equivalence between the two sets of satisficing objectives that allows algorithms developed for one set to be applied to the other. We then develop variants of the Upper Credible Limit (UCL) algorithm that solve the problems with satisficing objectives and show that these modified UCL algorithms achieve efficient satisficing performance.

Index Terms: Multi-armed bandit, upper credible limit (UCL).

I. INTRODUCTION

Engineering solutions to decision-making problems are often designed to maximize an objective function. However, in many contexts maximization of an objective function is an unreasonable goal, either because the objective itself is poorly defined or because solving the resulting optimization problem is intractable or costly. In these contexts, it is valuable to consider alternative decision-making frameworks. Herbert Simon considered alternative models of rational decision-making [30] with the goal of making them "compatible with the access to information and the computational capacities that are actually possessed by organisms, including man, in the kinds of environments in which such organisms exist." A major feature of the models he considered is what he called satisficing. In [30], he discussed in very broad terms a variety of simplifications to the classical economic concept of rationality, most importantly the idea that payoffs should be simple, defined by doing well relative to some threshold value. In [31], he introduced the word satisficing, a combination of the words satisfy and suffice, to refer to this thresholding concept and illustrated it using a mathematical model of foraging. He also briefly discussed how satisficing relates to problems in inventory control and more complicated decision processes like playing chess.

Since Simon's pioneering work, satisficing has been studied in many fields such as psychology [29], economics [6], management science [23], [37], and ecology [36], [8]. In engineering, satisficing is of interest for the same reasons that motivated its introduction in the social science literature, specifically that it can simplify decision-making problems: as compared to maximizing, it allows for less risky decision making in the face of uncertainty. Furthermore, many engineering problems are naturally posed using a satisficing objective, such as choosing a design that meets given specifications, but where the designers may be indifferent among any such designs. Satisficing is well defined even if there are several competing performance measures that trade off in complicated ways, whereas maximizing may be poorly defined without additional information about preferences.

Satisficing has been studied in the engineering literature in several contexts. In [25], the authors studied design optimization using a satisficing objective and found that it is effective in many practical fields. In [14], the authors studied control theory using a satisficing objective function, and in [38], the authors used satisficing to study optimal software design. In [10], the authors used a multi-armed bandit algorithm to construct robots that actively adapt their control policies to mitigate damage, such as actuator failures. In order to speed the convergence of their algorithm, they only sought to identify control policies with performance above a set threshold, rather than to identify an optimal policy. The theory that we develop in this paper formalizes their notion of thresholding and provides bounds on performance.

In this paper, we consider satisficing in the stochastic multi-armed bandit problem [28], for which a decision maker sequentially chooses one of a set of alternative options, called arms, and earns a reward drawn from a stationary probability distribution associated with that arm. The standard multi-armed bandit problem uses a maximizing objective on accumulated reward. For this objective there is a known performance bound in terms of expected regret, which is the expected difference between the reward received by the decision maker and the maximum reward possible.

Manuscript received July 20, 2016; accepted November 27. Date of publication December 22, 2016; date of current version July 26. This work was supported in part by ONR grant N and ARO grant W911NF. P. Reverdy was supported through an NDSEG Fellowship. Recommended by Associate Editor Q.-S. Jia.

P. Reverdy is with the Department of Electrical and Systems Engineering, University of Pennsylvania, Philadelphia, PA USA (e-mail: preverdy@seas.upenn.edu).

V. Srivastava is with the Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI USA (e-mail: vaibhav@egr.msu.edu).

N. E. Leonard is with the Department of Mechanical and Aerospace Engineering, Princeton University, Princeton, NJ USA (e-mail: naomi@princeton.edu).

Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.

Digital Object Identifier /TAC

IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http:// standards/publications/rights/index.html for more information.

Since the standard notion of regret is defined relative to the unknown optimum, it can only be computed by an omniscient agent; this notion of regret is not computable by a decision maker faced with a multi-armed bandit problem. Nevertheless, it is a useful theoretical concept, which facilitates the analysis of algorithms designed to solve bandit problems. We extend the notion of regret to satisficing objectives and use it to analyze new algorithms.

In contrast to the standard stochastic multi-armed bandit problem, in which the agent seeks to determine, with certainty, the option with maximum mean reward, the satisficing multi-armed bandit problem seeks to determine, with a desired confidence, a satisfying option. We characterize satisficing multi-armed bandit problems using three separate features of the satisficing objective. The first feature selects the quantity on which the satisficing objective is defined. We consider two such quantities: the unknown mean reward of the selected option, and the instantaneous observed reward. The second feature retains the satisfaction aspect of the satisficing problem. In particular, it selects if the objective function should be optimizing, or if it should be satisfying. The third feature retains the sufficing aspect of the satisficing problem. In particular, it selects if the decision-making algorithm should be certain that the optimizing/satisfying criterion is met, or if it is sufficient for the algorithm to meet a desired threshold in confidence about the criterion.

Different combinations of the above three features of satisficing lead to eight satisficing objectives that we discuss in this paper. We begin by defining the four objectives for the case where the satisficing quantity is the unknown mean reward. We show that the bandit problem with each of these four objectives is equivalent to a previously studied bandit problem and use the equivalence to derive a performance bound for the satisficing problems. These four objectives seek an arm with satisfyingly high mean reward without regard to that reward's dispersion. To develop objectives with improved robustness properties, we then consider the case where the satisficing quantity is the instantaneous observed reward. We extend the first four objectives to this case by adding an additional layer of thresholding, which defines four more objectives. When the reward distributions belong to location-scale families, there is an equivalence between the objectives defined in terms of mean reward and the robust objectives defined in terms of instantaneous reward, which we prove for Gaussian rewards.

For simplicity of exposition, we then specialize to Gaussian multi-armed bandit problems, where the reward distributions are Gaussian with unknown mean and known variance. For such problems, we develop several modifications of the UCL algorithm that we developed in previous work [27]. These algorithms solve the problem with the satisficing-in-mean-reward objectives and thus also with the robust objectives. We show that these algorithms achieve efficient performance. These results extend our previous work [26] by incorporating the concept of sufficiency into the satisficing objective, as well as by adding several new algorithms and their associated analysis.

The assumption of Gaussian rewards with known variance is not required, but allows us to focus on the different notions of regret, which is the main contribution of this paper. We later show how the known variance assumption can be relaxed. Our methods also extend immediately to many other important classes of reward distributions, including distributions with bounded support and sub-Gaussian distributions. We show how to extend our methods in these cases and provide references to the relevant literature for other extensions.

The remainder of the paper is structured as follows. In Section II we review the standard stochastic multi-armed bandit problem and the associated performance bounds. In Section III we propose the satisficing objectives and bound performance in terms of these objectives. In Section IV we specialize to the case of Gaussian rewards and show the equivalence between the satisficing mean reward objectives and the satisficing instantaneous observed reward objectives. In Section V we review the UCL algorithm, and in Section VI we design modified versions of the UCL algorithm for the satisficing objectives. We show that these modified algorithms achieve efficient performance for Gaussian rewards. We show the results of numerical simulations in Section VII, and in Section VIII we conclude.

II. THE STOCHASTIC MULTI-ARMED BANDIT PROBLEM

In the stochastic multi-armed bandit problem a decision-making agent sequentially chooses one among a set of $N$ options called arms, in analogy with the lever of a slot machine. A single-levered slot machine is called a one-armed bandit, so the case of $N \ge 2$ options is called a multi-armed bandit.

The decision-making agent collects reward $r_t \in \mathbb{R}$ by choosing arm $i_t$ at each time $t \in \{1, \ldots, T\}$, where $T \in \mathbb{N}$ is the horizon length for the sequential decision process. The reward from option $i \in \{1, \ldots, N\}$ is sampled from a stationary probability distribution $\nu_i$ which has an unknown mean $m_i \in \mathbb{R}$. The decision-maker's objective is to maximize some function of the sequence of rewards $\{r_t\}$ by sequentially picking arms $i_t$ using only the information available at time $t$.

A. Maximization Objective

In the standard multi-armed bandit problem, the agent's objective is to maximize the expected cumulative reward

\[ J = \mathbb{E}\left[ \sum_{t=1}^{T} r_t \right] = \sum_{t=1}^{T} m_{i_t}. \tag{1} \]

Equivalently, by defining $m_{i^*} = \max_i m_i$ and $R_t = m_{i^*} - m_{i_t}$, the expected regret at time $t$, maximizing (1) can be formulated as minimizing the cumulative expected regret defined by

\[ \sum_{t=1}^{T} R_t = T m_{i^*} - \sum_{i=1}^{N} m_i \, \mathbb{E}\left[ n_i^T \right] = \sum_{i=1}^{N} \Delta_i \, \mathbb{E}\left[ n_i^T \right], \tag{2} \]

where $n_i^T$ is the number of times arm $i$ has been chosen up to time $T$, $\Delta_i = m_{i^*} - m_i$ is the expected regret due to picking arm $i$ instead of arm $i^*$, and the expectation is over the possible rewards and decisions made by the agent.
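The two forms of the cumulative expected regret in (2) can be checked numerically. The following is a minimal sketch, not part of the original paper: the function name, the three mean rewards, and the expected pull counts are hypothetical values chosen only for illustration.

```python
from typing import Sequence


def cumulative_expected_regret(means: Sequence[float],
                               expected_pulls: Sequence[float]) -> float:
    """Evaluate sum_i Delta_i * E[n_i^T], with Delta_i = m_{i*} - m_i, as in (2)."""
    best_mean = max(means)
    return sum((best_mean - m) * n for m, n in zip(means, expected_pulls))


# Hypothetical three-armed problem and hypothetical expected pull counts (T = 100).
means = [1.0, 0.5, 0.2]
expected_pulls = [80.0, 15.0, 5.0]

regret = cumulative_expected_regret(means, expected_pulls)

# The same quantity written as T * m_{i*} - sum_i m_i * E[n_i^T].
horizon = sum(expected_pulls)
regret_alt = horizon * max(means) - sum(m * n for m, n in zip(means, expected_pulls))
```

Only pulls of suboptimal arms contribute to the sum, which is why the analysis focuses on how often those arms are chosen.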

The interpretation of (2) is that suboptimal arms should be chosen as rarely as possible. This is a non-trivial task since the mean rewards $m_i$ are initially unknown to the decision-maker, who must try arms to learn about their rewards while preferentially picking arms that appear more rewarding. The tension between these requirements is known as the explore-exploit tradeoff and is common to many problems in machine learning and adaptive control.

B. Bound on Optimal Performance

Optimal performance in a bandit problem corresponds to picking suboptimal arms as rarely as possible, as shown by (2). Lai and Robbins [20] studied the standard stochastic multi-armed bandit problem and showed that any policy solving the problem must pick each suboptimal arm a number of times that is at least logarithmic in the time horizon $T$, i.e.,

\[ \mathbb{E}\left[ n_i^T \right] \ge \left( \frac{1}{D(\nu_i \| \nu_{i^*})} + o(1) \right) \log T, \tag{3} \]

where $o(1) \to 0$ as $T \to +\infty$. The quantity $D(\nu_i \| \nu_{i^*}) := \int \nu_i(r) \log \frac{\nu_i(r)}{\nu_{i^*}(r)} \, dr$ is the Kullback-Leibler divergence between the reward density $\nu_i$ of any suboptimal arm and the reward density $\nu_{i^*}$ of the optimal arm. Equation (3) implies that cumulative expected regret must grow at least logarithmically in time.

The bound (3) is asymptotic in time, but researchers (e.g., [4], [13], [27]) have developed algorithms that achieve cumulative expected regret that is bounded by a logarithmic term uniformly in time, sometimes with the same constant as in (3). Cumulative expected regret that is uniformly bounded in time by a logarithmic term is often called logarithmic regret for short. In the literature, algorithms that achieve logarithmic regret with a leading term that is within a constant factor of that in (3) are considered to have optimal performance.

C. Multiple Plays

Anantharam et al. [2] studied a generalization of the multi-armed bandit problem in which the agent picks $k \ge 1$ arms at each time $t$, which they called the multi-armed bandit problem with multiple plays. The case $k = 1$ corresponds to the standard multi-armed bandit problem defined above. In the spirit of [2], let $\sigma$ be a permutation of $\{1, \ldots, N\}$ such that $m_{\sigma(1)} \ge m_{\sigma(2)} \ge \cdots \ge m_{\sigma(N)}$. For the multi-armed bandit problem with $k$ plays, the optimal policy with full information corresponds to picking the arms $\sigma(1), \ldots, \sigma(k)$, called the $k$-best arms [2]. In the case $k = 1$, $\sigma(1) = i^*$, the optimal arm defined above.

For the case of general $k \ge 1$, the cumulative expected regret for the multi-armed bandit problem with multiple plays is defined as follows [2]:

\[ T \sum_{i=1}^{k} m_{\sigma(i)} - \sum_{i=1}^{N} m_i \, \mathbb{E}\left[ n_i^T \right], \tag{4} \]

which is a straightforward generalization of the regret (2). The suboptimal arms $\sigma(k+1), \ldots, \sigma(N)$ are called the $k$-worst arms [2]. Define $\Delta_i^k = m_{\sigma(k)} - m_i$ for each $k$-worst arm $i$. The quantity $\Delta_i^k$ is the generalization of the expected regret $\Delta_i$ for the problem with multiple plays, where the expected value of the optimal policy is that of the $k$ best arms.

As in the case of a single play, optimal performance corresponds to picking suboptimal (i.e., $k$-worst) arms as rarely as possible. By [2], each $k$-worst arm $i$ must be picked a number of times that is at least logarithmic in the time horizon $T$, i.e.,

\[ \mathbb{E}\left[ n_i^T \right] \ge \left( \frac{1}{D(\nu_i \| \nu_{\sigma(k)})} + o(1) \right) \log T. \tag{5} \]

This bound can be interpreted as a generalization of the Lai-Robbins bound (3) where the Kullback-Leibler divergence is taken with respect to the $k$th best arm $\sigma(k)$ rather than the first best arm $\sigma(1)$, i.e., the case $k = 1$.

D. PAC Bounds

In the standard multi-armed bandit problem and the multi-armed bandit problem with multiple plays, regret is defined in terms of the unknown mean reward values $m_i$. These regret definitions imply that avoiding regret requires identifying optimal arms with certainty. The requirement to identify optimal arms with certainty is characteristic of a maximizing decision-making strategy. In contrast, a satisficing decision-making agent should seek arms that are good enough. In this context, satisficing corresponds to finding arms that are optimal with high probability rather than with certainty. The Probably Approximately Correct (PAC) model for learning introduced by Valiant [34] provides a natural way to capture this aspect of satisficing.

Even-Dar et al. [11], [12] and Mannor and Tsitsiklis [22] studied the multi-armed bandit problem using the PAC model and defined an $\epsilon$-optimal arm as one for which $m_i > m_{i^*} - \epsilon$, i.e., the mean reward is within $\epsilon$ of the optimum value. Equivalently, an $\epsilon$-optimal arm is an arm for which the expected regret is at most $\epsilon$. Under the PAC model one wishes to find an $\epsilon$-optimal arm with probability of at least $1 - \delta$. With probability one, this can be achieved in a finite number of samples, so performance guarantees take the form of bounds on the number of samples required, which is referred to as sample complexity. In our notation, we denote sample complexity by $T$, as it is the value of the horizon length at which sampling terminates. When the rewards are Bernoulli distributed with unknown success probabilities $p_i$, the following lower bound holds [22]:

\[ \mathbb{E}[T] \ge O\left( \frac{1}{\epsilon^2} \log(1/\delta) \right). \tag{6} \]

A similar result was reported in [11] for $T$, rather than its expected value. In other words, one must sample an arm at least $\log(1/\delta)/\epsilon^2$ times to be able to declare that it is $\epsilon$-optimal with probability at least $1 - \delta$.

Similar to the work of [2] in extending Lai and Robbins' bounds [20] to the case of multiple plays, Kalyanakrishnan et al. [15] extended the work of [12] from finding the $\epsilon$-optimal arm to finding the $m$ $\epsilon$-best arms with probability at least $1 - \delta$. In [15] this problem is called Explore-$m$, and an algorithm that solves it is called $(\epsilon, m, \delta)$-optimal. Note that the problem in [12] is the special case Explore-1. The Explore-$m$ problem is studied in [15] for rewards that are Bernoulli distributed. It is proved that, for every $(\epsilon, m, \delta)$-optimal algorithm, there exists a bandit problem on which that algorithm has worst-case sample complexity of at least $\log(m/8\delta)$. Specifically, it is shown that there exists a bandit problem such that the number of samples $T$ required to identify $m$ $\epsilon$-best arms obeys

\[ T \ge \frac{1}{18375} \frac{N}{\epsilon^2} \log \frac{m}{8\delta}. \tag{7} \]

This gives a worst-case bound on the number of times all arms need to be sampled to achieve $(\epsilon, m, \delta)$-optimality.

The bounds (6) and (7) were both formulated for the case of Bernoulli rewards, but it is straightforward to extend them to the case where the rewards are Gaussian distributed with unknown mean and known variance.

E. Gaussian Rewards

In this paper we focus on the case of Gaussian reward distributions, where the distribution $\nu_i$ of rewards associated with arm $i$ is Gaussian with mean $m_i$, which is unknown to the decision maker, and variance $\sigma_{s,i}^2$, which is known to the decision maker from, e.g., previous observations or known measurement characteristics. Relaxation of the assumption of known variance is discussed in Remark 12. For the given case, the Kullback-Leibler divergence in (3) takes the value

\[ D(\nu_i \| \nu_{i^*}) = \frac{\Delta_i^2 + \sigma_{s,i}^2}{2 \sigma_{s,i^*}^2} - \frac{1}{2} \left( 1 + \log \frac{\sigma_{s,i}^2}{\sigma_{s,i^*}^2} \right). \tag{8} \]

This equation is more easily interpreted when the reward variances are uniform, i.e., $\sigma_{s,i}^2 = \sigma_s^2$ for each $i$. In some cases we assume uniform variance for simplicity of exposition, but the relevant results are readily generalized to the case of non-uniform variance. Assuming uniform variance, $D(\nu_i \| \nu_{i^*}) = \Delta_i^2 / 2\sigma_s^2$, so the bound (3) is

\[ \mathbb{E}\left[ n_i^T \right] \ge \left( \frac{2 \sigma_s^2}{\Delta_i^2} + o(1) \right) \log T. \tag{9} \]

This result can be interpreted as follows. For a given value of $\Delta_i$, a larger variance $\sigma_s^2$ makes the rewards more variable and therefore it is more difficult to distinguish between the arms. For a given value of $\sigma_s^2$, a larger value of $\Delta_i$ makes it easier to distinguish arm $i$ from the optimal arm. The expressions for the problem with multiple plays, i.e., (5), are identical except for substituting $\sigma(k)$ for $i^*$ and $\Delta_i^k$ for $\Delta_i$.

III. THE MULTI-ARMED BANDIT PROBLEM WITH SATISFICING OBJECTIVES

We now define the multi-armed bandit problem with satisficing objectives. We propose several new satisficing notions of regret and find associated bounds on optimal performance. These notions capture two dimensions of the satisficing problem: satisfaction, i.e., the agent's desire to obtain a reward that is above a certain threshold, and sufficiency, i.e., the agent's desire to attain a level of confidence that its choice of a given arm will bring them satisfaction. We define these notions first for satisficing in mean reward and then extend them to satisficing in instantaneous reward, which we refer to as robust satisficing.

A. Satisficing in Mean Reward

We define satisfaction in mean reward as having an expected reward $m_{i_t}$ that is at or above a specified threshold value $M$. Formally, we represent satisfaction in mean reward at time $t$ by the variable $s_t$, defined as

\[ s_t = \mathbf{1}\left( m_{i_t} \ge M \right), \tag{10} \]

where $\mathbf{1}(\cdot)$ is the indicator function, equal to one if the argument is true and zero otherwise. The threshold $M$ is a free parameter that must be specified by the decision-making agent.

Let $m_{i^*} = \max_i m_i$ be the maximum expected reward from any arm. The agent can never be satisfied if $M$ is greater than $m_{i^*}$, so we assume that $M \le m_{i^*}$ to make the problem feasible. If $M > m_{\sigma(2)}$, i.e., greater than the mean reward of the second-best arm, then arm $\sigma(1) = i^*$ is the only one that is satisfying in mean reward.

As in the multi-armed bandit problem with multiple plays, let $\sigma$ be a permutation of $\{1, \ldots, N\}$ such that $m_{\sigma(1)} \ge m_{\sigma(2)} \ge \cdots \ge m_{\sigma(N)}$. Let $k$ be the largest integer such that $m_{\sigma(k)} \ge M$. The arms $\{\sigma(1), \ldots, \sigma(k)\}$ are the $k$-best arms defined by the satisfaction threshold $M$. For each arm $i$, define the thresholded expected regret $\Delta_i^M = \max\{M - m_i, 0\}$. For each $k$-best arm, the thresholded regret is zero, and for each $k$-worst arm $i \in \{\sigma(k+1), \ldots, \sigma(N)\}$, the value $\Delta_i^M > 0$ quantifies the extent to which the arm is unsatisfying in mean reward. Note that if $M = m_{i^*}$, $\Delta_i^M = \Delta_i$, which is the standard measure of expected regret. We refer to the $k$-best and $k$-worst arms as satisfying and non-satisfying arms, respectively.

The satisfaction variable $s_t$ defined in (10) can be written as a function of the sign of $\Delta_{i_t}^M$: $s_t = \mathbf{1}\left( \Delta_{i_t}^M = 0 \right)$. The quantity $s_t$ is deterministic. However, since the agent does not know the value of $\Delta_i^M$ associated with any given arm, they must learn by sampling rewards from the various arms and updating their beliefs accordingly. Adopting a Bayesian framework, we assume $s_t$ is a realization of a binary random variable $S_t$. Due to the stochastic nature of the rewards the agent will have less than perfect confidence in their beliefs about the value of $s_t$.
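The satisfying arms and the thresholded expected regret are simple functions of the mean rewards and the threshold $M$. The sketch below is illustrative only: the function names and the example means are hypothetical, and it assumes an omniscient view of the true means, which the agent in the paper does not have.

```python
from typing import List, Sequence


def thresholded_regret(means: Sequence[float], threshold: float) -> List[float]:
    """Thresholded expected regret Delta_i^M = max(M - m_i, 0) for every arm."""
    return [max(threshold - m, 0.0) for m in means]


def satisfying_arms(means: Sequence[float], threshold: float) -> List[int]:
    """Indices of arms whose mean reward meets the satisfaction threshold M."""
    return [i for i, m in enumerate(means) if m >= threshold]


# Hypothetical mean rewards, known here only for the sake of the example.
means = [1.0, 0.5, 0.2]

gaps_low = thresholded_regret(means, threshold=0.4)    # two satisfying arms
k_low = len(satisfying_arms(means, threshold=0.4))

# With M equal to the best mean, the thresholded regret is the standard regret.
gaps_max = thresholded_regret(means, threshold=max(means))
standard_gaps = [max(means) - m for m in means]
```

Lowering the threshold enlarges the set of satisfying arms and shrinks every thresholded gap, which is the sense in which satisficing relaxes maximizing.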
We distinguish satisficing objectives in mean reward according to the degree $\delta \in [0, 1]$ of confidence the agent seeks in their beliefs, which we call sufficiency in mean reward. We define an arm to be $\delta$-sufficing in mean reward if $\Pr[S_t = 1] \ge 1 - \delta$, where the probability is evaluated based on the agent's current beliefs. For non-zero values of $\delta$, the agent finds it sufficient to have finite confidence that they are satisfied, while for $\delta = 0$, the agent wants certainty that they are satisfied. The agent cannot achieve certainty in finite time, so these two cases result in qualitatively different behavior: $\delta = 0$ means the agent will never stop exploring, while $\delta > 0$ means the agent will settle on a set of acceptable options after finite time.
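As an illustration of the $\delta$-sufficing test, suppose the agent's current belief about the mean reward of an arm is Gaussian with a given mean and standard deviation. This Gaussian belief is an assumption made for the sketch, and the function names and numbers are hypothetical. Under that assumption, $\Pr[S_t = 1]$ is the belief mass at or above the threshold $M$.

```python
from statistics import NormalDist


def satisfaction_probability(belief_mean: float, belief_std: float,
                             threshold: float) -> float:
    """Pr[m_i >= M] under an assumed Gaussian belief about the mean reward m_i."""
    return 1.0 - NormalDist(belief_mean, belief_std).cdf(threshold)


def is_delta_sufficing(belief_mean: float, belief_std: float,
                       threshold: float, delta: float) -> bool:
    """Check the delta-sufficing condition Pr[S_t = 1] >= 1 - delta."""
    return satisfaction_probability(belief_mean, belief_std, threshold) >= 1.0 - delta


# Hypothetical belief: mean 1.0, standard deviation 0.5, threshold M = 0.5.
prob = satisfaction_probability(1.0, 0.5, 0.5)

loose = is_delta_sufficing(1.0, 0.5, 0.5, delta=0.2)    # modest confidence required
strict = is_delta_sufficing(1.0, 0.5, 0.5, delta=0.0)   # certainty required
```

With any nonzero belief variance the probability stays strictly below one, so the $\delta = 0$ test never passes, which mirrors the statement that an agent seeking certainty never stops exploring.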

TABLE I
TABLE OF THE FOUR DIFFERENT REGRET CONCEPTS, AND RESULTING PROBLEMS, ASSOCIATED WITH THE SATISFICING-IN-MEAN-REWARD MULTI-ARMED BANDIT PROBLEM

Threshold level         | Seek certainty ($\delta = 0$)     | Suffice ($\delta > 0$)
$M > m_{\sigma(2)}$     | 1) Standard bandit                | 3) $\delta$-sufficing
$M \le m_{\sigma(2)}$   | 2) Satisfaction-in-mean-reward    | 4) $(M, \delta)$-satisficing

The satisficing-in-mean-reward objective is

\[ \sum_{t=1}^{T} \mathbf{1}\left( s_t = 1 \ \text{or} \ \Pr[S_t = 1] \ge 1 - \delta \right). \tag{11} \]

The objective (11) is maximized if, at each time $t$, a satisfying option is selected, or the probability that the option is satisfying is sufficiently high. The event that an option is satisfying is not known a priori and must be learned by exploration. This results in an explore-exploit tradeoff as in the standard multi-armed bandit problem. To quantify the optimal explore-exploit tradeoff in the spirit of the Lai-Robbins bound we introduce the following notion of the expected satisficing regret at time $t$, $R_t$, defined by

\[ R_t = \Delta_{i_t}^M \, \mathbf{1}\left( \Pr[S_t = 1] < 1 - \delta \right). \tag{12} \]

If the agent is insufficiently certain of being satisfied by the choice of $i_t$, they incur expected regret of $\Delta_{i_t}^M$. Otherwise, they incur no regret. We define the satisficing-in-mean-reward multi-armed bandit problem in terms of minimizing cumulative expected satisficing regret.

Definition 1 (Satisficing-In-Mean-Reward Multi-Armed Bandit Problem): The satisficing-in-mean-reward multi-armed bandit problem is to minimize the cumulative sum of the expected satisficing regret (12):

\[ J_R = \mathbb{E}\left[ \sum_{t=1}^{T} R_t \right]. \tag{13} \]

The satisficing-in-mean-reward bandit problem has two parameters: $M$ and $\delta$. These parameters characterize the agent's thresholds for satisfaction and sufficiency, respectively. For purposes of analysis we distinguish four cases as a function of the parameter values. For the satisfaction threshold $M \in \mathbb{R}$, the first case is setting $M > m_{\sigma(2)}$, while the second case is setting $M \le m_{\sigma(2)}$. For the sufficiency threshold $\delta \in [0, 1]$, the first case is the certainty value $\delta = 0$, while the second case is $\delta \in (0, 1]$.

Table I summarizes the four problems that result from the interaction of the two dimensions of satisfaction and sufficiency. Problem 1 sets the satisfaction threshold $M > m_{\sigma(2)}$ and the sufficiency threshold $\delta = 0$, which results in a standard bandit problem. We call Problem 2, with $M \le m_{\sigma(2)}$ and $\delta = 0$, satisfaction-in-mean-reward. We call Problem 3, with $M > m_{\sigma(2)}$ and $\delta \in (0, 1]$, $\delta$-sufficing. Finally, we call Problem 4, with $M \le m_{\sigma(2)}$ and $\delta \in (0, 1]$, $(M, \delta)$-satisficing.
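The expected satisficing regret (12) and the four-way classification of Table I can be sketched in a few lines. This is a hypothetical illustration rather than code from the paper: the function names are invented, the satisfaction probability is passed in as a number instead of being derived from a belief model, and the classification uses the true means, which only an omniscient observer would know.

```python
from typing import Sequence


def satisficing_regret(mean_reward: float, threshold: float,
                       satisfaction_prob: float, delta: float) -> float:
    """Expected satisficing regret (12): Delta^M if Pr[S_t = 1] < 1 - delta, else 0."""
    gap = max(threshold - mean_reward, 0.0)
    return gap if satisfaction_prob < 1.0 - delta else 0.0


def problem_number(means: Sequence[float], threshold: float, delta: float) -> int:
    """Classify (M, delta) into Problems 1-4 of Table I using the true means."""
    second_best = sorted(means, reverse=True)[1]
    high_threshold = threshold > second_best
    if delta == 0.0:
        return 1 if high_threshold else 2
    return 3 if high_threshold else 4


means = [1.0, 0.5, 0.2]  # hypothetical mean rewards

# Picking the worst arm while unsure of satisfaction incurs the thresholded gap.
r_unsure = satisficing_regret(means[2], threshold=0.4, satisfaction_prob=0.3, delta=0.1)
# If the agent's confidence reaches 1 - delta, (12) counts no regret.
r_confident = satisficing_regret(means[2], threshold=0.4, satisfaction_prob=0.95, delta=0.1)

labels = [problem_number(means, 0.9, 0.0), problem_number(means, 0.4, 0.0),
          problem_number(means, 0.9, 0.1), problem_number(means, 0.4, 0.1)]
```

Summing `satisficing_regret` over the sequence of choices gives one realization of the quantity inside the expectation in (13).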
Remark 1: We note that the distinction between Problems 1 and 2 and between Problems 3 and 4 is only due to the range of values $M$ can take. These problems can be thought of as a single problem in which the choice of $M$ dictates the cardinality of the set of satisfying arms. However, the two ranges of thresholds $M > m_{\sigma(2)}$ and $M \le m_{\sigma(2)}$ allow us to clearly contrast the satisficing problem with the standard problem. Assuming $M > m_{\sigma(2)}$ (Problems 1 and 3) is equivalent to assuming that the agent seeks the unknown highest mean reward, which is consistent with the standard problem. The policies we define for Problems 1 and 3 do not rely on a known threshold $M$. Assuming $M \le m_{\sigma(2)}$ is equivalent to assuming that the agent seeks to meet a known desired mean reward threshold. The policies we define for Problems 2 and 4 do rely on the threshold $M$. These same assumptions analogously distinguish Problems 5 and 7 from Problems 6 and 8 defined in Section III-B. However, unlike the policies for Problems 1 and 3, the policies defined for Problems 5 and 7 do rely on $M > m_{\sigma(2)}$ being known. We do not assume in any of the problems that the agent knows the permutation $\sigma$, so no policies depend on $\sigma$.

We develop performance bounds for each of these problems in terms of corollaries of the performance bounds presented in Section II. For the problems with $\delta = 0$, these bounds show that cumulative expected regret must grow at least at a logarithmic rate, while for the problems with $\delta > 0$, finite regret is possible.

Problem 1 (Standard Bandit): The satisficing-in-mean-reward multi-armed bandit problem with $M > m_{\sigma(2)}$ and $\delta = 0$ is a standard multi-armed bandit problem. Therefore, for this problem, the Lai-Robbins bound (3) holds, and the expected number of times a suboptimal arm $i$ is chosen obeys

\[ \mathbb{E}\left[ n_i^T \right] \ge \left( \frac{1}{D(\nu_i \| \nu_{i^*})} + o(1) \right) \log T. \]

As a direct consequence, the cumulative expected satisficing regret (13) grows at least logarithmically with time horizon $T$:

\[ J_R \ge \sum_{i=1}^{N} \left( \frac{\Delta_i^M}{D(\nu_i \| \nu_{i^*})} + o(1) \right) \log T. \]

Problem 2 (Satisfaction-in-Mean-Reward): The satisfaction-in-mean-reward problem, defined as the satisficing-in-mean-reward multi-armed bandit problem where $M \le m_{\sigma(2)}$ and $\delta = 0$, also has a logarithmic lower bound on the cumulative expected satisficing regret:

Corollary 2 (Satisfaction-In-Mean-Reward Regret Bound): The satisfaction-in-mean-reward problem is a satisficing-in-mean-reward multi-armed bandit problem where the objective (13) is defined with $M \le m_{\sigma(2)}$ and $\delta = 0$. Any policy solving the satisfaction-in-mean-reward problem obeys

\[ \mathbb{E}\left[ n_i^T \right] \ge \left( \frac{1}{D(\nu_i \| \nu_{\sigma(k)})} + o(1) \right) \log T \tag{14} \]

for each non-satisfying arm $i$, where $\sigma$ is a permutation of $\{1, \ldots, N\}$ such that $m_{\sigma(1)} \ge m_{\sigma(2)} \ge \cdots \ge m_{\sigma(N)}$ and $k$ is the largest integer such that $m_{\sigma(k)} \ge M$.

Proof: The definition of satisfaction (10) implies that performance bounds for the satisfaction-in-mean-reward problem and the multi-armed bandit problem with multiple plays are equivalent. Given a problem instance, the threshold $M$ induces the number $k$ of satisfying arms, so performance can be analyzed as in the problem with multiple plays. The bound (5) applies to the problem with multiple plays and the equivalence implies the result.

Problem 3 ($\delta$-Sufficing): The $\delta$-sufficing problem, defined as the satisficing-in-mean-reward multi-armed bandit problem where $M > m_{\sigma(2)}$ and $\delta \in (0, 1]$, admits policies that achieve cumulative expected regret that is a bounded function of $T$:

Corollary 3 ($\delta$-Sufficing Regret Bound): The $\delta$-sufficing problem is a satisficing-in-mean-reward multi-armed bandit problem where the objective (13) is defined with $M > m_{\sigma(2)}$ and $\delta \in (0, 1]$. Any policy solving the $\delta$-sufficing problem obeys

\[ n_i^T \ge O\left( \frac{1}{\epsilon^2} \log(1/\delta) \right) \tag{15} \]

for each suboptimal arm $i$, where $\epsilon = \Delta_i^M = M - m_i$.

Proof: The definition of satisfaction (10) in the $\delta$-sufficing problem implies that the agent incurs regret if the arm selected is not $(\epsilon = 0, \delta)$-optimal. The bound (6) thus provides a lower bound on the number of times the agent must incur regret.

Problem 4 ($(M, \delta)$-Satisficing): The $(M, \delta)$-satisficing problem, defined as the satisficing-in-mean-reward multi-armed bandit problem where $M \le m_{\sigma(2)}$ and $\delta \in (0, 1]$, admits policies that achieve cumulative expected regret that is a bounded function of $T$:

Corollary 4 ($(M, \delta)$-Satisficing Regret Bound): The $(M, \delta)$-satisficing problem is a satisficing-in-mean-reward multi-armed bandit problem where the objective (13) is defined with $M \le m_{\sigma(2)}$ and $\delta \in (0, 1]$. Any policy solving the $(M, \delta)$-satisficing multi-armed bandit problem obeys

\[ \sum_{i=1}^{N} n_i^T = T \ge \frac{1}{18375} \frac{N}{\epsilon^2} \log \frac{k}{8\delta}, \tag{16} \]

where $\sigma$ is a permutation of $\{1, \ldots, N\}$ such that $m_{\sigma(1)} \ge m_{\sigma(2)} \ge \cdots \ge m_{\sigma(N)}$, $k$ is the largest integer such that $m_{\sigma(k)} \ge M$, and $\epsilon = m_{\sigma(k)} - M$. Since only arms $\{\sigma(k+1), \ldots, \sigma(N)\}$ result in regret, the left hand side of (16) is an upper bound on the expected satisficing regret (13).

Proof: The definition of satisfaction (10) in the $(M, \delta)$-satisficing problem implies that an algorithm that minimizes satisficing regret is equivalent to an $(\epsilon = m_{\sigma(k)} - M, k, \delta)$-optimal algorithm in the sense of [15]. Therefore, the bound (7) applies to the $(M, \delta)$-satisficing problem.

Recall that $T$ is the number of times all arms, including the optimal one, should be cumulatively sampled such that following $T$ an $(M, \epsilon)$-optimal decision can be made. The lower bounds on both $n_i^T$ and $T$ are independent of $T$, suggesting that for $(M, \epsilon)$-satisficing, a bounded regret can be achieved.

Corollaries 3 and 4 show that the worst-case regret is a bounded function of $T$ for the sufficing problems, where $\delta > 0$. Therefore we can conclude that the expected regret for such problems can also be a bounded function of $T$. This is an important distinction from the maximizing problems, where $\delta = 0$: in such problems, the Lai-Robbins bound (3) implies that the expected regret must grow logarithmically with $T$. As is standard in the bandit literature, we say an algorithm has efficient performance if its regret matches, up to constant factors, the relevant growth rates: $\log T$ for maximizing problems and $\log(k/\delta)/\epsilon^2$ for sufficing problems.

B. Robust Satisficing in Instantaneous Reward

The four objectives defined in Section III-A above define satisfaction (10) in terms of the mean reward $m_i$ from an arm. This captures situations where the time scale for satisfaction spans numerous decision times. For example, consider foraging, where an animal must consume a minimum amount of food each day. If each decision time represents a small portion of the day, the total food consumed during the day represents the sum of numerous small rewards from each decision time. As long as the mean reward at each decision time is sufficiently high, the animal will meet its daily food requirement.

If, instead, the decision time scale is the same as the satisfaction time scale, it is more appropriate to define satisfaction at time $t$ in terms of the reward $r_t$ received at that time. This requires more robust algorithms, in the sense that they must ensure that each reward, rather than simply the mean reward, is satisfying with high probability.

In this context we define satisfaction in two stages. First, we define happiness as receiving a reward $r_t$ that is at least a threshold value $M \in \mathbb{R}$. We represent happiness at time $t$ as the Bernoulli random variable $h_t$, defined as

\[ h_t = \mathbf{1}\left( r_t \ge M \right). \tag{17} \]

We define the success probability of the happiness random variable $h_t$ as

\[ p_i = \Pr\left[ h_t = 1 \mid i_t = i \right]. \tag{18} \]

The success probability $p_i$ is the expected rate of happiness due to picking arm $i$. This defines a Bernoulli multi-armed bandit problem where the mean reward, i.e., the happiness rate, is $p_i$. We then define satisfaction in terms of a threshold $\Pi$ for this Bernoulli multi-armed bandit problem as we did in (10):

\[ s_t = \mathbf{1}\left( p_{i_t} \ge \Pi \right). \tag{19} \]

Given the happiness threshold $M$, this definition is identical to the definition (10) of satisfaction with $p_i$ in place of $m_i$, $p_{i^*} = \max_i p_i$ in place of $m_{i^*}$, and $\Pi$ in place of $M$. Therefore the four satisficing multi-armed bandit problems defined in Table I can be used to define four additional problems in this context, which we call robust satisficing.
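The two-stage definition can be illustrated by simulation: happiness (17) thresholds each reward, and satisfaction (19) thresholds the resulting happiness rate. In the sketch below the Gaussian reward model, the sample size, the seed, and all function names are assumptions made for the example, and the Monte Carlo estimate stands in for the true rate $p_i$ that appears in (18).

```python
import random
from typing import Callable


def happiness(reward: float, threshold: float) -> int:
    """Happiness indicator (17): 1 if the reward is at least M, else 0."""
    return 1 if reward >= threshold else 0


def estimate_happiness_rate(draw_reward: Callable[[], float],
                            threshold: float, num_samples: int) -> float:
    """Monte Carlo estimate of the success probability p_i in (18)."""
    total = sum(happiness(draw_reward(), threshold) for _ in range(num_samples))
    return total / num_samples


def is_satisfying(happiness_rate: float, rate_threshold: float) -> bool:
    """Satisfaction (19): the happiness rate meets the threshold Pi."""
    return happiness_rate >= rate_threshold


# Hypothetical arm: Gaussian rewards with mean 1.0 and unit standard deviation.
rng = random.Random(0)
p_hat = estimate_happiness_rate(lambda: rng.gauss(1.0, 1.0),
                                threshold=0.5, num_samples=20000)

satisfied_loose = is_satisfying(p_hat, rate_threshold=0.6)
satisfied_strict = is_satisfying(p_hat, rate_threshold=0.8)
```

For this arm the exact happiness rate is the standard normal distribution function evaluated at 0.5, roughly 0.69, so the arm is satisfying for the looser threshold on the rate and not for the stricter one.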
Definition 2 (Robust Satisficing Multi-Armed Bandit Problem): The robust satisficing multi-armed bandit problem is to minimize the cumulative sum of the expected satisficing regret (12):

\[ J_R = \mathbb{E}\left[ \sum_{t=1}^{T} R_t \right], \]

where the regret $R_t$ is defined using the notion of satisfaction defined by (19).

A robust satisficing multi-armed bandit problem has three parameters: $M$, $\Pi$, and $\delta$. We assume that $M$ and $\Pi$ are chosen such that there is at least one satisfying arm; otherwise, the expected regret must grow indefinitely. Table II summarizes the four robust satisficing multi-armed bandit problems that result from the interaction of the two dimensions of satisfaction and sufficiency, which we list below. We assume that $\varsigma$ is a permutation of $\{1, \ldots, N\}$ such that $p_{\varsigma(1)} \ge p_{\varsigma(2)} \ge \cdots \ge p_{\varsigma(N)}$.

TABLE II
TABLE OF THE FOUR DIFFERENT REGRET CONCEPTS, AND RESULTING PROBLEMS, ASSOCIATED WITH THE ROBUST SATISFICING MULTI-ARMED BANDIT PROBLEM

Threshold level            | Seek certainty ($\delta = 0$)  | Suffice ($\delta > 0$)
$\Pi > p_{\varsigma(2)}$   | 5) Robust bandit               | 7) $\delta$-robust sufficing
$\Pi \le p_{\varsigma(2)}$ | 6) Robust satisfaction         | 8) $(\Pi, \delta)$-robust satisficing

The quantity $p_i$ represents the probability of happiness, i.e., receiving a reward of at least $M$, due to choosing arm $i$.

Problem 5 (Robust Bandit): The robust bandit problem is defined as the robust satisficing multi-armed bandit problem where $\Pi > p_{\varsigma(2)}$ and $\delta = 0$.

Problem 6 (Robust Satisfaction): The robust satisfaction problem is defined as the robust satisficing multi-armed bandit problem where $\Pi \le p_{\varsigma(2)}$ and $\delta = 0$.

Problem 7 ($\delta$-Robust Sufficing): The $\delta$-robust sufficing problem is defined as the robust satisficing multi-armed bandit problem where $\Pi > p_{\varsigma(2)}$ and $\delta \in (0, 1]$.

Problem 8 ($(\Pi, \delta)$-Robust Satisficing): The $(\Pi, \delta)$-robust satisficing problem is defined as the robust satisficing multi-armed bandit problem where $\Pi \le p_{\varsigma(2)}$ and $\delta \in (0, 1]$.

For a large class of reward distributions, there is an equivalence between Problems 5-8, defined in terms of $r_t$, and Problems 1-4, defined in terms of $m_i$. By Lemma 5 below, when the rewards $r_t$ follow a Gaussian distribution with unknown mean $m_i$ and known variance $\sigma_{s,i}^2$, each problem in Table II is equivalent to the analogous problem in Table I.

IV. SATISFICING WITH GAUSSIAN REWARDS

In this section we study the Gaussian satisficing multi-armed bandit problem. This is the satisficing multi-armed bandit problem where the reward $r_t$ due to selecting arm $i$ is $r_t \sim \mathcal{N}(m_i, \sigma_{s,i}^2)$ and $\sigma_{s,i}^2$ is the known variance of arm $i$. In this case, we show a formal equivalence between the satisficing-in-mean-reward multi-armed bandit problems and the robust satisficing multi-armed bandit problems. The choice of Gaussian rewards facilitates modeling correlation dependencies among arms, which can be useful in applications.

A. Equivalence Lemma for Gaussian Rewards

For the Gaussian robust satisficing multi-armed bandit problem, define the quantity

\[ x_i = \frac{m_i - M}{\sigma_{s,i}}, \tag{20} \]

which we call the standardized mean reward, for each arm $i$. The following lemma states that each Gaussian robust satisficing multi-armed bandit problem where satisfaction is defined by (19) is equivalent to a Gaussian satisficing-in-mean-reward multi-armed bandit problem where satisfaction is defined by (10) with standardized reward distributions.

Lemma 5 (Equivalence for Gaussian Rewards): Each Gaussian robust satisficing multi-armed bandit problem is equivalent to a Gaussian satisficing-in-mean-reward multi-armed bandit problem with rewards $r_t \sim \mathcal{N}(x_i, 1)$ with $x_i$ given by (20). That is, the ordering of the arms in terms of $x_i$ is identical to the ordering in terms of $p_i$, and, in particular, the arm with maximal $x_i$ is the arm with maximal $p_i$.

Proof: With Gaussian rewards, the probability (18) of happiness due to choosing arm $i$ is

\[ p_i = \Pr\left[ m_i + \sigma_{s,i} z \ge M \right] = \Phi\left( \frac{m_i - M}{\sigma_{s,i}} \right) = \Phi(x_i), \tag{21} \]

where $z \sim \mathcal{N}(0, 1)$ is a standard normal random variable and $\Phi(z)$ is its cumulative distribution function. Let $i^* = \arg\max_i p_i$. The key insight is that $\Phi$ is a monotonically increasing function, which implies that the ordering of arms in terms of $p_i$ is identical to the ordering in terms of $x_i$. In particular, arm $i^*$ is the arm with maximal $x_i$. Therefore, satisfaction in terms of $r_t$ is equivalent to satisfaction in terms of the mean reward $x_i$. This is again a Gaussian bandit problem: consider the standardized reward

\[ \tilde{r}_t = \frac{r_t - M}{\sigma_{s,i_t}}, \tag{22} \]

which is a Gaussian random variable $\tilde{r}_t \sim \mathcal{N}(x_{i_t}, 1)$. The quantity $x_i$ plays the role of the mean reward $m_i$ and the transformed rewards have uniform variance $\sigma_s^2 = 1$. Minimizing the robust satisficing regret in terms of $r_t$ is equivalent to minimizing the satisficing regret in terms of $x_i$.

Lemma 5 has two implications for the relationship between Problems 5-8 and Problems 1-4 when rewards are Gaussian distributed. First, each of Problems 5-8 inherits a regret bound from the corresponding one of Problems 1-4. Second, each of Problems 5-8 can be solved by applying the algorithm developed for the corresponding one of Problems 1-4 after first applying the standardization transformation (22) to the observed rewards.

Remark 6 (Location-Scale Families): Lemma 5 is easily generalized to reward distributions belonging to location-scale families. A location-scale family is a set of probability distributions closed under affine transformations, i.e., if the random variable $X$ is in the family, so is the variable $Y = a + bX$, where $a, b \in \mathbb{R}$. Any random variable $X$ in such a family with mean $\mu$ and standard deviation $\sigma$ can be written as $X = \mu + \sigma Z$, where $Z$ is a zero-mean, unit-variance member of the family. Examples include the uniform distribution and Student's $t$-distribution.

B. Application to the Gaussian Robust Satisficing Problems

In this section we show how to use the equivalence result of Lemma 5 for the full set of robust satisficing problems in the case of Gaussian rewards. Recall from Lemma 5 that the probability of happiness (18) due to picking an arm $i$ is $p_i$. In the proof of the lemma, we show that maximizing the probability of happiness is equivalent to maximizing the mean reward in a Gaussian multi-armed bandit problem with mean rewards $x_i = \Phi^{-1}(p_i)$, where $x_i$ is the standardized mean reward $(m_i - M)/\sigma_{s,i}$.
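The relations (20)-(22) are easy to check numerically. The following sketch uses Python's standard-library `statistics.NormalDist` for $\Phi$ and $\Phi^{-1}$; the function names and the three example arms are hypothetical and chosen so that the arm with the highest mean is not the arm with the highest happiness probability.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # zero mean, unit variance


def standardized_mean(mean: float, threshold: float, sigma: float) -> float:
    """Standardized mean reward x_i = (m_i - M) / sigma_{s,i}, as in (20)."""
    return (mean - threshold) / sigma


def happiness_probability(mean: float, threshold: float, sigma: float) -> float:
    """Happiness probability p_i = Phi(x_i) for Gaussian rewards, as in (21)."""
    return STD_NORMAL.cdf(standardized_mean(mean, threshold, sigma))


def standardize_reward(reward: float, threshold: float, sigma: float) -> float:
    """Standardization (22) applied to one observed reward."""
    return (reward - threshold) / sigma


# Hypothetical arms given as (mean, standard deviation), with threshold M = 0.5.
arms = [(1.0, 2.0), (0.8, 0.2), (0.3, 0.5)]
M = 0.5

x = [standardized_mean(m, M, s) for m, s in arms]
p = [happiness_probability(m, M, s) for m, s in arms]

# Phi is increasing, so both quantities rank the arms identically.
order_by_x = sorted(range(len(arms)), key=lambda i: x[i], reverse=True)
order_by_p = sorted(range(len(arms)), key=lambda i: p[i], reverse=True)

# Phi^{-1}(p_i) recovers the standardized mean reward.
x_recovered = [STD_NORMAL.inv_cdf(pi) for pi in p]
```

In this example the second arm has a lower mean than the first but a much smaller variance, so it has the larger standardized mean and the larger happiness probability, which is the robustness effect that the instantaneous-reward objectives are designed to capture.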

Given an algorithm developed for one of the Problems 1-4 defined in Table I, it can be applied to the corresponding Problem 5-8 defined in Table II as follows. Standardize the observed rewards $r_t$ and run the algorithm using the standardized rewards $\tilde r_t = (r_t - M)/\sigma_{s,i_t}$ as input. For example, Problem 5, the robust multi-armed bandit problem, can be solved by an algorithm designed to solve Problem 1, the standard bandit problem, where rewards are transformed according to (22) before being input to the algorithm. The same procedure allows one to apply algorithms developed for Problem 3, $\delta$-sufficing, to Problem 7, $\delta$-robust sufficing.

For Problem 6, robust satisfaction, and Problem 8, $(\Pi,\delta)$-robust satisficing, we need a threshold $X$ that is analogous to the threshold $M$ defined for Problem 2, satisfaction-in-mean-reward, and Problem 4, $(M,\delta)$-satisficing. We use the relationship between $x_i$ and $p_i$ to derive the threshold. In particular, for a robust satisficing problem with probability of happiness threshold $\Pi$, define the threshold $X$ by

$$X = \Phi^{-1}(\Pi). \tag{23}$$

When the rewards are Gaussian distributed, we can apply algorithms developed for Problems 2 and 4 to the corresponding robust satisficing Problems 6 and 8 by standardizing rewards and using the threshold $X$ defined in (23) in place of the threshold $M$. Lemma 5 implies that the efficient performance guarantees for algorithms designed for Problems 1-4 also hold when they are used to solve the robust satisficing Problems 5-8.

V. THE UCL ALGORITHM FOR GAUSSIAN MULTI-ARMED BANDIT PROBLEMS

In this section we review the UCL algorithm, a Bayesian algorithm we developed and analyzed in [27] to solve the standard Gaussian multi-armed bandit problem. The UCL algorithm was developed by applying the Bayesian upper confidence bound approach of [16] to the case of Gaussian rewards; the choice of Gaussian rewards facilitated the modeling of human decision-making behavior. The UCL algorithm maintains a belief about the mean rewards $m$ by starting with a prior and updating using Bayesian inference as new rewards are received. At each time $t$ the algorithm chooses arm $i_t$ using a heuristic that is a simple function of the current belief state. For uninformative priors, the UCL algorithm achieves logarithmic regret, i.e., optimal performance.
Uninformative priors correspond to having no information about the mean rewards. A major advantage of the UCL algorithm is its ability to incorporate information about the mean rewards through the use of a so-called informative prior. In [27], we showed that an appropriately chosen prior can significantly increase the performance of the UCL algorithm. Several different UCL algorithms were developed in [27], including a stochastic decision rule to model human behavior; here we cover only the deterministic UCL algorithm, which, for brevity, we refer to as the UCL algorithm.

A. Prior

The prior distribution captures the agent's knowledge about the vector of mean rewards $m$ before beginning the task. We assume that the prior distribution is multivariate Gaussian with mean $\mu_0 \in \mathbb{R}^N$ and covariance $\Sigma_0 \in \mathbb{R}^{N \times N}$:

$$m \sim \mathcal{N}(\mu_0, \Sigma_0). \tag{24}$$

The $i$th element of $\mu_0$, denoted by $\mu_i^0$, represents the agent's mean belief of the reward $m_i$ associated with arm $i$. The $(i, i)$ element of $\Sigma_0$, denoted by $(\sigma_i^0)^2$, represents the agent's uncertainty associated with that belief. Off-diagonal elements of $\Sigma_0$, e.g., $\sigma_{ij}^0$, represent the agent's perceived relationship between $m_i$ and $m_j$: if $\sigma_{ij}^0$ is positive, high values of $m_i$ are correlated with high values of $m_j$, while if it is negative, high values of $m_i$ correlate with low values of $m_j$.

Any positive-definite matrix can be used as $\Sigma_0$, but it is often useful to consider a structured parametrization, such as $\Sigma_0 = \sigma_0^2 \Sigma$, where $\sigma_0^2 > 0$ encodes the agent's uncertainty. One important special case is an uncorrelated prior, where $\Sigma$ is diagonal, which corresponds to the agent perceiving the rewards associated with different arms to be independent. Another important special case is an uninformative prior, which corresponds to complete uncertainty, i.e., the limit $\sigma_0^2 \to +\infty$; an uninformative prior can be thought of as a special case of an uncorrelated prior.

B. Inference Update

At each time $t$ the agent picks an arm $i_t$ and receives a reward $r_t$ that is Gaussian distributed: $r_t \sim \mathcal{N}(m_{i_t}, \sigma_{s,i_t}^2)$. Bayesian inference provides an optimal solution to the problem of updating the belief state $(\mu_t, \Sigma_t)$ (i.e., the sufficient statistics for estimating $m$) to incorporate this new information. Let $\Lambda_t = \Sigma_t^{-1}$, and let $\phi_t \in \mathbb{R}^N$ be the vector with element $i_t$ equal to 1 and all other elements equal to zero. Then, given the Gaussian prior (24), the Bayesian update equations are linear [17]:

$$q_t = \frac{r_t \phi_t}{\sigma_{s,i_t}^2} + \Lambda_{t-1}\mu_{t-1}, \qquad \Lambda_t = \frac{\phi_t \phi_t^T}{\sigma_{s,i_t}^2} + \Lambda_{t-1}, \qquad \mu_t = \Sigma_t q_t. \tag{25}$$

C. Decision Heuristic

At each time $t$ the UCL algorithm computes a value $Q_i^t$ for each arm $i$. The algorithm then picks the arm that maximizes $Q_i^t$. That is, it picks

$$i_t = \arg\max_i Q_i^t. \tag{26}$$

The heuristic value $Q_i^t$ is

$$Q_i^t = \mu_i^t + \sigma_i^t \Phi^{-1}(1 - \alpha_t), \tag{27}$$

where $\mu_i^t = (\mu_t)_i$, $(\sigma_i^t)^2 = (\Sigma_t)_{ii}$, $\alpha_t = 1/(Kt)$, $K > 0$ is a tunable parameter, and $\Phi^{-1}$ is the quantile function of the standard normal random variable. The heuristic $Q_i^t$ is a Bayesian upper limit for the value of $m_i$ based on the information available at time $t$. It represents an optimistic assessment of the value of $m_i$. The decision made can be thought of as the most optimistic one consistent with the current information.

D. Performance

In [27], we studied the case of homogeneous sampling noise (i.e., $\sigma_{s,i}^2 = \sigma_s^2$ for each $i$) and showed that the UCL algorithm achieves logarithmic cumulative expected regret uniformly in time. In particular, we proved that the following theorem holds. We define $\{R_t^{\mathrm{UCL}}\}_{t \in \{1,\ldots,T\}}$ as the sequence of expected regret for the deterministic UCL algorithm.

Theorem 7 (Regret of the Deterministic UCL Algorithm [27]): The following statements hold for the Gaussian multi-armed bandit problem and the deterministic UCL algorithm with uncorrelated uninformative prior and $K = 1$:

1) the expected number of times a suboptimal arm $i$ is chosen until time $T$ satisfies

$$E[n_i^T] \le \left(\frac{8\sigma_s^2}{\Delta_i^2} + 2\right)\log T + 3;$$

2) the cumulative expected regret until time $T$ satisfies

$$J_R = \sum_{t=1}^{T} R_t^{\mathrm{UCL}} \le \sum_{i=1}^{N} \Delta_i \left(\left(\frac{8\sigma_s^2}{\Delta_i^2} + 2\right)\log T + 3\right).$$

The implication of this theorem can be seen by comparing 1) with the Lai-Robbins bound (9): the UCL algorithm achieves logarithmic regret uniformly in time with a constant that differs from the optimal asymptotic one by a constant factor, and thus is considered to have optimal performance.

VI. ALGORITHMS FOR SATISFICING GAUSSIAN MULTI-ARMED BANDIT PROBLEMS

In this section we develop algorithms for solving Gaussian multi-armed bandit problems with the satisficing objectives proposed in Section III. All the algorithms consist of modified versions of the UCL algorithm. We analyze the algorithms and show that they achieve efficient performance. The UCL algorithm solves the standard Gaussian multi-armed bandit problem, i.e., the satisficing Gaussian multi-armed bandit problem with $M > m_{i^*}$ and $\delta = 0$ (Problem 1). We develop three new UCL variants for Problems 2-4 in Table I. These algorithms can then be applied to Problems 5-8 in Table II. At the end of the section, we consider extensions to reward distributions other than the Gaussian with known variance.

A. Problem 2: Satisfaction-In-Mean-Reward UCL Algorithm

A simple modification of the UCL algorithm achieves logarithmic regret for the Gaussian satisfaction-in-mean-reward problem, which is the satisficing-in-mean-reward multi-armed bandit problem with $M \le m_{i^*}$ and $\delta = 0$ (Problem 2). We define this algorithm, which we refer to as the satisfaction-in-mean-reward UCL algorithm, by combining the heuristic value (27) with a threshold test.
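For reference, the baseline UCL algorithm of Section V, which the variants in this section modify, can be sketched as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: it covers only an uncorrelated prior, for which the update (25) reduces to a per-arm precision update, and it uses a proper prior with positive precision (an uninformative prior is the limit of vanishing precision). The function names are hypothetical.

```python
import math
from statistics import NormalDist


def update_belief(mu, prec, arm, reward, sigma_s):
    """Diagonal special case of the update (25); modifies mu and prec in place.

    mu[i] is the belief mean for arm i and prec[i] = 1 / (sigma_i)^2 is its
    precision. Only the entry of the arm that was played changes.
    """
    q = reward / sigma_s ** 2 + prec[arm] * mu[arm]
    prec[arm] += 1.0 / sigma_s ** 2
    mu[arm] = q / prec[arm]


def ucl_values(mu, prec, t, K=1.0):
    """Heuristic (27): Q_i = mu_i + sigma_i * Phi^{-1}(1 - alpha_t), alpha_t = 1/(K t).

    Requires K * t > 1 so that 1 - alpha_t lies strictly inside (0, 1).
    """
    alpha = 1.0 / (K * t)
    c = NormalDist().inv_cdf(1.0 - alpha)
    return [m + c / math.sqrt(p) for m, p in zip(mu, prec)]


def pick_arm(Q):
    """Selection rule (26): the arm with maximal heuristic value."""
    return max(range(len(Q)), key=lambda i: Q[i])


# Hypothetical two-arm example with unit prior variance and unit sampling noise.
mu = [0.0, 0.0]
prec = [1.0, 1.0]
update_belief(mu, prec, arm=0, reward=2.0, sigma_s=1.0)
Q = ucl_values(mu, prec, t=4)
chosen = pick_arm(Q)
```

After one reward of 2.0 at the first arm, its belief mean moves to 1.0 and its precision to 2.0, so it retains the larger heuristic value even though the unexplored arm carries a wider credible interval.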
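As in (27), define the heuristic value $Q_i^t$ as

$$Q_i^t = \mu_i^t + \sigma_i^t \Phi^{-1}(1 - \alpha_t),$$

where $\alpha_t = 1/(Kt)$ and $K > 0$ is again a tunable parameter. Let $M \in \mathbb{R}$ be the satisfaction threshold, so the agent is satisfied if it picks an arm $i$ with $m_i \ge M$. Let the eligible set at time $t$ be $\{i \mid Q_i^t \ge M\}$. In contrast to the UCL selection scheme (26) that picks the arm with maximal $Q_i^t$, satisfaction-in-mean-reward UCL picks any arm in the eligible set. That is, if the eligible set is non-empty, then

$$i_t \in \{i \mid Q_i^t \ge M\}, \tag{28}$$

or if the eligible set is empty, then satisfaction-in-mean-reward UCL picks the arm with maximal $Q_i^t$. Thus, if the most recently selected arm is in the eligible set, it may be selected again even if it does not have the maximal $Q_i^t$.

The satisfaction-in-mean-reward UCL algorithm achieves logarithmic cumulative expected satisfaction-in-mean-reward regret, as guaranteed by the following theorem.

Theorem 8 (Regret of the Satisfaction-In-Mean-Reward UCL Algorithm): Let a Gaussian multi-armed bandit problem with the satisfaction-in-mean-reward objective have at least one arm $i$ that obeys $m_i > M$, and, without loss of generality, assume $\sigma_{s,i}^2 = 1$ for each arm $i$. Then, the following statements hold for the satisfaction-in-mean-reward UCL algorithm with uncorrelated uninformative prior and $K = 1$:

1) the expected number of times a non-satisfying arm $i$ is chosen until time $T$ satisfies

$$E[n_i^T] \le \left(\frac{8}{(\Delta_i^M)^2} + 3\right)\log T + 4;$$

2) the cumulative expected satisfaction-in-mean-reward regret until time $T$ satisfies

$$J_{SM} \le \sum_{i=1}^{N} \Delta_i^M \left(\left(\frac{8}{(\Delta_i^M)^2} + 3\right)\log T + 4\right).$$

To prove Theorem 8 we use the following bound from [1].

Lemma 9 (Bounds on the Inverse Gaussian cdf): For the standard normal (i.e., Gaussian) random variable $z$ and a constant $w \in \mathbb{R}_{\ge 0}$,

$$\Pr[z \ge w] \le \frac{2e^{-w^2/2}}{\sqrt{2\pi}\left(w + \sqrt{w^2 + 8/\pi}\right)} \le \frac{1}{2}e^{-w^2/2}. \tag{29}$$

It follows from (29) that for any $\alpha \in (0, 0.5]$,

$$\Phi^{-1}(1 - \alpha) \le \sqrt{-2\log\alpha}. \tag{30}$$

Proof of Theorem 8: The proof proceeds as in the proof of Theorem 7 in [27], which itself follows the proofs in [4]. Let $i$ be a non-satisfying arm, i.e., $m_i < M$, and recall that $i^*$ designates the arm with the maximum mean reward. Then

$$\begin{aligned} E[n_i^T] &= \sum_{t=1}^{T} \Pr[i_t = i] \\ &\le \sum_{t=1}^{T} \left( \Pr[Q_i^t \ge M] + \Pr\left[Q_i^t \ge Q_{i^*}^t \text{ and } \max_j Q_j^t < M\right] \right) \\ &\le \eta + \sum_{t=1}^{T} \left( \Pr[Q_i^t \ge M,\ n_i^t \ge \eta] + \Pr[Q_i^t \ge Q_{i^*}^t,\ n_i^t \ge \eta] \right). \end{aligned}$$

The first term in the summand corresponds to the probability that the non-satisfying arm $i$ is in the eligible set, while the second term corresponds to the probability that the eligible set is empty and that a non-satisfying arm appears better than an optimal arm. The statement $Q_i^t \ge Q_{i^*}^t$ implies that at least one of the following inequalities holds:

$$\mu_i^t \ge m_i + C_i^t \tag{31}$$

$$\mu_{i^*}^t \le m_{i^*} - C_{i^*}^t \tag{32}$$

$$m_{i^*} < m_i + 2C_i^t, \tag{33}$$

where $C_i^t = \sigma_i^t \Phi^{-1}(1 - \alpha_t)$ and $\alpha_t = 1/(Kt)$. Otherwise, if none of (31)-(33) holds, then

$$Q_{i^*}^t = \mu_{i^*}^t + C_{i^*}^t > m_{i^*} \ge m_i + 2C_i^t > \mu_i^t + C_i^t = Q_i^t.$$

We first analyze the probability that (31) holds. For an uncorrelated uninformative prior, $\mu_i^t$ is equal to $\bar m_i^t$, the empirical mean reward observed at arm $i$ until time $t$, and $\sigma_i^t = 1/\sqrt{n_i^t}$. Therefore, for an uncorrelated uninformative prior,

$$Q_i^t = \bar m_i^t + \frac{1}{\sqrt{n_i^t}}\Phi^{-1}(1 - \alpha_t).$$

Conditional on $n_i^t$, the empirical mean reward $\bar m_i^t$ is itself a Gaussian random variable with mean $m_i$ and standard deviation $1/\sqrt{n_i^t}$, so (31) holds if

$$\begin{aligned} \bar m_i^t &\ge m_i + \frac{1}{\sqrt{n_i^t}}\Phi^{-1}(1 - \alpha_t) \\ \iff m_i + \frac{z}{\sqrt{n_i^t}} &\ge m_i + \frac{1}{\sqrt{n_i^t}}\Phi^{-1}(1 - \alpha_t) \\ \iff z &\ge \Phi^{-1}(1 - \alpha_t), \end{aligned}$$

where $z \sim \mathcal{N}(0, 1)$ is a standard normal random variable. Thus, for an uninformative prior,

$$\Pr[\text{(31) holds}] = \alpha_t = \frac{1}{Kt}.$$

Similarly, (32) holds if

$$\begin{aligned} \bar m_{i^*}^t &\le m_{i^*} - C_{i^*}^t \\ \iff m_{i^*} + \frac{z}{\sqrt{n_{i^*}^t}} &\le m_{i^*} - \frac{1}{\sqrt{n_{i^*}^t}}\Phi^{-1}(1 - \alpha_t) \\ \iff z &\le -\Phi^{-1}(1 - \alpha_t), \end{aligned}$$

where $z \sim \mathcal{N}(0, 1)$ is a standard normal random variable. Thus, for an uninformative prior,

$$\Pr[\text{(32) holds}] = \alpha_t = \frac{1}{Kt}.$$

Inequality (33) holds if

$$\begin{aligned} m_{i^*} &< m_i + \frac{2}{\sqrt{n_i^t}}\Phi^{-1}(1 - \alpha_t) \\ \iff \Delta_i &< \frac{2}{\sqrt{n_i^t}}\Phi^{-1}(1 - \alpha_t) \\ \implies \frac{\Delta_i^2}{4} &< -\frac{2}{n_i^t}\log\alpha_t \qquad (34) \\ \implies \frac{\Delta_i^2}{4} &< \frac{2}{n_i^t}\log t \le \frac{2}{n_i^t}\log T, \end{aligned}$$

where $\Delta_i = m_{i^*} - m_i$, inequality (34) follows from bound (30), and the last line uses $K = 1$. Thus, for an uninformative prior, (33) never holds if

$$n_i^t \ge \frac{8}{\Delta_i^2}\log T. \tag{35}$$

Thus, for sufficiently large $n_i^t$, $\Pr[Q_i^t \ge Q_{i^*}^t] \le 2/(Kt)$.

We now bound the probability $\Pr[Q_i^t \ge M]$ that a non-satisfying arm $i$ is in the eligible set. Note that $Q_i^t \ge M$ implies that at least one of the following inequalities holds:

$$\mu_i^t \ge m_i + C_i^t \tag{36}$$

$$M < m_i + 2C_i^t. \tag{37}$$

Otherwise, if neither (36) nor (37) holds, $M \ge m_i + 2C_i^t > \mu_i^t + C_i^t = Q_i^t$ and arm $i$ is not in the eligible set. Inequality (36) is identical to (31), and (37) has the same form as (33) with $M$ in place of $m_{i^*}$. For an uninformative prior,

$$\Pr[\text{(36) holds}] = \alpha_t = \frac{1}{Kt}.$$

And (37) holds if

$$\begin{aligned} M &< m_i + \frac{2}{\sqrt{n_i^t}}\Phi^{-1}(1 - \alpha_t) \\ \iff \Delta_i^M &< \frac{2}{\sqrt{n_i^t}}\Phi^{-1}(1 - \alpha_t) \\ \implies \frac{(\Delta_i^M)^2}{4} &< -\frac{2}{n_i^t}\log\alpha_t \\ \implies \frac{(\Delta_i^M)^2}{4} &< \frac{2}{n_i^t}\log t \le \frac{2}{n_i^t}\log T. \end{aligned}$$

Thus, for an uninformative prior, (37) never holds if

$$n_i^t \ge \frac{8}{(\Delta_i^M)^2}\log T. \tag{38}$$

Since $m_{i^*} \ge M$, for each non-satisfying arm $i$, $\Delta_i \ge \Delta_i^M$. Thus, $1/(\Delta_i^M)^2 \ge 1/\Delta_i^2$ and (38) implies (35).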
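So setting

$$\eta = \frac{8}{(\Delta_i^M)^2}\log T \tag{39}$$

yields the bound

$$E[n_i^T] \le \eta + \sum_{t=1}^{T} \left( \Pr[Q_i^t \ge M,\ n_i^t \ge \eta] + \Pr[Q_i^t \ge Q_{i^*}^t,\ n_i^t \ge \eta] \right) < \frac{8}{(\Delta_i^M)^2}\log T + 3\sum_{t=1}^{T}\frac{1}{t}.$$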

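The sum can be bounded by the integral

$$\sum_{t=1}^{T}\frac{1}{t} \le 1 + \int_1^T \frac{1}{t}\,dt = 1 + \log T,$$

yielding the bound in the first statement of the theorem:

$$E[n_i^T] \le \left(\frac{8}{(\Delta_i^M)^2} + 3\right)\log T + 4.$$

The second statement of the theorem follows from the definition (12) of expected satisficing regret.

B. Problem 3: $\delta$-Sufficing UCL Algorithm

An alternative modification of the UCL algorithm achieves finite satisficing regret in the Gaussian $\delta$-sufficing problem, which is the satisficing-in-mean-reward multi-armed bandit with $M > m_{i^*}$ and $\delta \in (0, 1]$ (Problem 3). For the agent, this can be thought of as wanting to have finite confidence that it has found the unknown optimal arm $i^*$. For the $\delta$-sufficing problem, define the heuristic function

$$Q_i^t = \mu_i^t + \sigma_i^t \Phi^{-1}\left(1 - \frac{\delta}{2}\right).$$

We define the $\delta$-sufficing UCL algorithm as the algorithm that selects arm $i_t = \arg\max_i Q_i^t$ at each decision time $t$. The $\delta$-sufficing UCL algorithm achieves finite cumulative satisficing regret, as guaranteed by the following theorem.

Theorem 10: Consider the $\delta$-sufficing UCL algorithm with an uninformative prior. The number of times $n_i^T$ the picked arm $i$ is non-satisfying with probability greater than $\delta$ is upper bounded as

$$n_i^T < \frac{4\sigma_s^2}{\Delta_i^2}\left(\Phi^{-1}(1 - \delta/2)\right)^2 + 1.$$

Proof: We bound $n_i^T$ by noting that a non-satisfying arm $i$ is picked only if $Q_i^t \ge Q_{i^*}^t$, which can be decomposed as in the proof of Theorem 8 into the three conditions

$$\mu_i^t \ge m_i + C_i^t \tag{40}$$

$$\mu_{i^*}^t \le m_{i^*} - C_{i^*}^t \tag{41}$$

$$m_{i^*} < m_i + 2C_i^t, \tag{42}$$

where now $C_i^t = \sigma_i^t \Phi^{-1}(1 - \delta/2)$. Condition (42) is equivalent to

$$\Delta_i = m_{i^*} - m_i < 2C_i^t = \frac{2\sigma_s}{\sqrt{n_i^t}}\Phi^{-1}(1 - \delta/2).$$

Squaring and rearranging, we see that this never holds if

$$n_i^t > \frac{4\sigma_s^2}{\Delta_i^2}\left(\Phi^{-1}(1 - \delta/2)\right)^2 = \eta.$$

The same argument as in the proof of Theorem 8 shows that for $n_i^t \ge 1$, (40) and (41) each hold with probability at most $\delta/2$. Therefore, for $n_i^t > \eta + 1$, a non-satisfying arm is selected with probability at most $\delta$.

Theorem 10 guarantees that the $\delta$-sufficing UCL algorithm achieves finite regret. Furthermore, the algorithm is efficient in that the regret matches the dependence on $\epsilon$ and $\delta$ in the bound (15). To see this, note that a non-satisfying arm $i$ with gap $\Delta_i$ is an $\epsilon = \Delta_i$-suboptimal arm, so Corollary 3 implies that $n_i^T$ is lower bounded by $O(\log(1/\delta)/\epsilon^2)$. The statement of Theorem 10 combined with the bound (30) on the inverse Gaussian cdf implies that $n_i^T$ is upper bounded by $8\sigma_s^2\log(2/\delta)/\Delta_i^2 + 1 = 8\sigma_s^2\log(2/\delta)/\epsilon^2 + 1$, which matches the lower bound (15) up to constant factors.

C. Problem 4: $(M,\delta)$-Satisficing UCL Algorithm

A third modification of the UCL algorithm achieves finite satisficing regret in the Gaussian $(M,\delta)$-satisficing problem, which is the satisficing-in-mean-reward multi-armed bandit with $M \le m_{i^*}$ and $\delta \in (0, 1]$ (Problem 4). For the agent, this can be thought of as wanting to have finite confidence that it has found an arm whose mean reward is above a known threshold. For the $(M,\delta)$-satisficing problem, define the heuristic function

$$Q_i^t = \mu_i^t + \sigma_i^t \Phi^{-1}\left(1 - \frac{\delta}{3}\right).$$

Let the eligible set at time $t$ be $\{i \mid Q_i^t \ge M\}$. We define the $(M,\delta)$-satisficing UCL algorithm as the algorithm that selects arm $i_t \in \{i \mid Q_i^t \ge M\}$, if the eligible set at time $t$ is non-empty. Otherwise, if the eligible set is empty, the algorithm picks the arm with maximal $Q_i^t$. The $(M,\delta)$-satisficing UCL algorithm achieves efficient performance as guaranteed by the following theorem.

Theorem 11: Consider the $(M,\delta)$-satisficing UCL algorithm with an uninformative prior. The number of times $n_i^T$ the picked arm $i$ is non-satisfying with probability greater than $\delta$ is upper bounded as

$$n_i^T < \frac{4\sigma_s^2}{(\Delta_i^M)^2}\left(\Phi^{-1}(1 - \delta/3)\right)^2 + 1.$$

Proof: The proof is very similar to the proofs of Theorems 8 and 10. As in Theorem 8, we bound $n_i^T$ by

$$n_i^T = \sum_{t=1}^{T} \mathbf{1}(i_t = i) \le \eta + \sum_{t=1}^{T} \left( \mathbf{1}(Q_i^t \ge M,\ n_i^t \ge \eta) + \mathbf{1}(Q_i^t \ge Q_{i^*}^t,\ n_i^t \ge \eta) \right).$$

The condition $Q_i^t \ge M$, which means arm $i$ is in the eligible set, can be decomposed into the two conditions

$$\mu_i^t \ge m_i + C_i^t \tag{43}$$

$$M < m_i + 2C_i^t, \tag{44}$$

where now $C_i^t = \sigma_i^t \Phi^{-1}(1 - \delta/3)$. Equation (44) is equivalent to

$$\Delta_i^M = M - m_i < 2C_i^t = \frac{2\sigma_s}{\sqrt{n_i^t}}\Phi^{-1}(1 - \delta/3).$$

Squaring and rearranging, we see that (44) never holds if

$$n_i^t > \frac{4\sigma_s^2}{(\Delta_i^M)^2}\left(\Phi^{-1}(1 - \delta/3)\right)^2 = \eta.$$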

The same argument as in the proof of Theorem 10 shows that for $n_i^t \ge 1$, (43) holds with probability at most $\delta/3$, so $n_i^t > \eta$ implies that a non-satisfying arm is in the eligible set with probability at most $\delta/3$. As in the proof of Theorem 10, a non-satisfying arm $i$ is picked due to the eligible set being empty only if $Q_i^t \ge Q_{i^*}^t$, where $i^*$ is the arm with maximal mean reward. This condition can again be decomposed into the three conditions (40)-(42). Equation (42) does not hold if $n_i^t > \eta$, so the probability that $Q_i^t \ge Q_{i^*}^t$ is bounded by the probability that either (40) or (41) holds. For $n_i^t > 1$, each of these holds with probability $\delta/3$, so the probability of a non-satisfying arm being chosen due to the eligible set being empty is at most $2\delta/3$. Thus, for $n_i^t > \eta + 1$, a non-satisfying arm is selected with probability at most $\delta$.

Theorem 11 guarantees that the $(M,\delta)$-satisficing UCL algorithm achieves finite regret. Furthermore, the algorithm is efficient in that the regret matches the dependence on $\epsilon$ and $\delta$ in the bound (16). Applying the bound (30) on the inverse Gaussian cdf to the statement in the theorem, we see that $n_i^T$ is upper bounded by $8\sigma_s^2\log(3/\delta)/(\Delta_i^M)^2$. Summing this bound over non-satisfying arms shows that the total number of times the algorithm incurs regret is at most

$$8\sigma_s^2\log(3/\delta)\sum_{\{i \mid \Delta_i^M > 0\}} \frac{1}{(\Delta_i^M)^2}.$$

This matches the dependence on $\epsilon$ and $\delta$ in the bound (16) up to constant factors. Note that lower bound (16) counts the number of selections of all arms including the optimal arm, while the upper bound counts only the suboptimal arms. Hence, we can only claim that we achieve cumulative regret bounded in $T$. With a better lower bound on $n_i^T$, we may be able to claim that, similar to $\delta$-sufficing UCL, $(M,\delta)$-satisficing UCL achieves the optimal dependence on $\epsilon$ and $\delta$. However, this remains an open problem to pursue.

D. Robust Satisficing UCL Algorithms

The UCL algorithm solves Problem 1, the Gaussian standard problem. The modified versions of the UCL algorithm in Sections VI-A, VI-B, and VI-C solve the other three Gaussian satisficing-in-mean-reward Problems 2-4. All four UCL algorithms achieve efficient performance in solving their respective problems, as guaranteed by Theorems 8, 10, and 11. The equivalence result of Lemma 5 shows for Gaussian distributed rewards that we can modify the four UCL algorithms developed for Problems 1-4 to solve Problems 5-8 as follows.
The modified UCL algorithms make decisions based on the standardized mean reward (20) using priors on the standardized mean rewards. A prior belief $m \sim \mathcal{N}(\mu_0, \Sigma_0)$ on the mean rewards $m$ is transformed to a prior belief on the standardized mean rewards $x \sim \mathcal{N}(\tilde\mu_0, \tilde\Sigma_0)$ by

$$(\tilde\mu_0)_i = \frac{(\mu_0)_i - M}{\sigma_{s,i}}, \qquad (\tilde\Sigma_0)_{ij} = \frac{(\Sigma_0)_{ij}}{\sigma_{s,i}\,\sigma_{s,j}}.$$

Problem 5: Robust UCL Algorithm: The robust UCL algorithm is the UCL algorithm where the prior is given in terms of the standardized mean rewards, and the observed reward $r_t$ is standardized according to the transformation (22) before being input to the inference equations (25).

Problem 6: Robust Satisfaction UCL Algorithm: The robust satisfaction UCL algorithm is the satisfaction-in-mean-reward UCL algorithm where the prior is given in terms of the standardized mean rewards, the observed reward $r_t$ is standardized according to the transformation (22) before being input to the inference equations (25), and the parameter $M$ is set equal to $X = \Phi^{-1}(\Pi)$ defined in (23).

Problem 7: $\delta$-Robust Sufficing Algorithm: The $\delta$-robust sufficing UCL algorithm is the $\delta$-sufficing UCL algorithm where the prior is given in terms of the standardized mean rewards, and the observed reward $r_t$ is standardized according to the transformation (22) before being input to the inference equations (25).

Problem 8: $(\Pi,\delta)$-Robust Satisficing Algorithm: The $(\Pi,\delta)$-robust satisficing UCL algorithm is the $(M,\delta)$-satisficing UCL algorithm where the prior is given in terms of the standardized mean rewards, the observed reward $r_t$ is standardized according to the transformation (22) before being input to the inference equations (25), and the parameter $M$ is set equal to $X = \Phi^{-1}(\Pi)$ defined in (23).

Lemma 5 implies that the performance guarantees that hold for the UCL algorithms developed for Problems 1-4 also hold for the four new UCL algorithms defined above when applied to Problems 5-8.

E. Relaxations of Gaussian and Known Variance Assumptions

The algorithms presented so far have been developed assuming that the reward distribution associated with each arm $i$ is Gaussian with unknown mean $m_i$ and known variance $\sigma_{s,i}^2$. The reward variance may be known, e.g., estimated from known sensor characteristics or prior data. When the reward variance is not known, a simple modification to the heuristic (27) yields an algorithm that achieves efficient performance. Similar simple modifications extend our results to the case where the reward distribution is sub-Gaussian, which includes distributions with bounded support.
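The two ingredients that the robust algorithms of Section VI-D add to the UCL variants, the prior transformation and the threshold $X = \Phi^{-1}(\Pi)$ of (23), can be sketched as follows. This is a sketch under stated assumptions: the function names and the numerical values are hypothetical, and the prior is stored as plain nested lists.

```python
from statistics import NormalDist


def standardize_prior(mu0, Sigma0, sigma_s, M):
    """Map a prior N(mu0, Sigma0) on the mean rewards to a prior on the
    standardized mean rewards: subtract M and divide by sigma_{s,i} in the
    mean, divide entry (i, j) of the covariance by sigma_{s,i} * sigma_{s,j}.
    """
    n = len(mu0)
    mu_tilde = [(mu0[i] - M) / sigma_s[i] for i in range(n)]
    Sigma_tilde = [
        [Sigma0[i][j] / (sigma_s[i] * sigma_s[j]) for j in range(n)]
        for i in range(n)
    ]
    return mu_tilde, Sigma_tilde


def robust_threshold(Pi):
    """Threshold (23): X = Phi^{-1}(Pi), valid for 0 < Pi < 1."""
    return NormalDist().inv_cdf(Pi)


# Hypothetical two-arm prior (illustrative values only).
mu0 = [2.0, 4.0]
Sigma0 = [[4.0, 1.0], [1.0, 16.0]]
sigma_s = [2.0, 4.0]
M = 2.0

mu_tilde, Sigma_tilde = standardize_prior(mu0, Sigma0, sigma_s, M)
X = robust_threshold(0.975)
```

With a probability-of-happiness threshold of 0.975, the standardized threshold is roughly 1.96, so an arm is satisfying in the robust sense only if its mean reward exceeds $M$ by about 1.96 sampling standard deviations.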
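We state modifications for the case of an uninformative prior. Prior information can be incorporated using a conjugate prior, as discussed in [24].

Remark 12 (Gaussian Rewards With Unknown Variance): When the reward distribution is Gaussian with unknown variance, the heuristic developed by Auer et al. [4] for their algorithm UCB1-NORMAL results in algorithms that achieve efficient performance. Recall that $n_i^t$ is the number of times arm $i$ has been selected up to time $t$, and $\bar m_i^t$ is the empirical mean reward observed at arm $i$ up to time $t$. Define $q_i^t = \sum_{\tau=1}^{t} r_\tau^2\,\mathbf{1}(i_\tau = i)$ as the sum of the squared rewards obtained from arm $i$. The UCB1-NORMAL algorithm is composed of two rules: if there is an arm that has been played less than $8\log t$ times, it selects that arm. Otherwise it selects the arm $i$ that maximizes the heuristic

$$Q_{i,\text{UCB1-NORMAL}}^t = \bar m_i^t + \sqrt{16\,\frac{q_i^t - n_i^t(\bar m_i^t)^2}{n_i^t - 1}\,\frac{\log t}{n_i^t}}.$$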
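A minimal sketch of the two UCB1-NORMAL rules of Remark 12 follows. It assumes the heuristic is the empirical mean plus $\sqrt{16 \cdot \text{sample variance} \cdot \log t / n}$, with the sample variance written as $(q - n\bar m^2)/(n - 1)$; the function names are hypothetical. A second function applies the same variance estimate with $\log(k/\delta)$ in place of $\log t$ and a leading factor of 4, which is the form assumed here for the $\delta$-dependent variants.

```python
import math


def _variance_term(rewards):
    """(q - n * mean^2) / (n - 1) for the rewards observed at one arm."""
    n = len(rewards)
    if n < 2:
        raise ValueError("need at least two observations to estimate variance")
    mean = sum(rewards) / n
    q = sum(r * r for r in rewards)  # sum of squared rewards
    # Clamp at zero to guard against small negative values from rounding.
    return mean, max(0.0, (q - n * mean * mean) / (n - 1))


def needs_forced_play(n_i, t):
    """First rule: an arm played fewer than 8 log t times is selected."""
    return n_i < 8.0 * math.log(t)


def ucb1_normal_index(rewards, t):
    """Second rule: heuristic value of an arm at time t (t >= 1)."""
    mean, var = _variance_term(rewards)
    return mean + math.sqrt(16.0 * var * math.log(t) / len(rewards))


def delta_index(rewards, k, delta):
    """Assumed delta-dependent variant: log(k / delta) replaces log t."""
    mean, var = _variance_term(rewards)
    return mean + 4.0 * math.sqrt(var * math.log(k / delta) / len(rewards))


# Hypothetical observations at a single arm.
obs = [1.0, 3.0]
index_at_10 = ucb1_normal_index(obs, t=10)
index_delta = delta_index(obs, k=2, delta=0.05)
```

For the two observations above, the empirical mean is 2 and the variance estimate is 2, so the heuristic at $t = 10$ equals $2 + 4\sqrt{\log 10}$.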

This heuristic can be used directly in the standard and satisfaction-in-mean-reward UCL algorithms. For the $\delta$-sufficing and $(M,\delta)$-satisficing UCL algorithms, use $k = 2$ and $k = 3$, respectively, in the heuristic

$$Q_i^t = \bar m_i^t + 4\sqrt{\frac{q_i^t - n_i^t(\bar m_i^t)^2}{n_i^t - 1}\,\frac{\log(k/\delta)}{n_i^t}}.$$

The Gaussian distribution with unknown mean and variance is again a location-scale family, so Lemma 5 implies that these modified algorithms can be used to solve the robust satisficing problems as well. Prior information can be incorporated by means of a conjugate prior, as discussed in [24].

For the following remarks, define the generalized heuristic

$$Q_{i,\beta}^t = \bar m_i^t + \sqrt{\frac{\beta\log t}{n_i^t}}.$$

Remark 13 (Sub-Gaussian Rewards): Another generalization of Gaussian rewards with known variance is the case where the reward distribution is sub-Gaussian, also known as light-tailed. The distribution of a random variable $X$ is called sub-Gaussian if its moment generating function $M(u) = E[\exp(uX)]$ is finite for all $u \in \mathbb{R}$. Then, one can find a constant $\zeta$ such that $M(u) \le \exp(\zeta u^2/2)$ [9]. In this case, a heuristic function due to Liu and Zhou [21],

$$Q_{i,SG}^t = Q_{i,8\zeta}^t,$$

can be used to achieve efficient performance.

Remark 14 (Reward Distributions With Bounded Support): Another common assumption in the bandit literature is that the reward distributions are arbitrary but have a known bounded support $[a, b] \subset \mathbb{R}$. Without loss of generality, we assume that the support is contained in the unit interval $[0, 1]$. In this case the UCB1 heuristic due to Auer et al. [4],

$$Q_{i,\mathrm{UCB1}}^t = Q_{i,2}^t,$$

can be used in the standard and satisfaction-in-mean-reward UCL algorithms. For the $\delta$-sufficing and $(M,\delta)$-satisficing UCL algorithms, with $k = 2$ and $k = 3$, respectively, the heuristic

$$Q_i^t = Q_{i,1/2}^{k/\delta},$$

i.e., the generalized heuristic with $\beta = 1/2$ and $t$ replaced by $k/\delta$, can be used to achieve efficient performance. For the robust satisficing problems the relevant reward, happiness $h_t$ (17), is a Bernoulli random variable which is supported on $[0, 1]$. Therefore, each robust satisficing problem can be solved by the appropriate variant of UCB1. However, if additional information is available about the distribution of the raw rewards $r_t$, e.g., that they are Gaussian with known variance, then the robust UCL algorithms can achieve improved performance relative to UCB1, for example if the Kullback-Leibler divergence between the $r_t$ distributions is larger than that between the Bernoulli distributions associated with $h_t$. Additional extensions to heavy-tailed distributions may be possible using the techniques of [7].

VII. NUMERICAL EXAMPLES

In this section, we present the results of numerical simulations of the modified UCL algorithms solving multi-armed bandit problems with Gaussian rewards and satisficing objectives. We consider both thresholding the mean rewards $m_i$, as in Problems 1-4 in Table I, and thresholding the instantaneous rewards, as in Problems 5-8 in Table II. In all of the cases presented, the algorithms used an uninformative prior. We use the simulations to illustrate performance of the algorithms relative to the bounds proved in the theorems of Section IV. We also use the simulations to compare how the different algorithms trade off accumulation of reward with reduction in exploration cost as measured by number of switches among arms. As shown in the figures, satisficing can significantly decrease the exploration cost while incurring little cost in terms of the rewards received by the agent.

Fig. 1. Comparison of regret incurred by the UCL algorithms when solving the standard problem (Problem 1) and satisfaction-in-mean-reward problem (Problem 2). Both problems define regret by thresholding mean reward values; the standard bandit objective incurs regret when the mean reward of the chosen option is less than the maximum reward $m_{i^*}$, while the satisfaction-in-mean-reward problem incurs regret when the mean reward is less than $M \le m_{i^*}$, here set equal to 2.5. For both problems, the cumulative expected regret and its upper bound increase at a logarithmic rate since the agent seeks certainty that its threshold is met, which it cannot achieve in finite time.

Fig. 2. Comparison of regret incurred by the UCL algorithms when solving the $\delta$-sufficing and $(M,\delta)$-satisficing problems, Problems 3 and 4, respectively. As in Fig. 1, the problems define regret by thresholding the mean reward values; the $\delta$-sufficing objective incurs regret when the mean reward of the chosen option is less than the maximum reward $m_{i^*}$, while the $(M,\delta)$-satisficing problem incurs regret when the mean reward is less than $M \le m_{i^*}$, here set equal to 2.5. In contrast to Fig. 1, the agent only seeks to have $1 - \delta = 95\%$ confidence that its threshold is met, which it can achieve in finite time. Thus, the upper bounds on cumulative expected regret are constant functions of horizon length and the mean regret plateaus at a finite value.

We first consider the satisficing objectives with thresholding in the mean rewards. We illustrate how the objectives of Problems 1 and 2 yield logarithmic regret (Fig. 1) whereas the objectives of Problems 3 and 4 yield finite regret (Fig. 2), as predicted by the bounds proved in Theorems 7, 8, 10 and 11.
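A small simulation in the spirit of these experiments can be sketched as follows. This is not the authors' simulation code: it assumes unit sampling variance, an uncorrelated uninformative prior, and $K = 1$, it plays each arm once at the start so that the empirical means exist, and it resolves the "pick any eligible arm" rule by staying with the previous arm when possible and otherwise taking the lowest-index eligible arm. The names `run_ucl` and `count_switches` and the variant labels are hypothetical.

```python
import math
import random
from statistics import NormalDist

PHI_INV = NormalDist().inv_cdf  # quantile function of the standard normal


def run_ucl(means, T, variant="standard", M=None, delta=0.05, seed=0):
    """Simulate one run of a UCL variant; returns the list of chosen arms.

    variant: "standard" (Problem 1), "satisfaction" (Problem 2),
             "sufficing" (Problem 3), or "M_delta" (Problem 4).
    """
    uses_threshold = variant in ("satisfaction", "M_delta")
    if uses_threshold and M is None:
        raise ValueError("this variant needs a threshold M")
    rng = random.Random(seed)
    N = len(means)
    counts, sums, choices, prev = [0] * N, [0.0] * N, [], None
    for t in range(1, T + 1):
        if t <= N:
            arm = t - 1  # initialization: play every arm once
        else:
            if variant in ("standard", "satisfaction"):
                level = 1.0 - 1.0 / t          # alpha_t = 1 / t
            elif variant == "sufficing":
                level = 1.0 - delta / 2.0
            else:
                level = 1.0 - delta / 3.0
            c = PHI_INV(level)
            # Uninformative prior: belief mean is the empirical mean and
            # belief standard deviation is 1 / sqrt(n_i).
            Q = [sums[i] / counts[i] + c / math.sqrt(counts[i]) for i in range(N)]
            arm = max(range(N), key=lambda i: Q[i])
            if uses_threshold:
                eligible = [i for i in range(N) if Q[i] >= M]
                if eligible:
                    arm = prev if prev in eligible else eligible[0]
        reward = rng.gauss(means[arm], 1.0)
        counts[arm] += 1
        sums[arm] += reward
        choices.append(arm)
        prev = arm
    return choices


def count_switches(choices):
    """Exploration cost measured as the number of switches among arms."""
    return sum(1 for a, b in zip(choices, choices[1:]) if a != b)


# Hypothetical three-arm problem with threshold M = 2.5.
arm_means = [1.0, 2.0, 3.0]
standard_run = run_ucl(arm_means, T=200, variant="standard")
satisficing_run = run_ucl(arm_means, T=200, variant="M_delta", M=2.5)
```

Comparing `count_switches(standard_run)` with `count_switches(satisficing_run)` over many seeds gives a rough view of the exploration cost trade-off discussed above; a single run is too noisy to support any conclusion.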

Preprint: arXiv:1512.07638v2 [cs.LG], 19 Dec 2016.