On Optimal Foraging and Multi-armed Bandits


Proceedings of the 51st Annual Allerton Conference on Communication, Control and Computing, October 2013

Vaibhav Srivastava, Paul Reverdy, Naomi E. Leonard

Abstract: We consider two variants of the standard multi-armed bandit problem, namely, the multi-armed bandit problem with transition costs and the multi-armed bandit problem on graphs. We develop block allocation algorithms for these problems that achieve an expected cumulative regret that is uniformly dominated by a logarithmic function of time, and an expected cumulative number of transitions from one arm to another arm uniformly dominated by a double-logarithmic function of time. We observe that the multi-armed bandit problem with transition costs and the associated block allocation algorithm capture the key features of popular animal foraging models in literature.

I. INTRODUCTION

Foraging is a fundamental animal behavior that pertains to searching out food resources and exploiting them. Foraging behavior is studied in behavioral ecology using economic principles, i.e., the foraging decisions are evaluated based on their effects on certain pay-off functions. At the heart of a foraging decision is the tradeoff between exploration (to search for a better food resource) and exploitation (to stick with the best known food resource). In the engineering literature, a benchmark setup to study the exploration-exploitation tradeoff is the multi-armed bandit problem. The multi-armed bandit problem models a class of resource allocation problems in which a decision-maker allocates a single resource by sequentially choosing one among a set of competing alternative options called arms. In the so-called stationary multi-armed bandit problem, a decision-maker at each discrete time instant chooses an arm and collects a reward drawn from an unknown stationary probability distribution associated with the selected arm. The objective of the decision-maker is to maximize the total reward aggregated over the sequential allocation process. The fundamental exploration-exploitation tradeoff in foraging can be modeled as a multi-armed bandit problem, and the effectiveness of the foraging decisions can be measured by comparing them to the optimal decisions for the multi-armed bandit problem.
In this paper, we explore this connection and argue that the solution to a Bayesian multi-armed bandit problem captures the qualitative features of the foraging behavior in some animals.

This research has been supported in part by ONR grant N and ARO grant W911NG. P. Reverdy is supported through an NDSEG Fellowship. V. Srivastava, P. Reverdy, and N. E. Leonard are with the Department of Mechanical and Aerospace Engineering, Princeton University, Princeton, NJ, USA {vaibhavs, preverdy,

Literature review: The multi-armed bandit problem has been extensively studied; a survey is presented in [1]. In their seminal work, Lai and Robbins [2] established a logarithmic lower bound on the expected number of times a sub-optimal arm needs to be selected by an optimal policy. Since [2], considerable emphasis has been placed on the design of simple heuristic policies that achieve the logarithmic lower bound on the expected number of selection instances of any suboptimal arm. To this end, Auer et al. [3] developed upper confidence bound (UCB) algorithms for multi-armed bandits with bounded reward that achieve logarithmic expected cumulative regret uniformly in time. Recently, Srinivas et al. [4] developed asymptotically optimal UCB algorithms for Gaussian process optimization. Kaufmann et al. [5] developed a generic Bayesian UCB algorithm and established its optimality for binary bandits with uniform prior. Reverdy et al. [6] established the optimality of a Bayesian UCB algorithm for Gaussian rewards and drew several connections between these algorithms and human decision-making. They also elucidated the role of priors in decision-making performance.

Some variations of the multi-armed bandit problem have been studied as well. Agrawal et al. [7] studied the multi-armed bandit problem with transition costs, i.e., the multi-armed bandit problem in which a certain penalty is imposed each time the decision-maker switches from the currently selected arm, and developed an asymptotically optimal block allocation algorithm. In this paper, we consider the Gaussian multi-armed bandit problem with transition costs and develop a block allocation algorithm that achieves an expected cumulative regret that is uniformly dominated by a logarithmic term.
Moreover, the block allocation scheme designed in this paper incurs smaller expected transition costs than the block allocation scheme in [7]. Kleinberg et al. [8] considered the multi-armed bandit problem in which not every arm is available for selection at each time (sleeping experts), and they analyzed the performance of the UCB algorithms. In contrast to the temporal unavailability of arms in [8], we consider a spatial unavailability of arms. We propose a novel multi-armed bandit problem, namely, the graphical multi-armed bandit problem, in which only a subset of the arms can be selected at the next allocation instance given the currently selected arm. We develop a block allocation algorithm for such a problem that achieves expected cumulative regret that is uniformly dominated by a logarithmic term.

Foraging has been extensively studied in the behavioral ecology literature [9], [10], [11], [12], [13]. A particular emphasis has been on optimal foraging theory [9], [10] that studies foraging behavior based on economic principles. Traditional works [9], [10] in optimal foraging theory have studied the optimal behavior by (i) picking an appropriate currency; (ii) establishing appropriate cost-benefit functions; and (iii) determining the optimal policies. Typically the currency is chosen as the net rate of energy intake and the fundamental hypothesis is that this intake rate is maximized.

The fundamental questions studied in optimal foraging theory include (i) which environment patch should the animal visit next? (ii) how long should the animal stay in that patch? and (iii) which foraging path should the animal choose in each patch?

In recent years, a significant focus has been on the macroscopic properties of foraging. It has been observed that Lévy flights are efficient search mechanisms, and it has been hypothesized that animal foraging has evolved into a Lévy flight [14], [11]. An alternative macroscopic model to the Lévy flight model is the intermittent search model [15]. The intermittent search model views foraging in two alternating phases. In the first phase the animal performs a local Brownian search, and in the second phase the animal performs a ballistic relocation. In both the Lévy flight and intermittent search models, the key macroscopic observation is that the animal performs a local exploration for some time and then moves to a far-off location. While these macroscopic models capture the general characteristics of foraging well, they do not provide insights into the decision mechanisms used by the animal.

There have been significant efforts to understand the decision mechanisms in foraging; see, e.g., [16], [17], [18]. Of particular interest here are the foraging studies in the multi-armed bandit problem setting. Krebs et al. [16] studied foraging in great tits in a two-armed bandit setting and found that the foraging policy of great tits is close to the optimal policy for the two-armed bandit problem. Keasar et al. [17] explored the foraging behavior of bumblebees in a two-armed bandit setting and discussed plausible decision-making mechanisms.

Contributions: In this paper, we study the multi-armed bandit problem with Gaussian rewards. In animal foraging, the energy aggregated from a patch can be thought of as the reward from the patch, and the animal's objective is to maximize intake energy rate, while minimizing expenditure in time and energy. In robotic foraging, the robot searches an area, and the reward is the aggregated evidence. Analogous to the animal, the robot's objective is typically to maximize evidence collected, while minimizing expenditure of time and energy.
To address this common problem, we consider two particular extensions of the standard multi-armed bandit problem, namely, the multi-armed bandit problem with transition costs, and the graphical multi-armed bandit problem. We justify the need for the extensions as follows. In the standard multi-armed bandit problem, the decision-maker can switch between two arms any number of times; while in the robotic as well as the animal foraging task, a higher number of switches between arms is undesirable because it results in a higher travel time that leads to a smaller energy/evidence aggregation rate and a larger fuel cost. Thus the foraging objective is equivalent to maximizing the aggregated reward while minimizing the switches between the arms; this is addressed by our first extension. Another shortcoming of the standard multi-armed bandit problem is that it assumes that each arm can be directly visited from another arm; while this is true for any convex environment, non-convex environments require extra care. A well known technique to handle non-convex environments is the occupancy grid [19] that constructs a graph associated with the non-convex environment. Accordingly, our second extension to the multi-armed bandit problem on graphs enables study of the foraging problem in non-convex environments.

The major contributions of this work are threefold. First, we study the Gaussian multi-armed bandit problem with transition costs and extend the Bayesian-UCB algorithm in [6] to a block allocation strategy that uniformly achieves an expected cumulative regret that is dominated by a logarithmic term and an expected number of transitions between arms that is dominated by a double-logarithmic term. Second, we study the graphical Gaussian multi-armed bandit problem and extend the block allocation strategy to this problem. We show that even for the graphical multi-armed bandit problem, the block allocation strategy uniformly achieves an expected cumulative regret that is dominated by a logarithmic term. Third, we draw connections between animal foraging behavior and the behavior of the proposed policies for the multi-armed bandits. We argue that the multi-armed bandits and the associated block allocation algorithms qualitatively capture the foraging behavior of some animals.
In particular, we observe that the multi-armed bandit problem setup has the potential to provide an overarching framework that brings together the classical optimal foraging theory, the Lévy flight based macroscopic search models, and the decision-mechanism based search models.

Paper structure: The remainder of the paper is organized as follows. We review standard Gaussian multi-armed bandits in Section II. The Gaussian multi-armed bandits with transition costs and the graphical Gaussian multi-armed bandits are studied in Section III and IV, respectively. We draw comparisons between the behavior of the block allocation algorithm and animal foraging in Section V and conclude in Section VI.

II. REVIEW OF BANDITS WITH GAUSSIAN REWARDS

Consider an N-armed bandit problem, i.e., a multi-armed bandit problem with $N$ arms. The reward associated with arm $i \in \{1,\dots,N\}$ is a Gaussian random variable with an unknown mean $m_i$, and a known variance $\sigma_s^2$. The mean of the Gaussian reward at arm $i$ can be interpreted as the signal strength at the arm, while the variance can be interpreted as the sampling noise that is the same at each arm. Let the agent choose arm $i_t$ at time $t \in \{1,\dots,T\}$ and receive a reward $r_t \sim \mathcal{N}(m_{i_t}, \sigma_s^2)$. The decision-maker's objective is to choose a sequence of arms $\{i_t\}_{t \in \{1,\dots,T\}}$ that maximizes the expected cumulative reward $\sum_{t=1}^{T} m_{i_t}$, where $T$ is the horizon length of the sequential allocation process.

For a multi-armed bandit, the expected regret at time $t$ is defined by $R_t = m_{i^*} - m_{i_t}$, where $m_{i^*} = \max\{m_i \mid i \in \{1,\dots,N\}\}$. The objective of the decision-maker can be equivalently defined as minimizing the expected cumulative regret defined by

$$\sum_{t=1}^{T} R_t = \sum_{i=1}^{N} \Delta_i\, E[n_i^T],$$

where $n_i^T$ is the cumulative number of times option $i$ has been chosen until time $T$ and $\Delta_i = m_{i^*} - m_i$ is the expected regret due to picking arm $i$ instead of arm $i^*$.
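The two forms of the cumulative regret above can be checked on a small example. The following is a minimal sketch, not part of the original paper: the function name `cumulative_regret`, the 0-based arm indices, and the sample numbers are illustrative assumptions. It evaluates the regret once step by step and once through the per-arm counts $n_i^T$, for one realized sequence of choices.

```python
def cumulative_regret(means, choices):
    """Regret of a realized sequence of arm choices, computed two ways.

    means   -- true mean reward m_i of each arm (0-based indices, assumed)
    choices -- index of the arm selected at each time step
    """
    best = max(means)                          # m_{i*}
    gaps = [best - m for m in means]           # Delta_i = m_{i*} - m_i
    counts = [0] * len(means)                  # n_i^T
    for i in choices:
        counts[i] += 1
    per_step = sum(best - means[i] for i in choices)      # sum_t R_t
    per_arm = sum(g * n for g, n in zip(gaps, counts))    # sum_i Delta_i n_i^T
    return per_step, per_arm


# Hypothetical example: three arms, six allocation instances.
step_form, arm_form = cumulative_regret([1.0, 0.5, 0.0], [0, 1, 2, 1, 0, 0])
```

Both returned values agree for any sequence; taking expectations over the randomness of the policy gives the identity stated in the text.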

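A. Bound on Optimal Performance

Lai and Robbins [2] showed that any asymptotically efficient algorithm for the multi-armed bandit problem must choose suboptimal arms for an expected number of times that is at least logarithmic in time. That is,

$$E[n_i^T] \ge \left( \frac{1}{D(p_i \,\|\, p_{i^*})} + o(1) \right) \log T,$$

where $o(1) \to 0$ as $T \to +\infty$ and $D(\cdot \,\|\, \cdot) \in \mathbb{R}_{\ge 0} \cup \{+\infty\}$, defined by

$$D(p_i \,\|\, p_{i^*}) = \int p_i(r) \log \frac{p_i(r)}{p_{i^*}(r)} \, dr,$$

is the Kullback-Leibler divergence between the reward density $p_i$ of any suboptimal option and the reward density $p_{i^*}$ of the optimal arm. For the Gaussian reward structure considered in this paper, the Kullback-Leibler divergence is equal to $D(p_i \,\|\, p_{i^*}) = \Delta_i^2 / (2\sigma_s^2)$, and consequently,

$$E[n_i^T] \ge \left( \frac{2\sigma_s^2}{\Delta_i^2} + o(1) \right) \log T.$$

This leads to a lower bound on the cumulative regret given by

$$\sum_{t=1}^{T} R_t \ge \sum_{i=1}^{N} \left( \frac{2\sigma_s^2}{\Delta_i} + o(1) \right) \log T.$$

B. Upper Credible Limit Algorithm for Gaussian Bandits

Let the prior on the mean reward at arm $i$ be a Gaussian random variable with mean $\mu_i^0$ and variance $\sigma_0^2$. We are particularly interested in the case of an uninformative prior, i.e., $\sigma_0^2 \to +\infty$. Let the number of times arm $i$ has been chosen until time $t$ be denoted by $n_i^t$. Let the empirical mean of the rewards from arm $i$ until time $t$ be $\bar m_i^t$. Conditioned on the number of visits $n_i^t$ to arm $i$ and the empirical mean $\bar m_i^t$, the posterior distribution of the mean reward ($M_i$) at arm $i$ at time $t$ is a Gaussian random variable with mean and variance

$$\mu_i^t := E[M_i \mid n_i^t, \bar m_i^t] = \frac{\delta^2 \mu_i^0 + n_i^t \bar m_i^t}{\delta^2 + n_i^t}, \quad \text{and} \quad (\sigma_i^t)^2 := \mathrm{Var}[M_i \mid n_i^t, \bar m_i^t] = \frac{\sigma_s^2}{\delta^2 + n_i^t},$$

respectively, where $\delta^2 = \sigma_s^2 / \sigma_0^2$. The UCL algorithm, proposed in [6], at each (discrete) time $t$ first computes the $(1 - 1/Kt)$-upper credible limit $Q_i^t$ associated with each arm $i \in \{1,\dots,N\}$ defined by

$$Q_i^t := \mu_i^t + \frac{\sigma_s}{\sqrt{\delta^2 + n_i^t}} \, \Phi^{-1}\!\left( 1 - \frac{1}{Kt} \right),$$

where $K > 0$ is a constant and $\Phi^{-1}(\cdot)$ is the inverse cumulative distribution function for the standard normal random variable. The UCL algorithm then selects an arm $i_t := \arg\max\{Q_i^t \mid i \in \{1,\dots,N\}\}$. For the uninformative prior, i.e., $\delta^2 \to 0^+$, the UCL algorithm achieves a logarithmic expected cumulative regret for a multi-armed bandit problem with Gaussian rewards. In particular, the regret satisfies the following uniform upper bound:

$$\sum_{t=1}^{T} R_t^{\mathrm{UCL}} \le \sum_{i=1}^{N} \Delta_i \left( \left( \frac{8\beta^2\sigma_s^2}{\Delta_i^2} + \frac{2}{\sqrt{2\pi e}} \right) \log T + \frac{4\beta^2\sigma_s^2}{\Delta_i^2} \left( 1 - \log 2 - \log\log T \right) + 1 + \frac{2}{\sqrt{2\pi e}} \right),$$

where $R_t^{\mathrm{UCL}}$ is the regret of the UCL algorithm at time $t$, and $\beta = 1.02$.

III. GAUSSIAN MULTI-ARMED BANDITS WITH TRANSITION COSTS

Consider the N-armed bandit problem described in Section II. Suppose the decision-maker incurs a random transition cost $c_{ij} \in \mathbb{R}_{\ge 0}$ for a transition from arm $i$ to arm $j$. No cost is incurred if the same arm as at the previous time instant is chosen, i.e., $c_{ii} = 0$. Such a cost structure corresponds to a search problem in which the $N$ arms correspond to $N$ spatially distributed regions and the transition cost $c_{ij}$ corresponds to the travel cost from region $i$ to region $j$.

A. The Block UCL Algorithm

For such Gaussian bandits with transition costs, we develop a block allocation strategy that extends the UCL algorithm of Section II-B. To develop this strategy, we divide the set of natural numbers (allocation instances) into frames $\{f_k \mid k \in \mathbb{N}\}$ such that frame $f_k$ starts at time $2^{k-1}$ and ends at time $2^k - 1$. Thus, the length of frame $f_k$ is $2^{k-1}$. We subdivide frame $f_k$ into blocks that we call rounds of allocation. Let the first $\lfloor 2^{k-1}/k \rfloor$ blocks in frame $f_k$ have length $k$ and the remaining allocation instances in frame $f_k$ constitute a single block of length $2^{k-1} - \lfloor 2^{k-1}/k \rfloor k$. The total number of allocation rounds (blocks) in frame $f_k$ is $b_k = \lceil 2^{k-1}/k \rceil$. Let $\ell \in \mathbb{N}$ be the smallest index such that $T < 2^{\ell}$. Note that each round of allocation is characterized by the tuple $(k, r)$, for some $k \in \{1,\dots,\ell\}$, and $r \in \{1,\dots,b_k\}$. The block UCL algorithm at each round of allocation selects the arm with the maximum upper limit to the smallest $(1 - 1/K\tau_{kr})$-credible set ($Q_i^{kr}$ defined below), where $\tau_{kr}$ is the time at allocation round $(k, r)$, and chooses it for the length of that round (block).

B. Regret Analysis of the Block UCL Algorithm

In this section, we analyze the regret of the block UCL algorithm. We first introduce some notation. Let $Q_i^{kr}$ be the maximum upper limit to the smallest $(1 - 1/K\tau_{kr})$-credible set for the mean of arm $i$ at allocation round $(k, r)$, where $K = \sqrt{2\pi e}$ is the credible limit parameter. Let $n_i^{kr}$ be the number of times arm $i$ has been chosen until allocation round $(k, r)$. Let $s_i^t$ be the number of times the decision-maker transitions to arm $i$ from another arm $j \in \{1,\dots,N\} \setminus \{i\}$ until time $t$. Let the empirical mean of the rewards from arm $i$ until allocation round $(k, r)$ be $\bar m_i^{kr}$.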
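Conditioned on the number of visits $n_i^{kr}$ to arm $i$ and the empirical mean $\bar m_i^{kr}$, the posterior distribution of the mean reward ($M_i$) at arm $i$ at allocation round $(k, r)$ is a Gaussian random variable with mean and variance

$$\mu_i^{kr} := E[M_i \mid n_i^{kr}, \bar m_i^{kr}] = \frac{\delta^2 \mu_i^0 + n_i^{kr} \bar m_i^{kr}}{\delta^2 + n_i^{kr}}, \quad \text{and} \quad (\sigma_i^{kr})^2 := \mathrm{Var}[M_i \mid n_i^{kr}, \bar m_i^{kr}] = \frac{\sigma_s^2}{\delta^2 + n_i^{kr}},$$

respectively. Moreover,

$$E[\mu_i^{kr} \mid n_i^{kr}] = \frac{\delta^2 \mu_i^0 + n_i^{kr} m_i}{\delta^2 + n_i^{kr}}, \quad \text{and} \quad \mathrm{Var}[\mu_i^{kr} \mid n_i^{kr}] = \frac{n_i^{kr} \sigma_s^2}{(\delta^2 + n_i^{kr})^2}.$$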
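The frame and block structure of Section III-A, together with the posterior above in the uninformative-prior limit ($\delta^2 \to 0^+$), can be sketched in a few lines. This is an illustrative sketch under stated assumptions rather than the authors' implementation: the function names, the truncation of the last frame at the horizon $T$, the choice to try unvisited arms first, and the simulated Gaussian rewards are all assumptions made for the example.

```python
import math
import random
from statistics import NormalDist

K = math.sqrt(2 * math.pi * math.e)   # credible limit parameter K from the text


def block_schedule(T):
    """Yield (k, r, tau, length) for every allocation round up to horizon T."""
    k = 1
    while 2 ** (k - 1) <= T:
        start, end = 2 ** (k - 1), min(2 ** k - 1, T)   # frame f_k, cut at T
        tau, r = start, 1
        while tau <= end:
            length = min(k, end - tau + 1)   # last block of a frame may be shorter
            yield k, r, tau, length
            tau += length
            r += 1
        k += 1


def ucl_index(mean_hat, n, tau, sigma_s):
    """Upper credible limit Q for an uninformative prior (delta^2 -> 0)."""
    if n == 0:
        return math.inf                      # assumption: unvisited arms go first
    quantile = NormalDist().inv_cdf(1 - 1 / (K * tau))
    return mean_hat + sigma_s / math.sqrt(n) * quantile


def block_ucl(means, sigma_s, T, seed=0):
    """Simulate block UCL; return per-arm counts and the number of switches."""
    rng = random.Random(seed)
    N = len(means)
    counts, sums = [0] * N, [0.0] * N
    transitions, current = 0, None
    for k, r, tau, length in block_schedule(T):
        q = [ucl_index(sums[i] / counts[i] if counts[i] else 0.0,
                       counts[i], tau, sigma_s) for i in range(N)]
        arm = max(range(N), key=q.__getitem__)
        if current is not None and arm != current:
            transitions += 1                 # one switch per change of block arm
        current = arm
        for _ in range(length):              # hold the arm for the whole block
            sums[arm] += rng.gauss(means[arm], sigma_s)
            counts[arm] += 1
    return counts, transitions


counts, transitions = block_ucl([0.0, 1.0, 0.5], sigma_s=1.0, T=1000)
```

Because an arm can change only at a block boundary, the number of switches is bounded by the number of allocation rounds, which is the mechanism behind the double-logarithmic transition bound analyzed in this section.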

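Accordingly, the $(1 - 1/K\tau_{kr})$-upper credible limit $Q_i^{kr}$ is

$$Q_i^{kr} = \mu_i^{kr} + \frac{\sigma_s}{\sqrt{\delta^2 + n_i^{kr}}} \, \Phi^{-1}\!\left( 1 - \frac{1}{K\tau_{kr}} \right).$$

Also, for each $i \in \{1,\dots,N\}$, we define constants

$$\gamma_1^i = \frac{8\beta^2\sigma_s^2}{\Delta_i^2} + \frac{1}{\log 2} + \frac{2}{K},$$

$$\gamma_2^i = \frac{4\beta^2\sigma_s^2}{\Delta_i^2} (1 - \log 2) + 2 + \frac{8}{K} + \frac{\log 4}{K},$$

$$\gamma_3^i = \gamma_1^i \log 2 \, (2 - \log\log 2) - \left( \frac{4\beta^2\sigma_s^2}{\Delta_i^2} \log\log 2 - \gamma_2^i \right) \left( 1 + \frac{\pi^2}{6} \right), \quad \text{and}$$

$$\bar c_{i,\max} = \max\{ E[c_{ij}] \mid j \in \{1,\dots,N\} \}.$$

Let $\{R_t^{\mathrm{BUCL}}\}_{t \in \{1,\dots,T\}}$ be the sequence of the expected regret of the block UCL algorithm, and $\{S_t^{\mathrm{BUCL}}\}_{t \in \{1,\dots,T\}}$ be the sequence of expected transition costs. The block UCL algorithm achieves a logarithmic expected cumulative regret as formalized in the following theorem.

Theorem 1 (Regret of Block UCL Algorithm): The following statements hold for the Gaussian multi-armed bandit problem with transition costs and the block UCL algorithm with an uninformative prior:

(i) the expected number of times a suboptimal arm $i$ is chosen until time $T$ satisfies

$$E[n_i^T] \le \gamma_1^i \log T - \frac{4\beta^2\sigma_s^2}{\Delta_i^2} \log\log T + \gamma_2^i;$$

(ii) the expected number of transitions to a suboptimal arm $i$ from another arm until time $T$ satisfies

$$E[s_i^T] \le (\gamma_1^i \log 2) \log\log T + \gamma_3^i;$$

(iii) the cumulative regret and the cumulative transition cost until time $T$ satisfy

$$\sum_{t=1}^{T} R_t^{\mathrm{BUCL}} \le \sum_{i=1}^{N} \Delta_i \left( \gamma_1^i \log T - \frac{4\beta^2\sigma_s^2}{\Delta_i^2} \log\log T + \gamma_2^i \right),$$

$$\sum_{t=1}^{T} S_t^{\mathrm{BUCL}} \le \sum_{i=1,\, i \ne i^*}^{N} (\bar c_{i,\max} + \bar c_{i^*,\max}) \left( (\gamma_1^i \log 2) \log\log T + \gamma_3^i \right) + \bar c_{i^*,\max}.$$

Proof: See Appendix.

IV. GRAPHICAL GAUSSIAN BANDITS

We now consider multi-armed bandits with Gaussian rewards in which the decision-maker cannot move to every other arm from the current arm. Let the arms that can be visited from arm $i$ be $\mathrm{ne}(i) \subseteq \{1,\dots,N\}$. Such a multi-armed bandit can be represented by a graph $G$ with node set $\{1,\dots,N\}$ and edge set $E = \{(i, j) \mid j \in \mathrm{ne}(i),\ i \in \{1,\dots,N\}\}$. We assume that the graph is connected in the sense that there exists at least one path from each node $i \in \{1,\dots,N\}$ to every other node $j \in \{1,\dots,N\}$.

A. The Graphical Block UCL Algorithm

For the graphical Gaussian bandits, we develop an algorithm similar to the block allocation algorithm, namely, the graphical block UCL algorithm. Similar to the block allocation algorithm, at each comparison block, the arm with the maximum upper credible limit is determined.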
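The graph representation and the connectivity assumption of Section IV can be made concrete with a small sketch. The code below is an illustration only: the dictionary-of-lists encoding of $\mathrm{ne}(i)$ and the helper names `hop_distances` and `is_connected` are assumptions, not notation from the paper. It uses breadth-first search to count the edges between arms and to check that every arm can reach every other arm.

```python
from collections import deque


def hop_distances(ne, source):
    """Breadth-first search: number of edges on a shortest path from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        i = queue.popleft()
        for j in ne[i]:                 # arms that can be visited from arm i
            if j not in dist:
                dist[j] = dist[i] + 1
                queue.append(j)
    return dist


def is_connected(ne):
    """True if every arm can reach every other arm along edges of G."""
    return all(len(hop_distances(ne, i)) == len(ne) for i in ne)


# Hypothetical corridor-like environment with four regions in a row.
corridor = {1: [2], 2: [1, 3], 3: [2, 4], 4: [3]}
reachable = is_connected(corridor)
```

In this encoding, a non-convex environment discretized into cells would list, for each cell, only the cells that can be reached from it in one move.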
Since the arm with the maximum upper credible limit may not be immediately reachable from the current arm, the graphical block UCL algorithm traverses a shortest path from the current arm to the arm with the maximum upper credible limit. The key intuition behind the algorithm is that the block allocation strategy results in an expected number of transitions that is sub-logarithmic in the horizon length. In the context of graphical bandits, sub-logarithmic transitions result in sub-logarithmic undesired visits to the arms on the chosen shortest path to the desired arm. Consequently, the regret of the algorithm is dominated by the logarithmic term.

B. Regret Analysis of the Graphical Block UCL Algorithm

We now analyze the performance of the graphical block UCL algorithm. Let $\{R_t^{\mathrm{GUCL}}\}_{t \in \{1,\dots,T\}}$ be the sequence of expected regret of the graphical block UCL algorithm. The graphical block UCL algorithm achieves a logarithmic expected cumulative regret as formalized in the following theorem.

Theorem 2 (Regret of Graphical Block UCL Algorithm): The following statements hold for the graphical Gaussian multi-armed bandit problem with the graphical block UCL algorithm and an uninformative prior:

(i) the expected number of times a suboptimal arm $i$ is chosen until time $T$ satisfies

$$E[n_i^T] \le \gamma_1^i \log T - \frac{4\beta^2\sigma_s^2}{\Delta_i^2} \log\log T + \gamma_2^i + \sum_{j=1,\, j \ne i^*}^{N} \left( (2\gamma_1^j \log 2) \log\log T + 2\gamma_3^j \right) + 1;$$

(ii) the cumulative regret until time $T$ satisfies

$$\sum_{t=1}^{T} R_t^{\mathrm{GUCL}} \le \sum_{i=1}^{N} \Delta_i \left( \gamma_1^i \log T - \frac{4\beta^2\sigma_s^2}{\Delta_i^2} \log\log T + \gamma_2^i + \sum_{j=1,\, j \ne i^*}^{N} \left( (2\gamma_1^j \log 2) \log\log T + 2\gamma_3^j \right) + 1 \right).$$

Proof: See Appendix.

V. COMPARISON WITH ANIMAL FORAGING

In this section, we compare the behavior of the block allocation algorithm for the multi-armed bandits with the animal foraging behavior reported in the literature. Consider the foraging environment as composed of patches, where each patch has sources of energy that are modeled by Gaussian random variables with an unknown mean and a known variance. The exploration-exploitation tradeoff in the foraging problem can be modeled by the multi-armed bandit problem. In particular, the foraging objective of animals is to maximize the net energy accumulation rate, which in the multi-armed bandit setting maps to maximizing the expected cumulative reward while minimizing the travel time, i.e., minimizing the number of transitions among arms.

The solution to the multi-armed bandit problem naturally answers the first two fundamental questions studied in optimal foraging theory: (i) which environment patch should the animal visit next? (ii) how long should the animal stay in that patch? However, the solution to the multi-armed bandit problem does not answer the third fundamental question: which foraging path should the animal choose in each patch? To understand the third question, it is natural to envision that points within a patch are highly correlated in terms of the energy accumulation, i.e., each point within a patch provides energy at somewhat the same rate, and accordingly the energy can be accumulated, e.g., via an ergodic random walk.

For simplicity of analysis, in this paper, we assume that the arms are uncorrelated and the prior is uninformative. In general, the prior may be informative and arms may be correlated. The algorithm proposed in this paper extends to this case by simply replacing the $N$ univariate inference procedures with an N-variate inference procedure. The correlation structure captures the structure of the environment: higher correlation describes a smoother environment, while lower correlation describes a rougher environment. In a sufficiently correlated environment, the block allocation algorithm at allocation round $(k, r)$ picks an arm with the highest value of $Q_i^{kr}$ and samples it $k$ times. At the subsequent allocation instance, due to the correlation structure the uncertainty in the estimates for the nearby locations will go down while the uncertainty in the far-off locations would remain high. Consequently, the component of $Q_i^{kr}$ associated with the width of the credible set will be higher for the far-off locations than the nearby locations. If the prior means are assumed to be uniform, the block allocation strategy at the next allocation instance will select a location far-off from the current location. This is a central feature of the macroscopic foraging models, including the Lévy flight model and the intermittent search model. Thus, the Bayesian multi-armed bandit problem and the associated block allocation strategy qualitatively capture the behavior of Lévy flights and related macroscopic models for search.
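The effect described above, where sampling one location shrinks the uncertainty at nearby locations more than at far-off ones, can be illustrated with a small N-variate Gaussian update. This is a sketch under assumptions that are not in the paper: arms are placed on a line, the prior covariance decays exponentially with the distance $|i - j|$ (the kernel and its `length_scale` are hypothetical choices), and the helper names are illustrative.

```python
import math


def exp_kernel_prior(N, sigma0, length_scale):
    """Assumed prior covariance: correlation decays with distance |i - j|."""
    return [[sigma0 ** 2 * math.exp(-abs(i - j) / length_scale)
             for j in range(N)] for i in range(N)]


def observe(mu, cov, i, reward, sigma_s):
    """Condition an N-variate Gaussian belief on one noisy reward from arm i."""
    N = len(mu)
    denom = cov[i][i] + sigma_s ** 2              # variance of the observation
    gain = [cov[j][i] / denom for j in range(N)]  # how much each arm learns
    new_mu = [mu[j] + gain[j] * (reward - mu[i]) for j in range(N)]
    new_cov = [[cov[a][b] - gain[a] * cov[i][b] for b in range(N)]
               for a in range(N)]
    return new_mu, new_cov


# Sample arm 0 for a block of five allocation instances (hypothetical rewards).
mu = [0.0] * 10
cov = exp_kernel_prior(10, sigma0=1.0, length_scale=2.0)
for _ in range(5):
    mu, cov = observe(mu, cov, 0, reward=1.0, sigma_s=1.0)
posterior_var = [cov[j][j] for j in range(10)]
```

After the block, the posterior variance grows with the distance from the sampled arm, so with uniform prior means the credible-set width term favors far-off arms at the next round, in line with the relocation behavior discussed in the text.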
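Overall, the multi-armed bandit problem with transition costs models the fundamental foraging objective as defined in the optimal foraging literature, and its solution yields search trajectories akin to those described by macroscopic search models. Moreover, the solution to the multi-armed bandit problem with transition costs naturally provides the decision mechanisms involved with the search process. Therefore, the multi-armed bandit problem setup has the potential to provide an overarching framework that brings together the classical optimal foraging theory, the Lévy flight based macroscopic search models, and the decision-mechanism based search models.

VI. CONCLUSIONS

We studied two variations of the Gaussian multi-armed bandit problem, namely, the Gaussian multi-armed bandit problem with transition cost, and the graphical Gaussian multi-armed bandit problem, and developed block allocation algorithms that uniformly achieve an expected cumulative regret dominated by a logarithmic function of time, and a number of expected cumulative transitions among the arms dominated by a double-logarithmic function of time.

We drew some qualitative connections between foraging behavior of some animals and the behavior of the block allocation algorithm. In particular, we argued that the multi-armed bandit problem models the foraging objective in optimal foraging theory well and the associated block allocation strategy captures the key features of popular macroscopic search models.

At this stage, we observe and point out the potential of the multi-armed bandit problem and the associated block allocation algorithm to bridge the gap between classical optimal foraging theory and recent macroscopic search models. This suggests an exciting new avenue of inquiry in which the bandit model may prove valuable for future study of animal foraging. In the future, we plan to investigate the bandit model more extensively in the context of empirical work on both animal and robotic foraging.

APPENDIX

A. Proof of regret of the block UCL algorithm

Proof of Theorem 1: We start by establishing the first statement. For a given $t$, let $(k_t, r_t)$ be the lexicographically maximum tuple such that $\tau_{k_t r_t} \le t$. We note that

$$
\begin{aligned}
n_i^T &= \sum_{t=1}^{T} \mathbf{1}(i_t = i) \\
&= \sum_{t=1}^{T} \Big( \mathbf{1}(i_t = i \ \&\ n_i^{k_t r_t} < \eta) + \mathbf{1}(i_t = i \ \&\ n_i^{k_t r_t} \ge \eta) \Big) \\
&\le \eta + \ell + \sum_{t=1}^{T} \mathbf{1}(i_t = i \ \&\ n_i^{k_t r_t} \ge \eta) \\
&\le \eta + \ell + \sum_{k=1}^{\ell} \sum_{r=1}^{b_k} k \, \mathbf{1}(i_{\tau_{kr}} = i \ \&\ n_i^{kr} \ge \eta).
\end{aligned} \tag{1}
$$

It can be shown (see [20] for details) that if we choose

$$\eta = \frac{8\beta^2\sigma_s^2}{\Delta_i^2} \left( \log T - \frac{1}{2} \log\log T \right) + \frac{4\beta^2\sigma_s^2}{\Delta_i^2} (1 - \log 2),$$

then

$$E[n_i^T] \le \eta + \ell + \frac{2}{K} \sum_{k=1}^{\ell} \sum_{r=1}^{b_k} \frac{k}{\tau_{kr}}. \tag{2}$$

We now focus on the term $\sum_{k=1}^{\ell} \sum_{r=1}^{b_k} k/\tau_{kr}$. We note that $\tau_{kr} = 2^{k-1} + (r-1)k$, and hence

$$\sum_{r=1}^{b_k} \frac{k}{\tau_{kr}} = \frac{k}{2^{k-1}} + \sum_{r=2}^{b_k} \frac{k}{2^{k-1} + (r-1)k} \le \frac{k}{2^{k-1}} + \int_{1}^{b_k} \frac{k}{k(x-1) + 2^{k-1}} \, dx \le \frac{k}{2^{k-1}} + \log 2. \tag{3}$$

Since $T \ge 2^{\ell-1}$, it follows that $\ell \le 1 + \log_2 T =: \bar\ell$. Therefore, inequalities (2) and (3) yield

$$
\begin{aligned}
E[n_i^T] &\le \eta + \ell + \frac{2}{K} \sum_{k=1}^{\ell} \left( \frac{k}{2^{k-1}} + \log 2 \right) \\
&\le \eta + \bar\ell + \frac{8}{K} + \frac{2 \log 2}{K} \, \bar\ell \\
&\le \gamma_1^i \log T - \frac{4\beta^2\sigma_s^2}{\Delta_i^2} \log\log T + \gamma_2^i.
\end{aligned}
$$

We now establish the second statement. In the spirit of [7], we note that the number of times the decision-maker transitions to arm $i$ from another arm in frame $f_k$ is equal to the number of times arm $i$ is selected in frame $f_k$ divided by the length of each block in frame $f_k$. Consequently,

$$
\begin{aligned}
s_i^T &\le \sum_{k=1}^{\ell} \frac{n_i^{2^k} - n_i^{2^{k-1}}}{k} \\
&= \frac{n_i^{2^\ell}}{\ell} + \sum_{k=1}^{\ell-1} n_i^{2^k} \left( \frac{1}{k} - \frac{1}{k+1} \right) - n_i^{2^0} \\
&\le \frac{n_i^{2^\ell}}{\ell} + \sum_{k=1}^{\ell-1} \frac{n_i^{2^k}}{k^2}.
\end{aligned}
$$

Therefore, it follows that

$$E[s_i^T] \le \frac{E[n_i^{2^\ell}]}{\ell} + \sum_{k=1}^{\ell-1} \frac{E[n_i^{2^k}]}{k^2}. \tag{4}$$

Substituting $E[n_i^{2^k}]$ in inequality (4) with the derived upper bounds and performing some algebraic manipulations, we obtain

$$E[s_i^T] \le (\gamma_1^i \log 2) \log\log T + \gamma_3^i.$$

We now establish the last statement. The bound on the cumulative regret follows from the definition and the first statement. To establish the bound on the cumulative switching cost, we note that

$$
\begin{aligned}
\sum_{t=1}^{T} S_t^{\mathrm{BUCL}} &\le \sum_{i=1,\, i \ne i^*}^{N} \bar c_{i,\max} E[s_i^T] + \bar c_{i^*,\max} E[s_{i^*}^T] \\
&\le \sum_{i=1,\, i \ne i^*}^{N} (\bar c_{i,\max} + \bar c_{i^*,\max}) E[s_i^T] + \bar c_{i^*,\max},
\end{aligned} \tag{5}
$$

where the second inequality follows from the observation that $s_{i^*}^T \le \sum_{i=1,\, i \ne i^*}^{N} s_i^T + 1$. The final expression follows from inequality (5) and the second statement.

B. Proof of regret of the graphical block UCL algorithm

Proof of Theorem 2: We start by establishing the first statement. We classify the selection of arms in two categories, namely, the goal selection and the transient selection. The goal selection of an arm corresponds to the situation in which the arm has the maximum upper credible limit, while the transient selection corresponds to the situation in which the arm is selected because it belongs to the chosen shortest path to the arm with the maximum credible limit. We note that due to transient selections, the number of frames until time $T$ is at most equal to the number of frames if there are no transient selections. Consequently, the expected number of goal selections of a suboptimal arm $i$ is upper bounded by the expected number of selections of arm $i$ in the block allocation algorithm, i.e.,

$$E[n_{\mathrm{goal},i}^T] \le \gamma_1^i \log T - \frac{4\beta^2\sigma_s^2}{\Delta_i^2} \log\log T + \gamma_2^i.$$

Moreover, the number of transient selections of arm $i$ is upper bounded by the total number of transitions from an arm to another arm in the block allocation algorithm, i.e.,

$$E[n_{\mathrm{transient},i}^T] \le \sum_{j=1,\, j \ne i^*}^{N} \left( (2\gamma_1^j \log 2) \log\log T + 2\gamma_3^j \right) + 1.$$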
The expected number of selections of arm $i$ is the sum of the expected number of transient selections and the expected number of goal selections, and thus the first statement follows. The second statement follows immediately from the definition of the cumulative regret.

REFERENCES

[1] S. Bubeck and N. Cesa-Bianchi. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Machine Learning, 5(1):1-122, 2012.
[2] T. L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6(1):4-22.
[3] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2):235-256, 2002.
[4] N. Srinivas, A. Krause, S. M. Kakade, and M. Seeger. Information-theoretic regret bounds for Gaussian process optimization in the bandit setting. IEEE Transactions on Information Theory, 58(5), 2012.
[5] E. Kaufmann, O. Cappé, and A. Garivier. On Bayesian upper confidence bounds for bandit problems. In Int. Conf. on Artificial Intelligence and Statistics, La Palma, Canary Islands, Spain, April 2012.
[6] P. Reverdy, V. Srivastava, and N. E. Leonard. Modeling human decision-making in multi-armed bandits. In Multidisciplinary Conf. on Reinforcement Learning and Decision Making, Princeton, NJ, USA, Oct 2013.
[7] R. Agrawal, M. V. Hegde, and D. Teneketzis. Asymptotically efficient adaptive allocation rules for the multi-armed bandit problem with switching cost. IEEE Transactions on Automatic Control, 33(10).
[8] R. Kleinberg, A. Niculescu-Mizil, and Y. Sharma. Regret bounds for sleeping experts and bandits. Machine Learning, 80(2-3):245-272, 2010.
[9] G. H. Pyke, H. R. Pulliam, and E. L. Charnov. Optimal foraging: a selective review of theory and tests. Quarterly Review of Biology.
[10] D. W. Stephens and J. R. Krebs. Foraging Theory. Princeton University Press.
[11] G. M. Viswanathan, M. G. E. da Luz, E. P. Raposo, and H. E. Stanley. The Physics of Foraging: An Introduction to Random Searches and Biological Encounters. Cambridge University Press, 2011.
[12] E. Gelenbe, N. Schmajuk, J. Staddon, and J. Reif. Autonomous search by robots and animals: A survey. Robotics and Autonomous Systems, 22(1):23-34.
[13] S. R. X. Dall, L. Giraldeau, O. Olsson, J. M. McNamara, and D. W. Stephens. Information and its use by animals in evolutionary ecology. Trends in Ecology & Evolution, 20(4), 2005.
[14] G. M. Viswanathan, S. V. Buldyrev, S. Havlin, M. G. E. da Luz, E. P. Raposo, and H. E. Stanley. Optimizing the success of random searches. Nature.
[15] O. Bénichou, C. Loverdo, M. Moreau, and R. Voituriez. Intermittent search strategies. Reviews of Modern Physics, 83(1):81, 2011.
[16] J. R. Krebs, A. Kacelnik, and P. Taylor. Test of optimal sampling by foraging great tits. Nature.
[17] T. Keasar, E. Rashkovich, D. Cohen, and A. Shmida. Bees in two-armed bandit situations: Foraging choices and possible decision mechanisms. Behavioral Ecology, 13(6), 2002.
[18] A. M. Hein and S. A. McKinley. Sensing and decision-making in random search. Proceedings of the National Academy of Sciences, 109(30), 2012.
[19] S. Thrun, W. Burgard, and D. Fox. Probabilistic Robotics. The MIT Press, 2005.
[20] P. Reverdy, V. Srivastava, and N. E. Leonard. Modeling human decision-making in generalized Gaussian multi-armed bandits. arXiv preprint, July 2013.
