One Practical Algorithm for Both Stochastic and Adversarial Bandits


(Full Version Including Appendices)

Yevgeny Seldin (YEVGENY.SELDIN@GMAIL.COM)
Queensland University of Technology, Brisbane, Australia

Aleksandrs Slivkins (SLIVKINS@MICROSOFT.COM)
Microsoft Research, New York NY, USA

Proceedings of the 31st International Conference on Machine Learning, Beijing, China, 2014. JMLR: W&CP volume 32. Copyright 2014 by the author(s).

Abstract

We present an algorithm for multiarmed bandits that achieves almost optimal performance in both stochastic and adversarial regimes without prior knowledge about the nature of the environment. Our algorithm is based on augmentation of the EXP3 algorithm with a new control lever in the form of exploration parameters that are tailored individually for each arm. The algorithm simultaneously applies the "old" control lever, the learning rate, to control the regret in the adversarial regime and the new control lever to detect and exploit gaps between the arm losses. This secures problem-dependent "logarithmic" regret when gaps are present without compromising on the worst-case performance guarantee in the adversarial regime. We show that the algorithm can exploit both the usual expected gaps between the arm losses in the stochastic regime and deterministic gaps between the arm losses in the adversarial regime. The algorithm retains a "logarithmic" regret guarantee in the stochastic regime even when some observations are contaminated by an adversary, as long as on average the contamination does not reduce the gap by more than a half. Our results for the stochastic regime are supported by experimental validation.

1. Introduction

Stochastic multiarmed bandits (Thompson, 1933; Robbins, 1952; Lai & Robbins, 1985; Auer et al., 2002a) and adversarial multiarmed bandits (Auer et al., 1995; 2002b) have co-existed in parallel for almost two decades by now, in the sense that no algorithm for stochastic multiarmed bandits is applicable to adversarial multiarmed bandits and algorithms for adversarial bandits are unable to exploit the simpler regime of stochastic bandits. The recent attempt of Bubeck & Slivkins (2012) to bring them together did not make it in the full sense of unification, since the algorithm of Bubeck and Slivkins relies on the knowledge of the time horizon and makes a one-time irreversible switch between stochastic and adversarial operation modes
if the beginning of the game is estimated to exhibit adversarial behavior.

We present an algorithm that treats both stochastic and adversarial multiarmed bandit problems without distinguishing between them. Our algorithm "just runs", as most other bandit algorithms, without knowledge of the time horizon and without making any hard statements about the nature of the environment. We show that if the environment happens to be adversarial the performance of the algorithm is just a factor of 2 worse than the performance of the EXP3 algorithm with the best constants, as described in Bubeck & Cesa-Bianchi (2012), and if the environment happens to be stochastic the performance of our algorithm is comparable to the performance of UCB1 of Auer et al. (2002a). Thus, we cover the full range and achieve almost optimal performance at the extreme points.

Furthermore, we show that the new algorithm can exploit both the usual expected gaps between the arm losses in the stochastic regime and deterministic gaps between the arm losses in the adversarial regime. We also show that the algorithm retains a "logarithmic" regret guarantee in the stochastic regime even when some observations are adversarially contaminated, as long as on average the contamination does not reduce the gap by more than a half. To the best of our knowledge, no other algorithm has yet been shown to be able to exploit gaps in the adversarial or adversarially contaminated stochastic regimes. The contaminated stochastic regime is a very practical model, since in many real-life situations we are dealing with stochastic environments with occasional disturbances.

Since the introduction of Thompson's sampling (Thompson, 1933), which was analyzed only after 80 years (Kaufmann et al., 2012; Agrawal & Goyal, 2013), a variety of algorithms were invented for the stochastic multiarmed bandit problem. The most powerful for today are KL-UCB (Cappé et al., 2013), EwS (Maillard, 2011), and the aforementioned Thompson's sampling. It is easy to show that any deterministic algorithm can potentially suffer linear regret in the adversarial regime (see the supplementary material for a proof). Although nothing is known about the performance of randomized algorithms for stochastic bandits in the adversarial regimes, empirically they are extremely sensitive to deviations from the stochastic assumption.

In the adversarial world the most powerful algorithm for today is INF (Audibert & Bubeck, 2009; Bubeck & Cesa-Bianchi, 2012). Nevertheless, the EXP3 algorithm of Auer et al. (2002b) still retains an important place, mainly due to its simplicity and wide applicability, which covers combinatorial bandits, partial monitoring games, and many other adversarial problems. Since any stochastic problem can be seen as an instance of an adversarial problem, both INF and EXP3 have the worst-case "root-t" regret guarantee in the stochastic regime, but it is not known whether they can do better. Empirically in the stochastic regime EXP3 is inferior to all other known algorithms for this setting, including the simplest UCB1 algorithm.

It is interesting to take a brief look into the development of EXP3. The algorithm was first suggested in Auer et al. (1995) and its parametrization and analysis were improved in Auer et al. (2002b). The EXP3 of Auer et al. was designed for the multiarmed bandit game with rewards and its playing strategy is based on mixing a Gibbs distribution (also known as exponential weights) with a uniform exploration distribution in proportion to the learning rate. The uniform exploration leaves no hope for achieving "logarithmic" regret in the stochastic regime simultaneously with the "root-t" regret in the adversarial regime, since each arm is played at least $\Omega(\sqrt{t})$ times in $t$ rounds of the game. By changing the learning rate Cesa-Bianchi & Fischer (1998) managed to derive a different parametrization of the algorithm that was shown to achieve "logarithmic" regret in the stochastic regime, but it had no regret guarantees in the adversarial regime. Stoltz (2005) has observed that in the game with losses the "root-t" regret guarantee in the adversarial regime can be achieved without mixing in the
uniform distribution (and even leads to better constants). However, mixing in any distribution that element-wise does not exceed the learning rate does not break the worst-case performance of the algorithm in the game with losses. We exploit this emerged freedom in order to derive a modification of the EXP3 algorithm that achieves almost optimal regret in both adversarial and stochastic regimes without prior knowledge about the nature of the environment.

Footnote: Rewards can be transformed into losses by taking $\ell_t^a = 1 - r_t^a$.

2. Problem Setting

We study the multiarmed bandit (MAB) game with losses. In each round $t$ of the game the algorithm chooses one action $A_t$ among $K$ possible actions, a.k.a. arms, and observes the corresponding loss $\ell_t^{A_t}$. The losses of other arms are not observed. There is a large number of loss generation models, four of which are considered below. In this work we restrict ourselves to loss sequences $\{\ell_t^a\}_{t,a}$ that are generated independently of the algorithm's actions. Under this assumption we can assume that the loss sequences are written down before the game starts (but not revealed to the algorithm). We also make the standard assumption that the losses are bounded in the $[0,1]$ interval.

The performance of the algorithm is quantified by regret, defined as the difference between the expected loss of the algorithm up to round $t$ and the expected loss of the best arm up to round $t$:

$$R(t) = \sum_{s=1}^{t} \mathbb{E}\left[\ell_s^{A_s}\right] - \min_a \left\{ \mathbb{E}\left[\sum_{s=1}^{t} \ell_s^a\right] \right\}.$$

The expectation is taken over the possible randomness of the algorithm and the loss generation model. The goal of the algorithm is to minimize the regret.

We consider two standard loss generation models, the adversarial regime and the stochastic regime, and two intermediate regimes, the contaminated stochastic regime and the adversarial regime with a gap.

Adversarial regime. In this regime the loss sequences are generated by an unrestricted adversary (who is oblivious to the algorithm's actions). This is the most general setting and the other three regimes can be seen as special cases of the adversarial regime. An arm $a \in \arg\min_{a'} \left(\sum_{s=1}^{t} \ell_s^{a'}\right)$ is known as a best arm in hindsight for the first $t$ rounds.

Stochastic regime. In this regime the losses $\ell_t^a$ are sampled independently from an unknown distribution that depends on $a$, but not on $t$. We use $\mu(a) = \mathbb{E}[\ell_t^a]$ to denote the expected loss of arm $a$. Arm $a$ is called a best arm if $\mu(a) = \min_{a'} \{\mu(a')\}$ and suboptimal otherwise; let $a^*$ denote some best arm. For each arm $a$, define the gap $\Delta(a) = \mu(a) - \mu(a^*)$. Let $\Delta = \min_{a : \Delta(a) > 0} \{\Delta(a)\}$ denote the minimal gap. Letting $N_t(a)$ be the number of times arm $a$ was played up to (and including) round $t$, the regret can be rewritten as

$$R(t) = \sum_a \mathbb{E}[N_t(a)]\, \Delta(a). \qquad (1)$$

Contaminated stochastic regime. In this regime the adversary picks some round-arm pairs $(t, a)$ ("locations") before the game starts and assigns the loss values there in an arbitrary way. The remaining losses are generated according to the stochastic regime. We call a contaminated stochastic regime moderately contaminated after $\tau$ rounds if for all $t \ge \tau$ the total number of contaminated locations of each suboptimal arm up to time $t$ is at most $t\Delta(a)/4$ and the number of contaminated locations of each best arm is at most $t\Delta/4$. By this definition, for all $t \ge \tau$ on average over the stochasticity of the loss sequences the adversary can reduce the gap of every arm by at most a half.

Adversarial regime with a gap. An adversarial regime is named by us an adversarial regime with a gap if there exists a round $\tau$ and an arm $a_\tau^*$ that persists to be the best arm in hindsight for all rounds $t \ge \tau$. We name such an arm a consistently best arm after round $\tau$. If no such arm exists then $a_\tau^*$ is undefined. Note that if $a_\tau^*$ is defined for some $\tau$ then $a_{\tau'}^*$ is defined for all $\tau' > \tau$. We use $\lambda_t(a) = \sum_{s=1}^{t} \ell_s^a$ to denote the cumulative loss of arm $a$. Whenever $a_\tau^*$ is defined we define a deterministic gap of arm $a$ on round $\tau$ as:

$$\Delta(\tau, a) = \min_{t \ge \tau} \left\{ \frac{1}{t} \left( \lambda_t(a) - \lambda_t(a_\tau^*) \right) \right\}.$$

If $a_\tau^*$ is undefined, $\Delta(\tau, a)$ is defined as zero.

Notation. We use $\mathbb{1}\{\mathcal{E}\}$ to denote the indicator function of event $\mathcal{E}$ and $\mathbb{1}_t^a = \mathbb{1}\{A_t = a\}$ to denote the indicator function of the event that arm $a$ was played on round $t$.

3. Main Results

Our main results include a new algorithm, which we name EXP3++, and its analysis in the four regimes defined in the previous section. The EXP3++ algorithm, provided in the Algorithm 1 box, is a generalization of the EXP3 algorithm with losses.

Algorithm 1: EXP3++
    Remark: see text for the definition of $\eta_t$ and $\xi_t(a)$.
    $\forall a$: $\tilde{L}_0(a) = 0$
    for $t = 1, 2, \dots$ do
        $\beta_t = \frac{1}{2} \sqrt{\frac{\ln K}{tK}}$
        $\forall a$: $\varepsilon_t(a) = \min\left\{ \frac{1}{2K}, \beta_t, \xi_t(a) \right\}$
        $\forall a$: $\rho_t(a) = e^{-\eta_t \tilde{L}_{t-1}(a)} \Big/ \sum_{a'} e^{-\eta_t \tilde{L}_{t-1}(a')}$
        $\forall a$: $\tilde{\rho}_t(a) = \left(1 - \sum_{a'} \varepsilon_t(a')\right) \rho_t(a) + \varepsilon_t(a)$
        Draw action $A_t$ according to $\tilde{\rho}_t$ and play it
        Observe and suffer the loss $\ell_t^{A_t}$
        $\forall a$: $\tilde{\ell}_t^a = \frac{\ell_t^{A_t}}{\tilde{\rho}_t(a)} \mathbb{1}_t^a$
        $\forall a$: $\tilde{L}_t(a) = \tilde{L}_{t-1}(a) + \tilde{\ell}_t^a$
    end for

The EXP3++ algorithm has two control levers: the learning rate $\eta_t$ and the exploration parameters $\xi_t(a)$. The EXP3 with losses (as described in Bubeck & Cesa-Bianchi (2012)) is a special case of EXP3++ with $\eta_t = 2\beta_t$ and $\xi_t(a) = 0$.

The crucial innovation in EXP3++ is the introduction of exploration parameters $\xi_t(a)$, which are tuned individually for each arm depending on the past observations. In the sequel we show that tuning only the learning rate $\eta_t$ suffices to control the regret of EXP3++ in the adversarial regime, irrespective of the choice of the exploration parameters $\xi_t(a)$. Then
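we show that tuning only the exploration parameters $\xi_t(a)$ suffices to control the regret of EXP3++ in the stochastic regime irrespective of the choice of $\eta_t$, as long as $\eta_t \ge \beta_t$. Applying the two control levers simultaneously we obtain an algorithm that achieves the optimal "root-t" regret in the adversarial regime (up to logarithmic factors) and almost optimal "logarithmic" regret in the stochastic regime (though with a suboptimal power in the logarithm). Then we show that the new control lever is even more powerful and allows us to detect and exploit the gap in even more challenging situations, including the moderately contaminated stochastic regime and the adversarial regime with a gap.

Adversarial Regime

First, we show that tuning $\eta_t$ is sufficient to control the regret of EXP3++ in the adversarial regime.

Theorem 1. For $\eta_t = \beta_t$ and any $\xi_t(a) \ge 0$ the regret of EXP3++ for any $t$ satisfies:

$$R(t) \le 4\sqrt{Kt \ln K}.$$

Note that the regret bound in Theorem 1 is just a factor of 2 worse than the regret of EXP3 with losses (Bubeck & Cesa-Bianchi, 2012).

Stochastic Regime

Now we show that for any $\eta_t \ge \beta_t$ tuning the exploration parameters $\xi_t(a)$ suffices to control the regret of the algorithm in the stochastic regime. By choosing $\eta_t = \beta_t$ we obtain algorithms that have both the optimal "root-t" regret scaling in the adversarial regime and "logarithmic" regret scaling in the stochastic regime.

We consider a number of different ways of tuning the exploration parameters $\xi_t(a)$, which lead to different parametrizations of EXP3++. We start with an idealistic assumption that the gap is known, just to give an idea of what is the best result we can hope for.

Theorem 2. Assume that the gaps $\Delta(a)$ are known. For any choice of $\eta_t \ge \beta_t$ and any $c \ge 18$, the regret of EXP3++ with $\xi_t(a) = \frac{c \ln(t\Delta(a)^2)}{t\Delta(a)^2}$ in the stochastic regime satisfies:

$$R(t) \le \sum_a O\left(\frac{\ln(t)^2}{\Delta(a)}\right) + \sum_a \tilde{O}\left(\frac{K}{\Delta(a)^3}\right).$$

The constants in this theorem are small and are provided explicitly in the analysis. We also show that $c$ can be made almost as small as 2.

Next we show that using the empirical gap as an estimate of the true gap,

$$\hat{\Delta}_t(a) = \min\left\{1, \frac{1}{t}\left(\tilde{L}_t(a) - \min_{a'} \tilde{L}_t(a')\right)\right\}, \qquad (2)$$

we can also achieve a polylogarithmic regret guarantee. We call this algorithm EXP3++$^{\mathrm{AVG}}$.

Theorem 3. Let $c \ge 18$ and $\eta_t \ge \beta_t$. Let $t^*$ be the minimal integer that satisfies $t^* \ge \frac{4c^2 K \ln(t^*)^4}{\ln K}$ and let $t^*(a) = \max\left\{t^*, e^{1/\Delta(a)^2}\right\}$. The regret of EXP3++ with $\xi_t(a) = \frac{c \ln(t)^2}{t \hat{\Delta}_{t-1}(a)^2}$ (termed EXP3++$^{\mathrm{AVG}}$) in the stochastic regime satisfies:

$$R(t) \le \sum_a O\left(\frac{\ln(t)^3}{\Delta(a)}\right) + \sum_a \Delta(a)\, t^*(a).$$

Although the additive constants in this theorem are very large, in the experimental section we show that a minor modification of this algorithm performs comparably to UCB1 in the stochastic regime (and has the adversarial regret guarantee in addition).

In the following theorem we show that if we assume a known time horizon $T$, then we can eliminate the additive term $e^{1/\Delta(a)^2}$ in the regret bound. The algorithm in Theorem 4 replaces the empirical gap estimate in the definition of $\xi_t(a)$ with a lower confidence bound on the gap and slightly adjusts other terms. We name this algorithm EXP3++$^{\mathrm{LCBT}}$.

Theorem 4. Consider the stochastic regime with a known time horizon $T$. The EXP3++$^{\mathrm{LCBT}}$ algorithm with any $\eta_t \ge \beta_t$ and appropriately defined $\xi_t(a)$ achieves regret $R(T) \le O(\log^3 T)$.

The precise definition of EXP3++$^{\mathrm{LCBT}}$ and the proof of Theorem 4 are provided in the supplementary material. It seems that simultaneous elimination of the assumption on the known time horizon and of the exponentially large additive term is a very challenging problem and we defer it to future work.

Contaminated Stochastic Regime

Next we show that EXP3++$^{\mathrm{AVG}}$ can sustain moderate contamination in the stochastic regime without a significant deterioration in performance.

Theorem 5. Under the parametrization given in Theorem 3, for $t^*(a) = \max\left\{t^*, e^{4/\Delta(a)^2}\right\}$, where $t^*$ is defined as before, the regret of EXP3++$^{\mathrm{AVG}}$ in the stochastic regime that is moderately contaminated after $\tau$ rounds satisfies:

$$R(t) \le \sum_a O\left(\frac{\ln(t)^3}{\Delta(a)}\right) + \sum_a \max\{t^*(a), \tau\}.$$

The price that is paid for moderate contamination after $\tau$ rounds is the scaling of $\Delta(a)$ by a factor of $1/2$ and the additive factor of $\tau$. (The scaling of $\Delta(a)$ affects the definition of $t^*(a)$ and the constant in $O\left(\ln(t)^3/\Delta(a)\right)$.) As before, the regret guarantee of Theorem 5 comes in addition to the guarantee of Theorem 1.

Adversarial Regime with a Gap

Finally, we show that EXP3++$^{\mathrm{AVG}}$ can also take advantage of a deterministic gap in the adversarial regime.

Theorem 6. Under the parametrization given in Theorem 3, the regret of EXP3++$^{\mathrm{AVG}}$ in the adversarial regime satisfies:

$$R(t) \le \min_{\tau} \left\{ \sum_a \max\left\{t^*, \tau, e^{1/\Delta(\tau,a)^2}\right\} + \sum_a O\left(\frac{\ln(t)^3}{\Delta(\tau,a)^2}\right) \right\}.$$

We remind the reader that in the absence of a consistently best arm $\Delta(\tau, a)$ is defined as zero and the regret bound is vacuous (but the regret bound of Theorem 1 still holds). We also note that $\Delta(\tau, a)$ is a non-decreasing function of $\tau$. Therefore, there is a trade-off: increasing $\tau$ increases $\Delta(\tau, a)$, but loses the regret guarantee on the rounds before $\tau$ (for simplicity, we assume that we have no guarantees before $\tau$). Theorem 6 allows us to pick the $\tau$ that minimizes this trade-off. An important implication of the theorem is that if the deterministic gap is growing with time the regret guarantee improves too.

4. Proofs

We prove the theorems from the previous section in the order they were presented.

The Adversarial Regime

The proof of Theorem 1 relies on the following lemma, which is an intermediate step in the analysis of EXP3 by Bubeck (2010) (see also Bubeck & Cesa-Bianchi (2012)).

Lemma 7. For any $K$ sequences of non-negative numbers $X_1^a, X_2^a, \dots$ indexed by $a \in \{1, \dots, K\}$ and any non-increasing positive sequence $\eta_1, \eta_2, \dots$, for $\rho_t(a) = \frac{\exp\left(-\eta_t \sum_{s=1}^{t-1} X_s^a\right)}{\sum_{a'} \exp\left(-\eta_t \sum_{s=1}^{t-1} X_s^{a'}\right)}$ (assuming for $t = 1$ the sum in the exponent is zero) we have:

$$\sum_{t=1}^{T} \sum_a \rho_t(a) X_t^a - \min_a \left(\sum_{t=1}^{T} X_t^a\right) \le \sum_{t=1}^{T} \frac{\eta_t}{2} \sum_a \rho_t(a) (X_t^a)^2 + \frac{\ln K}{\eta_T}. \qquad (3)$$

More precisely, we are using the following corollary, which follows by allowing the $X_t^a$-s to be random variables and taking expectations of the two sides of (3) and using the fact that $\mathbb{E}[\min[\cdot]] \le \min[\mathbb{E}[\cdot]]$. We decompose expectations of incremental sums into incremental sums of conditional expectations and use $\mathbb{E}_t[\cdot]$ to denote expectations conditioned on the realization of all random variables up to round $t$.

Corollary 8. Let $X_1^a, X_2^a, \dots$ for $a \in \{1, \dots, K\}$ be non-negative random variables and let $\eta_t$ and $\rho_t$ be as defined in Lemma 7. Then:

$$\mathbb{E}\left[\sum_{t=1}^{T} \mathbb{E}_t\left[\sum_a \rho_t(a) X_t^a\right]\right] - \min_a \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{E}_t[X_t^a]\right] \le \mathbb{E}\left[\sum_{t=1}^{T} \frac{\eta_t}{2} \mathbb{E}_t\left[\sum_a \rho_t(a) (X_t^a)^2\right]\right] + \frac{\ln K}{\eta_T}.$$

Proof of Theorem 1. We associate $X_t^a$ in (3) with $\tilde{\ell}_t^a$ in the EXP3++ algorithm. We have $\mathbb{E}_t[\tilde{\ell}_t^a] = \ell_t^a$ and since $\rho_t(a) = \frac{\tilde{\rho}_t(a) - \varepsilon_t(a)}{1 - \sum_{a'} \varepsilon_t(a')} \ge \tilde{\rho}_t(a) - \varepsilon_t(a)$ and $\ell_t^a \in [0, 1]$ we also have:

$$\mathbb{E}_t\left[\sum_a \rho_t(a) \tilde{\ell}_t^a\right] \ge \mathbb{E}_t\left[\sum_a \left(\tilde{\rho}_t(a) - \varepsilon_t(a)\right) \tilde{\ell}_t^a\right] \ge \mathbb{E}_t\left[\ell_t^{A_t}\right] - \sum_a \varepsilon_t(a).$$

As well, we have:

$$\mathbb{E}_t\left[\sum_a \rho_t(a) (\tilde{\ell}_t^a)^2\right] = \mathbb{E}_t\left[\sum_a \rho_t(a) \left(\frac{\ell_t^{A_t}}{\tilde{\rho}_t(a)}\right)^2 \mathbb{1}_t^a\right] \le \sum_a \frac{\rho_t(a)}{\tilde{\rho}_t(a)} = \sum_a \frac{\rho_t(a)}{\left(1 - \sum_{a'} \varepsilon_t(a')\right) \rho_t(a) + \varepsilon_t(a)} \le 2K,$$

where the last inequality follows by the fact that $\left(1 - \sum_{a'} \varepsilon_t(a')\right) \ge \frac{1}{2}$ by the definition of $\varepsilon_t(a)$. Substitution of the above calculations into Corollary 8 yields:

$$R(T) = \mathbb{E}\left[\sum_{t=1}^{T} \ell_t^{A_t}\right] - \min_a \mathbb{E}\left[\sum_{t=1}^{T} \ell_t^a\right] \le K \sum_{t=1}^{T} \eta_t + \frac{\ln K}{\eta_T} + \sum_{t=1}^{T} \sum_a \varepsilon_t(a) \le 2K \sum_{t=1}^{T} \eta_t + \frac{\ln K}{\eta_T}.$$

The result of the theorem follows by the choice of $\eta_t$.

The Stochastic Regime

Our proofs are based on the following form of Bernstein's inequality, which is a minor improvement over Cesa-Bianchi & Lugosi (2006, Lemma A.8) based on the ideas from Boucheron et al. (2013, Theorem 2.10).

Theorem 9 (Bernstein's inequality for martingales). Let $X_1, \dots, X_n$ be a martingale difference sequence with respect to the filtration $\mathcal{F} = (\mathcal{F}_i)_{1 \le i \le n}$ and let $S_i = \sum_{j=1}^{i} X_j$ be the associated martingale. Assume that there exist positive numbers $\nu$ and $c$, such that $X_j \le c$ for all $j$ with probability 1 and $\sum_{i=1}^{n} \mathbb{E}\left[(X_i)^2 \mid \mathcal{F}_{i-1}\right] \le \nu$ with probability 1. Then for all $b > 0$:

$$\mathbb{P}\left[S_n > \sqrt{2\nu b} + \frac{cb}{3}\right] \le e^{-b}.$$

We are also using the following technical lemma, which is proved in the supplementary material.

Lemma 10. For any $c > 0$:

$$\sum_{t=0}^{\infty} e^{-c\sqrt{t}} = O\left(\frac{1}{c^2}\right).$$

The proof of Theorems 2 and 3 is based on the following lemma.

Lemma 11. Let $\{\underline{\varepsilon}_t(a)\}_{t=1}^{\infty}$ be non-increasing deterministic sequences, such that $\underline{\varepsilon}_t(a) \le \varepsilon_t(a)$ with probability 1 and $\underline{\varepsilon}_t(a) \le \underline{\varepsilon}_t(a^*)$ for all $t$ and $a$. Define $\nu_t(a) = \sum_{s=1}^{t} \frac{1}{\underline{\varepsilon}_s(a)}$ and define the event $\mathcal{E}_t^a$:

$$t\Delta(a) - \left(\tilde{L}_t(a) - \tilde{L}_t(a^*)\right) \le \sqrt{2\left(\nu_t(a) + \nu_t(a^*)\right) b_t} + \frac{1.25\, b_t}{3\, \underline{\varepsilon}_t(a^*)}. \qquad (\mathcal{E}_t^a)$$

Then for any positive sequence $b_1, b_2, \dots$ and any $t^* \ge 2$ the number of times arm $a$ is played by EXP3++ up to round $t$ is bounded as:

$$\mathbb{E}[N_t(a)] \le (t^* - 1) + \sum_{s=t^*}^{t} e^{-b_{s-1}} + \sum_{s=t^*}^{t} \varepsilon_s(a)\, \mathbb{1}\{\mathcal{E}_{s-1}^a\} + \sum_{s=t^*}^{t} e^{-\eta_s g_{s-1}(a)},$$

where

$$g_t(a) = t\Delta(a) - \sqrt{2tb_t \left(\frac{1}{\underline{\varepsilon}_t(a)} + \frac{1}{\underline{\varepsilon}_t(a^*)}\right)} - \frac{1.25\, b_t}{3\, \underline{\varepsilon}_t(a^*)}.$$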
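Lemma 10 is used repeatedly to bound the last sum in Lemma 11, and it can be sanity-checked numerically. The sketch below is an illustration rather than part of the proof. It relies on a standard integral comparison that is an assumption of this example: since $e^{-c\sqrt{t}}$ is decreasing in $t$, the sum lies between $\int_0^\infty e^{-c\sqrt{t}}\,dt = 2/c^2$ and $1 + 2/c^2$. The function names and the truncation point `terms` are hypothetical choices.

```python
import math


def exp_sqrt_sum(c, terms=200_000):
    """Partial sum of exp(-c * sqrt(t)) over t = 0, ..., terms - 1."""
    return sum(math.exp(-c * math.sqrt(t)) for t in range(terms))


def integral_bound(c):
    """Upper bound 1 + 2/c^2: the t = 0 term plus the integral over [0, inf)."""
    return 1.0 + 2.0 / (c * c)


if __name__ == "__main__":
    for c in (0.05, 0.2, 1.0):
        # The partial sum should sit between 2/c^2 and 1 + 2/c^2.
        print(c, 2.0 / (c * c), exp_sqrt_sum(c), integral_bound(c))
```

For small $c$ the additive 1 is negligible next to $2/c^2$, which is the $O(1/c^2)$ scaling the lemma states.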

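Proof. Note that the elements of the martingale difference sequence $\left\{\Delta(a) - \left(\tilde{\ell}_t^a - \tilde{\ell}_t^{a^*}\right)\right\}_{t=1}^{\infty}$ are upper bounded by $\frac{1}{\underline{\varepsilon}_t(a^*)} + 1$. Since $\underline{\varepsilon}_t(a^*) \le \varepsilon_t(a^*) \le 1/(2K) \le 1/4$ we can simplify the upper bound by using $\frac{1}{\underline{\varepsilon}_t(a^*)} + 1 \le \frac{1.25}{\underline{\varepsilon}_t(a^*)}$. Further note that

$$\sum_{s=1}^{t} \mathbb{E}_s\left[\left(\Delta(a) - \left(\tilde{\ell}_s^a - \tilde{\ell}_s^{a^*}\right)\right)^2\right] \le \sum_{s=1}^{t} \mathbb{E}_s\left[\left(\tilde{\ell}_s^a - \tilde{\ell}_s^{a^*}\right)^2\right] = \sum_{s=1}^{t} \left(\mathbb{E}_s\left[(\tilde{\ell}_s^a)^2\right] + \mathbb{E}_s\left[(\tilde{\ell}_s^{a^*})^2\right]\right)$$
$$\le \sum_{s=1}^{t} \left(\frac{1}{\tilde{\rho}_s(a)} + \frac{1}{\tilde{\rho}_s(a^*)}\right) \le \sum_{s=1}^{t} \left(\frac{1}{\varepsilon_s(a)} + \frac{1}{\varepsilon_s(a^*)}\right) \le \sum_{s=1}^{t} \left(\frac{1}{\underline{\varepsilon}_s(a)} + \frac{1}{\underline{\varepsilon}_s(a^*)}\right) = \nu_t(a) + \nu_t(a^*)$$

with probability 1. Let $\bar{\mathcal{E}}_t^a$ denote the complement of event $\mathcal{E}_t^a$. Then by Bernstein's inequality $\mathbb{P}\left[\bar{\mathcal{E}}_t^a\right] \le e^{-b_t}$. The number of times arm $a$ is played up to round $t$ is bounded as:

$$\mathbb{E}[N_t(a)] = \sum_{s=1}^{t} \mathbb{P}[A_s = a] \le (t^* - 1) + \sum_{s=t^*}^{t} \mathbb{P}[A_s = a]$$
$$= (t^* - 1) + \sum_{s=t^*}^{t} \left( \mathbb{P}\left[A_s = a \mid \mathcal{E}_{s-1}^a\right] \mathbb{P}\left[\mathcal{E}_{s-1}^a\right] + \mathbb{P}\left[A_s = a \mid \bar{\mathcal{E}}_{s-1}^a\right] \mathbb{P}\left[\bar{\mathcal{E}}_{s-1}^a\right] \right)$$
$$\le (t^* - 1) + \sum_{s=t^*}^{t} \left( \mathbb{P}\left[A_s = a \mid \mathcal{E}_{s-1}^a\right] \mathbb{1}\{\mathcal{E}_{s-1}^a\} + \mathbb{P}\left[\bar{\mathcal{E}}_{s-1}^a\right] \right)$$
$$\le (t^* - 1) + \sum_{s=t^*}^{t} \left( \mathbb{P}\left[A_s = a \mid \mathcal{E}_{s-1}^a\right] \mathbb{1}\{\mathcal{E}_{s-1}^a\} + e^{-b_{s-1}} \right).$$

For the terms of the sum above we have:

$$\mathbb{P}\left[A_t = a \mid \mathcal{E}_{t-1}^a\right] \mathbb{1}\{\mathcal{E}_{t-1}^a\} = \tilde{\rho}_t(a)\, \mathbb{1}\{\mathcal{E}_{t-1}^a\} \le \left(\rho_t(a) + \varepsilon_t(a)\right) \mathbb{1}\{\mathcal{E}_{t-1}^a\}$$
$$= \left(\varepsilon_t(a) + \frac{e^{-\eta_t \tilde{L}_{t-1}(a)}}{\sum_{a'} e^{-\eta_t \tilde{L}_{t-1}(a')}}\right) \mathbb{1}\{\mathcal{E}_{t-1}^a\} \le \left(\varepsilon_t(a) + e^{-\eta_t \left(\tilde{L}_{t-1}(a) - \tilde{L}_{t-1}(a^*)\right)}\right) \mathbb{1}\{\mathcal{E}_{t-1}^a\}$$
$$\le \varepsilon_t(a)\, \mathbb{1}\{\mathcal{E}_{t-1}^a\} + e^{-\eta_t g_{t-1}(a)},$$

where in the last inequality we used the facts that event $\mathcal{E}_{t-1}^a$ holds and that since $\underline{\varepsilon}_t(a)$ is a non-increasing sequence $\nu_t(a) \le \frac{t}{\underline{\varepsilon}_t(a)}$. Substitution of this result back into the computation of $\mathbb{E}[N_t(a)]$ completes the proof.

Proof of Theorem 2. The proof is based on Lemma 11. Let $b_t = \ln(t\Delta(a)^2)$ and $\underline{\varepsilon}_t(a) = \varepsilon_t(a)$. For any $c \ge 18$ and any $t \ge t^*$, where $t^*$ is the minimal integer for which $t^* \ge \frac{4c^2 K \ln(t^*\Delta(a)^2)^2}{\Delta(a)^4 \ln K}$, we have:

$$g_t(a) = t\Delta(a) - \sqrt{2tb_t \left(\frac{1}{\underline{\varepsilon}_t(a)} + \frac{1}{\underline{\varepsilon}_t(a^*)}\right)} - \frac{1.25\, b_t}{3\, \underline{\varepsilon}_t(a^*)} \ge t\Delta(a) - 2\sqrt{\frac{tb_t}{\underline{\varepsilon}_t(a)}} - \frac{1.25\, b_t}{3\, \underline{\varepsilon}_t(a)}$$
$$\ge t\Delta(a) \left(1 - \frac{2}{\sqrt{c}} - \frac{1.25}{3c}\right) \ge \frac{1}{2} t\Delta(a).$$

The choice of $t^*$ ensures that for $t \ge t^*$ and all suboptimal actions $a$ we have $\varepsilon_t(a) = \xi_t(a)$, which slightly simplifies the calculations. Also note that since $\varepsilon_t(a^*) = \min\left\{\frac{1}{2K}, \beta_t\right\}$, asymptotically the $1/\underline{\varepsilon}_t(a)$ term in $g_t(a)$ dominates the $1/\underline{\varepsilon}_t(a^*)$ term and with a bit more careful bounding $c$ can be made almost as small as 2. By substitution of the lower bound on $g_t(a)$ into Lemma 11 we have:

$$\mathbb{E}[N_t(a)] \le t^* + \frac{\ln t}{\Delta(a)^2} + \frac{c \ln(t)^2}{\Delta(a)^2} + \sum_{s=t^*}^{t} e^{-\frac{\Delta(a)}{4} (s-1) \sqrt{\frac{\ln K}{sK}}} \le \frac{c \ln(t)^2}{\Delta(a)^2} + \frac{\ln t}{\Delta(a)^2} + O\left(\frac{K}{\Delta(a)^2}\right) + t^*,$$

where we used Lemma 10 to bound the sum of the exponents. Note that $t^*$ is of order $\tilde{O}\left(\frac{K}{\Delta(a)^4}\right)$.

Proof of Theorem 3. Note that since by our definition $\hat{\Delta}_t(a) \le 1$ the sequence $\underline{\varepsilon}_t(a) = \underline{\varepsilon}_t = \min\left\{\frac{1}{2K}, \beta_t, \frac{c \ln(t)^2}{t}\right\}$ satisfies the condition of Lemma 11. Also note that for $t$ large enough, so that $t \ge \frac{4c^2 K \ln(t)^4}{\ln K}$, we have $\underline{\varepsilon}_t = \frac{c \ln(t)^2}{t}$. Let $b_t = \ln(t)$ and let $t^*$ be large enough, so that for all $t \ge t^*$ we have $t \ge \frac{4c^2 K \ln(t)^4}{\ln K}$ and $t \ge e^{1/\Delta(a)^2}$. We are going to bound the three terms in the bound on $\mathbb{E}[N_t(a)]$ in Lemma 11. Bounding $\sum_{s=t^*}^{t} e^{-b_{s-1}}$ is easy. For bounding $\sum_{s=t^*}^{t} \varepsilon_s(a)\, \mathbb{1}\{\mathcal{E}_{s-1}^a\}$ we note that when $\mathcal{E}_t^a$ holds and $c \ge 18$ we have:

$$\hat{\Delta}_t(a) \ge \frac{1}{t}\left(\tilde{L}_t(a) - \min_{a'} \tilde{L}_t(a')\right) \ge \frac{1}{t}\left(\tilde{L}_t(a) - \tilde{L}_t(a^*)\right) \ge \frac{1}{t} g_t(a) = \Delta(a) - 2\sqrt{\frac{b_t}{t\, \underline{\varepsilon}_t}} - \frac{1.25\, b_t}{3t\, \underline{\varepsilon}_t} \qquad (4)$$
$$= \Delta(a) - \frac{2}{\sqrt{c \ln t}} - \frac{1.25}{3c \ln t} \ge \Delta(a) \left(1 - \frac{2}{\sqrt{c}} - \frac{1.25}{3c}\right) \ge \frac{1}{2} \Delta(a),$$

where in (4) we used the fact that $\mathcal{E}_t^a$ holds and in the last line we used the fact that for $t \ge t^*$ we have $\sqrt{\ln t} \ge 1/\Delta(a)$. Thus $\varepsilon_t(a)\, \mathbb{1}\{\mathcal{E}_{t-1}^a\} \le \frac{c \ln(t)^2}{t \hat{\Delta}_{t-1}(a)^2} \le \frac{4c \ln(t)^2}{t \Delta(a)^2}$ and $\sum_{s=t^*}^{t} \varepsilon_s(a)\, \mathbb{1}\{\mathcal{E}_{s-1}^a\} = O\left(\frac{\ln(t)^3}{\Delta(a)^2}\right)$. Finally, for the last term in Lemma 11 we have already shown as an intermediate step in the calculation of the bound on $\hat{\Delta}_t(a)$ that for $t \ge t^*$ we have $g_t(a) \ge \frac{1}{2} t\Delta(a)$. Therefore, the last term is of order $O\left(\frac{K}{\Delta(a)^2}\right)$. By taking all these calculations together we obtain the result of the theorem. Note that the result holds for any $\eta_t \ge \beta_t$.

The Contaminated Stochastic Regime

Proof of Theorem 5. The key element of the previous proof was a high-probability lower bound on $\tilde{L}_t(a) - \tilde{L}_t(a^*)$. We show that we can obtain a similar lower bound in the contaminated setting too. Let $\mathbb{1}_{t,a}^{\star}$ denote the indicator function of contamination in location $(t, a)$ ($\mathbb{1}_{t,a}^{\star}$ takes value 1 if contamination occurred and 0 otherwise). Let $m_t^a = \mathbb{1}_{t,a}^{\star} \ell_t^a + (1 - \mathbb{1}_{t,a}^{\star}) \mu(a)$; in other words, if arm $a$ was contaminated on round $t$ then $m_t^a$ is the adversarially assigned value of the loss of arm $a$ on round $t$ and otherwise it is the expected loss. Let $M_t(a) = \sum_{s=1}^{t} m_s^a$; then $\left(M_t(a) - M_t(a^*)\right) - \left(\tilde{L}_t(a) - \tilde{L}_t(a^*)\right)$ is a martingale. By the definition of a process that is moderately contaminated after $\tau$ rounds, for $t \ge \tau$ and any suboptimal action $a$ the total number of rounds up to $t$ where either $a$ itself or $a^*$ were contaminated is at most $t\Delta(a)/2$. Therefore, $M_t(a) - M_t(a^*) \ge t\Delta(a) - t\Delta(a)/2 = t\Delta(a)/2$.

Define the event $\mathcal{B}_t^a$:

$$\frac{t\Delta(a)}{2} - \left(\tilde{L}_t(a) - \tilde{L}_t(a^*)\right) \le 2\sqrt{\nu_t b_t} + \frac{1.25\, b_t}{3\, \underline{\varepsilon}_t}, \qquad (\mathcal{B}_t^a)$$

where $\underline{\varepsilon}_t$ is defined in the proof of Theorem 3 and $\nu_t = \sum_{s=1}^{t} \frac{1}{\underline{\varepsilon}_s}$. Then by Bernstein's inequality $\mathbb{P}\left[\bar{\mathcal{B}}_t^a\right] \le e^{-b_t}$. The remainder of the proof is identical to the proof of Theorem 3 with $\Delta(a)$ replaced by $\Delta(a)/2$.

The Adversarial Regime with a Gap

The proof of Theorem 6 is based on the following lemma, which is an analogue of Theorems 3 and 5.

Lemma 12. Under the parametrization given in Theorem 3, the number of times a suboptimal arm $a$ is played by EXP3++$^{\mathrm{AVG}}$ in an adversarial regime with a gap satisfies:

$$\mathbb{E}[N_t(a)] \le \max\left\{t^*, \tau, e^{1/\Delta(\tau,a)^2}\right\} + O\left(\frac{\ln(t)^3}{\Delta(\tau,a)^2}\right).$$

Proof. Again, the only modification we need is a high-probability lower bound on $\tilde{L}_t(a) - \tilde{L}_t(a_\tau^*)$. We note that $\left(\lambda_t(a) - \lambda_t(a_\tau^*)\right) - \left(\tilde{L}_t(a) - \tilde{L}_t(a_\tau^*)\right)$ is a martingale and that by definition for $t \ge \tau$ we have $\lambda_t(a) - \lambda_t(a_\tau^*) \ge t\Delta(\tau, a)$. Define the events $\mathcal{W}_t^a$:

$$t\Delta(\tau, a) - \left(\tilde{L}_t(a) - \tilde{L}_t(a_\tau^*)\right) \le 2\sqrt{\nu_t b_t} + \frac{1.25\, b_t}{3\, \underline{\varepsilon}_t}, \qquad (\mathcal{W}_t^a)$$

where $\underline{\varepsilon}_t$ and $\nu_t$ are as in the proof of Theorem 5. By Bernstein's inequality $\mathbb{P}\left[\bar{\mathcal{W}}_t^a\right] \le e^{-b_t}$. The remainder of the proof is identical to the proof of Theorem 3.

Proof of Theorem 6. Note that by definition $\Delta(\tau, a)$ is a non-decreasing sequence of $\tau$. Since Lemma 12 is a deterministic result it holds for all $\tau$ simultaneously and we are free to choose the one that minimizes the bound.

5. Empirical Evaluation: Stochastic Regime

We consider the stochastic multiarmed bandit problem with Bernoulli rewards. For all the suboptimal arms the rewards are Bernoulli with bias 0.5 and for the single best arm the reward is Bernoulli with bias $0.5 + \Delta$. We run the experiments with $K = 2$, $K = 10$, and $K = 100$, and $\Delta = 0.1$ and $\Delta = 0.01$ (in total, six combinations of $K$ and $\Delta$). We run each game for $10^7$ rounds and make ten repetitions of each experiment. The solid lines in the graphs in Figure 1 represent the mean performance over the experiments and the dashed lines represent the mean plus one standard deviation (std) over the ten repetitions of the corresponding experiment.

In the experiments EXP3++ is parametrized by $\xi_t(a) = \frac{\ln(t \hat{\Delta}_t(a)^2)}{32\, t \hat{\Delta}_t(a)^2}$, where $\hat{\Delta}_t(a)$ is the empirical estimate of $\Delta(a)$ defined in (2). In order to demonstrate that in the stochastic regime the exploration parameters are in full control of the performance we run the EXP3++ algorithm with two different learning rates. EXP3++$^{\mathrm{EMP}}$ corresponds to $\eta_t = \beta_t$ and EXP3++$^{\mathrm{ACC}}$ corresponds to $\eta_t = 1$. Note that only EXP3++$^{\mathrm{EMP}}$ has a performance guarantee in the adversarial regime.

We compare the EXP3++ algorithm with the EXP3 algorithm (as described in Bubeck & Cesa-Bianchi (2012)), the UCB1 algorithm of Auer et al. (2002a), and Thompson's sampling. Since it was demonstrated empirically in Seldin et al. (2012) that in the above experiments the performance of Thompson sampling is comparable or superior to the performance of EwS and KL-UCB, the latter two algorithms are excluded from the comparison. For the EXP3++ and the EXP3 algorithms we transform the rewards into losses via the $\ell_t^a = 1 - r_t^a$ transformation; the other algorithms operate directly on the rewards.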
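The experimental setup above can be sketched in a few lines of Python. This is a minimal illustration, not the code behind the reported experiments. Several details are assumptions of the sketch: the empirical gap at round $t$ is computed from the loss estimates of rounds $1, \dots, t-1$ and normalized by $t-1$; $\xi_t(a)$ is set to 0 whenever $t\hat{\Delta}^2 \le 1$ (where the logarithm would be non-positive or undefined); and the reported quantity is the pseudo-regret, the sum of the true gaps of the played arms. The function names and the `scale` parameter are hypothetical.

```python
import math
import random


def empirical_xi(L, t, scale=1.0 / 32):
    """xi_t(a) = scale * ln(t * gap^2) / (t * gap^2); set to 0 when t * gap^2 <= 1 (assumption)."""
    if t <= 1:
        return [0.0] * len(L)
    low = min(L)
    xi = []
    for cum_loss in L:
        gap = min(1.0, (cum_loss - low) / (t - 1))  # empirical gap from rounds 1..t-1
        z = t * gap * gap
        xi.append(scale * math.log(z) / z if z > 1.0 else 0.0)
    return xi


def sampling_distribution(L, t, eta, xi):
    """Sampling distribution of Algorithm 1: Gibbs weights mixed with per-arm exploration."""
    K = len(L)
    beta = 0.5 * math.sqrt(math.log(K) / (t * K))
    eps = [min(1.0 / (2 * K), beta, x) for x in xi]
    low = min(L)  # shifting by the minimum leaves the ratios unchanged and avoids overflow
    w = [math.exp(-eta * (cum_loss - low)) for cum_loss in L]
    total = sum(w)
    mix = 1.0 - sum(eps)
    return [mix * wi / total + e for wi, e in zip(w, eps)]


def run_exp3pp(reward_probs, horizon, accelerated=False, seed=0):
    """Play EXP3++ on Bernoulli arms (K >= 2) and return the pseudo-regret."""
    rng = random.Random(seed)
    K = len(reward_probs)
    best = max(reward_probs)
    L = [0.0] * K  # importance-weighted cumulative loss estimates
    regret = 0.0
    for t in range(1, horizon + 1):
        # eta_t = beta_t for the EMP variant, eta_t = 1 for the ACC variant
        eta = 1.0 if accelerated else 0.5 * math.sqrt(math.log(K) / (t * K))
        p = sampling_distribution(L, t, eta, empirical_xi(L, t))
        arm = rng.choices(range(K), weights=p)[0]
        reward = 1.0 if rng.random() < reward_probs[arm] else 0.0
        L[arm] += (1.0 - reward) / p[arm]  # loss l = 1 - r, divided by the play probability
        regret += best - reward_probs[arm]
    return regret


if __name__ == "__main__":
    # One best arm with bias 0.5 + 0.1 and nine arms with bias 0.5, on a short horizon.
    print(run_exp3pp([0.6] + [0.5] * 9, horizon=100_000))
```

A pure-Python loop is too slow for the $10^7$-round games of the paper, so the usage line runs a much shorter horizon; a vectorized implementation would be the natural next step.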

[Figure 1: six panels, (a) to (f), one per combination of $K$ and $\Delta$, each plotting cumulative regret against the round number for UCB1, Thom, EXP3, EXP3++$^{\mathrm{EMP}}$, and EXP3++$^{\mathrm{ACC}}$.]

Figure 1. Comparison of UCB1, Thompson sampling ("Thom"), EXP3, and EXP3++ algorithms in the stochastic regime. The legend in figure (f) corresponds to all the figures. EXP3++$^{\mathrm{EMP}}$ is the Empirical EXP3++ algorithm and EXP3++$^{\mathrm{ACC}}$ is an Accelerated Empirical EXP3++, where we take $\eta_t = 1$. Solid lines correspond to means over 10 repetitions of the corresponding experiments and dashed lines correspond to the means plus one standard deviation.

The results are presented in Figure 1. We see that in all the experiments the performance of EXP3++$^{\mathrm{EMP}}$ is almost identical to the performance of UCB1. However, unlike UCB1 and Thompson's sampling, EXP3++$^{\mathrm{EMP}}$ is secured against the possibility that the game is controlled by an adversary. In the supplementary material we show that any deterministic algorithm is vulnerable against an adversary.

The EXP3++$^{\mathrm{ACC}}$ algorithm can be seen as a teaser for future work. It performs better than EXP3++$^{\mathrm{EMP}}$, but it does not have the adversarial regime performance guarantee. However, we do not exclude the possibility that by some more sophisticated simultaneous control of $\eta_t$ and the $\varepsilon_t(a)$-s it may be possible to design an algorithm that will have both better performance in the stochastic regime and a regret guarantee in the adversarial regime. An example of such sophisticated control of the learning rate in full information games can be found in de Rooij et al. (2014).

6. Discussion

We presented a generalization of the EXP3 algorithm, the EXP3++ algorithm, which augments the EXP3 algorithm with a new control lever in the form of exploration parameters $\varepsilon_t(a)$ that are tuned individually for each arm. We have shown that the new control lever is extremely useful in detecting and exploiting the gap in a wide range of regimes, while the old control lever always keeps the worst-case performance of the algorithm under control. Due to the central role of the EXP3 algorithm in the adversarial analysis, which stretches far beyond the adversarial bandits, and due to the simplicity of our generalization we believe that our result will lead to a multitude of new algorithms for other problems that exploit the gaps without compromising on the worst-case performance guarantees. There is also room for further improvement of the presented technique that we plan to pursue in future work.

Acknowledgments

The authors would like to thank Sébastien Bubeck and Wouter Koolen for useful discussions and Csaba Szepesvári for bringing up the reference to Cesa-Bianchi & Fischer (1998). This research was supported by an Australian Research Council Australian Laureate Fellowship (FL110100281).

References

Agrawal, Shipra and Goyal, Navin. Further optimal regret bounds for Thompson sampling. In AISTATS, 2013.

Audibert, Jean-Yves and Bubeck, Sébastien. Minimax policies for adversarial and stochastic bandits. In Proceedings of the International Conference on Computational Learning Theory (COLT), 2009.

Auer, Peter, Cesa-Bianchi, Nicolò, Freund, Yoav, and Schapire, Robert E. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the Annual Symposium on Foundations of Computer Science, 1995.

Auer, Peter, Cesa-Bianchi, Nicolò, and Fischer, Paul. Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47, 2002a.

Auer, Peter, Cesa-Bianchi, Nicolò, Freund, Yoav, and Schapire, Robert E. The nonstochastic multiarmed bandit problem. SIAM Journal of Computing, 32(1), 2002b.

Babaioff, Moshe, Dughmi, Shaddin, Kleinberg, Robert, and Slivkins, Aleksandrs. Dynamic pricing with limited supply. In 13th ACM Conf. on Electronic Commerce (EC), 2012.

Boucheron, Stéphane, Lugosi, Gábor, and Massart, Pascal. Concentration Inequalities: A Nonasymptotic Theory of Independence. Oxford University Press, 2013.

Bubeck, Sébastien. Bandits Games and Clustering Foundations. PhD thesis, Université Lille 1, 2010.

Bubeck, Sébastien and Cesa-Bianchi, Nicolò. Regret analysis of stochastic and nonstochastic multi-armed bandit problems. Foundations and Trends in Machine Learning, 5, 2012.

Bubeck, Sébastien and Slivkins, Aleksandrs. The best of both worlds: stochastic and adversarial bandits. In Proceedings of the International Conference on Computational Learning Theory (COLT), 2012.

Cappé, Olivier, Garivier, Aurélien, Maillard, Odalric-Ambrym, Munos, Rémi, and Stoltz, Gilles. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, 41, 2013.

Cesa-Bianchi, Nicolò and Fischer, Paul. Finite-time regret bounds for the multiarmed bandit problem. In Proceedings of the International Conference on Machine Learning (ICML), 1998.

Cesa-Bianchi, Nicolò and Lugosi, Gábor. Prediction, Learning, and Games. Cambridge University Press, 2006.

de Rooij, Steven, van Erven, Tim, Grünwald, Peter D., and Koolen, Wouter M. Follow the leader if you can, hedge if you must. Journal of Machine Learning Research, 2014.

Kaufmann, Emilie, Korda, Nathaniel, and Munos, Rémi. Thompson sampling: An optimal finite time analysis. In Proceedings of the International Conference on Algorithmic Learning Theory (ALT), 2012.

Lai, Tze Leung and Robbins, Herbert. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6, 1985.

Maillard, Odalric-Ambrym. Apprentissage Séquentiel: Bandits, Statistique et Renforcement. PhD thesis, INRIA Lille, 2011.

Robbins, Herbert. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 1952.

Seldin, Yevgeny, Szepesvári, Csaba, Auer, Peter, and Abbasi-Yadkori, Yasin. Evaluation and analysis of the performance of the EXP3 algorithm in stochastic environments. In JMLR Workshop and Conference Proceedings, volume 24 (EWRL), 2012.

Stoltz, Gilles. Incomplete Information and Internal Regret in Prediction of Individual Sequences. PhD thesis, Université Paris-Sud, 2005.

Thompson, William R. On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25, 1933.

[Figure: Perturbed Stochastic Environment. Three panels, (a) K = 2, (b) K = 4, (c) K = 8, each plotting Cumulative Regret against the round $t$ (horizontal axis in units of $10^6$ rounds). Comparison of UCB, Thompson sampling (Thom), EXP3, and EXP3++$^{\mathrm{EMP}}$. The legend in panel (c) corresponds to all the panels. Solid lines correspond to means over repetitions of the corresponding experiments and dashed lines correspond to the means plus one standard deviation.]

A. Vulnerability of Deterministic Algorithms in the Adversarial Regime

We show that in the adversarial regime any deterministic algorithm can be forced to suffer linear regret. Let $\mathcal{A}$ be any deterministic algorithm. Note that given a sequence of losses up to time $t$, the adversary knows which arm $\mathcal{A}$ will play on round $t+1$. So the adversary can incrementally design a sequence of losses such that the arm played by $\mathcal{A}$ always has loss 1 and all other arms have loss 0. Over $T$ rounds the loss of $\mathcal{A}$ will be $T$, whereas the loss of the best arm in hindsight will be at most $T/K$ (the total loss $T$ is spread over $K$ arms, so some arm carries at most the average), and the regret will be at least $T - T/K \ge T/2$.

B. Empirical Evaluation: Moderately Contaminated Stochastic Environment

We simulate a moderately contaminated stochastic environment by drawing an initial block of rounds of the game according to one stochastic model and then switching the best arm and continuing the game until $T = 8 \times 10^6$. We note that the contamination is not fully adversarial, but drawn from a different stochastic model. We run this experiment with a fixed gap $\Delta$ and $K = 2$, 4, and 8 arms. The results are presented in the figure above. It is hard to see the initial block of rounds on the graph, but its effect on all the algorithms is clearly visible. Despite the initial corrupted rounds, the EXP3++$^{\mathrm{EMP}}$ algorithm successfully returns to the stochastic operation mode and achieves better results than EXP3.

C. Proof of the Lemma Bounding $\sum_{t} e^{-c\sqrt{t}}$

Proof. It is easy to check by differentiation that

$$\int e^{-c\sqrt{t}}\,dt = -\frac{2}{c}\sqrt{t}\,e^{-c\sqrt{t}} - \frac{2}{c^2}\,e^{-c\sqrt{t}}.$$

Thus, since the summand is decreasing in $t$, we have:

$$\sum_{t=0}^{\infty} e^{-c\sqrt{t}} = O\!\left(\int_0^{\infty} e^{-c\sqrt{t}}\,dt\right) = O\!\left(\left[-\frac{2}{c}\sqrt{t}\,e^{-c\sqrt{t}} - \frac{2}{c^2}\,e^{-c\sqrt{t}}\right]_0^{\infty}\right) = O\!\left(\frac{2}{c^2}\right).$$

D. Definition of EXP3++$^{\mathrm{LCBT}}$ and Proof of Theorem 4

As discussed in the main text, if we assume a known time horizon $T$, then we can eliminate the additive term $e^{1/\Delta^2}$ in the regret bound for EXP3++. For this purpose we define and analyze a version of EXP3++, called EXP3++$^{\mathrm{LCBT}}$, in which we replace the empirical gap estimate with a lower confidence bound on the gap.

D.1. Algorithm specification

Fix an arm $a$ and a round $t$. Recall that $N_t(a)$ denotes the number of times this arm has been played by the algorithm up to round $t$, and let $\hat\mu_t(a)$ be the corresponding average loss. The difference $|\hat\mu_t(a) - \mu(a)|$ is bounded from above by the confidence term

$$\mathrm{conf}_t(a) = \min\left(1,\ \sqrt{8\log(T)/N_t(a)}\right)$$

with probability at least $1 - T^{-2}$. The minimal expected loss $\mu^* = \min_a \mu(a)$ can (whp) be upper-bounded by

$$\hat\mu^*_t = \min_{\text{arms } a}\ \left(\hat\mu_t(a) + \mathrm{conf}_t(a)\right).$$

The gap $\Delta(a)$ can (whp) be lower-bounded by

$$\Delta^{LB}_t(a) = \max\left(0,\ \hat\mu_t(a) - \mathrm{conf}_t(a) - \hat\mu^*_t\right).$$
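The lower confidence bound on the gap can be sketched in a few lines. This is a minimal illustration rather than the authors' implementation: the function names `conf` and `gap_lower_bounds` are hypothetical, the logarithm is assumed to be natural, and an arm that has never been played is assumed to get the trivial confidence term 1.

```python
import math


def conf(n_plays: int, horizon: int) -> float:
    """Confidence term min(1, sqrt(8 log T / N)); natural log is an assumption."""
    if n_plays == 0:
        # Assumption: an unplayed arm gets the trivial width 1.
        return 1.0
    return min(1.0, math.sqrt(8.0 * math.log(horizon) / n_plays))


def gap_lower_bounds(mean_losses, play_counts, horizon):
    """Per-arm lower confidence bounds on the gaps."""
    widths = [conf(n, horizon) for n in play_counts]
    # Upper confidence bound on the minimal expected loss.
    mu_star_ub = min(m + w for m, w in zip(mean_losses, widths))
    # Lower bound on each gap, clipped at zero.
    return [max(0.0, m - w - mu_star_ub) for m, w in zip(mean_losses, widths)]


# Two arms with empirical mean losses 0.2 and 0.5, each played 500,000 times.
lb = gap_lower_bounds([0.2, 0.5], [500_000, 500_000], horizon=10**6)
```

In this example the true difference between the empirical means is 0.3, and the lower bound for the worse arm is smaller than that by twice the confidence width; the empirically best arm always gets a lower bound of 0.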

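Using this lower bound, we define algorithm EXP3++$^{\mathrm{LCBT}}$ to be a version of EXP3++ with

$$\xi_t(a) = \frac{\log^2 T}{t\,\left(\Delta^{LB}_t(a)\right)^2}.$$

D.2. Regret bound and proof sketch

For convenience, we repeat the formulation of Theorem 4 in the theorem below. It provides a regret bound for EXP3++$^{\mathrm{LCBT}}$ in the stochastic regime with known time horizon.

Theorem (restatement of Theorem 4). Consider the stochastic regime with a known time horizon $T$ and minimal gap $\Delta$. The EXP3++$^{\mathrm{LCBT}}$ algorithm with any $\eta_t \ge \beta_t$ achieves regret $R(T)$ that is polylogarithmic in $T$, with polynomial dependence on $K$ and $1/\Delta$.

Let us describe the key steps of the proof. First, we capture two useful properties of $\Delta^{LB}_t(a)$.

Claim 4. The following two events hold with probability at least $1 - O(T^{-2})$:

$$\left\{\Delta^{LB}_t(a) \le \Delta(a) \quad \forall a, t\right\},$$

$$\left\{N_t(a) \ge \theta \ \Rightarrow\ \Delta^{LB}_t(a) \ge \tfrac{1}{2}\Delta(a) \quad \forall a, t\right\},$$

for a threshold $\theta = \Theta\!\left(K\log^2(T)/\Delta^4\right)$.

In fact, the rest of the proof uses $\Delta^{LB}_t(a)$ only through the above two properties. Accordingly, our analysis applies to any other estimator $\Delta^{LB}_t(a) \in [0,1]$ that satisfies the two properties with probability at least $1 - O(T^{-2})$. The proof of the first property is a straightforward application of the Azuma-Hoeffding inequality. Proving the second requires a little subtlety to handle $\mathrm{conf}_t(a^*)$. Here and throughout, $a^*$ is some best arm.

Second, a simple but crucial computation identifies $\Delta^{est}_t(a) = \tilde L_t(a) - \tilde L_t(a^*)$ as a lever that our analysis can use to control the probability of choosing a suboptimal arm.

Claim 5. Fix a suboptimal action $a$ and a round $t$. Then

$$\tilde\rho_t(a) \le \exp\left(-\eta_t\,\Delta^{est}_t(a)\right) + \varepsilon_t(a).$$

Proof. Recall that $\tilde\rho_t(a) \le \rho_t(a) + \varepsilon_t(a)$. Denoting $w_t(a) = \exp(-\eta_t \tilde L_t(a))$, we have

$$\rho_t(a) = \frac{w_t(a)}{\sum_{a'} w_t(a')} \le \frac{w_t(a)}{w_t(a^*)} = \exp\left(-\eta_t\left(\tilde L_t(a) - \tilde L_t(a^*)\right)\right).$$

Next, we use the lower-bounding property of $\Delta^{LB}_t(a)$ (Claim 4) to deduce a lower bound on $\Delta^{est}_t(a)$.

Lemma 6. The following event holds with probability at least $1 - O(T^{-2})$:

$$\Delta^{est}_t(a) \ge \tfrac{1}{2}\,t\,\Delta(a) \quad \forall a,\ \forall t \ge t_0.$$

Here it suffices to take $t_0 = \Theta\!\left(K\log^4(T)/\Delta^4\right)$.

Lemma 6 is proved by applying Bernstein's inequality (Theorem 9) to bound the deviation of $\tilde L_t(a)$ for each arm. The crux is to bound from above the $\Sigma_n = \sum_{i=1}^{n} E[X_i^2 \mid \mathcal{F}_{i-1}]$ term in Theorem 9. Claim 4 boosts the exploration parameter $\varepsilon_t(a)$, which essentially allows us to upper-bound $\Sigma_n$ by $\tilde O(\Delta^2 t^2)$ for sufficiently large $t$, rather than merely by $\tilde O(t^2)$.

Now, plugging the bound of Lemma 6 into Claim 5 implies that for $t \ge t_0$ the probability of choosing a suboptimal arm $a$ is essentially at most $\varepsilon_t(a)$. And that term can be upper-bounded using Claim 4: $\varepsilon_t(a) \le \tilde O\!\left(1/(t\Delta^2)\right)$ after arm $a$ has been chosen at least $\theta$ times. Putting this together, we see that after some initial period the algorithm enters the regime where arm $a$ is chosen with probability at most $\tilde O\!\left(1/(t\Delta^2)\right)$ in each round $t$, which implies that the total expected number of times arm $a$ is chosen in this regime is at most $\tilde O(1/\Delta^2)$. The initial period includes at least $t_0$ rounds and at least $\theta$ plays of arm $a$, whichever comes last. It is not hard to see that arm $a$ can be selected at most $\max(t_0, \theta)$ times during this period. With some computations, the above argument proves that $E[N_T(a)] \le O\!\left(K\log^4(T)/\Delta^4\right)$ for each suboptimal arm $a$. Since the regret in the stochastic regime is $\sum_a \Delta(a)\,E[N_T(a)]$, this implies the claimed regret bound.

D.3. High-probability events

Before we delve into the detailed analysis, we use the Azuma-Hoeffding inequality and Bernstein's inequality to set up several useful high-probability events. First, we use the following corollary of Theorem 9.

Theorem 7. Let $X_1, \dots, X_n$ be 0-1 random variables. Let $M = \sum_{t=1}^{n} M_t$, where $M_t = E[X_t \mid X_1, \dots, X_{t-1}]$ for each $t$. Then for any $b \ge 1$ the event

$$\left|\sum_{t=1}^{n} X_t - M\right| \le b\left(\sqrt{M\log n} + \log n\right)$$

holds with probability at least $1 - n^{-\Omega(b)}$.

For a self-contained proof, one can refer to Theorem 4 in the full version of Babaioff et al. (2012).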
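The concentration statement of Theorem 7 can be illustrated with a small simulation. This is only a sketch under stated assumptions: the adaptive rule that sets each success probability from the running average of past outcomes is a hypothetical choice made for illustration, the logarithm is natural, and the function name `adaptive_bernoulli_deviation` is not from the paper.

```python
import math
import random


def adaptive_bernoulli_deviation(n: int, seed: int = 0):
    """Simulate 0-1 variables whose conditional means depend on the past.

    Returns (|sum_t X_t - M|, sqrt(M log n) + log n), where M is the sum of
    the conditional means M_t = E[X_t | X_1, ..., X_{t-1}].
    """
    rng = random.Random(seed)
    total = 0             # running sum of X_t
    cond_mean_sum = 0.0   # running sum of the conditional means M_t
    for t in range(1, n + 1):
        # Hypothetical adaptive rule: the success probability lies in
        # [0.1, 0.4] and depends on the running average of past outcomes.
        past_avg = total / (t - 1) if t > 1 else 0.0
        p = 0.1 + 0.3 * past_avg
        cond_mean_sum += p
        total += 1 if rng.random() < p else 0
    deviation = abs(total - cond_mean_sum)
    scale = math.sqrt(cond_mean_sum * math.log(n)) + math.log(n)
    return deviation, scale


deviation, scale = adaptive_bernoulli_deviation(10_000)
```

Comparing `deviation` with a small multiple of `scale` over several seeds gives a feel for how the deviation of the sum from $M$ is governed by $\sqrt{M\log n} + \log n$ rather than by $n$.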

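Second, we approximate $\tilde L_t(a)$ with $t\mu(a)$, as long as $\varepsilon_t(a)$ can be bounded from below.

Claim 8. Fix an arm $a$ and a round $t$. Suppose $\varepsilon_s(a) \ge \varepsilon'_s$ for each round $s \le t$ and some numbers $\varepsilon'_1 \ge \varepsilon'_2 \ge \dots \ge \varepsilon'_t$. Denote $\nu = \sum_{s \le t} 1/\varepsilon'_s$. Then for each $\lambda > 0$

$$P\left[\,\left|\tilde L_t(a) - t\mu(a)\right| \ge \sqrt{2\nu\lambda} + \frac{\lambda}{3\varepsilon'_t}\,\right] \le 2e^{-\lambda}.$$

Proof. Note that $\left(\tilde\ell_s(a) - \mu(a)\right)_{s \le t}$ form a martingale difference sequence such that for all rounds $s \le t$ we have $\tilde\ell_s(a) \le 1/\varepsilon'_t$ and moreover

$$\sum_{s \le t} E_s\!\left[\tilde\ell_s(a)^2\right] \le \sum_{s \le t} E_s\!\left[\frac{1}{\tilde\rho_s(a)}\right] \le \sum_{s \le t} \frac{1}{\varepsilon_s(a)} \le \sum_{s \le t} \frac{1}{\varepsilon'_s} = \nu.$$

Thus, the claim follows from Bernstein's inequality.

Third, for each time interval $[t_1, t_2]$ let $n_{[t_1,t_2]}(a)$ be the number of times arm $a$ is chosen in this time interval. We approximate it with $\hat n_{[t_1,t_2]}(a) = \sum_{s=t_1}^{t_2} \tilde\rho_s(a)$.

Claim 9. Fix an arm $a$, rounds $t_1 \le t_2$, and $\lambda > 0$. Let $\hat n = \hat n_{[t_1,t_2]}(a)$. Then

$$P\left[\,\left|n_{[t_1,t_2]}(a) - \hat n\right| \ge O\!\left(\sqrt{\lambda \hat n} + \lambda\right)\right] \le e^{-\lambda}.$$

Proof. This follows from Theorem 7.

We will use the following high-probability event:

$$\left\{\left|n_{[t_1,t_2]}(a) - \hat n_{[t_1,t_2]}(a)\right| \le O\!\left(\sqrt{\log(T)\,\hat n_{[t_1,t_2]}(a)} + \log T\right) \ :\ \forall a,\ \forall t_1 \le t_2\right\}.$$

This event holds with probability $1 - O(T^{-2})$ by Claim 9.

D.4. Detailed analysis

To side-step some difficulties with handling low-probability events in early rounds, we define several high-probability events, and focus on the clean execution when all these events hold.

Proof of Claim 4, second property. It suffices to focus on a clean execution of the algorithm: one in which the first event of Claim 4 and the concentration event for $n_{[t_1,t_2]}(a)$ above both hold. Recall that for any arm $a$ and any $\Delta > 0$

$$N_t(a) \ge \frac{128\log T}{\Delta^2} \ \Rightarrow\ \mathrm{conf}_t(a) \le \frac{\Delta}{4}.$$

Fix a suboptimal arm $a$ and consider some round $t$ such that $N_t(a) \ge \theta$. Then $\mathrm{conf}_t(a) \le \Delta/4$ by the implication above with $\Delta = \Delta(a)$. It remains to prove that $\mathrm{conf}_t(a^*) \le \Delta/4$. From the first event of Claim 4, $\tilde\rho_s(a^*) \ge \varepsilon_s(a^*) \ge A/\sqrt{s}$, where $A = \frac{1}{2}\sqrt{\ln K / K}$. It follows that

$$\hat N_t(a^*) \ge \sum_{s=t/2}^{t} \tilde\rho_s(a^*) \ge \sum_{s=t/2}^{t} \frac{A}{\sqrt{s}} \ge \frac{A\sqrt{t}}{2}.$$

Noting that $t \ge N_t(a) \ge \theta$, we have $\hat N_t(a^*) \ge A\sqrt{\theta}/2 \ge c\log(T)/\Delta^2$, where the constant $c$ can be made as large as needed through the constant in $\theta$. By the concentration event for $n_{[t_1,t_2]}(a)$, it follows that $N_t(a^*) \ge 128\log(T)/\Delta^2$. Using the implication above with $\Delta = \Delta(a)$, we obtain $\mathrm{conf}_t(a^*) \le \Delta/4$, completing the proof.

Proof of Lemma 6. Let the threshold be $t_0 = c_0 K \log^4(T)/\Delta^4$, where $c_0$ is some absolute constant. First, we reduce to the case when the first event of Claim 4 holds deterministically. Indeed, consider another version of EXP3++$^{\mathrm{LCBT}}$ where $\Delta^{LB}_t(a)$ is replaced by $\min\left(\Delta(a), \Delta^{LB}_t(a)\right)$. This new version satisfies the first event of Claim 4 deterministically, and the two versions coincide with probability at least $1 - O(T^{-2})$. This completes the reduction. From here on, we assume that this event holds.

Fix a round $t \le T$ and a parameter $\lambda > 0$. Denote $A = \frac{1}{2}\sqrt{\ln K / K}$ and $B = \log^2(T)/\Delta(a)^2$ (with $B = \infty$ for the best arm). For each arm $a$, we apply Claim 8 with

$$\varepsilon'_s = \min\left(\frac{1}{2K},\ \frac{A}{\sqrt{s}},\ \frac{B}{s}\right), \qquad \nu = \sum_{s \le t} \frac{1}{\varepsilon'_s} \le \sum_{s \le t}\left(2K + \frac{\sqrt{s}}{A} + \frac{s}{B}\right) = O\!\left(Kt + \frac{t^{3/2}}{A} + \frac{t^2}{B}\right).$$

Plugging this into Claim 8, we obtain

$$P\left[\,\left|\tilde L_t(a) - t\mu(a)\right| \ge \Gamma(a)\right] \le 2e^{-\lambda},$$

where, for any $t \ge (\lambda/A)^2 + K^2A^2$ and some constant $c > 0$,

$$\Gamma(a) = c\left(\sqrt{\lambda/A}\;t^{3/4} + \sqrt{\lambda/B}\;t + (\lambda/B)\,t\right).$$

Fix some suboptimal arm $a$. Then

$$P\left[\,\tilde L_t(a) - \tilde L_t(a^*) < t\Delta(a) - \Gamma(a) - \Gamma(a^*)\right] < 2Ke^{-\lambda}.$$

Take $\lambda = 3\log(KT)$. It suffices to prove that $\Gamma(a) + \Gamma(a^*) \le t\Delta(a)/2$.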

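This holds because $\Gamma(a^*) = c\sqrt{\lambda/A}\;t^{3/4}$ and

$$2c\sqrt{\lambda/A}\;t^{3/4} \le \frac{t\Delta}{4} \ \text{ if } \ t \ge \frac{(8c)^4(\lambda/A)^2}{\Delta^4}, \qquad c\,\frac{\lambda}{B}\,t \le c\sqrt{\frac{\lambda}{B}}\;t \le \frac{t\Delta}{8} \ \text{ if } \ \log^2 T \ge (8c)^2\lambda.$$

Proof of the theorem. An execution of the algorithm is called clean if the following events hold:

(i) $n_{[t_1,t_2]}(a)$ is close to its expectation (the concentration event following Claim 9);
(ii) the estimate $\Delta^{LB}_t(a)$ is sharp if $N_t(a) \ge \theta$ (the second event of Claim 4);
(iii) the estimate $\tilde L_t(a) - \tilde L_t(a^*)$ is sharp if $t \ge t_0$ (the event of Lemma 6).

We have proved that an execution is clean with probability at least $1 - O(T^{-2})$. So it suffices to focus on a clean execution from here on.

Fix a suboptimal arm $a$. By event (iii), in each round $t \ge t_0$ we have $\tilde L_t(a) - \tilde L_t(a^*) \ge t\Delta(a)/2$. Plugging this into Claim 5, we obtain

$$\tilde\rho_t(a) \le \exp\left(-\eta_t\,t\,\Delta(a)/2\right) + \varepsilon_t(a) \le T^{-2} + \varepsilon_t(a).$$

Let $\theta$ be the threshold from event (ii). Assume that arm $a$ is selected at least $\theta$ times, i.e. that $N_t(a) \ge \theta$ for some round $t$. Let $t_\theta$ be the smallest such round. Then for any round $t \ge t_\theta$ we have $\Delta^{LB}_t(a) \ge \Delta(a)/2$, and consequently $\varepsilon_t(a) \le O\!\left(\log^2(T)/(t\Delta^2)\right)$. Letting $t^* = \max(t_0, t_\theta)$, we have

$$\hat n_{[t^*,T]}(a) = \sum_{s=t^*}^{T} \tilde\rho_s(a) \le \frac{1}{T} + O\!\left(\frac{\log^2 T}{\Delta^2}\right)\sum_{s=t^*}^{T}\frac{1}{s} \le O\!\left(\frac{\log^3 T}{\Delta^2}\right).$$

Using event (i), we have $n_{[t^*,T]}(a) \le O\!\left(\log^3(T)/\Delta^2\right)$.

Note that arm $a$ can be played at most $\max(\theta, t_0)$ times before round $t^*$. This is because arm $a$ is played $\theta$ times before round $t_\theta$, and at most $t_0$ times before round $t_0$. Putting this together, we have

$$n_T(a) = n_{[1,t^*]}(a) + n_{[t^*,T]}(a) \le \max(\theta, t_0) + O\!\left(\frac{\log^3 T}{\Delta^2}\right).$$

Plugging in $t_0 = O\!\left(K\log^4(T)/\Delta^4\right)$ and $\theta = O\!\left(K\log^2(T)/\Delta^4\right)$, we have $n_T(a) \le O\!\left(K\log^4(T)/\Delta^4\right)$. Recall that this holds for each suboptimal arm $a$, in any clean execution of the algorithm. Since the regret in the stochastic regime is $\sum_a \Delta(a)\,E[N_T(a)]$, this implies the regret bound claimed in the theorem.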
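The counting step at the end of the proof sums per-round selection probabilities of order $1/s$, which is why it costs only one extra logarithmic factor. The sketch below checks this numerically. It assumes a per-round bound of the form $T^{-2} + \log^2(T)/(s\Delta^2)$ with the hidden constant set to 1, and natural logarithms; both are assumptions made for illustration, and the function names are hypothetical.

```python
import math


def expected_plays_bound(t_star: int, horizon: int, gap: float) -> float:
    """Sum of the assumed per-round bounds T^-2 + log^2(T) / (s * gap^2)
    over rounds s = t_star, ..., T (hidden constant taken to be 1)."""
    log_t = math.log(horizon)
    return sum(horizon ** -2 + log_t ** 2 / (s * gap ** 2)
               for s in range(t_star, horizon + 1))


def closed_form_bound(horizon: int, gap: float) -> float:
    """Closed form 1/T + (1 + log T) * log^2(T) / gap^2, obtained from the
    harmonic-sum estimate sum_{s <= T} 1/s <= 1 + log T."""
    log_t = math.log(horizon)
    return 1.0 / horizon + (1.0 + log_t) * log_t ** 2 / gap ** 2


total = expected_plays_bound(1, 10_000, 0.1)
bound = closed_form_bound(10_000, 0.1)
```

Starting the sum later, at a larger `t_star`, can only decrease `total`, so the closed form is a valid cap for every starting round.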