CSE 547/Stat 548: Machine Learning for Big Data
Lecture — Multi-Armed Bandits: Non-adaptive and Adaptive Sampling
Instructor: Sham Kakade

1 The (stochastic) multi-armed bandit problem

The basic paradigm is as follows:

- K independent arms: a ∈ {1, ..., K}.
- Each arm a returns a random reward R_a if pulled. (In this simpler case, assume the distribution of R_a is not time varying.)
- Game: you choose arm a_t at time t. You then observe X_t = R_{a_t}, where R_{a_t} is sampled from the underlying distribution of that arm.

Critically, the distribution over R_a is not known.

1.1 Regret: an online performance measure

Our objective is to maximize our long-term reward. We have a (possibly randomized) sequential strategy/algorithm A, which is of the form:

  a_t = A(a_1, X_1, a_2, X_2, ..., a_{t-1}, X_{t-1})

In T rounds, our expected reward is:

  E[ Σ_{t=1}^T X_t | A ]

where the expectation is with respect to the reward process and our algorithm. Suppose μ_a = E[R_a], and let us assume 0 ≤ μ_a ≤ 1. Also, define μ* = max_a μ_a. In T rounds and in expectation, the best we can do is obtain μ*T. We will measure our performance by our expected regret, defined as follows. In T rounds, our (observed) regret is:

  μ*T − Σ_{t=1}^T X_t
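To make the game above concrete, here is a minimal simulation sketch: each arm has a hidden Bernoulli reward distribution, a naive player pulls arms uniformly at random, and we compute the observed regret. The arm means `mu`, horizon `T`, and seed are made-up values for illustration, not part of the notes.

```python
import random

def pull(mu, a, rng):
    # Bernoulli reward with mean mu[a]; the means are hidden from the player.
    return 1.0 if rng.random() < mu[a] else 0.0

def observed_regret(mu, rewards):
    # mu* times T, minus the reward actually collected.
    return max(mu) * len(rewards) - sum(rewards)

rng = random.Random(0)
mu = [0.2, 0.5, 0.8]   # hypothetical (unknown) arm means, so mu* = 0.8
T = 1000
actions = [rng.randrange(len(mu)) for _ in range(T)]   # naive uniform play
rewards = [pull(mu, a, rng) for a in actions]
# Uniform play earns about 0.5 per round, so regret grows like 0.3 * T.
print(observed_regret(mu, rewards))
```

Uniform play pays linear regret here; the rest of the notes is about doing better.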
and our expected regret is:

  μ*T − E[ Σ_{t=1}^T X_t | A ]

where the expectation is with respect to the randomness in our outcomes (and possibly our algorithm, if it is randomized).

1.2 Caveat

Our presentation in these notes will be loose in terms of log(·) factors, in both K and T. There are multiple good treatments that provide improvements in terms of these factors.

2 Review: Hoeffding's bound

With N samples, denote the sample mean as:

  μ̂ = (1/N) Σ_{t=1}^N X_t.

Lemma 2.1. Supposing that the X_t's are i.i.d. and bounded between 0 and 1, then, with probability greater than 1 − δ, we have that:

  |μ̂ − μ| ≤ √( log(2/δ) / (2N) ).

3 Warmup: a non-adaptive strategy

Suppose we first pull each arm m times, in an exploration phase. Then, for the remainder of the T steps, we pull the arm which had the best observed reward during the exploration phase. By the union bound, with probability greater than 1 − δ, for all actions a,

  |μ̂_a − μ_a| ≤ O( √( log(K/δ) / m ) ).

To see this, we simply make the error probability of each arm's confidence interval δ/K, so that the total error probability is δ. Thus all the confidence intervals will hold.

During the exploration rounds, our cumulative regret is at most Km, a trivial upper bound. During the exploitation rounds, let us bound our cumulative regret for the remaining T − Km steps. Note that for the arm i that we pull, we must have that μ̂_i ≥ μ̂_{i*}, where i* is an optimal arm. This implies that

  μ* − μ_i ≤ c √( log(K/δ) / m )

where c is a universal constant. To see this, note that by construction of the algorithm μ̂_i ≥ μ̂_{i*}, which implies

  μ* − μ_i = μ_{i*} − μ̂_{i*} + μ̂_{i*} − μ_i ≤ (μ_{i*} − μ̂_{i*}) + (μ̂_i − μ_i) ≤ |μ_{i*} − μ̂_{i*}| + |μ̂_i − μ_i|,

and the claim follows using the confidence interval bounds.
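As a numerical sanity check on Lemma 2.1, the sketch below estimates how often the Hoeffding interval fails for i.i.d. Bernoulli samples. The mean, sample size, and trial count are arbitrary choices for illustration.

```python
import math
import random

def hoeffding_radius(n, delta):
    # Width of the two-sided Hoeffding interval for [0,1]-bounded i.i.d. samples.
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))

rng = random.Random(1)
mu, N, delta, trials = 0.6, 500, 0.05, 2000
violations = 0
for _ in range(trials):
    mu_hat = sum(1.0 if rng.random() < mu else 0.0 for _ in range(N)) / N
    if abs(mu_hat - mu) > hoeffding_radius(N, delta):
        violations += 1
# Lemma 2.1 says the interval fails with probability at most delta = 0.05;
# empirically the failure rate is typically far smaller (Hoeffding is conservative).
print(violations / trials)
```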
Hence, our total regret is:

  μ*T − E[ Σ_{t=1}^T X_t ] ≤ Km + O( √( log(K/δ) / m ) ) · (T − Km).

Now let us optimize for m.

Lemma 3.1. (Regret of the non-adaptive strategy) The total expected regret of the non-adaptive strategy is:

  μ*T − E[ Σ_{t=1}^T X_t | A ] ≤ c K^{1/3} T^{2/3} (log T)^{1/3}

where c is a universal constant.

Proof. Choose m = (T/K)^{2/3} (log(KT))^{1/3} and δ = 1/T². Note that with probability greater than 1 − 1/T², our regret is bounded by O( K^{1/3} T^{2/3} (log(KT))^{1/3} ). Also, if we fail, the largest regret we can pay is T, and this occurs with probability less than 1/T², so the regret is:

  exp. regret ≤ Pr(no failure event) · c K^{1/3} T^{2/3} (log(KT))^{1/3} + Pr(failure event) · T
             ≤ c (1 − 1/T²) K^{1/3} T^{2/3} (log(KT))^{1/3} + 1/T.

This shows that the regret is bounded as O( K^{1/3} T^{2/3} (log(KT))^{1/3} ). For T > K, log(KT) ≤ 2 log T (and for K ≥ T, the claimed regret bound is trivially true). This completes the proof (for a different universal constant).

3.1 A (minimax) optimal adaptive algorithm

We will now provide an optimal (up to log factors) algorithm — optimal under the assumption that the rewards are i.i.d. and upper bounded by 1. Let N_{a,t} be the number of times we have pulled arm a up to time t. The question is: which arm should we pull at time t + 1?

3.2 Confidence bounds

If we don't care about log factors, then the following is a straightforward argument showing that our confidence bounds will simultaneously hold for all times t (from 0 to ∞) and all K arms.

Lemma 3.2. With probability greater than 1 − δ, we have that for all times t ≥ K and all a ∈ [K],

  |μ̂_{a,t} − μ_a| ≤ c √( log(t/δ) / N_{a,t} )

where c is a universal constant.

Proof. We will actually prove a stronger statement: supposing that we observe the outcome of every arm, we will first provide a probabilistic statement for the confidence intervals of all the arms (and for all sample sizes). Let us apply Hoeffding's bound with an error probability of δ/(K n²). Specifically, for an arm a with n samples, we have that with probability greater than 1 − δ/(K n²):

  |μ̂_{a,n} − μ_a| ≤ c √( log(Kn/δ) / n )
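The non-adaptive (explore-then-commit) strategy and the choice of m from Lemma 3.1 can be sketched as follows. The arm means, horizon, and seed are hypothetical values chosen for illustration.

```python
import math
import random

def explore_then_commit(mu, T, m, rng):
    # Pull each arm m times, then commit to the empirically best arm.
    K = len(mu)
    totals = [0.0] * K
    reward = 0.0
    for a in range(K):
        for _ in range(m):
            x = 1.0 if rng.random() < mu[a] else 0.0
            totals[a] += x
            reward += x
    best = max(range(K), key=lambda a: totals[a] / m)  # ties broken arbitrarily
    for _ in range(T - K * m):
        reward += 1.0 if rng.random() < mu[best] else 0.0
    return reward

rng = random.Random(2)
mu, T = [0.3, 0.5, 0.7], 10_000
K = len(mu)
m = int((T / K) ** (2 / 3) * math.log(K * T) ** (1 / 3))  # choice from Lemma 3.1
regret = max(mu) * T - explore_then_commit(mu, T, m, rng)
print(m, regret)
```

The exploration phase pays roughly 0.2 per round on average here, and with this m the committed arm is almost always the true best one, so the realized regret is dominated by exploration.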
(by a straightforward application of Hoeffding's bound). Now note that the total error probability over all arms and over all sample sizes is:

  Σ_{a=1}^K Σ_{n=1}^∞ δ/(K n²) = δ π²/6

(the π²/6 is from the Basel problem). Note the sum is finite, which means the total error probability for all of these confidence intervals is less than a constant times δ. We have thus shown the following (note the quantifiers): with probability greater than 1 − δ, for all arms a and all sample sizes n ≥ 1,

  |μ̂_{a,n} − μ_a| ≤ c √( log(Kn/δ) / n )

(for a possibly different constant c). Observe that the confidence bound that any algorithm uses for arm a at time t is based on having n = N_{a,t} samples, so we can now apply the above bound in this case, where:

  c √( log(K N_{a,t}/δ) / N_{a,t} ) ≤ c √( log(Kt/δ) / N_{a,t} )

since N_{a,t} ≤ t. This shows that these confidence bounds are valid for all times t and all arms. The proof is completed by noting that for t ≥ K, log(Kt) ≤ 2 log t.

3.3 The Upper Confidence Bound (UCB) Algorithm

At each time t:

- Pull arm:

    a_t = argmax_a [ μ̂_{a,t} + c √( log(t/δ) / N_{a,t} ) ] := argmax_a [ μ̂_{a,t} + ConfBound_{a,t} ]

  (where c ≤ 10 is a constant).
- Observe reward X_t.
- Update μ̂_{a,t}, N_{a,t}, and ConfBound_{a,t}.

With probability greater than 1 − δ, all the confidence bounds will hold for all arms and all times t.

3.4 Analysis of UCB

If we pull arm a at time t, what is our instantaneous regret, i.e. what is μ* − μ_{a_t}? Let i* be an optimal arm. Note that by construction of the algorithm, if we pull arm a at time t, then:

  μ̂_{a,t} + ConfBound_{a,t} ≥ μ̂_{i*,t} + ConfBound_{i*,t} ≥ μ_{i*} = μ*,

where the last step follows because μ_{i*} is contained within the confidence interval for i*. Using this, together with μ_a ≥ μ̂_{a,t} − ConfBound_{a,t}, we have that:

  μ* − μ_a ≤ μ̂_{a,t} + ConfBound_{a,t} − μ_a ≤ 2 ConfBound_{a,t}.
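A minimal sketch of the UCB rule above on Bernoulli arms. We use c = 1 rather than the constant from the notes, and the arm means, horizon, and seed are made-up illustrative values.

```python
import math
import random

def ucb(mu, T, delta, rng, c=1.0):
    # Pull the arm maximizing sample mean + confidence bound (Section 3.3).
    K = len(mu)
    counts, totals, reward = [0] * K, [0.0] * K, 0.0
    for t in range(1, T + 1):
        if t <= K:
            a = t - 1  # pull each arm once to initialize the estimates
        else:
            a = max(range(K), key=lambda i: totals[i] / counts[i]
                    + c * math.sqrt(math.log(t / delta) / counts[i]))
        x = 1.0 if rng.random() < mu[a] else 0.0
        counts[a] += 1
        totals[a] += x
        reward += x
    return reward, counts

rng = random.Random(3)
mu, T = [0.3, 0.5, 0.7], 10_000
reward, counts = ucb(mu, T, delta=1.0 / T**2, rng=rng)
regret = max(mu) * T - reward
# UCB concentrates its pulls on the best arm, so regret stays far below
# the roughly 0.2 * T that uniform play would pay here.
print(regret, counts)
```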
Theorem 3.3. (UCB regret) The total expected regret of UCB is:

  μ*T − E[ Σ_{t=1}^T X_t | A ] ≤ c √( KT log T )

for an appropriately chosen universal constant c.

Proof. The expected regret is bounded as:

  μ*T − E[ Σ_{t=1}^T X_t ] ≤ Σ_{t=1}^T 2 ConfBound_{a_t,t}
                           ≤ Σ_a Σ_{n=1}^{N_{a,T}} 2c √( log(T/δ) / n )
                           ≤ Σ_a 4c √( log(T/δ) N_{a,T} ).   (1)

Note that the following constraint on the N_{a,T}'s must hold:

  Σ_a N_{a,T} = T.

One can now show that the worst-case setting of the N_{a,T}'s — the one that makes Equation 1 as large as possible subject to this constraint — is N_{a,T} = T/K, which gives a bound of 4c √( KT log(T/δ) ). Finally, to obtain the expected regret bound, the proof is identical to that of the previous argument (in the non-adaptive case, where we choose δ = 1/T²).
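The worst-case claim — that the arm-dependent part of Equation 1 is maximized at N_{a,T} = T/K — follows from concavity of the square root. A quick numerical check, with arbitrary illustrative values of K and T:

```python
import math
import random

def bound_value(counts):
    # The arm-dependent part of Equation 1: sum over arms of sqrt(N_{a,T}).
    return sum(math.sqrt(n) for n in counts)

K, T = 3, 300
uniform = [T // K] * K  # the claimed worst case, N_{a,T} = T/K
rng = random.Random(4)
for _ in range(1000):
    # A random nonnegative integer allocation summing to T.
    cuts = sorted(rng.randrange(T + 1) for _ in range(K - 1))
    counts = [b - a for a, b in zip([0] + cuts, cuts + [T])]
    assert bound_value(counts) <= bound_value(uniform) + 1e-9
print(bound_value(uniform), math.sqrt(K * T))  # both are 30.0 = sqrt(K*T) here
```

No random allocation beats the uniform one, and Σ_a √(T/K) = √(KT), matching the √(KT log T) shape of the theorem.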