Lecture 9: UCB Algorithm and Adversarial Bandit Problem

EECS598: Prediction and Learning: It's Only a Game                                        Fall 2013
Prof. Jacob Abernethy                                                  Scribe: Hossein Keshavarz

Announcements

- There is no class on Wednesday, November 27.
- Good job on project proposals so far!

9.1 Review of the stochastic multi-armed bandit problem

Consider a gambler playing against an $N$-armed slot machine who sequentially pulls arms in order to minimize the expected regret. Each arm $i \in \{1,\dots,N\}$ has a probability distribution $D_i$ supported on the closed interval $[0,1]$. Let $\mu_i$ be the expected loss of arm $i$ and define $\tilde{I} = \arg\min_{1 \le j \le N} \mu_j$ as the index of the arm with the smallest expected loss. If the gambler selects arm $I_t$ at round $t$, then the expected regret of the gambler up to round $T$ is defined as
\[
\mathbb{E}[\mathrm{regret}] = \mathbb{E}\Bigg[ \sum_{t=1}^{T} X_{I_t,t} - \sum_{t=1}^{T} X_{\tilde{I},t} \Bigg],
\]
where $X_{i,t} \sim D_i$ denotes the loss incurred by choosing arm $i$ at round $t$. It is worth mentioning that the expectation is taken over the distributions $\{D_i\}_{i=1}^{N}$ as well as over the randomness in choosing arms.

9.2 Analysis of the greedy algorithm

This section states and proves a theorem giving an upper bound on the expected regret of the greedy algorithm, which was introduced in the last lecture: sample each arm $m$ times, and thereafter always play the arm with the smallest empirical mean loss.

Theorem 9.1. Suppose that there exists a positive scalar $\Delta$ such that $\mu_j - \mu_{\tilde{I}} \ge \Delta$ for any $j \neq \tilde{I}$. Then the expected regret of the greedy algorithm has the following upper bound:
\[
\mathbb{E}[\mathrm{regret}] \le 1 + \frac{2N}{\Delta^2}\log(2NT).
\]

Proof. Let us decompose the expected regret into a sum of two terms, in which the first term is the regret incurred during the sampling phase and the second term corresponds to the exploitation phase.
\[
\begin{aligned}
\mathbb{E}[\mathrm{regret}] &= \mathbb{E}\Bigg[\sum_{t=1}^{Nm} \big(X_{I_t,t} - X_{\tilde{I},t}\big)\Bigg] + \mathbb{E}\Bigg[\sum_{t=Nm+1}^{T} \big(X_{I_t,t} - X_{\tilde{I},t}\big)\Bigg] \\
&\overset{(a)}{\le} Nm + \sum_{t=Nm+1}^{T} \mathbb{P}\big(I_t \neq \tilde{I}\big)
\overset{(b)}{\le} Nm + T \max_{Nm+1 \le t \le T} \mathbb{P}\big(I_t \neq \tilde{I}\big). \qquad (9.1)
\end{aligned}
\]
Note that inequality (a) is an immediate consequence of the fact that $X_{k,t} \in [0,1]$ for all $1 \le k \le N$ and $1 \le t \le T$, and inequality (b) follows from bounding each summand by the maximum over $t$. Therefore, in order to complete the proof we need to upper bound $\mathbb{P}(I_t \neq \tilde{I})$ uniformly over $t$. If $I_t \neq \tilde{I}$ then $\hat{\mu}_{I_t} \le \hat{\mu}_{\tilde{I}}$; hence, using some straightforward algebraic manipulation, we obtain
\[
\Delta \le \mu_{I_t} - \mu_{\tilde{I}}
= \big(\mu_{I_t} - \hat{\mu}_{I_t}\big) + \big(\hat{\mu}_{I_t} - \hat{\mu}_{\tilde{I}}\big) + \big(\hat{\mu}_{\tilde{I}} - \mu_{\tilde{I}}\big)
\le \big|\mu_{I_t} - \hat{\mu}_{I_t}\big| + 0 + \big|\hat{\mu}_{\tilde{I}} - \mu_{\tilde{I}}\big|
\le 2 \max_{1 \le j \le N} \big|\mu_j - \hat{\mu}_j\big|.
\]
In other words, $\mathbb{P}(I_t \neq \tilde{I}) \le \mathbb{P}\big(\max_{1\le j\le N} |\mu_j - \hat{\mu}_j| \ge \Delta/2\big)$ for any $t$. Combining the union bound (inequality (a) below) with Hoeffding's inequality (inequality (b) below) gives the desired upper bound:
\[
\max_{Nm+1\le t\le T} \mathbb{P}\big(I_t \neq \tilde{I}\big)
\le \mathbb{P}\Big(\max_{1\le j\le N} \big|\mu_j - \hat{\mu}_j\big| \ge \tfrac{\Delta}{2}\Big)
\overset{(a)}{\le} N \max_{1\le j\le N} \mathbb{P}\Big(\big|\mu_j - \hat{\mu}_j\big| \ge \tfrac{\Delta}{2}\Big)
\overset{(b)}{\le} 2N \exp\Big(-\frac{m\Delta^2}{2}\Big). \qquad (9.2)
\]
Letting the exploration parameter $m = \frac{2}{\Delta^2}\log\frac{2N}{\delta}$ and substituting inequality (9.2) into inequality (9.1) yield
\[
\mathbb{E}[\mathrm{regret}] \le Nm + T\delta = \frac{2N}{\Delta^2}\log\frac{2N}{\delta} + T\delta.
\]
Choosing $\delta = \frac{1}{T}$ terminates the proof. Note that the reason $\Delta^2$ appears in the denominator of the expected regret bound is that the inequality $\mathbb{E}\big[\sum_{t=1}^{Nm}(X_{I_t,t} - X_{\tilde{I},t})\big] \le Nm$ is not tight enough.

Lemma 9.2 (Hoeffding's inequality). Let $Z_1,\dots,Z_n$ be independent and identically distributed random variables such that $Z_i \in [0,1]$ almost surely and $\mathbb{E}[Z_i] = \mu$. Then
\[
\mathbb{P}\Bigg( \bigg| \frac{1}{n}\sum_{i=1}^{n} Z_i - \mu \bigg| \ge \epsilon \Bigg) \le 2\exp\big(-2n\epsilon^2\big).
\]
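As a quick illustration, the explore-then-exploit greedy scheme analyzed above can be sketched in a few lines of Python. The Bernoulli loss distributions, horizon, and value of $m$ below are illustrative assumptions, not part of the lecture.

```python
import random

def greedy_bandit(T, arms, m):
    """Explore-then-exploit greedy: pull each of the N arms m times, then
    always play the arm with the smallest empirical mean loss.

    `arms` is a list of N callables, each returning a loss in [0, 1].
    Returns the total loss incurred and the index of the exploited arm.
    """
    N = len(arms)
    totals = [0.0] * N          # cumulative observed loss per arm
    total_loss = 0.0
    t = 0
    # Sampling phase: N * m rounds (assumes N * m <= T).
    for _ in range(m):
        for j in range(N):
            x = arms[j]()
            totals[j] += x
            total_loss += x
            t += 1
    # Exploitation phase: commit to the empirically best arm.
    best = min(range(N), key=lambda j: totals[j] / m)
    for _ in range(T - t):
        total_loss += arms[best]()
    return total_loss, best

# Toy experiment (assumed setup): two Bernoulli-loss arms with means 0.8 and 0.2.
random.seed(0)
arms = [lambda: float(random.random() < 0.8),
        lambda: float(random.random() < 0.2)]
loss, best = greedy_bandit(T=2000, arms=arms, m=50)
print(best)
```

With a gap this large ($\Delta = 0.6$), the sampling phase identifies the optimal arm (index 1) with overwhelming probability, and the incurred loss is dominated by the $Nm$ exploration rounds, mirroring the decomposition in the proof above.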
9.3 Upper Confidence Bound (UCB) algorithm

In order to introduce the UCB algorithm, let us first introduce some notation. Define
\[
T_j(t) = \sum_{s=1}^{t} \mathbb{I}\{I_s = j\}
\]
as the number of times the gambler has played arm $j$ up to time $t$, and let the empirical mean loss of arm $j$ be
\[
\hat{\mu}_{j,t} = \frac{1}{T_j(t)} \sum_{s=1}^{t} X_{j,s}\, \mathbb{I}\{I_s = j\}. \qquad (9.3)
\]

Algorithm 1. UCB algorithm for the multi-armed bandit problem
    Initialization: Play each arm once
    For $t = 1$ to $T$
        For $j = 1$ to $N$
            Update $\hat{\mu}_{j,t}$ according to identity (9.3)
        Play the arm $I_t = \arg\min_{1\le j\le N} \bigg( \hat{\mu}_{j,t} - \sqrt{\dfrac{3\log t}{T_j(t)}} \bigg)$

The arm-selection criterion of the UCB algorithm encourages choosing rarely played arms. The next theorem characterizes an upper bound on the expected regret of the UCB algorithm. The interested reader is referred to [1] for further details and the proof of Theorem 9.3.

Theorem 9.3. Let $\Delta_j = \mu_j - \mu_{\tilde{I}}$ and suppose $\Delta_j > 0$ for any $j \neq \tilde{I}$. Then the expected regret of the UCB algorithm can be upper bounded as
\[
\mathbb{E}[\mathrm{regret}] \le O\Bigg( \sum_{j \neq \tilde{I}} \frac{\log T}{\Delta_j} \Bigg).
\]

Theorem 9.3 shows that the expected regret of the UCB algorithm is far better than that of the greedy method for small $\Delta_j$'s. Strictly speaking, if $\Delta_{\min} = \min_{j \neq \tilde{I}} \Delta_j$, then
\[
\mathbb{E}[\mathrm{regret}] \le O\bigg( \frac{N \log T}{\Delta_{\min}} \bigg).
\]

9.4 Adversarial Bandit

Unlike the stochastic multi-armed bandit setting, where the loss of each action has a stationary distribution over time, in the adversarial bandit problem there is no statistical assumption on the process generating the losses. In this new formulation, the loss of each arm is determined at each round by an adversary, and the player only observes the losses of the actions it has chosen. The only assumption about the loss vector is that $l_t \in [0,1]^N$ at each round $t$. Since the adversary can assign high losses to previously selected actions and low losses to unseen arms, a deterministic arm-selection policy cannot optimize the expected
regret function. Finally, we mention that the distribution of the action $I_t$ depends only on the losses of the previous actions, $\{l_s(I_s)\}_{s=1}^{t-1}$.

Due to the lack of full information about the losses of the arms, the player cannot run the exponential weighted algorithm (EWA) to optimize the regret function. One natural solution is to estimate the loss vector $l_t$ based on the observation of the single component $l_t(I_t)$. Taking advantage of the tower property of conditional expectation, we can show that if $\tilde{l}_t$ is an unbiased estimator of $l_t$, i.e. $\mathbb{E}[\tilde{l}_t \mid \mathcal{F}_t] = l_t$ where $\mathcal{F}_t$ is the $\sigma$-field generated by the observations up to round $t$, then the expected regrets with respect to $\tilde{l}_t$ and $l_t$ are the same:
\[
\mathbb{E}\big[\tilde{l}_t \cdot (p_t - p)\big]
= \mathbb{E}\Big[\mathbb{E}\big[\tilde{l}_t \cdot (p_t - p) \mid \mathcal{F}_t\big]\Big]
= \mathbb{E}\Big[\mathbb{E}\big[\tilde{l}_t \mid \mathcal{F}_t\big] \cdot (p_t - p)\Big]
= \mathbb{E}\big[l_t \cdot (p_t - p)\big].
\]

9.4.1 Unbiased estimation of $l_t$

We claim that the following procedure, called the exponential weighted algorithm with $\epsilon$-exploration, generates an unbiased estimator of $l_t$:

1. With probability $\epsilon$, choose $p_t = (1/N, \dots, 1/N)$, select $I_t \sim p_t$ uniformly at random, and let $\tilde{l}_t = \Big(0,\dots,0, \frac{N\, l_t(I_t)}{\epsilon}, 0,\dots,0\Big)$, with the nonzero entry in coordinate $I_t$.

2. With probability $1-\epsilon$, choose $p_t$ by the exponential weighted algorithm on the loss vectors $\{\tilde{l}_s\}_{s=1}^{t-1}$ and let $\tilde{l}_t = 0$.

Proof of claim:
\[
\mathbb{E}\big[\tilde{l}_t\big] = (1-\epsilon)\cdot 0 + \epsilon \cdot \frac{1}{N} \sum_{j=1}^{N} \Big(0,\dots,0, \frac{N\, l_t(j)}{\epsilon}, 0,\dots,0\Big) = l_t.
\]

Since $\tilde{l}_t$ is an unbiased estimator of $l_t$, at first glance it seems that the expected regret of the above algorithm is exactly equal to the expected regret of EWA. However, an attentive reader will notice that $\tilde{l}_t$ no longer lies in the closed cube $[0,1]^N$. Recalling the proof of Theorem 3.1 in the lecture notes, one can show that
\[
\mathbb{E}[\text{regret of EWA with } \epsilon\text{-exploration}] \le T\epsilon + \frac{\log N}{\eta} + \eta T \max_{t}\big\|\tilde{l}_t\big\|_\infty^2 = O\bigg( T\epsilon + \frac{\log N}{\eta} + \frac{\eta T N^2}{\epsilon^2} \bigg).
\]
Now choosing the regularization parameters $\epsilon = \sqrt{N}\big(\frac{\log N}{T}\big)^{1/4}$ and $\eta = \frac{\epsilon}{N}\sqrt{\frac{\log N}{T}}$, the optimal upper bound is given by
\[
\mathbb{E}[\text{regret of EWA with } \epsilon\text{-exploration}] \le O\Big( \sqrt{N}\, T^{3/4} (\log N)^{1/4} \Big). \qquad (9.4)
\]
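The $\epsilon$-exploration scheme of Section 9.4.1 can be sketched as follows. The toy adversary, the parameter values, and the `loss_fn` interface are assumptions made for illustration; a careful implementation would tune $\epsilon$ and $\eta$ as above.

```python
import math
import random

def ewa_eps_exploration(T, N, loss_fn, eps, eta, rng):
    """Sketch of EWA with epsilon-exploration.

    With probability eps: play a uniformly random arm and form the
    importance-weighted estimate (N / eps) * l_t(I_t) on that coordinate.
    With probability 1 - eps: sample from the EWA distribution over the
    estimated losses, and leave the estimate at zero for this round.
    `loss_fn(t)` returns the adversary's loss vector l_t in [0,1]^N
    (assumed interface). Returns the total loss incurred.
    """
    L = [0.0] * N                      # cumulative estimated losses
    total_loss = 0.0
    for t in range(T):
        l = loss_fn(t)
        if rng.random() < eps:
            i = rng.randrange(N)       # exploration round: uniform arm
            L[i] += (N / eps) * l[i]   # unbiased importance-weighted estimate
        else:
            m = min(L)                 # shift for numerical stability
            w = [math.exp(-eta * (Lj - m)) for Lj in L]
            s = sum(w)
            p = [wj / s for wj in w]
            i = rng.choices(range(N), weights=p)[0]  # exploitation: sample from EWA
            # estimated loss vector stays zero on exploitation rounds
        total_loss += l[i]
    return total_loss

# Toy adversary (assumption): arm 0 always suffers loss 1, arm 1 loss 0.
rng = random.Random(1)
loss = ewa_eps_exploration(T=1000, N=2, loss_fn=lambda t: [1.0, 0.0],
                           eps=0.1, eta=0.05, rng=rng)
print(loss)
```

After a handful of exploration rounds the EWA distribution concentrates on arm 1, so the total loss is dominated by the forced exploration, consistent with the $T\epsilon$ term in the bound above.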
It is worthwhile to mention that although $\max_{1\le t\le T} \|\tilde{l}_t\|_\infty \le \frac{N}{\epsilon}$, most of the time, namely with probability $1-\epsilon$, we have $\tilde{l}_t = 0 \in [0,1]^N$. Hence, the proof can be slightly modified in a smart way to obtain the upper bound
\[
\mathbb{E}[\text{regret of EWA with } \epsilon\text{-exploration}] \le O\bigg( T\epsilon + \frac{\log N}{\eta} + \frac{\eta T N^2}{\epsilon} \bigg),
\]
which leads to $\mathbb{E}[\text{regret of EWA with } \epsilon\text{-exploration}] \le O\big((NT)^{2/3}\big)$, suppressing logarithmic factors.

Question: Is there an algorithm with expected regret $O(\sqrt{NT})$? Yes! The EXP3 algorithm.

9.5 EXP3 Algorithm

Let $\tilde{L}_j(t) = \sum_{s=1}^{t} \tilde{l}_s(j)$ denote the cumulative estimated loss of arm $j$ up to round $t$.

Algorithm 2. EXP3 algorithm for the adversarial multi-armed bandit problem
    Input: regularization parameter $\eta$
    Initialization: set the initial distribution $p_1 = (1/N,\dots,1/N)$
    For $t = 1$ to $T$
        Sample $I_t$ based on the distribution $p_t$
        Observe $l_t(I_t)$
        Let $\tilde{l}_t = \Big(0,\dots,0, \frac{l_t(I_t)}{p_t(I_t)}, 0,\dots,0\Big)$, with the nonzero entry in coordinate $I_t$
        For $j = 1$ to $N$, update
        \[
        p_{t+1}(j) = \frac{\exp\big(-\eta \tilde{L}_j(t)\big)}{\sum_{j'=1}^{N} \exp\big(-\eta \tilde{L}_{j'}(t)\big)}
        \]

The EXP3 algorithm will be analyzed in the next class.

References

[1] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning, 47(2-3):235-256, 2002.

[2] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire. Gambling in a rigged casino: The adversarial multi-armed bandit problem. In Proceedings of the 36th Annual Symposium on Foundations of Computer Science, pages 322-331. IEEE, 1995.