CMSC 858G: Bandits, Experts and Games                                09/19/16
Lecture 3: Lower Bounds for Bandit Algorithms
Instructor: Alex Slivkins            Scribes: Soham De & Karthik A Sankararaman

1 Lower Bounds

In this lecture (and the first half of the next one), we prove an Ω(√(KT)) lower bound on the regret of bandit algorithms. This gives us a sense of the best possible upper bounds on regret that we can hope to prove.

On a high level, there are two ways of proving a lower bound on regret:

(1) Give a family F of problem instances, the same for all algorithms, such that any algorithm fails (has high regret) on some instance in F.

(2) Give a distribution over problem instances, and show that, in expectation over this distribution, any algorithm will fail.

Note that (2) implies (1): if regret is high in expectation over problem instances, then there exists at least one problem instance with high regret. Also, (1) implies (2) if |F| is a constant. This can be seen as follows: suppose we know that any algorithm has high regret (say H) on one problem instance in F and low regret on all other instances in F; then, taking a uniform distribution over F, any algorithm has expected regret at least H/|F|. (So this argument breaks down if |F| is large.) If we prove a stronger version of (1) that says that for any algorithm, regret is high on a constant fraction of the problem instances in F, then, considering a uniform distribution over F, this implies (2) regardless of whether |F| is large or not.

In this lecture, for proving lower bounds, we consider 0-1 rewards and the following family of problem instances (with fixed ε to be adjusted in the analysis):

    I_j = { µ_i = 1/2          for each arm i ≠ j,
            µ_i = (1 + ε)/2    for arm i = j }         for each j = 1, 2, ..., K.     (1)

(Recall that K is the number of arms.) In the previous lecture, we saw that sampling each arm Õ(1/ε²) times is sufficient for the upper bounds on regret that we derived. In this lecture, we prove that sampling each arm Ω(1/ε²) times is necessary to determine whether an arm is bad or not. The proof methods will require the KL-divergence, an important tool from information theory. In the next section, we briefly study the KL-divergence and some of its properties.

2 KL-divergence

Consider a finite sample space Ω, and let p, q be two probability distributions defined on Ω. Then the Kullback-Leibler divergence, or KL-divergence, is defined as:

    KL(p, q) = Σ_{x ∈ Ω} p(x) ln( p(x)/q(x) ) = E_p[ ln( p(x)/q(x) ) ].
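As a quick sanity check on this definition (not part of the original notes; the helper name kl and the example distributions are our own), the following minimal Python sketch evaluates it on the biased-vs-fair coin pair that the rest of the lecture uses.

import math

def kl(p, q):
    """KL-divergence between two distributions on the same finite sample space,
    each given as a dict mapping outcomes to probabilities."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

eps = 0.1
p_eps = {1: (1 + eps) / 2, 0: (1 - eps) / 2}   # slightly biased coin
p_fair = {1: 0.5, 0: 0.5}                      # fair coin

print(kl(p_eps, p_fair))   # about eps^2 / 2 for small eps (and at most 2*eps^2)
print(kl(p_fair, p_fair))  # 0, since KL(p, p) = 0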

The KL-divergence is similar to a notion of distance, with the properties that it is non-negative, equals 0 iff p = q, and is small if the distributions p and q are close. However, it is not strictly a distance function, since it is not symmetric and does not satisfy the triangle inequality. The intuition for the formula is as follows: we are interested in how certain we are that data with underlying distribution q could have been generated from distribution p. The KL-divergence effectively answers this question by measuring the average log-likelihood ratio of the data under p when the underlying distribution of the data is actually q.

Remark 2.1. The definition of KL-divergence, as well as the properties discussed below, extend to infinite sample spaces. However, KL-divergence for finite sample spaces suffices for this class, and is much easier to work with.

Properties of KL-divergence

We present several basic properties of KL-divergence that will be needed later. The proofs of these properties are fairly simple; we include them here for the sake of completeness.

1. Gibbs' inequality: KL(p, q) ≥ 0 for all p, q. Further, KL(p, q) = 0 iff p = q.

Proof. Define f(y) = y ln(y); f is a convex function on the domain y > 0. From the definition of the KL-divergence we get:

    KL(p, q) = Σ_{x ∈ Ω} p(x) ln( p(x)/q(x) )
             = Σ_{x ∈ Ω} q(x) f( p(x)/q(x) )
             ≥ f( Σ_{x ∈ Ω} q(x) · p(x)/q(x) )        [follows from Jensen's inequality]
             = f( Σ_{x ∈ Ω} p(x) ) = f(1) = 0,

where Jensen's inequality states that φ(λ_1 x_1 + λ_2 x_2) ≤ λ_1 φ(x_1) + λ_2 φ(x_2) if φ is a convex function and λ_1 + λ_2 = 1 with λ_1, λ_2 > 0. Jensen's inequality further has the property that equality holds iff x_1 = x_2 or φ is a linear function. In this case, since f is not a linear function, equality holds (i.e., KL(p, q) = 0) iff p(x) = q(x) for all x ∈ Ω.

2. Let the sample space be a product Ω = Ω_1 × Ω_2 × ... × Ω_n. Further, let p and q be two distributions on Ω of the product form p = p_1 × p_2 × ... × p_n and q = q_1 × q_2 × ... × q_n, where for each j = 1, ..., n, p_j and q_j are distributions on Ω_j. Then we have the property:

    KL(p, q) = Σ_{j=1}^{n} KL(p_j, q_j).

Proof. Let x = (x_1, x_2, ..., x_n) ∈ Ω with x_i ∈ Ω_i for i = 1, ..., n. Let h_i(x_i) = ln( p_i(x_i)/q_i(x_i) ). Then:

    KL(p, q) = Σ_{x ∈ Ω} p(x) ln( p(x)/q(x) )
             = Σ_{x ∈ Ω} p(x) Σ_{i=1}^{n} h_i(x_i)           [since ln( p(x)/q(x) ) = Σ_{i=1}^{n} h_i(x_i)]
             = Σ_{i=1}^{n} Σ_{x ∈ Ω} p(x) h_i(x_i)
             = Σ_{i=1}^{n} Σ_{x_i ∈ Ω_i} h_i(x_i) Σ_{x ∈ Ω : i-th coordinate = x_i} p(x)
             = Σ_{i=1}^{n} Σ_{x_i ∈ Ω_i} p_i(x_i) h_i(x_i)   [since Σ_{x ∈ Ω : i-th coordinate = x_i} p(x) = p_i(x_i)]
             = Σ_{i=1}^{n} KL(p_i, q_i).

3. Weaker form of Pinsker's inequality: for every event A ⊂ Ω,  2 (p(A) − q(A))² ≤ KL(p, q).

Proof. To prove this property, we first claim the following:

Claim 2.2. For each event A ⊂ Ω,

    Σ_{x ∈ A} p(x) ln( p(x)/q(x) ) ≥ p(A) ln( p(A)/q(A) ).

Proof. Let us define the following distributions on A:

    p_A(x) = p(x)/p(A)   and   q_A(x) = q(x)/q(A)   for all x ∈ A.

Then the claim can be proved as follows:

    Σ_{x ∈ A} p(x) ln( p(x)/q(x) ) = p(A) Σ_{x ∈ A} p_A(x) ln( p(A) p_A(x) / ( q(A) q_A(x) ) )
                                   = p(A) ( Σ_{x ∈ A} p_A(x) ln( p_A(x)/q_A(x) ) + ln( p(A)/q(A) ) )
                                   ≥ p(A) ln( p(A)/q(A) ).   [since Σ_{x ∈ A} p_A(x) ln( p_A(x)/q_A(x) ) = KL(p_A, q_A) ≥ 0]

Fix A ⊂ Ω. Using Claim 2.2 twice, we have the following:

    Σ_{x ∈ A} p(x) ln( p(x)/q(x) ) ≥ p(A) ln( p(A)/q(A) ),
    Σ_{x ∉ A} p(x) ln( p(x)/q(x) ) ≥ p(Ā) ln( p(Ā)/q(Ā) ),

where Ā denotes the complement of A.

where Ā denotes the complement of A. Now, let a p(a) and b q(a). Further, assume a < b. Then, we have: KL(p, q) a ln a 1 a + (1 a) ln b 1 b b ( a x + 1 a ) dx 1 x a b a b This proves the property. a x a x(1 x) dx 4(x a)dx (b a). [since x(1 x) 1/4] 4. Let p ɛ denote a distribution on {0, 1} such that p ɛ (1) (1 + ɛ)/. Thus, p ɛ (0) (1 ɛ)/. Further, let p 0 denote the distribution on {0, 1} where p 0 (0) p 0 (1) 1/. Then we have the property: KL(p ɛ, p 0 ) ɛ. Proof. KL(p ɛ, p 0 ) 1 + ɛ ln(1 + ɛ) + 1 ɛ ln(1 ɛ) 1 (ln(1 + ɛ) + ln(1 ɛ)) + ɛ (ln(1 + ɛ) ln(1 ɛ)) 1 ln(1 ɛ ) + ɛ ln 1 + ɛ 1 ɛ. Now, ln(1 ɛ ) < 0 and we can write ln 1+ɛ 1 ɛ ln ( 1 + ɛ 1 ɛ KL(p ɛ, p 0 ) < ɛ How are these properties going to be used? ) ɛ 1 ɛ ɛ 1 ɛ ɛ. We start with the same setting as in Property. From Property 3, we have: ɛ 1 ɛ. Thus, we get: (p(a) q(a)) KL(p, q) n KL(p, q ). (follows from Property ) 1 For example, we can define p and q to be distributions of a biased coin with small ɛ (p (1) (1 + ɛ)/, p (1) (1 ɛ)/) vs an unbiased coin (q (0) q (1) 1/). Then, we can use Property 4 to bound the above as: n n (p(a) q(a)) KL(p, q ) δ nδ, 1 where δ ɛ Thus, we arrive at the following bound: p(a) q(a) nδ/. 1 4

3 A simple example: flipping one coin

We start with a simple example, which illustrates our proof technique and is interesting as a standalone result. We have a single coin, whose outcome is 0 or 1. The coin's mean is unknown. We assume that the true mean µ ∈ [0, 1] is either µ_1 or µ_2, for two known values µ_1 > µ_2. The coin is flipped T times. The goal is to identify whether µ = µ_1 or µ = µ_2.

Define Ω := {0, 1}^T to be the sample space of the outcomes of the T coin tosses. We need a decision rule

    Rule : Ω → {High, Low}

with the following two properties:

    Pr[ Rule(observations) = High | µ = µ_1 ] ≥ 0.99                 (2)
    Pr[ Rule(observations) = Low  | µ = µ_2 ] ≥ 0.99                 (3)

The question is how large T should be for such a Rule to exist. We know that if δ = µ_1 − µ_2, then T ≥ Ω(1/δ²) is sufficient. We will prove that it is also necessary. We will focus on the special case when both µ_1 and µ_2 are close to 1/2.

Claim 3.1. Let µ_1 = (1 + ε)/2 and µ_2 = 1/2. For any rule to work (i.e., satisfy equations (2) and (3)), we need

    T ≥ Ω(1/ε²).                                                     (4)

Proof. Define, for any event A ⊂ Ω, the following quantities:

    P_1(A) = Pr[A | µ = µ_1],
    P_2(A) = Pr[A | µ = µ_2].

To prove the claim, we will consider the following inequality. For the event A ⊂ Ω on which the rule outputs High (i.e., A = {ω ∈ Ω : Rule(ω) = High}), properties (2) and (3) imply

    P_1(A) − P_2(A) ≥ 0.98.                                          (5)

We prove the claim by showing that if (4) is false, then (5) is false, too. Specifically, we will assume that T < 1/(4ε²). (In fact, the argument below holds for an arbitrary event A ⊂ Ω.)

Define, for each i ∈ {1, 2}, P_{i,t} to be the distribution of the t-th coin toss under P_i. Then P_i = P_{i,1} × P_{i,2} × ... × P_{i,T}. From the KL-divergence properties, we have:

    2 (P_1(A) − P_2(A))² ≤ KL(P_1, P_2)                              [Property 3]
                         = Σ_{t=1}^{T} KL(P_{1,t}, P_{2,t})          [Property 2]
                         ≤ 2 T ε²                                    [Property 4]

Hence, we have

    P_1(A) − P_2(A) ≤ ε √T < 1/2,

where the last inequality holds because T < 1/(4ε²). This contradicts (5), which completes the proof.
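The following simulation (our own illustration, not part of the notes) makes Claim 3.1 tangible for one particular, natural decision rule: threshold the number of heads at the midpoint of the two means. It is not claimed to be the optimal rule; it merely shows the 1/ε² scale of the problem.

import random

def rule_success(eps, T, trials=1000, seed=0):
    """Empirical success probabilities of the midpoint-threshold rule: output High
    iff the number of heads exceeds T*(2 + eps)/4, the midpoint of the two means.
    Returns (Pr[High | mu = mu_1], Pr[Low | mu = mu_2]), estimated by simulation."""
    rng = random.Random(seed)
    threshold = T * (2 + eps) / 4
    high_ok = low_ok = 0
    for _ in range(trials):
        heads_1 = sum(rng.random() < (1 + eps) / 2 for _ in range(T))  # mu = mu_1
        heads_2 = sum(rng.random() < 0.5 for _ in range(T))            # mu = mu_2
        high_ok += heads_1 > threshold
        low_ok += heads_2 <= threshold
    return high_ok / trials, low_ok / trials

eps = 0.1
for T in (10, 100, 1000, 4000):    # note 1/eps^2 = 100
    print(T, rule_success(eps, T))
# Both success probabilities reach the 0.99 level of (2)-(3) only once T is a
# large multiple of 1/eps^2; for T near or below 1/eps^2 they stay well below it.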

4 Flipping several coins: bandits with prediction

Let us extend the previous example to flipping multiple coins. More formally, we consider a bandit problem with K arms, where each arm corresponds to a coin. Each arm gives a 0-1 reward, drawn independently from a fixed but unknown distribution. After T rounds, the algorithm outputs a guess y_T ∈ A for which arm is the best arm, where A is the set of all arms. (Recall that the best arm is the arm with the highest mean reward.) We call this version "bandits with predictions". In this section, we will only be concerned with the quality of the prediction, rather than with accumulated rewards and regret.

For each arm a ∈ A, the mean reward is denoted µ(a). (We will also write it as µ_a whenever convenient.) A particular problem instance is specified as a tuple I = (µ(a) : a ∈ A). A good algorithm for the bandits-with-prediction problem described above should satisfy

    Pr[ y_T is correct | I ] ≥ 0.99                                  (6)

for each problem instance I. We will use the family (1) of problem instances to argue that one needs T ≥ Ω(K/ε²) for any algorithm to work, i.e., satisfy (6), on all instances in this family.

Lemma 4.1. Suppose an algorithm for bandits with predictions satisfies (6) for all problem instances I_1, ..., I_K. Then T ≥ Ω(K/ε²).

This result is of independent interest (regardless of the lower bound on regret). In fact, we will prove a stronger lemma, which will (also) be the crux in the proof of the regret bound.

Lemma 4.2. Suppose T ≤ cK/ε², for a small enough absolute constant c. Fix any deterministic algorithm for bandits with prediction. Then there exist at least K/3 arms j such that

    Pr[ y_T = j | I_j ] < 3/4.

Remark 4.3. The proof for K = 2 arms is particularly simple, so we will do it first. We will then extend this proof to arbitrary K with more subtleties. While the lemma holds for an arbitrary K, we will present a simplified proof which requires K ≥ 24.

We will use the standard shorthand [T] := {1, 2, ..., T}.

Let us set up the sample space to be used in the proof. Let (r_t(a) : a ∈ A, t ∈ [T]) be mutually independent 0-1 random variables such that r_t(a) has expectation µ(a). We refer to this tuple as the rewards table, where we interpret r_t(a) as the reward received by the algorithm the t-th time it chooses arm a. The sample space is Ω = {0, 1}^{K × T}, where each outcome ω ∈ Ω corresponds to a particular realization of the rewards table. Each problem instance I_j defines a distribution P_j on Ω:

    P_j(A) = Pr[A | I_j]   for each event A ⊂ Ω.

Also, let P_j^{a,t} be the distribution of r_t(a) under instance I_j, so that

    P_j = ∏_{a ∈ A, t ∈ [T]} P_j^{a,t}.

Proof (K = 2 arms). Define A = {ω ∈ Ω : y_T = 1}; in other words, A is the event that the algorithm predicts arm 1. (But the argument below holds for any event A ⊂ Ω.) Similarly to the previous section, we use the properties of KL-divergence as follows:

    2 (P_1(A) − P_2(A))² ≤ KL(P_1, P_2)

                         = Σ_{a=1}^{2} Σ_{t=1}^{T} KL(P_1^{a,t}, P_2^{a,t})
                         ≤ 4 T ε²                                                      (7)

The last inequality holds because KL(P_1^{a,t}, P_2^{a,t}) is non-zero only when P_1^{a,t} ≠ P_2^{a,t}, and when they differ, their KL-divergence is at most 2ε² by Property 4. Hence,

    P_1(A) − P_2(A) ≤ ε √(2T) < 1/2,

where the last inequality holds whenever T < 1/(8ε²). To complete the proof, observe that if Pr[y_T = j | I_j] ≥ 3/4 for both problem instances j ∈ {1, 2}, then P_1(A) ≥ 3/4 and P_2(A) < 1/4, so their difference is at least 1/2, a contradiction.

Proof (K ≥ 24). Compared to the 2-arms case, the time horizon T can now be larger by a factor of O(K). The crucial improvement is a more delicate version of the KL-divergence argument in (7), which results in a right-hand side of the form O(Tε²/K).

For the sake of the analysis, we will consider an additional problem instance

    I_0 = { µ_a = 1/2 for all arms a },

which we call the base instance. Let E_0[·] denote expectation under this problem instance, and let P_0 denote the corresponding distribution on Ω. Also, let T_a be the total number of times arm a is played.

We consider the algorithm's performance on problem instance I_0, and focus on arms j that are neglected by the algorithm, in the sense that the algorithm does not choose arm j very often and is not likely to pick j as the guess y_T. Formally, we observe that:

    there exist at least 2K/3 arms j such that E_0[T_j] ≤ 3T/K,         (8)
    there exist at least 2K/3 arms j such that P_0(y_T = j) ≤ 3/K.      (9)

(To prove (8), assume for contradiction that more than K/3 arms j have E_0[T_j] > 3T/K. Then the expected total number of times these arms are played is strictly greater than T, which is a contradiction. (9) is proved similarly.)

By Markov's inequality, E_0[T_j] ≤ 3T/K implies that Pr_0[T_j ≤ 24T/K] ≥ 7/8. Since the sets of arms in (8) and (9) must overlap on at least K/3 arms, we conclude:

    there exist at least K/3 arms j such that Pr_0[T_j ≤ m] ≥ 7/8 and P_0(y_T = j) ≤ 3/K,   (10)

where m = 24T/K.

We will now refine our definition of the sample space to obtain the required claim. For each arm a, define the t-round sample space Ω_a^t = {0, 1}^t, where each outcome corresponds to a particular realization of the tuple (r_s(a) : s ∈ [t]). (Recall that we interpret r_t(a) as the reward received by the algorithm the t-th time it chooses arm a.) Then the full sample space we considered before can be expressed as

    Ω = ∏_{a ∈ A} Ω_a^T.

Fix an arm j satisfying the two properties in (10). We will consider a reduced sample space in which arm j is played only m = 24T/K times:

    Ω* = Ω_j^m × ∏_{arms a ≠ j} Ω_a^T.                                (11)

Each problem instance I_l defines a distribution P*_l on Ω*:

    P*_l(A) = Pr[A | I_l]   for each event A ⊂ Ω*.

In other words, the distribution P*_l is the restriction of P_l to the reduced sample space Ω*.

We apply the KL-divergence argument to the distributions P*_0 and P*_j. For each event A ⊂ Ω*:

    2 (P*_0(A) − P*_j(A))² ≤ KL(P*_0, P*_j)
                           = Σ_{arms a ≠ j} Σ_{t=1}^{T} KL(P_0^{a,t}, P_j^{a,t}) + Σ_{t=1}^{m} KL(P_0^{j,t}, P_j^{j,t})
                           ≤ 0 + 2 m ε².

Note that each arm a ≠ j has identical reward distributions under instances I_0 and I_j (namely, its mean reward is 1/2). So the distributions P_0^{a,t} and P_j^{a,t} are the same, and therefore their KL-divergence is 0. Whereas for arm j we only need to sum over m samples. Therefore, assuming T ≤ cK/ε² with a small enough constant c, we can conclude that

    |P*_0(A) − P*_j(A)| ≤ ε √m < 1/8   for all events A ⊂ Ω*.         (12)

To apply (12), we need to make sure that the event A is in fact contained in Ω*, i.e., that whether A holds is completely determined by the first m samples of arm j (and arbitrarily many samples of the other arms). In particular, we cannot take A = {y_T = j}, which would be the most natural extension of the proof technique from the 2-arms case. Instead, we apply (12) twice: to the events

    A = {y_T = j and T_j ≤ m}   and   A' = {T_j > m}.                 (13)

Indeed, note that whether the algorithm samples arm j more than m times is completely determined by the first m coin tosses of arm j (together with the tosses of the other arms)!

We are ready for the final computation:

    P_j(A)  ≤ 1/8 + P_0(A)                      [by (12)]
            ≤ 1/8 + P_0(y_T = j) ≤ 1/4          [by our choice of arm j, since P_0(y_T = j) ≤ 3/K ≤ 1/8]
    P_j(A') ≤ 1/8 + P_0(A')                     [by (12)]
            ≤ 1/4                               [by our choice of arm j, since P_0(T_j > m) ≤ 1/8]
    P_j(y_T = j) = P_j(y_T = j and T_j ≤ m) + P_j(y_T = j and T_j > m)
                 ≤ P_j(A) + P_j(A') ≤ 1/4 + 1/4 = 1/2 < 3/4.

Recall that this holds for any arm j satisfying the properties in (10). Since there are at least K/3 such arms, the lemma follows.

Next lecture: Lemma 4.2 is used to derive the Ω(√(KT)) lower bound on regret.
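To make Lemma 4.2 concrete, here is a small simulation (our own illustration, not from the notes). It runs one particular algorithm, round-robin exploration followed by an empirical-best guess, on instances from the family (1) and reports how often the guess y_T is correct for several time horizons. The lemma is a statement about all algorithms; the simulation merely shows the K/ε² threshold for this one concrete choice.

import random

def predict_best_arm(K, T, best_arm, eps, rng):
    """Round-robin exploration for T rounds on instance I_{best_arm} (the best arm
    has mean (1 + eps)/2, every other arm has mean 1/2), then guess the arm with
    the highest empirical mean."""
    heads = [0] * K
    pulls = [0] * K
    for t in range(T):
        a = t % K
        mean = (1 + eps) / 2 if a == best_arm else 0.5
        heads[a] += rng.random() < mean
        pulls[a] += 1
    return max(range(K), key=lambda a: heads[a] / max(pulls[a], 1))

def success_rate(K, T, eps, trials=200, seed=0):
    """Fraction of instances I_j (with j drawn uniformly) on which the guess is correct."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        j = rng.randrange(K)
        correct += predict_best_arm(K, T, j, eps, rng) == j
    return correct / trials

K, eps = 24, 0.1
for T in (240, 2400, 24000):          # K / eps^2 = 2400
    print(T, success_rate(K, T, eps))
# With T at or below roughly K/eps^2 the guess is wrong on most instances, in
# line with Lemma 4.2; accuracy like (6) only becomes plausible for much larger T.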

5 Bibliographic notes

The Ω(√(KT)) lower bound on regret is from Auer et al. (2002). KL-divergence and its properties are textbook material from information theory; see, e.g., Cover and Thomas (1991). The present exposition (the outline and much of the technical details) is based on Robert Kleinberg's lecture notes (Kleinberg, 2007). We present a substantially simpler proof compared to Auer et al. (2002) and Kleinberg (2007), in that we avoid the general chain rule for KL-divergence. Instead, we only use the special case of independent distributions (Property 2 in Section 2), which is much easier to state and to apply. The proof of Lemma 4.2 (for general K), which in prior work relies on the general chain rule, is modified accordingly. In particular, we define the reduced sample space Ω* with only a small number of samples from the bad arm j, and apply the KL-divergence argument to the carefully defined events in (13), rather than to the seemingly more natural event A = {y_T = j}.

References

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48-77, 2002. Preliminary version in 36th IEEE FOCS, 1995.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, New York, 1991.

Robert Kleinberg. Lecture notes: CS683: Learning, Games, and Electronic Markets (week 9), 2007. Available at http://www.cs.cornell.edu/courses/cs683/2007sp/lecnotes/week9.pdf.