CMSC 858G: Bandits, Experts and Games                                09/19/16
Lecture 3: Lower Bounds for Bandit Algorithms
Instructor: Alex Slivkins            Scribes: Soham De & Karthik A Sankararaman

1 Lower Bounds

In this lecture (and the first half of the next one), we prove an Ω(√(KT)) lower bound on the regret of bandit algorithms. This gives us a sense of the best possible upper bounds on regret that we can hope to prove.

On a high level, there are two ways of proving a lower bound on regret:

(1) Give a family F of problem instances, the same for all algorithms, such that any algorithm fails (has high regret) on some instance in F.

(2) Give a distribution over problem instances, and show that, in expectation over this distribution, any algorithm will fail.

Note that (2) implies (1): if regret is high in expectation over problem instances, then there exists at least one problem instance with high regret. Also, (1) implies (2) if |F| is a constant. This can be seen as follows: suppose we know that any algorithm has high regret (say H) on one problem instance in F and low regret on all other instances in F; then, taking a uniform distribution over F, any algorithm has expected regret at least H/|F|. (So this argument breaks down if |F| is large.) If we prove a stronger version of (1) that says that for any algorithm, regret is high on a constant fraction of the problem instances in F, then, considering a uniform distribution over F, this implies (2) regardless of whether |F| is large or not.

In this lecture, for proving lower bounds, we consider 0-1 rewards and the following family of problem instances (with fixed ε to be adjusted in the analysis):

    I_j = { µ_i = 1/2          for each arm i ≠ j,
            µ_i = (1 + ε)/2    for arm i = j }         for each j = 1, 2, ..., K.     (1)

(Recall that K is the number of arms.) In the previous lecture, we saw that sampling each arm Õ(1/ε²) times is sufficient for the upper bounds on regret that we derived. In this lecture, we prove that sampling each arm Ω(1/ε²) times is necessary to determine whether an arm is bad or not. The proof methods will require the KL-divergence, an important tool from information theory. In the next section, we briefly study the KL-divergence and some of its properties.

2 KL-divergence

Consider a finite sample space Ω, and let p, q be two probability distributions defined on Ω. Then the Kullback-Leibler divergence, or KL-divergence, is defined as:

    KL(p, q) = Σ_{x ∈ Ω} p(x) ln( p(x)/q(x) ) = E_p[ ln( p(x)/q(x) ) ].
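As a quick sanity check on this definition (not part of the original notes; the helper name kl and the example distributions are our own), the following minimal Python sketch evaluates it on the biased-vs-fair coin pair that the rest of the lecture uses.

import math

def kl(p, q):
    """KL-divergence between two distributions on the same finite sample space,
    each given as a dict mapping outcomes to probabilities."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p if p[x] > 0)

eps = 0.1
p_eps = {1: (1 + eps) / 2, 0: (1 - eps) / 2}   # slightly biased coin
p_fair = {1: 0.5, 0: 0.5}                      # fair coin

print(kl(p_eps, p_fair))   # about eps^2 / 2 for small eps (and at most 2*eps^2)
print(kl(p_fair, p_fair))  # 0, since KL(p, p) = 0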

The KL-divergence is similar to a notion of distance, with the properties that it is non-negative, equals 0 iff p = q, and is small if the distributions p and q are close. However, it is not strictly a distance function, since it is not symmetric and does not satisfy the triangle inequality. The intuition for the formula is as follows: we are interested in how certain we are that data with underlying distribution q could have been generated from distribution p. The KL-divergence effectively answers this question by measuring the average log-likelihood ratio of the data under p when the underlying distribution of the data is actually q.

Remark 2.1. The definition of KL-divergence, as well as the properties discussed below, extend to infinite sample spaces. However, KL-divergence for finite sample spaces suffices for this class, and is much easier to work with.

Properties of KL-divergence

We present several basic properties of KL-divergence that will be needed later. The proofs of these properties are fairly simple; we include them here for the sake of completeness.

1. Gibbs' inequality: KL(p, q) ≥ 0 for all p, q. Further, KL(p, q) = 0 iff p = q.

Proof. Define f(y) = y ln(y); f is a convex function on the domain y > 0. From the definition of the KL-divergence we get:

    KL(p, q) = Σ_{x ∈ Ω} p(x) ln( p(x)/q(x) )
             = Σ_{x ∈ Ω} q(x) f( p(x)/q(x) )
             ≥ f( Σ_{x ∈ Ω} q(x) · p(x)/q(x) )        [follows from Jensen's inequality]
             = f( Σ_{x ∈ Ω} p(x) ) = f(1) = 0,

where Jensen's inequality states that φ(λ_1 x_1 + λ_2 x_2) ≤ λ_1 φ(x_1) + λ_2 φ(x_2) if φ is a convex function and λ_1 + λ_2 = 1 with λ_1, λ_2 > 0. Jensen's inequality further has the property that equality holds iff x_1 = x_2 or φ is a linear function. In this case, since f is not a linear function, equality holds (i.e., KL(p, q) = 0) iff p(x) = q(x) for all x ∈ Ω.

2. Let the sample space be a product Ω = Ω_1 × Ω_2 × ... × Ω_n. Further, let p and q be two distributions on Ω of the product form p = p_1 × p_2 × ... × p_n and q = q_1 × q_2 × ... × q_n, where for each j = 1, ..., n, p_j and q_j are distributions on Ω_j. Then we have the property:

    KL(p, q) = Σ_{j=1}^{n} KL(p_j, q_j).

Proof. Let x = (x_1, x_2, ..., x_n) ∈ Ω with x_i ∈ Ω_i for i = 1, ..., n. Let h_i(x_i) = ln( p_i(x_i)/q_i(x_i) ). Then:

    KL(p, q) = Σ_{x ∈ Ω} p(x) ln( p(x)/q(x) )
             = Σ_{x ∈ Ω} p(x) Σ_{i=1}^{n} h_i(x_i)           [since ln( p(x)/q(x) ) = Σ_{i=1}^{n} h_i(x_i)]
             = Σ_{i=1}^{n} Σ_{x ∈ Ω} p(x) h_i(x_i)
             = Σ_{i=1}^{n} Σ_{x_i ∈ Ω_i} h_i(x_i) Σ_{x ∈ Ω : i-th coordinate = x_i} p(x)
             = Σ_{i=1}^{n} Σ_{x_i ∈ Ω_i} p_i(x_i) h_i(x_i)   [since Σ_{x ∈ Ω : i-th coordinate = x_i} p(x) = p_i(x_i)]
             = Σ_{i=1}^{n} KL(p_i, q_i).

3. Weaker form of Pinsker's inequality: for every event A ⊂ Ω,  2 (p(A) − q(A))² ≤ KL(p, q).

Proof. To prove this property, we first claim the following:

Claim 2.2. For each event A ⊂ Ω,

    Σ_{x ∈ A} p(x) ln( p(x)/q(x) ) ≥ p(A) ln( p(A)/q(A) ).

Proof. Let us define the following distributions on A:

    p_A(x) = p(x)/p(A)   and   q_A(x) = q(x)/q(A)   for all x ∈ A.

Then the claim can be proved as follows:

    Σ_{x ∈ A} p(x) ln( p(x)/q(x) ) = p(A) Σ_{x ∈ A} p_A(x) ln( p(A) p_A(x) / ( q(A) q_A(x) ) )
                                   = p(A) ( Σ_{x ∈ A} p_A(x) ln( p_A(x)/q_A(x) ) + ln( p(A)/q(A) ) )
                                   ≥ p(A) ln( p(A)/q(A) ).   [since Σ_{x ∈ A} p_A(x) ln( p_A(x)/q_A(x) ) = KL(p_A, q_A) ≥ 0]

Fix A ⊂ Ω. Using Claim 2.2 twice, we have the following:

    Σ_{x ∈ A} p(x) ln( p(x)/q(x) ) ≥ p(A) ln( p(A)/q(A) ),
    Σ_{x ∉ A} p(x) ln( p(x)/q(x) ) ≥ p(Ā) ln( p(Ā)/q(Ā) ),

where Ā denotes the complement of A.

where Ā denotes the complement of A. Now, let a p(a) and b q(a). Further, assume a < b. Then, we have: KL(p, q) a ln a 1 a + (1 a) ln b 1 b b ( a x + 1 a ) dx 1 x a b a b This proves the property. a x a x(1 x) dx 4(x a)dx (b a). [since x(1 x) 1/4] 4. Let p ɛ denote a distribution on {0, 1} such that p ɛ (1) (1 + ɛ)/. Thus, p ɛ (0) (1 ɛ)/. Further, let p 0 denote the distribution on {0, 1} where p 0 (0) p 0 (1) 1/. Then we have the property: KL(p ɛ, p 0 ) ɛ. Proof. KL(p ɛ, p 0 ) 1 + ɛ ln(1 + ɛ) + 1 ɛ ln(1 ɛ) 1 (ln(1 + ɛ) + ln(1 ɛ)) + ɛ (ln(1 + ɛ) ln(1 ɛ)) 1 ln(1 ɛ ) + ɛ ln 1 + ɛ 1 ɛ. Now, ln(1 ɛ ) < 0 and we can write ln 1+ɛ 1 ɛ ln ( 1 + ɛ 1 ɛ KL(p ɛ, p 0 ) < ɛ How are these properties going to be used? ) ɛ 1 ɛ ɛ 1 ɛ ɛ. We start with the same setting as in Property. From Property 3, we have: ɛ 1 ɛ. Thus, we get: (p(a) q(a)) KL(p, q) n KL(p, q ). (follows from Property ) 1 For example, we can define p and q to be distributions of a biased coin with small ɛ (p (1) (1 + ɛ)/, p (1) (1 ɛ)/) vs an unbiased coin (q (0) q (1) 1/). Then, we can use Property 4 to bound the above as: n n (p(a) q(a)) KL(p, q ) δ nδ, 1 where δ ɛ Thus, we arrive at the following bound: p(a) q(a) nδ/. 1 4

3 A simple example: flipping one coin

We start with a simple example, which illustrates our proof technique and is interesting as a standalone result. We have a single coin, whose outcome is 0 or 1. The coin's mean is unknown. We assume that the true mean µ ∈ [0, 1] is either µ_1 or µ_2, for two known values µ_1 > µ_2. The coin is flipped T times. The goal is to identify whether µ = µ_1 or µ = µ_2.

Define Ω := {0, 1}^T to be the sample space of the outcomes of the T coin tosses. We need a decision rule

    Rule : Ω → {High, Low}

with the following two properties:

    Pr[ Rule(observations) = High | µ = µ_1 ] ≥ 0.99                 (2)
    Pr[ Rule(observations) = Low  | µ = µ_2 ] ≥ 0.99                 (3)

The question is how large T should be for such a Rule to exist. We know that if δ = µ_1 − µ_2, then T ≥ Ω(1/δ²) is sufficient. We will prove that it is also necessary. We will focus on the special case when both µ_1 and µ_2 are close to 1/2.

Claim 3.1. Let µ_1 = (1 + ε)/2 and µ_2 = 1/2. For any rule to work (i.e., satisfy equations (2) and (3)), we need

    T ≥ Ω(1/ε²).                                                     (4)

Proof. Define, for any event A ⊂ Ω, the following quantities:

    P_1(A) = Pr[A | µ = µ_1],
    P_2(A) = Pr[A | µ = µ_2].

To prove the claim, we will consider the following inequality. For the event A ⊂ Ω on which the rule outputs High (i.e., A = {ω ∈ Ω : Rule(ω) = High}), properties (2) and (3) imply

    P_1(A) − P_2(A) ≥ 0.98.                                          (5)

We prove the claim by showing that if (4) is false, then (5) is false, too. Specifically, we will assume that T < 1/(4ε²). (In fact, the argument below holds for an arbitrary event A ⊂ Ω.)

Define, for each i ∈ {1, 2}, P_{i,t} to be the distribution of the t-th coin toss under P_i. Then P_i = P_{i,1} × P_{i,2} × ... × P_{i,T}. From the KL-divergence properties, we have:

    2 (P_1(A) − P_2(A))² ≤ KL(P_1, P_2)                              [Property 3]
                         = Σ_{t=1}^{T} KL(P_{1,t}, P_{2,t})          [Property 2]
                         ≤ 2 T ε²                                    [Property 4]

Hence, we have

    P_1(A) − P_2(A) ≤ ε √T < 1/2,

where the last inequality holds because T < 1/(4ε²). This contradicts (5), which completes the proof.
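The following simulation (our own illustration, not part of the notes) makes Claim 3.1 tangible for one particular, natural decision rule: threshold the number of heads at the midpoint of the two means. It is not claimed to be the optimal rule; it merely shows the 1/ε² scale of the problem.

import random

def rule_success(eps, T, trials=1000, seed=0):
    """Empirical success probabilities of the midpoint-threshold rule: output High
    iff the number of heads exceeds T*(2 + eps)/4, the midpoint of the two means.
    Returns (Pr[High | mu = mu_1], Pr[Low | mu = mu_2]), estimated by simulation."""
    rng = random.Random(seed)
    threshold = T * (2 + eps) / 4
    high_ok = low_ok = 0
    for _ in range(trials):
        heads_1 = sum(rng.random() < (1 + eps) / 2 for _ in range(T))  # mu = mu_1
        heads_2 = sum(rng.random() < 0.5 for _ in range(T))            # mu = mu_2
        high_ok += heads_1 > threshold
        low_ok += heads_2 <= threshold
    return high_ok / trials, low_ok / trials

eps = 0.1
for T in (10, 100, 1000, 4000):    # note 1/eps^2 = 100
    print(T, rule_success(eps, T))
# Both success probabilities reach the 0.99 level of (2)-(3) only once T is a
# large multiple of 1/eps^2; for T near or below 1/eps^2 they stay well below it.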

4 Flipping several coins: bandits with prediction

Let us extend the previous example to flipping multiple coins. More formally, we consider a bandit problem with K arms, where each arm corresponds to a coin. Each arm gives a 0-1 reward, drawn independently from a fixed but unknown distribution. After T rounds, the algorithm outputs a guess y_T ∈ A for which arm is the best arm, where A is the set of all arms. (Recall that the best arm is the arm with the highest mean reward.) We call this version "bandits with predictions". In this section, we will only be concerned with the quality of the prediction, rather than with accumulated rewards and regret.

For each arm a ∈ A, the mean reward is denoted µ(a). (We will also write it as µ_a whenever convenient.) A particular problem instance is specified as a tuple I = (µ(a) : a ∈ A). A good algorithm for the bandits-with-prediction problem described above should satisfy

    Pr[ y_T is correct | I ] ≥ 0.99                                  (6)

for each problem instance I. We will use the family (1) of problem instances to argue that one needs T ≥ Ω(K/ε²) for any algorithm to work, i.e., satisfy (6), on all instances in this family.

Lemma 4.1. Suppose an algorithm for bandits with predictions satisfies (6) for all problem instances I_1, ..., I_K. Then T ≥ Ω(K/ε²).

This result is of independent interest (regardless of the lower bound on regret). In fact, we will prove a stronger lemma, which will (also) be the crux in the proof of the regret bound.

Lemma 4.2. Suppose T ≤ cK/ε², for a small enough absolute constant c. Fix any deterministic algorithm for bandits with prediction. Then there exist at least K/3 arms j such that

    Pr[ y_T = j | I_j ] < 3/4.

Remark 4.3. The proof for K = 2 arms is particularly simple, so we will do it first. We will then extend this proof to arbitrary K with more subtleties. While the lemma holds for an arbitrary K, we will present a simplified proof which requires K ≥ 24.

We will use the standard shorthand [T] := {1, 2, ..., T}.

Let us set up the sample space to be used in the proof. Let (r_t(a) : a ∈ A, t ∈ [T]) be mutually independent 0-1 random variables such that r_t(a) has expectation µ(a). We refer to this tuple as the rewards table, where we interpret r_t(a) as the reward received by the algorithm the t-th time it chooses arm a. The sample space is Ω = {0, 1}^{K × T}, where each outcome ω ∈ Ω corresponds to a particular realization of the rewards table. Each problem instance I_j defines a distribution P_j on Ω:

    P_j(A) = Pr[A | I_j]   for each event A ⊂ Ω.

Also, let P_j^{a,t} be the distribution of r_t(a) under instance I_j, so that

    P_j = ∏_{a ∈ A, t ∈ [T]} P_j^{a,t}.

Proof (K = 2 arms). Define A = {ω ∈ Ω : y_T = 1}; in other words, A is the event that the algorithm predicts arm 1. (But the argument below holds for any event A ⊂ Ω.) Similarly to the previous section, we use the properties of KL-divergence as follows:

    2 (P_1(A) − P_2(A))² ≤ KL(P_1, P_2)

                         = Σ_{a=1}^{2} Σ_{t=1}^{T} KL(P_1^{a,t}, P_2^{a,t})
                         ≤ 4 T ε²                                                      (7)

The last inequality holds because KL(P_1^{a,t}, P_2^{a,t}) is non-zero only when P_1^{a,t} ≠ P_2^{a,t}, and when they differ, their KL-divergence is at most 2ε² by Property 4. Hence,

    P_1(A) − P_2(A) ≤ ε √(2T) < 1/2,

where the last inequality holds whenever T < 1/(8ε²). To complete the proof, observe that if Pr[y_T = j | I_j] ≥ 3/4 for both problem instances j ∈ {1, 2}, then P_1(A) ≥ 3/4 and P_2(A) < 1/4, so their difference is at least 1/2, a contradiction.

Proof (K ≥ 24). Compared to the 2-arms case, the time horizon T can now be larger by a factor of O(K). The crucial improvement is a more delicate version of the KL-divergence argument in (7), which results in a right-hand side of the form O(Tε²/K).

For the sake of the analysis, we will consider an additional problem instance

    I_0 = { µ_a = 1/2 for all arms a },

which we call the base instance. Let E_0[·] denote expectation under this problem instance, and let P_0 denote the corresponding distribution on Ω. Also, let T_a be the total number of times arm a is played.

We consider the algorithm's performance on problem instance I_0, and focus on arms j that are neglected by the algorithm, in the sense that the algorithm does not choose arm j very often and is not likely to pick j as the guess y_T. Formally, we observe that:

    there exist at least 2K/3 arms j such that E_0[T_j] ≤ 3T/K,         (8)
    there exist at least 2K/3 arms j such that P_0(y_T = j) ≤ 3/K.      (9)

(To prove (8), assume for contradiction that more than K/3 arms j have E_0[T_j] > 3T/K. Then the expected total number of times these arms are played is strictly greater than T, which is a contradiction. (9) is proved similarly.)

By Markov's inequality, E_0[T_j] ≤ 3T/K implies that Pr_0[T_j ≤ 24T/K] ≥ 7/8. Since the sets of arms in (8) and (9) must overlap on at least K/3 arms, we conclude:

    there exist at least K/3 arms j such that Pr_0[T_j ≤ m] ≥ 7/8 and P_0(y_T = j) ≤ 3/K,   (10)

where m = 24T/K.

We will now refine our definition of the sample space to obtain the required claim. For each arm a, define the t-round sample space Ω_a^t = {0, 1}^t, where each outcome corresponds to a particular realization of the tuple (r_s(a) : s ∈ [t]). (Recall that we interpret r_t(a) as the reward received by the algorithm the t-th time it chooses arm a.) Then the full sample space we considered before can be expressed as

    Ω = ∏_{a ∈ A} Ω_a^T.

Fix an arm j satisfying the two properties in (10). We will consider a reduced sample space in which arm j is played only m = 24T/K times:

    Ω* = Ω_j^m × ∏_{arms a ≠ j} Ω_a^T.                                (11)

Each problem instance I_l defines a distribution P*_l on Ω*:

    P*_l(A) = Pr[A | I_l]   for each event A ⊂ Ω*.

In other words, the distribution P*_l is the restriction of P_l to the reduced sample space Ω*.

We apply the KL-divergence argument to the distributions P*_0 and P*_j. For each event A ⊂ Ω*:

    2 (P*_0(A) − P*_j(A))² ≤ KL(P*_0, P*_j)
                           = Σ_{arms a ≠ j} Σ_{t=1}^{T} KL(P_0^{a,t}, P_j^{a,t}) + Σ_{t=1}^{m} KL(P_0^{j,t}, P_j^{j,t})
                           ≤ 0 + 2 m ε².

Note that each arm a ≠ j has identical reward distributions under instances I_0 and I_j (namely, its mean reward is 1/2). So the distributions P_0^{a,t} and P_j^{a,t} are the same, and therefore their KL-divergence is 0. Whereas for arm j we only need to sum over m samples. Therefore, assuming T ≤ cK/ε² with a small enough constant c, we can conclude that

    |P*_0(A) − P*_j(A)| ≤ ε √m < 1/8   for all events A ⊂ Ω*.         (12)

To apply (12), we need to make sure that the event A is in fact contained in Ω*, i.e., that whether A holds is completely determined by the first m samples of arm j (and arbitrarily many samples of the other arms). In particular, we cannot take A = {y_T = j}, which would be the most natural extension of the proof technique from the 2-arms case. Instead, we apply (12) twice: to the events

    A = {y_T = j and T_j ≤ m}   and   A' = {T_j > m}.                 (13)

Indeed, note that whether the algorithm samples arm j more than m times is completely determined by the first m coin tosses of arm j (together with the tosses of the other arms)!

We are ready for the final computation:

    P_j(A)  ≤ 1/8 + P_0(A)                      [by (12)]
            ≤ 1/8 + P_0(y_T = j) ≤ 1/4          [by our choice of arm j, since P_0(y_T = j) ≤ 3/K ≤ 1/8]
    P_j(A') ≤ 1/8 + P_0(A')                     [by (12)]
            ≤ 1/4                               [by our choice of arm j, since P_0(T_j > m) ≤ 1/8]
    P_j(y_T = j) = P_j(y_T = j and T_j ≤ m) + P_j(y_T = j and T_j > m)
                 ≤ P_j(A) + P_j(A') ≤ 1/4 + 1/4 = 1/2 < 3/4.

Recall that this holds for any arm j satisfying the properties in (10). Since there are at least K/3 such arms, the lemma follows.

Next lecture: Lemma 4.2 is used to derive the Ω(√(KT)) lower bound on regret.
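To make Lemma 4.2 concrete, here is a small simulation (our own illustration, not from the notes). It runs one particular algorithm, round-robin exploration followed by an empirical-best guess, on instances from the family (1) and reports how often the guess y_T is correct for several time horizons. The lemma is a statement about all algorithms; the simulation merely shows the K/ε² threshold for this one concrete choice.

import random

def predict_best_arm(K, T, best_arm, eps, rng):
    """Round-robin exploration for T rounds on instance I_{best_arm} (the best arm
    has mean (1 + eps)/2, every other arm has mean 1/2), then guess the arm with
    the highest empirical mean."""
    heads = [0] * K
    pulls = [0] * K
    for t in range(T):
        a = t % K
        mean = (1 + eps) / 2 if a == best_arm else 0.5
        heads[a] += rng.random() < mean
        pulls[a] += 1
    return max(range(K), key=lambda a: heads[a] / max(pulls[a], 1))

def success_rate(K, T, eps, trials=200, seed=0):
    """Fraction of instances I_j (with j drawn uniformly) on which the guess is correct."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        j = rng.randrange(K)
        correct += predict_best_arm(K, T, j, eps, rng) == j
    return correct / trials

K, eps = 24, 0.1
for T in (240, 2400, 24000):          # K / eps^2 = 2400
    print(T, success_rate(K, T, eps))
# With T at or below roughly K/eps^2 the guess is wrong on most instances, in
# line with Lemma 4.2; accuracy like (6) only becomes plausible for much larger T.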

5 Bibliographic notes

The Ω(√(KT)) lower bound on regret is from Auer et al. (2002). KL-divergence and its properties are textbook material from information theory; see, e.g., Cover and Thomas (1991). The present exposition (the outline and much of the technical details) is based on Robert Kleinberg's lecture notes (Kleinberg, 2007). We present a substantially simpler proof compared to Auer et al. (2002) and Kleinberg (2007), in that we avoid the general chain rule for KL-divergence. Instead, we only use the special case of independent distributions (Property 2 in Section 2), which is much easier to state and to apply. The proof of Lemma 4.2 (for general K), which in prior work relies on the general chain rule, is modified accordingly. In particular, we define the reduced sample space Ω* with only a small number of samples from the bad arm j, and apply the KL-divergence argument to the carefully defined events in (13), rather than to the seemingly more natural event A = {y_T = j}.

References

Peter Auer, Nicolò Cesa-Bianchi, Yoav Freund, and Robert E. Schapire. The nonstochastic multiarmed bandit problem. SIAM J. Comput., 32(1):48-77, 2002. Preliminary version in 36th IEEE FOCS, 1995.

Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley & Sons, New York, 1991.

Robert Kleinberg. Lecture notes: CS683: Learning, Games, and Electronic Markets (week 9), 2007. Available at http://www.cs.cornell.edu/courses/cs683/2007sp/lecnotes/week9.pdf.