An Asymptotically Optimal Algorithm for the Max k-armed Bandit Problem


Matthew J. Streeter and Stephen F. Smith
February 27, 2006
CMU-CS
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA

Abstract

We present an asymptotically optimal algorithm for the max variant of the k-armed bandit problem. Given a set of k slot machines, each yielding payoff from a fixed (but unknown) distribution, we wish to allocate trials to the machines so as to maximize the expected maximum payoff received over a series of n trials. Subject to certain distributional assumptions, we show that O(ln(1/δ) · ln(n)²/ɛ²) trials are sufficient to identify, with probability at least 1 − δ, a machine whose expected maximum payoff is within ɛ of optimal. This result leads to a strategy for solving the problem that is asymptotically optimal in the following sense: the gap between the expected maximum payoff obtained by using our strategy for n trials and that obtained by pulling the single best arm for all n trials approaches zero as n → ∞.

Keywords: multi-armed bandit problem, PAC bounds

1 Introduction

In the k-armed bandit problem one is faced with a set of k slot machines, each having an arm that, when pulled, yields a payoff drawn from a fixed (but unknown) distribution. The goal is to allocate trials to the arms so as to maximize the expected cumulative payoff obtained over a series of n trials. Solving the problem entails striking a balance between exploration (determining which arm yields the highest mean payoff) and exploitation (repeatedly pulling this arm).

In the max k-armed bandit problem, the goal is to maximize the expected maximum (rather than cumulative) payoff. This version of the problem arises in practice when tackling combinatorial optimization problems for which a number of randomized search heuristics exist: given k heuristics, each yielding a stochastic outcome when applied to some particular problem instance, we wish to allocate trials to the heuristics so as to maximize the maximum payoff (e.g., the maximum number of clauses satisfied by any sampled variable assignment, or the minimum makespan of any sampled schedule). Cicirello and Smith [3] show that a max k-armed bandit approach yields good performance on the resource-constrained project scheduling problem with maximum time lags (RCPSP/max).

1.1 Summary of Results

We consider a restricted version of the max k-armed bandit problem in which each arm yields payoff drawn from a generalized extreme value (GEV) distribution (defined in §2). This paper presents the first provably asymptotically optimal algorithm for this problem. Roughly speaking, the justification for assuming a GEV distribution is the Extremal Types Theorem (stated in §2), which states that the distribution of the sample maximum of n independent, identically distributed random variables approaches a GEV distribution as n → ∞. A more formal justification is given in §3. For reasons that will become clear, the nature of our results depends on the shape parameter (ξ) of the GEV distribution.
Assuming all arms have ξ ≤ 0, our results can be summarized as follows.

Let a be an arm that yields payoff drawn from a GEV distribution with unknown parameters; let M_n denote the maximum payoff obtained after pulling a n times, and let m_n = E[M_n]. We provide an algorithm that, after pulling the arm O(ln(1/δ) · ln(n)²/ɛ²) times, produces an estimate m̃_n of m_n with the property that P[|m̃_n − m_n| < ɛ] ≥ 1 − δ.

Let a_1, a_2, ..., a_k be k arms, each yielding payoff from (distinct) GEV distributions with unknown parameters. Let m_n^i denote the expected maximum payoff obtained by pulling the i-th arm n times, and let m_n = max_{1≤i≤k} m_n^i. We provide an algorithm that, when run for n pulls, obtains expected maximum payoff m_n − o(1).

Our results for the case ξ > 0 are similar, except that our estimates and expected maximum payoffs come within arbitrarily small factors (rather than absolute distances) of optimality. Specifically, our estimates have the property that P[1/(1 + ɛ) < (m̃_n − α_1)/(m_n − α_1) < 1 + ɛ] ≥ 1 − δ for a constant α_1 independent of n, while the expected maximum payoff obtained by using our algorithm for n pulls is m_n(1 − o(1)).
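As a concrete (if simplistic) illustration of the setting, the sketch below simulates a max k-armed bandit instance with GEV arms and runs a naive explore-then-exploit heuristic. This is not the algorithm analyzed in this paper (which estimates m_n by interpolation rather than by raw empirical maxima), and all function names are ours:

```python
import math
import random

def gev_sample(rng, mu, sigma, xi):
    """Inverse-transform draw from GEV(mu, sigma, xi); xi == 0 is the Gumbel case."""
    u = rng.random()  # u in [0, 1); u == 0 has negligible probability
    if xi == 0.0:
        return mu - sigma * math.log(-math.log(u))
    return mu + sigma * ((-math.log(u)) ** (-xi) - 1.0) / xi

def explore_then_exploit(rng, arms, n, t):
    """Pull each of the k arms t times, then commit the remaining
    n - t*k pulls to the arm with the largest empirical maximum.
    `arms` is a list of (mu, sigma, xi) triples; returns the maximum
    payoff observed over all n pulls."""
    assert t * len(arms) <= n
    scores = []          # empirical maximum of each arm's t pulls
    best_payoff = -math.inf
    for (mu, sigma, xi) in arms:
        m = max(gev_sample(rng, mu, sigma, xi) for _ in range(t))
        scores.append(m)
        best_payoff = max(best_payoff, m)
    # Exploitation: spend the rest of the budget on the apparent best arm.
    mu, sigma, xi = arms[scores.index(max(scores))]
    for _ in range(n - t * len(arms)):
        best_payoff = max(best_payoff, gev_sample(rng, mu, sigma, xi))
    return best_payoff
```

Because most of the budget is spent on the apparently best arm, the returned value tracks that arm's expected maximum over the remaining pulls, which is the quantity the guarantees above are about.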

1.2 Related Work

The classical k-armed bandit problem was first studied by Robbins [7] and has since been the subject of numerous papers; see Berry and Fristedt [1] and Kaelbling [6] for overviews. In a paper similar in spirit to ours, Fong [5] showed that an initial exploration phase consisting of O((k/ɛ²) ln(k/δ)) pulls is sufficient to identify, with probability at least 1 − δ, an arm whose mean payoff is within ɛ of optimal. Theorem 2 of this paper proves a bound similar to Fong's on the number of pulls needed to identify an arm whose expected maximum payoff (over a series of n trials) is near-optimal.

The max variant of the k-armed bandit problem was first studied by Cicirello and Smith [2, 3], who successfully used a heuristic for the max k-armed bandit problem to select among priority rules for the RCPSP/max. The design of Cicirello and Smith's heuristic is motivated by an analysis of the special case in which each arm's payoff distribution is a GEV distribution with shape parameter ξ = 0, but they do not rigorously analyze the heuristic's behavior. Our paper is more theoretical and less empirical: on the one hand we do not perform experiments on any practical combinatorial problem, but on the other hand we provide stronger performance guarantees under weaker distributional assumptions.

1.3 Notation

For an arbitrary cumulative distribution function G, let the random variable M_n^G be defined by M_n^G = max{Z_1, Z_2, ..., Z_n}, where Z_1, Z_2, ..., Z_n are independent random variables, each having distribution G. Let m_n^G = E[M_n^G].

2 Extreme Value Theory

This section provides a self-contained overview of results in extreme value theory that are relevant to this work. Our presentation is based on the text by Coles [4]. The central result of extreme value theory is an analogue of the central limit theorem that applies to extremely rare events.
Recall that the central limit theorem states that (under certain regularity conditions) the distribution of the sum of n independent, identically distributed (i.i.d.) random variables converges to a normal distribution as n → ∞. The Extremal Types Theorem states that (under certain regularity conditions) the distribution of the maximum of n i.i.d. random variables converges to a generalized extreme value (GEV) distribution.

Definition (GEV distribution). A random variable Z has a generalized extreme value distribution if, for constants µ, σ > 0, and ξ, P[Z ≤ z] = GEV_(µ,σ,ξ)(z), where

GEV_(µ,σ,ξ)(z) = exp(−(1 + ξ(z − µ)/σ)^(−1/ξ))
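Since the GEV family is defined through a closed-form CDF, it is straightforward to evaluate and to sample by inverse transform. The following is a minimal sketch (function names are ours, not from the paper):

```python
import math

def gev_cdf(z, mu, sigma, xi):
    """P[Z <= z] for Z ~ GEV(mu, sigma, xi); xi == 0 is the Gumbel limit."""
    if xi == 0.0:
        return math.exp(-math.exp(-(z - mu) / sigma))
    s = 1.0 + xi * (z - mu) / sigma
    if s <= 0.0:
        # Outside the support: below the lower endpoint when xi > 0,
        # above the upper endpoint when xi < 0.
        return 0.0 if xi > 0.0 else 1.0
    return math.exp(-s ** (-1.0 / xi))

def gev_quantile(u, mu, sigma, xi):
    """Inverse CDF: the z with gev_cdf(z) == u, for 0 < u < 1.
    Feeding in a uniform random u yields a GEV sample."""
    if xi == 0.0:
        return mu - sigma * math.log(-math.log(u))
    return mu + sigma * ((-math.log(u)) ** (-xi) - 1.0) / xi
```

The quantile function is obtained by solving exp(−(1 + ξ(z − µ)/σ)^(−1/ξ)) = u for z, so cdf and quantile are exact inverses of one another on (0, 1).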

for z ∈ {z : 1 + ξ(z − µ)/σ > 0}, and GEV_(µ,σ,ξ)(z) = 1 otherwise. The case ξ = 0 is interpreted as the limit

lim_{ξ′→0} GEV_(µ,σ,ξ′)(z) = exp(−exp((µ − z)/σ)).

The following three propositions establish properties of the GEV distribution.

Proposition 1. Let Z be a random variable with P[Z ≤ z] = GEV_(µ,σ,ξ)(z). Then

E[Z] = µ + (σ/ξ)(Γ(1 − ξ) − 1)   if ξ < 1, ξ ≠ 0;
E[Z] = µ + σγ                     if ξ = 0;
E[Z] = ∞                          if ξ ≥ 1,

where Γ(z) = ∫_0^∞ t^{z−1} exp(−t) dt is the complete gamma function and γ = lim_{n→∞} (Σ_{k=1}^n 1/k − ln(n)) is Euler's constant.

Proposition 2. Let G = GEV_(µ,σ,ξ). Then M_n^G has distribution GEV_(µ′,σ′,ξ′), where

µ′ = µ + (σ/ξ)(n^ξ − 1) if ξ ≠ 0; µ′ = µ + σ ln(n) otherwise,

σ′ = σn^ξ, and ξ′ = ξ.

Substituting the parameters of M_n^G given by Proposition 2 into Proposition 1 gives an expression for m_n^G.

Proposition 3. Let G = GEV_(µ,σ,ξ) where ξ < 1. Then

m_n^G = µ + (σ/ξ)(n^ξ Γ(1 − ξ) − 1) if ξ ≠ 0; m_n^G = µ + σγ + σ ln(n) otherwise.

It follows that for ξ > 0, m_n^G is Θ(n^ξ); for ξ = 0, m_n^G is Θ(ln(n)); and for ξ < 0, m_n^G = µ − σ/ξ − Θ(n^ξ).
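Propositions 2 and 3 are easy to check numerically: raising the GEV CDF to the n-th power reproduces the re-parameterized GEV of Proposition 2, and Proposition 3 gives m_n^G in closed form. A small sketch (function names are ours):

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant

def gev_cdf(z, mu, sigma, xi):
    """CDF of GEV(mu, sigma, xi); xi == 0 is the Gumbel limit."""
    if xi == 0.0:
        return math.exp(-math.exp(-(z - mu) / sigma))
    s = 1.0 + xi * (z - mu) / sigma
    if s <= 0.0:
        return 0.0 if xi > 0.0 else 1.0
    return math.exp(-s ** (-1.0 / xi))

def max_params(mu, sigma, xi, n):
    """Parameters of the distribution of the max of n draws (Proposition 2)."""
    if xi == 0.0:
        return mu + sigma * math.log(n), sigma, xi
    return mu + sigma * (n ** xi - 1.0) / xi, sigma * n ** xi, xi

def expected_max(mu, sigma, xi, n):
    """m_n^G for G = GEV(mu, sigma, xi), valid for xi < 1 (Proposition 3)."""
    if xi == 0.0:
        return mu + sigma * EULER_GAMMA + sigma * math.log(n)
    return mu + (sigma / xi) * (n ** xi * math.gamma(1.0 - xi) - 1.0)
```

The identity gev_cdf(z)**n == gev_cdf(z) under max_params is the "max-stability" of the GEV family, and it is what makes the family closed under taking sample maxima.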

Figure 1: The effect of the shape parameter (ξ) on the expected maximum of n independent draws from a GEV distribution.

It is useful to have a visual picture of what Proposition 3 means. Figure 1 plots m_n^G as a function of n for three GEV distributions with µ = 0, σ = 1, and ξ ∈ {−0.1, 0, 0.1}.

The central result of extreme value theory is the following theorem.

The Extremal Types Theorem. Let G be an arbitrary cumulative distribution function, and suppose there exist sequences of constants {a_n > 0} and {b_n} such that

lim_{n→∞} P[(M_n^G − b_n)/a_n ≤ z] = G*(z)   (2.1)

for any continuity point z of G*, where G* is not a point mass. Then there exist constants µ, σ > 0, and ξ such that G*(z) = GEV_(µ,σ,ξ)(z) for all z. Furthermore,

lim_{n→∞} P[M_n^G ≤ z] = GEV_(µa_n+b_n, σa_n, ξ)(z).

Condition (2.1) holds for a variety of distributions, including the normal, lognormal, uniform, and Cauchy distributions.

3 The Max k-armed Bandit Problem

Definition (max k-armed bandit instance). An instance I = (n, G) of the max k-armed bandit problem is an ordered pair whose first element is a positive integer n, and whose second element is a set G = {G_1, G_2, ..., G_k} of k cumulative distribution functions, each thought of as an arm on a slot machine. The i-th arm, when pulled, returns a realization of a random variable with distribution G_i.

Definition (max k-armed bandit strategy). A max k-armed bandit strategy S is an algorithm that, given an instance I = (n, G) of the max k-armed bandit problem, performs a sequence of n

arm-pulls. For any strategy S and integer ℓ ≤ n, we denote by S^ℓ(I) the expected maximum payoff obtained by running S on I for ℓ trials:

S^ℓ(I) = E[max_{0≤j≤ℓ} p_j]

where p_j is the payoff obtained from the j-th pull, and we define p_0 = 0. Our goal is to come up with a strategy S such that S^n(I) is near-maximal.

Note that the problem is ill-posed (i.e., there is no clear criterion for preferring one strategy over another) unless we make some assumptions about the distributions G_i. We will assume that each arm G_i = GEV_(µ_i,σ_i,ξ_i) is a GEV distribution whose parameters satisfy

1. µ_i ≤ µ_u,
2. 0 < σ_l ≤ σ_i ≤ σ_u, and
3. ξ_l ≤ ξ_i ≤ ξ_u < 1/2

for known constants µ_u, σ_l, σ_u, ξ_l, and ξ_u.

There are two arguments for assuming that each arm is a GEV distribution. First, in practice the distribution of payoffs returned by a strong heuristic may be approximately GEV, even if the conditions of the Extremal Types Theorem are not formally satisfied [2]. A second argument runs as follows. Suppose I = (n, G) is an instance of the max k-armed bandit problem in which each distribution G_i ∈ G satisfies condition (2.1) of the Extremal Types Theorem. Consider the instance Ī = (⌈n/m⌉, Ḡ), where Ḡ = {Ḡ_1, Ḡ_2, ..., Ḡ_k}, and arm Ḡ_i returns the maximum payoff obtained by pulling the corresponding arm G_i m times. Effectively, Ī is a restricted version of I in which the arms must be pulled in batches of size m, rather than in any arbitrary order. For m sufficiently large, the Extremal Types Theorem guarantees that for each i, Ḡ_i ≈ GEV_(µ_i,σ_i,ξ_i) for some constants µ_i, σ_i, and ξ_i. Thus, the instance I′ = (⌈n/m⌉, G′) with G′ = {G′_1, G′_2, ..., G′_k} and G′_i = GEV_(µ_i,σ_i,ξ_i) is approximately equivalent to Ī and satisfies our distributional assumptions.

The purpose of the restrictions on the parameters µ_i, σ_i, and ξ_i is to ensure that each GEV distribution has finite, bounded mean and variance.

4 An Asymptotically Optimal Algorithm

We will analyze the following max k-armed bandit strategy.

Strategy S_1(ɛ, δ):

1. (Exploration) For each arm G_i ∈ G:

   (a) Using t = O(ln(1/δ) · ln(n)²/ɛ²) samples of G_i, obtain an estimate m̃_n^{G_i} of m_n^{G_i}. Assuming that arm G_i has shape parameter ξ_i ≤ 0, our estimate will have the property that P[|m̃_n^{G_i} − m_n^{G_i}| < ɛ] ≥ 1 − δ.

2. (Exploitation) Set î = arg max_{1≤i≤k} m̃_n^{G_i}, and pull arm G_î for the remaining n − tk trials.

If an arm G_i has shape parameter ξ_i > 0, the estimate obtained in step 1(a) will instead have the property that P[1/(1 + ɛ) < (m̃_n − α_1)/(m_n − α_1) < 1 + ɛ] ≥ 1 − δ for a constant α_1 independent of n.

The following theorem shows that with appropriate settings of ɛ and δ, strategy S_1 is asymptotically optimal.

Theorem 1. Let I = (n, G) be an instance of the max k-armed bandit problem, where G = {G_1, G_2, ..., G_k} and G_i = GEV_(µ_i,σ_i,ξ_i). Let m_n = max_{1≤i≤k} m_n^{G_i}, ξ_max = max_{1≤i≤k} ξ_i, and S = S_1(∆, 1/(kn²)), where ∆ = (k · ln(nk) · ln(n)²/n)^{1/3}. Then if ξ_max ≤ 0,

lim_{n→∞} (m_n − S^n(I)) = 0,

while if ξ_max > 0,

lim_{n→∞} S^n(I)/m_n = 1.

Proof. Case ξ_max ≤ 0. Let m̂_n = m_n^{G_î} (where î is the arm selected for exploitation in step 2). Then m̂_{n−tk} is the expected maximum payoff obtained during the exploitation step, so S^n(I) ≥ m̂_{n−tk}.

Claim 1. m̂_n − m̂_{n−tk} is O(tk/n).

Proof of Claim 1. Let µ = µ_î, σ = σ_î, and ξ = ξ_î be the parameters of the arm selected for exploitation. Suppose ξ = 0. Then by Proposition 3,

m̂_n − m̂_{n−tk} = σ(ln(n) − ln(n − tk)).

Expanding ln(x + β) in powers of β about β = 0, for 0 ≤ β < x/2 and x > 0, yields

ln(x + β) − ln(x) = Σ_{i=1}^∞ (−1)^{i+1} (1/i)(β/x)^i ≤ (β/x)(1 − β/x)^{−1} < 2β/x,

so for n sufficiently large,

m̂_n − m̂_{n−tk} < 2σ · tk/(n − tk) = O(tk/n).

Now suppose ξ < 0. By Proposition 3,

m̂_n − m̂_{n−tk} = σξ^{−1}Γ(1 − ξ)(n^ξ − (n − tk)^ξ) = O((n − tk)^ξ − n^ξ),

where we have used the fact that σξ^{−1}Γ(1 − ξ) < 0. Expanding (n − tk)^ξ in powers of tk about tk = 0 gives

(n − tk)^ξ = n^ξ − ξn^{ξ−1}(tk) + O(n^{ξ−2}(tk)²),

and so

(n − tk)^ξ − n^ξ ≤ −ξn^{ξ−1}(tk) ≤ −ξ_l · tk/n = O(tk/n).

With probability at least 1 − kδ, all estimates obtained during the exploration phase are within ɛ of the correct values, so that m_n − m̂_n < 2ɛ. Assuming m_n − m̂_n < 2ɛ, it follows that

m_n − m̂_{n−tk} = (m_n − m̂_n) + (m̂_n − m̂_{n−tk}) < 2ɛ + O(tk/n) = 2ɛ + (k/n) · O(ln(1/δ) · ln(n)²/ɛ²) = O(∆),

where ∆ = (k · ln(nk) · ln(n)²/n)^{1/3}, and in the second step we have used Claim 1. Thus

S^n(I) ≥ (1 − kδ)(m_n − O(∆)) = m_n − O(∆) = m_n − o(1).

Case ξ_max > 0. See Appendix A.

Theorem 1 completes our analysis of the performance of S_1. It remains only to describe how the estimates in step 1(a) are obtained.

4.1 Estimating m_n

We adopt the following notation: let G = GEV_(µ,σ,ξ) denote a GEV distribution with (unknown) parameters µ, σ, and ξ satisfying the conditions stated in §3, and

let m_i = m_i^G. To estimate m_n, we first obtain an accurate estimate of ξ. Then

1. if ξ = 0 (so that the growth of m_n as a function of ln(n) is linear), we estimate m_n by first estimating m_1 and m_2, then performing linear interpolation;

2. otherwise, we estimate m_n by first estimating m_1, m_2, and m_4, then performing a nonlinear interpolation.

4.1.1 Estimating m_i for i ∈ {1, 2, 4}

The following two lemmas use well-known ideas to efficiently estimate m_i for small values of i.

Lemma 1. Let i be a positive integer and let ɛ > 0 be a real number. Then O(i/ɛ²) draws from G suffice to obtain an estimate m̃_i of m_i such that P[|m̃_i − m_i| < ɛ] ≥ 3/4.

Proof. First consider the special case i = 1. Let X denote the sum of t draws from G, for some to-be-specified positive integer t. Then E[X] = m_1 · t and Var[X] = σ̄² · t, where σ̄ is the (unknown) standard deviation of G. We take m̃_1 = X/t as our estimate of m_1. Then

P[|m̃_1 − m_1| ≥ ɛ] = P[|tm̃_1 − tm_1| ≥ tɛ] = P[|X − E[X]| ≥ (tɛ/(σ̄√t)) · √Var[X]] ≤ σ̄²/(tɛ²),

where the last inequality is Chebyshev's. Thus to guarantee P[|m̃_1 − m_1| ≥ ɛ] ≤ 1/4 we must set t = 4σ̄²/ɛ² = O(1/ɛ²) (note that due to the assumptions in §3, σ̄ is O(1)). For i > 1, we let X be the sum of t block maxima (each the maximum of i independent draws from G), which increases the number of samples required by a factor of i.

To boost the probability that |m̃_i − m_i| < ɛ from 3/4 to 1 − δ, we use the median of means method.

Lemma 2. Let i be a positive integer and let ɛ > 0 and δ ∈ (0, 1) be real numbers. Then O(ln(1/δ) · i/ɛ²) draws from G suffice to obtain an estimate m̃_i of m_i such that P[|m̃_i − m_i| < ɛ] ≥ 1 − δ.

Proof. We invoke Lemma 1 r times (for r to be determined), yielding a set E = {m̃_i^(1), m̃_i^(2), ..., m̃_i^(r)} of estimates of m_i. Let m̃_i be the median element of E. Let A = {m̃_i^(j) ∈ E : |m̃_i^(j) − m_i| < ɛ} be the set of accurate estimates of m_i, and let A = |A|. Then |m̃_i − m_i| ≥ ɛ implies A ≤ r/2, while E[A] ≥ (3/4)r. Using the standard Chernoff bound, we have

P[|m̃_i − m_i| ≥ ɛ] ≤ P[A ≤ r/2] ≤ exp(−r/C)

for a constant C > 0. Thus r = O(ln(1/δ)) repetitions suffice to ensure P[|m̃_i − m_i| ≥ ɛ] ≤ δ.
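Lemmas 1 and 2 combine two standard ingredients: estimate m_i by averaging block maxima, then take a median over independent repetitions to boost the confidence. A hedged sketch of the two estimators (function names are ours):

```python
import statistics

def estimate_m_i(draws, i):
    """Lemma 1's estimator: the average of block maxima, each block
    being the max of i consecutive draws.  len(draws) should be a
    multiple of i."""
    maxima = [max(draws[j:j + i]) for j in range(0, len(draws), i)]
    return sum(maxima) / len(maxima)

def median_of_means(draws, i, r):
    """Lemma 2's boost: run the Lemma 1 estimator on r disjoint chunks
    of the sample and return the median of the r results.  Taking
    r = O(ln(1/delta)) repetitions drives the failure probability of
    the 3/4-confidence base estimator down to delta."""
    chunk = len(draws) // r
    estimates = [estimate_m_i(draws[j * chunk:(j + 1) * chunk], i)
                 for j in range(r)]
    return statistics.median(estimates)
```

The median step is what makes the estimate robust: even if a minority of chunks produce wildly inaccurate averages, the median is unaffected.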

4.1.2 Estimating m_n when ξ = 0

Lemma 3. Assume G has shape parameter ξ = 0. Let n be a positive integer and let ɛ > 0 and δ ∈ (0, 1) be real numbers. Then O(ln(1/δ) · ln(n)²/ɛ²) draws from G suffice to obtain an estimate m̃_n of m_n such that P[|m̃_n − m_n| < ɛ] ≥ 1 − δ.

Proof. By Proposition 3, m_i = µ + σγ + σ ln(i). Thus

m_n = m_1 + (m_2 − m_1) log_2(n).   (4.1)

Let m̃_1 and m̃_2 be estimates of m_1 and m_2, respectively, and let m̃_n be the estimate of m_n obtained by plugging m̃_1 and m̃_2 into (4.1). Define ∆_i = |m̃_i − m_i| for i ∈ {1, 2, n}. Then

∆_n ≤ (1 + log_2(n))(∆_1 + ∆_2).

Thus to guarantee P[∆_n < ɛ] ≥ 1 − δ, it suffices that P[∆_i ≤ ɛ/(2(1 + log_2(n)))] ≥ 1 − δ/2 for all i ∈ {1, 2}. By Lemma 2, this requires O(ln(1/δ) · ln(n)²/ɛ²) draws from G.

4.1.3 Estimating m_n when ξ ≠ 0

For the purpose of the proofs presented here, we will make a minor assumption concerning an arm's shape parameter ξ_i: we assume that for some known constant ξ* > 0,

|ξ_i| < ξ* ⟹ ξ_i = 0.

Removing this assumption does not fundamentally change the results, but it makes the proofs more complicated.

Lemma 5 shows how to efficiently estimate ξ. Lemmas 6 and 7 show how to efficiently estimate m_n in the cases ξ < 0 and ξ > 0, respectively. We will use the following lemma.

Lemma 4. m_4 − m_2 ≥ σ/8 and m_2 − m_1 ≥ σ/4.

Proof. See Appendix A.

Lemma 5. For real numbers ɛ > 0 and δ ∈ (0, 1), O(ln(1/δ) · 1/ɛ²) draws from G suffice to obtain an estimate ξ̃ of ξ such that P[|ξ̃ − ξ| < ɛ] ≥ 1 − δ.

Proof. Using Proposition 3, it is straightforward to check that for any ξ < 1,

ξ = log_2((m_4 − m_2)/(m_2 − m_1)).   (4.2)
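Equations (4.1) and (4.2) can be exercised directly against the closed form of Proposition 3: with exact inputs the recovery is exact, while the algorithm feeds in the estimates of Lemma 2. A sketch (function names are ours):

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant

def true_m_i(mu, sigma, xi, i):
    """Closed form for m_i from Proposition 3 (valid for xi < 1)."""
    if xi == 0.0:
        return mu + sigma * EULER_GAMMA + sigma * math.log(i)
    return mu + (sigma / xi) * (i ** xi * math.gamma(1.0 - xi) - 1.0)

def estimate_shape(m1, m2, m4):
    """Equation (4.2): recover xi from the expected maxima of 1, 2 and
    4 draws.  The ratio (m4 - m2)/(m2 - m1) equals 2**xi."""
    return math.log2((m4 - m2) / (m2 - m1))

def interpolate_m_n_gumbel(m1, m2, n):
    """Equation (4.1): when xi = 0, m_n is linear in log2(n), so two
    points determine the whole curve."""
    return m1 + (m2 - m1) * math.log2(n)
```

For ξ = 0 the ratio in estimate_shape is exactly 1 (both gaps equal σ ln 2), which is why the estimated shape can be thresholded against ξ* to decide between the linear and nonlinear interpolation branches.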

Let m̃_1, m̃_2, and m̃_4 be estimates of m_1, m_2, and m_4, respectively, and let ξ̃ be the estimate of ξ obtained by plugging m̃_1, m̃_2, and m̃_4 into (4.2). Define ∆_m = max_{i∈{1,2,4}} |m̃_i − m_i| and ∆_ξ = |ξ̃ − ξ|. We wish to upper bound ∆_ξ as a function of ∆_m. In Claim 1 of Theorem 1 we showed that ln(x + β) − ln(x) ≤ 2β/x for β ≤ x/2. Letting N = m_4 − m_2 and D = m_2 − m_1, and noting that ξ = log_2(N) − log_2(D) = (1/ln 2)(ln(N) − ln(D)), it follows that

∆_ξ ≤ (1/ln 2)(2(2∆_m)/N + 2(2∆_m)/D)

for ∆_m < (1/2) min(N, D). Thus by Lemma 4 and the assumption that σ ≥ σ_l, ∆_ξ is O(∆_m). Define ∆_i = |m̃_i − m_i|, so that ∆_m = max_{i∈{1,2,4}} ∆_i. Then to guarantee P[∆_ξ < ɛ] ≥ 1 − δ, it suffices that P[∆_i ≤ Ω(ɛ)] ≥ 1 − δ/3 for all i ∈ {1, 2, 4}. By Lemma 2, this requires O(ln(1/δ) · 1/ɛ²) draws from G.

Lemma 6. Assume G has shape parameter ξ ≤ −ξ*. Let n be a positive integer and let ɛ > 0 and δ ∈ (0, 1) be real numbers. Then O(ln(1/δ) · 1/ɛ²) draws from G suffice to obtain an estimate m̃_n of m_n such that P[|m̃_n − m_n| < ɛ] ≥ 1 − δ.

Proof. By Proposition 3, m_i = µ + (σ/ξ)(i^ξ Γ(1 − ξ) − 1). Define

α_1 = µ − σξ^{−1}
α_2 = σξ^{−1} Γ(1 − ξ)
α_3 = 2^ξ

so that

m_i = α_1 + α_2 · α_3^{log_2(i)}.   (4.3)

Plugging the values i = 1, i = 2, and i = 4 into (4.3) yields a system of three quadratic equations. Solving this system for α_1, α_2, and α_3 yields

α_1 = (m_1 m_4 − m_2²)(m_1 − 2m_2 + m_4)^{−1}
α_2 = (m_1² − 2m_1 m_2 + m_2²)(m_1 − 2m_2 + m_4)^{−1}
α_3 = (m_4 − m_2)(m_2 − m_1)^{−1}.

Let m̃_1, m̃_2, and m̃_4 be estimates of m_1, m_2, and m_4, respectively. Plugging m̃_1, m̃_2, and m̃_4 into the above equations yields estimates, say ᾱ_1, ᾱ_2, and ᾱ_3, of α_1, α_2, and α_3, respectively. Define ∆_m = max_{i∈{1,2,4}} |m̃_i − m_i| and ∆_α = max_{i∈{1,2,3}} |ᾱ_i − α_i|. To complete the proof, we show that |m̃_n − m_n| is O(∆_m). The argument consists of two parts: in claims 1 through 3 we show that ∆_α is O(∆_m), then in Claim 4 we show that |m̃_n − m_n| is O(∆_α).

Claim 1. Each of the numerators in the expressions for α_1, α_2, and α_3 has absolute value bounded from above, while each of the denominators has absolute value bounded from below. (The bounds are independent of the unknown parameters of G.)

Proof of Claim 1. The numerators will have bounded absolute value as long as m_1, m_2, and m_4 are bounded. Upper bounds on m_1, m_2, and m_4 follow from the restrictions on the parameters µ, σ, and ξ. As for the denominators, by Lemma 4 we have

|m_1 − 2m_2 + m_4| = (m_2 − m_1)|α_3 − 1| ≥ (1/8)σ_l (2^{ξ*} − 1).

Claim 2. Let N and D ≠ 0 be fixed real numbers, and let β_N and β_D be real numbers with |β_D| < |D|/2. Then |(N + β_N)/(D + β_D) − N/D| is O(|β_N| + |β_D|).

Proof of Claim 2. First, using the Taylor series expansion of N/(D + β_D),

|N/(D + β_D) − N/D| = |Nβ_D/D²| · |Σ_{i=0}^∞ (−β_D/D)^i| ≤ |Nβ_D/D²| (1 − |β_D/D|)^{−1} = O(|β_D|).

Then

|(N + β_N)/(D + β_D) − N/D| ≤ |N/(D + β_D) − N/D| + |β_N/(D + β_D)| = O(|β_N| + |β_D|).

Claim 3. ∆_α is O(∆_m).

Proof of Claim 3. We show that |ᾱ_1 − α_1| is O(∆_m). Similar arguments show that |ᾱ_2 − α_2| and |ᾱ_3 − α_3| are O(∆_m), which proves the claim. To see that |ᾱ_1 − α_1| is O(∆_m), let N = m_1 m_4 − m_2² and let D = m_1 − 2m_2 + m_4, so that α_1 = N/D. Define N̄ and D̄ in the natural way so that ᾱ_1 = N̄/D̄. Because m_1, m_2, and m_4 are all O(1) (by Claim 1), it follows that both |N̄ − N| and |D̄ − D| are O(∆_m). That |ᾱ_1 − α_1| is O(∆_m) then follows by Claim 2.

Claim 4. |m̃_n − m_n| is O(∆_α).

Proof of Claim 4. Because ξ_l ≤ ξ ≤ −ξ*, it must be that 0 < 2^{ξ_l} ≤ α_3 ≤ 2^{−ξ*} < 1. So for ∆_α sufficiently small, 0 < ᾱ_3 < 1. Then

|m̃_n − m_n| = |(ᾱ_1 + ᾱ_2 ᾱ_3^{log_2(n)}) − (α_1 + α_2 α_3^{log_2(n)})|
≤ |ᾱ_1 − α_1| + |ᾱ_2 ᾱ_3^{log_2(n)} − ᾱ_2 α_3^{log_2(n)}| + |ᾱ_2 α_3^{log_2(n)} − α_2 α_3^{log_2(n)}|
≤ |ᾱ_1 − α_1| + ᾱ_2 |ᾱ_3 − α_3| + |ᾱ_2 − α_2|
= O(∆_α),

where on the third line we have used the fact that both α_3 and ᾱ_3 are between 0 and 1, and in the last line we have used the fact that ᾱ_2 is O(1).

Putting claims 3 and 4 together, |m̃_n − m_n| is O(∆_m). Define ∆_i = |m̃_i − m_i|, so that ∆_m = max_{i∈{1,2,4}} ∆_i. Thus to guarantee P[|m̃_n − m_n| < ɛ] ≥ 1 − δ, it suffices that P[∆_i < Ω(ɛ)] ≥ 1 − δ/3 for all i ∈ {1, 2, 4}. By Lemma 2, this requires O(ln(1/δ) · 1/ɛ²) draws from G.

Lemma 7. Assume G has shape parameter ξ ≥ ξ*. Let n be a positive integer and let ɛ > 0 and δ ∈ (0, 1) be real numbers. Then O(ln(1/δ) · ln(n)²/ɛ²) draws from G suffice to obtain an estimate m̃_n of m_n such that

P[1/(1 + ɛ) < (m̃_n − α_1)/(m_n − α_1) < 1 + ɛ] ≥ 1 − δ

where α_1 = µ − σ/ξ.

Proof. See Appendix A.

Putting the results of lemmas 3, 5, 6, and 7 together, we obtain the following theorem.

Theorem 2. Let n be a positive integer and let ɛ > 0 and δ ∈ (0, 1) be real numbers. Then O(ln(1/δ) · ln(n)²/ɛ²) draws from G suffice to obtain an estimate m̃_n of m_n such that with probability at least 1 − δ, one of the following holds:

ξ ≤ 0 and |m̃_n − m_n| < ɛ, or

ξ > 0 and 1/(1 + ɛ) < (m̃_n − α_1)/(m_n − α_1) < 1 + ɛ, where α_1 = µ − σ/ξ.

Proof. First, invoke Lemma 5 with parameters ξ* and δ/2. Then invoke one of Lemmas 3, 6, or 7 (depending on the estimate ξ̃ obtained from Lemma 5) with parameters ɛ and δ/2.

Theorem 2 shows that step 1(a) of strategy S_1 can be performed as described.

5 Conclusions

The max k-armed bandit problem is a variant of the classical k-armed bandit problem with practical applications to combinatorial optimization. Motivated by extreme value theory, we studied a restricted version of this problem in which each arm yields payoff drawn from a GEV distribution. We derived PAC bounds on the sample complexity of estimating m_n, the expected maximum of n draws from a GEV distribution. Using these bounds, we showed that a simple algorithm for the max k-armed bandit problem is asymptotically optimal. Ours is the first algorithm for this problem with rigorous asymptotic performance guarantees.

References

[1] Donald A. Berry and Bert Fristedt. Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, London.

[2] Vincent A. Cicirello and Stephen F. Smith. Heuristic selection for stochastic search optimization: Modeling solution quality by extreme value theory. In Proceedings of the 10th International Conference on Principles and Practice of Constraint Programming.

[3] Vincent A. Cicirello and Stephen F. Smith. The max k-armed bandit: A new model of exploration applied to search heuristic selection. In Proceedings of AAAI 2005.

[4] Stuart Coles. An Introduction to Statistical Modeling of Extreme Values. Springer-Verlag, London.

[5] Philip W. L. Fong. A quantitative study of hypothesis selection. In International Conference on Machine Learning.

[6] Leslie P. Kaelbling. Learning in Embedded Systems. The MIT Press, Cambridge, MA.

[7] Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58.

Appendix A

Theorem 1. Let I = (n, G) be an instance of the max k-armed bandit problem, where G = {G_1, G_2, ..., G_k} and G_i = GEV_(µ_i,σ_i,ξ_i). Let m_n = max_{1≤i≤k} m_n^{G_i}, ξ_max = max_{1≤i≤k} ξ_i, and S = S_1(∆, 1/(kn²)), where ∆ = (k · ln(nk) · ln(n)²/n)^{1/3}. Then if ξ_max ≤ 0,

lim_{n→∞} (m_n − S^n(I)) = 0,

while if ξ_max > 0,

lim_{n→∞} S^n(I)/m_n = 1.

Proof. For ξ_max ≤ 0, the theorem was proved in the main text. It remains only to address the case ξ_max > 0.

Case ξ_max > 0. We will prove the stronger claim that

(S^n(I) − α_1)/(m_n − α_1) = 1 − O(∆)   (5.1)

where ∆ = (k · ln(nk) · ln(n)²/n)^{1/3} and α_1 = max_{1≤i≤k} α_1^i, where α_1^i = µ_i − σ_i/ξ_i.

For the moment, let us assume that all arms have shape parameter ξ_i > 0. Let A be the event (which occurs with probability at least 1 − kδ) that all estimates obtained in step 1(a) satisfy the inequality in Theorem 2.

Claim 1. To prove (5.1), it suffices to show that A implies (m_n − α_1)/(m̂_{n−tk} − α_1) = 1 + O(∆).

Proof of Claim 1. Because S^n(I) ≥ m̂_{n−tk} and the event A occurs with probability at least 1 − kδ, it suffices to show that A implies (1 − δk)(m̂_{n−tk} − α_1)/(m_n − α_1) = 1 − O(∆). Because δk · m̂_{n−tk}/(m_n − α_1) is O(1/n²) = o(∆), it suffices to show that A implies (m̂_{n−tk} − α_1)/(m_n − α_1) = 1 − O(∆). This can be rewritten as m_n − α_1 = (m̂_{n−tk} − α_1) · 1/(1 − O(∆)) = (m̂_{n−tk} − α_1)(1 + O(∆)), where we can replace 1/(1 − O(∆)) with 1 + O(∆) because for r < 1/2, 1/(1 − r) = 1 + r/(1 − r) < 1 + 2r.

Claim 2. (m̂_n − α̂_1)/(m̂_{n−tk} − α̂_1) = 1 + O(∆), where α̂_1 = µ_î − σ_î/ξ_î.

Proof of Claim 2. Using Proposition 3,

ln((m̂_n − α̂_1)/(m̂_{n−tk} − α̂_1)) = ln(n^ξ/(n − tk)^ξ) = ξ(ln(n) − ln(n − tk)) = O(tk/n) = O(∆).

The claim follows from the fact that exp(β) < 1 + 3β for β < 1, so that exp(O(∆)) = 1 + O(∆).

Claim 3. A implies that for all i, (m_n^i − α_1)/(m̃_n^i − α_1) < 1 + ɛ and (m̃_n^i − α_1)/(m_n^i − α_1) < 1 + ɛ.

Proof of Claim 3. By definition, α_1 = α_1^i + β for some β ≥ 0. The claim follows from Theorem 2 and the fact that for positive N and D and β ≥ 0, N/D < 1 + ɛ implies (N + β)/(D + β) < 1 + ɛ.

Claim 4. A implies (m_n − α_1)/(m̂_{n−tk} − α_1) = 1 + O(∆).

Proof of Claim 4. Let i* = arg max_{1≤i≤k} m_n^i, so that m_n = m_n^{i*}. Then

(m_n − α_1)/(m̂_{n−tk} − α_1) = [(m_n^{i*} − α_1)/(m̃_n^{i*} − α_1)] · [(m̃_n^{i*} − α_1)/(m̃_n^î − α_1)] · [(m̃_n^î − α_1)/(m̂_{n−tk} − α_1)] ≤ (1 + ɛ) · 1 · (1 + ɛ)(1 + O(∆)) = 1 + O(∆),

where in the second step we have used claims 2 and 3, together with the fact that î maximizes m̃_n^i.

Putting claims 1 and 4 together completes the proof. To remove the assumption that all arms have ξ_i > 0, we need to show that A implies that, for n sufficiently large, the arms î and i* (the only arms that play a role in the proof) have shape parameters > 0. This follows from the fact that if ξ_i ≤ 0, m_n^i is O(ln(n)), while if ξ_i ≥ ξ* > 0, m_n^i is Ω(n^{ξ_i}).

Lemma 4. m_4 − m_2 ≥ σ/8 and m_2 − m_1 ≥ σ/4.

Proof. If ξ = 0, then by Proposition 3, m_4 − m_2 = m_2 − m_1 = ln(2)σ and we are done. Otherwise,

m_2 − m_1 = σ(2^ξ − 1)ξ^{−1}Γ(1 − ξ) and m_4 − m_2 = σ(4^ξ − 2^ξ)ξ^{−1}Γ(1 − ξ).

It thus suffices to prove the following claim.

Claim 1.

min_{ξ<1/2} (2^ξ − 1)ξ^{−1}Γ(1 − ξ) ≥ 1/4, and

min_{ξ<1/2} (4^ξ − 2^ξ)ξ^{−1}Γ(1 − ξ) ≥ 1/8.

Proof of Claim 1. We state without proof the following properties of the Γ function: Γ(1 + z) = zΓ(z) for z > 0, and Γ(z) ≥ 1/2 for z > 1/2. Making the change of variable y = −ξ, it suffices to show

min_{y>−1/2} (1 − 2^{−y})y^{−1}Γ(1 + y) ≥ 1/4,   (5.2)

min_{y>−1/2} 2^{−y}(1 − 2^{−y})y^{−1}Γ(1 + y) ≥ 1/8.   (5.3)

(5.2) holds because for −1/2 < y ≤ 1, (1 − 2^{−y})y^{−1} ≥ 1/2, so that

(1 − 2^{−y})y^{−1}Γ(1 + y) ≥ (1/2)Γ(1 + y) ≥ 1/4,

while for y > 1, 1 − 2^{−y} ≥ 1/2 and y^{−1}Γ(1 + y) = Γ(y), so that

(1 − 2^{−y})y^{−1}Γ(1 + y) ≥ (1/2)Γ(y) ≥ 1/4.

Similarly, (5.3) holds because for −1/2 < y ≤ 1, additionally 2^{−y} ≥ 1/2, so that

2^{−y}(1 − 2^{−y})y^{−1}Γ(1 + y) ≥ (1/4)Γ(1 + y) ≥ 1/8,

while for y > 1, one checks that

2^{−y}(1 − 2^{−y})y^{−1}Γ(1 + y) = (2^{−y} − 4^{−y})Γ(y) ≥ 1/8;

the left-hand side attains its minimum (≈ 0.187, near y ≈ 2.1) above 1/8, and grows without bound as y → ∞.

Lemma 7. Assume G has shape parameter ξ ≥ ξ*. Let n be a positive integer and let ɛ > 0 and δ ∈ (0, 1) be real numbers. Then O(ln(1/δ) · ln(n)²/ɛ²) draws from G suffice to obtain an estimate m̃_n of m_n such that

P[1/(1 + ɛ) < (m̃_n − α_1)/(m_n − α_1) < 1 + ɛ] ≥ 1 − δ

where α_1 = µ − σ/ξ.

Proof. We use the same estimation procedure as in the proof of Lemma 6. Let α_1, α_2, α_3, ∆_α, and ∆_m be defined as they were in that proof. The inequality 1/(1 + ɛ) < (m̃_n − α_1)/(m_n − α_1) < 1 + ɛ is the same as |ln(m̃_n − α_1) − ln(m_n − α_1)| < ln(1 + ɛ). For ɛ < 1/4, ln(1 + ɛ) ≥ (7/8)ɛ, so it suffices to guarantee that |ln(m̃_n − α_1) − ln(m_n − α_1)| < (7/8)ɛ.

Claim 1. |ln(m̃_n − α_1) − ln(m_n − α_1)| is O(ln(n) · ∆_α).

Proof of Claim 1. Because |ln(m̃_n − α_1) − ln(m̃_n − ᾱ_1)| is O(∆_α), it suffices to show that |ln(m̃_n − ᾱ_1) − ln(m_n − α_1)| is O(ln(n) · ∆_α). This is true because

ln(m̃_n − ᾱ_1) = ln(ᾱ_2 ᾱ_3^{log_2(n)}) = log_2(n) ln(ᾱ_3) + ln(ᾱ_2) = log_2(n) ln(α_3) + ln(α_2) ± O(ln(n) · ∆_α) = ln(α_2 α_3^{log_2(n)}) ± O(ln(n) · ∆_α) = ln(m_n − α_1) ± O(ln(n) · ∆_α).

Setting ∆_α < Ω(ln(n)^{−1} ɛ) then guarantees |ln(m̃_n − α_1) − ln(m_n − α_1)| < (7/8)ɛ. By Claim 3 of the proof of Lemma 6 (which did not depend on the assumption ξ < 0), ∆_α is O(∆_m), so we require P[∆_m < Ω(ln(n)^{−1} ɛ)] ≥ 1 − δ. Define ∆_i = |m̃_i − m_i|, so that ∆_m = max_{i∈{1,2,4}} ∆_i. It suffices that P[∆_i < Ω(ln(n)^{−1} ɛ)] ≥ 1 − δ/3 for i ∈ {1, 2, 4}. By Lemma 2, ensuring this requires O(ln(1/δ) · ln(n)²/ɛ²) draws from G.
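To close, the interpolation underlying Lemmas 6 and 7 and the constants asserted in Lemma 4 can both be spot-checked numerically. The grid scan below is a sanity check, not a proof, and the function names are ours:

```python
import math

def fit_m_n(m1, m2, m4, n):
    """Lemma 6/7 estimator: fit m_i = a1 + a2 * a3**log2(i) through
    (m1, m2, m4) and extrapolate to n.  Exact when the inputs are the
    true expected maxima and xi != 0."""
    d = m1 - 2.0 * m2 + m4
    a1 = (m1 * m4 - m2 * m2) / d
    a2 = (m1 - m2) ** 2 / d
    a3 = (m4 - m2) / (m2 - m1)
    return a1 + a2 * a3 ** math.log2(n)

def normalized_gap(xi, which):
    """(m2 - m1)/sigma ('21') or (m4 - m2)/sigma ('42') as a function
    of xi (Proposition 3, xi != 0) -- the quantities bounded below in
    Lemma 4."""
    g = math.gamma(1.0 - xi) / xi
    if which == "21":
        return (2.0 ** xi - 1.0) * g
    return (4.0 ** xi - 2.0 ** xi) * g

# Scan xi over a grid covering [-6, 0.49] (excluding 0, where the
# Gumbel gaps are both ln(2) * sigma) and record the smallest gaps.
grid = [x / 100.0 for x in range(-600, 50) if x != 0]
min_gap_21 = min(normalized_gap(x, "21") for x in grid)
min_gap_42 = min(normalized_gap(x, "42") for x in grid)
```

With exact inputs from Proposition 3, fit_m_n recovers m_n exactly for any ξ ≠ 0 (positive or negative), and the grid minima stay above the 1/4 and 1/8 constants of Lemma 4.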


More information

Notes 1 : Measure-theoretic foundations I

Notes 1 : Measure-theoretic foundations I Notes 1 : Measure-theoretic foundations I Math 733-734: Theory of Probability Lecturer: Sebastien Roch References: [Wil91, Section 1.0-1.8, 2.1-2.3, 3.1-3.11], [Fel68, Sections 7.2, 8.1, 9.6], [Dur10,

More information

The information complexity of sequential resource allocation

The information complexity of sequential resource allocation The information complexity of sequential resource allocation Emilie Kaufmann, joint work with Olivier Cappé, Aurélien Garivier and Shivaram Kalyanakrishan SMILE Seminar, ENS, June 8th, 205 Sequential allocation

More information

Regression and Statistical Inference

Regression and Statistical Inference Regression and Statistical Inference Walid Mnif wmnif@uwo.ca Department of Applied Mathematics The University of Western Ontario, London, Canada 1 Elements of Probability 2 Elements of Probability CDF&PDF

More information

Discriminative Learning can Succeed where Generative Learning Fails

Discriminative Learning can Succeed where Generative Learning Fails Discriminative Learning can Succeed where Generative Learning Fails Philip M. Long, a Rocco A. Servedio, b,,1 Hans Ulrich Simon c a Google, Mountain View, CA, USA b Columbia University, New York, New York,

More information

Linear models. Linear models are computationally convenient and remain widely used in. applied econometric research

Linear models. Linear models are computationally convenient and remain widely used in. applied econometric research Linear models Linear models are computationally convenient and remain widely used in applied econometric research Our main focus in these lectures will be on single equation linear models of the form y

More information

Bennett-type Generalization Bounds: Large-deviation Case and Faster Rate of Convergence

Bennett-type Generalization Bounds: Large-deviation Case and Faster Rate of Convergence Bennett-type Generalization Bounds: Large-deviation Case and Faster Rate of Convergence Chao Zhang The Biodesign Institute Arizona State University Tempe, AZ 8587, USA Abstract In this paper, we present

More information

6.1 Moment Generating and Characteristic Functions

6.1 Moment Generating and Characteristic Functions Chapter 6 Limit Theorems The power statistics can mostly be seen when there is a large collection of data points and we are interested in understanding the macro state of the system, e.g., the average,

More information

1 Review of The Learning Setting

1 Review of The Learning Setting COS 5: Theoretical Machine Learning Lecturer: Rob Schapire Lecture #8 Scribe: Changyan Wang February 28, 208 Review of The Learning Setting Last class, we moved beyond the PAC model: in the PAC model we

More information

Lecture 5: January 30

Lecture 5: January 30 CS71 Randomness & Computation Spring 018 Instructor: Alistair Sinclair Lecture 5: January 30 Disclaimer: These notes have not been subjected to the usual scrutiny accorded to formal publications. They

More information

Homework 4 Solutions

Homework 4 Solutions CS 174: Combinatorics and Discrete Probability Fall 01 Homework 4 Solutions Problem 1. (Exercise 3.4 from MU 5 points) Recall the randomized algorithm discussed in class for finding the median of a set

More information

Lectures 2 3 : Wigner s semicircle law

Lectures 2 3 : Wigner s semicircle law Fall 009 MATH 833 Random Matrices B. Való Lectures 3 : Wigner s semicircle law Notes prepared by: M. Koyama As we set up last wee, let M n = [X ij ] n i,j=1 be a symmetric n n matrix with Random entries

More information

Learning symmetric non-monotone submodular functions

Learning symmetric non-monotone submodular functions Learning symmetric non-monotone submodular functions Maria-Florina Balcan Georgia Institute of Technology ninamf@cc.gatech.edu Nicholas J. A. Harvey University of British Columbia nickhar@cs.ubc.ca Satoru

More information

The space complexity of approximating the frequency moments

The space complexity of approximating the frequency moments The space complexity of approximating the frequency moments Felix Biermeier November 24, 2015 1 Overview Introduction Approximations of frequency moments lower bounds 2 Frequency moments Problem Estimate

More information

1 Introduction. 2 Measure theoretic definitions

1 Introduction. 2 Measure theoretic definitions 1 Introduction These notes aim to recall some basic definitions needed for dealing with random variables. Sections to 5 follow mostly the presentation given in chapter two of [1]. Measure theoretic definitions

More information

Lecture 9. = 1+z + 2! + z3. 1 = 0, it follows that the radius of convergence of (1) is.

Lecture 9. = 1+z + 2! + z3. 1 = 0, it follows that the radius of convergence of (1) is. The Exponential Function Lecture 9 The exponential function 1 plays a central role in analysis, more so in the case of complex analysis and is going to be our first example using the power series method.

More information

Random Bernstein-Markov factors

Random Bernstein-Markov factors Random Bernstein-Markov factors Igor Pritsker and Koushik Ramachandran October 20, 208 Abstract For a polynomial P n of degree n, Bernstein s inequality states that P n n P n for all L p norms on the unit

More information

Lecture 4: Two-point Sampling, Coupon Collector s problem

Lecture 4: Two-point Sampling, Coupon Collector s problem Randomized Algorithms Lecture 4: Two-point Sampling, Coupon Collector s problem Sotiris Nikoletseas Associate Professor CEID - ETY Course 2013-2014 Sotiris Nikoletseas, Associate Professor Randomized Algorithms

More information

Bandits : optimality in exponential families

Bandits : optimality in exponential families Bandits : optimality in exponential families Odalric-Ambrym Maillard IHES, January 2016 Odalric-Ambrym Maillard Bandits 1 / 40 Introduction 1 Stochastic multi-armed bandits 2 Boundary crossing probabilities

More information

Module 3. Function of a Random Variable and its distribution

Module 3. Function of a Random Variable and its distribution Module 3 Function of a Random Variable and its distribution 1. Function of a Random Variable Let Ω, F, be a probability space and let be random variable defined on Ω, F,. Further let h: R R be a given

More information

Lecture 9: March 26, 2014

Lecture 9: March 26, 2014 COMS 6998-3: Sub-Linear Algorithms in Learning and Testing Lecturer: Rocco Servedio Lecture 9: March 26, 204 Spring 204 Scriber: Keith Nichols Overview. Last Time Finished analysis of O ( n ɛ ) -query

More information

Experience-Efficient Learning in Associative Bandit Problems

Experience-Efficient Learning in Associative Bandit Problems Alexander L. Strehl strehl@cs.rutgers.edu Chris Mesterharm mesterha@cs.rutgers.edu Michael L. Littman mlittman@cs.rutgers.edu Haym Hirsh hirsh@cs.rutgers.edu Department of Computer Science, Rutgers University,

More information

Hoeffding, Chernoff, Bennet, and Bernstein Bounds

Hoeffding, Chernoff, Bennet, and Bernstein Bounds Stat 928: Statistical Learning Theory Lecture: 6 Hoeffding, Chernoff, Bennet, Bernstein Bounds Instructor: Sham Kakade 1 Hoeffding s Bound We say X is a sub-gaussian rom variable if it has quadratically

More information

Introduction to Machine Learning. Lecture 2

Introduction to Machine Learning. Lecture 2 Introduction to Machine Learning Lecturer: Eran Halperin Lecture 2 Fall Semester Scribe: Yishay Mansour Some of the material was not presented in class (and is marked with a side line) and is given for

More information

Machine Learning. Lecture 9: Learning Theory. Feng Li.

Machine Learning. Lecture 9: Learning Theory. Feng Li. Machine Learning Lecture 9: Learning Theory Feng Li fli@sdu.edu.cn https://funglee.github.io School of Computer Science and Technology Shandong University Fall 2018 Why Learning Theory How can we tell

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

THE first formalization of the multi-armed bandit problem

THE first formalization of the multi-armed bandit problem EDIC RESEARCH PROPOSAL 1 Multi-armed Bandits in a Network Farnood Salehi I&C, EPFL Abstract The multi-armed bandit problem is a sequential decision problem in which we have several options (arms). We can

More information

Proofs for Large Sample Properties of Generalized Method of Moments Estimators

Proofs for Large Sample Properties of Generalized Method of Moments Estimators Proofs for Large Sample Properties of Generalized Method of Moments Estimators Lars Peter Hansen University of Chicago March 8, 2012 1 Introduction Econometrica did not publish many of the proofs in my

More information

1 A Support Vector Machine without Support Vectors

1 A Support Vector Machine without Support Vectors CS/CNS/EE 53 Advanced Topics in Machine Learning Problem Set 1 Handed out: 15 Jan 010 Due: 01 Feb 010 1 A Support Vector Machine without Support Vectors In this question, you ll be implementing an online

More information

Distinguishing a truncated random permutation from a random function

Distinguishing a truncated random permutation from a random function Distinguishing a truncated random permutation from a random function Shoni Gilboa Shay Gueron July 9 05 Abstract An oracle chooses a function f from the set of n bits strings to itself which is either

More information

The Multi-Armed Bandit Problem

The Multi-Armed Bandit Problem Università degli Studi di Milano The bandit problem [Robbins, 1952]... K slot machines Rewards X i,1, X i,2,... of machine i are i.i.d. [0, 1]-valued random variables An allocation policy prescribes which

More information

Electronic Companion to: What matters in school choice tie-breaking? How competition guides design

Electronic Companion to: What matters in school choice tie-breaking? How competition guides design Electronic Companion to: What matters in school choice tie-breaking? How competition guides design This electronic companion provides the proofs for the theoretical results (Sections -5) and stylized simulations

More information

Lecture Notes 3 Convergence (Chapter 5)

Lecture Notes 3 Convergence (Chapter 5) Lecture Notes 3 Convergence (Chapter 5) 1 Convergence of Random Variables Let X 1, X 2,... be a sequence of random variables and let X be another random variable. Let F n denote the cdf of X n and let

More information

A sequential hypothesis test based on a generalized Azuma inequality 1

A sequential hypothesis test based on a generalized Azuma inequality 1 A sequential hypothesis test based on a generalized Azuma inequality 1 Daniël Reijsbergen a,2, Werner Scheinhardt b, Pieter-Tjerk de Boer b a Laboratory for Foundations of Computer Science, University

More information

Finite-time Analysis of the Multiarmed Bandit Problem*

Finite-time Analysis of the Multiarmed Bandit Problem* Machine Learning, 47, 35 56, 00 c 00 Kluwer Academic Publishers. Manufactured in The Netherlands. Finite-time Analysis of the Multiarmed Bandit Problem* PETER AUER University of Technology Graz, A-8010

More information

A Result of Vapnik with Applications

A Result of Vapnik with Applications A Result of Vapnik with Applications Martin Anthony Department of Statistical and Mathematical Sciences London School of Economics Houghton Street London WC2A 2AE, U.K. John Shawe-Taylor Department of

More information

The Game of Normal Numbers

The Game of Normal Numbers The Game of Normal Numbers Ehud Lehrer September 4, 2003 Abstract We introduce a two-player game where at each period one player, say, Player 2, chooses a distribution and the other player, Player 1, a

More information

How Much Evidence Should One Collect?

How Much Evidence Should One Collect? How Much Evidence Should One Collect? Remco Heesen October 10, 2013 Abstract This paper focuses on the question how much evidence one should collect before deciding on the truth-value of a proposition.

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Bandit Problems MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Multi-Armed Bandit Problem Problem: which arm of a K-slot machine should a gambler pull to maximize his

More information

Financial Econometrics and Volatility Models Extreme Value Theory

Financial Econometrics and Volatility Models Extreme Value Theory Financial Econometrics and Volatility Models Extreme Value Theory Eric Zivot May 3, 2010 1 Lecture Outline Modeling Maxima and Worst Cases The Generalized Extreme Value Distribution Modeling Extremes Over

More information

Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning

Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning Christos Dimitrakakis Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands

More information

LARGE DEVIATIONS OF TYPICAL LINEAR FUNCTIONALS ON A CONVEX BODY WITH UNCONDITIONAL BASIS. S. G. Bobkov and F. L. Nazarov. September 25, 2011

LARGE DEVIATIONS OF TYPICAL LINEAR FUNCTIONALS ON A CONVEX BODY WITH UNCONDITIONAL BASIS. S. G. Bobkov and F. L. Nazarov. September 25, 2011 LARGE DEVIATIONS OF TYPICAL LINEAR FUNCTIONALS ON A CONVEX BODY WITH UNCONDITIONAL BASIS S. G. Bobkov and F. L. Nazarov September 25, 20 Abstract We study large deviations of linear functionals on an isotropic

More information

10.1 The Formal Model

10.1 The Formal Model 67577 Intro. to Machine Learning Fall semester, 2008/9 Lecture 10: The Formal (PAC) Learning Model Lecturer: Amnon Shashua Scribe: Amnon Shashua 1 We have see so far algorithms that explicitly estimate

More information

The concentration of the chromatic number of random graphs

The concentration of the chromatic number of random graphs The concentration of the chromatic number of random graphs Noga Alon Michael Krivelevich Abstract We prove that for every constant δ > 0 the chromatic number of the random graph G(n, p) with p = n 1/2

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

CLASSICAL PROBABILITY MODES OF CONVERGENCE AND INEQUALITIES

CLASSICAL PROBABILITY MODES OF CONVERGENCE AND INEQUALITIES CLASSICAL PROBABILITY 2008 2. MODES OF CONVERGENCE AND INEQUALITIES JOHN MORIARTY In many interesting and important situations, the object of interest is influenced by many random factors. If we can construct

More information

Rigorous Dynamics and Consistent Estimation in Arbitrarily Conditioned Linear Systems

Rigorous Dynamics and Consistent Estimation in Arbitrarily Conditioned Linear Systems 1 Rigorous Dynamics and Consistent Estimation in Arbitrarily Conditioned Linear Systems Alyson K. Fletcher, Mojtaba Sahraee-Ardakan, Philip Schniter, and Sundeep Rangan Abstract arxiv:1706.06054v1 cs.it

More information

March 1, Florida State University. Concentration Inequalities: Martingale. Approach and Entropy Method. Lizhe Sun and Boning Yang.

March 1, Florida State University. Concentration Inequalities: Martingale. Approach and Entropy Method. Lizhe Sun and Boning Yang. Florida State University March 1, 2018 Framework 1. (Lizhe) Basic inequalities Chernoff bounding Review for STA 6448 2. (Lizhe) Discrete-time martingales inequalities via martingale approach 3. (Boning)

More information

SOME MEMORYLESS BANDIT POLICIES

SOME MEMORYLESS BANDIT POLICIES J. Appl. Prob. 40, 250 256 (2003) Printed in Israel Applied Probability Trust 2003 SOME MEMORYLESS BANDIT POLICIES EROL A. PEKÖZ, Boston University Abstract We consider a multiarmed bit problem, where

More information

Estimation of AUC from 0 to Infinity in Serial Sacrifice Designs

Estimation of AUC from 0 to Infinity in Serial Sacrifice Designs Estimation of AUC from 0 to Infinity in Serial Sacrifice Designs Martin J. Wolfsegger Department of Biostatistics, Baxter AG, Vienna, Austria Thomas Jaki Department of Statistics, University of South Carolina,

More information

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions

Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions International Journal of Control Vol. 00, No. 00, January 2007, 1 10 Stochastic Optimization with Inequality Constraints Using Simultaneous Perturbations and Penalty Functions I-JENG WANG and JAMES C.

More information

Lecture 3: Lower Bounds for Bandit Algorithms

Lecture 3: Lower Bounds for Bandit Algorithms CMSC 858G: Bandits, Experts and Games 09/19/16 Lecture 3: Lower Bounds for Bandit Algorithms Instructor: Alex Slivkins Scribed by: Soham De & Karthik A Sankararaman 1 Lower Bounds In this lecture (and

More information

Advanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12

Advanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12 Advanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12 Lecture 4: Multiarmed Bandit in the Adversarial Model Lecturer: Yishay Mansour Scribe: Shai Vardi 4.1 Lecture Overview

More information

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University

PCA with random noise. Van Ha Vu. Department of Mathematics Yale University PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical

More information

Statistical Convergence of Kernel CCA

Statistical Convergence of Kernel CCA Statistical Convergence of Kernel CCA Kenji Fukumizu Institute of Statistical Mathematics Tokyo 106-8569 Japan fukumizu@ism.ac.jp Francis R. Bach Centre de Morphologie Mathematique Ecole des Mines de Paris,

More information

Estimation of Dynamic Regression Models

Estimation of Dynamic Regression Models University of Pavia 2007 Estimation of Dynamic Regression Models Eduardo Rossi University of Pavia Factorization of the density DGP: D t (x t χ t 1, d t ; Ψ) x t represent all the variables in the economy.

More information

Lecture 1: Introduction to Sublinear Algorithms

Lecture 1: Introduction to Sublinear Algorithms CSE 522: Sublinear (and Streaming) Algorithms Spring 2014 Lecture 1: Introduction to Sublinear Algorithms March 31, 2014 Lecturer: Paul Beame Scribe: Paul Beame Too much data, too little time, space for

More information

Tijmen Daniëls Universiteit van Amsterdam. Abstract

Tijmen Daniëls Universiteit van Amsterdam. Abstract Pure strategy dominance with quasiconcave utility functions Tijmen Daniëls Universiteit van Amsterdam Abstract By a result of Pearce (1984), in a finite strategic form game, the set of a player's serially

More information

Ordinal Optimization and Multi Armed Bandit Techniques

Ordinal Optimization and Multi Armed Bandit Techniques Ordinal Optimization and Multi Armed Bandit Techniques Sandeep Juneja. with Peter Glynn September 10, 2014 The ordinal optimization problem Determining the best of d alternative designs for a system, on

More information

On the Complexity of Best Arm Identification with Fixed Confidence

On the Complexity of Best Arm Identification with Fixed Confidence On the Complexity of Best Arm Identification with Fixed Confidence Discrete Optimization with Noise Aurélien Garivier, Emilie Kaufmann COLT, June 23 th 2016, New York Institut de Mathématiques de Toulouse

More information

Robustness and duality of maximum entropy and exponential family distributions

Robustness and duality of maximum entropy and exponential family distributions Chapter 7 Robustness and duality of maximum entropy and exponential family distributions In this lecture, we continue our study of exponential families, but now we investigate their properties in somewhat

More information

On prediction and density estimation Peter McCullagh University of Chicago December 2004

On prediction and density estimation Peter McCullagh University of Chicago December 2004 On prediction and density estimation Peter McCullagh University of Chicago December 2004 Summary Having observed the initial segment of a random sequence, subsequent values may be predicted by calculating

More information

Multivariate Statistics Random Projections and Johnson-Lindenstrauss Lemma

Multivariate Statistics Random Projections and Johnson-Lindenstrauss Lemma Multivariate Statistics Random Projections and Johnson-Lindenstrauss Lemma Suppose again we have n sample points x,..., x n R p. The data-point x i R p can be thought of as the i-th row X i of an n p-dimensional

More information

Lossless Online Bayesian Bagging

Lossless Online Bayesian Bagging Lossless Online Bayesian Bagging Herbert K. H. Lee ISDS Duke University Box 90251 Durham, NC 27708 herbie@isds.duke.edu Merlise A. Clyde ISDS Duke University Box 90251 Durham, NC 27708 clyde@isds.duke.edu

More information

CS261: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm

CS261: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm CS61: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm Tim Roughgarden February 9, 016 1 Online Algorithms This lecture begins the third module of the

More information

Introduction to Statistical Learning Theory

Introduction to Statistical Learning Theory Introduction to Statistical Learning Theory Definition Reminder: We are given m samples {(x i, y i )} m i=1 Dm and a hypothesis space H and we wish to return h H minimizing L D (h) = E[l(h(x), y)]. Problem

More information

The distribution inherited by Y is called the Cauchy distribution. Using that. d dy ln(1 + y2 ) = 1 arctan(y)

The distribution inherited by Y is called the Cauchy distribution. Using that. d dy ln(1 + y2 ) = 1 arctan(y) Stochastic Processes - MM3 - Solutions MM3 - Review Exercise Let X N (0, ), i.e. X is a standard Gaussian/normal random variable, and denote by f X the pdf of X. Consider also a continuous random variable

More information

Lecture 9. d N(0, 1). Now we fix n and think of a SRW on [0,1]. We take the k th step at time k n. and our increments are ± 1

Lecture 9. d N(0, 1). Now we fix n and think of a SRW on [0,1]. We take the k th step at time k n. and our increments are ± 1 Random Walks and Brownian Motion Tel Aviv University Spring 011 Lecture date: May 0, 011 Lecture 9 Instructor: Ron Peled Scribe: Jonathan Hermon In today s lecture we present the Brownian motion (BM).

More information

Sampling Contingency Tables

Sampling Contingency Tables Sampling Contingency Tables Martin Dyer Ravi Kannan John Mount February 3, 995 Introduction Given positive integers and, let be the set of arrays with nonnegative integer entries and row sums respectively

More information

Case study: stochastic simulation via Rademacher bootstrap

Case study: stochastic simulation via Rademacher bootstrap Case study: stochastic simulation via Rademacher bootstrap Maxim Raginsky December 4, 2013 In this lecture, we will look at an application of statistical learning theory to the problem of efficient stochastic

More information

Answers to Problem Set Number MIT (Fall 2005).

Answers to Problem Set Number MIT (Fall 2005). Answers to Problem Set Number 5. 18.305 MIT Fall 2005). D. Margetis and R. Rosales MIT, Math. Dept., Cambridge, MA 02139). November 23, 2005. Course TA: Nikos Savva, MIT, Dept. of Mathematics, Cambridge,

More information

Lecture 8 Inequality Testing and Moment Inequality Models

Lecture 8 Inequality Testing and Moment Inequality Models Lecture 8 Inequality Testing and Moment Inequality Models Inequality Testing In the previous lecture, we discussed how to test the nonlinear hypothesis H 0 : h(θ 0 ) 0 when the sample information comes

More information

Game Theory, On-line prediction and Boosting (Freund, Schapire)

Game Theory, On-line prediction and Boosting (Freund, Schapire) Game heory, On-line prediction and Boosting (Freund, Schapire) Idan Attias /4/208 INRODUCION he purpose of this paper is to bring out the close connection between game theory, on-line prediction and boosting,

More information

CSCI-6971 Lecture Notes: Monte Carlo integration

CSCI-6971 Lecture Notes: Monte Carlo integration CSCI-6971 Lecture otes: Monte Carlo integration Kristopher R. Beevers Department of Computer Science Rensselaer Polytechnic Institute beevek@cs.rpi.edu February 21, 2006 1 Overview Consider the following

More information

Lecture 8: The Field B dr

Lecture 8: The Field B dr Lecture 8: The Field B dr October 29, 2018 Throughout this lecture, we fix a perfectoid field C of characteristic p, with valuation ring O C. Fix an element π C with 0 < π C < 1, and let B denote the completion

More information

Essentials on the Analysis of Randomized Algorithms

Essentials on the Analysis of Randomized Algorithms Essentials on the Analysis of Randomized Algorithms Dimitris Diochnos Feb 0, 2009 Abstract These notes were written with Monte Carlo algorithms primarily in mind. Topics covered are basic (discrete) random

More information

Quantum query complexity of entropy estimation

Quantum query complexity of entropy estimation Quantum query complexity of entropy estimation Xiaodi Wu QuICS, University of Maryland MSR Redmond, July 19th, 2017 J o i n t C e n t e r f o r Quantum Information and Computer Science Outline Motivation

More information

Econ 508B: Lecture 5

Econ 508B: Lecture 5 Econ 508B: Lecture 5 Expectation, MGF and CGF Hongyi Liu Washington University in St. Louis July 31, 2017 Hongyi Liu (Washington University in St. Louis) Math Camp 2017 Stats July 31, 2017 1 / 23 Outline

More information

FORMULATION OF THE LEARNING PROBLEM

FORMULATION OF THE LEARNING PROBLEM FORMULTION OF THE LERNING PROBLEM MIM RGINSKY Now that we have seen an informal statement of the learning problem, as well as acquired some technical tools in the form of concentration inequalities, we

More information

Analysis of Thompson Sampling for the multi-armed bandit problem

Analysis of Thompson Sampling for the multi-armed bandit problem Analysis of Thompson Sampling for the multi-armed bandit problem Shipra Agrawal Microsoft Research India shipra@microsoft.com avin Goyal Microsoft Research India navingo@microsoft.com Abstract We show

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Lectures 2 3 : Wigner s semicircle law

Lectures 2 3 : Wigner s semicircle law Fall 009 MATH 833 Random Matrices B. Való Lectures 3 : Wigner s semicircle law Notes prepared by: M. Koyama As we set up last wee, let M n = [X ij ] n i,j= be a symmetric n n matrix with Random entries

More information

Convergence Rates on Root Finding

Convergence Rates on Root Finding Convergence Rates on Root Finding Com S 477/577 Oct 5, 004 A sequence x i R converges to ξ if for each ǫ > 0, there exists an integer Nǫ) such that x l ξ > ǫ for all l Nǫ). The Cauchy convergence criterion

More information

Lecture 18: March 15

Lecture 18: March 15 CS71 Randomness & Computation Spring 018 Instructor: Alistair Sinclair Lecture 18: March 15 Disclaimer: These notes have not been subjected to the usual scrutiny accorded to formal publications. They may

More information