An Asymptotically Optimal Algorithm for the Max k-armed Bandit Problem


An Asymptotically Optimal Algorithm for the Max k-armed Bandit Problem

Matthew J. Streeter and Stephen F. Smith
February 27, 2006
CMU-CS-06-110
School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213

Abstract

We present an asymptotically optimal algorithm for the max variant of the k-armed bandit problem. Given a set of k slot machines, each yielding payoff from a fixed (but unknown) distribution, we wish to allocate trials to the machines so as to maximize the expected maximum payoff received over a series of n trials. Subject to certain distributional assumptions, we show that O(ln(1/δ) ln(n)²/ɛ²) trials are sufficient to identify, with probability at least 1 − δ, a machine whose expected maximum payoff is within ɛ of optimal. This result leads to a strategy for solving the problem that is asymptotically optimal in the following sense: the gap between the expected maximum payoff obtained by using our strategy for n trials and that obtained by pulling the single best arm for all n trials approaches zero as n → ∞.

Keywords: multi-armed bandit problem, PAC bounds

1 Introduction

In the k-armed bandit problem one is faced with a set of k slot machines, each having an arm that, when pulled, yields a payoff drawn from a fixed (but unknown) distribution. The goal is to allocate trials to the arms so as to maximize the expected cumulative payoff obtained over a series of n trials. Solving the problem entails striking a balance between exploration (determining which arm yields the highest mean payoff) and exploitation (repeatedly pulling this arm).

In the max k-armed bandit problem, the goal is to maximize the expected maximum (rather than cumulative) payoff. This version of the problem arises in practice when tackling combinatorial optimization problems for which a number of randomized search heuristics exist: given k heuristics, each yielding a stochastic outcome when applied to some particular problem instance, we wish to allocate trials to the heuristics so as to maximize the maximum payoff (e.g., the maximum number of clauses satisfied by any sampled variable assignment, or the minimum makespan of any sampled schedule). Cicirello and Smith [3] show that a max k-armed bandit approach yields good performance on the resource-constrained project scheduling problem with maximum time lags (RCPSP/max).

1.1 Summary of Results

We consider a restricted version of the max k-armed bandit problem in which each arm yields payoff drawn from a generalized extreme value (GEV) distribution (defined in §2). This paper presents the first provably asymptotically optimal algorithm for this problem. Roughly speaking, the justification for assuming a GEV distribution is the Extremal Types Theorem (stated in §2), which states that the distribution of the sample maximum of n independent, identically distributed random variables approaches a GEV distribution as n → ∞. A more formal justification is given in §3. For reasons that will become clear, the nature of our results depends on the shape parameter (ξ) of the GEV distribution.
Assuming all arms have ξ ≤ 0, our results can be summarized as follows.

- Let a be an arm that yields payoff drawn from a GEV distribution with unknown parameters; let M_n denote the maximum payoff obtained after pulling a n times; and let m_n = E[M_n]. We provide an algorithm that, after pulling the arm O(ln(1/δ) ln(n)²/ɛ²) times, produces an estimate m̂_n of m_n with the property that P[|m̂_n − m_n| < ɛ] ≥ 1 − δ.

- Let a_1, a_2, ..., a_k be k arms, each yielding payoff from (distinct) GEV distributions with unknown parameters. Let m_n^i denote the expected maximum payoff obtained by pulling the i-th arm n times, and let m_n = max_{1≤i≤k} m_n^i. We provide an algorithm that, when run for n pulls, obtains expected maximum payoff m_n − o(1).

Our results for the case ξ > 0 are similar, except that our estimates and expected maximum payoffs come within arbitrarily small factors (rather than absolute distances) of optimality. Specifically, our estimates have the property that

  P[1/(1 + ɛ) < (m̂_n − α_1)/(m_n − α_1) < 1 + ɛ] ≥ 1 − δ

for a constant α_1 independent of n, while the expected maximum payoff obtained by using our algorithm for n pulls is m_n(1 − o(1)).
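Before formalizing anything, it is worth seeing why the max objective can rank arms differently from the classical cumulative objective. The toy example below uses illustrative numbers of our own (not from the paper): arm A always pays 1, while arm B pays 2 with probability 0.1 and 0 otherwise.

```python
# Arm A always pays 1; arm B pays 2 with probability 0.1 and 0 otherwise.
# Under the cumulative objective, arm A is better (mean 1.0 vs. 0.2).
# Under the max objective, arm B wins once n is moderately large, because
# a single success already yields the maximum payoff 2.
MEAN_A = 1.0
MEAN_B = 0.1 * 2.0

def expected_max_arm_b(n: int) -> float:
    """E[max payoff over n pulls of arm B] = 2 * P[at least one payoff of 2]."""
    return 2.0 * (1.0 - 0.9 ** n)

# For a single pull the two objectives coincide; by n = 50 pulls the expected
# maximum payoff of arm B is roughly 1.99, well above arm A's constant 1.0.
```

Since arm B's expected maximum converges to 2 exponentially fast, the "best single arm" for the max objective depends on the horizon n, which is exactly why the algorithms in this paper estimate the quantity m_n rather than a mean payoff.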

1.2 Related Work

The classical k-armed bandit problem was first studied by Robbins [7] and has since been the subject of numerous papers; see Berry and Fristedt [1] and Kaelbling [6] for overviews. In a paper similar in spirit to ours, Fong [5] showed that an initial exploration phase consisting of O((k/ɛ²) ln(k/δ)) pulls is sufficient to identify, with probability at least 1 − δ, an arm whose mean payoff is within ɛ of optimal. Theorem 2 of this paper proves a bound similar to Fong's on the number of pulls needed to identify an arm whose expected maximum payoff (over a series of n trials) is near-optimal.

The max variant of the k-armed bandit problem was first studied by Cicirello and Smith [2, 3], who successfully used a heuristic for the max k-armed bandit problem to select among priority rules for the RCPSP/max. The design of Cicirello and Smith's heuristic is motivated by an analysis of the special case in which each arm's payoff distribution is a GEV distribution with shape parameter ξ = 0, but they do not rigorously analyze the heuristic's behavior. Our paper is more theoretical and less empirical: on the one hand we do not perform experiments on any practical combinatorial problem, but on the other hand we provide stronger performance guarantees under weaker distributional assumptions.

1.3 Notation

For an arbitrary cumulative distribution function G, let the random variable M_n^G be defined by

  M_n^G = max{Z_1, Z_2, ..., Z_n}

where Z_1, Z_2, ..., Z_n are independent random variables, each having distribution G. Let m_n^G = E[M_n^G].

2 Extreme Value Theory

This section provides a self-contained overview of results in extreme value theory that are relevant to this work. Our presentation is based on the text by Coles [4]. The central result of extreme value theory is an analogue of the central limit theorem that applies to extremely rare events.
Recall that the central limit theorem states that (under certain regularity conditions) the distribution of the sum of n independent, identically distributed (i.i.d.) random variables converges to a normal distribution as n → ∞. The Extremal Types Theorem states that (under certain regularity conditions) the distribution of the maximum of n i.i.d. random variables converges to a generalized extreme value (GEV) distribution.

Definition (GEV distribution). A random variable Z has a generalized extreme value distribution if, for constants µ, σ > 0, and ξ, P[Z ≤ z] = GEV_(µ,σ,ξ)(z), where

  GEV_(µ,σ,ξ)(z) = exp(−(1 + ξ(z − µ)/σ)^(−1/ξ))

for z ∈ {z : 1 + ξ(z − µ)/σ > 0}; outside this range, GEV_(µ,σ,ξ)(z) is 0 (below the support) or 1 (above the support). The case ξ = 0 is interpreted as the limit

  lim_{ξ′→0} GEV_(µ,σ,ξ′)(z) = exp(−exp((µ − z)/σ)).

The following three propositions establish properties of the GEV distribution.

Proposition 1. Let Z be a random variable with P[Z ≤ z] = GEV_(µ,σ,ξ)(z). Then

  E[Z] = µ + (σ/ξ)(Γ(1 − ξ) − 1)   if ξ < 1, ξ ≠ 0
  E[Z] = µ + σγ                    if ξ = 0
  E[Z] = ∞                         if ξ ≥ 1

where

  Γ(z) = ∫_0^∞ t^(z−1) exp(−t) dt

is the complete gamma function and

  γ = lim_{n→∞} ((Σ_{k=1}^n 1/k) − ln(n))

is Euler's constant.

Proposition 2. Let G = GEV_(µ,σ,ξ). Then M_n^G has distribution GEV_(µ′,σ′,ξ′), where

  µ′ = µ + (σ/ξ)(n^ξ − 1) if ξ ≠ 0,  µ′ = µ + σ ln(n) otherwise,

σ′ = σn^ξ, and ξ′ = ξ.

Substituting the parameters of M_n^G given by Proposition 2 into Proposition 1 gives an expression for m_n^G.

Proposition 3. Let G = GEV_(µ,σ,ξ) where ξ < 1. Then

  m_n^G = µ + (σ/ξ)(n^ξ Γ(1 − ξ) − 1)  if ξ ≠ 0
  m_n^G = µ + σγ + σ ln(n)             if ξ = 0.

It follows that for ξ > 0, m_n^G is Θ(n^ξ); for ξ = 0, m_n^G is Θ(ln(n)); and for ξ < 0, m_n^G = µ − σ/ξ − Θ(n^ξ).
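The three propositions translate directly into code. The sketch below is our own illustration (function names are ours): it implements Propositions 1-3 and checks that they are mutually consistent, in that applying Proposition 1 to the transformed parameters of Proposition 2 reproduces the closed form of Proposition 3.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant (the gamma of Proposition 1)

def gev_mean(mu: float, sigma: float, xi: float) -> float:
    """Proposition 1: E[Z] for Z ~ GEV(mu, sigma, xi); infinite when xi >= 1."""
    if xi >= 1.0:
        return math.inf
    if xi == 0.0:
        return mu + sigma * EULER_GAMMA
    return mu + sigma * (math.gamma(1.0 - xi) - 1.0) / xi

def max_of_n_params(mu: float, sigma: float, xi: float, n: int):
    """Proposition 2: the max of n i.i.d. GEV draws is GEV(mu', sigma', xi)."""
    if xi == 0.0:
        mu_prime = mu + sigma * math.log(n)
    else:
        mu_prime = mu + sigma * (n ** xi - 1.0) / xi
    return mu_prime, sigma * n ** xi, xi

def expected_max(mu: float, sigma: float, xi: float, n: int) -> float:
    """Proposition 3: closed form for m_n^G, the expected maximum of n draws."""
    if xi == 0.0:
        return mu + sigma * EULER_GAMMA + sigma * math.log(n)
    return mu + (sigma / xi) * (n ** xi * math.gamma(1.0 - xi) - 1.0)

# Consistency check: Proposition 1 applied to Proposition 2's parameters
# must agree with Proposition 3, for every shape and horizon.
for xi in (-0.3, -0.1, 0.0, 0.1, 0.3):
    for n in (1, 2, 10, 1000):
        assert abs(gev_mean(*max_of_n_params(1.0, 2.0, xi, n))
                   - expected_max(1.0, 2.0, xi, n)) < 1e-9
```

The asymptotics stated after Proposition 3 are visible numerically: with ξ < 0 the values of expected_max approach µ − σ/ξ as n grows, while with ξ = 0 they grow linearly in ln(n).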

Figure 1: The effect of the shape parameter (ξ) on the expected maximum of n independent draws from a GEV distribution.

It is useful to have a visual picture of what Proposition 3 means. Figure 1 plots m_n^G as a function of ln(n) for three GEV distributions with µ = 0, σ = 1, and ξ ∈ {−0.1, 0, 0.1}.

The central result of extreme value theory is the following theorem.

The Extremal Types Theorem. Let G be an arbitrary cumulative distribution function, and suppose there exist sequences of constants {a_n > 0} and {b_n} such that

  lim_{n→∞} P[(M_n^G − b_n)/a_n ≤ z] = G*(z)   (2.1)

for any continuity point z of G*, where G* is not a point mass. Then there exist constants µ, σ > 0, and ξ such that G*(z) = GEV_(µ,σ,ξ)(z) for all z. Furthermore,

  lim_{n→∞} P[M_n^G ≤ z] = GEV_(µa_n+b_n, σa_n, ξ)(z).

Condition (2.1) holds for a variety of distributions, including the normal, lognormal, uniform, and Cauchy distributions.

3 The Max k-armed Bandit Problem

Definition (max k-armed bandit instance). An instance I = (n, G) of the max k-armed bandit problem is an ordered pair whose first element is a positive integer n, and whose second element is a set G = {G_1, G_2, ..., G_k} of k cumulative distribution functions, each thought of as an arm on a slot machine. The i-th arm, when pulled, returns a realization of a random variable with distribution G_i.

Definition (max k-armed bandit strategy). A max k-armed bandit strategy S is an algorithm that, given an instance I = (n, G) of the max k-armed bandit problem, performs a sequence of n

arm-pulls. For any strategy S and integer ℓ ≤ n, we denote by S_ℓ(I) the expected maximum payoff obtained by running S on I for ℓ trials:

  S_ℓ(I) = E[max_{0≤j≤ℓ} p_j]

where p_j is the payoff obtained from the j-th pull, and we define p_0 = 0. Our goal is to come up with a strategy S such that S_n(I) is near-maximal.

Note that the problem is ill-posed (i.e., there is no clear criterion for preferring one strategy over another) unless we make some assumptions about the distributions G_i. We will assume that each arm G_i = GEV_(µ_i, σ_i, ξ_i) is a GEV distribution whose parameters satisfy

1. µ_i ≤ µ_u,
2. 0 < σ_l ≤ σ_i ≤ σ_u, and
3. ξ_l ≤ ξ_i ≤ ξ_u < 1/2

for known constants µ_u, σ_l, σ_u, ξ_l, and ξ_u.

There are two arguments for assuming that each arm is a GEV distribution. First, in practice the distribution of payoffs returned by a strong heuristic may be approximately GEV, even if the conditions of the Extremal Types Theorem are not formally satisfied [2]. A second argument runs as follows. Suppose I = (n, G) is an instance of the max k-armed bandit problem in which each distribution G_i ∈ G satisfies condition (2.1) of the Extremal Types Theorem. Consider the instance Ī = (⌈n/m⌉, Ḡ), where Ḡ = {Ḡ_1, Ḡ_2, ..., Ḡ_k}, and arm Ḡ_i returns the maximum payoff obtained by pulling the corresponding arm G_i m times. Effectively, Ī is a restricted version of I in which the arms must be pulled in batches of size m, rather than in any arbitrary order. For m sufficiently large, the Extremal Types Theorem guarantees that for each i, Ḡ_i ≈ GEV_(µ_i, σ_i, ξ_i) for some constants µ_i, σ_i, and ξ_i. Thus, the instance I′ = (⌈n/m⌉, G′) with G′ = {G′_1, G′_2, ..., G′_k} and G′_i = GEV_(µ_i, σ_i, ξ_i) is approximately equivalent to Ī and satisfies our distributional assumptions.

The purpose of the restrictions on the parameters µ_i, σ_i, and ξ_i is to ensure that each GEV distribution has finite, bounded mean and variance.

4 An Asymptotically Optimal Algorithm

We will analyze the following max k-armed bandit strategy.

Strategy S_1(ɛ, δ):

1. (Exploration) For each arm G_i ∈ G:

   (a) Using t = O(ln(1/δ) ln(n)²/ɛ²) samples of G_i, obtain an estimate m̂_n^{G_i} of m_n^{G_i}. Assuming that arm G_i has shape parameter ξ_i ≤ 0, our estimate will have the property that P[|m̂_n^{G_i} − m_n^{G_i}| < ɛ] ≥ 1 − δ.

2. (Exploitation) Set î = arg max_{1≤i≤k} m̂_n^{G_i}, and pull arm G_î for the remaining n − tk trials.

If an arm G_i has shape parameter ξ_i > 0, the estimate obtained in step 1(a) will instead have the property that P[1/(1 + ɛ) < (m̂_n − α_1)/(m_n − α_1) < 1 + ɛ] ≥ 1 − δ for a constant α_1 independent of n.

The following theorem shows that with appropriate settings of ɛ and δ, strategy S_1 is asymptotically optimal.

Theorem 1. Let I = (n, G) be an instance of the max k-armed bandit problem, where G = {G_1, G_2, ..., G_k} and G_i = GEV_(µ_i, σ_i, ξ_i). Let m_n = max_{1≤i≤k} m_n^{G_i}, ξ_max = max_{1≤i≤k} ξ_i, and

  S = S_1(Δ, 1/(kn²)),  where Δ = (k ln(nk) ln(n)²/n)^{1/3}.

Then if ξ_max ≤ 0,

  lim_{n→∞} (m_n − S_n(I)) = 0,

while if ξ_max > 0,

  lim_{n→∞} S_n(I)/m_n = 1.

Proof.

Case ξ_max ≤ 0. Let m̂_n = m_n^{G_î} (where î is the arm selected for exploitation in step 2). Then m̂_{n−tk} is the expected maximum payoff obtained during the exploitation step, so

  S_n(I) ≥ m̂_{n−tk}.

Claim 1. m̂_n − m̂_{n−tk} is O(tk/n).

Proof of Claim 1. Let µ = µ_î, σ = σ_î, and ξ = ξ_î be the parameters of the arm selected for exploitation. Suppose ξ = 0. Then by Proposition 3,

  m̂_n − m̂_{n−tk} = σ(ln(n) − ln(n − tk)).

Expanding ln(x + β) in powers of β about β = 0, for |β| < x/2 and x > 0, yields

  ln(x + β) − ln(x) = Σ_{i=1}^∞ (−1)^{i+1} (1/i)(β/x)^i ≤ (|β|/x) · 1/(1 − |β|/x) < 2|β|/x,

so for n sufficiently large,

  m̂_n − m̂_{n−tk} ≤ 2σtk/n = O(tk/n).

Now suppose ξ < 0. By Proposition 3,

  m̂_n − m̂_{n−tk} = (σ/ξ)Γ(1 − ξ)(n^ξ − (n − tk)^ξ) = O((n − tk)^ξ − n^ξ),

where we have used the fact that (σ/ξ)Γ(1 − ξ) < 0. Expanding (n − t)^ξ in powers of t about t = 0 gives

  (n − t)^ξ = n^ξ − ξn^{ξ−1}t + O((t/n)² n^ξ),

and so (using n^{ξ−1} ≤ n^{−1} and −ξ ≤ −ξ_l),

  (n − t)^ξ − n^ξ ≈ −ξn^{ξ−1}t ≤ −ξ_l t/n = O(t/n).

This proves Claim 1.

With probability at least 1 − kδ, all estimates obtained during the exploration phase are within ɛ of the correct values, so that |m_n − m̂_n| < 2ɛ. Assuming |m_n − m̂_n| < 2ɛ, it follows that

  m_n − m̂_{n−tk} = (m_n − m̂_n) + (m̂_n − m̂_{n−tk})
                < 2ɛ + O(tk/n)
                = 2ɛ + O((k/n) ln(1/δ) ln(n)²/ɛ²)
                = O(Δ),

where Δ = (k ln(nk) ln(n)²/n)^{1/3} (recall that S_1 is invoked with ɛ = Δ and δ = 1/(kn²)), and on the second line we have used Claim 1. Thus

  S_n(I) ≥ (1 − kδ)(m_n − O(Δ)) = m_n − O(Δ) = m_n − o(1).

Case ξ_max > 0. See Appendix A.

Theorem 1 completes our analysis of the performance of S_1. It remains only to describe how the estimates in step 1(a) are obtained.

4.1 Estimating m_n

We adopt the following notation: let G = GEV_(µ,σ,ξ) denote a GEV distribution with (unknown) parameters µ, σ, and ξ satisfying the conditions stated in §3, and

let m_i = m_i^G.

To estimate m_n, we first obtain an accurate estimate of ξ. Then:

1. if ξ = 0 (so that the growth of m_n as a function of ln(n) is linear), we estimate m_n by first estimating m_1 and m_2, then performing linear interpolation;

2. otherwise, we estimate m_n by first estimating m_1, m_2, and m_4, then performing a nonlinear interpolation.

4.1.1 Estimating m_i for i ∈ {1, 2, 4}

The following two lemmas use well-known ideas to efficiently estimate m_i for small values of i.

Lemma 1. Let i be a positive integer and let ɛ > 0 be a real number. Then O(i/ɛ²) draws from G suffice to obtain an estimate m̂_i of m_i such that P[|m̂_i − m_i| < ɛ] ≥ 3/4.

Proof. First consider the special case i = 1. Let X denote the sum of t draws from G, for some to-be-specified positive integer t. Then E[X] = m_1 t and Var[X] = s²t, where s is the (unknown) standard deviation of G. We take m̂_1 = X/t as our estimate of m_1. Then

  P[|m̂_1 − m_1| ≥ ɛ] = P[|X − E[X]| ≥ tɛ] ≤ Var[X]/(tɛ)² = s²/(tɛ²),

where the last inequality is Chebyshev's. Thus to guarantee P[|m̂_1 − m_1| ≥ ɛ] ≤ 1/4 we may set t = 4s²/ɛ² = O(1/ɛ²) (note that due to the assumptions in §3, s is O(1)).

For i > 1, we let X be the sum of t block maxima (each the maximum of i independent draws from G), which increases the number of samples required by a factor of i.

To boost the probability that |m̂_i − m_i| < ɛ from 3/4 to 1 − δ, we use the median of means method.

Lemma 2. Let i be a positive integer and let ɛ > 0 and δ ∈ (0, 1) be real numbers. Then O(ln(1/δ) i/ɛ²) draws from G suffice to obtain an estimate m̂_i of m_i such that P[|m̂_i − m_i| < ɛ] ≥ 1 − δ.

Proof. We invoke Lemma 1 r times (for r to be determined), yielding a set E = {m̂_i^(1), m̂_i^(2), ..., m̂_i^(r)} of estimates of m_i. Let m̂_i be the median element of E. Let A = {m̂_i^(j) ∈ E : |m̂_i^(j) − m_i| < ɛ} be the set of accurate estimates of m_i, and let A = |A|. Then |m̂_i − m_i| ≥ ɛ implies A ≤ r/2, while E[A] ≥ (3/4)r. Using the standard Chernoff bound, we have

  P[|m̂_i − m_i| ≥ ɛ] ≤ P[A ≤ r/2] ≤ exp(−r/C)

for some constant C > 0. Thus r = O(ln(1/δ)) repetitions suffice to ensure P[|m̂_i − m_i| ≥ ɛ] ≤ δ.
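Lemmas 1 and 2 combine into the standard median-of-means estimator. The sketch below is our own illustration: the constants 4/ɛ² and 8 ln(1/δ) are illustrative stand-ins for the O(·) bounds, and the Chebyshev step assumes the block-maximum variance is O(1), as guaranteed by the parameter restrictions of §3.

```python
import math
import random
import statistics

def estimate_m_i(draw, i, eps, delta, rng):
    """Median-of-means estimate of m_i = E[max of i draws from G].

    Lemma 1: the mean of t = O(1/eps^2) block maxima is eps-accurate with
    probability >= 3/4 (Chebyshev).  Lemma 2: the median of r = O(ln(1/delta))
    such means is eps-accurate with probability >= 1 - delta (Chernoff).
    Total draws: r * t * i = O(ln(1/delta) * i / eps^2).
    """
    t = max(1, math.ceil(4.0 / eps ** 2))                # illustrative Lemma 1 constant
    r = max(1, math.ceil(8.0 * math.log(1.0 / delta)))   # illustrative Lemma 2 constant
    means = []
    for _ in range(r):
        blocks = (max(draw(rng) for _ in range(i)) for _ in range(t))
        means.append(sum(blocks) / t)
    return statistics.median(means)

# Example: for the uniform distribution on [0, 1], m_1 = 1/2 and m_2 = 2/3.
rng = random.Random(0)
uniform = lambda rng: rng.random()
m1_hat = estimate_m_i(uniform, 1, 0.05, 0.05, rng)
m2_hat = estimate_m_i(uniform, 2, 0.05, 0.05, rng)
```

The median step is what turns the constant 3/4 success probability of Lemma 1 into the high-probability guarantee of Lemma 2 at only logarithmic extra cost.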

4.1.2 Estimating m_n when ξ = 0

Lemma 3. Assume G has shape parameter ξ = 0. Let n be a positive integer and let ɛ > 0 and δ ∈ (0, 1) be real numbers. Then O(ln(1/δ) ln(n)²/ɛ²) draws from G suffice to obtain an estimate m̂_n of m_n such that P[|m̂_n − m_n| < ɛ] ≥ 1 − δ.

Proof. By Proposition 3, m_i = µ + σγ + σ ln(i). Thus

  m_n = m_1 + (m_2 − m_1) log_2(n).   (4.1)

Let m̂_1 and m̂_2 be estimates of m_1 and m_2, respectively, and let m̂_n be the estimate of m_n obtained by plugging m̂_1 and m̂_2 into (4.1). Define Δ_i = |m̂_i − m_i| for i ∈ {1, 2, n}. Then

  Δ_n ≤ (1 + log_2(n))(Δ_1 + Δ_2).

Thus to guarantee P[Δ_n < ɛ] ≥ 1 − δ, it suffices that P[Δ_i ≤ ɛ/(2(1 + log_2(n)))] ≥ 1 − δ/2 for all i ∈ {1, 2}. By Lemma 2, this requires O(ln(1/δ) ln(n)²/ɛ²) draws from G.

4.1.3 Estimating m_n when ξ ≠ 0

For the purpose of the proofs presented here, we will make a minor assumption concerning an arm's shape parameter ξ_i: we assume that for some known constant ξ* > 0, |ξ_i| < ξ* implies ξ_i = 0. Removing this assumption does not fundamentally change the results, but it makes the proofs more complicated.

Lemma 5 shows how to efficiently estimate ξ. Lemmas 6 and 7 show how to efficiently estimate m_n in the cases ξ < 0 and ξ > 0, respectively. We will use the following lemma.

Lemma 4. m_2 − m_1 ≥ σ/4 and m_4 − m_2 ≥ σ/8.

Proof. See Appendix A.

Lemma 5. For real numbers ɛ > 0 and δ ∈ (0, 1), O(ln(1/δ)/ɛ²) draws from G suffice to obtain an estimate ξ̂ of ξ such that P[|ξ̂ − ξ| < ɛ] ≥ 1 − δ.

Proof. Using Proposition 3, it is straightforward to check that for any ξ < 1,

  ξ = log_2((m_4 − m_2)/(m_2 − m_1)).   (4.2)

Let m̂_1, m̂_2, and m̂_4 be estimates of m_1, m_2, and m_4, respectively, and let ξ̂ be the estimate of ξ obtained by plugging m̂_1, m̂_2, and m̂_4 into (4.2). Define Δ_m = max_{i∈{1,2,4}} |m̂_i − m_i| and define Δ_ξ = |ξ̂ − ξ|. We wish to upper bound Δ_ξ as a function of Δ_m.

In Claim 1 of Theorem 1 we showed that |ln(x + β) − ln(x)| ≤ 2|β|/x for |β| ≤ x/2. Letting N = m_4 − m_2 and D = m_2 − m_1, and noting that ξ = log_2(N) − log_2(D) = (1/ln 2)(ln(N) − ln(D)), it follows that

  Δ_ξ ≤ (1/ln 2)(2(2Δ_m)/N + 2(2Δ_m)/D)

for Δ_m < (1/2) min(N, D) (each of N and D is perturbed by at most 2Δ_m). Thus by Lemma 4 and the assumption that σ ≥ σ_l, Δ_ξ is O(Δ_m). Define Δ_i = |m̂_i − m_i|, so that Δ_m = max_{i∈{1,2,4}} Δ_i. Then to guarantee P[Δ_ξ < ɛ] ≥ 1 − δ, it suffices that P[Δ_i < Ω(ɛ)] ≥ 1 − δ/3 for all i ∈ {1, 2, 4}. By Lemma 2, this requires O(ln(1/δ)/ɛ²) draws from G.

Lemma 6. Assume G has shape parameter ξ ≤ −ξ*. Let n be a positive integer and let ɛ > 0 and δ ∈ (0, 1) be real numbers. Then O(ln(1/δ)/ɛ²) draws from G suffice to obtain an estimate m̂_n of m_n such that P[|m̂_n − m_n| < ɛ] ≥ 1 − δ.

Proof. By Proposition 3, m_i = µ + (σ/ξ)(i^ξ Γ(1 − ξ) − 1). Define

  α_1 = µ − σξ^{−1}
  α_2 = σξ^{−1} Γ(1 − ξ)
  α_3 = 2^ξ

so that

  m_i = α_1 + α_2 α_3^{log_2(i)}.   (4.3)

Plugging the values i = 1, i = 2, and i = 4 into (4.3) yields a system of three quadratic equations. Solving this system for α_1, α_2, and α_3 yields

  α_1 = (m_1 m_4 − m_2²)(m_1 − 2m_2 + m_4)^{−1}
  α_2 = (m_1² − 2m_1 m_2 + m_2²)(m_1 − 2m_2 + m_4)^{−1}
  α_3 = (m_4 − m_2)(m_2 − m_1)^{−1}.

Let m̂_1, m̂_2, and m̂_4 be estimates of m_1, m_2, and m_4, respectively. Plugging m̂_1, m̂_2, and m̂_4 into the above equations yields estimates, say ᾱ_1, ᾱ_2, and ᾱ_3, of α_1, α_2, and α_3, respectively. Define Δ_m = max_{i∈{1,2,4}} |m̂_i − m_i| and Δ_α = max_{i∈{1,2,3}} |ᾱ_i − α_i|. To complete the proof, we show that |m̂_n − m_n| is O(Δ_m). The argument consists of two parts: in Claims 1 through 3 we show that Δ_α is O(Δ_m), then in Claim 4 we show that |m̂_n − m_n| is O(Δ_α).

Claim 1. Each of the numerators in the expressions for α_1, α_2, and α_3 has absolute value bounded from above, while each of the denominators has absolute value bounded from below. (The bounds are independent of the unknown parameters of G.)

Proof of Claim 1. The numerators will have bounded absolute value as long as m_1, m_2, and m_4 are bounded. Upper bounds on m_1, m_2, and m_4 follow from the restrictions on the parameters µ, σ, and ξ. As for the denominators, by Lemma 4 we have

  |m_1 − 2m_2 + m_4| = (m_2 − m_1)(1 − α_3) ≥ (σ_l/8)(1 − 2^{−ξ*}).

Claim 2. Let N and D ≠ 0 be fixed real numbers, and let β_N and β_D be real numbers with |β_D| < |D|/2. Then |(N + β_N)/(D + β_D) − N/D| is O(|β_N| + |β_D|).

Proof of Claim 2. First, using the Taylor series expansion of N/(D + β_D) about β_D = 0,

  |N/(D + β_D) − N/D| = |(Nβ_D/D²) Σ_{i=0}^∞ (−β_D/D)^i| ≤ |Nβ_D/D²| · 1/(1 − |β_D/D|) ≤ 2|Nβ_D/D²| = O(|β_D|).

Then

  |(N + β_N)/(D + β_D) − N/D| ≤ |N/(D + β_D) − N/D| + |β_N/(D + β_D)| = O(|β_N| + |β_D|).

Claim 3. Δ_α is O(Δ_m).

Proof of Claim 3. We show that |ᾱ_1 − α_1| is O(Δ_m). Similar arguments show that |ᾱ_2 − α_2| and |ᾱ_3 − α_3| are O(Δ_m), which proves the claim. To see that |ᾱ_1 − α_1| is O(Δ_m), let N = m_1 m_4 − m_2² and let D = m_1 − 2m_2 + m_4, so that α_1 = N/D. Define N̄ and D̄ in the natural way so that ᾱ_1 = N̄/D̄. Because m_1, m_2, and m_4 are all O(1) (by Claim 1), it follows that both |N̄ − N| and |D̄ − D| are O(Δ_m). That |ᾱ_1 − α_1| is O(Δ_m) then follows by Claim 2.

Claim 4. |m̂_n − m_n| is O(Δ_α).

Proof of Claim 4. Because ξ_l ≤ ξ ≤ −ξ*, it must be that 0 < 2^{ξ_l} ≤ α_3 ≤ 2^{−ξ*} < 1. So for Δ_α sufficiently small, 0 < ᾱ_3 < 1. Thus

  |m̂_n − m_n| = |(ᾱ_1 + ᾱ_2 ᾱ_3^{log_2(n)}) − (α_1 + α_2 α_3^{log_2(n)})|
             ≤ |ᾱ_1 − α_1| + |ᾱ_2| |ᾱ_3^{log_2(n)} − α_3^{log_2(n)}| + |ᾱ_2 − α_2| α_3^{log_2(n)}
             ≤ |ᾱ_1 − α_1| + O(|ᾱ_3 − α_3|) + |ᾱ_2 − α_2|
             = O(Δ_α),

where on the third line we have used the fact that both α_3 and ᾱ_3 lie in an interval [0, c] with c < 1 (so that |ᾱ_3^L − α_3^L| ≤ Lc^{L−1}|ᾱ_3 − α_3| = O(|ᾱ_3 − α_3|) uniformly in L), and in the last line we have used the fact that ᾱ_2 is O(1).

Putting Claims 3 and 4 together, |m̂_n − m_n| is O(Δ_m). Define Δ_i = |m̂_i − m_i|, so that Δ_m = max_{i∈{1,2,4}} Δ_i. Thus to guarantee P[|m̂_n − m_n| < ɛ] ≥ 1 − δ, it suffices that P[Δ_i < Ω(ɛ)] ≥ 1 − δ/3 for all i ∈ {1, 2, 4}. By Lemma 2, this requires O(ln(1/δ)/ɛ²) draws from G.

Lemma 7. Assume G has shape parameter ξ ≥ ξ*. Let n be a positive integer and let ɛ > 0 and δ ∈ (0, 1) be real numbers. Then O(ln(1/δ) ln(n)²/ɛ²) draws from G suffice to obtain an estimate m̂_n of m_n such that

  P[1/(1 + ɛ) < (m̂_n − α_1)/(m_n − α_1) < 1 + ɛ] ≥ 1 − δ

where α_1 = µ − σ/ξ.

Proof. See Appendix A.

Putting the results of Lemmas 3, 5, 6, and 7 together, we obtain the following theorem.

Theorem 2. Let n be a positive integer and let ɛ > 0 and δ ∈ (0, 1) be real numbers. Then O(ln(1/δ) ln(n)²/ɛ²) draws from G suffice to obtain an estimate m̂_n of m_n such that with probability at least 1 − δ, one of the following holds:

  ξ ≤ 0 and |m̂_n − m_n| < ɛ, or
  ξ > 0 and 1/(1 + ɛ) < (m̂_n − α_1)/(m_n − α_1) < 1 + ɛ, where α_1 = µ − σ/ξ.

Proof. First, invoke Lemma 5 with parameters ξ* and δ/2. Then invoke one of Lemmas 3, 6, or 7 (depending on the estimate ξ̂ obtained from Lemma 5) with parameters ɛ and δ/2.

Theorem 2 shows that step 1(a) of strategy S_1 can be performed as described.

5 Conclusions

The max k-armed bandit problem is a variant of the classical k-armed bandit problem with practical applications to combinatorial optimization. Motivated by extreme value theory, we studied a restricted version of this problem in which each arm yields payoff drawn from a GEV distribution. We derived PAC bounds on the sample complexity of estimating m_n, the expected maximum of n draws from a GEV distribution. Using these bounds, we showed that a simple algorithm for the max k-armed bandit problem is asymptotically optimal. Ours is the first algorithm for this problem with rigorous asymptotic performance guarantees.

References

[1] Donald A. Berry and Bert Fristedt. Bandit Problems: Sequential Allocation of Experiments. Chapman and Hall, London, 1986.

[2] Vincent A. Cicirello and Stephen F. Smith. Heuristic selection for stochastic search optimization: Modeling solution quality by extreme value theory. In Proceedings of the 10th International Conference on Principles and Practice of Constraint Programming, pages 197-211, 2004.

[3] Vincent A. Cicirello and Stephen F. Smith. The max k-armed bandit: A new model of exploration applied to search heuristic selection. In Proceedings of AAAI 2005, pages 1355-1361, 2005.

[4] Stuart Coles. An Introduction to Statistical Modeling of Extreme Values. Springer-Verlag, London, 2001.

[5] Philip W. L. Fong. A quantitative study of hypothesis selection. In International Conference on Machine Learning, pages 226-234, 1995.

[6] Leslie P. Kaelbling. Learning in Embedded Systems. The MIT Press, Cambridge, MA, 1993.

[7] Herbert Robbins. Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58:527-535, 1952.

Appendix A

Theorem 1. Let I = (n, G) be an instance of the max k-armed bandit problem, where G = {G_1, G_2, ..., G_k} and G_i = GEV_(µ_i, σ_i, ξ_i). Let m_n = max_{1≤i≤k} m_n^{G_i}, ξ_max = max_{1≤i≤k} ξ_i, and

  S = S_1(Δ, 1/(kn²)),  where Δ = (k ln(nk) ln(n)²/n)^{1/3}.

Then if ξ_max ≤ 0,

  lim_{n→∞} (m_n − S_n(I)) = 0,

while if ξ_max > 0,

  lim_{n→∞} S_n(I)/m_n = 1.

Proof. For ξ_max ≤ 0, the theorem was proved in the main text. It remains only to address the case ξ_max > 0.

Case ξ_max > 0. We will prove the stronger claim that

  (S_n(I) − α_1)/(m_n − α_1) = 1 − O(Δ)   (5.1)

where Δ = (k ln(nk) ln(n)²/n)^{1/3} and α_1 = min_{1≤i≤k} α_1^i, with α_1^i = µ_i − σ_i/ξ_i.

For the moment, let us assume that all arms have shape parameter ξ_i > 0. Let A be the event (which occurs with probability at least 1 − kδ) that all estimates obtained in step 1(a) satisfy the inequality in Theorem 2; here ɛ = Δ and δ = 1/(kn²) are the parameters with which S_1 is invoked. As in the main text, let m̂_n = m_n^{G_î}, and let m̃_n^i denote the estimate of m_n^i obtained in step 1(a).

Claim 1. To prove (5.1), it suffices to show that A implies

  (m_n − α_1)/(m̂_{n−tk} − α_1) = 1 + O(Δ).

Proof of Claim 1. Because S_n(I) ≥ m̂_{n−tk} and the event A occurs with probability at least 1 − kδ, it suffices to show that A implies

  (1 − δk)(m̂_{n−tk} − α_1)/(m_n − α_1) = 1 − O(Δ).

Because δk(m̂_{n−tk})/(m_n − α_1) is O(1/n²) = o(Δ), it suffices to show that A implies

  (m̂_{n−tk} − α_1)/(m_n − α_1) = 1 − O(Δ).

This can be rewritten as m_n − α_1 = (m̂_{n−tk} − α_1)(1 + O(Δ)) (we can replace 1/(1 − O(Δ)) with 1 + O(Δ) because for r < 1/2, 1/(1 − r) = 1 + r/(1 − r) < 1 + 2r).

Claim 2. (m̂_n − α̂_1)/(m̂_{n−tk} − α̂_1) = 1 + O(Δ), where α̂_1 = α_1^î.

Proof of Claim 2. Using Proposition 3,

  ln((m̂_n − α̂_1)/(m̂_{n−tk} − α̂_1)) = ln(n^ξ/(n − tk)^ξ) = ξ(ln(n) − ln(n − tk)) = O(tk/n) = O(Δ).

The claim follows from the fact that exp(β) < 1 + 3β for 0 < β < 1, so that exp(O(Δ)) = 1 + O(Δ).

Claim 3. A implies that for all i, (m̃_n^i − α_1)/(m_n^i − α_1) < 1 + ɛ, and likewise with m̃_n^i and m_n^i exchanged.

Proof of Claim 3. By definition, α_1^i = α_1 + β for some β ≥ 0. The claim follows from the fact that for positive N and D and β ≥ 0, N/D < 1 + ɛ implies (N + β)/(D + β) < 1 + ɛ.

Claim 4. A implies

  (m_n − α_1)/(m̂_{n−tk} − α_1) = 1 + O(Δ).

Proof of Claim 4. Let i* be the index of the best arm, so that m_n = m_n^{i*}. Then

  (m_n − α_1)/(m̂_{n−tk} − α_1)
    = [(m_n − α_1)/(m̃_n^{i*} − α_1)] · [(m̃_n^{i*} − α_1)/(m̃_n^î − α_1)] · [(m̃_n^î − α_1)/(m_n^î − α_1)] · [(m_n^î − α_1)/(m̂_{n−tk} − α_1)]
    ≤ (1 + ɛ) · 1 · (1 + ɛ) · (1 + O(Δ))
    = 1 + O(Δ),

where the second factor is at most 1 because î maximizes the estimates, and in the remaining steps we have used Claims 2 and 3 (recall ɛ = Δ).

Putting Claims 1 and 4 together completes the proof. To remove the assumption that all arms have ξ_i > 0, we need to show that A implies that for n sufficiently large, the arms î and i* (the only arms that play a role in the proof) have shape parameters > 0. This follows from the fact that if ξ_i ≤ 0, m_n^i is O(ln(n)), while if ξ_i ≥ ξ* > 0, m_n^i is Ω(n^{ξ_i}).

Lemma 4. m_2 − m_1 ≥ σ/4 and m_4 − m_2 ≥ σ/8.

Proof. If ξ = 0, then by Proposition 3, m_4 − m_2 = m_2 − m_1 = ln(2)σ and we are done. Otherwise,

  m_2 − m_1 = σ(2^ξ − 1)ξ^{−1}Γ(1 − ξ)  and  m_4 − m_2 = σ(4^ξ − 2^ξ)ξ^{−1}Γ(1 − ξ).

It thus suffices to prove the following claim.

Claim 1.

  min_{ξ<1/2} {(2^ξ − 1)ξ^{−1}Γ(1 − ξ)} ≥ 1/4,  and
  min_{ξ<1/2} {(4^ξ − 2^ξ)ξ^{−1}Γ(1 − ξ)} ≥ 1/8.

Proof of Claim 1. We state without proof the following properties of the Γ function: Γ(z + 1) = zΓ(z) for z > 0, and Γ(z) ≥ 1/2 for all z > 0. Making the change of variable y = −ξ, it suffices to show

  min_{y>−1/2} {(1 − 2^{−y})y^{−1}Γ(1 + y)} ≥ 1/4,   (5.2)
  min_{y>−1/2} {2^{−y}(1 − 2^{−y})y^{−1}Γ(1 + y)} ≥ 1/8.   (5.3)

(5.2) holds because for −1/2 < y ≤ 1,

  (1 − 2^{−y})y^{−1}Γ(1 + y) ≥ (1/2)Γ(1 + y) ≥ 1/4,

while for y > 1,

  (1 − 2^{−y})y^{−1}Γ(1 + y) = (1 − 2^{−y})Γ(y) ≥ (1/2)(1/2) = 1/4.

Similarly, (5.3) holds because for −1/2 < y ≤ 1,

  2^{−y}(1 − 2^{−y})y^{−1}Γ(1 + y) ≥ (1/2)(1/2)Γ(1 + y) ≥ 1/8,

while for y > 1, the expression equals 2^{−y}(1 − 2^{−y})Γ(y), and one checks that this is at least 1/8 for all y > 1 (its minimum, attained near y ≈ 2, is approximately 0.19).

Lemma 7. Assume G has shape parameter ξ ≥ ξ*. Let n be a positive integer and let ɛ > 0 and δ ∈ (0, 1) be real numbers. Then O(ln(1/δ) ln(n)²/ɛ²) draws from G suffice to obtain an estimate m̂_n of m_n such that

  P[1/(1 + ɛ) < (m̂_n − α_1)/(m_n − α_1) < 1 + ɛ] ≥ 1 − δ

where α_1 = µ − σ/ξ.

Proof. We use the same estimation procedure as in the proof of Lemma 6. Let α_1, α_2, α_3, Δ_α, and Δ_m be defined as they were in that proof. The inequality 1/(1 + ɛ) < (m̂_n − α_1)/(m_n − α_1) < 1 + ɛ is the same as |ln(m̂_n − α_1) − ln(m_n − α_1)| < ln(1 + ɛ). For ɛ sufficiently small, ln(1 + ɛ) ≥ (7/8)ɛ, so it suffices to guarantee that

  |ln(m̂_n − α_1) − ln(m_n − α_1)| < (7/8)ɛ.

Claim 1. |ln(m̂_n − α_1) − ln(m_n − α_1)| is O(ln(n)Δ_α).

Proof of Claim 1. Because |ln(m̂_n − α_1) − ln(m̂_n − ᾱ_1)| is O(Δ_α), it suffices to show that |ln(m̂_n − ᾱ_1) − ln(m_n − α_1)| is O(ln(n)Δ_α). This is true because

  ln(m̂_n − ᾱ_1) = ln(ᾱ_2 ᾱ_3^{log_2(n)})
              = log_2(n) ln(ᾱ_3) + ln(ᾱ_2)
              = log_2(n) ln(α_3) + ln(α_2) ± O(ln(n)Δ_α)
              = ln(α_2 α_3^{log_2(n)}) ± O(ln(n)Δ_α)
              = ln(m_n − α_1) ± O(ln(n)Δ_α).

Setting Δ_α < Ω(ln(n)^{−1}ɛ) then guarantees |ln(m̂_n − α_1) − ln(m_n − α_1)| < (7/8)ɛ. By Claim 3 of the proof of Lemma 6 (which did not depend on the assumption ξ < 0), Δ_α is O(Δ_m), so we require P[Δ_m < Ω(ln(n)^{−1}ɛ)] ≥ 1 − δ. Define Δ_i = |m̂_i − m_i|, so that Δ_m = max_{i∈{1,2,4}} Δ_i. It suffices that P[Δ_i < Ω(ln(n)^{−1}ɛ)] ≥ 1 − δ/3 for i ∈ {1, 2, 4}. By Lemma 2, ensuring this requires O(ln(1/δ) ln(n)²/ɛ²) draws from G.
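Putting §4 together in code: the sketch below is our own simplified illustration of the overall procedure. It estimates m_1, m_2, m_4 by plain averaging of block maxima (omitting the median-of-means boost of Lemma 2), estimates ξ via equation (4.2), and then interpolates linearly (Lemma 3) or nonlinearly (Lemmas 6 and 7); xi_star stands in for the known constant ξ*, and the guard against degenerate estimates is ours, not part of the paper's analysis.

```python
import math
import random
import statistics

def interpolate_m_n(m1, m2, m4, n, xi_star=0.05):
    """Estimate m_n from (estimates of) m_1, m_2, m_4, as in Theorem 2."""
    d1, d2 = m2 - m1, m4 - m2
    if d1 <= 0.0 or d2 <= 0.0:
        return m1                       # degenerate/noisy estimates: fall back (our guard)
    xi_hat = math.log2(d2 / d1)         # equation (4.2)
    if abs(xi_hat) < xi_star:           # treat xi as 0: linear in log_2(n) (Lemma 3)
        return m1 + d1 * math.log2(n)
    a3 = d2 / d1                        # a3 = 2**xi
    a2 = d1 * d1 / (m1 - 2.0 * m2 + m4)
    a1 = m1 - a2                        # a1 = mu - sigma/xi
    return a1 + a2 * a3 ** math.log2(n)  # equation (4.3) with i = n

def sample_m(draw, i, samples, rng):
    """Plain average of block maxima (Lemma 1 without the Lemma 2 boost)."""
    return statistics.fmean(max(draw(rng) for _ in range(i)) for _ in range(samples))

def strategy_s1(arms, n, samples, rng):
    """Strategy S_1: explore every arm, then exploit the best-looking arm."""
    estimates = []
    for draw in arms:
        m1, m2, m4 = (sample_m(draw, i, samples, rng) for i in (1, 2, 4))
        estimates.append(interpolate_m_n(m1, m2, m4, n))
    best = max(range(len(arms)), key=estimates.__getitem__)
    remaining = max(1, n - 7 * samples * len(arms))  # 1 + 2 + 4 draws per round
    return best, max(arms[best](rng) for _ in range(remaining))
```

Fed exact inputs, the interpolation is exact: with m_i = γ + ln(i) (a GEV with µ = 0, σ = 1, ξ = 0), interpolate_m_n returns γ + ln(n), and with m_i computed from Proposition 3 for a negative shape parameter it reproduces Proposition 3's m_n to floating-point accuracy, mirroring the identities used in the proofs of Lemmas 3 and 6.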