The non-stochastic multi-armed bandit problem


Submitted for journal publication.

The non-stochastic multi-armed bandit problem

Peter Auer, Institute for Theoretical Computer Science, Graz University of Technology, A-8010 Graz (Austria)
Nicolò Cesa-Bianchi, Department of Computer Science, Università di Milano, Milano (Italy)
Yoav Freund, Banter, Inc., 214 Willow Ave., Apt. #5A, Hoboken, NJ
Robert E. Schapire, AT&T Labs, 180 Park Avenue, Florham Park, NJ

November 20, 2001

Abstract

In the multi-armed bandit problem, a gambler must decide which arm of K non-identical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Past solutions for the bandit problem have almost always relied on assumptions about the statistics of the slot machines. In this work, we make no statistical assumptions whatsoever about the nature of the process generating the payoffs of the slot machines. We give a solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs. In a sequence of T plays, we prove that the per-round payoff of our algorithm approaches that of the best arm at the rate O(T^{-1/2}). We show by a matching lower bound that this is best possible. We also prove that our algorithm approaches the per-round payoff of any set of strategies at a similar rate: if the best strategy is chosen from a pool of N strategies, then our algorithm

approaches the per-round payoff of this strategy at the rate O((log N)^{1/2} T^{-1/2}). Finally, we apply our results to the problem of playing an unknown repeated matrix game. We show that our algorithm approaches the minimax payoff of the unknown game at the rate O(T^{-1/2}).

Keywords: adversarial bandit problem, unknown matrix games
AMS subject classification: 68Q32, 68T05, 91A20

1 Introduction

In the multi-armed bandit problem, originally proposed by Robbins [19], a gambler must choose which of K slot machines to play. At each time step, he pulls the arm of one of the machines and receives a reward or payoff (possibly zero or negative). The gambler's purpose is to maximize his return, i.e., the sum of the rewards he receives over a sequence of pulls. In this model, each arm is assumed to deliver rewards that are independently drawn from a fixed and unknown distribution. As reward distributions differ from arm to arm, the goal is to find the arm with the highest expected payoff as early as possible, and then to keep gambling using that best arm.

The problem is a paradigmatic example of the trade-off between exploration and exploitation. On the one hand, if the gambler plays exclusively on the machine that he thinks is best ("exploitation"), he may fail to discover that one of the other arms actually has a higher expected payoff. On the other hand, if he spends too much time trying out all the machines and gathering statistics ("exploration"), he may fail to play the best arm often enough to get a high return.

The gambler's performance is typically measured in terms of regret. This is the difference between the expected return of the optimal strategy (pulling consistently the best arm) and the gambler's expected return. Lai and Robbins [14] proved that the gambler's regret over T pulls can be made, for T -> infinity, as small as O(ln T). Furthermore, they prove that this bound is optimal in the following sense: there does not exist a strategy for the gambler with a better asymptotic performance.

Though this formulation of the bandit problem allows an elegant statistical treatment of the exploration-exploitation trade-off, it may not be adequate to model certain environments. As a motivating example, consider the task of repeatedly choosing a route for transmitting packets between two points in a communication network. To cast this scenario within the bandit problem, suppose there is only a fixed number of possible routes and the transmission cost is reported back to the sender. Now, it is likely that the costs associated with each route cannot be modeled by a stationary distribution, so a more sophisticated set of statistical assumptions would be required. In general, it may be difficult or impossible to determine the right statistical assumptions for a given domain, and some domains may exhibit dependencies to an extent that no such assumptions are appropriate.

To provide a framework where one could model scenarios like the one sketched above, we present the adversarial bandit problem, a variant of the bandit problem in which no statistical assumptions are made about the generation of rewards. We only assume that each slot machine is initially assigned an arbitrary and unknown sequence of rewards, one for each time step, chosen from a bounded real interval. Each time the gambler pulls the arm of a slot machine he receives the corresponding reward from the sequence assigned to that slot machine. To measure the gambler's performance in this setting we replace the notion of (statistical) regret with that of worst-case

regret. Given any sequence (j_1, ..., j_T) of pulls, where T > 0 is an arbitrary time horizon and each j_t is the index of an arm, the worst-case regret of a gambler for this sequence of pulls is the difference between the return the gambler would have had by pulling arms j_1, ..., j_T and the actual gambler's return, where both returns are determined by the initial assignment of rewards. It is easy to see that, in this model, the gambler cannot keep his regret small (say, sublinear in T) for all sequences of pulls and with respect to the worst-case assignment of rewards to the arms. Thus, to make the problem feasible, we allow the regret to depend on the "hardness" of the sequence of pulls for which it is measured, where the hardness of a sequence is roughly the number of times one has to change the slot machine currently being played in order to pull the arms in the order given by the sequence. This trick allows us to effectively control the worst-case regret simultaneously for all sequences of pulls, even though (as one should expect) our regret bounds become trivial when the hardness of the sequence (j_1, ..., j_T) we compete against gets too close to T.

As a remark, note that a deterministic bandit problem was also considered by Gittins [9] and Ishikida and Varaiya [13]. However, their version of the bandit problem is very different from ours: they assume that the player can compute ahead of time exactly what payoffs will be received from each arm, and their problem is thus one of optimization, rather than exploration and exploitation.

Our most general result is a very efficient, randomized player algorithm whose expected regret for any sequence of pulls is O(S sqrt(KT ln(KT))),¹ where S is the hardness of the sequence (see Theorem 8.1 and Corollaries 8.2, 8.4). Note that this bound holds simultaneously for all sequences of pulls, for any assignment of rewards to the arms, and uniformly over the time horizon T. If the gambler is willing to impose an upper bound S on the hardness of the sequences of pulls for which he wants to measure his regret, an improved bound O(sqrt(S K T ln(KT))) on the expected regret for these sequences can be proven (see Corollaries 8.3 and 8.5).

With the purpose of establishing connections with certain results in game theory, we also look at a special case of the worst-case regret, which we call weak regret. Given a time horizon T, call "best arm" the arm that has the highest return (sum of assigned rewards) up to time T with respect to the initial assignment of rewards. The gambler's weak regret is the difference between the return of this best arm and the actual gambler's return. In the paper we introduce a randomized player algorithm, tailored to this notion of regret, whose expected weak regret is O(sqrt(K G_max ln K)), where G_max is the return of the best arm (see Theorem 4.1 in Section 4). As before, this bound holds for any assignment of rewards to the arms and uniformly over the choice of the time horizon T. Using a more complex player algorithm, we also prove that the weak regret is O(sqrt(KT ln(KT/δ))) with probability at least 1 - δ over the algorithm's randomization, for any fixed δ > 0 (see Theorems 6.3 and 6.4 in Section 6). This also implies that, asymptotically for T -> infinity and K constant, the weak regret is O(sqrt(T) (ln T)^{1+ε}) with probability 1 for any fixed ε > 0 (see Corollary 6.5).

Our worst-case bounds may appear weaker than the bounds proved using statistical assumptions, such as those shown by Lai and Robbins [14] of the form O(ln T).
However, when comparing our results to those in the statistics literature, it is important to point out an important difference in the asymptotic quantification. In the work of Lai and Robbins the assumption is that the distribution of rewards that is associated with each arm is fixed as the total number of iterations T increases to infinity. In contrast, our bounds hold for any finite T, and, by the generality of our

¹ Though in this introduction we use the compact asymptotic notation, our bounds are proven for each finite T and almost always with explicit constants.

model, these bounds are applicable when the payoffs are randomly (or adversarially) chosen in a manner that does depend on T. It is this quantification order, and not the adversarial nature of our framework, which is the cause for the apparent gap. We prove this point in Theorem 5.1 where we show that, for any player algorithm for the K-armed bandit problem and for any T, there exists a set of K reward distributions such that the expected weak regret of the algorithm when playing on these arms for T time steps is Ω(sqrt(KT)).

So far we have considered notions of regret that compare the return of the gambler to the return of a sequence of pulls or to the return of the best arm. A further notion of regret which we explore is the regret for the best strategy in a given set of strategies that are available to the gambler. The notion of "strategy" generalizes that of "sequence of pulls": at each time step a strategy gives a recommendation, in the form of a probability distribution over the K arms, as to which arm to play next. Given an assignment of rewards to the arms and a set of N strategies for the gambler, call "best strategy" the strategy that yields the highest return with respect to this assignment. Then the regret for the best strategy is the difference between the return of this best strategy and the actual gambler's return. Using a randomized player that combines the choices of the N strategies (in the same vein as the algorithms for prediction with expert advice from [3]), we show that the expected regret for the best strategy is O(sqrt(KT ln N)); see Theorem 7.1. Note that the dependence on the number of strategies is only logarithmic, and therefore the bound is quite reasonable even when the player is combining a very large number of strategies.

The adversarial bandit problem is closely related to the problem of learning to play an unknown N-person finite game, where the same game is played repeatedly by N players. A desirable property for a player is Hannan-consistency, which is similar to saying (in our bandit framework) that the weak regret per time step of the player converges to 0 with probability 1. Examples of Hannan-consistent player strategies have been provided by several authors in the past (see [18] for a survey of these results). By applying (slight extensions of) Theorems 6.3 and 6.4, we can provide an example of a simple Hannan-consistent player whose convergence rate is optimal up to logarithmic factors.

Our player algorithms are based in part on an algorithm presented by Freund and Schapire [6, 7], which in turn is a variant of Littlestone and Warmuth's [15] weighted majority algorithm, and Vovk's [20] aggregating strategies. In the setting analyzed by Freund and Schapire the player scores on each pull the reward of the chosen arm, but gains access to the rewards associated with all of the arms (not just the one that was chosen).

2 Notation and terminology

An adversarial bandit problem is specified by the number K of possible actions, where each action is denoted by an integer 1 <= i <= K, and by an assignment of rewards, i.e., an infinite sequence x(1), x(2), ... of vectors x(t) = (x_1(t), ..., x_K(t)), where x_i(t) ∈ [0, 1] denotes the reward obtained if action i is chosen at time step (also called "trial") t. (Even though throughout the paper we will assume that all rewards belong to the [0, 1] interval, the generalization of our results to rewards in [a, b] for arbitrary a < b is straightforward.) We assume that the player knows the number K of actions.
Furthermore, after each trial t, we assume the player only knows the rewards x_{i_1}(1), ..., x_{i_t}(t) of the previously chosen actions i_1, ..., i_t. In this respect, we can view the player

algorithm as a sequence I_1, I_2, ..., where each I_t is a mapping from the set ({1, ..., K} × [0, 1])^{t-1} of action indices and previous rewards to the set of action indices. For any reward assignment and for any T > 0, let

  G_A(T) ≝ Σ_{t=1}^T x_{i_t}(t)

be the return at time horizon T of algorithm A choosing actions i_1, i_2, .... In what follows, we will write G_A instead of G_A(T) whenever the value of T is clear from the context.

Our measure of performance for a player algorithm is the worst-case regret, and in this paper we explore variants of the notion of regret. Given any time horizon T > 0 and any sequence of actions (j_1, ..., j_T), the (worst-case) regret of algorithm A for (j_1, ..., j_T) is the difference

  G_{(j_1,...,j_T)} - G_A(T)   (1)

where

  G_{(j_1,...,j_T)} ≝ Σ_{t=1}^T x_{j_t}(t)

is the return, at time horizon T, obtained by choosing actions j_1, ..., j_T. Hence, the regret (1) measures how much the player lost (or gained, depending on the sign of the difference) by following strategy A instead of choosing actions j_1, ..., j_T. A special case of this is the regret of A for the best single action (which we will call weak regret for short), defined by

  G_max(T) - G_A(T)

where

  G_max(T) ≝ max_j Σ_{t=1}^T x_j(t)

is the return of the single globally best action at time horizon T. As before, we will write G_max instead of G_max(T) whenever the value of T is clear from the context.

As our player algorithms will be randomized, fixing a player algorithm defines a probability distribution over the set of all sequences of actions. All the probabilities P{·} and expectations E[·] considered in this paper will be taken with respect to this distribution.

In what follows, we will prove two kinds of bounds on the performance of a (randomized) player A. The first is a bound on the expected regret

  G_{(j_1,...,j_T)} - E[G_A(T)]

of A for an arbitrary sequence (j_1, ..., j_T) of actions. The second is a confidence bound on the weak regret. This has the form

  P{ G_max(T) > G_A(T) + ε } <= δ

and states that, with high probability, the return of A up to time T is not much smaller than that of the globally best action.

Finally, we remark that all of our bounds hold for any sequence x(1), x(2), ... of reward assignments, and most of them hold uniformly over the time horizon T (i.e., they hold for all T without requiring T as input parameter).

Algorithm Exp3
Parameters: Real γ ∈ (0, 1].
Initialization: w_i(1) = 1 for i = 1, ..., K.

For each t = 1, 2, ...
  1. Set
       p_i(t) = (1 - γ) w_i(t) / Σ_{j=1}^K w_j(t) + γ/K,   i = 1, ..., K.
  2. Draw i_t randomly according to the probabilities p_1(t), ..., p_K(t).
  3. Receive reward x_{i_t}(t) ∈ [0, 1].
  4. For j = 1, ..., K set
       x̂_j(t) = x_j(t)/p_j(t) if j = i_t, and x̂_j(t) = 0 otherwise;
       w_j(t+1) = w_j(t) exp(γ x̂_j(t)/K).

Figure 1: Pseudo-code of algorithm Exp3 for the weak regret.

3 Upper bounds on the weak regret

In this section we present and analyze our simplest player algorithm, Exp3 (which stands for "Exponential-weight algorithm for Exploration and Exploitation"). We will show a bound on the expected regret of Exp3 with respect to the single best action. In the next sections, we will greatly strengthen this result.

The algorithm Exp3, described in Figure 1, is a variant of the algorithm Hedge introduced by Freund and Schapire [6] for solving a different worst-case sequential allocation problem. On each time step t, Exp3 draws an action i_t according to the distribution p_1(t), ..., p_K(t). This distribution is a mixture of the uniform distribution and a distribution which assigns to each action a probability mass exponential in the estimated cumulative reward for that action. Intuitively, mixing in the uniform distribution is done to make sure that the algorithm tries out all K actions and gets good estimates of the rewards for each. Otherwise, the algorithm might miss a good action because the initial rewards it observes for this action are low and large rewards that occur later are not observed because the action is not selected.

For the drawn action i_t, Exp3 sets the estimated reward x̂_{i_t}(t) to x_{i_t}(t)/p_{i_t}(t). Dividing the actual gain by the probability that the action was chosen compensates the reward of actions that are unlikely to be chosen. This choice of estimated rewards guarantees that their expectations are equal to the actual rewards for each action; that is, E[x̂_j(t) | i_1, ..., i_{t-1}] = x_j(t), where the expectation is taken with respect to the random choice of i_t at trial t given the choices i_1, ..., i_{t-1} in the previous t - 1 trials.
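As a concrete illustration of the update just described, here is a minimal Python sketch of Exp3 (Figure 1). The function and argument names (exp3, reward_fn, gamma) are ours, not the paper's, and reward_fn is an assumed oracle returning the reward x_i(t) ∈ [0, 1] of the chosen action.

    import math
    import random

    def exp3(K, T, gamma, reward_fn):
        """Minimal sketch of Exp3 (Figure 1). reward_fn(i, t) returns x_i(t) in [0, 1]."""
        w = [1.0] * K                                                    # w_i(1) = 1
        total_reward = 0.0
        for t in range(1, T + 1):
            W = sum(w)
            p = [(1 - gamma) * w[i] / W + gamma / K for i in range(K)]   # step 1
            i_t = random.choices(range(K), weights=p)[0]                 # step 2
            x = reward_fn(i_t, t)                                        # step 3
            total_reward += x
            x_hat = x / p[i_t]                                           # step 4 (x_hat is 0 for j != i_t)
            w[i_t] *= math.exp(gamma * x_hat / K)
        return total_reward

If an upper bound g on G_max is available (for instance g = T when the horizon is known), γ would be tuned as in Corollary 3.2 below, γ = min{1, sqrt(K ln K / ((e-1) g))}.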

We now give the first main theorem of this paper, which bounds the expected weak regret of algorithm Exp3.

Theorem 3.1  For any K > 0 and for any γ ∈ (0, 1],

  G_max - E[G_{Exp3}] <= (e - 1) γ G_max + (K ln K)/γ

holds for any assignment of rewards and for any T > 0.

To understand this theorem, it is helpful to consider a simpler bound which can be obtained by an appropriate choice of the parameter γ.

Corollary 3.2  For any T > 0, assume that g >= G_max and that algorithm Exp3 is run with input parameter

  γ = min{ 1, sqrt( K ln K / ((e - 1) g) ) }.

Then

  G_max - E[G_{Exp3}] <= 2 sqrt(e - 1) sqrt(g K ln K) <= 2.63 sqrt(g K ln K)

holds for any assignment of rewards.

Proof. If g <= (K ln K)/(e - 1), then the bound is trivial since the expected regret cannot be more than g. Otherwise, by Theorem 3.1, the expected regret is at most

  (e - 1) γ G_max + (K ln K)/γ <= (e - 1) γ g + (K ln K)/γ = 2 sqrt(e - 1) sqrt(g K ln K)

as desired. □

To apply Corollary 3.2, it is necessary that an upper bound g on G_max(T) be available for tuning γ. For example, if the time horizon T is known then, since no action can have payoff greater than 1 on any trial, we can use g = T as an upper bound. In Section 4, we give a technique that does not require prior knowledge of such an upper bound, yielding a result which holds uniformly over T.

If the rewards x_i(t) are in the range [a, b], a < b, then Exp3 can be used after the rewards have been translated and rescaled to the range [0, 1]. Applying Corollary 3.2 with g = T gives the bound (b - a) 2 sqrt(e - 1) sqrt(T K ln K) on the regret. For instance, this is applicable to a standard loss model where the rewards fall in the range [-1, 0].

Proof of Theorem 3.1. Here (and also throughout the paper without explicit mention) we use the following simple facts, which are immediately derived from the definitions:

  x̂_i(t) <= 1/p_i(t) <= K/γ,   (2)

  Σ_{i=1}^K p_i(t) x̂_i(t) = p_{i_t}(t) · x_{i_t}(t)/p_{i_t}(t) = x_{i_t}(t),   (3)

  Σ_{i=1}^K p_i(t) x̂_i(t)² = p_{i_t}(t) · (x_{i_t}(t)/p_{i_t}(t)) · x̂_{i_t}(t) = x_{i_t}(t) x̂_{i_t}(t) <= x̂_{i_t}(t) <= Σ_{i=1}^K x̂_i(t).   (4)

Let W_t = w_1(t) + ... + w_K(t). For all sequences i_1, ..., i_T of actions drawn by Exp3,

  W_{t+1}/W_t = Σ_{i=1}^K w_i(t+1)/W_t
    = Σ_{i=1}^K (w_i(t)/W_t) exp((γ/K) x̂_i(t))
    = Σ_{i=1}^K [(p_i(t) - γ/K)/(1 - γ)] exp((γ/K) x̂_i(t))   (5)
    <= Σ_{i=1}^K [(p_i(t) - γ/K)/(1 - γ)] [1 + (γ/K) x̂_i(t) + (e - 2)(γ/K)² x̂_i(t)²]   (6)
    <= 1 + (γ/K)/(1 - γ) Σ_{i=1}^K p_i(t) x̂_i(t) + (e - 2)(γ/K)²/(1 - γ) Σ_{i=1}^K p_i(t) x̂_i(t)²   (7)
    <= 1 + (γ/K)/(1 - γ) x_{i_t}(t) + (e - 2)(γ/K)²/(1 - γ) Σ_{i=1}^K x̂_i(t).   (8)

Eq. (5) uses the definition of p_i(t) in Figure 1. Eq. (6) uses the fact that e^x <= 1 + x + (e - 2)x² for x <= 1; the expression (γ/K) x̂_i(t) in the preceding line is at most 1 by Eq. (2). Eq. (8) uses Eqs. (3) and (4). Taking logarithms and using 1 + x <= e^x gives

  ln(W_{t+1}/W_t) <= (γ/K)/(1 - γ) x_{i_t}(t) + (e - 2)(γ/K)²/(1 - γ) Σ_{i=1}^K x̂_i(t).

Summing over t we then get

  ln(W_{T+1}/W_1) <= (γ/K)/(1 - γ) G_{Exp3} + (e - 2)(γ/K)²/(1 - γ) Σ_{t=1}^T Σ_{i=1}^K x̂_i(t).   (9)

For any action j,

  ln(W_{T+1}/W_1) >= ln(w_j(T+1)/W_1) = (γ/K) Σ_{t=1}^T x̂_j(t) - ln K.

Combining with Eq. (9), we get

  G_{Exp3} >= (1 - γ) Σ_{t=1}^T x̂_j(t) - (K ln K)/γ - (e - 2)(γ/K) Σ_{t=1}^T Σ_{i=1}^K x̂_i(t).   (10)

We next take the expectation of both sides of (10) with respect to the distribution of (i_1, ..., i_T). For the expected value of each x̂_i(t), we have

  E[x̂_i(t) | i_1, ..., i_{t-1}] = p_i(t) · x_i(t)/p_i(t) + (1 - p_i(t)) · 0 = x_i(t).   (11)

Combining (10) and (11), we find that

  E[G_{Exp3}] >= (1 - γ) Σ_{t=1}^T x_j(t) - (K ln K)/γ - (e - 2)(γ/K) Σ_{t=1}^T Σ_{i=1}^K x_i(t).

Since j was chosen arbitrarily and Σ_{t=1}^T Σ_{i=1}^K x_i(t) <= K G_max, we obtain the inequality in the statement of the theorem. □

Additional notation. As our other player algorithms will be variants of Exp3, we find it convenient to define some further notation based on the quantities used in the analysis of Exp3. For each 1 <= i <= K and for each t >= 1 define

  G_i(t+1) ≝ Σ_{s=1}^t x_i(s),
  Ĝ_i(t+1) ≝ Σ_{s=1}^t x̂_i(s),
  Ĝ_max(t+1) ≝ max_{1<=i<=K} Ĝ_i(t+1).

4 Bounds on the weak regret that hold uniformly over time

In Section 3, we showed that Exp3 yields an expected regret of O(sqrt(K g ln K)) whenever an upper bound g on the return G_max of the best action is known in advance. A bound of O(sqrt(KT ln K)), which holds uniformly over T, could be easily proven via the guessing techniques which will be used to prove Corollaries 8.4 and 8.5 in Section 8. In this section, instead, we describe an algorithm, called Exp3.1, whose expected weak regret is O(sqrt(K G_max ln K)) uniformly over T. As G_max = G_max(T) <= T, this bound is never worse than O(sqrt(KT ln K)) and is substantially better whenever the return of the best arm is small compared to T.

Our algorithm Exp3.1, described in Figure 2, proceeds in epochs, where each epoch consists of a sequence of trials. We use r = 0, 1, 2, ... to index the epochs. On epoch r, the algorithm guesses a bound g_r for the return of the best action. It then uses this guess to tune the parameter γ of Exp3, restarting Exp3 at the beginning of each epoch. As usual, we use t to denote the current time step.² Exp3.1 maintains an estimate Ĝ_i(t+1) of the return of each action i. Since E[x̂_i(t)] = x_i(t), this estimate will be unbiased in the sense that E[Ĝ_i(t+1)] = G_i(t+1) for all i and t. Using these estimates, the algorithm detects (approximately) when the actual gain of some action has advanced beyond g_r. When this happens, the algorithm goes on to the next epoch, restarting Exp3 with a larger bound on the maximal gain.

² Note that, in general, this t may differ from the local variable t used by Exp3, which we now regard as a subroutine. Throughout this section, we will only use t to refer to the total number of trials as in Figure 2.
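The formal pseudo-code is given in Figure 2 below. As a rough illustration of the epoch structure, the following Python sketch (names ours, with Exp3 inlined as the inner loop and reward_fn an assumed reward oracle) restarts Exp3 with the doubled guesses g_r until the time horizon is exhausted.

    import math
    import random

    def exp3_1(K, T, reward_fn):
        """Sketch of Exp3.1 (Figure 2): restart Exp3 on each epoch r with guess g_r."""
        c = K * math.log(K) / (math.e - 1)
        G_hat = [0.0] * K                      # estimated returns G_hat_i, accumulated across all epochs
        total_reward = 0.0
        t, r = 1, 0
        while t <= T:
            g_r = c * 4 ** r
            gamma = min(1.0, math.sqrt(K * math.log(K) / ((math.e - 1) * g_r)))
            w = [1.0] * K                      # restart Exp3
            while t <= T and max(G_hat) <= g_r - K / gamma:
                W = sum(w)
                p = [(1 - gamma) * w[i] / W + gamma / K for i in range(K)]
                i_t = random.choices(range(K), weights=p)[0]
                x = reward_fn(i_t, t)
                total_reward += x
                x_hat = x / p[i_t]
                G_hat[i_t] += x_hat            # step (b): only the chosen action's estimate changes
                w[i_t] *= math.exp(gamma * x_hat / K)
                t += 1
            r += 1
        return total_reward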

Algorithm Exp3.1
Initialization: Let t = 1, and Ĝ_i(1) = 0 for i = 1, ..., K.

Repeat for r = 0, 1, 2, ...
  1. Let g_r = (K ln K)/(e - 1) · 4^r.
  2. Restart Exp3 choosing γ_r = min{ 1, sqrt( K ln K / ((e - 1) g_r) ) }.
  3. While max_i Ĝ_i(t) <= g_r - K/γ_r do:
     (a) Let i_t be the random action chosen by Exp3 and x_{i_t}(t) the corresponding reward.
     (b) Ĝ_i(t+1) = Ĝ_i(t) + x̂_i(t) for i = 1, ..., K.
     (c) t := t + 1.

Figure 2: Pseudo-code of algorithm Exp3.1 to control the weak regret uniformly over time.

The performance of the algorithm is characterized by the following theorem which is the main result of this section.

Theorem 4.1  For any K > 0,

  G_max - E[G_{Exp3.1}] <= 8 sqrt(e - 1) sqrt(G_max K ln K) + 8(e - 1) K + 2 K ln K
    <= 10.5 sqrt(G_max K ln K) + 13.8 K + 2 K ln K

holds for any assignment of rewards and for any T > 0.

The proof of the theorem is divided into two lemmas. The first bounds the regret suffered on each epoch, and the second bounds the total number of epochs.

Fix T arbitrarily and define the following random variables: Let R be the total number of epochs (i.e., the final value of r). Let S_r and T_r be the first and last time steps completed on epoch r (where, for convenience, we define T_R = T). Thus, epoch r consists of trials S_r, S_r + 1, ..., T_r. Note that, in degenerate cases, some epochs may be empty, in which case S_r = T_r + 1. Let Ĝ_max = Ĝ_max(T + 1).

Lemma 4.2  For any action j and for every epoch r,

  Σ_{t=S_r}^{T_r} x_{i_t}(t) >= Σ_{t=S_r}^{T_r} x̂_j(t) - 2 sqrt(e - 1) sqrt(g_r K ln K).

Proof. If S_r > T_r (so that no trials occur on epoch r), then the lemma holds trivially since both summations will be equal to zero. Assume then that S_r <= T_r. Let g = g_r and γ = γ_r. We use (10) from the proof of Theorem 3.1:

  Σ_{t=S_r}^{T_r} x_{i_t}(t) >= (1 - γ) Σ_{t=S_r}^{T_r} x̂_j(t) - (K ln K)/γ - (e - 2)(γ/K) Σ_{t=S_r}^{T_r} Σ_{i=1}^K x̂_i(t).

From the definition of the termination condition we know that Ĝ_i(T_r) <= g - K/γ. Using (2), we get x̂_i(t) <= K/γ. This implies that Ĝ_i(T_r + 1) <= g for all i. Thus,

  Σ_{t=S_r}^{T_r} x_{i_t}(t) >= Σ_{t=S_r}^{T_r} x̂_j(t) - g (γ + (e - 2)γ) - (K ln K)/γ.

By our choice for γ, we get the statement of the lemma. □

The next lemma gives an implicit upper bound on the number of epochs R. Let c = (K ln K)/(e - 1).

Lemma 4.3  The number of epochs R satisfies

  2^{R-1} <= K/c + sqrt(Ĝ_max/c).

Proof. If R = 0, then the bound holds trivially. So assume R >= 1. Let z = 2^{R-1}. Because epoch R - 1 was completed, by the termination condition,

  Ĝ_max >= Ĝ_max(T_{R-1} + 1) > g_{R-1} - K/γ_{R-1} = c 4^{R-1} - K 2^{R-1} = c z² - K z.   (12)

Suppose the claim of the lemma is false. Then z > K/c + sqrt(Ĝ_max/c). Since the function c x² - K x is increasing for x > K/(2c), this implies that

  c z² - K z > c (K/c + sqrt(Ĝ_max/c))² - K (K/c + sqrt(Ĝ_max/c)) = K sqrt(Ĝ_max/c) + Ĝ_max >= Ĝ_max,

contradicting (12). □

Proof of Theorem 4.1. Using the lemmas, we have that

  G_{Exp3.1} = Σ_{t=1}^T x_{i_t}(t) = Σ_{r=0}^R Σ_{t=S_r}^{T_r} x_{i_t}(t)
    >= max_j Σ_{r=0}^R ( Σ_{t=S_r}^{T_r} x̂_j(t) - 2 sqrt(e - 1) sqrt(g_r K ln K) )

    = max_j Ĝ_j(T + 1) - 2 sqrt(e - 1) Σ_{r=0}^R sqrt(g_r K ln K)
    = Ĝ_max - 2 K ln K Σ_{r=0}^R 2^r
    = Ĝ_max - 2 K ln K (2^{R+1} - 1)
    = Ĝ_max + 2 K ln K - 8 K ln K · 2^{R-1}
    >= Ĝ_max + 2 K ln K - 8 K ln K ( K/c + sqrt(Ĝ_max/c) )
    >= Ĝ_max - 2 K ln K - 8(e - 1) K - 8 sqrt(e - 1) sqrt(Ĝ_max K ln K).   (13)

Here, we used Lemma 4.2 for the first inequality and Lemma 4.3 for the second inequality. The other steps follow from definitions and simple algebra.

Let f(x) = x - a sqrt(x) - b for x >= 0, where a = 8 sqrt((e - 1) K ln K) and b = 2 K ln K + 8(e - 1) K. Taking expectations of both sides of (13) gives

  E[G_{Exp3.1}] >= E[f(Ĝ_max)].   (14)

Since the second derivative of f is positive for x > 0, f is convex so that, by Jensen's inequality,

  E[f(Ĝ_max)] >= f(E[Ĝ_max]).   (15)

Note that

  E[Ĝ_max] = E[max_j Ĝ_j(T + 1)] >= max_j E[Ĝ_j(T + 1)] = max_j Σ_{t=1}^T x_j(t) = G_max.

The function f is increasing if and only if x > a²/4. Therefore, if G_max > a²/4 then f(E[Ĝ_max]) >= f(G_max). Combined with (14) and (15), this gives E[G_{Exp3.1}] >= f(G_max), which is equivalent to the statement of the theorem. On the other hand, if G_max <= a²/4 then, because f is nonincreasing on [0, a²/4], f(G_max) <= f(0) = -b <= 0 <= E[G_{Exp3.1}], so the theorem follows trivially in this case as well. □

5 Lower bounds on the weak regret

In this section, we state a lower bound on the expected weak regret of any player. More precisely, for any choice of the time horizon T we show that there exists a strategy for assigning the rewards to the actions such that the expected weak regret of any player algorithm is Ω(sqrt(KT)). Observe that this does not match the upper bound for our algorithms Exp3 and Exp3.1 (see Corollary 3.2 and Theorem 4.1); it is an open problem to close this gap.

Our lower bound is proven using the classical (statistical) bandit model with a crucial difference: the reward distribution depends on the number K of actions and on the time horizon T. This dependence is the reason why our lower bound does not contradict the upper bounds of the form

O(ln T) for the classical bandit model [14]. There, the distribution over the rewards is fixed as T -> infinity.

Note that our lower bound has a considerably stronger dependence on the number K of actions than the lower bound Ω(sqrt(T ln K)), which could have been proven directly from the results in [3, 6]. Specifically, our lower bound implies that no upper bound is possible of the form O(T^α (ln K)^β) where α < 1 and β > 0.

Theorem 5.1  For any number of actions K >= 2 and for any time horizon T, there exists a distribution over the assignment of rewards such that the expected weak regret of any algorithm (where the expectation is taken with respect to both the randomization over rewards and the algorithm's internal randomization) is at least

  (1/20) min{ sqrt(KT), T }.

The proof is given in Appendix A. The lower bound implies, of course, that for any algorithm there is a particular choice of rewards that will cause the expected weak regret (where the expectation is now with respect to the algorithm's internal randomization only) to be larger than this value.

6 Bounds on the weak regret that hold with probability 1

In Section 4 we showed that the expected weak regret of algorithm Exp3.1 is O(sqrt(KT ln K)). In this section we show that a modification of Exp3 achieves a weak regret of O(sqrt(KT ln(KT/δ))) with probability at least 1 - δ, for any fixed δ > 0 and uniformly over T. From this, a bound on the weak regret that holds with probability 1 follows easily.

The modification of Exp3 is necessary since the variance of the regret achieved by this algorithm is large, so large that an interesting high probability bound may not hold. The large variance of the regret comes from the large variance of the estimates x̂_i(t) for the payoffs x_i(t). In fact, the variance of x̂_i(t) can be close to 1/p_i(t) which, for γ in our range of interest, is (ignoring the dependence on K) of magnitude sqrt(T). Summing over trials, the variance of the return of Exp3 is about T^{3/2}, so that the regret might be as large as T^{3/4}.

To control the variance we modify algorithm Exp3 so that it uses estimates which are based on upper confidence bounds instead of estimates with the correct expectation. The modified algorithm Exp3.P is given in Figure 3. Let

  σ̂_i(t+1) ≝ sqrt(KT) + Σ_{s=1}^t 1/(p_i(s) sqrt(KT)).

Whereas algorithm Exp3 directly uses the estimates Ĝ_i(t) when choosing i_t at random, algorithm Exp3.P uses the upper confidence bounds Ĝ_i(t) + α σ̂_i(t). The next lemma shows that, for appropriate α, these are indeed upper confidence bounds. Fix some time horizon T. In what follows, we will use σ̂_i to denote σ̂_i(T+1) and Ĝ_i to denote Ĝ_i(T+1).
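The pseudo-code of Exp3.P is given in Figure 3 below. As a rough Python sketch of the same update (names ours; α and γ are set here as in Theorem 6.3, and reward_fn is an assumed reward oracle), note how the confidence term α/(p_j(t) sqrt(KT)) is added inside the exponent for every action, not only the chosen one:

    import math
    import random

    def exp3p(K, T, delta, reward_fn, t0=0, steps=None):
        """Sketch of Exp3.P (Figure 3), with alpha and gamma tuned as in Theorem 6.3.
        reward_fn(i, t) returns x_i(t) in [0, 1]; t0 offsets the global trial index;
        steps (if given) runs fewer than T trials with the same tuning."""
        steps = T if steps is None else steps
        alpha = 2.0 * math.sqrt(math.log(K * T / delta))
        gamma = min(3.0 / 5.0, 2.0 * math.sqrt(3.0 * K * math.log(K) / (5.0 * T)))
        w = [math.exp((alpha * gamma / 3.0) * math.sqrt(T / K)) for _ in range(K)]
        total_reward = 0.0
        for t in range(1, steps + 1):
            W = sum(w)
            p = [(1 - gamma) * w[i] / W + gamma / K for i in range(K)]
            i_t = random.choices(range(K), weights=p)[0]
            x = reward_fn(i_t, t0 + t)
            total_reward += x
            for j in range(K):
                x_hat = x / p[j] if j == i_t else 0.0
                bonus = alpha / (p[j] * math.sqrt(K * T))    # upper-confidence correction
                w[j] *= math.exp((gamma / (3.0 * K)) * (x_hat + bonus))
            m = max(w)
            w = [wi / m for wi in w]   # rescaling leaves p unchanged and avoids numeric overflow
        return total_reward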

Algorithm Exp3.P
Parameters: Reals α > 0 and γ ∈ (0, 1].
Initialization: For i = 1, ..., K

  w_i(1) = exp( (α γ / 3) sqrt(T/K) ).

For each t = 1, 2, ..., T
  1. For i = 1, ..., K set
       p_i(t) = (1 - γ) w_i(t) / Σ_{j=1}^K w_j(t) + γ/K.
  2. Choose i_t randomly according to the distribution p_1(t), ..., p_K(t).
  3. Receive reward x_{i_t}(t) ∈ [0, 1].
  4. For j = 1, ..., K set
       x̂_j(t) = x_j(t)/p_j(t) if j = i_t, and x̂_j(t) = 0 otherwise;
       w_j(t+1) = w_j(t) exp( (γ/(3K)) ( x̂_j(t) + α/(p_j(t) sqrt(KT)) ) ).

Figure 3: Pseudo-code of algorithm Exp3.P achieving small weak regret with high probability.

Lemma 6.1  If 2 sqrt(ln(KT/δ)) <= α <= 2 sqrt(KT), then

  P{ ∃i : Ĝ_i + α σ̂_i < G_i } <= δ.

Proof. Fix some i and set

  s_t ≝ α / (2 σ̂_i(t+1)).

Since α <= 2 sqrt(KT) and σ̂_i(t+1) >= sqrt(KT), we have s_t <= 1. Now

  P{ Ĝ_i + α σ̂_i < G_i }
    = P{ Σ_{t=1}^T (x_i(t) - x̂_i(t)) - (α/2) σ̂_i > (α/2) σ̂_i }
    <= P{ s_T Σ_{t=1}^T ( x_i(t) - x̂_i(t) - α/(2 p_i(t) sqrt(KT)) ) > α²/4 }   (16)

    <= e^{-α²/4} E[ exp( s_T Σ_{t=1}^T ( x_i(t) - x̂_i(t) - α/(2 p_i(t) sqrt(KT)) ) ) ]   (17)

where in step (16) we multiplied both sides by s_T and used σ̂_i >= sqrt(KT) + Σ_{t=1}^T 1/(p_i(t) sqrt(KT)), while in step (17) we used Markov's inequality. For t = 1, ..., T set

  Z_t ≝ exp( s_t Σ_{τ=1}^t ( x_i(τ) - x̂_i(τ) - α/(2 p_i(τ) sqrt(KT)) ) ).

Then, for t = 2, ..., T,

  Z_t = exp( s_t ( x_i(t) - x̂_i(t) - α/(2 p_i(t) sqrt(KT)) ) ) (Z_{t-1})^{s_t/s_{t-1}}.

Denote by E_t[Z_t] = E[Z_t | i_1, ..., i_{t-1}] the expectation of Z_t with respect to the random choice in trial t and conditioned on the past t - 1 trials. Note that when the past t - 1 trials are fixed the only random quantities in Z_t are the x̂_i(t)'s. Note also that x_i(t) - x̂_i(t) <= 1, and that

  E_t[ (x_i(t) - x̂_i(t))² ] = E_t[ x̂_i(t)² ] - x_i(t)² <= E_t[ x̂_i(t)² ] = x_i(t)²/p_i(t) <= 1/p_i(t).   (18)

Hence, for each t = 2, ..., T,

  E_t[Z_t] = E_t[ exp( s_t ( x_i(t) - x̂_i(t) - α/(2 p_i(t) sqrt(KT)) ) ) ] (Z_{t-1})^{s_t/s_{t-1}}   (19)
    <= E_t[ 1 + s_t (x_i(t) - x̂_i(t)) + s_t² (x_i(t) - x̂_i(t))² ] exp( -s_t α/(2 p_i(t) sqrt(KT)) ) (Z_{t-1})^{s_t/s_{t-1}}   (20)
    <= ( 1 + s_t²/p_i(t) ) exp( -s_t α/(2 p_i(t) sqrt(KT)) ) (Z_{t-1})^{s_t/s_{t-1}}   (21)
    <= exp( s_t²/p_i(t) - s_t α/(2 p_i(t) sqrt(KT)) ) (Z_{t-1})^{s_t/s_{t-1}}   (22)
    <= 1 + Z_{t-1}.   (23)

Eq. (20) uses e^a <= 1 + a + a² for a <= 1. Eq. (21) uses E_t[x̂_i(t)] = x_i(t) and (18). Eq. (22) uses 1 + x <= e^x for any real x. Eq. (23) uses the facts that the exponent in (22) is nonpositive, because s_t = α/(2 σ̂_i(t+1)) <= α/(2 sqrt(KT)) since σ̂_i(t+1) >= sqrt(KT), together with s_t <= s_{t-1} and z^u <= 1 + z for any z > 0 and u ∈ [0, 1]. Observing that E[Z_1] <= 1, we get by induction that E[Z_T] <= T, and the lemma follows by our choice of α (together with a union bound over the K actions). □

The next lemma shows that the return achieved by algorithm Exp3.P is close to its upper confidence bounds. Let

  Û ≝ max_{1<=i<=K} ( Ĝ_i + α σ̂_i ).

Lemma 6.2  If α <= 2 sqrt(KT), then

  G_{Exp3.P} >= (1 - 5γ/3) Û - (3K/γ) ln K - 2 α sqrt(KT) - 2 α².

Proof. We proceed as in the analysis of algorithm Exp3. Set η = γ/(3K) and consider any sequence i_1, ..., i_T of actions chosen by Exp3.P. As x̂_i(t) <= K/γ, p_i(t) >= γ/K, and α <= 2 sqrt(KT), we have

  η ( x̂_i(t) + α/(p_i(t) sqrt(KT)) ) <= 1.

Therefore,

  W_{t+1}/W_t = Σ_{i=1}^K w_i(t+1)/W_t
    = Σ_{i=1}^K (w_i(t)/W_t) exp( η ( x̂_i(t) + α/(p_i(t) sqrt(KT)) ) )
    = Σ_{i=1}^K [(p_i(t) - γ/K)/(1 - γ)] exp( η ( x̂_i(t) + α/(p_i(t) sqrt(KT)) ) )
    <= Σ_{i=1}^K [(p_i(t) - γ/K)/(1 - γ)] [ 1 + η ( x̂_i(t) + α/(p_i(t) sqrt(KT)) ) + η² ( x̂_i(t) + α/(p_i(t) sqrt(KT)) )² ]
    <= 1 + (η/(1 - γ)) Σ_{i=1}^K p_i(t) ( x̂_i(t) + α/(p_i(t) sqrt(KT)) ) + (2η²/(1 - γ)) Σ_{i=1}^K p_i(t) ( x̂_i(t)² + α²/(p_i(t)² KT) )
    <= 1 + (η/(1 - γ)) ( x_{i_t}(t) + α sqrt(K/T) ) + (2η²/(1 - γ)) ( Σ_{i=1}^K x̂_i(t) + (α²/(KT)) Σ_{i=1}^K 1/p_i(t) ).

The second inequality uses e^a <= 1 + a + a² for a <= 1, and (a + b)² <= 2(a² + b²) for any a, b. The last inequality uses Eqs. (2), (3) and (4).

Taking logarithms, using ln(1 + x) <= x, and summing over t = 1, ..., T we get

  ln(W_{T+1}/W_1) <= (η/(1 - γ)) ( G_{Exp3.P} + α sqrt(KT) ) + (2η²/(1 - γ)) ( Σ_{i=1}^K Ĝ_i + (α²/(KT)) Σ_{t=1}^T Σ_{i=1}^K 1/p_i(t) ).

Since

  ln W_1 = ln K + η α sqrt(KT)

and for any j

  ln W_{T+1} >= ln w_j(T+1) = ln w_j(1) + η Σ_{t=1}^T ( x̂_j(t) + α/(p_j(t) sqrt(KT)) ) = η ( Ĝ_j + α σ̂_j ),

this implies

  G_{Exp3.P} >= (1 - γ) ( Ĝ_j + α σ̂_j ) - (1/η) ln K - 2 α sqrt(KT) - 2 η Σ_{i=1}^K Ĝ_i - 2 α²

for any j. Finally, using η = γ/(3K) and Σ_{i=1}^K Ĝ_i <= K Û yields the lemma. □

Combining Lemmas 6.1 and 6.2 gives the main result of this section.

Theorem 6.3  For any fixed T > 0, for all K >= 2 and for all δ > 0, if

  γ = min{ 3/5, 2 sqrt( (3 K ln K) / (5 T) ) }  and  α = 2 sqrt(ln(KT/δ)),

then

  G_max - G_{Exp3.P} <= 4 sqrt(KT ln(KT/δ)) + 4 sqrt( (5/3) KT ln K ) + 8 ln(KT/δ)

holds for any assignment of rewards with probability at least 1 - δ.

Proof. We assume without loss of generality that T >= (20/3) K ln K and that δ >= KT e^{-KT}. If either of these conditions does not hold, then the theorem holds trivially. Note that T >= (20/3) K ln K ensures γ <= 3/5. Note also that δ >= KT e^{-KT} implies α <= 2 sqrt(KT) for our choice of α. So we can apply Lemmas 6.1 and 6.2. By Lemma 6.2 we have

  G_{Exp3.P} >= (1 - 5γ/3) Û - (3K/γ) ln K - 2 α sqrt(KT) - 2 α².

By Lemma 6.1 we have Û >= G_max with probability at least 1 - δ. Collecting terms and using G_max <= T gives the theorem. □

It is not difficult to obtain an algorithm that does not need the time horizon T as input parameter and whose regret is only slightly worse than that proven for the algorithm Exp3.P in Theorem 6.3. This new algorithm, called Exp3.P.1 and shown in Figure 4, simply restarts Exp3.P doubling its guess for T each time. The only careful issue is the choice of the confidence parameter δ and of the minimum length of the runs to ensure that Lemma 6.1 holds for all the runs of Exp3.P.

Algorithm Exp3.P.1
Parameters: Real 0 < δ < 1.
Initialization: Let T_r = 2^r, δ_r = δ / ((r+1)(r+2)), and

  r* = min{ r ∈ ℕ : δ_r >= K T_r e^{-K T_r} }.   (24)

Repeat for r = r*, r* + 1, ...
  Run Exp3.P for T_r trials choosing δ and γ as in Theorem 6.3 with T = T_r and δ = δ_r.

Figure 4: Pseudo-code of algorithm Exp3.P.1 (see Theorem 6.4).

Theorem 6.4  Let K >= 2, δ ∈ (0, 1) and T >= 2^{r*}. Let c_T = 2 ln(2 + log₂ T), and let r* be as in Eq. (24). Then

  G_max - G_{Exp3.P.1} <= (10/(sqrt(2) - 1)) sqrt( 2 K T ( ln(KT/δ) + c_T ) ) + 10 (1 + log₂ T) ( ln(KT/δ) + c_T )

holds with probability at least 1 - δ.

Proof. Choose the time horizon T arbitrarily and call epoch the sequence of trials between two successive restarts of algorithm Exp3.P. For each r > r*, where r* is defined in (24), let

  G_i(r) ≝ Σ_{t=2^r+1}^{2^{r+1}} x_i(t),   Ĝ_i(r) ≝ Σ_{t=2^r+1}^{2^{r+1}} x̂_i(t),   σ̂_i(r) ≝ sqrt(K T_r) + Σ_{t=2^r+1}^{2^{r+1}} 1/(p_i(t) sqrt(K T_r)),

and similarly define the quantities G_i(r*), Ĝ_i(r*) and σ̂_i(r*) with sums that go from t = 1 to t = 2^{r*+1}. For each r >= r*, we have δ_r >= K T_r e^{-K T_r}. Thus we can find numbers α_r such that, by Lemma 6.1,

  P{ (∃ r >= r*)(∃ i) : Ĝ_i(r) + α_r σ̂_i(r) < G_i(r) } <= Σ_{r >= r*} P{ ∃ i : Ĝ_i(r) + α_r σ̂_i(r) < G_i(r) } <= Σ_{r >= 0} δ/((r+1)(r+2)) = δ.

We now apply Theorem 6.3 to each epoch. Without loss of generality, assume that T satisfies

  2^{r*+ℓ-1} < T <= Σ_{r=0}^{ℓ-1} T_{r*+r} < 2^{r*+ℓ}

for some ℓ >= 1. With probability at least 1 - δ over the random draw of Exp3.P.1's actions i_1, ..., i_T,

  G_max - G_{Exp3.P.1}
    <= Σ_{r=0}^{ℓ-1} [ 4 sqrt( K T_{r*+r} ln(K T_{r*+r}/δ_{r*+r}) ) + 4 sqrt( (5/3) K T_{r*+r} ln K ) + 8 ln(K T_{r*+r}/δ_{r*+r}) ]
    <= 10 sqrt( K ln(K T_{r*+ℓ-1}/δ_{r*+ℓ-1}) ) Σ_{r=0}^{ℓ-1} sqrt(T_{r*+r}) + 10 ℓ ln(K T_{r*+ℓ-1}/δ_{r*+ℓ-1})
    <= (10/(sqrt(2) - 1)) sqrt( 2 K T ( ln(KT/δ) + c_T ) ) + 10 (1 + log₂ T) ( ln(KT/δ) + c_T ),

where c_T = 2 ln(2 + log₂ T). □

From the above theorem we get, as a simple corollary, a statement about the almost sure convergence of the return of algorithm Exp3.P.1. The rate of convergence is almost optimal, as one can see from our lower bound in Section 5.

Corollary 6.5  For any K >= 2 and for any function f: ℝ -> ℝ with lim_{T->∞} f(T) = ∞,

  lim_{T->∞} ( G_max - G_{Exp3.P.1} ) / ( sqrt(T) (ln T) f(T) ) = 0

holds for any assignment of rewards with probability 1.

Proof. Let δ = 1/T². Then, by Theorem 6.4, there exists a constant C such that for all T large enough

  G_max - G_{Exp3.P.1} <= C sqrt(KT) ln T

with probability at least 1 - 1/T². This implies that

  P{ ( G_max - G_{Exp3.P.1} ) / ( sqrt(T) (ln T) f(T) ) > C sqrt(K)/f(T) } <= 1/T²

and the theorem follows from the Borel-Cantelli lemma. □
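As a small illustration of the doubling scheme of Figure 4, the following Python sketch (names ours) reuses the exp3p sketch shown earlier, running it on epochs of doubling length T_r = 2^r with confidence parameters δ_r = δ/((r+1)(r+2)), starting from the r* of Eq. (24).

    import math

    def exp3p1(K, T, delta, reward_fn):
        """Sketch of Exp3.P.1 (Figure 4): restart Exp3.P with doubled horizon guesses."""
        def delta_r(r):
            return delta / ((r + 1) * (r + 2))
        # r* from Eq. (24): smallest r with delta_r >= K * T_r * exp(-K * T_r)
        r = 0
        while delta_r(r) < K * 2 ** r * math.exp(-K * 2 ** r):
            r += 1
        total_reward, t = 0.0, 0
        while t < T:
            T_r = 2 ** r
            n = min(T_r, T - t)      # the last run is truncated at the true horizon
            total_reward += exp3p(K, T_r, delta_r(r), reward_fn, t0=t, steps=n)
            t += n
            r += 1
        return total_reward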

7 The regret against the best strategy from a pool

Consider a setting where the player has preliminarily fixed a set of strategies that could be used for choosing actions. These strategies might select different actions at different iterations. The strategies can be computations performed by the player or they can be external advice given to the player by experts. We will use the more general term "expert" (borrowed from Cesa-Bianchi et al. [3]) because we place no restrictions on the generation of the advice. The player's goal in this case is to combine the advice of the experts in such a way that its return is close to that of the best expert.

Formally, we assume that the player, prior to choosing an action at time t, is provided with a set of N probability vectors ξ¹(t), ..., ξᴺ(t) ∈ [0, 1]^K, where Σ_{j=1}^K ξ^i_j(t) = 1 for each i = 1, ..., N. We interpret ξ^i(t) as the advice of expert i on trial t, where the j-th component ξ^i_j(t) represents the recommended probability of playing action j. (As a special case, the distribution can be concentrated on a single action, which represents a deterministic recommendation.) If the vector of rewards at time t is x(t), then the expected reward for expert i, with respect to the chosen probability vector ξ^i(t), is simply ξ^i(t) · x(t). In analogy with G_max, we define

  G̃_max ≝ max_{1<=i<=N} Σ_{t=1}^T ξ^i(t) · x(t),

measuring the expected return of the best strategy. Then the regret for the best strategy at time horizon T, defined by G̃_max(T) - G_A(T), measures the difference between the return of the best expert and player A's return up to time T.

Our results hold for any finite set of experts. Formally, we regard each ξ^i(t) as a random variable which is an arbitrary function of the random sequence of plays i_1, ..., i_{t-1}. This definition allows for experts whose advice depends on the entire past history as observed by the player, as well as on other side information which may be available.

We could at this point view each expert as a "meta-action" in a higher-level bandit problem with payoff vector defined at trial t as (ξ¹(t) · x(t), ..., ξᴺ(t) · x(t)). We could then immediately apply Corollary 3.2 to obtain a bound of O(sqrt(g N log N)) on the player's regret relative to the best expert (where g is an upper bound on G̃_max). However, this bound is quite weak if the player is combining many experts (i.e., if N is very large). We show below that the algorithm Exp3 from Section 3 can be modified yielding a regret term of the form O(sqrt(g K log N)). This bound is very reasonable when the number of actions is small, but the number of experts is quite large (even exponential).

Our algorithm Exp4 is shown in Figure 5, and is only a slightly modified version of Exp3. (Exp4 stands for "Exponential-weight algorithm for Exploration and Exploitation using Expert advice.") Let us define y(t) ∈ [0, 1]^N to be the vector with components corresponding to the gains of the experts: y_i(t) = ξ^i(t) · x(t). The simplest possible expert is one which always assigns uniform weight to all actions so that ξ_j(t) = 1/K on each round t. We call this the uniform expert. To prove our results, we need to assume that the uniform expert is included in the family of experts.³ Clearly, the uniform expert can always be added to any given family of experts at the very small expense of increasing N by one.

³ In fact, we can use a slightly weaker sufficient condition, namely, that the uniform expert is included in the convex hull of the family of experts, i.e., that there exist nonnegative numbers α_1, ..., α_N with Σ_{j=1}^N α_j = 1 such that, for all t and all i, Σ_{j=1}^N α_j ξ^j_i(t) = 1/K.
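Figure 5 below gives the pseudo-code of Exp4. The following Python sketch (names ours; advice_fn and reward_fn are assumed oracles for the experts' advice vectors and the rewards) shows the same computation: the experts' weights induce a distribution over actions, and each expert is then credited with its estimated gain ŷ_i(t) = ξ^i(t) · x̂(t).

    import math
    import random

    def exp4(K, N, T, gamma, advice_fn, reward_fn):
        """Sketch of Exp4 (Figure 5). advice_fn(t) returns a list of N probability
        vectors over the K actions; reward_fn(i, t) returns x_i(t) in [0, 1]."""
        w = [1.0] * N                                   # one weight per expert
        total_reward = 0.0
        for t in range(1, T + 1):
            xi = advice_fn(t)                           # xi[i][j]: expert i's probability for action j
            W = sum(w)
            p = [(1 - gamma) * sum(w[i] * xi[i][j] for i in range(N)) / W + gamma / K
                 for j in range(K)]
            i_t = random.choices(range(K), weights=p)[0]
            x = reward_fn(i_t, t)
            total_reward += x
            x_hat = x / p[i_t]                          # importance-weighted estimate (zero for j != i_t)
            for i in range(N):
                y_hat = xi[i][i_t] * x_hat              # expert i's estimated gain
                w[i] *= math.exp(gamma * y_hat / K)
            m = max(w)
            w = [wi / m for wi in w]                    # rescale to avoid overflow; p is unchanged
        return total_reward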

Algorithm Exp4
Parameters: Real γ ∈ (0, 1].
Initialization: w_i(1) = 1 for i = 1, ..., N.

For each t = 1, 2, ...
  1. Get advice vectors ξ¹(t), ..., ξᴺ(t).
  2. Set W_t = Σ_{i=1}^N w_i(t) and for j = 1, ..., K set
       p_j(t) = (1 - γ) Σ_{i=1}^N w_i(t) ξ^i_j(t) / W_t + γ/K.
  3. Draw action i_t randomly according to the probabilities p_1(t), ..., p_K(t).
  4. Receive reward x_{i_t}(t) ∈ [0, 1].
  5. For j = 1, ..., K set
       x̂_j(t) = x_j(t)/p_j(t) if j = i_t, and x̂_j(t) = 0 otherwise.
  6. For i = 1, ..., N set
       ŷ_i(t) = ξ^i(t) · x̂(t),
       w_i(t+1) = w_i(t) exp(γ ŷ_i(t)/K).

Figure 5: Pseudo-code of algorithm Exp4 for using expert advice.

Theorem 7.1  For any K, T > 0, for any γ ∈ (0, 1], and for any family of experts which includes the uniform expert,

  G̃_max - E[G_{Exp4}] <= (e - 1) γ G̃_max + (K ln N)/γ

holds for any assignment of rewards.

Proof. We prove this theorem along the lines of the proof of Theorem 3.1. Let q_i(t) = w_i(t)/W_t. Then

  W_{t+1}/W_t = Σ_{i=1}^N w_i(t+1)/W_t = Σ_{i=1}^N q_i(t) exp((γ/K) ŷ_i(t))

    <= Σ_{i=1}^N q_i(t) [ 1 + (γ/K) ŷ_i(t) + (e - 2)(γ/K)² ŷ_i(t)² ]
    <= 1 + (γ/K) Σ_{i=1}^N q_i(t) ŷ_i(t) + (e - 2)(γ/K)² Σ_{i=1}^N q_i(t) ŷ_i(t)².

Taking logarithms and summing over t we get

  ln(W_{T+1}/W_1) <= (γ/K) Σ_{t=1}^T Σ_{i=1}^N q_i(t) ŷ_i(t) + (e - 2)(γ/K)² Σ_{t=1}^T Σ_{i=1}^N q_i(t) ŷ_i(t)².

Since, for any expert k,

  ln(W_{T+1}/W_1) >= ln(w_k(T+1)/W_1) = (γ/K) Σ_{t=1}^T ŷ_k(t) - ln N,

we get

  Σ_{t=1}^T Σ_{i=1}^N q_i(t) ŷ_i(t) >= Σ_{t=1}^T ŷ_k(t) - (K ln N)/γ - (e - 2)(γ/K) Σ_{t=1}^T Σ_{i=1}^N q_i(t) ŷ_i(t)².

Note that

  Σ_{i=1}^N q_i(t) ŷ_i(t) = Σ_{j=1}^K ( Σ_{i=1}^N q_i(t) ξ^i_j(t) ) x̂_j(t) = Σ_{j=1}^K [ (p_j(t) - γ/K)/(1 - γ) ] x̂_j(t) <= x_{i_t}(t)/(1 - γ).

Also

  Σ_{i=1}^N q_i(t) ŷ_i(t)² = Σ_{i=1}^N q_i(t) ( ξ^i_{i_t}(t) x̂_{i_t}(t) )² <= ( Σ_{i=1}^N q_i(t) ξ^i_{i_t}(t) ) x̂_{i_t}(t)² <= ( p_{i_t}(t)/(1 - γ) ) x̂_{i_t}(t)² = x̂_{i_t}(t) x_{i_t}(t)/(1 - γ) <= (1/(1 - γ)) Σ_{j=1}^K x̂_j(t).

Therefore, for all experts k,

  G_{Exp4} = Σ_{t=1}^T x_{i_t}(t) >= (1 - γ) Σ_{t=1}^T ŷ_k(t) - (K ln N)/γ - (e - 2)(γ/K) Σ_{t=1}^T Σ_{j=1}^K x̂_j(t).

We now take expectations of both sides of this inequality. Note that

  E[ŷ_k(t)] = E[ Σ_{j=1}^K ξ^k_j(t) x̂_j(t) ] = Σ_{j=1}^K ξ^k_j(t) x_j(t) = y_k(t).

Further,

  E[ (1/K) Σ_{t=1}^T Σ_{j=1}^K x̂_j(t) ] = (1/K) Σ_{t=1}^T Σ_{j=1}^K x_j(t) <= max_{1<=i<=N} Σ_{t=1}^T y_i(t) = G̃_max

since we have assumed that the uniform expert is included in the family of experts. Combining these facts immediately implies the statement of the theorem. □

8 The regret against arbitrary strategies

In this section we present a variant of algorithm Exp3 and prove a bound on its expected regret for any sequence (j_1, ..., j_T) of actions. To prove this result, we rank all sequences of actions according to their "hardness". The hardness of a sequence (j_1, ..., j_T) is defined by

  H(j_1, ..., j_T) ≝ 1 + |{ 1 <= ℓ < T : j_ℓ ≠ j_{ℓ+1} }|.

So, H(1, ..., 1) = 1 and H(1, 1, 3, 2, 2) = 3. The bound on the regret which we will prove grows with the hardness of the sequence for which we are measuring the regret. In particular, we will show that the player algorithm Exp3.S described in Figure 6 has an expected regret of O(H(jᵀ) sqrt(KT ln(KT))) for any sequence jᵀ = (j_1, ..., j_T) of actions. On the other hand, if the regret is measured for any sequence jᵀ of actions of hardness H(jᵀ) <= S, then the expected regret of Exp3.S (with parameters tuned to this S) reduces to O(sqrt(S K T ln(KT))). In what follows, we will use G_{jᵀ} to denote the return x_{j_1}(1) + ... + x_{j_T}(T) of a sequence jᵀ = (j_1, ..., j_T) of actions.

Theorem 8.1  For any K > 0, for any γ ∈ (0, 1], and for any α > 0,

  G_{jᵀ} - E[G_{Exp3.S}] <= K ( H(jᵀ) ln(K/α) + e α T ) / γ + (e - 1) γ T

holds for any assignment of rewards, for any T > 0, and for any sequence jᵀ = (j_1, ..., j_T) of actions.

Corollary 8.2  Assume that algorithm Exp3.S is run with input parameters α = 1/T and

  γ = min{ 1, sqrt( K ln(KT) / T ) }.

Then

  G_{jᵀ} - E[G_{Exp3.S}] <= H(jᵀ) sqrt(KT ln(KT)) + 2e sqrt(KT ln(KT))

holds for any sequence jᵀ = (j_1, ..., j_T) of actions.

Algorithm Exp3.S
Parameters: Reals γ ∈ (0, 1] and α > 0.
Initialization: w_i(1) = 1 for i = 1, ..., K.

For each t = 1, 2, ...
  1. Set
       p_i(t) = (1 - γ) w_i(t) / Σ_{j=1}^K w_j(t) + γ/K,   i = 1, ..., K.
  2. Draw i_t randomly according to the probabilities p_1(t), ..., p_K(t).
  3. Receive reward x_{i_t}(t) ∈ [0, 1].
  4. For j = 1, ..., K set
       x̂_j(t) = x_j(t)/p_j(t) if j = i_t, and x̂_j(t) = 0 otherwise;
       w_j(t+1) = w_j(t) exp(γ x̂_j(t)/K) + (e α / K) Σ_{i=1}^K w_i(t).

Figure 6: Pseudo-code of algorithm Exp3.S to control the expected regret.

Note that the statement of Corollary 8.2 can be equivalently written as

  E[G_{Exp3.S}] >= max_{jᵀ} ( G_{jᵀ} - H(jᵀ) sqrt(KT ln(KT)) ) - 2e sqrt(KT ln(KT)),

revealing that algorithm Exp3.S is able to automatically trade off between the return G_{jᵀ} of a sequence jᵀ and its hardness H(jᵀ).

Corollary 8.3  Assume that algorithm Exp3.S is run with input parameters α = 1/T and

  γ = min{ 1, sqrt( K ( S ln(KT) + e ) / ( (e - 1) T ) ) }.

Then

  G_{jᵀ} - E[G_{Exp3.S}] <= 2 sqrt(e - 1) sqrt( KT ( S ln(KT) + e ) )

holds for any sequence jᵀ = (j_1, ..., j_T) of actions such that H(jᵀ) <= S.
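The following Python sketch (names ours; reward_fn is an assumed reward oracle) implements the Exp3.S update of Figure 6, together with a small helper computing the hardness H of a comparison sequence; γ and α are left to the caller, e.g. tuned as in Corollary 8.2 or 8.3.

    import math
    import random

    def hardness(seq):
        """H(j_1, ..., j_T) = 1 + number of switches in the sequence."""
        return 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)

    def exp3s(K, T, gamma, alpha, reward_fn):
        """Sketch of Exp3.S (Figure 6): Exp3 with weight sharing, for tracking switching sequences."""
        w = [1.0] * K
        total_reward = 0.0
        for t in range(1, T + 1):
            W = sum(w)
            p = [(1 - gamma) * w[i] / W + gamma / K for i in range(K)]
            i_t = random.choices(range(K), weights=p)[0]
            x = reward_fn(i_t, t)
            total_reward += x
            share = (math.e * alpha / K) * W        # the shared weight keeps every arm "alive"
            for j in range(K):
                x_hat = x / p[j] if j == i_t else 0.0
                w[j] = w[j] * math.exp(gamma * x_hat / K) + share
        return total_reward

For example, hardness([1, 1, 3, 2, 2]) returns 3, matching the example given above.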

Proof of Theorem 8.1. Fix any sequence jᵀ = (j_1, ..., j_T) of actions. With a technique that follows closely the proof of Theorem 3.1, we can prove that for all sequences i_1, ..., i_T of actions drawn by Exp3.S,

  W_{t+1}/W_t <= 1 + (γ/K)/(1 - γ) x_{i_t}(t) + (e - 2)(γ/K)²/(1 - γ) Σ_{i=1}^K x̂_i(t) + e α,   (25)

where, as usual, W_t = w_1(t) + ... + w_K(t). Now let S = H(jᵀ) and partition (1, ..., T) in segments

  [T_1, ..., T_2), [T_2, ..., T_3), ..., [T_S, ..., T_{S+1}),

where T_1 = 1, T_{S+1} = T + 1, and j_{T_s} = j_{T_s+1} = ... = j_{T_{s+1}-1} for each segment s = 1, ..., S. Fix an arbitrary segment [T_s, T_{s+1}) and let Δ_s = T_{s+1} - T_s. Furthermore, let

  G_{Exp3.S}(s) ≝ Σ_{t=T_s}^{T_{s+1}-1} x_{i_t}(t).

Taking logarithms on both sides of (25) and summing over t = T_s, ..., T_{s+1} - 1 we get

  ln(W_{T_{s+1}}/W_{T_s}) <= (γ/K)/(1 - γ) G_{Exp3.S}(s) + (e - 2)(γ/K)²/(1 - γ) Σ_{t=T_s}^{T_{s+1}-1} Σ_{i=1}^K x̂_i(t) + e α Δ_s.   (26)

Now let j be the action such that j_{T_s} = ... = j_{T_{s+1}-1} = j. Since

  w_j(T_{s+1}) >= w_j(T_s + 1) exp( (γ/K) Σ_{t=T_s+1}^{T_{s+1}-1} x̂_j(t) )
    >= (e α / K) W_{T_s} exp( (γ/K) Σ_{t=T_s+1}^{T_{s+1}-1} x̂_j(t) )
    >= (α/K) W_{T_s} exp( (γ/K) Σ_{t=T_s}^{T_{s+1}-1} x̂_j(t) ),

where the last step uses γ x̂_j(t)/K <= 1, we have

  ln(W_{T_{s+1}}/W_{T_s}) >= ln( w_j(T_{s+1})/W_{T_s} ) >= ln(α/K) + (γ/K) Σ_{t=T_s}^{T_{s+1}-1} x̂_j(t).   (27)

Piecing together (26) and (27) we get

  G_{Exp3.S}(s) >= (1 - γ) Σ_{t=T_s}^{T_{s+1}-1} x̂_j(t) - (K/γ) ln(K/α) - (e - 2)(γ/K) Σ_{t=T_s}^{T_{s+1}-1} Σ_{i=1}^K x̂_i(t) - (e K α / γ) Δ_s.

Summing over all segments s = 1, ..., S, taking expectation with respect to the random choices of algorithm Exp3.S, and using G_{(j_1,...,j_T)} <= T and Σ_{t=1}^T Σ_{i=1}^K x_i(t) <= KT yields the inequality in the statement of the theorem. □

If the time horizon T is not known, we can apply techniques similar to those applied for proving Theorem 6.4 in Section 6. More specifically, we introduce a new algorithm, Exp3.S.1, that runs Exp3.S as a subroutine. Suppose that at each new run (or epoch) r = 0, 1, ..., Exp3.S is started with its parameters set as prescribed in Corollary 8.2, where T is set to T_r = 2^r, and then stopped after T_r iterations. Clearly, for any fixed sequence jᵀ = (j_1, ..., j_T) of actions, the number of segments (see the proof of Theorem 8.1 for a definition of segment) within each epoch r is at most H(jᵀ). Hence the expected regret of Exp3.S.1 for epoch r is certainly not more than

  ( H(jᵀ) + 2e ) sqrt( K T_r ln(K T_r) ).

Let ℓ be such that 2^ℓ <= T < 2^{ℓ+1}. Then the last epoch is ℓ <= log₂ T and the overall regret (over the ℓ + 1 epochs) is at most

  ( H(jᵀ) + 2e ) Σ_{r=0}^ℓ sqrt( K T_r ln(K T_r) ) <= ( H(jᵀ) + 2e ) sqrt( K ln(K T_ℓ) ) Σ_{r=0}^ℓ sqrt(T_r).

Finishing up the calculations proves the following.

Corollary 8.4

  G_{jᵀ} - E[G_{Exp3.S.1}] <= ( ( H(jᵀ) + 2e ) / (sqrt(2) - 1) ) sqrt( 2 K T ln(KT) )

for any T > 0 and for any sequence jᵀ = (j_1, ..., j_T) of actions.

On the other hand, if Exp3.S.1 runs Exp3.S with parameters set as prescribed in Corollary 8.3, with a reasoning similar to the one above we conclude the following.

Corollary 8.5

  G_{jᵀ} - E[G_{Exp3.S.1}] <= ( 2 sqrt(e - 1) / (sqrt(2) - 1) ) sqrt( 2 K T ( S ln(KT) + e ) )

for any T > 0 and for any sequence jᵀ = (j_1, ..., j_T) of actions such that H(jᵀ) <= S.

9 Applications to game theory

The adversarial bandit problem can be easily related to the problem of playing repeated games. For N > 1 integer, an N-person finite game is defined by N finite sets S_1, ..., S_N of pure strategies,

one set for each player, and by N functions u_1, ..., u_N, where function u_i : S_1 × ... × S_N -> ℝ is player i's payoff function. Note that each player's payoff depends both on the pure strategy chosen by the player and on the pure strategies chosen by the other players. Let S = S_1 × ... × S_N and let S_{-i} = S_1 × ... × S_{i-1} × S_{i+1} × ... × S_N. We use s and s_{-i} to denote typical members of, respectively, S and S_{-i}. Given s ∈ S, we will often write (j, s_{-i}) to denote (s_1, ..., s_{i-1}, j, s_{i+1}, ..., s_N), where j ∈ S_i.

Suppose that the game is played repeatedly through time. Assume for now that each player knows all payoff functions and, after each repetition (or round) t, also knows the vector s(t) = (s_1(t), ..., s_N(t)) of pure strategies chosen by the players. Hence, the pure strategy s_i(t) chosen by player i at round t may depend on what player i and the other players chose in the past rounds. The average regret of player i for the pure strategy j after T rounds is defined by

  R_i^{(j)}(T) = (1/T) Σ_{t=1}^T [ u_i(j, s_{-i}(t)) - u_i(s(t)) ].

This is how much player i lost on average for not playing the pure strategy j on all rounds, given that all the other players kept their choices fixed. A desirable property for a player is Hannan-consistency [8], defined as follows. Player i is Hannan-consistent if

  limsup_{T->∞} max_{j ∈ S_i} R_i^{(j)}(T) <= 0  with probability 1.

The existence and properties of Hannan-consistent players have been first investigated by Hannan [10] and Blackwell [2], and later by many others (see [18] for a nice survey).

Hannan-consistency can be also studied in the so-called "unknown game setup", where it is further assumed that: (1) each player knows neither the total number of players nor the payoff function of any player (including itself); (2) after each round each player sees its own payoffs but it sees neither the choices of the other players nor the resulting payoffs. This setup was previously studied by Baños [1], Megiddo [16], and by Hart and Mas-Colell [11, 12].

We can apply the results of Section 6 to prove that a player using algorithm Exp3.P.1 as a mixed strategy is Hannan-consistent in the unknown game setup whenever the payoffs obtained by the player belong to a known bounded real interval. To do that, we must first extend our results to the case when the assignment of rewards can be chosen adaptively. More precisely, we can view the payoff x_{i_t}(t), received by the gambler at trial t of the bandit problem, as the payoff u_i(i_t, s_{-i}(t)) received by player i at the t-th round of the game. However, unlike our adversarial bandit framework where all the rewards were assigned to each arm at the beginning, here the payoff u_i(i_t, s_{-i}(t)) depends on the (possibly randomized) choices of all players which, in turn, are functions of their realized payoffs. In our bandit terminology, this corresponds to assuming that the vector (x_1(t), ..., x_K(t)) of rewards for each trial t is chosen by an adversary who knows the gambler's strategy and the outcome of the gambler's random draws up to time t - 1. We leave to the interested reader the easy but lengthy task of checking that all of our results (including those of Section 6) hold under this additional assumption. Using Theorem 6.4 we then get the following.
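As an illustration of this application (a sketch under our own naming, not the paper's), a player with K pure strategies and payoffs known to lie in an interval [a, b] can simply rescale its observed payoffs to [0, 1] and feed them to the exp3p1 sketch given earlier:

    def hannan_consistent_play(K, T, delta, payoff_fn, a, b):
        """Sketch: player i runs Exp3.P.1 over its K pure strategies in the unknown game setup.
        payoff_fn(j, t) returns the player's own payoff u_i(j, s_{-i}(t)) in [a, b] for the
        pure strategy j it actually played at round t; nothing else about the game is observed."""
        def rescaled_reward(j, t):
            return (payoff_fn(j, t) - a) / (b - a)   # translate and rescale payoffs to [0, 1]
        return exp3p1(K, T, delta, rescaled_reward)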


More information

General Linear Model Introduction, Classes of Linear models and Estimation

General Linear Model Introduction, Classes of Linear models and Estimation Stat 740 General Linear Model Introduction, Classes of Linear models and Estimation An aim of scientific enquiry: To describe or to discover relationshis among events (variables) in the controlled (laboratory)

More information

Combining Logistic Regression with Kriging for Mapping the Risk of Occurrence of Unexploded Ordnance (UXO)

Combining Logistic Regression with Kriging for Mapping the Risk of Occurrence of Unexploded Ordnance (UXO) Combining Logistic Regression with Kriging for Maing the Risk of Occurrence of Unexloded Ordnance (UXO) H. Saito (), P. Goovaerts (), S. A. McKenna (2) Environmental and Water Resources Engineering, Deartment

More information

arxiv: v1 [physics.data-an] 26 Oct 2012

arxiv: v1 [physics.data-an] 26 Oct 2012 Constraints on Yield Parameters in Extended Maximum Likelihood Fits Till Moritz Karbach a, Maximilian Schlu b a TU Dortmund, Germany, moritz.karbach@cern.ch b TU Dortmund, Germany, maximilian.schlu@cern.ch

More information

Online Appendix to Accompany AComparisonof Traditional and Open-Access Appointment Scheduling Policies

Online Appendix to Accompany AComparisonof Traditional and Open-Access Appointment Scheduling Policies Online Aendix to Accomany AComarisonof Traditional and Oen-Access Aointment Scheduling Policies Lawrence W. Robinson Johnson Graduate School of Management Cornell University Ithaca, NY 14853-6201 lwr2@cornell.edu

More information

An Analysis of Reliable Classifiers through ROC Isometrics

An Analysis of Reliable Classifiers through ROC Isometrics An Analysis of Reliable Classifiers through ROC Isometrics Stijn Vanderlooy s.vanderlooy@cs.unimaas.nl Ida G. Srinkhuizen-Kuyer kuyer@cs.unimaas.nl Evgueni N. Smirnov smirnov@cs.unimaas.nl MICC-IKAT, Universiteit

More information

Information collection on a graph

Information collection on a graph Information collection on a grah Ilya O. Ryzhov Warren Powell February 10, 2010 Abstract We derive a knowledge gradient olicy for an otimal learning roblem on a grah, in which we use sequential measurements

More information

Asymptotically Optimal Simulation Allocation under Dependent Sampling

Asymptotically Optimal Simulation Allocation under Dependent Sampling Asymtotically Otimal Simulation Allocation under Deendent Samling Xiaoing Xiong The Robert H. Smith School of Business, University of Maryland, College Park, MD 20742-1815, USA, xiaoingx@yahoo.com Sandee

More information

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley Elements of Asymtotic Theory James L. Powell Deartment of Economics University of California, Berkeley Objectives of Asymtotic Theory While exact results are available for, say, the distribution of the

More information

On Doob s Maximal Inequality for Brownian Motion

On Doob s Maximal Inequality for Brownian Motion Stochastic Process. Al. Vol. 69, No., 997, (-5) Research Reort No. 337, 995, Det. Theoret. Statist. Aarhus On Doob s Maximal Inequality for Brownian Motion S. E. GRAVERSEN and G. PESKIR If B = (B t ) t

More information

Positive decomposition of transfer functions with multiple poles

Positive decomposition of transfer functions with multiple poles Positive decomosition of transfer functions with multile oles Béla Nagy 1, Máté Matolcsi 2, and Márta Szilvási 1 Deartment of Analysis, Technical University of Budaest (BME), H-1111, Budaest, Egry J. u.

More information

Topic 7: Using identity types

Topic 7: Using identity types Toic 7: Using identity tyes June 10, 2014 Now we would like to learn how to use identity tyes and how to do some actual mathematics with them. By now we have essentially introduced all inference rules

More information

Estimation of the large covariance matrix with two-step monotone missing data

Estimation of the large covariance matrix with two-step monotone missing data Estimation of the large covariance matrix with two-ste monotone missing data Masashi Hyodo, Nobumichi Shutoh 2, Takashi Seo, and Tatjana Pavlenko 3 Deartment of Mathematical Information Science, Tokyo

More information

Feedback-error control

Feedback-error control Chater 4 Feedback-error control 4.1 Introduction This chater exlains the feedback-error (FBE) control scheme originally described by Kawato [, 87, 8]. FBE is a widely used neural network based controller

More information

State Estimation with ARMarkov Models

State Estimation with ARMarkov Models Deartment of Mechanical and Aerosace Engineering Technical Reort No. 3046, October 1998. Princeton University, Princeton, NJ. State Estimation with ARMarkov Models Ryoung K. Lim 1 Columbia University,

More information

Research Article An iterative Algorithm for Hemicontractive Mappings in Banach Spaces

Research Article An iterative Algorithm for Hemicontractive Mappings in Banach Spaces Abstract and Alied Analysis Volume 2012, Article ID 264103, 11 ages doi:10.1155/2012/264103 Research Article An iterative Algorithm for Hemicontractive Maings in Banach Saces Youli Yu, 1 Zhitao Wu, 2 and

More information

Proof: We follow thearoach develoed in [4]. We adot a useful but non-intuitive notion of time; a bin with z balls at time t receives its next ball at

Proof: We follow thearoach develoed in [4]. We adot a useful but non-intuitive notion of time; a bin with z balls at time t receives its next ball at A Scaling Result for Exlosive Processes M. Mitzenmacher Λ J. Sencer We consider the following balls and bins model, as described in [, 4]. Balls are sequentially thrown into bins so that the robability

More information

Sums of independent random variables

Sums of independent random variables 3 Sums of indeendent random variables This lecture collects a number of estimates for sums of indeendent random variables with values in a Banach sace E. We concentrate on sums of the form N γ nx n, where

More information

Extension of Minimax to Infinite Matrices

Extension of Minimax to Infinite Matrices Extension of Minimax to Infinite Matrices Chris Calabro June 21, 2004 Abstract Von Neumann s minimax theorem is tyically alied to a finite ayoff matrix A R m n. Here we show that (i) if m, n are both inite,

More information

Elementary theory of L p spaces

Elementary theory of L p spaces CHAPTER 3 Elementary theory of L saces 3.1 Convexity. Jensen, Hölder, Minkowski inequality. We begin with two definitions. A set A R d is said to be convex if, for any x 0, x 1 2 A x = x 0 + (x 1 x 0 )

More information

Universal Finite Memory Coding of Binary Sequences

Universal Finite Memory Coding of Binary Sequences Deartment of Electrical Engineering Systems Universal Finite Memory Coding of Binary Sequences Thesis submitted towards the degree of Master of Science in Electrical and Electronic Engineering in Tel-Aviv

More information

AM 221: Advanced Optimization Spring Prof. Yaron Singer Lecture 6 February 12th, 2014

AM 221: Advanced Optimization Spring Prof. Yaron Singer Lecture 6 February 12th, 2014 AM 221: Advanced Otimization Sring 2014 Prof. Yaron Singer Lecture 6 February 12th, 2014 1 Overview In our revious lecture we exlored the concet of duality which is the cornerstone of Otimization Theory.

More information

MATHEMATICAL MODELLING OF THE WIRELESS COMMUNICATION NETWORK

MATHEMATICAL MODELLING OF THE WIRELESS COMMUNICATION NETWORK Comuter Modelling and ew Technologies, 5, Vol.9, o., 3-39 Transort and Telecommunication Institute, Lomonosov, LV-9, Riga, Latvia MATHEMATICAL MODELLIG OF THE WIRELESS COMMUICATIO ETWORK M. KOPEETSK Deartment

More information

The inverse Goldbach problem

The inverse Goldbach problem 1 The inverse Goldbach roblem by Christian Elsholtz Submission Setember 7, 2000 (this version includes galley corrections). Aeared in Mathematika 2001. Abstract We imrove the uer and lower bounds of the

More information

where x i is the ith coordinate of x R N. 1. Show that the following upper bound holds for the growth function of H:

where x i is the ith coordinate of x R N. 1. Show that the following upper bound holds for the growth function of H: Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 2 October 25, 2017 Due: November 08, 2017 A. Growth function Growth function of stum functions.

More information

Homework Solution 4 for APPM4/5560 Markov Processes

Homework Solution 4 for APPM4/5560 Markov Processes Homework Solution 4 for APPM4/556 Markov Processes 9.Reflecting random walk on the line. Consider the oints,,, 4 to be marked on a straight line. Let X n be a Markov chain that moves to the right with

More information

arxiv:cond-mat/ v2 25 Sep 2002

arxiv:cond-mat/ v2 25 Sep 2002 Energy fluctuations at the multicritical oint in two-dimensional sin glasses arxiv:cond-mat/0207694 v2 25 Se 2002 1. Introduction Hidetoshi Nishimori, Cyril Falvo and Yukiyasu Ozeki Deartment of Physics,

More information

RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES

RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES AARON ZWIEBACH Abstract. In this aer we will analyze research that has been recently done in the field of discrete

More information

On the Toppling of a Sand Pile

On the Toppling of a Sand Pile Discrete Mathematics and Theoretical Comuter Science Proceedings AA (DM-CCG), 2001, 275 286 On the Toling of a Sand Pile Jean-Christohe Novelli 1 and Dominique Rossin 2 1 CNRS, LIFL, Bâtiment M3, Université

More information

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley Elements of Asymtotic Theory James L. Powell Deartment of Economics University of California, Berkeley Objectives of Asymtotic Theory While exact results are available for, say, the distribution of the

More information

IMPROVED BOUNDS IN THE SCALED ENFLO TYPE INEQUALITY FOR BANACH SPACES

IMPROVED BOUNDS IN THE SCALED ENFLO TYPE INEQUALITY FOR BANACH SPACES IMPROVED BOUNDS IN THE SCALED ENFLO TYPE INEQUALITY FOR BANACH SPACES OHAD GILADI AND ASSAF NAOR Abstract. It is shown that if (, ) is a Banach sace with Rademacher tye 1 then for every n N there exists

More information

1-way quantum finite automata: strengths, weaknesses and generalizations

1-way quantum finite automata: strengths, weaknesses and generalizations 1-way quantum finite automata: strengths, weaknesses and generalizations arxiv:quant-h/9802062v3 30 Se 1998 Andris Ambainis UC Berkeley Abstract Rūsiņš Freivalds University of Latvia We study 1-way quantum

More information

Statics and dynamics: some elementary concepts

Statics and dynamics: some elementary concepts 1 Statics and dynamics: some elementary concets Dynamics is the study of the movement through time of variables such as heartbeat, temerature, secies oulation, voltage, roduction, emloyment, rices and

More information

ECE 534 Information Theory - Midterm 2

ECE 534 Information Theory - Midterm 2 ECE 534 Information Theory - Midterm Nov.4, 009. 3:30-4:45 in LH03. You will be given the full class time: 75 minutes. Use it wisely! Many of the roblems have short answers; try to find shortcuts. You

More information

1. INTRODUCTION. Fn 2 = F j F j+1 (1.1)

1. INTRODUCTION. Fn 2 = F j F j+1 (1.1) CERTAIN CLASSES OF FINITE SUMS THAT INVOLVE GENERALIZED FIBONACCI AND LUCAS NUMBERS The beautiful identity R.S. Melham Deartment of Mathematical Sciences, University of Technology, Sydney PO Box 23, Broadway,

More information

Improved Capacity Bounds for the Binary Energy Harvesting Channel

Improved Capacity Bounds for the Binary Energy Harvesting Channel Imroved Caacity Bounds for the Binary Energy Harvesting Channel Kaya Tutuncuoglu 1, Omur Ozel 2, Aylin Yener 1, and Sennur Ulukus 2 1 Deartment of Electrical Engineering, The Pennsylvania State University,

More information

1 Extremum Estimators

1 Extremum Estimators FINC 9311-21 Financial Econometrics Handout Jialin Yu 1 Extremum Estimators Let θ 0 be a vector of k 1 unknown arameters. Extremum estimators: estimators obtained by maximizing or minimizing some objective

More information

Morten Frydenberg Section for Biostatistics Version :Friday, 05 September 2014

Morten Frydenberg Section for Biostatistics Version :Friday, 05 September 2014 Morten Frydenberg Section for Biostatistics Version :Friday, 05 Setember 204 All models are aroximations! The best model does not exist! Comlicated models needs a lot of data. lower your ambitions or get

More information

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests 009 American Control Conference Hyatt Regency Riverfront, St. Louis, MO, USA June 0-, 009 FrB4. System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests James C. Sall Abstract

More information

ON THE LEAST SIGNIFICANT p ADIC DIGITS OF CERTAIN LUCAS NUMBERS

ON THE LEAST SIGNIFICANT p ADIC DIGITS OF CERTAIN LUCAS NUMBERS #A13 INTEGERS 14 (014) ON THE LEAST SIGNIFICANT ADIC DIGITS OF CERTAIN LUCAS NUMBERS Tamás Lengyel Deartment of Mathematics, Occidental College, Los Angeles, California lengyel@oxy.edu Received: 6/13/13,

More information

Age of Information: Whittle Index for Scheduling Stochastic Arrivals

Age of Information: Whittle Index for Scheduling Stochastic Arrivals Age of Information: Whittle Index for Scheduling Stochastic Arrivals Yu-Pin Hsu Deartment of Communication Engineering National Taiei University yuinhsu@mail.ntu.edu.tw arxiv:80.03422v2 [math.oc] 7 Ar

More information

B8.1 Martingales Through Measure Theory. Concept of independence

B8.1 Martingales Through Measure Theory. Concept of independence B8.1 Martingales Through Measure Theory Concet of indeendence Motivated by the notion of indeendent events in relims robability, we have generalized the concet of indeendence to families of σ-algebras.

More information

ECE 6960: Adv. Random Processes & Applications Lecture Notes, Fall 2010

ECE 6960: Adv. Random Processes & Applications Lecture Notes, Fall 2010 ECE 6960: Adv. Random Processes & Alications Lecture Notes, Fall 2010 Lecture 16 Today: (1) Markov Processes, (2) Markov Chains, (3) State Classification Intro Please turn in H 6 today. Read Chater 11,

More information

On the capacity of the general trapdoor channel with feedback

On the capacity of the general trapdoor channel with feedback On the caacity of the general tradoor channel with feedback Jui Wu and Achilleas Anastasooulos Electrical Engineering and Comuter Science Deartment University of Michigan Ann Arbor, MI, 48109-1 email:

More information

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Technical Sciences and Alied Mathematics MODELING THE RELIABILITY OF CISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Cezar VASILESCU Regional Deartment of Defense Resources Management

More information

Quantitative estimates of propagation of chaos for stochastic systems with W 1, kernels

Quantitative estimates of propagation of chaos for stochastic systems with W 1, kernels oname manuscrit o. will be inserted by the editor) Quantitative estimates of roagation of chaos for stochastic systems with W, kernels Pierre-Emmanuel Jabin Zhenfu Wang Received: date / Acceted: date Abstract

More information

t 0 Xt sup X t p c p inf t 0

t 0 Xt sup X t p c p inf t 0 SHARP MAXIMAL L -ESTIMATES FOR MARTINGALES RODRIGO BAÑUELOS AND ADAM OSȨKOWSKI ABSTRACT. Let X be a suermartingale starting from 0 which has only nonnegative jums. For each 0 < < we determine the best

More information

Shadow Computing: An Energy-Aware Fault Tolerant Computing Model

Shadow Computing: An Energy-Aware Fault Tolerant Computing Model Shadow Comuting: An Energy-Aware Fault Tolerant Comuting Model Bryan Mills, Taieb Znati, Rami Melhem Deartment of Comuter Science University of Pittsburgh (bmills, znati, melhem)@cs.itt.edu Index Terms

More information

HENSEL S LEMMA KEITH CONRAD

HENSEL S LEMMA KEITH CONRAD HENSEL S LEMMA KEITH CONRAD 1. Introduction In the -adic integers, congruences are aroximations: for a and b in Z, a b mod n is the same as a b 1/ n. Turning information modulo one ower of into similar

More information

Analysis of Multi-Hop Emergency Message Propagation in Vehicular Ad Hoc Networks

Analysis of Multi-Hop Emergency Message Propagation in Vehicular Ad Hoc Networks Analysis of Multi-Ho Emergency Message Proagation in Vehicular Ad Hoc Networks ABSTRACT Vehicular Ad Hoc Networks (VANETs) are attracting the attention of researchers, industry, and governments for their

More information

p-adic Measures and Bernoulli Numbers

p-adic Measures and Bernoulli Numbers -Adic Measures and Bernoulli Numbers Adam Bowers Introduction The constants B k in the Taylor series exansion t e t = t k B k k! k=0 are known as the Bernoulli numbers. The first few are,, 6, 0, 30, 0,

More information

#A64 INTEGERS 18 (2018) APPLYING MODULAR ARITHMETIC TO DIOPHANTINE EQUATIONS

#A64 INTEGERS 18 (2018) APPLYING MODULAR ARITHMETIC TO DIOPHANTINE EQUATIONS #A64 INTEGERS 18 (2018) APPLYING MODULAR ARITHMETIC TO DIOPHANTINE EQUATIONS Ramy F. Taki ElDin Physics and Engineering Mathematics Deartment, Faculty of Engineering, Ain Shams University, Cairo, Egyt

More information

STA 250: Statistics. Notes 7. Bayesian Approach to Statistics. Book chapters: 7.2

STA 250: Statistics. Notes 7. Bayesian Approach to Statistics. Book chapters: 7.2 STA 25: Statistics Notes 7. Bayesian Aroach to Statistics Book chaters: 7.2 1 From calibrating a rocedure to quantifying uncertainty We saw that the central idea of classical testing is to rovide a rigorous

More information

arxiv: v2 [math.na] 6 Apr 2016

arxiv: v2 [math.na] 6 Apr 2016 Existence and otimality of strong stability reserving linear multiste methods: a duality-based aroach arxiv:504.03930v [math.na] 6 Ar 06 Adrián Németh January 9, 08 Abstract David I. Ketcheson We rove

More information

1 1 c (a) 1 (b) 1 Figure 1: (a) First ath followed by salesman in the stris method. (b) Alternative ath. 4. D = distance travelled closing the loo. Th

1 1 c (a) 1 (b) 1 Figure 1: (a) First ath followed by salesman in the stris method. (b) Alternative ath. 4. D = distance travelled closing the loo. Th 18.415/6.854 Advanced Algorithms ovember 7, 1996 Euclidean TSP (art I) Lecturer: Michel X. Goemans MIT These notes are based on scribe notes by Marios Paaefthymiou and Mike Klugerman. 1 Euclidean TSP Consider

More information
