The non-stochastic multi-armed bandit problem


Submitted for journal publication.

The non-stochastic multi-armed bandit problem

Peter Auer, Institute for Theoretical Computer Science, Graz University of Technology, A-8010 Graz (Austria)
Nicolò Cesa-Bianchi, Department of Computer Science, Università di Milano, Milano (Italy)
Yoav Freund, Banter, Inc., 214 Willow Ave., Apt. #5A, Hoboken, NJ
Robert E. Schapire, AT&T Labs, 180 Park Avenue, Florham Park, NJ

November 20, 2001

Abstract

In the multi-armed bandit problem, a gambler must decide which arm of K non-identical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Past solutions for the bandit problem have almost always relied on assumptions about the statistics of the slot machines. In this work, we make no statistical assumptions whatsoever about the nature of the process generating the payoffs of the slot machines. We give a solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs. In a sequence of T plays, we prove that the per-round payoff of our algorithm approaches that of the best arm at the rate O(T^{-1/2}). We show by a matching lower bound that this is best possible. We also prove that our algorithm approaches the per-round payoff of any set of strategies at a similar rate: if the best strategy is chosen from a pool of N strategies, then our algorithm

approaches the per-round payoff of this strategy at the rate O((log N)^{1/2} T^{-1/2}). Finally, we apply our results to the problem of playing an unknown repeated matrix game. We show that our algorithm approaches the minimax payoff of the unknown game at the rate O(T^{-1/2}).

Keywords: adversarial bandit problem, unknown matrix games
AMS subject classification: 68Q32, 68T05, 91A20

1 Introduction

In the multi-armed bandit problem, originally proposed by Robbins [19], a gambler must choose which of K slot machines to play. At each time step, he pulls the arm of one of the machines and receives a reward or payoff (possibly zero or negative). The gambler's purpose is to maximize his return, i.e., the sum of the rewards he receives over a sequence of pulls. In this model, each arm is assumed to deliver rewards that are independently drawn from a fixed and unknown distribution. As reward distributions differ from arm to arm, the goal is to find the arm with the highest expected payoff as early as possible, and then to keep gambling using that best arm.

The problem is a paradigmatic example of the trade-off between exploration and exploitation. On the one hand, if the gambler plays exclusively on the machine that he thinks is best ("exploitation"), he may fail to discover that one of the other arms actually has a higher expected payoff. On the other hand, if he spends too much time trying out all the machines and gathering statistics ("exploration"), he may fail to play the best arm often enough to get a high return.

The gambler's performance is typically measured in terms of regret. This is the difference between the expected return of the optimal strategy (pulling consistently the best arm) and the gambler's expected return. Lai and Robbins [14] proved that the gambler's regret over T pulls can be made, for T -> infinity, as small as O(ln T). Furthermore, they prove that this bound is optimal in the following sense: there does not exist a strategy for the gambler with a better asymptotic performance.

Though this formulation of the bandit problem allows an elegant statistical treatment of the exploration-exploitation trade-off, it may not be adequate to model certain environments. As a motivating example, consider the task of repeatedly choosing a route for transmitting packets between two points in a communication network. To cast this scenario within the bandit problem, suppose there is only a fixed number of possible routes and the transmission cost is reported back to the sender. Now, it is likely that the costs associated with each route cannot be modeled by a stationary distribution, so a more sophisticated set of statistical assumptions would be required. In general, it may be difficult or impossible to determine the right statistical assumptions for a given domain, and some domains may exhibit dependencies to an extent that no such assumptions are appropriate.

To provide a framework where one could model scenarios like the one sketched above, we present the adversarial bandit problem, a variant of the bandit problem in which no statistical assumptions are made about the generation of rewards. We only assume that each slot machine is initially assigned an arbitrary and unknown sequence of rewards, one for each time step, chosen from a bounded real interval. Each time the gambler pulls the arm of a slot machine he receives the corresponding reward from the sequence assigned to that slot machine. To measure the gambler's performance in this setting we replace the notion of (statistical) regret with that of worst-case

regret. Given any sequence (j_1, ..., j_T) of pulls, where T > 0 is an arbitrary time horizon and each j_t is the index of an arm, the worst-case regret of a gambler for this sequence of pulls is the difference between the return the gambler would have had by pulling arms j_1, ..., j_T and the actual gambler's return, where both returns are determined by the initial assignment of rewards. It is easy to see that, in this model, the gambler cannot keep his regret small (say, sublinear in T) for all sequences of pulls and with respect to the worst-case assignment of rewards to the arms. Thus, to make the problem feasible, we allow the regret to depend on the "hardness" of the sequence of pulls for which it is measured, where the hardness of a sequence is roughly the number of times one has to change the slot machine currently being played in order to pull the arms in the order given by the sequence. This trick allows us to effectively control the worst-case regret simultaneously for all sequences of pulls, even though (as one should expect) our regret bounds become trivial when the hardness of the sequence (j_1, ..., j_T) we compete against gets too close to T.

As a remark, note that a deterministic bandit problem was also considered by Gittins [9] and Ishikida and Varaiya [13]. However, their version of the bandit problem is very different from ours: they assume that the player can compute ahead of time exactly what payoffs will be received from each arm, and their problem is thus one of optimization, rather than exploration and exploitation.

Our most general result is a very efficient, randomized player algorithm whose expected regret for any sequence of pulls is O(S sqrt(KT ln(KT))),¹ where S is the hardness of the sequence (see Theorem 8.1 and Corollaries 8.2, 8.4). Note that this bound holds simultaneously for all sequences of pulls, for any assignment of rewards to the arms, and uniformly over the time horizon T. If the gambler is willing to impose an upper bound S on the hardness of the sequences of pulls for which he wants to measure his regret, an improved bound O(sqrt(S K T ln(KT))) on the expected regret for these sequences can be proven (see Corollaries 8.3 and 8.5).

With the purpose of establishing connections with certain results in game theory, we also look at a special case of the worst-case regret, which we call weak regret. Given a time horizon T, call "best arm" the arm that has the highest return (sum of assigned rewards) up to time T with respect to the initial assignment of rewards. The gambler's weak regret is the difference between the return of this best arm and the actual gambler's return. In the paper we introduce a randomized player algorithm, tailored to this notion of regret, whose expected weak regret is O(sqrt(K G_max ln K)), where G_max is the return of the best arm (see Theorem 4.1 in Section 4). As before, this bound holds for any assignment of rewards to the arms and uniformly over the choice of the time horizon T. Using a more complex player algorithm, we also prove that the weak regret is O(sqrt(KT ln(KT/δ))) with probability at least 1 - δ over the algorithm's randomization, for any fixed δ > 0 (see Theorems 6.3 and 6.4 in Section 6). This also implies that, asymptotically for T -> infinity and K constant, the weak regret is O(sqrt(T) (ln T)^{1+ε}) with probability 1 for any fixed ε > 0 (see Corollary 6.5).

Our worst-case bounds may appear weaker than the bounds proved using statistical assumptions, such as those shown by Lai and Robbins [14] of the form O(ln T).
However, when comparing our results to those in the statistics literature, it is important to point out an important difference in the asymptotic quantification. In the work of Lai and Robbins the assumption is that the distribution of rewards that is associated with each arm is fixed as the total number of iterations T increases to infinity. In contrast, our bounds hold for any finite T, and, by the generality of our

¹ Though in this introduction we use the compact asymptotic notation, our bounds are proven for each finite T and almost always with explicit constants.

model, these bounds are applicable when the payoffs are randomly (or adversarially) chosen in a manner that does depend on T. It is this quantification order, and not the adversarial nature of our framework, which is the cause for the apparent gap. We prove this point in Theorem 5.1 where we show that, for any player algorithm for the K-armed bandit problem and for any T, there exists a set of K reward distributions such that the expected weak regret of the algorithm when playing on these arms for T time steps is Ω(sqrt(KT)).

So far we have considered notions of regret that compare the return of the gambler to the return of a sequence of pulls or to the return of the best arm. A further notion of regret which we explore is the regret for the best strategy in a given set of strategies that are available to the gambler. The notion of "strategy" generalizes that of "sequence of pulls": at each time step a strategy gives a recommendation, in the form of a probability distribution over the K arms, as to which arm to play next. Given an assignment of rewards to the arms and a set of N strategies for the gambler, call "best strategy" the strategy that yields the highest return with respect to this assignment. Then the regret for the best strategy is the difference between the return of this best strategy and the actual gambler's return. Using a randomized player that combines the choices of the N strategies (in the same vein as the algorithms for prediction with expert advice from [3]), we show that the expected regret for the best strategy is O(sqrt(KT ln N)); see Theorem 7.1. Note that the dependence on the number of strategies is only logarithmic, and therefore the bound is quite reasonable even when the player is combining a very large number of strategies.

The adversarial bandit problem is closely related to the problem of learning to play an unknown N-person finite game, where the same game is played repeatedly by N players. A desirable property for a player is Hannan-consistency, which is similar to saying (in our bandit framework) that the weak regret per time step of the player converges to 0 with probability 1. Examples of Hannan-consistent player strategies have been provided by several authors in the past (see [18] for a survey of these results). By applying (slight extensions of) Theorems 6.3 and 6.4, we can provide an example of a simple Hannan-consistent player whose convergence rate is optimal up to logarithmic factors.

Our player algorithms are based in part on an algorithm presented by Freund and Schapire [6, 7], which in turn is a variant of Littlestone and Warmuth's [15] weighted majority algorithm, and Vovk's [20] aggregating strategies. In the setting analyzed by Freund and Schapire the player scores on each pull the reward of the chosen arm, but gains access to the rewards associated with all of the arms (not just the one that was chosen).

2 Notation and terminology

An adversarial bandit problem is specified by the number K of possible actions, where each action is denoted by an integer 1 <= i <= K, and by an assignment of rewards, i.e., an infinite sequence x(1), x(2), ... of vectors x(t) = (x_1(t), ..., x_K(t)), where x_i(t) ∈ [0, 1] denotes the reward obtained if action i is chosen at time step (also called "trial") t. (Even though throughout the paper we will assume that all rewards belong to the [0, 1] interval, the generalization of our results to rewards in [a, b] for arbitrary a < b is straightforward.) We assume that the player knows the number K of actions.
Furthermore, after each trial t, we assume the player only knows the rewards x_{i_1}(1), ..., x_{i_t}(t) of the previously chosen actions i_1, ..., i_t. In this respect, we can view the player

algorithm as a sequence I_1, I_2, ..., where each I_t is a mapping from the set ({1, ..., K} × [0, 1])^{t-1} of action indices and previous rewards to the set of action indices. For any reward assignment and for any T > 0, let

  G_A(T) ≝ Σ_{t=1}^T x_{i_t}(t)

be the return at time horizon T of algorithm A choosing actions i_1, i_2, .... In what follows, we will write G_A instead of G_A(T) whenever the value of T is clear from the context.

Our measure of performance for a player algorithm is the worst-case regret, and in this paper we explore variants of the notion of regret. Given any time horizon T > 0 and any sequence of actions (j_1, ..., j_T), the (worst-case) regret of algorithm A for (j_1, ..., j_T) is the difference

  G_{(j_1,...,j_T)} - G_A(T)   (1)

where

  G_{(j_1,...,j_T)} ≝ Σ_{t=1}^T x_{j_t}(t)

is the return, at time horizon T, obtained by choosing actions j_1, ..., j_T. Hence, the regret (1) measures how much the player lost (or gained, depending on the sign of the difference) by following strategy A instead of choosing actions j_1, ..., j_T. A special case of this is the regret of A for the best single action (which we will call weak regret for short), defined by

  G_max(T) - G_A(T)

where

  G_max(T) ≝ max_j Σ_{t=1}^T x_j(t)

is the return of the single globally best action at time horizon T. As before, we will write G_max instead of G_max(T) whenever the value of T is clear from the context.

As our player algorithms will be randomized, fixing a player algorithm defines a probability distribution over the set of all sequences of actions. All the probabilities P{·} and expectations E[·] considered in this paper will be taken with respect to this distribution.

In what follows, we will prove two kinds of bounds on the performance of a (randomized) player A. The first is a bound on the expected regret

  G_{(j_1,...,j_T)} - E[G_A(T)]

of A for an arbitrary sequence (j_1, ..., j_T) of actions. The second is a confidence bound on the weak regret. This has the form

  P{ G_max(T) > G_A(T) + ε } <= δ

and states that, with high probability, the return of A up to time T is not much smaller than that of the globally best action.

Finally, we remark that all of our bounds hold for any sequence x(1), x(2), ... of reward assignments, and most of them hold uniformly over the time horizon T (i.e., they hold for all T without requiring T as input parameter).

Algorithm Exp3
Parameters: Real γ ∈ (0, 1].
Initialization: w_i(1) = 1 for i = 1, ..., K.

For each t = 1, 2, ...
  1. Set
       p_i(t) = (1 - γ) w_i(t) / Σ_{j=1}^K w_j(t) + γ/K,   i = 1, ..., K.
  2. Draw i_t randomly according to the probabilities p_1(t), ..., p_K(t).
  3. Receive reward x_{i_t}(t) ∈ [0, 1].
  4. For j = 1, ..., K set
       x̂_j(t) = x_j(t)/p_j(t) if j = i_t, and x̂_j(t) = 0 otherwise;
       w_j(t+1) = w_j(t) exp(γ x̂_j(t)/K).

Figure 1: Pseudo-code of algorithm Exp3 for the weak regret.

3 Upper bounds on the weak regret

In this section we present and analyze our simplest player algorithm, Exp3 (which stands for "Exponential-weight algorithm for Exploration and Exploitation"). We will show a bound on the expected regret of Exp3 with respect to the single best action. In the next sections, we will greatly strengthen this result.

The algorithm Exp3, described in Figure 1, is a variant of the algorithm Hedge introduced by Freund and Schapire [6] for solving a different worst-case sequential allocation problem. On each time step t, Exp3 draws an action i_t according to the distribution p_1(t), ..., p_K(t). This distribution is a mixture of the uniform distribution and a distribution which assigns to each action a probability mass exponential in the estimated cumulative reward for that action. Intuitively, mixing in the uniform distribution is done to make sure that the algorithm tries out all K actions and gets good estimates of the rewards for each. Otherwise, the algorithm might miss a good action because the initial rewards it observes for this action are low and large rewards that occur later are not observed because the action is not selected.

For the drawn action i_t, Exp3 sets the estimated reward x̂_{i_t}(t) to x_{i_t}(t)/p_{i_t}(t). Dividing the actual gain by the probability that the action was chosen compensates the reward of actions that are unlikely to be chosen. This choice of estimated rewards guarantees that their expectations are equal to the actual rewards for each action; that is, E[x̂_j(t) | i_1, ..., i_{t-1}] = x_j(t), where the expectation is taken with respect to the random choice of i_t at trial t given the choices i_1, ..., i_{t-1} in the previous t - 1 trials.
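As a concrete illustration of the update just described, here is a minimal Python sketch of Exp3 (Figure 1). The function and argument names (exp3, reward_fn, gamma) are ours, not the paper's, and reward_fn is an assumed oracle returning the reward x_i(t) ∈ [0, 1] of the chosen action.

    import math
    import random

    def exp3(K, T, gamma, reward_fn):
        """Minimal sketch of Exp3 (Figure 1). reward_fn(i, t) returns x_i(t) in [0, 1]."""
        w = [1.0] * K                                                    # w_i(1) = 1
        total_reward = 0.0
        for t in range(1, T + 1):
            W = sum(w)
            p = [(1 - gamma) * w[i] / W + gamma / K for i in range(K)]   # step 1
            i_t = random.choices(range(K), weights=p)[0]                 # step 2
            x = reward_fn(i_t, t)                                        # step 3
            total_reward += x
            x_hat = x / p[i_t]                                           # step 4 (x_hat is 0 for j != i_t)
            w[i_t] *= math.exp(gamma * x_hat / K)
        return total_reward

If an upper bound g on G_max is available (for instance g = T when the horizon is known), γ would be tuned as in Corollary 3.2 below, γ = min{1, sqrt(K ln K / ((e-1) g))}.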

We now give the first main theorem of this paper, which bounds the expected weak regret of algorithm Exp3.

Theorem 3.1  For any K > 0 and for any γ ∈ (0, 1],

  G_max - E[G_{Exp3}] <= (e - 1) γ G_max + (K ln K)/γ

holds for any assignment of rewards and for any T > 0.

To understand this theorem, it is helpful to consider a simpler bound which can be obtained by an appropriate choice of the parameter γ.

Corollary 3.2  For any T > 0, assume that g >= G_max and that algorithm Exp3 is run with input parameter

  γ = min{ 1, sqrt( K ln K / ((e - 1) g) ) }.

Then

  G_max - E[G_{Exp3}] <= 2 sqrt(e - 1) sqrt(g K ln K) <= 2.63 sqrt(g K ln K)

holds for any assignment of rewards.

Proof. If g <= (K ln K)/(e - 1), then the bound is trivial since the expected regret cannot be more than g. Otherwise, by Theorem 3.1, the expected regret is at most

  (e - 1) γ G_max + (K ln K)/γ <= (e - 1) γ g + (K ln K)/γ = 2 sqrt(e - 1) sqrt(g K ln K)

as desired. □

To apply Corollary 3.2, it is necessary that an upper bound g on G_max(T) be available for tuning γ. For example, if the time horizon T is known then, since no action can have payoff greater than 1 on any trial, we can use g = T as an upper bound. In Section 4, we give a technique that does not require prior knowledge of such an upper bound, yielding a result which holds uniformly over T.

If the rewards x_i(t) are in the range [a, b], a < b, then Exp3 can be used after the rewards have been translated and rescaled to the range [0, 1]. Applying Corollary 3.2 with g = T gives the bound (b - a) 2 sqrt(e - 1) sqrt(T K ln K) on the regret. For instance, this is applicable to a standard loss model where the rewards fall in the range [-1, 0].

Proof of Theorem 3.1. Here (and also throughout the paper without explicit mention) we use the following simple facts, which are immediately derived from the definitions:

  x̂_i(t) <= 1/p_i(t) <= K/γ,   (2)

  Σ_{i=1}^K p_i(t) x̂_i(t) = p_{i_t}(t) · x_{i_t}(t)/p_{i_t}(t) = x_{i_t}(t),   (3)

  Σ_{i=1}^K p_i(t) x̂_i(t)² = p_{i_t}(t) · (x_{i_t}(t)/p_{i_t}(t)) · x̂_{i_t}(t) = x_{i_t}(t) x̂_{i_t}(t) <= x̂_{i_t}(t) <= Σ_{i=1}^K x̂_i(t).   (4)

Let W_t = w_1(t) + ... + w_K(t). For all sequences i_1, ..., i_T of actions drawn by Exp3,

  W_{t+1}/W_t = Σ_{i=1}^K w_i(t+1)/W_t
    = Σ_{i=1}^K (w_i(t)/W_t) exp((γ/K) x̂_i(t))
    = Σ_{i=1}^K [(p_i(t) - γ/K)/(1 - γ)] exp((γ/K) x̂_i(t))   (5)
    <= Σ_{i=1}^K [(p_i(t) - γ/K)/(1 - γ)] [1 + (γ/K) x̂_i(t) + (e - 2)(γ/K)² x̂_i(t)²]   (6)
    <= 1 + (γ/K)/(1 - γ) Σ_{i=1}^K p_i(t) x̂_i(t) + (e - 2)(γ/K)²/(1 - γ) Σ_{i=1}^K p_i(t) x̂_i(t)²   (7)
    <= 1 + (γ/K)/(1 - γ) x_{i_t}(t) + (e - 2)(γ/K)²/(1 - γ) Σ_{i=1}^K x̂_i(t).   (8)

Eq. (5) uses the definition of p_i(t) in Figure 1. Eq. (6) uses the fact that e^x <= 1 + x + (e - 2)x² for x <= 1; the expression (γ/K) x̂_i(t) in the preceding line is at most 1 by Eq. (2). Eq. (8) uses Eqs. (3) and (4). Taking logarithms and using 1 + x <= e^x gives

  ln(W_{t+1}/W_t) <= (γ/K)/(1 - γ) x_{i_t}(t) + (e - 2)(γ/K)²/(1 - γ) Σ_{i=1}^K x̂_i(t).

Summing over t we then get

  ln(W_{T+1}/W_1) <= (γ/K)/(1 - γ) G_{Exp3} + (e - 2)(γ/K)²/(1 - γ) Σ_{t=1}^T Σ_{i=1}^K x̂_i(t).   (9)

For any action j,

  ln(W_{T+1}/W_1) >= ln(w_j(T+1)/W_1) = (γ/K) Σ_{t=1}^T x̂_j(t) - ln K.

Combining with Eq. (9), we get

  G_{Exp3} >= (1 - γ) Σ_{t=1}^T x̂_j(t) - (K ln K)/γ - (e - 2)(γ/K) Σ_{t=1}^T Σ_{i=1}^K x̂_i(t).   (10)

We next take the expectation of both sides of (10) with respect to the distribution of (i_1, ..., i_T). For the expected value of each x̂_i(t), we have

  E[x̂_i(t) | i_1, ..., i_{t-1}] = p_i(t) · x_i(t)/p_i(t) + (1 - p_i(t)) · 0 = x_i(t).   (11)

Combining (10) and (11), we find that

  E[G_{Exp3}] >= (1 - γ) Σ_{t=1}^T x_j(t) - (K ln K)/γ - (e - 2)(γ/K) Σ_{t=1}^T Σ_{i=1}^K x_i(t).

Since j was chosen arbitrarily and Σ_{t=1}^T Σ_{i=1}^K x_i(t) <= K G_max, we obtain the inequality in the statement of the theorem. □

Additional notation. As our other player algorithms will be variants of Exp3, we find it convenient to define some further notation based on the quantities used in the analysis of Exp3. For each 1 <= i <= K and for each t >= 1 define

  G_i(t+1) ≝ Σ_{s=1}^t x_i(s),
  Ĝ_i(t+1) ≝ Σ_{s=1}^t x̂_i(s),
  Ĝ_max(t+1) ≝ max_{1<=i<=K} Ĝ_i(t+1).

4 Bounds on the weak regret that hold uniformly over time

In Section 3, we showed that Exp3 yields an expected regret of O(sqrt(K g ln K)) whenever an upper bound g on the return G_max of the best action is known in advance. A bound of O(sqrt(KT ln K)), which holds uniformly over T, could be easily proven via the guessing techniques which will be used to prove Corollaries 8.4 and 8.5 in Section 8. In this section, instead, we describe an algorithm, called Exp3.1, whose expected weak regret is O(sqrt(K G_max ln K)) uniformly over T. As G_max = G_max(T) <= T, this bound is never worse than O(sqrt(KT ln K)) and is substantially better whenever the return of the best arm is small compared to T.

Our algorithm Exp3.1, described in Figure 2, proceeds in epochs, where each epoch consists of a sequence of trials. We use r = 0, 1, 2, ... to index the epochs. On epoch r, the algorithm guesses a bound g_r for the return of the best action. It then uses this guess to tune the parameter γ of Exp3, restarting Exp3 at the beginning of each epoch. As usual, we use t to denote the current time step.² Exp3.1 maintains an estimate Ĝ_i(t+1) of the return of each action i. Since E[x̂_i(t)] = x_i(t), this estimate will be unbiased in the sense that E[Ĝ_i(t+1)] = G_i(t+1) for all i and t. Using these estimates, the algorithm detects (approximately) when the actual gain of some action has advanced beyond g_r. When this happens, the algorithm goes on to the next epoch, restarting Exp3 with a larger bound on the maximal gain.

² Note that, in general, this t may differ from the local variable t used by Exp3, which we now regard as a subroutine. Throughout this section, we will only use t to refer to the total number of trials as in Figure 2.
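The formal pseudo-code is given in Figure 2 below. As a rough illustration of the epoch structure, the following Python sketch (names ours, with Exp3 inlined as the inner loop and reward_fn an assumed reward oracle) restarts Exp3 with the doubled guesses g_r until the time horizon is exhausted.

    import math
    import random

    def exp3_1(K, T, reward_fn):
        """Sketch of Exp3.1 (Figure 2): restart Exp3 on each epoch r with guess g_r."""
        c = K * math.log(K) / (math.e - 1)
        G_hat = [0.0] * K                      # estimated returns G_hat_i, accumulated across all epochs
        total_reward = 0.0
        t, r = 1, 0
        while t <= T:
            g_r = c * 4 ** r
            gamma = min(1.0, math.sqrt(K * math.log(K) / ((math.e - 1) * g_r)))
            w = [1.0] * K                      # restart Exp3
            while t <= T and max(G_hat) <= g_r - K / gamma:
                W = sum(w)
                p = [(1 - gamma) * w[i] / W + gamma / K for i in range(K)]
                i_t = random.choices(range(K), weights=p)[0]
                x = reward_fn(i_t, t)
                total_reward += x
                x_hat = x / p[i_t]
                G_hat[i_t] += x_hat            # step (b): only the chosen action's estimate changes
                w[i_t] *= math.exp(gamma * x_hat / K)
                t += 1
            r += 1
        return total_reward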

Algorithm Exp3.1
Initialization: Let t = 1, and Ĝ_i(1) = 0 for i = 1, ..., K.

Repeat for r = 0, 1, 2, ...
  1. Let g_r = (K ln K)/(e - 1) · 4^r.
  2. Restart Exp3 choosing γ_r = min{ 1, sqrt( K ln K / ((e - 1) g_r) ) }.
  3. While max_i Ĝ_i(t) <= g_r - K/γ_r do:
     (a) Let i_t be the random action chosen by Exp3 and x_{i_t}(t) the corresponding reward.
     (b) Ĝ_i(t+1) = Ĝ_i(t) + x̂_i(t) for i = 1, ..., K.
     (c) t := t + 1.

Figure 2: Pseudo-code of algorithm Exp3.1 to control the weak regret uniformly over time.

The performance of the algorithm is characterized by the following theorem which is the main result of this section.

Theorem 4.1  For any K > 0,

  G_max - E[G_{Exp3.1}] <= 8 sqrt(e - 1) sqrt(G_max K ln K) + 8(e - 1) K + 2 K ln K
    <= 10.5 sqrt(G_max K ln K) + 13.8 K + 2 K ln K

holds for any assignment of rewards and for any T > 0.

The proof of the theorem is divided into two lemmas. The first bounds the regret suffered on each epoch, and the second bounds the total number of epochs.

Fix T arbitrarily and define the following random variables: Let R be the total number of epochs (i.e., the final value of r). Let S_r and T_r be the first and last time steps completed on epoch r (where, for convenience, we define T_R = T). Thus, epoch r consists of trials S_r, S_r + 1, ..., T_r. Note that, in degenerate cases, some epochs may be empty, in which case S_r = T_r + 1. Let Ĝ_max = Ĝ_max(T + 1).

Lemma 4.2  For any action j and for every epoch r,

  Σ_{t=S_r}^{T_r} x_{i_t}(t) >= Σ_{t=S_r}^{T_r} x̂_j(t) - 2 sqrt(e - 1) sqrt(g_r K ln K).

Proof. If S_r > T_r (so that no trials occur on epoch r), then the lemma holds trivially since both summations will be equal to zero. Assume then that S_r <= T_r. Let g = g_r and γ = γ_r. We use (10) from the proof of Theorem 3.1:

  Σ_{t=S_r}^{T_r} x_{i_t}(t) >= (1 - γ) Σ_{t=S_r}^{T_r} x̂_j(t) - (K ln K)/γ - (e - 2)(γ/K) Σ_{t=S_r}^{T_r} Σ_{i=1}^K x̂_i(t).

From the definition of the termination condition we know that Ĝ_i(T_r) <= g - K/γ. Using (2), we get x̂_i(t) <= K/γ. This implies that Ĝ_i(T_r + 1) <= g for all i. Thus,

  Σ_{t=S_r}^{T_r} x_{i_t}(t) >= Σ_{t=S_r}^{T_r} x̂_j(t) - g (γ + (e - 2)γ) - (K ln K)/γ.

By our choice for γ, we get the statement of the lemma. □

The next lemma gives an implicit upper bound on the number of epochs R. Let c = (K ln K)/(e - 1).

Lemma 4.3  The number of epochs R satisfies

  2^{R-1} <= K/c + sqrt(Ĝ_max/c).

Proof. If R = 0, then the bound holds trivially. So assume R >= 1. Let z = 2^{R-1}. Because epoch R - 1 was completed, by the termination condition,

  Ĝ_max >= Ĝ_max(T_{R-1} + 1) > g_{R-1} - K/γ_{R-1} = c 4^{R-1} - K 2^{R-1} = c z² - K z.   (12)

Suppose the claim of the lemma is false. Then z > K/c + sqrt(Ĝ_max/c). Since the function c x² - K x is increasing for x > K/(2c), this implies that

  c z² - K z > c (K/c + sqrt(Ĝ_max/c))² - K (K/c + sqrt(Ĝ_max/c)) = K sqrt(Ĝ_max/c) + Ĝ_max >= Ĝ_max,

contradicting (12). □

Proof of Theorem 4.1. Using the lemmas, we have that

  G_{Exp3.1} = Σ_{t=1}^T x_{i_t}(t) = Σ_{r=0}^R Σ_{t=S_r}^{T_r} x_{i_t}(t)
    >= max_j Σ_{r=0}^R ( Σ_{t=S_r}^{T_r} x̂_j(t) - 2 sqrt(e - 1) sqrt(g_r K ln K) )

    = max_j Ĝ_j(T + 1) - 2 sqrt(e - 1) Σ_{r=0}^R sqrt(g_r K ln K)
    = Ĝ_max - 2 K ln K Σ_{r=0}^R 2^r
    = Ĝ_max - 2 K ln K (2^{R+1} - 1)
    = Ĝ_max + 2 K ln K - 8 K ln K · 2^{R-1}
    >= Ĝ_max + 2 K ln K - 8 K ln K ( K/c + sqrt(Ĝ_max/c) )
    >= Ĝ_max - 2 K ln K - 8(e - 1) K - 8 sqrt(e - 1) sqrt(Ĝ_max K ln K).   (13)

Here, we used Lemma 4.2 for the first inequality and Lemma 4.3 for the second inequality. The other steps follow from definitions and simple algebra.

Let f(x) = x - a sqrt(x) - b for x >= 0, where a = 8 sqrt((e - 1) K ln K) and b = 2 K ln K + 8(e - 1) K. Taking expectations of both sides of (13) gives

  E[G_{Exp3.1}] >= E[f(Ĝ_max)].   (14)

Since the second derivative of f is positive for x > 0, f is convex so that, by Jensen's inequality,

  E[f(Ĝ_max)] >= f(E[Ĝ_max]).   (15)

Note that

  E[Ĝ_max] = E[max_j Ĝ_j(T + 1)] >= max_j E[Ĝ_j(T + 1)] = max_j Σ_{t=1}^T x_j(t) = G_max.

The function f is increasing if and only if x > a²/4. Therefore, if G_max > a²/4 then f(E[Ĝ_max]) >= f(G_max). Combined with (14) and (15), this gives E[G_{Exp3.1}] >= f(G_max), which is equivalent to the statement of the theorem. On the other hand, if G_max <= a²/4 then, because f is nonincreasing on [0, a²/4], f(G_max) <= f(0) = -b <= 0 <= E[G_{Exp3.1}], so the theorem follows trivially in this case as well. □

5 Lower bounds on the weak regret

In this section, we state a lower bound on the expected weak regret of any player. More precisely, for any choice of the time horizon T we show that there exists a strategy for assigning the rewards to the actions such that the expected weak regret of any player algorithm is Ω(sqrt(KT)). Observe that this does not match the upper bound for our algorithms Exp3 and Exp3.1 (see Corollary 3.2 and Theorem 4.1); it is an open problem to close this gap.

Our lower bound is proven using the classical (statistical) bandit model with a crucial difference: the reward distribution depends on the number K of actions and on the time horizon T. This dependence is the reason why our lower bound does not contradict the upper bounds of the form

O(ln T) for the classical bandit model [14]. There, the distribution over the rewards is fixed as T -> infinity.

Note that our lower bound has a considerably stronger dependence on the number K of actions than the lower bound Ω(sqrt(T ln K)), which could have been proven directly from the results in [3, 6]. Specifically, our lower bound implies that no upper bound is possible of the form O(T^α (ln K)^β) where α < 1 and β > 0.

Theorem 5.1  For any number of actions K >= 2 and for any time horizon T, there exists a distribution over the assignment of rewards such that the expected weak regret of any algorithm (where the expectation is taken with respect to both the randomization over rewards and the algorithm's internal randomization) is at least

  (1/20) min{ sqrt(KT), T }.

The proof is given in Appendix A. The lower bound implies, of course, that for any algorithm there is a particular choice of rewards that will cause the expected weak regret (where the expectation is now with respect to the algorithm's internal randomization only) to be larger than this value.

6 Bounds on the weak regret that hold with probability 1

In Section 4 we showed that the expected weak regret of algorithm Exp3.1 is O(sqrt(KT ln K)). In this section we show that a modification of Exp3 achieves a weak regret of O(sqrt(KT ln(KT/δ))) with probability at least 1 - δ, for any fixed δ > 0 and uniformly over T. From this, a bound on the weak regret that holds with probability 1 follows easily.

The modification of Exp3 is necessary since the variance of the regret achieved by this algorithm is large, so large that an interesting high probability bound may not hold. The large variance of the regret comes from the large variance of the estimates x̂_i(t) for the payoffs x_i(t). In fact, the variance of x̂_i(t) can be close to 1/p_i(t) which, for γ in our range of interest, is (ignoring the dependence on K) of magnitude sqrt(T). Summing over trials, the variance of the return of Exp3 is about T^{3/2}, so that the regret might be as large as T^{3/4}.

To control the variance we modify algorithm Exp3 so that it uses estimates which are based on upper confidence bounds instead of estimates with the correct expectation. The modified algorithm Exp3.P is given in Figure 3. Let

  σ̂_i(t+1) ≝ sqrt(KT) + Σ_{s=1}^t 1/(p_i(s) sqrt(KT)).

Whereas algorithm Exp3 directly uses the estimates Ĝ_i(t) when choosing i_t at random, algorithm Exp3.P uses the upper confidence bounds Ĝ_i(t) + α σ̂_i(t). The next lemma shows that, for appropriate α, these are indeed upper confidence bounds. Fix some time horizon T. In what follows, we will use σ̂_i to denote σ̂_i(T+1) and Ĝ_i to denote Ĝ_i(T+1).
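The pseudo-code of Exp3.P is given in Figure 3 below. As a rough Python sketch of the same update (names ours; α and γ are set here as in Theorem 6.3, and reward_fn is an assumed reward oracle), note how the confidence term α/(p_j(t) sqrt(KT)) is added inside the exponent for every action, not only the chosen one:

    import math
    import random

    def exp3p(K, T, delta, reward_fn, t0=0, steps=None):
        """Sketch of Exp3.P (Figure 3), with alpha and gamma tuned as in Theorem 6.3.
        reward_fn(i, t) returns x_i(t) in [0, 1]; t0 offsets the global trial index;
        steps (if given) runs fewer than T trials with the same tuning."""
        steps = T if steps is None else steps
        alpha = 2.0 * math.sqrt(math.log(K * T / delta))
        gamma = min(3.0 / 5.0, 2.0 * math.sqrt(3.0 * K * math.log(K) / (5.0 * T)))
        w = [math.exp((alpha * gamma / 3.0) * math.sqrt(T / K)) for _ in range(K)]
        total_reward = 0.0
        for t in range(1, steps + 1):
            W = sum(w)
            p = [(1 - gamma) * w[i] / W + gamma / K for i in range(K)]
            i_t = random.choices(range(K), weights=p)[0]
            x = reward_fn(i_t, t0 + t)
            total_reward += x
            for j in range(K):
                x_hat = x / p[j] if j == i_t else 0.0
                bonus = alpha / (p[j] * math.sqrt(K * T))    # upper-confidence correction
                w[j] *= math.exp((gamma / (3.0 * K)) * (x_hat + bonus))
            m = max(w)
            w = [wi / m for wi in w]   # rescaling leaves p unchanged and avoids numeric overflow
        return total_reward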

Algorithm Exp3.P
Parameters: Reals α > 0 and γ ∈ (0, 1].
Initialization: For i = 1, ..., K

  w_i(1) = exp( (α γ / 3) sqrt(T/K) ).

For each t = 1, 2, ..., T
  1. For i = 1, ..., K set
       p_i(t) = (1 - γ) w_i(t) / Σ_{j=1}^K w_j(t) + γ/K.
  2. Choose i_t randomly according to the distribution p_1(t), ..., p_K(t).
  3. Receive reward x_{i_t}(t) ∈ [0, 1].
  4. For j = 1, ..., K set
       x̂_j(t) = x_j(t)/p_j(t) if j = i_t, and x̂_j(t) = 0 otherwise;
       w_j(t+1) = w_j(t) exp( (γ/(3K)) ( x̂_j(t) + α/(p_j(t) sqrt(KT)) ) ).

Figure 3: Pseudo-code of algorithm Exp3.P achieving small weak regret with high probability.

Lemma 6.1  If 2 sqrt(ln(KT/δ)) <= α <= 2 sqrt(KT), then

  P{ ∃i : Ĝ_i + α σ̂_i < G_i } <= δ.

Proof. Fix some i and set

  s_t ≝ α / (2 σ̂_i(t+1)).

Since α <= 2 sqrt(KT) and σ̂_i(t+1) >= sqrt(KT), we have s_t <= 1. Now

  P{ Ĝ_i + α σ̂_i < G_i }
    = P{ Σ_{t=1}^T (x_i(t) - x̂_i(t)) - (α/2) σ̂_i > (α/2) σ̂_i }
    <= P{ s_T Σ_{t=1}^T ( x_i(t) - x̂_i(t) - α/(2 p_i(t) sqrt(KT)) ) > α²/4 }   (16)

    <= e^{-α²/4} E[ exp( s_T Σ_{t=1}^T ( x_i(t) - x̂_i(t) - α/(2 p_i(t) sqrt(KT)) ) ) ]   (17)

where in step (16) we multiplied both sides by s_T and used σ̂_i >= sqrt(KT) + Σ_{t=1}^T 1/(p_i(t) sqrt(KT)), while in step (17) we used Markov's inequality. For t = 1, ..., T set

  Z_t ≝ exp( s_t Σ_{τ=1}^t ( x_i(τ) - x̂_i(τ) - α/(2 p_i(τ) sqrt(KT)) ) ).

Then, for t = 2, ..., T,

  Z_t = exp( s_t ( x_i(t) - x̂_i(t) - α/(2 p_i(t) sqrt(KT)) ) ) (Z_{t-1})^{s_t/s_{t-1}}.

Denote by E_t[Z_t] = E[Z_t | i_1, ..., i_{t-1}] the expectation of Z_t with respect to the random choice in trial t and conditioned on the past t - 1 trials. Note that when the past t - 1 trials are fixed the only random quantities in Z_t are the x̂_i(t)'s. Note also that x_i(t) - x̂_i(t) <= 1, and that

  E_t[ (x_i(t) - x̂_i(t))² ] = E_t[ x̂_i(t)² ] - x_i(t)² <= E_t[ x̂_i(t)² ] = x_i(t)²/p_i(t) <= 1/p_i(t).   (18)

Hence, for each t = 2, ..., T,

  E_t[Z_t] = E_t[ exp( s_t ( x_i(t) - x̂_i(t) - α/(2 p_i(t) sqrt(KT)) ) ) ] (Z_{t-1})^{s_t/s_{t-1}}   (19)
    <= E_t[ 1 + s_t (x_i(t) - x̂_i(t)) + s_t² (x_i(t) - x̂_i(t))² ] exp( -s_t α/(2 p_i(t) sqrt(KT)) ) (Z_{t-1})^{s_t/s_{t-1}}   (20)
    <= ( 1 + s_t²/p_i(t) ) exp( -s_t α/(2 p_i(t) sqrt(KT)) ) (Z_{t-1})^{s_t/s_{t-1}}   (21)
    <= exp( s_t²/p_i(t) - s_t α/(2 p_i(t) sqrt(KT)) ) (Z_{t-1})^{s_t/s_{t-1}}   (22)
    <= 1 + Z_{t-1}.   (23)

Eq. (20) uses e^a <= 1 + a + a² for a <= 1. Eq. (21) uses E_t[x̂_i(t)] = x_i(t) and (18). Eq. (22) uses 1 + x <= e^x for any real x. Eq. (23) uses the facts that the exponent in (22) is nonpositive, because s_t = α/(2 σ̂_i(t+1)) <= α/(2 sqrt(KT)) since σ̂_i(t+1) >= sqrt(KT), together with s_t <= s_{t-1} and z^u <= 1 + z for any z > 0 and u ∈ [0, 1]. Observing that E[Z_1] <= 1, we get by induction that E[Z_T] <= T, and the lemma follows by our choice of α (together with a union bound over the K actions). □

The next lemma shows that the return achieved by algorithm Exp3.P is close to its upper confidence bounds. Let

  Û ≝ max_{1<=i<=K} ( Ĝ_i + α σ̂_i ).

Lemma 6.2  If α <= 2 sqrt(KT), then

  G_{Exp3.P} >= (1 - 5γ/3) Û - (3K/γ) ln K - 2 α sqrt(KT) - 2 α².

Proof. We proceed as in the analysis of algorithm Exp3. Set η = γ/(3K) and consider any sequence i_1, ..., i_T of actions chosen by Exp3.P. As x̂_i(t) <= K/γ, p_i(t) >= γ/K, and α <= 2 sqrt(KT), we have

  η ( x̂_i(t) + α/(p_i(t) sqrt(KT)) ) <= 1.

Therefore,

  W_{t+1}/W_t = Σ_{i=1}^K w_i(t+1)/W_t
    = Σ_{i=1}^K (w_i(t)/W_t) exp( η ( x̂_i(t) + α/(p_i(t) sqrt(KT)) ) )
    = Σ_{i=1}^K [(p_i(t) - γ/K)/(1 - γ)] exp( η ( x̂_i(t) + α/(p_i(t) sqrt(KT)) ) )
    <= Σ_{i=1}^K [(p_i(t) - γ/K)/(1 - γ)] [ 1 + η ( x̂_i(t) + α/(p_i(t) sqrt(KT)) ) + η² ( x̂_i(t) + α/(p_i(t) sqrt(KT)) )² ]
    <= 1 + (η/(1 - γ)) Σ_{i=1}^K p_i(t) ( x̂_i(t) + α/(p_i(t) sqrt(KT)) ) + (2η²/(1 - γ)) Σ_{i=1}^K p_i(t) ( x̂_i(t)² + α²/(p_i(t)² KT) )
    <= 1 + (η/(1 - γ)) ( x_{i_t}(t) + α sqrt(K/T) ) + (2η²/(1 - γ)) ( Σ_{i=1}^K x̂_i(t) + (α²/(KT)) Σ_{i=1}^K 1/p_i(t) ).

The second inequality uses e^a <= 1 + a + a² for a <= 1, and (a + b)² <= 2(a² + b²) for any a, b. The last inequality uses Eqs. (2), (3) and (4).

Taking logarithms, using ln(1 + x) <= x, and summing over t = 1, ..., T we get

  ln(W_{T+1}/W_1) <= (η/(1 - γ)) ( G_{Exp3.P} + α sqrt(KT) ) + (2η²/(1 - γ)) ( Σ_{i=1}^K Ĝ_i + (α²/(KT)) Σ_{t=1}^T Σ_{i=1}^K 1/p_i(t) ).

Since

  ln W_1 = ln K + η α sqrt(KT)

and for any j

  ln W_{T+1} >= ln w_j(T+1) = ln w_j(1) + η Σ_{t=1}^T ( x̂_j(t) + α/(p_j(t) sqrt(KT)) ) = η ( Ĝ_j + α σ̂_j ),

this implies

  G_{Exp3.P} >= (1 - γ) ( Ĝ_j + α σ̂_j ) - (1/η) ln K - 2 α sqrt(KT) - 2 η Σ_{i=1}^K Ĝ_i - 2 α²

for any j. Finally, using η = γ/(3K) and Σ_{i=1}^K Ĝ_i <= K Û yields the lemma. □

Combining Lemmas 6.1 and 6.2 gives the main result of this section.

Theorem 6.3  For any fixed T > 0, for all K >= 2 and for all δ > 0, if

  γ = min{ 3/5, 2 sqrt( (3 K ln K) / (5 T) ) }  and  α = 2 sqrt(ln(KT/δ)),

then

  G_max - G_{Exp3.P} <= 4 sqrt(KT ln(KT/δ)) + 4 sqrt( (5/3) KT ln K ) + 8 ln(KT/δ)

holds for any assignment of rewards with probability at least 1 - δ.

Proof. We assume without loss of generality that T >= (20/3) K ln K and that δ >= KT e^{-KT}. If either of these conditions does not hold, then the theorem holds trivially. Note that T >= (20/3) K ln K ensures γ <= 3/5. Note also that δ >= KT e^{-KT} implies α <= 2 sqrt(KT) for our choice of α. So we can apply Lemmas 6.1 and 6.2. By Lemma 6.2 we have

  G_{Exp3.P} >= (1 - 5γ/3) Û - (3K/γ) ln K - 2 α sqrt(KT) - 2 α².

By Lemma 6.1 we have Û >= G_max with probability at least 1 - δ. Collecting terms and using G_max <= T gives the theorem. □

It is not difficult to obtain an algorithm that does not need the time horizon T as input parameter and whose regret is only slightly worse than that proven for the algorithm Exp3.P in Theorem 6.3. This new algorithm, called Exp3.P.1 and shown in Figure 4, simply restarts Exp3.P doubling its guess for T each time. The only careful issue is the choice of the confidence parameter δ and of the minimum length of the runs to ensure that Lemma 6.1 holds for all the runs of Exp3.P.

Algorithm Exp3.P.1
Parameters: Real 0 < δ < 1.
Initialization: Let T_r = 2^r, δ_r = δ / ((r+1)(r+2)), and

  r* = min{ r ∈ ℕ : δ_r >= K T_r e^{-K T_r} }.   (24)

Repeat for r = r*, r* + 1, ...
  Run Exp3.P for T_r trials choosing δ and γ as in Theorem 6.3 with T = T_r and δ = δ_r.

Figure 4: Pseudo-code of algorithm Exp3.P.1 (see Theorem 6.4).

Theorem 6.4  Let K >= 2, δ ∈ (0, 1) and T >= 2^{r*}. Let c_T = 2 ln(2 + log₂ T), and let r* be as in Eq. (24). Then

  G_max - G_{Exp3.P.1} <= (10/(sqrt(2) - 1)) sqrt( 2 K T ( ln(KT/δ) + c_T ) ) + 10 (1 + log₂ T) ( ln(KT/δ) + c_T )

holds with probability at least 1 - δ.

Proof. Choose the time horizon T arbitrarily and call epoch the sequence of trials between two successive restarts of algorithm Exp3.P. For each r > r*, where r* is defined in (24), let

  G_i(r) ≝ Σ_{t=2^r+1}^{2^{r+1}} x_i(t),   Ĝ_i(r) ≝ Σ_{t=2^r+1}^{2^{r+1}} x̂_i(t),   σ̂_i(r) ≝ sqrt(K T_r) + Σ_{t=2^r+1}^{2^{r+1}} 1/(p_i(t) sqrt(K T_r)),

and similarly define the quantities G_i(r*), Ĝ_i(r*) and σ̂_i(r*) with sums that go from t = 1 to t = 2^{r*+1}. For each r >= r*, we have δ_r >= K T_r e^{-K T_r}. Thus we can find numbers α_r such that, by Lemma 6.1,

  P{ (∃ r >= r*)(∃ i) : Ĝ_i(r) + α_r σ̂_i(r) < G_i(r) } <= Σ_{r >= r*} P{ ∃ i : Ĝ_i(r) + α_r σ̂_i(r) < G_i(r) } <= Σ_{r >= 0} δ/((r+1)(r+2)) = δ.

We now apply Theorem 6.3 to each epoch. Without loss of generality, assume that T satisfies

  2^{r*+ℓ-1} < T <= Σ_{r=0}^{ℓ-1} T_{r*+r} < 2^{r*+ℓ}

for some ℓ >= 1. With probability at least 1 - δ over the random draw of Exp3.P.1's actions i_1, ..., i_T,

  G_max - G_{Exp3.P.1}
    <= Σ_{r=0}^{ℓ-1} [ 4 sqrt( K T_{r*+r} ln(K T_{r*+r}/δ_{r*+r}) ) + 4 sqrt( (5/3) K T_{r*+r} ln K ) + 8 ln(K T_{r*+r}/δ_{r*+r}) ]
    <= 10 sqrt( K ln(K T_{r*+ℓ-1}/δ_{r*+ℓ-1}) ) Σ_{r=0}^{ℓ-1} sqrt(T_{r*+r}) + 10 ℓ ln(K T_{r*+ℓ-1}/δ_{r*+ℓ-1})
    <= (10/(sqrt(2) - 1)) sqrt( 2 K T ( ln(KT/δ) + c_T ) ) + 10 (1 + log₂ T) ( ln(KT/δ) + c_T ),

where c_T = 2 ln(2 + log₂ T). □

From the above theorem we get, as a simple corollary, a statement about the almost sure convergence of the return of algorithm Exp3.P.1. The rate of convergence is almost optimal, as one can see from our lower bound in Section 5.

Corollary 6.5  For any K >= 2 and for any function f: ℝ -> ℝ with lim_{T->∞} f(T) = ∞,

  lim_{T->∞} ( G_max - G_{Exp3.P.1} ) / ( sqrt(T) (ln T) f(T) ) = 0

holds for any assignment of rewards with probability 1.

Proof. Let δ = 1/T². Then, by Theorem 6.4, there exists a constant C such that for all T large enough

  G_max - G_{Exp3.P.1} <= C sqrt(KT) ln T

with probability at least 1 - 1/T². This implies that

  P{ ( G_max - G_{Exp3.P.1} ) / ( sqrt(T) (ln T) f(T) ) > C sqrt(K)/f(T) } <= 1/T²

and the theorem follows from the Borel-Cantelli lemma. □
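As a small illustration of the doubling scheme of Figure 4, the following Python sketch (names ours) reuses the exp3p sketch shown earlier, running it on epochs of doubling length T_r = 2^r with confidence parameters δ_r = δ/((r+1)(r+2)), starting from the r* of Eq. (24).

    import math

    def exp3p1(K, T, delta, reward_fn):
        """Sketch of Exp3.P.1 (Figure 4): restart Exp3.P with doubled horizon guesses."""
        def delta_r(r):
            return delta / ((r + 1) * (r + 2))
        # r* from Eq. (24): smallest r with delta_r >= K * T_r * exp(-K * T_r)
        r = 0
        while delta_r(r) < K * 2 ** r * math.exp(-K * 2 ** r):
            r += 1
        total_reward, t = 0.0, 0
        while t < T:
            T_r = 2 ** r
            n = min(T_r, T - t)      # the last run is truncated at the true horizon
            total_reward += exp3p(K, T_r, delta_r(r), reward_fn, t0=t, steps=n)
            t += n
            r += 1
        return total_reward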

7 The regret against the best strategy from a pool

Consider a setting where the player has preliminarily fixed a set of strategies that could be used for choosing actions. These strategies might select different actions at different iterations. The strategies can be computations performed by the player or they can be external advice given to the player by experts. We will use the more general term "expert" (borrowed from Cesa-Bianchi et al. [3]) because we place no restrictions on the generation of the advice. The player's goal in this case is to combine the advice of the experts in such a way that its return is close to that of the best expert.

Formally, we assume that the player, prior to choosing an action at time t, is provided with a set of N probability vectors ξ¹(t), ..., ξᴺ(t) ∈ [0, 1]^K, where Σ_{j=1}^K ξ^i_j(t) = 1 for each i = 1, ..., N. We interpret ξ^i(t) as the advice of expert i on trial t, where the j-th component ξ^i_j(t) represents the recommended probability of playing action j. (As a special case, the distribution can be concentrated on a single action, which represents a deterministic recommendation.) If the vector of rewards at time t is x(t), then the expected reward for expert i, with respect to the chosen probability vector ξ^i(t), is simply ξ^i(t) · x(t). In analogy with G_max, we define

  G̃_max ≝ max_{1<=i<=N} Σ_{t=1}^T ξ^i(t) · x(t),

measuring the expected return of the best strategy. Then the regret for the best strategy at time horizon T, defined by G̃_max(T) - G_A(T), measures the difference between the return of the best expert and player A's return up to time T.

Our results hold for any finite set of experts. Formally, we regard each ξ^i(t) as a random variable which is an arbitrary function of the random sequence of plays i_1, ..., i_{t-1}. This definition allows for experts whose advice depends on the entire past history as observed by the player, as well as on other side information which may be available.

We could at this point view each expert as a "meta-action" in a higher-level bandit problem with payoff vector defined at trial t as (ξ¹(t) · x(t), ..., ξᴺ(t) · x(t)). We could then immediately apply Corollary 3.2 to obtain a bound of O(sqrt(g N log N)) on the player's regret relative to the best expert (where g is an upper bound on G̃_max). However, this bound is quite weak if the player is combining many experts (i.e., if N is very large). We show below that the algorithm Exp3 from Section 3 can be modified yielding a regret term of the form O(sqrt(g K log N)). This bound is very reasonable when the number of actions is small, but the number of experts is quite large (even exponential).

Our algorithm Exp4 is shown in Figure 5, and is only a slightly modified version of Exp3. (Exp4 stands for "Exponential-weight algorithm for Exploration and Exploitation using Expert advice.") Let us define y(t) ∈ [0, 1]^N to be the vector with components corresponding to the gains of the experts: y_i(t) = ξ^i(t) · x(t). The simplest possible expert is one which always assigns uniform weight to all actions so that ξ_j(t) = 1/K on each round t. We call this the uniform expert. To prove our results, we need to assume that the uniform expert is included in the family of experts.³ Clearly, the uniform expert can always be added to any given family of experts at the very small expense of increasing N by one.

³ In fact, we can use a slightly weaker sufficient condition, namely, that the uniform expert is included in the convex hull of the family of experts, i.e., that there exist nonnegative numbers α_1, ..., α_N with Σ_{j=1}^N α_j = 1 such that, for all t and all i, Σ_{j=1}^N α_j ξ^j_i(t) = 1/K.
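Figure 5 below gives the pseudo-code of Exp4. The following Python sketch (names ours; advice_fn and reward_fn are assumed oracles for the experts' advice vectors and the rewards) shows the same computation: the experts' weights induce a distribution over actions, and each expert is then credited with its estimated gain ŷ_i(t) = ξ^i(t) · x̂(t).

    import math
    import random

    def exp4(K, N, T, gamma, advice_fn, reward_fn):
        """Sketch of Exp4 (Figure 5). advice_fn(t) returns a list of N probability
        vectors over the K actions; reward_fn(i, t) returns x_i(t) in [0, 1]."""
        w = [1.0] * N                                   # one weight per expert
        total_reward = 0.0
        for t in range(1, T + 1):
            xi = advice_fn(t)                           # xi[i][j]: expert i's probability for action j
            W = sum(w)
            p = [(1 - gamma) * sum(w[i] * xi[i][j] for i in range(N)) / W + gamma / K
                 for j in range(K)]
            i_t = random.choices(range(K), weights=p)[0]
            x = reward_fn(i_t, t)
            total_reward += x
            x_hat = x / p[i_t]                          # importance-weighted estimate (zero for j != i_t)
            for i in range(N):
                y_hat = xi[i][i_t] * x_hat              # expert i's estimated gain
                w[i] *= math.exp(gamma * y_hat / K)
            m = max(w)
            w = [wi / m for wi in w]                    # rescale to avoid overflow; p is unchanged
        return total_reward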

Algorithm Exp4
Parameters: Real γ ∈ (0, 1].
Initialization: w_i(1) = 1 for i = 1, ..., N.

For each t = 1, 2, ...
  1. Get advice vectors ξ¹(t), ..., ξᴺ(t).
  2. Set W_t = Σ_{i=1}^N w_i(t) and for j = 1, ..., K set
       p_j(t) = (1 - γ) Σ_{i=1}^N w_i(t) ξ^i_j(t) / W_t + γ/K.
  3. Draw action i_t randomly according to the probabilities p_1(t), ..., p_K(t).
  4. Receive reward x_{i_t}(t) ∈ [0, 1].
  5. For j = 1, ..., K set
       x̂_j(t) = x_j(t)/p_j(t) if j = i_t, and x̂_j(t) = 0 otherwise.
  6. For i = 1, ..., N set
       ŷ_i(t) = ξ^i(t) · x̂(t),
       w_i(t+1) = w_i(t) exp(γ ŷ_i(t)/K).

Figure 5: Pseudo-code of algorithm Exp4 for using expert advice.

Theorem 7.1  For any K, T > 0, for any γ ∈ (0, 1], and for any family of experts which includes the uniform expert,

  G̃_max - E[G_{Exp4}] <= (e - 1) γ G̃_max + (K ln N)/γ

holds for any assignment of rewards.

Proof. We prove this theorem along the lines of the proof of Theorem 3.1. Let q_i(t) = w_i(t)/W_t. Then

  W_{t+1}/W_t = Σ_{i=1}^N w_i(t+1)/W_t = Σ_{i=1}^N q_i(t) exp((γ/K) ŷ_i(t))

    <= Σ_{i=1}^N q_i(t) [ 1 + (γ/K) ŷ_i(t) + (e - 2)(γ/K)² ŷ_i(t)² ]
    <= 1 + (γ/K) Σ_{i=1}^N q_i(t) ŷ_i(t) + (e - 2)(γ/K)² Σ_{i=1}^N q_i(t) ŷ_i(t)².

Taking logarithms and summing over t we get

  ln(W_{T+1}/W_1) <= (γ/K) Σ_{t=1}^T Σ_{i=1}^N q_i(t) ŷ_i(t) + (e - 2)(γ/K)² Σ_{t=1}^T Σ_{i=1}^N q_i(t) ŷ_i(t)².

Since, for any expert k,

  ln(W_{T+1}/W_1) >= ln(w_k(T+1)/W_1) = (γ/K) Σ_{t=1}^T ŷ_k(t) - ln N,

we get

  Σ_{t=1}^T Σ_{i=1}^N q_i(t) ŷ_i(t) >= Σ_{t=1}^T ŷ_k(t) - (K ln N)/γ - (e - 2)(γ/K) Σ_{t=1}^T Σ_{i=1}^N q_i(t) ŷ_i(t)².

Note that

  Σ_{i=1}^N q_i(t) ŷ_i(t) = Σ_{j=1}^K ( Σ_{i=1}^N q_i(t) ξ^i_j(t) ) x̂_j(t) = Σ_{j=1}^K [ (p_j(t) - γ/K)/(1 - γ) ] x̂_j(t) <= x_{i_t}(t)/(1 - γ).

Also

  Σ_{i=1}^N q_i(t) ŷ_i(t)² = Σ_{i=1}^N q_i(t) ( ξ^i_{i_t}(t) x̂_{i_t}(t) )² <= ( Σ_{i=1}^N q_i(t) ξ^i_{i_t}(t) ) x̂_{i_t}(t)² <= ( p_{i_t}(t)/(1 - γ) ) x̂_{i_t}(t)² = x̂_{i_t}(t) x_{i_t}(t)/(1 - γ) <= (1/(1 - γ)) Σ_{j=1}^K x̂_j(t).

Therefore, for all experts k,

  G_{Exp4} = Σ_{t=1}^T x_{i_t}(t) >= (1 - γ) Σ_{t=1}^T ŷ_k(t) - (K ln N)/γ - (e - 2)(γ/K) Σ_{t=1}^T Σ_{j=1}^K x̂_j(t).

We now take expectations of both sides of this inequality. Note that

  E[ŷ_k(t)] = E[ Σ_{j=1}^K ξ^k_j(t) x̂_j(t) ] = Σ_{j=1}^K ξ^k_j(t) x_j(t) = y_k(t).

Further,

  E[ (1/K) Σ_{t=1}^T Σ_{j=1}^K x̂_j(t) ] = (1/K) Σ_{t=1}^T Σ_{j=1}^K x_j(t) <= max_{1<=i<=N} Σ_{t=1}^T y_i(t) = G̃_max

since we have assumed that the uniform expert is included in the family of experts. Combining these facts immediately implies the statement of the theorem. □

8 The regret against arbitrary strategies

In this section we present a variant of algorithm Exp3 and prove a bound on its expected regret for any sequence (j_1, ..., j_T) of actions. To prove this result, we rank all sequences of actions according to their "hardness". The hardness of a sequence (j_1, ..., j_T) is defined by

  H(j_1, ..., j_T) ≝ 1 + |{ 1 <= ℓ < T : j_ℓ ≠ j_{ℓ+1} }|.

So, H(1, ..., 1) = 1 and H(1, 1, 3, 2, 2) = 3. The bound on the regret which we will prove grows with the hardness of the sequence for which we are measuring the regret. In particular, we will show that the player algorithm Exp3.S described in Figure 6 has an expected regret of O(H(jᵀ) sqrt(KT ln(KT))) for any sequence jᵀ = (j_1, ..., j_T) of actions. On the other hand, if the regret is measured for any sequence jᵀ of actions of hardness H(jᵀ) <= S, then the expected regret of Exp3.S (with parameters tuned to this S) reduces to O(sqrt(S K T ln(KT))). In what follows, we will use G_{jᵀ} to denote the return x_{j_1}(1) + ... + x_{j_T}(T) of a sequence jᵀ = (j_1, ..., j_T) of actions.

Theorem 8.1  For any K > 0, for any γ ∈ (0, 1], and for any α > 0,

  G_{jᵀ} - E[G_{Exp3.S}] <= K ( H(jᵀ) ln(K/α) + e α T ) / γ + (e - 1) γ T

holds for any assignment of rewards, for any T > 0, and for any sequence jᵀ = (j_1, ..., j_T) of actions.

Corollary 8.2  Assume that algorithm Exp3.S is run with input parameters α = 1/T and

  γ = min{ 1, sqrt( K ln(KT) / T ) }.

Then

  G_{jᵀ} - E[G_{Exp3.S}] <= H(jᵀ) sqrt(KT ln(KT)) + 2e sqrt(KT ln(KT))

holds for any sequence jᵀ = (j_1, ..., j_T) of actions.

Algorithm Exp3.S
Parameters: Reals γ ∈ (0, 1] and α > 0.
Initialization: w_i(1) = 1 for i = 1, ..., K.

For each t = 1, 2, ...
  1. Set
       p_i(t) = (1 - γ) w_i(t) / Σ_{j=1}^K w_j(t) + γ/K,   i = 1, ..., K.
  2. Draw i_t randomly according to the probabilities p_1(t), ..., p_K(t).
  3. Receive reward x_{i_t}(t) ∈ [0, 1].
  4. For j = 1, ..., K set
       x̂_j(t) = x_j(t)/p_j(t) if j = i_t, and x̂_j(t) = 0 otherwise;
       w_j(t+1) = w_j(t) exp(γ x̂_j(t)/K) + (e α / K) Σ_{i=1}^K w_i(t).

Figure 6: Pseudo-code of algorithm Exp3.S to control the expected regret.

Note that the statement of Corollary 8.2 can be equivalently written as

  E[G_{Exp3.S}] >= max_{jᵀ} ( G_{jᵀ} - H(jᵀ) sqrt(KT ln(KT)) ) - 2e sqrt(KT ln(KT)),

revealing that algorithm Exp3.S is able to automatically trade off between the return G_{jᵀ} of a sequence jᵀ and its hardness H(jᵀ).

Corollary 8.3  Assume that algorithm Exp3.S is run with input parameters α = 1/T and

  γ = min{ 1, sqrt( K ( S ln(KT) + e ) / ( (e - 1) T ) ) }.

Then

  G_{jᵀ} - E[G_{Exp3.S}] <= 2 sqrt(e - 1) sqrt( KT ( S ln(KT) + e ) )

holds for any sequence jᵀ = (j_1, ..., j_T) of actions such that H(jᵀ) <= S.
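The following Python sketch (names ours; reward_fn is an assumed reward oracle) implements the Exp3.S update of Figure 6, together with a small helper computing the hardness H of a comparison sequence; γ and α are left to the caller, e.g. tuned as in Corollary 8.2 or 8.3.

    import math
    import random

    def hardness(seq):
        """H(j_1, ..., j_T) = 1 + number of switches in the sequence."""
        return 1 + sum(1 for a, b in zip(seq, seq[1:]) if a != b)

    def exp3s(K, T, gamma, alpha, reward_fn):
        """Sketch of Exp3.S (Figure 6): Exp3 with weight sharing, for tracking switching sequences."""
        w = [1.0] * K
        total_reward = 0.0
        for t in range(1, T + 1):
            W = sum(w)
            p = [(1 - gamma) * w[i] / W + gamma / K for i in range(K)]
            i_t = random.choices(range(K), weights=p)[0]
            x = reward_fn(i_t, t)
            total_reward += x
            share = (math.e * alpha / K) * W        # the shared weight keeps every arm "alive"
            for j in range(K):
                x_hat = x / p[j] if j == i_t else 0.0
                w[j] = w[j] * math.exp(gamma * x_hat / K) + share
        return total_reward

For example, hardness([1, 1, 3, 2, 2]) returns 3, matching the example given above.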

Proof of Theorem 8.1. Fix any sequence jᵀ = (j_1, ..., j_T) of actions. With a technique that follows closely the proof of Theorem 3.1, we can prove that for all sequences i_1, ..., i_T of actions drawn by Exp3.S,

  W_{t+1}/W_t <= 1 + (γ/K)/(1 - γ) x_{i_t}(t) + (e - 2)(γ/K)²/(1 - γ) Σ_{i=1}^K x̂_i(t) + e α,   (25)

where, as usual, W_t = w_1(t) + ... + w_K(t). Now let S = H(jᵀ) and partition (1, ..., T) in segments

  [T_1, ..., T_2), [T_2, ..., T_3), ..., [T_S, ..., T_{S+1}),

where T_1 = 1, T_{S+1} = T + 1, and j_{T_s} = j_{T_s+1} = ... = j_{T_{s+1}-1} for each segment s = 1, ..., S. Fix an arbitrary segment [T_s, T_{s+1}) and let Δ_s = T_{s+1} - T_s. Furthermore, let

  G_{Exp3.S}(s) ≝ Σ_{t=T_s}^{T_{s+1}-1} x_{i_t}(t).

Taking logarithms on both sides of (25) and summing over t = T_s, ..., T_{s+1} - 1 we get

  ln(W_{T_{s+1}}/W_{T_s}) <= (γ/K)/(1 - γ) G_{Exp3.S}(s) + (e - 2)(γ/K)²/(1 - γ) Σ_{t=T_s}^{T_{s+1}-1} Σ_{i=1}^K x̂_i(t) + e α Δ_s.   (26)

Now let j be the action such that j_{T_s} = ... = j_{T_{s+1}-1} = j. Since

  w_j(T_{s+1}) >= w_j(T_s + 1) exp( (γ/K) Σ_{t=T_s+1}^{T_{s+1}-1} x̂_j(t) )
    >= (e α / K) W_{T_s} exp( (γ/K) Σ_{t=T_s+1}^{T_{s+1}-1} x̂_j(t) )
    >= (α/K) W_{T_s} exp( (γ/K) Σ_{t=T_s}^{T_{s+1}-1} x̂_j(t) ),

where the last step uses γ x̂_j(t)/K <= 1, we have

  ln(W_{T_{s+1}}/W_{T_s}) >= ln( w_j(T_{s+1})/W_{T_s} ) >= ln(α/K) + (γ/K) Σ_{t=T_s}^{T_{s+1}-1} x̂_j(t).   (27)

Piecing together (26) and (27) we get

  G_{Exp3.S}(s) >= (1 - γ) Σ_{t=T_s}^{T_{s+1}-1} x̂_j(t) - (K/γ) ln(K/α) - (e - 2)(γ/K) Σ_{t=T_s}^{T_{s+1}-1} Σ_{i=1}^K x̂_i(t) - (e K α / γ) Δ_s.

Summing over all segments s = 1, ..., S, taking expectation with respect to the random choices of algorithm Exp3.S, and using G_{(j_1,...,j_T)} <= T and Σ_{t=1}^T Σ_{i=1}^K x_i(t) <= KT yields the inequality in the statement of the theorem. □

If the time horizon T is not known, we can apply techniques similar to those applied for proving Theorem 6.4 in Section 6. More specifically, we introduce a new algorithm, Exp3.S.1, that runs Exp3.S as a subroutine. Suppose that at each new run (or epoch) r = 0, 1, ..., Exp3.S is started with its parameters set as prescribed in Corollary 8.2, where T is set to T_r = 2^r, and then stopped after T_r iterations. Clearly, for any fixed sequence jᵀ = (j_1, ..., j_T) of actions, the number of segments (see the proof of Theorem 8.1 for a definition of segment) within each epoch r is at most H(jᵀ). Hence the expected regret of Exp3.S.1 for epoch r is certainly not more than

  ( H(jᵀ) + 2e ) sqrt( K T_r ln(K T_r) ).

Let ℓ be such that 2^ℓ <= T < 2^{ℓ+1}. Then the last epoch is ℓ <= log₂ T and the overall regret (over the ℓ + 1 epochs) is at most

  ( H(jᵀ) + 2e ) Σ_{r=0}^ℓ sqrt( K T_r ln(K T_r) ) <= ( H(jᵀ) + 2e ) sqrt( K ln(K T_ℓ) ) Σ_{r=0}^ℓ sqrt(T_r).

Finishing up the calculations proves the following.

Corollary 8.4

  G_{jᵀ} - E[G_{Exp3.S.1}] <= ( ( H(jᵀ) + 2e ) / (sqrt(2) - 1) ) sqrt( 2 K T ln(KT) )

for any T > 0 and for any sequence jᵀ = (j_1, ..., j_T) of actions.

On the other hand, if Exp3.S.1 runs Exp3.S with parameters set as prescribed in Corollary 8.3, with a reasoning similar to the one above we conclude the following.

Corollary 8.5

  G_{jᵀ} - E[G_{Exp3.S.1}] <= ( 2 sqrt(e - 1) / (sqrt(2) - 1) ) sqrt( 2 K T ( S ln(KT) + e ) )

for any T > 0 and for any sequence jᵀ = (j_1, ..., j_T) of actions such that H(jᵀ) <= S.

9 Applications to game theory

The adversarial bandit problem can be easily related to the problem of playing repeated games. For N > 1 integer, an N-person finite game is defined by N finite sets S_1, ..., S_N of pure strategies,

one set for each player, and by N functions u_1, ..., u_N, where function u_i : S_1 × ... × S_N -> ℝ is player i's payoff function. Note that each player's payoff depends both on the pure strategy chosen by the player and on the pure strategies chosen by the other players. Let S = S_1 × ... × S_N and let S_{-i} = S_1 × ... × S_{i-1} × S_{i+1} × ... × S_N. We use s and s_{-i} to denote typical members of, respectively, S and S_{-i}. Given s ∈ S, we will often write (j, s_{-i}) to denote (s_1, ..., s_{i-1}, j, s_{i+1}, ..., s_N), where j ∈ S_i.

Suppose that the game is played repeatedly through time. Assume for now that each player knows all payoff functions and, after each repetition (or round) t, also knows the vector s(t) = (s_1(t), ..., s_N(t)) of pure strategies chosen by the players. Hence, the pure strategy s_i(t) chosen by player i at round t may depend on what player i and the other players chose in the past rounds. The average regret of player i for the pure strategy j after T rounds is defined by

  R_i^{(j)}(T) = (1/T) Σ_{t=1}^T [ u_i(j, s_{-i}(t)) - u_i(s(t)) ].

This is how much player i lost on average for not playing the pure strategy j on all rounds, given that all the other players kept their choices fixed. A desirable property for a player is Hannan-consistency [8], defined as follows. Player i is Hannan-consistent if

  limsup_{T->∞} max_{j ∈ S_i} R_i^{(j)}(T) <= 0  with probability 1.

The existence and properties of Hannan-consistent players have been first investigated by Hannan [10] and Blackwell [2], and later by many others (see [18] for a nice survey).

Hannan-consistency can be also studied in the so-called "unknown game setup", where it is further assumed that: (1) each player knows neither the total number of players nor the payoff function of any player (including itself); (2) after each round each player sees its own payoffs but it sees neither the choices of the other players nor the resulting payoffs. This setup was previously studied by Baños [1], Megiddo [16], and by Hart and Mas-Colell [11, 12].

We can apply the results of Section 6 to prove that a player using algorithm Exp3.P.1 as a mixed strategy is Hannan-consistent in the unknown game setup whenever the payoffs obtained by the player belong to a known bounded real interval. To do that, we must first extend our results to the case when the assignment of rewards can be chosen adaptively. More precisely, we can view the payoff x_{i_t}(t), received by the gambler at trial t of the bandit problem, as the payoff u_i(i_t, s_{-i}(t)) received by player i at the t-th round of the game. However, unlike our adversarial bandit framework where all the rewards were assigned to each arm at the beginning, here the payoff u_i(i_t, s_{-i}(t)) depends on the (possibly randomized) choices of all players which, in turn, are functions of their realized payoffs. In our bandit terminology, this corresponds to assuming that the vector (x_1(t), ..., x_K(t)) of rewards for each trial t is chosen by an adversary who knows the gambler's strategy and the outcome of the gambler's random draws up to time t - 1. We leave to the interested reader the easy but lengthy task of checking that all of our results (including those of Section 6) hold under this additional assumption. Using Theorem 6.4 we then get the following.
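As an illustration of this application (a sketch under our own naming, not the paper's), a player with K pure strategies and payoffs known to lie in an interval [a, b] can simply rescale its observed payoffs to [0, 1] and feed them to the exp3p1 sketch given earlier:

    def hannan_consistent_play(K, T, delta, payoff_fn, a, b):
        """Sketch: player i runs Exp3.P.1 over its K pure strategies in the unknown game setup.
        payoff_fn(j, t) returns the player's own payoff u_i(j, s_{-i}(t)) in [a, b] for the
        pure strategy j it actually played at round t; nothing else about the game is observed."""
        def rescaled_reward(j, t):
            return (payoff_fn(j, t) - a) / (b - a)   # translate and rescale payoffs to [0, 1]
        return exp3p1(K, T, delta, rescaled_reward)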


More information

General Linear Model Introduction, Classes of Linear models and Estimation

General Linear Model Introduction, Classes of Linear models and Estimation Stat 740 General Linear Model Introduction, Classes of Linear models and Estimation An aim of scientific enquiry: To describe or to discover relationshis among events (variables) in the controlled (laboratory)

More information

Combining Logistic Regression with Kriging for Mapping the Risk of Occurrence of Unexploded Ordnance (UXO)

Combining Logistic Regression with Kriging for Mapping the Risk of Occurrence of Unexploded Ordnance (UXO) Combining Logistic Regression with Kriging for Maing the Risk of Occurrence of Unexloded Ordnance (UXO) H. Saito (), P. Goovaerts (), S. A. McKenna (2) Environmental and Water Resources Engineering, Deartment

More information

arxiv: v1 [physics.data-an] 26 Oct 2012

arxiv: v1 [physics.data-an] 26 Oct 2012 Constraints on Yield Parameters in Extended Maximum Likelihood Fits Till Moritz Karbach a, Maximilian Schlu b a TU Dortmund, Germany, moritz.karbach@cern.ch b TU Dortmund, Germany, maximilian.schlu@cern.ch

More information

Online Appendix to Accompany AComparisonof Traditional and Open-Access Appointment Scheduling Policies

Online Appendix to Accompany AComparisonof Traditional and Open-Access Appointment Scheduling Policies Online Aendix to Accomany AComarisonof Traditional and Oen-Access Aointment Scheduling Policies Lawrence W. Robinson Johnson Graduate School of Management Cornell University Ithaca, NY 14853-6201 lwr2@cornell.edu

More information

An Analysis of Reliable Classifiers through ROC Isometrics

An Analysis of Reliable Classifiers through ROC Isometrics An Analysis of Reliable Classifiers through ROC Isometrics Stijn Vanderlooy s.vanderlooy@cs.unimaas.nl Ida G. Srinkhuizen-Kuyer kuyer@cs.unimaas.nl Evgueni N. Smirnov smirnov@cs.unimaas.nl MICC-IKAT, Universiteit

More information

Information collection on a graph

Information collection on a graph Information collection on a grah Ilya O. Ryzhov Warren Powell February 10, 2010 Abstract We derive a knowledge gradient olicy for an otimal learning roblem on a grah, in which we use sequential measurements

More information

Asymptotically Optimal Simulation Allocation under Dependent Sampling

Asymptotically Optimal Simulation Allocation under Dependent Sampling Asymtotically Otimal Simulation Allocation under Deendent Samling Xiaoing Xiong The Robert H. Smith School of Business, University of Maryland, College Park, MD 20742-1815, USA, xiaoingx@yahoo.com Sandee

More information

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley Elements of Asymtotic Theory James L. Powell Deartment of Economics University of California, Berkeley Objectives of Asymtotic Theory While exact results are available for, say, the distribution of the

More information

On Doob s Maximal Inequality for Brownian Motion

On Doob s Maximal Inequality for Brownian Motion Stochastic Process. Al. Vol. 69, No., 997, (-5) Research Reort No. 337, 995, Det. Theoret. Statist. Aarhus On Doob s Maximal Inequality for Brownian Motion S. E. GRAVERSEN and G. PESKIR If B = (B t ) t

More information

Positive decomposition of transfer functions with multiple poles

Positive decomposition of transfer functions with multiple poles Positive decomosition of transfer functions with multile oles Béla Nagy 1, Máté Matolcsi 2, and Márta Szilvási 1 Deartment of Analysis, Technical University of Budaest (BME), H-1111, Budaest, Egry J. u.

More information

Topic 7: Using identity types

Topic 7: Using identity types Toic 7: Using identity tyes June 10, 2014 Now we would like to learn how to use identity tyes and how to do some actual mathematics with them. By now we have essentially introduced all inference rules

More information

Estimation of the large covariance matrix with two-step monotone missing data

Estimation of the large covariance matrix with two-step monotone missing data Estimation of the large covariance matrix with two-ste monotone missing data Masashi Hyodo, Nobumichi Shutoh 2, Takashi Seo, and Tatjana Pavlenko 3 Deartment of Mathematical Information Science, Tokyo

More information

Feedback-error control

Feedback-error control Chater 4 Feedback-error control 4.1 Introduction This chater exlains the feedback-error (FBE) control scheme originally described by Kawato [, 87, 8]. FBE is a widely used neural network based controller

More information

State Estimation with ARMarkov Models

State Estimation with ARMarkov Models Deartment of Mechanical and Aerosace Engineering Technical Reort No. 3046, October 1998. Princeton University, Princeton, NJ. State Estimation with ARMarkov Models Ryoung K. Lim 1 Columbia University,

More information

Research Article An iterative Algorithm for Hemicontractive Mappings in Banach Spaces

Research Article An iterative Algorithm for Hemicontractive Mappings in Banach Spaces Abstract and Alied Analysis Volume 2012, Article ID 264103, 11 ages doi:10.1155/2012/264103 Research Article An iterative Algorithm for Hemicontractive Maings in Banach Saces Youli Yu, 1 Zhitao Wu, 2 and

More information

Proof: We follow thearoach develoed in [4]. We adot a useful but non-intuitive notion of time; a bin with z balls at time t receives its next ball at

Proof: We follow thearoach develoed in [4]. We adot a useful but non-intuitive notion of time; a bin with z balls at time t receives its next ball at A Scaling Result for Exlosive Processes M. Mitzenmacher Λ J. Sencer We consider the following balls and bins model, as described in [, 4]. Balls are sequentially thrown into bins so that the robability

More information

Sums of independent random variables

Sums of independent random variables 3 Sums of indeendent random variables This lecture collects a number of estimates for sums of indeendent random variables with values in a Banach sace E. We concentrate on sums of the form N γ nx n, where

More information

Extension of Minimax to Infinite Matrices

Extension of Minimax to Infinite Matrices Extension of Minimax to Infinite Matrices Chris Calabro June 21, 2004 Abstract Von Neumann s minimax theorem is tyically alied to a finite ayoff matrix A R m n. Here we show that (i) if m, n are both inite,

More information

Elementary theory of L p spaces

Elementary theory of L p spaces CHAPTER 3 Elementary theory of L saces 3.1 Convexity. Jensen, Hölder, Minkowski inequality. We begin with two definitions. A set A R d is said to be convex if, for any x 0, x 1 2 A x = x 0 + (x 1 x 0 )

More information

Universal Finite Memory Coding of Binary Sequences

Universal Finite Memory Coding of Binary Sequences Deartment of Electrical Engineering Systems Universal Finite Memory Coding of Binary Sequences Thesis submitted towards the degree of Master of Science in Electrical and Electronic Engineering in Tel-Aviv

More information

AM 221: Advanced Optimization Spring Prof. Yaron Singer Lecture 6 February 12th, 2014

AM 221: Advanced Optimization Spring Prof. Yaron Singer Lecture 6 February 12th, 2014 AM 221: Advanced Otimization Sring 2014 Prof. Yaron Singer Lecture 6 February 12th, 2014 1 Overview In our revious lecture we exlored the concet of duality which is the cornerstone of Otimization Theory.

More information

MATHEMATICAL MODELLING OF THE WIRELESS COMMUNICATION NETWORK

MATHEMATICAL MODELLING OF THE WIRELESS COMMUNICATION NETWORK Comuter Modelling and ew Technologies, 5, Vol.9, o., 3-39 Transort and Telecommunication Institute, Lomonosov, LV-9, Riga, Latvia MATHEMATICAL MODELLIG OF THE WIRELESS COMMUICATIO ETWORK M. KOPEETSK Deartment

More information

The inverse Goldbach problem

The inverse Goldbach problem 1 The inverse Goldbach roblem by Christian Elsholtz Submission Setember 7, 2000 (this version includes galley corrections). Aeared in Mathematika 2001. Abstract We imrove the uer and lower bounds of the

More information

where x i is the ith coordinate of x R N. 1. Show that the following upper bound holds for the growth function of H:

where x i is the ith coordinate of x R N. 1. Show that the following upper bound holds for the growth function of H: Mehryar Mohri Foundations of Machine Learning Courant Institute of Mathematical Sciences Homework assignment 2 October 25, 2017 Due: November 08, 2017 A. Growth function Growth function of stum functions.

More information

Homework Solution 4 for APPM4/5560 Markov Processes

Homework Solution 4 for APPM4/5560 Markov Processes Homework Solution 4 for APPM4/556 Markov Processes 9.Reflecting random walk on the line. Consider the oints,,, 4 to be marked on a straight line. Let X n be a Markov chain that moves to the right with

More information

arxiv:cond-mat/ v2 25 Sep 2002

arxiv:cond-mat/ v2 25 Sep 2002 Energy fluctuations at the multicritical oint in two-dimensional sin glasses arxiv:cond-mat/0207694 v2 25 Se 2002 1. Introduction Hidetoshi Nishimori, Cyril Falvo and Yukiyasu Ozeki Deartment of Physics,

More information

RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES

RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES RANDOM WALKS AND PERCOLATION: AN ANALYSIS OF CURRENT RESEARCH ON MODELING NATURAL PROCESSES AARON ZWIEBACH Abstract. In this aer we will analyze research that has been recently done in the field of discrete

More information

On the Toppling of a Sand Pile

On the Toppling of a Sand Pile Discrete Mathematics and Theoretical Comuter Science Proceedings AA (DM-CCG), 2001, 275 286 On the Toling of a Sand Pile Jean-Christohe Novelli 1 and Dominique Rossin 2 1 CNRS, LIFL, Bâtiment M3, Université

More information

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley

Elements of Asymptotic Theory. James L. Powell Department of Economics University of California, Berkeley Elements of Asymtotic Theory James L. Powell Deartment of Economics University of California, Berkeley Objectives of Asymtotic Theory While exact results are available for, say, the distribution of the

More information

IMPROVED BOUNDS IN THE SCALED ENFLO TYPE INEQUALITY FOR BANACH SPACES

IMPROVED BOUNDS IN THE SCALED ENFLO TYPE INEQUALITY FOR BANACH SPACES IMPROVED BOUNDS IN THE SCALED ENFLO TYPE INEQUALITY FOR BANACH SPACES OHAD GILADI AND ASSAF NAOR Abstract. It is shown that if (, ) is a Banach sace with Rademacher tye 1 then for every n N there exists

More information

1-way quantum finite automata: strengths, weaknesses and generalizations

1-way quantum finite automata: strengths, weaknesses and generalizations 1-way quantum finite automata: strengths, weaknesses and generalizations arxiv:quant-h/9802062v3 30 Se 1998 Andris Ambainis UC Berkeley Abstract Rūsiņš Freivalds University of Latvia We study 1-way quantum

More information

Statics and dynamics: some elementary concepts

Statics and dynamics: some elementary concepts 1 Statics and dynamics: some elementary concets Dynamics is the study of the movement through time of variables such as heartbeat, temerature, secies oulation, voltage, roduction, emloyment, rices and

More information

ECE 534 Information Theory - Midterm 2

ECE 534 Information Theory - Midterm 2 ECE 534 Information Theory - Midterm Nov.4, 009. 3:30-4:45 in LH03. You will be given the full class time: 75 minutes. Use it wisely! Many of the roblems have short answers; try to find shortcuts. You

More information

1. INTRODUCTION. Fn 2 = F j F j+1 (1.1)

1. INTRODUCTION. Fn 2 = F j F j+1 (1.1) CERTAIN CLASSES OF FINITE SUMS THAT INVOLVE GENERALIZED FIBONACCI AND LUCAS NUMBERS The beautiful identity R.S. Melham Deartment of Mathematical Sciences, University of Technology, Sydney PO Box 23, Broadway,

More information

Improved Capacity Bounds for the Binary Energy Harvesting Channel

Improved Capacity Bounds for the Binary Energy Harvesting Channel Imroved Caacity Bounds for the Binary Energy Harvesting Channel Kaya Tutuncuoglu 1, Omur Ozel 2, Aylin Yener 1, and Sennur Ulukus 2 1 Deartment of Electrical Engineering, The Pennsylvania State University,

More information

1 Extremum Estimators

1 Extremum Estimators FINC 9311-21 Financial Econometrics Handout Jialin Yu 1 Extremum Estimators Let θ 0 be a vector of k 1 unknown arameters. Extremum estimators: estimators obtained by maximizing or minimizing some objective

More information

Morten Frydenberg Section for Biostatistics Version :Friday, 05 September 2014

Morten Frydenberg Section for Biostatistics Version :Friday, 05 September 2014 Morten Frydenberg Section for Biostatistics Version :Friday, 05 Setember 204 All models are aroximations! The best model does not exist! Comlicated models needs a lot of data. lower your ambitions or get

More information

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests

System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests 009 American Control Conference Hyatt Regency Riverfront, St. Louis, MO, USA June 0-, 009 FrB4. System Reliability Estimation and Confidence Regions from Subsystem and Full System Tests James C. Sall Abstract

More information

ON THE LEAST SIGNIFICANT p ADIC DIGITS OF CERTAIN LUCAS NUMBERS

ON THE LEAST SIGNIFICANT p ADIC DIGITS OF CERTAIN LUCAS NUMBERS #A13 INTEGERS 14 (014) ON THE LEAST SIGNIFICANT ADIC DIGITS OF CERTAIN LUCAS NUMBERS Tamás Lengyel Deartment of Mathematics, Occidental College, Los Angeles, California lengyel@oxy.edu Received: 6/13/13,

More information

Age of Information: Whittle Index for Scheduling Stochastic Arrivals

Age of Information: Whittle Index for Scheduling Stochastic Arrivals Age of Information: Whittle Index for Scheduling Stochastic Arrivals Yu-Pin Hsu Deartment of Communication Engineering National Taiei University yuinhsu@mail.ntu.edu.tw arxiv:80.03422v2 [math.oc] 7 Ar

More information

B8.1 Martingales Through Measure Theory. Concept of independence

B8.1 Martingales Through Measure Theory. Concept of independence B8.1 Martingales Through Measure Theory Concet of indeendence Motivated by the notion of indeendent events in relims robability, we have generalized the concet of indeendence to families of σ-algebras.

More information

ECE 6960: Adv. Random Processes & Applications Lecture Notes, Fall 2010

ECE 6960: Adv. Random Processes & Applications Lecture Notes, Fall 2010 ECE 6960: Adv. Random Processes & Alications Lecture Notes, Fall 2010 Lecture 16 Today: (1) Markov Processes, (2) Markov Chains, (3) State Classification Intro Please turn in H 6 today. Read Chater 11,

More information

On the capacity of the general trapdoor channel with feedback

On the capacity of the general trapdoor channel with feedback On the caacity of the general tradoor channel with feedback Jui Wu and Achilleas Anastasooulos Electrical Engineering and Comuter Science Deartment University of Michigan Ann Arbor, MI, 48109-1 email:

More information

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL

MODELING THE RELIABILITY OF C4ISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Technical Sciences and Alied Mathematics MODELING THE RELIABILITY OF CISR SYSTEMS HARDWARE/SOFTWARE COMPONENTS USING AN IMPROVED MARKOV MODEL Cezar VASILESCU Regional Deartment of Defense Resources Management

More information

Quantitative estimates of propagation of chaos for stochastic systems with W 1, kernels

Quantitative estimates of propagation of chaos for stochastic systems with W 1, kernels oname manuscrit o. will be inserted by the editor) Quantitative estimates of roagation of chaos for stochastic systems with W, kernels Pierre-Emmanuel Jabin Zhenfu Wang Received: date / Acceted: date Abstract

More information

t 0 Xt sup X t p c p inf t 0

t 0 Xt sup X t p c p inf t 0 SHARP MAXIMAL L -ESTIMATES FOR MARTINGALES RODRIGO BAÑUELOS AND ADAM OSȨKOWSKI ABSTRACT. Let X be a suermartingale starting from 0 which has only nonnegative jums. For each 0 < < we determine the best

More information

Shadow Computing: An Energy-Aware Fault Tolerant Computing Model

Shadow Computing: An Energy-Aware Fault Tolerant Computing Model Shadow Comuting: An Energy-Aware Fault Tolerant Comuting Model Bryan Mills, Taieb Znati, Rami Melhem Deartment of Comuter Science University of Pittsburgh (bmills, znati, melhem)@cs.itt.edu Index Terms

More information

HENSEL S LEMMA KEITH CONRAD

HENSEL S LEMMA KEITH CONRAD HENSEL S LEMMA KEITH CONRAD 1. Introduction In the -adic integers, congruences are aroximations: for a and b in Z, a b mod n is the same as a b 1/ n. Turning information modulo one ower of into similar

More information

Analysis of Multi-Hop Emergency Message Propagation in Vehicular Ad Hoc Networks

Analysis of Multi-Hop Emergency Message Propagation in Vehicular Ad Hoc Networks Analysis of Multi-Ho Emergency Message Proagation in Vehicular Ad Hoc Networks ABSTRACT Vehicular Ad Hoc Networks (VANETs) are attracting the attention of researchers, industry, and governments for their

More information

p-adic Measures and Bernoulli Numbers

p-adic Measures and Bernoulli Numbers -Adic Measures and Bernoulli Numbers Adam Bowers Introduction The constants B k in the Taylor series exansion t e t = t k B k k! k=0 are known as the Bernoulli numbers. The first few are,, 6, 0, 30, 0,

More information

#A64 INTEGERS 18 (2018) APPLYING MODULAR ARITHMETIC TO DIOPHANTINE EQUATIONS

#A64 INTEGERS 18 (2018) APPLYING MODULAR ARITHMETIC TO DIOPHANTINE EQUATIONS #A64 INTEGERS 18 (2018) APPLYING MODULAR ARITHMETIC TO DIOPHANTINE EQUATIONS Ramy F. Taki ElDin Physics and Engineering Mathematics Deartment, Faculty of Engineering, Ain Shams University, Cairo, Egyt

More information

STA 250: Statistics. Notes 7. Bayesian Approach to Statistics. Book chapters: 7.2

STA 250: Statistics. Notes 7. Bayesian Approach to Statistics. Book chapters: 7.2 STA 25: Statistics Notes 7. Bayesian Aroach to Statistics Book chaters: 7.2 1 From calibrating a rocedure to quantifying uncertainty We saw that the central idea of classical testing is to rovide a rigorous

More information

arxiv: v2 [math.na] 6 Apr 2016

arxiv: v2 [math.na] 6 Apr 2016 Existence and otimality of strong stability reserving linear multiste methods: a duality-based aroach arxiv:504.03930v [math.na] 6 Ar 06 Adrián Németh January 9, 08 Abstract David I. Ketcheson We rove

More information

1 1 c (a) 1 (b) 1 Figure 1: (a) First ath followed by salesman in the stris method. (b) Alternative ath. 4. D = distance travelled closing the loo. Th

1 1 c (a) 1 (b) 1 Figure 1: (a) First ath followed by salesman in the stris method. (b) Alternative ath. 4. D = distance travelled closing the loo. Th 18.415/6.854 Advanced Algorithms ovember 7, 1996 Euclidean TSP (art I) Lecturer: Michel X. Goemans MIT These notes are based on scribe notes by Marios Paaefthymiou and Mike Klugerman. 1 Euclidean TSP Consider

More information
