Gambling in a rigged casino: The adversarial multi-armed bandit problem


Gambling in a rigged casino: The adversarial multi-armed bandit problem

Peter Auer
Institute for Theoretical Computer Science, University of Technology Graz, A-8010 Graz (Austria)
pauer@igi.tu-graz.ac.at

Nicolò Cesa-Bianchi
Department of Computer Science, Università di Milano, Milano (Italy)
cesabian@dsi.unimi.it

Yoav Freund and Robert E. Schapire
AT&T Labs, 180 Park Avenue, Florham Park, NJ
{yoav, schapire}@research.att.com

Internal Report DSI, Università di Milano, Italy

Abstract

In the multi-armed bandit problem, a gambler must decide which arm of $K$ non-identical slot machines to play in a sequence of trials so as to maximize his reward. This classical problem has received much attention because of the simple model it provides of the trade-off between exploration (trying out each arm to find the best one) and exploitation (playing the arm believed to give the best payoff). Past solutions for the bandit problem have almost always relied on assumptions about the statistics of the slot machines. In this work, we make no statistical assumptions whatsoever about the nature of the process generating the payoffs of the slot machines. We give a solution to the bandit problem in which an adversary, rather than a well-behaved stochastic process, has complete control over the payoffs. In a sequence of $T$ plays, we prove that the expected per-round payoff of our algorithm approaches that of the best arm at the rate $O(T^{-1/2})$, and we give an improved rate of convergence when the best arm has fairly low payoff. We also prove a general matching lower bound on the best possible performance of any algorithm in our setting. In addition, we consider a setting in which the player has a team of experts advising him on which arm to play; here, we give a strategy that will guarantee expected payoff close to that of the best expert. Finally, we apply our result to the problem of learning to play an unknown repeated matrix game against an all-powerful adversary.

An early extended abstract of this paper appeared in the proceedings of the 36th Annual Symposium on Foundations of Computer Science. The present draft is a very substantially revised and expanded version which has been submitted for journal publication.

1 Introduction

In the well studied multi-armed bandit problem, originally proposed by Robbins [16], a gambler must choose which of $K$ slot machines to play. At each time step, he pulls the arm of one of the machines and receives a reward or payoff (possibly zero or negative). The gambler's purpose is to maximize his total reward over a sequence of trials. Since each arm is assumed to have a different distribution of rewards, the goal is to find the arm with the best expected return as early as possible, and then to keep gambling using that arm.

The problem is a classical example of the trade-off between exploration and exploitation. On the one hand, if the gambler plays exclusively on the machine that he thinks is best ("exploitation"), he may fail to discover that one of the other arms actually has a higher average return. On the other hand, if he spends too much time trying out all the machines and gathering statistics ("exploration"), he may fail to play the best arm often enough to get a high total return.

As a more practically motivated example, consider the task of repeatedly choosing a route for transmitting packets between two points in a communication network. Suppose there are $K$ possible routes and the transmission cost is reported back to the sender. Then the problem can be seen as that of selecting a route for each packet so that the total cost of transmitting a large set of packets would not be much larger than the cost incurred by sending all of them on the single best route.

In the past, the bandit problem has almost always been studied with the aid of statistical assumptions on the process generating the rewards for each arm. In the gambling example, for instance, it might be natural to assume that the distribution of rewards for each arm is Gaussian and time-invariant. However, it is likely that the costs associated with each route in the routing example cannot be modeled by a stationary distribution, so a more sophisticated set of statistical assumptions would be required. In general, it may be difficult or impossible to determine the right statistical assumptions for a given domain, and some domains may be inherently adversarial in nature, so that no such assumptions are appropriate.

In this paper, we present a variant of the bandit problem in which no statistical assumptions are made about the generation of rewards. In our model, the reward associated with each arm is determined at each time step by an adversary with unbounded computational power rather than by some benign stochastic process. We only assume that the rewards are chosen from a bounded range. The performance of any player is measured in terms of regret, i.e., the expected difference between the total reward scored by the player and the total reward scored by the best arm.

At first it may seem impossible that the player should stand a chance against such a powerful opponent. Indeed, a deterministic player will fare very badly against an adversary who assigns low payoff to the chosen arm and high payoff to all the other arms. However, in this paper we present a very efficient, randomized player algorithm that performs well against any adversary. We prove that the difference between the expected gain of our algorithm and the expected gain of the best of the $K$ arms is at most $O(\sqrt{TK\log K})$, where $T$ is the number of time steps. Note that the average per-time-step regret approaches zero at the rate $O(1/\sqrt{T})$.
We also present more refined bounds in which the dependence on $T$ is replaced by the total reward of the best arm (or an assumed upper bound thereof).

Our worst-case bounds may appear weaker than the bounds proved using statistical assumptions, such as those shown by Lai and Robbins [12] of the form $O(\log T)$. However, when comparing our results to those in the statistics literature, it is important to point out an important difference in the asymptotic quantification. In the work of Lai and Robbins the assumption is that the distribution of rewards associated with each arm is fixed as the total number of iterations $T$ increases to infinity. In contrast, our bounds hold for any finite $T$, and, by the generality of our model, these bounds are applicable when the payoffs are randomly (or adversarially) chosen in a manner that does depend on $T$. It is this quantification order, and not the adversarial nature of our framework, which is the cause for the apparent gap. We prove this point by showing that, for any algorithm for the $K$-arm bandit problem and for any $T$, there exists a set of $K$ reward distributions such that the expected regret of the algorithm when playing against such arms for $T$ iterations is lower bounded by $\Omega(\sqrt{KT})$.

We can also show that the per-sequence regret is well behaved. More precisely, we show that our algorithm can guarantee that the actual (rather than expected) difference between its gain and the gain of the best arm on any run is upper bounded by $O(T^{2/3}(K\ln K)^{1/3})$ with high probability. This bound is weaker than the bound on the expected regret. It is not clear whether or not this bound can be improved to have a dependence of $O(\sqrt{T})$ on the number of trials.

A non-stochastic bandit problem was also considered by Gittins [9] and Ishikida and Varaiya [11]. However, their version of the bandit problem is very different from ours: they assume that the player can compute ahead of time exactly what payoffs will be received from each arm, and their problem is thus one of optimization, rather than exploration and exploitation.

Our algorithm is based in part on an algorithm presented by Freund and Schapire [6, 7], which in turn is a variant of Littlestone and Warmuth's [13] weighted majority algorithm, and Vovk's [17] aggregating strategies. In the setting analyzed by Freund and Schapire (which we call here the full information game), the player on each trial scores the reward of the chosen arm, but gains access to the rewards associated with all of the arms (not just the one that was chosen).

In some situations, picking the same action at all trials might not be the best strategy. For example, in the packet routing problem it might be that no single route is good for the whole duration of the message, but switching between routes from time to time can yield better performance. We give a variant of our algorithm which combines the choices of $N$ strategies (or "experts"), each of which recommends one of the $K$ actions at each iteration. We show that the regret with respect to the best strategy is $O(\sqrt{TK\ln N})$. Note that the dependence on the number of strategies is only logarithmic, and therefore the bound is quite reasonable even when the player is combining a very large number of strategies.

The adversarial bandit problem is closely related to the problem of learning to play an unknown repeated matrix game. In this setting, a player without prior knowledge of the game matrix is playing the game repeatedly against an adversary with complete knowledge of the game and unbounded computational power. It is well known that matrix games have an associated value which is the best possible expected payoff when playing the game against an adversary. If the matrix is known, then a randomized strategy that achieves the value of the game can be computed (say, using a linear-programming algorithm) and employed by the player. The case where the matrix is entirely unknown was previously considered by Baños [1] and Megiddo [14], who proposed two different strategies whose per-round payoff converges to the game value. Both of these algorithms are extremely inefficient. For the same problem, we show that by using our algorithm the player achieves an expected per-round payoff in $T$ rounds which efficiently approaches the value of the game at the rate $O(T^{-1/2})$. This convergence is much faster than that achieved by Baños and Megiddo.

Our paper is organized as follows. In Section 2, we give the formal definition of the problem. In Section 3, we describe Freund and Schapire's algorithm for the full information game and state its performance.
In Section 4, we describe our basic algorithm for the partial information game and prove the bound on the expected regret. In Section 5, we prove a bound on the regret of our algorithm on typical sequences. In Section 6, we show how to adaptively tune the parameters of the algorithm when no prior knowledge (such as the length of the game) is available. In Section 7, we give a lower bound on the regret suffered by any algorithm for the partial information game. In Section 8, we show how to modify the algorithm to use expert advice. Finally, in Section 9, we describe the application of our algorithm to repeated matrix games.

2 Notation and terminology

We formalize the bandit problem as a game between a player choosing actions and an adversary choosing the rewards associated with each action. The game is parameterized by the number $K$ of possible actions, where each action is denoted by an integer $i$, $1 \le i \le K$.

We will assume that all the rewards belong to the unit interval $[0, 1]$. The generalization to rewards in $[a, b]$ for arbitrary $a < b$ is straightforward. The game is played in a sequence of trials $t = 1, 2, \dots, T$. We distinguish two variants: the partial information game, which captures the adversarial multi-armed bandit problem; and the full information game, which is essentially equivalent to the framework studied by Freund and Schapire [6]. On each trial $t$ of the full information game:

1. The adversary selects a vector $x(t) \in [0,1]^K$ of current rewards. The $i$th component $x_i(t)$ is interpreted as the reward associated with action $i$ at trial $t$.
2. Without knowledge of the adversary's choice, the player chooses an action by picking a number $i_t \in \{1, 2, \dots, K\}$ and scores the corresponding reward $x_{i_t}(t)$.
3. The player observes the entire vector $x(t)$ of current rewards.

The partial information game corresponds to the above description of the full information game but with step 3 replaced by:

3'. The player observes only the reward $x_{i_t}(t)$ for the chosen action $i_t$.

Let $G_A := \sum_{t=1}^T x_{i_t}(t)$ be the total reward of player $A$ choosing actions $i_1, i_2, \dots, i_T$.

We formally define an adversary as a deterministic¹ function mapping the past history of play $i_1, \dots, i_{t-1}$ to the current reward vector $x(t)$. As a special case, we say that an adversary is oblivious if it is independent of the player's actions, i.e., if the reward at trial $t$ is a function of $t$ only. All of our results, which are proved for a nonoblivious adversary, hold for an oblivious adversary as well.

As our player algorithms will be randomized, fixing an adversary and a player algorithm defines a probability distribution over the set $\{1, \dots, K\}^T$ of sequences of $T$ actions. All the probabilities and expectations considered in this paper will be with respect to this distribution. For an oblivious adversary, the rewards are fixed quantities with respect to this distribution, but for a nonoblivious adversary, each reward $x_i(t)$ is a random variable defined on the set $\{1, \dots, K\}^{t-1}$ of player actions up to trial $t-1$. We will not use explicit notation to represent this dependence, but will refer to it in the text when appropriate.

The measure of the performance of our algorithm is the regret, which is the difference between the total reward of the algorithm $G_A$ and the total reward of the best action. We shall mostly be concerned with the expected regret of the algorithm. Formally, we define the expected total reward of algorithm $A$ by
$$E[G_A] := E_{i_1,\dots,i_T}\Big[\sum_{t=1}^T x_{i_t}(t)\Big],$$
the expected total reward of the best action by
$$EG_{\max} := \max_{1\le j\le K} E_{i_1,\dots,i_T}\Big[\sum_{t=1}^T x_j(t)\Big],$$
and the expected regret of algorithm $A$ by $R_A = EG_{\max} - E[G_A]$. This definition is easiest to interpret for an oblivious adversary since, in this case, $EG_{\max}$ truly measures what could have been gained had the best action been played for the entire sequence.
However, for a nonoblivious adversary, the definition of regret is a bit strange: it still compares the total reward of the algorithm to the sum of rewards that were associated with taking some action $j$ on all iterations; however, had action $j$ actually been taken, the rewards chosen by the adversary would have been different than those actually generated, since the variable $x_j(t)$ depends on the past history of plays $i_1, \dots, i_{t-1}$. Although the definition of $R_A$ looks difficult to interpret in this case, in Section 9 we prove that our bounds on the regret for a nonoblivious adversary can also be used to derive an interesting result in the context of repeated matrix games. We shall also give a bound that holds with high probability on the actual regret of the algorithm, i.e., on the actual difference between the gain of the algorithm and the gain of the best action:
$$\max_j \sum_{t=1}^T x_j(t) - G_A.$$

¹ There is no loss of generality in assuming that the adversary is deterministic. To see this, assume that the adversary maps past histories to distributions over the values of $x(t)$. This defines a stochastic strategy for the adversary for the $T$-step game, which is equivalent to a distribution over all deterministic adversarial strategies for the $T$-step game. Assume that $A$ is any player algorithm and that $B$ is the worst-case stochastic strategy for the adversary playing against $A$. The stated equivalence implies that there is a deterministic adversarial strategy $\tilde B$ against which the gain of $A$ is at most as large as the gain of $A$ against $B$. (The same argument can easily be made for other measures of performance, such as the regret, which is defined shortly.)
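To make the protocol concrete, the following is a minimal Python sketch of the partial information game loop just defined. The `player` and `adversary` interfaces are hypothetical stand-ins (not part of the paper): the player exposes `choose()` and `update()`, and the adversary maps the history of plays to a reward vector in $[0,1]^K$.

```python
def play_partial_information_game(player, adversary, K, T):
    """Run T trials of the partial information game.

    player.choose() returns an action in {0, ..., K-1}; player.update(i, r)
    receives only the reward of the chosen action.  adversary(history) returns
    the reward vector x(t) in [0, 1]^K and may depend on past plays.
    """
    history = []
    gain = 0.0                        # G_A, the player's total reward
    action_gain = [0.0] * K           # G_i, the actual total reward of each action
    for _ in range(T):
        x = adversary(list(history))  # reward vector chosen before seeing i_t
        i = player.choose()
        player.update(i, x[i])        # step 3': only x_{i_t}(t) is revealed
        history.append(i)
        gain += x[i]
        for j in range(K):
            action_gain[j] += x[j]
    actual_regret = max(action_gain) - gain
    return gain, actual_regret
```

Only the chosen component $x_{i_t}(t)$ is passed back to the player, which is exactly what distinguishes the partial information game from the full information game.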

3 The full information game

In this section, we describe an algorithm, called Hedge, for the full information game which will also be used as a building block in the design of our algorithm for the partial information game. The version of Hedge presented here is a variant² of the algorithm introduced by Freund and Schapire [6], which itself is a direct generalization of Littlestone and Warmuth's Weighted Majority algorithm [13]. Hedge is described in Figure 1. The main idea is simply to choose action $i$ at time $t$ with probability proportional to $\exp(\eta G_i(t-1))$, where $\eta > 0$ is a parameter and $G_i(t) = \sum_{t'=1}^t x_i(t')$ is the total reward scored by action $i$ up through trial $t$. Thus, actions yielding high rewards quickly gain a high probability of being chosen. Since we allow for rewards larger than 1, proving bounds for Hedge is more complex than for Freund and Schapire's original algorithm.

Algorithm Hedge
Parameter: a real number $\eta > 0$.
Initialization: set $G_i(0) := 0$ for $i = 1, \dots, K$.
Repeat for $t = 1, 2, \dots$ until the game ends:
1. Choose action $i_t$ according to the distribution $p(t)$, where
$$p_i(t) = \frac{\exp(\eta\,G_i(t-1))}{\sum_{j=1}^K \exp(\eta\,G_j(t-1))}.$$
2. Receive the reward vector $x(t)$ and score gain $x_{i_t}(t)$.
3. Set $G_i(t) := G_i(t-1) + x_i(t)$ for $i = 1, \dots, K$.

Figure 1: Algorithm Hedge for the full information game.

² These modifications enable Hedge to handle gains (rewards in $[0, M]$) rather than losses (rewards in $[-1, 0]$). Note that we also allow rewards larger than 1. These changes are necessary to use Hedge as a building block in the partial information game.
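The following is a small Python sketch of Hedge as described in Figure 1. It is a reading of the pseudocode above, not a tuned implementation; `eta` stands for the parameter $\eta$, and the shift of the weights by their maximum is only a numerical-stability detail that does not appear in the figure and does not change the distribution.

```python
import math

class Hedge:
    """Sketch of algorithm Hedge (Figure 1)."""

    def __init__(self, K, eta):
        self.K = K
        self.eta = eta
        self.G = [0.0] * K                  # G_i(0) := 0

    def distribution(self):
        # p_i(t) proportional to exp(eta * G_i(t-1))
        m = max(self.G)
        w = [math.exp(self.eta * (g - m)) for g in self.G]
        s = sum(w)
        return [wi / s for wi in w]

    def update(self, x):
        # receive the (possibly simulated) reward vector and update the totals
        for i in range(self.K):
            self.G[i] += x[i]
```

With rewards in $[0,1]$ and $T$ known in advance, the tuning of $\eta$ discussed after Corollary 3.2 below gives regret at most $\sqrt{2T\ln K}$.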

The following theorem is an extension of Freund and Schapire's Theorem 2. Here and throughout this paper, we make use of the function $\Phi_M(x)$, which is defined for $M \ne 0$ to be
$$\Phi_M(x) := \frac{e^{Mx} - 1 - Mx}{M^2}.$$

Theorem 3.1 For $\eta > 0$, and for any sequence of reward vectors $x(1), \dots, x(T)$ with $x_i(t) \in [0, M]$, $M > 0$, the probability vectors $p(t)$ computed by Hedge satisfy
$$\sum_{t=1}^T\sum_{i=1}^K p_i(t)\,x_i(t) \;\ge\; \sum_{t=1}^T x_j(t) - \frac{\ln K}{\eta} - \frac{\Phi_M(\eta)}{\eta}\sum_{t=1}^T\sum_{i=1}^K p_i(t)\,x_i(t)^2$$
for all actions $j = 1, \dots, K$.

In the special case that $M = 1$, we can replace $\sum_{t=1}^T\sum_{i=1}^K p_i(t)x_i(t)^2$ with its upper bound $\sum_{t=1}^T\sum_{i=1}^K p_i(t)x_i(t)$ to get the following lower bound on the gain of Hedge.

Corollary 3.2 For $\eta > 0$, and for any sequence of reward vectors $x(1), \dots, x(T)$ with $x_i(t) \in [0, 1]$, the probability vectors $p(t)$ computed by Hedge satisfy
$$\sum_{t=1}^T p(t)\cdot x(t) \;\ge\; \frac{\eta\,\max_j\sum_{t=1}^T x_j(t) - \ln K}{e^{\eta} - 1}.$$

Note that $p(t)\cdot x(t) = E_{i_t}[x_{i_t}(t) \mid i_1, \dots, i_{t-1}]$, so this corollary immediately implies the lower bound
$$E[G_{\text{Hedge}}] = E\Big[\sum_{t=1}^T x_{i_t}(t)\Big] = E\Big[\sum_{t=1}^T p(t)\cdot x(t)\Big] \;\ge\; \frac{\eta\,E\big[\max_j\sum_{t=1}^T x_j(t)\big] - \ln K}{e^{\eta} - 1} \;\ge\; \frac{\eta\,EG_{\max} - \ln K}{e^{\eta} - 1}.$$
In particular, it can be shown that if we choose $\eta = \ln\big(1 + \sqrt{2(\ln K)/T}\big)$ then Hedge suffers regret at most $\sqrt{2T\ln K}$ in the full information game, i.e., $E[G_{\text{Hedge}}] \ge EG_{\max} - \sqrt{2T\ln K}$.

To prove Theorem 3.1, we will use the following inequality.

Lemma 3.3 For all $\eta > 0$, for all $M \ne 0$ and for all $x \le M$:
$$e^{\eta x} \le 1 + \eta x + \Phi_M(\eta)\,x^2.$$

Proof. It suffices to show that the function
$$f(x) := \frac{e^{\eta x} - 1 - \eta x}{x^2}$$
is nondecreasing, since the inequality $f(x) \le f(M)$ immediately implies the lemma. (We can make $f$ continuous with continuous derivatives by defining $f(0) = \eta^2/2$.) We need to show that the first derivative $f'(x) \ge 0$, for which it is sufficient to show that
$$g(x) := \frac{x^3 f'(x)}{e^{\eta x} + 1} = \eta x - 2\,\frac{e^{\eta x} - 1}{e^{\eta x} + 1}$$
is nonnegative for positive $x$ and nonpositive for negative $x$.

This can be proved by noting that $g(0) = 0$ and that the first derivative
$$g'(x) = \eta\left(\frac{e^{\eta x} - 1}{e^{\eta x} + 1}\right)^{2}$$
is obviously nonnegative. □

Proof of Theorem 3.1. Let $W_t = \sum_{i=1}^K \exp(\eta\,G_i(t-1))$. By definition of the algorithm, we find that, for all $1 \le t \le T$,
$$\frac{W_{t+1}}{W_t} = \sum_{i=1}^K \frac{\exp(\eta\,G_i(t-1))}{W_t}\,\exp(\eta\,x_i(t)) = \sum_{i=1}^K p_i(t)\exp(\eta\,x_i(t)) \;\le\; 1 + \eta\sum_{i=1}^K p_i(t)x_i(t) + \Phi_M(\eta)\sum_{i=1}^K p_i(t)x_i(t)^2$$
using Lemma 3.3. Taking logarithms and summing over $t = 1, \dots, T$ yields
$$\ln\frac{W_{T+1}}{W_1} = \sum_{t=1}^T \ln\frac{W_{t+1}}{W_t} \le \sum_{t=1}^T \ln\Big(1 + \eta\sum_{i=1}^K p_i(t)x_i(t) + \Phi_M(\eta)\sum_{i=1}^K p_i(t)x_i(t)^2\Big) \le \eta\sum_{t=1}^T\sum_{i=1}^K p_i(t)x_i(t) + \Phi_M(\eta)\sum_{t=1}^T\sum_{i=1}^K p_i(t)x_i(t)^2 \quad (1)$$
since $1 + x \le e^x$ for all $x$. Observing that $W_1 = K$ and, for any $j$, $W_{T+1} \ge \exp(\eta\,G_j(T))$, we get
$$\ln\frac{W_{T+1}}{W_1} \ge \eta\,G_j(T) - \ln K. \quad (2)$$
Combining Equations (1) and (2) and rearranging, we obtain the statement of the theorem. □

4 The partial information game

In this section, we move to the analysis of the partial information game. We present an algorithm Exp3 that runs the algorithm Hedge of Section 3 as a subroutine. (Exp3 stands for "Exponential-weight algorithm for Exploration and Exploitation.") The algorithm is described in Figure 2. On each trial $t$, Exp3 receives the distribution vector $p(t)$ from Hedge and selects an action $i_t$ according to the distribution $\hat p(t)$, which is a mixture of $p(t)$ and the uniform distribution. Intuitively, mixing in the uniform distribution is done to make sure that the algorithm tries out all $K$ actions and gets good estimates of the rewards for each. Otherwise, the algorithm might miss a good action because the initial rewards it observes for this action are low and large rewards that occur later are not observed because the action is not selected.

After Exp3 receives the reward $x_{i_t}(t)$ associated with the chosen action, it generates a simulated reward vector $\hat x(t)$ for Hedge. As Hedge requires full information, all components of this vector must be filled in, even for the actions that were not selected.

For the chosen action $i_t$, we set the simulated reward $\hat x_{i_t}(t)$ to $x_{i_t}(t)/\hat p_{i_t}(t)$. Dividing the actual gain by the probability that the action was chosen compensates the reward of actions that are unlikely to be chosen. The other actions all receive a simulated reward of zero. This choice of simulated rewards guarantees that the expected simulated gain associated with any fixed action $j$ is equal to the actual gain of the action; that is, $E_{i_t}[\hat x_j(t) \mid i_1, \dots, i_{t-1}] = x_j(t)$.

Algorithm Exp3
Parameters: reals $\eta > 0$ and $\gamma \in (0, 1]$.
Initialization: initialize Hedge($\eta$).
Repeat for $t = 1, 2, \dots$ until the game ends:
1. Get the distribution $p(t)$ from Hedge.
2. Select action $i_t$ to be $j$ with probability $\hat p_j(t) = (1-\gamma)\,p_j(t) + \gamma/K$.
3. Receive reward $x_{i_t}(t) \in [0, 1]$.
4. Feed the simulated reward vector $\hat x(t)$ back to Hedge, where
$$\hat x_j(t) = \begin{cases} x_{i_t}(t)/\hat p_{i_t}(t) & \text{if } j = i_t \\ 0 & \text{otherwise.}\end{cases}$$

Figure 2: Algorithm Exp3 for the partial information game.

We now give the first main theorem of this paper, which bounds the regret of algorithm Exp3.

Theorem 4.1 For $\eta > 0$ and $\gamma \in (0, 1]$, the expected gain of algorithm Exp3 is at least
$$E[G_{\text{Exp3}}] \;\ge\; EG_{\max} - \left(\gamma + \frac{K\,\Phi_{K/\gamma}(\eta)}{\eta}\right)EG_{\max} - \frac{(1-\gamma)\ln K}{\eta}.$$

To understand this theorem, it is helpful to consider a simpler bound which can be obtained by an appropriate choice of the parameters $\eta$ and $\gamma$:

Corollary 4.2 Assume that $g \ge EG_{\max}$ and that algorithm Exp3 is run with input parameters $\eta = \gamma/K$ and
$$\gamma = \min\left\{1,\; \sqrt{\frac{K\ln K}{(e-1)\,g}}\right\}.$$
Then the expected regret of algorithm Exp3 is at most
$$R_{\text{Exp3}} \;\le\; 2\sqrt{e-1}\,\sqrt{gK\ln K} \;\le\; 2.63\,\sqrt{gK\ln K}.$$

Proof. If $g \le (K\ln K)/(e-1)$, then the bound is trivial since the expected regret cannot be more than $g$. Otherwise, by Theorem 4.1, the expected regret is at most
$$\left(\gamma + \frac{K\,\Phi_{K/\gamma}(\eta)}{\eta}\right)g + \frac{\ln K}{\eta} = (e-1)\gamma g + \frac{K\ln K}{\gamma} = 2\sqrt{e-1}\,\sqrt{gK\ln K}. \qquad \Box$$
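As an illustration, here is a Python sketch of Exp3 as described in Figure 2, reusing the `Hedge` class sketched in Section 3, together with the parameter choices of Corollary 4.2. The helper name `exp3_tuning` and the class layout are ours, not the paper's.

```python
import math
import random

def exp3_tuning(K, g):
    """Parameter choices of Corollary 4.2 given an upper bound g >= EG_max."""
    gamma = min(1.0, math.sqrt(K * math.log(K) / ((math.e - 1) * g)))
    return gamma, gamma / K                      # (gamma, eta) with eta = gamma / K

class Exp3:
    """Sketch of algorithm Exp3 (Figure 2), running Hedge as a subroutine."""

    def __init__(self, K, gamma, eta):
        self.K = K
        self.gamma = gamma
        self.hedge = Hedge(K, eta)               # Hedge sketch from Section 3

    def choose(self):
        p = self.hedge.distribution()
        # hat-p(t): mix Hedge's distribution with the uniform distribution
        self.p_hat = [(1 - self.gamma) * pj + self.gamma / self.K for pj in p]
        self.i = random.choices(range(self.K), weights=self.p_hat)[0]
        return self.i

    def update(self, i, reward):
        # simulated reward vector: reward / hat-p_i for the chosen action, 0 elsewhere
        x_hat = [0.0] * self.K
        x_hat[i] = reward / self.p_hat[i]
        self.hedge.update(x_hat)
```

For instance, with $T$ known in advance one would construct the player as `Exp3(K, *exp3_tuning(K, g=T))`, following the discussion of the choice $g = T$ in the next paragraph.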

To apply Corollary 4.2, it is necessary that an upper bound $g$ on $EG_{\max}$ be available for tuning $\eta$ and $\gamma$. For example, if the number of trials $T$ is known in advance then, since no action can have payoff greater than 1 on any trial, we can use $g = T$ as an upper bound. In Section 6, we give a technique that does not require prior knowledge of such an upper bound. If the rewards $x_i(t)$ are in the range $[a, b]$, $a < b$, then Exp3 can be used after the rewards have been translated and rescaled to the range $[0, 1]$. Applying Corollary 4.2 with $g = T$ gives the bound $(b-a)\,2\sqrt{e-1}\,\sqrt{TK\ln K}$ on the regret. For instance, this is applicable to a standard loss model where the rewards fall in the range $[-1, 0]$.

Proof of Theorem 4.1. By the definition of the algorithm, we have that $\hat x_i(t) \le 1/\hat p_i(t) \le K/\gamma$. Thus we find, by Theorem 3.1, that for all actions $j = 1, \dots, K$,
$$\sum_{t=1}^T\sum_{i=1}^K p_i(t)\,\hat x_i(t) \;\ge\; \sum_{t=1}^T \hat x_j(t) - \frac{\ln K}{\eta} - \frac{\Phi_{K/\gamma}(\eta)}{\eta}\sum_{t=1}^T\sum_{i=1}^K p_i(t)\,\hat x_i(t)^2.$$
Since
$$\sum_{i=1}^K p_i(t)\,\hat x_i(t) = p_{i_t}(t)\,\frac{x_{i_t}(t)}{\hat p_{i_t}(t)} \le \frac{x_{i_t}(t)}{1-\gamma} \quad (3)$$
and
$$\sum_{i=1}^K p_i(t)\,\hat x_i(t)^2 = p_{i_t}(t)\,\frac{x_{i_t}(t)}{\hat p_{i_t}(t)}\,\hat x_{i_t}(t) \le \frac{\hat x_{i_t}(t)}{1-\gamma}, \quad (4)$$
we get that for all actions $j = 1, \dots, K$
$$\sum_{t=1}^T x_{i_t}(t) \;\ge\; (1-\gamma)\sum_{t=1}^T \hat x_j(t) - \frac{(1-\gamma)\ln K}{\eta} - \frac{\Phi_{K/\gamma}(\eta)}{\eta}\sum_{t=1}^T \hat x_{i_t}(t). \quad (5)$$
Note that
$$G_{\text{Exp3}} = \sum_{t=1}^T x_{i_t}(t) \qquad\text{and}\qquad \sum_{t=1}^T \hat x_{i_t}(t) = \sum_{t=1}^T\sum_{i=1}^K \hat x_i(t). \quad (6)$$
We next take the expectation of Equation (5) with respect to the distribution of $\langle i_1, \dots, i_T\rangle$. For the expected value of $\hat x_j(t)$, we have:
$$E[\hat x_j(t)] = E_{i_1,\dots,i_{t-1}}\big[E_{i_t}[\hat x_j(t) \mid i_1,\dots,i_{t-1}]\big] = E_{i_1,\dots,i_{t-1}}\Big[\hat p_j(t)\,\frac{x_j(t)}{\hat p_j(t)} + (1 - \hat p_j(t))\cdot 0\Big] = E[x_j(t)]. \quad (7)$$
Combining Equations (5), (6) and (7), we find that
$$E[G_{\text{Exp3}}] \;\ge\; (1-\gamma)\sum_{t=1}^T E[x_j(t)] - \frac{(1-\gamma)\ln K}{\eta} - \frac{\Phi_{K/\gamma}(\eta)}{\eta}\sum_{t=1}^T\sum_{i=1}^K E[x_i(t)].$$
Since $\max_j \sum_{t=1}^T E[x_j(t)] = EG_{\max}$ and $\sum_{t=1}^T\sum_{i=1}^K E[x_i(t)] \le K\,EG_{\max}$, we obtain the inequality in the statement of the theorem. □
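A minimal wrapper illustrating the translation and rescaling mentioned above: rewards in $[a,b]$ are mapped to $[0,1]$ before being fed to the Exp3 sketch, so that Corollary 4.2 applies and the resulting regret guarantee scales by $(b-a)$. The class name is ours.

```python
class RescaledExp3:
    """Feed Exp3 rewards translated and rescaled from [a, b] to [0, 1]."""

    def __init__(self, K, gamma, eta, a, b):
        assert a < b
        self.inner = Exp3(K, gamma, eta)     # Exp3 sketch above
        self.a, self.b = a, b

    def choose(self):
        return self.inner.choose()

    def update(self, i, reward):
        self.inner.update(i, (reward - self.a) / (self.b - self.a))
```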

5 A bound on the regret that holds with high probability

In the last section, we showed that algorithm Exp3 with appropriately set parameters can guarantee an expected regret of at most $O(\sqrt{gK\ln K})$. In the case that the adversarial strategy is oblivious (i.e., when the rewards associated with each action are chosen without regard to the player's past actions), we compare the expected gain of the player to $EG_{\max}$, which, in this case, is the actual gain of the best action. However, if the adversary is not oblivious, our notion of expected regret can be very weak. Consider, for example, a rather benign but nonoblivious adversary which assigns reward 0 to all actions on the first round, and then, on all future rounds, assigns reward 1 to action $i_1$ (i.e., to whichever action was played by the player on the first round), and 0 to all other actions. In this case, assuming the player chooses the first action uniformly at random (as do all algorithms considered in this paper), the expected total gain of any action is $(T-1)/K$. This means that the bound that we get from Corollary 4.2 in this case will guarantee only that the expected gain of the algorithm is not much smaller than $EG_{\max} = (T-1)/K$. This is a very weak guarantee since, in each run, there is one action whose actual gain is $T-1$. On the other hand, Exp3 would clearly perform much better than promised in this simple case.

Clearly, we need a bound that relates the player's gain to the actual gain of the best action in the same run. In this section, we prove such a bound for Exp3. Specifically, let us define the random variable
$$G_i = \sum_{t=1}^T x_i(t)$$
to be the actual total gain of action $i$, and let $G_{\max} = \max_i G_i$ be the actual total gain of the best action. The main result of this section is a proof of a bound which holds with high probability relating the player's actual gain $G_{\text{Exp3}}$ to $G_{\max}$. We show that the dependence of the difference $G_{\max} - G_{\text{Exp3}}$ as a function of $T$ is $O(T^{2/3})$ with high probability for an appropriate setting of Exp3's parameters. This dependence is sufficient to show that the average per-trial gain of the algorithm approaches that of the best action as $T \to \infty$. However, the dependence is significantly worse than the $O(\sqrt{T})$ dependence of the bound on the expected regret proved in Theorem 4.1. It is an open question whether the gap between the bounds is real or can be closed by this or some other algorithm.

For notational convenience, let us also define the random variables
$$\hat G_i = \sum_{t=1}^T \hat x_i(t) \qquad\text{and}\qquad \hat G_{\max} = \max_i \hat G_i.$$
The heart of the proof of the result in this section is an upper bound that holds with high probability on the deviation of $\hat G_i$ from $G_i$ for any action $i$. The main difficulty in proving such a bound is that the gains associated with a single action in different trials are not independent of each other, but may be dependent through the decisions made by the adversary. However, using martingale theory, we can prove the following lemma:

Lemma 5.1 Let $\lambda > 0$ and $\delta > 0$. Then with probability at least $1 - \delta$, for every action $i$,
$$\hat G_i \;\ge\; \left(1 - \frac{K\,\Phi_1(\lambda)}{\gamma\lambda}\right)G_i - \frac{\ln(K/\delta)}{\lambda}.$$

Proof. Given in Appendix A. □

Using this lemma, we can prove the main result of this section:

Theorem 5.2 Let $\eta > 0$, $\gamma \in (0, 1]$, $\lambda > 0$ and $\delta > 0$. Then with probability at least $1 - \delta$, the gain of algorithm Exp3 is at least
$$G_{\text{Exp3}} \;\ge\; G_{\max} - \left(\gamma + \frac{K\,\Phi_{K/\gamma}(\eta)}{\eta} + \frac{K\,\Phi_1(\lambda)}{\gamma\lambda}\right)G_{\max} - \frac{(1-\gamma)\ln K}{\eta} - \frac{\ln(K/\delta)}{\lambda}.$$

Proof. Note first that
$$\sum_{t=1}^T \hat x_{i_t}(t) = \sum_{t=1}^T\sum_{i=1}^K \hat x_i(t) = \sum_{i=1}^K \hat G_i \le K\,\hat G_{\max}.$$
Combining with Equation (5) gives, for all actions $j$,
$$G_{\text{Exp3}} \;\ge\; (1-\gamma)\hat G_j - \frac{(1-\gamma)\ln K}{\eta} - \frac{K\,\Phi_{K/\gamma}(\eta)}{\eta}\,\hat G_{\max} \;\ge\; \left(1 - \gamma - \frac{K\,\Phi_{K/\gamma}(\eta)}{\eta}\right)\hat G_{\max} - \frac{(1-\gamma)\ln K}{\eta},$$
where the last step takes $j$ to be the action maximizing $\hat G_j$. Next, we apply Lemma 5.1, which implies that, with probability at least $1-\delta$, for all actions $i$,
$$G_{\text{Exp3}} \;\ge\; \left(1 - \gamma - \frac{K\,\Phi_{K/\gamma}(\eta)}{\eta}\right)\left(\left(1 - \frac{K\,\Phi_1(\lambda)}{\gamma\lambda}\right)G_i - \frac{\ln(K/\delta)}{\lambda}\right) - \frac{(1-\gamma)\ln K}{\eta} \;\ge\; G_i - \left(\gamma + \frac{K\,\Phi_{K/\gamma}(\eta)}{\eta} + \frac{K\,\Phi_1(\lambda)}{\gamma\lambda}\right)G_i - \frac{\ln(K/\delta)}{\lambda} - \frac{(1-\gamma)\ln K}{\eta}.$$
Choosing $i$ to be the best action gives the result. □

To interpret this result, we give the following simple corollary.

Corollary 5.3 Let $\delta > 0$. Assume that $g \ge G_{\max}$ and that algorithm Exp3 is run with input parameters $\eta = \gamma/K$ and
$$\gamma = \min\left\{1,\; \left(\frac{K\ln(K/\delta)}{b^2 g}\right)^{1/3}\right\},$$
where $b = (e-1)/2$. Then with probability at least $1-\delta$ the regret of algorithm Exp3 is at most
$$R_{\text{Exp3}} \;\le\; (b^{4/3} + 4b^{1/3})\,g^{2/3}(K\ln(K/\delta))^{1/3} \;\le\; 4.62\,g^{2/3}(K\ln(K/\delta))^{1/3}.$$

Proof. We assume that $g \ge b^{-2}K\ln(K/\delta)$, since otherwise the bound follows from the trivial fact that the regret is at most $g$. We apply Theorem 5.2 setting
$$\lambda = \left(\frac{(\ln(K/\delta))^2}{b\,K\,g^2}\right)^{1/3}.$$
Given our assumed lower bound on $g$, we have that $\lambda \le 1$, which implies that $\Phi_1(\lambda) \le \lambda^2$. Plugging into the bound in Theorem 5.2, this implies a bound on the regret of
$$\frac{b^{2/3}g^{1/3}K\ln K}{(K\ln(K/\delta))^{1/3}} + 4\,\big(b\,g^2\,K\ln(K/\delta)\big)^{1/3}.$$
The result now follows by upper bounding $K\ln K$ in the first term by $(K\ln(K/\delta))^{2/3}(gb^2)^{1/3}$, using our assumed lower bound on $g$. □

As $g = T$ is an upper bound that holds for any sequence, we get that the dependence of the regret of the algorithm on $T$ is $O(T^{2/3})$.

6 Guessing the maximal reward

In Section 4, we showed that algorithm Exp3 yields a regret of $O(\sqrt{gK\ln K})$ whenever an upper bound $g$ on the total expected reward $EG_{\max}$ of the best action is known in advance. In this section, we describe an algorithm Exp3.1 which does not require prior knowledge of a bound on $EG_{\max}$ and whose regret is at most $O(\sqrt{EG_{\max}K\ln K})$. Along the same lines, the bounds of Corollary 5.3 can be achieved without prior knowledge about $G_{\max}$.

Our algorithm Exp3.1, described in Figure 3, proceeds in epochs, where each epoch consists of a sequence of trials. We use $r = 0, 1, 2, \dots$ to index the epochs. On epoch $r$, the algorithm guesses a bound $g_r$ for the total reward of the best action. It then uses this guess to tune the parameters $\eta$ and $\gamma$ of Exp3, restarting Exp3 at the beginning of each epoch. As usual, we use $t$ to denote the current time step.³

Algorithm Exp3.1
Initialization: let $t = 0$, $c = \frac{K\ln K}{e-1}$, and $\hat G_i(0) = 0$ for $i = 1, \dots, K$.
Repeat for $r = 0, 1, 2, \dots$ until the game ends:
1. Let $S_r = t + 1$ and $g_r = c\,4^r$.
2. Restart Exp3 choosing $\eta$ and $\gamma$ as in Corollary 4.2 (with $g = g_r$), namely, $\gamma = \gamma_r = 2^{-r}$ and $\eta = \eta_r = \gamma_r/K$.
3. While $\max_i \hat G_i(t) \le g_r - K/\gamma_r$ do:
(a) $t := t + 1$
(b) Let $\hat p(t)$ and $i_t$ be the distribution and random action chosen by Exp3.
(c) Compute $\hat x(t)$ from $\hat p(t)$ and the observed reward $x_{i_t}(t)$ as in Figure 2.
(d) $\hat G_i(t) = \hat G_i(t-1) + \hat x_i(t)$ for $i = 1, \dots, K$.
4. Let $T_r = t$.

Figure 3: Algorithm Exp3.1 for the partial information game when a bound on $EG_{\max}$ is not known.

³ Note that, in general, this $t$ may differ from the local variable $t$ used by Exp3, which we now regard as a subroutine. Throughout this section, we will only use $t$ to refer to the total number of trials as in Figure 3.
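A Python sketch of Exp3.1 following Figure 3, built on the Exp3 sketch from Section 4. The epoch bookkeeping (restart test and quadrupling of the guess $g_r$) mirrors the figure; the class layout and attribute names are ours.

```python
import math

class Exp31:
    """Sketch of algorithm Exp3.1 (Figure 3): guesses g_r = c * 4^r and restarts
    the Exp3 sketch with the tuning of Corollary 4.2 at the start of each epoch."""

    def __init__(self, K):
        self.K = K
        self.c = K * math.log(K) / (math.e - 1)
        self.G_hat = [0.0] * K                       # global estimates hat-G_i(t)
        self.r = -1
        self._start_epoch()

    def _start_epoch(self):
        self.r += 1
        self.g_r = self.c * 4 ** self.r              # guessed bound on the best total reward
        self.gamma = 2.0 ** (-self.r)                # gamma_r = 2^{-r}, eta_r = gamma_r / K
        self.exp3 = Exp3(self.K, self.gamma, self.gamma / self.K)

    def choose(self):
        # step 3: stay in the epoch while max_i hat-G_i(t) <= g_r - K / gamma_r
        if max(self.G_hat) > self.g_r - self.K / self.gamma:
            self._start_epoch()
        return self.exp3.choose()

    def update(self, i, reward):
        self.exp3.update(i, reward)
        self.G_hat[i] += reward / self.exp3.p_hat[i]  # hat-x_i(t) for the chosen action
```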

Exp3.1 maintains an estimate $\hat G_i(t) = \sum_{t'=1}^t \hat x_i(t')$ of the total reward of each action $i$. Since $E[\hat x_i(t)] = E[x_i(t)]$, this estimate will be unbiased in the sense that
$$E[\hat G_i(t)] = E\Big[\sum_{t'=1}^t x_i(t')\Big]$$
for all $i$ and $t$. Using these estimates, the algorithm detects (approximately) when the actual gain of some action has advanced beyond $g_r$. When this happens, the algorithm goes on to the next epoch, restarting Exp3 with a larger bound on the maximal gain.

The performance of the algorithm is characterized by the following theorem, which is the main result of this section.

Theorem 6.1 The regret suffered by algorithm Exp3.1 is at most
$$R_{\text{Exp3.1}} \;\le\; 8\sqrt{e-1}\,\sqrt{EG_{\max}K\ln K} + 8(e-1)K + 2K\ln K \;\le\; 10.5\,\sqrt{EG_{\max}K\ln K} + 13.8\,K + 2K\ln K.$$

The proof of the theorem is divided into two lemmas. The first bounds the regret suffered on each epoch, and the second bounds the total number of epochs.

As usual, we use $T$ to denote the total number of time steps (i.e., the final value of $t$). We also define the following random variables: Let $R$ be the total number of epochs (i.e., the final value of $r$). As in the figure, $S_r$ and $T_r$ denote the first and last time steps completed on epoch $r$ (where, for convenience, we define $T_R = T$). Thus, epoch $r$ consists of trials $S_r, S_r + 1, \dots, T_r$. Note that, in degenerate cases, some epochs may be empty, in which case $S_r = T_r + 1$. Let $\hat G_{\max}(t) = \max_i \hat G_i(t)$ and let $\hat G_{\max} = \hat G_{\max}(T)$.

Lemma 6.2 For any action $j$ and for every epoch $r$, the gain of Exp3.1 during epoch $r$ is lower bounded by
$$\sum_{t=S_r}^{T_r} x_{i_t}(t) \;\ge\; \sum_{t=S_r}^{T_r} \hat x_j(t) - 2\sqrt{e-1}\,\sqrt{g_r K\ln K}.$$

Proof. If $S_r > T_r$ (so that no trials occur on epoch $r$), then the lemma holds trivially since both summations will be equal to zero. Assume then that $S_r \le T_r$, and let $g = g_r$, $\gamma = \gamma_r$ and $\eta = \eta_r$. We use Equation (5) from the proof of Theorem 4.1:
$$\sum_{t=S_r}^{T_r} x_{i_t}(t) \;\ge\; (1-\gamma)\sum_{t=S_r}^{T_r} \hat x_j(t) - \frac{(1-\gamma)\ln K}{\eta} - \frac{\Phi_{K/\gamma}(\eta)}{\eta}\sum_{t=S_r}^{T_r} \hat x_{i_t}(t) \;\ge\; \sum_{t=S_r}^{T_r} \hat x_j(t) - \gamma\sum_{t=S_r}^{T_r} \hat x_j(t) - \frac{(1-\gamma)\ln K}{\eta} - \frac{\Phi_{K/\gamma}(\eta)}{\eta}\sum_{t=S_r}^{T_r}\sum_{i=1}^K \hat x_i(t).$$
From the definition of the termination condition and since $S_r \le T_r$, we know that $\hat G_i(T_r - 1) \le g - K/\gamma$. Since $\hat x_i(t) \le K/\gamma$ (by Exp3's choice of $\hat p(t)$), this implies that $\hat G_i(T_r) \le g$ for all $i$. Thus,
$$\sum_{t=S_r}^{T_r} x_{i_t}(t) \;\ge\; \sum_{t=S_r}^{T_r} \hat x_j(t) - \left(\gamma + \frac{K\,\Phi_{K/\gamma}(\eta)}{\eta}\right)g - \frac{(1-\gamma)\ln K}{\eta}.$$

By our choices for $\gamma$ and $\eta$, we get the statement of the lemma. □

The next lemma gives an implicit upper bound on the number of epochs $R$.

Lemma 6.3 The number of epochs $R$ satisfies
$$2^{R-1} \;\le\; \frac{K}{c} + \sqrt{\frac{\hat G_{\max}}{c}}.$$

Proof. If $R = 0$, then the bound holds trivially. So assume $R \ge 1$. Let $z = 2^{R-1}$. Because epoch $R-1$ was completed, by the termination condition,
$$\hat G_{\max} \;\ge\; \hat G_{\max}(T_{R-1}) \;>\; g_{R-1} - \frac{K}{\gamma_{R-1}} = c\,4^{R-1} - K\,2^{R-1} = cz^2 - Kz. \quad (8)$$
Suppose the claim of the lemma is false. Then $z > K/c + \sqrt{\hat G_{\max}/c}$. Since the function $cx^2 - Kx$ is increasing for $x > K/(2c)$, this implies that
$$cz^2 - Kz \;>\; c\left(\frac{K}{c} + \sqrt{\frac{\hat G_{\max}}{c}}\right)^{2} - K\left(\frac{K}{c} + \sqrt{\frac{\hat G_{\max}}{c}}\right) = K\sqrt{\frac{\hat G_{\max}}{c}} + \hat G_{\max} \;\ge\; \hat G_{\max},$$
contradicting Equation (8). □

Proof of Theorem 6.1. Using the lemmas, we have that
$$G_{\text{Exp3.1}} = \sum_{t=1}^T x_{i_t}(t) = \sum_{r=0}^R \sum_{t=S_r}^{T_r} x_{i_t}(t) \;\ge\; \max_j \sum_{r=0}^R \left(\sum_{t=S_r}^{T_r} \hat x_j(t) - 2\sqrt{e-1}\,\sqrt{g_r K\ln K}\right) = \max_j \hat G_j(T) - 2K\ln K\sum_{r=0}^R 2^r$$
$$= \hat G_{\max} - 2K\ln K\,(2^{R+1} - 1) \;\ge\; \hat G_{\max} + 2K\ln K - 8K\ln K\left(\frac{K}{c} + \sqrt{\frac{\hat G_{\max}}{c}}\right) \;\ge\; \hat G_{\max} - 2K\ln K - 8(e-1)K - 8\sqrt{e-1}\,\sqrt{\hat G_{\max}K\ln K}. \quad (9)$$
Here, we used Lemma 6.2 for the first inequality and Lemma 6.3 for the second inequality. The other steps follow from definitions and simple algebra.

Let $f(x) = x - a\sqrt{x} - b$ for $x \ge 0$, where $a = 8\sqrt{e-1}\sqrt{K\ln K}$ and $b = 2K\ln K + 8(e-1)K$. Taking expectations of both sides of Equation (9) gives
$$E[G_{\text{Exp3.1}}] \;\ge\; E[f(\hat G_{\max})]. \quad (10)$$
Since the second derivative of $f$ is positive for $x > 0$, $f$ is convex, so that, by Jensen's inequality,
$$E[f(\hat G_{\max})] \;\ge\; f(E[\hat G_{\max}]). \quad (11)$$

Note that
$$E[\hat G_{\max}] = E\big[\max_j \hat G_j(T)\big] \;\ge\; \max_j E[\hat G_j(T)] = \max_j E\Big[\sum_{t=1}^T x_j(t)\Big] = EG_{\max}.$$
The function $f$ is increasing if and only if $x > a^2/4$. Therefore, if $EG_{\max} > a^2/4$ then $f(E[\hat G_{\max}]) \ge f(EG_{\max})$. Combined with Equations (10) and (11), this gives $E[G_{\text{Exp3.1}}] \ge f(EG_{\max})$, which is equivalent to the statement of the theorem. On the other hand, if $EG_{\max} \le a^2/4$ then, because $f$ is nonincreasing on $[0, a^2/4]$,
$$f(EG_{\max}) \le f(0) = -b \le 0 \le E[G_{\text{Exp3.1}}],$$
so the theorem follows trivially in this case as well. □

7 A lower bound

In this section, we state an information-theoretic lower bound on the regret of any player, i.e., a lower bound that holds even if the player has unbounded computational power. More precisely, we show that there exists an adversarial strategy for choosing the rewards such that the expected regret of any player algorithm is $\Omega(\sqrt{TK})$. Observe that this does not match the upper bound for our algorithms Exp3 and Exp3.1 (see Corollary 4.2 and Theorem 6.1); it is an open problem to close this gap.

The adversarial strategy we use in our proof is oblivious to the algorithm; it simply assigns the rewards at random according to some distribution, similar to a standard statistical model for the bandit problem. The choice of distribution depends on the number of actions $K$ and the number of iterations $T$. This dependence of the distribution on $T$ is the reason that our lower bound does not contradict the upper bounds of the form $O(\log T)$ which appear in the statistics literature [12]. There, the distribution over the rewards is fixed as $T \to \infty$.

For the full information game, matching upper and lower bounds of the form $\Theta(\sqrt{T\log K})$ were already known [3, 6]. Our lower bound shows that for the partial information game the dependence on the number of actions increases considerably. Specifically, our lower bound implies that no upper bound is possible of the form $O(T^{\alpha}(\log K)^{\beta})$ where $\alpha < 1$, $\beta > 0$.

Theorem 7.1 For any number of actions $K \ge 2$ and any number of iterations $T$, there exists a distribution over the rewards assigned to different actions such that the expected regret of any algorithm is at least
$$\frac{1}{20}\,\min\{\sqrt{KT},\, T\}.$$

The proof is given in Appendix B. The lower bound on the expected regret implies, of course, that for any algorithm there is a particular choice of rewards that will cause the regret to be larger than this expected value.

8 Combining the advice of many experts

Up to this point, we have considered a bandit problem in which the player's goal is to achieve a payoff close to that of the best single action. In a more general setting, the player may have a set of strategies for choosing the best action. These strategies might select different actions at different iterations. The strategies can be computations performed by the player or they can be external advice given to the player by experts. We will use the more general term "expert" (borrowed from Cesa-Bianchi et al. [3]) because we place no restrictions on the generation of the advice.

The player's goal in this case is to combine the advice of the experts in such a way that its total reward is close to that of the best expert (rather than the best single action).

For example, consider the packet-routing problem. In this case there might be several routing strategies, each based on different assumptions regarding network load distribution and using different data to estimate current load. Each of these strategies might suggest different routes at different times, and each might be better in different situations. In this case, we would like to have an algorithm for combining these strategies which, for each set of packets, performs almost as well as the strategy that was best for that set.

Formally, at each trial $t$, we assume that the player, prior to choosing an action, is provided with a set of $N$ probability vectors $\xi^j(t) \in [0,1]^K$, $j = 1, \dots, N$, $\sum_{i=1}^K \xi^j_i(t) = 1$. We interpret $\xi^j(t)$ as the advice of expert $j$ on trial $t$, where the $i$th component $\xi^j_i(t)$ represents the recommended probability of playing action $i$. (As a special case, the distribution can be concentrated on a single action, which represents a deterministic recommendation.) If the adversary chooses payoff vector $x(t)$, then the expected reward for expert $j$ (with respect to the chosen probability vector $\xi^j(t)$) is simply $\xi^j(t)\cdot x(t)$. In analogy with $EG_{\max}$, we define
$$E\tilde G_{\max} := \max_{1\le j\le N} E_{i_1,\dots,i_T}\Big[\sum_{t=1}^T \xi^j(t)\cdot x(t)\Big],$$
so that the regret $\tilde R_A := E\tilde G_{\max} - E[G_A]$ measures the expected difference between the player's total reward and the total reward of the best expert. Our results hold for any finite set of experts.

Formally, we regard each $\xi^j(t)$ as a random variable which is an arbitrary function of the random sequence of plays $i_1, \dots, i_{t-1}$ (just like the adversary's payoff vector $x(t)$). This definition allows for experts whose advice depends on the entire past history as observed by the player, as well as other side information which may be available.

We could at this point view each expert as a "meta-action" in a higher-level bandit problem with payoff vector defined at trial $t$ as $(\xi^1(t)\cdot x(t), \dots, \xi^N(t)\cdot x(t))$. We could then immediately apply Corollary 4.2 to obtain a bound of $O(\sqrt{gN\log N})$ on the player's regret relative to the best expert (where $g$ is an upper bound on $E\tilde G_{\max}$). However, this bound is quite weak if the player is combining many experts (i.e., if $N$ is very large). We show below that the algorithm Exp3 from Section 4 can be modified, yielding a regret term of the form $O(\sqrt{gK\log N})$. This bound is very reasonable when the number of actions is small, but the number of experts is quite large (even exponential).

Our algorithm Exp4 is shown in Figure 4, and is only a slightly modified version of Exp3. (Exp4 stands for "Exponential-weight algorithm for Exploration and Exploitation using Expert advice.") As before, we use Hedge as a subroutine, but we now apply Hedge to a problem of dimension $N$ rather than $K$. At trial $t$, we receive a probability vector $q(t)$ from Hedge which represents a distribution over strategies. We compute the vector $p(t)$ as a weighted average (with respect to $q(t)$) of the strategy vectors $\xi^j(t)$. The vector $\hat p(t)$ is then computed as before using $p(t)$, and an action $i_t$ is chosen randomly. We define the vector $\hat x(t) \in [0, K/\gamma]^K$ as before, and we finally feed the vector $\hat y(t) \in [0, K/\gamma]^N$ to Hedge, where $\hat y_j(t) := \xi^j(t)\cdot\hat x(t)$.
Let us also define $y(t) \in [0,1]^N$ to be the vector with components corresponding to the gains of the experts: $y_j(t) := \xi^j(t)\cdot x(t)$.

The simplest possible expert is one which always assigns uniform weight to all actions, so that $\xi_i(t) = 1/K$ on each round $t$. We call this the uniform expert. To prove our results, we need to assume that the uniform expert is included in the family of experts.⁴ Clearly, the uniform expert can always be added to any given family of experts at the very small expense of increasing $N$ by one.

⁴ In fact, we can use a slightly weaker sufficient condition, namely, that the uniform expert is included in the convex hull of the family of experts, i.e., that there exist nonnegative numbers $\alpha_1, \dots, \alpha_N$ with $\sum_{j=1}^N \alpha_j = 1$ such that, for all $t$ and all $i$, $\sum_{j=1}^N \alpha_j\,\xi^j_i(t) = 1/K$.
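Before stating the formal algorithm and its guarantee, here is a Python sketch of the Exp4 update just described (the pseudocode is given in Figure 4 below). It reuses the `Hedge` sketch of Section 3, now of dimension $N$; the method names are ours, and each advice vector is assumed to be a probability distribution over the $K$ actions.

```python
import random

class Exp4:
    """Sketch of algorithm Exp4: Hedge of dimension N over the experts, with the
    same mixing and simulated rewards as Exp3."""

    def __init__(self, N, K, gamma, eta):
        self.N, self.K, self.gamma = N, K, gamma
        self.hedge = Hedge(N, eta)               # Hedge sketch, K replaced by N

    def choose(self, advice):
        # advice[j] is expert j's distribution xi^j(t) over the K actions
        q = self.hedge.distribution()
        p = [sum(q[j] * advice[j][i] for j in range(self.N)) for i in range(self.K)]
        self.p_hat = [(1 - self.gamma) * pi + self.gamma / self.K for pi in p]
        self.advice = advice
        self.i = random.choices(range(self.K), weights=self.p_hat)[0]
        return self.i

    def update(self, i, reward):
        x_hat = [0.0] * self.K
        x_hat[i] = reward / self.p_hat[i]
        # hat-y_j(t) = xi^j(t) . hat-x(t) is fed back to Hedge
        y_hat = [sum(self.advice[j][k] * x_hat[k] for k in range(self.K))
                 for j in range(self.N)]
        self.hedge.update(y_hat)
```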

Algorithm Exp4
Parameters: reals $\eta > 0$ and $\gamma \in [0, 1]$.
Initialization: initialize Hedge (with $K$ replaced by $N$).
Repeat for $t = 1, 2, \dots$ until the game ends:
1. Get the distribution $q(t) \in [0,1]^N$ from Hedge.
2. Get advice vectors $\xi^j(t) \in [0,1]^K$, and let $p(t) := \sum_{j=1}^N q_j(t)\,\xi^j(t)$.
3. Select action $i_t$ to be $j$ with probability $\hat p_j(t) = (1-\gamma)\,p_j(t) + \gamma/K$.
4. Receive reward $x_{i_t}(t) \in [0, 1]$.
5. Compute the simulated reward vector $\hat x(t)$ as
$$\hat x_j(t) = \begin{cases} x_{i_t}(t)/\hat p_{i_t}(t) & \text{if } j = i_t \\ 0 & \text{otherwise.}\end{cases}$$
6. Feed the vector $\hat y(t) \in [0, K/\gamma]^N$ to Hedge, where $\hat y_j(t) := \xi^j(t)\cdot\hat x(t)$.

Figure 4: Algorithm Exp4 for using expert advice in the partial information game.

Theorem 8.1 For $\eta > 0$ and $\gamma \in (0, 1]$, and for any family of experts which includes the uniform expert, the expected gain of algorithm Exp4 is at least
$$E[G_{\text{Exp4}}] \;\ge\; E\tilde G_{\max} - \left(\gamma + \frac{K\,\Phi_{K/\gamma}(\eta)}{\eta}\right)E\tilde G_{\max} - \frac{(1-\gamma)\ln N}{\eta}.$$

Proof. We prove this theorem along the lines of the proof of Theorem 4.1. By Theorem 3.1, for all experts $j = 1, \dots, N$,
$$\sum_{t=1}^T q(t)\cdot\hat y(t) \;\ge\; \sum_{t=1}^T \hat y_j(t) - \frac{\ln N}{\eta} - \frac{\Phi_{K/\gamma}(\eta)}{\eta}\sum_{t=1}^T\sum_{j=1}^N q_j(t)\,\hat y_j(t)^2.$$
Now
$$q(t)\cdot\hat y(t) = \sum_{j=1}^N q_j(t)\,\xi^j(t)\cdot\hat x(t) = p(t)\cdot\hat x(t) \le \frac{x_{i_t}(t)}{1-\gamma}$$
by Equation (3). Also, similar to Equation (4),
$$\sum_{j=1}^N q_j(t)\,\hat y_j(t)^2 = \sum_{j=1}^N q_j(t)\big(\xi^j_{i_t}(t)\,\hat x_{i_t}(t)\big)^2 \le \hat x_{i_t}(t)^2\sum_{j=1}^N q_j(t)\,\xi^j_{i_t}(t) = p_{i_t}(t)\,\hat x_{i_t}(t)^2 \le \frac{\hat x_{i_t}(t)}{1-\gamma}.$$
Therefore, using Equation (6), for all experts $j$,
$$G_{\text{Exp4}} = \sum_{t=1}^T x_{i_t}(t) \;\ge\; (1-\gamma)\sum_{t=1}^T \hat y_j(t) - \frac{(1-\gamma)\ln N}{\eta} - \frac{\Phi_{K/\gamma}(\eta)}{\eta}\sum_{t=1}^T\sum_{i=1}^K \hat x_i(t).$$

As before, we take expectations of both sides of this inequality. Note that
$$E[\hat y_j(t)] = E\Big[\sum_{i=1}^K \hat p_i(t)\,\frac{\xi^j_i(t)\,x_i(t)}{\hat p_i(t)}\Big] = E\big[\xi^j(t)\cdot x(t)\big] = E[y_j(t)].$$
Further,
$$E\Big[\frac{1}{K}\sum_{t=1}^T\sum_{i=1}^K \hat x_i(t)\Big] = E\Big[\sum_{t=1}^T \frac{1}{K}\sum_{i=1}^K x_i(t)\Big] \;\le\; \max_j E\Big[\sum_{t=1}^T y_j(t)\Big] = E\tilde G_{\max},$$
since we have assumed that the uniform expert is included in the family of experts. Combining these facts immediately implies the statement of the theorem. □

Analogous versions of the other main results of this paper can be proved in which occurrences of $\ln K$ are replaced by $\ln N$. For Corollary 4.2, this is immediate using Theorem 8.1, yielding a bound on the regret of at most $2\sqrt{e-1}\,\sqrt{gK\ln N}$. For the analog of Lemma 5.1, we need to prove a bound on the difference between $\sum_t y_j(t)$ and $\sum_t \hat y_j(t)$ for each expert $j$, which can be done exactly as before, replacing $\delta/K$ with $\delta/N$ in the proof. The analogs of Theorems 5.2 and 6.1 can be proved as before, where we again need to assume that the uniform expert is included in the family of experts. The analog of Corollary 5.3 is straightforward.

9 Nearly optimal play of an unknown repeated game

The bandit problem considered up to this point is closely related to the problem of playing an unknown repeated game against an adversary of unbounded computational power. In this latter setting, a game is defined by an $n \times m$ matrix $M$. On each trial $t$, the player (also called the row player) chooses a row $i$ of the matrix. At the same time, the adversary (or column player) chooses a column $j$. The player then receives the payoff $M_{ij}$. In repeated play, the player's goal is to maximize its expected total payoff over a sequence of plays.

Suppose in some trial the player chooses its next move $i$ randomly according to a probability distribution on rows represented by a (column) vector $p \in [0,1]^n$, and the adversary similarly chooses according to a probability vector $q \in [0,1]^m$. Then the expected payoff is $p^{\mathsf T} M q$. Von Neumann's celebrated minimax theorem states that
$$\max_p \min_q p^{\mathsf T} M q = \min_q \max_p p^{\mathsf T} M q,$$
where maximum and minimum are taken over the (compact) set of all distribution vectors $p$ and $q$. The quantity $v$ defined by the above equation is called the value of the game with matrix $M$. In words, this says that there exists a mixed (randomized) strategy $p$ for the row player that guarantees expected payoff at least $v$, regardless of the column player's action. Moreover, this payoff is optimal in the sense that the column player can choose a mixed strategy whose expected payoff is at most $v$, regardless of the row player's action. Thus, if the player knows the matrix $M$, it can compute a strategy (for instance, using linear programming) that is certain to bring an expected optimal payoff not smaller than $v$ on each trial.

Suppose now that the game $M$ is entirely unknown to the player. To be precise, assume the player knows only the number of rows of the matrix and a bound on the magnitude of the entries of $M$. The main result of this section is a proof, based on the results in Section 4, showing that the player can play in such a manner that its payoff per trial will rapidly converge to the optimal maximin payoff $v$. This result holds even when the adversary knows the game $M$ and also knows the (randomized) strategy being used by the player.

The problem of learning to play a repeated game when the player gets to see the whole column of rewards associated with the choice of the adversary corresponds to our full-information game.

This problem was studied by Hannan [10], Blackwell [2] and more recently by Foster and Vohra [5], Fudenberg and Levine [8], and Freund and Schapire [7]. The problem of learning to play when the player gets to see only the single element of the matrix associated with his choice and the choice of the adversary corresponds to the partial information game, which is our emphasis here. This problem was previously considered by Baños [1] and Megiddo [14]. However, these previously proposed strategies are extremely inefficient. Not only is our strategy simpler and much more efficient, but we also are able to prove much faster rates of convergence.

In fact, the application of our earlier algorithms to this problem is entirely straightforward. The player's actions are now identified with the rows of the matrix and are chosen randomly on each trial according to algorithm Exp3, where we tune $\eta$ and $\gamma$ as in Corollary 4.2 with $g = T$, where $T$ is the total number of epochs of play.⁵ The payoff vector $x(t)$ is simply $M_{\cdot j_t}$, the $j_t$-th column of $M$ chosen by the adversary on trial $t$.

Theorem 9.1 Let $M$ be an unknown game matrix in $[a, b]^{n\times m}$ with value $v$. Suppose the player, knowing only $a$, $b$ and $n$, uses the algorithm sketched above against any adversary for $T$ trials. Then the player's expected payoff per trial is at least
$$v - 2(b-a)\sqrt{\frac{(e-1)\,n\ln n}{T}}.$$

Proof. We assume that $[a, b] = [0, 1]$; the extension to the general case is straightforward. By Corollary 4.2, we have
$$E\Big[\sum_{t=1}^T M_{i_t j_t}\Big] = E\Big[\sum_{t=1}^T x_{i_t}(t)\Big] \;\ge\; \max_i E\Big[\sum_{t=1}^T x_i(t)\Big] - 2\sqrt{(e-1)\,T\,n\ln n}.$$
Let $\tilde p$ be a maxmin strategy for the row player such that
$$v = \max_p \min_q p^{\mathsf T} M q = \min_q \tilde p^{\mathsf T} M q,$$
and let $q(t)$ be a distribution vector whose $j_t$-th component is 1. Then
$$\max_i E\Big[\sum_{t=1}^T x_i(t)\Big] \;\ge\; \sum_{i=1}^n \tilde p_i\, E\Big[\sum_{t=1}^T x_i(t)\Big] = E\Big[\sum_{t=1}^T \tilde p\cdot x(t)\Big] = E\Big[\sum_{t=1}^T \tilde p^{\mathsf T} M q(t)\Big] \;\ge\; vT,$$
since $\tilde p^{\mathsf T} M q \ge v$ for all $q$. Thus, the player's expected payoff is at least
$$vT - 2\sqrt{(e-1)\,T\,n\ln n}.$$
Dividing by $T$ to get the average per-trial payoff gives the result. □

Note that the theorem is independent of the number of columns of $M$ and, with appropriate assumptions, the theorem can be easily generalized to adversaries with an infinite number of strategies. If the matrix $M$ is very large and all entries are small, then, even if $M$ is known to the player, our algorithm may be an efficient alternative to linear programming. The generality of the theorem also allows us to handle games in which the outcome for given plays $i$ and $j$ is a random variable (rather than a constant $M_{ij}$). Finally, as pointed out by Megiddo [14], such a result is valid for non-cooperative, multi-person games; the average per-trial payoff of any player using this strategy will converge rapidly to the maximin payoff of the one-shot game.

⁵ If $T$ is not known in advance, the methods developed in Section 6 can be applied.
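To illustrate the strategy analyzed in Theorem 9.1, the following sketch lets the row player run the Exp3 sketch of Section 4 over the $n$ rows, tuned as in Corollary 4.2 with $g = T$. The `payoff` and `adversary` callables are hypothetical stand-ins for the unknown matrix and the column player, and payoffs are assumed to be already translated and rescaled to $[0, 1]$.

```python
def play_unknown_game(payoff, adversary, n, T):
    """Row player's strategy of Section 9: run the Exp3 sketch over the n rows,
    tuned as in Corollary 4.2 with g = T.  payoff(i, j) returns the entry M_ij
    (assumed in [0, 1]); adversary(history) returns the column j_t and may use
    full knowledge of the game and of the player's strategy."""
    gamma, eta = exp3_tuning(n, T)           # tuning helper from the Exp3 sketch
    player = Exp3(n, gamma, eta)
    history, total = [], 0.0
    for _ in range(T):
        i = player.choose()
        j = adversary(list(history))
        r = payoff(i, j)                     # only M_{i_t j_t} is observed
        player.update(i, r)
        history.append((i, j))
        total += r
    return total / T                         # average per-trial payoff, near the value v
```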


More information

EASINESS IN BANDITS. Gergely Neu. Pompeu Fabra University

EASINESS IN BANDITS. Gergely Neu. Pompeu Fabra University EASINESS IN BANDITS Gergely Neu Pompeu Fabra University EASINESS IN BANDITS Gergely Neu Pompeu Fabra University THE BANDIT PROBLEM Play for T rounds attempting to maximize rewards THE BANDIT PROBLEM Play

More information

Multi-armed bandit models: a tutorial

Multi-armed bandit models: a tutorial Multi-armed bandit models: a tutorial CERMICS seminar, March 30th, 2016 Multi-Armed Bandit model: general setting K arms: for a {1,..., K}, (X a,t ) t N is a stochastic process. (unknown distributions)

More information

An Online Algorithm for Maximizing Submodular Functions

An Online Algorithm for Maximizing Submodular Functions An Online Algorithm for Maximizing Submodular Functions Matthew Streeter December 20, 2007 CMU-CS-07-171 Daniel Golovin School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 This research

More information

The Multi-Arm Bandit Framework

The Multi-Arm Bandit Framework The Multi-Arm Bandit Framework A. LAZARIC (SequeL Team @INRIA-Lille) ENS Cachan - Master 2 MVA SequeL INRIA Lille MVA-RL Course In This Lecture A. LAZARIC Reinforcement Learning Algorithms Oct 29th, 2013-2/94

More information

Convergence and No-Regret in Multiagent Learning

Convergence and No-Regret in Multiagent Learning Convergence and No-Regret in Multiagent Learning Michael Bowling Department of Computing Science University of Alberta Edmonton, Alberta Canada T6G 2E8 bowling@cs.ualberta.ca Abstract Learning in a multiagent

More information

Bandit models: a tutorial

Bandit models: a tutorial Gdt COS, December 3rd, 2015 Multi-Armed Bandit model: general setting K arms: for a {1,..., K}, (X a,t ) t N is a stochastic process. (unknown distributions) Bandit game: a each round t, an agent chooses

More information

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I. Sébastien Bubeck Theory Group

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I. Sébastien Bubeck Theory Group Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I Sébastien Bubeck Theory Group i.i.d. multi-armed bandit, Robbins [1952] i.i.d. multi-armed bandit, Robbins [1952] Known

More information

Administration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon.

Administration. CSCI567 Machine Learning (Fall 2018) Outline. Outline. HW5 is available, due on 11/18. Practice final will also be available soon. Administration CSCI567 Machine Learning Fall 2018 Prof. Haipeng Luo U of Southern California Nov 7, 2018 HW5 is available, due on 11/18. Practice final will also be available soon. Remaining weeks: 11/14,

More information

Experts in a Markov Decision Process

Experts in a Markov Decision Process University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 2004 Experts in a Markov Decision Process Eyal Even-Dar Sham Kakade University of Pennsylvania Yishay Mansour Follow

More information

Blackwell s Approachability Theorem: A Generalization in a Special Case. Amy Greenwald, Amir Jafari and Casey Marks

Blackwell s Approachability Theorem: A Generalization in a Special Case. Amy Greenwald, Amir Jafari and Casey Marks Blackwell s Approachability Theorem: A Generalization in a Special Case Amy Greenwald, Amir Jafari and Casey Marks Department of Computer Science Brown University Providence, Rhode Island 02912 CS-06-01

More information

Exponential Weights on the Hypercube in Polynomial Time

Exponential Weights on the Hypercube in Polynomial Time European Workshop on Reinforcement Learning 14 (2018) October 2018, Lille, France. Exponential Weights on the Hypercube in Polynomial Time College of Information and Computer Sciences University of Massachusetts

More information

Online Learning with Randomized Feedback Graphs for Optimal PUE Attacks in Cognitive Radio Networks

Online Learning with Randomized Feedback Graphs for Optimal PUE Attacks in Cognitive Radio Networks 1 Online Learning with Randomized Feedback Graphs for Optimal PUE Attacks in Cognitive Radio Networks Monireh Dabaghchian, Amir Alipour-Fanid, Kai Zeng, Qingsi Wang, Peter Auer arxiv:1709.10128v3 [cs.ni]

More information

Learning Algorithms for Minimizing Queue Length Regret

Learning Algorithms for Minimizing Queue Length Regret Learning Algorithms for Minimizing Queue Length Regret Thomas Stahlbuhk Massachusetts Institute of Technology Cambridge, MA Brooke Shrader MIT Lincoln Laboratory Lexington, MA Eytan Modiano Massachusetts

More information

Lecture 14: Approachability and regret minimization Ramesh Johari May 23, 2007

Lecture 14: Approachability and regret minimization Ramesh Johari May 23, 2007 MS&E 336 Lecture 4: Approachability and regret minimization Ramesh Johari May 23, 2007 In this lecture we use Blackwell s approachability theorem to formulate both external and internal regret minimizing

More information

0.1 Motivating example: weighted majority algorithm

0.1 Motivating example: weighted majority algorithm princeton univ. F 16 cos 521: Advanced Algorithm Design Lecture 8: Decision-making under total uncertainty: the multiplicative weight algorithm Lecturer: Sanjeev Arora Scribe: Sanjeev Arora (Today s notes

More information

Explore no more: Improved high-probability regret bounds for non-stochastic bandits

Explore no more: Improved high-probability regret bounds for non-stochastic bandits Explore no more: Improved high-probability regret bounds for non-stochastic bandits Gergely Neu SequeL team INRIA Lille Nord Europe gergely.neu@gmail.com Abstract This work addresses the problem of regret

More information

Game Theory, On-line Prediction and Boosting

Game Theory, On-line Prediction and Boosting roceedings of the Ninth Annual Conference on Computational Learning heory, 996. Game heory, On-line rediction and Boosting Yoav Freund Robert E. Schapire A& Laboratories 600 Mountain Avenue Murray Hill,

More information

Bandit Algorithms. Zhifeng Wang ... Department of Statistics Florida State University

Bandit Algorithms. Zhifeng Wang ... Department of Statistics Florida State University Bandit Algorithms Zhifeng Wang Department of Statistics Florida State University Outline Multi-Armed Bandits (MAB) Exploration-First Epsilon-Greedy Softmax UCB Thompson Sampling Adversarial Bandits Exp3

More information

Minimax strategy for prediction with expert advice under stochastic assumptions

Minimax strategy for prediction with expert advice under stochastic assumptions Minimax strategy for prediction ith expert advice under stochastic assumptions Wojciech Kotłosi Poznań University of Technology, Poland otlosi@cs.put.poznan.pl Abstract We consider the setting of prediction

More information

No-Regret Algorithms for Unconstrained Online Convex Optimization

No-Regret Algorithms for Unconstrained Online Convex Optimization No-Regret Algorithms for Unconstrained Online Convex Optimization Matthew Streeter Duolingo, Inc. Pittsburgh, PA 153 matt@duolingo.com H. Brendan McMahan Google, Inc. Seattle, WA 98103 mcmahan@google.com

More information

Online Bounds for Bayesian Algorithms

Online Bounds for Bayesian Algorithms Online Bounds for Bayesian Algorithms Sham M. Kakade Computer and Information Science Department University of Pennsylvania Andrew Y. Ng Computer Science Department Stanford University Abstract We present

More information

On the Worst-case Analysis of Temporal-difference Learning Algorithms

On the Worst-case Analysis of Temporal-difference Learning Algorithms Machine Learning, 22(1/2/3:95-121, 1996. On the Worst-case Analysis of Temporal-difference Learning Algorithms schapire@research.att.com ATT Bell Laboratories, 600 Mountain Avenue, Room 2A-424, Murray

More information

CS261: A Second Course in Algorithms Lecture #12: Applications of Multiplicative Weights to Games and Linear Programs

CS261: A Second Course in Algorithms Lecture #12: Applications of Multiplicative Weights to Games and Linear Programs CS26: A Second Course in Algorithms Lecture #2: Applications of Multiplicative Weights to Games and Linear Programs Tim Roughgarden February, 206 Extensions of the Multiplicative Weights Guarantee Last

More information

New bounds on the price of bandit feedback for mistake-bounded online multiclass learning

New bounds on the price of bandit feedback for mistake-bounded online multiclass learning Journal of Machine Learning Research 1 8, 2017 Algorithmic Learning Theory 2017 New bounds on the price of bandit feedback for mistake-bounded online multiclass learning Philip M. Long Google, 1600 Amphitheatre

More information

Lecture notes for Analysis of Algorithms : Markov decision processes

Lecture notes for Analysis of Algorithms : Markov decision processes Lecture notes for Analysis of Algorithms : Markov decision processes Lecturer: Thomas Dueholm Hansen June 6, 013 Abstract We give an introduction to infinite-horizon Markov decision processes (MDPs) with

More information

Selecting Efficient Correlated Equilibria Through Distributed Learning. Jason R. Marden

Selecting Efficient Correlated Equilibria Through Distributed Learning. Jason R. Marden 1 Selecting Efficient Correlated Equilibria Through Distributed Learning Jason R. Marden Abstract A learning rule is completely uncoupled if each player s behavior is conditioned only on his own realized

More information

Online Learning with Feedback Graphs

Online Learning with Feedback Graphs Online Learning with Feedback Graphs Claudio Gentile INRIA and Google NY clagentile@gmailcom NYC March 6th, 2018 1 Content of this lecture Regret analysis of sequential prediction problems lying between

More information

Online prediction with expert advise

Online prediction with expert advise Online prediction with expert advise Jyrki Kivinen Australian National University http://axiom.anu.edu.au/~kivinen Contents 1. Online prediction: introductory example, basic setting 2. Classification with

More information

Finite-time Analysis of the Multiarmed Bandit Problem*

Finite-time Analysis of the Multiarmed Bandit Problem* Machine Learning, 47, 35 56, 00 c 00 Kluwer Academic Publishers. Manufactured in The Netherlands. Finite-time Analysis of the Multiarmed Bandit Problem* PETER AUER University of Technology Graz, A-8010

More information

CS 6820 Fall 2014 Lectures, October 3-20, 2014

CS 6820 Fall 2014 Lectures, October 3-20, 2014 Analysis of Algorithms Linear Programming Notes CS 6820 Fall 2014 Lectures, October 3-20, 2014 1 Linear programming The linear programming (LP) problem is the following optimization problem. We are given

More information

CS261: Problem Set #3

CS261: Problem Set #3 CS261: Problem Set #3 Due by 11:59 PM on Tuesday, February 23, 2016 Instructions: (1) Form a group of 1-3 students. You should turn in only one write-up for your entire group. (2) Submission instructions:

More information

Online Learning with Feedback Graphs

Online Learning with Feedback Graphs Online Learning with Feedback Graphs Nicolò Cesa-Bianchi Università degli Studi di Milano Joint work with: Noga Alon (Tel-Aviv University) Ofer Dekel (Microsoft Research) Tomer Koren (Technion and Microsoft

More information

Full-information Online Learning

Full-information Online Learning Introduction Expert Advice OCO LM A DA NANJING UNIVERSITY Full-information Lijun Zhang Nanjing University, China June 2, 2017 Outline Introduction Expert Advice OCO 1 Introduction Definitions Regret 2

More information

Lecture 8: Decision-making under total uncertainty: the multiplicative weight algorithm. Lecturer: Sanjeev Arora

Lecture 8: Decision-making under total uncertainty: the multiplicative weight algorithm. Lecturer: Sanjeev Arora princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 8: Decision-making under total uncertainty: the multiplicative weight algorithm Lecturer: Sanjeev Arora Scribe: (Today s notes below are

More information

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning Tutorial: PART 2 Online Convex Optimization, A Game- Theoretic Approach to Learning Elad Hazan Princeton University Satyen Kale Yahoo Research Exploiting curvature: logarithmic regret Logarithmic regret

More information

Chapter III. Stability of Linear Systems

Chapter III. Stability of Linear Systems 1 Chapter III Stability of Linear Systems 1. Stability and state transition matrix 2. Time-varying (non-autonomous) systems 3. Time-invariant systems 1 STABILITY AND STATE TRANSITION MATRIX 2 In this chapter,

More information

Deterministic Calibration and Nash Equilibrium

Deterministic Calibration and Nash Equilibrium University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 2-2008 Deterministic Calibration and Nash Equilibrium Sham M. Kakade Dean P. Foster University of Pennsylvania Follow

More information

Mistake Bounds on Noise-Free Multi-Armed Bandit Game

Mistake Bounds on Noise-Free Multi-Armed Bandit Game Mistake Bounds on Noise-Free Multi-Armed Bandit Game Atsuyoshi Nakamura a,, David P. Helmbold b, Manfred K. Warmuth b a Hokkaido University, Kita 14, Nishi 9, Kita-ku, Sapporo 060-0814, Japan b University

More information

Online Learning, Mistake Bounds, Perceptron Algorithm

Online Learning, Mistake Bounds, Perceptron Algorithm Online Learning, Mistake Bounds, Perceptron Algorithm 1 Online Learning So far the focus of the course has been on batch learning, where algorithms are presented with a sample of training data, from which

More information

Homework 4 Solutions

Homework 4 Solutions CS 174: Combinatorics and Discrete Probability Fall 01 Homework 4 Solutions Problem 1. (Exercise 3.4 from MU 5 points) Recall the randomized algorithm discussed in class for finding the median of a set

More information

On Minimaxity of Follow the Leader Strategy in the Stochastic Setting

On Minimaxity of Follow the Leader Strategy in the Stochastic Setting On Minimaxity of Follow the Leader Strategy in the Stochastic Setting Wojciech Kot lowsi Poznań University of Technology, Poland wotlowsi@cs.put.poznan.pl Abstract. We consider the setting of prediction

More information

Lectures 6, 7 and part of 8

Lectures 6, 7 and part of 8 Lectures 6, 7 and part of 8 Uriel Feige April 26, May 3, May 10, 2015 1 Linear programming duality 1.1 The diet problem revisited Recall the diet problem from Lecture 1. There are n foods, m nutrients,

More information

NOWADAYS the demand to wireless bandwidth is growing

NOWADAYS the demand to wireless bandwidth is growing 1 Online Learning with Randomized Feedback Graphs for Optimal PUE Attacks in Cognitive Radio Networks Monireh Dabaghchian, Student Member, IEEE, Amir Alipour-Fanid, Student Member, IEEE, Kai Zeng, Member,

More information

Prediction and Playing Games

Prediction and Playing Games Prediction and Playing Games Vineel Pratap vineel@eng.ucsd.edu February 20, 204 Chapter 7 : Prediction, Learning and Games - Cesa Binachi & Lugosi K-Person Normal Form Games Each player k (k =,..., K)

More information

Multi-armed Bandits in the Presence of Side Observations in Social Networks

Multi-armed Bandits in the Presence of Side Observations in Social Networks 52nd IEEE Conference on Decision and Control December 0-3, 203. Florence, Italy Multi-armed Bandits in the Presence of Side Observations in Social Networks Swapna Buccapatnam, Atilla Eryilmaz, and Ness

More information

Regret in the On-Line Decision Problem

Regret in the On-Line Decision Problem Games and Economic Behavior 29, 7 35 (1999) Article ID game.1999.0740, available online at http://www.idealibrary.com on Regret in the On-Line Decision Problem Dean P. Foster Department of Statistics,

More information

Testing Problems with Sub-Learning Sample Complexity

Testing Problems with Sub-Learning Sample Complexity Testing Problems with Sub-Learning Sample Complexity Michael Kearns AT&T Labs Research 180 Park Avenue Florham Park, NJ, 07932 mkearns@researchattcom Dana Ron Laboratory for Computer Science, MIT 545 Technology

More information

A Drifting-Games Analysis for Online Learning and Applications to Boosting

A Drifting-Games Analysis for Online Learning and Applications to Boosting A Drifting-Games Analysis for Online Learning and Applications to Boosting Haipeng Luo Department of Computer Science Princeton University Princeton, NJ 08540 haipengl@cs.princeton.edu Robert E. Schapire

More information

Introduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Intro to Learning Theory Date: 12/8/16

Introduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Intro to Learning Theory Date: 12/8/16 600.463 Introduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Intro to Learning Theory Date: 12/8/16 25.1 Introduction Today we re going to talk about machine learning, but from an

More information

THE first formalization of the multi-armed bandit problem

THE first formalization of the multi-armed bandit problem EDIC RESEARCH PROPOSAL 1 Multi-armed Bandits in a Network Farnood Salehi I&C, EPFL Abstract The multi-armed bandit problem is a sequential decision problem in which we have several options (arms). We can

More information

Adaptive Sampling Under Low Noise Conditions 1

Adaptive Sampling Under Low Noise Conditions 1 Manuscrit auteur, publié dans "41èmes Journées de Statistique, SFdS, Bordeaux (2009)" Adaptive Sampling Under Low Noise Conditions 1 Nicolò Cesa-Bianchi Dipartimento di Scienze dell Informazione Università

More information

COS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture #24 Scribe: Jad Bechara May 2, 2018

COS 511: Theoretical Machine Learning. Lecturer: Rob Schapire Lecture #24 Scribe: Jad Bechara May 2, 2018 COS 5: heoretical Machine Learning Lecturer: Rob Schapire Lecture #24 Scribe: Jad Bechara May 2, 208 Review of Game heory he games we will discuss are two-player games that can be modeled by a game matrix

More information

Efficient Algorithms for Universal Portfolios

Efficient Algorithms for Universal Portfolios Efficient Algorithms for Universal Portfolios Adam Kalai CMU Department of Computer Science akalai@cs.cmu.edu Santosh Vempala y MIT Department of Mathematics and Laboratory for Computer Science vempala@math.mit.edu

More information

The Distribution of Optimal Strategies in Symmetric Zero-sum Games

The Distribution of Optimal Strategies in Symmetric Zero-sum Games The Distribution of Optimal Strategies in Symmetric Zero-sum Games Florian Brandl Technische Universität München brandlfl@in.tum.de Given a skew-symmetric matrix, the corresponding two-player symmetric

More information

Internal Regret in On-line Portfolio Selection

Internal Regret in On-line Portfolio Selection Internal Regret in On-line Portfolio Selection Gilles Stoltz (gilles.stoltz@ens.fr) Département de Mathématiques et Applications, Ecole Normale Supérieure, 75005 Paris, France Gábor Lugosi (lugosi@upf.es)

More information

Lecture 6: Non-stochastic best arm identification

Lecture 6: Non-stochastic best arm identification CSE599i: Online and Adaptive Machine Learning Winter 08 Lecturer: Kevin Jamieson Lecture 6: Non-stochastic best arm identification Scribes: Anran Wang, eibin Li, rian Chan, Shiqing Yu, Zhijin Zhou Disclaimer:

More information

The best expert versus the smartest algorithm

The best expert versus the smartest algorithm Theoretical Computer Science 34 004 361 380 www.elsevier.com/locate/tcs The best expert versus the smartest algorithm Peter Chen a, Guoli Ding b; a Department of Computer Science, Louisiana State University,

More information

Appendix B for The Evolution of Strategic Sophistication (Intended for Online Publication)

Appendix B for The Evolution of Strategic Sophistication (Intended for Online Publication) Appendix B for The Evolution of Strategic Sophistication (Intended for Online Publication) Nikolaus Robalino and Arthur Robson Appendix B: Proof of Theorem 2 This appendix contains the proof of Theorem

More information

Stochastic Contextual Bandits with Known. Reward Functions

Stochastic Contextual Bandits with Known. Reward Functions Stochastic Contextual Bandits with nown 1 Reward Functions Pranav Sakulkar and Bhaskar rishnamachari Ming Hsieh Department of Electrical Engineering Viterbi School of Engineering University of Southern

More information

1 Cryptographic hash functions

1 Cryptographic hash functions CSCI 5440: Cryptography Lecture 6 The Chinese University of Hong Kong 24 October 2012 1 Cryptographic hash functions Last time we saw a construction of message authentication codes (MACs) for fixed-length

More information

The Multi-Armed Bandit Problem

The Multi-Armed Bandit Problem The Multi-Armed Bandit Problem Electrical and Computer Engineering December 7, 2013 Outline 1 2 Mathematical 3 Algorithm Upper Confidence Bound Algorithm A/B Testing Exploration vs. Exploitation Scientist

More information

Lecture 8. Instructor: Haipeng Luo

Lecture 8. Instructor: Haipeng Luo Lecture 8 Instructor: Haipeng Luo Boosting and AdaBoost In this lecture we discuss the connection between boosting and online learning. Boosting is not only one of the most fundamental theories in machine

More information

Learning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley

Learning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley Learning Methods for Online Prediction Problems Peter Bartlett Statistics and EECS UC Berkeley Course Synopsis A finite comparison class: A = {1,..., m}. Converting online to batch. Online convex optimization.

More information

Explore no more: Improved high-probability regret bounds for non-stochastic bandits

Explore no more: Improved high-probability regret bounds for non-stochastic bandits Explore no more: Improved high-probability regret bounds for non-stochastic bandits Gergely Neu SequeL team INRIA Lille Nord Europe gergely.neu@gmail.com Abstract This work addresses the problem of regret

More information

CS264: Beyond Worst-Case Analysis Lecture #11: LP Decoding

CS264: Beyond Worst-Case Analysis Lecture #11: LP Decoding CS264: Beyond Worst-Case Analysis Lecture #11: LP Decoding Tim Roughgarden October 29, 2014 1 Preamble This lecture covers our final subtopic within the exact and approximate recovery part of the course.

More information

Learning to Coordinate Efficiently: A Model-based Approach

Learning to Coordinate Efficiently: A Model-based Approach Journal of Artificial Intelligence Research 19 (2003) 11-23 Submitted 10/02; published 7/03 Learning to Coordinate Efficiently: A Model-based Approach Ronen I. Brafman Computer Science Department Ben-Gurion

More information

An Algorithms-based Intro to Machine Learning

An Algorithms-based Intro to Machine Learning CMU 15451 lecture 12/08/11 An Algorithmsbased Intro to Machine Learning Plan for today Machine Learning intro: models and basic issues An interesting algorithm for combining expert advice Avrim Blum [Based

More information

1 Cryptographic hash functions

1 Cryptographic hash functions CSCI 5440: Cryptography Lecture 6 The Chinese University of Hong Kong 23 February 2011 1 Cryptographic hash functions Last time we saw a construction of message authentication codes (MACs) for fixed-length

More information

Efficient learning by implicit exploration in bandit problems with side observations

Efficient learning by implicit exploration in bandit problems with side observations Efficient learning by implicit exploration in bandit problems with side observations omáš Kocák Gergely Neu Michal Valko Rémi Munos SequeL team, INRIA Lille Nord Europe, France {tomas.kocak,gergely.neu,michal.valko,remi.munos}@inria.fr

More information

CS364A: Algorithmic Game Theory Lecture #13: Potential Games; A Hierarchy of Equilibria

CS364A: Algorithmic Game Theory Lecture #13: Potential Games; A Hierarchy of Equilibria CS364A: Algorithmic Game Theory Lecture #13: Potential Games; A Hierarchy of Equilibria Tim Roughgarden November 4, 2013 Last lecture we proved that every pure Nash equilibrium of an atomic selfish routing

More information

Bandits for Online Optimization

Bandits for Online Optimization Bandits for Online Optimization Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Bandits for Online Optimization 1 / 16 The multiarmed bandit problem... K slot machines Each

More information