Online learning with feedback graphs and switching costs


A Proof of Theorem 1

Proof. Without loss of generality, let the independent sequence set $I(G_{1:T})$ be formed of the actions (or arms) $1$ to $\beta(G_{1:T})$. Given the sequence of feedback graphs $G_{1:T}$, let $T_i$ be the number of times the action $i \in I(G_{1:T})$ is selected by the player in $T$ rounds, and let $\tilde{T}$ be the total number of times actions are selected from the set $K \setminus I(G_{1:T})$. Let $E_i$ denote expectation conditioned on the event $\{X = i\}$, where $X$ is the index of the best arm, and let $P_i$ denote the corresponding conditional probability. Additionally, we define $P_0$ as the probability conditioned on the event $\mathcal{E}_0$ that no arm receives the advantage $\epsilon_1$. Therefore, under $P_0$ all the actions in the independent sequence set, i.e. $i \in I(G_{1:T})$, incur an expected loss of $1/2$, whereas the expected loss of the actions $i \in K \setminus I(G_{1:T})$ is $1/2 + \epsilon_2$. Let $E_0$ be the corresponding conditional expectation. For all $i \in K$ and $t \le T$, $l_t(i)$ and $l^c_t(i)$ denote the unclipped and clipped loss of the action $i$, respectively. Assuming the unclipped losses are observed by the player, $\mathcal{F}$ is the sigma field generated by the unclipped losses, and $S_t(i_t)$ is the set of actions whose losses are observed at time $t$ following the selection of $i_t$, according to the feedback graph $G_t$. The observed sequence of unclipped losses is referred to as $l^o_{1:T}$. Additionally, $\mathcal{F}^c$ is the sigma field generated by the clipped losses $l^c_t(i)$, for all $t \le T$ and $i \in S_t(i_t)$, and the observed sequence of clipped losses is referred to as $l^{c,o}_{1:T}$. By definition, $\mathcal{F}^c \subseteq \mathcal{F}$.

Let $i_1, \ldots, i_T$ be the sequence of actions selected by the player over the time horizon $T$. Then the regret $R^c$ of the player corresponding to the clipped losses is

$$R^c = \sum_{t=1}^{T} l^c_t(i_t) + c\,M_s - \min_{i \in K} \sum_{t=1}^{T} l^c_t(i) \qquad (1)$$

where $M_s$ is the number of switches in the action selection sequence $i_1, \ldots, i_T$ and $c$ is the cost of each switch in action. We also define the regret $R$ corresponding to the unclipped loss function in Algorithm 1 as

$$R = \sum_{t=1}^{T} l_t(i_t) + c\,M_s - \min_{i \in K} \sum_{t=1}^{T} l_t(i). \qquad (2)$$

Using (Dekel et al., 2014, Lemma 4) we have

$$P\big(\text{for all } t \le T:\ 1/2 + W_t \in [1/6,\, 5/6]\big) \ge 5/6. \qquad (3)$$

Moreover, for all sufficiently large $T$ (in particular $T > 6$) we have $\epsilon_1 + \epsilon_2 < 1/6$. If the event $B = \{\text{for all } t \le T:\ 1/2 + W_t \in [1/6, 5/6]\}$ occurs and $\epsilon_1 + \epsilon_2 < 1/6$, then for all $i \in K$, $l^c_t(i) = l_t(i)$, which implies $R^c = R$ (see (1) and (2)). If the event $B$ does not occur, then the clipped and unclipped losses can differ, and the regrets satisfy $R - R^c \le (\epsilon_1 + \epsilon_2)\,T$. Therefore, for all sufficiently large $T$,

$$E R - E R^c \le \big(1 - P(B)\big)\, E\big[R - R^c \mid B \text{ does not occur}\big] \le \frac{(\epsilon_1 + \epsilon_2)\, T}{6}. \qquad (4)$$
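For concreteness, the switching-cost regrets in (1) and (2) are straightforward to compute from a play sequence. The following is a small illustrative sketch (the function and variable names are ours, not the paper's):

```python
# Illustrative sketch only: computing R = sum_t l_t(i_t) + c*M_s - min_i sum_t l_t(i)
# as in (2), for a given loss matrix and play sequence.
import numpy as np

def switching_cost_regret(losses, plays, switch_cost):
    """losses: T x K array of losses l_t(i); plays: length-T list of selected arms."""
    T, K = losses.shape
    incurred = sum(losses[t, plays[t]] for t in range(T))
    # M_s: number of switches in the action sequence i_1, ..., i_T
    num_switches = sum(1 for t in range(1, T) if plays[t] != plays[t - 1])
    best_fixed = losses.sum(axis=0).min()   # min_i sum_t l_t(i)
    return incurred + switch_cost * num_switches - best_fixed

# Example: 2 arms, 5 rounds, unit switching cost
rng = np.random.default_rng(0)
losses = rng.random((5, 2))
print(switching_cost_regret(losses, [0, 1, 1, 0, 0], switch_cost=1.0))
```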

Thus, (4) lower bounds the actual regret $R^c$ in terms of the regret $R$. We now derive a lower bound on the regret $R$ corresponding to the unclipped losses. Using the definition of $R$ we have

$$E R \ge \max_{i \in I(G_{1:T})} E_i\Big[\sum_{t=1}^{T} l_t(i_t) + c\,M_s - \min_{j}\sum_{t=1}^{T} l_t(j)\Big] \ge \frac{1}{\beta(G_{1:T})}\sum_{i \in I(G_{1:T})} E_i\Big[\epsilon_1 \sum_{j \in I(G_{1:T})\setminus\{i\}} T_j + (\epsilon_1+\epsilon_2)\,\tilde{T} + c\,M_s\Big]$$
$$\stackrel{(a)}{=} \frac{1}{\beta(G_{1:T})}\sum_{i \in I(G_{1:T})}\Big(\epsilon_1\,(T - E_i T_i) + \epsilon_2\,E_i\tilde{T} + c\,E_i M_s\Big) \stackrel{(b)}{\ge} \epsilon_1 T - \frac{\epsilon_1}{\beta(G_{1:T})}\sum_{i \in I(G_{1:T})} E_i T_i + c\,E M_s \qquad (5)$$

where $E$ denotes expectation when the best arm is drawn uniformly at random from $I(G_{1:T})$ (so that $E M_s = \frac{1}{\beta(G_{1:T})}\sum_i E_i M_s$), $(a)$ follows from $\sum_j T_j + \tilde{T} = T$, and $(b)$ follows from $\epsilon_2 \tilde{T} \ge 0$. Now we upper bound the term $E_i T_i$ in (5) to obtain the lower bound on the expected regret $E R$. Since the player is deterministic, the event $\{i_t = i\}$ is $\mathcal{F}^c$-measurable. Therefore we have

$$P_i(i_t = i) - P_0(i_t = i) \le d^{\mathcal{F}^c}_{TV}(P_0, P_i) \stackrel{(a)}{\le} d^{\mathcal{F}}_{TV}(P_0, P_i),$$

where $d^{\mathcal{F}}_{TV}(P_0, P_i) = \sup_{A \in \mathcal{F}} |P_0(A) - P_i(A)|$ is the total variation distance between the two probability measures and $(a)$ follows from $\mathcal{F}^c \subseteq \mathcal{F}$. Summing the above inequality over $t \le T$ and $i \in I(G_{1:T})$ yields

$$\sum_{i \in I(G_{1:T})} \big(E_i T_i - E_0 T_i\big) \le T \sum_{i \in I(G_{1:T})} d^{\mathcal{F}}_{TV}(P_0, P_i).$$

Rearranging the above equation and using $\sum_i E_0 T_i = E_0 \sum_i T_i \le T$, we get

$$\sum_{i \in I(G_{1:T})} E_i T_i \le T \sum_{i \in I(G_{1:T})} d^{\mathcal{F}}_{TV}(P_0, P_i) + T.$$

Combining the above equation with (5) we get

$$E R \ge \epsilon_1 T - \frac{\epsilon_1 T}{\beta(G_{1:T})} \sum_{i \in I(G_{1:T})} d^{\mathcal{F}}_{TV}(P_0, P_i) - \frac{\epsilon_1 T}{\beta(G_{1:T})} + c\,E M_s \stackrel{(a)}{\ge} \frac{\epsilon_1 T}{2} - \frac{\epsilon_1 T}{\beta(G_{1:T})} \sum_{i \in I(G_{1:T})} d^{\mathcal{F}}_{TV}(P_0, P_i) + c\,E M_s \qquad (6)$$

where $(a)$ uses the fact that $\beta(G_{1:T}) \ge 2$. Next we upper bound the second term on the right hand side of (6). Using Pinsker's inequality we have

$$d^{\mathcal{F}}_{TV}(P_0, P_i) \le \sqrt{\tfrac{1}{2}\, D_{KL}\big(P_0(l^o_{1:T}) \,\|\, P_i(l^o_{1:T})\big)} \qquad (7)$$

where $l^o_{1:T}$ are the losses observed by the player over the time horizon $T$. Using the chain rule of relative entropy to decompose $D_{KL}(P_0(l^o_{1:T}) \,\|\, P_i(l^o_{1:T}))$, we get

$$D_{KL}\big(P_0(l^o_{1:T}) \,\|\, P_i(l^o_{1:T})\big) = \sum_{t=1}^{T} D_{KL}\big(P_0(l^o_t \mid l^o_{1:t-1}) \,\|\, P_i(l^o_t \mid l^o_{1:t-1})\big) = \sum_{t=1}^{T} D_{KL}\big(P_0(l^o_t \mid l^o_{\rho^*(t)}) \,\|\, P_i(l^o_t \mid l^o_{\rho^*(t)})\big) \qquad (8)$$

where $\rho^*(t)$ is the set of time instances $0 \le k \le t$ encountered when the operation $\rho(\cdot)$ in Algorithm 1 is applied recursively to $t$. Now we deal with each term $D_{KL}(P_0(l^o_t \mid l^o_{\rho^*(t)}) \,\|\, P_i(l^o_t \mid l^o_{\rho^*(t)}))$ in the summation individually. For $i \in I(G_{1:T})$, we separate this computation into four cases: (1) $i, t$ such that the loss of action $i$ is observed at both time instances $t$ and $\rho(t)$, i.e. $i \in S_t(i_t)$ and $i \in S_{\rho(t)}(i_{\rho(t)})$; (2) $i, t$ such that the loss of action $i$ is observed at time instance $t$ but not at time instance $\rho(t)$, i.e. $i \in S_t(i_t)$ and $i \notin S_{\rho(t)}(i_{\rho(t)})$; (3) $i, t$ such that the loss of action $i$ is not observed at time instance $t$ but is observed at time instance $\rho(t)$, i.e. $i \notin S_t(i_t)$ and $i \in S_{\rho(t)}(i_{\rho(t)})$; (4) $i, t$ such that the loss of action $i$ is observed at neither of the time instances $t$ and $\rho(t)$, i.e. $i \notin S_t(i_t)$ and $i \notin S_{\rho(t)}(i_{\rho(t)})$. Note that at a single time instance the loss of only one single action from the $I(G_{1:T})$ arms can be observed.

Case 1: Since the loss of action $i$ is observed from the independent sequence set $I(G_{1:T})$ at both time instances, the loss distribution for the action $i$ is $l^o_t(i) \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(i), \sigma^2)$ under both $P_0$ and $P_i$. For all $j \in K \setminus I(G_{1:T})$, the loss distribution is $l^o_t(j) \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(i) + \epsilon_1 + \epsilon_2, \sigma^2)$ under both $P_0$ and $P_i$.

Case 2: Since the loss of action $i$ is observed from the independent sequence set $I(G_{1:T})$ at time instance $t$ but not at $\rho(t)$, there exists an action $k' \in I(G_{1:T}) \setminus \{i\}$ from the independent sequence set which was observed at time instance $\rho(t)$. Then the loss distribution for the action $i$ is $l^o_t(i) \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(k'), \sigma^2)$ under $P_0$ and $l^o_t(i) \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(k') - \epsilon_1, \sigma^2)$ under $P_i$. For all $j \in K \setminus I(G_{1:T})$, the loss distribution is $l^o_t(j) \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(k') + \epsilon_2, \sigma^2)$ under both $P_0$ and $P_i$.

Case 3: Since the action $i$ is observed from the independent sequence set $I(G_{1:T})$ at time instance $\rho(t)$ but not at $t$, there exists an action $k' \in I(G_{1:T}) \setminus \{i\}$ from the independent sequence set which was observed at time instance $t$. Then the loss distribution for the arm $k'$ is $l^o_t(k') \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(i), \sigma^2)$ under $P_0$ and $l^o_t(k') \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(i) + \epsilon_1, \sigma^2)$ under $P_i$. For all $j \in K \setminus I(G_{1:T})$, the loss distribution is $l^o_t(j) \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(i) + \epsilon_1 + \epsilon_2, \sigma^2)$ under both $P_0$ and $P_i$.

Case 4: Let $k'$ be the arm from the independent sequence set observed at time instance $\rho(t)$. Since the arm $i$ is not observed from the independent sequence set $I(G_{1:T})$ at the time instances $t$ and $\rho(t)$, the loss distribution for all arms $k'' \in I(G_{1:T}) \setminus \{i\}$ is $l^o_t(k'') \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(k'), \sigma^2)$ under both $P_0$ and $P_i$. For all $j \in K \setminus I(G_{1:T})$, the loss distribution is $l^o_t(j) \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(k') + \epsilon_2, \sigma^2)$ under both $P_0$ and $P_i$.

Therefore we have

$$D_{KL}\big(P_0(l^o_t \mid l^o_{\rho^*(t)}) \,\|\, P_i(l^o_t \mid l^o_{\rho^*(t)})\big) \le P_0\big(i \in S_t(i_t),\, i \notin S_{\rho(t)}(i_{\rho(t)})\big)\, D_{KL}\big(\mathcal{N}(0,\sigma^2) \,\|\, \mathcal{N}(-\epsilon_1,\sigma^2)\big) + P_0\big(i \notin S_t(i_t),\, i \in S_{\rho(t)}(i_{\rho(t)})\big)\, D_{KL}\big(\mathcal{N}(0,\sigma^2) \,\|\, \mathcal{N}(\epsilon_1,\sigma^2)\big) = \frac{\epsilon_1^2}{2\sigma^2}\, P_0(B_t) \qquad (9)$$

where $B_t = \{i \in S_t(i_t),\, i \notin S_{\rho(t)}(i_{\rho(t)})\} \cup \{i \notin S_t(i_t),\, i \in S_{\rho(t)}(i_{\rho(t)})\}$. The event $B_t$ implies that the player has switched at least once between feedback systems $S_t(k_1)$ and $S_{\rho(t)}(k_2)$ such that $i \in S_t(k_1)$ but $i \notin S_{\rho(t)}(k_2)$, or vice versa.
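The equality in (9) uses the standard formula for the Kullback-Leibler divergence between two Gaussians with a common variance; for completeness (this is a textbook identity, not specific to the paper),

$$D_{KL}\big(\mathcal{N}(\mu_0,\sigma^2)\,\|\,\mathcal{N}(\mu_1,\sigma^2)\big) = E_{x \sim \mathcal{N}(\mu_0,\sigma^2)}\Big[\frac{(x-\mu_1)^2 - (x-\mu_0)^2}{2\sigma^2}\Big] = \frac{(\mu_0-\mu_1)^2}{2\sigma^2},$$

so with $|\mu_0 - \mu_1| = \epsilon_1$ both divergences appearing in (9) equal $\epsilon_1^2/(2\sigma^2)$.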
Let $N_i$ be the number of times the player switches from a feedback system which includes $i$ to a feedback system which does not include $i$, or vice versa. Then, using (8) and (9), we have

$$D_{KL}\big(P_0(l^o_{1:T}) \,\|\, P_i(l^o_{1:T})\big) \le \frac{\epsilon_1^2\, \omega(\rho)}{2\sigma^2}\, E_0 N_i \qquad (10)$$

where $\omega(\rho)$ is the width of the process $\rho(\cdot)$ (see Definition 2 in Dekel et al. (2014)) and is bounded above by $2\log_2(T)$. Combining (7) and (10) we have

$$d^{\mathcal{F}}_{TV}(P_0, P_i) = \sup_{A \in \mathcal{F}} |P_0(A) - P_i(A)| \le \frac{\epsilon_1}{\sigma} \sqrt{\log_2(T)\, E_0 N_i}. \qquad (11)$$

If $E M_s \ge \epsilon_1 T$ then $E R > c\,\epsilon_1 T$, and the claimed lower bound follows. So let us assume $E M_s \le \epsilon_1 T$. For all $i \in I(G_{1:T})$ we have

$$E_0 M_s - E_i M_s = \sum_{m \ge 1} \big(P_0(M_s \ge m) - P_i(M_s \ge m)\big) \ge -\epsilon_1 T\, d^{\mathcal{F}}_{TV}(P_0, P_i). \qquad (12)$$

Using the above equation we have

$$E_0 M_s \ge E M_s - \frac{\epsilon_1 T}{\beta(G_{1:T})} \sum_{i \in I(G_{1:T})} d^{\mathcal{F}}_{TV}(P_0, P_i). \qquad (13)$$

Now, combining (4), (6), (11) and (13) we obtain

$$E R^c \ge \frac{\epsilon_1 T}{6} - \frac{\epsilon_1 T}{\beta(G_{1:T})} \sum_{i \in I(G_{1:T})} \frac{\epsilon_1}{\sigma} \sqrt{\log_2(T)\, E_0 N_i} + c\, E_0 M_s \stackrel{(a)}{\ge} \frac{\epsilon_1 T}{6} - \frac{\epsilon_1^2 T}{\sigma} \sqrt{\frac{2 \log_2(T)\, E_0 M_s}{\beta(G_{1:T})}} + c\, E_0 M_s$$
$$\stackrel{(b)}{\ge} \frac{c^{1/3}\, \beta^{1/3}(G_{1:T})\, T^{2/3}}{54 \log_2(T)} - \frac{c^{1/3}\, \beta^{1/3}(G_{1:T})\, T^{2/3}}{162 \log_2(T)} = \frac{c^{1/3}\, \beta^{1/3}(G_{1:T})\, T^{2/3}}{81 \log_2(T)} \qquad (14)$$

where $(a)$ follows from the concavity of $\sqrt{x}$ together with $\sum_i N_i \le 2 M_s$, and $(b)$ follows from the fact that the right hand side is minimized at $E_0 M_s = \epsilon_1^2\, T \sqrt{\log_2(T)}/(2 c \sigma)$. The claim of the theorem now follows.

B Proof of Lemma 2

Proof. Recall that $\beta(G_{1:T})$ is the cardinality of $I(G_{1:T})$, and let the actions $1, 2, \ldots, \beta(G_{1:T})$ belong to the set $I(G_{1:T})$. The adversary selects an action uniformly at random from the set $I(G_{1:T})$, say $j$, and assigns the loss sequence to action $j$ using independent Bernoulli random variables with parameter $0.5 - \epsilon$, where $\epsilon = \sqrt{\beta(G_{1:T})/T}$. For all $i \in I(G_{1:T}) \setminus \{j\}$, losses are assigned using independent Bernoulli random variables with parameter $0.5$. For all $i \notin I(G_{1:T})$, the losses are assigned using independent Bernoulli random variables with parameter $1$. The proof of the lemma follows along the same lines as Theorem 5 in Alon et al. (2017).

C Proof of Theorem 3

Proof. The proof of this theorem uses the result of Theorem 1. Since the loss sequence is assigned independently to each sub-sequence $U_m$, $1 \le m \le M$, using Theorem 1 there exists a constant $b_m$ such that

$$E\Big[\sum_{t=1}^{T} l_t(i_t)\,\mathbb{1}(G_t \in U_m) + c\,W_m - \min_{i \in U_m} \sum_{t=1}^{T} l_t(i)\,\mathbb{1}(G_t \in U_m)\Big] \ge b_m\, c^{1/3}\, \beta(U_m)^{1/3}\, N(U_m)^{2/3} / \log(T) \qquad (15)$$

where $W_m$ is the number of switches performed within the sub-sequence $U_m$. Since $\sum_{m=1}^{M} W_m \le \sum_{t=2}^{T} \mathbb{1}(i_t \ne i_{t-1})$, there exists a constant $b$ such that the expected regret of any algorithm $\mathcal{A}$ is at least $b \sum_{m=1}^{M} c^{1/3}\, \beta(U_m)^{1/3}\, N(U_m)^{2/3} / \log T$.

D Proof of Lemma 4

Proof. The proof is by contradiction and follows along the same lines as the proof of Theorem 4 in Dekel et al. (2014). Suppose $\mathcal{A}$ performs at most $\tilde{O}\big((\beta(G_{1:T})^{1/2}\, T)^{\alpha}\big)$ switches for any sequence of loss functions over $T$ rounds, with $\beta + \alpha/2 < 1$. Then there exists a real number $\gamma$ such that $\beta < \gamma < 1 - \alpha/2$. Assign $c = (\beta(G_{1:T})^{1/2}\, T)^{3\gamma - 2}$. Then the expected regret of the algorithm, including the switching cost, is $\tilde{O}\big((\beta(G_{1:T})^{1/2} T)^{\beta} + (\beta(G_{1:T})^{1/2} T)^{3\gamma - 2} (\beta(G_{1:T})^{1/2} T)^{\alpha}\big) = \tilde{o}\big((\beta(G_{1:T})^{1/2} T)^{\gamma}\big)$ over the sequence of losses assigned by the adversary, because $\beta < \gamma$ and $\alpha < 2 - 2\gamma$. However, according to Theorem 1 the expected regret is at least $\Omega\big(\beta(G_{1:T})^{1/3}\, ((\beta(G_{1:T})^{1/2} T)^{3\gamma - 2})^{1/3}\, T^{2/3}\big) = \Omega\big((\beta(G_{1:T})^{1/2} T)^{\gamma}\big)$, since $\beta(G_{1:T})^{1/3 + (3\gamma-2)/6}\, T^{(3\gamma-2)/3 + 2/3} = \beta(G_{1:T})^{\gamma/2}\, T^{\gamma}$. Hence, by contradiction, the claim of the lemma follows.

E Proof of Theorem 5

Proof. Let $t_1, t_2, \ldots$ be the sequence of time instances at which the event $E_t$ occurs during the duration $T$ of the game, and define the inter-event times $r_j = t_{j+1} - t_j$. Let $\mathrm{mas}(G_{(1)}) \ge \ldots \ge \mathrm{mas}(G_{(T)})$ denote the sizes of the maximal acyclic subgraphs arranged in decreasing order, i.e. $\mathrm{mas}(G_{(1)})$ (respectively $\mathrm{mas}(G_{(T)})$) is the maximum (respectively minimum) size of a maximal acyclic subgraph observed in the sequence $G_{1:T} = \{G_1, \ldots, G_T\}$. Using the definition of $E_t$, note that $r_j$ is a random variable bounded by $T^{1/3} c^{2/3} / \mathrm{mas}^{1/3}(G_{(T)})$. For all $j$, the ratio of the total weights of the actions at rounds $t_j$ and $t_{j+1}$ is

$$\frac{W_{t_{j+1}}}{W_{t_j}} = \sum_{i} \frac{w_{i,t_{j+1}}}{W_{t_j}} = \sum_{i} \frac{w_{i,t_j}}{W_{t_j}} \exp\big(-\eta\, \tilde{l}_{t_j+1:t_j+r_j}(i)\big) = \sum_{i} p_{i,t_j} \exp\big(-\eta\, \tilde{l}_{t_j+1:t_j+r_j}(i)\big)$$
$$\stackrel{(a)}{\le} \sum_{i} p_{i,t_j} \Big(1 - \eta\, \tilde{l}_{t_j+1:t_j+r_j}(i) + \frac{\eta^2}{2}\, \tilde{l}^2_{t_j+1:t_j+r_j}(i)\Big) = 1 - \eta \sum_{i} p_{i,t_j}\, \tilde{l}_{t_j+1:t_j+r_j}(i) + \frac{\eta^2}{2} \sum_{i} p_{i,t_j}\, \tilde{l}^2_{t_j+1:t_j+r_j}(i) \qquad (16)$$

where $\tilde{l}_{t_j+1:t_j+r_j}(i)$ denotes the estimated loss of action $i$ accumulated over the rounds $t_j+1, \ldots, t_j+r_j$, and $(a)$ follows from the fact that $e^{-x} \le 1 - x + x^2/2$ for all $x \ge 0$. Now, taking logarithms on both sides of (16), summing over $t_1, t_2, \ldots$, and using $\log(1+x) \le x$ for all $x > -1$, we get

$$\log \frac{W_{T+1}}{W_1} \le -\eta \sum_{j} \sum_{i} p_{i,t_j}\, \tilde{l}_{t_j+1:t_j+r_j}(i) + \frac{\eta^2}{2} \sum_{j} \sum_{i} p_{i,t_j}\, \tilde{l}^2_{t_j+1:t_j+r_j}(i). \qquad (17)$$
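The inequalities (16) and (17) are the usual exponential-weights potential argument, applied once per inter-event block with importance-weighted block loss estimates. The sketch below shows one such block update; it is illustrative only (the initialization, learning rate, and exact estimator form are assumptions on our part, not the paper's algorithm):

```python
# Hedged sketch of one exponential-weights block update with graph feedback,
# in the spirit of (16)-(17). All names and the estimator details are illustrative.
import numpy as np

def block_update(weights, eta, block_losses, block_graphs, played):
    """weights: length-K vector w_{., t_j};  eta: learning rate;
    block_losses: list of length-K arrays, the losses l_t over rounds t_j+1..t_j+r_j;
    block_graphs: list of feedback systems, block_graphs[s][k] = S_t(k), the arms
                  observed if k is played at that round;
    played: arm i_{t_j}, held fixed throughout the block."""
    K = len(weights)
    p = weights / weights.sum()                      # p_{., t_j}
    est = np.zeros(K)                                # block estimate \tilde{l}_{t_j+1:t_j+r_j}
    for l_t, S in zip(block_losses, block_graphs):
        for i in S[played]:                          # losses actually observed this round
            q_it = sum(p[k] for k in range(K) if i in S[k])   # q_{i,t}
            est[i] += l_t[i] / q_it                  # importance-weighted estimate
    return weights * np.exp(-eta * est)              # w_{., t_{j+1}}

# Tiny usage example with K = 3 arms and a 2-round block.
w = np.ones(3)
graphs = [[{0, 1}, {1}, {2}], [{0}, {1, 2}, {2}]]    # S_t(k) for k = 0, 1, 2
losses = [np.array([0.2, 0.5, 0.9]), np.array([0.1, 0.4, 0.8])]
print(block_update(w, eta=0.1, block_losses=losses, block_graphs=graphs, played=0))
```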

For all actions $k \in K$ we also have

$$\log \frac{W_{T+1}}{W_1} \ge \log \frac{w_{k,T+1}}{W_1} \ge -\eta \sum_{j} \tilde{l}_{t_j+1:t_j+r_j}(k) - \log(K). \qquad (18)$$

Combining (17) and (18), for all $k \in K$ we obtain

$$\sum_{j} \sum_{i} p_{i,t_j}\, \tilde{l}_{t_j+1:t_j+r_j}(i) - \sum_{j} \tilde{l}_{t_j+1:t_j+r_j}(k) \le \frac{\log(K)}{\eta} + \frac{\eta}{2} \sum_{j} \sum_{i} p_{i,t_j}\, \tilde{l}^2_{t_j+1:t_j+r_j}(i). \qquad (19)$$

Now, for all $i \in K$, the conditional expectation of $\tilde{l}_{t_j+1:t_j+r_j}(i)$ is

$$E\big[\tilde{l}_{t_j+1:t_j+r_j}(i) \mid p_{t_j}, r_j\big] = \sum_{t=t_j+1}^{t_j+r_j} \sum_{k:\, i \in S_t(k)} p_{k,t_j}\, \frac{l_t(i)}{q_{i,t}} = \sum_{t=t_j+1}^{t_j+r_j} \frac{l_t(i)}{q_{i,t}} \sum_{k:\, i \in S_t(k)} p_{k,t_j} = \sum_{t=t_j+1}^{t_j+r_j} l_t(i). \qquad (20)$$

Therefore, for all $i \in K$,

$$E\Big[\sum_{j} \tilde{l}_{t_j+1:t_j+r_j}(i)\Big] = E\Big[E\Big[\sum_{j} \tilde{l}_{t_j+1:t_j+r_j}(i) \,\Big|\, \{p_{t_j}, r_j\}_j\Big]\Big] = E\Big[\sum_{j} \sum_{t=t_j+1}^{t_j+r_j} l_t(i)\Big] \le \sum_{t=1}^{T} l_t(i). \qquad (21)$$

Now, the expectation of the second term on the right hand side of (19) is

$$E\Big[\sum_{j} \sum_{i} p_{i,t_j}\, \tilde{l}^2_{t_j+1:t_j+r_j}(i)\Big] = E\Big[\sum_{j} E\Big[\sum_{i} p_{i,t_j}\, \tilde{l}^2_{t_j+1:t_j+r_j}(i) \,\Big|\, \{p_{t_j}, r_j\}_j\Big]\Big] \stackrel{(a)}{\le} E\Big[\sum_{j} \mathrm{mas}(G_{t_j:t_j+r_j})\, r_j^2\Big] \qquad (22)$$

where $\mathrm{mas}(G_{t_j:t_j+r_j}) = \max_{t_j \le n \le t_j+r_j} \mathrm{mas}(G_n)$ and $(a)$ follows from the fact that for all $i \in K$ and $t \le T$, $l_t(i) \le 1$ and $\sum_i p_{i,t}/q_{i,t} \le \mathrm{mas}(G_t)$ (Alon et al., 2017, Lemma 10). Now we bound $\sum_j \mathrm{mas}(G_{t_j:t_j+r_j})\, r_j^2$ through the following optimization problem:

$$\max_{\{r_j\}} \ \sum_{j} \mathrm{mas}(G_{t_j:t_j+r_j})\, r_j^2 \quad \text{subject to} \quad \sum_{j} r_j \le T, \qquad 0 \le r_j \le \frac{T^{1/3}\, c^{2/3}}{\mathrm{mas}^{1/3}(G_{(T)})}. \qquad (23)$$

Since the objective function is submodular and the constraints are linear, the greedy solution is at least a $(1 - 1/e)$ fraction of the optimal solution (Nemhauser and Wolsey, 1978). Therefore the optimal solution $o^*$ of the above optimization problem satisfies

$$o^* \le \sum_{t=1}^{\tilde{T}} \frac{T^{2/3}\, \mathrm{mas}(G_{(t)})\, c^{4/3}}{(1 - 1/e)\, \mathrm{mas}^{2/3}(G_{(T)})} \qquad (24)$$

where $\tilde{T} = T^{2/3}\, \mathrm{mas}^{1/3}(G_{(T)}) / c^{2/3}$. Using (19), (20), (21), (22) and (24), for all $k \in K$ we have

$$E\Big[\sum_{j} \sum_{t=t_j+1}^{t_j+r_j} \sum_{i} p_{i,t_j}\, l_t(i)\Big] - \sum_{t=1}^{T} l_t(k) \le \frac{\log(K)}{\eta} + \frac{\eta}{2} \sum_{t=1}^{\tilde{T}} \frac{T^{2/3}\, c^{4/3}\, \mathrm{mas}(G_{(t)})}{(1 - 1/e)\, \mathrm{mas}^{2/3}(G_{(T)})}. \qquad (25)$$

Additionally, the player switches its action only if $E_t$ is true. Thus, using (25) and $c(i,j) = c$ for all $i \ne j \in K$, we have

$$R_{\mathcal{A}}(l_{1:T}, C) \le \frac{\log(K)}{\eta} + \frac{\eta}{2} \sum_{t=1}^{\tilde{T}} \frac{T^{2/3}\, c^{4/3}\, \mathrm{mas}(G_{(t)})}{(1 - 1/e)\, \mathrm{mas}^{2/3}(G_{(T)})} + c\, E\Big[\sum_{t=2}^{T} \mathbb{1}(i_t \ne i_{t-1})\Big]. \qquad (26)$$

Now we bound $E\big[\sum_{t=2}^{T} \mathbb{1}(i_t \ne i_{t-1})\big]$. The event $E^1_t$ does not contribute to any switching cost. The event $E^2_t$ can lead to at most $T^{2/3}\, \mathrm{mas}^{1/3}(G_{(T)})/c^{2/3}$ switches. Now suppose $E^3_t$ causes $N_T$ switches. Then we have

$$E N_T = E\Big[\sum_{j} E\big[\mathbb{1}(i_{t_{j+1}} \ne i_{t_j},\ E^3_{t_j} \text{ is true}) \,\big|\, \{p_{t_j}, r_j\}_j\big]\Big] \le E\Big[\sum_{j} \sum_{k \in K \setminus \{i\}} P\big(i_{t_j} = i,\ E^3_{t_j} \text{ is true}\big)\, P\big(i_{t_{j+1}} = k \mid i_{t_j} = i,\ \{p_{t_j}, r_j\}_j\big)\Big]$$
$$\le E\Big[\sum_{j} \sum_{k \in K \setminus \{i\}} p_{i,t_j}\, p_{k,t_{j+1}}\Big] \stackrel{(a)}{\le} \sum_{j} \frac{\mathrm{mas}^{1/3}(G_{(T)})}{c^{2/3}\, t_{j+1}^{1/3}} \le \frac{\mathrm{mas}^{1/3}(G_{(T)})\, T^{2/3}}{c^{2/3}} \qquad (27)$$

where $(a)$ follows from Lemma 1 in this section. Thus the number of switches is at most $2\, c^{-2/3}\, \mathrm{mas}^{1/3}(G_{(T)})\, T^{2/3}$ and the switching cost is at most $2\, c^{1/3}\, \mathrm{mas}^{1/3}(G_{(T)})\, T^{2/3}$. Part (iii) of the theorem follows by combining the results from (i) and (ii). Part (iv) follows from the fact that if $G_t$ is undirected, $\mathrm{mas}(G_t) = \alpha(G_t)$.

Lemma 1. Given that $i \in K$ is chosen at time instance $t_j$, for all $k \in K \setminus \{i\}$ we have $p_{i,t_j}\, p_{k,t_{j+1}} \le c^{-2/3}\, \mathrm{mas}^{1/3}(G_{(T)})\, (t_{j+1})^{-1/3}$.

Proof. Given that $i$ is chosen at time instance $t_j$, for all $k \in K \setminus \{i\}$ we have

$$p_{k,t_{j+1}} \le \frac{p_{k,1} \exp\big(-\eta\, \hat{l}_{t_{j+1}}(k)\big)}{p_{i,t_j} \exp\big(-\eta\, \tilde{l}_{t_j+1:t_j+r_j}(i)\big)} \stackrel{(a)}{=} \frac{p_{k,1} \exp\big(-\eta\, (\hat{l}_{t_j}(k) + \tilde{l}_{t_j+1:t_j+r_j}(k))\big)}{p_{i,t_j} \exp\big(-\eta\, \tilde{l}_{t_j+1:t_j+r_j}(i)\big)}$$
$$\stackrel{(b)}{=} \frac{\exp\big(-\eta\, (\hat{l}_{t_j}(k) + \tilde{l}_{t_j+1:t_j+r_j}(k) - \tilde{l}_{t_j+1:t_j+r_j}(i))\big)}{K\, p_{i,t_j}} \stackrel{(c)}{\le} \frac{\exp\big(-\eta\, (\epsilon_{t_{j+1}}/\eta)\big)}{K\, p_{i,t_j}} \le \frac{\exp(-\epsilon_{t_{j+1}})}{p_{i,t_j}} \qquad (28)$$

where $(a)$ follows from the fact that $\hat{l}_{t_{j+1}}(k) = \hat{l}_{t_j}(k) + \tilde{l}_{t_j+1:t_j+r_j}(k)$; $(b)$ follows from $p_{k,1} = 1/K$; and $(c)$ follows from the fact that, for all $k \in K \setminus \{i\}$, $\hat{l}_t(k) - \hat{l}_t(i) > \epsilon_t/\eta$, as the increment in $\hat{l}_t(i)$ is bounded by $1/q_{i,t}$. Now, replacing $\epsilon_t = \log\big(t\, c^2 / \mathrm{mas}(G_{(T)})\big)/3$ in (28), we have

$$p_{i,t_j}\, p_{k,t_{j+1}} \le c^{-2/3}\, \mathrm{mas}^{1/3}(G_{(T)})\, t_{j+1}^{-1/3}. \qquad (29)$$

F Proof of Theorem 6

Proof. We borrow the notation from the proof of Theorem 5. Using the fact that $\eta_t$ is decreasing in $t$, together with (19), we have

$$\sum_{j} \sum_{i} p_{i,t_j}\, \tilde{l}_{t_j+1:t_j+r_j}(i) - \min_{k \in K} \sum_{j} \tilde{l}_{t_j+1:t_j+r_j}(k) \le \frac{\log(K)}{\eta_T} + \sum_{j} \frac{\eta_{t_j}}{2} \sum_{i} p_{i,t_j}\, \tilde{l}^2_{t_j+1:t_j+r_j}(i). \qquad (30)$$

Now, taking expectations on both sides and using the fact that the expectation of the minimum is smaller than the minimum of the expectations, we have

$$E\Big[\sum_{j} \sum_{i} p_{i,t_j}\, \tilde{l}_{t_j+1:t_j+r_j}(i)\Big] - E\Big[\min_{k \in K} \sum_{j} \tilde{l}_{t_j+1:t_j+r_j}(k)\Big] \le E\Big[\frac{\log(K)}{\eta_T}\Big] + E\Big[\sum_{j} \frac{\eta_{t_j}}{2} \sum_{i} p_{i,t_j}\, \tilde{l}^2_{t_j+1:t_j+r_j}(i)\Big]$$
$$\stackrel{(a)}{\le} E\Big[\frac{\log(K)}{\eta_T}\Big] + E\Big[\sum_{j} \frac{\eta_{t_j}}{2}\, \mathrm{mas}(G_{t_j:t_j+r_j})\, r_j^2\Big] \stackrel{(b)}{\le} E\Big[\frac{\log(K)}{\eta_T}\Big] + \sum_{j} \frac{\eta_{t_j}}{2} \cdot \frac{2}{\epsilon_{t_j}^2}\, E\big[\mathrm{mas}(G_{t_j:t_j+r_j})\big]$$
$$\stackrel{(c)}{\le} E\Big[\sum_{t=2}^{T} \mathbb{1}(i_t \ne i_{t-1})\Big] + \sum_{j} \frac{2\sqrt{\log(K)}\,\, E\big[\mathrm{mas}(G_{t_j:t_j+r_j})\big]}{\mathrm{mas}^{2/3}(G_{(T)})} \stackrel{(d)}{\le} E\Big[\sum_{t=2}^{T} \mathbb{1}(i_t \ne i_{t-1})\Big] + \sum_{j} \frac{2\sqrt{\log(K)}\, \mathrm{mas}(G_{(j)})}{\mathrm{mas}^{2/3}(G_{(T)})} \qquad (31)$$

where $(a)$ follows from (22), given that $i_t$ is selected using $p_t$; $(b)$ follows from the fact that, since the probability of selecting a new action is at most $\epsilon_{t_j}$, the mean and the variance of the geometric random variable $r_j$ are bounded by $1/\epsilon_{t_j}$ and $(1-\epsilon_{t_j})/\epsilon_{t_j}^2$ respectively; $(c)$ follows from the values of $\eta_t$ and $\epsilon_t$; and $(d)$ follows from the fact that $\mathrm{mas}(G_{(j)})/\mathrm{mas}(G_{(T)})$ is a monotonically non-increasing sequence in $j$, so the summation is a concave function and the inequality follows from Jensen's inequality.

Now we bound the term $E\big[\sum_{t=2}^{T} \mathbb{1}(i_t \ne i_{t-1})\big]$ in (31); this also gives a bound on the number of switches performed by the algorithm. We have

$$E\Big[\sum_{t=2}^{T} \mathbb{1}(i_t \ne i_{t-1})\Big] \le \sum_{t=1}^{T} \epsilon_t \le 0.5\, \mathrm{mas}^{1/3}(G_{(T)})\, T^{2/3}\, c^{-1/3}. \qquad (32)$$
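The bound (32) reflects that the algorithm is only allowed to change its action when a low-probability event fires, so the expected number of switches is at most $\sum_t \epsilon_t$. The sketch below simulates such a switching schedule; the specific form of $\epsilon_t$ here is a placeholder chosen only to match the $T^{2/3}$ scaling in (32), not the paper's exact choice:

```python
# Hedged sketch of a lazy switching schedule in the spirit of (32): the arm may be
# re-drawn only when a Bernoulli(eps_t) event fires, so E[#switches] <= sum_t eps_t.
import numpy as np

def lazy_switch_schedule(T, eps, rng=None):
    """Return the rounds at which the player is allowed to change its action.
    eps: function t -> switching probability epsilon_t (a modeling assumption here)."""
    rng = np.random.default_rng(rng)
    return [t for t in range(1, T + 1) if rng.random() < eps(t)]

# Placeholder epsilon_t of order (mas / (c * t))^{1/3}, scaled so that
# sum_t eps_t <= 0.5 * mas^{1/3} * T^{2/3} / c^{1/3}, mirroring (32).
c, mas = 1.0, 2.0
eps = lambda t: min(1.0, (mas / c) ** (1.0 / 3.0) * t ** (-1.0 / 3.0) / 3.0)
rounds = lazy_switch_schedule(T=1000, eps=eps, rng=0)
print(len(rounds), "allowed switches; sum_t eps_t =", round(sum(eps(t) for t in range(1, 1001)), 2))
```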

References

Alon, N., Cesa-Bianchi, N., Gentile, C., Mannor, S., Mansour, Y., and Shamir, O. (2017). Nonstochastic multi-armed bandits with graph-structured feedback. SIAM Journal on Computing, 46(6):1785-1826.

Dekel, O., Ding, J., Koren, T., and Peres, Y. (2014). Bandits with switching costs: $T^{2/3}$ regret. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing (STOC 2014). ACM.

Nemhauser, G. L. and Wolsey, L. A. (1978). Best algorithms for approximating the maximum of a submodular set function. Mathematics of Operations Research, 3(3):177-188.
