Online learning with feedback graphs and switching costs
A Proof of Theorem 1

Proof. Without loss of generality, let the independent sequence set $I(G_{1:T})$ be formed of the actions (or arms) $1$ to $\beta(G_{1:T})$. Given the sequence of feedback graphs $G_{1:T}$, let $T_i$ be the number of times action $i \in I(G_{1:T})$ is selected by the player in $T$ rounds, and let $T_0$ be the total number of times actions are selected from the set $K \setminus I(G_{1:T})$. Let $\mathbb{E}_i$ denote expectation conditioned on the event $X_i$ and $P_i$ the probability conditioned on $X_i$. Additionally, we define $P_0$ as the probability conditioned on the event $X_0$, under which all the actions in the independent sequence set, i.e., $i \in I(G_{1:T})$, incur an expected loss of $1/2$, whereas the expected loss of the actions $i \in K \setminus I(G_{1:T})$ is $1/2 + \varepsilon_2$. Let $\mathbb{E}_0$ be the corresponding conditional expectation. For all $i \in K$ and $t \le T$, $l_t(i)$ and $l^c_t(i)$ denote the unclipped and clipped losses of action $i$, respectively. Assuming the unclipped losses are observed by the player, $\mathcal{F}$ is the sigma-field generated by the unclipped losses, and $S_t(i_t)$ is the set of actions whose losses are observed at time $t$ following the selection of $i_t$, according to the feedback graph $G_t$. The observed sequence of unclipped losses will be referred to as $l^o_{1:T}$. Additionally, $\mathcal{F}^c$ is the sigma-field generated by the clipped losses $l^c_t(i)$ for all $t \le T$ and $i \in S_t(i_t)$, and the observed sequence of clipped losses will be referred to as $l^{c,o}_{1:T}$. By definition, $\mathcal{F}^c \subseteq \mathcal{F}$.

Let $i_1, \ldots, i_T$ be the sequence of actions selected by the player over the time horizon $T$. The regret $R^c$ of the player corresponding to the clipped losses is

$$R^c = \sum_{t=1}^{T} l^c_t(i_t) + c\,M_s - \min_{i \in K} \sum_{t=1}^{T} l^c_t(i), \quad (1)$$

where $M_s$ is the number of switches in the action selection sequence $i_1, \ldots, i_T$ and $c$ is the cost of each switch in action. Analogously, we define the regret $R$ corresponding to the unclipped losses used in Algorithm 1:

$$R = \sum_{t=1}^{T} l_t(i_t) + c\,M_s - \min_{i \in K} \sum_{t=1}^{T} l_t(i). \quad (2)$$

Using Dekel et al. (2014, Lemma 4), we have

$$P\Big(\text{for all } t \le T:\ \tfrac{1}{6} \le \tfrac{1}{2} + W_t \le \tfrac{5}{6}\Big) \ge \tfrac{5}{6}. \quad (3)$$

Moreover, for all sufficiently large $T$ we have $\varepsilon_1 + \varepsilon_2 < 1/6$. If the event $B = \{\text{for all } t \le T:\ 1/6 \le 1/2 + W_t \le 5/6\}$ occurs and $\varepsilon_1 + \varepsilon_2 < 1/6$, then for all $i \in K$, $l^c_t(i) = l_t(i)$, which implies $R^c = R$ (see (1) and (2)). If the event $B$ does not occur, then the losses at any time $t$ satisfy $|l_t(i) - l^c_t(i)| \le \varepsilon_1 + \varepsilon_2$, and therefore $-c\,M_s \le R^c - R \le c\,M_s + (\varepsilon_1 + \varepsilon_2)T$. Hence, for all sufficiently large $T$,

$$\mathbb{E}R^c \ge \mathbb{E}R - \big(1 - P(B)\big)\,\mathbb{E}\big[R - R^c \mid B \text{ does not occur}\big] \ge \mathbb{E}R - \frac{(\varepsilon_1 + \varepsilon_2)T}{6}. \quad (4)$$
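To make the regret definitions in (1) and (2) concrete, the following sketch computes the switching-cost regret of an action sequence against the best fixed action in hindsight. The loss matrix, action sequence, and switch cost are illustrative toy values, not taken from the paper.

```python
def switching_regret(losses, actions, c):
    """Regret with switching costs, as in (1):
    total incurred loss, plus c per switch, minus the loss of the
    best fixed action in hindsight. `losses[t][i]` is the loss of
    action i at round t (toy values, not the paper's construction)."""
    T = len(losses)
    K = len(losses[0])
    play_cost = sum(losses[t][actions[t]] for t in range(T))
    switches = sum(1 for t in range(1, T) if actions[t] != actions[t - 1])
    best_fixed = min(sum(losses[t][i] for t in range(T)) for i in range(K))
    return play_cost + c * switches - best_fixed


# An alternating sequence pays zero play cost here but three switches,
# while the best fixed action accumulates loss 2.
example = switching_regret(
    losses=[[0, 1], [1, 0], [0, 1], [1, 0]],
    actions=[0, 1, 0, 1],
    c=1,
)
```

With $c = 1$, the example sequence's three switches outweigh its zero play cost, illustrating why the switching cost term can dominate the regret.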
Thus, (4) lower-bounds the actual regret $R^c$ in terms of the regret $R$. We now derive a lower bound on the regret $R$ corresponding to the unclipped losses. Using the definition of $R$, we have

$$\mathbb{E}R \ge \frac{1}{\beta(G_{1:T})} \sum_{i \in I(G_{1:T})} \mathbb{E}_i\Big[\sum_{t=1}^{T} l_t(i_t) - \min_{j} \sum_{t=1}^{T} l_t(j) + c\,M_s\Big] \overset{(a)}{=} \frac{1}{\beta(G_{1:T})} \sum_{i \in I(G_{1:T})} \Big(\varepsilon_1 \big(T - \mathbb{E}_i T_i\big) + \varepsilon_2\,\mathbb{E}_i T_0 + c\,\mathbb{E}_i M_s\Big) \overset{(b)}{\ge} \frac{1}{\beta(G_{1:T})} \sum_{i \in I(G_{1:T})} \Big(\varepsilon_1 \big(T - \mathbb{E}_i T_i\big) + c\,\mathbb{E}_i M_s\Big), \quad (5)$$

where (a) follows from $\sum_j T_j + T_0 = T$ and (b) follows from $\varepsilon_2 T_0 \ge 0$. Now we upper-bound $\mathbb{E}_i T_i$ in (5) to obtain the lower bound on the expected regret $\mathbb{E}R$. Since the player is deterministic, the event $\{i_t = i\}$ is $\mathcal{F}^c$-measurable. Therefore we have

$$P_i(i_t = i) - P_0(i_t = i) \le d^{\mathcal{F}^c}_{TV}(P_0, P_i) \le d^{\mathcal{F}}_{TV}(P_0, P_i),$$

where $d^{\mathcal{F}}_{TV}(P_0, P_i) = \sup_{A \in \mathcal{F}} |P_0(A) - P_i(A)|$ is the total variation distance between the two probability measures, and the last inequality follows from $\mathcal{F}^c \subseteq \mathcal{F}$. Summing the above inequality over $t \le T$ and $i \in I(G_{1:T})$ yields

$$\sum_{i} \big(\mathbb{E}_i T_i - \mathbb{E}_0 T_i\big) \le T \sum_{i} d^{\mathcal{F}}_{TV}(P_0, P_i).$$

Rearranging the above and using $\sum_i \mathbb{E}_0 T_i = \mathbb{E}_0 \sum_i T_i \le T$, we get

$$\sum_{i} \mathbb{E}_i T_i \le T \sum_{i} d^{\mathcal{F}}_{TV}(P_0, P_i) + T.$$

Combining the above with (5), we get

$$\mathbb{E}R \ge \varepsilon_1 T - \frac{\varepsilon_1 T}{\beta(G_{1:T})} - \frac{\varepsilon_1 T}{\beta(G_{1:T})} \sum_{i} d^{\mathcal{F}}_{TV}(P_0, P_i) + \frac{c}{\beta(G_{1:T})} \sum_i \mathbb{E}_i M_s \ge \frac{\varepsilon_1 T}{2} - \frac{\varepsilon_1 T}{\beta(G_{1:T})} \sum_{i} d^{\mathcal{F}}_{TV}(P_0, P_i) + \frac{c}{\beta(G_{1:T})} \sum_i \mathbb{E}_i M_s, \quad (6)$$

where the last step uses $\beta(G_{1:T}) \ge 2$. Next we upper-bound the second term on the right-hand side of (6). Using Pinsker's inequality, we have

$$d^{\mathcal{F}}_{TV}(P_0, P_i) \le \sqrt{\frac{1}{2}\, D_{KL}\big(P_0(l^o_{1:T}) \,\|\, P_i(l^o_{1:T})\big)}, \quad (7)$$
where $l^o_{1:T}$ are the losses observed by the player over the time horizon $T$. Using the chain rule of relative entropy to decompose $D_{KL}(P_0(l^o_{1:T}) \,\|\, P_i(l^o_{1:T}))$, we get

$$D_{KL}\big(P_0(l^o_{1:T}) \,\|\, P_i(l^o_{1:T})\big) = \sum_{t=1}^{T} D_{KL}\big(P_0(l^o_t \mid l^o_{1:t-1}) \,\|\, P_i(l^o_t \mid l^o_{1:t-1})\big) = \sum_{t=1}^{T} D_{KL}\big(P_0(l^o_t \mid l^o_{\rho^*(t)}) \,\|\, P_i(l^o_t \mid l^o_{\rho^*(t)})\big), \quad (8)$$

where $\rho^*(t)$ is the set of time instances $0 \le k < t$ encountered when the operation $\rho(\cdot)$ in Algorithm 1 is applied recursively to $t$. We now deal with each term $D_{KL}(P_0(l^o_t \mid l^o_{\rho^*(t)}) \,\|\, P_i(l^o_t \mid l^o_{\rho^*(t)}))$ in the summation individually. For $i \in I(G_{1:T})$, we separate this computation into four cases: (i) the loss of action $i$ is observed at both time instances $t$ and $\rho(t)$, i.e., $i \in S_t(i_t)$ and $i \in S_{\rho(t)}(i_{\rho(t)})$; (ii) the loss of action $i$ is observed at time instance $t$ but not at time instance $\rho(t)$, i.e., $i \in S_t(i_t)$ and $i \notin S_{\rho(t)}(i_{\rho(t)})$; (iii) the loss of action $i$ is not observed at time instance $t$ but is observed at time instance $\rho(t)$, i.e., $i \notin S_t(i_t)$ and $i \in S_{\rho(t)}(i_{\rho(t)})$; (iv) the loss of action $i$ is observed at neither time instance $t$ nor $\rho(t)$, i.e., $i \notin S_t(i_t)$ and $i \notin S_{\rho(t)}(i_{\rho(t)})$. Note that at a single time instance, the loss of only one action from the $I(G_{1:T})$ arms can be observed.

Case 1: Since the loss of action $i$ from the independent sequence set $I(G_{1:T})$ is observed at both time instances, the loss distribution for action $i$ is $l^o_t(i) \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(i), \sigma^2)$ under both $P_0$ and $P_i$. For all $j \in K \setminus I(G_{1:T})$, the loss distribution is $l^o_t(j) \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(i) + \varepsilon_1 + \varepsilon_2, \sigma^2)$ under both $P_0$ and $P_i$.

Case 2: Since the loss of action $i$ from the independent sequence set $I(G_{1:T})$ is observed at time instance $t$ but not at $\rho(t)$, there exists an action $k' \in I(G_{1:T}) \setminus \{i\}$ from the independent sequence set which was observed at time instance $\rho(t)$. Then the loss distribution for action $i$ is $l^o_t(i) \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(k'), \sigma^2)$ under $P_0$ and $l^o_t(i) \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(k') - \varepsilon_1, \sigma^2)$ under $P_i$.
For all $j \in K \setminus I(G_{1:T})$, the loss distribution is $l^o_t(j) \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(k') + \varepsilon_2, \sigma^2)$ under both $P_0$ and $P_i$.

Case 3: Since action $i$ from the independent sequence set $I(G_{1:T})$ is observed at time instance $\rho(t)$ but not at $t$, there exists an action $k' \in I(G_{1:T}) \setminus \{i\}$ from the independent sequence set which was observed at time instance $t$. Then the loss distribution for arm $k'$ is $l^o_t(k') \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(i), \sigma^2)$ under $P_0$ and $l^o_t(k') \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(i) + \varepsilon_1, \sigma^2)$ under $P_i$. For all $j \in K \setminus I(G_{1:T})$, the loss distribution is $l^o_t(j) \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(i) + \varepsilon_1 + \varepsilon_2, \sigma^2)$ under both $P_0$ and $P_i$.

Case 4: Let $k''$ be the arm from the independent sequence set observed at time instance $\rho(t)$. Since arm $i$ is not observed from the independent sequence set $I(G_{1:T})$ at either time instance $t$ or $\rho(t)$, the loss distribution for the arm $k' \in I(G_{1:T}) \setminus \{i\}$ observed at time $t$ is $l^o_t(k') \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(k''), \sigma^2)$ under both $P_0$ and $P_i$. For all $j \in K \setminus I(G_{1:T})$, the loss distribution is $l^o_t(j) \mid l^o_{\rho^*(t)} \sim \mathcal{N}(l_{\rho(t)}(k'') + \varepsilon_2, \sigma^2)$ under both $P_0$ and $P_i$.

Therefore, only Cases 2 and 3 contribute to the relative entropy, and we have

$$D_{KL}\big(P_0(l^o_t \mid l^o_{\rho^*(t)}) \,\|\, P_i(l^o_t \mid l^o_{\rho^*(t)})\big) = P_0\big(i \in S_t(i_t),\, i \notin S_{\rho(t)}(i_{\rho(t)})\big)\, D_{KL}\big(\mathcal{N}(0, \sigma^2) \,\|\, \mathcal{N}(-\varepsilon_1, \sigma^2)\big) + P_0\big(i \notin S_t(i_t),\, i \in S_{\rho(t)}(i_{\rho(t)})\big)\, D_{KL}\big(\mathcal{N}(0, \sigma^2) \,\|\, \mathcal{N}(\varepsilon_1, \sigma^2)\big) = \frac{\varepsilon_1^2}{2\sigma^2}\, P_0(B_t), \quad (9)$$

where $B_t = \{i \in S_t(i_t),\, i \notin S_{\rho(t)}(i_{\rho(t)})\} \cup \{i \notin S_t(i_t),\, i \in S_{\rho(t)}(i_{\rho(t)})\}$. The event $B_t$ implies that the player has switched at least once between feedback systems $S_t(k_1)$ and $S_{\rho(t)}(k_2)$ such that $i \in S_t(k_1)$ but $i \notin S_{\rho(t)}(k_2)$, or vice versa. Let $N_i$ be the number of times the player switches from a feedback system which includes $i$ to a feedback system which does not include $i$, or vice versa. Then, using (8) and (9), we have

$$D_{KL}\big(P_0(l^o_{1:T}) \,\|\, P_i(l^o_{1:T})\big) \le \frac{\varepsilon_1^2\, \omega(\rho)}{2\sigma^2}\, \mathbb{E}_0 N_i, \quad (10)$$
where $\omega(\rho)$ is the width of the process $\rho(\cdot)$ (see Definition 2 in Dekel et al. (2014)), which is bounded above by $2\log_2(T)$. Combining (7) and (10), we have

$$\sup_{A \in \mathcal{F}} \big(P_0(A) - P_i(A)\big) \le \frac{\varepsilon_1 \sqrt{\log_2(T)\, \mathbb{E}_0 N_i}}{\sigma}. \quad (11)$$

If $M_s \ge \varepsilon_1 T$, then $\mathbb{E}R \ge \varepsilon_1 T$ from the switching cost alone, and the claimed lower bound follows. Thus, let us assume $M_s \le \varepsilon_1 T$. For all $i \in I(G_{1:T})$ we then have

$$\mathbb{E}_0 M_s - \mathbb{E}_i M_s = \sum_{m=1}^{\varepsilon_1 T} \big(P_0(M_s \ge m) - P_i(M_s \ge m)\big) \le \varepsilon_1 T\, d^{\mathcal{F}}_{TV}(P_0, P_i). \quad (12)$$

Using the above, we have

$$\mathbb{E}_0 M_s \le \mathbb{E}M_s + \frac{\varepsilon_1 T}{\beta(G_{1:T})} \sum_{i} d^{\mathcal{F}}_{TV}(P_0, P_i), \quad (13)$$

where $\mathbb{E}M_s = \frac{1}{\beta(G_{1:T})} \sum_i \mathbb{E}_i M_s$. Now, combining (4), (6), (11) and (13), we obtain

$$\mathbb{E}R^c \ge \frac{\varepsilon_1 T}{6} - \frac{\varepsilon_1 T}{\beta(G_{1:T})} \sum_{i} \frac{\varepsilon_1 \sqrt{\log_2(T)\, \mathbb{E}_0 N_i}}{\sigma} + c\,\mathbb{E}_0 M_s \overset{(a)}{\ge} \frac{\varepsilon_1 T}{6} - \frac{\varepsilon_1^2 T}{\sigma} \sqrt{\frac{2 \log_2(T)\, \mathbb{E}_0 M_s}{\beta(G_{1:T})}} + c\,\mathbb{E}_0 M_s \overset{(b)}{\ge} \frac{c^{1/3} \beta(G_{1:T})^{1/3} T^{2/3}}{54 \log_2(T)} - \frac{c^{1/3} \beta(G_{1:T})^{1/3} T^{2/3}}{162 \log_2(T)} = \frac{c^{1/3} \beta(G_{1:T})^{1/3} T^{2/3}}{81 \log_2(T)}, \quad (14)$$

where (a) follows from the concavity of $\sqrt{x}$ and $\sum_i N_i \le 2 M_s$, and (b) follows from the fact that the right-hand side is minimized over the value of $\mathbb{E}_0 M_s$ (at $\mathbb{E}_0 M_s$ of order $\varepsilon_1^2 T \sqrt{\log_2(T)}/(2 c \sigma)$ squared). The claim of the theorem now follows.

B Proof of Lemma 2

Proof. Let $\beta(G_{1:T})$ be the cardinality of $I(G_{1:T})$, and let actions $1, 2, \ldots, \beta(G_{1:T})$ belong to the set $I(G_{1:T})$. The adversary selects an action uniformly at random from the set $I(G_{1:T})$, say $j$, and assigns the loss sequence to action $j$ using independent Bernoulli random variables with parameter $0.5 - \varepsilon$, where $\varepsilon = \sqrt{\beta(G_{1:T})/T}$. For all $i \in I(G_{1:T}) \setminus \{j\}$, losses are assigned using independent Bernoulli random variables with parameter $0.5$. For all $i \notin I(G_{1:T})$, the losses are assigned using independent Bernoulli random variables with parameter $1$. The proof of the lemma follows along the same lines as Theorem 5 in Alon et al. (2017).

C Proof of Theorem 3

Proof. The proof of this theorem uses the result of Theorem 1. The loss sequence is assigned independently to each sub-sequence $U_m$, where $m \le M$. Using Theorem 1, there exists a constant $b_m$ such that

$$\mathbb{E}\Big[\sum_{t=1}^{T} l_t(i_t)\,\mathbb{1}(G_t \in U_m) + c\,W_m - \min_{i \in U_m} \sum_{t=1}^{T} l_t(i)\,\mathbb{1}(G_t \in U_m)\Big] \ge b_m\, c^{1/3}\, \beta(U_m)^{1/3}\, N(U_m)^{2/3} / \log(T), \quad (15)$$
where $W_m$ is the number of switches performed within the sub-sequence $U_m$. Since $\sum_{m \le M} W_m \le \sum_{t=2}^{T} \mathbb{1}(i_t \ne i_{t-1})$, there exists a constant $b$ such that the expected regret of any algorithm $\mathcal{A}$ is at least $b\, c^{1/3} \sum_{m \le M} \beta(U_m)^{1/3}\, N(U_m)^{2/3} / \log T$.

D Proof of Lemma 4

Proof. The proof is by contradiction and follows along the same lines as the proof of Theorem 4 in Dekel et al. (2014). Suppose $\mathcal{A}$ performs at most $\tilde{O}\big((\beta(G_{1:T})^{1/2} T)^{\alpha}\big)$ switches for any sequence of loss functions over $T$ rounds, with $\beta + \alpha/2 < 1$, where $\beta$ is the regret exponent of $\mathcal{A}$. Then there exists a real number $\gamma$ such that $\beta < \gamma < 1 - \alpha/2$. Assign $c = (\beta(G_{1:T})^{1/2} T)^{3\gamma - 2}$. Then the expected regret of the algorithm, including the switching cost, is

$$\tilde{O}\big((\beta(G_{1:T})^{1/2} T)^{\beta} + (\beta(G_{1:T})^{1/2} T)^{3\gamma - 2}\, (\beta(G_{1:T})^{1/2} T)^{\alpha}\big) = \tilde{o}\big((\beta(G_{1:T})^{1/2} T)^{\gamma}\big)$$

over the sequence of losses assigned by the adversary, because $\beta < \gamma$ and $\alpha < 2 - 2\gamma$. However, according to Theorem 1, the expected regret is at least

$$\Omega\big(\beta(G_{1:T})^{1/3}\, (\beta(G_{1:T})^{1/2} T)^{(3\gamma - 2)/3}\, T^{2/3}\big) = \Omega\big((\beta(G_{1:T})^{1/2} T)^{\gamma}\big).$$

Hence, by contradiction, the claim of the lemma follows.

E Proof of Theorem 5

Proof. Let $t_1, t_2, \ldots$ be the sequence of time instances at which the event $E_t$ occurs during the duration $T$ of the game. We define $\{r_j = t_{j+1} - t_j\}_j$ as the sequence of inter-event times between the events $E_t$. Let $\mathrm{mas}(G_{(1)}), \ldots, \mathrm{mas}(G_{(T)})$ denote the sequence of sizes of maximal acyclic subgraphs in decreasing order, i.e., $\mathrm{mas}(G_{(1)})$ (resp. $\mathrm{mas}(G_{(T)})$) is the maximum (resp. minimum) size of a maximal acyclic subgraph observed in the sequence $G_{1:T} = \{G_1, \ldots, G_T\}$. Using the definition of $E_t$, note that $r_j$ is a random variable bounded by $T^{1/3} c^{2/3} / \mathrm{mas}^{1/3}(G_{(T)})$. For all $j$, the ratio of the total weights of actions at rounds $t_{j+1}$ and $t_j$ is

$$\frac{W_{t_{j+1}}}{W_{t_j}} = \sum_i \frac{w_{i, t_j} \exp\big(-\eta\, \tilde{l}_{t_j + r_j}(i)\big)}{W_{t_j}} = \sum_i p_{i, t_j} \exp\big(-\eta\, \tilde{l}_{t_j + r_j}(i)\big) \le \sum_i p_{i, t_j} \Big(1 - \eta\, \tilde{l}_{t_j + r_j}(i) + \frac{\eta^2}{2}\, \tilde{l}^2_{t_j + r_j}(i)\Big) = 1 - \eta \sum_i p_{i, t_j}\, \tilde{l}_{t_j + r_j}(i) + \frac{\eta^2}{2} \sum_i p_{i, t_j}\, \tilde{l}^2_{t_j + r_j}(i), \quad (16)$$

where $\tilde{l}_{t_j + r_j}(i)$ denotes the estimated loss of action $i$ accumulated between $t_j$ and $t_j + r_j$, and the inequality follows from the fact that $e^{-x} \le 1 - x + x^2/2$ for all $x \ge 0$. Now, taking logs on both sides of (16), summing over $t_1, t_2, \ldots$, and using $\log(1 + x) \le x$ for all $x > -1$, we get

$$\log \frac{W_{t+1}}{W_1} \le -\eta \sum_j \sum_i p_{i, t_j}\, \tilde{l}_{t_j + r_j}(i) + \frac{\eta^2}{2} \sum_j \sum_i p_{i, t_j}\, \tilde{l}^2_{t_j + r_j}(i). \quad (17)$$
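The two elementary inequalities invoked to obtain (16) and (17) can be checked numerically; this is a sanity check over a grid, not part of the proof.

```python
import math

# Grid check of the inequalities used above:
#   exp(-x) <= 1 - x + x**2 / 2   for all x >= 0   (used in (16))
#   log(1+x) <= x                 for all x > -1   (used in (17))
for k in range(0, 500):
    x = k * 0.01  # x in [0, 4.99]
    assert math.exp(-x) <= 1 - x + x * x / 2 + 1e-12

for k in range(-99, 500):
    x = k * 0.01  # x in (-0.99, 4.99]
    assert math.log(1 + x) <= x + 1e-12
```

Both inequalities are tight at $x = 0$, which is why the second-order term $\eta^2 \tilde{l}^2/2$ is the only slack carried into the regret bound.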
For all actions $k \in K$, we also have

$$\log \frac{W_{t+1}}{W_1} \ge \log \frac{w_{k, t+1}}{W_1} = -\eta \sum_j \tilde{l}_{t_j + r_j}(k) - \log(K). \quad (18)$$

Combining (17) and (18), for all $k \in K$ we obtain

$$\sum_j \sum_i p_{i, t_j}\, \tilde{l}_{t_j + r_j}(i) \le \sum_j \tilde{l}_{t_j + r_j}(k) + \frac{\log(K)}{\eta} + \frac{\eta}{2} \sum_j \sum_i p_{i, t_j}\, \tilde{l}^2_{t_j + r_j}(i). \quad (19)$$

Now, for all $i \in K$, the conditional expectation of $\tilde{l}_{t_j + r_j}(i)$ is

$$\mathbb{E}\big[\tilde{l}_{t_j + r_j}(i) \mid p_{t_j}, r_j\big] = \sum_{t = t_j}^{t_j + r_j} \sum_{k: i \in S_t(k)} p_{k, t_j}\, \frac{l_t(i)}{q_{i, t}} = \sum_{t = t_j}^{t_j + r_j} \frac{l_t(i)}{q_{i, t}} \sum_{k: i \in S_t(k)} p_{k, t_j} = \sum_{t = t_j}^{t_j + r_j} l_t(i), \quad (20)$$

where $q_{i, t} = \sum_{k: i \in S_t(k)} p_{k, t_j}$ is the probability that the loss of action $i$ is observed at time $t$. Therefore, for all $i \in K$,

$$\mathbb{E}\Big[\sum_j \tilde{l}_{t_j + r_j}(i)\ \Big|\ \{p_{t_j}, r_j\}_j\Big] = \sum_j \sum_{t = t_j}^{t_j + r_j} l_t(i) \le \sum_{t=1}^{T} l_t(i). \quad (21)$$

Now, the expectation of the second term on the right-hand side of (19) is

$$\mathbb{E}\Big[\sum_j \sum_i p_{i, t_j}\, \tilde{l}^2_{t_j + r_j}(i)\Big] = \mathbb{E}\Big[\sum_j \mathbb{E}\Big[\sum_i p_{i, t_j}\, \tilde{l}^2_{t_j + r_j}(i)\ \Big|\ \{p_{t_j}, r_j\}_j\Big]\Big] \le \mathbb{E}\Big[\sum_j \mathrm{mas}(G_{t_j : t_j + r_j})\, r_j^2\Big], \quad (22)$$

where $\mathrm{mas}(G_{t_j : t_j + r_j}) = \max_{t_j \le n \le t_j + r_j} \mathrm{mas}(G_n)$, and the inequality follows from the facts that $l_t(i) \le 1$ and $\sum_i p_{i, t}/q_{i, t} \le \mathrm{mas}(G_t)$ for all $i \in K$ and $t \le T$ (Alon et al. 2017, Lemma 10). To bound $\sum_j \mathrm{mas}(G_{t_j : t_j + r_j})\, r_j^2$, we write the following optimization problem:

$$\max_{\{r_j\}}\ \sum_j \mathrm{mas}(G_{t_j : t_j + r_j})\, r_j^2 \quad \text{subject to} \quad \sum_j r_j \le T, \qquad 0 \le r_j \le \frac{T^{1/3} c^{2/3}}{\mathrm{mas}^{1/3}(G_{(T)})}. \quad (23)$$

Since the objective function is submodular and the constraints are linear, the ratio of the value of the greedy solution to that of the optimal solution is at least $(1 - 1/e)$ (Nemhauser and Wolsey (1978)). Therefore the optimal value $o^*$ of the above optimization problem satisfies

$$o^* \le \frac{T^{2/3} c^{4/3} \sum_{t \le t^*} \mathrm{mas}(G_{(t)})}{(1 - 1/e)\, \mathrm{mas}^{2/3}(G_{(T)})}, \quad (24)$$
where $t^* = T^{2/3}\, \mathrm{mas}^{1/3}(G_{(T)}) / c^{2/3}$ is the maximum number of inter-event intervals. Using (19), (20), (21), (22) and (24), for all $k \in K$ we have

$$\mathbb{E}\Big[\sum_j \sum_i p_{i, t_j} \sum_{t = t_j}^{t_j + r_j} l_t(i)\Big] - \sum_{t=1}^{T} l_t(k) \le \frac{\log(K)}{\eta} + \frac{\eta}{2}\, \frac{T^{2/3} c^{4/3} \sum_{t \le t^*} \mathrm{mas}(G_{(t)})}{(1 - 1/e)\, \mathrm{mas}^{2/3}(G_{(T)})}. \quad (25)$$

Additionally, the player switches its action only if $E_t$ is true. Thus, using (25) and $c(i, j) \le c$ for all $i, j \in K$, we have

$$R_{\mathcal{A}}(l_{1:T}, C) \le \frac{\log(K)}{\eta} + \frac{\eta}{2}\, \frac{T^{2/3} c^{4/3} \sum_{t \le t^*} \mathrm{mas}(G_{(t)})}{(1 - 1/e)\, \mathrm{mas}^{2/3}(G_{(T)})} + c\, \mathbb{E}\Big[\sum_{t=2}^{T} \mathbb{1}(i_t \ne i_{t-1})\Big]. \quad (26)$$

Now we bound $\mathbb{E}\sum_{t=2}^{T} \mathbb{1}(i_t \ne i_{t-1})$. The event $E^1_t$ does not contribute to any switching cost (SC). The event $E^2_t$ can lead to at most $T^{2/3}\, \mathrm{mas}^{1/3}(G_{(T)}) / c^{2/3}$ switches. Now let $E^3_t$ cause $N_T$ switches. Then we have

$$\mathbb{E}N_T = \mathbb{E}\Big[\sum_j \mathbb{1}\big(i_{t_{j+1}} \ne i_{t_j},\ E^3_{t_j} \text{ is true}\big)\Big] = \mathbb{E}\Big[\sum_j \mathbb{E}\big[\mathbb{1}\big(i_{t_{j+1}} \ne i_{t_j},\ E^3_{t_j} \text{ is true}\big) \mid \{p_{t_j}, r_j\}_j\big]\Big] = \mathbb{E}\Big[\sum_j \sum_{k \in K \setminus \{i\}} P\big(i_{t_j} = i,\ E^3_{t_j} \text{ is true}\big)\, P\big(i_{t_{j+1}} = k \mid i_{t_j} = i,\ \{p_{t_j}, r_j\}_j\big)\Big] \le \mathbb{E}\Big[\sum_j \sum_{k \in K \setminus \{i\}} p_{i, t_j}\, p_{k, t_{j+1}}\Big] \overset{(a)}{\le} \sum_j \frac{\mathrm{mas}^{1/3}(G_{(T)})}{c^{2/3}\, t_{j+1}^{1/3}} \le \frac{\mathrm{mas}^{1/3}(G_{(T)})\, T^{2/3}}{c^{2/3}}, \quad (27)$$

where (a) follows from Lemma 1 in this section. Thus the number of switches is at most $2\, c^{-2/3}\, \mathrm{mas}^{1/3}(G_{(T)})\, T^{2/3}$ and the SC is at most $2\, c^{1/3}\, \mathrm{mas}^{1/3}(G_{(T)})\, T^{2/3}$. Part (iii) of the theorem follows by combining the results from (i) and (ii). Part (iv) follows from the fact that if $G_t$ is undirected, $\mathrm{mas}(G_t) = \alpha(G_t)$.

Lemma 1. Given that $i \in K$ is chosen at time instance $t_j$, for all $k \in K \setminus \{i\}$ we have $p_{i, t_j}\, p_{k, t_{j+1}} \le \mathrm{mas}^{1/3}(G_{(T)}) / (c^{2/3}\, t_{j+1}^{1/3})$.

Proof. Given that $i$ is chosen at time instance $t_j$, for all $k \in K \setminus \{i\}$ we have

$$p_{k, t_{j+1}} \le \frac{p_{k, t_{j+1}}}{p_{i, t_{j+1}}} = \frac{p_{k, 1} \exp\big(-\eta\, \hat{l}_{t_{j+1}}(k)\big)}{p_{i, 1} \exp\big(-\eta\, \hat{l}_{t_{j+1}}(i)\big)} \overset{(a)}{=} \frac{p_{k, 1} \exp\big(-\eta\, (\hat{l}_{t_j}(k) + \tilde{l}_{t_j + r_j}(k))\big)}{p_{i, 1} \exp\big(-\eta\, (\hat{l}_{t_j}(i) + \tilde{l}_{t_j + r_j}(i))\big)} \overset{(b)}{=} \exp\big(-\eta\, (\hat{l}_{t_j}(k) - \hat{l}_{t_j}(i) + \tilde{l}_{t_j + r_j}(k) - \tilde{l}_{t_j + r_j}(i))\big) \overset{(c)}{\le} \frac{\exp\big(-\eta\, (\varepsilon_{t_{j+1}} / \eta)\big)}{p_{i, t_j}} = \frac{\exp(-\varepsilon_{t_{j+1}})}{p_{i, t_j}}, \quad (28)$$
where (a) follows from the fact that $\hat{l}_{t_{j+1}}(k) = \hat{l}_{t_j}(k) + \tilde{l}_{t_j + r_j}(k)$; (b) follows from $p_{k, 1} = p_{i, 1} = 1/K$; and (c) follows from the fact that for all $k \in K \setminus \{i\}$, $\hat{l}_{t_j}(k) - \hat{l}_{t_j}(i) > \varepsilon_{t_{j+1}} / \eta$, as the increment in $\hat{l}_t(i)$ is bounded by $1/q_{i, t}$. Now, substituting $\varepsilon_t = \log\big(t\, c^2 / \mathrm{mas}(G_{(T)})\big)/3$ in (28), we have

$$p_{i, t_j}\, p_{k, t_{j+1}} \le \frac{\mathrm{mas}^{1/3}(G_{(T)})}{c^{2/3}\, t_{j+1}^{1/3}}. \quad (29)$$

F Proof of Theorem 6

Proof. We borrow the notation from the proof of Theorem 5. Using the fact that $\eta_t$ is decreasing in $t$, together with (19), we have

$$\sum_j \sum_i p_{i, t_j}\, \tilde{l}_{t_j + r_j}(i) - \min_{k \in K} \sum_j \tilde{l}_{t_j + r_j}(k) \le \frac{\log(K)}{\eta_t} + \sum_j \frac{\eta_{t_j}}{2} \sum_i p_{i, t_j}\, \tilde{l}^2_{t_j + r_j}(i). \quad (30)$$

Now, taking expectations on both sides and using the fact that the expectation of the minimum is smaller than the minimum of the expectations, we have

$$\mathbb{E}\Big[\sum_j \sum_i p_{i, t_j}\, \tilde{l}_{t_j + r_j}(i)\Big] - \min_{k \in K} \mathbb{E}\Big[\sum_j \tilde{l}_{t_j + r_j}(k)\Big] \overset{(a)}{\le} \frac{\log(K)}{\eta_t} + \sum_j \frac{\eta_{t_j}}{2}\, \mathbb{E}\big[\mathrm{mas}(G_{t_j : t_j + r_j})\, r_j^2\big] \overset{(b)}{\le} \frac{\log(K)}{\eta_t} + \sum_j \frac{\eta_{t_j}}{\varepsilon_{t_j}^2}\, \mathrm{mas}(G_{t_j : t_j + r_j}) \overset{(c)}{\le} \frac{\log(K)}{\eta_t} + \sum_j \frac{2 \sqrt{\log(K)}\, \mathrm{mas}(G_{(j)})}{\mathrm{mas}^{2/3}(G_{(T)})}, \quad (31)$$

where (a) follows from (22); (b) follows from the fact that, since the probability of selecting a new action is at most $\varepsilon_{t_j}$, the mean and the variance of the geometric inter-event time $r_j$ are bounded by $1/\varepsilon_{t_j}$ and $(1 - \varepsilon_{t_j})/\varepsilon_{t_j}^2$, respectively; and (c) follows from the values of $\eta_t$ and $\varepsilon_t$, together with the fact that $\mathrm{mas}(G_{(j)})/\mathrm{mas}(G_{(T)})$ is a monotonically non-increasing sequence in $j$, so that the summation is a concave function and Jensen's inequality applies. Now we bound the expectation in (31); this also gives a bound on the number of switches performed by the algorithm. We have

$$\mathbb{E}\Big[\sum_{t=2}^{T} \mathbb{1}(i_t \ne i_{t-1})\Big] \le \sum_{t=2}^{T} \varepsilon_t \le 0.5\, \mathrm{mas}^{1/3}(G_{(T)})\, T^{2/3}\, c^{-1/3}. \quad (32)$$
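Step (b) in (31) rests on a standard fact about geometric random variables: if a new action is selected with probability at most $\varepsilon$ per round, the inter-event time has mean at most $1/\varepsilon$. A seeded simulation (illustrative only, with one toy value of $\varepsilon$) confirms the mean.

```python
import random


def geometric_mean(p, n, seed=1):
    """Empirical mean of a Geometric(p) random variable: the number of
    independent trials until the first success. Toy sanity check, not
    part of the algorithm."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n):
        k = 1
        while rng.random() >= p:
            k += 1
        total += k
    return total / n
```

For example, `geometric_mean(0.2, 100000)` is close to $1/0.2 = 5$, matching the bound $\mathbb{E}[r_j] \le 1/\varepsilon_{t_j}$ used in step (b).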
References

Alon, N., Cesa-Bianchi, N., Gentile, C., Mannor, S., Mansour, Y., and Shamir, O. (2017). Nonstochastic multi-armed bandits with graph-structured feedback. SIAM Journal on Computing, 46(6).

Dekel, O., Ding, J., Koren, T., and Peres, Y. (2014). Bandits with switching costs: T^{2/3} regret. In Proceedings of the Forty-Sixth Annual ACM Symposium on Theory of Computing. ACM.

Nemhauser, G. L. and Wolsey, L. A. (1978). Best algorithms for approximating the maximum of a submodular set function. Mathematics of Operations Research, 3(3).
More informationarxiv: v1 [cs.lg] 7 Sep 2018
Analysis of Thompson Sampling for Combinatorial Multi-armed Bandit with Probabilistically Triggered Arms Alihan Hüyük Bilkent University Cem Tekin Bilkent University arxiv:809.02707v [cs.lg] 7 Sep 208
More informationA Second-order Bound with Excess Losses
A Second-order Bound with Excess Losses Pierre Gaillard 12 Gilles Stoltz 2 Tim van Erven 3 1 EDF R&D, Clamart, France 2 GREGHEC: HEC Paris CNRS, Jouy-en-Josas, France 3 Leiden University, the Netherlands
More informationBeat the Mean Bandit
Yisong Yue H. John Heinz III College, Carnegie Mellon University, Pittsburgh, PA, USA Thorsten Joachims Department of Computer Science, Cornell University, Ithaca, NY, USA yisongyue@cmu.edu tj@cs.cornell.edu
More informationOnline Learning with Predictable Sequences
JMLR: Workshop and Conference Proceedings vol (2013) 1 27 Online Learning with Predictable Sequences Alexander Rakhlin Karthik Sridharan rakhlin@wharton.upenn.edu skarthik@wharton.upenn.edu Abstract We
More informationTHE WEIGHTED MAJORITY ALGORITHM
THE WEIGHTED MAJORITY ALGORITHM Csaba Szepesvári University of Alberta CMPUT 654 E-mail: szepesva@ualberta.ca UofA, October 3, 2006 OUTLINE 1 PREDICTION WITH EXPERT ADVICE 2 HALVING: FIND THE PERFECT EXPERT!
More informationBandit Online Convex Optimization
March 31, 2015 Outline 1 OCO vs Bandit OCO 2 Gradient Estimates 3 Oblivious Adversary 4 Reshaping for Improved Rates 5 Adaptive Adversary 6 Concluding Remarks Review of (Online) Convex Optimization Set-up
More informationarxiv: v3 [stat.ml] 13 Jun 2018
Ciara Pike-Burke hipra Agrawal Csaba zepesvári 3 4 teffen Grünewälder arxiv:709.06853v3 stat.ml 3 Jun 08 Abstract We study a variant of the stochastic K-armed bandit problem, which we call bandits with
More informationMinimax Policies for Combinatorial Prediction Games
Minimax Policies for Combinatorial Prediction Games Jean-Yves Audibert Imagine, Univ. Paris Est, and Sierra, CNRS/ENS/INRIA, Paris, France audibert@imagine.enpc.fr Sébastien Bubeck Centre de Recerca Matemàtica
More informationPrediction and Playing Games
Prediction and Playing Games Vineel Pratap vineel@eng.ucsd.edu February 20, 204 Chapter 7 : Prediction, Learning and Games - Cesa Binachi & Lugosi K-Person Normal Form Games Each player k (k =,..., K)
More informationTwo optimization problems in a stochastic bandit model
Two optimization problems in a stochastic bandit model Emilie Kaufmann joint work with Olivier Cappé, Aurélien Garivier and Shivaram Kalyanakrishnan Journées MAS 204, Toulouse Outline From stochastic optimization
More informationOnline Learning with Gaussian Payoffs and Side Observations
Online Learning with Gaussian Payoffs and Side Observations Yifan Wu András György 2 Csaba Szepesvári Dept. of Computing Science University of Alberta {ywu2,szepesva}@ualberta.ca 2 Dept. of Electrical
More informationContextual Combinatorial Bandit and its Application on Diversified Online Recommendation
Contextual Combinatorial Bandit and its Application on Diversified Online Recommendation Lijing Qin Shouyuan Chen Xiaoyan Zhu Abstract Recommender systems are faced with new challenges that are beyond
More informationLearning Algorithms for Minimizing Queue Length Regret
Learning Algorithms for Minimizing Queue Length Regret Thomas Stahlbuhk Massachusetts Institute of Technology Cambridge, MA Brooke Shrader MIT Lincoln Laboratory Lexington, MA Eytan Modiano Massachusetts
More informationPCA with random noise. Van Ha Vu. Department of Mathematics Yale University
PCA with random noise Van Ha Vu Department of Mathematics Yale University An important problem that appears in various areas of applied mathematics (in particular statistics, computer science and numerical
More informationOptimization, Learning, and Games with Predictable Sequences
Optimization, Learning, and Games with Predictable Sequences Alexander Rakhlin University of Pennsylvania Karthik Sridharan University of Pennsylvania Abstract We provide several applications of Optimistic
More informationOnline Learning and Sequential Decision Making
Online Learning and Sequential Decision Making Emilie Kaufmann CNRS & CRIStAL, Inria SequeL, emilie.kaufmann@univ-lille.fr Research School, ENS Lyon, Novembre 12-13th 2018 Emilie Kaufmann Online Learning
More informationIntroduction to Multi-Armed Bandits
Introduction to Multi-Armed Bandits (preliminary and incomplete draft) Aleksandrs Slivkins Microsoft Research NYC https://www.microsoft.com/en-us/research/people/slivkins/ First draft: January 2017 This
More informationLecture 10 + additional notes
CSE533: Information Theorn Computer Science November 1, 2010 Lecturer: Anup Rao Lecture 10 + additional notes Scribe: Mohammad Moharrami 1 Constraint satisfaction problems We start by defining bivariate
More informationOn the Complexity of Best Arm Identification with Fixed Confidence
On the Complexity of Best Arm Identification with Fixed Confidence Discrete Optimization with Noise Aurélien Garivier, joint work with Emilie Kaufmann CNRS, CRIStAL) to be presented at COLT 16, New York
More informationBandit View on Continuous Stochastic Optimization
Bandit View on Continuous Stochastic Optimization Sébastien Bubeck 1 joint work with Rémi Munos 1 & Gilles Stoltz 2 & Csaba Szepesvari 3 1 INRIA Lille, SequeL team 2 CNRS/ENS/HEC 3 University of Alberta
More informationOnline Learning with Abstention
Corinna Cortes 1 Giulia DeSalvo 1 Claudio Gentile 1 2 Mehryar Mohri 3 1 Scott Yang 4 Abstract We present an extensive study of a key problem in online learning where the learner can opt to abstain from
More informationMultiple Identifications in Multi-Armed Bandits
Multiple Identifications in Multi-Armed Bandits arxiv:05.38v [cs.lg] 4 May 0 Sébastien Bubeck Department of Operations Research and Financial Engineering, Princeton University sbubeck@princeton.edu Tengyao
More informationLecture 5: Regret Bounds for Thompson Sampling
CMSC 858G: Bandits, Experts and Games 09/2/6 Lecture 5: Regret Bounds for Thompson Sampling Instructor: Alex Slivkins Scribed by: Yancy Liao Regret Bounds for Thompson Sampling For each round t, we defined
More informationOnline Convex Optimization. Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016
Online Convex Optimization Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016 The General Setting The General Setting (Cover) Given only the above, learning isn't always possible Some Natural
More informationSupplementary: Battle of Bandits
Supplementary: Battle of Bandits A Proof of Lemma Proof. We sta by noting that PX i PX i, i {U, V } Pi {U, V } PX i i {U, V } PX i i {U, V } j,j i j,j i j,j i PX i {U, V } {i, j} Q Q aia a ia j j, j,j
More informationOnline Convex Optimization
Advanced Course in Machine Learning Spring 2010 Online Convex Optimization Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz A convex repeated game is a two players game that is performed
More informationBandit Algorithms. Tor Lattimore & Csaba Szepesvári
Bandit Algorithms Tor Lattimore & Csaba Szepesvári Bandits Time 1 2 3 4 5 6 7 8 9 10 11 12 Left arm $1 $0 $1 $1 $0 Right arm $1 $0 Five rounds to go. Which arm would you play next? Overview What are bandits,
More informationPartial monitoring classification, regret bounds, and algorithms
Partial monitoring classification, regret bounds, and algorithms Gábor Bartók Department of Computing Science University of Alberta Dávid Pál Department of Computing Science University of Alberta Csaba
More informationRegret Bounds for Sleeping Experts and Bandits
Regret Bounds for Sleeping Experts and Bandits Robert D. Kleinberg Department of Computer Science Cornell University Ithaca, NY 4853 rdk@cs.cornell.edu Alexandru Niculescu-Mizil Department of Computer
More information1 Submodular functions
CS 369P: Polyhedral techniques in combinatorial optimization Instructor: Jan Vondrák Lecture date: November 16, 2010 1 Submodular functions We have already encountered submodular functions. Let s recall
More informationReward Maximization Under Uncertainty: Leveraging Side-Observations on Networks
Reward Maximization Under Uncertainty: Leveraging Side-Observations Reward Maximization Under Uncertainty: Leveraging Side-Observations on Networks Swapna Buccapatnam AT&T Labs Research, Middletown, NJ
More informationRevisiting the Exploration-Exploitation Tradeoff in Bandit Models
Revisiting the Exploration-Exploitation Tradeoff in Bandit Models joint work with Aurélien Garivier (IMT, Toulouse) and Tor Lattimore (University of Alberta) Workshop on Optimization and Decision-Making
More informationBandits with Movement Costs and Adaptive Pricing
Proceedings of Machine Learning Research vol 65:1 27, 2017 Bandits with Movement Costs and Adaptive Pricing Tomer Koren Google Roi Livni Princeton University, Computer Science Department Yishay Mansour
More informationOn the Generalization Ability of Online Strongly Convex Programming Algorithms
On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract
More informationOptimization of Submodular Functions Tutorial - lecture I
Optimization of Submodular Functions Tutorial - lecture I Jan Vondrák 1 1 IBM Almaden Research Center San Jose, CA Jan Vondrák (IBM Almaden) Submodular Optimization Tutorial 1 / 1 Lecture I: outline 1
More informationOn the Complexity of Best Arm Identification in Multi-Armed Bandit Models
On the Complexity of Best Arm Identification in Multi-Armed Bandit Models Aurélien Garivier Institut de Mathématiques de Toulouse Information Theory, Learning and Big Data Simons Institute, Berkeley, March
More informationOnline learning CMPUT 654. October 20, 2011
Online learning CMPUT 654 Gábor Bartók Dávid Pál Csaba Szepesvári István Szita October 20, 2011 Contents 1 Shooting Game 4 1.1 Exercises...................................... 6 2 Weighted Majority Algorithm
More informationUniversity of Alberta. The Role of Information in Online Learning
University of Alberta The Role of Information in Online Learning by Gábor Bartók A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of the requirements for the degree
More information