Online Learning with Feedback Graphs

Online Learning with Feedback Graphs. Claudio Gentile, INRIA and Google NY. NYC, March 6th.

Content of this lecture. Regret analysis of sequential prediction problems lying between the full-information and the bandit information regimes: motivation; nonstochastic setting (brief review of background, feedback graphs); stochastic setting (brief review of background, feedback graphs); examples (nonstochastic).

Motivation. Sequential prediction problems with partial information, where items in the action space have semantic connections that turn into observability dependencies among the associated losses/gains. (Figure: example action spaces.)

Background/1: Nonstochastic experts. K actions for the Learner. For t = 1, 2, ...:
1. Losses l_t(i) in [0, 1] are assigned deterministically by Nature to every action i = 1, ..., K (hidden to the Learner).
2. The Learner picks an action I_t (possibly using randomization) and incurs loss l_t(I_t).
3. The Learner gets feedback information: l_t(1), ..., l_t(I_t), ..., l_t(K), i.e., all losses are revealed.

No (external, pseudo) regret. Goal: given T rounds, the Learner's total loss $\sum_{t=1}^T \ell_t(I_t)$ must be close to that of the single best action in hindsight. Regret of the Learner over T rounds:
$R_T = \mathbb{E}\Bigl[\sum_{t=1}^T \ell_t(I_t)\Bigr] - \min_{i=1,\dots,K} \sum_{t=1}^T \ell_t(i)$.
Want: R_T = o(T) as T grows large ("no regret"). Notice: no stochastic assumptions on the losses, but assume for simplicity that Nature is deterministic and oblivious. Lower bound: $R_T \ge (1 - o(1))\sqrt{\tfrac{T \ln K}{2}}$ as T, K grow (l_t(i) random coin flips + a simple probabilistic argument) [CB+97].
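
As a quick illustration of this quantity, here is a minimal sketch (not from the lecture) computing the realized regret of a played action sequence; the expectation in the definition above is over the Learner's internal randomization, which a single run only samples.

```python
import numpy as np

def regret(losses, actions):
    """Realized regret against the best single action in hindsight.
    losses has shape (T, K); actions is a length-T array of played indices."""
    losses = np.asarray(losses)
    actions = np.asarray(actions)
    learner_loss = losses[np.arange(len(actions)), actions].sum()
    best_action_loss = losses.sum(axis=0).min()
    return learner_loss - best_action_loss

# toy check: always playing action 0 against i.i.d. uniform losses
rng = np.random.default_rng(0)
L = rng.random((100, 5))
print(regret(L, np.zeros(100, dtype=int)))
```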

Exponentially-weighted algorithm [CB+97]. At round t pick action I_t = i with probability proportional to $\exp\bigl(-\eta \sum_{s=1}^{t-1} \ell_s(i)\bigr)$, i.e., exponential in the total loss of action i so far. With $\eta = \sqrt{\ln K / (8T)}$: $R_T \le \sqrt{\tfrac{T \ln K}{2}}$. With the dynamic choice $\eta_t = \sqrt{\ln K / t}$: R_T loses constant factors.
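
A minimal sketch of this forecaster (my own illustration, not code from [CB+97]); the loss matrix, horizon, and learning rate below are arbitrary toy choices.

```python
import numpy as np

def exponentially_weighted(losses, eta, seed=0):
    """Exponentially weighted forecaster on a full-information loss matrix.

    losses: array of shape (T, K) with entries in [0, 1]
    eta:    learning rate, e.g. sqrt(ln K / (8 T))
    Returns the played actions and the total loss incurred.
    """
    rng = np.random.default_rng(seed)
    T, K = losses.shape
    cum_loss = np.zeros(K)          # cumulative loss of each action
    total, actions = 0.0, []
    for t in range(T):
        # weights proportional to exp(-eta * cumulative loss); shift for stability
        w = np.exp(-eta * (cum_loss - cum_loss.min()))
        p = w / w.sum()
        i = rng.choice(K, p=p)
        actions.append(i)
        total += losses[t, i]
        cum_loss += losses[t]       # full information: all losses observed
    return actions, total

# toy run on random losses
T, K = 1000, 10
losses = np.random.default_rng(1).random((T, K))
eta = np.sqrt(np.log(K) / (8 * T))
_, alg_loss = exponentially_weighted(losses, eta)
best = losses.sum(axis=0).min()
print("regret ~", alg_loss - best)
```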

Nonstochastic bandit problem/1. K actions for the Learner. For t = 1, 2, ...:
1. Losses l_t(i) in [0, 1] are assigned deterministically by Nature to every action i = 1, ..., K (hidden to the Learner).
2. The Learner picks an action I_t (possibly using randomization) and incurs loss l_t(I_t).
3. The Learner gets feedback information: l_t(I_t) only.

Nonstochastic bandit problem/2. Goal: same as before. Regret of the Learner over T rounds:
$R_T = \mathbb{E}\Bigl[\sum_{t=1}^T \ell_t(I_t)\Bigr] - \min_{i=1,\dots,K} \sum_{t=1}^T \ell_t(i)$.
Want: R_T = o(T) as T grows large ("no regret"). Tradeoff: exploration vs. exploitation.

Nonstochastic bandit problem/3: the Exp3 algorithm/1 [Auer+02]. At round t pick action I_t = i with probability proportional to $\exp\bigl(-\eta \sum_{s=1}^{t-1} \hat\ell_s(i)\bigr)$, i = 1, ..., K, where
$\hat\ell_s(i) = \dfrac{\ell_s(i)}{\Pr_s(\ell_s(i)\text{ is observed in round } s)}$ if $\ell_s(i)$ is observed, and 0 otherwise.
Only one nonzero component in $\hat\ell_t$. This is the exponentially-weighted algorithm run with (importance sampling) loss estimates $\hat\ell_t(i)$ in place of $\ell_t(i)$.

Nonstochastic bandit problem/3: the Exp3 algorithm/2 [Auer+02]. Properties of the loss estimates: $\mathbb{E}_t[\hat\ell_t(i)] = \ell_t(i)$ (unbiasedness) and $\mathbb{E}_t[\hat\ell_t(i)^2] \le \dfrac{1}{\Pr_t(\ell_t(i)\text{ is observed in round } t)}$ (variance control).
Regret analysis: set $p_t(i) = \Pr_t(I_t = i)$. Approximating exp(x) up to second order, summing over rounds t and over-approximating:
$\sum_{t=1}^T \sum_{i=1}^K p_t(i)\hat\ell_t(i) - \min_{i=1,\dots,K}\sum_{t=1}^T \hat\ell_t(i) \le \frac{\ln K}{\eta} + \frac{\eta}{2}\sum_{t=1}^T\sum_{i=1}^K p_t(i)\hat\ell_t(i)^2$.
Taking expectations (tower rule) and optimizing over $\eta$:
$R_T \le \frac{\ln K}{\eta} + \frac{\eta}{2} T K = \sqrt{2TK\ln K}$.
Lower bound $\Omega(\sqrt{TK})$ (improved upper bound by the INF algorithm [AB09]).
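
The following is a minimal Exp3 sketch along the lines of the two slides above (my own illustration, not code from [Auer+02]); `loss_fn`, the toy adversary, and the choice of eta in the demo are assumptions.

```python
import numpy as np

def exp3(loss_fn, T, K, eta, seed=0):
    """Exp3: exponentially weighted forecaster fed with importance-weighted
    loss estimates; only the loss of the played action is observed.

    loss_fn(t, i) returns the loss of action i at round t (queried only for
    the action actually played)."""
    rng = np.random.default_rng(seed)
    cum_est = np.zeros(K)                     # cumulative estimated losses
    total = 0.0
    for t in range(T):
        w = np.exp(-eta * (cum_est - cum_est.min()))
        p = w / w.sum()
        i = rng.choice(K, p=p)
        loss = loss_fn(t, i)
        total += loss
        # importance-weighted estimate: unbiased since E[loss/p[i] * 1{played}] = loss
        cum_est[i] += loss / p[i]
    return total

# toy oblivious adversary: a fixed loss vector
K, T = 10, 5000
base = np.linspace(0.2, 0.8, K)
eta = np.sqrt(2 * np.log(K) / (T * K))
total = exp3(lambda t, i: base[i], T, K, eta)
print("regret ~", total - T * base.min())
```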

Contrasting the expert and nonstochastic bandit problems.
Experts: the Learner observes all losses l_t(1), ..., l_t(K), so $\Pr_t(\ell_t(i)\text{ is observed in round } t) = 1$. Regret $R_T = O(\sqrt{T \ln K})$.
Nonstochastic bandits: the Learner only observes the loss l_t(I_t) of the chosen action, so $\Pr_t(\ell_t(i)\text{ is observed in round } t) = \Pr_t(I_t = i)$. Note: Exp3 collapses to the exponentially-weighted algorithm. Regret $R_T = O(\sqrt{TK})$.
Exponential gap, ln K vs. K: relevant when actions are many.

Nonstochastic bandits with feedback graphs/1 [MS11, A+13, K+14]. K actions for the Learner. For t = 1, 2, ...:
1. Losses l_t(i) in [0, 1] are assigned deterministically by Nature to every action i = 1, ..., K (hidden to the Learner).
2. A feedback graph $G_t = (V, E_t)$, $V = \{1, \dots, K\}$, is generated by an exogenous process (hidden to the Learner); all self-loops are included.
3. The Learner picks an action I_t (possibly using randomization) and incurs loss l_t(I_t).
4. The Learner gets feedback information: $\{\ell_t(j) : (I_t, j) \in E_t\}$, plus the graph $G_t$ itself.

Nonstochastic bandits with feedback graphs/2: algorithm Exp3-IX [Ne15]. At round t pick action I_t = i with probability proportional to $\exp\bigl(-\eta \sum_{s=1}^{t-1} \hat\ell_s(i)\bigr)$, i = 1, ..., K, where
$\hat\ell_s(i) = \dfrac{\ell_s(i)}{\gamma_s + \Pr_s(\ell_s(i)\text{ is observed in round } s)}$ if $\ell_s(i)$ is observed, and 0 otherwise.
Note: the probability of observing the loss of an action is at least the probability of playing it. This is the exponentially-weighted algorithm run with $\gamma_t$-biased (importance sampling) loss estimates $\hat\ell_t(i)$ in place of $\ell_t(i)$; the bias is controlled by $\gamma_t = 1/\sqrt{t}$.
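
A sketch of Exp3-IX under the feedback-graph protocol above (my own illustration, not the authors' code); the adjacency-matrix representation, `loss_fn`, and the numeric choices in the demo are assumptions.

```python
import numpy as np

def exp3_ix(loss_fn, graphs, eta, seed=0):
    """Exp3-IX sketch for bandits with feedback graphs.

    loss_fn(t, j) returns l_t(j); graphs[t] is a boolean KxK adjacency matrix
    (self-loops included) where graphs[t][i, j] = True means that playing i
    reveals the loss of j."""
    rng = np.random.default_rng(seed)
    T = len(graphs)
    K = graphs[0].shape[0]
    cum_est = np.zeros(K)
    total = 0.0
    for t in range(T):
        gamma = 1.0 / np.sqrt(t + 1)               # implicit-exploration bias
        w = np.exp(-eta * (cum_est - cum_est.min()))
        p = w / w.sum()
        i = rng.choice(K, p=p)
        total += loss_fn(t, i)
        G = graphs[t]
        Q = G.astype(float).T @ p                  # Q[j] = prob. the loss of j is observed
        for j in np.flatnonzero(G[i]):             # out-neighbors of the played arm
            cum_est[j] += loss_fn(t, j) / (gamma + Q[j])
    return total

# toy run: fixed losses, Erdos-Renyi feedback graphs with self-loops
rng = np.random.default_rng(1)
K, T = 10, 2000
base = np.linspace(0.2, 0.8, K)
graphs = []
for _ in range(T):
    A = rng.random((K, K)) < 0.3
    np.fill_diagonal(A, True)
    graphs.append(A)
print("regret ~", exp3_ix(lambda t, j: base[j], graphs, eta=0.05) - T * base.min())
```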

Nonstochastic bandits with feedback graphs/3 [A+13, K+14]. Independence number $\alpha(G_t)$: disregard edge orientation. It ranges between the two extremes $1 \le \alpha(G_t) \le K$, where 1 is the clique (expert problem) and K is the edgeless graph (bandit problem).
Regret analysis: if $G_t = G$ for all t, $R_T = \tilde O\bigl(\sqrt{T\,\alpha(G)}\bigr)$ (also a lower bound, up to logs). In general, $R_T = O\bigl(\ln(TK)\sqrt{\sum_{t=1}^T \alpha(G_t)}\bigr)$.

Nonstochastic bandits with feedback graphs/4. Notation: $p_t(i) = \Pr_t(I_t = i)$ (probability of playing) and $Q_t(i) = \Pr_t(\ell_t(i)\text{ is observed in round } t)$ (probability of observing). Loss estimates:
$\hat\ell_t(i) = \dfrac{\ell_t(i)\,\mathbf{1}\{\ell_t(i)\text{ is observed in round } t\}}{\gamma_t + Q_t(i)}$.
Properties: $\mathbb{E}_t[\hat\ell_t(i)] \le \ell_t(i)$ (unbiased up to the small $\gamma_t$ bias) and $\mathbb{E}_t[\hat\ell_t(i)^2] \le 1/Q_t(i)$ (variance control).
Some details of the regret analysis. From
$\sum_{t=1}^T\sum_{i=1}^K p_t(i)\hat\ell_t(i) - \min_{i=1,\dots,K}\sum_{t=1}^T\hat\ell_t(i) \le \frac{\ln K}{\eta} + \frac{\eta}{2}\sum_{t=1}^T\sum_{i=1}^K p_t(i)\hat\ell_t(i)^2$,
taking expectations:
$R_T \lesssim \frac{\ln K}{\eta} + \frac{\eta}{2}\,\mathbb{E}\Bigl[\sum_{t=1}^T\underbrace{\sum_{i=1}^K \frac{p_t(i)}{Q_t(i)}}_{\text{variance}}\Bigr]$.

Nonstochastic bandits with feedback graphs/5. Relating the variance term to $\alpha(G)$. Suppose G is undirected (with self-loops) and consider
$\Sigma = \sum_{i=1}^K \frac{p(i)}{Q^G(i)} = \sum_{i=1}^K \frac{p(i)}{\sum_{j : j \sim_G i} p(j)}$,
building an independent set $S \subseteq V$ for $G = (V, E)$ greedily.
Init: $S = \emptyset$; $G_1 = G$, $V_1 = V$. At step k: pick $i_k = \arg\min_{i \in V_k} Q^{G_k}(i)$, augment $S \leftarrow S \cup \{i_k\}$, and remove $i_k$ and all its neighbors (with their incident edges) from $G_k$. The removed terms satisfy
$\sum_{j : j \sim_{G_k} i_k} \frac{p(j)}{Q^{G_k}(j)} \le \sum_{j : j \sim_{G_k} i_k} \frac{p(j)}{Q^{G_k}(i_k)} = \frac{Q^{G_k}(i_k)}{Q^{G_k}(i_k)} = 1$,
so $\Sigma$ decreases by at most 1. This yields a smaller graph $G_{k+1} = (V_{k+1}, E_{k+1})$; iterate.

Nonstochastic bandits with feedback graphs/6. Hence, at each iteration $\Sigma$ decreases by at most 1 while |S| increases by 1, so the potential $|S| + \Sigma$ never decreases: it has its minimal value at the beginning ($S = \emptyset$) and reaches its maximal value when G becomes empty ($\Sigma = 0$). Since S is an independent set by construction, $\Sigma \le |S| \le \alpha(G)$. When G is directed the analysis gets more complicated (it needs a lower bound on $p_t(i)$) and adds a log T factor to the bound. We have obtained:
$R_T \le \frac{\ln K}{\eta} + \frac{\eta}{2}\sum_{t=1}^T \alpha(G_t) = O\Bigl(\sqrt{(\ln K)\sum_{t=1}^T \alpha(G_t)}\Bigr)$.
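
The greedy construction of the previous slides is easy to run numerically. The sketch below (my own illustration) peels the graph as described, then prints $\Sigma$ and |S| on a random undirected graph with self-loops, so one can check $\Sigma \le |S|$ on examples.

```python
import numpy as np

def peel(adj, p):
    """Greedy construction from the slides (undirected G with self-loops):
    returns Sigma = sum_i p(i)/Q(i) and the independent set S built by
    repeatedly picking the vertex with the smallest Q in the surviving
    subgraph and deleting its whole closed neighborhood."""
    Q = adj.astype(float) @ p              # Q(i) = p-mass of i's closed neighborhood
    sigma = float(np.sum(p / Q))
    alive, S = set(range(len(p))), []
    while alive:
        Qsub = {i: sum(p[j] for j in alive if adj[i, j]) for i in alive}
        i_star = min(alive, key=Qsub.get)  # vertex with the smallest surviving Q
        S.append(i_star)
        alive -= {j for j in alive if adj[i_star, j]}   # removes i_star too (self-loop)
    return sigma, S

# random undirected graph with self-loops and a random distribution p
rng = np.random.default_rng(0)
K = 8
A = rng.random((K, K)) < 0.3
A = A | A.T
np.fill_diagonal(A, True)
p = rng.dirichlet(np.ones(K))
sigma, S = peel(A, p)
print(f"Sigma = {sigma:.2f} <= |S| = {len(S)}")
```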

Stochastic bandit problem/1. K actions for the Learner. When picking action i at time t, the Learner receives as reward an independent realization of a random variable $X_i$ with $\mathbb{E}[X_i] = \mu_i$ and $X_i \in [0, 1]$. The $\mu_i$'s are hidden to the Learner. For t = 1, 2, ...:
1. The Learner picks an action I_t (possibly using randomization) and gathers reward $X_{I_t,t}$.
2. The Learner gets feedback information: $X_{I_t,t}$.
Goal: optimize the (pseudo)regret
$R_T = \max_{i=1,\dots,K}\mathbb{E}\Bigl[\sum_{t=1}^T X_{i,t}\Bigr] - \mathbb{E}\Bigl[\sum_{t=1}^T X_{I_t,t}\Bigr] = \mu^* T - \mathbb{E}\Bigl[\sum_{t=1}^T X_{I_t,t}\Bigr] = \sum_{i=1}^K \Delta_i\,\mathbb{E}[T_i(T)]$,
where $\mu^* = \max_i \mu_i$, $\Delta_i = \mu^* - \mu_i$, and $T_i(T)$ is the number of times action i is played in T rounds.

Stochastic bandit problem/2: the UCB algorithm [ACF02]. At round t pick action
$I_t = \arg\max_{i=1,\dots,K}\Bigl(\bar X_{i,t-1} + \sqrt{\tfrac{\ln t}{T_{i,t-1}}}\Bigr)$,
where $T_{i,t-1}$ is the number of times the reward of action i has been observed so far and $\bar X_{i,t-1} = \frac{1}{T_{i,t-1}}\sum_{s \le t-1 : I_s = i} X_{i,s}$ is the average reward of action i observed so far. (Pseudo)regret:
$R_T = O\Bigl(\bigl(\sum_{i=1}^K \tfrac{1}{\Delta_i}\bigr)\ln T + K\Bigr)$.
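
A minimal UCB sketch matching the index above (my own illustration; the exploration constant varies across sources, and the Bernoulli means in the demo are arbitrary).

```python
import numpy as np

def ucb(reward_fn, K, T):
    """UCB sketch: play each arm once, then pick the arm maximizing
    mean + sqrt(ln t / pulls). reward_fn(i) draws an independent [0,1] reward."""
    pulls = np.zeros(K)
    sums = np.zeros(K)
    for t in range(1, T + 1):
        if t <= K:
            i = t - 1                           # initialization: try every arm once
        else:
            i = int(np.argmax(sums / pulls + np.sqrt(np.log(t) / pulls)))
        r = reward_fn(i)
        pulls[i] += 1
        sums[i] += r
    return pulls

# toy Bernoulli instance (means assumed for illustration)
mu = np.array([0.3, 0.5, 0.45, 0.7])
rng = np.random.default_rng(1)
pulls = ucb(lambda i: float(rng.random() < mu[i]), K=len(mu), T=20000)
print("pulls per arm:", pulls)   # the best arm should dominate
```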

Stochastic bandits with feedback graphs/1. K actions for the Learner, arranged into a fixed graph G = (V, E). When picking action i at time t, the Learner receives as reward an independent realization of a random variable $X_i$ with $\mathbb{E}[X_i] = \mu_i$, but also observes the rewards of nearby actions in G. The $\mu_i$'s are hidden to the Learner. For t = 1, 2, ...:
1. The Learner picks an action I_t (possibly using randomization) and gathers reward $X_{I_t,t}$.
2. The Learner gets feedback information: $\{X_{j,t} : (I_t, j) \in E\}$.
Goal: optimize the (pseudo)regret
$R_T = \max_{i=1,\dots,K}\mathbb{E}\Bigl[\sum_{t=1}^T X_{i,t}\Bigr] - \mathbb{E}\Bigl[\sum_{t=1}^T X_{I_t,t}\Bigr] = \mu^* T - \mathbb{E}\Bigl[\sum_{t=1}^T X_{I_t,t}\Bigr] = \sum_{i=1}^K \Delta_i\,\mathbb{E}[T_i(T)]$, as before.

Stochastic bandits with feedback graphs/2: UCB-N [Ca+12]. At round t pick action
$I_t = \arg\max_{i=1,\dots,K}\Bigl(\bar X_{i,t-1} + \sqrt{\tfrac{\ln t}{O_{i,t-1}}}\Bigr)$,
where $O_{i,t-1}$ is the number of times the reward of action i has been observed so far and $\bar X_{i,t-1} = \frac{1}{O_{i,t-1}}\sum_{s \le t-1 : I_s \sim_G i} X_{i,s}$ is the average reward of action i observed so far.
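
A sketch of the UCB-N idea (my own illustration, not the code of [Ca+12]): identical to UCB except that every revealed neighbor updates its own observation count and running mean. The path graph and the Bernoulli means in the demo are arbitrary.

```python
import numpy as np

def ucb_n(reward_fn, adj, T):
    """UCB-N sketch: playing arm i also reveals independent reward draws for
    every neighbor j of i in the fixed graph adj (boolean KxK, self-loops
    included), so counts O_i and means use all observations, not only pulls."""
    K = adj.shape[0]
    obs = np.zeros(K)          # O_i: number of observed rewards of arm i
    sums = np.zeros(K)
    for t in range(1, T + 1):
        if np.any(obs == 0):
            i = int(np.argmin(obs))                 # ensure every arm is observed once
        else:
            i = int(np.argmax(sums / obs + np.sqrt(np.log(t) / obs)))
        for j in np.flatnonzero(adj[i]):            # all neighbors are revealed
            sums[j] += reward_fn(j)
            obs[j] += 1
    return obs

# toy instance: Bernoulli arms on a path graph (assumed means for illustration)
mu = np.array([0.3, 0.5, 0.45, 0.7, 0.6])
K = len(mu)
A = np.eye(K, dtype=bool) | np.eye(K, k=1, dtype=bool) | np.eye(K, k=-1, dtype=bool)
rng = np.random.default_rng(2)
print("observations per arm:", ucb_n(lambda j: float(rng.random() < mu[j]), A, T=10000))
```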

Stochastic bandits with feedback graphs/3. Clique covering number $\chi(G)$ (assume G is undirected): it satisfies $1 \le \alpha(G) \le \chi(G) \le K$, with the two extremes given by the clique (expert problem) and the edgeless graph (bandit problem). (Figure: example graph with a clique cover of size c(G) = 4.)
Regret analysis: given any partition $\mathcal{C} = \{C_1, C_2, \dots, C_{|\mathcal{C}|}\}$ of V into cliques,
$R_T = O\Bigl(\sum_{C \in \mathcal{C}} \frac{\max_{i \in C}\Delta_i}{\min_{i \in C}\Delta_i^2}\,\ln T + K\Bigr)$.
The sum is over $\chi(G)$ regret terms (this can be improved to $\alpha(G)$); the term K can be replaced by $\chi(G)$ by a modified algorithm. No tight lower bounds available.

Simple examples/1: Auctions (nonstochastic). Second-price auction with reserve, seller side: the highest bid is revealed to the seller (e.g., AppNexus), and the auctioneer is a third party. Revenue = 1 − loss. Actions are reserve prices, and each price reveals the revenue of itself plus the revenue of any higher price: after the seller plays reserve price I_t, both the seller's revenue and the highest bid are revealed, so the seller is in a position to observe the revenues of all prices j ≥ I_t. (Figure: feedback graph G_t over prices, with bids b_1, b_2 and the played price I_t.) Disregarding edge orientation the graph is complete, hence $\alpha(G_t) = 1$ and $R_T = O\bigl(\ln(TK)\sqrt{T}\bigr)$: an expert problem up to logs [CB+17].
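
To make the observability structure concrete, here is a small simulation sketch of this setting (my own reading of the slide, not code from [CB+17]); the price grid and the bids are arbitrary, and b1 >= b2 are the simulator's ground truth, not something the seller sees in advance.

```python
import numpy as np

def revealed_revenues(prices, i_t, b1, b2):
    """Observability in the reserve-price example: prices is sorted
    increasingly, b1 >= b2 are the two highest bids.  After playing reserve
    prices[i_t], the seller sees its own revenue and b1, which is enough to
    reconstruct the revenue of every price j >= i_t.  Returns {index: revenue}."""
    revealed = {}
    for j in range(i_t, len(prices)):
        r = prices[j]
        revealed[j] = 0.0 if r > b1 else max(r, b2)   # second-price with reserve r
    return revealed

prices = np.round(np.linspace(0.1, 1.0, 10), 2)
print(revealed_revenues(prices, i_t=3, b1=0.8, b2=0.45))
```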

Simple examples/2: Contextual bandits (nonstochastic) [Auer+02]. K predictors $f_i : \{1, \dots, T\} \to \{1, \dots, N\}$, i = 1, ..., K, each one using the same N << K actions. The Learner's action space is the set of K predictors. For t = 1, 2, ...:
1. Losses l_t(j) in [0, 1] are assigned deterministically by Nature to every action j = 1, ..., N (hidden to the Learner).
2. The Learner observes $f_1(t), f_2(t), \dots, f_K(t)$.
3. The Learner picks a predictor $f_{I_t}$ (possibly using randomization) and incurs loss $\ell_t(f_{I_t}(t))$.
4. The Learner gets feedback information: $\ell_t(f_{I_t}(t))$.
The feedback graph G_t on the K predictors is made up of the N cliques $\{i : f_i(t) = 1\}, \{i : f_i(t) = 2\}, \dots, \{i : f_i(t) = N\}$: two predictors recommending the same action observe each other's loss. Independence number: $\alpha(G_t) \le N$ for all t. (Figure: example with K = 8, N = 3.)
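
A small sketch (my own illustration) building the feedback graph G_t from the predictors' recommendations and brute-forcing its independence number, to check alpha(G_t) <= N on a toy instance.

```python
import numpy as np
from itertools import combinations

def contextual_feedback_graph(predictions):
    """Feedback graph in the contextual-bandits reduction on the slide:
    predictions[i] = f_i(t).  Two predictors are connected iff they recommend
    the same action, so G_t is a union of at most N cliques (self-loops
    included)."""
    predictions = np.asarray(predictions)
    return predictions[:, None] == predictions[None, :]

def independence_number(adj):
    """Brute-force alpha(G) for a small graph (exponential; illustration only)."""
    K = adj.shape[0]
    for size in range(K, 0, -1):
        for S in combinations(range(K), size):
            if all(not adj[a, b] for a, b in combinations(S, 2)):
                return size
    return 0

preds = np.random.default_rng(0).integers(0, 3, size=8)   # K = 8 predictors, N = 3 actions
G = contextual_feedback_graph(preds)
print("alpha(G_t) =", independence_number(G), "<= N = 3")
```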

References
[CB+97] N. Cesa-Bianchi, Y. Freund, D. Haussler, D. Helmbold, R. Schapire, M. Warmuth. How to use expert advice. Journal of the ACM, 44(3), 1997.
[AB09] J. Audibert and S. Bubeck. Regret bounds and minimax policies under partial monitoring. Journal of Machine Learning Research, 11, 2010.
[Auer+02] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. Schapire. The nonstochastic multi-armed bandit problem. SIAM Journal on Computing, 32(1), pp. 48-77, 2002.
[ACF02] P. Auer, N. Cesa-Bianchi, and P. Fischer. Finite-time analysis of the multi-armed bandit problem. Machine Learning Journal, 47(2-3), 2002.
[MS11] S. Mannor, O. Shamir. From bandits to experts: On the value of side-observations. NIPS 2011.
[Ca+12] S. Caron, B. Kveton, M. Lelarge, and S. Bhagat. Leveraging side observations in stochastic bandits. UAI 2012.
[A+13] N. Alon, N. Cesa-Bianchi, C. Gentile, Y. Mansour. From bandits to experts: A tale of domination and independence. NIPS 2013.
[K+14] T. Kocak, G. Neu, M. Valko, R. Munos. Efficient learning by implicit exploration in bandit problems with side observations. NIPS 2014.
[Ne15] G. Neu. Explore no more: Improved high-probability regret bounds for non-stochastic bandits. NIPS 2015.
[CB+17] N. Cesa-Bianchi, P. Gaillard, C. Gentile, S. Gerchinovitz. Algorithmic chaining and the role of partial feedback in online nonparametric learning. COLT 2017.
