New Algorithms for Contextual Bandits


Lev Reyzin, Georgia Institute of Technology (work done at Yahoo!)

Papers covered in this talk:
- A. Beygelzimer, J. Langford, L. Li, L. Reyzin, R.E. Schapire. Contextual Bandit Algorithms with Supervised Learning Guarantees (AISTATS 2011)
- M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, T. Zhang. Efficient Optimal Learning for Contextual Bandits (UAI 2011)
- S. Kale, L. Reyzin, R.E. Schapire. Non-Stochastic Bandit Slate Problems (NIPS 2010)

Serving Content to Users (figure): a user arrives with a query, IP address, browser properties, etc.; the system returns a result (i.e., an ad or a news story) and observes whether the user clicks or not. This interaction repeats over many users.

Outline:
- The setting and some background
- Show ideas that fail
- Give a high-probability optimal algorithm
- Dealing with VC sets
- An efficient algorithm
- Slates

Multiarmed Bandits [Robbins 52] (figure): K slot-machine arms are played over T rounds. Each round the learner pulls one arm and sees only that arm's reward (e.g. $0.50, $1, $0, ...). Over T rounds the arms accumulate total rewards such as $0.5T, $0.2T, $0.33T, ..., $0.1T; if the algorithm earns $0.4T while the best arm would have earned $0.5T, its regret is $0.1T.

Contextual Bandits [Auer-CesaBianchi-Freund-Schapire 02] (figure): in each round t a context x_t arrives, N experts/policies/functions each recommend one of the K arms (think of N >> K), the learner pulls an arm, and only that arm's reward is seen. Over T rounds the policies accumulate total rewards such as $0.2T, $0.12T, $0.22T, ..., $0.17T; if the algorithm earns $0.2T while the best policy would have earned $0.22T, its regret is $0.02T.

The rewards can come i.i.d. from a distribution or be arbitrary (stochastic / adversarial), and the experts can be present or not (contextual / non-contextual).

The Setting
- T rounds, K possible actions, N policies π in Π (mapping contexts to actions)
- for t = 1 to T:
  - the world commits to rewards r(t) = r_1(t), r_2(t), ..., r_K(t) (adversarial or i.i.d.)
  - the world provides context x_t
  - the learner's policies recommend π_1(x_t), π_2(x_t), ..., π_N(x_t)
  - the learner chooses action j_t
  - the learner receives reward r_{j_t}(t)
- we want to compete with following the best policy in hindsight (a minimal simulation of this protocol is sketched below)
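To make the protocol concrete, here is a minimal Python simulation loop. It is an illustrative sketch, not code from the papers: env, policies, and learner are hypothetical interfaces, with env(t) returning the round's context and its hidden reward vector.

import random

def run_contextual_bandit(env, policies, learner, T):
    """Generic contextual bandit protocol: T rounds, K actions, N policies.
    The learner sees the policies' recommendations, picks one action,
    and observes only that action's reward."""
    total_reward = 0.0
    for t in range(T):
        context, rewards = env(t)                  # reward vector stays hidden
        advice = [pi(context) for pi in policies]  # pi_1(x_t), ..., pi_N(x_t)
        action = learner.choose(context, advice)   # learner picks j_t
        reward = rewards[action]                   # only r_{j_t}(t) is revealed
        learner.update(action, reward, advice)
        total_reward += reward
    return total_reward

class UniformLearner:
    """Baseline learner: ignores the advice and plays uniformly at random."""
    def __init__(self, K):
        self.K = K
    def choose(self, context, advice):
        return random.randrange(self.K)
    def update(self, action, reward, advice):
        pass

Any learner that exposes the same choose/update interface can be plugged in place of the uniform baseline.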

Regret
- reward of algorithm A: G_A(T) = Σ_{t=1}^{T} r_{j_t}(t)
- expected reward of policy i: G_i(T) = Σ_{t=1}^{T} r_{π_i(x_t)}(t)
- algorithm A's regret: max_i G_i(T) - G_A(T)

Regret
- algorithm A's regret: max_i G_i(T) - G_A(T)
- bound on expected regret: max_i G_i(T) - E[G_A(T)] < ε
- high-probability bound: P[ max_i G_i(T) - G_A(T) > ε ] ≤ δ

- Harder than supervised learning: in the bandit setting we do not know the rewards of actions not taken.
- Many applications: ad auctions, medicine, finance, ...
- Exploration/exploitation:
  - can exploit an expert/arm you've learned to be good
  - can explore an expert/arm you're not sure about

Some Barriers
Ω((KT)^{1/2}) (non-contextual) and ~Ω((TK ln N)^{1/2}) (contextual) are known lower bounds [Auer et al. 02] on regret, even in the stochastic case. Any algorithm achieving regret Õ((KT polylog N)^{1/2}) is said to be optimal. ε-greedy-style algorithms that first explore (act randomly) and then exploit (follow the best policy so far) cannot be optimal: any optimal algorithm must be adaptive.

Two Types of Approaches
UCB [Auer 02]: at every time step, 1) pull the arm with the highest upper confidence bound, 2) update the confidence bound of the arm pulled.
EXP3 / Exponential Weights [Littlestone-Warmuth 94] [Auer et al. 02]: at every time step, 1) sample an arm from the distribution defined by the weights (mixed with the uniform distribution), 2) update the weights exponentially. (Both are sketched below.)
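For concreteness, here are minimal Python sketches of the two update rules in their non-contextual form. They are illustrative only; constants, tie-breaking, and tuning are glossed over.

import math, random

class UCB1:
    """Stochastic setting: pull the arm with the highest upper confidence bound."""
    def __init__(self, K):
        self.counts = [0] * K
        self.means = [0.0] * K
        self.t = 0

    def choose(self):
        self.t += 1
        for a in range(len(self.counts)):          # play each arm once first
            if self.counts[a] == 0:
                return a
        ucb = [self.means[a] + math.sqrt(2.0 * math.log(self.t) / self.counts[a])
               for a in range(len(self.counts))]
        return max(range(len(ucb)), key=lambda a: ucb[a])

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]

class EXP3:
    """Adversarial setting: exponential weights, uniform mixing, and
    importance-weighted reward estimates (call update right after choose)."""
    def __init__(self, K, gamma=0.1):
        self.w = [1.0] * K
        self.gamma = gamma

    def probs(self):
        total = sum(self.w)
        K = len(self.w)
        return [(1.0 - self.gamma) * wi / total + self.gamma / K for wi in self.w]

    def choose(self):
        p = self.probs()
        return random.choices(range(len(p)), weights=p)[0]

    def update(self, arm, reward):
        p = self.probs()
        est = reward / p[arm]                      # unbiased estimate of the arm's reward
        self.w[arm] *= math.exp(self.gamma * est / len(self.w))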

UCB vs. EXP3: A Comparison
UCB [Auer 02]
- Pros: optimal for the stochastic setting; succeeds with high probability.
- Cons: does not work in the adversarial setting; is not optimal in the contextual setting.
EXP3 & friends [Auer-CesaBianchi-Freund-Schapire 02]
- Pros: optimal for both the adversarial and stochastic settings; can be made to work in the contextual setting.
- Cons: does not succeed with high probability in the contextual setting (only in expectation).

Algorithm                          Regret                        High prob?   Context?
Exp4 [ACFS 02]                     Õ((KT ln N)^{1/2})            No           Yes
ε-greedy, epoch-greedy [LZ 07]     Õ((K ln N)^{1/3} T^{2/3})     Yes          Yes
Exp3.P [ACFS 02], UCB [Auer 00]    Õ((KT)^{1/2})                 Yes          No
Exp4.P [BLLRS 10] (new)            Õ((KT ln(N/δ))^{1/2})         Yes          Yes

Lower bound: Ω((KT)^{1/2}) [ACFS 02]

EXP4.P [Beygelzimer-Langford-Li-R-Schapire 11]
Main Theorem: For any δ > 0, with probability at least 1 - δ, EXP4.P has regret at most O((KT ln(N/δ))^{1/2}) in the adversarial contextual bandit setting.
EXP4.P combines the advantages of Exponential Weights and UCB:
- optimal for both the stochastic and adversarial settings
- works for the contextual case (and also the non-contextual case)
- a high-probability result

Outline: next, ideas that fail.

Some Failed Approaches
- Bad idea 1: maintain a set of plausible hypotheses and randomize uniformly over their predicted actions.
  - The adversary has two actions, one always paying off 1 and the other 0. The hypotheses generally agree on the correct action, except that a different one defects each round. This incurs regret of ~T/2.
- Bad idea 2: maintain a set of plausible hypotheses and randomize uniformly among the hypotheses.
  - The adversary has two actions, one always paying off 1 and the other 0. If all but one of > 2T hypotheses always predict the wrong arm, and only one hypothesis always predicts the good arm, then with probability > 1/2 it is never picked and the algorithm incurs regret of ~T.

epsilon-greedy
- Rough idea of ε-greedy (or ε-first): act randomly for ε rounds, then go with the best (arm or expert).
- Even if we know the number of rounds in advance, ε-first won't get us regret O(T^{1/2}), even in the non-contextual setting.
- Rough analysis: even for just 2 arms, we suffer regret of about ε + (T - ε)/ε^{1/2}.
- ε ≈ T^{2/3} is the optimal tradeoff, giving regret ~T^{2/3} (see the calculation below).
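The T^{2/3} rate comes from balancing the two phases; informally (this is the rough calculation from the slide, not a formal proof):

\mathrm{regret}(\varepsilon) \;\approx\;
\underbrace{\varepsilon}_{\text{exploration}}
\;+\;
\underbrace{\frac{T-\varepsilon}{\sqrt{\varepsilon}}}_{\text{exploitation with estimation error } \sim 1/\sqrt{\varepsilon}},
\qquad
\min_{\varepsilon}\Big(\varepsilon + \frac{T}{\sqrt{\varepsilon}}\Big)
\ \text{is attained at}\ \varepsilon \asymp T^{2/3},
\ \text{giving regret} \asymp T^{2/3}.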

Outline: next, a high-probability optimal algorithm.

Ideas Behind Exp4.P (all appeared in previous algorithms)
- exponential weights: keep a weight on each expert that scales exponentially with the expert's (estimated) performance
- upper confidence bounds: use an upper confidence bound on each expert's estimated reward
- ensuring exploration: make sure each action is taken with some minimum probability
- importance weighting: give rare events more importance to keep estimates unbiased
(A rough sketch of how these pieces combine follows.)
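A rough Python sketch of how these four ingredients combine. This is not the exact Exp4.P pseudocode: the exploration floor, learning rate, and confidence constant below are placeholder choices, and experts are assumed to recommend a single action each.

import math, random

class Exp4PSketch:
    """Exponential weights over N experts, importance-weighted reward estimates,
    a per-expert confidence bonus, and a forced-exploration floor on each action."""

    def __init__(self, N, K, T, delta=0.05):
        self.N, self.K = N, K
        self.w = [1.0] * N
        # placeholder tuning; assumes K * pmin <= 1 (i.e. T is large enough)
        self.pmin = math.sqrt(math.log(N) / (K * T))
        self.eta = self.pmin / 2.0
        self.bonus = math.sqrt(math.log(N / delta) / (K * T))

    def action_probs(self, advice):
        # advice[i] is the single action recommended by expert i
        W = sum(self.w)
        p = [0.0] * self.K
        for i, a in enumerate(advice):
            p[a] += self.w[i] / W
        # mix with uniform so every action has probability at least pmin
        return [(1.0 - self.K * self.pmin) * pj + self.pmin for pj in p]

    def choose(self, advice):
        p = self.action_probs(advice)
        action = random.choices(range(self.K), weights=p)[0]
        return action, p

    def update(self, advice, action, reward, p):
        rhat = reward / p[action]              # importance-weighted reward estimate
        for i, a in enumerate(advice):
            est = rhat if a == action else 0.0
            # estimated reward plus a confidence term that grows when the
            # expert's recommended action was unlikely to be explored
            self.w[i] *= math.exp(self.eta * (est + self.bonus / p[a]))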

(slide from Beygelzimer & Langford ICML 2010 tutorial)

(Exp4.P algorithm) [Beygelzimer, Langford, Li, R, Schapire 10]

Lemma 1

Lemma 2

EXP4.P [Beygelzimer-Langford-Li-R-Schapire 11]
Main Theorem: For any δ > 0, with probability at least 1 - δ, EXP4.P has regret at most O((KT ln(N/δ))^{1/2}) in the adversarial contextual bandit setting.
Key insights (on top of UCB / EXP):
1) exponential weights and upper confidence bounds stack
2) a generalized Bernstein's inequality for martingales

Efficiency

Algorithm                       Regret        High prob?   Context?   Efficient?
Exp4 [ACFS 02]                  Õ(T^{1/2})    No           Yes        No
epoch-greedy [LZ 07]            Õ(T^{2/3})    Yes          Yes        Yes
Exp3.P / UCB [ACFS 02][A 00]    Õ(T^{1/2})    Yes          No         Yes
Exp4.P [BLLRS 10]               Õ(T^{1/2})    Yes          Yes        No

EXP4.P Applied to Yahoo!

Experiments on Yahoo! Data
- We chose a policy class for which we could efficiently keep track of the weights.
- We created 5 clusters, with users (at each time step) getting features based on their distances to the clusters.
- Policies mapped clusters to article (action) choices.
- We ran on personalized news article recommendations for the Yahoo! front page.
- We used a learning bucket on which we ran the algorithms and a deployment bucket on which we ran the greedy (best) learned policy.

Experimental Results
Reported estimated (normalized) click-through rates on front-page news. Over 41M user visits; 253 total articles; 21 candidate articles per visit. Learning and deployment eCTRs were reported for EXP4.P, EXP4, and ε-greedy.
Why does this work in practice?

Outline: next, dealing with VC sets.

Infinitely Many Policies
- What if we have an infinite number of policies?
- Our bound of Õ((KT ln N)^{1/2}) becomes vacuous.
- If we assume our policy class has finite VC dimension d, then we can tackle this problem.
- We need an i.i.d. assumption. We will also assume K = 2 to illustrate the argument.

VC Dimension
- The VC dimension of a hypothesis class captures the class's expressive power.
- It is the cardinality of the largest set (in our case, of contexts) the class can shatter.
- To shatter a set means to label it in all possible configurations.

VE, an Algorithm for VC Sets
The VE algorithm:
- Act uniformly at random for τ rounds.
- This partitions our policies Π into equivalence classes according to their labelings of the first τ examples.
- Pick one representative from each equivalence class to form Π′.
- Run Exp4.P on Π′.
(The partitioning step is sketched below.)
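A sketch of the partitioning step in Python. It is illustrative: it assumes the policies can be enumerated, which is only conceptual for an infinite class; in the analysis, Sauer's lemma bounds how many distinct labelings can occur.

def representatives(policies, contexts):
    """Group policies by how they label the first tau contexts, and keep one
    representative per equivalence class (Pi -> Pi')."""
    classes = {}
    for pi in policies:
        labeling = tuple(pi(x) for x in contexts)
        classes.setdefault(labeling, pi)   # first policy seen represents its class
    return list(classes.values())

# VE outline:
#  1) act uniformly at random for the first tau rounds, recording the contexts
#  2) reps = representatives(policies, first_tau_contexts)
#  3) run Exp4.P over the finite set reps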

Outline of Analysis of VE
- Sauer's lemma bounds the number of equivalence classes by (eτ/d)^d.
- Hence, using the Exp4.P bound, VE's regret to Π′ is τ + Õ((Td ln τ)^{1/2}).
- We can show that the regret of Π′ to Π is about (T/τ)(d ln T), by looking at the probability of disagreeing on future data given agreement for τ steps.
- τ ≈ (Td ln(1/δ))^{1/2} achieves the optimal trade-off (see the informal balance below).
- This gives Õ((Td)^{1/2}) regret.
- Still inefficient!
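Informally, the quantities being balanced are as follows (constants and failure probabilities suppressed; this is a reading of the slide, not a precise statement):

\underbrace{\tau}_{\text{uniform play}}
\;+\;
\underbrace{\sqrt{T d \ln \tau}}_{\text{Exp4.P regret to }\Pi'}
\;+\;
\underbrace{\frac{T}{\tau}\, d \ln T}_{\text{regret of }\Pi'\text{ to }\Pi}
\;=\; \tilde{O}\big(\sqrt{Td}\big)
\quad \text{when } \tau \approx \sqrt{Td}.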

Outline: next, an efficient algorithm.

Hope for an Efficient Algorithm? [Dudik-Hsu-Kale-Karampatziakis-Langford-R-Zhang 11]
For EXP4.P, the dependence on N in the regret is logarithmic. This suggests we could compete with a large, even super-polynomial number of policies (e.g. N = K^100 contributes only 10 ln^{1/2} K to the regret). However, all known contextual bandit algorithms explicitly keep track of the N policies; even worse, just reading in the N policies would take too long for large N.

Idea: Use Supervised Learning
- Competing with a large (even exponentially large) set of policies is commonplace in supervised learning.
- Targets: e.g. linear thresholds, CNF, decision trees (in practice only)
- Methods: e.g. boosting, SVM, neural networks, gradient descent
- The recommendations of the policies don't need to be explicitly read in when the policy class has structure!
(Figure: labeled examples x_1, ..., x_6 and a policy class Π go into a supervised learning oracle, which returns a good policy in Π. Warning: this oracle problem is NP-hard in general.)
The idea originates with [Langford-Zhang 07]. A minimal sketch of the oracle's contract follows.
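A minimal Python sketch of the oracle's contract (hypothetical interface; the brute-force loop only illustrates the specification, a real oracle would exploit the structure of the class rather than enumerate it):

def argmax_oracle(policy_class, dataset):
    """Supervised-learning (argmax) oracle: return a policy in the class that
    maximizes total reward on a made-up, fully labeled dataset of
    (context, reward_vector) pairs."""
    def total_reward(pi):
        return sum(rewards[pi(x)] for x, rewards in dataset)
    return max(policy_class, key=total_reward)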

Back to Contextual Bandits (figure): the bandit data collected so far (contexts x_1, x_2, ..., with observed rewards such as $0.50 for the chosen arms) is turned into made-up, fully labeled data and fed to the supervised learning oracle, which searches the N policies (N >> K) for a good one.

Randomized-UCB
Main Theorem [Dudik-Hsu-Kale-Karampatziakis-Langford-R-Zhang 11]: For any δ > 0, with probability at least 1 - δ, given access to a supervised learning oracle, Randomized-UCB has regret at most O((KT ln(NT/δ))^{1/2} + K ln(NK/δ)) in the stochastic contextual bandit setting and runs in time poly(K, T, ln N).

Randomized-UCB: the key ideas
- If arms are chosen among only good policies such that all have variance < approximately 2K, we win; a minimax theorem shows that such a distribution exists.
- This condition can be softened to occasionally allow choosing bad policies, via randomized upper confidence bounds.
- This creates the problem of choosing arms so as to satisfy the constraints; it is expressed as a convex optimization problem solvable by the ellipsoid algorithm.
- A separation oracle can be implemented with the supervised learning oracle.
- Not practical to implement (yet)!

Outline: next, slates.

Bandit Slate Problems [Kale-R-Schapire 11]
Problem: instead of selecting one arm, we need to select s ≥ 1 arms (possibly ranked). The motivation is web ads, where a search engine shows multiple ads at once.

Slates Setting
- On round t the algorithm selects a slate S_t of s arms.
- Unordered or ordered; no context or contextual.
- The algorithm sees r_j(t) for all j in S_t.
- The algorithm gets reward Σ_{j ∈ S_t} r_j(t).
- The obvious solution is to reduce to the regular bandit problem over all possible slates, but we can do much better.

Algorithm Idea (figure): maintain a weight for each arm's inclusion in the slate, apply a multiplicative update using the observed (importance-weighted) rewards, and then apply a relative entropy projection back onto the set of valid slate marginals. (The projection step is also Component Hedge, obtained independently by Koolen et al.) A rough sketch follows.
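A rough Python sketch of the unordered-slate update under these ideas. Illustrative assumptions: the step size eta is a placeholder, p holds the per-arm marginal probabilities, and the final decomposition of the projected marginals into a random slate (e.g. by dependent rounding) is omitted.

import math

def kl_project(q, s, iters=50):
    """KL-project positive weights q onto {p : sum(p) = s, 0 <= p_i <= 1}
    (assumes 1 <= s <= len(q)). The projection has the form
    p_i = min(1, c * q_i); binary-search for the scaling constant c."""
    lo, hi = 0.0, max(s / qi for qi in q)
    for _ in range(iters):
        c = (lo + hi) / 2.0
        if sum(min(1.0, c * qi) for qi in q) < s:
            lo = c
        else:
            hi = c
    c = (lo + hi) / 2.0
    return [min(1.0, c * qi) for qi in q]

def slate_update(p, slate, rewards, eta=0.1):
    """One round for unordered slates: multiplicative update on the arms that
    appeared in the slate, using rewards importance-weighted by their marginal
    probabilities, followed by the relative entropy projection."""
    q = list(p)
    for j in slate:
        q[j] *= math.exp(eta * rewards[j] / p[j])
    return kl_project(q, s=len(slate))

The projection keeps every arm's marginal at most 1 while the marginals sum to s, which is exactly the constraint set for unordered slates of size s.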

Slate Results

              Unordered Slates         Ordered, with Positional Factors
No policies   Õ((sKT)^{1/2}) *         Õ(s(KT)^{1/2})
N policies    Õ((sKT ln N)^{1/2})      Õ(s(KT ln N)^{1/2})

* Independently obtained by Uchiya et al. 10, using different methods.

Discussion
- The contextual bandit setting captures many interesting real-world problems.
- We presented the first optimal, high-probability contextual bandit algorithm.
- We showed how one could possibly make it efficient (not fully there yet).
- We discussed slates, a more realistic setting. How to make those efficient?
