New Algorithms for Contextual Bandits

New Algorithms for Contextual Bandits. Lev Reyzin, Georgia Institute of Technology. Work done at Yahoo! 1

A. Beygelzimer, J. Langford, L. Li, L. Reyzin, and R.E. Schapire. Contextual Bandit Algorithms with Supervised Learning Guarantees (AISTATS 2011).
M. Dudik, D. Hsu, S. Kale, N. Karampatziakis, J. Langford, L. Reyzin, and T. Zhang. Efficient Optimal Learning for Contextual Bandits (UAI 2011).
S. Kale, L. Reyzin, and R.E. Schapire. Non-Stochastic Bandit Slate Problems (NIPS 2010). 2

Serving Content to Users: the user supplies a query, IP address, browser properties, etc.; the system returns a result (i.e., an ad or a news story); the system then observes whether the user clicks or not.

Outline
- The setting and some background
- Show ideas that fail
- Give a high-probability optimal algorithm
- Dealing with VC sets
- An efficient algorithm
- Slates 9

Multiarmed Bandits [Robbins 52]: over rounds 1, 2, ..., T, the learner pulls one of k arms each round and observes only that arm's payoff (e.g., $0.50, $0.33, $0, ...). After T rounds the arms' cumulative payoffs might be $0.5T, $0.2T, $0.33T, ..., $0.1T, while the algorithm earned $0.4T; the regret is the best arm's total minus the algorithm's total, here $0.5T − $0.4T = $0.1T.

Contextual Bandits [Auer-Cesa-Bianchi-Freund-Schapire 02]: now, before each round t, a context x_t arrives and each of N experts/policies/functions (think of N >> K) recommends one of the K arms. The learner pulls one arm and observes only that arm's payoff. In the illustration the policies' cumulative payoffs are $0.12T, $0.22T, $0.1T, ..., $0.17T, the algorithm earns $0.2T, and the regret relative to the best policy is $0.22T − $0.2T = $0.02T. The rewards can come i.i.d. from a distribution or be arbitrary (stochastic / adversarial), and the experts can be present or not (contextual / non-contextual).

The Setting
- T rounds, K possible actions, N policies π in Π (each mapping contexts to actions)
- for t = 1 to T:
  - the world commits to rewards r(t) = (r_1(t), r_2(t), ..., r_K(t)) (adversarial or i.i.d.)
  - the world provides context x_t
  - the learner's policies recommend π_1(x_t), π_2(x_t), ..., π_N(x_t)
  - the learner chooses action j_t
  - the learner receives reward r_{j_t}(t)
- goal: compete with the best policy in hindsight 23

Regret
- reward of algorithm A: G_A(T) = Σ_{t=1}^T r_{j_t}(t)
- expected reward of policy i: G_i(T) = Σ_{t=1}^T r_{π_i(x_t)}(t)
- algorithm A's regret: max_i G_i(T) − G_A(T) 24

Regret
- algorithm A's regret: max_i G_i(T) − G_A(T)
- bound on expected regret: max_i G_i(T) − E[G_A(T)] < ε
- high-probability bound: P[max_i G_i(T) − G_A(T) > ε] ≤ δ 25
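To fix the notation, here is a tiny illustrative Python sketch of these quantities, assuming (only for illustration) that the full reward matrix is available after the fact; in the bandit setting the algorithm itself only ever observes r_{j_t}(t):

```python
import numpy as np

# Hypothetical logged data: rewards[t, a] is the reward of action a at round t,
# chosen[t] is the action the algorithm played, recs[i, t] is policy i's action.
def regret(rewards: np.ndarray, chosen: np.ndarray, recs: np.ndarray) -> float:
    T = rewards.shape[0]
    G_A = rewards[np.arange(T), chosen].sum()               # algorithm's total reward G_A(T)
    G = rewards[np.arange(T)[None, :], recs].sum(axis=1)    # each policy's total reward G_i(T)
    return G.max() - G_A                                    # max_i G_i(T) - G_A(T)
```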

Harder than supervised learning: in the bandit setting we do not know the rewards of the actions not taken.
Many applications: ad auctions, medicine, finance, ...
Exploration/exploitation: we can exploit an expert/arm we've learned to be good, or explore an expert/arm we're not sure about. 26

Some Barriers. Ω((KT)^{1/2}) (non-contextual) and Ω((TK ln N)^{1/2}) (contextual) are known lower bounds on regret [Auer et al. 02], even in the stochastic case. Any algorithm achieving regret Õ((KT polylog N)^{1/2}) is said to be optimal. ε-greedy algorithms that first explore (act randomly) and then exploit (follow the best policy) cannot be optimal: any optimal algorithm must be adaptive. 27

Two Types of Approaches
UCB [Auer 02]: at every time step, 1) pull the arm with the highest upper confidence bound; 2) update the confidence bound of the arm pulled.
EXP3 / Exponential Weights [Littlestone-Warmuth 94], [Auer et al. 02]: at every time step, 1) sample an arm from the distribution defined by the weights (mixed with uniform); 2) update the weights exponentially. 28
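As a concrete illustration of the two update rules (a minimal sketch with illustrative parameter choices, not the exact algorithms from the cited papers):

```python
import math, random

def ucb_index(total_reward, pulls, t):
    """UCB1-style index for one arm: empirical mean plus a confidence bonus."""
    return total_reward / pulls + math.sqrt(2 * math.log(t) / pulls)

def exp3_step(weights, gamma, eta, pull):
    """One EXP3 step: sample from the weights mixed with uniform, then give the
    pulled arm an exponential update using an importance-weighted reward."""
    K = len(weights)
    total = sum(weights)
    probs = [(1 - gamma) * w / total + gamma / K for w in weights]
    arm = random.choices(range(K), weights=probs)[0]
    reward = pull(arm)                      # observed reward in [0, 1]
    estimate = reward / probs[arm]          # importance-weighted reward estimate
    weights[arm] *= math.exp(eta * estimate)
    return arm, reward
```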

UCB vs. EXP3: A Comparison
UCB [Auer 02]
- Pros: optimal for the stochastic setting; succeeds with high probability.
- Cons: does not work in the adversarial setting; is not optimal in the contextual setting.
EXP3 & friends [Auer-Cesa-Bianchi-Freund-Schapire 02]
- Pros: optimal for both the adversarial and stochastic settings; can be made to work in the contextual setting.
- Cons: does not succeed with high probability in the contextual setting (only in expectation). 29

Algorithm | Regret | High prob? | Context?
Exp4 [ACFS 02] | Õ((KT ln N)^{1/2}) | No | Yes
ε-greedy, epoch-greedy [LZ 07] | Õ((K ln N)^{1/3} T^{2/3}) | Yes | Yes
Exp3.P [ACFS 02], UCB [Auer 00] | Õ((KT)^{1/2}) | Yes | No
Exp4.P [BLLRS 10] | Õ((KT ln(N/δ))^{1/2}) | Yes | Yes
(Ω((KT)^{1/2}) lower bound [ACFS 02].)

EXP4.P [Beygelzimer-Langford-Li-R-Schapire 11]
Main Theorem: for any δ > 0, with probability at least 1 − δ, EXP4.P has regret at most O((KT ln(N/δ))^{1/2}) in the adversarial contextual bandit setting.
EXP4.P combines the advantages of Exponential Weights and UCB:
- optimal for both the stochastic and adversarial settings
- works in the contextual case (and also the non-contextual case)
- a high-probability result 32

Outline: The setting and some background; Show ideas that fail; Give a high-probability optimal algorithm; Dealing with VC sets; An efficient algorithm; Slates. 33

Some Failed Approaches
- Bad idea 1: maintain a set of plausible hypotheses and randomize uniformly over their predicted actions. The adversary has two actions, one always paying 1 and the other 0. The hypotheses generally agree on the correct action, except that a different one defects each round; randomizing over the two predicted actions then incurs regret of about T/2.
- Bad idea 2: maintain a set of plausible hypotheses and randomize uniformly among the hypotheses. The adversary again has two actions, one always paying 1 and the other 0. If all but one of more than 2T hypotheses always predict the wrong arm, and only one hypothesis always predicts the good arm, then with probability greater than 1/2 that hypothesis is never picked and the algorithm incurs regret T. 34
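To make the "with probability greater than 1/2" claim in bad idea 2 concrete (a standard calculation filling in the slide's claim): with M > 2T hypotheses and one drawn uniformly at random each round,

    Pr[the single good hypothesis is never picked in T rounds] = (1 − 1/M)^T > (1 − 1/(2T))^T ≥ 1/2,

so with probability more than 1/2 the algorithm only ever follows hypotheses that pick the zero-reward arm, and it suffers regret T.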

epsilon-greedy
- Rough idea of ε-greedy (or ε-first): act randomly for ε rounds, then go with the best arm or expert.
- Even if we know the number of rounds in advance, ε-first won't get us regret O(T^{1/2}), even in the non-contextual setting.
- Rough analysis: even for just 2 arms, we suffer regret of about ε + (T − ε)/ε^{1/2}.
- ε ≈ T^{2/3} is the optimal tradeoff, giving regret ≈ T^{2/3}. 38
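To see where the T^{2/3} rate comes from (a standard back-of-the-envelope calculation, not verbatim from the slide): exploring for ε rounds can cost up to ε, and afterwards the arm-mean estimates are accurate only to about 1/ε^{1/2}, so the remaining T − ε exploitation rounds lose roughly (T − ε)/ε^{1/2} more. Minimizing

    ε + T/ε^{1/2}

over ε gives ε ≈ T^{2/3}, and substituting back yields regret ≈ T^{2/3} + T/T^{1/3} = 2T^{2/3}.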

Outline: The setting and some background; Show ideas that fail; Give a high-probability optimal algorithm; Dealing with VC sets; An efficient algorithm; Slates. 39

Ideas Behind Exp4.P (all appeared in previous algorithms)
- exponential weights: keep a weight on each expert that scales exponentially with the expert's (estimated) performance
- upper confidence bounds: use an upper confidence bound on each expert's estimated reward
- ensuring exploration: make sure each action is taken with some minimum probability
- importance weighting: give rare events more importance to keep estimates unbiased 40

(slide from Beygelzimer & Langford ICML 2010 tutorial) 41

(Exp4.P) [Beygelzimer, Langford, Li, R, Schapire 10] 42
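The Exp4.P pseudocode itself appears here only as an image, so below is a rough Python sketch of a main loop combining the four ingredients listed above (exponential weights, upper confidence bounds, forced exploration, importance weighting). The mixing rate and confidence-bonus constants are illustrative placeholders, not the exact values from the paper.

```python
import math, random

def exp4p_sketch(experts, K, T, get_context, pull, delta=0.05):
    """Illustrative Exp4.P-style loop. Each expert maps a context to a
    probability vector over the K arms; pull(arm) returns a reward in [0, 1]."""
    N = len(experts)
    gamma = min(1.0, math.sqrt(K * math.log(N) / T))      # exploration/learning rate (illustrative)
    bonus = math.sqrt(math.log(N / delta) / (K * T))      # confidence-bonus scale (illustrative)
    w = [1.0] * N
    for t in range(T):
        x = get_context(t)
        advice = [e(x) for e in experts]                  # each a distribution over the K arms
        W = sum(w)
        # mix the experts' advice by weight, then mix with uniform to force exploration
        p = [(1 - gamma) * sum(w[i] * advice[i][a] for i in range(N)) / W + gamma / K
             for a in range(K)]
        arm = random.choices(range(K), weights=p)[0]
        r = pull(arm)
        rhat = [r / p[a] if a == arm else 0.0 for a in range(K)]   # importance-weighted rewards
        for i in range(N):
            yhat = sum(advice[i][a] * rhat[a] for a in range(K))   # expert i's estimated reward
            vhat = sum(advice[i][a] / p[a] for a in range(K))      # its variance-like term
            w[i] *= math.exp((gamma / (3 * K)) * (yhat + bonus * vhat))  # update with UCB-style bonus
```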

Lemma 1 43

Lemma 2 44

EXP4.P [Beygelzimer-Langford-Li-R-Schapire 11]
Main Theorem: for any δ > 0, with probability at least 1 − δ, EXP4.P has regret at most O((KT ln(N/δ))^{1/2}) in the adversarial contextual bandit setting.
Key insights (on top of UCB/EXP):
1) exponential weights and upper confidence bounds stack
2) a generalized Bernstein inequality for martingales 45
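For reference, one standard Bernstein-type concentration inequality for martingales is Freedman's inequality (given here as general background; the paper's exact variant may differ): if X_1, ..., X_T is a martingale difference sequence with |X_t| ≤ b and predictable variance Σ_t E[X_t^2 | X_1, ..., X_{t−1}] ≤ v, then for any a > 0,

    Pr[X_1 + ... + X_T ≥ a] ≤ exp(−a^2 / (2v + 2ab/3)).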

Efficiency
Algorithm | Regret | High prob? | Context? | Efficient?
Exp4 [ACFS 02] | Õ(T^{1/2}) | No | Yes | No
epoch-greedy [LZ 07] | Õ(T^{2/3}) | Yes | Yes | Yes
Exp3.P/UCB [ACFS 02][A 00] | Õ(T^{1/2}) | Yes | No | Yes
Exp4.P [BLLRS 10] | Õ(T^{1/2}) | Yes | Yes | No 46

EXP4.P Applied to Yahoo! 47

Experiments on Yahoo! Data
- We chose a policy class for which we could efficiently keep track of the weights.
- We created 5 clusters, with users (at each time step) getting features based on their distances to the clusters.
- Policies mapped clusters to article (action) choices.
- We ran on personalized news article recommendation for the Yahoo! front page.
- We used a learning bucket, on which we ran the algorithms, and a deployment bucket, on which we ran the greedy (best) learned policy. 48

Experimental Results. Reported estimated (normalized) click-through rates (eCTR) on front-page news, over 41M user visits, 253 total articles, 21 candidate articles per visit.
Algorithm | Learning eCTR | Deployment eCTR
EXP4.P | 1.0525 | 1.6512
EXP4 | 1.0988 | 1.5309
ε-greedy | 1.3829 | 1.4290
Why does this work in practice?

Outline: The setting and some background; Show ideas that fail; Give a high-probability optimal algorithm; Dealing with VC sets; An efficient algorithm; Slates. 51

Infinitely Many Policies
- What if we have an infinite number of policies? Our bound of Õ((KT ln N)^{1/2}) becomes vacuous.
- If we assume our policy class has finite VC dimension d, then we can tackle this problem.
- We need an i.i.d. assumption; we will also assume K = 2 to illustrate the argument. 52

VC Dimension
- The VC dimension of a hypothesis class captures the class's expressive power.
- It is the cardinality of the largest set (in our case, of contexts) that the class can shatter.
- To shatter a set means to label it in all possible configurations; for example, threshold functions on the real line can shatter any single point but no pair of points, so their VC dimension is 1. 53
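A tiny self-contained sketch of the shattering definition (illustrative code, not from the talk), using threshold classifiers on the real line as the hypothesis class:

```python
def shatters(hypotheses, points):
    """True if the hypotheses realize every possible 0/1 labeling of `points`,
    i.e., the set of points is shattered."""
    realized = {tuple(h(p) for p in points) for h in hypotheses}
    return len(realized) == 2 ** len(points)

# Threshold classifiers on the real line: h_t(x) = 1 iff x >= t.
thresholds = [lambda x, t=t: int(x >= t) for t in (-1.0, 0.0, 1.0, 2.0)]
print(shatters(thresholds, [0.5]))        # True: any single point can be shattered
print(shatters(thresholds, [0.5, 1.5]))   # False: the labeling (1, 0) is impossible, so VC dim = 1
```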

VE, an Algorithm for VC Sets. The VE algorithm:
- Act uniformly at random for τ rounds.
- This partitions our policies Π into equivalence classes according to their labelings of the first τ examples.
- Pick one representative from each equivalence class to form Π′.
- Run Exp4.P on Π′. 54
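A rough Python sketch of VE's two phases, treating Exp4.P as a black box with a hypothetical interface and assuming K = 2 actions:

```python
import random

def VE(policies, tau, T, get_context, pull_arm, exp4p):
    """Sketch of VE: uniform exploration for tau rounds, then Exp4.P over one
    representative policy per equivalence class. `exp4p` is assumed to be an
    Exp4.P implementation with the signature used below."""
    # Phase 1: act uniformly at random for tau rounds, recording the contexts.
    contexts = []
    for t in range(tau):
        x = get_context(t)
        contexts.append(x)
        pull_arm(random.randrange(2))            # K = 2 arms, uniform exploration
    # Partition the policies by their labelings of the first tau contexts.
    classes = {}
    for pi in policies:
        key = tuple(pi(x) for x in contexts)     # the policy's behavior so far
        classes.setdefault(key, pi)              # keep one representative per class
    representatives = list(classes.values())     # this is the finite class Pi'
    # Phase 2: run Exp4.P over the representatives for the remaining rounds.
    exp4p(representatives, T - tau, get_context, pull_arm)
```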

Outline of Analysis of VE
- Sauer's lemma bounds the number of equivalence classes by (eτ/d)^d.
- Hence, using the Exp4.P bound, VE's regret to Π′ is τ + Õ((Td ln τ)^{1/2}).
- We can show that the regret of Π′ to Π is about (T/τ)(d ln T), by looking at the probability of disagreeing on future data given agreement for τ steps.
- τ ≈ (Td ln(1/δ))^{1/2} achieves the optimal trade-off, giving Õ((Td)^{1/2}) regret.
- Still inefficient! 55

Outline: The setting and some background; Show ideas that fail; Give a high-probability optimal algorithm; Dealing with VC sets; An efficient algorithm; Slates. 56

Hope for an Efficient Algorithm? [Dudik-Hsu-Kale-Karampatziakis-Langford-R-Zhang 11]
For EXP4.P, the dependence on N in the regret is logarithmic. This suggests we could compete with a large, even super-polynomial, number of policies (e.g., N = K^100 becomes 10 log^{1/2} K in the regret). However, all known contextual bandit algorithms explicitly keep track of the N policies; even worse, just reading in the N policies would take too long for large N. 57

Idea: Use Supervised Learning
- Competing with a large (even exponentially large) set of policies is commonplace in supervised learning.
- Targets: e.g., linear thresholds, CNF, decision trees (in practice only).
- Methods: e.g., boosting, SVMs, neural networks, gradient descent.
- The recommendations of the policies don't need to be explicitly read in when the policy class has structure: given contexts x_1, x_2, ..., a supervised learning oracle over the policy class Π returns a good policy in Π (warning: this is NP-hard in general).
The idea originates with [Langford-Zhang 07].
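To make the oracle abstraction concrete (an illustrative interface, not the one from the paper): the learner never enumerates Π; it only calls an argmax oracle that, given contexts paired with estimated per-action rewards, returns a policy maximizing total estimated reward. In practice such an oracle would be a cost-sensitive supervised learner rather than the brute-force search shown here.

```python
from typing import Callable, List, Sequence, Tuple

Policy = Callable[[Sequence[float]], int]   # a policy maps a context to one of K actions

def argmax_oracle(data: List[Tuple[Sequence[float], Sequence[float]]],
                  policy_class: List[Policy]) -> Policy:
    """Illustrative argmax (supervised-learning) oracle.

    `data` is a list of (context, estimated_rewards) pairs, where
    estimated_rewards[a] is an importance-weighted reward estimate for action a.
    Returns the policy in `policy_class` with the highest total estimated reward;
    real implementations replace this brute-force search with a cost-sensitive
    classification algorithm."""
    def total_reward(pi: Policy) -> float:
        return sum(rewards[pi(x)] for x, rewards in data)
    return max(policy_class, key=total_reward)
```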

Back to Contextual Bandits: as the rounds proceed, the learner accumulates contexts x_1, x_2, x_3, ... together with reward estimates for the arms (e.g., $1.20, $0.70, $0.50 in the illustration). It feeds this made-up data to the supervised learning oracle, which returns a good policy from among the N policies (think of N >> K).

Randomized-UCB
Main Theorem [Dudik-Hsu-Kale-Karampatziakis-Langford-R-Zhang 11]: for any δ > 0, with probability at least 1 − δ, given access to a supervised learning oracle, Randomized-UCB has regret at most O((KT ln(NT/δ))^{1/2} + K ln(NK/δ)) in the stochastic contextual bandit setting and runs in time poly(K, T, ln N).
Key ideas:
- If arms are chosen among only good policies such that all have variance less than approximately 2K, we win; such a distribution can be shown to exist via a minimax theorem.
- This condition can be softened to occasionally allow choosing bad policies, via randomized upper confidence bounds.
- This creates the problem of choosing arms so as to satisfy the constraints, expressed as a convex optimization problem solvable by the ellipsoid algorithm.
- A separation oracle can be implemented with the supervised learning oracle.
Not practical to implement (yet)!

Outline: The setting and some background; Show ideas that fail; Give a high-probability optimal algorithm; Dealing with VC sets; An efficient algorithm; Slates. 66

Bandit Slate Problems [Kale-R-Schapire 11]. Problem: instead of selecting one arm, we need to select a slate of s ≥ 1 arms (possibly ranked). The motivation is web advertising, where a search engine shows multiple ads at once. 67

Slates Setting
- On round t the algorithm selects a slate S_t of s arms.
- Variants: unordered or ordered; no context or contextual.
- The algorithm sees r_j(t) for all j in S_t.
- The algorithm gets reward Σ_{j ∈ S_t} r_j(t).
- The obvious solution is to reduce to the regular bandit problem, but we can do much better. 68

Algorithm Idea: maintain a weight vector over the arms, starting uniform (the illustration shows .33 on each of three components). On each round, apply a multiplicative update, then a relative entropy projection back onto the feasible set. (This is also Component Hedge, obtained independently by Koolen et al. 10.)
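As a concrete illustration of the two steps just named (a minimal sketch under the assumption that the feasible set is the set of marginal vectors x in [0,1]^K with Σ_i x_i = s; the full algorithm in the paper also forms importance-weighted reward estimates and decomposes the projected marginals back into a distribution over slates):

```python
import numpy as np

def multiplicative_update(x, reward_estimates, eta=0.1):
    """One exponential-weights step on the per-arm (estimated) rewards."""
    return x * np.exp(eta * np.asarray(reward_estimates, dtype=float))

def kl_project_to_slate_marginals(w, s):
    """Relative-entropy (KL) projection of w > 0 onto {x in [0,1]^K : sum(x) = s}.
    The projection has the form x_i = min(1, c * w_i) for a scalar c > 0, found
    by repeatedly capping coordinates that would exceed 1. Assumes 0 < s < K."""
    w = np.asarray(w, dtype=float)
    capped = np.zeros(len(w), dtype=bool)
    while True:
        c = (s - capped.sum()) / w[~capped].sum()      # rescale the uncapped coordinates
        newly_capped = (~capped) & (c * w >= 1.0)      # these must sit at the cap of 1
        if not newly_capped.any():
            break
        capped |= newly_capped
    x = np.minimum(1.0, c * w)
    x[capped] = 1.0
    return x

# Illustrative usage: K = 5 arms, slates of size s = 2.
x = np.full(5, 2 / 5)                                  # start at the uniform marginal vector
x = multiplicative_update(x, [1.0, 0.0, 0.0, 1.0, 0.0])
x = kl_project_to_slate_marginals(x, s=2)              # back to valid slate marginals
```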

Slate Results
Setting | Unordered slates | Ordered, with positional factors
No policies | Õ((sKT)^{1/2})* | Õ(s(KT)^{1/2})
N policies | Õ((sKT ln N)^{1/2}) | Õ(s(KT ln N)^{1/2})
*Independently obtained by Uchiya et al. 10, using different methods. 75

Discussion
- The contextual bandit setting captures many interesting real-world problems.
- We presented the first optimal, high-probability contextual algorithm.
- We showed how one could possibly make it efficient; we are not fully there yet.
- We discussed slates, a more realistic setting. How do we make those efficient? 76