Reducing contextual bandits to supervised learning

Daniel Hsu, Columbia University
Based on joint work with A. Agarwal, S. Kale, J. Langford, L. Li, and R. Schapire.

Learning to interact: example #1 (practicing physician)

Loop:
1. Patient arrives with symptoms, medical history, genome...
2. Prescribe treatment.
3. Observe impact on patient's health (e.g., improves, worsens).

Goal: prescribe treatments that yield good health outcomes.

Learning to interact: example #2 (website operator)

Loop:
1. User visits website with profile, browsing history...
2. Choose content to display on website.
3. Observe user reaction to content (e.g., click, "like").

Goal: choose content that yields desired user behavior.

Contextual bandit problem

For t = 1, 2, ..., T:
0. Nature draws (x_t, r_t) from a distribution D over X × [0,1]^A.
1. Observe context x_t ∈ X. [e.g., user profile, search query]
2. Choose action a_t ∈ A. [e.g., ad to display]
3. Collect reward r_t(a_t) ∈ [0,1]. [e.g., 1 if click, 0 otherwise]

Task: choose actions a_t that yield high expected reward (w.r.t. D).

Contextual: use features x_t to choose good actions a_t.
Bandit: r_t(a) for a ≠ a_t is not observed.
(Non-bandit setting: the whole reward vector r_t ∈ [0,1]^A is observed.)
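To make the protocol concrete, here is a minimal simulation sketch (mine, not from the talk); the context dimension, the reward model, and the uniform-random learner are placeholder choices.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3          # number of actions |A|  (hypothetical)
T = 1000       # number of rounds
d = 5          # context dimension     (hypothetical)

def draw_context_and_rewards():
    """Nature draws (x_t, r_t) i.i.d. from some fixed distribution D."""
    x = rng.normal(size=d)
    # Hypothetical reward vector in [0, 1]^A; the learner never sees all of it.
    r = 1.0 / (1.0 + np.exp(-(x[:K] + 0.5 * rng.normal(size=K))))
    return x, r

total_reward = 0.0
for t in range(T):
    x_t, r_t = draw_context_and_rewards()   # step 0 (only x_t is shown to the learner)
    a_t = rng.integers(K)                   # placeholder learner: uniform exploration
    total_reward += r_t[a_t]                # bandit feedback: only r_t(a_t) is observed
print("average reward:", total_reward / T)
```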

Challenges

1. Exploration vs. exploitation. Use what you've already learned (exploit), but also learn about actions that could be good (explore). Must balance the two to get good statistical performance.

2. Must use context. Want to do as well as the best policy (i.e., decision rule) π: context x ↦ action a from some policy class Π (a set of decision rules). Computationally constrained with large Π (e.g., all decision trees).

3. Selection bias, especially while exploiting.

Learning objective

Regret (i.e., relative performance) with respect to a policy class Π:

    \underbrace{\max_{\pi \in \Pi} \frac{1}{T}\sum_{t=1}^{T} r_t(\pi(x_t))}_{\text{average reward of best policy}} \;-\; \underbrace{\frac{1}{T}\sum_{t=1}^{T} r_t(a_t)}_{\text{average reward of learner}}

A strong benchmark if Π contains a policy with high expected reward!

Goal: regret → 0 as fast as possible as T → ∞.

Our result

New fast and simple algorithm for contextual bandits.
- Operates via a computationally efficient reduction to supervised learning.
- Statistically (near) optimal regret bound.

Need for exploration

No-exploration approach:
1. Using historical data, learn a reward predictor r̂(a | x) for each action a ∈ A based on context x ∈ X.
2. Then deploy the policy π̂ given by

       π̂(x) := arg max_{a ∈ A} r̂(a | x),

   and collect more data.

Suffers from selection bias.

Using no-exploration

Example: two contexts {X, Y}, two actions {A, B}.
Suppose the initial policy says π̂(X) = A and π̂(Y) = B.

    Observed rewards          Reward estimates
          A     B                   A     B
    X    0.7    -             X    0.7   0.5
    Y     -    0.1            Y    0.5   0.1

New policy: π̂'(X) = π̂'(Y) = A.

    Observed rewards          Reward estimates
          A     B                   A     B
    X    0.7    -             X    0.7   0.5
    Y    0.3   0.1            Y    0.3   0.1

We never try action B in context X, so we incur Ω(1) regret, since the true rewards are:

    True rewards
          A     B
    X    0.7   1.0
    Y    0.3   0.1
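A tiny simulation of the failure mode above (my illustration, not from the slides): greedy updates of a tabular reward estimate, seeded with the example's tables, get stuck playing A everywhere and never observe the reward of B in context X.

```python
# Contexts and actions from the example; true rewards as in the last table.
true_reward = {("X", "A"): 0.7, ("X", "B"): 1.0,
               ("Y", "A"): 0.3, ("Y", "B"): 0.1}
estimate = {k: 0.5 for k in true_reward}      # default estimate for unobserved pairs
policy = {"X": "A", "Y": "B"}                 # initial policy from the example

for round_ in range(100):
    for ctx in ("X", "Y"):
        act = policy[ctx]                      # exploit the current estimate, no exploration
        estimate[(ctx, act)] = true_reward[(ctx, act)]   # observe (noiseless) reward
    # Re-fit the greedy policy from the current estimates.
    policy = {ctx: max("AB", key=lambda a: estimate[(ctx, a)]) for ctx in ("X", "Y")}

print(policy)                  # stuck at {'X': 'A', 'Y': 'A'}
print(estimate[("X", "B")])    # still the default 0.5; the true value 1.0 is never discovered
```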

Dealing with policies

Feedback in round t: reward of the chosen action, r_t(a_t).
- Tells us about policies π ∈ Π such that π(x_t) = a_t.
- Not informative about other policies!

Possible approach: track the average reward of each π ∈ Π.
- Exp4 (Auer, Cesa-Bianchi, Freund, & Schapire, FOCS 1995).
- Statistically optimal regret bound O(√(K log(N) / T)) for K := |A| actions and N := |Π| policies after T rounds.
- Explicit bookkeeping is computationally intractable for large N.

But perhaps the policy class Π has some structure...

Hypothetical full-information setting

If we observed rewards for all actions...

As in supervised learning, we would have labeled data after t rounds:

    (x_1, \rho_1), \ldots, (x_t, \rho_t) \in X \times \mathbb{R}^A.

    context  ↔  features
    actions  ↔  classes
    rewards  ↔  costs
    policy   ↔  classifier

Can often exploit the structure of Π to get tractable algorithms.

Abstraction: arg max oracle (AMO)

    \mathrm{AMO}\big(\{(x_i, \rho_i)\}_{i=1}^{t}\big) := \arg\max_{\pi \in \Pi} \sum_{i=1}^{t} \rho_i(\pi(x_i)).

Can't directly use this in the bandit setting.
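For a small, explicitly listed policy class, the AMO interface can be served by brute force; the sketch below (my illustration) is only meant to pin down that interface, since the threshold policy class and the data are made up, and for structured classes the same call is answered by a supervised cost-sensitive learner.

```python
import numpy as np

def amo(policies, data):
    """arg max oracle: return the policy in `policies` maximizing
    sum_i rho_i(pi(x_i)) on the labeled data set data = [(x_i, rho_i), ...].
    Brute-force version for a small, explicitly listed policy class."""
    return max(policies, key=lambda pi: sum(rho[pi(x)] for x, rho in data))

# Tiny example: scalar contexts, actions {0, 1}, and a class of threshold policies.
policies = [lambda x, c=c: int(x > c) for c in (-1.0, 0.0, 1.0)]
data = [(-0.5, np.array([1.0, 0.0])),
        (0.5,  np.array([0.2, 0.9])),
        (1.5,  np.array([0.1, 0.8]))]
best = amo(policies, data)
print([best(x) for x, _ in data])   # the threshold-at-0 policy wins: [0, 1, 1]
```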

Using the AMO with some exploration

Explore-then-exploit:
1. In the first τ rounds, choose a_t ∈ A uniformly at random to get unbiased estimates r̂_t of r_t for all t ≤ τ.
2. Get π̂ := AMO({(x_t, r̂_t)}_{t=1}^{τ}).
3. Henceforth use a_t := π̂(x_t), for t = τ+1, τ+2, ..., T.

Regret bound with the best choice of τ: on the order of T^{-1/3} (sub-optimal).
(Dependencies on |A| and |Π| hidden.)
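A schematic rendering of explore-then-exploit (a sketch, not the talk's code); it reuses the placeholder `amo` and `draw_context_and_rewards` from the earlier snippets, and the uniform-exploration estimate r(a)·K is the K-action special case of the inverse propensity weighting introduced below.

```python
import numpy as np

rng = np.random.default_rng(0)

def explore_then_exploit(tau, T, K, policies, amo, draw_context_and_rewards):
    """Uniform exploration for tau rounds, a single AMO call, then pure exploitation."""
    data, total, pi_hat = [], 0.0, None
    for t in range(T):
        x, r = draw_context_and_rewards()
        if t < tau:
            a = rng.integers(K)              # explore uniformly at random
            r_hat = np.zeros(K)
            r_hat[a] = r[a] * K              # unbiased: observed reward divided by prob. 1/K
            data.append((x, r_hat))
        else:
            if pi_hat is None:
                pi_hat = amo(policies, data) # one AMO call on the estimated rewards
            a = pi_hat(x)                    # exploit for the remaining rounds
        total += r[a]
    return total / T
```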

Previous contextual bandit algorithms

- Exp4 (Auer, Cesa-Bianchi, Freund, & Schapire, FOCS 1995): optimal regret, but explicitly enumerates Π.
- Epoch-Greedy (Langford & Zhang, NIPS 2007): sub-optimal regret, but one call to the AMO.
- "Monster" (Dudík, Hsu, Kale, Karampatziakis, Langford, Reyzin, & Zhang, UAI 2011): near-optimal regret, but O(T^6) calls to the AMO.

Our result

Let K := |A| and N := |Π|. Our result: a new, fast, and simple algorithm.
- Regret bound: Õ(√(K log(N) / T)). Near optimal.
- # calls to the AMO: Õ(√(TK / log N)). Less than once per round!

Rest of the talk

Components of the new algorithm (Importance-weighted LOw-Variance Epoch-Timed Oracleized CONtextual BANDITS):
1. Classical tricks: randomization, inverse propensity weighting.
2. Efficient algorithm for balancing exploration/exploitation.
3. Additional tricks: warm start and epoch structure.

1. Classical tricks

What would've happened if I had done X?

For t = 1, 2, ..., T:
0. Nature draws (x_t, r_t) from a distribution D over X × [0,1]^A.
1. Observe context x_t ∈ X. [e.g., user profile, search query]
2. Choose action a_t ∈ A. [e.g., ad to display]
3. Collect reward r_t(a_t) ∈ [0,1]. [e.g., 1 if click, 0 otherwise]

Q: How do I learn about r_t(a) for actions a I don't actually take?
A: Randomize. Draw a_t ~ p_t for some pre-specified probability distribution p_t.

Inverse propensity weighting (Horvitz & Thompson, JASA 1952)

Importance-weighted estimate of the reward from round t: for each a ∈ A,

    \hat{r}_t(a) := \frac{r_t(a_t)\,\mathbf{1}\{a = a_t\}}{p_t(a_t)}
                  = \begin{cases} r_t(a_t)/p_t(a_t) & \text{if } a = a_t, \\ 0 & \text{otherwise.} \end{cases}

Unbiasedness:

    \mathbb{E}_{a_t \sim p_t}\big[\hat{r}_t(a)\big] = \sum_{a' \in A} p_t(a') \cdot \frac{r_t(a')\,\mathbf{1}\{a = a'\}}{p_t(a')} = r_t(a).

Range and variance: upper bounded by 1/p_t(a).

Estimate the average reward of a policy:

    \widehat{\mathrm{Rew}}_t(\pi) := \frac{1}{t}\sum_{i=1}^{t} \hat{r}_i(\pi(x_i)).

How should we choose the p_t?
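The estimator is a one-liner in code; a minimal sketch (mine, with made-up numbers) of the per-round IPW vector and the policy-value estimate:

```python
import numpy as np

def ipw_estimate(K, a_t, r_at, p_t):
    """Importance-weighted reward vector: r_t(a_t)/p_t(a_t) at the chosen action, 0 elsewhere."""
    r_hat = np.zeros(K)
    r_hat[a_t] = r_at / p_t[a_t]
    return r_hat

def estimated_policy_value(policy, contexts, r_hats):
    """Rew_t(pi) = (1/t) * sum_i rhat_i(pi(x_i))."""
    return float(np.mean([r_hat[policy(x)] for x, r_hat in zip(contexts, r_hats)]))

# Example: 3 actions; action 1 chosen with probability 0.25, observed reward 0.8.
print(ipw_estimate(3, a_t=1, r_at=0.8, p_t=np.array([0.5, 0.25, 0.25])))   # [0.  3.2 0. ]
```

Note that the estimate at the chosen action can be much larger than 1 when p_t(a_t) is small, which is exactly the 1/p_t(a) range/variance issue that the low-variance constraints below are designed to control.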

Hedging over policies

Get action distributions via policy distributions:

    (Q, x)  ↦  p
    (policy distribution, context)  ↦  action distribution

Policy distribution: Q = (Q(π) : π ∈ Π), a probability distribution over the policies π in the policy class Π.

1: Pick an initial distribution Q_1 over policies Π.
2: for round t = 1, 2, ... do
3:   Nature draws (x_t, r_t) from distribution D over X × [0,1]^A.
4:   Observe context x_t.
5:   Compute a distribution p_t over A (using Q_t and x_t).
6:   Pick action a_t ~ p_t.
7:   Collect reward r_t(a_t).
8:   Compute a new distribution Q_{t+1} over policies Π.
9: end for

2. Efficient construction of good policy distributions

Our approach

Q: How do we choose Q_t for good exploration/exploitation?

Caveat: Q_t must be efficiently computable and representable!

Our approach:
1. Define a convex feasibility problem (over distributions Q on Π) such that solutions yield (near) optimal regret bounds.
2. Design an algorithm that finds a sparse solution Q.

The algorithm only accesses Π via calls to the AMO, so nnz(Q) = O(# AMO calls).

The "good policy distribution" problem

Convex feasibility problem for the policy distribution Q:

    \sum_{\pi \in \Pi} Q(\pi)\,\widehat{\mathrm{Reg}}_t(\pi) \;\le\; \sqrt{\frac{K \log N}{t}}    (Low regret)

    \mathrm{var}\big(\widehat{\mathrm{Rew}}_t(\pi)\big) \;\le\; b(\pi) \quad \forall\, \pi \in \Pi    (Low variance)

Using a feasible Q_t in round t gives near-optimal regret.

But there are |Π| variables and more than |Π| constraints, ...

Solving the convex feasibility problem

Solver for the "good policy distribution" problem:

Start with some Q (e.g., Q := 0), then repeat:
1. If the low-regret constraint is violated, fix it by rescaling: Q := cQ for some c < 1.
2. Find the most violated low-variance constraint, say corresponding to policy π̃, and update Q(π̃) := Q(π̃) + α.
   (If no such violated constraint exists, stop and return Q.)

(c < 1 and α > 0 have closed-form formulas.)
(Technical detail: Q can be a sub-distribution that sums to less than one.)

Implementation via the AMO

Finding a low-variance constraint violation (see the sketch below):
1. Create fictitious rewards for each i = 1, 2, ..., t and each a ∈ A:

       \tilde{r}_i(a) := \hat{r}_i(a) + \frac{\mu}{Q(a \mid x_i)}, \qquad \text{where } \mu \approx \sqrt{\frac{\log N}{Kt}}.

2. Obtain π̃ := AMO({(x_i, \tilde{r}_i)}_{i=1}^{t}).
3. \widetilde{\mathrm{Rew}}_t(π̃) > threshold iff π̃'s low-variance constraint is violated.
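A skeletal rendering of the solver plus the AMO-based violation check from the last two slides. This is a sketch under many simplifications: the rescaling constant, the step size, and the stopping threshold are placeholders rather than the paper's closed-form values, and it reuses the brute-force `amo` interface sketched earlier.

```python
import numpy as np

def solve_good_distribution(policies, contexts, r_hats, regrets, amo, K, mu,
                            regret_budget, max_iters=50):
    """Coordinate-descent sketch for the 'good policy distribution' problem.
    Q is a sparse dict {policy index: weight} (a sub-distribution).  The rescaling
    constant c, the step size alpha, and the stopping threshold are crude
    placeholders; the paper derives closed-form values for all of them."""
    Q = {}

    def q_smoothed(x):
        # Smoothed projection Q^mu(. | x) of the current Q onto actions.
        p = np.full(K, mu)
        for j, w in Q.items():
            p[policies[j](x)] += (1.0 - K * mu) * w
        return p

    for _ in range(max_iters):
        # Step 1: if the low-regret constraint is violated, rescale Q down.
        total_reg = sum(w * regrets[j] for j, w in Q.items())
        if total_reg > regret_budget:
            c = regret_budget / total_reg                # placeholder for the closed-form c < 1
            Q = {j: c * w for j, w in Q.items()}
        # Step 2: find the most violated low-variance constraint with one AMO call
        # on the fictitious rewards  r~_i(a) = r^_i(a) + mu / Q^mu(a | x_i).
        fict = [(x, rh + mu / q_smoothed(x)) for x, rh in zip(contexts, r_hats)]
        pi_star = amo(policies, fict)
        score = np.mean([rho[pi_star(x)] for x, rho in fict])
        if score <= 1.0 + 2.0 * K * mu:                  # placeholder stopping threshold
            return Q                                     # no violated constraint: Q is feasible
        j_star = policies.index(pi_star)
        Q[j_star] = Q.get(j_star, 0.0) + mu              # placeholder for the closed-form alpha > 0
    return Q
```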

Iteration bound

The solver is coordinate descent for minimizing the potential function

    \Phi(Q) := c_1\,\hat{\mathbb{E}}_x\big[\mathrm{RE}\big(\text{uniform} \,\|\, Q(\cdot \mid x)\big)\big] \;+\; c_2 \sum_{\pi \in \Pi} Q(\pi)\,\widehat{\mathrm{Reg}}_t(\pi).

(The partial derivative with respect to Q(π) is the low-variance constraint for π.)

Returns a feasible solution after Õ(√(Kt / log N)) steps.

(Actually use (1 − ε)Q + ε·uniform inside the RE expression.)

Algorithm

1: Pick an initial distribution Q_1 over policies Π.
2: for round t = 1, 2, ... do
3:   Nature draws (x_t, r_t) from distribution D over X × [0,1]^A.
4:   Observe context x_t.
5:   Compute the action distribution p_t := Q_t(· | x_t).
6:   Pick action a_t ~ p_t.
7:   Collect reward r_t(a_t).
8:   Compute the new policy distribution Q_{t+1} using coordinate descent + AMO.
9: end for
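Putting the pieces together, the main loop is short. The sketch below reuses the placeholder helpers from the earlier snippets (`draw_context_and_rewards`, `amo`, `solve_good_distribution`) and only captures the shape of the algorithm, not a faithful implementation; in particular it enumerates Π to form the regret estimates, which the real algorithm avoids, and its choice of µ_t is an assumption.

```python
import numpy as np

rng = np.random.default_rng(0)

def run(policies, T, K, draw_context_and_rewards, solve_good_distribution, amo):
    """Schematic main loop; recomputes Q every round (the epoch trick below avoids this)."""
    contexts, r_hats = [], []
    Q = {}                                               # sparse policy distribution
    N = len(policies)
    for t in range(1, T + 1):
        x, r = draw_context_and_rewards()
        mu = min(1.0 / (2 * K), np.sqrt(np.log(N) / (K * t)))   # smoothing parameter (placeholder choice)
        # p_t := Q_t^{mu_t}(. | x_t): project Q onto actions, then smooth.
        p = np.full(K, mu)
        for j, w in Q.items():
            p[policies[j](x)] += (1.0 - K * mu) * w
        p = p / p.sum()                                  # put leftover sub-distribution mass back (simplification)
        a = rng.choice(K, p=p)
        r_hat = np.zeros(K)
        r_hat[a] = r[a] / p[a]                           # inverse propensity weighting
        contexts.append(x)
        r_hats.append(r_hat)
        # Estimated rewards/regrets for every policy (enumerates Pi here only for
        # illustration; the real algorithm avoids this via the AMO).
        rew = np.array([np.mean([rh[pi(xx)] for xx, rh in zip(contexts, r_hats)])
                        for pi in policies])
        regrets = rew.max() - rew
        # Recompute the policy distribution via coordinate descent + AMO.
        Q = solve_good_distribution(policies, contexts, r_hats, regrets, amo, K, mu,
                                    regret_budget=np.sqrt(K * np.log(N) / t))
    return Q
```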

Recap

- A feasible solution to the "good policy distribution" problem gives a near-optimal regret bound.
- New coordinate descent algorithm: repeatedly find a violated constraint and adjust Q to satisfy it.
- Analysis: in round t, nnz(Q_{t+1}) = O(# AMO calls) = Õ(√(Kt / log N)).

3. Additional tricks: warm start and epoch structure

Total complexity over all rounds

In round t, coordinate descent for computing Q_{t+1} requires Õ(√(Kt / log N)) AMO calls.

To compute Q_{t+1} in every round t = 1, 2, ..., T, we would therefore need Õ(√(K / log N) · T^{1.5}) AMO calls over T rounds (since summing √t over t ≤ T gives order T^{1.5}).

Warm start

To compute Q_{t+1} using coordinate descent, initialize with Q_t.

1. The total epoch-to-epoch increase in the potential is Õ(√(T/K)) over all T rounds (w.h.p., exploiting the i.i.d. assumption).
2. Each coordinate descent step decreases the potential by Ω(√(log N) / K).
3. So over all T rounds, the total # of calls to the AMO is Õ(√(KT / log N)).

But we still need an AMO call just to check whether Q_t is feasible!

Epoch trick

Regret analysis: Q_t has low instantaneous per-round regret (this also crucially relies on the i.i.d. assumption).
⟹ the same Q_t can be used for O(t) more rounds!

Epoch trick: split the T rounds into epochs, and only compute Q_t at the start of each epoch.

- Doubling: only update on rounds 2^1, 2^2, 2^3, 2^4, ...
  About log T updates, so Õ(√(KT / log N)) AMO calls overall.
- Squares: only update on rounds 1^2, 2^2, 3^2, 4^2, ...
  About √T updates, so Õ(√(K / log N)) AMO calls per update, on average.
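The epoch schedules are easy to state in code; a small sketch of the two update schedules (the helper names are mine):

```python
import math

def is_update_round_squares(t):
    """Squares schedule: recompute Q only on rounds 1, 4, 9, 16, ... (about sqrt(T) updates)."""
    return math.isqrt(t) ** 2 == t

def is_update_round_doubling(t):
    """Doubling schedule: recompute Q only on rounds 2, 4, 8, 16, ... (about log T updates)."""
    return t >= 2 and (t & (t - 1)) == 0

print([t for t in range(1, 101) if is_update_round_squares(t)])
# [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
```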

Warm start + epoch trick

Over all T rounds:
- Update the policy distribution on rounds 1^2, 2^2, 3^2, 4^2, ..., i.e., a total of √T times.
- Total # calls to the AMO: Õ(√(KT / log N)).
- # AMO calls per update (on average): Õ(√(K / log N)).

4. Closing remarks and open problems

Recap

1. New algorithm for general contextual bandits.
2. Accesses the policy class Π only via the AMO.
3. Defined a convex feasibility problem over policy distributions that are good for exploration/exploitation:
   - Regret: Õ(√(K log(N) / T)).
   - Coordinate descent finds an Õ(√(KT / log N))-sparse solution.
4. The epoch structure allows the policy distribution to change very infrequently; combining it with warm start gives the computational improvements.

Open problems

1. Empirical evaluation.
2. Adaptive algorithms that take advantage of problem easiness.
3. Alternatives to the AMO.

Thanks!


Projections of policy distributions

Given a policy distribution Q and context x, for each a ∈ A,

    Q(a \mid x) := \sum_{\pi \in \Pi} Q(\pi)\,\mathbf{1}\{\pi(x) = a\}

(so Q ↦ Q(· | x) is a linear map).

We actually use the smoothed projection

    p_t := Q_t^{\mu_t}(\cdot \mid x_t) := (1 - K\mu_t)\,Q_t(\cdot \mid x_t) + \mu_t \mathbf{1},

so every action has probability at least µ_t (to be determined).
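In code, the smoothed projection is one short function; a sketch using the same sparse dictionary representation of Q as the earlier snippets, with µ as a given parameter:

```python
import numpy as np

def smoothed_projection(Q, policies, x, K, mu):
    """p := (1 - K*mu) * Q(. | x) + mu * 1.
    Q is a sparse dict {policy index: weight}; Q(a | x) sums the weight of every
    policy that picks action a on context x, and mixing with mu * 1 guarantees
    each action has probability at least mu."""
    p = np.full(K, mu)
    for j, w in Q.items():
        p[policies[j](x)] += (1.0 - K * mu) * w
    return p
```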

The potential function

    \Phi(Q) := \frac{t\mu_t}{1 - K\mu_t}\,\hat{\mathbb{E}}_{x \sim H_t}\big[\mathrm{RE}\big(\text{uniform} \,\|\, Q^{\mu_t}(\cdot \mid x)\big)\big] \;+\; \frac{1}{Kt\mu_t}\sum_{\pi \in \Pi} Q(\pi)\,\widehat{\mathrm{Reg}}_t(\pi).