Learning with Exploration
John Langford (Yahoo!) {With help from many}
Austin, March 24, 2011

Example of Learning through Exploration
Yahoo! wants to interactively choose content and use the observed feedback to improve future content choices.
Repeatedly:
1. A user comes to Yahoo! (with a history of previous visits, IP address, data related to their Yahoo! account).
2. Yahoo! chooses information to present (from urls, ads, news stories).
3. The user reacts to the presented information (clicks on something, comes back and clicks again, et cetera).

Another Example: Clinical Decision Making
Repeatedly:
1. A patient comes to a doctor with symptoms, medical history, test results.
2. The doctor chooses a treatment.
3. The patient responds to it.
The doctor wants a policy for choosing targeted treatments for individual patients.

The Contextual Bandit Setting
For t = 1, ..., T:
1. The world produces some context x_t ∈ X.
2. The learner chooses an action a_t ∈ {1, ..., K}.
3. The world reacts with reward r_t(a_t) ∈ [0, 1].

Goal: Learn a good policy for choosing actions given context.

What does learning mean? Efficiently competing with a large reference class of possible policies Π = {π : X → {1, ..., K}}:

    Regret = \max_{\pi \in \Pi} \sum_{t=1}^{T} r_t(\pi(x_t)) - \sum_{t=1}^{T} r_t(a_t)

Other names: associative reinforcement learning, associative bandits, learning with partial feedback, bandits with side information.
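
As a concrete illustration (not from the talk), here is a minimal Python sketch of this interaction protocol and of the regret computation against a small policy class. The simulated environment, the policy class, and all names here are hypothetical stand-ins.

    import numpy as np

    rng = np.random.default_rng(0)
    T, K, d = 1000, 4, 5
    theta = rng.random((K, d))                      # hidden per-action reward parameters

    def draw_context():
        return rng.random(d)                        # context x_t in X = [0, 1]^d

    def reward(x, a):
        # noisy reward in [0, 1]; in the real problem it is observed only for the chosen action
        return float(np.clip(theta[a] @ x / d + 0.1 * rng.standard_normal(), 0.0, 1.0))

    # A small illustrative policy class Pi: each policy keys off one feature of the context.
    policies = [lambda x, j=j: int(x[j] * K) % K for j in range(d)]

    history = []
    for t in range(T):
        x = draw_context()                          # 1. world produces context x_t
        a = int(rng.integers(K))                    # 2. learner chooses a_t (here: uniformly at random)
        r = reward(x, a)                            # 3. world reacts with reward r_t(a_t)
        history.append((x, a, r))

    # Regret against the best policy in Pi. Note this re-queries counterfactual rewards,
    # which is only possible in a simulator; in the real problem they are never observed.
    learner_total = sum(r for _, _, r in history)
    best_total = max(sum(reward(x, pi(x)) for x, _, _ in history) for pi in policies)
    print("empirical regret:", best_total - learner_total)

The impossibility of the last step outside a simulator is exactly why the evaluation methods below are needed.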

Basic Observation #1
This is not a supervised learning problem: we don't know the reward of actions not taken, so the loss function is unknown even at training time. Exploration is required to succeed (but this is still simpler than reinforcement learning: we know which action is responsible for each reward).

Basic Observation #2
This is not a bandit problem: in the bandit setting, there is no x, and the goal is to compete with the set of constant actions. Too weak in practice. Generalization across x is required to succeed.

Outline
1. What is it?
2. How can we Evaluate?
3. How can we Learn?

The Evaluation Problem
Let π : X → A be a policy mapping features to actions. How do we evaluate it?
Method 1: Deploy algorithm in the world.
1. Found company.
2. Get lots of business.
3. Deploy algorithm.
VERY expensive and VERY noisy.

How do we measure a Static Policy?
Let π : X → A be a policy mapping features to actions. How do we evaluate it?
Answer: Collect T samples of the form (x, a, r_a, p_a), where p_a = p(a | x) is the probability of choosing action a, then evaluate:

    Value(\pi) = \frac{1}{T} \sum_{(x, a, p_a, r_a)} \frac{r_a \, I(\pi(x) = a)}{p_a}

Theorem: For all policies π, for all IID data distributions D, Value(π) is an unbiased estimate of the expected reward of π:

    E_{(x, \vec r) \sim D}\left[ r_{\pi(x)} \right] = E[Value(\pi)]

with deviations bounded by [Kearns et al. 00, adapted]:

    O\left( \frac{1}{\sqrt{T \, \min_x p_{\pi(x)}}} \right)

Proof [Part 1]: For all π, x, p(a), r_a:

    E_{a \sim p}\left[ \frac{r_a I(\pi(x) = a)}{p(a)} \right] = \sum_a p(a) \, \frac{r_a I(\pi(x) = a)}{p(a)} = r_{\pi(x)}
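
A minimal sketch of this importance-weighted (inverse propensity score) estimator, assuming logged tuples in the (x, a, r_a, p_a) form above; `policy` and `logged_data` are hypothetical names.

    def ips_value(policy, logged_data):
        """Estimate Value(pi) from logged (x, a, r, p) tuples, where p = p(a|x)
        is the probability with which the logging policy chose action a."""
        total = 0.0
        for x, a, r, p in logged_data:
            if policy(x) == a:            # indicator I(pi(x) = a)
                total += r / p            # importance weight 1 / p_a
        return total / len(logged_data)

Only samples where the evaluated policy agrees with the logged action contribute, and rare actions get weight 1/p_a, which is where the min_x p_{π(x)} term in the deviation bound comes from.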

Doubly Robust Policy Evaluation [Dudik, Langford, Li 2011]
Basic question: Can we reduce the variance of a policy estimate?
Suppose we have a reward estimate r̂(a, x); then we can form an estimator according to:

    \frac{(r - \hat r(a, x)) \, I(\pi(x) = a)}{p(a | x)} + \hat r(\pi(x), x)

Or even, with an estimated probability p̂(a | x):

    Value_{DR}(\pi) = \frac{1}{T} \sum_{(x, a, r, \hat p)} \left[ \frac{(r - \hat r(a, x)) \, I(\pi(x) = a)}{\hat p(a | x)} + \hat r(\pi(x), x) \right]
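
A corresponding sketch of the doubly robust estimator; `r_hat` is any reward regressor, and the logged propensity may itself be an estimate p̂(a | x). Names are placeholders.

    def dr_value(policy, logged_data, r_hat):
        """Doubly robust estimate of Value(pi) from logged (x, a, r, p_hat) tuples."""
        total = 0.0
        for x, a, r, p_hat in logged_data:
            chosen = policy(x)
            correction = (r - r_hat(a, x)) / p_hat if chosen == a else 0.0
            total += correction + r_hat(chosen, x)   # regression estimate + importance-weighted residual
        return total / len(logged_data)

If r̂ is exact the residual has zero mean, and if p̂ is exact the estimator is unbiased regardless of r̂; either suffices, which the analysis on the next slide makes precise.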

Analysis

    Value_{DR}(\pi) = \frac{1}{T} \sum_{(x, a, r, \hat p)} \left[ \frac{(r - \hat r(a, x)) \, I(\pi(x) = a)}{\hat p(a | x)} + \hat r(\pi(x), x) \right]

Let Δ(a, x) = r̂(a, x) − E_{\vec r | x}[r_a] (the reward deviation).
Let δ(a, x) = 1 − p(a | x) / p̂(a | x) (the probability deviation).

Theorem: For all policies π and all (x, \vec r), taking the expectation over the logged action a ~ p(· | x), the per-sample doubly robust estimate satisfies

    E[\,\text{DR estimate} \mid x\,] - E_{\vec r | x}\left[ r_{\pi(x)} \right] = \Delta(\pi(x), x) \, \delta(\pi(x), x)

In essence: the deviations multiply, and since deviations are < 1 this is good.
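
The transcript omits the calculation behind this identity; here is a one-line sketch, conditioning on x and taking expectations over a ~ p(· | x) and the reward, with Δ and δ evaluated at (π(x), x):

    E\left[ \frac{(r - \hat r(a, x)) I(\pi(x) = a)}{\hat p(a | x)} + \hat r(\pi(x), x) \,\middle|\, x \right]
      = \frac{p(\pi(x) | x)}{\hat p(\pi(x) | x)} \left( E[r_{\pi(x)} | x] - \hat r(\pi(x), x) \right) + \hat r(\pi(x), x)
      = (1 - \delta)(-\Delta) + \hat r(\pi(x), x)
      = E[r_{\pi(x)} | x] + \Delta \, \delta,

so the bias is exactly the product of the reward deviation and the probability deviation.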

An Empirical Test
1. Pick some UCI multiclass datasets.
2. Generate (x, a, r, p) quads via uniform random exploration of actions.
3. Learn r̂(a, x).
4. For each x, compute the doubly robust estimate for every action a', where a is the action actually logged:

    \frac{(r - \hat r(a, x)) \, I(a' = a)}{p(a | x)} + \hat r(a', x)

5. Learn π using a cost-sensitive classifier (see the sketch below).
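
A sketch of steps 2 and 4 of this pipeline (uniform-exploration logging of a multiclass dataset, then per-action doubly robust estimates); dataset loading, the regressor r_hat, and the cost-sensitive learner are left abstract, and all names are placeholders.

    import numpy as np

    def multiclass_to_bandit(X, y, K, rng):
        """Step 2: reveal the reward of one uniformly chosen action per example."""
        logged = []
        for x, label in zip(X, y):
            a = int(rng.integers(K))                  # uniform random exploration
            r = 1.0 if a == label else 0.0            # multiclass reward: 1 iff correct class
            logged.append((x, a, r, 1.0 / K))         # p(a|x) = 1/K
        return logged

    def dr_reward_matrix(logged, K, r_hat):
        """Step 4: doubly robust reward estimate for every action of every example."""
        rewards = np.zeros((len(logged), K))
        for i, (x, a, r, p) in enumerate(logged):
            for a_prime in range(K):
                correction = (r - r_hat(a, x)) / p if a_prime == a else 0.0
                rewards[i, a_prime] = correction + r_hat(a_prime, x)
        return rewards    # 1 - rewards can be fed as costs to a cost-sensitive classifier (step 5)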

Experimental Results
IPS: r̂ = 0.  DR: r̂ = w_a · x.
Filter Tree = [Beygelzimer, Langford, Ravikumar 2009], a CSMC reduction to decision trees.
Offset Tree = [Beygelzimer, Langford 2009], a direct reduction to decision trees.
[Bar chart: classification error of IPS (Filter Tree), DR (Filter Tree), and Offset Tree on the UCI datasets ecoli, glass, letter, optdigits, page-blocks, pendigits, satimage, vehicle, and yeast.]

Experimental Results
IPS: r̂ = 0.  DR: r̂ = w_a · x.
DLM = [McAllester, Hazan, Keshet 2010], CSMC on a linear representation.
Offset Tree = [Beygelzimer, Langford 2009], a direct reduction to decision trees.
[Bar chart: classification error of IPS (DLM), DR (DLM), and Offset Tree on the same UCI datasets.]

Outline
1. What is it?
2. How can we Evaluate?
3. How can we Learn?

Exponential Weight Algorithm for Exploration and Exploitation with Experts (EXP4) [Auer et al. 95]
Initialization: ∀π ∈ Π : w_1(π) = 1.
For each t = 1, 2, ...:
1. Observe x_t and let, for a = 1, ..., K,

    p_t(a) = (1 - K\mu) \, \frac{\sum_{\pi} 1[\pi(x_t) = a] \, w_t(\pi)}{\sum_{\pi} w_t(\pi)} + \mu,

   where µ = \sqrt{\ln|\Pi| / (KT)} is the minimum probability.
2. Draw a_t from p_t, and observe reward r_t(a_t).
3. Update, for each π ∈ Π:

    w_{t+1}(\pi) = \begin{cases} w_t(\pi) \exp\left( \mu \, \frac{r_t(a_t)}{p_t(a_t)} \right) & \text{if } \pi(x_t) = a_t \\ w_t(\pi) & \text{otherwise} \end{cases}
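
A minimal sketch of EXP4 with an explicitly enumerated (small) policy set; looping over every policy each round is exactly the running-time cost discussed on the next slide. The function and variable names are placeholders.

    import numpy as np

    def exp4(policies, K, T, get_context, get_reward, mu, rng):
        """policies: list of functions x -> action in {0,...,K-1}; mu: minimum action probability."""
        w = np.ones(len(policies))
        for t in range(T):
            x = get_context(t)
            votes = np.array([pi(x) for pi in policies])
            p = np.zeros(K)
            for a in range(K):
                p[a] = (1 - K * mu) * w[votes == a].sum() / w.sum() + mu
            p /= p.sum()                                    # guard against floating-point drift
            a_t = int(rng.choice(K, p=p))
            r_t = get_reward(t, x, a_t)                     # only the chosen action's reward is seen
            w[votes == a_t] *= np.exp(mu * r_t / p[a_t])    # exponential update for agreeing policies
        return w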

What do we know about EXP4?
Theorem [Auer et al. 95]: For all oblivious sequences (x_1, \vec r_1), ..., (x_T, \vec r_T), EXP4 has expected regret

    O\left( \sqrt{TK \ln |\Pi|} \right).

Theorem [Auer et al. 95]: For any T, there exists an iid sequence such that the expected regret of any player is Ω(\sqrt{TK}).

EXP4 can be modified to succeed with high probability, or over VC sets when the world is IID [Beygelzimer et al. 2011].

EXP4 is slow: Ω(TN) running time, where N = |Π|. Exponentially slower than is typical for supervised learning. No reasonable oracle-ized algorithm for speeding it up.

A new algorithm [Dudik, Hsu, Kale, Karampatziakis, Langford, Reyzin, Zhang 2011]
Policy Elimination
Let Π_0 = Π and µ_t = 1/\sqrt{Kt}.
For each t = 1, 2, ...:
1. Choose a distribution P_t over Π_{t-1} such that for all π ∈ Π_{t-1}:

    E_{x \sim D_X}\left[ \frac{1}{(1 - K\mu_t) \, \Pr_{\pi' \sim P_t}(\pi'(x) = \pi(x)) + \mu_t} \right] \le 2K

2. Observe x_t.
3. Let p_t(a) = (1 − Kµ_t) Pr_{π' ∼ P_t}(π'(x_t) = a) + µ_t.
4. Choose a_t ∼ p_t and observe reward r_t.
5. Let Π_t = {π ∈ Π_{t-1} : η_t(π) ≥ max_{π' ∈ Π_{t-1}} η_t(π') − Kµ_t} (η_t(π): the empirical importance-weighted value estimate of π, as in Value(π) above).
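
For concreteness, a minimal sketch (not from the talk) of step 3, projecting the policy distribution P_t onto a smoothed action distribution; `policies` and `weights` are placeholder names for a finite policy set and the distribution P_t over it.

    import numpy as np

    def smoothed_action_distribution(x, policies, weights, K, mu_t):
        """p_t(a) = (1 - K*mu_t) * Pr_{pi ~ P_t}(pi(x) = a) + mu_t."""
        p = np.zeros(K)
        for pi, w in zip(policies, weights):
            p[pi(x)] += w                       # accumulate Pr_{pi ~ P_t}(pi(x) = a)
        return (1.0 - K * mu_t) * p + mu_t      # sums to 1 as long as mu_t <= 1/K

Every action then has probability at least µ_t, so the importance weights inside the value estimates η_t stay bounded by 1/µ_t.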

Analysis
For all sets of policies Π, for all distributions D(x, \vec r), if the world is IID w.r.t. D, then with high probability Policy Elimination has expected regret

    O\left( \sqrt{TK \ln |\Pi|} \right).

A key lemma: For any set of policies Π and any distribution over x, step 1 is possible.

Proof: Consider the game

    \min_P \max_Q \; E_{\pi \sim Q} E_x \left[ \frac{1}{(1 - K\mu_t) \Pr_{\pi' \sim P}(\pi'(x) = \pi(x)) + \mu_t} \right]

Minimax magic!

    = \max_Q \min_P \; E_{\pi \sim Q} E_x \left[ \frac{1}{(1 - K\mu_t) \Pr_{\pi' \sim P}(\pi'(x) = \pi(x)) + \mu_t} \right]

Let P = Q:

    \le \max_Q \; E_{\pi \sim Q} E_x \left[ \frac{1}{(1 - K\mu_t) \Pr_{\pi' \sim Q}(\pi'(x) = \pi(x)) + \mu_t} \right]

Linearity of expectation:

    = \max_Q \; E_x \sum_a \frac{\Pr_{\pi \sim Q}(\pi(x) = a)}{(1 - K\mu_t) \Pr_{\pi' \sim Q}(\pi'(x) = a) + \mu_t}
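
The transcript ends mid-proof; a sketch of the remaining step (the standard conclusion, assuming µ_t ≤ 1/(2K)) bounds each term of the sum:

    \frac{\Pr_{\pi \sim Q}(\pi(x) = a)}{(1 - K\mu_t) \Pr_{\pi' \sim Q}(\pi'(x) = a) + \mu_t} \le \frac{1}{1 - K\mu_t}
    \quad \text{for every action } a,

so the whole expression is at most K/(1 − Kµ_t) ≤ 2K. By the minimax equality this also bounds min_P max_Q, and since the max over distributions Q equals the max over individual policies π, some distribution P satisfying the constraint in step 1 exists.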

Another Use of minimax [DHKKLRZ11]
Randomized UCB
Let µ_t = \sqrt{\ln|\Pi| / (Kt)}.
Let Δ_t(π) = max_{π'} η_t(π') − η_t(π).
For each t = 1, 2, ...:
1. Choose a distribution P over Π minimizing E_{π ∼ P}[Δ_t(π)] subject to, for all π:

    E_{x \sim h_t}\left[ \frac{1}{(1 - K\mu_t) \, \Pr_{\pi' \sim P}(\pi'(x) = \pi(x)) + \mu_t} \right] \le \max\{2K, \; C \, t \, (\Delta_t(\pi))^2\}

   (h_t: the history of contexts observed so far)
2. Observe x_t.
3. Let p_t(a) = (1 − Kµ_t) Pr_{π' ∼ P}(π'(x_t) = a) + µ_t.
4. Choose a_t ∼ p_t and observe reward r_t.

Randomized UCB analysis
For all sets of policies Π, for all distributions D(x, \vec r), if the world is IID w.r.t. D, then with high probability Randomized UCB has expected regret

    O\left( \sqrt{TK \ln |\Pi|} \right).

And: Given a cost-sensitive optimization oracle for Π, Randomized UCB runs in time Poly(t, K, log |Π|)!

Uses the ellipsoid algorithm for convex programming. The first general non-exponential-time algorithm for contextual bandits.

Final Thoughts and pointers
Two papers coming to an arXiv near you.
Great background in the Exploration and Learning tutorial: http://hunch.net/~exploration_learning
Further contextual bandit discussion: http://hunch.net/