The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits John Langford and Tong Zhang


The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits. John Langford and Tong Zhang. Presentation by Terry Lam, 02/2011.

Outline: The Contextual Bandit Problem; Prior Works; The Epoch-Greedy Algorithm; Analysis.

Standard k-armed bandit problem. The world chooses k rewards r_1, r_2, ..., r_k ∈ [0, 1]. The player chooses an arm a ∈ {1, 2, ..., k} without knowledge of the world's chosen rewards. The player observes the reward r_a; only the reward of the pulled arm is observed.

Contextual Bandits. The player observes context information x. The world chooses k rewards r_1, r_2, ..., r_k ∈ [0, 1]. The player chooses an arm a ∈ {1, 2, ..., k} without knowledge of the world's chosen rewards. The player observes the reward r_a; only the reward of the pulled arm is observed.

Why Contextual Bandits? Context information is common in practice, for example when matching ads to web pages: the bandit arms are the ads, the context information is the web-page content and the visitor profile, and the reward is the revenue from clicked ads. Goal: show relevant ads on each page to maximize the expected revenue.

Definitions. Contextual bandit problem: (x, r) ~ P, where P is a distribution, x is the context, a ∈ {1, 2, ..., k} is the arm to be pulled, and r_a ∈ [0, 1] is the reward for arm a. Repeated game: at each round, a sample (x, r_1, r_2, ..., r_k) is drawn from P, the context x is announced, the player chooses an arm a, and only the reward r_a is revealed.
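To make the repeated game concrete, here is a minimal Python sketch of the interaction protocol; the environment, the function names, and the uniform-random player are hypothetical and serve only to illustrate the information flow (the world draws (x, r) from P, the context is announced, and only the pulled arm's reward is revealed).

import numpy as np

def run_contextual_bandit(T, draw_context_and_rewards, choose_arm, rng):
    # Play the repeated game: each round the world draws (x, r_1, ..., r_k) ~ P,
    # the context x is announced, the player picks an arm a, and only r_a is revealed.
    history = []          # observed (x_t, a_t, r_{a_t, t}) triples
    total_reward = 0.0
    for t in range(T):
        x, r = draw_context_and_rewards(rng)   # r is a length-k vector of rewards in [0, 1]
        a = choose_arm(x, history, rng)        # the player sees only x and past observations
        total_reward += r[a]
        history.append((x, a, r[a]))
    return total_reward

# Hypothetical environment: 3 arms, 2-dimensional contexts, rewards tied to the context.
def draw_context_and_rewards(rng, k=3):
    x = rng.normal(size=2)
    r = np.clip(0.5 + 0.2 * x[0] * np.arange(k) + 0.1 * rng.normal(size=k), 0.0, 1.0)
    return x, r

# A uniformly random player, used only to exercise the protocol.
reward = run_contextual_bandit(
    T=1000,
    draw_context_and_rewards=draw_context_and_rewards,
    choose_arm=lambda x, history, rng: int(rng.integers(3)),
    rng=np.random.default_rng(0),
)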

Definitions. Contextual bandit algorithm B: at time step t, decide which arm a ∈ {1, 2, ..., k} to pull. Known information: the current context x_t and the previous observations (x_1, a_1, r_{a_1,1}), ..., (x_{t-1}, a_{t-1}, r_{a_{t-1},t-1}). Goal: maximize the expected total reward.

Definitions. A hypothesis h maps a context x to an arm a, i.e. h : X → {1, ..., k}. Expected reward: R(h) = E_{(x, r) ~ P}[r_{h(x)}]. The expected regret of B w.r.t. h up to time T is T·R(h) minus the expected total reward collected by B over the first T steps. Hypothesis space H: a set of hypotheses h. The expected regret of B w.r.t. H up to time T is its regret w.r.t. the best hypothesis in H; B has to compete with the best hypothesis in H!

Outline: The Contextual Bandit Problem; Prior Works; The Epoch-Greedy Algorithm; Analysis.

Prior works: EXP3 (Auer et al., 1995). Standard multi-armed bandits; the context information is lost. Maintain a weight w_i(t) for each arm i and set p_i(t) = (1 − γ) w_i(t) / Σ_j w_j(t) + γ/k. At each time t: draw i_t according to p_1(t), ..., p_k(t); receive the reward x_{i_t}(t) ∈ [0, 1]; for i = 1, ..., k form the importance-weighted estimate x̂_i(t) = x_i(t)/p_i(t) if i = i_t and 0 otherwise, and update w_i(t+1) = w_i(t) exp(γ x̂_i(t)/k). Regret bound versus the best arm: O(√(T k ln k)).
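Below is a minimal Python sketch of EXP3 as summarized above (exponential weights over arms with uniform exploration mixed in and importance-weighted reward estimates); the Bernoulli environment and the choice of γ are hypothetical.

import numpy as np

def exp3(T, k, gamma, pull_arm, rng):
    # EXP3: one weight per arm; mix the normalized weights with uniform exploration.
    w = np.ones(k)
    total_reward = 0.0
    for t in range(T):
        p = (1.0 - gamma) * w / w.sum() + gamma / k
        i = int(rng.choice(k, p=p))
        x = pull_arm(i, rng)                 # observed reward in [0, 1]
        total_reward += x
        x_hat = x / p[i]                     # importance-weighted estimate for the pulled arm only
        w[i] *= np.exp(gamma * x_hat / k)    # multiplicative weight update
    return total_reward

# Hypothetical Bernoulli arms, used only to exercise the algorithm.
means = np.array([0.3, 0.5, 0.7])
reward = exp3(
    T=10_000, k=3, gamma=0.05,
    pull_arm=lambda i, rng: float(rng.random() < means[i]),
    rng=np.random.default_rng(0),
)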

Prior works: EXP4 (Auer et al., 1995). Combine the advice of m experts. At time t: get advice vectors ξ^1(t), ξ^2(t), ..., ξ^m(t), where each expert advises a distribution over the arms. Maintain a weight w_j(t) for expert j and let W(t) = Σ_j w_j(t). Combine the advice into the final distribution over arms: for arm i, p_i(t) = (1 − γ) Σ_j w_j(t) ξ^j_i(t) / W(t) + γ/k. There is still an exploration parameter γ.

Prior works: EXP4 (cont.). At time t: draw i_t according to p_1(t), ..., p_k(t) and receive the reward x_{i_t}(t) ∈ [0, 1]. For each arm i = 1, ..., k form the estimate x̂_i(t) = x_i(t)/p_i(t) if i = i_t and 0 otherwise; for each expert j = 1, ..., m update w_j(t+1) = w_j(t) exp(γ ξ^j(t)·x̂(t) / k). Regret w.r.t. the best expert: O(√(T k ln m)).
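The same kind of sketch for EXP4: the weights live on the m experts rather than the k arms, their advice distributions are mixed according to the weights, and each expert is credited the importance-weighted reward its advice would have earned. The two constant experts in the usage example are hypothetical.

import numpy as np

def exp4(T, k, m, gamma, get_advice, pull_arm, rng):
    # EXP4: exponential weights over m experts, each advising a distribution over k arms.
    w = np.ones(m)
    total_reward = 0.0
    for t in range(T):
        xi = get_advice(t)                                  # shape (m, k); each row sums to 1
        p = (1.0 - gamma) * (w @ xi) / w.sum() + gamma / k  # combined distribution over arms
        i = int(rng.choice(k, p=p))
        x = pull_arm(i, rng)
        total_reward += x
        x_hat = np.zeros(k)
        x_hat[i] = x / p[i]                                 # importance-weighted reward vector
        w *= np.exp(gamma * (xi @ x_hat) / k)               # credit each expert its estimated reward
    return total_reward

# Hypothetical setup: two experts, each deterministically recommending one of two arms.
advice = np.array([[1.0, 0.0], [0.0, 1.0]])
means = np.array([0.4, 0.6])
reward = exp4(
    T=10_000, k=2, m=2, gamma=0.05,
    get_advice=lambda t: advice,
    pull_arm=lambda i, rng: float(rng.random() < means[i]),
    rng=np.random.default_rng(0),
)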

Epoch-Greedy properties. Needs no knowledge of the time horizon T. Regret bound O(T^{2/3} ln^{1/3} m), where m = |H| is the size of the hypothesis space (each hypothesis plays the role of an expert). The bound becomes O(ln T) under certain structure of the hypothesis space. Reduced computational complexity.

Outline: The Contextual Bandit Problem; Prior Works; The Epoch-Greedy Algorithm; Analysis.

Intuition: T is known. First phase: n steps of exploration by pulling arms at random. Second phase: exploitation. Let ε_n be the average regret of one exploitation step after n exploration steps. Total regret: n + (T − n) ε_n. Pick n to minimize the total regret.
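As a worked version of this trade-off, assume (as the finite-hypothesis analysis later makes precise via Bernstein's inequality) that ε_n = O(√(k ln m / n)). Then
\[
n + (T - n)\,\epsilon_n \;\lesssim\; n + T\sqrt{\tfrac{k\ln m}{n}},
\qquad
\frac{d}{dn}\Big(n + T\sqrt{\tfrac{k\ln m}{n}}\Big) = 1 - \tfrac{1}{2}\,T\sqrt{k\ln m}\; n^{-3/2} = 0
\;\Rightarrow\;
n = \Theta\big(T^{2/3}(k\ln m)^{1/3}\big),
\]
and plugging this n back in shows that both terms are of the same order, so the total regret is O(T^{2/3}(k ln m)^{1/3}), matching the bound quoted on the properties slide.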

Intuition: T is unknown. Run exploration/exploitation in epochs. At epoch l: one step of exploration followed by 1/ε_l steps of exploitation. Recall: after l random exploration steps, ε_l is the average regret of one exploitation step.

Intuition: T is unknown (cont.). Total regret after L epochs: at most 2L, since each epoch contributes at most 1 from its exploration step and at most (1/ε_l)·ε_l = 1 from its exploitation steps. Let L_T be the epoch containing time T. It is easy to prove that the resulting regret is no worse than three times the bound obtained with known T and the optimal stopping point.

Algorithm Key Ideas. Three main components: random exploration, which incurs large immediate regret; learning the best hypothesis from the exploration samples, which reduces the regret of future exploitation steps; and exploitation by following the best hypothesis learned so far, which maximizes immediate reward. The algorithm runs in epochs, and each epoch contains exactly one step of random exploration and several steps of exploitation.

Notations. Z_l = (x_l, a_l, r_{a_l, l}): the random exploration sample at epoch l. Z_1^l = {Z_1, ..., Z_l}: the set of all exploration samples up to epoch l. s(Z_1^l): the number of exploitation steps in epoch l; it can be either data-independent or data-dependent. ĥ_l = argmax_{h ∈ H} R̂(h, Z_1^l): the empirical reward maximization estimator (the empirical reward R̂ is defined on a later slide).

Epoch-Greedy algorithm. For epoch l = 1, 2, ...:
  Exploration: observe x_l, pick a_l ∈ {1, 2, ..., k} uniformly at random, and receive the reward r_{a_l, l} ∈ [0, 1].
  Learning: find the best hypothesis ĥ_l = argmax_{h ∈ H} R̂(h, Z_1^l) and compute s(Z_1^l).
  Exploitation: repeat s(Z_1^l) times: observe a context x, select the arm a = ĥ_l(x), and receive the reward r_a ∈ [0, 1].
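A minimal Python sketch of this loop for a finite hypothesis class H, with the learning step done by brute-force empirical reward maximization over H; the exploitation schedule s_l = ⌈c √(l / (k ln m))⌉ follows the analysis below, and the constant c, the environment, and the constant-arm hypotheses in the usage example are hypothetical.

import numpy as np
from math import ceil, log, sqrt

def epoch_greedy(T, k, hypotheses, draw_context_and_rewards, rng, c=1.0):
    # hypotheses: a finite list of functions h(x) -> arm index (the class H, with m = len(hypotheses) >= 2).
    m = len(hypotheses)
    exploration = []          # exploration samples Z_l = (x_l, a_l, r_{a_l, l})
    t, total_reward = 0, 0.0
    l = 0
    while t < T:
        l += 1
        # One step of uniform random exploration.
        x, r = draw_context_and_rewards(rng)
        a = int(rng.integers(k))
        total_reward += r[a]
        exploration.append((x, a, r[a]))
        t += 1
        # Learning: importance-weighted empirical reward maximization over H.
        def empirical_reward(h):
            return float(np.mean([k * ra * (h(xx) == aa) for (xx, aa, ra) in exploration]))
        h_best = max(hypotheses, key=empirical_reward)
        # s_l exploitation steps following the learned hypothesis.
        s_l = ceil(c * sqrt(l / (k * log(m))))
        for _ in range(s_l):
            if t >= T:
                break
            x, r = draw_context_and_rewards(rng)
            total_reward += r[h_best(x)]
            t += 1
    return total_reward

# Hypothetical usage: H = {always pull arm i} and a small synthetic environment.
def env(rng, k=3):
    x = rng.normal(size=2)
    return x, np.clip(0.5 + 0.2 * x[0] * np.arange(k), 0.0, 1.0)

hs = [lambda x, i=i: i for i in range(3)]
reward = epoch_greedy(T=2000, k=3, hypotheses=hs, draw_context_and_rewards=env, rng=np.random.default_rng(0))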

Empirical reward estimation. Dealing with missing observations: in context x, randomly pulling arm a yields only r_a. Since the exploration arm is chosen uniformly at random from the k arms, the fully observed reward r_{h(x)} can be replaced by the importance-weighted quantity k·r_a·1(h(x) = a); i.e., the reward expectation w.r.t. the exploration samples satisfies E_{(x,r)~P, a~Unif}[k·r_a·1(h(x) = a)] = R(h). Empirical reward estimate of h ∈ H: R̂(h, Z_1^l) = (1/l) Σ_{Z ∈ Z_1^l} k·r_a·1(h(x) = a).
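A one-line check that this importance-weighted quantity is unbiased when the exploration arm is drawn uniformly from the k arms:
\[
E_{(x,r)\sim P,\; a\sim \mathrm{Unif}\{1,\dots,k\}}\big[\,k\, r_a\, \mathbf{1}(h(x)=a)\,\big]
= E_{(x,r)\sim P}\Big[\sum_{a=1}^{k}\tfrac{1}{k}\cdot k\, r_a\,\mathbf{1}(h(x)=a)\Big]
= E_{(x,r)\sim P}\big[r_{h(x)}\big] = R(h),
\]
so the empirical average R̂(h, Z_1^l) over the exploration samples is an unbiased estimate of R(h).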

Outline: The Contextual Bandit Problem; Prior Works; The Epoch-Greedy Algorithm; Analysis.

Theorem. Denote the per-epoch exploitation cost μ_l(H, s) = E_{Z_1^l}[ s(Z_1^l) · (max_{h ∈ H} R(h) − R(ĥ_l)) ]. For all T, n_l, L such that T ≤ Σ_{l=1}^{L} n_l, the expected regret of the Epoch-Greedy algorithm satisfies R(Epoch-Greedy, H, T) ≤ L + Σ_{l=1}^{L} μ_l(H, s) + T Σ_{l=1}^{L} Pr[s(Z_1^l) < n_l].

Theorem Proof Sketch. Recall: for all T, n_l, L such that T ≤ Σ_{l=1}^{L} n_l, the expected regret of the Epoch-Greedy algorithm satisfies R(Epoch-Greedy, H, T) ≤ L + Σ_{l=1}^{L} μ_l(H, s) + T Σ_{l=1}^{L} Pr[s(Z_1^l) < n_l]. There are two complementary cases. Case 1: s(Z_1^l) ≥ n_l for all l = 1, ..., L. Case 2: s(Z_1^l) < n_l for some l = 1, ..., L.

Theorem Proof Sketch, Case 1: s(Z_1^l) ≥ n_l for all l = 1, ..., L. Then the first T steps are contained in the first L epochs. The exploitation regret in epoch l is μ_l(H, s), and the exploration regret is at most 1 per epoch. Regret contribution of Case 1: L + Σ_{l=1}^{L} μ_l(H, s).

Theorem Proof Sketch, Case 2: s(Z_1^l) < n_l for some l = 1, ..., L. The regret of Case 2 is at most T, since the regret is at most 1 per step. The probability of Case 2 is at most Σ_{l=1}^{L} Pr[s(Z_1^l) < n_l], so the regret contribution of Case 2 is bounded by T Σ_{l=1}^{L} Pr[s(Z_1^l) < n_l]. Total regret of Case 1 and Case 2: R(Epoch-Greedy, H, T) ≤ L + Σ_{l=1}^{L} μ_l(H, s) + T Σ_{l=1}^{L} Pr[s(Z_1^l) < n_l].

Bound for Finite Hypothesis Space. Denote the size of the hypothesis space by m = |H| < ∞. By Bernstein's inequality, max_{h ∈ H} R(h) − E_{Z_1^l}[R(ĥ_l)] ≤ c √(k ln m / l) for some constant c. Recall the per-epoch exploitation cost μ_l(H, s). Pick s(Z_1^l) = (1/c) √(l / (k ln m)); then μ_l(H, s) ≤ 1.

Bound for Finite Hypothesis Space (cont.). Recall the theorem: for all T, n_l, L such that T ≤ Σ_{l=1}^{L} n_l, the expected regret satisfies R(Epoch-Greedy, H, T) ≤ L + Σ_{l=1}^{L} μ_l(H, s) + T Σ_{l=1}^{L} Pr[s(Z_1^l) < n_l]. Take n_l = s(Z_1^l), which is deterministic here, so Pr[s(Z_1^l) < n_l] = 0. Let L = c' T^{2/3} (k ln m)^{1/3} for some constant c'; then T ≤ Σ_{l=1}^{L} n_l. Therefore R(Epoch-Greedy, H, T) ≤ 2L = O(T^{2/3} (k ln m)^{1/3}).
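A quick check of the arithmetic behind this choice of L, with constants suppressed: taking n_l = s(Z_1^l) ≈ √(l / (k ln m)),
\[
\sum_{l=1}^{L} n_l \;\approx\; \sum_{l=1}^{L}\sqrt{\frac{l}{k\ln m}} \;\approx\; \frac{2}{3}\,\frac{L^{3/2}}{\sqrt{k\ln m}} \;\ge\; T
\quad\text{whenever}\quad
L = \Theta\!\big(T^{2/3}(k\ln m)^{1/3}\big),
\]
and since each μ_l(H, s) ≤ 1 and the failure probabilities vanish, the theorem gives R(Epoch-Greedy, H, T) ≤ L + L = O(T^{2/3} (k ln m)^{1/3}).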

Bound improvement. Let H = {h_1, ..., h_m} and assume WLOG that R(h_1) ≥ R(h_2) ≥ ... ≥ R(h_m). Suppose R(h_1) ≥ R(h_2) + Δ for some Δ > 0; Δ is the gap between the best and the second-best hypothesis. With appropriate parameter choices (s(Z_1^l) is data-dependent in this case), R(Epoch-Greedy, H, T) is bounded by a term of order k(ln m + ln(T + 1))/Δ² plus constants. That means the regret is O(k ln(m) + k ln(T)).

Conclusions. Contextual multi-armed bandits generalize the standard multi-armed bandit problem: the observable context helps decide which arm to pull. Epoch-Greedy controls the exploration/exploitation trade-off through the sample complexity of the hypothesis class, which makes it well suited to large hypothesis spaces and to classes with special structure.