The Epoch-Greedy Algorithm for Contextual Multi-armed Bandits
John Langford and Tong Zhang
Presentation by Terry Lam, 02/2011
Outline
- The Contextual Bandit Problem
- Prior Works
- The Epoch-Greedy Algorithm
- Analysis
Standard k-armed bandit problem
- The world chooses $k$ rewards $r_1, r_2, \dots, r_k \in [0, 1]$
- The player chooses an arm $a \in \{1, 2, \dots, k\}$, without knowledge of the world's chosen rewards
- The player observes the reward $r_a$; only the reward of the pulled arm is observed
Contextual Bandits
- The player observes context information $x$
- The world chooses $k$ rewards $r_1, r_2, \dots, r_k \in [0, 1]$
- The player chooses an arm $a \in \{1, 2, \dots, k\}$, without knowledge of the world's chosen rewards
- The player observes the reward $r_a$; only the reward of the pulled arm is observed
Why Contextual Bandits?
- Context information is common in practice, for example matching ads to web pages:
  - Bandit arms = ads
  - Context information = web-page contents, visitor profile
  - Reward = revenue from clicked ads
- Goal: show relevant ads on each page to maximize the expected revenue
Definitions
- Contextual bandit problem: a distribution $P$ over $(x, \vec{r})$
  - $x$: context
  - $a \in \{1, 2, \dots, k\}$: the arm to be pulled
  - $r_a \in [0, 1]$: the reward for arm $a$
- Repeated game:
  - At each round, a sample $(x, r_1, r_2, \dots, r_k)$ is drawn from $P$
  - The context $x$ is announced
  - The player chooses an arm $a$
  - Only the reward $r_a$ is revealed
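To make the protocol concrete, here is a minimal Python sketch of the repeated game; `draw_P` and the `choose`/`observe` interface are hypothetical stand-ins, not notation from the paper.

```python
def play(bandit, draw_P, T):
    """Minimal sketch of the repeated game. `draw_P()` samples (x, r) from the
    joint distribution P, where r is the full reward vector; `bandit` exposes
    choose/observe methods. Both are hypothetical stand-ins."""
    total = 0.0
    for t in range(T):
        x, r = draw_P()              # world draws (x, r_1, ..., r_k) ~ P
        a = bandit.choose(x)         # player chooses an arm seeing only x
        bandit.observe(x, a, r[a])   # only the pulled arm's reward is revealed
        total += r[a]
    return total
```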
Definitions
- Contextual bandit algorithm $B$: at time step $t$, decides which arm $a \in \{1, 2, \dots, k\}$ to pull
- Known information:
  - Current context $x_t$
  - Previous observations $(x_1, a_1, r_{a_1,1}), \dots, (x_{t-1}, a_{t-1}, r_{a_{t-1},t-1})$
- Goal: maximize the expected total reward
Definitions
- A hypothesis $h$ maps a context $x$ to an arm $a$: $h : X \to \{1, \dots, k\}$
- Expected reward: $R(h) = E_{(x, \vec{r}) \sim P}\left[r_{h(x)}\right]$
- Expected regret of $B$ w.r.t. $h$ up to time $T$
- Hypothesis space $H$: a set of hypotheses $h$
- Expected regret of $B$ w.r.t. $H$ up to time $T$: $B$ is to compete with the best hypothesis in $H$!
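One standard way to write out the two regret definitions the slide refers to, with $r_{a_t,t}$ denoting the reward of the arm pulled at round $t$:

$$R(B, h, T) \;=\; T\,R(h) \;-\; E\!\left[\sum_{t=1}^{T} r_{a_t,t}\right], \qquad R(B, H, T) \;=\; \sup_{h \in H} R(B, h, T) \;=\; T\max_{h \in H} R(h) \;-\; E\!\left[\sum_{t=1}^{T} r_{a_t,t}\right].$$

Minimizing regret w.r.t. $H$ thus means tracking the total reward of the best fixed hypothesis in hindsight.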
Outline
- The Contextual Bandit Problem
- Prior Works
- The Epoch-Greedy Algorithm
- Analysis
Prior works
- EXP3 (Auer et al., 1995): standard multi-armed bandits; context information is lost
- Set $w_i(1) = 1$ for each arm $i$; at time $t$:
  - For arm $i$, set $p_i(t) = (1-\gamma)\,\frac{w_i(t)}{\sum_{j=1}^{k} w_j(t)} + \frac{\gamma}{k}$
  - Draw $i_t$ according to $p_1(t), p_2(t), \dots, p_k(t)$
  - Receive reward $x_{i_t}(t) \in [0, 1]$
  - For $i = 1, \dots, k$: $\hat{x}_i(t) = x_i(t)/p_i(t)$ if $i = i_t$ and $0$ otherwise; $w_i(t+1) = w_i(t)\,\exp\!\left(\gamma\,\hat{x}_i(t)/k\right)$
- Regret bound versus the best arm: $O(\sqrt{Tk\ln k})$
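A minimal NumPy sketch of the EXP3 loop above; the environment callback `pull(t, i)`, returning the reward of arm `i` at round `t`, is a hypothetical stand-in.

```python
import numpy as np

def exp3(T, k, gamma, pull):
    """Minimal EXP3 sketch. `pull(t, i)` returns x_i(t) in [0, 1]."""
    w = np.ones(k)                                 # w_i(1) = 1
    total = 0.0
    for t in range(T):
        p = (1 - gamma) * w / w.sum() + gamma / k  # mix weights with uniform exploration
        i = np.random.choice(k, p=p)               # draw i_t ~ p(t)
        x = pull(t, i)                             # observe only the pulled arm's reward
        total += x
        w[i] *= np.exp(gamma * (x / p[i]) / k)     # importance-weighted exponential update
    return total
```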
Prior works
- EXP4 (Auer et al., 1995): combine the advice of $m$ experts
- At time $t$:
  - Get advice vectors $\xi^1(t), \xi^2(t), \dots, \xi^m(t)$; each expert advises a distribution over the arms
  - Weight $w_j(t)$ for expert $j$; let $W_t = \sum_{j=1}^{m} w_j(t)$
  - Combine for the final distribution over arms: for arm $i$, $p_i(t) = (1-\gamma)\sum_{j=1}^{m}\frac{w_j(t)\,\xi_i^j(t)}{W_t} + \frac{\gamma}{k}$
- Still has the exploration parameter $\gamma$
Prior works
- EXP4 (cont.); at time $t$:
  - Draw $i_t$ according to $p_1(t), p_2(t), \dots, p_k(t)$
  - Receive reward $x_{i_t}(t) \in [0, 1]$
  - For each arm $i = 1, \dots, k$: $\hat{x}_i(t) = x_i(t)/p_i(t)$ if $i = i_t$ and $0$ otherwise
  - For each expert $j = 1, \dots, m$: $\hat{y}_j(t) = \xi^j(t) \cdot \hat{x}(t)$, $w_j(t+1) = w_j(t)\,\exp\!\left(\gamma\,\hat{y}_j(t)/k\right)$
- Regret w.r.t. the best expert: $O(\sqrt{Tk\ln m})$
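A matching sketch for EXP4; `advice(t)` (returning an m × k matrix whose rows are the experts' distributions over arms) and `pull(t, i)` are again hypothetical environment callbacks.

```python
import numpy as np

def exp4(T, k, m, gamma, advice, pull):
    """Minimal EXP4 sketch over m experts and k arms."""
    w = np.ones(m)                                 # one weight per expert
    for t in range(T):
        xi = advice(t)                             # advice vectors xi^1(t), ..., xi^m(t)
        p = (1 - gamma) * (w / w.sum()) @ xi + gamma / k   # combined arm distribution
        i = np.random.choice(k, p=p)               # draw i_t ~ p(t)
        x = pull(t, i)
        xhat = np.zeros(k)
        xhat[i] = x / p[i]                         # importance-weighted reward vector
        yhat = xi @ xhat                           # each expert's estimated reward
        w *= np.exp(gamma * yhat / k)              # exponential update of expert weights
    return w
```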
Epoch-Greedy properties
- No knowledge of the time horizon $T$ required
- Regret bound $O(T^{2/3}\ln^{1/3} m)$, where $m = |H|$ is the size of the hypothesis space (each hypothesis playing the role of an expert)
- $O(\ln T)$ regret with certain structure of the hypothesis space
- Reduced computational complexity
Outline
- The Contextual Bandit Problem
- Prior Works
- The Epoch-Greedy Algorithm
- Analysis
Intuition: T is known
- First phase: $n$ steps of exploration (random pulling of arms)
- Second phase: exploitation
- Average regret of one exploitation step after $n$ explorations: $\epsilon_n$
- Total regret: at most $n + (T - n)\,\epsilon_n$
- Pick $n$ to minimize the total regret (see the calculation below)
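For instance, plugging in the finite-hypothesis-space rate $\epsilon_n = O\!\left(\sqrt{k\ln m / n}\right)$ used later in the talk and balancing the two terms:

$$\min_{n}\;\; n + T\sqrt{\frac{k\ln m}{n}} \;\;\Longrightarrow\;\; n = \Theta\!\left(T^{2/3}(k\ln m)^{1/3}\right), \qquad \text{total regret} = O\!\left(T^{2/3}(k\ln m)^{1/3}\right).$$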
Intuition: T is unknown
- Run exploration/exploitation in epochs
- At epoch $l$: one step of exploration, then $1/\epsilon_l$ steps of exploitation
- Recall: after learning from $l$ random explorations, $\epsilon_l$ is the average regret of one exploitation step
Intuition: T is unknown
- Total regret after $L$ epochs: at most $\sum_{l=1}^{L}\left(1 + \epsilon_l \cdot \frac{1}{\epsilon_l}\right) = 2L$
- Let $L_T$ be the epoch containing step $T$
- It is easy to prove that the resulting regret is no worse than three times the bound obtained with known $T$ and the optimal stopping point
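A sketch of the accounting, assuming the rate $\epsilon_l = \sqrt{k\ln m / l}$: epoch $l$ covers about $1/\epsilon_l$ steps, so the first $L$ epochs cover

$$\sum_{l=1}^{L}\frac{1}{\epsilon_l} \;=\; \sum_{l=1}^{L}\sqrt{\frac{l}{k\ln m}} \;=\; \Theta\!\left(\frac{L^{3/2}}{\sqrt{k\ln m}}\right) \text{ steps} \;\;\Longrightarrow\;\; L_T = \Theta\!\left(T^{2/3}(k\ln m)^{1/3}\right),$$

so the total regret $2L_T$ matches the known-$T$ bound of the previous slide up to a constant factor.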
Algorithm Key Ideas
- Three main components:
  - Random exploration: large immediate regret
  - Learning the best hypothesis from the exploration samples: reduces the regret of future exploitation steps
  - Exploitation by following the best hypothesis: maximizes immediate reward
- Run in several epochs; each epoch contains exactly one step of random exploration and several steps of exploitation
Notations
- $Z_l = (x_l, a_l, r_{a_l,l})$: random exploration sample at epoch $l$
- $Z_1^l = \{Z_1, \dots, Z_l\}$: set of all exploration samples up to epoch $l$
- $s(Z_1^l)$: number of exploitation steps in epoch $l$; either data-independent or data-dependent
- $\hat{h}_l$: the empirical reward maximization estimator learned from $Z_1^l$ (defined below)
Epoch-Greedy
Epoch-Greedy algorithm: for epoch $l = 1, 2, \dots$:
- Exploration: observe $x_l$, pick $a_l \in \{1, 2, \dots, k\}$ uniformly at random, receive reward $r_{a_l,l} \in [0, 1]$
- Learning: find the best hypothesis $\hat{h}_l$ by empirical reward maximization over $Z_1^l$
- Exploitation: repeat $s(Z_1^l)$ times: observe context $x$, select arm $a = \hat{h}_l(x)$, receive reward $r_a \in [0, 1]$
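A minimal Python sketch of this loop; `contexts`, `pull`, and `learn` are hypothetical stand-ins for the environment and the empirical reward maximizer, and the data-independent schedule $s(Z_1^l) \propto \sqrt{l}$ is the finite-$H$ choice from the analysis, with constants omitted.

```python
import math
import random

def epoch_greedy(T, k, contexts, pull, learn):
    """Minimal Epoch-Greedy sketch. `contexts` is an iterator of contexts,
    `pull(x, a)` returns r_a in [0, 1], `learn(samples)` returns h-hat."""
    samples = []                      # exploration samples Z_1^l
    t, l = 0, 0
    while t < T:
        l += 1
        # Exploration: pull one arm uniformly at random and record the sample.
        x = next(contexts)
        a = random.randrange(k)
        samples.append((x, a, pull(x, a)))
        t += 1
        # Learning: refit the empirical reward maximizer on all samples so far.
        h = learn(samples)
        # Exploitation: follow h for s(Z_1^l) ~ sqrt(l) steps.
        for _ in range(int(math.sqrt(l))):
            if t >= T:
                break
            x = next(contexts)
            pull(x, h(x))
            t += 1
```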
Empirical reward estimation
- Dealing with missing observations: in context $x$, randomly pulling arm $a$ only yields $r_a$
- Fully observed (importance-weighted) reward: $\hat{r}_{a'} = k\,r_a\,\mathbb{1}(a' = a)$, i.e. $E_{a \sim \mathrm{uniform}}[\hat{r}_{a'}] = r_{a'}$
- Reward expectation w.r.t. exploration samples: $E_{(x,\vec{r}) \sim P,\; a \sim \mathrm{uniform}}\left[k\,r_a\,\mathbb{1}(h(x) = a)\right] = R(h)$
- Empirical reward estimate of $h \in H$: $\hat{R}(h, Z_1^l) = \frac{1}{l}\sum_{i=1}^{l} k\,r_{a_i,i}\,\mathbb{1}(h(x_i) = a_i)$, maximized by $\hat{h}_l$
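A sketch of this estimator, usable as the `learn` callback in the Epoch-Greedy sketch above for a finite hypothesis space `H` (a hypothetical list of callables); because exploration pulls arms uniformly, each matching sample is importance-weighted by $k$.

```python
def empirical_reward(h, samples, k):
    """Unbiased estimate of R(h) from uniform-exploration samples (x, a, r_a):
    only the pulled arm's reward is observed, so a sample where h agrees with
    the pulled arm is importance-weighted by 1 / (1/k) = k."""
    return sum(k * r for (x, a, r) in samples if h(x) == a) / len(samples)

def make_learner(H, k):
    """Empirical reward maximization over a finite hypothesis space H."""
    return lambda samples: max(H, key=lambda h: empirical_reward(h, samples, k))
```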
Outline
- The Contextual Bandit Problem
- Prior Works
- The Epoch-Greedy Algorithm
- Analysis
Theorem
- Denote the per-epoch exploitation cost: $\mu_l(H, s) = E_{Z_1^l}\!\left[s(Z_1^l)\left(\max_{h \in H} R(h) - R(\hat{h}_l)\right)\right]$
- For all $T$, $n_l$, $L$ such that $T \le \sum_{l=1}^{L} n_l$, the expected regret of the Epoch-Greedy algorithm satisfies
  $$R(\text{Epoch-Greedy}, H, T) \;\le\; L + \sum_{l=1}^{L}\mu_l(H, s) + T\sum_{l=1}^{L}\Pr\!\left[s(Z_1^l) < n_l\right]$$
Theorem Proof Sketch
- For all $T$, $n_l$, $L$ such that $T \le \sum_{l=1}^{L} n_l$, the expected regret of the Epoch-Greedy algorithm satisfies
  $$R(\text{Epoch-Greedy}, H, T) \;\le\; L + \sum_{l=1}^{L}\mu_l(H, s) + T\sum_{l=1}^{L}\Pr\!\left[s(Z_1^l) < n_l\right]$$
- There are two complementary cases:
  - Case 1: $s(Z_1^l) \ge n_l$ for all $l = 1, \dots, L$
  - Case 2: $s(Z_1^l) < n_l$ for some $l = 1, \dots, L$
Theorem Proof Sketch
- Case 1: $s(Z_1^l) \ge n_l$ for all $l = 1, \dots, L$
- Then, since $T \le \sum_{l=1}^{L} n_l$, step $T$ is contained within the first $L$ epochs
- The exploitation regret in epoch $l$ is at most $\mu_l(H, s)$; the exploration regret per epoch is at most 1
- Regret contribution of Case 1: $L + \sum_{l=1}^{L}\mu_l(H, s)$
Theorem Proof Sketch
- Case 2: $s(Z_1^l) < n_l$ for some $l = 1, \dots, L$
- The regret of Case 2 is at most $T$, since the regret is at most 1 per step
- Probability of Case 2: at most $\sum_{l=1}^{L}\Pr\!\left[s(Z_1^l) < n_l\right]$ (union bound)
- Bound for the regret contribution of Case 2: $T\sum_{l=1}^{L}\Pr\!\left[s(Z_1^l) < n_l\right]$
- Total regret of Case 1 and Case 2:
  $$R(\text{Epoch-Greedy}, H, T) \;\le\; L + \sum_{l=1}^{L}\mu_l(H, s) + T\sum_{l=1}^{L}\Pr\!\left[s(Z_1^l) < n_l\right]$$
Bound for Finite Hypothesis Space
- Denote the size of the hypothesis space $m = |H| < \infty$
- By Bernstein's inequality, the per-step exploitation regret after $l$ explorations satisfies $\epsilon_l \le c\sqrt{k\ln m / l}$, where $c$ is some constant
- Recall the per-epoch exploitation cost $\mu_l(H, s)$
- Pick $s(Z_1^l) = c'\sqrt{l/(k\ln m)}$ for a suitable constant $c'$; then $\mu_l(H, s) \le 1$
Bound for Finite Hypothesis Space
- Theorem: for all $T$, $n_l$, $L$ such that $T \le \sum_{l=1}^{L} n_l$, the expected regret of the Epoch-Greedy algorithm is at most $L + \sum_{l=1}^{L}\mu_l(H, s) + T\sum_{l=1}^{L}\Pr[s(Z_1^l) < n_l]$
- Take $n_l = s(Z_1^l) = c'\sqrt{l/(k\ln m)}$ (data-independent); then $\Pr[s(Z_1^l) < n_l] = 0$
- Let $L = c\,T^{2/3}(k\ln m)^{1/3}$ for some constant $c$; then $T \le \sum_{l=1}^{L} n_l$
- Therefore $R(\text{Epoch-Greedy}, H, T) \le L + \sum_{l=1}^{L}\mu_l(H, s) \le 2L = O\!\left(T^{2/3}(k\ln m)^{1/3}\right)$
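The choice of $L$ follows from summing the schedule (a sketch, with constants suppressed):

$$\sum_{l=1}^{L} n_l \;=\; \Theta\!\left(\sum_{l=1}^{L}\sqrt{\frac{l}{k\ln m}}\right) \;=\; \Theta\!\left(\frac{L^{3/2}}{\sqrt{k\ln m}}\right) \;\ge\; T \;\;\Longleftrightarrow\;\; L \;=\; \Omega\!\left(T^{2/3}(k\ln m)^{1/3}\right),$$

so the smallest admissible $L$ yields the stated $O\!\left(T^{2/3}(k\ln m)^{1/3}\right)$ regret.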
Bound improvement
- Let $H = \{h_1, \dots, h_m\}$; WLOG $R(h_1) \ge R(h_2) \ge \dots \ge R(h_m)$
- Suppose $R(h_1) \ge R(h_2) + \Delta$ for some $\Delta > 0$ ($\Delta$: the gap between the best and second-best hypotheses)
- With appropriate parameter choices ($s(Z_1^l)$ is data-dependent in this case):
  $$R(\text{Epoch-Greedy}, H, T) \;\le\; \frac{8k\left(\ln m + \ln(T+1)\right)}{\Delta^2} + 1 + \frac{c\,k}{\Delta^2}$$
- That means the regret is $O(k\ln m + k\ln T)$
Conclusions
- Contextual multi-armed bandits: a generalization of the standard multi-armed bandit problem
- The observable context helps the decision of which arm to pull
- Epoch-Greedy ties the exploration/exploitation trade-off to the sample complexity of the hypothesis space
- Well suited to large hypothesis spaces, or to hypothesis spaces with special structure