The Multi-Arm Bandit Framework


1 The Multi-Arm Bandit Framework
A. LAZARIC (SequeL Team @INRIA-Lille)
ENS Cachan - Master 2 MVA
MVA-RL Course

2 In This Lecture

3 In This Lecture
Question: which route should we take?
Problem: each day we obtain only limited feedback: the traveling time of the chosen route.
Result: if we do not repeatedly try different options, we cannot learn.
Solution: trade off between optimization and learning.

4 Mathematical Tools: Outline
Mathematical Tools
The General Multi-arm Bandit Problem
The Stochastic Multi-arm Bandit Problem
The Non-Stochastic Multi-arm Bandit Problem
Connections to Game Theory
Other Stochastic Multi-arm Bandit Problems

5 Mathematical Tools: Concentration Inequalities
Proposition (Chernoff-Hoeffding Inequality)
Let $X_i \in [a_i, b_i]$ be $n$ independent r.v. with mean $\mu_i = \mathbb{E} X_i$. Then
$$P\left[\left|\sum_{i=1}^n (X_i - \mu_i)\right| \geq \epsilon\right] \leq 2\exp\left(-\frac{2\epsilon^2}{\sum_{i=1}^n (b_i - a_i)^2}\right).$$

6 Mathematical Tools: Concentration Inequalities
Proof.
$$P\left(\sum_{i=1}^n (X_i - \mu_i) \geq \epsilon\right) = P\left(e^{s\sum_{i=1}^n (X_i - \mu_i)} \geq e^{s\epsilon}\right) \leq e^{-s\epsilon}\,\mathbb{E}\big[e^{s\sum_{i=1}^n (X_i - \mu_i)}\big] \qquad \text{(Markov inequality)}$$
$$= e^{-s\epsilon}\prod_{i=1}^n \mathbb{E}\big[e^{s(X_i - \mu_i)}\big] \qquad \text{(independent random variables)}$$
$$\leq e^{-s\epsilon}\prod_{i=1}^n e^{s^2(b_i - a_i)^2/8} = e^{-s\epsilon + s^2\sum_{i=1}^n (b_i - a_i)^2/8} \qquad \text{(Hoeffding inequality)}$$
If we choose $s = 4\epsilon/\sum_{i=1}^n (b_i - a_i)^2$, the result follows. Similar arguments hold for $P\left(\sum_{i=1}^n (X_i - \mu_i) \leq -\epsilon\right)$.

7 Mathematical Tools: Concentration Inequalities
Finite sample guarantee:
$$P\Bigg[\underbrace{\left|\frac{1}{n}\sum_{t=1}^n X_t - \mathbb{E}[X_1]\right|}_{\text{deviation}} > \underbrace{\epsilon}_{\text{accuracy}}\Bigg] \leq \underbrace{2\exp\left(-\frac{2n\epsilon^2}{(b-a)^2}\right)}_{\text{confidence}}$$

8 Mathematical Tools: Concentration Inequalities
Finite sample guarantee:
$$P\left[\left|\frac{1}{n}\sum_{t=1}^n X_t - \mathbb{E}[X_1]\right| > (b-a)\sqrt{\frac{\log 2/\delta}{2n}}\right] \leq \delta$$

9 Mathematical Tools: Concentration Inequalities
Finite sample guarantee:
$$P\left[\left|\frac{1}{n}\sum_{t=1}^n X_t - \mathbb{E}[X_1]\right| > \epsilon\right] \leq \delta \quad \text{if} \quad n \geq \frac{(b-a)^2\log 2/\delta}{2\epsilon^2}.$$
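
As a quick sanity check (not part of the slides), the snippet below compares the empirical deviation probability of the sample mean of n bounded i.i.d. variables with the bound $2\exp(-2n\epsilon^2/(b-a)^2)$; the Bernoulli(0.5) distribution and the values of n and epsilon are arbitrary illustrative choices.

```python
import numpy as np

# Empirical check of the finite-sample guarantee for Bernoulli(0.5) samples bounded in [0, 1].
rng = np.random.default_rng(0)
n, eps, runs, mu = 100, 0.1, 20000, 0.5

deviations = 0
for _ in range(runs):
    x = rng.binomial(1, mu, size=n)        # n i.i.d. samples in [0, 1]
    if abs(x.mean() - mu) > eps:           # deviation event
        deviations += 1

empirical = deviations / runs
bound = 2 * np.exp(-2 * n * eps**2 / (1 - 0) ** 2)   # (b - a) = 1
print(f"empirical P(|mean - mu| > eps) = {empirical:.4f} <= Hoeffding bound = {bound:.4f}")
```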

10 The General Multi-arm Bandit Problem: Outline
Mathematical Tools
The General Multi-arm Bandit Problem
The Stochastic Multi-arm Bandit Problem
The Non-Stochastic Multi-arm Bandit Problem
Connections to Game Theory
Other Stochastic Multi-arm Bandit Problems

11 The General Multi-arm Bandit Problem: The Multi-armed Bandit Game
The learner has $i = 1, \ldots, N$ arms (options, experts, ...)
At each round $t = 1, \ldots, n$, at the same time:
- The environment chooses a vector of rewards $\{X_{i,t}\}_{i=1}^N$
- The learner chooses an arm $I_t$
The learner receives a reward $X_{I_t,t}$
The environment does not reveal the rewards of the other arms

12 The General Multi-arm Bandit Problem: The Multi-armed Bandit Game (cont'd)
The regret
$$R_n(\mathcal{A}) = \max_{i=1,\ldots,N} \mathbb{E}\left[\sum_{t=1}^n X_{i,t}\right] - \mathbb{E}\left[\sum_{t=1}^n X_{I_t,t}\right]$$
The expectation summarizes any possible source of randomness (either in $X$ or in the algorithm).

13 The General Multi-arm Bandit Problem: The Exploration-Exploitation Dilemma
Problem 1: The environment does not reveal the rewards of the arms not pulled by the learner, so the learner should gain information by repeatedly pulling all the arms (exploration).
Problem 2: Whenever the learner pulls a bad arm, it suffers some regret, so the learner should reduce the regret by repeatedly pulling the best arm (exploitation).
Challenge: The learner should solve two opposite problems: the exploration-exploitation dilemma!

14 The General Multi-arm Bandit Problem: The Multi-armed Bandit Game (cont'd)
Examples:
- Packet routing
- Clinical trials
- Web advertising
- Computer games
- Resource mining
- ...

15 The Stochastic Multi-arm Bandit Problem: Outline
Mathematical Tools
The General Multi-arm Bandit Problem
The Stochastic Multi-arm Bandit Problem
The Non-Stochastic Multi-arm Bandit Problem
Connections to Game Theory
Other Stochastic Multi-arm Bandit Problems

16 The Stochastic Multi-arm Bandit Problem (Definition)
The environment is stochastic:
- Each arm has a distribution $\nu_i$ bounded in $[0, 1]$ and characterized by an expected value $\mu_i$
- The rewards are i.i.d.: $X_{i,t} \sim \nu_i$

17 The Stochastic Multi-arm Bandit Problem (cont'd)
Notation: the number of times arm $i$ has been pulled after $n$ rounds is
$$T_{i,n} = \sum_{t=1}^n \mathbb{I}\{I_t = i\}$$
Regret:
$$R_n(\mathcal{A}) = \max_{i=1,\ldots,N} \mathbb{E}\left[\sum_{t=1}^n X_{i,t}\right] - \mathbb{E}\left[\sum_{t=1}^n X_{I_t,t}\right] = \max_{i=1,\ldots,N}(n\mu_i) - \mathbb{E}\left[\sum_{t=1}^n X_{I_t,t}\right] = \max_{i=1,\ldots,N}(n\mu_i) - \sum_{i=1}^N \mathbb{E}[T_{i,n}]\,\mu_i = n\mu_{i^*} - \sum_{i=1}^N \mathbb{E}[T_{i,n}]\,\mu_i$$
where $i^* = \arg\max_i \mu_i$ is the optimal arm.

18 The Stochastic Multi-arm Bandit Problem (cont'd)
$$R_n(\mathcal{A}) = \sum_{i \neq i^*} \mathbb{E}[T_{i,n}]\,\Delta_i, \qquad \Delta_i = \mu_{i^*} - \mu_i$$
We only need to study the expected number of pulls of the suboptimal arms.

19 The Stochastic Multi-arm Bandit Problem: Optimism in the Face of Uncertainty Learning (OFUL)
Whenever we are uncertain about the outcome of an arm, we consider the best possible world and choose the best arm.
Why it works:
- If the best possible world is correct: no regret
- If the best possible world is wrong: the reduction in the uncertainty is maximized

20 The Stochastic Multi-arm Bandit Problem (cont'd)
[Figure: empirical reward distributions of the arms after different numbers of pulls (100, 50, 20).]

21 The Stochastic Multi-arm Bandit Problem (cont'd)
Optimism in the face of uncertainty
[Figure: reward distributions of the arms with the corresponding optimistic estimates.]

22 The Stochastic Multi-arm Bandit Problem: The Upper Confidence Bound (UCB) Algorithm
The idea
[Figure: estimated rewards of arms 1-4 with confidence intervals; the number of pulls per arm is (10), (73), (3), (23). Axes: Arms (x), Reward (y).]

23 The Stochastic Multi-arm Bandit Problem: The Upper Confidence Bound (UCB) Algorithm
Show time!

24 The Stochastic Multi-arm Bandit Problem: The Upper Confidence Bound (UCB) Algorithm (cont'd)
At each round $t = 1, \ldots, n$:
- Compute the score of each arm $i$: $B_i$ = (optimistic score of arm $i$)
- Pull arm $I_t = \arg\max_{i=1,\ldots,N} B_{i,s,t}$
- Update the number of pulls: $T_{I_t,t} = T_{I_t,t-1} + 1$

25 The Stochastic Multi-arm Bandit Problem: The Upper Confidence Bound (UCB) Algorithm (cont'd)
The score (with parameters $\rho$ and $\delta$):
$B_{i,s,t}$ = (optimistic score of arm $i$ if pulled $s$ times up to round $t$) = knowledge + optimism (uncertainty)
$$B_{i,s,t} = \hat\mu_{i,s} + \rho\sqrt{\frac{\log 1/\delta}{2s}}$$
Optimism in the face of uncertainty:
- Current knowledge: average rewards $\hat\mu_{i,s}$
- Current uncertainty: number of pulls $s$
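
A minimal Python sketch of the resulting UCB rule with the anytime choice $\delta = 1/t$ (not part of the slides; the Bernoulli arms, the value of $\rho$, and the horizon are illustrative assumptions).

```python
import numpy as np

def ucb(means, n, rho=2.0, seed=0):
    """Anytime UCB sketch: B_{i,s,t} = mu_hat_{i,s} + rho * sqrt(log(t) / (2 s)), i.e. delta = 1/t.
    `means` are Bernoulli arm means (illustrative); returns the pseudo-regret after n rounds."""
    rng = np.random.default_rng(seed)
    N = len(means)
    pulls = np.zeros(N)       # T_{i,t}
    mu_hat = np.zeros(N)      # empirical means
    regret = 0.0
    for t in range(1, n + 1):
        if t <= N:            # pull each arm once to initialize the estimates
            i = t - 1
        else:
            scores = mu_hat + rho * np.sqrt(np.log(t) / (2 * pulls))
            i = int(np.argmax(scores))
        x = rng.binomial(1, means[i])              # only the chosen arm's reward is observed
        pulls[i] += 1
        mu_hat[i] += (x - mu_hat[i]) / pulls[i]    # incremental mean update
        regret += max(means) - means[i]
    return regret

print(ucb([0.9, 0.8, 0.5], n=10000))
```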

26 The Stochastic Multi-arm Bandit Problem: The Upper Confidence Bound (UCB) Algorithm (cont'd)
Do you remember Chernoff-Hoeffding?
Theorem. Let $X_1, \ldots, X_n$ be i.i.d. samples from a distribution bounded in $[a, b]$. Then for any $\delta \in (0,1)$
$$P\left[\left|\frac{1}{n}\sum_{t=1}^n X_t - \mathbb{E}[X_1]\right| > (b-a)\sqrt{\frac{\log 2/\delta}{2n}}\right] \leq \delta$$

27 The Stochastic Multi-arm Bandit Problem: The Upper Confidence Bound (UCB) Algorithm (cont'd)
After $s$ pulls of arm $i$:
$$P\left[\mathbb{E}[X_i] \leq \frac{1}{s}\sum_{t=1}^s X_{i,t} + \sqrt{\frac{\log 1/\delta}{2s}}\right] \geq 1 - \delta$$
$$P\left[\mu_i \leq \hat\mu_{i,s} + \sqrt{\frac{\log 1/\delta}{2s}}\right] \geq 1 - \delta$$
UCB uses an upper confidence bound on the expectation.

28 The Stochastic Multi-arm Bandit Problem: The Upper Confidence Bound (UCB) Algorithm (cont'd)
Theorem. For any set of $N$ arms with distributions bounded in $[0, b]$, if $\delta = 1/t$, then UCB($\rho$) with $\rho > 1$ achieves a regret
$$R_n(\mathcal{A}) \leq \sum_{i \neq i^*}\left[\frac{4b^2}{\Delta_i}\,\rho\log(n) + \Delta_i\left(\frac{3}{2} + \frac{1}{2(\rho-1)}\right)\right]$$

29 The Stochastic Multi-arm Bandit Problem: The Upper Confidence Bound (UCB) Algorithm (cont'd)
Let $N = 2$ and let $\Delta$ be the gap of the suboptimal arm:
$$R_n(\mathcal{A}) \leq O\left(\frac{1}{\Delta}\,\rho\log(n)\right)$$
Remark 1: the cumulative regret slowly increases as $\log(n)$.
Remark 2: the smaller the gap, the bigger the regret... why?

30 The Stochastic Multi-arm Bandit Problem: The Upper Confidence Bound (UCB) Algorithm (cont'd)
Show time (again)!

31 The Stochastic Multi-arm Bandit Problem: The Worst-Case Performance
Remark: the regret bound is distribution dependent
$$R_n(\mathcal{A}; \Delta) \leq O\left(\frac{1}{\Delta}\,\rho\log(n)\right)$$
Meaning: the algorithm is able to adapt to the specific problem at hand!
Worst-case performance: what is the distribution which leads to the worst possible performance of UCB? What is the distribution-free performance of UCB?
$$R_n(\mathcal{A}) = \sup_{\Delta} R_n(\mathcal{A}; \Delta)$$

32 The Stochastic Multi-arm Bandit Problem: The Worst-Case Performance
Problem: it seems that if $\Delta \to 0$ then the regret tends to infinity. This is nonsense, because the regret is defined as
$$R_n(\mathcal{A}; \Delta) = \Delta\,\mathbb{E}[T_{2,n}]$$
so if $\Delta$ is small, the regret is also small... In fact
$$R_n(\mathcal{A}; \Delta) = \min\left\{O\left(\frac{1}{\Delta}\,\rho\log(n)\right),\ \Delta\,\mathbb{E}[T_{2,n}]\right\}$$

33 The Stochastic Multi-arm Bandit Problem: The Worst-Case Performance
Then
$$R_n(\mathcal{A}) = \sup_{\Delta} R_n(\mathcal{A}; \Delta) = \sup_{\Delta}\min\left\{O\left(\frac{1}{\Delta}\,\rho\log(n)\right),\ \Delta n\right\}$$
and the supremum is attained for $\Delta$ of order $\sqrt{\log(n)/n}$, which gives a worst-case regret of order $\sqrt{n\log n}$.

34 The Stochastic Multi-arm Bandit Problem: Tuning the confidence δ of UCB
Remark: UCB is an anytime algorithm ($\delta = 1/t$)
$$B_{i,s,t} = \hat\mu_{i,s} + \rho\sqrt{\frac{\log t}{2s}}$$
Remark: if the time horizon $n$ is known, then the optimal choice is $\delta = 1/n$
$$B_{i,s,t} = \hat\mu_{i,s} + \rho\sqrt{\frac{\log n}{2s}}$$

35 The Stochastic Multi-arm Bandit Problem: Tuning the confidence δ of UCB (cont'd)
Intuition: UCB should pull the suboptimal arms
- Enough: so as to understand which arm is the best
- Not too much: so as to keep the regret as small as possible
The confidence $1 - \delta$ has the following impact (similar for $\rho$):
- Big $1 - \delta$: high level of exploration
- Small $1 - \delta$: high level of exploitation
Solution: depending on the time horizon, we can tune the trade-off between exploration and exploitation.

36 The Stochastic Multi-arm Bandit Problem: Tuning the confidence δ of UCB (cont'd)
Let's dig into the (one-and-a-half-page!) proof.
Define the (high-probability) event [statistics]
$$E = \left\{\forall i, s:\ |\hat\mu_{i,s} - \mu_i| \leq \sqrt{\frac{\log 1/\delta}{2s}}\right\}$$
By Chernoff-Hoeffding, $P[E] \geq 1 - nN\delta$.
At time $t$ we pull arm $i$ [algorithm]:
$$B_{i,T_{i,t-1}} \geq B_{i^*,T_{i^*,t-1}} \quad\Longleftrightarrow\quad \hat\mu_{i,T_{i,t-1}} + \sqrt{\frac{\log 1/\delta}{2T_{i,t-1}}} \geq \hat\mu_{i^*,T_{i^*,t-1}} + \sqrt{\frac{\log 1/\delta}{2T_{i^*,t-1}}}$$
On the event $E$ we have [math]:
$$\mu_i + 2\sqrt{\frac{\log 1/\delta}{2T_{i,t-1}}} \geq \mu_{i^*}$$

37 The Stochastic Multi-arm Bandit Problem: Tuning the confidence δ of UCB (cont'd)
Assume $t$ is the last time $i$ is pulled; then $T_{i,n} = T_{i,t-1} + 1$, thus
$$\mu_i + 2\sqrt{\frac{\log 1/\delta}{2(T_{i,n} - 1)}} \geq \mu_{i^*}$$
Reordering [math]:
$$T_{i,n} \leq \frac{2\log 1/\delta}{\Delta_i^2} + 1$$
under the event $E$, and thus with probability $1 - nN\delta$. Moving to the expectation [statistics]:
$$\mathbb{E}[T_{i,n}] = \mathbb{E}[T_{i,n}\,\mathbb{I}_E] + \mathbb{E}[T_{i,n}\,\mathbb{I}_{E^C}] \leq \frac{2\log 1/\delta}{\Delta_i^2} + 1 + n\,(nN\delta)$$
Trading off the two terms with $\delta = 1/n^2$, we obtain the result on the next slide.

38 The Stochastic Multi-arm Bandit Problem: Tuning the confidence δ of UCB (cont'd)
Trading off the two terms with $\delta = 1/n^2$, we obtain
$$B_{i,s} = \hat\mu_{i,s} + \sqrt{\frac{2\log n}{2s}}$$
and
$$\mathbb{E}[T_{i,n}] \leq \frac{4\log n}{\Delta_i^2} + 1 + N$$

39 The Stochastic Multi-arm Bandit Problem: Tuning the confidence δ of UCB (cont'd)
[Figure: regret of UCB on a multi-armed bandit problem for δ = 1/t and δ = 1/n; the two choices behave almost the same (i.e., in expectation).]

40 The Stochastic Multi-arm Bandit Problem: Tuning the confidence δ of UCB (cont'd)
[Figure: the value-at-risk of the regret for UCB-anytime.]

41 The Stochastic Multi-arm Bandit Problem: Tuning the ρ of UCB (cont'd)
UCB values (for the $\delta = 1/n$ algorithm):
$$B_{i,s} = \hat\mu_{i,s} + \rho\sqrt{\frac{\log n}{2s}}$$
Theory:
- $\rho < 0.5$: polynomial regret w.r.t. $n$
- $\rho > 0.5$: logarithmic regret w.r.t. $n$
Practice: $\rho = 0.2$ is often the best choice

42 The Stochastic Multi-arm Bandit Problem: Improvements over UCB: UCB-V
Idea: use Bernstein bounds with the empirical variance.
Algorithm:
$$B_{i,s,t} = \hat\mu_{i,s} + \sqrt{\frac{\log t}{2s}} \quad\longrightarrow\quad B^V_{i,s,t} = \hat\mu_{i,s} + \sqrt{\frac{2\hat\sigma^2_{i,s}\log t}{s}} + \frac{8\log t}{3s}$$
$$R_n \leq O\left(\frac{1}{\Delta}\log n\right) \quad\longrightarrow\quad R_n \leq O\left(\frac{\sigma^2}{\Delta}\log n\right)$$
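
A small sketch of the UCB-V index for a single arm, following the formula above (assuming rewards in [0, 1]; the reward sample and round index are arbitrary):

```python
import numpy as np

def ucbv_index(rewards, t):
    """UCB-V style index for one arm: empirical mean plus a Bernstein-type bonus
    based on the empirical variance, as in the B^V_{i,s,t} formula above."""
    s = len(rewards)
    mu_hat = np.mean(rewards)
    var_hat = np.var(rewards)                                  # empirical variance
    bonus = np.sqrt(2 * var_hat * np.log(t) / s) + 8 * np.log(t) / (3 * s)
    return mu_hat + bonus

print(ucbv_index([0.2, 0.4, 0.3, 0.5], t=100))
```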

43 The Stochastic Multi-arm Bandit Problem: Improvements over UCB: KL-UCB
Idea: use Kullback-Leibler bounds, which are tighter than other bounds.
Algorithm: the algorithm is still index based but a bit more complicated.
$$R_n \leq O\left(\frac{1}{\Delta}\log n\right) \quad\longrightarrow\quad R_n \leq O\left(\frac{1}{KL(\nu_i, \nu^*)}\log n\right)$$
(with $\nu^*$ the distribution of the optimal arm)

44 The Stochastic Multi-arm Bandit Problem: Improvements over UCB: Thompson strategy
Idea: keep a distribution over the possible values of $\mu_i$.
Algorithm: Bayesian approach; compute the posterior distribution of each arm given the samples and use it to select the arm to pull.
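
A minimal sketch of Thompson sampling for Bernoulli arms with Beta(1,1) priors; the conjugate prior choice and the arm means are illustrative assumptions, not specified on the slide.

```python
import numpy as np

def thompson_bernoulli(means, n, seed=0):
    """Thompson sampling sketch: keep a Beta posterior per Bernoulli arm,
    sample one value per arm, and play the arm with the largest sample."""
    rng = np.random.default_rng(seed)
    N = len(means)
    alpha = np.ones(N)   # posterior Beta parameters (successes + 1)
    beta = np.ones(N)    # posterior Beta parameters (failures + 1)
    regret = 0.0
    for _ in range(n):
        theta = rng.beta(alpha, beta)       # one posterior sample per arm
        i = int(np.argmax(theta))
        x = rng.binomial(1, means[i])
        alpha[i] += x
        beta[i] += 1 - x
        regret += max(means) - means[i]
    return regret

print(thompson_bernoulli([0.9, 0.8, 0.5], n=10000))
```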

45 The Stochastic Multi-arm Bandit Problem: Back to UCB: the Lower Bound
Theorem. For any stochastic bandit $\{\nu_i\}$, any algorithm $\mathcal{A}$ has a regret
$$\lim_{n\to\infty} \frac{R_n}{\log n} \geq \sum_{i \neq i^*} \frac{\Delta_i}{\inf_{\nu} KL(\nu_i, \nu)}$$
Problem: this is just asymptotic.
Open question: what is the finite-time lower bound?

46 The Non-Stochastic Multi-arm Bandit Problem: Outline
Mathematical Tools
The General Multi-arm Bandit Problem
The Stochastic Multi-arm Bandit Problem
The Non-Stochastic Multi-arm Bandit Problem
Connections to Game Theory
Other Stochastic Multi-arm Bandit Problems

47 The Non-Stochastic Multi-arm Bandit Problem (Definition)
The environment is adversarial:
- Arms have no fixed distribution
- The rewards $X_{i,t}$ are arbitrarily chosen by the environment

48 The Non-Stochastic Multi-arm Bandit Problem (cont'd)
The (non-stochastic bandit) regret:
$$R_n(\mathcal{A}) = \max_{i=1,\ldots,N} \mathbb{E}\left[\sum_{t=1}^n X_{i,t}\right] - \mathbb{E}\left[\sum_{t=1}^n X_{I_t,t}\right] \quad\longrightarrow\quad R_n(\mathcal{A}) = \max_{i=1,\ldots,N} \sum_{t=1}^n X_{i,t} - \mathbb{E}\left[\sum_{t=1}^n X_{I_t,t}\right]$$

49 The Non-Stochastic Multi-arm Bandit Problem: The Exponentially Weighted Average Forecaster
- Initialize the weights $w_{i,0} = 1$
- Compute $\hat p_{i,t} = \frac{w_{i,t-1}}{W_{t-1}}$ (with $W_{t-1} = \sum_{i=1}^N w_{i,t-1}$)
- Choose the arm at random: $I_t \sim \hat p_t$
- Observe the rewards $\{X_{i,t}\}$ and receive the reward $X_{I_t,t}$
- Update $w_{i,t} = w_{i,t-1}\exp(\eta X_{i,t})$

50 The Non-Stochastic Multi-arm Bandit Problem (cont'd)
Problem: we only observe the reward of the specific arm chosen at time $t$!! (i.e., only $X_{I_t,t}$ is observed)

51 The Non-Stochastic Multi-arm Bandit Problem: The Exponentially Weighted Average Forecaster
- Initialize the weights $w_{i,0} = 1$
- Compute $\hat p_{i,t} = \frac{w_{i,t-1}}{W_{t-1}}$ (with $W_{t-1} = \sum_{i=1}^N w_{i,t-1}$)
- Choose the arm at random: $I_t \sim \hat p_t$
- Observe the rewards $\{X_{i,t}\}$ and receive the reward $X_{I_t,t}$
- Update $w_{i,t} = w_{i,t-1}\exp(\eta X_{i,t})$: this update is not possible

52 The Non-Stochastic Multi-arm Bandit Problem (cont'd)
We use the importance weighting trick:
$$\hat X_{i,t} = \begin{cases} \dfrac{X_{i,t}}{\hat p_{i,t}} & \text{if } i = I_t \\ 0 & \text{otherwise} \end{cases}$$
Why it is a good idea:
$$\mathbb{E}\big[\hat X_{i,t}\big] = \frac{X_{i,t}}{\hat p_{i,t}}\,\hat p_{i,t} + 0\,(1 - \hat p_{i,t}) = X_{i,t}$$
$\hat X_{i,t}$ is an unbiased estimator of $X_{i,t}$.
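
A quick numerical illustration of the unbiasedness claim (the probability p and the reward value are arbitrary):

```python
import numpy as np

# Importance-weighted estimate of the reward of arm i: X_hat = X / p if i is played, else 0.
rng = np.random.default_rng(0)
p_i, x_i, runs = 0.3, 0.7, 200000   # probability of playing arm i and its (fixed) reward

estimates = np.where(rng.random(runs) < p_i, x_i / p_i, 0.0)
print(f"mean of X_hat = {estimates.mean():.4f}  (true X = {x_i})")
```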

53 The Non-Stochastic Multi-arm Bandit Problem: The Exp3 Algorithm
Exp3: Exponential-weight algorithm for Exploration and Exploitation
- Initialize the weights $w_{i,0} = 1$
- Compute $\hat p_{i,t} = \frac{w_{i,t-1}}{W_{t-1}}$ (with $W_{t-1} = \sum_{i=1}^N w_{i,t-1}$)
- Choose the arm at random: $I_t \sim \hat p_t$
- Receive the reward $X_{I_t,t}$
- Update $w_{i,t} = w_{i,t-1}\exp(\eta \hat X_{i,t})$

54 The Non-Stochastic Multi-arm Bandit Problem: The Exp3 Algorithm
Question: is this enough? Is this algorithm actually exploring enough?
Answer: more or less...
- Exp3 has a small regret in expectation
- Exp3 might have large deviations with high probability (i.e., from time to time it may concentrate $\hat p_t$ on the wrong arm for too long and then incur a large regret)

55 The Non-Stochastic Multi-arm Bandit Problem: The Exp3 Algorithm
Fix: add some extra uniform exploration
- Initialize the weights $w_{i,0} = 1$
- Compute $\hat p_{i,t} = (1-\gamma)\frac{w_{i,t-1}}{W_{t-1}} + \frac{\gamma}{N}$ (with $W_{t-1} = \sum_{i=1}^N w_{i,t-1}$)
- Choose the arm at random: $I_t \sim \hat p_t$
- Receive the reward $X_{I_t,t}$
- Update $w_{i,t} = w_{i,t-1}\exp(\eta \hat X_{i,t})$
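
A minimal Python sketch of this Exp3 variant (not from the slides; the reward table plays the role of an oblivious adversary and the values of gamma and eta are arbitrary). Log-weights are used only to avoid numerical overflow.

```python
import numpy as np

def exp3(reward_table, gamma=0.05, eta=0.05, seed=0):
    """Exp3 with extra uniform exploration (gamma) and importance-weighted updates.
    reward_table: (n, N) array of rewards in [0, 1] fixed in advance by the environment."""
    rng = np.random.default_rng(seed)
    n, N = reward_table.shape
    log_w = np.zeros(N)                              # log-weights, for numerical stability
    total = 0.0
    for t in range(n):
        w = np.exp(log_w - log_w.max())
        p = (1 - gamma) * w / w.sum() + gamma / N    # mix exponential weights with uniform exploration
        i = rng.choice(N, p=p)
        x = reward_table[t, i]
        total += x
        log_w[i] += eta * x / p[i]                   # only the pulled arm's estimate X_hat is non-zero
    best_fixed_arm = reward_table.sum(axis=0).max()
    return best_fixed_arm - total                    # regret against the best fixed arm

rewards = np.random.default_rng(1).random((5000, 4)) # an arbitrary (oblivious) reward sequence
print(exp3(rewards))
```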

56 The Non-Stochastic Multi-arm Bandit Problem: The Exp3 Algorithm
Theorem. If Exp3 is run with $\gamma = \eta$, then it achieves a regret
$$R_n(\mathcal{A}) = \max_{i=1,\ldots,N}\sum_{t=1}^n X_{i,t} - \mathbb{E}\left[\sum_{t=1}^n X_{I_t,t}\right] \leq (e-1)\gamma G_{\max} + \frac{N\log N}{\gamma}$$
with $G_{\max} = \max_{i=1,\ldots,N}\sum_{t=1}^n X_{i,t}$.

57 The Non-Stochastic Multi-arm Bandit Problem: The Exp3 Algorithm
Theorem. If Exp3 is run with
$$\gamma = \eta = \sqrt{\frac{N\log N}{(e-1)\,n}}$$
then it achieves a regret
$$R_n(\mathcal{A}) \leq O(\sqrt{nN\log N})$$

58 The Non-Stochastic Multi-arm Bandit Problem: The Exp3 Algorithm
Comparison with online learning (full information):
$$R_n(\text{Exp3}) \leq O(\sqrt{nN\log N}) \qquad\qquad R_n(\text{EWA}) \leq O(\sqrt{n\log N})$$
Intuition: in online learning at each round we obtain $N$ feedbacks, while in bandits we receive 1 feedback.

59 The Non-Stochastic Multi-arm Bandit Problem: The Improved-Exp3 Algorithm
- Initialize the weights $w_{i,0} = 1$
- Compute $\hat p_{i,t} = (1-\gamma)\frac{w_{i,t-1}}{W_{t-1}} + \frac{\gamma}{N}$ (with $W_{t-1} = \sum_{i=1}^N w_{i,t-1}$)
- Choose the arm at random: $I_t \sim \hat p_t$
- Receive the reward $X_{I_t,t}$
- Compute $\tilde X_{i,t} = \hat X_{i,t} + \frac{\beta}{\hat p_{i,t}}$
- Update $w_{i,t} = w_{i,t-1}\exp(\eta \tilde X_{i,t})$

60 The Non-Stochastic Multi-arm Bandit Problem: The Improved-Exp3 Algorithm
Theorem. If Improved-Exp3 is run with parameters in the ranges
$$\gamma \leq \frac{1}{2}; \qquad 0 \leq \eta \leq \frac{\gamma}{2N}; \qquad \sqrt{\frac{\log(N/\delta)}{nN}} \leq \beta \leq 1$$
then it achieves a regret
$$R_n^{HP}(\mathcal{A}) \leq n\big(\gamma + \eta(1+\beta)N\big) + \frac{\log N}{\eta} + 2nN\beta$$
with probability at least $1 - \delta$.

61 The Non-Stochastic Multi-arm Bandit Problem: The Improved-Exp3 Algorithm
Theorem. If Improved-Exp3 is run with parameters
$$\beta = \sqrt{\frac{\log(N/\delta)}{nN}}; \qquad \gamma = \frac{4N\beta}{3+\beta}; \qquad \eta = \frac{\gamma}{2N}$$
then it achieves a regret
$$R_n^{HP}(\mathcal{A}) \leq \frac{11}{2}\sqrt{nN\log(N/\delta)}$$
with probability at least $1 - \delta$.

62 Connections to Game Theory: Outline
Mathematical Tools
The General Multi-arm Bandit Problem
The Stochastic Multi-arm Bandit Problem
The Non-Stochastic Multi-arm Bandit Problem
Connections to Game Theory
Other Stochastic Multi-arm Bandit Problems

63 Connections to Game Theory: Repeated Two-Player Zero-Sum Games
A two-player zero-sum game (payoffs are (Red, Blue); Red chooses the row, Blue the column):
        A          B          C
1    30, -30    -10, 10     20, -20
2   -10, 10     20, -20    -20, 20
Nash equilibrium: a set of strategies is a Nash equilibrium if no player can do better by unilaterally changing his strategy.
- Red: take action 1 with prob. 4/7 and action 2 with prob. 3/7
- Blue: take action A with prob. 0, action B with prob. 4/7, and action C with prob. 3/7
Value of the game: V = 20/7 (reward of Red at the equilibrium)
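
A quick numerical check of the stated equilibrium and value (the payoff matrix below is the reconstruction shown above, so treat it as an assumption):

```python
import numpy as np

# Red's payoffs (rows: actions 1, 2; columns: actions A, B, C), as reconstructed above.
M = np.array([[30, -10, 20],
              [-10, 20, -20]])

p = np.array([4/7, 3/7])        # Red's equilibrium mix over rows
q = np.array([0, 4/7, 3/7])     # Blue's equilibrium mix over columns

print("Red's payoff against each column:", p @ M)        # at least 20/7 for every column
print("Red's payoff of each row vs Blue's mix:", M @ q)  # at most 20/7 for every row
print("value of the game:", p @ M @ q, "= 20/7 =", 20/7)
```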

64 Connections to Game Theory: Repeated Two-Player Zero-Sum Games
At each round $t$:
- The row player computes a mixed strategy $\hat p_t = (\hat p_{1,t}, \ldots, \hat p_{N,t})$
- The column player computes a mixed strategy $\hat q_t = (\hat q_{1,t}, \ldots, \hat q_{M,t})$
- The row player selects action $I_t \in \{1, \ldots, N\}$, the column player selects action $J_t \in \{1, \ldots, M\}$
- The row player suffers $\ell(I_t, J_t)$, the column player suffers $-\ell(I_t, J_t)$
Value of the game:
$$V = \max_q \min_p \ell(p, q) \quad\text{with}\quad \ell(p, q) = \sum_{i=1}^N\sum_{j=1}^M p_i\,q_j\,\ell(i, j)$$

65 Connections to Game Theory: Repeated Two-Player Zero-Sum Games
Question: what if the two players are both bandit algorithms (e.g., Exp3)?
Row player: a bandit algorithm is able to minimize
$$R_n(\text{row}) = \sum_{t=1}^n \ell(I_t, J_t) - \min_{i=1,\ldots,N}\sum_{t=1}^n \ell(i, J_t)$$
Column player: a bandit algorithm is able to minimize
$$R_n(\text{col}) = \max_{j=1,\ldots,M}\sum_{t=1}^n \ell(I_t, j) - \sum_{t=1}^n \ell(I_t, J_t)$$

66 Connections to Game Theory: Repeated Two-Player Zero-Sum Games
Theorem. If both the row and the column players play according to a Hannan-consistent strategy, then
$$\limsup_{n\to\infty} \frac{1}{n}\sum_{t=1}^n \ell(I_t, J_t) = V$$

67 Connections to Game Theory: Repeated Two-Player Zero-Sum Games
Theorem. The empirical distribution of plays
$$\hat p_{i,n} = \frac{1}{n}\sum_{t=1}^n \mathbb{I}\{I_t = i\}, \qquad \hat q_{j,n} = \frac{1}{n}\sum_{t=1}^n \mathbb{I}\{J_t = j\}$$
induces a product distribution $\hat p_n \times \hat q_n$ which converges to the set of Nash equilibria $p^* \times q^*$.

68 Connections to Game Theory: Repeated Two-Player Zero-Sum Games
Proof idea. Since $\ell(p, J_t)$ is linear over the simplex, the minimum is attained at one of the corners [math]:
$$\min_{i=1,\ldots,N}\frac{1}{n}\sum_{t=1}^n \ell(i, J_t) = \min_p \frac{1}{n}\sum_{t=1}^n \ell(p, J_t)$$
We consider the empirical distribution of the column player [def]:
$$\hat q_{j,n} = \frac{1}{n}\sum_{t=1}^n \mathbb{I}\{J_t = j\}$$
Elaborating on it [math]:
$$\min_p \frac{1}{n}\sum_{t=1}^n \ell(p, J_t) = \min_p \sum_{j=1}^M \hat q_{j,n}\,\ell(p, j) = \min_p \ell(p, \hat q_n) \leq \max_q\min_p \ell(p, q) = V$$

69 Connections to Game Theory: Repeated Two-Player Zero-Sum Games
Proof idea. By definition of a Hannan-consistent strategy [def]:
$$\limsup_{n\to\infty}\left[\frac{1}{n}\sum_{t=1}^n \ell(I_t, J_t) - \min_{i=1,\ldots,N}\frac{1}{n}\sum_{t=1}^n \ell(i, J_t)\right] \leq 0$$
Then
$$\limsup_{n\to\infty}\frac{1}{n}\sum_{t=1}^n \ell(I_t, J_t) \leq V$$
If we do the same for the other player [zero-sum game]:
$$\liminf_{n\to\infty}\frac{1}{n}\sum_{t=1}^n \ell(I_t, J_t) \geq V$$

70 Connections to Game Theory: Repeated Two-Player Zero-Sum Games
Question: how fast do they converge to the Nash equilibrium?
Answer: it depends on the specific algorithm. For EWA($\eta$), we know that
$$\sum_{t=1}^n \ell(I_t, J_t) - \min_{i=1,\ldots,N}\sum_{t=1}^n \ell(i, J_t) \leq \frac{\log N}{\eta} + \frac{n\eta}{8} + \sqrt{\frac{n}{2}\log\frac{1}{\delta}}$$

71 Connections to Game Theory: Repeated Two-Player Zero-Sum Games
Generality of the results:
- Players do not know the payoff matrix
- Players do not observe the loss of the other player
- Players do not even observe the action of the other player

72 Connections to Game Theory: Internal Regret and Correlated Equilibria
External (expected) regret:
$$R_n = \sum_{t=1}^n \ell(\hat p_t, y_t) - \min_{i=1,\ldots,N}\sum_{t=1}^n \ell(i, y_t) = \max_{i=1,\ldots,N}\sum_{t=1}^n\sum_{j=1}^N \hat p_{j,t}\big(\ell(j, y_t) - \ell(i, y_t)\big)$$
Internal (expected) regret:
$$R^I_n = \max_{i,j=1,\ldots,N}\sum_{t=1}^n \hat p_{i,t}\big(\ell(i, y_t) - \ell(j, y_t)\big)$$

73 Connections to Game Theory: Internal Regret and Correlated Equilibria
Internal (expected) regret:
$$R^I_n = \max_{i,j=1,\ldots,N}\sum_{t=1}^n \hat p_{i,t}\big(\ell(i, y_t) - \ell(j, y_t)\big)$$
Intuition: an algorithm has small internal regret if, for each pair of experts $(i, j)$, the learner does not regret not having followed expert $j$ each time it followed expert $i$.

74 Connections to Game Theory: Internal Regret and Correlated Equilibria
Theorem. Consider a $K$-person game with a set of correlated equilibria $\mathcal{C}$. If all the players are internal-regret minimizers, then the distance between the empirical distribution of plays and the set of correlated equilibria $\mathcal{C}$ converges to 0.

75 Connections to Game Theory: Nash Equilibria in Extensive-Form Games
A powerful model for sequential games:
- Checkers / Chess / Go
- Poker
- Bargaining
- Monitoring
- Patrolling
- ...

76 Connections to Game Theory: Nash Equilibria in Extensive-Form Games

77 Connections to Game Theory: Nash Equilibria in Extensive-Form Games

78 Connections to Game Theory: Nash Equilibria in Extensive-Form Games
No details about the algorithm... but...
Theorem. If player $k$ selects actions according to the counterfactual regret minimization algorithm, then it achieves a regret
$$R_{k,T} \leq \#\text{states}\,\sqrt{\frac{\#\text{actions}}{T}}$$
Theorem. In a two-player zero-sum extensive-form game, counterfactual regret minimization achieves a $2\epsilon$-Nash equilibrium, with
$$\epsilon \leq \#\text{states}\,\sqrt{\frac{\#\text{actions}}{T}}$$

79 Other Stochastic Multi-arm Bandit Problems: Outline
Mathematical Tools
The General Multi-arm Bandit Problem
The Stochastic Multi-arm Bandit Problem
The Non-Stochastic Multi-arm Bandit Problem
Connections to Game Theory
Other Stochastic Multi-arm Bandit Problems

80 Other Stochastic Multi-arm Bandit Problems: The Best Arm Identification Problem
Motivating examples:
- Find the best shortest path in a limited number of days
- Maximize the confidence about the best treatment after a finite number of patients
- Discover the best advertisements after a training phase
- ...

81 Other Stochastic Multi-arm Bandit Problems: The Best Arm Identification Problem
Objective: given a fixed budget $n$, return the best arm $i^* = \arg\max_i \mu_i$ at the end of the experiment.
Measure of performance: the probability of error $P[J_n \neq i^*]$
$$P[J_n \neq i^*] \leq \sum_{i=1}^N \exp\left(-T_{i,n}\,\frac{\Delta_i^2}{2}\right)$$
Algorithm idea: mimic the behavior of the optimal allocation
$$T^*_{i,n} = \frac{1/\Delta_i^2}{\sum_{j=1}^N 1/\Delta_j^2}\, n$$

82 Other Stochastic Multi-arm Bandit Problems: The Best Arm Identification Problem
The Successive Rejects algorithm:
- Divide the budget into $N-1$ phases. Define ($\overline{\log}(N) = \frac{1}{2} + \sum_{i=2}^N \frac{1}{i}$)
$$n_k = \left\lceil \frac{1}{\overline{\log}(N)}\,\frac{n - N}{N + 1 - k} \right\rceil$$
- Set of active arms $A_k$ at phase $k$ ($A_1 = \{1, \ldots, N\}$)
- For each phase $k = 1, \ldots, N-1$:
  - For each arm $i \in A_k$, pull arm $i$ for $n_k - n_{k-1}$ rounds
  - Remove the worst arm: $A_{k+1} = A_k \setminus \arg\min_{i \in A_k} \hat\mu_{i,n_k}$
- Return the only remaining arm $J_n = A_N$
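
A compact Python sketch of Successive Rejects on Bernoulli arms (the arm means and the budget are illustrative; the ceiling in n_k follows the formula above):

```python
import numpy as np

def successive_rejects(means, n, seed=0):
    """Successive Rejects sketch: N-1 phases, dropping the empirically worst active arm at the end of each phase."""
    rng = np.random.default_rng(seed)
    N = len(means)
    log_bar = 0.5 + sum(1.0 / i for i in range(2, N + 1))
    n_k = lambda k: int(np.ceil((n - N) / (log_bar * (N + 1 - k))))

    active = list(range(N))
    pulls = np.zeros(N)
    sums = np.zeros(N)
    prev = 0
    for k in range(1, N):
        for i in active:
            for _ in range(n_k(k) - prev):             # pull each active arm n_k - n_{k-1} times
                sums[i] += rng.binomial(1, means[i])
                pulls[i] += 1
        prev = n_k(k)
        worst = min(active, key=lambda i: sums[i] / pulls[i])
        active.remove(worst)                            # reject the empirically worst arm
    return active[0]                                     # recommended arm J_n

print(successive_rejects([0.5, 0.45, 0.4, 0.3], n=4000))
```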

83 Other Stochastic Multi-arm Bandit Problems: The Best Arm Identification Problem
Theorem. The Successive Rejects algorithm has a probability of error
$$P[J_n \neq i^*] \leq \frac{N(N-1)}{2}\exp\left(-\frac{n - N}{\overline{\log}(N)\,H_2}\right)$$
with $H_2 = \max_{i=1,\ldots,N} i\,\Delta_{(i)}^{-2}$ (where $\Delta_{(i)}$ are the gaps sorted in increasing order).

84 Other Stochastic Multi-arm Bandit Problems: The Best Arm Identification Problem
The UCB-E algorithm:
- Define an exploration parameter $a$
- Compute $B_{i,s} = \hat\mu_{i,s} + \sqrt{\frac{a}{s}}$
- Select $I_t = \arg\max_i B_{i,T_{i,t-1}}$
- At the end, return $J_n = \arg\max_i \hat\mu_{i,T_{i,n}}$
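
A matching sketch of UCB-E (the exploration parameter a is passed in; the usage below plugs in a value of order (n - N)/H1 as in the theorem on the next slide, computing H1 from the true gaps, which a real agent would not know):

```python
import numpy as np

def ucb_e(means, n, a, seed=0):
    """UCB-E sketch: index B_{i,s} = mu_hat_{i,s} + sqrt(a / s); recommend the empirical best arm at the end."""
    rng = np.random.default_rng(seed)
    N = len(means)
    pulls = np.zeros(N)
    mu_hat = np.zeros(N)
    for t in range(n):
        if t < N:                                      # one initial pull per arm
            i = t
        else:
            i = int(np.argmax(mu_hat + np.sqrt(a / pulls)))
        x = rng.binomial(1, means[i])
        pulls[i] += 1
        mu_hat[i] += (x - mu_hat[i]) / pulls[i]
    return int(np.argmax(mu_hat))                      # recommended arm J_n

means = [0.5, 0.45, 0.4, 0.3]
gaps = [max(means) - m for m in means if m < max(means)]
H1 = sum(1.0 / g**2 for g in gaps)
print(ucb_e(means, n=4000, a=(4000 - 4) / H1))
```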

85 Other Stochastic Multi-arm Bandit Problems: The Best Arm Identification Problem
Theorem. The UCB-E algorithm with $a = \frac{n - N}{H_1}$ has a probability of error
$$P[J_n \neq i^*] \leq 2nN\exp\left(-\frac{2a}{25}\right)$$
with $H_1 = \sum_{i=1}^N 1/\Delta_i^2$.

86 Other Stochastic Multi-arm Bandit Problems: The Best Arm Identification Problem

87 Other Stochastic Multi-arm Bandit Problems: The Active Bandit Problem
Motivating examples:
- $N$ production lines
- Testing the performance of a line is expensive
- We want an accurate estimate of the performance of each production line

88 Other Stochastic Multi-arm Bandit Problems: The Active Bandit Problem
Objective: given a fixed budget $n$, return an estimate $\hat\mu_{i,n}$ of the means which is as accurate as possible for all the arms.
Notice: if an arm has mean $\mu_i$ and variance $\sigma_i^2$ and it is pulled $T_{i,n}$ times, then
$$L_{i,n} = \mathbb{E}\big[(\hat\mu_{i,T_{i,n}} - \mu_i)^2\big] = \frac{\sigma_i^2}{T_{i,n}}, \qquad L_n = \max_i L_{i,n}$$

89 Other Stochastic Multi-arm Bandit Problems: The Active Bandit Problem
Problem: what is the allocation $(T_{1,n}, \ldots, T_{N,n})$ (such that $\sum_i T_{i,n} = n$) which minimizes the loss?
Answer:
$$(T^*_{1,n}, \ldots, T^*_{N,n}) = \arg\min_{(T_{1,n},\ldots,T_{N,n})} L_n, \qquad T^*_{i,n} = \frac{\sigma_i^2}{\sum_{j=1}^N \sigma_j^2}\,n, \qquad L^*_n = \frac{\sum_{i=1}^N \sigma_i^2}{n} = \frac{\Sigma}{n}$$

90 Other Stochastic Multi-arm Bandit Problems: The Active Bandit Problem
Objective: given a fixed budget $n$, return an estimate $\hat\mu_{i,n}$ of the means which is as accurate as possible for all the arms.
Measure of performance: the regret on the quadratic error
$$R_n(\mathcal{A}) = \max_i L_{i,n}(\mathcal{A}) - \frac{\sum_{i=1}^N \sigma_i^2}{n}$$
Algorithm idea: mimic the behavior of the optimal allocation
$$T^*_{i,n} = \frac{\sigma_i^2}{\sum_{j=1}^N \sigma_j^2}\,n = \lambda_i\, n$$

91 Other Stochastic Multi-arm Bandit Problems: The Active Bandit Problem
A UCB-based strategy. At each time step $t = 1, \ldots, n$:
- Estimate $\hat\sigma^2_{i,t-1} = \frac{1}{T_{i,t-1}}\sum_{s=1}^{T_{i,t-1}} X^2_{i,s} - \hat\mu^2_{i,t-1}$
- Compute $B_{i,t} = \frac{1}{T_{i,t-1}}\left(\hat\sigma^2_{i,t-1} + \sqrt{\frac{\log 1/\delta}{2T_{i,t-1}}}\right)$
- Pull arm $I_t = \arg\max_i B_{i,t}$
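
A sketch of this variance-based UCB allocation (the sampling function, the number of initial pulls per arm, and the choice of delta are illustrative assumptions):

```python
import numpy as np

def active_ucb(sample, N, n, delta=0.01, seed=0):
    """Active-bandit allocation sketch: pull the arm maximizing
    B_i = (sigma_hat_i^2 + sqrt(log(1/delta) / (2 T_i))) / T_i.
    `sample(i, rng)` draws one reward from arm i (user-provided)."""
    rng = np.random.default_rng(seed)
    obs = [[] for _ in range(N)]
    for t in range(n):
        if t < 2 * N:                        # two initial pulls per arm to get a variance estimate
            i = t % N
        else:
            B = []
            for xs in obs:
                T = len(xs)
                var_hat = np.var(xs)
                B.append((var_hat + np.sqrt(np.log(1 / delta) / (2 * T))) / T)
            i = int(np.argmax(B))
        obs[i].append(sample(i, rng))
    return [float(np.mean(xs)) for xs in obs]  # estimated means

widths = [1.0, 0.5, 0.1]                       # uniform arms with different variances (illustrative)
print(active_ucb(lambda i, rng: rng.uniform(0, widths[i]), N=3, n=3000))
```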

92 Other Stochastic Multi-arm Bandit Problems: The Active Bandit Problem
Theorem. The UCB-based algorithm achieves a regret
$$R_n(\mathcal{A}) \leq \frac{98\log(n)}{n^{3/2}\,\lambda_{\min}^{5/2}} + O\left(\frac{\log n}{n^2}\right)$$

93 Bibliography I

94 Reinforcement Learning
Alessandro Lazaric
sequel.lille.inria.fr


More information

Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning

Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning Christos Dimitrakakis Informatics Institute, University of Amsterdam, Amsterdam, The Netherlands

More information

Theory and Applications of A Repeated Game Playing Algorithm. Rob Schapire Princeton University [currently visiting Yahoo!

Theory and Applications of A Repeated Game Playing Algorithm. Rob Schapire Princeton University [currently visiting Yahoo! Theory and Applications of A Repeated Game Playing Algorithm Rob Schapire Princeton University [currently visiting Yahoo! Research] Learning Is (Often) Just a Game some learning problems: learn from training

More information

From Bandits to Experts: A Tale of Domination and Independence

From Bandits to Experts: A Tale of Domination and Independence From Bandits to Experts: A Tale of Domination and Independence Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Domination and Independence 1 / 1 From Bandits to Experts: A

More information

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms

Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Exponential Moving Average Based Multiagent Reinforcement Learning Algorithms Mostafa D. Awheda Department of Systems and Computer Engineering Carleton University Ottawa, Canada KS 5B6 Email: mawheda@sce.carleton.ca

More information

Lecture 2: Learning from Evaluative Feedback. or Bandit Problems

Lecture 2: Learning from Evaluative Feedback. or Bandit Problems Lecture 2: Learning from Evaluative Feedback or Bandit Problems 1 Edward L. Thorndike (1874-1949) Puzzle Box 2 Learning by Trial-and-Error Law of Effect: Of several responses to the same situation, those

More information

Ordinal optimization - Empirical large deviations rate estimators, and multi-armed bandit methods

Ordinal optimization - Empirical large deviations rate estimators, and multi-armed bandit methods Ordinal optimization - Empirical large deviations rate estimators, and multi-armed bandit methods Sandeep Juneja Tata Institute of Fundamental Research Mumbai, India joint work with Peter Glynn Applied

More information

Lecture 4: Lower Bounds (ending); Thompson Sampling

Lecture 4: Lower Bounds (ending); Thompson Sampling CMSC 858G: Bandits, Experts and Games 09/12/16 Lecture 4: Lower Bounds (ending); Thompson Sampling Instructor: Alex Slivkins Scribed by: Guowei Sun,Cheng Jie 1 Lower bounds on regret (ending) Recap from

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Model-Based Reinforcement Learning Model-based, PAC-MDP, sample complexity, exploration/exploitation, RMAX, E3, Bayes-optimal, Bayesian RL, model learning Vien Ngo MLR, University

More information

Adaptive Learning with Unknown Information Flows

Adaptive Learning with Unknown Information Flows Adaptive Learning with Unknown Information Flows Yonatan Gur Stanford University Ahmadreza Momeni Stanford University June 8, 018 Abstract An agent facing sequential decisions that are characterized by

More information

Dynamic resource allocation: Bandit problems and extensions

Dynamic resource allocation: Bandit problems and extensions Dynamic resource allocation: Bandit problems and extensions Aurélien Garivier Institut de Mathématiques de Toulouse MAD Seminar, Université Toulouse 1 October 3rd, 2014 The Bandit Model Roadmap 1 The Bandit

More information

Online Learning with Feedback Graphs

Online Learning with Feedback Graphs Online Learning with Feedback Graphs Claudio Gentile INRIA and Google NY clagentile@gmailcom NYC March 6th, 2018 1 Content of this lecture Regret analysis of sequential prediction problems lying between

More information

Efficient learning by implicit exploration in bandit problems with side observations

Efficient learning by implicit exploration in bandit problems with side observations Efficient learning by implicit exploration in bandit problems with side observations Tomáš Kocák, Gergely Neu, Michal Valko, Rémi Munos SequeL team, INRIA Lille - Nord Europe, France SequeL INRIA Lille

More information