The Multi-Armed Bandit Problem

1 The Multi-Armed Bandit Problem
Electrical and Computer Engineering
December 7, 2013

2 Outline
  1
  2 Mathematical
  3 Algorithm
      Upper Confidence Bound Algorithm

3 A/B Testing

4 Exploration vs. Exploitation
- Scientist view: explore new ideas
- Businessman view: exploit the best idea found so far

5 Terminology
- Pulling an arm = making a choice (which ad or color to display)
- Reward/regret = measure of success (user click, item purchase)

6 Problem Formulation
- $K$ arms, indexed $1, \dots, K$.
- Arm $i$ has reward distribution $\nu_i(x)$, $x \in [0, 1]$, with mean $\mu_i$. Think $\mathrm{Bernoulli}(p_i)$.
- The $\nu_i$ are unknown.
- Finite time horizon (number of arm pulls) $n$.
- At time $t$, the player chooses arm $I_t \in \{1, \dots, K\}$ and the environment returns the reward $g_{I_t,t} \sim \nu_{I_t}$.
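The formulation above can be sketched as a small simulated environment. This is only an illustration under stated assumptions: the class name `BernoulliBandit`, its `pull` method, and the seeding are hypothetical, and arms are indexed from 0 rather than 1.

```python
import random


class BernoulliBandit:
    """K-armed bandit whose arm i pays 1 with probability probs[i], else 0."""

    def __init__(self, probs, seed=None):
        self.probs = list(probs)        # hidden means mu_i = p_i
        self.rng = random.Random(seed)  # private RNG so runs are repeatable

    @property
    def n_arms(self):
        return len(self.probs)

    def pull(self, arm):
        """Return one reward g_{arm,t} in {0, 1} drawn from arm `arm`."""
        return 1 if self.rng.random() < self.probs[arm] else 0


bandit = BernoulliBandit([0.1, 0.1, 0.1, 0.1, 0.9], seed=0)
rewards = [bandit.pull(4) for _ in range(1000)]
empirical_mean = sum(rewards) / len(rewards)
print(empirical_mean)  # should land close to the hidden mean 0.9
```

The player only ever sees the values returned by `pull`; the list `probs` plays the role of the unknown $\nu_i$.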

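11 Definitions
Define
  $i^* = \arg\max_{i=1,\dots,K} \mu_i$
  $\mu^* = \max_{i=1,\dots,K} \mu_i$
  $\Delta_i = \mu^* - \mu_i$
  $T_i(n) = \sum_{t=1}^{n} \mathbb{1}\{I_t = i\}$
Cumulative regret
  $\hat{R}_n = \sum_{t=1}^{n} g_{i^*,t} - \sum_{t=1}^{n} g_{I_t,t}$
Objective
- Find the best arm
- Minimize the expected regret
  $R_n = \mathbb{E}\hat{R}_n = n\mu^* - \mathbb{E}\sum_{i=1}^{K} T_i(n)\mu_i = \sum_{i=1}^{K} \Delta_i \, \mathbb{E}T_i(n)$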
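To make the last identity concrete, the sketch below evaluates both sides for one fixed set of pull counts. The function name `regret_from_counts` and the example counts are made up for illustration; the means are the Bernoulli example used elsewhere in these slides.

```python
def regret_from_counts(means, pulls):
    """Sum over arms of Delta_i * T_i(n), with Delta_i = mu* - mu_i."""
    mu_star = max(means)
    return sum((mu_star - mu) * t for mu, t in zip(means, pulls))


means = [0.1, 0.1, 0.1, 0.1, 0.9]
pulls = [10, 10, 10, 10, 960]  # hypothetical T_i(n) after n = 1000 pulls
n = sum(pulls)

via_gaps = regret_from_counts(means, pulls)
via_means = n * max(means) - sum(mu * t for mu, t in zip(means, pulls))
print(via_gaps, via_means)  # both forms agree, up to floating-point rounding
```

Only the 40 pulls of suboptimal arms contribute, each costing a gap of 0.8, so both expressions come out at about 32.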

16 Clarification
Objectively and subjectively best options
- Objectively best: which option is truly the best (as known to an oracle)?
- Subjectively best: which option has been best in the past?
Exploitation vs. exploration
- Exploitation: choose the subjectively best arm
- Exploration: choose anything else

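18 $\epsilon$-Greedy Algorithm
Strategy = $\epsilon$ Scientist + $(1 - \epsilon)$ Businessman
At each time $t$:
- With probability $1 - \epsilon$, pick the subjectively best arm
- With probability $\epsilon$, pick an arm uniformly at random (each arm with probability $\epsilon / K$)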
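A minimal sketch of this strategy on simulated Bernoulli arms follows. The function name `epsilon_greedy`, the seed, and the tie-breaking rule (lowest index wins) are assumptions made for the example, not details from the slides.

```python
import random


def epsilon_greedy(probs, n, epsilon, seed=0):
    """Run epsilon-greedy for n pulls on Bernoulli arms; return pull counts."""
    rng = random.Random(seed)
    k = len(probs)
    counts = [0] * k   # T_i: number of pulls of arm i so far
    means = [0.0] * k  # empirical mean reward of arm i so far
    for _ in range(n):
        if rng.random() < epsilon:
            arm = rng.randrange(k)  # explore: uniform over all K arms
        else:
            arm = max(range(k), key=lambda i: means[i])  # exploit
        reward = 1 if rng.random() < probs[arm] else 0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
    return counts


counts = epsilon_greedy([0.1, 0.1, 0.1, 0.1, 0.9], n=5000, epsilon=0.1)
print(counts)  # the last arm should receive the large majority of pulls
```

Because the exploration rate stays fixed, roughly a fraction $\epsilon (K-1)/K$ of all pulls keeps going to suboptimal arms no matter how long the run is.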

19 Probability of Selecting Best Arm
Bernoulli arms with reward probabilities 0.1, 0.1, 0.1, 0.1, 0.9
[Figure: Accuracy of the Epsilon Greedy Algorithm. x-axis: Time; y-axis: Probability of Selecting Best Arm; one curve per value of Epsilon.]
- $\epsilon = 0.1$ (Businessman): learns slowly, does well at the end
- $\epsilon = 0.5$ (Scientist): learns quickly, doesn't exploit at the end

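20 Theoretical guarantee
Weakness: $\epsilon$ is constant. Solution: annealing.
Theoretical guarantee (Auer, Cesa-Bianchi, Fischer, 2002)
- Let $\Delta = \min_{i:\Delta_i>0} \Delta_i$ and consider $\epsilon_t = \min\left(\frac{6K}{\Delta^2 t}, 1\right)$.
- When $t \ge \frac{6K}{\Delta^2}$, the probability of choosing a suboptimal arm $i$ is bounded by $\frac{C}{\Delta^2 t}$, for some constant $C > 0$.
- As a consequence, $\mathbb{E}[T_i(n)] \le \frac{C}{\Delta^2} \log n$ and $R_n \le \sum_{i:\Delta_i>0} \frac{C \Delta_i}{\Delta^2} \log n$: logarithmic regret.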
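The annealing schedule is easy to tabulate. The sketch below assumes the smallest gap $\Delta$ is known, which is not the case in the problem as formulated, so treat it purely as an illustration of how fast $\epsilon_t$ decays; the function name `annealed_epsilon` is hypothetical.

```python
def annealed_epsilon(t, k, delta):
    """Exploration rate min(6K / (delta^2 * t), 1) for round t >= 1."""
    return min(6.0 * k / (delta ** 2 * t), 1.0)


k, delta = 5, 0.8                # five arms, gap 0.9 - 0.1
threshold = 6 * k / delta ** 2   # before this round the rate is capped at 1

for t in (1, 10, 100, 1000):
    print(t, annealed_epsilon(t, k, delta))
```

With these values the threshold is a little under 47 rounds: the player explores on every round until then, and afterwards the exploration rate shrinks like $1/t$.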

23 Weakness of $\epsilon$-Greedy
Exploration is insensitive to relative performance levels:
- Two arms with rewards 0.9 and 0.1
- Two arms with rewards 0.15 and 0.1
Both cases are explored at the same rate.
Solution: Softmax

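24 Softmax
Idea:
  $P(\text{arm } 1) = \frac{\hat\mu_1}{\hat\mu_1 + \hat\mu_2}$, $P(\text{arm } 2) = \frac{\hat\mu_2}{\hat\mu_1 + \hat\mu_2}$
Variant with temperature $T$:
  $P(\text{arm } 1) = \frac{e^{\hat\mu_1 / T}}{e^{\hat\mu_1 / T} + e^{\hat\mu_2 / T}}$, $P(\text{arm } 2) = \frac{e^{\hat\mu_2 / T}}{e^{\hat\mu_1 / T} + e^{\hat\mu_2 / T}}$
- $T \to \infty$: pure exploration
- $T \to 0$: pure exploitation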
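The effect of the temperature can be checked numerically. This is a small sketch, not the presenter's code: `softmax_probs` is a hypothetical helper, and subtracting the maximum before exponentiating is a standard numerical-stability step that leaves the probabilities unchanged.

```python
import math


def softmax_probs(means, temperature):
    """Selection probabilities proportional to exp(mean / temperature)."""
    scaled = [m / temperature for m in means]
    shift = max(scaled)  # subtracting the max avoids overflow in exp
    weights = [math.exp(s - shift) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]


estimates = [0.15, 0.10]  # empirical means of two arms
for temperature in (100.0, 0.1, 0.01):
    print(temperature, softmax_probs(estimates, temperature))
```

At a temperature of 100 the two arms are chosen almost equally often, while at 0.01 nearly all of the probability goes to the arm with the higher estimate.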

26 Weakness of Softmax
Doesn't use confidence: the two cases below are treated identically.
- $\hat{p}_1 = 0.15$ after 100 plays, $\hat{p}_2 = 0.1$ after 100 plays
- $\hat{p}_1 = 0.15$ after 100K plays, $\hat{p}_2 = 0.1$ after 100K plays
Solution: UCB (Upper Confidence Bound) Algorithm
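One way to see why the number of plays matters is to attach a Hoeffding confidence radius to each estimate. The sketch assumes rewards in $[0, 1]$ and a two-sided 95% level; the helper name `hoeffding_radius` and the choice of 0.05 for the failure probability are illustrative assumptions.

```python
import math


def hoeffding_radius(n, failure_prob=0.05):
    """Radius r with |p_hat - p| <= r with probability >= 1 - failure_prob."""
    return math.sqrt(math.log(2.0 / failure_prob) / (2.0 * n))


gap = 0.15 - 0.10  # difference between the two empirical means
for plays in (100, 100_000):
    r = hoeffding_radius(plays)
    # The two intervals are disjoint only when 2 * r is smaller than the gap.
    print(plays, round(r, 4), 2 * r < gap)
```

After 100 plays the intervals overlap heavily, so the ordering of the two arms is not yet established; after 100K plays they are clearly separated. Softmax assigns the same probabilities in both situations.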

27 UCB Algorithm
Optimism in the Face of Uncertainty
- At time $t$, construct the most optimistic estimate for each arm:
  $V_{i,t-1} = \hat\mu_{i,t-1} + \sqrt{\frac{2 \log t}{T_i(t-1)}}$
- Play the arm with the maximum upper bound, i.e. play $I_t \in \arg\max_{i \in \{1,\dots,K\}} V_{i,t-1}$
- Proof based on Hoeffding's inequality
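The index rule can be sketched in a few lines. This assumes the UCB1 form of the bonus from Auer et al. (2002), $\sqrt{2 \log t / T_i(t-1)}$, and an initial pass that plays every arm once so that no count is zero; the function name `ucb1`, the seed, and the simulated Bernoulli arms are choices made for the example.

```python
import math
import random


def ucb1(probs, n, seed=0):
    """Run UCB1 for n pulls on Bernoulli arms; return pull counts."""
    rng = random.Random(seed)
    k = len(probs)
    counts = [0] * k   # T_i(t-1): pulls of arm i before round t
    means = [0.0] * k  # empirical mean of arm i before round t
    for t in range(1, n + 1):
        if t <= k:
            arm = t - 1  # initialization: play each arm once
        else:
            # Optimistic index: empirical mean plus exploration bonus.
            arm = max(
                range(k),
                key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]),
            )
        reward = 1 if rng.random() < probs[arm] else 0
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running average
    return counts


counts = ucb1([0.1, 0.1, 0.1, 0.1, 0.9], n=5000)
print(counts)  # suboptimal arms should be pulled only a small number of times
```

Unlike $\epsilon$-greedy, there is no exploration parameter to tune: an arm is revisited only when its bonus grows large enough to make its index competitive again.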

31 Results
[Figure: Accuracy of the UCB1 Algorithm. x-axis: Time; y-axis: Probability of Selecting Best Arm.]

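32 Theoretical Guarantee
Regret bound (Auer, Cesa-Bianchi, Fischer, 2002)
  $R_n \le \left[ 8 \sum_{i:\mu_i<\mu^*} \left( \frac{\log n}{\Delta_i} \right) \right] + \left( 1 + \frac{\pi^2}{3} \right) \left( \sum_{i=1}^{K} \Delta_i \right)$
Lower bound (Lai and Robbins, 1985)
Asymptotic total regret is at least logarithmic in the number of steps:
  $\liminf_{n \to \infty} \frac{R_n}{\log n} \ge \sum_{i:\Delta_i>0} \frac{\Delta_i}{\mathrm{KL}(\nu_i \,\|\, \nu^*)}$
where $\nu^*$ is the reward distribution of the best arm.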
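For Bernoulli arms the constant in the lower bound has a closed form, which the following sketch evaluates for the five-arm example from the earlier slides. The helper names `bernoulli_kl` and `lai_robbins_constant` are hypothetical, and the code assumes all means lie strictly between 0 and 1.

```python
import math


def bernoulli_kl(p, q):
    """KL divergence KL(Bernoulli(p) || Bernoulli(q)) for 0 < p, q < 1."""
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def lai_robbins_constant(means):
    """Sum over suboptimal arms of Delta_i / KL(nu_i || nu*)."""
    mu_star = max(means)
    return sum(
        (mu_star - mu) / bernoulli_kl(mu, mu_star)
        for mu in means
        if mu < mu_star
    )


constant = lai_robbins_constant([0.1, 0.1, 0.1, 0.1, 0.9])
print(constant)  # asymptotic regret grows at least like constant * log(n)
```

Here each suboptimal arm contributes $0.8 / (0.8 \ln 9) = 1 / \ln 9$, so the constant is $4 / \ln 9$, or about 1.82.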

34 Comparison
[Figure: Accuracy of Different Algorithms. x-axis: Time; y-axis: Probability of Selecting Best Arm; curves: Annealing epsilon Greedy, UCB1, Annealing Softmax.]

35 Summary
  1
  2 Mathematical
  3 Algorithm
      Upper Confidence Bound Algorithm

36 References
- White, John. Bandit Algorithms for Website Optimization. O'Reilly.
- Auer, Peter, Nicolo Cesa-Bianchi, and Paul Fischer. "Finite-time analysis of the multiarmed bandit problem." Machine Learning (2002).
