Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade

1 Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade Machine Learning for Big Data CSE547/STAT548 University of Washington S. M. Kakade (UW) Optimization for Big data 1 / 22

2 Announcements
HW 4 will be posted soon (short).
Poster session: June 1, 9-11:30am; ask the TA or CSE students for help with printing.
Projects: the end of the term is approaching.
Today:
Quick overview: parallelization and deep learning.
Bandits:
1. Review: vanilla K-armed setting, UCB.
2. Today: UCB (continued), Thompson sampling, linear bandits and ad placement.

3 The problem
In unsupervised learning, we just have data.
In supervised learning, we have inputs X and labels Y (often we spend resources to get these labels).
In reinforcement learning (very general), we act in the world, there is state, and we observe rewards.
Bandit settings: we have K decisions each round, and we only receive feedback for the chosen decision.

4 Review

5 Multi-Armed Bandit Game
K independent arms: $a \in \{1, \ldots, K\}$.
Each arm $a$ returns a random reward $R_a$ if pulled.
(Simpler case) Assume the distribution of $R_a$ is not time varying.
Game:
You choose arm $a_t$ at time $t$.
You then observe $X_t = R_{a_t}$, where $R_{a_t}$ is sampled from the underlying distribution of that arm.
The distribution of $R_a$ is not known.
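The game above can be sketched in a few lines of Python. This is only an illustration under an added assumption: rewards are Bernoulli (0 or 1), which the slides do not require. The class name `BernoulliBandit` and its `means` and `seed` parameters are hypothetical names chosen for this sketch.

```python
import random


class BernoulliBandit:
    """K independent arms; arm a pays 1 with probability means[a], else 0.

    Bernoulli rewards are an assumption of this sketch, not of the slides.
    """

    def __init__(self, means, seed=0):
        self.means = list(means)        # hidden from the learner
        self.rng = random.Random(seed)  # private generator for reproducibility

    @property
    def k(self):
        return len(self.means)

    def pull(self, arm):
        """Return X_t = R_{a_t}: a fresh sample from the chosen arm only."""
        return 1.0 if self.rng.random() < self.means[arm] else 0.0


bandit = BernoulliBandit([0.2, 0.5, 0.7], seed=1)
rewards = [bandit.pull(2) for _ in range(1000)]
print(sum(rewards) / len(rewards))  # empirical mean of arm 2, near 0.7
```

The learner only ever calls `pull`, so it sees feedback for the arm it chose and nothing else, which is the defining feature of the bandit setting.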

6 Ad placement...

7 The Goal
We would like to maximize our long-term future reward.
Our (possibly randomized) sequential strategy/algorithm $A$ is:
$$a_t = A(a_1, X_1, a_2, X_2, \ldots, a_{t-1}, X_{t-1})$$
In $T$ rounds, our reward is:
$$E\left[\sum_{t=1}^{T} X_t \,\middle|\, A\right]$$
where the expectation is with respect to the reward process and our algorithm.
Objective: What is a strategy which maximizes our long-term reward?

8 Our Regret
Suppose $\mu_a = E[R_a]$. Assume $0 \le \mu_a \le 1$.
Let $\mu^* = \max_a \mu_a$.
In expectation, the best we can do is obtain $\mu^* T$ reward in $T$ steps.
In $T$ rounds, our regret is:
$$\mu^* T - E\left[\sum_{t=1}^{T} X_t \,\middle|\, A\right] \le\ ??$$
Objective: What is a strategy which makes our regret small?
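As a small worked example of the regret definition, the sketch below computes the regret of a given sequence of pulls when the true means are known (something only a simulator has access to). It uses the fact that $E[X_t \mid a_t] = \mu_{a_t}$; the function name `expected_regret` is a hypothetical helper for this illustration.

```python
def expected_regret(means, pulls):
    """mu* T minus the expected reward of the arms that were actually pulled."""
    mu_star = max(means)
    return mu_star * len(pulls) - sum(means[a] for a in pulls)


means = [0.2, 0.5]
print(expected_regret(means, [0, 1, 1]))  # one pull of the worse arm costs 0.5 - 0.2
print(expected_regret(means, [1, 1, 1]))  # always pulling the best arm costs nothing
```

Each pull of a suboptimal arm adds its gap to the total, so small regret means few pulls of bad arms.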

9 A Naive Strategy
For the first $\tau$ rounds, sample each arm $\tau/K$ times.
For the remainder of the rounds, choose the arm with the best observed empirical reward.
How good is this strategy? How do we set $\tau$?
Let's look at confidence intervals.

10 Our regret
(Exploration rounds) What is our regret for the first $\tau$ rounds?
(Exploitation rounds) What is our regret for the remaining $T - \tau$ rounds?
Our total regret is:
$$\mu^* T - \sum_{t=1}^{T} X_t \le \tau + O\left(\sqrt{\frac{\log(K/\delta)}{\tau/K}}\,(T - \tau)\right)$$
How do we choose $\tau$?
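One way to answer the last question is to balance the exploration cost against the exploitation cost. The derivation below is a sketch that ignores constants and uses $T - \tau \le T$; balancing the two terms is within a constant factor of minimizing their sum.

```latex
\begin{align*}
\text{Regret}(\tau) &\lesssim \tau + T\sqrt{\frac{K\log(K/\delta)}{\tau}} \\
\tau = T\sqrt{\frac{K\log(K/\delta)}{\tau}}
  &\iff \tau^{3/2} = T\sqrt{K\log(K/\delta)}
  \iff \tau = T^{2/3}\bigl(K\log(K/\delta)\bigr)^{1/3} \\
\text{Regret}(\tau) &\lesssim 2\,K^{1/3}\,T^{2/3}\,\bigl(\log(K/\delta)\bigr)^{1/3}
\end{align*}
```

With $\delta = 1/T$ we have $\log(K/\delta) = \log(KT)$, and the event that some confidence interval fails adds at most $\delta T = 1$ to the expected regret. Note that the balanced $\tau$ agrees with $K^{1/3} T^{2/3}$ up to the logarithmic factor.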

11 The Naive Strategy's Regret
Choose $\tau = K^{1/3} T^{2/3}$ and $\delta = 1/T$.
Theorem: Our total (expected) regret is:
$$\mu^* T - E\left[\sum_{t=1}^{T} X_t \,\middle|\, A\right] \le O\left(K^{1/3}\, T^{2/3}\, (\log(KT))^{1/3}\right)$$
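The naive strategy is short enough to sketch directly. The code below is an illustration rather than a reference implementation: the function name `explore_then_commit`, the `pull` callback, and the Bernoulli test arms are assumptions made for the example, and it assumes `tau <= horizon`.

```python
import random


def explore_then_commit(pull, k, horizon, tau):
    """Sample each arm tau // k times, then commit to the best empirical arm.

    `pull(arm)` returns one reward; assumes k <= tau <= horizon.
    """
    per_arm = max(1, tau // k)
    sums = [0.0] * k
    pulls = []
    for arm in range(k):                  # exploration rounds
        for _ in range(per_arm):
            sums[arm] += pull(arm)
            pulls.append(arm)
    # Every arm has the same count, so ranking sums ranks empirical means.
    best = max(range(k), key=lambda a: sums[a])
    while len(pulls) < horizon:           # exploitation rounds
        pull(best)
        pulls.append(best)
    return pulls[:horizon], best


rng = random.Random(0)
true_means = [0.3, 0.7]                   # hypothetical Bernoulli arms
k, horizon = len(true_means), 3000
tau = int(k ** (1 / 3) * horizon ** (2 / 3))
pulls, best = explore_then_commit(
    lambda a: 1.0 if rng.random() < true_means[a] else 0.0, k, horizon, tau)
print(tau, best, pulls.count(0), pulls.count(1))
```

The strategy never revisits its decision after round $\tau$, which is why its regret scales as $T^{2/3}$ rather than $\sqrt{T}$.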

12 Can we be more adaptive?
Are we still pulling arms that we know are sub-optimal? How do we know this?
Let $N_{a,t}$ be the number of times we pulled arm $a$ up to time $t$.
Confidence interval at time $t$: with probability greater than $1 - \delta$,
$$|\hat\mu_{a,t} - \mu_a| \le O\left(\sqrt{\frac{\log(1/\delta)}{N_{a,t}}}\right)$$
With $\delta \to \delta/(TK)$, the above bound will hold for all arms $a \in [K]$ and timesteps $t \le T$.
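For rewards in $[0, 1]$, Hoeffding's inequality gives one concrete form of this interval, with radius $\sqrt{\log(2/\delta) / (2 N_{a,t})}$. The sketch below computes it, along with the union-bound version obtained by replacing $\delta$ with $\delta/(TK)$. The function names are hypothetical, and the slides only state the bound up to constants.

```python
import math


def hoeffding_radius(n_pulls, delta):
    """Two-sided Hoeffding radius for the mean of n_pulls rewards in [0, 1]."""
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n_pulls))


def uniform_radius(n_pulls, delta, k, horizon):
    """Radius that holds for all k arms and all horizon rounds (union bound)."""
    return hoeffding_radius(n_pulls, delta / (k * horizon))


for n in (10, 100, 1000):
    # The radius shrinks like 1 / sqrt(n); the union bound only costs a log factor.
    print(n, hoeffding_radius(n, 0.05), uniform_radius(n, 0.05, k=10, horizon=1000))
```

The union bound inflates the radius by a factor of roughly $\sqrt{\log(TK)}$, which is where the $\log(KT/\delta)$ term in the UCB index comes from.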

13 Upper Confidence Bound (UCB) Algorithm
At each time $t$:
Pull arm:
$$a_t = \arg\max_a\ \hat\mu_{a,t} + c\sqrt{\frac{\log(KT/\delta)}{N_{a,t}}} := \arg\max_a\ \hat\mu_{a,t} + \mathrm{ConfBound}_{a,t}$$
(where $c \le 10$ is a constant).
Observe reward $X_t$.
Update $\hat\mu_{a,t}$, $N_{a,t}$, and $\mathrm{ConfBound}_{a,t}$.
How well does this do?
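The UCB loop can be sketched as follows. Several details here are assumptions of the sketch and not part of the slide: each arm is pulled once at the start so that $N_{a,t} > 0$, the constant defaults to `c=1.0`, and the test arms are Bernoulli. The names `ucb` and `pull` are hypothetical.

```python
import math
import random


def ucb(pull, k, horizon, delta, c=1.0):
    """Run UCB for `horizon` rounds (assumes horizon >= k); return counts and means."""
    counts = [0] * k                      # N_{a,t}
    means = [0.0] * k                     # empirical means mu-hat_{a,t}
    log_term = math.log(k * horizon / delta)
    for t in range(horizon):
        if t < k:
            arm = t                       # initial pass: pull every arm once
        else:
            arm = max(
                range(k),
                key=lambda a: means[a] + c * math.sqrt(log_term / counts[a]),
            )
        x = pull(arm)
        counts[arm] += 1
        means[arm] += (x - means[arm]) / counts[arm]   # running average update
    return counts, means


rng = random.Random(0)
true_means = [0.1, 0.9]                   # hypothetical Bernoulli arms
counts, means = ucb(
    lambda a: 1.0 if rng.random() < true_means[a] else 0.0,
    k=2, horizon=2000, delta=1 / 2000)
print(counts)  # most pulls should go to arm 1
```

An arm's bonus shrinks as it is pulled, so a suboptimal arm stops being chosen once its upper bound falls below that of the best arm.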

14 Today

15 Instantaneous Regret
With probability greater than $1 - \delta$, all the confidence bounds will hold.
Question: If $\hat\mu_{a,t} + \mathrm{ConfBound}_{a,t} \le \mu^*$, could UCB pull arm $a$ at time $t$?
Question: If we pull arm $a$ at time $t$, how much regret do we pay? That is,
$$\mu^* - \mu_{a_t} \le\ ??$$

16 Total Regret
Theorem: The total (expected) regret of UCB is:
$$\mu^* T - E\left[\sum_{t=1}^{T} X_t \,\middle|\, A\right] \le O\left(\sqrt{KT \log(KT)}\right)$$
This is better than the naive strategy.
Up to log factors, it is optimal.
Practical algorithm?

17 Simulation

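18 Proof Idea: for K = 2
Suppose arm $a = 2$ is not optimal.
Claim 1: All confidence intervals will be valid (with Pr $\ge 1 - \delta$).
Claim 2: If we pull arm $a = 1$, then no regret.
Claim 3: If we pull $a = 2$, then we pay at most $2C_{a,t}$ regret.
To see this:
$$\hat\mu_{a,t} + C_{a,t} \ge \hat\mu_{1,t} + C_{1,t} \ge \mu^* \quad \text{Why?}$$
$$\mu_a \ge \hat\mu_{a,t} - C_{a,t} \quad \text{Why?}$$
Subtracting the second line from the first gives $\mu^* - \mu_a \le 2C_{a,t}$.
The total regret is (up to constant and logarithmic factors):
$$\sum_t C_{a,t} \lesssim \sum_t \frac{1}{\sqrt{N_{a,t}}}$$
Note that $N_{a,t} \le T$ (and increasing).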
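The final sum can be bounded explicitly. The sketch below assumes the convention that $N_{a,t}$ counts pulls of arm $a$ up to and including round $t$, so that over the rounds in which $a$ is pulled it takes the values $1, 2, \ldots, N_{a,T}$.

```latex
\begin{align*}
\sum_{t \,:\, a_t = a} \frac{1}{\sqrt{N_{a,t}}}
  \le \sum_{n=1}^{N_{a,T}} \frac{1}{\sqrt{n}}
  \le 1 + \int_{1}^{N_{a,T}} \frac{dx}{\sqrt{x}}
  = 2\sqrt{N_{a,T}} - 1
  \le 2\sqrt{T}
\end{align*}
```

Multiplying by the $\sqrt{\log(KT/\delta)}$ factor inside $C_{a,t}$ recovers a bound of order $\sqrt{T \log(KT/\delta)}$ for $K = 2$.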

19 Aside: Logarithmic regret
The previous rates are not a function of problem-dependent parameters.
On any given problem, we expect to eventually start pulling the best arm.
Define the gap as:
$$\Delta = \mu^* - \max_{a \ne a^*} \mu_a$$
where $a^*$ is an optimal arm.
Theorem: The total (expected) regret of UCB is:
$$\mu^* T - E\left[\sum_{t=1}^{T} X_t \,\middle|\, A\right] \lesssim \frac{K}{\Delta} \log(T)$$
(The same algorithm enjoys this bound.)
Question: How is the naive algorithm different?

20 Thompson sampling
Practical issues: how do we obtain good confidence intervals? Are there variants with similar performance?
Suppose we are Bayesian. We have a posterior distribution:
$$\Pr(\mu_a \mid \mathrm{History}_{<t})$$
Thompson sampling:
Sample from each posterior:
$$\nu_a \sim \Pr(\mu_a \mid \mathrm{History}_{<t})$$
Take action:
$$a_t = \arg\max_a \nu_a$$
Update posteriors.
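For 0/1 rewards, the three steps above have a simple closed form, since a Beta prior stays Beta after each Bernoulli observation. The sketch below assumes Bernoulli rewards and a uniform Beta(1, 1) prior on every arm; both are choices made for this illustration and are not specified on the slide. The function name `thompson_bernoulli` is hypothetical.

```python
import random


def thompson_bernoulli(pull, k, horizon, seed=0):
    """Thompson sampling with Beta posteriors; `pull(arm)` must return 0.0 or 1.0."""
    rng = random.Random(seed)
    alpha = [1.0] * k   # 1 + observed successes (Beta(1, 1) prior is an assumption)
    beta = [1.0] * k    # 1 + observed failures
    counts = [0] * k
    for _ in range(horizon):
        # Sample nu_a from each arm's posterior, then act greedily on the samples.
        samples = [rng.betavariate(alpha[a], beta[a]) for a in range(k)]
        arm = max(range(k), key=lambda a: samples[a])
        x = pull(arm)
        alpha[arm] += x           # posterior update for the pulled arm only
        beta[arm] += 1.0 - x
        counts[arm] += 1
    return counts


env_rng = random.Random(1)
true_means = [0.1, 0.9]           # hypothetical Bernoulli arms
counts = thompson_bernoulli(
    lambda a: 1.0 if env_rng.random() < true_means[a] else 0.0, k=2, horizon=2000)
print(counts)  # most pulls should go to arm 1
```

Exploration comes from posterior randomness: an arm with few pulls has a wide posterior, so its sample is occasionally the largest even when its empirical mean is low.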

21 Thompson sampling and Confidence intervals

22 Acknowledgements
JourneeCOSdec2015-Kaufman.pdf


Bandit models: a tutorial
Gdt COS, December 3rd, 2015
Multi-Armed Bandit model: general setting
K arms: for $a \in \{1, \ldots, K\}$, $(X_{a,t})_{t \in \mathbb{N}}$ is a stochastic process (unknown distributions).
Bandit game: at each round $t$, an agent chooses