Bandit Algorithms for Pure Exploration: Best Arm Identification and Game Tree Search. Wouter M. Koolen

Size: px

Start display at page:

Download "Bandit Algorithms for Pure Exploration: Best Arm Identification and Game Tree Search. Wouter M. Koolen"

Marshall Gardner
5 years ago
Views:

1 Bandit Algorithms for Pure Exploration: Best Arm Identification and Game Tree Search Wouter M. Koolen Machine Learning and Statistics for Structures Friday 23 rd February, 2018

2 Outline 1 Intro 2 Model 3 Statistician s toolbox 4 Sample Complexity Lower Bound 5 Algorithms 6 Extensions

3 Question Complexity of interactive learning.

4 Question Complexity of interactive learning. Medical testing [Villar et al., 2015] Online advertising and website optimisation [Zhou et al., 2014] Monte Carlo planning [Grill et al., 2016], and Game-playing AI [Silver et al., 2016]

5 Pure Exploration Query: Which is... the most effective drug dose? the most appealing website layout? the safest next robot action?

6 Pure Exploration Query: Which is... Method the most effective drug dose? the most appealing website layout? the safest next robot action? statistical experiments in physical or simulated environment, interactively and adaptively.

7 Pure Exploration Query: Which is... Method the most effective drug dose? the most appealing website layout? the safest next robot action? statistical experiments in physical or simulated environment, interactively and adaptively. Main scientific questions: sample complexity of interactive learning # experiments as function of query structure and environment Design of efficient pure exploration systems

8 Outline 1 Intro 2 Model 3 Statistician s toolbox 4 Sample Complexity Lower Bound 5 Algorithms 6 Extensions

9 Formal model Environment (Multi-armed bandit model) K distributions ν 1,..., ν K with means µ 1,..., µ K. Best arm i = argmax µ i i [K]

10 Formal model Environment (Multi-armed bandit model) K distributions ν 1,..., ν K with means µ 1,..., µ K. Best arm i = argmax µ i i [K] Strategy Stopping rule τ N In round t τ sampling rule picks I t [K]. See X t ν It. Recommendation rule Î [K].

11 Formal model Environment (Multi-armed bandit model) K distributions ν 1,..., ν K with means µ 1,..., µ K. Best arm i = argmax µ i i [K] Strategy Stopping rule τ N In round t τ sampling rule picks I t [K]. See X t ν It. Recommendation rule Î [K]. Realisation of interaction: (I 1, X 1 ),..., (I τ, X τ ), Î.

12 Formal model Environment (Multi-armed bandit model) K distributions ν 1,..., ν K with means µ 1,..., µ K. Best arm i = argmax µ i i [K] Strategy Stopping rule τ N In round t τ sampling rule picks I t [K]. See X t ν It. Recommendation rule Î [K]. Realisation of interaction: (I 1, X 1 ),..., (I τ, X τ ), Î. Two objectives: sample efficiency τ and correctness Î = i.

13 Objective Two main flavours: fixed budget : fix τ = T, optimise P(Î = i ) fixed confidence : fix P(Î = i ) δ, optimise E[τ].

14 Objective Two main flavours: fixed budget : fix τ = T, optimise P(Î = i ) fixed confidence : fix P(Î = i ) δ, optimise E[τ]. Notation N i (t) = t s=1 1 {I s = i} and ˆµ i (t) = 1 N i (t) t X s 1 {I s = i} s=1

15 Outline 1 Intro 2 Model 3 Statistician s toolbox 4 Sample Complexity Lower Bound 5 Algorithms 6 Extensions

16 A statistician s view Active Sequential Multiple Composite Hypothesis Testing H i = {ν i (ν) = i} i [K] (Frequentist) uniform type 1 error control.

17 Families of approaches to BAI Upper and Lower confidence bounds [Bubeck et al., 2011, Kalyanakrishnan et al., 2012, Gabillon et al., 2012, Kaufmann and Kalyanakrishnan, 2013, Jamieson et al., 2014], Racing or Successive Rejects/Eliminations [Maron and Moore, 1997, Even-Dar et al., 2006, Audibert et al., 2010, Kaufmann and Kalyanakrishnan, 2013, Karnin et al., 2013], Thompson Sampling [Russo, 2016] (hemidemisemibayesian) Track-and-Stop [Garivier and Kaufmann, 2016].

18 Outline 1 Intro 2 Model 3 Statistician s toolbox 4 Sample Complexity Lower Bound 5 Algorithms 6 Extensions

19 Change of Measure If the observations are likely under both ν and λ, yet i (ν) i (λ), then the algorithm cannot stop and be correct. Theorem (Kaufmann, Cappé, and Garivier [2016]) For bandit models µ and λ, stopping time τ, and event E F τ, K E [N i (τ)] KL(ν i λ i ) d (P ν (E), P λ (E)) ν i=1

20 Change of Measure If the observations are likely under both ν and λ, yet i (ν) i (λ), then the algorithm cannot stop and be correct. Theorem (Kaufmann, Cappé, and Garivier [2016]) For bandit models µ and λ, stopping time τ, and event E F τ, K E [N i (τ)] KL(ν i λ i ) d (P ν (E), P λ (E)) ν i=1 Theorem (Garivier and Kaufmann [2016]) Let µ and λ be bandit models with i (ν) i (λ). Then for any δ-correct algorithm K E [N i (τ)] KL(ν i λ i ) d(δ, 1 δ) ν i=1

21 Sample Complexity Consequence Starting point: hence so E ν [τ] K i=1 E ν [τ] max w K E ν [N i (τ)] E ν [τ] min λ Alt(ν) KL(ν i λ i ) d(δ, 1 δ) K w i KL(ν i λ i ) d(δ, 1 δ) i=1 Theorem (Garivier and Kaufmann [2016]) E ν [τ] T (ν)d(δ, 1 δ) where 1 T (ν) = max min w K λ Alt(ν) K w i KL(ν i λ i ) i=1

22 Example K = 5 arms, µ = (.3,.4,.5,.6,.7). Bernoulli T (µ) = Gaussian (σ 2 = 1) T (µ) = 893.5

23 Outline 1 Intro 2 Model 3 Statistician s toolbox 4 Sample Complexity Lower Bound 5 Algorithms 6 Extensions

24 Algorithms Sampling rule? Stopping rule? Recommendation rule? Î = argmax i [K] ˆµ i (τ)

25 Sampling Rule Look at the lower bound again. Any good algorithm must sample with optimal proportions w (ν) = argmax w K min λ Alt(ν) K w i KL(ν i λ i ) i=1

26 Sampling Rule Look at the lower bound again. Any good algorithm must sample with optimal proportions w (ν) = argmax w K min λ Alt(ν) K w i KL(ν i λ i ) i=1 Idea: draw I t w ( ˆµ(t)). Ensure N i (t)/t wi by forced exploration Draw arm with N i (t)/t below wi (tracking) Computation

27 Stopping Rule When do we have enough evidence to stop? Generalized Likelihood Ratio Test (GLRT) P ˆµ(t) (data) Z t = ln max λ Alt( ˆµ(t)) P λ (data)

28 Stopping Rule When do we have enough evidence to stop? Generalized Likelihood Ratio Test (GLRT) P ˆµ(t) (data) Z t = ln max λ Alt( ˆµ(t)) P λ (data) Turns out, GLRT statistic equals Z t = min λ Alt( ˆµ(t)) K N i (t) KL(ˆµ i (t) λ i ) i=1 i.e. lower bound with ˆµ(t) plug-in.

29 Stopping Rule When do we have enough evidence to stop? Generalized Likelihood Ratio Test (GLRT) P ˆµ(t) (data) Z t = ln max λ Alt( ˆµ(t)) P λ (data) Turns out, GLRT statistic equals Z t = min λ Alt( ˆµ(t)) K N i (t) KL(ˆµ i (t) λ i ) i=1 i.e. lower bound with ˆµ(t) plug-in. Roughly: stop when Z t ln 1 δ. Make precise with careful universal coding (MDL) argument.

30 All in all Final result: for Track-and-Stop algorithms E µ [τ] lim sup δ 0 ln 1 δ = T (µ) Very similar optimality result for Top Two Thompson Sampling by Russo [2016]

31 Outline 1 Intro 2 Model 3 Statistician s toolbox 4 Sample Complexity Lower Bound 5 Algorithms 6 Extensions

32 Beyond asymptotic bounds Okay, so good algorithms have E µ [τ] T (µ) ln 1 δ + small. What about lower-order terms? Moderate confidence regime! Dependence on ln ln 1 δ Dependence on ln K. [Simchowitz et al., 2017, Chen et al., 2017b]

33 Beyond Best Arm Practical and fundamental question: solving more complex pure exploration problems. : Pure Exploration Problems general Monte Carlo Tree Search Maximin Action Identification (max-min)* Combinatorial Pure Exploration BAI max max-min max-sum Fully Dense Fixed Sparsity Variable Sparsity

34 Combinatorial Pure Exploration Best k-set Shortest path Spanning tree...

35 Combinatorial Pure Exploration Best k-set Shortest path Spanning tree... Combinatorial collection F of subsets of [K]. i = i (ν) = argmax S F i S [Chen et al., 2017a] Track-and-stop-like algorithms. Can compute lower bound. Dense. µ i.

36 Game Tree Search max 1 2 min min µ 1,1 µ 1,2 µ 2,1 µ 2,2 Goal: find maximin action i := arg max i min µ i,j j

37 Sparsity in the Lower Bound (depth 2) Kaufmann and Koolen [2017] Sparsity Pattern max min Oracle weights w supported on only 7 of the 13 leaves. Open problem: algorithms incorporating appropriate pruning?

38 Sparsity in the Lower Bound (depth 3) max min max optimal weights µ 1 µ 2 µ 3 µ w 1 w 2 w 3 w 4 oracle weights w = (w 1, w 2, w 3, w 4 ) as a function of µ 3 for µ = (16, 5, µ 3, 0) It s complicated. But not intractable.

39 Conclusion Pure Exploration is an interesting, hot, promising area. : Pure Exploration Problems general Monte Carlo Tree Search Maximin Action Identification (max-min)* Combinatorial Pure Exploration BAI max max-min max-sum Fully Dense Fixed Sparsity Variable Sparsity

On the Complexity of Best Arm Identification with Fixed Confidence

On the Complexity of Best Arm Identification with Fixed Confidence Discrete Optimization with Noise Aurélien Garivier, Emilie Kaufmann COLT, June 23 th 2016, New York Institut de Mathématiques de Toulouse