Monte-Carlo Tree Search by Best Arm Identification


1 Monte-Carlo Tree Search by Best Arm Identification. Emilie Kaufmann (Inria Lille, SequeL team) and Wouter M. Koolen (CWI, Machine Learning Group). Inria-CWI workshop, Amsterdam, September 20th, 2017.

2 Part of a new Associate Team proposal, 6PAC, involving Peter Grünwald (CWI, Machine Learning Group), Wouter M. Koolen (CWI, Machine Learning Group), Benjamin Guedj (Inria Lille, MODAL project-team) and Emilie Kaufmann (Inria Lille, SequeL project-team). Broader goal: Probably Approximately Correct Learning that is Safe, Efficient, Sequential, Active, Structured, Ideal (the 6 in 6PAC).

3 Monte-Carlo Tree Search for games

4 Monte-Carlo Tree Search for games. We introduce an idealized model: a fixed maximin tree with i.i.d. playouts starting from each leaf, and propose new algorithms with sample complexity guarantees.

5 Outline 1 Problem formulation 2 The BAI-MCTS architecture 3 UGapE-MCTS and LUCB-MCTS 4 Towards optimal algorithms

6 Outline 1 Problem formulation 2 The BAI-MCTS architecture 3 UGapE-MCTS and LUCB-MCTS 4 Towards optimal algorithms

7 A simple model for MCTS. A fixed MAXMIN game tree $\mathcal{T}$, with leaves $\mathcal{L}$. MAX node (= your move), MIN node (= adversary move). Leaf $\ell$: stochastic oracle $\mathcal{O}_\ell$ that evaluates the position.

8 A simple model for MCTS. At round $t$ an MCTS algorithm: picks a path down to a leaf $L_t$, gets an evaluation of this leaf $X_t \sim \mathcal{O}_{L_t}$. Assumption: i.i.d. successive evaluations, $\mathbb{E}_{X \sim \mathcal{O}_\ell}[X] = \mu_\ell$.

9 A simple model for MCTS. [Figure: maximin tree with leaf means $\mu_1, \dots, \mu_8$.] At round $t$ an MCTS algorithm: picks a path down to a leaf $L_t$, gets an evaluation of this leaf $X_t \sim \mathcal{O}_{L_t}$. Assumption: i.i.d. successive evaluations, $\mathbb{E}_{X \sim \mathcal{O}_\ell}[X] = \mu_\ell$.

10 Goal. [Figure: tree rooted at $s_0$ with leaf means $\mu_1, \dots, \mu_8$.] An MCTS algorithm should find the best move at the root: $$V_s = \begin{cases} \mu_s & \text{if } s \in \mathcal{L}, \\ \max_{c \in \mathcal{C}(s)} V_c & \text{if } s \text{ is a MAX node}, \\ \min_{c \in \mathcal{C}(s)} V_c & \text{if } s \text{ is a MIN node}, \end{cases} \qquad s^* = \operatorname*{argmax}_{s \in \mathcal{C}(s_0)} V_s.$$
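To make the value recursion concrete, here is a minimal sketch (not from the slides); the Node class, the tree shape and the leaf means are illustrative assumptions, not the authors' benchmark.

```python
# Minimal sketch (not from the talk): computing the maximin values V_s bottom-up.
# The Node class, tree shape and leaf means below are illustrative assumptions.

class Node:
    def __init__(self, is_max=True, children=None, mu=None):
        self.is_max = is_max          # True for a MAX node, False for a MIN node
        self.children = children or []
        self.mu = mu                  # leaf mean mu_l (leaves only)

def value(node):
    """Return V_s: the leaf mean at a leaf, else the max/min over children."""
    if not node.children:
        return node.mu
    child_values = [value(c) for c in node.children]
    return max(child_values) if node.is_max else min(child_values)

# Depth-two example: a MAX root whose children are MIN nodes.
root = Node(is_max=True, children=[
    Node(is_max=False, children=[Node(mu=0.45), Node(mu=0.50)]),
    Node(is_max=False, children=[Node(mu=0.35), Node(mu=0.60)]),
    Node(is_max=False, children=[Node(mu=0.30), Node(mu=0.40)]),
])

root_action_values = [value(c) for c in root.children]
s_star = max(range(len(root_action_values)), key=root_action_values.__getitem__)
print(root_action_values, "-> best move at the root:", s_star)  # [0.45, 0.35, 0.3] -> 0
```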

11 A PAC learning framework. [Figure: tree rooted at $s_0$ with leaf means $\mu_1, \dots, \mu_8$.] MCTS algorithm: $(L_t, \tau, \hat{s}_\tau)$, where $L_t$ is the sampling rule, $\tau$ is the stopping rule, and $\hat{s}_\tau \in \mathcal{C}(s_0)$ is the recommendation rule. It is $(\epsilon, \delta)$-PAC if $\mathbb{P}\left(V_{\hat{s}_\tau} \geq V_{s^*} - \epsilon\right) \geq 1 - \delta$. Goal: an $(\epsilon, \delta)$-PAC algorithm with a small sample complexity $\tau$.

12 A simpler problem: best arm identification. Reminiscent of a bandit model: [Figure: bandit with arm means $\mu_1, \dots, \mu_8$.] A Best Arm Identification algorithm: $(A_t, \tau, \hat{s}_\tau)$, where $A_t$ is the sampling rule, $\tau$ is the stopping rule, and $\hat{s}_\tau$ is the recommendation rule. It is $(\epsilon, \delta)$-PAC if $\mathbb{P}\left(\mu_{\hat{s}_\tau} \geq \mu^* - \epsilon\right) \geq 1 - \delta$.

13 A simpler problem: best arm identification. Reminiscent of a bandit model: [Figure: bandit with arm means $\mu_1, \dots, \mu_8$.] A Best Arm Identification algorithm: $(A_t, \tau, \hat{s}_\tau)$, where $A_t$ is the sampling rule, $\tau$ is the stopping rule, and $\hat{s}_\tau$ is the recommendation rule. It is $(\epsilon, \delta)$-PAC if $\mathbb{P}\left(\mu_{\hat{s}_\tau} \geq \mu^* - \epsilon\right) \geq 1 - \delta$. The BAI problem: how to adaptively sample the arms so as to identify as quickly as possible the arm with highest mean?

14 MCTS: a structured BAI problem. Reminiscent of a bandit model: [Figure: maximin tree with leaf means $\mu_1, \dots, \mu_8$.] A Best Arm Identification algorithm: $(L_t, \tau, \hat{s}_\tau)$, where $L_t$ is the sampling rule, $\tau$ is the stopping rule, and $\hat{s}_\tau \in \mathcal{C}(s_0)$ is the recommendation rule. It is $(\epsilon, \delta)$-PAC if $\mathbb{P}\left(V_{\hat{s}_\tau} \geq V_{s^*} - \epsilon\right) \geq 1 - \delta$. The MCTS problem: how to adaptively sample the leaves of a maxmin tree so as to identify as quickly as possible the best action at the root?

15 Outline 1 Problem formulation 2 The BAI-MCTS architecture 3 UGapE-MCTS and LUCB-MCTS 4 Towards optimal algorithms

16 A key building block: confidence intervals. Using the samples collected for the leaves, one can build, for $\ell \in \mathcal{L}$, a confidence interval $[\mathrm{LCB}_\ell(t), \mathrm{UCB}_\ell(t)]$ on $\mu_\ell$. [Figure: tree rooted at $s_0$ with leaf means $\mu_1, \dots, \mu_8$.]

17 A key building block: confidence intervals. Using the samples collected for the leaves, one can build, for $\ell \in \mathcal{L}$, a confidence interval $[\mathrm{LCB}_\ell(t), \mathrm{UCB}_\ell(t)]$ on $\mu_\ell$. Idea: propagate these confidence intervals up in the tree.

18 A key building block: confidence intervals. MAX node: $\mathrm{UCB}_s(t) = \max_{c \in \mathcal{C}(s)} \mathrm{UCB}_c(t)$, $\mathrm{LCB}_s(t) = \max_{c \in \mathcal{C}(s)} \mathrm{LCB}_c(t)$.

19 A key building block: confidence intervals. MAX node: $\mathrm{UCB}_s(t) = \max_{c \in \mathcal{C}(s)} \mathrm{UCB}_c(t)$, $\mathrm{LCB}_s(t) = \max_{c \in \mathcal{C}(s)} \mathrm{LCB}_c(t)$.

20 A key building block: confidence intervals. MIN node: $\mathrm{UCB}_s(t) = \min_{c \in \mathcal{C}(s)} \mathrm{UCB}_c(t)$, $\mathrm{LCB}_s(t) = \min_{c \in \mathcal{C}(s)} \mathrm{LCB}_c(t)$.

21 Property of this construction: $\big(\forall \ell \in \mathcal{L},\ \mu_\ell \in \mathcal{I}_\ell(t)\big) \Rightarrow \big(\forall s \in \mathcal{T},\ V_s \in \mathcal{I}_s(t)\big)$.

22 Representative leaves. $\ell_s(t)$: representative leaf of internal node $s \in \mathcal{T}$. Idea: alternate optimistic/pessimistic moves starting from $s$.
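A possible implementation of the interval propagation (slides 18-20) and of the representative leaf (slide 22) is sketched below. The data structures and field names are assumptions; "alternate optimistic/pessimistic moves" is read here as: follow the child with largest UCB at a MAX node and the child with smallest LCB at a MIN node.

```python
# Sketch (assumed data structures, not the authors' code): propagating the leaf
# confidence intervals up a maximin tree, and selecting a representative leaf.

class Node:
    def __init__(self, is_max=True, children=None, lcb=0.0, ucb=1.0):
        self.is_max = is_max            # True: MAX node, False: MIN node
        self.children = children or []
        self.lcb, self.ucb = lcb, ucb   # for leaves: current [LCB_l(t), UCB_l(t)]

def propagate(node):
    """Fill node.lcb / node.ucb of internal nodes from the leaf intervals:
    max of both bounds at a MAX node, min of both bounds at a MIN node."""
    if not node.children:
        return node.lcb, node.ucb
    bounds = [propagate(c) for c in node.children]
    agg = max if node.is_max else min
    node.lcb = agg(b[0] for b in bounds)
    node.ucb = agg(b[1] for b in bounds)
    return node.lcb, node.ucb

def representative_leaf(node):
    """Follow optimistic moves (largest UCB) at MAX nodes and pessimistic
    moves (smallest LCB) at MIN nodes, down to a leaf."""
    while node.children:
        if node.is_max:
            node = max(node.children, key=lambda c: c.ucb)   # optimistic move
        else:
            node = min(node.children, key=lambda c: c.lcb)   # pessimistic move
    return node
```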

23 Generic BAI-MCTS algorithm
Input: a BAI algorithm
Initialization: t = 0.
while not BAIStop({s ∈ C(s_0)}) do
    R_{t+1} = BAIStep({s ∈ C(s_0)})
    Sample the representative leaf L_{t+1} = ℓ_{R_{t+1}}(t)
    Update the information about the arms.
    t = t + 1.
end
Output: BAIReco({s ∈ C(s_0)})

24 Generic BAI-MCTS algorithm
Input: a BAI algorithm
Initialization: t = 0.
while not BAIStop({s ∈ C(s_0)}) do
    R_{t+1} = BAIStep({s ∈ C(s_0)})
    Sample the representative leaf L_{t+1} = ℓ_{R_{t+1}}(t)
    Update the information about the arms.
    t = t + 1.
end
Output: BAIReco({s ∈ C(s_0)})
... sometimes reduces to updating confidence intervals!
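In Python, the generic loop could be organised as follows; BAIStop, BAIStep and BAIReco are passed in as functions of the depth-one nodes, and propagate / representative_leaf are the helpers from the earlier sketch. This is a sketch of the architecture, not the authors' code.

```python
# Sketch of the generic BAI-MCTS loop. bai_stop / bai_step / bai_reco stand for
# the BAIStop / BAIStep / BAIReco routines of the plugged-in BAI algorithm;
# propagate() and representative_leaf() are the helpers sketched earlier;
# sample_leaf() queries the leaf oracle and update() refreshes its interval.

def bai_mcts(root, bai_stop, bai_step, bai_reco, sample_leaf, update):
    depth_one = root.children              # candidate moves {s in C(s_0)}
    t = 0
    while not bai_stop(depth_one, t):
        r = bai_step(depth_one, t)         # promising depth-one node R_{t+1}
        leaf = representative_leaf(r)      # its representative leaf l_{R_{t+1}}(t)
        x = sample_leaf(leaf)              # one i.i.d. evaluation from the leaf oracle
        update(leaf, x)                    # update the empirical mean and [LCB, UCB]
        propagate(root)                    # refresh the intervals of internal nodes
        t += 1
    return bai_reco(depth_one, t)
```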

25 Outline 1 Problem formulation 2 The BAI-MCTS architecture 3 UGapE-MCTS and LUCB-MCTS 4 Towards optimal algorithms

26 An example of a BAI algorithm: LUCB. The (KL)-LUCB algorithm [Kalyanakrishnan et al. 12, Kaufmann and Kalyanakrishnan 13].

27 UGapE-MCTS, based on the UGapE algorithm [Gabillon et al. 12].
Sampling rule: $R_{t+1}$ is the least sampled among two promising depth-one nodes, $a_t$ and $b_t$, where $$a_t = \operatorname*{argmin}_{a \in \mathcal{C}(s_0)} B_a(t), \qquad B_s(t) = \max_{s' \in \mathcal{C}(s_0) \setminus \{s\}} \mathrm{UCB}_{s'}(t) - \mathrm{LCB}_s(t), \qquad b_t = \operatorname*{argmax}_{b \in \mathcal{C}(s_0) \setminus \{a_t\}} \mathrm{UCB}_b(t).$$
Stopping rule: $\tau = \inf\left\{ t \in \mathbb{N} : \mathrm{UCB}_{b_t}(t) - \mathrm{LCB}_{a_t}(t) < \epsilon \right\}$.
Recommendation rule: $\hat{s}_\tau = a_\tau$.
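At the root, the two rules can be written down directly from the formulas above; the sketch below assumes each depth-one node s carries up-to-date propagated bounds s.lcb, s.ucb and a count s.n of how often its representative leaf has been sampled (field names are mine, not the authors').

```python
# Sketch of the UGapE-MCTS rules at the root (not the authors' implementation).
# Each depth-one node s is assumed to expose s.lcb, s.ucb (propagated bounds)
# and s.n (number of samples routed through s so far).

def ugape_step(depth_one):
    """Return (a_t, b_t, R_{t+1}): candidate best move, challenger,
    and the least sampled of the two."""
    def B(s):   # B_s(t) = max_{s' != s} UCB_{s'}(t) - LCB_s(t)
        return max(o.ucb for o in depth_one if o is not s) - s.lcb
    a_t = min(depth_one, key=B)
    b_t = max((s for s in depth_one if s is not a_t), key=lambda s: s.ucb)
    return a_t, b_t, min((a_t, b_t), key=lambda s: s.n)

def ugape_stop(depth_one, eps):
    """Stop when UCB_{b_t}(t) - LCB_{a_t}(t) < eps; then recommend a_t."""
    a_t, b_t, _ = ugape_step(depth_one)
    return b_t.ucb - a_t.lcb < eps, a_t
```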

28 Theoretical guarantees. We choose confidence intervals of the form $$\mathrm{LCB}_\ell(t) = \hat{\mu}_\ell(t) - \sqrt{\frac{\beta(N_\ell(t), \delta)}{2 N_\ell(t)}}, \qquad \mathrm{UCB}_\ell(t) = \hat{\mu}_\ell(t) + \sqrt{\frac{\beta(N_\ell(t), \delta)}{2 N_\ell(t)}},$$ where $\beta(s, \delta)$ is some exploration function. Correctness: if $\delta \leq \max(0.1|\mathcal{L}|, 1)$, then for the choice $\beta(s, \delta) = \log(|\mathcal{L}|/\delta) + 3 \log\log(|\mathcal{L}|/\delta) + (3/2)\log(\log s + 1)$, UGapE-MCTS is $(\epsilon, \delta)$-PAC.
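As a quick numerical illustration, the exploration function and the resulting interval can be evaluated directly; the code below just transcribes the two formulas above (natural logarithm assumed for "log", example numbers arbitrary).

```python
# Sketch: the exploration function beta(s, delta) and the resulting leaf
# confidence interval, transcribing the two formulas above (natural log assumed).
import math

def beta(s, delta, n_leaves):
    r = math.log(n_leaves / delta)           # log(|L| / delta)
    return r + 3 * math.log(r) + 1.5 * math.log(math.log(s) + 1)

def confidence_interval(mu_hat, n_samples, delta, n_leaves):
    half_width = math.sqrt(beta(n_samples, delta, n_leaves) / (2 * n_samples))
    return mu_hat - half_width, mu_hat + half_width

# Example with arbitrary numbers: 8 leaves, 200 samples of one leaf, delta = 0.1.
print(confidence_interval(mu_hat=0.6, n_samples=200, delta=0.1, n_leaves=8))
```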

29 Theoretical guarantees. Define $$\Delta_* := V_{s^*} - V_{s_2^*}, \qquad \Delta_\ell := \max_{s \in \mathrm{Ancestors}(\ell) \setminus \{s_0\}} \big| V_{\mathrm{Parent}(s)} - V_s \big|, \qquad H_\epsilon(\mu) := \sum_{\ell \in \mathcal{L}} \frac{1}{\Delta_\ell^2 \vee \Delta_*^2 \vee \epsilon^2}.$$ Sample complexity: with probability larger than $1 - \delta$, the total number of leaf explorations performed by UGapE-MCTS is upper bounded as $$\tau = O\left( H_\epsilon(\mu) \log\frac{1}{\delta} \right).$$

30 Theoretical guarantees. Recall the quantities $$\Delta_* := V_{s^*} - V_{s_2^*}, \qquad \Delta_\ell := \max_{s \in \mathrm{Ancestors}(\ell) \setminus \{s_0\}} \big| V_{\mathrm{Parent}(s)} - V_s \big|, \qquad H_\epsilon(\mu) := \sum_{\ell \in \mathcal{L}} \frac{1}{\Delta_\ell^2 \vee \Delta_*^2 \vee \epsilon^2}.$$
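For a concrete depth-two instance the complexity term can be computed directly from the leaf means; the sketch below uses the small illustrative tree from the earlier value-computation sketch and reads Ancestors(ℓ) as including the leaf ℓ itself (both choices are assumptions on my part, not from the talk).

```python
# Sketch: evaluating H_eps(mu) for a depth-two maximin tree (MAX root, MIN
# depth-one nodes). Leaf (i, j) is child j of depth-one node i. The leaf means
# and the convention that Ancestors(l) contains l itself are assumptions.

mu = [[0.45, 0.50], [0.35, 0.60], [0.30, 0.40]]   # illustrative leaf means mu[i][j]

V1 = [min(row) for row in mu]                     # values of the depth-one MIN nodes
V0 = max(V1)                                      # value of the root
delta_star = V0 - sorted(V1, reverse=True)[1]     # V(s*) - V(s*_2)

def H_eps(eps):
    total = 0.0
    for i, row in enumerate(mu):
        for m in row:
            delta_l = max(abs(V0 - V1[i]),        # gap at the depth-one ancestor
                          abs(V1[i] - m))         # gap at the leaf itself
            total += 1.0 / max(delta_l ** 2, delta_star ** 2, eps ** 2)
    return total

print(H_eps(eps=0.0))
```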

31 Numerical results. $\epsilon = 0$, $\delta =$ ($N = 10^6$ simulations). LUCB-MCTS (0.72% errors, 1551 samples), UGapE-MCTS (0.75% errors, 1584 samples), FindTopWinner (0% errors, samples) [Teraoka et al. 14].

32 Outline 1 Problem formulation 2 The BAI-MCTS architecture 3 UGapE-MCTS and LUCB-MCTS 4 Towards optimal algorithms

33 A sample complexity lower bound. Theorem: let $\epsilon = 0$. Any $\delta$-correct algorithm satisfies $$\mathbb{E}_\mu[\tau] \geq T^*(\mu) \log\left(\frac{1}{3\delta}\right), \quad \text{where } T^*(\mu)^{-1} := \sup_{w \in \Sigma_{\mathcal{L}}} \inf_{\lambda \in \mathrm{Alt}(\mu)} \sum_{\ell \in \mathcal{L}} w_\ell\, \mathrm{KL}\big(\mathcal{B}(\mu_\ell), \mathcal{B}(\lambda_\ell)\big).$$ Depth-two tree: the optimal proportions satisfy $w_{i,j}^*(\mu) = 0$ if $i \geq 2$ and $j \geq 2$.

34 A sample complexity lower bound. Theorem: let $\epsilon = 0$. Any $\delta$-correct algorithm satisfies $$\mathbb{E}_\mu[\tau] \geq T^*(\mu) \log\left(\frac{1}{3\delta}\right), \quad \text{where } T^*(\mu)^{-1} := \sup_{w \in \Sigma_{\mathcal{L}}} \inf_{\lambda \in \mathrm{Alt}(\mu)} \sum_{\ell \in \mathcal{L}} w_\ell\, \mathrm{KL}\big(\mathcal{B}(\mu_\ell), \mathcal{B}(\lambda_\ell)\big).$$ Depth-two tree: the optimal proportions satisfy $w_{i,j}^*(\mu) = 0$ if $i \geq 2$ and $j \geq 2$. A more general sparsity pattern?

35 Conclusion. Our contributions: a generic way to use a BAI algorithm for MCTS; PAC and sample complexity guarantees for UGapE-MCTS and LUCB-MCTS, which also display good empirical performance. Future work: identify the optimal sample complexity of the MCTS problem (i.e. matching upper and lower bounds), and that of other structured Best Arm Identification problems [Ajallooeian et al., ALT 17].

36 Conclusion. Our contributions: a generic way to use a BAI algorithm for MCTS; PAC and sample complexity guarantees for UGapE-MCTS and LUCB-MCTS, which also display good empirical performance. Future work: identify the optimal sample complexity of the MCTS problem (i.e. matching upper and lower bounds), and that of other structured Best Arm Identification problems [Ajallooeian et al., ALT 17]. Reference: E. Kaufmann & W.M. Koolen, Monte-Carlo Tree Search by Best Arm Identification, to appear in NIPS 2017.
