On the Complexity of Best Arm Identification with Fixed Confidence


On the Complexity of Best Arm Identification with Fixed Confidence
Discrete Optimization with Noise

Aurélien Garivier (Institut de Mathématiques de Toulouse, LabEx CIMI, Université Paul Sabatier, France) and Emilie Kaufmann (Université Lille, CNRS UMR 9189, Laboratoire CRIStAL, F-59000 Lille, France)

COLT, June 23rd, 2016, New York

The Problem

Best-Arm Identification with Fixed Confidence

K options = probability distributions ν = (ν_a)_{1 ≤ a ≤ K}, where each ν_a belongs to an exponential family F parameterized by its expectation µ_a.

At round t, you may:
- choose an option A_t = φ_t(A_1, X_1, ..., A_{t-1}, X_{t-1}) ∈ {1, ..., K};
- observe a new independent sample X_t ~ ν_{A_t};

so as to identify the best arm a* = argmax_a µ_a (with µ* = max_a µ_a) as fast as possible: stopping time τ_δ.

Fixed-budget setting: given τ = T, minimize P(â_τ ≠ a*).
Fixed-confidence setting: minimize E[τ_δ] under the constraint P(â_{τ_δ} ≠ a*) ≤ δ.

Intuition

Simplest setting: for all a ∈ {1, ..., K}, ν_a = N(µ_a, 1). For example: µ = [2, 1.75, 1.75, 1.6, 1.5].

At time t: you have sampled option a n_a times, and your empirical average is X̄_{a,n_a}.

[figure: the five Gaussian arms]

If you stop at time t, your probability of preferring arm a ≥ 2 over arm a* = 1 is

P( X̄_{a,n_a} > X̄_{1,n_1} ) = P( ( X̄_{a,n_a} − µ_a − (X̄_{1,n_1} − µ_1) ) / √(1/n_1 + 1/n_a) > (µ_1 − µ_a) / √(1/n_1 + 1/n_a) )
                            = Φ̄( (µ_1 − µ_a) / √(1/n_1 + 1/n_a) ),

where Φ̄(u) = ∫_u^∞ e^{−v²/2} / √(2π) dv.
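This Gaussian tail formula is easy to check numerically. The snippet below is an illustrative sketch, not part of the talk: confusion_prob is a hypothetical helper evaluating Φ̄((µ_1 − µ_a)/√(1/n_1 + 1/n_a)), compared against a Monte-Carlo estimate.

```python
# Sketch (not from the talk): probability of preferring arm a over arm 1
# after n_1 samples of arm 1 and n_a samples of arm a, unit-variance Gaussians.
import numpy as np
from scipy.stats import norm

def confusion_prob(mu1, mua, n1, na):
    """P( Xbar_{a,n_a} > Xbar_{1,n_1} ) = Phi-bar( (mu1-mua)/sqrt(1/n1+1/na) )."""
    gap = (mu1 - mua) / np.sqrt(1.0 / n1 + 1.0 / na)
    return norm.sf(gap)  # survival function = 1 - Phi

# compare with a Monte-Carlo estimate on the running example
rng = np.random.default_rng(0)
mu1, mua, n1, na = 2.0, 1.75, 100, 100
mc = np.mean(rng.normal(mua, 1, (50_000, na)).mean(axis=1)
             > rng.normal(mu1, 1, (50_000, n1)).mean(axis=1))
print(confusion_prob(mu1, mua, n1, na), mc)  # the two values should be close
```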

Uniform Sampling [animated figure]

Intuition: Equalizing the Probabilities of Confusion

Simplest setting: for all a ∈ {1, ..., K}, ν_a = N(µ_a, 1). For example: µ = [2, 1.75, 1.75, 1.6, 1.5].

Active learning: you allocate a relative budget w_a to option a, with w_1 + ... + w_K = 1. At time t, you have sampled option a about n_a ≈ w_a t times, and your empirical average is X̄_{a,n_a}.

[figure: the five Gaussian arms]

If you stop at time t, your probability of preferring arm a ≥ 2 over arm a* = 1 is

P( X̄_{a,n_a} > X̄_{1,n_1} ) = P( ( X̄_{a,n_a} − µ_a − (X̄_{1,n_1} − µ_1) ) / √(1/n_1 + 1/n_a) > (µ_1 − µ_a) / √(1/n_1 + 1/n_a) )
                            = Φ̄( (µ_1 − µ_a) / √(1/n_1 + 1/n_a) ).
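As an illustration of why uniform sampling is wasteful, the sketch below (the helper confusion_probs is hypothetical, not from the talk) evaluates the per-arm confusion probabilities for the example means under a given allocation w; a good allocation makes the largest of them as small as possible.

```python
# Sketch (illustration only): per-arm probabilities of confusion under an
# allocation w at budget t, for unit-variance Gaussian arms (arm 0 is best).
import numpy as np
from scipy.stats import norm

def confusion_probs(mu, w, t):
    """P( Xbar_a > Xbar_1 ) for each suboptimal arm a, when n_a ~ w_a * t."""
    mu, w = np.asarray(mu, float), np.asarray(w, float)
    n = w * t
    gaps = (mu[0] - mu[1:]) / np.sqrt(1.0 / n[0] + 1.0 / n[1:])
    return norm.sf(gaps)

mu = [2, 1.75, 1.75, 1.6, 1.5]
uniform = np.full(5, 0.2)
print(confusion_probs(mu, uniform, t=200))
# A better allocation puts more samples on arms 1-3 and fewer on arms 4-5,
# so that the largest of these probabilities is reduced.
```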

Improving: trial 1 [animated figure]

Improving: trial 2 [animated figure]

Improving: trial 3 [animated figure]

Optimal Proportions [animated figure]

How to Turn this Intuition into a Theorem?

- The arms are not Gaussian (no closed formula for the probability of confusion) → large deviations (Sanov, KL).
- You do not allocate a relative budget once and for all, you sample sequentially; no fixed-size samples, but a sequential experiment → tracking lemma.
- How do you compute the optimal proportions? → lower bound, game.
- The parameters of the distributions are unknown → (sequential) estimation.
- When should you stop? → Chernoff's stopping rule.

Exponential Families

ν_1, ..., ν_K belong to a one-dimensional exponential family

P_{λ,Θ,b} = { ν_θ, θ ∈ Θ : ν_θ has density f_θ(x) = exp(θx − b(θ)) w.r.t. λ }.

Examples: Gaussian, Bernoulli, Poisson distributions...

ν_θ can be parameterized by its mean µ = ḃ(θ): ν^µ := ν_{ḃ^{-1}(µ)}.

Notation: Kullback-Leibler divergence. For a given exponential family,

d(µ, µ') := KL(ν^µ, ν^{µ'}) = E_{X ~ ν^µ}[ log (dν^µ / dν^{µ'})(X) ]

is the KL divergence between the distributions of means µ and µ'. We identify ν = (ν^{µ_1}, ..., ν^{µ_K}) with µ = (µ_1, ..., µ_K) and consider

S = { µ ∈ (ḃ(Θ))^K : ∃ a ∈ {1, ..., K} : µ_a > max_{i≠a} µ_i }.
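For concreteness, the divergence d(µ, µ') has a closed form in the usual families; the small sketch below (illustrative helper names, closed forms for the unit-variance Gaussian and Bernoulli cases) will be reused in later snippets.

```python
# Sketch: the divergence d(mu, mu') = KL(nu^mu, nu^mu') in two standard
# one-parameter exponential families (closed forms).
import numpy as np

def d_gaussian(mu, mup, sigma2=1.0):
    """KL( N(mu, sigma2), N(mup, sigma2) ) = (mu - mup)^2 / (2 sigma2)."""
    return (mu - mup) ** 2 / (2.0 * sigma2)

def d_bernoulli(mu, mup):
    """KL( B(mu), B(mup) ) = mu log(mu/mup) + (1-mu) log((1-mu)/(1-mup))."""
    mu = np.clip(mu, 1e-12, 1 - 1e-12)
    mup = np.clip(mup, 1e-12, 1 - 1e-12)
    return mu * np.log(mu / mup) + (1 - mu) * np.log((1 - mu) / (1 - mup))

print(d_gaussian(2.0, 1.75), d_bernoulli(0.5, 0.45))
```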

Back to: Equalizing the Probabilities of Confusion

Simplest setting: for all a ∈ {1, ..., K}, ν_a = N(µ_a, 1). For example: µ = [2, 1.75, 1.75, 1.6, 1.5]. You allocate a relative budget w_a to option a, with w_1 + ... + w_K = 1. At time t, you have sampled option a about n_a ≈ w_a t times, and the empirical average is X̄_{a,n_a}.

If you stop at time t, your probability of preferring arm a ≥ 2 over arm a* = 1 is

P( X̄_{a,n_a} > X̄_{1,n_1} ) = Φ̄( (µ_1 − µ_a) / √(1/n_1 + 1/n_a) ) ≤ exp( −(µ_1 − µ_a)² / (2 (1/n_1 + 1/n_a)) )    (Chernoff's bound)

Large Deviation Principle: this probability behaves like

e^{−n_1 (µ_1 − m)²/2} · e^{−n_a (µ_a − m)²/2}   with   m = (n_1 µ_1 + n_a µ_a) / (n_1 + n_a),

that is,

P( X̄_{a,n_a} > X̄_{1,n_1} ) ≈ max_{µ_a < m < µ_1} P( X̄_{1,n_1} < m ) · P( X̄_{a,n_a} ≥ m )    (cf. Sanov).

Entropic Method for Large Deviations Lower Bounds

Let d(µ, µ') = KL(N(µ, 1), N(µ', 1)) = (µ − µ')²/2, and write KL(Y, Z) for KL(L(Y), L(Z)). Fix ε > 0 and µ_a ≤ m ≤ µ_1, and compare the two models

P :  X_{1,1}, ..., X_{1,n_1} iid N(µ_1, 1)     and   X_{a,1}, ..., X_{a,n_a} iid N(µ_a, 1)
P' : X_{1,1}, ..., X_{1,n_1} iid N(m − ε, 1)   and   X_{a,1}, ..., X_{a,n_a} iid N(m + ε, 1).

Then

n_1 d(m − ε, µ_1) + n_a d(m + ε, µ_a) = KL( (X_{1,i}, X_{a,i}) under P', (X_{1,i}, X_{a,i}) under P )
  ≥ KL( 1{X̄_{a,n_a} > X̄_{1,n_1}} under P', 1{X̄_{a,n_a} > X̄_{1,n_1}} under P )    (contraction of entropy = data-processing inequality)
  = kl( P'( X̄_{a,n_a} > X̄_{1,n_1} ), P( X̄_{a,n_a} > X̄_{1,n_1} ) ),

where kl(p, q) = p log(p/q) + (1 − p) log((1 − p)/(1 − q)). Since kl(p, q) ≥ p log(1/q) − log(2), this yields

n_1 d(m − ε, µ_1) + n_a d(m + ε, µ_a) ≥ P'( X̄_{a,n_a} > X̄_{1,n_1} ) log( 1 / P( X̄_{a,n_a} > X̄_{1,n_1} ) ) − log(2).

Under P', the event { X̄_{a,n_a} > X̄_{1,n_1} } has probability at least 1 − e^{−(n_1+n_a)ε²/2}, so

P( X̄_{a,n_a} > X̄_{1,n_1} ) ≥ max_{µ_a ≤ m ≤ µ_1} exp( −( n_1 d(m − ε, µ_1) + n_a d(m + ε, µ_a) + log(2) ) / ( 1 − e^{−(n_1+n_a)ε²/2} ) ),

the maximum being reached at m = (n_1 µ_1 + n_a µ_a)/(n_1 + n_a), where the exponent is of order (µ_1 − µ_a + ε)² / (2 (1/n_1 + 1/n_a)).

With n_1 = w_1 T and n_a = w_a T, this reads

(1/T) log( 1 / P( X̄_{a,n_a} > X̄_{1,n_1} ) ) ≲ w_1 d(m − ε, µ_1) + w_a d(m + ε, µ_a):

if you want to have P( X̄_{a,n_a} > X̄_{1,n_1} ) ≤ δ, then you need

T ≥ log(1/δ) / ( w_1 d(m, µ_1) + w_a d(m, µ_a) ).
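A quick numerical illustration of this last bound for the Gaussian running example (the helper required_budget is hypothetical; it takes m to be the weighted mean of µ_1 and µ_a, the value that makes the constraint tightest):

```python
# Sketch (illustration only): the sample size suggested by the bound
#   T >= log(1/delta) / ( w_1 d(m, mu_1) + w_a d(m, mu_a) )
# for unit-variance Gaussian arms, with m the weighted mean of mu_1 and mu_a.
import numpy as np

def d(mu, mup):                      # Gaussian KL divergence, sigma^2 = 1
    return (mu - mup) ** 2 / 2.0

def required_budget(mu1, mua, w1, wa, delta):
    m = (w1 * mu1 + wa * mua) / (w1 + wa)
    return np.log(1.0 / delta) / (w1 * d(m, mu1) + wa * d(m, mua))

# running example: mu_1 = 2, mu_a = 1.75, uniform weights w_1 = w_a = 0.2
print(required_budget(2.0, 1.75, 0.2, 0.2, delta=0.1))  # roughly 737 rounds
```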

Lower Bound

Lower-Bounding the Sample Complexity

Let µ = (µ_1, ..., µ_K) and λ = (λ_1, ..., λ_K) be two elements of S.

Uniform δ-PAC constraint [Kaufmann, Cappé, G. '15]: if a*(µ) ≠ a*(λ), any δ-PAC algorithm satisfies

Σ_{a=1}^K E_µ[ N_a(τ_δ) ] d(µ_a, λ_a) ≥ kl(δ, 1 − δ).

Let Alt(µ) = { λ : a*(λ) ≠ a*(µ) }. Taking λ_1 = m_2 − ε and λ_2 = m_2 + ε (with m_2 between µ_2 and µ_1), then doing the same with arms 3 and 4, yields

E_µ[N_1(τ_δ)] d(µ_1, m_2 − ε) + E_µ[N_2(τ_δ)] d(µ_2, m_2 + ε) ≥ kl(δ, 1 − δ)
E_µ[N_1(τ_δ)] d(µ_1, m_3 − ε) + E_µ[N_3(τ_δ)] d(µ_3, m_3 + ε) ≥ kl(δ, 1 − δ)
E_µ[N_1(τ_δ)] d(µ_1, m_4 − ε) + E_µ[N_4(τ_δ)] d(µ_4, m_4 + ε) ≥ kl(δ, 1 − δ)

Combining all such constraints:

inf_{λ ∈ Alt(µ)} Σ_{a=1}^K E_µ[N_a(τ_δ)] d(µ_a, λ_a) ≥ kl(δ, 1 − δ)
E_µ[τ_δ] · inf_{λ ∈ Alt(µ)} Σ_{a=1}^K ( E_µ[N_a(τ_δ)] / E_µ[τ_δ] ) d(µ_a, λ_a) ≥ kl(δ, 1 − δ)
E_µ[τ_δ] · sup_{w ∈ Σ_K} inf_{λ ∈ Alt(µ)} Σ_{a=1}^K w_a d(µ_a, λ_a) ≥ kl(δ, 1 − δ)

Lower Bound: the Complexity of BAI

Theorem. For any δ-PAC algorithm,

E_µ[τ_δ] ≥ T*(µ) kl(δ, 1 − δ),   where   T*(µ)^{-1} = sup_{w ∈ Σ_K} inf_{λ ∈ Alt(µ)} Σ_{a=1}^K w_a d(µ_a, λ_a).

- kl(δ, 1 − δ) ~ log(1/δ) when δ → 0, and kl(δ, 1 − δ) ≥ log(1/(2.4δ));
- cf. [Graves and Lai 1997, Vaidhyan and Sundaresan 2015];
- the optimal proportions of arm draws are w*(µ) = argmax_{w ∈ Σ_K} inf_{λ ∈ Alt(µ)} Σ_{a=1}^K w_a d(µ_a, λ_a); they do not depend on δ.
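The optimization problem defining T*(µ) and w*(µ) can be solved numerically. The sketch below is an illustrative implementation, not the authors' code: it relies on the fact that, for a one-parameter exponential family, the inner infimum against challenger a is attained at λ_1 = λ_a = (w_1 µ_1 + w_a µ_a)/(w_1 + w_a), and maximizes over w with a generic optimizer on a softmax reparameterization of the simplex.

```python
# Sketch (not the authors' implementation): numerical computation of
# T*(mu) and w*(mu) for unit-variance Gaussian arms, mu[0] assumed best.
import numpy as np
from scipy.optimize import minimize

def d(mu, lam):                      # Gaussian KL divergence, sigma^2 = 1
    return (mu - lam) ** 2 / 2.0

def F(w, mu):
    """inf_{lambda in Alt(mu)} sum_a w_a d(mu_a, lambda_a).
    For each challenger a, the infimum is attained at the weighted mean m_a."""
    mu, w = np.asarray(mu, float), np.asarray(w, float)
    vals = []
    for a in range(1, len(mu)):
        m = (w[0] * mu[0] + w[a] * mu[a]) / (w[0] + w[a] + 1e-12)
        vals.append(w[0] * d(mu[0], m) + w[a] * d(mu[a], m))
    return min(vals)

def optimal_proportions(mu):
    """Maximize F over the simplex via a softmax reparameterization."""
    K = len(mu)
    def neg(z):
        w = np.exp(z - z.max()); w /= w.sum()
        return -F(w, mu)
    res = minimize(neg, np.zeros(K), method="Nelder-Mead",
                   options={"xatol": 1e-8, "fatol": 1e-12, "maxiter": 20000})
    w = np.exp(res.x - res.x.max()); w /= w.sum()
    return w, 1.0 / F(w, mu)         # (w*(mu), T*(mu))

w_star, T_star = optimal_proportions([2, 1.75, 1.75, 1.6, 1.5])
print(np.round(w_star, 3), round(T_star, 1))
```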

PAC-BAI as a Game

Given a parameter µ = (µ_1, ..., µ_K):
- the statistician chooses proportions of arm draws w = (w_a)_a;
- the opponent chooses an alternative model λ ∈ Alt(µ);
- the payoff is the minimal number T = T(w, λ) of draws necessary for the statistician not to violate the δ-PAC constraint, i.e. Σ_{a=1}^K T w_a d(µ_a, λ_a) ≥ kl(δ, 1 − δ).

Specializing to µ_1 > µ_2 ≥ ... ≥ µ_K, the opponent in fact chooses an arm a ∈ {2, ..., K} and plays λ_a = argmin_λ [ w_1 d(µ_1, λ) + w_a d(µ_a, λ) ]; the payoff is then the minimal number T = T(w, a, δ) of draws necessary to ensure that

T w_1 d(µ_1, λ_a − ε) + T w_a d(µ_a, λ_a + ε) ≥ kl(δ, 1 − δ),   that is,   T(w, a, δ) = kl(δ, 1 − δ) / ( w_1 d(µ_1, λ_a − ε) + w_a d(µ_a, λ_a + ε) ).

T*(µ) kl(δ, 1 − δ) = value of the game; w* = optimal action for the statistician.

Properties of T*(µ) and w*(µ)

1. Unique solution, characterized by scalar equations only.
2. For all µ ∈ S and all a, w*_a(µ) > 0.
3. w* is continuous at every µ ∈ S.
4. If µ_1 > µ_2 ≥ ... ≥ µ_K, then w*_2(µ) ≥ ... ≥ w*_K(µ) (but one may have w*_1(µ) < w*_2(µ)).
5. Case of two arms [Kaufmann, Cappé, G. '14]: E_µ[τ_δ] ≥ kl(δ, 1 − δ) / d*(µ_1, µ_2), where d* is the reversed Chernoff information, d*(µ_1, µ_2) := d(µ_1, µ*) = d(µ_2, µ*).
6. Gaussian arms: an algebraic equation, but no simple formula for K ≥ 3; however,

Σ_{a=1}^K 2σ²/Δ_a² ≤ T*(µ) ≤ 2 Σ_{a=1}^K 2σ²/Δ_a².

The Track-and-Stop Strategy

Sampling rule: Tracking the optimal proportions

µ̂(t) = (µ̂_1(t), ..., µ̂_K(t)): vector of empirical means.

Introducing U_t = { a : N_a(t) < √t }, the arm sampled at round t + 1 is

A_{t+1} ∈ argmin_{a ∈ U_t} N_a(t)                       if U_t ≠ ∅   (forced exploration)
A_{t+1} ∈ argmax_{1 ≤ a ≤ K} [ t w*_a(µ̂(t)) − N_a(t) ]   otherwise   (tracking)

Lemma. Under the Tracking sampling rule,

P_µ( lim_{t→∞} N_a(t)/t = w*_a(µ) ) = 1.
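A direct transcription of this sampling rule (a sketch; w_star is assumed to be a function returning the optimal proportions for the current empirical means, for instance computed as in the earlier snippet):

```python
# Sketch of the Tracking sampling rule (illustration only).
import numpy as np

def next_arm(counts, mu_hat, w_star):
    """Return the index of the arm to sample at round t + 1.

    counts : array of N_a(t); mu_hat : array of empirical means;
    w_star : callable mapping mu_hat to the optimal proportions w*(mu_hat).
    """
    counts = np.asarray(counts, float)
    t = counts.sum()
    undersampled = np.where(counts < np.sqrt(t))[0]
    if undersampled.size > 0:                      # forced exploration
        return int(undersampled[np.argmin(counts[undersampled])])
    w = w_star(mu_hat)                             # tracking
    return int(np.argmax(t * w - counts))
```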

Sequential Generalized Likelihood Test

High values of the Generalized Likelihood Ratio

Z_{a,b}(t) := log [ max_{λ: λ_a ≥ λ_b} dP_λ(X_1, ..., X_t) / max_{λ: λ_a ≤ λ_b} dP_λ(X_1, ..., X_t) ]
            = N_a(t) d( µ̂_a(t), µ̂_{a,b}(t) ) + N_b(t) d( µ̂_b(t), µ̂_{a,b}(t) )   if µ̂_a(t) ≥ µ̂_b(t),
            = −Z_{b,a}(t)   otherwise,

where µ̂_{a,b}(t) = ( N_a(t) µ̂_a(t) + N_b(t) µ̂_b(t) ) / ( N_a(t) + N_b(t) ), reject the hypothesis that µ_a ≤ µ_b. We stop when one arm is assessed to be significantly larger than all other arms, according to a GLR test:

τ_δ = inf{ t ∈ N : ∃ a ∈ {1, ..., K}, ∀ b ≠ a, Z_{a,b}(t) > β(t, δ) }
    = inf{ t ∈ N : Z(t) := max_{a ∈ {1,...,K}} min_{b ≠ a} Z_{a,b}(t) > β(t, δ) }

Chernoff stopping rule [Chernoff '59]. Two other possible interpretations of the stopping rule:

- MDL: Z_{a,b}(t) = ( N_a(t) + N_b(t) ) H( µ̂_{a,b}(t) ) − [ N_a(t) H( µ̂_a(t) ) + N_b(t) H( µ̂_b(t) ) ];
- plug-in complexity estimate: with F(w, µ) := inf_{λ ∈ Alt(µ)} Σ_{a=1}^K w_a d(µ_a, λ_a), so that T*(µ)^{-1} = F(w*(µ), µ), stop when Z(t) = t F( (N_a(t)/t)_a, µ̂(t) ) > β(t, δ), instead of waiting for the lower bound t ≥ T*(µ) kl(δ, 1 − δ).
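For unit-variance Gaussian arms, the statistic Z_{a,b}(t) and the stopping test take the following form (illustrative sketch; helper names are not from the paper):

```python
# Sketch: Chernoff stopping statistic Z(t) for unit-variance Gaussian arms.
import numpy as np

def d(mu, lam):                      # Gaussian KL divergence, sigma^2 = 1
    return (mu - lam) ** 2 / 2.0

def Z_ab(counts, mu_hat, a, b):
    """GLR statistic for "mu_a > mu_b"; non-negative iff mu_hat[a] >= mu_hat[b]."""
    na, nb = counts[a], counts[b]
    m = (na * mu_hat[a] + nb * mu_hat[b]) / (na + nb)  # weighted mean mu_hat_{a,b}
    z = na * d(mu_hat[a], m) + nb * d(mu_hat[b], m)
    return z if mu_hat[a] >= mu_hat[b] else -z

def should_stop(counts, mu_hat, delta):
    """Stop when Z(t) = max_a min_{b != a} Z_{a,b}(t) exceeds beta(t, delta)."""
    K, t = len(counts), sum(counts)
    beta = np.log(2 * (K - 1) * t / delta)             # threshold from the talk
    Z = max(min(Z_ab(counts, mu_hat, a, b) for b in range(K) if b != a)
            for a in range(K))
    return Z > beta
```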

Calibration

Theorem. The Chernoff stopping rule is δ-PAC for β(t, δ) = log( 2(K − 1)t / δ ).

Lemma. If µ_a < µ_b, then whatever the sampling rule,

P_µ( ∃ t ∈ N : Z_{a,b}(t) > log(2t/δ) ) ≤ δ.

The proof uses Barron's lemma (a change of distribution) and the Krichevsky-Trofimov universal distribution (very information-theoretic ideas).

Asymptotic Optimality of the T&S Strategy

Theorem. The Track-and-Stop strategy, which uses
- the Tracking sampling rule,
- the Chernoff stopping rule with β(t, δ) = log( 2(K − 1)t / δ ),
- and recommends â_{τ_δ} = argmax_{a=1,...,K} µ̂_a(τ_δ),
is δ-PAC for every δ ∈ (0, 1) and satisfies

lim sup_{δ→0} E_µ[τ_δ] / log(1/δ) = T*(µ).

Why is the T&S Strategy asymptotically Optimal? [animated figure]

Sketch of proof (almost-sure convergence only)

- Forced exploration ⇒ N_a(t) → ∞ a.s. for all a ∈ {1, ..., K} ⇒ µ̂(t) → µ a.s. ⇒ w*(µ̂(t)) → w* a.s.
- Tracking rule: N_a(t)/t → w*_a a.s.
- The mapping F : (µ', w) ↦ inf_{λ ∈ Alt(µ')} Σ_{a=1}^K w_a d(µ'_a, λ_a) is continuous at (µ, w*(µ)), so

Z(t) = t F( µ̂(t), (N_a(t)/t)_{a=1}^K ) ≈ t F(µ, w*) = t T*(µ)^{-1},

and for every ε > 0 there exists t_0 such that for all t ≥ t_0, Z(t) ≥ t (1 + ε)^{-1} T*(µ)^{-1}.

- Thus τ_δ ≤ t_0 ∨ inf{ t ∈ N : (1 + ε)^{-1} T*(µ)^{-1} t ≥ log( 2(K − 1)t / δ ) }, and

lim sup_{δ→0} τ_δ / log(1/δ) ≤ (1 + ε) T*(µ)   a.s.

Numerical Experiments

µ_1 = [0.5, 0.45, 0.43, 0.4],            w*(µ_1) = [0.42, 0.39, 0.14, 0.06]
µ_2 = [0.3, 0.21, 0.2, 0.19, 0.18],      w*(µ_2) = [0.34, 0.25, 0.18, 0.13, 0.10]

In practice, set the threshold to β(t, δ) = log( (log(t) + 1) / δ ) (δ-PAC OK).

Table 1: Expected number of draws E_µ[τ_δ] for δ = 0.1, averaged over N = 3000 experiments.

        Track-and-Stop   Chernoff-Racing   KL-LUCB   KL-Racing
µ_1     4052             4516              8437      9590
µ_2     1406             3078              2716      3334

- Empirically good even for large values of the risk δ.
- Racing is sub-optimal in general, because it plays w_1 = w_2.
- LUCB is sub-optimal in general, because it plays w_1 = 1/2.
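To make the pieces concrete, here is a minimal end-to-end Track-and-Stop run on Bernoulli arms, using the practical threshold β(t, δ) = log((log(t) + 1)/δ) mentioned above. This is an illustrative sketch, not the authors' code: it recomputes w*(µ̂(t)) from scratch at every tracking step with a generic optimizer, which is slow but keeps the example short.

```python
# Sketch (illustration only): a minimal Track-and-Stop run on Bernoulli arms.
import numpy as np
from scipy.optimize import minimize

def kl(p, q):
    p, q = np.clip(p, 1e-12, 1 - 1e-12), np.clip(q, 1e-12, 1 - 1e-12)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def F(w, mu):
    """inf over the alternative: empirical best arm against each challenger."""
    best = int(np.argmax(mu))
    vals = []
    for a in range(len(mu)):
        if a == best:
            continue
        m = (w[best] * mu[best] + w[a] * mu[a]) / (w[best] + w[a] + 1e-12)
        vals.append(w[best] * kl(mu[best], m) + w[a] * kl(mu[a], m))
    return min(vals)

def w_star(mu):
    """Optimal proportions via a softmax reparameterization of the simplex."""
    def neg(z):
        w = np.exp(z - z.max()); w /= w.sum()
        return -F(w, mu)
    res = minimize(neg, np.zeros(len(mu)), method="Nelder-Mead")
    w = np.exp(res.x - res.x.max())
    return w / w.sum()

def track_and_stop(mu_true, delta, seed=0):
    rng = np.random.default_rng(seed)
    K = len(mu_true)
    counts, sums = np.zeros(K), np.zeros(K)
    for a in range(K):                               # one sample per arm to start
        counts[a] += 1; sums[a] += rng.random() < mu_true[a]
    while True:
        t = counts.sum()
        mu_hat = sums / counts
        # Chernoff stopping rule, plug-in form Z(t) = t * F(N/t, mu_hat)
        if t * F(counts / t, mu_hat) > np.log((np.log(t) + 1) / delta):
            return int(np.argmax(mu_hat)), int(t)
        undersampled = np.where(counts < np.sqrt(t))[0]
        if undersampled.size > 0:                    # forced exploration
            a = int(undersampled[np.argmin(counts[undersampled])])
        else:                                        # tracking
            a = int(np.argmax(t * w_star(mu_hat) - counts))
        counts[a] += 1; sums[a] += rng.random() < mu_true[a]

print(track_and_stop([0.3, 0.21, 0.2, 0.19, 0.18], delta=0.1))
```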

Perspectives

For best arm identification, we showed that

inf_{PAC algorithms} lim sup_{δ→0} E_µ[τ_δ] / log(1/δ) = [ sup_{w ∈ Σ_K} inf_{λ ∈ Alt(µ)} Σ_{a=1}^K w_a d(µ_a, λ_a) ]^{-1}

and provided an efficient strategy asymptotically matching this bound.

Future work:
- (easy) anytime stopping gives a confidence level;
- (easy) find an ε-optimal arm;
- (easy) find the m best arms;
- design and analyze a more stable algorithm (hint: optimism);
- give a simple algorithm with a finite-time analysis; candidate: play the action maximizing the expected increase of Z(t);
- extend to structured and continuous settings.

References

O. Cappé, A. Garivier, O-A. Maillard, R. Munos, and G. Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, 2013.
H. Chernoff. Sequential design of experiments. The Annals of Mathematical Statistics, 1959.
E. Even-Dar, S. Mannor, and Y. Mansour. Action elimination and stopping conditions for the multi-armed bandit and reinforcement learning problems. JMLR, 2006.
T.L. Graves and T.L. Lai. Asymptotically efficient adaptive choice of control laws in controlled Markov chains. SIAM Journal on Control and Optimization, 35(3):715-743, 1997.
S. Kalyanakrishnan, A. Tewari, P. Auer, and P. Stone. PAC subset selection in stochastic multi-armed bandits. ICML, 2012.
E. Kaufmann, O. Cappé, and A. Garivier. On the complexity of best arm identification in multi-armed bandit models. JMLR, 2015.
A. Garivier and E. Kaufmann. Optimal best arm identification with fixed confidence. COLT 2016, New York. arXiv:1602.04589.
A. Garivier, P. Ménard, and G. Stoltz. Explore first, exploit next: the true shape of regret in bandit problems.
E. Kaufmann and S. Kalyanakrishnan. The information complexity of best arm identification. COLT 2013.
T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 1985.
D. Russo. Simple Bayesian algorithms for best arm identification. COLT 2016.
N.K. Vaidhyan and R. Sundaresan. Learning to detect an oddball target. arXiv:1508.05572, 2015.