The information complexity of best-arm identification

The information complexity of best-arm identification
Emilie Kaufmann, joint work with Olivier Cappé and Aurélien Garivier
MAB workshop, Lancaster, January 2016

Context: the multi-armed bandit model (MAB)
K arms = K probability distributions (ν_a has mean µ_a): ν_1, ν_2, ν_3, ν_4, ν_5.
At round t, an agent:
- chooses an arm A_t
- observes a sample X_t ~ ν_{A_t}
using a sequential sampling strategy (A_t): A_{t+1} = F_t(A_1, X_1, ..., A_t, X_t), aimed at a prescribed objective, e.g. related to learning a* = argmax_a µ_a and µ* = max_a µ_a.

A possible objective: Regret minimization
Samples = rewards; (A_t) is adjusted to maximize the (expected) sum of rewards, E[Σ_{t=1}^T X_t], or equivalently to minimize the regret
R_T = E[ T µ* − Σ_{t=1}^T X_t ]
⇒ exploration/exploitation tradeoff.
Motivation: clinical trials [1933], with arms B(µ_1), ..., B(µ_5). Goal: maximize the number of patients healed during the trial.

Our objective: Best-arm identification
Goal: identify the best arm a* as fast and accurately as possible. There is no incentive to draw arms with high means! ⇒ optimal exploration.
The agent's strategy is made of:
- a sequential sampling strategy (A_t)
- a stopping rule τ (a stopping time)
- a recommendation rule â_τ
Possible goals:
- Fixed-budget setting: τ = T; minimize P(â_τ ≠ a*)
- Fixed-confidence setting: minimize E[τ] subject to P(â_τ ≠ a*) ≤ δ
Motivation: market research, A/B Testing, clinical trials...

Wanted: Optimal algorithms in the PAC formulation
M a class of bandit models ν = (ν_1, ..., ν_K). A strategy is δ-PAC on M if ∀ν ∈ M, P_ν(â_τ = a*) ≥ 1 − δ.
Goal: for some classes M, and every ν ∈ M, find
- a lower bound on E_ν[τ] that holds for any δ-PAC strategy
- a δ-PAC strategy such that E_ν[τ] matches this bound
(distribution-dependent bounds)

Outline
1. Regret minimization
2. Lower bound on the sample complexity
3. The complexity of A/B Testing
4. Algorithms for the general case

Exponential family bandit models
ν_1, ..., ν_K belong to a one-dimensional exponential family:
P_{λ,Θ,b} = { ν_θ, θ ∈ Θ : ν_θ has density f_θ(x) = exp(θx − b(θ)) w.r.t. λ }
Examples: Gaussian, Bernoulli, Poisson distributions...
ν_θ can be parametrized by its mean µ = ḃ(θ): ν^µ := ν_{ḃ⁻¹(µ)}.
Notation: Kullback-Leibler divergence. For a given exponential family P,
d_P(µ, µ') := KL(ν^µ, ν^{µ'}) = E_{X ~ ν^µ}[ log (dν^µ/dν^{µ'})(X) ]
is the KL divergence between the distributions of means µ and µ'.
Example: Bernoulli distributions,
d(µ, µ') = KL(B(µ), B(µ')) = µ log(µ/µ') + (1 − µ) log((1 − µ)/(1 − µ')).
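To make the notation concrete, here is a minimal Python sketch (not part of the original slides) of the divergence d(µ, µ') in two standard cases used later in the talk, Bernoulli arms and Gaussian arms with a known variance; the clipping constant eps is only a numerical safeguard.

```python
import math

def kl_bernoulli(mu, mu_prime, eps=1e-12):
    """d(mu, mu') = KL(B(mu), B(mu')) between two Bernoulli distributions."""
    mu = min(max(mu, eps), 1 - eps)
    mu_prime = min(max(mu_prime, eps), 1 - eps)
    return mu * math.log(mu / mu_prime) + (1 - mu) * math.log((1 - mu) / (1 - mu_prime))

def kl_gaussian(mu, mu_prime, sigma2=1.0):
    """KL divergence between two Gaussians sharing a known variance sigma^2."""
    return (mu - mu_prime) ** 2 / (2 * sigma2)
```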

Outline
1. Regret minimization
2. Lower bound on the sample complexity
3. The complexity of A/B Testing
4. Algorithms for the general case

Optimal algorithms for regret minimization
ν = (ν^{µ_1}, ..., ν^{µ_K}) ∈ M = (P)^K. N_a(t): number of draws of arm a up to time t.
R_T(ν) = Σ_{a=1}^K (µ* − µ_a) E_ν[N_a(T)]
Consistent algorithm: ∀ν ∈ M, ∀α ∈ ]0,1[, R_T(ν) = o(T^α).
[Lai and Robbins 1985]: every consistent algorithm satisfies
µ_a < µ*  ⇒  liminf_{T→∞} E_ν[N_a(T)] / log T ≥ 1/d(µ_a, µ*).
Definition. A bandit algorithm is asymptotically optimal if, for every ν ∈ M,
µ_a < µ*  ⇒  limsup_{T→∞} E_ν[N_a(T)] / log T ≤ 1/d(µ_a, µ*).

KL-UCB: an asymptotically optimal algorithm
KL-UCB [Cappé et al. 2013] selects A_{t+1} = argmax_a u_a(t), with
u_a(t) = max{ x : d(µ̂_a(t), x) ≤ log(t)/N_a(t) },
where d(µ, µ') = KL(ν^µ, ν^{µ'}).
[Figure: the curve q ↦ d(S_a(t)/N_a(t), q), its quadratic lower bound 2(q − S_a(t)/N_a(t))², and the threshold log(t)/N_a(t) defining the upper confidence bound above S_a(t)/N_a(t)]
It satisfies E_ν[N_a(T)] ≤ (1/d(µ_a, µ*)) log T + O(√log T).
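The index u_a(t) has no closed form, but since x ↦ d(µ̂, x) is increasing for x ≥ µ̂ it can be computed by bisection. A minimal sketch for Bernoulli arms, reusing kl_bernoulli from the previous snippet (the stopping precision is an arbitrary choice):

```python
import math

def klucb_index(mu_hat, n_pulls, t, precision=1e-6):
    """u_a(t): the largest x >= mu_hat with N_a(t) * d(mu_hat, x) <= log(t) (Bernoulli case)."""
    level = math.log(t) / n_pulls
    low, high = mu_hat, 1.0
    while high - low > precision:
        mid = (low + high) / 2
        if kl_bernoulli(mu_hat, mid) <= level:
            low = mid      # mid is still inside the confidence region: move up
        else:
            high = mid     # mid is too optimistic: move down
    return low
```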

The information complexity of regret minimization
Letting
κ_R(ν) := inf_{A consistent} liminf_{T→∞} R_T(ν) / log(T),
we showed that
κ_R(ν) = Σ_{a=1}^K (µ* − µ_a) / d(µ_a, µ*).
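The constant κ_R(ν) is easy to evaluate numerically for a given instance; here is a small sketch for Bernoulli arms, again reusing kl_bernoulli (the example instance is arbitrary):

```python
def kappa_R(means):
    """kappa_R(nu) = sum over suboptimal arms of (mu* - mu_a) / d(mu_a, mu*), Bernoulli case."""
    mu_star = max(means)
    return sum((mu_star - mu) / kl_bernoulli(mu, mu_star) for mu in means if mu < mu_star)

# e.g. kappa_R([0.5, 0.4, 0.3]) gives the constant in front of log(T) for this instance
```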

Outline
1. Regret minimization
2. Lower bound on the sample complexity
3. The complexity of A/B Testing
4. Algorithms for the general case

A general lower bound
M a class of exponential family bandit models, A = (A_t, τ, â_τ) a strategy. A is δ-PAC on M if ∀ν ∈ M, P_ν(â_τ = a*) ≥ 1 − δ.
Theorem [K., Cappé, Garivier 15]
Let ν = (ν^{µ_1}, ..., ν^{µ_K}) be such that µ_1 > µ_2 ≥ ... ≥ µ_K, and let δ ∈ ]0,1[. Any algorithm that is δ-PAC on M satisfies
E_ν[τ] ≥ ( 1/d(µ_1, µ_2) + Σ_{a=2}^K 1/d(µ_a, µ_1) ) log(1/(2.4δ)),
where d(µ, µ') = KL(ν^µ, ν^{µ'}).
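The right-hand side is straightforward to evaluate for a concrete instance, which is useful as a baseline when benchmarking δ-PAC algorithms. A sketch for Bernoulli arms, reusing kl_bernoulli:

```python
import math

def bai_lower_bound(means, delta):
    """( 1/d(mu_1, mu_2) + sum_{a>=2} 1/d(mu_a, mu_1) ) * log(1 / (2.4 * delta)), Bernoulli case."""
    mus = sorted(means, reverse=True)          # mu_1 > mu_2 >= ... >= mu_K
    mu1, mu2 = mus[0], mus[1]
    total = 1.0 / kl_bernoulli(mu1, mu2)
    total += sum(1.0 / kl_bernoulli(mu, mu1) for mu in mus[1:])
    return total * math.log(1.0 / (2.4 * delta))
```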

Behind the lower bound: changes of distribution
Lemma [K., Cappé, Garivier 2015]
Let ν = (ν_1, ν_2, ..., ν_K) and ν' = (ν'_1, ν'_2, ..., ν'_K) be two bandit models. Then
Σ_{a=1}^K E_ν[N_a(τ)] KL(ν_a, ν'_a) ≥ sup_{E ∈ F_τ} kl(P_ν(E), P_ν'(E)),
with kl(x, y) = x log(x/y) + (1 − x) log((1 − x)/(1 − y)).
Two ingredients, with L_t the log-likelihood ratio of past observations under ν and ν':
- Wald's equality: E_ν[L_τ] = Σ_{a=1}^K E_ν[N_a(τ)] KL(ν_a, ν'_a)
- change of distribution: ∀E ∈ F_τ, P_ν'(E) = E_ν[ exp(−L_τ) 1_E ]

Behind the lower bound: changes of distribution
Exponential family bandits: for ν = (µ_1, µ_2, ..., µ_K) and ν' = (µ'_1, µ'_2, ..., µ'_K),
∀E ∈ F_τ, Σ_{a=1}^K E_ν[N_a(τ)] d(µ_a, µ'_a) ≥ kl(P_ν(E), P_ν'(E)),
and E_ν[τ] = Σ_{a=1}^K E_ν[N_a(τ)].
Then, for each a ≠ 1, choose ν' such that arm 1 is no longer the best arm:
µ'_a = µ_1 + ε and µ'_i = µ_i for i ≠ a
(only the mean of arm a is moved, from µ_a up to just above µ_1).
With E = (â_τ = 1): P_ν(E) ≥ 1 − δ and P_ν'(E) ≤ δ, hence
E_ν[N_a(τ)] d(µ_a, µ_1 + ε) ≥ kl(1 − δ, δ)  ⇒  E_ν[N_a(τ)] ≥ (1/d(µ_a, µ_1)) log(1/(2.4δ))
(letting ε → 0).

The complexity of best arm identification
M a class of exponential family bandit models.
Theorem [K., Cappé, Garivier 15]
Let ν = (ν^{µ_1}, ..., ν^{µ_K}) be such that µ_1 > µ_2 ≥ ... ≥ µ_K, and let δ ∈ ]0,1[. Any algorithm that is δ-PAC on M satisfies
E_ν[τ] ≥ ( 1/d(µ_1, µ_2) + Σ_{a=2}^K 1/d(µ_a, µ_1) ) log(1/(2.4δ)).
For any class M, the complexity term of ν ∈ M is defined as
κ_C(ν) := inf_{A PAC} limsup_{δ→0} E_ν[τ] / log(1/δ),
where A = (A(δ)) is PAC if for all δ ∈ ]0,1[, A(δ) is δ-PAC on M.

Outline
1. Regret minimization
2. Lower bound on the sample complexity
3. The complexity of A/B Testing
4. Algorithms for the general case

Computing the complexity term
M a class of two-armed bandit models. For ν = (ν_1, ν_2), recall that
κ_C(ν) := inf_{A PAC} limsup_{δ→0} E_ν[τ] / log(1/δ).
We now compute κ_C(ν) for two types of classes M:
- exponential family bandit models: M = { ν = (ν^{µ_1}, ν^{µ_2}) : ν^{µ_i} ∈ P, µ_1 ≠ µ_2 }
- Gaussian with known variances σ_1² and σ_2²: M = { ν = (N(µ_1, σ_1²), N(µ_2, σ_2²)) : (µ_1, µ_2) ∈ R², µ_1 ≠ µ_2 }

Lower bounds on the complexity
From our previous lower bound (or a similar method):
- exponential family bandit models: κ_C(ν) ≥ 1/d(µ_1, µ_2) + 1/d(µ_2, µ_1)
- Gaussian with known variances σ_1², σ_2²: κ_C(ν) ≥ 2σ_1²/(µ_1 − µ_2)² + 2σ_2²/(µ_2 − µ_1)²

Towards tighter lower bounds
Exponential family bandits: for ν = (µ_1, µ_2) and any ν' = (µ'_1, µ'_2) with µ'_1 < µ'_2,
E_ν[N_1(τ)] d(µ_1, µ'_1) + E_ν[N_2(τ)] d(µ_2, µ'_2) ≥ log(1/(2.4δ)).
Instead of moving a single mean as previously (µ'_1 = µ_1, µ'_2 = µ_1 + ε), a new change of distribution moves both means to a common point: µ'_1 = µ* and µ'_2 = µ* + ε.
Choosing µ* such that d(µ_1, µ*) = d(µ_2, µ*) =: d*(µ_1, µ_2) yields
d*(µ_1, µ_2) E_ν[τ] ≥ log(1/(2.4δ))  ⇒  E_ν[τ] ≥ (1/d*(µ_1, µ_2)) log(1/(2.4δ)).

Tighter lower bounds in the two-armed case
New lower bounds (tighter!):
- exponential families: κ_C(ν) ≥ 1/d*(µ_1, µ_2)
- Gaussian with known σ_1², σ_2²: κ_C(ν) ≥ 2(σ_1 + σ_2)²/(µ_1 − µ_2)²
where d*(µ_1, µ_2) := d(µ_1, µ*) = d(µ_2, µ*) is a Chernoff information.
Previous lower bounds:
- exponential families: κ_C(ν) ≥ 1/d(µ_1, µ_2) + 1/d(µ_2, µ_1)
- Gaussian with known σ_1², σ_2²: κ_C(ν) ≥ 2(σ_1² + σ_2²)/(µ_1 − µ_2)²
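The Chernoff information has no closed form in general, but for a one-parameter family the crossing point d(µ_1, z) = d(µ_2, z) can be located by bisection, since one divergence increases and the other decreases as z moves from one mean to the other. A sketch for Bernoulli arms, reusing kl_bernoulli:

```python
def chernoff_information(mu1, mu2, precision=1e-8):
    """d*(mu1, mu2): common value of d(mu1, z) and d(mu2, z) at their crossing point (Bernoulli)."""
    lo, hi = min(mu1, mu2), max(mu1, mu2)
    while hi - lo > precision:
        z = (lo + hi) / 2
        if kl_bernoulli(hi, z) > kl_bernoulli(lo, z):
            lo = z   # still closer to the low mean: the crossing point is further right
        else:
            hi = z
    return kl_bernoulli(mu1, (lo + hi) / 2)
```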

Upper bounds on the complexity: algorithms
M = { ν = (N(µ_1, σ_1²), N(µ_2, σ_2²)) : (µ_1, µ_2) ∈ R², µ_1 ≠ µ_2 }.
The α-elimination algorithm with exploration rate β(t, δ):
- chooses A_t so as to keep the proportion N_1(t)/t close to α, i.e. A_t = 2 if and only if ⌈αt⌉ = ⌈α(t+1)⌉
- stops at
τ = inf{ t ∈ N : |µ̂_1(t) − µ̂_2(t)| > sqrt( 2 σ_t²(α) β(t, δ) ) },
where µ̂_a(t) is the empirical mean of the rewards obtained from arm a up to time t and σ_t²(α) = σ_1²/⌈αt⌉ + σ_2²/(t − ⌈αt⌉).
[Figure: a sample run showing the two empirical means and the stopping threshold over 1000 rounds]

Gaussian case: matching algorithm
Theorem [K., Cappé, Garivier 14]
With α = σ_1/(σ_1 + σ_2) and β(t, δ) = log(t/δ) + 2 log log(6t), α-elimination is δ-PAC and, for every ε > 0,
E_ν[τ] ≤ (1 + ε) [ 2(σ_1 + σ_2)²/(µ_1 − µ_2)² ] log(1/δ) + o_ε(log(1/δ)) as δ → 0.
In the Gaussian case this gives κ_C(ν) ≤ 2(σ_1 + σ_2)²/(µ_1 − µ_2)², and finally
κ_C(ν) = 2(σ_1 + σ_2)²/(µ_1 − µ_2)².
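A minimal sketch of α-elimination under these choices of α and β. The arms are passed as sampling functions sample1 and sample2 (an interface assumed here for illustration), the stopping test uses the absolute difference of the empirical means, and t_max is only a safeguard against non-termination.

```python
import math

def alpha_elimination(sample1, sample2, sigma1, sigma2, delta, t_max=10**6):
    """Two-armed alpha-elimination for Gaussian arms with known standard deviations."""
    alpha = sigma1 / (sigma1 + sigma2)
    sums, counts = [0.0, 0.0], [0, 0]
    for t in range(1, t_max + 1):
        # keep N_1(t) = ceil(alpha * t): draw arm 1 whenever it lags behind that target
        arm = 0 if counts[0] < math.ceil(alpha * t) else 1
        x = sample1() if arm == 0 else sample2()
        sums[arm] += x
        counts[arm] += 1
        if min(counts) == 0:
            continue
        mu1_hat, mu2_hat = sums[0] / counts[0], sums[1] / counts[1]
        sigma_t2 = sigma1**2 / counts[0] + sigma2**2 / counts[1]
        beta = math.log(t / delta) + 2 * math.log(math.log(6 * t))
        if abs(mu1_hat - mu2_hat) > math.sqrt(2 * sigma_t2 * beta):
            return (1 if mu1_hat > mu2_hat else 2), t   # (recommended arm, stopping time)
    return None, t_max
```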

Exponential families: uniform sampling
Another lower bound... M = { ν = (ν^{µ_1}, ν^{µ_2}) : ν^{µ_i} ∈ P, µ_1 ≠ µ_2 }.
Any δ-PAC algorithm using uniform sampling (A_t = t mod 2) satisfies
E_ν[τ] ≥ (1/I*(µ_1, µ_2)) log(1/(2.4δ))
with I*(µ_1, µ_2) = [ d(µ_1, (µ_1 + µ_2)/2) + d(µ_2, (µ_1 + µ_2)/2) ] / 2.
Remark: I*(µ_1, µ_2) is very close to d*(µ_1, µ_2)... ⇒ look for a good strategy based on uniform sampling!

Exponential families: uniform sampling
For Bernoulli bandit models, uniform sampling with the stopping rule
τ = inf{ t ∈ N : |µ̂_1(t) − µ̂_2(t)| > sqrt( (2/t) log(t/δ) ) }
is δ-PAC but not optimal: E_ν[τ]/log(1/δ) is of order 2/(µ_1 − µ_2)² > 1/I*(µ_1, µ_2).
SGLRT algorithm (Sequential Generalized Likelihood Ratio Test)
Let α > 0. There exists C = C_α such that the algorithm using a uniform sampling strategy and the stopping rule
τ = inf{ t ∈ N : t I*(µ̂_1(t), µ̂_2(t)) > β(t, δ) }
with β(t, δ) = log(C t^{1+α}/δ) is δ-PAC and satisfies
limsup_{δ→0} E_ν[τ]/log(1/δ) ≤ (1 + α)/I*(µ_1, µ_2), hence κ_C(ν) ≤ (1 + α)/I*(µ_1, µ_2).
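A sketch of this uniform-sampling SGLRT test for two Bernoulli arms, reusing kl_bernoulli. The arms are again passed as sampling functions, and C is a placeholder value: the slides only state that a suitable constant C = C_α exists.

```python
import math

def I_star(mu1, mu2):
    """I*(mu1, mu2) = [ d(mu1, m) + d(mu2, m) ] / 2 with m = (mu1 + mu2) / 2 (Bernoulli)."""
    m = (mu1 + mu2) / 2
    return (kl_bernoulli(mu1, m) + kl_bernoulli(mu2, m)) / 2

def sglrt_ab_test(sample1, sample2, delta, alpha=0.5, C=10.0, t_max=10**6):
    """Uniform sampling + SGLRT stopping: stop when t * I*(mu1_hat, mu2_hat) > beta(t, delta)."""
    sums, counts = [0.0, 0.0], [0, 0]
    for t in range(1, t_max + 1):
        arm = t % 2                      # alternate between the two arms
        x = sample1() if arm == 0 else sample2()
        sums[arm] += x
        counts[arm] += 1
        if min(counts) == 0:
            continue
        mu1_hat, mu2_hat = sums[0] / counts[0], sums[1] / counts[1]
        beta = math.log(C * t ** (1 + alpha) / delta)
        if t * I_star(mu1_hat, mu2_hat) > beta:
            return (1 if mu1_hat > mu2_hat else 2), t
    return None, t_max
```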

The complexity of A/B Testing
For Gaussian bandit models with known variances σ_1² and σ_2², if ν = (N(µ_1, σ_1²), N(µ_2, σ_2²)),
κ_C(ν) = 2(σ_1 + σ_2)²/(µ_1 − µ_2)²
and the optimal strategy draws the arms proportionally to their standard deviations.
For exponential family bandit models, if ν = (ν^{µ_1}, ν^{µ_2}),
1/d*(µ_1, µ_2) ≤ κ_C(ν) ≤ 1/I*(µ_1, µ_2)
and uniform sampling is close to optimal.

Outline
1. Regret minimization
2. Lower bound on the sample complexity
3. The complexity of A/B Testing
4. Algorithms for the general case

Racing (or elimination) algorithms
S = {1, ..., K}   (set of remaining arms)
r = 0             (current round)
while |S| > 1:
    r = r + 1
    draw each a ∈ S and compute µ̂_{a,r}, the empirical mean of the r samples observed so far
    compute the empirical best and empirical worst arms:
        b_r = argmax_{a ∈ S} µ̂_{a,r}
        w_r = argmin_{a ∈ S} µ̂_{a,r}
    if EliminationRule(r, b_r, w_r), eliminate w_r: S = S \ {w_r}
end
Output: â, the single element left in S.

Elimination rules
In the literature:
- Successive Elimination for Bernoulli bandits [Even-Dar et al. 06]:
EliminationRule(r, a, b) = ( µ̂_{a,r} − µ̂_{b,r} > 2 sqrt( log(cKr²/δ) / r ) )
- KL-Racing for exponential family bandits [K. and Kalyanakrishnan 13]:
EliminationRule(r, a, b) = ( l_{a,r} > u_{b,r} ), with
l_{a,r} = min{ x : r d(µ̂_{a,r}, x) ≤ β(r, δ) }
u_{b,r} = max{ x : r d(µ̂_{b,r}, x) ≤ β(r, δ) }

The Racing-SGLRT algorithm
EliminationRule(r, a, b) = ( r d(µ̂_{a,r}, (µ̂_{a,r} + µ̂_{b,r})/2) + r d(µ̂_{b,r}, (µ̂_{a,r} + µ̂_{b,r})/2) > β(r, δ) )
                         = ( 2r I*(µ̂_{a,r}, µ̂_{b,r}) > β(r, δ) )
Analysis of Racing-SGLRT
Let α > 0. For an exploration rate of the form β(r, δ) = log(C r^{1+α}/δ), Racing-SGLRT is δ-PAC and satisfies
limsup_{δ→0} E_ν[τ]/log(1/δ) ≤ (1 + α) Σ_{a=2}^K 1/I*(µ_a, µ_1).
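A compact Python sketch of the racing loop with this elimination rule, reusing I_star from the A/B-testing snippet. Here samplers is an assumed interface (a list of K callables, one per arm), and C and alpha parametrize the exploration rate as above.

```python
import math

def racing_sglrt(samplers, delta, alpha=0.5, C=10.0, r_max=10**6):
    """Racing with the SGLRT elimination rule: drop the empirical worst arm w_r
    as soon as 2 * r * I*(mu_hat_{b_r}, mu_hat_{w_r}) > beta(r, delta)."""
    K = len(samplers)
    remaining = list(range(K))
    sums = [0.0] * K
    r = 0
    while len(remaining) > 1 and r < r_max:
        r += 1
        for a in remaining:                     # draw every remaining arm once per round
            sums[a] += samplers[a]()
        means = {a: sums[a] / r for a in remaining}
        b = max(remaining, key=means.get)       # empirical best arm
        w = min(remaining, key=means.get)       # empirical worst arm
        beta = math.log(C * r ** (1 + alpha) / delta)
        if 2 * r * I_star(means[b], means[w]) > beta:
            remaining.remove(w)
    return remaining[0]                         # index of the recommended arm
```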

Conclusion
We presented:
- a simple methodology to derive lower bounds on the sample complexity of a δ-PAC strategy
- a characterization of the complexity of best arm identification in two-armed bandit models, involving alternative information-theoretic quantities (e.g. Chernoff information)
To be continued...
- A/B Testing: for which classes of distributions is uniform sampling a good idea?
- the complexity of best arm identification is still to be understood in the general case:
1/d*(µ_1, µ_2) + Σ_{a=2}^K 1/d(µ_a, µ_1) ≤ κ_C(ν) ≤ Σ_{a=2}^K 1/I*(µ_a, µ_1)

References
O. Cappé, A. Garivier, O-A. Maillard, R. Munos, and G. Stoltz. Kullback-Leibler upper confidence bounds for optimal sequential allocation. Annals of Statistics, 2013.
E. Even-Dar, S. Mannor, Y. Mansour. Action Elimination and Stopping Conditions for the Multi-Armed Bandit and Reinforcement Learning Problems. JMLR, 2006.
E. Kaufmann, S. Kalyanakrishnan. Information Complexity in Bandit Subset Selection. In Proceedings of the 26th Conference On Learning Theory (COLT), 2013.
E. Kaufmann, O. Cappé, A. Garivier. On the Complexity of A/B Testing. In Proceedings of the 27th Conference On Learning Theory (COLT), 2014.
E. Kaufmann, O. Cappé, A. Garivier. On the Complexity of Best Arm Identification in Multi-Armed Bandit Models. JMLR, 2015.
T.L. Lai and H. Robbins. Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 1985.