Bandit models: a tutorial


Gdt COS, December 3rd, 2015

Multi-Armed Bandit model: general setting

K arms: for $a \in \{1,\dots,K\}$, $(X_{a,t})_{t \in \mathbb{N}}$ is a stochastic process (unknown distributions).

Bandit game: at each round $t$, an agent
- chooses an arm $A_t \in \{1,\dots,K\}$
- receives a reward $X_t = X_{A_t,t}$

Goal: build a sequential strategy $A_t = F_t(A_1, X_1, \dots, A_{t-1}, X_{t-1})$ maximizing
$$\mathbb{E}\Big[\sum_{t=1}^{\infty} \alpha_t X_t\Big],$$
where $(\alpha_t)_{t \in \mathbb{N}}$ is a discount sequence. [Berry and Fristedt, 1985]

Multi-Armed Bandit model: the i.i.d. case

K independent arms: for $a \in \{1,\dots,K\}$, $(X_{a,t})_{t \in \mathbb{N}}$ is i.i.d. with distribution $\nu_a$ (unknown distributions).

Bandit game: at each round $t$, an agent
- chooses an arm $A_t \in \{1,\dots,K\}$
- receives a reward $X_t \sim \nu_{A_t}$

Goal: build a sequential strategy $A_t = F_t(A_1, X_1, \dots, A_{t-1}, X_{t-1})$ maximizing
- Finite-horizon MAB: $\mathbb{E}\big[\sum_{t=1}^{T} X_t\big]$
- Discounted MAB: $\mathbb{E}\big[\sum_{t=1}^{\infty} \alpha^{t-1} X_t\big]$

Why MABs?

[Figure: five slot machines with unknown reward distributions $\nu_1, \dots, \nu_5$.]

Goal: maximize one's gains in a casino? (HOPELESS)

Why MABs? Real motivations

Clinical trials: arms $\mathcal{B}(\mu_1), \dots, \mathcal{B}(\mu_5)$
- choose a treatment $A_t$ for patient $t$
- observe a response $X_t \in \{0,1\}$ with $\mathbb{P}(X_t = 1) = \mu_{A_t}$
Goal: maximize the number of patients healed.

Recommendation tasks: arms $\nu_1, \dots, \nu_5$
- recommend a movie $A_t$ for visitor $t$
- observe a rating $X_t \sim \nu_{A_t}$ (e.g. $X_t \in \{1,\dots,5\}$)
Goal: maximize the sum of ratings.

Bernoulli bandit models

K independent arms: for $a \in \{1,\dots,K\}$, $(X_{a,t})_{t \in \mathbb{N}}$ is i.i.d. $\mathcal{B}(\mu_a)$.

Bandit game: at each round $t$, an agent
- chooses an arm $A_t \in \{1,\dots,K\}$
- receives a reward $X_t \sim \mathcal{B}(\mu_{A_t}) \in \{0,1\}$

Goal: maximize
- Finite-horizon MAB: $\mathbb{E}\big[\sum_{t=1}^{T} X_t\big]$
- Discounted MAB: $\mathbb{E}\big[\sum_{t=1}^{\infty} \alpha^{t-1} X_t\big]$

Two points of view:
- Frequentist model: $\mu_1, \dots, \mu_K$ are unknown parameters; for arm $a$, $(X_{a,t})_t$ is i.i.d. $\mathcal{B}(\mu_a)$.
- Bayesian model: $\mu_1, \dots, \mu_K$ are drawn from a prior distribution, $\mu_a \sim \pi_a$; for arm $a$, $(X_{a,t})_t \mid \mu$ is i.i.d. $\mathcal{B}(\mu_a)$.
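To make the game concrete, here is a minimal simulation of a Bernoulli bandit round loop; the arm means and the uniformly-random placeholder strategy are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.1, 0.3, 0.5, 0.4, 0.2])   # unknown Bernoulli means (example values)
K, T = len(mu), 1000

def random_strategy(history):
    """Placeholder strategy: ignores the history and picks an arm uniformly at random."""
    return int(rng.integers(K))

history, total = [], 0
for t in range(T):
    a = random_strategy(history)       # A_t = F_t(A_1, X_1, ..., A_{t-1}, X_{t-1})
    x = int(rng.binomial(1, mu[a]))    # X_t ~ B(mu_{A_t})
    history.append((a, x))
    total += x

print("total reward:", total, "- oracle expected reward:", mu.max() * T)
```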

Outline
1. Bayesian bandits: a planning problem
2. Frequentist bandits: asymptotically optimal algorithms
3. Non-stochastic bandits: minimax algorithms

A Markov Decision Process

Bandit model $(\mathcal{B}(\mu_1), \dots, \mathcal{B}(\mu_K))$ with i.i.d. prior distribution $\mu_a \sim \mathcal{U}([0,1])$.

Posterior distribution: $\pi_a^t := \mathcal{L}(\mu_a \mid X_1, \dots, X_t)$, and
$$\pi_a^t = \mathrm{Beta}\big(\underbrace{S_a(t)}_{\#\text{ones}} + 1,\ \underbrace{N_a(t) - S_a(t)}_{\#\text{zeros}} + 1\big),$$
where $S_a(t)$ is the sum of the rewards gathered from arm $a$ up to time $t$ and $N_a(t)$ is the number of draws of arm $a$ up to time $t$.

The state $\Pi^t = (\pi_a^t)_{a=1}^K$ evolves in an MDP.
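As a small illustration of the posterior update above (uniform prior, Beta posterior), here is a sketch; the observed counts are example values.

```python
from scipy.stats import beta

def posterior(successes, pulls):
    """Posterior on mu_a after N_a(t) = pulls draws with S_a(t) = successes ones,
    starting from the uniform prior U([0, 1]) = Beta(1, 1)."""
    return beta(successes + 1, pulls - successes + 1)

pi_a = posterior(successes=3, pulls=10)      # e.g. 3 ones out of 10 draws of arm a
print(pi_a.mean(), pi_a.interval(0.95))      # posterior mean and 95% credible interval
```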

A Markov Decision Process

[Figure: example of a transition of the posterior state when $A_t = 2$ and $X_t = 1$.]

Solving a planning problem: there exists an exact solution to
- the finite-horizon MAB: $\arg\max_{(A_t)} \mathbb{E}_{\mu \sim \pi}\big[\sum_{t=1}^{T} X_t\big]$
- the discounted MAB: $\arg\max_{(A_t)} \mathbb{E}_{\mu \sim \pi}\big[\sum_{t=1}^{\infty} \alpha^{t-1} X_t\big]$

Optimal policy = solution to the dynamic programming equations.

Problem: the state space is too large!
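To illustrate the dynamic programming solution (and why the state space blows up), here is a sketch that computes the Bayes-optimal expected reward for two Bernoulli arms with uniform priors and a small horizon, by backward induction over the posterior states; the state encoding and the horizon are illustrative choices, not part of the original slides.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def value(state, r):
    """Optimal expected future reward with r rounds left.
    state = ((s1, n1), (s2, n2)): successes and pulls of each arm,
    so the posterior of arm a is Beta(s_a + 1, n_a - s_a + 1)."""
    if r == 0:
        return 0.0
    best = 0.0
    for a, (s, n) in enumerate(state):
        p = (s + 1) / (n + 2)                        # posterior mean of arm a
        up = list(state); up[a] = (s + 1, n + 1)     # transition if X_t = 1
        down = list(state); down[a] = (s, n + 1)     # transition if X_t = 0
        q = p * (1 + value(tuple(up), r - 1)) + (1 - p) * value(tuple(down), r - 1)
        best = max(best, q)
    return best

print(value(((0, 0), (0, 0)), r=10))   # Bayes-optimal expected reward for horizon T = 10
```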

A reduction of the dimension

[Gittins 79]: the solution of the discounted MAB reduces to an index policy:
$$A_{t+1} = \operatorname*{argmax}_{a=1,\dots,K} G_\alpha(\pi_a^t).$$

The Gittins indices:
$$G_\alpha(p) = \sup_{\text{stopping times } \tau > 0} \frac{\mathbb{E}_{\mu \sim p,\ Y_t \overset{\text{i.i.d.}}{\sim} \mathcal{B}(\mu)}\big[\sum_{t=1}^{\tau} \alpha^{t-1} Y_t\big]}{\mathbb{E}_{\mu \sim p,\ Y_t \overset{\text{i.i.d.}}{\sim} \mathcal{B}(\mu)}\big[\sum_{t=1}^{\tau} \alpha^{t-1}\big]},$$
the maximal expected instantaneous reward (per unit of discounted time) when committing to an arm with $\mu \sim p$, when rewards are discounted by $\alpha$.

The Gittins indices

An alternative formulation:
$$G_\alpha(p) = \inf\{\lambda \in \mathbb{R} : V_\alpha^*(p, \lambda) = 0\},$$
with
$$V_\alpha^*(p, \lambda) = \sup_{\text{stopping times } \tau > 0} \mathbb{E}_{\mu \sim p,\ Y_t \overset{\text{i.i.d.}}{\sim} \mathcal{B}(\mu)}\Big[\sum_{t=1}^{\tau} \alpha^{t-1}(Y_t - \lambda)\Big]:$$
the price worth paying for committing to an arm with $\mu \sim p$ when rewards are discounted by $\alpha$.

Gittins indices for finite horizon

The Finite-Horizon Gittins indices depend on the remaining time to play $r$:
$$G(p, r) = \inf\{\lambda \in \mathbb{R} : V_r^*(p, \lambda) = 0\},$$
with
$$V_r^*(p, \lambda) = \sup_{\text{stopping times } 0 < \tau \le r} \mathbb{E}_{\mu \sim p,\ Y_t \overset{\text{i.i.d.}}{\sim} \mathcal{B}(\mu)}\Big[\sum_{t=1}^{\tau} (Y_t - \lambda)\Big]:$$
the price worth paying for playing an arm with $\mu \sim p$ for at most $r$ rounds.

The Finite-Horizon Gittins algorithm
$$A_{t+1} = \operatorname*{argmax}_{a=1,\dots,K} G(\pi_a^t, T - t)$$
does NOT coincide with the optimal solution [Berry and Fristedt 85]... but is conjectured to be a good approximation!
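A possible numerical sketch of the finite-horizon Gittins index defined above, for a Beta posterior: the stopping value $V_r^*(p,\lambda)$ is computed by backward induction and the index is found by bisection over $\lambda$. The recursion and the function names are illustrative assumptions, not code from the talk.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def stop_value(a, b, k, lam):
    """sup over stopping times 0 <= tau <= k of E[sum_{t<=tau} (Y_t - lam)],
    when mu ~ Beta(a, b) and Y_t | mu ~ B(mu); stopping immediately is allowed."""
    if k == 0:
        return 0.0
    p = a / (a + b)   # posterior mean
    keep_playing = p - lam + p * stop_value(a + 1, b, k - 1, lam) \
                           + (1 - p) * stop_value(a, b + 1, k - 1, lam)
    return max(0.0, keep_playing)

def fh_gittins(a, b, r, tol=1e-6):
    """Finite-horizon Gittins index G(Beta(a, b), r), found by bisection over lambda."""
    lo, hi = 0.0, 1.0
    p = a / (a + b)
    while hi - lo > tol:
        lam = (lo + hi) / 2
        # value when the arm must be played at least once (tau >= 1)
        v = p - lam + p * stop_value(a + 1, b, r - 1, lam) \
                    + (1 - p) * stop_value(a, b + 1, r - 1, lam)
        lo, hi = (lam, hi) if v > 0 else (lo, lam)
    return (lo + hi) / 2

print(fh_gittins(1, 1, r=20))   # index of a fresh arm (uniform prior) with 20 rounds left
```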

Outline
1. Bayesian bandits: a planning problem
2. Frequentist bandits: asymptotically optimal algorithms
3. Non-stochastic bandits: minimax algorithms

Regret minimization

$\mu = (\mu_1, \dots, \mu_K)$: unknown parameters, $\mu^* = \max_a \mu_a$.

The regret of a strategy $\mathcal{A} = (A_t)$ is defined as
$$R_\mu(\mathcal{A}, T) = \mathbb{E}_\mu\Big[\mu^* T - \sum_{t=1}^{T} X_t\Big]$$
and can be rewritten
$$R_\mu(\mathcal{A}, T) = \sum_{a=1}^{K} (\mu^* - \mu_a)\,\mathbb{E}_\mu[N_a(T)],$$
where $N_a(t)$ is the number of draws of arm $a$ up to time $t$.

Maximizing rewards ⟺ minimizing regret.

Goal: design strategies that have small regret for all $\mu$.
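Since the regret decomposes over the expected numbers of draws, it can be estimated from simulated runs. A minimal sketch, where `run_strategy` is a hypothetical function returning the vector of draw counts $N_a(T)$ of one run.

```python
import numpy as np

def regret_from_counts(mu, counts):
    """One-run version of R_mu(A, T) = sum_a (mu* - mu_a) * N_a(T)."""
    mu = np.asarray(mu, dtype=float)
    return float(np.sum((mu.max() - mu) * np.asarray(counts, dtype=float)))

# Averaging over many simulated runs estimates the regret of a strategy.
# `run_strategy(mu, T)` is a hypothetical function returning the draw counts N_a(T):
# regrets = [regret_from_counts(mu, run_strategy(mu, T=1000)) for _ in range(100)]
# print(np.mean(regrets))
print(regret_from_counts([0.5, 0.4, 0.3], counts=[800, 150, 50]))   # 0.1*150 + 0.2*50 = 25.0
```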

Optimal algorithms for regret minimization

All the arms should be drawn infinitely often!

[Lai and Robbins, 1985]: a uniformly efficient strategy (i.e. $\forall \mu$, $\forall \alpha \in\, ]0,1[$, $R_\mu(\mathcal{A}, T) = o(T^\alpha)$) satisfies
$$\mu_a < \mu^* \;\Rightarrow\; \liminf_{T \to \infty} \frac{\mathbb{E}_\mu[N_a(T)]}{\log T} \ge \frac{1}{d(\mu_a, \mu^*)},$$
where
$$d(\mu, \mu') = \mathrm{KL}(\mathcal{B}(\mu), \mathcal{B}(\mu')) = \mu \log\frac{\mu}{\mu'} + (1 - \mu)\log\frac{1-\mu}{1-\mu'}.$$

Definition. A bandit algorithm is asymptotically optimal if, for every $\mu$,
$$\mu_a < \mu^* \;\Rightarrow\; \limsup_{T \to \infty} \frac{\mathbb{E}_\mu[N_a(T)]}{\log T} \le \frac{1}{d(\mu_a, \mu^*)}.$$
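A small helper computing the Bernoulli Kullback-Leibler divergence $d(\mu, \mu')$ and the resulting Lai-Robbins constant for an example pair of means (the clipping constant `eps` is only a numerical safeguard).

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """d(p, q) = KL(B(p), B(q)) = p log(p/q) + (1-p) log((1-p)/(1-q))."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

# Lai & Robbins: a uniformly efficient strategy draws a sub-optimal arm a
# at least ~ log(T) / d(mu_a, mu*) times.  Example with mu_a = 0.4, mu* = 0.5:
print(1 / kl_bernoulli(0.4, 0.5))   # ~ 49.7 draws per unit of log(T)
```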

First algorithms

Idea 1: draw each arm $T/K$ times. → EXPLORATION

Idea 2: always choose the empirical best arm, $A_{t+1} = \operatorname*{argmax}_a \hat\mu_a(t)$. → EXPLOITATION

Idea 3: draw the arms uniformly during $T/2$ rounds, then draw the empirical best arm until the end. → EXPLORATION followed by EXPLOITATION

Linear regret...
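A sketch of Idea 3 (explore uniformly for $T/2$ rounds, then commit to the empirical best arm) on simulated Bernoulli arms; the means are example values.

```python
import numpy as np

def explore_then_commit(mu, T, seed=0):
    """Idea 3: draw the arms uniformly during T/2 rounds, then play the empirical best arm."""
    rng = np.random.default_rng(seed)
    K = len(mu)
    sums, counts = np.zeros(K), np.zeros(K)
    total = 0.0
    for t in range(T):
        if t < T // 2:
            a = t % K                                            # exploration phase
        else:
            a = int(np.argmax(sums / np.maximum(counts, 1)))     # exploitation phase
        x = rng.binomial(1, mu[a])
        sums[a] += x; counts[a] += 1; total += x
    return total, counts

reward, counts = explore_then_commit([0.45, 0.5], T=10000)
print(reward, counts)   # if exploration misidentifies the best arm, the regret grows linearly
```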

Optimistic algorithms

For each arm $a$, build a confidence interval on $\mu_a$:
$$\mu_a \le \mathrm{UCB}_a(t) \quad \text{w.h.p.}$$

[Figure: confidence intervals on the arms at round $t$.]

Optimism principle: act as if the best possible model were the true model:
$$A_{t+1} = \operatorname*{argmax}_a \mathrm{UCB}_a(t).$$

A UCB algorithm in action!

[Figure: snapshot of the confidence intervals and the number of draws of each arm during a run.]

The UCB1 algorithm

UCB1 [Auer et al. 02] is based on the index
$$\mathrm{UCB}_a(t) = \hat\mu_a(t) + \sqrt{\frac{\alpha \log(t)}{2 N_a(t)}}.$$

Hoeffding's inequality (with $\hat\mu_{a,s}$ the empirical mean of the first $s$ rewards from arm $a$):
$$\mathbb{P}\left(\hat\mu_{a,s} + \sqrt{\frac{\alpha \log(t)}{2s}} \le \mu_a\right) \le \exp\left(-2s\,\frac{\alpha \log(t)}{2s}\right) = \frac{1}{t^\alpha}.$$

Union bound:
$$\mathbb{P}(\mathrm{UCB}_a(t) \le \mu_a) \le \mathbb{P}\left(\exists s \le t : \hat\mu_{a,s} + \sqrt{\frac{\alpha \log(t)}{2s}} \le \mu_a\right) \le \sum_{s=1}^{t} \frac{1}{t^\alpha} = \frac{1}{t^{\alpha-1}}.$$
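A minimal implementation of UCB1 with the index above, pulling each arm once for initialisation; the arm means and the value of $\alpha$ are example choices.

```python
import numpy as np

def ucb1(mu, T, alpha=2.01, seed=0):
    """Play UCB_a(t) = mu_hat_a(t) + sqrt(alpha * log(t) / (2 * N_a(t)))."""
    rng = np.random.default_rng(seed)
    K = len(mu)
    sums, counts = np.zeros(K), np.zeros(K)
    for t in range(T):
        if t < K:
            a = t                                    # initialisation: pull each arm once
        else:
            means = sums / counts
            bonus = np.sqrt(alpha * np.log(t) / (2 * counts))
            a = int(np.argmax(means + bonus))
        x = rng.binomial(1, mu[a])
        sums[a] += x; counts[a] += 1
    return counts

print(ucb1([0.45, 0.5, 0.3], T=10000))   # the best arm should dominate the counts
```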

The UCB1 algorithm

Theorem. For every $\alpha > 2$ and every sub-optimal arm $a$, there exists a constant $C_\alpha > 0$ such that
$$\mathbb{E}_\mu[N_a(T)] \le \frac{2\alpha}{(\mu^* - \mu_a)^2}\log(T) + C_\alpha.$$

Remark: $\dfrac{2\alpha}{(\mu^* - \mu_a)^2} > 4\alpha \cdot \dfrac{1}{d(\mu_a, \mu^*)} > \dfrac{1}{d(\mu_a, \mu^*)}$, so UCB1 is not asymptotically optimal.

Proof: 1/3

Assume $\mu^* = \mu_1$ and $\mu_2 < \mu_1$.
$$N_2(T) = \sum_{t=0}^{T-1} \mathbb{1}_{(A_{t+1} = 2)} = \sum_{t=0}^{T-1} \mathbb{1}_{(A_{t+1} = 2,\ \mathrm{UCB}_1(t) \le \mu_1)} + \sum_{t=0}^{T-1} \mathbb{1}_{(A_{t+1} = 2,\ \mathrm{UCB}_1(t) > \mu_1)}$$
$$\le \sum_{t=0}^{T-1} \mathbb{1}_{(\mathrm{UCB}_1(t) \le \mu_1)} + \sum_{t=0}^{T-1} \mathbb{1}_{(A_{t+1} = 2,\ \mathrm{UCB}_2(t) > \mu_1)},$$
since $A_{t+1} = 2$ implies $\mathrm{UCB}_2(t) \ge \mathrm{UCB}_1(t)$. Hence
$$\mathbb{E}[N_2(T)] \le \underbrace{\sum_{t=0}^{T-1} \mathbb{P}(\mathrm{UCB}_1(t) \le \mu_1)}_{A} + \underbrace{\sum_{t=0}^{T-1} \mathbb{P}(A_{t+1} = 2,\ \mathrm{UCB}_2(t) > \mu_1)}_{B}.$$

Proof: 2/3

Term A: if $\alpha > 2$,
$$\sum_{t=0}^{T-1} \mathbb{P}(\mathrm{UCB}_1(t) \le \mu_1) \le 1 + \sum_{t=1}^{T-1} \frac{1}{t^{\alpha-1}} \le 1 + \zeta(\alpha - 1) := C_\alpha / 2.$$

Proof: 3/3

Term B:
$$(B) = \sum_{t=0}^{T-1} \mathbb{P}(A_{t+1} = 2,\ \mathrm{UCB}_2(t) > \mu_1) \le \sum_{t=0}^{T-1} \mathbb{P}(A_{t+1} = 2,\ \mathrm{UCB}_2(t) > \mu_1,\ \mathrm{LCB}_2(t) \le \mu_2) + C_\alpha/2,$$
with
$$\mathrm{LCB}_2(t) = \hat\mu_2(t) - \sqrt{\frac{\alpha \log t}{2 N_2(t)}}$$
(the event $\mathrm{LCB}_2(t) > \mu_2$ contributes at most $C_\alpha/2$, by the same argument as for Term A). Moreover,
$$\mathrm{LCB}_2(t) < \mu_2 < \mu_1 < \mathrm{UCB}_2(t) \;\Rightarrow\; \mu_1 - \mu_2 \le 2\sqrt{\frac{\alpha \log(t)}{2 N_2(t)}} \;\Rightarrow\; N_2(t) \le \frac{2\alpha}{(\mu_1 - \mu_2)^2}\log(T).$$

Proof: 3/3 (continued)

Term B:
$$(B) \le \sum_{t=0}^{T-1} \mathbb{P}(A_{t+1} = 2,\ \mathrm{UCB}_2(t) > \mu_1,\ \mathrm{LCB}_2(t) \le \mu_2) + C_\alpha/2$$
$$\le \sum_{t=0}^{T-1} \mathbb{P}\left(A_{t+1} = 2,\ N_2(t) \le \frac{2\alpha}{(\mu_1 - \mu_2)^2}\log(T)\right) + C_\alpha/2 \le \frac{2\alpha}{(\mu_1 - \mu_2)^2}\log(T) + C_\alpha/2.$$

Conclusion:
$$\mathbb{E}[N_2(T)] \le \frac{2\alpha}{(\mu_1 - \mu_2)^2}\log(T) + C_\alpha.$$

The KL-UCB algorithm

A UCB-type algorithm, $A_{t+1} = \operatorname*{argmax}_a u_a(t)$, associated with the right upper confidence bounds:
$$u_a(t) = \max\left\{q \ge \hat\mu_a(t) : d(\hat\mu_a(t), q) \le \frac{\log(t)}{N_a(t)}\right\},$$
where $\hat\mu_a(t)$ is the empirical mean of the rewards from arm $a$ up to time $t$.

[Figure: the divergence $q \mapsto d(\hat\mu_a(t), q)$, with the threshold $\log(t)/N_a(t)$ determining the interval $[\hat\mu_a(t), u_a(t)]$.]

[Cappé et al. 13]: KL-UCB satisfies
$$\mathbb{E}_\mu[N_a(T)] \le \frac{1}{d(\mu_a, \mu^*)}\log T + O\big(\sqrt{\log(T)}\big).$$
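A sketch of the KL-UCB index: since $q \mapsto d(\hat\mu_a(t), q)$ is increasing on $[\hat\mu_a(t), 1]$, the upper confidence bound $u_a(t)$ can be found by bisection. The numerical values are examples.

```python
import math

def kl_bernoulli(p, q, eps=1e-12):
    """Bernoulli KL divergence d(p, q), clipped away from {0, 1} for numerical safety."""
    p, q = min(max(p, eps), 1 - eps), min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, n_pulls, t, tol=1e-6):
    """u_a(t) = max{ q >= mu_hat : d(mu_hat, q) <= log(t) / N_a(t) }, by bisection."""
    level = math.log(t) / n_pulls
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:
        q = (lo + hi) / 2
        if kl_bernoulli(mu_hat, q) <= level:   # q still inside the confidence region
            lo = q
        else:
            hi = q
    return (lo + hi) / 2

print(kl_ucb_index(mu_hat=0.3, n_pulls=50, t=1000))
```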

Bayesian algorithms for regret minimization?

Algorithms based on Bayesian tools can be good at solving the (frequentist) regret minimization problem.

[Figure: posterior distributions and number of draws of each arm for two Bayesian strategies.]

Ideas:
- use the Finite-Horizon Gittins indices
- use posterior quantiles
- use posterior samples

Thompson Sampling

$(\pi_1^t, \dots, \pi_K^t)$: posterior distribution on $(\mu_1, \dots, \mu_K)$ at round $t$.

Algorithm: Thompson Sampling is a randomized Bayesian algorithm:
$$\forall a \in \{1,\dots,K\},\ \theta_a(t) \sim \pi_a^t, \qquad A_{t+1} = \operatorname*{argmax}_a \theta_a(t).$$
Draw each arm according to its posterior probability of being optimal. [Thompson 1933]

[Figure: posterior distributions of two arms, with one posterior sample $\theta_a(t)$ drawn from each.]

Thompson Sampling is asymptotically optimal. [K., Korda, Munos 2012]
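A minimal Thompson Sampling sketch for Bernoulli arms with uniform priors, following the algorithm above; the arm means are example values.

```python
import numpy as np

def thompson_sampling(mu, T, seed=0):
    """theta_a(t) ~ Beta(S_a(t)+1, N_a(t)-S_a(t)+1), then play argmax_a theta_a(t)."""
    rng = np.random.default_rng(seed)
    K = len(mu)
    successes, failures = np.zeros(K), np.zeros(K)
    for _ in range(T):
        theta = rng.beta(successes + 1, failures + 1)   # one posterior sample per arm
        a = int(np.argmax(theta))
        x = rng.binomial(1, mu[a])
        successes[a] += x; failures[a] += 1 - x
    return successes + failures   # N_a(T)

print(thompson_sampling([0.45, 0.5, 0.3], T=10000))
```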

Outline
1. Bayesian bandits: a planning problem
2. Frequentist bandits: asymptotically optimal algorithms
3. Non-stochastic bandits: minimax algorithms

Minimax regret

In stochastic (Bernoulli) bandits, we exhibited algorithms satisfying, for all $\mu$,
$$R_\mu(\mathcal{A}, T) = \underbrace{\left(\sum_{a : \mu_a < \mu^*} \frac{\mu^* - \mu_a}{d(\mu_a, \mu^*)}\right)}_{\text{problem-dependent term, can be large}} \log(T) + o(\log(T)).$$

For those algorithms, one can also prove that, for some constant $C$ and all $\mu$,
$$R_\mu(\mathcal{A}, T) \le \underbrace{C\sqrt{KT\log(T)}}_{\text{problem-independent bound}}.$$

Minimax rate of the regret:
$$\inf_{\mathcal{A}} \sup_{\mu} R_\mu(\mathcal{A}, T) = O\big(\sqrt{KT}\big).$$

Removing the stochastic assumption...

A new bandit game: at round $t$,
- the player chooses arm $A_t$
- simultaneously, an adversary chooses the vector of rewards $(x_{t,1}, \dots, x_{t,K})$
- the player receives the reward $x_t = x_{t,A_t}$

Goal: maximize rewards, or minimize the regret
$$R(\mathcal{A}, T) = \max_a \mathbb{E}\Big[\sum_{t=1}^{T} x_{t,a}\Big] - \mathbb{E}\Big[\sum_{t=1}^{T} x_t\Big].$$

Full information: Exponentially Weighted Forecaster

The full-information game: at round $t$,
- the player chooses arm $A_t$
- simultaneously, an adversary chooses the vector of rewards $(x_{t,1}, \dots, x_{t,K})$
- the player receives the reward $x_t = x_{t,A_t}$ and observes the full reward vector $(x_{t,1}, \dots, x_{t,K})$

The EWF algorithm [Littlestone, Warmuth 1994]: with $\hat p_t$ the probability distribution used at round $t$,
$$\hat p_{t,a} \propto e^{\eta \sum_{s=1}^{t-1} x_{s,a}}, \qquad A_t \sim \hat p_t.$$

Back to the bandit case: the EXP3 strategy

We don't have access to $x_{t,a}$ for all $a$... but the importance-weighted estimate
$$\hat x_{t,a} = \frac{x_{t,a}}{\hat p_{t,a}} \mathbb{1}_{(A_t = a)}$$
satisfies $\mathbb{E}[\hat x_{t,a}] = x_{t,a}$.

The EXP3 strategy [Auer et al. 2003]: with $\hat p_t$ the probability distribution used at round $t$,
$$\hat p_{t,a} \propto e^{\eta \sum_{s=1}^{t-1} \hat x_{s,a}}, \qquad A_t \sim \hat p_t.$$

Theoretical results

The EXP3 strategy [Auer et al. 2003]: with $\hat p_t$ the probability distribution used at round $t$,
$$\hat p_{t,a} \propto e^{\eta \sum_{s=1}^{t-1} \hat x_{s,a}}, \qquad A_t \sim \hat p_t.$$

[Bubeck and Cesa-Bianchi 12]: EXP3 with $\eta = \sqrt{\frac{\log(K)}{KT}}$ satisfies
$$R(\mathrm{EXP3}, T) \le \sqrt{2\, KT \log K}.$$

Remarks:
- almost the same guarantees hold for the anytime choice $\eta_t = \sqrt{\frac{\log(K)}{Kt}}$
- extra exploration is needed to obtain high-probability results
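A sketch of EXP3 with the learning rate from the bound above, run against an arbitrary toy reward sequence standing in for the adversary (the reward matrix is made-up example data).

```python
import numpy as np

def exp3(reward_matrix, seed=0):
    """EXP3: exponential weights on importance-weighted estimates
    x_hat_{t,a} = (x_{t,a} / p_hat_{t,a}) * 1(A_t = a)."""
    rng = np.random.default_rng(seed)
    T, K = reward_matrix.shape
    eta = np.sqrt(np.log(K) / (K * T))
    x_hat_sum = np.zeros(K)
    total = 0.0
    for t in range(T):
        w = np.exp(eta * (x_hat_sum - x_hat_sum.max()))   # shift by max for numerical stability
        p = w / w.sum()
        a = int(rng.choice(K, p=p))
        x = reward_matrix[t, a]
        x_hat_sum[a] += x / p[a]                          # unbiased estimate of x_{t,a}
        total += x
    return total

# Toy "adversary": an arbitrary reward sequence in [0, 1] (example data only).
rewards = np.random.default_rng(1).random((5000, 3)) * np.array([0.4, 0.6, 0.5])
print(exp3(rewards), rewards.sum(axis=0).max())   # EXP3 reward vs. best single arm in hindsight
```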

Conclusions

Under different assumptions, different types of strategies achieve an exploration-exploitation tradeoff in bandit models:
- Index policies: Gittins indices, UCB-type algorithms
- Randomized algorithms: Thompson Sampling, exponential weights

More complex bandit models not covered today: restless bandits, contextual bandits, combinatorial bandits...

Books and surveys

Bayesian bandits:
- Bandit Problems: Sequential Allocation of Experiments. Berry and Fristedt. Chapman and Hall, 1985.
- Multi-armed Bandit Allocation Indices. Gittins, Glazebrook, Weber. Wiley, 2011.

Stochastic and non-stochastic bandits:
- Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Bubeck and Cesa-Bianchi. Foundations and Trends in Machine Learning, 2012.
- Prediction, Learning and Games. Cesa-Bianchi and Lugosi. Cambridge University Press, 2006.