Stratégies bayésiennes et fréquentistes dans un modèle de bandit

Similar documents
Revisiting the Exploration-Exploitation Tradeoff in Bandit Models

Multi-armed bandit models: a tutorial

Online Learning and Sequential Decision Making

Bandit models: a tutorial

The information complexity of sequential resource allocation

On Bayesian bandit algorithms

Two optimization problems in a stochastic bandit model

The information complexity of best-arm identification

Bayesian and Frequentist Methods in Bandit Models

On the Complexity of Best Arm Identification in Multi-Armed Bandit Models

On the Complexity of Best Arm Identification with Fixed Confidence

On the Complexity of Best Arm Identification with Fixed Confidence

arxiv: v2 [stat.ml] 19 Jul 2012

Bandits : optimality in exponential families

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I. Sébastien Bubeck Theory Group

arxiv: v2 [stat.ml] 14 Nov 2016

Two generic principles in modern bandits: the optimistic principle and Thompson sampling

KULLBACK-LEIBLER UPPER CONFIDENCE BOUNDS FOR OPTIMAL SEQUENTIAL ALLOCATION

Lecture 5: Regret Bounds for Thompson Sampling

Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett

arxiv: v3 [cs.lg] 7 Nov 2017

Advanced Machine Learning

On the Complexity of A/B Testing

arxiv: v3 [stat.ml] 6 Nov 2017

Bandits and Exploration: How do we (optimally) gather information? Sham M. Kakade

Bandit Algorithms for Pure Exploration: Best Arm Identification and Game Tree Search. Wouter M. Koolen

Reinforcement Learning

Analysis of Thompson Sampling for the multi-armed bandit problem

Stat 260/CS Learning in Sequential Decision Problems. Peter Bartlett

arxiv: v4 [cs.lg] 22 Jul 2014

Lecture 4: Lower Bounds (ending); Thompson Sampling

Bandit Algorithms. Zhifeng Wang ... Department of Statistics Florida State University

The Multi-Arm Bandit Framework

Dynamic resource allocation: Bandit problems and extensions

Informational Confidence Bounds for Self-Normalized Averages and Applications

Grundlagen der Künstlichen Intelligenz

Monte-Carlo Tree Search by. MCTS by Best Arm Identification

Optimality of Thompson Sampling for Gaussian Bandits Depends on Priors

Stochastic bandits: Explore-First and UCB

Mechanisms with Learning for Stochastic Multi-armed Bandit Problems

Introduction to Bandit Algorithms. Introduction to Bandit Algorithms

Sparse Linear Contextual Bandits via Relevance Vector Machines

Kullback-Leibler Upper Confidence Bounds for Optimal Sequential Allocation

The Multi-Armed Bandit Problem

Optimistic Bayesian Sampling in Contextual-Bandit Problems

The geometry of Gaussian processes and Bayesian optimization. Contal CMLA, ENS Cachan

arxiv: v4 [math.pr] 26 Aug 2013

Reinforcement Learning

Learning to play K-armed bandit problems

Ordinal optimization - Empirical large deviations rate estimators, and multi-armed bandit methods

Analysis of Thompson Sampling for the multi-armed bandit problem

Csaba Szepesvári 1. University of Alberta. Machine Learning Summer School, Ile de Re, France, 2008

Racing Thompson: an Efficient Algorithm for Thompson Sampling with Non-conjugate Priors

Anytime optimal algorithms in stochastic multi-armed bandits

Stochastic Contextual Bandits with Known. Reward Functions

Introduction to Bayesian Learning. Machine Learning Fall 2018

An Information-Theoretic Analysis of Thompson Sampling

Reinforcement Learning

Lecture 4 January 23

Multi-Armed Bandits. Credit: David Silver. Google DeepMind. Presenter: Tianlu Wang

Stat 260/CS Learning in Sequential Decision Problems.

Bandit Algorithms. Tor Lattimore & Csaba Szepesvári

Multi-Armed Bandit Formulations for Identification and Control

arxiv: v1 [cs.lg] 15 Oct 2014

On Sequential Decision Problems

arxiv: v2 [math.st] 14 Nov 2016

Bayesian Contextual Multi-armed Bandits

1 MDP Value Iteration Algorithm

Multi-Armed Bandit: Learning in Dynamic Systems with Unknown Models

arxiv: v7 [cs.lg] 7 Jul 2017

Accouncements. You should turn in a PDF and a python file(s) Figure for problem 9 should be in the PDF

arxiv: v1 [cs.ds] 4 Mar 2016

Complex Bandit Problems and Thompson Sampling

Exploration and exploitation of scratch games

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Sébastien Bubeck Theory Group

Multi-armed Bandits in the Presence of Side Observations in Social Networks

Optimistic Gittins Indices

Subsampling, Concentration and Multi-armed bandits

Optimal Best Arm Identification with Fixed Confidence

Reward Maximization Under Uncertainty: Leveraging Side-Observations on Networks

Annealing-Pareto Multi-Objective Multi-Armed Bandit Algorithm

Exploiting Correlation in Finite-Armed Structured Bandits

The knowledge gradient method for multi-armed bandit problems

Online Learning with Gaussian Payoffs and Side Observations

Thompson Sampling for the non-stationary Corrupt Multi-Armed Bandit

An Estimation Based Allocation Rule with Super-linear Regret and Finite Lock-on Time for Time-dependent Multi-armed Bandit Processes

Improved Algorithms for Linear Stochastic Bandits

COS 402 Machine Learning and Artificial Intelligence Fall Lecture 22. Exploration & Exploitation in Reinforcement Learning: MAB, UCB, Exp3

Stochastic Regret Minimization via Thompson Sampling

On Regret-Optimal Learning in Decentralized Multi-player Multi-armed Bandits

THE first formalization of the multi-armed bandit problem

Lecture 8: Information Theory and Statistics

Statistical Inference

Learning to Optimize via Information-Directed Sampling

THE MULTI-ARMED BANDIT PROBLEM: AN EFFICIENT NON-PARAMETRIC SOLUTION

Complexity of stochastic branch and bound methods for belief tree search in Bayesian reinforcement learning

Reinforcement Learning

Corrupt Bandits. Abstract

Multiple Identifications in Multi-Armed Bandits

Review: Probability. BM1: Advanced Natural Language Processing. University of Potsdam. Tatjana Scheffler

Transcription:

Stratégies bayésiennes et fréquentistes dans un modèle de bandit (Bayesian and frequentist strategies in a bandit model). Thesis carried out at Telecom ParisTech, co-supervised by Olivier Cappé, Aurélien Garivier and Rémi Munos. Journées MAS, Grenoble, 30 August 2016.

The multi-armed bandit model
$K$ arms = $K$ probability distributions ($\nu_a$ has mean $\mu_a$): $\nu_1, \nu_2, \nu_3, \nu_4, \nu_5$.
At round $t$, an agent chooses an arm $A_t$ and observes a sample $X_t \sim \nu_{A_t}$, using a sequential sampling strategy $(A_t)$: $A_{t+1} = F_t(A_1, X_1, \ldots, A_t, X_t)$.
Generic goal: learn the best arm, $a^* = \arg\max_a \mu_a$.
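
To make the protocol concrete, here is a minimal Python sketch (not part of the original slides) of the interaction loop for Bernoulli arms; the example means and the function names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.1, 0.3, 0.5, 0.6, 0.45])     # assumed means of the K = 5 arms
pull = lambda a: float(rng.random() < mu[a])  # sample X_t ~ nu_{A_t} = Bernoulli(mu[a])

def run(policy, horizon):
    """Sequential protocol: A_{t+1} = F_t(A_1, X_1, ..., A_t, X_t)."""
    history = []
    for t in range(horizon):
        a = policy(history)      # choose an arm based on past observations
        x = pull(a)              # observe a sample from the chosen arm
        history.append((a, x))
    return history

# Example with a uniformly random policy (pure exploration, no learning)
hist = run(lambda h: rng.integers(len(mu)), horizon=1000)
print("average reward:", np.mean([x for _, x in hist]), "best achievable:", mu.max())
```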

Regret minimization in a bandit model
Samples = rewards; $(A_t)$ is adjusted to maximize the (expected) sum of rewards,
$$\mathbb{E}\left[\sum_{t=1}^{T} X_t\right],$$
or equivalently to minimize the regret:
$$R_T = \mathbb{E}\left[T\mu^* - \sum_{t=1}^{T} X_t\right], \qquad \mu^* = \mu_{a^*} = \max_a \mu_a.$$
Exploration/Exploitation tradeoff.

Modern motivation: recommendation tasks
Arms $\nu_1, \ldots, \nu_5$ = items to recommend. For the $t$-th visitor of a website: recommend a movie $A_t$, observe a rating $X_t \sim \nu_{A_t}$ (e.g. $X_t \in \{1, \ldots, 5\}$).
Goal: maximize the sum of ratings.

Back to the initial motivation: clinical trials
Arms $\mathcal{B}(\mu_1), \ldots, \mathcal{B}(\mu_5)$ = treatments. For the $t$-th patient in a clinical study: choose a treatment $A_t$, observe a response $X_t \in \{0, 1\}$ with $\mathbb{P}(X_t = 1) = \mu_{A_t}$.
Goal: maximize the number of patients healed during the study.
Alternative goal: allocate the treatments so as to identify the best treatment as quickly as possible (no focus on curing patients during the study).

Two bandit problems
Regret minimization: the bandit algorithm is a sampling rule $(A_t)$; input: horizon $T$; objective: minimize $R_T = \mathbb{E}_\mu\left[T\mu^* - \sum_{t=1}^{T} X_t\right]$; Exploration/Exploitation tradeoff.
Best arm identification: the bandit algorithm is a sampling rule $(A_t)$, a stopping rule $\tau$ and a recommendation rule $\hat{a}_\tau$; input: risk parameter $\delta$; objective: ensure $\mathbb{P}(\hat{a}_\tau = a^*) \geq 1 - \delta$ and minimize $\mathbb{E}[\tau]$; pure Exploration.
Goal: find efficient, optimal algorithms for both objectives. In the presentation, we focus on Bernoulli bandit models, $\mu = (\mu_1, \ldots, \mu_K)$.

Outline
1. Asymptotically optimal algorithms for regret minimization
2. A Bayesian approach for regret minimization: Bayes-UCB, Thompson Sampling
3. Optimal algorithms for Best Arm Identification

Optimal algorithms for regret minimization
$\mu = (\mu_1, \ldots, \mu_K)$. $N_a(t)$: number of draws of arm $a$ up to time $t$.
$$R_\mu(\mathcal{A}, T) = \mu^* T - \mathbb{E}_\mu\left[\sum_{t=1}^{T} X_t\right] = \sum_{a=1}^{K} (\mu^* - \mu_a)\,\mathbb{E}_\mu[N_a(T)]$$
Notation: Kullback-Leibler divergence
$$d(\mu, \mu') := \mathrm{KL}\big(\mathcal{B}(\mu), \mathcal{B}(\mu')\big) = \mu \log(\mu/\mu') + (1 - \mu)\log\big((1 - \mu)/(1 - \mu')\big)$$
[Lai and Robbins, 1985]: for uniformly efficient algorithms,
$$\mu_a < \mu^* \implies \liminf_{T \to \infty} \frac{\mathbb{E}_\mu[N_a(T)]}{\log T} \geq \frac{1}{d(\mu_a, \mu^*)}.$$
A bandit algorithm is asymptotically optimal if, for every $\mu$,
$$\mu_a < \mu^* \implies \limsup_{T \to \infty} \frac{\mathbb{E}_\mu[N_a(T)]}{\log T} \leq \frac{1}{d(\mu_a, \mu^*)}.$$
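
As a worked example of these quantities (with illustrative values, not from the slides), the snippet below computes the Bernoulli KL-divergence $d(\mu, \mu')$ and the resulting Lai-Robbins scaling $\log(T)/d(\mu_a, \mu^*)$ for the number of draws of a suboptimal arm.

```python
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    """d(p, q) = KL(B(p), B(q)) between Bernoulli distributions."""
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Assumed example: mu_a = 0.4, mu_star = 0.5. An asymptotically optimal
# algorithm draws the suboptimal arm roughly log(T) / d(mu_a, mu_star) times.
mu_a, mu_star = 0.4, 0.5
for T in [1_000, 100_000, 10_000_000]:
    print(T, np.log(T) / kl_bernoulli(mu_a, mu_star))
```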

Algorithms: naive ideas
Idea 1: Choose each arm $T/K$ times (EXPLORATION).
Idea 2: Always choose the best arm so far (EXPLOITATION), $A_{t+1} = \arg\max_a \hat\mu_a(t)$... linear regret.
Idea 3: First explore the arms uniformly, then commit to the empirical best until the end (EXPLORATION followed by EXPLOITATION)... still sub-optimal.
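
A minimal sketch of Idea 3 (explore-then-commit), assuming Bernoulli rewards; the example means, the horizon and the exploration length are illustrative assumptions, not values from the talk.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.1, 0.3, 0.5, 0.6, 0.45])     # assumed example means
pull = lambda a: float(rng.random() < mu[a])  # Bernoulli rewards
K, T, n_explore = len(mu), 10_000, 100

sums, counts, total = np.zeros(K), np.zeros(K), 0.0
for t in range(T):
    if t < K * n_explore:
        a = t % K                              # uniform exploration phase
    else:
        a = int(np.argmax(sums / counts))      # commit to the empirical best arm
    x = pull(a)
    sums[a] += x
    counts[a] += 1
    total += x

print("regret estimate:", T * mu.max() - total)
```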

Mixing Exploration and Exploitation: the UCB approach
$\mathcal{I}_a(t) = [\mathrm{LCB}_a(t), \mathrm{UCB}_a(t)]$: a confidence interval on $\mu_a$, based on the observations up to round $t$. A UCB-type (or optimistic) algorithm chooses at round $t$
$$A_{t+1} = \underset{a = 1, \ldots, K}{\arg\max}\ \mathrm{UCB}_a(t).$$
How to choose the Upper Confidence Bounds? Use appropriate deviation inequalities, in order to guarantee that $\mathbb{P}(\mu_a \leq \mathrm{UCB}_a(t)) \gtrsim 1 - t^{-1}$.
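
For illustration, here is a sketch of a UCB-type algorithm using a Hoeffding-based bound of width $\sqrt{\log(t)/(2 N_a(t))}$, one standard choice that yields the $1 - t^{-1}$ coverage mentioned above (it is not the bound used later in the talk); the example means are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.1, 0.3, 0.5, 0.6, 0.45])     # assumed example means
pull = lambda a: float(rng.random() < mu[a])  # Bernoulli rewards
K, T = len(mu), 10_000

sums, counts = np.zeros(K), np.zeros(K)
for t in range(T):
    if t < K:
        a = t                                  # draw each arm once
    else:
        means = sums / counts
        # Hoeffding: P(mu_a > mean_a + sqrt(log(t)/(2 N_a))) <= 1/t
        ucb = means + np.sqrt(np.log(t) / (2 * counts))
        a = int(np.argmax(ucb))
    x = pull(a)
    sums[a] += x
    counts[a] += 1

print("draws per arm:", counts)
```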

The KL-UCB algorithm
$\hat\mu_a(t)$: empirical mean of rewards from arm $a$ up to time $t$. A deviation inequality involving the KL-divergence:
$$\mathbb{P}\big(N_a(t)\, d(\hat\mu_a(t), \mu_a) \geq \gamma\big) \leq 2e(\log(t) + 1)\,\gamma\, e^{-\gamma}.$$
The KL-UCB algorithm: $A_{t+1} = \arg\max_a u_a(t)$, with
$$u_a(t) := \max\left\{ q : d(\hat\mu_a(t), q) \leq \frac{\log(t) + c \log\log(t)}{N_a(t)} \right\}.$$
[Figure: $u_a(t)$ is the largest $q$ such that $d(\hat\mu_a(t), q)$ stays below the threshold $\log(t)/N_a(t)$; the interval $[\hat\mu_a(t), u_a(t)]$ is shown.]
[Cappé et al. 13]: KL-UCB satisfies, for $c \geq 5$,
$$\mathbb{E}_\mu[N_a(T)] \leq \frac{1}{d(\mu_a, \mu^*)}\log T + O(\sqrt{\log T}).$$
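
A possible implementation sketch of the KL-UCB index: since $q \mapsto d(\hat\mu_a(t), q)$ is increasing on $[\hat\mu_a(t), 1]$, the index can be computed by bisection. The bisection solver, the value of $c$ and the example means are illustrative assumptions.

```python
import numpy as np

def kl_bernoulli(p, q, eps=1e-12):
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def kl_ucb_index(mu_hat, n, t, c=3.0, tol=1e-6):
    """u_a(t) = max{ q : d(mu_hat, q) <= (log t + c log log t) / n },
    computed by bisection; c = 3.0 is an illustrative choice."""
    threshold = (np.log(t) + c * np.log(max(np.log(t), 1.0))) / n
    lo, hi = mu_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if kl_bernoulli(mu_hat, mid) <= threshold:
            lo = mid
        else:
            hi = mid
    return lo

rng = np.random.default_rng(0)
mu = np.array([0.1, 0.3, 0.5, 0.6, 0.45])     # assumed example means
pull = lambda a: float(rng.random() < mu[a])
K, T = len(mu), 5_000
sums, counts = np.zeros(K), np.zeros(K)
for t in range(T):
    if t < K:
        a = t
    else:
        idx = [kl_ucb_index(sums[a] / counts[a], counts[a], t) for a in range(K)]
        a = int(np.argmax(idx))
    x = pull(a)
    sums[a] += x
    counts[a] += 1
print("draws per arm:", counts)
```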

Outline (continued): 2. A Bayesian approach for regret minimization: Bayes-UCB, Thompson Sampling

A frequentist or a Bayesian model?
$\mu = (\mu_1, \ldots, \mu_K)$. Two probabilistic modelings.
Frequentist model: $\mu_1, \ldots, \mu_K$ are unknown parameters; arm $a$: $(Y_{a,s})_s$ i.i.d. $\mathcal{B}(\mu_a)$.
Bayesian model: $\mu_1, \ldots, \mu_K$ are drawn from a prior distribution, $\mu_a \sim \pi_a$; arm $a$: $(Y_{a,s})_s \mid \mu$ i.i.d. $\mathcal{B}(\mu_a)$.
The regret can be computed in each case.
Frequentist regret (regret): $R_T(\mathcal{A}, \mu) = \mathbb{E}_\mu\left[\sum_{t=1}^{T} (\mu^* - \mu_{A_t})\right]$.
Bayesian regret (Bayes risk): $R_T(\mathcal{A}, \pi) = \mathbb{E}_{\mu \sim \pi}\left[\sum_{t=1}^{T} (\mu^* - \mu_{A_t})\right] = \int R_T(\mathcal{A}, \mu)\, d\pi(\mu)$.

Frequentist and Bayesian algorithms
Two types of tools to build bandit algorithms.
Frequentist tools: MLE estimators of the means, confidence intervals.
Bayesian tools: posterior distributions $\pi_a^t = \mathcal{L}(\mu_a \mid X_{a,1}, \ldots, X_{a,N_a(t)})$.
[Figure: confidence intervals vs. posterior distributions on the arm means, together with the number of draws of each arm.]
One can separate tools and objective: we present efficient Bayesian algorithms for regret minimization.
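
A small sketch contrasting the two toolboxes on simulated Bernoulli observations (assumed data, not from the talk): the confidence interval shown is a Hoeffding-based one, and the posterior is the Beta posterior under the uniform prior used on the next slide.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
x = (rng.random(40) < 0.35).astype(float)   # assumed: 40 Bernoulli(0.35) samples
n, s = len(x), x.sum()

# Frequentist tools: MLE and a two-sided 95% Hoeffding confidence interval
mle = s / n
half_width = np.sqrt(np.log(2 / 0.05) / (2 * n))
print("MLE:", mle, "confidence interval:", (mle - half_width, mle + half_width))

# Bayesian tool: posterior under a uniform prior, Beta(1 + S, 1 + n - S)
posterior = beta(1 + s, 1 + n - s)
print("posterior mean:", posterior.mean(),
      "95% credible interval:", posterior.interval(0.95))
```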

Outline (continued): 2. A Bayesian approach for regret minimization: Bayes-UCB

The Bayes-UCB algorithm
$\pi_a^t$: the posterior distribution over $\mu_a$ at the end of round $t$.
Algorithm: Bayes-UCB [K., Cappé, Garivier 2012]
$$A_{t+1} = \underset{a}{\arg\max}\ Q\left(1 - \frac{1}{t(\log t)^c},\ \pi_a^t\right),$$
where $Q(\alpha, p)$ is the quantile of order $\alpha$ of the distribution $p$.
Bernoulli rewards with uniform prior: $\pi_a^0$ i.i.d. $\mathcal{U}([0,1]) = \mathrm{Beta}(1,1)$ and $\pi_a^t = \mathrm{Beta}(S_a(t) + 1,\ N_a(t) - S_a(t) + 1)$.
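
A minimal sketch of Bayes-UCB for Bernoulli bandits using the Beta-posterior quantile (scipy's beta.ppf). The example means are assumptions, and $c = 0$ is used here only for simplicity; the theorem on the next slide requires $c \geq 5$.

```python
import numpy as np
from scipy.stats import beta

rng = np.random.default_rng(0)
mu = np.array([0.1, 0.3, 0.5, 0.6, 0.45])     # assumed example means
pull = lambda a: float(rng.random() < mu[a])
K, T, c = len(mu), 5_000, 0                   # c = 0: simplified illustrative choice

S = np.zeros(K)   # sums of rewards per arm
N = np.zeros(K)   # numbers of draws per arm
for t in range(1, T + 1):
    level = 1.0 - 1.0 / (t * max(np.log(t), 1.0) ** c)
    # Bayes-UCB index: quantile of order `level` of each Beta posterior
    q = beta.ppf(level, S + 1, N - S + 1)
    a = int(np.argmax(q))
    x = pull(a)
    S[a] += x
    N[a] += 1

print("draws per arm:", N)
```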

Bayes-UCB in practice
[Figure: posterior distributions of the arms after some rounds, with the Bayes-UCB indices and the number of draws of each arm.]

Theoretical results
Bayes-UCB is asymptotically optimal.
Theorem [K., Cappé, Garivier 2012]. Let $\epsilon > 0$. The Bayes-UCB algorithm using a uniform prior over the arms and parameter $c \geq 5$ satisfies
$$\mathbb{E}_\mu[N_a(T)] \leq \frac{1 + \epsilon}{d(\mu_a, \mu^*)}\log(T) + o_{\epsilon, c}(\log(T)).$$

Links to a frequentist algorithm
The Bayes-UCB index $q_a(t)$ is close to the KL-UCB index.
Lemma: $\tilde{u}_a(t) \leq q_a(t) \leq u_a(t)$, with
$$u_a(t) = \max\left\{ q : d(\hat\mu_a(t), q) \leq \frac{\log(t) + c \log\log(t)}{N_a(t)} \right\},$$
$$\tilde{u}_a(t) = \max\left\{ q : d\left(\frac{N_a(t)\hat\mu_a(t)}{N_a(t) + 1},\ q\right) \leq \frac{\log\left(\frac{t}{N_a(t) + 2}\right) + c \log\log(t)}{N_a(t) + 1} \right\}.$$
Bayes-UCB automatically builds confidence intervals based on the Kullback-Leibler divergence!

Where does it come from?
We have a tight bound on the tail of the posterior distributions (Beta distributions).
First element: link between Beta and Binomial distributions. If $X_{a,b} \sim \mathrm{Beta}(a, b)$ and $S_{n,x} \sim \mathrm{Bin}(n, x)$,
$$\mathbb{P}(X_{a,b} \geq x) = \mathbb{P}(S_{a+b-1,\,1-x} \geq b).$$
Second element: Sanov-type inequalities.
Lemma: For $k > nx$,
$$\frac{e^{-n\, d\left(\frac{k}{n}, x\right)}}{n + 1} \leq \mathbb{P}(S_{n,x} \geq k) \leq e^{-n\, d\left(\frac{k}{n}, x\right)}.$$
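
Both ingredients are easy to check numerically. The snippet below (with illustrative values, not from the slides) verifies the Beta-Binomial identity with scipy and compares a binomial tail probability with the Sanov-type upper and lower bounds.

```python
import numpy as np
from scipy.stats import beta, binom

def kl_bernoulli(p, q, eps=1e-12):
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Beta-Binomial link: P(X_{a,b} >= x) = P(S_{a+b-1, 1-x} >= b)
a, b, x = 5, 8, 0.3
lhs = beta.sf(x, a, b)
rhs = binom.sf(b - 1, a + b - 1, 1 - x)    # P(S >= b) = P(S > b - 1)
print(lhs, rhs)                             # the two values should match

# Sanov-type bounds: for k > n*x,
# e^{-n d(k/n, x)} / (n + 1) <= P(S_{n,x} >= k) <= e^{-n d(k/n, x)}
n, x, k = 50, 0.3, 25
tail = binom.sf(k - 1, n, x)
bound = np.exp(-n * kl_bernoulli(k / n, x))
print(bound / (n + 1), tail, bound)         # lower bound <= tail <= upper bound
```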

Outline (continued): 2. A Bayesian approach for regret minimization: Thompson Sampling

Thompson Sampling
$(\pi_1^t, \ldots, \pi_K^t)$: posterior distribution on $(\mu_1, \ldots, \mu_K)$ at round $t$.
Algorithm: Thompson Sampling, a randomized Bayesian algorithm:
$$\forall a \in \{1, \ldots, K\},\ \theta_a(t) \sim \pi_a^t, \qquad A_{t+1} = \underset{a}{\arg\max}\ \theta_a(t).$$
Draw each arm according to its posterior probability of being optimal. It is the first bandit algorithm, introduced by [Thompson 1933].
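
A minimal sketch of Thompson Sampling with Beta(1, 1) priors, matching the Bernoulli setting above; the example means are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.1, 0.3, 0.5, 0.6, 0.45])     # assumed example means
pull = lambda a: float(rng.random() < mu[a])
K, T = len(mu), 5_000

S = np.zeros(K)   # sums of rewards per arm
N = np.zeros(K)   # numbers of draws per arm
for t in range(T):
    # Draw one sample from each Beta posterior and play the argmax
    theta = rng.beta(S + 1, N - S + 1)
    a = int(np.argmax(theta))
    x = pull(a)
    S[a] += x
    N[a] += 1

print("draws per arm:", N)
```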

Illustration of the algorithm
[Figure: posterior distributions of the mean of arm 1 (top) and arm 2 (bottom) in a two-armed bandit model, with the true means $\mu_1$, $\mu_2$ and the posterior samples $\theta_1(t)$, $\theta_2(t)$.]

Thompson Sampling is asymptotically optimal
Good empirical performance in complex models; first logarithmic regret bound by [Agrawal and Goyal 2012].
Theorem [K., Korda, Munos 2012]. For all $\epsilon > 0$,
$$\mathbb{E}_\mu[N_a(T)] \leq \frac{1 + \epsilon}{d(\mu_a, \mu^*)}\log(T) + o_{\mu, \epsilon}(\log(T)).$$

Outline (continued): 3. Optimal algorithms for Best Arm Identification

A sample complexity lower bound
A Best Arm Identification algorithm $(A_t, \tau, \hat{a}_\tau)$ is $\delta$-PAC if, for all $\mu$, $\mathbb{P}_\mu(\hat{a}_\tau = a^*(\mu)) \geq 1 - \delta$.
Theorem [Garivier and K. 2016]. For any $\delta$-PAC algorithm,
$$\mathbb{E}_\mu[\tau] \geq T^*(\mu)\, \log\left(\frac{1}{2.4\,\delta}\right),$$
where
$$T^*(\mu)^{-1} = \sup_{w \in \Sigma_K}\ \inf_{\{\lambda : a^*(\lambda) \neq a^*(\mu)\}}\ \sum_{a=1}^{K} w_a\, d(\mu_a, \lambda_a),$$
with $\Sigma_K = \{w \in [0, 1]^K : \sum_{i=1}^{K} w_i = 1\}$.

Optimal proportions of draws
The vector
$$w^*(\mu) = \underset{w \in \Sigma_K}{\arg\max}\ \inf_{\{\lambda : a^*(\lambda) \neq a^*(\mu)\}}\ \sum_{a=1}^{K} w_a\, d(\mu_a, \lambda_a)$$
contains the optimal proportions of draws of the arms, i.e. an algorithm matching the lower bound should satisfy
$$\forall a \in \{1, \ldots, K\}, \quad \frac{\mathbb{E}_\mu[N_a(\tau)]}{\mathbb{E}_\mu[\tau]} \simeq w_a^*(\mu).$$
Building on this notion of optimal proportions, one can exhibit an asymptotically optimal algorithm:
$$\lim_{\delta \to 0} \frac{\mathbb{E}_\mu[\tau_\delta]}{\log(1/\delta)} = T^*(\mu).$$
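
The characteristic time $T^*(\mu)$ and the weights $w^*(\mu)$ can be computed numerically. The sketch below is an illustrative implementation, not the algorithm of the talk: for Bernoulli arms the inner infimum reduces to a comparison of $a^*$ with each challenger arm $b$, the optimal common value being the $w$-weighted average of $\mu_{a^*}$ and $\mu_b$; the outer maximization over the simplex is handled by scipy (SLSQP), and the example means are assumptions.

```python
import numpy as np
from scipy.optimize import minimize

def kl(p, q, eps=1e-12):
    p, q = np.clip(p, eps, 1 - eps), np.clip(q, eps, 1 - eps)
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

def inner_inf(w, mu):
    """inf over {lambda : a*(lambda) != a*(mu)} of sum_a w_a d(mu_a, lambda_a):
    only arms a* and one challenger b move; for Bernoulli arms the optimal
    common mean is the w-weighted average of mu_{a*} and mu_b."""
    star = int(np.argmax(mu))
    vals = []
    for b in range(len(mu)):
        if b == star:
            continue
        denom = w[star] + w[b]
        if denom <= 0:
            vals.append(0.0)
            continue
        m = (w[star] * mu[star] + w[b] * mu[b]) / denom
        vals.append(w[star] * kl(mu[star], m) + w[b] * kl(mu[b], m))
    return min(vals)

def optimal_weights(mu):
    """Maximize the inner infimum over the simplex (a simple SLSQP sketch)."""
    K = len(mu)
    res = minimize(lambda w: -inner_inf(w, mu), np.ones(K) / K,
                   bounds=[(0.0, 1.0)] * K,
                   constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}])
    w = np.clip(res.x, 0, 1)
    w /= w.sum()
    return w, 1.0 / inner_inf(w, mu)    # w*(mu), T*(mu)

mu = np.array([0.5, 0.45, 0.4, 0.3])    # assumed example means
w_star, T_star = optimal_weights(mu)
print("w*(mu):", np.round(w_star, 3), "T*(mu):", round(T_star, 1))
# Lower bound on E[tau] for delta = 0.01: T*(mu) * log(1 / (2.4 * delta))
print("lower bound for delta = 0.01:", T_star * np.log(1 / (2.4 * 0.01)))
```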

Conclusion
We saw two Bayesian algorithms that are good alternatives to KL-UCB for regret minimization because: they are also asymptotically optimal in simple models; they display better empirical performance; they can be easily generalized to more complex models.
Algorithms for regret minimization and BAI are very different!
Playing mostly the best arm vs. playing the optimal proportions. [Figure: number of draws of each arm under a regret-minimizing algorithm vs. a best-arm-identification algorithm.]
Different complexity terms (featuring the KL-divergence):
$$R_T \simeq \left(\sum_{a \neq a^*} \frac{\mu^* - \mu_a}{d(\mu_a, \mu^*)}\right) \log(T), \qquad \mathbb{E}_\mu[\tau] \simeq T^*(\mu) \log(1/\delta).$$

Thank you for your attention. Thank you!

Bayesian algorithms in contextual linear bandit models
At time $t$, a set of contexts $\mathcal{D}_t \subset \mathbb{R}^d$ is revealed (= characteristics of the items to recommend).
The model: if the context $x_t \in \mathcal{D}_t$ is selected, a reward $r_t = x_t^T \theta + \epsilon_t$ is received; $\theta \in \mathbb{R}^d$ is the underlying preference vector.
A Bayesian model (with Gaussian prior):
$$r_t = x_t^T \theta + \epsilon_t, \qquad \theta \sim \mathcal{N}(0, \kappa^2 I_d), \qquad \epsilon_t \sim \mathcal{N}(0, \sigma^2).$$
Explicit posterior: $p(\theta \mid x_1, r_1, \ldots, x_t, r_t) = \mathcal{N}(\hat\theta(t), \Sigma_t)$.
Thompson Sampling: $\tilde\theta(t) \sim \mathcal{N}(\hat\theta(t), \Sigma_t)$, and $x_{t+1} = \underset{x \in \mathcal{D}_{t+1}}{\arg\max}\ x^T \tilde\theta(t)$.
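
A minimal sketch of Thompson Sampling in this Gaussian linear model: the posterior is Gaussian with precision $I_d/\kappa^2 + \sum_s x_s x_s^T/\sigma^2$ and mean $\Sigma_t \sum_s x_s r_s/\sigma^2$, and one posterior sample is drawn per round. The dimensions, context distribution and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, sigma, kappa, T = 5, 0.5, 1.0, 2_000        # assumed dimensions and parameters
theta_true = rng.normal(0, kappa, size=d)      # hidden preference vector

# Track the inverse posterior covariance B and the vector f = sum x_s r_s / sigma^2,
# so that Sigma_t = B^{-1} and theta_hat(t) = Sigma_t f.
B = np.eye(d) / kappa**2
f = np.zeros(d)

for t in range(T):
    contexts = rng.normal(size=(10, d))        # D_t: 10 assumed candidate contexts
    Sigma = np.linalg.inv(B)
    theta_hat = Sigma @ f
    # Thompson Sampling: sample theta(t) from the posterior, play the argmax context
    theta_sample = rng.multivariate_normal(theta_hat, Sigma)
    x = contexts[np.argmax(contexts @ theta_sample)]
    r = x @ theta_true + rng.normal(0, sigma)  # observed reward
    B += np.outer(x, x) / sigma**2
    f += x * r / sigma**2

print("estimation error:", np.linalg.norm(np.linalg.inv(B) @ f - theta_true))
```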