
1 LEARNING THEORY OF OPTIMAL DECISION MAKING, PART I: ON-LINE LEARNING IN STOCHASTIC ENVIRONMENTS. Csaba Szepesvári, Department of Computing Science, University of Alberta. Machine Learning Summer School, Île de Ré, France, 2008. With thanks to: the RLAI group, the SZTAKI group, Jean-Yves Audibert, Rémi Munos.

2 OUTLINE
1. HIGH LEVEL OVERVIEW OF THE TALKS
2. MOTIVATION: What is it? Why should we care? The challenge
3. BANDITS: Forcing, ε-greedy, Softmax, Optimism in the Face of Uncertainty
4. BANDITS WITH SIDE INFORMATION
5. ONLINE LEARNING IN MDPS
6. CONCLUSIONS

3 MOTTO A theory is something nobody believes, except the person who made it. An experiment is something everybody believes, except the person who made it. (Einstein)

4 HIGH LEVEL OVERVIEW OF THE TALKS
Day 1: Online learning in stochastic environments
Day 2: Batch learning in Markovian Decision Processes
Day 3: Online learning in adversarial environments

5 REFRESHER
Union bound: $P(A \cup B) \le P(A) + P(B)$
Inclusion: $A \subset B \Rightarrow P(A) \le P(B)$
Inversion: if $P(X > t) \le F(t)$ holds for all $t$, then for any $0 < \delta < 1$, w.p. $1-\delta$, $X \le F^{-1}(\delta)$
Expectation: if $X \ge 0$, $E[X] = \int_0^\infty P(X \ge t)\,dt$

6 WHAT IS IT? PROTOCOL OF LEARNING
Concepts: Agent, Environment, sensations, actions, rewards. Time: $t = 1, 2, \dots$
Perception-action loop:
1. Agent senses $Y_t$ coming from the Environment
2. Agent sends action $A_t$ to the Environment
3. Agent receives reward $R_t$ from the Environment
4. $t := t + 1$, go to Step 1
Goal: maximize $\sum_{t=1}^T R_t$
RELATED PROBLEMS
Cost-sensitive learning: actions are predictions
Passive (batch) learning: no influence on training data
Active learning: learn fast (instead of cheaply)

7 WHY SHOULD WE CARE? Online: opportunity to keep improving; can learn with fewer data points; learning should be cheap (learning you know!). In stochastic environments: convenience, ...

8 THE CHALLENGE: EXPLORE OR EXPLOIT? CLINICAL TRIALS
Drugs: $i \in \mathcal{I} \stackrel{\mathrm{def}}{=} \{1, 2, \dots, k\}$
Protocol:
1. Choose drug $I_t \in \mathcal{I}$ for the next patient
2. Observe response $R_t(I_t) \in \{0, 1\}$ of the patient
3. $t := t + 1$, go to Step 1
Estimated response of drug $i$ after $n$ steps: $Q_n(i) = \dfrac{\sum_{t=1}^n \mathbb{I}\{I_t = i\}\, R_t(i)}{\sum_{t=1}^n \mathbb{I}\{I_t = i\}}$
WHICH DRUG TO CHOOSE? Best response so far? (greedy choice; exploit; optimize return) Least explored? (explore; collect information) ???
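
The estimate $Q_n(i)$ above is just a per-arm empirical mean, which can be maintained incrementally instead of being recomputed from the whole history. A minimal Python sketch (not from the slides; the two Bernoulli "drugs" and their response rates are illustrative assumptions):

```python
import random

class EmpiricalMeans:
    """Tracks Q_n(i), the empirical mean response of each option i."""

    def __init__(self, k):
        self.counts = [0] * k     # how many times option i was chosen
        self.means = [0.0] * k    # current empirical mean Q_n(i)

    def update(self, i, reward):
        """Incremental mean update after observing `reward` for option i."""
        self.counts[i] += 1
        self.means[i] += (reward - self.means[i]) / self.counts[i]

if __name__ == "__main__":
    random.seed(0)
    q = EmpiricalMeans(k=2)
    true_p = [0.4, 0.6]                      # hypothetical response rates
    for t in range(10_000):
        i = t % 2                            # alternate options, just to feed data
        r = 1 if random.random() < true_p[i] else 0
        q.update(i, r)
    print(q.means)                           # close to [0.4, 0.6]
```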

9 BANDIT PROBLEMS
BANDIT PROBLEM [ROBBINS, 1952]
Choices: $i \in \mathcal{I} \stackrel{\mathrm{def}}{=} \{1, 2, \dots, k\}$
Protocol:
1. Choose an option $I_t \in \mathcal{I}$
2. Observe response $R_t(I_t) \in \{0, 1\}$
3. $t := t + 1$, go to Step 1
Terminology: option = arm = action
ASSUMPTION $R_t(i) \sim P_i(\cdot)$, independence within and between the arms

10 EXAMPLES Clinical trials, web advertising, job shop scheduling, ...

11 SOME DEFINITIONS
Expected payoffs: $Q(i) = E[R_1(i)]$
Optimal arm: $Q(i^*) = Q^* \stackrel{\mathrm{def}}{=} \max_i Q(i)$
Set of optimal arms: $\mathcal{I}^* = \{i : Q(i) = Q^*\}$
Payoff loss ("gap"): $\Delta_i = Q^* - Q(i)$
Total reward up to time $n$: $R_{1:n} = \sum_{t=1}^n R_t(I_t)$
A bandit problem instance: $B$; class of bandit problems: $\mathcal{B}$

12 DEFINING THE GOALS
DEFINITION (CONSISTENCY) A bandit algorithm $A$ is strongly consistent on $\mathcal{B}$ if $\lim_{t\to\infty} P(I_t \in \mathcal{I}^*) = 1$ holds when $A$ is run on any instance from $\mathcal{B}$.
DEFINITION (HANNAN-CONSISTENCY, "NO-REGRET") A bandit algorithm is Hannan-consistent on $\mathcal{B}$ if its expected regret $L_n$ is sublinear over time: $L_n \stackrel{\mathrm{def}}{=} n Q^* - E[R_{1:n}] = o(n)$ when $A$ is run on any instance from $\mathcal{B}$.

13 STRONG VS. HANNAN-CONSISTENCY
PROPOSITION If an algorithm is strongly consistent then it is also Hannan-consistent.
PROOF. Let $a_t = E[Q^* - R_t(I_t)]$. Then $a_t = \sum_i E[Q^* - R_t(i) \mid I_t = i]\, P(I_t = i) = \sum_{i:\Delta_i > 0} \Delta_i\, P(I_t = i)$. By strong consistency, $a_t \to 0$. By Cesàro's lemma, $\frac{1}{n}\sum_{t=1}^n a_t \to 0$, and since $L_n = \sum_{t=1}^n a_t$, this gives $L_n/n \to 0$.
PROPOSITION The reverse does not hold.

14 EXPLORATION IS NECESSARY
DEFINITION A bandit problem is non-trivial if for some optimal arm $i^*$ and some suboptimal arm $i$, $R_t(i^*) < R_t(i)$ holds with positive probability.
DEFINITION An algorithm stops exploring on problem $B$ if there exists a time $n$ such that after time step $n$ the algorithm only chooses exploiting actions: $Q_t(I_t) = \max_i Q_t(i)$ for all $t > n$. Time $n$ may depend on $B$.
PROPOSITION Let $\mathcal{B}$ contain a non-trivial bandit problem. If an algorithm stops exploring, it cannot be consistent on $\mathcal{B}$.

15 METHODS FOR ACHIEVING CONSISTENCY Forcing with a fixed schedule, ε-greedy, Softmax, Optimism in the Face of Uncertainty

16 FORCING
IDEA Exploration is necessary, so make sure that every arm is selected infinitely often. Let $T_n(i) = \sum_{t=1}^n \mathbb{I}\{I_t = i\}$ be the number of times arm $i$ is chosen up to time $n$.
FORCING ALGORITHM($(f_t)_{t\in\mathbb{N}}$) At time $t$ do:
1. $i_0 := \operatorname{argmin}_i T_t(i)$
2. if $T_t(i_0) < f_t$ then $I_t := i_0$ else $I_t := \operatorname{argmax}_i Q_t(i)$.
COMMENT Other possibility: periodic forcing (forcing with a fixed timing).
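
A minimal Python sketch of the Forcing Algorithm above (the Bernoulli arms and the schedule $f_t = c\log t$ are illustrative assumptions, anticipating the choice analysed on the next slides):

```python
import math
import random

def forcing_bandit(arms_p, n, f=lambda t: 2.0 * math.log(t + 1), seed=0):
    """Forcing Algorithm: force the least-tried arm while T_t(i0) < f_t,
    otherwise play greedily with respect to the empirical means Q_t."""
    rng = random.Random(seed)
    k = len(arms_p)
    T = [0] * k                  # T_t(i): number of pulls of arm i so far
    Q = [0.0] * k                # Q_t(i): empirical mean of arm i
    total = 0.0
    for t in range(1, n + 1):
        i0 = min(range(k), key=lambda i: T[i])        # least explored arm
        if T[i0] < f(t):
            a = i0                                    # forced exploration
        else:
            a = max(range(k), key=lambda i: Q[i])     # greedy exploitation
        r = 1.0 if rng.random() < arms_p[a] else 0.0  # Bernoulli reward (illustrative)
        T[a] += 1
        Q[a] += (r - Q[a]) / T[a]
        total += r
    return n * max(arms_p) - total                    # realised regret

if __name__ == "__main__":
    print(forcing_bandit([0.5, 0.6], n=10_000))
```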

17 REGRET OF THE FORCING ALGORITHM
PROPOSITION Let $f_t/t \to 0$ and $\lim_{t\to\infty} f_t = \infty$. Then the Forcing Algorithm is Hannan-consistent, i.e., $L_n/n \to 0$. (The Forcing Algorithm is not strongly consistent.)
... but what is the growth rate of $L_n$?

18 CENTRAL LIMIT THEOREM
$P\Big(\sqrt{\tfrac{n}{\sigma^2}}\,(\bar X_n - E[X_1]) \ge y\Big) \approx 1 - \Phi(y) \le \tfrac{1}{\sqrt{2\pi}\, y}\, e^{-y^2/2}$,
hence
$P\big(\bar X_n - E[X_1] \ge \epsilon\big) = P\Big(\sqrt{\tfrac{n}{\sigma^2}}\,(\bar X_n - E[X_1]) \ge \sqrt{\tfrac{n}{\sigma^2}}\,\epsilon\Big) \lesssim \sqrt{\tfrac{\sigma^2}{n\epsilon^2}}\; e^{-n\epsilon^2/(2\sigma^2)}$.

19 THEOREM (HOEFFDING'S INEQUALITY) Let $X_i \in [0,1]$ be i.i.d., $\mu = E[X_1]$, $\bar X_n = \frac{1}{n}\sum_{t=1}^n X_t$. Then
$P(\bar X_n \ge \mu + \epsilon) \le e^{-2n\epsilon^2}$ and $P(\bar X_n \le \mu - \epsilon) \le e^{-2n\epsilon^2}$.
COROLLARY (HOEFFDING'S BOUND IN DEVIATION FORM) Let $\delta > 0$. With probability $1-\delta$, $\bar X_n \le \mu + \sqrt{\tfrac{\log(1/\delta)}{2n}}$. Similarly, with probability $1-\delta$, $\bar X_n \ge \mu - \sqrt{\tfrac{\log(1/\delta)}{2n}}$.
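
A quick Monte Carlo sanity check of the deviation form of the bound (the sample size, confidence level, and Bernoulli(1/2) samples are illustrative assumptions):

```python
import math
import random

def hoeffding_check(n=200, delta=0.05, trials=20_000, seed=1):
    """Estimate P( X_bar - mu >= sqrt(log(1/delta)/(2n)) ) for Bernoulli(1/2)
    samples and compare it with the Hoeffding guarantee delta."""
    rng = random.Random(seed)
    radius = math.sqrt(math.log(1.0 / delta) / (2 * n))
    mu = 0.5
    exceed = 0
    for _ in range(trials):
        mean = sum(1 if rng.random() < mu else 0 for _ in range(n)) / n
        if mean - mu >= radius:
            exceed += 1
    return exceed / trials, delta

if __name__ == "__main__":
    freq, delta = hoeffding_check()
    print(f"observed exceedance {freq:.4f} <= guaranteed {delta}")
```

The observed exceedance frequency should come out well below δ, as expected from a worst-case bound.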

20 HEURISTIC ANALYSIS
Assume that $R_t(i) \in [0,1]$. Then (handwaving) w.p. $1-\delta_t$, for all $i \in \{1,\dots,k\}$, $|Q_t(i) - Q(i)| \le c_t \stackrel{\mathrm{def}}{=} \sqrt{\tfrac{\log(2k/\delta_t)}{2 f_t}}$, since we explored arm $i$ at least $f_t$ times.
Counting mistakes:
1. When forcing: total contribution to $L_n$ is at most $k f_n$
2. When the sample is "atypical": $\sum_{t=1}^n \delta_t$
3. When the sample is "typical": no mistake when $c_t < \Delta/2$, where $\Delta \stackrel{\mathrm{def}}{=} \min_{j:\Delta_j > 0} \Delta_j$, i.e. (*) $f_t \ge 2\log(2k/\delta_t)/\Delta^2$
Hence $L_n \lesssim k f_n + \sum_{t=1}^n \delta_t + O(1) = O(\log n)$ if $\delta_t = 1/t$ and $f_t = c \log t$ with $c = \Omega(\Delta^{-2})$.

21 REGRET OF THE FORCING ALGORITHM
THEOREM Assume that the payoffs are bounded. If $f_t = c \log t$ with $c = \Omega(\Delta^{-2})$ then the regret of the Forcing Algorithm grows logarithmically: $L_n = O(\log n)$.

22 THE ε-greedy ALGORITHM
ε-GREEDY($(\epsilon_t)_{t\in\mathbb{N}}$) At time $t$ do:
1. Draw $U_t$ uniformly at random in $[0,1]$
2. if $U_t < \epsilon_t$ then pick $I_t$ uniformly at random from $\{1, 2, \dots, k\}$
3. else $I_t := \operatorname{argmax}_i Q_t(i)$.
THEOREM (REGRET OF ε-GREEDY [AUER ET AL., 2002]) Assume that the payoffs are bounded. If $\epsilon_t = c/t$ with $c = \Omega(k/\Delta^2)$ then $P(I_t \notin \mathcal{I}^*) = O(c/t)$ and $L_n = O(c \log n)$.
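
A minimal Python sketch of ε-greedy with the decaying schedule $\epsilon_t = \min(1, c/t)$ (the constant $c$ and the Bernoulli arms are illustrative assumptions):

```python
import random

def epsilon_greedy(arms_p, n, c=10.0, seed=0):
    """epsilon-greedy with epsilon_t = min(1, c/t) and empirical means Q_t."""
    rng = random.Random(seed)
    k = len(arms_p)
    T = [0] * k
    Q = [0.0] * k
    total = 0.0
    for t in range(1, n + 1):
        eps = min(1.0, c / t)
        if rng.random() < eps:
            a = rng.randrange(k)                      # explore: uniform over arms
        else:
            a = max(range(k), key=lambda i: Q[i])     # exploit: greedy arm
        r = 1.0 if rng.random() < arms_p[a] else 0.0
        T[a] += 1
        Q[a] += (r - Q[a]) / T[a]
        total += r
    return n * max(arms_p) - total                    # realised regret

if __name__ == "__main__":
    print(epsilon_greedy([0.5, 0.6], n=10_000))
```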

23 SOFTMAX ALGORITHMS
Problem with the previous algorithms: when exploring, they are indifferent about $Q_t(i)$.
BOLTZMANN($(\tau_t)_{t\in\mathbb{N}}$) At time $t$ do:
1. Let $w_t(i) = \exp(Q_t(i)/\tau_t)$, $i \in \{1, 2, \dots, k\}$
2. Let $p_t(i) = \dfrac{w_t(i)}{\sum_j w_t(j)}$
3. Draw $I_t$ from $p_t(\cdot)$
COMMENTS $\tau_t \to 0$ is the "computational temperature"; aka exponential-weights algorithm, Gibbs selection. Plays a big role in adversarial environments.
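
A minimal Python sketch of Boltzmann exploration (the cooling schedule $\tau_t = 1/\log(t+2)$ and the Bernoulli arms are illustrative assumptions):

```python
import math
import random

def boltzmann(arms_p, n, tau=lambda t: 1.0 / math.log(t + 2), seed=0):
    """Boltzmann / softmax exploration: draw I_t with probability
    p_t(i) proportional to exp(Q_t(i) / tau_t)."""
    rng = random.Random(seed)
    k = len(arms_p)
    T = [0] * k
    Q = [0.0] * k
    total = 0.0
    for t in range(1, n + 1):
        temp = tau(t)
        w = [math.exp(Q[i] / temp) for i in range(k)]   # unnormalised weights
        a = rng.choices(range(k), weights=w)[0]         # sample I_t ~ p_t(.)
        r = 1.0 if rng.random() < arms_p[a] else 0.0
        T[a] += 1
        Q[a] += (r - Q[a]) / T[a]
        total += r
    return n * max(arms_p) - total

if __name__ == "__main__":
    print(boltzmann([0.5, 0.6], n=10_000))
```

Note that, unlike ε-greedy, the exploration probability of a clearly bad arm shrinks as its estimated value falls behind.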

24 OFU: OPTIMISM IN THE FACE OF UNCERTAINTY
IDEA [LAI AND ROBBINS, 1985] Choose the arm with the best potential.
Assumption: $R_t(i) \sim p_{\theta_i}(\cdot)$, $\theta_i \in \mathbb{R}$ unknown, $p$ known.
Mean payoff: $Q(\theta) = \int r\, p_\theta(r)\, dr$
Uncertainty set: $U_{i,t} = \{\theta : \log P(R_1(i), \dots, R_{T_i(t)}(i) \mid \theta)\ \text{is "large"}\}$ ... "large" depends on $t$, $T_i(t)$.
Upper confidence index for arm $i$: $\mathrm{UCI}_t(i) = \max_{\theta \in U_{i,t}} Q(\theta)$
Algorithm UCI: $I_t := \operatorname{argmax}_i \mathrm{UCI}_t(i)$.
Two reasons we select an arm: (i) the associated uncertainty is big, (ii) the arm looks good. Is this a good algorithm??

25 REGRET BOUND FOR UCI [LAI AND ROBBINS, 1985]
THEOREM For any suboptimal arm $i$, $E[T_n(i)] \le \Big(\frac{1}{D(p_i\|p^*)} + o(1)\Big)\log n$, where $p_i = p_{\theta_i}$, $D(p_i\|p^*) = \int p_i(x)\log\frac{p_i(x)}{p^*(x)}\,dx$, and $p^*$ is the distribution of an optimal arm.
COROLLARY $L_n \le \sum_{i:\Delta_i>0} \Delta_i \Big\{\frac{1}{D(p_i\|p^*)} + o(1)\Big\}\log n$.

26 A LOWER BOUND [LAI AND ROBBINS, 1985]
$B = (p_1, \dots, p_K)$; $L_n^A(B)$: regret of $A$ when run on problem $B$.
DEFINITION (UNIFORMLY GOOD ALGORITHMS) Algorithm $A$ is uniformly good if $L_n^A(B) = o(n^a)$ holds for all $a > 0$ and all reward distributions $B = (p_1, \dots, p_K) \in \mathcal{B}$. This is a minimum requirement!
THEOREM (LOWER BOUND) If $A$ is uniformly good then for any $B = (p_1, \dots, p_K) \in \mathcal{B}$ and any suboptimal arm $i$, $E[T_n(i)] \ge \Big(\frac{1}{D(p_i\|p^*)} - o(1)\Big)\log n$.
COROLLARY UCI algorithms are asymptotically efficient.

27 UCB: A NONPARAMETRIC IMPLEMENTATION OF OFU
Goal: avoid parametric distributions! How to implement the OFU principle? What is the potential of an arm? We need upper estimates of its mean payoff!
[Agrawal, 1995]: large-deviation theory, asymptotic results. [Auer et al., 2002]: avoid asymptotics, use Hoeffding's inequality!
Hoeffding: $Q(i) \le Q_t(i) + \sqrt{\tfrac{\log(k/\delta_t)}{2 T_t(i)}}$ (w.h.p.), so set
$\mathrm{UCI}_t(i) := Q_t(i) + \sqrt{\tfrac{p \log(k/\delta_t)}{2 T_t(i)}}$, where $p > 1$ and $\delta_t \to 0$ are tuning parameters.

28 ALGORITHM UCB1
UCB1(p) At time $t$ do:
1. $\mathrm{UCI}_t(i) := Q_t(i) + \sqrt{\tfrac{p \log t}{2 T_t(i)}}$
2. $I_t := \operatorname{argmax}_i \mathrm{UCI}_t(i)$
THEOREM (UCB1 REGRET) Let $0 \le R_t(i) \le 1$, i.i.d., independent between the arms. Then the regret of UCB1 satisfies
$E[L_n] \le 2p \sum_{i:\Delta_i>0} \frac{\log n}{\Delta_i} + \frac{p}{p-2}\sum_{i=1}^K \Delta_i$.
(Slightly) improved bound compared to [Auer et al., 2002]. One can show that $1/D(p_j\|p^*) \le 1/(2\Delta_j^2)$, so this is still far from the best possible constant (1/2).
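
A minimal Python sketch of UCB1(p) with the index above ($p = 2$ and the Bernoulli arms are illustrative assumptions):

```python
import math
import random

def ucb1(arms_p, n, p=2.0, seed=0):
    """UCB1: pull each arm once, then pull the argmax of
    Q_t(i) + sqrt(p * log(t) / (2 * T_t(i)))."""
    rng = random.Random(seed)
    k = len(arms_p)
    T = [0] * k
    Q = [0.0] * k
    total = 0.0

    def pull(a):
        nonlocal total
        r = 1.0 if rng.random() < arms_p[a] else 0.0
        T[a] += 1
        Q[a] += (r - Q[a]) / T[a]
        total += r

    for a in range(k):                     # initialisation: one pull per arm
        pull(a)
    for t in range(k + 1, n + 1):
        uci = [Q[i] + math.sqrt(p * math.log(t) / (2 * T[i])) for i in range(k)]
        pull(max(range(k), key=lambda i: uci[i]))
    return n * max(arms_p) - total         # realised regret

if __name__ == "__main__":
    print(ucb1([0.5, 0.6], n=10_000))
```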

29 UCBTUNED: MOTIVATION
Assume that rewards are in $[a, a+b]$. UCB1 needs to be adjusted: $\mathrm{UCI}_t(i) := Q_t(i) + b\sqrt{\tfrac{p \log t}{2 T_t(i)}}$.
Regret bound: $R_n \le 2p \sum_{i:\Delta_i>0} \frac{b^2}{\Delta_i}\log n + \frac{p}{p-2}\, b \sum_{i=1}^K \Delta_i$.
Problem: outrageously big when $\mathrm{Var}[R_t(i)] \ll b^2$!

30 UCBTUNED: ALGORITHM
Idea: estimate the variance and use it to define the upper index!
THEOREM (EMPIRICAL BERNSTEIN BOUND [AUDIBERT ET AL., 2007]) Let $a \le X_t \le a+b$ be i.i.d., $t > 2$. Let $\bar X_t$ be the empirical mean of $X_1, \dots, X_t$ and $V_t = \frac{1}{t}\sum_{s=1}^t (X_s - \bar X_t)^2$ the empirical variance estimate. Then for any $0 < \delta < 1$, w.p. at least $1-\delta$,
$\bar X_t - E[X_1] \le \sqrt{\tfrac{2 V_t x_\delta}{t}} + \tfrac{3 b x_\delta}{t}$, where $x_\delta = \log(3/\delta)$.
UCBTUNED(p) At time $t$ do:
1. $\mathrm{UCI}_t(i) := Q_t(i) + \sqrt{\tfrac{2 V_t(i)\, p \log t}{T_t(i)}} + \tfrac{3 b\, p \log t}{T_t(i)}$
2. $I_t := \operatorname{argmax}_i \mathrm{UCI}_t(i)$
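
A minimal Python sketch of the variance-aware index above, using Welford's recurrence for the empirical variance $V_t(i)$ (the reward range $b = 1$, $p = 1.2$, and the Bernoulli arms are illustrative assumptions):

```python
import math
import random

def ucb_tuned(arms_p, n, p=1.2, b=1.0, seed=0):
    """UCBTuned(p): index Q_t(i) + sqrt(2*V_t(i)*p*log(t)/T_t(i)) + 3*b*p*log(t)/T_t(i)."""
    rng = random.Random(seed)
    k = len(arms_p)
    T = [0] * k
    Q = [0.0] * k
    M2 = [0.0] * k     # sum of squared deviations, so V_t(i) = M2[i] / T[i]
    total = 0.0

    def pull(a):
        nonlocal total
        r = 1.0 if rng.random() < arms_p[a] else 0.0
        T[a] += 1
        d = r - Q[a]
        Q[a] += d / T[a]
        M2[a] += d * (r - Q[a])            # Welford update of the variance sum
        total += r

    for a in range(k):                     # one pull per arm to initialise
        pull(a)
    for t in range(k + 1, n + 1):
        plog = p * math.log(t)
        uci = [Q[i] + math.sqrt(2 * (M2[i] / T[i]) * plog / T[i]) + 3 * b * plog / T[i]
               for i in range(k)]
        pull(max(range(k), key=lambda i: uci[i]))
    return n * max(arms_p) - total

if __name__ == "__main__":
    print(ucb_tuned([0.5, 0.6], n=10_000))
```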

31 UCBTUNED: REGRET BOUND
THEOREM (UCBTUNED REGRET [AUDIBERT ET AL., 2007]) Let $0 \le R_t(i) \le b$, i.i.d., independent between the arms, $p > 1$, $\sigma_i^2 = \mathrm{Var}[R_t(i)]$. Then the regret of UCBTuned(p) satisfies
$E[L_n] \le c_p \sum_{i:\Delta_i>0} \Big(\frac{\sigma_i^2}{\Delta_i} + 2b\Big)\log n$.
In particular, when $p = 1.2$, $c_p = 10$.

32 WHAT REALLY HAPPENS... [Figure] Distribution of $T_t(2)/t$ plotted against time. Bandit: $R_t(1) \sim \mathrm{Ber}(0.5)$, $R_t(2) =$

33 EXTENSIONS
Switching costs [Agrawal et al., 1988]
Dependent rewards [Lai and Yakowitz, 1995]
Continuous action spaces: discretization [Kleinberg, 2004, Auer et al., 2007b]; tree-based methods [Kocsis and Szepesvári, 2006, Bubeck et al., 2008]
Linear mean-payoff $Q(\cdot)$ [Dani et al., 2008]
Drifts [Garivier and Moulines, 2008]
Side information (see below)
Feedback: MDPs (see below)

34 SUMMARY
Forcing, ε-greedy: the exploration schedule provides a lower bound on the regret; requires tuning of the loss tolerance.
OFU-based algorithms: no fixed schedule for exploration, hence adaptivity; advantage: minimal tuning.
You can mix these ideas!

35 BANDITS WITH SIDE INFORMATION ("ASSOCIATIVE BANDITS")
PROTOCOL OF LEARNING Perception-action loop:
1. Agent senses $Y_t$ coming from the Environment
2. Agent sends action $A_t$ to the Environment
3. Agent receives reward $R_t$ from the Environment
4. $t := t + 1$, go to Step 1
ASSUMPTION Sensations are not influenced by the agent's decisions.

36 VARIANTS
$Y_t$ is from an $m$-element set $\mathcal{Y}$: $m$ independent bandit problems
Dependent payoffs, e.g., $Q : \mathcal{Y} \times \mathcal{A} \to \mathbb{R}$ with $Q(\cdot, a)$ smooth [Yang and Zhu, 2002] or linear [Auer, 2003]
Continuous action sets
Delayed feedback [György et al., 2007]

37 ONLINE LEARNING IN MDPS
Assumption: the Environment is a finite MDP $(\mathcal{X}, \mathcal{A}, p, r)$.
PROTOCOL OF LEARNING Perception-action loop:
1. Agent senses state $X_t \sim p(\cdot \mid X_{t-1}, A_{t-1})$ of the Environment
2. Agent sends action $A_t$ to the Environment
3. Agent receives reward $R_t$ from the Environment ($E[R_t \mid X_t = x, A_t = a] = r(x,a)$)
4. $t := t + 1$, go to Step 1
Goal: minimize the regret $L_n = n\lambda^* - \sum_{t=1}^n R_t$, where $\lambda^*$ is the best possible reward per time step.

38 AVERAGE REWARD MDPS
ASSUMPTION Irreducibility of the Markov chains under stationary policies.
AVERAGE REWARD BELLMAN OPTIMALITY EQUATIONS (BOE) Find $\lambda^* \in \mathbb{R}$, $h^* : \mathcal{X} \to \mathbb{R}$ such that
$\lambda^* + h^*(x) = \max_{a \in A(x)} \big[ r(x,a) + \langle p_x(a), h^* \rangle \big] = \max_{a \in A(x)} Q^*(x,a), \quad x \in \mathcal{X}$.
RESTRICTED PROBLEM Restricting the actions to $D(x) \subset A(x)$ yields the solution $h_D$, $Q_D$.

39 OPTIMISTIC LINEAR PROGRAMMING [TEWARI AND BARTLETT, 2007]
ALGORITHM OLP
1. Update $\hat p$, the estimate of the transition probabilities
2. $D(x) := \{a \in A(x) : T_t(x,a) \ge \log^2 T_t(x)\}$, $x \in \mathcal{X}$ // keep well-sampled actions only
3. $(\lambda, h, Q) := \mathrm{BOE}(D, \hat p)$ // solve the Bellman equations
4. $\mathrm{UCI}(a) := \sup\big\{ r(X_t, a) + \langle q, h \rangle : \|q\|_1 = 1,\ \|\hat p_{X_t}(a) - q\|_1 \le \sqrt{\tfrac{2 \log t}{T_t(X_t, a)}} \big\}$
5. $\Gamma := \{a \in A(X_t) : Q(X_t, a) = \lambda + h(X_t),\ T_t(X_t, a) < \log^2(T_t(X_t) + 1)\}$
6. if $A(X_t) = \Gamma$ /* all actions are in danger */ then let $A_t$ be some element of $\Gamma$, else $A_t := \operatorname{argmax}_{a \in A(X_t)} \mathrm{UCI}(a)$.

40 REGRET BOUNDS FOR OLP
$\Delta^*(x,a) = h^*(x) + \lambda^* - Q^*(x,a)$
$Z_\epsilon(x,a) = \{q : \|q\|_1 = 1,\ r(x,a) + \langle q, h^* \rangle \ge \lambda^* + h^*(x) - \epsilon\}$
$C = \{(x,a) : Q^*(x,a) < \lambda^* + h^*(x),\ \exists\, \epsilon > 0 : Z_\epsilon(x,a) \ne \emptyset\}$
$J_{x,a}(p; \epsilon) = \inf\{\|p - q\|_1 : q \in Z_\epsilon(x,a)\}$
$K(x,a) = \lim_{\epsilon \to 0} J_{x,a}(p_x(a); \epsilon)$
$H = 2 \sum_{(x,a) \in C} \frac{\Delta^*(x,a)}{K(x,a)^2}$
THEOREM ([TEWARI AND BARTLETT, 2007]) Let the MDP $M$ be irreducible. If $L_n$ is the regret of OLP after $n$ steps then $\limsup_{n\to\infty} \frac{L_n}{\log n} \le H$.

41 INTERPRETATION
PROPOSITION Assume $A(x) = A$ for all $x \in \mathcal{X}$. Let $\Delta = \min_{(x,a) \in C} \Delta^*(x,a)$. Then $H \le 2\,|\mathcal{X}|\,|A|\,\|h^*\|^2\,\Delta^{-1}$.
COMMENT The algorithm closely follows that of [Burnetas and Katehakis, 1997], who proved asymptotic efficiency for their policy.

42 THE UCRL2 ALGORITHM [AUER ET AL., 2007A]
MOTIVATION Explore OFU in MDPs; high-probability bounds for finite-horizon problems.
ALGORITHM UCRL2(δ, n)
Phase initialization:
1. Estimate the mean model $\hat p_t$ using maximum likelihood
2. $U := \big\{ p : \|p(\cdot \mid x,a) - \hat p_t(\cdot \mid x,a)\|_1 \le c\sqrt{\tfrac{|\mathcal{X}| \log(|\mathcal{A}| n/\delta)}{T_t(x,a)}} \big\}$
3. $p^* := \operatorname{argmax}_{p \in U} \lambda^*(p)$, $\pi := \pi^*(p^*)$
4. $C(x,a) := T_t(x,a)$
Execution of a phase:
1. Execute $\pi$ until some $(x,a)$ gets visited more than $C(x,a)$ times.

43 UCRL2 RESULTS
DEFINITION (DIAMETER) Let $M$ be an MDP. The diameter of $M$ is $D(M) = \max_{x,y} \min_\pi E[T(x \to y; \pi)]$.
Results:
Lower bound: $E[L_n] = \Omega\big(\sqrt{D |\mathcal{X}| |\mathcal{A}| n}\big)$
Upper bounds: w.p. $1 - \delta/n$, $L_n \le O\big(D |\mathcal{X}| \sqrt{|\mathcal{A}| n \log(|\mathcal{A}| n/\delta)}\big)$; w.p. $1 - \delta$, $L_n \le O\big(\tfrac{D^2 |\mathcal{X}|^2 |\mathcal{A}| \log(|\mathcal{A}| n/\delta)}{\Delta}\big)$, where $\Delta = \min_{\pi : \lambda^\pi < \lambda^*} (\lambda^* - \lambda^\pi)$.

44 CONCLUSIONS
Exploration is necessary ... but should be controlled in a wise manner.
Tools: forced exploration, Softmax, optimism in the face of uncertainty.
After 50 years still lots of work to be done! Scaling up, non-stationary environments, dependencies.

45 REFERENCES I
Agrawal, R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27.
Agrawal, R., Hegde, M., and Teneketzis, D. (1988). Asymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching cost. IEEE Transactions on Automatic Control, 33(10).
Audibert, J.-Y., Munos, R., and Szepesvári, C. (2007). Tuning bandit algorithms in stochastic environments. In Algorithmic Learning Theory - 18th International Conference, ALT 2007. Springer.
Auer, P. (2003). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3.
Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite-time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3).
Auer, P., Jaksch, T., and Ortner, R. (2007a). Near-optimal regret bounds for reinforcement learning. Technical report, University of Leoben.
Auer, P., Ortner, R., and Szepesvári, C. (2007b). Improved rates for the stochastic continuum-armed bandit problem. In Proceedings of the 20th Annual Conference on Learning Theory (COLT-07).

46 REFERENCES II
Bubeck, S., Munos, R., and Szepesvári, C. (2008). Online optimization in X-armed bandits. In NIPS-21.
Burnetas, A. and Katehakis, M. (1997). Optimal adaptive policies for Markov Decision Processes. Mathematics of Operations Research, 22(1).
Dani, V., Hayes, T., and Kakade, S. (2008). Stochastic linear optimization under bandit feedback. COLT.
Garivier, A. and Moulines, E. (2008). On upper-confidence bound policies for non-stationary bandit problems. Technical report, LTCI.
György, A., Kocsis, L., Szabó, I., and Szepesvári, C. (2007). Continuous time associative bandit problems. In IJCAI-07.
Kleinberg, R. (2004). Nearly tight bounds for the continuum-armed bandit problem. In NIPS.
Kocsis, L. and Szepesvári, C. (2006). Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning (ECML-2006).
Lai, T. and Yakowitz, S. (1995). Machine learning and nonparametric bandit theory. IEEE Transactions on Automatic Control, 40.

47 REFERENCES III
Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4-22.
Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58.
Tewari, A. and Bartlett, P. (2007). Optimistic linear programming gives logarithmic regret for irreducible MDPs. In NIPS-20.
Yang, Y. and Zhu, D. (2002). Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates. Annals of Statistics, 30(1).
