Csaba Szepesvári 1. University of Alberta. Machine Learning Summer School, Ile de Re, France, 2008
1 LEARNING THEORY OF OPTIMAL DECISION MAKING PART I: ON-LINE LEARNING IN STOCHASTIC ENVIRONMENTS Csaba Szepesvári, Department of Computing Science, University of Alberta. Machine Learning Summer School, Ile de Re, France, 2008. With thanks to: RLAI group, SZTAKI group, Jean-Yves Audibert, Rémi Munos
2 OUTLINE 1 HIGH LEVEL OVERVIEW OF THE TALKS 2 MOTIVATION What is it? Why should we care? The challenge 3 BANDITS Forcing ɛ-greedy Softmax Optimism in the Face of Uncertainty 4 BANDITS WITH SIDE INFORMATION 5 ONLINE LEARNING IN MDPS 6 CONCLUSIONS
3 MOTTO A theory is something nobody believes, except the person who made it. An experiment is something everybody believes, except the person who made it. (Einstein)
4 HIGH LEVEL OVERVIEW OF THE TALKS Day 1: Online learning in stochastic environments. Day 2: Batch learning in Markovian Decision Processes. Day 3: Online learning in adversarial environments
5 REFRESHER Union bound: $P(A \cup B) \le P(A) + P(B)$. Inclusion: $A \subset B \Rightarrow P(A) \le P(B)$. Inversion: if $P(X > t) \le F(t)$ holds for all $t$, then for any $0 < \delta < 1$, w.p. $1-\delta$, $X \le F^{-1}(\delta)$. Expectation: if $X \ge 0$, $E[X] = \int_0^\infty P(X \ge t)\,dt$
6 WHAT IS IT? PROTOCOL OF LEARNING Concepts: Agent, Environment, sensations, actions, rewards. Time: t = 1, 2, ... Perception-action loop: 1 Agent senses $Y_t$ coming from the Environment 2 Agent sends action $A_t$ to the Environment 3 Agent receives reward $R_t$ from the Environment 4 t := t + 1, go to Step 1. Goal: maximize $\sum_{t=1}^T R_t$. RELATED PROBLEMS Cost-sensitive learning: actions are predictions. Passive (batch) learning: no influence on the training data. Active learning: learn fast (instead of cheaply)
7 WHY SHOULD WE CARE? Online learning: the opportunity to keep improving; can learn with fewer data points; learning should be cheap. In stochastic environments: convenience, ...
8 THE CHALLENGE: EXPLORE OR EXPLOIT? CLINICAL TRIALS Drugs: $i \in I \stackrel{\mathrm{def}}{=} \{1, 2, \ldots, k\}$. Protocol: 1 Choose drug $I_t \in I$ for the next patient 2 Observe response $R_t(I_t) \in \{0, 1\}$ of the patient 3 t := t + 1, go to Step 1. Estimated response of drug i after n steps: $Q_n(i) = \frac{\sum_{t=1}^n \mathbb{I}\{I_t = i\}\, R_t(i)}{\sum_{t=1}^n \mathbb{I}\{I_t = i\}}$. WHICH DRUG TO CHOOSE? Best response so far? (greedy choice; exploit; optimize return) Least explored? (explore; collect information)???
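The estimate $Q_n(i)$ above is just a per-arm running average. A minimal sketch in Python (helper names such as `make_estimator` and `update` are ours, not from the slides):

```python
# Incremental estimate of the empirical mean payoff from the slide:
# Q_n(i) = sum_t I{I_t = i} R_t(i) / sum_t I{I_t = i}.

def make_estimator(k):
    """Per-arm pull counts and empirical mean estimates for k arms."""
    return [0] * k, [0.0] * k

def update(counts, means, arm, reward):
    """Fold one observed reward into the arm's running average."""
    counts[arm] += 1
    # Incremental mean: Q_n(i) = Q_{n-1}(i) + (R - Q_{n-1}(i)) / n
    means[arm] += (reward - means[arm]) / counts[arm]

counts, means = make_estimator(2)
for r in (1, 0, 1, 1):      # four responses observed for drug 0
    update(counts, means, 0, r)
print(counts[0], means[0])  # 4 pulls, empirical mean 0.75
```

The incremental form avoids storing past rewards and is numerically equivalent to the ratio in the slide.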
9 BANDIT PROBLEMS BANDIT PROBLEM [ROBBINS, 1952] Choices: $i \in I \stackrel{\mathrm{def}}{=} \{1, 2, \ldots, k\}$. Protocol: 1 Choose an option $I_t \in I$ 2 Observe response $R_t(I_t) \in \{0, 1\}$ 3 t := t + 1, go to Step 1. Terminology: option = arm = action. ASSUMPTION $R_t(i) \sim P_i(\cdot)$, independence within and between the arms
10 EXAMPLES Clinical trials. Web advertising. Job shop scheduling. ...
11 SOME DEFINITIONS Expected payoffs: $Q(i) = E[R_1(i)]$. Optimal arm: $Q(i^*) = Q^* \stackrel{\mathrm{def}}{=} \max_i Q(i)$. Set of optimal arms: $I^* = \{i : Q(i) = Q^*\}$. Payoff loss ("gap"): $\Delta_i = Q^* - Q(i)$. Total reward up to time n: $R_{1:n} = \sum_{t=1}^n R_t(I_t)$. A bandit problem instance: B. Class of bandit problems: $\mathcal{B}$
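The definitions above can be packed into a tiny simulator. A sketch assuming Bernoulli payoffs (the class name and its fields are illustrative); the regret is computed via the standard identity $E[L_n] = \sum_i \Delta_i\, E[T_n(i)]$, which also underlies the proof on the next slide:

```python
import random

# A toy Bernoulli bandit instance B illustrating Q(i), Q*, the gaps
# Delta_i = Q* - Q(i), and the regret identity L_n = sum_i Delta_i T_n(i).

class BernoulliBandit:
    def __init__(self, means):
        self.means = list(means)                           # Q(i) for each arm
        self.q_star = max(self.means)                      # Q*
        self.gaps = [self.q_star - q for q in self.means]  # Delta_i
        self.pulls = [0] * len(means)                      # T_n(i)

    def pull(self, i):
        """Play arm i, return a Bernoulli(Q(i)) reward."""
        self.pulls[i] += 1
        return 1 if random.random() < self.means[i] else 0

    def pseudo_regret(self):
        """sum_i Delta_i * T_n(i): the (pseudo-)regret of the play so far."""
        return sum(d * t for d, t in zip(self.gaps, self.pulls))

b = BernoulliBandit([0.5, 0.9, 0.7])
print(b.q_star, b.gaps)  # gaps approx. [0.4, 0.0, 0.2]
```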
12 DEFINING THE GOALS DEFINITION (CONSISTENCY) A bandit algorithm A is strongly consistent on $\mathcal{B}$ if $\lim_{t\to\infty} P(I_t \in I^*) = 1$ holds when A is run on any instance from $\mathcal{B}$. DEFINITION (HANNAN-CONSISTENCY ("NO-REGRET")) A bandit algorithm A is Hannan-consistent on $\mathcal{B}$ if its expected regret $L_n$ is sublinear over time: $L_n \stackrel{\mathrm{def}}{=} n Q^* - E[R_{1:n}] = o(n)$ when A is run on any instance from $\mathcal{B}$.
13 STRONG VS. HANNAN-CONSISTENCY PROPOSITION If an algorithm is strongly consistent then it is also Hannan-consistent. PROOF. Let $a_t = E[Q^* - R_t(I_t)]$. Then $a_t = \sum_i E[Q^* - R_t(i) \mid I_t = i]\, P(I_t = i) = \sum_i \Delta_i\, P(I_t = i) = \sum_{i:\Delta_i > 0} \Delta_i\, P(I_t = i)$. Hence $a_t \to 0$. Cesàro: $\frac{1}{n}\sum_{t=1}^n a_t \to 0$. PROPOSITION The reverse does not hold.
14 EXPLORATION IS NECESSARY DEFINITION A bandit problem is non-trivial if for $i^*$ optimal and $i$ suboptimal, $R_t(i^*) < R_t(i)$ holds with positive probability. DEFINITION An algorithm stops exploring on problem B if there exists a time n such that for all $t > n$ the algorithm only chooses exploiting actions: $Q_t(I_t) = \max_i Q_t(i)$. Time n may depend on B. PROPOSITION Let $\mathcal{B}$ contain a non-trivial bandit problem. If an algorithm stops exploring, it cannot be consistent on $\mathcal{B}$.
15 METHODS FOR ACHIEVING CONSISTENCY Forcing (fixed schedule); ɛ-greedy; Softmax; Optimism in the Face of Uncertainty
16 FORCING IDEA Exploration is necessary, so make sure that every arm is selected infinitely often. $T_n(i) = \sum_{t=1}^n \mathbb{I}\{I_t = i\}$: number of times arm i is chosen up to time n. FORCING ALGORITHM($(f_t)_{t\in\mathbb{N}}$) At time t do: 1 $i_0 := \mathrm{argmin}_i T_t(i)$ 2 if $T_t(i_0) < f_t$ then $I_t := i_0$, else $I_t := \mathrm{argmax}_i Q_t(i)$. COMMENT Other possibility: periodic forcing (forcing with a fixed timing)
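One step of the Forcing Algorithm can be sketched as follows; the schedule $f_t = c \log t$ anticipates the regret analysis a few slides below (function and variable names are ours):

```python
import math

# One step of the Forcing Algorithm with schedule f_t = c*log(t):
# if the least-tried arm has fewer than f_t pulls, force it;
# otherwise act greedily on the empirical means Q_t.

def forcing_step(t, counts, means, c=2.0):
    """Return the arm index to play at time t (t >= 1)."""
    f_t = c * math.log(max(t, 1))                 # forcing schedule, -> infinity
    i0 = min(range(len(counts)), key=lambda i: counts[i])
    if counts[i0] < f_t:
        return i0                                 # explore: forced pull
    return max(range(len(means)), key=lambda i: means[i])  # exploit

print(forcing_step(3, [0, 5], [0.1, 0.9]))  # -> 0: the under-sampled arm is forced
print(forcing_step(3, [5, 5], [0.1, 0.9]))  # -> 1: enough samples, play greedily
```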
17 REGRET OF THE FORCING ALGORITHM PROPOSITION Let $f_t \ge 0$ and $\lim_{t\to\infty} f_t = \infty$. Then the Forcing Algorithm is Hannan-consistent, i.e., $L_n/n \to 0$. (The Forcing Algorithm is not strongly consistent.) ... but what is the growth rate of $L_n$?
18 CENTRAL LIMIT THEOREM $P\!\left(\sqrt{\tfrac{n}{\sigma^2}}\,(\bar X_n - E[X_1]) \ge y\right) \to 1 - \Phi(y) \le \frac{1}{\sqrt{2\pi}\,y}\, e^{-y^2/2}$, hence $P\!\left(\bar X_n \ge E[X_1] + \epsilon\right) = P\!\left(\sqrt{\tfrac{n}{\sigma^2}}\,(\bar X_n - E[X_1]) \ge \sqrt{\tfrac{n}{\sigma^2}}\,\epsilon\right) \lessapprox \sqrt{\frac{\sigma^2}{n\epsilon^2}}\; e^{-n\epsilon^2/(2\sigma^2)}$.
19 THEOREM (HOEFFDING'S INEQUALITY) Let $X_i \in [0,1]$ be i.i.d., $\mu = E[X_1]$, $\bar X_n = \frac{1}{n}\sum_{t=1}^n X_t$. Then $P(\bar X_n \ge \mu + \epsilon) \le e^{-2n\epsilon^2}$ and $P(\bar X_n \le \mu - \epsilon) \le e^{-2n\epsilon^2}$. COROLLARY (HOEFFDING'S BOUND IN DEVIATION FORM) Let $\delta > 0$. With probability $1-\delta$, $\bar X_n \le \mu + \sqrt{\frac{\log(1/\delta)}{2n}}$. Similarly, with probability $1-\delta$, $\bar X_n \ge \mu - \sqrt{\frac{\log(1/\delta)}{2n}}$.
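The deviation form is easy to sanity-check numerically: over many repetitions, the fraction of runs in which the empirical mean escapes the confidence radius should stay below $\delta$. A quick simulation (all parameter values are chosen arbitrarily for illustration):

```python
import math, random

# Empirical check of Hoeffding's bound in deviation form: the empirical mean
# of n Bernoulli(mu) samples should exceed mu + sqrt(log(1/delta)/(2n))
# in at most (roughly) a delta fraction of independent runs.

random.seed(0)
mu, n, delta, reps = 0.5, 200, 0.05, 2000
eps = math.sqrt(math.log(1 / delta) / (2 * n))   # deviation-form radius

violations = 0
for _ in range(reps):
    mean = sum(random.random() < mu for _ in range(n)) / n
    if mean > mu + eps:
        violations += 1

print(violations / reps, "<=", delta)  # observed failure rate vs. delta
```

In practice the observed rate is well below $\delta$; Hoeffding's inequality is loose, which is part of the motivation for the variance-aware bounds later in the talk.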
20 HEURISTIC ANALYSIS Assume that $R_t(i) \in [0,1]$. Then (handwaving) w.p. $1-\delta_t$, for all $i \in \{1,\ldots,k\}$, $|Q_t(i) - Q(i)| \le c_t \stackrel{\mathrm{def}}{=} \sqrt{\frac{\log(2k/\delta_t)}{2 f_t}}$, since we explored each arm at least $f_t$ times. Counting mistakes: 1 When forcing: total contribution to $L_n$ is at most $k f_n$. 2 When the sample is "atypical": $\sum_{t=1}^n \delta_t$. 3 When the sample is "typical": no mistake when $c_t < \Delta/2$, where $\Delta \stackrel{\mathrm{def}}{=} \min_{j:\Delta_j>0} \Delta_j$; (*) this holds when $f_t \ge 2\log(2k/\delta_t)/\Delta^2$. Hence $L_n \le k f_n + \sum_{t=1}^n \delta_t + O(1) = O(\log n)$ if $\delta_t = 1/t$ and $f_t = c\log(t)$ with $c = \Omega(\Delta^{-2})$.
21 REGRET OF THE FORCING ALGORITHM THEOREM Assume that the payoffs are bounded. If $f_t = c\log(t)$ with $c = \Omega(\Delta^{-2})$ then the regret of the Forcing Algorithm grows logarithmically: $L_n = O(\log n)$.
22 THE ɛ-GREEDY ALGORITHM ɛ-GREEDY($(\epsilon_t)_{t\in\mathbb{N}}$) At time t do: 1 Draw $U_t$ in $[0,1]$ uniformly at random 2 if $U_t < \epsilon_t$ then pick $I_t$ randomly from $\{1,2,\ldots,k\}$ 3 else $I_t := \mathrm{argmax}_i Q_t(i)$. THEOREM (REGRET OF ɛ-GREEDY [AUER ET AL., 2002]) Assume that the payoffs are bounded. If $\epsilon_t = c/t$ with $c = \Omega(k/\Delta^2)$ then $P(I_n \notin I^*) \le O(c/n)$ and $L_n = O(c\log n)$.
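A sketch of one ɛ-greedy step with the decaying schedule $\epsilon_t = \min(1, c/t)$ from the theorem (the constant $c$ and all names are ours):

```python
import random

# One step of epsilon-greedy with the decaying schedule eps_t = min(1, c/t):
# with probability eps_t explore uniformly, otherwise exploit the best
# empirical mean. Early on eps_t = 1, so every step explores.

def eps_greedy_step(t, means, c=5.0, rng=random):
    eps_t = min(1.0, c / t)
    if rng.random() < eps_t:
        return rng.randrange(len(means))                       # explore
    return max(range(len(means)), key=lambda i: means[i])      # exploit

random.seed(1)
choice = eps_greedy_step(100, [0.2, 0.8])  # at t=100, eps_t = 0.05: mostly greedy
print(choice)
```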
23 SOFTMAX ALGORITHMS Problem with the previous algorithms: when exploring, they are indifferent about $Q_t(i)$. BOLTZMANN($(\tau_t)_{t\in\mathbb{N}}$) At time t do: 1 Let $w_t(i) = \exp(Q_t(i)/\tau_t)$, $i \in \{1,2,\ldots,k\}$ 2 Let $p_t(i) = \frac{w_t(i)}{\sum_j w_t(j)}$ 3 Draw $I_t$ from $p_t(\cdot)$. COMMENTS $\tau_t \to 0$; "computational temperature". Aka exponential weights algorithm, Gibbs-selection. Plays a big role in adversarial environments
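The Boltzmann rule is a softmax over the empirical means; subtracting the maximum before exponentiating avoids overflow at low temperatures. A sketch (helper names are ours):

```python
import math, random

# Numerically stable Boltzmann (softmax) arm probabilities:
# p_t(i) = exp(Q_t(i)/tau) / sum_j exp(Q_t(j)/tau), tau = temperature.

def boltzmann_probs(means, tau):
    m = max(q / tau for q in means)                  # shift for stability
    w = [math.exp(q / tau - m) for q in means]
    s = sum(w)
    return [x / s for x in w]

def boltzmann_draw(means, tau, rng=random):
    return rng.choices(range(len(means)), weights=boltzmann_probs(means, tau))[0]

p_hot = boltzmann_probs([0.2, 0.8], tau=10.0)   # high tau: near-uniform
p_cold = boltzmann_probs([0.2, 0.8], tau=0.05)  # low tau: near-greedy
print(p_hot, p_cold)
```

As $\tau_t \to 0$ the draw concentrates on the greedy arm, which is exactly why the cooling schedule must be tuned with care.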
24 OFU: OPTIMISM IN THE FACE OF UNCERTAINTY IDEA [LAI AND ROBBINS, 1985] Choose the arm with the best potential. Assumption: $R_t(i) \sim p_{\theta_i}(\cdot)$, $\theta_i \in \mathbb{R}$ unknown, $p$ known. Mean payoff: $Q(\theta) = \int r\, p_\theta(r)\,dr$. Uncertainty set: $U_{i,t} = \{\theta : \log P(R_1(i),\ldots,R_{T_i(t)}(i) \mid \theta)$ is "large"$\}$; how "large" depends on $t$, $T_i(t)$. Upper confidence index for arm i: $\mathrm{UCI}_t(i) = \max_{\theta \in U_{i,t}} Q(\theta)$. Algorithm UCI: $I_t := \mathrm{argmax}_i \mathrm{UCI}_t(i)$. Two reasons we select an arm: (i) the associated uncertainty is big, (ii) the arm looks good. Is this a good algorithm??
25 REGRET BOUND FOR UCI [LAI AND ROBBINS, 1985] THEOREM For any suboptimal arm i, $E[T_n(i)] \le \left(\frac{1}{D(p_i \| p^*)} + o(1)\right)\log(n)$, where $p_i = p_{\theta_i}$, $D(p_i \| p^*) = \int p_i(x)\log\frac{p_i(x)}{p^*(x)}\,dx$, and $p^*$ is the distribution of an optimal arm. COROLLARY $L_n \le \sum_{i:\Delta_i>0} \Delta_i \left\{\frac{1}{D(p_i\|p^*)} + o(1)\right\} \log n$.
26 A LOWER BOUND [LAI AND ROBBINS, 1985] $B = (p_1,\ldots,p_K)$; $L_n^A(B)$: regret of A when run on problem B. DEFINITION (UNIFORMLY GOOD ALGORITHMS) Algorithm A is uniformly good if $L_n^A(B) = o(n^a)$ holds for all $a > 0$ and reward distributions $B = (p_1,\ldots,p_K) \in \mathcal{B}$. This is a minimum requirement! THEOREM (LOWER BOUND) If A is uniformly good then for any $B = (p_1,\ldots,p_K) \in \mathcal{B}$ and any suboptimal arm i, $E[T_i(n)] \ge \left(\frac{1}{D(p_i\|p^*)} - o(1)\right)\log n$. COROLLARY UCI algorithms are asymptotically efficient.
27 UCB: A NONPARAMETRIC IMPLEMENTATION OF OFU Goal: avoid parametric distributions! How to implement the OFU principle? What is the potential of an arm? We need upper estimates of its mean payoff! [Agrawal, 1995]: large-deviation theory, asymptotic results. [Auer et al., 2002]: avoid asymptotics, use Hoeffding's inequality! Hoeffding: w.h.p. $Q(i) \le Q_t(i) + \sqrt{\frac{\log(k/\delta_t)}{2T_t(i)}}$, hence $\mathrm{UCI}_t(i) := Q_t(i) + \sqrt{\frac{p\log(k/\delta_t)}{2T_t(i)}}$, where $p > 1$ and $\delta_t \to 0$ are tuning parameters
28 ALGORITHM UCB1 UCB1(p) At time t do: 1 $\mathrm{UCI}_t(i) := Q_t(i) + \sqrt{\frac{p\log(t)}{2T_t(i)}}$ 2 $I_t := \mathrm{argmax}_i \mathrm{UCI}_t(i)$. THEOREM (UCB1 REGRET) Let $0 \le R_t(i) \le 1$ be i.i.d., independent between the arms. Then the regret of UCB1 satisfies $E[L_n] \le 2p\Big(\sum_{i:\Delta_i>0}\frac{1}{\Delta_i}\Big)\log(n) + \frac{p}{p-2}\sum_{i=1}^K \Delta_i$. (Slightly) improved bound compared to [Auer et al., 2002]. One can show that $1/D(p_j\|p^*) \le 1/(2\Delta_j^2)$. Still far from the best possible constant (1/2).
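A minimal UCB1-style sketch, playing each arm once before using the index (the initialization rule and all names are ours; $p > 1$ is the tuning parameter from the slide):

```python
import math

# UCB1-style choice: UCI_t(i) = Q_t(i) + sqrt(p*log(t) / (2*T_t(i))),
# after an initialization phase that tries every arm once.

def ucb1_choose(t, counts, means, p=2.0):
    for i, n_i in enumerate(counts):
        if n_i == 0:
            return i                          # initialize: pull untried arms first
    def index(i):
        return means[i] + math.sqrt(p * math.log(t) / (2 * counts[i]))
    return max(range(len(means)), key=index)

# The bonus shrinks as T_t(i) grows, so a rarely-tried, slightly-worse arm
# can beat a well-sampled one: optimism in the face of uncertainty.
print(ucb1_choose(100, [95, 4], [0.60, 0.55]))  # -> 1 (large bonus on arm 1)
```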
29 UCBTUNED: MOTIVATION Assume that rewards are in $[a, a+b]$. UCB1 needs to be adjusted: $\mathrm{UCI}_t(i) := Q_t(i) + b\sqrt{\frac{p\log(t)}{2T_t(i)}}$. Regret bound: $L_n \le 2pb^2\Big(\sum_{i:\Delta_i>0}\frac{1}{\Delta_i}\Big)\log(n) + \frac{pb}{p-2}\sum_{i=1}^K \Delta_i$. Problem: this is outrageously big when $\mathrm{Var}[R_t(i)] \ll b^2$!
30 UCBTUNED: ALGORITHM Idea: estimate the variance and use it to define the upper index! THEOREM (EMPIRICAL BERNSTEIN BOUND [AUDIBERT ET AL., 2007]) Let $a \le X_t \le a+b$ be i.i.d., $t > 2$. Let $\bar X_t$ be the empirical mean of $X_1,\ldots,X_t$ and $V_t = \frac{1}{t}\sum_{s=1}^t (X_s - \bar X_t)^2$ be the empirical variance estimate. Then for any $0 < \delta < 1$, w.p. at least $1-\delta$, $\bar X_t \le E[X_1] + \sqrt{\frac{2V_t x_\delta}{t}} + \frac{3b x_\delta}{t}$, where $x_\delta = \log(3/\delta)$. UCBTUNED(p) At time t do: 1 $\mathrm{UCI}_t(i) := Q_t(i) + \sqrt{\frac{2V_t(i)\, p\log t}{T_t(i)}} + \frac{3bp\log t}{T_t(i)}$ 2 $I_t := \mathrm{argmax}_i \mathrm{UCI}_t(i)$
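The UCBTuned index combines the empirical-variance term with the $3b x_\delta / t$ correction from the empirical Bernstein bound. A sketch of just the index computation, with made-up example numbers (all names are ours):

```python
import math

# UCBTuned-style index from the empirical Bernstein bound:
# UCI_t(i) = Q_t(i) + sqrt(2*V_t(i)*p*log(t)/T_t(i)) + 3*b*p*log(t)/T_t(i),
# where b bounds the payoff range and V_t(i) is the empirical variance.

def ucbtuned_index(t, n_i, mean_i, var_i, b=1.0, p=1.2):
    x = p * math.log(t)
    return mean_i + math.sqrt(2 * var_i * x / n_i) + 3 * b * x / n_i

# Low-variance arms get a much smaller exploration bonus than
# the variance-blind Hoeffding-based index would give them.
hi = ucbtuned_index(1000, 100, 0.5, var_i=0.25)  # worst-case variance
lo = ucbtuned_index(1000, 100, 0.5, var_i=0.01)  # nearly deterministic arm
print(lo, "<", hi)
```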
31 UCBTUNED: REGRET BOUND THEOREM (UCBTUNED REGRET [AUDIBERT ET AL., 2007]) Let $0 \le R_t(i) \le b$ be i.i.d., independent between the arms, $p > 1$, $\sigma_i^2 = \mathrm{Var}[R_t(i)]$. Then the regret of UCBTuned(p) satisfies $E[L_n] \le c_p \sum_{i:\Delta_i>0}\left(\frac{\sigma_i^2}{\Delta_i} + 2b\right)\log n$. In particular, when $p = 1.2$, $c_p = 10$.
32 WHAT REALLY HAPPENS... [Figure] Distribution of $T_t(2)/t$ plotted against time. Bandit: $R_t(1) \sim \mathrm{Ber}(0.5)$, $R_t(2) =$
33 EXTENSIONS Switching costs [Agrawal et al., 1988]. Dependent rewards [Lai and Yakowitz, 1995]. Continuous action spaces: discretization [Kleinberg, 2004, Auer et al., 2007b], tree-based methods [Kocsis and Szepesvári, 2006, Bubeck et al., 2008]. Linear mean-payoff $Q(\cdot)$ [Dani et al., 2008]. Drifts [Garivier and Moulines, 2008]. Side information: see below. Feedback, MDPs: see below
34 SUMMARY Forcing, ɛ-greedy: the exploration schedule provides a lower bound on the regret; requires tuning of the loss tolerance. OFU-based algorithms: no fixed schedule for exploration, hence adaptivity; advantage: minimal tuning. You can mix these ideas!
35 BANDITS WITH SIDE INFORMATION ( ASSOCIATIVE BANDITS) PROTOCOL OF LEARNING Perception-action loop: 1 Agent senses Y t coming from the Environment 2 Agent sends action A t to the Environment 3 Agent receives reward R t from the Environment 4 t := t + 1, go to Step 1 ASSUMPTION Sensations are not influenced by the agent s decisions
36 VARIANTS $Y_t$ is from an m-element set $\mathcal{Y}$: m independent bandit problems. Dependent payoffs; e.g., $Q : \mathcal{Y} \times \mathcal{A} \to \mathbb{R}$ with $Q(\cdot, a)$ smooth [Yang and Zhu, 2002] or linear [Auer, 2003]. Continuous action sets. Delayed feedback [György et al., 2007]
37 ONLINE LEARNING IN MDPS Assumption: the Environment is a finite MDP $(X, A, p, r)$. PROTOCOL OF LEARNING Perception-action loop: 1 Agent senses state $X_t \sim p(\cdot \mid X_{t-1}, A_{t-1})$ of the Environment 2 Agent sends action $A_t$ to the Environment 3 Agent receives reward $R_t$ from the Environment ($E[R_t \mid X_t = x, A_t = a] = r(x,a)$) 4 t := t + 1, go to Step 1. Goal: minimize the regret $L_n = n\lambda^* - \sum_{t=1}^n R_t$, where $\lambda^*$ is the best possible reward per time step
38 AVERAGE REWARD MDPS ASSUMPTION Irreducibility of the Markov chains under stationary policies. AVERAGE REWARD BELLMAN OPTIMALITY EQUATIONS (BOE) Find $\lambda^* \in \mathbb{R}$, $h^* : X \to \mathbb{R}$ such that $\lambda^* + h^*(x) = \max_{a \in A(x)} \big[r(x,a) + \langle p_x(a), h^*\rangle\big] = \max_{a \in A(x)} Q^*(x,a)$, $x \in X$. RESTRICTED PROBLEM $D(x) \subset A(x)$: $h_D^*$, $Q_D^*$
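On small MDPs the BOE above can be solved numerically by relative value iteration: apply the Bellman operator and subtract the value at a reference state so the iterates stay bounded. A sketch on a made-up two-state, two-action MDP (all transition and reward numbers are illustrative, not from the slides):

```python
# Relative value iteration for the average-reward Bellman optimality
# equations on a toy MDP. Pinning h at a reference state (state 0) keeps
# iterates bounded; the pinned value converges to lambda*.

# p[x][a] = next-state distribution, r[x][a] = expected reward
p = [[[0.9, 0.1], [0.2, 0.8]],
     [[0.5, 0.5], [0.1, 0.9]]]
r = [[1.0, 0.0],
     [0.0, 2.0]]

h = [0.0, 0.0]
for _ in range(2000):
    # Bellman operator: q(x,a) = r(x,a) + <p_x(a), h>
    q = [[r[x][a] + sum(pr * h[y] for y, pr in enumerate(p[x][a]))
          for a in range(2)] for x in range(2)]
    v = [max(q[x]) for x in range(2)]
    lam = v[0]                        # value at the reference state -> lambda*
    h = [v[x] - lam for x in range(2)]

print(round(lam, 4), [round(x, 4) for x in h])  # approx. 1.7778, [0.0, 2.2222]
```

For this toy instance one can check the BOE by hand: the optimal policy takes action 1 in both states, giving $\lambda^* = 16/9 \approx 1.7778$ and $h^*(1) - h^*(0) = 20/9$.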
39 OPTIMISTIC LINEAR PROGRAMMING [TEWARI AND BARTLETT, 2007] ALGORITHM OLP 1 Update $\hat p$, the estimate of the transition probabilities 2 $D(x) := \{a \in A(x) : T_t(x,a) \ge \log^2 T_t(x)\}$, $x \in X$ // keep well-sampled actions only 3 $(\lambda, h, Q) := \mathrm{BOE}(D, \hat p)$ // solve the Bellman equations 4 $\mathrm{UCI}(a) := \sup\Big\{ r(X_t, a) + \langle q, h\rangle : \|q - \hat p_{X_t}(a)\|_1 \le \sqrt{\tfrac{2\log t}{T_t(X_t,a)}} \Big\}$ 5 $\Gamma := \{a \in A(X_t) : Q(X_t,a) = \lambda + h(X_t),\ T_t(X_t,a) < \log^2(T_t(X_t)+1)\}$ 6 if $A(X_t) = \Gamma$ /* all actions are in danger */ then let $A_t$ be some element of $\Gamma$, else $A_t := \mathrm{argmax}_{a \in A(X_t)} \mathrm{UCI}(a)$.
40 REGRET BOUNDS FOR OLP $\Delta^*(x,a) = h^*(x) + \lambda^* - Q^*(x,a)$; $Z_\epsilon(x,a) = \{q : \|q\|_1 = 1,\ r(x,a) + \langle q, h^*\rangle \ge h^*(x) + \lambda^* - \epsilon\}$; $C = \{(x,a) : Q^*(x,a) < \lambda^* + h^*(x),\ \exists\,\epsilon > 0: Z_\epsilon(x,a) \ne \emptyset\}$; $J_{x,a}(p;\epsilon) = \inf\{\|p - q\|_1 : q \in Z_\epsilon(x,a)\}$; $K^*(x,a) = \lim_{\epsilon\to 0} J_{x,a}(p_x(a);\epsilon)$; $H = \sum_{(x,a)\in C} \frac{2\Delta^*(x,a)}{K^*(x,a)}$. THEOREM ([TEWARI AND BARTLETT, 2007]) Let the MDP M be irreducible. If $L_n$ is the regret of OLP after n steps then $\limsup_{n\to\infty} \frac{L_n}{\log n} \le H$.
41 INTERPRETATION PROPOSITION Assume $A(x) = A$ for all $x \in X$. Let $\Delta = \min_{(x,a)\in C} \Delta^*(x,a)$. Then $H \le \frac{2\,|X|\,|A|\,\|h^*\|^2}{\Delta}$. COMMENT The algorithm closely follows that of [Burnetas and Katehakis, 1997], who proved asymptotic efficiency for their policy.
42 THE UCRL2 ALGORITHM [AUER ET AL., 2007A] MOTIVATION Explore OFU in MDPs; high-probability bounds for finite horizon problems. ALGORITHM UCRL2(δ, n) Phase initialization: 1 Estimate the mean model $\hat p_t$ using ML 2 $U := \Big\{ p : \|p(\cdot \mid x,a) - \hat p_t(\cdot \mid x,a)\|_1 \le c\sqrt{\tfrac{|X|\log(|A|n/\delta)}{T_t(x,a)}} \Big\}$ 3 $p' := \mathrm{argmax}_{p \in U} \lambda^*(p)$, $\pi := \pi^*(p')$ 4 $C(x,a) := T_t(x,a)$. Execution of a phase: 1 Execute π until some (x, a) gets visited more than C(x, a) times.
43 UCRL2 RESULTS DEFINITION (DIAMETER) Let M be an MDP. Then the diameter of M is $D(M) = \max_{x,y}\min_\pi E[T(x \to y; \pi)]$. Results: Lower bound: $E[L_n] = \Omega(\sqrt{D|X||A|n})$. Upper bounds: w.p. $1-\delta/n$, $L_n \le O\big(D|X|\sqrt{|A|\,n\log(|A|n/\delta)}\big)$; w.p. $1-\delta$, $L_n \le O\big(\tfrac{D^2|X|^2|A|\log(|A|n/\delta)}{\Delta}\big)$, where $\Delta = \min_{\pi:\lambda_\pi < \lambda^*} (\lambda^* - \lambda_\pi)$.
44 CONCLUSIONS Exploration is necessary.. but should be controlled in a wise manner Tools: Forced exploration Softmax Optimism in the face of uncertainty After 50 years still lots of work to be done! Scaling up Non-stationary environments Dependencies
45 REFERENCES I
Agrawal, R. (1995). Sample mean based index policies with O(log n) regret for the multi-armed bandit problem. Advances in Applied Probability, 27.
Agrawal, R., Hedge, M., and Teneketzis, D. (1988). Asymptotically efficient adaptive allocation rules for the multiarmed bandit problem with switching cost. IEEE Transactions on Automatic Control, 33(10).
Audibert, J.-Y., Munos, R., and Szepesvári, C. (2007). Tuning bandit algorithms in stochastic environments. In Algorithmic Learning Theory, 18th International Conference, ALT 2007. Springer.
Auer, P. (2003). Using confidence bounds for exploitation-exploration trade-offs. Journal of Machine Learning Research, 3.
Auer, P., Cesa-Bianchi, N., and Fischer, P. (2002). Finite time analysis of the multiarmed bandit problem. Machine Learning, 47(2-3).
Auer, P., Jaksch, T., and Ortner, R. (2007a). Near-optimal regret bounds for reinforcement learning. Technical report, University of Leoben.
Auer, P., Ortner, R., and Szepesvári, C. (2007b). Improved rates for the stochastic continuum-armed bandit problem. In Proceedings of the 20th Annual Conference on Learning Theory (COLT-07).
46 REFERENCES II
Bubeck, S., Munos, R., and Szepesvári, C. (2008). Online optimization in X-armed bandits. In NIPS-21.
Burnetas, A. and Katehakis, M. (1997). Optimal adaptive policies for Markov Decision Processes. Mathematics of Operations Research, 22(1).
Dani, V., Hayes, T., and Kakade, S. (2008). Stochastic linear optimization under bandit feedback. In COLT 2008.
Garivier, A. and Moulines, E. (2008). On upper-confidence bound policies for non-stationary bandit problems. Technical report, LTCI.
György, A., Kocsis, L., Szabó, I., and Szepesvári, C. (2007). Continuous time associative bandit problems. In IJCAI-07.
Kleinberg, R. (2004). Nearly tight bounds for the continuum-armed bandit problem. In NIPS.
Kocsis, L. and Szepesvári, C. (2006). Bandit based Monte-Carlo planning. In Proceedings of the 17th European Conference on Machine Learning (ECML-2006).
Lai, T. and Yakowitz, S. (1995). Machine learning and nonparametric bandit theory. IEEE Transactions on Automatic Control, 40.
47 REFERENCES III
Lai, T. L. and Robbins, H. (1985). Asymptotically efficient adaptive allocation rules. Advances in Applied Mathematics, 6:4-22.
Robbins, H. (1952). Some aspects of the sequential design of experiments. Bulletin of the American Mathematical Society, 58.
Tewari, A. and Bartlett, P. (2007). Optimistic linear programming gives logarithmic regret for irreducible MDPs. In NIPS-20.
Yang, Y. and Zhu, D. (2002). Randomized allocation with nonparametric estimation for a multi-armed bandit problem with covariates. Annals of Statistics, 30(1).
More informationMulti armed bandit problem: some insights
Multi armed bandit problem: some insights July 4, 20 Introduction Multi Armed Bandit problems have been widely studied in the context of sequential analysis. The application areas include clinical trials,
More information1 MDP Value Iteration Algorithm
CS 0. - Active Learning Problem Set Handed out: 4 Jan 009 Due: 9 Jan 009 MDP Value Iteration Algorithm. Implement the value iteration algorithm given in the lecture. That is, solve Bellman s equation using
More informationAdvanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12
Advanced Topics in Machine Learning and Algorithmic Game Theory Fall semester, 2011/12 Lecture 4: Multiarmed Bandit in the Adversarial Model Lecturer: Yishay Mansour Scribe: Shai Vardi 4.1 Lecture Overview
More informationElements of Reinforcement Learning
Elements of Reinforcement Learning Policy: way learning algorithm behaves (mapping from state to action) Reward function: Mapping of state action pair to reward or cost Value function: long term reward,
More informationNew bounds on the price of bandit feedback for mistake-bounded online multiclass learning
Journal of Machine Learning Research 1 8, 2017 Algorithmic Learning Theory 2017 New bounds on the price of bandit feedback for mistake-bounded online multiclass learning Philip M. Long Google, 1600 Amphitheatre
More informationLecture 4: Lower Bounds (ending); Thompson Sampling
CMSC 858G: Bandits, Experts and Games 09/12/16 Lecture 4: Lower Bounds (ending); Thompson Sampling Instructor: Alex Slivkins Scribed by: Guowei Sun,Cheng Jie 1 Lower bounds on regret (ending) Recap from
More informationRegret Bounds for Restless Markov Bandits
Regret Bounds for Restless Markov Bandits Ronald Ortner, Daniil Ryabko, Peter Auer, Rémi Munos Abstract We consider the restless Markov bandit problem, in which the state of each arm evolves according
More informationStat 260/CS Learning in Sequential Decision Problems. Peter Bartlett
Stat 260/CS 294-102. Learning in Sequential Decision Problems. Peter Bartlett 1. Multi-armed bandit algorithms. Concentration inequalities. P(X ǫ) exp( ψ (ǫ))). Cumulant generating function bounds. Hoeffding
More informationAdaptive Concentration Inequalities for Sequential Decision Problems
Adaptive Concentration Inequalities for Sequential Decision Problems Shengjia Zhao Tsinghua University zhaosj12@stanford.edu Ashish Sabharwal Allen Institute for AI AshishS@allenai.org Enze Zhou Tsinghua
More informationExplore no more: Improved high-probability regret bounds for non-stochastic bandits
Explore no more: Improved high-probability regret bounds for non-stochastic bandits Gergely Neu SequeL team INRIA Lille Nord Europe gergely.neu@gmail.com Abstract This work addresses the problem of regret
More informationGrundlagen der Künstlichen Intelligenz
Grundlagen der Künstlichen Intelligenz Reinforcement learning Daniel Hennes 4.12.2017 (WS 2017/18) University Stuttgart - IPVS - Machine Learning & Robotics 1 Today Reinforcement learning Model based and
More informationRegret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. Sébastien Bubeck Theory Group
Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems Sébastien Bubeck Theory Group Part 1: i.i.d., adversarial, and Bayesian bandit models i.i.d. multi-armed bandit, Robbins [1952]
More informationActive Learning and Optimized Information Gathering
Active Learning and Optimized Information Gathering Lecture 7 Learning Theory CS 101.2 Andreas Krause Announcements Project proposal: Due tomorrow 1/27 Homework 1: Due Thursday 1/29 Any time is ok. Office
More informationThe Multi-Armed Bandit Problem
Università degli Studi di Milano The bandit problem [Robbins, 1952]... K slot machines Rewards X i,1, X i,2,... of machine i are i.i.d. [0, 1]-valued random variables An allocation policy prescribes which
More informationCombinatorial Multi-Armed Bandit: General Framework, Results and Applications
Combinatorial Multi-Armed Bandit: General Framework, Results and Applications Wei Chen Microsoft Research Asia, Beijing, China Yajun Wang Microsoft Research Asia, Beijing, China Yang Yuan Computer Science
More informationOnline Optimization in X -Armed Bandits
Online Optimization in X -Armed Bandits Sébastien Bubeck INRIA Lille, SequeL project, France sebastien.bubeck@inria.fr Rémi Munos INRIA Lille, SequeL project, France remi.munos@inria.fr Gilles Stoltz Ecole
More informationRegret lower bounds and extended Upper Confidence Bounds policies in stochastic multi-armed bandit problem
Journal of Machine Learning Research 1 01) 1-48 Submitted 4/00; Published 10/00 Regret lower bounds and extended Upper Confidence Bounds policies in stochastic multi-armed bandit problem Antoine Salomon
More informationAnalysis of Thompson Sampling for the multi-armed bandit problem
Analysis of Thompson Sampling for the multi-armed bandit problem Shipra Agrawal Microsoft Research India shipra@microsoft.com avin Goyal Microsoft Research India navingo@microsoft.com Abstract We show
More informationarxiv: v1 [cs.lg] 12 Sep 2017
Adaptive Exploration-Exploitation Tradeoff for Opportunistic Bandits Huasen Wu, Xueying Guo,, Xin Liu University of California, Davis, CA, USA huasenwu@gmail.com guoxueying@outlook.com xinliu@ucdavis.edu
More informationIntroduction to Reinforcement Learning. CMPT 882 Mar. 18
Introduction to Reinforcement Learning CMPT 882 Mar. 18 Outline for the week Basic ideas in RL Value functions and value iteration Policy evaluation and policy improvement Model-free RL Monte-Carlo and
More informationThe multi armed-bandit problem
The multi armed-bandit problem (with covariates if we have time) Vianney Perchet & Philippe Rigollet LPMA Université Paris Diderot ORFE Princeton University Algorithms and Dynamics for Games and Optimization
More informationA Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem
A Change-Detection based Framework for Piecewise-stationary Multi-Armed Bandit Problem Fang Liu and Joohyun Lee and Ness Shroff The Ohio State University Columbus, Ohio 43210 {liu.3977, lee.7119, shroff.11}@osu.edu
More informationComplex Bandit Problems and Thompson Sampling
Complex Bandit Problems and Aditya Gopalan Department of Electrical Engineering Technion, Israel aditya@ee.technion.ac.il Shie Mannor Department of Electrical Engineering Technion, Israel shie@ee.technion.ac.il
More informationReinforcement Learning
Reinforcement Learning Model-Based Reinforcement Learning Model-based, PAC-MDP, sample complexity, exploration/exploitation, RMAX, E3, Bayes-optimal, Bayesian RL, model learning Vien Ngo MLR, University
More informationPAC Subset Selection in Stochastic Multi-armed Bandits
In Langford, Pineau, editors, Proceedings of the 9th International Conference on Machine Learning, pp 655--66, Omnipress, New York, NY, USA, 0 PAC Subset Selection in Stochastic Multi-armed Bandits Shivaram
More informationExplore no more: Improved high-probability regret bounds for non-stochastic bandits
Explore no more: Improved high-probability regret bounds for non-stochastic bandits Gergely Neu SequeL team INRIA Lille Nord Europe gergely.neu@gmail.com Abstract This work addresses the problem of regret
More informationFrom Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning
From Bandits to Monte-Carlo Tree Search: The Optimistic Principle Applied to Optimization and Planning Rémi Munos To cite this version: Rémi Munos. From Bandits to Monte-Carlo Tree Search: The Optimistic
More informationTHE first formalization of the multi-armed bandit problem
EDIC RESEARCH PROPOSAL 1 Multi-armed Bandits in a Network Farnood Salehi I&C, EPFL Abstract The multi-armed bandit problem is a sequential decision problem in which we have several options (arms). We can
More informationArtificial Intelligence
Artificial Intelligence Dynamic Programming Marc Toussaint University of Stuttgart Winter 2018/19 Motivation: So far we focussed on tree search-like solvers for decision problems. There is a second important
More informationarxiv: v2 [stat.ml] 19 Jul 2012
Thompson Sampling: An Asymptotically Optimal Finite Time Analysis Emilie Kaufmann, Nathaniel Korda and Rémi Munos arxiv:105.417v [stat.ml] 19 Jul 01 Telecom Paristech UMR CNRS 5141 & INRIA Lille - Nord
More informationLower PAC bound on Upper Confidence Bound-based Q-learning with examples
Lower PAC bound on Upper Confidence Bound-based Q-learning with examples Jia-Shen Boon, Xiaomin Zhang University of Wisconsin-Madison {boon,xiaominz}@cs.wisc.edu Abstract Abstract Recently, there has been
More informationIntroduction to Reinforcement Learning and multi-armed bandits
Introduction to Reinforcement Learning and multi-armed bandits Rémi Munos INRIA Lille - Nord Europe Currently on leave at MSR-NE http://researchers.lille.inria.fr/ munos/ NETADIS Summer School 2013, Hillerod,
More informationExploiting Variance Information in Monte-Carlo Tree Search
Exploiting Variance Information in Monte-Carlo Tree Search Robert Liec Vien Ngo Marc Toussaint Machine Learning and Robotics Lab University of Stuttgart prename.surname@ipvs.uni-stuttgart.de Abstract In
More informationBandits for Online Optimization
Bandits for Online Optimization Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Bandits for Online Optimization 1 / 16 The multiarmed bandit problem... K slot machines Each
More informationSparse Linear Contextual Bandits via Relevance Vector Machines
Sparse Linear Contextual Bandits via Relevance Vector Machines Davis Gilton and Rebecca Willett Electrical and Computer Engineering University of Wisconsin-Madison Madison, WI 53706 Email: gilton@wisc.edu,
More informationChapter 2 Stochastic Multi-armed Bandit
Chapter 2 Stochastic Multi-armed Bandit Abstract In this chapter, we present the formulation, theoretical bound, and algorithms for the stochastic MAB problem. Several important variants of stochastic
More informationOnline Learning and Online Convex Optimization
Online Learning and Online Convex Optimization Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Online Learning 1 / 49 Summary 1 My beautiful regret 2 A supposedly fun game
More informationPiecewise-stationary Bandit Problems with Side Observations
Jia Yuan Yu jia.yu@mcgill.ca Department Electrical and Computer Engineering, McGill University, Montréal, Québec, Canada. Shie Mannor shie.mannor@mcgill.ca; shie@ee.technion.ac.il Department Electrical
More information