The knowledge gradient method for multi-armed bandit problems


1 The knowledge gradient method for multi-armed bandit problems: Moving beyond index policies. Ilya O. Ryzhov, Warren Powell, Peter Frazier. Department of Operations Research and Financial Engineering, Princeton University. INFORMS APS Conference, July 12.

2 Outline 1 Introduction 2 Mathematical model 3 The knowledge gradient (KG) policy 4 Asymptotic behavior of the KG policy 5 Bandit problems with correlated arms 6 Numerical results 7 Conclusions

3 Outline 1 Introduction 2 Mathematical model 3 The knowledge gradient (KG) policy 4 Asymptotic behavior of the KG policy 5 Bandit problems with correlated arms 6 Numerical results 7 Conclusions

4 Motivation: clinical drug trials We are testing experimental diabetes treatments on human patients. We want to find the best treatment, but we also care about the effect on the patients. How can we allocate groups of patients to treatments?

5 The multi-armed bandit problem There are M different treatments (M arms or alternatives). The effectiveness of each treatment is unknown, but we have a Bayesian belief about it. We can measure a treatment (by trying it out on a group of patients) and observe a result that changes our beliefs. How should we allocate our measurements to maximize some measure of the total benefit across all patient groups?

6 Outline 1 Introduction 2 Mathematical model 3 The knowledge gradient (KG) policy 4 Asymptotic behavior of the KG policy 5 Bandit problems with correlated arms 6 Numerical results 7 Conclusions

7 The multi-armed bandit problem At first, we believe that $\mu_x \sim \mathcal{N}(\mu_x^0, 1/\beta_x^0)$. We measure alternative $x$ and observe a reward $\hat{\mu}_x^1 \sim \mathcal{N}(\mu_x, 1/\beta^\varepsilon)$. As a result, our beliefs change: $\mu_x^1 = \frac{\beta_x^0 \mu_x^0 + \beta^\varepsilon \hat{\mu}_x^1}{\beta_x^0 + \beta^\varepsilon}$, $\beta_x^1 = \beta_x^0 + \beta^\varepsilon$. The quantity $\beta_x^0$ is called the precision of our beliefs: $(\sigma_x^0)^2 = 1/\beta_x^0$. For all $y \neq x$, $\mu_y^1 = \mu_y^0$.

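A minimal sketch of this update in Python (the variable names and the example values are assumptions, not from the talk):

```python
import numpy as np

def update_beliefs(mu, beta, x, y, beta_eps):
    """Update independent normal beliefs after measuring alternative x.

    mu, beta : arrays of posterior means and precisions for all alternatives
    x        : index of the measured alternative
    y        : observed reward, assumed ~ N(mu_x, 1/beta_eps)
    """
    mu_new, beta_new = mu.copy(), beta.copy()
    mu_new[x] = (beta[x] * mu[x] + beta_eps * y) / (beta[x] + beta_eps)
    beta_new[x] = beta[x] + beta_eps
    return mu_new, beta_new  # beliefs about every y != x are unchanged

# example: 5 alternatives, diffuse prior, one noisy observation of alternative 2
mu0 = np.zeros(5)
beta0 = np.full(5, 0.01)           # prior precision (variance 100)
mu1, beta1 = update_beliefs(mu0, beta0, x=2, y=1.7, beta_eps=1.0)
```
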
8 The multi-armed bandit problem After n measurements, our beliefs about the alternatives are encoded in the knowledge state $s^n = (\mu^n, \sigma^n)$. A decision rule $X^n$ is a function that maps the knowledge state $s^n$ to an alternative $X^n(s^n) \in \{1,\dots,M\}$. A learning policy $\pi$ is a sequence of decision rules $X^{\pi,1}, X^{\pi,2}, \dots$ Objective function: choose a measurement policy $\pi$ to achieve $\sup_\pi \mathbb{E}^\pi \sum_{n=0}^{\infty} \gamma^n \mu_{X^{\pi,n}(s^n)}$ for some discount factor $0 < \gamma < 1$.

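To make the online objective concrete, here is a simulation sketch that accumulates the discounted true reward collected by an arbitrary decision rule; the diffuse prior, the toy greedy policy, and all names are illustrative assumptions:

```python
import numpy as np

def discounted_value(policy, mu_true, sigma_eps, gamma, horizon, rng):
    """Estimate sum_n gamma^n * mu_{X^n(s^n)} along one sample path.

    policy : function (mu, beta, n) -> index of the alternative to measure
    """
    M = len(mu_true)
    mu, beta = np.zeros(M), np.full(M, 1e-4)    # diffuse prior beliefs
    total = 0.0
    for n in range(horizon):                    # truncate the infinite sum
        x = policy(mu, beta, n)
        total += gamma ** n * mu_true[x]        # true (unknown) reward of the choice
        y = rng.normal(mu_true[x], sigma_eps)   # noisy observation
        beta_eps = 1.0 / sigma_eps ** 2
        mu[x] = (beta[x] * mu[x] + beta_eps * y) / (beta[x] + beta_eps)
        beta[x] += beta_eps
    return total

rng = np.random.default_rng(0)
# toy policy: greedy on the posterior mean, with tiny noise to break ties
greedy = lambda mu, beta, n: int(np.argmax(mu + rng.normal(0, 1e-6, len(mu))))
print(discounted_value(greedy, mu_true=np.array([0.2, 0.5, 0.1]),
                       sigma_eps=1.0, gamma=0.9, horizon=200, rng=rng))
```
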
9 Index policies An index policy yields decision rules of the form $X^{\pi,n}(s^n) = \arg\max_x I_x^\pi(\mu_x^n, \sigma_x^n)$, where the index $I_x^\pi$ depends on our beliefs about $x$, but not on our beliefs about other alternatives. Index policies allow us to consider each alternative separately, as if there were no other alternatives.

10 Examples of index policies Interval estimation (Kaelbling 1993): $X^{IE,n}(s^n) = \arg\max_x \left[ \mu_x^n + z_{\alpha/2}\, \sigma_x^n \right]$. Upper confidence bound for finite horizon (Lai 1987): $X^{UCB,n}(s^n) = \arg\max_x \left[ \mu_x^n + g\!\left(N_x^n, N\right) \right]$, where the exploration bonus $g$ depends on the number of times $N_x^n$ that alternative $x$ has been measured and on the horizon $N$. Gittins indices (Gittins & Jones 1974): $X^{Gitt,n}(s^n) = \arg\max_x \Gamma(\mu_x^n, \sigma_x^n, \sigma_\varepsilon, \gamma)$. The Gittins policy is optimal for infinite-horizon problems, but Gittins indices are difficult to compute and usually need to be approximated (Chick & Gans 2009).

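To illustrate how an index policy uses only the measured alternative's own beliefs, here is a sketch of the interval-estimation rule; the α level and the example arrays are assumptions:

```python
import numpy as np
from scipy.stats import norm

def interval_estimation_choice(mu, sigma, alpha=0.05):
    """Index policy: each alternative's index depends only on its own beliefs."""
    z = norm.ppf(1 - alpha / 2)        # z_{alpha/2}, e.g. 1.96 for alpha = 0.05
    index = mu + z * sigma             # upper end of a (1 - alpha) credible interval
    return int(np.argmax(index))

mu = np.array([0.3, 0.5, 0.4])
sigma = np.array([0.8, 0.1, 0.6])
print(interval_estimation_choice(mu, sigma))   # picks the largest mu + z * sigma
```
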
11 Outline 1 Introduction 2 Mathematical model 3 The knowledge gradient (KG) policy 4 Asymptotic behavior of the KG policy 5 Bandit problems with correlated arms 6 Numerical results 7 Conclusions

12 The knowledge gradient concept One-period look-ahead rule for making measurements. Originally developed for offline problems (Gupta & Miescke 1996, Frazier et al. 2008). Offline objective: $\sup_\pi \mathbb{E}^\pi \left[ \max_x \mu_x^N \right]$. Online objective: $\sup_\pi \mathbb{E}^\pi \sum_{n=0}^{\infty} \gamma^n \mu_{X^{\pi,n}(s^n)}$.

13 Definition of the knowledge gradient In the offline problem, our only goal is to estimate $\max_x \mu_x$. If we measure $x$ at time $n$, we can expect to improve our estimate by $\nu_x^{KG,n} = \mathbb{E}^n\left[\left.\max_y \mu_y^{n+1} \right| x^n = x\right] - \max_y \mu_y^n$. This quantity is called the knowledge gradient of $x$ at time $n$. The offline KG decision rule is $X^{Off,n}(s^n) = \arg\max_x \nu_x^{KG,n}$.

14 Computation of the knowledge gradient The expectation $\nu_x^{KG,n} = \mathbb{E}^n\left[\left.\max_y \mu_y^{n+1}\right| x^n = x\right] - \max_y \mu_y^n$ has a closed-form solution $\nu_x^{KG,n} = \tilde{\sigma}_x^n \, f\!\left( -\frac{\left|\mu_x^n - \max_{y \neq x} \mu_y^n\right|}{\tilde{\sigma}_x^n} \right)$, where $f(z) = z\,\Phi(z) + \varphi(z)$, $\tilde{\sigma}_x^n = \sqrt{(\sigma_x^n)^2 - (\sigma_x^{n+1})^2}$, and $\Phi, \varphi$ are the standard Gaussian cdf and pdf.

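A sketch of this closed form in Python, assuming independent normal beliefs and a known common noise standard deviation sigma_eps; the names and example values are assumptions:

```python
import numpy as np
from scipy.stats import norm

def kg_factors(mu, sigma, sigma_eps):
    """Offline knowledge gradient nu^{KG}_x for independent normal beliefs."""
    var_next = 1.0 / (1.0 / sigma**2 + 1.0 / sigma_eps**2)   # posterior variance after one more measurement
    sigma_tilde = np.sqrt(sigma**2 - var_next)               # reduction in standard deviation
    nu = np.empty_like(mu)
    for x in range(len(mu)):
        best_other = np.max(np.delete(mu, x))                # max_{y != x} mu_y
        z = -abs(mu[x] - best_other) / sigma_tilde[x]
        nu[x] = sigma_tilde[x] * (z * norm.cdf(z) + norm.pdf(z))   # f(z) = z*Phi(z) + phi(z)
    return nu

mu = np.array([0.3, 0.5, 0.4])
sigma = np.array([0.8, 0.1, 0.6])
print(kg_factors(mu, sigma, sigma_eps=1.0))
```
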
15 Using KG for multi-armed bandits In the bandit problem, if we stop learning at time $n+1$, we expect to collect a reward of $\frac{1}{1-\gamma} \max_x \mu_x^{n+1}$. If we decide to choose alternative $x$ at time $n$, we expect to collect $\mu_x^n$ immediately, plus the discounted downstream value. The online KG decision rule is $X^{KG,n}(s^n) = \arg\max_x \left[ \mu_x^n + \frac{\gamma}{1-\gamma} \mathbb{E}^n\!\left[\left.\max_y \mu_y^{n+1}\right| x^n = x\right] \right] = \arg\max_x \left[ \mu_x^n + \frac{\gamma}{1-\gamma} \left( \mathbb{E}^n\!\left[\left.\max_y \mu_y^{n+1}\right| x^n = x\right] - \max_y \mu_y^n \right) \right] = \arg\max_x \left[ \mu_x^n + \frac{\gamma}{1-\gamma} \nu_x^{KG,n} \right]$, where the second equality holds because subtracting the constant $\max_y \mu_y^n$ does not change the arg max.

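A sketch of the resulting decision rule; the KG factors nu would come from the closed form above, and the example numbers are made up for illustration:

```python
import numpy as np

def online_kg_choice(mu, nu, gamma):
    """Online KG rule: arg max_x  mu_x + gamma/(1-gamma) * nu_x,
    where nu_x is the (offline) knowledge gradient of alternative x."""
    return int(np.argmax(mu + gamma / (1.0 - gamma) * np.asarray(nu)))

# example with made-up KG factors
mu = np.array([0.3, 0.5, 0.4])
nu = np.array([0.20, 0.01, 0.15])
print(online_kg_choice(mu, nu, gamma=0.9))   # the exploration bonus favors alternative 0
```
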
16 Why KG is not an index policy The KG decision rule is $X^{KG,n}(s^n) = \arg\max_x \left[ \mu_x^n + \frac{\gamma}{1-\gamma} \nu_x^{KG,n} \right]$, where $\nu_x^{KG,n} = \tilde{\sigma}_x^n \, f\!\left( -\frac{\left|\mu_x^n - \max_{y \neq x} \mu_y^n\right|}{\tilde{\sigma}_x^n} \right)$. The KG factor of alternative $x$ depends on $\max_{y \neq x} \mu_y^n$. The calculation for alternative $x$ uses our beliefs about all the other alternatives.

17 Outline 1 Introduction 2 Mathematical model 3 The knowledge gradient (KG) policy 4 Asymptotic behavior of the KG policy 5 Bandit problems with correlated arms 6 Numerical results 7 Conclusions

18 KG will converge to some alternative We say that the KG policy converges to alternative $x$ if it measures alternative $x$ infinitely often. Proposition: For almost every sample path, only one alternative will be measured infinitely often by the KG policy.

19 KG will converge to some alternative KG has to converge to some alternative, but not necessarily to the best one. However, even the Gittins policy, which is known to be optimal, does not necessarily converge to the best alternative. Can we find situations where KG does converge to the best alternative?

20 Connection to offline KG Idea: For $\gamma$ close to 1, the online KG decision rule $X^{KG,n}(s^n) = \arg\max_x \left[ \mu_x^n + \frac{\gamma}{1-\gamma} \nu_x^{KG,n} \right]$ starts to resemble the offline KG decision rule $X^{Off,n}(s^n) = \arg\max_x \nu_x^{KG,n}$. We know from Frazier et al. (2008) that the offline KG rule will measure every alternative infinitely often (thus finding the best alternative) in an infinite-horizon setting.

21 Asymptotic behavior for large γ Let KG(γ) denote the online KG policy with discount factor $\gamma$. Define the stopping time $N_\gamma = \min\left\{ n \geq 0 : X^{Off,n}(s^n) \neq X^{KG(\gamma),n}(s^n) \right\}$. Lemma: As $\gamma \to 1$, $N_\gamma \to \infty$ almost surely.

22 Asymptotic behavior for large γ The time horizon can be divided into three periods. First period, $n < N_\gamma$: online KG agrees with offline KG.

23 Asymptotic behavior for large γ Second period, $n \geq N_\gamma$: online KG no longer agrees with offline KG.

24 Asymptotic behavior for large γ Third period, $n \to \infty$: online KG converges.

25 Asymptotic behavior for large γ Lemma: For fixed $0 < \gamma < 1$, $\lim_{n \to \infty} \mathbb{E}^{KG(\gamma)}\left[ \max_x \mu_x^n \right] \geq \mathbb{E}^{KG(\gamma)}\left[ \max_x \mu_x^{N_\gamma} \right]$. Proof: It can be shown that $M_n = \max_x \mu_x^n$ is a uniformly integrable submartingale. Therefore, $M_n$ converges almost surely and in $L^1$, and $\lim_{n \to \infty} \mathbb{E} M_n = \mathbb{E} \lim_{n \to \infty} M_n \geq \mathbb{E} M_{N_\gamma}$ by Doob's optional sampling theorem.

26 Asymptotic behavior for large γ Lemma: $\lim_{\gamma \to 1} \mathbb{E}^{KG(\gamma)}\left[ \max_x \mu_x^{N_\gamma} \right] = \mathbb{E}\left[ \max_x \mu_x \right]$. Proof: Because online KG agrees with offline KG on all decisions before $N_\gamma$, $\lim_{\gamma \to 1} \mathbb{E}^{KG(\gamma)}\left[ \max_x \mu_x^{N_\gamma} \right] = \lim_{\gamma \to 1} \mathbb{E}^{Off}\left[ \max_x \mu_x^{N_\gamma} \right] = \lim_{n \to \infty} \mathbb{E}^{Off}\left[ \max_x \mu_x^n \right] = \mathbb{E}\left[ \max_x \mu_x \right]$. The last line follows by the asymptotic optimality of offline KG (Frazier et al. 2008).

27 Main asymptotic result Theorem: $\lim_{\gamma \to 1} \lim_{n \to \infty} \mathbb{E}^{KG(\gamma)}\left[ \max_x \mu_x^n \right] = \mathbb{E}\left[ \max_x \mu_x \right]$. Proof: Combining the previous results yields $\lim_{\gamma \to 1} \lim_{n \to \infty} \mathbb{E}^{KG(\gamma)}\left[ \max_x \mu_x^n \right] \geq \lim_{\gamma \to 1} \mathbb{E}^{KG(\gamma)}\left[ \max_x \mu_x^{N_\gamma} \right] = \mathbb{E}\left[ \max_x \mu_x \right]$. The other direction can be obtained using Jensen's inequality.

28 Summary of asymptotic results As $\gamma \to 1$, the value of the alternative measured infinitely often by KG converges to the value of the best alternative. This is evidence that some of the attractive properties of offline KG carry over to the online problem. Rates of convergence are a subject for future work. Experimental work suggests that online KG performs well in practice.

29 Outline 1 Introduction 2 Mathematical model 3 The knowledge gradient (KG) policy 4 Asymptotic behavior of the KG policy 5 Bandit problems with correlated arms 6 Numerical results 7 Conclusions

30 Bandit problems with correlated arms The classic multi-armed bandit model assumes that the alternatives are independent. However, the definition of the knowledge gradient $\nu_x^{KG,n} = \mathbb{E}^n\left[\left.\max_y \mu_y^{n+1}\right| x^n = x\right] - \max_y \mu_y^n$ does not preclude the presence of correlated beliefs.

31 The meaning of correlated beliefs Correlated beliefs allow us to learn about many alternatives by making one measurement. Observe what happens when we measure a single alternative (figure shown across slides 31-33).

34 Mathematical model for correlated problem At first, we believe that $\mu \sim \mathcal{N}(\mu^0, \Sigma^0)$. We measure alternative $x$ and observe a reward $\hat{\mu}_x^1 \sim \mathcal{N}(\mu_x, \sigma_\varepsilon^2)$. As a result, our beliefs change: $\mu^1 = \mu^0 + \frac{\hat{\mu}_x^1 - \mu_x^0}{\sigma_\varepsilon^2 + \Sigma_{xx}^0}\, \Sigma^0 e_x$, $\Sigma^1 = \Sigma^0 - \frac{\Sigma^0 e_x e_x^T \Sigma^0}{\sigma_\varepsilon^2 + \Sigma_{xx}^0}$. The vector $e_x$ consists of all zeros, with a single 1 at index $x$. Now, it is possible for every component of $\mu$ to change after a measurement.

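A sketch of this rank-one update in Python (the names and the example prior covariance are assumptions):

```python
import numpy as np

def correlated_update(mu, Sigma, x, y, sigma_eps):
    """Update multivariate normal beliefs after observing y ~ N(mu_x, sigma_eps^2)."""
    e_x = np.zeros(len(mu)); e_x[x] = 1.0
    denom = sigma_eps**2 + Sigma[x, x]
    mu_new = mu + (y - mu[x]) / denom * (Sigma @ e_x)
    Sigma_new = Sigma - np.outer(Sigma @ e_x, e_x @ Sigma) / denom
    return mu_new, Sigma_new

mu0 = np.zeros(3)
Sigma0 = np.array([[1.0, 0.5, 0.0],
                   [0.5, 1.0, 0.5],
                   [0.0, 0.5, 1.0]])
mu1, Sigma1 = correlated_update(mu0, Sigma0, x=1, y=2.0, sigma_eps=1.0)
# measuring alternative 1 also shifts our beliefs about alternatives 0 and 2
print(mu1)
```
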
35 Correlated knowledge gradients An index policy is inherently unable to consider correlations, because it always considers every alternative separately from the others. However, we can still use the KG rule: $X^{KG,n}(s^n) = \arg\max_x \left[ \mu_x^n + \frac{\gamma}{1-\gamma} \nu_x^{KG,n} \right]$. The computation of $\nu_x^{KG,n}$ is more complicated than in the independent case, but Frazier et al. (2009) provides a numerical algorithm.

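The exact computation from Frazier et al. (2009) is not reproduced here; the sketch below instead approximates the correlated KG factor by Monte Carlo over the predictive distribution of the next observation, using the update formula from the previous slide. This is a simple stand-in, not the paper's algorithm, and all names are assumptions:

```python
import numpy as np

def correlated_kg_mc(mu, Sigma, sigma_eps, n_samples=10_000, rng=None):
    """Monte Carlo estimate of nu^{KG}_x = E[max_y mu^{n+1}_y | measure x] - max_y mu^n_y."""
    rng = rng or np.random.default_rng()
    M = len(mu)
    nu = np.zeros(M)
    best_now = np.max(mu)
    for x in range(M):
        denom = sigma_eps**2 + Sigma[x, x]
        # predictive distribution of the next observation of alternative x
        y = rng.normal(mu[x], np.sqrt(denom), size=n_samples)
        # updated mean vectors for every sampled observation (rank-one update)
        mu_next = mu + np.outer((y - mu[x]) / denom, Sigma[:, x])
        nu[x] = np.mean(np.max(mu_next, axis=1)) - best_now
    return nu

mu = np.zeros(3)
Sigma = np.array([[1.0, 0.5, 0.0],
                  [0.5, 1.0, 0.5],
                  [0.0, 0.5, 1.0]])
print(correlated_kg_mc(mu, Sigma, sigma_eps=1.0))
```
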
36 Correlated knowledge gradients By introducing correlations, we are able to learn efficiently in problems with a very large number of alternatives. Example: Subset selection. Suppose that a diabetes treatment consists of 5 drugs (with 20 drugs to choose from). Two treatments are correlated if they contain one or more of the same drugs. The number of subsets is $\binom{20}{5} = 15{,}504$, but one measurement will now provide much more information than before.

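One way such a correlated prior could be set up for the subset-selection example is sketched below; the overlap-based covariance is purely an illustrative assumption (not the structure used in the talk) and is only built for a small slice of the 15,504 treatments:

```python
import itertools
import numpy as np

drugs = range(20)
treatments = list(itertools.combinations(drugs, 5))   # all 5-drug treatments
print(len(treatments))                                # 15504 = C(20, 5)

# toy prior covariance: treatments sharing more drugs are more strongly correlated
def overlap_covariance(treatments, scale=1.0):
    n = len(treatments)
    Sigma = np.empty((n, n))
    for i, a in enumerate(treatments):
        for j, b in enumerate(treatments):
            Sigma[i, j] = scale * len(set(a) & set(b)) / 5.0
    return Sigma

# the full 15504 x 15504 matrix is large; illustrate on a small slice
Sigma_small = overlap_covariance(treatments[:100])
print(Sigma_small.shape)
```
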
37 Outline 1 Introduction 2 Mathematical model 3 The knowledge gradient (KG) policy 4 Asymptotic behavior of the KG policy 5 Bandit problems with correlated arms 6 Numerical results 7 Conclusions

38 Numerical results: independent alternatives We compare the performance of two policies by comparing the true values of the alternatives they measure in every time step. KG outperforms approximate Gittins (Chick & Gans 2009) over 75% of the time.

39 Numerical results: independent alternatives We can examine how performance varies with measurement noise in several representative sample problems.

40 Numerical results: independent alternatives We can also vary the discount factor; KG yields very good performance for γ close to 1.

41 Numerical results: correlated alternatives

42 Outline 1 Introduction 2 Mathematical model 3 The knowledge gradient (KG) policy 4 Asymptotic behavior of the KG policy 5 Bandit problems with correlated arms 6 Numerical results 7 Conclusions

43 Conclusions The KG policy is a new, non-index approach to the multi-armed bandit problem. KG is substantially easier to compute than Gittins indices, and is competitive against the state of the art in Gittins approximation. KG outperforms or is competitive against other index policies such as interval estimation. KG handles correlated problems, which index policies were never designed to do.

44 References Chick, S.E. & Gans, N. (2009) Economic Analysis of Simulation Selection Options. Management Science, to appear. Frazier, P.I., Powell, W. & Dayanik, S. (2008) A knowledge-gradient policy for sequential information collection. SIAM Journal on Control and Optimization 47(5). Frazier, P.I., Powell, W. & Dayanik, S. (2009) The knowledge gradient policy for correlated normal rewards. INFORMS Journal on Computing, to appear. Gittins, J.C. & Jones, D.M. (1974) A Dynamic Allocation Index for the Sequential Design of Experiments. In: Progress in Statistics, J. Gani et al., eds. Gupta, S. & Miescke, K. (1996) Bayesian look ahead one stage sampling allocation for selecting the best population. Journal of Statistical Planning and Inference 54.

45 References Kaelbling, L.P. (1993) Learning in Embedded Systems. MIT Press, Cambridge, MA. Lai, T.L. (1987) Adaptive treatment allocation and the multi-armed bandit problem. Annals of Statistics 15(3). Ryzhov, I.O., Powell, W. & Frazier, P.I. The knowledge gradient algorithm for a general class of online learning problems. In preparation. Ryzhov, I.O. & Powell, W. The knowledge gradient algorithm for online subset selection. Proceedings of the 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning.
