The knowledge gradient method for multi-armed bandit problems
1 The knowledge gradient method for multi-armed bandit problems: moving beyond index policies. Ilya O. Ryzhov, Warren Powell, Peter Frazier. Department of Operations Research and Financial Engineering, Princeton University. INFORMS APS Conference, July 12.
2 Outline: 1 Introduction; 2 Mathematical model; 3 The knowledge gradient (KG) policy; 4 Asymptotic behavior of the KG policy; 5 Bandit problems with correlated arms; 6 Numerical results; 7 Conclusions.
4 Motivation: clinical drug trials. We are testing experimental diabetes treatments on human patients. We want to find the best treatment, but we also care about the effect on the patients. How can we allocate groups of patients to treatments?
5 The multi-armed bandit problem. There are M different treatments (M arms or alternatives). The effectiveness of each treatment is unknown, but we have a Bayesian belief about it. We can measure a treatment (by trying it out on a group of patients) and observe a result that changes our beliefs. How should we allocate our measurements to maximize some measure of the total benefit across all patient groups?
7 The multi-armed bandit problem. At first, we believe that $\mu_x \sim N(\mu_x^0, 1/\beta_x^0)$. We measure alternative $x$ and observe a reward $\hat{\mu}_x^1 \sim N(\mu_x, 1/\beta^\varepsilon)$. As a result, our beliefs change:
$$\mu_x^1 = \frac{\beta_x^0 \mu_x^0 + \beta^\varepsilon \hat{\mu}_x^1}{\beta_x^0 + \beta^\varepsilon}, \qquad \beta_x^1 = \beta_x^0 + \beta^\varepsilon.$$
The quantity $\beta_x^0$ is called the precision of our beliefs, with $(\sigma_x^0)^2 = 1/\beta_x^0$. For all $y \neq x$, $\mu_y^1 = \mu_y^0$.
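The update above is just a precision-weighted average of the prior mean and the new observation. A minimal sketch in Python (function and variable names are my own, not from the talk):

```python
def update_beliefs(mu0, beta0, y, beta_eps):
    """Precision-weighted Bayesian update of a normal belief about one arm.

    mu0, beta0: prior mean and precision; y: observed reward;
    beta_eps: precision (1/variance) of the measurement noise.
    """
    mu1 = (beta0 * mu0 + beta_eps * y) / (beta0 + beta_eps)
    beta1 = beta0 + beta_eps
    return mu1, beta1

# A vague prior (precision 1) pulled toward an observation of 2.0 made
# with noise precision 3: posterior mean (1*0 + 3*2)/(1+3) = 1.5.
mu1, beta1 = update_beliefs(mu0=0.0, beta0=1.0, y=2.0, beta_eps=3.0)
```

Note how the posterior precision simply adds the measurement precision, so beliefs only sharpen over time.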
8 The multi-armed bandit problem. After n measurements, our beliefs about the alternatives are encoded in the knowledge state $s^n = (\mu^n, \sigma^n)$. A decision rule $X^n$ is a function that maps the knowledge state $s^n$ to an alternative $X^n(s^n) \in \{1, \dots, M\}$. A learning policy $\pi$ is a sequence of decision rules $X^{\pi,1}, X^{\pi,2}, \dots$
Objective function: choose a measurement policy $\pi$ to achieve
$$\sup_\pi \mathbb{E}^\pi \sum_{n=0}^\infty \gamma^n \mu_{X^{\pi,n}(s^n)}$$
for some discount factor $0 < \gamma < 1$.
9 Index policies. An index policy yields decision rules of the form $X^{\pi,n}(s^n) = \arg\max_x I_x^\pi(\mu_x^n, \sigma_x^n)$, where the index $I_x^\pi$ depends on our beliefs about $x$, but not on our beliefs about the other alternatives. Index policies allow us to consider each alternative separately, as if there were no other alternatives.
10 Examples of index policies.
Interval estimation (Kaelbling 1993): $X^{IE,n}(s^n) = \arg\max_x \mu_x^n + z_{\alpha/2}\, \sigma_x^n$.
Upper confidence bound for finite horizon $N$ (Lai 1987): $X^{UCB,n}(s^n) = \arg\max_x \mu_x^n + \sqrt{\frac{2}{N_x^n}\, g\!\left(\frac{N_x^n}{N}\right)}$, where $N_x^n$ counts the measurements of $x$ so far.
Gittins indices (Gittins & Jones 1974): $X^{Gitt,n}(s^n) = \arg\max_x \Gamma(\mu_x^n, \sigma_x^n, \sigma_\varepsilon, \gamma)$.
The Gittins policy is optimal for infinite-horizon problems, but Gittins indices are difficult to compute and usually need to be approximated (Chick & Gans 2009).
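As an illustration of how an index policy scores each arm in isolation, here is a sketch of the interval estimation rule (names are illustrative; `z` defaults to the two-sided 95% normal quantile):

```python
def interval_estimation_choice(mu, sigma, z=1.96):
    """Index rule: pick the arm with the largest value of mu_x + z * sigma_x.

    Each arm's score uses only our beliefs about that arm, which is
    exactly what makes this an index policy.
    """
    scores = [m + z * s for m, s in zip(mu, sigma)]
    return max(range(len(mu)), key=scores.__getitem__)

# Arm 1 has a lower mean but far more uncertainty, so its index is larger.
arm = interval_estimation_choice(mu=[1.0, 0.5], sigma=[0.1, 1.0])  # → 1
```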
12 The knowledge gradient concept. A one-period look-ahead rule for making measurements, originally developed for offline problems (Gupta & Miescke 1996, Frazier et al. 2008).
Offline objective: $\sup_\pi \mathbb{E}^\pi \max_x \mu_x^N$.
Online objective: $\sup_\pi \mathbb{E}^\pi \sum_{n=0}^\infty \gamma^n \mu_{X^{\pi,n}(s^n)}$.
13 Definition of the knowledge gradient. In the offline problem, our only goal is to estimate $\max_x \mu_x$. If we measure $x$ at time n, we can expect to improve our estimate by
$$\nu_x^{KG,n} = \mathbb{E}^n\left[\max_y \mu_y^{n+1}\right] - \max_y \mu_y^n.$$
This quantity is called the knowledge gradient of $x$ at time n. The offline KG decision rule is $X^{Off,n}(s^n) = \arg\max_x \nu_x^{KG,n}$.
14 Computation of the knowledge gradient. The expectation $\nu_x^{KG,n} = \mathbb{E}^n\left[\max_y \mu_y^{n+1}\right] - \max_y \mu_y^n$ has a closed-form solution
$$\nu_x^{KG,n} = \tilde{\sigma}_x^n\, f\!\left(-\left|\frac{\mu_x^n - \max_{y \neq x} \mu_y^n}{\tilde{\sigma}_x^n}\right|\right),$$
where $f(z) = z\,\Phi(z) + \phi(z)$, $\tilde{\sigma}_x^n = \sqrt{(\sigma_x^n)^2 - (\sigma_x^{n+1})^2}$, and $\Phi, \phi$ are the standard Gaussian cdf and pdf.
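The closed form above can be sketched directly, assuming independent normal beliefs; the helper names are mine, and `statistics.NormalDist` supplies $\Phi$ and $\phi$:

```python
from math import sqrt
from statistics import NormalDist

_std = NormalDist()  # standard normal distribution

def kg_factor(mu, sigma, x, sigma_eps):
    """Knowledge gradient of arm x under independent normal beliefs.

    mu, sigma: current means and standard deviations of our beliefs;
    sigma_eps: standard deviation of the measurement noise.
    """
    # Posterior variance of arm x after one more measurement:
    var_next = 1.0 / (1.0 / sigma[x] ** 2 + 1.0 / sigma_eps ** 2)
    sigma_tilde = sqrt(sigma[x] ** 2 - var_next)
    best_other = max(m for y, m in enumerate(mu) if y != x)
    z = -abs(mu[x] - best_other) / sigma_tilde
    return sigma_tilde * (z * _std.cdf(z) + _std.pdf(z))

# Two identical arms: sigma_tilde = sqrt(0.5), f(0) = phi(0), so
# the KG factor is sqrt(0.5) * 0.3989 ≈ 0.282.
v = kg_factor(mu=[0.0, 0.0], sigma=[1.0, 1.0], x=0, sigma_eps=1.0)
```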
15 Using KG for multi-armed bandits. In the bandit problem, if we stop learning at time n+1, we expect to collect a reward of $\frac{1}{1-\gamma} \max_y \mu_y^{n+1}$. If we decide to choose alternative $x$ at time n, we expect to collect $\mu_x^n$ immediately, plus the discounted downstream value. The online KG decision rule is
$$X^{KG,n}(s^n) = \arg\max_x\, \mu_x^n + \frac{\gamma}{1-\gamma} \mathbb{E}^n\left[\max_y \mu_y^{n+1}\right] = \arg\max_x\, \mu_x^n + \frac{\gamma}{1-\gamma}\left(\mathbb{E}^n\left[\max_y \mu_y^{n+1}\right] - \max_y \mu_y^n\right) = \arg\max_x\, \mu_x^n + \frac{\gamma}{1-\gamma} \nu_x^{KG,n}.$$
(The second equality holds because $\max_y \mu_y^n$ does not depend on $x$.)
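Putting the pieces together, a self-contained sketch of the online KG decision rule; the numbers in the example are arbitrary, chosen only to contrast a patient planner with a myopic one:

```python
from math import sqrt
from statistics import NormalDist

_std = NormalDist()

def online_kg_choice(mu, sigma, sigma_eps, gamma):
    """Score each arm by immediate reward plus a discounted learning bonus:
    mu_x + gamma/(1-gamma) * nu_x^KG, and pick the arg max."""
    scores = []
    for x in range(len(mu)):
        var_next = 1.0 / (1.0 / sigma[x] ** 2 + 1.0 / sigma_eps ** 2)
        s_tilde = sqrt(sigma[x] ** 2 - var_next)
        best_other = max(m for y, m in enumerate(mu) if y != x)
        z = -abs(mu[x] - best_other) / s_tilde
        nu = s_tilde * (z * _std.cdf(z) + _std.pdf(z))
        scores.append(mu[x] + gamma / (1.0 - gamma) * nu)
    return max(range(len(mu)), key=scores.__getitem__)

# With gamma near 1 the learning bonus dominates and the uncertain arm 1
# is explored; with gamma near 0 the rule exploits arm 0's higher mean.
explore = online_kg_choice(mu=[1.0, 0.9], sigma=[0.01, 2.0],
                           sigma_eps=1.0, gamma=0.9)
exploit = online_kg_choice(mu=[1.0, 0.9], sigma=[0.01, 2.0],
                           sigma_eps=1.0, gamma=0.01)
```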
16 Why KG is not an index policy. The KG decision rule is $X^{KG,n}(s^n) = \arg\max_x\, \mu_x^n + \frac{\gamma}{1-\gamma} \nu_x^{KG,n}$, where $\nu_x^{KG,n} = \tilde{\sigma}_x^n\, f\!\left(-\left|\frac{\mu_x^n - \max_{y \neq x} \mu_y^n}{\tilde{\sigma}_x^n}\right|\right)$. The KG factor of alternative $x$ depends on $\max_{y \neq x} \mu_y^n$: the calculation for alternative $x$ uses our beliefs about all the other alternatives.
18 KG will converge to some alternative. We say that the KG policy converges to alternative $x$ if it measures alternative $x$ infinitely often. Proposition: For almost every sample path, only one alternative is measured infinitely often by the KG policy.
19 KG will converge to some alternative. KG has to converge to some alternative, but not necessarily to the best one. However, even the Gittins policy, which is known to be optimal, does not necessarily converge to the best alternative. Can we find situations where KG does converge to the best alternative?
20 Connection to offline KG. Idea: for $\gamma$ close to 1, the online KG decision rule $X^{KG,n}(s^n) = \arg\max_x\, \mu_x^n + \frac{\gamma}{1-\gamma} \nu_x^{KG,n}$ starts to resemble the offline KG decision rule $X^{Off,n}(s^n) = \arg\max_x \nu_x^{KG,n}$. We know from Frazier et al. (2008) that the offline KG rule will measure every alternative infinitely often (thus finding the best alternative) in an infinite-horizon setting.
21 Asymptotic behavior for large $\gamma$. Let KG($\gamma$) denote the online KG policy with discount factor $\gamma$. Define the stopping time
$$N_\gamma = \min\left\{n \geq 0 : X^{Off,n}(s^n) \neq X^{KG(\gamma),n}(s^n)\right\}.$$
Lemma: As $\gamma \to 1$, $N_\gamma \to \infty$ almost surely.
22 Asymptotic behavior for large $\gamma$. The time horizon can be divided into three periods: for $n < N_\gamma$, online KG agrees with offline KG; for $n \geq N_\gamma$, online KG may disagree with offline KG; and eventually online KG converges to a single alternative.
25 Asymptotic behavior for large $\gamma$. Lemma: For fixed $0 < \gamma < 1$,
$$\lim_{n \to \infty} \mathbb{E}^{KG(\gamma)}\left[\max_x \mu_x^n\right] \geq \mathbb{E}^{KG(\gamma)}\left[\max_x \mu_x^{N_\gamma}\right].$$
Proof: It can be shown that $M^n = \max_x \mu_x^n$ is a uniformly integrable submartingale. Therefore, $M^n$ converges almost surely and in $L^1$, and
$$\lim_n \mathbb{E} M^n = \mathbb{E} \lim_n M^n \geq \mathbb{E} M^{N_\gamma}$$
by Doob's optional sampling theorem.
26 Asymptotic behavior for large $\gamma$. Lemma:
$$\lim_{\gamma \to 1} \mathbb{E}^{KG(\gamma)}\left[\max_x \mu_x^{N_\gamma}\right] = \mathbb{E}\left[\max_x \mu_x\right].$$
Proof: Because online KG agrees with offline KG on all decisions before $N_\gamma$,
$$\lim_{\gamma \to 1} \mathbb{E}^{KG(\gamma)}\left[\max_x \mu_x^{N_\gamma}\right] = \lim_{\gamma \to 1} \mathbb{E}^{Off}\left[\max_x \mu_x^{N_\gamma}\right] = \lim_{n \to \infty} \mathbb{E}^{Off}\left[\max_x \mu_x^n\right] = \mathbb{E}\left[\max_x \mu_x\right].$$
The last step follows by the asymptotic optimality of offline KG (Frazier et al. 2008).
27 Main asymptotic result. Theorem:
$$\lim_{\gamma \to 1} \lim_{n \to \infty} \mathbb{E}^{KG(\gamma)}\left[\max_x \mu_x^n\right] = \mathbb{E}\left[\max_x \mu_x\right].$$
Proof: Combining the previous results yields
$$\lim_{\gamma \to 1} \lim_{n \to \infty} \mathbb{E}^{KG(\gamma)}\left[\max_x \mu_x^n\right] \geq \lim_{\gamma \to 1} \mathbb{E}^{KG(\gamma)}\left[\max_x \mu_x^{N_\gamma}\right] = \mathbb{E}\left[\max_x \mu_x\right].$$
The other direction can be obtained using Jensen's inequality.
28 Summary of asymptotic results. As $\gamma \to 1$, the value of the alternative measured infinitely often by KG converges to the value of the best alternative. This is evidence that some of the attractive properties of offline KG carry over to the online problem. Rates of convergence are a subject for future work. Experimental work suggests that online KG performs well in practice.
30 Bandit problems with correlated arms. The classic multi-armed bandit model assumes that the alternatives are independent. However, the definition of the knowledge gradient $\nu_x^{KG,n} = \mathbb{E}^n\left[\max_y \mu_y^{n+1}\right] - \max_y \mu_y^n$ does not preclude the presence of correlated beliefs.
31 The meaning of correlated beliefs. Correlated beliefs allow us to learn about many alternatives by making one measurement. Observe what happens when we measure a single alternative. [Figure sequence: beliefs about all alternatives shift after one measurement; not transcribed.]
34 Mathematical model for correlated problem. At first, we believe that $\mu \sim N(\mu^0, \Sigma^0)$. We measure alternative $x$ and observe a reward $\hat{\mu}_x^1 \sim N(\mu_x, \sigma_\varepsilon^2)$. As a result, our beliefs change:
$$\mu^1 = \mu^0 + \frac{\hat{\mu}_x^1 - \mu_x^0}{\sigma_\varepsilon^2 + \Sigma_{xx}^0}\, \Sigma^0 e_x, \qquad \Sigma^1 = \Sigma^0 - \frac{\Sigma^0 e_x e_x^T \Sigma^0}{\sigma_\varepsilon^2 + \Sigma_{xx}^0}.$$
The vector $e_x$ consists of all zeros, with a single 1 at index $x$. Now, it is possible for every component of $\mu$ to change after a measurement.
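The correlated update is a rank-one (Sherman-Morrison-style) adjustment of the covariance matrix. A minimal pure-Python sketch, with names of my choosing:

```python
def correlated_update(mu, Sigma, x, y, noise_var):
    """Update a correlated normal belief after observing arm x.

    mu: list of prior means; Sigma: prior covariance (list of lists);
    y: observed reward; noise_var: measurement noise variance.
    """
    n = len(mu)
    col = [Sigma[i][x] for i in range(n)]      # Sigma^0 e_x: column x of Sigma
    denom = noise_var + Sigma[x][x]
    step = (y - mu[x]) / denom
    mu_new = [mu[i] + step * col[i] for i in range(n)]
    Sigma_new = [[Sigma[i][j] - col[i] * col[j] / denom
                  for j in range(n)] for i in range(n)]
    return mu_new, Sigma_new

# Measuring arm 0 also moves our belief about the positively correlated arm 1.
mu1, S1 = correlated_update(mu=[0.0, 0.0],
                            Sigma=[[1.0, 0.5], [0.5, 1.0]],
                            x=0, y=1.0, noise_var=1.0)
```

Every mean moves in proportion to its covariance with the measured arm, which is exactly why one measurement can teach us about many alternatives.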
35 Correlated knowledge gradients. An index policy is inherently unable to consider correlations, because it always considers each alternative separately from the others. However, we can still use the KG rule: $X^{KG,n}(s^n) = \arg\max_x\, \mu_x^n + \frac{\gamma}{1-\gamma} \nu_x^{KG,n}$. The computation of $\nu_x^{KG,n}$ is more complicated than in the independent case, but Frazier et al. (2009) provide a numerical algorithm.
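The exact algorithm of Frazier et al. (2009) is more involved; purely as an illustration of what $\nu_x^{KG,n}$ means under correlated beliefs (and not as their algorithm), the expectation can be estimated by Monte Carlo sampling of the next observation:

```python
import random

def correlated_kg_mc(mu, Sigma, x, noise_var, n_samples=20000, seed=1):
    """Monte Carlo estimate of the KG factor of arm x under correlated beliefs.

    Samples the next observation from its predictive distribution, applies
    the conjugate mean update to every arm, and averages the resulting maxima.
    """
    rng = random.Random(seed)
    n = len(mu)
    col = [Sigma[i][x] for i in range(n)]
    denom = noise_var + Sigma[x][x]
    pred_std = denom ** 0.5        # predictive std. dev. of the observation
    total = 0.0
    for _ in range(n_samples):
        y = mu[x] + pred_std * rng.gauss(0.0, 1.0)
        step = (y - mu[x]) / denom
        total += max(mu[i] + step * col[i] for i in range(n))
    return total / n_samples - max(mu)

# Sanity check: with two independent, identical arms this should approach
# the independent-case closed form sqrt(0.5) * phi(0) ≈ 0.282.
v = correlated_kg_mc(mu=[0.0, 0.0], Sigma=[[1.0, 0.0], [0.0, 1.0]],
                     x=0, noise_var=1.0)
```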
36 Correlated knowledge gradients. By introducing correlations, we are able to learn efficiently in problems with a very large number of alternatives. Example: subset selection. Suppose that a diabetes treatment consists of 5 drugs (with 20 drugs to choose from). Two treatments are correlated if they contain one or more of the same drugs. The number of subsets is $\binom{20}{5} = 15{,}504$, but one measurement will now provide much more information than before.
38 Numerical results: independent alternatives. We compare the performance of two policies by comparing the true values of the alternatives they measure in every time step. KG outperforms approximate Gittins (Chick & Gans 2009) over 75% of the time.
39 Numerical results: independent alternatives. We can examine how performance varies with measurement noise in several representative sample problems.
40 Numerical results: independent alternatives. We can also vary the discount factor; KG yields very good performance for $\gamma$ close to 1.
41 Numerical results: correlated alternatives. [Figure not transcribed.]
43 Conclusions. The KG policy is a new, non-index approach to the multi-armed bandit problem. KG is substantially easier to compute than Gittins indices, and is competitive against the state of the art in Gittins approximation. KG outperforms or is competitive against other index policies such as interval estimation. KG handles correlated problems, which index policies were never designed to do.
44 References
Chick, S.E. & Gans, N. (2009). Economic analysis of simulation selection options. Management Science, to appear.
Frazier, P.I., Powell, W.B. & Dayanik, S. (2008). A knowledge-gradient policy for sequential information collection. SIAM Journal on Control and Optimization, 47(5).
Frazier, P.I., Powell, W.B. & Dayanik, S. (2009). The knowledge gradient policy for correlated normal rewards. INFORMS Journal on Computing, to appear.
Gittins, J.C. & Jones, D.M. (1974). A dynamic allocation index for the sequential design of experiments. In: Progress in Statistics, J. Gani et al., eds.
Gupta, S. & Miescke, K. (1996). Bayesian look-ahead one-stage sampling allocations for selecting the best population. Journal of Statistical Planning and Inference, 54.
45 References
Kaelbling, L.P. (1993). Learning in Embedded Systems. MIT Press, Cambridge, MA.
Lai, T.L. (1987). Adaptive treatment allocation and the multi-armed bandit problem. Annals of Statistics, 15(3).
Ryzhov, I.O., Powell, W.B. & Frazier, P.I. The knowledge gradient algorithm for a general class of online learning problems. In preparation.
Ryzhov, I.O. & Powell, W.B. The knowledge gradient algorithm for online subset selection. Proceedings of the 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning.
More informationOrdinal optimization - Empirical large deviations rate estimators, and multi-armed bandit methods
Ordinal optimization - Empirical large deviations rate estimators, and multi-armed bandit methods Sandeep Juneja Tata Institute of Fundamental Research Mumbai, India joint work with Peter Glynn Applied
More informationProcedia Computer Science 00 (2011) 000 6
Procedia Computer Science (211) 6 Procedia Computer Science Complex Adaptive Systems, Volume 1 Cihan H. Dagli, Editor in Chief Conference Organized by Missouri University of Science and Technology 211-
More informationReinforcement Learning
Reinforcement Learning Lecture 5: Bandit optimisation Alexandre Proutiere, Sadegh Talebi, Jungseul Ok KTH, The Royal Institute of Technology Objectives of this lecture Introduce bandit optimisation: the
More informationControl Theory : Course Summary
Control Theory : Course Summary Author: Joshua Volkmann Abstract There are a wide range of problems which involve making decisions over time in the face of uncertainty. Control theory draws from the fields
More informationAn Information-Theoretic Analysis of Thompson Sampling
Journal of Machine Learning Research (2015) Submitted ; Published An Information-Theoretic Analysis of Thompson Sampling Daniel Russo Department of Management Science and Engineering Stanford University
More informationInformation collection on a graph
Information collection on a grah Ilya O. Ryzhov Warren Powell October 25, 2009 Abstract We derive a knowledge gradient olicy for an otimal learning roblem on a grah, in which we use sequential measurements
More informationInformation collection on a graph
Information collection on a grah Ilya O. Ryzhov Warren Powell February 10, 2010 Abstract We derive a knowledge gradient olicy for an otimal learning roblem on a grah, in which we use sequential measurements
More informationPartially Observable Markov Decision Processes (POMDPs)
Partially Observable Markov Decision Processes (POMDPs) Sachin Patil Guest Lecture: CS287 Advanced Robotics Slides adapted from Pieter Abbeel, Alex Lee Outline Introduction to POMDPs Locally Optimal Solutions
More informationParadoxes in Learning and the Marginal Value of Information
Paradoxen Learning and the Marginal Value of Information Peter I. Frazier, Warren B. Powell April 14, 2010 Abstract We consider the Bayesian ranking and selection problem, in which one wishes to allocate
More informationSOME MEMORYLESS BANDIT POLICIES
J. Appl. Prob. 40, 250 256 (2003) Printed in Israel Applied Probability Trust 2003 SOME MEMORYLESS BANDIT POLICIES EROL A. PEKÖZ, Boston University Abstract We consider a multiarmed bit problem, where
More informationMulti armed bandit problem: some insights
Multi armed bandit problem: some insights July 4, 20 Introduction Multi Armed Bandit problems have been widely studied in the context of sequential analysis. The application areas include clinical trials,
More information9 - Markov processes and Burt & Allison 1963 AGEC
This document was generated at 8:37 PM on Saturday, March 17, 2018 Copyright 2018 Richard T. Woodward 9 - Markov processes and Burt & Allison 1963 AGEC 642-2018 I. What is a Markov Chain? A Markov chain
More informationMulti-armed Bandits and the Gittins Index
Multi-armed Bandits and the Gittins Index Richard Weber Statistical Laboratory, University of Cambridge A talk to accompany Lecture 6 Two-armed Bandit 3, 10, 4, 9, 12, 1,... 5, 6, 2, 15, 2, 7,... Two-armed
More informationAn Estimation Based Allocation Rule with Super-linear Regret and Finite Lock-on Time for Time-dependent Multi-armed Bandit Processes
An Estimation Based Allocation Rule with Super-linear Regret and Finite Lock-on Time for Time-dependent Multi-armed Bandit Processes Prokopis C. Prokopiou, Peter E. Caines, and Aditya Mahajan McGill University
More informationA Rothschild-Stiglitz approach to Bayesian persuasion
A Rothschild-Stiglitz approach to Bayesian persuasion Matthew Gentzkow and Emir Kamenica Stanford University and University of Chicago December 2015 Abstract Rothschild and Stiglitz (1970) represent random
More informationAllocating Resources, in the Future
Allocating Resources, in the Future Sid Banerjee School of ORIE May 3, 2018 Simons Workshop on Mathematical and Computational Challenges in Real-Time Decision Making online resource allocation: basic model......
More informationParallel Bayesian Global Optimization, with Application to Metrics Optimization at Yelp
.. Parallel Bayesian Global Optimization, with Application to Metrics Optimization at Yelp Jialei Wang 1 Peter Frazier 1 Scott Clark 2 Eric Liu 2 1 School of Operations Research & Information Engineering,
More informationValue and Policy Iteration
Chapter 7 Value and Policy Iteration 1 For infinite horizon problems, we need to replace our basic computational tool, the DP algorithm, which we used to compute the optimal cost and policy for finite
More informationCONSISTENCY OF SEQUENTIAL BAYESIAN SAMPLING POLICIES
SIAM J. CONTROL OPTIM. Vol. 49, No. 2, pp. 712 731 c 2011 Society for Industrial and Applied Mathematics CONSISTENCY OF SEQUENTIAL BAYESIAN SAMPLING POLICIES PETER I. FRAZIER AND WARREN B. POWELL Abstract.
More informationA PECULIAR COIN-TOSSING MODEL
A PECULIAR COIN-TOSSING MODEL EDWARD J. GREEN 1. Coin tossing according to de Finetti A coin is drawn at random from a finite set of coins. Each coin generates an i.i.d. sequence of outcomes (heads or
More informationis a Borel subset of S Θ for each c R (Bertsekas and Shreve, 1978, Proposition 7.36) This always holds in practical applications.
Stat 811 Lecture Notes The Wald Consistency Theorem Charles J. Geyer April 9, 01 1 Analyticity Assumptions Let { f θ : θ Θ } be a family of subprobability densities 1 with respect to a measure µ on a measurable
More informationNotes from Week 9: Multi-Armed Bandit Problems II. 1 Information-theoretic lower bounds for multiarmed
CS 683 Learning, Games, and Electronic Markets Spring 007 Notes from Week 9: Multi-Armed Bandit Problems II Instructor: Robert Kleinberg 6-30 Mar 007 1 Information-theoretic lower bounds for multiarmed
More informationParadoxes in Learning and the Marginal Value of Information
Paradoxes in Learning and the Marginal Value of Information Peter I. Frazier, Warren B. Powell August 21, 2010 Abstract We consider the Bayesian ranking and selection problem, in which one wishes to allocate
More informationProbability Models of Information Exchange on Networks Lecture 5
Probability Models of Information Exchange on Networks Lecture 5 Elchanan Mossel (UC Berkeley) July 25, 2013 1 / 22 Informational Framework Each agent receives a private signal X i which depends on S.
More informationL p Functions. Given a measure space (X, µ) and a real number p [1, ), recall that the L p -norm of a measurable function f : X R is defined by
L p Functions Given a measure space (, µ) and a real number p [, ), recall that the L p -norm of a measurable function f : R is defined by f p = ( ) /p f p dµ Note that the L p -norm of a function f may
More information