The knowledge gradient method for multi-armed bandit problems

The knowledge gradient method for multi-armed bandit problems: moving beyond index policies
Ilya O. Ryzhov, Warren Powell, Peter Frazier
Department of Operations Research and Financial Engineering, Princeton University
INFORMS APS Conference, July 12, 2009

Outline
1. Introduction
2. Mathematical model
3. The knowledge gradient (KG) policy
4. Asymptotic behavior of the KG policy
5. Bandit problems with correlated arms
6. Numerical results
7. Conclusions

Motivation: clinical drug trials
We are testing experimental diabetes treatments on human patients.
We want to find the best treatment, but we also care about the effect on the patients.
How can we allocate groups of patients to treatments?

The multi-armed bandit problem
There are M different treatments (M arms or alternatives).
The effectiveness of each treatment is unknown, but we have a Bayesian belief about it.
We can measure a treatment (by trying it out on a group of patients) and observe a result that changes our beliefs.
How should we allocate our measurements to maximize some measure of the total benefit across all patient groups?

The multi-armed bandit problem
At first, we believe that $\mu_x \sim N(\mu_x^0, 1/\beta_x^0)$. We measure alternative $x$ and observe a reward $\hat{\mu}_x^1 \sim N(\mu_x, 1/\beta^\varepsilon)$. As a result, our beliefs change:
$$\mu_x^1 = \frac{\beta_x^0 \mu_x^0 + \beta^\varepsilon \hat{\mu}_x^1}{\beta_x^0 + \beta^\varepsilon}, \qquad \beta_x^1 = \beta_x^0 + \beta^\varepsilon.$$
The quantity $\beta_x^0$ is called the precision of our beliefs, with $(\sigma_x^0)^2 = 1/\beta_x^0$. For all $y \neq x$, $\mu_y^1 = \mu_y^0$.
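
A minimal sketch of this precision-weighted update in Python (the function and variable names are mine, not from the talk):

```python
import numpy as np

def update_independent(mu, beta, x, y_hat, beta_eps):
    """Precision-weighted Bayesian update for independent normal beliefs.

    mu, beta : arrays of posterior means and precisions for all M arms
    x        : index of the measured arm
    y_hat    : observed reward for arm x
    beta_eps : measurement precision (1 / noise variance)
    """
    mu, beta = mu.copy(), beta.copy()
    mu[x] = (beta[x] * mu[x] + beta_eps * y_hat) / (beta[x] + beta_eps)
    beta[x] = beta[x] + beta_eps
    return mu, beta   # beliefs about all arms y != x are unchanged
```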

The multi-armed bandit problem
After $n$ measurements, our beliefs about the alternatives are encoded in the knowledge state $s^n = (\mu^n, \sigma^n)$.
A decision rule $X^n$ is a function that maps the knowledge state $s^n$ to an alternative $X^n(s^n) \in \{1, \dots, M\}$.
A learning policy $\pi$ is a sequence of decision rules $X^{\pi,1}, X^{\pi,2}, \dots$
Objective function: choose a measurement policy $\pi$ to achieve
$$\sup_\pi \mathbb{E}^\pi \sum_{n=0}^\infty \gamma^n \mu_{X^{\pi,n}(s^n)}$$
for some discount factor $0 < \gamma < 1$.
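
To make the objective concrete, here is a hedged sketch of how one might simulate a policy and accumulate the discounted true rewards it earns, truncating the infinite sum at a finite horizon; it reuses the update_independent helper sketched above, and all names are illustrative:

```python
import numpy as np

def run_policy(policy, mu_true, mu0, beta0, beta_eps, gamma, horizon=200, seed=0):
    """Simulate a learning policy and accumulate discounted true rewards.

    policy(mu, beta) -> index of the arm to measure next.
    Truncates the infinite-horizon sum at `horizon` steps.
    """
    rng = np.random.default_rng(seed)
    mu, beta = mu0.copy(), beta0.copy()
    total = 0.0
    for n in range(horizon):
        x = policy(mu, beta)
        total += gamma ** n * mu_true[x]          # reward actually earned at step n
        y_hat = mu_true[x] + rng.normal(0.0, 1.0 / np.sqrt(beta_eps))
        mu, beta = update_independent(mu, beta, x, y_hat, beta_eps)
    return total
```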

Index policies
An index policy yields decision rules of the form $X^{\pi,n}(s^n) = \arg\max_x I_x^\pi(\mu_x^n, \sigma_x^n)$, where the index $I_x^\pi$ depends on our beliefs about $x$, but not on our beliefs about other alternatives.
Index policies allow us to consider each alternative separately, as if there were no other alternatives.

Examples of index policies
Interval estimation (Kaelbling 1993): $X^{IE,n}(s^n) = \arg\max_x \mu_x^n + z_{\alpha/2}\,\sigma_x^n$
Upper confidence bound for a finite horizon $N$ (Lai 1987): $X^{UCB,n}(s^n) = \arg\max_x \mu_x^n + \sigma_\varepsilon \sqrt{\frac{2}{N_x^n}\, g\!\left(\frac{N_x^n}{N}\right)}$, where $N_x^n$ is the number of times alternative $x$ has been measured
Gittins indices (Gittins & Jones 1974): $X^{Gitt,n}(s^n) = \arg\max_x \Gamma(\mu_x^n, \sigma_x^n, \sigma_\varepsilon, \gamma)$
The Gittins policy is optimal for infinite-horizon problems, but Gittins indices are difficult to compute and usually need to be approximated (Chick & Gans 2009).
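
As an illustration, the interval-estimation index from the slide translates directly into a few lines of Python (a sketch; the names are mine):

```python
import numpy as np
from scipy.stats import norm

def interval_estimation_index(mu, sigma, alpha=0.05):
    """Interval-estimation index (Kaelbling 1993): posterior mean plus a
    z-quantile multiple of the posterior standard deviation."""
    z = norm.ppf(1.0 - alpha / 2.0)
    return mu + z * sigma

# the IE policy measures the arm with the largest index:
# x = np.argmax(interval_estimation_index(mu, sigma))
```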

The knowledge gradient concept
One-period look-ahead rule for making measurements.
Originally developed for offline problems (Gupta & Miescke 1996, Frazier et al. 2008).
Offline objective: $\sup_\pi \mathbb{E}^\pi \left[ \max_x \mu_x^N \right]$
Online objective: $\sup_\pi \mathbb{E}^\pi \sum_{n=0}^\infty \gamma^n \mu_{X^{\pi,n}(s^n)}$

Definition of the knowledge gradient
In the offline problem, our only goal is to estimate $\max_x \mu_x$.
If we measure $x$ at time $n$, we can expect to improve our estimate by
$$\nu_x^{KG,n} = \mathbb{E}_x^n\left[\max_y \mu_y^{n+1}\right] - \max_y \mu_y^n.$$
This quantity is called the knowledge gradient of $x$ at time $n$.
The offline KG decision rule is $X^{Off,n}(s^n) = \arg\max_x \nu_x^{KG,n}$.

Computation of the knowledge gradient
The expectation $\nu_x^{KG,n} = \mathbb{E}_x^n\left[\max_y \mu_y^{n+1}\right] - \max_y \mu_y^n$ has a closed-form solution
$$\nu_x^{KG,n} = \tilde{\sigma}_x^n\, f\!\left(-\frac{\left|\mu_x^n - \max_{y \neq x} \mu_y^n\right|}{\tilde{\sigma}_x^n}\right),$$
where $f(z) = z\Phi(z) + \phi(z)$, $\tilde{\sigma}_x^n = \sqrt{(\sigma_x^n)^2 - (\sigma_x^{n+1})^2}$, and $\Phi, \phi$ are the standard Gaussian cdf and pdf.
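
A sketch of this closed form in Python, assuming independent normal beliefs and a known measurement noise standard deviation (sigma_eps); all names are mine:

```python
import numpy as np
from scipy.stats import norm

def kg_factors(mu, sigma, sigma_eps):
    """Offline knowledge-gradient factors for independent normal beliefs.

    mu, sigma : posterior means and standard deviations for all M arms
    sigma_eps : measurement noise standard deviation
    """
    var = sigma ** 2
    # posterior variance of arm x after one more measurement of x
    var_next = 1.0 / (1.0 / var + 1.0 / sigma_eps ** 2)
    sigma_tilde = np.sqrt(var - var_next)    # one-step change in belief about arm x

    nu = np.empty(len(mu))
    for x in range(len(mu)):
        best_other = np.max(np.delete(mu, x))              # max over y != x of mu_y
        z = -np.abs(mu[x] - best_other) / sigma_tilde[x]
        nu[x] = sigma_tilde[x] * (z * norm.cdf(z) + norm.pdf(z))   # f(z)
    return nu
```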

Using KG for multi-armed bandits
In the bandit problem, if we stop learning at time $n+1$, we expect to collect a reward of $\frac{1}{1-\gamma}\max_x \mu_x^{n+1}$.
If we decide to choose alternative $x$ at time $n$, we expect to collect $\mu_x^n$ immediately, plus the discounted downstream value.
The online KG decision rule is
$$X^{KG,n}(s^n) = \arg\max_x \mu_x^n + \frac{\gamma}{1-\gamma}\,\mathbb{E}_x^n\left[\max_y \mu_y^{n+1}\right] = \arg\max_x \mu_x^n + \frac{\gamma}{1-\gamma}\left(\mathbb{E}_x^n\left[\max_y \mu_y^{n+1}\right] - \max_y \mu_y^n\right) = \arg\max_x \mu_x^n + \frac{\gamma}{1-\gamma}\,\nu_x^{KG,n},$$
where subtracting the constant $\max_y \mu_y^n$ does not change the arg max.
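
Combining the immediate reward with the discounted KG factor gives a one-line decision rule; this sketch reuses the kg_factors helper from the previous block, and the names are mine:

```python
import numpy as np

def online_kg_decision(mu, sigma, sigma_eps, gamma):
    """Online KG rule: immediate reward plus discounted value of information.

    kg_factors() is the offline KG helper sketched earlier in this transcript.
    """
    nu = kg_factors(mu, sigma, sigma_eps)
    return int(np.argmax(mu + gamma / (1.0 - gamma) * nu))
```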

Why KG is not an index policy
The KG decision rule is $X^{KG,n}(s^n) = \arg\max_x \mu_x^n + \frac{\gamma}{1-\gamma}\,\nu_x^{KG,n}$, where
$$\nu_x^{KG,n} = \tilde{\sigma}_x^n\, f\!\left(-\frac{\left|\mu_x^n - \max_{y \neq x} \mu_y^n\right|}{\tilde{\sigma}_x^n}\right).$$
The KG factor of alternative $x$ depends on $\max_{y \neq x} \mu_y^n$: the calculation for alternative $x$ uses our beliefs about all the other alternatives.

KG will converge to some alternative
We say that the KG policy converges to alternative $x$ if it measures alternative $x$ infinitely often.
Proposition. For almost every sample path, only one alternative will be measured infinitely often by the KG policy.

KG will converge to some alternative
KG has to converge to some alternative, but not necessarily to the best one.
However, even the Gittins policy, which is known to be optimal, does not necessarily converge to the best alternative.
Can we find situations where KG does converge to the best alternative?

Connection to offline KG
Idea: for $\gamma$ close to 1, the online KG decision rule $X^{KG,n}(s^n) = \arg\max_x \mu_x^n + \frac{\gamma}{1-\gamma}\,\nu_x^{KG,n}$ starts to resemble the offline KG decision rule $X^{Off,n}(s^n) = \arg\max_x \nu_x^{KG,n}$.
We know from Frazier et al. (2008) that the offline KG rule will measure every alternative infinitely often (thus finding the best alternative) in an infinite-horizon setting.

Asymptotic behavior for large γ
Let KG(γ) denote the online KG policy with discount factor $\gamma$. Define the stopping time
$$N_\gamma = \min\left\{n \ge 0 : X^{Off,n}(s^n) \neq X^{KG(\gamma),n}(s^n)\right\}.$$
Lemma. As $\gamma \to 1$, $N_\gamma \to \infty$ almost surely.

Asymptotic behavior for large γ
The time horizon can be divided into three periods:
$n < N_\gamma$: online KG agrees with offline KG
$n \ge N_\gamma$: online KG need not agree with offline KG
$n \to \infty$: online KG converges

Asymptotic behavior for large γ
Lemma. For fixed $0 < \gamma < 1$,
$$\lim_{n \to \infty} \mathbb{E}^{KG(\gamma)}\left[\max_x \mu_x^n\right] \ge \mathbb{E}^{KG(\gamma)}\left[\max_x \mu_x^{N_\gamma}\right].$$
Proof. It can be shown that $M^n = \max_x \mu_x^n$ is a uniformly integrable submartingale. Therefore, $M^n$ converges almost surely and in $L^1$, and
$$\lim_{n \to \infty} \mathbb{E} M^n = \mathbb{E} \lim_{n \to \infty} M^n \ge \mathbb{E} M^{N_\gamma}$$
by Doob's optional sampling theorem.

Asymptotic behavior for large γ
Lemma.
$$\lim_{\gamma \to 1} \mathbb{E}^{KG(\gamma)}\left[\max_x \mu_x^{N_\gamma}\right] = \mathbb{E}\left[\max_x \mu_x\right].$$
Proof. Because online KG agrees with offline KG on all decisions before $N_\gamma$,
$$\lim_{\gamma \to 1} \mathbb{E}^{KG(\gamma)}\left[\max_x \mu_x^{N_\gamma}\right] = \lim_{\gamma \to 1} \mathbb{E}^{Off}\left[\max_x \mu_x^{N_\gamma}\right] = \lim_{n \to \infty} \mathbb{E}^{Off}\left[\max_x \mu_x^{n}\right] = \mathbb{E}\left[\max_x \mu_x\right].$$
The last equality follows from the asymptotic optimality of offline KG (Frazier et al. 2008).

Main asymptotic result
Theorem.
$$\lim_{\gamma \to 1} \lim_{n \to \infty} \mathbb{E}^{KG(\gamma)}\left[\max_x \mu_x^n\right] = \mathbb{E}\left[\max_x \mu_x\right].$$
Proof. Combining the previous results yields
$$\lim_{\gamma \to 1} \lim_{n \to \infty} \mathbb{E}^{KG(\gamma)}\left[\max_x \mu_x^n\right] \ge \lim_{\gamma \to 1} \mathbb{E}^{KG(\gamma)}\left[\max_x \mu_x^{N_\gamma}\right] = \mathbb{E}\left[\max_x \mu_x\right].$$
The other direction can be obtained using Jensen's inequality.

Summary of asymptotic results
As $\gamma \to 1$, the value of the alternative measured infinitely often by KG converges to the value of the best alternative.
This is evidence that some of the attractive properties of offline KG carry over to the online problem.
Rates of convergence are a subject for future work.
Experimental work suggests that online KG performs well in practice.

Bandit problems with correlated arms
The classic multi-armed bandit model assumes that the alternatives are independent. However, the definition of the knowledge gradient
$$\nu_x^{KG,n} = \mathbb{E}_x^n\left[\max_y \mu_y^{n+1}\right] - \max_y \mu_y^n$$
does not preclude the presence of correlated beliefs.

The meaning of correlated beliefs
Correlated beliefs allow us to learn about many alternatives by making one measurement.
Observe what happens when we measure alternative 5...

Mathematical model for the correlated problem
At first, we believe that $\mu \sim N(\mu^0, \Sigma^0)$. We measure alternative $x$ and observe a reward $\hat{\mu}_x^1 \sim N(\mu_x, \sigma_\varepsilon^2)$. As a result, our beliefs change:
$$\mu^1 = \mu^0 + \frac{\hat{\mu}_x^1 - \mu_x^0}{\sigma_\varepsilon^2 + \Sigma_{xx}^0}\,\Sigma^0 e_x, \qquad \Sigma^1 = \Sigma^0 - \frac{\Sigma^0 e_x e_x^T \Sigma^0}{\sigma_\varepsilon^2 + \Sigma_{xx}^0}.$$
The vector $e_x$ consists of all zeros, with a single 1 at index $x$.
Now, it is possible for every component of $\mu$ to change after a measurement.
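
A direct transcription of this rank-one update in NumPy (a sketch; the names are mine):

```python
import numpy as np

def update_correlated(mu, Sigma, x, y_hat, sigma_eps):
    """Bayesian update of a multivariate normal belief after measuring arm x.

    mu        : (M,) vector of posterior means
    Sigma     : (M, M) posterior covariance matrix
    x         : index of the measured arm
    y_hat     : observed reward for arm x
    sigma_eps : measurement noise standard deviation
    """
    e_x = np.zeros(len(mu))
    e_x[x] = 1.0
    denom = sigma_eps ** 2 + Sigma[x, x]
    Sigma_col = Sigma @ e_x                          # column x of Sigma
    mu_new = mu + (y_hat - mu[x]) / denom * Sigma_col
    Sigma_new = Sigma - np.outer(Sigma_col, Sigma_col) / denom
    return mu_new, Sigma_new
```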

Correlated knowledge gradients
An index policy is inherently unable to account for correlations, because it always considers each alternative separately from the others.
However, we can still use the KG rule: $X^{KG,n}(s^n) = \arg\max_x \mu_x^n + \frac{\gamma}{1-\gamma}\,\nu_x^{KG,n}$.
The computation of $\nu_x^{KG,n}$ is more complicated than in the independent case, but Frazier et al. (2009) provide a numerical algorithm.
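
The exact algorithm of Frazier et al. (2009) is not reproduced here; as a rough illustration only, the correlated KG factor can be approximated by Monte Carlo, simulating hypothetical observations of arm x and reusing the update_correlated sketch above (all names are mine):

```python
import numpy as np

def correlated_kg_factor_mc(mu, Sigma, x, sigma_eps, n_samples=10_000, seed=0):
    """Monte Carlo estimate of the correlated KG factor for arm x.

    Not the exact algorithm of Frazier et al. (2009): it samples the
    predictive distribution of the next observation of arm x, applies
    update_correlated(), and averages the improvement in max_y mu_y.
    """
    rng = np.random.default_rng(seed)
    best_now = np.max(mu)
    pred_std = np.sqrt(Sigma[x, x] + sigma_eps ** 2)   # predictive std of the observation
    improvements = []
    for y_hat in mu[x] + pred_std * rng.standard_normal(n_samples):
        mu_next, _ = update_correlated(mu, Sigma, x, y_hat, sigma_eps)
        improvements.append(np.max(mu_next) - best_now)
    return float(np.mean(improvements))
```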

Correlated knowledge gradients
By introducing correlations, we are able to learn efficiently in problems with a very large number of alternatives.
Example: subset selection. Suppose that a diabetes treatment consists of 5 drugs (with 20 drugs to choose from). Two treatments are correlated if they contain one or more of the same drugs. The number of subsets is $\binom{20}{5} = 15504$, but one measurement will now provide much more information than before.
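
One simple, entirely hypothetical way to encode this idea in code is to let the prior covariance between two treatments grow with the number of drugs they share; this illustrates the modeling idea from the slide, not the authors' actual prior:

```python
from itertools import combinations

# Treatments are 5-drug subsets of 20 candidate drugs.
drugs = range(20)
treatments = list(combinations(drugs, 5))     # 15,504 alternatives
print(len(treatments))                        # -> 15504

def prior_covariance(t1, t2, base_var=1.0):
    """Hypothetical prior: covariance proportional to the number of shared drugs."""
    shared = len(set(t1) & set(t2))
    return base_var * shared / 5.0            # identical treatments get full variance
```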

Numerical results: independent alternatives
We compare the performance of two policies by comparing the true values of the alternatives they measure in every time step.
KG outperforms approximate Gittins (Chick & Gans 2009) over 75% of the time.

Numerical results: independent alternatives
We can examine how performance varies with measurement noise in several representative sample problems.

Numerical results: independent alternatives
We can also vary the discount factor; KG yields very good performance for $\gamma$ close to 1.

Numerical results: correlated alternatives

Conclusions
The KG policy is a new, non-index approach to the multi-armed bandit problem.
KG is substantially easier to compute than Gittins indices, and is competitive with the state of the art in Gittins approximation.
KG outperforms or is competitive with other index policies such as interval estimation.
KG handles correlated problems, which index policies were never designed to do.

References
Chick, S.E. & Gans, N. (2009) Economic Analysis of Simulation Selection Options. Management Science, to appear.
Frazier, P.I., Powell, W. & Dayanik, S. (2008) A knowledge-gradient policy for sequential information collection. SIAM Journal on Control and Optimization 47(5), 2410-2439.
Frazier, P.I., Powell, W. & Dayanik, S. (2009) The knowledge-gradient policy for correlated normal beliefs. INFORMS Journal on Computing, to appear.
Gittins, J.C. & Jones, D.M. (1974) A Dynamic Allocation Index for the Sequential Design of Experiments. In: Progress in Statistics, J. Gani et al., eds., 241-266.
Gupta, S. & Miescke, K. (1996) Bayesian look ahead one stage sampling allocations for selecting the best population. Journal of Statistical Planning and Inference 54, 229-244.

References
Kaelbling, L.P. (1993) Learning in Embedded Systems. MIT Press, Cambridge, MA.
Lai, T.L. (1987) Adaptive treatment allocation and the multi-armed bandit problem. Annals of Statistics 15(3), 1091-1114.
Ryzhov, I.O., Powell, W. & Frazier, P.I. The knowledge gradient algorithm for a general class of online learning problems. In preparation.
Ryzhov, I.O. & Powell, W. The knowledge gradient algorithm for online subset selection. Proceedings of the 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, 137-144.