The knowledge gradient method for multi-armed bandit problems: moving beyond index policies. Ilya O. Ryzhov, Warren Powell, Peter Frazier. Department of Operations Research and Financial Engineering, Princeton University. INFORMS APS Conference, July 12, 2009. 1 / 41
Outline
1. Introduction
2. Mathematical model
3. The knowledge gradient (KG) policy
4. Asymptotic behavior of the KG policy
5. Bandit problems with correlated arms
6. Numerical results
7. Conclusions
Motivation: clinical drug trials. We are testing experimental diabetes treatments on human patients. We want to find the best treatment, but we also care about the effect on the patients. How can we allocate groups of patients to treatments?
The multi-armed bandit problem. There are M different treatments (M arms or alternatives). The effectiveness of each treatment is unknown, but we have a Bayesian belief about it. We can measure a treatment (by trying it out on a group of patients) and observe a result that changes our beliefs. How should we allocate our measurements to maximize some measure of the total benefit across all patient groups?
The multi-armed bandit problem. At first, we believe that μ_x ~ N(μ_x^0, 1/β_x^0). We measure alternative x and observe a reward μ̂_x^1 ~ N(μ_x, 1/β^ε). As a result, our beliefs change: μ_x^1 = (β_x^0 μ_x^0 + β^ε μ̂_x^1) / (β_x^0 + β^ε), β_x^1 = β_x^0 + β^ε. The quantity β_x^0 is called the precision of our beliefs: (σ_x^0)^2 = 1/β_x^0. For all y ≠ x, μ_y^1 = μ_y^0.
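The precision-weighted update above takes only a few lines; here is a minimal sketch (the function name is ours):

```python
def update_beliefs(mu0, beta0, muhat, beta_eps):
    # Precision-weighted average of prior mean and observation:
    # mu1 = (beta0*mu0 + beta_eps*muhat) / (beta0 + beta_eps)
    beta1 = beta0 + beta_eps
    mu1 = (beta0 * mu0 + beta_eps * muhat) / beta1
    return mu1, beta1

# Prior N(0, 1) with precision 1; observe 2.0 with equal precision.
# The posterior mean lands halfway between prior mean and observation.
mu1, beta1 = update_beliefs(0.0, 1.0, 2.0, 1.0)  # mu1 = 1.0, beta1 = 2.0
```

With equal precisions the update is a simple average; a more precise observation would pull the posterior mean closer to μ̂.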
The multi-armed bandit problem. After n measurements, our beliefs about the alternatives are encoded in the knowledge state: s^n = (μ^n, σ^n). A decision rule X^n is a function that maps the knowledge state s^n to an alternative X^n(s^n) ∈ {1, ..., M}. A learning policy π is a sequence of decision rules X^{π,1}, X^{π,2}, .... Objective function: choose a measurement policy π to achieve sup_π E^π Σ_{n=0}^∞ γ^n μ_{X^{π,n}(s^n)} for some discount factor 0 < γ < 1.
Index policies. An index policy yields decision rules of the form X^{π,n}(s^n) = arg max_x I_x^π(μ_x^n, σ_x^n), where the index I_x^π depends on our beliefs about x, but not on our beliefs about other alternatives. Index policies allow us to consider each alternative separately, as if there were no other alternatives.
Examples of index policies. Interval estimation (Kaelbling 1993): X^{IE,n}(s^n) = arg max_x μ_x^n + z_{α/2} σ_x^n. Upper confidence bound for finite horizon (Lai 1987): X^{UCB,n}(s^n) = arg max_x μ_x^n + √( (2/N_x^n) g(N/N_x^n) ), where N_x^n counts the measurements of x made so far and N is the horizon. Gittins indices (Gittins & Jones 1974): X^{Gitt,n}(s^n) = arg max_x Γ(μ_x^n, σ_x^n, σ_ε, γ). The Gittins policy is optimal for infinite-horizon problems, but Gittins indices are difficult to compute and usually need to be approximated (Chick & Gans 2009).
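Of these, interval estimation is the simplest to implement; a minimal sketch (the function name and the choice z_{α/2} = 1.96 are ours):

```python
def interval_estimation(mu, sigma, z_alpha=1.96):
    # Pick the alternative with the highest upper confidence limit
    # mu_x + z * sigma_x: an index that ignores the other alternatives.
    scores = [m + z_alpha * s for m, s in zip(mu, sigma)]
    return max(range(len(mu)), key=lambda x: scores[x])

# Alternative 1 has a lower mean but much higher uncertainty,
# so its upper confidence limit wins.
choice = interval_estimation(mu=[1.0, 0.5], sigma=[0.1, 0.9])  # choice = 1
```

Note how the score of each alternative depends only on that alternative's own beliefs; this is what makes it an index policy.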
The knowledge gradient concept. One-period look-ahead rule for making measurements. Originally developed for offline problems (Gupta & Miescke 1996, Frazier et al. 2008). Offline objective: sup_π E^π [ max_x μ_x^N ]. Online objective: sup_π E^π Σ_{n=0}^∞ γ^n μ_{X^{π,n}(s^n)}.
Definition of the knowledge gradient. In the offline problem, our only goal is to estimate max_x μ_x. If we measure x at time n, we can expect to improve our estimate by ν_x^{KG,n} = E^n [ max_{x'} μ_{x'}^{n+1} − max_{x'} μ_{x'}^n ]. This quantity is called the knowledge gradient of x at time n. The offline KG decision rule is X^{Off,n}(s^n) = arg max_x ν_x^{KG,n}.
Computation of the knowledge gradient. The expectation ν_x^{KG,n} = E^n [ max_{x'} μ_{x'}^{n+1} − max_{x'} μ_{x'}^n ] has a closed-form solution ν_x^{KG,n} = σ̃_x^n f( −|μ_x^n − max_{x'≠x} μ_{x'}^n| / σ̃_x^n ), where f(z) = zΦ(z) + φ(z), σ̃_x^n = √( (σ_x^n)^2 − (σ_x^{n+1})^2 ), and Φ, φ are the standard Gaussian cdf and pdf.
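The closed form needs nothing beyond the standard Gaussian cdf and pdf; here is a sketch for independent normal beliefs (function name ours, using only the standard library):

```python
import math

def kg_factor(mu, sigma, sigma_eps, x):
    # Predictive reduction in std dev:
    # sigma_tilde^2 = (sigma^n)^2 - (sigma^{n+1})^2,
    # where the posterior variance follows the precision update.
    var_next = 1.0 / (1.0 / sigma[x] ** 2 + 1.0 / sigma_eps ** 2)
    sigma_tilde = math.sqrt(sigma[x] ** 2 - var_next)
    best_other = max(m for i, m in enumerate(mu) if i != x)
    z = -abs(mu[x] - best_other) / sigma_tilde
    phi = math.exp(-z * z / 2.0) / math.sqrt(2.0 * math.pi)   # Gaussian pdf
    Phi = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))          # Gaussian cdf
    return sigma_tilde * (z * Phi + phi)                      # sigma_tilde * f(z)

# Two identical alternatives: z = 0, f(0) = phi(0), so
# nu = sqrt(0.5) * 0.3989... ~ 0.282.
nu = kg_factor(mu=[0.0, 0.0], sigma=[1.0, 1.0], sigma_eps=1.0, x=0)
```

The factor shrinks as |μ_x^n − max_{x'≠x} μ_{x'}^n| grows: measuring an alternative we already believe to be far from the leader changes our estimate of the maximum very little.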
Using KG for multi-armed bandits. In the bandit problem, if we stop learning at time n+1, we will expect to collect a reward of (1/(1−γ)) max_x μ_x^{n+1}. If we decide to choose alternative x at time n, we expect to collect μ_x^n immediately, plus the discounted downstream value. The online KG decision rule is X^{KG,n}(s^n) = arg max_x μ_x^n + (γ/(1−γ)) E^n [ max_{x'} μ_{x'}^{n+1} ] = arg max_x μ_x^n + (γ/(1−γ)) E^n [ max_{x'} μ_{x'}^{n+1} − max_{x'} μ_{x'}^n ] = arg max_x μ_x^n + (γ/(1−γ)) ν_x^{KG,n}.
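The online rule is then just the immediate reward plus a discounted value of information; a minimal sketch (names ours, KG factors assumed precomputed, e.g. by the closed form above):

```python
def online_kg_choice(mu, nu_kg, gamma):
    # Trade off immediate reward mu_x against the discounted value of
    # information gamma/(1-gamma) * nu_x.
    scores = [m + gamma / (1.0 - gamma) * nu for m, nu in zip(mu, nu_kg)]
    return max(range(len(mu)), key=lambda x: scores[x])

# With a small discount factor the myopic reward dominates; for gamma
# close to 1, the value of information takes over (the asymptotic
# regime discussed later in the talk).
myopic = online_kg_choice(mu=[1.0, 0.9], nu_kg=[0.01, 0.2], gamma=0.1)
patient = online_kg_choice(mu=[1.0, 0.9], nu_kg=[0.01, 0.2], gamma=0.95)
```

In this toy instance the myopic choice is alternative 0 and the patient choice is alternative 1, illustrating how γ controls the exploration/exploitation trade-off.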
Why KG is not an index policy. The KG decision rule is X^{KG,n}(s^n) = arg max_x μ_x^n + (γ/(1−γ)) ν_x^{KG,n}, where ν_x^{KG,n} = σ̃_x^n f( −|μ_x^n − max_{x'≠x} μ_{x'}^n| / σ̃_x^n ). The KG factor of alternative x depends on max_{x'≠x} μ_{x'}^n. The calculation for alternative x uses our beliefs about all the other alternatives.
KG will converge to some alternative. We say that the KG policy converges to alternative x if it measures alternative x infinitely often. Proposition: For almost every sample path, only one alternative will be measured infinitely often by the KG policy.
KG will converge to some alternative. KG has to converge to some alternative, but not necessarily to the best one. However, even the Gittins policy, which is known to be optimal, does not converge to the best alternative. Can we find situations where KG does converge to the best alternative?
Connection to offline KG. Idea: for γ close to 1, the online KG decision rule X^{KG,n}(s^n) = arg max_x μ_x^n + (γ/(1−γ)) ν_x^{KG,n} starts to resemble the offline KG decision rule X^{Off,n}(s^n) = arg max_x ν_x^{KG,n}. We know from Frazier et al. (2008) that the offline KG rule will measure every alternative infinitely often (thus finding the best alternative) in an infinite-horizon setting.
Asymptotic behavior for large γ. Let KG(γ) denote the online KG policy with discount factor γ. Define the stopping time N_γ = min{ n ≥ 0 : X^{Off,n}(s^n) ≠ X^{KG(γ),n}(s^n) }. Lemma: As γ → 1, N_γ → ∞ almost surely.
Asymptotic behavior for large γ. The time horizon can be divided into three periods: for n < N_γ, online KG agrees with offline KG; for n ≥ N_γ, online KG no longer agrees with offline KG; and in the limit n → ∞, online KG converges.
Asymptotic behavior for large γ. Lemma: For fixed 0 < γ < 1, lim_{n→∞} E^{KG(γ)} [ max_x μ_x^n ] ≥ E^{KG(γ)} [ max_x μ_x^{N_γ} ]. Proof: It can be shown that M_n = max_x μ_x^n is a uniformly integrable submartingale. Therefore, M_n converges almost surely and in L^1, and lim_{n→∞} E M_n = E lim_{n→∞} M_n ≥ E M_{N_γ} by Doob's optional sampling theorem.
Asymptotic behavior for large γ. Lemma: lim_{γ→1} E^{KG(γ)} [ max_x μ_x^{N_γ} ] = E [ max_x μ_x ]. Proof: Because online KG agrees with offline KG on all decisions before N_γ, lim_{γ→1} E^{KG(γ)} [ max_x μ_x^{N_γ} ] = lim_{γ→1} E^{Off} [ max_x μ_x^{N_γ} ] = lim_{n→∞} E^{Off} [ max_x μ_x^n ] = E [ max_x μ_x ]. The last line follows by the asymptotic optimality of offline KG (Frazier et al. 2008).
Main asymptotic result. Theorem: lim_{γ→1} lim_{n→∞} E^{KG(γ)} [ max_x μ_x^n ] = E [ max_x μ_x ]. Proof: Combining the previous results yields lim_{γ→1} lim_{n→∞} E^{KG(γ)} [ max_x μ_x^n ] ≥ lim_{γ→1} E^{KG(γ)} [ max_x μ_x^{N_γ} ] = E [ max_x μ_x ]. The other direction can be obtained using Jensen's inequality.
Summary of asymptotic results. As γ → 1, the value of the alternative measured infinitely often by KG converges to the value of the best alternative. This is evidence that some of the attractive properties of offline KG carry over to the online problem. Rates of convergence are a subject for future work. Experimental work suggests that online KG performs well in practice.
Bandit problems with correlated arms. The classic multi-armed bandit model assumes that the alternatives are independent. However, the definition of the knowledge gradient ν_x^{KG,n} = E^n [ max_{x'} μ_{x'}^{n+1} − max_{x'} μ_{x'}^n ] does not preclude the presence of correlated beliefs.
The meaning of correlated beliefs. Correlated beliefs allow us to learn about many alternatives by making one measurement. Observe what happens when we measure alternative 5...
Mathematical model for correlated problem. At first, we believe that μ ~ N(μ^0, Σ^0). We measure alternative x and observe a reward μ̂_x^1 ~ N(μ_x, σ_ε^2). As a result, our beliefs change: μ^1 = μ^0 + ((μ̂_x^1 − μ_x^0) / (σ_ε^2 + Σ_{xx}^0)) Σ^0 e_x, Σ^1 = Σ^0 − (Σ^0 e_x e_x^T Σ^0) / (σ_ε^2 + Σ_{xx}^0). The vector e_x consists of all zeros, with a single 1 at index x. Now, it is possible for every component of μ to change after a measurement.
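This rank-one update is straightforward to code; a pure-Python sketch on a 2×2 example (function name ours; matrices as nested lists to keep it dependency-free):

```python
def correlated_update(mu, Sigma, x, muhat, var_eps):
    # Rank-one Bayesian update of a multivariate normal belief
    # after measuring alternative x with noise variance var_eps.
    n = len(mu)
    denom = var_eps + Sigma[x][x]
    col = [Sigma[i][x] for i in range(n)]  # the vector Sigma^0 e_x
    mu_new = [mu[i] + (muhat - mu[x]) / denom * col[i] for i in range(n)]
    Sigma_new = [[Sigma[i][j] - col[i] * col[j] / denom for j in range(n)]
                 for i in range(n)]
    return mu_new, Sigma_new

# Measuring alternative 0 also shifts our belief about the
# positively correlated alternative 1.
mu1, S1 = correlated_update([0.0, 0.0], [[1.0, 0.5], [0.5, 1.0]],
                            x=0, muhat=2.0, var_eps=1.0)
# mu1 == [1.0, 0.5]: the unmeasured arm moved too.
```

With independent beliefs (Σ diagonal), col has a single nonzero entry and the update collapses to the scalar precision update from the independent model.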
Correlated knowledge gradients. An index policy is inherently unable to consider correlations, because it always considers every alternative separately from the others. However, we can still use the KG rule: X^{KG,n}(s^n) = arg max_x μ_x^n + (γ/(1−γ)) ν_x^{KG,n}. The computation of ν_x^{KG,n} is more complicated than in the independent case, but Frazier et al. (2009) provides a numerical algorithm.
Correlated knowledge gradients. By introducing correlations, we are able to learn efficiently in problems with a very large number of alternatives. Example: subset selection. Suppose that a diabetes treatment consists of 5 drugs (with 20 drugs to choose from). Two treatments are correlated if they contain one or more of the same drugs. The number of subsets is C(20,5) = 15,504, but one measurement will now provide much more information than before.
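For concreteness, the size of this alternative space is easy to compute:

```python
from math import comb

# Number of distinct 5-drug treatments from a pool of 20 drugs.
n_treatments = comb(20, 5)  # 15504
```

Far too many arms to measure one by one, which is exactly why exploiting correlations between overlapping subsets matters.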
Numerical results: independent alternatives. We compare the performance of two policies by comparing the true values of the alternatives they measure in every time step. KG outperforms approximate Gittins (Chick & Gans 2009) over 75% of the time.
Numerical results: independent alternatives. We can examine how performance varies with measurement noise in several representative sample problems.
Numerical results: independent alternatives. We can also vary the discount factor; KG yields very good performance for γ close to 1.
Numerical results: correlated alternatives
Conclusions. The KG policy is a new, non-index approach to the multi-armed bandit problem. KG is substantially easier to compute than Gittins indices, and is competitive against the state of the art in Gittins approximation. KG outperforms or is competitive against other index policies such as interval estimation. KG handles correlated problems, which index policies were never designed to do.
References
Chick, S.E. & Gans, N. (2009) Economic Analysis of Simulation Selection Options. Management Science, to appear.
Frazier, P.I., Powell, W. & Dayanik, S. (2008) A knowledge-gradient policy for sequential information collection. SIAM J. on Control and Optimization 47:5, 2410-2439.
Frazier, P.I., Powell, W. & Dayanik, S. (2009) The knowledge gradient policy for correlated normal rewards. INFORMS J. on Computing, to appear.
Gittins, J.C. & Jones, D.M. (1974) A Dynamic Allocation Index for the Sequential Design of Experiments. In: Progress in Statistics, J. Gani et al., eds., 241-266.
Gupta, S. & Miescke, K. (1996) Bayesian look ahead one stage sampling allocation for selecting the best population. J. of Statistical Planning and Inference 54:229-244.
References
Kaelbling, L.P. (1993) Learning in embedded systems. MIT Press, Cambridge MA.
Lai, T.L. (1987) Adaptive treatment allocation and the multi-armed bandit problem. Annals of Statistics 15:3, 1091-1114.
Ryzhov, I.O., Powell, W. & Frazier, P.I. The knowledge gradient algorithm for a general class of online learning problems. In preparation.
Ryzhov, I.O. & Powell, W. The knowledge gradient algorithm for online subset selection. Proceedings of the 2009 IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pp. 137-144.