Bandit Problems with Side Observations

Size: px

Start display at page:

Download "Bandit Problems with Side Observations"

Dominic Andrews
5 years ago
Views:

1 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 50, NO. 5, MAY 2005 Bandit Problems with Side Observations Chih-Chun Wang, Student Member, IEEE, Sanjeev R. Kulkarni, Fellow, IEEE, and H. Vincent Poor, Fellow, IEEE arxiv:cs.it/ v 22 Jan 2005 Abstract An etension of the traditional two-armed bandit problem is considered, in which the decision maker has access to some side information before deciding which arm to pull. At each time t, before making a selection, the decision maker is able to observe a random variable X t that provides some information on the rewards to be obtained. The focus is on finding uniformly good rules that minimize the growth rate of the inferior sampling time and on quantifying how much the additional information helps. Various settings are considered and for each setting, lower bounds on the achievable inferior sampling time are developed and asymptotically optimal adaptive schemes achieving these lower bounds are constructed. Inde Terms Two-armed bandit, side information, inferior sampling time, allocation rule, asymptotic, efficient, adaptive. I. INTRODUCTION SINCE the publication of [], bandit problems have attracted much attention in various areas of statistics, control, learning, and economics e.g., see [2], [3], [4], [5], [6], [7], [8], [9], [0]. In the classical two-armed bandit problem, at each time a player selects one of two arms and receives a reward drawn from a distribution associated with the arm selected. The essence of the bandit problem is that the reward distributions are unknown, and so there is a fundamental tradeoff between gathering information about the unknown reward distributions and choosing the arm we currently think is the best. A rich set of problems arises in trying to find an optimal/reasonable balance between these conflicting objectives also referred to as learning versus control, or eploration versus eploitation. We let Yτ 2 and Yτ denote the sequences of rewards from arms and 2 in a two-armed bandit machine. In the traditional parametric setting, the underlying configurations/distributions of the arms are epressed by a pair of parameters C 0 θ, θ 2 such that Yτ and Yτ 2 are independent and identically distributed i.i.d. with distribution F θ, F θ2, where F θ is a known family of distributions parametrized by θ. The goal is to maimize the sum of the epected rewards. Results on achievable performance have been obtained for a number of variations and etensions of the basic problem defined in [9] e.g., see [], [2], [3], [4], [5], [6], [7]. In this paper, we consider an etension of the classical twoarmed bandit where we have access to side information before making our decision about which arm to pull. Suppose at Manuscript received November 5, 2002; revised November 9, This work was supported in part by the National Science Foundation under Grants ANI and ECS , the Army Research Office under contract number DAAD , the Office of Naval Research under Grant No. N , and by the New Jersey Center for Pervasive Information Technologies. C-C. Wang, S.R. Kulkarni, and H.V. Poor are with Princeton University. time t, in addition to the history of previous decisions, outcomes, and observations, we have access to a side observation X t to help us make our current decision. The etent to which this side observation can help depends on the relationship of X t to the reward distributions of Yt and Yt 2. Previous work on bandit problems with side observations includes [8], [9], [20], [2], [22]. Woodroofe [2] considered a one-armed bandit in a Bayesian setting, and constructed a simple criterion for asymptotically optimal rules. Sarkar [20] etended the side information model of [2] to the eponential family. In [9], Kulkarni considered classes of reward distributions and their effects on performance using results from learning theory. Most of the previous work with side observations is on one-armed bandit problems, which can be viewed as a special case of the two-armed setting by letting arm 2 always return zero. In contrast with this previous work, we consider various general settings of side information for a two-armed bandit problem. Our focus is on providing both lower bounds and bound-achieving algorithms for the various settings. The results and proofs are very much along the lines of [8] and subsequent works as in [], [2], [3], [4], [5]. We now describe the settings considered in this paper. Direct Information: In this case, X t provides information directly about the underlying configuration C 0 θ, θ 2, which allows a type of separation between the learning and control. This has a dramatic effect on the achievable inferior sampling time. Specifically, estimating θ, θ 2 by observing X τ, and using the estimate ˆθ, ˆθ 2 to make the decision, results in bounded epected inferior sampling time. If the distribution of X τ is not a function of C 0, we are not able to learn C 0 through X τ. However, different values of the side observation X t will result in different conditional distributions of the rewards Yt i. By eploiting this new structure observing X t in advance, we can hope to do better than the case without any side observation. A physical meaning about the above scenario constant distribution on X τ is that a two-armed bandit with the side observations drawn from a finite set, 2,, n can be viewed as a set of n different two-armed sub-bandit machines indeed from to n. The player does not know the order of sub-machines he is going to play, which is determined by rolling a die with n faces. However, by observing X t, the player knows which machine out of the n different ones he is facing now before selecting which arm to play. The connection between these sub-machines is that they share the same common configuration pair θ, θ 2, so that the rewards observed from one machine provide information on the common θ, θ 2, which can then be applied to all of the others different values of X t. This is the key aspect that makes this

2 2 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 50, NO. 5, MAY 2005 setup distinct from simply having many independent bandit problems with random access opportunity. We consider the following three cases of different relationships among the most rewarding arm, C 0, and X t. 2 For all possible C 0, the best arm is a function of X t : That is, θ, θ 2,, 2 such that at time t, arm yields higher epected reward conditioned on X t while arm 2 is preferred when X t 2. Surprisingly, we ehibit an algorithm that achieves bounded epected inferior sampling time in this case. Woodroofe s result [2] can then be viewed as a special case of this scenario. 3 For all possible C 0, the best arm is not a function of X t : In this case, for all configurations θ, θ 2, one of the arms is always preferred regardless of the value of X t. Since the conditional reward distributions are functions of X t, the intuition is that we can postpone our learning until it is most advantageous to us. We show that, asymptotically, our performance will be governed by the most informative bandit among the different values taken on by X t. 4 Mied Case: This is a general case that combines the previous two, and contains the main contribution of this paper. For some possible configurations, one arm may always be preferred for any X t, while for other possible configurations, the preferred arm is a function of X t. We ehibit an algorithm that achieves the best possible in either case. That is, if the best arm is a function of X t, it achieves bounded epected inferior sampling time as in Case 2, while if the underlying configuration is such that one arm is always preferred, then we get the results of Case 3. Our paper is organized as follows. In Section II, we introduce the general formulation. In Section III, we provide background on the asymptotic analysis of traditional bandit problems without side observations. In Sections IV through VII, we consider the above four cases respectively. The results are included in each section, while details of the proofs are provided in the appendi. II. GENERAL FORMULATION Consider the two-armed bandit problem defined as follows. Suppose we have two sequences of real-valued random variables r.v. s, Yτ i i,2, and an i.i.d. side observation sequence X τ, taking values in X R. Yτ i denotes the reward sequence of arm i while X t is the side information observed at time t before making the decision. The formal parametric setting is as follows. For each configuration pair C 0 θ, θ 2 and each i, the sequence of vectors X t, Yt i is i.i.d. with joint distribution G C0 df θi dy, where the families G C C Θ 2 and F θ θ Θ are known to the player, but the true value of the corresponding inde C 0 must be learned through eperiments. For notational simplicity, we further assumed Θ is a set of real numbers. Note that the concept of the i.i.d. bandit is now etended to the assumption that the vector sequence X t, Yt i t is i.i.d. The unconditioned marginal sequence Yτ i remains i.i.d. However, rather than the unconditional marginals, the player TABLE I GLOSSARY Not n Description G C d The marginal distribution of the i.i.d. X τ under configuration C. F θi dy The conditional distribution of the reward of arm i, Yt i, under parameter θ i. µ θ The conditional epectation of the reward, µ θ E θ Y yf θ dy. C 0, 2C 0 The first and the second coordinates of the configuration pair C 0, i.e. C 0 θ, 2C 0 θ 2. For eample: F C0 dy F θ dy and µ 2C0 µ θ2. M C The inde of the preferred arm, i.e. argma i,2 µ ic. φ t The decision rule taking values in, 2 and depending only on the past outcomes and the current side information X t. T i t The total number of samples taken on arm i up to time t, T i t t φ τi. T inf t The total number of samples taken on the inferior arm up to time t: T inf t t IP, Q φ τ M C0 X τ. The Kullback-Leibler K-L information number between distributions P and Q: IP, Q E P log dp. dq Iθ, θ 2 The conditional K-L information number: Iθ, θ 2 IF θ, F θ2. is now facing the conditional distribution of Y i t, which is a function of the observed side information X t and is not identically distributed given different X t. The goal is to find an adaptive allocation rule φ τ to maimize the growth rate of the epected reward: E C0 W φ t : E C0 φyτ + φτ2yτ 2, or equivalently to minimize the growth rate of the epected inferior sampling time, namely E C0 T inf t. To be more eplicit, at any time t, φ t takes a value in, 2 and depends only on the past rewards τ < t and the current side observation X t. We define a uniformly good rule as follows. Definition Uniformly Good Rules: An allocation rule φ τ is uniformly good if for all C Θ 2, E C T inf t ot α, α > 0. In what follows, we consider only uniformly good rules and regard other rules as uninteresting. Necessary notation and several quantities of interest are defined in TABLE I. We assume that all the given epectations eist and are finite. III. TRADITIONAL BANDITS Under the general formulation provided in Section II, the traditional non-bayesian, parametric, infinite horizon, two- In the literature of bandit problems, the term regret is more typically used rather than the inferior sampling time. For traditional two-armed bandits, the regret is defined as regret : t maµ θ, µ θ2 E C0 W φ t, the difference between the best possible reward and that of the strategy of interest φ τ. The relationship between the regret and T inf t is as follows. regret µ θ µ θ2 E C0 T inf t. For greater simplicity in the discussion of bandit problems with side observations, we consider T inf t rather than the regret.

3 WANG et al.: BANDIT PROBLEMS WITH SIDE OBSERVATIONS 3 armed bandit is simply a degenerate case, i.e., the traditional bandit problem is equivalent to having only one element in X say X 0. This formulation of traditional bandit problems is identical to the two-armed case of [4], [8], [9]. For simplicity, the argument 0 can be omitted in this traditional setting, i.e., M C : M C 0, µ θ : µ θ 0, Iθ, θ 2 : Iθ, θ 2 0, etc. The main contribution of [4], [8], [9] is the asymptotic analysis stated as the following two theorems. Theorem logt Lower Bound: For any uniformly good rule φ τ, T inf t satisfies and P C 0 inf T inf t E C0 T inf t ǫlogt, ǫ > 0,, where is a constant depending on C 0. If M C0 2, then T inf t T t and is defined 2 as follows. infiθ, θ : θ, µ θ > µ θ2. The epression for for the case in which M C0 can be obtained by symmetry. Theorem 2 Asymptotic Tightness: Under certain regularity conditions 3, the above lower bound is asymptotically tight. Formally stated, given the distribution family F θ, there eists a decision rule φ τ such that for all C 0 θ, θ 2 Θ 2, sup E C0 T inf t, where is the same as in Theorem. The intuition behind the lower bound is as follows. Suppose M C0 2 and consider another configuration C θ, θ 2 such that M C. It can be shown that if under configuration C 0 θ, θ 2, E C0 T inf t is less than the lower bound, E C T inf t must be greater than ot α for some α > 0, which contradicts the assumption that φ τ is uniformly good. A. Formulation IV. DIRECT INFORMATION In this setting, the side observation X t directly reveals information about the underlying configuration pair C 0 θ, θ 2 in the following way. Dependence: G C G C2 iff C C 2. As a result, observing the empirical distribution of X t gives us useful information about the underlying parameter pair C 0. Thus this is a type of identifiability condition. Eamples: Θ 0, 0.5 and X, 2, 3. θ k if k, 2 P θ,θ 2X t k θ θ 2 otherwise. 2 Throughout this paper, we will adopt the conventions that the infimum of the null set is, and 0. 3 If the parameter set is finite, Theorem 2 always holds. If Θ is the set of reals, the required regularity conditions are on the unboundedness and the continuity of µ θ w.r.t. θ and on the continuity of Iθ, θ w.r.t. µ θ. Θ 0, and X [0, ]. X t is beta distributed with parameters θ, θ 2. B. Scheme with Bounded E C0 T inf t Consider the following condition. Condition : For any fied C 0 inf ρg C0, G Ce : C e Θ 2,, M Ce M C0 > 0, where ρ denotes the Prohorov metric 4 on the space of distributions. Two eamples satisfying Condition are as follows: Eample : X is finite, and X, µ θ is continuous with respect to w.r.t. θ. Eample 2: F θ Nθ, is a Gaussian distribution with mean θ and variance. Under this condition, we obtain the following result. Theorem 3 Bounded E C0 T inf t: If Condition is satisfied, then there eists an allocation rule φ τ, such that E C0 T inf t < and T inf t < a.s. Note: the information directly revealed by X t helps the sequential control scheme surpass the lower bound stated in Theorem. This significant improvement bounded epected inferior sampling time is due to the fact that the dilemma between learning and control no longer eists in the direct information case. We provide a scheme achieving bounded E C0 T inf t as in Algorithm, of which a detailed analysis is given in APPENDIX II. Algorithm φ t, the decision at time t after observing X t but before deciding φ t : Construct C t : C Θ 2 : ρg C, L X t inf ρg C, L X t +, C Θ 2 t where L X t is the empirical measure of the side observations X τ until time t, and ρ is the Prohorov metric as before. 2: Arbitrarily pick Ĉt Ct, and set φt M Ĉ t X t. V. BEST ARM AS A FUNCTION OF X t For all of the following sections Sections V through VII, we consider only the case in which observing X t will not reveal any information about C 0, but only reveals information about the upcoming reward Y i t, that is, G C0 does not depend on the value of C 0 ; we use G : G C0 as shorthand notation. Three further refinements regarding the relationship between M C and will be discussed separately each in one section. A. Formulation In this section, we assume that for all possible C, the side observation X t is always able to change the preference order as shown in Fig.. That is, For all C Θ 2, there eist and 2 such that M C and M C 2 2.

4 4 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 50, NO. 5, MAY 2005 a b c Algorithm 2 φ t, the decision at time t + Variables: Denote Ti t as the total number of time instants until time t when arm i has been pulled and X τ, i.e. Ti t : Xτ,φτi, 2 3 Fig.. The best arm at time t always depends on the side observation X t. That is, for any possible pair θ, θ 2 the two curves, µ θ and µ θ2, w.r.t. always intersect each other. The needed regularity conditions are as follows. X is a finite set and P G X t > 0 for all X. 2 θ, θ 2,, Iθ, θ 2 is strictly positive and finite. 3, µ θ is continuous w.r.t. θ. The first condition embodies the idea of treating X t as the inde of several different bandit machines, which also simplifies our proof. The second condition is to ensure that all these different bandit problems are non-trivial, with nonidentical pairs of arms. Eample: Θ 0,, X,, and the conditional reward distribution F θ Nθ,. B. Scheme with Bounded E C0 T inf t Theorem 4 Bounded E C0 T inf t: If the above conditions are satisfied, there eists an allocation rule φ τ such that E C 0 T inf t <. Such a rule is obviously uniformly good. Note: although the side observation X t does not reveal any information about C 0 in this setting, the alternation of the best arm as the i.i.d. X t takes on different values makes it possible to always perform the control part, φ t MĈt X t, and simultaneously sample both arms often enough. Since the information about both arms will be implicitly revealed through the alternation of M C0 X t, the dilemma of learning and control no longer eists, and a significant improvement E C0 T inf t < is obtained over the lower bound in Theorem. We construct an allocation rule with bounded E C0 T inf t given as Algorithm 2. The intuition as to why the proposed scheme has bounded E C0 T inf t is as follows. The forced sampling, T i t < t +, ensures there are enough samples on both arms, which implies good enough estimates of C 0. Based on the good enough estimates, the myopic action of sampling the seemingly better arm, φ t+ MĈt X t+, will result in very few inferior samplings. Unlike the traditional two-armed bandits, in this scenario, the best arm M C0 varies from one outcome of X t to the other. Therefore, the 4 A definition of the Prohorov metric is stated in APPENDIX I. and define i : arg ma T i t and T i t : ma T i t. Construct C t : C θ, θ 2 Θ 2 : σc, t infσc, t : C Θ 2 + t, with σc, t : ρf C, L t, +ρf 2C 2, L 2 t, where L i t is the empirical measure of rewards sampled from arm i at those time instants τ t when X τ. As before ρp, Q is the Prohorov metric. Arbitrarily choose Ĉt Ct. Algorithm: : if t + 6 then 2: φ t+ t mod : else if i such that T i t < t + then 4: φ t+ i. 5: else 6: φ t+ MĈt X t+. 7: end if Note that Line guarantees that there is only one i such that T i t < t +. 2 a b c Fig. 2. The best arm at time t never depends on the side observation X t. That is, for any possible pair, θ, θ 2, the two curves, µ θ and µ θ2, do not intersect each other. However, in this case, we can postpone our sampling to the most informative time instants. myopic action and the even appearances of the i.i.d. X τ will eventually make both T t and T 2 t grow linearly with the elapsed time t, and the forced sampling should occur only rarely. This situation differs significantly from the traditional bandits, where the forced sampling will inevitably make the T inf t of the order of t, which is an undesired result. A detailed proof of the boundedness of E C0 T inf t for this scheme is provided in APPENDIX III. A. Formulation VI. BEST ARM IS NOT A FUNCTION OF X t Besides the assumption of constant G, in this section, we consider the case in which for all C Θ 2, M C is not a function of, and we thus can use M C : M C as shorthand notation. Fig. 2 illustrates this situation.

5 WANG et al.: BANDIT PROBLEMS WITH SIDE OBSERVATIONS 5 The needed regularity conditions are similar to those in Section V: X is a finite set and P G X t > 0 for all X. 2 θ, θ 2,, Iθ, θ 2 is strictly positive and finite. In this case, one arm is always better than the other no matter what value of X t occurs. The conflict between learning and control still eists. As epected, the growth rate of the epected inferior sampling time is again lower bounded by, but with the additional help of X t we can see improvements over the traditional bandit problems. To greatly simplify the notation, we also assume that 4 For all, the conditional epected reward µ θ is strictly increasing w.r.t. θ. This condition gives us the notational convenience that the order of µ θ, µ θ2 is simply the same as the order of θ, θ 2. Eample: Θ,, X, 2, 3, and the conditional reward distribution F θ Nθ,. B. Lower Bound Theorem 5 logt Lower Bound: Under the above assumptions, for any uniformly good rule φ τ, T inf t satisfies P ǫlogt C 0 T inf t, ǫ > 0, and inf E C0 T inf t, 2 where is a constant depending on C 0. If M C0 2, then T inf t T t. The constant can be epressed as follows. inf sup θ:θ>θ 2 X Iθ, θ. 3 The epression for for the case in which M C0 can be obtained by symmetry. Note : if the decision maker is not able to access the side observation X t, the player will then face the unconditional reward distribution F θ i dy Gd rather than F θi dy. Let Iθ, θ 2 denote the Kullback-Leibler information between the unconditional reward distributions. By the conveity of the Kullback-Leibler information, we have sup Iθ, θ Iθ, θ Gd Iθ, θ. This shows that the new constant in front of, in 3, is no larger than the corresponding constant in, and the additional side information X t generally improves the decision made in the bandit problem. As we would epect, Theorem 5 collapses to Theorem when X. Note 2: This situation is like having several related bandit machines, whose reward distributions are all determined by the common configuration pair θ, θ 2. The information obtained from one machine is also applicable to the other machines. If arm 2 is always better than arm, we wish to sample arm 2 most of the time the control part, and force sample arm once in a while the learning part. With the help of the side information X t, we can postpone our forced sampling learning to the most informative machine X t. As a result, the constant in the lower bound in Theorem has been further reduced to this new. A detailed proof of Theorem 5 is provided in APPENDIX IV. C. Scheme Achieving the Lower Bound Consider the additional conditions as follows. Θ is finite. 2 A saddle point for eists; that is, for all θ < θ 2, inf sup θ:θ>θ 2 Iθ, θ sup inf Iθ, θ. θ:θ>θ 2 With the above conditions, we construct a -lowerbound-achieving scheme φ τ, which is inspired by [2]. The following terms and quantities are necessary in the epression of φ τ. Denote Ĉt : θ α, θ β. Instead of the traditional ˆθ, ˆθ 2 representation, we use θ α, θ β. Based on this representation, we are able to derive the following useful notation: θ α β : minθ α, θ β α β : argminθ α, θ β θ α β : maθ α, θ β α β : argmaθ α, θ β. For instance, if θ α < θ β, µ θ α µ θ α β; arm α β represents arm ; Y α β t is the reward of arm 2; and T inf t T α β t T t. Choose an ǫ such that 0 < ǫ < 2 min ρf θ, F ϑ : X, θ ϑ Θ, where ρ is the Prohorov metric. The whole system is well-sampled if there eists a unique estimate Ĉ t θ α, θ β, such that the empirical measure L i t falls into the ǫ-neighborhood of F i Ĉ t, for all X and i, 2. That is Ĉt, s.t. ρ L i t, FiĈt < ǫ, X, i, 2. For any estimate Ĉt θ α, θ β, define the most informative bandit according to Ĉt as Ĉt : arg ma inf Iθ α β, θ,, θ:θ>θ α β and Λ t Ĉt, θ to be the conditional likelihood ratio between the seemingly inferior arm θ α β and the competing parameter θ: Λ t Ĉt, θ : T α β t m F θ α β dy α β τ m Ĉt, F θ dy α β τ m Ĉt where τ m denotes the time instant of the m-th pull of arm α β when the side observation X τ Ĉτ. Set a total number of X + Θ 2 + Θ 3 counters, including X counters, named ctr ; Θ 2 counters, named ctrĉ for all possible Ĉ Θ2 ; and Θ 3 counters, named ctrĉ, θ for all possible Ĉ and θ. Initially, all counters are set to zero.

6 6 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 50, NO. 5, MAY 2005 Algorithm 3 φ t+, the decision at time t + : if there eists i, 2 and X such that Ti t 0, then Cond0 2: φ t+ t + mod 2. 3: else if the whole system is not well-sampled or θ α θ β, then Cond 4: ctrx t+ ctrx t+ + and φ t+ ctrx t+ mod 2. 5: else if θ α β θ : maθ, then Cond2 6: φ t+ if it is θ α θ. Otherwise, φ t+ 2. 7: else Cond3 8: ctrĉt ctrĉt +. 9: 0: if ctrĉt is odd, then Cond3a φ t+ MĈt X t+. : else Cond3b 2: θ argminλ tĉ t, θ : θ > θ α β. 3: if X t+ Ĉt then Cond3b 4: if Λ tĉ t, θ t 2, then Cond3ba 5: ctrĉt, θ ctrĉt, θ +. 6: if k N s.t. ctrĉt, θ k 2, then Cond3ba 7: φ t+ k mod 2. 8: else Cond3ba2 9: φ t+ 3 MĈt X t+. 20: end if 2: else Cond3bb 22: φ t+ MĈt X t+. 23: end if 24: else Cond3b2 25: φ t+ MĈt X t+. 26: end if 27: end if 28: end if Theorem 6 Asymptotic Tightness: With the above conditions, the scheme described in Algorithm 3 achieves the lower bound 2, so that this φ τ is uniformly good and asymptotically optimal. A complete analysis is provided in APPENDIX V. VII. MIXED CASE The main difference between Sections V and VI is that in one case, for all possible C 0, X t always changes the preference order, while in the other, for all possible C 0, X t never changes the order. A more general case is a miture of these two. In this section, we consider this mied case, which is the main result of this paper. A. Formulation Besides the assumption of constant G, in this section, we consider the case in which for some C Θ 2, M C is not a function of. For the remaining C, there eist and 2 s.t. M C and M C 2 2. For future reference, when the configuration pair C 0 satisfies the latter case, we say the configuration pair C 0 is implicitly revealing. Fig. 3 illustrates this situation. However, without knowledge of the authentic underlying configuration C 0, we do not know whether C 0 is implicitly revealing or not. In view of the results of Sections V and VI, we would like to find a single scheme that is able to achieve bounded E C0 T inf t when being applied to an implicitly revealing C 0, and on the other hand to achieve the lower bound when being applied to those C 0 which are not implicitly revealing. The needed regularity conditions are the same as those in Sections V and VI: 2 3 a b c Fig. 3. If θ, θ 2 θ a, θ b, the best arm depends on, i.e. µ θ and µ θ2 intersect each other as in Section V. If θ, θ 2 θ b, θ c, the best arm does not depend on, i.e. µ θ and µ θ2 do not intersect each other as in Section VI. X is a finite set and P G X t > 0 for all X. 2 θ, θ 2,, Iθ, θ 2 is strictly positive and finite. To simplify the notation and the following proof, we define a partial ordering as θ ϑ iff, µ θ µ ϑ, and θ ϑ is defined similarly. Note that for a configuration C 0 θ, θ 2, it can be the case that neither θ θ 2 nor θ θ 2. Eample: Θ 0,, X, and the conditional reward distribution F θ Nθ 2 θ,. Then C 0 θ, θ 2 0., 0.2 is implicitly revealing, but C 0 0, 0 is not. B. Lower Bound Theorem 7 Lower Bound: Under the above assumptions, for any uniformly good rule φ τ, if C 0 is not implicitly revealing, T inf t satisfies and P C 0 inf T inf t E C0 T inf t ǫ, ǫ > 0,, 4 where is a constant depending on C 0. If M C0 2, T inf t T t, and the constant can be epressed as follows. inf θ: 0, s.t. sup Iθ, θ. µ θ 0>µ θ2 0 The epression for for the case in which M C0 can be obtained by symmetry. The only difference between the lower bounds 2 and 4 is that, in 4, has been changed from taking the infimum over θ > θ 2, µ θ > µ θ2 to a larger set, θ :, µ θ > µ θ2. The reason for this is that under this case, consider a θ for which there eists such that µ θ > µ θ2. If the authentic configuration is C θ, θ 2 rather than θ, θ 2, a linear order of incorrect sampling will be introduced, which violates the uniformly-good-rule assumption. As a result, a broader class of competing distributions C θ, θ 2 must be considered, i.e., we must consider a different set of configurations, over which the infimum is taken. A detailed proof is contained in APPENDIX VI.

7 WANG et al.: BANDIT PROBLEMS WITH SIDE OBSERVATIONS 7 C. Scheme Achieving the Lower Bound Consider the same two additional conditions as those in Section VI. Θ is finite. 2 A saddle point for eists; that is, for all θ, inf sup Iθ, θ θ: 0,µ θ 0>µ θ2 0 sup inf Iθ, θ. θ: 0,µ θ 0>µ θ2 0 Algorithm 4 φ t+, the decision at time t + : if there eists i, 2 and X such that Ti t 0, then Cond0 2: φ t+ t + mod 2. 3: else if the whole system is not well-sampled or θ α θ β, then Cond 4: ctrx t+ ctrx t+ + and φ t+ ctrx t+ mod 2. 5: else if there eists i, 2, such that θ,, µ θ µ i Ĉ t, then Cond2 6: φ t+ i, where i is the satisfying inde. 7: else if Ĉ t is implicitly revealing, then Cond2.5 8: φ t+ MĈt X t+. 9: else Cond3 0: ctrĉt ctrĉt +. : if ctrĉt is odd, then Cond3a 2: φ t+ MĈt X t+. 3: else Cond3b 4: θ argminλ tĉt, θ : θ, 0, s.t. µ θ 0 > µ θ α β 0. 5: if X t+ Ĉt then Cond3b 6: if Λ tĉt, θ t 2, then Cond3ba 7: ctrĉt, θ ctrĉt, θ +. 8: if k N s.t. ctrĉt, θ k 2, then Cond3ba 9: φ t+ k mod 2. 20: else Cond3ba2 2: φ t+ 3 MĈt X t+. 22: end if 23: else Cond3bb 24: φ t+ MĈt X t+. 25: end if 26: else Cond3b2 27: φ t+ MĈt X t+. 28: end if 29: end if 30: end if A proposed scheme is described in Algorithm 4, which is similar to the scheme in Section VI-C. The only differences are the insertion of Cond2.5, Lines 7 and 8; the modification of Cond2, Lines 5 and 6; and the modification of Cond3b, line 4. Notes: When the estimate Ĉt θ α, θ β is not implicitly revealing, an ordering between θ α and θ β eists. As a result, all notation regarding α β, θ α β, etc., remains valid. 2 The definition of Λ t Ĉt, θ is slightly different. For any estimate Ĉt θ α, θ β that is not implicitly revealing, we can define the most informative bandit according to Ĉ t as Ĉt 5 : arg ma inf θ: 0,µ θ 0 >µ θ α β 0 Iθα β, θ, and Λ t Ĉt, θ to be the conditional likelihood ratio between the seemingly inferior arm θ α β and the competing parameter θ. That is, Λ t Ĉt, θ : T α β t m dy α β τ m Ĉt F θ dy α β, Ĉt F θ α β τ m where τ m denotes the time instant of the m-th pull of arm α β when the side observation X τ Ĉτ. The difference between this new Λ t Ĉt, θ and the previous one in Algorithm 3 is that we have a new Ĉt defined in 5. Theorem 8 Asymptotic Tightness: With the above conditions, the scheme described in Algorithm 4 has bounded t E C0 T inf t, or achieves the lower bound 4, depending on whether the underlying configuration pair C 0 is implicitly revealing or not. A detailed analysis is given in APPENDIX VI. VIII. CONCLUSION We have shown that observing additional side information can significantly improve sequential decisions in bandit problems. If the side observation itself directly provides information about the underlying configuration, then it resolves the dilemma of forced sampling and optimal control. The epected inferior sampling time will be bounded, as shown in Section IV. If the side observation does not provide information on the underlying configuration θ, θ 2, but always affects the preference order implicitly revealing, then the myopic approach of sampling the seemingly-best arm will automatically sample both arms enough. The epected inferior sampling time is bounded, as shown in Section V. If the side observation does not affect the preference order at all, the dilemma still eists. However, by postponing our forced sampling to the most informative time instants, we can reduce the constant in the lower bound, as shown in Section VI. In Section VII, we combined the settings of Sections V and VI, and obtained a general result. When the underlying configuration C 0 is implicitly revealing such that X t will change the preference order, we obtain bounded epected inferior sampling time as in Section V. Even if C 0 is not implicitly revealing in that X t does not change the preference order, the new lower bound can be achieved as in Section VI. Our results are summarized in TABLE II. APPENDIX I SANOV S THEOREM AND THE PROHOROV METRIC For two distributions P and Q on the reals, the Prohorov metric is defined as follows. Definition 2 The Prohorov metric: For any closed set A R and ǫ > 0, define A ǫ, the ǫ-flattening of A, as A ǫ : R : inf y < ǫ. y A The Prohorov metric ρ is then defined as follows. ρp, Q : inf ǫ > 0 : PA QA ǫ + ǫ, for all closed A R..

8 8 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 50, NO. 5, MAY 2005 TABLE II SUMMARY OF THE BENEFIT OF THE SIDE OBSERVATIONS AND THE REQUIRED REGULARITY CONDITIONS. Characterization Regularity Conditions Results G C G C2 iff C C 2. As Ĉ t C 0,, MĈt M C0. φ τ s.t. C 0, t E C0 T inf t <. i Constant G C, i.e., G C : G, ii C,, 2, s.t. M C, M C 2 2 implicitly revealing. i Constant G C, i.e., G C : G, ii C, M C only depends on C, not on. i Constant G C, i.e., G C : G, ii The underlying C 0 may be implicitly revealing or not. i X is finite. ii θ θ 2,, 0 < Iθ, θ 2 <. iii, µ θ is continuous w.r.t. θ. i X is finite. ii θ θ 2,, 0 < Iθ, θ 2 <, iii, µ θ is strictly increasing w.r.t. θ. i X is finite. ii θ θ 2,, 0 < Iθ, θ 2 <. φ τ such that C 0, t E C0 T inf t <. For any uniformly good φ τ, we have E C0 T inf t t, : inf θ sup Iθ, θ. For finite Θ, φ τ, s.t. t E C0 T inf t. For any uniformly good φ τ, if C 0 is not implicitly revealing, we have inf t E C0 T t, : inf θ sup Iθ, θ. For finite Θ, φ t s.t. if C 0 is implicitly revealing, E C0 T inf t <, E C0 T 2 if C 0 is not i.r., inf t t. The Prohorov metric generates the topology corresponding to convergence in distribution. Throughout this paper, the open/closed sets on the space of distributions are thus defined accordingly. Theorem 9 Sanov s theorem: Let L X n denote the empirical measure of the real-valued i.i.d. random variables X, X 2,, X n. Suppose X i is of distribution P and consider any open set A and closed set B from the topological space of distributions, generated by the Prohorov metric. We have inf n n log P PL X n A inf IQ, P Q A sup n n log P PL X n B inf IQ, P. Q B Further discussion of the Prohorov metric and Sanov s theorem can be found in [23], [24]. APPENDIX II PROOF OF Theorem 3 Proof: For any underlying configuration pair C 0 θ, θ 2, define the error set C e as follows. C e : C Θ 2 : M C M C0. 6 X Let C e denote the closure of C e. By Condition, C 0 / C e. For any t, we can write P C0 φt M C0 X t P C0 X MĈt t M C0 X t P C0, MĈt M C0 P C0 Ĉt C e P C0 Ĉt C e. Let ǫ 3 infρg C 0, G Ce : C e C e, which is strictly positive by Condition, and consider sufficiently large t ǫ. If ρg C 0, L X t < ǫ, then by the definition of C t, ρgĉt, L X t < ǫ + ǫ 2ǫ. By the triangle inequality, ρg C0, GĈt < 3ǫ and Ĉt C e. As a result, Ĉt C e ρl X t, G C0 ǫ K t. K t is a closed set. By Sanov s theorem, the probability of K t is eponentially upper bounded w.r.t. t, and so is P C0 Ĉt C e. As a result, we have E C 0 T inf t P C0 φ τ M C0 X τ <. By the monotone convergence theorem, the epectation of T inf t is finite, which implies that T inf t is finite a.s. APPENDIX III PROOF OF Theorem 4 Similarly, we define C e as that in 6. We need the following lemma to complete the analysis. Lemma : With the regularity conditions specified in Section V, a, a 2 > 0 such that P C0 Ĉt C e a ep a 2 min T t, T2 t. Proof of Lemma : By the continuity of µ θ w.r.t. θ and the assumption of finite X, it can be shown that C 0 C c e. 5 Therefore there eists a neighborhood of C 0, C δ θ δ, θ +δ θ 2 δ, θ 2 +δ, such that C δ C c e C e C c δ. Define a strictly positive ǫ > 0 as follows. ǫ : ρ 4 inf, F Fˆθi θi : X, i, 2, ˆθ, ˆθ 2 C c δ. We would like to prove that for sufficiently large t > ǫ, Ĉt Cδ c i, ρ L i t, F θi > ǫ. 5 Cc e denotes the complement of C e.

9 WANG et al.: BANDIT PROBLEMS WITH SIDE OBSERVATIONS 9 Suppose ρ L i t, F θi ǫ for both i, 2. By the definition of σc 0, t, we have σc 0, t 2ǫ. 7 However, for those Ĉt C c δ, by the definition of ǫ, for some i, 2, we have σĉt, t ρfˆθi, L i t, F ρfˆθi θi ρf θi, L i t 3ǫ, 8 which contradicts the definition of C t since 7 and 8 imply σĉt, t > t + σc 0, t. As a result, for sufficiently large t, we have Ĉt C e Ĉt Cδ c i, ρ i,2 L ρ i t, F θi > ǫ L i t, F θi > ǫ. 9 By Sanov s theorem, the probability of each term in the union of the right-hand side of 9 is eponentially bounded w.r.t. Ti t. As a result, the probability of this finite union is bounded by a ep a 2 min T t, T2 t for some a, a 2 > 0. Analysis of the scheme: We first use induction to show that t 6, T i t t. This statement is true for t 6. Suppose T t t t. If T i t t, by the monotonicity of T i t w.r.t. t, we have T i t T i t t. If T i t < t, by the forced sampling mechanism, T i t T i t + t + t. We consider the event of the inferior sampling at time t + : φt+ M C0 X t+ φ t+ M C0 X t+, Ĉt Ce φ t+ M C0 X t+, Ĉt Cc e Ĉt C e φ t+ M C0 X t+, Ĉt Cc e A t+ B t+. 0 Since T i t t, we have min i T i t t and min i Ti t t X. By Lemma, we have P C 0 A t+ t a e a2 X, and hence t+7 P C 0 A t+ <. For P C0 B t+, we can write B t+ φ t+ MĈt X t+, Ĉt Cc e mint i t i < t +, Ĉt Cc e mint i t i t N, Ĉt Cc e i, φ a i, a τ 0, t], T i τ 0 t B t+ B2 t+, where τ 0 t 2 + and B t+, B 2 t+ correspond to i, 2, respectively. The first equality comes from the fact that since Ĉt C c e, M C 0 X t+ MĈt X t+. The first subset sign comes from the fact that φ t+ MĈt X t+ implies the decision rule φ t+ is in the stage of forced sampling. The second equality follows by combining both the inequalities: mint i t i t and mint i t i < t + and the fact that both t and T i t are integers. The reasoning behind the second subset inequality is as follows. By again using the fact that T i t t and substituting τ 0 for t, we have t < T i τ 0 and thus have T i τ 0 t T i t, which guarantees that arm i has not been sampled from time τ 0 + to t. By the symmetry between Bt+ and B2 t+, we can consider only Bt+ for eample. We have P C0 B t+ P C0 X MĈa a 2, a τ 0, t] P C0 MĈa X a 2 MĈb X b 2, b τ 0, a a τ 0,t] t τ0 minp G X t. 2 The first inequality comes from the definition of φ τ which implies that if T τ 0 T t t, the forced sampling mechanism is not active during the time interval τ 0, t]. So φ a 2 implies MĈa X a 2, a τ 0, t]. The second inequality comes from the assumption of i.i.d. X τ, which implies that X a is independent of Ĉb and X b for all b < a. Since at least one will make MĈa X a, each term in the product is then upper bounded by min P G X t. It is worth noting that by the regularity assumption on G, min P G X t is strictly less than. Then from, 2, and the union bound, we obtain P C0 B t+ P C0 Bt+ + P C0 Bt+ 2 2a t t 2 + for some a <. Hence t+7 P C 0 B t+ <. From 0, we conclude that E C 0 T inf t 6 + P C0 A τ+ + P C0 B τ+ <, τ+7 which completes the proof. APPENDIX IV PROOF OF Theorem 5 Proof: The proof is inspired by [4]. Without loss of generality, we assume M C0 2, which immediately implies T inf t T t. Fi a θ with µ θ > µ θ2, and define C θ, θ 2. Let λn denote the log likelihood ratio between θ and θ based on the first n observed rewards of arm. That is n Fθ dyτm λn : log X τm F θ dyτm X, τm m where τm is a random variable corresponding to the time inde of the m-th pull of arm. By conditioning on the sequence X τm, λn is a sum of independent r.v. s. Let K C : sup X Iθ, θ, and suppose there eists δ > 0 such that sup n λn n > K C + δ,

10 0 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 50, NO. 5, MAY 2005 with positive probability. Then with positive probability, there eists an 0 such that the average of the subsequence for which X τm 0, will be larger than K C + δ. This, however, contradicts the strong law of large numbers since the subsequence is i.i.d. and with marginal epectation Iθ, θ 0. Thus we obtain λn sup n n K C, P C 0 a.s. 3 The inequality 3 is equivalent to the statement that with probability one, there are finitely many n such that λn > nk C + δ for some δ > 0. And since K C > 0, this in turn implies there are at most finitely manly n such that ma mn λm > nk C + δ. As a result, we have and sup n ma mn λm n K C, P C0 a.s., P C 0 m n, λm + δnk C 0. 4 n Henceforth, we proceed using contradiction. Suppose sup P C0 T t < > δK C Using A and A 2 as shorthand to denote events A : +δ T t < +2δK and A C 2 : λt t +2δ, and by 4, we have sup P C0 A A 2 > 0. 5 The quantity E C T inf t can be rewritten as follows. Tinf t E C a E C T 2 t b E C t T t c t d e t t f O t + 2δK C + 2δK C + 2δK C δ +2δ P C A P C A A 2 +δ e +2δ P C0 A A 2. 6 The equality marked a follows from M C and b follows from the fact that T t+t 2 t t. c and d follow from elementary probability inequalities. e follows from the change-of-measure formula and the definition of A 2 in which +δ λt t +2δ. f follows from simple arithmetic and Eq. 5. The inequality 6 contradicts the assumption that φ τ is uniformly good for both C 0 θ, θ 2 and C θ, θ 2, and thus we have P ǫ C 0 T t 0, ǫ > 0. K C By choosing the θ in C θ, θ 2 with the minimizing configuration inf θ>θ2 sup Iθ, θ, we complete the proof of the first statement of Theorem 5. The second statement in Theorem 5 can be obtained by simply applying Markov s inequality and the first statement. APPENDIX V PROOF OF Theorem 6 We prove Theorem 6 by decomposing the inferior sampling time instants into disjoint subsequences, each of which will be discussed in separate lemmas respectively. For simplicity, throughout this proof, we use Condt as shorthand for Cond is satisfied at time t 6, and use δ-nbdg to denote the δ-neighborhood of the distribution G on the L space of distributions. Suppose M C0 2. To prove that for the φ τ in Algorithm 3, sup E C0 T t inf θ>θ2 ma Iθ, we,θ first note the following: T t 7 φ τ φ τ, Cond0τ + φ τ, Condτ + φ τ, Cond2τ + φ τ, Ĉτ θα, θ β, θ α < θ β θ 2, Cond3τ + φ τ, Ĉτ θα, θ β, θ θ α > θ β, Cond3τ + φ τ, Ĉτ θα, θ β, θ θ α < θ β θ 2, Cond3τ + φ τ, Ĉτ θα, θ β, θ θ α > θ β, Cond3τ + φ τ, Ĉτ θα, θ β C 0 θ, θ 2, Cond3τ. These eight terms of the right-hand side of 7 will be treated separately in Lemmas 2 through 8. Lemma 2: Suppose M C0 2, i.e., θ < θ 2. 7 Then C 0 Θ 2, E C0 φ τ, Cond0τ E C0 Cond0τ <. Proof: Let T0 : Cond0τ. By the monotone convergence theorem, it is equivalent to prove that E C0 T0 < for all C 0. By the definition of Cond0, we have P C0 T0 t XP G X t, τ < t and τ t mod 2, X τ X P G X t P G X [ t 2 ] [ t 2 minp G X t ]. By directly computing the epectation, we obtain E C0 T0 <. 6 At time t means after observing X t but before the final decision φ t is made. It is basically the moment when we are performing the φ t-deciding algorithm. 7 There is no need to consider the case θ θ 2, since in that case, all allocation rules are optimal.

11 WANG et al.: BANDIT PROBLEMS WITH SIDE OBSERVATIONS Lemma 3: Suppose M C0 2, i.e., θ < θ 2. Then E C 0 φ τ, Condτ E C0 Condτ <. Proof: We define L X t Cond as the empirical distribution of X τ at those time instants τ t for which Cond is satisfied. We then have Condτ Condτ, L X τ Cond δ-nbdg + Condτ, L X τ Cond / δ-nbdg. 8 By Sanov s theorem on finite alphabets see [24], each term in the second sum is eponentially upper bounded w.r.t. τ, which implies the bounded epectation of the second sum. For the first sum, we have Condτ, L X τ Cond δ-nbdg i,, s.t. L i τ / ǫ-nbdf θ i, and L X τ Cond δ-nbdg 2 L i τ / ǫ-nbdf θ i, i L X τ Cond δ-nbdg 9 2 [ τ ] P G X δ n, 2 i τ s.t. ρl i n, F θ i > ǫ. 20 The first inequality comes from etending the finite sum to the infinite sum and the definition of Cond. The second inequality comes from the union bound. The third inequality comes from the following three steps. First we change the summation inde from the time variable τ to τ, which specifies that it is the τ -th time that the condition in 9 is satisfied. Note: by definition, τ τ. Second, by L X τ Cond δ-nbdg, there must be at least τ P G X δ time instants that X s, s τ, which guarantees we have enough access to the bandit machine. And finally, by the definition of Cond in Algorithm 3, at the τ -th[ time of satisfaction, ] the sample τ size n must be greater than P C0 X δ 2. By slightly abusing the notation L i t with L i n, where n represents the sample size Ti t rather than the current time t, we obtain the third inequality. Remark: this change-of-inde transformation will be used etensively throughout the proofs in this section. By Sanov s theorem on R Theorem 9, the probability of each term in 20 is eponentially upper bounded w.r.t. τ, which implies that the summation has bounded epectation. By 8, the proof of Lemma 3 is then complete. Lemma 4: Suppose M C0 2, i.e., θ < θ 2. Then E C 0 φ τ, Cond2τ <. Proof: By the assumption θ < θ 2, we have φ τ, Cond2τ Ĉτ θ α, θ β, θ α θ Ĉτ θ α, θ β, θ α θ, L X τ θ α θ δ-nbdg + Ĉτ θ α, θ β, θ α θ, L X τ θ α θ / δ-nbdg By Sanov s theorem on finite alphabets, each term in the second sum is eponentially upper bounded w.r.t. τ, which implies the bounded epectation of the second sum. For the first sum, we have Ĉτ θ α, θ β, θ α θ, L X τ θ α θ δ-nbdg Ĉτ θ α, θ β, θ α θ,, s.t. ρl τ, F θ > ǫ,l X τ θ α θ δ-nbdg n [ τ P G X δ ], s.t. τ ρl n, F θ > ǫ. By etending the finite sum to the infinite sum, we obtain the first inequality. By the definition of Cond2 in Algorithm 3 and using eactly the same reasoning used in going from 9 to 20, we obtain the second inequality. By Sanov s theorem, each term in the above sum is eponentially upper bounded w.r.t. τ. Thus it follows that the epectation of the first sum is also finite, which completes the proof. Lemma 5: Suppose M C0 2, i.e., θ < θ 2. Then E C 0 E C 0 <. Proof: We have φ τ, Ĉτ θ α, θ β, θ α < θ β θ 2, Cond3τ Ĉτ θ α, θ β, θ α < θ β θ 2, Cond3τ Ĉτ θ α, θ β, θ α < θ β θ 2, Cond3τ Ĉτ θ, ϑ, Cond3τ θ,ϑ:θ<ϑ θ 2 2 Ĉτ θ, ϑ, Cond3aτ θ,ϑ:θ<ϑ θ 2 2 X τ, Ĉτ θ, ϑ, Cond3aτ θ,ϑ:θ<ϑ θ 2 2 ρl 2 τ, F θ 2 > ǫ, Cond3aτ θ,ϑ:θ<ϑ θ 2 2 n [ τ ], s.t. θ,ϑ:θ<ϑ θ 2 τ ρl 2 n, F θ 2 > ǫ.

12 2 IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 50, NO. 5, MAY 2005 The first equality follows from conditioning on the event that the eact value of the estimate Ĉτ is some configuration pair θ, ϑ. The first inequality follows from the definition of Cond3a in Algorithm 3, where double the number of time instants with odd ctrĉ will be larger than the total number of times that Cond3 is satisfied. The second equality follows from conditioning on the value of X τ. The second inequality follows from the condition that the second coordinate of the estimate, ϑ θ 2, and then etending the finite sum to the infinite sum. The third inequality follows from the definition of Cond3a and changing the time inde to τ, similarly to the reasoning in By Sanov s theorem, each term is eponentially upper bounded w.r.t. τ, and thus the entire sum has bounded epectation. The proof is thus complete. Corollary : By the symmetry of φ τ, we have E C0 <. φ τ, Ĉτ θα, θ β, θ θ α > θ β, Cond3τ Lemma 6: Suppose M C0 2, i.e., θ < θ 2. Then E C 0 φ τ, Ĉτ θ α, θ β, θ θ α < θ β θ 2, Cond3τ <. Proof: We have φ τ, Ĉτ θ α, θ β, θ θ α < θ β θ 2, Cond3τ θ:θ θ<θ 2 θ:θ θ<θ 2 φ τ, Ĉτ θ, θ 2, Cond3τ φ τ, Ĉτ θ, θ 2, Cond3baτ + φ τ, Ĉτ θ, θ 2, Cond3ba2τ θ θ<ϑθ 2 θ :θ >θ 2 φ τ, Ĉτ θ, θ 2, θ θ, Cond3baτ + φ τ, Ĉ τ θ, θ 2, θ θ, Cond3ba2τ2 θ:θ θ<θ 2 θ :θ >θ 2 τ 4τ + 2 2τ 2 n [τ ], s.t. ρl θ n, F θ θ > ǫ. 22 The second equality comes from the fact that the scheme samples the inferior arm only when either Cond3ba or Cond3ba2 is satisfied. For the first inequality, we condition on θ and etend to the infinite sum. For the last inequality, we change the time inde to τ, which specifies the τ -th satisfaction of Cond3ba, so that we can upper bound the first sum of 2. The reason we have a multiplication factor 4τ + 2 2τ 2 in front of the indicator function is in order to upper bound the second sum of 2, concerning Cond3ba2, simultaneously. To obtain this result, we note that between the consecutive times τ and τ +, at which Cond3ba is satisfied and arm is pulled, the number of times that Cond3ba2 is satisfied and arm is pulled cannot eceed 2τ + 2 2τ 2, which is because of the algorithm involving ctrĉt, θ in Line 6. Multiplying the factor 4τ + 2 2τ 2, we simultaneously bound these two sums. By Sanov s theorem, the epectation of the indicator in 22 is eponentially upper bounded w.r.t. τ. As a result, the entire sum will have bounded epectation, which in turn completes the proof. Lemma 7: Suppose M C0 2, i.e., θ < θ 2. Then E C 0 φ τ, Ĉτ θ α, θ β, θ θ α > θ β, Cond3τ <. Proof: We have, φ τ, Ĉτ θ α, θ β, θ θ α > θ β, Cond3τ φ τ, Ĉτ θ, ϑ, Cond3τ ϑ:θ >ϑ Ĉτ θ, ϑ, Cond3τ ϑ:θ >ϑ 2 Ĉτ θ, ϑ, Cond3bτ + ϑ:θ >ϑ #ϑ Θ : ϑ < θ + 2 θ :θ >θ #ϑ Θ : ϑ < θ + 2 ϑ:θ >ϑ Ĉτ θ, ϑ, θ θ, Cond3bτ ϑ:θ >ϑ θ :θ >θ Ĉτ θ, ϑ, θ θ, Cond3bτ, L X τ Cond3b δ-nbdg + Ĉτ θ, ϑ, θ θ, Cond3bτ, L X τ Cond3b / δ-nbdg. 23 The first inequality follows from Line in Algorithm 3, where Cond3b is satisfied once after two times of Cond3 satisfaction. The last two equalities come from conditioning on θ and L X τ Cond3b. By Sanov s theorem on finite alphabets, the terms of the second sum in 23 are eponentially upper bounded and the entire sum thus has bounded epectation. For the first sum, we have Ĉτ θ, ϑ, θ θ, Cond3bτ, L X τ Cond3b δ-nbdg P G X θ, ϑ δ Ĉτ θ, ϑ, θ θ, Cond3bτ. 24

13 WANG et al.: BANDIT PROBLEMS WITH SIDE OBSERVATIONS 3 This inequality follows from the fact that once L X falls into the δ-nbdg, the total number of time instants can be upper bounded by the number of instants when X τ θ, ϑ, over P G X θ, ϑ δ. To show E C 0 Ĉτ θ, ϑ, θ θ, Cond3bτ we further decompose the epectand into Ĉτ θ, ϑ, θ θ, Cond3bτ Ĉτ θ, ϑ, θ θ, Cond3baτ <, + Ĉ τ θ, ϑ, θ θ, Cond3bbτ. 25 For the first sum in 25, under the assumption θ > ϑ, we can write Ĉτ θ, ϑ, θ θ, Cond3baτ Ĉτ θ, ϑ, θ θ, Cond3baτ + Ĉτ θ, ϑ, θ θ, Cond3ba2τ + 2 Ĉτ θ, ϑ, θ θ, Cond3ba2τ + 2 ρ n [τ ], s.t. τ ρl θ,ϑ 2 n, F θ2 θ, ϑ > ǫ. 26 The first inequality comes from conditioning on the subconditions Cond3ba and Cond3ba2, and etending to the infinite sums. Let SQ n denote the set of perfectly squared integers in,, n. The second inequality is from the definition of Cond3ba in Algorithm 3 and the fact that n N, SQ n is no larger than +,, n\sq n. The third inequality comes from the fact that by definition, under Cond3ba2, φ τ 2, and changing the time inde to τ, the number of satisfaction times. By Sanov s theorem on R, the above has bounded epectation. For the second sum of 25, with the condition θ > ϑ Ĉτ θ, ϑ, θ θ, Cond3bbτ t + Ĉτ θ, ϑ, θ θ, Λ τĉτ, θ > τlog τ 2 t + Ĉτ θ, ϑ, ΛτĈτ, θ2 > τlog τ2 t T θ,ϑ 2 + m τ t + n τ, s.t. F ϑ dy 2 τ m θ, ϑ F θ2 dy 2 τ m θ, ϑ > τlog τ2, n F ϑ dy 2 m θ, ϑ F m θ2 dy 2 m θ, ϑ > τlog τ2, where Ym 2 is the reward of arm 2 at the m-th time that X s θ, ϑ and φ s 2. The first inequality follows from focusing only on the Λ τ Ĉτ, θ condition in Cond3bb and then shifting the time inde τ. The second inequality follows by replacing the minimum achieving θ with θ 2. The third inequality follows from epressing Λ τ using its definition. The fourth inequality follows from the set relationship, where n is T θ,ϑ 2 τ, the number of time instants that the side information X s θ, ϑ and φ s 2, for s τ. We first note that n m F ϑ dy 2 m θ,ϑ F θ2 dy 2 m θ,ϑ is a positive martingale with epectation, when being considered under distribution F θ2 θ, ϑ. By Doob s maimal inequality, we have n F ϑ dy P C0 n 2 m τ, θ, ϑ F m θ2 dym 2 > τlog τ2 θ, ϑ τlog τ 2, and thus the epectation is bounded, i.e., E C0 Ĉ τ θ, ϑ, θ θ, Cond3bbτ + <. 27 τlog τ 2 By 23, 24, 25, 26, and 27, Lemma 7 is proved. Lemma 8: Suppose M C0 2, i.e., θ < θ 2. Then sup E C0 φ τ, Ĉτ θ α, θ β θ, θ 2, Cond3τ inf θ>θ2 ma Iθ, θ. Proof: By the definition of φ τ, especially of Cond3ba, we have E C0 φ τ, Ĉτ θα, θ β C 0 θ, θ 2, Cond3τ E C0 Ĉτ C0, Cond3baτ E C0 sup n t : n F θ dy m min C 0 θ>θ 2 F m θ dy m C tlog 0 t2 E C0 sup n < : n F θ dy m min C 0 θ>θ 2 F m θ dy m C tlog 0 t2 E C0 ma θ>θ 2 sup n < : n m n E C0 ma inf θ>θ 2 n s s m F θ dy m C 0 F θ dy m C 0 t2 F θ dy m C 0 F θ dy m C 0 t2 where Ym denotes the reward of the m-th time that arm of the sub-bandit machine X τ C 0 is pulled. The first inequality follows because, by definition, only when Cond3ba is satisfied can φ τ, given Ĉτ C 0. The second inequality is obtained by focusing on the subcondition Λ τ Ĉτ, θ in Cond3ba, and letting n T t be the number of time instants when arm is pulled and X τ C 0. The third inequality comes from etending the upper bound of n from t to. The equalities come from rearranging the ma and min operators and elementary implications. By applying Lemma 4.3 of [2], quoted as,

Metric Spaces and Topology

Chapter 2 Metric Spaces and Topology From an engineering perspective, the most important way to construct a topology on a set is to define the topology in terms of a metric on the set. This approach underlies