Bandit Problems with Side Observations

IEEE TRANSACTIONS ON AUTOMATIC CONTROL, VOL. 50, NO. 5, MAY 2005

Chih-Chun Wang, Student Member, IEEE, Sanjeev R. Kulkarni, Fellow, IEEE, and H. Vincent Poor, Fellow, IEEE

Abstract: An extension of the traditional two-armed bandit problem is considered, in which the decision maker has access to some side information before deciding which arm to pull. At each time $t$, before making a selection, the decision maker is able to observe a random variable $X_t$ that provides some information on the rewards to be obtained. The focus is on finding uniformly good rules that minimize the growth rate of the inferior sampling time and on quantifying how much the additional information helps. Various settings are considered and, for each setting, lower bounds on the achievable inferior sampling time are developed and asymptotically optimal adaptive schemes achieving these lower bounds are constructed.

Index Terms: Two-armed bandit, side information, inferior sampling time, allocation rule, asymptotically efficient, adaptive.

I. INTRODUCTION

Since the publication of [1], bandit problems have attracted much attention in various areas of statistics, control, learning, and economics (e.g., see [2], [3], [4], [5], [6], [7], [8], [9], [10]). In the classical two-armed bandit problem, at each time a player selects one of two arms and receives a reward drawn from a distribution associated with the arm selected. The essence of the bandit problem is that the reward distributions are unknown, and so there is a fundamental tradeoff between gathering information about the unknown reward distributions and choosing the arm we currently think is the best. A rich set of problems arises in trying to find an optimal/reasonable balance between these conflicting objectives (also referred to as learning versus control, or exploration versus exploitation).

We let $\{Y^1_\tau\}$ and $\{Y^2_\tau\}$ denote the sequences of rewards from arms 1 and 2 in a two-armed bandit machine. In the traditional parametric setting, the underlying configurations/distributions of the arms are expressed by a pair of parameters $C_0 = (\theta_1, \theta_2)$ such that $\{Y^1_\tau\}$ and $\{Y^2_\tau\}$ are independent and identically distributed (i.i.d.) with distributions $F_{\theta_1}$ and $F_{\theta_2}$, where $\{F_\theta\}$ is a known family of distributions parametrized by $\theta$. The goal is to maximize the sum of the expected rewards. Results on achievable performance have been obtained for a number of variations and extensions of the basic problem defined in [9] (e.g., see [11], [12], [13], [14], [15], [16], [17]).

In this paper, we consider an extension of the classical two-armed bandit where we have access to side information before making our decision about which arm to pull. Suppose that at time $t$, in addition to the history of previous decisions, outcomes, and observations, we have access to a side observation $X_t$ to help us make our current decision. The extent to which this side observation can help depends on the relationship of $X_t$ to the reward distributions of $Y^1_t$ and $Y^2_t$. Previous work on bandit problems with side observations includes [18], [19], [20], [21], [22].

(Manuscript received November 5, 2002; revised November 9. This work was supported in part by the National Science Foundation under Grants ANI and ECS, the Army Research Office under contract number DAAD, the Office of Naval Research under Grant No. N, and by the New Jersey Center for Pervasive Information Technologies. C.-C. Wang, S. R. Kulkarni, and H. V. Poor are with Princeton University.)
Woodroofe [21] considered a one-armed bandit in a Bayesian setting, and constructed a simple criterion for asymptotically optimal rules. Sarkar [20] extended the side information model of [21] to the exponential family. In [19], Kulkarni considered classes of reward distributions and their effects on performance using results from learning theory. Most of the previous work with side observations is on one-armed bandit problems, which can be viewed as a special case of the two-armed setting by letting arm 2 always return zero. In contrast with this previous work, we consider various general settings of side information for a two-armed bandit problem. Our focus is on providing both lower bounds and bound-achieving algorithms for the various settings. The results and proofs are very much along the lines of [8] and subsequent works as in [11], [12], [13], [14], [15].

We now describe the settings considered in this paper.

1) Direct Information: In this case, $X_t$ provides information directly about the underlying configuration $C_0 = (\theta_1, \theta_2)$, which allows a type of separation between learning and control. This has a dramatic effect on the achievable inferior sampling time. Specifically, estimating $(\theta_1, \theta_2)$ by observing $\{X_\tau\}$, and using the estimate $(\hat\theta_1, \hat\theta_2)$ to make the decision, results in bounded expected inferior sampling time.

If the distribution of $X_\tau$ is not a function of $C_0$, we are not able to learn $C_0$ through $X_\tau$. However, different values of the side observation $X_t$ will result in different conditional distributions of the rewards $Y^i_t$. By exploiting this new structure (observing $X_t$ in advance), we can hope to do better than the case without any side observation. A physical interpretation of this scenario (constant distribution of $X_\tau$) is that a two-armed bandit with side observations drawn from a finite set $\{1, 2, \dots, n\}$ can be viewed as a set of $n$ different two-armed sub-bandit machines, indexed from 1 to $n$. The player does not know in advance the order of the sub-machines he is going to play, which is determined by rolling a die with $n$ faces. However, by observing $X_t$, the player knows which machine out of the $n$ different ones he is facing before selecting which arm to play. The connection between these sub-machines is that they share the same common configuration pair $(\theta_1, \theta_2)$, so that the rewards observed from one machine provide information on the common $(\theta_1, \theta_2)$, which can then be applied to all of the others (different values of $X_t$). This is the key aspect that makes this setup distinct from simply having many independent bandit problems with random access opportunity. We consider the following three cases of different relationships among the most rewarding arm, $C_0$, and $X_t$.

2) For all possible $C_0$, the best arm is a function of $X_t$: That is, for every $(\theta_1, \theta_2)$ there exist $x_1, x_2$ such that at time $t$, arm 1 yields the higher expected reward conditioned on $X_t = x_1$ while arm 2 is preferred when $X_t = x_2$. Surprisingly, we exhibit an algorithm that achieves bounded expected inferior sampling time in this case. Woodroofe's result [21] can then be viewed as a special case of this scenario.

3) For all possible $C_0$, the best arm is not a function of $X_t$: In this case, for all configurations $(\theta_1, \theta_2)$, one of the arms is always preferred regardless of the value of $X_t$. Since the conditional reward distributions are functions of $X_t$, the intuition is that we can postpone our learning until it is most advantageous to us. We show that, asymptotically, our performance will be governed by the most informative bandit among the different values taken on by $X_t$.

4) Mixed Case: This is a general case that combines the previous two, and contains the main contribution of this paper. For some possible configurations, one arm may always be preferred for any $X_t$, while for other possible configurations, the preferred arm is a function of $X_t$. We exhibit an algorithm that achieves the best possible performance in either case. That is, if the best arm is a function of $X_t$, it achieves bounded expected inferior sampling time as in Case 2, while if the underlying configuration is such that one arm is always preferred, then we get the results of Case 3.

Our paper is organized as follows. In Section II, we introduce the general formulation. In Section III, we provide background on the asymptotic analysis of traditional bandit problems without side observations. In Sections IV through VII, we consider the above four cases, respectively. The results are included in each section, while details of the proofs are provided in the appendix.

II. GENERAL FORMULATION

Consider the two-armed bandit problem defined as follows. Suppose we have two sequences of real-valued random variables (r.v.'s) $\{Y^i_\tau\}$, $i = 1, 2$, and an i.i.d. side observation sequence $\{X_\tau\}$ taking values in $\mathcal{X} \subseteq \mathbb{R}$. $\{Y^i_\tau\}$ denotes the reward sequence of arm $i$, while $X_t$ is the side information observed at time $t$ before making the decision.

The formal parametric setting is as follows. For each configuration pair $C_0 = (\theta_1, \theta_2)$ and each $i$, the sequence of vectors $\{(X_t, Y^i_t)\}$ is i.i.d. with joint distribution $G_{C_0}(dx) F_{\theta_i}(dy|x)$, where the families $\{G_C\}_{C \in \Theta^2}$ and $\{F_\theta(\cdot|x)\}_{\theta \in \Theta}$ are known to the player, but the true value of the corresponding index $C_0$ must be learned through experiments. For notational simplicity, we further assume that $\Theta$ is a set of real numbers. Note that the concept of the i.i.d. bandit is now extended to the assumption that the vector sequence $\{(X_t, Y^i_t)\}_t$ is i.i.d. The unconditioned marginal sequence $\{Y^i_\tau\}$ remains i.i.d. However, rather than the unconditional marginals, the player now faces the conditional distribution of $Y^i_t$, which is a function of the observed side information $X_t$ and is not identically distributed across different values of $X_t$.

TABLE I
GLOSSARY

$G_C(dx)$: The marginal distribution of the i.i.d. $\{X_\tau\}$ under configuration $C$.
$F_{\theta_i}(dy|x)$: The conditional distribution of the reward of arm $i$, $Y^i_t$, under parameter $\theta_i$.
$\mu_\theta(x)$: The conditional expectation of the reward, $\mu_\theta(x) = E_\theta(Y|x) = \int y\, F_\theta(dy|x)$.
$1(C_0), 2(C_0)$: The first and the second coordinates of the configuration pair $C_0$, i.e., $1(C_0) = \theta_1$, $2(C_0) = \theta_2$. For example, $F_{1(C_0)}(dy|x) = F_{\theta_1}(dy|x)$ and $\mu_{2(C_0)} = \mu_{\theta_2}$.
$M_C(x)$: The index of the preferred arm, i.e., $\arg\max_{i=1,2} \mu_{i(C)}(x)$.
$\phi_t$: The decision rule, taking values in $\{1, 2\}$ and depending only on the past outcomes and the current side information $X_t$.
$T_i(t)$: The total number of samples taken on arm $i$ up to time $t$: $T_i(t) = \sum_{\tau=1}^t 1_{\{\phi_\tau = i\}}$.
$T_{\text{inf}}(t)$: The total number of samples taken on the inferior arm up to time $t$: $T_{\text{inf}}(t) = \sum_{\tau=1}^t 1_{\{\phi_\tau \ne M_{C_0}(X_\tau)\}}$.
$I(P, Q)$: The Kullback-Leibler (K-L) information number between distributions $P$ and $Q$: $I(P, Q) = E_P \log \frac{dP}{dQ}$.
$I_x(\theta_1, \theta_2)$: The conditional K-L information number: $I_x(\theta_1, \theta_2) = I(F_{\theta_1}(\cdot|x), F_{\theta_2}(\cdot|x))$.

The goal is to find an adaptive allocation rule $\{\phi_\tau\}$ that maximizes the growth rate of the expected reward,
$$E_{C_0} W_\phi(t) := E_{C_0} \sum_{\tau=1}^t \left(1_{\{\phi_\tau = 1\}} Y^1_\tau + 1_{\{\phi_\tau = 2\}} Y^2_\tau\right),$$
or equivalently minimizes the growth rate of the expected inferior sampling time, namely $E_{C_0} T_{\text{inf}}(t)$.[Footnote 1] To be more explicit, at any time $t$, $\phi_t$ takes a value in $\{1, 2\}$ and depends only on the past rewards ($\tau < t$) and the current side observation $X_t$. We define a uniformly good rule as follows.

Definition 1 (Uniformly Good Rules): An allocation rule $\{\phi_\tau\}$ is uniformly good if for all $C \in \Theta^2$, $E_C T_{\text{inf}}(t) = o(t^\alpha)$ for all $\alpha > 0$.

In what follows, we consider only uniformly good rules and regard other rules as uninteresting. Necessary notation and several quantities of interest are defined in TABLE I. We assume that all the given expectations exist and are finite.

[Footnote 1] In the bandit literature, the term "regret" is more typically used rather than the inferior sampling time. For traditional two-armed bandits, the regret is defined as $\mathrm{regret}(t) := t \max(\mu_{\theta_1}, \mu_{\theta_2}) - E_{C_0} W_\phi(t)$, the difference between the best possible reward and that of the strategy of interest $\{\phi_\tau\}$. The relationship between the regret and $T_{\text{inf}}(t)$ is $\mathrm{regret}(t) = |\mu_{\theta_1} - \mu_{\theta_2}|\, E_{C_0} T_{\text{inf}}(t)$. For greater simplicity in the discussion of bandit problems with side observations, we consider $T_{\text{inf}}(t)$ rather than the regret.

III. TRADITIONAL BANDITS

Under the general formulation provided in Section II, the traditional (non-Bayesian, parametric, infinite-horizon) two-armed bandit is simply a degenerate case; i.e., the traditional bandit problem is equivalent to having only one element in $\mathcal{X}$, say $\mathcal{X} = \{0\}$. This formulation of traditional bandit problems is identical to the two-armed case of [4], [8], [9]. For simplicity, the argument 0 can be omitted in this traditional setting, i.e., $M_C := M_C(0)$, $\mu_\theta := \mu_\theta(0)$, $I(\theta_1, \theta_2) := I_0(\theta_1, \theta_2)$, etc. The main contribution of [4], [8], [9] is the asymptotic analysis stated as the following two theorems.

Theorem 1 ($\log t$ Lower Bound): For any uniformly good rule $\{\phi_\tau\}$, $T_{\text{inf}}(t)$ satisfies
$$\lim_{t\to\infty} P_{C_0}\!\left( T_{\text{inf}}(t) \ge \frac{(1-\epsilon)\log t}{D_{C_0}} \right) = 1, \quad \forall \epsilon > 0, \qquad \text{and} \qquad \liminf_{t\to\infty} \frac{E_{C_0} T_{\text{inf}}(t)}{\log t} \ge \frac{1}{D_{C_0}}, \tag{1}$$
where $D_{C_0}$ is a constant depending on $C_0$. If $M_{C_0} = 2$, then $T_{\text{inf}}(t) = T_1(t)$ and $D_{C_0}$ is defined[Footnote 2] as
$$D_{C_0} := \inf\{I(\theta_1, \theta) : \theta \in \Theta,\ \mu_\theta > \mu_{\theta_2}\}.$$
The expression for $D_{C_0}$ for the case in which $M_{C_0} = 1$ can be obtained by symmetry.

Theorem 2 (Asymptotic Tightness): Under certain regularity conditions,[Footnote 3] the above lower bound is asymptotically tight. Formally stated, given the distribution family $\{F_\theta\}$, there exists a decision rule $\{\phi_\tau\}$ such that for all $C_0 = (\theta_1, \theta_2) \in \Theta^2$,
$$\limsup_{t\to\infty} \frac{E_{C_0} T_{\text{inf}}(t)}{\log t} \le \frac{1}{D_{C_0}},$$
where $D_{C_0}$ is the same as in Theorem 1.

The intuition behind the lower bound is as follows. Suppose $M_{C_0} = 2$ and consider another configuration $C' = (\theta'_1, \theta_2)$ such that $M_{C'} = 1$. It can be shown that if, under configuration $C_0 = (\theta_1, \theta_2)$, $E_{C_0} T_{\text{inf}}(t)$ is less than the lower bound, then $E_{C'} T_{\text{inf}}(t)$ must grow faster than $o(t^\alpha)$ for some $\alpha > 0$, which contradicts the assumption that $\{\phi_\tau\}$ is uniformly good.

[Footnote 2] Throughout this paper, we adopt the conventions that the infimum of the empty set is $\infty$, and that $1/\infty = 0$.
[Footnote 3] If the parameter set is finite, Theorem 2 always holds. If $\Theta$ is the set of reals, the required regularity conditions concern the unboundedness and continuity of $\mu_\theta$ w.r.t. $\theta$ and the continuity of $I(\theta, \theta')$ w.r.t. $\mu_{\theta'}$.

IV. DIRECT INFORMATION

A. Formulation

In this setting, the side observation $X_t$ directly reveals information about the underlying configuration pair $C_0 = (\theta_1, \theta_2)$ in the following way.

1) Dependence: $G_{C_1} \ne G_{C_2}$ iff $C_1 \ne C_2$.

As a result, observing the empirical distribution of $X_t$ gives us useful information about the underlying parameter pair $C_0$; thus this is a type of identifiability condition.

Examples: 1) $\Theta = (0, 0.5)$ and $\mathcal{X} = \{1, 2, 3\}$, with
$$P_{(\theta_1,\theta_2)}(X_t = k) = \begin{cases} \theta_k & \text{if } k \in \{1, 2\}, \\ 1 - \theta_1 - \theta_2 & \text{otherwise.} \end{cases}$$
2) $\Theta = (0, 1)$ and $\mathcal{X} = [0, 1]$; $X_t$ is beta distributed with parameters $(\theta_1, \theta_2)$.

B. Scheme with Bounded $E_{C_0} T_{\text{inf}}(t)$

Consider the following condition.

Condition 1: For any fixed $C_0$,
$$\inf\{\rho(G_{C_0}, G_{C_e}) : C_e \in \Theta^2,\ \exists x,\ M_{C_e}(x) \ne M_{C_0}(x)\} > 0,$$
where $\rho$ denotes the Prohorov metric[Footnote 4] on the space of distributions.

Two examples satisfying Condition 1 are as follows.
Example 1: $\mathcal{X}$ is finite, and for all $x \in \mathcal{X}$, $\mu_\theta(x)$ is continuous with respect to (w.r.t.) $\theta$.
Example 2: $F_\theta(\cdot|x) = N(\theta x, 1)$, a Gaussian distribution with mean $\theta x$ and variance 1.

Under this condition, we obtain the following result.

Theorem 3 (Bounded $E_{C_0} T_{\text{inf}}(t)$): If Condition 1 is satisfied, then there exists an allocation rule $\{\phi_\tau\}$ such that $\lim_t E_{C_0} T_{\text{inf}}(t) < \infty$ and $\lim_t T_{\text{inf}}(t) < \infty$ a.s.

Note: the information directly revealed by $X_t$ helps the sequential control scheme surpass the lower bound stated in Theorem 1. This significant improvement (bounded expected inferior sampling time) is due to the fact that the dilemma between learning and control no longer exists in the direct information case. We provide a scheme achieving bounded $E_{C_0} T_{\text{inf}}(t)$ as Algorithm 1 below, of which a detailed analysis is given in APPENDIX II.
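Before the formal statement, the following minimal Python sketch illustrates the estimate-then-exploit structure of the scheme. It assumes a finite side-observation alphabet and a finite candidate set for $\Theta^2$, and substitutes total-variation distance on the finite alphabet for the Prohorov metric; with finitely many candidates, the $+1/t$ slack in the construction of $C(t)$ can be dropped. The dictionaries `G` and `M` and the function name are our own illustrative conventions.

```python
# Sketch of the direct-information scheme (Algorithm 1 below), assuming a
# finite alphabet and finitely many candidate configurations; total-variation
# distance stands in for the Prohorov metric.
import numpy as np
from collections import Counter

def algorithm1_policy(G, M, candidates, xs):
    """G[C]: pmf of X under configuration C (dict x -> prob);
    M[C]: preferred arm under C (dict x -> 0 or 1)."""
    counts = Counter()

    def policy(history, x):
        counts[x] += 1                      # include X_t in the empirical law L_X(t)
        n = sum(counts.values())
        emp = np.array([counts[z] / n for z in xs])
        C_hat = min(candidates,             # best-fitting configuration
                    key=lambda C: np.abs(emp - np.array([G[C][z] for z in xs])).sum())
        return M[C_hat][x]                  # exploit: play M_{C_hat}(X_t)
    return policy
```

Because the estimate uses only the side observations, no reward ever needs to be sacrificed for learning, which is exactly why the inferior sampling time stays bounded.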
Algorithm 1 ($\phi_t$, the decision at time $t$, after observing $X_t$ but before deciding $\phi_t$):
1: Construct
$$C(t) := \Big\{C \in \Theta^2 : \rho(G_C, L_X(t)) \le \inf_{C' \in \Theta^2} \rho(G_{C'}, L_X(t)) + \tfrac{1}{t}\Big\},$$
where $L_X(t)$ is the empirical measure of the side observations $\{X_\tau\}$ up to time $t$, and $\rho$ is the Prohorov metric as before.
2: Arbitrarily pick $\hat C_t \in C(t)$, and set $\phi_t = M_{\hat C_t}(X_t)$.

V. BEST ARM AS A FUNCTION OF $X_t$

For all of the following sections (Sections V through VII), we consider only the case in which observing $X_t$ reveals no information about $C_0$, but only information about the upcoming reward $Y^i_t$; that is, $G_{C_0}$ does not depend on the value of $C_0$, and we use $G := G_{C_0}$ as shorthand notation. Three further refinements regarding the relationship between $M_C(x)$ and $x$ will be discussed separately, each in one section.

A. Formulation

In this section, we assume that for all possible $C$, the side observation $X_t$ is always able to change the preference order, as shown in Fig. 1. That is, for all $C \in \Theta^2$, there exist $x_1$ and $x_2$ such that $M_C(x_1) = 1$ and $M_C(x_2) = 2$.

Fig. 1. [Panels (a), (b), (c).] The best arm at time $t$ always depends on the side observation $X_t$. That is, for any possible pair $(\theta_1, \theta_2)$, the two curves $\mu_{\theta_1}(x)$ and $\mu_{\theta_2}(x)$, viewed as functions of $x$, always intersect each other.

The needed regularity conditions are as follows.
1) $\mathcal{X}$ is a finite set and $P_G(X_t = x) > 0$ for all $x \in \mathcal{X}$.
2) For all $\theta_1 \ne \theta_2$ and all $x$, $I_x(\theta_1, \theta_2)$ is strictly positive and finite.
3) For all $x$, $\mu_\theta(x)$ is continuous w.r.t. $\theta$.

The first condition embodies the idea of treating $X_t$ as the index of several different bandit machines, which also simplifies our proof. The second condition ensures that all these different bandit problems are non-trivial, with non-identical pairs of arms.

Example: $\Theta = (0, 1)$, $\mathcal{X} = \{-1, 1\}$, and the conditional reward distribution $F_\theta(\cdot|x) = N(\theta x, 1)$.

B. Scheme with Bounded $E_{C_0} T_{\text{inf}}(t)$

Theorem 4 (Bounded $E_{C_0} T_{\text{inf}}(t)$): If the above conditions are satisfied, there exists an allocation rule $\{\phi_\tau\}$ such that $\lim_t E_{C_0} T_{\text{inf}}(t) < \infty$. Such a rule is obviously uniformly good.

Note: although the side observation $X_t$ does not reveal any information about $C_0$ in this setting, the alternation of the best arm as the i.i.d. $X_t$ takes on different values makes it possible to always perform the control part, $\phi_t = M_{\hat C_t}(X_t)$, and simultaneously sample both arms often enough. Since information about both arms is implicitly revealed through the alternation of $M_{C_0}(X_t)$, the dilemma of learning and control no longer exists, and a significant improvement ($E_{C_0} T_{\text{inf}}(t) < \infty$) is obtained over the lower bound in Theorem 1. We construct an allocation rule with bounded $E_{C_0} T_{\text{inf}}(t)$, given as Algorithm 2.

Algorithm 2 ($\phi_{t+1}$, the decision at time $t+1$):
Variables: Denote by $T_i^x(t)$ the total number of time instants up to time $t$ at which arm $i$ has been pulled and $X_\tau = x$, i.e.,
$$T_i^x(t) := \sum_{\tau=1}^t 1_{\{X_\tau = x,\ \phi_\tau = i\}},$$
and define $x_i^* := \arg\max_x T_i^x(t)$ and $T_i^*(t) := \max_x T_i^x(t)$. Construct
$$C(t) := \Big\{C = (\theta_1, \theta_2) \in \Theta^2 : \sigma(C, t) \le \inf_{C' \in \Theta^2} \sigma(C', t) + \tfrac{1}{t}\Big\},$$
with $\sigma(C, t) := \rho\big(F_{1(C)}(\cdot|x_1^*), L_1^{x_1^*}(t)\big) + \rho\big(F_{2(C)}(\cdot|x_2^*), L_2^{x_2^*}(t)\big)$, where $L_i^x(t)$ is the empirical measure of the rewards sampled from arm $i$ at those time instants $\tau \le t$ when $X_\tau = x$. As before, $\rho(P, Q)$ is the Prohorov metric. Arbitrarily choose $\hat C_t \in C(t)$.

Algorithm:
1: if $t + 1 \le 6$ then
2:   $\phi_{t+1} = (t \bmod 2) + 1$.
3: else if $\exists i$ such that $T_i(t) < \sqrt{t+1}$ then
4:   $\phi_{t+1} = i$.
5: else
6:   $\phi_{t+1} = M_{\hat C_t}(X_{t+1})$.
7: end if
Note that the initial alternation in Line 2 guarantees that there is only one $i$ such that $T_i(t) < \sqrt{t+1}$.

The intuition as to why the proposed scheme has bounded $E_{C_0} T_{\text{inf}}(t)$ is as follows. The forced sampling (triggered when $T_i(t) < \sqrt{t+1}$) ensures there are enough samples on both arms, which implies good enough estimates of $C_0$. Based on the good enough estimates, the myopic action of sampling the seemingly better arm, $\phi_{t+1} = M_{\hat C_t}(X_{t+1})$, results in very few inferior samplings. Unlike the traditional two-armed bandit, in this scenario the best arm $M_{C_0}(x)$ varies from one outcome of $X_t$ to another. Therefore, the myopic action and the even appearances of the i.i.d. $\{X_\tau\}$ will eventually make both $T_1(t)$ and $T_2(t)$ grow linearly with the elapsed time $t$, and the forced sampling should occur only rarely. This situation differs significantly from the traditional bandit, where forced sampling inevitably makes $T_{\text{inf}}(t)$ of the order of $\log t$, which is an undesired result.

[Footnote 4] A definition of the Prohorov metric is stated in APPENDIX I.
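A minimal Python sketch of Algorithm 2 follows, under the same illustrative Gaussian family $N(\theta x, 1)$ and reusing the `(x, arm, y)` history convention of the earlier sketch. To keep the code short, per-(arm, $x$) empirical means over all $x$ replace the Prohorov-metric fit at the most-sampled $x$; this changes the estimator's details but not the forced-sampling/myopic structure the analysis relies on.

```python
# Sketch of Algorithm 2 (forced sampling + myopic play), assuming rewards
# N(theta * x, 1) and a finite candidate grid; a crude empirical-mean fit
# stands in for the Prohorov-metric statistic sigma(C, t).
import numpy as np

def algorithm2_policy(candidates, xs):
    def policy(history, x):
        t = len(history) + 1
        T = [sum(1 for (_, a, _) in history if a == i) for i in (0, 1)]
        if t <= 6:
            return t % 2                       # initial alternation
        if min(T) < np.sqrt(t):                # forced sampling
            return int(np.argmin(T))
        # empirical mean reward for each (arm, x) pair seen so far
        means = {}
        for i in (0, 1):
            for z in xs:
                ys = [y for (xv, a, y) in history if a == i and xv == z]
                if ys:
                    means[(i, z)] = np.mean(ys)
        fit = lambda C: sum(abs(m - C[i] * z) for (i, z), m in means.items())
        C_hat = min(candidates, key=fit)
        return int(np.argmax([C_hat[0] * x, C_hat[1] * x]))  # myopic M_{C_hat}(x)
    return policy
```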
A detailed proof of the boundedness of $E_{C_0} T_{\text{inf}}(t)$ for this scheme is provided in APPENDIX III.

VI. BEST ARM IS NOT A FUNCTION OF $X_t$

A. Formulation

Besides the assumption of constant $G$, in this section we consider the case in which, for all $C \in \Theta^2$, $M_C(x)$ is not a function of $x$; we thus use $M_C := M_C(x)$ as shorthand notation. Fig. 2 illustrates this situation.

Fig. 2. [Panels (a), (b), (c).] The best arm at time $t$ never depends on the side observation $X_t$. That is, for any possible pair $(\theta_1, \theta_2)$, the two curves $\mu_{\theta_1}(x)$ and $\mu_{\theta_2}(x)$ do not intersect each other. However, in this case, we can postpone our sampling to the most informative time instants.

The needed regularity conditions are similar to those in Section V:
1) $\mathcal{X}$ is a finite set and $P_G(X_t = x) > 0$ for all $x \in \mathcal{X}$.
2) For all $\theta_1 \ne \theta_2$ and all $x$, $I_x(\theta_1, \theta_2)$ is strictly positive and finite.

In this case, one arm is always better than the other no matter what value $X_t$ takes. The conflict between learning and control still exists. As expected, the growth rate of the expected inferior sampling time is again logarithmic in $t$, but with the additional help of $X_t$ we can see improvements over the traditional bandit problem. To greatly simplify the notation, we also assume:
3) For all $x$, the conditional expected reward $\mu_\theta(x)$ is strictly increasing w.r.t. $\theta$.
This condition gives us the notational convenience that the order of $\mu_{\theta_1}(x), \mu_{\theta_2}(x)$ is simply the same as the order of $\theta_1, \theta_2$.

Example: $\Theta = (-1, 1)$, $\mathcal{X} = \{1, 2, 3\}$, and the conditional reward distribution $F_\theta(\cdot|x) = N(\theta x, 1)$.

B. Lower Bound

Theorem 5 ($\log t$ Lower Bound): Under the above assumptions, for any uniformly good rule $\{\phi_\tau\}$, $T_{\text{inf}}(t)$ satisfies
$$\lim_{t\to\infty} P_{C_0}\!\left(T_{\text{inf}}(t) \ge \frac{(1-\epsilon)\log t}{D^*_{C_0}}\right) = 1, \quad \forall \epsilon > 0, \qquad \text{and} \qquad \liminf_{t\to\infty}\frac{E_{C_0} T_{\text{inf}}(t)}{\log t} \ge \frac{1}{D^*_{C_0}}, \tag{2}$$
where $D^*_{C_0}$ is a constant depending on $C_0$. If $M_{C_0} = 2$, then $T_{\text{inf}}(t) = T_1(t)$, and the constant can be expressed as
$$D^*_{C_0} = \inf_{\theta:\ \theta > \theta_2}\ \sup_{x \in \mathcal{X}} I_x(\theta_1, \theta). \tag{3}$$
The expression for the case in which $M_{C_0} = 1$ can be obtained by symmetry.

Note 1: if the decision maker is not able to access the side observation $X_t$, the player then faces the unconditional reward distribution $\bar F_{\theta_i}(dy) = \int F_{\theta_i}(dy|x)\, G(dx)$ rather than $F_{\theta_i}(dy|x)$. Let $\bar I(\theta_1, \theta_2)$ denote the Kullback-Leibler information number between the unconditional reward distributions. By the convexity of the Kullback-Leibler information, we have
$$\sup_x I_x(\theta_1, \theta) \ \ge\ \int I_x(\theta_1, \theta)\, G(dx) \ \ge\ \bar I(\theta_1, \theta).$$
This shows that the new constant in front of $\log t$ in (2), namely $1/D^*_{C_0}$, is no larger than the corresponding constant obtained from Theorem 1 without side observations, so the additional side information $X_t$ generally improves the decisions made in the bandit problem. As we would expect, Theorem 5 collapses to Theorem 1 when $|\mathcal{X}| = 1$.

Note 2: this situation is like having several related bandit machines whose reward distributions are all determined by the common configuration pair $(\theta_1, \theta_2)$. The information obtained from one machine is also applicable to the other machines. If arm 2 is always better than arm 1, we wish to sample arm 2 most of the time (the control part), and force-sample arm 1 once in a while (the learning part). With the help of the side information $X_t$, we can postpone our forced sampling (learning) to the most informative machine $x^*$. As a result, the constant in the lower bound of Theorem 1 is further reduced to the new value $1/D^*_{C_0}$. A detailed proof of Theorem 5 is provided in APPENDIX IV.

C. Scheme Achieving the Lower Bound

Consider the following additional conditions.
1) $\Theta$ is finite.
2) A saddle point for $D^*_{C_0}$ exists; that is, for all $\theta_1 < \theta_2$,
$$\inf_{\theta:\ \theta > \theta_2}\ \sup_x I_x(\theta_1, \theta) = \sup_x\ \inf_{\theta:\ \theta > \theta_2} I_x(\theta_1, \theta).$$

With the above conditions, we construct a lower-bound-achieving scheme $\{\phi_\tau\}$, inspired by [2]. The following terms and quantities are necessary for the description of $\{\phi_\tau\}$.

Denote $\hat C_t := (\theta_\alpha, \theta_\beta)$. Instead of the traditional $(\hat\theta_1, \hat\theta_2)$ representation, we use $(\theta_\alpha, \theta_\beta)$. Based on this representation, we derive the following useful notation:
$$\theta_{\alpha\wedge\beta} := \min(\theta_\alpha, \theta_\beta), \quad \alpha\wedge\beta := \arg\min(\theta_\alpha, \theta_\beta), \quad \theta_{\alpha\vee\beta} := \max(\theta_\alpha, \theta_\beta), \quad \alpha\vee\beta := \arg\max(\theta_\alpha, \theta_\beta).$$
For instance, if $\theta_\alpha < \theta_\beta$, then $\mu_{\theta_\alpha} = \mu_{\theta_{\alpha\wedge\beta}}$; arm $\alpha\wedge\beta$ represents arm 1; $Y^{\alpha\vee\beta}_t$ is the reward of arm 2; and $T_{\text{inf}}(t) = T_{\alpha\wedge\beta}(t) = T_1(t)$.

Choose an $\epsilon$ such that
$$0 < \epsilon < \tfrac{1}{2}\min\{\rho(F_\theta(\cdot|x), F_\vartheta(\cdot|x)) : x \in \mathcal{X},\ \theta \ne \vartheta \in \Theta\},$$
where $\rho$ is the Prohorov metric.
The whole system is well-sampled (at time $t$) if there exists a unique estimate $\hat C_t = (\theta_\alpha, \theta_\beta)$ such that the empirical measure $L_i^x(t)$ falls into the $\epsilon$-neighborhood of $F_{i(\hat C_t)}(\cdot|x)$ for all $x \in \mathcal{X}$ and $i = 1, 2$. That is,
$$\exists!\, \hat C_t \ \text{s.t.}\ \rho\big(L_i^x(t), F_{i(\hat C_t)}(\cdot|x)\big) < \epsilon, \quad \forall x \in \mathcal{X},\ i \in \{1, 2\}.$$

For any estimate $\hat C_t = (\theta_\alpha, \theta_\beta)$, define the most informative bandit according to $\hat C_t$ as
$$x^*(\hat C_t) := \arg\max_x\ \inf_{\theta:\ \theta > \theta_{\alpha\vee\beta}} I_x(\theta_{\alpha\wedge\beta}, \theta),$$
and $\Lambda_t(\hat C_t, \theta)$ to be the conditional likelihood ratio between the seemingly inferior arm ($\theta_{\alpha\wedge\beta}$) and the competing parameter $\theta$:
$$\Lambda_t(\hat C_t, \theta) := \prod_{m=1}^{T^{x^*(\hat C_t)}_{\alpha\wedge\beta}(t)} \frac{F_{\theta_{\alpha\wedge\beta}}\big(dY^{\alpha\wedge\beta}_{\tau_m}\,\big|\,x^*(\hat C_t)\big)}{F_{\theta}\big(dY^{\alpha\wedge\beta}_{\tau_m}\,\big|\,x^*(\hat C_t)\big)},$$
where $\tau_m$ denotes the time instant of the $m$-th pull of arm $\alpha\wedge\beta$ when the side observation $X_\tau = x^*(\hat C_\tau)$.

Set a total of $|\mathcal{X}| + |\Theta|^2 + |\Theta|^3$ counters, including $|\mathcal{X}|$ counters named ctr($x$); $|\Theta|^2$ counters named ctr($\hat C$), for all possible $\hat C \in \Theta^2$; and $|\Theta|^3$ counters named ctr($\hat C, \theta$), for all possible $\hat C$ and $\theta$. Initially, all counters are set to zero.
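For the illustrative Gaussian family $N(\theta x, 1)$, the conditional K-L number has the closed form $I_x(\theta, \theta') = (\theta - \theta')^2 x^2 / 2$, so $x^*(\hat C_t)$ and $\log \Lambda_t$ are easy to compute. The sketch below assumes a finite $\Theta$, assumes SciPy is available, and uses our own function names; it is a numerical illustration of the two quantities just defined, not the paper's implementation.

```python
# Most informative bandit x*(C_hat) and log-likelihood ratio for the
# illustrative family F_theta(.|x) = N(theta * x, 1).
import numpy as np
from scipy.stats import norm

def kl_x(x, th, th2):
    return 0.5 * ((th - th2) * x) ** 2      # I_x(theta, theta') for N(theta*x, 1)

def most_informative_x(xs, Theta, th_low, th_high):
    # x*(C_hat): maximize over x the worst-case K-L number against the
    # competing parameters theta > theta_{alpha v beta}
    def worst(x):
        comp = [kl_x(x, th_low, th) for th in Theta if th > th_high]
        return min(comp) if comp else np.inf
    return max(xs, key=worst)

def log_Lambda(ys, x_star, th_low, th):
    # log Lambda_t: rewards ys of arm (alpha ^ beta) collected at x*,
    # theta_{alpha ^ beta} tested against the competing theta
    ys = np.asarray(ys)
    return np.sum(norm.logpdf(ys, th_low * x_star, 1)
                  - norm.logpdf(ys, th * x_star, 1))
```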

Algorithm 3 ($\phi_{t+1}$, the decision at time $t+1$):
1: if there exist $i \in \{1,2\}$ and $x \in \mathcal{X}$ such that $T_i^x(t) = 0$, then (Cond0)
2:   $\phi_{t+1} = (t+1) \bmod 2 + 1$.
3: else if the whole system is not well-sampled or $\theta_\alpha = \theta_\beta$, then (Cond1)
4:   ctr($X_{t+1}$) $\leftarrow$ ctr($X_{t+1}$) $+ 1$ and $\phi_{t+1} = $ ctr($X_{t+1}$) mod $2 + 1$.
5: else if $\theta_{\alpha\vee\beta} = \theta^{\max} := \max\{\theta \in \Theta\}$, then (Cond2)
6:   $\phi_{t+1} = 1$ if $\theta_{\alpha\vee\beta}$ is $\theta_\alpha$; otherwise, $\phi_{t+1} = 2$.
7: else (Cond3)
8:   ctr($\hat C_t$) $\leftarrow$ ctr($\hat C_t$) $+ 1$.
9:   if ctr($\hat C_t$) is odd, then (Cond3a)
10:     $\phi_{t+1} = M_{\hat C_t}(X_{t+1})$.
11:   else (Cond3b)
12:     $\theta^\dagger = \arg\min\{\Lambda_t(\hat C_t, \theta) : \theta > \theta_{\alpha\vee\beta}\}$.
13:     if $X_{t+1} = x^*(\hat C_t)$ then (Cond3b1)
14:       if $\Lambda_t(\hat C_t, \theta^\dagger) \le t(\log t)^2$, then (Cond3b1a)
15:         ctr($\hat C_t, \theta^\dagger$) $\leftarrow$ ctr($\hat C_t, \theta^\dagger$) $+ 1$.
16:         if $\exists k \in \mathbb{N}$ s.t. ctr($\hat C_t, \theta^\dagger$) $= k^2$, then (Cond3b1a1)
17:           $\phi_{t+1} = k \bmod 2 + 1$.
18:         else (Cond3b1a2)
19:           $\phi_{t+1} = 3 - M_{\hat C_t}(X_{t+1})$.
20:         end if
21:       else (Cond3b1b)
22:         $\phi_{t+1} = M_{\hat C_t}(X_{t+1})$.
23:       end if
24:     else (Cond3b2)
25:       $\phi_{t+1} = M_{\hat C_t}(X_{t+1})$.
26:     end if
27:   end if
28: end if

Theorem 6 (Asymptotic Tightness): Under the above conditions, the scheme described in Algorithm 3 achieves the lower bound (2), so that this $\{\phi_\tau\}$ is uniformly good and asymptotically optimal. A complete analysis is provided in APPENDIX V.

VII. MIXED CASE

The main difference between Sections V and VI is that in one case, for all possible $C_0$, $X_t$ always changes the preference order, while in the other, for all possible $C_0$, $X_t$ never changes the order. A more general case is a mixture of these two. In this section, we consider this mixed case, which contains the main result of this paper.

A. Formulation

Besides the assumption of constant $G$, in this section we consider the case in which, for some $C \in \Theta^2$, $M_C(x)$ is not a function of $x$, while for the remaining $C$, there exist $x_1$ and $x_2$ s.t. $M_C(x_1) = 1$ and $M_C(x_2) = 2$. For future reference, when the configuration pair $C_0$ satisfies the latter case, we say that $C_0$ is implicitly revealing. Fig. 3 illustrates this situation. However, without knowledge of the authentic underlying configuration $C_0$, we do not know whether $C_0$ is implicitly revealing or not. In view of the results of Sections V and VI, we would like to find a single scheme that achieves bounded $E_{C_0} T_{\text{inf}}(t)$ when applied to an implicitly revealing $C_0$, and on the other hand achieves the lower bound when applied to those $C_0$ that are not implicitly revealing.

Fig. 3. [Panels (a), (b), (c).] If $(\theta_1, \theta_2) = (\theta_a, \theta_b)$, the best arm depends on $x$, i.e., $\mu_{\theta_1}(x)$ and $\mu_{\theta_2}(x)$ intersect each other as in Section V. If $(\theta_1, \theta_2) = (\theta_b, \theta_c)$, the best arm does not depend on $x$, i.e., $\mu_{\theta_1}(x)$ and $\mu_{\theta_2}(x)$ do not intersect each other, as in Section VI.

The needed regularity conditions are the same as those in Sections V and VI:
1) $\mathcal{X}$ is a finite set and $P_G(X_t = x) > 0$ for all $x \in \mathcal{X}$.
2) For all $\theta_1 \ne \theta_2$ and all $x$, $I_x(\theta_1, \theta_2)$ is strictly positive and finite.

To simplify the notation and the ensuing proof, we define a partial ordering $\succeq$ by $\theta \succeq \vartheta$ iff $\mu_\theta(x) \ge \mu_\vartheta(x)$ for all $x$; $\theta \preceq \vartheta$ is defined similarly. Note that for a configuration $C_0 = (\theta_1, \theta_2)$, it can be the case that neither $\theta_1 \succeq \theta_2$ nor $\theta_1 \preceq \theta_2$.

Example: $\Theta = (0, 1)$, $\mathcal{X} = \{-1, 1\}$, and the conditional reward distribution $F_\theta(\cdot|x) = N(\theta^2 - \theta x, 1)$. Then $C_0 = (\theta_1, \theta_2) = (0.1, 0.2)$ is implicitly revealing, but configurations with $\theta_1 + \theta_2 > 1$ are not.

B. Lower Bound

Theorem 7 (Lower Bound): Under the above assumptions, for any uniformly good rule $\{\phi_\tau\}$, if $C_0$ is not implicitly revealing, $T_{\text{inf}}(t)$ satisfies
$$\lim_{t\to\infty} P_{C_0}\!\left(T_{\text{inf}}(t) \ge \frac{(1-\epsilon)\log t}{D^{**}_{C_0}}\right) = 1, \quad \forall \epsilon > 0, \qquad \text{and} \qquad \liminf_{t\to\infty}\frac{E_{C_0} T_{\text{inf}}(t)}{\log t} \ge \frac{1}{D^{**}_{C_0}}, \tag{4}$$
where $D^{**}_{C_0}$ is a constant depending on $C_0$. If $M_{C_0} = 2$, then $T_{\text{inf}}(t) = T_1(t)$, and the constant can be expressed as
$$D^{**}_{C_0} = \inf_{\theta:\ \exists x_0 \text{ s.t. } \mu_\theta(x_0) > \mu_{\theta_2}(x_0)}\ \sup_x I_x(\theta_1, \theta).$$
The expression for the case in which $M_{C_0} = 1$ can be obtained by symmetry.
The only difference between the lower bounds (2) and (4) is that in (4) the infimum is taken over a larger set: it has been changed from $\{\theta : \theta > \theta_2\}$, i.e., $\mu_\theta(x) > \mu_{\theta_2}(x)$ for all $x$, to $\{\theta : \exists x,\ \mu_\theta(x) > \mu_{\theta_2}(x)\}$. The reason is as follows: consider a $\theta$ for which there exists some $x$ such that $\mu_\theta(x) > \mu_{\theta_2}(x)$. If the authentic configuration were $C' = (\theta, \theta_2)$ rather than $(\theta_1, \theta_2)$, then a rule that does not distinguish this possibility would incur a linear order of incorrect sampling, violating the uniformly-good-rule assumption. As a result, a broader class of competing configurations $C' = (\theta, \theta_2)$ must be considered, i.e., we must take the infimum over this larger set of configurations. A detailed proof is contained in APPENDIX VI.
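The implicitly-revealing dichotomy is easy to test numerically. The helper below is our own illustration, agnostic to the reward family (the mean function is passed in); the concrete `mu` in the usage lines is the mean function from the example above.

```python
# A configuration C = (theta1, theta2) is implicitly revealing iff the sign
# of mu_theta1(x) - mu_theta2(x) changes over the finite alphabet X.
def implicitly_revealing(C, xs, mu):
    """mu(theta, x): conditional expected reward mu_theta(x)."""
    diffs = [mu(C[0], x) - mu(C[1], x) for x in xs]
    return any(d > 0 for d in diffs) and any(d < 0 for d in diffs)

# With mu_theta(x) = theta**2 - theta*x on X = {-1, 1}, as in Section VII:
mu = lambda th, x: th ** 2 - th * x
print(implicitly_revealing((0.1, 0.2), [-1, 1], mu))   # True
print(implicitly_revealing((0.6, 0.7), [-1, 1], mu))   # False: theta1 + theta2 > 1
```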

C. Scheme Achieving the Lower Bound

Consider the same two additional conditions as those in Section VI.
1) $\Theta$ is finite.
2) A saddle point for $D^{**}_{C_0}$ exists; that is, for all $\theta_1$,
$$\inf_{\theta:\ \exists x_0,\ \mu_\theta(x_0) > \mu_{\theta_2}(x_0)}\ \sup_x I_x(\theta_1, \theta) = \sup_x\ \inf_{\theta:\ \exists x_0,\ \mu_\theta(x_0) > \mu_{\theta_2}(x_0)} I_x(\theta_1, \theta).$$

Algorithm 4 ($\phi_{t+1}$, the decision at time $t+1$):
1: if there exist $i \in \{1,2\}$ and $x \in \mathcal{X}$ such that $T_i^x(t) = 0$, then (Cond0)
2:   $\phi_{t+1} = (t+1) \bmod 2 + 1$.
3: else if the whole system is not well-sampled or $\theta_\alpha = \theta_\beta$, then (Cond1)
4:   ctr($X_{t+1}$) $\leftarrow$ ctr($X_{t+1}$) $+ 1$ and $\phi_{t+1} = $ ctr($X_{t+1}$) mod $2 + 1$.
5: else if there exists $i \in \{1, 2\}$ such that $\forall \theta \in \Theta, \forall x$, $\mu_\theta(x) \le \mu_{i(\hat C_t)}(x)$, then (Cond2)
6:   $\phi_{t+1} = i$, where $i$ is the satisfying index.
7: else if $\hat C_t$ is implicitly revealing, then (Cond2.5)
8:   $\phi_{t+1} = M_{\hat C_t}(X_{t+1})$.
9: else (Cond3)
10:  ctr($\hat C_t$) $\leftarrow$ ctr($\hat C_t$) $+ 1$.
11:  if ctr($\hat C_t$) is odd, then (Cond3a)
12:    $\phi_{t+1} = M_{\hat C_t}(X_{t+1})$.
13:  else (Cond3b)
14:    $\theta^\dagger = \arg\min\{\Lambda_t(\hat C_t, \theta) : \theta \text{ s.t. } \exists x_0,\ \mu_\theta(x_0) > \mu_{\theta_{\alpha\vee\beta}}(x_0)\}$.
15:    if $X_{t+1} = x^*(\hat C_t)$ then (Cond3b1)
16:      if $\Lambda_t(\hat C_t, \theta^\dagger) \le t(\log t)^2$, then (Cond3b1a)
17:        ctr($\hat C_t, \theta^\dagger$) $\leftarrow$ ctr($\hat C_t, \theta^\dagger$) $+ 1$.
18:        if $\exists k \in \mathbb{N}$ s.t. ctr($\hat C_t, \theta^\dagger$) $= k^2$, then (Cond3b1a1)
19:          $\phi_{t+1} = k \bmod 2 + 1$.
20:        else (Cond3b1a2)
21:          $\phi_{t+1} = 3 - M_{\hat C_t}(X_{t+1})$.
22:        end if
23:      else (Cond3b1b)
24:        $\phi_{t+1} = M_{\hat C_t}(X_{t+1})$.
25:      end if
26:    else (Cond3b2)
27:      $\phi_{t+1} = M_{\hat C_t}(X_{t+1})$.
28:    end if
29:  end if
30: end if

The proposed scheme is described in Algorithm 4, which is similar to the scheme in Section VI-C. The only differences are the insertion of Cond2.5 (Lines 7 and 8), the modification of Cond2 (Lines 5 and 6), and the modification of Cond3b (Line 14).

Notes: 1) When the estimate $\hat C_t = (\theta_\alpha, \theta_\beta)$ is not implicitly revealing, an ordering (w.r.t. $\succeq$) between $\theta_\alpha$ and $\theta_\beta$ exists. As a result, all notation regarding $\alpha\wedge\beta$, $\theta_{\alpha\vee\beta}$, etc., remains valid.
2) The definition of $\Lambda_t(\hat C_t, \theta)$ is slightly different. For any estimate $\hat C_t = (\theta_\alpha, \theta_\beta)$ that is not implicitly revealing, we can define the most informative bandit according to $\hat C_t$ as
$$x^*(\hat C_t) := \arg\max_x\ \inf_{\theta:\ \exists x_0,\ \mu_\theta(x_0) > \mu_{\theta_{\alpha\vee\beta}}(x_0)} I_x(\theta_{\alpha\wedge\beta}, \theta), \tag{5}$$
and $\Lambda_t(\hat C_t, \theta)$ to be the conditional likelihood ratio between the seemingly inferior arm ($\theta_{\alpha\wedge\beta}$) and the competing parameter $\theta$. That is,
$$\Lambda_t(\hat C_t, \theta) := \prod_{m=1}^{T^{x^*(\hat C_t)}_{\alpha\wedge\beta}(t)} \frac{F_{\theta_{\alpha\wedge\beta}}\big(dY^{\alpha\wedge\beta}_{\tau_m}\,\big|\,x^*(\hat C_t)\big)}{F_{\theta}\big(dY^{\alpha\wedge\beta}_{\tau_m}\,\big|\,x^*(\hat C_t)\big)},$$
where $\tau_m$ denotes the time instant of the $m$-th pull of arm $\alpha\wedge\beta$ when the side observation $X_\tau = x^*(\hat C_\tau)$. The difference between this new $\Lambda_t(\hat C_t, \theta)$ and the previous one in Algorithm 3 is the new $x^*(\hat C_t)$ defined in (5).

Theorem 8 (Asymptotic Tightness): Under the above conditions, the scheme described in Algorithm 4 has bounded $\lim_t E_{C_0} T_{\text{inf}}(t)$, or achieves the lower bound (4), depending on whether the underlying configuration pair $C_0$ is implicitly revealing or not. A detailed analysis is given in APPENDIX VI.

VIII. CONCLUSION

We have shown that observing additional side information can significantly improve sequential decisions in bandit problems. If the side observation itself directly provides information about the underlying configuration, then it resolves the dilemma between forced sampling and optimal control: the expected inferior sampling time is bounded, as shown in Section IV. If the side observation does not provide information on the underlying configuration $(\theta_1, \theta_2)$ but always affects the preference order (implicitly revealing), then the myopic approach of sampling the seemingly best arm automatically samples both arms enough; the expected inferior sampling time is again bounded, as shown in Section V. If the side observation does not affect the preference order at all, the dilemma still exists. However, by postponing our forced sampling to the most informative time instants, we can reduce the constant in the $\log t$ lower bound, as shown in Section VI. In Section VII, we combined the settings of Sections V and VI and obtained a general result.
When the underlying configuration $C_0$ is implicitly revealing, so that $X_t$ will change the preference order, we obtain bounded expected inferior sampling time as in Section V. Even if $C_0$ is not implicitly revealing, in that $X_t$ does not change the preference order, the new lower bound can be achieved as in Section VI. Our results are summarized in TABLE II.

APPENDIX I
SANOV'S THEOREM AND THE PROHOROV METRIC

For two distributions $P$ and $Q$ on the reals, the Prohorov metric is defined as follows.

Definition 2 (The Prohorov metric): For any closed set $A \subseteq \mathbb{R}$ and $\epsilon > 0$, define $A^\epsilon$, the $\epsilon$-flattening of $A$, as
$$A^\epsilon := \Big\{x \in \mathbb{R} : \inf_{y \in A} |x - y| < \epsilon\Big\}.$$
The Prohorov metric $\rho$ is then defined as
$$\rho(P, Q) := \inf\{\epsilon > 0 : P(A) \le Q(A^\epsilon) + \epsilon, \text{ for all closed } A \subseteq \mathbb{R}\}.$$
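As a quick numerical sanity check of the large-deviations machinery stated below (Theorem 9), the following Monte Carlo sketch uses the simplest finite alphabet. All numbers are arbitrary illustrative values; the empirical exponent and the Sanov exponent agree only up to polynomial prefactors at moderate $n$.

```python
# Sanity check of Sanov's theorem for i.i.d. Bernoulli(p): the chance that
# the empirical frequency reaches q > p decays like exp(-n * I(q, p)).
import numpy as np

rng = np.random.default_rng(1)
p, q, n, trials = 0.3, 0.5, 50, 200_000

def kl_bernoulli(a, b):
    return a * np.log(a / b) + (1 - a) * np.log((1 - a) / (1 - b))

freq = (rng.binomial(n, p, trials) / n >= q).mean()
print("empirical (1/n) log P:", np.log(freq) / n)      # roughly -0.13 here
print("Sanov exponent -I(q,p):", -kl_bernoulli(q, p))  # roughly -0.087
```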

TABLE II
SUMMARY OF THE BENEFITS OF SIDE OBSERVATIONS AND THE REQUIRED REGULARITY CONDITIONS

1) Characterization: $G_{C_1} \ne G_{C_2}$ iff $C_1 \ne C_2$ (direct information). Regularity conditions: as $\hat C_t \to C_0$, $\forall x$, $M_{\hat C_t}(x) = M_{C_0}(x)$. Result: $\exists \{\phi_\tau\}$ s.t. $\forall C_0$, $\lim_t E_{C_0} T_{\text{inf}}(t) < \infty$.

2) Characterization: (i) constant $G_C$, i.e., $G_C =: G$; (ii) $\forall C$, $\exists x_1, x_2$ s.t. $M_C(x_1) = 1$, $M_C(x_2) = 2$ (implicitly revealing). Regularity conditions: (i) $\mathcal{X}$ is finite; (ii) $\forall \theta_1 \ne \theta_2, \forall x$, $0 < I_x(\theta_1, \theta_2) < \infty$; (iii) $\forall x$, $\mu_\theta(x)$ is continuous w.r.t. $\theta$. Result: $\exists \{\phi_\tau\}$ such that $\forall C_0$, $\lim_t E_{C_0} T_{\text{inf}}(t) < \infty$.

3) Characterization: (i) constant $G_C$; (ii) $\forall C$, $M_C$ depends only on $C$, not on $x$. Regularity conditions: (i) $\mathcal{X}$ is finite; (ii) $\forall \theta_1 \ne \theta_2, \forall x$, $0 < I_x(\theta_1, \theta_2) < \infty$; (iii) $\forall x$, $\mu_\theta(x)$ is strictly increasing w.r.t. $\theta$. Result: for any uniformly good $\{\phi_\tau\}$, $\liminf_t E_{C_0} T_{\text{inf}}(t)/\log t \ge 1/D^*_{C_0}$ with $D^*_{C_0} := \inf_\theta \sup_x I_x(\theta_1, \theta)$; for finite $\Theta$, $\exists \{\phi_\tau\}$ s.t. $\lim_t E_{C_0} T_{\text{inf}}(t)/\log t = 1/D^*_{C_0}$.

4) Characterization: (i) constant $G_C$; (ii) the underlying $C_0$ may be implicitly revealing or not. Regularity conditions: (i) $\mathcal{X}$ is finite; (ii) $\forall \theta_1 \ne \theta_2, \forall x$, $0 < I_x(\theta_1, \theta_2) < \infty$. Result: for any uniformly good $\{\phi_\tau\}$, if $C_0$ is not implicitly revealing, $\liminf_t E_{C_0} T_{\text{inf}}(t)/\log t \ge 1/D^{**}_{C_0}$; for finite $\Theta$, $\exists \{\phi_\tau\}$ s.t. if $C_0$ is implicitly revealing, $\lim_t E_{C_0} T_{\text{inf}}(t) < \infty$, and if $C_0$ is not implicitly revealing, $\lim_t E_{C_0} T_{\text{inf}}(t)/\log t = 1/D^{**}_{C_0}$.

The Prohorov metric generates the topology corresponding to convergence in distribution. Throughout this paper, the open/closed sets on the space of distributions are defined accordingly.

Theorem 9 (Sanov's theorem): Let $L_X(n)$ denote the empirical measure of the real-valued i.i.d. random variables $X_1, X_2, \dots, X_n$. Suppose $X_i$ has distribution $P$ and consider any open set $A$ and closed set $B$ in the topological space of distributions generated by the Prohorov metric. We have
$$\liminf_{n\to\infty} \frac{1}{n}\log P\big(L_X(n) \in A\big) \ge -\inf_{Q \in A} I(Q, P), \qquad \limsup_{n\to\infty} \frac{1}{n}\log P\big(L_X(n) \in B\big) \le -\inf_{Q \in B} I(Q, P).$$
Further discussion of the Prohorov metric and Sanov's theorem can be found in [23], [24].

APPENDIX II
PROOF OF THEOREM 3

Proof: For any underlying configuration pair $C_0 = (\theta_1, \theta_2)$, define the error set $C_e$ as
$$C_e := \{C \in \Theta^2 : \exists x,\ M_C(x) \ne M_{C_0}(x)\}. \tag{6}$$
Let $\bar C_e$ denote the closure of $C_e$. By Condition 1, $C_0 \notin \bar C_e$. For any $t$, we can write
$$P_{C_0}\big(\phi_t \ne M_{C_0}(X_t)\big) = P_{C_0}\big(M_{\hat C_t}(X_t) \ne M_{C_0}(X_t)\big) \le P_{C_0}\big(\exists x,\ M_{\hat C_t}(x) \ne M_{C_0}(x)\big) = P_{C_0}\big(\hat C_t \in C_e\big) \le P_{C_0}\big(\hat C_t \in \bar C_e\big).$$
Let $\epsilon = \tfrac{1}{3}\inf\{\rho(G_{C_0}, G_{C_e}) : C_e \in \bar C_e\}$, which is strictly positive by Condition 1, and consider sufficiently large $t \ge 1/\epsilon$. If $\rho(G_{C_0}, L_X(t)) < \epsilon$, then by the definition of $C(t)$, $\rho(G_{\hat C_t}, L_X(t)) < \epsilon + \tfrac{1}{t} \le 2\epsilon$. By the triangle inequality, $\rho(G_{C_0}, G_{\hat C_t}) < 3\epsilon$, and hence $\hat C_t \notin \bar C_e$. As a result,
$$\{\hat C_t \in \bar C_e\} \subseteq \{\rho(L_X(t), G_{C_0}) \ge \epsilon\} =: K_t,$$
and $K_t$ is defined by a closed set of distributions. By Sanov's theorem, the probability of $K_t$ is exponentially upper bounded w.r.t. $t$, and so is $P_{C_0}(\hat C_t \in \bar C_e)$. As a result, we have
$$\sup_t E_{C_0} T_{\text{inf}}(t) \le \sum_{\tau=1}^\infty P_{C_0}\big(\phi_\tau \ne M_{C_0}(X_\tau)\big) < \infty.$$
By the monotone convergence theorem, the expectation of $\lim_t T_{\text{inf}}(t)$ is finite, which implies that $\lim_t T_{\text{inf}}(t)$ is finite a.s.

APPENDIX III
PROOF OF THEOREM 4

Similarly, we define $C_e$ as in (6). We need the following lemma to complete the analysis.

Lemma 1: With the regularity conditions specified in Section V, there exist $a_1, a_2 > 0$ such that
$$P_{C_0}\big(\hat C_t \in C_e\big) \le a_1 \exp\big(-a_2 \min\{T_1^*(t), T_2^*(t)\}\big).$$

Proof of Lemma 1: By the continuity of $\mu_\theta(x)$ w.r.t. $\theta$ and the assumption of finite $\mathcal{X}$, it can be shown that $C_0$ lies in the interior of $C_e^c$.[Footnote 5] Therefore, there exists a neighborhood of $C_0$, $C_\delta = (\theta_1 - \delta, \theta_1 + \delta) \times (\theta_2 - \delta, \theta_2 + \delta)$, such that $C_\delta \subseteq C_e^c$, i.e., $C_e \subseteq C_\delta^c$. Define a strictly positive $\epsilon > 0$ as
$$\epsilon := \tfrac{1}{4}\inf\big\{\rho\big(F_{\hat\theta_i}(\cdot|x), F_{\theta_i}(\cdot|x)\big) : x \in \mathcal{X},\ i \in \{1,2\},\ (\hat\theta_1, \hat\theta_2) \in C_\delta^c\big\}.$$
We would like to prove that for sufficiently large $t > 1/\epsilon$,
$$\hat C_t \in C_\delta^c \ \Longrightarrow\ \exists i,\ \rho\big(L_i^{x_i^*}(t), F_{\theta_i}(\cdot|x_i^*)\big) > \epsilon.$$
[Footnote 5] $C_e^c$ denotes the complement of $C_e$.

Suppose $\rho(L_i^{x_i^*}(t), F_{\theta_i}(\cdot|x_i^*)) \le \epsilon$ for both $i = 1, 2$. By the definition of $\sigma(C_0, t)$, we have
$$\sigma(C_0, t) \le 2\epsilon. \tag{7}$$
However, for those $\hat C_t \in C_\delta^c$, by the definition of $\epsilon$, for some $i \in \{1, 2\}$ we have
$$\sigma(\hat C_t, t) \ge \rho\big(F_{\hat\theta_i}(\cdot|x_i^*), L_i^{x_i^*}(t)\big) \ge \rho\big(F_{\hat\theta_i}(\cdot|x_i^*), F_{\theta_i}(\cdot|x_i^*)\big) - \rho\big(F_{\theta_i}(\cdot|x_i^*), L_i^{x_i^*}(t)\big) \ge 3\epsilon, \tag{8}$$
which contradicts the definition of $C(t)$, since (7) and (8) imply $\sigma(\hat C_t, t) > \tfrac{1}{t} + \sigma(C_0, t)$. As a result, for sufficiently large $t$, we have
$$\{\hat C_t \in C_e\} \subseteq \{\hat C_t \in C_\delta^c\} \subseteq \bigcup_{i=1,2}\big\{\rho\big(L_i^{x_i^*}(t), F_{\theta_i}(\cdot|x_i^*)\big) > \epsilon\big\}. \tag{9}$$
By Sanov's theorem, the probability of each term in the union on the right-hand side of (9) is exponentially bounded w.r.t. $T_i^*(t)$. As a result, the probability of this finite union is bounded by $a_1 \exp(-a_2 \min\{T_1^*(t), T_2^*(t)\})$ for some $a_1, a_2 > 0$.

Analysis of the scheme: We first use induction to show that for all $t \ge 6$, $T_i(t) \ge \sqrt{t}$. The statement is true for $t = 6$. Suppose $T_i(t-1) \ge \sqrt{t-1}$. If $T_i(t-1) \ge \sqrt{t}$, by the monotonicity of $T_i(t)$ w.r.t. $t$, we have $T_i(t) \ge T_i(t-1) \ge \sqrt{t}$. If $T_i(t-1) < \sqrt{t}$, by the forced sampling mechanism, $T_i(t) = T_i(t-1) + 1 \ge \sqrt{t-1} + 1 \ge \sqrt{t}$.

We consider the event of inferior sampling at time $t+1$:
$$\{\phi_{t+1} \ne M_{C_0}(X_{t+1})\} \subseteq \{\hat C_t \in C_e\} \cup \{\phi_{t+1} \ne M_{C_0}(X_{t+1}),\ \hat C_t \in C_e^c\} =: A_{t+1} \cup B_{t+1}. \tag{10}$$
Since $T_i(t) \ge \sqrt{t}$, we have $\min_i T_i(t) \ge \sqrt{t}$ and $\min_i T_i^*(t) \ge \sqrt{t}/|\mathcal{X}|$. By Lemma 1, we have $P_{C_0}(A_{t+1}) \le a_1 e^{-a_2\sqrt{t}/|\mathcal{X}|}$, and hence $\sum_{t\ge 7} P_{C_0}(A_{t+1}) < \infty$.

For $P_{C_0}(B_{t+1})$, we can write
$$B_{t+1} = \{\phi_{t+1} \ne M_{\hat C_t}(X_{t+1}),\ \hat C_t \in C_e^c\} \subseteq \{\min_i T_i(t) < \sqrt{t+1},\ \hat C_t \in C_e^c\} = \{\min_i T_i(t) = \lceil\sqrt{t}\,\rceil,\ \hat C_t \in C_e^c\} \subseteq \bigcup_{i=1,2}\{\phi_a \ne i,\ \forall a \in (\tau_0, t],\ T_i(\tau_0) \ge \sqrt{\tau_0}\} =: B^1_{t+1} \cup B^2_{t+1}, \tag{11}$$
where $\tau_0 := (\sqrt{t+1}-1)^2$ and $B^1_{t+1}, B^2_{t+1}$ correspond to $i = 1, 2$, respectively. The first equality comes from the fact that, for $\hat C_t \in C_e^c$, $M_{C_0}(X_{t+1}) = M_{\hat C_t}(X_{t+1})$. The first inclusion comes from the fact that $\phi_{t+1} \ne M_{\hat C_t}(X_{t+1})$ implies that the decision rule $\phi_{t+1}$ is in the stage of forced sampling. The second equality follows by combining the two inequalities $\min_i T_i(t) \ge \sqrt{t}$ and $\min_i T_i(t) < \sqrt{t+1}$ with the fact that $T_i(t)$ is an integer. The reasoning behind the second inclusion is as follows: using $T_i(\tau_0) \ge \sqrt{\tau_0} = \sqrt{t+1} - 1$ and $T_i(t) < \sqrt{t+1}$, we have $T_i(\tau_0) = T_i(t)$, which guarantees that arm $i$ has not been sampled from time $\tau_0 + 1$ to $t$.

By the symmetry between $B^1_{t+1}$ and $B^2_{t+1}$, we consider only $B^1_{t+1}$. We have
$$P_{C_0}(B^1_{t+1}) \le P_{C_0}\big(M_{\hat C_a}(X_a) = 2,\ \forall a \in (\tau_0, t]\big) = \prod_{a \in (\tau_0, t]} P_{C_0}\big(M_{\hat C_a}(X_a) = 2 \,\big|\, M_{\hat C_b}(X_b) = 2,\ \forall b \in (\tau_0, a)\big) \le \big(1 - \min_x P_G(X_t = x)\big)^{t-\tau_0}. \tag{12}$$
The first inequality comes from the definition of $\{\phi_\tau\}$: if $T_1(\tau_0) = T_1(t) \ge \sqrt{t}$, the forced sampling mechanism is not active during the time interval $(\tau_0, t]$, so $\phi_a = 2$ implies $M_{\hat C_a}(X_a) = 2$ for all $a \in (\tau_0, t]$. The second inequality comes from the assumption of i.i.d. $\{X_\tau\}$, which implies that $X_a$ is independent of $\hat C_b$ and $X_b$ for all $b < a$; since at least one $x$ makes $M_{\hat C_a}(x) = 1$, each term in the product is upper bounded by $1 - \min_x P_G(X_t = x)$. It is worth noting that, by the regularity assumption on $G$, this bound is strictly less than 1. Then from (11), (12), and the union bound, we obtain
$$P_{C_0}(B_{t+1}) \le P_{C_0}(B^1_{t+1}) + P_{C_0}(B^2_{t+1}) \le 2 a^{\,t - \tau_0} = 2 a^{\,2\sqrt{t+1}-2}$$
for some $a < 1$. Hence $\sum_{t\ge 7} P_{C_0}(B_{t+1}) < \infty$. From (10), we conclude that
$$\lim_t E_{C_0} T_{\text{inf}}(t) \le 6 + \sum_{\tau \ge 6}\big(P_{C_0}(A_{\tau+1}) + P_{C_0}(B_{\tau+1})\big) < \infty,$$
which completes the proof.
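The boundedness just proved is easy to observe empirically. The snippet below is an illustrative companion only: it combines the `run` and `algorithm2_policy` sketches introduced earlier (both our own constructions), on the family $N(\theta x, 1)$ with $\mathcal{X} = \{-1, 1\}$, where $T_{\text{inf}}(t)$ for the forced-sampling-plus-myopic rule should stay essentially flat in $t$.

```python
# Empirical check: T_inf stays bounded for the Algorithm 2 sketch while a
# side-observation-blind baseline grows linearly. Assumes run() and
# algorithm2_policy() from the earlier sketches are in scope.
import numpy as np

grid = [(a, b) for a in np.linspace(0.1, 0.9, 5) for b in np.linspace(0.1, 0.9, 5)]
for T in (500, 2000, 5000):
    pol = algorithm2_policy(grid, [-1, 1])
    print(T,
          run(pol, C0=(0.3, 0.7), xs=[-1, 1], T=T),          # roughly constant
          run(lambda h, x: 1, C0=(0.3, 0.7), xs=[-1, 1], T=T))  # ~ T/2
```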
APPENDIX IV
PROOF OF THEOREM 5

Proof: The proof is inspired by [4]. Without loss of generality, we assume $M_{C_0} = 2$, which immediately implies $T_{\text{inf}}(t) = T_1(t)$. Fix a $\theta'$ with $\theta' > \theta_2$, and define $C' = (\theta', \theta_2)$. Let $\lambda(n)$ denote the log-likelihood ratio between $\theta_1$ and $\theta'$ based on the first $n$ observed rewards of arm 1. That is,
$$\lambda(n) := \sum_{m=1}^n \log\frac{F_{\theta_1}\big(dY^1_{\tau^1_m}\,\big|\,X_{\tau^1_m}\big)}{F_{\theta'}\big(dY^1_{\tau^1_m}\,\big|\,X_{\tau^1_m}\big)},$$
where $\tau^1_m$ is a random variable corresponding to the time index of the $m$-th pull of arm 1. Conditioned on the sequence $\{X_{\tau^1_m}\}$, $\lambda(n)$ is a sum of independent r.v.'s.

Let $K_{C'} := \sup_x I_x(\theta_1, \theta')$, and suppose there exists $\delta > 0$ such that
$$\limsup_n \frac{\lambda(n)}{n} > K_{C'} + \delta$$
with positive probability.

Then, with positive probability, there exists an $x_0$ such that the average over the subsequence for which $X_{\tau^1_m} = x_0$ is larger than $K_{C'} + \delta$. This, however, contradicts the strong law of large numbers, since the subsequence is i.i.d. with marginal expectation $I_{x_0}(\theta_1, \theta') \le K_{C'}$. Thus we obtain
$$\limsup_n \frac{\lambda(n)}{n} \le K_{C'}, \quad P_{C_0}\text{-a.s.} \tag{13}$$
Inequality (13) is equivalent to the statement that, with probability one, there are only finitely many $n$ such that $\lambda(n) > n(K_{C'} + \delta)$ for any fixed $\delta > 0$. Since $K_{C'} > 0$, this in turn implies that there are at most finitely many $n$ such that $\max_{m\le n} \lambda(m) > n(K_{C'} + \delta)$. As a result, we have
$$\limsup_n \frac{\max_{m\le n}\lambda(m)}{n} \le K_{C'}, \quad P_{C_0}\text{-a.s., and} \qquad \lim_{n\to\infty} P_{C_0}\big(\exists m \le n,\ \lambda(m) \ge (1+\delta)\, n K_{C'}\big) = 0. \tag{14}$$

Henceforth, we proceed by contradiction. Suppose
$$\limsup_{t\to\infty} P_{C_0}\!\left(T_1(t) < \frac{(1-\delta)\log t}{(1+2\delta)K_{C'}}\right) > 0.$$
Using $A_1$ and $A_2$ as shorthand for the events
$$A_1 := \Big\{T_1(t) < \frac{(1-\delta)\log t}{(1+2\delta)K_{C'}}\Big\} \quad \text{and} \quad A_2 := \Big\{\lambda(T_1(t)) \le \frac{1+\delta}{1+2\delta}\log t\Big\},$$
and by (14), we have
$$\limsup_{t\to\infty} P_{C_0}(A_1 \cap A_2) > 0. \tag{15}$$
The quantity $E_{C'} T_{\text{inf}}(t)$ can then be bounded as follows:
$$E_{C'} T_{\text{inf}}(t) \overset{(a)}{=} E_{C'} T_2(t) \overset{(b)}{=} E_{C'}\big(t - T_1(t)\big) \overset{(c)}{\ge} \Big(t - \frac{(1-\delta)\log t}{(1+2\delta)K_{C'}}\Big) P_{C'}(A_1) \overset{(d)}{\ge} \Big(t - \frac{(1-\delta)\log t}{(1+2\delta)K_{C'}}\Big) P_{C'}(A_1 \cap A_2) \overset{(e)}{\ge} \Big(t - \frac{(1-\delta)\log t}{(1+2\delta)K_{C'}}\Big)\, t^{-\frac{1+\delta}{1+2\delta}}\, P_{C_0}(A_1 \cap A_2) \overset{(f)}{=} \Omega\big(t^{\frac{\delta}{1+2\delta}}\big). \tag{16}$$
Equality (a) follows from $M_{C'} = 1$, and (b) from $T_1(t) + T_2(t) = t$. Steps (c) and (d) follow from elementary probability inequalities. Step (e) follows from the change-of-measure formula and the definition of $A_2$, in which $\lambda(T_1(t)) \le \frac{1+\delta}{1+2\delta}\log t$. Step (f) follows from simple arithmetic and (15). Inequality (16) contradicts the assumption that $\{\phi_\tau\}$ is uniformly good for both $C_0 = (\theta_1, \theta_2)$ and $C' = (\theta', \theta_2)$, and thus we have
$$\lim_{t\to\infty} P_{C_0}\!\left(T_1(t) \ge \frac{(1-\epsilon)\log t}{K_{C'}}\right) = 1, \quad \forall \epsilon > 0.$$
By choosing the $\theta'$ in $C' = (\theta', \theta_2)$ to be the minimizing argument of $\inf_{\theta > \theta_2} \sup_x I_x(\theta_1, \theta)$, we complete the proof of the first statement of Theorem 5. The second statement in Theorem 5 can be obtained by simply applying Markov's inequality to the first statement.

APPENDIX V
PROOF OF THEOREM 6

We prove Theorem 6 by decomposing the inferior sampling time instants into disjoint subsequences, each of which is treated in a separate lemma. For simplicity, throughout this proof we use Cond($\tau$) as shorthand for "Cond is satisfied at time $\tau$",[Footnote 6] and use $\delta$-nbd($G$) to denote the $\delta$-neighborhood of the distribution $G$ in the space of distributions on $\mathcal{X}$.

Suppose $M_{C_0} = 2$. To prove that, for the $\{\phi_\tau\}$ in Algorithm 3,
$$\limsup_t \frac{E_{C_0} T_1(t)}{\log t} \le \frac{1}{\inf_{\theta>\theta_2}\max_x I_x(\theta_1, \theta)},$$
we first note the following decomposition:
$$T_1(t) \le \sum_\tau 1_{\{\phi_\tau=1,\,\text{Cond0}(\tau)\}} + \sum_\tau 1_{\{\phi_\tau=1,\,\text{Cond1}(\tau)\}} + \sum_\tau 1_{\{\phi_\tau=1,\,\text{Cond2}(\tau)\}} + \sum_\tau 1_{\{\phi_\tau=1,\,\hat C_\tau=(\theta_\alpha,\theta_\beta),\,\theta_\alpha<\theta_\beta\ne\theta_2,\,\text{Cond3}(\tau)\}} + \sum_\tau 1_{\{\phi_\tau=1,\,\theta_1\ne\theta_\alpha>\theta_\beta,\,\text{Cond3}(\tau)\}} + \sum_\tau 1_{\{\phi_\tau=1,\,\theta_1\ne\theta_\alpha<\theta_\beta=\theta_2,\,\text{Cond3}(\tau)\}} + \sum_\tau 1_{\{\phi_\tau=1,\,\theta_1=\theta_\alpha>\theta_\beta,\,\text{Cond3}(\tau)\}} + \sum_\tau 1_{\{\phi_\tau=1,\,\hat C_\tau=C_0=(\theta_1,\theta_2),\,\text{Cond3}(\tau)\}}. \tag{17}$$
These eight terms on the right-hand side of (17) will be treated separately in Lemmas 2 through 8.

Lemma 2: Suppose $M_{C_0} = 2$, i.e., $\theta_1 < \theta_2$.[Footnote 7] Then for all $C_0 \in \Theta^2$,
$$E_{C_0}\sum_\tau 1_{\{\phi_\tau=1,\,\text{Cond0}(\tau)\}} \le E_{C_0}\sum_\tau 1_{\{\text{Cond0}(\tau)\}} < \infty.$$
Proof: Let $T^0 := \sum_\tau 1_{\{\text{Cond0}(\tau)\}}$. By the monotone convergence theorem, it suffices to prove $E_{C_0} T^0 < \infty$ for all $C_0$. Under Cond0 the two arms alternate deterministically with the parity of $t$, so a pair $(i, x)$ can remain unsampled at time $t$ only if $x$ has not appeared at any earlier time instant of the matching parity. Hence
$$P_{C_0}(T^0 \ge t) \le \sum_{x\in\mathcal{X}} P_G\big(X_\tau \ne x,\ \forall\tau < t \text{ with } \tau \equiv t \bmod 2\big) \le |\mathcal{X}|\,\big(1 - \min_x P_G(X_t = x)\big)^{\lfloor t/2\rfloor}.$$
By directly computing the expectation, we obtain $E_{C_0} T^0 < \infty$.

[Footnote 6] "At time $t$" means after observing $X_t$ but before the final decision $\phi_t$ is made; it is the moment at which the $\phi_t$-deciding algorithm is executed.
[Footnote 7] There is no need to consider the case $\theta_1 = \theta_2$, since in that case all allocation rules are optimal.

Lemma 3: Suppose $M_{C_0} = 2$, i.e., $\theta_1 < \theta_2$. Then
$$E_{C_0}\sum_\tau 1_{\{\phi_\tau=1,\,\text{Cond1}(\tau)\}} \le E_{C_0}\sum_\tau 1_{\{\text{Cond1}(\tau)\}} < \infty.$$
Proof: We define $L_X(\tau\,|\,\text{Cond1})$ as the empirical distribution of $X_{\tau'}$ at those time instants $\tau' \le \tau$ for which Cond1 is satisfied. We then have
$$\sum_\tau 1_{\{\text{Cond1}(\tau)\}} = \sum_\tau 1_{\{\text{Cond1}(\tau),\ L_X(\tau|\text{Cond1}) \in \delta\text{-nbd}(G)\}} + \sum_\tau 1_{\{\text{Cond1}(\tau),\ L_X(\tau|\text{Cond1}) \notin \delta\text{-nbd}(G)\}}. \tag{18}$$
By Sanov's theorem on finite alphabets (see [24]), each term of the second sum is exponentially upper bounded w.r.t. $\tau$, which implies that the second sum has bounded expectation. For the first sum, we have
$$\sum_\tau 1_{\{\text{Cond1}(\tau),\ L_X(\tau|\text{Cond1}) \in \delta\text{-nbd}(G)\}} \le \sum_\tau 1_{\{\exists i, x \text{ s.t. } L_i^x(\tau)\notin\epsilon\text{-nbd}(F_{\theta_i}(\cdot|x)),\ L_X(\tau|\text{Cond1})\in\delta\text{-nbd}(G)\}} \le \sum_{i,x}\sum_\tau 1_{\{L_i^x(\tau)\notin\epsilon\text{-nbd}(F_{\theta_i}(\cdot|x)),\ L_X(\tau|\text{Cond1})\in\delta\text{-nbd}(G)\}} \tag{19}$$
$$\le \sum_{i,x}\sum_{\tau'=1}^\infty 1_{\{\exists n \ge \lceil\tau'(P_G(x)-\delta)/2\rceil\ \text{s.t.}\ \rho(L_i^x(n), F_{\theta_i}(\cdot|x)) > \epsilon\}}. \tag{20}$$
The first inequality comes from extending the finite sum to the infinite sum and the definition of Cond1. The second inequality comes from the union bound. The third inequality comes from the following three steps. First, we change the summation index from the time variable $\tau$ to $\tau'$, which specifies that it is the $\tau'$-th time that the condition in (19) is satisfied (note that, by definition, $\tau' \le \tau$). Second, since $L_X(\tau|\text{Cond1}) \in \delta$-nbd($G$), there must be at least $\tau'(P_G(x) - \delta)$ time instants with $X_s = x$, $s \le \tau$, which guarantees enough access to bandit machine $x$. Finally, by the definition of Cond1 in Algorithm 3 (the alternation driven by ctr($x$)), at the $\tau'$-th time of satisfaction the sample size $n$ must be at least $\lceil\tau'(P_G(x)-\delta)/2\rceil$. In (20) we slightly abuse the notation $L_i^x(t)$ by writing $L_i^x(n)$, where $n$ represents the sample size $T_i^x(t)$ rather than the current time $t$. (Remark: this change-of-index transformation is used extensively throughout the proofs in this section.) By Sanov's theorem on $\mathbb{R}$ (Theorem 9), the probability of each term in (20) is exponentially upper bounded w.r.t. $\tau'$, which implies that the summation has bounded expectation. By (18), the proof of Lemma 3 is complete.

Lemma 4: Suppose $M_{C_0} = 2$, i.e., $\theta_1 < \theta_2$. Then $E_{C_0}\sum_\tau 1_{\{\phi_\tau=1,\,\text{Cond2}(\tau)\}} < \infty$.
Proof: By the assumption $\theta_1 < \theta_2$ and the definition of Cond2, $\phi_\tau = 1$ under Cond2 forces the wrong estimate $\theta_\alpha = \theta^{\max} \ne \theta_1$, so
$$\sum_\tau 1_{\{\phi_\tau=1,\,\text{Cond2}(\tau)\}} \le \sum_\tau 1_{\{\hat C_\tau=(\theta_\alpha,\theta_\beta),\,\theta_\alpha=\theta^{\max},\ L_X(\tau|\theta_\alpha=\theta^{\max}) \in \delta\text{-nbd}(G)\}} + \sum_\tau 1_{\{\hat C_\tau=(\theta_\alpha,\theta_\beta),\,\theta_\alpha=\theta^{\max},\ L_X(\tau|\theta_\alpha=\theta^{\max}) \notin \delta\text{-nbd}(G)\}}.$$
By Sanov's theorem on finite alphabets, each term of the second sum is exponentially upper bounded w.r.t. $\tau$, which implies that the second sum has bounded expectation. For the first sum, we have
$$\sum_\tau 1_{\{\hat C_\tau=(\theta_\alpha,\theta_\beta),\,\theta_\alpha=\theta^{\max},\ \ldots\}} \le \sum_\tau 1_{\{\exists x\ \text{s.t.}\ \rho(L_1^x(\tau), F_{\theta_1}(\cdot|x))>\epsilon,\ \ldots\}} \le \sum_x\sum_{\tau'=1}^\infty 1_{\{\exists n \ge \lceil\tau'(P_G(x)-\delta)\rceil\ \text{s.t.}\ \rho(L_1^x(n), F_{\theta_1}(\cdot|x)) > \epsilon\}}.$$
By extending the finite sum to the infinite sum we obtain the first inequality; by the definition of Cond2 in Algorithm 3 and exactly the same reasoning used in going from (19) to (20), we obtain the second. By Sanov's theorem, each term in the above sum is exponentially upper bounded w.r.t. $\tau'$; it follows that the expectation of the first sum is also finite, which completes the proof.

Lemma 5: Suppose $M_{C_0} = 2$, i.e., $\theta_1 < \theta_2$. Then
$$E_{C_0}\sum_\tau 1_{\{\phi_\tau=1,\,\hat C_\tau=(\theta_\alpha,\theta_\beta),\,\theta_\alpha<\theta_\beta\ne\theta_2,\,\text{Cond3}(\tau)\}} < \infty.$$
Proof: We have
$$\sum_\tau 1_{\{\phi_\tau=1,\,\ldots\}} \le \sum_\tau 1_{\{\hat C_\tau=(\theta_\alpha,\theta_\beta),\,\theta_\alpha<\theta_\beta\ne\theta_2,\,\text{Cond3}(\tau)\}} = \sum_{\theta,\vartheta:\,\theta<\vartheta\ne\theta_2}\sum_\tau 1_{\{\hat C_\tau=(\theta,\vartheta),\,\text{Cond3}(\tau)\}} \le \sum_{\theta,\vartheta:\,\theta<\vartheta\ne\theta_2} 2\sum_\tau 1_{\{\hat C_\tau=(\theta,\vartheta),\,\text{Cond3a}(\tau)\}} \le \sum_{\theta,\vartheta}\, 2\sum_x\sum_\tau 1_{\{X_\tau=x,\,\hat C_\tau=(\theta,\vartheta),\,\text{Cond3a}(\tau)\}} \le \sum_{\theta,\vartheta}\, 2\sum_x\sum_\tau 1_{\{\rho(L_2^x(\tau), F_{\theta_2}(\cdot|x))>\epsilon,\,\text{Cond3a}(\tau)\}} \le \sum_{\theta,\vartheta:\,\theta<\vartheta\ne\theta_2} 2\sum_x\sum_{\tau'=1}^\infty 1_{\{\exists n\ge\tau'\ \text{s.t.}\ \rho(L_2^x(n), F_{\theta_2}(\cdot|x))>\epsilon\}}.$$

The first equality follows from conditioning on the event that the exact value of the estimate $\hat C_\tau$ is some configuration pair $(\theta, \vartheta)$. The first inequality following it comes from the definition of Cond3a in Algorithm 3: for each fixed $\hat C$, twice the number of time instants with odd ctr($\hat C$) upper bounds the total number of times Cond3 is satisfied. The next step follows from conditioning on the value of $X_\tau$. The subsequent inequality follows from the condition that the second coordinate of the estimate satisfies $\vartheta \ne \theta_2$, so that well-sampledness and the choice of $\epsilon$ force $\rho(L_2^x(\tau), F_{\theta_2}(\cdot|x)) > \epsilon$, after which the finite sum is extended to the infinite sum. The final inequality follows from the definition of Cond3a and changing the time index to $\tau'$, similarly to the reasoning in (19)-(20). By Sanov's theorem, each term is exponentially upper bounded w.r.t. $\tau'$, and thus the entire sum has bounded expectation. The proof is thus complete.

Corollary 1: By the symmetry of $\{\phi_\tau\}$, we have
$$E_{C_0}\sum_\tau 1_{\{\phi_\tau=1,\,\hat C_\tau=(\theta_\alpha,\theta_\beta),\,\theta_1\ne\theta_\alpha>\theta_\beta,\,\text{Cond3}(\tau)\}} < \infty.$$

Lemma 6: Suppose $M_{C_0} = 2$, i.e., $\theta_1 < \theta_2$. Then
$$E_{C_0}\sum_\tau 1_{\{\phi_\tau=1,\,\hat C_\tau=(\theta_\alpha,\theta_\beta),\,\theta_1\ne\theta_\alpha<\theta_\beta=\theta_2,\,\text{Cond3}(\tau)\}} < \infty.$$
Proof: We have
$$\sum_\tau 1_{\{\phi_\tau=1,\,\ldots\}} = \sum_{\theta:\,\theta_1\ne\theta<\theta_2}\sum_\tau 1_{\{\phi_\tau=1,\,\hat C_\tau=(\theta,\theta_2),\,\text{Cond3}(\tau)\}} = \sum_{\theta:\,\theta_1\ne\theta<\theta_2}\sum_\tau \Big(1_{\{\phi_\tau=1,\,\hat C_\tau=(\theta,\theta_2),\,\text{Cond3b1a1}(\tau)\}} + 1_{\{\phi_\tau=1,\,\hat C_\tau=(\theta,\theta_2),\,\text{Cond3b1a2}(\tau)\}}\Big) \tag{21}$$
$$\le \sum_{\theta:\,\theta_1\ne\theta<\theta_2}\ \sum_{\theta':\,\theta'>\theta_2}\ \sum_{\tau'=1}^\infty \big(1 + (2\tau'+2)^2 - (2\tau')^2\big)\, 1_{\{\exists n\ge\tau'\ \text{s.t.}\ \rho(L_1^{x^*(\theta,\theta_2)}(n),\, F_{\theta_1}(\cdot|x^*(\theta,\theta_2)))>\epsilon\}}. \tag{22}$$
The second equality in (21) comes from the fact that, given such an estimate, the scheme samples the inferior arm only when either Cond3b1a1 or Cond3b1a2 is satisfied. For the inequality (22), we condition on the competing parameter $\theta^\dagger = \theta'$ and extend to the infinite sum, and we change the time index to $\tau'$, which specifies the $\tau'$-th satisfaction of Cond3b1a1 with arm 1 pulled; this allows us to upper bound the first sum of (21). The multiplicative factor in front of the indicator simultaneously upper bounds the second sum of (21), concerning Cond3b1a2: between the consecutive times $\tau'$ and $\tau'+1$ at which Cond3b1a1 is satisfied and arm 1 is pulled, the number of times that Cond3b1a2 is satisfied and arm 1 is pulled cannot exceed $(2\tau'+2)^2 - (2\tau')^2$, because of the counter ctr($\hat C_t, \theta$) updated in Cond3b1a. By Sanov's theorem, the expectation of the indicator in (22) is exponentially upper bounded w.r.t. $\tau'$, and the polynomial factor does not affect summability. As a result, the entire sum has bounded expectation, which completes the proof.

Lemma 7: Suppose $M_{C_0} = 2$, i.e., $\theta_1 < \theta_2$. Then
$$E_{C_0}\sum_\tau 1_{\{\phi_\tau=1,\,\hat C_\tau=(\theta_\alpha,\theta_\beta),\,\theta_1=\theta_\alpha>\theta_\beta,\,\text{Cond3}(\tau)\}} < \infty.$$
Proof: We have
$$\sum_\tau 1_{\{\phi_\tau=1,\,\ldots\}} \le \sum_{\vartheta:\,\theta_1>\vartheta}\sum_\tau 1_{\{\hat C_\tau=(\theta_1,\vartheta),\,\text{Cond3}(\tau)\}} \le \#\{\vartheta\in\Theta : \vartheta<\theta_1\} + 2\sum_{\vartheta:\,\theta_1>\vartheta}\sum_{\theta'}\sum_\tau 1_{\{\hat C_\tau=(\theta_1,\vartheta),\,\theta^\dagger=\theta',\,\text{Cond3b}(\tau)\}}$$
$$= \#\{\vartheta : \vartheta<\theta_1\} + 2\sum_{\vartheta,\theta'}\Big(\sum_\tau 1_{\{\ldots,\,\text{Cond3b}(\tau),\ L_X(\tau|\text{Cond3b})\in\delta\text{-nbd}(G)\}} + \sum_\tau 1_{\{\ldots,\,\text{Cond3b}(\tau),\ L_X(\tau|\text{Cond3b})\notin\delta\text{-nbd}(G)\}}\Big). \tag{23}$$
The first inequality follows from the odd/even alternation of ctr($\hat C$) in Algorithm 3, by which Cond3b is satisfied once for every two satisfactions of Cond3. The last equality comes from conditioning on $\theta^\dagger = \theta'$ and on $L_X(\tau|\text{Cond3b})$. By Sanov's theorem on finite alphabets, the terms of the second sum in (23) are exponentially upper bounded, and that sum thus has bounded expectation. For the first sum, we have
$$\sum_\tau 1_{\{\hat C_\tau=(\theta_1,\vartheta),\,\theta^\dagger=\theta',\,\text{Cond3b}(\tau),\ L_X(\tau|\text{Cond3b})\in\delta\text{-nbd}(G)\}} \le \frac{1}{P_G(X = x^*(\theta_1,\vartheta)) - \delta}\ \sum_\tau 1_{\{\hat C_\tau=(\theta_1,\vartheta),\,\theta^\dagger=\theta',\,\text{Cond3b1}(\tau)\}}. \tag{24}$$

This inequality follows from the fact that, once $L_X(\cdot|\text{Cond3b})$ falls into the $\delta$-nbd($G$), the total number of Cond3b time instants can be upper bounded by the number of instants at which $X_\tau = x^*(\theta_1, \vartheta)$, divided by $P_G(X = x^*(\theta_1, \vartheta)) - \delta$. To show
$$E_{C_0}\sum_\tau 1_{\{\hat C_\tau=(\theta_1,\vartheta),\,\theta^\dagger=\theta',\,\text{Cond3b1}(\tau)\}} < \infty,$$
we further decompose the summand into
$$1_{\{\hat C_\tau=(\theta_1,\vartheta),\,\theta^\dagger=\theta',\,\text{Cond3b1}(\tau)\}} = 1_{\{\hat C_\tau=(\theta_1,\vartheta),\,\theta^\dagger=\theta',\,\text{Cond3b1a}(\tau)\}} + 1_{\{\hat C_\tau=(\theta_1,\vartheta),\,\theta^\dagger=\theta',\,\text{Cond3b1b}(\tau)\}}. \tag{25}$$
For the first sum in (25), under the assumption $\theta_1 > \vartheta$ we can write
$$\sum_\tau 1_{\{\ldots,\,\text{Cond3b1a}(\tau)\}} \le \sum_\tau 1_{\{\ldots,\,\text{Cond3b1a1}(\tau)\}} + \sum_\tau 1_{\{\ldots,\,\text{Cond3b1a2}(\tau)\}} \le 1 + 2\sum_\tau 1_{\{\ldots,\,\text{Cond3b1a2}(\tau)\}} \le 1 + 2\sum_{\tau'=1}^\infty 1_{\{\exists n\ge\tau'\ \text{s.t.}\ \rho(L_2^{x^*(\theta_1,\vartheta)}(n),\, F_{\theta_2}(\cdot|x^*(\theta_1,\vartheta)))>\epsilon\}}. \tag{26}$$
The first inequality comes from conditioning on the subconditions Cond3b1a1 and Cond3b1a2 and extending to the infinite sums. Let $SQ_n$ denote the set of perfect-square integers in $\{1, \dots, n\}$; the second inequality follows from the definition of Cond3b1a1 in Algorithm 3 and the fact that, for all $n \in \mathbb{N}$, $|SQ_n|$ is no larger than $1 + |\{1, \dots, n\}\setminus SQ_n|$. The third inequality comes from the fact that, by definition, under Cond3b1a2 we have $\phi_\tau = 2$, together with changing the time index to $\tau'$, the number of satisfaction times. By Sanov's theorem on $\mathbb{R}$, the above has bounded expectation.

For the second sum of (25), with the condition $\theta_1 > \vartheta$,
$$\sum_\tau 1_{\{\ldots,\,\text{Cond3b1b}(\tau)\}} \le \sum_\tau 1_{\{\hat C_\tau=(\theta_1,\vartheta),\ \Lambda_\tau(\hat C_\tau,\theta^\dagger) > \tau(\log\tau)^2\}} \le \sum_\tau 1_{\{\hat C_\tau=(\theta_1,\vartheta),\ \Lambda_\tau(\hat C_\tau,\theta_2) > \tau(\log\tau)^2\}} \le \sum_\tau 1_{\Big\{\exists n \le T_2^{x^*(\theta_1,\vartheta)}(\tau)\ \text{s.t.}\ \prod_{m=1}^n \frac{F_\vartheta(dY^2_m|x^*(\theta_1,\vartheta))}{F_{\theta_2}(dY^2_m|x^*(\theta_1,\vartheta))} > \tau(\log\tau)^2\Big\}},$$
where $Y^2_m$ is the reward of arm 2 at the $m$-th time that $X_s = x^*(\theta_1,\vartheta)$ and $\phi_s = 2$. The first inequality follows from focusing only on the $\Lambda_\tau(\hat C_\tau, \theta^\dagger)$ condition in Cond3b1b and shifting the time index. The second inequality follows from replacing the minimizing $\theta^\dagger$ with $\theta_2$, since $\Lambda_\tau(\hat C_\tau, \theta^\dagger) \le \Lambda_\tau(\hat C_\tau, \theta_2)$. The third inequality follows from expressing $\Lambda_\tau$ via its definition and the set relationship, where $n$ ranges over the possible values of $T_2^{x^*(\theta_1,\vartheta)}(\tau)$.

We first note that $\prod_{m=1}^n \frac{F_\vartheta(dY^2_m|x^*)}{F_{\theta_2}(dY^2_m|x^*)}$ is a positive martingale with expectation 1 when considered under the true conditional distribution $F_{\theta_2}(\cdot|x^*)$. By Doob's maximal inequality, we have
$$P_{C_0}\!\left(\exists n,\ \prod_{m=1}^n \frac{F_\vartheta(dY^2_m|x^*)}{F_{\theta_2}(dY^2_m|x^*)} > \tau(\log\tau)^2\right) \le \frac{1}{\tau(\log\tau)^2},$$
and thus the expectation is bounded, i.e.,
$$E_{C_0}\sum_\tau 1_{\{\hat C_\tau=(\theta_1,\vartheta),\,\theta^\dagger=\theta',\,\text{Cond3b1b}(\tau)\}} \le \sum_\tau \frac{1}{\tau(\log\tau)^2} < \infty. \tag{27}$$
By (23), (24), (25), (26), and (27), Lemma 7 is proved.

Lemma 8: Suppose $M_{C_0} = 2$, i.e., $\theta_1 < \theta_2$. Then
$$\limsup_t \frac{1}{\log t}\, E_{C_0}\sum_{\tau\le t} 1_{\{\phi_\tau=1,\,\hat C_\tau=(\theta_\alpha,\theta_\beta)=(\theta_1,\theta_2),\,\text{Cond3}(\tau)\}} \le \frac{1}{\inf_{\theta>\theta_2}\max_x I_x(\theta_1, \theta)}.$$
Proof: By the definition of $\{\phi_\tau\}$, especially of Cond3b1a, we have
$$E_{C_0}\sum_{\tau\le t} 1_{\{\phi_\tau=1,\,\hat C_\tau=C_0,\,\text{Cond3}(\tau)\}} \le E_{C_0}\sum_{\tau\le t} 1_{\{\hat C_\tau=C_0,\,\text{Cond3b1a}(\tau)\}} \le E_{C_0}\left[\sup\Big\{n \le t : \min_{\theta>\theta_2}\prod_{m=1}^n \frac{F_{\theta_1}(dY^1_m|x^*(C_0))}{F_\theta(dY^1_m|x^*(C_0))} \le t(\log t)^2\Big\}\right] \le E_{C_0}\left[\max_{\theta>\theta_2}\ \sup\Big\{n < \infty : \sum_{m=1}^n \log\frac{F_{\theta_1}(dY^1_m|x^*(C_0))}{F_\theta(dY^1_m|x^*(C_0))} \le \log\big(t(\log t)^2\big)\Big\}\right],$$
where $Y^1_m$ denotes the reward of the $m$-th pull of arm 1 at the sub-bandit machine $X_\tau = x^*(C_0)$. The first inequality follows because, given $\hat C_\tau = C_0$, $\phi_\tau = 1$ can occur only when Cond3b1a is satisfied. The second inequality is obtained by focusing on the subcondition $\Lambda_\tau(\hat C_\tau, \theta^\dagger) \le t(\log t)^2$ in Cond3b1a, letting $n = T_1^{x^*}(t)$ be the number of time instants at which arm 1 is pulled and $X_\tau = x^*(C_0)$. The third inequality comes from extending the upper limit on $n$ from $t$ to $\infty$; the equalities come from rearranging the max and min operators and elementary implications. By applying Lemma 4.3 of [2], quoted as


Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable Lecture Notes 1 Probability and Random Variables Probability Spaces Conditional Probability and Independence Random Variables Functions of a Random Variable Generation of a Random Variable Jointly Distributed

More information

Lecture 2. We now introduce some fundamental tools in martingale theory, which are useful in controlling the fluctuation of martingales.

Lecture 2. We now introduce some fundamental tools in martingale theory, which are useful in controlling the fluctuation of martingales. Lecture 2 1 Martingales We now introduce some fundamental tools in martingale theory, which are useful in controlling the fluctuation of martingales. 1.1 Doob s inequality We have the following maximal

More information

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable

Lecture Notes 1 Probability and Random Variables. Conditional Probability and Independence. Functions of a Random Variable Lecture Notes 1 Probability and Random Variables Probability Spaces Conditional Probability and Independence Random Variables Functions of a Random Variable Generation of a Random Variable Jointly Distributed

More information

Tight Bounds for Symmetric Divergence Measures and a Refined Bound for Lossless Source Coding

Tight Bounds for Symmetric Divergence Measures and a Refined Bound for Lossless Source Coding APPEARS IN THE IEEE TRANSACTIONS ON INFORMATION THEORY, FEBRUARY 015 1 Tight Bounds for Symmetric Divergence Measures and a Refined Bound for Lossless Source Coding Igal Sason Abstract Tight bounds for

More information

Lecture 35: December The fundamental statistical distances

Lecture 35: December The fundamental statistical distances 36-705: Intermediate Statistics Fall 207 Lecturer: Siva Balakrishnan Lecture 35: December 4 Today we will discuss distances and metrics between distributions that are useful in statistics. I will be lose

More information

x log x, which is strictly convex, and use Jensen s Inequality:

x log x, which is strictly convex, and use Jensen s Inequality: 2. Information measures: mutual information 2.1 Divergence: main inequality Theorem 2.1 (Information Inequality). D(P Q) 0 ; D(P Q) = 0 iff P = Q Proof. Let ϕ(x) x log x, which is strictly convex, and

More information

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi

Real Analysis Math 131AH Rudin, Chapter #1. Dominique Abdi Real Analysis Math 3AH Rudin, Chapter # Dominique Abdi.. If r is rational (r 0) and x is irrational, prove that r + x and rx are irrational. Solution. Assume the contrary, that r+x and rx are rational.

More information

Lecture Notes 2 Random Variables. Discrete Random Variables: Probability mass function (pmf)

Lecture Notes 2 Random Variables. Discrete Random Variables: Probability mass function (pmf) Lecture Notes 2 Random Variables Definition Discrete Random Variables: Probability mass function (pmf) Continuous Random Variables: Probability density function (pdf) Mean and Variance Cumulative Distribution

More information

Alireza Shafaei. Machine Learning Reading Group The University of British Columbia Summer 2017

Alireza Shafaei. Machine Learning Reading Group The University of British Columbia Summer 2017 s s Machine Learning Reading Group The University of British Columbia Summer 2017 (OCO) Convex 1/29 Outline (OCO) Convex Stochastic Bernoulli s (OCO) Convex 2/29 At each iteration t, the player chooses

More information

Notes on Discriminant Functions and Optimal Classification

Notes on Discriminant Functions and Optimal Classification Notes on Discriminant Functions and Optimal Classification Padhraic Smyth, Department of Computer Science University of California, Irvine c 2017 1 Discriminant Functions Consider a classification problem

More information

Data Compression. Limit of Information Compression. October, Examples of codes 1

Data Compression. Limit of Information Compression. October, Examples of codes 1 Data Compression Limit of Information Compression Radu Trîmbiţaş October, 202 Outline Contents Eamples of codes 2 Kraft Inequality 4 2. Kraft Inequality............................ 4 2.2 Kraft inequality

More information

Multi armed bandit problem: some insights

Multi armed bandit problem: some insights Multi armed bandit problem: some insights July 4, 20 Introduction Multi Armed Bandit problems have been widely studied in the context of sequential analysis. The application areas include clinical trials,

More information

QUANTIZATION FOR DISTRIBUTED ESTIMATION IN LARGE SCALE SENSOR NETWORKS

QUANTIZATION FOR DISTRIBUTED ESTIMATION IN LARGE SCALE SENSOR NETWORKS QUANTIZATION FOR DISTRIBUTED ESTIMATION IN LARGE SCALE SENSOR NETWORKS Parvathinathan Venkitasubramaniam, Gökhan Mergen, Lang Tong and Ananthram Swami ABSTRACT We study the problem of quantization for

More information

Rate of Convergence of Learning in Social Networks

Rate of Convergence of Learning in Social Networks Rate of Convergence of Learning in Social Networks Ilan Lobel, Daron Acemoglu, Munther Dahleh and Asuman Ozdaglar Massachusetts Institute of Technology Cambridge, MA Abstract We study the rate of convergence

More information

A Maximally Controllable Indifference-Zone Policy

A Maximally Controllable Indifference-Zone Policy A Maimally Controllable Indifference-Zone Policy Peter I. Frazier Operations Research and Information Engineering Cornell University Ithaca, NY 14853, USA February 21, 2011 Abstract We consider the indifference-zone

More information

7 Convergence in R d and in Metric Spaces

7 Convergence in R d and in Metric Spaces STA 711: Probability & Measure Theory Robert L. Wolpert 7 Convergence in R d and in Metric Spaces A sequence of elements a n of R d converges to a limit a if and only if, for each ǫ > 0, the sequence a

More information

Lecture 4: Lower Bounds (ending); Thompson Sampling

Lecture 4: Lower Bounds (ending); Thompson Sampling CMSC 858G: Bandits, Experts and Games 09/12/16 Lecture 4: Lower Bounds (ending); Thompson Sampling Instructor: Alex Slivkins Scribed by: Guowei Sun,Cheng Jie 1 Lower bounds on regret (ending) Recap from

More information

Gi en Demand for Several Goods

Gi en Demand for Several Goods Gi en Demand for Several Goods Peter Norman Sørensen January 28, 2011 Abstract The utility maimizing consumer s demand function may simultaneously possess the Gi en property for any number of goods strictly

More information

M.S. Project Report. Efficient Failure Rate Prediction for SRAM Cells via Gibbs Sampling. Yamei Feng 12/15/2011

M.S. Project Report. Efficient Failure Rate Prediction for SRAM Cells via Gibbs Sampling. Yamei Feng 12/15/2011 .S. Project Report Efficient Failure Rate Prediction for SRA Cells via Gibbs Sampling Yamei Feng /5/ Committee embers: Prof. Xin Li Prof. Ken ai Table of Contents CHAPTER INTRODUCTION...3 CHAPTER BACKGROUND...5

More information

Bayesian Learning in Social Networks

Bayesian Learning in Social Networks Bayesian Learning in Social Networks Asu Ozdaglar Joint work with Daron Acemoglu, Munther Dahleh, Ilan Lobel Department of Electrical Engineering and Computer Science, Department of Economics, Operations

More information

STAT 7032 Probability Spring Wlodek Bryc

STAT 7032 Probability Spring Wlodek Bryc STAT 7032 Probability Spring 2018 Wlodek Bryc Created: Friday, Jan 2, 2014 Revised for Spring 2018 Printed: January 9, 2018 File: Grad-Prob-2018.TEX Department of Mathematical Sciences, University of Cincinnati,

More information

Value Iteration and Action ɛ-approximation of Optimal Policies in Discounted Markov Decision Processes

Value Iteration and Action ɛ-approximation of Optimal Policies in Discounted Markov Decision Processes Value Iteration and Action ɛ-approximation of Optimal Policies in Discounted Markov Decision Processes RAÚL MONTES-DE-OCA Departamento de Matemáticas Universidad Autónoma Metropolitana-Iztapalapa San Rafael

More information

Gaussian Random Fields

Gaussian Random Fields Gaussian Random Fields Mini-Course by Prof. Voijkan Jaksic Vincent Larochelle, Alexandre Tomberg May 9, 009 Review Defnition.. Let, F, P ) be a probability space. Random variables {X,..., X n } are called

More information

Common Knowledge and Sequential Team Problems

Common Knowledge and Sequential Team Problems Common Knowledge and Sequential Team Problems Authors: Ashutosh Nayyar and Demosthenis Teneketzis Computer Engineering Technical Report Number CENG-2018-02 Ming Hsieh Department of Electrical Engineering

More information

Class 26: review for final exam 18.05, Spring 2014

Class 26: review for final exam 18.05, Spring 2014 Probability Class 26: review for final eam 8.05, Spring 204 Counting Sets Inclusion-eclusion principle Rule of product (multiplication rule) Permutation and combinations Basics Outcome, sample space, event

More information

Randomized Algorithms

Randomized Algorithms Randomized Algorithms Prof. Tapio Elomaa tapio.elomaa@tut.fi Course Basics A new 4 credit unit course Part of Theoretical Computer Science courses at the Department of Mathematics There will be 4 hours

More information

Analysis of Thompson Sampling for the multi-armed bandit problem

Analysis of Thompson Sampling for the multi-armed bandit problem Analysis of Thompson Sampling for the multi-armed bandit problem Shipra Agrawal Microsoft Research India shipra@microsoft.com avin Goyal Microsoft Research India navingo@microsoft.com Abstract We show

More information

Introduction to Bayesian Statistics

Introduction to Bayesian Statistics Bayesian Parameter Estimation Introduction to Bayesian Statistics Harvey Thornburg Center for Computer Research in Music and Acoustics (CCRMA) Department of Music, Stanford University Stanford, California

More information

AS computer hardware technology advances, both

AS computer hardware technology advances, both 1 Best-Harmonically-Fit Periodic Task Assignment Algorithm on Multiple Periodic Resources Chunhui Guo, Student Member, IEEE, Xiayu Hua, Student Member, IEEE, Hao Wu, Student Member, IEEE, Douglas Lautner,

More information

Existence and Uniqueness

Existence and Uniqueness Chapter 3 Existence and Uniqueness An intellect which at a certain moment would know all forces that set nature in motion, and all positions of all items of which nature is composed, if this intellect

More information

Probability and Measure

Probability and Measure Probability and Measure Robert L. Wolpert Institute of Statistics and Decision Sciences Duke University, Durham, NC, USA Convergence of Random Variables 1. Convergence Concepts 1.1. Convergence of Real

More information

Introduction to Empirical Processes and Semiparametric Inference Lecture 12: Glivenko-Cantelli and Donsker Results

Introduction to Empirical Processes and Semiparametric Inference Lecture 12: Glivenko-Cantelli and Donsker Results Introduction to Empirical Processes and Semiparametric Inference Lecture 12: Glivenko-Cantelli and Donsker Results Michael R. Kosorok, Ph.D. Professor and Chair of Biostatistics Professor of Statistics

More information

Two Factor Additive Conjoint Measurement With One Solvable Component

Two Factor Additive Conjoint Measurement With One Solvable Component Two Factor Additive Conjoint Measurement With One Solvable Component Christophe Gonzales LIP6 Université Paris 6 tour 46-0, ème étage 4, place Jussieu 755 Paris Cedex 05, France e-mail: ChristopheGonzales@lip6fr

More information

Contents Ordered Fields... 2 Ordered sets and fields... 2 Construction of the Reals 1: Dedekind Cuts... 2 Metric Spaces... 3

Contents Ordered Fields... 2 Ordered sets and fields... 2 Construction of the Reals 1: Dedekind Cuts... 2 Metric Spaces... 3 Analysis Math Notes Study Guide Real Analysis Contents Ordered Fields 2 Ordered sets and fields 2 Construction of the Reals 1: Dedekind Cuts 2 Metric Spaces 3 Metric Spaces 3 Definitions 4 Separability

More information

6196 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 9, SEPTEMBER 2011

6196 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 9, SEPTEMBER 2011 6196 IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 57, NO. 9, SEPTEMBER 2011 On the Structure of Real-Time Encoding and Decoding Functions in a Multiterminal Communication System Ashutosh Nayyar, Student

More information

Topology. Xiaolong Han. Department of Mathematics, California State University, Northridge, CA 91330, USA address:

Topology. Xiaolong Han. Department of Mathematics, California State University, Northridge, CA 91330, USA  address: Topology Xiaolong Han Department of Mathematics, California State University, Northridge, CA 91330, USA E-mail address: Xiaolong.Han@csun.edu Remark. You are entitled to a reward of 1 point toward a homework

More information

2 Generating Functions

2 Generating Functions 2 Generating Functions In this part of the course, we re going to introduce algebraic methods for counting and proving combinatorial identities. This is often greatly advantageous over the method of finding

More information

6. Brownian Motion. Q(A) = P [ ω : x(, ω) A )

6. Brownian Motion. Q(A) = P [ ω : x(, ω) A ) 6. Brownian Motion. stochastic process can be thought of in one of many equivalent ways. We can begin with an underlying probability space (Ω, Σ, P) and a real valued stochastic process can be defined

More information

Proofs for Large Sample Properties of Generalized Method of Moments Estimators

Proofs for Large Sample Properties of Generalized Method of Moments Estimators Proofs for Large Sample Properties of Generalized Method of Moments Estimators Lars Peter Hansen University of Chicago March 8, 2012 1 Introduction Econometrica did not publish many of the proofs in my

More information

CSCI-6971 Lecture Notes: Probability theory

CSCI-6971 Lecture Notes: Probability theory CSCI-6971 Lecture Notes: Probability theory Kristopher R. Beevers Department of Computer Science Rensselaer Polytechnic Institute beevek@cs.rpi.edu January 31, 2006 1 Properties of probabilities Let, A,

More information

Hierarchical Knowledge Gradient for Sequential Sampling

Hierarchical Knowledge Gradient for Sequential Sampling Journal of Machine Learning Research () Submitted ; Published Hierarchical Knowledge Gradient for Sequential Sampling Martijn R.K. Mes Department of Operational Methods for Production and Logistics University

More information

Finding a Needle in a Haystack: Conditions for Reliable. Detection in the Presence of Clutter

Finding a Needle in a Haystack: Conditions for Reliable. Detection in the Presence of Clutter Finding a eedle in a Haystack: Conditions for Reliable Detection in the Presence of Clutter Bruno Jedynak and Damianos Karakos October 23, 2006 Abstract We study conditions for the detection of an -length

More information

Basics of reinforcement learning

Basics of reinforcement learning Basics of reinforcement learning Lucian Buşoniu TMLSS, 20 July 2018 Main idea of reinforcement learning (RL) Learn a sequential decision policy to optimize the cumulative performance of an unknown system

More information

Statistical Machine Learning

Statistical Machine Learning Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 2: Introduction to statistical learning theory. 1 / 22 Goals of statistical learning theory SLT aims at studying the performance of

More information

Selecting Efficient Correlated Equilibria Through Distributed Learning. Jason R. Marden

Selecting Efficient Correlated Equilibria Through Distributed Learning. Jason R. Marden 1 Selecting Efficient Correlated Equilibria Through Distributed Learning Jason R. Marden Abstract A learning rule is completely uncoupled if each player s behavior is conditioned only on his own realized

More information

Math 320-2: Midterm 2 Practice Solutions Northwestern University, Winter 2015

Math 320-2: Midterm 2 Practice Solutions Northwestern University, Winter 2015 Math 30-: Midterm Practice Solutions Northwestern University, Winter 015 1. Give an example of each of the following. No justification is needed. (a) A metric on R with respect to which R is bounded. (b)

More information

Revisiting the Exploration-Exploitation Tradeoff in Bandit Models

Revisiting the Exploration-Exploitation Tradeoff in Bandit Models Revisiting the Exploration-Exploitation Tradeoff in Bandit Models joint work with Aurélien Garivier (IMT, Toulouse) and Tor Lattimore (University of Alberta) Workshop on Optimization and Decision-Making

More information

Analysis Finite and Infinite Sets The Real Numbers The Cantor Set

Analysis Finite and Infinite Sets The Real Numbers The Cantor Set Analysis Finite and Infinite Sets Definition. An initial segment is {n N n n 0 }. Definition. A finite set can be put into one-to-one correspondence with an initial segment. The empty set is also considered

More information

Module 1. Probability

Module 1. Probability Module 1 Probability 1. Introduction In our daily life we come across many processes whose nature cannot be predicted in advance. Such processes are referred to as random processes. The only way to derive

More information

Online Convex Optimization

Online Convex Optimization Advanced Course in Machine Learning Spring 2010 Online Convex Optimization Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz A convex repeated game is a two players game that is performed

More information

An exponential family of distributions is a parametric statistical model having densities with respect to some positive measure λ of the form.

An exponential family of distributions is a parametric statistical model having densities with respect to some positive measure λ of the form. Stat 8112 Lecture Notes Asymptotics of Exponential Families Charles J. Geyer January 23, 2013 1 Exponential Families An exponential family of distributions is a parametric statistical model having densities

More information

University of Warwick, EC9A0 Maths for Economists Lecture Notes 10: Dynamic Programming

University of Warwick, EC9A0 Maths for Economists Lecture Notes 10: Dynamic Programming University of Warwick, EC9A0 Maths for Economists 1 of 63 University of Warwick, EC9A0 Maths for Economists Lecture Notes 10: Dynamic Programming Peter J. Hammond Autumn 2013, revised 2014 University of

More information

APPENDIX C: Measure Theoretic Issues

APPENDIX C: Measure Theoretic Issues APPENDIX C: Measure Theoretic Issues A general theory of stochastic dynamic programming must deal with the formidable mathematical questions that arise from the presence of uncountable probability spaces.

More information

Multidimensional partitions of unity and Gaussian terrains

Multidimensional partitions of unity and Gaussian terrains and Gaussian terrains Richard A. Bale, Jeff P. Grossman, Gary F. Margrave, and Michael P. Lamoureu ABSTRACT Partitions of unity play an important rôle as amplitude-preserving windows for nonstationary

More information

Preliminary Results on Social Learning with Partial Observations

Preliminary Results on Social Learning with Partial Observations Preliminary Results on Social Learning with Partial Observations Ilan Lobel, Daron Acemoglu, Munther Dahleh and Asuman Ozdaglar ABSTRACT We study a model of social learning with partial observations from

More information

Math 104: Homework 7 solutions

Math 104: Homework 7 solutions Math 04: Homework 7 solutions. (a) The derivative of f () = is f () = 2 which is unbounded as 0. Since f () is continuous on [0, ], it is uniformly continous on this interval by Theorem 9.2. Hence for

More information

Real Analysis. July 10, These notes are intended for use in the warm-up camp for incoming Berkeley Statistics

Real Analysis. July 10, These notes are intended for use in the warm-up camp for incoming Berkeley Statistics Real Analysis July 10, 2006 1 Introduction These notes are intended for use in the warm-up camp for incoming Berkeley Statistics graduate students. Welcome to Cal! The real analysis review presented here

More information

DR.RUPNATHJI( DR.RUPAK NATH )

DR.RUPNATHJI( DR.RUPAK NATH ) Contents 1 Sets 1 2 The Real Numbers 9 3 Sequences 29 4 Series 59 5 Functions 81 6 Power Series 105 7 The elementary functions 111 Chapter 1 Sets It is very convenient to introduce some notation and terminology

More information

Qualifying Exam in Machine Learning

Qualifying Exam in Machine Learning Qualifying Exam in Machine Learning October 20, 2009 Instructions: Answer two out of the three questions in Part 1. In addition, answer two out of three questions in two additional parts (choose two parts

More information

Unsupervised Learning with Permuted Data

Unsupervised Learning with Permuted Data Unsupervised Learning with Permuted Data Sergey Kirshner skirshne@ics.uci.edu Sridevi Parise sparise@ics.uci.edu Padhraic Smyth smyth@ics.uci.edu School of Information and Computer Science, University

More information

Steepest descent on factor graphs

Steepest descent on factor graphs Steepest descent on factor graphs Justin Dauwels, Sascha Korl, and Hans-Andrea Loeliger Abstract We show how steepest descent can be used as a tool for estimation on factor graphs. From our eposition,

More information

Measure Theory on Topological Spaces. Course: Prof. Tony Dorlas 2010 Typset: Cathal Ormond

Measure Theory on Topological Spaces. Course: Prof. Tony Dorlas 2010 Typset: Cathal Ormond Measure Theory on Topological Spaces Course: Prof. Tony Dorlas 2010 Typset: Cathal Ormond May 22, 2011 Contents 1 Introduction 2 1.1 The Riemann Integral........................................ 2 1.2 Measurable..............................................

More information

Eco517 Fall 2004 C. Sims MIDTERM EXAM

Eco517 Fall 2004 C. Sims MIDTERM EXAM Eco517 Fall 2004 C. Sims MIDTERM EXAM Answer all four questions. Each is worth 23 points. Do not devote disproportionate time to any one question unless you have answered all the others. (1) We are considering

More information

arxiv: v1 [math.oc] 24 Dec 2018

arxiv: v1 [math.oc] 24 Dec 2018 On Increasing Self-Confidence in Non-Bayesian Social Learning over Time-Varying Directed Graphs César A. Uribe and Ali Jadbabaie arxiv:82.989v [math.oc] 24 Dec 28 Abstract We study the convergence of the

More information

Some Background Material

Some Background Material Chapter 1 Some Background Material In the first chapter, we present a quick review of elementary - but important - material as a way of dipping our toes in the water. This chapter also introduces important

More information

A Generic Bound on Cycles in Two-Player Games

A Generic Bound on Cycles in Two-Player Games A Generic Bound on Cycles in Two-Player Games David S. Ahn February 006 Abstract We provide a bound on the size of simultaneous best response cycles for generic finite two-player games. The bound shows

More information

Lecture 2: CDF and EDF

Lecture 2: CDF and EDF STAT 425: Introduction to Nonparametric Statistics Winter 2018 Instructor: Yen-Chi Chen Lecture 2: CDF and EDF 2.1 CDF: Cumulative Distribution Function For a random variable X, its CDF F () contains all

More information

Learning Algorithms for Minimizing Queue Length Regret

Learning Algorithms for Minimizing Queue Length Regret Learning Algorithms for Minimizing Queue Length Regret Thomas Stahlbuhk Massachusetts Institute of Technology Cambridge, MA Brooke Shrader MIT Lincoln Laboratory Lexington, MA Eytan Modiano Massachusetts

More information

On the Complexity of Best Arm Identification with Fixed Confidence

On the Complexity of Best Arm Identification with Fixed Confidence On the Complexity of Best Arm Identification with Fixed Confidence Discrete Optimization with Noise Aurélien Garivier, Emilie Kaufmann COLT, June 23 th 2016, New York Institut de Mathématiques de Toulouse

More information

The information complexity of best-arm identification

The information complexity of best-arm identification The information complexity of best-arm identification Emilie Kaufmann, joint work with Olivier Cappé and Aurélien Garivier MAB workshop, Lancaster, January th, 206 Context: the multi-armed bandit model

More information

Avoider-Enforcer games played on edge disjoint hypergraphs

Avoider-Enforcer games played on edge disjoint hypergraphs Avoider-Enforcer games played on edge disjoint hypergraphs Asaf Ferber Michael Krivelevich Alon Naor July 8, 2013 Abstract We analyze Avoider-Enforcer games played on edge disjoint hypergraphs, providing

More information

MATH 116, LECTURES 10 & 11: Limits

MATH 116, LECTURES 10 & 11: Limits MATH 6, LECTURES 0 & : Limits Limits In application, we often deal with quantities which are close to other quantities but which cannot be defined eactly. Consider the problem of how a car s speedometer

More information

Appendix B for The Evolution of Strategic Sophistication (Intended for Online Publication)

Appendix B for The Evolution of Strategic Sophistication (Intended for Online Publication) Appendix B for The Evolution of Strategic Sophistication (Intended for Online Publication) Nikolaus Robalino and Arthur Robson Appendix B: Proof of Theorem 2 This appendix contains the proof of Theorem

More information

Guessing Games. Anthony Mendes and Kent E. Morrison

Guessing Games. Anthony Mendes and Kent E. Morrison Guessing Games Anthony Mendes and Kent E. Morrison Abstract. In a guessing game, players guess the value of a random real number selected using some probability density function. The winner may be determined

More information

Lecture 5. If we interpret the index n 0 as time, then a Markov chain simply requires that the future depends only on the present and not on the past.

Lecture 5. If we interpret the index n 0 as time, then a Markov chain simply requires that the future depends only on the present and not on the past. 1 Markov chain: definition Lecture 5 Definition 1.1 Markov chain] A sequence of random variables (X n ) n 0 taking values in a measurable state space (S, S) is called a (discrete time) Markov chain, if

More information

ebay/google short course: Problem set 2

ebay/google short course: Problem set 2 18 Jan 013 ebay/google short course: Problem set 1. (the Echange Parado) You are playing the following game against an opponent, with a referee also taking part. The referee has two envelopes (numbered

More information

A Stochastic Paradox for Reflected Brownian Motion?

A Stochastic Paradox for Reflected Brownian Motion? Proceedings of the 9th International Symposium on Mathematical Theory of Networks and Systems MTNS 2 9 July, 2 udapest, Hungary A Stochastic Parado for Reflected rownian Motion? Erik I. Verriest Abstract

More information