arxiv: v1 [cs.lg] 7 Sep 2018


Analysis of Thompson Sampling for Combinatorial Multi-armed Bandit with Probabilistically Triggered Arms

Alihan Hüyük (Bilkent University), Cem Tekin (Bilkent University)

Copyright 2018 by the authors.

Abstract

We analyze the regret of combinatorial Thompson sampling (CTS) for the combinatorial multi-armed bandit with probabilistically triggered arms under the semi-bandit feedback setting. We assume that the learner has access to an exact optimization oracle but does not know the expected base arm outcomes beforehand. When the expected reward function is Lipschitz continuous in the expected base arm outcomes, we derive an $O(\sum_{i=1}^{m} \log T / (p_i \Delta_i))$ regret bound for CTS, where $m$ denotes the number of base arms, $p_i$ denotes the minimum non-zero triggering probability of base arm $i$, and $\Delta_i$ denotes the minimum suboptimality gap of base arm $i$. We also show via numerical experiments that CTS outperforms combinatorial upper confidence bound (CUCB).

1 Introduction

The multi-armed bandit (MAB) problem is the prime example of the tradeoff between exploration and exploitation faced in many reinforcement learning problems [1]. In the classical MAB, at each round the learner selects an arm, which yields a random reward drawn from an unknown distribution. The goal of the learner is to maximize its expected cumulative reward over all rounds by learning to select arms that yield high rewards. The learner's performance is measured by its regret with respect to an oracle policy that always selects the arm with the highest expected reward. It is shown that when the arms' rewards are independent, any uniformly good policy incurs regret that is at least logarithmic in time [2]. Several classes of policies have been proposed for the learner to minimize its regret. These include Thompson sampling [3, 4, 5] and upper confidence bound (UCB) policies [2, 6, 7], which are shown to achieve regret that is logarithmic in time, and hence are order optimal.
Combinatorial multi-armed bandit (CMAB) [8, 9, 10] is an extension of MAB where the learner selects a super arm at each round, which is defined to be a subset of the base arms. Then, the learner observes and collects the reward associated with the selected super arm, and also observes the outcomes of the base arms that are in the selected super arm. This type of feedback is called semi-bandit feedback. For the special case when the expected reward of a super arm is a linear combination of the expected outcomes of the base arms in that super arm, it is shown in [9] that a combinatorial version of UCB in [7] achieves $O(Km \log T / \Delta)$ gap-dependent and $O(\sqrt{KmT \log T})$ gap-free regrets, where $m$ is the number of base arms, $K$ is the maximum number of base arms in a super arm, and $\Delta$ is the gap between the expected reward of the optimal super arm and the second best super arm.

Later on, this setting is generalized to allow the expected reward of the super arm to be a more general function of the expected outcomes of the base arms that obeys certain monotonicity and bounded smoothness conditions [10]. The main challenge in the general case is that the optimization problem itself is NP-hard, but an approximately optimal solution can usually be computed efficiently for many special cases [11]. Therefore, it is assumed that the learner has access to an approximation oracle, which, when given the expected outcomes of the base arms, outputs a super arm whose expected reward is at least an $\alpha$ fraction of the optimal reward with probability at least $\beta$. Thus, the regret is measured with respect to the $\alpha\beta$ fraction of the optimal reward, and it is proven that a combinatorial variant of UCB, called CUCB, achieves $O(\sum_{i=1}^{m} \log T / \Delta_i)$ regret when the bounded smoothness function is $f(x) = \gamma x$ for some $\gamma > 0$, where $\Delta_i$ is the minimum gap between the expected reward of the optimal super arm and the expected reward of any suboptimal super arm that contains base arm $i$.

Recently, it is shown in [12] that Thompson sampling can achieve $O(\sum_{i=1}^{m} \log T / \Delta_i)$ regret for the general CMAB under a Lipschitz continuity assumption on the expected reward, given that the learner has access to an exact computation oracle, which outputs an optimal super arm when given the set of expected base arm outcomes. Moreover, it is also shown that the learner cannot guarantee sublinear regret when it only has access to an approximation oracle.

An interesting extension of CMAB is CMAB with probabilistically triggered arms (PTAs) [13], where the selected super arm probabilistically triggers a set of base arms, and the reward obtained in a round is a function of the set of probabilistically triggered base arms and their expected outcomes. For this problem, it is shown in [13] that logarithmic regret is achievable when the expected reward function has the $\ell_\infty$ bounded smoothness property. However, this bound depends on $1/p^*$, where $p^*$ is the minimum non-zero triggering probability. Later, it is shown in [14] that under a stricter smoothness assumption on the expected reward function, called triggering probability modulated bounded smoothness, it is possible to achieve regret that does not depend on $1/p^*$. It is also shown in [14] that the dependence on $1/p^*$ is unavoidable in the general case. In another work [15], CMAB with PTAs is considered for the case when the arm triggering probabilities are all positive, and it is shown that both CUCB and CTS achieve bounded regret.

Apart from the works mentioned above, numerous other works tackle related online learning problems. For instance, [16] considers matroid MAB, which is a special case of CMAB where the super arms are given as independent sets of a matroid with base arms being the elements of the ground set, and the expected reward of a super arm is the sum of the expected outcomes of the base arms in the super arm.
In addition, Thompson sampling is also analyzed for a parametric CMAB model given a prior with finite support in [17], and for a contextual CMAB model with a Bayesian regret metric in [18]. Unlike these works, we adopt the models in [12] and [13], work in a setting where there is an unknown but fixed parameter (expected outcome) vector, and analyze the expected regret.

To sum up, in this work we analyze the expected regret of CTS when the learner has access to an exact computation oracle, and prove that it achieves $O(\sum_{i=1}^{m} \log T / (p_i \Delta_i))$ regret. Comparing this with the regret lower bound for CMAB with PTAs given in Theorem 3 in [14], we also observe that our regret bound is tight.

The rest of the paper is organized as follows. The problem formulation is given in Section 2. The CTS algorithm is described in Section 3. The regret analysis of CTS is given in Section 4. Numerical results are given in Section 5, and concluding remarks are given in Section 6. Proofs of the lemmas that are used in the regret analysis are given in the supplemental document.

2 Problem Formulation

Combinatorial multi-armed bandit (CMAB) is a decision making problem where the learner interacts with its environment through $m$ base arms, indexed by the set $[m] := \{1, 2, \ldots, m\}$, sequentially over rounds indexed by $t \in [T]$. In this paper, we consider the general CMAB model introduced in [13] and borrow the notation from [12]. In this model, the following events take place in order in each round $t$:

- The learner selects a subset of base arms, denoted by $S(t)$, which is called a super arm.
- $S(t)$ causes some other base arms to probabilistically trigger based on a stochastic triggering process, which results in a set of triggered base arms $S'(t)$ that contains $S(t)$.
- The learner obtains a reward that depends on $S'(t)$ and observes the outcomes of the arms in $S'(t)$.

Next, we describe in detail the super arms, the triggering process, the reward and the outcomes.
First of all, the learner is allowed to select $S(t)$ from a subset of $2^{[m]}$ denoted by $\mathcal{I}$, which corresponds to the set of feasible super arms. Once $S(t)$ is selected, all base arms $i \in S(t)$ are immediately triggered. These arms can trigger other base arms that are not in $S(t)$, and those arms can further trigger other base arms, and so on. Hence, the triggering of the base arms may depend on each other. At the end, a random superset $S'(t)$ of $S(t)$ is formed that consists of all base arms triggered as a result of selecting $S(t)$.

The triggering process can be described by a set of triggering probabilities. For each $i \in [m]$ and $S \in \mathcal{I}$, $p_i^S := \Pr[i \in S'(t) \mid S(t) = S]$ denotes the probability that base arm $i$ is triggered when super arm $S$ is selected. Let $\tilde{S} := \{i \in [m] : p_i^S > 0\}$ be the set of all base arms that could potentially be triggered by super arm $S$, which is called the triggering set of $S$. We have that $S(t) \subseteq S'(t) \subseteq \tilde{S}(t) \subseteq [m]$. We define $p_i := \min_{S \in \mathcal{I} : i \in \tilde{S}} p_i^S$ as the minimum nonzero triggering probability of base arm $i$, and $p^* := \min_{i \in [m]} p_i$ as the minimum nonzero triggering probability.

Next, we describe the base arm outcomes. After $S(t)$ is selected, the environment draws a random outcome vector $X(t) := (X_1(t), X_2(t), \ldots, X_m(t))$ from a fixed probability distribution $D$ on $[0, 1]^m$, independent of the previous rounds. Here $X_i(t)$ represents the outcome of base arm $i$. We define the mean outcome (parameter) vector as $\mu := (\mu_1, \mu_2, \ldots, \mu_m)$, where $\mu_i := \mathbb{E}_{X(t) \sim D}[X_i(t)]$, and use $\mu_S$ to denote the projection of $\mu$ on $S$ for $S \subseteq [m]$. Since CTS computes a posterior over $\mu$, the following assumption is made to have an efficient and simple update of the posterior distribution.

Assumption 1. The outcomes of all base arms are mutually independent, i.e., $D = D_1 \times D_2 \times \cdots \times D_m$.

Note that this independence assumption holds in many applications, including the influence maximization problem with the independent cascade influence propagation model [19].

At the end of round $t$, the learner receives a reward that depends on the set of triggered arms $S'(t)$ and the outcome vector $X(t)$, which is denoted by $R(S'(t), X(t))$. For simplicity of notation, we also use $R(t) := R(S'(t), X(t))$ to denote the reward in round $t$. Note that whether a base arm is in the selected super arm or is triggered afterwards is not relevant in terms of the reward. We also make two other assumptions about the reward function $R$, which are standard in the CMAB literature [12, 13].

Assumption 2. The expected reward of super arm $S \in \mathcal{I}$ only depends on $S$ and the mean outcome vector $\mu$, i.e., there exists a function $r$ such that $\mathbb{E}[R(t)] = \mathbb{E}_{S'(t),\, X(t) \sim D}[R(S'(t), X(t))] = r(S(t), \mu)$.

Assumption 3 (Lipschitz continuity). There exists a constant $B > 0$ such that for every super arm $S$ and every pair of mean outcome vectors $\mu$ and $\mu'$, $|r(S, \mu) - r(S, \mu')| \le B \|\mu_{\tilde{S}} - \mu'_{\tilde{S}}\|_1$, where $\|\cdot\|_1$ denotes the $\ell_1$ norm.

We consider the semi-bandit feedback model, where at the end of round $t$, the learner observes the individual outcomes of the triggered arms, denoted by $Q(S'(t), X(t)) := \{(i, X_i(t)) : i \in S'(t)\}$. Again, for simplicity of notation, we also use $Q(t) = Q(S'(t), X(t))$ to denote the observation at the end of round $t$. Based on this, the only information available to the learner when choosing the super arm to select in round $t + 1$ is its observation history, given as $\mathcal{F}_t := \{(S(\tau), Q(\tau)) : \tau \in [t]\}$.
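The definitions of the triggering set, $p_i$ and $p^*$ above can be made concrete with a small sketch. The table `trig` below is a hypothetical toy instance that is not taken from the paper, and the helper names are illustrative only.

```python
from typing import Dict, FrozenSet, Set

# Hypothetical toy instance (an assumption for illustration): for each feasible
# super arm S, trig[S][i] is the triggering probability p_i^S. Base arms that
# are missing from the inner dictionary have triggering probability 0.
TrigTable = Dict[FrozenSet[int], Dict[int, float]]


def triggering_set(probs: Dict[int, float]) -> Set[int]:
    """Triggering set of a super arm: all base arms i with p_i^S > 0."""
    return {i for i, p in probs.items() if p > 0.0}


def min_trigger_probs(trig: TrigTable) -> Dict[int, float]:
    """p_i: minimum nonzero triggering probability of each base arm i."""
    p_min: Dict[int, float] = {}
    for probs in trig.values():
        for i in triggering_set(probs):
            p_min[i] = min(p_min.get(i, 1.0), probs[i])
    return p_min


trig: TrigTable = {
    frozenset({0}): {0: 1.0, 1: 0.5},
    frozenset({1}): {1: 1.0, 2: 0.25},
    frozenset({0, 1}): {0: 1.0, 1: 1.0, 2: 0.4},
}
p = min_trigger_probs(trig)
p_star = min(p.values())  # p^*: minimum over all base arms
```

In this toy instance base arm 2 is never selected directly and is only reached through triggering, so it is the arm that determines $p^*$.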
In order to evaluate the performance of the learner, we define the set of optimal super arms given an $m$-dimensional parameter vector $\theta$ as $\mathrm{OPT}(\theta) := \arg\max_{S \in \mathcal{I}} r(S, \theta)$. We use $\mathrm{OPT} := \mathrm{OPT}(\mu)$ to denote the set of optimal super arms given the true mean outcome vector $\mu$. Based on this, we let $S^*$ represent a specific super arm in $\arg\min_{S \in \mathrm{OPT}} |\tilde{S}|$, which is the set of super arms that have triggering sets with minimum cardinality among all optimal super arms. We also let $k^* := |S^*|$ and $\tilde{k}^* := |\tilde{S}^*|$.

Next, we define the suboptimality gap due to selecting super arm $S \in \mathcal{I}$ as $\Delta_S := r(S^*, \mu) - r(S, \mu)$, the maximum suboptimality gap as $\Delta_{\max} := \max_{S \in \mathcal{I}} \Delta_S$, and the minimum suboptimality gap of base arm $i$ as $\Delta_i := \min_{S \in \mathcal{I} - \mathrm{OPT} : i \in \tilde{S}} \Delta_S$. If there is no such super arm $S$, let $\Delta_i = \infty$. The goal of the learner is to minimize the expected regret over the time horizon $T$, given by

$$\mathrm{Reg}(T) := \mathbb{E}\left[\sum_{t=1}^{T} \big(r(S^*, \mu) - r(S(t), \mu)\big)\right] = \mathbb{E}\left[\sum_{t=1}^{T} \Delta_{S(t)}\right]. \qquad (1)$$

3 The Learning Algorithm

Algorithm 1 Combinatorial Thompson Sampling (CTS)
For each base arm $i$, let $a_i = 1$, $b_i = 1$
for $t = 1, 2, \ldots$ do
    For each base arm $i$, draw a sample $\theta_i(t)$ from the Beta distribution $\beta(a_i, b_i)$; let $\theta(t) := (\theta_1(t), \ldots, \theta_m(t))$
    Select super arm $S(t) = \mathrm{Oracle}(\theta(t))$, get the observation $Q(t)$
    for all $(i, X_i) \in Q(t)$ do
        $Y_i \leftarrow 1$ with probability $X_i$, $0$ with probability $1 - X_i$
        $a_i \leftarrow a_i + Y_i$
        $b_i \leftarrow b_i + (1 - Y_i)$
    end for
end for

We consider the CTS algorithm for CMAB with PTAs [12, 15] (pseudocode given in Algorithm 1). We assume that the learner has access to an exact computation oracle, which takes as input a parameter vector $\theta$ and outputs a super arm, denoted by $\mathrm{Oracle}(\theta)$, such that $\mathrm{Oracle}(\theta) \in \mathrm{OPT}(\theta)$. CTS keeps a Beta posterior over the mean outcome of each base arm. At the beginning of round $t$, for each base arm $i$ it draws a sample $\theta_i(t)$ from its posterior distribution. Then, it forms the parameter vector in round $t$ as $\theta(t) := (\theta_1(t), \ldots, \theta_m(t))$, gives it to the exact computation oracle, and selects the super arm $S(t) = \mathrm{Oracle}(\theta(t))$. At the end of the round, CTS updates the posterior distributions of the triggered base arms using the observation $Q(t)$.

4 Regret Analysis

4.1 Main Theorem

The regret bound for CTS is given in the following theorem.

Theorem 1. Under Assumptions 1, 2 and 3, for all $D$, the regret of CTS by round $T$ is bounded as follows:

$$\mathrm{Reg}(T) \le \sum_{i=1}^{m} \max_{S \in \mathcal{I} - \mathrm{OPT} : i \in \tilde{S}} \frac{16 B^2 |\tilde{S}| \log T}{(1-\rho)\, p_i \left(\Delta_S - 2B(\tilde{k}^{*2} + 2)\varepsilon\right)} + \left(3 + \frac{\tilde{K}_{\max}^2}{(1-\rho)\, p^* \varepsilon^2} + \frac{2\,\mathbb{I}\{p^* < 1\}}{\rho^2 p^*}\right) m \Delta_{\max} + \alpha \cdot \frac{8 \tilde{k}^*}{p^* \varepsilon^2} \left(\frac{4}{\varepsilon^2} + 1\right)^{\tilde{k}^*} \log\frac{\tilde{k}^*}{\varepsilon^2}\, \Delta_{\max}$$

for all $\rho \in (0, 1)$, and for all $\varepsilon$ such that $\forall S \in \mathcal{I} - \mathrm{OPT}$, $\Delta_S > 2B(\tilde{k}^{*2} + 2)\varepsilon$, where $B$ is the Lipschitz constant in Assumption 3, $\alpha > 0$ is a problem independent constant that is also independent of $T$, and $\tilde{K}_{\max} = \max_{S \in \mathcal{I}} |\tilde{S}|$ is the maximum triggering set size among all super arms.

We compare the result of Theorem 1 with [13], which shows that the regret of CUCB is $O(\sum_{i \in [m]} \log T / (p_i \Delta_i))$, given an $\ell_\infty$ bounded smoothness condition on the expected reward function, when the bounded smoothness function is $f(x) = \gamma x$. When $\varepsilon$ is sufficiently small, the regret bound in Theorem 1 is asymptotically equivalent to the regret bound for CUCB in terms of the dependence on $T$, $p_i$ and $\Delta_i$, $i \in [m]$. For the case with $p^* = 1$ (no probabilistic triggering), the regret bound in Theorem 1 matches the regret bound in Theorem 1 in [12] in terms of the dependence on $T$ and $\Delta_i$, $i \in [m]$. As a final remark, we note that Theorem 3 in [14] shows that the $1/p_i$ term in the regret bound is unavoidable in the general case.

4.2 Preliminaries for the Proof

The complement of set $S$ is denoted by $\neg S$ or $S^c$. The indicator function is given as $\mathbb{I}\{\cdot\}$. $M_i(t) := \sum_{\tau=1}^{t-1} \mathbb{I}\{i \in \tilde{S}(\tau)\}$ denotes the number of times base arm $i$ is in the triggering set of the selected super arm, i.e., the number of times it is tried to be triggered, $N_i(t) := \sum_{\tau=1}^{t-1} \mathbb{I}\{i \in S'(\tau)\}$ denotes the number of times base arm $i$ is triggered, and $\hat{\mu}_i(t) := \frac{1}{N_i(t)} \sum_{\tau : \tau < t,\, i \in S'(\tau)} Y_i(\tau)$ denotes the empirical mean outcome of base arm $i$ until round $t$, where $Y_i(t)$ is the Bernoulli random variable with mean $X_i(t)$ that is used for updating the posterior distribution that corresponds to base arm $i$ in CTS.
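Algorithm 1 and the counters defined above can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: the top-$K$ oracle (exact only under the assumption of a linear reward with a cardinality constraint) and the Bernoulli environment without probabilistic triggering are hypothetical stand-ins, and all function names are illustrative.

```python
import random
from typing import Callable, Dict, List, Tuple

Oracle = Callable[[List[float]], List[int]]      # theta -> selected super arm
Play = Callable[[List[int]], Dict[int, float]]   # super arm -> observation Q(t)


def cts(m: int, oracle: Oracle, play: Play, horizon: int,
        rng: random.Random) -> Tuple[List[int], List[int]]:
    """Run CTS for `horizon` rounds and return the Beta parameters (a, b)."""
    a = [1] * m
    b = [1] * m
    for _ in range(horizon):
        # Draw one posterior sample per base arm.
        theta = [rng.betavariate(a[i], b[i]) for i in range(m)]
        super_arm = oracle(theta)
        observation = play(super_arm)  # outcomes of the triggered arms only
        for i, x in observation.items():
            # Convert the outcome in [0, 1] to a Bernoulli variable with mean x.
            y = 1 if rng.random() < x else 0
            a[i] += y
            b[i] += 1 - y
    return a, b


def top_k_oracle(k: int) -> Oracle:
    """Assumed oracle: exact when r(S, theta) is the sum of theta_i over |S| = k."""
    def oracle(theta: List[float]) -> List[int]:
        order = sorted(range(len(theta)), key=lambda i: theta[i], reverse=True)
        return order[:k]
    return oracle


def bernoulli_env(mu: List[float], rng: random.Random) -> Play:
    """Assumed environment: every selected arm is triggered (p^* = 1)."""
    def play(super_arm: List[int]) -> Dict[int, float]:
        return {i: float(rng.random() < mu[i]) for i in super_arm}
    return play


rng = random.Random(0)
mu = [0.9, 0.8, 0.2, 0.1]
a, b = cts(len(mu), top_k_oracle(2), bernoulli_env(mu, rng), horizon=500, rng=rng)
```

After the run, $a_i + b_i - 2$ equals the number of observations of base arm $i$, which plays the role of $N_i(t)$ in the analysis.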
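We define

$$\ell(S) := \frac{2 \log T}{\left(\dfrac{\Delta_S}{2B|\tilde{S}|} - \dfrac{(\tilde{k}^{*2} + 2)\,\varepsilon}{|\tilde{S}|}\right)^2}$$

as the sampling threshold of super arm $S$, $L_i(S) := \dfrac{\ell(S)}{(1-\rho)\, p_i}$ as the trial threshold of base arm $i$ with respect to super arm $S$, and $L_i^{\max} := \max_{S \in \mathcal{I} - \mathrm{OPT} : i \in \tilde{S}} L_i(S)$.

Consider an $m$-dimensional parameter vector $\theta$. Similar to [12], given $Z \subseteq \tilde{S}^*$, we say that the first bad event for $Z$, denoted by $\mathcal{E}_{Z,1}(\theta)$, holds when all $\theta' = (\theta'_Z, \theta_{Z^c})$ such that $\|\theta'_Z - \mu_Z\|_\infty \le \varepsilon$ satisfy the following properties:

- $Z \subseteq \widetilde{\mathrm{Oracle}}(\theta')$, where $\widetilde{\mathrm{Oracle}}(\theta')$ denotes the triggering set of $\mathrm{Oracle}(\theta')$.
- Either $\mathrm{Oracle}(\theta') \in \mathrm{OPT}$ or $\|\theta'_{\widetilde{\mathrm{Oracle}}(\theta')} - \mu_{\widetilde{\mathrm{Oracle}}(\theta')}\|_1 > \dfrac{\Delta_{\mathrm{Oracle}(\theta')}}{B} - (\tilde{k}^{*2} + 1)\varepsilon$.

Given the same parameter vector $\theta$, the second bad event for $Z$ is defined as $\mathcal{E}_{Z,2}(\theta) := \{\|\theta_Z - \mu_Z\|_\infty > \varepsilon\}$. In addition, similar to the regret analysis in [12], we will make use of the following events when bounding the regret:

$$A(t) := \{S(t) \in \mathrm{OPT}\} \qquad (2)$$
$$B_{i,1}(t) := \left\{|\hat{\mu}_i(t) - \mu_i| > \frac{\varepsilon}{|\tilde{S}(t)|}\right\}, \qquad B_{i,2}(t) := \{N_i(t) \le (1-\rho)\, p_i M_i(t)\}$$
$$B(t) := \{\exists i \in \tilde{S}(t) : B_{i,1}(t) \vee B_{i,2}(t)\} \qquad (3)$$
$$C(t) := \left\{\exists i \in \tilde{S}(t) : |\theta_i(t) - \hat{\mu}_i(t)| > \sqrt{\frac{2 \log T}{N_i(t)}}\right\} \qquad (4)$$
$$D(t) := \left\{\|\theta_{\tilde{S}(t)}(t) - \mu_{\tilde{S}(t)}\|_1 > \frac{\Delta_{S(t)}}{B} - (\tilde{k}^{*2} + 1)\varepsilon\right\} \qquad (5)$$

4.3 Regret Decomposition

Using the definitions of the events given in (2)-(5), the regret can be upper bounded as follows:

$$\mathrm{Reg}(T) = \sum_{t=1}^{T} \mathbb{E}[\mathbb{I}\{\neg A(t)\}\, \Delta_{S(t)}]$$
$$\le \sum_{t=1}^{T} \mathbb{E}[\mathbb{I}\{B(t) \wedge \neg A(t)\}\, \Delta_{S(t)}] \qquad (6)$$
$$+ \sum_{t=1}^{T} \mathbb{E}[\mathbb{I}\{C(t) \wedge \neg A(t)\}\, \Delta_{S(t)}] \qquad (7)$$
$$+ \sum_{t=1}^{T} \mathbb{E}[\mathbb{I}\{\neg B(t) \wedge \neg C(t) \wedge D(t) \wedge \neg A(t)\}\, \Delta_{S(t)}] \qquad (8)$$
$$+ \sum_{t=1}^{T} \mathbb{E}[\mathbb{I}\{\neg D(t) \wedge \neg A(t)\}\, \Delta_{S(t)}]. \qquad (9)$$

The regret bound in Theorem 1 is obtained by bounding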
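each term in the above decomposition. For this, we will make use of the facts and lemmas that are introduced in the following section.

4.4 Facts and Lemmas

Fact 1 (Multiplicative Chernoff Bound, [21] and [13]). Let $X_1, \ldots, X_n$ be Bernoulli random variables taking values in $\{0, 1\}$ such that $\mathbb{E}[X_t \mid X_1, \ldots, X_{t-1}] \ge \mu$ for all $t \le n$, and $Y = X_1 + \cdots + X_n$. Then, for all $\delta \in (0, 1)$,

$$\Pr[Y \le (1 - \delta) n \mu] \le e^{-\delta^2 n \mu / 2}.$$

Fact 2 (Lemma 4 in [12]). When CTS is run, the following holds for all base arms $i \in [m]$:

$$\Pr\left[\theta_i(t) - \hat{\mu}_i(t) > \sqrt{\frac{2 \log T}{N_i(t)}}\right] \le \frac{1}{T}, \qquad \Pr\left[\hat{\mu}_i(t) - \theta_i(t) > \sqrt{\frac{2 \log T}{N_i(t)}}\right] \le \frac{1}{T}.$$

Fact 3 (Results from Lemma 7 in [12]). Given $Z \subseteq \tilde{S}^*$, let $\tau_j$ be the round at which $\mathcal{E}_{Z,1}(\theta(t)) \wedge \neg\mathcal{E}_{Z,2}(\theta(t))$ occurs for the $j$th time, and let $\tau_0 = 0$. If $\forall i \in Z$, $N_i(\tau_j + 1) \ge q$, then

$$\mathbb{E}\left[\sum_{t=\tau_j+1}^{\tau_{j+1}} \mathbb{I}\{\mathcal{E}_{Z,1}(\theta(t)), \mathcal{E}_{Z,2}(\theta(t))\}\right] \le \prod_{i \in Z} B_q - 1 \qquad (10)$$

where $B_q$ is given as

$$B_q = \begin{cases} 1 + 6\alpha_1 \dfrac{1}{\varepsilon^2} e^{-\frac{\varepsilon^2}{2} q} + \dfrac{2}{e^{\frac{\varepsilon^2}{8} q} - 2} & \text{if } q > \dfrac{8}{\varepsilon^2} \\[2ex] \dfrac{4}{\varepsilon^2} & \text{otherwise} \end{cases}$$

and $\alpha_1$ is a problem independent constant. Moreover,

$$\sum_{q=0}^{\infty} \left(\prod_{i \in Z} B_q - 1\right) \le 13 \alpha_2 \cdot \frac{2^{2|Z|+3} \log\frac{|Z|}{\varepsilon^2}}{\varepsilon^{2|Z|+2}} \qquad (11)$$

where $\alpha_2$ is a problem independent constant.

Using these facts, we prove the following three lemmas (proofs can be found in the supplemental document).

Lemma 1. When CTS is run, we have

$$\mathbb{E}\big[|\{t : 1 \le t \le T,\ i \in \tilde{S}(t),\ |\hat{\mu}_i(t) - \mu_i| > \varepsilon \vee B_{i,2}(t)\}|\big] \le 1 + \frac{1}{(1-\rho)\, p^* \varepsilon^2} + \frac{2\,\mathbb{I}\{p^* < 1\}}{\rho^2 p^*}$$

for all $i \in [m]$ and $\rho \in (0, 1)$.

Lemma 2. Suppose that $\neg D(t) \wedge \neg A(t)$ happens. Then, there exists $Z \subseteq \tilde{S}^*$ such that $Z \ne \emptyset$ and $\mathcal{E}_{Z,1}(\theta(t))$ holds.

Lemma 3. When CTS is run, for all $Z \subseteq \tilde{S}^*$ such that $Z \ne \emptyset$, we have

$$\sum_{t=1}^{T} \mathbb{E}[\mathbb{I}\{\mathcal{E}_{Z,1}(\theta(t)), \mathcal{E}_{Z,2}(\theta(t))\}] \le 13 \alpha_2 \cdot \frac{|Z|}{p^*} \cdot \frac{2^{2|Z|+3} \log\frac{|Z|}{\varepsilon^2}}{\varepsilon^{2|Z|+2}}.$$

4.5 Main Proof of Theorem 1

4.5.1 Bounding (6)

Using Lemma 1, we have

$$\sum_{t=1}^{T} \mathbb{E}[\mathbb{I}\{B(t) \wedge \neg A(t)\}\, \Delta_{S(t)}] \le \Delta_{\max} \sum_{i=1}^{m} \mathbb{E}\left[\left|\left\{t : 1 \le t \le T,\ i \in \tilde{S}(t),\ |\hat{\mu}_i(t) - \mu_i| > \frac{\varepsilon}{\tilde{K}_{\max}} \vee B_{i,2}(t)\right\}\right|\right] \le \left(1 + \frac{\tilde{K}_{\max}^2}{(1-\rho)\, p^* \varepsilon^2} + \frac{2\,\mathbb{I}\{p^* < 1\}}{\rho^2 p^*}\right) m \Delta_{\max}$$

where $\tilde{K}_{\max} = \max_{S \in \mathcal{I}} |\tilde{S}|$.

4.5.2 Bounding (7)

By Fact 2, we have

$$\sum_{t=1}^{T} \mathbb{E}[\mathbb{I}\{C(t) \wedge \neg A(t)\}\, \Delta_{S(t)}] \le \Delta_{\max} \sum_{i=1}^{m} \sum_{t=1}^{T} \Pr\left[|\theta_i(t) - \hat{\mu}_i(t)| > \sqrt{\frac{2 \log T}{N_i(t)}}\right] \le 2 m \Delta_{\max}.$$

4.5.3 Bounding (8)

For this, we first show that event $\neg B(t) \wedge \neg C(t) \wedge D(t) \wedge \neg A(t)$ cannot happen when $M_i(t) > L_i(S(t))$, $\forall i \in \tilde{S}(t)$. To see this, assume that both $\neg B(t) \wedge \neg C(t) \wedge \neg A(t)$ and $M_i(t) > L_i(S(t))$, $\forall i \in \tilde{S}(t)$ hold. Then, we must have

$$\|\theta_{\tilde{S}(t)}(t) - \hat{\mu}_{\tilde{S}(t)}(t)\|_1 = \sum_{i \in \tilde{S}(t)} |\theta_i(t) - \hat{\mu}_i(t)|$$
$$\le \sum_{i \in \tilde{S}(t)} \sqrt{\frac{2 \log T}{N_i(t)}} \qquad (12)$$
$$\le \sum_{i \in \tilde{S}(t)} \sqrt{\frac{2 \log T}{(1-\rho)\, p_i M_i(t)}} \qquad (13)$$
$$\le \sum_{i \in \tilde{S}(t)} \sqrt{\frac{2 \log T}{(1-\rho)\, p_i L_i(S(t))}}$$
$$= \sum_{i \in \tilde{S}(t)} \sqrt{\frac{2 \log T}{\ell(S(t))}} \qquad (14)$$
$$= \sum_{i \in \tilde{S}(t)} \left(\frac{\Delta_{S(t)}}{2B|\tilde{S}(t)|} - \frac{(\tilde{k}^{*2} + 2)\,\varepsilon}{|\tilde{S}(t)|}\right) = \frac{\Delta_{S(t)}}{2B} - (\tilde{k}^{*2} + 2)\varepsilon$$

where (12) holds when $\neg C(t)$ happens, (13) holds when $\neg B(t)$ happens, and (14) holds by the definition of $L_i(S(t))$. We also know that $\|\hat{\mu}_{\tilde{S}(t)}(t) - \mu_{\tilde{S}(t)}\|_1 \le \varepsilon$ when $\neg B(t)$ happens. Then,

$$\|\theta_{\tilde{S}(t)}(t) - \mu_{\tilde{S}(t)}\|_1 \le \|\theta_{\tilde{S}(t)}(t) - \hat{\mu}_{\tilde{S}(t)}(t)\|_1 + \|\hat{\mu}_{\tilde{S}(t)}(t) - \mu_{\tilde{S}(t)}\|_1 \le \frac{\Delta_{S(t)}}{2B} - (\tilde{k}^{*2} + 1)\varepsilon < \frac{\Delta_{S(t)}}{B} - (\tilde{k}^{*2} + 1)\varepsilon,$$

which implies that $\neg D(t)$ happens. Thus, we conclude that when $\neg B(t) \wedge \neg C(t) \wedge D(t) \wedge \neg A(t)$ happens, there exists some $i \in \tilde{S}(t)$ such that $M_i(t) \le L_i(S(t))$. Let $S_1(t)$ be the base arms $i$ in $\tilde{S}(t)$ such that $M_i(t) > L_i(S(t))$, and $S_2(t)$ be the other base arms in $\tilde{S}(t)$. By the result above, $S_2(t) \ne \emptyset$. Next, we show that

$$\Delta_{S(t)} \le 2B \sum_{i \in S_2(t)} \sqrt{\frac{2 \log T}{(1-\rho)\, p_i M_i(t)}}.$$

This holds since

$$\frac{\Delta_{S(t)}}{B} - (\tilde{k}^{*2} + 1)\varepsilon < \sum_{i \in \tilde{S}(t)} |\theta_i(t) - \mu_i|$$
$$\le \sum_{i \in \tilde{S}(t)} \big(|\theta_i(t) - \hat{\mu}_i(t)| + |\hat{\mu}_i(t) - \mu_i|\big)$$
$$\le \sum_{i \in S_1(t)} |\theta_i(t) - \hat{\mu}_i(t)| + \sum_{i \in S_2(t)} |\theta_i(t) - \hat{\mu}_i(t)| + \varepsilon$$
$$\le |S_1(t)| \left(\frac{\Delta_{S(t)}}{2B|\tilde{S}(t)|} - \frac{(\tilde{k}^{*2} + 2)\,\varepsilon}{|\tilde{S}(t)|}\right) + \sum_{i \in S_2(t)} |\theta_i(t) - \hat{\mu}_i(t)| + \varepsilon$$
$$\le \frac{\Delta_{S(t)}}{2B} - (\tilde{k}^{*2} + 2)\varepsilon + \sum_{i \in S_2(t)} |\theta_i(t) - \hat{\mu}_i(t)| + \varepsilon$$
$$= \frac{\Delta_{S(t)}}{2B} - (\tilde{k}^{*2} + 1)\varepsilon + \sum_{i \in S_2(t)} |\theta_i(t) - \hat{\mu}_i(t)|$$
$$\le \frac{\Delta_{S(t)}}{2B} - (\tilde{k}^{*2} + 1)\varepsilon + \sum_{i \in S_2(t)} \sqrt{\frac{2 \log T}{N_i(t)}}$$
$$\le \frac{\Delta_{S(t)}}{2B} - (\tilde{k}^{*2} + 1)\varepsilon + \sum_{i \in S_2(t)} \sqrt{\frac{2 \log T}{(1-\rho)\, p_i M_i(t)}}.$$

Fix $i \in [m]$. For $w > 0$, let $\eta_w^i$ be the round for which $i \in S_2(\eta_w^i)$ and $|\{t \le \eta_w^i : i \in S_2(t)\}| = w$, and let $w^i(T) := |\{t \le T : i \in S_2(t)\}|$. We have $i \in S_2(\eta_w^i) \subseteq \tilde{S}(\eta_w^i)$ for all $w > 0$, which implies that $M_i(\eta_{w+1}^i) \ge w$. Moreover, by the definition of $S_2(t)$, we know that $M_i(t) \le L_i(S(t)) \le L_i^{\max}$ for $i \in S_2(t)$, $t \le T$. These two facts together imply that $w^i(T) \le L_i^{\max}$ with probability 1.

Consider the round $\tau_1^i$ for which $i \in \tilde{S}(t)$ for the first time, i.e., $\tau_1^i := \min\{t : i \in \tilde{S}(t)\}$. We know that $M_i(\tau_1^i) = 0 \le L_i(S)$ for all $S$, hence $i \in S_2(\tau_1^i)$. Since $\forall t < \tau_1^i$, $i \notin \tilde{S}(t)$, and $i \notin \tilde{S}(t)$ implies $i \notin S_2(t)$, we conclude that $\tau_1^i = \eta_1^i$. We also observe that $\neg B(t)$ cannot happen for $t \le \tau_1^i = \eta_1^i$, since $N_i(t) > (1-\rho)\, p_i M_i(t) = 0$ cannot be true when $N_i(t) \le M_i(t) = 0$. Then,

$$\sum_{t=1}^{T} \mathbb{E}[\mathbb{I}\{\neg B(t) \wedge \neg C(t) \wedge D(t) \wedge \neg A(t)\}\, \Delta_{S(t)}] \le \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{I}\{\neg B(t)\}\, 2B \sum_{i \in S_2(t)} \sqrt{\frac{2 \log T}{(1-\rho)\, p_i M_i(t)}}\right]$$
$$= \mathbb{E}\left[\sum_{i=1}^{m} \sum_{t=1}^{T} \mathbb{I}\{i \in S_2(t), \neg B(t)\}\, 2B \sqrt{\frac{2 \log T}{(1-\rho)\, p_i M_i(t)}}\right]$$
$$\le \sum_{i=1}^{m} \mathbb{E}\left[\sum_{t=\eta_1^i+1}^{T} \mathbb{I}\{i \in S_2(t)\}\, 2B \sqrt{\frac{2 \log T}{(1-\rho)\, p_i M_i(t)}}\right]$$
$$\le \sum_{i=1}^{m} \mathbb{E}\left[\sum_{w=1}^{L_i^{\max}} \sum_{t=\eta_w^i+1}^{\eta_{w+1}^i} \mathbb{I}\{i \in S_2(t)\}\, 2B \sqrt{\frac{2 \log T}{(1-\rho)\, p_i M_i(t)}}\right]$$
$$= \sum_{i=1}^{m} \mathbb{E}\left[\sum_{w=1}^{L_i^{\max}} 2B \sqrt{\frac{2 \log T}{(1-\rho)\, p_i M_i(\eta_{w+1}^i)}}\right]$$
$$\le \sum_{i=1}^{m} \sum_{w=1}^{L_i^{\max}} 2B \sqrt{\frac{2 \log T}{(1-\rho)\, p_i w}}$$
$$\le \sum_{i=1}^{m} 4B \sqrt{\frac{2 \log T \cdot L_i^{\max}}{(1-\rho)\, p_i}} \qquad (15)$$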
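where (15) holds since $\sum_{n=1}^{N} \sqrt{1/n} \le 2\sqrt{N}$.

4.5.4 Bounding (9)

From Lemma 2, we know that

$$\sum_{t=1}^{T} \mathbb{E}[\mathbb{I}\{\neg D(t) \wedge \neg A(t)\}\, \Delta_{S(t)}] \le \Delta_{\max} \sum_{Z \subseteq \tilde{S}^*,\, Z \ne \emptyset}\ \sum_{t=1}^{T} \mathbb{E}[\mathbb{I}\{\mathcal{E}_{Z,1}(\theta(t)), \mathcal{E}_{Z,2}(\theta(t))\}]$$

since $\neg D(t) \wedge \neg A(t)$ implies $\mathcal{E}_{Z,1}(\theta(t))$ for some $Z \subseteq \tilde{S}^*$, and $\mathcal{E}_{Z,1}(\theta(t)) \wedge \neg\mathcal{E}_{Z,2}(\theta(t))$ implies either $A(t)$ or $D(t)$. From Lemma 3, we have:

$$\sum_{Z \subseteq \tilde{S}^*,\, Z \ne \emptyset}\ \sum_{t=1}^{T} \mathbb{E}[\mathbb{I}\{\mathcal{E}_{Z,1}(\theta(t)), \mathcal{E}_{Z,2}(\theta(t))\}] \le \sum_{Z \subseteq \tilde{S}^*,\, Z \ne \emptyset} 13 \alpha_2 \cdot \frac{|Z|}{p^*} \cdot \frac{2^{2|Z|+3} \log\frac{|Z|}{\varepsilon^2}}{\varepsilon^{2|Z|+2}}$$
$$\le 13 \alpha_2 \cdot \frac{8 \tilde{k}^*}{p^* \varepsilon^2} \log\frac{\tilde{k}^*}{\varepsilon^2} \sum_{Z \subseteq \tilde{S}^*,\, Z \ne \emptyset} \left(\frac{4}{\varepsilon^2}\right)^{|Z|}$$
$$\le 13 \alpha_2 \cdot \frac{8 \tilde{k}^*}{p^* \varepsilon^2} \left(\frac{4}{\varepsilon^2} + 1\right)^{\tilde{k}^*} \log\frac{\tilde{k}^*}{\varepsilon^2}.$$

4.5.5 Summing the Bounds

The regret bound for CTS is computed by summing the bounds derived for terms (6)-(9) in the regret decomposition, which are given in the sections above:

$$\mathrm{Reg}(T) \le \sum_{i=1}^{m} 4B \sqrt{\frac{2 \log T \cdot L_i^{\max}}{(1-\rho)\, p_i}} + \left(3 + \frac{\tilde{K}_{\max}^2}{(1-\rho)\, p^* \varepsilon^2} + \frac{2\,\mathbb{I}\{p^* < 1\}}{\rho^2 p^*}\right) m \Delta_{\max} + 13 \alpha_2 \cdot \frac{8 \tilde{k}^*}{p^* \varepsilon^2} \left(\frac{4}{\varepsilon^2} + 1\right)^{\tilde{k}^*} \log\frac{\tilde{k}^*}{\varepsilon^2}\, \Delta_{\max}$$
$$= \sum_{i=1}^{m} \max_{S \in \mathcal{I} - \mathrm{OPT} : i \in \tilde{S}} \frac{16 B^2 |\tilde{S}| \log T}{(1-\rho)\, p_i \left(\Delta_S - 2B(\tilde{k}^{*2} + 2)\varepsilon\right)} + \left(3 + \frac{\tilde{K}_{\max}^2}{(1-\rho)\, p^* \varepsilon^2} + \frac{2\,\mathbb{I}\{p^* < 1\}}{\rho^2 p^*}\right) m \Delta_{\max} + \alpha \cdot \frac{8 \tilde{k}^*}{p^* \varepsilon^2} \left(\frac{4}{\varepsilon^2} + 1\right)^{\tilde{k}^*} \log\frac{\tilde{k}^*}{\varepsilon^2}\, \Delta_{\max}$$

where $\alpha := 13 \alpha_2$.

5 Numerical Results

In this section, we compare CTS with CUCB in [13] in a cascading bandit problem [20], which is a special case of CMAB with PTAs. In this problem, a search engine outputs a list of $K$ web pages for each of its $L$ users among a set of $R$ web pages. Then, the users examine their respective lists, and click on the first page that they find attractive. If all pages fail to attract them, they do not click on any page. The goal of the search engine is to maximize the number of clicks.

The problem can be modeled as a CMAB problem. The base arms are user-page pairs $(i, j)$, where $i \in [L]$ and $j \in [R]$. User $i$ finds page $j$ attractive independent of other users and other pages, and the probability that user $i$ finds page $j$ attractive is given as $p_{i,j}$. The super arms are $L$ lists of $K$-tuples, where each $K$-tuple represents the list of pages shown to a user. Given a super arm $S$, let $S(i, k)$ denote the $k$th page that is selected for user $i$. Then, the triggering probabilities can be written as

$$p_{(i,j)}^S = \begin{cases} 1 & \text{if } j = S(i, 1) \\ \prod_{k'=1}^{k-1} (1 - p_{i,S(i,k')}) & \text{if } \exists k \ne 1 : j = S(i, k) \\ 0 & \text{otherwise} \end{cases}$$

that is, we observe feedback for a top selection immediately, and observe feedback for the other selections only if all previous selections fail to attract the user. The expected reward of playing super arm $S$ can be written as

$$r(S, p) = \sum_{i=1}^{L} \left(1 - \prod_{k=1}^{K} (1 - p_{i,S(i,k)})\right)$$

for which Assumption 3 holds when $B = 1$.

Figure 1: Regrets of CTS and CUCB for the cascading bandit problem.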
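The triggering probabilities and the expected reward of the cascading model can be sketched as follows. This is a minimal illustration, assuming 0-based page indices and attraction probabilities stored as nested lists; the function names are hypothetical.

```python
from typing import Dict, List, Sequence


def trigger_probs(p_user: Sequence[float], ranked: Sequence[int]) -> Dict[int, float]:
    """Triggering probability of each page shown to one user.

    The page at position k is examined only if all earlier pages failed to
    attract the user. Pages that are not shown are omitted (probability 0).
    """
    probs: Dict[int, float] = {}
    all_failed = 1.0  # probability that every earlier page failed
    for j in ranked:
        probs[j] = all_failed
        all_failed *= 1.0 - p_user[j]
    return probs


def expected_reward(p: Sequence[Sequence[float]], S: Sequence[Sequence[int]]) -> float:
    """Expected number of clicks r(S, p): sum over users of P(at least one click)."""
    total = 0.0
    for i, ranked in enumerate(S):
        no_click = 1.0
        for j in ranked:
            no_click *= 1.0 - p[i][j]
        total += 1.0 - no_click
    return total


# One user, three pages, list of K = 2 pages: page 0 first, then page 1.
p_example: List[List[float]] = [[0.5, 0.2, 0.9]]
S_example: List[List[int]] = [[0, 1]]
probs_example = trigger_probs(p_example[0], S_example[0])
reward_example = expected_reward(p_example, S_example)
```

In the example, page 1 is examined only when page 0 fails, which happens with probability $1 - 0.5$, and the expected reward is $1 - (1 - 0.5)(1 - 0.2)$.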

We let $L = 100$, $R = 20$ and $K = 5$, and generate the $p_{i,j}$'s by sampling uniformly at random from $[0, 1]$. We run both CTS and CUCB for 600 rounds, and report their regrets averaged over 1000 runs in Figure 1. As expected, CTS significantly outperforms CUCB. The relatively poor performance of CUCB can be explained by an excessive number of explorations, caused by UCBs that stay high for a large number of rounds.

6 Conclusion

In this paper we analyzed the regret of CTS for CMAB with PTAs. We proved an order optimal regret bound when the expected reward function is Lipschitz continuous. Our bound includes the $1/p^*$ term that is unavoidable in general. Future work includes deriving regret bounds under stricter assumptions on the expected reward function, such as the triggering probability modulated bounded smoothness condition given in [14], to get rid of the $1/p^*$ term.

References

[1] H. Robbins, "Some aspects of the sequential design of experiments," Bull. Amer. Math. Soc., vol. 55, 1952.
[2] T. L. Lai and H. Robbins, "Asymptotically efficient adaptive allocation rules," Advances in Applied Mathematics, vol. 6, pp. 4-22, 1985.
[3] W. R. Thompson, "On the likelihood that one unknown probability exceeds another in view of the evidence of two samples," Biometrika, vol. 25, no. 3/4, 1933.
[4] S. Agrawal and N. Goyal, "Analysis of Thompson sampling for the multi-armed bandit problem," in Proc. 25th Annual Conference on Learning Theory, 2012.
[5] D. Russo and B. Van Roy, "Learning to optimize via posterior sampling," Mathematics of Operations Research, vol. 39, no. 4, 2014.
[6] R. Agrawal, "Sample mean based index policies with O(log n) regret for the multi-armed bandit problem," Advances in Applied Probability, vol. 27, no. 4, 1995.
[7] P. Auer, N. Cesa-Bianchi, and P. Fischer, "Finite-time analysis of the multiarmed bandit problem," Machine Learning, vol. 47, 2002.
[8] Y. Gai, B. Krishnamachari, and R. Jain, "Combinatorial network optimization with unknown variables: Multi-armed bandits with linear rewards and individual observations," IEEE/ACM Transactions on Networking, vol. 20, no. 5, 2012.
[9] B. Kveton, Z. Wen, A. Ashkan, and C. Szepesvari, "Tight regret bounds for stochastic combinatorial semi-bandits," in Proc. 18th International Conference on Artificial Intelligence and Statistics, 2015.
[10] W. Chen, Y. Wang, and Y. Yuan, "Combinatorial multi-armed bandit: General framework and applications," in Proc. 30th International Conference on Machine Learning, pp. 151-159, 2013.
[11] G. L. Nemhauser, L. A. Wolsey, and M. L. Fisher, "An analysis of approximations for maximizing submodular set functions I," Mathematical Programming, vol. 14, no. 1, 1978.
[12] S. Wang and W. Chen, "Thompson sampling for combinatorial semi-bandits," in Proc. 35th International Conference on Machine Learning, 2018.
[13] W. Chen, Y. Wang, Y. Yuan, and Q. Wang, "Combinatorial multi-armed bandit and its extension to probabilistically triggered arms," The Journal of Machine Learning Research, vol. 17, no. 1, 2016.
[14] Q. Wang and W. Chen, "Improving regret bounds for combinatorial semi-bandits with probabilistically triggered arms and its applications," in Proc. Advances in Neural Information Processing Systems, pp. 1161-1171, 2017.
[15] A. O. Saritac and C. Tekin, "Combinatorial multi-armed bandit with probabilistically triggered arms: A case with bounded regret," arXiv preprint, 2017.
[16] B. Kveton, Z. Wen, A. Ashkan, H. Eydgahi, and B. Eriksson, "Matroid bandits: Fast combinatorial optimization with learning," in Proc. 30th Conference on Uncertainty in Artificial Intelligence, 2014.
[17] A. Gopalan, S. Mannor, and Y. Mansour, "Thompson sampling for complex online problems," in Proc. 31st International Conference on Machine Learning, 2014.
[18] Z. Wen, B. Kveton, and A. Ashkan, "Efficient learning in large-scale combinatorial semi-bandits," in Proc. 32nd International Conference on Machine Learning, pp. 1113-1122, 2015.
[19] D. Kempe, J. Kleinberg, and E. Tardos, "Maximizing the spread of influence through a social network," in Proc. 9th ACM SIGKDD Int. Conf. Knowledge Discovery and Data Mining, pp. 137-146, 2003.
[20] B. Kveton, C. Szepesvari, Z. Wen, and A. Ashkan, "Cascading bandits: Learning to rank in the cascade model," in Proc. International Conference on Machine Learning, 2015.
[21] M. Mitzenmacher and E. Upfal, Probability and Computing: Randomized Algorithms and Probabilistic Analysis. Cambridge University Press, 2005.

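A Supplemental Document

A.1 Proof of Lemma 1

The proof is similar to the proof of Lemma 3 in [12]. However, additional steps are required to take probabilistic triggering into account. Consider a base arm $i \in [m]$. Let $\tau_w^i$ be the round for which base arm $i$ is in the triggering set of the selected super arm for the $w$th time. Hence, we have $i \in \tilde{S}(\tau_w^i)$ for all $w > 0$. Also let $\tau_0^i = 0$. Then, we have:

$$\mathbb{E}\big[|\{t : 1 \le t \le T,\ i \in \tilde{S}(t),\ |\hat{\mu}_i(t) - \mu_i| > \varepsilon \vee B_{i,2}(t)\}|\big]$$
$$= \mathbb{E}\left[\sum_{t=1}^{T} \mathbb{I}\{i \in \tilde{S}(t),\ |\hat{\mu}_i(t) - \mu_i| > \varepsilon \vee B_{i,2}(t)\}\right]$$
$$\le \mathbb{E}\left[\sum_{w=0}^{\infty} \sum_{t=\tau_w^i+1}^{\tau_{w+1}^i} \mathbb{I}\{i \in \tilde{S}(t),\ |\hat{\mu}_i(t) - \mu_i| > \varepsilon \vee B_{i,2}(t)\}\right]$$
$$= \sum_{w=0}^{\infty} \mathbb{E}\big[\mathbb{I}\{i \in \tilde{S}(\tau_{w+1}^i),\ |\hat{\mu}_i(\tau_{w+1}^i) - \mu_i| > \varepsilon \vee B_{i,2}(\tau_{w+1}^i)\}\big]$$
$$= \sum_{w=0}^{\infty} \mathbb{E}\big[\mathbb{I}\{|\hat{\mu}_i(\tau_{w+1}^i) - \mu_i| > \varepsilon \vee B_{i,2}(\tau_{w+1}^i)\}\big]$$
$$\le 1 + \sum_{w=1}^{\infty} \Pr\big[|\hat{\mu}_i(\tau_{w+1}^i) - \mu_i| > \varepsilon \vee B_{i,2}(\tau_{w+1}^i)\big]$$
$$\le 1 + \sum_{w=1}^{\infty} \Pr\big[|\hat{\mu}_i(\tau_{w+1}^i) - \mu_i| > \varepsilon,\ \neg B_{i,2}(\tau_{w+1}^i)\big] + \sum_{w=1}^{\infty} \Pr\big[B_{i,2}(\tau_{w+1}^i)\big]$$
$$= 1 + \sum_{w=1}^{\infty} \Pr\big[|\hat{\mu}_i(\tau_{w+1}^i) - \mu_i| > \varepsilon,\ N_i(\tau_{w+1}^i) > (1-\rho) w p_i\big] + \sum_{w=1}^{\infty} \Pr\big[N_i(\tau_{w+1}^i) \le (1-\rho) w p_i\big]$$
$$\le 1 + \sum_{w=1}^{\infty} \Pr\big[|\hat{\mu}_i(\tau_{w+1}^i) - \mu_i| > \varepsilon,\ N_i(\tau_{w+1}^i) > (1-\rho) w p^*\big] + \sum_{w=1}^{\infty} \Pr\big[N_i(\tau_{w+1}^i) \le (1-\rho) w p_i\big]$$
$$\le 1 + \sum_{w=1}^{\infty} 2 e^{-2(1-\rho) w p^* \varepsilon^2} + \mathbb{I}\{p^* < 1\} \sum_{w=1}^{\infty} e^{-\rho^2 w p^* / 2}$$
$$\le 1 + \frac{1}{(1-\rho)\, p^* \varepsilon^2} + \frac{2\,\mathbb{I}\{p^* < 1\}}{\rho^2 p^*} \qquad (16)$$

where the second term in (16) is obtained by observing that

$$\Pr\big[|\hat{\mu}_i(\tau_{w+1}^i) - \mu_i| > \varepsilon,\ N_i(\tau_{w+1}^i) > (1-\rho) w p^*\big] \le \sum_{k=\lceil (1-\rho) w p^* \rceil}^{\infty} \Pr\big[|\hat{\mu}_i(\tau_{w+1}^i) - \mu_i| > \varepsilon \mid N_i(\tau_{w+1}^i) = k\big] \Pr\big[N_i(\tau_{w+1}^i) = k\big]$$

and applying Hoeffding's inequality, and the third term in (16) is obtained by using Fact 1.

A.2 Proof of Lemma 2

The proof is similar to the proof of Lemma 1 in [12]. Let $\theta' := (\theta'_{\tilde{S}^*}, \theta_{\tilde{S}^{*c}}(t))$ be such that

$$\|\theta'_{\tilde{S}^*} - \mu_{\tilde{S}^*}\|_\infty \le \varepsilon. \qquad (17)$$

Claim 1: For all $S$ such that $\tilde{S} \cap \tilde{S}^* = \emptyset$, $S \ne \mathrm{Oracle}(\theta')$.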

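Claim 1 holds since

$$r(S, \theta') = r(S, \theta(t)) \qquad (18)$$
$$\le r(S(t), \theta(t)) \qquad (19)$$
$$\le r(S(t), \mu) + B\left(\frac{\Delta_{S(t)}}{B} - (\tilde{k}^{*2} + 1)\varepsilon\right) \qquad (20)$$
$$= r(S(t), \mu) + \Delta_{S(t)} - B(\tilde{k}^{*2} + 1)\varepsilon$$
$$= r(S^*, \mu) - B(\tilde{k}^{*2} + 1)\varepsilon \qquad (21)$$
$$< r(S^*, \mu) - B \tilde{k}^* \varepsilon$$
$$\le r(S^*, \theta') \qquad (22)$$

where (18) follows from Assumption 3 since $\theta'$ and $\theta(t)$ only differ on arms in $\tilde{S}^*$ and $\tilde{S} \cap \tilde{S}^* = \emptyset$, (19) holds since $S(t) \in \mathrm{OPT}(\theta(t))$, (20) is by $\neg D(t)$ and Assumption 3, (21) is by the definition of $\Delta_{S(t)}$, and (22) is again by Assumption 3.

Next, we consider two cases:

- Case 1a: $\tilde{S}^* \subseteq \widetilde{\mathrm{Oracle}}(\theta')$ for all $\theta' = (\theta'_{\tilde{S}^*}, \theta_{\tilde{S}^{*c}}(t))$ that satisfy (17).
- Case 1b: There exists $\theta' = (\theta'_{\tilde{S}^*}, \theta_{\tilde{S}^{*c}}(t))$ that satisfies (17) for which $\tilde{S}^* \not\subseteq \widetilde{\mathrm{Oracle}}(\theta')$. For this $\theta'$, let $S_1 = \mathrm{Oracle}(\theta')$ and $Z_1 = \tilde{S}_1 \cap \tilde{S}^*$. Together with Claim 1, for this case, we have $Z_1 \ne \tilde{S}^*$ and $Z_1 \ne \emptyset$.

Note that Case 1a and Case 1b are complements of each other.

When Case 1a is true, for any given $\theta'$, with an abuse of notation, let $S_0 := \mathrm{Oracle}(\theta')$. Then, we have $r(S_0, \theta') \ge r(S^*, \theta') \ge r(S^*, \mu) - B \tilde{k}^* \varepsilon$. If $S_0 \notin \mathrm{OPT}$, then we have $r(S^*, \mu) = r(S_0, \mu) + \Delta_{S_0}$. Combining the two results above, we obtain $r(S_0, \theta') \ge r(S_0, \mu) + \Delta_{S_0} - B \tilde{k}^* \varepsilon$. By Assumption 3, this implies that

$$\|\theta'_{\tilde{S}_0} - \mu_{\tilde{S}_0}\|_1 \ge \frac{\Delta_{S_0}}{B} - \tilde{k}^* \varepsilon > \frac{\Delta_{S_0}}{B} - (\tilde{k}^{*2} + 1)\varepsilon.$$

Thus, from the discussion above, we conclude that either $S_0 \in \mathrm{OPT}$ or $\|\theta'_{\tilde{S}_0} - \mu_{\tilde{S}_0}\|_1 > \frac{\Delta_{S_0}}{B} - (\tilde{k}^{*2} + 1)\varepsilon$. This means $\mathcal{E}_{\tilde{S}^*,1}(\theta') = \mathcal{E}_{\tilde{S}^*,1}(\theta(t))$ holds. Hence, if Case 1a is true, then Lemma 2 holds for $Z = \tilde{S}^*$.

In Case 1b, we also have $r(S_1, \theta') \ge r(S^*, \theta') \ge r(S^*, \mu) - B \tilde{k}^* \varepsilon$. Consider any $\theta'' = (\theta''_{Z_1}, \theta_{Z_1^c}(t))$ such that

$$\|\theta''_{Z_1} - \mu_{Z_1}\|_\infty \le \varepsilon. \qquad (23)$$

We see that

$$\|\theta'_{\tilde{S}_1} - \theta''_{\tilde{S}_1}\|_1 = \sum_{i \in \tilde{S}_1 \cap \tilde{S}^*} |\theta'_i - \theta''_i| + \sum_{i \in \tilde{S}_1 \cap \tilde{S}^{*c}} |\theta'_i - \theta''_i| \le \sum_{i \in Z_1} \big(|\theta'_i - \mu_i| + |\mu_i - \theta''_i|\big) \le 2(\tilde{k}^* - 1)\varepsilon,$$

hence

$$r(S_1, \theta'') \ge r(S_1, \theta') - 2B(\tilde{k}^* - 1)\varepsilon \ge r(S^*, \mu) - B \tilde{k}^* \varepsilon - 2B(\tilde{k}^* - 1)\varepsilon = r(S^*, \mu) - B(3\tilde{k}^* - 2)\varepsilon.$$

Claim 2: For all $S$ such that $\tilde{S} \cap Z_1 = \emptyset$, $S \ne \mathrm{Oracle}(\theta'')$.

Similar to Claim 1, Claim 2 holds since

$$r(S, \theta'') = r(S, \theta(t)) \le r(S(t), \theta(t)) \le r(S(t), \mu) + B\left(\frac{\Delta_{S(t)}}{B} - (\tilde{k}^{*2} + 1)\varepsilon\right) = r(S(t), \mu) + \Delta_{S(t)} - B(\tilde{k}^{*2} + 1)\varepsilon = r(S^*, \mu) - B(\tilde{k}^{*2} + 1)\varepsilon$$
$$< r(S^*, \mu) - B(3\tilde{k}^* - 2)\varepsilon \qquad (24)$$
$$\le r(S_1, \theta'')$$

where (24) holds since $\tilde{k}^* \ge 2$ in Case 1b.

Claim 2 implies that when Case 1b holds, we have $\widetilde{\mathrm{Oracle}}(\theta'') \cap Z_1 \ne \emptyset$. Hence, we consider two cases again for $\mathrm{Oracle}(\theta'')$:

- Case 2a: $Z_1 \subseteq \widetilde{\mathrm{Oracle}}(\theta'')$ for all $\theta'' = (\theta''_{Z_1}, \theta_{Z_1^c}(t))$ that satisfy (23).
- Case 2b: There exists $\theta'' = (\theta''_{Z_1}, \theta_{Z_1^c}(t))$ that satisfies (23) for which $Z_1 \not\subseteq \widetilde{\mathrm{Oracle}}(\theta'')$. For this $\theta''$, let $S_2 = \mathrm{Oracle}(\theta'')$ and $Z_2 = \tilde{S}_2 \cap Z_1$. Together with Claim 2, for this case, we have $Z_2 \ne Z_1$ and $Z_2 \ne \emptyset$.

Similar to Case 1a, when Case 2a is true, Lemma 2 holds for $Z = Z_1$. Thus, we can keep repeating the same arguments iteratively, and the size of $Z_i$ will decrease by at least 1 at each iteration. After at most $\tilde{k}^* - 1$ iterations, Case b will not be possible. In order to see this, suppose that we come to a point where $|Z_i| = 1$. As in all iterations, either Case $(i+1)$a or Case $(i+1)$b must hold. However, when Case $(i+1)$b holds, Claim $(i+1)$, which follows from Case $i$b, implies that there exists a $Z_{i+1} \subseteq Z_i$ such that $Z_{i+1} \ne \emptyset$ and $Z_{i+1} \ne Z_i$, which is not possible when $|Z_i| = 1$. Therefore, we conclude that some Case $(i+1)$a must hold, where $Z_i \subseteq \tilde{S}^*$, $Z_i \ne \emptyset$, and $\mathcal{E}_{Z_i,1}(\theta(t))$ occurs.

Finally, we need to show that Claim $(i+1)$ holds for all iterations. We focus on the claim

$$r(S^*, \mu) - B(\tilde{k}^{*2} + 1)\varepsilon < r(S^*, \mu) - B\left(\tilde{k}^* + 2\sum_{k=1}^{i} (\tilde{k}^* - k)\right)\varepsilon$$

as repeating the other arguments for all iterations is straightforward. The given inequality is true as

$$\tilde{k}^* + 2\sum_{k=1}^{i} (\tilde{k}^* - k) \le \tilde{k}^* + 2\sum_{k=1}^{\tilde{k}^* - 1} (\tilde{k}^* - k) = \tilde{k}^{*2} < \tilde{k}^{*2} + 1.$$

Note that, when checking Claim $(i+1)$, we know that $i$ previous iterations have passed, hence $\tilde{k}^*$ must be at least $i + 1$.

A.3 Proof of Lemma 3

Given $Z$, we re-index the base arms in $Z$ such that $z_i$ represents the $i$th base arm in $Z$. We also introduce a counter $c(t)$, and let $c(1) = 1$. If at round $t$, $\mathcal{E}_{Z,1}(\theta(t)) \wedge \neg\mathcal{E}_{Z,2}(\theta(t))$ occurs and a feedback for $z_{c(t)}$ is observed, i.e., $z_{c(t)} \in S'(t)$, the counter is updated with probability $p^* / p_{z_{c(t)}}^{S(t)}$ in the following way:

$$c(t+1) = \begin{cases} c(t) + 1 & \text{if } c(t) < |Z| \\ 1 & \text{if } c(t) = |Z| \end{cases}$$

If the counter is not updated at round $t$, $c(t+1) = c(t)$.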
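Note that when $E_{Z,1}(\theta(t)) \wedge \neg E_{Z,2}(\theta(t))$ occurs, $z_{c(t)} \in Z \subseteq \widetilde{\mathrm{Oracle}}(\theta(t)) = \tilde{S}(t)$, hence we always have $0 < p^*/p^{S(t)}_{z_{c(t)}} \le 1$. Moreover, the probability that the counter is updated, i.e., $c(t+1) \neq c(t)$, given that $E_{Z,1}(\theta(t)) \wedge \neg E_{Z,2}(\theta(t))$ occurs is constant and equal to $p^*$ for all rounds $t$ for which $E_{Z,1}(\theta(t)) \wedge \neg E_{Z,2}(\theta(t))$ occurs. To see this, consider a parameter vector $\theta$ such that $E_{Z,1}(\theta) \wedge \neg E_{Z,2}(\theta)$ holds and let $S = \mathrm{Oracle}(\theta)$, then

$$\Pr[c(t+1) \neq c(t) \mid \theta(t) = \theta] = \Pr[z_{c(t)} \in S'(t) \mid S(t) = S] \cdot \frac{p^*}{p^{S}_{z_{c(t)}}} = p^{S}_{z_{c(t)}} \cdot \frac{p^*}{p^{S}_{z_{c(t)}}} = p^*.$$

Let $\tau_j$ be the round at which $E_{Z,1}(\theta(t)) \wedge \neg E_{Z,2}(\theta(t))$ occurs for the $j$th time, and let $\tau_0 := 0$. Then, the counter is updated only at rounds $\tau_j$, with probability $p^*$. Let $\eta_{q,k}$ be the round $\tau_j$ such that $c(\tau_j + 1) = k + 1$ and $c(\tau_j) = k$ holds for the $(q+1)$th time. Let $\eta_{0,0} = 0$ and $\eta_{q,|Z|} = \eta_{q+1,0}$. We know that $0 = \eta_{0,0} < \eta_{0,1} < \ldots < \eta_{0,|Z|} = \eta_{1,0} < \eta_{1,1} < \ldots$.

We use two important observations to continue with the proof. Firstly, due to the way the counter is updated, for $t \ge \eta_{q,0} + 1$ we have $N_i(t) \ge q$, $\forall i \in Z$. Secondly, for non-negative integers $j_1$ and $j_2$,

$$\Pr[\eta_{q,k+1} = \tau_{j_1+j_2+1} \mid \eta_{q,k} = \tau_{j_1}] = p^*(1 - p^*)^{j_2}.$$

This holds since for the given event to hold, the counter must not be updated at rounds $\tau_{j_1+1}, \tau_{j_1+2}, \ldots, \tau_{j_1+j_2}$, each of which happens with probability $1 - p^*$, and must be updated at round $\tau_{j_1+j_2+1}$, which happens with probability $p^*$.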
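The thinning step above (accept an observed feedback with probability $p^*/p^{S}_{z_{c(t)}}$) can be checked numerically. The following is a minimal sketch, not part of the original analysis: the function name `update_probability`, the parameters `p_trigger` (standing in for $p^{S}_{z_{c(t)}}$) and `p_star` (standing in for $p^*$), and all numeric values are illustrative assumptions. It estimates the per-round update probability and should return a value close to `p_star` regardless of `p_trigger`.

```python
import random


def update_probability(p_trigger, p_star, trials=200_000, seed=0):
    """Estimate the probability that the counter is updated in one round.

    p_trigger: assumed triggering probability of the tracked base arm under
               the selected super arm (must satisfy p_trigger >= p_star > 0).
    p_star:    assumed minimum non-zero triggering probability.
    """
    if not 0 < p_star <= p_trigger <= 1:
        raise ValueError("need 0 < p_star <= p_trigger <= 1")
    rng = random.Random(seed)
    updates = 0
    for _ in range(trials):
        # Feedback for the tracked arm is observed with probability p_trigger.
        observed = rng.random() < p_trigger
        # Given feedback, the update is accepted with probability p_star / p_trigger.
        if observed and rng.random() < p_star / p_trigger:
            updates += 1
    return updates / trials


if __name__ == "__main__":
    for p_trigger in (0.3, 0.6, 1.0):
        estimate = update_probability(p_trigger, p_star=0.2)
        print(f"p_trigger={p_trigger:.1f}  estimated update probability={estimate:.4f}")
```

Because the acceptance probability cancels the triggering probability, the waiting time between counter updates (measured in rounds $\tau_j$) is geometric with parameter $p^*$, which is the second observation used above.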

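Therefore, we have

\begin{align}
&\mathbb{E}\left[\sum_{t=\eta_{q,k}+1}^{\eta_{q,k+1}} \mathbb{I}\{E_{Z,1}(\theta(t)), E_{Z,2}(\theta(t))\}\right] \notag \\
&= \sum_{j_1=0}^{\infty} \Pr[\eta_{q,k} = \tau_{j_1}] \sum_{j_2=0}^{\infty} \Pr[\eta_{q,k+1} = \tau_{j_1+j_2+1} \mid \eta_{q,k} = \tau_{j_1}] \sum_{j=j_1}^{j_1+j_2} \mathbb{E}\left[\sum_{t=\tau_j+1}^{\tau_{j+1}} \mathbb{I}\{E_{Z,1}(\theta(t)), E_{Z,2}(\theta(t))\} \,\middle|\, \eta_{q,k} \le \tau_j < \eta_{q+1,k}\right] \notag \\
&\le \sum_{j_1=0}^{\infty} \Pr[\eta_{q,k} = \tau_{j_1}] \sum_{j_2=0}^{\infty} p^*(1 - p^*)^{j_2} \sum_{j=j_1}^{j_1+j_2} \Big(\prod_{i \in Z} B_q - 1\Big) \notag \\
&= \sum_{j_1=0}^{\infty} \Pr[\eta_{q,k} = \tau_{j_1}] \Big(\prod_{i \in Z} B_q - 1\Big) \sum_{j_2=0}^{\infty} p^*(j_2 + 1)(1 - p^*)^{j_2} \notag \\
&= \frac{1}{p^*} \sum_{j_1=0}^{\infty} \Pr[\eta_{q,k} = \tau_{j_1}] \Big(\prod_{i \in Z} B_q - 1\Big) \notag \\
&= \frac{1}{p^*} \Big(\prod_{i \in Z} B_q - 1\Big) \tag{25}
\end{align}

where (25) holds due to our observations and (10) in Fact 3. Finally, we have

\begin{align}
\sum_{t=1}^{T} \mathbb{E}\big[\mathbb{I}\{E_{Z,1}(\theta(t)), E_{Z,2}(\theta(t))\}\big]
&\le \sum_{q=0}^{\infty} \sum_{k=0}^{|Z|-1} \mathbb{E}\left[\sum_{t=\eta_{q,k}+1}^{\eta_{q,k+1}} \mathbb{I}\{E_{Z,1}(\theta(t)), E_{Z,2}(\theta(t))\}\right] \notag \\
&\le \sum_{q=0}^{\infty} \sum_{k=0}^{|Z|-1} \frac{1}{p^*} \Big(\prod_{i \in Z} B_q - 1\Big) \notag \\
&= \frac{|Z|}{p^*} \sum_{q=0}^{\infty} \Big(\prod_{i \in Z} B_q - 1\Big) \notag \\
&\le 13\alpha_2' \cdot \frac{|Z|}{p^*} \cdot \frac{2^{2|Z|+3} \log\frac{|Z|}{\varepsilon^2}}{\varepsilon^{2|Z|+2}} \tag{26}
\end{align}

where (26) holds due to (11) in Fact 3.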
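The factor $1/p^*$ in (25) comes from a standard geometric-series identity. As a short worked derivation (the shorthand $x := 1 - p^*$ is introduced here only for illustration and assumes $0 < p^* \le 1$):

\begin{align}
\sum_{j_2=0}^{\infty} p^*(j_2 + 1)(1 - p^*)^{j_2}
&= p^* \sum_{j_2=0}^{\infty} (j_2 + 1)\, x^{j_2}
= p^* \frac{d}{dx} \sum_{j_2=0}^{\infty} x^{j_2+1} \notag \\
&= p^* \frac{d}{dx} \left(\frac{x}{1 - x}\right)
= \frac{p^*}{(1 - x)^2}
= \frac{p^*}{(p^*)^2}
= \frac{1}{p^*}. \notag
\end{align}

Equivalently, $j_2 + 1$ is the number of rounds $\tau_j$ needed until the counter is updated, which is geometrically distributed with success probability $p^*$ and therefore has mean $1/p^*$.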


Multi-Armed Bandits. Credit: David Silver. Google DeepMind. Presenter: Tianlu Wang Multi-Armed Bandits Credit: David Silver Google DeepMind Presenter: Tianlu Wang Credit: David Silver (DeepMind) Multi-Armed Bandits Presenter: Tianlu Wang 1 / 27 Outline 1 Introduction Exploration vs.

More information

COMPUTING OPTIMAL SEQUENTIAL ALLOCATION RULES IN CLINICAL TRIALS* Michael N. Katehakis. State University of New York at Stony Brook. and.

COMPUTING OPTIMAL SEQUENTIAL ALLOCATION RULES IN CLINICAL TRIALS* Michael N. Katehakis. State University of New York at Stony Brook. and. COMPUTING OPTIMAL SEQUENTIAL ALLOCATION RULES IN CLINICAL TRIALS* Michael N. Katehakis State University of New York at Stony Brook and Cyrus Derman Columbia University The problem of assigning one of several

More information

Online Forest Density Estimation

Online Forest Density Estimation Online Forest Density Estimation Frédéric Koriche CRIL - CNRS UMR 8188, Univ. Artois koriche@cril.fr UAI 16 1 Outline 1 Probabilistic Graphical Models 2 Online Density Estimation 3 Online Forest Density

More information

The geometry of Gaussian processes and Bayesian optimization. Contal CMLA, ENS Cachan

The geometry of Gaussian processes and Bayesian optimization. Contal CMLA, ENS Cachan The geometry of Gaussian processes and Bayesian optimization. Contal CMLA, ENS Cachan Background: Global Optimization and Gaussian Processes The Geometry of Gaussian Processes and the Chaining Trick Algorithm

More information

On the Complexity of Best Arm Identification with Fixed Confidence

On the Complexity of Best Arm Identification with Fixed Confidence On the Complexity of Best Arm Identification with Fixed Confidence Discrete Optimization with Noise Aurélien Garivier, Emilie Kaufmann COLT, June 23 th 2016, New York Institut de Mathématiques de Toulouse

More information

Chapter 2 Stochastic Multi-armed Bandit

Chapter 2 Stochastic Multi-armed Bandit Chapter 2 Stochastic Multi-armed Bandit Abstract In this chapter, we present the formulation, theoretical bound, and algorithms for the stochastic MAB problem. Several important variants of stochastic

More information

Cascading Bandits: Learning to Rank in the Cascade Model

Cascading Bandits: Learning to Rank in the Cascade Model Branislav Kveton Adobe Research, San Jose, CA Csaba Szepesvári Department of Computing Science, University of Alberta Zheng Wen Yahoo Labs, Sunnyvale, CA Azin Ashkan Technicolor Research, Los Altos, CA

More information

Multiple Identifications in Multi-Armed Bandits

Multiple Identifications in Multi-Armed Bandits Multiple Identifications in Multi-Armed Bandits arxiv:05.38v [cs.lg] 4 May 0 Sébastien Bubeck Department of Operations Research and Financial Engineering, Princeton University sbubeck@princeton.edu Tengyao

More information