Bandit Online Learning with Unknown Delays


Bandit Online Learning with Unknown Delays

Bingcong Li, Tianyi Chen, and Georgios B. Giannakis
University of Minnesota - Twin Cities, Minneapolis, MN 55455, USA
{lixx5599, chen387, georgios}@umn.edu

November 13, 2018

arXiv preprint [cs.LG], 11 Nov 2018

Abstract

This paper deals with bandit online learning problems involving feedback of unknown delay that can emerge in multi-armed bandit (MAB) and bandit convex optimization (BCO) settings. MAB and BCO require only values of the objective function involved that become available through feedback, and are used to estimate the gradient appearing in the corresponding iterative algorithms. Since the challenging case of feedback with unknown delays prevents one from constructing the sought gradient estimates, existing MAB and BCO algorithms become intractable. For such challenging setups, delayed exploration, exploitation, and exponential (DEXP3) iterations, along with delayed bandit gradient descent (DBGD) iterations, are developed for MAB and BCO, respectively. Leveraging a unified analysis framework, it is established that both DEXP3 and DBGD guarantee an $\mathcal{O}\big(\sqrt{T+D}\big)$ regret, where $D$ denotes the delay accumulated over $T$ slots. The regret bounds match those in the full-information settings. Numerical tests using both synthetic and real data validate the performance of DEXP3 and DBGD.

1 Introduction

Sequential decision making emerges in various learning and optimization tasks, such as online advertisement, online routing, and portfolio management [5, 15]. Among popular methods for sequential decision making, multi-armed bandit (MAB) and bandit convex optimization (BCO) have widely appreciated merits, because even with limited information they offer quantifiable performance guarantees. MAB and BCO can be viewed as a repeated game between a possibly randomized learner and the possibly adversarial nature. In each round, the learner selects an action and incurs the associated loss that is returned by nature. In contrast to the full-information setting, only the loss of the performed action (rather than the gradient of the loss function, or even the loss function itself) is revealed to the learner. Popular approaches to bandit online learning estimate gradients using several point-wise evaluations of the loss function, and use them to run online gradient-type iterative solvers; see e.g., [3] for MAB and [1, 13] for BCO.

Although widely applicable with solid performance guarantees, standard MAB and BCO frameworks do not account for delayed feedback that is naturally present in various applications. For example, when carrying out machine learning tasks using distributed mobile devices (a setup referred to as federated learning [21]), delay comes from the time it takes to compute at the mobile devices and to transmit over the wireless communication links; in online recommendation, the click-through rate could be aggregated and then periodically sent back [19]; in online routing over communication networks, the latency of each routing decision can be revealed only after the packet's arrival at its destination [4]; and in parallel computing by data centers, computations are carried out with outdated information because agents are not synchronized [2, 11, 20].

Challenges arise naturally when dealing with bandit online learning with unknown delays, simply because unknown delays prevent existing methods for non-stochastic MAB as well as BCO from constructing reliable gradient estimates. To address this limitation, our solution is a fine-grained biased gradient estimator for MAB and a deterministic gradient estimator for BCO.

In both cases, the standard unbiased loss estimator for non-stochastic MAB and the nearly unbiased one for BCO are no longer available. The resultant algorithms, which we abbreviate as DEXP3 and DBGD, are guaranteed to achieve $\mathcal{O}\big(\sqrt{T+D}\big)$ regret over a $T$-slot time horizon with overall delay $D$, which matches the regret bounds in delayed full-information online learning settings.

1.1 Related works

Delayed online learning can be categorized depending on whether the feedback information is full or bandit (meaning partial). We review prior works from these two aspects.

Delayed online learning. This class deals with delayed but fully revealed loss information, namely a fully known gradient or loss function. It has been proved that an $\mathcal{O}\big(\sqrt{T+D}\big)$ regret can be achieved over a $T$-slot horizon with overall delay $D$. In particular, algorithms dealing with a fixed delay have been studied in [29]. To reduce the storage and computation burden of [29], an online gradient descent type algorithm for a fixed $d$-slot delay was developed in [18], where the lower bound $\Omega\big(\sqrt{(d+1)T}\big)$ was also provided. Adversarial delays have been tackled recently in [17, 25, 26]. However, the algorithms as well as the corresponding analyses in [17, 25, 26] are not applicable to the bandit online learning setting when the delays are unknown.

Delayed bandit online learning. Stochastic MAB with delays has been studied in [8, 9, 24, 28]; see also [16] for multi-instance generalizations introduced to handle adversarial delays in stochastic and non-stochastic MAB settings. For non-stochastic MAB, EXP3-based algorithms were developed to handle fixed delays in [7, 23]. Although not requiring memory for extra instances, the delay in [7] and [23] must be known. A recent work [6] considers a more general non-stochastic MAB setting, where the feedback is anonymous.(1) Although the scope of [6] is general, we will see in Section 3 that the regret bound of the MAB algorithm in [6] is suboptimal.

1.2 Contributions

Our main contributions can be summarized as follows. c1) Based on a novel biased gradient estimator, a delayed exploration-exploitation exponentially weighted (DEXP3) algorithm is developed for delayed non-stochastic MAB with unknown and adversarially chosen delays; c2) Relying on a novel deterministic gradient estimator, a delayed bandit gradient descent (DBGD) algorithm is developed to handle the delayed BCO setting; and c3) A unifying analysis framework reveals that DEXP3 and DBGD provably achieve an $\mathcal{O}\big(\sqrt{T+D}\big)$ regret, where $D$ is the cumulative delay over $T$ slots. The regret bounds match those of online learning in full-information settings. Numerical tests validate the proposed DEXP3 and DBGD.

Notational conventions. Bold lowercase letters denote column vectors; $\mathbb{E}[\cdot]$ represents expectation; $\mathbb{1}(\cdot)$ denotes the indicator function; $(\cdot)^\top$ stands for vector transposition; and $\|\mathbf{x}\|$ denotes the $\ell_2$-norm of a vector $\mathbf{x}$.

2 Problem statements

Before introducing the delayed bandit learning settings, we first revisit the standard non-stochastic MAB and BCO.

2.1 MAB and BCO

Non-stochastic MAB. Consider the MAB problem with a total of $K$ arms (a.k.a. actions) [3, 5]. At the beginning of slot $t$, without knowing the loss of any arm, the learner selects an arm $a_t$ following a distribution $\mathbf{p}_t \in \Delta_K$ over all arms, where the probability simplex is defined as $\Delta_K := \{\mathbf{p} \in \mathbb{R}^K : p(k) \ge 0, \forall k;\ \sum_{k=1}^K p(k) = 1\}$. The loss $l_t(a_t)$ incurred by the selection of $a_t$ is an entry of the $K \times 1$ loss vector $\mathbf{l}_t$, and it is observed by the learner. Along with previously observed losses $\{l_s(a_s)\}_{s=1}^{t}$, it then becomes possible to find $\mathbf{p}_{t+1}$; see also Fig. 1 (a1).

Footnote 1: Anonymous feedback in MAB means that the learner observes the loss without knowing which arm it is associated with.

Figure 1: A single-slot structure of: (a1) standard MAB; (a2) non-stochastic MAB with delayed feedback; (b1) standard BCO; (b2) BCO with delayed feedback, where $\mathbf{p}_{t+1} \leftarrow \{l_s(a_s)\}_{s=1}^{t}$ means that $\mathbf{p}_{t+1}$ is updated based on the previously observed losses $\{l_s(a_s)\}_{s=1}^{t}$.

The goal is to minimize the regret, which is the difference between the expected cumulative loss of the learner and the loss of the best fixed policy in hindsight, given by
$$\mathrm{Reg}_T^{\mathrm{MAB}} := \mathbb{E}\Big[\sum_{t=1}^{T} \mathbf{p}_t^\top \mathbf{l}_t\Big] - \sum_{t=1}^{T} \mathbf{p}^{*\top} \mathbf{l}_t \qquad (1)$$
where the expectation is taken w.r.t. the possible randomness of $\mathbf{p}_t$ induced by the selection of $\{a_s\}_{s=1}^{t-1}$, while the best fixed policy $\mathbf{p}^*$ is
$$\mathbf{p}^* := \arg\min_{\mathbf{p} \in \Delta_K}\ \sum_{t=1}^{T} \mathbf{p}^\top \mathbf{l}_t. \qquad (2)$$
Specifically, if $\mathbf{p}^* = [0, \ldots, 1, \ldots, 0]^\top$, the regret is relative to the corresponding best fixed arm in hindsight.

BCO. Consider now the BCO setup with $M$-point feedback [1]. At the beginning of slot $t$, without knowing the loss, the learner selects $x_t \in \mathcal{X}$, where $\mathcal{X} \subseteq \mathbb{R}^K$ is a compact and convex set. Being able to query the function values at another $M-1$ points $\{x_{t,k} \in \mathcal{X}\}_{k=1}^{M-1}$, and with $x_{t,0} := x_t$, the loss values at $\{x_{t,k}\}_{k=0}^{M-1}$, that is, $\{f_t(x_{t,k})\}_{k=0}^{M-1}$, are observed instead of the function $f_t$. The learner leverages the revealed losses to decide the next action $x_{t+1}$; see also Fig. 1 (b1). The learner's goal is to find a sequence $\big\{\{x_{t,k}\}_{k=0}^{M-1}\big\}_{t=1}^{T}$ that minimizes the regret relative to the best fixed action in hindsight,(2) meaning
$$\mathrm{Reg}_T^{\mathrm{BCO}} := \mathbb{E}\Big[\sum_{t=1}^{T} f_t(x_t)\Big] - \sum_{t=1}^{T} f_t(x^*) \qquad (3)$$
where the expectation is taken over the sequence of random actions $\{x_s\}_{s=1}^{t-1}$. The best fixed action $x^*$ in hindsight is
$$x^* := \arg\min_{x \in \mathcal{X}}\ \sum_{t=1}^{T} f_t(x). \qquad (4)$$

Footnote 2: This definition is slightly different from that in [1]. However, we will show in Section 5.3 that the regret bound is not affected.
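To make the regret definitions concrete, the following minimal Python sketch computes the empirical MAB regret in (1)-(2) for a given loss sequence and a sequence of sampling distributions. The array shapes, the random example, and all variable names are illustrative assumptions, not part of the paper.

```python
import numpy as np

def mab_regret(losses, probs):
    """Empirical regret (1) against the best fixed arm (2).

    losses: (T, K) array with losses[t, k] = l_t(k)
    probs:  (T, K) array with probs[t] = p_t, the learner's distribution at slot t
    """
    expected_loss = np.sum(losses * probs)       # E[sum_t p_t^T l_t]
    best_fixed = np.min(losses.sum(axis=0))      # min_k sum_t l_t(k), attained at a simplex vertex
    return expected_loss - best_fixed

# Example: 3 arms over 4 slots with a uniform learner.
T, K = 4, 3
rng = np.random.default_rng(0)
losses = rng.random((T, K))
probs = np.full((T, K), 1.0 / K)
print(mab_regret(losses, probs))
```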

In both MAB and BCO settings, an online algorithm is desirable when its regret is sublinear w.r.t. the time horizon $T$, i.e., $\mathrm{Reg}_T^{\mathrm{MAB}} = o(T)$ and $\mathrm{Reg}_T^{\mathrm{BCO}} = o(T)$ [5, 15].

2.2 Delayed MAB and BCO

MAB with unknown delays. In delayed MAB, the learner still chooses an arm $a_t \sim \mathbf{p}_t$ at the beginning of slot $t$. However, the loss $l_t(a_t)$ is observed after $d_t$ slots, namely at the end of slot $t+d_t$, where the delay $d_t \ge 0$ can vary from slot to slot. In this paper, we assume that $\{d_t\}_{t=1}^T$ can be chosen adversarially by nature. Let $l_{s\to t}(a_{s\to t})$ denote the loss incurred by the selected arm $a_s$ in slot $s$ but observed at slot $t$, i.e., the learner receives the losses collected in $\mathcal{L}_t = \{l_{s\to t}(a_{s\to t}),\ s: s+d_s = t\}$ at the end of slot $t$. Note that it is possible to have $\mathcal{L}_t = \emptyset$ in certain slots, and the order of the feedback can be arbitrary, meaning it is possible to have $t_1 + d_{t_1} \ge t_2 + d_{t_2}$ when $t_1 < t_2$. In contrast to [16] however, we consider the case where the delay $d_t$ is not accessible, i.e., the learner observes only the value of $l_{s\to t}(a_{s\to t})$, but not $s$.

The learner's goal is to select $\{\mathbf{p}_t\}_{t=1}^T$ on the fly to minimize the regret defined in (1). Note that in the presence of delays, the information available to decide $\mathbf{p}_t$ is even less than in the standard MAB. Specifically, the information available to the learner when deciding $\mathbf{p}_t$ is collected in the set $\mathcal{L}_{1:t-1} := \cup_{s=1}^{t-1}\mathcal{L}_s$; see also Fig. 1 (a2). For simplicity, we assume that all feedback information is received by the end of slot $T$. This assumption incurs no loss of generality, since feedback arriving at the end of slot $T$ cannot aid the arm selection, and hence the final performance of the learner is not affected.

BCO with unknown delays. For delayed BCO, the learner still plays $x_t$ while querying $\{x_{t,k}\}_{k=1}^{M-1}$ at the beginning of slot $t$. However, the loss as well as the query responses $\{f_t(x_{t,k})\}_{k=0}^{M-1}$ are observed at the end of slot $t+d_t$. Similarly, let $f_{s\to t}(x_{s\to t})$ denote the loss incurred in slot $s$ but observed at $t$, so that the feedback set at the end of slot $t$ is $\mathcal{L}_t = \big\{\{f_{s\to t}(x_{s\to t,k})\}_{k=0}^{M-1},\ s: s+d_s = t\big\}$. To find a desirable $x_t$, the learner relies on the history $\mathcal{L}_{1:t-1} := \cup_{s=1}^{t-1}\mathcal{L}_s$, with the goal of minimizing the regret in (3).

Why do existing algorithms fail with unknown delays? Algorithms for the standard non-delayed MAB, such as EXP3, cannot be applied to delayed MAB with unknown delays. Recall that in settings without delay, to deal with the partially observed $\mathbf{l}_t$, EXP3 relies on an importance-sampling type of loss estimate given by [3]
$$\hat{l}_t(k) = \frac{l_t(a_t)\,\mathbb{1}(a_t = k)}{p_t(k)}, \quad k = 1, \ldots, K. \qquad (5)$$
The denominator as well as the indicator function in (5) ensure unbiasedness of $\hat{l}_t(k)$. Leveraging the estimated loss, the distribution $\mathbf{p}_{t+1}$ is obtained by
$$p_{t+1}(k) = \frac{p_t(k)\exp\big(-\eta\,\hat{l}_t(k)\big)}{\sum_{j=1}^K p_t(j)\exp\big(-\eta\,\hat{l}_t(j)\big)}, \quad \forall k \qquad (6)$$
where $\eta$ is the learning rate. Consider now that the loss $l_{s\to t}(a_{s\to t})$ with delay $d_s$ is observed at $t = s+d_s$. To recover the unbiased estimator $\hat{l}_{s\to t}(k)$ in (5), $p_s(k)$ must be known. However, since $d_s$ is not revealed, even if the learner can store the previous probability distributions, it is not clear how to obtain the loss estimator.

Knowing the delay is instrumental for the gradient estimator in BCO as well. For non-delayed single-point feedback BCO [13], i.e., $M = 1$, since only one value of the loss instead of the full gradient is observed per slot, the idea is to draw $u_t$ uniformly from the surface of the unit ball in $\mathbb{R}^K$, and form the gradient estimate as
$$g_t = \frac{K}{\delta}\,f_t(x_t + \delta u_t)\,u_t \qquad (7)$$
where $\delta$ is a small constant. The next action is obtained using a standard online projected gradient descent iteration leveraging the estimated gradient, that is
$$x_{t+1} = \Pi_{\mathcal{X}_\delta}\big[x_t - \eta\,g_t\big] \qquad (8)$$
where $\mathcal{X}_\delta$ is the shrunk feasible set that ensures $x_t + \delta u_t$ is feasible.
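For reference, here is a minimal Python sketch of the standard (non-delayed) building blocks just described: the importance-weighted loss estimate (5) with the exponential-weights update (6), and the single-point gradient estimate (7) with the projected step (8). The step sizes, the Euclidean-ball projection standing in for $\Pi_{\mathcal{X}_\delta}$, and all variable names are illustrative assumptions.

```python
import numpy as np

def exp3_step(p, arm, loss, eta):
    """One EXP3 update: estimator (5) followed by the exponential weighting (6)."""
    K = len(p)
    l_hat = np.zeros(K)
    l_hat[arm] = loss / p[arm]            # importance-weighted loss estimate (5)
    w = p * np.exp(-eta * l_hat)          # exponential weighting (6)
    return w / w.sum()

def one_point_gradient(f_value, u, delta, K):
    """Single-point gradient estimate (7): g_t = (K/delta) * f_t(x_t + delta*u_t) * u_t."""
    return (K / delta) * f_value * u

def project_ball(x, radius=1.0):
    """Euclidean projection onto a ball (stands in for the projection onto X_delta)."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def bgd_step(x, f_value, u, delta, eta, radius=1.0):
    """One projected step (8) using the estimated gradient."""
    g = one_point_gradient(f_value, u, delta, len(x))
    return project_ball(x - eta * g, radius)
```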

While $g_t$ serves as a nearly unbiased estimator of $\nabla f_t(x_t)$, the unknown delay creates a mismatch between the feedback $f_{s\to t}(x_{s\to t} + \delta u_{s\to t})$ and $g_{s\to t}$. Specifically, given the feedback $f_{s\to t}(x_{s\to t} + \delta u_{s\to t})$, since $d_s$ is unknown, the learner does not know $u_{s\to t}$ and hence cannot form $g_{s\to t}$ in (7). Similar arguments also hold for BCO with multi-point feedback. Therefore, performing delayed bandit learning with unknown delays is challenging, and has not been explored.

3 DEXP3 for Delayed MAB

We start with the non-stochastic MAB setup, which is randomized in nature because an arm $a_t$ is chosen randomly per slot according to a $K \times 1$ probability mass vector $\mathbf{p}_t$. In this section, we show that for the MAB problem, as long as the unknown delay is bounded, the randomized algorithm that we term Delayed EXP3 (DEXP3) can cope with unknown delays based only on single-point feedback, through a biased loss estimator, and is guaranteed to attain a desirable regret.

Recall that the feedback at slot $t$ includes losses incurred at slots $s_n$, $n = 1, 2, \ldots, |\mathcal{L}_t|$, where $\mathcal{L}_t := \{l_{s_n\to t}(a_{s_n\to t}) : s_n = t - d_{s_n}\}$. Once $\mathcal{L}_t$ is revealed, the learner estimates $\mathbf{l}_{s_n\to t}$ by scaling the observed loss according to $\mathbf{p}_t$ at the current slot. For each $l_{s_n\to t}(a_{s_n\to t}) \in \mathcal{L}_t$, the estimator of the loss vector $\mathbf{l}_{s_n\to t}$ is
$$\hat{l}_{s_n\to t}(k) = \frac{l_{s_n\to t}(k)\,\mathbb{1}(a_{s_n\to t} = k)}{p_t(k)}, \quad \forall k. \qquad (9)$$
It is worth mentioning that the index $s_n$ in (9) is only used for the analysis; during the implementation, there is no need to know $s_n$. In contrast to EXP3 [3] and its variants for delayed MAB with known delays [7, 16], our estimator of $l_{s_n\to t}(k)$ in (9) turns out to be biased, since $a_{s_n\to t}$ is chosen according to $\mathbf{p}_{s_n}$ and not according to $\mathbf{p}_t$, that is,
$$\mathbb{E}_{a_{s_n\to t}}\big[\hat{l}_{s_n\to t}(k)\big] = \frac{l_{s_n\to t}(k)\,p_{s_n}(k)}{p_t(k)} \neq l_{s_n\to t}(k). \qquad (10)$$
Since $\mathcal{L}_t$ may contain multiple rounds of feedback, leveraging each $\hat{\mathbf{l}}_{s_n\to t}$, the learner must update $|\mathcal{L}_t|$ times to obtain $\mathbf{p}_{t+1}$. Intuitively, to upper bound the bias in (10), an upper bound on $p_{s_n}(k)/p_t(k)$ is required, which amounts to a lower bound on $p_t(k)$.

On the other hand, the lower bound on $p_t(k)$ cannot be too large, to avoid incurring extra regret. Different from EXP3, our DEXP3 ensures a lower bound on $p_t(k)$ by introducing an intermediate weight vector $\mathbf{w}_t$ that evaluates the historical performance of each arm. Let $n$ denote the index of the inner-loop update at slot $t$, starting from $\mathbf{p}_t^0 := \mathbf{p}_t$. For each $l_{s_n\to t}(a_{s_n\to t}) \in \mathcal{L}_t$, the learner first updates $\mathbf{w}_t^n$ using the estimated loss $\hat{\mathbf{l}}_{s_n\to t}$ as
$$w_t^n(k) = p_t^{n-1}(k)\exp\Big(-\eta\min\big\{\delta_1, \hat{l}_{s_n\to t}(k)\big\}\Big), \quad \forall k \qquad (11)$$
where $\eta$ is the learning rate, and $\delta_1$ serves as an upper bound on $\hat{l}_{s_n\to t}(k)$ that controls its bias. However, to confine the extra regret incurred by introducing $\delta_1$, a carefully selected $\delta_1$ should ensure that the probability of $\hat{l}_{s_n\to t}(k)$ exceeding $\delta_1$ is small enough. Then the learner finds $\bar{\mathbf{w}}_t^n$ by a trimmed normalization as
$$\bar{w}_t^n(k) = \max\Big\{\frac{w_t^n(k)}{\sum_{j=1}^K w_t^n(j)},\ \frac{\delta_2}{K}\Big\}, \quad \forall k. \qquad (12)$$
Update (12) ensures that $\bar{w}_t^n(k)$ is lower bounded by $\delta_2/K$. Finally, the learner normalizes $\bar{\mathbf{w}}_t^n$ to obtain $\mathbf{p}_t^n$ as
$$p_t^n(k) = \frac{\bar{w}_t^n(k)}{\sum_{j=1}^K \bar{w}_t^n(j)}, \quad \forall k. \qquad (13)$$
It can be shown that $p_t^n(k)$ is lower bounded by $p_t^n(k) \ge \frac{\delta_2}{K(1+\delta_2)}$ [cf. (35) in the supplementary material]. After all the elements in $\mathcal{L}_t$ have been used, the learner finds $\mathbf{p}_{t+1}$ via
$$\mathbf{p}_{t+1} = \mathbf{p}_t^{|\mathcal{L}_t|}. \qquad (14)$$
Furthermore, if $\mathcal{L}_t = \emptyset$, the learner directly reuses the previous distribution, i.e., $\mathbf{p}_{t+1} = \mathbf{p}_t$, and chooses an arm accordingly. In a nutshell, DEXP3 is summarized in Alg. 1 (a code sketch follows Remark 1). As will be shown in Sec. 5.2, if the delay $d_t$ is bounded by a constant $\bar{d}$, DEXP3 guarantees a regret of $\mathcal{O}\big(\sqrt{T+D}\big)$, where $D = \sum_{t=1}^T d_t$ is the overall delay.

Algorithm 1 DEXP3
1: Initialize: $p_1(k) = 1/K, \forall k$.
2: for $t = 1, \ldots, T$ do
3:     Select an arm $a_t \sim \mathbf{p}_t$.
4:     Observe the feedback collected in the set $\mathcal{L}_t$.
5:     for $n = 1, 2, \ldots, |\mathcal{L}_t|$ do
6:         Estimate $\hat{\mathbf{l}}_{s_n\to t}$ via (9) if $l_{s_n\to t}(a_{s_n\to t}) \in \mathcal{L}_t$.
7:         Update $\mathbf{p}_t^n$ via (11)-(13).
8:     end for
9:     Obtain $\mathbf{p}_{t+1}$ via (14).
10: end for

Figure 2: An example of DEXP3 with $\mathcal{L}_t$ including the losses incurred in slots $s_1$, $s_2$, and $s_3$, and $\mathcal{L}_{t+1} = \emptyset$.

Remark 1. The recent composite loss wrapper algorithm (abbreviated as CLW) in [6] can also be applied to the delayed MAB problem. However, CLW is designed for a more general setting with composite and anonymous feedback, and its efficiency drops when the previous action $a_{s\to t}$ is known. The main differences between DEXP3 and CLW are: i) the loss estimators are different; ii) DEXP3 updates $\mathbf{p}_t$ in every slot, while CLW updates only once every $\mathcal{O}(\bar{d})$ slots, thus requiring a larger learning rate; and iii) DEXP3 does not involve $\bar{d}$ in the loss estimator, leading to a tighter regret bound in terms of delay, $\mathcal{O}\big(\sqrt{T+D}\big)$, in contrast to $\mathcal{O}\big(\sqrt{(\bar{d}+1)T}\big)$ for CLW. As corroborated by simulations, DEXP3 outperforms CLW in the considered setting.
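To make the per-slot mechanics concrete, the following minimal Python sketch implements one slot of DEXP3, i.e., the biased estimator (9) followed by the capped exponential weighting (11), the trimmed normalization (12), and the re-normalizations (13)-(14). The feedback data structure and all variable names are illustrative assumptions, not part of the paper.

```python
import numpy as np

def dexp3_slot_update(p_t, feedback, eta, delta1, delta2):
    """One slot of DEXP3 (Alg. 1, lines 5-9).

    p_t:      current distribution p_t, shape (K,)
    feedback: list of (arm, loss) pairs received at the end of slot t;
              the slot s_n in which each loss was incurred is NOT needed
    """
    K = len(p_t)
    p = p_t.copy()
    for arm, loss in feedback:                               # inner loop over |L_t| feedback values
        l_hat = np.zeros(K)
        l_hat[arm] = loss / p[arm]                           # biased estimator (9), scaled by the current p_t^n
        w = p * np.exp(-eta * np.minimum(delta1, l_hat))     # capped exponential weights (11)
        w_bar = np.maximum(w / w.sum(), delta2 / K)          # trimmed normalization (12)
        p = w_bar / w_bar.sum()                              # normalization (13)
    return p                                                 # p_{t+1} as in (14); unchanged if L_t is empty
```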

4 DBGD for Delayed BCO

In this section, we develop an algorithm that we term Delayed Bandit Gradient Descent (DBGD), based on a deterministic approximation of the gradient obtained using $M = K+1$ rounds of feedback. DBGD enjoys a regret of $\mathcal{O}\big(\sqrt{T+D}\big)$ for BCO problems even when the delays are unknown. In practice, $(K+1)$-point feedback can be obtained i) when it is possible to evaluate the loss function easily; and ii) when the slot duration is long, meaning that the algorithm has enough time to query multiple points from the oracle [27].

The intuition behind our deterministic approximation originates from the definition of the gradient [1]. Consider for example $x \in \mathbb{R}^2$ and the gradient $\nabla f(x) = \big[[\nabla f(x)]_1, [\nabla f(x)]_2\big]^\top$, where
$$[\nabla f(x)]_1 = \lim_{\delta \to 0}\frac{f(x+\delta e_1) - f(x)}{\delta}; \qquad [\nabla f(x)]_2 = \lim_{\delta \to 0}\frac{f(x+\delta e_2) - f(x)}{\delta}. \qquad (15)$$
Similarly, for a $K$-dimensional $x$, if $K+1$ rounds of feedback are available, the gradient can be approximated as
$$g_t = \frac{1}{\delta}\sum_{k=1}^{K}\big(f_t(x_t + \delta e_k) - f_t(x_t)\big)\,e_k \qquad (16)$$
where $e_k := [0, \ldots, 1, \ldots, 0]^\top$ denotes the unit vector with $k$-th entry equal to 1. Intuitively, a smaller $\delta$ improves the approximation accuracy. When $f_t$ is further assumed to be linear, $g_t$ in (16) is unbiased. In this case, the gradient of $f_t$ can be recovered exactly, and thus the setup boils down to a delayed one with full information. However, if $f_t$ is generally convex, $g_t$ in (16) is biased.

Leveraging the gradient in (16), we are ready to introduce the DBGD algorithm. Per slot $t$, the learner plays $x_t$ and also queries $f_t(x_t + \delta e_k)$, $k \in \{1, \ldots, K\}$. To ensure that $x_t + \delta e_k$ is feasible, $x_t$ should be confined to the set $\mathcal{X}_\delta := \{x : \frac{1}{1-\delta}x \in \mathcal{X}\}$. Note that if $0 \le \delta < 1$, $\mathcal{X}_\delta$ is still convex. Let $n = 1, 2, \ldots, |\mathcal{L}_t|$ index the inner-loop updates at slot $t$. At the end of slot $t$, the learner receives the observations $\mathcal{L}_t = \big\{\{f_{s_n\to t}(x_{s_n\to t}), f_{s_n\to t}(x_{s_n\to t} + \delta e_k), k = 1, \ldots, K\},\ s_n: s_n + d_{s_n} = t\big\}$. Per received feedback value, the learner approximates the gradient via (16); thus, for each $\{f_{s_n\to t}(x_{s_n\to t}), f_{s_n\to t}(x_{s_n\to t} + \delta e_k), k = 1, \ldots, K\}$, we have
$$g_{s_n\to t} = \frac{1}{\delta}\sum_{k=1}^{K}\big(f_{s_n\to t}(x_{s_n\to t} + \delta e_k) - f_{s_n\to t}(x_{s_n\to t})\big)\,e_k. \qquad (17)$$
With $g_{s_n\to t}$ and $x_t^0 := x_t$, the learner updates $|\mathcal{L}_t|$ times to obtain $x_{t+1}$ by
$$x_t^n = \Pi_{\mathcal{X}_\delta}\big[x_t^{n-1} - \eta\,g_{s_n\to t}\big], \quad n = 1, \ldots, |\mathcal{L}_t| \qquad (18a)$$
$$x_{t+1} = x_t^{|\mathcal{L}_t|}. \qquad (18b)$$
If no feedback is received at slot $t$, the learner simply sets $x_{t+1} = x_t$. DBGD is summarized in Algorithm 2; a code sketch follows.

Algorithm 2 DBGD
1: Initialize: $x_1 = \mathbf{0}$.
2: for $t = 1, \ldots, T$ do
3:     Play $x_t$, and also query $x_t + \delta e_k$, $k = 1, \ldots, K$.
4:     Observe the feedback collected in the set $\mathcal{L}_t$.
5:     if $\mathcal{L}_t = \emptyset$ then set $x_{t+1} = x_t$.
6:     else estimate the gradient $g_{s_n\to t}$ via (17) if $s_n + d_{s_n} = t$.
7:         Update $x_{t+1}$ via (18).
8:     end if
9: end for

Figure 3: An example of DBGD with $\mathcal{L}_t$ including the losses incurred in slots $s_1$, $s_2$, and $s_3$, and $\mathcal{L}_{t+1} = \emptyset$.
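A minimal Python sketch of DBGD (Alg. 2) is given below, combining the $(K+1)$-point gradient approximation (17) with the projected inner-loop updates (18). The Euclidean-ball projection standing in for $\Pi_{\mathcal{X}_\delta}$, the feedback format, and the parameter values are illustrative assumptions.

```python
import numpy as np

def k_plus_1_point_gradient(f_x, f_perturbed, delta):
    """Gradient approximation (16)/(17): entry k is (f(x + delta*e_k) - f(x)) / delta."""
    return (np.asarray(f_perturbed) - f_x) / delta

def project_ball(x, radius=1.0):
    """Euclidean projection, standing in for Pi_{X_delta} when X is a ball."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def dbgd_slot_update(x_t, feedback, eta, delta, radius=1.0):
    """One slot of DBGD (Alg. 2, lines 5-8).

    x_t:      current point
    feedback: list of (f_x, f_perturbed) pairs, where f_x = f_s(x_s) and
              f_perturbed[k] = f_s(x_s + delta*e_k); arrival order is arbitrary
    """
    if not feedback:
        return x_t                                            # L_t empty: reuse the previous point
    x = x_t.copy()
    for f_x, f_perturbed in feedback:
        g = k_plus_1_point_gradient(f_x, f_perturbed, delta)  # estimate (17)
        x = project_ball(x - eta * g, radius)                 # projected step (18a)
    return x                                                  # x_{t+1} as in (18b)
```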

5 A unified framework for regret analysis

In this section, we show that both DEXP3 and DBGD can guarantee an $\mathcal{O}\big(\sqrt{T+D}\big)$ regret. Our analysis considerably broadens that in [17], which was originally developed for delayed online learning with full-information feedback.

5.1 Mapping from real to virtual slots

Analyzing the recursion involving the consecutive variables $\mathbf{p}_t$ and $\mathbf{p}_{t+1}$ in DEXP3 (or $x_t$ and $x_{t+1}$ in DBGD) is challenging because, unlike in standard settings, the number of feedback rounds varies over slots. We bypass this variable feedback using the notion of a virtual slot. Over the real time horizon there are in total $T$ virtual slots, where the $\tau$-th virtual slot is associated with the $\tau$-th loss value fed back. Recall that the feedback received at the end of slot $t$ is $\mathcal{L}_t$. With the overall feedback received up to the end of slot $t-1$ denoted by $\mathcal{L}_{1:t-1} := \cup_{v=1}^{t-1}\mathcal{L}_v$, the virtual slot $\tau$ corresponding to the first feedback value received at slot $t$ is $\tau = |\mathcal{L}_{1:t-1}| + 1$. In what follows, we use MAB as an example to elaborate on this mapping, but the BCO setting can be argued likewise.

Figure 4: An example of the mapping from real slots (solid line) to virtual slots (dotted line). At the end of slot $t$, the feedback is $\mathcal{L}_t = \{l_{s_1\to t}(a_{s_1\to t}), l_{s_2\to t}(a_{s_2\to t})\}$; "v.s." stands for virtual slot.

When multiple rounds of feedback are received over a real slot $t$, DEXP3 updates $\mathbf{p}_t$ $|\mathcal{L}_t|$ times to obtain $\mathbf{p}_{t+1}$; see (11)-(14). Using the notion of virtual slots, these $|\mathcal{L}_t|$ updates are performed over $|\mathcal{L}_t|$ consecutive virtual slots. Taking Fig. 4 as an example, when $\mathbf{p}_t^1$ is obtained by using an estimated loss $\hat{\mathbf{l}}_{s_1\to t}$ [cf. (9) and (11)-(13)], this update is mapped to the virtual slot $\tau = |\mathcal{L}_{1:t-1}| + 1$, where $\mathbf{l}_\tau := \hat{\mathbf{l}}_{s_1\to t}$ is adopted to obtain $\mathbf{p}_{\tau+1} := \mathbf{p}_t^1$. Similarly, when $\mathbf{p}_t^2$ is obtained using $\hat{\mathbf{l}}_{s_2\to t}$, the virtual slot yields $\mathbf{p}_{\tau+2} := \mathbf{p}_t^2$ via $\mathbf{l}_{\tau+1} := \hat{\mathbf{l}}_{s_2\to t}$. That is to say, at real slot $t$, for $n = 1, \ldots, |\mathcal{L}_t|$, each update from $\mathbf{p}_t^{n-1}$ to $\mathbf{p}_t^n$ using the estimated loss $\hat{\mathbf{l}}_{s_n\to t}$ is mapped to an update at the virtual slot $\tau+n-1$, where $\mathbf{l}_{\tau+n-1} := \hat{\mathbf{l}}_{s_n\to t}$ is employed to obtain $\mathbf{p}_{\tau+n} := \mathbf{p}_t^n$ from $\mathbf{p}_{\tau+n-1} = \mathbf{p}_t^{n-1}$. According to this real-to-virtual slot mapping, we have $\mathbf{p}_{\tau+|\mathcal{L}_t|} = \mathbf{p}_{t+1}$; see also Fig. 4 for two examples. As we will show later, it is convenient to analyze the recursion between two consecutive $\mathbf{p}_\tau$ and $\mathbf{p}_{\tau+1}$, which is the key to the ensuing regret analysis.

With regard to DBGD, since multiple feedback rounds are possible per real slot $t$, we again map the $|\mathcal{L}_t|$ updates at a real slot [cf. (18)] to $|\mathcal{L}_t|$ virtual slots. The mapping is exactly the same as that in DEXP3, that is, per virtual slot $\tau$, $g_\tau$ is used to obtain $x_{\tau+1}$. From this real-to-virtual mapping vantage point, DEXP3 and DBGD can be viewed as inexact EXP3 and BGD running on the virtual time horizon with only one feedback value per virtual slot. That is to say, instead of analyzing the regret on the real time horizon, which can involve multiple feedback rounds per slot, we can alternatively turn to the virtual slots, where there is only one update per slot.
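The bookkeeping behind this mapping is light: feedback is simply processed in arrival order, and the $\tau$-th received value triggers the $\tau$-th (virtual-slot) update. The hedged Python sketch below tracks this counter for either algorithm; the event format, the `update` callback, and all names are illustrative assumptions.

```python
def run_virtual_updates(T, feedback_per_slot, update):
    """Replay delayed feedback on the virtual time line.

    feedback_per_slot[t] is the list L_t of feedback values arriving at the end
    of real slot t; `update` consumes one feedback value (one virtual slot).
    """
    tau = 0                                   # virtual-slot counter
    for t in range(T):
        for fb in feedback_per_slot[t]:       # |L_t| may be 0, 1, or several
            tau += 1                          # the tau-th feedback value defines virtual slot tau
            update(fb)                        # exactly one update per virtual slot
    return tau                                # equals T once all feedback has arrived
```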

5.2 Regret Analysis of DEXP3

We now turn to the regret analysis of DEXP3. The analysis builds on the following assumptions.

Assumption 1. The losses satisfy $\max_{t,k} l_t(k) \le 1$.

Assumption 2. The delay $d_t$ is bounded, i.e., $\max_t d_t \le \bar{d}$.

Assumption 1 requires the loss to be upper bounded, which is common in MAB; see also [3, 5, 15]. Assumption 2 asks for the delay to be bounded, which also appears in previous analyses of the delayed online learning setup [6, 7, 16].

Let us first consider the change of $p_\tau(k)$ after one update in a virtual slot.

Lemma 1. If the parameters are properly selected such that $1 - \delta_2 - \eta\delta_1 \ge 0$, the following inequality holds
$$\frac{p_{\tau-1}(k)}{p_\tau(k)} \le \frac{1}{1 - \delta_2 - \eta\delta_1}, \quad \forall k, \tau. \qquad (19)$$
Proof. See Sec. B.1 of the supplementary document.

Lemma 2. If the parameters are properly selected such that $1 - \eta\delta_1 \ge 0$, the following inequality holds
$$\frac{p_\tau(k)}{p_{\tau-1}(k)} \le \max\Big\{1+\delta_2,\ \frac{1}{1-\eta\delta_1}\Big\}, \quad \forall k, \tau. \qquad (20)$$
Proof. See Sec. B.2 of the supplementary document.

Lemmas 1 and 2 assert that both $p_{\tau-1}(k)/p_\tau(k)$ and $p_\tau(k)/p_{\tau-1}(k)$ are bounded deterministically, that is, regardless of the arm selection and the observed loss. These bounds are critical for deriving the regret. The final cornerstone is the regret in the virtual slots, specified in the following lemma.

Lemma 3. For a given sequence $\{\mathbf{l}_\tau\}_{\tau=1}^T$, the following relation holds
$$\sum_{\tau=1}^T(\mathbf{p}_\tau - \mathbf{p})^\top\min\{\mathbf{l}_\tau, \delta_1\mathbf{1}\} \le \frac{T\ln(1+\delta_2) + \ln K}{\eta} + \frac{\eta}{2}\sum_{\tau=1}^T\sum_{k=1}^K p_\tau(k)\big[l_\tau(k)\big]^2 \qquad (21)$$
where $\mathbf{1}$ is a $K \times 1$ vector of all ones, and $\mathbf{p} \in \Delta_K$.
Proof. See Sec. B.3 of the supplementary document.

Leveraging Lemma 3, the regret of DEXP3 follows.

Theorem 1. Suppose Assumptions 1 and 2 hold. Defining the overall delay $D := \sum_{t=1}^T d_t$, and choosing $\delta_2 = \frac{1}{T+D}$, $\eta = \sqrt{\frac{\ln K}{T+D}}$, and $\delta_1 = \frac{1}{2\bar{d}(\eta+\delta_2)}$, DEXP3 guarantees that $\mathrm{Reg}_T^{\mathrm{MAB}}$ in (1) satisfies
$$\mathrm{Reg}_T^{\mathrm{MAB}} = \mathcal{O}\big(\sqrt{T+D}\big). \qquad (22)$$
Proof. See Sec. B.4 of the supplementary document.

Theorem 1 recovers the regret bound for delayed non-stochastic MAB with fixed yet known delay [16, 23]. Furthermore, the established $\mathcal{O}\big(\sqrt{T+D}\big)$ regret also addresses the open problem of non-stochastic MAB with unknown and possibly adversarial delays [7].

5.3 Regret analysis of DBGD

Our analysis builds on the following assumptions.

Assumption 3. For any $t$, the loss function $f_t$ is convex.

Assumption 4. For any $t$, $f_t$ is $L$-Lipschitz and $\beta$-smooth.

Assumption 5. The feasible set $\mathcal{X}$ contains $\epsilon\mathcal{B}$, where $\mathcal{B}$ is the unit ball and $\epsilon > 0$ is a predefined parameter. The diameter of $\mathcal{X}$ is $R$; that is, $\max_{x,y \in \mathcal{X}}\|x-y\| \le R$.

Assumptions 3-5 are common in online learning [15]. Assumption 4 requires that $f_t$ be $L$-Lipschitz and $\beta$-smooth, which is needed to bound the bias of the estimator $g_{s\to t}$ [1]. Assumption 5 is also typical in BCO [1, 12, 13, 15]. In addition, the counterpart of Assumption 1 in BCO readily follows from Assumptions 4 and 5.

To start, the quality of the gradient estimator $g_{s\to t}$ is first evaluated. As stated, when $f_t$ is not linear, the estimator $g_{s\to t}$ is biased, but its bias is bounded.

Lemma 4. If Assumption 4 holds, then for every $f_{s\to t}(x_{s\to t})$, the corresponding estimator (16) satisfies
$$\|g_{s\to t}\| \le \sqrt{K}L, \qquad \text{and} \qquad \|g_{s\to t} - \nabla f_{s\to t}(x_{s\to t})\| \le \frac{\beta\delta\sqrt{K}}{2}. \qquad (23)$$
Proof. See Sec. C.1 of the supplementary document.

Lemma 4 suggests that with $\delta$ small enough, the bias of $g_{s\to t}$ will not be too large. The following lemma then shows the relation among $g_\tau$, $x_\tau$, and $x_{\tau+1}$ in a virtual slot.

Lemma 5. Under Assumptions 4 and 5, the update in a virtual slot $\tau$ guarantees that for any $x \in \mathcal{X}_\delta$, we have
$$g_\tau^\top(x_\tau - x) \le \frac{\|x_\tau - x\|^2 - \|x_{\tau+1} - x\|^2}{2\eta} + \frac{\eta K L^2}{2}. \qquad (24)$$
Proof. See Sec. C.2 of the supplementary document.

Lemma 5 is the counterpart of the gradient descent estimate in the non-delayed and full-information setting [15, Theorem 3.1], and demonstrates that DBGD is BGD running on the virtual slots. Finally, leveraging these results, the regret bound follows.

Theorem 2. Suppose Assumptions 3-5 hold. Choosing $\delta = \mathcal{O}\big((T+D)^{-1}\big)$ and $\eta = \mathcal{O}\big((T+D)^{-1/2}\big)$, DBGD guarantees that the regret is bounded, that is,
$$\mathrm{Reg}_T^{\mathrm{BCO}} = \mathbb{E}\Big[\sum_{t=1}^T f_t(x_t)\Big] - \sum_{t=1}^T f_t(x^*) = \mathcal{O}\big(\sqrt{T+D}\big) \qquad (25)$$
where $D := \sum_{t=1}^T d_t$ is the overall delay.
Proof. See Sec. C.3 of the supplementary document.

For the slightly different regret definition in [1], DBGD achieves the same regret bound.

Corollary 1. Upon defining $x_{t,0} := x_t$ and $x_{t,k} := x_t + \delta e_k$, and choosing $\delta = \mathcal{O}\big((T+D)^{-1}\big)$ and $\eta = \mathcal{O}\big((T+D)^{-1/2}\big)$, DBGD also guarantees that
$$\mathbb{E}\Big[\frac{1}{K+1}\sum_{t=1}^T\sum_{k=0}^{K} f_t(x_{t,k})\Big] - \sum_{t=1}^T f_t(x^*) = \mathcal{O}\big(\sqrt{T+D}\big). \qquad (26)$$
Proof. See Sec. C.4 of the supplementary document.

The $\mathcal{O}\big(\sqrt{T+D}\big)$ regret in Theorem 2 and Corollary 1 recovers the bound of delayed online learning in the full-information setup [17, 18, 25], with only bandit feedback.

6 Numerical tests

In this section, experiments are conducted to corroborate the validity of the novel DEXP3 and DBGD schemes. In the synthetic data tests, we consider $T = 2{,}000$ slots. Delays are generated periodically with period $\{1, 2, 1, 0, 3, 0, 2\}$, with the delays of the last few slots slightly modified to ensure that all feedback arrives by the end of slot $T = 2{,}000$, resulting in the overall delay $D = 2{,}569$.

DEXP3 synthetic tests. Consider $K = 5$ arms and losses generated with a sudden change. Specifically, for $t \in [1, 500]$ we have $l_t(k) = 0.4\,k\cos t$ per arm $k$, while for the rest of the slots $l_t(k) = 0.2\,k\sin t$. To benchmark the novel DEXP3, we use: i) the standard EXP3 for non-delayed MAB [3]; ii) BOLD for delayed MAB with known delay [16]; and iii) CLW, which deals with the more difficult setting in [6]. The instantaneous accumulated regret normalized by $T$ versus the time slots is plotted in Fig. 5 (a). The gap between BOLD and EXP3 illustrates that even with a known delay, the learner suffers extra regret. The small gap between DEXP3 and BOLD further demonstrates that the estimator bias [cf. (9)] causes a slightly larger regret, which is the price paid for the unknown delay. Compared with CLW, DEXP3 performs significantly better, since DEXP3 can leverage the non-anonymous feedback that CLW does not exploit.

Figure 5: (a) Regret of DEXP3 using synthetic data; (b) regret of DEXP3 using real data. (Curves: EXP3, BOLD, DEXP3, CLW; axes: normalized regret versus time slot.)
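The synthetic delay sequence and MAB losses described above can be reproduced along the following lines. The exact within-period delay pattern, the end-of-horizon trimming, and the loss formulas are reconstructed assumptions used only for illustration; the paper states the period, the change point, and the totals $T = 2{,}000$ and $D = 2{,}569$.

```python
import numpy as np

def make_delays(T, pattern=(1, 2, 1, 0, 3, 0, 2)):
    """Periodic delays d_t; trim near the horizon so every loss arrives by slot T."""
    d = np.array([pattern[t % len(pattern)] for t in range(T)])
    for t in range(T):
        d[t] = min(d[t], T - 1 - t)   # feedback for slot t must arrive within the horizon
    return d

def make_losses(T=2000, K=5, change_point=500):
    """Synthetic MAB losses with a sudden change at t = change_point."""
    t = np.arange(1, T + 1)[:, None]
    k = np.arange(1, K + 1)[None, :]
    return np.where(t <= change_point, 0.4 * k * np.cos(t), 0.2 * k * np.sin(t))

delays = make_delays(2000)
print(delays.sum())   # total delay D accumulated over the horizon
```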

DEXP3 real tests. We also tested DEXP3 using the Jester Online Joke Recommender System dataset [14], where $T = 24{,}983$ users rate $K = 100$ different jokes from 0 (not funny) to 1 (very funny). The goal is to recommend one joke per slot $t$ to amuse the users, and the system performance is evaluated by $l_t = 1 - (\text{the score of the recommended joke})$. In this test, we assign a random score within the range $[0, 1]$ to the missing entries of the dataset. The delay is generated periodically as in the synthetic test, resulting in $D = 32{,}119$. Similar to the synthetic test, it can be observed in Fig. 5 (b) that DEXP3 incurs a slightly larger regret than BOLD due to the unknown delay, but outperforms the recently developed CLW.

DBGD synthetic tests. Consider $K = 5$, and let the feasible set $\mathcal{X} \subset \mathbb{R}^5$ be the unit ball, i.e., $\mathcal{X} := \{x : \|x\| \le 1\}$. The loss function at slot $t$ is generated as $f_t(x) = a_t\|x\|^2 + \mathbf{b}_t^\top x$, where $a_t = \cos(3t) + 3$, while the entries of $\mathbf{b}_t$ are $b_t(1) = \sin t + 1$, $b_t(2) = \cos t$, $b_t(3) = \sin t$, $b_t(4) = \sin t$, and $b_t(5) = 2$. To see the influence of the bandit feedback and the unknown delay, we consider the following benchmarks: i) the standard OGD [30] in the full-information and non-delayed setting; ii) the $(K+1)$-point feedback BCO [1] for non-delayed BCO; iii) SOLID for delayed full-information OCO [17]; and iv) CLW-BCO, with the inner algorithm relying on the $(K+1)$-point feedback BCO [6]. The instantaneous accumulated regret normalized by $T$ versus the time slots is plotted in Fig. 6 (a). This test shows that DBGD performs almost as well as SOLID, and the gap between DBGD/SOLID and OGD/(K+1)-BCO is due to the delay. The regret of DBGD again significantly outperforms that of CLW-BCO, demonstrating the efficiency of DBGD.

Figure 6: (a) Regret of DBGD using synthetic data; (b) regret of DBGD using real data. (Curves: OGD, SOLID, (K+1)-BCO, DBGD, CLW-BCO; axes: normalized regret versus time slot.)

DBGD real tests. To further illustrate the merits of DBGD, we conduct tests on online regression applied to a yacht hydrodynamics dataset [10], which contains $T = 308$ data points with $K = 6$ features. Per slot $t$, the regressor $x_t \in \mathbb{R}^6$ predicts based on the feature vector $w_t$ before its label $y_t$ is revealed. The loss function for slot $t$ is $f_t(x_t) = \frac{1}{2}(y_t - x_t^\top w_t)^2$. The delay is generated periodically as before, and cumulatively it is $D = 394$. The instantaneous accumulated regret normalized by $T$ versus the time slots is plotted in Fig. 6 (b). Again, DBGD outperforms CLW-BCO considerably, and comparing the regret of DBGD with that of SOLID shows the influence that the delay has under bandit feedback.

7 Conclusions

Bandit online learning with unknown delays, including non-stochastic MAB and BCO, was studied in this paper. Different from bandit online learning settings where the experienced delay is known, the unknown delay prevents the simple gradient estimates needed by the iterative algorithms. To address this issue, a biased loss estimator as well as a deterministic gradient estimator were developed for non-stochastic MAB and BCO, respectively. Leveraging the proposed estimators, the so-termed DEXP3 and DBGD algorithms were developed. An $\mathcal{O}\big(\sqrt{T+D}\big)$ regret was analytically established for both DEXP3 and DBGD, matching the regret bound of delayed online learning in full-information settings. Numerical tests on synthetic and real datasets confirmed the performance gain of DEXP3 and DBGD relative to state-of-the-art approaches.

References

[1] A. Agarwal, O. Dekel, and L. Xiao, "Optimal algorithms for online convex optimization with multi-point bandit feedback," in Proc. Intl. Conf. on Learning Theory, 2010.
[2] A. Agarwal and J. C. Duchi, "Distributed delayed stochastic optimization," in Proc. Advances in Neural Info. Process. Syst., Granada, Spain, 2011.
[3] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, "The nonstochastic multiarmed bandit problem," SIAM Journal on Computing, vol. 32, no. 1, pp. 48-77, 2002.
[4] B. Awerbuch and R. D. Kleinberg, "Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches," in Proc. ACM Symp. on Theory of Computing, Chicago, IL, Jun. 2004.
[5] S. Bubeck, N. Cesa-Bianchi et al., "Regret analysis of stochastic and nonstochastic multi-armed bandit problems," Found. and Trends in Machine Learning, vol. 5, no. 1, pp. 1-122, 2012.
[6] N. Cesa-Bianchi, C. Gentile, and Y. Mansour, "Nonstochastic bandits with composite anonymous feedback," in Proc. Conf. on Learning Theory, Stockholm, Sweden, 2018.
[7] N. Cesa-Bianchi, C. Gentile, Y. Mansour, and A. Minora, "Delay and cooperation in nonstochastic bandits," J. Machine Learning Res., vol. 49, 2016.
[8] O. Chapelle and L. Li, "An empirical evaluation of Thompson sampling," in Proc. Advances in Neural Info. Process. Syst., Granada, Spain, 2011.
[9] T. Desautels, A. Krause, and J. W. Burdick, "Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization," J. Machine Learning Res., vol. 15, no. 1, 2014.
[10] D. Dheeru and E. Karra Taniskidou, "UCI machine learning repository," 2017.
[11] J. Duchi, M. I. Jordan, and B. McMahan, "Estimation, optimization, and parallelism when data is sparse," in Proc. Advances in Neural Info. Process. Syst., Lake Tahoe, Nevada, 2013.
[12] J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono, "Optimal rates for zero-order convex optimization: The power of two function evaluations," IEEE Trans. Inform. Theory, vol. 61, no. 5, 2015.
[13] A. D. Flaxman, A. T. Kalai, and H. B. McMahan, "Online convex optimization in the bandit setting: Gradient descent without a gradient," in Proc. of ACM-SIAM Symposium on Discrete Algorithms, Vancouver, Canada, 2005.
[14] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins, "Eigentaste: A constant time collaborative filtering algorithm," Information Retrieval, vol. 4, no. 2, 2001.
[15] E. Hazan, "Introduction to online convex optimization," Found. and Trends in Optimization, vol. 2, no. 3-4, pp. 157-325, 2016.
[16] P. Joulani, A. György, and C. Szepesvári, "Online learning under delayed feedback," in Proc. Intl. Conf. Machine Learning, Atlanta, 2013.
[17] P. Joulani, A. György, and C. Szepesvári, "Delay-tolerant online convex optimization: Unified analysis and adaptive-gradient algorithms," in Proc. of AAAI Conf. on Artificial Intelligence, Phoenix, Arizona, 2016.

[18] J. Langford, A. J. Smola, and M. Zinkevich, "Slow learners are fast," in Proc. Advances in Neural Info. Process. Syst., 2009.
[19] L. Li, W. Chu, J. Langford, and R. E. Schapire, "A contextual-bandit approach to personalized news article recommendation," in Proc. of the 19th Intl. Conf. on World Wide Web, Raleigh, NC: ACM, 2010.
[20] B. McMahan and M. Streeter, "Delay-tolerant algorithms for asynchronous distributed online learning," in Proc. Advances in Neural Info. Process. Syst., Montreal, Canada, 2014.
[21] H. B. McMahan, E. Moore, D. Ramage, S. Hampson et al., "Communication-efficient learning of deep networks from decentralized data," in Proc. Intl. Conf. on Artificial Intelligence and Statistics, Fort Lauderdale, Florida, 2017.
[22] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2013.
[23] G. Neu, A. Antos, A. György, and C. Szepesvári, "Online Markov decision processes under bandit feedback," in Proc. Advances in Neural Info. Process. Syst., Vancouver, Canada, 2010.
[24] C. Pike-Burke, S. Agrawal, C. Szepesvári, and S. Grünewälder, "Bandits with delayed, anonymous feedback," arXiv preprint, 2017.
[25] K. Quanrud and D. Khashabi, "Online learning with adversarial delays," in Proc. Advances in Neural Info. Process. Syst., Montreal, Canada, 2015.
[26] O. Shamir and L. Szlak, "Online learning with local permutations and delayed feedback," in Proc. Intl. Conf. Machine Learning, Sydney, Australia, 2017.
[27] T. S. Thune and Y. Seldin, "Adaptation to easy data in prediction with limited advice," arXiv preprint, 2018.
[28] C. Vernade, O. Cappé, and V. Perchet, "Stochastic bandit models for delayed conversions," in Proc. Conf. on Uncertainty in Artificial Intelligence, Sydney, Australia, 2017.
[29] M. J. Weinberger and E. Ordentlich, "On delayed prediction of individual sequences," IEEE Trans. Inform. Theory, vol. 48, no. 7, 2002.
[30] M. Zinkevich, "Online convex programming and generalized infinitesimal gradient ascent," in Proc. Intl. Conf. Machine Learning, Washington D.C., 2003.

Supplementary Document for "Bandit Online Learning with Unknown Delays"

A Real-to-virtual slot mapping

For the analysis, let $t(\tau)$ denote the real slot in which the real loss $\mathbf{l}_{t(\tau)}$ corresponding to $\mathbf{l}_\tau$ was incurred, i.e., $\mathbf{l}_\tau = \hat{\mathbf{l}}_{t(\tau)\to t(\tau)+d_{t(\tau)}}$. Also define the auxiliary variable $s_\tau = \tau - 1 - |\mathcal{L}_{1:t(\tau)-1}|$. See an example in Fig. 7 and Table 1.

Lemma 6. The following relations hold: i) $s_\tau \ge 0, \forall\tau$; ii) $\sum_{\tau=1}^T s_\tau \le \sum_{t=1}^T d_t$; and iii) if $\max_t d_t \le \bar{d}$, then $s_\tau \le 2\bar{d}, \forall\tau$.

Proof. We first prove property i), $s_\tau \ge 0, \forall\tau$. Consider virtual slot $\tau$, whose observed loss is $l_{t(\tau)}(a_{t(\tau)})$ with the corresponding $s_\tau = \tau - 1 - |\mathcal{L}_{1:t(\tau)-1}|$. Suppose that $|\mathcal{L}_{1:t(\tau)-1}| = m$, where $0 \le m \le t(\tau)-1$ by the definition of $\mathcal{L}_{1:t(\tau)-1}$. The history $|\mathcal{L}_{1:t(\tau)-1}| = m$ implies that at the beginning of slot $t_1 = t(\tau)$, there are in total $m$ received feedback values. On the other hand, the loss $l_{t(\tau)}(a_{t(\tau)})$ is observed at the end of slot $t_2 = t(\tau)+d_{t(\tau)} \ge t_1$, and thus at the beginning of $t_2$ there are at least $m$ observations. Hence we must have $\tau \ge m+1$, and by definition $s_\tau \ge m - m = 0$.

For property ii), the proof follows from the definition of $s_\tau$, i.e.,
$$\sum_{\tau=1}^T s_\tau \overset{(a)}{=} \sum_{\tau=1}^T\big(\tau - 1 - |\mathcal{L}_{1:t(\tau)-1}|\big) = \sum_{t=1}^T\big(t - 1 - |\mathcal{L}_{1:t-1}|\big) \overset{(b)}{\le} \sum_{t=1}^T d_t \qquad (27)$$
where (a) is due to the fact that $\{t(\tau)\}_{\tau=1}^T$ is a permutation of $\{1, 2, \ldots, T\}$, and (b) follows from the definition of $\mathcal{L}_{1:t-1}$.

Finally, for property iii), notice that $|\mathcal{L}_{1:t(\tau)-1}| \ge t(\tau) - 1 - \bar{d}$, because at the beginning of slot $t(\tau)$ the losses of all slots up to $t(\tau) - 1 - \bar{d}$ must have been received. Therefore,
$$s_\tau = \tau - 1 - |\mathcal{L}_{1:t(\tau)-1}| \le \tau - 1 - \big(t(\tau) - 1 - \bar{d}\big) \overset{(c)}{\le} 2\bar{d} \qquad (28)$$
where (c) follows because $l_{t(\tau)}(a_{t(\tau)})$ is observed at the end of slot $t(\tau)+d_{t(\tau)}$ and $|\mathcal{L}_{1:t(\tau)+d_{t(\tau)}-1}|$ is at most $t(\tau)+d_{t(\tau)}-1$ (since $l_{t(\tau)}(a_{t(\tau)})$ is not yet observed), so that $\tau \le t(\tau)+d_{t(\tau)}$, i.e., $\tau - t(\tau) \le d_{t(\tau)} \le \bar{d}$.

Figure 7: An example of the mapping from real slots (solid line) to virtual slots (dotted line). The value of $t(\tau)$ is marked beside the corresponding arrow; $T = 3$ with delays $d_1 = 2$, $d_2 = 0$, and $d_3 = 0$.

Table 1: The values of $t(\tau)$, $|\mathcal{L}_{1:t(\tau)-1}|$, and $s_\tau$ in Fig. 7.
Virtual slot            | τ = 1 | τ = 2 | τ = 3
t(τ)                    |   2   |   3   |   1
|L_{1:t(τ)-1}|          |   0   |   1   |   0
s_τ                     |   0   |   0   |   2

B Proofs for DEXP3

Before diving into the proofs, we first record some useful but simple bounds for the different parameters of DEXP3 in the virtual slots. In virtual slot $\tau$, the update is carried out as in (11), (12), and (13), namely
$$w_{\tau+1}(k) = p_\tau(k)\exp\Big(-\eta\min\{\delta_1, l_\tau(k)\}\Big), \quad \forall k, \qquad (29)$$
$$\bar{w}_{\tau+1}(k) = \max\Big\{\frac{w_{\tau+1}(k)}{\sum_{j=1}^K w_{\tau+1}(j)},\ \frac{\delta_2}{K}\Big\}, \quad \forall k, \qquad (30)$$
$$p_{\tau+1}(k) = \frac{\bar{w}_{\tau+1}(k)}{\sum_{j=1}^K \bar{w}_{\tau+1}(j)}, \quad \forall k. \qquad (31)$$
Since $l_\tau(k) \ge 0, \forall k, \tau$, we have
$$\sum_{j=1}^K w_{\tau+1}(j) \le \sum_{j=1}^K p_\tau(j) = 1. \qquad (32)$$
And $\sum_{k=1}^K \bar{w}_\tau(k)$ is upper/lower bounded by
$$\sum_{k=1}^K \bar{w}_\tau(k) \ge \sum_{k=1}^K\frac{w_\tau(k)}{\sum_{j=1}^K w_\tau(j)} = 1; \qquad (33)$$
$$\sum_{k=1}^K \bar{w}_\tau(k) \le \sum_{k=1}^K\Big(\frac{w_\tau(k)}{\sum_{j=1}^K w_\tau(j)} + \frac{\delta_2}{K}\Big) = 1+\delta_2. \qquad (34)$$
Finally, $p_\tau(k)$ is upper/lower bounded by
$$\frac{\delta_2}{K(1+\delta_2)} \le \frac{\bar{w}_\tau(k)}{1+\delta_2} \le p_\tau(k) \le \bar{w}_\tau(k). \qquad (35)$$

B.1 Proof of Lemma 1

Lemma 7. In consecutive virtual slots $\tau-1$ and $\tau$, the following inequality holds for any $k$:
$$p_{\tau-1}(k) - p_\tau(k) \le p_{\tau-1}(k)\,\frac{\delta_2 + \eta\min\{\delta_1, l_{\tau-1}(k)\}}{1+\delta_2}. \qquad (36)$$

Proof. First, we have
$$p_\tau(k) \overset{(a)}{\ge} \frac{\bar{w}_\tau(k)}{1+\delta_2} \ge \frac{w_\tau(k)}{(1+\delta_2)\sum_{j=1}^K w_\tau(j)} \overset{(b)}{\ge} \frac{w_\tau(k)}{1+\delta_2} = \frac{p_{\tau-1}(k)\exp\big(-\eta\min\{\delta_1, l_{\tau-1}(k)\}\big)}{1+\delta_2} \qquad (37)$$
where in (a) we used (35), and (b) is due to (32). Hence, we have
$$p_{\tau-1}(k) - p_\tau(k) \le p_{\tau-1}(k) - \frac{p_{\tau-1}(k)\exp\big(-\eta\min\{\delta_1, l_{\tau-1}(k)\}\big)}{1+\delta_2} \overset{(c)}{\le} p_{\tau-1}(k) - \frac{p_{\tau-1}(k)\big(1-\eta\min\{\delta_1, l_{\tau-1}(k)\}\big)}{1+\delta_2} = p_{\tau-1}(k)\,\frac{\delta_2 + \eta\min\{\delta_1, l_{\tau-1}(k)\}}{1+\delta_2} \qquad (38)$$
where in (c) we used $e^{-x} \ge 1-x$, which completes the proof.

From Lemma 7, we have
$$p_{\tau-1}(k) - p_\tau(k) \le p_{\tau-1}(k)\big(\delta_2 + \eta\min\{\delta_1, l_{\tau-1}(k)\}\big) \le p_{\tau-1}(k)\big(\delta_2 + \eta\delta_1\big), \qquad (39)$$
i.e., $p_\tau(k) \ge p_{\tau-1}(k)(1-\delta_2-\eta\delta_1)$. Hence, as long as $1-\delta_2-\eta\delta_1 \ge 0$, we can guarantee that (19) is satisfied.

B.2 Proof of Lemma 2

Lemma 8. The following inequality holds for any $\tau$ and any $k$:
$$p_\tau(k) - p_{\tau-1}(k) \le p_\tau(k)\Big[1 - I_\tau(k)\sum_{j=1}^K p_{\tau-1}(j)\big(1 - \eta\min\{\delta_1, l_{\tau-1}(j)\}\big)\Big] \qquad (40)$$
where $I_\tau(k) := \mathbb{1}\Big(\frac{w_\tau(k)}{\sum_{j=1}^K w_\tau(j)} \ge \frac{\delta_2}{K}\Big)$.

Proof. We first show that
$$p_\tau(k)\,I_\tau(k) \le \frac{w_\tau(k)}{\sum_{j=1}^K w_\tau(j)}. \qquad (41)$$
It is easy to see that (41) holds when $I_\tau(k) = 0$. When $I_\tau(k) = 1$, we have $\bar{w}_\tau(k) = \frac{w_\tau(k)}{\sum_{j=1}^K w_\tau(j)}$; applying (35) then gives $p_\tau(k) \le \bar{w}_\tau(k) = \frac{w_\tau(k)}{\sum_{j=1}^K w_\tau(j)}$, from which (41) holds. Then we have
$$p_\tau(k) - p_{\tau-1}(k) \le p_\tau(k) - w_\tau(k) \le p_\tau(k) - p_\tau(k)\,I_\tau(k)\sum_{j=1}^K w_\tau(j) = p_\tau(k)\Big[1 - I_\tau(k)\sum_{j=1}^K p_{\tau-1}(j)\exp\big(-\eta\min\{\delta_1, l_{\tau-1}(j)\}\big)\Big]$$
$$\overset{(a)}{\le} p_\tau(k)\Big[1 - I_\tau(k)\sum_{j=1}^K p_{\tau-1}(j)\big(1-\eta\min\{\delta_1, l_{\tau-1}(j)\}\big)\Big] \qquad (42)$$
where in (a) we used $e^{-x} \ge 1-x$.

The proof of Lemma 2 builds on Lemma 8. First consider the case $I_\tau(k) = 0$. In this case Lemma 8 becomes $p_\tau(k) - p_{\tau-1}(k) \le p_\tau(k)$, which is trivial. On the other hand, since $I_\tau(k) = 0$, we have $\bar{w}_\tau(k) = \frac{\delta_2}{K}$; leveraging (35), $p_\tau(k) \le \bar{w}_\tau(k) = \frac{\delta_2}{K}$. Plugging in the lower bound on $p_{\tau-1}(k)$ from (35), we have
$$\frac{p_\tau(k)}{p_{\tau-1}(k)} \le \frac{\delta_2}{K}\cdot\frac{K(1+\delta_2)}{\delta_2} = 1+\delta_2. \qquad (43)$$
Considering the case $I_\tau(k) = 1$, Lemma 8 becomes
$$p_\tau(k) - p_{\tau-1}(k) \le p_\tau(k)\Big[1 - \sum_{j=1}^K p_{\tau-1}(j)\big(1-\eta\min\{\delta_1, l_{\tau-1}(j)\}\big)\Big] = \eta\,p_\tau(k)\sum_{j=1}^K p_{\tau-1}(j)\min\{\delta_1, l_{\tau-1}(j)\} \le \eta\,\delta_1\,p_\tau(k). \qquad (44)$$
Rearranging (44) and combining it with (43), we complete the proof.

B.3 Proof of Lemma 3

For conciseness, define $\mathbf{c}_\tau := \min\{\mathbf{l}_\tau, \delta_1\mathbf{1}\}$ and, correspondingly, $c_\tau(k) := \min\{l_\tau(k), \delta_1\}$. Also define $W_\tau := \sum_{k=1}^K w_\tau(k)$ and $\bar{W}_\tau := \sum_{k=1}^K \bar{w}_\tau(k)$. Leveraging these auxiliary variables, we have
$$W_{T+1} = \sum_{k=1}^K p_T(k)\,e^{-\eta c_T(k)} \ge \frac{\sum_{k=1}^K w_T(k)\,e^{-\eta c_T(k)}}{W_T\bar{W}_T} = \frac{\sum_{k=1}^K p_{T-1}(k)\,e^{-\eta(c_T(k)+c_{T-1}(k))}}{W_T\bar{W}_T} \ge \cdots \ge \frac{\sum_{k=1}^K w_1(k)\exp\big(-\eta\sum_{\tau=1}^T c_\tau(k)\big)}{W_1\prod_{\tau=2}^T W_\tau\bar{W}_\tau}. \qquad (45)$$
Then, for any probability distribution $\mathbf{p} \in \Delta_K$, noticing the initialization $w_1(k) = 1, \forall k$, and hence $W_1 = K$, (45) implies that
$$\sum_{k=1}^K p(k)\exp\Big(-\eta\sum_{\tau=1}^T c_\tau(k)\Big) \le \sum_{k=1}^K\exp\Big(-\eta\sum_{\tau=1}^T c_\tau(k)\Big) \le K\,W_{T+1}\prod_{\tau=2}^T W_\tau\bar{W}_\tau \overset{(a)}{\le} K(1+\delta_2)^T\prod_{\tau=2}^{T+1}W_\tau \qquad (46)$$
where in (a) we used the fact that $\bar{W}_\tau \le 1+\delta_2$. Then, using Jensen's inequality on $e^{-x}$, we have
$$\exp\Big(-\eta\sum_{\tau=1}^T\sum_{k=1}^K p(k)c_\tau(k)\Big) \le \sum_{k=1}^K p(k)\exp\Big(-\eta\sum_{\tau=1}^T c_\tau(k)\Big). \qquad (47)$$
Plugging (47) into (46), we have
$$\exp\Big(-\eta\sum_{\tau=1}^T\sum_{k=1}^K p(k)c_\tau(k)\Big) \le K(1+\delta_2)^T\prod_{\tau=2}^{T+1}W_\tau. \qquad (48)$$
On the other hand, $W_\tau$ can be upper bounded by
$$W_\tau = \sum_{k=1}^K p_{\tau-1}(k)\,e^{-\eta c_{\tau-1}(k)} \overset{(b)}{\le} \sum_{k=1}^K p_{\tau-1}(k)\Big(1 - \eta c_{\tau-1}(k) + \frac{\eta^2}{2}c_{\tau-1}(k)^2\Big) = 1 - \eta\sum_{k=1}^K p_{\tau-1}(k)c_{\tau-1}(k) + \frac{\eta^2}{2}\sum_{k=1}^K p_{\tau-1}(k)c_{\tau-1}(k)^2 \qquad (49)$$
where (b) follows from $e^{-x} \le 1-x+\frac{x^2}{2}$ for $x \ge 0$. Taking the logarithm on both sides of (49), we arrive at
$$\ln W_\tau \le \ln\Big(1 - \eta\sum_{k=1}^K p_{\tau-1}(k)c_{\tau-1}(k) + \frac{\eta^2}{2}\sum_{k=1}^K p_{\tau-1}(k)c_{\tau-1}(k)^2\Big) \overset{(c)}{\le} -\eta\sum_{k=1}^K p_{\tau-1}(k)c_{\tau-1}(k) + \frac{\eta^2}{2}\sum_{k=1}^K p_{\tau-1}(k)c_{\tau-1}(k)^2 \qquad (50)$$

where in (c) we used $\ln(1+x) \le x$. Then taking the logarithm on both sides of (48) and plugging (50) in, we arrive at
$$-\eta\sum_{\tau=1}^T\sum_{k=1}^K p(k)\,c_\tau(k) \le T\ln(1+\delta_2) + \ln K - \eta\sum_{\tau=1}^T\sum_{k=1}^K p_\tau(k)\,c_\tau(k) + \frac{\eta^2}{2}\sum_{\tau=1}^T\sum_{k=1}^K p_\tau(k)\,c_\tau(k)^2. \qquad (51)$$
Rearranging the terms of (51) and writing it compactly, we obtain
$$\sum_{\tau=1}^T(\mathbf{p}_\tau - \mathbf{p})^\top\mathbf{c}_\tau \le \frac{T\ln(1+\delta_2)+\ln K}{\eta} + \frac{\eta}{2}\sum_{\tau=1}^T\sum_{k=1}^K p_\tau(k)\,c_\tau(k)^2 \le \frac{T\ln(1+\delta_2)+\ln K}{\eta} + \frac{\eta}{2}\sum_{\tau=1}^T\sum_{k=1}^K p_\tau(k)\,l_\tau(k)^2. \qquad (52)$$

B.4 Proof of Theorem 1

To begin with, the instantaneous regret can be written as
$$\mathbf{p}_t^\top\mathbf{l}_t - \mathbf{p}^{*\top}\mathbf{l}_t \overset{(a)}{=} \sum_{k=1}^K p_t(k)\,\mathbb{E}_{a_t}\Big[\frac{l_t(k)\mathbb{1}(a_t=k)}{p_t(k)}\Big] - \sum_{k=1}^K p^*(k)\,\mathbb{E}_{a_t}\Big[\frac{l_t(k)\mathbb{1}(a_t=k)}{p_t(k)}\Big]$$
$$\le \max_k\Big[\frac{p_{t+d_t}(k)}{p_t(k)}\Big]\sum_{k=1}^K\big(p_t(k) - p^*(k)\big)\,\mathbb{E}_{a_t}\Big[\frac{l_t(k)\mathbb{1}(a_t=k)}{p_{t+d_t}(k)}\Big] \overset{(b)}{=} \max_k\Big[\frac{p_{t+d_t}(k)}{p_t(k)}\Big]\,\mathbb{E}_{a_t}\big[\mathbf{p}_t^\top\hat{\mathbf{l}}_{t\to t+d_t} - \mathbf{p}^{*\top}\hat{\mathbf{l}}_{t\to t+d_t}\big] \qquad (53)$$
where (a) is due to $\mathbb{E}_{a_t}\big[\frac{l_t(k)\mathbb{1}(a_t=k)}{p_t(k)}\big] = l_t(k)$, and (b) follows from the definition $\hat{l}_{t\to t+d_t}(k) = \frac{l_t(k)\mathbb{1}(a_t=k)}{p_{t+d_t}(k)}$. Then the overall regret over $T$ slots is given by
$$\mathrm{Reg}_T = \mathbb{E}\Big[\sum_{t=1}^T\mathbf{p}_t^\top\mathbf{l}_t - \mathbf{p}^{*\top}\mathbf{l}_t\Big] \le \mathbb{E}\Big[\sum_{t=1}^T\max_k\Big[\frac{p_{t+d_t}(k)}{p_t(k)}\Big]\,\mathbb{E}_{a_t}\big[\mathbf{p}_t^\top\hat{\mathbf{l}}_{t\to t+d_t} - \mathbf{p}^{*\top}\hat{\mathbf{l}}_{t\to t+d_t}\big]\Big]$$
$$\overset{(c)}{=} \mathbb{E}\Big[\sum_{\tau=1}^T\max_k\Big[\frac{p_{t(\tau)+d_{t(\tau)}}(k)}{p_{t(\tau)}(k)}\Big]\,\mathbb{E}_{a_{t(\tau)}}\big[\mathbf{p}_{t(\tau)}^\top\hat{\mathbf{l}}_{t(\tau)\to t(\tau)+d_{t(\tau)}} - \mathbf{p}^{*\top}\hat{\mathbf{l}}_{t(\tau)\to t(\tau)+d_{t(\tau)}}\big]\Big]$$
$$\overset{(d)}{=} \mathbb{E}\Big[\sum_{\tau=1}^T\max_k\Big[\frac{p_{t(\tau)+d_{t(\tau)}}(k)}{p_{t(\tau)}(k)}\Big]\,\mathbb{E}_{a_{t(\tau)}}\big[\mathbf{p}_{t(\tau)}^\top\mathbf{l}_\tau - \mathbf{p}^{*\top}\mathbf{l}_\tau\big]\Big]$$
$$\overset{(e)}{=} \mathbb{E}\Big[\sum_{\tau=1}^T\max_k\Big[\frac{p_{t(\tau)+d_{t(\tau)}}(k)}{p_{t(\tau)}(k)}\Big]\Big(\mathbb{E}_{a_{t(\tau)}}\big[\mathbf{p}_{\tau-s_\tau}^\top\mathbf{l}_\tau - \mathbf{p}_\tau^\top\mathbf{l}_\tau\big] + \mathbb{E}_{a_{t(\tau)}}\big[\mathbf{p}_\tau^\top\mathbf{l}_\tau - \mathbf{p}^{*\top}\mathbf{l}_\tau\big]\Big)\Big] \qquad (54)$$

where (c) is due to the fact that $\{t(1), t(2), \ldots, t(T)\}$ is a permutation of $\{1, 2, \ldots, T\}$; (d) follows from $\mathbf{l}_\tau = \hat{\mathbf{l}}_{t(\tau)\to t(\tau)+d_{t(\tau)}}$; and (e) uses the facts that $\mathbf{p}_t = \mathbf{p}_{|\mathcal{L}_{1:t-1}|+1}$ and $\mathbf{p}_{t(\tau)} = \mathbf{p}_{|\mathcal{L}_{1:t(\tau)-1}|+1} = \mathbf{p}_{\tau-s_\tau}$.

First note that between real time slots $t(\tau)$ and $t(\tau)+d_{t(\tau)}$, at most $\bar{d} + d_{t(\tau)} \le 2\bar{d}$ feedback values are received. Hence the corresponding virtual slots differ by no more than $2\bar{d}$. Note also that the index of the virtual slot corresponding to $t(\tau)$ must be no larger than that of $t(\tau)+d_{t(\tau)}$. Hence, for all $\tau \in [1, T]$, we have
$$\max_k\frac{p_{t(\tau)+d_{t(\tau)}}(k)}{p_{t(\tau)}(k)} \le \Big(\max_k\frac{p_{\tau+1}(k)}{p_\tau(k)}\Big)^{2\bar{d}} \overset{(f)}{\le} \max\Big\{(1+\delta_2)^{2\bar{d}},\ \frac{1}{(1-\eta\delta_1)^{2\bar{d}}}\Big\} \qquad (55)$$
where (f) is the result of Lemma 2.

Then, to bound the terms in the second bracket of (54), we again denote $\mathbf{c}_\tau := \min\{\mathbf{l}_\tau, \delta_1\mathbf{1}\}$ and, correspondingly, $c_\tau(k) := \min\{l_\tau(k), \delta_1\}$ for conciseness. Letting $m$ denote the index of the only non-zero entry of $\mathbf{l}_\tau$, we have
$$\mathbf{p}_{\tau-s_\tau}^\top\mathbf{c}_\tau - \mathbf{p}_\tau^\top\mathbf{c}_\tau = \mathbf{c}_\tau^\top\big(\mathbf{p}_{\tau-s_\tau} - \mathbf{p}_\tau\big) \overset{(g)}{=} c_\tau(m)\sum_{j=0}^{s_\tau-1}\big(p_{\tau-s_\tau+j}(m) - p_{\tau-s_\tau+j+1}(m)\big)$$
$$\overset{(h)}{\le} c_\tau(m)\sum_{j=0}^{s_\tau-1}p_{\tau-s_\tau+j}(m)\,\frac{\delta_2 + \eta\,c_{\tau-s_\tau+j}(m)}{1+\delta_2} \le l_\tau(m)\sum_{j=0}^{s_\tau-1}\big(\eta\,p_{\tau-s_\tau+j}(m)\,c_{\tau-s_\tau+j}(m) + \delta_2\big) \qquad (56)$$
where (g) follows from the facts that $\mathbf{l}_\tau$ has at most one non-zero entry, with index $m$ [cf. (9)], and that $s_\tau \ge 0$ [cf. Lemma 6]; and (h) is the result of Lemma 7. Then notice that
$$l_\tau(k)\,p_\tau(k) = \frac{l_{t(\tau)}(k)\,\mathbb{1}(a_{t(\tau)}=k)}{p_{t(\tau)+d_{t(\tau)}}(k)}\,p_\tau(k) \overset{(i)}{\le} \frac{l_{t(\tau)}(k)\,\mathbb{1}(a_{t(\tau)}=k)}{(1-\delta_2-\eta\delta_1)^{2\bar{d}}} \qquad (57)$$
where (i) uses the fact that between $t(\tau)$ and $t(\tau)+d_{t(\tau)}$ there are at most $2\bar{d}$ feedback values; applying the result of Lemma 1 repeatedly then yields (57). Plugging (57) back into (56) and taking the expectation w.r.t. $a_{t(\tau)}$, we arrive at
$$\mathbb{E}_{a_{t(\tau)}}\big[\mathbf{p}_{\tau-s_\tau}^\top\mathbf{c}_\tau - \mathbf{p}_\tau^\top\mathbf{c}_\tau\big] \le \frac{\eta\,s_\tau}{(1-\delta_2-\eta\delta_1)^{2\bar{d}}}\,\mathbb{E}_{a_{t(\tau)}}\Big[\sum_{k=1}^K p_{t(\tau)}(k)\,l_\tau(k)\Big] + \delta_2 s_\tau \overset{(j)}{\le} \frac{\eta\,s_\tau K}{(1-\delta_2-\eta\delta_1)^{4\bar{d}}} + \delta_2 s_\tau \qquad (58)$$
where (j) follows from a similar argument as (57). Then, noticing that $\sum_{\tau=1}^T s_\tau \le \sum_{t=1}^T d_t = D$, we have
$$\sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[\mathbf{p}_{\tau-s_\tau}^\top\mathbf{c}_\tau - \mathbf{p}_\tau^\top\mathbf{c}_\tau\big] \le \frac{\eta KD}{(1-\delta_2-\eta\delta_1)^{4\bar{d}}} + \delta_2 D. \qquad (59)$$
Then, using a similar argument as in (57), we have
$$\mathbb{E}_{a_{t(\tau)}}\Big[\sum_{k=1}^K p_\tau(k)\,l_\tau(k)^2\Big] = \sum_{k=1}^K\frac{l_{t(\tau)}(k)^2\,p_\tau(k)\,p_{t(\tau)}(k)}{p_{t(\tau)+d_{t(\tau)}}(k)^2} \le \frac{K}{(1-\delta_2-\eta\delta_1)^{4\bar{d}}}. \qquad (60)$$

Then, leveraging Lemma 3, we have
$$\sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau-\mathbf{p})^\top\mathbf{c}_\tau\big] \le \frac{T\ln(1+\delta_2)+\ln K}{\eta} + \frac{\eta}{2}\sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\Big[\sum_{k=1}^K p_\tau(k)\,l_\tau(k)^2\Big] \le \frac{T\ln(1+\delta_2)+\ln K}{\eta} + \frac{\eta KT}{2(1-\delta_2-\eta\delta_1)^{4\bar{d}}}. \qquad (61)$$
The last step is to show that introducing $\delta_1$ does not incur too much extra regret. Note that both $\mathbf{c}_\tau$ and $\mathbf{l}_\tau$ have only one non-zero entry, whose index is denoted by $m_\tau$. We have $l_\tau(m_\tau) > c_\tau(m_\tau)$ only when $l_\tau(m_\tau) = \frac{l_{t(\tau)}(m_\tau)}{p_{t(\tau)+d_{t(\tau)}}(m_\tau)} > \delta_1$, which is equivalent to $p_{t(\tau)+d_{t(\tau)}}(m_\tau) < l_{t(\tau)}(m_\tau)/\delta_1 \le 1/\delta_1$. Hence, we have
$$\sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau-\mathbf{p})^\top\mathbf{l}_\tau\big] = \sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau-\mathbf{p})^\top\mathbf{c}_\tau\big] + \sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau-\mathbf{p})^\top(\mathbf{l}_\tau-\mathbf{c}_\tau)\big]$$
$$\overset{(h)}{\le} \sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau-\mathbf{p})^\top\mathbf{c}_\tau\big] + \sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\Big[p_\tau(m_\tau)\big(l_\tau(m_\tau)-c_\tau(m_\tau)\big)\,\mathbb{1}\big(p_{t(\tau)+d_{t(\tau)}}(m_\tau)<1/\delta_1\big)\Big]$$
$$\le \sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau-\mathbf{p})^\top\mathbf{c}_\tau\big] + \sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\Big[p_\tau(m_\tau)\,l_\tau(m_\tau)\,\mathbb{1}\big(p_{t(\tau)+d_{t(\tau)}}(m_\tau)<1/\delta_1\big)\Big] \qquad (62)$$
where in (h), $m_\tau$ denotes the index of the only non-zero entry of $\mathbf{l}_\tau$, and $\mathbf{p}$ is dropped due to the appearance of the indicator function. To proceed, notice that
$$\mathbb{E}_{a_{t(\tau)}}\Big[l_\tau(m_\tau)\,p_\tau(m_\tau)\,\mathbb{1}\big(p_{t(\tau)+d_{t(\tau)}}(m_\tau)<1/\delta_1\big)\Big] = \sum_{k=1}^K\frac{p_{t(\tau)}(k)\,l_{t(\tau)}(k)}{p_{t(\tau)+d_{t(\tau)}}(k)}\,p_\tau(k)\,\mathbb{1}\big(p_{t(\tau)+d_{t(\tau)}}(k)<1/\delta_1\big)$$
$$\overset{(i)}{\le} \frac{1}{(1-\delta_2-\eta\delta_1)^{2\bar{d}}}\sum_{k=1}^K\frac{p_\tau(k)}{p_{t(\tau)+d_{t(\tau)}}(k)}\,p_{t(\tau)+d_{t(\tau)}}(k)\,\mathbb{1}\big(p_{t(\tau)+d_{t(\tau)}}(k)<1/\delta_1\big)$$
$$\le \frac{1}{(1-\delta_2-\eta\delta_1)^{4\bar{d}}}\sum_{k=1}^K p_{t(\tau)+d_{t(\tau)}}(k)\,\mathbb{1}\big(p_{t(\tau)+d_{t(\tau)}}(k)<1/\delta_1\big) \overset{(j)}{\le} \frac{K}{\delta_1(1-\delta_2-\eta\delta_1)^{4\bar{d}}} \qquad (63)$$
where in (i) we used a similar argument to (57), and in (j) we used the fact that $x\,\mathbb{1}(x<a) \le a$. Plugging (63) back into (62), we arrive at
$$\sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau-\mathbf{p})^\top\mathbf{l}_\tau\big] \le \sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau-\mathbf{p})^\top\mathbf{c}_\tau\big] + \frac{KT}{\delta_1(1-\delta_2-\eta\delta_1)^{4\bar{d}}}. \qquad (64)$$
Applying similar arguments as in (62) and (63), we can also show that
$$\sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_{\tau-s_\tau}-\mathbf{p}_\tau)^\top\mathbf{l}_\tau\big] \le \sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_{\tau-s_\tau}-\mathbf{p}_\tau)^\top\mathbf{c}_\tau\big] + \frac{KD}{\delta_1(1-\delta_2-\eta\delta_1)^{6\bar{d}}}. \qquad (65)$$
For the parameter selection, we have $T\ln(1+\delta_2) = T\ln\big(1+\frac{1}{T+D}\big) \le \ln e = 1$. Leveraging the inequality $\big(1-\frac{1}{2x}\big)^{2x} \ge \frac{1}{4}$ for $x \in \mathbb{N}^+$, we have that
$$\frac{1}{(1-\delta_2-\eta\delta_1)^{2\bar{d}}} = \mathcal{O}(1), \qquad \frac{1}{(1-\delta_2-\eta\delta_1)^{4\bar{d}}} = \mathcal{O}(1), \qquad \frac{1}{(1-\delta_2-\eta\delta_1)^{6\bar{d}}} = \mathcal{O}(1). \qquad (66)$$

From (66) it is not hard to see the bound on (55), which is
$$\max_k\frac{p_{t(\tau)+d_{t(\tau)}}(k)}{p_{t(\tau)}(k)} \le \max\Big\{(1+\delta_2)^{2\bar{d}},\ \frac{1}{(1-\eta\delta_1)^{2\bar{d}}}\Big\} = \mathcal{O}(1). \qquad (67)$$
Then for (59), we have
$$\sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_{\tau-s_\tau}-\mathbf{p}_\tau)^\top\mathbf{c}_\tau\big] \le \frac{\eta KD}{(1-\delta_2-\eta\delta_1)^{4\bar{d}}} + \delta_2 D = \mathcal{O}\big(\sqrt{D}\big). \qquad (68)$$
For (61), we have
$$\sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau-\mathbf{p})^\top\mathbf{c}_\tau\big] \le \frac{T\ln(1+\delta_2)+\ln K}{\eta} + \frac{\eta KT}{2(1-\delta_2-\eta\delta_1)^{4\bar{d}}} = \mathcal{O}\big(\sqrt{T}+\ln K\big). \qquad (69)$$
Using (69) and the fact that $T/\delta_1 = \mathcal{O}(\sqrt{T})$, we can bound (64) by
$$\sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau-\mathbf{p})^\top\mathbf{l}_\tau\big] \le \sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau-\mathbf{p})^\top\mathbf{c}_\tau\big] + \frac{KT}{\delta_1(1-\delta_2-\eta\delta_1)^{4\bar{d}}} = \mathcal{O}\big(\sqrt{T}+\ln K\big). \qquad (70)$$
Plugging in (68) and using the fact that $D/\delta_1 = \mathcal{O}(\sqrt{D})$, we have
$$\sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_{\tau-s_\tau}-\mathbf{p}_\tau)^\top\mathbf{l}_\tau\big] \le \sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_{\tau-s_\tau}-\mathbf{p}_\tau)^\top\mathbf{c}_\tau\big] + \frac{KD}{\delta_1(1-\delta_2-\eta\delta_1)^{6\bar{d}}} = \mathcal{O}\big(\sqrt{D}\big). \qquad (71)$$
Plugging (67), (70), and (71) into (54), the regret is bounded by
$$\mathrm{Reg}_T = \mathbb{E}\Big[\sum_{t=1}^T\mathbf{p}_t^\top\mathbf{l}_t\Big] - \sum_{t=1}^T\mathbf{p}^{*\top}\mathbf{l}_t = \mathcal{O}\big(\sqrt{T+D}\big). \qquad (72)$$

C Proofs for DBGD

C.1 Proof of Lemma 4

Since $f_{s\to t}$ is $L$-Lipschitz, we have $|g_{s\to t}(k)| = \frac{1}{\delta}\big|f_{s\to t}(x_{s\to t}+\delta e_k) - f_{s\to t}(x_{s\to t})\big| \le \frac{1}{\delta}L\|\delta e_k\| = L$, and thus $\|g_{s\to t}\| \le \sqrt{K}L$. On the other hand, let $\nabla_{s\to t} := \nabla f_{s\to t}(x_{s\to t})$, with $\nabla_{s\to t}(k)$ being the $k$-th entry of $\nabla_{s\to t}$. Due to the $\beta$-smoothness of $f_{s\to t}$, we have
$$\big|g_{s\to t}(k) - \nabla_{s\to t}(k)\big| = \frac{1}{\delta}\Big|f_{s\to t}(x_{s\to t}+\delta e_k) - f_{s\to t}(x_{s\to t}) - \delta\,\nabla_{s\to t}^\top e_k\Big| \le \frac{1}{\delta}\cdot\frac{\beta}{2}\delta^2 = \frac{\beta\delta}{2} \qquad (73)$$
suggesting that $\|g_{s\to t} - \nabla f_{s\to t}(x_{s\to t})\| \le \frac{\beta\delta\sqrt{K}}{2}$.

C.2 Proof of Lemma 5

Lemma 5 (Restated). In the virtual slots, it is guaranteed that
$$\|x_\tau - x_{\tau-s_\tau}\| \le \eta\,s_\tau\sqrt{K}L \qquad (74)$$
and for any $x \in \mathcal{X}_\delta$, we have
$$g_\tau^\top(x_\tau - x) \le \frac{\|x_\tau - x\|^2 - \|x_{\tau+1}-x\|^2}{2\eta} + \frac{\eta KL^2}{2}. \qquad (75)$$

Proof. The proof begins with
$$\|x_{\tau-s_\tau} - x_\tau\| \le \sum_{j=0}^{s_\tau-1}\|x_{\tau-s_\tau+j} - x_{\tau-s_\tau+j+1}\| \overset{(a)}{\le} \eta\,s_\tau\sqrt{K}L \qquad (76)$$
where in (a) we used the fact that $\|x_\tau - x_{\tau+1}\| = \|x_\tau - \Pi_{\mathcal{X}_\delta}[x_\tau - \eta g_\tau]\| \le \eta\|g_\tau\|$. Hence we proved the first inequality. Then, notice that
$$\|x_{\tau+1}-x\|^2 - \|x_\tau-x\|^2 = \big\|\Pi_{\mathcal{X}_\delta}[x_\tau-\eta g_\tau] - x\big\|^2 - \|x_\tau-x\|^2 \overset{(b)}{\le} \|x_\tau-\eta g_\tau-x\|^2 - \|x_\tau-x\|^2 = -2\eta\,g_\tau^\top(x_\tau-x) + \eta^2\|g_\tau\|^2 \qquad (77)$$
where (b) uses the non-expansiveness of the projection. Rearranging the terms of (77) completes the proof.

C.3 Proof of Theorem 2

Lemma 9. Let $h_t(x) := f_t(x) + \big(g_t - \nabla f_t(x_t)\big)^\top x$, where $g_t := g_{t\to t+d_t}$. Then $h_t(x)$ has the following properties: i) $h_t(x)$ is $\big(L + \frac{\beta\delta\sqrt{K}}{2}\big)$-Lipschitz; and ii) $h_t(x)$ is $\beta$-smooth and convex.

Proof. Starting with the first property, consider that
$$|h_t(x) - h_t(y)| = \Big|f_t(x) + \big(g_t - \nabla f_t(x_t)\big)^\top x - f_t(y) - \big(g_t - \nabla f_t(x_t)\big)^\top y\Big| \le |f_t(x)-f_t(y)| + \|g_t - \nabla f_t(x_t)\|\,\|x-y\| \overset{(a)}{\le} \Big(L + \frac{\beta\delta\sqrt{K}}{2}\Big)\|x-y\| \qquad (78)$$
where in (a) we used the results in Lemma 4. For the second property, the convexity of $h_t(x)$ is obvious. Then, noticing that $\nabla h_t(x) = \nabla f_t(x) + g_t - \nabla f_t(x_t)$, we have
$$h_t(y) - h_t(x) = f_t(y) - f_t(x) + \big(g_t - \nabla f_t(x_t)\big)^\top(y-x) \le \nabla f_t(x)^\top(y-x) + \frac{\beta}{2}\|y-x\|^2 + \big(g_t - \nabla f_t(x_t)\big)^\top(y-x) = \nabla h_t(x)^\top(y-x) + \frac{\beta}{2}\|y-x\|^2 \qquad (79)$$
which implies that $h_t(x)$ is $\beta$-smooth.

We are now ready to prove Theorem 2. Let $h_t(x) := f_t(x) + \big(g_t - \nabla f_t(x_t)\big)^\top x$, where $g_t := g_{t\to t+d_t}$, and let $x^*_\delta \in \mathcal{X}_\delta$ denote a feasible point close to $x^*$. Using the properties of $h_t(x)$ in Lemma 9 as well as the fact that $\nabla h_t(x_t) = g_t$, we have
$$\mathrm{Reg}_T = \sum_{t=1}^T f_t(x_t) - f_t(x^*) \overset{(a)}{=} \sum_{t=1}^T h_t(x_t) - h_t(x^*) + \big(g_t - \nabla f_t(x_t)\big)^\top(x^* - x_t)$$
$$\le \sum_{t=1}^T h_t(x_t) - h_t(x^*_\delta) + \big|h_t(x^*_\delta) - h_t(x^*)\big| + \big\|g_t - \nabla f_t(x_t)\big\|\,\|x^* - x_t\| \overset{(b)}{\le} \sum_{t=1}^T\big[h_t(x_t) - h_t(x^*_\delta)\big] + \delta RT\Big(L + \frac{\beta\delta\sqrt{K}}{2}\Big) + \frac{RT\beta\delta\sqrt{K}}{2} \qquad (80)$$


More information

Learning, Games, and Networks

Learning, Games, and Networks Learning, Games, and Networks Abhishek Sinha Laboratory for Information and Decision Systems MIT ML Talk Series @CNRG December 12, 2016 1 / 44 Outline 1 Prediction With Experts Advice 2 Application to

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Online Convex Optimization MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Online projected sub-gradient descent. Exponentiated Gradient (EG). Mirror descent.

More information

Online learning with feedback graphs and switching costs

Online learning with feedback graphs and switching costs Online learning with feedback graphs and switching costs A Proof of Theorem Proof. Without loss of generality let the independent sequence set I(G :T ) formed of actions (or arms ) from to. Given the sequence

More information

Online Passive-Aggressive Algorithms

Online Passive-Aggressive Algorithms Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il

More information

Bandit models: a tutorial

Bandit models: a tutorial Gdt COS, December 3rd, 2015 Multi-Armed Bandit model: general setting K arms: for a {1,..., K}, (X a,t ) t N is a stochastic process. (unknown distributions) Bandit game: a each round t, an agent chooses

More information

Online Passive-Aggressive Algorithms

Online Passive-Aggressive Algorithms Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il

More information

New bounds on the price of bandit feedback for mistake-bounded online multiclass learning

New bounds on the price of bandit feedback for mistake-bounded online multiclass learning Journal of Machine Learning Research 1 8, 2017 Algorithmic Learning Theory 2017 New bounds on the price of bandit feedback for mistake-bounded online multiclass learning Philip M. Long Google, 1600 Amphitheatre

More information

Exponential Weights on the Hypercube in Polynomial Time

Exponential Weights on the Hypercube in Polynomial Time European Workshop on Reinforcement Learning 14 (2018) October 2018, Lille, France. Exponential Weights on the Hypercube in Polynomial Time College of Information and Computer Sciences University of Massachusetts

More information

Regret Bounds for Online Portfolio Selection with a Cardinality Constraint

Regret Bounds for Online Portfolio Selection with a Cardinality Constraint Regret Bounds for Online Portfolio Selection with a Cardinality Constraint Shinji Ito NEC Corporation Daisuke Hatano RIKEN AIP Hanna Sumita okyo Metropolitan University Akihiro Yabe NEC Corporation akuro

More information

The No-Regret Framework for Online Learning

The No-Regret Framework for Online Learning The No-Regret Framework for Online Learning A Tutorial Introduction Nahum Shimkin Technion Israel Institute of Technology Haifa, Israel Stochastic Processes in Engineering IIT Mumbai, March 2013 N. Shimkin,

More information

New Algorithms for Contextual Bandits

New Algorithms for Contextual Bandits New Algorithms for Contextual Bandits Lev Reyzin Georgia Institute of Technology Work done at Yahoo! 1 S A. Beygelzimer, J. Langford, L. Li, L. Reyzin, R.E. Schapire Contextual Bandit Algorithms with Supervised

More information

Learning Algorithms for Minimizing Queue Length Regret

Learning Algorithms for Minimizing Queue Length Regret Learning Algorithms for Minimizing Queue Length Regret Thomas Stahlbuhk Massachusetts Institute of Technology Cambridge, MA Brooke Shrader MIT Lincoln Laboratory Lexington, MA Eytan Modiano Massachusetts

More information

Lecture 19: UCB Algorithm and Adversarial Bandit Problem. Announcements Review on stochastic multi-armed bandit problem

Lecture 19: UCB Algorithm and Adversarial Bandit Problem. Announcements Review on stochastic multi-armed bandit problem Lecture 9: UCB Algorithm and Adversarial Bandit Problem EECS598: Prediction and Learning: It s Only a Game Fall 03 Lecture 9: UCB Algorithm and Adversarial Bandit Problem Prof. Jacob Abernethy Scribe:

More information

Multi-armed bandit models: a tutorial

Multi-armed bandit models: a tutorial Multi-armed bandit models: a tutorial CERMICS seminar, March 30th, 2016 Multi-Armed Bandit model: general setting K arms: for a {1,..., K}, (X a,t ) t N is a stochastic process. (unknown distributions)

More information

arxiv: v4 [math.oc] 5 Jan 2016

arxiv: v4 [math.oc] 5 Jan 2016 Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The

More information

Online Learning and Sequential Decision Making

Online Learning and Sequential Decision Making Online Learning and Sequential Decision Making Emilie Kaufmann CNRS & CRIStAL, Inria SequeL, emilie.kaufmann@univ-lille.fr Research School, ENS Lyon, Novembre 12-13th 2018 Emilie Kaufmann Online Learning

More information

Lecture: Adaptive Filtering

Lecture: Adaptive Filtering ECE 830 Spring 2013 Statistical Signal Processing instructors: K. Jamieson and R. Nowak Lecture: Adaptive Filtering Adaptive filters are commonly used for online filtering of signals. The goal is to estimate

More information

Experts in a Markov Decision Process

Experts in a Markov Decision Process University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 2004 Experts in a Markov Decision Process Eyal Even-Dar Sham Kakade University of Pennsylvania Yishay Mansour Follow

More information

Online Learning with Experts & Multiplicative Weights Algorithms

Online Learning with Experts & Multiplicative Weights Algorithms Online Learning with Experts & Multiplicative Weights Algorithms CS 159 lecture #2 Stephan Zheng April 1, 2016 Caltech Table of contents 1. Online Learning with Experts With a perfect expert Without perfect

More information

Optimization, Learning, and Games with Predictable Sequences

Optimization, Learning, and Games with Predictable Sequences Optimization, Learning, and Games with Predictable Sequences Alexander Rakhlin University of Pennsylvania Karthik Sridharan University of Pennsylvania Abstract We provide several applications of Optimistic

More information

EASINESS IN BANDITS. Gergely Neu. Pompeu Fabra University

EASINESS IN BANDITS. Gergely Neu. Pompeu Fabra University EASINESS IN BANDITS Gergely Neu Pompeu Fabra University EASINESS IN BANDITS Gergely Neu Pompeu Fabra University THE BANDIT PROBLEM Play for T rounds attempting to maximize rewards THE BANDIT PROBLEM Play

More information

Optimistic Bandit Convex Optimization

Optimistic Bandit Convex Optimization Optimistic Bandit Convex Optimization Mehryar Mohri Courant Institute and Google 25 Mercer Street New York, NY 002 mohri@cims.nyu.edu Scott Yang Courant Institute 25 Mercer Street New York, NY 002 yangs@cims.nyu.edu

More information

Sparse Linear Contextual Bandits via Relevance Vector Machines

Sparse Linear Contextual Bandits via Relevance Vector Machines Sparse Linear Contextual Bandits via Relevance Vector Machines Davis Gilton and Rebecca Willett Electrical and Computer Engineering University of Wisconsin-Madison Madison, WI 53706 Email: gilton@wisc.edu,

More information

A Drifting-Games Analysis for Online Learning and Applications to Boosting

A Drifting-Games Analysis for Online Learning and Applications to Boosting A Drifting-Games Analysis for Online Learning and Applications to Boosting Haipeng Luo Department of Computer Science Princeton University Princeton, NJ 08540 haipengl@cs.princeton.edu Robert E. Schapire

More information

Lecture 4: Lower Bounds (ending); Thompson Sampling

Lecture 4: Lower Bounds (ending); Thompson Sampling CMSC 858G: Bandits, Experts and Games 09/12/16 Lecture 4: Lower Bounds (ending); Thompson Sampling Instructor: Alex Slivkins Scribed by: Guowei Sun,Cheng Jie 1 Lower bounds on regret (ending) Recap from

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

From Bandits to Experts: A Tale of Domination and Independence

From Bandits to Experts: A Tale of Domination and Independence From Bandits to Experts: A Tale of Domination and Independence Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Domination and Independence 1 / 1 From Bandits to Experts: A

More information

Ad Placement Strategies

Ad Placement Strategies Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad

More information

Proximal and First-Order Methods for Convex Optimization

Proximal and First-Order Methods for Convex Optimization Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,

More information

Explore no more: Improved high-probability regret bounds for non-stochastic bandits

Explore no more: Improved high-probability regret bounds for non-stochastic bandits Explore no more: Improved high-probability regret bounds for non-stochastic bandits Gergely Neu SequeL team INRIA Lille Nord Europe gergely.neu@gmail.com Abstract This work addresses the problem of regret

More information

arxiv: v1 [cs.lg] 23 Jan 2019

arxiv: v1 [cs.lg] 23 Jan 2019 Cooperative Online Learning: Keeping your Neighbors Updated Nicolò Cesa-Bianchi, Tommaso R. Cesari, and Claire Monteleoni 2 Dipartimento di Informatica, Università degli Studi di Milano, Italy 2 Department

More information

Evaluation of multi armed bandit algorithms and empirical algorithm

Evaluation of multi armed bandit algorithms and empirical algorithm Acta Technica 62, No. 2B/2017, 639 656 c 2017 Institute of Thermomechanics CAS, v.v.i. Evaluation of multi armed bandit algorithms and empirical algorithm Zhang Hong 2,3, Cao Xiushan 1, Pu Qiumei 1,4 Abstract.

More information

Convergence and No-Regret in Multiagent Learning

Convergence and No-Regret in Multiagent Learning Convergence and No-Regret in Multiagent Learning Michael Bowling Department of Computing Science University of Alberta Edmonton, Alberta Canada T6G 2E8 bowling@cs.ualberta.ca Abstract Learning in a multiagent

More information

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I. Sébastien Bubeck Theory Group

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I. Sébastien Bubeck Theory Group Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I Sébastien Bubeck Theory Group i.i.d. multi-armed bandit, Robbins [1952] i.i.d. multi-armed bandit, Robbins [1952] Known

More information

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and

More information

arxiv: v1 [cs.lg] 12 Sep 2017

arxiv: v1 [cs.lg] 12 Sep 2017 Adaptive Exploration-Exploitation Tradeoff for Opportunistic Bandits Huasen Wu, Xueying Guo,, Xin Liu University of California, Davis, CA, USA huasenwu@gmail.com guoxueying@outlook.com xinliu@ucdavis.edu

More information

arxiv: v1 [cs.lg] 7 Sep 2018

arxiv: v1 [cs.lg] 7 Sep 2018 Analysis of Thompson Sampling for Combinatorial Multi-armed Bandit with Probabilistically Triggered Arms Alihan Hüyük Bilkent University Cem Tekin Bilkent University arxiv:809.02707v [cs.lg] 7 Sep 208

More information

Dynamic Regret of Strongly Adaptive Methods

Dynamic Regret of Strongly Adaptive Methods Lijun Zhang 1 ianbao Yang 2 Rong Jin 3 Zhi-Hua Zhou 1 Abstract o cope with changing environments, recent developments in online learning have introduced the concepts of adaptive regret and dynamic regret

More information

Hybrid Machine Learning Algorithms

Hybrid Machine Learning Algorithms Hybrid Machine Learning Algorithms Umar Syed Princeton University Includes joint work with: Rob Schapire (Princeton) Nina Mishra, Alex Slivkins (Microsoft) Common Approaches to Machine Learning!! Supervised

More information

Online (and Distributed) Learning with Information Constraints. Ohad Shamir

Online (and Distributed) Learning with Information Constraints. Ohad Shamir Online (and Distributed) Learning with Information Constraints Ohad Shamir Weizmann Institute of Science Online Algorithms and Learning Workshop Leiden, November 2014 Ohad Shamir Learning with Information

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Lecture 5: Bandit optimisation Alexandre Proutiere, Sadegh Talebi, Jungseul Ok KTH, The Royal Institute of Technology Objectives of this lecture Introduce bandit optimisation: the

More information

Lecture 23: Online convex optimization Online convex optimization: generalization of several algorithms

Lecture 23: Online convex optimization Online convex optimization: generalization of several algorithms EECS 598-005: heoretical Foundations of Machine Learning Fall 2015 Lecture 23: Online convex optimization Lecturer: Jacob Abernethy Scribes: Vikas Dhiman Disclaimer: hese notes have not been subjected

More information

Online Convex Optimization. Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016

Online Convex Optimization. Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016 Online Convex Optimization Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016 The General Setting The General Setting (Cover) Given only the above, learning isn't always possible Some Natural

More information

Online Optimization : Competing with Dynamic Comparators

Online Optimization : Competing with Dynamic Comparators Ali Jadbabaie Alexander Rakhlin Shahin Shahrampour Karthik Sridharan University of Pennsylvania University of Pennsylvania University of Pennsylvania Cornell University Abstract Recent literature on online

More information

Stochastic Contextual Bandits with Known. Reward Functions

Stochastic Contextual Bandits with Known. Reward Functions Stochastic Contextual Bandits with nown 1 Reward Functions Pranav Sakulkar and Bhaskar rishnamachari Ming Hsieh Department of Electrical Engineering Viterbi School of Engineering University of Southern

More information

Online Optimization in Dynamic Environments: Improved Regret Rates for Strongly Convex Problems

Online Optimization in Dynamic Environments: Improved Regret Rates for Strongly Convex Problems 216 IEEE 55th Conference on Decision and Control (CDC) ARIA Resort & Casino December 12-14, 216, Las Vegas, USA Online Optimization in Dynamic Environments: Improved Regret Rates for Strongly Convex Problems

More information

MDP Preliminaries. Nan Jiang. February 10, 2019

MDP Preliminaries. Nan Jiang. February 10, 2019 MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process

More information

6350 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 65, NO. 24, DECEMBER 15, 2017

6350 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 65, NO. 24, DECEMBER 15, 2017 6350 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 65, NO. 4, DECEMBER 15, 017 An Online Convex Optimization Approach to Proactive Network Resource Allocation Tianyi Chen, Student Member, IEEE, Qing Ling,

More information

Efficient learning by implicit exploration in bandit problems with side observations

Efficient learning by implicit exploration in bandit problems with side observations Efficient learning by implicit exploration in bandit problems with side observations Tomáš Kocák, Gergely Neu, Michal Valko, Rémi Munos SequeL team, INRIA Lille - Nord Europe, France SequeL INRIA Lille

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Subgradient Methods for Maximum Margin Structured Learning

Subgradient Methods for Maximum Margin Structured Learning Nathan D. Ratliff ndr@ri.cmu.edu J. Andrew Bagnell dbagnell@ri.cmu.edu Robotics Institute, Carnegie Mellon University, Pittsburgh, PA. 15213 USA Martin A. Zinkevich maz@cs.ualberta.ca Department of Computing

More information

Extracting Certainty from Uncertainty: Regret Bounded by Variation in Costs

Extracting Certainty from Uncertainty: Regret Bounded by Variation in Costs Extracting Certainty from Uncertainty: Regret Bounded by Variation in Costs Elad Hazan IBM Almaden Research Center 650 Harry Rd San Jose, CA 95120 ehazan@cs.princeton.edu Satyen Kale Yahoo! Research 4301

More information

Stochastic bandits: Explore-First and UCB

Stochastic bandits: Explore-First and UCB CSE599s, Spring 2014, Online Learning Lecture 15-2/19/2014 Stochastic bandits: Explore-First and UCB Lecturer: Brendan McMahan or Ofer Dekel Scribe: Javad Hosseini In this lecture, we like to answer this

More information

Online Submodular Minimization

Online Submodular Minimization Online Submodular Minimization Elad Hazan IBM Almaden Research Center 650 Harry Rd, San Jose, CA 95120 hazan@us.ibm.com Satyen Kale Yahoo! Research 4301 Great America Parkway, Santa Clara, CA 95054 skale@yahoo-inc.com

More information

Generalized Mirror Descents with Non-Convex Potential Functions in Atomic Congestion Games

Generalized Mirror Descents with Non-Convex Potential Functions in Atomic Congestion Games Generalized Mirror Descents with Non-Convex Potential Functions in Atomic Congestion Games Po-An Chen Institute of Information Management National Chiao Tung University, Taiwan poanchen@nctu.edu.tw Abstract.

More information

Move from Perturbed scheme to exponential weighting average

Move from Perturbed scheme to exponential weighting average Move from Perturbed scheme to exponential weighting average Chunyang Xiao Abstract In an online decision problem, one makes decisions often with a pool of decisions sequence called experts but without

More information

CS261: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm

CS261: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm CS61: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm Tim Roughgarden February 9, 016 1 Online Algorithms This lecture begins the third module of the

More information

Applications of on-line prediction. in telecommunication problems

Applications of on-line prediction. in telecommunication problems Applications of on-line prediction in telecommunication problems Gábor Lugosi Pompeu Fabra University, Barcelona based on joint work with András György and Tamás Linder 1 Outline On-line prediction; Some

More information

Extracting Certainty from Uncertainty: Regret Bounded by Variation in Costs

Extracting Certainty from Uncertainty: Regret Bounded by Variation in Costs Extracting Certainty from Uncertainty: Regret Bounded by Variation in Costs Elad Hazan IBM Almaden 650 Harry Rd, San Jose, CA 95120 hazan@us.ibm.com Satyen Kale Microsoft Research 1 Microsoft Way, Redmond,

More information

Online Learning with Costly Features and Labels

Online Learning with Costly Features and Labels Online Learning with Costly Features and Labels Navid Zolghadr Department of Computing Science University of Alberta zolghadr@ualberta.ca Gábor Bartók Department of Computer Science TH Zürich bartok@inf.ethz.ch

More information

Online Convex Optimization

Online Convex Optimization Advanced Course in Machine Learning Spring 2010 Online Convex Optimization Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz A convex repeated game is a two players game that is performed

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Performance and Convergence of Multi-user Online Learning

Performance and Convergence of Multi-user Online Learning Performance and Convergence of Multi-user Online Learning Cem Tekin, Mingyan Liu Department of Electrical Engineering and Computer Science University of Michigan, Ann Arbor, Michigan, 4809-222 Email: {cmtkn,

More information

Sequential Multi-armed Bandits

Sequential Multi-armed Bandits Sequential Multi-armed Bandits Cem Tekin Mihaela van der Schaar University of California, Los Angeles Abstract In this paper we introduce a new class of online learning problems called sequential multi-armed

More information

arxiv: v3 [cs.lg] 30 Jun 2012

arxiv: v3 [cs.lg] 30 Jun 2012 arxiv:05874v3 [cslg] 30 Jun 0 Orly Avner Shie Mannor Department of Electrical Engineering, Technion Ohad Shamir Microsoft Research New England Abstract We consider a multi-armed bandit problem where the

More information

Discover Relevant Sources : A Multi-Armed Bandit Approach

Discover Relevant Sources : A Multi-Armed Bandit Approach Discover Relevant Sources : A Multi-Armed Bandit Approach 1 Onur Atan, Mihaela van der Schaar Department of Electrical Engineering, University of California Los Angeles Email: oatan@ucla.edu, mihaela@ee.ucla.edu

More information

Stochastic Optimization

Stochastic Optimization Introduction Related Work SGD Epoch-GD LM A DA NANJING UNIVERSITY Lijun Zhang Nanjing University, China May 26, 2017 Introduction Related Work SGD Epoch-GD Outline 1 Introduction 2 Related Work 3 Stochastic

More information

Online Learning of Non-stationary Sequences

Online Learning of Non-stationary Sequences Online Learning of Non-stationary Sequences Claire Monteleoni and Tommi Jaakkola MIT Computer Science and Artificial Intelligence Laboratory 200 Technology Square Cambridge, MA 02139 {cmontel,tommi}@ai.mit.edu

More information

The Algorithmic Foundations of Adaptive Data Analysis November, Lecture The Multiplicative Weights Algorithm

The Algorithmic Foundations of Adaptive Data Analysis November, Lecture The Multiplicative Weights Algorithm he Algorithmic Foundations of Adaptive Data Analysis November, 207 Lecture 5-6 Lecturer: Aaron Roth Scribe: Aaron Roth he Multiplicative Weights Algorithm In this lecture, we define and analyze a classic,

More information