Bandit Online Learning with Unknown Delays


Bandit Online Learning with Unknown Delays

Bingcong Li, Tianyi Chen, and Georgios B. Giannakis
University of Minnesota - Twin Cities, Minneapolis, MN 55455, USA
{lixx5599, chen387, georgios}@umn.edu

November 13, 2018

arXiv preprint [cs.LG], 11 Nov 2018

Abstract

This paper deals with bandit online learning problems involving feedback of unknown delay that can emerge in multi-armed bandit (MAB) and bandit convex optimization (BCO) settings. MAB and BCO require only values of the objective function involved that become available through feedback, and are used to estimate the gradient appearing in the corresponding iterative algorithms. Since the challenging case of feedback with unknown delays prevents one from constructing the sought gradient estimates, existing MAB and BCO algorithms become intractable. For such challenging setups, delayed exploration, exploitation, and exponential (DEXP3) iterations, along with delayed bandit gradient descent (DBGD) iterations, are developed for MAB and BCO, respectively. Leveraging a unified analysis framework, it is established that both DEXP3 and DBGD guarantee an $\mathcal{O}\big(\sqrt{T+D}\big)$ regret, where $D$ denotes the delay accumulated over $T$ slots. The regret bounds match those in the full-information settings. Numerical tests using both synthetic and real data validate the performance of DEXP3 and DBGD.

1 Introduction

Sequential decision making emerges in various learning and optimization tasks, such as online advertisement, online routing, and portfolio management [5, 15]. Among popular methods for sequential decision making, multi-armed bandit (MAB) and bandit convex optimization (BCO) have widely appreciated merits, because even with limited information they offer quantifiable performance guarantees. MAB and BCO can be viewed as a repeated game between a possibly randomized learner and the possibly adversarial nature. In each round, the learner selects an action and incurs the associated loss that is returned by nature. In contrast to the full-information setting, only the loss of the performed action (rather than the gradient of the loss function, or even the loss function itself) is revealed to the learner. Popular approaches to bandit online learning estimate gradients using several point-wise evaluations of the loss function, and use them to run online gradient-type iterative solvers; see e.g., [3] for MAB and [1, 13] for BCO.

Although widely applicable with solid performance guarantees, standard MAB and BCO frameworks do not account for delayed feedback that is naturally present in various applications. For example, when carrying out machine learning tasks using distributed mobile devices (a setup referred to as federated learning [21]), delay comes from the time it takes to compute at the mobile devices and to transmit over the wireless communication links; in online recommendation, the click-through rate could be aggregated and then periodically sent back [19]; in online routing over communication networks, the latency of each routing decision can be revealed only after the packet's arrival at its destination [4]; and in parallel computing by data centers, computations are carried out with outdated information because agents are not synchronized [2, 11, 20].

Challenges arise naturally when dealing with bandit online learning with unknown delays, simply because unknown delays prevent existing methods for non-stochastic MAB as well as BCO from constructing reliable gradient estimates. To address this limitation, our solution is a fine-grained biased gradient estimator for MAB and a deterministic gradient estimator for BCO.

In both cases, the standard unbiased loss estimator for non-stochastic MAB and the nearly unbiased one for BCO are no longer available. The resultant algorithms, which we abbreviate as DEXP3 and DBGD, are guaranteed to achieve $\mathcal{O}\big(\sqrt{T+D}\big)$ regret over a $T$-slot time horizon with overall delay $D$, which matches the regret bounds in delayed full-information online learning settings.

1.1 Related works

Delayed online learning can be categorized depending on whether the feedback information is full or bandit (meaning partial). We review prior works from these two aspects.

Delayed online learning. This class deals with delayed but fully revealed loss information, namely a fully known gradient or loss function. It has been proved that an $\mathcal{O}\big(\sqrt{T+D}\big)$ regret can be achieved over a $T$-slot horizon with overall delay $D$. In particular, algorithms dealing with a fixed delay have been studied in [29]. To reduce the storage and computation burden of [29], an online gradient descent type algorithm for a fixed $d$-slot delay was developed in [18], where the lower bound $\Omega\big(\sqrt{(d+1)T}\big)$ was also provided. Adversarial delays have been tackled recently in [17, 25, 26]. However, the algorithms as well as the corresponding analyses in [17, 25, 26] are not applicable to the bandit online learning setting when the delays are unknown.

Delayed bandit online learning. Stochastic MAB with delays has been studied in [8, 9, 24, 28]; see also [16] for multi-instance generalizations introduced to handle adversarial delays in stochastic and non-stochastic MAB settings. For non-stochastic MAB, EXP3-based algorithms were developed to handle fixed delays in [7, 23]. Although not requiring memory for extra instances, the delay in [7] and [23] must be known. A recent work [6] considers a more general non-stochastic MAB setting, where the feedback is anonymous.(1) Although the scope of [6] is general, we will see in Section 3 that the regret bound of the MAB algorithm in [6] is suboptimal.

1.2 Contributions

Our main contributions can be summarized as follows. c1) Based on a novel biased gradient estimator, a delayed exploration-exploitation exponentially weighted (DEXP3) algorithm is developed for delayed non-stochastic MAB with unknown and adversarially chosen delays; c2) Relying on a novel deterministic gradient estimator, a delayed bandit gradient descent (DBGD) algorithm is developed to handle the delayed BCO setting; and c3) A unifying analysis framework reveals that DEXP3 and DBGD provably achieve an $\mathcal{O}\big(\sqrt{T+D}\big)$ regret, where $D$ is the cumulative delay over $T$ slots. The regret bounds match those of online learning in full-information settings. Numerical tests validate the proposed DEXP3 and DBGD.

Notational conventions. Bold lowercase letters denote column vectors; $\mathbb{E}[\cdot]$ represents expectation; $\mathbb{1}(\cdot)$ denotes the indicator function; $(\cdot)^\top$ stands for vector transposition; and $\|\mathbf{x}\|$ denotes the $\ell_2$-norm of a vector $\mathbf{x}$.

2 Problem statements

Before introducing the delayed bandit learning settings, we first revisit the standard non-stochastic MAB and BCO.

2.1 MAB and BCO

Non-stochastic MAB. Consider the MAB problem with a total of $K$ arms (a.k.a. actions) [3, 5]. At the beginning of slot $t$, without knowing the loss of any arm, the learner selects an arm $a_t$ following a distribution $\mathbf{p}_t \in \Delta_K$ over all arms, where the probability simplex is defined as $\Delta_K := \{\mathbf{p} \in \mathbb{R}^K : p(k) \ge 0, \forall k;\ \sum_{k=1}^K p(k) = 1\}$. The loss $l_t(a_t)$ incurred by the selection of $a_t$ is an entry of the $K \times 1$ loss vector $\mathbf{l}_t$, and it is observed by the learner. Along with previously observed losses $\{l_s(a_s)\}_{s=1}^{t}$, it then becomes possible to find $\mathbf{p}_{t+1}$; see also Fig. 1 (a1).

Footnote 1: Anonymous feedback in MAB means that the learner observes the loss without knowing which arm it is associated with.

Figure 1: A single-slot structure of: (a1) standard MAB; (a2) non-stochastic MAB with delayed feedback; (b1) standard BCO; (b2) BCO with delayed feedback, where $\mathbf{p}_{t+1} \leftarrow \{l_s(a_s)\}_{s=1}^{t}$ means that $\mathbf{p}_{t+1}$ is updated based on the previously observed losses $\{l_s(a_s)\}_{s=1}^{t}$.

The goal is to minimize the regret, which is the difference between the expected cumulative loss of the learner and the loss of the best fixed policy in hindsight, given by
$$\mathrm{Reg}_T^{\mathrm{MAB}} := \mathbb{E}\Big[\sum_{t=1}^{T} \mathbf{p}_t^\top \mathbf{l}_t\Big] - \sum_{t=1}^{T} \mathbf{p}^{*\top} \mathbf{l}_t \qquad (1)$$
where the expectation is taken w.r.t. the possible randomness of $\mathbf{p}_t$ induced by the selection of $\{a_s\}_{s=1}^{t-1}$, while the best fixed policy $\mathbf{p}^*$ is
$$\mathbf{p}^* := \arg\min_{\mathbf{p} \in \Delta_K}\ \sum_{t=1}^{T} \mathbf{p}^\top \mathbf{l}_t. \qquad (2)$$
Specifically, if $\mathbf{p}^* = [0, \ldots, 1, \ldots, 0]^\top$, the regret is relative to the corresponding best fixed arm in hindsight.

BCO. Consider now the BCO setup with $M$-point feedback [1]. At the beginning of slot $t$, without knowing the loss, the learner selects $x_t \in \mathcal{X}$, where $\mathcal{X} \subseteq \mathbb{R}^K$ is a compact and convex set. Being able to query the function values at another $M-1$ points $\{x_{t,k} \in \mathcal{X}\}_{k=1}^{M-1}$, and with $x_{t,0} := x_t$, the loss values at $\{x_{t,k}\}_{k=0}^{M-1}$, that is, $\{f_t(x_{t,k})\}_{k=0}^{M-1}$, are observed instead of the function $f_t$. The learner leverages the revealed losses to decide the next action $x_{t+1}$; see also Fig. 1 (b1). The learner's goal is to find a sequence $\big\{\{x_{t,k}\}_{k=0}^{M-1}\big\}_{t=1}^{T}$ that minimizes the regret relative to the best fixed action in hindsight,(2) meaning
$$\mathrm{Reg}_T^{\mathrm{BCO}} := \mathbb{E}\Big[\sum_{t=1}^{T} f_t(x_t)\Big] - \sum_{t=1}^{T} f_t(x^*) \qquad (3)$$
where the expectation is taken over the sequence of random actions $\{x_s\}_{s=1}^{t-1}$. The best fixed action $x^*$ in hindsight is
$$x^* := \arg\min_{x \in \mathcal{X}}\ \sum_{t=1}^{T} f_t(x). \qquad (4)$$

Footnote 2: This definition is slightly different from that in [1]. However, we will show in Section 5.3 that the regret bound is not affected.
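To make the regret definitions concrete, the following minimal Python sketch computes the empirical MAB regret in (1)-(2) for a given loss sequence and a sequence of sampling distributions. The array shapes, the random example, and all variable names are illustrative assumptions, not part of the paper.

```python
import numpy as np

def mab_regret(losses, probs):
    """Empirical regret (1) against the best fixed arm (2).

    losses: (T, K) array with losses[t, k] = l_t(k)
    probs:  (T, K) array with probs[t] = p_t, the learner's distribution at slot t
    """
    expected_loss = np.sum(losses * probs)       # E[sum_t p_t^T l_t]
    best_fixed = np.min(losses.sum(axis=0))      # min_k sum_t l_t(k), attained at a simplex vertex
    return expected_loss - best_fixed

# Example: 3 arms over 4 slots with a uniform learner.
T, K = 4, 3
rng = np.random.default_rng(0)
losses = rng.random((T, K))
probs = np.full((T, K), 1.0 / K)
print(mab_regret(losses, probs))
```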

In both MAB and BCO settings, an online algorithm is desirable when its regret is sublinear w.r.t. the time horizon $T$, i.e., $\mathrm{Reg}_T^{\mathrm{MAB}} = o(T)$ and $\mathrm{Reg}_T^{\mathrm{BCO}} = o(T)$ [5, 15].

2.2 Delayed MAB and BCO

MAB with unknown delays. In delayed MAB, the learner still chooses an arm $a_t \sim \mathbf{p}_t$ at the beginning of slot $t$. However, the loss $l_t(a_t)$ is observed after $d_t$ slots, namely at the end of slot $t+d_t$, where the delay $d_t \ge 0$ can vary from slot to slot. In this paper, we assume that $\{d_t\}_{t=1}^T$ can be chosen adversarially by nature. Let $l_{s\to t}(a_{s\to t})$ denote the loss incurred by the selected arm $a_s$ in slot $s$ but observed at slot $t$, i.e., the learner receives the losses collected in $\mathcal{L}_t = \{l_{s\to t}(a_{s\to t}),\ s: s+d_s = t\}$ at the end of slot $t$. Note that it is possible to have $\mathcal{L}_t = \emptyset$ in certain slots, and the order of the feedback can be arbitrary, meaning it is possible to have $t_1 + d_{t_1} \ge t_2 + d_{t_2}$ when $t_1 < t_2$. In contrast to [16] however, we consider the case where the delay $d_t$ is not accessible, i.e., the learner observes only the value of $l_{s\to t}(a_{s\to t})$, but not $s$.

The learner's goal is to select $\{\mathbf{p}_t\}_{t=1}^T$ on the fly to minimize the regret defined in (1). Note that in the presence of delays, the information available to decide $\mathbf{p}_t$ is even less than in the standard MAB. Specifically, the information available to the learner when deciding $\mathbf{p}_t$ is collected in the set $\mathcal{L}_{1:t-1} := \cup_{s=1}^{t-1}\mathcal{L}_s$; see also Fig. 1 (a2). For simplicity, we assume that all feedback information is received by the end of slot $T$. This assumption incurs no loss of generality, since feedback arriving at the end of slot $T$ cannot aid the arm selection, and hence the final performance of the learner is not affected.

BCO with unknown delays. For delayed BCO, the learner still plays $x_t$ while querying $\{x_{t,k}\}_{k=1}^{M-1}$ at the beginning of slot $t$. However, the loss as well as the query responses $\{f_t(x_{t,k})\}_{k=0}^{M-1}$ are observed at the end of slot $t+d_t$. Similarly, let $f_{s\to t}(x_{s\to t})$ denote the loss incurred in slot $s$ but observed at $t$, so that the feedback set at the end of slot $t$ is $\mathcal{L}_t = \big\{\{f_{s\to t}(x_{s\to t,k})\}_{k=0}^{M-1},\ s: s+d_s = t\big\}$. To find a desirable $x_t$, the learner relies on the history $\mathcal{L}_{1:t-1} := \cup_{s=1}^{t-1}\mathcal{L}_s$, with the goal of minimizing the regret in (3).

Why do existing algorithms fail with unknown delays? Algorithms for the standard non-delayed MAB, such as EXP3, cannot be applied to delayed MAB with unknown delays. Recall that in settings without delay, to deal with the partially observed $\mathbf{l}_t$, EXP3 relies on an importance-sampling type of loss estimate given by [3]
$$\hat{l}_t(k) = \frac{l_t(a_t)\,\mathbb{1}(a_t = k)}{p_t(k)}, \quad k = 1, \ldots, K. \qquad (5)$$
The denominator as well as the indicator function in (5) ensure unbiasedness of $\hat{l}_t(k)$. Leveraging the estimated loss, the distribution $\mathbf{p}_{t+1}$ is obtained by
$$p_{t+1}(k) = \frac{p_t(k)\exp\big(-\eta\,\hat{l}_t(k)\big)}{\sum_{j=1}^K p_t(j)\exp\big(-\eta\,\hat{l}_t(j)\big)}, \quad \forall k \qquad (6)$$
where $\eta$ is the learning rate. Consider now that the loss $l_{s\to t}(a_{s\to t})$ with delay $d_s$ is observed at $t = s+d_s$. To recover the unbiased estimator $\hat{l}_{s\to t}(k)$ in (5), $p_s(k)$ must be known. However, since $d_s$ is not revealed, even if the learner can store the previous probability distributions, it is not clear how to obtain the loss estimator.

Knowing the delay is instrumental for the gradient estimator in BCO as well. For non-delayed single-point feedback BCO [13], i.e., $M = 1$, since only one value of the loss instead of the full gradient is observed per slot, the idea is to draw $u_t$ uniformly from the surface of the unit ball in $\mathbb{R}^K$, and form the gradient estimate as
$$g_t = \frac{K}{\delta}\,f_t(x_t + \delta u_t)\,u_t \qquad (7)$$
where $\delta$ is a small constant. The next action is obtained using a standard online projected gradient descent iteration leveraging the estimated gradient, that is
$$x_{t+1} = \Pi_{\mathcal{X}_\delta}\big[x_t - \eta\,g_t\big] \qquad (8)$$
where $\mathcal{X}_\delta$ is the shrunk feasible set that ensures $x_t + \delta u_t$ is feasible.
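For reference, here is a minimal Python sketch of the standard (non-delayed) building blocks just described: the importance-weighted loss estimate (5) with the exponential-weights update (6), and the single-point gradient estimate (7) with the projected step (8). The step sizes, the Euclidean-ball projection standing in for $\Pi_{\mathcal{X}_\delta}$, and all variable names are illustrative assumptions.

```python
import numpy as np

def exp3_step(p, arm, loss, eta):
    """One EXP3 update: estimator (5) followed by the exponential weighting (6)."""
    K = len(p)
    l_hat = np.zeros(K)
    l_hat[arm] = loss / p[arm]            # importance-weighted loss estimate (5)
    w = p * np.exp(-eta * l_hat)          # exponential weighting (6)
    return w / w.sum()

def one_point_gradient(f_value, u, delta, K):
    """Single-point gradient estimate (7): g_t = (K/delta) * f_t(x_t + delta*u_t) * u_t."""
    return (K / delta) * f_value * u

def project_ball(x, radius=1.0):
    """Euclidean projection onto a ball (stands in for the projection onto X_delta)."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def bgd_step(x, f_value, u, delta, eta, radius=1.0):
    """One projected step (8) using the estimated gradient."""
    g = one_point_gradient(f_value, u, delta, len(x))
    return project_ball(x - eta * g, radius)
```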

While $g_t$ serves as a nearly unbiased estimator of $\nabla f_t(x_t)$, the unknown delay creates a mismatch between the feedback $f_{s\to t}(x_{s\to t} + \delta u_{s\to t})$ and $g_{s\to t}$. Specifically, given the feedback $f_{s\to t}(x_{s\to t} + \delta u_{s\to t})$, since $d_s$ is unknown, the learner does not know $u_{s\to t}$ and hence cannot form $g_{s\to t}$ in (7). Similar arguments also hold for BCO with multi-point feedback. Therefore, performing delayed bandit learning with unknown delays is challenging, and has not been explored.

3 DEXP3 for Delayed MAB

We start with the non-stochastic MAB setup, which is randomized in nature because an arm $a_t$ is chosen randomly per slot according to a $K \times 1$ probability mass vector $\mathbf{p}_t$. In this section, we show that for the MAB problem, as long as the unknown delay is bounded, the randomized algorithm that we term Delayed EXP3 (DEXP3) can cope with unknown delays based only on single-point feedback, through a biased loss estimator, and is guaranteed to attain a desirable regret.

Recall that the feedback at slot $t$ includes losses incurred at slots $s_n$, $n = 1, 2, \ldots, |\mathcal{L}_t|$, where $\mathcal{L}_t := \{l_{s_n\to t}(a_{s_n\to t}) : s_n = t - d_{s_n}\}$. Once $\mathcal{L}_t$ is revealed, the learner estimates $\mathbf{l}_{s_n\to t}$ by scaling the observed loss according to $\mathbf{p}_t$ at the current slot. For each $l_{s_n\to t}(a_{s_n\to t}) \in \mathcal{L}_t$, the estimator of the loss vector $\mathbf{l}_{s_n\to t}$ is
$$\hat{l}_{s_n\to t}(k) = \frac{l_{s_n\to t}(k)\,\mathbb{1}(a_{s_n\to t} = k)}{p_t(k)}, \quad \forall k. \qquad (9)$$
It is worth mentioning that the index $s_n$ in (9) is only used for the analysis; during the implementation, there is no need to know $s_n$. In contrast to EXP3 [3] and its variants for delayed MAB with known delays [7, 16], our estimator of $l_{s_n\to t}(k)$ in (9) turns out to be biased, since $a_{s_n\to t}$ is chosen according to $\mathbf{p}_{s_n}$ and not according to $\mathbf{p}_t$, that is,
$$\mathbb{E}_{a_{s_n\to t}}\big[\hat{l}_{s_n\to t}(k)\big] = \frac{l_{s_n\to t}(k)\,p_{s_n}(k)}{p_t(k)} \neq l_{s_n\to t}(k). \qquad (10)$$
Since $\mathcal{L}_t$ may contain multiple rounds of feedback, leveraging each $\hat{\mathbf{l}}_{s_n\to t}$, the learner must update $|\mathcal{L}_t|$ times to obtain $\mathbf{p}_{t+1}$. Intuitively, to upper bound the bias in (10), an upper bound on $p_{s_n}(k)/p_t(k)$ is required, which amounts to a lower bound on $p_t(k)$.

On the other hand, the lower bound on $p_t(k)$ cannot be too large, to avoid incurring extra regret. Different from EXP3, our DEXP3 ensures a lower bound on $p_t(k)$ by introducing an intermediate weight vector $\mathbf{w}_t$ that evaluates the historical performance of each arm. Let $n$ denote the index of the inner-loop update at slot $t$, starting from $\mathbf{p}_t^0 := \mathbf{p}_t$. For each $l_{s_n\to t}(a_{s_n\to t}) \in \mathcal{L}_t$, the learner first updates $\mathbf{w}_t^n$ using the estimated loss $\hat{\mathbf{l}}_{s_n\to t}$ as
$$w_t^n(k) = p_t^{n-1}(k)\exp\Big(-\eta\min\big\{\delta_1, \hat{l}_{s_n\to t}(k)\big\}\Big), \quad \forall k \qquad (11)$$
where $\eta$ is the learning rate, and $\delta_1$ serves as an upper bound on $\hat{l}_{s_n\to t}(k)$ that controls its bias. However, to confine the extra regret incurred by introducing $\delta_1$, a carefully selected $\delta_1$ should ensure that the probability of $\hat{l}_{s_n\to t}(k)$ exceeding $\delta_1$ is small enough. Then the learner finds $\bar{\mathbf{w}}_t^n$ by a trimmed normalization as
$$\bar{w}_t^n(k) = \max\Big\{\frac{w_t^n(k)}{\sum_{j=1}^K w_t^n(j)},\ \frac{\delta_2}{K}\Big\}, \quad \forall k. \qquad (12)$$
Update (12) ensures that $\bar{w}_t^n(k)$ is lower bounded by $\delta_2/K$. Finally, the learner normalizes $\bar{\mathbf{w}}_t^n$ to obtain $\mathbf{p}_t^n$ as
$$p_t^n(k) = \frac{\bar{w}_t^n(k)}{\sum_{j=1}^K \bar{w}_t^n(j)}, \quad \forall k. \qquad (13)$$
It can be shown that $p_t^n(k)$ is lower bounded by $p_t^n(k) \ge \frac{\delta_2}{K(1+\delta_2)}$ [cf. (35) in the supplementary material]. After all the elements in $\mathcal{L}_t$ have been used, the learner finds $\mathbf{p}_{t+1}$ via
$$\mathbf{p}_{t+1} = \mathbf{p}_t^{|\mathcal{L}_t|}. \qquad (14)$$
Furthermore, if $\mathcal{L}_t = \emptyset$, the learner directly reuses the previous distribution, i.e., $\mathbf{p}_{t+1} = \mathbf{p}_t$, and chooses an arm accordingly. In a nutshell, DEXP3 is summarized in Alg. 1 (a code sketch follows Remark 1). As will be shown in Sec. 5.2, if the delay $d_t$ is bounded by a constant $\bar{d}$, DEXP3 guarantees a regret of $\mathcal{O}\big(\sqrt{T+D}\big)$, where $D = \sum_{t=1}^T d_t$ is the overall delay.

Algorithm 1 DEXP3
1: Initialize: $p_1(k) = 1/K, \forall k$.
2: for $t = 1, \ldots, T$ do
3:     Select an arm $a_t \sim \mathbf{p}_t$.
4:     Observe the feedback collected in the set $\mathcal{L}_t$.
5:     for $n = 1, 2, \ldots, |\mathcal{L}_t|$ do
6:         Estimate $\hat{\mathbf{l}}_{s_n\to t}$ via (9) if $l_{s_n\to t}(a_{s_n\to t}) \in \mathcal{L}_t$.
7:         Update $\mathbf{p}_t^n$ via (11)-(13).
8:     end for
9:     Obtain $\mathbf{p}_{t+1}$ via (14).
10: end for

Figure 2: An example of DEXP3 with $\mathcal{L}_t$ including the losses incurred in slots $s_1$, $s_2$, and $s_3$, and $\mathcal{L}_{t+1} = \emptyset$.

Remark 1. The recent composite loss wrapper algorithm (abbreviated as CLW) in [6] can also be applied to the delayed MAB problem. However, CLW is designed for a more general setting with composite and anonymous feedback, and its efficiency drops when the previous action $a_{s\to t}$ is known. The main differences between DEXP3 and CLW are: i) the loss estimators are different; ii) DEXP3 updates $\mathbf{p}_t$ in every slot, while CLW updates only once every $\mathcal{O}(\bar{d})$ slots, thus requiring a larger learning rate; and iii) DEXP3 does not involve $\bar{d}$ in the loss estimator, leading to a tighter regret bound in terms of delay, $\mathcal{O}\big(\sqrt{T+D}\big)$, in contrast to $\mathcal{O}\big(\sqrt{(\bar{d}+1)T}\big)$ for CLW. As corroborated by simulations, DEXP3 outperforms CLW in the considered setting.
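To make the per-slot mechanics concrete, the following minimal Python sketch implements one slot of DEXP3, i.e., the biased estimator (9) followed by the capped exponential weighting (11), the trimmed normalization (12), and the re-normalizations (13)-(14). The feedback data structure and all variable names are illustrative assumptions, not part of the paper.

```python
import numpy as np

def dexp3_slot_update(p_t, feedback, eta, delta1, delta2):
    """One slot of DEXP3 (Alg. 1, lines 5-9).

    p_t:      current distribution p_t, shape (K,)
    feedback: list of (arm, loss) pairs received at the end of slot t;
              the slot s_n in which each loss was incurred is NOT needed
    """
    K = len(p_t)
    p = p_t.copy()
    for arm, loss in feedback:                               # inner loop over |L_t| feedback values
        l_hat = np.zeros(K)
        l_hat[arm] = loss / p[arm]                           # biased estimator (9), scaled by the current p_t^n
        w = p * np.exp(-eta * np.minimum(delta1, l_hat))     # capped exponential weights (11)
        w_bar = np.maximum(w / w.sum(), delta2 / K)          # trimmed normalization (12)
        p = w_bar / w_bar.sum()                              # normalization (13)
    return p                                                 # p_{t+1} as in (14); unchanged if L_t is empty
```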

4 DBGD for Delayed BCO

In this section, we develop an algorithm that we term Delayed Bandit Gradient Descent (DBGD), based on a deterministic approximation of the gradient obtained using $M = K+1$ rounds of feedback. DBGD enjoys a regret of $\mathcal{O}\big(\sqrt{T+D}\big)$ for BCO problems even when the delays are unknown. In practice, $(K+1)$-point feedback can be obtained i) when it is possible to evaluate the loss function easily; and ii) when the slot duration is long, meaning that the algorithm has enough time to query multiple points from the oracle [27].

The intuition behind our deterministic approximation originates from the definition of the gradient [1]. Consider for example $x \in \mathbb{R}^2$ and the gradient $\nabla f(x) = \big[[\nabla f(x)]_1, [\nabla f(x)]_2\big]^\top$, where
$$[\nabla f(x)]_1 = \lim_{\delta \to 0}\frac{f(x+\delta e_1) - f(x)}{\delta}; \qquad [\nabla f(x)]_2 = \lim_{\delta \to 0}\frac{f(x+\delta e_2) - f(x)}{\delta}. \qquad (15)$$
Similarly, for a $K$-dimensional $x$, if $K+1$ rounds of feedback are available, the gradient can be approximated as
$$g_t = \frac{1}{\delta}\sum_{k=1}^{K}\big(f_t(x_t + \delta e_k) - f_t(x_t)\big)\,e_k \qquad (16)$$
where $e_k := [0, \ldots, 1, \ldots, 0]^\top$ denotes the unit vector with $k$-th entry equal to 1. Intuitively, a smaller $\delta$ improves the approximation accuracy. When $f_t$ is further assumed to be linear, $g_t$ in (16) is unbiased. In this case, the gradient of $f_t$ can be recovered exactly, and thus the setup boils down to a delayed one with full information. However, if $f_t$ is generally convex, $g_t$ in (16) is biased.

Leveraging the gradient in (16), we are ready to introduce the DBGD algorithm. Per slot $t$, the learner plays $x_t$ and also queries $f_t(x_t + \delta e_k)$, $k \in \{1, \ldots, K\}$. To ensure that $x_t + \delta e_k$ is feasible, $x_t$ should be confined to the set $\mathcal{X}_\delta := \{x : \frac{1}{1-\delta}x \in \mathcal{X}\}$. Note that if $0 \le \delta < 1$, $\mathcal{X}_\delta$ is still convex. Let $n = 1, 2, \ldots, |\mathcal{L}_t|$ index the inner-loop updates at slot $t$. At the end of slot $t$, the learner receives the observations $\mathcal{L}_t = \big\{\{f_{s_n\to t}(x_{s_n\to t}), f_{s_n\to t}(x_{s_n\to t} + \delta e_k), k = 1, \ldots, K\},\ s_n: s_n + d_{s_n} = t\big\}$. Per received feedback value, the learner approximates the gradient via (16); thus, for each $\{f_{s_n\to t}(x_{s_n\to t}), f_{s_n\to t}(x_{s_n\to t} + \delta e_k), k = 1, \ldots, K\}$, we have
$$g_{s_n\to t} = \frac{1}{\delta}\sum_{k=1}^{K}\big(f_{s_n\to t}(x_{s_n\to t} + \delta e_k) - f_{s_n\to t}(x_{s_n\to t})\big)\,e_k. \qquad (17)$$
With $g_{s_n\to t}$ and $x_t^0 := x_t$, the learner updates $|\mathcal{L}_t|$ times to obtain $x_{t+1}$ by
$$x_t^n = \Pi_{\mathcal{X}_\delta}\big[x_t^{n-1} - \eta\,g_{s_n\to t}\big], \quad n = 1, \ldots, |\mathcal{L}_t| \qquad (18a)$$
$$x_{t+1} = x_t^{|\mathcal{L}_t|}. \qquad (18b)$$
If no feedback is received at slot $t$, the learner simply sets $x_{t+1} = x_t$. DBGD is summarized in Algorithm 2; a code sketch follows.

Algorithm 2 DBGD
1: Initialize: $x_1 = \mathbf{0}$.
2: for $t = 1, \ldots, T$ do
3:     Play $x_t$, and also query $x_t + \delta e_k$, $k = 1, \ldots, K$.
4:     Observe the feedback collected in the set $\mathcal{L}_t$.
5:     if $\mathcal{L}_t = \emptyset$ then set $x_{t+1} = x_t$.
6:     else estimate the gradient $g_{s_n\to t}$ via (17) if $s_n + d_{s_n} = t$.
7:         Update $x_{t+1}$ via (18).
8:     end if
9: end for

Figure 3: An example of DBGD with $\mathcal{L}_t$ including the losses incurred in slots $s_1$, $s_2$, and $s_3$, and $\mathcal{L}_{t+1} = \emptyset$.
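A minimal Python sketch of DBGD (Alg. 2) is given below, combining the $(K+1)$-point gradient approximation (17) with the projected inner-loop updates (18). The Euclidean-ball projection standing in for $\Pi_{\mathcal{X}_\delta}$, the feedback format, and the parameter values are illustrative assumptions.

```python
import numpy as np

def k_plus_1_point_gradient(f_x, f_perturbed, delta):
    """Gradient approximation (16)/(17): entry k is (f(x + delta*e_k) - f(x)) / delta."""
    return (np.asarray(f_perturbed) - f_x) / delta

def project_ball(x, radius=1.0):
    """Euclidean projection, standing in for Pi_{X_delta} when X is a ball."""
    norm = np.linalg.norm(x)
    return x if norm <= radius else x * (radius / norm)

def dbgd_slot_update(x_t, feedback, eta, delta, radius=1.0):
    """One slot of DBGD (Alg. 2, lines 5-8).

    x_t:      current point
    feedback: list of (f_x, f_perturbed) pairs, where f_x = f_s(x_s) and
              f_perturbed[k] = f_s(x_s + delta*e_k); arrival order is arbitrary
    """
    if not feedback:
        return x_t                                            # L_t empty: reuse the previous point
    x = x_t.copy()
    for f_x, f_perturbed in feedback:
        g = k_plus_1_point_gradient(f_x, f_perturbed, delta)  # estimate (17)
        x = project_ball(x - eta * g, radius)                 # projected step (18a)
    return x                                                  # x_{t+1} as in (18b)
```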

5 A unified framework for regret analysis

In this section, we show that both DEXP3 and DBGD can guarantee an $\mathcal{O}\big(\sqrt{T+D}\big)$ regret. Our analysis considerably broadens that in [17], which was originally developed for delayed online learning with full-information feedback.

5.1 Mapping from real to virtual slots

Analyzing the recursion involving the consecutive variables $\mathbf{p}_t$ and $\mathbf{p}_{t+1}$ in DEXP3 (or $x_t$ and $x_{t+1}$ in DBGD) is challenging because, unlike in standard settings, the number of feedback rounds varies over slots. We bypass this variable feedback using the notion of a virtual slot. Over the real time horizon there are in total $T$ virtual slots, where the $\tau$-th virtual slot is associated with the $\tau$-th loss value fed back. Recall that the feedback received at the end of slot $t$ is $\mathcal{L}_t$. With the overall feedback received up to the end of slot $t-1$ denoted by $\mathcal{L}_{1:t-1} := \cup_{v=1}^{t-1}\mathcal{L}_v$, the virtual slot $\tau$ corresponding to the first feedback value received at slot $t$ is $\tau = |\mathcal{L}_{1:t-1}| + 1$. In what follows, we use MAB as an example to elaborate on this mapping, but the BCO setting can be argued likewise.

Figure 4: An example of the mapping from real slots (solid line) to virtual slots (dotted line). At the end of slot $t$, the feedback is $\mathcal{L}_t = \{l_{s_1\to t}(a_{s_1\to t}), l_{s_2\to t}(a_{s_2\to t})\}$; "v.s." stands for virtual slot.

When multiple rounds of feedback are received over a real slot $t$, DEXP3 updates $\mathbf{p}_t$ $|\mathcal{L}_t|$ times to obtain $\mathbf{p}_{t+1}$; see (11)-(14). Using the notion of virtual slots, these $|\mathcal{L}_t|$ updates are performed over $|\mathcal{L}_t|$ consecutive virtual slots. Taking Fig. 4 as an example, when $\mathbf{p}_t^1$ is obtained by using an estimated loss $\hat{\mathbf{l}}_{s_1\to t}$ [cf. (9) and (11)-(13)], this update is mapped to the virtual slot $\tau = |\mathcal{L}_{1:t-1}| + 1$, where $\mathbf{l}_\tau := \hat{\mathbf{l}}_{s_1\to t}$ is adopted to obtain $\mathbf{p}_{\tau+1} := \mathbf{p}_t^1$. Similarly, when $\mathbf{p}_t^2$ is obtained using $\hat{\mathbf{l}}_{s_2\to t}$, the virtual slot yields $\mathbf{p}_{\tau+2} := \mathbf{p}_t^2$ via $\mathbf{l}_{\tau+1} := \hat{\mathbf{l}}_{s_2\to t}$. That is to say, at real slot $t$, for $n = 1, \ldots, |\mathcal{L}_t|$, each update from $\mathbf{p}_t^{n-1}$ to $\mathbf{p}_t^n$ using the estimated loss $\hat{\mathbf{l}}_{s_n\to t}$ is mapped to an update at the virtual slot $\tau+n-1$, where $\mathbf{l}_{\tau+n-1} := \hat{\mathbf{l}}_{s_n\to t}$ is employed to obtain $\mathbf{p}_{\tau+n} := \mathbf{p}_t^n$ from $\mathbf{p}_{\tau+n-1} = \mathbf{p}_t^{n-1}$. According to this real-to-virtual slot mapping, we have $\mathbf{p}_{\tau+|\mathcal{L}_t|} = \mathbf{p}_{t+1}$; see also Fig. 4 for two examples. As we will show later, it is convenient to analyze the recursion between two consecutive $\mathbf{p}_\tau$ and $\mathbf{p}_{\tau+1}$, which is the key to the ensuing regret analysis.

With regard to DBGD, since multiple feedback rounds are possible per real slot $t$, we again map the $|\mathcal{L}_t|$ updates at a real slot [cf. (18)] to $|\mathcal{L}_t|$ virtual slots. The mapping is exactly the same as that in DEXP3, that is, per virtual slot $\tau$, $g_\tau$ is used to obtain $x_{\tau+1}$. From this real-to-virtual mapping vantage point, DEXP3 and DBGD can be viewed as inexact EXP3 and BGD running on the virtual time horizon with only one feedback value per virtual slot. That is to say, instead of analyzing the regret on the real time horizon, which can involve multiple feedback rounds per slot, we can alternatively turn to the virtual slots, where there is only one update per slot.
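The bookkeeping behind this mapping is light: feedback is simply processed in arrival order, and the $\tau$-th received value triggers the $\tau$-th (virtual-slot) update. The hedged Python sketch below tracks this counter for either algorithm; the event format, the `update` callback, and all names are illustrative assumptions.

```python
def run_virtual_updates(T, feedback_per_slot, update):
    """Replay delayed feedback on the virtual time line.

    feedback_per_slot[t] is the list L_t of feedback values arriving at the end
    of real slot t; `update` consumes one feedback value (one virtual slot).
    """
    tau = 0                                   # virtual-slot counter
    for t in range(T):
        for fb in feedback_per_slot[t]:       # |L_t| may be 0, 1, or several
            tau += 1                          # the tau-th feedback value defines virtual slot tau
            update(fb)                        # exactly one update per virtual slot
    return tau                                # equals T once all feedback has arrived
```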

5.2 Regret Analysis of DEXP3

We now turn to the regret analysis of DEXP3. The analysis builds on the following assumptions.

Assumption 1. The losses satisfy $\max_{t,k} l_t(k) \le 1$.

Assumption 2. The delay $d_t$ is bounded, i.e., $\max_t d_t \le \bar{d}$.

Assumption 1 requires the loss to be upper bounded, which is common in MAB; see also [3, 5, 15]. Assumption 2 asks for the delay to be bounded, which also appears in previous analyses of the delayed online learning setup [6, 7, 16].

Let us first consider the change of $p_\tau(k)$ after one update in a virtual slot.

Lemma 1. If the parameters are properly selected such that $1 - \delta_2 - \eta\delta_1 \ge 0$, the following inequality holds
$$\frac{p_{\tau-1}(k)}{p_\tau(k)} \le \frac{1}{1 - \delta_2 - \eta\delta_1}, \quad \forall k, \tau. \qquad (19)$$
Proof. See Sec. B.1 of the supplementary document.

Lemma 2. If the parameters are properly selected such that $1 - \eta\delta_1 \ge 0$, the following inequality holds
$$\frac{p_\tau(k)}{p_{\tau-1}(k)} \le \max\Big\{1+\delta_2,\ \frac{1}{1-\eta\delta_1}\Big\}, \quad \forall k, \tau. \qquad (20)$$
Proof. See Sec. B.2 of the supplementary document.

Lemmas 1 and 2 assert that both $p_{\tau-1}(k)/p_\tau(k)$ and $p_\tau(k)/p_{\tau-1}(k)$ are bounded deterministically, that is, regardless of the arm selection and the observed loss. These bounds are critical for deriving the regret. The final cornerstone is the regret in the virtual slots, specified in the following lemma.

Lemma 3. For a given sequence $\{\mathbf{l}_\tau\}_{\tau=1}^T$, the following relation holds
$$\sum_{\tau=1}^T(\mathbf{p}_\tau - \mathbf{p})^\top\min\{\mathbf{l}_\tau, \delta_1\mathbf{1}\} \le \frac{T\ln(1+\delta_2) + \ln K}{\eta} + \frac{\eta}{2}\sum_{\tau=1}^T\sum_{k=1}^K p_\tau(k)\big[l_\tau(k)\big]^2 \qquad (21)$$
where $\mathbf{1}$ is a $K \times 1$ vector of all ones, and $\mathbf{p} \in \Delta_K$.
Proof. See Sec. B.3 of the supplementary document.

Leveraging Lemma 3, the regret of DEXP3 follows.

Theorem 1. Suppose Assumptions 1 and 2 hold. Defining the overall delay $D := \sum_{t=1}^T d_t$, and choosing $\delta_2 = \frac{1}{T+D}$, $\eta = \sqrt{\frac{\ln K}{T+D}}$, and $\delta_1 = \frac{1}{2\bar{d}(\eta+\delta_2)}$, DEXP3 guarantees that $\mathrm{Reg}_T^{\mathrm{MAB}}$ in (1) satisfies
$$\mathrm{Reg}_T^{\mathrm{MAB}} = \mathcal{O}\big(\sqrt{T+D}\big). \qquad (22)$$
Proof. See Sec. B.4 of the supplementary document.

Theorem 1 recovers the regret bound for delayed non-stochastic MAB with fixed yet known delay [16, 23]. Furthermore, the established $\mathcal{O}\big(\sqrt{T+D}\big)$ regret also addresses the open problem of non-stochastic MAB with unknown and possibly adversarial delays [7].

5.3 Regret analysis of DBGD

Our analysis builds on the following assumptions.

Assumption 3. For any $t$, the loss function $f_t$ is convex.

Assumption 4. For any $t$, $f_t$ is $L$-Lipschitz and $\beta$-smooth.

Assumption 5. The feasible set $\mathcal{X}$ contains $\epsilon\mathcal{B}$, where $\mathcal{B}$ is the unit ball and $\epsilon > 0$ is a predefined parameter. The diameter of $\mathcal{X}$ is $R$; that is, $\max_{x,y \in \mathcal{X}}\|x-y\| \le R$.

Assumptions 3-5 are common in online learning [15]. Assumption 4 requires that $f_t$ be $L$-Lipschitz and $\beta$-smooth, which is needed to bound the bias of the estimator $g_{s\to t}$ [1]. Assumption 5 is also typical in BCO [1, 12, 13, 15]. In addition, the counterpart of Assumption 1 in BCO readily follows from Assumptions 4 and 5.

To start, the quality of the gradient estimator $g_{s\to t}$ is first evaluated. As stated, when $f_t$ is not linear, the estimator $g_{s\to t}$ is biased, but its bias is bounded.

Lemma 4. If Assumption 4 holds, then for every $f_{s\to t}(x_{s\to t})$, the corresponding estimator (16) satisfies
$$\|g_{s\to t}\| \le \sqrt{K}L, \qquad \text{and} \qquad \|g_{s\to t} - \nabla f_{s\to t}(x_{s\to t})\| \le \frac{\beta\delta\sqrt{K}}{2}. \qquad (23)$$
Proof. See Sec. C.1 of the supplementary document.

Lemma 4 suggests that with $\delta$ small enough, the bias of $g_{s\to t}$ will not be too large. The following lemma then shows the relation among $g_\tau$, $x_\tau$, and $x_{\tau+1}$ in a virtual slot.

Lemma 5. Under Assumptions 4 and 5, the update in a virtual slot $\tau$ guarantees that for any $x \in \mathcal{X}_\delta$, we have
$$g_\tau^\top(x_\tau - x) \le \frac{\|x_\tau - x\|^2 - \|x_{\tau+1} - x\|^2}{2\eta} + \frac{\eta K L^2}{2}. \qquad (24)$$
Proof. See Sec. C.2 of the supplementary document.

Lemma 5 is the counterpart of the gradient descent estimate in the non-delayed and full-information setting [15, Theorem 3.1], and demonstrates that DBGD is BGD running on the virtual slots. Finally, leveraging these results, the regret bound follows.

Theorem 2. Suppose Assumptions 3-5 hold. Choosing $\delta = \mathcal{O}\big((T+D)^{-1}\big)$ and $\eta = \mathcal{O}\big((T+D)^{-1/2}\big)$, DBGD guarantees that the regret is bounded, that is,
$$\mathrm{Reg}_T^{\mathrm{BCO}} = \mathbb{E}\Big[\sum_{t=1}^T f_t(x_t)\Big] - \sum_{t=1}^T f_t(x^*) = \mathcal{O}\big(\sqrt{T+D}\big) \qquad (25)$$
where $D := \sum_{t=1}^T d_t$ is the overall delay.
Proof. See Sec. C.3 of the supplementary document.

For the slightly different regret definition in [1], DBGD achieves the same regret bound.

Corollary 1. Upon defining $x_{t,0} := x_t$ and $x_{t,k} := x_t + \delta e_k$, and choosing $\delta = \mathcal{O}\big((T+D)^{-1}\big)$ and $\eta = \mathcal{O}\big((T+D)^{-1/2}\big)$, DBGD also guarantees that
$$\mathbb{E}\Big[\frac{1}{K+1}\sum_{t=1}^T\sum_{k=0}^{K} f_t(x_{t,k})\Big] - \sum_{t=1}^T f_t(x^*) = \mathcal{O}\big(\sqrt{T+D}\big). \qquad (26)$$
Proof. See Sec. C.4 of the supplementary document.

The $\mathcal{O}\big(\sqrt{T+D}\big)$ regret in Theorem 2 and Corollary 1 recovers the bound of delayed online learning in the full-information setup [17, 18, 25], with only bandit feedback.

6 Numerical tests

In this section, experiments are conducted to corroborate the validity of the novel DEXP3 and DBGD schemes. In the synthetic data tests, we consider $T = 2{,}000$ slots. Delays are generated periodically with period $\{1, 2, 1, 0, 3, 0, 2\}$, with the delays of the last few slots slightly modified to ensure that all feedback arrives by the end of slot $T = 2{,}000$, resulting in the overall delay $D = 2{,}569$.

DEXP3 synthetic tests. Consider $K = 5$ arms and losses generated with a sudden change. Specifically, for $t \in [1, 500]$ we have $l_t(k) = 0.4\,k\cos t$ per arm $k$, while for the rest of the slots $l_t(k) = 0.2\,k\sin t$. To benchmark the novel DEXP3, we use: i) the standard EXP3 for non-delayed MAB [3]; ii) BOLD for delayed MAB with known delay [16]; and iii) CLW, which deals with the more difficult setting in [6]. The instantaneous accumulated regret normalized by $T$ versus the time slots is plotted in Fig. 5 (a). The gap between BOLD and EXP3 illustrates that even with a known delay, the learner suffers extra regret. The small gap between DEXP3 and BOLD further demonstrates that the estimator bias [cf. (9)] causes a slightly larger regret, which is the price paid for the unknown delay. Compared with CLW, DEXP3 performs significantly better, since DEXP3 can leverage the non-anonymous feedback that CLW does not exploit.

Figure 5: (a) Regret of DEXP3 using synthetic data; (b) regret of DEXP3 using real data. (Curves: EXP3, BOLD, DEXP3, CLW; axes: normalized regret versus time slot.)
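The synthetic delay sequence and MAB losses described above can be reproduced along the following lines. The exact within-period delay pattern, the end-of-horizon trimming, and the loss formulas are reconstructed assumptions used only for illustration; the paper states the period, the change point, and the totals $T = 2{,}000$ and $D = 2{,}569$.

```python
import numpy as np

def make_delays(T, pattern=(1, 2, 1, 0, 3, 0, 2)):
    """Periodic delays d_t; trim near the horizon so every loss arrives by slot T."""
    d = np.array([pattern[t % len(pattern)] for t in range(T)])
    for t in range(T):
        d[t] = min(d[t], T - 1 - t)   # feedback for slot t must arrive within the horizon
    return d

def make_losses(T=2000, K=5, change_point=500):
    """Synthetic MAB losses with a sudden change at t = change_point."""
    t = np.arange(1, T + 1)[:, None]
    k = np.arange(1, K + 1)[None, :]
    return np.where(t <= change_point, 0.4 * k * np.cos(t), 0.2 * k * np.sin(t))

delays = make_delays(2000)
print(delays.sum())   # total delay D accumulated over the horizon
```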

DEXP3 real tests. We also tested DEXP3 using the Jester Online Joke Recommender System dataset [14], where $T = 24{,}983$ users rate $K = 100$ different jokes from 0 (not funny) to 1 (very funny). The goal is to recommend one joke per slot $t$ to amuse the users, and the system performance is evaluated by $l_t = 1 - (\text{the score of the recommended joke})$. In this test, we assign a random score within the range $[0, 1]$ to the missing entries of the dataset. The delay is generated periodically as in the synthetic test, resulting in $D = 32{,}119$. Similar to the synthetic test, it can be observed in Fig. 5 (b) that DEXP3 incurs a slightly larger regret than BOLD due to the unknown delay, but outperforms the recently developed CLW.

DBGD synthetic tests. Consider $K = 5$, and let the feasible set $\mathcal{X} \subset \mathbb{R}^5$ be the unit ball, i.e., $\mathcal{X} := \{x : \|x\| \le 1\}$. The loss function at slot $t$ is generated as $f_t(x) = a_t\|x\|^2 + \mathbf{b}_t^\top x$, where $a_t = \cos(3t) + 3$, while the entries of $\mathbf{b}_t$ are $b_t(1) = \sin t + 1$, $b_t(2) = \cos t$, $b_t(3) = \sin t$, $b_t(4) = \sin t$, and $b_t(5) = 2$. To see the influence of the bandit feedback and the unknown delay, we consider the following benchmarks: i) the standard OGD [30] in the full-information and non-delayed setting; ii) the $(K+1)$-point feedback BCO [1] for non-delayed BCO; iii) SOLID for delayed full-information OCO [17]; and iv) CLW-BCO, with the inner algorithm relying on the $(K+1)$-point feedback BCO [6]. The instantaneous accumulated regret normalized by $T$ versus the time slots is plotted in Fig. 6 (a). This test shows that DBGD performs almost as well as SOLID, and the gap between DBGD/SOLID and OGD/(K+1)-BCO is due to the delay. The regret of DBGD again significantly outperforms that of CLW-BCO, demonstrating the efficiency of DBGD.

Figure 6: (a) Regret of DBGD using synthetic data; (b) regret of DBGD using real data. (Curves: OGD, SOLID, (K+1)-BCO, DBGD, CLW-BCO; axes: normalized regret versus time slot.)

DBGD real tests. To further illustrate the merits of DBGD, we conduct tests on online regression applied to a yacht hydrodynamics dataset [10], which contains $T = 308$ data points with $K = 6$ features. Per slot $t$, the regressor $x_t \in \mathbb{R}^6$ predicts based on the feature vector $w_t$ before its label $y_t$ is revealed. The loss function for slot $t$ is $f_t(x_t) = \frac{1}{2}(y_t - x_t^\top w_t)^2$. The delay is generated periodically as before, and cumulatively it is $D = 394$. The instantaneous accumulated regret normalized by $T$ versus the time slots is plotted in Fig. 6 (b). Again, DBGD outperforms CLW-BCO considerably, and comparing the regret of DBGD with that of SOLID shows the influence that the delay has under bandit feedback.

7 Conclusions

Bandit online learning with unknown delays, including non-stochastic MAB and BCO, was studied in this paper. Different from bandit online learning settings where the experienced delay is known, the unknown delay prevents the simple gradient estimates needed by the iterative algorithms. To address this issue, a biased loss estimator as well as a deterministic gradient estimator were developed for non-stochastic MAB and BCO, respectively. Leveraging the proposed estimators, the so-termed DEXP3 and DBGD algorithms were developed. An $\mathcal{O}\big(\sqrt{T+D}\big)$ regret was analytically established for both DEXP3 and DBGD, matching the regret bound of delayed online learning in full-information settings. Numerical tests on synthetic and real datasets confirmed the performance gain of DEXP3 and DBGD relative to state-of-the-art approaches.

References

[1] A. Agarwal, O. Dekel, and L. Xiao, "Optimal algorithms for online convex optimization with multi-point bandit feedback," in Proc. Intl. Conf. on Learning Theory, 2010.
[2] A. Agarwal and J. C. Duchi, "Distributed delayed stochastic optimization," in Proc. Advances in Neural Info. Process. Syst., Granada, Spain, 2011.
[3] P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, "The nonstochastic multiarmed bandit problem," SIAM Journal on Computing, vol. 32, no. 1, pp. 48-77, 2002.
[4] B. Awerbuch and R. D. Kleinberg, "Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches," in Proc. ACM Symp. on Theory of Computing, Chicago, IL, Jun. 2004.
[5] S. Bubeck, N. Cesa-Bianchi et al., "Regret analysis of stochastic and nonstochastic multi-armed bandit problems," Found. and Trends in Machine Learning, vol. 5, no. 1, pp. 1-122, 2012.
[6] N. Cesa-Bianchi, C. Gentile, and Y. Mansour, "Nonstochastic bandits with composite anonymous feedback," in Proc. Conf. on Learning Theory, Stockholm, Sweden, 2018.
[7] N. Cesa-Bianchi, C. Gentile, Y. Mansour, and A. Minora, "Delay and cooperation in nonstochastic bandits," J. Machine Learning Res., vol. 49, 2016.
[8] O. Chapelle and L. Li, "An empirical evaluation of Thompson sampling," in Proc. Advances in Neural Info. Process. Syst., Granada, Spain, 2011.
[9] T. Desautels, A. Krause, and J. W. Burdick, "Parallelizing exploration-exploitation tradeoffs in Gaussian process bandit optimization," J. Machine Learning Res., vol. 15, no. 1, 2014.
[10] D. Dheeru and E. Karra Taniskidou, "UCI machine learning repository," 2017.
[11] J. Duchi, M. I. Jordan, and B. McMahan, "Estimation, optimization, and parallelism when data is sparse," in Proc. Advances in Neural Info. Process. Syst., Lake Tahoe, Nevada, 2013.
[12] J. C. Duchi, M. I. Jordan, M. J. Wainwright, and A. Wibisono, "Optimal rates for zero-order convex optimization: The power of two function evaluations," IEEE Trans. Inform. Theory, vol. 61, no. 5, 2015.
[13] A. D. Flaxman, A. T. Kalai, and H. B. McMahan, "Online convex optimization in the bandit setting: Gradient descent without a gradient," in Proc. of ACM-SIAM Symposium on Discrete Algorithms, Vancouver, Canada, 2005.
[14] K. Goldberg, T. Roeder, D. Gupta, and C. Perkins, "Eigentaste: A constant time collaborative filtering algorithm," Information Retrieval, vol. 4, no. 2, 2001.
[15] E. Hazan, "Introduction to online convex optimization," Found. and Trends in Optimization, vol. 2, no. 3-4, pp. 157-325, 2016.
[16] P. Joulani, A. György, and C. Szepesvári, "Online learning under delayed feedback," in Proc. Intl. Conf. Machine Learning, Atlanta, 2013.
[17] P. Joulani, A. György, and C. Szepesvári, "Delay-tolerant online convex optimization: Unified analysis and adaptive-gradient algorithms," in Proc. of AAAI Conf. on Artificial Intelligence, Phoenix, Arizona, 2016.

[18] J. Langford, A. J. Smola, and M. Zinkevich, "Slow learners are fast," in Proc. Advances in Neural Info. Process. Syst., 2009.
[19] L. Li, W. Chu, J. Langford, and R. E. Schapire, "A contextual-bandit approach to personalized news article recommendation," in Proc. of the 19th Intl. Conf. on World Wide Web, Raleigh, NC: ACM, 2010.
[20] B. McMahan and M. Streeter, "Delay-tolerant algorithms for asynchronous distributed online learning," in Proc. Advances in Neural Info. Process. Syst., Montreal, Canada, 2014.
[21] H. B. McMahan, E. Moore, D. Ramage, S. Hampson et al., "Communication-efficient learning of deep networks from decentralized data," in Proc. Intl. Conf. on Artificial Intelligence and Statistics, Fort Lauderdale, Florida, 2017.
[22] Y. Nesterov, Introductory Lectures on Convex Optimization: A Basic Course. Springer Science & Business Media, 2013.
[23] G. Neu, A. Antos, A. György, and C. Szepesvári, "Online Markov decision processes under bandit feedback," in Proc. Advances in Neural Info. Process. Syst., Vancouver, Canada, 2010.
[24] C. Pike-Burke, S. Agrawal, C. Szepesvári, and S. Grünewälder, "Bandits with delayed, anonymous feedback," arXiv preprint, 2017.
[25] K. Quanrud and D. Khashabi, "Online learning with adversarial delays," in Proc. Advances in Neural Info. Process. Syst., Montreal, Canada, 2015.
[26] O. Shamir and L. Szlak, "Online learning with local permutations and delayed feedback," in Proc. Intl. Conf. Machine Learning, Sydney, Australia, 2017.
[27] T. S. Thune and Y. Seldin, "Adaptation to easy data in prediction with limited advice," arXiv preprint, 2018.
[28] C. Vernade, O. Cappé, and V. Perchet, "Stochastic bandit models for delayed conversions," in Proc. Conf. on Uncertainty in Artificial Intelligence, Sydney, Australia, 2017.
[29] M. J. Weinberger and E. Ordentlich, "On delayed prediction of individual sequences," IEEE Trans. Inform. Theory, vol. 48, no. 7, 2002.
[30] M. Zinkevich, "Online convex programming and generalized infinitesimal gradient ascent," in Proc. Intl. Conf. Machine Learning, Washington D.C., 2003.

Supplementary Document for "Bandit Online Learning with Unknown Delays"

A Real-to-virtual slot mapping

For the analysis, let $t(\tau)$ denote the real slot in which the real loss $\mathbf{l}_{t(\tau)}$ corresponding to $\mathbf{l}_\tau$ was incurred, i.e., $\mathbf{l}_\tau = \hat{\mathbf{l}}_{t(\tau)\to t(\tau)+d_{t(\tau)}}$. Also define the auxiliary variable $s_\tau = \tau - 1 - |\mathcal{L}_{1:t(\tau)-1}|$. See an example in Fig. 7 and Table 1.

Lemma 6. The following relations hold: i) $s_\tau \ge 0, \forall\tau$; ii) $\sum_{\tau=1}^T s_\tau \le \sum_{t=1}^T d_t$; and iii) if $\max_t d_t \le \bar{d}$, then $s_\tau \le 2\bar{d}, \forall\tau$.

Proof. We first prove property i), $s_\tau \ge 0, \forall\tau$. Consider virtual slot $\tau$, whose observed loss is $l_{t(\tau)}(a_{t(\tau)})$ with the corresponding $s_\tau = \tau - 1 - |\mathcal{L}_{1:t(\tau)-1}|$. Suppose that $|\mathcal{L}_{1:t(\tau)-1}| = m$, where $0 \le m \le t(\tau)-1$ by the definition of $\mathcal{L}_{1:t(\tau)-1}$. The history $|\mathcal{L}_{1:t(\tau)-1}| = m$ implies that at the beginning of slot $t_1 = t(\tau)$, there are in total $m$ received feedback values. On the other hand, the loss $l_{t(\tau)}(a_{t(\tau)})$ is observed at the end of slot $t_2 = t(\tau)+d_{t(\tau)} \ge t_1$, and thus at the beginning of $t_2$ there are at least $m$ observations. Hence we must have $\tau \ge m+1$, and by definition $s_\tau \ge m - m = 0$.

For property ii), the proof follows from the definition of $s_\tau$, i.e.,
$$\sum_{\tau=1}^T s_\tau \overset{(a)}{=} \sum_{\tau=1}^T\big(\tau - 1 - |\mathcal{L}_{1:t(\tau)-1}|\big) = \sum_{t=1}^T\big(t - 1 - |\mathcal{L}_{1:t-1}|\big) \overset{(b)}{\le} \sum_{t=1}^T d_t \qquad (27)$$
where (a) is due to the fact that $\{t(\tau)\}_{\tau=1}^T$ is a permutation of $\{1, 2, \ldots, T\}$, and (b) follows from the definition of $\mathcal{L}_{1:t-1}$.

Finally, for property iii), notice that $|\mathcal{L}_{1:t(\tau)-1}| \ge t(\tau) - 1 - \bar{d}$, because at the beginning of slot $t(\tau)$ the losses of all slots up to $t(\tau) - 1 - \bar{d}$ must have been received. Therefore,
$$s_\tau = \tau - 1 - |\mathcal{L}_{1:t(\tau)-1}| \le \tau - 1 - \big(t(\tau) - 1 - \bar{d}\big) \overset{(c)}{\le} 2\bar{d} \qquad (28)$$
where (c) follows because $l_{t(\tau)}(a_{t(\tau)})$ is observed at the end of slot $t(\tau)+d_{t(\tau)}$ and $|\mathcal{L}_{1:t(\tau)+d_{t(\tau)}-1}|$ is at most $t(\tau)+d_{t(\tau)}-1$ (since $l_{t(\tau)}(a_{t(\tau)})$ is not yet observed), so that $\tau \le t(\tau)+d_{t(\tau)}$, i.e., $\tau - t(\tau) \le d_{t(\tau)} \le \bar{d}$.

Figure 7: An example of the mapping from real slots (solid line) to virtual slots (dotted line). The value of $t(\tau)$ is marked beside the corresponding arrow; $T = 3$ with delays $d_1 = 2$, $d_2 = 0$, and $d_3 = 0$.

Table 1: The values of $t(\tau)$, $|\mathcal{L}_{1:t(\tau)-1}|$, and $s_\tau$ in Fig. 7.
Virtual slot            | τ = 1 | τ = 2 | τ = 3
t(τ)                    |   2   |   3   |   1
|L_{1:t(τ)-1}|          |   0   |   1   |   0
s_τ                     |   0   |   0   |   2

B Proofs for DEXP3

Before diving into the proofs, we first record some useful but simple bounds for the different parameters of DEXP3 in the virtual slots. In virtual slot $\tau$, the update is carried out as in (11), (12), and (13), namely
$$w_{\tau+1}(k) = p_\tau(k)\exp\Big(-\eta\min\{\delta_1, l_\tau(k)\}\Big), \quad \forall k, \qquad (29)$$
$$\bar{w}_{\tau+1}(k) = \max\Big\{\frac{w_{\tau+1}(k)}{\sum_{j=1}^K w_{\tau+1}(j)},\ \frac{\delta_2}{K}\Big\}, \quad \forall k, \qquad (30)$$
$$p_{\tau+1}(k) = \frac{\bar{w}_{\tau+1}(k)}{\sum_{j=1}^K \bar{w}_{\tau+1}(j)}, \quad \forall k. \qquad (31)$$
Since $l_\tau(k) \ge 0, \forall k, \tau$, we have
$$\sum_{j=1}^K w_{\tau+1}(j) \le \sum_{j=1}^K p_\tau(j) = 1. \qquad (32)$$
And $\sum_{k=1}^K \bar{w}_\tau(k)$ is upper/lower bounded by
$$\sum_{k=1}^K \bar{w}_\tau(k) \ge \sum_{k=1}^K\frac{w_\tau(k)}{\sum_{j=1}^K w_\tau(j)} = 1; \qquad (33)$$
$$\sum_{k=1}^K \bar{w}_\tau(k) \le \sum_{k=1}^K\Big(\frac{w_\tau(k)}{\sum_{j=1}^K w_\tau(j)} + \frac{\delta_2}{K}\Big) = 1+\delta_2. \qquad (34)$$
Finally, $p_\tau(k)$ is upper/lower bounded by
$$\frac{\delta_2}{K(1+\delta_2)} \le \frac{\bar{w}_\tau(k)}{1+\delta_2} \le p_\tau(k) \le \bar{w}_\tau(k). \qquad (35)$$

B.1 Proof of Lemma 1

Lemma 7. In consecutive virtual slots $\tau-1$ and $\tau$, the following inequality holds for any $k$:
$$p_{\tau-1}(k) - p_\tau(k) \le p_{\tau-1}(k)\,\frac{\delta_2 + \eta\min\{\delta_1, l_{\tau-1}(k)\}}{1+\delta_2}. \qquad (36)$$

Proof. First, we have
$$p_\tau(k) \overset{(a)}{\ge} \frac{\bar{w}_\tau(k)}{1+\delta_2} \ge \frac{w_\tau(k)}{(1+\delta_2)\sum_{j=1}^K w_\tau(j)} \overset{(b)}{\ge} \frac{w_\tau(k)}{1+\delta_2} = \frac{p_{\tau-1}(k)\exp\big(-\eta\min\{\delta_1, l_{\tau-1}(k)\}\big)}{1+\delta_2} \qquad (37)$$
where in (a) we used (35), and (b) is due to (32). Hence, we have
$$p_{\tau-1}(k) - p_\tau(k) \le p_{\tau-1}(k) - \frac{p_{\tau-1}(k)\exp\big(-\eta\min\{\delta_1, l_{\tau-1}(k)\}\big)}{1+\delta_2} \overset{(c)}{\le} p_{\tau-1}(k) - \frac{p_{\tau-1}(k)\big(1-\eta\min\{\delta_1, l_{\tau-1}(k)\}\big)}{1+\delta_2} = p_{\tau-1}(k)\,\frac{\delta_2 + \eta\min\{\delta_1, l_{\tau-1}(k)\}}{1+\delta_2} \qquad (38)$$
where in (c) we used $e^{-x} \ge 1-x$, which completes the proof.

From Lemma 7, we have
$$p_{\tau-1}(k) - p_\tau(k) \le p_{\tau-1}(k)\big(\delta_2 + \eta\min\{\delta_1, l_{\tau-1}(k)\}\big) \le p_{\tau-1}(k)\big(\delta_2 + \eta\delta_1\big), \qquad (39)$$
i.e., $p_\tau(k) \ge p_{\tau-1}(k)(1-\delta_2-\eta\delta_1)$. Hence, as long as $1-\delta_2-\eta\delta_1 \ge 0$, we can guarantee that (19) is satisfied.

B.2 Proof of Lemma 2

Lemma 8. The following inequality holds for any $\tau$ and any $k$:
$$p_\tau(k) - p_{\tau-1}(k) \le p_\tau(k)\Big[1 - I_\tau(k)\sum_{j=1}^K p_{\tau-1}(j)\big(1 - \eta\min\{\delta_1, l_{\tau-1}(j)\}\big)\Big] \qquad (40)$$
where $I_\tau(k) := \mathbb{1}\Big(\frac{w_\tau(k)}{\sum_{j=1}^K w_\tau(j)} \ge \frac{\delta_2}{K}\Big)$.

Proof. We first show that
$$p_\tau(k)\,I_\tau(k) \le \frac{w_\tau(k)}{\sum_{j=1}^K w_\tau(j)}. \qquad (41)$$
It is easy to see that (41) holds when $I_\tau(k) = 0$. When $I_\tau(k) = 1$, we have $\bar{w}_\tau(k) = \frac{w_\tau(k)}{\sum_{j=1}^K w_\tau(j)}$; applying (35) then gives $p_\tau(k) \le \bar{w}_\tau(k) = \frac{w_\tau(k)}{\sum_{j=1}^K w_\tau(j)}$, from which (41) holds. Then we have
$$p_\tau(k) - p_{\tau-1}(k) \le p_\tau(k) - w_\tau(k) \le p_\tau(k) - p_\tau(k)\,I_\tau(k)\sum_{j=1}^K w_\tau(j) = p_\tau(k)\Big[1 - I_\tau(k)\sum_{j=1}^K p_{\tau-1}(j)\exp\big(-\eta\min\{\delta_1, l_{\tau-1}(j)\}\big)\Big]$$
$$\overset{(a)}{\le} p_\tau(k)\Big[1 - I_\tau(k)\sum_{j=1}^K p_{\tau-1}(j)\big(1-\eta\min\{\delta_1, l_{\tau-1}(j)\}\big)\Big] \qquad (42)$$
where in (a) we used $e^{-x} \ge 1-x$.

The proof of Lemma 2 builds on Lemma 8. First consider the case $I_\tau(k) = 0$. In this case Lemma 8 becomes $p_\tau(k) - p_{\tau-1}(k) \le p_\tau(k)$, which is trivial. On the other hand, since $I_\tau(k) = 0$, we have $\bar{w}_\tau(k) = \frac{\delta_2}{K}$; leveraging (35), $p_\tau(k) \le \bar{w}_\tau(k) = \frac{\delta_2}{K}$. Plugging in the lower bound on $p_{\tau-1}(k)$ from (35), we have
$$\frac{p_\tau(k)}{p_{\tau-1}(k)} \le \frac{\delta_2}{K}\cdot\frac{K(1+\delta_2)}{\delta_2} = 1+\delta_2. \qquad (43)$$
Considering the case $I_\tau(k) = 1$, Lemma 8 becomes
$$p_\tau(k) - p_{\tau-1}(k) \le p_\tau(k)\Big[1 - \sum_{j=1}^K p_{\tau-1}(j)\big(1-\eta\min\{\delta_1, l_{\tau-1}(j)\}\big)\Big] = \eta\,p_\tau(k)\sum_{j=1}^K p_{\tau-1}(j)\min\{\delta_1, l_{\tau-1}(j)\} \le \eta\,\delta_1\,p_\tau(k). \qquad (44)$$
Rearranging (44) and combining it with (43), we complete the proof.

B.3 Proof of Lemma 3

For conciseness, define $\mathbf{c}_\tau := \min\{\mathbf{l}_\tau, \delta_1\mathbf{1}\}$ and, correspondingly, $c_\tau(k) := \min\{l_\tau(k), \delta_1\}$. Also define $W_\tau := \sum_{k=1}^K w_\tau(k)$ and $\bar{W}_\tau := \sum_{k=1}^K \bar{w}_\tau(k)$. Leveraging these auxiliary variables, we have
$$W_{T+1} = \sum_{k=1}^K p_T(k)\,e^{-\eta c_T(k)} \ge \frac{\sum_{k=1}^K w_T(k)\,e^{-\eta c_T(k)}}{W_T\bar{W}_T} = \frac{\sum_{k=1}^K p_{T-1}(k)\,e^{-\eta(c_T(k)+c_{T-1}(k))}}{W_T\bar{W}_T} \ge \cdots \ge \frac{\sum_{k=1}^K w_1(k)\exp\big(-\eta\sum_{\tau=1}^T c_\tau(k)\big)}{W_1\prod_{\tau=2}^T W_\tau\bar{W}_\tau}. \qquad (45)$$
Then, for any probability distribution $\mathbf{p} \in \Delta_K$, noticing the initialization $w_1(k) = 1, \forall k$, and hence $W_1 = K$, (45) implies that
$$\sum_{k=1}^K p(k)\exp\Big(-\eta\sum_{\tau=1}^T c_\tau(k)\Big) \le \sum_{k=1}^K\exp\Big(-\eta\sum_{\tau=1}^T c_\tau(k)\Big) \le K\,W_{T+1}\prod_{\tau=2}^T W_\tau\bar{W}_\tau \overset{(a)}{\le} K(1+\delta_2)^T\prod_{\tau=2}^{T+1}W_\tau \qquad (46)$$
where in (a) we used the fact that $\bar{W}_\tau \le 1+\delta_2$. Then, using Jensen's inequality on $e^{-x}$, we have
$$\exp\Big(-\eta\sum_{\tau=1}^T\sum_{k=1}^K p(k)c_\tau(k)\Big) \le \sum_{k=1}^K p(k)\exp\Big(-\eta\sum_{\tau=1}^T c_\tau(k)\Big). \qquad (47)$$
Plugging (47) into (46), we have
$$\exp\Big(-\eta\sum_{\tau=1}^T\sum_{k=1}^K p(k)c_\tau(k)\Big) \le K(1+\delta_2)^T\prod_{\tau=2}^{T+1}W_\tau. \qquad (48)$$
On the other hand, $W_\tau$ can be upper bounded by
$$W_\tau = \sum_{k=1}^K p_{\tau-1}(k)\,e^{-\eta c_{\tau-1}(k)} \overset{(b)}{\le} \sum_{k=1}^K p_{\tau-1}(k)\Big(1 - \eta c_{\tau-1}(k) + \frac{\eta^2}{2}c_{\tau-1}(k)^2\Big) = 1 - \eta\sum_{k=1}^K p_{\tau-1}(k)c_{\tau-1}(k) + \frac{\eta^2}{2}\sum_{k=1}^K p_{\tau-1}(k)c_{\tau-1}(k)^2 \qquad (49)$$
where (b) follows from $e^{-x} \le 1-x+\frac{x^2}{2}$ for $x \ge 0$. Taking the logarithm on both sides of (49), we arrive at
$$\ln W_\tau \le \ln\Big(1 - \eta\sum_{k=1}^K p_{\tau-1}(k)c_{\tau-1}(k) + \frac{\eta^2}{2}\sum_{k=1}^K p_{\tau-1}(k)c_{\tau-1}(k)^2\Big) \overset{(c)}{\le} -\eta\sum_{k=1}^K p_{\tau-1}(k)c_{\tau-1}(k) + \frac{\eta^2}{2}\sum_{k=1}^K p_{\tau-1}(k)c_{\tau-1}(k)^2 \qquad (50)$$

where in (c) we used $\ln(1+x) \le x$. Then taking the logarithm on both sides of (48) and plugging (50) in, we arrive at
$$-\eta\sum_{\tau=1}^T\sum_{k=1}^K p(k)\,c_\tau(k) \le T\ln(1+\delta_2) + \ln K - \eta\sum_{\tau=1}^T\sum_{k=1}^K p_\tau(k)\,c_\tau(k) + \frac{\eta^2}{2}\sum_{\tau=1}^T\sum_{k=1}^K p_\tau(k)\,c_\tau(k)^2. \qquad (51)$$
Rearranging the terms of (51) and writing it compactly, we obtain
$$\sum_{\tau=1}^T(\mathbf{p}_\tau - \mathbf{p})^\top\mathbf{c}_\tau \le \frac{T\ln(1+\delta_2)+\ln K}{\eta} + \frac{\eta}{2}\sum_{\tau=1}^T\sum_{k=1}^K p_\tau(k)\,c_\tau(k)^2 \le \frac{T\ln(1+\delta_2)+\ln K}{\eta} + \frac{\eta}{2}\sum_{\tau=1}^T\sum_{k=1}^K p_\tau(k)\,l_\tau(k)^2. \qquad (52)$$

B.4 Proof of Theorem 1

To begin with, the instantaneous regret can be written as
$$\mathbf{p}_t^\top\mathbf{l}_t - \mathbf{p}^{*\top}\mathbf{l}_t \overset{(a)}{=} \sum_{k=1}^K p_t(k)\,\mathbb{E}_{a_t}\Big[\frac{l_t(k)\mathbb{1}(a_t=k)}{p_t(k)}\Big] - \sum_{k=1}^K p^*(k)\,\mathbb{E}_{a_t}\Big[\frac{l_t(k)\mathbb{1}(a_t=k)}{p_t(k)}\Big]$$
$$\le \max_k\Big[\frac{p_{t+d_t}(k)}{p_t(k)}\Big]\sum_{k=1}^K\big(p_t(k) - p^*(k)\big)\,\mathbb{E}_{a_t}\Big[\frac{l_t(k)\mathbb{1}(a_t=k)}{p_{t+d_t}(k)}\Big] \overset{(b)}{=} \max_k\Big[\frac{p_{t+d_t}(k)}{p_t(k)}\Big]\,\mathbb{E}_{a_t}\big[\mathbf{p}_t^\top\hat{\mathbf{l}}_{t\to t+d_t} - \mathbf{p}^{*\top}\hat{\mathbf{l}}_{t\to t+d_t}\big] \qquad (53)$$
where (a) is due to $\mathbb{E}_{a_t}\big[\frac{l_t(k)\mathbb{1}(a_t=k)}{p_t(k)}\big] = l_t(k)$, and (b) follows from the definition $\hat{l}_{t\to t+d_t}(k) = \frac{l_t(k)\mathbb{1}(a_t=k)}{p_{t+d_t}(k)}$. Then the overall regret over $T$ slots is given by
$$\mathrm{Reg}_T = \mathbb{E}\Big[\sum_{t=1}^T\mathbf{p}_t^\top\mathbf{l}_t - \mathbf{p}^{*\top}\mathbf{l}_t\Big] \le \mathbb{E}\Big[\sum_{t=1}^T\max_k\Big[\frac{p_{t+d_t}(k)}{p_t(k)}\Big]\,\mathbb{E}_{a_t}\big[\mathbf{p}_t^\top\hat{\mathbf{l}}_{t\to t+d_t} - \mathbf{p}^{*\top}\hat{\mathbf{l}}_{t\to t+d_t}\big]\Big]$$
$$\overset{(c)}{=} \mathbb{E}\Big[\sum_{\tau=1}^T\max_k\Big[\frac{p_{t(\tau)+d_{t(\tau)}}(k)}{p_{t(\tau)}(k)}\Big]\,\mathbb{E}_{a_{t(\tau)}}\big[\mathbf{p}_{t(\tau)}^\top\hat{\mathbf{l}}_{t(\tau)\to t(\tau)+d_{t(\tau)}} - \mathbf{p}^{*\top}\hat{\mathbf{l}}_{t(\tau)\to t(\tau)+d_{t(\tau)}}\big]\Big]$$
$$\overset{(d)}{=} \mathbb{E}\Big[\sum_{\tau=1}^T\max_k\Big[\frac{p_{t(\tau)+d_{t(\tau)}}(k)}{p_{t(\tau)}(k)}\Big]\,\mathbb{E}_{a_{t(\tau)}}\big[\mathbf{p}_{t(\tau)}^\top\mathbf{l}_\tau - \mathbf{p}^{*\top}\mathbf{l}_\tau\big]\Big]$$
$$\overset{(e)}{=} \mathbb{E}\Big[\sum_{\tau=1}^T\max_k\Big[\frac{p_{t(\tau)+d_{t(\tau)}}(k)}{p_{t(\tau)}(k)}\Big]\Big(\mathbb{E}_{a_{t(\tau)}}\big[\mathbf{p}_{\tau-s_\tau}^\top\mathbf{l}_\tau - \mathbf{p}_\tau^\top\mathbf{l}_\tau\big] + \mathbb{E}_{a_{t(\tau)}}\big[\mathbf{p}_\tau^\top\mathbf{l}_\tau - \mathbf{p}^{*\top}\mathbf{l}_\tau\big]\Big)\Big] \qquad (54)$$

where (c) is due to the fact that $\{t(1), t(2), \ldots, t(T)\}$ is a permutation of $\{1, 2, \ldots, T\}$; (d) follows from $\mathbf{l}_\tau = \hat{\mathbf{l}}_{t(\tau)\to t(\tau)+d_{t(\tau)}}$; and (e) uses the facts that $\mathbf{p}_t = \mathbf{p}_{|\mathcal{L}_{1:t-1}|+1}$ and $\mathbf{p}_{t(\tau)} = \mathbf{p}_{|\mathcal{L}_{1:t(\tau)-1}|+1} = \mathbf{p}_{\tau-s_\tau}$.

First note that between real time slots $t(\tau)$ and $t(\tau)+d_{t(\tau)}$, at most $\bar{d} + d_{t(\tau)} \le 2\bar{d}$ feedback values are received. Hence the corresponding virtual slots differ by no more than $2\bar{d}$. Note also that the index of the virtual slot corresponding to $t(\tau)$ must be no larger than that of $t(\tau)+d_{t(\tau)}$. Hence, for all $\tau \in [1, T]$, we have
$$\max_k\frac{p_{t(\tau)+d_{t(\tau)}}(k)}{p_{t(\tau)}(k)} \le \Big(\max_k\frac{p_{\tau+1}(k)}{p_\tau(k)}\Big)^{2\bar{d}} \overset{(f)}{\le} \max\Big\{(1+\delta_2)^{2\bar{d}},\ \frac{1}{(1-\eta\delta_1)^{2\bar{d}}}\Big\} \qquad (55)$$
where (f) is the result of Lemma 2.

Then, to bound the terms in the second bracket of (54), we again denote $\mathbf{c}_\tau := \min\{\mathbf{l}_\tau, \delta_1\mathbf{1}\}$ and, correspondingly, $c_\tau(k) := \min\{l_\tau(k), \delta_1\}$ for conciseness. Letting $m$ denote the index of the only non-zero entry of $\mathbf{l}_\tau$, we have
$$\mathbf{p}_{\tau-s_\tau}^\top\mathbf{c}_\tau - \mathbf{p}_\tau^\top\mathbf{c}_\tau = \mathbf{c}_\tau^\top\big(\mathbf{p}_{\tau-s_\tau} - \mathbf{p}_\tau\big) \overset{(g)}{=} c_\tau(m)\sum_{j=0}^{s_\tau-1}\big(p_{\tau-s_\tau+j}(m) - p_{\tau-s_\tau+j+1}(m)\big)$$
$$\overset{(h)}{\le} c_\tau(m)\sum_{j=0}^{s_\tau-1}p_{\tau-s_\tau+j}(m)\,\frac{\delta_2 + \eta\,c_{\tau-s_\tau+j}(m)}{1+\delta_2} \le l_\tau(m)\sum_{j=0}^{s_\tau-1}\big(\eta\,p_{\tau-s_\tau+j}(m)\,c_{\tau-s_\tau+j}(m) + \delta_2\big) \qquad (56)$$
where (g) follows from the facts that $\mathbf{l}_\tau$ has at most one non-zero entry, with index $m$ [cf. (9)], and that $s_\tau \ge 0$ [cf. Lemma 6]; and (h) is the result of Lemma 7. Then notice that
$$l_\tau(k)\,p_\tau(k) = \frac{l_{t(\tau)}(k)\,\mathbb{1}(a_{t(\tau)}=k)}{p_{t(\tau)+d_{t(\tau)}}(k)}\,p_\tau(k) \overset{(i)}{\le} \frac{l_{t(\tau)}(k)\,\mathbb{1}(a_{t(\tau)}=k)}{(1-\delta_2-\eta\delta_1)^{2\bar{d}}} \qquad (57)$$
where (i) uses the fact that between $t(\tau)$ and $t(\tau)+d_{t(\tau)}$ there are at most $2\bar{d}$ feedback values; applying the result of Lemma 1 repeatedly then yields (57). Plugging (57) back into (56) and taking the expectation w.r.t. $a_{t(\tau)}$, we arrive at
$$\mathbb{E}_{a_{t(\tau)}}\big[\mathbf{p}_{\tau-s_\tau}^\top\mathbf{c}_\tau - \mathbf{p}_\tau^\top\mathbf{c}_\tau\big] \le \frac{\eta\,s_\tau}{(1-\delta_2-\eta\delta_1)^{2\bar{d}}}\,\mathbb{E}_{a_{t(\tau)}}\Big[\sum_{k=1}^K p_{t(\tau)}(k)\,l_\tau(k)\Big] + \delta_2 s_\tau \overset{(j)}{\le} \frac{\eta\,s_\tau K}{(1-\delta_2-\eta\delta_1)^{4\bar{d}}} + \delta_2 s_\tau \qquad (58)$$
where (j) follows from a similar argument as (57). Then, noticing that $\sum_{\tau=1}^T s_\tau \le \sum_{t=1}^T d_t = D$, we have
$$\sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[\mathbf{p}_{\tau-s_\tau}^\top\mathbf{c}_\tau - \mathbf{p}_\tau^\top\mathbf{c}_\tau\big] \le \frac{\eta KD}{(1-\delta_2-\eta\delta_1)^{4\bar{d}}} + \delta_2 D. \qquad (59)$$
Then, using a similar argument as in (57), we have
$$\mathbb{E}_{a_{t(\tau)}}\Big[\sum_{k=1}^K p_\tau(k)\,l_\tau(k)^2\Big] = \sum_{k=1}^K\frac{l_{t(\tau)}(k)^2\,p_\tau(k)\,p_{t(\tau)}(k)}{p_{t(\tau)+d_{t(\tau)}}(k)^2} \le \frac{K}{(1-\delta_2-\eta\delta_1)^{4\bar{d}}}. \qquad (60)$$

Then, leveraging Lemma 3, we have
$$\sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau-\mathbf{p})^\top\mathbf{c}_\tau\big] \le \frac{T\ln(1+\delta_2)+\ln K}{\eta} + \frac{\eta}{2}\sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\Big[\sum_{k=1}^K p_\tau(k)\,l_\tau(k)^2\Big] \le \frac{T\ln(1+\delta_2)+\ln K}{\eta} + \frac{\eta KT}{2(1-\delta_2-\eta\delta_1)^{4\bar{d}}}. \qquad (61)$$
The last step is to show that introducing $\delta_1$ does not incur too much extra regret. Note that both $\mathbf{c}_\tau$ and $\mathbf{l}_\tau$ have only one non-zero entry, whose index is denoted by $m_\tau$. We have $l_\tau(m_\tau) > c_\tau(m_\tau)$ only when $l_\tau(m_\tau) = \frac{l_{t(\tau)}(m_\tau)}{p_{t(\tau)+d_{t(\tau)}}(m_\tau)} > \delta_1$, which is equivalent to $p_{t(\tau)+d_{t(\tau)}}(m_\tau) < l_{t(\tau)}(m_\tau)/\delta_1 \le 1/\delta_1$. Hence, we have
$$\sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau-\mathbf{p})^\top\mathbf{l}_\tau\big] = \sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau-\mathbf{p})^\top\mathbf{c}_\tau\big] + \sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau-\mathbf{p})^\top(\mathbf{l}_\tau-\mathbf{c}_\tau)\big]$$
$$\overset{(h)}{\le} \sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau-\mathbf{p})^\top\mathbf{c}_\tau\big] + \sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\Big[p_\tau(m_\tau)\big(l_\tau(m_\tau)-c_\tau(m_\tau)\big)\,\mathbb{1}\big(p_{t(\tau)+d_{t(\tau)}}(m_\tau)<1/\delta_1\big)\Big]$$
$$\le \sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau-\mathbf{p})^\top\mathbf{c}_\tau\big] + \sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\Big[p_\tau(m_\tau)\,l_\tau(m_\tau)\,\mathbb{1}\big(p_{t(\tau)+d_{t(\tau)}}(m_\tau)<1/\delta_1\big)\Big] \qquad (62)$$
where in (h), $m_\tau$ denotes the index of the only non-zero entry of $\mathbf{l}_\tau$, and $\mathbf{p}$ is dropped due to the appearance of the indicator function. To proceed, notice that
$$\mathbb{E}_{a_{t(\tau)}}\Big[l_\tau(m_\tau)\,p_\tau(m_\tau)\,\mathbb{1}\big(p_{t(\tau)+d_{t(\tau)}}(m_\tau)<1/\delta_1\big)\Big] = \sum_{k=1}^K\frac{p_{t(\tau)}(k)\,l_{t(\tau)}(k)}{p_{t(\tau)+d_{t(\tau)}}(k)}\,p_\tau(k)\,\mathbb{1}\big(p_{t(\tau)+d_{t(\tau)}}(k)<1/\delta_1\big)$$
$$\overset{(i)}{\le} \frac{1}{(1-\delta_2-\eta\delta_1)^{2\bar{d}}}\sum_{k=1}^K\frac{p_\tau(k)}{p_{t(\tau)+d_{t(\tau)}}(k)}\,p_{t(\tau)+d_{t(\tau)}}(k)\,\mathbb{1}\big(p_{t(\tau)+d_{t(\tau)}}(k)<1/\delta_1\big)$$
$$\le \frac{1}{(1-\delta_2-\eta\delta_1)^{4\bar{d}}}\sum_{k=1}^K p_{t(\tau)+d_{t(\tau)}}(k)\,\mathbb{1}\big(p_{t(\tau)+d_{t(\tau)}}(k)<1/\delta_1\big) \overset{(j)}{\le} \frac{K}{\delta_1(1-\delta_2-\eta\delta_1)^{4\bar{d}}} \qquad (63)$$
where in (i) we used a similar argument to (57), and in (j) we used the fact that $x\,\mathbb{1}(x<a) \le a$. Plugging (63) back into (62), we arrive at
$$\sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau-\mathbf{p})^\top\mathbf{l}_\tau\big] \le \sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau-\mathbf{p})^\top\mathbf{c}_\tau\big] + \frac{KT}{\delta_1(1-\delta_2-\eta\delta_1)^{4\bar{d}}}. \qquad (64)$$
Applying similar arguments as in (62) and (63), we can also show that
$$\sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_{\tau-s_\tau}-\mathbf{p}_\tau)^\top\mathbf{l}_\tau\big] \le \sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_{\tau-s_\tau}-\mathbf{p}_\tau)^\top\mathbf{c}_\tau\big] + \frac{KD}{\delta_1(1-\delta_2-\eta\delta_1)^{6\bar{d}}}. \qquad (65)$$
For the parameter selection, we have $T\ln(1+\delta_2) = T\ln\big(1+\frac{1}{T+D}\big) \le \ln e = 1$. Leveraging the inequality $\big(1-\frac{1}{2x}\big)^{2x} \ge \frac{1}{4}$ for $x \in \mathbb{N}^+$, we have that
$$\frac{1}{(1-\delta_2-\eta\delta_1)^{2\bar{d}}} = \mathcal{O}(1), \qquad \frac{1}{(1-\delta_2-\eta\delta_1)^{4\bar{d}}} = \mathcal{O}(1), \qquad \frac{1}{(1-\delta_2-\eta\delta_1)^{6\bar{d}}} = \mathcal{O}(1). \qquad (66)$$

From (66) it is not hard to see the bound on (55), which is
$$\max_k\frac{p_{t(\tau)+d_{t(\tau)}}(k)}{p_{t(\tau)}(k)} \le \max\Big\{(1+\delta_2)^{2\bar{d}},\ \frac{1}{(1-\eta\delta_1)^{2\bar{d}}}\Big\} = \mathcal{O}(1). \qquad (67)$$
Then for (59), we have
$$\sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_{\tau-s_\tau}-\mathbf{p}_\tau)^\top\mathbf{c}_\tau\big] \le \frac{\eta KD}{(1-\delta_2-\eta\delta_1)^{4\bar{d}}} + \delta_2 D = \mathcal{O}\big(\sqrt{D}\big). \qquad (68)$$
For (61), we have
$$\sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau-\mathbf{p})^\top\mathbf{c}_\tau\big] \le \frac{T\ln(1+\delta_2)+\ln K}{\eta} + \frac{\eta KT}{2(1-\delta_2-\eta\delta_1)^{4\bar{d}}} = \mathcal{O}\big(\sqrt{T}+\ln K\big). \qquad (69)$$
Using (69) and the fact that $T/\delta_1 = \mathcal{O}(\sqrt{T})$, we can bound (64) by
$$\sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau-\mathbf{p})^\top\mathbf{l}_\tau\big] \le \sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_\tau-\mathbf{p})^\top\mathbf{c}_\tau\big] + \frac{KT}{\delta_1(1-\delta_2-\eta\delta_1)^{4\bar{d}}} = \mathcal{O}\big(\sqrt{T}+\ln K\big). \qquad (70)$$
Plugging in (68) and using the fact that $D/\delta_1 = \mathcal{O}(\sqrt{D})$, we have
$$\sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_{\tau-s_\tau}-\mathbf{p}_\tau)^\top\mathbf{l}_\tau\big] \le \sum_{\tau=1}^T\mathbb{E}_{a_{t(\tau)}}\big[(\mathbf{p}_{\tau-s_\tau}-\mathbf{p}_\tau)^\top\mathbf{c}_\tau\big] + \frac{KD}{\delta_1(1-\delta_2-\eta\delta_1)^{6\bar{d}}} = \mathcal{O}\big(\sqrt{D}\big). \qquad (71)$$
Plugging (67), (70), and (71) into (54), the regret is bounded by
$$\mathrm{Reg}_T = \mathbb{E}\Big[\sum_{t=1}^T\mathbf{p}_t^\top\mathbf{l}_t\Big] - \sum_{t=1}^T\mathbf{p}^{*\top}\mathbf{l}_t = \mathcal{O}\big(\sqrt{T+D}\big). \qquad (72)$$

C Proofs for DBGD

C.1 Proof of Lemma 4

Since $f_{s\to t}$ is $L$-Lipschitz, we have $|g_{s\to t}(k)| = \frac{1}{\delta}\big|f_{s\to t}(x_{s\to t}+\delta e_k) - f_{s\to t}(x_{s\to t})\big| \le \frac{1}{\delta}L\|\delta e_k\| = L$, and thus $\|g_{s\to t}\| \le \sqrt{K}L$. On the other hand, let $\nabla_{s\to t} := \nabla f_{s\to t}(x_{s\to t})$, with $\nabla_{s\to t}(k)$ being the $k$-th entry of $\nabla_{s\to t}$. Due to the $\beta$-smoothness of $f_{s\to t}$, we have
$$\big|g_{s\to t}(k) - \nabla_{s\to t}(k)\big| = \frac{1}{\delta}\Big|f_{s\to t}(x_{s\to t}+\delta e_k) - f_{s\to t}(x_{s\to t}) - \delta\,\nabla_{s\to t}^\top e_k\Big| \le \frac{1}{\delta}\cdot\frac{\beta}{2}\delta^2 = \frac{\beta\delta}{2} \qquad (73)$$
suggesting that $\|g_{s\to t} - \nabla f_{s\to t}(x_{s\to t})\| \le \frac{\beta\delta\sqrt{K}}{2}$.

C.2 Proof of Lemma 5

Lemma 5 (Restated). In the virtual slots, it is guaranteed that
$$\|x_\tau - x_{\tau-s_\tau}\| \le \eta\,s_\tau\sqrt{K}L \qquad (74)$$
and for any $x \in \mathcal{X}_\delta$, we have
$$g_\tau^\top(x_\tau - x) \le \frac{\|x_\tau - x\|^2 - \|x_{\tau+1}-x\|^2}{2\eta} + \frac{\eta KL^2}{2}. \qquad (75)$$

Proof. The proof begins with
$$\|x_{\tau-s_\tau} - x_\tau\| \le \sum_{j=0}^{s_\tau-1}\|x_{\tau-s_\tau+j} - x_{\tau-s_\tau+j+1}\| \overset{(a)}{\le} \eta\,s_\tau\sqrt{K}L \qquad (76)$$
where in (a) we used the fact that $\|x_\tau - x_{\tau+1}\| = \|x_\tau - \Pi_{\mathcal{X}_\delta}[x_\tau - \eta g_\tau]\| \le \eta\|g_\tau\|$. Hence we proved the first inequality. Then, notice that
$$\|x_{\tau+1}-x\|^2 - \|x_\tau-x\|^2 = \big\|\Pi_{\mathcal{X}_\delta}[x_\tau-\eta g_\tau] - x\big\|^2 - \|x_\tau-x\|^2 \overset{(b)}{\le} \|x_\tau-\eta g_\tau-x\|^2 - \|x_\tau-x\|^2 = -2\eta\,g_\tau^\top(x_\tau-x) + \eta^2\|g_\tau\|^2 \qquad (77)$$
where (b) uses the non-expansiveness of the projection. Rearranging the terms of (77) completes the proof.

C.3 Proof of Theorem 2

Lemma 9. Let $h_t(x) := f_t(x) + \big(g_t - \nabla f_t(x_t)\big)^\top x$, where $g_t := g_{t\to t+d_t}$. Then $h_t(x)$ has the following properties: i) $h_t(x)$ is $\big(L + \frac{\beta\delta\sqrt{K}}{2}\big)$-Lipschitz; and ii) $h_t(x)$ is $\beta$-smooth and convex.

Proof. Starting with the first property, consider that
$$|h_t(x) - h_t(y)| = \Big|f_t(x) + \big(g_t - \nabla f_t(x_t)\big)^\top x - f_t(y) - \big(g_t - \nabla f_t(x_t)\big)^\top y\Big| \le |f_t(x)-f_t(y)| + \|g_t - \nabla f_t(x_t)\|\,\|x-y\| \overset{(a)}{\le} \Big(L + \frac{\beta\delta\sqrt{K}}{2}\Big)\|x-y\| \qquad (78)$$
where in (a) we used the results in Lemma 4. For the second property, the convexity of $h_t(x)$ is obvious. Then, noticing that $\nabla h_t(x) = \nabla f_t(x) + g_t - \nabla f_t(x_t)$, we have
$$h_t(y) - h_t(x) = f_t(y) - f_t(x) + \big(g_t - \nabla f_t(x_t)\big)^\top(y-x) \le \nabla f_t(x)^\top(y-x) + \frac{\beta}{2}\|y-x\|^2 + \big(g_t - \nabla f_t(x_t)\big)^\top(y-x) = \nabla h_t(x)^\top(y-x) + \frac{\beta}{2}\|y-x\|^2 \qquad (79)$$
which implies that $h_t(x)$ is $\beta$-smooth.

We are now ready to prove Theorem 2. Let $h_t(x) := f_t(x) + \big(g_t - \nabla f_t(x_t)\big)^\top x$, where $g_t := g_{t\to t+d_t}$, and let $x^*_\delta \in \mathcal{X}_\delta$ denote a feasible point close to $x^*$. Using the properties of $h_t(x)$ in Lemma 9 as well as the fact that $\nabla h_t(x_t) = g_t$, we have
$$\mathrm{Reg}_T = \sum_{t=1}^T f_t(x_t) - f_t(x^*) \overset{(a)}{=} \sum_{t=1}^T h_t(x_t) - h_t(x^*) + \big(g_t - \nabla f_t(x_t)\big)^\top(x^* - x_t)$$
$$\le \sum_{t=1}^T h_t(x_t) - h_t(x^*_\delta) + \big|h_t(x^*_\delta) - h_t(x^*)\big| + \big\|g_t - \nabla f_t(x_t)\big\|\,\|x^* - x_t\| \overset{(b)}{\le} \sum_{t=1}^T\big[h_t(x_t) - h_t(x^*_\delta)\big] + \delta RT\Big(L + \frac{\beta\delta\sqrt{K}}{2}\Big) + \frac{RT\beta\delta\sqrt{K}}{2} \qquad (80)$$


More information

Learning, Games, and Networks

Learning, Games, and Networks Learning, Games, and Networks Abhishek Sinha Laboratory for Information and Decision Systems MIT ML Talk Series @CNRG December 12, 2016 1 / 44 Outline 1 Prediction With Experts Advice 2 Application to

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Online Convex Optimization MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Online projected sub-gradient descent. Exponentiated Gradient (EG). Mirror descent.

More information

Online learning with feedback graphs and switching costs

Online learning with feedback graphs and switching costs Online learning with feedback graphs and switching costs A Proof of Theorem Proof. Without loss of generality let the independent sequence set I(G :T ) formed of actions (or arms ) from to. Given the sequence

More information

Online Passive-Aggressive Algorithms

Online Passive-Aggressive Algorithms Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il

More information

Bandit models: a tutorial

Bandit models: a tutorial Gdt COS, December 3rd, 2015 Multi-Armed Bandit model: general setting K arms: for a {1,..., K}, (X a,t ) t N is a stochastic process. (unknown distributions) Bandit game: a each round t, an agent chooses

More information

Online Passive-Aggressive Algorithms

Online Passive-Aggressive Algorithms Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il

More information

New bounds on the price of bandit feedback for mistake-bounded online multiclass learning

New bounds on the price of bandit feedback for mistake-bounded online multiclass learning Journal of Machine Learning Research 1 8, 2017 Algorithmic Learning Theory 2017 New bounds on the price of bandit feedback for mistake-bounded online multiclass learning Philip M. Long Google, 1600 Amphitheatre

More information

Exponential Weights on the Hypercube in Polynomial Time

Exponential Weights on the Hypercube in Polynomial Time European Workshop on Reinforcement Learning 14 (2018) October 2018, Lille, France. Exponential Weights on the Hypercube in Polynomial Time College of Information and Computer Sciences University of Massachusetts

More information

Regret Bounds for Online Portfolio Selection with a Cardinality Constraint

Regret Bounds for Online Portfolio Selection with a Cardinality Constraint Regret Bounds for Online Portfolio Selection with a Cardinality Constraint Shinji Ito NEC Corporation Daisuke Hatano RIKEN AIP Hanna Sumita okyo Metropolitan University Akihiro Yabe NEC Corporation akuro

More information

The No-Regret Framework for Online Learning

The No-Regret Framework for Online Learning The No-Regret Framework for Online Learning A Tutorial Introduction Nahum Shimkin Technion Israel Institute of Technology Haifa, Israel Stochastic Processes in Engineering IIT Mumbai, March 2013 N. Shimkin,

More information

New Algorithms for Contextual Bandits

New Algorithms for Contextual Bandits New Algorithms for Contextual Bandits Lev Reyzin Georgia Institute of Technology Work done at Yahoo! 1 S A. Beygelzimer, J. Langford, L. Li, L. Reyzin, R.E. Schapire Contextual Bandit Algorithms with Supervised

More information

Learning Algorithms for Minimizing Queue Length Regret

Learning Algorithms for Minimizing Queue Length Regret Learning Algorithms for Minimizing Queue Length Regret Thomas Stahlbuhk Massachusetts Institute of Technology Cambridge, MA Brooke Shrader MIT Lincoln Laboratory Lexington, MA Eytan Modiano Massachusetts

More information

Lecture 19: UCB Algorithm and Adversarial Bandit Problem. Announcements Review on stochastic multi-armed bandit problem

Lecture 19: UCB Algorithm and Adversarial Bandit Problem. Announcements Review on stochastic multi-armed bandit problem Lecture 9: UCB Algorithm and Adversarial Bandit Problem EECS598: Prediction and Learning: It s Only a Game Fall 03 Lecture 9: UCB Algorithm and Adversarial Bandit Problem Prof. Jacob Abernethy Scribe:

More information

Multi-armed bandit models: a tutorial

Multi-armed bandit models: a tutorial Multi-armed bandit models: a tutorial CERMICS seminar, March 30th, 2016 Multi-Armed Bandit model: general setting K arms: for a {1,..., K}, (X a,t ) t N is a stochastic process. (unknown distributions)

More information

arxiv: v4 [math.oc] 5 Jan 2016

arxiv: v4 [math.oc] 5 Jan 2016 Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The

More information

Online Learning and Sequential Decision Making

Online Learning and Sequential Decision Making Online Learning and Sequential Decision Making Emilie Kaufmann CNRS & CRIStAL, Inria SequeL, emilie.kaufmann@univ-lille.fr Research School, ENS Lyon, Novembre 12-13th 2018 Emilie Kaufmann Online Learning

More information

Lecture: Adaptive Filtering

Lecture: Adaptive Filtering ECE 830 Spring 2013 Statistical Signal Processing instructors: K. Jamieson and R. Nowak Lecture: Adaptive Filtering Adaptive filters are commonly used for online filtering of signals. The goal is to estimate

More information

Experts in a Markov Decision Process

Experts in a Markov Decision Process University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 2004 Experts in a Markov Decision Process Eyal Even-Dar Sham Kakade University of Pennsylvania Yishay Mansour Follow

More information

Online Learning with Experts & Multiplicative Weights Algorithms

Online Learning with Experts & Multiplicative Weights Algorithms Online Learning with Experts & Multiplicative Weights Algorithms CS 159 lecture #2 Stephan Zheng April 1, 2016 Caltech Table of contents 1. Online Learning with Experts With a perfect expert Without perfect

More information

Optimization, Learning, and Games with Predictable Sequences

Optimization, Learning, and Games with Predictable Sequences Optimization, Learning, and Games with Predictable Sequences Alexander Rakhlin University of Pennsylvania Karthik Sridharan University of Pennsylvania Abstract We provide several applications of Optimistic

More information

EASINESS IN BANDITS. Gergely Neu. Pompeu Fabra University

EASINESS IN BANDITS. Gergely Neu. Pompeu Fabra University EASINESS IN BANDITS Gergely Neu Pompeu Fabra University EASINESS IN BANDITS Gergely Neu Pompeu Fabra University THE BANDIT PROBLEM Play for T rounds attempting to maximize rewards THE BANDIT PROBLEM Play

More information

Optimistic Bandit Convex Optimization

Optimistic Bandit Convex Optimization Optimistic Bandit Convex Optimization Mehryar Mohri Courant Institute and Google 25 Mercer Street New York, NY 002 mohri@cims.nyu.edu Scott Yang Courant Institute 25 Mercer Street New York, NY 002 yangs@cims.nyu.edu

More information

Sparse Linear Contextual Bandits via Relevance Vector Machines

Sparse Linear Contextual Bandits via Relevance Vector Machines Sparse Linear Contextual Bandits via Relevance Vector Machines Davis Gilton and Rebecca Willett Electrical and Computer Engineering University of Wisconsin-Madison Madison, WI 53706 Email: gilton@wisc.edu,

More information

A Drifting-Games Analysis for Online Learning and Applications to Boosting

A Drifting-Games Analysis for Online Learning and Applications to Boosting A Drifting-Games Analysis for Online Learning and Applications to Boosting Haipeng Luo Department of Computer Science Princeton University Princeton, NJ 08540 haipengl@cs.princeton.edu Robert E. Schapire

More information

Lecture 4: Lower Bounds (ending); Thompson Sampling

Lecture 4: Lower Bounds (ending); Thompson Sampling CMSC 858G: Bandits, Experts and Games 09/12/16 Lecture 4: Lower Bounds (ending); Thompson Sampling Instructor: Alex Slivkins Scribed by: Guowei Sun,Cheng Jie 1 Lower bounds on regret (ending) Recap from

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

From Bandits to Experts: A Tale of Domination and Independence

From Bandits to Experts: A Tale of Domination and Independence From Bandits to Experts: A Tale of Domination and Independence Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Domination and Independence 1 / 1 From Bandits to Experts: A

More information

Ad Placement Strategies

Ad Placement Strategies Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad

More information

Proximal and First-Order Methods for Convex Optimization

Proximal and First-Order Methods for Convex Optimization Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,

More information

Explore no more: Improved high-probability regret bounds for non-stochastic bandits

Explore no more: Improved high-probability regret bounds for non-stochastic bandits Explore no more: Improved high-probability regret bounds for non-stochastic bandits Gergely Neu SequeL team INRIA Lille Nord Europe gergely.neu@gmail.com Abstract This work addresses the problem of regret

More information

arxiv: v1 [cs.lg] 23 Jan 2019

arxiv: v1 [cs.lg] 23 Jan 2019 Cooperative Online Learning: Keeping your Neighbors Updated Nicolò Cesa-Bianchi, Tommaso R. Cesari, and Claire Monteleoni 2 Dipartimento di Informatica, Università degli Studi di Milano, Italy 2 Department

More information

Evaluation of multi armed bandit algorithms and empirical algorithm

Evaluation of multi armed bandit algorithms and empirical algorithm Acta Technica 62, No. 2B/2017, 639 656 c 2017 Institute of Thermomechanics CAS, v.v.i. Evaluation of multi armed bandit algorithms and empirical algorithm Zhang Hong 2,3, Cao Xiushan 1, Pu Qiumei 1,4 Abstract.

More information

Convergence and No-Regret in Multiagent Learning

Convergence and No-Regret in Multiagent Learning Convergence and No-Regret in Multiagent Learning Michael Bowling Department of Computing Science University of Alberta Edmonton, Alberta Canada T6G 2E8 bowling@cs.ualberta.ca Abstract Learning in a multiagent

More information

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I. Sébastien Bubeck Theory Group

Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I. Sébastien Bubeck Theory Group Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Part I Sébastien Bubeck Theory Group i.i.d. multi-armed bandit, Robbins [1952] i.i.d. multi-armed bandit, Robbins [1952] Known

More information

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and

More information

arxiv: v1 [cs.lg] 12 Sep 2017

arxiv: v1 [cs.lg] 12 Sep 2017 Adaptive Exploration-Exploitation Tradeoff for Opportunistic Bandits Huasen Wu, Xueying Guo,, Xin Liu University of California, Davis, CA, USA huasenwu@gmail.com guoxueying@outlook.com xinliu@ucdavis.edu

More information

arxiv: v1 [cs.lg] 7 Sep 2018

arxiv: v1 [cs.lg] 7 Sep 2018 Analysis of Thompson Sampling for Combinatorial Multi-armed Bandit with Probabilistically Triggered Arms Alihan Hüyük Bilkent University Cem Tekin Bilkent University arxiv:809.02707v [cs.lg] 7 Sep 208

More information

Dynamic Regret of Strongly Adaptive Methods

Dynamic Regret of Strongly Adaptive Methods Lijun Zhang 1 ianbao Yang 2 Rong Jin 3 Zhi-Hua Zhou 1 Abstract o cope with changing environments, recent developments in online learning have introduced the concepts of adaptive regret and dynamic regret

More information

Hybrid Machine Learning Algorithms

Hybrid Machine Learning Algorithms Hybrid Machine Learning Algorithms Umar Syed Princeton University Includes joint work with: Rob Schapire (Princeton) Nina Mishra, Alex Slivkins (Microsoft) Common Approaches to Machine Learning!! Supervised

More information

Online (and Distributed) Learning with Information Constraints. Ohad Shamir

Online (and Distributed) Learning with Information Constraints. Ohad Shamir Online (and Distributed) Learning with Information Constraints Ohad Shamir Weizmann Institute of Science Online Algorithms and Learning Workshop Leiden, November 2014 Ohad Shamir Learning with Information

More information

Reinforcement Learning

Reinforcement Learning Reinforcement Learning Lecture 5: Bandit optimisation Alexandre Proutiere, Sadegh Talebi, Jungseul Ok KTH, The Royal Institute of Technology Objectives of this lecture Introduce bandit optimisation: the

More information

Lecture 23: Online convex optimization Online convex optimization: generalization of several algorithms

Lecture 23: Online convex optimization Online convex optimization: generalization of several algorithms EECS 598-005: heoretical Foundations of Machine Learning Fall 2015 Lecture 23: Online convex optimization Lecturer: Jacob Abernethy Scribes: Vikas Dhiman Disclaimer: hese notes have not been subjected

More information

Online Convex Optimization. Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016

Online Convex Optimization. Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016 Online Convex Optimization Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016 The General Setting The General Setting (Cover) Given only the above, learning isn't always possible Some Natural

More information

Online Optimization : Competing with Dynamic Comparators

Online Optimization : Competing with Dynamic Comparators Ali Jadbabaie Alexander Rakhlin Shahin Shahrampour Karthik Sridharan University of Pennsylvania University of Pennsylvania University of Pennsylvania Cornell University Abstract Recent literature on online

More information

Stochastic Contextual Bandits with Known. Reward Functions

Stochastic Contextual Bandits with Known. Reward Functions Stochastic Contextual Bandits with nown 1 Reward Functions Pranav Sakulkar and Bhaskar rishnamachari Ming Hsieh Department of Electrical Engineering Viterbi School of Engineering University of Southern

More information

Online Optimization in Dynamic Environments: Improved Regret Rates for Strongly Convex Problems

Online Optimization in Dynamic Environments: Improved Regret Rates for Strongly Convex Problems 216 IEEE 55th Conference on Decision and Control (CDC) ARIA Resort & Casino December 12-14, 216, Las Vegas, USA Online Optimization in Dynamic Environments: Improved Regret Rates for Strongly Convex Problems

More information

MDP Preliminaries. Nan Jiang. February 10, 2019

MDP Preliminaries. Nan Jiang. February 10, 2019 MDP Preliminaries Nan Jiang February 10, 2019 1 Markov Decision Processes In reinforcement learning, the interactions between the agent and the environment are often described by a Markov Decision Process

More information

6350 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 65, NO. 24, DECEMBER 15, 2017

6350 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 65, NO. 24, DECEMBER 15, 2017 6350 IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 65, NO. 4, DECEMBER 15, 017 An Online Convex Optimization Approach to Proactive Network Resource Allocation Tianyi Chen, Student Member, IEEE, Qing Ling,

More information

Efficient learning by implicit exploration in bandit problems with side observations

Efficient learning by implicit exploration in bandit problems with side observations Efficient learning by implicit exploration in bandit problems with side observations Tomáš Kocák, Gergely Neu, Michal Valko, Rémi Munos SequeL team, INRIA Lille - Nord Europe, France SequeL INRIA Lille

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

Subgradient Methods for Maximum Margin Structured Learning

Subgradient Methods for Maximum Margin Structured Learning Nathan D. Ratliff ndr@ri.cmu.edu J. Andrew Bagnell dbagnell@ri.cmu.edu Robotics Institute, Carnegie Mellon University, Pittsburgh, PA. 15213 USA Martin A. Zinkevich maz@cs.ualberta.ca Department of Computing

More information

Extracting Certainty from Uncertainty: Regret Bounded by Variation in Costs

Extracting Certainty from Uncertainty: Regret Bounded by Variation in Costs Extracting Certainty from Uncertainty: Regret Bounded by Variation in Costs Elad Hazan IBM Almaden Research Center 650 Harry Rd San Jose, CA 95120 ehazan@cs.princeton.edu Satyen Kale Yahoo! Research 4301

More information

Stochastic bandits: Explore-First and UCB

Stochastic bandits: Explore-First and UCB CSE599s, Spring 2014, Online Learning Lecture 15-2/19/2014 Stochastic bandits: Explore-First and UCB Lecturer: Brendan McMahan or Ofer Dekel Scribe: Javad Hosseini In this lecture, we like to answer this

More information

Online Submodular Minimization

Online Submodular Minimization Online Submodular Minimization Elad Hazan IBM Almaden Research Center 650 Harry Rd, San Jose, CA 95120 hazan@us.ibm.com Satyen Kale Yahoo! Research 4301 Great America Parkway, Santa Clara, CA 95054 skale@yahoo-inc.com

More information

Generalized Mirror Descents with Non-Convex Potential Functions in Atomic Congestion Games

Generalized Mirror Descents with Non-Convex Potential Functions in Atomic Congestion Games Generalized Mirror Descents with Non-Convex Potential Functions in Atomic Congestion Games Po-An Chen Institute of Information Management National Chiao Tung University, Taiwan poanchen@nctu.edu.tw Abstract.

More information

Move from Perturbed scheme to exponential weighting average

Move from Perturbed scheme to exponential weighting average Move from Perturbed scheme to exponential weighting average Chunyang Xiao Abstract In an online decision problem, one makes decisions often with a pool of decisions sequence called experts but without

More information

CS261: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm

CS261: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm CS61: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm Tim Roughgarden February 9, 016 1 Online Algorithms This lecture begins the third module of the

More information

Applications of on-line prediction. in telecommunication problems

Applications of on-line prediction. in telecommunication problems Applications of on-line prediction in telecommunication problems Gábor Lugosi Pompeu Fabra University, Barcelona based on joint work with András György and Tamás Linder 1 Outline On-line prediction; Some

More information

Extracting Certainty from Uncertainty: Regret Bounded by Variation in Costs

Extracting Certainty from Uncertainty: Regret Bounded by Variation in Costs Extracting Certainty from Uncertainty: Regret Bounded by Variation in Costs Elad Hazan IBM Almaden 650 Harry Rd, San Jose, CA 95120 hazan@us.ibm.com Satyen Kale Microsoft Research 1 Microsoft Way, Redmond,

More information

Online Learning with Costly Features and Labels

Online Learning with Costly Features and Labels Online Learning with Costly Features and Labels Navid Zolghadr Department of Computing Science University of Alberta zolghadr@ualberta.ca Gábor Bartók Department of Computer Science TH Zürich bartok@inf.ethz.ch

More information

Online Convex Optimization

Online Convex Optimization Advanced Course in Machine Learning Spring 2010 Online Convex Optimization Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz A convex repeated game is a two players game that is performed

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Performance and Convergence of Multi-user Online Learning

Performance and Convergence of Multi-user Online Learning Performance and Convergence of Multi-user Online Learning Cem Tekin, Mingyan Liu Department of Electrical Engineering and Computer Science University of Michigan, Ann Arbor, Michigan, 4809-222 Email: {cmtkn,

More information

Sequential Multi-armed Bandits

Sequential Multi-armed Bandits Sequential Multi-armed Bandits Cem Tekin Mihaela van der Schaar University of California, Los Angeles Abstract In this paper we introduce a new class of online learning problems called sequential multi-armed

More information

arxiv: v3 [cs.lg] 30 Jun 2012

arxiv: v3 [cs.lg] 30 Jun 2012 arxiv:05874v3 [cslg] 30 Jun 0 Orly Avner Shie Mannor Department of Electrical Engineering, Technion Ohad Shamir Microsoft Research New England Abstract We consider a multi-armed bandit problem where the

More information

Discover Relevant Sources : A Multi-Armed Bandit Approach

Discover Relevant Sources : A Multi-Armed Bandit Approach Discover Relevant Sources : A Multi-Armed Bandit Approach 1 Onur Atan, Mihaela van der Schaar Department of Electrical Engineering, University of California Los Angeles Email: oatan@ucla.edu, mihaela@ee.ucla.edu

More information

Stochastic Optimization

Stochastic Optimization Introduction Related Work SGD Epoch-GD LM A DA NANJING UNIVERSITY Lijun Zhang Nanjing University, China May 26, 2017 Introduction Related Work SGD Epoch-GD Outline 1 Introduction 2 Related Work 3 Stochastic

More information

Online Learning of Non-stationary Sequences

Online Learning of Non-stationary Sequences Online Learning of Non-stationary Sequences Claire Monteleoni and Tommi Jaakkola MIT Computer Science and Artificial Intelligence Laboratory 200 Technology Square Cambridge, MA 02139 {cmontel,tommi}@ai.mit.edu

More information

The Algorithmic Foundations of Adaptive Data Analysis November, Lecture The Multiplicative Weights Algorithm

The Algorithmic Foundations of Adaptive Data Analysis November, Lecture The Multiplicative Weights Algorithm he Algorithmic Foundations of Adaptive Data Analysis November, 207 Lecture 5-6 Lecturer: Aaron Roth Scribe: Aaron Roth he Multiplicative Weights Algorithm In this lecture, we define and analyze a classic,

More information