Online Optimization with Gradual Variations


JMLR: Workshop and Conference Proceedings vol 23 (2012)

Chao-Kai Chiang^{1,2} (chaokai@iis.sinica.edu.tw), Tianbao Yang^3 (yangtia@msu.edu), Chia-Jung Lee^1 (leecj@iis.sinica.edu.tw), Mehrdad Mahdavi^3 (mahdavim@cse.msu.edu), Chi-Jen Lu^1 (cjlu@iis.sinica.edu.tw), Rong Jin^3 (rongjin@cse.msu.edu), Shenghuo Zhu^4 (zsh@sv.nec-labs.com)

1 Institute of Information Science, Academia Sinica, Taipei, Taiwan.
2 Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan.
3 Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, USA.
4 NEC Laboratories America, Cupertino, CA 95014, USA.

Abstract

We study the online convex optimization problem, in which an online algorithm has to make repeated decisions with convex loss functions and hopes to achieve a small regret. We consider a natural restriction of this problem in which the loss functions have a small deviation, measured by the sum of the distances between every two consecutive loss functions, according to some distance metrics. We show that for linear and general smooth convex loss functions, an online algorithm modified from the gradient descent algorithm can achieve a regret which only scales as the square root of the deviation. For the closely related problem of prediction with expert advice, we show that an online algorithm modified from the multiplicative update algorithm can achieve a similar regret bound for a different measure of deviation. Finally, for loss functions which are strictly convex, we show that an online algorithm modified from the online Newton step algorithm can achieve a regret which is only logarithmic in terms of the deviation, and as an application, we can also have such a logarithmic regret for the portfolio management problem.

Keywords: Online Learning, Regret, Convex Optimization, Deviation.

1. Introduction

We study the online convex optimization problem, in which a player has to make decisions iteratively for a number of rounds in the following way. In round $t$, the player has to choose a point $x_t$ from some convex feasible set $X \subseteq \mathbb{R}^N$, and after that the player receives a convex loss function $f_t$ and suffers the corresponding loss $f_t(x_t) \in [0,1]$. The player would like to have an online algorithm that can minimize its regret, which is the difference between the total loss it suffers and that of the best fixed point in hindsight. It is known that when playing for $T$ rounds, a regret of $O(\sqrt{TN})$ can be achieved using the gradient descent algorithm (Zinkevich, 2003).

When the loss functions are restricted to be linear, it becomes the well-known online linear optimization problem. Another related problem is the prediction with expert advice problem, in which the player in each round has to choose one of $N$ actions to play, possibly in a probabilistic way. This can be seen as a special case of the online linear optimization problem, with the feasible set being the set of probability distributions over the $N$ actions, and a regret of $O(\sqrt{T \ln N})$ can be achieved using the multiplicative update algorithm (Littlestone and Warmuth, 1994; Freund and Schapire, 1997). There have been many wonderful results and applications for these problems, and more information can be found in (Cesa-Bianchi and Lugosi, 2006). The regrets achieved for these problems are in fact optimal, since matching lower bounds are also known (see, e.g., (Cesa-Bianchi and Lugosi, 2006; Abernethy et al., 2008)). On the other hand, when the loss functions satisfy some nice property, a smaller regret becomes possible. Hazan et al. (2007) showed that a regret of $O(N \ln T)$ can be achieved for loss functions satisfying some strong convexity properties, which includes functions arising from the portfolio management problem (Cover, 1991).

Most previous works, including those discussed above, considered the most general setting in which the loss functions could be arbitrary and possibly chosen in an adversarial way. However, the environments around us may not always be adversarial, and the loss functions may have some patterns which can be exploited for achieving a smaller regret. One work along this direction is that of Hazan and Kale (2008). For the online linear optimization problem, in which each loss function is linear and can be seen as a vector, they considered the case in which the loss functions have a small variation, defined as $V = \sum_{t=1}^T \|f_t - \mu\|_2^2$, where $\mu = \sum_{t=1}^T f_t / T$ is the average of the loss functions and $\|\cdot\|_p$ denotes the $L_p$-norm. For this, they showed that a regret of $O(\sqrt{V})$ can be achieved, and they also have an analogous result for the prediction with expert advice problem. In another paper, Hazan and Kale (2009) considered the portfolio management problem, in which each loss function has the form $f_t(x) = -\ln\langle v_t, x\rangle$ with $v_t \in [\delta, 1]^N$ for some constant $\delta \in (0,1)$, where $\langle v_t, x\rangle$ denotes the inner product of the vectors $v_t$ and $x$, and they showed how to achieve a regret of $O(N \log Q)$, with $Q = \sum_{t=1}^T \|v_t - \mu\|_2^2$ and $\mu = \sum_{t=1}^T v_t / T$. Note that according to their definition, a small $V$ means that most of the loss functions center around some fixed loss function $\mu$, and similarly for the case of a small $Q$. This seems to model a stationary environment, in which all the loss functions are produced according to some fixed distribution.

The variation introduced in (Hazan and Kale, 2008) is defined in terms of the total difference between the individual linear cost vectors and their mean. In this paper we introduce a new measure for the loss functions, which we call the $L_p$-deviation, defined in terms of the difference between each loss function and its predecessor:

$D_p = \sum_{t=1}^T \max_{x \in X} \|\nabla f_t(x) - \nabla f_{t-1}(x)\|_p^2$,  (1)

where we use the convention that $f_0$ is the all-0 function.
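For linear losses the gradient is the loss vector itself, so (1) reduces to $\sum_{t=1}^T \|f_t - f_{t-1}\|_p^2$. As a quick illustration (not part of the paper), here is a minimal Python sketch of this quantity; the function name and the example sequence are ours.

```python
import numpy as np

def lp_deviation(loss_vectors, p=2):
    """L_p-deviation of a sequence of linear loss vectors:
    D_p = sum_t ||f_t - f_{t-1}||_p^2, with f_0 = 0 by convention."""
    prev = np.zeros_like(loss_vectors[0])
    total = 0.0
    for f in loss_vectors:
        total += np.linalg.norm(f - prev, ord=p) ** 2
        prev = f
    return total

# A slowly drifting sequence keeps the deviation small even over many rounds.
fs = [np.array([0.5 + 0.001 * t, 0.5 - 0.001 * t]) for t in range(1000)]
print(lp_deviation(fs, p=2))  # small, despite T = 1000
```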
The motivation for defining this gradual variation (i.e., the $L_p$-deviation) stems from two observations: one is practical, and the other is technical, arising from the limitation of extending the results in (Hazan and Kale, 2008) to general convex functions. From a practical point of view, we are interested in a more general scenario in which the environment may be evolving, but

in a somewhat gradual way. For example, the weather condition or the stock price at one moment may have some correlation with the next, and their difference is usually small, while abrupt changes occur only sporadically. The $L_p$-deviation naturally models these situations.

In order to understand the limitation of extending the results in (Hazan and Kale, 2008), let us apply them to general convex loss functions as follows. Since those results were developed for linear loss functions, a straightforward approach is to use the first-order approximation of a convex loss function, i.e., $f_t(x) \approx f_t(x_t) + \langle \nabla f_t(x_t), x - x_t\rangle$, and replace the linear loss vector with the gradient of the loss function $f_t$ at $x_t$. Using the convexity of $f_t$, we have

$\sum_{t=1}^T f_t(x_t) - \min_{\pi \in X} \sum_{t=1}^T f_t(\pi) \le \sum_{t=1}^T \langle \nabla f_t(x_t), x_t\rangle - \min_{\pi \in X} \sum_{t=1}^T \langle \nabla f_t(x_t), \pi\rangle$.  (2)

By assuming $\|\nabla f_t(x)\|_2 \le 1$ for all $t \in [T]$ and $x \in X$, we can apply Hazan and Kale's variation-based bound to bound the regret in (2) by the variation of the gradients:

$V = \sum_{t=1}^T \|\nabla f_t(x_t) - \mu\|_2^2$, where $\mu = \frac{1}{T}\sum_{\tau=1}^T \nabla f_\tau(x_\tau)$.  (3)

To better understand $V$ in (3), we can bound it as

$V = \sum_{t=1}^T \left\|\nabla f_t(x_t) - \frac{1}{T}\sum_{\tau=1}^T \nabla f_\tau(x_\tau)\right\|_2^2 \le \frac{1}{T}\sum_{t=1}^T \sum_{\tau=1}^T \|\nabla f_t(x_t) - \nabla f_\tau(x_\tau)\|_2^2 \le 2(V_1 + V_2)$,

where

$V_1 = \frac{1}{T}\sum_{t=1}^T \sum_{\tau=1}^T \|\nabla f_t(x_t) - \nabla f_t(x_\tau)\|_2^2$ and $V_2 = \frac{1}{T}\sum_{t=1}^T \sum_{\tau=1}^T \|\nabla f_t(x_\tau) - \nabla f_\tau(x_\tau)\|_2^2$,

using the convexity of $\|\cdot\|_2^2$ for the first inequality and $\|a + b\|_2^2 \le 2\|a\|_2^2 + 2\|b\|_2^2$ for the second. We see that the variation $V$ is bounded by two parts: $V_1$ essentially measures the smoothness of the individual loss functions, while $V_2$ measures the variation in the gradients across the loss functions. As a result, even when all the loss functions are identical, $V_2$ vanishes while $V_1$ may still be large, and therefore the regret of the algorithm in (Hazan and Kale, 2008) for online convex optimization may still be bounded by $O(\sqrt{T})$, regardless of the smoothness of the cost functions.

To address the above challenges, the bounds in this paper are developed in terms of the $L_p$-deviation. We note that for linear functions, the $L_p$-deviation becomes $D_p = \sum_{t=1}^T \|f_t - f_{t-1}\|_p^2$. It can be shown that $D_2 \le O(V)$, while there are loss functions with $D_2 \le O(1)$ and $V \ge \Omega(T)$. Thus, one can argue that our constraint of a small deviation is strictly easier to satisfy than that of a small variation in (Hazan and Kale, 2008). For the portfolio management problem, a natural measure of deviation is $\sum_{t=1}^T \|v_t - v_{t-1}\|_2^2$, and one can show that $D_2 \le O(N)\sum_{t=1}^T \|v_t - v_{t-1}\|_2^2 \le O(NQ)$, so one can again argue that our constraint is easier to satisfy than that of (Hazan and Kale, 2009).

In this paper, we consider loss functions with such deviation constraints and obtain the following results. First, for the online linear optimization problem, we provide an algorithm which, when given loss functions with $L_2$-deviation $D_2$, can achieve a regret of $O(\sqrt{D_2})$. This is in fact optimal, as a matching lower bound can be shown. Since $D_2 \le O(TN)$, we immediately recover the result of Zinkevich (2003).

Furthermore, as discussed before, since one can upper-bound $D_2$ in terms of $V$ but not vice versa, our result is arguably stronger than that of Hazan and Kale (2008); interestingly, our analysis even looks simpler than theirs. A similar bound was given by Rakhlin et al. (2011) in a game-theoretic setting, but they did not discuss any algorithm. Next, for the prediction with expert advice problem, we provide an algorithm such that, when given loss functions with $L_\infty$-deviation $D_\infty$, it achieves a regret of $O(\sqrt{D_\infty \ln N})$, which is also optimal with a matching lower bound. Note that since $D_\infty \le O(T)$, we also recover the $O(\sqrt{T \ln N})$ regret bound of Freund and Schapire (1997), but our result seems incomparable to that of Hazan and Kale (2008). We then establish a variation bound for general convex loss functions, aiming to take one step further along the direction of (Hazan and Kale, 2008). Our result shows that for general smooth convex functions, the proposed algorithm attains an $O(\sqrt{D_2})$ bound; we also show that the smoothness assumption is unavoidable for general convex loss functions. Finally, we provide an algorithm for the online convex optimization problem studied by Hazan et al. (2007), in which the loss functions are strictly convex. Our algorithm achieves a regret of $O(N \ln T)$, which matches that of an algorithm in (Hazan et al., 2007), and when the loss functions have $L_2$-deviation $D_2$, for a large enough $D_2$, and satisfy some smoothness condition, our algorithm achieves a regret of $O(N \ln D_2)$. This can be applied to the portfolio management problem considered by Hazan and Kale (2009), as the corresponding loss functions in fact satisfy our smoothness condition, and we can achieve a regret of $O(N \ln D)$ when $\sum_{t=1}^T \|v_t - v_{t-1}\|_2^2 \le D$. As discussed before, one can again argue that our result is stronger than that of Hazan and Kale (2009).

All of our algorithms are based on the following idea, which we illustrate using the online linear optimization problem as an example. For general linear functions, the gradient descent algorithm is known to achieve an optimal regret; after round $t$ it computes the point $x_{t+1} = \Pi_X(x_t - \eta f_t)$, the projection of $x_t - \eta f_t$ onto the feasible set $X$, and plays it in the next round. Now, if the loss functions have a small deviation, $f_t$ may be close to $f_{t-1}$, so in round $t$ it may be a good idea to play a point which moves one step further in the direction of $-f_{t-1}$, as that may make its inner product with $f_t$ (which is its loss with respect to $f_t$) smaller. In fact, it can be shown that if one could play the point $x_{t+1} = \Pi_X(x_t - \eta f_t)$ in round $t$, a very small regret could be achieved, but in reality one does not have $f_t$ available before round $t$ to compute $x_{t+1}$. On the other hand, if $f_{t-1}$ is a good estimate of $f_t$, the point $\hat{x}_t = \Pi_X(x_t - \eta f_{t-1})$ should be a good estimate of $x_{t+1}$ too. The point $\hat{x}_t$ can actually be computed before round $t$, since $f_{t-1}$ is available, so our algorithm plays $\hat{x}_t$ in round $t$. Our algorithms for the prediction with expert advice problem and the online convex optimization problem use the same idea. We unify all our algorithms by a meta algorithm, which can be seen as a type of mirror descent algorithm (Nemirovski and Yudin, 1978; Beck and Teboulle, 2003), using the notion of Bregman divergence with respect to some function $R$. We then derive different algorithms for different settings simply by instantiating the meta algorithm with different choices of the function $R$.
For the linear and general smooth online convex optimization problems, the prediction with expert advice problem, and the online strictly convex optimization problem, respectively, the algorithms we derive can be seen as modifications of the gradient descent algorithm of (Zinkevich, 2003), the multiplicative update algorithm of (Littlestone and Warmuth, 1994; Freund and Schapire, 1997), and the online Newton step algorithm of (Hazan et al., 2007), with the modification based on the idea, discussed above, of moving one step further in the direction of the previous gradient.

2. Preliminaries

For a positive integer $n$, let $[n]$ denote the set $\{1, 2, \ldots, n\}$. Let $\mathbb{R}$ denote the set of real numbers, $\mathbb{R}^N$ the set of $N$-dimensional vectors over $\mathbb{R}$, and $\mathbb{R}^{N \times N}$ the set of $N \times N$ matrices over $\mathbb{R}$. We view a vector $x$ as a column vector and denote its transpose, a row vector, by $x^\top$. For a vector $x \in \mathbb{R}^N$ and an index $i \in [N]$, let $x(i)$ denote the $i$-th component of $x$. For $x, y \in \mathbb{R}^N$, let $\langle x, y\rangle = \sum_{i=1}^N x(i)y(i)$ and let $\mathrm{RE}(x\|y) = \sum_{i=1}^N x(i)\ln\frac{x(i)}{y(i)}$. All the matrices considered in this paper are symmetric, and we will assume this without stating it later. For two matrices $A$ and $B$, we write $A \succeq B$ if $A - B$ is a positive semidefinite (PSD) matrix. For $x \in \mathbb{R}^N$, let $\|x\|_p$ denote the $L_p$-norm of $x$, and for a PSD matrix $H \in \mathbb{R}^{N \times N}$, define the norm $\|x\|_H = \sqrt{x^\top H x}$. Note that if $H$ is the identity matrix, then $\|x\|_H = \|x\|_2$. We will need the following simple fact, which is proved in Appendix A.

Proposition 1 For any $y, z \in \mathbb{R}^N$ and any PSD $H \in \mathbb{R}^{N \times N}$, $\|y + z\|_H^2 \le 2\|y\|_H^2 + 2\|z\|_H^2$.

We will need the notion of Bregman divergence and the projection according to it.

Definition 2 Let $R : \mathbb{R}^N \to \mathbb{R}$ be a differentiable function and $X \subseteq \mathbb{R}^N$ a convex set. Define the Bregman divergence of $x, y \in \mathbb{R}^N$ with respect to $R$ by $B_R(x, y) = R(x) - R(y) - \langle \nabla R(y), x - y\rangle$. Define the projection of $y \in \mathbb{R}^N$ onto $X$ according to $B_R$ by $\Pi_{X,R}(y) = \arg\min_{x \in X} B_R(x, y)$.

We consider the online convex optimization problem, in which an online algorithm must play for $T$ rounds in the following way. In each round $t \in [T]$, it plays a point $x_t \in X$, for some convex feasible set $X \subseteq \mathbb{R}^N$, and after that it receives a loss function $f_t : X \to \mathbb{R}$ and suffers a loss of $f_t(x_t)$. The goal is to minimize its regret, defined as

$\sum_{t=1}^T f_t(x_t) - \min_{\pi \in X} \sum_{t=1}^T f_t(\pi)$,

which is the difference between its total loss and that of the best offline algorithm playing a single point $\pi \in X$ for all $T$ rounds. We study four special cases of this problem. The first is the online linear optimization problem, in which each loss function $f_t$ is linear. The second is the prediction with expert advice problem, which can be seen as a special case of the online linear optimization problem with the set of probability distributions over $N$ actions as the feasible set $X$. The third is the case in which the loss functions are smooth, which is a generalization of the linear optimization setting. Finally, we consider the case in which the loss functions are strictly convex in the following sense.

Definition 3 For $\beta > 0$, we say that a function $f : X \to \mathbb{R}$ is $\beta$-convex if, for all $x, y \in X$, $f(x) \ge f(y) + \langle \nabla f(y), x - y\rangle + \beta\langle \nabla f(y), x - y\rangle^2$.

As shown in (Hazan et al., 2007), all the convex functions considered there are in fact $\beta$-convex, and thus our result also applies to those convex functions. For simplicity of presentation, we will assume throughout the paper that the feasible set $X$ is a closed convex set contained in the unit ball $\{x \in \mathbb{R}^N : \|x\|_2 \le 1\}$; the extension to the general case is straightforward.
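To make Definition 2 concrete, here is a minimal Python sketch (the helper names are ours, not the paper's) evaluating a Bregman divergence from a function and its gradient, together with the two instances used later: the squared Euclidean norm, whose divergence is $\frac{1}{2}\|x - y\|_2^2$, and the entropic function, whose divergence between probability distributions is $\mathrm{RE}(x\|y)$.

```python
import numpy as np

def bregman(R, grad_R, x, y):
    """B_R(x, y) = R(x) - R(y) - <grad R(y), x - y>  (Definition 2)."""
    return R(x) - R(y) - float(grad_R(y) @ (x - y))

# R(x) = ||x||_2^2 / 2 gives B_R(x, y) = ||x - y||_2^2 / 2.
sq = lambda x: 0.5 * float(x @ x)
sq_grad = lambda x: x

# R(x) = sum_i x(i)(ln x(i) - 1) gives B_R(x, y) = RE(x||y) for distributions.
ent = lambda x: float(np.sum(x * (np.log(x) - 1)))
ent_grad = lambda x: np.log(x)

x, y = np.array([0.3, 0.7]), np.array([0.5, 0.5])
print(bregman(sq, sq_grad, x, y))    # 0.04 = ||x - y||_2^2 / 2
print(bregman(ent, ent_grad, x, y))  # equals RE(x||y) since both sum to 1
```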

Algorithm 1 Meta algorithm
1: Initially, let $x_1 = \hat{x}_1 = (1/N, \ldots, 1/N)$.
2: In round $t \in [T]$:
   (a) Play $\hat{x}_t$.
   (b) Receive $f_t$ and compute $\ell_t = \nabla f_t(\hat{x}_t)$.
   (c) Update
       $x_{t+1} = \arg\min_{x \in X}\left\{\langle \ell_t, x\rangle + B_{R_t}(x, x_t)\right\}$,
       $\hat{x}_{t+1} = \arg\min_{\hat{x} \in X}\left\{\langle \ell_t, \hat{x}\rangle + B_{R_{t+1}}(\hat{x}, x_{t+1})\right\}$.

3. Meta Algorithm

All of our algorithms in the coming sections are based on the Meta algorithm, given in Algorithm 1, which has the parameters $R_t$ for $t \in [T]$. For different types of problems we will have different choices of $R_t$, which will be specified in the respective sections. Here we allow $R_t$ to depend on $t$, although we do not need this freedom for linear functions and general convex functions; we only need it for strictly convex functions. Note that we define $x_{t+1}$ using $R_t$ instead of $R_{t+1}$ for a technical reason which will become clear in the proof of Lemma 6. Our Meta algorithm is related to the mirror descent algorithm, as it can be shown to have the following equivalent form, which is proved in Appendix B.1.

Lemma 4 Suppose $y_{t+1}$ and $\hat{y}_{t+1}$ satisfy the conditions $\nabla R_t(y_{t+1}) = \nabla R_t(x_t) - \ell_t$ and $\nabla R_{t+1}(\hat{y}_{t+1}) = \nabla R_{t+1}(x_{t+1}) - \ell_t$, respectively, for a strictly convex $R_t$. Then the update in Step (c) of the Meta algorithm is identical to

$x_{t+1} = \Pi_{X,R_t}(y_{t+1}) = \arg\min_{x \in X} B_{R_t}(x, y_{t+1})$, $\hat{x}_{t+1} = \Pi_{X,R_{t+1}}(\hat{y}_{t+1}) = \arg\min_{\hat{x} \in X} B_{R_{t+1}}(\hat{x}, \hat{y}_{t+1})$.

Note that a typical mirror descent algorithm plays in round $t$ a point roughly corresponding to our $x_t$, while we move one step further in the direction given by $\ell_{t-1}$ and instead play $\hat{x}_t = \arg\min_{\hat{x} \in X}\{\langle \ell_{t-1}, \hat{x}\rangle + B_{R_t}(\hat{x}, x_t)\}$. The intuition behind our algorithm is the following. It can be shown that if one could play $x_{t+1} = \arg\min_{x \in X}\{\langle \ell_t, x\rangle + B_{R_t}(x, x_t)\}$ in round $t$, then a small regret could be achieved, but in reality one does not have $f_t$ available to compute $x_{t+1}$ before round $t$. Nevertheless, if the loss vectors have a small deviation, $\ell_t$ is likely to be close to $\ell_{t-1}$, and so is $\hat{x}_t$ to $x_{t+1}$, which is made possible by defining $x_{t+1}$ and $\hat{x}_t$ both using $R_t$. Based on this idea, we let our algorithm play $\hat{x}_t$ in round $t$.
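Before analyzing the regret, here is a minimal Python sketch of Algorithm 1 (ours, not the paper's). The two callables `get_gradient` and `bregman_step` are assumed oracles standing in for the revealed loss gradients and the argmin in Step (c); the later sections instantiate the latter in closed form.

```python
import numpy as np

def meta_algorithm(T, N, get_gradient, bregman_step):
    """Sketch of Algorithm 1 (Meta algorithm), assuming two oracles:

    get_gradient(t, x)     -> l_t = grad f_t(x), revealed after playing in round t
    bregman_step(t, l, x0) -> argmin_{x in X} { <l, x> + B_{R_t}(x, x0) }
    """
    x = np.full(N, 1.0 / N)   # x_1 = (1/N, ..., 1/N)
    x_hat = x.copy()          # x-hat_1
    plays = []
    for t in range(1, T + 1):
        plays.append(x_hat)                      # (a) play x-hat_t
        l = get_gradient(t, x_hat)               # (b) l_t = grad f_t(x-hat_t)
        x_next = bregman_step(t, l, x)           # (c) x_{t+1} uses R_t ...
        x_hat = bregman_step(t + 1, l, x_next)   # ... x-hat_{t+1} uses R_{t+1}
        x = x_next
    return plays
```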

Now let us see how to bound the regret of the algorithm. Consider any $\pi \in X$ played by the offline algorithm. For a $\beta$-convex function $f_t$, we know from the definition that

$f_t(\hat{x}_t) - f_t(\pi) \le \langle \ell_t, \hat{x}_t - \pi\rangle - \beta\|\hat{x}_t - \pi\|_{h_t}^2$, where $h_t = \ell_t\ell_t^\top$,  (4)

while for a linear or a general convex $f_t$, the above still holds with $\beta = 0$. Thus, the key is to bound $\langle \ell_t, \hat{x}_t - \pi\rangle$, which is done by the following lemma, proved in Appendix B.2.

Lemma 5 Let $S_t = \langle \ell_t - \ell_{t-1}, \hat{x}_t - x_{t+1}\rangle$, $A_t = B_{R_t}(\pi, x_t) - B_{R_t}(\pi, x_{t+1})$, and $B_t = B_{R_t}(x_{t+1}, \hat{x}_t) + B_{R_t}(\hat{x}_t, x_t)$. Then $\langle \ell_t, \hat{x}_t - \pi\rangle \le S_t + A_t - B_t$.

The following lemma provides an upper bound for $S_t$.

Lemma 6 Suppose $\|\cdot\|$ is a norm, with dual norm $\|\cdot\|_*$, such that $\frac{1}{2}\|x - x'\|^2 \le B_{R_t}(x, x')$ for any $x, x' \in X$. Then $S_t = \langle \ell_t - \ell_{t-1}, \hat{x}_t - x_{t+1}\rangle \le \|\ell_t - \ell_{t-1}\|_*^2$.

Proof By a generalized Cauchy-Schwarz inequality, $S_t = \langle \ell_t - \ell_{t-1}, \hat{x}_t - x_{t+1}\rangle \le \|\ell_t - \ell_{t-1}\|_* \|\hat{x}_t - x_{t+1}\|$. Then we need the following, which is proved in Appendix B.3.

Proposition 7 $\|\hat{x}_t - x_{t+1}\| \le \|\nabla R_t(\hat{y}_t) - \nabla R_t(y_{t+1})\|_*$.

From this proposition, we have

$\|\hat{x}_t - x_{t+1}\| \le \|(\nabla R_t(x_t) - \ell_{t-1}) - (\nabla R_t(x_t) - \ell_t)\|_* = \|\ell_t - \ell_{t-1}\|_*$.  (5)

This is why we define $x_{t+1}$ and $y_{t+1}$ using $R_t$ instead of $R_{t+1}$. Finally, by combining these bounds together, we have the lemma.

Taking the sum over $t$ of the bounds in Lemma 5 and Lemma 6, we obtain a general regret bound for the Meta algorithm. In the following sections, we make different choices of $R_t$ and of the norms for different types of loss functions, and derive the corresponding regret bounds.

4. Linear Loss Functions

In this section, we consider the case in which each loss function $f_t$ is linear, so that it can be seen as an $N$-dimensional vector in $\mathbb{R}^N$ with $f_t(x) = \langle f_t, x\rangle$ and $\nabla f_t(x) = f_t$. We measure the deviation of the loss functions by their $L_p$-deviation, defined in (1), which becomes $\sum_{t=1}^T \|f_t - f_{t-1}\|_p^2$ for linear functions. To bound the regret suffered in each round, we can use the bound in (4) with $\beta = 0$, and we drop the term $B_t$ from the bound in Lemma 5. By summing the bound over $t$, we have

$\sum_{t=1}^T f_t(\hat{x}_t) - \sum_{t=1}^T f_t(\pi) \le \sum_{t=1}^T S_t + \sum_{t=1}^T A_t$,  (6)

where $S_t = \langle f_t - f_{t-1}, \hat{x}_t - x_{t+1}\rangle$ and $A_t = B_{R_t}(\pi, x_t) - B_{R_t}(\pi, x_{t+1})$. In the following two subsections, we consider the online linear optimization problem and the prediction with expert advice problem, respectively, for which we will make different choices of $R_t$ and use different measures of deviation.

4.1. Online Linear Optimization Problem

In this subsection, we consider the online linear optimization problem with loss functions of $L_2$-deviation $D_2$. To instantiate the Meta algorithm for such loss functions, we choose

$R_t(x) = \frac{1}{2\eta}\|x\|_2^2$, for every $t \in [T]$,

where $\eta$ is the learning rate, to be determined later; in fact, it can also be adjusted inside the algorithm using the standard doubling trick, by keeping track of the deviation accumulated so far. It is easy to show that with this choice of $R_t$,

$\nabla R_t(x) = \frac{x}{\eta}$, $B_{R_t}(x, y) = \frac{1}{2\eta}\|x - y\|_2^2$, and $\Pi_{X,R_t}(y) = \arg\min_{x \in X}\|x - y\|_2$.

Then, according to Lemma 4, the update in Step (c) of the Meta algorithm becomes:

$x_{t+1} = \arg\min_{x \in X}\|x - y_{t+1}\|_2$, with $y_{t+1} = x_t - \eta f_t$,
$\hat{x}_{t+1} = \arg\min_{\hat{x} \in X}\|\hat{x} - \hat{y}_{t+1}\|_2$, with $\hat{y}_{t+1} = x_{t+1} - \eta f_t$.

The regret achieved by our algorithm is guaranteed by the following.

Theorem 8 When the $L_2$-deviation of the loss functions is $D_2$, the regret of our algorithm is at most $O(\sqrt{D_2})$.

Proof We start by bounding the first sum in (6). Note that we can apply Lemma 6 with the norm $\|x\| = \frac{1}{\sqrt{\eta}}\|x\|_2$, since $\frac{1}{2}\|x - x'\|^2 = \frac{1}{2\eta}\|x - x'\|_2^2 = B_{R_t}(x, x')$ for any $x, x' \in X$. As the dual norm is $\|x\|_* = \sqrt{\eta}\|x\|_2$, Lemma 6 gives us

$\sum_{t=1}^T S_t \le \sum_{t=1}^T \eta\|f_t - f_{t-1}\|_2^2 \le \eta D_2$.

Next, note that $A_t = \frac{1}{2\eta}\|\pi - x_t\|_2^2 - \frac{1}{2\eta}\|\pi - x_{t+1}\|_2^2$, so the second sum in (6) is

$\sum_{t=1}^T A_t = \frac{1}{2\eta}\left(\|\pi - x_1\|_2^2 - \|\pi - x_{T+1}\|_2^2\right) \le \frac{2}{\eta}$,

by telescoping and then using the facts that $\|\pi - x_1\|_2^2 \le 4$ and $\|\pi - x_{T+1}\|_2^2 \ge 0$. Finally, by substituting these two bounds into (6), we have

$\sum_{t=1}^T (f_t(\hat{x}_t) - f_t(\pi)) \le \eta D_2 + \frac{2}{\eta} \le O(\sqrt{D_2})$,

by choosing $\eta = \sqrt{2/D_2}$, which proves the theorem.
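As a concrete illustration of the updates above, the following is a minimal sketch (ours) specialized to the unit-ball feasible set, where the projection has a closed form; $\eta$ would be set as in Theorem 8, or adjusted by the doubling trick when $D_2$ is unknown.

```python
import numpy as np

def project_unit_ball(y):
    """Euclidean projection onto X = {x : ||x||_2 <= 1}."""
    n = np.linalg.norm(y)
    return y if n <= 1.0 else y / n

def linear_updates(loss_vectors, eta):
    """Minimal sketch of the Section 4.1 updates for f_t(x) = <f_t, x>."""
    N = len(loss_vectors[0])
    x = np.full(N, 1.0 / N)     # x_1
    x_hat = x.copy()            # x-hat_1
    total_loss = 0.0
    for f in loss_vectors:
        total_loss += float(f @ x_hat)          # play x-hat_t, then observe f_t
        x = project_unit_ball(x - eta * f)      # x_{t+1} = Pi_X(x_t - eta f_t)
        x_hat = project_unit_ball(x - eta * f)  # x-hat_{t+1} = Pi_X(x_{t+1} - eta f_t)
    return total_loss
```

Note the one-line difference from plain gradient descent: the extra projected step computing $\hat{x}_{t+1}$ reuses the most recent loss vector $f_t$.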

Let us make three remarks about Theorem 8. First, as mentioned in the introduction, one can argue that our result is strictly stronger than that of (Hazan and Kale, 2008), as our deviation bound is easier to satisfy. This is because, by Proposition 1, $\|f_t - f_{t-1}\|_2^2 \le 2(\|f_t - \mu\|_2^2 + \|\mu - f_{t-1}\|_2^2)$ and thus $D_2 \le 4V + O(1)$, while, for example, with $N = 1$, $f_t = 0$ for $t \le T/2$ and $f_t = 1$ for $T/2 < t \le T$, we have $D_2 \le O(1)$ and $V \ge \Omega(T)$. Next, we claim that the regret achieved by our algorithm is optimal. This is because a matching lower bound can be shown by simply setting the loss functions of all but the first $r = D_2$ rounds to be the all-0 function, and then applying the known $\Omega(\sqrt{r})$ regret lower bound to the first $r$ rounds. Finally, our algorithm can be seen as a modification of the gradient descent (GD) algorithm of (Zinkevich, 2003), which plays $x_t$, instead of our $\hat{x}_t$, in round $t$. One may then wonder whether GD already performs as well as our algorithm does. The following lemma, proved in Appendix C.1, provides a negative answer, which means that our modification is in fact necessary.

Lemma 9 The regret of the GD algorithm is at least $\Omega(\min\{D_2, \sqrt{T}\})$.

4.2. Prediction with Expert Advice

In this subsection, we consider the prediction with expert advice problem. Now the feasible set $X$ is the set of probability distributions over $N$ actions, which can be represented as $N$-dimensional vectors. Although this problem can be seen as a special case of that in Subsection 4.1, and Theorem 8 also applies here, we would like to obtain a stronger result. More precisely, we now consider the $L_\infty$-deviation instead of the $L_2$-deviation, and we assume that the loss functions have $L_\infty$-deviation $D_\infty$. Note that with $D_2 \le D_\infty N$, Theorem 8 only gives a regret bound of $O(\sqrt{D_\infty N})$. To obtain a smaller regret, we instantiate the Meta algorithm with

$R_t(x) = \frac{1}{\eta}\sum_{i=1}^N x(i)(\ln x(i) - 1)$, for every $t \in [T]$,

where $\eta$ is the learning rate to be determined later, and recall that $x(i)$ denotes the $i$-th component of the vector $x$. It is easy to show that with this choice,

$\nabla R_t(x) = \frac{1}{\eta}(\ln x(1), \ldots, \ln x(N))$, $B_{R_t}(x, y) = \frac{1}{\eta}\mathrm{RE}(x\|y)$, and $\Pi_{X,R_t}(y) = y/Z$,

with the normalization factor $Z = \sum_{j=1}^N y(j)$. Then, according to Lemma 4, the update in Step (c) of the Meta algorithm becomes:

$x_{t+1}(i) = x_t(i)e^{-\eta f_t(i)}/Z_{t+1}$, for each $i \in [N]$, with $Z_{t+1} = \sum_{j=1}^N x_t(j)e^{-\eta f_t(j)}$,
$\hat{x}_{t+1}(i) = x_{t+1}(i)e^{-\eta f_t(i)}/\hat{Z}_{t+1}$, for each $i \in [N]$, with $\hat{Z}_{t+1} = \sum_{j=1}^N x_{t+1}(j)e^{-\eta f_t(j)}$.

Note that our algorithm can be seen as a modification of the multiplicative update algorithm (Littlestone and Warmuth, 1994; Freund and Schapire, 1997), which plays $x_t$, instead of our $\hat{x}_t$, in round $t$. The regret achieved by our algorithm is guaranteed by the following theorem, which we prove in Appendix C.2.

Theorem 10 When the $L_\infty$-deviation of the loss functions is $D_\infty$, the regret of our algorithm is at most $O(\sqrt{D_\infty \ln N})$.

We remark that the regret achieved by our algorithm is also optimal. This is because a matching lower bound can be shown by simply setting the loss functions of all but the first $r = D_\infty$ rounds to be the all-0 function, and then applying the known $\Omega(\sqrt{r \ln N})$ regret lower bound to the first $r$ rounds.
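The following minimal sketch (ours) implements the two-step multiplicative update above; by Theorem 10, $\eta$ would be set to $\sqrt{(\ln N)/D_\infty}$, or adjusted by a doubling trick when $D_\infty$ is unknown.

```python
import numpy as np

def expert_advice_updates(loss_vectors, eta):
    """Minimal sketch of the Section 4.2 updates; each entry of
    loss_vectors is a loss vector f_t in [0, 1]^N."""
    N = len(loss_vectors[0])
    x = np.full(N, 1.0 / N)    # x_1, the uniform distribution
    x_hat = x.copy()           # x-hat_1
    expected_loss = 0.0
    for f in loss_vectors:
        expected_loss += float(f @ x_hat)  # play x-hat_t; only then is f_t revealed
        x = x * np.exp(-eta * f)
        x /= x.sum()                       # x_{t+1}(i) = x_t(i) e^{-eta f_t(i)} / Z_{t+1}
        x_hat = x * np.exp(-eta * f)       # second step reuses the same f_t
        x_hat /= x_hat.sum()               # x-hat_{t+1}(i) = x_{t+1}(i) e^{-eta f_t(i)} / Z-hat_{t+1}
    return expected_loss
```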

5. General Convex Loss Functions

In this section, we consider general convex loss functions. We measure the deviation of the loss functions by their $L_2$-deviation defined in (1), which is $D_2 = \sum_{t=1}^T \max_{x \in X}\|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2^2$. Our algorithm for such loss functions is the same as the algorithm for linear functions. To bound its regret, we now need the help of the term $B_t$ in Lemma 5, and we have

$\sum_{t=1}^T (f_t(\hat{x}_t) - f_t(\pi)) \le \sum_{t=1}^T S_t + \sum_{t=1}^T A_t - \sum_{t=1}^T B_t$.  (7)

From the proof of Theorem 8, we know that $\sum_{t=1}^T A_t \le \frac{2}{\eta}$ and $\sum_{t=1}^T S_t \le \sum_{t=1}^T \eta\|\ell_t - \ell_{t-1}\|_2^2$, which, unlike in Theorem 8, cannot be immediately bounded by the $L_2$-deviation. This is because $\ell_t - \ell_{t-1} = \nabla f_t(\hat{x}_t) - \nabla f_{t-1}(\hat{x}_{t-1})$, where the two gradients are taken at different points. To handle this issue, we further assume that each loss function $f_t$ satisfies the following $\lambda$-smoothness condition:

$\|\nabla f_t(x) - \nabla f_t(y)\|_2 \le \lambda\|x - y\|_2$, for any $x, y \in X$.  (8)

We emphasize that our assumption about the smoothness of the loss functions is necessary to achieve the desired variation bound. To see this, consider the special case of $f_1(x) = \cdots = f_T(x) = f(x)$, for which $D_2 = O(1)$. If the variation bound $O(\sqrt{D_2})$ held for any sequence of convex functions, then in this special case, where all loss functions are identical, we would have $\sum_{t=1}^T f(\hat{x}_t) \le \min_{\pi \in X}\sum_{t=1}^T f(\pi) + O(1)$, implying that $(1/T)\sum_{t=1}^T \hat{x}_t$ approaches the optimal solution at the rate $O(1/T)$. This contradicts the $\Omega(1/\sqrt{T})$ lower bound on the convergence rate of any first-order optimization method for general convex functions (Nesterov, 2004, Theorem 3.2.1), and therefore the smoothness assumption is necessary to extend our results to general convex loss functions.

Our main result of this section is the following theorem, which establishes the variation bound achieved by the Meta algorithm for general smooth convex loss functions.

Theorem 11 When the loss functions have $L_2$-deviation $D_2$ and each loss function is $\lambda$-smooth, with $\lambda \le \sqrt{D_2/8}$, the regret of our algorithm is at most $O(\sqrt{D_2})$.

The proof of Theorem 11 follows immediately from the next two lemmas. First, we need the following to bound $\sum_{t=1}^T S_t$ in terms of $D_2$.

Lemma 12 $\sum_{t=1}^T \|\ell_t - \ell_{t-1}\|_2^2 \le 2D_2 + 2\lambda^2\sum_{t=2}^T \|\hat{x}_t - \hat{x}_{t-1}\|_2^2$.

Proof Note that $\|\ell_t - \ell_{t-1}\|_2^2 = \|\nabla f_t(\hat{x}_t) - \nabla f_{t-1}(\hat{x}_{t-1})\|_2^2$, which by Proposition 1 is at most

$2\|\nabla f_t(\hat{x}_t) - \nabla f_{t-1}(\hat{x}_t)\|_2^2 + 2\|\nabla f_{t-1}(\hat{x}_t) - \nabla f_{t-1}(\hat{x}_{t-1})\|_2^2$,

where the second term above is at most $2\lambda^2\|\hat{x}_t - \hat{x}_{t-1}\|_2^2$ by the $\lambda$-smoothness condition. By summing the bound over $t$, we have the lemma.

To eliminate the undesirable term $2\lambda^2\sum_{t=2}^T\|\hat{x}_t - \hat{x}_{t-1}\|_2^2$ in the lemma, we use the help of the sum $\sum_{t=1}^T B_t$, which has the following lower bound.

Lemma 13 $\sum_{t=1}^T B_t \ge \frac{1}{4\eta}\sum_{t=2}^T \|\hat{x}_t - \hat{x}_{t-1}\|_2^2$.

Proof Recall that $B_t = \frac{1}{2\eta}\|x_{t+1} - \hat{x}_t\|_2^2 + \frac{1}{2\eta}\|\hat{x}_t - x_t\|_2^2$, so we can write $\sum_{t=1}^T B_t$ as

$\frac{1}{2\eta}\|x_{T+1} - \hat{x}_T\|_2^2 + \frac{1}{2\eta}\|\hat{x}_1 - x_1\|_2^2 + \sum_{t=2}^T \frac{1}{2\eta}\left(\|x_t - \hat{x}_{t-1}\|_2^2 + \|\hat{x}_t - x_t\|_2^2\right) \ge \frac{1}{4\eta}\sum_{t=2}^T \|\hat{x}_t - \hat{x}_{t-1}\|_2^2$,

by Proposition 1, with $H$ being the identity matrix so that $\|x\|_H = \|x\|_2$.

According to the bounds obtained so far, the regret of our algorithm is at most

$2\eta D_2 + 2\eta\lambda^2\sum_{t=2}^T \|\hat{x}_t - \hat{x}_{t-1}\|_2^2 - \frac{1}{4\eta}\sum_{t=2}^T \|\hat{x}_t - \hat{x}_{t-1}\|_2^2 + \frac{2}{\eta} \le O\left(\eta D_2 + \frac{1}{\eta}\right) \le O(\sqrt{D_2})$,

when $\lambda \le 1/(\sqrt{8}\eta)$ and $\eta = 1/\sqrt{D_2}$, which together amount to the condition $\lambda \le \sqrt{D_2/8}$ of Theorem 11.

6. Strictly Convex Loss Functions

In this section, we consider convex loss functions which are strictly convex. More precisely, suppose that for some $\beta > 0$, each loss function is $\beta$-convex, so that

$f_t(\hat{x}_t) - f_t(\pi) \le \langle \ell_t, \hat{x}_t - \pi\rangle - \beta\|\pi - \hat{x}_t\|_{h_t}^2$, where $h_t = \ell_t\ell_t^\top$.  (9)

Again, we measure the deviation of the loss functions by their $L_2$-deviation, defined in (1). To instantiate the Meta algorithm for such loss functions, we choose

$R_t(x) = \frac{1}{2}\|x\|_{H_t}^2$, with $H_t = I + \beta\gamma^2 I + \beta\sum_{\tau=1}^{t-1}\ell_\tau\ell_\tau^\top$,

where $I$ is the $N \times N$ identity matrix and $\gamma$ is an upper bound on $\|\ell_t\|_2$ for every $t$, so that $\gamma^2 I \succeq \ell_t\ell_t^\top$. It is easy to show that with this choice,

$\nabla R_t(x) = H_t x$, $B_{R_t}(x, y) = \frac{1}{2}\|x - y\|_{H_t}^2$, and $\Pi_{X,R_t}(y) = \arg\min_{x \in X}\|x - y\|_{H_t}$.

Then, according to Lemma 4, the update in Step (c) of the Meta algorithm becomes:

$x_{t+1} = \arg\min_{x \in X}\|x - y_{t+1}\|_{H_t}$, with $y_{t+1} = x_t - H_t^{-1}\ell_t$,
$\hat{x}_{t+1} = \arg\min_{\hat{x} \in X}\|\hat{x} - \hat{y}_{t+1}\|_{H_{t+1}}$, with $\hat{y}_{t+1} = x_{t+1} - H_{t+1}^{-1}\ell_t$.

We remark that our algorithm is related to the online Newton step algorithm in (Hazan et al., 2007), except that our matrix $H_t$ is slightly different from theirs, and we play $\hat{x}_t$ in round $t$ while they play a point roughly corresponding to our $x_t$. It is easy to verify that the update of our algorithm can be computed at the end of round $t$, because we have $\ell_1, \ldots, \ell_t$ available to compute $H_t$ and $H_{t+1}$.
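As an illustration of this instantiation, the following minimal sketch (ours) maintains $H_t$ and performs the two updates. Here `proj_H`, computing $\arg\min_{x \in X}\|x - y\|_H$, is an assumed oracle (for a general convex $X$ it is a small quadratic program), and `get_gradient` is as in the meta-algorithm sketch of Section 3.

```python
import numpy as np

def strictly_convex_updates(T, N, get_gradient, proj_H, beta, gamma):
    """Minimal sketch of the Section 6 instantiation, R_t(x) = ||x||_{H_t}^2 / 2.
    proj_H(y, H) is an assumed oracle for argmin_{x in X} ||x - y||_H."""
    H = (1.0 + beta * gamma ** 2) * np.eye(N)   # H_1 = I + beta * gamma^2 * I
    x = np.full(N, 1.0 / N)                     # x_1
    x_hat = x.copy()                            # x-hat_1
    plays = []
    for t in range(1, T + 1):
        plays.append(x_hat)                     # play x-hat_t, then observe f_t
        l = get_gradient(t, x_hat)              # l_t = grad f_t(x-hat_t)
        H_next = H + beta * np.outer(l, l)      # H_{t+1} = H_t + beta * l_t l_t^T
        x_next = proj_H(x - np.linalg.solve(H, l), H)                # via y_{t+1}
        x_hat = proj_H(x_next - np.linalg.solve(H_next, l), H_next)  # via y-hat_{t+1}
        x, H = x_next, H_next
    return plays
```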

To bound the regret of our algorithm, note that by substituting the bound of Lemma 5 into (9) and then taking the sum over $t$, we obtain

$\sum_{t=1}^T (f_t(\hat{x}_t) - f_t(\pi)) \le \sum_{t=1}^T S_t + \sum_{t=1}^T A_t - \sum_{t=1}^T B_t - \sum_{t=1}^T C_t$,  (10)

with $S_t = \langle \ell_t - \ell_{t-1}, \hat{x}_t - x_{t+1}\rangle$, $A_t = \frac{1}{2}\|\pi - x_t\|_{H_t}^2 - \frac{1}{2}\|\pi - x_{t+1}\|_{H_t}^2$, $B_t = \frac{1}{2}\|x_{t+1} - \hat{x}_t\|_{H_t}^2 + \frac{1}{2}\|\hat{x}_t - x_t\|_{H_t}^2$, and $C_t = \beta\|\pi - \hat{x}_t\|_{h_t}^2$. Then our key lemma is the following, which is proved in Appendix D.1.

Lemma 14 Suppose the loss functions are $\beta$-convex for some $\beta > 0$. Then

$\sum_{t=1}^T S_t + \sum_{t=1}^T A_t - \sum_{t=1}^T C_t \le O(1 + \beta\gamma^2) + \frac{8N}{\beta}\ln\left(1 + \frac{\beta}{4}\sum_{t=1}^T \|\ell_t - \ell_{t-1}\|_2^2\right)$.

Note that the lemma does not use the nonnegative sum $\sum_{t=1}^T B_t$, but it already provides a regret bound matching that in (Hazan et al., 2007). To bound $\sum_{t=1}^T \|\ell_t - \ell_{t-1}\|_2^2$ in terms of the $L_2$-deviation, we again assume that each loss function satisfies the $\lambda$-smoothness condition defined in (8), and we will also use the help of the sum $\sum_{t=1}^T B_t$. To get a cleaner regret bound, let us assume without loss of generality that $\lambda \ge 1$ and $\beta \le 1$, because otherwise we can set $\lambda = 1$ or $\beta = 1$ and the inequalities in (8) and (9) still hold. Our main result of this section is the following, which we prove in Appendix D.3.

Theorem 15 Suppose the loss functions are $\beta$-convex and their $L_2$-deviation is $D_2$, with $\beta \le 1$ and $D_2 \ge 1$. Furthermore, suppose each loss function is $\lambda$-smooth, with $\lambda \ge 1$, and has gradients of $L_2$-norm at most $\gamma$. Then the regret of our algorithm is at most $O(\beta\gamma^2 + (N/\beta)\ln(\lambda N D_2))$, which becomes $O((N/\beta)\ln D_2)$ for a large enough $D_2$.

An immediate application of Theorem 15 is to the portfolio management problem considered in (Hazan and Kale, 2009). In this problem, the feasible set $X$ is the $N$-dimensional probability simplex, and each loss function has the form $f_t(x) = -\ln\langle v_t, x\rangle$, with $v_t \in [\delta, 1]^N$ for some $\delta \in (0, 1)$. A natural measure of deviation, extending that of (Hazan and Kale, 2009), for such loss functions is $D = \sum_{t=1}^T \|v_t - v_{t-1}\|_2^2$. By applying Theorem 15 to this problem, we have the following, which is proved in Appendix D.4.

Corollary 16 For the portfolio management problem described above, there is an online algorithm which achieves a regret of $O((N/\delta^2)\ln((N/\delta)D))$.
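As a small illustration (ours, not the paper's), the gradient of the portfolio loss and the deviation measure used in Corollary 16 can be computed as follows; the price-relative vectors in the example are made up.

```python
import numpy as np

def portfolio_gradient(v, x):
    """Gradient of f_t(x) = -ln <v_t, x>: grad f_t(x) = -v_t / <v_t, x>."""
    return -v / float(v @ x)

def portfolio_deviation(price_relatives):
    """The deviation D = sum_t ||v_t - v_{t-1}||_2^2 used in Corollary 16."""
    return sum(np.linalg.norm(v - w) ** 2
               for v, w in zip(price_relatives[1:], price_relatives[:-1]))

vs = [np.array([0.90, 1.00]), np.array([0.92, 0.99]), np.array([0.91, 1.00])]
print(portfolio_deviation(vs))
print(portfolio_gradient(vs[0], np.array([0.5, 0.5])))
```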

Acknowledgments

The authors T. Yang, M. Mahdavi, and R. Jin acknowledge support from the National Science Foundation (IIS) and the Office of Naval Research (ONR awards N and N). The authors C.-K. Chiang, C.-J. Lee, and C.-J. Lu acknowledge support from Academia Sinica and the National Science Council (NSC E MY3) of Taiwan.

References

Jacob Abernethy, Peter L. Bartlett, Alexander Rakhlin, and Ambuj Tewari. Optimal strategies and minimax lower bounds for online convex games. In COLT, 2008.

Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167-175, 2003.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, 2006.

Thomas Cover. Universal portfolios. Mathematical Finance, 1(1):1-29, 1991.

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

Elad Hazan and Satyen Kale. Extracting certainty from uncertainty: regret bounded by variation in costs. In COLT, pages 57-68, 2008.

Elad Hazan and Satyen Kale. On stochastic and worst-case models for investing. In NIPS, 2009.

Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169-192, 2007.

Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212-261, 1994.

A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Nauka Publishers, Moscow, 1978.

Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course (Applied Optimization). Springer Netherlands, 1st edition, 2004.

Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: stochastic, constrained, and smoothed adversaries. In NIPS, 2011.

Nati Srebro, Karthik Sridharan, and Ambuj Tewari. On the universality of online mirror descent. In NIPS, 2011.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pages 928-936, 2003.

Appendix A. Proof of Proposition 1

By definition,

$2\|y\|_H^2 + 2\|z\|_H^2 - \|y + z\|_H^2 = \|y\|_H^2 + \|z\|_H^2 - 2y^\top H z = \|y - z\|_H^2 \ge 0$,

which implies that $\|y + z\|_H^2 \le 2\|y\|_H^2 + 2\|z\|_H^2$.

Appendix B. Proofs in Section 3

B.1. Proof of Lemma 4

The lemma follows immediately from the following known fact (see, e.g., (Beck and Teboulle, 2003; Srebro et al., 2011)); we give the proof for completeness.

Proposition 17 Suppose $R$ is strictly convex and differentiable, and $y$ satisfies the condition $\nabla R(y) = \nabla R(u) - \ell$. Then

$\arg\min_{x \in X}\left(\langle \ell, x\rangle + B_R(x, u)\right) = \arg\min_{x \in X} B_R(x, y)$.

Proof Since $R$ is strictly convex, the minimum on each side is achieved by a unique point. Next, note that

$B_R(x, y) = R(x) - R(y) - \langle \nabla R(y), x - y\rangle = R(x) - \langle \nabla R(y), x\rangle + c_1$,

where $c_1 = -R(y) + \langle \nabla R(y), y\rangle$ does not depend on the variable $x$. Thus, using the condition $\nabla R(y) = \nabla R(u) - \ell$, we have

$\arg\min_{x \in X} B_R(x, y) = \arg\min_{x \in X}\left(R(x) - \langle \nabla R(u) - \ell, x\rangle\right) = \arg\min_{x \in X}\left(\langle \ell, x\rangle + R(x) - \langle \nabla R(u), x\rangle\right)$.

On the other hand, $B_R(x, u) = R(x) - R(u) - \langle \nabla R(u), x - u\rangle = R(x) - \langle \nabla R(u), x\rangle + c_2$, where $c_2 = -R(u) + \langle \nabla R(u), u\rangle$ does not depend on the variable $x$. Thus, we have

$\arg\min_{x \in X}\left(\langle \ell, x\rangle + B_R(x, u)\right) = \arg\min_{x \in X}\left(\langle \ell, x\rangle + R(x) - \langle \nabla R(u), x\rangle\right) = \arg\min_{x \in X} B_R(x, y)$.

B.2. Proof of Lemma 5

Let us write $\langle \ell_t, \hat{x}_t - \pi\rangle = \langle \ell_t, \hat{x}_t - x_{t+1}\rangle + \langle \ell_t, x_{t+1} - \pi\rangle$, which in turn equals

$\langle \ell_t - \ell_{t-1}, \hat{x}_t - x_{t+1}\rangle + \langle \ell_{t-1}, \hat{x}_t - x_{t+1}\rangle + \langle \ell_t, x_{t+1} - \pi\rangle$.  (11)

To bound the second and third terms in (11), we rely on the following.

Proposition 18 Suppose $\ell \in \mathbb{R}^N$, $v = \arg\min_{x \in X}\left(\langle \ell, x\rangle + B_R(x, u)\right)$, and $w \in X$. Then

$\langle \ell, v - w\rangle \le B_R(w, u) - B_R(w, v) - B_R(v, u)$.

Proof We need the following well-known fact; for a proof, see, e.g., (Boyd and Vandenberghe, 2004).

Fact 1 Let $X \subseteq \mathbb{R}^N$ be a convex set and $x^* = \arg\min_{z \in X}\varphi(z)$ for some continuous and differentiable function $\varphi : X \to \mathbb{R}$. Then for any $w \in X$, $\langle \nabla\varphi(x^*), w - x^*\rangle \ge 0$.

Let $\varphi$ be the function defined by $\varphi(x) = \langle \ell, x\rangle + B_R(x, u)$. Since $X$ is a convex set and $v$ is the minimizer of $\varphi(x)$ over $x \in X$, it follows from Fact 1 that $\langle \nabla\varphi(v), w - v\rangle \ge 0$. Since $\nabla\varphi(v) = \ell + \nabla R(v) - \nabla R(u)$, we have

$\langle \ell, v - w\rangle \le \langle \nabla R(v) - \nabla R(u), w - v\rangle$.

Then, by the definition of Bregman divergence, we obtain

$B_R(w, u) - B_R(w, v) - B_R(v, u) = -\langle \nabla R(u), w - v\rangle + \langle \nabla R(v), w - v\rangle = \langle \nabla R(v) - \nabla R(u), w - v\rangle$.

As a result, we have

$\langle \ell, v - w\rangle \le \langle \nabla R(v) - \nabla R(u), w - v\rangle = B_R(w, u) - B_R(w, v) - B_R(v, u)$.

From Proposition 18 and the definitions of $\hat{x}_t$ and $x_{t+1}$, we have

$\langle \ell_{t-1}, \hat{x}_t - x_{t+1}\rangle \le B_{R_t}(x_{t+1}, x_t) - B_{R_t}(x_{t+1}, \hat{x}_t) - B_{R_t}(\hat{x}_t, x_t)$,  (12)

and

$\langle \ell_t, x_{t+1} - \pi\rangle \le B_{R_t}(\pi, x_t) - B_{R_t}(\pi, x_{t+1}) - B_{R_t}(x_{t+1}, x_t)$.  (13)

Combining the bounds in (11), (12), and (13) together, we have the lemma.

B.3. Proof of Proposition 7

To simplify the notation, let $R = R_t$, $x = \hat{x}_t$, $x' = x_{t+1}$, $y = \hat{y}_t$, and $y' = y_{t+1}$. Then, from the property of the norm assumed in Lemma 6, we know that

$\frac{1}{2}\|x - x'\|^2 \le R(x) - R(x') - \langle \nabla R(x'), x - x'\rangle$, and also $\frac{1}{2}\|x' - x\|^2 \le R(x') - R(x) - \langle \nabla R(x), x' - x\rangle$.

Adding these two bounds, we obtain

$\|x - x'\|^2 \le \langle \nabla R(x) - \nabla R(x'), x - x'\rangle$.  (14)

Next, we show that

$\langle \nabla R(x) - \nabla R(x'), x - x'\rangle \le \langle \nabla R(y) - \nabla R(y'), x - x'\rangle$.  (15)

For this, we need Fact 1 in Appendix B.2. By letting $\varphi(z) = B_R(z, y)$, we have $x = \arg\min_{z \in X}\varphi(z)$, $\nabla\varphi(x) = \nabla R(x) - \nabla R(y)$, and $\langle \nabla R(x) - \nabla R(y), x' - x\rangle \ge 0$.

On the other hand, by letting $\varphi(z) = B_R(z, y')$, we have $x' = \arg\min_{z \in X}\varphi(z)$, $\nabla\varphi(x') = \nabla R(x') - \nabla R(y')$, and $\langle \nabla R(x') - \nabla R(y'), x - x'\rangle \ge 0$. Combining these two bounds, we have

$\langle(\nabla R(y) - \nabla R(y')) - (\nabla R(x) - \nabla R(x')), x - x'\rangle \ge 0$,

which implies the inequality in (15). Finally, by combining (14) and (15), we obtain

$\|x - x'\|^2 \le \langle \nabla R(y) - \nabla R(y'), x - x'\rangle \le \|\nabla R(y) - \nabla R(y')\|_* \|x - x'\|$,

by a generalized Cauchy-Schwarz inequality. Dividing both sides by $\|x - x'\|$, we have the proposition.

Appendix C. Proofs in Section 4

C.1. Proof of Lemma 9 in Section 4

One may wonder if the GD algorithm can also achieve the same regret as our algorithm's by choosing the learning rate $\eta$ properly. We show that no matter what learning rate $\eta$ the GD algorithm chooses, there exists a sequence of loss vectors which causes a large regret. Let $f$ be any unit vector, and take $x_1 = 0$. Let $s = 1/\eta$, so that if we use $f_t = f$ for every $t \le s$, each such $y_{t+1} = x_1 - t\eta f$ still remains in $X$, and thus $x_{t+1} = y_{t+1}$. Next, we analyze the regret by considering the following three cases, depending on the range of $s$.

First, when $s \ge \sqrt{T}$, we choose $f_t = f$ for $t$ from 1 to $\min\{s/2, T\}$ and $f_t = 0$ for the remaining $t$. Clearly, the best strategy of the offline algorithm is to play $\pi = -f$. On the other hand, since the learning rate $\eta$ is too small, the strategy $x_t$ played by GD, for $t \le s/2$, stays far away from $\pi$, so that $\langle f_t, x_t - \pi\rangle \ge 1 - t\eta \ge 1/2$. Therefore, the regret is at least $\min\{s/2, T\}\cdot(1/2) = \Omega(\sqrt{T})$, while the deviation is only $O(1)$.

Second, when $1 \le s < \sqrt{T}$, the learning rate is high enough that GD may overreact to each loss vector, and we make it pay by flipping the direction of the loss vectors frequently. More precisely, we use the vector $f$ for the first $s$ rounds, so that $x_{t+1} = x_1 - t\eta f$ for any $t \le s$, but just as $x_{s+1}$ has moved far enough in the direction of $-f$, we make GD pay by switching the loss vector to $-f$, which we continue to use for $s$ rounds. Note that $x_{s+1+r} = x_{s+1-r}$ but $f_{s+1+r} = -f_{s+1-r}$ for any $r \le s$, so

$\sum_{t=1}^{2s}\langle f_t, x_t - x_1\rangle = \langle f_{s+1}, x_{s+1} - x_1\rangle \ge \Omega(1)$.

As the play then returns back to $x_1$, we can see the first $2s$ rounds as a period, which contributes only $\|2f\|_2^2 = 4$ to the deviation. Then we repeat the period $\tau$ times, where $\tau = D_2/4$ if there are enough rounds, with $T/(2s) \ge D_2/4$, to use up the deviation $D_2$, and $\tau = T/(2s)$ otherwise. For any remaining round $t$, we simply choose $f_t = 0$. As a result, the total regret is at least $\Omega(1)\cdot\tau = \Omega(\min\{D_2/4, T/(2s)\}) = \Omega(\min\{D_2, \sqrt{T}\})$, since $s < \sqrt{T}$.

Finally, when $s < 1$, the learning rate is so high that we can easily make GD pay by flipping the direction of the loss vector in each round. More precisely, by starting with $f_1 = f$, we can have $x_2$ on the boundary of $X$, which means that if we then alternate between $-f$ and $f$, the strategies GD plays will alternate between two points $x_3$ and $x_2$ which have a constant distance from each other. Then, following the analysis in the second case, one can show that the total regret is at least $\Omega(\min\{D_2, T\}) \ge \Omega(\min\{D_2, \sqrt{T}\})$.

C.2. Proof of Theorem 10

We start by bounding the first sum in (6). Note that we can apply Lemma 6 with the norm $\|x\| = \frac{1}{\sqrt{\eta}}\|x\|_1$, since for any $x, x' \in X$,

$\frac{1}{2}\|x - x'\|^2 = \frac{1}{2\eta}\|x - x'\|_1^2 \le \frac{1}{\eta}\mathrm{RE}(x\|x') = B_{R_t}(x, x')$,

by Pinsker's inequality. As the dual norm is $\|x\|_* = \sqrt{\eta}\|x\|_\infty$, Lemma 6 gives us

$\sum_{t=1}^T S_t \le \sum_{t=1}^T \eta\|f_t - f_{t-1}\|_\infty^2 \le \eta D_\infty$.

Next, note that $A_t = \frac{1}{\eta}\mathrm{RE}(\pi\|x_t) - \frac{1}{\eta}\mathrm{RE}(\pi\|x_{t+1})$, so the second sum in (6) is

$\sum_{t=1}^T A_t = \frac{1}{\eta}\left(\mathrm{RE}(\pi\|x_1) - \mathrm{RE}(\pi\|x_{T+1})\right) \le \frac{\ln N}{\eta}$,

by telescoping and then using the facts that $\mathrm{RE}(\pi\|x_1) \le \ln N$ and $\mathrm{RE}(\pi\|x_{T+1}) \ge 0$. Finally, by substituting these two bounds into (6), we have

$\sum_{t=1}^T (f_t(\hat{x}_t) - f_t(\pi)) \le \eta D_\infty + \frac{\ln N}{\eta} \le O\left(\sqrt{D_\infty \ln N}\right)$,

by choosing $\eta = \sqrt{(\ln N)/D_\infty}$, which proves the theorem.

Appendix D. Proofs in Section 6

D.1. Proof of Lemma 14

We start by bounding the first sum in (10). Note that we can apply Lemma 6 with the norm $\|\cdot\| = \|\cdot\|_{H_t}$, since $\frac{1}{2}\|x - x'\|_{H_t}^2 = B_{R_t}(x, x')$ for any $x, x' \in X$. As the dual norm is $\|\cdot\|_{H_t^{-1}}$, Lemma 6 gives us

$\sum_{t=1}^T S_t \le \sum_{t=1}^T \|\ell_t - \ell_{t-1}\|_{H_t^{-1}}^2$.

Next, we bound the second sum $\sum_{t=1}^T A_t$ in (10), which can be written as

$\frac{1}{2}\|\pi - x_1\|_{H_1}^2 - \frac{1}{2}\|\pi - x_{T+1}\|_{H_{T+1}}^2 + \frac{1}{2}\sum_{t=1}^T\left(\|\pi - x_{t+1}\|_{H_{t+1}}^2 - \|\pi - x_{t+1}\|_{H_t}^2\right)$.

Since $\|\pi - x_1\|_{H_1}^2 = O(1 + \beta\gamma^2)$, $\|\pi - x_{T+1}\|_{H_{T+1}}^2 \ge 0$, and $H_{t+1} - H_t = \beta h_t$, we have

$\sum_{t=1}^T A_t \le O(1 + \beta\gamma^2) + \frac{\beta}{2}\sum_{t=1}^T \|\pi - x_{t+1}\|_{h_t}^2$.

Note that, unlike in the case of linear functions, here the sum does not telescope, and hence we do not have a small bound for it. The last sum $\sum_{t=1}^T C_t$ in (10) now comes to help. Recall that $C_t = \beta\|\pi - \hat{x}_t\|_{h_t}^2$, so by Proposition 1,

$\frac{\beta}{2}\|\pi - x_{t+1}\|_{h_t}^2 - C_t \le \beta\|\pi - \hat{x}_t\|_{h_t}^2 + \beta\|\hat{x}_t - x_{t+1}\|_{h_t}^2 - C_t = \beta\|\hat{x}_t - x_{t+1}\|_{h_t}^2$,

which, by the facts that $H_t \succeq \beta\gamma^2 I \succeq \beta h_t$ and the bound in (5), is at most

$\|\hat{x}_t - x_{t+1}\|_{H_t}^2 \le \|\ell_t - \ell_{t-1}\|_{H_t^{-1}}^2$.

Combining the bounds derived so far, we obtain

$\sum_{t=1}^T S_t + \sum_{t=1}^T A_t - \sum_{t=1}^T C_t \le O(1 + \beta\gamma^2) + 2\sum_{t=1}^T \|\ell_t - \ell_{t-1}\|_{H_t^{-1}}^2$.  (16)

Finally, to complete our proof of Lemma 14, we rely on the following, which provides a bound for the last term in (16) and is proved in Appendix D.2.

Lemma 19 $\sum_{t=1}^T \|\ell_t - \ell_{t-1}\|_{H_t^{-1}}^2 \le \frac{4N}{\beta}\ln\left(1 + \frac{\beta}{4}\sum_{t=1}^T \|\ell_t - \ell_{t-1}\|_2^2\right)$.

D.2. Proof of Lemma 19

We need the following lemma from (Hazan et al., 2007).

Lemma 20 Let $u_t \in \mathbb{R}^N$, for $t \in [T]$, be a sequence of vectors, and define $V_t = I + \sum_{\tau=1}^t u_\tau u_\tau^\top$. Then

$\sum_{t=1}^T u_t^\top V_t^{-1} u_t \le N\ln\left(1 + \sum_{t=1}^T \|u_t\|_2^2\right)$.

To prove Lemma 19, first note that for any $t \in [T]$,

$H_t = I + \beta\gamma^2 I + \beta\sum_{\tau=1}^{t-1}\ell_\tau\ell_\tau^\top \succeq I + \frac{\beta}{2}\sum_{\tau=1}^{t}\left(\ell_\tau\ell_\tau^\top + \ell_{\tau-1}\ell_{\tau-1}^\top\right)$,

since $\gamma^2 I \succeq \ell_t\ell_t^\top$ and $\ell_0$ is the all-0 vector. Next, we claim that

$\frac{1}{2}\left(\ell_\tau\ell_\tau^\top + \ell_{\tau-1}\ell_{\tau-1}^\top\right) \succeq \frac{1}{4}\left(\ell_\tau - \ell_{\tau-1}\right)\left(\ell_\tau - \ell_{\tau-1}\right)^\top$.

This is because, by subtracting four times the right-hand side from four times the left-hand side, we have

$2\ell_\tau\ell_\tau^\top + 2\ell_{\tau-1}\ell_{\tau-1}^\top - \left(\ell_\tau - \ell_{\tau-1}\right)\left(\ell_\tau - \ell_{\tau-1}\right)^\top = \left(\ell_\tau + \ell_{\tau-1}\right)\left(\ell_\tau + \ell_{\tau-1}\right)^\top \succeq 0$.

Thus, with $K_t = I + \frac{\beta}{4}\sum_{\tau=1}^{t}\left(\ell_\tau - \ell_{\tau-1}\right)\left(\ell_\tau - \ell_{\tau-1}\right)^\top$, we have $H_t \succeq K_t$ and hence $H_t^{-1} \preceq K_t^{-1}$. This implies that

$\sum_{t=1}^T \|\ell_t - \ell_{t-1}\|_{H_t^{-1}}^2 \le \sum_{t=1}^T \|\ell_t - \ell_{t-1}\|_{K_t^{-1}}^2 = \frac{4}{\beta}\sum_{t=1}^T u_t^\top\left(I + \sum_{\tau=1}^t u_\tau u_\tau^\top\right)^{-1} u_t$,

where $u_t = \frac{\sqrt{\beta}}{2}(\ell_t - \ell_{t-1})$, which by Lemma 20 is at most

$\frac{4N}{\beta}\ln\left(1 + \frac{\beta}{4}\sum_{t=1}^T \|\ell_t - \ell_{t-1}\|_2^2\right)$.

D.3. Proof of Theorem 15

To bound the last term in the bound of Lemma 14 in terms of our deviation bound $D_2$, we use Lemma 12 from Section 5. Combining this with (10), we can upper-bound $\sum_{t=1}^T (f_t(\hat{x}_t) - f_t(\pi))$ by

$O(1 + \beta\gamma^2) + \frac{8N}{\beta}\ln\left(1 + \frac{\beta}{2}D_2 + \frac{\beta}{2}\lambda^2\sum_{t=2}^T\|\hat{x}_t - \hat{x}_{t-1}\|_2^2\right) - \sum_{t=1}^T B_t$.  (17)

To eliminate the undesirable last term inside the logarithm above, we need the help of the sum $\sum_{t=1}^T B_t$, which has the following bound.

Lemma 21 $\sum_{t=1}^T B_t \ge \frac{1}{4}\sum_{t=2}^T \|\hat{x}_t - \hat{x}_{t-1}\|_2^2$.

Proof Recall that $B_t = \frac{1}{2}\|x_{t+1} - \hat{x}_t\|_{H_t}^2 + \frac{1}{2}\|\hat{x}_t - x_t\|_{H_t}^2$, so we can write $\sum_{t=1}^T B_t$ as

$\frac{1}{2}\|x_{T+1} - \hat{x}_T\|_{H_T}^2 + \frac{1}{2}\|\hat{x}_1 - x_1\|_{H_1}^2 + \sum_{t=2}^T\left(\frac{1}{2}\|x_t - \hat{x}_{t-1}\|_{H_{t-1}}^2 + \frac{1}{2}\|\hat{x}_t - x_t\|_{H_t}^2\right) \ge \sum_{t=2}^T\left(\frac{1}{2}\|x_t - \hat{x}_{t-1}\|_{H_{t-1}}^2 + \frac{1}{2}\|\hat{x}_t - x_t\|_{H_{t-1}}^2\right)$,

since $H_t \succeq H_{t-1}$ and the two boundary terms are nonnegative. By Proposition 1, this is at least

$\frac{1}{4}\sum_{t=2}^T \|\hat{x}_t - \hat{x}_{t-1}\|_{H_{t-1}}^2 \ge \frac{1}{4}\sum_{t=2}^T \|\hat{x}_t - \hat{x}_{t-1}\|_2^2$,

as $H_{t-1} \succeq I$, where $I$ is the identity matrix.

Applying this lemma to (17), we obtain a regret bound of the form

$O(1 + \beta\gamma^2) + \frac{8N}{\beta}\ln\left(1 + \frac{\beta}{2}D_2 + \frac{\beta}{2}\lambda^2 W\right) - \frac{W}{4}$, where $W = \sum_{t=2}^T \|\hat{x}_t - \hat{x}_{t-1}\|_2^2$.

Observe that the combination of the last two terms above becomes negative when $W \ge (\lambda N D_2)^c/\beta$ for some large enough constant $c$, as we assume $\beta \le 1$ and $\lambda, D_2 \ge 1$; otherwise, $W < (\lambda N D_2)^c/\beta$ and the logarithmic term is at most $O((N/\beta)\ln(\lambda N D_2))$. Thus, the regret bound is at most $O(\beta\gamma^2 + (N/\beta)\ln(\lambda N D_2))$, which completes the proof of Theorem 15.

D.4. Proof of Corollary 16

Recall that each loss function has the form $f_t(x) = -\ln\langle v_t, x\rangle$ for some $v_t \in [\delta, 1]^N$ with $\delta \in (0, 1)$, and note that $\nabla f_t(x) = -v_t/\langle v_t, x\rangle$. To apply Theorem 15, we need to determine the parameters $\beta$, $\gamma$, $\lambda$, and $D_2$. First, by a Taylor expansion, we know that for any $x, y \in X$ there is some $\xi_t$ on the line segment between $x$ and $y$ such that

$f_t(x) = f_t(y) + \langle \nabla f_t(y), x - y\rangle + \frac{1}{2\langle v_t, \xi_t\rangle^2}(x - y)^\top v_t v_t^\top (x - y)$,

where the last term above equals

$\frac{\langle v_t, x - y\rangle^2}{2\langle v_t, \xi_t\rangle^2} = \frac{\langle v_t, y\rangle^2}{2\langle v_t, \xi_t\rangle^2}\langle \nabla f_t(y), x - y\rangle^2 \ge \frac{\delta^2}{2}\langle \nabla f_t(y), x - y\rangle^2$,

since $\langle v_t, y\rangle \ge \delta$ and $\langle v_t, \xi_t\rangle \le 1$. Thus, we can choose $\beta = \delta^2/2$. Next, since $\|\nabla f_t(x)\|_2 = \|v_t\|_2/\langle v_t, x\rangle \le \sqrt{N}/\delta$, we can choose $\gamma = \sqrt{N}/\delta$. Third, note that

$\|\nabla f_t(x) - \nabla f_t(y)\|_2 = \left\|\frac{v_t}{\langle v_t, x\rangle} - \frac{v_t}{\langle v_t, y\rangle}\right\|_2 = \|v_t\|_2\frac{|\langle v_t, y - x\rangle|}{\langle v_t, x\rangle\langle v_t, y\rangle}$,

which by a Cauchy-Schwarz inequality is at most

$\frac{\|v_t\|_2^2}{\langle v_t, x\rangle\langle v_t, y\rangle}\|x - y\|_2 \le \frac{N}{\delta^2}\|x - y\|_2$.

Thus, we can choose $\lambda = N/\delta^2$. Finally, note that for any $x \in X$,

$\|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2 = \left\|\frac{v_t}{\langle v_t, x\rangle} - \frac{v_{t-1}}{\langle v_{t-1}, x\rangle}\right\|_2 = \left\|\frac{v_t - v_{t-1}}{\langle v_t, x\rangle} + v_{t-1}\frac{\langle v_{t-1} - v_t, x\rangle}{\langle v_t, x\rangle\langle v_{t-1}, x\rangle}\right\|_2$,

which by a triangle inequality and then a Cauchy-Schwarz inequality is at most

$\frac{\|v_t - v_{t-1}\|_2}{\langle v_t, x\rangle} + \|v_{t-1}\|_2\frac{\|v_t - v_{t-1}\|_2\|x\|_2}{\langle v_t, x\rangle\langle v_{t-1}, x\rangle} \le \left(\frac{1}{\delta} + \frac{\sqrt{N}}{\delta^2}\right)\|v_t - v_{t-1}\|_2 \le \frac{2\sqrt{N}}{\delta^2}\|v_t - v_{t-1}\|_2$.

This implies that

$\sum_{t=1}^T \max_{x \in X}\|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2^2 \le \frac{4N}{\delta^4}\sum_{t=1}^T \|v_t - v_{t-1}\|_2^2 = \frac{4N}{\delta^4}D$.

Thus, we can choose $D_2 = (4N/\delta^4)D$. Using these bounds in Theorem 15, we have the corollary.


More information

Online Passive-Aggressive Algorithms

Online Passive-Aggressive Algorithms Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il

More information

EASINESS IN BANDITS. Gergely Neu. Pompeu Fabra University

EASINESS IN BANDITS. Gergely Neu. Pompeu Fabra University EASINESS IN BANDITS Gergely Neu Pompeu Fabra University EASINESS IN BANDITS Gergely Neu Pompeu Fabra University THE BANDIT PROBLEM Play for T rounds attempting to maximize rewards THE BANDIT PROBLEM Play

More information

Logarithmic Regret Algorithms for Strongly Convex Repeated Games

Logarithmic Regret Algorithms for Strongly Convex Repeated Games Logarithmic Regret Algorithms for Strongly Convex Repeated Games Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci & Eng, The Hebrew University, Jerusalem 91904, Israel 2 Google Inc 1600

More information

Online Learning with Predictable Sequences

Online Learning with Predictable Sequences JMLR: Workshop and Conference Proceedings vol (2013) 1 27 Online Learning with Predictable Sequences Alexander Rakhlin Karthik Sridharan rakhlin@wharton.upenn.edu skarthik@wharton.upenn.edu Abstract We

More information

Accelerating Online Convex Optimization via Adaptive Prediction

Accelerating Online Convex Optimization via Adaptive Prediction Mehryar Mohri Courant Institute and Google Research New York, NY 10012 mohri@cims.nyu.edu Scott Yang Courant Institute New York, NY 10012 yangs@cims.nyu.edu Abstract We present a powerful general framework

More information

Stochastic Optimization

Stochastic Optimization Introduction Related Work SGD Epoch-GD LM A DA NANJING UNIVERSITY Lijun Zhang Nanjing University, China May 26, 2017 Introduction Related Work SGD Epoch-GD Outline 1 Introduction 2 Related Work 3 Stochastic

More information

An Online Convex Optimization Approach to Blackwell s Approachability

An Online Convex Optimization Approach to Blackwell s Approachability Journal of Machine Learning Research 17 (2016) 1-23 Submitted 7/15; Revised 6/16; Published 8/16 An Online Convex Optimization Approach to Blackwell s Approachability Nahum Shimkin Faculty of Electrical

More information

Online Linear Optimization via Smoothing

Online Linear Optimization via Smoothing JMLR: Workshop and Conference Proceedings vol 35:1 18, 2014 Online Linear Optimization via Smoothing Jacob Abernethy Chansoo Lee Computer Science and Engineering Division, University of Michigan, Ann Arbor

More information

Learning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley

Learning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley Learning Methods for Online Prediction Problems Peter Bartlett Statistics and EECS UC Berkeley Course Synopsis A finite comparison class: A = {1,..., m}. 1. Prediction with expert advice. 2. With perfect

More information

Lecture 23: Online convex optimization Online convex optimization: generalization of several algorithms

Lecture 23: Online convex optimization Online convex optimization: generalization of several algorithms EECS 598-005: heoretical Foundations of Machine Learning Fall 2015 Lecture 23: Online convex optimization Lecturer: Jacob Abernethy Scribes: Vikas Dhiman Disclaimer: hese notes have not been subjected

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Bandit Problems MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Multi-Armed Bandit Problem Problem: which arm of a K-slot machine should a gambler pull to maximize his

More information

Optimal and Adaptive Online Learning

Optimal and Adaptive Online Learning Optimal and Adaptive Online Learning Haipeng Luo Advisor: Robert Schapire Computer Science Department Princeton University Examples of Online Learning (a) Spam detection 2 / 34 Examples of Online Learning

More information

Agnostic Online learnability

Agnostic Online learnability Technical Report TTIC-TR-2008-2 October 2008 Agnostic Online learnability Shai Shalev-Shwartz Toyota Technological Institute Chicago shai@tti-c.org ABSTRACT We study a fundamental question. What classes

More information

A Unified Approach to Proximal Algorithms using Bregman Distance

A Unified Approach to Proximal Algorithms using Bregman Distance A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department

More information

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Haihao Lu August 3, 08 Abstract The usual approach to developing and analyzing first-order

More information

15-859E: Advanced Algorithms CMU, Spring 2015 Lecture #16: Gradient Descent February 18, 2015

15-859E: Advanced Algorithms CMU, Spring 2015 Lecture #16: Gradient Descent February 18, 2015 5-859E: Advanced Algorithms CMU, Spring 205 Lecture #6: Gradient Descent February 8, 205 Lecturer: Anupam Gupta Scribe: Guru Guruganesh In this lecture, we will study the gradient descent algorithm and

More information

Delay-Tolerant Online Convex Optimization: Unified Analysis and Adaptive-Gradient Algorithms

Delay-Tolerant Online Convex Optimization: Unified Analysis and Adaptive-Gradient Algorithms Delay-Tolerant Online Convex Optimization: Unified Analysis and Adaptive-Gradient Algorithms Pooria Joulani 1 András György 2 Csaba Szepesvári 1 1 Department of Computing Science, University of Alberta,

More information

Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations

Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations JMLR: Workshop and Conference Proceedings vol 35:1 20, 2014 Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations H. Brendan McMahan MCMAHAN@GOOGLE.COM Google,

More information

A Modular Analysis of Adaptive (Non-)Convex Optimization: Optimism, Composite Objectives, and Variational Bounds

A Modular Analysis of Adaptive (Non-)Convex Optimization: Optimism, Composite Objectives, and Variational Bounds Proceedings of Machine Learning Research 76: 40, 207 Algorithmic Learning Theory 207 A Modular Analysis of Adaptive (Non-)Convex Optimization: Optimism, Composite Objectives, and Variational Bounds Pooria

More information

Minimax strategy for prediction with expert advice under stochastic assumptions

Minimax strategy for prediction with expert advice under stochastic assumptions Minimax strategy for prediction ith expert advice under stochastic assumptions Wojciech Kotłosi Poznań University of Technology, Poland otlosi@cs.put.poznan.pl Abstract We consider the setting of prediction

More information

Online Learning: Random Averages, Combinatorial Parameters, and Learnability

Online Learning: Random Averages, Combinatorial Parameters, and Learnability Online Learning: Random Averages, Combinatorial Parameters, and Learnability Alexander Rakhlin Department of Statistics University of Pennsylvania Karthik Sridharan Toyota Technological Institute at Chicago

More information

Lecture 4: Lower Bounds (ending); Thompson Sampling

Lecture 4: Lower Bounds (ending); Thompson Sampling CMSC 858G: Bandits, Experts and Games 09/12/16 Lecture 4: Lower Bounds (ending); Thompson Sampling Instructor: Alex Slivkins Scribed by: Guowei Sun,Cheng Jie 1 Lower bounds on regret (ending) Recap from

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization Alexander Rakhlin University of Pennsylvania Ohad Shamir Microsoft Research New England Karthik Sridharan University of Pennsylvania

More information

Optimistic Bandit Convex Optimization

Optimistic Bandit Convex Optimization Optimistic Bandit Convex Optimization Mehryar Mohri Courant Institute and Google 25 Mercer Street New York, NY 002 mohri@cims.nyu.edu Scott Yang Courant Institute 25 Mercer Street New York, NY 002 yangs@cims.nyu.edu

More information

Online Learning for Time Series Prediction

Online Learning for Time Series Prediction Online Learning for Time Series Prediction Joint work with Vitaly Kuznetsov (Google Research) MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Motivation Time series prediction: stock values.

More information

Multi-armed Bandits: Competing with Optimal Sequences

Multi-armed Bandits: Competing with Optimal Sequences Multi-armed Bandits: Competing with Optimal Sequences Oren Anava The Voleon Group Berkeley, CA oren@voleon.com Zohar Karnin Yahoo! Research New York, NY zkarnin@yahoo-inc.com Abstract We consider sequential

More information

Explore no more: Improved high-probability regret bounds for non-stochastic bandits

Explore no more: Improved high-probability regret bounds for non-stochastic bandits Explore no more: Improved high-probability regret bounds for non-stochastic bandits Gergely Neu SequeL team INRIA Lille Nord Europe gergely.neu@gmail.com Abstract This work addresses the problem of regret

More information

Bandit Convex Optimization: T Regret in One Dimension

Bandit Convex Optimization: T Regret in One Dimension Bandit Convex Optimization: T Regret in One Dimension arxiv:1502.06398v1 [cs.lg 23 Feb 2015 Sébastien Bubeck Microsoft Research sebubeck@microsoft.com Tomer Koren Technion tomerk@technion.ac.il February

More information

Near-Optimal Algorithms for Online Matrix Prediction

Near-Optimal Algorithms for Online Matrix Prediction JMLR: Workshop and Conference Proceedings vol 23 (2012) 38.1 38.13 25th Annual Conference on Learning Theory Near-Optimal Algorithms for Online Matrix Prediction Elad Hazan Technion - Israel Inst. of Tech.

More information

Online Convex Optimization. Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016

Online Convex Optimization. Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016 Online Convex Optimization Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016 The General Setting The General Setting (Cover) Given only the above, learning isn't always possible Some Natural

More information

Online prediction with expert advise

Online prediction with expert advise Online prediction with expert advise Jyrki Kivinen Australian National University http://axiom.anu.edu.au/~kivinen Contents 1. Online prediction: introductory example, basic setting 2. Classification with

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Learning and Games MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Normal form games Nash equilibrium von Neumann s minimax theorem Correlated equilibrium Internal

More information

The Online Approach to Machine Learning

The Online Approach to Machine Learning The Online Approach to Machine Learning Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Online Approach to ML 1 / 53 Summary 1 My beautiful regret 2 A supposedly fun game I

More information

Minimax Fixed-Design Linear Regression

Minimax Fixed-Design Linear Regression JMLR: Workshop and Conference Proceedings vol 40:1 14, 2015 Mini Fixed-Design Linear Regression Peter L. Bartlett University of California at Berkeley and Queensland University of Technology Wouter M.

More information

Better Algorithms for Benign Bandits

Better Algorithms for Benign Bandits Better Algorithms for Benign Bandits Elad Hazan IBM Almaden 650 Harry Rd, San Jose, CA 95120 ehazan@cs.princeton.edu Satyen Kale Microsoft Research One Microsoft Way, Redmond, WA 98052 satyen.kale@microsoft.com

More information

Faster Rates for Convex-Concave Games

Faster Rates for Convex-Concave Games Proceedings of Machine Learning Research vol 75: 3, 208 3st Annual Conference on Learning Theory Faster Rates for Convex-Concave Games Jacob Abernethy Georgia Tech Kevin A. Lai Georgia Tech Kfir Y. Levy

More information

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties Fedor S. Stonyakin 1 and Alexander A. Titov 1 V. I. Vernadsky Crimean Federal University, Simferopol,

More information

Lecture 19: Follow The Regulerized Leader

Lecture 19: Follow The Regulerized Leader COS-511: Learning heory Spring 2017 Lecturer: Roi Livni Lecture 19: Follow he Regulerized Leader Disclaimer: hese notes have not been subjected to the usual scrutiny reserved for formal publications. hey

More information

Blackwell s Approachability Theorem: A Generalization in a Special Case. Amy Greenwald, Amir Jafari and Casey Marks

Blackwell s Approachability Theorem: A Generalization in a Special Case. Amy Greenwald, Amir Jafari and Casey Marks Blackwell s Approachability Theorem: A Generalization in a Special Case Amy Greenwald, Amir Jafari and Casey Marks Department of Computer Science Brown University Providence, Rhode Island 02912 CS-06-01

More information

Online Bounds for Bayesian Algorithms

Online Bounds for Bayesian Algorithms Online Bounds for Bayesian Algorithms Sham M. Kakade Computer and Information Science Department University of Pennsylvania Andrew Y. Ng Computer Science Department Stanford University Abstract We present

More information

Yevgeny Seldin. University of Copenhagen

Yevgeny Seldin. University of Copenhagen Yevgeny Seldin University of Copenhagen Classical (Batch) Machine Learning Collect Data Data Assumption The samples are independent identically distributed (i.i.d.) Machine Learning Prediction rule New

More information

Bregman Divergence and Mirror Descent

Bregman Divergence and Mirror Descent Bregman Divergence and Mirror Descent Bregman Divergence Motivation Generalize squared Euclidean distance to a class of distances that all share similar properties Lots of applications in machine learning,

More information

Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity

Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity Benjamin Grimmer Abstract We generalize the classic convergence rate theory for subgradient methods to

More information

Convex Repeated Games and Fenchel Duality

Convex Repeated Games and Fenchel Duality Convex Repeated Games and Fenchel Duality Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci. & Eng., he Hebrew University, Jerusalem 91904, Israel 2 Google Inc. 1600 Amphitheater Parkway,

More information

More Adaptive Algorithms for Adversarial Bandits

More Adaptive Algorithms for Adversarial Bandits Proceedings of Machine Learning Research vol 75: 29, 208 3st Annual Conference on Learning Theory More Adaptive Algorithms for Adversarial Bandits Chen-Yu Wei University of Southern California Haipeng

More information

Bandit Online Convex Optimization

Bandit Online Convex Optimization March 31, 2015 Outline 1 OCO vs Bandit OCO 2 Gradient Estimates 3 Oblivious Adversary 4 Reshaping for Improved Rates 5 Adaptive Adversary 6 Concluding Remarks Review of (Online) Convex Optimization Set-up

More information

Logarithmic Regret Algorithms for Online Convex Optimization

Logarithmic Regret Algorithms for Online Convex Optimization Logarithmic Regret Algorithms for Online Convex Optimization Elad Hazan Amit Agarwal Satyen Kale November 20, 2008 Abstract In an online convex optimization problem a decision-maker makes a sequence of

More information

Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 Regularization

Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 Regularization Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 Regularization H. Brendan McMahan Google, Inc. Abstract We prove that many mirror descent algorithms for online conve optimization

More information

Lecture 16: Perceptron and Exponential Weights Algorithm

Lecture 16: Perceptron and Exponential Weights Algorithm EECS 598-005: Theoretical Foundations of Machine Learning Fall 2015 Lecture 16: Perceptron and Exponential Weights Algorithm Lecturer: Jacob Abernethy Scribes: Yue Wang, Editors: Weiqing Yu and Andrew

More information

Minimizing Adaptive Regret with One Gradient per Iteration

Minimizing Adaptive Regret with One Gradient per Iteration Minimizing Adaptive Regret with One Gradient per Iteration Guanghui Wang, Dakuan Zhao, Lijun Zhang National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China wanggh@lamda.nju.edu.cn,

More information

Lecture 25: Subgradient Method and Bundle Methods April 24

Lecture 25: Subgradient Method and Bundle Methods April 24 IE 51: Convex Optimization Spring 017, UIUC Lecture 5: Subgradient Method and Bundle Methods April 4 Instructor: Niao He Scribe: Shuanglong Wang Courtesy warning: hese notes do not necessarily cover everything

More information

Better Algorithms for Benign Bandits

Better Algorithms for Benign Bandits Better Algorithms for Benign Bandits Elad Hazan IBM Almaden 650 Harry Rd, San Jose, CA 95120 hazan@us.ibm.com Satyen Kale Microsoft Research One Microsoft Way, Redmond, WA 98052 satyen.kale@microsoft.com

More information

Gambling in a rigged casino: The adversarial multi-armed bandit problem

Gambling in a rigged casino: The adversarial multi-armed bandit problem Gambling in a rigged casino: The adversarial multi-armed bandit problem Peter Auer Institute for Theoretical Computer Science University of Technology Graz A-8010 Graz (Austria) pauer@igi.tu-graz.ac.at

More information

Math 273a: Optimization Subgradients of convex functions

Math 273a: Optimization Subgradients of convex functions Math 273a: Optimization Subgradients of convex functions Made by: Damek Davis Edited by Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com 1 / 42 Subgradients Assumptions

More information

Theory and Applications of A Repeated Game Playing Algorithm. Rob Schapire Princeton University [currently visiting Yahoo!

Theory and Applications of A Repeated Game Playing Algorithm. Rob Schapire Princeton University [currently visiting Yahoo! Theory and Applications of A Repeated Game Playing Algorithm Rob Schapire Princeton University [currently visiting Yahoo! Research] Learning Is (Often) Just a Game some learning problems: learn from training

More information