Online Optimization with Gradual Variations


JMLR: Workshop and Conference Proceedings vol 23 (2012)

Chao-Kai Chiang^{1,2} (chaokai@iis.sinica.edu.tw), Tianbao Yang^3 (yangtia@msu.edu), Chia-Jung Lee^1 (leecj@iis.sinica.edu.tw), Mehrdad Mahdavi^3 (mahdavim@cse.msu.edu), Chi-Jen Lu^1 (cjlu@iis.sinica.edu.tw), Rong Jin^3 (rongjin@cse.msu.edu), Shenghuo Zhu^4 (zsh@sv.nec-labs.com)

1 Institute of Information Science, Academia Sinica, Taipei, Taiwan.
2 Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan.
3 Department of Computer Science and Engineering, Michigan State University, East Lansing, MI 48824, USA.
4 NEC Laboratories America, Cupertino, CA 95014, USA.

Abstract

We study the online convex optimization problem, in which an online algorithm has to make repeated decisions with convex loss functions and hopes to achieve a small regret. We consider a natural restriction of this problem in which the loss functions have a small deviation, measured by the sum of the distances between every two consecutive loss functions, according to some distance metrics. We show that for linear and general smooth convex loss functions, an online algorithm modified from the gradient descent algorithm can achieve a regret which only scales as the square root of the deviation. For the closely related problem of prediction with expert advice, we show that an online algorithm modified from the multiplicative update algorithm can achieve a similar regret bound for a different measure of deviation. Finally, for loss functions which are strictly convex, we show that an online algorithm modified from the online Newton step algorithm can achieve a regret which is only logarithmic in terms of the deviation, and as an application, we can also have such a logarithmic regret for the portfolio management problem.

Keywords: Online Learning, Regret, Convex Optimization, Deviation.

1. Introduction

We study the online convex optimization problem, in which a player has to make decisions iteratively for a number of rounds in the following way. In round $t$, the player has to choose a point $x_t$ from some convex feasible set $X \subseteq \mathbb{R}^N$, and after that the player receives a convex loss function $f_t$ and suffers the corresponding loss $f_t(x_t) \in [0,1]$. The player would like to have an online algorithm that can minimize its regret, which is the difference between the total loss it suffers and that of the best fixed point in hindsight. It is known that when playing for $T$ rounds, a regret of $O(\sqrt{TN})$ can be achieved using the gradient descent algorithm (Zinkevich, 2003).

When the loss functions are restricted to be linear, it becomes the well-known online linear optimization problem. Another related problem is the prediction with expert advice problem, in which the player in each round has to choose one of $N$ actions to play, possibly in a probabilistic way. This can be seen as a special case of the online linear optimization problem, with the feasible set being the set of probability distributions over the $N$ actions, and a regret of $O(\sqrt{T \ln N})$ can be achieved using the multiplicative update algorithm (Littlestone and Warmuth, 1994; Freund and Schapire, 1997). There have been many wonderful results and applications for these problems, and more information can be found in (Cesa-Bianchi and Lugosi, 2006). The regrets achieved for these problems are in fact optimal, since matching lower bounds are also known (see, e.g., (Cesa-Bianchi and Lugosi, 2006; Abernethy et al., 2008)). On the other hand, when the loss functions satisfy some nice property, a smaller regret becomes possible. Hazan et al. (2007) showed that a regret of $O(N \ln T)$ can be achieved for loss functions satisfying some strong convexity properties, which includes functions arising from the portfolio management problem (Cover, 1991).

Most previous works, including those discussed above, considered the most general setting in which the loss functions could be arbitrary and possibly chosen in an adversarial way. However, the environments around us may not always be adversarial, and the loss functions may have some patterns which can be exploited for achieving a smaller regret. One work along this direction is that of Hazan and Kale (2008). For the online linear optimization problem, in which each loss function is linear and can be seen as a vector, they considered the case in which the loss functions have a small variation, defined as $V = \sum_{t=1}^T \|f_t - \mu\|_2^2$, where $\mu = \sum_{t=1}^T f_t / T$ is the average of the loss functions and $\|\cdot\|_p$ denotes the $L_p$-norm. For this, they showed that a regret of $O(\sqrt{V})$ can be achieved, and they also have an analogous result for the prediction with expert advice problem. In another paper, Hazan and Kale (2009) considered the portfolio management problem, in which each loss function has the form $f_t(x) = -\ln\langle v_t, x\rangle$ with $v_t \in [\delta, 1]^N$ for some constant $\delta \in (0,1)$, where $\langle v_t, x\rangle$ denotes the inner product of the vectors $v_t$ and $x$, and they showed how to achieve a regret of $O(N \log Q)$, with $Q = \sum_{t=1}^T \|v_t - \mu\|_2^2$ and $\mu = \sum_{t=1}^T v_t / T$. Note that according to their definition, a small $V$ means that most of the loss functions center around some fixed loss function $\mu$, and similarly for the case of a small $Q$. This seems to model a stationary environment, in which all the loss functions are produced according to some fixed distribution.

The variation introduced in (Hazan and Kale, 2008) is defined in terms of the total difference between the individual linear cost vectors and their mean. In this paper we introduce a new measure for the loss functions, which we call the $L_p$-deviation, defined in terms of the difference between each loss function and its predecessor:

$D_p = \sum_{t=1}^T \max_{x \in X} \|\nabla f_t(x) - \nabla f_{t-1}(x)\|_p^2$,  (1)

where we use the convention that $f_0$ is the all-0 function.
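For linear losses the gradient is the loss vector itself, so (1) reduces to $\sum_{t=1}^T \|f_t - f_{t-1}\|_p^2$. As a quick illustration (not part of the paper), here is a minimal Python sketch of this quantity; the function name and the example sequence are ours.

```python
import numpy as np

def lp_deviation(loss_vectors, p=2):
    """L_p-deviation of a sequence of linear loss vectors:
    D_p = sum_t ||f_t - f_{t-1}||_p^2, with f_0 = 0 by convention."""
    prev = np.zeros_like(loss_vectors[0])
    total = 0.0
    for f in loss_vectors:
        total += np.linalg.norm(f - prev, ord=p) ** 2
        prev = f
    return total

# A slowly drifting sequence keeps the deviation small even over many rounds.
fs = [np.array([0.5 + 0.001 * t, 0.5 - 0.001 * t]) for t in range(1000)]
print(lp_deviation(fs, p=2))  # small, despite T = 1000
```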
The motivation for defining this gradual variation (i.e., the $L_p$-deviation) stems from two observations: one is practical, and the other is technical, arising from the limitation of extending the results in (Hazan and Kale, 2008) to general convex functions. From a practical point of view, we are interested in a more general scenario in which the environment may be evolving, but

in a somewhat gradual way. For example, the weather condition or the stock price at one moment may have some correlation with the next, and their difference is usually small, while abrupt changes occur only sporadically. The $L_p$-deviation naturally models these situations.

In order to understand the limitation of extending the results in (Hazan and Kale, 2008), let us apply them to general convex loss functions as follows. Since those results were developed for linear loss functions, a straightforward approach is to use the first-order approximation of a convex loss function, i.e., $f_t(x) \approx f_t(x_t) + \langle \nabla f_t(x_t), x - x_t\rangle$, and replace the linear loss vector with the gradient of the loss function $f_t$ at $x_t$. Using the convexity of $f_t$, we have

$\sum_{t=1}^T f_t(x_t) - \min_{\pi \in X} \sum_{t=1}^T f_t(\pi) \le \sum_{t=1}^T \langle \nabla f_t(x_t), x_t\rangle - \min_{\pi \in X} \sum_{t=1}^T \langle \nabla f_t(x_t), \pi\rangle$.  (2)

By assuming $\|\nabla f_t(x)\|_2 \le 1$ for all $t \in [T]$ and $x \in X$, we can apply Hazan and Kale's variation-based bound to bound the regret in (2) by the variation of the gradients:

$V = \sum_{t=1}^T \|\nabla f_t(x_t) - \mu\|_2^2$, where $\mu = \frac{1}{T}\sum_{\tau=1}^T \nabla f_\tau(x_\tau)$.  (3)

To better understand $V$ in (3), we can bound it as

$V = \sum_{t=1}^T \left\|\nabla f_t(x_t) - \frac{1}{T}\sum_{\tau=1}^T \nabla f_\tau(x_\tau)\right\|_2^2 \le \frac{1}{T}\sum_{t=1}^T \sum_{\tau=1}^T \|\nabla f_t(x_t) - \nabla f_\tau(x_\tau)\|_2^2 \le 2(V_1 + V_2)$,

where

$V_1 = \frac{1}{T}\sum_{t=1}^T \sum_{\tau=1}^T \|\nabla f_t(x_t) - \nabla f_t(x_\tau)\|_2^2$ and $V_2 = \frac{1}{T}\sum_{t=1}^T \sum_{\tau=1}^T \|\nabla f_t(x_\tau) - \nabla f_\tau(x_\tau)\|_2^2$,

using the convexity of $\|\cdot\|_2^2$ for the first inequality and $\|a + b\|_2^2 \le 2\|a\|_2^2 + 2\|b\|_2^2$ for the second. We see that the variation $V$ is bounded by two parts: $V_1$ essentially measures the smoothness of the individual loss functions, while $V_2$ measures the variation in the gradients across the loss functions. As a result, even when all the loss functions are identical, $V_2$ vanishes while $V_1$ may still be large, and therefore the regret of the algorithm in (Hazan and Kale, 2008) for online convex optimization may still be bounded by $O(\sqrt{T})$, regardless of the smoothness of the cost functions.

To address the above challenges, the bounds in this paper are developed in terms of the $L_p$-deviation. We note that for linear functions, the $L_p$-deviation becomes $D_p = \sum_{t=1}^T \|f_t - f_{t-1}\|_p^2$. It can be shown that $D_2 \le O(V)$, while there are loss functions with $D_2 \le O(1)$ and $V \ge \Omega(T)$. Thus, one can argue that our constraint of a small deviation is strictly easier to satisfy than that of a small variation in (Hazan and Kale, 2008). For the portfolio management problem, a natural measure of deviation is $\sum_{t=1}^T \|v_t - v_{t-1}\|_2^2$, and one can show that $D_2 \le O(N)\sum_{t=1}^T \|v_t - v_{t-1}\|_2^2 \le O(NQ)$, so one can again argue that our constraint is easier to satisfy than that of (Hazan and Kale, 2009).

In this paper, we consider loss functions with such deviation constraints and obtain the following results. First, for the online linear optimization problem, we provide an algorithm which, when given loss functions with $L_2$-deviation $D_2$, can achieve a regret of $O(\sqrt{D_2})$. This is in fact optimal, as a matching lower bound can be shown. Since $D_2 \le O(TN)$, we immediately recover the result of Zinkevich (2003).

Furthermore, as discussed before, since one can upper-bound $D_2$ in terms of $V$ but not vice versa, our result is arguably stronger than that of Hazan and Kale (2008); interestingly, our analysis even looks simpler than theirs. A similar bound was given by Rakhlin et al. (2011) in a game-theoretic setting, but they did not discuss any algorithm. Next, for the prediction with expert advice problem, we provide an algorithm such that, when given loss functions with $L_\infty$-deviation $D_\infty$, it achieves a regret of $O(\sqrt{D_\infty \ln N})$, which is also optimal with a matching lower bound. Note that since $D_\infty \le O(T)$, we also recover the $O(\sqrt{T \ln N})$ regret bound of Freund and Schapire (1997), but our result seems incomparable to that of Hazan and Kale (2008). We then establish a variation bound for general convex loss functions, aiming to take one step further along the direction of (Hazan and Kale, 2008). Our result shows that for general smooth convex functions, the proposed algorithm attains an $O(\sqrt{D_2})$ bound; we also show that the smoothness assumption is unavoidable for general convex loss functions. Finally, we provide an algorithm for the online convex optimization problem studied by Hazan et al. (2007), in which the loss functions are strictly convex. Our algorithm achieves a regret of $O(N \ln T)$, which matches that of an algorithm in (Hazan et al., 2007), and when the loss functions have $L_2$-deviation $D_2$, for a large enough $D_2$, and satisfy some smoothness condition, our algorithm achieves a regret of $O(N \ln D_2)$. This can be applied to the portfolio management problem considered by Hazan and Kale (2009), as the corresponding loss functions in fact satisfy our smoothness condition, and we can achieve a regret of $O(N \ln D)$ when $\sum_{t=1}^T \|v_t - v_{t-1}\|_2^2 \le D$. As discussed before, one can again argue that our result is stronger than that of Hazan and Kale (2009).

All of our algorithms are based on the following idea, which we illustrate using the online linear optimization problem as an example. For general linear functions, the gradient descent algorithm is known to achieve an optimal regret; after round $t$ it computes the point $x_{t+1} = \Pi_X(x_t - \eta f_t)$, the projection of $x_t - \eta f_t$ onto the feasible set $X$, and plays it in the next round. Now, if the loss functions have a small deviation, $f_t$ may be close to $f_{t-1}$, so in round $t$ it may be a good idea to play a point which moves one step further in the direction of $-f_{t-1}$, as that may make its inner product with $f_t$ (which is its loss with respect to $f_t$) smaller. In fact, it can be shown that if one could play the point $x_{t+1} = \Pi_X(x_t - \eta f_t)$ in round $t$, a very small regret could be achieved, but in reality one does not have $f_t$ available before round $t$ to compute $x_{t+1}$. On the other hand, if $f_{t-1}$ is a good estimate of $f_t$, the point $\hat{x}_t = \Pi_X(x_t - \eta f_{t-1})$ should be a good estimate of $x_{t+1}$ too. The point $\hat{x}_t$ can actually be computed before round $t$, since $f_{t-1}$ is available, so our algorithm plays $\hat{x}_t$ in round $t$. Our algorithms for the prediction with expert advice problem and the online convex optimization problem use the same idea. We unify all our algorithms by a meta algorithm, which can be seen as a type of mirror descent algorithm (Nemirovski and Yudin, 1978; Beck and Teboulle, 2003), using the notion of Bregman divergence with respect to some function $R$. We then derive different algorithms for different settings simply by instantiating the meta algorithm with different choices of the function $R$.
For the linear and general smooth online convex optimization problems, the prediction with expert advice problem, and the online strictly convex optimization problem, respectively, the algorithms we derive can be seen as modifications of the gradient descent algorithm of (Zinkevich, 2003), the multiplicative update algorithm of (Littlestone and Warmuth, 1994; Freund and Schapire, 1997), and the online Newton step algorithm of (Hazan et al., 2007), with the modification based on the idea, discussed above, of moving one step further in the direction of the previous gradient.

2. Preliminaries

For a positive integer $n$, let $[n]$ denote the set $\{1, 2, \ldots, n\}$. Let $\mathbb{R}$ denote the set of real numbers, $\mathbb{R}^N$ the set of $N$-dimensional vectors over $\mathbb{R}$, and $\mathbb{R}^{N \times N}$ the set of $N \times N$ matrices over $\mathbb{R}$. We view a vector $x$ as a column vector and denote its transpose, a row vector, by $x^\top$. For a vector $x \in \mathbb{R}^N$ and an index $i \in [N]$, let $x(i)$ denote the $i$-th component of $x$. For $x, y \in \mathbb{R}^N$, let $\langle x, y\rangle = \sum_{i=1}^N x(i)y(i)$ and let $\mathrm{RE}(x\|y) = \sum_{i=1}^N x(i)\ln\frac{x(i)}{y(i)}$. All the matrices considered in this paper are symmetric, and we will assume this without stating it later. For two matrices $A$ and $B$, we write $A \succeq B$ if $A - B$ is a positive semidefinite (PSD) matrix. For $x \in \mathbb{R}^N$, let $\|x\|_p$ denote the $L_p$-norm of $x$, and for a PSD matrix $H \in \mathbb{R}^{N \times N}$, define the norm $\|x\|_H = \sqrt{x^\top H x}$. Note that if $H$ is the identity matrix, then $\|x\|_H = \|x\|_2$. We will need the following simple fact, which is proved in Appendix A.

Proposition 1 For any $y, z \in \mathbb{R}^N$ and any PSD $H \in \mathbb{R}^{N \times N}$, $\|y + z\|_H^2 \le 2\|y\|_H^2 + 2\|z\|_H^2$.

We will need the notion of Bregman divergence and the projection according to it.

Definition 2 Let $R : \mathbb{R}^N \to \mathbb{R}$ be a differentiable function and $X \subseteq \mathbb{R}^N$ a convex set. Define the Bregman divergence of $x, y \in \mathbb{R}^N$ with respect to $R$ by $B_R(x, y) = R(x) - R(y) - \langle \nabla R(y), x - y\rangle$. Define the projection of $y \in \mathbb{R}^N$ onto $X$ according to $B_R$ by $\Pi_{X,R}(y) = \arg\min_{x \in X} B_R(x, y)$.

We consider the online convex optimization problem, in which an online algorithm must play for $T$ rounds in the following way. In each round $t \in [T]$, it plays a point $x_t \in X$, for some convex feasible set $X \subseteq \mathbb{R}^N$, and after that it receives a loss function $f_t : X \to \mathbb{R}$ and suffers a loss of $f_t(x_t)$. The goal is to minimize its regret, defined as

$\sum_{t=1}^T f_t(x_t) - \min_{\pi \in X} \sum_{t=1}^T f_t(\pi)$,

which is the difference between its total loss and that of the best offline algorithm playing a single point $\pi \in X$ for all $T$ rounds. We study four special cases of this problem. The first is the online linear optimization problem, in which each loss function $f_t$ is linear. The second is the prediction with expert advice problem, which can be seen as a special case of the online linear optimization problem with the set of probability distributions over $N$ actions as the feasible set $X$. The third is the case in which the loss functions are smooth, which is a generalization of the linear optimization setting. Finally, we consider the case in which the loss functions are strictly convex in the following sense.

Definition 3 For $\beta > 0$, we say that a function $f : X \to \mathbb{R}$ is $\beta$-convex if, for all $x, y \in X$, $f(x) \ge f(y) + \langle \nabla f(y), x - y\rangle + \beta\langle \nabla f(y), x - y\rangle^2$.

As shown in (Hazan et al., 2007), all the convex functions considered there are in fact $\beta$-convex, and thus our result also applies to those convex functions. For simplicity of presentation, we will assume throughout the paper that the feasible set $X$ is a closed convex set contained in the unit ball $\{x \in \mathbb{R}^N : \|x\|_2 \le 1\}$; the extension to the general case is straightforward.
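To make Definition 2 concrete, here is a minimal Python sketch (the helper names are ours, not the paper's) evaluating a Bregman divergence from a function and its gradient, together with the two instances used later: the squared Euclidean norm, whose divergence is $\frac{1}{2}\|x - y\|_2^2$, and the entropic function, whose divergence between probability distributions is $\mathrm{RE}(x\|y)$.

```python
import numpy as np

def bregman(R, grad_R, x, y):
    """B_R(x, y) = R(x) - R(y) - <grad R(y), x - y>  (Definition 2)."""
    return R(x) - R(y) - float(grad_R(y) @ (x - y))

# R(x) = ||x||_2^2 / 2 gives B_R(x, y) = ||x - y||_2^2 / 2.
sq = lambda x: 0.5 * float(x @ x)
sq_grad = lambda x: x

# R(x) = sum_i x(i)(ln x(i) - 1) gives B_R(x, y) = RE(x||y) for distributions.
ent = lambda x: float(np.sum(x * (np.log(x) - 1)))
ent_grad = lambda x: np.log(x)

x, y = np.array([0.3, 0.7]), np.array([0.5, 0.5])
print(bregman(sq, sq_grad, x, y))    # 0.04 = ||x - y||_2^2 / 2
print(bregman(ent, ent_grad, x, y))  # equals RE(x||y) since both sum to 1
```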

Algorithm 1 Meta algorithm
1: Initially, let $x_1 = \hat{x}_1 = (1/N, \ldots, 1/N)$.
2: In round $t \in [T]$:
   (a) Play $\hat{x}_t$.
   (b) Receive $f_t$ and compute $\ell_t = \nabla f_t(\hat{x}_t)$.
   (c) Update
       $x_{t+1} = \arg\min_{x \in X}\left\{\langle \ell_t, x\rangle + B_{R_t}(x, x_t)\right\}$,
       $\hat{x}_{t+1} = \arg\min_{\hat{x} \in X}\left\{\langle \ell_t, \hat{x}\rangle + B_{R_{t+1}}(\hat{x}, x_{t+1})\right\}$.

3. Meta Algorithm

All of our algorithms in the coming sections are based on the Meta algorithm, given in Algorithm 1, which has the parameters $R_t$ for $t \in [T]$. For different types of problems we will have different choices of $R_t$, which will be specified in the respective sections. Here we allow $R_t$ to depend on $t$, although we do not need this freedom for linear functions and general convex functions; we only need it for strictly convex functions. Note that we define $x_{t+1}$ using $R_t$ instead of $R_{t+1}$ for a technical reason which will become clear in the proof of Lemma 6. Our Meta algorithm is related to the mirror descent algorithm, as it can be shown to have the following equivalent form, which is proved in Appendix B.1.

Lemma 4 Suppose $y_{t+1}$ and $\hat{y}_{t+1}$ satisfy the conditions $\nabla R_t(y_{t+1}) = \nabla R_t(x_t) - \ell_t$ and $\nabla R_{t+1}(\hat{y}_{t+1}) = \nabla R_{t+1}(x_{t+1}) - \ell_t$, respectively, for a strictly convex $R_t$. Then the update in Step (c) of the Meta algorithm is identical to

$x_{t+1} = \Pi_{X,R_t}(y_{t+1}) = \arg\min_{x \in X} B_{R_t}(x, y_{t+1})$, $\hat{x}_{t+1} = \Pi_{X,R_{t+1}}(\hat{y}_{t+1}) = \arg\min_{\hat{x} \in X} B_{R_{t+1}}(\hat{x}, \hat{y}_{t+1})$.

Note that a typical mirror descent algorithm plays in round $t$ a point roughly corresponding to our $x_t$, while we move one step further in the direction given by $\ell_{t-1}$ and instead play $\hat{x}_t = \arg\min_{\hat{x} \in X}\{\langle \ell_{t-1}, \hat{x}\rangle + B_{R_t}(\hat{x}, x_t)\}$. The intuition behind our algorithm is the following. It can be shown that if one could play $x_{t+1} = \arg\min_{x \in X}\{\langle \ell_t, x\rangle + B_{R_t}(x, x_t)\}$ in round $t$, then a small regret could be achieved, but in reality one does not have $f_t$ available to compute $x_{t+1}$ before round $t$. Nevertheless, if the loss vectors have a small deviation, $\ell_t$ is likely to be close to $\ell_{t-1}$, and so is $\hat{x}_t$ to $x_{t+1}$, which is made possible by defining $x_{t+1}$ and $\hat{x}_t$ both using $R_t$. Based on this idea, we let our algorithm play $\hat{x}_t$ in round $t$.
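Before analyzing the regret, here is a minimal Python sketch of Algorithm 1 (ours, not the paper's). The two callables `get_gradient` and `bregman_step` are assumed oracles standing in for the revealed loss gradients and the argmin in Step (c); the later sections instantiate the latter in closed form.

```python
import numpy as np

def meta_algorithm(T, N, get_gradient, bregman_step):
    """Sketch of Algorithm 1 (Meta algorithm), assuming two oracles:

    get_gradient(t, x)     -> l_t = grad f_t(x), revealed after playing in round t
    bregman_step(t, l, x0) -> argmin_{x in X} { <l, x> + B_{R_t}(x, x0) }
    """
    x = np.full(N, 1.0 / N)   # x_1 = (1/N, ..., 1/N)
    x_hat = x.copy()          # x-hat_1
    plays = []
    for t in range(1, T + 1):
        plays.append(x_hat)                      # (a) play x-hat_t
        l = get_gradient(t, x_hat)               # (b) l_t = grad f_t(x-hat_t)
        x_next = bregman_step(t, l, x)           # (c) x_{t+1} uses R_t ...
        x_hat = bregman_step(t + 1, l, x_next)   # ... x-hat_{t+1} uses R_{t+1}
        x = x_next
    return plays
```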

Now let us see how to bound the regret of the algorithm. Consider any $\pi \in X$ played by the offline algorithm. For a $\beta$-convex function $f_t$, we know from the definition that

$f_t(\hat{x}_t) - f_t(\pi) \le \langle \ell_t, \hat{x}_t - \pi\rangle - \beta\|\hat{x}_t - \pi\|_{h_t}^2$, where $h_t = \ell_t\ell_t^\top$,  (4)

while for a linear or a general convex $f_t$, the above still holds with $\beta = 0$. Thus, the key is to bound $\langle \ell_t, \hat{x}_t - \pi\rangle$, which is done by the following lemma, proved in Appendix B.2.

Lemma 5 Let $S_t = \langle \ell_t - \ell_{t-1}, \hat{x}_t - x_{t+1}\rangle$, $A_t = B_{R_t}(\pi, x_t) - B_{R_t}(\pi, x_{t+1})$, and $B_t = B_{R_t}(x_{t+1}, \hat{x}_t) + B_{R_t}(\hat{x}_t, x_t)$. Then $\langle \ell_t, \hat{x}_t - \pi\rangle \le S_t + A_t - B_t$.

The following lemma provides an upper bound for $S_t$.

Lemma 6 Suppose $\|\cdot\|$ is a norm, with dual norm $\|\cdot\|_*$, such that $\frac{1}{2}\|x - x'\|^2 \le B_{R_t}(x, x')$ for any $x, x' \in X$. Then $S_t = \langle \ell_t - \ell_{t-1}, \hat{x}_t - x_{t+1}\rangle \le \|\ell_t - \ell_{t-1}\|_*^2$.

Proof By a generalized Cauchy-Schwarz inequality, $S_t = \langle \ell_t - \ell_{t-1}, \hat{x}_t - x_{t+1}\rangle \le \|\ell_t - \ell_{t-1}\|_* \|\hat{x}_t - x_{t+1}\|$. Then we need the following, which is proved in Appendix B.3.

Proposition 7 $\|\hat{x}_t - x_{t+1}\| \le \|\nabla R_t(\hat{y}_t) - \nabla R_t(y_{t+1})\|_*$.

From this proposition, we have

$\|\hat{x}_t - x_{t+1}\| \le \|(\nabla R_t(x_t) - \ell_{t-1}) - (\nabla R_t(x_t) - \ell_t)\|_* = \|\ell_t - \ell_{t-1}\|_*$.  (5)

This is why we define $x_{t+1}$ and $y_{t+1}$ using $R_t$ instead of $R_{t+1}$. Finally, by combining these bounds together, we have the lemma.

Taking the sum over $t$ of the bounds in Lemma 5 and Lemma 6, we obtain a general regret bound for the Meta algorithm. In the following sections, we make different choices of $R_t$ and of the norms for different types of loss functions, and derive the corresponding regret bounds.

4. Linear Loss Functions

In this section, we consider the case in which each loss function $f_t$ is linear, so that it can be seen as an $N$-dimensional vector in $\mathbb{R}^N$ with $f_t(x) = \langle f_t, x\rangle$ and $\nabla f_t(x) = f_t$. We measure the deviation of the loss functions by their $L_p$-deviation, defined in (1), which becomes $\sum_{t=1}^T \|f_t - f_{t-1}\|_p^2$ for linear functions. To bound the regret suffered in each round, we can use the bound in (4) with $\beta = 0$, and we drop the term $B_t$ from the bound in Lemma 5. By summing the bound over $t$, we have

$\sum_{t=1}^T f_t(\hat{x}_t) - \sum_{t=1}^T f_t(\pi) \le \sum_{t=1}^T S_t + \sum_{t=1}^T A_t$,  (6)

where $S_t = \langle f_t - f_{t-1}, \hat{x}_t - x_{t+1}\rangle$ and $A_t = B_{R_t}(\pi, x_t) - B_{R_t}(\pi, x_{t+1})$. In the following two subsections, we consider the online linear optimization problem and the prediction with expert advice problem, respectively, for which we will make different choices of $R_t$ and use different measures of deviation.

4.1. Online Linear Optimization Problem

In this subsection, we consider the online linear optimization problem with loss functions of $L_2$-deviation $D_2$. To instantiate the Meta algorithm for such loss functions, we choose

$R_t(x) = \frac{1}{2\eta}\|x\|_2^2$, for every $t \in [T]$,

where $\eta$ is the learning rate, to be determined later; in fact, it can also be adjusted inside the algorithm using the standard doubling trick, by keeping track of the deviation accumulated so far. It is easy to show that with this choice of $R_t$,

$\nabla R_t(x) = \frac{x}{\eta}$, $B_{R_t}(x, y) = \frac{1}{2\eta}\|x - y\|_2^2$, and $\Pi_{X,R_t}(y) = \arg\min_{x \in X}\|x - y\|_2$.

Then, according to Lemma 4, the update in Step (c) of the Meta algorithm becomes:

$x_{t+1} = \arg\min_{x \in X}\|x - y_{t+1}\|_2$, with $y_{t+1} = x_t - \eta f_t$,
$\hat{x}_{t+1} = \arg\min_{\hat{x} \in X}\|\hat{x} - \hat{y}_{t+1}\|_2$, with $\hat{y}_{t+1} = x_{t+1} - \eta f_t$.

The regret achieved by our algorithm is guaranteed by the following.

Theorem 8 When the $L_2$-deviation of the loss functions is $D_2$, the regret of our algorithm is at most $O(\sqrt{D_2})$.

Proof We start by bounding the first sum in (6). Note that we can apply Lemma 6 with the norm $\|x\| = \frac{1}{\sqrt{\eta}}\|x\|_2$, since $\frac{1}{2}\|x - x'\|^2 = \frac{1}{2\eta}\|x - x'\|_2^2 = B_{R_t}(x, x')$ for any $x, x' \in X$. As the dual norm is $\|x\|_* = \sqrt{\eta}\|x\|_2$, Lemma 6 gives us

$\sum_{t=1}^T S_t \le \sum_{t=1}^T \eta\|f_t - f_{t-1}\|_2^2 \le \eta D_2$.

Next, note that $A_t = \frac{1}{2\eta}\|\pi - x_t\|_2^2 - \frac{1}{2\eta}\|\pi - x_{t+1}\|_2^2$, so the second sum in (6) is

$\sum_{t=1}^T A_t = \frac{1}{2\eta}\left(\|\pi - x_1\|_2^2 - \|\pi - x_{T+1}\|_2^2\right) \le \frac{2}{\eta}$,

by telescoping and then using the facts that $\|\pi - x_1\|_2^2 \le 4$ and $\|\pi - x_{T+1}\|_2^2 \ge 0$. Finally, by substituting these two bounds into (6), we have

$\sum_{t=1}^T (f_t(\hat{x}_t) - f_t(\pi)) \le \eta D_2 + \frac{2}{\eta} \le O(\sqrt{D_2})$,

by choosing $\eta = \sqrt{2/D_2}$, which proves the theorem.
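As a concrete illustration of the updates above, the following is a minimal sketch (ours) specialized to the unit-ball feasible set, where the projection has a closed form; $\eta$ would be set as in Theorem 8, or adjusted by the doubling trick when $D_2$ is unknown.

```python
import numpy as np

def project_unit_ball(y):
    """Euclidean projection onto X = {x : ||x||_2 <= 1}."""
    n = np.linalg.norm(y)
    return y if n <= 1.0 else y / n

def linear_updates(loss_vectors, eta):
    """Minimal sketch of the Section 4.1 updates for f_t(x) = <f_t, x>."""
    N = len(loss_vectors[0])
    x = np.full(N, 1.0 / N)     # x_1
    x_hat = x.copy()            # x-hat_1
    total_loss = 0.0
    for f in loss_vectors:
        total_loss += float(f @ x_hat)          # play x-hat_t, then observe f_t
        x = project_unit_ball(x - eta * f)      # x_{t+1} = Pi_X(x_t - eta f_t)
        x_hat = project_unit_ball(x - eta * f)  # x-hat_{t+1} = Pi_X(x_{t+1} - eta f_t)
    return total_loss
```

Note the one-line difference from plain gradient descent: the extra projected step computing $\hat{x}_{t+1}$ reuses the most recent loss vector $f_t$.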

Let us make three remarks about Theorem 8. First, as mentioned in the introduction, one can argue that our result is strictly stronger than that of (Hazan and Kale, 2008), as our deviation bound is easier to satisfy. This is because, by Proposition 1, $\|f_t - f_{t-1}\|_2^2 \le 2(\|f_t - \mu\|_2^2 + \|\mu - f_{t-1}\|_2^2)$ and thus $D_2 \le 4V + O(1)$, while, for example, with $N = 1$, $f_t = 0$ for $t \le T/2$ and $f_t = 1$ for $T/2 < t \le T$, we have $D_2 \le O(1)$ and $V \ge \Omega(T)$. Next, we claim that the regret achieved by our algorithm is optimal. This is because a matching lower bound can be shown by simply setting the loss functions of all but the first $r = D_2$ rounds to be the all-0 function, and then applying the known $\Omega(\sqrt{r})$ regret lower bound to the first $r$ rounds. Finally, our algorithm can be seen as a modification of the gradient descent (GD) algorithm of (Zinkevich, 2003), which plays $x_t$, instead of our $\hat{x}_t$, in round $t$. One may then wonder whether GD already performs as well as our algorithm does. The following lemma, proved in Appendix C.1, provides a negative answer, which means that our modification is in fact necessary.

Lemma 9 The regret of the GD algorithm is at least $\Omega(\min\{D_2, \sqrt{T}\})$.

4.2. Prediction with Expert Advice

In this subsection, we consider the prediction with expert advice problem. Now the feasible set $X$ is the set of probability distributions over $N$ actions, which can be represented as $N$-dimensional vectors. Although this problem can be seen as a special case of that in Subsection 4.1, and Theorem 8 also applies here, we would like to obtain a stronger result. More precisely, we now consider the $L_\infty$-deviation instead of the $L_2$-deviation, and we assume that the loss functions have $L_\infty$-deviation $D_\infty$. Note that with $D_2 \le D_\infty N$, Theorem 8 only gives a regret bound of $O(\sqrt{D_\infty N})$. To obtain a smaller regret, we instantiate the Meta algorithm with

$R_t(x) = \frac{1}{\eta}\sum_{i=1}^N x(i)(\ln x(i) - 1)$, for every $t \in [T]$,

where $\eta$ is the learning rate to be determined later, and recall that $x(i)$ denotes the $i$-th component of the vector $x$. It is easy to show that with this choice,

$\nabla R_t(x) = \frac{1}{\eta}(\ln x(1), \ldots, \ln x(N))$, $B_{R_t}(x, y) = \frac{1}{\eta}\mathrm{RE}(x\|y)$, and $\Pi_{X,R_t}(y) = y/Z$,

with the normalization factor $Z = \sum_{j=1}^N y(j)$. Then, according to Lemma 4, the update in Step (c) of the Meta algorithm becomes:

$x_{t+1}(i) = x_t(i)e^{-\eta f_t(i)}/Z_{t+1}$, for each $i \in [N]$, with $Z_{t+1} = \sum_{j=1}^N x_t(j)e^{-\eta f_t(j)}$,
$\hat{x}_{t+1}(i) = x_{t+1}(i)e^{-\eta f_t(i)}/\hat{Z}_{t+1}$, for each $i \in [N]$, with $\hat{Z}_{t+1} = \sum_{j=1}^N x_{t+1}(j)e^{-\eta f_t(j)}$.

Note that our algorithm can be seen as a modification of the multiplicative update algorithm (Littlestone and Warmuth, 1994; Freund and Schapire, 1997), which plays $x_t$, instead of our $\hat{x}_t$, in round $t$. The regret achieved by our algorithm is guaranteed by the following theorem, which we prove in Appendix C.2.

Theorem 10 When the $L_\infty$-deviation of the loss functions is $D_\infty$, the regret of our algorithm is at most $O(\sqrt{D_\infty \ln N})$.

We remark that the regret achieved by our algorithm is also optimal. This is because a matching lower bound can be shown by simply setting the loss functions of all but the first $r = D_\infty$ rounds to be the all-0 function, and then applying the known $\Omega(\sqrt{r \ln N})$ regret lower bound to the first $r$ rounds.
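The following minimal sketch (ours) implements the two-step multiplicative update above; by Theorem 10, $\eta$ would be set to $\sqrt{(\ln N)/D_\infty}$, or adjusted by a doubling trick when $D_\infty$ is unknown.

```python
import numpy as np

def expert_advice_updates(loss_vectors, eta):
    """Minimal sketch of the Section 4.2 updates; each entry of
    loss_vectors is a loss vector f_t in [0, 1]^N."""
    N = len(loss_vectors[0])
    x = np.full(N, 1.0 / N)    # x_1, the uniform distribution
    x_hat = x.copy()           # x-hat_1
    expected_loss = 0.0
    for f in loss_vectors:
        expected_loss += float(f @ x_hat)  # play x-hat_t; only then is f_t revealed
        x = x * np.exp(-eta * f)
        x /= x.sum()                       # x_{t+1}(i) = x_t(i) e^{-eta f_t(i)} / Z_{t+1}
        x_hat = x * np.exp(-eta * f)       # second step reuses the same f_t
        x_hat /= x_hat.sum()               # x-hat_{t+1}(i) = x_{t+1}(i) e^{-eta f_t(i)} / Z-hat_{t+1}
    return expected_loss
```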

5. General Convex Loss Functions

In this section, we consider general convex loss functions. We measure the deviation of the loss functions by their $L_2$-deviation defined in (1), which is $D_2 = \sum_{t=1}^T \max_{x \in X}\|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2^2$. Our algorithm for such loss functions is the same as the algorithm for linear functions. To bound its regret, we now need the help of the term $B_t$ in Lemma 5, and we have

$\sum_{t=1}^T (f_t(\hat{x}_t) - f_t(\pi)) \le \sum_{t=1}^T S_t + \sum_{t=1}^T A_t - \sum_{t=1}^T B_t$.  (7)

From the proof of Theorem 8, we know that $\sum_{t=1}^T A_t \le \frac{2}{\eta}$ and $\sum_{t=1}^T S_t \le \sum_{t=1}^T \eta\|\ell_t - \ell_{t-1}\|_2^2$, which, unlike in Theorem 8, cannot be immediately bounded by the $L_2$-deviation. This is because $\ell_t - \ell_{t-1} = \nabla f_t(\hat{x}_t) - \nabla f_{t-1}(\hat{x}_{t-1})$, where the two gradients are taken at different points. To handle this issue, we further assume that each loss function $f_t$ satisfies the following $\lambda$-smoothness condition:

$\|\nabla f_t(x) - \nabla f_t(y)\|_2 \le \lambda\|x - y\|_2$, for any $x, y \in X$.  (8)

We emphasize that our assumption about the smoothness of the loss functions is necessary to achieve the desired variation bound. To see this, consider the special case of $f_1(x) = \cdots = f_T(x) = f(x)$, for which $D_2 = O(1)$. If the variation bound $O(\sqrt{D_2})$ held for any sequence of convex functions, then in this special case, where all loss functions are identical, we would have $\sum_{t=1}^T f(\hat{x}_t) \le \min_{\pi \in X}\sum_{t=1}^T f(\pi) + O(1)$, implying that $(1/T)\sum_{t=1}^T \hat{x}_t$ approaches the optimal solution at the rate $O(1/T)$. This contradicts the $\Omega(1/\sqrt{T})$ lower bound on the convergence rate of any first-order optimization method for general convex functions (Nesterov, 2004, Theorem 3.2.1), and therefore the smoothness assumption is necessary to extend our results to general convex loss functions.

Our main result of this section is the following theorem, which establishes the variation bound achieved by the Meta algorithm for general smooth convex loss functions.

Theorem 11 When the loss functions have $L_2$-deviation $D_2$ and each loss function is $\lambda$-smooth, with $\lambda \le \sqrt{D_2/8}$, the regret of our algorithm is at most $O(\sqrt{D_2})$.

The proof of Theorem 11 follows immediately from the next two lemmas. First, we need the following to bound $\sum_{t=1}^T S_t$ in terms of $D_2$.

Lemma 12 $\sum_{t=1}^T \|\ell_t - \ell_{t-1}\|_2^2 \le 2D_2 + 2\lambda^2\sum_{t=2}^T \|\hat{x}_t - \hat{x}_{t-1}\|_2^2$.

Proof Note that $\|\ell_t - \ell_{t-1}\|_2^2 = \|\nabla f_t(\hat{x}_t) - \nabla f_{t-1}(\hat{x}_{t-1})\|_2^2$, which by Proposition 1 is at most

$2\|\nabla f_t(\hat{x}_t) - \nabla f_{t-1}(\hat{x}_t)\|_2^2 + 2\|\nabla f_{t-1}(\hat{x}_t) - \nabla f_{t-1}(\hat{x}_{t-1})\|_2^2$,

where the second term above is at most $2\lambda^2\|\hat{x}_t - \hat{x}_{t-1}\|_2^2$ by the $\lambda$-smoothness condition. By summing the bound over $t$, we have the lemma.

To eliminate the undesirable term $2\lambda^2\sum_{t=2}^T\|\hat{x}_t - \hat{x}_{t-1}\|_2^2$ in the lemma, we use the help of the sum $\sum_{t=1}^T B_t$, which has the following lower bound.

Lemma 13 $\sum_{t=1}^T B_t \ge \frac{1}{4\eta}\sum_{t=2}^T \|\hat{x}_t - \hat{x}_{t-1}\|_2^2$.

Proof Recall that $B_t = \frac{1}{2\eta}\|x_{t+1} - \hat{x}_t\|_2^2 + \frac{1}{2\eta}\|\hat{x}_t - x_t\|_2^2$, so we can write $\sum_{t=1}^T B_t$ as

$\frac{1}{2\eta}\|x_{T+1} - \hat{x}_T\|_2^2 + \frac{1}{2\eta}\|\hat{x}_1 - x_1\|_2^2 + \sum_{t=2}^T \frac{1}{2\eta}\left(\|x_t - \hat{x}_{t-1}\|_2^2 + \|\hat{x}_t - x_t\|_2^2\right) \ge \frac{1}{4\eta}\sum_{t=2}^T \|\hat{x}_t - \hat{x}_{t-1}\|_2^2$,

by Proposition 1, with $H$ being the identity matrix so that $\|x\|_H = \|x\|_2$.

According to the bounds obtained so far, the regret of our algorithm is at most

$2\eta D_2 + 2\eta\lambda^2\sum_{t=2}^T \|\hat{x}_t - \hat{x}_{t-1}\|_2^2 - \frac{1}{4\eta}\sum_{t=2}^T \|\hat{x}_t - \hat{x}_{t-1}\|_2^2 + \frac{2}{\eta} \le O\left(\eta D_2 + \frac{1}{\eta}\right) \le O(\sqrt{D_2})$,

when $\lambda \le 1/(\sqrt{8}\eta)$ and $\eta = 1/\sqrt{D_2}$, which together amount to the condition $\lambda \le \sqrt{D_2/8}$ of Theorem 11.

6. Strictly Convex Loss Functions

In this section, we consider convex loss functions which are strictly convex. More precisely, suppose that for some $\beta > 0$, each loss function is $\beta$-convex, so that

$f_t(\hat{x}_t) - f_t(\pi) \le \langle \ell_t, \hat{x}_t - \pi\rangle - \beta\|\pi - \hat{x}_t\|_{h_t}^2$, where $h_t = \ell_t\ell_t^\top$.  (9)

Again, we measure the deviation of the loss functions by their $L_2$-deviation, defined in (1). To instantiate the Meta algorithm for such loss functions, we choose

$R_t(x) = \frac{1}{2}\|x\|_{H_t}^2$, with $H_t = I + \beta\gamma^2 I + \beta\sum_{\tau=1}^{t-1}\ell_\tau\ell_\tau^\top$,

where $I$ is the $N \times N$ identity matrix and $\gamma$ is an upper bound on $\|\ell_t\|_2$ for every $t$, so that $\gamma^2 I \succeq \ell_t\ell_t^\top$. It is easy to show that with this choice,

$\nabla R_t(x) = H_t x$, $B_{R_t}(x, y) = \frac{1}{2}\|x - y\|_{H_t}^2$, and $\Pi_{X,R_t}(y) = \arg\min_{x \in X}\|x - y\|_{H_t}$.

Then, according to Lemma 4, the update in Step (c) of the Meta algorithm becomes:

$x_{t+1} = \arg\min_{x \in X}\|x - y_{t+1}\|_{H_t}$, with $y_{t+1} = x_t - H_t^{-1}\ell_t$,
$\hat{x}_{t+1} = \arg\min_{\hat{x} \in X}\|\hat{x} - \hat{y}_{t+1}\|_{H_{t+1}}$, with $\hat{y}_{t+1} = x_{t+1} - H_{t+1}^{-1}\ell_t$.

We remark that our algorithm is related to the online Newton step algorithm in (Hazan et al., 2007), except that our matrix $H_t$ is slightly different from theirs, and we play $\hat{x}_t$ in round $t$ while they play a point roughly corresponding to our $x_t$. It is easy to verify that the update of our algorithm can be computed at the end of round $t$, because we have $\ell_1, \ldots, \ell_t$ available to compute $H_t$ and $H_{t+1}$.
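As an illustration of this instantiation, the following minimal sketch (ours) maintains $H_t$ and performs the two updates. Here `proj_H`, computing $\arg\min_{x \in X}\|x - y\|_H$, is an assumed oracle (for a general convex $X$ it is a small quadratic program), and `get_gradient` is as in the meta-algorithm sketch of Section 3.

```python
import numpy as np

def strictly_convex_updates(T, N, get_gradient, proj_H, beta, gamma):
    """Minimal sketch of the Section 6 instantiation, R_t(x) = ||x||_{H_t}^2 / 2.
    proj_H(y, H) is an assumed oracle for argmin_{x in X} ||x - y||_H."""
    H = (1.0 + beta * gamma ** 2) * np.eye(N)   # H_1 = I + beta * gamma^2 * I
    x = np.full(N, 1.0 / N)                     # x_1
    x_hat = x.copy()                            # x-hat_1
    plays = []
    for t in range(1, T + 1):
        plays.append(x_hat)                     # play x-hat_t, then observe f_t
        l = get_gradient(t, x_hat)              # l_t = grad f_t(x-hat_t)
        H_next = H + beta * np.outer(l, l)      # H_{t+1} = H_t + beta * l_t l_t^T
        x_next = proj_H(x - np.linalg.solve(H, l), H)                # via y_{t+1}
        x_hat = proj_H(x_next - np.linalg.solve(H_next, l), H_next)  # via y-hat_{t+1}
        x, H = x_next, H_next
    return plays
```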

To bound the regret of our algorithm, note that by substituting the bound of Lemma 5 into (9) and then taking the sum over $t$, we obtain

$\sum_{t=1}^T (f_t(\hat{x}_t) - f_t(\pi)) \le \sum_{t=1}^T S_t + \sum_{t=1}^T A_t - \sum_{t=1}^T B_t - \sum_{t=1}^T C_t$,  (10)

with $S_t = \langle \ell_t - \ell_{t-1}, \hat{x}_t - x_{t+1}\rangle$, $A_t = \frac{1}{2}\|\pi - x_t\|_{H_t}^2 - \frac{1}{2}\|\pi - x_{t+1}\|_{H_t}^2$, $B_t = \frac{1}{2}\|x_{t+1} - \hat{x}_t\|_{H_t}^2 + \frac{1}{2}\|\hat{x}_t - x_t\|_{H_t}^2$, and $C_t = \beta\|\pi - \hat{x}_t\|_{h_t}^2$. Then our key lemma is the following, which is proved in Appendix D.1.

Lemma 14 Suppose the loss functions are $\beta$-convex for some $\beta > 0$. Then

$\sum_{t=1}^T S_t + \sum_{t=1}^T A_t - \sum_{t=1}^T C_t \le O(1 + \beta\gamma^2) + \frac{8N}{\beta}\ln\left(1 + \frac{\beta}{4}\sum_{t=1}^T \|\ell_t - \ell_{t-1}\|_2^2\right)$.

Note that the lemma does not use the nonnegative sum $\sum_{t=1}^T B_t$, but it already provides a regret bound matching that in (Hazan et al., 2007). To bound $\sum_{t=1}^T \|\ell_t - \ell_{t-1}\|_2^2$ in terms of the $L_2$-deviation, we again assume that each loss function satisfies the $\lambda$-smoothness condition defined in (8), and we will also use the help of the sum $\sum_{t=1}^T B_t$. To get a cleaner regret bound, let us assume without loss of generality that $\lambda \ge 1$ and $\beta \le 1$, because otherwise we can set $\lambda = 1$ or $\beta = 1$ and the inequalities in (8) and (9) still hold. Our main result of this section is the following, which we prove in Appendix D.3.

Theorem 15 Suppose the loss functions are $\beta$-convex and their $L_2$-deviation is $D_2$, with $\beta \le 1$ and $D_2 \ge 1$. Furthermore, suppose each loss function is $\lambda$-smooth, with $\lambda \ge 1$, and has gradients of $L_2$-norm at most $\gamma$. Then the regret of our algorithm is at most $O(\beta\gamma^2 + (N/\beta)\ln(\lambda N D_2))$, which becomes $O((N/\beta)\ln D_2)$ for a large enough $D_2$.

An immediate application of Theorem 15 is to the portfolio management problem considered in (Hazan and Kale, 2009). In this problem, the feasible set $X$ is the $N$-dimensional probability simplex, and each loss function has the form $f_t(x) = -\ln\langle v_t, x\rangle$, with $v_t \in [\delta, 1]^N$ for some $\delta \in (0, 1)$. A natural measure of deviation, extending that of (Hazan and Kale, 2009), for such loss functions is $D = \sum_{t=1}^T \|v_t - v_{t-1}\|_2^2$. By applying Theorem 15 to this problem, we have the following, which is proved in Appendix D.4.

Corollary 16 For the portfolio management problem described above, there is an online algorithm which achieves a regret of $O((N/\delta^2)\ln((N/\delta)D))$.
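As a small illustration (ours, not the paper's), the gradient of the portfolio loss and the deviation measure used in Corollary 16 can be computed as follows; the price-relative vectors in the example are made up.

```python
import numpy as np

def portfolio_gradient(v, x):
    """Gradient of f_t(x) = -ln <v_t, x>: grad f_t(x) = -v_t / <v_t, x>."""
    return -v / float(v @ x)

def portfolio_deviation(price_relatives):
    """The deviation D = sum_t ||v_t - v_{t-1}||_2^2 used in Corollary 16."""
    return sum(np.linalg.norm(v - w) ** 2
               for v, w in zip(price_relatives[1:], price_relatives[:-1]))

vs = [np.array([0.90, 1.00]), np.array([0.92, 0.99]), np.array([0.91, 1.00])]
print(portfolio_deviation(vs))
print(portfolio_gradient(vs[0], np.array([0.5, 0.5])))
```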

Acknowledgments

The authors T. Yang, M. Mahdavi, and R. Jin acknowledge support from the National Science Foundation (IIS) and the Office of Naval Research (ONR awards N and N). The authors C.-K. Chiang, C.-J. Lee, and C.-J. Lu acknowledge support from Academia Sinica and the National Science Council (NSC E MY3) of Taiwan.

References

Jacob Abernethy, Peter L. Bartlett, Alexander Rakhlin, and Ambuj Tewari. Optimal strategies and minimax lower bounds for online convex games. In COLT, 2008.

Amir Beck and Marc Teboulle. Mirror descent and nonlinear projected subgradient methods for convex optimization. Operations Research Letters, 31(3):167-175, 2003.

Stephen Boyd and Lieven Vandenberghe. Convex Optimization. Cambridge University Press, New York, NY, USA, 2004.

Nicolò Cesa-Bianchi and Gábor Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, 2006.

Thomas Cover. Universal portfolios. Mathematical Finance, 1(1):1-29, 1991.

Yoav Freund and Robert E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1):119-139, 1997.

Elad Hazan and Satyen Kale. Extracting certainty from uncertainty: regret bounded by variation in costs. In COLT, pages 57-68, 2008.

Elad Hazan and Satyen Kale. On stochastic and worst-case models for investing. In NIPS, 2009.

Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169-192, 2007.

Nick Littlestone and Manfred K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212-261, 1994.

A. Nemirovski and D. Yudin. Problem Complexity and Method Efficiency in Optimization. Nauka Publishers, Moscow, 1978.

Yurii Nesterov. Introductory Lectures on Convex Optimization: A Basic Course (Applied Optimization). Springer Netherlands, 1st edition, 2004.

Alexander Rakhlin, Karthik Sridharan, and Ambuj Tewari. Online learning: stochastic, constrained, and smoothed adversaries. In NIPS, 2011.

Nati Srebro, Karthik Sridharan, and Ambuj Tewari. On the universality of online mirror descent. In NIPS, 2011.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pages 928-936, 2003.

Appendix A. Proof of Proposition 1

By definition,

$2\|y\|_H^2 + 2\|z\|_H^2 - \|y + z\|_H^2 = \|y\|_H^2 + \|z\|_H^2 - 2y^\top H z = \|y - z\|_H^2 \ge 0$,

which implies that $\|y + z\|_H^2 \le 2\|y\|_H^2 + 2\|z\|_H^2$.

Appendix B. Proofs in Section 3

B.1. Proof of Lemma 4

The lemma follows immediately from the following known fact (see, e.g., (Beck and Teboulle, 2003; Srebro et al., 2011)); we give the proof for completeness.

Proposition 17 Suppose $R$ is strictly convex and differentiable, and $y$ satisfies the condition $\nabla R(y) = \nabla R(u) - \ell$. Then

$\arg\min_{x \in X}\left(\langle \ell, x\rangle + B_R(x, u)\right) = \arg\min_{x \in X} B_R(x, y)$.

Proof Since $R$ is strictly convex, the minimum on each side is achieved by a unique point. Next, note that

$B_R(x, y) = R(x) - R(y) - \langle \nabla R(y), x - y\rangle = R(x) - \langle \nabla R(y), x\rangle + c_1$,

where $c_1 = -R(y) + \langle \nabla R(y), y\rangle$ does not depend on the variable $x$. Thus, using the condition $\nabla R(y) = \nabla R(u) - \ell$, we have

$\arg\min_{x \in X} B_R(x, y) = \arg\min_{x \in X}\left(R(x) - \langle \nabla R(u) - \ell, x\rangle\right) = \arg\min_{x \in X}\left(\langle \ell, x\rangle + R(x) - \langle \nabla R(u), x\rangle\right)$.

On the other hand, $B_R(x, u) = R(x) - R(u) - \langle \nabla R(u), x - u\rangle = R(x) - \langle \nabla R(u), x\rangle + c_2$, where $c_2 = -R(u) + \langle \nabla R(u), u\rangle$ does not depend on the variable $x$. Thus, we have

$\arg\min_{x \in X}\left(\langle \ell, x\rangle + B_R(x, u)\right) = \arg\min_{x \in X}\left(\langle \ell, x\rangle + R(x) - \langle \nabla R(u), x\rangle\right) = \arg\min_{x \in X} B_R(x, y)$.

B.2. Proof of Lemma 5

Let us write $\langle \ell_t, \hat{x}_t - \pi\rangle = \langle \ell_t, \hat{x}_t - x_{t+1}\rangle + \langle \ell_t, x_{t+1} - \pi\rangle$, which in turn equals

$\langle \ell_t - \ell_{t-1}, \hat{x}_t - x_{t+1}\rangle + \langle \ell_{t-1}, \hat{x}_t - x_{t+1}\rangle + \langle \ell_t, x_{t+1} - \pi\rangle$.  (11)

To bound the second and third terms in (11), we rely on the following.

Proposition 18 Suppose $\ell \in \mathbb{R}^N$, $v = \arg\min_{x \in X}\left(\langle \ell, x\rangle + B_R(x, u)\right)$, and $w \in X$. Then

$\langle \ell, v - w\rangle \le B_R(w, u) - B_R(w, v) - B_R(v, u)$.

Proof We need the following well-known fact; for a proof, see, e.g., (Boyd and Vandenberghe, 2004).

Fact 1 Let $X \subseteq \mathbb{R}^N$ be a convex set and $x^* = \arg\min_{z \in X}\varphi(z)$ for some continuous and differentiable function $\varphi : X \to \mathbb{R}$. Then for any $w \in X$, $\langle \nabla\varphi(x^*), w - x^*\rangle \ge 0$.

Let $\varphi$ be the function defined by $\varphi(x) = \langle \ell, x\rangle + B_R(x, u)$. Since $X$ is a convex set and $v$ is the minimizer of $\varphi(x)$ over $x \in X$, it follows from Fact 1 that $\langle \nabla\varphi(v), w - v\rangle \ge 0$. Since $\nabla\varphi(v) = \ell + \nabla R(v) - \nabla R(u)$, we have

$\langle \ell, v - w\rangle \le \langle \nabla R(v) - \nabla R(u), w - v\rangle$.

Then, by the definition of Bregman divergence, we obtain

$B_R(w, u) - B_R(w, v) - B_R(v, u) = -\langle \nabla R(u), w - v\rangle + \langle \nabla R(v), w - v\rangle = \langle \nabla R(v) - \nabla R(u), w - v\rangle$.

As a result, we have

$\langle \ell, v - w\rangle \le \langle \nabla R(v) - \nabla R(u), w - v\rangle = B_R(w, u) - B_R(w, v) - B_R(v, u)$.

From Proposition 18 and the definitions of $\hat{x}_t$ and $x_{t+1}$, we have

$\langle \ell_{t-1}, \hat{x}_t - x_{t+1}\rangle \le B_{R_t}(x_{t+1}, x_t) - B_{R_t}(x_{t+1}, \hat{x}_t) - B_{R_t}(\hat{x}_t, x_t)$,  (12)

and

$\langle \ell_t, x_{t+1} - \pi\rangle \le B_{R_t}(\pi, x_t) - B_{R_t}(\pi, x_{t+1}) - B_{R_t}(x_{t+1}, x_t)$.  (13)

Combining the bounds in (11), (12), and (13) together, we have the lemma.

B.3. Proof of Proposition 7

To simplify the notation, let $R = R_t$, $x = \hat{x}_t$, $x' = x_{t+1}$, $y = \hat{y}_t$, and $y' = y_{t+1}$. Then, from the property of the norm assumed in Lemma 6, we know that

$\frac{1}{2}\|x - x'\|^2 \le R(x) - R(x') - \langle \nabla R(x'), x - x'\rangle$, and also $\frac{1}{2}\|x' - x\|^2 \le R(x') - R(x) - \langle \nabla R(x), x' - x\rangle$.

Adding these two bounds, we obtain

$\|x - x'\|^2 \le \langle \nabla R(x) - \nabla R(x'), x - x'\rangle$.  (14)

Next, we show that

$\langle \nabla R(x) - \nabla R(x'), x - x'\rangle \le \langle \nabla R(y) - \nabla R(y'), x - x'\rangle$.  (15)

For this, we need Fact 1 in Appendix B.2. By letting $\varphi(z) = B_R(z, y)$, we have $x = \arg\min_{z \in X}\varphi(z)$, $\nabla\varphi(x) = \nabla R(x) - \nabla R(y)$, and $\langle \nabla R(x) - \nabla R(y), x' - x\rangle \ge 0$.

On the other hand, by letting $\varphi(z) = B_R(z, y')$, we have $x' = \arg\min_{z \in X}\varphi(z)$, $\nabla\varphi(x') = \nabla R(x') - \nabla R(y')$, and $\langle \nabla R(x') - \nabla R(y'), x - x'\rangle \ge 0$. Combining these two bounds, we have

$\langle(\nabla R(y) - \nabla R(y')) - (\nabla R(x) - \nabla R(x')), x - x'\rangle \ge 0$,

which implies the inequality in (15). Finally, by combining (14) and (15), we obtain

$\|x - x'\|^2 \le \langle \nabla R(y) - \nabla R(y'), x - x'\rangle \le \|\nabla R(y) - \nabla R(y')\|_* \|x - x'\|$,

by a generalized Cauchy-Schwarz inequality. Dividing both sides by $\|x - x'\|$, we have the proposition.

Appendix C. Proofs in Section 4

C.1. Proof of Lemma 9 in Section 4

One may wonder if the GD algorithm can also achieve the same regret as our algorithm's by choosing the learning rate $\eta$ properly. We show that no matter what learning rate $\eta$ the GD algorithm chooses, there exists a sequence of loss vectors which causes a large regret. Let $f$ be any unit vector, and take $x_1 = 0$. Let $s = 1/\eta$, so that if we use $f_t = f$ for every $t \le s$, each such $y_{t+1} = x_1 - t\eta f$ still remains in $X$, and thus $x_{t+1} = y_{t+1}$. Next, we analyze the regret by considering the following three cases, depending on the range of $s$.

First, when $s \ge \sqrt{T}$, we choose $f_t = f$ for $t$ from 1 to $\min\{s/2, T\}$ and $f_t = 0$ for the remaining $t$. Clearly, the best strategy of the offline algorithm is to play $\pi = -f$. On the other hand, since the learning rate $\eta$ is too small, the strategy $x_t$ played by GD, for $t \le s/2$, stays far away from $\pi$, so that $\langle f_t, x_t - \pi\rangle \ge 1 - t\eta \ge 1/2$. Therefore, the regret is at least $\min\{s/2, T\}\cdot(1/2) = \Omega(\sqrt{T})$, while the deviation is only $O(1)$.

Second, when $1 \le s < \sqrt{T}$, the learning rate is high enough that GD may overreact to each loss vector, and we make it pay by flipping the direction of the loss vectors frequently. More precisely, we use the vector $f$ for the first $s$ rounds, so that $x_{t+1} = x_1 - t\eta f$ for any $t \le s$, but just as $x_{s+1}$ has moved far enough in the direction of $-f$, we make GD pay by switching the loss vector to $-f$, which we continue to use for $s$ rounds. Note that $x_{s+1+r} = x_{s+1-r}$ but $f_{s+1+r} = -f_{s+1-r}$ for any $r \le s$, so

$\sum_{t=1}^{2s}\langle f_t, x_t - x_1\rangle = \langle f_{s+1}, x_{s+1} - x_1\rangle \ge \Omega(1)$.

As the play then returns back to $x_1$, we can see the first $2s$ rounds as a period, which contributes only $\|2f\|_2^2 = 4$ to the deviation. Then we repeat the period $\tau$ times, where $\tau = D_2/4$ if there are enough rounds, with $T/(2s) \ge D_2/4$, to use up the deviation $D_2$, and $\tau = T/(2s)$ otherwise. For any remaining round $t$, we simply choose $f_t = 0$. As a result, the total regret is at least $\Omega(1)\cdot\tau = \Omega(\min\{D_2/4, T/(2s)\}) = \Omega(\min\{D_2, \sqrt{T}\})$, since $s < \sqrt{T}$.

Finally, when $s < 1$, the learning rate is so high that we can easily make GD pay by flipping the direction of the loss vector in each round. More precisely, by starting with $f_1 = f$, we can have $x_2$ on the boundary of $X$, which means that if we then alternate between $-f$ and $f$, the strategies GD plays will alternate between two points $x_3$ and $x_2$ which have a constant distance from each other. Then, following the analysis in the second case, one can show that the total regret is at least $\Omega(\min\{D_2, T\}) \ge \Omega(\min\{D_2, \sqrt{T}\})$.

C.2. Proof of Theorem 10

We start by bounding the first sum in (6). Note that we can apply Lemma 6 with the norm $\|x\| = \frac{1}{\sqrt{\eta}}\|x\|_1$, since for any $x, x' \in X$,

$\frac{1}{2}\|x - x'\|^2 = \frac{1}{2\eta}\|x - x'\|_1^2 \le \frac{1}{\eta}\mathrm{RE}(x\|x') = B_{R_t}(x, x')$,

by Pinsker's inequality. As the dual norm is $\|x\|_* = \sqrt{\eta}\|x\|_\infty$, Lemma 6 gives us

$\sum_{t=1}^T S_t \le \sum_{t=1}^T \eta\|f_t - f_{t-1}\|_\infty^2 \le \eta D_\infty$.

Next, note that $A_t = \frac{1}{\eta}\mathrm{RE}(\pi\|x_t) - \frac{1}{\eta}\mathrm{RE}(\pi\|x_{t+1})$, so the second sum in (6) is

$\sum_{t=1}^T A_t = \frac{1}{\eta}\left(\mathrm{RE}(\pi\|x_1) - \mathrm{RE}(\pi\|x_{T+1})\right) \le \frac{\ln N}{\eta}$,

by telescoping and then using the facts that $\mathrm{RE}(\pi\|x_1) \le \ln N$ and $\mathrm{RE}(\pi\|x_{T+1}) \ge 0$. Finally, by substituting these two bounds into (6), we have

$\sum_{t=1}^T (f_t(\hat{x}_t) - f_t(\pi)) \le \eta D_\infty + \frac{\ln N}{\eta} \le O\left(\sqrt{D_\infty \ln N}\right)$,

by choosing $\eta = \sqrt{(\ln N)/D_\infty}$, which proves the theorem.

Appendix D. Proofs in Section 6

D.1. Proof of Lemma 14

We start by bounding the first sum in (10). Note that we can apply Lemma 6 with the norm $\|\cdot\| = \|\cdot\|_{H_t}$, since $\frac{1}{2}\|x - x'\|_{H_t}^2 = B_{R_t}(x, x')$ for any $x, x' \in X$. As the dual norm is $\|\cdot\|_{H_t^{-1}}$, Lemma 6 gives us

$\sum_{t=1}^T S_t \le \sum_{t=1}^T \|\ell_t - \ell_{t-1}\|_{H_t^{-1}}^2$.

Next, we bound the second sum $\sum_{t=1}^T A_t$ in (10), which can be written as

$\frac{1}{2}\|\pi - x_1\|_{H_1}^2 - \frac{1}{2}\|\pi - x_{T+1}\|_{H_{T+1}}^2 + \frac{1}{2}\sum_{t=1}^T\left(\|\pi - x_{t+1}\|_{H_{t+1}}^2 - \|\pi - x_{t+1}\|_{H_t}^2\right)$.

Since $\|\pi - x_1\|_{H_1}^2 = O(1 + \beta\gamma^2)$, $\|\pi - x_{T+1}\|_{H_{T+1}}^2 \ge 0$, and $H_{t+1} - H_t = \beta h_t$, we have

$\sum_{t=1}^T A_t \le O(1 + \beta\gamma^2) + \frac{\beta}{2}\sum_{t=1}^T \|\pi - x_{t+1}\|_{h_t}^2$.

Note that, unlike in the case of linear functions, here the sum does not telescope, and hence we do not have a small bound for it. The last sum $\sum_{t=1}^T C_t$ in (10) now comes to help. Recall that $C_t = \beta\|\pi - \hat{x}_t\|_{h_t}^2$, so by Proposition 1,

$\frac{\beta}{2}\|\pi - x_{t+1}\|_{h_t}^2 - C_t \le \beta\|\pi - \hat{x}_t\|_{h_t}^2 + \beta\|\hat{x}_t - x_{t+1}\|_{h_t}^2 - C_t = \beta\|\hat{x}_t - x_{t+1}\|_{h_t}^2$,

which, by the facts that $H_t \succeq \beta\gamma^2 I \succeq \beta h_t$ and the bound in (5), is at most

$\|\hat{x}_t - x_{t+1}\|_{H_t}^2 \le \|\ell_t - \ell_{t-1}\|_{H_t^{-1}}^2$.

Combining the bounds derived so far, we obtain

$\sum_{t=1}^T S_t + \sum_{t=1}^T A_t - \sum_{t=1}^T C_t \le O(1 + \beta\gamma^2) + 2\sum_{t=1}^T \|\ell_t - \ell_{t-1}\|_{H_t^{-1}}^2$.  (16)

Finally, to complete our proof of Lemma 14, we rely on the following, which provides a bound for the last term in (16) and is proved in Appendix D.2.

Lemma 19 $\sum_{t=1}^T \|\ell_t - \ell_{t-1}\|_{H_t^{-1}}^2 \le \frac{4N}{\beta}\ln\left(1 + \frac{\beta}{4}\sum_{t=1}^T \|\ell_t - \ell_{t-1}\|_2^2\right)$.

D.2. Proof of Lemma 19

We need the following lemma from (Hazan et al., 2007).

Lemma 20 Let $u_t \in \mathbb{R}^N$, for $t \in [T]$, be a sequence of vectors, and define $V_t = I + \sum_{\tau=1}^t u_\tau u_\tau^\top$. Then

$\sum_{t=1}^T u_t^\top V_t^{-1} u_t \le N\ln\left(1 + \sum_{t=1}^T \|u_t\|_2^2\right)$.

To prove Lemma 19, first note that for any $t \in [T]$,

$H_t = I + \beta\gamma^2 I + \beta\sum_{\tau=1}^{t-1}\ell_\tau\ell_\tau^\top \succeq I + \frac{\beta}{2}\sum_{\tau=1}^{t}\left(\ell_\tau\ell_\tau^\top + \ell_{\tau-1}\ell_{\tau-1}^\top\right)$,

since $\gamma^2 I \succeq \ell_t\ell_t^\top$ and $\ell_0$ is the all-0 vector. Next, we claim that

$\frac{1}{2}\left(\ell_\tau\ell_\tau^\top + \ell_{\tau-1}\ell_{\tau-1}^\top\right) \succeq \frac{1}{4}\left(\ell_\tau - \ell_{\tau-1}\right)\left(\ell_\tau - \ell_{\tau-1}\right)^\top$.

This is because, by subtracting four times the right-hand side from four times the left-hand side, we have

$2\ell_\tau\ell_\tau^\top + 2\ell_{\tau-1}\ell_{\tau-1}^\top - \left(\ell_\tau - \ell_{\tau-1}\right)\left(\ell_\tau - \ell_{\tau-1}\right)^\top = \left(\ell_\tau + \ell_{\tau-1}\right)\left(\ell_\tau + \ell_{\tau-1}\right)^\top \succeq 0$.

Thus, with $K_t = I + \frac{\beta}{4}\sum_{\tau=1}^{t}\left(\ell_\tau - \ell_{\tau-1}\right)\left(\ell_\tau - \ell_{\tau-1}\right)^\top$, we have $H_t \succeq K_t$ and hence $H_t^{-1} \preceq K_t^{-1}$. This implies that

$\sum_{t=1}^T \|\ell_t - \ell_{t-1}\|_{H_t^{-1}}^2 \le \sum_{t=1}^T \|\ell_t - \ell_{t-1}\|_{K_t^{-1}}^2 = \frac{4}{\beta}\sum_{t=1}^T u_t^\top\left(I + \sum_{\tau=1}^t u_\tau u_\tau^\top\right)^{-1} u_t$,

where $u_t = \frac{\sqrt{\beta}}{2}(\ell_t - \ell_{t-1})$, which by Lemma 20 is at most

$\frac{4N}{\beta}\ln\left(1 + \frac{\beta}{4}\sum_{t=1}^T \|\ell_t - \ell_{t-1}\|_2^2\right)$.

D.3. Proof of Theorem 15

To bound the last term in the bound of Lemma 14 in terms of our deviation bound $D_2$, we use Lemma 12 from Section 5. Combining this with (10), we can upper-bound $\sum_{t=1}^T (f_t(\hat{x}_t) - f_t(\pi))$ by

$O(1 + \beta\gamma^2) + \frac{8N}{\beta}\ln\left(1 + \frac{\beta}{2}D_2 + \frac{\beta}{2}\lambda^2\sum_{t=2}^T\|\hat{x}_t - \hat{x}_{t-1}\|_2^2\right) - \sum_{t=1}^T B_t$.  (17)

To eliminate the undesirable last term inside the logarithm above, we need the help of the sum $\sum_{t=1}^T B_t$, which has the following bound.

Lemma 21 $\sum_{t=1}^T B_t \ge \frac{1}{4}\sum_{t=2}^T \|\hat{x}_t - \hat{x}_{t-1}\|_2^2$.

Proof Recall that $B_t = \frac{1}{2}\|x_{t+1} - \hat{x}_t\|_{H_t}^2 + \frac{1}{2}\|\hat{x}_t - x_t\|_{H_t}^2$, so we can write $\sum_{t=1}^T B_t$ as

$\frac{1}{2}\|x_{T+1} - \hat{x}_T\|_{H_T}^2 + \frac{1}{2}\|\hat{x}_1 - x_1\|_{H_1}^2 + \sum_{t=2}^T\left(\frac{1}{2}\|x_t - \hat{x}_{t-1}\|_{H_{t-1}}^2 + \frac{1}{2}\|\hat{x}_t - x_t\|_{H_t}^2\right) \ge \sum_{t=2}^T\left(\frac{1}{2}\|x_t - \hat{x}_{t-1}\|_{H_{t-1}}^2 + \frac{1}{2}\|\hat{x}_t - x_t\|_{H_{t-1}}^2\right)$,

since $H_t \succeq H_{t-1}$ and the two boundary terms are nonnegative. By Proposition 1, this is at least

$\frac{1}{4}\sum_{t=2}^T \|\hat{x}_t - \hat{x}_{t-1}\|_{H_{t-1}}^2 \ge \frac{1}{4}\sum_{t=2}^T \|\hat{x}_t - \hat{x}_{t-1}\|_2^2$,

as $H_{t-1} \succeq I$, where $I$ is the identity matrix.

Applying this lemma to (17), we obtain a regret bound of the form

$O(1 + \beta\gamma^2) + \frac{8N}{\beta}\ln\left(1 + \frac{\beta}{2}D_2 + \frac{\beta}{2}\lambda^2 W\right) - \frac{W}{4}$, where $W = \sum_{t=2}^T \|\hat{x}_t - \hat{x}_{t-1}\|_2^2$.

Observe that the combination of the last two terms above becomes negative when $W \ge (\lambda N D_2)^c/\beta$ for some large enough constant $c$, as we assume $\beta \le 1$ and $\lambda, D_2 \ge 1$; otherwise, $W < (\lambda N D_2)^c/\beta$ and the logarithmic term is at most $O((N/\beta)\ln(\lambda N D_2))$. Thus, the regret bound is at most $O(\beta\gamma^2 + (N/\beta)\ln(\lambda N D_2))$, which completes the proof of Theorem 15.

D.4. Proof of Corollary 16

Recall that each loss function has the form $f_t(x) = -\ln\langle v_t, x\rangle$ for some $v_t \in [\delta, 1]^N$ with $\delta \in (0, 1)$, and note that $\nabla f_t(x) = -v_t/\langle v_t, x\rangle$. To apply Theorem 15, we need to determine the parameters $\beta$, $\gamma$, $\lambda$, and $D_2$. First, by a Taylor expansion, we know that for any $x, y \in X$ there is some $\xi_t$ on the line segment between $x$ and $y$ such that

$f_t(x) = f_t(y) + \langle \nabla f_t(y), x - y\rangle + \frac{1}{2\langle v_t, \xi_t\rangle^2}(x - y)^\top v_t v_t^\top (x - y)$,

where the last term above equals

$\frac{\langle v_t, x - y\rangle^2}{2\langle v_t, \xi_t\rangle^2} = \frac{\langle v_t, y\rangle^2}{2\langle v_t, \xi_t\rangle^2}\langle \nabla f_t(y), x - y\rangle^2 \ge \frac{\delta^2}{2}\langle \nabla f_t(y), x - y\rangle^2$,

since $\langle v_t, y\rangle \ge \delta$ and $\langle v_t, \xi_t\rangle \le 1$. Thus, we can choose $\beta = \delta^2/2$. Next, since $\|\nabla f_t(x)\|_2 = \|v_t\|_2/\langle v_t, x\rangle \le \sqrt{N}/\delta$, we can choose $\gamma = \sqrt{N}/\delta$. Third, note that

$\|\nabla f_t(x) - \nabla f_t(y)\|_2 = \left\|\frac{v_t}{\langle v_t, x\rangle} - \frac{v_t}{\langle v_t, y\rangle}\right\|_2 = \|v_t\|_2\frac{|\langle v_t, y - x\rangle|}{\langle v_t, x\rangle\langle v_t, y\rangle}$,

which by a Cauchy-Schwarz inequality is at most

$\frac{\|v_t\|_2^2}{\langle v_t, x\rangle\langle v_t, y\rangle}\|x - y\|_2 \le \frac{N}{\delta^2}\|x - y\|_2$.

Thus, we can choose $\lambda = N/\delta^2$. Finally, note that for any $x \in X$,

$\|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2 = \left\|\frac{v_t}{\langle v_t, x\rangle} - \frac{v_{t-1}}{\langle v_{t-1}, x\rangle}\right\|_2 = \left\|\frac{v_t - v_{t-1}}{\langle v_t, x\rangle} + v_{t-1}\frac{\langle v_{t-1} - v_t, x\rangle}{\langle v_t, x\rangle\langle v_{t-1}, x\rangle}\right\|_2$,

which by a triangle inequality and then a Cauchy-Schwarz inequality is at most

$\frac{\|v_t - v_{t-1}\|_2}{\langle v_t, x\rangle} + \|v_{t-1}\|_2\frac{\|v_t - v_{t-1}\|_2\|x\|_2}{\langle v_t, x\rangle\langle v_{t-1}, x\rangle} \le \left(\frac{1}{\delta} + \frac{\sqrt{N}}{\delta^2}\right)\|v_t - v_{t-1}\|_2 \le \frac{2\sqrt{N}}{\delta^2}\|v_t - v_{t-1}\|_2$.

This implies that

$\sum_{t=1}^T \max_{x \in X}\|\nabla f_t(x) - \nabla f_{t-1}(x)\|_2^2 \le \frac{4N}{\delta^4}\sum_{t=1}^T \|v_t - v_{t-1}\|_2^2 = \frac{4N}{\delta^4}D$.

Thus, we can choose $D_2 = (4N/\delta^4)D$. Using these bounds in Theorem 15, we have the corollary.


More information

Online Passive-Aggressive Algorithms

Online Passive-Aggressive Algorithms Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il

More information

EASINESS IN BANDITS. Gergely Neu. Pompeu Fabra University

EASINESS IN BANDITS. Gergely Neu. Pompeu Fabra University EASINESS IN BANDITS Gergely Neu Pompeu Fabra University EASINESS IN BANDITS Gergely Neu Pompeu Fabra University THE BANDIT PROBLEM Play for T rounds attempting to maximize rewards THE BANDIT PROBLEM Play

More information

Logarithmic Regret Algorithms for Strongly Convex Repeated Games

Logarithmic Regret Algorithms for Strongly Convex Repeated Games Logarithmic Regret Algorithms for Strongly Convex Repeated Games Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci & Eng, The Hebrew University, Jerusalem 91904, Israel 2 Google Inc 1600

More information

Online Learning with Predictable Sequences

Online Learning with Predictable Sequences JMLR: Workshop and Conference Proceedings vol (2013) 1 27 Online Learning with Predictable Sequences Alexander Rakhlin Karthik Sridharan rakhlin@wharton.upenn.edu skarthik@wharton.upenn.edu Abstract We

More information

Accelerating Online Convex Optimization via Adaptive Prediction

Accelerating Online Convex Optimization via Adaptive Prediction Mehryar Mohri Courant Institute and Google Research New York, NY 10012 mohri@cims.nyu.edu Scott Yang Courant Institute New York, NY 10012 yangs@cims.nyu.edu Abstract We present a powerful general framework

More information

Stochastic Optimization

Stochastic Optimization Introduction Related Work SGD Epoch-GD LM A DA NANJING UNIVERSITY Lijun Zhang Nanjing University, China May 26, 2017 Introduction Related Work SGD Epoch-GD Outline 1 Introduction 2 Related Work 3 Stochastic

More information

An Online Convex Optimization Approach to Blackwell s Approachability

An Online Convex Optimization Approach to Blackwell s Approachability Journal of Machine Learning Research 17 (2016) 1-23 Submitted 7/15; Revised 6/16; Published 8/16 An Online Convex Optimization Approach to Blackwell s Approachability Nahum Shimkin Faculty of Electrical

More information

Online Linear Optimization via Smoothing

Online Linear Optimization via Smoothing JMLR: Workshop and Conference Proceedings vol 35:1 18, 2014 Online Linear Optimization via Smoothing Jacob Abernethy Chansoo Lee Computer Science and Engineering Division, University of Michigan, Ann Arbor

More information

Learning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley

Learning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley Learning Methods for Online Prediction Problems Peter Bartlett Statistics and EECS UC Berkeley Course Synopsis A finite comparison class: A = {1,..., m}. 1. Prediction with expert advice. 2. With perfect

More information

Lecture 23: Online convex optimization Online convex optimization: generalization of several algorithms

Lecture 23: Online convex optimization Online convex optimization: generalization of several algorithms EECS 598-005: heoretical Foundations of Machine Learning Fall 2015 Lecture 23: Online convex optimization Lecturer: Jacob Abernethy Scribes: Vikas Dhiman Disclaimer: hese notes have not been subjected

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Bandit Problems MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Multi-Armed Bandit Problem Problem: which arm of a K-slot machine should a gambler pull to maximize his

More information

Optimal and Adaptive Online Learning

Optimal and Adaptive Online Learning Optimal and Adaptive Online Learning Haipeng Luo Advisor: Robert Schapire Computer Science Department Princeton University Examples of Online Learning (a) Spam detection 2 / 34 Examples of Online Learning

More information

Agnostic Online learnability

Agnostic Online learnability Technical Report TTIC-TR-2008-2 October 2008 Agnostic Online learnability Shai Shalev-Shwartz Toyota Technological Institute Chicago shai@tti-c.org ABSTRACT We study a fundamental question. What classes

More information

A Unified Approach to Proximal Algorithms using Bregman Distance

A Unified Approach to Proximal Algorithms using Bregman Distance A Unified Approach to Proximal Algorithms using Bregman Distance Yi Zhou a,, Yingbin Liang a, Lixin Shen b a Department of Electrical Engineering and Computer Science, Syracuse University b Department

More information

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Haihao Lu August 3, 08 Abstract The usual approach to developing and analyzing first-order

More information

15-859E: Advanced Algorithms CMU, Spring 2015 Lecture #16: Gradient Descent February 18, 2015

15-859E: Advanced Algorithms CMU, Spring 2015 Lecture #16: Gradient Descent February 18, 2015 5-859E: Advanced Algorithms CMU, Spring 205 Lecture #6: Gradient Descent February 8, 205 Lecturer: Anupam Gupta Scribe: Guru Guruganesh In this lecture, we will study the gradient descent algorithm and

More information

Delay-Tolerant Online Convex Optimization: Unified Analysis and Adaptive-Gradient Algorithms

Delay-Tolerant Online Convex Optimization: Unified Analysis and Adaptive-Gradient Algorithms Delay-Tolerant Online Convex Optimization: Unified Analysis and Adaptive-Gradient Algorithms Pooria Joulani 1 András György 2 Csaba Szepesvári 1 1 Department of Computing Science, University of Alberta,

More information

Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations

Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations JMLR: Workshop and Conference Proceedings vol 35:1 20, 2014 Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations H. Brendan McMahan MCMAHAN@GOOGLE.COM Google,

More information

A Modular Analysis of Adaptive (Non-)Convex Optimization: Optimism, Composite Objectives, and Variational Bounds

A Modular Analysis of Adaptive (Non-)Convex Optimization: Optimism, Composite Objectives, and Variational Bounds Proceedings of Machine Learning Research 76: 40, 207 Algorithmic Learning Theory 207 A Modular Analysis of Adaptive (Non-)Convex Optimization: Optimism, Composite Objectives, and Variational Bounds Pooria

More information

Minimax strategy for prediction with expert advice under stochastic assumptions

Minimax strategy for prediction with expert advice under stochastic assumptions Minimax strategy for prediction ith expert advice under stochastic assumptions Wojciech Kotłosi Poznań University of Technology, Poland otlosi@cs.put.poznan.pl Abstract We consider the setting of prediction

More information

Online Learning: Random Averages, Combinatorial Parameters, and Learnability

Online Learning: Random Averages, Combinatorial Parameters, and Learnability Online Learning: Random Averages, Combinatorial Parameters, and Learnability Alexander Rakhlin Department of Statistics University of Pennsylvania Karthik Sridharan Toyota Technological Institute at Chicago

More information

Lecture 4: Lower Bounds (ending); Thompson Sampling

Lecture 4: Lower Bounds (ending); Thompson Sampling CMSC 858G: Bandits, Experts and Games 09/12/16 Lecture 4: Lower Bounds (ending); Thompson Sampling Instructor: Alex Slivkins Scribed by: Guowei Sun,Cheng Jie 1 Lower bounds on regret (ending) Recap from

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization Alexander Rakhlin University of Pennsylvania Ohad Shamir Microsoft Research New England Karthik Sridharan University of Pennsylvania

More information

Optimistic Bandit Convex Optimization

Optimistic Bandit Convex Optimization Optimistic Bandit Convex Optimization Mehryar Mohri Courant Institute and Google 25 Mercer Street New York, NY 002 mohri@cims.nyu.edu Scott Yang Courant Institute 25 Mercer Street New York, NY 002 yangs@cims.nyu.edu

More information

Online Learning for Time Series Prediction

Online Learning for Time Series Prediction Online Learning for Time Series Prediction Joint work with Vitaly Kuznetsov (Google Research) MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Motivation Time series prediction: stock values.

More information

Multi-armed Bandits: Competing with Optimal Sequences

Multi-armed Bandits: Competing with Optimal Sequences Multi-armed Bandits: Competing with Optimal Sequences Oren Anava The Voleon Group Berkeley, CA oren@voleon.com Zohar Karnin Yahoo! Research New York, NY zkarnin@yahoo-inc.com Abstract We consider sequential

More information

Explore no more: Improved high-probability regret bounds for non-stochastic bandits

Explore no more: Improved high-probability regret bounds for non-stochastic bandits Explore no more: Improved high-probability regret bounds for non-stochastic bandits Gergely Neu SequeL team INRIA Lille Nord Europe gergely.neu@gmail.com Abstract This work addresses the problem of regret

More information

Bandit Convex Optimization: T Regret in One Dimension

Bandit Convex Optimization: T Regret in One Dimension Bandit Convex Optimization: T Regret in One Dimension arxiv:1502.06398v1 [cs.lg 23 Feb 2015 Sébastien Bubeck Microsoft Research sebubeck@microsoft.com Tomer Koren Technion tomerk@technion.ac.il February

More information

Near-Optimal Algorithms for Online Matrix Prediction

Near-Optimal Algorithms for Online Matrix Prediction JMLR: Workshop and Conference Proceedings vol 23 (2012) 38.1 38.13 25th Annual Conference on Learning Theory Near-Optimal Algorithms for Online Matrix Prediction Elad Hazan Technion - Israel Inst. of Tech.

More information

Online Convex Optimization. Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016

Online Convex Optimization. Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016 Online Convex Optimization Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016 The General Setting The General Setting (Cover) Given only the above, learning isn't always possible Some Natural

More information

Online prediction with expert advise

Online prediction with expert advise Online prediction with expert advise Jyrki Kivinen Australian National University http://axiom.anu.edu.au/~kivinen Contents 1. Online prediction: introductory example, basic setting 2. Classification with

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Learning and Games MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Normal form games Nash equilibrium von Neumann s minimax theorem Correlated equilibrium Internal

More information

The Online Approach to Machine Learning

The Online Approach to Machine Learning The Online Approach to Machine Learning Nicolò Cesa-Bianchi Università degli Studi di Milano N. Cesa-Bianchi (UNIMI) Online Approach to ML 1 / 53 Summary 1 My beautiful regret 2 A supposedly fun game I

More information

Minimax Fixed-Design Linear Regression

Minimax Fixed-Design Linear Regression JMLR: Workshop and Conference Proceedings vol 40:1 14, 2015 Mini Fixed-Design Linear Regression Peter L. Bartlett University of California at Berkeley and Queensland University of Technology Wouter M.

More information

Better Algorithms for Benign Bandits

Better Algorithms for Benign Bandits Better Algorithms for Benign Bandits Elad Hazan IBM Almaden 650 Harry Rd, San Jose, CA 95120 ehazan@cs.princeton.edu Satyen Kale Microsoft Research One Microsoft Way, Redmond, WA 98052 satyen.kale@microsoft.com

More information

Faster Rates for Convex-Concave Games

Faster Rates for Convex-Concave Games Proceedings of Machine Learning Research vol 75: 3, 208 3st Annual Conference on Learning Theory Faster Rates for Convex-Concave Games Jacob Abernethy Georgia Tech Kevin A. Lai Georgia Tech Kfir Y. Levy

More information

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties Fedor S. Stonyakin 1 and Alexander A. Titov 1 V. I. Vernadsky Crimean Federal University, Simferopol,

More information

Lecture 19: Follow The Regulerized Leader

Lecture 19: Follow The Regulerized Leader COS-511: Learning heory Spring 2017 Lecturer: Roi Livni Lecture 19: Follow he Regulerized Leader Disclaimer: hese notes have not been subjected to the usual scrutiny reserved for formal publications. hey

More information

Blackwell s Approachability Theorem: A Generalization in a Special Case. Amy Greenwald, Amir Jafari and Casey Marks

Blackwell s Approachability Theorem: A Generalization in a Special Case. Amy Greenwald, Amir Jafari and Casey Marks Blackwell s Approachability Theorem: A Generalization in a Special Case Amy Greenwald, Amir Jafari and Casey Marks Department of Computer Science Brown University Providence, Rhode Island 02912 CS-06-01

More information

Online Bounds for Bayesian Algorithms

Online Bounds for Bayesian Algorithms Online Bounds for Bayesian Algorithms Sham M. Kakade Computer and Information Science Department University of Pennsylvania Andrew Y. Ng Computer Science Department Stanford University Abstract We present

More information

Yevgeny Seldin. University of Copenhagen

Yevgeny Seldin. University of Copenhagen Yevgeny Seldin University of Copenhagen Classical (Batch) Machine Learning Collect Data Data Assumption The samples are independent identically distributed (i.i.d.) Machine Learning Prediction rule New

More information

Bregman Divergence and Mirror Descent

Bregman Divergence and Mirror Descent Bregman Divergence and Mirror Descent Bregman Divergence Motivation Generalize squared Euclidean distance to a class of distances that all share similar properties Lots of applications in machine learning,

More information

Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity

Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity Convergence Rates for Deterministic and Stochastic Subgradient Methods Without Lipschitz Continuity Benjamin Grimmer Abstract We generalize the classic convergence rate theory for subgradient methods to

More information

Convex Repeated Games and Fenchel Duality

Convex Repeated Games and Fenchel Duality Convex Repeated Games and Fenchel Duality Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci. & Eng., he Hebrew University, Jerusalem 91904, Israel 2 Google Inc. 1600 Amphitheater Parkway,

More information

More Adaptive Algorithms for Adversarial Bandits

More Adaptive Algorithms for Adversarial Bandits Proceedings of Machine Learning Research vol 75: 29, 208 3st Annual Conference on Learning Theory More Adaptive Algorithms for Adversarial Bandits Chen-Yu Wei University of Southern California Haipeng

More information

Bandit Online Convex Optimization

Bandit Online Convex Optimization March 31, 2015 Outline 1 OCO vs Bandit OCO 2 Gradient Estimates 3 Oblivious Adversary 4 Reshaping for Improved Rates 5 Adaptive Adversary 6 Concluding Remarks Review of (Online) Convex Optimization Set-up

More information

Logarithmic Regret Algorithms for Online Convex Optimization

Logarithmic Regret Algorithms for Online Convex Optimization Logarithmic Regret Algorithms for Online Convex Optimization Elad Hazan Amit Agarwal Satyen Kale November 20, 2008 Abstract In an online convex optimization problem a decision-maker makes a sequence of

More information

Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 Regularization

Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 Regularization Follow-the-Regularized-Leader and Mirror Descent: Equivalence Theorems and L1 Regularization H. Brendan McMahan Google, Inc. Abstract We prove that many mirror descent algorithms for online conve optimization

More information

Lecture 16: Perceptron and Exponential Weights Algorithm

Lecture 16: Perceptron and Exponential Weights Algorithm EECS 598-005: Theoretical Foundations of Machine Learning Fall 2015 Lecture 16: Perceptron and Exponential Weights Algorithm Lecturer: Jacob Abernethy Scribes: Yue Wang, Editors: Weiqing Yu and Andrew

More information

Minimizing Adaptive Regret with One Gradient per Iteration

Minimizing Adaptive Regret with One Gradient per Iteration Minimizing Adaptive Regret with One Gradient per Iteration Guanghui Wang, Dakuan Zhao, Lijun Zhang National Key Laboratory for Novel Software Technology, Nanjing University, Nanjing 210023, China wanggh@lamda.nju.edu.cn,

More information

Lecture 25: Subgradient Method and Bundle Methods April 24

Lecture 25: Subgradient Method and Bundle Methods April 24 IE 51: Convex Optimization Spring 017, UIUC Lecture 5: Subgradient Method and Bundle Methods April 4 Instructor: Niao He Scribe: Shuanglong Wang Courtesy warning: hese notes do not necessarily cover everything

More information

Better Algorithms for Benign Bandits

Better Algorithms for Benign Bandits Better Algorithms for Benign Bandits Elad Hazan IBM Almaden 650 Harry Rd, San Jose, CA 95120 hazan@us.ibm.com Satyen Kale Microsoft Research One Microsoft Way, Redmond, WA 98052 satyen.kale@microsoft.com

More information

Gambling in a rigged casino: The adversarial multi-armed bandit problem

Gambling in a rigged casino: The adversarial multi-armed bandit problem Gambling in a rigged casino: The adversarial multi-armed bandit problem Peter Auer Institute for Theoretical Computer Science University of Technology Graz A-8010 Graz (Austria) pauer@igi.tu-graz.ac.at

More information

Math 273a: Optimization Subgradients of convex functions

Math 273a: Optimization Subgradients of convex functions Math 273a: Optimization Subgradients of convex functions Made by: Damek Davis Edited by Wotao Yin Department of Mathematics, UCLA Fall 2015 online discussions on piazza.com 1 / 42 Subgradients Assumptions

More information

Theory and Applications of A Repeated Game Playing Algorithm. Rob Schapire Princeton University [currently visiting Yahoo!

Theory and Applications of A Repeated Game Playing Algorithm. Rob Schapire Princeton University [currently visiting Yahoo! Theory and Applications of A Repeated Game Playing Algorithm Rob Schapire Princeton University [currently visiting Yahoo! Research] Learning Is (Often) Just a Game some learning problems: learn from training

More information