Regret bounded by gradual variation for online convex optimization


Tianbao Yang · Mehrdad Mahdavi · Rong Jin · Shenghuo Zhu

Abstract Recently, it has been shown that the regret of the Follow the Regularized Leader (FTRL) algorithm for online linear optimization can be bounded by the total variation of the cost vectors rather than by the number of rounds. In this paper, we extend this result to general online convex optimization; in particular, this resolves an open problem that has been posed in a number of recent papers. We first analyze the limitations of the FTRL algorithm in [16] when applied to online convex optimization, and extend the definition of variation to a gradual variation, which is shown to be a lower bound of the total variation. We then present two novel algorithms that bound the regret by the gradual variation of the cost functions. Unlike previous approaches that maintain a single sequence of solutions, the proposed algorithms maintain two sequences of solutions, which makes it possible to achieve a variation-based regret bound for online convex optimization. To establish the main results, we discuss a lower bound for FTRL that maintains only one sequence of solutions, and a necessary condition on the smoothness of the cost functions for obtaining a gradual variation bound. We extend the main results three-fold: (i) we present a general method to obtain a gradual variation bound measured by a general norm; (ii) we extend the algorithms to a class of online non-smooth optimization problems with a gradual variation bound; and

Tianbao Yang, NEC Laboratories America, tyang@nec-labs.com (most of this work was done while the author was at Michigan State University). Mehrdad Mahdavi and Rong Jin, Department of Computer Science and Engineering, Michigan State University, mahdavim@cse.msu.edu, rongjin@cse.msu.edu. Shenghuo Zhu, NEC Laboratories America, zsh@nec-labs.com.

(iii) we develop a deterministic algorithm for online bandit optimization in the multi-point bandit setting.

Keywords online convex optimization · regret bound · gradual variation · bandit

1 Introduction

We consider the general online convex optimization problem [6], which proceeds in trials. At each trial, the learner is asked to predict a decision vector x_t that belongs to a bounded closed convex set P ⊆ R^d; it then receives a cost function c_t : P → R and incurs a cost of c_t(x_t) for the submitted solution. The goal of online convex optimization is to come up with a sequence of solutions x_1, ..., x_T that minimizes the regret, which is defined as the difference between the cost accumulated up to trial T by the learner's decisions and the cost of the best fixed decision in hindsight, i.e.

Regret_T = Σ_{t=1}^T c_t(x_t) − min_{x∈P} Σ_{t=1}^T c_t(x).

In the special case of linear cost functions, i.e. c_t(x) = f_t^⊤x, the problem becomes online linear optimization. The goal of online convex optimization is to design algorithms that predict, with a small regret, the solution x_t at the t-th trial given partial knowledge of the past cost functions c_τ, τ = 1, ..., t−1. Over the past decade, many algorithms have been proposed for online convex optimization, especially for online linear optimization. In the first seminal paper on online convex optimization, Zinkevich [6] proposed a gradient descent algorithm with a regret bound of O(√T). When the cost functions are strongly convex, the regret bound of the online gradient descent algorithm is reduced to O(log T) with an appropriately chosen step size [13]. Another common methodology for online convex optimization, especially for online linear optimization, is based on the framework of Follow the Leader (FTL) [17]. FTL chooses x_t by minimizing the total cost incurred over all previous trials.
Since the naive FTL algorithm fails to achieve sublinear regret in the worst case, many variants have been developed to fix the problem, including Follow the Perturbed Leader (FTPL) [17], Follow the Regularized Leader (FTRL) [1], and Follow the Approximate Leader (FTAL) [13]. Other methodologies for online convex optimization introduce a potential function (or link function) to map solutions between the space of primal variables and the space of dual variables, and carry out a primal-dual update based on the potential function. The well-known Exponentiated Gradient (EG) algorithm [0] and the multiplicative weights algorithm [1] belong to this category. We note that these different algorithms are closely related. For example, in online linear optimization, the potential-based primal-dual algorithm is equivalent to the FTRL algorithm [16].

Generally, most previous studies of online convex optimization bound the regret in terms of the number of trials T. However, one expects the regret to be low in an unchanging environment, or when the cost functions are somehow correlated. Specifically, the tightest rate for the regret should depend on the variation of the sequence of cost functions rather than on the number of rounds T. An algorithm whose regret is stated in terms of the variation of the cost functions is expected to perform better when the cost sequence has low variation. It is therefore of great interest to derive a variation-based regret bound for online convex optimization in an adversarial setting. Since it has been established that the regret of a natural algorithm in a stochastic setting (i.e., when the cost functions are generated by a stationary stochastic process) can be bounded by the total variation of the cost functions [16], devising an algorithm in the fully adversarial setting (i.e., when the cost functions are chosen completely adversarially) was posed as an open problem in [5]. The question is whether or not it is possible to bound the regret of an online optimization algorithm by the variation of the observed cost functions. Recently, [16] made substantial progress along this route and proved a variation-based regret bound for online linear optimization by a tight analysis of the FTRL algorithm with an appropriately chosen step size. A similar regret bound is shown in the same paper for prediction from expert advice, by slightly modifying the multiplicative weights algorithm. In this work, we aim to take one step further and contribute to this research direction by developing algorithms for the general framework of online convex optimization with variation-based regret bounds. A preliminary version of this paper appeared at the Conference on Learning Theory (COLT) [8] as the result of a merge between two papers, as discussed in a commentary written by Satyen Kale [18].
We highlight the differences between this paper and [8]. We first motivate the definition of gradual variation by showing that the total variation bound in [16] is not necessarily small when the cost functions change slowly, and that the gradual variation is smaller than the total variation. We then extend FTRL to achieve a regret bounded by the gradual variation in Subsection 2.1. The second algorithm, in Subsection 2.2, is similar to the one that appeared in [8]. Sections 3 and 4 contain generalizations of the second algorithm to non-smooth functions and to a bandit setting, respectively, which appear in this paper for the first time. In the remainder of this section, we first present the results from [16] for online linear optimization and discuss their potential limitations when applied to online convex optimization, which motivates our work.

1.1 Online Linear Optimization

Many decision problems can be cast into online linear optimization problems, such as prediction from expert advice [6] and the online shortest path problem [5]. [16] proved the first variation-based regret bound for online linear optimization problems in an adversarial setting. Hazan and Kale's algorithm for online

linear optimization is based on the framework of FTRL. For completeness, the algorithm is shown in Algorithm 1. At each trial, the decision vector x_t is given by solving the following optimization problem:

x_t = arg min_{x∈P} Σ_{τ=1}^{t−1} f_τ^⊤x + (1/(2η))‖x‖²₂,

where f_t is the cost vector received at trial t after predicting the decision x_t, and η is a step size.

Algorithm 1 Follow The Regularized Leader for Online Linear Optimization
1: Input: η > 0
2: for t = 1, ..., T do
3:   If t = 1, predict x_t = 0
4:   If t > 1, predict x_t by x_t = arg min_{x∈P} Σ_{τ=1}^{t−1} f_τ^⊤x + (1/(2η))‖x‖²₂
5:   Receive cost vector f_t and incur loss f_t^⊤x_t
6: end for

They bound the regret by the variation of the cost vectors defined as

VAR_T = Σ_{t=1}^T ‖f_t − μ‖²₂, (1)

where μ = (1/T) Σ_{t=1}^T f_t. By assuming ‖f_t‖₂ ≤ 1 for all t and setting the value of η to η = min{2/√VAR_T, 1/6}, they showed that the regret of Algorithm 1 can be bounded by

Σ_{t=1}^T f_t^⊤x_t − min_{x∈P} Σ_{t=1}^T f_t^⊤x ≤ 15√VAR_T if VAR_T ≥ 1, and by a constant if VAR_T ≤ 1. (2)

From (2), we can see that when the variation of the cost vectors is small (less than 1), the regret is a constant; otherwise it is bounded by the variation as O(√VAR_T). This result indicates that online linear optimization in the adversarial setting is as efficient as in the stationary stochastic setting.

1.2 Online Convex Optimization

Online convex optimization generalizes online linear optimization by replacing linear cost functions with non-linear convex cost functions. It has found applications in several domains, including portfolio management [3] and online classification [19]. For example, in the online portfolio management problem, an investor wants to distribute his wealth over a set of stocks without knowing the market output in advance. If we let x_t denote the distribution on the

stocks and r_t denote the price relative vector, i.e., r_t[i] denotes the ratio of the closing price of stock i on day t to its closing price on day t−1, then a function of interest is the logarithmic growth ratio, i.e. Σ_{t=1}^T log(x_t^⊤r_t), which is a concave function to be maximized. Similar to [16], we aim to develop algorithms for online convex optimization with regret bounded by the variation of the cost functions. Before presenting our algorithms, we first show that directly applying the FTRL algorithm to general online convex optimization may not achieve the desired result.

Algorithm 2 Follow The Regularized Leader (FTRL) for Online Convex Optimization
1: Input: η > 0
2: for t = 1, ..., T do
3:   If t = 1, predict x_t = 0
4:   If t > 1, predict x_t by x_t = arg min_{x∈P} Σ_{τ=1}^{t−1} f_τ^⊤x + (1/(2η))‖x‖²₂
5:   Receive cost function c_t and incur loss c_t(x_t)
6:   Compute f_t = ∇c_t(x_t)
7: end for

To extend FTRL to online convex optimization, a straightforward approach is to use the first-order approximation of the convex cost function, i.e., c_t(x) ≈ c_t(x_t) + ∇c_t(x_t)^⊤(x − x_t), and replace the cost vector f_t in Algorithm 1 with the gradient of the cost function c_t at x_t, i.e., f_t = ∇c_t(x_t). The resulting algorithm is shown in Algorithm 2. Using the convexity of c_t, we have

Σ_{t=1}^T c_t(x_t) − min_{x∈P} Σ_{t=1}^T c_t(x) ≤ Σ_{t=1}^T f_t^⊤x_t − min_{x∈P} Σ_{t=1}^T f_t^⊤x. (3)

If we assume ‖∇c_t(x)‖₂ ≤ 1 for all t and all x ∈ P, we can apply Hazan and Kale's variation-based bound in (2) to bound the regret in (3) by the variation

VAR_T = Σ_{t=1}^T ‖f_t − μ‖²₂ = Σ_{t=1}^T ‖∇c_t(x_t) − (1/T) Σ_{τ=1}^T ∇c_τ(x_τ)‖²₂. (4)

To better understand VAR_T in (4), we rewrite it as

VAR_T = Σ_{t=1}^T ‖∇c_t(x_t) − (1/T) Σ_{τ=1}^T ∇c_τ(x_τ)‖²₂ = (1/(2T)) Σ_{t,τ=1}^T ‖∇c_t(x_t) − ∇c_τ(x_τ)‖²₂
≤ (1/T) Σ_{t,τ=1}^T ‖∇c_t(x_t) − ∇c_t(x_τ)‖²₂ + (1/T) Σ_{t,τ=1}^T ‖∇c_t(x_τ) − ∇c_τ(x_τ)‖²₂ = VAR¹_T + VAR²_T. (5)
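As a concrete illustration, the reduction in Algorithm 2 — feed the linearizations f_t = ∇c_t(x_t) to the FTRL update — can be sketched in a few lines. Taking P to be the unit Euclidean ball is an assumption of this sketch, chosen so that the arg min has a closed form as a projected, scaled gradient sum:

```python
import numpy as np

def proj_ball(x):
    """Euclidean projection onto the unit ball, our stand-in for P."""
    n = np.linalg.norm(x)
    return x / n if n > 1.0 else x

def ftrl_linearized(grad_fns, eta, d):
    """Sketch of Algorithm 2: FTRL run on the linearizations f_t = grad c_t(x_t).

    grad_fns[t-1](x) returns the gradient of c_t at x (revealed only
    after predicting x_t).  Returns the predictions x_1, ..., x_T.
    """
    grad_sum = np.zeros(d)   # f_1 + ... + f_{t-1}
    xs = []
    for g_fn in grad_fns:
        # x_t = argmin_{x in P}  sum_{tau<t} f_tau.x + ||x||^2 / (2 eta)
        x = proj_ball(-eta * grad_sum)
        xs.append(x)
        grad_sum += g_fn(x)  # f_t = grad c_t(x_t)
    return xs
```

For identical smooth costs, the actual regret of this sketch is small; the point of the decomposition in (5) is that the bound obtained through (2) can nevertheless be as large as O(√T).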

We see that the variation VAR_T is bounded by two parts: VAR¹_T essentially measures the smoothness of the individual cost functions, while VAR²_T measures the variation in the gradients of the cost functions. Let us consider an easy setting where all cost functions are identical. In this case, VAR²_T vanishes and VAR_T equals VAR¹_T/2, i.e.,

VAR_T = (1/(2T)) Σ_{t,τ=1}^T ‖∇c_t(x_t) − ∇c_τ(x_τ)‖²₂ = (1/(2T)) Σ_{t,τ=1}^T ‖∇c_t(x_t) − ∇c_t(x_τ)‖²₂ = VAR¹_T/2.

As a result, the regret of the FTRL algorithm for online convex optimization may still be bounded by O(√T), regardless of the smoothness of the cost functions. To address this challenge, we develop two novel algorithms for online convex optimization that bound the regret by the variation of the cost functions. In particular, we would like to bound the regret of online convex optimization by the variation of the cost functions defined as

GV_T = Σ_{t=1}^{T−1} max_{x∈P} ‖∇c_{t+1}(x) − ∇c_t(x)‖²₂. (6)

Note that the variation in (6) is defined in terms of the gradual difference between each cost function and its predecessor, while the variation in (1) [16] is defined in terms of the total difference between the individual cost vectors and their mean. We therefore refer to the variation defined in (6) as gradual variation¹, and to the variation defined in (1) as total variation. It is straightforward to show that when c_t(x) = f_t^⊤x, the gradual variation GV_T defined in (6) is upper bounded by the total variation VAR_T defined in (1) up to a constant factor:

Σ_{t=1}^{T−1} ‖f_{t+1} − f_t‖²₂ ≤ Σ_{t=1}^{T−1} 2( ‖f_{t+1} − μ‖²₂ + ‖f_t − μ‖²₂ ) ≤ 4 Σ_{t=1}^T ‖f_t − μ‖²₂.

On the other hand, we cannot bound the total variation by the gradual variation up to a constant. This is verified by the following example: let f_1 = ··· = f_{T/2} = f and f_{T/2+1} = ··· = f_T = g ≠ f. The total variation in (1) is

VAR_T = Σ_{t=1}^T ‖f_t − μ‖²₂ = (T/2)‖f − (f+g)/2‖²₂ + (T/2)‖g − (f+g)/2‖²₂ = (T/4)‖f − g‖²₂ = Ω(T),

while the gradual variation defined in (6) is a constant:

GV_T = Σ_{t=1}^{T−1} ‖f_{t+1} − f_t‖²₂ = ‖f − g‖²₂ = O(1).

¹ This is also termed deviation in [8].
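The contrast between the two notions can be checked numerically for linear costs. The following sketch computes both quantities for the two-block sequence above; the particular vectors f and g are arbitrary choices for illustration:

```python
import numpy as np

def total_variation(fs):
    """VAR_T = sum_t ||f_t - mu||^2, with mu the mean cost vector."""
    mu = np.mean(fs, axis=0)
    return sum(np.linalg.norm(f - mu) ** 2 for f in fs)

def gradual_variation(fs):
    """GV_T = sum_{t=1}^{T-1} ||f_{t+1} - f_t||^2 (linear costs)."""
    return sum(np.linalg.norm(fs[t + 1] - fs[t]) ** 2
               for t in range(len(fs) - 1))

# The example from the text: half the rounds use f, the other half g != f.
T = 1000
f = np.array([1.0, 0.0])
g = np.array([0.0, 1.0])
fs = [f] * (T // 2) + [g] * (T // 2)

var_t = total_variation(fs)   # grows like (T/4) ||f - g||^2
gv_t = gradual_variation(fs)  # a single jump: ||f - g||^2
```

Here VAR_T grows linearly in T (it equals 500 for T = 1000), while GV_T stays at ‖f − g‖²₂ = 2.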

Based on the above analysis, we claim that a regret bound in terms of the gradual variation is usually tighter than one in terms of the total variation. Before closing this section, we present the following lower bound for FTRL, whose proof can be found in [8].

Theorem 1 The regret of FTRL is at least Ω(min(GV_T, √T)).

The theorem motivates us to develop new algorithms for online convex optimization that achieve a gradual variation bound of O(√GV_T). The remainder of the paper is organized as follows. We present the proposed algorithms and the main results in Section 2. Section 3 generalizes the results to a special class of non-smooth functions that are composed of a smooth and a non-smooth component. Section 4 is devoted to extending the proposed algorithms to the setting where only partial feedback about the cost functions is available, i.e., online bandit convex optimization with a variation-based regret bound. Finally, in Section 5, we conclude this work and discuss a few open problems.

2 Algorithms and Main Results

Without loss of generality, we assume that the decision set P is contained in the unit ball B, i.e., P ⊆ B, and that 0 ∈ P [16]. We propose two algorithms for online convex optimization. The first algorithm is an improved FTRL and the second one is based on the mirror prox method [1]. One common feature shared by the two algorithms is that both maintain two sequences of solutions: decision vectors x_{1:T} = (x_1, ..., x_T) and searching vectors z_{1:T} = (z_1, ..., z_T) that facilitate the updates of the decision vectors. Both algorithms enjoy almost the same regret bound, up to a constant factor.
To facilitate the discussion, besides the variation of the cost functions defined in (6), we define another variation, named the extended gradual variation, as follows:

EGV_{T,2}(y_{0:T−1}) = Σ_{t=0}^{T−1} ‖∇c_{t+1}(y_t) − ∇c_t(y_t)‖²₂ ≤ ‖∇c_1(y_0)‖²₂ + GV_T, (7)

where c_0(x) ≡ 0, the sequence y_0, ..., y_{T−1} is either z_0, ..., z_{T−1} (as in the improved FTRL) or x_0, ..., x_{T−1} (as in the prox method), and the subscript 2 means that the variation is defined with respect to the ℓ₂ norm. When all cost functions are identical, GV_T becomes zero and the extended variation EGV_{T,2}(y_{0:T−1}) reduces to ‖∇c_1(y_0)‖²₂, a constant independent of the number of trials. In the sequel, we use the notation EGV_{T,2} for simplicity. In this study, we assume smooth cost functions with Lipschitz continuous gradients, i.e., there exists a constant L > 0 such that

‖∇c_t(x) − ∇c_t(z)‖₂ ≤ L‖x − z‖₂, ∀x, z ∈ P, ∀t. (8)

Algorithm 3 Improved FTRL for Online Convex Optimization
1: Input: η ∈ (0, 1]
2: Initialization: z_0 = 0 and c_0(x) ≡ 0
3: for t = 1, ..., T do
4:   Predict x_t by x_t = arg min_{x∈P} { x^⊤∇c_{t−1}(z_{t−1}) + (L/(2η))‖x − z_{t−1}‖²₂ }
5:   Receive cost function c_t and incur loss c_t(x_t)
6:   Update z_t by z_t = arg min_{x∈P} { x^⊤ Σ_{τ=1}^t ∇c_τ(z_{τ−1}) + (L/(2η))‖x‖²₂ }
7: end for

Our results show that for online convex optimization with L-smooth cost functions, the regrets of the proposed algorithms can be bounded as follows:

Σ_{t=1}^T c_t(x_t) − min_{x∈P} Σ_{t=1}^T c_t(x) ≤ O(√EGV_{T,2}) + constant. (9)

Remark 1 We would like to emphasize that our assumption about the smoothness of the cost functions is necessary to achieve the variation-based bound stated in (9). To see this, consider the special case c_1(x) = ··· = c_T(x) = c(x). If the bound in (9) held for any sequence of convex functions, then for this special case, where all the cost functions are identical, we would have Σ_{t=1}^T c(x_t) ≤ min_{x∈P} Σ_{t=1}^T c(x) + O(1), implying that x̄_T = (1/T) Σ_{t=1}^T x_t approaches the optimal solution at the rate O(1/T). This contradicts the lower complexity bound, i.e. Ω(1/√T), for any optimization method that uses only first-order information about the cost functions [, Theorem 3.2.1]. This analysis indicates that the smoothness assumption is necessary to attain a variation-based regret bound for the general online convex optimization problem. We emphasize that this contradiction holds when only gradient information about the cost functions is provided to the learner; the learner may be able to achieve a variation-based bound using second-order information about the cost functions, which is not the focus of the present work.

2.1 An Improved FTRL Algorithm for Online Convex Optimization

² Gradual variation bounds for a special class of non-smooth cost functions that are composed of a smooth component and a non-smooth component are discussed in Section 3.

The improved FTRL algorithm for online convex optimization is presented in Algorithm 3. Note that in step 6, the searching vectors z_t are updated according to the FTRL algorithm after receiving the cost function c_t. To understand the updating procedure for the decision vector x_t specified in step 4, we rewrite it as

x_t = arg min_{x∈P} { c_{t−1}(z_{t−1}) + (x − z_{t−1})^⊤∇c_{t−1}(z_{t−1}) + (L/(2η))‖x − z_{t−1}‖²₂ }. (10)

Notice that

c_t(x) ≤ c_t(z_{t−1}) + (x − z_{t−1})^⊤∇c_t(z_{t−1}) + (L/2)‖x − z_{t−1}‖²₂ ≤ c_t(z_{t−1}) + (x − z_{t−1})^⊤∇c_t(z_{t−1}) + (L/(2η))‖x − z_{t−1}‖²₂, (11)

where the first inequality follows from the smoothness condition in (8) and the second inequality follows from the fact that η ≤ 1. The inequality (11) provides an upper bound on c_t(x) and can therefore be used as an approximation of c_t(x) for predicting x_t. However, since ∇c_t(z_{t−1}) is unknown before the prediction, we use ∇c_{t−1}(z_{t−1}) as a surrogate for ∇c_t(z_{t−1}), leading to the updating rule in (10). It is this approximation that leads to the variation bound. The following theorem states the regret bound of Algorithm 3.

Theorem 2 Let c_t, t = 1, ..., T, be a sequence of convex functions with L-Lipschitz continuous gradients. By setting η = min{1, L/√EGV_{T,2}}, we have the following regret bound for Algorithm 3:

Σ_{t=1}^T c_t(x_t) − min_{x∈P} Σ_{t=1}^T c_t(x) ≤ max( L, √EGV_{T,2} ).

Remark 2 Compared with the variation bound in (5) for the FTRL algorithm, the smoothness parameter L plays the same role as VAR¹_T, which accounts for the smoothness of the cost functions, and the term EGV_{T,2} plays the same role as VAR²_T, which accounts for the variation in the cost functions. Compared to the FTRL algorithm, the key advantage of the improved FTRL algorithm is that its regret bound reduces to a constant when the cost functions change only a constant number of times along the horizon.
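A minimal sketch of Algorithm 3, again assuming P is the unit Euclidean ball so that both arg mins reduce to projections (the indexing of the updates follows the discussion around (10)):

```python
import numpy as np

def proj_ball(x):
    n = np.linalg.norm(x)
    return x / n if n > 1.0 else x

def improved_ftrl(grad_fns, eta, L, d):
    """Sketch of Algorithm 3 with P the unit ball.

    grad_fns[t-1](z) returns grad c_t(z); c_0 is the zero function.
    Note the two gradient evaluations per round, at z_{t-1} and z_t.
    Returns the predictions x_1, ..., x_T.
    """
    z = np.zeros(d)           # z_0 = 0
    grad_sum = np.zeros(d)    # sum_tau grad c_tau(z_{tau-1})
    prev_grad = np.zeros(d)   # grad c_{t-1}(z_{t-1}), the surrogate
    xs = []
    for t in range(1, len(grad_fns) + 1):
        # step 4: x_t = argmin x.grad c_{t-1}(z_{t-1}) + (L/(2 eta))||x - z_{t-1}||^2
        x = proj_ball(z - (eta / L) * prev_grad)
        xs.append(x)
        # step 6: FTRL update of the searching point after c_t is revealed
        grad_sum += grad_fns[t - 1](z)    # grad c_t(z_{t-1})
        z = proj_ball(-(eta / L) * grad_sum)
        prev_grad = grad_fns[t - 1](z)    # grad c_t(z_t), surrogate for round t+1
    return xs
```

With identical costs, the surrogate gradient is exact from the second round on, and the regret is a constant, as Remark 2 predicts.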
Of course, the extended variation EGV_{T,2} may not be known a priori for setting the optimal η; in this case, we can apply the standard doubling trick [6] to obtain a bound that holds uniformly over time and is at most a factor of 8 worse than the bound obtained with the optimal choice of η. The details are provided in Appendix A. To prove Theorem 2, we first present the following lemma.

Lemma 1 Let c_t, t = 1, ..., T, be a sequence of convex functions with L-Lipschitz continuous gradients. By running Algorithm 3 over T trials, we have

Σ_{t=1}^T c_t(x_t) ≤ min_{x∈P} [ (L/(2η))‖x‖²₂ + Σ_{t=1}^T ( c_t(z_{t−1}) + (x − z_{t−1})^⊤∇c_t(z_{t−1}) ) ] + (η/(2L)) Σ_{t=0}^{T−1} ‖∇c_{t+1}(z_t) − ∇c_t(z_t)‖²₂.

With this lemma, we can easily prove Theorem 2 by exploiting the convexity of c_t(x).

Proof of Theorem 2 Using ‖x‖₂ ≤ 1 for x ∈ P ⊆ B, and the convexity of c_t(x), we have

min_{x∈P} [ (L/(2η))‖x‖²₂ + Σ_{t=1}^T ( c_t(z_{t−1}) + (x − z_{t−1})^⊤∇c_t(z_{t−1}) ) ] ≤ L/(2η) + min_{x∈P} Σ_{t=1}^T c_t(x).

Combining the above result with Lemma 1, we have

Σ_{t=1}^T c_t(x_t) − min_{x∈P} Σ_{t=1}^T c_t(x) ≤ L/(2η) + (η/(2L)) EGV_{T,2}.

By choosing η = min(1, L/√EGV_{T,2}), we have the regret bound claimed in Theorem 2.

Lemma 1 is proved by induction. The key observation is that z_t is the optimal solution to the strongly convex minimization problem in Lemma 1 restricted to the first t trials, i.e.,

z_t = arg min_{x∈P} [ (L/(2η))‖x‖²₂ + Σ_{τ=1}^t ( c_τ(z_{τ−1}) + (x − z_{τ−1})^⊤∇c_τ(z_{τ−1}) ) ].

Proof of Lemma 1 We prove the inequality by induction. When T = 1, we have x_1 = z_0 = 0 and

min_{x∈P} [ (L/(2η))‖x‖²₂ + c_1(z_0) + (x − z_0)^⊤∇c_1(z_0) ] + (η/(2L))‖∇c_1(z_0)‖²₂
≥ c_1(z_0) + min_x { (L/(2η))‖x‖²₂ + (x − z_0)^⊤∇c_1(z_0) } + (η/(2L))‖∇c_1(z_0)‖²₂ = c_1(z_0) = c_1(x_1),

where the inequality follows by relaxing the minimization domain from x ∈ P to the whole space, over which the minimum value is −(η/(2L))‖∇c_1(z_0)‖²₂. We now assume the inequality holds up to trial t and aim to prove it for t + 1. To this end, we define

ψ_t(x) = [ (L/(2η))‖x‖²₂ + Σ_{τ=1}^t ( c_τ(z_{τ−1}) + (x − z_{τ−1})^⊤∇c_τ(z_{τ−1}) ) ] + (η/(2L)) Σ_{τ=0}^{t−1} ‖∇c_{τ+1}(z_τ) − ∇c_τ(z_τ)‖²₂.

According to the updating procedure for z_t in step 6, we have z_t = arg min_{x∈P} ψ_t(x). Define φ_t = ψ_t(z_t) = min_{x∈P} ψ_t(x). Since ψ_t(x) is an (L/η)-strongly convex function, we have

ψ_{t+1}(x) ≥ ψ_{t+1}(z_t) + (L/(2η))‖x − z_t‖²₂ + (x − z_t)^⊤∇ψ_{t+1}(z_t)
= ψ_{t+1}(z_t) + (L/(2η))‖x − z_t‖²₂ + (x − z_t)^⊤( ∇ψ_t(z_t) + ∇c_{t+1}(z_t) ).

Setting x = z_{t+1} = arg min_{x∈P} ψ_{t+1}(x) in the above inequality, and noting that ψ_{t+1}(z_t) = φ_t + c_{t+1}(z_t) + (η/(2L))‖∇c_{t+1}(z_t) − ∇c_t(z_t)‖²₂, we obtain

φ_{t+1} ≥ φ_t + c_{t+1}(z_t) + (η/(2L))‖∇c_{t+1}(z_t) − ∇c_t(z_t)‖²₂ + (L/(2η))‖z_{t+1} − z_t‖²₂ + (z_{t+1} − z_t)^⊤( ∇ψ_t(z_t) + ∇c_{t+1}(z_t) )
≥ φ_t + c_{t+1}(z_t) + (η/(2L))‖∇c_{t+1}(z_t) − ∇c_t(z_t)‖²₂ + (L/(2η))‖z_{t+1} − z_t‖²₂ + (z_{t+1} − z_t)^⊤∇c_{t+1}(z_t),

where the second inequality follows from the fact that z_t = arg min_{x∈P} ψ_t(x), and therefore (x − z_t)^⊤∇ψ_t(z_t) ≥ 0 for all x ∈ P. Bounding the last two terms by their minimum over x ∈ P, we have

φ_{t+1} − φ_t − (η/(2L))‖∇c_{t+1}(z_t) − ∇c_t(z_t)‖²₂ ≥ min_{x∈P} { (L/(2η))‖x − z_t‖²₂ + (x − z_t)^⊤∇c_{t+1}(z_t) } + c_{t+1}(z_t)
= min_{x∈P} { ρ(x) + c_{t+1}(z_t) + r(x) }, (12)

where ρ(x) = (L/(2η))‖x − z_t‖²₂ + (x − z_t)^⊤∇c_t(z_t) and r(x) = (x − z_t)^⊤( ∇c_{t+1}(z_t) − ∇c_t(z_t) ). To bound the right-hand side, we note that x_{t+1} is the minimizer of ρ(x) by step 4 of Algorithm 3, and ρ(x) is an (L/η)-strongly convex function, so we have

ρ(x) ≥ ρ(x_{t+1}) + (x − x_{t+1})^⊤∇ρ(x_{t+1}) + (L/(2η))‖x − x_{t+1}‖²₂ ≥ ρ(x_{t+1}) + (L/(2η))‖x − x_{t+1}‖²₂,

since (x − x_{t+1})^⊤∇ρ(x_{t+1}) ≥ 0 for all x ∈ P. Then

ρ(x) + c_{t+1}(z_t) + r(x) ≥ ρ(x_{t+1}) + c_{t+1}(z_t) + (L/(2η))‖x − x_{t+1}‖²₂ + r(x),

where ρ(x_{t+1}) = (L/(2η))‖x_{t+1} − z_t‖²₂ + (x_{t+1} − z_t)^⊤∇c_t(z_t).

Plugging the above inequality into (12), we have

φ_{t+1} − φ_t − (η/(2L))‖∇c_{t+1}(z_t) − ∇c_t(z_t)‖²₂
≥ (L/(2η))‖x_{t+1} − z_t‖²₂ + (x_{t+1} − z_t)^⊤∇c_t(z_t) + c_{t+1}(z_t) + min_{x∈P} { (L/(2η))‖x − x_{t+1}‖²₂ + (x − z_t)^⊤( ∇c_{t+1}(z_t) − ∇c_t(z_t) ) }
= (L/(2η))‖x_{t+1} − z_t‖²₂ + (x_{t+1} − z_t)^⊤∇c_{t+1}(z_t) + c_{t+1}(z_t) + min_{x∈P} { (L/(2η))‖x − x_{t+1}‖²₂ + (x − x_{t+1})^⊤( ∇c_{t+1}(z_t) − ∇c_t(z_t) ) }
≥ (L/(2η))‖x_{t+1} − z_t‖²₂ + (x_{t+1} − z_t)^⊤∇c_{t+1}(z_t) + c_{t+1}(z_t) + min_x { (L/(2η))‖x − x_{t+1}‖²₂ + (x − x_{t+1})^⊤( ∇c_{t+1}(z_t) − ∇c_t(z_t) ) }
= (L/(2η))‖x_{t+1} − z_t‖²₂ + (x_{t+1} − z_t)^⊤∇c_{t+1}(z_t) + c_{t+1}(z_t) − (η/(2L))‖∇c_{t+1}(z_t) − ∇c_t(z_t)‖²₂
≥ c_{t+1}(x_{t+1}) − (η/(2L))‖∇c_{t+1}(z_t) − ∇c_t(z_t)‖²₂,

where the first equality follows by writing (x_{t+1} − z_t)^⊤∇c_t(z_t) = (x_{t+1} − z_t)^⊤∇c_{t+1}(z_t) − (x_{t+1} − z_t)^⊤( ∇c_{t+1}(z_t) − ∇c_t(z_t) ) and combining the subtracted term with (x − z_t)^⊤( ∇c_{t+1}(z_t) − ∇c_t(z_t) ), the second inequality follows by relaxing the minimization to the whole space, and the last inequality follows from the smoothness condition on c_{t+1}(x) together with η ≤ 1. Hence φ_{t+1} ≥ φ_t + c_{t+1}(x_{t+1}); since by induction φ_t ≥ Σ_{τ=1}^t c_τ(x_τ), we have φ_{t+1} ≥ Σ_{τ=1}^{t+1} c_τ(x_τ).

2.2 A Prox Method for Online Convex Optimization

In this subsection, we present a prox method for online convex optimization that enjoys the same order of regret bound as the improved FTRL algorithm. It is closely related to the prox method in [1], maintaining two sets of vectors x_{1:T} and z_{1:T}, where x_t and z_t are computed by gradient mappings

using ∇c_{t−1}(x_{t−1}) and ∇c_t(x_t), respectively, as

x_t = arg min_{x∈P} (1/2)‖x − ( z_{t−1} − (η/L)∇c_{t−1}(x_{t−1}) )‖²₂,
z_t = arg min_{x∈P} (1/2)‖x − ( z_{t−1} − (η/L)∇c_t(x_t) )‖²₂.

Algorithm 4 A Prox Method for Online Convex Optimization
1: Input: η > 0
2: Initialization: z_0 = x_0 = 0 and c_0(x) ≡ 0
3: for t = 1, ..., T do
4:   Predict x_t by x_t = arg min_{x∈P} { x^⊤∇c_{t−1}(x_{t−1}) + (L/(2η))‖x − z_{t−1}‖²₂ }
5:   Receive cost function c_t and incur loss c_t(x_t)
6:   Update z_t by z_t = arg min_{x∈P} { x^⊤∇c_t(x_t) + (L/(2η))‖x − z_{t−1}‖²₂ }
7: end for

The detailed steps are shown in Algorithm 4, where we use an equivalent form of the updates for x_t and z_t in order to facilitate comparison with Algorithm 3. Algorithm 4 differs from Algorithm 3 in two respects: (i) to update the searching point z_t, Algorithm 3 uses the FTRL scheme with all the gradients of the cost functions at the points {z_{τ−1}}_{τ=1}^t, while Algorithm 4 uses a prox step with the single gradient ∇c_t(x_t); and (ii) to update the decision vector x_t, Algorithm 4 uses the gradient ∇c_{t−1}(x_{t−1}) instead of ∇c_{t−1}(z_{t−1}). An advantage of Algorithm 4 over Algorithm 3 is that it only requires computing one gradient, ∇c_t(x_t), for each cost function; in contrast, the improved FTRL algorithm needs to compute the gradients of c_t(x) at two searching points, z_{t−1} and z_t. It is these differences that make it easier to extend the prox method to a bandit setting, which will be discussed in Section 4. The following theorem states the regret bound of the prox method for online convex optimization.

Theorem 3 Let c_t, t = 1, ..., T, be a sequence of convex functions with L-Lipschitz continuous gradients. By setting η = (1/2) min { 1/√2, L/√EGV_{T,2} }, we have the following regret bound for Algorithm 4:

Σ_{t=1}^T c_t(x_t) − min_{x∈P} Σ_{t=1}^T c_t(x) ≤ 2 max( √2 L, √EGV_{T,2} ).

We note that, compared to Theorem 2, the regret bound in Theorem 3 is slightly worse, by a constant factor of 2. To prove Theorem 3, we need the following lemma, which is Lemma 3.1 in [1] stated in our notation.

Lemma 2 (Lemma 3.1 in [1]) Let ω(z) be an α-strongly convex function with respect to the norm ‖·‖, whose dual norm is denoted by ‖·‖*, and let D(x, z) = ω(x) − ω(z) − (x − z)^⊤∇ω(z) be the Bregman distance induced by the function ω(x). Let Z be a convex compact set, and U ⊆ Z be convex and closed. Let z ∈ Z and γ > 0. Consider the points

x = arg min_{u∈U} { γ u^⊤ξ + D(u, z) }, (13)
z₊ = arg min_{u∈U} { γ u^⊤ζ + D(u, z) }; (14)

then for any u ∈ U, we have

γ ζ^⊤(x − u) ≤ D(u, z) − D(u, z₊) + (γ²/α)‖ξ − ζ‖²* − (α/2)[ ‖x − z‖² + ‖x − z₊‖² ]. (15)

So that readers need not struggle with the more complex notation of [1], we present a detailed proof in Appendix A, which is an adaptation of the original proof to our notation. Theorem 3 can be proved using the above lemma, because the updates of x_t and z_t can be written equivalently in the forms (13) and (14), respectively. The proof below starts from (15) and bounds the summation of each term over t = 1, ..., T.

Proof of Theorem 3 First, we note that the two updates in steps 4 and 6 of Algorithm 4 fit Lemma 2 if we let U = Z = P, z = z_{t−1}, x = x_t, z₊ = z_t, and ω(x) = (1/2)‖x‖²₂, which is a 1-strongly convex function with respect to ‖·‖₂. Then D(u, z) = (1/2)‖u − z‖²₂. As a result, the two updates for x_t and z_t in Algorithm 4 are exactly the updates in (13) and (14) with z = z_{t−1}, γ = η/L, ξ = ∇c_{t−1}(x_{t−1}), and ζ = ∇c_t(x_t). Substituting these into (15), we have, for any u ∈ P,

(η/L)(x_t − u)^⊤∇c_t(x_t) ≤ (1/2)( ‖u − z_{t−1}‖²₂ − ‖u − z_t‖²₂ ) + (η²/L²)‖∇c_t(x_t) − ∇c_{t−1}(x_{t−1})‖²₂ − (1/2)( ‖x_t − z_{t−1}‖²₂ + ‖x_t − z_t‖²₂ ).

Then we have

(η/L)( c_t(x_t) − c_t(u) ) ≤ (η/L)(x_t − u)^⊤∇c_t(x_t)
≤ (1/2)( ‖u − z_{t−1}‖²₂ − ‖u − z_t‖²₂ ) + (2η²/L²)‖∇c_t(x_{t−1}) − ∇c_{t−1}(x_{t−1})‖²₂ + (2η²/L²)‖∇c_t(x_t) − ∇c_t(x_{t−1})‖²₂ − (1/2)( ‖x_t − z_{t−1}‖²₂ + ‖x_t − z_t‖²₂ )
≤ (1/2)( ‖u − z_{t−1}‖²₂ − ‖u − z_t‖²₂ ) + (2η²/L²)‖∇c_t(x_{t−1}) − ∇c_{t−1}(x_{t−1})‖²₂ + 2η²‖x_t − x_{t−1}‖²₂ − (1/2)( ‖x_t − z_{t−1}‖²₂ + ‖x_t − z_t‖²₂ ),

where the first inequality follows from the convexity of c_t(x), the second from ‖a + b‖² ≤ 2‖a‖² + 2‖b‖², and the third from the smoothness of c_t(x).

By taking the summation over t = 1, ..., T with x* = arg min_{u∈P} Σ_{t=1}^T c_t(u), using ‖x* − z_0‖₂ ≤ 1, and dividing both sides by η/L, we have

Σ_{t=1}^T c_t(x_t) − min_{x∈P} Σ_{t=1}^T c_t(x) ≤ L/(2η) + (2η/L) Σ_{t=0}^{T−1} ‖∇c_{t+1}(x_t) − ∇c_t(x_t)‖²₂ + 2ηL Σ_{t=1}^T ‖x_t − x_{t−1}‖²₂ − (L/η) B_T, (16)

where B_T = (1/2) Σ_{t=1}^T ( ‖x_t − z_{t−1}‖²₂ + ‖x_t − z_t‖²₂ ). We can bound B_T as follows:

B_T = (1/2) Σ_{t=1}^T ( ‖x_t − z_{t−1}‖²₂ + ‖x_t − z_t‖²₂ ) ≥ (1/2) Σ_{t=2}^T ( ‖x_t − z_{t−1}‖²₂ + ‖x_{t−1} − z_{t−1}‖²₂ )
≥ (1/4) Σ_{t=2}^T ‖x_t − x_{t−1}‖²₂ = (1/4) Σ_{t=1}^T ‖x_t − x_{t−1}‖²₂, (17)

where the last equality follows from x_1 = x_0. Plugging the above bound into (16), we have

Σ_{t=1}^T c_t(x_t) − min_{x∈P} Σ_{t=1}^T c_t(x) ≤ L/(2η) + (2η/L) EGV_{T,2} + ( 2ηL − L/(4η) ) Σ_{t=1}^T ‖x_t − x_{t−1}‖²₂. (18)

We complete the proof by plugging in the value of η, for which 2ηL − L/(4η) ≤ 0.

2.3 A General Prox Method and Some Special Cases

In this subsection, we first present a general prox method that yields a variation bound defined in terms of a general norm. We then discuss three special cases: online linear optimization, prediction with expert advice, and online strictly convex optimization. The omitted proofs in this subsection can easily be reproduced by mimicking the proof of Theorem 3, if necessary with the help of the previous analysis as indicated in the appropriate text.

Algorithm 5 A General Prox Method for Online Convex Optimization
1: Input: η > 0, ω(z)
2: Initialization: z_0 = x_0 = arg min_{z∈P} ω(z) and c_0(x) ≡ 0
3: for t = 1, ..., T do
4:   Predict x_t by x_t = arg min_{x∈P} { x^⊤∇c_{t−1}(x_{t−1}) + (L/η) D(x, z_{t−1}) }
5:   Receive cost function c_t and incur loss c_t(x_t)
6:   Update z_t by z_t = arg min_{x∈P} { x^⊤∇c_t(x_t) + (L/η) D(x, z_{t−1}) }
7: end for

2.3.1 A General Prox Method

The prox method, together with Lemma 2, provides an easy way to generalize the framework from the Euclidean norm to a general norm. To be precise, let ‖·‖ denote a general norm, let ‖·‖* denote its dual norm, let ω(z) be an α-strongly convex function with respect to the norm ‖·‖, and let D(x, z) = ω(x) − ω(z) − (x − z)^⊤∇ω(z) be the Bregman distance induced by ω(x). Let c_t, t = 1, ..., T, be L-smooth functions with respect to ‖·‖, i.e., ‖∇c_t(x) − ∇c_t(z)‖* ≤ L‖x − z‖. Correspondingly, we define the extended gradual variation based on the general norm as follows:

EGV_T = Σ_{t=0}^{T−1} ‖∇c_{t+1}(x_t) − ∇c_t(x_t)‖²*. (19)

Algorithm 5 gives the detailed steps of the general framework. We note that the key differences from Algorithm 4 are that z_0 is set to arg min_{z∈P} ω(z), and that the Euclidean distances in steps 4 and 6 are replaced by Bregman distances, i.e.,

x_t = arg min_{x∈P} { x^⊤∇c_{t−1}(x_{t−1}) + (L/η) D(x, z_{t−1}) },
z_t = arg min_{x∈P} { x^⊤∇c_t(x_t) + (L/η) D(x, z_{t−1}) }.

The following theorem states the variation-based regret bound for the general-norm framework, where R measures the size of P, defined as R = max_{x∈P} ω(x) − min_{x∈P} ω(x).
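For intuition, with ω(x) = (1/2)‖x‖²₂ (so D(x, z) = (1/2)‖x − z‖²₂) and P the unit ball — both assumptions of this sketch — Algorithm 5 reduces to Algorithm 4, and both prox steps become projected gradient steps; the sketch also makes the single-gradient-per-round structure explicit:

```python
import numpy as np

def proj_ball(x):
    n = np.linalg.norm(x)
    return x / n if n > 1.0 else x

def prox_method(grad_fns, eta, L, d):
    """Sketch of Algorithm 5 with omega(x) = 0.5||x||^2 over the unit
    ball, i.e. Algorithm 4.  grad_fns[t-1](x) returns grad c_t(x).
    Only one gradient per round is computed.
    """
    z = np.zeros(d)           # z_0 = x_0 = argmin omega
    prev_grad = np.zeros(d)   # grad c_{t-1}(x_{t-1}); c_0 = 0
    xs = []
    for t in range(1, len(grad_fns) + 1):
        # step 4: x_t = argmin x.grad c_{t-1}(x_{t-1}) + (L/eta) D(x, z_{t-1})
        x = proj_ball(z - (eta / L) * prev_grad)
        xs.append(x)
        # step 6: z_t = argmin x.grad c_t(x_t) + (L/eta) D(x, z_{t-1})
        g = grad_fns[t - 1](x)
        z = proj_ball(z - (eta / L) * g)
        prev_grad = g         # reused as the surrogate in round t+1
    return xs
```

On an unchanging cost sequence the gradual variation is a constant, and the sketch's regret indeed stays bounded independently of T.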

Theorem 4 Let c_t, t = 1, ..., T, be a sequence of convex functions that are L-smooth with respect to the norm ‖·‖, let ω(z) be α-strongly convex with respect to ‖·‖, and let EGV_T be defined in (19). By setting η = min { α/(2√2), L√(αR)/√(2 EGV_T) }, we have the following regret bound for Algorithm 5:

Σ_{t=1}^T c_t(x_t) − min_{x∈P} Σ_{t=1}^T c_t(x) ≤ 2 max( 2√2 LR/α, √(2R·EGV_T/α) ).

2.3.2 Online Linear Optimization

Here we consider online linear optimization and present the algorithm and the gradual variation bound for this setting as a special case of the proposed algorithm. In particular, we are interested in bounding the regret by the gradual variation

EGV^f_{T,2} = Σ_{t=0}^{T−1} ‖f_{t+1} − f_t‖²₂,

where f_t, t = 1, ..., T, are the linear cost vectors and f_0 = 0. Since linear functions are smooth functions that satisfy the inequality in (8) for any positive L > 0, we can apply Algorithm 4 to online linear optimization with any positive value of L³. The regret bound of Algorithm 4 for online linear optimization is presented in the following corollary.

Corollary 1 Let c_t(x) = f_t^⊤x, t = 1, ..., T, be a sequence of linear functions. By setting η = 1/√(2 EGV^f_{T,2}) and L = 1 in Algorithm 4, we have

Σ_{t=1}^T f_t^⊤x_t − min_{x∈P} Σ_{t=1}^T f_t^⊤x ≤ √(2 EGV^f_{T,2}).

Remark 3 Note that the regret bound in Corollary 1 is stronger than the regret bound obtained in [16] for online linear optimization, due to the fact that the gradual variation is smaller than the total variation.

2.3.3 Prediction with Expert Advice

In the problem of prediction with expert advice, the decision vector x is a distribution over m experts, i.e., x ∈ P = {x ∈ R^m₊ : Σ_{i=1}^m x_i = 1}. Let f_t ∈ R^m denote the costs of the m experts in trial t. Similar to [16], we would like to bound the regret of prediction from expert advice by the gradual variation defined in the infinity norm, i.e.,

EGV^f_{T,∞} = Σ_{t=0}^{T−1} ‖f_{t+1} − f_t‖²_∞.

³ We simply set L = 1 for online linear optimization and prediction with expert advice.

Since this is a special online linear optimization problem, we can apply Algorithm 4 to obtain a regret bound as in Corollary 1, i.e.,

Σ_{t=1}^T f_t^⊤x_t − min_{x∈P} Σ_{t=1}^T f_t^⊤x ≤ √(2 EGV^f_{T,2}) ≤ √(2m EGV^f_{T,∞}).

However, the above regret bound scales badly with the number of experts. We can obtain a better regret bound of order O(√(EGV^f_{T,∞} ln m)) by applying the general prox method in Algorithm 5 with ω(x) = Σ_{i=1}^m x_i ln x_i and D(x, z) = Σ_{i=1}^m x_i ln(x_i/z_i). The two updates in Algorithm 5 become

x^i_t = z^i_{t−1} exp(−(η/L) f^i_{t−1}) / Σ_{j=1}^m z^j_{t−1} exp(−(η/L) f^j_{t−1}), i = 1, ..., m,
z^i_t = z^i_{t−1} exp(−(η/L) f^i_t) / Σ_{j=1}^m z^j_{t−1} exp(−(η/L) f^j_t), i = 1, ..., m.

The resulting regret bound is formally stated in the following corollary.

Corollary 2 Let c_t(x) = f_t^⊤x, t = 1, ..., T, be a sequence of linear functions in prediction with expert advice. By setting η = √(ln m / EGV^f_{T,∞}), L = 1, ω(x) = Σ_{i=1}^m x_i ln x_i and D(x, z) = Σ_{i=1}^m x_i ln(x_i/z_i) in Algorithm 5, we have

Σ_{t=1}^T f_t^⊤x_t − min_{x∈P} Σ_{t=1}^T f_t^⊤x ≤ 2√(EGV^f_{T,∞} ln m).

Remark 4 By the definition of EGV^f_{T,∞}, the regret bound in Corollary 2 is of order O(√(Σ_t max_i |f^i_{t+1} − f^i_t|² ln m)), which is similar to the regret bound obtained in [16] for prediction with expert advice. However, the definitions of the variation are not exactly the same. In [16], the authors bound the regret of prediction with expert advice by O(√(ln m · max_i Σ_{t=1}^T |f^i_t − μ_i|²) + ln m), where the variation is the maximum total variation over all experts. To compare the two regret bounds, we first consider two extreme cases. When the costs of all experts are the same, the variation in Corollary 2 is a standard gradual variation, while the variation in [16] is a standard total variation; according to the previous analysis, a gradual variation is smaller than a total variation, so the regret bound in Corollary 2 is better than that in [16]. In the other extreme case, when the costs of each expert are the same across all iterations, both regret bounds are constants.
More generally, if we assume the maximum total variation is small, say a constant, then ∑_{t=0}^{T−1} |f_{t+1}^i − f_t^i|^2 is also a constant for any i ∈ [m]. By the trivial analysis

∑_{t=0}^{T−1} max_i |f_{t+1}^i − f_t^i|^2 ≤ m max_i ∑_{t=0}^{T−1} |f_{t+1}^i − f_t^i|^2,

the regret bound in Corollary 2 might be worse, up to a factor of √m, than that in [16].

Remark 5 It was shown in [8] that both the regret bounds in Corollary 1 and Corollary 2 are optimal, because they match the lower bounds for a special sequence of loss functions. In particular, for online linear optimization, if all loss functions except the first T_k = EGV_{f,T,2} are all-0 functions, then the known lower bound Ω(√T_k) [2] matches the upper bound in Corollary 1. Similarly, for prediction with expert advice, if all loss functions except the first T_k = EGV_{f,T,∞} are all-0 functions, then the known lower bound Ω(√(T_k ln m)) [6] matches the upper bound in Corollary 2.

2.3.4 Online Strictly Convex Optimization

In this subsection, we present an algorithm that achieves a logarithmic variation bound for online strictly convex optimization [14]. In particular, we assume the cost functions c_t(x) are not only smooth but also strictly convex, defined formally as follows.

Definition 1 For β > 0, a function c(x) : P → R is β-strictly convex if for any x, z ∈ P

c(x) ≥ c(z) + ∇c(z)^T (x − z) + β (x − z)^T ∇c(z) ∇c(z)^T (x − z).   (20)

Strictly convex functions so defined include strongly convex functions and exponentially concave functions as special cases, as long as the gradient of the function is bounded. To see this, if c(x) is a β′-strongly convex function with a bounded gradient ‖∇c(x)‖_2 ≤ G, then

c(x) ≥ c(z) + ∇c(z)^T (x − z) + (β′/2)‖x − z‖_2^2 ≥ c(z) + ∇c(z)^T (x − z) + (β′/(2G^2)) (x − z)^T ∇c(z) ∇c(z)^T (x − z),

thus c(x) is β′/(2G^2)-strictly convex. Similarly, if c(x) is exp-concave, i.e., there exists α > 0 such that h(x) = exp(−α c(x)) is concave, then c(x) is β-strictly convex with β = (1/2) min{1/(4GD), α} (c.f. Lemma 2 in [14]), where D is the diameter of the domain. Therefore, in addition to smoothness and strict convexity, we also assume all the cost functions have bounded gradients, i.e., ‖∇c_t(x)‖_2 ≤ G. In [8], a logarithmic gradual variation regret bound has been proved. For completeness, we present the algorithm in the framework of the general prox method and summarize the results and analysis.
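The generalized-Euclidean updates used in this subsection can be sketched on an unconstrained domain, where the matrix H_t accumulates gradient outer products; the exact constants in H_t are our reconstruction of the garbled source, and the quadratic losses used to exercise the updates are illustrative assumptions, not the paper's setting:

```python
import numpy as np

def strictly_convex_prox(grad_fn, T, d, beta, G):
    """Sketch of the x_t / z_t updates of this subsection on R^d, with
    H_t = (1 + beta*G^2) I + beta * sum_tau grad_tau grad_tau^T (assumed form).

    grad_fn(t, x) returns grad c_t(x). Returns the T x d array of decisions.
    """
    H = (1.0 + beta * G * G) * np.eye(d)   # H_0 before any gradient is seen
    z = np.zeros(d)
    g_prev = np.zeros(d)                   # gradient of c_0 = 0
    xs = []
    for t in range(T):
        # x_t: minimize x^T g_{t-1} + 0.5*||x - z_{t-1}||^2_{H_{t-1}}
        x = z - np.linalg.solve(H, g_prev)
        xs.append(x)
        g = grad_fn(t, x)
        H = H + beta * np.outer(g, g)      # H_t now includes grad c_t(x_t)
        # z_t: minimize x^T g_t + 0.5*||x - z_{t-1}||^2_{H_t}
        z = z - np.linalg.solve(H, g)
        g_prev = g
    return np.array(xs)
```

Because H_t grows with the accumulated outer products, the effective step sizes shrink over time, which is what yields the logarithmic dependence in the regret bound.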
To derive a logarithmic gradual variation bound for online strictly convex optimization, we need to change the Euclidean distance function in Algorithm 4 to a generalized Euclidean distance function. Specifically, at trial t, we let

H_t = (1 + βG^2) I + β ∑_{τ=0}^{t} ∇c_τ(x_τ) ∇c_τ(x_τ)^T

and use the generalized Euclidean distance D_t(x, z) = (1/2)‖x − z‖_{H_t}^2 = (1/2)(x − z)^T H_t (x − z) in updating

x_t and z_t, i.e.,

x_t = arg min_{x∈P} { x^T ∇c_{t−1}(x_{t−1}) + (1/2)‖x − z_{t−1}‖_{H_{t−1}}^2 },
z_t = arg min_{x∈P} { x^T ∇c_t(x_t) + (1/2)‖x − z_{t−1}‖_{H_t}^2 }.   (21)

To prove the regret bound, we can prove an inequality similar to that in Lemma 2 by applying ω(x) = (1/2)‖x‖_{H_t}^2, which is stated as follows:

∇c_t(x_t)^T (x_t − z) ≤ D_t(z, z_{t−1}) − D_t(z, z_t) + (1/2)‖∇c_t(x_t) − ∇c_{t−1}(x_{t−1})‖_{H_t^{−1}}^2 − (1/2)[ ‖x_t − z_{t−1}‖_{H_t}^2 + ‖x_t − z_t‖_{H_t}^2 ].

Then, by applying the inequality in (20) for strictly convex functions, we obtain the following:

c_t(x_t) − c_t(z) ≤ D_t(z, z_{t−1}) − D_t(z, z_t) − β‖x_t − z‖_{h_t}^2 + (1/2)‖∇c_t(x_t) − ∇c_{t−1}(x_{t−1})‖_{H_t^{−1}}^2 − (1/2)[ ‖x_t − z_{t−1}‖_{H_t}^2 + ‖x_t − z_t‖_{H_t}^2 ],

where h_t = ∇c_t(x_t) ∇c_t(x_t)^T. It remains to apply the analysis in [8] to obtain a logarithmic gradual variation bound, which is stated in the following corollary; its proof is deferred to Appendix C.

Corollary 3 Let c_t(x), t = 1, ..., T, be a sequence of β-strictly convex and L-smooth functions with gradients bounded by G. We assume 8dL ≥ 1; otherwise we can set L = 1/(8d). An algorithm that adopts the updates in (21) has its regret bounded by

∑_{t=1}^T c_t(x_t) − min_{x∈P} ∑_{t=1}^T c_t(x) ≤ 1 + βG^2 + (8d/β) ln max(16dL, β EGV_{T,2}),

where EGV_{T,2} = ∑_{t=0}^{T−1} ‖∇c_{t+1}(x_t) − ∇c_t(x_t)‖_2^2 and d is the dimension of x ∈ P.

3 Online Non-Smooth Optimization with Gradual Variation Bound

All the results we have obtained so far rely on the assumption that the cost functions are smooth. Moreover, at the beginning of Section 2 we showed that for general non-smooth functions, when the only information presented to the learner is the first-order information about the cost functions, it is impossible to obtain a regret bound by gradual variation. In this section, however, we show that a gradual variation bound is achievable for a special class of non-smooth functions that are composed of a smooth component and a non-smooth component.

We consider two categories of non-smooth components. In the first category, we assume that the non-smooth component is a fixed function and is relatively simple, so that the composite gradient mapping can be solved without much computational overhead compared to the gradient mapping. A common example that falls into this category is a non-smooth regularizer. For example, in addition to the basic domain P, one may want to enforce a sparsity constraint on the decision vector x, i.e., ‖x‖_0 ≤ k < d, which is important in feature selection. However, the sparsity constraint ‖x‖_0 ≤ k is a non-convex function, and it is usually implemented by adding an ℓ_1 regularizer λ‖x‖_1 to the objective function, where λ > 0 is a regularization parameter. Therefore, at each iteration the cost function is given by c_t(x) + λ‖x‖_1. To prove a regret bound by gradual variation for this type of non-smooth optimization, we first present a simplified version of the general prox method and show that it has exactly the same regret bound as stated in Subsection 2.3.1, and then extend the algorithm to non-smooth optimization with a fixed non-smooth component.

In the second category, we assume that the non-smooth component can be written with an explicit maximization structure. In general, we consider a time-varying non-smooth component, present a primal-dual prox method, and prove a min-max regret bound by gradual variation. When the non-smooth components are equal across all trials, the usual regret is bounded by the min-max bound plus a variation in the non-smooth component. To see an application of the min-max regret bound, we consider the problem of online classification with hinge loss and show that the number of mistakes can be bounded by a variation in the sequential examples.
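For the first category, with g(x) = λ‖x‖_1 and Euclidean ω on an unconstrained domain, the composite gradient mapping mentioned above has a closed-form soft-thresholding solution; a minimal sketch (the argument names and the generic step size η/L are our notation, not fixed by the source):

```python
import numpy as np

def soft_threshold(v, tau):
    """Entrywise solution of min_x 0.5*(x - v)^2 + tau*|x|."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def composite_mapping(g, z, lam, eta_over_L):
    """Composite gradient mapping for the l1 regularizer on R^d:

        argmin_x  <x, g> + (L/(2*eta)) * ||x - z||_2^2 + lam * ||x||_1

    which reduces to soft-thresholding the gradient step z - (eta/L)*g.
    """
    return soft_threshold(z - eta_over_L * g, eta_over_L * lam)
```

This is why the ℓ_1 composite mapping costs essentially the same as a plain gradient mapping: it is a gradient step followed by an entrywise shrinkage.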
Before moving to the detailed analysis, it is worth mentioning that several works have proposed algorithms for optimizing the two types of non-smooth functions described above with an optimal convergence rate of O(1/T) [4,3]. Therefore, the existence of a regret bound by gradual variation for these two types of non-smooth optimization does not contradict the impossibility argument in Section 2.

3.1 Online Non-Smooth Optimization with a Fixed Non-Smooth Component

3.1.1 A Simplified Version of Algorithm 5

In this subsection, we present a simplified version of Algorithm 5, which is the foundation on which we develop the algorithm for non-smooth optimization. The key trick is to replace the domain constraint x ∈ P with a non-smooth function in the objective. Let δ_P(x) denote the indicator function of the domain P, i.e.,

δ_P(x) = 0 if x ∈ P, and δ_P(x) = ∞ otherwise.

Then the proximal gradient mapping for updating x_t (step 4 in Algorithm 5) is equivalent to

x_t = arg min_x x^T ∇c_{t−1}(x_{t−1}) + (L/η) D(x, z_{t−1}) + δ_P(x).   (22)

By the first-order optimality condition, there exists a sub-gradient v_t ∈ ∂δ_P(x_t) such that

∇c_{t−1}(x_{t−1}) + (L/η)(∇ω(x_t) − ∇ω(z_{t−1})) + v_t = 0.   (23)

Thus, x_t is equal to

x_t = arg min_x x^T (∇c_{t−1}(x_{t−1}) + v_t) + (L/η) D(x, z_{t−1}).   (24)

Then we can change the update for z_t to

z_t = arg min_x x^T (∇c_t(x_t) + v_t) + (L/η) D(x, z_{t−1}).   (25)

The key ingredient of the above update, compared to step 6 in Algorithm 5, is that we explicitly use the sub-gradient v_t that satisfies the optimality condition for x_t instead of solving a domain-constrained optimization problem. The advantage of updating z_t by (25) is that we can easily compute z_t from the first-order optimality condition, i.e.,

∇c_t(x_t) + v_t + (L/η)(∇ω(z_t) − ∇ω(z_{t−1})) = 0.   (26)

Note that equation (23) gives v_t = −∇c_{t−1}(x_{t−1}) − (L/η)(∇ω(x_t) − ∇ω(z_{t−1})). By plugging this into (26), we arrive at the following simplified update for z_t:

∇ω(z_t) = ∇ω(x_t) − (η/L)(∇c_t(x_t) − ∇c_{t−1}(x_{t−1})).

The simplified version of Algorithm 5 is presented in Algorithm 6.

Remark 6 We make three remarks for Algorithm 6. First, the searching point z_t does not necessarily belong to the domain P, which is usually not a problem given that the decision point x_t is always in P. Nevertheless, the update can be followed by a projection step z_t = arg min_{x∈P} D(x, z_t) to ensure that the searching point also stays in the domain P, where we slightly abuse the notation z_t for the solution of ∇ω(z_t) = ∇ω(x_t) − (η/L)(∇c_t(x_t) − ∇c_{t−1}(x_{t−1})). Second, the update in step 6 can be implemented by [6, Chapter 11]:

z_t = ∇ω*(∇ω(x_t) − (η/L)(∇c_t(x_t) − ∇c_{t−1}(x_{t−1}))),

Algorithm 6 A Simplified General Prox Method for Online Convex Optimization
1: Input: η > 0, ω(z)
2: Initialization: z_0 = x_0 = arg min_{z∈P} ω(z) and c_0(x) = 0
3: for t = 1, ..., T do
4: Predict x_t by x_t = arg min_{x∈P} { x^T ∇c_{t−1}(x_{t−1}) + (L/η) D(x, z_{t−1}) }
5: Receive cost function c_t and incur loss c_t(x_t)
6: Update z_t by solving ∇ω(z_t) = ∇ω(x_t) − (η/L)(∇c_t(x_t) − ∇c_{t−1}(x_{t−1}))
7: end for

where ω* is the Legendre-Fenchel conjugate of ω. For example, when ω(x) = (1/2)‖x‖_2^2, we have ω*(x) = (1/2)‖x‖_2^2 and the update for the searching point is given by z_t = x_t − (η/L)(∇c_t(x_t) − ∇c_{t−1}(x_{t−1})); when ω(x) = ∑_i x_i ln x_i, we have ω*(x) = ln[∑_i exp(x_i)] and the update for the searching point can be computed by

[z_t]_i ∝ [x_t]_i exp(−(η/L)[∇c_t(x_t) − ∇c_{t−1}(x_{t−1})]_i), s.t. ∑_i [z_t]_i = 1.

Third, the key inequality in (15) for proving the regret bound still holds for ζ = ∇c_{t−1}(x_{t−1}) + v_{t−1} and ξ = ∇c_t(x_t) + v_t, by noting the equivalence between the pairs (24), (25) and (13), (14); it is given below:

(η/L)(∇c_t(x_t) + v_t)^T (x_t − x) ≤ D(x, z_{t−1}) − D(x, z_t) + (η^2/(αL^2))‖∇c_t(x_t) − ∇c_{t−1}(x_{t−1})‖_*^2 − (α/2)[ ‖x_t − z_{t−1}‖^2 + ‖x_t − z_t‖^2 ], ∀x ∈ P,

where v_t ∈ ∂δ_P(x_t). As a result, we can apply the same analysis as in the proof of Theorem 3 to obtain the same regret bound as in Theorem 4 for Algorithm 6. Note that the above inequality remains valid even if we take a projection step after the update for z_t, due to the generalized Pythagorean inequality D(x, z_t') ≤ D(x, z_t), ∀x ∈ P [6], where z_t' = arg min_{x∈P} D(x, z_t).

3.1.2 A Gradual Variation Bound for Online Non-Smooth Optimization

In the spirit of Algorithm 6, we present an algorithm for online non-smooth optimization of functions c_t(x) = ĉ_t(x) + g(x) with a regret bound by gradual

Algorithm 7 A General Prox Method for Online Non-Smooth Optimization with a Fixed Non-Smooth Component
1: Input: η > 0, ω(z)
2: Initialization: z_0 = x_0 = arg min_{z∈P} ω(z) and c_0(x) = 0
3: for t = 1, ..., T do
4: Predict x_t by x_t = arg min_x { x^T ∇ĉ_{t−1}(x_{t−1}) + (L/η) D(x, z_{t−1}) + g(x) }
5: Receive cost function c_t and incur loss c_t(x_t)
6: Update z_t by z_t = ∇ω*(∇ω(x_t) − (η/L)(∇ĉ_t(x_t) − ∇ĉ_{t−1}(x_{t−1}))) and z_t = arg min_{x∈P} D(x, z_t)
7: end for

variation EGV_T = ∑_{t=0}^{T−1} ‖∇ĉ_{t+1}(x_t) − ∇ĉ_t(x_t)‖_*^2. The trick is to solve the composite gradient mapping

x_t = arg min_x x^T ∇ĉ_{t−1}(x_{t−1}) + (L/η) D(x, z_{t−1}) + g(x),

and to update z_t by

∇ω(z_t) = ∇ω(x_t) − (η/L)(∇ĉ_t(x_t) − ∇ĉ_{t−1}(x_{t−1})).

Algorithm 7 shows the detailed steps, and Corollary 4 states the regret bound, which can be proved similarly.

Corollary 4 Let c_t = ĉ_t + g, t = 1, ..., T, be a sequence of convex functions, where the ĉ_t are L-smooth w.r.t. ‖·‖ and g is a non-smooth function; let ω(z) be an α-strongly convex function w.r.t. ‖·‖, and let EGV_T be defined as in (19). By setting η = (1/2) min{α/2, LR/√(EGV_T)}, we have the following regret bound:

∑_{t=1}^T c_t(x_t) − min_{x∈P} ∑_{t=1}^T c_t(x) ≤ 2R max(2LR/α, √(EGV_T)).

3.2 Online Non-Smooth Optimization with an Explicit Max Structure

In the previous subsection, we assumed that the composite gradient mapping with the non-smooth component can be solved efficiently. Here, we replace this assumption with an explicit max structure of the non-smooth component. In what follows, we present a primal-dual prox method for such non-smooth cost functions and prove its regret bound. We consider a general setting, where

the non-smooth functions c_t(x) have the following structure:

c_t(x) = ĉ_t(x) + max_{u∈Q} ⟨A_t x, u⟩ − φ_t(u),   (27)

where ĉ_t(x) and φ_t(u) are L_1-smooth and L_2-smooth functions, respectively, and A_t ∈ R^{m×d} is a matrix used to characterize, together with φ_t(u), the non-smooth component of c_t(x) by maximization. Similarly, we define a dual cost function φ̄_t(u) as

φ̄_t(u) = −φ_t(u) + min_{x∈P} [ ⟨A_t x, u⟩ + ĉ_t(x) ].   (28)

We refer to x as the primal variable and to u as the dual variable. To motivate this setup, let us consider online classification with the hinge loss ℓ_t(w) = max(0, 1 − y_t w^T x_t), where we slightly abuse the notation (x_t, y_t) to denote the attribute-label pair received at trial t. It is straightforward to see that ℓ_t(w) is a non-smooth function and can be cast into the form in (27) by

ℓ_t(w) = max_{α∈[0,1]} α(1 − y_t w^T x_t) = max_{α∈[0,1]} −α y_t x_t^T w + α.

To present the algorithm and analyze its regret bound, we introduce some notation. Let F_t(x, u) = ĉ_t(x) + ⟨A_t x, u⟩ − φ_t(u); let ω_1(x) be an α_1-strongly convex function defined on the primal variable x w.r.t. a norm ‖·‖_p, and ω_2(u) an α_2-strongly convex function defined on the dual variable u w.r.t. a norm ‖·‖_q. Correspondingly, let D_1(x, z) and D_2(u, v) denote the induced Bregman distances, respectively. We assume that the domains P, Q are bounded and that the matrices A_t have a bounded norm, i.e.,

max_{x∈P} ‖x‖_p ≤ R_1, max_{u∈Q} ‖u‖_q ≤ R_2,
max_{x∈P} ω_1(x) − min_{x∈P} ω_1(x) ≤ M_1^2, max_{u∈Q} ω_2(u) − min_{u∈Q} ω_2(u) ≤ M_2^2,
‖A_t‖_{p,q} = max_{‖x‖_p ≤ 1, ‖u‖_q ≤ 1} ⟨A_t x, u⟩ ≤ σ.   (29)

Let ‖·‖_{p,*} and ‖·‖_{q,*} denote the dual norms of ‖·‖_p and ‖·‖_q, respectively. To prove a variational regret bound, we define a gradual variation as follows:

EGV_{T,p,q} = ∑_{t=0}^{T−1} [ ‖∇ĉ_{t+1}(x_t) − ∇ĉ_t(x_t)‖_{p,*}^2 + (R_1^2 + R_2^2)‖A_{t+1} − A_t‖_{p,q}^2 + ‖∇φ_{t+1}(u_t) − ∇φ_t(u_t)‖_{q,*}^2 ].   (30)

Given the above notation, Algorithm 8 shows the detailed steps and Theorem 5 states a min-max bound.
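The max-structure representation of the hinge loss above is easy to verify numerically; the following sketch checks max_{α∈[0,1]} α(1 − y w^T x) against the closed-form hinge loss over a grid of α values (the grid evaluation is purely illustrative):

```python
import numpy as np

def hinge(w, x, y):
    """Closed-form hinge loss max(0, 1 - y * w.x)."""
    return max(0.0, 1.0 - y * np.dot(w, x))

def hinge_via_max(w, x, y, grid=1001):
    """Evaluate max_{alpha in [0,1]} alpha * (1 - y * w.x) on a grid,
    i.e., the explicit-max representation (27) of the hinge loss."""
    alphas = np.linspace(0.0, 1.0, grid)
    return float(np.max(alphas * (1.0 - y * np.dot(w, x))))
```

Since the objective is linear in α, the maximum is attained at α = 1 when the margin term 1 − y w^T x is positive and at α = 0 otherwise, recovering the hinge loss exactly.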

Algorithm 8 General Prox Method for Online Non-Smooth Optimization with an Explicit Max Structure
1: Input: η > 0, ω_1(z), ω_2(v)
2: Initialization: z_0 = x_0 = arg min_{z∈P} ω_1(z), v_0 = u_0 = arg min_{v∈Q} ω_2(v) and ĉ_0(x) = φ_0(u) = 0
3: for t = 1, ..., T do
4: Update u_t by u_t = arg max_{u∈Q} { u^T A_{t−1} x_{t−1} − φ_{t−1}(u) − (L/η) D_2(u, v_{t−1}) }
5: Predict x_t by x_t = arg min_{x∈P} { x^T (∇ĉ_{t−1}(x_{t−1}) + A_{t−1}^T u_t) + (L/η) D_1(x, z_{t−1}) }
6: Receive cost function c_t and incur loss c_t(x_t)
7: Update v_t by v_t = arg max_{u∈Q} { u^T A_t x_t − φ_t(u) − (L/η) D_2(u, v_{t−1}) }
8: Update z_t by z_t = arg min_{x∈P} { x^T (∇ĉ_t(x_t) + A_t^T u_t) + (L/η) D_1(x, z_{t−1}) }
9: end for

Theorem 5 Let c_t(x) = ĉ_t(x) + max_{u∈Q} ⟨A_t x, u⟩ − φ_t(u), t = 1, ..., T, be a sequence of non-smooth functions. Assume that ĉ_t(x), φ_t(u) are L-smooth functions with L = max(L_1, L_2), and that the domains P, Q and the matrices A_t satisfy the boundedness conditions in (29). Let ω_1(x) be an α_1-strongly convex function w.r.t. the norm ‖·‖_p, let ω_2(u) be an α_2-strongly convex function w.r.t. the norm ‖·‖_q, and let α = min(α_1, α_2). By setting

η = min{ √(M_1^2 + M_2^2) / √(EGV_{T,p,q}), α/(4(σ + L)) }

in Algorithm 8, we have

max_{u∈Q} ∑_{t=1}^T F_t(x_t, u) − min_{x∈P} ∑_{t=1}^T F_t(x, u_t) ≤ 4√(M_1^2 + M_2^2) max( √(M_1^2 + M_2^2)(σ + L)/α, √(EGV_{T,p,q}) ).

To facilitate understanding, we break the proof into several lemmas. The following lemma is analogous to Lemma 2.

Lemma 3 Let θ = (x; u) denote the concatenated vector with norm ‖θ‖ = √(‖x‖_p^2 + ‖u‖_q^2) and dual norm ‖θ‖_* = √(‖x‖_{p,*}^2 + ‖u‖_{q,*}^2). Let ω(θ) = ω_1(x) + ω_2(u) and D(θ, ζ) =

D_1(x, z) + D_2(u, v). Then

η[ ∇_x F_t(x_t, u_t)^T (x_t − x) − ∇_u F_t(x_t, u_t)^T (u_t − u) ] ≤ D(θ, ζ_{t−1}) − D(θ, ζ_t) + η^2 ‖∇_x F_t(x_t, u_t) − ∇_x F_{t−1}(x_{t−1}, u_{t−1})‖_{p,*}^2 + η^2 ‖∇_u F_t(x_t, u_t) − ∇_u F_{t−1}(x_{t−1}, u_{t−1})‖_{q,*}^2 − (α/2)[ ‖x_t − z_{t−1}‖_p^2 + ‖u_t − v_{t−1}‖_q^2 + ‖x_t − z_t‖_p^2 + ‖u_t − v_t‖_q^2 ].

Proof The updates of x_t, u_t in Algorithm 8 can be seen as applying the updates in Lemma 2 with θ_t = (x_t; u_t) in place of x, ζ_{t−1} = (z_{t−1}; v_{t−1}) in place of z, and ζ_t = (z_t; v_t) in place of z_+. Note that ω(θ) is an α = min(α_1, α_2)-strongly convex function with respect to the norm ‖θ‖. Applying the results in Lemma 2 then completes the proof.

Applying the convexity of F_t(x, u) in terms of x and the concavity of F_t(x, u) in terms of u to the result in Lemma 3, we have

η(F_t(x_t, u) − F_t(x, u_t)) ≤ D_1(x, z_{t−1}) − D_1(x, z_t) + D_2(u, v_{t−1}) − D_2(u, v_t)
+ η^2 ‖∇ĉ_t(x_t) − ∇ĉ_{t−1}(x_{t−1}) + A_t^T u_t − A_{t−1}^T u_{t−1}‖_{p,*}^2
+ η^2 ‖∇φ_t(u_t) − ∇φ_{t−1}(u_{t−1}) + A_t x_t − A_{t−1} x_{t−1}‖_{q,*}^2
− (α/2)[ ‖x_t − z_{t−1}‖_p^2 + ‖x_t − z_t‖_p^2 ] − (α/2)[ ‖u_t − v_{t−1}‖_q^2 + ‖u_t − v_t‖_q^2 ]
≤ D_1(x, z_{t−1}) − D_1(x, z_t) + D_2(u, v_{t−1}) − D_2(u, v_t)   (31)
+ 2η^2 ‖∇ĉ_t(x_t) − ∇ĉ_{t−1}(x_{t−1})‖_{p,*}^2 + 2η^2 ‖A_t^T u_t − A_{t−1}^T u_{t−1}‖_{p,*}^2
+ 2η^2 ‖∇φ_t(u_t) − ∇φ_{t−1}(u_{t−1})‖_{q,*}^2 + 2η^2 ‖A_t x_t − A_{t−1} x_{t−1}‖_{q,*}^2
− (α/2)[ ‖x_t − z_{t−1}‖_p^2 + ‖x_t − z_t‖_p^2 ] − (α/2)[ ‖u_t − v_{t−1}‖_q^2 + ‖u_t − v_t‖_q^2 ].

The following lemma provides the tools for proceeding with the bound.

Lemma 4

‖∇ĉ_t(x_t) − ∇ĉ_{t−1}(x_{t−1})‖_{p,*}^2 ≤ 2‖∇ĉ_t(x_{t−1}) − ∇ĉ_{t−1}(x_{t−1})‖_{p,*}^2 + 2L^2 ‖x_t − x_{t−1}‖_p^2
‖∇φ_t(u_t) − ∇φ_{t−1}(u_{t−1})‖_{q,*}^2 ≤ 2‖∇φ_t(u_{t−1}) − ∇φ_{t−1}(u_{t−1})‖_{q,*}^2 + 2L^2 ‖u_t − u_{t−1}‖_q^2
‖A_t x_t − A_{t−1} x_{t−1}‖_{q,*}^2 ≤ 2R_1^2 ‖A_t − A_{t−1}‖_{p,q}^2 + 2σ^2 ‖x_t − x_{t−1}‖_p^2
‖A_t^T u_t − A_{t−1}^T u_{t−1}‖_{p,*}^2 ≤ 2R_2^2 ‖A_t − A_{t−1}‖_{p,q}^2 + 2σ^2 ‖u_t − u_{t−1}‖_q^2

Proof We prove the first and the third inequalities; the other two can be proved similarly. For the first, by the triangle inequality and the smoothness of ĉ_t(x),

‖∇ĉ_t(x_t) − ∇ĉ_{t−1}(x_{t−1})‖_{p,*} ≤ ‖∇ĉ_t(x_t) − ∇ĉ_t(x_{t−1})‖_{p,*} + ‖∇ĉ_t(x_{t−1}) − ∇ĉ_{t−1}(x_{t−1})‖_{p,*} ≤ L‖x_t − x_{t−1}‖_p + ‖∇ĉ_t(x_{t−1}) − ∇ĉ_{t−1}(x_{t−1})‖_{p,*},

and the claim follows from (a + b)^2 ≤ 2a^2 + 2b^2. For the third,

‖A_t x_t − A_{t−1} x_{t−1}‖_{q,*} ≤ ‖(A_t − A_{t−1}) x_{t−1}‖_{q,*} + ‖A_t (x_t − x_{t−1})‖_{q,*} ≤ R_1 ‖A_t − A_{t−1}‖_{p,q} + σ ‖x_t − x_{t−1}‖_p.

Lemma 5 For any x ∈ P and u ∈ Q, we have

η(F_t(x_t, u) − F_t(x, u_t)) ≤ D_1(x, z_{t−1}) − D_1(x, z_t) + D_2(u, v_{t−1}) − D_2(u, v_t)
+ 4η^2 [ ‖∇ĉ_t(x_{t−1}) − ∇ĉ_{t−1}(x_{t−1})‖_{p,*}^2 + ‖∇φ_t(u_{t−1}) − ∇φ_{t−1}(u_{t−1})‖_{q,*}^2 + (R_1^2 + R_2^2)‖A_t − A_{t−1}‖_{p,q}^2 ]
+ 4η^2 (σ^2 + L^2) ‖x_t − x_{t−1}‖_p^2 − (α/2)[ ‖x_t − z_{t−1}‖_p^2 + ‖x_t − z_t‖_p^2 ]
+ 4η^2 (σ^2 + L^2) ‖u_t − u_{t−1}‖_q^2 − (α/2)[ ‖u_t − v_{t−1}‖_q^2 + ‖u_t − v_t‖_q^2 ].

Proof The lemma is proved by plugging the inequalities of Lemma 4 into the inequality in (31) and collecting terms.


More information

Lecture 24 November 27

Lecture 24 November 27 EE 381V: Large Scale Optimization Fall 01 Lecture 4 November 7 Lecturer: Caramanis & Sanghavi Scribe: Jahshan Bhatti and Ken Pesyna 4.1 Mirror Descent Earlier, we motivated mirror descent as a way to improve

More information

Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations

Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations JMLR: Workshop and Conference Proceedings vol 35:1 20, 2014 Unconstrained Online Linear Learning in Hilbert Spaces: Minimax Algorithms and Normal Approximations H. Brendan McMahan MCMAHAN@GOOGLE.COM Google,

More information

A Modular Analysis of Adaptive (Non-)Convex Optimization: Optimism, Composite Objectives, and Variational Bounds

A Modular Analysis of Adaptive (Non-)Convex Optimization: Optimism, Composite Objectives, and Variational Bounds Proceedings of Machine Learning Research 76: 40, 207 Algorithmic Learning Theory 207 A Modular Analysis of Adaptive (Non-)Convex Optimization: Optimism, Composite Objectives, and Variational Bounds Pooria

More information

CS261: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm

CS261: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm CS61: A Second Course in Algorithms Lecture #11: Online Learning and the Multiplicative Weights Algorithm Tim Roughgarden February 9, 016 1 Online Algorithms This lecture begins the third module of the

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

Convergence rate of SGD

Convergence rate of SGD Convergence rate of SGD heorem: (see Nemirovski et al 09 from readings) Let f be a strongly convex stochastic function Assume gradient of f is Lipschitz continuous and bounded hen, for step sizes: he expected

More information

Optimization, Learning, and Games with Predictable Sequences

Optimization, Learning, and Games with Predictable Sequences Optimization, Learning, and Games with Predictable Sequences Alexander Rakhlin University of Pennsylvania Karthik Sridharan University of Pennsylvania Abstract We provide several applications of Optimistic

More information

Better Algorithms for Benign Bandits

Better Algorithms for Benign Bandits Better Algorithms for Benign Bandits Elad Hazan IBM Almaden 650 Harry Rd, San Jose, CA 95120 ehazan@cs.princeton.edu Satyen Kale Microsoft Research One Microsoft Way, Redmond, WA 98052 satyen.kale@microsoft.com

More information

Lecture 8: Decision-making under total uncertainty: the multiplicative weight algorithm. Lecturer: Sanjeev Arora

Lecture 8: Decision-making under total uncertainty: the multiplicative weight algorithm. Lecturer: Sanjeev Arora princeton univ. F 13 cos 521: Advanced Algorithm Design Lecture 8: Decision-making under total uncertainty: the multiplicative weight algorithm Lecturer: Sanjeev Arora Scribe: (Today s notes below are

More information

Supplementary Material of Dynamic Regret of Strongly Adaptive Methods

Supplementary Material of Dynamic Regret of Strongly Adaptive Methods Supplementary Material of A Proof of Lemma 1 Lijun Zhang 1 ianbao Yang Rong Jin 3 Zhi-Hua Zhou 1 We first prove the first part of Lemma 1 Let k = log K t hen, integer t can be represented in the base-k

More information

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties

One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties One Mirror Descent Algorithm for Convex Constrained Optimization Problems with Non-Standard Growth Properties Fedor S. Stonyakin 1 and Alexander A. Titov 1 V. I. Vernadsky Crimean Federal University, Simferopol,

More information

Learnability, Stability, Regularization and Strong Convexity

Learnability, Stability, Regularization and Strong Convexity Learnability, Stability, Regularization and Strong Convexity Nati Srebro Shai Shalev-Shwartz HUJI Ohad Shamir Weizmann Karthik Sridharan Cornell Ambuj Tewari Michigan Toyota Technological Institute Chicago

More information

Online Submodular Minimization

Online Submodular Minimization Online Submodular Minimization Elad Hazan IBM Almaden Research Center 650 Harry Rd, San Jose, CA 95120 hazan@us.ibm.com Satyen Kale Yahoo! Research 4301 Great America Parkway, Santa Clara, CA 95054 skale@yahoo-inc.com

More information

Interior-Point Methods for Linear Optimization

Interior-Point Methods for Linear Optimization Interior-Point Methods for Linear Optimization Robert M. Freund and Jorge Vera March, 204 c 204 Robert M. Freund and Jorge Vera. All rights reserved. Linear Optimization with a Logarithmic Barrier Function

More information

Primal-dual subgradient methods for convex problems

Primal-dual subgradient methods for convex problems Primal-dual subgradient methods for convex problems Yu. Nesterov March 2002, September 2005 (after revision) Abstract In this paper we present a new approach for constructing subgradient schemes for different

More information

Efficient Methods for Stochastic Composite Optimization

Efficient Methods for Stochastic Composite Optimization Efficient Methods for Stochastic Composite Optimization Guanghui Lan School of Industrial and Systems Engineering Georgia Institute of Technology, Atlanta, GA 3033-005 Email: glan@isye.gatech.edu June

More information

The Interplay Between Stability and Regret in Online Learning

The Interplay Between Stability and Regret in Online Learning The Interplay Between Stability and Regret in Online Learning Ankan Saha Department of Computer Science University of Chicago ankans@cs.uchicago.edu Prateek Jain Microsoft Research India prajain@microsoft.com

More information

Optimal Regularized Dual Averaging Methods for Stochastic Optimization

Optimal Regularized Dual Averaging Methods for Stochastic Optimization Optimal Regularized Dual Averaging Methods for Stochastic Optimization Xi Chen Machine Learning Department Carnegie Mellon University xichen@cs.cmu.edu Qihang Lin Javier Peña Tepper School of Business

More information

Accelerating Online Convex Optimization via Adaptive Prediction

Accelerating Online Convex Optimization via Adaptive Prediction Mehryar Mohri Courant Institute and Google Research New York, NY 10012 mohri@cims.nyu.edu Scott Yang Courant Institute New York, NY 10012 yangs@cims.nyu.edu Abstract We present a powerful general framework

More information

Learning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley

Learning Methods for Online Prediction Problems. Peter Bartlett Statistics and EECS UC Berkeley Learning Methods for Online Prediction Problems Peter Bartlett Statistics and EECS UC Berkeley Online Learning Repeated game: Aim: minimize ˆL n = Decision method plays a t World reveals l t L n l t (a

More information

Dynamic Regret of Strongly Adaptive Methods

Dynamic Regret of Strongly Adaptive Methods Lijun Zhang 1 ianbao Yang 2 Rong Jin 3 Zhi-Hua Zhou 1 Abstract o cope with changing environments, recent developments in online learning have introduced the concepts of adaptive regret and dynamic regret

More information

Homotopy Smoothing for Non-Smooth Problems with Lower Complexity than O(1/ɛ)

Homotopy Smoothing for Non-Smooth Problems with Lower Complexity than O(1/ɛ) 1 28 Homotopy Smoothing for Non-Smooth Problems with Lower Complexity than O(1/) Yi Xu yi-xu@uiowa.edu Yan Yan yan.yan-3@student.uts.edu.au Qihang Lin qihang-lin@uiowa.edu Tianbao Yang tianbao-yang@uiowa.edu

More information

Delay-Tolerant Online Convex Optimization: Unified Analysis and Adaptive-Gradient Algorithms

Delay-Tolerant Online Convex Optimization: Unified Analysis and Adaptive-Gradient Algorithms Delay-Tolerant Online Convex Optimization: Unified Analysis and Adaptive-Gradient Algorithms Pooria Joulani 1 András György 2 Csaba Szepesvári 1 1 Department of Computing Science, University of Alberta,

More information

Better Algorithms for Benign Bandits

Better Algorithms for Benign Bandits Better Algorithms for Benign Bandits Elad Hazan IBM Almaden 650 Harry Rd, San Jose, CA 95120 hazan@us.ibm.com Satyen Kale Microsoft Research One Microsoft Way, Redmond, WA 98052 satyen.kale@microsoft.com

More information

Near-Potential Games: Geometry and Dynamics

Near-Potential Games: Geometry and Dynamics Near-Potential Games: Geometry and Dynamics Ozan Candogan, Asuman Ozdaglar and Pablo A. Parrilo January 29, 2012 Abstract Potential games are a special class of games for which many adaptive user dynamics

More information

Stochastic and Adversarial Online Learning without Hyperparameters

Stochastic and Adversarial Online Learning without Hyperparameters Stochastic and Adversarial Online Learning without Hyperparameters Ashok Cutkosky Department of Computer Science Stanford University ashokc@cs.stanford.edu Kwabena Boahen Department of Bioengineering Stanford

More information

Exponentiated Gradient Descent

Exponentiated Gradient Descent CSE599s, Spring 01, Online Learning Lecture 10-04/6/01 Lecturer: Ofer Dekel Exponentiated Gradient Descent Scribe: Albert Yu 1 Introduction In this lecture we review norms, dual norms, strong convexity,

More information

Online Learning and Sequential Decision Making

Online Learning and Sequential Decision Making Online Learning and Sequential Decision Making Emilie Kaufmann CNRS & CRIStAL, Inria SequeL, emilie.kaufmann@univ-lille.fr Research School, ENS Lyon, Novembre 12-13th 2018 Emilie Kaufmann Online Learning

More information

Online Prediction: Bayes versus Experts

Online Prediction: Bayes versus Experts Marcus Hutter - 1 - Online Prediction Bayes versus Experts Online Prediction: Bayes versus Experts Marcus Hutter Istituto Dalle Molle di Studi sull Intelligenza Artificiale IDSIA, Galleria 2, CH-6928 Manno-Lugano,

More information

Convex hull of two quadratic or a conic quadratic and a quadratic inequality

Convex hull of two quadratic or a conic quadratic and a quadratic inequality Noname manuscript No. (will be inserted by the editor) Convex hull of two quadratic or a conic quadratic and a quadratic inequality Sina Modaresi Juan Pablo Vielma the date of receipt and acceptance should

More information

Introduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Intro to Learning Theory Date: 12/8/16

Introduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Intro to Learning Theory Date: 12/8/16 600.463 Introduction to Algorithms / Algorithms I Lecturer: Michael Dinitz Topic: Intro to Learning Theory Date: 12/8/16 25.1 Introduction Today we re going to talk about machine learning, but from an

More information

Gambling in a rigged casino: The adversarial multi-armed bandit problem

Gambling in a rigged casino: The adversarial multi-armed bandit problem Gambling in a rigged casino: The adversarial multi-armed bandit problem Peter Auer Institute for Theoretical Computer Science University of Technology Graz A-8010 Graz (Austria) pauer@igi.tu-graz.ac.at

More information

Learning, Games, and Networks

Learning, Games, and Networks Learning, Games, and Networks Abhishek Sinha Laboratory for Information and Decision Systems MIT ML Talk Series @CNRG December 12, 2016 1 / 44 Outline 1 Prediction With Experts Advice 2 Application to

More information

A strongly polynomial algorithm for linear systems having a binary solution

A strongly polynomial algorithm for linear systems having a binary solution A strongly polynomial algorithm for linear systems having a binary solution Sergei Chubanov Institute of Information Systems at the University of Siegen, Germany e-mail: sergei.chubanov@uni-siegen.de 7th

More information

Lower and Upper Bounds on the Generalization of Stochastic Exponentially Concave Optimization

Lower and Upper Bounds on the Generalization of Stochastic Exponentially Concave Optimization JMLR: Workshop and Conference Proceedings vol 40:1 16, 2015 28th Annual Conference on Learning Theory Lower and Upper Bounds on the Generalization of Stochastic Exponentially Concave Optimization Mehrdad

More information

15-859E: Advanced Algorithms CMU, Spring 2015 Lecture #16: Gradient Descent February 18, 2015

15-859E: Advanced Algorithms CMU, Spring 2015 Lecture #16: Gradient Descent February 18, 2015 5-859E: Advanced Algorithms CMU, Spring 205 Lecture #6: Gradient Descent February 8, 205 Lecturer: Anupam Gupta Scribe: Guru Guruganesh In this lecture, we will study the gradient descent algorithm and

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Online Convex Optimization MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Online projected sub-gradient descent. Exponentiated Gradient (EG). Mirror descent.

More information

Stability of Feedback Solutions for Infinite Horizon Noncooperative Differential Games

Stability of Feedback Solutions for Infinite Horizon Noncooperative Differential Games Stability of Feedback Solutions for Infinite Horizon Noncooperative Differential Games Alberto Bressan ) and Khai T. Nguyen ) *) Department of Mathematics, Penn State University **) Department of Mathematics,

More information

Online Passive-Aggressive Algorithms

Online Passive-Aggressive Algorithms Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il

More information

Conditional Gradient (Frank-Wolfe) Method

Conditional Gradient (Frank-Wolfe) Method Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties

More information

On the Power of Robust Solutions in Two-Stage Stochastic and Adaptive Optimization Problems

On the Power of Robust Solutions in Two-Stage Stochastic and Adaptive Optimization Problems MATHEMATICS OF OPERATIONS RESEARCH Vol. 35, No., May 010, pp. 84 305 issn 0364-765X eissn 156-5471 10 350 084 informs doi 10.187/moor.1090.0440 010 INFORMS On the Power of Robust Solutions in Two-Stage

More information

WE consider an undirected, connected network of n

WE consider an undirected, connected network of n On Nonconvex Decentralized Gradient Descent Jinshan Zeng and Wotao Yin Abstract Consensus optimization has received considerable attention in recent years. A number of decentralized algorithms have been

More information

Logarithmic Regret Algorithms for Online Convex Optimization

Logarithmic Regret Algorithms for Online Convex Optimization Logarithmic Regret Algorithms for Online Convex Optimization Elad Hazan Amit Agarwal Satyen Kale November 20, 2008 Abstract In an online convex optimization problem a decision-maker makes a sequence of

More information

Online Optimization : Competing with Dynamic Comparators

Online Optimization : Competing with Dynamic Comparators Ali Jadbabaie Alexander Rakhlin Shahin Shahrampour Karthik Sridharan University of Pennsylvania University of Pennsylvania University of Pennsylvania Cornell University Abstract Recent literature on online

More information

9 Classification. 9.1 Linear Classifiers

9 Classification. 9.1 Linear Classifiers 9 Classification This topic returns to prediction. Unlike linear regression where we were predicting a numeric value, in this case we are predicting a class: winner or loser, yes or no, rich or poor, positive

More information

Stochastic Semi-Proximal Mirror-Prox

Stochastic Semi-Proximal Mirror-Prox Stochastic Semi-Proximal Mirror-Prox Niao He Georgia Institute of echnology nhe6@gatech.edu Zaid Harchaoui NYU, Inria firstname.lastname@nyu.edu Abstract We present a direct extension of the Semi-Proximal

More information

1 Directional Derivatives and Differentiability

1 Directional Derivatives and Differentiability Wednesday, January 18, 2012 1 Directional Derivatives and Differentiability Let E R N, let f : E R and let x 0 E. Given a direction v R N, let L be the line through x 0 in the direction v, that is, L :=

More information

Lecture 4: Lower Bounds (ending); Thompson Sampling

Lecture 4: Lower Bounds (ending); Thompson Sampling CMSC 858G: Bandits, Experts and Games 09/12/16 Lecture 4: Lower Bounds (ending); Thompson Sampling Instructor: Alex Slivkins Scribed by: Guowei Sun,Cheng Jie 1 Lower bounds on regret (ending) Recap from

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information