Online Learning with Non-Convex Losses and Non-Stationary Regret


Xiang Gao (University of Minnesota), Xiaobo Li (University of Minnesota), Shuzhong Zhang (University of Minnesota)

Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS) 2018, Lanzarote, Spain. PMLR: Volume 84. Copyright 2018 by the authors.

Abstract

In this paper, we consider online learning with non-convex loss functions. Similar to Besbes et al. [2015], we apply the non-stationary regret as the performance metric. In particular, we study the regret bounds under different assumptions on the information available regarding the loss functions. When the gradient of the loss function at the decision point is available, we propose an online normalized gradient descent algorithm (ONGD) to solve the online learning problem. In another situation, when only the value of the loss function is available, we propose a bandit online normalized gradient descent algorithm (BONGD). Under a condition to be called weak pseudo-convexity (WPC), we show that both algorithms achieve a cumulative regret bound of O(√(T(1+V_T))), where V_T is the total temporal variation of the loss functions, thus establishing a sublinear regret bound for online learning with non-convex loss functions and a non-stationary regret measure.

1 Introduction

Online convex optimization (OCO) has been studied extensively in the literature. In OCO, at each period t ∈ {1, 2, ..., T}, an online player chooses a feasible strategy x_t from a decision set X ⊆ R^n and suffers a loss f_t(x_t), where f_t is a convex loss function. One key feature of OCO is that the player must make the decision for period t without knowing the loss function f_t. The performance of an OCO algorithm is usually measured by the stationary regret, which compares the accumulated loss suffered by the player with the loss suffered by the best fixed strategy. Specifically, the stationary regret is defined as

Regret_T^S({x_t}) = Σ_{t=1}^T f_t(x_t) − Σ_{t=1}^T f_t(x*),   (1)

where x* is one best fixed decision in hindsight, i.e., x* ∈ argmin_{x∈X} Σ_{t=1}^T f_t(x).

Several sublinear cumulative regret bounds measured by the stationary regret have been established in the literature. For example, Zinkevich [2003] proposed an online gradient descent algorithm which achieves a regret bound of order O(√T) for convex loss functions. The order of the regret can be further improved to O(log T) if the loss functions are strongly convex (see Hazan et al. [2007]). Moreover, these bounds are shown to be tight for OCO with convex / strongly convex loss functions respectively in Abernethy et al. [2009].

One important extension of OCO is the so-called bandit online convex optimization, where the online player is only supposed to know the function value f_t(x_t) at x_t, instead of the entire function f_t. In particular, when the player can only observe the function value at a single point, Flaxman et al. [2005] established an O(T^{3/4}) regret bound for general convex loss functions by constructing a zeroth-order approximation of the gradient. Assuming that the loss functions are smooth, the regret bound can be improved to O(T^{2/3}) by incorporating a self-concordant regularizer (see Saha and Tewari [2011]). Alternatively, if multiple points can be queried at the same time, Agarwal et al. [2010] showed that the regret can be further improved to O(T^{1/2}) and O(log T) for convex / strongly convex loss functions respectively.

As suggested by its name, the loss functions in OCO are assumed to be convex. Only a handful of papers have studied online learning with non-convex loss functions. In the existing works, mostly heuristic algorithms were proposed (e.g. Gasso et al. [2011], Ertekin et al. [2011]) without focusing on establishing sublinear regret bounds. There are a few noticeable exceptions though. Hazan and Kale [2012] developed an algorithm that achieves O(T^{1/2}) and O(T^{2/3}) regret bounds for the full-information and bandit settings respectively, by assuming the loss functions to be submodular.

Zhang et al. [2015] showed that an O(T^{2/3}) regret bound still holds if the loss functions are compositions of a non-increasing scalar function with a linear function.

The stationary regret requires the benchmark strategy to remain unchanged throughout the periods. This assumption may not be relevant in some applications. Recently, a new performance metric known as the non-stationary regret was proposed by Besbes et al. [2015]. The non-stationary regret compares the cumulative losses of the online player with the losses of the best possible responses:

Regret_T^NS({x_t}) = Σ_{t=1}^T f_t(x_t) − Σ_{t=1}^T f_t(x*_t),   (2)

where x*_t ∈ argmin_{x∈X} f_t(x). Clearly, the non-stationary regret is never less than the stationary regret. Besbes et al. [2015] prove that if there is no restriction on the changes of the loss functions, then the non-stationary regret is linear in T regardless of the strategies. To obtain meaningful bounds, the authors assumed that the temporal change of the sequence of functions {f_t} is bounded. Specifically, the loss functions are assumed to be taken from the set

V := { {f_1, f_2, ..., f_T} : Σ_{t=1}^{T−1} ||f_t − f_{t+1}|| ≤ V_T },   (3)

where ||f_t − f_{t+1}|| := sup_{x∈X} |f_t(x) − f_{t+1}(x)|. For nonzero temporal change V_T, Besbes et al. [2015] then propose algorithms with sublinear non-stationary regret bounds: O(V_T^{1/3} T^{2/3}), O(V_T^{1/2} T^{1/2}), and O(V_T^{1/3} T^{2/3}) respectively, for the cases where the loss functions are: convex with noisy gradients, strongly convex with noisy gradients, and strongly convex with noisy function values. Note that V_T > 0 is assumed in these bounds.

More recently, Yang et al. [2016] also studied non-stationary regret bounds for OCO. They proposed an uncertainty set S_p of sequences of functions in which the worst-case variation of the optimal solutions x*_t of f_t (referred to as the path variation) is bounded:

S_p := { {f_1, f_2, ..., f_T} : max_{x*_t ∈ argmin_{x∈X} f_t(x)} Σ_{t=1}^{T−1} ||x*_t − x*_{t+1}|| ≤ V_T }.

They then proved upper and lower bounds for several different feedback structures and functional classes within the convex function class. In particular, they showed that some existing algorithms, with suitable modifications, achieve O(V_T), O(√(T V_T)) and O(√(T V_T)) bounds for the true-gradient (with smoothness), noisy-gradient and two-point bandit feedback settings, respectively. Note that V_T > 0 is assumed in these bounds.

In this paper, we consider online non-convex optimization with the non-stationary regret as the performance metric. To the best of our knowledge, such a combination has not been studied before. For each period t, even after the decision x_t is made, the online player is not assumed to know the function f_t; instead, only some partial information regarding the loss at x_t is revealed. Specifically, only ∇f_t(x_t) (in the first-order setting) or f_t(x_t) (in the zeroth-order setting) is available to the player. Similar to Yang et al. [2016], we define the uncertainty set S_T of sequences of functions as follows:

S_T := { {f_1, f_2, ..., f_T} : ∃ x*_t ∈ argmin_{x∈X} f_t(x), t = 1, ..., T, s.t. Σ_{t=1}^{T−1} ||x*_t − x*_{t+1}|| ≤ V_T }.

Note that S_p ⊆ S_T for the same V_T. In particular, consider a static sequence f_t ≡ f, t = 1, ..., T, where f has multiple optimal solutions. This sequence clearly belongs to S_T even when V_T = 0. However, V_T would have to be linear in T in order for this sequence to be in S_p.

We propose the Online Normalized Gradient Descent (ONGD) and the novel Bandit Online Normalized Gradient Descent (BONGD) algorithms for the first-order setting and the zeroth-order setting respectively. For loss functions satisfying (4) and a condition to be introduced later, we show that these two algorithms both achieve an O(√(T(1+V_T))) regret bound. Compared to the regret bounds in Yang et al. [2016], our regret bound for the first-order setting is worse, but it is the same for the zeroth-order setting. Note, however, that our loss functions are non-convex and we use a weaker version of the variational constraint.

Regarding non-convex objective functions, a related work in the literature is Hazan et al. [2015], where the authors propose a Normalized Gradient Descent (NGD) method for solving an optimization model whose objective is a so-called strictly locally quasi-convex (SLQC) function. They further show that NGD converges to an ε-optimal minimum within O(1/ε²) iterations. This paper generalizes the results in Hazan et al. [2015] in the following aspects. First, Hazan et al. [2015] consider an optimization model, while this paper considers an online learning model. Second, Hazan et al. [2015] assume the objective function to be strictly locally quasi-convex (SLQC); in this paper we introduce the notion of weak pseudo-convexity (WPC), which will be shown to be a weaker condition than SLQC, and we show that the regret bounds hold if the objective functions are weakly pseudo-convex. Third, Hazan et al. [2015] consider only the first-order setting, while our proposed BONGD algorithm works for the bandit (zeroth-order) setting as well.

The rest of the paper is organized as follows. Section 2 presents some preparations, including the assumptions and notation. In Section 3, we present the ONGD algorithm and prove its regret bound. In Section 4, we present the BONGD algorithm for the zeroth-order setting and show its regret bound under some assumptions. Finally, we conclude the paper in Section 5.
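The following is a small, purely illustrative Python sketch (not from the paper) that makes the two regret notions and the variation budget concrete. The quadratic losses, the drifting minimizers c_t, and the player trajectory are made-up examples.

```python
import numpy as np

# Illustrative only: quadratic losses f_t(x) = ||x - c_t||^2 on R^2,
# with minimizers c_t drifting slowly over time (a made-up example).
T, rng = 200, np.random.default_rng(0)
c = np.cumsum(0.01 * rng.standard_normal((T, 2)), axis=0)  # slowly moving minimizers
x = np.zeros((T, 2))                                        # some player trajectory

f = lambda point, t: float(np.sum((point - c[t]) ** 2))
losses = np.array([f(x[t], t) for t in range(T)])

# Stationary regret: compare with the best *fixed* point in hindsight.
x_fixed = c.mean(axis=0)            # minimizer of sum_t f_t for this quadratic example
stationary = losses.sum() - sum(f(x_fixed, t) for t in range(T))

# Non-stationary regret: compare with the per-period minimizers x_t^* = c_t.
nonstationary = losses.sum() - sum(f(c[t], t) for t in range(T))

# Variation budget V_T: total movement of the minimizers, as in the set S_T.
V_T = np.sum(np.linalg.norm(np.diff(c, axis=0), axis=1))

print(stationary, nonstationary, V_T)
```

As the definitions suggest, the non-stationary regret computed this way is never smaller than the stationary one, and it is only meaningful to bound it when V_T grows sublinearly in T.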

2 Problem Setup

In this section, we present the assumptions underlying our online learning model and introduce some notation that will be used in the paper. Let X ⊆ R^n be a convex decision set that is known to the player. For every period t ∈ {1, 2, ..., T}, the loss function is f_t. Throughout the paper, we assume that X is bounded, i.e., there exists R > 0 such that ||x|| ≤ R for all x ∈ X. We present the following definitions regarding the loss functions.

Definition 1 (Bounded Gradient) A function f is said to have bounded gradient if there exists a finite positive value M such that ||∇f(x)|| ≤ M for all x ∈ X.

Note that if f has bounded gradient, then it is also Lipschitz continuous with Lipschitz constant M on the set X.

Definition 2 (Weak Pseudo-Convexity) A function f is said to be weakly pseudo-convex (WPC) if there exists K > 0 such that

f(x) − f(x*) ≤ K ⟨∇f(x), x − x*⟩ / ||∇f(x)||

holds for all x ∈ X, with the convention that ∇f(x)/||∇f(x)|| = 0 if ∇f(x) = 0, where x* is one optimal solution, i.e., x* ∈ argmin_{x∈X} f(x).

Here we discuss some implications of weak pseudo-convexity. If a differentiable function f is Lipschitz continuous and pseudo-convex, then we have (see a similar derivation in Nesterov [2004])

f(x) − f(y) ≤ M ⟨∇f(x), x − y⟩ / ||∇f(x)||

for all x, y with f(x) ≥ f(y), where M is the Lipschitz constant. Therefore, we can simply let K = M, and the function is also weakly pseudo-convex. Moreover, as another example, the star-convex functions proposed by Nesterov and Polyak [2006] are weakly pseudo-convex.

Proposition 1 If f is star-convex and smooth with bounded gradient on X, then f is weakly pseudo-convex.

The proof of Proposition 1 can be found in the supplementary file. We next introduce a property that is essentially the same as the SLQC property introduced in Hazan et al. [2015].

Definition 3 (Acute Angle) The gradient of f is said to satisfy the acute angle condition if there exists a positive value Z such that

cos(∠(∇f(x), x − x*)) = ⟨∇f(x), x − x*⟩ / (||∇f(x)|| ||x − x*||) ≥ Z > 0

holds for all x ∈ X, with the convention that ∇f(x)/||∇f(x)|| = 0 if ∇f(x) = 0, where x* is one optimal solution, i.e., x* ∈ argmin_{x∈X} f(x).

The following proposition shows that the acute angle condition, together with Lipschitz continuity, implies weak pseudo-convexity.

Proposition 2 If f has bounded gradient and satisfies the acute angle condition, then f is weakly pseudo-convex.

The proof of Proposition 2 can be found in the supplementary file. The class of weakly pseudo-convex functions certainly goes beyond the acute angle condition. For example, below is another class of functions satisfying WPC.

Proposition 3 If f has bounded gradient and satisfies α-homogeneity with respect to its minimum, i.e., there exists α > 0 such that

f(t(x − x*) + x*) − f(x*) ≤ t^α (f(x) − f(x*))

for all x ∈ X and t ∈ [0, 1], where x* ∈ argmin_{x∈X} f(x), then f is weakly pseudo-convex.

The proof of Proposition 3 can be found in the supplementary file. Proposition 3 suggests that all non-negative homogeneous polynomials satisfy WPC with respect to 0. Take f(x) = x1² + x2² + x1²x2² as an example. It is easy to verify that f satisfies the condition in Proposition 3, and thus is weakly pseudo-convex. In Figure 1, the surface of f(x) and a sublevel set of this function are plotted. The function is not quasi-convex, since the sublevel set is non-convex. However, this function satisfies the acute-angle condition in Definition 3.

Note that if f_i(x) is α_i-homogeneous with respect to the shared minimum x* for all i = 1, ..., I with α_i ≥ α > 0, and the gradients of the f_i are uniformly bounded over a set X, then Σ_{i=1}^I f_i(x) is WPC. As a result, we can construct functions that are WPC but do not satisfy the acute-angle condition. Consider the two-dimensional function f(x) = x1² + |x2|^{3/2}, and suppose that X is the unit disc centered at the origin. Clearly, f(x) is differentiable and Lipschitz continuous on X. Also, it is the sum of a 2-homogeneous function and a 3/2-homogeneous function with the shared minimum (0, 0). Thus f(x) is WPC. We compute that

cos(∠(∇f(x), x − x*)) = ⟨∇f(x), x − x*⟩ / (||∇f(x)|| ||x − x*||) = (2x1² + (3/2)|x2|^{3/2}) / (√(4x1² + (9/4)|x2|) √(x1² + x2²)).

Consider the parameterized path (x1, x2) = (√t, t^{2/3}) with t > 0. On this path, we have

cos(∠(∇f(x), x − x*)) = ((7/2) t) / (√(4t + (9/4)t^{2/3}) √(t + t^{4/3})),

which is of order t^{1/6} for small t. Therefore, along the path, as t approaches 0, we have cos(∠(∇f(x), x − x*)) → 0. This example shows that a WPC function may fail to satisfy the acute angle condition.
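As a quick sanity check of this counterexample, the following illustrative Python snippet (not from the paper) evaluates the cosine above along the path (√t, t^{2/3}); the slow decay, roughly t^{1/6}, is visible numerically.

```python
import numpy as np

# Numerical check of the counterexample: f(x) = x1^2 + |x2|^(3/2) on the unit disc,
# with x* = (0, 0). The cosine of the angle between grad f(x) and x - x* tends to 0
# along the path x(t) = (sqrt(t), t^(2/3)) as t -> 0.
def cos_angle(x1, x2):
    grad = np.array([2.0 * x1, 1.5 * np.sign(x2) * np.abs(x2) ** 0.5])
    diff = np.array([x1, x2])                  # x - x* with x* = (0, 0)
    return grad @ diff / (np.linalg.norm(grad) * np.linalg.norm(diff))

for t in [1e-2, 1e-4, 1e-6, 1e-8]:
    print(t, cos_angle(np.sqrt(t), t ** (2.0 / 3.0)))   # decays roughly like t^(1/6)
```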

[Figure 1: Plot of a WPC function that is not quasi-convex.]

As we mentioned before, in order to establish a sublinear non-stationary regret bound, we need to confine the loss functions {f_t} in a unified manner. Therefore, we introduce the uncertainty set of the loss functions S_T, as the set of admissible loss-function sequences whose total variation of the minimizers is bounded by V_T.

Definition 4 The uncertainty set of functions S_T is defined as

S_T := { {f_1, f_2, ..., f_T} : ∃ x*_t ∈ argmin_{x∈X} f_t(x), t = 1, ..., T, s.t. Σ_{t=1}^{T−1} ||x*_t − x*_{t+1}|| ≤ V_T },   (4)

where V_T ≥ 0.

In the zeroth-order setting to be discussed in Section 4, only the function value is available. Therefore, some randomized approach is needed in the algorithm. To account for this situation, we introduce the expected non-stationary regret as the performance metric for an algorithm that outputs a random sequence {x_t}.

Definition 5 The expected non-stationary regret of a randomized algorithm A is defined as

ERegret_T^NS({x_t}) = E[ Σ_{t=1}^T f_t(x_t) − Σ_{t=1}^T f_t(x*_t) ],   (5)

where the expectation is taken over the filtration generated by the random sequence {x_t} produced by A.

3 The First-Order Setting

In this section, we assume that for each period the gradient information at the current point is available to the online player after the decision is made. Specifically, at each period t, the sequence of events is as follows:

1. The online player chooses a strategy x_t;
2. A regret f_t(x_t) − f_t(x*_t) is incurred, but is not necessarily known to the online player;
3. The online player receives the feedback ∇f_t(x_t).

We propose the Online Normalized Gradient Descent (ONGD) algorithm in this setting. The normalized gradient descent method was first proposed in Nesterov [2004], where it is applied to pseudo-convex minimization problems. The ONGD algorithm uses the first-order information ∇f_t(x_t) to compute the normalized vector ∇f_t(x_t)/||∇f_t(x_t)|| as the search direction. Similar to the standard gradient method, it moves along that search direction with a specific stepsize η > 0 and then projects the point back onto the decision set X; see Algorithm 1 for the details. Note that in Algorithm 1, Π_X(y) := argmin_{x∈X} ||y − x|| is the projection operator.

Algorithm 1 Online Normalized Gradient Descent
Input: feasible set X, number of time periods T
Initialization: x_1 ∈ X
for t = 1 to T do
  choose x_t and receive the feedback g_t = ∇f_t(x_t)
  if ||g_t|| > 0 then
    x_{t+1} = Π_X(x_t − η g_t / ||g_t||)
  else
    x_{t+1} = x_t
  end if
end for
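A minimal Python sketch of Algorithm 1 follows. It assumes, for illustration only, that X is a Euclidean ball of radius R (so projection is a rescaling) and that a stand-in callback grad_oracle(t, x) supplies the feedback ∇f_t(x_t); these names and the ball-shaped feasible set are our assumptions, not part of the paper.

```python
import numpy as np

def project_ball(y, R):
    """Projection onto the Euclidean ball of radius R (assumed shape of X)."""
    nrm = np.linalg.norm(y)
    return y if nrm <= R else (R / nrm) * y

def ongd(grad_oracle, T, eta, R, x0):
    """Sketch of ONGD: normalized gradient step followed by projection."""
    x = np.array(x0, dtype=float)
    history = []
    for t in range(1, T + 1):
        history.append(x.copy())           # play x_t
        g = grad_oracle(t, x)              # receive feedback grad f_t(x_t)
        if np.linalg.norm(g) > 0:
            x = project_ball(x - eta * g / np.linalg.norm(g), R)
        # otherwise the gradient vanishes and x_{t+1} = x_t
    return history
```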

The main result is shown in the following theorem, which claims an O(√(T(1+V_T))) non-stationary regret bound for ONGD.

Theorem 1 Let V_T be as defined in (4). For any sequence of loss functions {f_t} ∈ S_T where each f_t is weakly pseudo-convex with a common constant K, let the stepsize be η = √((4R² + 6RV_T)/T). Then the following regret bound holds for ONGD:

Regret_T^NS({x_t}) ≤ 2K √(R²T + 1.5R V_T T) = O(√(T(1+V_T))).

Proof: Let x*_t, t = 1, ..., T, be a sequence of optimal solutions satisfying the condition in Definition 4, and let z_t := ||x_t − x*_t||. Then we have

z_{t+1}² = ||x_{t+1} − x*_{t+1}||²
        = ||x_{t+1} − x*_t + x*_t − x*_{t+1}||²
        ≤ ||x_{t+1} − x*_t||² + 2||x_{t+1} − x*_t|| ||x*_t − x*_{t+1}|| + ||x*_t − x*_{t+1}||²
        ≤ ||Π_X(x_t − η g_t/||g_t||) − x*_t||² + 6R ||x*_t − x*_{t+1}||
        ≤ ||x_t − η g_t/||g_t|| − x*_t||² + 6R ||x*_t − x*_{t+1}||
        = z_t² + η² − 2η ⟨g_t, x_t − x*_t⟩/||g_t|| + 6R ||x*_t − x*_{t+1}||.

Rearranging terms, dividing by 2η and multiplying by K, we have

K ⟨g_t, x_t − x*_t⟩/||g_t|| ≤ K [ (z_t² − z_{t+1}²)/(2η) + η/2 + (3R/η) ||x*_t − x*_{t+1}|| ].

By Definition 2, noting that g_t = ∇f_t(x_t), we have

f_t(x_t) − f_t(x*_t) ≤ K ⟨∇f_t(x_t), x_t − x*_t⟩/||∇f_t(x_t)|| ≤ K [ (z_t² − z_{t+1}²)/(2η) + η/2 + (3R/η) ||x*_t − x*_{t+1}|| ].

Summing these inequalities over t = 1, ..., T, we have

Regret_T^NS({x_t}) ≤ K [ (z_1² − z_{T+1}²)/(2η) + ηT/2 + (3R/η) Σ_{t=1}^{T−1} ||x*_t − x*_{t+1}|| ]   (6)
                  ≤ K [ (4R² + 6RV_T)/(2η) + ηT/2 ].

As a result, by noting η = √((4R² + 6RV_T)/T), we have

Regret_T^NS({x_t}) ≤ K √(T(4R² + 6RV_T)) = 2K √(R²T + 1.5R V_T T) = O(√(T(1+V_T))).

4 The Zeroth-Order Setting

In the previous section it was assumed that the gradient information is available, which may not be the case in some applications; such exceptions include the multi-armed bandit problem, dynamic pricing and Bayesian optimization. Therefore, in this section we consider the setting where the online player only receives the function value f_t(x_t), instead of the gradient ∇f_t(x_t), as the feedback.

As mentioned above, the zeroth-order (or bandit) setting has been studied in the OCO literature. The main technique (see Flaxman et al. [2005] for example) is to construct a zeroth-order approximation of the gradient of a smoothed function. That smoothed function is often created by integrating the original loss function against a chosen probability distribution. By querying random samples of the function value according to a probability distribution, the player is able to create an unbiased zeroth-order approximation of the gradient of the smoothed function. This is, however, not applicable to our online normalized gradient descent algorithm, since what we need is the direction of the gradient. Therefore, we shall first develop a new type of zeroth-order oracle that can approximate the gradient direction, without averaging multiple gradient samples, whenever the norm of the gradient is not too small. To proceed, we require some additional conditions on the loss functions.

Definition 6 (Error Bound) There exist D > 0 and γ > 0 such that

||x − x*_t|| ≤ D ||∇f_t(x)||^γ   for all x ∈ X and t = 1, ..., T,

where x*_t is the optimal solution of f_t, i.e., x*_t = argmin_{x∈X} f_t(x).

Since X is a compact set, the error bound condition is essentially the requirement that the optimal solution is unique and there is no non-optimal local minimum.

Definition 7 (Lipschitz Gradient) There exists a positive number L such that

||∇f_t(x) − ∇f_t(y)|| ≤ L ||x − y||   for all x, y ∈ X and t = 1, ..., T.

We introduce some notation that will be used in the subsequent analysis: S(n) denotes the unit sphere in R^n; m(A) denotes the measure of a set A ⊆ R^n; β_n denotes the area of the unit sphere S(n); ds_n denotes the differential element on the unit sphere S(n); 1_A(x) denotes the indicator function of the set A; and sign(·) denotes the sign function.
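As an aside on the notation, uniform samples from S(n) are what the bandit algorithm of this section will query along; a standard way to draw them, shown in the illustrative helper below (ours, not from the paper), is to normalize a standard Gaussian vector.

```python
import numpy as np

def uniform_on_sphere(n, rng=None):
    """Draw v uniformly from the unit sphere S(n) by normalizing a Gaussian vector."""
    if rng is None:
        rng = np.random.default_rng()
    v = rng.standard_normal(n)
    return v / np.linalg.norm(v)
```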

Before we present the main results, several lemmas are in order. The first lemma concerns a geometric property of the unit sphere.

Lemma 1 For any non-zero vector d ∈ R^n and 0 < δ < 1, define

S^x_δ := { v ∈ S(n) : |⟨d, v⟩| < δ }.

If ||d|| ≥ √δ, then there exists a constant C_n > 0 such that m(S^x_δ)/β_n < C_n √δ.

Proof: We have m(S^x_δ) = ∫_{S(n)} 1_{S^x_δ}(v) ds_n. By the symmetry of S(n), we may assume without loss of generality that d = (0, ..., 0, ||d||), so that S^x_δ = { v ∈ S(n) : |v_n| < δ/||d|| }. Let a := δ/||d||; since ||d|| ≥ √δ and δ < 1, we have a ≤ √δ < 1. Evaluating the integral in spherical coordinates and using arcsin(a) − arcsin(−a) < πa, we obtain m(S^x_δ) < πβ_{n−1} a ≤ πβ_{n−1} √δ. Setting C_n := πβ_{n−1}/β_n, the desired result follows.

The next lemma leads to an unbiased first-order estimator of the direction of a vector.

Lemma 2 Suppose d ∈ R^n and d ≠ 0. Then

∫_{S(n)} sign(⟨d, v⟩) v ds_n = P_n d/||d||,

where P_n > 0 is a constant.

Proof: By the symmetry of S(n), we may again assume d = (0, ..., 0, ||d||), so that ∫_{S(n)} sign(⟨d, v⟩) v ds_n = ∫_{S(n)} sign(v_n) v ds_n. Notice that if v = (v_1, v_2, ..., v_{n−1}, v_n) ∈ S(n), then u = (−v_1, −v_2, ..., −v_{n−1}, v_n) is also in S(n). As a result, the components orthogonal to d cancel, and the above integral lies along the direction d/||d|| = (0, ..., 0, 1), with length

P_n := ∫_{S(n)} sign(v_n) v_n ds_n = ∫_{S(n)} |v_n| ds_n > 0,

which can be evaluated explicitly in spherical coordinates.

Using the previous lemmas, we have the following result, which constructs a zeroth-order estimator of the normalized gradient.

Theorem 2 Suppose f has Lipschitz gradient (Definition 7) and ||∇f(x)|| ≥ √δ at the point x, where 0 < δ < 1. Let ε = δ²/L. Then

|| E_{v∼S(n)}[ sign(f(x + εv) − f(x)) v ] − (P_n/β_n) ∇f(x)/||∇f(x)|| || ≤ D_n √δ,

where v is a random vector uniformly distributed over S(n), P_n is the constant in Lemma 2, and D_n := 2C_n.

Proof: By Definition 7,

| f(x + εv) − f(x) − ε ⟨∇f(x), v⟩ | ≤ (L/2) ε² ||v||² = Lε²/2,

so that | (f(x + εv) − f(x))/ε − ⟨∇f(x), v⟩ | ≤ Lε/2. Let S^x_δ be the set in Lemma 1 with d = ∇f(x). Since |⟨∇f(x), v⟩| ≥ δ for v ∈ S(n)\S^x_δ, letting ε = δ²/L gives Lε/2 = δ²/2 < δ, and thus

sign(f(x + εv) − f(x)) = sign( (f(x + εv) − f(x))/ε ) = sign( ⟨∇f(x), v⟩ )   for all v ∈ S(n)\S^x_δ.

Therefore,

E_{v∼S(n)}[ sign(f(x + εv) − f(x)) v ]
  = (1/β_n) [ ∫_{S(n)\S^x_δ} sign(f(x + εv) − f(x)) v ds_n + ∫_{S^x_δ} sign(f(x + εv) − f(x)) v ds_n ]
  = (1/β_n) [ ∫_{S(n)} sign(⟨∇f(x), v⟩) v ds_n − ∫_{S^x_δ} sign(⟨∇f(x), v⟩) v ds_n + ∫_{S^x_δ} sign(f(x + εv) − f(x)) v ds_n ]
  = (P_n/β_n) ∇f(x)/||∇f(x)|| + (1/β_n) [ ∫_{S^x_δ} sign(f(x + εv) − f(x)) v ds_n − ∫_{S^x_δ} sign(⟨∇f(x), v⟩) v ds_n ],

where the last equality is due to Lemma 2. Putting the estimations together, and noting that each of the two remaining integrands has norm at most one, we have

|| E_{v∼S(n)}[ sign(f(x + εv) − f(x)) v ] − (P_n/β_n) ∇f(x)/||∇f(x)|| || ≤ 2 m(S^x_δ)/β_n ≤ 2C_n √δ.

Since D_n = 2C_n, the theorem is proved.

Based on Theorem 2, for a given δ > 0 we have a zeroth-order estimator of the normalized gradient given by

G_t(x_t, v_t) := sign( f_t(x_t + εv_t) − f_t(x_t) ) v_t,   (7)

where ε = δ²/L and v_t is a uniformly distributed random vector over S(n). Theorem 2 implies that the distance between (a scaled expectation of) this estimator and the normalized gradient can be controlled up to a factor of √δ. Essentially, the Bandit Online Normalized Gradient Descent (BONGD) algorithm replaces the normalized gradient in ONGD by G_t(x_t, v_t).

Algorithm 2 Bandit Online Normalized Gradient Descent
Input: feasible set X, number of time periods T, parameter δ > 0
Initialization: x_1 ∈ X, ε = δ²/L
for t = 1 to T do
  sample v_t uniformly over S(n) ⊆ R^n
  play x_t and x_t + εv_t; receive feedbacks f_t(x_t) and f_t(x_t + εv_t)
  set G_t(x_t, v_t) = sign( f_t(x_t + εv_t) − f_t(x_t) ) v_t
  update x_{t+1} = Π_X(x_t − η G_t(x_t, v_t))
end for

Note that Algorithm 2 actually outputs a random sequence of vectors {x_t}; hence the notion of expected non-stationary regret is applicable here. Let {F_t} denote the filtration generated by {x_t}; then v_t is independent of F_t. Note also that in Algorithm 2, at each step the function is queried at another point x_t + εv_t. Therefore, besides its output sequence {x_t}, we need to include {x_t + εv_t} in the regret. We thus define

ERegret_T^NS({x_t}, {x_t + εv_t}) = E[ Σ_{t=1}^T ( f_t(x_t) − f_t(x*_t) ) ] + E[ Σ_{t=1}^T ( f_t(x_t + εv_t) − f_t(x*_t) ) ].
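A minimal Python sketch of Algorithm 2 follows, under the same stand-in assumptions as the ONGD sketch above: X is a Euclidean ball of radius R, and a hypothetical callback value_oracle(t, x) supplies the bandit feedback f_t(x). The two-point query and the sign-based direction estimate follow the algorithm description; everything else is illustrative, not the authors' code.

```python
import numpy as np

def bongd(value_oracle, T, eta, R, delta, L, x0, rng=None):
    """Sketch of BONGD: two-point bandit feedback, sign-based direction estimate."""
    if rng is None:
        rng = np.random.default_rng()
    eps = delta ** 2 / L                      # epsilon = delta^2 / L, as in Algorithm 2
    x = np.array(x0, dtype=float)
    n, history = x.size, []
    for t in range(1, T + 1):
        v = rng.standard_normal(n)
        v /= np.linalg.norm(v)                # v_t uniform over S(n)
        history.append((x.copy(), x + eps * v))   # play x_t and x_t + eps * v_t
        f0 = value_oracle(t, x)                   # feedback f_t(x_t)
        f1 = value_oracle(t, x + eps * v)         # feedback f_t(x_t + eps * v_t)
        G = np.sign(f1 - f0) * v              # G_t(x_t, v_t): sign-based estimate (7)
        y = x - eta * G
        nrm = np.linalg.norm(y)
        x = y if nrm <= R else (R / nrm) * y  # project back onto the ball (stand-in for X)
    return history
```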

The following theorem shows that by choosing η and δ appropriately, we can still achieve an O(√(T(1+V_T))) expected non-stationary regret bound.

Theorem 3 Let V_T be as defined in (4). Assume that the loss functions have Lipschitz gradients (Definition 7), satisfy the error bound condition (Definition 6), and are weakly pseudo-convex with bounded gradients. For any sequence of loss functions {f_t} ∈ S_T, applying BONGD with η = √((4R² + 6RV_T)/T) and δ = min{T^{−1/γ}, T^{−1}}, the following regret bound holds:

ERegret_T^NS({x_t}, {x_t + εv_t}) = O(√(T(1+V_T))).

Proof: Let z_t := ||x_t − x*_t|| and Q_n := P_n/β_n, where P_n is the constant in Lemma 2. Then

z_{t+1}² = ||x_{t+1} − x*_{t+1}||²
        ≤ ||x_{t+1} − x*_t||² + 2||x_{t+1} − x*_t|| ||x*_t − x*_{t+1}|| + ||x*_t − x*_{t+1}||²
        ≤ ||Π_X(x_t − ηG_t(x_t, v_t)) − x*_t||² + 6R ||x*_t − x*_{t+1}||
        ≤ ||x_t − ηG_t(x_t, v_t) − x*_t||² + 6R ||x*_t − x*_{t+1}||
        ≤ z_t² + η² − 2η ⟨G_t(x_t, v_t), x_t − x*_t⟩ + 6R ||x*_t − x*_{t+1}||,

where we used ||G_t(x_t, v_t)|| ≤ 1. Rearranging the terms and multiplying by K/Q_n, we have

(K/Q_n) ⟨G_t(x_t, v_t), x_t − x*_t⟩ ≤ (K/Q_n) [ (z_t² − z_{t+1}²)/(2η) + η/2 + (3R/η) ||x*_t − x*_{t+1}|| ].   (8)

Now, based on ||∇f_t(x_t)||, we distinguish two cases.

Case 1: ||∇f_t(x_t)|| ≥ √δ. In this case, by Theorem 2 we have || E[G_t(x_t, v_t) | F_t] − Q_n ∇f_t(x_t)/||∇f_t(x_t)|| || ≤ D_n √δ. Therefore, by weak pseudo-convexity and (8),

f_t(x_t) − f_t(x*_t) ≤ K ⟨∇f_t(x_t), x_t − x*_t⟩ / ||∇f_t(x_t)||
  ≤ (K/Q_n) E[ ⟨G_t(x_t, v_t), x_t − x*_t⟩ | F_t ] + (D_n K/Q_n) √δ ||x_t − x*_t||
  ≤ (K/Q_n) [ (E[z_t² | F_t] − E[z_{t+1}² | F_t])/(2η) + η/2 + (3R/η) ||x*_t − x*_{t+1}|| ] + (2D_n K R/Q_n) √δ.   (9)

Case 2: ||∇f_t(x_t)|| < √δ. In this case, by the error bound property (Definition 6) we have ||x_t − x*_t|| ≤ D ||∇f_t(x_t)||^γ < D δ^{γ/2}. Therefore, due to the boundedness of the gradients, f_t(x_t) − f_t(x*_t) ≤ M ||x_t − x*_t|| ≤ MD δ^{γ/2}. Moreover, since | E[ ⟨G_t(x_t, v_t), x_t − x*_t⟩ | F_t ] | ≤ ||x_t − x*_t|| < D δ^{γ/2}, inequality (8) yields

f_t(x_t) − f_t(x*_t) ≤ (K/Q_n) [ (E[z_t² | F_t] − E[z_{t+1}² | F_t])/(2η) + η/2 + (3R/η) ||x*_t − x*_{t+1}|| ] + (KD/Q_n + MD) δ^{γ/2}.   (10)

In view of (9) and (10), if we let U := max{ 2D_n KR/Q_n, KD/Q_n + MD }, then in either case the following inequality holds:

f_t(x_t) − f_t(x*_t) ≤ (K/Q_n) [ (E[z_t² | F_t] − E[z_{t+1}² | F_t])/(2η) + η/2 + (3R/η) ||x*_t − x*_{t+1}|| ] + U (√δ + δ^{γ/2}).

Summing these inequalities over t = 1, ..., T, taking total expectations and telescoping, and noting that f_t(x_t + εv_t) − f_t(x*_t) ≤ f_t(x_t) − f_t(x*_t) + Mε ||v_t|| = f_t(x_t) − f_t(x*_t) + Mδ²/L, we have

ERegret_T^NS({x_t}, {x_t + εv_t}) ≤ 2(K/Q_n) [ E[z_1²]/(2η) + ηT/2 + (3R/η)V_T ] + 2UT (√δ + δ^{γ/2}) + MT δ²/L
                                 ≤ 2(K/Q_n) [ (4R² + 6RV_T)/(2η) + ηT/2 ] + 2UT (√δ + δ^{γ/2}) + MT δ²/L.

By choosing η = √((4R² + 6RV_T)/T) and δ = min{T^{−1/γ}, T^{−1}}, we have √δ ≤ T^{−1/2}, δ^{γ/2} ≤ T^{−1/2} and δ² ≤ T^{−1}, hence

ERegret_T^NS({x_t}, {x_t + εv_t}) ≤ (2K/Q_n) √(T(4R² + 6RV_T)) + 4U √T + M/L = O(√(T(1+V_T))).

Therefore, under the additional error bound condition (Definition 6) and the Lipschitz continuity of the gradients (Definition 7) on the loss functions (e.g., the function depicted in Figure 1 satisfies all the conditions of Theorem 3), the expected regret of BONGD remains O(√(T(1+V_T))), which matches both the upper and the lower bound in Yang et al. [2016] for general Lipschitz continuous convex cost functions with two-point bandit feedback. Moreover, the zeroth-order estimator of the normalized gradient could be of interest in its own right.

5 Concluding Remarks

In this paper, we considered online learning with non-convex loss functions and a non-stationary regret measure, and established O(√(T(1+V_T))) regret bounds, where V_T is the total variation of the minimizers of the loss functions, for a gradient-type algorithm and a bandit-type algorithm under some conditions on the non-convex loss functions. As a direction for future research, it would be interesting to find out whether the same regret bound can still be established without knowing V_T in advance. Moreover, it remains open to extend the results to the setting where the loss functions may be noisy and non-smooth.

References

Jacob Abernethy, Alekh Agarwal, Peter L. Bartlett, and Alexander Rakhlin. A stochastic view of optimal regret through minimax duality. arXiv preprint arXiv:0903.5328, 2009.

Alekh Agarwal, Ofer Dekel, and Lin Xiao. Optimal algorithms for online convex optimization with multi-point bandit feedback. In COLT, pages 28-40. Citeseer, 2010.

Omar Besbes, Yonatan Gur, and Assaf Zeevi. Non-stationary stochastic optimization. Operations Research, 63(5):1227-1244, 2015.

S. Ertekin, L. Bottou, and C. L. Giles. Nonconvex online support vector machines. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(2):368-381, 2011.

Abraham D. Flaxman, Adam Tauman Kalai, and H. Brendan McMahan. Online convex optimization in the bandit setting: gradient descent without a gradient. In Proceedings of the Sixteenth Annual ACM-SIAM Symposium on Discrete Algorithms, pages 385-394. Society for Industrial and Applied Mathematics, 2005.

Gilles Gasso, Aristidis Pappaioannou, Marina Spivak, and Léon Bottou. Batch and online learning algorithms for nonconvex Neyman-Pearson classification. ACM Transactions on Intelligent Systems and Technology (TIST), 2(3):28, 2011.

Elad Hazan and Satyen Kale. Online submodular minimization. The Journal of Machine Learning Research, 13:2903-2922, 2012.

Elad Hazan, Amit Agarwal, and Satyen Kale. Logarithmic regret algorithms for online convex optimization. Machine Learning, 69(2-3):169-192, 2007.

Elad Hazan, Kfir Levy, and Shai Shalev-Shwartz. Beyond convexity: Stochastic quasi-convex optimization. In Advances in Neural Information Processing Systems, pages 1594-1602, 2015.

Yurii Nesterov. Introductory Lectures on Convex Optimization, volume 87. Springer Science & Business Media, 2004.

Yurii Nesterov and Boris Polyak. Cubic regularization of Newton method and its global performance. Mathematical Programming, 108(1):177-205, 2006.

Ankan Saha and Ambuj Tewari. Improved regret guarantees for online smooth convex optimization with bandit feedback. In International Conference on Artificial Intelligence and Statistics, pages 636-642, 2011.

Tianbao Yang, Lijun Zhang, Rong Jin, and Jinfeng Yi. Tracking slowly moving clairvoyant: Optimal dynamic regret of online learning with true and noisy gradient. In International Conference on Machine Learning, pages 449-457, 2016.

Lijun Zhang, Tianbao Yang, Rong Jin, and Zhi-Hua Zhou. Online bandit learning for a special class of non-convex losses. In AAAI, pages 3158-3164, 2015.

Martin Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In ICML, pages 928-936, 2003.
