arxiv: v4 [math.oc] 5 Jan 2016

Size: px
Start display at page:

Download "arxiv: v4 [math.oc] 5 Jan 2016"

Transcription

1 Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv: v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The University of Iowa, Iowa City, IA 54 tianbao-yang@uiowa.edu, qihang-lin@uiowa.edu Abstract In this paper, we show that a restarted stochastic (sub)gradient descent (RSGD) can find an -optimal solution for a broad class of non-smooth and/or non-strongly convex optimization problems with a lower iteration complexity than vanilla stochastic subgradient descent (SGD). Of foremost interest, we show that for nonsmooth and non-strongly convex optimization, RSGD reduces the dependence of SGD s iteration complexity on the distance of initial solution to the optimal set to the dependence on the maximum distance of points in the -level set to the optimal set. This result of RSGD for non-smooth and non-strongly convex problems complements our prior work [7] that establishes a linear convergence rate for RSGD for a special family of non-smooth and non-strongly convex optimization problems whose epigraph is a polyhedron. Besides that, we also obtain some interesting results of RSGD for locally semi-strongly convex problems and smooth optimization problems that have a Łojasiewicz property. In particular, (i) for locally semi-strongly convex functions, RSGD has an O( 1 log( 1 )) iteration complexity, which in the strongly convex scenario is similar to if not better than that of SGD; (ii) for smooth optimization problems that have a Łojasiewicz property with a power constant of β (0, 1), RSGD 1 has an O( 1 log( 1 β )) iteration complexity, which is better that that of SGD for such smooth optimization problems whose iteration complexity is O( 1 ). The novelty of our analysis lies at exploiting the lower bound of the subgradient or gradient of the objection function at the -level set. 1 Introduction In this section, we present the problem and some necessary notations and basic results. We consider the following generic unconstrained optimization problem without imposing additional assumptions on the structure of the objective function: min f(w) (1) w R d where f(w) is a non-smooth and/or non-strongly convex function. We abuse the notation f(w) to denote the set of subgradients of a non-smooth function or the gradient of a smooth function. In the sequel, we will use the term subgradient generally without explicitly referring to the gradient for a smooth function. Let g(w) denote a stochastic subgradient of f(w) that depends on a random variable ξ such that E[g(w)] f(w). Since the objective function is not necessarily stronlgy convex, thus the optimal solution is not necessarily unique. We use Ω to denote the optimal solution set and use f to denote the unique optimal objective value. 1 In this case, RSGD refers to restarted stochastic gradient descent. 1

2 Figure 1: An illustration of Lemma 1 for a non-smooth and non-stronlgy convex problem. The value of κ is a constant (just the slope of the linear line), where w (red point) is the closest -optimal solution to w (blue point). Throughout the paper, we make the following assumptions unless stated otherwise. Assumption 1. For a convex minimization problem (1), we assume a. There exist known w 0 Ω and 0 0 such that f(w 0 ) min w Ω f(w) 0. b. There exists a known constant G such that g(w) G almost surely for any w R d. c. Ω is a non-empty convex compact set. Below, we first introduce some notations. We denote by the Euclidean norm. Let w denote the closest optimal solution in Ω to w measured in terms of norm, i.e., w = arg min u Ω u w. Since Ω is a non-empty convex compact set, w is unique. We denote by L the -level set of f(w) and by S the -sublevel set of f(w), respectively, i.e., L = {w R d : f(w) = f + }, S = {w R d : f(w) f + } Since Ω is bounded and f(x) is convex, it follows from [5] (Corollary 8.7.1) that the sublevel set S = {w R d : f(w) f + } is bounded for any 0 and so as the level set L. Define B to be the maximum distance of points on the level set L to the optimal set Ω, i.e., B = max min w u = max w w. () u Ω w L w L Due to that L and Ω are bounded, B is also bounded. For any > 0, we define κ B. Let w to denote the closest point in the -sublevel set to w, i.e.., w = arg min u S u w = arg min u R d u w, s.t. f(u) f +. (3) It is easy to show that w L when w / S. In the sequel, we let f(w) be defined as f(w) = inf v f(w) v Given the notations above, we provide the following lemma which is the key to our analysis.

3 Lemma 1. For any w, we have w w 1 (f(w) f(w κ )). In general, when f(w) ρ for any w L for a constant ρ > 0, we have w w 1 (f(w) f(w ρ )). Proof. First we note that if w S then w = w and the lemma holds trivially. Next we consider w S. Then w L, i.e., f(w ) = f +. By the definition of w, there exists λ 0 and a subgradient v f(w ) such that By the convexity of f( ) we have w w + λv = 0. f(w) f(w ) (w w ) v = w w v (4) where the equality is due to that w w is in the same direction of v. From the above inequality, the second part of the Lemma is proved due to the assumption f(w) ρ for any w L. To prove the first part, we use the convexity of f( ) again to obtain f(w ) f(w ) (w w ) f(w ) where w is the closest point in Ω to w. To proceed, we have w w f(w ) (w w ) f(w ) f(w ) f(w ) = f(w ) f = (5) By the definition of B, we have B f(w ) = f(w ) B Combining the above inequality with that in (4), we have which completes the proof. w w B (f(w) f(w )) Remark 1: It is important to note that the above results are deterministic independent of any stochastic or randomized optimization algorithms. Although we focus on SGD-type algorithms in this work, it might be of further interest to emploit the above results for the analysis of other types of stochastic or randomized algorithms. For example, it is easy to generalize the analysis and algorithms to restarted stochastic mirror descent as long as the Euclidean norm for w is replaced by a general norm w and the Euclidean norm of subgradient f(w) is replaced by a dual norm f(w). In Section, we leverage the first part of Lemma 1 to analyze the convergence for non-smooth non-strongly convex optimization problems. In Section 3, we leverage the second part to analyze the convergence for other types of optimization problems, including problems with a weaker notion of strong convexity and a Łojasiewicz property, where both types of problems enjoy a lower bound of the subgradients or gradient at the -level set. Indeed, we note from the proof above that the key to prove the first part is also to establish a lower bound of the subgradients at points in the - level set. This observation might be extended to many special problems. For the problem shown in Figure 1, the subgradient f(w + ) is a constant, i.e., B = c where c is a constant, implying that κ is a constant. More generally, when f(x) is polyhedral function as considered in [7], κ κ when 0, implying B = Θ(). We emphasize that the second part of Lemma 1 could be of independent interest for further research. 3

4 Algorithm 1 SGD: ŵ T = SGD(w 1, η, T ) 1: Input: a step size η, the number of iterations T, and the initial solution w 1, : for t = 1,..., T do 3: compute a stochastic subgradient of f(w) at w t denoted by g t 4: update w t+1 = w t ηg t 5: end for 6: Output: ŵ T = T w t t=1 T Algorithm RSGD 1: Input: the number of epochs K and the number of iterations t per-epoch, w 0 Ω and 0 as in Assumption 1.a : for k = 1,..., K do 3: Set η k = ( k 1 + (a k 1 1))/(G ) // where a k = k 1. k 4: Run SGD to obtain w k = SGD(w k 1, η k, t) 5: Set k = k 1 / 6: end for 7: Output: w K RSGD for Non-smooth Non-strongly Convex Optimization Our goal is to find an -optimal solution ŵ. Algorithm 1 describes the vanilla stochastic subgradient descent method that serves as a subroutine in the proposed restarted stochastic subgradient method (RSGD) shown in Algorithm. We note that the restarting scheme is similar to RSGD proposed in [7]. However, the step size is slightly different. The following lemma [8] provides guarantee on the convergence of SGD methods. Lemma. Let SGD run T iterations. Then for any w we have for using subgradients, and E [f(ŵ T ) f(w)] G η + E[ w 1 w ]. ηt Theorem 1. Suppose Assumption 1 is satisfied. Let Algorithm run for t iterations per-epoch with t 4G κ. With at most K = log ( 0 ) epochs, we have E[f(w K ) f ]. Hence, the total number of iterations for achieving an -suboptimal solution is at most T = tk = t log ( 0 ). Remark : Compare with vanilla SGD. Since the number of iterations per-epoch t is in the order of 4G B, the overall iteration complexity is 4G B log( 0 /). Compared to traditional SGD whose iteration complexity is G w 0 w 0 by setting the optimal step size η, RSGD s iteration complexity depends on B instead of w 0 w0 and only has a logarithmic dependence on 0, the upper bound of f(w 0 ) f. When the initial solution is far from the optimal set, or B is very small, the proposed RSGD could be much faster. In some special cases, e.g., the problem shown in Figure 1, B = Θ(), RSGD only needs a constant number of iterations. Remark 3: Compare with []. Freund & Lu [] proposed new subgradient descent methods with a new step-size rule by leveraging the lower bound of the objective function and a function growth condition, which could lead to improved iteration complexity for both non-smooth and smooth optimizations. For non-smooth optimization, their method achieves an iteration complexity of O(G C ( log H + 1 ) for a solution with a relative accuracy (defined shortly) of. Here, H = f(w0) f f f slb, f slb is a lower bound f and C is called growth constant. We emphasize that there are several key differences between our work and []:(i) The convergence results in [] are established based on finding an solution ŵ T with a relative accuracy of, i.e., f(ŵ T ) f (f f slb ), The constant is set artificially for ease of presentation. 4

5 while we consider absolute accuracy. (ii) The methods in [] are the deterministic first-order methods while we focus on stochastic first-order methods. (iii) Their version of subgradient descent method keeps track of the best solution up to the current iteration while our method maintains the average of all solutions within each epoch. (iv) Their convergence results depend on the growth constant C while ours depends on the quantity B. Proof. We assume that the epoch solutions w k S for k = 1,..., K 1 and induce how many epochs are sufficient for reaching to a point in the -sublevel set in the worst case. Let w k, denote the closest point to w k in the sublevel set. Since S S, therefore w k S and consequentially f(w k, ) = f +. We first define a k = k 1 1 so that a k k = a k 1+1 for k = 0, 1,.... According to Algorithm, we have k = k 1. We will show by induction that E[f(w k) f ] k + a k for k = 0, 1,... which leads to our conclusion when k = K. The inequality holds obviously for k = 0. Assuming E[f(w k 1 ) f ] k 1 + a k 1, we need to show that E[f(w k ) f ] k + a k. We apply Lemma to the k-th epoch of Algorithm and get By Lemma 1, we have E[f(w k ) f(w k 1, )] G η k + E[ w k 1 w k 1, ]. η k t E[ w k 1 w k 1, ] 1 κ E[(f(w k 1 ) f(w k 1, ))] k 1 + (a k 1 1) κ where the second inequality is from the fact that E[f(w k 1 ) f(w k 1, )] E[f(w k 1) f + (f f(w k 1, ))] k 1 + (a k 1 1) Choosing η k = k 1 + (a k 1 1) G E[f(w k ) f(w k 1, )] k 1 + (a k 1 1) 4 and t 4G κ, we have + k 1 + (a k 1 1) 4 which, together with the fact that f(w k 1, ) = f +, implies E[f(w k ) f ] + k 1 + (a k 1 1) Therefore by induction, we have = k + (a k 1 + 1) E[f(w K ) f ] K + a K = 0 K + K 1 K where the last inequality is due to the value of K. k 1 + (a k 1 1), = k + a k Before ending this section, we would like to emphasize that the same analysis can be employed to analyze the iteration complexity of restarted version of other stochastic/randomized algorithms. For example, it is not difficult to obtain similar results for randomized coordinate subgradient descent and stochastic dual averaging methods, which have been analyzed in the previous work [7]. 3 RSGD for other classes of problems In this section, we show some interesting convergence results of RSGD for (weaker) strongly convex optimization problems and smooth optimization problems. First, we consider a weaker notation of strong convexity f(w). A function f(w) is said to be (λ, )-locally semi-strongly convex, iff for any w S it satisfies w w λ (f(w) f(w )), (6) 5

6 Figure : An illustration of weaker classes of strong convexity and the applicable representative algorithms. where w is the closest point to w in the optimal set. If the inequality (6) holds for any w R d, we refer to it as semi-strong convexity. The above inequality implies w w for any w L. Therefore, in the proof of Lemma 1 f(w ) which, together with (4), further leads to w w λ λ as a result of (5), λ (f(w) f(w )) for any w. By choosing t 8G in Algorithm and following a similar analysis as the proof of Theorem 1 λ we can obtain an O( G λ log( 1 ) iteration complexity of Algorithm for (λ, )-locally semi-strongly convex optimization. This result is summarized in the following corollary. Corollary. Suppose Assumption 1 is satisfied and f(w) is (λ, )-locally semistrongly convex such that the inequality in (6) holds for any w S. Let Algorithm run for t iterations per-epoch with t 8G λ. With at most K = log ( 0 ) epochs, we have E[f(w) f ]. In comparison, to the best of our knowledge the vanilla SGD can only take advantage of strong convexity for obtaining Õ(1/) iteration complexity where Õ( ) suppresses constant and logarithmic terms. A function f(w) is said to be λ-strongly convex if it satisfies the following inequality for any w, u and any α [0, 1] f(αw + (1 α)w) αf(w) + (1 α)f(u) λ α(1 α) w u which is stronger than (6). In particular, λ-strong convexity implies (λ, )-local semi-strong convexity [3] but not vice versa, which is illustrated in Figure. The convergence rate of SGD for λ-strongly convex optimization is O( G log T λt ). It is worth noting that a different variant of SGD (namely Epoch-SGD) proposed in [3] applies to semi-strong convexity such that the inequality in (6) holds for any w and enjoys an O( G λt ) convergence rate. Nonetheless, RSGD is still attractive when only local semi-strong convexity holds, which is discussed further below for some examples. Locally strongly convex functions include, e.g., f(w) = e w z and f(w) = log(1+exp(w z)) that cover some commonly used loss functions in machine learning. A family of functions that admit (6) is in the form f(w) = h(aw) where h( ) is a strongly convex function (e.g., h(x) = x y ). By exploiting the Hoffman s bound, it can be shown that the f function in this form satisfies (6) for a constant λ > 0 3. As a result, for any locally strongly convex function h( ), the composition h(aw) is locally semi-strongly convex. Next, we consider a family of smooth optimization with an objective function satisfying the Łojasiewicz inequality, i.e., for every w there is an open neighborhood U of w and constants β (0, 1) and c > 0 such that for all u U f(u) f(w) β c f(u) 3 Indeed, λ is the minimum non-zero eigen-value of A A. 6

7 Algorithm 3 Multi-stage RSGD 1: Input: the accuracy parameter 1 for the first stage and the associated number iterations t 1 per-epoch, a decreasing power parameter r N + : Initialization: w 0 Ω and 0 as in Assumption 1.a 3: for s = 1,..., do 4: Run RSGD with the initial solution w s 1 and K s = log ( s 1 / s ) epochs and t s iterations per-epoch, and obtain a solution w s 5: Evaluate w s according to the earlier stopping criterion 6: If the criterion is not satisfied, update s+1 = s / r and t s+1 accordingly 7: Otherwise terminate the for loop 8: end for In particular, for any w L we assume that the Łojasiewicz inequality holds for w Ω and a neighborhood U of w such that w U. Then f(w) β c, w L The same analysis in Section for RSGD yields an O( G log( 1 β )) iteration complexity. In comparison, stochastic gradient descent for smooth optimization only has an O( 1 ) iteration complexity. Therefore the smaller the power constant β, the smaller the iteration complexity of RSGD. It is worth mentioning that the most notable class of functions that have the Łojasiewicz property at any point are real-analytic functions [6]. Recently, many works study when a function admits the Łojasiewicz property or the Kurdyka - Łojasiewicz property [1]. However, research about this issue is beyond the scope of this paper. The corollary below summarizes the iteration complexity of RSGD for minimizing a smooth function that has that the Łojasiewicz property. Corollary 3. Suppose Assumption 1 is satisfied and f(w) has a Łojasiewicz property - for any w L there exist constants β (0, 1) and c > 0 and an open neighborhood of w Ω that subsumes w such that f(w) f(w ) β c f(w) where w is the closest point to w in Ω. Let Algorithm run for t iterations per-epoch with t 4G c. With at most K = log β ( 0 ) epochs, we have E[f(w) f ]. 4 Implementation Tricks for Machine Learning Problems In this section, we discuss some technical issues related to implementation. First, we note that the RSGD algorithm presented in Algorithm is a unified framework for non-smooth and non-strongly convex problems, locally semi-stronlgy convex problems, and smooth problems of interest. There is only one parameter in RSGD needed to be tuned in practice, i.e., the number of iterations perepoch. The step size η is determined automatically given the accuracy. Although the value of is usually given by the user, in machine learning applications people may find it inconvenient if no prior knowledge about that gives good prediction performance is known. In particular, if is set too small the total number of iterations becomes unnecessarily large, which could harm the prediction performance on testing data as well. On the other hand, if is not sufficiently large, people would worry about that whether the best prediction performance is obtained. In contrast, SGD can be run with infinite number of iterations to get the optimal solution. To solve this issue, we present a multi-stage implementation scheme of RSGD guided by an earlier stopping criterion. The earlier stopping criterion is based on the prediction performance on a validation data, which is usually a common practice in big data learning [4]. We let the first stage start from a randomly chosen initial solution w 0 with a target accuracy of reasonably small 1 and a tuned value of t 1 per-iteration. After K 1 = log ( 0 / 1 ) epochs, we obtain an 1 -optimal model w 1 and evaluate the earlier stopping criterion to decided whether to stop or not. If the decision is to continue, we then decrease 1 by a constant number of times (e.g., ) and increase t 1 accordingly to obtain = 1 / and t. In particular, to get t we can increase t 1 by 4 times if according to the theory the number 7

8 of iterations per-epoch t depends on 1/, or times if t depends on 1/. Then we start the second stage of RSGD from the last solution w 1 with a target accuracy of and t iterations per-peoch. Note that since the new initial solution w 1 is 1 -optimal, therefore according to our theory, we can stop the second stage in K = log ( 1 / ) epochs and the resulting model w is guaranteed to be -optimal solution. Then we repeat the above process till that the earlier stopping criterion is satisfied, e.g, there is no change in the validation error or overfitting occurs 4. The multi-stage scheme is presented in Algorithm 3. 5 Conclusion In this work, we have analyzed the iteration complexity of a novel restarted stochastic subgradient descent method for non-smooth and/or non-strongly convex optimization for obtaining an - suboptimal solution. By leveraging the maximum distance of points in the -level set to the optimal set, we establish a lower bound of the subgradient of points in the -level set, which together with a novel analysis leads to an iteration complexity that could be better than that of the vanilla SGD for non-smooth and non-strongly convex optimization problems. Our novel analysis also leads to improved iteration complexities of RSGD for locally semi-strongly convex optimization problems, and smooth problems that have a Łojasiewicz property. It remains an interesting question to explore RSGD for solving non-convex problems. References [1] J. Bolte, A. Daniilidis, A. S. Lewis, and M. Shiota. Clarke subgradients of stratifiable functions. SIAM Journal on Optimization, 18:556 57, 007. [] R. M. Freund and H. Lu. New computational guarantees for solving convex optimization problems with first order methods, via a function growth condition measure. ArXiv e-prints, 015. [3] E. Hazan and S. Kale. Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization. In Proceedings of the 4th Annual Conference on Learning Theory (COLT), pages , 011. [4] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (NIPS), pages , 01. [5] R. Rockafellar. Convex Analysis. Princeton mathematical series. Princeton University Press, [6] R. Schneider and A. Uschmajew. Convergence results for projected line-search methods on varieties of low-rank matrices via łojasiewicz inequality. CoRR, abs/ , 014. [7] T. Yang and Q. Lin. Stochastic subgradient methods with linear convergence for polyhedral convex optimization. arxiv: cs.lg, 015. [8] M. Zinkevich. Online convex programming and generalized infinitesimal gradient ascent. In Proceedings of the International Conference on Machine Learning (ICML), pages , In the later case, the solution from the previous stage may be used as the final solution. 8

Accelerate Subgradient Methods

Accelerate Subgradient Methods Accelerate Subgradient Methods Tianbao Yang Department of Computer Science The University of Iowa Contributors: students Yi Xu, Yan Yan and colleague Qihang Lin Yang (CS@Uiowa) Accelerate Subgradient Methods

More information

Homotopy Smoothing for Non-Smooth Problems with Lower Complexity than O(1/ɛ)

Homotopy Smoothing for Non-Smooth Problems with Lower Complexity than O(1/ɛ) 1 28 Homotopy Smoothing for Non-Smooth Problems with Lower Complexity than O(1/) Yi Xu yi-xu@uiowa.edu Yan Yan yan.yan-3@student.uts.edu.au Qihang Lin qihang-lin@uiowa.edu Tianbao Yang tianbao-yang@uiowa.edu

More information

arxiv: v1 [math.oc] 10 Oct 2018

arxiv: v1 [math.oc] 10 Oct 2018 8 Frank-Wolfe Method is Automatically Adaptive to Error Bound ondition arxiv:80.04765v [math.o] 0 Oct 08 Yi Xu yi-xu@uiowa.edu Tianbao Yang tianbao-yang@uiowa.edu Department of omputer Science, The University

More information

SADAGRAD: Strongly Adaptive Stochastic Gradient Methods

SADAGRAD: Strongly Adaptive Stochastic Gradient Methods Zaiyi Chen * Yi Xu * Enhong Chen Tianbao Yang Abstract Although the convergence rates of existing variants of ADAGRAD have a better dependence on the number of iterations under the strong convexity condition,

More information

Stochastic Optimization

Stochastic Optimization Introduction Related Work SGD Epoch-GD LM A DA NANJING UNIVERSITY Lijun Zhang Nanjing University, China May 26, 2017 Introduction Related Work SGD Epoch-GD Outline 1 Introduction 2 Related Work 3 Stochastic

More information

Optimal Stochastic Strongly Convex Optimization with a Logarithmic Number of Projections

Optimal Stochastic Strongly Convex Optimization with a Logarithmic Number of Projections Optimal Stochastic Strongly Convex Optimization with a Logarithmic Number of Projections Jianhui Chen 1, Tianbao Yang 2, Qihang Lin 2, Lijun Zhang 3, and Yi Chang 4 July 18, 2016 Yahoo Research 1, The

More information

SADAGRAD: Strongly Adaptive Stochastic Gradient Methods

SADAGRAD: Strongly Adaptive Stochastic Gradient Methods Zaiyi Chen * Yi Xu * Enhong Chen Tianbao Yang Abstract Although the convergence rates of existing variants of ADAGRAD have a better dependence on the number of iterations under the strong convexity condition,

More information

Online Convex Optimization

Online Convex Optimization Advanced Course in Machine Learning Spring 2010 Online Convex Optimization Handouts are jointly prepared by Shie Mannor and Shai Shalev-Shwartz A convex repeated game is a two players game that is performed

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Lecture: Adaptive Filtering

Lecture: Adaptive Filtering ECE 830 Spring 2013 Statistical Signal Processing instructors: K. Jamieson and R. Nowak Lecture: Adaptive Filtering Adaptive filters are commonly used for online filtering of signals. The goal is to estimate

More information

Adaptive Online Gradient Descent

Adaptive Online Gradient Descent University of Pennsylvania ScholarlyCommons Statistics Papers Wharton Faculty Research 6-4-2007 Adaptive Online Gradient Descent Peter Bartlett Elad Hazan Alexander Rakhlin University of Pennsylvania Follow

More information

A Two-Stage Approach for Learning a Sparse Model with Sharp

A Two-Stage Approach for Learning a Sparse Model with Sharp A Two-Stage Approach for Learning a Sparse Model with Sharp Excess Risk Analysis The University of Iowa, Nanjing University, Alibaba Group February 3, 2017 1 Problem and Chanllenges 2 The Two-stage Approach

More information

Stochastic and Adversarial Online Learning without Hyperparameters

Stochastic and Adversarial Online Learning without Hyperparameters Stochastic and Adversarial Online Learning without Hyperparameters Ashok Cutkosky Department of Computer Science Stanford University ashokc@cs.stanford.edu Kwabena Boahen Department of Bioengineering Stanford

More information

A Two-Stage Approach for Learning a Sparse Model with Sharp Excess Risk Analysis

A Two-Stage Approach for Learning a Sparse Model with Sharp Excess Risk Analysis Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence (AAAI-7) A Two-Stage Approach for Learning a Sparse Model with Sharp Excess Risk Analysis Zhe Li, Tianbao Yang, Lijun Zhang, Rong

More information

Provable Non-Convex Min-Max Optimization

Provable Non-Convex Min-Max Optimization Provable Non-Convex Min-Max Optimization Mingrui Liu, Hassan Rafique, Qihang Lin, Tianbao Yang Department of Computer Science, The University of Iowa, Iowa City, IA, 52242 Department of Mathematics, The

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

Stochastic Subgradient Method

Stochastic Subgradient Method Stochastic Subgradient Method Lingjie Weng, Yutian Chen Bren School of Information and Computer Science UC Irvine Subgradient Recall basic inequality for convex differentiable f : f y f x + f x T (y x)

More information

Full-information Online Learning

Full-information Online Learning Introduction Expert Advice OCO LM A DA NANJING UNIVERSITY Full-information Lijun Zhang Nanjing University, China June 2, 2017 Outline Introduction Expert Advice OCO 1 Introduction Definitions Regret 2

More information

A Greedy Framework for First-Order Optimization

A Greedy Framework for First-Order Optimization A Greedy Framework for First-Order Optimization Jacob Steinhardt Department of Computer Science Stanford University Stanford, CA 94305 jsteinhardt@cs.stanford.edu Jonathan Huggins Department of EECS Massachusetts

More information

Optimal Regularized Dual Averaging Methods for Stochastic Optimization

Optimal Regularized Dual Averaging Methods for Stochastic Optimization Optimal Regularized Dual Averaging Methods for Stochastic Optimization Xi Chen Machine Learning Department Carnegie Mellon University xichen@cs.cmu.edu Qihang Lin Javier Peña Tepper School of Business

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Regret bounded by gradual variation for online convex optimization

Regret bounded by gradual variation for online convex optimization Noname manuscript No. will be inserted by the editor Regret bounded by gradual variation for online convex optimization Tianbao Yang Mehrdad Mahdavi Rong Jin Shenghuo Zhu Received: date / Accepted: date

More information

Adaptive Online Learning in Dynamic Environments

Adaptive Online Learning in Dynamic Environments Adaptive Online Learning in Dynamic Environments Lijun Zhang, Shiyin Lu, Zhi-Hua Zhou National Key Laboratory for Novel Software Technology Nanjing University, Nanjing 210023, China {zhanglj, lusy, zhouzh}@lamda.nju.edu.cn

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Optimization and Gradient Descent

Optimization and Gradient Descent Optimization and Gradient Descent INFO-4604, Applied Machine Learning University of Colorado Boulder September 12, 2017 Prof. Michael Paul Prediction Functions Remember: a prediction function is the function

More information

Stochastic Gradient Descent with Only One Projection

Stochastic Gradient Descent with Only One Projection Stochastic Gradient Descent with Only One Projection Mehrdad Mahdavi, ianbao Yang, Rong Jin, Shenghuo Zhu, and Jinfeng Yi Dept. of Computer Science and Engineering, Michigan State University, MI, USA Machine

More information

Online Convex Optimization. Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016

Online Convex Optimization. Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016 Online Convex Optimization Gautam Goel, Milan Cvitkovic, and Ellen Feldman CS 159 4/5/2016 The General Setting The General Setting (Cover) Given only the above, learning isn't always possible Some Natural

More information

Convergence of Cubic Regularization for Nonconvex Optimization under KŁ Property

Convergence of Cubic Regularization for Nonconvex Optimization under KŁ Property Convergence of Cubic Regularization for Nonconvex Optimization under KŁ Property Yi Zhou Department of ECE The Ohio State University zhou.1172@osu.edu Zhe Wang Department of ECE The Ohio State University

More information

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,

More information

arxiv: v2 [math.oc] 1 Nov 2017

arxiv: v2 [math.oc] 1 Nov 2017 Stochastic Non-convex Optimization with Strong High Probability Second-order Convergence arxiv:1710.09447v [math.oc] 1 Nov 017 Mingrui Liu, Tianbao Yang Department of Computer Science The University of

More information

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017 Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem

More information

Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate

Fast Stochastic AUC Maximization with O(1/n)-Convergence Rate Fast Stochastic Maximization with O1/n)-Convergence Rate Mingrui Liu 1 Xiaoxuan Zhang 1 Zaiyi Chen Xiaoyu Wang 3 ianbao Yang 1 Abstract In this paper we consider statistical learning with area under ROC

More information

Stochastic Convex Optimization: Faster Local Growth Implies Faster Global Convergence

Stochastic Convex Optimization: Faster Local Growth Implies Faster Global Convergence Stochastic Convex Optimization: Faster Local Growth Implies Faster Global Convergence Yi Xu 1 Qihang Lin 2 Tianbao Yang 1 Abstract In this paper, a new theory is developed for firstorder stochastic convex

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

Statistical Machine Learning II Spring 2017, Learning Theory, Lecture 4

Statistical Machine Learning II Spring 2017, Learning Theory, Lecture 4 Statistical Machine Learning II Spring 07, Learning Theory, Lecture 4 Jean Honorio jhonorio@purdue.edu Deterministic Optimization For brevity, everywhere differentiable functions will be called smooth.

More information

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................

More information

Logarithmic Regret Algorithms for Strongly Convex Repeated Games

Logarithmic Regret Algorithms for Strongly Convex Repeated Games Logarithmic Regret Algorithms for Strongly Convex Repeated Games Shai Shalev-Shwartz 1 and Yoram Singer 1,2 1 School of Computer Sci & Eng, The Hebrew University, Jerusalem 91904, Israel 2 Google Inc 1600

More information

SVRG++ with Non-uniform Sampling

SVRG++ with Non-uniform Sampling SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract

More information

A Unified Analysis of Stochastic Momentum Methods for Deep Learning

A Unified Analysis of Stochastic Momentum Methods for Deep Learning A Unified Analysis of Stochastic Momentum Methods for Deep Learning Yan Yan,2, Tianbao Yang 3, Zhe Li 3, Qihang Lin 4, Yi Yang,2 SUSTech-UTS Joint Centre of CIS, Southern University of Science and Technology

More information

Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning

Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning Simultaneous Model Selection and Optimization through Parameter-free Stochastic Learning Francesco Orabona Yahoo! Labs New York, USA francesco@orabona.com Abstract Stochastic gradient descent algorithms

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

OLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research

OLSO. Online Learning and Stochastic Optimization. Yoram Singer August 10, Google Research OLSO Online Learning and Stochastic Optimization Yoram Singer August 10, 2016 Google Research References Introduction to Online Convex Optimization, Elad Hazan, Princeton University Online Learning and

More information

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Generalized Boosting Algorithms for Convex Optimization

Generalized Boosting Algorithms for Convex Optimization Alexander Grubb agrubb@cmu.edu J. Andrew Bagnell dbagnell@ri.cmu.edu School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 523 USA Abstract Boosting is a popular way to derive powerful

More information

I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION

I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION I P IANO : I NERTIAL P ROXIMAL A LGORITHM FOR N ON -C ONVEX O PTIMIZATION Peter Ochs University of Freiburg Germany 17.01.2017 joint work with: Thomas Brox and Thomas Pock c 2017 Peter Ochs ipiano c 1

More information

Advanced Machine Learning

Advanced Machine Learning Advanced Machine Learning Online Convex Optimization MEHRYAR MOHRI MOHRI@ COURANT INSTITUTE & GOOGLE RESEARCH. Outline Online projected sub-gradient descent. Exponentiated Gradient (EG). Mirror descent.

More information

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference

ECE G: Special Topics in Signal Processing: Sparsity, Structure, and Inference ECE 18-898G: Special Topics in Signal Processing: Sparsity, Structure, and Inference Neural Networks: A brief touch Yuejie Chi Department of Electrical and Computer Engineering Spring 2018 1/41 Outline

More information

Constrained Optimization and Lagrangian Duality

Constrained Optimization and Lagrangian Duality CIS 520: Machine Learning Oct 02, 2017 Constrained Optimization and Lagrangian Duality Lecturer: Shivani Agarwal Disclaimer: These notes are designed to be a supplement to the lecture. They may or may

More information

Convergence rate of SGD

Convergence rate of SGD Convergence rate of SGD heorem: (see Nemirovski et al 09 from readings) Let f be a strongly convex stochastic function Assume gradient of f is Lipschitz continuous and bounded hen, for step sizes: he expected

More information

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI-16) Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong

More information

Ad Placement Strategies

Ad Placement Strategies Case Study : Estimating Click Probabilities Intro Logistic Regression Gradient Descent + SGD AdaGrad Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox January 7 th, 04 Ad

More information

Convex Optimization. Problem set 2. Due Monday April 26th

Convex Optimization. Problem set 2. Due Monday April 26th Convex Optimization Problem set 2 Due Monday April 26th 1 Gradient Decent without Line-search In this problem we will consider gradient descent with predetermined step sizes. That is, instead of determining

More information

Accelerating Online Convex Optimization via Adaptive Prediction

Accelerating Online Convex Optimization via Adaptive Prediction Mehryar Mohri Courant Institute and Google Research New York, NY 10012 mohri@cims.nyu.edu Scott Yang Courant Institute New York, NY 10012 yangs@cims.nyu.edu Abstract We present a powerful general framework

More information

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence

Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence Parallel and Distributed Stochastic Learning -Towards Scalable Learning for Big Data Intelligence oé LAMDA Group H ŒÆOŽÅ Æ EâX ^ #EâI[ : liwujun@nju.edu.cn Dec 10, 2016 Wu-Jun Li (http://cs.nju.edu.cn/lwj)

More information

Algorithmic Stability and Generalization Christoph Lampert

Algorithmic Stability and Generalization Christoph Lampert Algorithmic Stability and Generalization Christoph Lampert November 28, 2018 1 / 32 IST Austria (Institute of Science and Technology Austria) institute for basic research opened in 2009 located in outskirts

More information

Online Optimization : Competing with Dynamic Comparators

Online Optimization : Competing with Dynamic Comparators Ali Jadbabaie Alexander Rakhlin Shahin Shahrampour Karthik Sridharan University of Pennsylvania University of Pennsylvania University of Pennsylvania Cornell University Abstract Recent literature on online

More information

On the Generalization Ability of Online Strongly Convex Programming Algorithms

On the Generalization Ability of Online Strongly Convex Programming Algorithms On the Generalization Ability of Online Strongly Convex Programming Algorithms Sham M. Kakade I Chicago Chicago, IL 60637 sham@tti-c.org Ambuj ewari I Chicago Chicago, IL 60637 tewari@tti-c.org Abstract

More information

ECS171: Machine Learning

ECS171: Machine Learning ECS171: Machine Learning Lecture 4: Optimization (LFD 3.3, SGD) Cho-Jui Hsieh UC Davis Jan 22, 2018 Gradient descent Optimization Goal: find the minimizer of a function min f (w) w For now we assume f

More information

Tutorial: PART 1. Online Convex Optimization, A Game- Theoretic Approach to Learning.

Tutorial: PART 1. Online Convex Optimization, A Game- Theoretic Approach to Learning. Tutorial: PART 1 Online Convex Optimization, A Game- Theoretic Approach to Learning http://www.cs.princeton.edu/~ehazan/tutorial/tutorial.htm Elad Hazan Princeton University Satyen Kale Yahoo Research

More information

Sub-Sampled Newton Methods

Sub-Sampled Newton Methods Sub-Sampled Newton Methods F. Roosta-Khorasani and M. W. Mahoney ICSI and Dept of Statistics, UC Berkeley February 2016 F. Roosta-Khorasani and M. W. Mahoney (UCB) Sub-Sampled Newton Methods Feb 2016 1

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

SHARPNESS, RESTART AND ACCELERATION

SHARPNESS, RESTART AND ACCELERATION SHARPNESS, RESTART AND ACCELERATION VINCENT ROULET AND ALEXANDRE D ASPREMONT ABSTRACT. The Łojasievicz inequality shows that sharpness bounds on the minimum of convex optimization problems hold almost

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

A Parallel SGD method with Strong Convergence

A Parallel SGD method with Strong Convergence A Parallel SGD method with Strong Convergence Dhruv Mahajan Microsoft Research India dhrumaha@microsoft.com S. Sundararajan Microsoft Research India ssrajan@microsoft.com S. Sathiya Keerthi Microsoft Corporation,

More information

Online Passive-Aggressive Algorithms

Online Passive-Aggressive Algorithms Online Passive-Aggressive Algorithms Koby Crammer Ofer Dekel Shai Shalev-Shwartz Yoram Singer School of Computer Science & Engineering The Hebrew University, Jerusalem 91904, Israel {kobics,oferd,shais,singer}@cs.huji.ac.il

More information

First Efficient Convergence for Streaming k-pca: a Global, Gap-Free, and Near-Optimal Rate

First Efficient Convergence for Streaming k-pca: a Global, Gap-Free, and Near-Optimal Rate 58th Annual IEEE Symposium on Foundations of Computer Science First Efficient Convergence for Streaming k-pca: a Global, Gap-Free, and Near-Optimal Rate Zeyuan Allen-Zhu Microsoft Research zeyuan@csail.mit.edu

More information

Mini-Course 1: SGD Escapes Saddle Points

Mini-Course 1: SGD Escapes Saddle Points Mini-Course 1: SGD Escapes Saddle Points Yang Yuan Computer Science Department Cornell University Gradient Descent (GD) Task: min x f (x) GD does iterative updates x t+1 = x t η t f (x t ) Gradient Descent

More information

arxiv: v1 [math.oc] 7 Dec 2018

arxiv: v1 [math.oc] 7 Dec 2018 arxiv:1812.02878v1 [math.oc] 7 Dec 2018 Solving Non-Convex Non-Concave Min-Max Games Under Polyak- Lojasiewicz Condition Maziar Sanjabi, Meisam Razaviyayn, Jason D. Lee University of Southern California

More information

A DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson

A DELAYED PROXIMAL GRADIENT METHOD WITH LINEAR CONVERGENCE RATE. Hamid Reza Feyzmahdavian, Arda Aytekin, and Mikael Johansson 204 IEEE INTERNATIONAL WORKSHOP ON ACHINE LEARNING FOR SIGNAL PROCESSING, SEPT. 2 24, 204, REIS, FRANCE A DELAYED PROXIAL GRADIENT ETHOD WITH LINEAR CONVERGENCE RATE Hamid Reza Feyzmahdavian, Arda Aytekin,

More information

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning

Tutorial: PART 2. Online Convex Optimization, A Game- Theoretic Approach to Learning Tutorial: PART 2 Online Convex Optimization, A Game- Theoretic Approach to Learning Elad Hazan Princeton University Satyen Kale Yahoo Research Exploiting curvature: logarithmic regret Logarithmic regret

More information

No-Regret Algorithms for Unconstrained Online Convex Optimization

No-Regret Algorithms for Unconstrained Online Convex Optimization No-Regret Algorithms for Unconstrained Online Convex Optimization Matthew Streeter Duolingo, Inc. Pittsburgh, PA 153 matt@duolingo.com H. Brendan McMahan Google, Inc. Seattle, WA 98103 mcmahan@google.com

More information

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine

More information

SGD and Deep Learning

SGD and Deep Learning SGD and Deep Learning Subgradients Lets make the gradient cheating more formal. Recall that the gradient is the slope of the tangent. f(w 1 )+rf(w 1 ) (w w 1 ) Non differentiable case? w 1 Subgradients

More information

arxiv: v4 [math.oc] 24 Apr 2017

arxiv: v4 [math.oc] 24 Apr 2017 Finding Approximate ocal Minima Faster than Gradient Descent arxiv:6.046v4 [math.oc] 4 Apr 07 Naman Agarwal namana@cs.princeton.edu Princeton University Zeyuan Allen-Zhu zeyuan@csail.mit.edu Institute

More information

A survey: The convex optimization approach to regret minimization

A survey: The convex optimization approach to regret minimization A survey: The convex optimization approach to regret minimization Elad Hazan September 10, 2009 WORKING DRAFT Abstract A well studied and general setting for prediction and decision making is regret minimization

More information

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization Alexander Rakhlin University of Pennsylvania Ohad Shamir Microsoft Research New England Karthik Sridharan University of Pennsylvania

More information

Deep Learning & Neural Networks Lecture 4

Deep Learning & Neural Networks Lecture 4 Deep Learning & Neural Networks Lecture 4 Kevin Duh Graduate School of Information Science Nara Institute of Science and Technology Jan 23, 2014 2/20 3/20 Advanced Topics in Optimization Today we ll briefly

More information

Summary and discussion of: Dropout Training as Adaptive Regularization

Summary and discussion of: Dropout Training as Adaptive Regularization Summary and discussion of: Dropout Training as Adaptive Regularization Statistics Journal Club, 36-825 Kirstin Early and Calvin Murdock November 21, 2014 1 Introduction Multi-layered (i.e. deep) artificial

More information

Variance-Reduced and Projection-Free Stochastic Optimization

Variance-Reduced and Projection-Free Stochastic Optimization Elad Hazan Princeton University, Princeton, NJ 08540, USA Haipeng Luo Princeton University, Princeton, NJ 08540, USA EHAZAN@CS.PRINCETON.EDU HAIPENGL@CS.PRINCETON.EDU Abstract The Frank-Wolfe optimization

More information

Convergence Rate of Expectation-Maximization

Convergence Rate of Expectation-Maximization Convergence Rate of Expectation-Maximiation Raunak Kumar University of British Columbia Mark Schmidt University of British Columbia Abstract raunakkumar17@outlookcom schmidtm@csubcca Expectation-maximiation

More information

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer

Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Adaptive Subgradient Methods for Online Learning and Stochastic Optimization John Duchi, Elad Hanzan, Yoram Singer Vicente L. Malave February 23, 2011 Outline Notation minimize a number of functions φ

More information

Subgradient Methods for Maximum Margin Structured Learning

Subgradient Methods for Maximum Margin Structured Learning Nathan D. Ratliff ndr@ri.cmu.edu J. Andrew Bagnell dbagnell@ri.cmu.edu Robotics Institute, Carnegie Mellon University, Pittsburgh, PA. 15213 USA Martin A. Zinkevich maz@cs.ualberta.ca Department of Computing

More information

Mini-Batch Primal and Dual Methods for SVMs

Mini-Batch Primal and Dual Methods for SVMs Mini-Batch Primal and Dual Methods for SVMs Peter Richtárik School of Mathematics The University of Edinburgh Coauthors: M. Takáč (Edinburgh), A. Bijral and N. Srebro (both TTI at Chicago) arxiv:1303.2314

More information

CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent

CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent CSCI 1951-G Optimization Methods in Finance Part 12: Variants of Gradient Descent April 27, 2018 1 / 32 Outline 1) Moment and Nesterov s accelerated gradient descent 2) AdaGrad and RMSProp 4) Adam 5) Stochastic

More information

10. Ellipsoid method

10. Ellipsoid method 10. Ellipsoid method EE236C (Spring 2008-09) ellipsoid method convergence proof inequality constraints 10 1 Ellipsoid method history developed by Shor, Nemirovski, Yudin in 1970s used in 1979 by Khachian

More information

Complexity analysis of second-order algorithms based on line search for smooth nonconvex optimization

Complexity analysis of second-order algorithms based on line search for smooth nonconvex optimization Complexity analysis of second-order algorithms based on line search for smooth nonconvex optimization Clément Royer - University of Wisconsin-Madison Joint work with Stephen J. Wright MOPTA, Bethlehem,

More information

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Haihao Lu August 3, 08 Abstract The usual approach to developing and analyzing first-order

More information

Stochastic Optimization: First order method

Stochastic Optimization: First order method Stochastic Optimization: First order method Taiji Suzuki Tokyo Institute of Technology Graduate School of Information Science and Engineering Department of Mathematical and Computing Sciences JST, PRESTO

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

Lecture 9: September 28

Lecture 9: September 28 0-725/36-725: Convex Optimization Fall 206 Lecturer: Ryan Tibshirani Lecture 9: September 28 Scribes: Yiming Wu, Ye Yuan, Zhihao Li Note: LaTeX template courtesy of UC Berkeley EECS dept. Disclaimer: These

More information

Distributed Stochastic Variance Reduced Gradient Methods and A Lower Bound for Communication Complexity

Distributed Stochastic Variance Reduced Gradient Methods and A Lower Bound for Communication Complexity Noname manuscript No. (will be inserted by the editor) Distributed Stochastic Variance Reduced Gradient Methods and A Lower Bound for Communication Complexity Jason D. Lee Qihang Lin Tengyu Ma Tianbao

More information

Asynchronous Non-Convex Optimization For Separable Problem

Asynchronous Non-Convex Optimization For Separable Problem Asynchronous Non-Convex Optimization For Separable Problem Sandeep Kumar and Ketan Rajawat Dept. of Electrical Engineering, IIT Kanpur Uttar Pradesh, India Distributed Optimization A general multi-agent

More information

Active sets, steepest descent, and smooth approximation of functions

Active sets, steepest descent, and smooth approximation of functions Active sets, steepest descent, and smooth approximation of functions Dmitriy Drusvyatskiy School of ORIE, Cornell University Joint work with Alex D. Ioffe (Technion), Martin Larsson (EPFL), and Adrian

More information

arxiv: v1 [math.oc] 18 Mar 2016

arxiv: v1 [math.oc] 18 Mar 2016 Katyusha: Accelerated Variance Reduction for Faster SGD Zeyuan Allen-Zhu zeyuan@csail.mit.edu Princeton University arxiv:1603.05953v1 [math.oc] 18 Mar 016 March 18, 016 Abstract We consider minimizing

More information

Third-order Smoothness Helps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima

Third-order Smoothness Helps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima Third-order Smoothness elps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima Yaodong Yu and Pan Xu and Quanquan Gu arxiv:171.06585v1 [math.oc] 18 Dec 017 Abstract We propose stochastic

More information

arxiv: v2 [cs.lg] 14 Sep 2017

arxiv: v2 [cs.lg] 14 Sep 2017 Elad Hazan Princeton University, Princeton, NJ 08540, USA Haipeng Luo Princeton University, Princeton, NJ 08540, USA EHAZAN@CS.PRINCETON.EDU HAIPENGL@CS.PRINCETON.EDU arxiv:160.0101v [cs.lg] 14 Sep 017

More information

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information

Delay-Tolerant Online Convex Optimization: Unified Analysis and Adaptive-Gradient Algorithms

Delay-Tolerant Online Convex Optimization: Unified Analysis and Adaptive-Gradient Algorithms Delay-Tolerant Online Convex Optimization: Unified Analysis and Adaptive-Gradient Algorithms Pooria Joulani 1 András György 2 Csaba Szepesvári 1 1 Department of Computing Science, University of Alberta,

More information

Tutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer

Tutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer Tutorial: PART 2 Optimization for Machine Learning Elad Hazan Princeton University + help from Sanjeev Arora & Yoram Singer Agenda 1. Learning as mathematical optimization Stochastic optimization, ERM,

More information