arxiv: v2 [math.oc] 29 Jul 2016

Size: px
Start display at page:

Download "arxiv: v2 [math.oc] 29 Jul 2016"

Transcription

1 Stochastic Frank-Wolfe Methods for Nonconvex Optimization arxiv: v [math.oc] 9 Jul 06 Sashank J. Reddi sjakkamr@cs.cmu.edu Carnegie Mellon University Barnaás Póczós apoczos@cs.cmu.edu Carnegie Mellon University Suvrit Sra suvrit@mit.edu Massachusetts Institute of Technology Astract Alex Smola alex@smola.org Carnegie Mellon University We study Frank-Wolfe methods for nonconvex stochastic and finite-sum optimization prolems. Frank-Wolfe methods (in the convex case) have gained tremendous recent interest in machine learning and optimization communities due to their projection-free property and their aility to exploit structured constraints. However, our understanding of these algorithms in the nonconvex setting is fairly limited. In this paper, we propose nonconvex stochastic Frank-Wolfe methods and analyze their convergence properties. For ojective functions that decompose into a finitesum, we leverage ideas from variance reduction techniques for convex optimization to otain new variance reduced nonconvex Frank-Wolfe methods that have provaly faster convergence than the classical Frank-Wolfe method. Finally, we show that the faster convergence rates of our variance reduced methods also translate into improved convergence rates for the stochastic setting. Introduction We study optimization prolems of the form: { E z [f(x, z)], F (x) := min x Ω (stochastic) n n i= f i(x), (finite-sum). We assume that F, f, and f i (i {,..., n} [n]) are all differentiale, ut possily nonconvex; the domain Ω is convex and compact. Prolems of this form are at the heart of machine learning and statistics; for instance, the finitesum prolem arises under the name empirical loss minimization and M-estimation. Examples of such prolems include multiclass classification, matrix learning, recommendation systems (Jaggi, 03; Hazan and Kale, 0; Hazan and Luo, 06; Harchaoui et al., 04). Within convex optimization, prolem () is relatively well-studied. Two particularly popular approaches for solving it are: (i) Projected stochastic gradient descent (Sgd); and () the Frank-Wolfe (Fw) method. At each iteration, Sgd takes a step in a direction opposite to a stochastic approximation of the gradient F and uses projection onto Ω to ensure feasiility. While computing a stochastic approximation to F is usually inexpensive, in many real settings, the cost projecting onto Ω can e very high (e.g., projecting onto the trace-norm all, onto ase polytopes in sumodular minimization (Fujishige and Isotani, 0)); and in extreme cases projection can even e computationally intractale (Collins et al., 008). ()

2 In such cases, projection ased methods like Sgd ecome impractical. This difficulty underlies the recent surge of interest in Frank-Wolfe methods (Frank and Wolfe, 956; Jaggi, 03) (also known as conditional gradient), due to their projection-free property. In particular, Fw methods avoid the expensive projection operation and require just a linear oracle that solves prolems of the form min x Ω x, g at each iteration. Despite the remarkale success of Fw approaches in the convex setting, including stochastic prolems (Hazan and Luo, 06), their applicaility and non-asymptotic convergence for nonconvex optimization is largely unstudied. Even for Sgd, it is only recently that non-asymptotic convergence analysis for nonconvex optimization was otained (Ghadimi and Lan, 03; Ghadimi et al., 04). More recently, Reddi et al. (06a;) otained variance reduced stochastic methods that converge faster than Sgd in the nonconvex finite-sum setting. Similar fast variants of Fw for nonconvex prolems are not known. Given the vast importance of nonconvex models in machine learning (e.g., in deep learning) and the need to incorporate non-trivial constraints in such models, it is imperative to develop scalale, projection-free methods. This paper presents new Fw methods towards this goal. Our main contriutions are summarized elow, while the key complexity results are listed in Figure. Main Contriutions. For the nonconvex stochastic setting in (), we propose a stochastic Frank- Wolfe algorithm (Sfw), and provide its convergence analysis. For the nonconvex finite-sum setting, we propose two variance reduced (VR) algorithms: Svfw and SagaFw, ased on the popular VR algorithms Svrg and Saga, respectively. We show that y carefully selecting the parameters of these algorithms, we can attain faster convergence rates than the deterministic Fw. In particular, we prove that Svfw and SagaFw are faster than deterministic Fw y a factor of n /3 and n /3 respectively, where n is the numer of component functions in the finite-sum (see ()). Furthermore, leveraging these variance reduced methods, we propose two algorithms, Svfw-S and SagaFw-S, for the nonconvex stochastic setting, with faster convergence rates than Sfw. To our knowledge, our work presents the first theoretical improvement for stochastic variants of Frank-Wolfe in the context of nonconvex optimization.. Related Work The classical Frank-Wolfe method (Frank and Wolfe, 956) using line-search was analyzed for smooth convex functions F and polyhedral domains Ω. Here, a convergence rate of O(/ɛ) to ensure F (x) F ɛ was proved without additional conditions (Frank and Wolfe, 956; Jaggi, 03). There have een several recent works on improving the convergence rates under additional assumptions (Garer and Hazan, 05; Lacoste-Julien and Jaggi, 05). More recently, Hazan and Luo (06) proposed stochastic variants of Fw for convex prolems of form (), and showed theoretical improvements over the classical Frank-Wolfe method. The literature on nonconvex Frank-Wolfe is relatively small. The work (Bertsekas, 995) proves asymptotic convergence of Fw to a stationary point; though, no convergence rates are provided. To the est of our knowledge, Yu et al. (04) is the first to provide convergence rates for Fw-type algorithm in the nonconvex setting. Very recently, Lacoste-Julien (06) provided a (non-asymptotic) convergence rate of O(/ɛ ) for nonconvex Fw with adaptive step sizes. However, as we shall see later, implementation of classical Fw for () is expensive (or impossile in the pure stochastic case) since it requires calculation of the gradient F at each iteration. We show that our stochastic variants are provaly faster than the existing Fw methods. In the nonconvex setting, most of the work on stochastic methods focuses on Sgd (Ghadimi and Lan, 03; Ghadimi et al., 04) and analyzes convergence to stationary points. For the finite-sum setting, we uild on recent variance reduction techniques (Johnson and Zhang, 03; Defazio et al., 04; Schmidt et al., 03), which were first proposed for solving unconstrained convex prolems of form (). Projected variants to handle constraints were studied in (Defazio et al., 04; Xiao and Zhang, 04). More recently, Reddi et al. (06a;;c) provided nonconvex variants of these methods

3 Algorithm SFO/IFO Complexity LO Complexity Frank-Wolfe O(n/ɛ ) O ( /ɛ ) Sfw O ( /ɛ 4) O ( /ɛ ) Svfw O(n + n /3 /ɛ ) O(/ɛ ) SagaFw O(n + n /3 /ɛ ) O(/ɛ ) Svfw-S O(/ɛ 0/3 ) O(/ɛ ) SagaFw-S O(/ɛ 8/3 ) O(/ɛ ) Figure : Tale comparing the est SFO/IFO and LO complexity of algorithms discussed in the paper (for the nonconvex setting). Here, Sfw, Svfw-S and SagaFw-S are algorithms for the stochastic setting, while Fw, Svfw and SagaFw are algorithms for the finite-sum setting. The complexity is measured y the numer of oracle calls required to achieve an ɛ-accurate solution (see Section for definitions of SFO/IFO and LO complexity). The complexity of Fw is from (Lacoste-Julien, 06). The results marked in red are contriutions of this paper. For clarity, we hide the dependence of SFO/IFO and LO complexity on the initial point and few parameters related to the function F and domain Ω. that converge provaly faster than oth Sgd and its deterministic counterpart. Preliminaries As stated aove, we study two different prolem settings: () stochastic, where F (x) = E z [f(x, z)] and z is random variale whose distriution P is supported on Ξ R p ; and () finite-sums, where F (x) = n n f i(x). For the stochastic setting, we assume that F is L-smooth, i.e., its gradient is Lipschitz continuous with constant L, so F (x) F (y) L x y, x, y Ω. Here. denotes the l -norm. Furthermore, for the stochastic setting, we also assume the function f is G-Lipschitz i.e., f(x, z) G for all x Ω and z Ξ. Such an assumption is common in the stochastic setting (Ghadimi and Lan, 03; Hazan and Luo, 06). For the finite-sum setting, we assume that the individual functions f i (i [n]) are L-smooth i.e., f i (x) f i (y) L x y, x, y Ω. Note that this implies that the function F is also L-smooth. The domain Ω R d is assumed to e convex and compact with diameter D; i.e., x y D for all x, y Ω. Such an assumption is common to all Frank-Wolfe methods. Convergence criteria. The criterion used for the convergence analysis is important in nonconvex optimization. For unconstrained prolems, the gradient norm F is typically used to measure convergence, ecause F 0 translates into convergence to a stationary point. However, this criterion cannot e used for constrained prolems of the form (). Instead, we use the following quantity, typically referred to as Frank-Wolfe gap: G(x) = max v x, F (x). () v Ω 3

4 For convex functions, the Fw gap provides an upper ound on the suoptimality. For nonconvex functions, the gap G(x) = 0 if and only if x is a stationary point. To state our convergence results we will also need the following ound: β (F (x 0) F (x )) LD, given some (unspecified) initial point x 0 Ω. Oracle model. To compare convergence speed of different algorithms, we use the following lack-ox oracles: Stochastic First-Order Oracle (SFO): For a function F ( ) = E z [f(., z)] where z P, an SFO takes a point x and returns the pair (f(x, z ), f(x, z )) where z is a sample drawn i.i.d. from P (Nemirovski and Yudin, 983). Incremental First-Order Oracle (IFO): For a function F ( ) = n i f i(.), an IFO takes an index i [n] and a point x R d, and returns the pair (f i (x), f i (x)) (Agarwal and Bottou, 04). Linear Optimization Oracle (LO): For a set Ω, an LO takes a direction d and returns arg max v Ω v, d. Throughout the paper, y SFO, IFO and LO complexity of an algorithm, we mean the total numer of SFO, IFO and LO calls made y the algorithm to otain an ɛ-accurate solution, i.e., a solution for which E[G(x)] ɛ; the expectation is over any randomization as part of the algorithm. For clarity of presentation, we hide the dependence of these complexities on the initial point F (x 0 ) F (x ), Lipschitz constant G, and the smoothness constant L; we report the dependence on n to highlight its importance. Classical Fw. To place our results in perspective, we egin y recalling the classical Frank-Wolfe (Fw) algorithm (Frank and Wolfe, 956). Pseudocode for this is presented in Algorithm. Algorithm : Fw ( x 0, T, {γ i } T ) i=0 : Input: x 0 Ω, numer of iterations T, {γ i} T i=0 : for t = 0 to T do 3: Compute v t = arg max v Ω v, F (x t) 4: Compute update direction d t = v t x t 5: x t+ = x t + γ td t 6: end for where γi [0, ] for all i {0,..., T } Each iteration of Fw entails calculation of the gradient F and moving towards a minimizer of a linearized ojective. Notice that calculation of F may not e possile in the stochastic setting of (). Furthermore, even in the finite-sum setting, computing F requires n IFO calls, rendering the approach useless in large-scale prolems, where n is large. For the nonconvex finite-sum setting, the following key result was proved recently (Lacoste-Julien, 06). Theorem (Lacoste-Julien, 06)). Under appropriate selection of step sizes γ t, the IFO and LO complexity of Algorithm to achieve an ɛ-accurate solution in the finite-sum setting are O(n/ɛ ) and O(/ɛ ) respectively. The key aspect of Theorem is the dependence of IFO complexity on n. In particular, when n is large, the IFO complexity O(n/ɛ ) shown y the theorem ecomes prohiitively expensive; thus, undermining the enefits of Fw over competitors like projected Sgd. In the next section, we tackle this drawack and develop faster nonconvex stochastic and variance reduced Fw methods. 4

5 Algorithm : Nonconvex SFW ( x 0, T, {γ i } T i=0, { i} T ) i=0 : Input: x 0 Ω, numer of iterations T, {γ i } T i=0 where γ i [0, ] for all i {0,..., T }, miniatch size { i } T i=0 : for t = 0 to T do 3: Uniformly randomly pick i.i.d samples {z, t..., z t t } according to the distriution P. 4: Compute v t = arg max v Ω v, t t f(x t, z i ) 5: Compute update direction d t = v t x t 6: x t+ = x t + γ t d t 7: end for 8: Output: Iterate x a chosen uniformly random from {x t } T t=0. 3 Algorithms In this section, we descrie Fw algorithms for solving (). In particular, we explore stochastic and variance reduced versions of the classical Fw method, for the stochastic and finite-sum settings, respectively. We defer the discussion on comparison of the convergence rates to Section Stochastic Setting We first investigate the convergence of Fw in the stochastic setting. As mentioned earlier, the classical Fw method (Algorithm ) requires calculation of the full gradient F (x), which is typically impossile to compute in the stochastic setting. For convex prolems, Hazan and Luo (06) tackle this issue y using the popular Roins-Monro approximation (Roins and Monro, 95) to the gradient. We use a variant of the algorithm for our nonconvex stochastic setting, which we call Sfw. The pseudocode of Sfw is listed in Algorithm. Note that the samples {z i } are chosen independently according to the distriution P. Thus, E zi [ f(x, z i )] = F (x), i.e., we otain an uniased estimate of the gradient. Also, note that the output in Algorithm is randomly selected from all the iterates of the algorithm. The key parameters of Sfw are the step sizes {γ i } T i=0 and the miniatch sizes { t }. These parameters must e chosen appropriately in order to ensure convergence of the algorithm (see Theorem ). For our analysis, we assume that the function f is G-Lipschitz i.e., we have max x Ω,z Ξ f(x, z) G. This ound on the gradient is crucial for our convergence analysis. We prove the following key result for nonconvex Sfw. Theorem. Consider the stochastic setting of () where f is G-Lipschitz and F is L-smooth. Then, the output x a of Algorithm with parameters γ t = γ = (F (x0) F (x )) T LD β, t = = T for all t {0,..., T }, satisfies the following ound: E[G(x a )] D ) L(F (x (G + 0) F (x )) T β ( + β), where x is an optimal solution to (stochastic) prolem (). Proof. First oserve the following upper ound: F (x t+ ) F (x t ) + F (x t ), x t+ x t + L x t+ x t = F (x t ) + F (x t ), γ(v t x t ) + L γ(v t x t ) F (x t ) + γ F (x t ), v t x t + LD γ. (3) 5

6 The first inequality follows since F is L-smooth (see Lemma ). The equality is due to the fact that x t+ x t = γ(v t x t ). The second inequality holds ecause v t, x t Ω and ecause the diameter of Ω is D. Next, we introduce the following quantity: ˆv t := arg max v Ω v, F (x t), (4) which is used purely for our analysis and is not part of the algorithm. For revity, we use t to denote f(x t, z t i ). Rewriting inequality (3) using this quantity, we see that F (x t+ ) F (x t ) + γ t, v t x t + γ F (x t ) t, v t x t + LD γ F (x t ) + γ t, ˆv t x t + γ F (x t ) t, v t x t + LD γ = F (x t ) + γ F (x t ), ˆv t x t + γ F (x t ) t, v t ˆv t + LD γ = F (x t ) γg(x t ) + γ F (x t ) t, v t ˆv t + LD γ F (x t ) γg(x t ) + Dγ F (x t ) t + LD γ. (5) The second inequality follows from the optimality of v t in Algorithm, while the third inequality follows from recalling that G(x t ) = max v Ω v x t, F (x t ) = ˆv t x t, F (x t ), which holds due to the optimality of ˆv t in (4). The last inequality follows from Cauchy-Schwarz and the fact that the diameter of the feasile set Ω is ounded y D. Taking expectations and using Lemma in (5) we otain the following important ound: E[F (x t+ )] E[F (x t )] γe[g(x t )] + GDγ + LD γ. Summing over t and telescoping, we then otain the upper-ound T γ t=0 E[G(x t )] F (x 0 ) E[F (x T )] + T GDγ + T LD γ F (x 0 ) F (x ) + T GDγ + T LD γ. The latter inequality follows from the optimality of x. Using the definition of the output x a of Algorithm and the parameters specified in the theorem statement, we get E[G(x a )] F (x 0) F (x ) T γ D T (G + which concludes the proof of the theorem. + GD + LD γ L(F (x 0) F (x )) β ( + β) ), (6) An immediate consequence of Theorem is the following complexity result for Sfw. 6

7 Corollary. Under the setting of Theorem, the SFO complexity and LO complexity of Algorithm are O(/ɛ 4 ) and O(/ɛ ), respectively. Proof. The proof follows upon oserving that O(/ɛ ) miniatch size is required at each iteration of the algorithm, and noting that as per Theorem O(/ɛ ) iterations are required to achieve an ɛ-accurate solution. Note that the SFO and LO complexity of nonconvex Sfw is similar to that of online Fw (Hazan and Kale, 0) and slightly worse than complexity of Sfw for convex prolems (Hazan and Luo, 06). Furthermore, for simplicity of analysis, we used a fixed step size and miniatch size. One can derive an essentially similar result using a decreasing step size and increasing miniatch size. It is important to emphasize that the aove results also apply to the finite-sum setting. In particular, when the distriution P is the empirical measure, then the convergence result in Theorem also provides convergence rates for the finite-sum case. However, as we will see shortly, these convergence rates can e improved significantly y using variance reduction techniques. 3. Finite-sum Setting In this section, we consider the finite-sum setting of (). We show that y uilding on ideas from variance reduction for Sgd, one can significantly improve the convergence rates. The key idea is to use a variance reduced approximation of the gradient (Johnson and Zhang, 03; Defazio et al., 04). We analyze two different algorithms for the finite-sum setting. Our first algorithm (Svfw) is ased on the convex method of (Hazan and Luo, 06) adapted to the nonconvex case. Our second algorithm (SagaFw) is ased on another variance reduction technique called Saga (Defazio et al., 04). Svfw Algorithm Pseudocode of our first method (Svfw) is presented in Algorithm 3. Similar to (Johnson and Zhang, 03) and (Hazan and Luo, 06), nonconvex Svfw is also epoch-ased. At the end of each epoch, the full gradient is computed at the current iterate. This gradient is used for controlling the variance of the stochastic gradients in the inner loop. For epoch size m =, Svfw reduces to the classic Fw algorithm. In general, the epoch size m is chosen such that the total numer of IFO calls per epoch is Θ(n). This ensures that the cost of computing the full gradient at the end of each epoch is amortized. To enale a fair comparison with Sfw, we assume that the total numer of inner iterations across all epochs in Algorithm 3 is T. We prove the following key result for Algorithm 3. For ease of exposition, we assume that the total numer of inner iterations T is a multiple of m. Theorem 3. Consider the finite-sum setting of () where the functions {f i } n are L-smooth. Then, the output x a of Algorithm 3 with parameters γ t = γ = F (x0) F (x ) T LD β and t = = m for all t {0,..., m }, satisfies E[G(x a )] D T β L(F (x0 ) F (x ))( + β), where x is an optimal solution of () and x a is the output of Algorithm 4. Proof. We first analyze the convergence properties of iterates within an epoch. Suppose that the current epoch is s +. For revity, we drop the symol s from x s+ t, x s and g s, whenever it safe to do so given the context. The first part of the proof is similar to that of Theorem. For the sake of completeness, we provide the details here. We again use the quantity ˆv t = arg max v Ω v, F (x t ), as efore, purely for our analysis. 7

8 Algorithm 3: SVFW ( x 0, T, m, {γ i } m i=0, { i} m ) i=0 : Input: x 0 m = x 0 Ω, epoch size m, numer of epochs S = T/m, {γ i } m i=0 where γ i [0, ] for all i {0,..., m }, miniatch size { i } m i=0 : for s = 0 to S do 3: Let x s = x s m 4: Compute g s = F ( x s ) = n n f( xs ) 5: for t = 0 to m do 6: Uniformly randomly (with replacement) select suset I t = {i,..., i t } from [n]. 7: Compute vt s+ = arg max v Ω v, t ( f i (x s+ t ) f i ( x s ) + g s ) 8: Compute update direction d s+ t = vt s+ x s+ t 9: x s+ t+ = xs+ 0: end for : end for t + γ t d s+ t : Output: Iterate x a chosen uniformly random from {{x s+ t } t=0 m }S s=0. For the t th iteration within the epoch s, we have F (x t+ ) F (x t ) + F (x t ), γ(v t x t ) + L γ(v t x t ) F (x t ) + γ F (x t ), v t x t + LD γ. (7) This is due to Lemma and definition of x t+ in Algorithm 3. For revity, we use t to denote t ( f i (x t ) f i ( x) + g). Rewriting, we then otain F (x t+ ) F (x t ) + γ t, v t x t + γ F (x t ) t, v t x t + LD γ F (x t ) + γ t, ˆv t x t + γ F (x t ) t, v t x t + LD γ F (x t ) + γ F (x t ), ˆv t x t + γ F (x t ) t, v t ˆv t + LD γ F (x t ) γg(x t ) + Dγ F (x t ) t + LD γ. (8) The second inequality is due to the optimality of v t in Algorithm 3. The last inequality is due to the definition of G(x t ), the diameter of set Ω, and an application of Cauchy-Schwarz inequality. Note that the aove inequality is similar to (5), except for the crucial difference in the term F (x t ) t (instead of F (x t ) t in (5)). As we shall see shortly, this term has much lower variance, which ultimately leads to faster convergence rates. Taking expectations and using Lemma 3 in inequality (8) we otain the ound E[F (x t+ )] E[F (x t )] γe[g(x t )] To aid further analysis, we introduce the following Lyapunov function: + LDγ E[ x t x ] + LD γ. (9) R t = E[F (x t ) + c t x t x ], 8

9 where c m = 0 and c t = c t+ + (LDγ)/ for all t {0,..., m }. Using the relationship in (9), we see that R t+ = E[F (x t+ ) + c t+ x t+ x ] E[F (x t )] γe[g(x t )] + LDγ E[ x t x ] + LD γ + c t+ E[ x t+ x ] E[F (x t )] γe[g(x t )] + LDγ E[ x t x ] + LD γ + c t+ E[ x t+ x t + x t x ] R t γe[g(x t )] + LD γ + c t+ Dγ. (0) The second inequality follows from the triangle inequality, while the last inequality holds ecause: (a) c t = c t+ + (LDγ)/, and () x t+ x t = γ v t x t Dγ (recall the definition of diameter of Ω). Telescoping over all the iterations within an epoch, we otain m R m R 0 γ E[G(x t )] + LmD γ t=0 m = R 0 γ E[G(x t )] + LmD γ t=0 + Dγ m t= c t + L(m )md γ. () The equality follows from the relationship c t = c t+ + (LDγ)/. Since c m = 0 and x s+ 0 = x s = x s m (in Algorithm 3), from () we otain E[F (x s+ m )] E[F (x s+ m Now telescoping over all epochs, we otain + LmD γ S E[F (x S m)] F (x 0 ) γ m )] γ s=0 + T LD γ t=0 E[G(x s+ t )] + L(m )md γ. m t=0 E[G(x s+ t )] + T L(m )D γ. Rearranging this inequality and using the definition of the output in Algorithm 3, we finally otain E[G(x a )] F (x 0) E[F (x S m)] T γ F (x 0) F (x ) T γ + LD γ + LD γ LD (F (x 0 ) F (x )) ( + β). T β + L(m )D γ The second inequality follows from the optimality of x and ecause = m. The last inequality follows from the choice of γ stated in the theorem. This concludes the proof. 9

10 Algorithm 4: SagaFw ( x 0, T, {γ i } T i=0, { i} T ) i=0 : Input: α0 i = x 0 Ω for all i [n], numer of iterations T, {γ i } T i=0 where γ i [0, ] for all i {0,..., T }, miniatch size { i } T i=0 : Compute g 0 = n n F (αi 0) 3: for t = 0 to T do 4: Uniformly randomly (with replacement) select susets I t, J t from [n] of size t. 5: Compute v t = arg max v Ω v, t ( f i (x t ) f i (αt) i + g t ) 6: Compute update direction d t = v t x t 7: x t+ = x t + γ t d t 8: α j t+ = x t for j J t and α j t+ = αj t for j / J t 9: g t+ = g t n j J t ( f j (αt) j f j (α j t+ )) 0: end for : Output: Iterate x a chosen uniformly random from {x t } T t=0. The analysis suggests that the value of m should e set appropriately in Theorem 3 to otain good convergence rates. If m is small, the IFO complexity of Algorithm 3 is dominated y the step involving calculation of the full gradient at the end of each epoch. On the other hand, if m is large, a large miniatch is used in each step of the algorithm (since = m ), which increases the IFO complexity. With this intuition, we present following important corollary. Corollary. Under the setting of Theorem 3 and with m = n /3, the IFO complexity and LO complexity of Algorithm are O(n + n /3 /ɛ ) and O(/ɛ ), respectively. Proof. We first oserve that the total numer of IFO calls for an epoch (including those required for calculating the full gradient) is Θ(m 3 + n). Since m = n /3, the total amortized IFO complexity of one iteration within an epoch is O(m ) = O(n /3 ). Therefore, the IFO complexity is O(n + n /3 /ɛ ). Further, since each inner iteration requires O() LO calls, the LO complexity is O(/ɛ ). SagaFw Algorithm Svfw is a semi-stochastic algorithm since it requires calculation of the full gradient at the end of each epoch. Below we propose a purely incremental method (SagaFw) ased on the Saga algorithm of (Defazio et al., 04). The pseudocode for SagaFw is presented in Algorithm 4. A key feature of SagaFw is that it entirely avoids calculation of full gradients. Instead, it updates the average gradient vector g t at each iteration. This update requires maintaining additional vectors α i (i [n]), and in the worst case such a strategy incurs additional storage cost of O(nd). However, this cost can e reduced to O(n) in several practical cases (refer to (Defazio et al., 04; Reddi et al., 06)). For SagaFw, we prove the following key result. Theorem 4. Consider the finite-sum setting of () where functions {f i } n are L-smooth. Define θ(, n, T ) = / + (n 3/ /T 3/ ). Then the output x a of Algorithm 4 with parameters γ t = γ = F (x0) F (x ) T LD θ(,n,t )β and t = n for all t {0,..., T }, satisfies the following: E[G(x a )] D T β Lθ(, n, T )(F (x0 ) F (x ))( + β), where x is an optimal solution of prolem () and x a is the output of Algorithm 4. 0

11 Proof. We use the following quantities in our analysis: ˇ t = ( fi (x t ) f i (α i ) t) + g t t ˆv t = arg max v Ω v, F (x t). The first part of our proof is similar to that of Theorem 3. until (8), we have E[F (x t+ )] Using essentially the same argument F (x t ) γg(x t ) + Dγ F (x t ) ˇ t + LD γ F (x t ) γg(x t ) + LDγ n n E x t α i n t + LD γ. () The second inequality is due to Lemma 4. Next, we define the following Lyapunov function: n R t = E[F (x t )] + c t E x t α i n t, where c T = 0 and c t = ( ρ)c t+ + (LDγ n)/ for all t {0,..., T }, where ρ is the proaility ( /n) of an index i eing in J t. We can ound ρ from elow as ρ = ( n) +(/n) = /n +/n n, (3) where the first inequality follows from ( y) r /( + ry) (which holds for y [0, ] and r ), while the second inequality holds ecause n. Now oserve the following: n E x t+ α n t+ i = n E [ ρ x t+ x t + ( ρ) x t+ α i n t ] n = n n [ E ρ x t+ x t ] + ( ρ)( x t+ x t + x t αt ) i n E [ x t+ x t + ( ρ)e x t αt i ] (4) The first equality follows from the definition of α i t+ in Algorithm 4, while the inequality is just the triangle inequality. Using the aove relationship and the ound in (), we otain R t+ E[F (x t )] γe[g(x t )] + LD γ + LDγ n n n E[ x t αt ] i + c t+ E[ x t+ x t ] + c t+ ( ρ) n n E[ x t αt ] i R t γe[g(x t )] + LD γ + c t+ Dγ. (5)

12 The second inequality holds ecause: (a) c t = ( ρ)c t+ + (LDγ n)/, and () x t+ x t = γ v t x t Dγ (due to our ound on the diameter of the set Ω). Telescoping over all the iterations, we see that T R T R 0 γ E[G(x t )] + T LD γ t=0 T R 0 γ E[G(x t )] + T LD γ t=0 T R 0 γ E[G(x t )] + T LD γ t=0 + Dγ T t= + LD γ n ρ c t + LD γ n 3/ 3/. The second inequality follows form the fact that T t= c t LDγ n/(ρ ). This can, in turn, e otained from the recursion c t = ( ρ)c t+ + (LDγ n)/ and c T = 0. The third inequality is due to the ound on ρ in (3). Rearranging the aove inequality and using the definition of x a from Algorithm 4, we finally otain the ound E[G(x a )] F (x 0) E[F (x T )] T γ F (x 0) F (x ) T γ + LD γ + LD γθ(, n, T ). + LD γn 3/ T 3/ The first inequality uses the fact that c T = 0 and α i 0 = x 0 (in Algorithm 4). The second inequality uses the optimality of x and the definition of θ(, n, T ). Using the setting of γ in the theorem statement, we otain the desired result. Corollary 3. Assume T n. Under the settings of Theorem 4 and with = n /3, the IFO and LO complexity of Algorithm are O(n + n /3 /ɛ ) and O(/ɛ ), respectively. Proof. First, oserve that for T n and = n /3, θ(, n, T ) 5/ in Theorem 4. Thus, the IFO complexity is O(n + n /3 /ɛ ). Furthermore, since each iteration requires just O() LO calls, the LO complexity is O(/ɛ ). Notaly, the IFO complexity of SagaFw is lower than that of Svfw. Moreover, if T n 3/ and =, then we have θ(, n, T ) 5/, in which case the IFO complexity is O(n 3/ + /ɛ ). 4 Variance Reduction in Stochastic Setting In this section, we improve the convergence rates in the stochastic setting using variance reduction techniques. The key idea is to first otain samples {z i } are chosen independently according to the distriution P and then use Svfw or SagaFw, descried in this paper, on the finite-sum prolem over these samples. The pseudocode for the Svfw and SagaFw variants for stochastic setting (Svfw-S and SagaFw-S respectively) are provided in Figure. The following is the key result regarding the convergence rates of Svfw-S and SagaFw-S. Theorem 5. Consider the stochastic setting of () where f is G-Lipschitz and f(., z) is L-smooth F (x0) F (x ) T LD β for all z Ξ. Suppose B = T and γ = of Svfw-S and SagaFw-S satisfy the following: E[G(x a )] (for Svfw-S and SagaFw-S). Then the output D T β L(F (x0 ) F (x ))( + β) + GD T (6)

13 Svfw-S:(x 0, T, B, γ) Randomly sample z,, z B P Let finite-sum ˆF (x) = B B f(x, z i) Output Svfw(x 0, T, B /3, γ, B /3 ) applied on the function ˆF SagaFw-S:(x 0, T, B, γ) Randomly sample z,, z B P Let finite-sum ˆF (x) = B B f(x, z i) Output SagaFw(x 0, T, γ, 3B /3 ) applied on the function ˆF Figure : Svfw-S and SagaFw-S variants for the stochastic setting. Proof. Consider the finite-sum ˆF (x) = B B f(x, z i) where z,, z B P. We use the following notation: Ĝ(x) = max v Ω v x, ˆF (x). Let v t s+ = arg max v Ω v x s+ t, F (x s+ t ) and ˆv t s+ = arg max v Ω v x s+ t, ˆF (x s+ t ). We first oserve the following key relationship for Svfw: E[G(x s+ t ) Ĝ(xs+ t )] = E[ v t s+ x s+ t, F (x s+ t ) ] E[ ˆv t s+ x s+ t, ˆF (x s+ t ) ] E[ v s+ t x s+ t, F (x s+ t ) ] E[ v s+ t x s+ t, ˆF (x s+ t ) ] E[ v t s+ x s+ t, ˆF (x s+ t ) F (x s+ t ) ] DE[ ˆF (x s+ t ) F (x s+ t ) ] GD. T The first inequality is due to the optimality of ˆv t s+. The third inequality follows from Cauchy-Schwarz inequality. The last inequality is due to Lemma. Adding the aove inequality across all the iterations and epochs, we get: E[G(x a )] E[Ĝ(x a)] + GD T. Using the ound on E[Ĝ(x a)] in Theorem 3 (here, recall we are using Svfw on ˆF ) in the aove inequality, we get the desired result. The proof for SagaFw-S is similar. The following corollary on the complexity of Svfw-S and SagaFw-S is immediate consequence of the aove result. Corollary 4. Under the setting of Theorem 5, the SFO complexity of Svfw-S and SagaFw-S (in Figure ) are O(/ɛ 0/3 ) and O(/ɛ 8/3 ), respectively. The LO complexity of oth Svfw-S and SagaFw-S is O(/ɛ ). Proof. The proof follows from the fact that B = T, = B /3 (in Svfw-S) and = 3B /3 (in SagaFw-S) and IFO complexities of Svfw and SagaFw given in Corollary and 3 respectively. By comparing Corollary 4 with Corollary, we see that Svfw-S and SagaFw-S have etter SFO complexity than Sfw. 5 Discussion It is important to remark on the complexity results derived in this paper. For the stochastic setting, we showed that the SFO and LO complexity of Sfw are O(/ɛ 4 ) and O(/ɛ ), respectively. At first glance, these rates might appear worse than those otained for nonconvex Sgd (see (Ghadimi and Lan, 03)). However, it is important to note that the convergence criterion used in our paper is 3

14 different from the one used in (Ghadimi and Lan, 03). It is an important piece of future work to understand the precise relationship etween these convergence criteria. Furthermore, the convergence rates in this paper are similar to those otained for online Frank-Wolfe (Hazan and Kale, 0) and only slightly worse than those otained for stochastic Frank-Wolfe in the convex setting (Hazan and Luo, 06). We, further, improved the convergence rate of Sfw y using variance reduction ideas in the stochastic setting (Svfw-S and SagaFw-S algorithms in Section 4). Understanding the tightness of these rates is an interesting open prolem left as future work. For the finite-sum setting, while the complexity results of Sfw still hold, we otained significantly faster convergence rates y using variance reduction techniques. The dependence of IFO and LO complexity of nonconvex Svfw and SagaFw, on ɛ is O(/ɛ ), which matches the classical Frank-Wolfe algorithm (Lacoste-Julien, 06). However, Svfw and SagaFw exhiit a much weaker dependence on n than Fw; wherein, they are provaly faster than the classical Frank-Wolfe y a factor of n /3 and n /3, respectively. Similar (ut not same) enefits have also een reported for nonconvex Svrg and Saga over gradient descent (Reddi et al., 06a;). Interestingly, there appears to e a gap etween the convergence rates of Svfw and SagaFw. Whether this gap is an artifact of our analysis or has deeper reasons remains open. We conclude with a remark on a sutle point regarding the step size γ. The step size γ in Theorems, 3, and 4 requires knowledge of parameters like L, D and F (x) F (x ). Typically, an estimate of the these values suffices in practice. In asence of such knowledge, one can completely eliminate this dependence of γ on these parameters y simply choosing β = (F (x0) F (x )) LD. Fortunately, this comes at the cost of only slightly worse constants in the convergence rate. References A. Agarwal and L. Bottou. A Lower Bound for the Optimization of Finite Sums. arxiv:40.073, 04. D. Bertsekas. Nonlinear Programming. Athena Scientific, 995. ISBN M. Collins, A. Gloerson, T. Koo, X. Carreras, and P. L. Bartlett. Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks. JMLR, 9:775 8, 008. A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A Fast Incremental Gradient Method with Support for Non-Strongly Convex Composite Ojectives. In NIPS 7, pages M. Frank and P. Wolfe. An Algorithm for Quadratic Programming. Naval Research Logistics Quarterly, 3(-):95 0, March 956. S. Fujishige and S. Isotani. A Sumodular Function Minimization Algorithm ased on the Minimumnorm Base. Pacific Journal of Optimization, 7():3 7, 0. D. Garer and E. Hazan. Faster Rates for the Frank-Wolfe Method over Strongly-Convex Sets. In Proceedings of the 3nd International Conference on Machine Learning, ICML 05, Lille, France, 6- July 05, pages , 05. S. Ghadimi and G. Lan. Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming. SIAM Journal on Optimization, 3(4):34 368, 03. doi: 0.37/ S. Ghadimi, G. Lan, and H. Zhang. Mini-atch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization. Mathematical Programming, 55(-):67 305, Decemer 04. Z. Harchaoui, A. Juditsky, and A. Nemirovski. Conditional Gradient Algorithms for Norm-Regularized Smooth Convex Optimization. Mathematical Programming, 5(-):75, apr 04. 4

15 E. Hazan and S. Kale. Projection-free Online Learning. In Proceedings of the 9th International Conference on Machine Learning, ICML 0, Edinurgh, Scotland, UK, June 6 - July, 0, 0. E. Hazan and H. Luo. Variance-Reduced and Projection-Free Stochastic Optimization. CoRR, as/60.00, 06. M. Jaggi. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In ICML 3, pages , 03. R. Johnson and T. Zhang. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction. In NIPS 6, pages S. Lacoste-Julien. Convergence Rate of Frank-Wolfe for Non-Convex Ojectives. as/ , 06. S. Lacoste-Julien and M. Jaggi. On the Gloal Linear Convergence of Frank-Wolfe Optimization Variants. In Advances in Neural Information Processing Systems 8, pages , 05. A. Nemirovski and D. Yudin. Prolem Complexity and Method Efficiency in Optimization. John Wiley and Sons, 983. Y. Nesterov. Introductory Lectures On Convex Optimization: A Basic Course. Springer, 003. S. J. Reddi, A. Hefny, S. Sra, B. Póczos, and A. J. Smola. Stochastic Variance Reduction for Nonconvex Optimization. In Proceedings of the 33nd International Conference on Machine Learning, ICML 06, New York City, NY, USA, June 9-4, 06, pages 34 33, 06a. S. J. Reddi, S. Sra, B. Póczos, and A. J. Smola. Fast Incremental Method for Nonconvex Optimization. arxiv: , 06. S. J. Reddi, S. Sra, B. Póczos, and A. J. Smola. Fast Stochastic Methods for Nonsmooth Nonconvex Optimization. arxiv: , 06c. H. Roins and S. Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, (3): , Sep 95. M. W. Schmidt, N. L. Roux, and F. R. Bach. Minimizing Finite Sums with the Stochastic Average Gradient. arxiv: , 03. L. Xiao and T. Zhang. A Proximal Stochastic Gradient Method with Progressive Variance Reduction. SIAM Journal on Optimization, 4(4): , 04. Y. Yu, X. Zhang, and D. Schuurmans. Generalized Conditional Gradient for Sparse Estimation. as/arxiv:40.488, 04. Appendix The following ound on the value of functions with Lipschitz continuous gradients is classical (see e.g., (Nesterov, 003)). Lemma. If f : R d R is L-smooth, then for all x, y R d. f(x) f(y) + f(y), x y + L x y, 5

16 The following lemma is useful for ounding the variance of the gradient estimate used in the stochastic setting. Lemma. Suppose the function F (x) = E z [f(x, z)] where z is a random variale with distriution P and support Ξ, and max z Ξ f(x, z) G for all x Ω. Also, let x = f(x, z i ) where {z i } are i.i.d. samples from the distriution P. Then, the following holds for any x Ω: E[ x F (x) ] G. Proof. The proof follows from a simple application of Lemma 5 and Jensen s inequality. The following result is useful for ounding the variance of the updates of Svfw and follows from a slight modification of a result in (Reddi et al., 06a). We give the proof here for completeness. Lemma 3 (Reddi et al., 06a)). Let t = t ( f i (x s+ t ) f i ( x s ) + g s ) in Algorithm 3. For the iterates x s+ t and x s where t {0,..., m } and s {0,..., S } in Algorithm 3, the following inequality holds: Proof. For the ease of exposition, we first define E It [ F (x s+ t ) t ] L x s+ t x s. t ζ s+ t = I t Using this notation, we then otain the following: E It [ F (x s+ t ) t ] ( fi (x s+ t ) f i ( x s ) ). = E It [ ζt s+ + F ( x s ) F (x s+ t ) ] = E It [ ζt s+ E It [ζt s+ ] ] = E It t ( fi (x s+ t ) f i ( x s ) E It [ζ s+ t ] ). The second equality is due to the fact that E It [ζt s+ ] = F (x s+ t ) F ( x s ). From the aove relationship, we get E It [ F (x s+ t ) t ] [ ] E It f i (x s+ t ) f i ( x s ) E It [ζt s+ ] t t E It [ f i (x s+ t ) f i ( x s ) ] L t x s+ t x s. The first inequality follows from Lemma 5. The second inequality is due to the fact that for a random variale ζ, E[ ζ E[ζ] ] E[ ζ ]. The last inequality follows from L-smoothness of f i. The result follows from a simple application of Jensen s inequality to the inequality aove. 6

17 The following result is important for ounding the variance in SagaFw. The key difference from Lemma 3 is that the variance term in SagaFw involves αt. i Again, we provide the proof for completeness. Lemma 4. Let ˇ t = t ( f i (x t ) f i (αt) i + g t ) in Algorithm 4. {αt} i n where t {0,..., T } in Algorithm 4, we have the inequality For the iterates x t, v t and E It [ F (x t ) ˇ t ] Proof. As efore we first define the quantity ζ t = I t L t n n x t α i t. ( fi (x t ) f i (αt) i ). With this notation, we then otain the equality E It [ F (x t) ˇ t ] [ = E It ζ t + n ] f i(αt) i F (x t) = E It [ ζ t E It [ζ t] ] n [ = ( ) ] EI t f i(x t) f i(α i t) E It [ζ t]. The second equality follows from the fact that E It [ζ t ] = F (x t ) n n f i(α i t). From the aove inequality, we get the following ound: E It [ F (x t) ˇ t ] t E It [ f i(x t) f i(α i t) E It [ζ t] ] t E It [ f i(x t) f i(α i t) ] L n t n x t αt i. The first inequality is due to Lemma 5, while the second inequality holds ecause for a random variale ζ, E[ ζ E[ζ] ] E[ ζ ]. The last inequality is from L-smoothness of f i (i [n]) and uniform randomness of the set I t. By applying Jensen s inequality, we get the desired result. Lemma 5. For random variales z,..., z r that are independent and have mean 0, we have Proof. Expanding the left hand side we have E [ z z r ] = E [ z z r ]. E [ z z r ] = r i,j= the second equality here follows from the our hypothesis. [ r E [z i z j ] = E z i ] ; 7

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the

More information

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos

Stochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos 1 Stochastic Variance Reduction for Nonconvex Optimization Barnabás Póczos Contents 2 Stochastic Variance Reduction for Nonconvex Optimization Joint work with Sashank Reddi, Ahmed Hefny, Suvrit Sra, and

More information

Improved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations

Improved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations Improved Optimization of Finite Sums with Miniatch Stochastic Variance Reduced Proximal Iterations Jialei Wang University of Chicago Tong Zhang Tencent AI La Astract jialei@uchicago.edu tongzhang@tongzhang-ml.org

More information

Stochastic and online algorithms

Stochastic and online algorithms Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem

More information

Finite-sum Composition Optimization via Variance Reduced Gradient Descent

Finite-sum Composition Optimization via Variance Reduced Gradient Descent Finite-sum Composition Optimization via Variance Reduced Gradient Descent Xiangru Lian Mengdi Wang Ji Liu University of Rochester Princeton University University of Rochester xiangru@yandex.com mengdiw@princeton.edu

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Lecture 3-A - Convex) SUVRIT SRA Massachusetts Institute of Technology Special thanks: Francis Bach (INRIA, ENS) (for sharing this material, and permitting its use) MPI-IS

More information

Optimization for Machine Learning

Optimization for Machine Learning Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html

More information

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method

Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors

More information

SVRG++ with Non-uniform Sampling

SVRG++ with Non-uniform Sampling SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract

More information

Modern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization

Modern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization Modern Stochastic Methods Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization 10-725 Last time: conditional gradient method For the problem min x f(x) subject to x C where

More information

Variance-Reduced and Projection-Free Stochastic Optimization

Variance-Reduced and Projection-Free Stochastic Optimization Elad Hazan Princeton University, Princeton, NJ 08540, USA Haipeng Luo Princeton University, Princeton, NJ 08540, USA EHAZAN@CS.PRINCETON.EDU HAIPENGL@CS.PRINCETON.EDU Abstract The Frank-Wolfe optimization

More information

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm

Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Joint work with Nicolas Le Roux and Francis Bach University of British Columbia Context: Machine Learning for Big Data Large-scale

More information

Comparison of Modern Stochastic Optimization Algorithms

Comparison of Modern Stochastic Optimization Algorithms Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,

More information

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure

Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,

More information

arxiv: v2 [cs.lg] 14 Sep 2017

arxiv: v2 [cs.lg] 14 Sep 2017 Elad Hazan Princeton University, Princeton, NJ 08540, USA Haipeng Luo Princeton University, Princeton, NJ 08540, USA EHAZAN@CS.PRINCETON.EDU HAIPENGL@CS.PRINCETON.EDU arxiv:160.0101v [cs.lg] 14 Sep 017

More information

Stochastic Optimization

Stochastic Optimization Introduction Related Work SGD Epoch-GD LM A DA NANJING UNIVERSITY Lijun Zhang Nanjing University, China May 26, 2017 Introduction Related Work SGD Epoch-GD Outline 1 Introduction 2 Related Work 3 Stochastic

More information

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:

On Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence: A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition

More information

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization

Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine

More information

arxiv: v4 [math.oc] 5 Jan 2016

arxiv: v4 [math.oc] 5 Jan 2016 Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The

More information

Stochastic gradient descent and robustness to ill-conditioning

Stochastic gradient descent and robustness to ill-conditioning Stochastic gradient descent and robustness to ill-conditioning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

Proximal Minimization by Incremental Surrogate Optimization (MISO)

Proximal Minimization by Incremental Surrogate Optimization (MISO) Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan Shiqian Ma Yu-Hong Dai Yuqiu Qian May 16, 2016 Abstract One of the major issues in stochastic gradient descent (SGD) methods is how

More information

Oracle Complexity of Second-Order Methods for Smooth Convex Optimization

Oracle Complexity of Second-Order Methods for Smooth Convex Optimization racle Complexity of Second-rder Methods for Smooth Convex ptimization Yossi Arjevani had Shamir Ron Shiff Weizmann Institute of Science Rehovot 7610001 Israel Abstract yossi.arjevani@weizmann.ac.il ohad.shamir@weizmann.ac.il

More information

Convex Optimization Lecture 16

Convex Optimization Lecture 16 Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean

More information

Stochastic Gradient Descent with Variance Reduction

Stochastic Gradient Descent with Variance Reduction Stochastic Gradient Descent with Variance Reduction Rie Johnson, Tong Zhang Presenter: Jiawen Yao March 17, 2015 Rie Johnson, Tong Zhang Presenter: JiawenStochastic Yao Gradient Descent with Variance Reduction

More information

Linear Convergence under the Polyak-Łojasiewicz Inequality

Linear Convergence under the Polyak-Łojasiewicz Inequality Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini and Mark Schmidt The University of British Columbia LCI Forum February 28 th, 2017 1 / 17 Linear Convergence of Gradient-Based

More information

Fast Stochastic Optimization Algorithms for ML

Fast Stochastic Optimization Algorithms for ML Fast Stochastic Optimization Algorithms for ML Aaditya Ramdas April 20, 205 This lecture is about efficient algorithms for minimizing finite sums min w R d n i= f i (w) or min w R d n f i (w) + λ 2 w 2

More information

A Sparsity Preserving Stochastic Gradient Method for Composite Optimization

A Sparsity Preserving Stochastic Gradient Method for Composite Optimization A Sparsity Preserving Stochastic Gradient Method for Composite Optimization Qihang Lin Xi Chen Javier Peña April 3, 11 Abstract We propose new stochastic gradient algorithms for solving convex composite

More information

Linear Convergence under the Polyak-Łojasiewicz Inequality

Linear Convergence under the Polyak-Łojasiewicz Inequality Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini, Mark Schmidt University of British Columbia Linear of Convergence of Gradient-Based Methods Fitting most machine learning

More information

Large-scale Stochastic Optimization

Large-scale Stochastic Optimization Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation

More information

Coordinate Descent and Ascent Methods

Coordinate Descent and Ascent Methods Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:

More information

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization

Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization Alexander Rakhlin University of Pennsylvania Ohad Shamir Microsoft Research New England Karthik Sridharan University of Pennsylvania

More information

A Parallel SGD method with Strong Convergence

A Parallel SGD method with Strong Convergence A Parallel SGD method with Strong Convergence Dhruv Mahajan Microsoft Research India dhrumaha@microsoft.com S. Sundararajan Microsoft Research India ssrajan@microsoft.com S. Sathiya Keerthi Microsoft Corporation,

More information

Third-order Smoothness Helps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima

Third-order Smoothness Helps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima Third-order Smoothness elps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima Yaodong Yu and Pan Xu and Quanquan Gu arxiv:171.06585v1 [math.oc] 18 Dec 017 Abstract We propose stochastic

More information

ECS289: Scalable Machine Learning

ECS289: Scalable Machine Learning ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization

More information

Asynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization

Asynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization Proceedings of the hirty-first AAAI Conference on Artificial Intelligence (AAAI-7) Asynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization Zhouyuan Huo Dept. of Computer

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,

More information

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization

Stochastic Gradient Descent. Ryan Tibshirani Convex Optimization Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so

More information

Optimal Regularized Dual Averaging Methods for Stochastic Optimization

Optimal Regularized Dual Averaging Methods for Stochastic Optimization Optimal Regularized Dual Averaging Methods for Stochastic Optimization Xi Chen Machine Learning Department Carnegie Mellon University xichen@cs.cmu.edu Qihang Lin Javier Peña Tepper School of Business

More information

Optimization for Machine Learning (Lecture 3-B - Nonconvex)

Optimization for Machine Learning (Lecture 3-B - Nonconvex) Optimization for Machine Learning (Lecture 3-B - Nonconvex) SUVRIT SRA Massachusetts Institute of Technology MPI-IS Tübingen Machine Learning Summer School, June 2017 ml.mit.edu Nonconvex problems are

More information

Learning with stochastic proximal gradient

Learning with stochastic proximal gradient Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and

More information

Mini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization

Mini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization Noname manuscript No. (will be inserted by the editor) Mini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization Saeed Ghadimi Guanghui Lan Hongchao Zhang the date of

More information

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016

Contents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................

More information

Provable Non-Convex Min-Max Optimization

Provable Non-Convex Min-Max Optimization Provable Non-Convex Min-Max Optimization Mingrui Liu, Hassan Rafique, Qihang Lin, Tianbao Yang Department of Computer Science, The University of Iowa, Iowa City, IA, 52242 Department of Mathematics, The

More information

Trade-Offs in Distributed Learning and Optimization

Trade-Offs in Distributed Learning and Optimization Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed

More information

Stochastic Quasi-Newton Methods

Stochastic Quasi-Newton Methods Stochastic Quasi-Newton Methods Donald Goldfarb Department of IEOR Columbia University UCLA Distinguished Lecture Series May 17-19, 2016 1 / 35 Outline Stochastic Approximation Stochastic Gradient Descent

More information

Accelerating Stochastic Optimization

Accelerating Stochastic Optimization Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz

More information

Lecture: Adaptive Filtering

Lecture: Adaptive Filtering ECE 830 Spring 2013 Statistical Signal Processing instructors: K. Jamieson and R. Nowak Lecture: Adaptive Filtering Adaptive filters are commonly used for online filtering of signals. The goal is to estimate

More information

Linearly-Convergent Stochastic-Gradient Methods

Linearly-Convergent Stochastic-Gradient Methods Linearly-Convergent Stochastic-Gradient Methods Joint work with Francis Bach, Michael Friedlander, Nicolas Le Roux INRIA - SIERRA Project - Team Laboratoire d Informatique de l École Normale Supérieure

More information

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent

Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Haihao Lu August 3, 08 Abstract The usual approach to developing and analyzing first-order

More information

Barzilai-Borwein Step Size for Stochastic Gradient Descent

Barzilai-Borwein Step Size for Stochastic Gradient Descent Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong

More information

Conditional Gradient (Frank-Wolfe) Method

Conditional Gradient (Frank-Wolfe) Method Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties

More information

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;

min f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term; Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many

More information

1 Hoeffding s Inequality

1 Hoeffding s Inequality Proailistic Method: Hoeffding s Inequality and Differential Privacy Lecturer: Huert Chan Date: 27 May 22 Hoeffding s Inequality. Approximate Counting y Random Sampling Suppose there is a ag containing

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - CAP, July

More information

STA141C: Big Data & High Performance Statistical Computing

STA141C: Big Data & High Performance Statistical Computing STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied

More information

Complexity bounds for primal-dual methods minimizing the model of objective function

Complexity bounds for primal-dual methods minimizing the model of objective function Complexity bounds for primal-dual methods minimizing the model of objective function Yu. Nesterov July 4, 06 Abstract We provide Frank-Wolfe ( Conditional Gradients method with a convergence analysis allowing

More information

arxiv: v1 [cs.lg] 15 Jun 2016

arxiv: v1 [cs.lg] 15 Jun 2016 A Class of Parallel Douly Stochastic Algorithms for Large-Scale Learning arxiv:1606.04991v1 [cs.lg] 15 Jun 2016 Aryan Mokhtari Alec Koppel Alejandro Rieiro Department of Electrical and Systems Engineering

More information

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization

Frank-Wolfe Method. Ryan Tibshirani Convex Optimization Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)

More information

arxiv: v1 [math.oc] 18 Mar 2016

arxiv: v1 [math.oc] 18 Mar 2016 Katyusha: Accelerated Variance Reduction for Faster SGD Zeyuan Allen-Zhu zeyuan@csail.mit.edu Princeton University arxiv:1603.05953v1 [math.oc] 18 Mar 016 March 18, 016 Abstract We consider minimizing

More information

Stochastic Gradient Descent with Only One Projection

Stochastic Gradient Descent with Only One Projection Stochastic Gradient Descent with Only One Projection Mehrdad Mahdavi, ianbao Yang, Rong Jin, Shenghuo Zhu, and Jinfeng Yi Dept. of Computer Science and Engineering, Michigan State University, MI, USA Machine

More information

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained

NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS 1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x R n where f : R n R. We assume

More information

Beyond stochastic gradient descent for large-scale machine learning

Beyond stochastic gradient descent for large-scale machine learning Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines - October 2014 Big data revolution? A new

More information

SADAGRAD: Strongly Adaptive Stochastic Gradient Methods

SADAGRAD: Strongly Adaptive Stochastic Gradient Methods Zaiyi Chen * Yi Xu * Enhong Chen Tianbao Yang Abstract Although the convergence rates of existing variants of ADAGRAD have a better dependence on the number of iterations under the strong convexity condition,

More information

A Greedy Framework for First-Order Optimization

A Greedy Framework for First-Order Optimization A Greedy Framework for First-Order Optimization Jacob Steinhardt Department of Computer Science Stanford University Stanford, CA 94305 jsteinhardt@cs.stanford.edu Jonathan Huggins Department of EECS Massachusetts

More information

Big Data Analytics: Optimization and Randomization

Big Data Analytics: Optimization and Randomization Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.

More information

On Nesterov s Random Coordinate Descent Algorithms - Continued

On Nesterov s Random Coordinate Descent Algorithms - Continued On Nesterov s Random Coordinate Descent Algorithms - Continued Zheng Xu University of Texas At Arlington February 20, 2015 1 Revisit Random Coordinate Descent The Random Coordinate Descent Upper and Lower

More information

SGD CONVERGES TO GLOBAL MINIMUM IN DEEP LEARNING VIA STAR-CONVEX PATH

SGD CONVERGES TO GLOBAL MINIMUM IN DEEP LEARNING VIA STAR-CONVEX PATH Under review as a conference paper at ICLR 9 SGD CONVERGES TO GLOAL MINIMUM IN DEEP LEARNING VIA STAR-CONVEX PATH Anonymous authors Paper under double-blind review ASTRACT Stochastic gradient descent (SGD)

More information

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017

Machine Learning. Support Vector Machines. Fabio Vandin November 20, 2017 Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training

More information

Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - September 2012 Context Machine

More information

Stochastic gradient methods for machine learning

Stochastic gradient methods for machine learning Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - January 2013 Context Machine

More information

Proximal and First-Order Methods for Convex Optimization

Proximal and First-Order Methods for Convex Optimization Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,

More information

arxiv: v3 [math.oc] 8 Jan 2019

arxiv: v3 [math.oc] 8 Jan 2019 Why Random Reshuffling Beats Stochastic Gradient Descent Mert Gürbüzbalaban, Asuman Ozdaglar, Pablo Parrilo arxiv:1510.08560v3 [math.oc] 8 Jan 2019 January 9, 2019 Abstract We analyze the convergence rate

More information

Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization

Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Jinghui Chen Department of Systems and Information Engineering University of Virginia Quanquan Gu

More information

arxiv: v1 [math.oc] 10 Oct 2018

arxiv: v1 [math.oc] 10 Oct 2018 8 Frank-Wolfe Method is Automatically Adaptive to Error Bound ondition arxiv:80.04765v [math.o] 0 Oct 08 Yi Xu yi-xu@uiowa.edu Tianbao Yang tianbao-yang@uiowa.edu Department of omputer Science, The University

More information

Selected Topics in Optimization. Some slides borrowed from

Selected Topics in Optimization. Some slides borrowed from Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model

More information

Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values

Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Mengdi Wang Ethan X. Fang Han Liu Abstract Classical stochastic gradient methods are well suited

More information

Stochastic Gradient Descent. CS 584: Big Data Analytics

Stochastic Gradient Descent. CS 584: Big Data Analytics Stochastic Gradient Descent CS 584: Big Data Analytics Gradient Descent Recap Simplest and extremely popular Main Idea: take a step proportional to the negative of the gradient Easy to implement Each iteration

More information

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation

Course Notes for EE227C (Spring 2018): Convex Optimization and Approximation Course Notes for EE7C (Spring 08): Convex Optimization and Approximation Instructor: Moritz Hardt Email: hardt+ee7c@berkeley.edu Graduate Instructor: Max Simchowitz Email: msimchow+ee7c@berkeley.edu October

More information

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013

Convex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013 Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for

More information

arxiv: v2 [cs.lg] 4 Oct 2016

arxiv: v2 [cs.lg] 4 Oct 2016 Appearing in 2016 IEEE International Conference on Data Mining (ICDM), Barcelona. Efficient Distributed SGD with Variance Reduction Soham De and Tom Goldstein Department of Computer Science, University

More information

arxiv: v2 [math.oc] 1 Nov 2017

arxiv: v2 [math.oc] 1 Nov 2017 Stochastic Non-convex Optimization with Strong High Probability Second-order Convergence arxiv:1710.09447v [math.oc] 1 Nov 017 Mingrui Liu, Tianbao Yang Department of Computer Science The University of

More information

NESTT: A Nonconvex Primal-Dual Splitting Method for Distributed and Stochastic Optimization

NESTT: A Nonconvex Primal-Dual Splitting Method for Distributed and Stochastic Optimization ESTT: A onconvex Primal-Dual Splitting Method for Distributed and Stochastic Optimization Davood Hajinezhad, Mingyi Hong Tuo Zhao Zhaoran Wang Abstract We study a stochastic and distributed algorithm for

More information

Block stochastic gradient update method

Block stochastic gradient update method Block stochastic gradient update method Yangyang Xu and Wotao Yin IMA, University of Minnesota Department of Mathematics, UCLA November 1, 2015 This work was done while in Rice University 1 / 26 Stochastic

More information

Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization

Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization Frank E. Curtis, Lehigh University Beyond Convexity Workshop, Oaxaca, Mexico 26 October 2017 Worst-Case Complexity Guarantees and Nonconvex

More information

On the Convergence Rate of Incremental Aggregated Gradient Algorithms

On the Convergence Rate of Incremental Aggregated Gradient Algorithms On the Convergence Rate of Incremental Aggregated Gradient Algorithms M. Gürbüzbalaban, A. Ozdaglar, P. Parrilo June 5, 2015 Abstract Motivated by applications to distributed optimization over networks

More information

Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization

Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Lin Xiao (Microsoft Research) Joint work with Qihang Lin (CMU), Zhaosong Lu (Simon Fraser) Yuchen Zhang (UC Berkeley)

More information

IFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent

IFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent IFT 6085 - Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s):

More information

Generalized Conditional Gradient and Its Applications

Generalized Conditional Gradient and Its Applications Generalized Conditional Gradient and Its Applications Yaoliang Yu University of Alberta UBC Kelowna, 04/18/13 Y-L. Yu (UofA) GCG and Its Apps. UBC Kelowna, 04/18/13 1 / 25 1 Introduction 2 Generalized

More information

References. --- a tentative list of papers to be mentioned in the ICML 2017 tutorial. Recent Advances in Stochastic Convex and Non-Convex Optimization

References. --- a tentative list of papers to be mentioned in the ICML 2017 tutorial. Recent Advances in Stochastic Convex and Non-Convex Optimization References --- a tentative list of papers to be mentioned in the ICML 2017 tutorial Recent Advances in Stochastic Convex and Non-Convex Optimization Disclaimer: in a quite arbitrary order. 1. [ShalevShwartz-Zhang,

More information

Stochastic optimization: Beyond stochastic gradients and convexity Part I

Stochastic optimization: Beyond stochastic gradients and convexity Part I Stochastic optimization: Beyond stochastic gradients and convexity Part I Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint tutorial with Suvrit Sra, MIT - NIPS

More information

Bregman Divergence and Mirror Descent

Bregman Divergence and Mirror Descent Bregman Divergence and Mirror Descent Bregman Divergence Motivation Generalize squared Euclidean distance to a class of distances that all share similar properties Lots of applications in machine learning,

More information

Optimization methods

Optimization methods Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,

More information

On Projected Stochastic Gradient Descent Algorithm with Weighted Averaging for Least Squares Regression

On Projected Stochastic Gradient Descent Algorithm with Weighted Averaging for Least Squares Regression On Projected Stochastic Gradient Descent Algorithm with Weighted Averaging for Least Squares Regression arxiv:606.03000v [cs.it] 9 Jun 206 Kobi Cohen, Angelia Nedić and R. Srikant Abstract The problem

More information

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning

Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology

More information

Tutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer

Tutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer Tutorial: PART 2 Optimization for Machine Learning Elad Hazan Princeton University + help from Sanjeev Arora & Yoram Singer Agenda 1. Learning as mathematical optimization Stochastic optimization, ERM,

More information

Large-scale Machine Learning and Optimization

Large-scale Machine Learning and Optimization 1/84 Large-scale Machine Learning and Optimization Zaiwen Wen http://bicmr.pku.edu.cn/~wenzw/bigdata2018.html Acknowledgement: this slides is based on Dimitris Papailiopoulos and Shiqian Ma lecture notes

More information

Stochastic Optimization Algorithms Beyond SG

Stochastic Optimization Algorithms Beyond SG Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods

More information

Bringing Order to Special Cases of Klee s Measure Problem

Bringing Order to Special Cases of Klee s Measure Problem Bringing Order to Special Cases of Klee s Measure Prolem Karl Bringmann Max Planck Institute for Informatics Astract. Klee s Measure Prolem (KMP) asks for the volume of the union of n axis-aligned oxes

More information

A Stochastic PCA Algorithm with an Exponential Convergence Rate. Ohad Shamir

A Stochastic PCA Algorithm with an Exponential Convergence Rate. Ohad Shamir A Stochastic PCA Algorithm with an Exponential Convergence Rate Ohad Shamir Weizmann Institute of Science NIPS Optimization Workshop December 2014 Ohad Shamir Stochastic PCA with Exponential Convergence

More information