arxiv: v2 [math.oc] 29 Jul 2016
|
|
- Scot Quinn
- 5 years ago
- Views:
Transcription
1 Stochastic Frank-Wolfe Methods for Nonconvex Optimization arxiv: v [math.oc] 9 Jul 06 Sashank J. Reddi sjakkamr@cs.cmu.edu Carnegie Mellon University Barnaás Póczós apoczos@cs.cmu.edu Carnegie Mellon University Suvrit Sra suvrit@mit.edu Massachusetts Institute of Technology Astract Alex Smola alex@smola.org Carnegie Mellon University We study Frank-Wolfe methods for nonconvex stochastic and finite-sum optimization prolems. Frank-Wolfe methods (in the convex case) have gained tremendous recent interest in machine learning and optimization communities due to their projection-free property and their aility to exploit structured constraints. However, our understanding of these algorithms in the nonconvex setting is fairly limited. In this paper, we propose nonconvex stochastic Frank-Wolfe methods and analyze their convergence properties. For ojective functions that decompose into a finitesum, we leverage ideas from variance reduction techniques for convex optimization to otain new variance reduced nonconvex Frank-Wolfe methods that have provaly faster convergence than the classical Frank-Wolfe method. Finally, we show that the faster convergence rates of our variance reduced methods also translate into improved convergence rates for the stochastic setting. Introduction We study optimization prolems of the form: { E z [f(x, z)], F (x) := min x Ω (stochastic) n n i= f i(x), (finite-sum). We assume that F, f, and f i (i {,..., n} [n]) are all differentiale, ut possily nonconvex; the domain Ω is convex and compact. Prolems of this form are at the heart of machine learning and statistics; for instance, the finitesum prolem arises under the name empirical loss minimization and M-estimation. Examples of such prolems include multiclass classification, matrix learning, recommendation systems (Jaggi, 03; Hazan and Kale, 0; Hazan and Luo, 06; Harchaoui et al., 04). Within convex optimization, prolem () is relatively well-studied. Two particularly popular approaches for solving it are: (i) Projected stochastic gradient descent (Sgd); and () the Frank-Wolfe (Fw) method. At each iteration, Sgd takes a step in a direction opposite to a stochastic approximation of the gradient F and uses projection onto Ω to ensure feasiility. While computing a stochastic approximation to F is usually inexpensive, in many real settings, the cost projecting onto Ω can e very high (e.g., projecting onto the trace-norm all, onto ase polytopes in sumodular minimization (Fujishige and Isotani, 0)); and in extreme cases projection can even e computationally intractale (Collins et al., 008). ()
2 In such cases, projection ased methods like Sgd ecome impractical. This difficulty underlies the recent surge of interest in Frank-Wolfe methods (Frank and Wolfe, 956; Jaggi, 03) (also known as conditional gradient), due to their projection-free property. In particular, Fw methods avoid the expensive projection operation and require just a linear oracle that solves prolems of the form min x Ω x, g at each iteration. Despite the remarkale success of Fw approaches in the convex setting, including stochastic prolems (Hazan and Luo, 06), their applicaility and non-asymptotic convergence for nonconvex optimization is largely unstudied. Even for Sgd, it is only recently that non-asymptotic convergence analysis for nonconvex optimization was otained (Ghadimi and Lan, 03; Ghadimi et al., 04). More recently, Reddi et al. (06a;) otained variance reduced stochastic methods that converge faster than Sgd in the nonconvex finite-sum setting. Similar fast variants of Fw for nonconvex prolems are not known. Given the vast importance of nonconvex models in machine learning (e.g., in deep learning) and the need to incorporate non-trivial constraints in such models, it is imperative to develop scalale, projection-free methods. This paper presents new Fw methods towards this goal. Our main contriutions are summarized elow, while the key complexity results are listed in Figure. Main Contriutions. For the nonconvex stochastic setting in (), we propose a stochastic Frank- Wolfe algorithm (Sfw), and provide its convergence analysis. For the nonconvex finite-sum setting, we propose two variance reduced (VR) algorithms: Svfw and SagaFw, ased on the popular VR algorithms Svrg and Saga, respectively. We show that y carefully selecting the parameters of these algorithms, we can attain faster convergence rates than the deterministic Fw. In particular, we prove that Svfw and SagaFw are faster than deterministic Fw y a factor of n /3 and n /3 respectively, where n is the numer of component functions in the finite-sum (see ()). Furthermore, leveraging these variance reduced methods, we propose two algorithms, Svfw-S and SagaFw-S, for the nonconvex stochastic setting, with faster convergence rates than Sfw. To our knowledge, our work presents the first theoretical improvement for stochastic variants of Frank-Wolfe in the context of nonconvex optimization.. Related Work The classical Frank-Wolfe method (Frank and Wolfe, 956) using line-search was analyzed for smooth convex functions F and polyhedral domains Ω. Here, a convergence rate of O(/ɛ) to ensure F (x) F ɛ was proved without additional conditions (Frank and Wolfe, 956; Jaggi, 03). There have een several recent works on improving the convergence rates under additional assumptions (Garer and Hazan, 05; Lacoste-Julien and Jaggi, 05). More recently, Hazan and Luo (06) proposed stochastic variants of Fw for convex prolems of form (), and showed theoretical improvements over the classical Frank-Wolfe method. The literature on nonconvex Frank-Wolfe is relatively small. The work (Bertsekas, 995) proves asymptotic convergence of Fw to a stationary point; though, no convergence rates are provided. To the est of our knowledge, Yu et al. (04) is the first to provide convergence rates for Fw-type algorithm in the nonconvex setting. Very recently, Lacoste-Julien (06) provided a (non-asymptotic) convergence rate of O(/ɛ ) for nonconvex Fw with adaptive step sizes. However, as we shall see later, implementation of classical Fw for () is expensive (or impossile in the pure stochastic case) since it requires calculation of the gradient F at each iteration. We show that our stochastic variants are provaly faster than the existing Fw methods. In the nonconvex setting, most of the work on stochastic methods focuses on Sgd (Ghadimi and Lan, 03; Ghadimi et al., 04) and analyzes convergence to stationary points. For the finite-sum setting, we uild on recent variance reduction techniques (Johnson and Zhang, 03; Defazio et al., 04; Schmidt et al., 03), which were first proposed for solving unconstrained convex prolems of form (). Projected variants to handle constraints were studied in (Defazio et al., 04; Xiao and Zhang, 04). More recently, Reddi et al. (06a;;c) provided nonconvex variants of these methods
3 Algorithm SFO/IFO Complexity LO Complexity Frank-Wolfe O(n/ɛ ) O ( /ɛ ) Sfw O ( /ɛ 4) O ( /ɛ ) Svfw O(n + n /3 /ɛ ) O(/ɛ ) SagaFw O(n + n /3 /ɛ ) O(/ɛ ) Svfw-S O(/ɛ 0/3 ) O(/ɛ ) SagaFw-S O(/ɛ 8/3 ) O(/ɛ ) Figure : Tale comparing the est SFO/IFO and LO complexity of algorithms discussed in the paper (for the nonconvex setting). Here, Sfw, Svfw-S and SagaFw-S are algorithms for the stochastic setting, while Fw, Svfw and SagaFw are algorithms for the finite-sum setting. The complexity is measured y the numer of oracle calls required to achieve an ɛ-accurate solution (see Section for definitions of SFO/IFO and LO complexity). The complexity of Fw is from (Lacoste-Julien, 06). The results marked in red are contriutions of this paper. For clarity, we hide the dependence of SFO/IFO and LO complexity on the initial point and few parameters related to the function F and domain Ω. that converge provaly faster than oth Sgd and its deterministic counterpart. Preliminaries As stated aove, we study two different prolem settings: () stochastic, where F (x) = E z [f(x, z)] and z is random variale whose distriution P is supported on Ξ R p ; and () finite-sums, where F (x) = n n f i(x). For the stochastic setting, we assume that F is L-smooth, i.e., its gradient is Lipschitz continuous with constant L, so F (x) F (y) L x y, x, y Ω. Here. denotes the l -norm. Furthermore, for the stochastic setting, we also assume the function f is G-Lipschitz i.e., f(x, z) G for all x Ω and z Ξ. Such an assumption is common in the stochastic setting (Ghadimi and Lan, 03; Hazan and Luo, 06). For the finite-sum setting, we assume that the individual functions f i (i [n]) are L-smooth i.e., f i (x) f i (y) L x y, x, y Ω. Note that this implies that the function F is also L-smooth. The domain Ω R d is assumed to e convex and compact with diameter D; i.e., x y D for all x, y Ω. Such an assumption is common to all Frank-Wolfe methods. Convergence criteria. The criterion used for the convergence analysis is important in nonconvex optimization. For unconstrained prolems, the gradient norm F is typically used to measure convergence, ecause F 0 translates into convergence to a stationary point. However, this criterion cannot e used for constrained prolems of the form (). Instead, we use the following quantity, typically referred to as Frank-Wolfe gap: G(x) = max v x, F (x). () v Ω 3
4 For convex functions, the Fw gap provides an upper ound on the suoptimality. For nonconvex functions, the gap G(x) = 0 if and only if x is a stationary point. To state our convergence results we will also need the following ound: β (F (x 0) F (x )) LD, given some (unspecified) initial point x 0 Ω. Oracle model. To compare convergence speed of different algorithms, we use the following lack-ox oracles: Stochastic First-Order Oracle (SFO): For a function F ( ) = E z [f(., z)] where z P, an SFO takes a point x and returns the pair (f(x, z ), f(x, z )) where z is a sample drawn i.i.d. from P (Nemirovski and Yudin, 983). Incremental First-Order Oracle (IFO): For a function F ( ) = n i f i(.), an IFO takes an index i [n] and a point x R d, and returns the pair (f i (x), f i (x)) (Agarwal and Bottou, 04). Linear Optimization Oracle (LO): For a set Ω, an LO takes a direction d and returns arg max v Ω v, d. Throughout the paper, y SFO, IFO and LO complexity of an algorithm, we mean the total numer of SFO, IFO and LO calls made y the algorithm to otain an ɛ-accurate solution, i.e., a solution for which E[G(x)] ɛ; the expectation is over any randomization as part of the algorithm. For clarity of presentation, we hide the dependence of these complexities on the initial point F (x 0 ) F (x ), Lipschitz constant G, and the smoothness constant L; we report the dependence on n to highlight its importance. Classical Fw. To place our results in perspective, we egin y recalling the classical Frank-Wolfe (Fw) algorithm (Frank and Wolfe, 956). Pseudocode for this is presented in Algorithm. Algorithm : Fw ( x 0, T, {γ i } T ) i=0 : Input: x 0 Ω, numer of iterations T, {γ i} T i=0 : for t = 0 to T do 3: Compute v t = arg max v Ω v, F (x t) 4: Compute update direction d t = v t x t 5: x t+ = x t + γ td t 6: end for where γi [0, ] for all i {0,..., T } Each iteration of Fw entails calculation of the gradient F and moving towards a minimizer of a linearized ojective. Notice that calculation of F may not e possile in the stochastic setting of (). Furthermore, even in the finite-sum setting, computing F requires n IFO calls, rendering the approach useless in large-scale prolems, where n is large. For the nonconvex finite-sum setting, the following key result was proved recently (Lacoste-Julien, 06). Theorem (Lacoste-Julien, 06)). Under appropriate selection of step sizes γ t, the IFO and LO complexity of Algorithm to achieve an ɛ-accurate solution in the finite-sum setting are O(n/ɛ ) and O(/ɛ ) respectively. The key aspect of Theorem is the dependence of IFO complexity on n. In particular, when n is large, the IFO complexity O(n/ɛ ) shown y the theorem ecomes prohiitively expensive; thus, undermining the enefits of Fw over competitors like projected Sgd. In the next section, we tackle this drawack and develop faster nonconvex stochastic and variance reduced Fw methods. 4
5 Algorithm : Nonconvex SFW ( x 0, T, {γ i } T i=0, { i} T ) i=0 : Input: x 0 Ω, numer of iterations T, {γ i } T i=0 where γ i [0, ] for all i {0,..., T }, miniatch size { i } T i=0 : for t = 0 to T do 3: Uniformly randomly pick i.i.d samples {z, t..., z t t } according to the distriution P. 4: Compute v t = arg max v Ω v, t t f(x t, z i ) 5: Compute update direction d t = v t x t 6: x t+ = x t + γ t d t 7: end for 8: Output: Iterate x a chosen uniformly random from {x t } T t=0. 3 Algorithms In this section, we descrie Fw algorithms for solving (). In particular, we explore stochastic and variance reduced versions of the classical Fw method, for the stochastic and finite-sum settings, respectively. We defer the discussion on comparison of the convergence rates to Section Stochastic Setting We first investigate the convergence of Fw in the stochastic setting. As mentioned earlier, the classical Fw method (Algorithm ) requires calculation of the full gradient F (x), which is typically impossile to compute in the stochastic setting. For convex prolems, Hazan and Luo (06) tackle this issue y using the popular Roins-Monro approximation (Roins and Monro, 95) to the gradient. We use a variant of the algorithm for our nonconvex stochastic setting, which we call Sfw. The pseudocode of Sfw is listed in Algorithm. Note that the samples {z i } are chosen independently according to the distriution P. Thus, E zi [ f(x, z i )] = F (x), i.e., we otain an uniased estimate of the gradient. Also, note that the output in Algorithm is randomly selected from all the iterates of the algorithm. The key parameters of Sfw are the step sizes {γ i } T i=0 and the miniatch sizes { t }. These parameters must e chosen appropriately in order to ensure convergence of the algorithm (see Theorem ). For our analysis, we assume that the function f is G-Lipschitz i.e., we have max x Ω,z Ξ f(x, z) G. This ound on the gradient is crucial for our convergence analysis. We prove the following key result for nonconvex Sfw. Theorem. Consider the stochastic setting of () where f is G-Lipschitz and F is L-smooth. Then, the output x a of Algorithm with parameters γ t = γ = (F (x0) F (x )) T LD β, t = = T for all t {0,..., T }, satisfies the following ound: E[G(x a )] D ) L(F (x (G + 0) F (x )) T β ( + β), where x is an optimal solution to (stochastic) prolem (). Proof. First oserve the following upper ound: F (x t+ ) F (x t ) + F (x t ), x t+ x t + L x t+ x t = F (x t ) + F (x t ), γ(v t x t ) + L γ(v t x t ) F (x t ) + γ F (x t ), v t x t + LD γ. (3) 5
6 The first inequality follows since F is L-smooth (see Lemma ). The equality is due to the fact that x t+ x t = γ(v t x t ). The second inequality holds ecause v t, x t Ω and ecause the diameter of Ω is D. Next, we introduce the following quantity: ˆv t := arg max v Ω v, F (x t), (4) which is used purely for our analysis and is not part of the algorithm. For revity, we use t to denote f(x t, z t i ). Rewriting inequality (3) using this quantity, we see that F (x t+ ) F (x t ) + γ t, v t x t + γ F (x t ) t, v t x t + LD γ F (x t ) + γ t, ˆv t x t + γ F (x t ) t, v t x t + LD γ = F (x t ) + γ F (x t ), ˆv t x t + γ F (x t ) t, v t ˆv t + LD γ = F (x t ) γg(x t ) + γ F (x t ) t, v t ˆv t + LD γ F (x t ) γg(x t ) + Dγ F (x t ) t + LD γ. (5) The second inequality follows from the optimality of v t in Algorithm, while the third inequality follows from recalling that G(x t ) = max v Ω v x t, F (x t ) = ˆv t x t, F (x t ), which holds due to the optimality of ˆv t in (4). The last inequality follows from Cauchy-Schwarz and the fact that the diameter of the feasile set Ω is ounded y D. Taking expectations and using Lemma in (5) we otain the following important ound: E[F (x t+ )] E[F (x t )] γe[g(x t )] + GDγ + LD γ. Summing over t and telescoping, we then otain the upper-ound T γ t=0 E[G(x t )] F (x 0 ) E[F (x T )] + T GDγ + T LD γ F (x 0 ) F (x ) + T GDγ + T LD γ. The latter inequality follows from the optimality of x. Using the definition of the output x a of Algorithm and the parameters specified in the theorem statement, we get E[G(x a )] F (x 0) F (x ) T γ D T (G + which concludes the proof of the theorem. + GD + LD γ L(F (x 0) F (x )) β ( + β) ), (6) An immediate consequence of Theorem is the following complexity result for Sfw. 6
7 Corollary. Under the setting of Theorem, the SFO complexity and LO complexity of Algorithm are O(/ɛ 4 ) and O(/ɛ ), respectively. Proof. The proof follows upon oserving that O(/ɛ ) miniatch size is required at each iteration of the algorithm, and noting that as per Theorem O(/ɛ ) iterations are required to achieve an ɛ-accurate solution. Note that the SFO and LO complexity of nonconvex Sfw is similar to that of online Fw (Hazan and Kale, 0) and slightly worse than complexity of Sfw for convex prolems (Hazan and Luo, 06). Furthermore, for simplicity of analysis, we used a fixed step size and miniatch size. One can derive an essentially similar result using a decreasing step size and increasing miniatch size. It is important to emphasize that the aove results also apply to the finite-sum setting. In particular, when the distriution P is the empirical measure, then the convergence result in Theorem also provides convergence rates for the finite-sum case. However, as we will see shortly, these convergence rates can e improved significantly y using variance reduction techniques. 3. Finite-sum Setting In this section, we consider the finite-sum setting of (). We show that y uilding on ideas from variance reduction for Sgd, one can significantly improve the convergence rates. The key idea is to use a variance reduced approximation of the gradient (Johnson and Zhang, 03; Defazio et al., 04). We analyze two different algorithms for the finite-sum setting. Our first algorithm (Svfw) is ased on the convex method of (Hazan and Luo, 06) adapted to the nonconvex case. Our second algorithm (SagaFw) is ased on another variance reduction technique called Saga (Defazio et al., 04). Svfw Algorithm Pseudocode of our first method (Svfw) is presented in Algorithm 3. Similar to (Johnson and Zhang, 03) and (Hazan and Luo, 06), nonconvex Svfw is also epoch-ased. At the end of each epoch, the full gradient is computed at the current iterate. This gradient is used for controlling the variance of the stochastic gradients in the inner loop. For epoch size m =, Svfw reduces to the classic Fw algorithm. In general, the epoch size m is chosen such that the total numer of IFO calls per epoch is Θ(n). This ensures that the cost of computing the full gradient at the end of each epoch is amortized. To enale a fair comparison with Sfw, we assume that the total numer of inner iterations across all epochs in Algorithm 3 is T. We prove the following key result for Algorithm 3. For ease of exposition, we assume that the total numer of inner iterations T is a multiple of m. Theorem 3. Consider the finite-sum setting of () where the functions {f i } n are L-smooth. Then, the output x a of Algorithm 3 with parameters γ t = γ = F (x0) F (x ) T LD β and t = = m for all t {0,..., m }, satisfies E[G(x a )] D T β L(F (x0 ) F (x ))( + β), where x is an optimal solution of () and x a is the output of Algorithm 4. Proof. We first analyze the convergence properties of iterates within an epoch. Suppose that the current epoch is s +. For revity, we drop the symol s from x s+ t, x s and g s, whenever it safe to do so given the context. The first part of the proof is similar to that of Theorem. For the sake of completeness, we provide the details here. We again use the quantity ˆv t = arg max v Ω v, F (x t ), as efore, purely for our analysis. 7
8 Algorithm 3: SVFW ( x 0, T, m, {γ i } m i=0, { i} m ) i=0 : Input: x 0 m = x 0 Ω, epoch size m, numer of epochs S = T/m, {γ i } m i=0 where γ i [0, ] for all i {0,..., m }, miniatch size { i } m i=0 : for s = 0 to S do 3: Let x s = x s m 4: Compute g s = F ( x s ) = n n f( xs ) 5: for t = 0 to m do 6: Uniformly randomly (with replacement) select suset I t = {i,..., i t } from [n]. 7: Compute vt s+ = arg max v Ω v, t ( f i (x s+ t ) f i ( x s ) + g s ) 8: Compute update direction d s+ t = vt s+ x s+ t 9: x s+ t+ = xs+ 0: end for : end for t + γ t d s+ t : Output: Iterate x a chosen uniformly random from {{x s+ t } t=0 m }S s=0. For the t th iteration within the epoch s, we have F (x t+ ) F (x t ) + F (x t ), γ(v t x t ) + L γ(v t x t ) F (x t ) + γ F (x t ), v t x t + LD γ. (7) This is due to Lemma and definition of x t+ in Algorithm 3. For revity, we use t to denote t ( f i (x t ) f i ( x) + g). Rewriting, we then otain F (x t+ ) F (x t ) + γ t, v t x t + γ F (x t ) t, v t x t + LD γ F (x t ) + γ t, ˆv t x t + γ F (x t ) t, v t x t + LD γ F (x t ) + γ F (x t ), ˆv t x t + γ F (x t ) t, v t ˆv t + LD γ F (x t ) γg(x t ) + Dγ F (x t ) t + LD γ. (8) The second inequality is due to the optimality of v t in Algorithm 3. The last inequality is due to the definition of G(x t ), the diameter of set Ω, and an application of Cauchy-Schwarz inequality. Note that the aove inequality is similar to (5), except for the crucial difference in the term F (x t ) t (instead of F (x t ) t in (5)). As we shall see shortly, this term has much lower variance, which ultimately leads to faster convergence rates. Taking expectations and using Lemma 3 in inequality (8) we otain the ound E[F (x t+ )] E[F (x t )] γe[g(x t )] To aid further analysis, we introduce the following Lyapunov function: + LDγ E[ x t x ] + LD γ. (9) R t = E[F (x t ) + c t x t x ], 8
9 where c m = 0 and c t = c t+ + (LDγ)/ for all t {0,..., m }. Using the relationship in (9), we see that R t+ = E[F (x t+ ) + c t+ x t+ x ] E[F (x t )] γe[g(x t )] + LDγ E[ x t x ] + LD γ + c t+ E[ x t+ x ] E[F (x t )] γe[g(x t )] + LDγ E[ x t x ] + LD γ + c t+ E[ x t+ x t + x t x ] R t γe[g(x t )] + LD γ + c t+ Dγ. (0) The second inequality follows from the triangle inequality, while the last inequality holds ecause: (a) c t = c t+ + (LDγ)/, and () x t+ x t = γ v t x t Dγ (recall the definition of diameter of Ω). Telescoping over all the iterations within an epoch, we otain m R m R 0 γ E[G(x t )] + LmD γ t=0 m = R 0 γ E[G(x t )] + LmD γ t=0 + Dγ m t= c t + L(m )md γ. () The equality follows from the relationship c t = c t+ + (LDγ)/. Since c m = 0 and x s+ 0 = x s = x s m (in Algorithm 3), from () we otain E[F (x s+ m )] E[F (x s+ m Now telescoping over all epochs, we otain + LmD γ S E[F (x S m)] F (x 0 ) γ m )] γ s=0 + T LD γ t=0 E[G(x s+ t )] + L(m )md γ. m t=0 E[G(x s+ t )] + T L(m )D γ. Rearranging this inequality and using the definition of the output in Algorithm 3, we finally otain E[G(x a )] F (x 0) E[F (x S m)] T γ F (x 0) F (x ) T γ + LD γ + LD γ LD (F (x 0 ) F (x )) ( + β). T β + L(m )D γ The second inequality follows from the optimality of x and ecause = m. The last inequality follows from the choice of γ stated in the theorem. This concludes the proof. 9
10 Algorithm 4: SagaFw ( x 0, T, {γ i } T i=0, { i} T ) i=0 : Input: α0 i = x 0 Ω for all i [n], numer of iterations T, {γ i } T i=0 where γ i [0, ] for all i {0,..., T }, miniatch size { i } T i=0 : Compute g 0 = n n F (αi 0) 3: for t = 0 to T do 4: Uniformly randomly (with replacement) select susets I t, J t from [n] of size t. 5: Compute v t = arg max v Ω v, t ( f i (x t ) f i (αt) i + g t ) 6: Compute update direction d t = v t x t 7: x t+ = x t + γ t d t 8: α j t+ = x t for j J t and α j t+ = αj t for j / J t 9: g t+ = g t n j J t ( f j (αt) j f j (α j t+ )) 0: end for : Output: Iterate x a chosen uniformly random from {x t } T t=0. The analysis suggests that the value of m should e set appropriately in Theorem 3 to otain good convergence rates. If m is small, the IFO complexity of Algorithm 3 is dominated y the step involving calculation of the full gradient at the end of each epoch. On the other hand, if m is large, a large miniatch is used in each step of the algorithm (since = m ), which increases the IFO complexity. With this intuition, we present following important corollary. Corollary. Under the setting of Theorem 3 and with m = n /3, the IFO complexity and LO complexity of Algorithm are O(n + n /3 /ɛ ) and O(/ɛ ), respectively. Proof. We first oserve that the total numer of IFO calls for an epoch (including those required for calculating the full gradient) is Θ(m 3 + n). Since m = n /3, the total amortized IFO complexity of one iteration within an epoch is O(m ) = O(n /3 ). Therefore, the IFO complexity is O(n + n /3 /ɛ ). Further, since each inner iteration requires O() LO calls, the LO complexity is O(/ɛ ). SagaFw Algorithm Svfw is a semi-stochastic algorithm since it requires calculation of the full gradient at the end of each epoch. Below we propose a purely incremental method (SagaFw) ased on the Saga algorithm of (Defazio et al., 04). The pseudocode for SagaFw is presented in Algorithm 4. A key feature of SagaFw is that it entirely avoids calculation of full gradients. Instead, it updates the average gradient vector g t at each iteration. This update requires maintaining additional vectors α i (i [n]), and in the worst case such a strategy incurs additional storage cost of O(nd). However, this cost can e reduced to O(n) in several practical cases (refer to (Defazio et al., 04; Reddi et al., 06)). For SagaFw, we prove the following key result. Theorem 4. Consider the finite-sum setting of () where functions {f i } n are L-smooth. Define θ(, n, T ) = / + (n 3/ /T 3/ ). Then the output x a of Algorithm 4 with parameters γ t = γ = F (x0) F (x ) T LD θ(,n,t )β and t = n for all t {0,..., T }, satisfies the following: E[G(x a )] D T β Lθ(, n, T )(F (x0 ) F (x ))( + β), where x is an optimal solution of prolem () and x a is the output of Algorithm 4. 0
11 Proof. We use the following quantities in our analysis: ˇ t = ( fi (x t ) f i (α i ) t) + g t t ˆv t = arg max v Ω v, F (x t). The first part of our proof is similar to that of Theorem 3. until (8), we have E[F (x t+ )] Using essentially the same argument F (x t ) γg(x t ) + Dγ F (x t ) ˇ t + LD γ F (x t ) γg(x t ) + LDγ n n E x t α i n t + LD γ. () The second inequality is due to Lemma 4. Next, we define the following Lyapunov function: n R t = E[F (x t )] + c t E x t α i n t, where c T = 0 and c t = ( ρ)c t+ + (LDγ n)/ for all t {0,..., T }, where ρ is the proaility ( /n) of an index i eing in J t. We can ound ρ from elow as ρ = ( n) +(/n) = /n +/n n, (3) where the first inequality follows from ( y) r /( + ry) (which holds for y [0, ] and r ), while the second inequality holds ecause n. Now oserve the following: n E x t+ α n t+ i = n E [ ρ x t+ x t + ( ρ) x t+ α i n t ] n = n n [ E ρ x t+ x t ] + ( ρ)( x t+ x t + x t αt ) i n E [ x t+ x t + ( ρ)e x t αt i ] (4) The first equality follows from the definition of α i t+ in Algorithm 4, while the inequality is just the triangle inequality. Using the aove relationship and the ound in (), we otain R t+ E[F (x t )] γe[g(x t )] + LD γ + LDγ n n n E[ x t αt ] i + c t+ E[ x t+ x t ] + c t+ ( ρ) n n E[ x t αt ] i R t γe[g(x t )] + LD γ + c t+ Dγ. (5)
12 The second inequality holds ecause: (a) c t = ( ρ)c t+ + (LDγ n)/, and () x t+ x t = γ v t x t Dγ (due to our ound on the diameter of the set Ω). Telescoping over all the iterations, we see that T R T R 0 γ E[G(x t )] + T LD γ t=0 T R 0 γ E[G(x t )] + T LD γ t=0 T R 0 γ E[G(x t )] + T LD γ t=0 + Dγ T t= + LD γ n ρ c t + LD γ n 3/ 3/. The second inequality follows form the fact that T t= c t LDγ n/(ρ ). This can, in turn, e otained from the recursion c t = ( ρ)c t+ + (LDγ n)/ and c T = 0. The third inequality is due to the ound on ρ in (3). Rearranging the aove inequality and using the definition of x a from Algorithm 4, we finally otain the ound E[G(x a )] F (x 0) E[F (x T )] T γ F (x 0) F (x ) T γ + LD γ + LD γθ(, n, T ). + LD γn 3/ T 3/ The first inequality uses the fact that c T = 0 and α i 0 = x 0 (in Algorithm 4). The second inequality uses the optimality of x and the definition of θ(, n, T ). Using the setting of γ in the theorem statement, we otain the desired result. Corollary 3. Assume T n. Under the settings of Theorem 4 and with = n /3, the IFO and LO complexity of Algorithm are O(n + n /3 /ɛ ) and O(/ɛ ), respectively. Proof. First, oserve that for T n and = n /3, θ(, n, T ) 5/ in Theorem 4. Thus, the IFO complexity is O(n + n /3 /ɛ ). Furthermore, since each iteration requires just O() LO calls, the LO complexity is O(/ɛ ). Notaly, the IFO complexity of SagaFw is lower than that of Svfw. Moreover, if T n 3/ and =, then we have θ(, n, T ) 5/, in which case the IFO complexity is O(n 3/ + /ɛ ). 4 Variance Reduction in Stochastic Setting In this section, we improve the convergence rates in the stochastic setting using variance reduction techniques. The key idea is to first otain samples {z i } are chosen independently according to the distriution P and then use Svfw or SagaFw, descried in this paper, on the finite-sum prolem over these samples. The pseudocode for the Svfw and SagaFw variants for stochastic setting (Svfw-S and SagaFw-S respectively) are provided in Figure. The following is the key result regarding the convergence rates of Svfw-S and SagaFw-S. Theorem 5. Consider the stochastic setting of () where f is G-Lipschitz and f(., z) is L-smooth F (x0) F (x ) T LD β for all z Ξ. Suppose B = T and γ = of Svfw-S and SagaFw-S satisfy the following: E[G(x a )] (for Svfw-S and SagaFw-S). Then the output D T β L(F (x0 ) F (x ))( + β) + GD T (6)
13 Svfw-S:(x 0, T, B, γ) Randomly sample z,, z B P Let finite-sum ˆF (x) = B B f(x, z i) Output Svfw(x 0, T, B /3, γ, B /3 ) applied on the function ˆF SagaFw-S:(x 0, T, B, γ) Randomly sample z,, z B P Let finite-sum ˆF (x) = B B f(x, z i) Output SagaFw(x 0, T, γ, 3B /3 ) applied on the function ˆF Figure : Svfw-S and SagaFw-S variants for the stochastic setting. Proof. Consider the finite-sum ˆF (x) = B B f(x, z i) where z,, z B P. We use the following notation: Ĝ(x) = max v Ω v x, ˆF (x). Let v t s+ = arg max v Ω v x s+ t, F (x s+ t ) and ˆv t s+ = arg max v Ω v x s+ t, ˆF (x s+ t ). We first oserve the following key relationship for Svfw: E[G(x s+ t ) Ĝ(xs+ t )] = E[ v t s+ x s+ t, F (x s+ t ) ] E[ ˆv t s+ x s+ t, ˆF (x s+ t ) ] E[ v s+ t x s+ t, F (x s+ t ) ] E[ v s+ t x s+ t, ˆF (x s+ t ) ] E[ v t s+ x s+ t, ˆF (x s+ t ) F (x s+ t ) ] DE[ ˆF (x s+ t ) F (x s+ t ) ] GD. T The first inequality is due to the optimality of ˆv t s+. The third inequality follows from Cauchy-Schwarz inequality. The last inequality is due to Lemma. Adding the aove inequality across all the iterations and epochs, we get: E[G(x a )] E[Ĝ(x a)] + GD T. Using the ound on E[Ĝ(x a)] in Theorem 3 (here, recall we are using Svfw on ˆF ) in the aove inequality, we get the desired result. The proof for SagaFw-S is similar. The following corollary on the complexity of Svfw-S and SagaFw-S is immediate consequence of the aove result. Corollary 4. Under the setting of Theorem 5, the SFO complexity of Svfw-S and SagaFw-S (in Figure ) are O(/ɛ 0/3 ) and O(/ɛ 8/3 ), respectively. The LO complexity of oth Svfw-S and SagaFw-S is O(/ɛ ). Proof. The proof follows from the fact that B = T, = B /3 (in Svfw-S) and = 3B /3 (in SagaFw-S) and IFO complexities of Svfw and SagaFw given in Corollary and 3 respectively. By comparing Corollary 4 with Corollary, we see that Svfw-S and SagaFw-S have etter SFO complexity than Sfw. 5 Discussion It is important to remark on the complexity results derived in this paper. For the stochastic setting, we showed that the SFO and LO complexity of Sfw are O(/ɛ 4 ) and O(/ɛ ), respectively. At first glance, these rates might appear worse than those otained for nonconvex Sgd (see (Ghadimi and Lan, 03)). However, it is important to note that the convergence criterion used in our paper is 3
14 different from the one used in (Ghadimi and Lan, 03). It is an important piece of future work to understand the precise relationship etween these convergence criteria. Furthermore, the convergence rates in this paper are similar to those otained for online Frank-Wolfe (Hazan and Kale, 0) and only slightly worse than those otained for stochastic Frank-Wolfe in the convex setting (Hazan and Luo, 06). We, further, improved the convergence rate of Sfw y using variance reduction ideas in the stochastic setting (Svfw-S and SagaFw-S algorithms in Section 4). Understanding the tightness of these rates is an interesting open prolem left as future work. For the finite-sum setting, while the complexity results of Sfw still hold, we otained significantly faster convergence rates y using variance reduction techniques. The dependence of IFO and LO complexity of nonconvex Svfw and SagaFw, on ɛ is O(/ɛ ), which matches the classical Frank-Wolfe algorithm (Lacoste-Julien, 06). However, Svfw and SagaFw exhiit a much weaker dependence on n than Fw; wherein, they are provaly faster than the classical Frank-Wolfe y a factor of n /3 and n /3, respectively. Similar (ut not same) enefits have also een reported for nonconvex Svrg and Saga over gradient descent (Reddi et al., 06a;). Interestingly, there appears to e a gap etween the convergence rates of Svfw and SagaFw. Whether this gap is an artifact of our analysis or has deeper reasons remains open. We conclude with a remark on a sutle point regarding the step size γ. The step size γ in Theorems, 3, and 4 requires knowledge of parameters like L, D and F (x) F (x ). Typically, an estimate of the these values suffices in practice. In asence of such knowledge, one can completely eliminate this dependence of γ on these parameters y simply choosing β = (F (x0) F (x )) LD. Fortunately, this comes at the cost of only slightly worse constants in the convergence rate. References A. Agarwal and L. Bottou. A Lower Bound for the Optimization of Finite Sums. arxiv:40.073, 04. D. Bertsekas. Nonlinear Programming. Athena Scientific, 995. ISBN M. Collins, A. Gloerson, T. Koo, X. Carreras, and P. L. Bartlett. Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks. JMLR, 9:775 8, 008. A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A Fast Incremental Gradient Method with Support for Non-Strongly Convex Composite Ojectives. In NIPS 7, pages M. Frank and P. Wolfe. An Algorithm for Quadratic Programming. Naval Research Logistics Quarterly, 3(-):95 0, March 956. S. Fujishige and S. Isotani. A Sumodular Function Minimization Algorithm ased on the Minimumnorm Base. Pacific Journal of Optimization, 7():3 7, 0. D. Garer and E. Hazan. Faster Rates for the Frank-Wolfe Method over Strongly-Convex Sets. In Proceedings of the 3nd International Conference on Machine Learning, ICML 05, Lille, France, 6- July 05, pages , 05. S. Ghadimi and G. Lan. Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming. SIAM Journal on Optimization, 3(4):34 368, 03. doi: 0.37/ S. Ghadimi, G. Lan, and H. Zhang. Mini-atch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization. Mathematical Programming, 55(-):67 305, Decemer 04. Z. Harchaoui, A. Juditsky, and A. Nemirovski. Conditional Gradient Algorithms for Norm-Regularized Smooth Convex Optimization. Mathematical Programming, 5(-):75, apr 04. 4
15 E. Hazan and S. Kale. Projection-free Online Learning. In Proceedings of the 9th International Conference on Machine Learning, ICML 0, Edinurgh, Scotland, UK, June 6 - July, 0, 0. E. Hazan and H. Luo. Variance-Reduced and Projection-Free Stochastic Optimization. CoRR, as/60.00, 06. M. Jaggi. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In ICML 3, pages , 03. R. Johnson and T. Zhang. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction. In NIPS 6, pages S. Lacoste-Julien. Convergence Rate of Frank-Wolfe for Non-Convex Ojectives. as/ , 06. S. Lacoste-Julien and M. Jaggi. On the Gloal Linear Convergence of Frank-Wolfe Optimization Variants. In Advances in Neural Information Processing Systems 8, pages , 05. A. Nemirovski and D. Yudin. Prolem Complexity and Method Efficiency in Optimization. John Wiley and Sons, 983. Y. Nesterov. Introductory Lectures On Convex Optimization: A Basic Course. Springer, 003. S. J. Reddi, A. Hefny, S. Sra, B. Póczos, and A. J. Smola. Stochastic Variance Reduction for Nonconvex Optimization. In Proceedings of the 33nd International Conference on Machine Learning, ICML 06, New York City, NY, USA, June 9-4, 06, pages 34 33, 06a. S. J. Reddi, S. Sra, B. Póczos, and A. J. Smola. Fast Incremental Method for Nonconvex Optimization. arxiv: , 06. S. J. Reddi, S. Sra, B. Póczos, and A. J. Smola. Fast Stochastic Methods for Nonsmooth Nonconvex Optimization. arxiv: , 06c. H. Roins and S. Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, (3): , Sep 95. M. W. Schmidt, N. L. Roux, and F. R. Bach. Minimizing Finite Sums with the Stochastic Average Gradient. arxiv: , 03. L. Xiao and T. Zhang. A Proximal Stochastic Gradient Method with Progressive Variance Reduction. SIAM Journal on Optimization, 4(4): , 04. Y. Yu, X. Zhang, and D. Schuurmans. Generalized Conditional Gradient for Sparse Estimation. as/arxiv:40.488, 04. Appendix The following ound on the value of functions with Lipschitz continuous gradients is classical (see e.g., (Nesterov, 003)). Lemma. If f : R d R is L-smooth, then for all x, y R d. f(x) f(y) + f(y), x y + L x y, 5
16 The following lemma is useful for ounding the variance of the gradient estimate used in the stochastic setting. Lemma. Suppose the function F (x) = E z [f(x, z)] where z is a random variale with distriution P and support Ξ, and max z Ξ f(x, z) G for all x Ω. Also, let x = f(x, z i ) where {z i } are i.i.d. samples from the distriution P. Then, the following holds for any x Ω: E[ x F (x) ] G. Proof. The proof follows from a simple application of Lemma 5 and Jensen s inequality. The following result is useful for ounding the variance of the updates of Svfw and follows from a slight modification of a result in (Reddi et al., 06a). We give the proof here for completeness. Lemma 3 (Reddi et al., 06a)). Let t = t ( f i (x s+ t ) f i ( x s ) + g s ) in Algorithm 3. For the iterates x s+ t and x s where t {0,..., m } and s {0,..., S } in Algorithm 3, the following inequality holds: Proof. For the ease of exposition, we first define E It [ F (x s+ t ) t ] L x s+ t x s. t ζ s+ t = I t Using this notation, we then otain the following: E It [ F (x s+ t ) t ] ( fi (x s+ t ) f i ( x s ) ). = E It [ ζt s+ + F ( x s ) F (x s+ t ) ] = E It [ ζt s+ E It [ζt s+ ] ] = E It t ( fi (x s+ t ) f i ( x s ) E It [ζ s+ t ] ). The second equality is due to the fact that E It [ζt s+ ] = F (x s+ t ) F ( x s ). From the aove relationship, we get E It [ F (x s+ t ) t ] [ ] E It f i (x s+ t ) f i ( x s ) E It [ζt s+ ] t t E It [ f i (x s+ t ) f i ( x s ) ] L t x s+ t x s. The first inequality follows from Lemma 5. The second inequality is due to the fact that for a random variale ζ, E[ ζ E[ζ] ] E[ ζ ]. The last inequality follows from L-smoothness of f i. The result follows from a simple application of Jensen s inequality to the inequality aove. 6
17 The following result is important for ounding the variance in SagaFw. The key difference from Lemma 3 is that the variance term in SagaFw involves αt. i Again, we provide the proof for completeness. Lemma 4. Let ˇ t = t ( f i (x t ) f i (αt) i + g t ) in Algorithm 4. {αt} i n where t {0,..., T } in Algorithm 4, we have the inequality For the iterates x t, v t and E It [ F (x t ) ˇ t ] Proof. As efore we first define the quantity ζ t = I t L t n n x t α i t. ( fi (x t ) f i (αt) i ). With this notation, we then otain the equality E It [ F (x t) ˇ t ] [ = E It ζ t + n ] f i(αt) i F (x t) = E It [ ζ t E It [ζ t] ] n [ = ( ) ] EI t f i(x t) f i(α i t) E It [ζ t]. The second equality follows from the fact that E It [ζ t ] = F (x t ) n n f i(α i t). From the aove inequality, we get the following ound: E It [ F (x t) ˇ t ] t E It [ f i(x t) f i(α i t) E It [ζ t] ] t E It [ f i(x t) f i(α i t) ] L n t n x t αt i. The first inequality is due to Lemma 5, while the second inequality holds ecause for a random variale ζ, E[ ζ E[ζ] ] E[ ζ ]. The last inequality is from L-smoothness of f i (i [n]) and uniform randomness of the set I t. By applying Jensen s inequality, we get the desired result. Lemma 5. For random variales z,..., z r that are independent and have mean 0, we have Proof. Expanding the left hand side we have E [ z z r ] = E [ z z r ]. E [ z z r ] = r i,j= the second equality here follows from the our hypothesis. [ r E [z i z j ] = E z i ] ; 7
arxiv: v1 [math.oc] 1 Jul 2016
Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the
More informationStochastic Variance Reduction for Nonconvex Optimization. Barnabás Póczos
1 Stochastic Variance Reduction for Nonconvex Optimization Barnabás Póczos Contents 2 Stochastic Variance Reduction for Nonconvex Optimization Joint work with Sashank Reddi, Ahmed Hefny, Suvrit Sra, and
More informationImproved Optimization of Finite Sums with Minibatch Stochastic Variance Reduced Proximal Iterations
Improved Optimization of Finite Sums with Miniatch Stochastic Variance Reduced Proximal Iterations Jialei Wang University of Chicago Tong Zhang Tencent AI La Astract jialei@uchicago.edu tongzhang@tongzhang-ml.org
More informationStochastic and online algorithms
Stochastic and online algorithms stochastic gradient method online optimization and dual averaging method minimizing finite average Stochastic and online optimization 6 1 Stochastic optimization problem
More informationFinite-sum Composition Optimization via Variance Reduced Gradient Descent
Finite-sum Composition Optimization via Variance Reduced Gradient Descent Xiangru Lian Mengdi Wang Ji Liu University of Rochester Princeton University University of Rochester xiangru@yandex.com mengdiw@princeton.edu
More informationOptimization for Machine Learning
Optimization for Machine Learning (Lecture 3-A - Convex) SUVRIT SRA Massachusetts Institute of Technology Special thanks: Francis Bach (INRIA, ENS) (for sharing this material, and permitting its use) MPI-IS
More informationOptimization for Machine Learning
Optimization for Machine Learning (Problems; Algorithms - A) SUVRIT SRA Massachusetts Institute of Technology PKU Summer School on Data Science (July 2017) Course materials http://suvrit.de/teaching.html
More informationOptimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method
Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method Davood Hajinezhad Iowa State University Davood Hajinezhad Optimizing Nonconvex Finite Sums by a Proximal Primal-Dual Method 1 / 35 Co-Authors
More informationSVRG++ with Non-uniform Sampling
SVRG++ with Non-uniform Sampling Tamás Kern András György Department of Electrical and Electronic Engineering Imperial College London, London, UK, SW7 2BT {tamas.kern15,a.gyorgy}@imperial.ac.uk Abstract
More informationModern Stochastic Methods. Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization
Modern Stochastic Methods Ryan Tibshirani (notes by Sashank Reddi and Ryan Tibshirani) Convex Optimization 10-725 Last time: conditional gradient method For the problem min x f(x) subject to x C where
More informationVariance-Reduced and Projection-Free Stochastic Optimization
Elad Hazan Princeton University, Princeton, NJ 08540, USA Haipeng Luo Princeton University, Princeton, NJ 08540, USA EHAZAN@CS.PRINCETON.EDU HAIPENGL@CS.PRINCETON.EDU Abstract The Frank-Wolfe optimization
More informationMinimizing Finite Sums with the Stochastic Average Gradient Algorithm
Minimizing Finite Sums with the Stochastic Average Gradient Algorithm Joint work with Nicolas Le Roux and Francis Bach University of British Columbia Context: Machine Learning for Big Data Large-scale
More informationComparison of Modern Stochastic Optimization Algorithms
Comparison of Modern Stochastic Optimization Algorithms George Papamakarios December 214 Abstract Gradient-based optimization methods are popular in machine learning applications. In large-scale problems,
More informationStochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure
Stochastic Optimization with Variance Reduction for Infinite Datasets with Finite Sum Structure Alberto Bietti Julien Mairal Inria Grenoble (Thoth) March 21, 2017 Alberto Bietti Stochastic MISO March 21,
More informationarxiv: v2 [cs.lg] 14 Sep 2017
Elad Hazan Princeton University, Princeton, NJ 08540, USA Haipeng Luo Princeton University, Princeton, NJ 08540, USA EHAZAN@CS.PRINCETON.EDU HAIPENGL@CS.PRINCETON.EDU arxiv:160.0101v [cs.lg] 14 Sep 017
More informationStochastic Optimization
Introduction Related Work SGD Epoch-GD LM A DA NANJING UNIVERSITY Lijun Zhang Nanjing University, China May 26, 2017 Introduction Related Work SGD Epoch-GD Outline 1 Introduction 2 Related Work 3 Stochastic
More informationOn Acceleration with Noise-Corrupted Gradients. + m k 1 (x). By the definition of Bregman divergence:
A Omitted Proofs from Section 3 Proof of Lemma 3 Let m x) = a i On Acceleration with Noise-Corrupted Gradients fxi ), u x i D ψ u, x 0 ) denote the function under the minimum in the lower bound By Proposition
More informationStochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization
Stochastic Dual Coordinate Ascent Methods for Regularized Loss Minimization Shai Shalev-Shwartz and Tong Zhang School of CS and Engineering, The Hebrew University of Jerusalem Optimization for Machine
More informationarxiv: v4 [math.oc] 5 Jan 2016
Restarted SGD: Beating SGD without Smoothness and/or Strong Convexity arxiv:151.03107v4 [math.oc] 5 Jan 016 Tianbao Yang, Qihang Lin Department of Computer Science Department of Management Sciences The
More informationStochastic gradient descent and robustness to ill-conditioning
Stochastic gradient descent and robustness to ill-conditioning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,
More informationProximal Minimization by Incremental Surrogate Optimization (MISO)
Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants) Julien Mairal Inria, Grenoble ICCOPT, Tokyo, 2016 Julien Mairal, Inria MISO 1/26 Motivation: large-scale machine
More informationBarzilai-Borwein Step Size for Stochastic Gradient Descent
Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan Shiqian Ma Yu-Hong Dai Yuqiu Qian May 16, 2016 Abstract One of the major issues in stochastic gradient descent (SGD) methods is how
More informationOracle Complexity of Second-Order Methods for Smooth Convex Optimization
racle Complexity of Second-rder Methods for Smooth Convex ptimization Yossi Arjevani had Shamir Ron Shiff Weizmann Institute of Science Rehovot 7610001 Israel Abstract yossi.arjevani@weizmann.ac.il ohad.shamir@weizmann.ac.il
More informationConvex Optimization Lecture 16
Convex Optimization Lecture 16 Today: Projected Gradient Descent Conditional Gradient Descent Stochastic Gradient Descent Random Coordinate Descent Recall: Gradient Descent (Steepest Descent w.r.t Euclidean
More informationStochastic Gradient Descent with Variance Reduction
Stochastic Gradient Descent with Variance Reduction Rie Johnson, Tong Zhang Presenter: Jiawen Yao March 17, 2015 Rie Johnson, Tong Zhang Presenter: JiawenStochastic Yao Gradient Descent with Variance Reduction
More informationLinear Convergence under the Polyak-Łojasiewicz Inequality
Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini and Mark Schmidt The University of British Columbia LCI Forum February 28 th, 2017 1 / 17 Linear Convergence of Gradient-Based
More informationFast Stochastic Optimization Algorithms for ML
Fast Stochastic Optimization Algorithms for ML Aaditya Ramdas April 20, 205 This lecture is about efficient algorithms for minimizing finite sums min w R d n i= f i (w) or min w R d n f i (w) + λ 2 w 2
More informationA Sparsity Preserving Stochastic Gradient Method for Composite Optimization
A Sparsity Preserving Stochastic Gradient Method for Composite Optimization Qihang Lin Xi Chen Javier Peña April 3, 11 Abstract We propose new stochastic gradient algorithms for solving convex composite
More informationLinear Convergence under the Polyak-Łojasiewicz Inequality
Linear Convergence under the Polyak-Łojasiewicz Inequality Hamed Karimi, Julie Nutini, Mark Schmidt University of British Columbia Linear of Convergence of Gradient-Based Methods Fitting most machine learning
More informationLarge-scale Stochastic Optimization
Large-scale Stochastic Optimization 11-741/641/441 (Spring 2016) Hanxiao Liu hanxiaol@cs.cmu.edu March 24, 2016 1 / 22 Outline 1. Gradient Descent (GD) 2. Stochastic Gradient Descent (SGD) Formulation
More informationCoordinate Descent and Ascent Methods
Coordinate Descent and Ascent Methods Julie Nutini Machine Learning Reading Group November 3 rd, 2015 1 / 22 Projected-Gradient Methods Motivation Rewrite non-smooth problem as smooth constrained problem:
More informationMaking Gradient Descent Optimal for Strongly Convex Stochastic Optimization
Making Gradient Descent Optimal for Strongly Convex Stochastic Optimization Alexander Rakhlin University of Pennsylvania Ohad Shamir Microsoft Research New England Karthik Sridharan University of Pennsylvania
More informationA Parallel SGD method with Strong Convergence
A Parallel SGD method with Strong Convergence Dhruv Mahajan Microsoft Research India dhrumaha@microsoft.com S. Sundararajan Microsoft Research India ssrajan@microsoft.com S. Sathiya Keerthi Microsoft Corporation,
More informationThird-order Smoothness Helps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima
Third-order Smoothness elps: Even Faster Stochastic Optimization Algorithms for Finding Local Minima Yaodong Yu and Pan Xu and Quanquan Gu arxiv:171.06585v1 [math.oc] 18 Dec 017 Abstract We propose stochastic
More informationECS289: Scalable Machine Learning
ECS289: Scalable Machine Learning Cho-Jui Hsieh UC Davis Sept 29, 2016 Outline Convex vs Nonconvex Functions Coordinate Descent Gradient Descent Newton s method Stochastic Gradient Descent Numerical Optimization
More informationAsynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization
Proceedings of the hirty-first AAAI Conference on Artificial Intelligence (AAAI-7) Asynchronous Mini-Batch Gradient Descent with Variance Reduction for Non-Convex Optimization Zhouyuan Huo Dept. of Computer
More informationBeyond stochastic gradient descent for large-scale machine learning
Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint work with Aymeric Dieuleveut, Nicolas Flammarion,
More informationStochastic Gradient Descent. Ryan Tibshirani Convex Optimization
Stochastic Gradient Descent Ryan Tibshirani Convex Optimization 10-725 Last time: proximal gradient descent Consider the problem min x g(x) + h(x) with g, h convex, g differentiable, and h simple in so
More informationOptimal Regularized Dual Averaging Methods for Stochastic Optimization
Optimal Regularized Dual Averaging Methods for Stochastic Optimization Xi Chen Machine Learning Department Carnegie Mellon University xichen@cs.cmu.edu Qihang Lin Javier Peña Tepper School of Business
More informationOptimization for Machine Learning (Lecture 3-B - Nonconvex)
Optimization for Machine Learning (Lecture 3-B - Nonconvex) SUVRIT SRA Massachusetts Institute of Technology MPI-IS Tübingen Machine Learning Summer School, June 2017 ml.mit.edu Nonconvex problems are
More informationLearning with stochastic proximal gradient
Learning with stochastic proximal gradient Lorenzo Rosasco DIBRIS, Università di Genova Via Dodecaneso, 35 16146 Genova, Italy lrosasco@mit.edu Silvia Villa, Băng Công Vũ Laboratory for Computational and
More informationMini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization
Noname manuscript No. (will be inserted by the editor) Mini-batch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization Saeed Ghadimi Guanghui Lan Hongchao Zhang the date of
More informationContents. 1 Introduction. 1.1 History of Optimization ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016
ALG-ML SEMINAR LISSA: LINEAR TIME SECOND-ORDER STOCHASTIC ALGORITHM FEBRUARY 23, 2016 LECTURERS: NAMAN AGARWAL AND BRIAN BULLINS SCRIBE: KIRAN VODRAHALLI Contents 1 Introduction 1 1.1 History of Optimization.....................................
More informationProvable Non-Convex Min-Max Optimization
Provable Non-Convex Min-Max Optimization Mingrui Liu, Hassan Rafique, Qihang Lin, Tianbao Yang Department of Computer Science, The University of Iowa, Iowa City, IA, 52242 Department of Mathematics, The
More informationTrade-Offs in Distributed Learning and Optimization
Trade-Offs in Distributed Learning and Optimization Ohad Shamir Weizmann Institute of Science Includes joint works with Yossi Arjevani, Nathan Srebro and Tong Zhang IHES Workshop March 2016 Distributed
More informationStochastic Quasi-Newton Methods
Stochastic Quasi-Newton Methods Donald Goldfarb Department of IEOR Columbia University UCLA Distinguished Lecture Series May 17-19, 2016 1 / 35 Outline Stochastic Approximation Stochastic Gradient Descent
More informationAccelerating Stochastic Optimization
Accelerating Stochastic Optimization Shai Shalev-Shwartz School of CS and Engineering, The Hebrew University of Jerusalem and Mobileye Master Class at Tel-Aviv, Tel-Aviv University, November 2014 Shalev-Shwartz
More informationLecture: Adaptive Filtering
ECE 830 Spring 2013 Statistical Signal Processing instructors: K. Jamieson and R. Nowak Lecture: Adaptive Filtering Adaptive filters are commonly used for online filtering of signals. The goal is to estimate
More informationLinearly-Convergent Stochastic-Gradient Methods
Linearly-Convergent Stochastic-Gradient Methods Joint work with Francis Bach, Michael Friedlander, Nicolas Le Roux INRIA - SIERRA Project - Team Laboratoire d Informatique de l École Normale Supérieure
More informationRelative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent
Relative-Continuity for Non-Lipschitz Non-Smooth Convex Optimization using Stochastic (or Deterministic) Mirror Descent Haihao Lu August 3, 08 Abstract The usual approach to developing and analyzing first-order
More informationBarzilai-Borwein Step Size for Stochastic Gradient Descent
Barzilai-Borwein Step Size for Stochastic Gradient Descent Conghui Tan The Chinese University of Hong Kong chtan@se.cuhk.edu.hk Shiqian Ma The Chinese University of Hong Kong sqma@se.cuhk.edu.hk Yu-Hong
More informationConditional Gradient (Frank-Wolfe) Method
Conditional Gradient (Frank-Wolfe) Method Lecturer: Aarti Singh Co-instructor: Pradeep Ravikumar Convex Optimization 10-725/36-725 1 Outline Today: Conditional gradient method Convergence analysis Properties
More informationmin f(x). (2.1) Objectives consisting of a smooth convex term plus a nonconvex regularization term;
Chapter 2 Gradient Methods The gradient method forms the foundation of all of the schemes studied in this book. We will provide several complementary perspectives on this algorithm that highlight the many
More information1 Hoeffding s Inequality
Proailistic Method: Hoeffding s Inequality and Differential Privacy Lecturer: Huert Chan Date: 27 May 22 Hoeffding s Inequality. Approximate Counting y Random Sampling Suppose there is a ag containing
More informationBeyond stochastic gradient descent for large-scale machine learning
Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - CAP, July
More informationSTA141C: Big Data & High Performance Statistical Computing
STA141C: Big Data & High Performance Statistical Computing Lecture 8: Optimization Cho-Jui Hsieh UC Davis May 9, 2017 Optimization Numerical Optimization Numerical Optimization: min X f (X ) Can be applied
More informationComplexity bounds for primal-dual methods minimizing the model of objective function
Complexity bounds for primal-dual methods minimizing the model of objective function Yu. Nesterov July 4, 06 Abstract We provide Frank-Wolfe ( Conditional Gradients method with a convergence analysis allowing
More informationarxiv: v1 [cs.lg] 15 Jun 2016
A Class of Parallel Douly Stochastic Algorithms for Large-Scale Learning arxiv:1606.04991v1 [cs.lg] 15 Jun 2016 Aryan Mokhtari Alec Koppel Alejandro Rieiro Department of Electrical and Systems Engineering
More informationFrank-Wolfe Method. Ryan Tibshirani Convex Optimization
Frank-Wolfe Method Ryan Tibshirani Convex Optimization 10-725 Last time: ADMM For the problem min x,z f(x) + g(z) subject to Ax + Bz = c we form augmented Lagrangian (scaled form): L ρ (x, z, w) = f(x)
More informationarxiv: v1 [math.oc] 18 Mar 2016
Katyusha: Accelerated Variance Reduction for Faster SGD Zeyuan Allen-Zhu zeyuan@csail.mit.edu Princeton University arxiv:1603.05953v1 [math.oc] 18 Mar 016 March 18, 016 Abstract We consider minimizing
More informationStochastic Gradient Descent with Only One Projection
Stochastic Gradient Descent with Only One Projection Mehrdad Mahdavi, ianbao Yang, Rong Jin, Shenghuo Zhu, and Jinfeng Yi Dept. of Computer Science and Engineering, Michigan State University, MI, USA Machine
More informationNOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS. 1. Introduction. We consider first-order methods for smooth, unconstrained
NOTES ON FIRST-ORDER METHODS FOR MINIMIZING SMOOTH FUNCTIONS 1. Introduction. We consider first-order methods for smooth, unconstrained optimization: (1.1) minimize f(x), x R n where f : R n R. We assume
More informationBeyond stochastic gradient descent for large-scale machine learning
Beyond stochastic gradient descent for large-scale machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines - October 2014 Big data revolution? A new
More informationSADAGRAD: Strongly Adaptive Stochastic Gradient Methods
Zaiyi Chen * Yi Xu * Enhong Chen Tianbao Yang Abstract Although the convergence rates of existing variants of ADAGRAD have a better dependence on the number of iterations under the strong convexity condition,
More informationA Greedy Framework for First-Order Optimization
A Greedy Framework for First-Order Optimization Jacob Steinhardt Department of Computer Science Stanford University Stanford, CA 94305 jsteinhardt@cs.stanford.edu Jonathan Huggins Department of EECS Massachusetts
More informationBig Data Analytics: Optimization and Randomization
Big Data Analytics: Optimization and Randomization Tianbao Yang Tutorial@ACML 2015 Hong Kong Department of Computer Science, The University of Iowa, IA, USA Nov. 20, 2015 Yang Tutorial for ACML 15 Nov.
More informationOn Nesterov s Random Coordinate Descent Algorithms - Continued
On Nesterov s Random Coordinate Descent Algorithms - Continued Zheng Xu University of Texas At Arlington February 20, 2015 1 Revisit Random Coordinate Descent The Random Coordinate Descent Upper and Lower
More informationSGD CONVERGES TO GLOBAL MINIMUM IN DEEP LEARNING VIA STAR-CONVEX PATH
Under review as a conference paper at ICLR 9 SGD CONVERGES TO GLOAL MINIMUM IN DEEP LEARNING VIA STAR-CONVEX PATH Anonymous authors Paper under double-blind review ASTRACT Stochastic gradient descent (SGD)
More informationMachine Learning. Support Vector Machines. Fabio Vandin November 20, 2017
Machine Learning Support Vector Machines Fabio Vandin November 20, 2017 1 Classification and Margin Consider a classification problem with two classes: instance set X = R d label set Y = { 1, 1}. Training
More informationStochastic gradient methods for machine learning
Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - September 2012 Context Machine
More informationStochastic gradient methods for machine learning
Stochastic gradient methods for machine learning Francis Bach INRIA - Ecole Normale Supérieure, Paris, France Joint work with Eric Moulines, Nicolas Le Roux and Mark Schmidt - January 2013 Context Machine
More informationProximal and First-Order Methods for Convex Optimization
Proximal and First-Order Methods for Convex Optimization John C Duchi Yoram Singer January, 03 Abstract We describe the proximal method for minimization of convex functions We review classical results,
More informationarxiv: v3 [math.oc] 8 Jan 2019
Why Random Reshuffling Beats Stochastic Gradient Descent Mert Gürbüzbalaban, Asuman Ozdaglar, Pablo Parrilo arxiv:1510.08560v3 [math.oc] 8 Jan 2019 January 9, 2019 Abstract We analyze the convergence rate
More informationAccelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization
Accelerated Stochastic Block Coordinate Gradient Descent for Sparsity Constrained Nonconvex Optimization Jinghui Chen Department of Systems and Information Engineering University of Virginia Quanquan Gu
More informationarxiv: v1 [math.oc] 10 Oct 2018
8 Frank-Wolfe Method is Automatically Adaptive to Error Bound ondition arxiv:80.04765v [math.o] 0 Oct 08 Yi Xu yi-xu@uiowa.edu Tianbao Yang tianbao-yang@uiowa.edu Department of omputer Science, The University
More informationSelected Topics in Optimization. Some slides borrowed from
Selected Topics in Optimization Some slides borrowed from http://www.stat.cmu.edu/~ryantibs/convexopt/ Overview Optimization problems are almost everywhere in statistics and machine learning. Input Model
More informationStochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values
Stochastic Compositional Gradient Descent: Algorithms for Minimizing Nonlinear Functions of Expected Values Mengdi Wang Ethan X. Fang Han Liu Abstract Classical stochastic gradient methods are well suited
More informationStochastic Gradient Descent. CS 584: Big Data Analytics
Stochastic Gradient Descent CS 584: Big Data Analytics Gradient Descent Recap Simplest and extremely popular Main Idea: take a step proportional to the negative of the gradient Easy to implement Each iteration
More informationCourse Notes for EE227C (Spring 2018): Convex Optimization and Approximation
Course Notes for EE7C (Spring 08): Convex Optimization and Approximation Instructor: Moritz Hardt Email: hardt+ee7c@berkeley.edu Graduate Instructor: Max Simchowitz Email: msimchow+ee7c@berkeley.edu October
More informationConvex Optimization. (EE227A: UC Berkeley) Lecture 15. Suvrit Sra. (Gradient methods III) 12 March, 2013
Convex Optimization (EE227A: UC Berkeley) Lecture 15 (Gradient methods III) 12 March, 2013 Suvrit Sra Optimal gradient methods 2 / 27 Optimal gradient methods We saw following efficiency estimates for
More informationarxiv: v2 [cs.lg] 4 Oct 2016
Appearing in 2016 IEEE International Conference on Data Mining (ICDM), Barcelona. Efficient Distributed SGD with Variance Reduction Soham De and Tom Goldstein Department of Computer Science, University
More informationarxiv: v2 [math.oc] 1 Nov 2017
Stochastic Non-convex Optimization with Strong High Probability Second-order Convergence arxiv:1710.09447v [math.oc] 1 Nov 017 Mingrui Liu, Tianbao Yang Department of Computer Science The University of
More informationNESTT: A Nonconvex Primal-Dual Splitting Method for Distributed and Stochastic Optimization
ESTT: A onconvex Primal-Dual Splitting Method for Distributed and Stochastic Optimization Davood Hajinezhad, Mingyi Hong Tuo Zhao Zhaoran Wang Abstract We study a stochastic and distributed algorithm for
More informationBlock stochastic gradient update method
Block stochastic gradient update method Yangyang Xu and Wotao Yin IMA, University of Minnesota Department of Mathematics, UCLA November 1, 2015 This work was done while in Rice University 1 / 26 Stochastic
More informationWorst-Case Complexity Guarantees and Nonconvex Smooth Optimization
Worst-Case Complexity Guarantees and Nonconvex Smooth Optimization Frank E. Curtis, Lehigh University Beyond Convexity Workshop, Oaxaca, Mexico 26 October 2017 Worst-Case Complexity Guarantees and Nonconvex
More informationOn the Convergence Rate of Incremental Aggregated Gradient Algorithms
On the Convergence Rate of Incremental Aggregated Gradient Algorithms M. Gürbüzbalaban, A. Ozdaglar, P. Parrilo June 5, 2015 Abstract Motivated by applications to distributed optimization over networks
More informationAccelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization
Accelerated Randomized Primal-Dual Coordinate Method for Empirical Risk Minimization Lin Xiao (Microsoft Research) Joint work with Qihang Lin (CMU), Zhaosong Lu (Simon Fraser) Yuchen Zhang (UC Berkeley)
More informationIFT Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent
IFT 6085 - Lecture 6 Nesterov s Accelerated Gradient, Stochastic Gradient Descent This version of the notes has not yet been thoroughly checked. Please report any bugs to the scribes or instructor. Scribe(s):
More informationGeneralized Conditional Gradient and Its Applications
Generalized Conditional Gradient and Its Applications Yaoliang Yu University of Alberta UBC Kelowna, 04/18/13 Y-L. Yu (UofA) GCG and Its Apps. UBC Kelowna, 04/18/13 1 / 25 1 Introduction 2 Generalized
More informationReferences. --- a tentative list of papers to be mentioned in the ICML 2017 tutorial. Recent Advances in Stochastic Convex and Non-Convex Optimization
References --- a tentative list of papers to be mentioned in the ICML 2017 tutorial Recent Advances in Stochastic Convex and Non-Convex Optimization Disclaimer: in a quite arbitrary order. 1. [ShalevShwartz-Zhang,
More informationStochastic optimization: Beyond stochastic gradients and convexity Part I
Stochastic optimization: Beyond stochastic gradients and convexity Part I Francis Bach INRIA - Ecole Normale Supérieure, Paris, France ÉCOLE NORMALE SUPÉRIEURE Joint tutorial with Suvrit Sra, MIT - NIPS
More informationBregman Divergence and Mirror Descent
Bregman Divergence and Mirror Descent Bregman Divergence Motivation Generalize squared Euclidean distance to a class of distances that all share similar properties Lots of applications in machine learning,
More informationOptimization methods
Lecture notes 3 February 8, 016 1 Introduction Optimization methods In these notes we provide an overview of a selection of optimization methods. We focus on methods which rely on first-order information,
More informationOn Projected Stochastic Gradient Descent Algorithm with Weighted Averaging for Least Squares Regression
On Projected Stochastic Gradient Descent Algorithm with Weighted Averaging for Least Squares Regression arxiv:606.03000v [cs.it] 9 Jun 206 Kobi Cohen, Angelia Nedić and R. Srikant Abstract The problem
More informationDistributed Inexact Newton-type Pursuit for Non-convex Sparse Learning
Distributed Inexact Newton-type Pursuit for Non-convex Sparse Learning Bo Liu Department of Computer Science, Rutgers Univeristy Xiao-Tong Yuan BDAT Lab, Nanjing University of Information Science and Technology
More informationTutorial: PART 2. Optimization for Machine Learning. Elad Hazan Princeton University. + help from Sanjeev Arora & Yoram Singer
Tutorial: PART 2 Optimization for Machine Learning Elad Hazan Princeton University + help from Sanjeev Arora & Yoram Singer Agenda 1. Learning as mathematical optimization Stochastic optimization, ERM,
More informationLarge-scale Machine Learning and Optimization
1/84 Large-scale Machine Learning and Optimization Zaiwen Wen http://bicmr.pku.edu.cn/~wenzw/bigdata2018.html Acknowledgement: this slides is based on Dimitris Papailiopoulos and Shiqian Ma lecture notes
More informationStochastic Optimization Algorithms Beyond SG
Stochastic Optimization Algorithms Beyond SG Frank E. Curtis 1, Lehigh University involving joint work with Léon Bottou, Facebook AI Research Jorge Nocedal, Northwestern University Optimization Methods
More informationBringing Order to Special Cases of Klee s Measure Problem
Bringing Order to Special Cases of Klee s Measure Prolem Karl Bringmann Max Planck Institute for Informatics Astract. Klee s Measure Prolem (KMP) asks for the volume of the union of n axis-aligned oxes
More informationA Stochastic PCA Algorithm with an Exponential Convergence Rate. Ohad Shamir
A Stochastic PCA Algorithm with an Exponential Convergence Rate Ohad Shamir Weizmann Institute of Science NIPS Optimization Workshop December 2014 Ohad Shamir Stochastic PCA with Exponential Convergence
More information