arxiv: v2 [math.oc] 29 Jul 2016

Size: px

Start display at page:

Download "arxiv: v2 [math.oc] 29 Jul 2016"

Scot Quinn
5 years ago
Views:

1 Stochastic Frank-Wolfe Methods for Nonconvex Optimization arxiv: v [math.oc] 9 Jul 06 Sashank J. Reddi sjakkamr@cs.cmu.edu Carnegie Mellon University Barnaás Póczós apoczos@cs.cmu.edu Carnegie Mellon University Suvrit Sra suvrit@mit.edu Massachusetts Institute of Technology Astract Alex Smola alex@smola.org Carnegie Mellon University We study Frank-Wolfe methods for nonconvex stochastic and finite-sum optimization prolems. Frank-Wolfe methods (in the convex case) have gained tremendous recent interest in machine learning and optimization communities due to their projection-free property and their aility to exploit structured constraints. However, our understanding of these algorithms in the nonconvex setting is fairly limited. In this paper, we propose nonconvex stochastic Frank-Wolfe methods and analyze their convergence properties. For ojective functions that decompose into a finitesum, we leverage ideas from variance reduction techniques for convex optimization to otain new variance reduced nonconvex Frank-Wolfe methods that have provaly faster convergence than the classical Frank-Wolfe method. Finally, we show that the faster convergence rates of our variance reduced methods also translate into improved convergence rates for the stochastic setting. Introduction We study optimization prolems of the form: { E z [f(x, z)], F (x) := min x Ω (stochastic) n n i= f i(x), (finite-sum). We assume that F, f, and f i (i {,..., n} [n]) are all differentiale, ut possily nonconvex; the domain Ω is convex and compact. Prolems of this form are at the heart of machine learning and statistics; for instance, the finitesum prolem arises under the name empirical loss minimization and M-estimation. Examples of such prolems include multiclass classification, matrix learning, recommendation systems (Jaggi, 03; Hazan and Kale, 0; Hazan and Luo, 06; Harchaoui et al., 04). Within convex optimization, prolem () is relatively well-studied. Two particularly popular approaches for solving it are: (i) Projected stochastic gradient descent (Sgd); and () the Frank-Wolfe (Fw) method. At each iteration, Sgd takes a step in a direction opposite to a stochastic approximation of the gradient F and uses projection onto Ω to ensure feasiility. While computing a stochastic approximation to F is usually inexpensive, in many real settings, the cost projecting onto Ω can e very high (e.g., projecting onto the trace-norm all, onto ase polytopes in sumodular minimization (Fujishige and Isotani, 0)); and in extreme cases projection can even e computationally intractale (Collins et al., 008). ()

2 In such cases, projection ased methods like Sgd ecome impractical. This difficulty underlies the recent surge of interest in Frank-Wolfe methods (Frank and Wolfe, 956; Jaggi, 03) (also known as conditional gradient), due to their projection-free property. In particular, Fw methods avoid the expensive projection operation and require just a linear oracle that solves prolems of the form min x Ω x, g at each iteration. Despite the remarkale success of Fw approaches in the convex setting, including stochastic prolems (Hazan and Luo, 06), their applicaility and non-asymptotic convergence for nonconvex optimization is largely unstudied. Even for Sgd, it is only recently that non-asymptotic convergence analysis for nonconvex optimization was otained (Ghadimi and Lan, 03; Ghadimi et al., 04). More recently, Reddi et al. (06a;) otained variance reduced stochastic methods that converge faster than Sgd in the nonconvex finite-sum setting. Similar fast variants of Fw for nonconvex prolems are not known. Given the vast importance of nonconvex models in machine learning (e.g., in deep learning) and the need to incorporate non-trivial constraints in such models, it is imperative to develop scalale, projection-free methods. This paper presents new Fw methods towards this goal. Our main contriutions are summarized elow, while the key complexity results are listed in Figure. Main Contriutions. For the nonconvex stochastic setting in (), we propose a stochastic Frank- Wolfe algorithm (Sfw), and provide its convergence analysis. For the nonconvex finite-sum setting, we propose two variance reduced (VR) algorithms: Svfw and SagaFw, ased on the popular VR algorithms Svrg and Saga, respectively. We show that y carefully selecting the parameters of these algorithms, we can attain faster convergence rates than the deterministic Fw. In particular, we prove that Svfw and SagaFw are faster than deterministic Fw y a factor of n /3 and n /3 respectively, where n is the numer of component functions in the finite-sum (see ()). Furthermore, leveraging these variance reduced methods, we propose two algorithms, Svfw-S and SagaFw-S, for the nonconvex stochastic setting, with faster convergence rates than Sfw. To our knowledge, our work presents the first theoretical improvement for stochastic variants of Frank-Wolfe in the context of nonconvex optimization.. Related Work The classical Frank-Wolfe method (Frank and Wolfe, 956) using line-search was analyzed for smooth convex functions F and polyhedral domains Ω. Here, a convergence rate of O(/ɛ) to ensure F (x) F ɛ was proved without additional conditions (Frank and Wolfe, 956; Jaggi, 03). There have een several recent works on improving the convergence rates under additional assumptions (Garer and Hazan, 05; Lacoste-Julien and Jaggi, 05). More recently, Hazan and Luo (06) proposed stochastic variants of Fw for convex prolems of form (), and showed theoretical improvements over the classical Frank-Wolfe method. The literature on nonconvex Frank-Wolfe is relatively small. The work (Bertsekas, 995) proves asymptotic convergence of Fw to a stationary point; though, no convergence rates are provided. To the est of our knowledge, Yu et al. (04) is the first to provide convergence rates for Fw-type algorithm in the nonconvex setting. Very recently, Lacoste-Julien (06) provided a (non-asymptotic) convergence rate of O(/ɛ ) for nonconvex Fw with adaptive step sizes. However, as we shall see later, implementation of classical Fw for () is expensive (or impossile in the pure stochastic case) since it requires calculation of the gradient F at each iteration. We show that our stochastic variants are provaly faster than the existing Fw methods. In the nonconvex setting, most of the work on stochastic methods focuses on Sgd (Ghadimi and Lan, 03; Ghadimi et al., 04) and analyzes convergence to stationary points. For the finite-sum setting, we uild on recent variance reduction techniques (Johnson and Zhang, 03; Defazio et al., 04; Schmidt et al., 03), which were first proposed for solving unconstrained convex prolems of form (). Projected variants to handle constraints were studied in (Defazio et al., 04; Xiao and Zhang, 04). More recently, Reddi et al. (06a;;c) provided nonconvex variants of these methods

3 Algorithm SFO/IFO Complexity LO Complexity Frank-Wolfe O(n/ɛ ) O ( /ɛ ) Sfw O ( /ɛ 4) O ( /ɛ ) Svfw O(n + n /3 /ɛ ) O(/ɛ ) SagaFw O(n + n /3 /ɛ ) O(/ɛ ) Svfw-S O(/ɛ 0/3 ) O(/ɛ ) SagaFw-S O(/ɛ 8/3 ) O(/ɛ ) Figure : Tale comparing the est SFO/IFO and LO complexity of algorithms discussed in the paper (for the nonconvex setting). Here, Sfw, Svfw-S and SagaFw-S are algorithms for the stochastic setting, while Fw, Svfw and SagaFw are algorithms for the finite-sum setting. The complexity is measured y the numer of oracle calls required to achieve an ɛ-accurate solution (see Section for definitions of SFO/IFO and LO complexity). The complexity of Fw is from (Lacoste-Julien, 06). The results marked in red are contriutions of this paper. For clarity, we hide the dependence of SFO/IFO and LO complexity on the initial point and few parameters related to the function F and domain Ω. that converge provaly faster than oth Sgd and its deterministic counterpart. Preliminaries As stated aove, we study two different prolem settings: () stochastic, where F (x) = E z [f(x, z)] and z is random variale whose distriution P is supported on Ξ R p ; and () finite-sums, where F (x) = n n f i(x). For the stochastic setting, we assume that F is L-smooth, i.e., its gradient is Lipschitz continuous with constant L, so F (x) F (y) L x y, x, y Ω. Here. denotes the l -norm. Furthermore, for the stochastic setting, we also assume the function f is G-Lipschitz i.e., f(x, z) G for all x Ω and z Ξ. Such an assumption is common in the stochastic setting (Ghadimi and Lan, 03; Hazan and Luo, 06). For the finite-sum setting, we assume that the individual functions f i (i [n]) are L-smooth i.e., f i (x) f i (y) L x y, x, y Ω. Note that this implies that the function F is also L-smooth. The domain Ω R d is assumed to e convex and compact with diameter D; i.e., x y D for all x, y Ω. Such an assumption is common to all Frank-Wolfe methods. Convergence criteria. The criterion used for the convergence analysis is important in nonconvex optimization. For unconstrained prolems, the gradient norm F is typically used to measure convergence, ecause F 0 translates into convergence to a stationary point. However, this criterion cannot e used for constrained prolems of the form (). Instead, we use the following quantity, typically referred to as Frank-Wolfe gap: G(x) = max v x, F (x). () v Ω 3

4 For convex functions, the Fw gap provides an upper ound on the suoptimality. For nonconvex functions, the gap G(x) = 0 if and only if x is a stationary point. To state our convergence results we will also need the following ound: β (F (x 0) F (x )) LD, given some (unspecified) initial point x 0 Ω. Oracle model. To compare convergence speed of different algorithms, we use the following lack-ox oracles: Stochastic First-Order Oracle (SFO): For a function F ( ) = E z [f(., z)] where z P, an SFO takes a point x and returns the pair (f(x, z ), f(x, z )) where z is a sample drawn i.i.d. from P (Nemirovski and Yudin, 983). Incremental First-Order Oracle (IFO): For a function F ( ) = n i f i(.), an IFO takes an index i [n] and a point x R d, and returns the pair (f i (x), f i (x)) (Agarwal and Bottou, 04). Linear Optimization Oracle (LO): For a set Ω, an LO takes a direction d and returns arg max v Ω v, d. Throughout the paper, y SFO, IFO and LO complexity of an algorithm, we mean the total numer of SFO, IFO and LO calls made y the algorithm to otain an ɛ-accurate solution, i.e., a solution for which E[G(x)] ɛ; the expectation is over any randomization as part of the algorithm. For clarity of presentation, we hide the dependence of these complexities on the initial point F (x 0 ) F (x ), Lipschitz constant G, and the smoothness constant L; we report the dependence on n to highlight its importance. Classical Fw. To place our results in perspective, we egin y recalling the classical Frank-Wolfe (Fw) algorithm (Frank and Wolfe, 956). Pseudocode for this is presented in Algorithm. Algorithm : Fw ( x 0, T, {γ i } T ) i=0 : Input: x 0 Ω, numer of iterations T, {γ i} T i=0 : for t = 0 to T do 3: Compute v t = arg max v Ω v, F (x t) 4: Compute update direction d t = v t x t 5: x t+ = x t + γ td t 6: end for where γi [0, ] for all i {0,..., T } Each iteration of Fw entails calculation of the gradient F and moving towards a minimizer of a linearized ojective. Notice that calculation of F may not e possile in the stochastic setting of (). Furthermore, even in the finite-sum setting, computing F requires n IFO calls, rendering the approach useless in large-scale prolems, where n is large. For the nonconvex finite-sum setting, the following key result was proved recently (Lacoste-Julien, 06). Theorem (Lacoste-Julien, 06)). Under appropriate selection of step sizes γ t, the IFO and LO complexity of Algorithm to achieve an ɛ-accurate solution in the finite-sum setting are O(n/ɛ ) and O(/ɛ ) respectively. The key aspect of Theorem is the dependence of IFO complexity on n. In particular, when n is large, the IFO complexity O(n/ɛ ) shown y the theorem ecomes prohiitively expensive; thus, undermining the enefits of Fw over competitors like projected Sgd. In the next section, we tackle this drawack and develop faster nonconvex stochastic and variance reduced Fw methods. 4

5 Algorithm : Nonconvex SFW ( x 0, T, {γ i } T i=0, { i} T ) i=0 : Input: x 0 Ω, numer of iterations T, {γ i } T i=0 where γ i [0, ] for all i {0,..., T }, miniatch size { i } T i=0 : for t = 0 to T do 3: Uniformly randomly pick i.i.d samples {z, t..., z t t } according to the distriution P. 4: Compute v t = arg max v Ω v, t t f(x t, z i ) 5: Compute update direction d t = v t x t 6: x t+ = x t + γ t d t 7: end for 8: Output: Iterate x a chosen uniformly random from {x t } T t=0. 3 Algorithms In this section, we descrie Fw algorithms for solving (). In particular, we explore stochastic and variance reduced versions of the classical Fw method, for the stochastic and finite-sum settings, respectively. We defer the discussion on comparison of the convergence rates to Section Stochastic Setting We first investigate the convergence of Fw in the stochastic setting. As mentioned earlier, the classical Fw method (Algorithm ) requires calculation of the full gradient F (x), which is typically impossile to compute in the stochastic setting. For convex prolems, Hazan and Luo (06) tackle this issue y using the popular Roins-Monro approximation (Roins and Monro, 95) to the gradient. We use a variant of the algorithm for our nonconvex stochastic setting, which we call Sfw. The pseudocode of Sfw is listed in Algorithm. Note that the samples {z i } are chosen independently according to the distriution P. Thus, E zi [ f(x, z i )] = F (x), i.e., we otain an uniased estimate of the gradient. Also, note that the output in Algorithm is randomly selected from all the iterates of the algorithm. The key parameters of Sfw are the step sizes {γ i } T i=0 and the miniatch sizes { t }. These parameters must e chosen appropriately in order to ensure convergence of the algorithm (see Theorem ). For our analysis, we assume that the function f is G-Lipschitz i.e., we have max x Ω,z Ξ f(x, z) G. This ound on the gradient is crucial for our convergence analysis. We prove the following key result for nonconvex Sfw. Theorem. Consider the stochastic setting of () where f is G-Lipschitz and F is L-smooth. Then, the output x a of Algorithm with parameters γ t = γ = (F (x0) F (x )) T LD β, t = = T for all t {0,..., T }, satisfies the following ound: E[G(x a )] D ) L(F (x (G + 0) F (x )) T β ( + β), where x is an optimal solution to (stochastic) prolem (). Proof. First oserve the following upper ound: F (x t+ ) F (x t ) + F (x t ), x t+ x t + L x t+ x t = F (x t ) + F (x t ), γ(v t x t ) + L γ(v t x t ) F (x t ) + γ F (x t ), v t x t + LD γ. (3) 5

6 The first inequality follows since F is L-smooth (see Lemma ). The equality is due to the fact that x t+ x t = γ(v t x t ). The second inequality holds ecause v t, x t Ω and ecause the diameter of Ω is D. Next, we introduce the following quantity: ˆv t := arg max v Ω v, F (x t), (4) which is used purely for our analysis and is not part of the algorithm. For revity, we use t to denote f(x t, z t i ). Rewriting inequality (3) using this quantity, we see that F (x t+ ) F (x t ) + γ t, v t x t + γ F (x t ) t, v t x t + LD γ F (x t ) + γ t, ˆv t x t + γ F (x t ) t, v t x t + LD γ = F (x t ) + γ F (x t ), ˆv t x t + γ F (x t ) t, v t ˆv t + LD γ = F (x t ) γg(x t ) + γ F (x t ) t, v t ˆv t + LD γ F (x t ) γg(x t ) + Dγ F (x t ) t + LD γ. (5) The second inequality follows from the optimality of v t in Algorithm, while the third inequality follows from recalling that G(x t ) = max v Ω v x t, F (x t ) = ˆv t x t, F (x t ), which holds due to the optimality of ˆv t in (4). The last inequality follows from Cauchy-Schwarz and the fact that the diameter of the feasile set Ω is ounded y D. Taking expectations and using Lemma in (5) we otain the following important ound: E[F (x t+ )] E[F (x t )] γe[g(x t )] + GDγ + LD γ. Summing over t and telescoping, we then otain the upper-ound T γ t=0 E[G(x t )] F (x 0 ) E[F (x T )] + T GDγ + T LD γ F (x 0 ) F (x ) + T GDγ + T LD γ. The latter inequality follows from the optimality of x. Using the definition of the output x a of Algorithm and the parameters specified in the theorem statement, we get E[G(x a )] F (x 0) F (x ) T γ D T (G + which concludes the proof of the theorem. + GD + LD γ L(F (x 0) F (x )) β ( + β) ), (6) An immediate consequence of Theorem is the following complexity result for Sfw. 6

7 Corollary. Under the setting of Theorem, the SFO complexity and LO complexity of Algorithm are O(/ɛ 4 ) and O(/ɛ ), respectively. Proof. The proof follows upon oserving that O(/ɛ ) miniatch size is required at each iteration of the algorithm, and noting that as per Theorem O(/ɛ ) iterations are required to achieve an ɛ-accurate solution. Note that the SFO and LO complexity of nonconvex Sfw is similar to that of online Fw (Hazan and Kale, 0) and slightly worse than complexity of Sfw for convex prolems (Hazan and Luo, 06). Furthermore, for simplicity of analysis, we used a fixed step size and miniatch size. One can derive an essentially similar result using a decreasing step size and increasing miniatch size. It is important to emphasize that the aove results also apply to the finite-sum setting. In particular, when the distriution P is the empirical measure, then the convergence result in Theorem also provides convergence rates for the finite-sum case. However, as we will see shortly, these convergence rates can e improved significantly y using variance reduction techniques. 3. Finite-sum Setting In this section, we consider the finite-sum setting of (). We show that y uilding on ideas from variance reduction for Sgd, one can significantly improve the convergence rates. The key idea is to use a variance reduced approximation of the gradient (Johnson and Zhang, 03; Defazio et al., 04). We analyze two different algorithms for the finite-sum setting. Our first algorithm (Svfw) is ased on the convex method of (Hazan and Luo, 06) adapted to the nonconvex case. Our second algorithm (SagaFw) is ased on another variance reduction technique called Saga (Defazio et al., 04). Svfw Algorithm Pseudocode of our first method (Svfw) is presented in Algorithm 3. Similar to (Johnson and Zhang, 03) and (Hazan and Luo, 06), nonconvex Svfw is also epoch-ased. At the end of each epoch, the full gradient is computed at the current iterate. This gradient is used for controlling the variance of the stochastic gradients in the inner loop. For epoch size m =, Svfw reduces to the classic Fw algorithm. In general, the epoch size m is chosen such that the total numer of IFO calls per epoch is Θ(n). This ensures that the cost of computing the full gradient at the end of each epoch is amortized. To enale a fair comparison with Sfw, we assume that the total numer of inner iterations across all epochs in Algorithm 3 is T. We prove the following key result for Algorithm 3. For ease of exposition, we assume that the total numer of inner iterations T is a multiple of m. Theorem 3. Consider the finite-sum setting of () where the functions {f i } n are L-smooth. Then, the output x a of Algorithm 3 with parameters γ t = γ = F (x0) F (x ) T LD β and t = = m for all t {0,..., m }, satisfies E[G(x a )] D T β L(F (x0 ) F (x ))( + β), where x is an optimal solution of () and x a is the output of Algorithm 4. Proof. We first analyze the convergence properties of iterates within an epoch. Suppose that the current epoch is s +. For revity, we drop the symol s from x s+ t, x s and g s, whenever it safe to do so given the context. The first part of the proof is similar to that of Theorem. For the sake of completeness, we provide the details here. We again use the quantity ˆv t = arg max v Ω v, F (x t ), as efore, purely for our analysis. 7

8 Algorithm 3: SVFW ( x 0, T, m, {γ i } m i=0, { i} m ) i=0 : Input: x 0 m = x 0 Ω, epoch size m, numer of epochs S = T/m, {γ i } m i=0 where γ i [0, ] for all i {0,..., m }, miniatch size { i } m i=0 : for s = 0 to S do 3: Let x s = x s m 4: Compute g s = F ( x s ) = n n f( xs ) 5: for t = 0 to m do 6: Uniformly randomly (with replacement) select suset I t = {i,..., i t } from [n]. 7: Compute vt s+ = arg max v Ω v, t ( f i (x s+ t ) f i ( x s ) + g s ) 8: Compute update direction d s+ t = vt s+ x s+ t 9: x s+ t+ = xs+ 0: end for : end for t + γ t d s+ t : Output: Iterate x a chosen uniformly random from {{x s+ t } t=0 m }S s=0. For the t th iteration within the epoch s, we have F (x t+ ) F (x t ) + F (x t ), γ(v t x t ) + L γ(v t x t ) F (x t ) + γ F (x t ), v t x t + LD γ. (7) This is due to Lemma and definition of x t+ in Algorithm 3. For revity, we use t to denote t ( f i (x t ) f i ( x) + g). Rewriting, we then otain F (x t+ ) F (x t ) + γ t, v t x t + γ F (x t ) t, v t x t + LD γ F (x t ) + γ t, ˆv t x t + γ F (x t ) t, v t x t + LD γ F (x t ) + γ F (x t ), ˆv t x t + γ F (x t ) t, v t ˆv t + LD γ F (x t ) γg(x t ) + Dγ F (x t ) t + LD γ. (8) The second inequality is due to the optimality of v t in Algorithm 3. The last inequality is due to the definition of G(x t ), the diameter of set Ω, and an application of Cauchy-Schwarz inequality. Note that the aove inequality is similar to (5), except for the crucial difference in the term F (x t ) t (instead of F (x t ) t in (5)). As we shall see shortly, this term has much lower variance, which ultimately leads to faster convergence rates. Taking expectations and using Lemma 3 in inequality (8) we otain the ound E[F (x t+ )] E[F (x t )] γe[g(x t )] To aid further analysis, we introduce the following Lyapunov function: + LDγ E[ x t x ] + LD γ. (9) R t = E[F (x t ) + c t x t x ], 8

9 where c m = 0 and c t = c t+ + (LDγ)/ for all t {0,..., m }. Using the relationship in (9), we see that R t+ = E[F (x t+ ) + c t+ x t+ x ] E[F (x t )] γe[g(x t )] + LDγ E[ x t x ] + LD γ + c t+ E[ x t+ x ] E[F (x t )] γe[g(x t )] + LDγ E[ x t x ] + LD γ + c t+ E[ x t+ x t + x t x ] R t γe[g(x t )] + LD γ + c t+ Dγ. (0) The second inequality follows from the triangle inequality, while the last inequality holds ecause: (a) c t = c t+ + (LDγ)/, and () x t+ x t = γ v t x t Dγ (recall the definition of diameter of Ω). Telescoping over all the iterations within an epoch, we otain m R m R 0 γ E[G(x t )] + LmD γ t=0 m = R 0 γ E[G(x t )] + LmD γ t=0 + Dγ m t= c t + L(m )md γ. () The equality follows from the relationship c t = c t+ + (LDγ)/. Since c m = 0 and x s+ 0 = x s = x s m (in Algorithm 3), from () we otain E[F (x s+ m )] E[F (x s+ m Now telescoping over all epochs, we otain + LmD γ S E[F (x S m)] F (x 0 ) γ m )] γ s=0 + T LD γ t=0 E[G(x s+ t )] + L(m )md γ. m t=0 E[G(x s+ t )] + T L(m )D γ. Rearranging this inequality and using the definition of the output in Algorithm 3, we finally otain E[G(x a )] F (x 0) E[F (x S m)] T γ F (x 0) F (x ) T γ + LD γ + LD γ LD (F (x 0 ) F (x )) ( + β). T β + L(m )D γ The second inequality follows from the optimality of x and ecause = m. The last inequality follows from the choice of γ stated in the theorem. This concludes the proof. 9

10 Algorithm 4: SagaFw ( x 0, T, {γ i } T i=0, { i} T ) i=0 : Input: α0 i = x 0 Ω for all i [n], numer of iterations T, {γ i } T i=0 where γ i [0, ] for all i {0,..., T }, miniatch size { i } T i=0 : Compute g 0 = n n F (αi 0) 3: for t = 0 to T do 4: Uniformly randomly (with replacement) select susets I t, J t from [n] of size t. 5: Compute v t = arg max v Ω v, t ( f i (x t ) f i (αt) i + g t ) 6: Compute update direction d t = v t x t 7: x t+ = x t + γ t d t 8: α j t+ = x t for j J t and α j t+ = αj t for j / J t 9: g t+ = g t n j J t ( f j (αt) j f j (α j t+ )) 0: end for : Output: Iterate x a chosen uniformly random from {x t } T t=0. The analysis suggests that the value of m should e set appropriately in Theorem 3 to otain good convergence rates. If m is small, the IFO complexity of Algorithm 3 is dominated y the step involving calculation of the full gradient at the end of each epoch. On the other hand, if m is large, a large miniatch is used in each step of the algorithm (since = m ), which increases the IFO complexity. With this intuition, we present following important corollary. Corollary. Under the setting of Theorem 3 and with m = n /3, the IFO complexity and LO complexity of Algorithm are O(n + n /3 /ɛ ) and O(/ɛ ), respectively. Proof. We first oserve that the total numer of IFO calls for an epoch (including those required for calculating the full gradient) is Θ(m 3 + n). Since m = n /3, the total amortized IFO complexity of one iteration within an epoch is O(m ) = O(n /3 ). Therefore, the IFO complexity is O(n + n /3 /ɛ ). Further, since each inner iteration requires O() LO calls, the LO complexity is O(/ɛ ). SagaFw Algorithm Svfw is a semi-stochastic algorithm since it requires calculation of the full gradient at the end of each epoch. Below we propose a purely incremental method (SagaFw) ased on the Saga algorithm of (Defazio et al., 04). The pseudocode for SagaFw is presented in Algorithm 4. A key feature of SagaFw is that it entirely avoids calculation of full gradients. Instead, it updates the average gradient vector g t at each iteration. This update requires maintaining additional vectors α i (i [n]), and in the worst case such a strategy incurs additional storage cost of O(nd). However, this cost can e reduced to O(n) in several practical cases (refer to (Defazio et al., 04; Reddi et al., 06)). For SagaFw, we prove the following key result. Theorem 4. Consider the finite-sum setting of () where functions {f i } n are L-smooth. Define θ(, n, T ) = / + (n 3/ /T 3/ ). Then the output x a of Algorithm 4 with parameters γ t = γ = F (x0) F (x ) T LD θ(,n,t )β and t = n for all t {0,..., T }, satisfies the following: E[G(x a )] D T β Lθ(, n, T )(F (x0 ) F (x ))( + β), where x is an optimal solution of prolem () and x a is the output of Algorithm 4. 0

11 Proof. We use the following quantities in our analysis: ˇ t = ( fi (x t ) f i (α i ) t) + g t t ˆv t = arg max v Ω v, F (x t). The first part of our proof is similar to that of Theorem 3. until (8), we have E[F (x t+ )] Using essentially the same argument F (x t ) γg(x t ) + Dγ F (x t ) ˇ t + LD γ F (x t ) γg(x t ) + LDγ n n E x t α i n t + LD γ. () The second inequality is due to Lemma 4. Next, we define the following Lyapunov function: n R t = E[F (x t )] + c t E x t α i n t, where c T = 0 and c t = ( ρ)c t+ + (LDγ n)/ for all t {0,..., T }, where ρ is the proaility ( /n) of an index i eing in J t. We can ound ρ from elow as ρ = ( n) +(/n) = /n +/n n, (3) where the first inequality follows from ( y) r /( + ry) (which holds for y [0, ] and r ), while the second inequality holds ecause n. Now oserve the following: n E x t+ α n t+ i = n E [ ρ x t+ x t + ( ρ) x t+ α i n t ] n = n n [ E ρ x t+ x t ] + ( ρ)( x t+ x t + x t αt ) i n E [ x t+ x t + ( ρ)e x t αt i ] (4) The first equality follows from the definition of α i t+ in Algorithm 4, while the inequality is just the triangle inequality. Using the aove relationship and the ound in (), we otain R t+ E[F (x t )] γe[g(x t )] + LD γ + LDγ n n n E[ x t αt ] i + c t+ E[ x t+ x t ] + c t+ ( ρ) n n E[ x t αt ] i R t γe[g(x t )] + LD γ + c t+ Dγ. (5)

12 The second inequality holds ecause: (a) c t = ( ρ)c t+ + (LDγ n)/, and () x t+ x t = γ v t x t Dγ (due to our ound on the diameter of the set Ω). Telescoping over all the iterations, we see that T R T R 0 γ E[G(x t )] + T LD γ t=0 T R 0 γ E[G(x t )] + T LD γ t=0 T R 0 γ E[G(x t )] + T LD γ t=0 + Dγ T t= + LD γ n ρ c t + LD γ n 3/ 3/. The second inequality follows form the fact that T t= c t LDγ n/(ρ ). This can, in turn, e otained from the recursion c t = ( ρ)c t+ + (LDγ n)/ and c T = 0. The third inequality is due to the ound on ρ in (3). Rearranging the aove inequality and using the definition of x a from Algorithm 4, we finally otain the ound E[G(x a )] F (x 0) E[F (x T )] T γ F (x 0) F (x ) T γ + LD γ + LD γθ(, n, T ). + LD γn 3/ T 3/ The first inequality uses the fact that c T = 0 and α i 0 = x 0 (in Algorithm 4). The second inequality uses the optimality of x and the definition of θ(, n, T ). Using the setting of γ in the theorem statement, we otain the desired result. Corollary 3. Assume T n. Under the settings of Theorem 4 and with = n /3, the IFO and LO complexity of Algorithm are O(n + n /3 /ɛ ) and O(/ɛ ), respectively. Proof. First, oserve that for T n and = n /3, θ(, n, T ) 5/ in Theorem 4. Thus, the IFO complexity is O(n + n /3 /ɛ ). Furthermore, since each iteration requires just O() LO calls, the LO complexity is O(/ɛ ). Notaly, the IFO complexity of SagaFw is lower than that of Svfw. Moreover, if T n 3/ and =, then we have θ(, n, T ) 5/, in which case the IFO complexity is O(n 3/ + /ɛ ). 4 Variance Reduction in Stochastic Setting In this section, we improve the convergence rates in the stochastic setting using variance reduction techniques. The key idea is to first otain samples {z i } are chosen independently according to the distriution P and then use Svfw or SagaFw, descried in this paper, on the finite-sum prolem over these samples. The pseudocode for the Svfw and SagaFw variants for stochastic setting (Svfw-S and SagaFw-S respectively) are provided in Figure. The following is the key result regarding the convergence rates of Svfw-S and SagaFw-S. Theorem 5. Consider the stochastic setting of () where f is G-Lipschitz and f(., z) is L-smooth F (x0) F (x ) T LD β for all z Ξ. Suppose B = T and γ = of Svfw-S and SagaFw-S satisfy the following: E[G(x a )] (for Svfw-S and SagaFw-S). Then the output D T β L(F (x0 ) F (x ))( + β) + GD T (6)

13 Svfw-S:(x 0, T, B, γ) Randomly sample z,, z B P Let finite-sum ˆF (x) = B B f(x, z i) Output Svfw(x 0, T, B /3, γ, B /3 ) applied on the function ˆF SagaFw-S:(x 0, T, B, γ) Randomly sample z,, z B P Let finite-sum ˆF (x) = B B f(x, z i) Output SagaFw(x 0, T, γ, 3B /3 ) applied on the function ˆF Figure : Svfw-S and SagaFw-S variants for the stochastic setting. Proof. Consider the finite-sum ˆF (x) = B B f(x, z i) where z,, z B P. We use the following notation: Ĝ(x) = max v Ω v x, ˆF (x). Let v t s+ = arg max v Ω v x s+ t, F (x s+ t ) and ˆv t s+ = arg max v Ω v x s+ t, ˆF (x s+ t ). We first oserve the following key relationship for Svfw: E[G(x s+ t ) Ĝ(xs+ t )] = E[ v t s+ x s+ t, F (x s+ t ) ] E[ ˆv t s+ x s+ t, ˆF (x s+ t ) ] E[ v s+ t x s+ t, F (x s+ t ) ] E[ v s+ t x s+ t, ˆF (x s+ t ) ] E[ v t s+ x s+ t, ˆF (x s+ t ) F (x s+ t ) ] DE[ ˆF (x s+ t ) F (x s+ t ) ] GD. T The first inequality is due to the optimality of ˆv t s+. The third inequality follows from Cauchy-Schwarz inequality. The last inequality is due to Lemma. Adding the aove inequality across all the iterations and epochs, we get: E[G(x a )] E[Ĝ(x a)] + GD T. Using the ound on E[Ĝ(x a)] in Theorem 3 (here, recall we are using Svfw on ˆF ) in the aove inequality, we get the desired result. The proof for SagaFw-S is similar. The following corollary on the complexity of Svfw-S and SagaFw-S is immediate consequence of the aove result. Corollary 4. Under the setting of Theorem 5, the SFO complexity of Svfw-S and SagaFw-S (in Figure ) are O(/ɛ 0/3 ) and O(/ɛ 8/3 ), respectively. The LO complexity of oth Svfw-S and SagaFw-S is O(/ɛ ). Proof. The proof follows from the fact that B = T, = B /3 (in Svfw-S) and = 3B /3 (in SagaFw-S) and IFO complexities of Svfw and SagaFw given in Corollary and 3 respectively. By comparing Corollary 4 with Corollary, we see that Svfw-S and SagaFw-S have etter SFO complexity than Sfw. 5 Discussion It is important to remark on the complexity results derived in this paper. For the stochastic setting, we showed that the SFO and LO complexity of Sfw are O(/ɛ 4 ) and O(/ɛ ), respectively. At first glance, these rates might appear worse than those otained for nonconvex Sgd (see (Ghadimi and Lan, 03)). However, it is important to note that the convergence criterion used in our paper is 3

14 different from the one used in (Ghadimi and Lan, 03). It is an important piece of future work to understand the precise relationship etween these convergence criteria. Furthermore, the convergence rates in this paper are similar to those otained for online Frank-Wolfe (Hazan and Kale, 0) and only slightly worse than those otained for stochastic Frank-Wolfe in the convex setting (Hazan and Luo, 06). We, further, improved the convergence rate of Sfw y using variance reduction ideas in the stochastic setting (Svfw-S and SagaFw-S algorithms in Section 4). Understanding the tightness of these rates is an interesting open prolem left as future work. For the finite-sum setting, while the complexity results of Sfw still hold, we otained significantly faster convergence rates y using variance reduction techniques. The dependence of IFO and LO complexity of nonconvex Svfw and SagaFw, on ɛ is O(/ɛ ), which matches the classical Frank-Wolfe algorithm (Lacoste-Julien, 06). However, Svfw and SagaFw exhiit a much weaker dependence on n than Fw; wherein, they are provaly faster than the classical Frank-Wolfe y a factor of n /3 and n /3, respectively. Similar (ut not same) enefits have also een reported for nonconvex Svrg and Saga over gradient descent (Reddi et al., 06a;). Interestingly, there appears to e a gap etween the convergence rates of Svfw and SagaFw. Whether this gap is an artifact of our analysis or has deeper reasons remains open. We conclude with a remark on a sutle point regarding the step size γ. The step size γ in Theorems, 3, and 4 requires knowledge of parameters like L, D and F (x) F (x ). Typically, an estimate of the these values suffices in practice. In asence of such knowledge, one can completely eliminate this dependence of γ on these parameters y simply choosing β = (F (x0) F (x )) LD. Fortunately, this comes at the cost of only slightly worse constants in the convergence rate. References A. Agarwal and L. Bottou. A Lower Bound for the Optimization of Finite Sums. arxiv:40.073, 04. D. Bertsekas. Nonlinear Programming. Athena Scientific, 995. ISBN M. Collins, A. Gloerson, T. Koo, X. Carreras, and P. L. Bartlett. Exponentiated Gradient Algorithms for Conditional Random Fields and Max-Margin Markov Networks. JMLR, 9:775 8, 008. A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A Fast Incremental Gradient Method with Support for Non-Strongly Convex Composite Ojectives. In NIPS 7, pages M. Frank and P. Wolfe. An Algorithm for Quadratic Programming. Naval Research Logistics Quarterly, 3(-):95 0, March 956. S. Fujishige and S. Isotani. A Sumodular Function Minimization Algorithm ased on the Minimumnorm Base. Pacific Journal of Optimization, 7():3 7, 0. D. Garer and E. Hazan. Faster Rates for the Frank-Wolfe Method over Strongly-Convex Sets. In Proceedings of the 3nd International Conference on Machine Learning, ICML 05, Lille, France, 6- July 05, pages , 05. S. Ghadimi and G. Lan. Stochastic First- and Zeroth-Order Methods for Nonconvex Stochastic Programming. SIAM Journal on Optimization, 3(4):34 368, 03. doi: 0.37/ S. Ghadimi, G. Lan, and H. Zhang. Mini-atch Stochastic Approximation Methods for Nonconvex Stochastic Composite Optimization. Mathematical Programming, 55(-):67 305, Decemer 04. Z. Harchaoui, A. Juditsky, and A. Nemirovski. Conditional Gradient Algorithms for Norm-Regularized Smooth Convex Optimization. Mathematical Programming, 5(-):75, apr 04. 4

15 E. Hazan and S. Kale. Projection-free Online Learning. In Proceedings of the 9th International Conference on Machine Learning, ICML 0, Edinurgh, Scotland, UK, June 6 - July, 0, 0. E. Hazan and H. Luo. Variance-Reduced and Projection-Free Stochastic Optimization. CoRR, as/60.00, 06. M. Jaggi. Revisiting Frank-Wolfe: Projection-Free Sparse Convex Optimization. In ICML 3, pages , 03. R. Johnson and T. Zhang. Accelerating Stochastic Gradient Descent using Predictive Variance Reduction. In NIPS 6, pages S. Lacoste-Julien. Convergence Rate of Frank-Wolfe for Non-Convex Ojectives. as/ , 06. S. Lacoste-Julien and M. Jaggi. On the Gloal Linear Convergence of Frank-Wolfe Optimization Variants. In Advances in Neural Information Processing Systems 8, pages , 05. A. Nemirovski and D. Yudin. Prolem Complexity and Method Efficiency in Optimization. John Wiley and Sons, 983. Y. Nesterov. Introductory Lectures On Convex Optimization: A Basic Course. Springer, 003. S. J. Reddi, A. Hefny, S. Sra, B. Póczos, and A. J. Smola. Stochastic Variance Reduction for Nonconvex Optimization. In Proceedings of the 33nd International Conference on Machine Learning, ICML 06, New York City, NY, USA, June 9-4, 06, pages 34 33, 06a. S. J. Reddi, S. Sra, B. Póczos, and A. J. Smola. Fast Incremental Method for Nonconvex Optimization. arxiv: , 06. S. J. Reddi, S. Sra, B. Póczos, and A. J. Smola. Fast Stochastic Methods for Nonsmooth Nonconvex Optimization. arxiv: , 06c. H. Roins and S. Monro. A Stochastic Approximation Method. The Annals of Mathematical Statistics, (3): , Sep 95. M. W. Schmidt, N. L. Roux, and F. R. Bach. Minimizing Finite Sums with the Stochastic Average Gradient. arxiv: , 03. L. Xiao and T. Zhang. A Proximal Stochastic Gradient Method with Progressive Variance Reduction. SIAM Journal on Optimization, 4(4): , 04. Y. Yu, X. Zhang, and D. Schuurmans. Generalized Conditional Gradient for Sparse Estimation. as/arxiv:40.488, 04. Appendix The following ound on the value of functions with Lipschitz continuous gradients is classical (see e.g., (Nesterov, 003)). Lemma. If f : R d R is L-smooth, then for all x, y R d. f(x) f(y) + f(y), x y + L x y, 5

16 The following lemma is useful for ounding the variance of the gradient estimate used in the stochastic setting. Lemma. Suppose the function F (x) = E z [f(x, z)] where z is a random variale with distriution P and support Ξ, and max z Ξ f(x, z) G for all x Ω. Also, let x = f(x, z i ) where {z i } are i.i.d. samples from the distriution P. Then, the following holds for any x Ω: E[ x F (x) ] G. Proof. The proof follows from a simple application of Lemma 5 and Jensen s inequality. The following result is useful for ounding the variance of the updates of Svfw and follows from a slight modification of a result in (Reddi et al., 06a). We give the proof here for completeness. Lemma 3 (Reddi et al., 06a)). Let t = t ( f i (x s+ t ) f i ( x s ) + g s ) in Algorithm 3. For the iterates x s+ t and x s where t {0,..., m } and s {0,..., S } in Algorithm 3, the following inequality holds: Proof. For the ease of exposition, we first define E It [ F (x s+ t ) t ] L x s+ t x s. t ζ s+ t = I t Using this notation, we then otain the following: E It [ F (x s+ t ) t ] ( fi (x s+ t ) f i ( x s ) ). = E It [ ζt s+ + F ( x s ) F (x s+ t ) ] = E It [ ζt s+ E It [ζt s+ ] ] = E It t ( fi (x s+ t ) f i ( x s ) E It [ζ s+ t ] ). The second equality is due to the fact that E It [ζt s+ ] = F (x s+ t ) F ( x s ). From the aove relationship, we get E It [ F (x s+ t ) t ] [ ] E It f i (x s+ t ) f i ( x s ) E It [ζt s+ ] t t E It [ f i (x s+ t ) f i ( x s ) ] L t x s+ t x s. The first inequality follows from Lemma 5. The second inequality is due to the fact that for a random variale ζ, E[ ζ E[ζ] ] E[ ζ ]. The last inequality follows from L-smoothness of f i. The result follows from a simple application of Jensen s inequality to the inequality aove. 6

17 The following result is important for ounding the variance in SagaFw. The key difference from Lemma 3 is that the variance term in SagaFw involves αt. i Again, we provide the proof for completeness. Lemma 4. Let ˇ t = t ( f i (x t ) f i (αt) i + g t ) in Algorithm 4. {αt} i n where t {0,..., T } in Algorithm 4, we have the inequality For the iterates x t, v t and E It [ F (x t ) ˇ t ] Proof. As efore we first define the quantity ζ t = I t L t n n x t α i t. ( fi (x t ) f i (αt) i ). With this notation, we then otain the equality E It [ F (x t) ˇ t ] [ = E It ζ t + n ] f i(αt) i F (x t) = E It [ ζ t E It [ζ t] ] n [ = ( ) ] EI t f i(x t) f i(α i t) E It [ζ t]. The second equality follows from the fact that E It [ζ t ] = F (x t ) n n f i(α i t). From the aove inequality, we get the following ound: E It [ F (x t) ˇ t ] t E It [ f i(x t) f i(α i t) E It [ζ t] ] t E It [ f i(x t) f i(α i t) ] L n t n x t αt i. The first inequality is due to Lemma 5, while the second inequality holds ecause for a random variale ζ, E[ ζ E[ζ] ] E[ ζ ]. The last inequality is from L-smoothness of f i (i [n]) and uniform randomness of the set I t. By applying Jensen s inequality, we get the desired result. Lemma 5. For random variales z,..., z r that are independent and have mean 0, we have Proof. Expanding the left hand side we have E [ z z r ] = E [ z z r ]. E [ z z r ] = r i,j= the second equality here follows from the our hypothesis. [ r E [z i z j ] = E z i ] ; 7

arxiv: v1 [math.oc] 1 Jul 2016

arxiv: v1 [math.oc] 1 Jul 2016 Convergence Rate of Frank-Wolfe for Non-Convex Objectives Simon Lacoste-Julien INRIA - SIERRA team ENS, Paris June 8, 016 Abstract arxiv:1607.00345v1 [math.oc] 1 Jul 016 We give a simple proof that the