arxiv: v2 [cs.lg] 14 Sep 2017


Variance-Reduced and Projection-Free Stochastic Optimization

Elad Hazan, Princeton University, Princeton, NJ 08540, USA
Haipeng Luo, Princeton University, Princeton, NJ 08540, USA

Proceedings of the 33rd International Conference on Machine Learning, New York, NY, USA, 2016. JMLR: W&CP volume 48. Copyright 2016 by the authors.

Abstract

The Frank-Wolfe optimization algorithm has recently regained popularity for machine learning applications due to its projection-free property and its ability to handle structured constraints. However, in the stochastic learning setting, it is still relatively understudied compared to the gradient descent counterpart. In this work, leveraging a recent variance reduction technique, we propose two stochastic Frank-Wolfe variants which substantially improve previous results in terms of the number of stochastic gradient evaluations needed to achieve $1-\epsilon$ accuracy. For example, we improve from $O(1/\epsilon)$ to $O(\ln(1/\epsilon))$ if the objective function is smooth and strongly convex, and from $O(1/\epsilon^2)$ to $O(1/\epsilon^{1.5})$ if the objective function is smooth and Lipschitz. The theoretical improvement is also observed in experiments on real-world datasets for a multiclass classification application.

1. Introduction

We consider the following optimization problem

$$\min_{w \in \Omega} f(w) = \min_{w \in \Omega} \frac{1}{n} \sum_{i=1}^{n} f_i(w),$$

which is an extremely common objective in machine learning. We are interested in the case where 1) $n$, usually corresponding to the number of training examples, is very large and therefore stochastic optimization is much more efficient; and 2) the domain $\Omega$ admits fast linear optimization, while projecting onto it is much slower, necessitating projection-free optimization algorithms. Examples of such problems include multiclass classification, multitask learning, recommendation systems, matrix learning and many more (see for example (Hazan & Kale, 2012; Hazan et al., 2012; Jaggi, 2013; Dudik et al., 2012; Zhang et al., 2012; Harchaoui et al., 2015)).
The Frank-Wolfe algorithm (Frank & Wolfe, 1956) (also known as conditional gradient) and its variants are natural candidates for solving these problems, due to their projection-free property and their ability to handle structured constraints. However, despite gaining more popularity recently, its applicability and efficiency in the stochastic learning setting, where computing stochastic gradients is much faster than computing exact gradients, is still relatively understudied compared to variants of projected gradient descent methods.

In this work, we thus try to answer the following question: what running time can a projection-free algorithm achieve in terms of the number of stochastic gradient evaluations and the number of linear optimizations needed to achieve a certain accuracy? Utilizing Nesterov's acceleration technique (Nesterov, 1983) and the recent variance reduction idea (Johnson & Zhang, 2013; Mahdavi et al., 2013), we propose two new algorithms that are substantially faster than previous work. Specifically, to achieve $1-\epsilon$ accuracy, while the number of linear optimizations is the same as in previous work, the improvement in the number of stochastic gradient evaluations is summarized in Table 1:

| | previous work | this work |
| Smooth | $O(1/\epsilon^2)$ | $O(1/\epsilon^{1.5})$ |
| Smooth and Strongly Convex | $O(1/\epsilon)$ | $O(\ln(1/\epsilon))$ |

Table 1: Comparisons of number of stochastic gradients

The extra overhead of our algorithms is computing at most $O(\ln(1/\epsilon))$ exact gradients, which is computationally insignificant compared to the other operations. A more detailed comparison to previous work is included in Table 2, which will be further explained in Section 2.

While the idea of our algorithms is quite straightforward,

we emphasize that our analysis is non-trivial, especially for the second algorithm where the convergence of a sequence of auxiliary points in Nesterov's algorithm needs to be shown.

To support our theoretical results, we also conducted experiments on three large real-world datasets for a multiclass classification application. These experiments show significant improvement over both previous projection-free algorithms and algorithms such as projected stochastic gradient descent and its variance-reduced version.

The rest of the paper is organized as follows: Section 2 sets up the problem more formally and discusses related work. Our two new algorithms are presented and analyzed in Sections 3 and 4, followed by experiment details in Section 5.

2. Preliminary and Related Work

We assume each function $f_i$ is convex and $L$-smooth in $\mathbb{R}^d$ so that for any $w, v \in \mathbb{R}^d$ (see footnote 1),

$$\nabla f_i(v)^\top (w - v) \le f_i(w) - f_i(v) \le \nabla f_i(v)^\top (w - v) + \frac{L}{2}\|w - v\|^2.$$

We will use two more important properties of smoothness. The first one is

$$\|\nabla f_i(w) - \nabla f_i(v)\|^2 \le 2L\left(f_i(w) - f_i(v) - \nabla f_i(v)^\top (w - v)\right) \qquad (1)$$

(proven in Appendix A for completeness), and the second one is

$$f_i(\lambda w + (1-\lambda) v) \ge \lambda f_i(w) + (1-\lambda) f_i(v) - \frac{L}{2}\lambda(1-\lambda)\|w - v\|^2 \qquad (2)$$

for any $w, v \in \Omega$ and $\lambda \in [0, 1]$. Notice that $f = \frac{1}{n}\sum_{i=1}^n f_i$ is also $L$-smooth since smoothness is preserved under convex combinations.

For some cases, we also assume each $f_i$ is $G$-Lipschitz: $\|\nabla f_i(w)\| \le G$ for any $w \in \Omega$, and $f$ (although not necessarily each $f_i$) is $\alpha$-strongly convex, that is, $f(w) - f(v) \le \nabla f(w)^\top (w - v) - \frac{\alpha}{2}\|w - v\|^2$ for any $w, v \in \Omega$. As usual, $\mu = L/\alpha$ is called the condition number of $f$.

We assume the domain $\Omega \subset \mathbb{R}^d$ is a compact convex set with diameter $D$. We are interested in the case where linear optimization on $\Omega$, formally $\mathrm{argmin}_{v \in \Omega} w^\top v$ for any $w \in \mathbb{R}^d$, is much faster than projection onto $\Omega$, formally $\mathrm{argmin}_{v \in \Omega} \|w - v\|^2$.

Footnote 1: We thank Sebastian Pokutta and Gábor Braun for pointing out that $f_i$ needs to be defined over $\mathbb{R}^d$, rather than only over $\Omega$, in order for property (1) to hold.
Examples of such domains include the set of all bounded trace norm matrices, the convex hull of all rotation matrices, flow polytope and many more (see for instance (Hazan & Kale, 2012)).

2.1. Example Application: Multiclass Classification

Consider a multiclass classification problem where a set of training examples $(e_i, y_i)_{i=1,\ldots,n}$ is given beforehand. Here $e_i \in \mathbb{R}^m$ is a feature vector and $y_i \in \{1, \ldots, h\}$ is the label. Our goal is to find an accurate linear predictor, a matrix $w = [w_1^\top; \ldots; w_h^\top] \in \mathbb{R}^{h \times m}$ that predicts $\mathrm{argmax}_\ell\, w_\ell^\top e$ for any example $e$. Note that here the dimensionality $d$ is $hm$.

Previous work (Dudik et al., 2012; Zhang et al., 2012) found that finding $w$ by minimizing a regularized multivariate logistic loss gives a very accurate predictor in general. Specifically, the objective can be written in our notation with

$$f_i(w) = \log\Big(1 + \sum_{\ell \ne y_i} \exp\big(w_\ell^\top e_i - w_{y_i}^\top e_i\big)\Big)$$

and $\Omega = \{w \in \mathbb{R}^{h \times m} : \|w\|_* \le \tau\}$ where $\|\cdot\|_*$ denotes the matrix trace norm. In this case, projecting onto $\Omega$ is equivalent to performing an SVD, which takes $O(hm \min\{h, m\})$ time, while linear optimization on $\Omega$ amounts to finding the top singular vector, which can be done in time linear in the number of non-zeros in the corresponding $h$ by $m$ matrix, and is thus much faster. One can also verify that each $f_i$ is smooth. The number of examples $n$ can be prohibitively large for non-stochastic methods (for instance, tens of millions for the ImageNet dataset (Deng et al., 2009)), which makes stochastic optimization necessary.

2.2. Detailed Efficiency Comparisons

We call $\nabla f_i(w)$ a stochastic gradient for $f$ at some $w$, where $i$ is picked from $\{1, \ldots, n\}$ uniformly at random. Note that a stochastic gradient $\nabla f_i(w)$ is an unbiased estimator of the exact gradient $\nabla f(w)$. The efficiency of a projection-free algorithm is measured by how many exact gradient evaluations, stochastic gradient evaluations and linear optimizations respectively are needed to achieve $1-\epsilon$ accuracy, that is, to output a point $w \in \Omega$ such that $\mathbb{E}[f(w) - f(w^*)] \le \epsilon$ where $w^* \in \mathrm{argmin}_{w \in \Omega} f(w)$ is any optimum.
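To make the linear optimization step of Section 2.1 concrete, the following is a minimal sketch, not the authors' implementation. It relies on the standard fact that $\mathrm{argmin}_{\|V\|_* \le \tau} \langle G, V \rangle = -\tau u_1 v_1^\top$, where $u_1, v_1$ are the top singular vectors of $G$, and approximates them with plain power iteration. The function names, the fixed iteration count and the all-ones starting vector are assumptions made for illustration; the sketch also assumes $G$ is nonzero and the starting vector is not orthogonal to the top right singular vector.

```python
import math


def matvec(M, x):
    """Return M x for a matrix stored as a list of rows."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def rmatvec(M, y):
    """Return M^T y."""
    return [sum(row[j] * yi for row, yi in zip(M, y)) for j in range(len(M[0]))]


def normalize(x):
    """Return x / ||x|| together with ||x|| (assumes x is nonzero)."""
    norm = math.sqrt(sum(c * c for c in x))
    return [c / norm for c in x], norm


def lmo_trace_norm_ball(G, tau, iters=100):
    """Approximate argmin of <G, V> over {V : ||V||_* <= tau}.

    Power iteration on G^T G estimates the top singular pair (u, v);
    the minimizer is the rank-one matrix -tau * u v^T.
    """
    v, _ = normalize([1.0] * len(G[0]))  # hypothetical deterministic start
    for _ in range(iters):
        u, _ = normalize(matvec(G, v))
        v, _ = normalize(rmatvec(G, u))
    u, sigma = normalize(matvec(G, v))   # sigma estimates the top singular value
    V = [[-tau * ui * vj for vj in v] for ui in u]
    return V, sigma


# Toy 2 x 3 "gradient" whose singular values are 3 and 1.
G = [[3.0, 0.0, 0.0],
     [0.0, 1.0, 0.0]]
V, sigma = lmo_trace_norm_ball(G, tau=50.0)
inner = sum(g * x for g_row, v_row in zip(G, V) for g, x in zip(g_row, v_row))
print(sigma, inner)
```

Each iteration costs two matrix-vector products, which is where the "linear in the number of non-zeros" cost comes from when $G$ is stored sparsely; a projection would instead need the full SVD.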
Footnote 2: In general, condition "G-Lipschitz" in Table 2 means each $f_i$ is $G$-Lipschitz, except for our STORC algorithm which only requires $f$ being $G$-Lipschitz.

In Table 2, we summarize the efficiency (and the extra assumptions needed besides convexity and smoothness; see footnote 2) of existing

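algorithms in the literature as well as the two new algorithms we propose. Below we briefly explain these results from top to bottom.

| Algorithm | Extra Conditions | #Exact Gradients | #Stochastic Gradients | #Linear Optimizations |
| Frank-Wolfe | (none) | $O(LD^2/\epsilon)$ | 0 | $O(LD^2/\epsilon)$ |
| (Garber & Hazan, 2013) | $\alpha$-strongly convex, $\Omega$ is polytope | $O(d\mu\rho \ln(LD^2/\epsilon))$ | 0 | $O(d\mu\rho \ln(LD^2/\epsilon))$ |
| SFW | $G$-Lipschitz | 0 | $O(G^2 L D^4/\epsilon^3)$ | $O(LD^2/\epsilon)$ |
| Online-FW (Hazan & Kale, 2012) | $G$-Lipschitz | 0 | $O(d^2 (LD^2+GD)^4/\epsilon^4)$ | $O(d (LD^2+GD)^2/\epsilon^2)$ |
| Online-FW (Hazan & Kale, 2012) | $G$-Lipschitz ($L = \infty$ allowed) | 0 | $O(G^4 D^4/\epsilon^4)$ | $O(G^4 D^4/\epsilon^4)$ |
| SCGS (Lan & Zhou, 2014) | $G$-Lipschitz | 0 | $O(G^2 D^2/\epsilon^2)$ | $O(LD^2/\epsilon)$ |
| SCGS (Lan & Zhou, 2014) | $G$-Lipschitz, $\alpha$-strongly convex | 0 | $O(G^2/(\alpha\epsilon))$ | $O(LD^2/\epsilon)$ |
| SVRF (this work) | (none) | $O(\ln(LD^2/\epsilon))$ | $O(L^2 D^4/\epsilon^2)$ | $O(LD^2/\epsilon)$ |
| STORC (this work) | $G$-Lipschitz | $O(\ln(LD^2/\epsilon))$ | $O(LD^2/\epsilon + \sqrt{L} D^2 G/\epsilon^{1.5})$ | $O(LD^2/\epsilon)$ |
| STORC (this work) | $\nabla f(w^*) = 0$ | $O(\ln(LD^2/\epsilon))$ | $O(LD^2/\epsilon)$ | $O(LD^2/\epsilon)$ |
| STORC (this work) | $\alpha$-strongly convex | $O(\ln(LD^2/\epsilon))$ | $O(\mu^2 \ln(LD^2/\epsilon))$ | $O(LD^2/\epsilon)$ |

Table 2: Comparisons of different Frank-Wolfe variants (see Section 2.2 for further explanations).

The standard Frank-Wolfe algorithm:

$$v_k = \mathrm{argmin}_{v \in \Omega} \nabla f(w_{k-1})^\top v, \qquad w_k = (1-\gamma_k) w_{k-1} + \gamma_k v_k \qquad (3)$$

for some appropriately chosen $\gamma_k$ requires $O(1/\epsilon)$ iterations without additional conditions (Frank & Wolfe, 1956; Jaggi, 2013). In a recent paper, Garber & Hazan (2013) give a variant that requires $O(d\mu\rho \ln(1/\epsilon))$ iterations when $f$ is strongly convex and smooth, and $\Omega$ is a polytope (see footnote 3). Although the dependence on $\epsilon$ is much better, the geometric constant $\rho$ depends on the polyhedral set and can be very large. Moreover, each iteration of the algorithm requires further computation besides the linear optimization step.

The most obvious way to obtain a stochastic Frank-Wolfe variant is to replace $\nabla f(w_{k-1})$ by some $\nabla f_i(w_{k-1})$, or more generally the average of some number of iid samples of $\nabla f_i(w_{k-1})$ (mini-batch approach). We call this method SFW and include its analysis in Appendix B since we do not find it explicitly analyzed before. SFW needs $O(1/\epsilon^3)$ stochastic gradients and $O(1/\epsilon)$ linear optimization steps to reach an $\epsilon$-approximate optimum.

The work by Hazan & Kale (2012) focuses on an online learning setting. One can extract two results from this work for the setting studied here.

Footnote 3: See also recent follow up work (Lacoste-Julien & Jaggi, 2015).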
In any case, the result is worse than SFW for both the number of stochastic gradients and the number of linear optimizations.

Stochastic Conditional Gradient Sliding (SCGS), recently proposed by Lan & Zhou (2014), uses Nesterov's acceleration technique to speed up Frank-Wolfe. Without strong convexity, SCGS needs $O(1/\epsilon^2)$ stochastic gradients, improving SFW. With strong convexity, this number can even be improved to $O(1/\epsilon)$. In both cases, the number of linear optimization steps is $O(1/\epsilon)$.

The key idea of our algorithms is to combine the variance reduction technique proposed in (Johnson & Zhang, 2013; Mahdavi et al., 2013) with some of the above-mentioned algorithms. For example, our algorithm SVRF combines this technique with SFW, also improving the number of stochastic gradients from $O(1/\epsilon^3)$ to $O(1/\epsilon^2)$, but without any extra conditions (such as Lipschitzness required for SCGS). More importantly, despite having seemingly the same convergence rate, SVRF substantially outperforms SCGS empirically (see Section 5).

On the other hand, our second algorithm STORC combines variance reduction with SCGS, providing even further improvements. Specifically, the number of stochastic gradients is improved to: $O(1/\epsilon^{1.5})$ when $f$ is Lipschitz; $O(1/\epsilon)$ when $\nabla f(w^*) = 0$; and finally $O(\ln(1/\epsilon))$ when $f$ is strongly convex.

Footnote 4 (on the two results extracted from Hazan & Kale, 2012): The first result comes from the setting where the online loss functions are stochastic, and the second one comes from a completely online setting with the standard online-to-batch conversion.

Note that the condition $\nabla f(w^*) = 0$ essentially

means that $w^*$ is in the interior of $\Omega$, but it is still an interesting case when the optimum is not unique and doing unconstrained optimization would not necessarily return a point in $\Omega$.

Both of our algorithms require $O(1/\epsilon)$ linear optimization steps as in previous work, and overall require computing $O(\ln(LD^2/\epsilon))$ exact gradients. However, we emphasize that this extra overhead is much more affordable compared to non-stochastic Frank-Wolfe (that is, computing exact gradients every iteration) since it does not have any polynomial dependence on parameters such as $d$, $L$ or $\mu$.

2.3. Variance-Reduced Stochastic Gradients

Originally proposed in (Johnson & Zhang, 2013) and independently in (Mahdavi et al., 2013), the idea of variance-reduced stochastic gradients has proven to be highly useful and has been extended to various different algorithms (such as (Frostig et al., 2015; Moritz et al., 2016)).

A variance-reduced stochastic gradient at some point $w \in \Omega$ with some snapshot $w_0 \in \Omega$ is defined as

$$\tilde\nabla f(w; w_0) = \nabla f_i(w) - \big(\nabla f_i(w_0) - \nabla f(w_0)\big),$$

where $i$ is again picked from $\{1, \ldots, n\}$ uniformly at random. The snapshot $w_0$ is usually a decision point from some previous iteration of the algorithm and its exact gradient $\nabla f(w_0)$ has been pre-computed before, so that computing $\tilde\nabla f(w; w_0)$ only requires two standard stochastic gradient evaluations: $\nabla f_i(w)$ and $\nabla f_i(w_0)$.

A variance-reduced stochastic gradient is clearly also unbiased, that is, $\mathbb{E}[\tilde\nabla f(w; w_0)] = \nabla f(w)$. More importantly, the term $\nabla f_i(w_0) - \nabla f(w_0)$ serves as a correction term to reduce the variance of the stochastic gradient. Formally, one can prove the following.

Lemma 1. For any $w, w_0 \in \Omega$, we have

$$\mathbb{E}\big[\|\tilde\nabla f(w; w_0) - \nabla f(w)\|^2\big] \le 6L\big(2\,\mathbb{E}[f(w) - f(w^*)] + \mathbb{E}[f(w_0) - f(w^*)]\big).$$

In words, the variance of the variance-reduced stochastic gradient is bounded by how close the current point and the snapshot are to the optimum. The original work proves a bound on $\mathbb{E}[\|\tilde\nabla f(w; w_0)\|^2]$ under the assumption $\nabla f(w^*) = 0$, which we do not require here.
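The definition above can be checked numerically on a toy problem. The sketch below is an illustration under stated assumptions, not code from the paper: the components $f_i(w) = \frac{1}{2}(a_i^\top w - b_i)^2$, the three data points and all function names are hypothetical. Because $i$ is uniform over $\{1, \ldots, n\}$, expectations are computed exactly by averaging over all $i$.

```python
def grad_i(a, b, w):
    """Gradient of the hypothetical component f_i(w) = 0.5 * (a . w - b)^2."""
    r = sum(x * y for x, y in zip(a, w)) - b
    return [r * x for x in a]


def full_grad(A, b, w):
    """Exact gradient of f = (1/n) * sum_i f_i."""
    n = len(A)
    columns = zip(*(grad_i(A[i], b[i], w) for i in range(n)))
    return [sum(c) / n for c in columns]


def vr_grad(A, b, i, w, w0, g0):
    """Variance-reduced gradient: grad f_i(w) - (grad f_i(w0) - grad f(w0))."""
    gi_w = grad_i(A[i], b[i], w)
    gi_w0 = grad_i(A[i], b[i], w0)
    return [p - (q - r) for p, q, r in zip(gi_w, gi_w0, g0)]


def mean_sq_dev(estimates, target):
    """Average squared Euclidean distance of the estimates from the target."""
    total = sum(sum((e - t) ** 2 for e, t in zip(est, target)) for est in estimates)
    return total / len(estimates)


A = [[1.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
b = [1.0, -1.0, 0.5]
w0 = [0.30, -0.20]          # snapshot
w = [0.35, -0.25]           # current point, close to the snapshot
g0 = full_grad(A, b, w0)    # pre-computed exact gradient at the snapshot
g = full_grad(A, b, w)

vr = [vr_grad(A, b, i, w, w0, g0) for i in range(len(A))]
plain = [grad_i(A[i], b[i], w) for i in range(len(A))]

mean_vr = [sum(c) / len(vr) for c in zip(*vr)]
var_vr = mean_sq_dev(vr, g)
var_plain = mean_sq_dev(plain, g)
print(mean_vr, g, var_vr, var_plain)
```

On this example the average of the variance-reduced gradients matches the exact gradient, and their spread around it is much smaller than that of the plain stochastic gradients because $w$ is close to the snapshot. Lemma 1 is stated in terms of suboptimality of $w$ and $w_0$ rather than their distance, so this only illustrates the mechanism.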
However, the main idea of the proof is similar and we defer it to Section 6.

3. Stochastic Variance-Reduced Frank-Wolfe

With the previous discussion, our first algorithm is pretty straightforward: compared to the standard Frank-Wolfe, we simply replace the exact gradient with the average of a mini-batch of variance-reduced stochastic gradients, and take snapshots every once in a while. We call this algorithm Stochastic Variance-Reduced Frank-Wolfe (SVRF), whose pseudocode is presented in Alg 1. The convergence rate of this algorithm is shown in the following theorem.

Algorithm 1 Stochastic Variance-Reduced Frank-Wolfe (SVRF)
1: Input: Objective function $f = \frac{1}{n}\sum_{i=1}^n f_i$.
2: Input: Parameters $\gamma_k$, $m_k$ and $N_t$.
3: Initialize: $w_0 = \mathrm{argmin}_{w \in \Omega} \nabla f(x)^\top w$ for some arbitrary $x \in \Omega$.
4: for $t = 1, 2, \ldots, T$ do
5:   Take snapshot: $x_0 = w_{t-1}$ and compute $\nabla f(x_0)$.
6:   for $k = 1$ to $N_t$ do
7:     Compute $\tilde\nabla_k$, the average of $m_k$ iid samples of $\tilde\nabla f(x_{k-1}; x_0)$.
8:     Compute $v_k = \mathrm{argmin}_{v \in \Omega} \tilde\nabla_k^\top v$.
9:     Compute $x_k = (1-\gamma_k) x_{k-1} + \gamma_k v_k$.
10:   end for
11:   Set $w_t = x_{N_t}$.
12: end for

Theorem 1. With the following parameters,

$$\gamma_k = \frac{2}{k+1}, \qquad m_k = 96(k+1), \qquad N_t = 2^{t+3} - 2,$$

Algorithm 1 ensures $\mathbb{E}[f(w_t) - f(w^*)] \le \frac{LD^2}{2^{t+1}}$ for any $t$.

Before proving this theorem, we first show a direct implication of this convergence result.

Corollary 1. To achieve $1-\epsilon$ accuracy, Algorithm 1 requires $O(\ln(LD^2/\epsilon))$ exact gradient evaluations, $O(L^2D^4/\epsilon^2)$ stochastic gradient evaluations and $O(LD^2/\epsilon)$ linear optimizations.

Proof. According to the algorithm and the choice of parameters, it is clear that these three numbers are $T+1$, $\sum_{t=1}^{T}\sum_{k=1}^{N_t} m_k = O(4^T)$ and $\sum_{t=1}^{T} N_t = O(2^T)$ respectively. Theorem 1 implies that $T$ should be of order $\Theta(\log_2(LD^2/\epsilon))$. Plugging in all parameters concludes the proof.

To prove Theorem 1, we first consider a fixed iteration $t$ and prove the following lemma:

Lemma 2. For any $k$, we have $\mathbb{E}[f(x_k) - f(w^*)] \le \frac{4LD^2}{k+2}$ if $\mathbb{E}[\|\tilde\nabla_s - \nabla f(x_{s-1})\|^2] \le \frac{L^2D^2}{(s+1)^2}$ for all $s \le k$.
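Algorithm 1 with the parameters of Theorem 1 can be sketched in a few dozen lines. The sketch below is an illustration under loud assumptions, not the authors' implementation: the domain is an $\ell_1$ ball (chosen only because its linear optimization is a one-liner), the components are hypothetical least-squares losses $f_i(w) = \frac{1}{2}(a_i^\top w - b_i)^2$ on synthetic data, and the arbitrary point $x$ in Line 3 is taken to be the origin. All function names are made up for this example.

```python
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def lmo_l1_ball(g, radius):
    """Linear optimization over the l1 ball: argmin of g . v with ||v||_1 <= radius."""
    j = max(range(len(g)), key=lambda i: abs(g[i]))
    v = [0.0] * len(g)
    v[j] = -radius if g[j] > 0 else radius
    return v


def make_problem(n, w_star, rng):
    """Hypothetical realizable least-squares data with b_i = a_i . w_star."""
    A = [[rng.gauss(0.0, 1.0) for _ in w_star] for _ in range(n)]
    return A, [dot(a, w_star) for a in A]


def grad_i(A, b, i, w):
    """Gradient of f_i(w) = 0.5 * (a_i . w - b_i)^2."""
    r = dot(A[i], w) - b[i]
    return [r * a for a in A[i]]


def full_grad(A, b, w):
    g = [0.0] * len(w)
    for i in range(len(A)):
        g = [gj + c / len(A) for gj, c in zip(g, grad_i(A, b, i, w))]
    return g


def objective(A, b, w):
    return sum(0.5 * (dot(a, w) - bi) ** 2 for a, bi in zip(A, b)) / len(A)


def svrf(A, b, radius, T, rng):
    n, d = len(A), len(A[0])
    w = lmo_l1_ball(full_grad(A, b, [0.0] * d), radius)  # Line 3, with x = 0
    w_init = list(w)
    for t in range(1, T + 1):
        snapshot = list(w)                       # Line 5: x_0 = w_{t-1}
        g_snap = full_grad(A, b, snapshot)       # one exact gradient per epoch
        x = list(w)
        for k in range(1, 2 ** (t + 3) - 2 + 1):     # N_t = 2^(t+3) - 2
            m_k, gamma = 96 * (k + 1), 2.0 / (k + 1)
            est = list(g_snap)                   # Line 7: mini-batch VR gradient
            for _ in range(m_k):
                i = rng.randrange(n)
                gx, gs = grad_i(A, b, i, x), grad_i(A, b, i, snapshot)
                est = [e + (p - q) / m_k for e, p, q in zip(est, gx, gs)]
            v = lmo_l1_ball(est, radius)         # Line 8
            x = [(1 - gamma) * xi + gamma * vi for xi, vi in zip(x, v)]  # Line 9
        w = x                                    # Line 11
    return w_init, w


rng = random.Random(0)
A, b = make_problem(40, [0.2, -0.1, 0.0, 0.0, 0.0], rng)
w_init, w_final = svrf(A, b, radius=1.0, T=2, rng=rng)
print(objective(A, b, w_init), objective(A, b, w_final))
```

Every iterate is a convex combination of vertices of the ball, so feasibility holds by construction and no projection is ever computed. The inner counter $k$ is reset at each snapshot exactly as written in Algorithm 1.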

We defer the proof of this lemma to Section 6 for coherence. With the help of Lemma 2, we are now ready to prove the main convergence result.

Proof of Theorem 1. We prove by induction. For $t = 0$, by smoothness, the optimality of $w_0$ and convexity, we have

$$f(w_0) \le f(x) + \nabla f(x)^\top (w_0 - x) + \frac{L}{2}\|w_0 - x\|^2 \le f(x) + \nabla f(x)^\top (w^* - x) + \frac{LD^2}{2} \le f(w^*) + \frac{LD^2}{2}.$$

Now assuming $\mathbb{E}[f(w_{t-1}) - f(w^*)] \le \frac{LD^2}{2^t}$, we consider iteration $t$ of the algorithm and use another induction to show $\mathbb{E}[f(x_k) - f(w^*)] \le \frac{4LD^2}{k+2}$ for any $k \le N_t$. The base case is trivial since $x_0 = w_{t-1}$. Suppose $\mathbb{E}[f(x_{s-1}) - f(w^*)] \le \frac{4LD^2}{s+1}$ for any $s \le k$. Now because $\tilde\nabla_s$ is the average of $m_s$ iid samples of $\tilde\nabla f(x_{s-1}; x_0)$, its variance is reduced by a factor of $m_s$. That is, with Lemma 1 we have

$$\mathbb{E}[\|\tilde\nabla_s - \nabla f(x_{s-1})\|^2] \le \frac{6L}{m_s}\big(2\,\mathbb{E}[f(x_{s-1}) - f(w^*)] + \mathbb{E}[f(x_0) - f(w^*)]\big) \le \frac{6L}{m_s}\Big(\frac{8LD^2}{s+1} + \frac{LD^2}{2^t}\Big) \le \frac{6L}{m_s}\Big(\frac{8LD^2}{s+1} + \frac{8LD^2}{s+1}\Big) = \frac{L^2D^2}{(s+1)^2},$$

where the last inequality is by the fact $s \le N_t = 2^{t+3} - 2$ and the last equality is by plugging in the choice of $m_s$. Therefore the condition of Lemma 2 is satisfied and the induction is completed. Finally with the choice of $N_t$ we thus prove $\mathbb{E}[f(w_t) - f(w^*)] = \mathbb{E}[f(x_{N_t}) - f(w^*)] \le \frac{4LD^2}{N_t + 2} = \frac{LD^2}{2^{t+1}}$.

We remark that in Alg 1, we essentially restart the algorithm (that is, resetting $k$ to 1) after taking a new snapshot. However, another option is to continue increasing $k$ and never reset it. Although one can show that this only leads to a constant speed-up for the convergence, it provides more stable updates and is thus what we implement in experiments.

4. Stochastic Variance-Reduced Conditional Gradient Sliding

Our second algorithm applies variance reduction to the SCGS algorithm (Lan & Zhou, 2014). Again, the key difference is that we replace the stochastic gradients with the average of a mini-batch of variance-reduced stochastic gradients, and take snapshots every once in a while. See pseudocode in Alg 2 for details.

Algorithm 2 STOchastic variance-Reduced Conditional gradient sliding (STORC)
1: Input: Objective function $f = \frac{1}{n}\sum_{i=1}^n f_i$.
2: Input: Parameters $\gamma_k$, $\beta_k$, $\eta_{t,k}$, $m_{t,k}$ and $N_t$.
3: Initialize: $w_0 = \mathrm{argmin}_{w \in \Omega} \nabla f(x)^\top w$ for some arbitrary $x \in \Omega$.
4: for $t = 1, 2, \ldots$ do
5:   Take snapshot: $y_0 = w_{t-1}$ and compute $\nabla f(y_0)$.
6:   Initialize $x_0 = y_0$.
7:   for $k = 1$ to $N_t$ do
8:     Compute $z_k = (1-\gamma_k) y_{k-1} + \gamma_k x_{k-1}$.
9:     Compute $\tilde\nabla_k$, the average of $m_{t,k}$ iid samples of $\tilde\nabla f(z_k; y_0)$.
10:     Let $g(x) = \frac{\beta_k}{2}\|x - x_{k-1}\|^2 + \tilde\nabla_k^\top x$.
11:     Compute $x_k$, the output of using standard Frank-Wolfe to solve $\min_{x \in \Omega} g(x)$ until the duality gap is at most $\eta_{t,k}$, that is,
        $$\max_{x \in \Omega} \nabla g(x_k)^\top (x_k - x) \le \eta_{t,k}. \qquad (4)$$
12:     Compute $y_k = (1-\gamma_k) y_{k-1} + \gamma_k x_k$.
13:   end for
14:   Set $w_t = y_{N_t}$.
15: end for

The algorithm makes use of two auxiliary sequences $x_k$ and $z_k$ (Lines 8 and 12), which is standard for Nesterov's algorithm. $x_k$ is obtained by approximately solving a squared-norm regularized linear optimization so that it is close to $x_{k-1}$ (Line 11). Note that this step does not require computing any extra gradients of $f$ or $f_i$, and is done by performing the standard Frank-Wolfe algorithm (Eq. (3)) until the duality gap is at most a certain value $\eta_{t,k}$. The duality gap is a certificate of approximate optimality (see (Jaggi, 2013)), and is a side product of the linear optimization performed at each step, requiring no extra cost.

Also note that the stochastic gradients are computed at the sequence $z_k$ instead of $y_k$, which is also standard in Nesterov's algorithm. However, according to Lemma 1, we thus need to show the convergence rate of the auxiliary sequence $z_k$, which appears to be rarely studied previously to the best of our knowledge. This is one of the key steps in our analysis.

The main convergence result of STORC is the following:

Theorem 2. With the following parameters (where $D_t$ is defined later below):

$$\gamma_k = \frac{2}{k+1}, \qquad \beta_k = \frac{3L}{k}, \qquad \eta_{t,k} = \frac{2LD_t^2}{N_t k},$$

Algorithm 2 ensures $\mathbb{E}[f(w_t) - f(w^*)] \le \frac{LD^2}{2^{t+1}}$ for any $t$ if any of the following three cases holds:

(a) $\nabla f(w^*) = 0$ and $D_t^2 = D^2$, $N_t = \lceil 2^{t/2+2} \rceil$, $m_{t,k} = 900 N_t$.
(b) $f$ is $G$-Lipschitz and $D_t^2 = D^2$, $N_t = \lceil 2^{t/2+2} \rceil$, $m_{t,k} = 700 N_t + \frac{24 N_t G (k+1)}{LD}$.
(c) $f$ is $\alpha$-strongly convex and $D_t^2 = \frac{\mu D^2}{2^{t-1}}$, $N_t = \lceil \sqrt{32\mu} \rceil$, $m_{t,k} = 5600 N_t \mu$ where $\mu = L/\alpha$.

Again we first give a direct implication of the above result:

Corollary 2. To achieve $1-\epsilon$ accuracy, Algorithm 2 requires $O(\ln(LD^2/\epsilon))$ exact gradient evaluations and $O(LD^2/\epsilon)$ linear optimizations. The numbers of stochastic gradient evaluations for Cases (a), (b) and (c) are respectively $O(LD^2/\epsilon)$, $O(LD^2/\epsilon + \sqrt{L} D^2 G/\epsilon^{1.5})$ and $O(\mu^2 \ln(LD^2/\epsilon))$.

Proof. Line 11 requires $O(\beta_k D^2/\eta_{t,k})$ iterations of the standard Frank-Wolfe algorithm since $g(x)$ is $\beta_k$-smooth (see e.g. (Jaggi, 2013, Theorem 2)). So the numbers of exact gradient evaluations, stochastic gradient evaluations and linear optimizations are respectively $T+1$, $\sum_{t=1}^{T}\sum_{k=1}^{N_t} m_{t,k}$ and $O\big(\sum_{t=1}^{T}\sum_{k=1}^{N_t} \beta_k D^2/\eta_{t,k}\big)$. Theorem 2 implies that $T$ should be of order $\Theta(\log_2(LD^2/\epsilon))$. Plugging in all parameters proves the corollary.

To prove Theorem 2, we again first consider a fixed iteration $t$ and use the following lemma, which is essentially proven in (Lan & Zhou, 2014). We include a distilled proof in Appendix C for completeness.

Lemma 3. Suppose $\mathbb{E}[\|y_0 - w^*\|^2] \le D_t^2$ holds for some positive constant $D_t \le D$. Then for any $k$, we have

$$\mathbb{E}[f(y_k) - f(w^*)] \le \frac{8LD_t^2}{k(k+1)}$$

if $\mathbb{E}[\|\tilde\nabla_s - \nabla f(z_s)\|^2] \le \frac{L^2D_t^2}{N_t(s+1)^2}$ for all $s \le k$.

Proof of Theorem 2. We prove by induction. The base case $t = 0$ holds by the exact same argument as in the proof of Theorem 1. Suppose $\mathbb{E}[f(w_{t-1}) - f(w^*)] \le \frac{LD^2}{2^t}$ and consider iteration $t$. Below we use another induction to prove $\mathbb{E}[f(y_k) - f(w^*)] \le \frac{8LD_t^2}{k(k+1)}$ for any $1 \le k \le N_t$, which will conclude the proof since for any of the three cases, we have $\mathbb{E}[f(w_t) - f(w^*)] = \mathbb{E}[f(y_{N_t}) - f(w^*)]$, which is at most $\frac{8LD_t^2}{N_t^2} \le \frac{LD^2}{2^{t+1}}$.

We first show that the condition $\mathbb{E}[\|y_0 - w^*\|^2] \le D_t^2$ holds. This is trivial for Cases (a) and (b) when $D_t = D$. For Case (c), by strong convexity and the inductive assumption, we have $\mathbb{E}[\|y_0 - w^*\|^2] \le \frac{2}{\alpha}\mathbb{E}[f(y_0) - f(w^*)] \le \frac{LD^2}{\alpha 2^{t-1}} = D_t^2$.

Next note that Lemma 1 implies that $\mathbb{E}[\|\tilde\nabla_s - \nabla f(z_s)\|^2]$ is at most $\frac{6L}{m_{t,s}}\big(2\,\mathbb{E}[f(z_s) - f(w^*)] + \mathbb{E}[f(y_0) - f(w^*)]\big)$.
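So the key is to bound $\mathbb{E}[f(z_s) - f(w^*)]$. With $z_1 = y_0$ one can verify that $\mathbb{E}[\|\tilde\nabla_1 - \nabla f(z_1)\|^2]$ is at most $\frac{18L}{m_{t,1}}\mathbb{E}[f(y_0) - f(w^*)] \le \frac{18L^2D^2}{2^t m_{t,1}} \le \frac{L^2D_t^2}{4N_t}$ for all three cases, and thus $\mathbb{E}[f(y_s) - f(w^*)] \le \frac{8LD_t^2}{s(s+1)}$ holds for $s = 1$ by Lemma 3. Now suppose it holds for any $s < k$; below we discuss the three cases separately to show that it also holds for $s = k$.

Case (a). By smoothness, the condition $\nabla f(w^*) = 0$, the construction of $z_s$, and Cauchy-Schwarz inequality, we have for any $1 < s \le k$,

$$f(z_s) \le f(y_{s-1}) + (\nabla f(y_{s-1}) - \nabla f(w^*))^\top (z_s - y_{s-1}) + \frac{L}{2}\|z_s - y_{s-1}\|^2$$
$$= f(y_{s-1}) + \gamma_s (\nabla f(y_{s-1}) - \nabla f(w^*))^\top (x_{s-1} - y_{s-1}) + \frac{L\gamma_s^2}{2}\|x_{s-1} - y_{s-1}\|^2$$
$$\le f(y_{s-1}) + \gamma_s D \|\nabla f(y_{s-1}) - \nabla f(w^*)\| + \frac{LD^2\gamma_s^2}{2}.$$

Property (1) and the optimality of $w^*$ imply:

$$\|\nabla f(y_{s-1}) - \nabla f(w^*)\|^2 \le 2L\big(f(y_{s-1}) - f(w^*) - \nabla f(w^*)^\top (y_{s-1} - w^*)\big) \le 2L\big(f(y_{s-1}) - f(w^*)\big).$$

So subtracting $f(w^*)$ and taking expectation on both sides, and applying Jensen's inequality and the inductive assumption, we have

$$\mathbb{E}[f(z_s) - f(w^*)] \le \mathbb{E}[f(y_{s-1}) - f(w^*)] + \gamma_s D\sqrt{2L\,\mathbb{E}[f(y_{s-1}) - f(w^*)]} + \frac{2LD^2}{(s+1)^2}$$
$$\le \frac{8LD^2}{(s-1)s} + \frac{8LD^2}{(s+1)\sqrt{(s-1)s}} + \frac{2LD^2}{(s+1)^2} < \frac{55LD^2}{(s+1)^2}.$$

On the other hand, we have $\mathbb{E}[f(y_0) - f(w^*)] \le \frac{LD^2}{2^t} \le \frac{16LD^2}{(N_t - 1)^2} < \frac{40LD^2}{(N_t+1)^2} \le \frac{40LD^2}{(s+1)^2}$. So $\mathbb{E}[\|\tilde\nabla_s - \nabla f(z_s)\|^2]$ is at most $\frac{900L^2D^2}{m_{t,s}(s+1)^2}$, and the choice of $m_{t,s}$ ensures that this bound is at most $\frac{L^2D^2}{N_t(s+1)^2}$, satisfying the condition of Lemma 3 and thus completing the induction.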

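Case (b). With the $G$-Lipschitz condition we proceed similarly and bound $f(z_s)$ by

$$f(y_{s-1}) + \nabla f(y_{s-1})^\top (z_s - y_{s-1}) + \frac{L}{2}\|z_s - y_{s-1}\|^2 = f(y_{s-1}) + \gamma_s \nabla f(y_{s-1})^\top (x_{s-1} - y_{s-1}) + \frac{L\gamma_s^2}{2}\|x_{s-1} - y_{s-1}\|^2 \le f(y_{s-1}) + \gamma_s GD + \frac{LD^2\gamma_s^2}{2}.$$

So using bounds derived previously and the choice of $m_{t,s}$, we bound $\mathbb{E}[\|\tilde\nabla_s - \nabla f(z_s)\|^2]$ as follows:

$$\frac{6L}{m_{t,s}}\Big(\frac{16LD^2}{(s-1)s} + \frac{4GD}{s+1} + \frac{4LD^2}{(s+1)^2} + \frac{40LD^2}{(s+1)^2}\Big) < \frac{6L}{m_{t,s}}\Big(\frac{4GD}{s+1} + \frac{116LD^2}{(s+1)^2}\Big) < \frac{L^2D^2}{N_t(s+1)^2},$$

again completing the induction.

Case (c). Using the definition of $z_s$ and $y_s$ and direct calculation, one can remove the dependence on $x_s$ and verify

$$y_{s-1} = \frac{s+1}{2s-1} z_s + \frac{s-2}{2s-1} y_{s-2}$$

for any $s \ge 2$. Now we apply Property (2) with $\lambda = \frac{s+1}{2s-1}$:

$$f(y_{s-1}) \ge \frac{s+1}{2s-1} f(z_s) + \frac{s-2}{2s-1} f(y_{s-2}) - \frac{L}{2}\frac{(s+1)(s-2)}{(2s-1)^2}\|z_s - y_{s-2}\|^2$$
$$= f(w^*) + \frac{s+1}{2s-1}\big(f(z_s) - f(w^*)\big) + \frac{s-2}{2s-1}\big(f(y_{s-2}) - f(w^*)\big) - \frac{L(s-2)}{2(s+1)}\|y_{s-1} - y_{s-2}\|^2$$
$$\ge f(w^*) + \frac{1}{2}\big(f(z_s) - f(w^*)\big) - \frac{L}{2}\|y_{s-1} - y_{s-2}\|^2,$$

where the equality is by adding and subtracting $f(w^*)$ and the fact $y_{s-1} - y_{s-2} = \frac{s+1}{2s-1}(z_s - y_{s-2})$, and the last inequality is by $f(y_{s-2}) \ge f(w^*)$ and trivial relaxations. Rearranging gives $f(z_s) - f(w^*) \le 2\big(f(y_{s-1}) - f(w^*)\big) + L\|y_{s-1} - y_{s-2}\|^2$. Applying Cauchy-Schwarz inequality, strong convexity and the fact $\mu \ge 1$, we continue with

$$f(z_s) - f(w^*) \le 2\big(f(y_{s-1}) - f(w^*)\big) + 2L\big(\|y_{s-1} - w^*\|^2 + \|y_{s-2} - w^*\|^2\big)$$
$$\le 2\big(f(y_{s-1}) - f(w^*)\big) + 4\mu\big(f(y_{s-1}) - f(w^*) + f(y_{s-2}) - f(w^*)\big)$$
$$\le 6\mu\big(f(y_{s-1}) - f(w^*)\big) + 4\mu\big(f(y_{s-2}) - f(w^*)\big).$$

For $s \ge 3$, we use the inductive assumption to show $\mathbb{E}[f(z_s) - f(w^*)] \le \frac{48\mu LD_t^2}{(s-1)s} + \frac{32\mu LD_t^2}{(s-2)(s-1)} \le \frac{448\mu LD_t^2}{(s+1)^2}$. The case for $s = 2$ can be verified similarly using the bound on $\mathbb{E}[f(y_0) - f(w^*)]$ and $\mathbb{E}[f(y_1) - f(w^*)]$ (base case). Finally we bound the term $\mathbb{E}[f(y_0) - f(w^*)] \le \frac{LD^2}{2^t} = \frac{LD_t^2}{2\mu} \le \frac{32LD_t^2}{(N_t+1)^2} \le \frac{32LD_t^2}{(s+1)^2}$, and conclude that the variance $\mathbb{E}[\|\tilde\nabla_s - \nabla f(z_s)\|^2]$ is at most

$$\frac{6L}{m_{t,s}}\Big(\frac{896\mu LD_t^2}{(s+1)^2} + \frac{32LD_t^2}{(s+1)^2}\Big) \le \frac{L^2D_t^2}{N_t(s+1)^2},$$

completing the induction by Lemma 3.

5. Experiments

To support our theory, we conduct experiments on the multiclass classification problem mentioned in Sec 2.1. Three datasets are selected from the LIBSVM repository (see footnote 5) with a relatively large number of features, categories and examples, summarized in Table 3.

| dataset | #features | #categories | #examples |
| news20 | 62,061 | 20 | 15,935 |
| rcv1 | 47,236 | 53 | 15,564 |
| aloi | 128 | 1,000 | 108,000 |

Table 3: Summary of datasets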
Footnote 5: cjlin/libsvmtools/datasets/

Recall that the loss function is multivariate logistic loss and $\Omega$ is the set of matrices with bounded trace norm $\tau$. We focus on how fast the loss decreases instead of the final test error rate so that the tuning of $\tau$ is less important, and is fixed to 50 throughout.

We compare six algorithms. Four of them (SFW, SCGS, SVRF, STORC) are projection-free as discussed, and the other two are standard projected stochastic gradient descent (SGD) and its variance-reduced version (SVRG (Johnson & Zhang, 2013)), both of which require expensive projection.

For most of the parameters in these algorithms, we roughly follow what the theory suggests. For example, the size of the mini-batch of stochastic gradients at round $k$ is set to $k^2$, $k^3$ and $k$ respectively for SFW, SCGS and SVRF, and is fixed to 100 for the other three. The number of iterations between taking two snapshots for the variance-reduced methods (SVRG, SVRF and STORC) is fixed to 50. The learning rate is set to the typical decaying sequence $c/\sqrt{k}$ for SGD and a constant $c'$ for SVRG as the original work suggests, for some best tuned $c$ and $c'$.

Since the complexities of computing gradients, performing linear optimization and projecting are very different, we measure the actual running time of the algorithms and see how fast the loss decreases. Results can be found in Figure 1, where one can clearly observe that for all datasets,

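SGD and SVRG are significantly slower compared to the others, due to the expensive projection step, highlighting the usefulness of projection-free algorithms. Moreover, we also observe large improvement gained from the variance reduction technique, especially when comparing SCGS and STORC, as well as SFW and SVRF on the aloi dataset. Interestingly, even though the STORC algorithm gives the best theoretical results, empirically the simpler algorithms SFW and SVRF tend to have consistently better performance.

[Figure 1: Comparison of six algorithms (SGD, SVRG, SCGS, STORC, SFW, SVRF) on three multiclass datasets: (a) news20, (b) rcv1, (c) aloi. Each panel plots loss against time (s). Best viewed in color.]

6. Omitted Proofs

Proof of Lemma 1. Let $\mathbb{E}_i$ denote the conditional expectation given all the past except the realization of $i$. We have

$$\mathbb{E}_i[\|\tilde\nabla f(w; w_0) - \nabla f(w)\|^2] = \mathbb{E}_i[\|\nabla f_i(w) - \nabla f_i(w_0) + \nabla f(w_0) - \nabla f(w)\|^2]$$
$$= \mathbb{E}_i[\|(\nabla f_i(w) - \nabla f_i(w^*)) - (\nabla f_i(w_0) - \nabla f_i(w^*)) + (\nabla f(w_0) - \nabla f(w^*)) - (\nabla f(w) - \nabla f(w^*))\|^2]$$
$$\le 3\,\mathbb{E}_i[\|\nabla f_i(w) - \nabla f_i(w^*)\|^2 + \|(\nabla f_i(w_0) - \nabla f_i(w^*)) - (\nabla f(w_0) - \nabla f(w^*))\|^2 + \|\nabla f(w) - \nabla f(w^*)\|^2]$$
$$\le 3\,\mathbb{E}_i[\|\nabla f_i(w) - \nabla f_i(w^*)\|^2 + \|\nabla f_i(w_0) - \nabla f_i(w^*)\|^2 + \|\nabla f(w) - \nabla f(w^*)\|^2],$$

where the first inequality is Cauchy-Schwarz inequality, and the second one is by the fact $\mathbb{E}_i[\nabla f_i(w_0) - \nabla f_i(w^*)] = \nabla f(w_0) - \nabla f(w^*)$ and that the variance of a random variable is bounded by its second moment.

We now apply Property (1) to bound each of the three terms above. For example, $\mathbb{E}_i\|\nabla f_i(w) - \nabla f_i(w^*)\|^2 \le 2L\,\mathbb{E}_i[f_i(w) - f_i(w^*) - \nabla f_i(w^*)^\top (w - w^*)] = 2L\big(f(w) - f(w^*) - \nabla f(w^*)^\top (w - w^*)\big)$, which is at most $2L(f(w) - f(w^*))$ by the optimality of $w^*$. Proceeding similarly for the other two terms concludes the proof.

Proof of Lemma 2. For any $s \le k$, by smoothness we have $f(x_s) \le f(x_{s-1}) + \nabla f(x_{s-1})^\top (x_s - x_{s-1}) + \frac{L}{2}\|x_s - x_{s-1}\|^2$. Plugging in $x_s = (1-\gamma_s)x_{s-1} + \gamma_s v_s$ gives $f(x_s) \le f(x_{s-1}) + \gamma_s \nabla f(x_{s-1})^\top (v_s - x_{s-1}) + \frac{L\gamma_s^2}{2}\|v_s - x_{s-1}\|^2$. Rewriting and using the fact that $\|v_s - x_{s-1}\| \le D$ leads to

$$f(x_s) \le f(x_{s-1}) + \gamma_s \tilde\nabla_s^\top (v_s - x_{s-1}) + \gamma_s (\nabla f(x_{s-1}) - \tilde\nabla_s)^\top (v_s - x_{s-1}) + \frac{LD^2\gamma_s^2}{2}.$$

The optimality of $v_s$ implies $\tilde\nabla_s^\top v_s \le \tilde\nabla_s^\top w^*$. So with further rewriting we arrive at

$$f(x_s) \le f(x_{s-1}) + \gamma_s \nabla f(x_{s-1})^\top (w^* - x_{s-1}) + \gamma_s (\nabla f(x_{s-1}) - \tilde\nabla_s)^\top (v_s - w^*) + \frac{LD^2\gamma_s^2}{2}.$$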
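By convexity, the term $\nabla f(x_{s-1})^\top (w^* - x_{s-1})$ is bounded by $f(w^*) - f(x_{s-1})$, and by Cauchy-Schwarz inequality, the term $(\nabla f(x_{s-1}) - \tilde\nabla_s)^\top (v_s - w^*)$ is bounded by $D\|\tilde\nabla_s - \nabla f(x_{s-1})\|$, which in expectation is at most $\frac{LD^2}{s+1}$ by the condition on $\mathbb{E}[\|\tilde\nabla_s - \nabla f(x_{s-1})\|^2]$ and Jensen's inequality. Therefore we can bound $\mathbb{E}[f(x_s) - f(w^*)]$ by

$$(1-\gamma_s)\,\mathbb{E}[f(x_{s-1}) - f(w^*)] + \frac{LD^2\gamma_s}{s+1} + \frac{LD^2\gamma_s^2}{2} = (1-\gamma_s)\,\mathbb{E}[f(x_{s-1}) - f(w^*)] + LD^2\gamma_s^2.$$

Finally we prove $\mathbb{E}[f(x_k) - f(w^*)] \le \frac{4LD^2}{k+2}$ by induction. The base case is trivial: $\mathbb{E}[f(x_1) - f(w^*)]$ is bounded by $(1-\gamma_1)\,\mathbb{E}[f(x_0) - f(w^*)] + LD^2\gamma_1^2 = LD^2$ since $\gamma_1 = 1$. Suppose $\mathbb{E}[f(x_{s-1}) - f(w^*)] \le \frac{4LD^2}{s+1}$; then with $\gamma_s = \frac{2}{s+1}$ we bound $\mathbb{E}[f(x_s) - f(w^*)]$ by

$$\frac{4LD^2}{s+1}\Big(1 - \frac{1}{s+1}\Big) \le \frac{4LD^2}{s+2},$$

completing the induction.

7. Conclusion and Open Problems

We conclude that the variance reduction technique, previously shown to be highly useful for gradient descent variants, can also be very helpful in speeding up projection-free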

algorithms. The main open question is, in the strongly convex case, whether the number of stochastic gradients for STORC can be improved from $O(\mu^2 \ln(1/\epsilon))$ to $O(\mu \ln(1/\epsilon))$, which is typical for gradient descent methods, and whether the number of linear optimizations can be improved from $O(1/\epsilon)$ to $O(\ln(1/\epsilon))$.

Acknowledgements

The authors acknowledge support from the National Science Foundation grant IIS and a Google research award.

References

Deng, Jia, Dong, Wei, Socher, Richard, Li, Li-Jia, Li, Kai, and Fei-Fei, Li. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009.

Dudik, Miro, Harchaoui, Zaid, and Malick, Jérôme. Lifted coordinate descent for learning with trace-norm regularization. In Proceedings of the Fifteenth International Conference on Artificial Intelligence and Statistics, 2012.

Frank, Marguerite and Wolfe, Philip. An algorithm for quadratic programming. Naval Research Logistics Quarterly, 3(1-2):95-110, 1956.

Frostig, Roy, Ge, Rong, Kakade, Sham M, and Sidford, Aaron. Competing with the empirical risk minimizer in a single pass. In Proceedings of the 28th Annual Conference on Learning Theory, 2015.

Garber, Dan and Hazan, Elad. A linearly convergent conditional gradient algorithm with applications to online and stochastic optimization. arXiv preprint, 2013.

Harchaoui, Zaid, Juditsky, Anatoli, and Nemirovski, Arkadi. Conditional gradient algorithms for norm-regularized smooth convex optimization. Mathematical Programming, 152(1-2):75-112, 2015.

Hazan, Elad and Kale, Satyen. Projection-free online learning. In Proceedings of the 29th International Conference on Machine Learning, 2012.

Hazan, Elad, Kale, Satyen, and Shalev-Shwartz, Shai. Near-optimal algorithms for online matrix prediction. In COLT 2012 - The 25th Annual Conference on Learning Theory, June 25-27, 2012, Edinburgh, Scotland, 2012.

Jaggi, Martin. Revisiting Frank-Wolfe: Projection-free sparse convex optimization.
In Proceedings of the 30th International Conference on Machine Learning, 2013.

Johnson, Rie and Zhang, Tong. Accelerating stochastic gradient descent using predictive variance reduction. In Advances in Neural Information Processing Systems 27, 2013.

Lacoste-Julien, Simon and Jaggi, Martin. On the global linear convergence of Frank-Wolfe optimization variants. In Advances in Neural Information Processing Systems 29, 2015.

Lan, Guanghui and Zhou, Yi. Conditional gradient sliding for convex optimization. Optimization-Online preprint (4605), 2014.

Mahdavi, Mehrdad, Zhang, Lijun, and Jin, Rong. Mixed optimization for smooth functions. In Advances in Neural Information Processing Systems, 2013.

Moritz, Philipp, Nishihara, Robert, and Jordan, Michael I. A linearly-convergent stochastic L-BFGS algorithm. In Proceedings of the Nineteenth International Conference on Artificial Intelligence and Statistics, 2016.

Nesterov, Yu. E. A method of solving a convex programming problem with convergence rate $O(1/k^2)$. In Soviet Mathematics Doklady, volume 27, 1983.

Zhang, Xinhua, Schuurmans, Dale, and Yu, Yao-liang. Accelerated training for matrix-norm regularization: A boosting approach. In Advances in Neural Information Processing Systems 26, 2012.

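Supplementary material for "Variance-Reduced and Projection-Free Stochastic Optimization"

A. Proof of Property (1)

Proof. We drop the subscript $i$ for conciseness. Define $g(w) = f(w) - \nabla f(v)^\top w$, which is clearly also convex and $L$-smooth. Since $\nabla g(v) = 0$, $v$ is one of the minimizers of $g$. Therefore we have

$$g(v) - g(w) \le g\big(w - \tfrac{1}{L}\nabla g(w)\big) - g(w) \le \nabla g(w)^\top \big(w - \tfrac{1}{L}\nabla g(w) - w\big) + \frac{L}{2}\big\|w - \tfrac{1}{L}\nabla g(w) - w\big\|^2 \quad \text{(by smoothness of } g\text{)}$$
$$= -\frac{1}{2L}\|\nabla g(w)\|^2 = -\frac{1}{2L}\|\nabla f(w) - \nabla f(v)\|^2.$$

Rearranging and plugging in the definition of $g$ concludes the proof.

B. Analysis for SFW

The concrete update of SFW is

$$v_k = \mathrm{argmin}_{v \in \Omega} \tilde\nabla_k^\top v, \qquad w_k = (1-\gamma_k) w_{k-1} + \gamma_k v_k,$$

where $\tilde\nabla_k$ is the average of $m_k$ iid samples of the stochastic gradient $\nabla f_i(w_{k-1})$. The convergence rate of SFW is presented below.

Theorem 3. If each $f_i$ is $G$-Lipschitz, then with $\gamma_k = \frac{2}{k+1}$ and $m_k = \big(\frac{G(k+1)}{LD}\big)^2$, SFW ensures for any $k$,

$$\mathbb{E}[f(w_k) - f(w^*)] \le \frac{4LD^2}{k+2}.$$

Proof. Similar to the proof of Lemma 2, we first proceed as follows:

$$f(w_k) \le f(w_{k-1}) + \nabla f(w_{k-1})^\top (w_k - w_{k-1}) + \frac{L}{2}\|w_k - w_{k-1}\|^2 \quad \text{(smoothness)}$$
$$= f(w_{k-1}) + \gamma_k \nabla f(w_{k-1})^\top (v_k - w_{k-1}) + \frac{L\gamma_k^2}{2}\|v_k - w_{k-1}\|^2 \quad (w_k - w_{k-1} = \gamma_k (v_k - w_{k-1}))$$
$$\le f(w_{k-1}) + \gamma_k \tilde\nabla_k^\top (v_k - w_{k-1}) + \gamma_k (\nabla f(w_{k-1}) - \tilde\nabla_k)^\top (v_k - w_{k-1}) + \frac{LD^2\gamma_k^2}{2} \quad (\|v_k - w_{k-1}\| \le D)$$
$$\le f(w_{k-1}) + \gamma_k \tilde\nabla_k^\top (w^* - w_{k-1}) + \gamma_k (\nabla f(w_{k-1}) - \tilde\nabla_k)^\top (v_k - w_{k-1}) + \frac{LD^2\gamma_k^2}{2} \quad \text{(by optimality of } v_k\text{)}$$
$$= f(w_{k-1}) + \gamma_k \nabla f(w_{k-1})^\top (w^* - w_{k-1}) + \gamma_k (\nabla f(w_{k-1}) - \tilde\nabla_k)^\top (v_k - w^*) + \frac{LD^2\gamma_k^2}{2}$$
$$\le f(w_{k-1}) + \gamma_k \big(f(w^*) - f(w_{k-1})\big) + \gamma_k D\|\tilde\nabla_k - \nabla f(w_{k-1})\| + \frac{LD^2\gamma_k^2}{2},$$

where the last step is by convexity and Cauchy-Schwarz inequality. Since $f_i$ is $G$-Lipschitz, with Jensen's inequality, we further have $\mathbb{E}[\|\tilde\nabla_k - \nabla f(w_{k-1})\|] \le \sqrt{\mathbb{E}[\|\tilde\nabla_k - \nabla f(w_{k-1})\|^2]} \le \frac{G}{\sqrt{m_k}}$, which is at most $\frac{LD\gamma_k}{2}$ with the choice of $\gamma_k$ and $m_k$. So we arrive at $\mathbb{E}[f(w_k) - f(w^*)] \le (1-\gamma_k)\,\mathbb{E}[f(w_{k-1}) - f(w^*)] + LD^2\gamma_k^2$. It remains to use a simple induction to conclude the proof.

Now it is clear that to achieve $1-\epsilon$ accuracy, SFW needs $O(LD^2/\epsilon)$ iterations, and in total $O\big(\frac{G^2}{L^2D^2}\big(\frac{LD^2}{\epsilon}\big)^3\big) = O\big(\frac{G^2LD^4}{\epsilon^3}\big)$ stochastic gradients.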
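For readers who want the "simple induction" in the proof of Theorem 3 spelled out, the following worked step is one way to do it. It uses only the recursion $\mathbb{E}[f(w_k) - f(w^*)] \le (1-\gamma_k)\,\mathbb{E}[f(w_{k-1}) - f(w^*)] + LD^2\gamma_k^2$ and $\gamma_k = \frac{2}{k+1}$; the shorthand $a_k$ is introduced here for illustration and does not appear in the original.

```latex
% Shorthand (introduced for this sketch): a_k := E[f(w_k) - f(w^*)].
% Recursion from the proof: a_k <= (1 - gamma_k) a_{k-1} + L D^2 gamma_k^2.
\begin{align*}
\text{Base case } (k = 1,\ \gamma_1 = 1):\quad
  a_1 &\le 0 \cdot a_0 + LD^2 = LD^2 \le \frac{4LD^2}{3}. \\
\text{Inductive step } \Big(a_{k-1} \le \frac{4LD^2}{k+1}\Big):\quad
  a_k &\le \Big(1 - \frac{2}{k+1}\Big)\frac{4LD^2}{k+1} + \frac{4LD^2}{(k+1)^2} \\
      &= \frac{4LD^2\,k}{(k+1)^2}
       \le \frac{4LD^2}{k+2},
\end{align*}
% The last inequality holds because k(k+2) = (k+1)^2 - 1 <= (k+1)^2.
```

The same computation closes the induction at the end of the proof of Lemma 2, since both arguments end in the identical recursion.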

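C. Proof of Lemma 3

Proof. Let $\delta_s = \tilde\nabla_s - \nabla f(z_s)$. For any $s \le k$, we proceed as follows:

$$
\begin{aligned}
f(y_s) &\le f(z_s) + \nabla f(z_s)^\top (y_s - z_s) + \tfrac{L}{2}\|y_s - z_s\|^2 && \text{(by smoothness)} \\
&= (1-\gamma_s)\big(f(z_s) + \nabla f(z_s)^\top (y_{s-1} - z_s)\big) + \gamma_s \big(f(z_s) + \nabla f(z_s)^\top (w^* - z_s)\big) \\
&\quad + \gamma_s \nabla f(z_s)^\top (x_s - w^*) + \tfrac{L\gamma_s^2}{2}\|x_s - x_{s-1}\|^2 && \text{(by definition of } y_s \text{ and } z_s\text{)} \\
&\le (1-\gamma_s) f(y_{s-1}) + \gamma_s f(w^*) + \gamma_s \nabla f(z_s)^\top (x_s - w^*) + \tfrac{L\gamma_s^2}{2}\|x_s - x_{s-1}\|^2 && \text{(by convexity)} \\
&= (1-\gamma_s) f(y_{s-1}) + \gamma_s f(w^*) + \gamma_s \tilde\nabla_s^\top (x_s - w^*) + \tfrac{L\gamma_s^2}{2}\|x_s - x_{s-1}\|^2 + \gamma_s \delta_s^\top (w^* - x_s) \\
&\le (1-\gamma_s) f(y_{s-1}) + \gamma_s f(w^*) + \gamma_s \eta_{t,s} - \gamma_s \beta_s (x_s - x_{s-1})^\top (x_s - w^*) \\
&\quad + \tfrac{L\gamma_s^2}{2}\|x_s - x_{s-1}\|^2 + \gamma_s \delta_s^\top (w^* - x_s) && \text{(by Eq. 4)} \\
&= (1-\gamma_s) f(y_{s-1}) + \gamma_s f(w^*) + \gamma_s \eta_{t,s} + \tfrac{\beta_s\gamma_s}{2}\big(\|x_{s-1} - w^*\|^2 - \|x_s - w^*\|^2\big) \\
&\quad + \tfrac{\gamma_s}{2}\Big((L\gamma_s - \beta_s)\|x_s - x_{s-1}\|^2 + 2\delta_s^\top (x_{s-1} - x_s) + 2\delta_s^\top (w^* - x_{s-1})\Big) \\
&\le (1-\gamma_s) f(y_{s-1}) + \gamma_s f(w^*) + \gamma_s \eta_{t,s} + \tfrac{\beta_s\gamma_s}{2}\big(\|x_{s-1} - w^*\|^2 - \|x_s - w^*\|^2\big) \\
&\quad + \tfrac{\gamma_s}{2}\Big(\tfrac{\|\delta_s\|^2}{\beta_s - L\gamma_s} + 2\delta_s^\top (w^* - x_{s-1})\Big),
\end{aligned}
$$

where the last inequality is by the fact $\beta_s \ge L\gamma_s$ and thus

$$
(L\gamma_s - \beta_s)\|x_s - x_{s-1}\|^2 + 2\delta_s^\top (x_{s-1} - x_s)
= \frac{\|\delta_s\|^2}{\beta_s - L\gamma_s} - (\beta_s - L\gamma_s)\left\| x_s - x_{s-1} + \frac{\delta_s}{\beta_s - L\gamma_s} \right\|^2
\le \frac{\|\delta_s\|^2}{\beta_s - L\gamma_s} .
$$

Note that $\mathbb{E}[\delta_s^\top (w^* - x_{s-1})] = 0$. So with the condition $\mathbb{E}[\|\delta_s\|^2] \le \frac{L^2 D_t^2}{N_t (s+1)^2} \overset{\text{def}}{=} \sigma_s^2$ we arrive at

$$
\mathbb{E}[f(y_s) - f(w^*)] \le (1-\gamma_s)\,\mathbb{E}[f(y_{s-1}) - f(w^*)] + \gamma_s\left(\eta_{t,s} + \frac{\beta_s}{2}\big(\mathbb{E}[\|x_{s-1} - w^*\|^2] - \mathbb{E}[\|x_s - w^*\|^2]\big) + \frac{\sigma_s^2}{2(\beta_s - L\gamma_s)}\right).
$$

Now define $\Gamma_s = \Gamma_{s-1}(1-\gamma_s)$ when $s > 1$ and $\Gamma_1 = 1$. By induction, one can verify $\Gamma_s = \frac{2}{s(s+1)}$ and the following:

$$
\mathbb{E}[f(y_k) - f(w^*)] \le \Gamma_k \sum_{s=1}^{k} \frac{\gamma_s}{\Gamma_s}\left(\eta_{t,s} + \frac{\beta_s}{2}\big(\mathbb{E}[\|x_{s-1} - w^*\|^2] - \mathbb{E}[\|x_s - w^*\|^2]\big) + \frac{\sigma_s^2}{2(\beta_s - L\gamma_s)}\right),
$$

which is at most

$$
\Gamma_k \sum_{s=1}^{k} \frac{\gamma_s}{\Gamma_s}\left(\eta_{t,s} + \frac{\sigma_s^2}{2(\beta_s - L\gamma_s)}\right) + \frac{\Gamma_k}{2}\left(\frac{\gamma_1\beta_1}{\Gamma_1}\,\mathbb{E}[\|x_0 - w^*\|^2] + \sum_{s=2}^{k}\left(\frac{\gamma_s\beta_s}{\Gamma_s} - \frac{\gamma_{s-1}\beta_{s-1}}{\Gamma_{s-1}}\right)\mathbb{E}[\|x_{s-1} - w^*\|^2]\right).
$$

Finally plugging in the parameters $\gamma_s$, $\beta_s$, $\eta_{t,s}$, $\Gamma_s$ and the bound $\mathbb{E}[\|x_0 - w^*\|^2] \le D_t^2$ concludes the proof:

$$
\mathbb{E}[f(y_k) - f(w^*)] \le \frac{2}{k(k+1)} \sum_{s=1}^{k}\left(\frac{2LD_t^2}{N_t} + \frac{LD_t^2}{2N_t}\right) + \frac{3LD_t^2}{k(k+1)} \le \frac{8LD_t^2}{k(k+1)} .
$$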
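The closing steps of this proof rest on a few identities that are easy to check with exact rational arithmetic. The sketch below does so under assumed parameter choices $\gamma_s = 2/(s+1)$, $\beta_s = 3L/s$ and $\eta_{t,s} = 2LD_t^2/(N_t s)$. These values are not restated in this appendix and are assumptions here; $\gamma_s = 2/(s+1)$ is at least consistent with the stated $\Gamma_s = 2/(s(s+1))$. The function name `lemma3_rhs` is hypothetical.

```python
from fractions import Fraction

def lemma3_rhs(k, N_t, L=Fraction(1), D_t=Fraction(1)):
    """Return (Gamma_k, [gamma_s * beta_s / Gamma_s for s = 1..k], bound on E[f(y_k) - f(w*)])."""
    # Assumed parameter choices (not restated in the appendix).
    def gamma(s):
        return Fraction(2, s + 1)

    def beta(s):
        return 3 * L / s

    def eta(s):
        return 2 * L * D_t**2 / (N_t * s)

    def sigma_sq(s):
        return L**2 * D_t**2 / (N_t * (s + 1) ** 2)

    # Gamma_1 = 1 and Gamma_s = Gamma_{s-1} * (1 - gamma_s) for s > 1.
    Gamma = {1: Fraction(1)}
    for s in range(2, k + 1):
        Gamma[s] = Gamma[s - 1] * (1 - gamma(s))

    ratios = [gamma(s) * beta(s) / Gamma[s] for s in range(1, k + 1)]
    gap_and_noise = sum(
        gamma(s) / Gamma[s] * (eta(s) + sigma_sq(s) / (2 * (beta(s) - L * gamma(s))))
        for s in range(1, k + 1)
    )
    # With these parameters the ratios are constant, so the difference terms
    # vanish and only the s = 1 distance term (E||x_0 - w*||^2 <= D_t^2) remains.
    rhs = Gamma[k] * (gap_and_noise + ratios[0] * D_t**2 / 2)
    return Gamma[k], ratios, rhs

N_t = 20
checks = [lemma3_rhs(k, N_t) for k in range(1, N_t + 1)]
```

Under these assumptions $\gamma_s\beta_s/\Gamma_s = 3L$ for every $s$, which is why the sum over $s \ge 2$ of distance terms drops out, and the final inequality uses $k \le N_t$ to turn factors of $1/N_t$ into $1/k$.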


More information

Logistic Regression. Stochastic Gradient Descent

Logistic Regression. Stochastic Gradient Descent Tutorial 8 CPSC 340 Logistic Regression Stochastic Gradient Descent Logistic Regression Model A discriminative probabilistic model for classification e.g. spam filtering Let x R d be input and y { 1, 1}

More information

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee

Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Fast Asynchronous Parallel Stochastic Gradient Descent: A Lock-Free Approach with Convergence Guarantee Shen-Yi Zhao and Wu-Jun Li National Key Laboratory for Novel Software Technology Department of Computer

More information

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017

Support Vector Machines: Training with Stochastic Gradient Descent. Machine Learning Fall 2017 Support Vector Machines: Training with Stochastic Gradient Descent Machine Learning Fall 2017 1 Support vector machines Training by maximizing margin The SVM objective Solving the SVM optimization problem

More information

Second-Order Stochastic Optimization for Machine Learning in Linear Time

Second-Order Stochastic Optimization for Machine Learning in Linear Time Journal of Machine Learning Research 8 (207) -40 Submitted 9/6; Revised 8/7; Published /7 Second-Order Stochastic Optimization for Machine Learning in Linear Time Naman Agarwal Computer Science Department

More information

Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization

Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization JMLR: Workshop and Conference Proceedings vol (2010) 1 16 24th Annual Conference on Learning heory Beyond the regret minimization barrier: an optimal algorithm for stochastic strongly-convex optimization

More information

Faster Machine Learning via Low-Precision Communication & Computation. Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich)

Faster Machine Learning via Low-Precision Communication & Computation. Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich) Faster Machine Learning via Low-Precision Communication & Computation Dan Alistarh (IST Austria & ETH Zurich), Hantian Zhang (ETH Zurich) 2 How many bits do you need to represent a single number in machine

More information