Proximal Minimization by Incremental Surrogate Optimization (MISO)


1 Proximal Minimization by Incremental Surrogate Optimization (MISO) (and a few variants). Julien Mairal, Inria, Grenoble. ICCOPT, Tokyo, 2016.

2 Motivation: large-scale machine learning
Minimizing large finite sums of functions: given data points $x_i$, $i = 1, \ldots, n$, learn some model parameters $\theta$ in $\mathbb{R}^p$ by minimizing
$\min_{\theta \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n l(x_i, \theta) + \psi(\theta)$,
where $l$ measures the data fit and $\psi$ is a regularization function.
Minimizing expectations: if the amount of data is infinite, we may also need to minimize the expected cost
$\min_{\theta \in \mathbb{R}^p} \mathbb{E}_x[l(x, \theta)] + \psi(\theta)$,
leading to a stochastic optimization problem.

3–5 Motivation: large-scale machine learning
A few examples from the convex world:
Ridge regression: $\min_{\theta \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \frac{1}{2}(y_i - \langle \theta, x_i \rangle)^2 + \frac{\lambda}{2}\|\theta\|_2^2$.
Linear SVM: $\min_{\theta \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \max(0, 1 - y_i\langle \theta, x_i \rangle) + \frac{\lambda}{2}\|\theta\|_2^2$ (or its smooth variant with the squared hinge loss $\max(0, 1 - y_i\langle \theta, x_i \rangle)^2$).
Logistic regression: $\min_{\theta \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n \log\left(1 + e^{-y_i\langle \theta, x_i \rangle}\right) + \frac{\lambda}{2}\|\theta\|_2^2$.
In each case, the $\ell_2$ regularization $\frac{\lambda}{2}\|\theta\|_2^2$ may also be replaced by the $\ell_1$ penalty $\lambda\|\theta\|_1$.

6 Methodology
We will consider optimization methods that iteratively build a model of the objective before updating the variable:
$\theta_t \in \arg\min_{\theta \in \mathbb{R}^p} g_t(\theta)$,
where $g_t$ is easy to minimize and exploits the objective structure: large finite sum, expectation, (strong) convexity, composite?
There is a large body of related work: Kelley's and bundle methods; incremental and online EM algorithms; incremental and stochastic proximal gradient methods; variance-reduction techniques for minimizing finite sums. [Neal and Hinton, 1998, Duchi and Singer, 2009, Bertsekas, 2011, Schmidt et al., 2013, Defazio et al., 2014a, Shalev-Shwartz and Zhang, 2012, Lan, 2012, 2015]...

7 Outline of the talk
1) Stochastic majorization-minimization: $\min_{\theta \in \mathbb{R}^p} \mathbb{E}_x[l(x, \theta)] + \psi(\theta)$, where $l$ is not necessarily smooth or convex.
2) Incremental majorization-minimization: $\min_{\theta \in \mathbb{R}^p} \frac{1}{n}\sum_{i=1}^n l(x_i, \theta) + \psi(\theta)$. The MISO algorithm for non-convex functions.
3) Faster schemes for composite strongly-convex functions: another MISO algorithm for strongly-convex functions.
4) ??


9 Majorization-minimization principle
[Figure: the surrogate $g_t(\theta)$ majorizes the objective $f(\theta)$; the new iterate $\theta_t$ minimizes the surrogate built at $\theta_{t-1}$.]
Iteratively minimize locally tight upper bounds of the objective. The objective monotonically decreases. Under some assumptions, we get convergence rates similar to those of classical first-order approaches in the convex case.

10 Setting: first-order surrogate functions
[Figure: the approximation error $h_t(\theta)$ between the surrogate $g_t(\theta)$ and the objective $f(\theta)$, with iterates $\theta_{t-1}$ and $\theta_t$.]
$g_t(\theta_t) \geq f(\theta_t)$ for $\theta_t$ in $\arg\min_{\theta \in \Theta} g_t(\theta)$; the approximation error $h_t = g_t - f$ is differentiable, and $\nabla h_t$ is $L$-Lipschitz. Moreover, $h_t(\theta_{t-1}) = 0$ and $\nabla h_t(\theta_{t-1}) = 0$; we may also need $g_t$ to be strongly convex.

11 Examples of first-order surrogate functions
Lipschitz gradient surrogates: $f$ is $L$-smooth (differentiable + $L$-Lipschitz gradient).
$g: \theta \mapsto f(\kappa) + \nabla f(\kappa)^\top(\theta - \kappa) + \frac{L}{2}\|\theta - \kappa\|_2^2$.
Minimizing $g$ yields a gradient descent step $\theta \leftarrow \kappa - \frac{1}{L}\nabla f(\kappa)$.
Proximal gradient surrogates: $f = \tilde{f} + \psi$ with $\tilde{f}$ smooth.
$g: \theta \mapsto \tilde{f}(\kappa) + \nabla \tilde{f}(\kappa)^\top(\theta - \kappa) + \frac{L}{2}\|\theta - \kappa\|_2^2 + \psi(\theta)$.
Minimizing $g$ amounts to one step of the forward-backward, ISTA, or proximal gradient descent algorithm. [Nesterov, 2004, 2013, Beck and Teboulle, 2009, Wright et al., 2009]...
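To make the proximal gradient surrogate concrete, here is a minimal sketch (not from the slides) of the resulting update for $\ell_1$-regularized least squares: minimizing the surrogate built at the current point reduces to a gradient step on the smooth part followed by soft-thresholding. The data X, y and the constants lam, L are illustrative placeholders.

```python
import numpy as np

def soft_threshold(v, tau):
    """Proximal operator of tau * ||.||_1 (entrywise soft-thresholding)."""
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def prox_grad_step(theta, X, y, lam, L):
    """Minimize the proximal gradient surrogate of
    f(theta) = 0.5/n * ||y - X theta||_2^2 + lam * ||theta||_1 built at theta."""
    n = X.shape[0]
    grad = X.T @ (X @ theta - y) / n           # gradient of the smooth part at theta
    return soft_threshold(theta - grad / L, lam / L)

# toy usage on synthetic data (one ISTA step per call)
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = X @ (rng.standard_normal(20) * (rng.random(20) < 0.3)) + 0.01 * rng.standard_normal(100)
L = np.linalg.norm(X, 2) ** 2 / X.shape[0]     # Lipschitz constant of the smooth gradient
theta = np.zeros(20)
for _ in range(200):
    theta = prox_grad_step(theta, X, y, lam=0.1, L=L)
```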

12 Examples of first-order surrogate functions
Linearizing concave functions and DC-programming: $f = f_1 + f_2$ with $f_2$ smooth and concave.
$g: \theta \mapsto f_1(\theta) + f_2(\kappa) + \nabla f_2(\kappa)^\top(\theta - \kappa)$.
When $f_1$ is convex, the algorithm is called DC-programming.
Quadratic surrogates: $f$ is twice differentiable, and $H$ is a uniform upper bound of $\nabla^2 f$:
$g: \theta \mapsto f(\kappa) + \nabla f(\kappa)^\top(\theta - \kappa) + \frac{1}{2}(\theta - \kappa)^\top H (\theta - \kappa)$.
Upper bounds based on Jensen's inequality...
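As an illustration of the quadratic surrogate, below is a minimal sketch for logistic regression using the standard uniform Hessian bound $\nabla^2 f \preceq \frac{1}{4n} X^\top X$ (an assumption of this example, not something stated on the slide); minimizing the surrogate gives the fixed-metric Newton-like step implemented here.

```python
import numpy as np

def logistic_grad(theta, X, y):
    """Gradient of f(theta) = (1/n) * sum_i log(1 + exp(-y_i <x_i, theta>)), y_i in {-1, +1}."""
    n = X.shape[0]
    z = np.clip(y * (X @ theta), -50, 50)       # clip only for numerical safety
    s = 1.0 / (1.0 + np.exp(z))                 # = sigma(-y_i <x_i, theta>)
    return -(X.T @ (y * s)) / n

def mm_quadratic_step(theta, X, y, H):
    """Minimize g(t) = f(theta) + <grad f(theta), t - theta> + 0.5 (t - theta)^T H (t - theta)."""
    return theta - np.linalg.solve(H, logistic_grad(theta, X, y))

# toy usage: H is a uniform upper bound of the Hessian of the logistic loss
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 10))
y = np.sign(X @ rng.standard_normal(10))
n, p = X.shape
H = X.T @ X / (4 * n) + 1e-8 * np.eye(p)
theta = np.zeros(p)
for _ in range(50):
    theta = mm_quadratic_step(theta, X, y, H)
```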

13 Theoretical guarantees of the basic MM algorithm
When using first-order surrogates:
for convex problems, $f(\theta_t) - f^* = O(L/t)$;
for $\mu$-strongly convex ones, $O((1 - \mu/L)^t)$;
for non-convex problems, $f(\theta_t)$ monotonically decreases and
$\liminf_{t \to +\infty}\; \inf_{\theta \in \Theta} \frac{\nabla f(\theta_t, \theta - \theta_t)}{\|\theta - \theta_t\|_2} \geq 0$,  (1)
which we call the asymptotic stationary point condition.
Directional derivative: $\nabla f(\theta, \kappa) = \lim_{\varepsilon \to 0^+} \frac{f(\theta + \varepsilon\kappa) - f(\theta)}{\varepsilon}$.
When $\Theta = \mathbb{R}^p$ and $f$ is smooth, (1) is equivalent to $\nabla f(\theta_t) \to 0$.

14 Stochastic majorization-minimization [Mairal, 2013]
Assume that $f$ is an expectation: $f(\theta) = \mathbb{E}_x[l(\theta, x)]$.
Recipe: draw a single function $f_t: \theta \mapsto l(\theta, x_t)$ at iteration $t$; choose a first-order surrogate function $g_t$ for $f_t$ at $\theta_{t-1}$; update the model $\bar{g}_t = (1 - w_t)\,\bar{g}_{t-1} + w_t\, g_t$ with appropriate weights $w_t$; update $\theta_t$ by minimizing $\bar{g}_t$.
Related work: online-EM; online matrix factorization. [Neal and Hinton, 1998, Mairal et al., 2010, Razaviyayn et al., 2013]...

15 Stochastic majorization-minimization [Mairal, 2013]
Theoretical guarantees, non-convex problems: under a set of reasonable assumptions, $f(\theta_t)$ almost surely converges; the function $\bar{g}_t$ asymptotically behaves as a first-order surrogate; asymptotic stationary point conditions hold almost surely.
Theoretical guarantees, convex problems: under a few assumptions, for proximal gradient surrogates, we obtain expected rates similar to SGD with averaging: $O(1/t)$ for strongly convex problems, $O(\log(t)/\sqrt{t})$ for convex ones.
The most interesting feature of this principle is probably the ability to deal with some non-smooth non-convex problems.

16 Stochastic majorization-minimization [Mairal, 2013]
Update rule for the proximal gradient surrogate:
$\theta_t \in \arg\min_{\theta \in \Theta} \sum_{i=1}^t w_i^t \left[ \langle \nabla f_i(\theta_{i-1}), \theta \rangle + \frac{L}{2}\|\theta - \theta_{i-1}\|_2^2 + \psi(\theta) \right]$.  (SMM)
Other schemes in the literature [Duchi and Singer, 2009]:
$\theta_t \in \arg\min_{\theta \in \Theta} \langle \nabla f_t(\theta_{t-1}), \theta \rangle + \frac{1}{2\eta_t}\|\theta - \theta_{t-1}\|_2^2 + \psi(\theta)$,  (FOBOS)
or regularized dual averaging (RDA) of Xiao [2010]:
$\theta_t \in \arg\min_{\theta \in \Theta} \frac{1}{t}\sum_{i=1}^t \langle \nabla f_i(\theta_{i-1}), \theta \rangle + \frac{1}{2\eta_t}\|\theta\|_2^2 + \psi(\theta)$,  (RDA)
or others...
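Since the (SMM) objective is a single quadratic in $\theta$ plus $\psi$, its minimizer only depends on weighted averages of the past gradients and iterates. The sketch below implements this for $\psi = \lambda\|\cdot\|_1$ with the simple weights $w_t = 1/t$, which are an illustrative choice rather than the specific sequence analyzed in [Mairal, 2013].

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def smm_l1(grad_oracle, p, L, lam, n_iters, rng=None):
    """(SMM) with proximal gradient surrogates and psi = lam * ||.||_1.

    grad_oracle(theta, rng) returns the gradient at theta of a freshly drawn
    function f_t.  The aggregated surrogate is a quadratic plus psi, so its
    minimizer is a proximal step applied to running weighted averages."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.zeros(p)
    z_bar = np.zeros(p)           # weighted average of past gradients
    theta_bar = np.zeros(p)       # weighted average of past iterates
    for t in range(1, n_iters + 1):
        w = 1.0 / t               # illustrative weight sequence
        g = grad_oracle(theta, rng)
        z_bar = (1 - w) * z_bar + w * g
        theta_bar = (1 - w) * theta_bar + w * theta
        theta = soft_threshold(theta_bar - z_bar / L, lam / L)
    return theta

# toy usage: streaming square losses l(x, theta) = 0.5 * (y - <theta, x>)^2
true_theta = np.array([1.0, -2.0, 0.0, 0.0, 3.0])
def grad_oracle(theta, rng):
    x = rng.standard_normal(5)
    y = x @ true_theta + 0.1 * rng.standard_normal()
    return (x @ theta - y) * x

theta_hat = smm_l1(grad_oracle, p=5, L=6.0, lam=0.05, n_iters=5000)
```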

17 Outline of the talk (recap of the outline from slide 7; next: part 2, incremental majorization-minimization).

18 MISO (MM) for non-convex optimization [Mairal, 2015]
Assume that $f$ splits into many components: $f(\theta) = \frac{1}{n}\sum_{i=1}^n f_i(\theta)$.
Recipe: draw at random a single index $i_t$ at iteration $t$; compute a first-order surrogate $g_t^{i_t}$ of $f_{i_t}$ at $\theta_{t-1}$; incrementally update the approximate surrogate
$\bar{g}_t = \frac{1}{n}\sum_{i=1}^n g_t^i = \bar{g}_{t-1} + \frac{1}{n}\left(g_t^{i_t} - g_{t-1}^{i_t}\right)$;
update $\theta_t$ by minimizing $\bar{g}_t$.
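With Lipschitz-gradient surrogates for $L$-smooth components, each $g_t^i$ is a quadratic of curvature $L$ centered at an anchor $z_i = \kappa_i - \nabla f_i(\kappa_i)/L$ (where $\kappa_i$ is the point at which the surrogate of $f_i$ was last refreshed), so minimizing $\bar{g}_t$ amounts to averaging the anchors. The sketch below, written for a smooth unregularized sum, is an illustrative implementation of this incremental scheme, not the exact code behind the slides.

```python
import numpy as np

def miso_smooth(grad_i, n, p, L, n_epochs, rng=None):
    """Incremental MM for f = (1/n) * sum_i f_i with Lipschitz-gradient surrogates.

    grad_i(i, theta) returns the gradient of f_i at theta.  Each surrogate is
    (L/2) * ||theta - z_i||^2 + const, so the minimizer of the averaged
    surrogate is simply the mean of the anchor points z_i."""
    rng = np.random.default_rng() if rng is None else rng
    theta = np.zeros(p)
    z = np.stack([theta - grad_i(i, theta) / L for i in range(n)])  # initialize all surrogates at theta_0
    z_bar = z.mean(axis=0)
    theta = z_bar.copy()
    for _ in range(n_epochs * n):
        i = rng.integers(n)
        z_new = theta - grad_i(i, theta) / L       # refresh the surrogate of f_i at the current iterate
        z_bar += (z_new - z[i]) / n                # incremental update of the averaged surrogate
        z[i] = z_new
        theta = z_bar.copy()                       # minimizer of the averaged quadratic
    return theta

# toy usage: f_i(theta) = 0.5 * (y_i - <x_i, theta>)^2
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 10))
y = X @ rng.standard_normal(10)
L = np.max(np.sum(X ** 2, axis=1))                 # bound on the individual Lipschitz constants
theta = miso_smooth(lambda i, th: (X[i] @ th - y[i]) * X[i], n=500, p=10, L=L, n_epochs=20)
```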

19 MISO (MM) for non-convex optimization [Mairal, 2015]
Theoretical guarantees, non-convex problems: same as the basic MM algorithm, with probability one.
Theoretical guarantees, convex problems: when using proximal gradient surrogates, for convex problems, $f(\hat{\theta}_t) - f^* = O(nL/t)$; for $\mu$-strongly convex problems, $f(\theta_t) - f^* = O((1 - \mu/(nL))^t)$. The computational complexity is the same as that of ISTA.
Related work for non-convex problems: incremental EM; more specific incremental MM algorithms. [Neal and Hinton, 1998, Ahn et al., 2006].

20 Outline of the talk (recap of the outline from slide 7; next: part 3, faster schemes for composite strongly-convex functions).

21 MISO for µ-strongly convex smooth functions
Strong convexity provides simple quadratic surrogate functions:
$g_t^i: \theta \mapsto f_i(\theta_{t-1}) + \nabla f_i(\theta_{t-1})^\top(\theta - \theta_{t-1}) + \frac{\mu}{2}\|\theta - \theta_{t-1}\|_2^2$.  (2)
This time, the model of the objective is a lower bound.
Proposition: MISO with lower bounds [Mairal, 2015]. When the functions $f_i$ are $\mu$-strongly convex, $L$-smooth, and non-negative, MISO with the surrogates (2) guarantees that
$\mathbb{E}[f(\theta_t) - f^*] \leq \left(1 - \frac{1}{3n}\right)^t n f^*$,
under the big data condition $n \geq 2L/\mu$.
Remark: when $n < 2L/\mu$, the algorithm may diverge.

22 MISO for µ-strongly convex composite functions [Lin, Mairal, and Harchaoui, 2015]
First goal: allow a composite term $\psi$, i.e. $f(\theta) = \frac{1}{n}\sum_{i=1}^n f_i(\theta) + \psi(\theta)$, by simply using the composite lower bounds
$g_t^i: \theta \mapsto f_i(\theta_{t-1}) + \nabla f_i(\theta_{t-1})^\top(\theta - \theta_{t-1}) + \frac{\mu}{2}\|\theta - \theta_{t-1}\|_2^2 + \psi(\theta)$.  (*)
Second goal: remove the condition $n \geq 2L/\mu$ by using
$g_t^i: \theta \mapsto (1 - \delta)\, g_{t-1}^i(\theta) + \delta\, (*)$, with $\delta = \min\left(1, \frac{\mu n}{2(L - \mu)}\right)$,  (3)
instead of $\delta = 1$ previously.
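Because every surrogate in (3) is a quadratic of curvature $\mu$ plus $\psi$, and the $\delta$-averaging mixes quadratics of the same curvature, the aggregated model stays of the form $\frac{\mu}{2}\|\theta - \bar{z}\|_2^2 + \psi(\theta) + \text{const}$, whose minimizer is a proximal step at the averaged anchor. The sketch below implements this reading of MISO-Prox for $\psi = \lambda\|\cdot\|_1$; the derivation and the toy data are mine, so treat it as an illustration rather than the authors' reference code.

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def miso_prox(grad_i, n, p, mu, L, lam, n_epochs, rng=None):
    """MISO-Prox sketch: delta-averaged strongly convex lower bounds plus psi = lam * ||.||_1.

    Each surrogate of f_i is (mu/2) * ||theta - z_i||^2 + const + psi(theta), so the
    minimizer of the averaged model is prox_{psi/mu} evaluated at the mean of the z_i."""
    rng = np.random.default_rng() if rng is None else rng
    delta = min(1.0, mu * n / (2.0 * (L - mu)))
    theta = np.zeros(p)
    z = np.zeros((n, p))                           # anchors of the surrogates (initialization simplified for the sketch)
    z_bar = z.mean(axis=0)
    for _ in range(n_epochs * n):
        i = rng.integers(n)
        z_new = (1 - delta) * z[i] + delta * (theta - grad_i(i, theta) / mu)
        z_bar += (z_new - z[i]) / n
        z[i] = z_new
        theta = soft_threshold(z_bar, lam / mu)    # prox of psi/mu at the averaged anchor
    return theta

# toy usage: f_i(theta) = 0.5 * (y_i - <x_i, theta>)^2 + (mu/2) * ||theta||^2  (mu-strongly convex)
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 20))
y = X @ (rng.standard_normal(20) * (rng.random(20) < 0.3))
mu = 0.1
L = np.max(np.sum(X ** 2, axis=1)) + mu
theta = miso_prox(lambda i, th: (X[i] @ th - y[i]) * X[i] + mu * th,
                  n=300, p=20, mu=mu, L=L, lam=0.05, n_epochs=30)
```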

23 MISO for µ-strongly convex composite functions [Lin, Mairal, and Harchaoui, 2015]
Convergence of MISO-Prox: when the functions $f_i$ are $\mu$-strongly convex and $L$-smooth, MISO-Prox with the surrogates (3) guarantees that
$\mathbb{E}[f(\theta_t)] - f^* \leq \frac{1}{\tau}(1 - \tau)^{t+1}\left(f(\theta_0) - \bar{g}_0(\theta_0)\right)$ with $\tau \geq \min\left\{\frac{\mu}{4L}, \frac{1}{2n}\right\}$.
Furthermore, we also have fast convergence of the certificate:
$\mathbb{E}[f(\theta_t) - \bar{g}_t(\theta_t)] \leq \frac{1}{\tau}(1 - \tau)^t\left(f^* - \bar{g}_0(\theta_0)\right)$.

24 MISO for µ-strongly convex composite functions [Lin, Mairal, and Harchaoui, 2015]
Relation with SDCA [Shalev-Shwartz and Zhang, 2012]: variant 5 of SDCA is identical to MISO-Prox with $\delta = \frac{\mu n}{L + \mu n}$.
The construction is primal: the proof of convergence and the algorithm do not use duality, whereas SDCA is a dual ascent technique.
$\bar{g}_t(\theta_t)$ is a lower bound of $f^*$; it plays the same role as the dual lower bound in SDCA, but is easier to evaluate.
Another viewpoint about SDCA without duality: [Shalev-Shwartz, 2015].

25 MISO for µ-strongly convex composite functions
We may now compare the expected complexity, using the fact that incremental algorithms require computing a single $\nabla f_i$ per iteration. For $\mu > 0$:
grad. desc., ISTA, MISO-MM: $O\left(n \frac{L}{\mu} \log\left(\frac{1}{\varepsilon}\right)\right)$;
FISTA, acc. grad. desc.: $O\left(n \sqrt{\frac{L}{\mu}} \log\left(\frac{1}{\varepsilon}\right)\right)$;
SVRG, SAG, SAGA, SDCA, MISOµ, Finito: $O\left(\max\left(n, \frac{L}{\mu}\right) \log\left(\frac{1}{\varepsilon}\right)\right)$.
SVRG, SAG, SAGA, SDCA, MISO, Finito improve upon FISTA when
$\max\left(n, \frac{L}{\mu}\right) \leq n\sqrt{\frac{L}{\mu}} \iff \sqrt{\frac{L}{\mu}} \leq n$.
[Schmidt et al., 2013, Xiao and Zhang, 2014, Defazio et al., 2014a,b, Shalev-Shwartz and Zhang, 2012, Zhang and Xiao, 2015]
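A quick numerical reading of these bounds, with arbitrary illustrative values of n, L/µ, and ε (chosen only to show when the incremental methods win over FISTA):

```python
import numpy as np

n, cond, eps = 1_000_000, 10_000, 1e-6     # illustrative: n samples, condition number L/mu, accuracy
log_term = np.log(1.0 / eps)

print("grad. desc. / ISTA / MISO-MM   :", n * cond * log_term)           # O(n L/mu log(1/eps))
print("FISTA / acc. grad. desc.       :", n * np.sqrt(cond) * log_term)  # O(n sqrt(L/mu) log(1/eps))
print("SVRG/SAG/SAGA/SDCA/MISO/Finito :", max(n, cond) * log_term)       # O(max(n, L/mu) log(1/eps))
# here sqrt(L/mu) = 100 <= n = 10^6, so the incremental methods improve upon FISTA
```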

26 Outline of the talk (recap of the outline from slide 7; next: the unannounced part 4, "??").

30 Can we do better? [Lin, Mairal, and Harchaoui, 2015]
SVRG, SAG, SAGA, SDCA, MISO, Finito improve upon FISTA, but they are not accelerated in the sense of Nesterov.
How to improve the previous complexities?
1. Read classical papers about accelerated gradient methods;
2. stay in the room and listen to G. Lan's talk;
3. listen to Hongzhou Lin's talk tomorrow (Tuesday, 2:45pm, room 5A).
[Beck and Teboulle, 2009, Nesterov, 2013, Shalev-Shwartz and Zhang, 2014, Lan, 2015, Agarwal and Bottou, 2015, Allen-Zhu, 2016]

31 Conclusion
A large class of majorization-minimization algorithms for non-convex, possibly non-smooth, optimization; fast algorithms for minimizing large sums of convex functions (using lower bounds); see Hongzhou Lin's talk on acceleration tomorrow.
Related publications:
J. Mairal. Optimization with First-Order Surrogate Functions. ICML, 2013;
J. Mairal. Stochastic Majorization-Minimization Algorithms for Large-Scale Optimization. NIPS, 2013;
J. Mairal. Incremental Majorization-Minimization Optimization with Application to Large-Scale Machine Learning. SIAM Journal on Optimization, 2015;
H. Lin, J. Mairal, and Z. Harchaoui. A Universal Catalyst for First-Order Optimization. NIPS, 2015.

32 Online Sparse Matrix Factorization
Consider some signals $x$ in $\mathbb{R}^m$. We want to find a dictionary $D$ in $\mathbb{R}^{m \times p}$. The quality of $D$ is measured through the loss
$l(x, D) = \min_{\alpha \in \mathbb{R}^p} \frac{1}{2}\|x - D\alpha\|_2^2 + \lambda_1\|\alpha\|_1 + \frac{\lambda_2}{2}\|\alpha\|_2^2$.
Then, learning the dictionary amounts to solving $\min_{D \in \mathcal{D}} \mathbb{E}_x[l(x, D)] + \varphi(D)$.
Why is it a matrix factorization problem?
$\min_{D \in \mathcal{D},\, A \in \mathbb{R}^{p \times n}} \frac{1}{n}\left[\frac{1}{2}\|X - DA\|_F^2 + \sum_{i=1}^n \lambda_1\|\alpha_i\|_1 + \frac{\lambda_2}{2}\|\alpha_i\|_2^2\right] + \varphi(D)$.
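The online algorithm behind these slides, in the spirit of the online matrix factorization of [Mairal et al., 2010], alternates between sparse-coding the incoming signal and updating D on the aggregated quadratic surrogate via its sufficient statistics. The sketch below is a simplified, assumption-laden rendition (ISTA for the elastic-net coding step, one block coordinate pass over the columns of D per signal), not the authors' reference implementation.

```python
import numpy as np

def sparse_code(x, D, lam1, lam2, n_iters=100):
    """ISTA for min_a 0.5 * ||x - D a||_2^2 + lam1 * ||a||_1 + 0.5 * lam2 * ||a||_2^2."""
    a = np.zeros(D.shape[1])
    step = 1.0 / (np.linalg.norm(D, 2) ** 2 + lam2)
    for _ in range(n_iters):
        v = a - step * (D.T @ (D @ a - x) + lam2 * a)
        a = np.sign(v) * np.maximum(np.abs(v) - step * lam1, 0.0)
    return a

def online_dictionary_learning(signals, p, lam1=0.1, lam2=0.0):
    """One pass of online dictionary learning: accumulate the sufficient statistics
    (A, B) of the quadratic surrogate and update the columns of D by block
    coordinate descent, projecting each column onto the unit ball."""
    m = signals.shape[1]
    rng = np.random.default_rng(0)
    D = rng.standard_normal((m, p))
    D /= np.linalg.norm(D, axis=0)
    A = np.zeros((p, p))                   # running sum of alpha alpha^T
    B = np.zeros((m, p))                   # running sum of x alpha^T
    for x in signals:
        alpha = sparse_code(x, D, lam1, lam2)
        A += np.outer(alpha, alpha)
        B += np.outer(x, alpha)
        for j in range(p):                 # block coordinate descent on the surrogate
            if A[j, j] > 1e-10:
                D[:, j] += (B[:, j] - D @ A[:, j]) / A[j, j]
                D[:, j] /= max(1.0, np.linalg.norm(D[:, j]))
    return D

# toy usage: random vectors standing in for whitened image patches
patches = np.random.default_rng(1).standard_normal((200, 64))
D = online_dictionary_learning(patches, p=50)
```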

33 Online Sparse Matrix Factorization
When $\mathcal{D} = \{D \in \mathbb{R}^{m \times p} \text{ s.t. } \|d_j\|_2 \leq 1\}$ and $\varphi = 0$, the problem is called sparse coding or dictionary learning [Olshausen and Field, 1996, Elad and Aharon, 2006, Mairal et al., 2010].
Non-negativity constraints can easily be added; this yields an online non-negative matrix factorization algorithm.
$\varphi$ can be a function encouraging a particular structure in $D$ [Jenatton et al., 2011].

34–39 Online Sparse Matrix Factorization: dictionary learning on natural image patches
Consider $n = \ldots$ whitened natural image patches of size $m = \ldots$. We learn a dictionary with $K = 256$ elements. The slides show the learned dictionary on an old laptop (1.2 GHz dual-core CPU) after 0 s (initialization), 1.15 s (0.1 pass), 5.97 s (0.5 pass), … s (1 pass), … s (2 passes), and … s (5 passes).

40 References I
A. Agarwal and L. Bottou. A lower bound for the optimization of finite sums. In Proceedings of the International Conference on Machine Learning (ICML), 2015.
S. Ahn, J. A. Fessler, D. Blatt, and A. O. Hero. Convergent incremental optimization transfer algorithms: Application to tomography. IEEE Transactions on Medical Imaging, 25(3), 2006.
Z. Allen-Zhu. Katyusha: The first direct acceleration of stochastic gradient methods. arXiv preprint, 2016.
A. Beck and M. Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1), 2009.
D. P. Bertsekas. Incremental gradient, subgradient, and proximal methods for convex optimization: A survey. Optimization for Machine Learning, 2010:1-38, 2011.

41 References II
A. Defazio, F. Bach, and S. Lacoste-Julien. SAGA: A fast incremental gradient method with support for non-strongly convex composite objectives. In Advances in Neural Information Processing Systems (NIPS), 2014a.
A. J. Defazio, T. S. Caetano, and J. Domke. Finito: A faster, permutable incremental gradient method for big data problems. In Proceedings of the International Conference on Machine Learning (ICML), 2014b.
J. Duchi and Y. Singer. Efficient online and batch learning using forward backward splitting. Journal of Machine Learning Research, 10, 2009.
M. Elad and M. Aharon. Image denoising via sparse and redundant representations over learned dictionaries. IEEE Transactions on Image Processing, 15(12), 2006.
R. Jenatton, J.-Y. Audibert, and F. Bach. Structured variable selection with sparsity-inducing norms. Journal of Machine Learning Research, 12, 2011.
G. Lan. An optimal method for stochastic composite optimization. Mathematical Programming, 133(1-2), 2012.

42 References III
G. Lan. An optimal randomized incremental gradient method. arXiv preprint, 2015.
H. Lin, J. Mairal, and Z. Harchaoui. A universal catalyst for first-order optimization. In Advances in Neural Information Processing Systems (NIPS), 2015.
J. Mairal. Stochastic majorization-minimization algorithms for large-scale optimization. In Advances in Neural Information Processing Systems (NIPS), 2013.
J. Mairal. Incremental majorization-minimization optimization with application to large-scale machine learning. SIAM Journal on Optimization, 25(2), 2015.
J. Mairal, F. Bach, J. Ponce, and G. Sapiro. Online learning for matrix factorization and sparse coding. Journal of Machine Learning Research, 11:19-60, 2010.
R. M. Neal and G. E. Hinton. A view of the EM algorithm that justifies incremental, sparse, and other variants. Learning in Graphical Models, 89, 1998.

43 References IV
Y. Nesterov. Introductory Lectures on Convex Optimization: A Basic Course. Kluwer Academic Publishers, 2004.
Y. Nesterov. Gradient methods for minimizing composite objective function. Mathematical Programming, 140(1), 2013.
B. A. Olshausen and D. J. Field. Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381, 1996.
M. Razaviyayn, M. Sanjabi, and Z.-Q. Luo. A stochastic successive minimization method for nonsmooth nonconvex optimization with applications to transceiver design in wireless communication networks. arXiv preprint, 2013.
M. Schmidt, N. Le Roux, and F. Bach. Minimizing finite sums with the stochastic average gradient. arXiv preprint, 2013.
S. Shalev-Shwartz. SDCA without duality. arXiv:1502.06177, 2015.
S. Shalev-Shwartz and T. Zhang. Proximal stochastic dual coordinate ascent. arXiv preprint, 2012.

44 References V
S. Shalev-Shwartz and T. Zhang. Accelerated proximal stochastic dual coordinate ascent for regularized loss minimization. Mathematical Programming, pages 1-41, 2014.
S. J. Wright, R. D. Nowak, and M. A. T. Figueiredo. Sparse reconstruction by separable approximation. IEEE Transactions on Signal Processing, 57(7), 2009.
L. Xiao. Dual averaging methods for regularized stochastic learning and online optimization. Journal of Machine Learning Research, 11, 2010.
L. Xiao and T. Zhang. A proximal stochastic gradient method with progressive variance reduction. SIAM Journal on Optimization, 24(4), 2014.
Y. Zhang and L. Xiao. Stochastic primal-dual coordinate method for regularized empirical risk minimization. In Proceedings of the International Conference on Machine Learning (ICML), 2015.
